php
VOLUME III - ISSUE 3
MARCH 2004
The Magazine For PHP Professionals
Matchmaker
Make Me a Match
Using the Amazon.com API
through PHP and XML-RPC
Explore your HTML code with Tidy
www.phparch.com
PHP And WAP: Past, Present & Future
Testing Automation With PHP
PHP Ahoy!
A Look at: php | Cruise
Bahamas 2004
Plus: Tips & Tricks, Security Corner, Product Reviews and much more...
Licensed to 63883 - Joseph Crawford ([email protected])
TABLE OF CONTENTS
php|architect
Departments
5 Editorial
6 What’s New!
34 Book Review: Flash MX 2004 for Rich Internet Applications
42 Product Review: Mambo Open Source: Content Management System
59 Security Corner: Shared Hosting, by Chris Shiflett
63 Tips & Tricks, by John W. Holmes
66 exit(0); I Am Jack's Total Lack of Linux Support, by Marco Tabini
Features
9 Connecting to Amazon.com Web Services with NuSOAP, by Alessandro Sfondrini
16 Matchmaker, Matchmaker Make Me A Match: An Introduction to Regular Expressions, by George Schlossnagle
28 Automated Testing For PHP Applications, by Dr. James McCaffrey
35 PHP Ahoy! A look at php|cruise, by Marco Tabini
47 WAP: Past, Present and Future, by Andrea Trasatti
53 Tidying up your HTML in PHP5, by John Coggeshall
March 2004
●
PHP Architect
●
www.phparch.com
3
NEW!
You’ll never know what we’ll come up with next
Existing subscribers can upgrade to the Print edition and save!
php|architect
Visit: http://www.phparch.com/print for
more information or to subscribe online.
The Magazine For PHP Professionals
php|architect Subscription Dept.
P.O. Box 54526
1771 Avenue Road
Toronto, ON M5M 4N5
Canada
Name: ____________________________________________
Address: _________________________________________
City: _____________________________________________
State/Province: ____________________________________
ZIP/Postal Code: ___________________________________
Your charge will appear under the name "Marco Tabini & Associates, Inc." Please
allow up to 4 to 6 weeks for your subscription to be established and your first issue
to be mailed to you.
*US Pricing is approximate and for illustration purposes only.
Choose a Subscription type:
Canada/USA                                  $ 83.99 CAD  ($59.99 US*)
International Surface                       $111.99 CAD  ($79.99 US*)
International Air                           $125.99 CAD  ($89.99 US*)
Combo edition add-on (print + PDF edition)  $ 14.00 CAD  ($10.00 US)
Country: ___________________________________________
Payment type:
VISA
Mastercard
American Express
Credit Card Number:________________________________
Expiration Date: _____________________________________
E-mail address: ______________________________________
Phone Number: ____________________________________
Signature:
Date:
*By signing this order form, you agree that we will charge your account in Canadian
dollars for the “CAD” amounts indicated above. Because of fluctuations in the
exchange rates, the actual amount charged in your currency on your credit card
statement may vary slightly.
**Offer available only in conjunction with the purchase of a print subscription.
To subscribe via snail mail - please detach/copy this form, fill it
out and mail to the address above or fax to +1-416-630-5057
Login to your account
for more details.
EDITORIAL
php|architect
Volume III - Issue 3
March, 2004
Publisher
Marco Tabini
Editorial Team
Arbi Arzoumani
Peter MacIntyre
Eddie Peloke
Graphics & Layout
Arbi Arzoumani
Managing Editor
Emanuela Corso
Director of Marketing
J. Scott Johnson
[email protected]
Account Executive
Shelley Johnston
[email protected]
Authors
John Coggeshall, John Holmes,
Dr. James McCaffrey, George Schlossnagle, Alessandro
Sfondrini, Chris Shiflett, Andrea Trasatti
php|architect (ISSN 1709-7169) is published twelve times a year by Marco Tabini &
Associates, Inc., P.O. Box 54526, 1771 Avenue Road, Toronto, ON M5M 4N5, Canada.
Although all possible care has been placed in assuring the accuracy of the contents of this
magazine, including all associated source code, listings and figures, the publisher assumes
no responsibilities with regards of use of the information contained herein or in all associated material.
Contact Information:
General mailbox:
[email protected]
Editorial:
[email protected]
Subscriptions:
[email protected]
Sales & advertising:
[email protected]
Technical support:
[email protected]
Copyright © 2003-2004 Marco Tabini & Associates, Inc.
— All Rights Reserved
Continued on page 8...
E D I T O R I A L
R A N T S
I'm sure you're familiar with the Chinese proverb "may
you live in interesting times." Even though I rarely
think of my professional life as dull and boring, the
last month has been particularly exciting. As promised
in my exit(0) column from last month's issue, if you
look through the middle of the magazine you'll find a
full report (in colour!) on the best conference I have
ever attended—our very own php|cruise (forgive me
for a bit of professional pride; eight months of prep
work will do that to you). Things went so well that
we're working on another cruise—this time going to
Alaska in the fall—and plan on making php|c an annual event for many years to come.
All good things come to an end, of course, and, once
back from the cruise, it's back to work. Luckily for us,
work means bringing you yet another great issue of
php|architect—and I personally consider that another
good thing. Like every month, we've got some great
content waiting for you in the following pages.
The one I'm most proud of is George Schlossnagle's
regular expressions article. Regexes are something that
pretty much every programmer has to deal with, but
that very few among us really know how to use. In fact,
I've seen developers write extremely complicated code
with the explicit purpose of getting around having to
use a regular expression—and that is just plain wrong.
After all, using the best solution for each problem is
what being a programmer is all about.
Thus, I approached George about writing an article
on regular expressions—and it became quickly evident
that one article would not even come close to covering
the complexity of regex. Now, everyone knows that I
always try my best to stay away from multi-part articles
for a multitude of reasons, but in this case I felt that the
topic more than deserved our attention over multiple
issues and, therefore, George's article is the first in a
series of three. Over the next three months, he will take
you for a ride from the basics (which are covered in this
issue) to the more complex and exotic aspects of regular expressions, thus hopefully providing the PHP world
with a definitive guide to this topic.
If regular expressions are not your bag, one of the
other topics covered in this month's issue is certain to
tickle your fancy. For example, you may want to read
Alessandro Sfondrini's excellent article on using the
Amazon.com API directly from your PHP website, or
Andrea Trasatti's look at the world of WAP. As you can
probably imagine, both Andrea and Alessandro hail
from my native Italy—and that alone makes their articles more than worth reading. There, my monthly heritage tax is now paid up!
As I'm sure you've noticed, in the past few months
we've been publishing material about testing practices
quite frequently. As larger and larger projects are devel-
NEW STUFF
What’s New!
PHP 5.0 Beta 4
PHP.net has announced the release of PHP 5.0 Beta 4.
This fourth beta of PHP 5 is also scheduled to be the
last one (barring unexpected surprises, which did occur
with Beta 3). This beta incorporates dozens of bug fixes
since Beta 3, rewritten exceptions support, improved
interfaces support, new experimental SOAP support, as
well as lots of other improvements, some of which are
documented in the ChangeLog. Some of the key features of PHP 5 include:
• PHP 5 features the Zend Engine 2.
• XML support has been completely redone in
PHP 5; all extensions are now focused around
the excellent libxml2 library
(http://www.xmlsoft.org/).
• SQLite has been bundled with PHP. For more
information on SQLite, please visit their website.
• A new SimpleXML extension for easily accessing and manipulating XML as PHP objects. It
can also interface with the DOM extension
and vice-versa.
• Streams have been greatly improved, including the ability to access low-level socket operations on streams.
PHP.net also announced the release of PHP 4.3.5 RC3.
This will be the last release candidate prior to the
final release, so please test it as much as possible.
For more information visit http://www.php.net/.
ZEND Optimizer 2.5.1
Zend has announced the release of Zend Optimizer
2.5.1.
Zend.com describes the Optimizer as: "a free application that runs the files encoded by the Zend Encoder
and Zend SafeGuard Suite, while enhancing the running speed of PHP applications.
Benefits:
• Enables users to run files encoded by the Zend
Encoder
• Increases runtime performance up to 40%."
Get more information from Zend.com.
DEV Web Management System
Dev is a small but powerful and very flexible content
management system for web portals. The system is licensed
as freeware under the terms of the GNU GPL. It is
absolutely free for non-commercial and commercial
use. It is based on PHP 4 and MySQL.
This project allows the user to publish articles, evaluate articles by taking a poll, publish short news items,
create back-ends in XML format, manage download
lists, manage advertisements on the site, be informed
about events on the site, create system reports and
export them into MS Excel or XML format, and much
more.
For more information visit: http://dev-wms.sourceforge.net/.
phpMyAdmin 2.5.6
phpmyadmin.net has released the latest version of
phpMyAdmin. phpMyAdmin is a tool written in PHP
intended to handle the administration of MySQL over
the Web.
"Welcome to this new version, aimed at stabilization of
the 2.5 branch. Meanwhile, work is continuing on the new
2.6 branch. PhpMyAdmin is a tool written in PHP intended to handle the administration of MySQL over the Web.
Currently it can create and drop databases,
create/drop/alter tables, delete/edit/add fields, execute
any SQL statement, manage keys on fields."
For more information visit: www.phpmyadmin.net.
PhpSQLiteAdmin 0.2
PhpSQLiteAdmin is a Web interface for the administration of SQLite databases.
Version 0.2 comes with some new features and a lot
of internal cleanups and refactoring. PhpSQLiteAdmin
is still in an early stage of development. It comes free of
charge and without warranty.
For more information visit: www.phpsqliteadmin.net.
Zend Launches New PHP5 In-Depth
Articles Section
Zend Technologies has launched a new version of
its Developer's Corner on the zend.com website. PHP5 In-depth
showcases articles from many well-known PHP authors
on the new features of PHP. For more information,
check out http://www.zend.com/php/in-depth.php
phpMyEdit 5.4
phpMyEdit generates PHP code for displaying/editing
MySQL tables in HTML. All you need to do is to write a
simple calling program (a utility to do this is included).
Looking for a new PHP Extension? Check out some of the latest offerings from PECL.
ps 1.1.0
ps is an extension similar to the pdf extension but for creating PostScript files. Its api is modeled after the pdf extension.
Memcache 0.2
Memcached is a caching daemon designed especially for dynamic web applications to decrease
database load by storing objects in memory. This extension allows you to work with memcached through a handy OO interface.
POP3 1.0
The POP3 extension makes it possible for a PHP script to connect to and interact with a POP3
mail server. It is based on the PHP streams interface and requires no external library.
Fileinfo 0.1
This extension allows retrieval of information regarding the vast majority of files. This information
may include dimensions, quality, length, etc. Additionally, it can also be used to retrieve the
MIME type for a particular file and, for text files, the proper language encoding.
ionCube Releases New Encoder
UK-based ionCube has released a new version of their
compiled code PHP encoding tools. New features
include a choice of ASCII or binary encoded file formats
and optional support for OpenSource extensions such
as mmcache.
Prices start at a special price of $159 in their March
20% off sale.
For further information, please visit the homepage of
the Encoder:
Editorial: Continued from page 5
oped using PHP, serious testing processes are going to
become an integral part of every good developer's
arsenal of programming tools. What we never quite
considered is that PHP is a great testing platform even
for those projects that are not written using it.
Thankfully, James McCaffrey came to the rescue and
provided us with a wonderful article on the subject.
Our final article this month is about the new Tidy
extension, which author John Coggeshall has recently
introduced in PHP. You may have already heard about
the Tidy project, which provides a series of libraries
capable of parsing and automatically repairing documents written in markup languages like HTML or XML.
Tidy brings an important set of capabilities to PHP, and
I'm happy to have the author of the extension introduce us to it.
That's it for this month—time for me to go tend to
my sunburn while I start working on the next issue.
Until then, happy readings!
It includes a huge set of table manipulation functions
(record addition, change, view, copy, and remove), table
sorting, filtering, table lookups, and more.
Several minor bugs were fixed. A few new options
were added. Major features include tabs support, the
ability to specify SQL expressions for fields when writing to the database, the ability to define new triggers,
and more. All eval() calls were removed due to security
and performance reasons. Some code was optimized.
Several parts of the documentation were updated. A lot
of new language files were added and updated.
For more information visit:
http://platon.sk/projects/phpMyEdit/
http://www.ioncube.com/sa_encoder.php
php|a
Check out some of the hottest new releases from PEAR.
Mail_Queue 1.1
Class to handle mail queue management. Wrapper for PEAR::Mail and PEAR::DB (or
PEAR::MDB). It can load, save and send saved mails in the background and also back up some mails.
The Mail_Queue class puts mails in a temporary container waiting to be fed to the MTA (Mail
Transport Agent) and sends them later (e.g. every few minutes) via crontab or in some other way.
XML_Transformer 0.9.1
With the XML/Transformer class one can easily bind PHP functionality to XML tags, thus transforming the input XML tree into an output XML tree without the need for XSLT.
Net_LMTP 0.7.0
Provides an implementation of the RFC2033 LMTP using PEAR's Net_Socket and Auth_SASL
class.
Text_Wiki 0.8.3
Abstracts parsing and rendering rules for Wiki markup in structured plain text.
Connecting to Amazon.com
Web Services with NuSOAP
Have you ever wanted to add an online shop to your
website but gave up on the idea because you lack the
expertise and resources to run it? Using SOAP, you can
connect to Amazon Web Services and create a PHP application to remotely browse and search products, add
them to Amazon shopping carts or wish lists and, yes,
you can even earn money on every purchase performed
from your site.
In the article "Exploring the Google API with SOAP,"
which appeared in the January issue of php|a, I
showed you what SOAP is and how it can be used
together with PHP. We used a SOAP-encoded document to perform a search using the Google Engine,
then we parsed the response to display the results on
our website. To perform these operations, we wrote an
application from scratch; this approach can be great to
understand how SOAP works, but when a customer
asks you to implement a SOAP-based feature in an
application, you can't waste your time in that way.
In this case, there are some libraries that will make
your coding quicker and easier: one of these is
NuSOAP, which allows you to send Remote Procedure
Calls (RPCs) over HTTP.
This article will show you how we can use the
Amazon.com API with NuSOAP to perform searches
and display product details, without having to sort
through a lot of SOAP syntax: if you have had an
opportunity to read my previous article, you will notice
how much shorter an application written this way is,
and how much time can actually be saved by using this
method.
What are Amazon Web Services?
Amazon.com is one of the most widely known on-line
shops. You can find and buy almost everything, from
books to toys to power tools. Several years ago,
Amazon launched a very successful affiliate program,
which they later expanded in their Web Services program.
Why would you want to use Amazon Web Services (AWS)? For instance, if your website is about literature,
you may want to allow your users to look for books in
the (huge) Amazon database directly from your pages,
without redirecting them to Amazon.com. You can provide them with a detailed description of each book and,
when they decide to buy one, you can add it directly to
their Amazon shopping cart. When the time comes to
complete the purchase, you can redirect the user
directly to the Amazon website, where the checkout
process actually takes place and you receive credit for
your affiliate referral.
It is important to understand that AWS are designed
only to retrieve information about products and create,
as well as populate, shopping carts, not to perform payments: this must be done directly on the Amazon website, the reason being, of course, one of security for the
customer's personal information. In any case, a significant portion of the transaction is performed from your
website. This results in a benefit both for you and for
your users, since you can offer your customers a nearly
seamless user experience and collect your referral fees.
Access to AWS, as well as to the affiliate program, requires you to register with the Amazon Associates Program and obtain an Associates ID, which will identify each purchase sent through our website.
REQUIREMENTS
PHP: 4.1 and higher
OS: Any
Other software: NuSOAP 0.6.4
Code Directory: webs-nusoap
Getting started
Before we start coding, I recommend you download the AWS Software Developer's Kit from http://www.amazon.com/gp/browse.html/?node=3434641. It contains the License Agreement, a guide (you should have a look at it to familiarize yourself with the concepts associated with the program) and some code samples, including a few written in PHP!
As I mentioned earlier, you will also have to apply for your Developer's token, an alphanumerical string needed for performing searches and purchases. To do so, you have to visit
https://associates.amazon.com/exec/panama/associates/join/developer/application.html
and accept the AWS terms and conditions.
To write our application, we will take advantage of a PHP library called NuSOAP, which is really just a group of "userland" classes written in PHP and designed to allow developers to manage SOAP web services. It will speed up our coding by allowing us to focus on functionality rather than on the communication protocols. NuSOAP is distributed under the LGPL license, and can be downloaded here: http://dietrich.ganx4.com/nusoap/ .
To add NuSOAP support to our project, we simply have to include nusoap.php in our PHP scripts using require(). Performing a Remote Procedure Call (RPC) is simple; look at this example:
require("nusoap.php");
$params = array('name' => 'value');
$s = new soapclient("http://server/file.wsdl", true);
$result = $s -> call('method', $params);
First of all, we include NuSOAP and we store the parameters we will use for the RPC in the $params associative array. We then create a new soapclient object, passing two arguments to the constructor: the SOAP server address and a boolean value that indicates whether the server uses a WSDL document. WSDL (Web Services Description Language) documents contain information about a web service, as well as its methods and properties. They are often used by web service providers, including Amazon.
Once we have created the object, all we have to do is to actually execute the RPC by invoking the call() method and specifying the remote method name and the parameters to be passed (contained in $params in our case). NuSOAP automatically fetches the results of the call and stores them in the $result array.
Since we are working with a WSDL-based server, NuSOAP can actually create a "proxy" PHP class capable of providing a better interface to our scripts. Once we have instantiated $s, we can also invoke a remote method in this way:
$proxy = $s -> getProxy();
$result = $proxy -> method($params);
This can be useful to simplify our code: first, we create a proxy client, $proxy; any subsequent RPCs to methods specified in the WSDL can be performed using the proxy, without having to use the NuSOAP call() method again. In our application, we will use proxies to work with AWS.
Designing the application
Now that we've laid down some ground rules, it's time to decide in detail what the goals of our application are going to be. Since we're all PHP fans, our example website will be about PHP and, therefore, we'll want to allow our users to buy books on this topic from Amazon.
The first thing that we need is a search page: users will be able to search for a particular keyword (or for a set of keywords) and the page will display some basic information about each book that matches the criteria, such as its title, an image, the publishing company, author or authors and price. We also have to provide a way to browse the results, since AWS returns only ten results per call.
The search page should also contain a link for each product to another page on our website that will contain a detailed description of the book, including any user reviews and comments. From here, the users will be able to continue their purchase on Amazon.com or add the product to their wish lists.
The search page
If you have had an opportunity to read through the AWS documentation, you have probably discovered that searches by keyword can be performed using the KeywordSearchRequest() method, which requires the parameters shown in Figure 1.
Assuming that the call is successful, the server will return an array containing several items:
• The TotalResults element, which indicates the number of total results returned by the query.
• The TotalPages element, which provides the number of pages available in the search result.
• The Details sub-array, which contains a set of data about each search result matching our search criteria that is included in the page we have requested. Given that a search only returns a maximum of ten items per page, you can expect that this array will contain no more than ten elements. The lite search mode returns the data shown in Figure 2.
Figure 1
Parameter Name    Type      Description
keyword           String    The keyword on which the search should be performed.
page              String    The page number. AWS returns ten results per page, so page 1 will contain results 1 through 10, page 2 results 11 through 20, and so on.
mode              String    Specifies the ID of the store to browse. Each Amazon store has its unique ID, which indicates what kind of products it sells (e.g.: books, music, dvd, vhs, etc.). You can find a complete list of all the IDs available in the AWS documentation.
tag               String    Your Associate ID. If you don't have one, you can use the generic ID webservices-20.
type              String    Determines the type of search results. Lite indicates a simpler result set, while heavy provides a richer set of information about each item returned. We'll use lite for our example.
devtag            String    The Developer Token you have received from Amazon.
Figure 2
Result Datum      Type      Description
Url               String    The URL of the product page for this item on Amazon
Asin              String    The Amazon.com Standard Item Number for this product
ProductName       String    The name of the product (in our case, the title of the book)
Catalog           String    The category of the product (e.g.: books)
Authors           String    The name(s) of the author(s)
ReleaseDate       String    The release date, in human-readable format (e.g.: "23 February, 1976").
Manufacturer      String    The name of the product's manufacturer (the publisher in our case)
ImageUrlSmall     String    A pointer to the product's "small" image on the Amazon website
ImageUrlMedium    String    Same as above, for a slightly larger image
ImageUrlLarge     String    Same as above, but for an even larger image
ListPrice         String    The product's list price, including the currency symbol (e.g.: "$ 20.55")
OurPrice          String    The product's selling price on Amazon, including the currency symbol
UsedPrice         String    The product's price for used copies.
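As a rough illustration of the response layout described in "The search page" (TotalResults, TotalPages and the Details sub-array), the sketch below walks a hand-made array of the same shape. The sample values and the summarizeResults() helper are assumptions made for this example; they are not part of NuSOAP or AWS, and a real script would receive $results from the proxy call instead.

```php
<?php
// Hypothetical sample data shaped like the response described in the text;
// a real call would obtain it from $proxy->KeywordSearchRequest($param).
$results = array(
    'TotalResults' => 12,
    'TotalPages'   => 2,
    'Details'      => array(
        array('Asin' => 'EXAMPLE001', 'ProductName' => 'Sample PHP Book', 'OurPrice' => '$20.55'),
        array('Asin' => 'EXAMPLE002', 'ProductName' => 'Another PHP Book', 'OurPrice' => '$31.00'),
    ),
);

// summarizeResults() is an illustrative helper, not part of NuSOAP or AWS.
function summarizeResults($results, $page)
{
    // A missing or empty Details element means the search found nothing
    if (empty($results['Details'])) {
        return array('No results found.');
    }
    $lines = array(sprintf('%d results, page %d of %d',
        $results['TotalResults'], $page, $results['TotalPages']));
    foreach ($results['Details'] as $item) {
        $lines[] = $item['ProductName'] . ' (' . $item['Asin'] . '): ' . $item['OurPrice'];
    }
    return $lines;
}

$summary = summarizeResults($results, 1);
echo implode("\n", $summary), "\n";
```

Checking Details before reading anything else mirrors the order used by the search page, which stops as soon as it knows there is nothing to display.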
Listing 1
<form action="<?php echo htmlspecialchars($_SERVER['PHP_SELF'], ENT_QUOTES); ?>" method="GET">
<input type="text" name="keyword" value="" />
<input type="hidden" name="page" value="1" />
<input type="submit" name="button" value="Search!" />
</form>
<?php
if (empty($_GET["keyword"])) // If the form hasn't been submitted
    exit;                    // Stops the execution
require("nusoap.php");
// Escaped copies of the user input, used whenever it is echoed back
$keyword = htmlspecialchars($_GET["keyword"], ENT_QUOTES);
$page = isset($_GET["page"]) ? max(1, (int) $_GET["page"]) : 1;
$self = htmlspecialchars($_SERVER['PHP_SELF'], ENT_QUOTES);
$client = new soapclient("http://soap.amazon.com/schemas2/AmazonWebServices.wsdl", true);
$proxy = $client -> getProxy(); // Creates a WSDL client and a proxy
$param = array(
    'keyword' => $_GET["keyword"],
    'page'    => (string) $page,
    'mode'    => 'books',
    'tag'     => 'webservices-20',
    'type'    => 'lite',
    'devtag'  => 'YOUR-DEV-TOKEN'
);
$results = $proxy -> KeywordSearchRequest($param); // Calls the method
if (empty($results["Details"])) // Checks whether there are results
    die("<h3>No results found for &quot;".$keyword."&quot;.</h3>");
echo "<h3>Searched Amazon.com for &quot;".$keyword."&quot; - page "
    .$page." of ".$results["TotalPages"]."</h3>";
foreach ($results["Details"] as $res) // Prints each product's details
    echo "<img src='".$res["ImageUrlMedium"]."' align='left' /><br/>\n"
        ."<a href='details.php?asin=".$res["Asin"]."'><b>".$res["ProductName"]."</b></a><br /><br />\n"
        ."<b>Authors</b>: ".(isset($res["Authors"]) ? implode(', ', (array) $res["Authors"]) : '')."<br />\n"
        ."<b>Publishing Company</b>: ".$res["Manufacturer"]."<br />"
        ."<b>List Price</b>: ".$res["ListPrice"]." - <b>Our Price</b>: "
        .$res["OurPrice"]." - <b>Used Price</b>: ".$res["UsedPrice"]."<br /><br /><br />\n\n";
if ($page > 1) // Prints a link to prev. page if any
    echo "<a href='$self?keyword=".urlencode($_GET["keyword"])."&amp;page=".($page-1)."'>Previous Page</a>&nbsp;\n";
if ($page < $results["TotalPages"]) // Prints a link to next page if any
    echo "&nbsp;&nbsp;<a href='$self?keyword=".urlencode($_GET["keyword"])."&amp;page=".($page+1)."'>Next Page</a>";
?>
Figure 4
Parameter         Type               Description
asin              String             The product's ASIN (which, in our case, can be retrieved from $_GET['asin'])
tag               String             The Associate ID, or webservices-20 if you want to use a generic one
type              String             The type of search. In this case, we'll choose heavy, since we want all the information available on a particular book
devtag            String             Your Developer Token
Figure 5
Result Datum      Type               Description
SalesRank         Integer            The product's sales ranking
Lists             Array of Strings   The names of the ListMania lists that contain the product
BrowseList        Array of Arrays    Indicates the product categories in which the product can be found. Its contents look like this:
                                     BrowseList => Array ( [0] => Array ( BrowseName => PHP ) )
Media             String             The type of medium on which the product is distributed (e.g.: paperback or hardcover for books)
Isbn              String             The ISBN code of the product (books only)
Availability      String             Indicates how long the product takes to be shipped
Reviews           Array              This array contains information about the customer reviews associated with the product. It includes three elements: AvgCustomerRating, which indicates the average customer rating for the product, TotalCustomerReviews, which contains the number of customer reviews available and CustomerReviews, which is an array that contains the three most recent reviews (you can find the contents of this array in Figure 6).
SimilarProducts   Array of Strings   Contains the ASINs of products that are similar to this one.
Figure 6
Element           Type               Description
Rating            Integer            The rating of the product in this review
Summary           String             A summary of the review
Comment           String             The full review itself
As you can see, the KeywordSearchRequest() method
returns quite a few pieces of information for every
result item, although, of course, we don't have to output all of them on our site. If you look at Listing 1—the
source for our search page—you'll see that the very first
part of the file is nothing more than a simple HTML
form, which contains an input text box for the keyword
and a hidden field that forces the page number to 1—
this way, a new search will automatically start from the
first page of results.
The form uses the GET method because we need to
use links for the "Next Page" and "Previous Page" operations (something like page.php?keyword=blah&page=2).
Naturally, you could also use POST, but in that case it
would be much more difficult for someone to create a
direct link to your search results, which could, in theory, prevent you from completing some sales.
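To make the GET-based navigation concrete, here is a minimal sketch of how "Previous Page" and "Next Page" links of the form page.php?keyword=blah&page=2 could be generated. The pageLink() and pagerLinks() helpers are hypothetical names introduced for this example; only urlencode() and htmlspecialchars() are standard PHP functions.

```php
<?php
// pageLink() is an illustrative helper; the script name and parameter
// names mirror the example URL page.php?keyword=blah&page=2.
function pageLink($script, $keyword, $page, $label)
{
    // urlencode() keeps spaces and ampersands in the keyword from breaking the query string
    $url = $script . '?keyword=' . urlencode($keyword) . '&page=' . (int) $page;
    // htmlspecialchars() turns & into &amp; so the URL is safe inside an HTML attribute
    return '<a href="' . htmlspecialchars($url) . '">' . htmlspecialchars($label) . '</a>';
}

// Returns only the links that make sense for the current page.
function pagerLinks($script, $keyword, $page, $totalPages)
{
    $links = array();
    if ($page > 1) {
        $links[] = pageLink($script, $keyword, $page - 1, 'Previous Page');
    }
    if ($page < $totalPages) {
        $links[] = pageLink($script, $keyword, $page + 1, 'Next Page');
    }
    return $links;
}

echo implode(' ', pagerLinks('page.php', 'php & mysql', 2, 5)), "\n";
```

Because the whole search state lives in the query string, any of these links can be bookmarked or sent to someone else, which is the advantage of GET over POST mentioned above.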
The second part of the script contains the actual PHP
code. First of all, an if statement stops the
execution of the script if $_GET["keyword"] is empty.
Otherwise, we include NuSOAP and create a SOAP
client by passing the URI of the *.wsdl file for Amazon
(which is provided in the AWS documentation) and the
boolean true, which tells the soapclient constructor that the URI points to a WSDL document. We also create a proxy to call AWS methods
directly, as we have seen in the first part of the article.
The parameters needed to invoke KeywordSearchRequest() are stored in the $param array;
the first two (the keyword and the page number) are to
be found in the $_GET superglobal, since they change
each time we perform or browse a search, while the
others are constant and, therefore, we hardcode them
in our script. Remember to insert your developer token
in $param["devtag"].
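The following sketch shows one possible way to assemble that parameter array while guarding against a missing keyword or a non-numeric page value. buildSearchParams() is a hypothetical helper written for this illustration; the keys are the ones listed in Figure 1, and 'YOUR-DEV-TOKEN' is a placeholder.

```php
<?php
// buildSearchParams() is an illustrative helper; the keys are the
// KeywordSearchRequest() parameters listed in Figure 1.
function buildSearchParams($get, $devtag, $tag = 'webservices-20')
{
    $keyword = isset($get['keyword']) ? trim($get['keyword']) : '';
    if ($keyword === '') {
        return null; // nothing to search for
    }
    // Fall back to the first page when the value is missing or not positive
    $page = isset($get['page']) ? (int) $get['page'] : 1;
    if ($page < 1) {
        $page = 1;
    }
    return array(
        'keyword' => $keyword,          // changes with every search
        'page'    => (string) $page,    // changes while browsing results
        'mode'    => 'books',           // constant: the store to search
        'tag'     => $tag,              // constant: Associate ID
        'type'    => 'lite',            // constant: size of the result set
        'devtag'  => $devtag,           // constant: developer token
    );
}

$param = buildSearchParams(array('keyword' => ' php ', 'page' => 'abc'), 'YOUR-DEV-TOKEN');
print_r($param);
```

In a real page the first argument would be $_GET; passing the array in explicitly simply makes the helper easy to try out on its own.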
Once we have invoked the method and stored the
search results in $results, we have to display the latter
in a format that is comprehensible to the user. First, we
check whether there are any results to begin with. If
the search returned no data, the program displays a
warning and exits. Otherwise, we print a short summary of the search: the keyword, the current page number and total page count, followed by details about
each product in the current result page. These are actually produced by a simple foreach loop, which browses the $results["Details"] array, echoing the title of
each book, a medium-size image, its authors, publishing company and prices. We will also provide a link to
another page, details.php, which contains further
information on each book. The link contains a reference to the product's ASIN (the Amazon identifier for
each product) in order to make the application able to
retrieve the correct product from Amazon's catalogue
with another RPC.
The last part of this page allows the user to browse
the results: if the current page isn't the first one (Page
The Product Detail Page
Now that we are done with the first part of the application, it's time to move on to the product detail page,
which will show advanced information about a particular book. The AWS method we need in this case is
AsinSearchRequest(), which needs the parameters
shown in Figure 4. Just like before, the response that we
get back from Amazon is an array of arrays—except
that, in this case, we will simply concern ourselves with
the first result set, since the ASIN uniquely identifies
one product. Our data, therefore, will be stored in
$results['Details'][0], which, in turn, will contain
the information shown in Figure 5. As you can see,
some of the values returned are the same as the results
of the KeywordSearchRequest() call that we used in
Listing 1, while some others, like the customer reviews,
are more appropriate for a detailed product page.
Speaking of the product page, Listing 2 contains the
code for details.php. First, we check $_GET["asin"]; if
it is empty, the program displays a warning and exits.
In a more complete application, you may want a slightly more verbose explanation of what went wrong, or
perhaps an automatic redirection to the search page.
If we have an ASIN, we include the NuSOAP library,
then create a SOAP client and proxy as we did in the
previous page. Please note that we have to use
sprintf() to transform the ASIN into a ten-character
string, since AWS requires it to be submitted in that
format (as an alternative, you could use str_pad() to
ensure that the string is ten characters long).
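One caveat worth spelling out: the %010d format treats its argument as an integer, so it is only safe for purely numeric identifiers. The sketch below shows the str_pad() alternative mentioned above. The helper name normalize_asin() is hypothetical, and the assumption that an ASIN may contain letters (for instance an ISBN ending in X) is ours, not something Listing 2 relies on.

```php
<?php
// Hypothetical helper: left-pad an ASIN with zeros to ten characters
// without converting it to an integer first.
function normalize_asin($asin)
{
    $asin = strtoupper(trim((string) $asin));
    return str_pad($asin, 10, '0', STR_PAD_LEFT);
}

echo normalize_asin('596005601'), "\n";   // padded with one leading zero
echo normalize_asin('020161622X'), "\n";  // already ten characters, unchanged
```

Because str_pad() never casts the value, a trailing letter survives the normalization, which would not be the case with an integer format.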
This time, we only need to pass the ASIN and specify
heavy as the search type. Once the RPC has been executed, we retrieve the results and print them out, using
a foreach loop to cycle through the user reviews.
The final touch in our application consists of providing a link back to the Amazon website in order to make
it possible for our users to purchase a product—you
can't do much selling by just showing which products
are available!
The AWS documentation specifies that an HTTP form
must be set up for the purpose of submitting the purchase information over to Amazon.com. This form (you
can look at the one in Listing 2 for an example) uses the
POST method, and its action attribute is really nothing
more than a page on Amazon.com that contains the
1), the script prints a link to the previous one and, if it
isn't the last page (based on the information returned
by our AWS call), it prints a link to the next one.
Figure 3 shows our search page at work.
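The paging logic described above can be sketched as a small helper. This is an illustration rather than the code from Listing 1: paging_links() is a hypothetical function, the page.php file name simply follows the example URL given earlier, and the total page count is assumed to come from the AWS response.

```php
<?php
// Hypothetical helper: build "Previous Page" / "Next Page" links for a
// GET-based search page. $totalPages is assumed to be taken from the
// search results returned by AWS.
function paging_links($keyword, $page, $totalPages, $script = 'page.php')
{
    $links = array();
    if ($page > 1) {
        $query = http_build_query(array('keyword' => $keyword, 'page' => $page - 1), '', '&');
        $links[] = '<a href="' . htmlspecialchars($script . '?' . $query) . '">Previous Page</a>';
    }
    if ($page < $totalPages) {
        $query = http_build_query(array('keyword' => $keyword, 'page' => $page + 1), '', '&');
        $links[] = '<a href="' . htmlspecialchars($script . '?' . $query) . '">Next Page</a>';
    }
    return implode(' | ', $links);
}

echo paging_links('php xml', 2, 5), "\n";
```

Building the query string with http_build_query() takes care of URL-encoding the keyword, and htmlspecialchars() keeps the ampersand valid inside the href attribute.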
Listing 2
<?php
if (empty($_GET["asin"]))
    die("<h3>No ASIN specified</h3>");

require("nusoap.php");

$_GET["asin"] = sprintf("%010d", $_GET["asin"]);

// Creates a WSDL client and a proxy
$client = new soapclient("http://soap.amazon.com/schemas2/AmazonWebServices.wsdl", true);
$proxy = $client->getProxy();

$param = array(
    'asin'   => $_GET["asin"],
    'tag'    => 'webservices-20',
    'type'   => 'heavy',
    'devtag' => 'YOUR-DEV-TOKEN'
);

$results = $proxy->AsinSearchRequest($param); // Calls the method
?>

<h1><?=$results["Details"][0]["ProductName"] ?></h1>
<img src="<?=$results["Details"][0]["ImageUrlLarge"] ?>" align="left" height="350" />
<b>Authors:</b> <?=@implode(', ', $results["Details"][0]["Authors"])?><br /><br />
<b>Published by</b> <?=$results["Details"][0]["Manufacturer"]?>
<b> on</b> <?=$results["Details"][0]["ReleaseDate"]?><br /><br />
<b>List Price</b>: <?=$results["Details"][0]["ListPrice"] ?>
<b>Our Price</b>: <?=$results["Details"][0]["OurPrice"] ?>
<b>Used Price</b>: <?=$results["Details"][0]["UsedPrice"] ?><br /><br /><br />

<!-- Form to purchase on Amazon.com -->
<form method="POST" action="http://www.amazon.com/o/dt/assoc/handle-buy-box=<?=$_GET["asin"] ?>">
<input type="hidden" name="asin.<?=$_GET["asin"] ?>" value="1">
<input type="hidden" name="tag-value" value="webservices-20">
<input type="hidden" name="tag_value" value="webservices-20">
<input type="hidden" name="dev-tag-value" value="YOUR-DEV-TOKEN">
<input type="submit" name="submit.add-to-cart" value="Buy From Amazon.com">&nbsp;&nbsp;
<input type="submit" name="submit.add-to-registry.wishlist" value="Add to Wish List">
</form>
<!-- End Form -->

<b>ISBN:</b> <?=$results["Details"][0]["Isbn"]?><br /><br />
<b>Availability:</b> <?=$results["Details"][0]["Availability"]?><br /><br /><br />
<b>Sales Ranking:</b> <?=$results["Details"][0]["SalesRank"]?><br /><br />
<b>Average customer rating:</b> <?=$results["Details"][0]["Reviews"]["AvgCustomerRating"]?>
<br /><br /><h2>Read user reviews:</h2>

<?php
foreach ($results["Details"][0]["Reviews"]["CustomerReviews"] as $res)
    echo "<h3>".$res["Summary"]."</h3>"
        ."<b>Rating: </b>".$res["Rating"]."<br /><br />".$res["Comment"]."<br /><hr />";
?>
Further Improvements
As you have probably noticed, writing a SOAP-based
application using a library like NuSOAP is much faster
than developing your own SOAP classes; if you have
read my article about the Google API that appeared in
the January issue of php|a, you probably know what I
am talking about. This means that you can develop
rather complex applications without having to waste
time dealing with the nitty-gritty details of the underlying
protocol; in fact, we didn't even write any SOAP
code for our Amazon application: NuSOAP did it all for
us.
Naturally, the code that I have introduced here is very
basic and could stand to gain from some improvements.
For instance, Amazon Web Services allow you to
manage a remote shopping cart or wish list by
adding items to it and removing items from it. The very
last part of the purchase (the one where money changes
hands) must still take place on Amazon.com, but you
can let the user perform most of the normal operations
associated with an e-commerce website without leaving
your website. However, do keep in mind that if you
choose to manage the user's shopping cart remotely,
you can't change it once you've submitted it to
Amazon; this is done to protect the end user from
fraudulent transactions. You can check out the AWS
documentation for more details on this topic; you'll
find that it's not complicated at all.
Depending on your needs, you may choose to perform a different kind of search operation on your website: by similar products, by author, by ISBN, by manufacturer, and so on. You may also want to browse a
"node", or product category (e. g. "programming",
"web", etc.) directly, without performing a search. It
goes without saying that all this depends on what your
goals are.
If your Amazon-based shop becomes very popular,
you may decide to join the Amazon Associates
Program, an affiliate system that pays you commissions
on every sale. Keep in mind, however, that your application
must not send more than one request per second to
Amazon; even if you provide an error handling system,
you must not immediately retry a request if the previous
one has failed.
You should also provide a caching system, in order to
store the data needed by your site without going back
and forth to AWS for every request; you can check out
Bruno Pedro's excellent article in the February 2004
issue of php|a for more ideas on caching data from your
PHP scripts. If you choose to do so, don't forget that
you can't keep your data cached for more than
twenty-four hours.
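The two constraints just mentioned (at most one request per second, and cached data no older than twenty-four hours) can be sketched with a pair of small helpers. This is only an illustration under stated assumptions: cached_call() and throttle_request() are hypothetical names, the cache lives in plain files in the system temporary directory, and the throttle only covers requests made by a single PHP process; a busy site would need state shared between processes. The closure syntax also assumes a PHP version more recent than the one the listings target.

```php
<?php
// Hypothetical helper: return the cached value for $key if it is less
// than $ttl seconds old; otherwise call $fetch, store and return its result.
function cached_call($key, $fetch, $ttl = 86400, $dir = null)
{
    $dir  = ($dir === null) ? sys_get_temp_dir() : $dir;
    $file = $dir . '/aws_' . md5($key) . '.cache';

    clearstatcache();
    if (is_file($file) && (time() - filemtime($file)) < $ttl) {
        return unserialize(file_get_contents($file));
    }

    $data = call_user_func($fetch);
    file_put_contents($file, serialize($data), LOCK_EX);
    return $data;
}

// Hypothetical helper: wait until at least one second has passed since
// the previous request made by this PHP process.
function throttle_request()
{
    static $last = 0.0;
    $wait = 1.0 - (microtime(true) - $last);
    if ($wait > 0) {
        usleep((int) ceil($wait * 1000000));
    }
    $last = microtime(true);
}

// Stand-in for a real AWS call, so that the sketch runs on its own.
$fetch = function () {
    throttle_request();
    return array('Details' => array());
};
$results = cached_call('keyword:php:page:1', $fetch);
print_r($results);
```

In a real page the stand-in closure would wrap the proxy call, so the remote request (and the throttle) is only reached when the cache has no fresh copy.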
Finally, please keep in mind that in the examples
shown in this article we always referred to
Amazon.com, the American website. AWS are also
available for Amazon.co.uk, Amazon.de and
Amazon.co.jp, but you have to modify the URIs in the
script, changing the specifications in the WSDL document
from soap.amazon.com/ to soapeu.amazon.com/, and so on.
You will also have to add the locale parameter to your
RPC invocations; its value can be set to uk, de or jp,
depending on which Amazon
ASIN of the product that must be added to the user's shopping
basket. A few additional hidden fields provide the
ASIN, the Associates ID and the developer's token. The
form supports two different buttons: one adds the
product to the user's basket, while the other adds it to
the user's wish list.
Figure 3
I'm Outta Here
Amazon.com Web Services is a powerful tool that you
can use to add e-commerce functionality to your site
without going to the expense of developing an online
store of your own and stocking all the merchandise.
Even if you can't create a complete on-line shop using
AWS (because the purchase must be completed on the
Amazon website), you can still give your users a customized
shopping experience that relies on the practically limitless
resources of one of the world's most popular e-commerce websites.
The sample application that I showed you in this article is quite simple: if you plan to use it in a production
environment—especially if your site has a lot of traffic—
you should probably consider implementing features
like error handling and caching in order to prevent
problems with the Amazon servers. Adding these elements to your application may require some extra
work, but it could all pay off if you enjoy decent traffic
and join the Amazon Associates Program.
Perhaps most importantly, I hope to have given you
a good idea of how much a SOAP library (in this article
we have chosen NuSOAP, but there are some other
packages, like PEAR::SOAP) can simplify the creation of
a complex application: write a few lines of code to
perform a Remote Procedure Call and you're practically done.
If you want to extend our sample application and create
a "complete" on-line shop using AWS, have a look
at the documentation: there you will find a detailed
description of every method that's available for use. If
you want to learn more about SOAP, you can check out
the World Wide Web Consortium's notes about the protocol
at http://www.w3.org/TR/SOAP or, if you missed it,
read the article "Exploring the Google API with SOAP"
published in the January 2004 issue of php|a.
About the Author
Alessandro Sfondrini is a young Italian PHP programmer from Como. He
has already written some on-line PHP tutorials and published scripts on
the most important Italian web portals. You can contact him at
[email protected] .
website you are referring to.
To Discuss this article:
http://forums.phparch.com/130
Matchmaker, Matchmaker Make Me A Match
An Introduction to Regular Expressions
A quick search for the words "hate" and "regular expressions" on your favourite search engine is likely to bring up
thousands upon thousands of hits. While most developers
recognize the usefulness of regular expressions (and many
can't do without them once they have figured out how
regexes work), their use remains something of a black-magic art,
right up there with hypnosis and session management. Despite looking complicated, however, regular
expressions are much easier to work with than most people are willing to admit.
A Few Myths about Regexes
Before we get started, we should dispel a
few popular myths about regexes:
Myth: Regular Expressions are Slow.
Truth: Regular expressions can be slow,
but they don't need to be. The main regular expression library used by PHP (called
PCRE and consisting of the preg_ family of
functions) is quite fast and also quite
powerful. This power means that it is
easy to write a short regular expression
that performs a lot of work, and performing a lot of work with any tool can be
slow.
Myth: You should use basic string functions instead of regular expressions.
Truth: Regular string functions (for
example strstr or strtok) are (marginally)
faster than a regular expression that
accomplishes the same task. That having
been noted, this myth often leads to people
implementing complicated string
parsers using string matching functions
where a single regular expression would
do the trick. The PCRE library will always
match complex patterns faster than
a parser you implement on your own.
F E A T U R E
by George Schlossnagle
Regular expressions (commonly known as regexes)
are a powerful tool for pattern matching and text
manipulation. A typical problem that pulls people
into learning regular expressions is text munging: you
have a string of text and you need to replace portions
of it based on certain rules. For instance, you
might want to obfuscate all the email addresses
in a block of text so that email addresses like
george@example.com get translated to the form
george [at] example [dot] com. Regular
expressions are the tool for the job, and provide a powerful and deep syntax for handling tasks like these.
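To make the obfuscation task concrete, here is a minimal sketch of it. It is not code from this article: the function name obfuscate_emails() is hypothetical, the address pattern is a deliberately simple assumption (it is nowhere near a full RFC-compliant matcher), and the closure syntax assumes a reasonably recent PHP version.

```php
<?php
// Hypothetical helper: rewrite anything that looks like user@host.tld
// as "user [at] host [dot] tld". The pattern is intentionally simple.
function obfuscate_emails($text)
{
    return preg_replace_callback(
        '/\b([\w.+-]+)@([\w-]+(?:\.[\w-]+)+)\b/',
        function ($m) {
            // $m[1] is the local part, $m[2] the domain
            return $m[1] . ' [at] ' . str_replace('.', ' [dot] ', $m[2]);
        },
        $text
    );
}

echo obfuscate_emails('Write to george@example.com today.'), "\n";
```

A callback is used here because the dots inside the domain need a second transformation, something a plain replacement string cannot express on its own.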
Alternatives to the PCRE
Functions
PHP supplies some alternatives to the PCRE functions.
The most direct competitor is the POSIX regular expression library that consists of ereg, ereg_replace and others. We won't be looking at the POSIX regular expression functions because the PCRE library provides a
broader pattern-matching facility than its POSIX counterpart and the PCRE library is about 30% faster on
average. The other option is to perform string matching with the standard string functions. As noted above,
REQUIREMENTS
PHP: ANY
OS: Any
Applications: N/A
Code Directory: match-regex
the string functions are faster on the tasks they were
designed for (finding specific characters or substrings),
but are not an appropriate fit for anything but the simplest patterns.
Your First Regex
The simplest regex is a match against a static string. To
determine if the string 'george@example.com' is present in a piece of text, we can use the following code
fragment:
if(preg_match("/george@example\.com/", $text)) {
print "Matches";
} else {
print "Does not match";
}
Despite its simplicity, this example illustrates the
basic syntax of a regex match. The regex itself is the
first parameter, and is contained within slashes (/).
The second parameter is the text you want to test
the pattern against. The preg_match function returns
1 if the pattern matches and 0 if it does not, so its
result can be used directly in a condition. Using
slashes to delimit regular expressions is a convention
(taken from the UNIX utility awk), but is not necessary:
you can actually use any non-alphanumeric,
non-whitespace character other than a backslash.
Alternative delimiters are convenient if
your pattern itself contains slashes.
For instance, when dealing with file
paths or URLs (both of which contain
numerous slashes), it is common
to use a different delimiter.
We can also perform substitutions
with PCREs. To substitute 'george [at]
nospam.example.com' for my address
(a common anti-spam technique), you
can use
preg_replace("/george@example\.com/",
             "george [at] nospam.example.com",
             $text);
The other PCRE functions are:
• preg_grep(string pattern, array subjects [, int flag]): preg_grep
applies the specified pattern to every element of subjects,
returning an array consisting of those that matched. If the optional
flag is set to PREG_GREP_INVERT, only those
elements that did not match will be
returned.
• preg_match_all(string pattern, string subject
[, array matches [, int flags]]):
preg_match returns only the first match
found in its subject text. preg_match_all
matches as many times as possible, storing
all the matches in the matches array. I will discuss
this function in more detail later in the article.
• preg_replace_callback: This function
makes it possible to perform very complex
operations on a per-match basis through
the use of callback functions. We will cover
it in a future article, but some of its functionality
overlaps with evaluated replacements, which are discussed in this article.
• preg_quote(string text): When using input
text in a pattern, you may want to sanitize it
to ensure it does not contain any regex
metacharacters. preg_quote escapes all regex
metacharacters in a string.
• preg_split(string pattern, string subject
[, int limit [, int flags]]): preg_split
performs similarly to explode, allowing us to
break up the string subject into limit parts.
Instead of splitting on a specific delimiter,
preg_split allows the string to be broken
based on a regex.
Regex Basics
Of course, we can (and should) perform
the previous simple match using
strstr(), which is faster than any regex
function. What if, however, we want to
match all email addresses in a string,
rather than a specific one? What if you
wanted to change text only if it
appeared in a particular position within
your string?
The power of regular expressions is in
matching complex patterns that cannot be identified using straightforward
text-search functions like strstr(). The
basic components of a regular expression pattern are:
• Character Classes: Patterns rarely consist of
specified letters, but classes of letters. For
example 'any number' instead of a particular
number, or 'any letter' instead of a particular
letter.
• Grouping: Grouping allows for changing
the precedence of operations as well as
providing a means to extract the text you
matched with a pattern.
• Enumerations: Enumerators allow you to
specify how many times a character class or
sub-pattern appears. This allows for convenient
expression of fixed length patterns like
'a US zipcode is 5 digits' as well as variable
length patterns such as 'a domain is a number
of alphanumeric characters separated by
dots'.
• Alternations: Alternations allow for multiple
patterns to be combined. Unlike character
classes, which allow for a position to match
multiple characters, alternations allow for
entire patterns to be alternatively matched.
For example, a valid workday can be
Monday, Tuesday, Wednesday, Thursday or
Friday.
• Positional Anchors: Anchors allow you to
require your pattern to start matching at a
specific location in the search text, for example
at the beginning or end of a line.
• Global Pattern Modifiers: Global pattern
modifiers allow you to change the basic
behavior of a regular expression, for example
rendering it case-insensitive.
Character Classes
While it's usually easy to find a particular substring
within a larger string (for example, my e-mail address
in a message), it's not always easy to find a particular
type of substring, like any e-mail address. To do this,
you need to be able to match against a more generic
pattern and not just against a static string. PCRE supplies
character classes to allow you to do this; a character
class allows a specific character in a search text
to be matched against a range of possible characters.
For example, a US phone number is composed of a
three digit area code, a three digit exchange, and a four
digit line number, commonly delimited by a '-'. To
match this pattern, you could use the following regular
expression:
/\d\d\d-\d\d\d-\d\d\d\d/
The \d specifier is a built-in PCRE character class
that consists of all the digits. There are a couple of
things you should note about the pattern above. The
first is that we have many \d's. In regular expressions,
any character or character class matches only
a single character unless you use an enumerator
(which we'll cover later) to attach a quantity to it.
Second, if you test this pattern you will find the following results.
• 555-123-4567 matches. This is correct.
• 5555-123-45678 matches. This is not correct.
The second example does not represent a valid
phone number (the area code and line number are too
long), but it matches because the pattern fits as shown
in Figure 1.
Figure 1: Regex doesn't always work the way you expect. In
8877-xxx-yyyyy (x and y stand for digits), the pattern
\d\d\d-\d\d\d-\d\d\d\d lines up with the inner portion 877-xxx-yyyy.
There are a couple of ways to combat this problem.
If you know that your search text should be exactly a
phone number (with no leading or trailing text), you
can use positional anchors to force the pattern to start
at the beginning of the text and end at the end, as we'll
see later on.
If the phone number might be contained in text, on
the other hand, you might try and fix the pattern by
requiring at least one character of leading and trailing
whitespace around the number, using a pattern like:
/\s\d\d\d-\d\d\d-\d\d\d\d\s/
The \s specifier is another character class for all
whitespace (spaces, tabs, newlines, etc.). This pattern
does not work in all situations, though, since if
the text begins with the phone number you will be
unable to match the leading \s. To handle this case,
PCRE supports \b, a boundary condition that
matches at the border (or boundary) between a
'word' and a 'non-word' (these are words in the C
programming language sense: letters, numbers and
underscores only). \b is actually not a character class,
but what is known as a 'zero-width assertion'; this
means that the \b specifier does not actually match
the character on the other side of the boundary, but
only ensures that such a boundary exists. Putting
that into our pattern we can refine it to:
/\b\d\d\d-\d\d\d-\d\d\d\d\b/
Continuing the testing, we find that "077-xxx-yyyy"
matches. US and Canadian area codes and exchanges
cannot begin with 0 or 1 (these are reserved for long
distance and operator-assisted or international services).
To be able to restrict the leading numbers to the
allowed set, we need to be able to create our own
character classes. In PCRE, these are constructed by
filling a set of brackets ([ ]) with the characters we
want to match. To match 2-9, we can use the character
class [23456789], which is commonly shortened via
a range operator to [2-9]. To use a custom character
class in a pattern, you use it exactly as you would a
regular character or character class. Here is the phone
number pattern reworked to employ this:
/\b[2-9]\d\d-[2-9]\d\d-\d\d\d\d\b/
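A quick way to check the refined pattern is to run it over a handful of strings. The sketch below does that; the helper name contains_us_phone() and the sample strings are our own, chosen to exercise the word boundaries and the [2-9] restriction.

```php
<?php
// Returns true if $text contains something shaped like a US phone number
// whose area code and exchange both start with a digit from 2 to 9.
function contains_us_phone($text)
{
    return preg_match('/\b[2-9]\d\d-[2-9]\d\d-\d\d\d\d\b/', $text) === 1;
}

$samples = array(
    '555-321-1212',
    'Call 410-555-1212 today',
    '555-321-12128',   // one digit too many at the end
    '077-555-1212',    // area code starts with 0
);
foreach ($samples as $sample) {
    printf("%-26s %s\n", $sample, contains_us_phone($sample) ? 'matches' : 'does not match');
}
```

Note that 555-123-4567, which the first version of the pattern accepted, is now rejected as well, because its exchange begins with 1.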
PCRE provides six commonly used built-in character classes,
described in Figure 2. Additionally, PCRE
provides POSIX-style character classes for compatibility
with POSIX-style regular expressions. These classes
are described in Figure 3. POSIX character sets
aren't commonly used much in real-life code, which
is a shame because they are often a perfect fit for
problems that programmers encounter in their
day-to-day work.
You can negate a POSIX character class by adding a
^ after the first colon. For instance, to match all non-letter
characters, you could use the class [:^alpha:] (written
inside a custom character class, that is, [[:^alpha:]]).
Negations are also available in custom character
classes; for example, to match anything that is not the
greater-than character (>), you can use the custom
character class [^>]. Negations are very useful when
you are creating regular expressions that extract quoted
text or if you want to manually parse XML or HTML.
Since '-', '^' and '[ ]' have special meanings in custom
character classes, if you want those actual characters
to be elements of the class, you should escape
them with a backslash (\). The two exceptions are
the range operator -, which can appear un-escaped
as the last character in a class, since that is unambiguous,
and the negation character ^, which can
appear un-escaped in any position but the first.

Figure 2: Basic Character Classes
.     Matches any character (except a newline, by default).
\w    An alphanumeric character or the underscore character.
\W    Anything not a \w.
\d    A digit.
\D    A non-digit.
\s    Any whitespace. This includes spaces, tabs, newlines and
      related control characters (carriage return, form feed).
\S    A non-whitespace character.

Figure 3: POSIX Style Classes
[:alpha:]    Any letter.
[:alnum:]    Any alphanumeric character.
[:ascii:]    Any ASCII character.
[:cntrl:]    Any control character.
[:digit:]    Any digit (same as \d).
[:graph:]    Any alphanumeric or punctuation character.
[:lower:]    Any lowercase letter.
[:print:]    Any printable character.
[:space:]    Any whitespace character (same as \s).
[:upper:]    Any uppercase letter.
[:xdigit:]   Any hexadecimal 'digit'.

Grouping and Sub-Patterns
Usually, you will not only want to match a pattern, but
extract data from it as well. To extract a specific part of
a pattern, you surround it within parentheses. For
example, to capture each part of the phone number
pattern, you would add parentheses as follows:
/\b([2-9]\d\d)-([2-9]\d\d)-(\d\d\d\d)\b/
Figure 4
March 2004
●
PHP Architect
●
www.phparch.com
19
FEATURE
Matchmaker, Matchmaker Make Me A Match
$text = 'My phone number is 555-321-1212';
preg_match("/\b([2-9]\d\d)-([2-9]\d\d)-(\d\d\d\d)\b/",
$text, $matches);
print_r($matches);
Executing that code yields the following results, just
as we predicted:
Array
(
[0]
[1]
[2]
[3]
)
=>
=>
=>
=>
555-321-1212
555
321
1212
We can also nest patterns. If we wanted to capture
the entire local part of the phone number, in addition
to its componentized parts, the regex could be modified to be:
/\b([2-9]\d\d)-(([2-9]\d\d)-(\d\d\d\d))\b/
When we nest patterns, we move left to right and,
when we hit a nested pattern, we take the outermost
part first, then recursively parse its contents following
the same rules. With the above pattern, the patterns are
numbered as shown in Figure 4.
Sub-patterns are also extremely useful in substituListing 1
1
2
3
4
5
6
7
8
9
10
$fp = fopen(“/usr/share/dict/words”, “r”);
if(!$fp) {
print “dictionary file not found\n”;
exit;
}
while(($line = fgets($fp)) !== false) {
if(preg_match(‘/\b(\w)(\w)(\w)\3\2\1\b/’, $line)) {
print “palindrome: $line\n”;
}
}
Figure 5
h
a
l
l
a
h
\w(captured as
\1)
\w(captured as
\2)
\w(captured as
\3)
\3
\2
\1
●
preg_replace("/\b([2-9]\d\d)-([2-9]\d\d)(\d\d\d\d)\b/",
'\1-\2-XXXX', $text);
If we run this on the text 'My phone number is 410555-1212.', it returns 'My phone number is 410-552XXXX'.
Note that the replacement string in the above example is single-quoted. If we were to double quote it, we
would have to double escape our sub-pattern references as "\\1-\\2-XXXX". This may seem mysterious but
the reasoning is this: the PCRE library needs to be
passed the sub-pattern references as \1, but when we
double-quote a string, PHP attempts to interpret the
escaped characters for us. Single-quoting performs no
such interpretation and leaves your references
untouched. This is the same process by which "\n"
becomes a newline, but '\n' remains literally '\n'.
We can reference sub-patterns in matches as well,
using the same rules. A fun example of this is finding
all 6-letter palindromes. A palindrome is a word that
is spelled the same forward and backward, for example 'noon' or 'deed'. To spot a six-letter palindrome,
we match 3 characters and require that we see them
immediately in reverse order. Here is the pattern:
Note 2
This isn't the full story on RFC compliant email
addresses. Because the specification allows for
addresses to contain descriptions as well, a completely accurate email address validator is actually quite complex. An example can be found at
the end of Mastering Regular Expressions in Perl
- the regex presented there is X characters long!
For most purposes, the regex presented above is
completely sufficient.
Enumeration modifiers can also be used to
compress patterns with long repetitive parts.
For instance, the phone-number pattern can be
compressed to:
/\b[2-9]\d{2}-[2-9]\d{2}-\d{4}\b/
Matching a palindrome
March 2004
tions, since they allow us access to the matched subpatterns when performing the replacement. A captured sub-pattern can be accessed in the
{preg_replace} replacement text by referencing its offset as \N (where N is the sub-pattern number). Here is
an example that sanitizes phone numbers by obscuring their line number:
Licensed to 63883 - Joseph Crawford ([email protected])
Pattern fragments grouped in this fashion are called
sub-patterns. To see what they capture, you need to
pass a third argument to {preg_match}. This argument is set by the function as an array with the captured sub-pattern results in it. The zeroth element the
array is the text matched by the pattern as a whole,
while the sub-patterns captures are at the offset of
their pattern number. Patterns are numbered left-toright and outside-to-inside. So in the pattern above
the entire phone number is offset 0, the area code is
sub-pattern 1, the exchange is sub-pattern 2, and the
line number is sub-pattern 3.
Here you can see a sample phone number being run
through the regular expression.
PHP Architect
●
www.phparch.com
or, by noting that the area code and exchange
match the same pattern, we can compress it
even further, as follows:
/\b([2-9]\d{2}-) {2}\d{4}\b/
20
FEATURE
Matchmaker, Matchmaker Make Me A Match
/\b(\w)(\w)(\w)\3\2\1\b/
<?php
$text = 'Work: 877-555-1212, Fax: 888-555-1212';
preg_match_all("/\b([2-9]\d\d)-([2-9]\d\d)(\d\d\d\d)\b/",
$text, $matches);
print_r($matches);
?>
Executing that script returns the following:
Array
(
[0] => Array
(
[0] => 877-555-1212
[1] => 888-555-1212
)
[1] => Array
(
[0] => 877
[1] => 888
)
[2] => Array
(
[0] => 555
[1] => 555
)
[3] => Array
(
[0] => 1212
[1] => 1212
)
)
The alternative is to pass the optional flag
PREG_SET_ORDER. With this flag set, the ordering of the
match array is reversed: the match array contains one
element for each search text matched, with that array
containing the sub-pattern captures for that search
text. If we are looking to replicate the Perl idiom
while($text =~ /$regex/g) {
# perform work on one set of matches at a time
}
you can accomplish it with this PHP:
preg_match_all($regex, $text, $matches,
PREG_SET_ORDER);
foreach($matches as $match) {
// perform work on one set of matches at a time
}
March 2004
●
PHP Architect
●
www.phparch.com
Enumerations
Another important feature in pattern matching is the
ability to match variable-length patterns. In the phone
number example, even though the digits of the number were unknown, the length of the pattern was
fixed—it is always a three digit area code, three digit
exchange and four digit line number. On the other
hand, if we are matching email addresses, we don't a
priori know the length of the address.
When we run the palindrome pattern /\b(\w)(\w)(\w)\3\2\1\b/ against a palindrome like
'hallah', it matches as shown in Figure 5.
Notice that you need to use \b to make sure you
don't misidentify words that contain palindrome substrings. If you are running on a UNIX system, Listing 1
is a code block that will find all the six-letter palindromes in the dictionary file /usr/share/dict/words.
When we use preg_match_all with sub-patterns, we
have two choices of how we want the data returned to
us. The default behavior is for the match array to contain an array for each sub-pattern, where that array
contains the capture for the nth search match as its nth
element. If that's confusing, the phone-number script above and its output show how it looks when
matching all the phone numbers in a text.

Figure 6
Enumeration Modifiers
*       Match 0 or more times.
+       Match 1 or more times.
?       Match 0 or 1 times.
{m}     Match exactly m times.
{m,n}   Match between m and n times.
{m,}    Match at least m times.
{0,n}   Match between 0 and n times. (Older PCRE releases treat the shorthand {,n} as literal text.)
To handle this, PCRE supplies enumeration modifiers.
The most basic description of an email address is a
number of non-whitespace characters, followed by an
'@', followed by more non-whitespace characters. \S is
the character class for all non-whitespace characters, so
using that we can write this simplistic email-matching
pattern as:
/\S+@\S+/
+ is a PCRE enumerator that instructs the regex
engine to match one or more instances of the character or character class it applies to. PCRE supports a
number of enumeration methods for specifying that a
character or character class should be matched multiple times, as you can see in Figure 6.
The + and * modifiers are both greedy. This means
they will always match as long a sub-pattern as possible. This is not always the way you want your patterns
to behave, but I will leave the details of when we might
want a greedy or non-greedy match to a later article.
Enumeration modifiers can be applied not only to
characters and character classes, but to sub-patterns as
well. This allows for some pretty complex pattern generation, which is, after all, one of the best features of
regular expressions (at least when you can understand
what they do).
For example, we can use enumeration modifiers to
significantly improve our email-address pattern.
In a simplified reading of RFC 2822, which defines the "official"
valid email address syntax, an email address is composed of a localpart, an '@' and a domain. The localpart
is one or more characters from the set
[\w!#$%"*+\/=?`{}|~^-], while a domain is a dot-separated list of parts, each composed of characters from [\w-]. The pattern for the
local part looks much like \S+, with the explicit set in place of \S:
/[\w!#$%"*+\/=?`{}|~^-]+/
The pattern for domains is more complex. First, we
need to identify elements in the string. These are given
by
/[\w-]+/
If we only have two such elements, the domain pattern would look like this:
and not /, since our pattern contains slashes and we
would rather not have to escape them. A more elegant
approach is to combine them using an alternation, as
follows:
#(https?|ftp)://\S+#
The alternation operator | means that the sub-pattern #(https?|ftp)# matches either #https?# ('http'
with an optional 's') or #ftp#. To use this to automatically create anchor tags for all linked content, we can
use a replacement like this:
preg_replace('#((https?|ftp)://\S+)#',
'<a href="\1">\1</a>', $text);
Running this over a sample text, we notice that any
preexisting anchor tags will become munged. For
example:
/[\w-]+\.[\w-]+/
/([\w-]+\.)+[\w-]+/
Creating a sub-pattern simply involves placing it
inside parentheses. Combining the local and domain
patterns together, we arrive at a decent regular expression for matching valid email addresses:
/[\w!#$%"*+\/=?`{}|~^-]+@([\w-]+\.)+[\w-]+/
We can use this regular expression to perform the
anti-spam rewriting we illustrated at the beginning of
the article.
function obscure_emails($text) {
    $regex = '/([\w!#$%"*+\/=?`{}|~^-]+)@(([\w-]+\.)+[\w-]+)/';
    // preg_replace returns the new string; it does not modify $text in place.
    return preg_replace($regex, '\\1 [at] nospam.\\2', $text);
}
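As a quick check of how the sub-patterns are numbered here, the following sketch runs the same kind of replacement over a sample sentence. The function name obscure_emails_sketch and the sample address are made up for this illustration.

```php
<?php
// Illustrative variant of the anti-spam rewrite; the name is hypothetical.
function obscure_emails_sketch($text)
{
    // \1 = local part, \2 = whole domain, \3 = last repeated "label." group
    $regex = '/([\w!#$%"*+\/=?`{}|~^-]+)@(([\w-]+\.)+[\w-]+)/';
    return preg_replace($regex, '\1 [at] nospam.\2', $text);
}

$rewritten = obscure_emails_sketch('Mail george@example.com today');
print $rewritten . "\n";
```

Because sub-patterns are numbered outside-to-inside, the whole domain is \2 even though it contains the repeated inner group \3.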
Alternation
The last of the basic regular expression syntactical elements is alternation. Where character classes let us
match a single character against a set of allowed characters, alternations allow for matching a string against
multiple sub-patterns. For example, we might want to
identify all HTTP and FTP addresses in a document for
auto-linking or indexing purposes. We could do this
with two regular expressions:
#https?://\S+#
#ftp://\S+#
Becomes
Come visit us at <a href="<a href=
"http://www.phpa.com">phpa.com</a>
.">http://www.phpa.com">phpa.com</a>.</a>
Solving this in a completely robust manner involves
using look-behind assertions, which will be covered in
a future article, but we can do a decent job by noting
that the href value must be enclosed in quotes. Thus, if
we require the URL to not be preceded by a quote, we
should catch most cases. The revised regular expression
is:
preg_replace('#([^\'"])((https?|ftp)://\S*[^\s[:punct:]])#',
    '\1<a href="\2">\2</a>', $text);
Note here that we need to capture the non-quote character ([^\'"]) we
match before the URL and return it in the substitution to avoid losing it, and that we
have to escape the single quote, since the entire
pattern is part of a single-quoted string.
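Here is a small sketch of the quote-guard idea in action. It uses one possible variant of the pattern, in which the URL must end in a character that is neither whitespace nor punctuation; that ending rule, the function name autolink_sketch and the sample text are assumptions made for this illustration.

```php
<?php
// Hypothetical helper: links bare URLs but skips ones already inside href="...".
function autolink_sketch($text)
{
    // ([^'"])                 the character before the URL must not be a quote
    // \S*[^\s[:punct:]]       the URL ends on a non-punctuation character
    $regex = '#([^\'"])((https?|ftp)://\S*[^\s[:punct:]])#';
    return preg_replace($regex, '\1<a href="\2">\2</a>', $text);
}

$in  = 'Visit http://www.example.com/docs. Or <a href="http://www.example.com">here</a>.';
$out = autolink_sketch($in);
print $out . "\n";
```

The bare URL is wrapped in an anchor and the sentence-ending full stop stays outside the link, while the existing anchor is left alone because its URL is preceded by a quote. A URL at the very start of the text is not matched, since the pattern insists on a preceding character.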
Positional Anchors
In the example of matching valid US phone numbers,
the regular expression we had was good for spotting
phone numbers in a block of text, but not for validating that a block of text is a phone number. To do that,
we need to ensure that the phone number is the only
element in the search text, with no leading or trailing
components. Anchors help solve this problem. To mandate that our phone number match starts at the beginning of the search text and ends at the end of it, we can
modify our regex as follows:
/^([2-9]\d{2})-([2-9]\d{2})-(\d{4})$/
but this will require the document to be completely
scanned twice. Note that we are using # as a delimiter
Come visit us at <a
href="http://www.phpa.com">phpa.com</a>.
Note that since '.' is a special regex character (the
wild-card character class), we must escape it to have it
match just the '.' character. Since we can have an arbitrary number of dot-separated segments, we will encapsulate the first part of the pattern in a sub-pattern and
use the '+' enumerator to specify that it must occur one
or more times:
The leading ^ anchors the match at the beginning
of the text, meaning that the match will only succeed
function validate_us_phone($phone)
{
    $regex =
        '/^([2-9]\d{2})[.\s-]?([2-9]\d{2})[.\s-]?(\d{4})$/';
if(preg_match($regex, $phone, $matches)) {
return array( 'area_code' => $matches[1],
'exchange' => $matches[2],
'line_number' => $matches[3]);
}
return false;
}
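The difference between finding and validating is easy to see side by side. The sketch below compares an unanchored pattern, an anchored one, and an anchored one with the D modifier; the variable names and sample inputs are chosen for this illustration.

```php
<?php
$find     = '/([2-9]\d{2})-([2-9]\d{2})-(\d{4})/';    // a number anywhere in the text
$validate = '/^([2-9]\d{2})-([2-9]\d{2})-(\d{4})$/';  // the text must be a number
$strict   = '/^([2-9]\d{2})-([2-9]\d{2})-(\d{4})$/D'; // and no trailing newline either

$inputs = array('877-555-1212', 'call 877-555-1212 now', "877-555-1212\n");
foreach ($inputs as $input) {
    // json_encode is used only to make the trailing newline visible.
    printf("%-28s find=%d validate=%d strict=%d\n",
        json_encode($input),
        preg_match($find, $input),
        preg_match($validate, $input),
        preg_match($strict, $input));
}
```

Note that a plain $ still accepts input ending in a single newline; when validating user input, adding D closes that gap.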
Don't confuse the anchor operator ^ with the negated character class operator [^]. Because an anchor is
not a character class (in fact it's a special zero-length
look behind assertion, but that's a topic for a later article), it has no meaning inside a character class.
Anchors are also useful for extracting information
near the beginning or end of a string. For example, a
line from an Apache Common Log Format logfile looks
like the following:
10.80.117.254 - - [13/Feb/2004:14:53:01 -0500]
"GET /~george/blog/ HTTP/1.1" 200 43489
This says that on February 13, 2004 a request for
"/~george/blog/" was made from the IP address
10.80.117.254. This request was successful (it returned
a 200 OK response code), and the amount of
data returned was 43489 bytes. Writing a full parser for
this log line is not too difficult (we will do so in the
cookbook section at the end of the article), but many
queries do not require parsing the entire log. For
Listing 2
<?php
$logfile = isset($_SERVER['argv'][1]) ? $_SERVER['argv'][1] : '';
if(!$logfile) {
    print "Please specify a logfile to parse\n";
    exit;
}
if(($fp = fopen($logfile, "r")) === false) {
    print "Error opening $logfile\n";
    exit;
}
$frequency = array();
// The response code is the next-to-last field on each line.
$regex = '/(\d+) \d+$/';
while(($line = fgets($fp)) !== false) {
    if(preg_match($regex, $line, $matches)) {
        // Start each new code at zero before counting it.
        if(!isset($frequency[$matches[1]])) {
            $frequency[$matches[1]] = 0;
        }
        $frequency[$matches[1]]++;
    }
}
print "Code\tOccurrences\n";
foreach ($frequency as $code => $occurrences) {
    print "$code\t$occurrences\n";
}
?>
instance, if we want to count the number of occurrences of each response code, the expression to use is
quite simple. Looking at the log format, we see that the
last two fields are numbers, and we want the next to
last one. Expressed as a regex, that pattern looks like
this:
/(\d+) \d+$/
Working backwards, this says we first match the end
of the line ($), then a number (which we don't bother
to capture), then a number which we do want to capture (the response
code). We can wrap this into a quick script
to determine the frequency of various
responses as shown in Listing 2. When we
don't need to parse an entire text string, especially if its format is
complex, anchors can make our life much
easier.
if it begins there. The trailing $ anchors the match at
the end of the text, meaning that the match will only
succeed if the pattern terminates on the final character of the text to be matched against.
Here we use a slightly modified version of the
anchored pattern to make a function useful for validating user-inputted data. If the phone number is valid, it
will return an array of its components. If not, it will
return false. The regex has been made a bit more
robust by allowing the delimiter (previously -) to be
replaced by an optional . or whitespace.
Global Pattern Modifiers
The final regular expression syntactical elements we are
going to discuss in this article are global pattern modifiers. As their name implies, global pattern modifiers
change the overall behavior of the pattern. By far the
most common of these is the case insensitivity modifier, i. Global modifiers are implemented in the Perl
style, directly following the pattern they apply to. Here
is a function which uses a regex to extract all addresses
under a specified domain from a subject text, regardless of the casing of the domain (domains are case
insensitive).
function extract_addresses($domain, $text)
{
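    // Pass the delimiter so any "/" in $domain is escaped as well.
    $domain = preg_quote($domain, '/');
    // Concatenate: variables are not interpolated inside single quotes.
    $regex = '/([\w!#$%"*+\/=?`{}|~^-]+)@' . $domain . '/i';
    if(preg_match_all($regex, $text, $matches, PREG_PATTERN_ORDER)) {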
return $matches[1];
}
return false;
}
Notice here that, in addition to using the i modifier,
we also use preg_quote to sanitize $domain. Data that
can potentially come from an untrusted source (such as
a user) should always be quoted to prevent the accidental or malicious inclusion of regex characters. Also,
we use the PREG_PATTERN_ORDER flag so that all the subpattern \1 matches are stored in $matches[1] .
Otherwise we would need to iterate over $matches and
manually build the result set.
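To see why quoting matters, consider a domain such as example.com, where the unquoted dot would match any character. The sketch below is a self-contained variant for illustration; the function name, the trailing \b and the sample text are assumptions, not part of the function above.

```php
<?php
// Hypothetical variant of the extraction function for this illustration.
function extract_addresses_sketch($domain, $text)
{
    // Quote the domain, telling preg_quote which delimiter is in use.
    $quoted = preg_quote($domain, '/');
    // \b (an addition in this sketch) stops example.com matching example.community.
    $regex  = '/([\w!#$%"*+\/=?`{}|~^-]+)@' . $quoted . '\b/i';
    if (preg_match_all($regex, $text, $matches, PREG_PATTERN_ORDER)) {
        return $matches[1];
    }
    return false;
}

$text  = 'To: bob@Example.COM, eve@exampleXcom, amy@example.com';
$users = extract_addresses_sketch('example.com', $text);
print implode(', ', $users) . "\n";

// Without quoting, the dot is a wildcard and the look-alike domain matches.
$unquotedHit = preg_match('/@example.com/i', 'eve@exampleXcom');
```

With quoting, only the two genuine addresses are returned, regardless of the casing of the domain.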
The other possible pattern modifiers are as follows:
• m (treat as multiline). By default, PCRE assumes that we intend our search text to be processed as one big string, and ^ and $ will match only the beginning and ending of the search text, respectively. When the m modifier is used, ^ and $ will match at the beginning and ending of every line in the search text (the search text is considered to be broken into lines by any newline characters).
• s (treat as single line for wildcards) By default the wildcard character (.) will not match a newline. If . should match newlines as well, add the s modifier to the pattern.
• x (extended legibility) By default, any whitespace in a pattern is considered part of the pattern. With the x modifier, unescaped whitespace outside character classes is ignored and # starts a comment that runs to the end of the line. Allowing whitespace in a pattern can be helpful for readability and inline comments. Compare the following two regular expressions:
/([2-9]\d{2})[.\s-]?([2-9]\d{2})[.\s-]?(\d{4})/
and
/([2-9]\d{2}) # Match the area code (200-999) as subpattern 1
[.\s-]?       # An optional delimiter - dot, dash or ws
([2-9]\d{2})  # Match the exchange as subpattern 2
[.\s-]?       # An optional delimiter - dot, dash or ws
(\d{4})       # Match the line number as subpattern 3
/x
More information on creating readable patterns will be covered in a future article.
• A (Start anchored) This modifier is equivalent to putting a ^ at the start of our pattern: it anchors the pattern at the start of the search text. Thus the following two regular expressions are equivalent:
/^Subject: (.*)/
/Subject: (.*)/A
There are no benefits of using this method over manually anchoring a pattern with ^ (other than, perhaps, moving the anchor character from the beginning of your pattern to its end).
• D (Dollar end-only) If this modifier is set, the dollar end-anchor $ will match only at the end of the string. By default, $ will also match before the final character if that character is a newline. This is ignored if the m modifier is also used.
• S (Study) If we are going to execute a pattern a number of times, we can use this flag to instruct PCRE to take extra time 'studying' the pattern to improve its efficiency.
• U (Ungreedy) By default, all matches in PCRE are greedy; that is, a pattern will attempt to match the longest possible piece of the search text. The U modifier reverses this behavior, asking PCRE to find the shortest possible match for the pattern. More on greedy versus non-greedy matching will be covered in a future article.
• u (UTF-8) This modifier instructs PCRE to treat patterns and search texts as UTF-8 characters instead of just single-byte characters. UTF-8 support is still new and should be used with some caution as it may be incomplete.
• e (Evaluated replacements). This causes the replacement string in a preg_replace call to be evaluated as PHP. Back-references are expanded and the resulting expression is executed via eval. The result of the evaluation is used as the final replacement text. Let's try an example of how to use this writing Wiki-style links to documents. In Wikis, putting so-called CamelCaps text in a document will link it to the wiki page of that name. Doing this blindly with a regex can be achieved with the following replacement:
$text = preg_replace('/\b(([A-Z]\w+){2,})\b/',
    '<a href="/wiki/\1.html">\1</a>', $text);
This might result in a number of non-existent documents being linked to, though. If
Listing 3
function is_wiki_page($token)
{
    $page = $_SERVER['DOCUMENT_ROOT'] . "/wiki/$token.php";
    return file_exists($page);
}
// Note: the e modifier is only available in PHP versions before 7.0.
// The back-reference is quoted so that it is evaluated as a string.
$text = preg_replace('/\b(([A-Z]\w+){2,})\b/e',
    'is_wiki_page("\1")?"<a href=\"/wiki/\1\">\1</a>":"\1"',
$text);
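Since the e modifier is no longer available as of PHP 7, the same conditional replacement can be sketched with preg_replace_callback, which passes each match to a function instead of evaluating a string. In this sketch the function name link_wiki_words and the $existingPages array (a stand-in for the file_exists check) are assumptions made so the example is self-contained.

```php
<?php
// $existingPages is a hypothetical stand-in for the file_exists() lookup.
function link_wiki_words($text, array $existingPages)
{
    return preg_replace_callback(
        '/\b(([A-Z]\w+){2,})\b/',
        function ($m) use ($existingPages) {
            $name = $m[1]; // the whole CamelCaps word
            if (in_array($name, $existingPages, true)) {
                return '<a href="/wiki/' . $name . '">' . $name . '</a>';
            }
            return $name; // no such page: leave the word as it was
        },
        $text
    );
}

$linked = link_wiki_words('See FrontPage and MissingPage.', array('FrontPage'));
print $linked . "\n";
```

A callback also avoids building PHP source from matched text, which is one reason it is generally the safer choice.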
Unless specifically contraindicated (such as D and m),
pattern global modifiers can be freely combined.
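A short sketch shows several of these modifiers, alone and combined. The sample text and variable names are invented for this illustration.

```php
<?php
$text = "Subject: hello\nsubject: world\n";

// i only: ^ and $ refer to the whole text, so nothing matches.
$single = preg_match_all('/^subject: (.*)$/i', $text, $a);

// i and m combined: ^ and $ now match at every line.
$multi = preg_match_all('/^subject: (.*)$/im', $text, $b);

// s: the wildcard also crosses newlines, so (.*) runs to the end of the text.
preg_match('/^Subject: (.*)/s', $text, $c);

// x: whitespace and comments inside the pattern are ignored.
$phone = '/^ ([2-9]\d{2}) [.\s-]?   # area code, optional delimiter
             ([2-9]\d{2}) [.\s-]?   # exchange, optional delimiter
             (\d{4}) $              # line number
          /x';
preg_match($phone, '877.555 1212', $d);

print "without m: $single, with m: $multi\n";
```

Modifiers are simply written one after another following the closing delimiter, in any order.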
A Simple Regex Cookbook
As with most tools, the way to really learn regexes is to
use them in practical situations. To help you get on
your way, here is a short selection of recipes for making
the most out of your regular expressions.
Apache Log Processing
Being able to extract information from webserver logfiles is essential to both good housekeeping (knowing
what links are broken and the disposition of our traffic)
and forensics (determining where traffic is coming from
and what actions users are taking). The first step to this
is being able to parse our logs into an easily accessible
data structure. Apache common log format is defined
as the following:
"%h %l %u %t \"%r\" %>s %b"
Where the individual fields are:
• %h: The IP address (or hostname if DNS
lookups are enabled) of the requestor.
• %l: The remote logname, as supplied by
identd.
• %u: The remote user supplied to HTTP Basic
Authentication (same as
$_SERVER['PHP_AUTH_USER'])
• %t: The time in common log format
(%d/%b/%Y:%H:%M:%S %z in strftime format
terms).
• \"%r\": The full request line in quotes, such as "GET
/index.php HTTP/1.0"
• %>s: The three digit response code of the
final request served (Apache has a notion of
internal redirects; this is the response code
of the page actually returned to the user).
• %b: The number of bytes returned in the
response.
A function to parse a single line and return an array
with its contents is given in Listing 4. Even though we
didn't really explore it in much detail, the benefit of
using extended legibility regexes should be obvious
here—with 17 sub-patterns being captured, it would
be extremely difficult to guess the correct offsets at a
glance. Now that we have a parser, its applications are
nearly limitless. For example, Listing 5 shows a little
script I like to leave running in a window on my desktop; I tail my Apache log into it and it reports the number of hits I get per second in real-time. Running it as
tail -f /apache/logs/mysite/access | freq.php
Gives a running tally of hits per second (note that
this will only run under a UNIX-like environment and
that you'll need to make freq.php executable). This
data could just as easily be written to an MRTG database for graphing, or something even cleverer.
Because we have access to the fully parsed log line, we
we want the rewriting to only happen if the
destination document exists, we can perform the conditional replacement with an
evaluated replacement as shown in Listing 3.
Now, when a CamelCaps word is encountered, the regex checks is_wiki_page to see
if it should be linked. If so, the text is
replaced with a link; otherwise, it is left as-is
(or, rather, it is replaced with itself).
Evaluated replacements and their companion function preg_replace_callback will be
covered in depth in a future article.

Listing 4
function parse_clf_line($line)
{
    static $regex = '/^
        (\S+)               # the host or ip ($m[1])
        [ ]                 # a space
        (\S+)               # remote logname ($m[2])
        [ ]
        (\S+)               # auth user ($m[3])
        [ ]
        \[(                 # begin date match ($m[4])
            (\d{2})\/       # the day ($m[5])
            (\w{3})\/       # the month ($m[6])
            (\d{4}):        # the year ($m[7])
            (\d{2}):        # the hour ($m[8])
            (\d{2}):        # the minute ($m[9])
            (\d{2})\s+      # the second ($m[10])
            ([+-]\d{4})     # UTC offset ($m[11])
        )\]                 # end date match
        [ ]
        "(                  # begin request match ($m[12])
            (GET|HEAD|POST) # the HTTP method ($m[13])
            \s+
            (\S+)           # The requested URL ($m[14])
            \s+
            (HTTP\/\d\.\d)  # the protocol ($m[15])
        )"                  # end request match
        [ ]
        (\d{3})             # status ($m[16])
        [ ]
        (\d+|-)             # bytes, or - if none were sent ($m[17])
    $/xi';
    if(preg_match($regex, $line, $m)) {
        return $m;
    }
    return false;
}
Listing 5
#!/usr/local/bin/php
<?php
// log_freq.php
include_once("LogParser.inc");
$last_time = '';
$count = 0;
while(($line = fgets(STDIN)) !== false) {
    if($data = parse_clf_line($line)) {
        $this_time = $data[4];
        if($last_time && $last_time != $this_time) {
            print "$last_time: $count\n";
            $count = 0;
        }
        $last_time = $this_time;
        $count++;
    }
}
?>
$this_time = $data[4];
to
$this_time = "$data[5]/$data[6]/$data[7] $data[8]";
Similarly, we could count bytes instead of pages by
accumulating $data[17] (bytes transferred) in
$count.
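As a rough sketch of both ideas together, the following tallies hits and bytes per hour over a few sample log lines. To stay self-contained it uses a compact pattern written for this sketch rather than the full parser, and the second and third log lines are invented sample data.

```php
<?php
// Compact pattern for this sketch only: captures the date, the hour and the byte count.
$regex = '/\[(\d{2}\/\w{3}\/\d{4}):(\d{2}):\d{2}:\d{2} [+-]\d{4}\] ".*" \d{3} (\d+|-)$/';

$lines = array(
    '10.80.117.254 - - [13/Feb/2004:14:53:01 -0500] "GET /~george/blog/ HTTP/1.1" 200 43489',
    '10.80.117.254 - - [13/Feb/2004:14:59:30 -0500] "GET /favicon.ico HTTP/1.1" 404 -',
    '10.80.117.9 - - [13/Feb/2004:15:00:02 -0500] "GET / HTTP/1.1" 200 1200',
);

$hits  = array();
$bytes = array();
foreach ($lines as $line) {
    if (!preg_match($regex, $line, $m)) {
        continue; // skip lines that are not in common log format
    }
    $hour = "$m[1] $m[2]"; // e.g. "13/Feb/2004 14"
    if (!isset($hits[$hour])) {
        $hits[$hour]  = 0;
        $bytes[$hour] = 0;
    }
    $hits[$hour]++;
    // A dash means no body was sent, so it counts as zero bytes.
    $bytes[$hour] += ($m[3] === '-') ? 0 : (int) $m[3];
}

foreach ($hits as $hour => $n) {
    print "$hour: $n hits, {$bytes[$hour]} bytes\n";
}
```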
Single Pass Template Substitution
In its simplest form, a templating system runs through
a 'template' and replaces certain tokens with dynamic
values. One of the things that makes many templating
systems slow is that they must perform multiple passes
through a document, one for each token to be
replaced. If we standardize our token naming convention, we can actually perform the replacement in a single pass.
First, we require that all template tokens be of the form
{NAME} where NAME is a key in an associative array that
contains our substitutions. With this in place, we can
match all tokens in a single pass with the following
regex:
shows one possible way to do so. This function looks for
various DHTML and CSS directives that can be used for
cross-site scripting attacks, and if any are found it performs a very draconian stripping of all but the basic formatting tags.
Conclusion
We have now come to the end of our journey through
the basics of regular expressions. With these tools in
your hands, you should be able to tackle almost any
text matching challenge. Hopefully, you have lost any
fears you might have had concerning regular expressions. Once past the terseness of their syntax, regexes
can be a powerful and versatile addition to our programming toolkit.
At the same time, we have really only touched the tip
of the regex iceberg. In addition to the things we have
seen so far, the PCRE extension supports a number of
fine-grain features that allow for incredibly complex
matches. These advanced features will be covered in a
future set of articles.
could easily convert this to display hits per hour by
changing
Listing 6
/{(\w+)}/
Next we will use an evaluated replacement to substitute the appropriate value from the passed associative
array. Here is the full function:
function expand_text($text, $data)
{
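    // The key is quoted so it is evaluated as a string; the e modifier
    // is only available in PHP versions before 7.0.
    return preg_replace('/{(\w+)}/e', '$data["\1"]',
        $text);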
}
A simple demonstration of this function in action is
the following:
$template = <<<EOD
Hello {NAME},
Your friend {FRIEND} has sent you an e-card.
Click <a href="{LINK}">here</a> to pick it up.
EOD;
$data = array(
'NAME' => 'George',
'FRIEND' => 'Bob',
'LINK' =>
'http://www.example.com/ecard.html?id=12345'
);
print expand_text($template, $data);
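On PHP versions where the e modifier is unavailable (7.0 and later), the same single-pass substitution can be sketched with preg_replace_callback. The name expand_text_cb and the choice to leave unknown tokens untouched are assumptions of this sketch.

```php
<?php
// Hypothetical callback-based variant of the template expander.
function expand_text_cb($text, array $data)
{
    // The braces are escaped here to make it explicit that they are literals.
    return preg_replace_callback('/\{(\w+)\}/', function ($m) use ($data) {
        // Unknown tokens are left as they are rather than replaced with nothing.
        return array_key_exists($m[1], $data) ? $data[$m[1]] : $m[0];
    }, $text);
}

$expanded = expand_text_cb(
    'Hello {NAME}, your friend {FRIEND} sent {UNKNOWN}.',
    array('NAME' => 'George', 'FRIEND' => 'Bob')
);
print $expanded . "\n";
```

The document is still scanned only once, however many tokens the array holds.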
Preventing Cross-Site Scripting
Attacks
Javascript is one of the banes of my existence. Don't get
me wrong—it is a powerful and useful language, but its
tight integration with HTML makes it a fertile playground for malicious users to launch cross-site scripting
attacks. If we must allow HTML in user input, we will
want to at least remove any Javascript from it. Listing 6
// Listing 6: strip_dhtml()
function strip_dhtml($html)
{
    $ok_tags = '<br><b><h1><h2><h3><h4><i><li><ol><p><strong><table>' .
               '<tr><td><th><u><ul>';
    $js_event_list = array('load', 'unload', 'click', 'dblclick',
                           'mousedown', 'mouseup', 'mouseover',
                           'mousemove', 'mouseout', 'focus', 'blur',
                           'keypress', 'keydown', 'keyup', 'submit',
                           'reset', 'select', 'change');
    $js_events = implode('|', $js_event_list);
    $regexp = array();
    $regexp[] = "/on($js_events)\s*=/i";
    $regexp[] = "/(java|vb)scri?pt/i";
    $regexp[] = "/@\s*import/i";
    foreach($regexp as $re) {
        if(preg_match($re, $html)) {
            return strip_tags($html, $ok_tags);
        }
    }
    return $html;
}

About the Author
George Schlossnagle is a Principal at OmniTI Computer Consulting, a
Maryland-based tech company specializing in high-volume web and
email systems. Before joining OmniTI, George led technical operations
at several high-profile community web sites where he developed experience managing PHP in very large enterprise environments. George is a
frequent contributor to the PHP community. His work can be found in
the PHP core, as well as in the PEAR and PECL extension repositories.
Before entering into information technology, George trained to be a
mathematician and served a 2 year stint as a teacher in the Peace Corps.
His experience has taught him to value an inter-disciplinary approach to
problem solving that favors root-cause analysis of problems over simply
addressing symptoms.
To Discuss this article:
http://forums.phparch.com/131
Automated Testing For PHP Applications
PHP enables Web developers to create complex Web applications—nothing new there. The techniques for writing
automated tests for PHP Web applications, however, are
not well known. In this article, James McCaffrey shows you
a simple but representative PHP application and then
walks you through the creation of a powerful automated
test program written entirely in PHP. The code is explained
in detail so you can use it as is, or modify and extend the
technique to meet your own needs.
In this article, I will show you how to write powerful
automated tests in PHP for your Web applications.
PHP is remarkably well-suited for writing software test
automation and the system I present is surprisingly
short. Web applications built with PHP are becoming
more and more common in the enterprise arena and,
as a result, they are becoming increasingly complex. As
PHP matures, the ability to write test automation
becomes more valuable, but in conversations with my
colleagues I discovered that the techniques required for
automated testing of PHP Web applications are not well
known. In this article, I will show you how to quickly
write effective test automation that verifies your PHP
Web applications' correctness.
The best way to show you what we will accomplish is
with two screenshots. Figure 1 shows a dummy PHP
Web application that accepts a last name for an
employee and then searches a MySQL database and
displays the employee's ID, first name, last name, and
e-mail address. In this example searching for "Baker"
correctly returns a single employee whose ID is 002,
first name is Bob, and e-mail is [email protected].
Manually testing even this minimal Web application
would be extremely tedious, time consuming, and
error prone. Instead, we can test the application by
programmatically sending input to the PHP script on
the Web server, then capture the response stream,
examine the response for a correct target, and log a
pass or fail result. Figure 2 shows a PHP shell program
that does just that. Test cases 0002 and 0003 correspond to the manual test shown in Figure 1.
by Dr. James McCaffrey

You might have noticed that my examples use a
Windows/IIS system rather than the more usual
Linux/Apache setup. Most client companies that I work
with are large and have a mixed technology environment. Because many of these companies are experimenting with PHP and MySQL on a Windows/IIS base,
I decided to use that base for this article.
In the sections that follow I will walk you through the
underlying PHP Web application so that you will understand what we are testing, briefly examine the underlying MySQL database so that you understand its relationship to the test automation, and carefully go over
the PHP test automation program so that you can modify the source code to meet your own particular needs.
I will conclude with a discussion of some of the ways
you can extend this technique and use it in a production environment. After reading this article you will
have the ability to write PHP test automation—a hopefully valuable addition to your skill set.
The PHP Web Application
The most common use of PHP among the companies I
work with is to create dynamic Web pages that have an
interface to a MySQL database. I created a reduced
dummy Web application that contains the essential elements of most real-life applications I deal with. I started
by making a small database named dbCompany, which
contains a table named tblEmployees that has four
columns: empid (employee ID), lastname, firstname,
and email. I populated the table with the four rows of
data you can see in Figure 3.

REQUIREMENTS
PHP: 4.3.4
OS: Tested on Red Hat Linux 7 and Windows Server 2003
Other Software: N/A
Code Directory: auto-test

Next, I created a simple PHP Web application that
searches the database. The code shown in Listing 1
generates the Web page shown in Figure 1.
Both the database and the PHP application are simplistic, but together they have all the elements needed
to demonstrate test automation. Before I show you the
test automation program, let's imagine what it would
be like to manually test the application. (In fact, asking
how to test a dummy Web application like this is often
used as an interview question for dedicated software
test engineers.)
There are thousands of inputs you would have to
enter into the page and then visually determine if the
response was correct or not. Then, suppose you
changed the logic or the database structure—you'd
have to start all over. As you can imagine, this would
not be fun, or particularly efficient.
To automate the testing of the dummy PHP Web
application, we must programmatically send input to
the PHP script (via HTTP), then capture the HTTP
response stream, examine the response for strings that
tell us if the response is correct or not, and log results.
The PHP shell script shown in Listing 2 does exactly that
and generated the output shown in Figure 2.
I structured the test automation as two functions. The
main() function reads test case data from a text file,
sends an input value to the PHP Web application, and
[Figure 1: the dummy PHP Web application showing the search result for "Baker"]
[Figure 2: output of the PHP test automation program]
0001:Anderson:Adam:
0002:Baker:Bob:
0003:Baker:[email protected]:
0004:Chung:Kathy:deliberate fail
0005:De La Paz:Doug:
Each line of data represents a single test case. A 4-digit test case ID is followed by an input value, then an
expected result, and an optional comment. So, in test
case 0002, if we submit "Baker", we should see "Bob" in
the response.
The main() function starts by assigning values to variables for the IP address of the Web server, the port on
which the server listens, the path to the PHP application, and the method used to send user data:
$ipAddress = '127.0.0.1';
$port = '80';
$page = '/PHP/simple.php';
$method = 'POST';
Because this is test automation, you will know the IP
address of the Web server that has your PHP application, and it will usually be 127.0.0.1 (localhost), unless
you test on a server that is not installed on your local
machine. Port 80 is the default HTTP port, but it may
be different in a test environment. The two main methods of sending information to a Web server are POST
and GET. Recall that our dummy Web application sends
data using POST:
<form name="theForm" action="simple.php"
method="POST">
I will discuss using GET requests later. Next, main()
prints some minimal header information to the shell
and then opens the test case file for reading. The test
automation reads the test case file line by line:
For each line, we parse the four colon-delimited fields
using the explode() function. Using colons to delimit
test case data is arbitrary—in general, you can use any
character but want to avoid characters that appear in
the actual test case data. We append the input value to
lastname= using the urlencode() function. It replaces
characters that might be misinterpreted by the Web
server with their escaped equivalents. For example, a '/'
character would be replaced by a %2F sequence.
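The parsing step can be sketched as a small helper. The function name parse_test_case and the returned array layout are assumptions made for this illustration; the field order follows the test case file format described in this article.

```php
<?php
// Hypothetical helper for this sketch; the name is not from the article.
function parse_test_case($line)
{
    // Limit to four fields and pad, so a missing comment does not raise a notice.
    $fields = array_pad(explode(':', rtrim($line, "\r\n"), 4), 4, '');
    list($caseid, $input, $expected, $comment) = $fields;
    return array(
        'caseid'   => $caseid,
        'expected' => $expected,
        'comment'  => $comment,
        // urlencode escapes characters the server might misinterpret.
        'postData' => 'lastname=' . urlencode($input),
    );
}

$case = parse_test_case("0005:De La Paz:Doug:\n");
print $case['postData'] . "\n";
```

Note that urlencode turns the spaces in "De La Paz" into plus signs, which is the form a browser would submit.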
After we have a test case ID, an input last name to
send and an expected value to look for, the
resHasTarget() function does all the work:
if (resHasTarget($ipAddress, $port, $method, $page,
$postData, $expected))
echo "$caseid Pass input = " . str_pad($input,
12) . "expected = $expected\n";
else
echo "$caseid FAIL input = " . str_pad($input,
12) . "expected = $expected\n";
The resHasTarget() function posts data to the PHP
Web application and checks if the expected value is in
the response stream. For test case 0001,
"lastname=Anderson" is posted to
127.0.0.1:80/PHP/simple.php and the response is
examined for the presence of the string "Adam". If
"Adam" is found, resHasTarget() returns TRUE and we
log a "pass" message; otherwise we log a "fail" message.
Let's now examine the resHasTarget() function that
does most of the actual work. We start by creating a
socket and then using it to connect to our Web server:
$socket = socket_create(AF_INET, SOCK_STREAM, 0)
or die("Socket failed\n");
$connect = socket_connect($socket, $ipAddress, $port)
or die("Connect failed\n");
The constants AF_INET and SOCK_STREAM mean that we want
to use IPv4 addressing (dotted-quad notation such as
127.0.0.1) and a full-duplex TCP connection. There
are two important alternatives to the socket_* family of
functions I chose to use. A lower level choice is the
fsockopen() family of functions. A higher
$line = fgets($fp, 4096);
list($caseid, $input, $expected, $comment) =
explode(":", $line);
$postData = 'lastname=' . urlencode($input);
examines the response for an expected value. The
main() function calls a resHasTarget() function which
returns TRUE if some input data contains a target string.
Here are the contents of the test case file used in this
example:
$reqBody = $postData;
$contentLength = strlen($reqBody);
The $postData input parameter assumes we have
data in a name-value sequence like:
user=chris&age=25&job=tester
for example. Next we construct the HTTP headers we
are going to send to the server:
$send = $method . " " . $page . " HTTP/1.1\r\n";
$send .= "Host: localhost\r\n";
$send .= "Accept: */*\r\n";
$send .= "User-Agent: test.php test automation\r\n";
$send .= "Content-Type: application/x-www-form-urlencoded\r\n";
$send .= "Content-Length: " . $contentLength .
"\r\n\r\n";
$send .= $reqBody;
$send .= "\r\n";
An HTTP request starts with a line that specifies the
method (e.g., POST, GET, HEAD), followed by the path
to the PHP application and the HTTP version. The next
header line must specify the host that the request is
being sent to. The next two header lines are optional.
The Accept header tells the server what types of
responses are acceptable (here we'll accept anything).
The User-Agent header is a courtesy so the Web server
knows who is making the request. The next two header lines are required for POST requests. Content-Type
tells the server what kind of data is coming. You can
think of application/x-www-form-urlencoded as a magic
string that means "data from an HTML form".
The Content-Length header is the size of the POST
data in bytes. Notice that we have to construct the POST data
before the headers so we can specify the size at this
point in the program. Also notice that the
Content-Length header is followed by two carriage return, linefeed (\r\n) pairs:
HTTP uses CRLF line endings on every platform, and the empty line
tells the server where the headers end. Finally we
append the POST data to the request.
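As a rough sketch, the request assembly described above can be wrapped in one function that returns the request as a string, which makes it easy to inspect before anything is sent. The name buildPostRequest() and its $host parameter are illustrative, not part of the article's listing. The Connection: close header is an added assumption: it asks the server to close the connection after responding, so a read loop reaches end-of-stream instead of waiting on a kept-alive connection.

```php
<?php
// Sketch: assemble a raw HTTP/1.1 POST request as a single string.
// buildPostRequest() and $host are illustrative names (assumptions).
function buildPostRequest($host, $page, $postData)
{
    $request  = "POST " . $page . " HTTP/1.1\r\n";
    $request .= "Host: " . $host . "\r\n";
    $request .= "Accept: */*\r\n";
    $request .= "User-Agent: test.php test automation\r\n";
    $request .= "Content-Type: application/x-www-form-urlencoded\r\n";
    $request .= "Content-Length: " . strlen($postData) . "\r\n";
    // Assumption: request a one-shot connection rather than keep-alive.
    $request .= "Connection: close\r\n";
    $request .= "\r\n";       // empty line marks the end of the headers
    $request .= $postData;    // body is exactly Content-Length bytes long
    return $request;
}

$request = buildPostRequest('localhost', '/PHP/simple.php',
                            'lastname=' . urlencode('Anderson'));
echo $request, "\n";
```

Keeping the body to exactly Content-Length bytes avoids sending stray characters after the form data that the server would have to discard.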
Now we are ready to send the HTTP request to the
server, then grab the response stream and examine it:
response 2048 bytes (an arbitrary size) at a time (as
opposed to line-by-line). We also use strpos() to see if
the target string is anywhere in the 2048 bytes, and if
it is we close the socket and return TRUE. If we examine the entire response and never find the target string
we return FALSE.
There is one trick to watch for here—it is possible that
a response stream block of bytes might end in the middle of the target, breaking it into two parts. If so, you
would not find the target string. In practice this is not
very likely and you can defend against this possibility by
increasing the number of bytes read per socket_read()
so that you capture the entire response stream.
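Another way to guard against a split target, sketched here under my own assumptions rather than taken from the article's listing, is to carry the tail of each block over to the next read. The function name targetInChunks() is hypothetical, and the array of chunks stands in for successive socket_read() results so the idea can be tried without a server.

```php
<?php
// Sketch: search successive response blocks for $target, carrying over the
// last strlen($target) - 1 bytes so a match split across two reads is found.
// targetInChunks() is a hypothetical name; $chunks simulates socket_read().
function targetInChunks($chunks, $target)
{
    $keep = strlen($target) - 1;   // longest partial match a block can end with
    if ($keep < 0) {
        return false;              // an empty target is treated as "not found"
    }
    $tail = '';
    foreach ($chunks as $chunk) {
        $window = $tail . $chunk;
        if (strpos($window, $target) !== false) {
            return true;
        }
        $tail = ($keep > 0) ? (string) substr($window, -$keep) : '';
    }
    return false;
}

var_dump(targetInChunks(array('<td>1 Ad', 'am Anderson</td>'), 'Adam')); // bool(true)
var_dump(targetInChunks(array('<td>2 Bob', ' Baker</td>'), 'Adam'));     // bool(false)
```

In the real harness the foreach would become the while loop around socket_read(), with the same $tail bookkeeping inside it.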
To summarize, the key to automated testing of PHP
Web applications is the ability to send raw HTTP data to
the Web server. PHP has a family of socket functions
that make it easy to do so. After reading information
from test case files containing input values and expected values, you send the input to the server then examine the response for the expected value.
level choice is to use classes in the PEAR library. I have
programmed sockets using all three methods and have
found that any preference is more a matter of personal
programming style than functionality. After we connect
to the Web server we determine the size of the data we
will be posting:
Using The GET Method
In the previous sections, we assumed that the PHP Web
application under test sends data to the server using
the POST method. What if the application uses GET?
Suppose you have a Web application where the user
submits a user ID and a password using GET. (By the
way, this is a bad idea because with GET the form data
is appended to the request URL). The following code
snippet shows how to send a request using GET:
// create socket
// connect
$send  = "GET /PHP/form2.php?";
$send .= "userID=" . urlencode("root");
$send .= "&password=" . urlencode("secret");
$send .= " HTTP/1.1\r\n";
$send .= "Host: localhost\r\n";
$send .= "\r\n";
socket_write($socket, $send, strlen($send));
// read response
The first line of the HTTP request header uses GET
and the data to send is appended to the URL as a query
string using the name=value format. Because the user
data is tied to the URL, it is especially important to use
the urlencode() function to handle troublesome characters.
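For forms with more than a couple of fields, the query string can be built from an array instead of by hand. The following is a minimal sketch: buildGetRequest() is an illustrative name, the field values are made up, and the Connection: close header is my own assumption so that the server closes the connection once it has responded.

```php
<?php
// Sketch: build a raw GET request with the form data in the query string.
// buildGetRequest() is an illustrative name, not part of the article's code.
function buildGetRequest($host, $page, $fields)
{
    $pairs = array();
    foreach ($fields as $name => $value) {
        // Encode both sides so '&', '=', spaces and slashes survive the trip.
        $pairs[] = urlencode($name) . '=' . urlencode($value);
    }
    $request  = "GET " . $page . "?" . implode('&', $pairs) . " HTTP/1.1\r\n";
    $request .= "Host: " . $host . "\r\n";
    $request .= "Connection: close\r\n";   // assumption: no keep-alive wanted
    $request .= "\r\n";                    // a GET request has no body
    return $request;
}

echo buildGetRequest('localhost', '/PHP/form2.php',
                     array('userID' => 'root', 'password' => 'p&ss word'));
// First line sent: GET /PHP/form2.php?userID=root&password=p%26ss+word HTTP/1.1
```

Without the encoding step, the '&' inside the password would be read by the server as the start of a third field.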
socket_write($socket, $send, strlen($send));
while ($receiveBuffer = socket_read($socket, 2048))
{
if (strpos($receiveBuffer, $target) !== FALSE)
{
socket_close($socket);
return TRUE;
}
}
The socket_write() function sends the request and
associates the response to the socket. We read the
Beyond the Basics
You can modify and extend the basic PHP application
test framework presented here in many ways. For clarity, I used a simple text file to store test cases, but you
should consider good alternatives, like XML or database
storage. Using XML to hold your test cases is particularly appropriate when the test cases have a complex
structure (for example, many optional parameters), or
are shared across groups. A database, on the other
Listing 1
<html>
<!-- simple.php -->
<head><title>PHP Test Automation</title></head>
<body>
<h3>Query Employees</h3>
<form name="theForm" action="simple.php" method="POST">
<p>Last name: <input type="text" name="lastname" /></p>
<p><input type="submit" value="Find Employee" /></p>
</form>

<?php
$conn = mysql_connect("localhost", "guest", "secret");
mysql_select_db("dbCompany");

if (isset($_POST['lastname']))
{
  $search = $_POST['lastname'];
  $query = "SELECT * FROM tblEmployees WHERE lastname = '" . $search . "'";

  $dataset = mysql_query($query);

  echo "<table>\n";
  while ($row = mysql_fetch_array($dataset, MYSQL_ASSOC))
  {
    echo "<tr>\n";
    echo "<td>" . $row['empid'] . " " . $row['firstname'];
    echo " " . $row['lastname'] . " " . $row['email'] . "</td>\n";
    echo "</tr>\n";
  }
  echo "</table>\n";
}
mysql_close($conn);
?>

</body>
</html>
testing language, I was pleased to find that they are as
good as any language I've worked with—and maybe
even better, in some cases.
In the introduction to this article, I noted that most of
the client companies I work with are currently investi-
Listing 2
<?php

// test.php

function resHasTarget($ipAddress, $port, $method, $page, $postData, $target)
{
  $socket = socket_create(AF_INET, SOCK_STREAM, 0)
    or die("Socket failed\n");

  $connect = socket_connect($socket, $ipAddress, $port)
    or die("Connect failed\n");

  $reqBody = $postData;
  $contentLength = strlen($reqBody);

  $send = $method . " " . $page . " HTTP/1.1\r\n";
  $send .= "Host: localhost\r\n";
  $send .= "Accept: */*\r\n";
  $send .= "User-Agent: test.php test automation\r\n";
  $send .= "Content-Type: application/x-www-form-urlencoded\r\n";
  $send .= "Content-Length: " . $contentLength . "\r\n\r\n";
  $send .= $reqBody;
  $send .= "\r\n";

  socket_write($socket, $send, strlen($send));

  while ($receiveBuffer = socket_read($socket, 2048))
  {
    // strpos() returns 0 for a match at the very start of the buffer,
    // so compare against FALSE instead of relying on truthiness
    if (strpos($receiveBuffer, $target) !== FALSE)
    {
      socket_close($socket);
      return TRUE;
    }
    // prints every block that did not contain the target
    echo $receiveBuffer;
  }

  socket_close($socket);
  return FALSE;
}

function main()
{
  $ipAddress = '127.0.0.1';
  $port = '80';
  $page = '/PHP/simple.php';
  $method = 'POST';

  echo "\nBegin test run\n\n";
  echo "caseid result\n";
  echo "===================================================\n\n";

  $fp = fopen("cases.txt", "r");
  while (!feof($fp))
  {
    $line = fgets($fp, 4096);
    list($caseid, $input, $expected, $comment) = explode(":", $line);
    $postData = 'lastname=' . urlencode($input);

    if (resHasTarget($ipAddress, $port, $method, $page, $postData, $expected))
      echo "$caseid Pass input = " . str_pad($input, 12) . "expected = $expected\n";
    else
      echo "$caseid FAIL input = " . str_pad($input, 12) . "expected = $expected\n";
  }
  fclose($fp);
  echo "\nDone\n";

  // one extra request with an empty target, as in the original listing
  $postData='lastname=Baker';
  $expected='';
  resHasTarget($ipAddress, $port, $method, $page, $postData, $expected);
}

main(); // run tests

?>
hand, can come in handy when you have a very large
number of test cases.
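To give a feel for the XML option, here is a small sketch that reads test cases with SimpleXML, which requires PHP 5. The element and attribute names (testcases, testcase, id, input, expected) are invented for illustration, and loadCases() is a hypothetical helper; any layout that suits your test data would work equally well.

```php
<?php
// Sketch: the two sample test cases in a hypothetical XML layout.
$xml = <<<XML
<testcases>
  <testcase id="0001"><input>Anderson</input><expected>Adam</expected></testcase>
  <testcase id="0002"><input>Baker</input><expected>Bob</expected></testcase>
</testcases>
XML;

// Hypothetical helper: turn the XML into the same fields the text file holds.
function loadCases($xml)
{
    $cases = array();
    foreach (simplexml_load_string($xml)->testcase as $node) {
        $cases[] = array(
            'caseid'   => (string) $node['id'],
            'input'    => (string) $node->input,
            'expected' => (string) $node->expected,
        );
    }
    return $cases;
}

foreach (loadCases($xml) as $case) {
    echo $case['caseid'], ' ', $case['input'], ' -> ', $case['expected'], "\n";
}
```

Optional parameters then become optional child elements, which is awkward to express in a fixed colon-delimited line.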
The technique in this article displays its output to a
command shell. In a production environment, you will
probably want to write test results to a text file or a SQL
database. Writing to a text file is most appropriate
when you are on a relatively short production cycle.
Writing results to a SQL database is useful when you are
in a long production cycle because you will be generating lots of data that can be shared and analyzed in
many different ways.
In a production environment, I always add additional
data to the results log. At a minimum, you will want to
add counters for the number of cases which pass and
which fail. I also like to add timing information for each
test case and the overall test run. Timing information
can uncover problems in the Web application code that
basic pass-fail data misses. And for reporting purposes,
you can timestamp the date of the test run.
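The counters and timing information mentioned above might be collected along these lines. This is a sketch only: runCases(), its $runOne callback and stubCheck() are hypothetical names, and in a real harness the callback would call resHasTarget() instead of the stub used here.

```php
<?php
// Sketch: time each test case and keep pass/fail counters for the whole run.
// runCases() and $runOne are hypothetical; $runOne returns TRUE on a pass.
function runCases($cases, $runOne)
{
    $summary = array(
        'pass'    => 0,
        'fail'    => 0,
        'seconds' => 0.0,
        'started' => date('Y-m-d H:i:s'),   // timestamp for the report
    );
    foreach ($cases as $case) {
        $start   = microtime(true);
        $passed  = call_user_func($runOne, $case);
        $elapsed = microtime(true) - $start;

        $summary[$passed ? 'pass' : 'fail']++;
        $summary['seconds'] += $elapsed;
        printf("%s %s %.4f s\n", $case['caseid'], $passed ? 'Pass' : 'FAIL', $elapsed);
    }
    return $summary;
}

// Stand-in for resHasTarget(): passes whenever the input is not empty.
function stubCheck($case)
{
    return $case['input'] !== '';
}

$summary = runCases(array(
    array('caseid' => '0001', 'input' => 'Anderson'),
    array('caseid' => '0002', 'input' => ''),
), 'stubCheck');
printf("passed %d, failed %d\n", $summary['pass'], $summary['fail']);
```

A case that suddenly takes ten times longer than it did on the previous run is worth a look even when it still passes.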
To be honest, when I first started using PHP I was very
surprised at how well it works as a language for software test automation. In general, it is best to write test
automation using the same language as that used by
the system under test—test a C++ application using
C++, test a Java application using Java. The idea is that
if you use different languages, you run into many cross-language issues which affect the validity of your test
automation. But often, using the same language is just
not possible. When I examined PHP's capabilities as a
gating mixed-technology environments. As recently as twelve
months ago, mixing Open Source and proprietary technologies
usually had uneven results, but the situation has changed
dramatically for the better. The machine on which I developed
the techniques used in this article happily supports MySQL and
SQL Server, C# and PHP, Apache and IIS, and dual boots into
Linux and Windows XP. This works in PHP's favor: developers can
install PHP over their existing technologies and gradually
migrate. In particular, I am seeing many in-house shops start
to move from ColdFusion to PHP as their programming platform of
choice for Web projects.
An interesting side effect of the test automation presented
in this article is that you can easily adapt the test code to
create a general purpose HTTP response viewer. By placing an
echo statement inside the while loop that examines the response:
while ($receiveBuffer = socket_read($socket, 2048))
{
echo $receiveBuffer;
}
and making a few other cosmetic changes you can
view the entire response stream, as you can see in
Figure 4. If you are new to programming with PHP at a
low level, this is a great way to learn what is really going
on with HTTP behind the scenes.
In principle, testing PHP Web applications is similar to
traditional API (Application Programming Interface) or
unit testing. But because PHP applications are client-server based, there are additional connectivity issues.
This means you will want to use error checking liberally.
As usual for instructional articles, I removed all error
checking from the code presented here. Based on my
experience, adding exception handling code (if you're
using PHP5) will double the size of your source code
but is well worth the effort.
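As one possible shape for that error checking, the sketch below wraps the socket setup in a function that throws instead of calling die(). The helper name connectOrThrow() is hypothetical; it assumes PHP 5 exceptions and the sockets extension, and it uses socket_last_error() with socket_strerror() to report why a call failed.

```php
<?php
// Sketch: checked socket setup. connectOrThrow() is a hypothetical helper
// that needs PHP 5 (exceptions) and the sockets extension.
function connectOrThrow($ipAddress, $port)
{
    if (!function_exists('socket_create')) {
        throw new Exception('sockets extension is not loaded');
    }
    $socket = @socket_create(AF_INET, SOCK_STREAM, SOL_TCP);
    if ($socket === false) {
        throw new Exception('socket_create failed: '
            . socket_strerror(socket_last_error()));
    }
    if (@socket_connect($socket, $ipAddress, (int) $port) === false) {
        $reason = socket_strerror(socket_last_error($socket));
        socket_close($socket);
        throw new Exception("connect to $ipAddress:$port failed: $reason");
    }
    return $socket;
}

try {
    $socket = connectOrThrow('127.0.0.1', 80);
    socket_close($socket);
    echo "connected\n";
} catch (Exception $e) {
    // A connectivity problem aborts the run rather than logging false FAILs.
    echo 'Test run aborted: ', $e->getMessage(), "\n";
}
```

Separating "the server could not be reached" from "the expected value was missing" keeps an outage from showing up as a wall of failed test cases.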
One valuable use of the technique presented in this
article is to construct Developer Regression Tests (DRTs)
for your PHP Web applications. DRTs are a sequence of
automated tests that are run after you make changes to
your application. They are designed to determine if
your new code has broken existing functionality, before
you check it into your version-control repository. You can
also create an extensive set of test cases for a Full Test
Pass.
Conclusion
In this article, I have shown you how easy it is to create
test automation systems written in PHP for your applications. As PHP matures, testing will become more
important and the ability to write automated tests will
become more useful than it already is. And because
PHP works so well in a mixed technology environment,
the ability to write PHP test automation is a valuable
addition to your skill set—no matter what platforms
you use.
About the Author
Dr. James McCaffrey works for Volt Information Sciences, Inc., where he
manages technical training for over 4,000 software engineers working at
a wide range of companies. Previously, he was a university professor and
worked on several Microsoft products including Internet Explorer and
MSN Search. James can be reached at [email protected].
To Discuss this article:
http://forums.phparch.com/132
by Eddie Peloke
Flash MX 2004 for Rich Internet
Applications
by Phillip Kerman
Published by New Riders
Paperback 430 pages
$45.00 (US)
$67.99 (Canada)
It is hard to do much web surfing without coming
across some form of Flash content. Whether it is a
menu, form or movie, Macromedia's rich content
seems to be everywhere. In the past, it was primarily
used for creating animations and movies, but, recently,
Macromedia has begun pushing it past the boundaries
of simply being an 'animation tool' and into the realm
of programming. Rich Internet Applications, or RIAs, as
they are called, are intended to create a 'Rich' user experience by resembling a traditional desktop application more closely than a web app. RIAs also
bring forth a technology in Flash called Flash Remoting,
which allows Flash to communicate with outside services. These services can take the form of a ColdFusion
page, Java class, or even a PHP class, as outlined in the
"Flash Remoting with AMFPHP" article that appeared in
the July issue of php|architect.
In the past, I was never really interested in using
Flash. I had seen some cool menus and movies, but
never thought it practical enough to take the time to
learn the tool. However, when I read about RIAs for the
first time, I was excited to give this approach a try.
Book Review
Being a former teacher, I have always wanted to create an online grade book application. The thought of
using Flash's data grids, forms, and other features for
the presentation layer while using PHP classes for the
backend really interested me. I decided now is the time
to get to know Flash so I picked up Flash MX 2004 for
Rich Internet Applications. I should say, first of all, that
this book is not what I had expected. I was originally
hoping for more of a step-by-step tutorial on creating
an RIA, which this book does not provide. While the
author does provide some good examples, I found the
book to be more of a general RIA development best
practices book.
The book does a good job of explaining why you
would want to use an RIA and the technologies
involved. Chapters such as 'Presenting Data',
'Production Techniques', and 'Using Components' help
build a general understanding of the process and techniques, and some of the information in the chapters is
applicable to the creation of any application, regardless
of the tool involved. Even though the book is geared
more toward the skilled Flash and ActionScript user, the
author does include plenty of code examples, many of
which are self-explanatory.
One of my gripes about this book is its lack of coverage for some of the Flash remoting tools. While it can
be argued that this book is primarily about Flash's role
in RIAs, it would be nice to see some coverage of tools
such as AMFPHP, the open source Flash remoting tool
for PHP. It allows you to create a Flash movie which
connects to your PHP classes, where you can handle all
of the logic while Flash manages the presentation layer.
In my opinion, this could be a powerful combination
that is not nearly as well documented as it should be.
All in all, I think this is a good read for the developer
looking for more information on Flash and Rich Internet
Applications. It contains a lot of useful information
sprinkled with some cool RIAs created by the author. If
you are a Flash and ActionScript newbie, however, you
may want to brush up on your skills first.
PHP Ahoy! A look at php|cruise
March 1 - 5 • Bahamas 2004
by Marco Tabini
When the php|a team decided to organize a conference, the first question that we asked ourselves was "why". After all, there already was a
well-established circuit of PHP conferences throughout
North America at different times of the year; one could
almost say that the PHP conference market (if there
even is such a thing) was getting more and more saturated.
If anything, we wanted to avoid both interfering with
existing events and proposing "just another conference," given that there are already so many other
organizations out there that do a great job in conventional settings. Therefore, it took a while before Brian K.
Jones, then our Editor-in-Chief, came up with the brilliant
idea of holding the conference on a cruise ship,
and thus set us off on our path. Naturally, even with the
idea firmly in our minds, getting the first php|cruise off
the ground was an enormous task that took several
months of work just to get from the idea stage (you
know, the point where somebody starts saying "wouldn't it be nice if...") to the moment in which we finally
decided to announce it to the public.
After a lot more work, we finally sailed for the
Caribbean from Port Canaveral (near Orlando, Florida)
aboard the Sovereign of the Seas on March 1st. For
those who have never been on a large cruise ship
before, sailing on such a big vessel (the Sovereign holds
almost 3,000 people) is an... interesting experience.
Given that the ship itself is so massive,
even in relatively rough seas it will only
rock slightly, so that, while one easily
notices that something is "odd", it is
rarely disturbing to the point where one
gets seasick. Since we had a very busy
schedule, we started off with our
keynote session, given by Zend Studio
co-creator Zeev Suraski, shortly before
departure. Once the ship had actually
left the docks, many people only
noticed that we were not connected to
terra firma anymore because they saw
the speakers swaying slightly from one
side to the other, themselves hardly
aware of the fact!
The conference ran on two separate
tracks, so that the attendees could, at all
times, have an opportunity to choose
the session they best liked. For the most
part, every lecture took place in one of
the ship's two theatres, equipped with
the appropriate audio/video tools.
Despite some initial technical difficulties
(caused primarily by a projector with
the wrong cable), once we got under
way you could hardly tell that the whole
thing was happening on a ship in the
middle of the Atlantic Ocean—we could
have easily been in a hotel in any major
city. After the first day, once we had
everything set up, we even had practically continuous wireless (and wired)
Internet access.
From a practical perspective, therefore, php|c was a full-fledged, typical
PHP conference, with excellent speakers, many of whom offered original
talks, and lectures on all sorts of PHP-related topics, such as regular expressions, debugging, profiling and creating
development frameworks. However,
php|c was made unique by two
elements that were a consequence of
the venue we had chosen.
A cruise ship is a very odd place. On
one hand, you are effectively confined
to a limited space: huge (the Sovereign
held some 3,000 people very comfortably), but still limited if you compare it
to, say, being in the middle of
Manhattan. On the other, I dare anyone
to become bored during their stay
aboard the ship. No matter what
time it is, there is always something to
do, whether you're into gourmet food,
gambling, rock climbing or just sitting
around and having fun with your friends. In the context
of a conference, this results in a significantly higher
amount of experience-sharing between the attendees
and the speakers. More than once during the cruise, I
had occasion to walk by one of the many bars and find
groups of people talking animatedly about things as varied as what applications they were working on or what
they were expecting from PHP5. The ability to
exchange your personal experiences with your peers is,
perhaps, one of the most important aspects of a conference, but in a traditional setting it's too easy for the
attendees to go their separate ways outside of session
times and lose sight of each other.
We learned another important lesson by experimenting with the conference rooms in which each session
was held. By sheer accident, we were forced to move
one of the tracks from its assigned theatre to one of the
ship's many lounges for an entire day. Now, it goes
without saying that a lounge is set up in a very different way compared to a theatre: the seats are arranged
around tables, and the tables themselves are arranged
so that everybody is capable of seeing everybody else
(at least for the most part).
Although counterintuitive for a lecture, this setting
seems to have worked wonders as far as our sessions
went. Both speakers and attendees found themselves
more at ease and much more comfortable with contributing their personal comments and experiences during each session. Speaker Stuart Herbert held
perhaps one of the most memorable PHP sessions I
have ever participated in by hosting what he called a
"shared experiences" discussion on creating programming frameworks, loosely guided by a set of slides he
had prepared beforehand. Some of the attendees liked
Stu's idea so much that they rated it as a "six" on a scale
of one to five in their questionnaires!
From Work to Fun
Even though we were absolutely serious about holding
a full-fledged PHP conference, the venue we had chosen gave us plenty of opportunities for unprecedented
fun—we were, after all, on a cruise ship going to the
Bahamas!
As I mentioned earlier, the ship itself was a constant
source of activities, which, of course, prompted many of
our attendees to bring their significant others along
with them for the ride. A conference at which you can
have fun with your family: the perfect crime!
On top of the amenities you would normally expect,
like two salt-water swimming pools and two giant hot
tubs that the guests loved to take advantage of at
night, one could find all sorts of
attractions. Perhaps the most exotic
of them all must have been the rock
climbing wall, which I found a bit
scary but that some of our attendees
enjoyed very much in their spare
time.
There's a commercial for Disney
Cruises on TV in the US where each
member of the family finds something fun aboard the ship to do during their stay. The children go play
with the Disney characters, the
grandparents go play bingo (or
something like that) and mom goes
to the spa. Before leaving, they all
ask dad if he's all right—wondering
what activities he has planned. The man of the house
reassures them that he's got everything covered and
sends them all on their way. We next see him sleeping
in just about every spot that is fit for showing on television—from the beach to the massage parlour. Even
though Disney was not our cruise line (although "PHP
with Mickey" may be a good idea for the future), that
dad in the commercial inspired me—and I enjoyed
every last snoozing moment aboard that ship, as well,
of course, as the ensuing sunburn (but that's another
story).
So far, we've covered daytime
aboard the ship (we'll get back to
the shore later on). What about
the nightlife? Well, there were all
sorts of things going on, of
course. Given that we had been
blessed by extremely good
weather, the ship's crew organized some sort of dancing party
every night on the top deck (in
the open air), inclusive of a rich
midnight buffet. For those
among us who like taking risks, the
ship featured a full-fledged casino, whose friendly operators were
more than ready to take our
money. For a friendly discussion on the latest PHP
developments, many hit the various bars and shared
their thoughts over excellent daiquiris or piña
coladas. Finally, the musically inclined also had an
opportunity for more dancing in one of the disco
lounges or even karaoke.
Naturally, the ship wasn't at sea the whole time. We
visited Coco Cay, a private island owned by Royal
Caribbean (proof positive that the cruise business is
very profitable, apparently) that is, essentially, a huge
water park, complete with slides and beaches, as well
as many different excursion opportunities—like snorkeling, scuba diving or swimming with the dolphins—
which our attendees promptly took advantage of.
Although I didn't get an opportunity to visit Coco
Cay (I was too busy imitating the guy in the Disney
commercial), everyone who went had a great time and
brought home some wonderful memories.
Our second port-of-call—just before coming back to
Orlando—was Nassau, the capital of the Bahamas. The
ship docks on the "touristy" side of the town, so that
one can have a comfortable stroll around the various
shops and spend some of his hard-earned money on
anything from clothes to handmade souvenirs, which is
what I did. For the more adventurous, the ship was,
once more, organizing a number of different excursions, some of which were quite exotic—picture yourself snorkeling in the middle of crystal-clear water while
you play with a stingray, and you'll have a good idea of
what some of our guests experienced.
The Atlantis Resort, located on Paradise Island (not far
from Nassau itself), offered even more opportunities for
those who didn't want to stay on the ship but still wanted to enjoy
the nightlife. Atlantis features a number of different
attractions, including an incredible aquarium, a private
beach, several restaurants and yet another casino for the
gamblers. Being "busy" with a bit more R&R myself, I
didn't get an opportunity to go, but those who did
were very enthusiastic about it.
A Look at the Future
php|cruise turned out to be a very interesting experience. I think that everybody who participated had lots
of fun and learned something new about PHP, which
was, of course, our goal from the very beginning.
Encouraged by its success, we have started working
on the next edition, which will take place in the fall.
This time, we will go to Alaska, a land that offers a very
different, if just as exciting, set of possibilities for having
fun. Watch out for an announcement on our next exciting adventure coming April 15 on the php|architect
website!
To Discuss this article or see more pictures:
http://www.phparch.com/discuss/index.php/t/518/0
Content Management System
www.mamboserver.com
by Eddie Peloke
Like a lot of my peers, I spend most of my time helping others with their sites, yet I rarely have time to
look after my own. For the past year, my family site
has had nothing more than the default Apache page.
Don't get me wrong: I have plenty of ideas for the site,
but there always seem to be other things to work
on. On top of that, since I don't have time to create the
pages initially, I know I will have even less time to maintain the site.
Looks like I'll need a content management system.
While I have tried a few in the past, I haven't really
found one that I like; there's always something that
doesn't sit well with me. After all, CMSs are notoriously difficult to write, because it's nearly impossible to create a single application that will satisfy the needs of
every possible website.
Thus, when a co-worker returned from LinuxWorld
talking about a CMS he saw named Mambo, he managed to rouse my curiosity. His description was a bit
vague—I was told it was a PHP based CMS which
"looked nice"—but I thought I'd give it a try nonetheless. After all, with the amount of digital noise we are
subject to on a daily basis, the recommendations of
friends and colleagues are the last bastions of unfiltered, selfless information (at least for the most part).
According to the Mambo Open Source site: "Mambo
Open Source (MOS) is a PHP/MySQL based Content
Management System (CMS) framework released under
the GNU/GPL License, which enables the easy creation
and maintenance of a Web site or portal. The pure simplicity of MOS 4.5 means that you do not need to be an IT
Mambo Open Source
QUICK FACTS
Description:
First and foremost Mambo Open Source is a Content
Management System (CMS). The goal of the Mambo
Open Source project is to meet most of the requirements highlighted in the above article. As each day in
development goes by we are getting nearer and nearer,
whilst at the same time building a solid core which can
be expanded upon by 3rd Party Developers.
Mambo Open Source is the engine behind your website
that provides the ability to simplify the creation of content.
Requirements:
OS: UNIX, Microsoft Windows 2000/XP
Database: MySQL 3.23.55 or above
PHP: 4.2.1 or above
Web Server: Apache 1.3 or above
Web Browser: Internet Explorer 5.5 / Mozilla 1.4
Price:
Mambo Open Source is Free Software released under the
GNU General Public License.
Download Page:
http://www.mamboserver.com/content/menu/Mambo_Open_Source_
Download/
Product Homepage
http://www.mamboserver.com/
42
Professional to update, maintain and customize your content." Boy, have I heard this before... Well, let's see if it's
true.
Requirements
Before downloading the code, it's a good idea to
ensure that the system you're working on meets the
software's minimum requirements. In the case of MOS,
these are as follows:
• Operating Systems: UNIX, Microsoft Windows 2000/XP
• Database: MySQL 3.23.55 or above
• PHP: 4.2.1 or above
• Web Server: Apache 1.3 or above
• Web Browser: Internet Explorer 5.5 / Mozilla 1.4
Set Up
Well, my system does meet the requirements, and I was
pretty excited to give Mambo Open Source a try, so I
quickly downloaded the code and started the setup
process. Installing Mambo Open Source could not have
been easier: within five minutes of the download, it was
up and running. The Mambo installer performs quite a
few system checks and might complain if certain PHP
configuration parameters are not set or if it cannot access
certain directories, but those errors are easy to fix and
the installation should complete without trouble.
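To give an idea of what such pre-install checks involve, here is a minimal sketch of my own. It is not Mambo's installer code: the function name, its parameters and the directory names in the usage note are hypothetical, and it only covers the three kinds of checks mentioned in this review (PHP version, loaded extensions, directory access).

```php
<?php
// Hypothetical pre-install check; NOT taken from the Mambo installer.
// Returns a list of human-readable problems (empty when everything is fine).
function check_environment($minPhp, array $extensions, array $writableDirs)
{
    $problems = array();

    // Compare the running PHP version with the minimum the package asks for.
    if (version_compare(PHP_VERSION, $minPhp, '<')) {
        $problems[] = "PHP $minPhp or above is required, found " . PHP_VERSION;
    }

    // Every extension the package depends on must be loaded.
    foreach ($extensions as $ext) {
        if (!extension_loaded($ext)) {
            $problems[] = "Missing PHP extension: $ext";
        }
    }

    // The web server user must be able to write to these directories.
    foreach ($writableDirs as $dir) {
        if (!is_dir($dir) || !is_writable($dir)) {
            $problems[] = "Directory is not writable: $dir";
        }
    }

    return $problems;
}
```

On a PHP 4 server of the kind described above, a call might look like `check_environment('4.2.1', array('mysql', 'zlib'), array('images', 'media'))`, where the two directory names are placeholders rather than Mambo's real layout.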
The System
Once installed, the Mambo Open Source administration pages are the first place you will want to visit to
begin customizing the site to meet your needs. They
allow you to manage your site's templates, users,
menus, database, and so on. The administrator has a
nice, clean interface through which items are fairly easy
to find and manipulate.
One of the first aspects of your site you will probably
want to tweak is the interface. Mambo Open Source
comes with a handful of templates, but a quick web
search will return several sites with hundreds more—
giving more credence to the popularity of the package,
which is usually a good indication of its quality. Once
you have selected and installed your template, all that
is left is a click of the 'Publish' button to have it take
control of the site's look and feel. Incidentally, "Publish"
is a button you will become very familiar with when
using Mambo, as, by default, things don't appear
online until they are published.
Figure 1
If the template you selected needs to be modified,
don't worry—within the Mambo administrator, you can
edit the page's code or style sheet directly online. While
I have never had any trouble editing the code, I have
heard complaints from another developer using
Mambo of strange things happening when editing the
code with the administrator's WYSIWYG editor. Thus,
you may find it easier and less troublesome on a regular basis to edit your template's code directly in the
template's PHP file rather than through the administrator—but it's good to know that, in a pinch, you can
easily get by without having direct access to your server's filesystem.
Usage
A CMS needs to do more than simply manage the look
and feel of your applications, and Mambo does attempt
to give the user absolute control over every aspect of
the site it runs. From a management perspective, it is
broken down into three main parts: Components,
Modules and Templates. We have briefly discussed
Templates but what exactly are Modules and
Components? Some of the components that come
prepackaged with Mambo include:
• Banner Manager
• Polls
• Media Manager
• News Feeds
Some of the prepackaged modules include:
• Menu Managers
• Logins
• Statistics
If you don't find what you need from the initial install,
you will find several sites offering many free components and modules. Everything from file managers, to
forums, galleries, online shops, weather plug-ins, bug
tracking systems and pretty much any other thing you
can think of is out there for you to grab. I have even
come across some games, such as a humorous Pac-Man knock-off named, obviously, Mambo Man.
Figure 2
Now that we have talked briefly about components,
modules, and templates, we should take a moment to
talk about how they are installed. When the component, module and template (CMT) installer works, it is
extremely easy. Typically, all you have to do is upload
the zip file from within the Mambo administrator and
Mambo will unzip the code and take care of the installation for you. (It does this via the zlib package, which
Mambo checks for during the initial product
install to make sure it is available.) I say "typically" here
only because I have had several components, modules,
and templates that just refused to play nice and install.
Now, it is probably unfair to fault Mambo for some of
the third party components, but it is hard to determine
who is causing the problem. Of course, Mambo does
give you the option to upload the files yourself and
then install from the uploaded directory—it's just much
easier to use the administrator and let Mambo take care
of it.
Advanced Features
Mambo has a few "advanced" features that I have
found to be a nice addition. The most notable, for me,
is the database management system. The Mambo
administrator allows you to back up, restore and run
queries against your MySQL database and, while this in
no way replaces tools such as SQLyog or MySQL Front,
it is nice when you just need to run a quick query and
don't have access to another DB tool.
Mambo also contains support for content archiving
and versioning. While I have not yet used these features
in my system, I can see their benefit in an environment
with several users constantly changing content.
What I Liked
The main thing I like about Mambo is that it is written
in PHP. That means anything I don't like, I can fix with
ease. All of the quirks or deficiencies of the system can
be corrected by the programmer without having to
reinvent the wheel every time. I have also found the
code clean and, for the most part, well documented.
For example, a quick look into the database class shows
comments around the functions, data members, and so
on, making it easier to figure out what is going on and
ultimately easier to modify the code if needed.
I also like the ease with which content can be published, moved around the templates and re-ordered
directly from the web interface. As I mentioned, the
number of different templates, components and modules available online is a sign, in my opinion, of a healthy
and well-supported system. There is enough out there
to satisfy just about anyone's needs.
Figure 3
What I Didn't Like
I have found that it takes a day or so of playing around
with the system to get the hang of exactly how items
work. In some areas, where things should happen in a
certain order, it is not always obvious what the correct
procedure to follow is. For example, it happened to me
several times that, after having published an item, I
couldn't find my content on a web page because it didn't contain any 'records' or wasn't attached to a higher
level item. While I understand the need for such a hierarchy, it would be nice if the administrative pages gave
you better indications as to why some items won't
show up unless you do something else first.
My biggest gripe, however, is with some of the external components. Again, it is hard to fault Mambo for
code written by outside developers, but it is still a pain
nonetheless to get an error when attempting to install
new items, which is something that all users will have
to do at some point. Thus, a better installation management system might not be a bad idea.
If you are a standards stickler, you will find that some
of the templates will not pass the W3C validator. For
instance, running my test homepage through the validator returned 119 errors. While this doesn't particularly trouble me, I would have liked to see fewer standards
violations in the code. If standards are of high importance, you may want to check out xMambo, which is a
standards-compliant publishing system based on
Mambo. The good news is that, thankfully, the Mambo
team is now working to bring the benefits of xMambo
under the Mambo umbrella into a single package.
Conclusion
Overall, I like Mambo. If you are looking for a content
management system for one of your web projects, this
is definitely worth a look. With a wealth of external
plug-ins available, you should be able to find just about
any item you need to achieve what you want.
Figure 4
WAP: Past, Present and Future
by Andrea Trasatti
WAP stands for Wireless Application Protocol. It
is developed and administered by a consortium of companies known as the Open
Mobile Alliance (OMA), which also owns the trademark
on its brand name and regulates its use.
Just as the name implies, WAP is the application protocol for wireless services. WAP specifications range
from the communication protocol between a server
and a wireless device to the markup language that
should be used to exchange data. Most of the specifications are adaptations of existing standards for the
wireless and wired world. This means, of course, that
you will not need to study everything from scratch to
develop a WAP site; it is much easier than many might
think.
If you can write an HTML page, then you can also
write a WAP page. WML (Wireless Markup Language) is
the language that was invented to develop WAP sites.
It is derived from XML and complies with the XML standard. The first thing you will have to get used to is that
WAP browsers are not as developer-friendly as the traditional Web browsers you are used to. If you have a
typo or introduce incorrect syntax, the browser will
simply print an error message ("Compile error" is the
most common, but it depends on the browser) rather
than being tolerant and trying to recover from the situation like a Web browser.
If you want to develop a WAP site, the first thing you may
want to do is download an emulator (you will need
it to test your WAP pages), download the WAP documentation from the OMA Web page (http://www.openmobilealliance.org) and set up a Web server to load your
pages.
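Because a WAP browser gives up at the first syntax error, it can save time to check a page for well-formedness before loading it into the emulator. The following is a small sketch of that idea using PHP's XML parser functions; the function name is my own, and the check only covers XML well-formedness, not validity against the WML DTD.

```php
<?php
// Sketch: report the first XML syntax error in a WML page, or null if it parses.
// This checks well-formedness only; it does not validate against the WML DTD.
function wml_syntax_error($markup)
{
    $parser = xml_parser_create();

    // xml_parse() returns 1 on success and 0 on failure.
    if (xml_parse($parser, $markup, true) === 1) {
        return null;
    }

    return xml_error_string(xml_get_error_code($parser))
        . ' at line ' . xml_get_current_line_number($parser);
}
```

A page containing an unclosed `<br>` inside a paragraph, for instance, comes back with an error message, which is the kind of mistake a Web browser would silently forgive.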
To support as many devices as possible, I suggest that
you download the Openwave SDK (http://developer.openwave.com) and the Nokia Mobile Browser
(http://forum.nokia.com). Both software packages are
available for download free of charge (or after a free
registration on their respective websites). While the
Openwave SDK comes with a generic WAP emulator
and supports skins to emulate specific devices, the
Nokia Mobile Browser is a core application that will
require you to download some of the available plug-ins
to emulate specific devices; if you don't, you will get a
generic device which doesn't have much to do with real
WAP terminals (I don't suggest using it for testing!).
If you are moving into the world of WAP from HTML
and Web applications, you might think you will need a
specific editor to write your pages, but this is not the
case, as WML is very similar to HTML—and I know
many people who use Homesite or similar HTML editors
to do their WAP work without any problem. Thus, you
can just pick your favorite editor ('vim' is always at the
top of my list!) and get straight to work.
Getting Started
Now you have all the tools you need to start develop-
REQUIREMENTS
PHP: 4.x
OS: Any
Applications: N/A
Code Directory: wap
text/vnd.wap.wml wml
application/vnd.wap.wmlc wmlc
text/vnd.wap.wmlscript wmls
application/vnd.wap.wmlscriptc wmlsc
image/vnd.wap.wbmp wbmp
While developing your WAP site, you will mostly set
the appropriate mime type through PHP, but you might
need these Apache settings for images (wbmp), or in
case you decide to use hard-coded wml and wmls
pages. Since I introduced all these new file types, it is
worth talking about them a little bit, although you can
find extensive descriptions of these and more extensions in the OMA documents.
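When the MIME type is set from PHP rather than by Apache, the same mapping can live in a small lookup function. This is only a sketch: the function name is mine, and the type names are the registered WAP types as I know them.

```php
<?php
// Sketch: map a WAP file extension to its MIME type.
// Returns null for extensions that are not WAP-specific.
function wap_mime_type($filename)
{
    $types = array(
        'wml'   => 'text/vnd.wap.wml',
        'wmlc'  => 'application/vnd.wap.wmlc',
        'wmls'  => 'text/vnd.wap.wmlscript',
        'wmlsc' => 'application/vnd.wap.wmlscriptc',
        'wbmp'  => 'image/vnd.wap.wbmp',
    );

    $ext = strtolower(pathinfo($filename, PATHINFO_EXTENSION));

    return isset($types[$ext]) ? $types[$ext] : null;
}
```

A script that serves an image could then call `header('Content-Type: ' . wap_mime_type('logo.wbmp'));` followed by `readfile('logo.wbmp');`, where `logo.wbmp` is just a placeholder file name.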
"wml" is the extension for WML pages. WML pages
play the same role as HTML pages do on the Web. "wmls" is for WMLScript,
which is the wireless counterpart of JavaScript. Unlike
JavaScript in Web pages, however, WMLScript cannot be embedded
in a wml page; it is stored in external files that the page references.
WBMP stands for Wireless BitMaP and represents a
black and white image. All graphically-capable WAP
devices support WBMP, but the most recent devices
also support many other image formats, such as GIF,
JPG and PNG (which also provide color images). If you
are trying to write an application that will work with as
many different devices as possible, it's usually safe to
just go with a WBMP, which any graphics-capable device will display
properly.
If you don't know how to generate a WBMP image,
check your favorite graphics software—many recent
ones have plug-ins to generate WBMP's. Also, you can
find some software to convert a generic
image into a WBMP (try this online converter:
http://www.teraflops.com/wbmp/ and this online editor
http://Webcab.de/woe.htm, or Google a little bit, and
you'll find that there are plenty of applications available).
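If PHP's GD extension is available on your server, its imagewbmp() function is probably the simplest way to produce a WBMP. To show how little there is to the format, here is a sketch that encodes a tiny image by hand. It assumes the basic "type 0" layout as I understand it: two zero bytes, the width and height as multi-byte integers, then one bit per pixel (1 for white, 0 for black) with each row padded to a whole byte. The function names and the '#' notation for black pixels are my own.

```php
<?php
// Encode an unsigned integer in the multi-byte format used by WBMP:
// 7 bits per byte, most significant group first, high bit set on all but the last byte.
function wbmp_uintvar($n)
{
    $bytes = chr($n & 0x7F);
    $n >>= 7;
    while ($n > 0) {
        $bytes = chr(($n & 0x7F) | 0x80) . $bytes;
        $n >>= 7;
    }
    return $bytes;
}

// Sketch of a type 0 WBMP encoder.
// $rows is an array of equal-length strings; '#' is a black pixel, anything else is white.
function wbmp_encode(array $rows)
{
    $height = count($rows);
    $width = $height > 0 ? strlen($rows[0]) : 0;

    // Type field (0), fixed header field (0), then width and height.
    $out = "\x00\x00" . wbmp_uintvar($width) . wbmp_uintvar($height);

    foreach ($rows as $row) {
        for ($x = 0; $x < $width; $x += 8) {
            $byte = 0;
            for ($bit = 0; $bit < 8; $bit++) {
                $px = $x + $bit;
                // White pixels are stored as 1; padding bits past the width stay 0.
                $white = ($px < $width && $row[$px] !== '#') ? 1 : 0;
                $byte |= $white << (7 - $bit);
            }
            $out .= chr($byte);
        }
    }

    return $out;
}
```

The result can be written to a file or sent straight to the device with the image/vnd.wap.wbmp content type.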
WMLC pages are WML pages that are already compiled. This format is not widely used, but it exists
nonetheless. While WMLC content is referred to as
"compiled," it is not compiled in the way a C source file
is; WML tags are simply converted into symbols so that
they will use less bandwidth (remember that many
WAP devices have very little bandwidth available and,
therefore, you should make your documents as lean as
possible). Keep in mind that any WAP emulator will also
show the complete source of a WML page, regardless
of whether it has been compiled or compressed (just
like Web browsers do) and, therefore, compiling will
not protect your pages from prying eyes.
If you want to know more about how WAP works, I
suggest that you take a look at the OMA official documents—they will shed more light on subjects that I cannot cover in this article, like compiled pages and the
communication protocols between devices, gateways,
and webservers.
A Simple WAP Page
Now that we have had a little introduction to WAP and
its different component file types, we can proceed to
the first example.
Let's analyze the code in Listing 1. The first two lines
define the document and the revision of the WML used
in the current page. This is required code just as it is for
any XML document.
The <wml> tag defines the beginning of the WML
page. Each page must begin with this tag and end with
</wml> (just like an HTML page must be contained
within an <html> element).
<card> is the tag that defines the beginning of a
"card". When WML was first defined, each page was
defined as a "deck" that is composed of one or more
cards. Each card will be displayed as a single page by
the WAP browser. This is particularly useful if you have
a predefined navigation scheme.
If you think of a WAP device and the time it takes
to load a page over a slow network, you will understand how useful it is to already have the next page in
memory. Each card that composes the deck begins
with the <card> tag and ends with the </card> tag. The
opening tag takes an id attribute, which distinguishes
one card from another and behaves like an
HTML anchor. If you want to jump from one card to
another, all you need to do is create a link to #cardid,
where cardid is, obviously, the string defined in the id
attribute for that card.
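The deck-and-card structure lends itself to being generated from a PHP array. The sketch below builds a deck in which each card can link to the next one through its id; the function name and the array keys ('title', 'text', 'next') are inventions of mine for the sake of the example.

```php
<?php
// Sketch: build a WML deck from an array of cards keyed by card id.
// Each card is array('title' => ..., 'text' => ..., 'next' => optional id of another card).
function wml_deck(array $cards)
{
    $out = '<?xml version="1.0"?>' . "\n"
         . '<!DOCTYPE wml PUBLIC "-//WAPFORUM//DTD WML 1.1//EN"' . "\n"
         . '  "http://www.wapforum.org/DTD/wml_1.1.xml">' . "\n"
         . "<wml>\n";

    foreach ($cards as $id => $card) {
        // Escape everything that ends up in the markup, since the browser will not forgive a stray "&".
        $out .= '  <card id="' . htmlspecialchars($id) . '" title="'
              . htmlspecialchars($card['title']) . "\">\n"
              . '    <p>' . htmlspecialchars($card['text']);

        // A link to "#id" jumps to another card of the same deck without a new request.
        if (isset($card['next'])) {
            $out .= ' <a href="#' . htmlspecialchars($card['next']) . '">Next</a>';
        }

        $out .= "</p>\n  </card>\n";
    }

    return $out . "</wml>\n";
}
```

Since the whole deck travels in one response, the jump from one card to the next costs the user no additional download.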
Many of the tags that are part of the XHTML specification can also be used in WML. For example, <p> is
used to identify a paragraph, while <br/> introduces a
line break. It's important to remember that you are
ing—with the exception of a WAP server. Luckily, all
you need is a common webserver, such as Apache
(http://httpd.apache.org), and a little configuration in
the MIME types file to allow for the file types specific to
WAP documents and images. If you are going to develop all your WAP pages with PHP and won't need to use
any images, you will not really need to make this modification, although I always suggest to do it anyway—
after all, you never know if you will need it some day.
WAP introduced a few new extensions that need the
appropriate MIME types: wml, wmls and wbmp. If you
are using Apache, add the following lines to your
"mime.types" file (and then restart the server):
Listing 1
<?xml version="1.0"?>
<!DOCTYPE wml PUBLIC "-//WAPFORUM//DTD WML 1.1//EN"
  "http://www.wapforum.org/DTD/wml_1.1.xml">

<wml>
  <card id="main" title="first_page">
    <p>
      Hello World<br/>
      this is my first WAP page.
    </p>
  </card>
</wml>
cards, main, like and dislike. By default, the WAP
browser will show the first card to the user. The user will
pick one of the two options (click the link or accept button, which is bound to the right softkey) and the
appropriate following card will be displayed.
The img tag has a closing slash to comply with the
XML standard (just as it would in XHTML). Also, notice
the alt attribute, which is needed in WML so that if the
browser cannot show the image, the "alternative text"
will be displayed. The reason why this attribute is necessary is that many WAP devices do not have graphics-display capabilities and, therefore, the ability to show a
text alternative to an image is very important.
I put the image on the same line as the message to
raise a problem you might encounter with different
WAP browsers. Some (like newer Openwave browsers)
will display this image as you would expect a Web
browser to, on the same line as the text beside it, while
some others (like older Nokia devices) will place it on a
new line. Naturally, if the image width plus the text
exceeds the screen width, the image will
go on a new line, but you should be aware that not all
browsers will behave the same way under all circumstances and, therefore, you should plan your WAP
documents accordingly.
The syntax for anchors is just the same as for Web
pages. If you look at the WML 1.x documentation, you
will see that <anchor> can also be used as a tag for
anchors. Another alternative is the go tag. The browser's behavior in response to either is the same, but the
go tag also gives you the possibility to define the
method (GET or POST) and any additional parameters you want to pass to the subsequent pages. I suggest reading the full documentation for more specific
help.
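As an illustration of the go tag, here is a sketch of a PHP helper that wraps a go task inside an anchor element and passes extra parameters with postfield elements, which is how I understand WML 1.x to handle them. The helper's name and the vote.php target in the usage note are made up.

```php
<?php
// Sketch: build an <anchor> that triggers a <go> task with optional parameters.
// $method is normalized to "get" or "post"; $fields is an array of name => value pairs.
function wml_go_anchor($label, $href, $method, array $fields)
{
    $method = strtolower($method) === 'post' ? 'post' : 'get';

    $out = '<anchor>' . htmlspecialchars($label)
         . '<go href="' . htmlspecialchars($href) . '" method="' . $method . '">';

    // Each parameter travels in its own postfield element.
    foreach ($fields as $name => $value) {
        $out .= '<postfield name="' . htmlspecialchars($name)
              . '" value="' . htmlspecialchars($value) . '"/>';
    }

    return $out . '</go></anchor>';
}
```

For example, `wml_go_anchor('Vote', 'vote.php', 'POST', array('pic' => '42'))` yields a link that posts the pic parameter to the hypothetical vote.php script.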
Another important thing to keep in mind is that, in a
WAP application, the order in which tags appear is
extremely important. For example, the do tag should be
used outside of a paragraph and inside a card. In this
case, if we put the do tag inside the paragraph, it will
cause a compile error.
essentially dealing with an XML document—and, therefore, forgetting the slash will cause the WAP browser to
produce a "compile error" message.
As you can see, the structure of this simple page is
not much different from what its HTML equivalent
would look like—even more so if you were to compare
it to an XHTML document.
In the next example, we will see some more tags—
including some that are unique to WML. The first thing
you must have clear in your mind when developing a
WAP application is that you are not developing a website. Your application will be visited by users who have
a tiny display with little (often uncomfortable) keys and
who are probably paying a lot of money for the privilege of accessing the Internet. The key for a successful
WAP site, in my experience, is simplicity and usability.
The content is extremely important as well, of course,
but the difference between WAP and Web pages lies
primarily in user-friendliness. A simple Web page with
great content will get many hits (Google being the
prime example) while good content in WAP will not be
as popular if the site is not usable and friendly; people
will just wait until they get home and use the Web
instead.
To help developers make their applications easier to
use, the OMA defined some specific tags, such as <do>,
which support the type attribute. do tags can be of type
option or accept, for example. The former links an
object to the left softkey, while the latter defines the
action for the right softkey. These are useful to ease the
user's navigation (if you're wondering what softkeys
are, they are the two buttons below the screen; in order
to be WAP-compliant, a device must have these two
keys active during the navigation).
Let's move to a more complex page and check what
WML can offer; take a look at Listing 2. As you can see,
I used the well-known anchor tag and the <do> tag
together. This is probably not the most usable page I
ever wrote, but it's a good example that we can use to
play with cards and links.
Let's analyze the elements of this deck. We have three
Listing 2
<?xml version="1.0"?>
<!DOCTYPE wml PUBLIC "-//WAPFORUM//DTD WML 1.1//EN" "http://www.wapforum.org/DTD/wml_1.1.xml">
<wml>
  <card id="main" title="first page">
    <p>
      Hello World,<br/>
      take a look at this picture:<img src="mypic.wbmp" alt="here is my pic"/>
      Click <a href="#like">here</a> if you like it, or click your right softkey if you don't.
    </p>
    <!-- The do element sits outside the paragraph and wraps the go task it triggers -->
    <do type="accept">
      <go href="#dislike"/>
    </do>
  </card>
  <card id="like" title="I like it">
    <p>Good, you like my picture.</p>
  </card>
  <card id="dislike" title="I don't like it">
    <p>Too bad, you don't like my picture.</p>
  </card>
</wml>
WAP and PHP: A Simple Example
If you have configured your Apache webserver properly and saved the examples above into a couple of WML
files, you will be able to browse them with a WAP emulator. But what if you wanted to write a WAP application using PHP?
I don't need to explain why you should use PHP, and
that PHP can give you more than a static page, so let's
just talk about how you can use it. First of all let's talk
about the content type of your script's output. By
default, PHP sets it to text/html (default for Web
pages), but we want it as text/vnd.wap.wml, so that the
Listing 3
<?php
// Declare the output as WML; PHP would otherwise send text/html
header("Content-Type: text/vnd.wap.wml");
// An expiry date in the past, in the GMT format that HTTP expects
header("Expires: " . gmdate("D, d M Y H:i:s", time() - 1000) . " GMT");
header("Last-Modified: " . gmdate("D, d M Y H:i:s") . " GMT");
header("Cache-Control: no-cache, must-revalidate");
header("Pragma: no-cache");
// Echo the XML declaration so that it is never mistaken for a PHP short open tag
echo '<?xml version="1.0"?>' . "\n";
?>
<!DOCTYPE wml PUBLIC "-//WAPFORUM//DTD WML 1.1//EN"
  "http://www.wapforum.org/DTD/wml_1.1.xml">

<wml>
  <head>
    <meta forua="true" http-equiv="Cache-Control" content="max-age=0"/>
  </head>
  <card id="main" title="never cache">
    <p>This deck will always be reloaded</p>
  </card>
</wml>
WAP Gateway and WAP Browser will recognize it properly. The PHP syntax to set the header is:
header("Content-Type: text/vnd.wap.wml");
This is the very first step you will need to take to write
a WAP page within PHP. Just like for Web pages, you
can set an expiration time—and, in fact, you should,
because WAP browsers tend to use their cache very
aggressively to make navigation as fast (and inexpensive) as possible for their users.
Since WAP relies on the HTTP protocol for its communications, you can set the expiration time of a document in pretty much the same way you would for a
normal HTML page. As you can see in Listing 3, this is
accomplished with a simple set of calls to the
header() function.
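Not every page needs to be reloaded each time, of course. As a sketch, the helper below returns the headers for a WML response that may either be cached for a given number of seconds or never cached at all. The function name and its two parameters are my own; passing the current time in as an argument simply makes the function easier to test.

```php
<?php
// Sketch: headers for a WML response.
// $ttl is the cache lifetime in seconds (0 or less means "never cache"),
// $now is the current Unix timestamp, normally time().
function wml_cache_headers($ttl, $now)
{
    $headers = array('Content-Type: text/vnd.wap.wml');

    if ($ttl > 0) {
        // Let the browser and the gateway keep the deck until it expires.
        $headers[] = 'Expires: ' . gmdate('D, d M Y H:i:s', $now + $ttl) . ' GMT';
        $headers[] = 'Cache-Control: max-age=' . $ttl;
    } else {
        // An expiry date in the past forces a reload on every visit.
        $headers[] = 'Expires: ' . gmdate('D, d M Y H:i:s', $now - 1000) . ' GMT';
        $headers[] = 'Cache-Control: no-cache, must-revalidate';
        $headers[] = 'Pragma: no-cache';
    }

    return $headers;
}
```

In a page, the result would be sent before any output with something like `foreach (wml_cache_headers(3600, time()) as $h) { header($h); }`.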
The Road to the Future
Obviously, the examples I presented do not fulfill all the
possible scenarios of WML development. The aim of
this article is not to make you a wireless master, but
rather to show you the options you have if you want to
start developing a working site. I think it was worth
introducing WML and showing some of the main differences from "standard" Web development to let you
understand that you should not simply "recycle" a website and adapt it to a mobile device. You will need to
rethink it from scratch.
One of the targets of the new OMA standard (WAP
2.0) is to make the transition to WAP easier for Web
developers. The first step consists of using a "common
language": XHTML. As you all know, XHTML is supported by any browser released in the last year or two, and
WAP 2.0 is based on XHTML Basic, plus a few tags specific for mobile devices. What comes out as a result is
called XHTML Mobile Profile. Currently, version 1.0 has
been standardized and the OMA is working on version
2. As in any transition period, you should keep in mind, while developing, that you will need to support both
the old and the new standard for at least some time.
Any device released (but not necessarily purchased,
since dealers will have to clear inventories out) after
April 2003 supports XHTML MP. As you have probably
painfully learned as a professional developer, "support"
does not necessarily mean that everything works properly—so you can safely expect that many devices will
ignore some of the new tags, but they should, at least,
do so silently and without spitting out all sorts of errors.
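One way to support both standards from a single script is to look at the Accept header that the device sends. The sketch below rests on an assumption of mine: that XHTML MP browsers announce themselves with the application/vnd.wap.xhtml+xml type (or plain application/xhtml+xml). Real devices are not always this well behaved, so treat it as a starting point only; the function name and its return values are hypothetical.

```php
<?php
// Sketch: choose a markup language from the Accept header of the request.
// Returns 'xhtml-mp' when the device claims XHTML support, 'wml' otherwise.
function preferred_markup($accept)
{
    $accept = strtolower($accept);

    if (strpos($accept, 'application/vnd.wap.xhtml+xml') !== false
        || strpos($accept, 'application/xhtml+xml') !== false) {
        return 'xhtml-mp';
    }

    // Fall back to WML 1.x, which older devices understand.
    return 'wml';
}
```

In a live script, the argument would typically come from `$_SERVER['HTTP_ACCEPT']`, with an empty string as the fallback when the header is missing.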
What are the cool things about XHTML? The most
important is its support for CSS (Cascading Style
Sheets). With CSS, you are able to define styles and use
them in your WAP pages. This particular technology
was never used in WML 1.x because the displays were
so tiny that applying a style was simply not effective.
The latest-model devices, however, feature bigger
screens and color displays, making the appearance of
After the do tag I close the card and start a new one.
I created two simple cards just to show a message. As
you can see, I created a deck with three cards, even
though the user is likely to see only two of them. Thus,
not all of the content available in the document will
actually be shown to the user, but this will save load
time regardless of what choice he or she makes.
Unfortunately, this is not always possible, as you will
often generate the contents of a page depending on
the information passed from a link, a form or something similar, but you should take advantage of this
capability of WAP whenever possible, as your site will
be a bit friendlier as a result.
Also notice that some browsers will display anchors
alone on a line. If you have any text around the link,
your sentences will be split. Another particular behavior
some devices show is not allowing you to place a link
on an image. These limitations mainly apply only to the
first devices that hit the market, which had small
screens and were capable of displaying only a few lines
per page. Their manufacturers thought that introducing these rendering rules would make navigation
easier, although, of course, they ended up making the
developers' life harder. However, as annoying as they
are, these idiosyncrasies shouldn't discourage you; as
long as you respect strict WML standards, your pages
will be viewable by everyone. All you can do is
try your best to make them look good on as
many devices as you can.
Where to Go From Here
This was just a brief introduction to WAP, and you will
probably have a lot of questions. Starting your journey
into WAP programming is quite easy—you will probably write your first pages without many problems and
then "hit the wall" just when you start feeling confident!
A good reading of official documentation and a couple of specialized sites will certainly offer you a deeper
knowledge of the topic. What you should always keep
in mind is the main concept: a small device used while
on the move. Your site must be friendly and simple to
use. My suggestion is to develop your applications
while testing them with at least the Nokia and
Openwave SDKs. Better yet, you should test them with
all the "real" devices that you intend to support. One
step further is to balance every single page, trying to
make it short enough that scrolling up and down
through its contents will not be overly annoying. Also,
design your forms so that they can be easily completed
by your users without too much typing. Employ drop-down and multiple selection boxes whenever possible,
as these also help ensure the accuracy of the data that
is entered.
What are the advantages of developing a WAP site
with a language such as PHP? Of course, you get all the
normal perks of a web language, like the ability to
access a database. The real plus, however, is the possibility to tailor each WAP page to the mobile browser
used by your visitor. Reading the user agent that the
device sends at each request, you should be able to
understand the type of device and offer ad-hoc
markup. As I mentioned in this article, each device has
its own peculiarities, and this is even more true with the
new XHTML MP devices that, in many cases, do not
support the full standard or apply it in their own way.
With PHP, you will be able to build the WAP page that
fits each particular device best, although, of course, in
practical terms you still need to figure out how each of
them will behave. For this purpose, you essentially have
three options:
WAP pages a worthy consideration. Adding colors, different fonts, and alignment suddenly becomes both
useful and practical. Another big new feature is the
background color and background images. Once
again, this is a need that was never felt until new displays became available, but it is now a cool feature you
can add to your site.
While you could use italic or bold text in WML 1.x,
most of the devices did not support it. In XHTML MP,
these tags are inherited from the XHTML Basic and
include "strong", "big", "small", "b" and more. I don't
feel like I need to explain every single tag in XHTML,
given that you are probably very familiar with them and
you can get full documentation from the W3C. What
you should know is that you will need to be strict with
your syntax and avoid any mistakes or unclosed tags, as
on mobile devices these will generate compile errors (and
you would be producing invalid XHTML anyway).
With all the new additions that XHTML MP brought,
some of the features of WML 1.x have also been pruned
out of the standard. For example, we've lost the "do"
tag that lets us assign specific functions to the two softkeys, as well as the concept of card and deck. Forms are
present (and quite similar) in the two standards, but a
useful function that was lost in the transition is the possibility for the developer to predefine the type of information the user should insert. In WML 1.x, the developer could use specific tags to define that an input field
should be filled with numbers only, or letters only, and
the device interface would not let the user insert anything else. This helped both the user, who could more
easily pick the proper set of keys, and the developer,
who could better manage the submitted form.
Openwave decided to keep supporting this functionality with a proprietary tag, but this means that if you
decide to use it your code will not be compatible with
other devices!
• Acquire each of the devices you intend to
support and test your site on each one separately. As mentioned above, this will help
you ensure maximum performance in all situations, but it may turn out to be an expensive proposition, and it will certainly slow
your development efforts down.
• Purchase a commercial package that provides you with a list of device capabilities
and build your pages based on the information you find in it.
• Rely on an open-source package to do the
same.
The last two options are essentially equivalent, and
your choice will probably depend on how you feel
about open-source compared to a proprietary solution.
Personally, I believe that open-source products can be
superior, and that's why I am an active contributor to
the Wireless Universal Resource File (WURFL) project,
which you can find at http://www.wurfl.org.
The testing path I commonly follow is to develop the
application based on the information I find in WURFL
and then test it with a few of the real devices I intend
to support openly.
A Bit of Homework
The future of WAP is bigger and more colorful than ever
before, thanks to the new devices that have larger
screens, more colors, and faster browsers based on the
GPRS and 3G standards. Even if the dream of "the Web
on a mobile device" will probably remain a dream for a
little while longer, WAP is widely used by millions of
users every day for the download of content of all
kinds. For an example, look at ringtone services, which
have ballooned in popularity over the last couple of
years and have become a rather sizable market. WAP is
the ideal medium for them: you connect to a site,
browse a list, pick your favorite ringtone and download
it.
As devices get better and the cellular networks get
faster, WAP will become more and more useful. What
was once just a "cordless phone" (often the size of an
attaché case) is now becoming a tiny computer, and
WAP is the transport medium for the content users
want to have.
If I convinced you that it is worth your time to read
and experiment with WAP a little, you will probably
need some links to start from. Your first stop should be
"The Wireless FAQ", http://www.thewirelessfaq.com,
where you will find some of the things I discussed and
many more frequently asked questions and examples.
You will also find links to the OMA (http://openmobilealliance.org), from where you will be able to download
current and historical documents about WAP.
For your experiments, you can download the SDKs I
listed at the beginning of the article, and if you want a
comfortable emulator for Windows, take a look at
WinWAP (http://www.winwap.com). After a little testing
and playing around, you might also want to take a look
at WURFL (I wrote an article about it in the
June 2003 issue of php|architect) and OUI
(http://oui.sourceforge.net), an open-source library
published by Openwave that can dynamically adapt
your XHTML MP code to the capabilities of each wireless device.
There are also a lot of commercial products that can
help you develop WAP sites, but these are relatively
easy to find on pretty much any search engine. If websites and official documentation are not enough, you
can also come discuss WAP on the famous
WMLProgramming list on Yahoo!:
http://groups.yahoo.com/group/wmlprogramming
About the Author
Andrea Trasatti started his career as a SYSOP for the second BBS in Italy
to offer internet access. As the internet grew, he integrated his experience
with the development of web applications. Now he specializes in the
development of multichannel applications. He is an active member of the
open-source community. Some of his projects are the leading value
added services for one of the biggest mobile carriers of the world.
To Discuss this article:
http://forums.phparch.com/134
Tidying up your HTML in PHP5
by John Coggeshall
Tidy is a new extension that will be available as a standard in PHP 5. It provides a wide range of functionality for
manipulating HTML, XHTML, and XML documents from
within PHP. This article introduces all of the primary features of this new extension, and how you will be able to
make the most of it in your PHP scripts.
Although the Tidy extension itself is provided as a
part of PHP 5, by default it is not enabled, as it
relies on external libraries. To enable Tidy support
within PHP, you must first download the libTidy library,
available on the Tidy homepage at
http://tidy.sourceforge.net/. Once you have downloaded the latest version of the libTidy source, you can
install it on your server using the following commands:
[user@localhost]$ tar -zxvf tidy_src.tar.gz
[user@localhost]$ cd tidy
[user@localhost]$ /bin/sh build/gnuauto/setup.sh
[user@localhost]$ ./configure
[user@localhost]$ make
[user@localhost]$ make install
Note that, in order to fully complete the installation
of the libTidy library, the make install command must
be executed as superuser (i.e. root) or equivalent.
Once the libTidy library has been installed on the
server, Tidy support can be enabled in PHP 5 by specifying the --with-tidy configuration option to PHP's
./configure script:
[user@localhost]$ ./configure --with-tidy
The above command should work when
Tidy is installed in one of the common default locations. Alternatively, you can specify the location of the libTidy
library directly:
[user@localhost]$ ./configure --with-tidy=/path/to/libTidy
To confirm Tidy support in PHP 5, check for a Tidy
section in the output of the phpinfo() function or execute
the CLI version of PHP using the -m parameter (to
show installed modules). If everything has gone as
expected, you will see tidy in the module list and a Tidy
subsection in the output of the phpinfo() function.
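The same check can also be made from inside a script, which is handy when you cannot run the CLI binary on the server. The sketch below relies only on PHP's generic extension_loaded() and function_exists() functions; the helper name tidy_is_available() is made up for this example.

```php
<?php
// Reports whether the tidy extension is available to the running PHP binary.
function tidy_is_available()
{
    return extension_loaded('tidy') && function_exists('tidy_parse_string');
}

if (tidy_is_available()) {
    echo "Tidy support is enabled\n";
} else {
    echo "Tidy support is missing; rebuild PHP with --with-tidy\n";
}
```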
REQUIREMENTS
PHP: 5
OS: Linux
Applications: LibTidy Library
Code Directory: N/A
An introduction to the Tidy API
The Tidy extension, like many of the new PHP 5 extensions, supports a dual-nature procedural/object-oriented syntax. This allows you, as a developer, to use the
programming methodology you are most comfortable
with when using Tidy in your PHP applications. For
example, consider the following small snippet of code:
<?php
$tidy = tidy_parse_file("myfile.html");
tidy_clean_repair($tidy);
echo tidy_get_output($tidy);
?>
Don't be concerned that these functions have yet to
be introduced; I'll discuss them later in the article.
Instead, note the $tidy resource which is returned from
a call to the tidy_parse_file() function. This resource
represents the document being manipulated in memory, and must be passed to every function (similar to, for
instance, the way the cURL library works). This
resource, however, is much more powerful than its
PHP 4 counterparts, as it can also be treated as an
instance of a Tidy document object:
<?php
$tidy = tidy_parse_file("myfile.html");
$tidy->cleanRepair();
echo $tidy;
?>
This second example is identical in functionality to
the first, except, of course, that it uses the object-oriented syntax. Rather than calling tidy_clean_repair(),
which requires a resource to be passed, you can call the
cleanRepair() method instead.
The last line of the example above illustrates another
interesting feature of the Tidy library: because of its
dual-nature syntax, it is possible in PHP 5 to treat the
$tidy resource returned from the tidy_parse_file()
function as a string simply by using it in the context of
a string within PHP. The contents of this string are identical to those returned from a call to tidy_get_output(),
providing an incredibly useful shorthand for displaying
the contents of a document after it has been manipulated by the extension.
Although this is not recommended, the procedural and object-oriented forms
of the Tidy syntax can also be mixed, as shown here for the sake of example:
<?php
$tidy = tidy_parse_file("myfile.html");
$tidy->cleanRepair();
echo tidy_get_output($tidy);
?>
In this article, I will only use the procedural syntax for
Tidy to maintain consistency and avoid making things
appear more complicated than they are. The only time
I will use any object-oriented aspects is in the treatment
of the $tidy resource as a string when appropriate in
my examples. If you would prefer to use the OO syntax
of Tidy, converting between one and the other is a trivial task, as procedural names map to their object-oriented counterparts by doing the following:
• Remove the tidy_ prefix from the procedural name
• Remove all underscores from the function
name, capitalizing the first letter of every word
except the first (clean_repair() becomes
cleanRepair())
• When calling Tidy functions from an object-oriented context, the first parameter of the
function call (always the $tidy resource) is
omitted.
Basic Tidy usage: Parsing documents
Tidy's primary purpose is to parse, validate, and repair
markup documents in HTML, XHTML, or XML format
and return the results of that process. Parsing an input
document always begins this process. To parse a document stored within a file, you can use the
tidy_parse_file() function:
tidy_parse_file($filename [, $options [, $encoding [, $use_inc_path]]])
Where $filename is the path and filename of the document to parse. This can either be a file on the local file
system, or a remote URL. For now, the second parameter ($options) can be ignored (I will discuss it later),
while $encoding represents the character set of the
input document (such as utf8). The final parameter,
$use_inc_path, is a Boolean indicating whether Tidy
should search for the file in the PHP include path if not
found initially.
The tidy_parse_file() function loads and parses the
markup document and returns a resource representing
that document to your script. During the parsing
process, the document may be modified from its original version to make it syntactically correct. For instance,
missing end-tags are automatically added, attribute values are automatically quoted, and so on.
Documents can also be read from memory rather
than a file by using the tidy_parse_string() function:
tidy_parse_string($data [, $options [, $encoding]]);
Where $data is a string representing the document to
parse and $encoding is the character set the data is
stored in. As was the case with the tidy_parse_file()
function, I will temporarily ignore the $options parameter and discuss it in detail later.
As I mentioned earlier, once the document has been
parsed, the $tidy resource returned represents the document in memory. It can either be displayed immediately by using the tidy_get_output() function (or by
treating the $tidy resource as a string), or be further
manipulated as we will do shortly.
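As a rough illustration of both the naming rules and in-memory parsing, the sketch below derives a method name from a procedural name and then uses it on a document parsed with tidy_parse_string(). The helpers tidy_method_name() and repair_fragment() are invented for this example, and the rule is applied exactly as stated in this article; check the manual before assuming that every procedural function has a method counterpart.

```php
<?php
// Hypothetical helper: applies the naming rule described in the article,
// e.g. "tidy_clean_repair" becomes "cleanRepair".
function tidy_method_name($procedural)
{
    $name  = preg_replace('/^tidy_/', '', $procedural);
    $words = explode('_', $name);
    $first = array_shift($words);
    return $first . implode('', array_map('ucfirst', $words));
}

// Parses markup held in a string and returns the repaired document,
// or null when the tidy extension is not loaded.
function repair_fragment($html)
{
    if (!extension_loaded('tidy')) {
        return null;
    }
    $tidy = tidy_parse_string($html);
    if ($tidy === false) {
        return null;
    }
    $method = tidy_method_name('tidy_clean_repair'); // "cleanRepair"
    $tidy->$method();
    return tidy_get_output($tidy);
}

echo tidy_method_name('tidy_clean_repair'), "\n";
$repaired = repair_fragment('<p>Hello <b>world');
echo ($repaired === null) ? "tidy extension not loaded\n" : $repaired;
```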
Cleaning and Repairing Documents
When a document is parsed through the Tidy extension, it is only modified as necessary to make the
markup syntactically correct according to the configuration associated with it. The second phase of using
Tidy, called the "clean and repair" stage, further applies
configuration options to the document. This process is
manifested in the tidy_clean_repair() function with
the following syntax:
tidy_clean_repair($tidy);
Where, as expected, $tidy is the tidy resource representing the document. Since we have not discussed
configuration options in Tidy at all yet, let's introduce
them now.
Tidy Configuration Options
For any given document parsed by Tidy, there are an
incredible number of options which can be set to control different aspects of how the document will ultimately be rendered. These options range from the output format (HTML, XHTML, etc.), to the way the document will look (e.g. indented tags, wrapping length),
and more. In fact, the vast majority of Tidy's abilities are
taken advantage of by setting different combinations of
configuration options.
Note 1
Because of the sheer number of Tidy configuration options available, only a brief
cross-section will be discussed. For a complete reference, consult the Tidy homepage
at http://tidy.sourceforge.net/
To modify the current configuration of a document,
options can be set in a number of different ways (all of
which occur prior to the parsing of the document). For
now, we'll look at the run-time method of setting
options by taking a second look at the
tidy_parse_file() and tidy_parse_string() functions. As you may recall, when I first introduced these
functions I ignored the $options parameter of each.
This parameter, as you might expect, controls the configuration for the document. This value can be one of two
things: either an associative array containing configuration options and their respective values, or a string containing the path and filename of a Tidy configuration
file.
To begin, let's take a look at setting configuration
options through the use of an array. Consider the following code:
<?php
/* Define the tidy configuration options.
   In this case, output the document in XHTML format
   and set the line wrap for the markup to 1024 characters */
$options = array('output-xhtml' => true, 'wrap' => 1024);
/* Pass the options to Tidy */
$tidy = tidy_parse_file("http://www.phparch.com/", $options);
tidy_clean_repair($tidy);
echo $tidy;
?>
In the example above, we are modifying the values of
two Tidy configuration options, output-xhtml and wrap,
which instruct Tidy to generate output in XHTML format and to wrap the markup at 1024 characters per line.
This configuration is then applied to the php|architect
web site and an XHTML 1.0 version of the document is
sent as output to the browser or console.
As an alternative to setting configuration options
using an associative array, the $options parameter can
also be a string representing a file on the local file system that defines the options you would like to set.
Below is the content of an example Tidy configuration
file, which sets a number of different options spanning
the range of types:
indent-spaces: 4
indent: auto
tidy-mark: no
show-body-only: yes
new-blocklevel-tags: mytag, anothertag
Thus, to duplicate the options defined in the example
above, the following would be used for the contents of
the configuration file:
wrap: 1024
output-xhtml: yes
Assuming that this file was saved as myconfig.tcfg in
the /usr/local/etc/tidy directory, the following
script could be used to duplicate our previous
example:
<?php
/* Pass the options to Tidy */
$tidy = tidy_parse_file("http://www.phparch.com/",
                        "/usr/local/etc/tidy/myconfig.tcfg");
tidy_clean_repair($tidy);
echo $tidy;
?>
Note 2
Unlike documents that need to be parsed,
Tidy configuration files must be stored in
the local file system and cannot be fetched
from a remote resource.
Setting a Default Configuration
The use of Tidy configuration files is a powerful feature,
as it allows developers to create Tidy "profiles" that
allow them to process many different types of markup
in a very logical fashion. However, Tidy configuration
files can also be used to change the default configuration of a document when it is parsed. To define a
default configuration file, the tidy.default_config
php.ini configuration directive is used. Simply set this
directive to the path and filename of a Tidy configuration file and it will automatically be applied any time a
new document is parsed.
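Because the array form and the file form describe the same settings, it can be convenient to generate one from the other. The following is a minimal sketch, assuming the simple "name: value" file format shown in this article, with booleans written as yes/no; tidy_config_text() is a made-up helper, not part of the extension. The last lines read the tidy.default_config directive with PHP's generic ini_get() function.

```php
<?php
// Hypothetical helper: renders an options array in the
// "name: value" configuration file format shown in the article.
function tidy_config_text(array $options)
{
    $lines = array();
    foreach ($options as $name => $value) {
        if (is_bool($value)) {
            $value = $value ? 'yes' : 'no';
        } elseif (is_array($value)) {
            $value = implode(', ', $value);   // e.g. lists of tag names
        }
        $lines[] = $name . ': ' . $value;
    }
    return implode("\n", $lines) . "\n";
}

$options = array('wrap' => 1024, 'output-xhtml' => true);
echo tidy_config_text($options);

// ini_get() returns false when the directive is unknown,
// for instance when the tidy extension is not loaded.
$default = ini_get('tidy.default_config');
echo 'Default config file: ', ($default ? $default : '(none)'), "\n";
```

Writing the returned text to a local file gives you a profile that can be passed as the $options string or named in php.ini.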
Short Hand Tidying
Since the use of configurations in conjunction with calls
to the tidy_clean_repair() and tidy_get_output()
functions can be lengthy, the Tidy library provides a
shortcut that combines parsing, cleaning and output into a single function call
(actually, two similar functions). These functions are
tidy_repair_file() and tidy_repair_string(), whose
syntax is as shown:
tidy_repair_file($filename [, $options [, $encoding [, $use_inc_path]]]);
tidy_repair_string($data [, $options [, $encoding]]);
Where $filename and $data represent the document
(either as a file or in a string), $options is an associative
array of options (or a Tidy configuration file), and
$encoding is the character set to use when reading the
input document. The final parameter of the
tidy_repair_file() function, $use_inc_path, is a
Boolean indicating if Tidy should search the PHP
include path for the input file if it is not initially found.
When executed, these functions will parse and
clean/repair the input document using the specified
configuration and return a string containing the final
output:
<?php
$content = tidy_repair_file("http://www.phparch.com/",
                            "/usr/local/lib/tidy/myconfig.tcfg");
echo $content;
/*
The above is identical to:
$tidy = tidy_parse_file("http://www.phparch.com/",
                        "/usr/local/lib/tidy/myconfig.tcfg");
tidy_clean_repair($tidy);
$content = tidy_get_output($tidy);
echo $content;
*/
?>
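The one-call form described in this section works on strings as well as files. Here is a small sketch using tidy_repair_string() on an in-memory fragment; the wrapper quick_repair() is a hypothetical name, and it returns null rather than failing when the extension is not available.

```php
<?php
// Repairs a markup string in one call; returns null if tidy is unavailable.
function quick_repair($html, array $options = array())
{
    if (!function_exists('tidy_repair_string')) {
        return null;
    }
    $result = tidy_repair_string($html, $options);
    return ($result === false) ? null : $result;
}

// show-body-only keeps the output limited to the repaired fragment.
$clean = quick_repair('<p>Unclosed <i>markup', array('show-body-only' => true));
echo ($clean === null) ? "tidy extension not loaded\n" : $clean;
```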
Using the Tidy Parser Abilities
Along with all of the functionality provided by the Tidy
extension to validate,
manipulate, and repair
markup documents, Tidy is
also, of course, an excellent
parser of markup documents.
When Tidy parses a document, it generates a "document tree" representing its
contents in a hierarchical
fashion. This tree can be
accessed from within PHP
through a series of objects,
allowing you to pull out
entire blocks of HTML or
other markup without the
need for messy regular
expressions or another
extension.
To understand how to
use this feature of the Tidy
extension, first you must
understand how Tidy represents a document. As
stated, Tidy generates a
document tree based on
the input document, consisting of a number of parent and child nodes. When
dealing with HTML or
XHTML, these nodes represent tags within the document. Consider, for example, the following HTML
code:
<HTML>
<HEAD>
<TITLE>My document</TITLE>
</HEAD>
<BODY>
<B>This is <I>An example</I> Document!</B>
</BODY>
</HTML>
Internally, when this document is parsed by Tidy, the
structure shown in Figure 1 is generated. As you can
see, every HTML tag within the document is stored as
a node within the document tree. These nodes are represented in PHP by an internal class named tidy_node.
The structure of this class is as follows (note, the following is pseudo-PHP for illustration only):
Figure 1
<?php
class tidy_node {
    /* The string value of this node and all of its
       child nodes */
    public $value;
    /* The tag name i.e. 'HTML' or 'BODY' */
    public $name;
    /* A numeric value representing the node type */
    public $type;
    /* A numeric value representing type of tag (if
       any) */
    public $id;
    /* An associative array of tag attributes */
    public $attribute[];
    /* An indexed array of child nodes */
    public $child[];
    public function hasChildren();
    public function hasSiblings();
    public function isComment();
    public function isHtml();
    public function isText();
    public function isJste();
    public function isAsp();
    public function isPhp();
}
?>
Through the properties and methods available in the
tidy_node class, you are able to access all of the nodes
within a document tree. In order to retrieve the first
instance of the tidy_node class from the Tidy extension,
there are four different methods which you can use:
root(), head(), html() and body(). Each of these methods returns an instance of the tidy_node class representing the node for the document tag with the same
name (i.e. the html() method returns the node for the
<HTML> tag). As this aspect of Tidy is only available
using an object-oriented syntax, no procedural equivalent exists for node-retrieval functions:
<?php
$tidy = tidy_parse_file("http://www.phparch.com");
/* Get the node representing the <BODY> HTML Tag */
$body_node = $tidy->body();
echo "The HTML Tag for this node is: {$body_node->name}";
?>
When executed, you can expect the output to be:
The HTML Tag for this node is: body
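To make the document tree concrete, the sketch below walks the tree for the sample HTML document and prints one indented line per element. It uses only the name and child properties and the hasChildren() method described in this article; the function names are invented for the example, and no type hint is used because released PHP versions name the node class tidyNode rather than tidy_node.

```php
<?php
// Returns one line per element node, indented by depth in the tree.
function outline($node, $depth = 0)
{
    $text = '';
    if ($node->name != '') {                 // text nodes carry no tag name
        $text .= str_repeat('  ', $depth) . $node->name . "\n";
    }
    if ($node->hasChildren()) {
        foreach ($node->child as $child) {
            $text .= outline($child, $depth + 1);
        }
    }
    return $text;
}

// Parses a string and outlines it, or returns null without the extension.
function document_outline($html)
{
    if (!extension_loaded('tidy')) {
        return null;
    }
    $tidy = tidy_parse_string($html);
    return outline($tidy->html());
}

$html = '<HTML><HEAD><TITLE>My document</TITLE></HEAD>'
      . '<BODY><B>This is <I>An example</I> Document!</B></BODY></HTML>';

$result = document_outline($html);
echo ($result === null) ? "tidy extension not loaded\n" : $result;
```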
Note 3
The tidy_node class is also an overloaded
class, meaning that you can treat an instance
of the class as a string to retrieve the $value
property of the class:
<?php
echo $mynode->value; /* You can use this method */
echo $mynode;        /* Or this one! */
?>
One of the most important features of the Tidy extension's parsing abilities is the $value attribute of each
tidy_node instance. Specifically, the contents of this
property will not only be the value of the current node,
but all of the nodes spawned as children from it. Thus,
the value of a <TABLE> node will contain the contents of
the entire table, making pulling large complex sections
of HTML out of documents a snap.
When parsing HTML, another incredibly useful attribute of the tidy_node class is the $id property, which
holds an integer value indicating the HTML tag
this node represents. These integer values correspond
to a set of constants registered by the Tidy extension
and provide a quick way to identify HTML tags from
within a PHP script. All tag constants defined by the
Tidy extension are in the format of TIDY_TAG_<TAGNAME>
(where <TAGNAME> is the uppercase tag name you are
interested in, such as TIDY_TAG_BODY for the <BODY>
tag).
To retrieve a particular attribute of a tag within a document (for instance the HREF attribute of an anchor <A>
tag), the $attribute associative array is used by accessing the key with the name of the attribute in question.
To demonstrate all of this functionality, consider the
dump_nodes() function below, which extracts all of the
URLs from anchor (<A>) tags in the provided document:
<?php
function dump_nodes(tidy_node $node, &$urls = NULL)
{
    $urls = (is_array($urls)) ? $urls : array();
    if(isset($node->id)) {
        if($node->id == TIDY_TAG_A) {
            /* Anchors without an HREF (named anchors) are skipped */
            if(isset($node->attribute['href'])) {
                $urls[] = $node->attribute['href'];
            }
        }
    }
    if($node->hasChildren()) {
        foreach($node->child as $c) {
            dump_nodes($c, $urls);
        }
    }
    return $urls;
}
$tidy = tidy_parse_file("http://www.phparch.com/");
tidy_clean_repair($tidy);
$urls = dump_nodes($tidy->html());
print_r($urls);
?>
Looking at this code, the dump_nodes() function
accepts two parameters: the first is a node of type
tidy_node, and the second is an internal-use parameter
that we need during the recursion process to store the
array of URLs retrieved from the document. When executed, the dump_nodes() function begins by determining whether the current node is a known HTML tag by
checking for the existence of the $id property of the
node. If the latter exists, the function then proceeds to
check if this node is an anchor tag by comparing the
value of the $id property to the TIDY_TAG_A constant. If
we are indeed on an anchor tag, the function checks
for and saves the value of the $attribute['href'] array
key into the $urls array.
Once it has finished processing the current node,
regardless of its type, the dump_nodes() function proceeds to look for child nodes and handle each of
them in the same fashion recursively. Ultimately, this
script will navigate the entire document tree (starting
from the node provided to it initially) and return an
array of URLs found within.
Summary
As you can see, the Tidy extension for PHP 5 is an
incredibly useful and powerful extension which, when
used properly, can make your life as a developer much
easier. Furthermore, with judicious caching
of the output of your web site, making documents web
standard compliant won't even introduce an additional
load on your server. For more information on the Tidy
extension visit the PHP Manual at
http://www.php.net/tidy or the author's web site
http://www.coggeshall.org/.
About the Author
John Coggeshall is a PHP consultant and author who started losing sleep
over PHP around five years ago. Lately you'll find him losing sleep meeting deadlines for books or online columns on a wide range of PHP topics. You can find his work online at O'Reilly Networks onlamp.com and
Zend Technologies, or at his website http://www.coggeshall.org/. John
has also contributed to Apress' Professional PHP4 and is currently in the
process of writing the PHP Developer's Handbook published by Sams
Publishing.
To Discuss this article:
http://forums.phparch.com/135
Security Corner:
Shared Hosting
by Chris Shiflett
Welcome to another edition of Security Corner. This month, I have chosen a
topic that is a concern for many PHP developers: shared hosting. Through my
involvement with the PHPCommunity.org project, my contributions to mailing
lists, and my frequent browsing of PHP blogs and news sites, I have seen this
topic brought up in various incarnations. Some people are concerned about
hiding their database access credentials, some are concerned about safe_mode
being enabled or disabled, and others just want to know what they should be
concerned about, if anything.
As a result, I have decided to address these concerns in as much detail as possible, so that you will have a better understanding and appreciation of shared
hosting. After reading this article, you may decide that there is nothing for you
to be concerned about, or you may be terrified. Regardless, I hope to at least
provide you with clarity.
Shared Hosting
Since the advent of HTTP/1.1 and the required Host header, shared hosting has become very popular. Prior to
HTTP/1.1, there was no direct way for a Web client to
identify the domain from which it wanted content. The
browser simply determined the IP address associated with the domain entered by the user, and sent its
request there. An HTTP/1.0 request looks something like
the following, at a minimum:
GET /path/to/index.php HTTP/1.0
Notice that the URL presented in the request does not
include the domain name. The domain is unnecessary information under the assumption that only one
domain is served by the particular Web server (and that
domains have a one-to-one relationship with IP addresses). With HTTP/1.1, Host becomes a required header, so
this request, at a minimum, must be expressed as follows:
GET /path/to/index.php HTTP/1.1
Host: www.example.org
With this format, a single Web server (with a single IP
address) can serve an arbitrary number of domains,
because the client must identify the domain from which
it is requesting content. As a direct result, a
hosting company can host many domains on a single
server, and it is not necessary to have a separate public IP
for each domain. This yields much less expensive hosting and has spurred a tremendous growth in the Web
itself. Of course, this has been a driving force behind early
PHP adoption as well.
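The difference between the two request formats is easy to see in code. The sketch below builds the minimal request text for each protocol version; build_request() is a made-up helper, and the Connection: close line is an addition of this example (it is not required by the protocol) so that a server would close the socket after responding.

```php
<?php
// Builds the minimal request text for either protocol version.
// Without a host the request is HTTP/1.0; with one it is HTTP/1.1.
function build_request($path, $host = null)
{
    if ($host === null) {
        return "GET {$path} HTTP/1.0\r\n\r\n";
    }
    return "GET {$path} HTTP/1.1\r\n"
         . "Host: {$host}\r\n"
         . "Connection: close\r\n\r\n";
}

echo build_request('/path/to/index.php');
echo build_request('/path/to/index.php', 'www.example.org');
```

On the receiving side, a PHP script can read the requested domain from $_SERVER['HTTP_HOST'], which is what lets one server tell its hosted domains apart.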
The downside to shared hosting is that it incurs some
security risks that do not exist in a dedicated server environment. Some of these risks are mitigated by PHP's
safe_mode directive, but a solid understanding of the risks
is necessary to appreciate what safe_mode does (and
what it doesn't). Because of this, I will begin by introducing some of the unique risks associated with shared hosting.
Filesystem Security
A true multi-user operating system, such as Linux, is built
upon a fundamentally secure approach to user permissions. When you create a file, you specify a set of permissions for that file, either explicitly or implicitly by virtue of
the fact that you are creating that file within a specific
context. This is achieved by assigning each file both user
and group ownership as well as a set of privileges for
three groups of people:
1. The user who owns the file
2. All users in the group
3. All users on the server
These categories of people are referenced as user,
group, and other, respectively. The privileges that you can
assign each category of user include read, write, and execute (there are some other details, but they are irrelevant
59
SECURITY CORNER
-rw-r--r-1 chris
12:34 myfile
shiflett
4321 May
21
This file, myfile, is owned by the user chris and the
group shiflett. The permissions are identified as -rw-r-r--, and this can be broken into the leading hyphen
(indicating a normal file, as opposed to, say, a directory),
and then three groups of permissions:
1. rw- (read, write, no
execute)
2. r-- (read, no write, no
execute)
3. r-- (read, no write, no
execute)
These three sets of permissions correspond directly to
the three groups of users: user (chris), group (shiflett),
and other.
Linux users are probably familiar with these permissions
and how to change them with commands such as chown
and chmod. For a more thorough explanation of filesystem
http://www.linuxsecurity.com/
security,
see
docs/LDP/Security-HOWTO/file-security.html .
As a user on a shared host, it is unlikely that you will
have read access to many files outside of your own home
directory. You certainly shouldn't be able to browse the
home directory or document root of other users.
However, with a simple PHP script, this can be possible.
Browsing with PHP
For this discussion, we'll assume that the Web server is
Apache and that it is running as the user nobody. As a
result, in order for Apache to be able to serve your Web
content, that content must be readable by the user
nobody. This includes images, HTML files, and PHP scripts.
Thus, if someone could gain the same privileges as
nobody on the server, they would at least have access to
everyone's Web content, even if precautions are taken to
prevent access to any other user.
Whenever Apache executes your PHP scripts, it of
course does so as the user nobody. Combine this with
PHP's
rich
set
of
filesystem
functions
(http://www.php.net/filesystem), and you should begin to
realize the risk. To make the risk clearer, I have written a
very simplistic filesystem browser in PHP (See Listing 1).
This script outputs the current setting for the safe_mode
directive (for informational purposes) and allows you to
browse the local filesystem. This is an example of the type
of script an attacker might write, although several
enhancements would likely be added to make malicious
actions more convenient.
One of the first places an attacker might want to glance
is at /etc/passwd. This is achieved by either browsing
there from the root directory (where the script begins) or
visiting the URL directly (by calling the script with
?file=/etc/passwd).
This gives an attacker a list of users and their home
directories. Another file of interest might be httpd.conf.
March 2004
●
PHP Architect
●
www.phparch.com
Assuming each user's home directory has a directory
called public_html for their respective document roots,
an attacker can browse another user's Web content by
calling the script with ?dir=/home/victim/public_html/.
A security-conscious user will most likely keep sensitive
configuration files and the like somewhere outside of document root. For example, perhaps the database username
and password are stored in a file called db.inc and included with code similar to the following:
include('../inc/db.inc');
This seems wise, but unfortunately an attacker can still
view this file by calling the browse.php script with
?file=/home/victim/inc/db.inc. Why does this necessarily work? For the include() call to be successful,
Apache must have read access to the file. Thus, this script
must also have access. In addition, because the user's
login credentials are often the same as the database
access credentials, this technique will likely allow an
attacker to compromise any account on the server (and
launch additional attacks from compromised accounts).
There is also the potential for an attacker to use this
same script to gain access to anyone's session data. By just
browsing the /tmp directory (?dir=/tmp/), it is possible to
read any session that is stored there. With a few enhancements to the script, it could be even easier to view and/or
modify session data from these files. An attacker could
visit your application and then modify the associated session to grant administrator access, forge profile information, or anything of the like. And, because the attacker
can browse the source to your applications, this doesn't
even require guesswork. The attacker knows exactly what
session variables your applications use.
Of course, it is much safer to store session data in your
own database, but we have just seen how an attacker can
gain access to that as well. Luckily, safe_mode helps prevent these attacks.
to the present discussion). To illustrate this further, consider the following file listing:
The safe_mode Directive
The safe_mode directive is specifically designed to try to
mitigate some of these shared hosting concerns. If you
practice running the script from Listing 1 on your own
server, you can experiment with enabling safe_mode and
observing how much less effective the script becomes.
When safe_mode is enabled, PHP checks to see whether
the owner of the script being executed matches that of
the file being opened. Thus, a PHP script owned by you
cannot open files that are not owned by you. Your PHP
scripts are actually more restricted than you are from the
shell when safe_mode is enabled, because you likely have
read access to files not specifically owned by you. This
strict checking can be relaxed somewhat by enabling the
safe_mode_gid directive, which relaxes the checking to
the group instead of the user.
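To make the ownership rule concrete, here is a minimal userland sketch of the comparison described above. It is not PHP's actual implementation (the real check lived inside the engine, and safe_mode itself was removed in PHP 5.4). The function name and its parameters are hypothetical; only fileowner() and filegroup() are real PHP functions.

```php
<?php
// Hypothetical helper: approximates the safe_mode ownership rule in userland.
// $scriptUid and $scriptGid stand in for the owner and group of the running script.
function safe_mode_would_allow($path, $scriptUid, $scriptGid = null)
{
    $fileUid = @fileowner($path);
    if ($fileUid === false) {
        return false; // file is missing or cannot be stat'ed
    }
    if ($fileUid === $scriptUid) {
        return true;  // same owner: allowed under plain safe_mode
    }
    // With safe_mode_gid, a matching group is enough
    if ($scriptGid !== null) {
        return @filegroup($path) === $scriptGid;
    }
    return false;
}
```

A script could call this as safe_mode_would_allow('/etc/passwd', getmyuid()), since getmyuid() returns the UID of the script's owner. On a typical shared host that call would return false, because /etc/passwd belongs to root.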
Because safe_mode can cause problems for users who
have a legitimate reason to access files owned by another
user, there are a few other directives that allow even more
flexibility. The safe_mode_include_dir directive can specify one or more directories
from which users can include() files, regardless of ownership.
I encourage you to read http://www.php.net/features.safe-mode for more information.
A similar PHP directive is open_basedir. This directive
allows you to restrict all PHP scripts to only be able to
open files within the directories specified by this directive,
regardless of whether safe_mode is enabled.

Listing 1
<?php
echo "<pre>\n";

// Report whether safe_mode is in effect
if (ini_get('safe_mode'))
{
    echo "[safe_mode enabled]\n\n";
}
else
{
    echo "[safe_mode disabled]\n\n";
}

// ?dir= lists a directory, ?file= prints a file; the default is the root directory
if (isset($_GET['dir']))
{
    ls($_GET['dir']);
}
elseif (isset($_GET['file']))
{
    cat($_GET['file']);
}
else
{
    ls('/');
}

echo "</pre>\n";

// List a directory, linking every readable entry back to this script
function ls($dir)
{
    $handle = dir($dir);
    // Compare against false so that an entry named "0" does not end the loop
    while (false !== ($filename = $handle->read()))
    {
        $size = filesize("$dir$filename");

        if (is_dir("$dir$filename"))
        {
            if (is_readable("$dir$filename"))
            {
                $line = str_pad($size, 15);
                $line .= "<a href=\"{$_SERVER['PHP_SELF']}?dir=$dir$filename/\">$filename/</a>";
            }
            else
            {
                $line = str_pad($size, 15);
                $line .= "$filename/";
            }
        }
        else
        {
            if (is_readable("$dir$filename"))
            {
                $line = str_pad($size, 15);
                $line .= "<a href=\"{$_SERVER['PHP_SELF']}?file=$dir$filename\">$filename</a>";
            }
            else
            {
                $line = str_pad($size, 15);
                $line .= $filename;
            }
        }

        echo "$line\n";
    }
    $handle->close();

    return true;
}

// Print a file's contents, escaped for display
function cat($file)
{
    ob_start();
    readfile($file);
    $contents = ob_get_clean(); // fetch the buffer and stop buffering
    echo htmlentities($contents);

    return true;
}
?>

Bypassing safe_mode
Is there a known flaw in safe_mode that allows people to
bypass it? Not to my knowledge, but keep in mind that
safe_mode only protects against people using PHP to gain
access to otherwise restricted data. safe_mode does nothing to protect you against someone on your shared server who writes a similar program in another language. In
fact, the manual states: "It is architecturally incorrect to
try to solve this problem at the PHP level, but since the
alternatives at the web server and OS levels aren't very
realistic, many people, especially ISP's, use safe mode for
now."
Consider the following CGI script written in Bash:
#!/bin/bash
echo "Content-Type: text/plain"
echo ""
cat /etc/passwd
This will output the contents of /etc/passwd as long as
Apache can read that file. So, we're back to the same
dilemma. While the attacker can't use the script in Listing
1 to browse the filesystem when safe_mode is enabled,
this doesn't prevent the possibility of similar scripts written in other languages.
What Can You Do?
You probably knew that a shared host was less secure
than a dedicated one long before this article. Luckily,
there are some solutions to a few of the problems I have
presented, but not all. There are basically two main steps
that you want to take on a shared host:
1. Keep all sensitive data, such as session data,
stored in the database.
2. Keep your database access credentials safe.
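As a rough illustration of the first step, the sketch below stores session data in a database table through PHP's SessionHandlerInterface. It assumes PDO with a driver that accepts REPLACE INTO (MySQL and SQLite both do); the class name, table name, and column names are hypothetical, and a production handler would also need locking and error handling.

```php
<?php
// Sketch only: session storage in a database table via PDO.
// The class, table, and column names are illustrative assumptions.
class DbSessionHandler implements SessionHandlerInterface
{
    private $pdo;

    public function __construct(PDO $pdo)
    {
        $this->pdo = $pdo;
        $this->pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
        $this->pdo->exec(
            'CREATE TABLE IF NOT EXISTS sessions (
                id VARCHAR(128) PRIMARY KEY,
                data TEXT NOT NULL,
                updated_at INTEGER NOT NULL
            )'
        );
    }

    public function open($path, $name): bool
    {
        return true; // the PDO connection is already open
    }

    public function close(): bool
    {
        return true;
    }

    public function read($id): string
    {
        $stmt = $this->pdo->prepare('SELECT data FROM sessions WHERE id = ?');
        $stmt->execute(array($id));
        $data = $stmt->fetchColumn();
        return $data === false ? '' : (string) $data; // empty string means "no session yet"
    }

    public function write($id, $data): bool
    {
        $stmt = $this->pdo->prepare(
            'REPLACE INTO sessions (id, data, updated_at) VALUES (?, ?, ?)'
        );
        return $stmt->execute(array($id, $data, time()));
    }

    public function destroy($id): bool
    {
        $stmt = $this->pdo->prepare('DELETE FROM sessions WHERE id = ?');
        return $stmt->execute(array($id));
    }

    public function gc($maxLifetime): int
    {
        // Remove sessions that have not been written to within the lifetime
        $stmt = $this->pdo->prepare('DELETE FROM sessions WHERE updated_at < ?');
        $stmt->execute(array(time() - $maxLifetime));
        return $stmt->rowCount();
    }
}
```

An application would register the handler with session_set_save_handler(new DbSessionHandler($pdo), true) before calling session_start(). This only moves the session data out of /tmp; it is only as safe as the database credentials used to build $pdo, which is what the second step is about.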
The question is: how do you achieve the second goal?
If another user can potentially have access to any file that
we make available to Apache, it seems that there is
nowhere to hide the database access credentials. My
favorite solution to this problem is one that is described in
the PHP Cookbook by David Sklar and Adam
Trachtenberg.
The approach is to use environment variables to store
sensitive data (such as your database access credentials).
With Apache, you can use the SetEnv directive for this:
SetEnv DB_USER "myuser"
SetEnv DB_PASS "mypass"
Set as many environment variables as you need using
this syntax, and save this in a separate file that is not readable by Apache (so that it cannot be read using the techniques described earlier). In httpd.conf, you can include this file as follows:
Include "/path/to/secret-stuff"
Of course, you want to keep these include statements
within each user's VirtualHost block, otherwise all users
could access the same data.
Because Apache is typically started as root, it is able to
include this file while it is reading its configuration. Once
it is running as the user nobody, it can no longer access
this file, so other users cannot access this information with
clever scripts.
Once these environment variables are set, you can
access them in the $_ENV array. For example:
mysql_connect('localhost', $_ENV['DB_USER'],
$_ENV['DB_PASS']);
Because this information is stored in $_ENV, you need to
take care that this array is not output in any of your
scripts. In addition, a call to phpinfo() reveals all environment variables, so you should ensure that you have no
public scripts that execute this function.
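One way to keep this handling in a single place is a small helper that reads both values and fails loudly if either is missing. The sketch below is an assumption-laden example, not part of the Cookbook recipe: the function name is hypothetical, and it uses getenv() rather than $_ENV because $_ENV can be empty when the variables_order setting omits "E".

```php
<?php
// Hypothetical helper: fetch the database credentials set with SetEnv.
// DB_USER and DB_PASS are the variable names used in the example above.
function db_credentials_from_env()
{
    $creds = array();
    foreach (array('DB_USER', 'DB_PASS') as $name) {
        $value = getenv($name);
        if ($value === false || $value === '') {
            // Name the missing variable, but never echo a credential value
            throw new RuntimeException("Missing environment variable: $name");
        }
        $creds[$name] = $value;
    }
    return $creds;
}
```

The returned array can be passed to whatever connection call the application uses. Note that the mysql_connect() family shown above was removed in PHP 7, so current code would hand these values to mysqli or PDO instead.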
Until Next Time...
Hopefully, you now understand some of the risks involved
with shared hosting and can take some steps to mitigate
them. While safe_mode is a nice feature, there is only so
much help it can provide in this regard. It should be clear
that these risks are actually independent of PHP, and this
is why other steps are necessary.
As always, I'd love to hear about your own solutions to
these problems. Until next month, be safe.
About the Author
Chris Shiflett is a frequent contributor to the PHP community and one of
the leading security experts in the field. His solutions to security problems
are often used as points of reference, and these solutions are showcased
in his talks at conferences such as ApacheCon and the O'Reilly Open
Source Convention, his answers to questions on mailing lists such as
PHP-General and NYPHP-Talk, and his articles in publications such as
PHP Magazine and php|architect. Security Corner, his new monthly column for php|architect, is the industry's first and foremost PHP security
column.
Chris is the author of the HTTP Developer's Handbook (Sams
Publishing) and is currently writing PHP Security (O'Reilly and
Associates). In order to help bolster the strength of the PHP community, he is also leading an effort to create a PHP community site at
PHPCommunity.org. You can contact him at [email protected] or visit
his Web site at http://shiflett.org/.
Tips & Tricks
By John W. Holmes
Creating a Free MSSQL Development Environment on Windows
MSDE is Microsoft's Desktop Engine for their SQL Server. Why would you want to install such a thing? Well, assuming you are a professional developer creating applications that you intend for other people to use, it's not always a good idea to limit yourself to a single database. Designing your "killer app" to work only with MySQL will limit who can actually use your program. Believe it or not, not everyone can install MySQL just to use your program!
Installing MSDE on your machine (your Windows machine, obviously) will give you a full-featured install of SQL Server for free that you can test your code with. Using your own database abstraction layer, PEAR, or ADOdb, you can test that your application actually works with different database systems, like you say it does.
First, a couple of caveats. This is just one method I've found that works. Obviously, if you have a full installation of SQL Server or can afford one, you should go that route. This is not meant for a production machine, only development. I could not get the native MSSQL PHP functions to work with MSDE, so we'll also have to resort to ODBC. While this is going to be less efficient, it's going to work and allow you to test your programs; that's the end result we're shooting for anyhow.
The first step is to download MSDE from http://www.microsoft.com/sql/msde/howtobuy/msdeuse.asp and extract the file to your hard drive. Within the extracted directory, you'll notice a setup.exe file. There is also a ReadmeMSDE2000A.htm file that contains installation directions in addition to what I'll be outlining here.
Step two is getting to a command line and running the setup.exe program with some parameters. We're going to install a default instance of the program configured to use mixed mode authentication, meaning that it will not be tied to Windows authentication and you'll be able to use plain-text authentication with it. The additional configuration parameters you can use are explained in the HTML file. Run the following command:
setup.exe SAPWD="password" SECURITYMODE=SQL
This sets the sa (or root) password to password and configures the mixed mode authentication. Obviously, you can (and should) pick a stronger password for your own needs, even if this is just for a development machine.
Listing 1
$ser = "COMPUTERNAME"; // the name of the SQL Server
$db = "tempdb"; // the name of the database
$user = "sa"; // a valid username
$pass = "password"; // a password for the username

$conn = odbc_connect("Driver={SQL Server};Server=" . $ser . ";Database=" . $db, $user, $pass);

Listing 2
include('adodb/adodb.inc.php');
$conn = ADONewConnection('odbc_mssql');

$ser = "COCONUT"; // the name of the SQL Server
$db = "tempdb"; // the name of the database
$user = "sa"; // a valid username
$pass = "password"; // a password for the username

$conn->Connect("Driver={SQL Server};Server=$ser;Database=$db;", $user, $pass);
That command will trigger the setup program,
which will install the necessary services for MSDE to
run. MSDE will be configured to start when the OS
starts by default, but you can change that from the
Services menu of your Control Panel. If the installation
did not trigger a reboot (I wouldn't worry too much
about that), you may have to go in and start the service for the first time.
The next step is optional, but, if you go back and
visit the MSDE website, there are a number of third
party tools offered for download or purchase. Start at
http://www.microsoft.com/sql/msde/partners/default.asp
to see the tools. I'd recommend you download the
DbaMgr SQL Tools program (DbaMgr2k) from
http://www.asql.biz/DbaMgr.shtm as it will give you a free
GUI for your MSDE installation. DbaMgr2k will allow
you to create the necessary databases, tables, relationships, etc, for your application.
Now, like I alluded to before, it'd be great to just
uncomment the php_mssql.dll line and load the
MSSQL extension in php.ini, but I could not get those
functions to work with the pared-down version of SQL
Server that MSDE installs. In fact, the mssql_connect()
function would not connect to MSDE given a wide
variety of connection options (and even MSDE installation options). Thus, we'll have to resort to ODBC. If
anyone has any experience or instructions to the contrary, please share them to [email protected] or
post a message in the php|a forums at
http://www.phparch.com/discuss .
Ensure you have the ODBC extension enabled for
your installation of PHP and you will be able to connect
to MSDE using the code shown in Listing 1. Substitute
your actual computer name, database, login and password to get this to work. During my tests, I found that
you must use the computer name, as "localhost" or
"127.0.0.1" does not work (you will not be able to connect to MSDE server). Ensure the odbc_connect()
parameters are all on one line, also.
If you're using ADOdb or PEAR, then you follow their
instructions as if you were connecting through ODBC
to MSSQL. Example connection code for ADOdb is
shown in Listing 2. The method using PEAR would be
similar and is discussed in the PEAR documentation.
From this point, you'll be able to use the ODBC or
abstraction layer functions to execute queries and
retrieve data from the MSDE server, or whatever else
your application is designed to do. You can use the
DbaMgr2k program to create new users, databases,
and tables, or do it from your queries. What you do
from this point is up to you, but you now have a functional SQL Server test environment for free.
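As a minimal sketch of what "executing queries" over this connection might look like, the helpers below build the connection string and collect the rows of a result. Both function names are hypothetical; odbc_exec(), odbc_fetch_array(), and odbc_free_result() are the real ODBC extension functions, and the sample query in the comment assumes the sysobjects table that SQL Server databases normally contain.

```php
<?php
// Hypothetical helper: build the driver connection string for odbc_connect().
function msde_dsn($server, $database)
{
    return 'Driver={SQL Server};Server=' . $server . ';Database=' . $database;
}

// Hypothetical helper: run a query on an open ODBC connection and return all rows.
function msde_query_all($conn, $sql)
{
    $result = odbc_exec($conn, $sql);
    if ($result === false) {
        return false; // the query failed; odbc_errormsg($conn) has the details
    }
    $rows = array();
    while ($row = odbc_fetch_array($result)) {
        $rows[] = $row; // each row is an associative array keyed by column name
    }
    odbc_free_result($result);
    return $rows;
}

// Example use (requires the ODBC extension and a running MSDE instance):
// $conn = odbc_connect(msde_dsn('COMPUTERNAME', 'tempdb'), 'sa', 'password');
// $rows = msde_query_all($conn, 'SELECT name FROM sysobjects');
```

Keeping the connection string in one function makes it easy to switch the server name between development machines, which matters here because "localhost" reportedly does not work.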
Detecting the Web Server
If you distribute programs that can be run under a variety of web servers and need to determine which one
you're running under (or within), PHP offers a useful
function named php_sapi_name(). This will actually
return the type of interface between the web server
and PHP. It can then be used to determine what the
web server is (if it isn't obvious) and to take appropriate action. You may want to
include different files based upon
whether PHP is running in CGI versus SAPI mode, for example. A gracious user lists the possible return
values from the function in the
errata on the manual page. The
values are shown in Figure 1.

Figure 1
• aolserver
• activescript
• apache
• cgi-fcgi
• cgi
• isapi
• nsapi
• phttpd
• roxen
• java_servlet
• thttpd
• pi3web
• apache2filter
• caudium
• apache2handler
• tux
• webjames
• cli
• embed
• milter

Figure 2
This is my subject
Bcc: my_email@my_domain.com
Reply-To: bad_email@evil_domain.com
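For the php_sapi_name() check described in this section, a small sketch of the branching might look like the following. The function name and the three category labels are hypothetical; php_sapi_name() is the only real API used, and the grouping of Figure 1's values is deliberately coarse.

```php
<?php
// Hypothetical helper: map a SAPI name to a coarse category.
function sapi_category($sapi)
{
    if ($sapi === 'cli') {
        return 'command-line';
    }
    if (strpos($sapi, 'cgi') === 0) {
        return 'cgi';   // matches both "cgi" and "cgi-fcgi"
    }
    return 'other';     // everything else in Figure 1, e.g. "apache" or "isapi"
}

$category = sapi_category(php_sapi_name());
// An application could branch on $category here, for example to
// include a different bootstrap file for CGI than for a server module.
```

Matching on the "cgi" prefix rather than an exact string keeps the check working for both CGI variants listed in Figure 1.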
Check Your Email Subject
If you've been a reader with us since the beginning, you may recognize the following tip. My editor said to avoid reusing tips, but I feel this one needs to be brought up again as I still find numerous live sites vulnerable to mail header injection.
Quite a few web sites have pages through which you can send an e-mail to someone. These can be used to contact the site administrator, send a message out to other users, or many other purposes. Often, the goal of the contact form is to hide the e-mail address of the recipient. This can be for convenience (to prevent spam) or for security (protecting the identity of recipients).
If your site is just using a normal <input> text box for the user to enter the subject of the message in, you may be unwittingly allowing your visitors to inject additional headers into the e-mail message. The user can download (save-as) your form and modify the <input> area into a <textarea> element. Then, he or she can enter a "subject" like the one shown in Figure 2.
When this subject is inserted into PHP's mail() function:
mail($to,$subject,$message,$headers);
the Bcc: and Reply-To: headers are also added to the message. Thus, the malicious user has now included themselves on the contents of the message, can see who all of the other recipients were, and the end user is unaware that there is now a bad or altered Reply-To: address.
Now, this may not matter on a simple web page where people are sending you questions about your cat, because all this will do is give the user a copy of the message they just typed. But imagine a web page that allows Alcoholics Anonymous members to contact and e-mail each other anonymously. Now malicious users can Bcc: themselves on messages and see who all of the recipients are, and possibly even intercept replies to the message if the altered Reply-To: address is not noticed. Or, how about the many forum scripts that let you e-mail another user, but do not show you their e-mail address? Now you can Bcc: yourself on the message you send them and find out their actual e-mail address. The recipient of the message will not know, unless they examine all of the e-mail headers closely.
The bottom line is that you are allowing the user to put unchecked headers into your mail messages. The solution to this is to filter the subject you receive from your text box for new line ("\n") and carriage return ("\r") characters.
You can remove everything from the first line break onward with:
substr($string, 0, strcspn($string, "\r\n"))
Another option is to just tell the user that there has
been an error if a new line is detected and attempt to
save as much information about the user as you can,
for future reference (as long as that is in compliance
with your privacy policy).
You should also be aware that this isn't a vulnerability that's limited to PHP scripts. Any scripting language
could be vulnerable if it takes user input and places it
directly into mail headers. This is something to keep in
mind if you also develop in other languages besides
PHP.
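The two options described in this tip, trimming the subject to its first line or rejecting it outright, can be sketched as a pair of small functions. Both function names are hypothetical, and the sample input is the "subject" from Figure 2; treat this as one possible shape for the check rather than a complete mail-validation routine.

```php
<?php
// Hypothetical helper: keep only the first line of a user-supplied header value.
function first_line_only($value)
{
    // Cut at the first carriage return or line feed, whichever comes first
    $clean = preg_replace('/[\r\n].*/s', '', $value);
    return trim($clean);
}

// Hypothetical helper: detect an injection attempt so it can be rejected and logged.
function has_header_injection($value)
{
    return preg_match('/[\r\n]/', $value) === 1;
}

$subject = "This is my subject\nBcc: my_email@my_domain.com\nReply-To: bad_email@evil_domain.com";
$safeSubject = first_line_only($subject);
```

Checking for both "\r" and "\n" matters because mail headers are separated by CRLF, and either character on its own may be enough to start a new header on some systems. The same filtering applies to any other user input that ends up in a header, such as a From address.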
Send in Your Tips … Help the
Community
If you have any tips that would help out your peers,
please send them to [email protected] to be published. Anyone contributing a tip that gets published
will get a free issue (added on to your subscription if
you already have one). Also, if you haven't noticed
already, there is a special Tips 'n Tricks forum in the
phparch.com forums for discussing what's in this or
any column of Tips 'n Tricks. If you have any comments
about what's been written, be sure to post them there!
About the Author
John Holmes is a Captain in the U.S. Army and a freelance PHP and
MySQL programmer. He has been programming in PHP for over 4 years
and loves every minute of it. He is currently serving at Ft. Gordon,
Georgia as a Company Commander with his wife and two sons.
exit(0);
I Am Jack's Total Lack of Linux Support
By Marco Tabini
Maybe it doesn't sound like it, but five days can be a really long time.
When I left for php|cruise at the beginning of the month (technically, at the end of February), I did so without any means of accessing the Internet. My old and faithful Pentium II Acer laptop having recently left me for a better place (in one of the local city dumpsters), I figured that a week without having to worry about e-mail and the like would be a fun and relaxing experience. Of course, I fully expected that I would have to deal with a whole lot of e-mail in my inbox once I got back, but, surely, that would be no big deal.
Boy, was I wrong. I had a miserable week and ended up bumming laptop time off of all the other attendees. While everybody was out having fun in Nassau, I was walking around its busy streets looking for a computer store (and found none, thankfully, given what happened once I finally got around to buying a new laptop). I'm sure that the tales of the overweight spirit that haunts the decks of the Sovereign of the Seas asking people "can I borrow your laptop for five minutes" will live on for years to come. On top of everything, when I finally did manage to get home my mailbox contained just short of two thousand messages, all "good" ones without spam or viruses. As a result, I'm still sorting through my inbox, a full week behind in my answers. Bummer!
The effects of the one-week withdrawal and mailbox-shock still lingering, I resolved to take a trip to my local Best Buy superstore and purchase a laptop. Since I don't like to sit on a decision for too long, a couple of hours later a brand new, top-of-the-line Pentium 4-based Hewlett-Packard laptop sat on my desk, ready to be used. Given that I don't much care for Windows as a desktop operating system (and, let's be clear, this is only my personal preference, rather than a pseudo-objective comment on the operating system itself), I immediately
started the installation process for
the latest version of Gentoo.
Now, before I go on I must
point out that I have never, ever,
had any problem making any sort
of hardware work flawlessly under
Linux. I have always been able to
find drivers compatible with
whatever I'd throw in my box, be
it a sound card, disk drive controller or network adapter.
Therefore, I had no qualms about
driving to the store and picking
out my new computer the way
normal (read: Windows) users
do—by choosing the one I liked
best.
Three days later, I was still trying
to make the basic elements of my
laptop work. I'm not talking about
anything fancy here, like superfast 3D acceleration or some esoteric power-saving mode. I was
actually having trouble getting
Linux to recognize my ATI IDE
chipset (honestly, why is ATI making IDE chipsets anyway? Can't they just stick to graphics cards?), or my audio card. Sure, the computer would run, but with no IDE chipset support, hard disk access managed to slow the whole system down to speed levels I had not seen since the days of the original IBM 8088-based motherboard (for those who never had the pleasure of working with that particular monster, the boot-time POST check, that set of tests that today you see flying by as they verify that your RAM works, used to take something like five or six minutes).
All this from a company that has been professing its support for Linux for a few years now. Not only do they not release any drivers for their laptop hardware, but they actually go the distance and use embedded chipsets for which no drivers are available, even in the open source community, because their manufacturers refuse to make the necessary information available in the public domain. Undoubtedly, developing drivers for more than one operating system is more expensive, but the process of writing device drivers for Linux is well documented and understood, and I honestly find it hard to believe that writing a single driver with bindings for the two operating systems would be so prohibitively expensive as to not justify the additional number of computers sold to those users who want to use one rather than the other. After all, making a case from a business perspective for only supporting Windows has to be more and more difficult as time goes by, since Linux is quite popular in the desktop arena as well.
Well, I guess H-P has been too busy acquiring Compaq (a move that I still don't understand) to pay attention to recent market trends. Laptop back in the box, I made my way back to the store for an exchange. Thankfully, my wife is a neatness freak who keeps everything, down to the last piece of paper, so putting everything back in its original package, as dictated by Best Buy's 14-day exchange/refund policy for computers, was easy enough, or so I thought. Once at the store, I was informed that they could not accept the laptop in the condition I brought it back in; in other words, with Linux installed on it (it originally came with Windows XP, which, once in my hands, lasted approximately the time required to reboot from the Gentoo CD). In order for them to do me the "honour" of honouring their return policy, I would have to either drive all the way back home and reinstall Windows or pay a $60 reinstallation fee.
I suppose this makes perfect sense. After all, they may actually want to be able to resell the laptop once I bring it back (hence the request for all the original packaging material and manuals), and most people who buy from a general electronics store probably do not have the skill required to install an operating system from scratch. However, two things upset me to no end. First, their policy said nothing about the original operating system. It didn't even refer to the product having to be in its original condition, only that all the accessories and packaging had to be returned, which is exactly what I did. Second, and most important, while the store clerk at the returns desk was making sure that I didn't discreetly return a couple of bricks instead of a $2,000 computer, I actually went and bought another laptop, much more expensive than the first. So here I was, being hassled about an unwritten policy and having to haggle over a $60 charge that was not advertised anywhere while I was ready to spend several hundred dollars more on another computer. Luckily, human beings turned out to be smarter than the policies they are supposed to follow, and I walked out of the store with another computer, this time a Toshiba Satellite M30, on which I am currently writing this column.
I have no complaints about the new Toshiba: everything "just works", down to the last detail. Of course, there is no "official" support for Linux, but the laptop is built using parts for which all sorts of drivers exist, so that running Linux on it is not a problem. Of course, one could say that the difference in price between the two computers justifies the fact that the cheaper one won't work under Linux. However, the price difference can easily be attributed to a larger hard drive, a newer processor (the Toshiba runs on Centrino technology) and better battery time, not to mention the fact that, in the past, I've been able to run Linux on $300 computers without any hardware problem.
So, score one for Toshiba. Linux is not "one of the other operating systems" any longer. It has survived and it is thriving, and can no longer be ignored by a computer manufacturer who wants to stay in the market.