php|architect
The Magazine For PHP Professionals
VOLUME III - ISSUE 3, MARCH 2004
www.phparch.com

Matchmaker Make Me a Match
Using the Amazon.com API through PHP and XML-RPC
Explore your HTML code with Tidy
PHP And WAP: Past, Present & Future
Testing Automation With PHP
PHP Ahoy! A Look at php|cruise Bahamas 2004
Plus: Tips & Tricks, Security Corner, Product Reviews and much more...

Licensed to: Joseph Crawford [email protected], User #63883

TABLE OF CONTENTS

Departments
Editorial
What's New!
Book Review: Flash MX 2004 for Rich Internet Applications
Product Review: Mambo Open Source: Content Management System
Security Corner: Shared Hosting, by Chris Shiflett
Tips & Tricks, by John W. Holmes
exit(0); I Am Jack's Total Lack of Linux Support, by Marco Tabini

Features
Connecting to Amazon.com Web Services with NuSOAP, by Alessandro Sfondrini
Matchmaker, Matchmaker Make Me A Match: An Introduction to Regular Expressions, by George Schlossnagle
Automated Testing For PHP Applications, by Dr. James McCaffrey
PHP Ahoy! A look at php|cruise, by Marco Tabini
WAP: Past, Present and Future, by Andrea Trasatti
Tidying up your HTML in PHP5, by John Coggeshall

PRINT EDITION SUBSCRIPTIONS

You'll never know what we'll come up with next! Existing subscribers can upgrade to the Print edition and save! Visit http://www.phparch.com/print for more information or to subscribe online.

Mail to: php|architect Subscription Dept., P.O. Box 54526, 1771 Avenue Road, Toronto, ON M5M 4N5, Canada

Subscription type | Price (CAD) | Approximate price (US*)
Canada/USA | $83.99 | $59.99
International Surface | $111.99 | $79.99
International Air | $125.99 | $89.99
Combo edition add-on (print + PDF edition)** | $14.00 | $10.00

Order form fields: Name, Address, City, State/Province, ZIP/Postal Code, Country, Payment type (VISA, Mastercard, American Express), Credit Card Number, Expiration Date, E-mail address, Phone Number, Signature, Date.

Your charge will appear under the name "Marco Tabini & Associates, Inc." Please allow up to 4 to 6 weeks for your subscription to be established and your first issue to be mailed to you.

*US pricing is approximate and for illustration purposes only. By signing this order form, you agree that we will charge your account in Canadian dollars for the "CAD" amounts indicated above. Because of fluctuations in the exchange rates, the actual amount charged in your currency on your credit card statement may vary slightly.
**Offer available only in conjunction with the purchase of a print subscription.

To subscribe via snail mail, please detach/copy this form, fill it out and mail it to the address above or fax it to +1-416-630-5057.

php|architect
Volume III - Issue 3, March 2004

Publisher: Marco Tabini
Editorial Team: Arbi Arzoumani, Peter MacIntyre, Eddie Peloke
Graphics & Layout: Arbi Arzoumani
Managing Editor: Emanuela Corso
Director of Marketing: J. Scott Johnson [email protected]
Account Executive: Shelley Johnston [email protected]
Authors: John Coggeshall, John Holmes, Dr. James McCaffrey, George Schlossnagle, Alessandro Sfondrini, Chris Shiflett, Andrea Trasatti

php|architect (ISSN 1709-7169) is published twelve times a year by Marco Tabini & Associates, Inc., P.O. Box 54526, 1771 Avenue Road, Toronto, ON M5M 4N5, Canada.

Although all possible care has been placed in assuring the accuracy of the contents of this magazine, including all associated source code, listings and figures, the publisher assumes no responsibilities with regards of use of the information contained herein or in all associated material.

Contact Information:
General mailbox: [email protected]
Editorial: [email protected]
Subscriptions: [email protected]
Sales & advertising: [email protected]
Technical support: [email protected]

Copyright © 2003-2004 Marco Tabini & Associates, Inc. All Rights Reserved

EDITORIAL RANTS

I'm sure you're familiar with the Chinese proverb "may you live in interesting times." Even though I rarely think of my professional life as dull and boring, the last month has been particularly exciting. As promised in my exit(0) column from last month's issue, if you look through the middle of the magazine you'll find a full report (in colour!)
on the best conference I have ever attended: our very own php|cruise (forgive me for a bit of professional pride; eight months of prep work will do that to you). Things went so well that we're working on another cruise, this time going to Alaska in the fall, and plan on making php|c an annual event for many years to come.

All good things come to an end, of course, and, once back from the cruise, it's back to work. Luckily for us, work means bringing you yet another great issue of php|architect, and I personally consider that another good thing.

Like every month, we've got some great content waiting for you in the following pages. The one I'm most proud of is George Schlossnagle's regular expressions article. Regexes are something that pretty much every programmer has to deal with, but that very few among us really know how to use. In fact, I've seen developers write extremely complicated code with the explicit purpose of getting around having to use a regular expression, and that is just plain wrong. After all, using the best solution for each problem is what being a programmer is all about.

Thus, I approached George about writing an article on regular expressions, and it became quickly evident that one article would not even come close to covering the complexity of regex. Now, everyone knows that I always try my best to stay away from multi-part articles for a multitude of reasons, but in this case I felt that the topic more than deserved our attention over multiple issues and, therefore, George's article is the first in a series of three. Over the next three months, he will take you for a ride from the basics (which are covered in this issue) to the more complex and exotic aspects of regular expressions, thus hopefully providing the PHP world with a definitive guide to this topic.

If regular expressions are not your bag, one of the other topics covered in this month's issue is certain to tickle your fancy.
For example, you may want to read Alessandro Sfondrini's excellent article on using the Amazon.com API directly from your PHP website, or Andrea Trasatti's look at the world of WAP. As you can probably imagine, both Andrea and Alessandro hail from my native Italy, and that alone makes their articles more than worth reading. There, my monthly heritage tax is now paid up!

As I'm sure you've noticed, in the past few months we've been publishing material about testing practices quite frequently. As larger and larger projects are developed using PHP, serious testing processes are going to become an integral part of every good developer's arsenal of programming tools. What we never quite considered is that PHP is a great testing platform even for those projects that are not written using it. Thankfully, James McCaffrey came to the rescue and provided us with a wonderful article on the subject.

Our final article this month is about the new Tidy extension, which author John Coggeshall has recently introduced in PHP. You may have already heard about the Tidy project, which provides a series of libraries capable of parsing and automatically repairing documents written in markup languages like HTML or XML. Tidy brings an important set of capabilities to PHP, and I'm happy to have the author of the extension introduce us to it.

That's it for this month: time for me to go tend to my sunburn while I start working on the next issue. Until then, happy readings!

NEW STUFF

What's New!

PHP 5.0 Beta 4
PHP.net has announced the release of PHP 5.0 Beta 4. This fourth beta of PHP 5 is also scheduled to be the last one (barring unexpected surprises, that did occur with beta 3). This beta incorporates dozens of bug fixes since Beta 3, rewritten exceptions support, improved interfaces support, new experimental SOAP support, as well as lots of other improvements, some of which are documented in the ChangeLog. Some of the key features of PHP 5 include:
• PHP 5 features the Zend Engine 2.
• XML support has been completely redone in PHP 5; all extensions are now focused around the excellent libxml2 library (http://www.xmlsoft.org/).
• SQLite has been bundled with PHP. For more information on SQLite, please visit their website.
• A new SimpleXML extension for easily accessing and manipulating XML as PHP objects. It can also interface with the DOM extension and vice-versa.
• Streams have been greatly improved, including the ability to access low-level socket operations on streams.

PHP.net also announced the release of PHP 4.3.5 RC 3. This will be the last release candidate prior to the final release, so please test it as much as possible. For more information visit http://www.php.net/.

ZEND Optimizer 2.5.1
Zend has announced the release of Zend Optimizer 2.5.1. Zend.com describes the Optimizer as: "a free application that runs the files encoded by the Zend Encoder and Zend SafeGuard Suite, while enhancing the running speed of PHP applications. Benefits:
• Enables users to run files encoded by the Zend Encoder
• Increases runtime performance up to 40%."
Get more information from Zend.com.

DEV Web Management System
Dev is a small but powerful and very flexible content management system for web portals. The system is licensed as freeware under the terms of the GNU GPL. It is absolutely free for non-commercial and commercial use, and is based on PHP 4 and MySQL. The project allows the user to publish articles, rate articles by taking a poll, publish short news items and create back-ends in XML format, manage download lists, manage advertisements on the site, be informed about events on the site, create system reports and export them into MS Excel or XML format, and much more. For more information visit: http://dev-wms.sourceforge.net/.

PhpMyAdmin 2.5.6
Phpmyadmin.net has released their latest version of phpMyAdmin. "Welcome to this new version, aimed at stabilization of the 2.5 branch. Meanwhile, work is continuing on the new 2.6 branch. PhpMyAdmin is a tool written in PHP intended to handle the administration of MySQL over the Web. Currently it can create and drop databases, create/drop/alter tables, delete/edit/add fields, execute any SQL statement, manage keys on fields." For more information visit: www.phpmyadmin.net.

PhpSQLiteAdmin 0.2
PhpSQLiteAdmin is a Web interface for the administration of SQLite databases. Version 0.2 comes with some new features and a lot of internal cleanups and refactoring. PhpSQLiteAdmin is still in an early stage of development. It comes free of charge and without warranty. For more information visit: www.phpsqliteadmin.net.

Zend Launches New PHP5 In-Depth Articles Section
Zend Technologies have launched a new version of their Developer's Corner on the zend.com website. PHP5 In-depth showcases articles from many well-known PHP authors on the new features of PHP. For more information, check out http://www.zend.com/php/in-depth.php

phpMyEdit 5.4
phpMyEdit generates PHP code for displaying/editing MySQL tables in HTML. All you need to do is to write a simple calling program (a utility to do this is included). It includes a huge set of table manipulation functions (record addition, change, view, copy, and remove), table sorting, filtering, table lookups, and more. Several minor bugs were fixed. A few new options were added. Major features include tabs support, the ability to specify SQL expressions for fields when writing to the database, the ability to define new triggers, and more. All eval() calls were removed due to security and performance reasons. Some code was optimized. Several parts of the documentation were updated. A lot of new language files were added and updated. For more information visit: http://platon.sk/projects/phpMyEdit/.

ionCube Releases New Encoder
UK-based ionCube has released a new version of their compiled code PHP encoding tools. New features include a choice of ASCII or binary encoded file formats and optional support for OpenSource extensions such as mmcache. Prices start at a special price of $159 in their March 20% off sale. For further information, please visit the homepage of the Encoder: http://www.ioncube.com/sa_encoder.php

Looking for a new PHP Extension? Check out some of the latest offerings from PECL.

ps 1.1.0
ps is an extension similar to the pdf extension but for creating PostScript files. Its API is modeled after the pdf extension.

Memcache 0.2
Memcached is a caching daemon designed especially for dynamic web applications to decrease database load by storing objects in memory. This extension allows you to work with memcached through a handy OO interface.

This extension allows you to call the functions made available by the libstatgrab library.

POP3 1.0
The POP3 extension makes it possible for a PHP script to connect to and interact with a POP3 mail server. It is based on the PHP streams interface and requires no external library.

Fileinfo 0.1
This extension allows retrieval of information regarding the vast majority of file types. This information may include dimensions, quality, length etc. Additionally, it can also be used to retrieve the MIME type for a particular file and, for text files, the proper language encoding.

Check out some of the hottest new releases from PEAR.
Mail_Queue 1.1
Class to handle mail queue management. Wrapper for PEAR::Mail and PEAR::DB (or PEAR::MDB). It can load, save and send saved mails in the background and also back up some mails. The Mail_Queue class puts mails in a temporary container, where they wait to be fed to the MTA (Mail Transport Agent), and sends them later (e.g. every few minutes) via crontab or in some other way.

XML_Transformer 0.9.1
With the XML/Transformer class one can easily bind PHP functionality to XML tags, thus transforming the input XML tree into an output XML tree without the need for XSLT.

Net_LMTP 0.7.0
Provides an implementation of the RFC2033 LMTP using PEAR's Net_Socket and Auth_SASL class.

Text_Wiki 0.8.3
Abstracts parsing and rendering rules for Wiki markup in structured plain text.

FEATURE

Connecting to Amazon.com Web Services with NuSOAP
by Alessandro Sfondrini

Have you ever wanted to add an online shop to your website but gave up on the idea because you lack the expertise and resources to run it? Using SOAP, you can connect to Amazon Web Services and create a PHP application to remotely browse and search products, add them to Amazon shopping carts or wish lists and, yes, you can even earn money on every purchase performed from your site.

In the article "Exploring the Google API with SOAP," which appeared in the January issue of php|a, I showed you what SOAP is and how it can be used together with PHP. We used a SOAP-encoded document to perform a search using the Google Engine, then we parsed the response to display the results on our website. To perform these operations, we wrote an application from scratch; this approach can be great to understand how SOAP works, but when a customer asks you to implement a SOAP-based feature in an application, you can't waste your time in that way. In this case, there are some libraries that will make your coding quicker and easier: one of these is NuSOAP, which allows you to send Remote Procedure Calls (RPCs) over HTTP.

This article will show you how we can use the Amazon.com API with NuSOAP to perform searches and display product details, without having to sort through a lot of SOAP syntax: if you have had an opportunity to read my previous article, you will notice how much shorter an application written this way is, and how much time can actually be saved by using this method.

What are Amazon Web Services?

Amazon.com is one of the most widely known on-line shops. You can find and buy almost everything, from books to toys to power tools. Several years ago, Amazon launched a very successful affiliate program, which they later expanded into their Web Services program.

Why would you want to use Amazon Web Services (AWS)? For instance, if your website is about Literature, you may want to allow your users to look for books in the (huge) Amazon database directly from your pages, without redirecting them to Amazon.com. You can provide them with a detailed description of each book and, when they decide to buy one, you can add it directly to their Amazon shopping cart. When the time comes to complete the purchase, you can redirect the user directly to the Amazon website, where the checkout process actually takes place and you receive credit for your affiliate referral.

It is important to understand that AWS are designed only to retrieve information about products and to create and populate shopping carts, not to perform payments: this must be done directly on the Amazon website, the reason being, of course, one of security for the customer's personal information. In any case, a significant portion of the transaction is performed from your website. This results in a benefit both for you and for your users, since you can offer your customers a nearly seamless user experience and collect your referral fees.
Access to AWS, as well as to the affiliate program, requires you to register with the Amazon Associates Program and obtain an Associates ID, which will identify each purchase sent through our website.

REQUIREMENTS
PHP: 4.1 and higher
OS: Any
Other software: NuSOAP 0.6.4
Code Directory: webs-nusoap

Getting started

Before we start coding, I recommend you download the AWS Software Developer's Kit from http://www.amazon.com/gp/browse.html/?node=3434641. It contains the License Agreement, a guide (you should have a look at it to familiarize yourself with the concepts associated with the program) and some code samples, including a few written in PHP! As I mentioned earlier, you will also have to apply for your Developer's token, an alphanumerical string needed for performing searches and purchases: to do so, you have to visit https://associates.amazon.com/exec/panama/associates/join/developer/application.html and accept the AWS terms and conditions.

To write our application, we will take advantage of a PHP library called NuSOAP, which is really just a group of "userland" classes written in PHP and designed to allow developers to manage SOAP web services. It will speed up our coding by allowing us to focus on functionality rather than on the communication protocols. NuSOAP is distributed under the LGPL license, and can be downloaded here: http://dietrich.ganx4.com/nusoap/.

To add NuSOAP support to our project, we simply have to include nusoap.php in our PHP scripts using require(). Performing a Remote Procedure Call (RPC) is simple; look at this example:

    require("nusoap.php");
    $params = array('name' => 'value');
    $s = new soapclient("http://server/file.wsdl", true);
    $result = $s -> call('method', $params);

First of all, we include NuSOAP and we store the parameters we will use for the RPC in the $params associative array. We then create a new soapclient object, passing two arguments to the constructor: the SOAP server address and a boolean value that indicates whether the server uses a WSDL document. WSDL (Web Services Description Language) documents contain information about a web service, as well as its methods and properties. They are often used by web service providers, including Amazon.

Once we have created the object, all we have to do is to actually execute the RPC by invoking the call() method and specifying the remote method name and the parameters to be passed (contained in $params in our case). NuSOAP automatically fetches the results of the call and stores them in the $result array.

Since we are working with a WSDL-based server, NuSOAP can actually create a "proxy" PHP class capable of providing a better interface to our scripts. Once we have instantiated $s, we can also invoke a remote method in this way:

    $proxy = $s -> getProxy();
    $result = $proxy -> method($params);

This can be useful to simplify our code: first, we create a proxy client, $proxy; any subsequent RPCs to methods specified in the WSDL can be performed using the proxy, without having to use the NuSOAP call() method again. In our application, we will use proxies to work with AWS.

Designing the application

Now that we've laid down some ground rules, it's time to decide in detail what the goals of our application are going to be. Since we're all PHP fans, our example website will be about PHP and, therefore, we'll want to allow our users to buy books on this topic from Amazon.

The first thing that we need is a search page: users will be able to search for a particular keyword (or for a set of keywords) and the page will display some basic information about each book that matches the criteria, such as its title, an image, the publishing company, author or authors and price. We also have to provide a way to browse the results, since AWS calls only return ten results per call.

The search page should also contain a link for each product to another page on our website that will contain a detailed description of the book, including any user reviews and comments. From here, the users will be able to continue their purchase on Amazon.com or add the product to their wish lists.

The search page

If you have had an opportunity to read through the AWS documentation, you have probably discovered that searches by keyword can be performed using the KeywordSearchRequest() method, which requires the parameters shown in Figure 1. Assuming that the call will be successful, the server will return an array containing several items:
• The TotalResults element, which indicates the number of total results returned by the query.
• The TotalPages element, which provides the number of pages available in the search result.
• The Details sub-array, which contains a set of data about each search result matching our search criteria that is included in the page we have requested. Given that a search only returns a maximum of ten items per page, you can expect that this array will contain no more than ten elements.

The lite search mode returns the data shown in Figure 2.

Figure 1
Parameter Name | Type | Description
keyword | String | The keyword on which the search should be performed.
page | String | The page number. AWS returns ten results per page, so page 1 will contain results 1 through 10, page 2 results 11 through 20, and so on.
mode | String | Specifies the ID of the store to browse. Each Amazon store has its unique ID, which indicates what kind of products it sells (e.g.: books, music, dvd, vhs, etc.). You can find a complete list of all the IDs available in the AWS documentation.
tag | String | Your Associate ID. If you don't have one, you can use the generic ID webservices-20.
type | String | Determines the type of search results. Lite indicates a simpler result set, while heavy provides a richer set of information about each item returned. We'll use lite for our example.
devtag | String | The Developer Token you have received from Amazon.

Figure 2
Result Datum | Type | Description
Url | String | The URL of the product page for this item on Amazon
Asin | String | The Amazon.com Standard Item Number for this product
ProductName | String | The name of the product (in our case, the title of the book)
Catalog | String | The category of the product (e.g.: books)
Authors | String | The name(s) of the author(s)
ReleaseDate | String | The release date, in human-readable format (e.g.: "23 February, 1976").
Manufacturer | String | The name of the product's manufacturer (the publisher in our case)
ImageUrlSmall | String | A pointer to the product's "small" image on the Amazon website
ImageUrlMedium | String | Same as above, for a slightly larger image
ImageUrlLarge | String | Same as above, but for an even larger image
ListPrice | String | The product's list price, including the currency symbol (e.g.: "$ 20.55")
OurPrice | String | The product's selling price on Amazon, including the currency symbol
UsedPrice | String | The product's price for used copies.
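The response layout described above (TotalResults, TotalPages and Details) lends itself to a small helper that tells the user which results they are looking at. What follows is only a sketch and not part of the original listings: summarizePage() is a hypothetical function, the sample array is made-up data shaped like the response, and the page size of ten is taken from the text rather than checked against the AWS documentation.

```php
<?php
// Hypothetical helper: summarises one page of a KeywordSearchRequest()
// response. Assumes the response is an array with TotalResults, TotalPages
// and Details, and that AWS returns at most ten items per page.
function summarizePage($results, $page, $perPage = 10)
{
    $details = isset($results['Details']) ? $results['Details'] : array();
    if (count($details) == 0) {
        return 'No results';
    }
    $first = ($page - 1) * $perPage + 1;    // page 2 starts at result 11
    $last  = $first + count($details) - 1;  // the last page may be short
    return "Results $first-$last of " . $results['TotalResults']
         . " (page $page of " . $results['TotalPages'] . ")";
}

// Made-up sample data shaped like the response described in the text.
$sample = array(
    'TotalResults' => 23,
    'TotalPages'   => 3,
    'Details'      => array(
        array('ProductName' => 'Book A'),
        array('ProductName' => 'Book B'),
        array('ProductName' => 'Book C'),
    ),
);

echo summarizePage($sample, 3), "\n"; // prints: Results 21-23 of 23 (page 3 of 3)
```

Counting the Details entries, instead of assuming a full page, keeps the summary correct on the last page, which usually holds fewer than ten items.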
Listing 1

    <form action="<?= htmlspecialchars($_SERVER['PHP_SELF']) ?>" method="GET">
      <input type="text" name="keyword" value="" />
      <input type="hidden" name="page" value="1" />
      <input type="submit" name="button" value="Search!" />
    </form>

    <?php
    if (empty($_GET["keyword"])) // If the form hasn't been submitted
        exit;                    // Stops the execution

    require("nusoap.php");

    // Escaped copies of the user input, for HTML output and for links
    $keyword = htmlspecialchars($_GET["keyword"]);
    $kwParam = urlencode($_GET["keyword"]);
    $page    = isset($_GET["page"]) ? max(1, (int) $_GET["page"]) : 1;
    $self    = htmlspecialchars($_SERVER['PHP_SELF']);

    // Creates a WSDL client and a proxy
    $client = new soapclient("http://soap.amazon.com/schemas2/AmazonWebServices.wsdl", true);
    $proxy  = $client -> getProxy();

    $param = array(
        'keyword' => $_GET["keyword"],
        'page'    => (string) $page,
        'mode'    => 'books',
        'tag'     => 'webservices-20',
        'type'    => 'lite',
        'devtag'  => 'YOUR-DEV-TOKEN'
    );

    $results = $proxy -> KeywordSearchRequest($param); // Calls the method

    if (empty($results["Details"])) // Checks whether there are results
        die("<h3>No results found for &quot;" . $keyword . "&quot;.</h3>");

    echo "<h3>Searched Amazon.com for &quot;" . $keyword . "&quot; - page "
        . $page . " of " . $results["TotalPages"] . "</h3>";

    foreach ($results["Details"] as $res) // Prints each product's details
        echo "<img src='" . $res["ImageUrlMedium"] . "' align='left' /><br/>\n"
            . "<a href='details.php?asin=" . $res["Asin"] . "'><b>" . $res["ProductName"] . "</b></a><br /><br />\n"
            . "<b>Authors</b>: " . @implode(', ', $res["Authors"]) . "<br />\n"
            . "<b>Publishing Company</b>: " . $res["Manufacturer"] . "<br />"
            . "<b>List Price</b>: " . $res["ListPrice"] . " - <b>Our Price</b>: "
            . $res["OurPrice"] . " - <b>Used Price</b>: " . $res["UsedPrice"] . "<br /><br /><br />\n\n";

    if ($page > 1) // Prints a link to the previous page, if any
        echo "<a href='$self?keyword=$kwParam&amp;page=" . ($page - 1) . "'>Previous Page</a> \n";

    if ($page < $results["TotalPages"]) // Prints a link to the next page, if any
        echo " <a href='$self?keyword=$kwParam&amp;page=" . ($page + 1) . "'>Next Page</a>";
    ?>

Figure 4
Parameter | Type | Description
asin | String | The product's ASIN (which, in our case, can be retrieved from $_GET['asin'])
tag | String | The Associate ID, or webservices-20 if you want to use a generic one
type | String | The type of search. In this case, we'll choose heavy, since we want all the information available on a particular book
devtag | String | Your Developer Token

Figure 5
Result Datum | Type | Description
SalesRank | Integer | The product's sales ranking
Lists | Array of Strings | The names of the ListMania lists that contain the product
BrowseList | Array of Arrays | Indicates the product categories in which the product can be found. Its contents look like this:
    BrowseList => Array ( [0] => Array ( BrowseName => PHP ) )
Media | String | The type of medium on which the product is distributed (e.g.: paperback or hardcover for books)
Isbn | String | The ISBN code of the product (books only)
Availability | String | Indicates how long the product takes to be shipped
Reviews | Array | This array contains information about the customer reviews associated with the product. It includes three elements: AvgCustomerRating, which indicates the average customer rating for the product, TotalCustomerReviews, which contains the number of customer reviews available, and CustomerReviews, which is an array that contains the three most recent reviews (you can find the contents of this array in Figure 6).
SimilarProducts | Array of Strings | Contains the ASINs of products that are similar to this one.
Figure 6
Element | Type | Description
Rating | Integer | The rating of the product in this review
Summary | String | A summary of the review
Comment | String | The full review itself

As you can see, the KeywordSearchRequest() method returns quite a few pieces of information for every result item, although, of course, we don't have to output all of them on our site.

If you look at Listing 1, the source for our search page, you'll see that the very first part of the file is nothing more than a simple HTML form, which contains an input text box for the keyword and a hidden field that forces the page number to 1; this way, a new search will automatically start from the first page of results. The form uses the GET method because we need to use links for the "Next Page" and "Previous Page" operations (something like page.php?keyword=blah&page=2). Naturally, you could also use POST, but in that case it would be much more difficult for someone to create a direct link to your search results, which could, in theory, prevent you from completing some sales.

The second part of the script contains the actual PHP code. First of all, an if statement stops the execution of the script if $_GET["keyword"] is empty. Otherwise, we include NuSOAP and create a SOAP client by passing the URI of the *.wsdl file for Amazon (which is provided in the AWS documentation) and the boolean true to indicate to the constructor of the soapclient class that the server provides a WSDL document. We also create a proxy to call AWS methods directly, as we have seen in the first part of the article.

The parameters needed to invoke KeywordSearchRequest() are stored in the $param array; the first two (the keyword and the page number) are to be found in the $_GET superglobal, since they change each time we perform or browse a search, while the others are constant and, therefore, we hardcode them in our script. Remember to insert your developer token in $param["devtag"].
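As a sketch of the parameter handling just described, the two functions below separate the values that come from the query string from the ones that stay constant, and build the kind of link used for the "Next Page" and "Previous Page" operations. Both function names are hypothetical, and http_build_query() requires PHP 5 or later, which goes beyond the PHP 4.1 requirement stated for the article's own code.

```php
<?php
// Hypothetical helper: builds the KeywordSearchRequest() parameters from the
// query string, keeping the constant values in one place.
function buildSearchParams($query, $devToken)
{
    $page = isset($query['page']) ? (int) $query['page'] : 1;
    return array(
        'keyword' => trim(isset($query['keyword']) ? $query['keyword'] : ''),
        'page'    => (string) max(1, $page), // never below the first page
        'mode'    => 'books',
        'tag'     => 'webservices-20',
        'type'    => 'lite',
        'devtag'  => $devToken,
    );
}

// Hypothetical helper: URL of another page of the same search. The '&'
// separator is passed explicitly so the result does not depend on php.ini.
function pageLink($script, $keyword, $page)
{
    $query = http_build_query(array('keyword' => $keyword, 'page' => $page), '', '&');
    return $script . '?' . $query;
}

// In a real page the first argument would be $_GET.
$param = buildSearchParams(array('keyword' => ' php 5 ', 'page' => '0'), 'YOUR-DEV-TOKEN');
echo $param['keyword'], ' / ', $param['page'], "\n";    // prints: php 5 / 1
echo pageLink('page.php', $param['keyword'], 2), "\n";  // prints: page.php?keyword=php+5&page=2
```

When the link is written into HTML, the result of pageLink() should still be passed through htmlspecialchars() so that the & separator is emitted as &amp;.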
Once we have invoked the method and stored the search results in $results, we have to display the latter in a format that is comprehensible to the user. First, we check whether there are any results to begin with. If the search returned no data, the program displays a warning and exits. Otherwise, we print a short summary of the search: the keyword, the current page number and total page count, followed by details about each product in the current result page. These are actually produced by a simple foreach loop, which browses the $results["Details"] array, echoing the title of each book, a medium-size image, its authors, publishing company and prices. We will also provide a link to another page, details.php, which contains further information on each book. The link contains a reference to the product's ASIN (the Amazon identifier for each product) in order to make the application able to retrieve the correct product from Amazon's catalogue with another RPC.

The last part of this page allows the user to browse the results: if the current page isn't the first one (Page 1), the script prints a link to the previous one and, if it isn't the last page (based on the information returned by our AWS call), it prints a link to the next one. Figure 3 shows our search page at work.

The Product Detail Page

Now that we are done with the first part of the application, it's time to move on to the product detail page, which will show advanced information about a particular book. The AWS method we need in this case is AsinSearchRequest(), which needs the parameters shown in Figure 4.

Just like before, the response that we get back from Amazon is an array of arrays, except that, in this case, we will simply concern ourselves with the first result set, since the ASIN uniquely identifies one product. Our data, therefore, will be stored in $results['Details'][0], which, in turn, will contain the information shown in Figure 5. As you can see, some of the values returned are the same as the results of the KeywordSearchRequest() call that we used in Listing 1, while some others, like the customer reviews, are more appropriate for a detailed product page.

Speaking of the product page, Listing 2 contains the code for details.php. First, we check $_GET["asin"]; if it is empty, the program displays a warning and exits. In a more complete application, you may want a slightly more verbose explanation of what went wrong, or perhaps an automatic redirection to the search page.

If we have an ASIN, we include the NuSOAP library, then create a SOAP client and proxy as we did in the previous page. Please note that the ASIN has to be transformed into a ten-character string, since AWS requires it to be submitted in that format. sprintf() with the %010d format does this for purely numeric ASINs; str_pad() also ensures that the string is ten characters long, and it leaves ASINs that contain letters (such as an ISBN ending in X) intact. This time, we only need to pass the ASIN and specify heavy as the search type. Once the RPC has been executed, we retrieve the results and print them out, using a foreach loop to cycle through the user reviews.

The final touch in our application consists of providing a link back to the Amazon website in order to make it possible for our users to purchase a product; you can't do much selling by just showing which products are available! The AWS documentation specifies that an HTTP form must be set up for the purpose of submitting the purchase information over to Amazon.com. This form (you can look at the one in Listing 2 for an example) uses the POST method, and its action attribute is really nothing more than a page on Amazon.com that contains the
Listing 2 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 <?php if(empty($_GET[“asin”])) die(“<h3>No ASIN specified</h3>”); require(“nusoap.php”); $_GET[“asin”] = sprintf(“%010d”, $_GET[“asin”]); $client = new soapclient(“http://soap.amazon.com/schemas2/AmazonWebServices.wsdl”, true); $proxy = $client -> getProxy(); // Creates a WSDL client and a proxy $param = array( ‘asin’ ‘tag’ ‘type’ ‘devtag’ ); => => => => $_GET[“asin”], ‘webservices-20’, ‘heavy’, ‘YOUR-DEV-TOKEN’ $results = $proxy -> AsinSearchRequest($param); // Calls the method ?> <h1><?=$results[“Details”][0][“ProductName”] ?></h1> <img src=”<?=$results[“Details”][0][“ImageUrlLarge”] ?>” align=”left” height=”350” /> <b>Authors:</b> <?=@implode(‘, ‘, $results[“Details”][0][“Authors”])?><br /><br /> <b>Published by</b> <?=$results[“Details”][0][“Manufacturer”]?> <b> on</b> <?=$results[“Details”][0][“ReleaseDate”]?><br /><br /> <b>List Price</b>: <?=$results[“Details”][0][“ListPrice”] ?> <b>Our Price</b>: <?=$results[“Details”][0][“OurPrice”] ?> <b>Used Price</b>: <?=$results[“Details”][0][“UsedPrice”] ?><br /><br /><br /> <!— Form to purchase on Amazon.com —> <form method=”POST” action=”http://www.amazon.com/o/dt/assoc/handle-buy-box=<?=$_GET[“asin”] ?>”> <input type=”hidden” name=”asin.<?=$_GET[“asin”] ?>” value=”1”> <input type=”hidden” name=”tag-value” value=”webservices-20”> <input type=”hidden” name=”tag_value” value=”webservices-20”> <input type=”hidden” name=”dev-tag-value” value=”YOUR-DEV-TOKEN”> <input type=”submit” name=”submit.add-to-cart” value=”Buy From Amazon.com”> <input type=”submit” name=”submit.add-to-registry.wishlist” value=”Add to Wish List”> </form> <!— End Form —> <b>ISBN:</b> <?=$results[“Details”][0][“Isbn”]?><br /><br /> <b>Availability:</b> <?=$results[“Details”][0][“Availability”]?><br /><br /><br /> <b>Sales Ranking:</b> <?=$results[“Details”][0][“SalesRank”]?><br /><br /> <b>Average customer 
rating:</b> <?=$results[“Details”][0][“Reviews”][“AvgCustomerRating”]?> <br /><br /><h2>Read user reviews:</h2> <?php foreach($results[“Details”][0][“Reviews”][“CustomerReviews”] as $res) echo “<h3>”.$res[“Summary”].”</h3>” .”<b>Rating: </b>”.$res[“Rating”].”<br /><br />”.$res[“Comment”].”<br /><hr />”; ?> March 2004 ● PHP Architect ● www.phparch.com 13 FEATURE Connecting to Amazon.com Web Services with NuSOAP Further Improvements As you have probably noticed, writing a SOAP-based application using a library like NuSOAP is much faster than developing your own SOAP classes—if you have read my article about the Google API that appeared on the January issue of php|a, you probably know what I am talking about. This means that you can develop rather complex applications without having to waste time dealing with the nitty-gritty details of the underlying protocol; in fact, we didn't even write any SOAP code for our Amazon application—NuSOAP did it all for us. Naturally, the code that I have introduced here is very basic and could stand to gain from some improvements. For instance, Amazon Web Services allow you to to manage a a remote shopping cart or wish list by adding and removing items to them. The very last part of the purchase—the one where money changes hands—must still take place on Amazon.com, but you can let the user perform most of the normal operations associated with an e-commerce website without leaving your website. However, do keep in mind that if you choose to manage the user's shopping cart remotely, you can't change it once you've submitted to Amazon—this is done to protect the end user from fraudulent transactions. You can check out the AWS documentation for more details on this topic—you'll find that it's not complicated at all. Depending on your needs, you may choose to perform a different kind of search operation on your website: by similar products, by author, by ISBN, by manufacturer, and so on. 
You may also want to browse a "node", or product category (e. g. "programming", "web", etc.) directly, without performing a search. It goes without saying that all this depends on what your goals are. If your Amazon-based shop becomes very popular, you may decide to join the Amazon Associates Program, an affiliate system that pays you commissions on every sale. Be careful, however, that your application must not send more than one request per second to Amazon—even if you provide an error handling system, you must not immediately retry a request if the previous one has failed. You should also provide a caching system, in order to store the data needed by your site without going back and forth to AWS for every request—you can check out Bruno Pedro's excellent article in the February 2004 issue of php|a for more idea on caching data from your PHP scripts. If you choose to do so, don't forget that you can't keep your data cached for more than twentyfour hours. Finally, please keep in mind that in the examples shown in this article we always referred to Amazon.com, the American website. AWS are also available for Amazon.co.uk, Amazon.de and Amazon.co.jp, but you have to modify the URIs in the script, changing the specifications in the WSDL document from [soap.amazon.com/] to soapeu.amazon.com/, and so on. You will also have to add the locale parameter to your RPC invocations—its value can be set to uk, de or jp, depending on which Amazon Licensed to 63883 - Joseph Crawford ([email protected]) ASIN of product that must be added to the user's shopping basket. A few additional hidden fields provide the ASIN, the Associates Id and the Developer's token. The form supports two different buttons: one adds the product to the user's basket, while the other adds it to his wishlist. 
Figure 3 March 2004 ● PHP Architect ● www.phparch.com 14 FEATURE Connecting to Amazon.com Web Services with NuSOAP I'm Outta Here Amazon.com Web Services is a powerful tool that you can use to add e-commerce functionality to your site without going to the expense of developing an online store of your own and stocking all the merchandise. Even if you can't create a complete on-line shop using ASW (because the purchase must be completed on the Amazon website), you can still give your users a customized shopping experience that relies on the practically limitless resources of one of the world's most popular e-commerce websites. The sample application that I showed you in this article is quite simple: if you plan to use it in a production environment—especially if your site has a lot of traffic— you should probably consider implementing features like error handling and caching in order to prevent problems with the Amazon servers. Adding these elements to your application may require some extra work, but it could all pay off if you enjoy decent traffic and join the Amazon Associates Program. Perhaps most importantly, I hope to have given you a good idea of how much a SOAP library (in this article we have chosen NuSOAP, but there are some others packages, like PEAR::SOAP) can simplify the creation of a complex application—write in few lines of code to perform a Remote Procedure Call and you're practically done. If you want to extend our sample application and create a "complete" on-line shop using AWS, have a look to the documentation: there you will find a detailed description of every method that's available for use. If you want to learn more about SOAP, you can check out the World Wide Web Consortium's notes about the protocol at http://www.w3.org/TR/SOAP or—if you missed it— read the article "Exploring the Google API with SOAP" published in the January 2004 issue of php|a. About the Author ?> Alessandro Sfondrini is a young Italian PHP programmer from Como. 
He has already written some on-line PHP tutorials and published scripts on most important Italian web portals. You can contact him at [email protected] . Licensed to 63883 - Joseph Crawford ([email protected]) website you are referring to. To Discuss this article: http://forums.phparch.com/130 FavorHosting.com offers reliable and cost effective web hosting... SETUP FEES WAIVED AND FIRST 30 DAYS FREE! So if you're worried about an unreliable hosting provider who won't be around in another month, or available to answer your PHP specific support questions. Contact us and we'll switch your information and servers to one of our reliable hosting facilities and you'll enjoy no installation fees plus your first month of service is free!* - Strong support team - Focused on developer needs - Full Managed Backup Services Included Our support team consists of knowledgable and experienced professionals who understand the requirements of installing and supporting PHP based applications. Please visit http://www.favorhosting.com/phpa/ call 1-866-4FAVOR1 now for information. March 2004 ● PHP Architect ● www.phparch.com 15 Matchmaker, Matchmaker Make Me A Match An Introduction to Regular Expressions A quick search for the words "hate" and "regular expressions" on your favourite search engine is likely to bring up thousands upon thousands of hits. While most developers recognize the usefulness of regular expressions (and many can't do without them once they have figured out how regexes work), their use remains something of a blackmagic art—right up there with hypnosis and session management. Despite looking complicated, however, regular expressions are much easier to work with than most people are willing to admit. A Few Myths about Regexes Before we get started, we should dispel a few popular myths about regexs: Myth: Regular Expressions are Slow. Truth: Regular expressions can be slow, but they don't need to be. 
The main regular expression library used by PHP (called PCRE and consisting of the preg_ family of functions) is quite fast and also quite powerful. This power means that it is easy to write a short regular expression that performs a lot of work, and performing a lot of work with any tool can be slow. Myth: You should use basic string functions instead of regular expressions. Truth: Regular string functions (for example strstr or strtok) are (marginally) faster than the regular expression to accomplish the same task. That having been noted, this myth often leads to people implementing complicated string parsers using string matching functions where a single regular expression would do the trick. The PCRE library will always match complex patterns faster than implementing a parser on your own. March 2004 ● PHP Architect ● www.phparch.com Licensed to 63883 - Joseph Crawford ([email protected]) F E A T U R E by George Schlossnagle R egular expressions (commonly known as regexes) are a powerful tool for pattern matching and text manipulation. A typical problem that pulls people into learning regular expressions is text munging: you have a string of text and you need to replace portions of it based on certain rules. For instance, you might want to obfuscate all the email addresses in a block of text so that email addresses like [email protected] get translated to the form george [at] example [dot] com. Regular expressions are the tool for the job, and provide a powerful and deep syntax for handling tasks like these. Alternatives to the PCRE Functions PHP supplies some alternatives to the PCRE functions. The most direct competitor is the POSIX regular expression library that consists of ereg, ereg_replace and others. We won't be looking at the POSIX regular expression functions because the PCRE library provides a broader pattern-matching facility than its POSIX counterpart and the PCRE library is about 30% faster on average. 
The other option is to perform string matching with the standard string functions. As noted above, the string functions are faster on the tasks they were designed for (finding specific characters or substrings), but are not an appropriate fit for anything but the simplest patterns.

REQUIREMENTS
PHP: Any
OS: Any
Applications: N/A
Code Directory: match-regex

Your First Regex

The simplest regex is a match against a static string. To determine if the string '[email protected]' is present in a piece of text, we can use the following code fragment:

    if (preg_match("/george@example\.com/", $text)) {
        print "Matches";
    } else {
        print "Does not match";
    }

Despite its simplicity, this example illustrates the basic syntax of a regex match. The regex itself is the first parameter, and is contained within slashes (/). The second parameter is the text you want to test the pattern against. The preg_match function returns 1 if the match succeeds, and 0 if it fails. Using slashes to delimit regular expressions is a convention (taken from the UNIX utility awk), but is not necessary: you can actually use any character that is not alphanumeric, a backslash or whitespace. Alternative delimiters are convenient if your pattern itself contains slashes. For instance, when dealing with file paths or URLs (both of which contain numerous slashes), it is common to use a different delimiter.

We can also perform substitutions with PCREs. To substitute 'george [at] nospam.example.com' for my address (a common anti-spam technique), you can use

    $text = preg_replace("/george@example\.com/", "george [at] nospam.example.com", $text);

The other PCRE functions are:

• preg_grep(string pattern, array subjects [, int flag]): preg_grep applies the specified pattern to every element of subjects, returning an array consisting of those that matched. If the optional flag is set to PREG_GREP_INVERT, only those elements that did not match will be returned.

• preg_match_all(string pattern, string subject [, array matches [, int flags]]): preg_match returns only the first match found in its subject text. preg_match_all matches as many times as possible, returning an array of all the matches. I will discuss this function in more detail later in the article.

• preg_replace_callback: This function makes it possible to perform very complex operations on a per-match basis through the use of callback functions. We will cover it in a future article, but some of its functionality overlaps with evaluated replacements, which are discussed in this article.

• preg_quote(string text): When using input text in a pattern, you may want to sanitize it to ensure it does not contain any regex metacharacters. preg_quote escapes all regex metacharacters in a string.

• preg_split(string pattern, string subject [, int limit [, int flags]]): preg_split performs similarly to explode, allowing us to break up the string subject into at most limit parts. Instead of splitting on a specific delimiter, preg_split allows the string to be broken based on a regex.

Regex Basics

Of course, we can (and should) perform the previous simple match using strstr(), which is faster than any regex function. What if, however, we want to match all email addresses in a string, rather than a specific one? What if you wanted to change text only if it appeared in a particular position within your string? The power of regular expressions is in matching complex patterns that cannot be identified using straightforward text-search functions like strstr().

The basic components of a regular expression pattern are:

• Character Classes: Patterns rarely consist only of specific letters; more often they describe classes of letters.
For example, 'any number' instead of a particular number, or 'any letter' instead of a particular letter.

• Grouping: Grouping allows for changing the precedence of operations as well as providing a means to extract the text you matched with a pattern.

• Enumerations: Enumerators allow you to specify how many times a character class or sub-pattern appears. This allows for convenient expression of fixed length patterns like 'a US zipcode is 5 digits' as well as variable length patterns such as 'a domain is a number of alphanumeric characters separated by dots'.

• Alternations: Alternations allow for multiple patterns to be combined. Unlike character classes, which allow for a position to match multiple characters, alternations allow for entire patterns to be alternatively matched. For example, a valid workday can be Monday, Tuesday, Wednesday, Thursday or Friday.

• Positional Anchors: Anchors allow you to require your pattern to start matching at a specific location in the search text, for example at the beginning or end of a line.

• Global Pattern Modifiers: Global pattern modifiers allow you to change the basic behavior of a regular expression, for example rendering it case-insensitive.

Character Classes

While it's usually easy to find a particular substring within a larger string (for example, my e-mail address in a message), it's not always easy to find a particular type of substring, like any e-mail address. To do this, you need to be able to match against a more generic pattern and not just against a static string. PCRE supplies character classes to allow you to do this; a character class allows a specific character in a search text to be matched against a range of possible characters. For example, a US phone number is composed of a three digit area code, a three digit exchange, and a four digit line number, commonly delimited by a '-'. To match this pattern, you could use the following regular expression:

    /\d\d\d-\d\d\d-\d\d\d\d/

The \d specifier is a built-in PCRE character class that consists of all the digits. There are a couple of things you should note about the pattern above. The first is that we have many \d's. In regular expressions, any character or character class matches only a single character unless you use an enumerator (which we'll cover later) to attach a quantity to it. Second, if you test this pattern you will find the following results.

• 555-123-4567 matches. This is correct.
• 5555-123-45678 matches. This is not correct.

The second example does not represent a valid phone number (the area code and line number are too long), but it matches because the pattern fits inside the longer string, as shown in Figure 1.

Figure 1: Regex doesn't always work the way you expect. The diagram aligns the pattern \d\d\d-\d\d\d-\d\d\d\d with part of a longer number.

There are a couple of ways to combat this problem. If you know that your search text should be exactly a phone number (with no leading or trailing text), you can use positional anchors to force the pattern to start at the beginning of the text and end at the end, as we'll see later on. If the phone number might be contained in text, on the other hand, you might try to fix the pattern by requiring at least one character of leading and trailing whitespace around the number, using a pattern like:

    /\s\d\d\d-\d\d\d-\d\d\d\d\s/

The \s specifier is another character class for all whitespace (spaces, tabs, newlines, etc.). This pattern does not work in all situations, though, since if the text begins with the phone number you will be unable to match the leading \s. To handle this case, PCRE supports \b, a boundary condition that matches at the border (or boundary) between a 'word' and a 'non-word' (these are words in the C programming language sense: letters, numbers and underscores only).
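The two failure modes described above can be reproduced with a short sketch. The patterns are the ones from the text; the sample strings and the $report variable are illustrative assumptions.

```php
<?php
// Patterns taken from the text above.
$plain  = '/\d\d\d-\d\d\d-\d\d\d\d/';
$spaced = '/\s\d\d\d-\d\d\d-\d\d\d\d\s/';

// Sample inputs (assumptions chosen to show each case).
$samples = array(
    '5555-123-45678',            // too many digits at both ends
    'Call 555-123-4567 today',   // number surrounded by whitespace
    '555-123-4567 is my number', // number at the very start of the text
);

$report = array();
foreach ($samples as $sample) {
    // preg_match() returns 1 when the pattern is found and 0 when it is not.
    $report[$sample] = array(
        'plain'  => preg_match($plain, $sample),
        'spaced' => preg_match($spaced, $sample),
    );
}
print_r($report);
```

The plain pattern accepts all three strings, including the invalid one, while the whitespace-delimited version rejects the invalid string but also misses the number at the start of the text.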
\b is actually not a character class, but what is known as a 'zero-width assertion'; this means that the \b specifier does not actually match the character on the other side of the boundary, but only ensures that such a boundary exists. Putting that into our pattern we can refine it to: /\b\d\d\d-\d\d\d-\d\d\d\d\b/ Continuing the testing, we find that "077-xxx-yyyy" matches. US and Canadian area codes and exchanges cannot begin with 0 or 1 (these are reserved for long distance and operator-assisted or international services). To be able to restrict the leading numbers to the allowed set, we need to be able to create our own character classes. In PCRE, these are constructed by filling a set of brackets ([[ ]) with the characters we want to match. To match 2-9, we can use the character class [23456789], which is commonly shortened via a range operator to [2-9]. To use a custom character class in a pattern, you use it exactly as you would a regular character or character class. Here is the phone number pattern reworked to employ this: /\b[2-9]\d\d-[2-9]\d\d-\d\d\d\d\b/ 18 FEATURE Matchmaker, Matchmaker Make Me A Match Figure 2 Basic Character Classes them with a backslash (\\). The two exceptions are the range operator -, which can appear un-escaped as the last character in a class, since that is unambiguous, and the negation character ^, which can appear un-escaped in any position but the first. Grouping and Sub-Patterns Usually, you will not only want to match a pattern, but extract data from it as well. To extract a specific part of a pattern, you surround it within parentheses. For example, to capture each part of the phone number pattern, you would add parentheses as follows: /\b([2-9]\d\d)-([2-9]\d\d)-(\d\d\d\d)\b/ Figure 3 POSIX Style Classes :alpha: Any letter :alnum: Any alphanumeric character :ascii: Any ASCII character :cntrl: Any control chatacter. . 
Matches any character :digit: Any digit (same as \d) \w An alphanumeric character or the underscore character. :graph: Any alphanumeric or punctuation character. \W Anything not a \w. :lower: Any lowercase letter. \d A digit. :print: Any printable character. \D A non-digit. :space: Any whitespace character (same as \s). \s Any whitespace. This includes spaces, tabs, newlines, control characters. :upper: Any upperspace character. \S A non-whitespace character. :xdigit:] Any hexadecimal 'digit'. Licensed to 63883 - Joseph Crawford ([email protected]) PCRE provides six commonly used built-in character classes, described in Figure 2. Additionally, PCRE provides POSIX-style character classes for compatibility with POSIX-style regular expressions. These classes are described in Figure 3. POSIX character sets aren't commonly used much in real-life code, which is a shame because they are often a perfect fit for problems that programmers encounter in their dayto-day work. You can negate a POSIX character class by adding a ^ after the first colon. For instance, to match all non-letter characters, you could use the class :^alpha:. Negations are also available in custom character classes—for example, to match anything that is not the greater-than character (>), you can use the custom character class [^>]. Negations are very useful when you are creating regular expressions that extract quoted text or if you want to manually parse XML or HTML. 
Since '--', '^^' and '[[ ]' have special meanings in custom character classes, if you want those actual characters to be elements of the class, you should escape Figure 4 March 2004 ● PHP Architect ● www.phparch.com 19 FEATURE Matchmaker, Matchmaker Make Me A Match $text = 'My phone number is 555-321-1212'; preg_match("/\b([2-9]\d\d)-([2-9]\d\d)-(\d\d\d\d)\b/", $text, $matches); print_r($matches); Executing that code yields the following results, just as we predicted: Array ( [0] [1] [2] [3] ) => => => => 555-321-1212 555 321 1212 We can also nest patterns. If we wanted to capture the entire local part of the phone number, in addition to its componentized parts, the regex could be modified to be: /\b([2-9]\d\d)-(([2-9]\d\d)-(\d\d\d\d))\b/ When we nest patterns, we move left to right and, when we hit a nested pattern, we take the outermost part first, then recursively parse its contents following the same rules. With the above pattern, the patterns are numbered as shown in Figure 4. Sub-patterns are also extremely useful in substituListing 1 1 2 3 4 5 6 7 8 9 10 $fp = fopen(“/usr/share/dict/words”, “r”); if(!$fp) { print “dictionary file not found\n”; exit; } while(($line = fgets($fp)) !== false) { if(preg_match(‘/\b(\w)(\w)(\w)\3\2\1\b/’, $line)) { print “palindrome: $line\n”; } } Figure 5 h a l l a h \w(captured as \1) \w(captured as \2) \w(captured as \3) \3 \2 \1 ● preg_replace("/\b([2-9]\d\d)-([2-9]\d\d)(\d\d\d\d)\b/", '\1-\2-XXXX', $text); If we run this on the text 'My phone number is 410555-1212.', it returns 'My phone number is 410-552XXXX'. Note that the replacement string in the above example is single-quoted. If we were to double quote it, we would have to double escape our sub-pattern references as "\\1-\\2-XXXX". This may seem mysterious but the reasoning is this: the PCRE library needs to be passed the sub-pattern references as \1, but when we double-quote a string, PHP attempts to interpret the escaped characters for us. 
Single-quoting performs no such interpretation and leaves your references untouched. This is the same process by which "\n" becomes a newline, but '\n' remains literally '\n'. We can reference sub-patterns in matches as well, using the same rules. A fun example of this is finding all 6-letter palindromes. A palindrome is a word that is spelled the same forward and backward, for example 'noon' or 'deed'. To spot a six-letter palindrome, we match 3 characters and require that we see them immediately in reverse order. Here is the pattern: Note 2 This isn't the full story on RFC compliant email addresses. Because the specification allows for addresses to contain descriptions as well, a completely accurate email address validator is actually quite complex. An example can be found at the end of Mastering Regular Expressions in Perl - the regex presented there is X characters long! For most purposes, the regex presented above is completely sufficient. Enumeration modifiers can also be used to compress patterns with long repetitive parts. For instance, the phone-number pattern can be compressed to: /\b[2-9]\d{2}-[2-9]\d{2}-\d{4}\b/ Matching a palindrome March 2004 tions, since they allow us access to the matched subpatterns when performing the replacement. A captured sub-pattern can be accessed in the {preg_replace} replacement text by referencing its offset as \N (where N is the sub-pattern number). Here is an example that sanitizes phone numbers by obscuring their line number: Licensed to 63883 - Joseph Crawford ([email protected]) Pattern fragments grouped in this fashion are called sub-patterns. To see what they capture, you need to pass a third argument to {preg_match}. This argument is set by the function as an array with the captured sub-pattern results in it. The zeroth element the array is the text matched by the pattern as a whole, while the sub-patterns captures are at the offset of their pattern number. Patterns are numbered left-toright and outside-to-inside. 
So in the pattern above the entire phone number is offset 0, the area code is sub-pattern 1, the exchange is sub-pattern 2, and the line number is sub-pattern 3. Here you can see a sample phone number being run through the regular expression. PHP Architect ● www.phparch.com or, by noting that the area code and exchange match the same pattern, we can compress it even further, as follows: /\b([2-9]\d{2}-) {2}\d{4}\b/ 20 FEATURE Matchmaker, Matchmaker Make Me A Match /\b(\w)(\w)(\w)\3\2\1\b/ <?php $text = 'Work: 877-555-1212, Fax: 888-555-1212'; preg_match_all("/\b([2-9]\d\d)-([2-9]\d\d)(\d\d\d\d)\b/", $text, $matches); print_r($matches); ?> Executing that script returns the following: Array ( [0] => Array ( [0] => 877-555-1212 [1] => 888-555-1212 ) [1] => Array ( [0] => 877 [1] => 888 ) [2] => Array ( [0] => 555 [1] => 555 ) [3] => Array ( [0] => 1212 [1] => 1212 ) ) The alternative is to pass the optional flag PREG_SET_ORDER. With this flag set, the ordering of the match array is reversed: the match array contains one element for each search text matched, with that array containing the sub-pattern captures for that search text. If we are looking to replicate the Perl idiom while($text =~ /$regex/g) { # perform work on one set of matches at a time } you can accomplish it with this PHP: preg_match_all($regex, $text, $matches, PREG_SET_ORDER); foreach($matches as $match) { // perform work on one set of matches at a time } March 2004 ● PHP Architect ● www.phparch.com Enumerations Another important feature in pattern matching is the ability to match variable-length patterns. In the phone number example, even though the digits of the number were unknown, the length of the pattern was fixed—it is always a three digit area code, three digit exchange and four digit line number. On the other hand, if we are matching email addresses, we don't a priori know the length of the address. Figure 6 Enumeration Modifiers * Match 0 or more times. + Match 1 or more times. ? 
Match 0 or 1 times. {m} Licensed to 63883 - Joseph Crawford ([email protected]) When we run this pattern against a palindrome like ' hallah', it matches as shown in Figure 5. Notice that you need to use \b to make sure you don't misidentify words that contain palindrome substrings. If you are running on a UNIX system, Listing 1 is a code block that will find all the six-letter palindromes in the dictionary file /usr/share/dict/words. When we use preg_match_all with sub-patterns, we have two choices of how we want the data returned to us. The default behavior is for the match array to contain an array for each sub-pattern, where that array contains the capture for the nth search match as its nth element. If that's confusing, here is how it looks when matching all the phone numbers in a text: Match exactly m times. {m,n} Match between m and n times. {m,} Match at least m times. {,n} Match between 0 and n times. To handle this, PCRE supplies enumeration modifiers. The most basic description of an email address is a number of non-whitespace characters, followed by an '@', followed by more non-whitespace characters. \S is the character class for all non-whitespace characters, so using that we can write this simplistic email-matching pattern as: /\S+@\S+/ + is a PCRE enumerator that instructs the regex engine to match one or more instances of the character or character class it applies to. PCRE supports a number of enumeration methods for specifying that a character or character class should be matched multiple times, as you can see in Figure 6. The + and * modifiers are both greedy. This means they will always match as long a sub-pattern as possible. This is not always the way you want your patterns to behave, but I will leave the details of when we might want a greedy or non-greedy match to a later article. Enumeration modifiers can be applied not only to characters and character classes, but to sub-patterns as well. 
This allows for some pretty complex pattern generation, which is, after all, one of the best features of regular expressions (at least when you can understand what they do). For example, we can use enumeration modifiers to significantly improve our email-address pattern.

According to RFC 2822, which defines the "official" valid email address syntax, an email address is composed of a localpart, an '@' and a domain. The localpart is one or more characters from the set [\w!#$%"*+\/=?`{}|~^-], while a domain is a dot-separated list of parts composed of \w-. The pattern for the local part is almost identical to the definition of \S+:

    /[\w!#$%"*+\/=?`{}|~^-]+/

The pattern for domains is more complex. First, we need to identify elements in the string. These are given by

    /[\w-]+/

If we only have two such elements, the domain pattern would look like this:

    /[\w-]+\.[\w-]+/

Note that since '.' is a special regex character (the wild-card character class), we must escape it to have it match just the '.' character. Since we can have an arbitrary number of dot-separated segments, we will encapsulate the first part of the pattern in a sub-pattern and use the '+' enumerator to specify that it must occur one or more times:

    /([\w-]+\.)+[\w-]+/

Creating a sub-pattern simply involves placing it inside parentheses. Combining the local and domain patterns together, we arrive at a decent regular expression for matching valid email addresses:

    /[\w!#$%"*+\/=?`{}|~^-]+@([\w-]+\.)+[\w-]+/

We can use this regular expression to perform the anti-spam rewriting we illustrated at the beginning of the article.

    function obscure_emails($text)
    {
        $regex = '/([\w!#$%"*+\/=?`{}|~^-]+)@(([\w-]+\.)+[\w-]+)/';
        return preg_replace($regex, '\\1 [at] nospam.\\2', $text);
    }

Alternation

The last of the basic regular expression syntactical elements is alternation. Where character classes let us match a single character against a set of allowed characters, alternations allow for matching a string against multiple sub-patterns. For example, we might want to identify all HTTP and FTP addresses in a document for auto-linking or indexing purposes. We could do this with two regular expressions:

    #https?://\S+#
    #ftp://\S+#

but this will require the document to be completely scanned twice. Note that we are using # as a delimiter and not /, since our pattern contains slashes and we would rather not have to escape them. A more elegant approach is to combine them using an alternation, as follows:

    #(https?|ftp)://\S+#

The alternation operator | means that the sub-pattern #(https?|ftp)# matches either #https?# ('http' with an optional 's') or #ftp#. To use this to automatically create anchor tags for all linked content, we can use a replacement like this:

    preg_replace('#((https?|ftp)://\S+)#', '<a href="\1">\1</a>', $text);

Running this over a sample text, we notice that any preexisting anchor tags will become munged. For example:

    Come visit us at <a href="http://www.phpa.com">phpa.com</a>.

Becomes

    Come visit us at <a href="<a href="http://www.phpa.com">phpa.com</a>.">http://www.phpa.com">phpa.com</a>.</a>

Solving this in a completely robust manner involves using look-behind assertions, which will be covered in a future article, but we can do a decent job by noting that the href value must be enclosed in quotes. Thus, if we require the URL to not be preceded by a quote, we should catch most cases. The revised regular expression is:

    preg_replace('#([^\'"])((https?|ftp)://\S+)([:punct:]) #', '\1<a href="\2">\2</a>', $text);

Note here that we need to capture and return in the substitution the non-quote ([^\'"]) character we match before the URL to avoid losing it, and that we have to escape the single quote, since the entire pattern is part of a single-quoted string.

Positional Anchors

In the example of matching valid US phone numbers, the regular expression we had was good for spotting phone numbers in a block of text, but not for validating that a block of text is a phone number. To do that, we need to ensure that the phone number is the only element in the search text, with no leading or trailing components. Anchors help solve this problem. To mandate that our phone number match starts at the beginning of the search text and ends at the end of it, we can modify our regex as follows:

    /^([2-9]\d{2})-([2-9]\d{2})-(\d{4})$/

The leading ^ anchors the match at the beginning of the text, meaning that the match will only succeed if it begins there.

    function validate_us_phone($phone)
    {
        $regex = '/^([2-9]\d{2})[.\s-]?([2-9]\d{2})[.\s-]?(\d{4})$/';
        if(preg_match($regex, $phone, $matches)) {
            return array(
                'area_code'   => $matches[1],
                'exchange'    => $matches[2],
                'line_number' => $matches[3]);
        }
        return false;
    }

Don't confuse the anchor operator ^ with the negated character class operator [^]. Because an anchor is not a character class (in fact it's a special zero-length look-behind assertion, but that's a topic for a later article), it has no meaning inside a character class.

Anchors are also useful for extracting information near the beginning or end of a string. For example, a line from an Apache Common Log Format logfile looks like the following:

    10.80.117.254 - - [13/Feb/2004:14:53:01 -0500] "GET /~george/blog/ HTTP/1.1" 200 43489

This says that on February 13, 2004 a request for "/~george/blog/" was made from the IP address 10.80.117.254. This request was successful (it returned a 200 OK response code), and the amount of data returned was 43489 bytes.
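As a rough illustration of anchoring at either end of a string, the sketch below picks the host off the front of the sample line and the last two fields off the back. The variable names are made up for this example, and it assumes the byte count is numeric (Apache writes "-" when no body is sent, which this pattern would not match).

```php
<?php
// Sample line copied from the text above.
$line = '10.80.117.254 - - [13/Feb/2004:14:53:01 -0500] "GET /~george/blog/ HTTP/1.1" 200 43489';

// ^ anchors at the start: the first run of non-whitespace is the host.
preg_match('/^(\S+)/', $line, $start);

// $ anchors at the end: the last two fields are the status and byte count.
preg_match('/(\d{3}) (\d+)$/', $line, $end);

// Illustrative variable names, not taken from the article's listings.
$host   = $start[1];
$status = (int) $end[1];
$bytes  = (int) $end[2];

echo "host=$host status=$status bytes=$bytes\n";
// prints: host=10.80.117.254 status=200 bytes=43489
```

Neither pattern needs to describe the date or request fields in the middle of the line, which is the point of anchoring.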
Writing a full parser for this log line is not too difficult (we will do so in the cookbook section at the end of the article), but many queries do not require parsing the entire log. For instance, if we want to count the number of occurrences of each response code, the expression to use is quite simple. Looking at the log format, we see that the last two fields are numbers, and we want the next to last one. Expressed as a regex, that pattern looks like this:

    /(\d+) \d+$/

Working backwards, this says we first match the end of the line ($), then a number (which we don't bother to capture), then a number which we do want to capture (the response code). We can wrap this into a quick script to determine the frequency of various responses as shown in Listing 2. When we don't need to parse an entire text string, especially if its format is complex, anchors can make our life much easier.

Listing 2

    <?php
    $logfile = $_SERVER['argv'][1];
    if(!$logfile) {
        print "Please specify a logfile to parse\n";
        exit;
    }
    if(($fp = fopen($logfile, "r")) == false) {
        print "Error opening $logfile\n";
        exit;
    }
    $frequency = array();
    while(($line = fgets($fp)) !== false) {
        $regex = '/(\d+) \d+$/';
        if(preg_match($regex, $line, $matches)) {
            // start each response code at zero the first time it is seen
            if(!isset($frequency[$matches[1]])) {
                $frequency[$matches[1]] = 0;
            }
            $frequency[$matches[1]]++;
        }
    }
    print "Code\tOccurrences\n";
    foreach ($frequency as $code => $occurrences) {
        print "$code\t$occurrences\n";
    }
    ?>

Just as the leading ^ requires the match to begin at the start of the text, the trailing $ anchors the match at the end of the text, meaning that the match will only succeed if the pattern terminates on the final character of the text to be matched against. The validate_us_phone() function uses a slightly modified version of the anchored pattern to make a function useful for validating user-inputted data. If the phone number is valid, it will return an array of its components. If not, it will return false. The regex has been made a bit more robust by making the delimiter (previously -) optional and allowing a dot or whitespace in its place.

Global Pattern Modifiers

The final regular expression syntactical elements we are going to discuss in this article are global pattern modifiers. As their name implies, global pattern modifiers change the overall behavior of the pattern. By far the most common of these is the case insensitivity modifier, i. Global modifiers are implemented in the Perl style, directly following the pattern they apply to. Here is a function which uses a regex to extract all addresses under a specified domain from a subject text, regardless of the casing of the domain (domains are case insensitive).

    function extract_addresses($domain, $text)
    {
        // quote regex metacharacters, including the / delimiter
        $domain = preg_quote($domain, '/');
        $regex = '/([\w!#$%"*+\/=?`{}|~^-]+)@' . $domain . '/i';
        if(preg_match_all($regex, $text, $matches, PREG_PATTERN_ORDER)) {
            return $matches[1];
        }
        return false;
    }

Notice here that, in addition to using the i modifier, we also use preg_quote to sanitize $domain. Data that can potentially come from an untrusted source (such as a user) should always be quoted to prevent the accidental or malicious inclusion of regex characters. Also, we use the PREG_PATTERN_ORDER flag so that all the subpattern \1 matches are stored in $matches[1]. Otherwise we would need to iterate over $matches and manually build the result set.

The other possible pattern modifiers are as follows:

• m (treat as multiline). By default, PCRE assumes that we intend our search text to be processed as one big string, and ^ and $ will match only the beginning and ending of the search text, respectively. When the m modifier is used, ^ and $ will match at the beginning and ending of every line in the search text (the search text is considered to be broken into lines by any newline characters).

• s (treat as single line for wildcards). By default the wildcard character (.) will not match a newline. If . should match newlines as well, add the s modifier to the pattern.

• x (extended legibility). By default, any whitespace in a pattern is considered part of the pattern. Allowing whitespace in a pattern can be helpful for readability and inline comments. Compare the following two regular expressions:

    /([2-9]\d{2})[.\s-]?([2-9]\d{2})[.\s-]?(\d{4})/

and

    /([2-9]\d{2})   # Match the area code (200-999) as subpattern 1
     [.\s-]?        # An optional delimiter - dot, dash or ws
     ([2-9]\d{2})   # Match the exchange as subpattern 2
     [.\s-]?        # An optional delimiter - dot, dash or ws
     (\d{4})        # Match the line number as subpattern 3
    /x

More information on creating readable patterns will be covered in a future article.

• A (Start anchored). This modifier is equivalent to putting a ^ at the start of our pattern: it anchors the pattern at the start of the search text. Thus the following two regular expressions are equivalent:

    /^Subject: (.*)/
    /Subject: (.*)/A

There are no benefits of using this method over manually anchoring a pattern with ^ (other than, perhaps, moving the anchor character from the beginning of your pattern to its end).

• D (Dollar end-only). If this modifier is set, the dollar end-anchor $ will match only at the end of the string. By default, $ will match before the final character if that character is a newline. This is ignored if the m modifier is also used.

• S (Study). If we are going to execute a pattern a number of times, we can use this flag to instruct PCRE to take extra time 'studying' the pattern to improve its efficiency.

• U (Ungreedy). By default, all matches in PCRE are greedy; that is, a pattern will attempt to match the longest possible piece of the search text. The U modifier reverses this behavior, asking PCRE to find the shortest possible match for the pattern. More on greedy versus non-greedy matching will be covered in a future article.

• u (UTF-8). This modifier instructs PCRE to treat patterns and search texts as UTF-8 characters instead of just single-byte characters. UTF-8 support is still new and should be used with some caution as it may be incomplete.

• e (Evaluated replacements). This causes the replacement string in a preg_replace call to be evaluated as PHP. Back-references are expanded and the resulting expression is executed via eval. The result of the evaluation is used as the final replacement text. Let's try an example of how to use this: writing Wiki-style links to documents. In Wikis, putting so-called CamelCaps text in a document will link it to the wiki page of that name. Doing this blindly with a regex can be achieved with the following replacement:

    $text = preg_replace('/\b(([A-Z]\w+){2,})\b/',
                         '<a href="/wiki/\1.html">\1</a>', $text);

This might result in a number of non-existent documents being linked to, though. Listing 3 guards against that by checking for the page first.

Listing 3

    function is_wiki_page($token)
    {
        $page = $_SERVER['DOCUMENT_ROOT'] . "/wiki/$token.php";
        if(file_exists($page)) {
            return true;
        }
        return false;
    }
    // the back-reference is quoted so that it is evaluated as a string
    $text = preg_replace('/\b(([A-Z]\w+){2,})\b/e',
                         'is_wiki_page("\1")?"<a href=\"/wiki/\1\">\1</a>":"\1"',
                         $text);

Unless specifically contraindicated (such as D and m), global pattern modifiers can be freely combined.

A Simple Regex Cookbook

As with most tools, the way to really learn regexes is to use them in practical situations. To help you get on your way, here is a short selection of recipes for making the most out of your regular expressions.
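A note on the evaluated replacement in Listing 3: the e modifier was deprecated in PHP 5.5 and removed in PHP 7.0, so on a current interpreter the same conditional rewrite is usually expressed with preg_replace_callback. The following is only a sketch under stated assumptions: link_wiki_words() and its $pages list are hypothetical, the list stands in for the file_exists() check, and the inner group is made non-capturing.

```php
<?php
// Hypothetical stand-in for Listing 3's file_exists() check:
// a page "exists" when its name is in the supplied list.
function is_wiki_page(string $token, array $pages): bool
{
    return in_array($token, $pages, true);
}

// Hypothetical helper: link CamelCaps words only when the page exists.
function link_wiki_words(string $text, array $pages): string
{
    return preg_replace_callback(
        '/\b((?:[A-Z]\w+){2,})\b/',
        function (array $m) use ($pages) {
            $word = $m[1];
            // Existing pages become links; anything else is returned unchanged.
            return is_wiki_page($word, $pages)
                ? '<a href="/wiki/' . $word . '">' . $word . '</a>'
                : $word;
        },
        $text
    );
}

echo link_wiki_words('See FrontPage and MissingPage.', ['FrontPage']), "\n";
// prints: See <a href="/wiki/FrontPage">FrontPage</a> and MissingPage.
```

Because the callback receives the captures as an ordinary array, no replacement string is ever evaluated as code, which also sidesteps the quoting concerns of an evaluated replacement.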
For the Wiki-link example, if we want the rewriting to only happen if the destination document exists, we can perform the conditional replacement with an evaluated replacement as shown in Listing 3. Now, when a CamelCaps word is encountered, the regex checks is_wiki_page to see if it should be linked. If so, the text is replaced with a link; otherwise, it is left as-is (or, rather, it is replaced with itself). Evaluated replacements and their companion function preg_replace_callback will be covered in depth in a future article.

Apache Log Processing

Being able to extract information from webserver logfiles is essential to both good housekeeping (knowing what links are broken and the disposition of our traffic) and forensics (determining where traffic is coming from and what actions users are taking). The first step to this is being able to parse our logs into an easily accessible data structure. Apache common log format is defined as the following:

    "%h %l %u %t \"%r\" %>s %b"

Where the individual fields are:

• %h: The IP address (or hostname if DNS lookups are enabled) of the requestor.
• %l: The remote logname, as supplied by identd.
• %u: The remote user supplied to HTTP Basic Authentication (same as $_SERVER['PHP_AUTH_USER']).
• %t: The time in common log format (%d/%b/%Y:%H:%M:%S %z in strftime format terms).
• \"%r\": The full request line, such as "GET /index.php HTTP/1.0".
• %>s: The three digit response code of the final request served (Apache has a notion of internal redirects; this is the response code on the page actually returned to the user).
• %b: The number of bytes returned in the response.

A function to parse a single line and return an array with its contents is given in Listing 4. Even though we didn't really explore it in much detail, the benefit of using extended legibility regexes should be obvious here: with 17 sub-patterns being captured, it would be extremely difficult to guess the correct offsets at a glance.

Listing 4

    function parse_clf_line($line)
    {
        static $regex = '/^
            (\S+)               # the host or ip ($m[1])
            [ ]                 # a space
            (\S+)               # remote logname ($m[2])
            [ ]
            (\S+)               # auth user ($m[3])
            [ ]
            \[(                 # begin date match ($m[4])
            (\d{2})\/           # the day ($m[5])
            (\w{3})\/           # the month ($m[6])
            (\d{4}):            # the year ($m[7])
            (\d{2}):            # the hour ($m[8])
            (\d{2}):            # the minute ($m[9])
            (\d{2})\s+          # the second ($m[10])
            ([+-]\d{4})         # UTC offset ($m[11])
            )\]                 # end date match
            [ ]
            "(                  # begin request match ($m[12])
            (GET|HEAD|POST)     # the HTTP method ($m[13])
            \s+
            (\S+)               # the requested URL ($m[14])
            \s+
            (HTTP\/\d\.\d)      # the protocol ($m[15])
            )"                  # end request match
            [ ]
            (\d{3})             # status ($m[16])
            [ ]
            (\d+)               # bytes ($m[17])
            $/xi';
        if(preg_match($regex, $line, $m)) {
            return $m;
        }
    }

Now that we have a parser, its applications are nearly limitless. For example, Listing 5 shows a little script I like to leave running in a window on my desktop; I tail my Apache log into it and it reports the number of hits I get per second in real-time. Running it as

    tail -f /apache/logs/mysite/access | freq.php

gives a running tally of hits per second (note that this will only run under a UNIX-like environment and that you'll need to make freq.php executable).

Listing 5

    #!/usr/local/bin/php
    <?php
    // log_freq.php
    include_once("LogParser.inc");
    $last_time = '';
    $count = 0;
    while(($line = fgets(STDIN)) !== false) {
        if($data = parse_clf_line($line)) {
            $this_time = $data[4];
            if($last_time && $last_time != $this_time) {
                print "$last_time: $count\n";
                $count = 0;
            }
            $last_time = $this_time;
            $count++;
        }
    }
    ?>

This data could just as easily be written to an MRTG database for graphing, or something even cleverer. Because we have access to the fully parsed log line, we could easily convert this to display hits per hour by changing

    $this_time = $data[4];

to

    $this_time = "$data[5]/$data[6]/$data[7] $data[8]";

Similarly, we could count bytes instead of pages by accumulating $data[17] (bytes transferred) in $count.

Single Pass Template Substitution

In its simplest form, a templating system runs through a 'template' and replaces certain tokens with dynamic values. One of the things that makes many templating systems slow is that they must perform multiple passes through a document, one for each token to be replaced. If we standardize our token naming convention, we can actually perform the replacement in a single pass. First, we require that all tokens be of the form {NAME} where NAME is a key in an associative array that contains our substitutions. With this in place, we can match all tokens in a single pass with the following regex:

    /{(\w+)}/

Next we will use an evaluated replacement to substitute the appropriate value from the passed associative array. Here is the full function:

    function expand_text($text, $data)
    {
        // the key is quoted so that it is evaluated as a string
        return preg_replace('/{(\w+)}/e', '$data["\1"]', $text);
    }

A simple demonstration of this function in action is the following:

    $template = <<<EOD
    Hello {NAME},
    Your friend {FRIEND} has sent you an e-card.
    Click <a href="{LINK}">here</a> to pick it up.
    EOD;
    $data = array(
        'NAME'   => 'George',
        'FRIEND' => 'Bob',
        'LINK'   => 'http://www.example.com/ecard.html?id=12345'
    );
    print expand_text($template, $data);

Preventing Cross-Site Scripting Attacks

Javascript is one of the banes of my existence. Don't get me wrong: it is a powerful and useful language, but its tight integration with HTML makes it a fertile playground for malicious users to launch cross-site scripting attacks. If we must allow HTML in user input, we will want to at least remove any Javascript from it. Listing 6 shows one possible way to do so. This function looks for various DHTML and CSS directives that can be used for cross-site scripting attacks, and if any are found it performs a very draconian stripping of all but the basic formatting tags.

Listing 6

    function strip_dhtml($html)
    {
        $ok_tags = '<br><b><h1><h2><h3><h4><i><li><ol><p><strong><table>' .
                   '<tr><td><th><u><ul>';
        $js_event_list = array('load', 'unload', 'click', 'dblclick',
                               'mousedown', 'mouseup', 'mouseover',
                               'mousemove', 'mouseout', 'focus', 'blur',
                               'keypress', 'keydown', 'keyup', 'submit',
                               'reset', 'select', 'change');
        $js_events = implode('|', $js_event_list);
        $regexp[] = "/on($js_events)\s*=/i";
        $regexp[] = "/(java|vb)scri?pt/i";
        $regexp[] = "/@\s*import/i";
        foreach($regexp as $re) {
            if(preg_match($re, $html)) {
                return strip_tags($html, $ok_tags);
            }
        }
        return $html;
    }

Conclusion

We have now come to the end of our journey through the basics of regular expressions. With these tools in your hands, you should be able to tackle almost any text matching challenge. Hopefully, you have lost any fears you might have had concerning regular expressions. Once past the terseness of their syntax, regexes can be a powerful and versatile addition to our programming toolkit.

At the same time, we have really only touched the tip of the regex iceberg. In addition to the things we have seen so far, the PCRE extension supports a number of fine-grain features that allow for incredibly complex matches. These advanced features will be covered in a future set of articles.

About the Author

George Schlossnagle is a Principal at OmniTI Computer Consulting, a Maryland-based tech company specializing in high-volume web and email systems. Before joining OmniTI, George led technical operations at several high-profile community web sites where he developed experience managing PHP in very large enterprise environments. George is a frequent contributor to the PHP community. His work can be found in the PHP core, as well as in the PEAR and PECL extension repositories. Before entering into information technology, George trained to be a mathematician and served a 2 year stint as a teacher in the Peace Corps. His experience has taught him to value an inter-disciplinary approach to problem solving that favors root-cause analysis of problems over simply addressing symptoms.

To Discuss this article: http://forums.phparch.com/131

Automated Testing For PHP Applications
by Dr. James McCaffrey

PHP enables Web developers to create complex Web applications; nothing new there. The techniques for writing automated tests for PHP Web applications, however, are not well known. In this article, James McCaffrey shows you a simple but representative PHP application and then walks you through the creation of a powerful automated test program written entirely in PHP. The code is explained in detail so you can use it as is, or modify and extend the technique to meet your own needs.

In this article, I will show you how to write powerful automated tests in PHP for your Web applications. PHP is remarkably well-suited for writing software test automation and the system I present is surprisingly short. Web applications built with PHP are becoming more and more common in the enterprise arena and, as a result, they are becoming increasingly complex. As PHP matures, the ability to write test automation becomes more valuable, but in conversations with my colleagues I discovered that the techniques required for automated testing of PHP Web applications are not well known.
In this article, I will show you how to quickly write effective test automation that verifies your PHP Web applications' correctness.

The best way to show you what we will accomplish is with two screenshots. Figure 1 shows a dummy PHP Web application that accepts a last name for an employee and then searches a MySQL database and displays the employee's ID, first name, last name, and e-mail address. In this example searching for "Baker" correctly returns a single employee whose ID is 002, first name is Bob, and e-mail is [email protected].

Manually testing even this minimal Web application would be extremely tedious, time consuming, and error prone. Instead, we can test the application by programmatically sending input to the PHP script on the Web server, then capture the response stream, examine the response for a correct target, and log a pass or fail result. Figure 2 shows a PHP shell program that does just that. Test cases 0002 and 0003 correspond to the manual test shown in Figure 1.

You might have noticed that my examples use a Windows/IIS system rather than the more usual Linux/Apache setup. Most client companies that I work with are large and have a mixed technology environment. Because many of these companies are experimenting with PHP and MySQL on a Windows/IIS base, I decided to use that base for this article.

In the sections that follow I will walk you through the underlying PHP Web application so that you will understand what we are testing, briefly examine the underlying MySQL database so that you understand its relationship to the test automation, and carefully go over the PHP test automation program so that you can modify the source code to meet your own particular needs. I will conclude with a discussion of some of the ways you can extend this technique and use it in a production environment. After reading this article you will have the ability to write PHP test automation, a hopefully valuable addition to your skill set.

REQUIREMENTS
    PHP: 4.3.4
    OS: Tested on Red Hat Linux 7 and Windows Server 2003
    Other Software: N/A
    Code Directory: auto-test

The PHP Web Application

The most common use of PHP among the companies I work with is to create dynamic Web pages that have an interface to a MySQL database. I created a reduced dummy Web application that contains the essential elements of most real-life applications I deal with. I started by making a small database named dbCompany, which contains a table named tblEmployees that has four columns: empid (employee ID), lastname, firstname, and email. I populated the table with the four rows of data you can see in Figure 3 (next page).

Next, I created a simple PHP Web application that searches the database. The code shown in Listing 1 generates the Web page shown in Figure 1. Both the database and the PHP application are simplistic, but together they have all the elements needed to demonstrate test automation.

Before I show you the test automation program, let's imagine what it would be like to manually test the application. (In fact, asking how to test a dummy Web application like this is often used as an interview question for dedicated software test engineers.) There are thousands of inputs you would have to enter into the page and then visually determine if the response was correct or not. Then, suppose you changed the logic or the database structure: you'd have to start all over. As you can imagine, this would not be fun, or particularly efficient.

To automate the testing of the dummy PHP Web application, we must programmatically send input to the PHP script (via HTTP), then capture the HTTP response stream, examine the response for strings that tell us if the response is correct or not, and log results.
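The "examine and log" half of those steps is small enough to sketch on its own. The helper names below (response_has_target, format_result) are hypothetical and not taken from the article's listings, and the response string is a hand-written stand-in for a captured HTTP body. The log layout loosely follows the pass/fail lines the article's test program prints.

```php
<?php
// Hypothetical helper: true when $target occurs anywhere in $response.
function response_has_target(string $response, string $target): bool
{
    // strpos() returns 0 for a match at the very start of the string,
    // so the result is compared with !== false rather than tested for truth.
    return strpos($response, $target) !== false;
}

// Hypothetical helper: format one log line per test case.
function format_result(string $caseId, string $input, string $expected, bool $passed): string
{
    $verdict = $passed ? 'Pass' : 'FAIL';
    return "$caseId $verdict input = " . str_pad($input, 12) . "expected = $expected";
}

// Stand-in for a response body captured from the application under test.
$response = 'Bob Baker, employee 002';

echo format_result('0002', 'Baker', 'Bob', response_has_target($response, 'Bob')), "\n";
echo format_result('0004', 'Chung', 'Kathy', response_has_target($response, 'Kathy')), "\n";
```

Keeping the check separate from the socket code makes it easy to exercise with canned strings before pointing the harness at a live server.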
The PHP shell script shown in Listing 2 performs these steps and generated the output shown in Figure 2. I structured the test automation as two functions. The main() function reads test case data from a text file, sends an input value to the PHP Web application, and examines the response for an expected value. The main() function calls a resHasTarget() function which returns TRUE if some input data contains a target string. Here are the contents of the test case file used in this example:

    0001:Anderson:Adam:
    0002:Baker:Bob:
    0003:Baker:[email protected]:
    0004:Chung:Kathy:deliberate fail
    0005:De La Paz:Doug:

[Figure 1] [Figure 2]

Each line of data represents a single test case. A 4-digit test case ID is followed by an input value, then an expected result, and an optional comment. So, in test case 0002, if we submit "Baker", we should see "Bob" in the response.

The main() function starts by assigning values to variables for the IP address of the Web server, the port on which the server listens, the path to the PHP application, and the method used to send user data:

    $ipAddress = '127.0.0.1';
    $port = '80';
    $page = '/PHP/simple.php';
    $method = 'POST';

Because this is test automation, you will know the IP address of the Web server that has your PHP application, and it will usually be 127.0.0.1 (localhost), unless you test on a server that is not installed on your local machine. Port 80 is the default HTTP port, but it may be different in a test environment. The two main methods of sending information to a Web server are POST and GET. Recall that our dummy Web application sends data using POST:

    <form name="theForm" action="simple.php" method="POST">

I will discuss using GET requests later. Next, main() prints some minimal header information to the shell and then opens the test case file for reading. The test automation reads the test case file line by line:

    $line = fgets($fp, 4096);
    list($caseid, $input, $expected, $comment) = explode(":", $line);
    $postData = 'lastname=' . urlencode($input);

For each line, we parse the four colon-delimited fields using the explode() function. Using colons to delimit test case data is arbitrary; in general, you can use any character but want to avoid characters that appear in the actual test case data. We append the input value to lastname= using the urlencode() function. It replaces characters that might be misinterpreted by the Web server with their escaped equivalents. For example, a '/' character would be replaced by a %2F sequence.

After we have a test case ID, an input last name to send and an expected value to look for, the resHasTarget() function does all the work:

    if (resHasTarget($ipAddress, $port, $method, $page, $postData, $expected))
        echo "$caseid Pass input = " . str_pad($input, 12) . "expected = $expected\n";
    else
        echo "$caseid FAIL input = " . str_pad($input, 12) . "expected = $expected\n";

The resHasTarget() function posts data to the PHP Web application and checks if the expected value is in the response stream. For test case 0001, "lastname=Anderson" is posted to 127.0.0.1:80/PHP/simple.php and the response is examined for the presence of the string "Adam". If "Adam" is found, resHasTarget() returns TRUE and we log a "pass" message, otherwise we log a "fail" message.

Let's now examine the resHasTarget() function that does most of the actual work. We start by creating a socket and then using it to connect to our Web server:

    $socket = socket_create(AF_INET, SOCK_STREAM, 0)
        or die("Socket failed\n");
    $connect = socket_connect($socket, $ipAddress, $port)
        or die("Connect failed\n");

The constants AF_INET and SOCK_STREAM mean that we want to use IPv4 addressing (dotted-quad notation, i.e., 127.0.0.1) and a full-duplex, TCP connection. There are two important alternatives to the socket_* family of functions I chose to use. One is the fsockopen() family of stream functions. A higher level choice is to use classes in the PEAR library. I have programmed sockets using all three methods and have found that any preference is more a matter of personal programming style than functionality.

[Figure 3]

After we connect to the Web server we determine the size of the data we will be posting:

    $reqBody = $postData;
    $contentLength = strlen($reqBody);

The $postData input parameter assumes we have data in a name-value sequence like user=chris&age=25&job=tester, for example. Next we construct the HTTP headers we are going to send to the server:

    $send  = $method . " " . $page . " HTTP/1.1\r\n";
    $send .= "Host: localhost\r\n";
    $send .= "Accept: */*\r\n";
    $send .= "User-Agent: test.php test automation\r\n";
    $send .= "Content-Type: application/x-www-form-urlencoded\r\n";
    $send .= "Content-Length: " . $contentLength . "\r\n\r\n";
    $send .= $reqBody;
    $send .= "\r\n";

An HTTP request starts with a line that specifies the method (e.g., POST, GET, HEAD), followed by the path to the PHP application and the HTTP version. The next header line must specify the host that the request is being sent to. The next two header lines are optional. The Accept header tells the server what types of responses are acceptable (here we'll accept anything). The User-Agent header is a courtesy so the Web server knows who is making the request. The next two header lines are required for POST requests. Content-Type tells the server what kind of data is coming. You can think of application/x-www-form-urlencoded as a magic string that means "data from an HTML form". The Content-Length header is the size of the POST data. Notice that we have to construct the POST data before the headers so we can specify the size at this point in the program. Also notice that the Content-Length header is followed by a blank line, that is, two carriage return/linefeed (\r\n) pairs, which HTTP requires regardless of the operating system. Finally we append the POST data to the request.

Now we are ready to send the HTTP request to the server, then grab the response stream and examine it:

    socket_write($socket, $send, strlen($send));
    while ($receiveBuffer = socket_read($socket, 2048)) {
        // strpos() returns 0 for a match at offset 0, so compare against false
        if (strpos($receiveBuffer, $target) !== false) {
            socket_close($socket);
            return TRUE;
        }
    }
    socket_close($socket);
    return FALSE;

The socket_write() function sends the request and associates the response to the socket. We read the response 2048 bytes (an arbitrary size) at a time (as opposed to line-by-line). We also use strpos() to see if the target string is anywhere in the 2048 bytes, and if it is we close the socket and return TRUE. If we examine the entire response and never find the target string we return FALSE. There is one trick to watch for here: it is possible that a response stream block of bytes might end in the middle of the target, breaking it into two parts. If so, you would not find the target string. In practice this is not very likely and you can defend against this possibility by increasing the number of bytes read per socket_read() so that you capture the entire response stream.

To summarize, the key to automated testing of PHP Web applications is the ability to send raw HTTP data to the Web server. PHP has a family of socket functions that make it easy to do so. After reading information from test case files containing input values and expected values, you send the input to the server then examine the response for the expected value.

Using The GET Method

In the previous sections, we assumed that the PHP Web application under test sends data to the server using the POST method. What if the application uses GET? Suppose you have a Web application where the user submits a user ID and a password using GET. (By the way, this is a bad idea because with GET the form data is appended to the request URL.) The following code snippet shows how to send a request using GET:

    // create socket
    // connect
    $send  = "GET /PHP/form2.php?";
    $send .= "userID=" . urlencode("root");
    $send .= "&password=" . urlencode("secret");
    $send .= " HTTP/1.1\r\n";
    $send .= "Host: localhost\r\n";
    $send .= "\r\n";
    socket_write($socket, $send, strlen($send));
    // read response

The first line of the HTTP request header uses GET and the data to send is appended to the URL as a query string using the name=value format. Because the user data is tied to the URL, it is especially important to use the urlencode() function to handle troublesome characters.

Beyond the Basics

You can modify and extend the basic PHP application test framework presented here in many ways. For clarity, I used a simple text file to store test cases, but you should consider good alternatives, like XML or database storage. Using XML to hold your test cases is particularly appropriate when the test cases have a complex structure (for example, many optional parameters), or are shared across groups. A database, on the other

Listing 1

    <html>
    <!-- simple.php -->
    <head><title>PHP Test Automation</title></head>
    <body>
    <h3>Query Employees</h3>
    <form name="theForm" action="simple.php" method="POST">
    <p>Last name: <input type="text" name="lastname" /></p>
    <p><input type="submit" value="Find Employee" /></p>
    </form>

    <?php
    $conn = mysql_connect("localhost", "guest", "secret");
    mysql_select_db("dbCompany");

    if (isset($_POST['lastname']))
    {
    $search = $_POST['lastname'];
    $query = "SELECT * FROM tblEmployees WHERE lastname = '" .
$search . “‘“; 19 20 $dataset = mysql_query($query); 21 22 echo “<table>\n”; 23 while ($row = mysql_fetch_array($dataset, MYSQL_ASSOC)) 24 { 25 echo “<tr>\n”; 26 echo “<td>” . $row[‘empid’] . “ “ . $row[‘firstname’]; 27 echo “ “ . $row[‘lastname’] . “ “ . $row[‘email’] . “</td>\n”; 28 echo “</tr>\n”; 29 } 30 echo “</table>\n”; 31 } 32 mysql_close($conn); 33 ?> 34 35 </body> 36 </html> March 2004 ● PHP Architect ● www.phparch.com testing language, I was pleased to find that they are as good as any language I've worked with—and maybe even better, in some cases. In the introduction to this article, I noted that most of the client companies I work with are currently investiListing 2 1 <?php 2 3 // test.php 4 5 function resHasTarget($ipAddress, $port, $method, $page, $postData, $target) 6 { 7 $socket = socket_create(AF_INET, SOCK_STREAM, 0) 8 or die(“Socket failed\n”); 9 10 $connect = socket_connect($socket, $ipAddress, $port) 11 or die(“Connect failed\n”); 12 13 $reqBody = $postData; 14 $contentLength = strlen($reqBody); 15 16 $send = $method . “ “ . $page . “ HTTP/1.1\r\n”; 17 $send .= “Host: localhost\r\n”; 18 $send .= “Accept: */*\r\n”; 19 $send .= “User-Agent: test.php test automation\r\n”; 20 $send .= “Content-Type: application/x-www-formurlencoded\r\n”; 21 $send .= “Content-Length: “ . $contentLength . 
“\r\n\r\n”; 22 $send .= $reqBody; 23 $send .= “\r\n”; 24 25 socket_write($socket, $send, strlen($send)); 26 27 while ($receiveBuffer = socket_read($socket, 2048)) 28 { 29 if (strpos($receiveBuffer, $target)) 30 { 31 socket_close($socket); 32 return TRUE; 33 } 34 echo $receiveBuffer; 35 } 36 37 socket_close($socket); 38 return FALSE; 39 } 40 41 function main() 42 { 43 $ipAddress = ‘127.0.0.1’; 44 $port = ‘80’; 45 $page = ‘/PHP/simple.php’; 46 $method = ‘POST’; 47 48 echo “\nBegin test run\n\n”; 49 echo “caseid result\n”; 50 echo “===================================================\n\n”; 51 52 $fp = fopen(“cases.txt”, “r”); 53 while (!feof($fp)) 54 { 55 $line = fgets($fp, 4096); 56 list($caseid, $input, $expected, $comment) = explode(“:”, $line); 57 $postData = ‘lastname=’ . urlencode($input); 58 59 if (resHasTarget($ipAddress, $port, $method, $page, $postData, $expected)) 60 echo “$caseid Pass input = “ . str_pad($input, 12) . “expected = $expected\n”; 61 else 62 echo “$caseid FAIL input = “ . str_pad($input, 12) . “expected = $expected\n”; 63 } 64 fclose($fp); 65 echo “\nDone\n”; 66 67 $postData=’lastname=Baker’; 68 $expected=’’; 69 resHasTarget($ipAddress, $port, $method, $page, $postData, $expected); 70 } 71 72 main(); // run tests 73 74 ?> Licensed to 63883 - Joseph Crawford ([email protected]) hand, can come in handy when you have a very large number of test cases. The technique in this article displays its output to a command shell. In a production environment, you will probably want to write test results to a text file or a SQL database. Writing to a text file is most appropriate when you are on a relatively short production cycle. Writing results to a SQL database is useful when you are in a long production cycle because you will be generating lots of data that can be shared and analyzed in many different ways. In a production environment, I always add additional data to the results log. 
At a minimum, you will want to add counters for the number of cases which pass and which fail. I also like to add timing information for each test case and the overall test run. Timing information can uncover problems in the Web application code that basic pass-fail data misses. And for reporting purposes, you can timestamp the date of the test run. To be honest, when I first started using PHP I was very surprised at how well it works as a language for software test automation. In general, it is best to write test automation using the same language as that used by the system under test—test a C++ application using C++, test a Java application using Java. The idea is that if you use different languages, you run into many crosslanguage issues which affect the validity of your test automation. But often, using the same language is just not possible. When I examined PHP's capabilities as a 32 FEATURE gating mixed-technology enviFigure 4 ronments. As recently as twelve months ago, mixing Open Source and proprietary technologies usually had uneven results, but the situation has changed dramatically for the better. The machine on which I developed the techniques used in this article happily supports MySQL and SQL Server, C# and PHP, Apache and IIS, and dual boots into Linux and Windows XP. This works in PHP's favor: developers can install PHP over their existing technologies and gradually migrate. In particular, I am seeing many in-house shops start to move from ColdFusion to PHP as their programming platform of choice for Web projects. An interesting side effect of the test automation presented in this article is that you can easily adapt the test code to create a general purpose HTTP response viewer. By placing an echo() statement inside the while loop that examines the response: while ($receiveBuffer = socket_read($socket, 2048)) { echo $receiveBuffer; } and making a few other cosmetic changes you can view the entire response stream, as you can see in Figure 4. 
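The response viewer idea can be taken one small step further. What follows is a minimal sketch, not the author's code: the helper names splitHttpResponse() and readWholeResponse() are hypothetical, and the type declarations assume PHP 7 or later. It collects the whole response into one string and then separates the status line, the headers and the body at the first blank line.

```php
<?php
// Sketch only: splitHttpResponse() and readWholeResponse() are hypothetical
// helper names, not part of the article's test.php. Assumes PHP 7 or later.

// Read blocks from a connected socket until the server closes the connection.
function readWholeResponse($socket, int $blockSize = 2048): string
{
    $raw = '';
    while (true) {
        $block = socket_read($socket, $blockSize);
        if ($block === false || $block === '') {
            break; // error or end of stream
        }
        $raw .= $block;
    }
    return $raw;
}

// Split a raw HTTP response into its status line, headers and body.
function splitHttpResponse(string $raw): array
{
    // The headers end at the first blank line, i.e. two CRLF pairs in a row.
    $parts = explode("\r\n\r\n", $raw, 2);
    $headerLines = explode("\r\n", $parts[0]);
    $status = array_shift($headerLines);

    $headers = [];
    foreach ($headerLines as $line) {
        $pair = explode(':', $line, 2);
        if (count($pair) === 2) {
            // Header names are case-insensitive, so store them in lower case.
            $headers[strtolower(trim($pair[0]))] = trim($pair[1]);
        }
    }

    return ['status' => $status, 'headers' => $headers, 'body' => $parts[1] ?? ''];
}

// Demonstration on a canned response, so no Web server is needed.
$sample = "HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<table><tr><td>Adam</td></tr></table>";
$view = splitHttpResponse($sample);
echo $view['status'], "\n";                  // prints: HTTP/1.1 200 OK
echo $view['headers']['content-type'], "\n"; // prints: text/html
```

Against a live server, readWholeResponse() only returns once the server closes the connection, so you would probably want to add a "Connection: close" header to the request; with HTTP/1.1 the server may otherwise keep the socket open. Searching the collected string also sidesteps the case where the target is split across two blocks.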
If you are new to programming with PHP at a low level, this is a great way to learn what is really going on with HTTP behind the scenes.

In principle, testing PHP Web applications is similar to traditional API (Application Programming Interface) or unit testing. But because PHP applications are client-server based, there are additional connectivity issues. This means you will want to use error checking liberally. As usual for instructional articles, I removed all error checking in the code presented here. Based on my experience, adding exception handling code (if you're using PHP5) will double the size of your source code but is well worth the effort.

One valuable use of the technique presented in this article is to construct Developer Regression Tests (DRTs) for your PHP Web applications. DRTs are a sequence of automated tests that are run after you make changes to your application. They are designed to determine if your new code has broken existing functionality, before you check it in to your version-control repository. You can also create an extensive set of test cases for a Full Test Pass.

Conclusion

In this article, I have shown you how easy it is to create test automation systems written in PHP for your applications. As PHP matures, testing will become more important and the ability to write automated tests will become more useful than it already is. And because PHP works so well in a mixed technology environment, the ability to write PHP test automation is a valuable addition to your skill set, no matter what platforms you use.

About the Author

Dr. James McCaffrey works for Volt Information Sciences, Inc., where he manages technical training for over 4,000 software engineers working at a wide range of companies.
Previously, he was a university professor and worked on several Microsoft products including Internet Explorer and MSN Search. James can be reached at [email protected].

To Discuss this article: http://forums.phparch.com/132

Book Review
by Eddie Peloke

Flash MX 2004 for Rich Internet Applications
by Phillip Kerman
Publisher: New Riders
Paperback, 430 pages
$45.00 (US), $67.99 (Canada)

It is hard to do much web surfing without coming across some form of Flash content. Whether it is a menu, form or movie, Macromedia's rich content seems to be everywhere. In the past, it was primarily used for creating animations and movies, but, recently, Macromedia has begun pushing it past the boundaries of simply being an 'animation tool' and into the realm of programming. Rich Internet Applications, or RIAs, as they are called, are intended as applications which create a 'Rich' user experience by resembling a traditional desktop application more closely than a web app. RIAs also bring forth a technology in Flash called Flash Remoting, which allows Flash to communicate with outside services. These services can take the form of a ColdFusion page, Java class, or even a PHP class, as outlined in the "Flash Remoting with AMFPHP" article that appeared in the July issue of PHP|architect.

In the past, I was never really interested in using Flash. I had seen some cool menus and movies, but never thought it practical enough to take the time to learn the tool. However, when I read about RIAs for the first time, I was excited to give this approach a try. Being a former teacher, I have always wanted to create an online grade book application. The thought of using Flash's data grids, forms, and other features for the presentation layer while using PHP classes for the backend really interested me.
I decided now was the time to get to know Flash, so I picked up Flash MX 2004 for Rich Internet Applications.

I should say, first of all, that this book is not what I had expected. I was originally hoping for more of a step-by-step tutorial on creating an RIA, which this book does not provide. While the author does provide some good examples, I found the book to be more of a general RIA development best practices book. The book does a good job of explaining why you would want to use an RIA and the technologies involved. Chapters such as 'Presenting Data', 'Production Techniques', and 'Using Components' help build a general understanding of the process and techniques, and some of the information in the chapters is applicable to the creation of any application, regardless of the tool involved. Even though the book is geared more toward the skilled Flash and ActionScript user, the author does include plenty of code examples, many of which are self-explanatory.

One of my gripes about this book is its lack of coverage for some of the Flash remoting tools. While it can be argued that this book is primarily about Flash's role in RIAs, it would be nice to see some coverage of tools such as AMFPHP, the open source Flash remoting tool for PHP. It allows you to create a Flash movie which connects to your PHP classes, where you can handle all of the logic while Flash manages the presentation layer. In my opinion, this could be a powerful combination that is not nearly as well documented as it should be.

All in all, I think this is a good read for the developer looking for more information on Flash and Rich Internet Applications. It contains a lot of useful information sprinkled with some cool RIAs created by the author. If you are a Flash and ActionScript newbie, however, you may want to brush up on your skills first.
CRUISE REVIEW

PHP Ahoy! A look at php|cruise
March 1 - 5 • Bahamas 2004
by Marco Tabini

When php|a decided to organize a conference, the first question that we asked ourselves was "why". After all, there already was a well-established circuit of PHP conferences throughout North America at different times of the year; one could almost say that the PHP conference market (if there even is such a thing) was getting more and more saturated. If anything, we wanted to avoid both interfering with existing events and proposing "just another conference," given that there are already so many other organizations out there that do a great job in conventional settings. Therefore, it took a while before Brian K. Jones, then our Editor-in-Chief, came up with the brilliant idea of holding the conference on a cruise ship, and thus set us off on our path.

Naturally, even with the idea firmly in our minds, getting the first php|cruise off the ground was an enormous task that took several months of work just to get from the idea stage (you know, the point where somebody starts saying "wouldn't it be nice if...") to the moment in which we finally decided to announce it to the public. After lots more work, we finally sailed for the Caribbean from Port Canaveral (near Orlando, Florida) aboard the Sovereign of the Seas on March 1st.

For those who have never been on a large cruise ship before, sailing on such a big vessel (the Sovereign holds almost 3,000 people) is an... interesting experience. Given that the ship itself is so massive, even in relatively rough seas it will only rock slightly, so that, while one easily notices that something is "odd", it is rarely disturbing to the point where one gets seasick.
Since we had a very busy schedule, we started off with our keynote session, given by Zend Studio co-creator Zeev Suraski, shortly before departure. Once the ship had actually left the docks, many people only noticed that we were not connected to terra firma anymore because they saw the speakers swaying slightly from one side to the other, themselves hardly aware of the fact!

The conference ran on two separate tracks, so that the attendees could, at all times, choose the session they liked best. For the most part, every lecture took place in one of the ship's two theatres, equipped with the appropriate audio/video tools. Despite some initial technical difficulties (caused primarily by a projector with the wrong cable), once we got under way you could hardly tell that the whole thing was happening on a ship in the middle of the Atlantic Ocean; we could have easily been in a hotel in any major city. After the first day, once we had everything set up, we even had practically continuous wireless (and wired) Internet access.

From a practical perspective, therefore, php|c was a full-fledged, typical PHP conference, with excellent speakers, many of whom offered original talks, and lectures on all sorts of PHP-related topics, such as regular expressions, debugging, profiling and creating development frameworks. However, php|c was made unique by two elements that were a consequence of the venue we had chosen.

A cruise ship is a very odd place. On one hand, you are effectively confined to a limited space: huge (the Sovereign held some 3,000 people very comfortably), but still limited if you compare it to, say, being in the middle of Manhattan. On the other, I dare anyone to become bored during their stay aboard the ship. No matter what time it is, there is always something to do, whether you're into gourmet food, gambling, rock climbing or just sitting around and having fun with your friends.
In the context of a conference, this results in a significantly higher amount of experience-sharing between the attendees and the speakers. More than once during the cruise, I had occasion to walk by one of the many bars and find groups of people talking animatedly about things as varied as what applications they were working on or what they were expecting from PHP5. The ability to exchange your personal experiences with your peers is, perhaps, one of the most important aspects of a conference, but in a traditional setting it's too easy for the attendees to go their separate ways outside of session times and lose sight of each other.

We learned another important lesson by experimenting with the conference rooms in which each session was held. By sheer accident, we were forced to move one of the tracks from its assigned theatre to one of the ship's many lounges for an entire day. Now, it goes without saying that a lounge is set up in a very different way compared to a theatre: the seats are arranged around tables, and the tables themselves are arranged so that everybody can see everybody else (at least for the most part). Although counterintuitive for a lecture, this setting seems to have worked wonders as far as our sessions went. Both speakers and attendees found themselves more at ease and much more comfortable with intervening during each session with their personal comments and experiences. Speaker Stuart Herbert held perhaps one of the most memorable PHP sessions I have ever participated in by hosting what he called a "shared experiences" discussion on creating programming frameworks, loosely guided by a set of slides he had prepared beforehand. Some of the attendees liked Stu's idea so much that they rated it as a "six" on a scale of one to five in their questionnaires!

From Work to Fun

Even though we were absolutely serious about holding a full-fledged PHP conference, the venue we had chosen gave us plenty of opportunities for unprecedented fun; we were, after all, on a cruise ship going to the Bahamas! As I mentioned earlier, the ship itself was a constant source of activities, which, of course, prompted many of our attendees to bring their significant others along with them for the ride. A conference at which you can have fun with your family: the perfect crime!

On top of the amenities you would normally expect, like two salt-water swimming pools and two giant hot tubs that the guests loved to take advantage of at night, one could find all sorts of attractions. Perhaps the most exotic of them all must have been the rock climbing wall, which I found a bit scary but that some of our attendees enjoyed very much in their spare time.

There's a commercial for Disney Cruises on TV in the US where each member of the family finds something fun aboard the ship to do during their stay. The children go play with the Disney characters, the grandparents go play bingo (or something like that) and mom goes to the spa. Before leaving, they all ask dad if he's all right, wondering what activities he has planned. The man of the house reassures them that he's got everything covered and sends them all on their way. We next see him sleeping in just about every spot that is fit for showing on television, from the beach to the massage parlour.
Even though Disney was not our cruise line (although "PHP with Mickey" may be a good idea for the future), that dad in the commercial inspired me, and I enjoyed every last snoozing moment aboard that ship, as well, of course, as the ensuing sunburn (but that's another story).

So far, we've covered daytime aboard the ship (we'll get back to the shore later on). What about the nightlife? Well, there were all sorts of things going on, of course. Given that we had been blessed by extremely good weather, the ship's crew organized some sort of dancing party every night on the top deck (in the open air), inclusive of a rich midnight buffet. For those among us who like taking risks, the ship featured a full-fledged casino, whose friendly operators were more than ready to take our money. For a friendly discussion on the latest PHP developments, many hit the various bars and shared their thoughts over excellent daiquiris or piña coladas. Finally, the musically inclined also had an opportunity for more dancing in one of the disco lounges or even karaoke.

Naturally, the ship wasn't at sea the whole time. We visited Coco Cay, a private island owned by Royal Caribbean (proof positive that the cruise business is very profitable, apparently) that is, essentially, a huge water park, complete with slides and beaches, as well as many different excursion opportunities (like snorkeling, scuba diving or swimming with the dolphins) which our attendees promptly took advantage of. Although I didn't get an opportunity to visit Coco Cay (I was too busy imitating the guy in the Disney commercial), everyone who went had a great time and brought home some wonderful memories.

Our second port-of-call, just before coming back to Orlando, was Nassau, the capital of the Bahamas.
The ship docks on the "touristy" side of the town, so that one can have a comfortable stroll around the various shops and spend some hard-earned money on anything from clothes to handmade souvenirs, which is what I did. For the more adventurous, the ship was, once more, organizing a number of different excursions, some of which were quite exotic: picture yourself snorkeling in the middle of crystal-clear water while you play with a stingray, and you'll have a good idea of what some of our guests experienced.

The Atlantis Resort, located on Paradise Island (not far from Nassau itself), offered even more opportunities for those who didn't want to stay on the ship but still wanted to enjoy the nightlife. Atlantis features a number of different attractions, including an incredible aquarium, a private beach, several restaurants and yet another casino for the gamblers. Being "busy" with a bit more R&R myself, I didn't get an opportunity to go, but those who did were very enthusiastic about it.

A Look at the Future

php|cruise turned out to be a very interesting experience. I think that everybody who participated had lots of fun and learned something new about PHP, which was, of course, our goal from the very beginning. Encouraged by its success, we have started working on the next edition, which will take place in the fall. This time, we will go to Alaska, a land that offers a very different, if just as exciting, set of possibilities for having fun. Watch out for an announcement on our next exciting adventure coming April 15 on the php|architect website!

To Discuss this article or see more pictures: http://www.phparch.com/discuss/index.php/t/518/0

PRODUCT REVIEW

Mambo Open Source Content Management System
www.mamboserver.com
by Eddie Peloke

QUICK FACTS

Description: First and foremost Mambo Open Source is a Content Management System (CMS). The goal of the Mambo Open Source project is to meet most of the requirements highlighted in the above article. As each day in development goes by we are getting nearer and nearer, whilst at the same time building a solid core which can be expanded upon by 3rd Party Developers. Mambo Open Source is the engine behind your website that provides the ability to simplify the creation of content.

Requirements:
OS: UNIX, Microsoft Windows 2000/XP
Database: MySQL 3.23.55 or above
PHP: 4.2.1 or above
Web Server: Apache 1.3 or above
Web Browser: Internet Explorer 5.5 / Mozilla 1.4

Price: Mambo Open Source is Free Software released under the GNU General Public License.

Download Page: http://www.mamboserver.com/content/menu/Mambo_Open_Source_Download/

Product Homepage: http://www.mamboserver.com/

Like a lot of my peers, I spend most of my time helping others with their sites, yet I rarely have time to look after my own. For the past year, my family site has had nothing more than the default Apache page. Don't get me wrong: I have plenty of ideas for the site, but there just always seem to be other things to work on. On top of that, since I don't have time to create the pages initially, I know I will have even less time to maintain the site. Looks like I'll need a content management system.

While I have tried a few in the past, I haven't really found one that I like; there's always something that doesn't sit well with me. After all, CMSs are notoriously difficult to write, because it's nearly impossible to create a single application that will satisfy the needs of every possible website. Thus, when a co-worker returned from LinuxWorld talking about a CMS he saw named Mambo, he managed to rouse my curiosity. His description was a bit vague (I was told it was a PHP based CMS which "looked nice") but I thought I'd give it a try nonetheless. After all, with the amount of digital noise we are subject to on a daily basis, the recommendations of friends and colleagues are the last bastions of unfiltered, selfless information (at least for the most part).

According to the Mambo Open Source site:

"Mambo Open Source (MOS) is a PHP/MySQL based Content Management System (CMS) framework released under the GNU/GPL License, which enables the easy creation and maintenance of a Web site or portal. The pure simplicity of MOS 4.5 means that you do not need to be an IT Professional to update, maintain and customize your content."

Boy, have I heard this before... Well, let's see if it's true.

Requirements

Before downloading the code, it's a good thing to ensure that the system you're working on meets the software's minimum requirements. In the case of MOS, these are as follows:

• Operating Systems: UNIX, Microsoft Windows 2000/XP
• Database: MySQL 3.23.55 or above
• PHP: 4.2.1 or above
• Web Server: Apache 1.3 or above
• Web Browser: Internet Explorer 5.5 / Mozilla 1.4

Set Up

Well, my system does meet the requirements, and I was pretty excited to give Mambo Open Source a try, so I quickly downloaded the code and started the set up process. Installing Mambo Open Source could not have been easier: within five minutes from download, it was up and running.
The Mambo install performs quite a few system checks and might complain if it doesn't find certain PHP configuration parameters set or doesn't have access to certain directories, but those errors are easy to fix and the installation should complete without trouble.

The System

Once Mambo Open Source is installed, the administration pages are the first place you will want to visit to begin customizing the site to meet your needs. They allow you to manage your site's templates, users, menus, database, and so on. The administrator has a nice, clean interface through which items are fairly easy to find and manipulate.

One of the first aspects of your site you will probably want to tweak is the interface. Mambo Open Source comes with a handful of templates, but a quick web search will return several sites with hundreds more, which speaks to the popularity of the package and is usually a good indication of its quality. Once you have selected and installed your template, all that is left is a click of the 'Publish' button to have it take control of the site's look and feel. Incidentally, "Publish" is a button you will become very familiar with when using Mambo, as, by default, things don't appear online until they are published.

Figure 1

If the template you selected needs to be modified, don't worry: within the Mambo administrator, you can edit the page's code or style sheet directly online. While I have never had any trouble editing the code, I have heard complaints from another developer using Mambo of strange things happening when editing the code with the administrator's WYSIWYG editor.
Thus, you may find it easier and less troublesome on a regular basis to edit your template's code directly in the template's PHP file rather than through the administrator, but it's good to know that, in a pinch, you can easily get by without having direct access to your server's filesystem.

Usage

A CMS needs to do more than simply manage the look and feel of your applications, and Mambo does attempt to give the user absolute control over every aspect of the site it runs. From a management perspective, it is broken down into three main parts: Components, Modules and Templates. We have briefly discussed Templates, but what exactly are Modules and Components?

Some of the components that come prepackaged with Mambo include:

• Banner Manager
• Polls
• Media Manager
• News Feeds

Some of the prepackaged modules include:

• Menu Managers
• Logins
• Statistics

If you don't find what you need in the initial install, you will find several sites offering many free components and modules. Everything from file managers to forums, galleries, online shops, weather plug-ins, bug tracking systems and pretty much anything else you can think of is out there for you to grab. I have even come across some games, such as a humorous PacMan knock-off named, obviously, Mambo Man.

Figure 2

Now that we have talked briefly about components, modules, and templates, we should take a moment to talk about how they are installed. When the component, module and template (CMT) installer works, it is extremely easy. Typically, all you have to do is upload the zip file from within the Mambo administrator and Mambo will unzip the code and take care of the installation for you. (It does this via the zlib package, for which Mambo will check during the initial product install to make sure it is available).
I say "typically" here only because I have had several components, modules, and templates that just refused to play nice and install. Now, it is probably unfair to fault Mambo for some of the third-party components, but it is hard to determine who is causing the problem. Of course, Mambo does give you the option to upload the files yourself and then install from the uploaded directory; it's just much easier to use the administrator and let Mambo take care of it.

Advanced Features

Mambo has a few "advanced" features that I have found to be a nice addition. The most notable, for me, is the database management system. The Mambo administrator allows you to back up, restore and run queries against your MySQL database and, while this in no way replaces tools such as SQLyog or MySQL Front, it is nice when you just need to run a quick query and don't have access to another DB tool.

Mambo also contains support for content archiving and versioning. While I have not yet used these features in my system, I can see their benefit in an environment with several users constantly changing content.

What I Liked

The main thing I like about Mambo is that it is written in PHP. That means anything I don't like, I can fix with ease. All of the quirks or deficiencies of the system can be corrected by the programmer without having to reinvent the wheel every time.

I have also found the code clean and, for the most part, well documented. For example, a quick look into the database class shows comments around the functions, data members, and so on, making it easier to figure out what is going on and ultimately easier to modify the code if needed. I also like the ease with which content can be published, moved around the templates and re-ordered directly from the web interface.
As I mentioned, the number of different templates, components and modules available online is a sign, in my opinion, of a healthy and well-supported system. There is enough out there to satisfy just about anyone's needs.

Figure 3

What I Didn't Like

I have found that it takes a day or so of playing around with the system to get the hang of exactly how items work. In some areas, where things should happen in a certain order, it is not always obvious what the correct procedure is. For example, it happened to me several times that, after having published an item, I couldn't find my content on a web page because it didn't contain any "records" or wasn't attached to a higher-level item. While I understand the need for such a hierarchy, it would be nice if the administrative pages gave you better indications as to why some items won't show up unless you do something else first.

My biggest gripe, however, is with some of the external components. Again, it is hard to fault Mambo for code written by outside developers, but it is still a pain to get an error when attempting to install new items, which is something that all users will have to do at some point. Thus, a better installation management system might not be a bad idea.

If you are a standards stickler, you will find that some of the templates will not pass the W3C validator. For instance, running my test homepage through the validator returned 119 errors. While this doesn't particularly trouble me, I would have liked to see fewer standards violations in the code. If standards are of high importance, you may want to check out xMambo, which is a standards-compliant publishing system based on Mambo. The good news is that the Mambo team is now working to bring the benefits of xMambo under the Mambo umbrella into a single package.
Conclusion

Overall, I like Mambo. If you are looking for a content management system for one of your web projects, this is definitely worth a look. With a wealth of external plug-ins available, you should be able to find just about any item you need to achieve what you want.

Figure 4

WAP: Past, Present and Future
by Andrea Trasatti

WAP stands for Wireless Application Protocol. It is developed and administered by a consortium of companies known as the Open Mobile Alliance (OMA), which also owns the trademark on its brand name and regulates its use. Just as the name implies, WAP is the application protocol for wireless services. WAP specifications range from the communication protocol between a server and a wireless device to the markup language that should be used to exchange data. Most of the specifications are an adaptation of existing standards for the wireless and wired world. This means, of course, that you will not need to study everything from scratch to develop a WAP site; it is much easier than many might think. If you can write an HTML page, then you can also write a WAP page.

WML (Wireless Markup Language) is the language that was invented to develop WAP sites. It is derived from XML and complies with the XML standard. The first thing you will have to get used to is that WAP browsers are not as developer-friendly as the traditional Web browsers you are used to. If you have a typo or introduce incorrect syntax, the browser will simply print an error message ("Compile error" is the most common, but it depends on the browser) rather than being tolerant and trying to recover from the situation like a Web browser.

If you want to develop a WAP site, the first thing you may want to do is download an emulator (you will need it to test your WAP pages), download the WAP documentation from the OMA Web page (http://www.openmobilealliance.org) and set up a Web server to load your pages. To support as many devices as possible, I suggest that you download the Openwave SDK (http://developer.openwave.com) and the Nokia Mobile Browser (http://forum.nokia.com). Both software packages are available for download free of charge (or after a free registration on their respective websites). While the Openwave SDK comes with a generic WAP emulator and supports skins to emulate specific devices, the Nokia Mobile Browser is a core application that will require you to download some of the available plug-ins to emulate specific devices; if you don't, you will get a generic device which doesn't have much to do with real WAP terminals (I don't suggest using it for testing!).

If you are moving into the world of WAP from HTML and Web applications, you might think you will need a specific editor to write your pages, but this is not the case, as WML is very similar to HTML, and I know many people who use Homesite or similar HTML editors to do their WAP work without any problem. Thus, you can just pick your favorite editor ("vim" is always at the top of my list!) and get straight to work.
REQUIREMENTS
PHP: 4.x
OS: Any
Applications: N/A
Code Directory: wap

Getting Started

Now you have all the tools you need to start developing, with the exception of a WAP server. Luckily, all you need is a common webserver, such as Apache (http://httpd.apache.org), and a little configuration in the MIME types file to allow for the file types specific to WAP documents and images. If you are going to develop all your WAP pages with PHP and won't need to use any images, you will not really need to make this modification, although I always suggest doing it anyway; after all, you never know if you will need it some day. WAP introduced a few new extensions that need the appropriate MIME types: wml, wmls and wbmp. If you are using Apache, add the following lines to your "mime.types" file (and then restart the server):

text/vnd.wap.wml                wml
application/vnd.wap.wmlc        wmlc
text/vnd.wap.wmlscript          wmls
application/vnd.wap.wmlscriptc  wmlsc
image/vnd.wap.wbmp              wbmp

While developing your WAP site, you will mostly set the appropriate MIME type through PHP, but you might need these Apache settings for images (wbmp), or in case you decide to use hard-coded wml and wmls pages.

Since I introduced all these new file types, it is worth talking about them a little bit, although you can find extensive descriptions of these and more extensions in the OMA documents. "wml" is the extension for WML pages, which play the same role on a WAP site that HTML pages play on the Web. "wmls" is for WMLScript, the wireless counterpart of JavaScript. Unlike JavaScript in a Web page, WMLScript is not embedded in the WML page itself: it is stored in external files that the page references by URL.

WBMP stands for Wireless BitMaP and represents a black and white image. All graphically-capable WAP devices support WBMP, but the most recent devices also support many other image formats, such as GIF, JPG and PNG (which also provide color images). If you are trying to write an application that will work with as many different devices as possible, it's usually safe to just go with a WBMP, which any device will display properly. If you don't know how to generate a WBMP image, check your favorite graphics software; many recent ones have plug-ins to generate WBMPs. Also, you can find some software to convert a generic image into a WBMP (try this online converter: http://www.teraflops.com/wbmp/ and this online editor: http://Webcab.de/woe.htm, or Google a little bit, and you'll find that there are plenty of applications available).

WMLC pages are WML pages that are already compiled. This format is not widely used, but it exists nonetheless. While WMLC content is referred to as "compiled," it is not compiled in the way a C source file is; WML tags are simply converted into symbols so that they will use less bandwidth (remember that many WAP devices have little bandwidth at their disposal and, therefore, you should make your documents as lean as possible). Keep in mind that any WAP emulator will also show the complete source of a WML page, regardless of whether it has been compiled or compressed (just like Web browsers do) and, therefore, compiling will not protect your pages from prying eyes.

If you want to know more about how WAP works, I suggest that you take a look at the OMA official documents; they will shed more light on subjects that I cannot cover in this article, like compiled pages and the communication protocols between devices, gateways, and webservers.

A Simple WAP Page

Now that we have had a little introduction to WAP and its different component file types, we can proceed to the first example. Let's analyze the code in Listing 1.

Listing 1

<?xml version="1.0"?>
<!DOCTYPE wml PUBLIC "-//WAPFORUM//DTD WML 1.1//EN" "http://www.wapforum.org/DTD/wml_1.1.xml">

<wml>
  <card id="main" title="first_page">
    <p>
      Hello World<br/>
      this is my first WAP page.
    </p>
  </card>
</wml>

The first two lines define the document and the revision of the WML used in the current page. This is required code, just as it is for any XML document. The <wml> tag defines the beginning of the WML page. Each page must begin with this tag and end with </wml> (just like an HTML page must be contained within an <HTML> element). <card> is the tag that defines the beginning of a "card". When WML was first defined, each page was defined as a "deck" that is composed of one or more cards. Each card will be displayed as a single page by the WAP browser. This is particularly useful if you have a predefined navigation scheme. If you think of a WAP device and the time it takes to load a page over a slow network, you will understand how useful it is to already have the next page in memory.

Each card that composes the deck begins with the <card> tag and ends with the </card> tag. The opening tag needs the id attribute, which differentiates the cards from one another and behaves like an HTML anchor. If you want to jump from one card to another, all you need to do is create a link to #cardid, where cardid is, obviously, the string defined in the id attribute for that card.

Many of the tags that are part of the XHTML specification can also be used in WML. For example, <p> is used to identify a paragraph, while <br/> introduces a line break. It's important to remember that you are essentially dealing with an XML document and, therefore, forgetting the slash will cause the WAP browser to produce a "compile error" message.

As you can see, the structure of this simple page is not much different from what its HTML equivalent would look like, even more so if you were to compare it to an XHTML document. In the next example, we will see some more tags, including some that are unique to WML.

The first thing you must have clear in your mind when developing a WAP application is that you are not developing a website. Your application will be visited by users who have a tiny display with little (often uncomfortable) keys and who are probably paying a lot of money for the privilege of accessing the Internet. The key to a successful WAP site, in my experience, is simplicity and usability. The content is extremely important as well, of course, but the difference between WAP and Web pages lies primarily in user-friendliness. A simple Web page with great content will get many hits (Google being the prime example), while good content in WAP will not be as popular if the site is not usable and friendly; people will just wait until they get home and use the Web instead.

To help developers make their applications easier to use, the OMA defined some specific tags, such as <do>, which supports the type attribute. do tags can be of type option or accept, for example. The former links an object to the left softkey, while the latter defines the action for the right softkey. These are useful to ease the user's navigation (if you're wondering what softkeys are, they are the two buttons below the screen; in order to be WAP-compliant, a device must have these two keys active during the navigation).

Let's move to a more complex page and check what WML can offer; take a look at Listing 2.

Listing 2

<?xml version="1.0"?>
<!DOCTYPE wml PUBLIC "-//WAPFORUM//DTD WML 1.1//EN" "http://www.wapforum.org/DTD/wml_1.1.xml">
<wml>
  <card id="main" title="first page">
    <p>
      Hello World,<br/>
      take a look at this picture:<img src="mypic.wbmp" alt="here is my pic"/>
      Click <a href="#like">here</a> if you like it, or click your right softkey if you don't.
    </p>
    <do type="accept" label="Dislike">
      <go href="#dislike"/>
    </do>
  </card>
  <card id="like" title="I like it">
    <p>Good, you like my picture.</p>
  </card>
  <card id="dislike" title="I don't like it">
    <p>Too bad, you don't like my picture.</p>
  </card>
</wml>

As you can see, I used the well-known anchor tag and the <do> tag together. This is probably not the most usable page I ever wrote, but it's a good example that we can use to play with cards and links. Let's analyze the elements of this deck. We have three cards: main, like and dislike. By default, the WAP browser will show the first card to the user. The user will pick one of the two options (click the link or the accept button, which is bound to the right softkey) and the appropriate following card will be displayed.

The img tag has a closing slash to comply with the XML standard (just as it would in XHTML). Also, notice the alt attribute, which is needed in WML so that if the browser cannot show the image, the "alternative text" will be displayed. The reason why this attribute is necessary is that many WAP devices do not have graphics display capabilities and, therefore, the ability to show a text alternative to an image is very important.

I put the image on the same line as the message to raise a problem you might encounter with different WAP browsers. Some (like newer Openwave browsers) will display this image as you would expect a Web browser to, on the same line as the text beside it, while some others (like older Nokia devices) will place it on a new line. Naturally, if the image width plus the text exceeds the screen width, the image will go on a new line, but you should be aware that not all browsers will behave the same way under all circumstances and, therefore, you should plan your WAP documents accordingly.

The syntax for anchors is just the same as for Web pages. If you look at the WML 1.x documentation, you will see that <anchor> can also be used as a tag for anchors. Another alternative is the go tag. The browser's behavior in response to either is the same, but the go tag also gives you the possibility to define the method (GET or POST) and any additional parameters you want to pass to the subsequent pages. I suggest reading the full documentation for more specific help.

Another important thing to keep in mind is that, in a WAP application, the order in which tags appear is extremely important. For example, the do tag should be used outside of a paragraph and inside a card. In this case, if we put the do tag inside the paragraph, it will cause a compile error.

WAP and PHP: A Simple Example

If you have configured your Apache webserver properly and saved the examples above into a couple of WML files, you will be able to browse them with a WAP emulator. But what if you wanted to write a WAP application using PHP? I don't need to explain why you should use PHP, or that PHP can give you more than a static page, so let's just talk about how you can use it. First of all, let's talk about the content type of your script's output. By default, PHP sets it to text/html (the default for Web pages), but we want it to be text/vnd.wap.wml, so that the WAP Gateway and WAP Browser will recognize it properly.

Listing 3

<?php
header("Content-Type: text/vnd.wap.wml");
header("Expires: " . gmdate("D, d M Y H:i:s", time() - 1000) . " GMT");
header("Last-Modified: " . gmdate("D, d M Y H:i:s") . " GMT");
header("Cache-Control: no-cache, must-revalidate");
header("Pragma: no-cache");
// Print the XML declaration from PHP so that "<?" is not taken for a PHP open tag
echo '<?xml version="1.0"?>' . "\n";
?>
<!DOCTYPE wml PUBLIC "-//WAPFORUM//DTD WML 1.1//EN" "http://www.wapforum.org/DTD/wml_1.1.xml">

<wml>
  <head>
    <meta forua="true" http-equiv="Cache-Control" content="max-age=0"/>
  </head>
  <card id="main" title="never cache">
    <p>This deck will always be reloaded</p>
  </card>
</wml>
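Listing 3 mixes its headers with a hand-written deck. For pages whose cards come from data, it can help to keep the headers and the markup in two small functions. The following is only a sketch: wml_headers() and render_deck() are hypothetical helpers, not part of PHP or of any WAP library, and the card array layout (id, title, text) is an assumption made for the example.

```php
<?php
// A sketch only: wml_headers() and render_deck() are hypothetical helpers,
// not part of PHP or of any WAP library.

/**
 * HTTP headers for a WML response that should never be served from cache
 * (the WML content type plus the cache headers used in Listing 3).
 */
function wml_headers()
{
    $httpDate = 'D, d M Y H:i:s'; // HTTP dates are expressed in GMT
    return array(
        'Content-Type: text/vnd.wap.wml',
        'Expires: ' . gmdate($httpDate, time() - 1000) . ' GMT',
        'Last-Modified: ' . gmdate($httpDate) . ' GMT',
        'Cache-Control: no-cache, must-revalidate',
        'Pragma: no-cache',
    );
}

/**
 * Build a WML 1.1 deck from an array of cards.
 * Each card is array('id' => ..., 'title' => ..., 'text' => ...).
 * Every value is escaped, because a stray "<" or "&" is enough to make a
 * WAP browser report a compile error.
 */
function render_deck($cards)
{
    $wml  = '<?xml version="1.0"?>' . "\n";
    $wml .= '<!DOCTYPE wml PUBLIC "-//WAPFORUM//DTD WML 1.1//EN" '
          . '"http://www.wapforum.org/DTD/wml_1.1.xml">' . "\n";
    $wml .= "<wml>\n";
    foreach ($cards as $card) {
        $wml .= sprintf(
            "  <card id=\"%s\" title=\"%s\">\n    <p>%s</p>\n  </card>\n",
            htmlspecialchars($card['id'], ENT_QUOTES),
            htmlspecialchars($card['title'], ENT_QUOTES),
            htmlspecialchars($card['text'], ENT_QUOTES)
        );
    }
    return $wml . "</wml>\n";
}

$deck = render_deck(array(
    array('id' => 'main', 'title' => 'Hello', 'text' => 'Fish & chips < 5 EUR'),
));

// Under a web server, the headers must be sent before any output.
if (PHP_SAPI !== 'cli') {
    foreach (wml_headers() as $line) {
        header($line);
    }
}
echo $deck;
```

Building the whole deck as a string before echoing it means nothing reaches the browser until the headers have been sent, which avoids the "headers already sent" class of mistakes.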
The PHP syntax to set the header is:

header("Content-Type: text/vnd.wap.wml");

This is the very first step you will need to take to write a WAP page within PHP. Just like for Web pages, you can set an expiration time and, in fact, you should, because WAP browsers tend to use their cache very aggressively to make navigation as fast (and inexpensive) as possible for their users. Since WAP relies on the HTTP protocol for its communications, you can set the expiration time of a document in pretty much the same way you would for a normal HTML page. As you can see in Listing 3, this is accomplished with a simple set of calls to the header() function.

The Road to the Future

Obviously, the examples I presented do not cover all the possible scenarios of WML development. The aim of this article is not to make you a wireless master, but rather to show you the options you have if you want to start developing a working site. I think it was worth introducing WML and showing some of the main differences from "standard" Web development to let you understand that you should not simply "recycle" a website and adapt it to a mobile device. You will need to rethink it from scratch.

One of the targets of the new OMA standard (WAP 2.0) is to make the transition to WAP easier for Web developers. The first step consists of using a "common language": XHTML. As you all know, XHTML is supported by any browser released in the last year or two, and WAP 2.0 is based on XHTML Basic, plus a few tags specific to mobile devices. What comes out as a result is called XHTML Mobile Profile. Currently, version 1.0 has been standardized and the OMA is working on version 2. As in any transition period, you should keep in mind, while developing, that you will need to support both the old and the new standard for at least some time. Any device released (but not necessarily purchased, since dealers will have to clear inventories out) after April 2003 supports XHTML MP.
As you have probably painfully learned as a professional developer, "support" does not necessarily mean that everything works properly, so you can safely expect that many devices will ignore some of the new tags, but they should, at least, do so silently and without spitting out all sorts of errors.

What are the cool things about XHTML? The most important is its support for CSS (Cascading Style Sheets). With CSS, you are able to define styles and use them in your WAP pages. This particular technology was never used in WML 1.x because the displays were so tiny that applying a style was simply not effective. The latest-model devices, however, feature bigger screens and color displays, making the appearance of

After the do tag I close the card and start a new one. I created two simple cards just to show a message. As you can see, I created a deck with three cards, even though the user is likely to see only two of them. Thus, not all of the content available in the document will actually be shown to the user, but this will save load time regardless of what choice he or she makes. Unfortunately, this is not always possible, as you will often generate the contents of a page depending on the information passed from a link, a form or something similar, but you should take advantage of this capability of WAP whenever possible, as your site will be a bit friendlier as a result.

Also notice that some browsers will display anchors alone on a line. If you have any text around the link, your sentences will be split. Another peculiar behavior some devices show is not allowing you to place a link on an image. These limitations mainly apply to the first devices that hit the market, which had small screens and were capable of displaying only a few lines per page.
Their manufacturers thought that introducing these rendering rules would make navigation easier although, of course, they ended up making the developers' life harder. However, as annoying as they are, these idiosyncrasies shouldn't discourage you; as long as you respect strict WML standards, your pages will be viewable by everyone, and all you can do is try your best to make them look good on as many devices as you can.

Where to Go From Here

This was just a brief introduction to WAP, and you will probably have a lot of questions. Starting your journey into WAP programming is quite easy: you will probably write your first pages without many problems and then "hit the wall" just when you start feeling confident! A good reading of the official documentation and a couple of specialized sites will certainly offer you a deeper knowledge of the topic. What you should always keep in mind is the main concept: a small device used while on the move. Your site must be friendly and simple to use.

My suggestion is to develop your applications while testing them with at least the Nokia and Openwave SDKs. Better yet, you should test them with all the "real" devices that you intend to support. One step further is to balance every single page, trying to make it short enough that scrolling up and down through its contents will not be overly annoying. Also, design your forms so that they can be easily completed by your users without too much typing. Employ dropdown and multiple selection boxes whenever possible, as these also help ensure the accuracy of the data that is entered.

What are the advantages of developing a WAP site with a language such as PHP? Of course, you get all the normal perks of a web language, like the ability to access a database. The real plus, however, is the possibility to tailor each WAP page to the mobile browser used by your visitor.
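One way to start tailoring pages is to look at what each request says about the browser that sent it. The sketch below is an assumption-laden illustration rather than a recommended implementation: choose_markup() is a hypothetical function, the idea of testing the Accept header first is an assumption about how a device advertises XHTML MP support, and the user-agent fragments in the fallback list are made-up placeholders, not real device names.

```php
<?php
// Hypothetical helper: pick a markup language for the requesting device.
// The Accept-header test is an assumption about how a device advertises
// XHTML MP support; the user-agent fragments are made-up placeholders.

function choose_markup($accept, $userAgent)
{
    // XHTML Mobile Profile is served as application/vnd.wap.xhtml+xml
    if (stripos($accept, 'application/vnd.wap.xhtml+xml') !== false) {
        return 'xhtml-mp';
    }
    if (stripos($accept, 'text/vnd.wap.wml') !== false) {
        return 'wml';
    }

    // Fall back to a hand-maintained list of user-agent fragments.
    $knownXhtmlDevices = array('ExampleBrowser/2', 'DemoPhone-X');
    foreach ($knownXhtmlDevices as $fragment) {
        if (stripos($userAgent, $fragment) !== false) {
            return 'xhtml-mp';
        }
    }

    return 'wml'; // the older standard is the safest default
}

$accept    = isset($_SERVER['HTTP_ACCEPT']) ? $_SERVER['HTTP_ACCEPT'] : '';
$userAgent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';

echo choose_markup($accept, $userAgent), "\n";
```

A hand-maintained list like $knownXhtmlDevices does not scale past a handful of devices, which is exactly the gap that device capability databases are meant to fill.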
Reading the user agent that the device sends with each request, you should be able to identify the type of device and offer ad-hoc markup. As I mentioned in this article, each device has its own peculiarities, and this is even more true with the new XHTML MP devices that, in many cases, do not support the full standard or apply it in their own way. With PHP, you will be able to build the WAP page that fits each particular device best, although, of course, in practical terms you still need to figure out how each of them will behave. For this purpose, you essentially have three options:

WAP pages a worthy consideration. Adding colors, different fonts, and alignment suddenly becomes both useful and practical. Another big new feature is support for background colors and background images. Once again, this is a need that was never felt until new displays became available, but it is now a cool feature you can add to your site. While you could use italic or bold text in WML 1.x, most of the devices did not support it. In XHTML MP, these tags are inherited from XHTML Basic and include "strong", "big", "small", "b" and more.

I don't feel like I need to explain every single tag in XHTML, given that you are probably very familiar with them and you can get full documentation from the W3C. What you should know is that you will need to be strict with your syntax and avoid any mistakes or unclosed tags, as on mobile devices these will generate compile errors (and you would be producing invalid XHTML anyway).

With all the new additions that XHTML MP brought, some of the features of WML 1.x have also been pruned out of the standard. For example, we've lost the "do" tag that lets us assign specific functions to the two softkeys, as well as the concept of card and deck.
Forms are present (and quite similar) in the two standards, but a useful function that was lost in the transition is the possibility for the developer to predefine the type of information the user should insert. In WML 1.x, the developer could use specific tags to define that an input field should be filled with numbers only, or letters only, and the device interface would not let the user insert anything else. This helped both the user, who could more easily pick the proper set of keys, and the developer, who could better manage the submitted form. Openwave decided to keep supporting this functionality with a proprietary tag, but this means that if you decide to use it your code will not be compatible with other devices!

• Acquire each of the devices you intend to support and test your site on each one separately. As mentioned above, this will help you ensure maximum performance in all situations, but it may turn out to be an expensive proposition, and it will certainly slow your development efforts down.
• Purchase a commercial package that provides you with a list of device capabilities and build your pages based on the information you find in it.
• Rely on an open-source package to do the same.

The last two options are essentially equivalent, and your choice will probably depend on how you feel about open-source compared to a proprietary solution. Personally, I believe that open-source products can be superior, and that's why I am an active contributor to the Wireless Universal Resource File (WURFL) project, which you can find at http://www.wurfl.org. The testing path I commonly follow is to develop the application based on the information I find in WURFL and then test it with a few of the real devices I intend to support openly.

A Bit of Homework

The future of WAP is bigger and more colorful than ever before, thanks to the new devices that have larger screens, more colors, and faster browsers based on the GPRS and 3G standards.
Even if the dream of "the Web on a mobile device" will probably remain a dream for a little while longer, WAP is widely used by millions of users every day for the download of content of all kinds. For an example, look at ringtone services, which have ballooned in popularity over the last couple of years and have become a rather sizable market. WAP is the ideal medium for them: you connect to a site, browse a list, pick your favorite ringtone and download it. As devices get better and the cellular networks get faster, WAP will become more and more useful. What was once just a "cordless phone" (often the size of an attaché case) is now becoming a tiny computer, and WAP is the transport medium for the content users want to have.

If I convinced you that it is worth your time to read and experiment with WAP a little, you will probably need some links to start from. Your first stop should be "The Wireless FAQ", http://www.thewirelessfaq.com, where you will find some of the things I discussed and many more frequently asked questions and examples. You will also find links to the OMA (http://openmobilealliance.org), from where you will be able to download current and historical documents about WAP. For your experiments, you can download the SDKs I listed at the beginning of the article, and if you want a comfortable emulator for Windows, take a look at WinWAP (http://www.winwap.com). After a little testing and playing around, you might also want to take a look at WURFL (I wrote an article about it in the June 2003 issue of php|architect) and OUI (http://oui.sourceforge.net), an open-source library published by OpenWave that can dynamically adapt your XHTML MP code to the capabilities of each wireless device. There are also a lot of commercial products that can help you develop WAP sites, but these are relatively easy to find on pretty much any search engine.

If websites and official documentation are not enough, you can also come discuss WAP on the famous WMLProgramming list on Yahoo!: http://groups.yahoo.com/group/wmlprogramming

About the Author

Andrea Trasatti started his career as a SYSOP for the second BBS in Italy to offer internet access. As the internet grew, he integrated his experience with the development of web applications. Now he specializes in the development of multichannel applications. He is an active member of the open-source community. Some of his projects are the leading value-added services for one of the biggest mobile carriers of the world.

To Discuss this article: http://forums.phparch.com/134

Tidying up your HTML in PHP5
by John Coggeshall

Tidy is a new extension that will be available as a standard in PHP 5. It provides a wide range of functionality for manipulating HTML, XHTML, and XML documents from within PHP. This article introduces all of the primary features of this new extension, and how you will be able to make the most of it in your PHP scripts.

Although the Tidy extension itself is provided as a part of PHP 5, by default it is not enabled, as it relies on external libraries. To enable Tidy support within PHP, you must first download the libTidy library, available on the Tidy homepage at http://tidy.sourceforge.net/. Once you have downloaded the latest version of the libTidy source, you can install it on your server using the following commands:

    [user@localhost]$ tar -zxvf tidy_src.tar.gz
    [user@localhost]$ cd tidy
    [user@localhost]$ /bin/sh build/gnuauto/setup.sh
    [user@localhost]$ ./configure
    [user@localhost]$ make
    [user@localhost]$ make install

Note that, in order to fully complete the installation of the libTidy library, the make install command must be executed as superuser (i.e. root) or equivalent.

Once the libTidy library has been installed on the server, Tidy support can be enabled in PHP 5 by specifying the --with-tidy configuration option to PHP's ./configure script:

    [user@localhost]$ ./configure --with-tidy

Although the above command should work when Tidy is installed in common default locations, you can also specify the location of the libTidy library directly:

    [user@localhost]$ ./configure --with-tidy=/path/to/libTidy

To confirm Tidy support in PHP 5, check for a Tidy section in the output of the phpinfo() function or execute the CLI version of PHP using the -m parameter (to show installed modules). If everything has gone as expected, you will see tidy in the module list and a Tidy subsection in the output of the phpinfo() function.

An introduction to the Tidy API

The Tidy extension, like many of the new PHP 5 extensions, supports a dual-nature procedural/object-oriented syntax. This allows you, as a developer, to use the programming methodology you are most comfortable with when using Tidy in your PHP applications. For example, consider the following small snippet of code:

    <?php
    $tidy = tidy_parse_file("myfile.html");
    tidy_clean_repair($tidy);
    echo tidy_get_output($tidy);
    ?>

Don't be concerned that these functions have yet to be introduced; I'll discuss them later in the article. Instead, note the $tidy resource which is returned from a call to the tidy_parse_file() function. This resource represents the document being manipulated in memory, and must be passed to every function (similar to, for instance, the way the cURL library works).
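If you would rather confirm Tidy support from inside a script than from phpinfo() or the module list, a run-time check is one option. The following is only a minimal sketch: tidy_is_available() is a hypothetical helper name of my own, and it relies on nothing more than PHP's standard extension_loaded() and function_exists() functions.

```php
<?php
// Hypothetical helper (not part of the Tidy API): report whether the
// Tidy extension is loaded and its procedural entry point is present.
function tidy_is_available()
{
    return extension_loaded('tidy') && function_exists('tidy_parse_file');
}

if (tidy_is_available()) {
    echo "Tidy support is enabled\n";
} else {
    echo "Tidy support is missing; rebuild PHP with --with-tidy\n";
}
```

A check like this lets a script degrade gracefully (for example, by sending its markup unmodified) on servers where the extension was not compiled in.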
The $tidy resource, however, is much more powerful than its PHP 4 counterparts, as it can also be treated as an instance of a Tidy Document Object:

    <?php
    $tidy = tidy_parse_file("myfile.html");
    $tidy->cleanRepair();
    echo $tidy;
    ?>

This second example is identical in functionality to the first, except, of course, that it uses the object-oriented syntax. Rather than calling tidy_clean_repair(), which requires a resource to be passed, you can call the cleanRepair() method instead. The last line of the example above illustrates another interesting feature of the Tidy library: because of its dual-nature syntax, it is possible in PHP 5 to treat the $tidy resource returned from the tidy_parse_file() function as a string simply by using it in the context of a string within PHP. The contents of this string are identical to those returned from a call to tidy_get_output(), providing an incredibly useful shorthand for displaying the contents of a document after it has been manipulated by the extension.

Although it is not recommended, the procedural and object-oriented forms can also be mixed, as shown here for the sake of example:

    <?php
    $tidy = tidy_parse_file("myfile.html");
    $tidy->cleanRepair();
    echo tidy_get_output($tidy);
    ?>

In this article, I will only use the procedural syntax for Tidy to maintain consistency and avoid making things appear more complicated than they are. The only time I will use any object-oriented aspects is in the treatment of the $tidy resource as a string when appropriate in my examples. If you would prefer to use the OO syntax of Tidy, converting between one and the other is a trivial task, as procedural names map to their object-oriented counterparts as follows:

• Remove the tidy_ prefix from the procedural name
• Remove all underscores from the function name, capitalizing the first letter of every word but the first in the method name (clean_repair() becomes cleanRepair())
• When calling Tidy functions from an object-oriented context, omit the first parameter of the function call (always the $tidy resource)

REQUIREMENTS
PHP: 5
OS: Linux
Applications: LibTidy Library
Code Directory: N/A

Basic Tidy usage: Parsing documents

Tidy's primary purpose is to parse, validate, and repair markup documents in HTML, XHTML, or XML format and return the results of that process. Parsing an input document always begins this process. To parse a document stored within a file, you can use the tidy_parse_file() function:

    tidy_parse_file($filename [, $options [, $encoding [, $use_inc_path]]])

Where $filename is the path and filename of the document to parse. This can either be a file on the local file system, or a remote URL. For now, the second parameter ($options) can be ignored (I will discuss it later), while $encoding represents the character set of the input document (such as utf8). The final parameter, $use_inc_path, is a Boolean indicating whether Tidy should search for the file in the PHP include path if not found initially.

The tidy_parse_file() function loads and parses the markup document and returns a resource representing that document to your script. During the parsing process, the document may be modified from its original version to make it syntactically correct. For instance, missing end-tags are automatically added, attribute values are automatically quoted, and so on.

Documents can also be read from memory rather than a file by using the tidy_parse_string() function:

    tidy_parse_string($data [, $options [, $encoding]]);

Where $data is a string representing the document to parse and $encoding is the character set the data is stored in. As was the case with the tidy_parse_file() function, I will temporarily ignore the $options parameter and discuss it in detail later.

As I mentioned earlier, once the document has been parsed, the $tidy resource returned represents the document in memory. It can either be displayed immediately by using the tidy_get_output() function (or by treating the $tidy resource as a string), or be further manipulated as we will do shortly.

Cleaning and Repairing Documents

When a document is parsed through the Tidy extension, it is only modified as necessary to make the markup syntactically correct according to the configuration associated with it. The second phase of using Tidy, called the "clean and repair" stage, further applies configuration options to the document. This stage is performed by the tidy_clean_repair() function, which has the following syntax:

    tidy_clean_repair($tidy);

Where, as expected, $tidy is the tidy resource representing the document. Since we have not discussed configuration options in Tidy at all yet, let's introduce them now.
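Before moving on to configuration, here is a minimal sketch of the two stages just described: parsing from memory, then cleaning and repairing. The helper name repair_fragment(), the sample markup, and the 'utf8' encoding value are assumptions chosen for illustration, and the sketch returns NULL when the Tidy extension is not loaded rather than failing.

```php
<?php
// Hypothetical helper (not part of the Tidy API): parse a fragment held
// in memory, run the clean-and-repair stage, and return the markup.
function repair_fragment($html)
{
    if (!extension_loaded('tidy')) {
        return null; // Tidy support is not compiled in
    }

    // Stage 1: parse. An empty options array keeps Tidy's defaults.
    $tidy = tidy_parse_string($html, array(), 'utf8');

    // Stage 2: clean and repair according to the configuration.
    tidy_clean_repair($tidy);

    return tidy_get_output($tidy);
}

// Neither the <b> nor the <p> element is closed in this input.
$result = repair_fragment('<p>Hello <b>world');

echo ($result === null) ? "Tidy extension not available\n" : $result;
```

With the default configuration, the output should be a complete document in which the unclosed elements have been closed, although the exact markup depends on the libTidy version installed.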
Tidy Configuration Options

For any given document parsed by Tidy, there are an incredible number of options that can be set to control different aspects of how the document will ultimately be rendered. These options range from the output format (HTML, XHTML, etc.) to the way the document will look (e.g. indented tags, wrapping length), and more. In fact, the vast majority of Tidy's abilities are taken advantage of by setting different combinations of configuration options.

To modify the current configuration of a document, options can be set in a number of different ways (all of which occur prior to the parsing of the document). For now, we'll look at the run-time method of setting options by taking a second look at the tidy_parse_file() and tidy_parse_string() functions. As you may recall, when I first introduced these functions I ignored the $options parameter of each. As you might expect, this parameter controls the configuration for the document. Its value can be one of two things: either an associative array containing configuration options and their respective values, or a string containing the path and filename of a Tidy configuration file.

To begin, let's take a look at setting configuration options through the use of an array. Consider the following code:

    <?php
    /* Define the tidy configuration options.
       In this case, output the document in XHTML format and
       set the line wrap for the markup to 1 kilobyte */
    $options = array('output-xhtml' => true,
                     'wrap' => 1024);

    /* Pass the options to Tidy */
    $tidy = tidy_parse_file("http://www.phparch.com/", $options);
    tidy_clean_repair($tidy);
    echo $tidy;
    ?>

In the example above, we are modifying the values of two Tidy configuration options, output-xhtml and wrap, which instruct Tidy to generate output in XHTML format with a markup line-wrapping of 1 kilobyte per line. This configuration is then applied to the php|architect web site and an XHTML 1.0 version of the document is sent as output to the browser or console.

As an alternative to setting configuration options using an associative array, the $options parameter can also be a string representing a file on the local file system that defines the options you would like to set. Below is the content of an example Tidy configuration file, which sets a number of different options spanning the range of types:

    indent-spaces: 4
    indent: auto
    tidy-mark: no
    show-body-only: yes
    new-blocklevel-tags: mytag, anothertag

Thus, to duplicate the options defined in the example above, the following would be used for the contents of the configuration file:

    wrap: 1024
    output-xhtml: yes

Assuming that this file was saved as myconfig.tcfg in the /usr/local/etc/tidy directory, the following script could be used to duplicate our previous example:

    <?php
    /* Pass the options to Tidy */
    $tidy = tidy_parse_file("http://www.phparch.com/",
                            "/usr/local/etc/tidy/myconfig.tcfg");
    tidy_clean_repair($tidy);
    echo $tidy;
    ?>

Note 1: Because of the sheer number of Tidy configuration options available, only a brief cross-section will be discussed. For a complete reference, consult the Tidy homepage at http://tidy.sourceforge.net/

Note 2: Unlike documents that need to be parsed, Tidy configuration files must be stored in the local file system and cannot be fetched from a remote resource.

Setting a Default Configuration

The use of Tidy configuration files is a powerful feature, as it allows developers to create Tidy "profiles" that allow them to process many different types of markup in a very logical fashion. However, Tidy configuration files can also be used to change the default configuration of a document when it is parsed. To define a default configuration file, the tidy.default_config php.ini configuration directive is used. Simply set this directive to the path and filename of a Tidy configuration file and it will automatically be applied any time a new document is parsed.

Short Hand Tidying

Since parsing a document with a configuration and then calling the tidy_clean_repair() and tidy_get_output() functions can be lengthy, the Tidy extension provides a shortcut that combines these steps into a single function call (actually, two similar functions). These functions are tidy_repair_file() and tidy_repair_string(), whose syntax is as shown:

    tidy_repair_file($filename [, $options [, $encoding [, $use_inc_path]]]);
    tidy_repair_string($data [, $options [, $encoding]]);

Where $filename and $data represent the document (either as a file or in a string), $options is an associative array of options (or a Tidy configuration file), and $encoding is the character set to use when reading the input document. The final parameter of the tidy_repair_file() function, $use_inc_path, is a Boolean indicating whether Tidy should search the PHP include path for the input file if it is not initially found.

When executed, these functions will parse and clean/repair the input document using the specified configuration and return a string containing the final output:

    <?php
    $content = tidy_repair_file("http://www.phparch.com/",
                                "/usr/local/etc/tidy/myconfig.tcfg");
    echo $content;

    /* The above is identical to:

    $tidy = tidy_parse_file("http://www.phparch.com/",
                            "/usr/local/etc/tidy/myconfig.tcfg");
    tidy_clean_repair($tidy);
    $content = tidy_get_output($tidy);
    echo $content;
    */
    ?>
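As a rough sketch of the shorthand functions used with an options array instead of a configuration file, the following repairs a list whose items were never closed. The helper name tidy_fragment() and the sample markup are assumptions of mine; output-xhtml and show-body-only are the same options that appear in the configuration examples in this article.

```php
<?php
// Hypothetical helper (not part of the Tidy API): repair a fragment in
// a single call and return only what belongs inside <body>.
function tidy_fragment($html)
{
    if (!function_exists('tidy_repair_string')) {
        return null; // Tidy support is not compiled in
    }

    $options = array(
        'output-xhtml'   => true, // emit XHTML rather than HTML
        'show-body-only' => true, // skip the <html>, <head> and <body> wrappers
    );

    return tidy_repair_string($html, $options, 'utf8');
}

// The <li> elements below have no closing tags.
$out = tidy_fragment('<ul><li>one<li>two</ul>');

echo ($out === null) ? "Tidy extension not available\n" : $out;
```

Returning just the body contents is convenient when the repaired markup is going to be embedded in a larger page, which is one plausible reason to prefer an options array built at run time over a shared configuration file.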
Using the Tidy Parser Abilities

Along with all of the functionality provided by the Tidy extension to validate, manipulate, and repair markup documents, Tidy is also, of course, an excellent parser of markup documents. When Tidy parses a document, it generates a "document tree" representing its contents in a hierarchical fashion. This tree can be accessed from within PHP through a series of objects, allowing you to pull out entire blocks of HTML or other markup without the need for messy regular expressions or another extension.

To understand how to use this feature of the Tidy extension, first you must understand how Tidy represents a document. As stated, Tidy generates a document tree based on the input document, consisting of a number of parent and child nodes. When dealing with HTML or XHTML, these nodes represent tags within the document. Consider, for example, the following HTML code:

    <HTML>
      <HEAD>
        <TITLE>My document</TITLE>
      </HEAD>
      <BODY>
        <B>This is <I>An example</I> Document!</B>
      </BODY>
    </HTML>

Internally, when this document is parsed by Tidy, the structure shown in Figure 1 is generated.

[Figure 1: the document tree generated for the example document]

As you can see, every HTML tag within the document is stored as a node within the document tree. These nodes are represented in PHP by an internal class named tidy_node. The structure of this class is as follows (note, the following is pseudo-PHP for illustration only):

    <?php
    class tidy_node {

        /* The string value of this node and all of its child nodes */
        public $value;

        /* The tag name i.e 'HTML' or 'BODY' */
        public $name;

        /* A numeric value representing the node type */
        public $type;

        /* A numeric value representing type of tag (if any) */
        public $id;

        /* An associative array of tag attributes */
        public $attribute[];

        /* An indexed array of child nodes */
        public $child[];

        public function hasChildren();
        public function hasSiblings();

        public function isComment();
        public function isHtml();
        public function isText();
        public function isJste();
        public function isAsp();
        public function isPhp();
    }
    ?>

Through the properties and methods available in the tidy_node class, you are able to access all of the nodes within a document tree. To retrieve the first instance of the tidy_node class from the Tidy extension, there are four different methods you can use: root(), head(), html() and body(). Each of these methods returns an instance of the tidy_node class representing the node for the document tag with the same name (i.e. the html() method returns the node for the <HTML> tag). As this aspect of Tidy is only available using an object-oriented syntax, no procedural equivalent exists for node-retrieval functions:

    <?php
    $tidy = tidy_parse_file("http://www.phparch.com");

    /* Get the node representing the <BODY> HTML Tag */
    $body_node = $tidy->body();

    echo "The HTML Tag for this node is: {$body_node->name}";
    ?>

When executed, you can expect the output to be:

    The HTML Tag for this node is: body

Note 3: The tidy_node class is also an overloaded class, meaning that you can treat an instance of the class as a string to retrieve the $value property of the class:

    <?php
    echo $mynode->value; /* You can use this method */
    echo $mynode;        /* Or this one! */
    ?>

One of the most important features of the Tidy extension's parsing abilities is the $value attribute of each tidy_node instance. Specifically, the contents of this property will not only be the value of the current node, but all of the nodes spawned as children from it. Thus, the value of a <TABLE> node will contain the contents of the entire table, making pulling large complex sections of HTML out of documents a snap.

When parsing HTML, another incredibly useful attribute of the tidy_node class is the $id property, which represents an integer value indicating the HTML tag this node represents. These integer values correspond to a set of constants registered by the Tidy extension and provide a quick way to identify HTML tags from within a PHP script. All tag constants defined by the Tidy extension are in the format of TIDY_TAG_<TAGNAME> (where <TAGNAME> is the uppercase tag name you are interested in, such as TIDY_TAG_BODY for the <BODY> tag).

To retrieve a particular attribute of a tag within a document (for instance the HREF attribute of an anchor <A> tag), the $attribute associative array is used by accessing the key with the name of the attribute in question. To demonstrate all of this functionality, consider the dump_nodes() function below, which extracts all of the URLs from anchor (<A>) tags in the provided document:

    <?php
    function dump_nodes(tidy_node $node, &$urls = NULL) {

        $urls = (is_array($urls)) ? $urls : array();

        if(isset($node->id)) {
            /* Only anchors that actually carry an href are recorded */
            if($node->id == TIDY_TAG_A && isset($node->attribute['href'])) {
                $urls[] = $node->attribute['href'];
            }
        }

        if($node->hasChildren()) {
            foreach($node->child as $c) {
                dump_nodes($c, $urls);
            }
        }

        return $urls;
    }

    $tidy = tidy_parse_file("http://www.phparch.com/");
    tidy_clean_repair($tidy);
    $urls = dump_nodes($tidy->html());
    print_r($urls);
    ?>

Looking at this code, the dump_nodes() function accepts two parameters: the first is a node of type tidy_node, and the second is an internal-use parameter that we need during the recursion process to store the array of URLs retrieved from the document. When executed, the dump_nodes() function begins by determining whether the current node is a known HTML tag by checking for the existence of the $id property of the node. If the latter exists, the function then proceeds to check if this node is an anchor tag by comparing the value of the $id property to the TIDY_TAG_A constant. If we are indeed on an anchor tag, the function checks for and saves the value of the $attribute['href'] array key into the $urls array.

Once it has finished processing the current node, regardless of its type, the dump_nodes() function proceeds to look for child nodes and handle each of them in the same fashion recursively. Ultimately, this script will navigate the entire document tree (starting from the node provided to it initially) and return an array of URLs found within.

Summary

As you can see, the Tidy extension for PHP 5 is an incredibly useful and powerful extension which, when used properly, can make your life as a developer much easier. Furthermore, with judicious caching of the output of your web site, making documents compliant with web standards won't even introduce an additional load on your server.

For more information on the Tidy extension, visit the PHP Manual at http://www.php.net/tidy or the author's web site, http://www.coggeshall.org/.

About the Author

John Coggeshall is a PHP consultant and author who started losing sleep over PHP around five years ago. Lately you'll find him losing sleep meeting deadlines for books or online columns on a wide range of PHP topics. You can find his work online at O'Reilly Network's onlamp.com and Zend Technologies, or at his website http://www.coggeshall.org/. John has also contributed to Apress' Professional PHP4 and is currently in the process of writing the PHP Developer's Handbook published by Sams Publishing.

To Discuss this article: http://forums.phparch.com/135

Security Corner: Shared Hosting
by Chris Shiflett

Welcome to another edition of Security Corner.
This month, I have chosen a topic that is a concern for many PHP developers: shared hosting. Through my involvement with the PHPCommunity.org project, my contributions to mailing lists, and my frequent browsing of PHP blogs and news sites, I have seen this topic brought up in various incarnations. Some people are concerned about hiding their database access credentials, some are concerned about safe_mode being enabled or disabled, and others just want to know what they should be concerned about, if anything.

As a result, I have decided to address these concerns in as much detail as possible, so that you will have a better understanding and appreciation of shared hosting. After reading this article, you may decide that there is nothing for you to be concerned about, or you may be terrified. Regardless, I hope to at least provide you with clarity.

Shared Hosting

Since the advent of HTTP/1.1 and the required Host header, shared hosting has become very popular. Prior to HTTP/1.1, there was no direct way for a Web client to identify the domain from which it wanted content. The browser simply determined the IP address associated with the domain entered by the user and sent its request there. An HTTP/1.0 request looks something like the following, at a minimum:

    GET /path/to/index.php HTTP/1.0

Notice that the URL presented in the request does not include the domain name. The domain is unnecessary information under the assumption that only one domain is served by the particular Web server (and that domains have a one-to-one relationship with IP addresses). With HTTP/1.1, Host becomes a required header, so this request, at a minimum, must be expressed as follows:

    GET /path/to/index.php HTTP/1.1
    Host: www.example.org

With this format, a single Web server (with a single IP address) can serve an arbitrary number of domains, because the client must identify the domain from which it is requesting content.
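To make the role of the Host header concrete, here is a small sketch of how a server-side program might pick a document root from it. This is an illustration only, not how Apache is implemented: request_host() is a hypothetical helper, and the domain-to-directory table is invented for the example.

```php
<?php
// Hypothetical helper: return the value of the Host header from a raw
// HTTP request, or NULL when the request carries none (as in HTTP/1.0).
function request_host($rawRequest)
{
    $lines = preg_split('/\r\n|\n/', $rawRequest);
    array_shift($lines); // discard the request line (GET ... HTTP/1.x)

    foreach ($lines as $line) {
        if ($line === '') {
            break; // a blank line ends the headers
        }
        $parts = explode(':', $line, 2);
        if (count($parts) === 2 && strcasecmp(trim($parts[0]), 'Host') === 0) {
            return strtolower(trim($parts[1]));
        }
    }

    return null;
}

// Invented mapping of domains to document roots on one shared server.
$docroots = array(
    'www.example.org' => '/home/alice/public_html',
    'www.example.net' => '/home/bob/public_html',
);

$http11 = "GET /path/to/index.php HTTP/1.1\r\nHost: www.example.org\r\n\r\n";
$http10 = "GET /path/to/index.php HTTP/1.0\r\n\r\n";

$host = request_host($http11);
echo "HTTP/1.1 request is served from: {$docroots[$host]}\n";

// Without a Host header the server cannot tell the domains apart.
echo "HTTP/1.0 request names a host: " . (request_host($http10) === null ? "no" : "yes") . "\n";
```

The second request shows why one IP address per domain was needed before HTTP/1.1: nothing in the request itself distinguishes one hosted site from another.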
As a direct result, a hosting company can host many domains on a single server, and it is not necessary to have a separate public IP for each domain. This yields much more inexpensive hosting and has spurred a tremendous growth in the Web itself. Of course, this has been a driving force behind early PHP adoption as well.

The downside to shared hosting is that it incurs some security risks that do not exist in a dedicated server environment. Some of these risks are mitigated by PHP's safe_mode directive, but a solid understanding of the risks is necessary to appreciate what safe_mode does (and what it doesn't). Because of this, I will begin by introducing some of the unique risks associated with shared hosting.

Filesystem Security

A true multi-user operating system, such as Linux, is built upon a fundamentally secure approach to user permissions. When you create a file, you specify a set of permissions for that file, either explicitly or implicitly by virtue of the fact that you are creating that file within a specific context. This is achieved by assigning each file both user and group ownership as well as a set of privileges for three groups of people:

1. The user who owns the file
2. All users in the group
3. All users on the server

These categories of people are referenced as user, group, and other, respectively. The privileges that you can assign each category of user include read, write, and execute (there are some other details, but they are irrelevant to the present discussion). To illustrate this further, consider the following file listing:

    -rw-r--r--  1 chris  shiflett  4321 May 21 12:34 myfile

This file, myfile, is owned by the user chris and the group shiflett. The permissions are identified as -rw-r--r--, and this can be broken into the leading hyphen (indicating a normal file, as opposed to, say, a directory), and then three groups of permissions:

1. rw- (read, write, no execute)
2. r-- (read, no write, no execute)
3. r-- (read, no write, no execute)

These three sets of permissions correspond directly to the three groups of users: user (chris), group (shiflett), and other. Linux users are probably familiar with these permissions and how to change them with commands such as chown and chmod. For a more thorough explanation of filesystem security, see http://www.linuxsecurity.com/docs/LDP/Security-HOWTO/file-security.html.

As a user on a shared host, it is unlikely that you will have read access to many files outside of your own home directory. You certainly shouldn't be able to browse the home directory or document root of other users. However, with a simple PHP script, this can be possible.

Browsing with PHP

For this discussion, we'll assume that the Web server is Apache and that it is running as the user nobody. As a result, in order for Apache to be able to serve your Web content, that content must be readable by the user nobody. This includes images, HTML files, and PHP scripts. Thus, if someone could gain the same privileges as nobody on the server, they would at least have access to everyone's Web content, even if precautions are taken to prevent access by any other user.

Whenever Apache executes your PHP scripts, it of course does so as the user nobody. Combine this with PHP's rich set of filesystem functions (http://www.php.net/filesystem), and you should begin to realize the risk. To make the risk clearer, I have written a very simplistic filesystem browser in PHP (see Listing 1). This script outputs the current setting for the safe_mode directive (for informational purposes) and allows you to browse the local filesystem. This is an example of the type of script an attacker might write, although several enhancements would likely be added to make malicious actions more convenient.

One of the first places an attacker might want to glance is at /etc/passwd.
This is achieved by either browsing there from the root directory (where the script begins) or visiting the URL directly (by calling the script with ?file=/etc/passwd). This gives an attacker a list of users and their home directories. Another file of interest might be httpd.conf. Assuming each user's home directory has a directory called public_html for their respective document roots, an attacker can browse another user's Web content by calling the script with ?dir=/home/victim/public_html/.

A security-conscious user will most likely keep sensitive configuration files and the like somewhere outside of document root. For example, perhaps the database username and password are stored in a file called db.inc and included with code similar to the following:

    include('../inc/db.inc');

This seems wise, but unfortunately an attacker can still view this file by calling the browse.php script with ?file=/home/victim/inc/db.inc. Why does this necessarily work? For the include() call to be successful, Apache must have read access to the file. Thus, this script must also have access. In addition, because the user's login credentials are often the same as the database access credentials, this technique will likely allow an attacker to compromise any account on the server (and launch additional attacks from compromised accounts).

There is also the potential for an attacker to use this same script to gain access to anyone's session data. By just browsing the /tmp directory (?dir=/tmp/), it is possible to read any session that is stored there. With a few enhancements to the script, it could be even easier to view and/or modify session data from these files. An attacker could visit your application and then modify the associated session to grant administrator access, forge profile information, or anything of the like. And, because the attacker can browse the source to your applications, this doesn't even require guesswork.
The attacker knows exactly what session variables your applications use.

Of course, it is much safer to store session data in your own database, but we have just seen how an attacker can gain access to that as well. Luckily, safe_mode helps prevent these attacks.

The safe_mode Directive

The safe_mode directive is specifically designed to try to mitigate some of these shared hosting concerns. If you practice running the script from Listing 1 on your own server, you can experiment with enabling safe_mode and observing how much less effective the script becomes.

When safe_mode is enabled, PHP checks to see whether the owner of the script being executed matches that of the file being opened. Thus, a PHP script owned by you cannot open files that are not owned by you. Your PHP scripts are actually more restricted than you are from the shell when safe_mode is enabled, because you likely have read access to files not specifically owned by you. This strict checking can be relaxed somewhat by enabling the safe_mode_gid directive, which relaxes the checking to the group instead of the user.

Because safe_mode can cause problems for users who have a legitimate reason to access files owned by another user, there are a few other directives that allow even more flexibility. The safe_mode_include_dir directive can specify one or more directories from which users can include() files, regardless of ownership. I encourage you to read http://www.php.net/features.safe-mode for more information.

A similar PHP directive is open_basedir. This directive allows you to restrict all PHP scripts to only be able to open files within the directories specified by this directive, regardless of whether safe_mode is enabled.

Listing 1

    <?php
    echo "<pre>\n";

    if (ini_get('safe_mode'))
    {
        echo "[safe_mode enabled]\n\n";
    }
    else
    {
        echo "[safe_mode disabled]\n\n";
    }

    if (isset($_GET['dir']))
    {
        ls($_GET['dir']);
    }
    elseif (isset($_GET['file']))
    {
        cat($_GET['file']);
    }
    else
    {
        ls('/');
    }

    echo "</pre>\n";

    /* List a directory, linking each readable entry back to this script */
    function ls($dir)
    {
        $handle = dir($dir);
        /* Compare against false so an entry named "0" does not end the loop */
        while (false !== ($filename = $handle->read()))
        {
            $size = filesize("$dir$filename");

            if (is_dir("$dir$filename"))
            {
                if (is_readable("$dir$filename"))
                {
                    $line = str_pad($size, 15);
                    $line .= "<a href=\"{$_SERVER['PHP_SELF']}?dir=$dir$filename/\">$filename/</a>";
                }
                else
                {
                    $line = str_pad($size, 15);
                    $line .= "$filename/";
                }
            }
            else
            {
                if (is_readable("$dir$filename"))
                {
                    $line = str_pad($size, 15);
                    $line .= "<a href=\"{$_SERVER['PHP_SELF']}?file=$dir$filename\">$filename</a>";
                }
                else
                {
                    $line = str_pad($size, 15);
                    $line .= $filename;
                }
            }

            echo "$line\n";
        }
        $handle->close();

        return true;
    }

    /* Output the contents of a file, escaped for display */
    function cat($file)
    {
        ob_start();
        readfile($file);
        /* Fetch the buffer and close it in one step */
        $contents = ob_get_clean();
        echo htmlentities($contents);

        return true;
    }
    ?>

Bypassing safe_mode

Is there a known flaw in safe_mode that allows people to bypass it? Not to my knowledge, but keep in mind that safe_mode only protects against people using PHP to gain access to otherwise restricted data. safe_mode does nothing to protect you against someone on your shared server who writes a similar program in another language. In fact, the manual states:

"It is architecturally incorrect to try to solve this problem at the PHP level, but since the alternatives at the web server and OS levels aren't very realistic, many people, especially ISP's, use safe mode for now."

Consider the following CGI script written in Bash:

    #!/bin/bash

    echo "Content-Type: text/plain"
    echo ""
    cat /etc/passwd

This will output the contents of /etc/passwd as long as Apache can read that file. So, we're back to the same dilemma. While the attacker can't use the script in Listing 1 to browse the filesystem when safe_mode is enabled, this doesn't prevent the possibility of similar scripts written in other languages.

What Can You Do?

You probably knew that a shared host was less secure than a dedicated one long before this article. Luckily, there are some solutions to a few of the problems I have presented, but not all. There are basically two main steps that you want to take on a shared host:

1. Keep all sensitive data, such as session data, stored in the database.
2. Keep your database access credentials safe.

The question is: how do you achieve the second goal? If another user can potentially have access to any file that we make available to Apache, it seems that there is nowhere to hide the database access credentials.

My favorite solution to this problem is one that is described in the PHP Cookbook by David Sklar and Adam Trachtenberg. The approach is to use environment variables to store sensitive data (such as your database access credentials). With Apache, you can use the SetEnv directive for this:

    SetEnv DB_USER "myuser"
    SetEnv DB_PASS "mypass"

Set as many environment variables as you need using this syntax, and save this in a separate file that is not readable by the user Apache runs as (so that it cannot be read using the techniques described earlier). In httpd.conf, you can include this file as follows:

    Include "/path/to/secret-stuff"

Of course, you want to keep these include statements within each user's VirtualHost block, otherwise all users could access the same data.
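From the PHP side, the environment-variable approach might be wrapped as in the sketch below. The helper name env_or_fail() is my own invention, and putenv() is used here only so the example runs on its own; on a real host the values would come from the SetEnv lines in the included Apache file.

```php
<?php
// Hypothetical helper (not from the article): read a required setting
// from the environment and fail loudly when it is missing.
function env_or_fail($name)
{
    $value = getenv($name);
    if ($value === false || $value === '') {
        throw new RuntimeException("Missing environment variable: $name");
    }
    return $value;
}

// For this stand-alone demo only, putenv() plays the role of SetEnv.
putenv('DB_USER=myuser');
putenv('DB_PASS=mypass');

$dbUser = env_or_fail('DB_USER');
$dbPass = env_or_fail('DB_PASS');

// Report that the credentials were found without printing them.
echo "Database credentials loaded from the environment\n";
```

Failing early when a variable is absent avoids connecting with empty credentials by accident. Depending on the server API and PHP's variables_order setting, values set by the Web server may be visible through getenv() or $_SERVER rather than $_ENV, so it is worth checking which of these your host actually populates.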
Because Apache is typically started as root, it is able to include this file while it is reading its configuration. Once it is running as the user nobody, it can no longer access this file, so other users cannot access this information with clever scripts.

Once these environment variables are set, you can access them in the $_ENV array. For example:

    mysql_connect('localhost', $_ENV['DB_USER'], $_ENV['DB_PASS']);

If a value does not show up there on your setup, check $_SERVER or getenv(), since where these variables surface can depend on your PHP configuration.

Because this information is stored in $_ENV, you need to take care that this array is not output in any of your scripts. In addition, a call to phpinfo() reveals all environment variables, so you should ensure that you have no public scripts that execute this function.

Until Next Time...

Hopefully, you now understand some of the risks involved with shared hosting and can take some steps to mitigate them. While safe_mode is a nice feature, there is only so much help it can provide in this regard. It should be clear that these risks are actually independent of PHP, and this is why other steps are necessary. As always, I'd love to hear about your own solutions to these problems. Until next month, be safe.

About the Author

Chris Shiflett is a frequent contributor to the PHP community and one of the leading security experts in the field. His solutions to security problems are often used as points of reference, and these solutions are showcased in his talks at conferences such as ApacheCon and the O'Reilly Open Source Convention, his answers to questions on mailing lists such as PHP-General and NYPHP-Talk, and his articles in publications such as PHP Magazine and php|architect. Security Corner, his new monthly column for php|architect, is the industry's first and foremost PHP security column. Chris is the author of the HTTP Developer's Handbook (Sams Publishing) and is currently writing PHP Security (O'Reilly and Associates). In order to help bolster the strength of the PHP community, he is also leading an effort to create a PHP community site at PHPCommunity.org. You can contact him at [email protected] or visit his Web site at http://shiflett.org/.

Tips & Tricks
By John W. Holmes

Creating a Free MSSQL Development Environment on Windows

MSDE is Microsoft's Desktop Engine for their SQL Server. Why would you want to install such a thing? Well, assuming you are a professional developer creating applications that you intend for other people to use, it's not always a good idea to limit yourself to a single database. Designing your "killer app" to work only with MySQL will limit who can actually use your program. Believe it or not, not everyone can install MySQL just to use your program! Installing MSDE on your machine (your Windows machine, obviously) will give you a full-featured install of SQL Server for free that you can test your code with. Using your own database abstraction layer, PEAR, or ADOdb, you can test that your application actually works with different database systems, as you say it does.

First, a couple of caveats. This is just one method I've found that works. Obviously, if you have a full installation of SQL Server or can afford one, you should go that route. This is not meant for a production machine, only development. I could not get the native MSSQL PHP functions to work with MSDE, so we'll also have to resort to ODBC. While this is going to be less efficient, it's going to work and allow you to test your programs, and that's the end result we're shooting for anyhow.

The first step is to download MSDE from http://www.microsoft.com/sql/msde/howtobuy/msdeuse.asp and extract the file to your hard drive. Within the extracted directory, you'll notice a setup.exe file. There is also a ReadmeMSDE2000A.htm file that contains installation directions in addition to what I'll be outlining here.

Step two is getting to a command line and running the setup.exe program with some parameters. We're going to install a default instance of the program configured to use mixed mode authentication, meaning that it will not be tied to Windows authentication and you'll be able to use plain-text authentication with it. The additional configuration parameters you can use are explained in the HTML file. Run the following command:

    setup.exe SAPWD="password" SECURITYMODE=SQL

This sets the sa (or root) password to password and configures the mixed mode authentication. Obviously, you can (and should) pick a stronger password for your own needs, even if this is just for a development machine.

That command will trigger the setup program, which will install the necessary services for MSDE to run. MSDE will be configured to start when the OS starts by default, but you can change that from the Services menu of your Control Panel.

Listing 1

$ser="COMPUTERNAME"; //the name of the SQL Server
$db="tempdb"; //the name of the database
$user="sa"; //a valid username
$pass="password"; //a password for the username
$conn=odbc_connect("Driver={SQL Server};Server=".$ser.";Database=".$db, $user, $pass);

Listing 2

include('adodb/adodb.inc.php');
$conn = &ADONewConnection('odbc_mssql');
$ser="COCONUT"; //the name of the SQL Server
$db="tempdb"; //the name of the database
$user="sa"; //a valid username
$pass="password"; //a password for the username
$conn->Connect("Driver={SQL Server};Server=$ser;Database=$db;",$user,$pass);
If the installation did not trigger a reboot (I wouldn't worry too much about that), you may have to go in and start the service for the first time.

The next step is optional, but, if you go back and visit the MSDE website, there are a number of third party tools offered for download or purchase. Start at http://www.microsoft.com/sql/msde/partners/default.asp to see the tools. I'd recommend you download the DbaMgr SQL Tools program (DbaMgr2k) from http://www.asql.biz/DbaMgr.shtm as it will give you a free GUI for your MSDE installation. DbaMgr2k will allow you to create the necessary databases, tables, relationships, etc., for your application.

Now, as I alluded to before, it'd be great to just uncomment the php_mssql.dll line in php.ini and load the MSSQL extension, but I could not get those functions to work with the pared-down version of SQL Server that MSDE installs. In fact, the mssql_connect() function would not connect to MSDE given a wide variety of connection options (and even MSDE installation options). Thus, we'll have to resort to ODBC. If anyone has any experience or instructions to the contrary, please send them to [email protected] or post a message in the php|a forums at http://www.phparch.com/discuss .

Ensure you have the ODBC extension enabled for your installation of PHP and you will be able to connect to MSDE using the code shown in Listing 1. Substitute your actual computer name, database, login and password to get this to work. During my tests, I found that you must use the computer name, as "localhost" or "127.0.0.1" does not work (you will not be able to connect to the MSDE server). Also, ensure the odbc_connect() parameters are all on one line.

If you're using ADOdb or PEAR, then you follow their instructions as if you were connecting through ODBC to MSSQL. Example connection code for ADOdb is shown in Listing 2. The method using PEAR would be similar and is discussed in the PEAR documentation.
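As a rough sketch of how the plain ODBC route might be wrapped up, the snippet below builds the same kind of connection string used in the listings and, only when the ODBC extension is loaded, asks the server for its version. The function names build_mssql_dsn() and msde_server_version() and the SELECT @@VERSION query are illustrative assumptions, not part of the original tip.

```php
<?php
// Hypothetical helper: assembles an ODBC connection string in the
// same form as the one passed to Connect() in Listing 2.
function build_mssql_dsn($server, $database)
{
    return 'Driver={SQL Server};Server=' . $server . ';Database=' . $database . ';';
}

// Hypothetical helper: returns the server's version string,
// or false if ODBC is unavailable or the connection or query fails.
function msde_server_version($dsn, $user, $pass)
{
    if (!function_exists('odbc_connect')) {
        return false; // ODBC extension not loaded
    }

    $conn = @odbc_connect($dsn, $user, $pass);
    if ($conn === false) {
        return false; // could not reach the MSDE server
    }

    $version = false;
    $result = @odbc_exec($conn, 'SELECT @@VERSION');
    if ($result !== false && odbc_fetch_row($result)) {
        $version = odbc_result($result, 1); // first column of the first row
    }

    odbc_close($conn);

    return $version;
}
```

You might call it as msde_server_version(build_mssql_dsn('COMPUTERNAME', 'tempdb'), 'sa', 'password'), substituting your own computer name and credentials. Keeping the string building separate from the connection attempt makes it easy to check the connection string on its own.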
From this point, you'll be able to use the ODBC or abstraction layer functions to execute queries and retrieve data from the MSDE server, or whatever else your application is designed to do. You can use the DbaMgr2k program to create new users, databases, and tables, or do it from your queries. What you do from this point is up to you, but you now have a functional SQL Server test environment for free.

Detecting the Web Server

If you distribute programs that can be run under a variety of web servers and need to determine which one you're running under (or within), PHP offers a useful function named php_sapi_name(). This will actually return the type of interface between the web server and PHP. It can then be used to determine what the web server is (if it isn't obvious) and to take appropriate action. You may want to include different files based upon whether PHP is running in CGI versus SAPI mode, for example. A gracious user lists the possible return values from the function in the user-contributed notes on the manual page. The values are shown in Figure 1.

Figure 1

• aolserver
• activescript
• apache
• cgi-fcgi
• cgi
• isapi
• nsapi
• phttpd
• roxen
• java_servlet
• thttpd
• pi3web
• apache2filter
• caudium
• apache2handler
• tux
• webjames
• cli
• embed
• milter

Check Your Email Subject

If you've been a reader with us since the beginning, you may recognize the following tip. My editor said to avoid reusing tips, but I feel this one needs to be brought up again as I still find numerous live sites vulnerable to mail header injection.

Quite a few web sites have pages through which you can send an e-mail to someone. These can be used to contact the site administrator, send a message out to other users, or for many other purposes. Often, the goal of the contact form is to hide the e-mail address of the recipient. This can be for convenience (to prevent spam) or for security (protecting the identity of recipients).

If your site is just using a normal <input> text box for the user to enter the subject of the message in, you may be unwittingly allowing your visitors to inject additional headers into the e-mail message. The user can download (save-as) your form and modify the <input> area into a <textarea> element. Then, he or she can enter a "subject" like the one shown in Figure 2.

Figure 2

This is my subject
Bcc: my_email@my_domain.com
Reply-To: bad_email@evil_domain.com

When this subject is inserted into PHP's mail() function:

    mail($to,$subject,$message,$headers);

the Bcc: and Reply-To: headers are also added to the message. Thus, the malicious user has now included themselves on the contents of the message, can see who all of the other recipients were, and the end user is unaware that there is now a bad or altered Reply-To: address.

Now, this may not matter on a simple web page where people are sending you questions about your cat, because all this will do is give the user a copy of the message they just typed. But imagine a web page that allows Alcoholics Anonymous users to contact and e-mail each other anonymously. Now malicious users can Bcc: themselves on messages and see who all of the recipients are, and possibly even intercept replies to the message if the altered Reply-To: address is not noticed. Or, how about the many forum scripts that let you e-mail another user, but do not show you their e-mail address? Now you can Bcc: yourself on the message you send them and find out their actual e-mail address. The recipient of the message will not know, unless they examine all of the e-mail headers closely.

The bottom line is that you are allowing the user to put unchecked headers into your mail messages. The solution to this is to filter the subject you receive from your text box for new line ("\n") and carriage return ("\r") characters. You can remove everything from the first line break onward with:

    $subject = preg_replace('/[\r\n].*/s', '', $subject);

Another option is to just tell the user that there has been an error if a new line is detected and attempt to save as much information about the user as you can, for future reference (as long as that is in compliance with your privacy policy).

You should also be aware that this isn't a vulnerability that's limited to PHP scripts. Any scripting language could be vulnerable if it takes user input and places it directly into mail headers. This is something to keep in mind if you also develop in languages other than PHP.

Send in Your Tips … Help the Community

If you have any tips that would help out your peers, please send them to [email protected] to be published. Anyone contributing a tip that gets published will get a free issue (added on to your subscription if you already have one). Also, if you haven't noticed already, there is a special Tips 'n Tricks forum in the phparch.com forums for discussing what's in this or any column of Tips 'n Tricks. If you have any comments about what's been written, be sure to post them there!

About the Author

John Holmes is a Captain in the U.S. Army and a freelance PHP and MySQL programmer. He has been programming in PHP for over 4 years and loves every minute of it. He currently serves as a Company Commander at Ft. Gordon, Georgia, where he lives with his wife and two sons.

exit(0);

I Am Jack's Total Lack of Linux Support

By Marco Tabini

Maybe it doesn't sound like it, but five days can be a really long time. When I left for php|cruise at the beginning of the month (technically, at the end of February), I did so without any means of accessing the Internet.
My old and faithful Pentium II Acer laptop having recently left me for a better place (in one of the local city dumpsters), I figured that a week without having to worry about e-mail and the like would be a fun and relaxing experience. Of course, I fully expected to have to deal with a whole lot of e-mail in my inbox once I got back, but, surely, that would be no big deal.

Boy, was I wrong. I had a miserable week and ended up bumming laptop time off of all the other attendees. While everybody was out having fun in Nassau, I was walking around its busy streets looking for a computer store (and found none, thankfully, given what happened once I finally got around to buying a new laptop). I'm sure that the tales of the overweight spirit that haunts the decks of the Sovereign of the Seas asking people "can I borrow your laptop for five minutes" will live on for years to come.

On top of everything, when I finally did manage to get home my mailbox contained just short of two thousand messages, all "good" ones without spam or viruses. As a result, I'm still sorting through my inbox, a full week behind in my answers. Bummer!

The effects of the one-week withdrawal and mailbox-shock still lingering, I resolved to take a trip to my local Best Buy superstore and purchase a laptop. Since I don't like to sit on a decision for too long, a couple of hours later a brand new, top-of-the-line Pentium 4-based Hewlett-Packard laptop sat on my desk, ready to be used. Given that I don't much care for Windows as a desktop operating system (and, let's be clear, this is only my personal preference, rather than a pseudo-objective comment on the operating system itself), I immediately started the installation process for the latest version of Gentoo.

Now, before I go on I must point out that I have never, ever, had any problem making any sort of hardware work flawlessly under Linux. I have always been able to find drivers compatible with whatever I'd throw in my box, be it a sound card, disk drive controller or network adapter. Therefore, I had no qualms about driving to the store and picking out my new computer the way normal (read: Windows) users do: by choosing the one I liked best.

Three days later, I was still trying to make the basic elements of my laptop work. I'm not talking about anything fancy here, like superfast 3D acceleration or some esoteric power-saving mode. I was actually having trouble getting Linux to recognize my ATI IDE chipset (honestly, why is ATI making IDE chipsets anyway? Can't they just stick to graphics cards?), or my audio card. Sure, the computer would run, but with no IDE chipset support, hard disk access managed to slow the whole system down to speed levels I had not seen since the days of the original IBM 8088-based motherboard (for those who never had the pleasure of working with that particular monster, the boot-time POST check, that set of tests that today you see flying by as they verify that your RAM works, used to take something like five or six minutes).

Why Can't We All Just Get Along?

All this from a company that has been professing its support for Linux for a few years now. Not only do they not release any drivers for their laptop hardware, but they actually go the distance and use embedded chipsets for which no drivers are available, even in the open source community, because their manufacturers refuse to make the necessary information available in the public domain.

Undoubtedly, developing drivers for more than one operating system is more expensive, but the process of writing device drivers for Linux is well documented and understood, and I honestly find it hard to believe that writing a single driver with bindings for the two operating systems would be so prohibitively expensive as to not justify the additional number of computers sold to those users who want to use one rather than the other. After all, making a case from a business perspective for only supporting Windows has to be more and more difficult as time goes by, since Linux is quite popular in the desktop arena as well.

Well, I guess H-P has been too busy acquiring Compaq (a move that I still don't understand) to pay attention to recent market trends. Laptop back in the box, I made my way back to the store for an exchange. Thankfully, my wife is a neatness freak who keeps everything, down to the last piece of paper, so putting everything back in its original package, as dictated by Best Buy's 14-day exchange/refund policy for computers, was easy enough, or so I thought.

Once at the store, I was informed that they could not accept the laptop in the condition I brought it back in, or, in other words, with Linux installed on it (it originally came with Windows XP, which, once in my hands, lasted approximately the time required to reboot from the Gentoo CD). In order for them to do me the "honour" of honouring their return policy, I would have to either drive all the way back home and reinstall Windows or pay a $60 reinstallation fee.

I suppose this makes perfect sense. After all, they may actually want to be able to resell the laptop once I bring it back (hence the request for all the original packaging material and manuals), and most people who buy from a general electronics store probably do not have the skill required to install an operating system from scratch. However, two things upset me to no end.

First, their policy said nothing about the original operating system. It didn't even refer to the product having to be in its original condition, only that all the accessories and packaging had to be returned, which is exactly what I did. Second, and most important, while the store clerk at the returns desk was making sure that I didn't discreetly return a couple of bricks instead of a $2,000 computer, I actually went and bought another laptop, much more expensive than the first.

So here I was, being hassled about an unwritten policy and having to haggle over a $60 charge that was not advertised anywhere while I was ready to spend several hundred dollars more on another computer.

Luckily, human beings turned out to be smarter than the policies they are supposed to follow, and I walked out of the store with another computer, this time a Toshiba Satellite M30, on which I am currently writing this column.

I have no complaints about the new Toshiba: everything "just works", down to the last detail. Of course, there is no "official" support for Linux, but the laptop is built using parts for which all sorts of drivers exist, so that running Linux on it is not a problem.

Of course, one could say that the difference in price between the two computers justifies the fact that the cheaper one won't work under Linux. However, the price difference can easily be attributed to a larger hard drive, a newer processor (the Toshiba runs on Centrino technology) and better battery time, not to mention the fact that, in the past, I've been able to run Linux on $300 computers without any hardware problem.

So, score one for Toshiba. Linux is not "one of the other operating systems" any longer. It has survived and it is thriving, and can no longer be ignored by a computer manufacturer who wants to stay in the market.

php|a