Lecture 2 - UCLA Statistics
Transcription
Lecture 2: The pipeline Last time • We presented some (admittedly rough) context for the material to be covered in this course: From a comparison of Stat 202a in 2004 and 2010 we saw somewhat incredible shifts in the mechanisms by which data are distributed • We also considered the term “data science” and compared current attempts at a definition with older (some really old) approaches and the classic cycle of exploratory data analysis • We ended by examining some early work on bringing data analysis to high school students, curricular approaches that should have something to say about how we train statisticians Today • We will look at the Unix operating system, the philosophy underlying its design and some basic tools • Given the abstract nature of our first lecture, we will emphasize some hands-on tasks -- We will use as our case study the last week of traffic across the department’s website www.stat.ucla.edu • You will have a chance to kick the tires on these tools and address some simple website usage statistics The shell: Why we teach it • Many of you know the computer as a Desktop and a handful of applications (a browser, Word, Excel, maybe R); to introduce the shell means having a discussion about the structure of the computer, about operating systems, about filesystems, about history • The shell offers programmatic access to a computer’s constituent parts, allows students to “do” data analysis on directories (folders), on processes, on their network • Given that there are so many flavors of shell, it is also the first time we can talk about choosing tools (that the choice is yours to make!), about evaluating which shell is best for you, and about the shifting terrain of software development The shell: Why we teach it • As a practical matter, shell tools are an indispensable part of my own practice, for data “cleaning,” for preliminary data processing, for exploratory analysis -- By design, they are simple, lightweight and easily combined to perform much 
more complex computations • The shell also becomes important as you start to make use of shared course resources (data, software, hardware) -- This year, most of our work will be done on computers outside the Statistics Department, and the shell will be your window onto these other systems • On the computers in front of you, you can access the shell through a Terminal Window (Applications -> Utilities -> Terminal) -- For those of you with a Windows machine, you will have to download an emulator like Cygwin... Operating systems • Most devices that contain a computer of some kind will have an OS; they tend to emerge when the appliance will have to deal with new applications, complex user input and possibly changing requirements of its function • Your TiVo, smartphone and even your automobile all have operating systems Operating systems • An operating system is a piece of software (code) that organizes and controls hardware and other software so your computer behaves in a flexible but predictable way • For home computers, Windows, MacOS and Linux are among the most commonly used operating systems Your computer • The Central Processing Unit (CPU) or microprocessor is a complete computational engine, capable of carrying out a number of basic commands (basic arithmetic, store and retrieve information from its memory, and so on) • The CPU itself has the capacity to store some amount of information; when it needs more space, it moves data to another kind of chip known as Random Access Memory (RAM) -- “random access” as opposed to, say, sequential access to memory locations • Your computer also has one or more storage devices that can be used to organize and store data; hard disks or drives store data magnetically, while solid state drives again use semiconductor technology Your computer • Your operating system, then, manages all of your computer’s resources, providing layers of abstraction for both you, the user, as well as developers who are writing new programs for your 
computer • With the emergence of so-called cloud computing, we imagine a variety of computing resources “out there” on the web on which we can execute our computations; in this model, computations are performed elsewhere, and your own computer might function more as a “browser” receiving results -- Google’s Chrome operating system is “minimalist” in this sense http://www.youtube.com/watch?v=0QRO3gKj3qw But let’s step back a bit... • The computers you are sitting at are running the Mac OS which is built on a Unix platform -- let’s spend some time talking about Unix (we will have more to say about three fundamental abstractions for computer components -- the memory, the interpreter and the communication link -- next time) • We’ll start with a little history that might help explain why these tools look the way they do, their underlying design philosophy and how they can help you in the early (or late, even) stages of an analysis Some history • In 1964, Bell Labs partnered with MIT and GE to create Multics (for Multiplexed Information and Computing Service) -- here is the vision they had for computing “Such systems must run continuously and reliably 7 days a week, 24 hours a day in a way similar to telephone or power systems, and must be capable of meeting wide service demands: from multiple man-machine interaction to the sequential processing of absentee-user jobs; from the use of the system with dedicated languages and subsystems to the programming of the system itself” Some history • Bell Labs pulled out of the Multics project in 1969; a group of researchers at Bell Labs started work on Unics (Uniplexed information and computing system) because initially it could support only one user; as the system matured, it was renamed Unix, which isn’t an acronym for anything • Ritchie simply says that Unix is a “somewhat treacherous pun on Multics” From “The Art of Unix 
Programming” by Raymond The Unix filesystem • In Multics, we find the first notion of a hierarchical file system -- software for organizing and storing computer files and the data they contain -- in Unix, files are arranged in a tree structure that allows separate users to have control of their own areas • In general, a file system provides a level of abstraction for the user, keeping track of how data are stored on an associated physical storage device (say, a hard drive or a CD) • Unix began (more or less) as a file system and then an interactive shell emerged to let you examine its contents and perform basic operations The Unix kernel and its shell • The Unix kernel is the part of the operating system that provides other programs access to the system’s resources (the computer’s CPU or central processing unit, its memory and various I/O or input/output devices) • The Unix shell is a command-line interface to the kernel -- keep in mind that Unix was designed by computer scientists for computer scientists and the interface is not optimized for novices Unix shells • The Unix shell is a type of program called an interpreter; in this case, think of it as a text-based interface to the kernel • The shell operates in a simple loop: It accepts a command, interprets it, executes the command and waits for another -- The shell displays a prompt to tell you that it is ready to accept a command Shells in general • The term “shell” is often applied to any program that you type commands into (we’ll see similar kinds of interfaces when we talk about Python, R, and even sqlite3 and mongo) -- the idea is that the shell is the outermost interface to the inner workings of the system it surrounds • It is sometimes useful to take a broader view of the shell concept, for example, considering a browser (like Firefox) a shell for a system that renders web pages -- in such cases, we might usefully distinguish between text-based shells (command line interfaces) and graphics-based shells 
(graphical user interfaces) • Now, open up the terminal window on your Mac and let’s explore a little... * A boring slide, but full of potential! Unix shells • The shell is itself a program that the Unix operating system runs for you (a program is referred to as a process when it is running) • The kernel manages many processes at once, many of which are the result of user commands (others provide services that keep the computer running) • Some commands are built into the shell, others have been added by users -- Either way, the shell waits until the command is executed The result of typing in the command top: a printout of all the processes running on your computer (the columns give the name of the command, its process ID, how hard the computer is thinking about it and how much memory is being used)
% hostinfo # neolab.stat.ucla.edu
Mach kernel version:
 Darwin Kernel Version 9.8.0: Wed Jul 15 16:55:01 PDT 2009; root:xnu-1228.15.4~1/RELEASE_I386
Kernel configured for up to 8 processors.
8 processors are physically available.
8 processors are logically available.
Processor type: i486 (Intel 80486)
Processors active: 0 1 2 3 4 5 6 7
Primary memory available: 16.00 gigabytes
Default processor set: 159 tasks, 364 threads, 8 processors
Load average: 10.14, Mach factor: 1.43
% hostinfo # my laptop
Mach kernel version:
 Darwin Kernel Version 10.4.0: Fri Apr 23 18:28:53 PDT 2010; root:xnu-1504.7.4~1/RELEASE_I386
Kernel configured for up to 2 processors.
2 processors are physically available.
2 processors are logically available.
Processor type: i486 (Intel 80486)
Processors active: 0 1
Primary memory available: 4.00 gigabytes
Default processor set: 110 tasks, 442 threads, 2 processors
Load average: 0.25, Mach factor: 1.74
Running top on neolab.stat.ucla.edu -- How many processes are running? How much RAM is available? Running top on taia.stat.ucla.edu (last year) -- How many processes are running? How much RAM is available? What do you reckon this computer does? 
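The questions on the slide above (how many processes? what is the load?) can also be answered without the interactive top display; here is a small sketch using the standard ps and uptime commands (the exact columns ps prints vary a little between Linux and Mac OS X):

```shell
# How many processes are running right now?  "ps aux" prints one line
# per process plus a header line, so subtract one from the count.
ps aux | wc -l

# The load averages that hostinfo and top report are also printed by:
uptime
```

Notice that ps is itself just another process; its own line shows up in the count.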
Operating systems • Processor management -- Schedules jobs (formally referred to as processes) to be executed by the computer • Memory and storage management -- Allocates the space required for each running process in main memory (RAM) or in some other temporary location if space is tight, and supervises the storage of data onto disk Operating systems • Device management -- A program called a driver translates data (files from the filesystem) into signals that devices like printers can understand; an operating system manages the communication between devices and the CPU, for example • Application interface -- An API (application programming interface) lets programmers use functions of the computer and the operating system without having to know how something is done • User interface -- Finally, the operating system turns and looks at you; the UI is a program that defines how users interact with the computer -- some are graphical (Windows is a GUI) and some are text-based (your Unix shell) Unix shell(s) • There are, in fact, many different kinds of Unix shells -- The table on the right lists a few of the most popular • Why so many? Bourne shell /bin/sh The oldest and most standardized shell. Widely used for system startup files (scripts run during system startup). Installed in Mac OS X. Bash (Bourne Again SHell) /bin/bash Bash is an improved version of sh. Combines features from csh, sh, and ksh. Very widely used, especially on Linux systems. Installed in Mac OS X. http://www.gnu.org/manual/bash/ C shell /bin/csh Provides scripting features that have a syntax similar to that of the C programming language (originally written by Bill Joy). Installed in Mac OS X. Korn shell /bin/ksh Developed at AT&T by David Korn in the early 1980s. Ksh is widely used for programming. It is now open-source software, although you must agree to AT&T's license to install it. http://www.kornshell.com TC Shell /bin/tcsh An improved version of csh. 
The t in tcsh comes from the TENEX and TOPS-20 operating systems, which provided a command-completion feature that the creator (Ken Greer) of tcsh included in his new shell. Wilfredo Sanchez, formerly lead engineer on Mac OS X for Apple, worked on tcsh in the early 1990s at the Massachusetts Institute of Technology. Z shell /bin/zsh Created in 1990, zsh combines features from tcsh, bash, and ksh, and adds many of its own. Installed in Mac OS X. http://zsh.sourceforge.net Why the choices? • A shell program was originally meant to take commands, interpret them and then execute some operation • Inevitably, one wants to collect a number of these operations into programs that execute compound tasks; at the same time you want to make interaction on the command line as easy as possible (a history mechanism, editing capabilities and so on) • The original Bourne shell is ideal for programming; the C-shell and its variants are good for interactive use; the Korn shell is a combination of both Steve Bourne, creator of sh, in 2005 From "ksh - An Extensible High Level Language" by David G. Korn At the time of these so-called ``shell wars'', I worked on a project at Bell Laboratories that needed a form entry system. We decided to build a form interpreter, rather than writing a separate program for each form. Instead of inventing a new script language, we built a form entry system by modifying the Bourne shell, adding built-in commands as necessary. The application was coded as shell scripts. We added a builtin to read form template description files and create shell variables, and a built-in to output shell variables through a form mask. We also added a built-in named let to do arithmetic using a small subset of the C language expression syntax. An array facility was added to handle columns of data on the screen. Shell functions were added to make it easier to write modular code, since our shell scripts tended to be larger than most shell scripts at that time. 
Since the Bourne shell was written in an Algol-like variant of C, we converted our version of it to a more standard K&R version of C. We removed the restriction that disallowed I/O redirection of built-in commands, and added echo, pwd, and test built-in commands for improved performance. Finally, we added a capability to run a command as a coprocess so that the command that processed the user-entered data and accessed the database could be written as a separate process. By the late 1970s, each of these shells had sizable followings within Bell Laboratories. The two shells were not compatible, leading to a division as to which should become the standard shell. Steve Bourne and John Mashey argued their respective cases at three successive UNIX user group meetings. Between meetings, each enhanced their shell to have the functionality available in the other. A committee was set up to choose a standard shell. It chose the Bourne shell as the standard. At the same time, Steve Bourne at Bell Laboratories wrote a version of the shell which included programming language techniques. A rich set of structured flow control primitives was part of the language; the shell processed commands by building a parse tree and then evaluating the tree. Because of the rich flow control primitives, there was no need for a goto command. Bourne introduced the "here-document" whereby the contents of a file are inserted directly into the script. One of the often overlooked contributions of the Bourne shell is that it helped to eliminate the distinction between programs and shell scripts. Earlier versions of the shell read input from standard input, making it impossible to use shell scripts as part of a pipeline. Unlike most earlier systems, the Thompson shell command language was a user-level program that did not have any special privileges. This meant that new shells could be created by any user, which led to a succession of improved shells. 
In the mid-1970s, John Mashey at Bell Laboratories extended the Thompson shell by adding commands so that it could be used as a primitive programming language. He made commands such as if and goto built-ins for improved performance, and also added shell variables. The original UNIX system shell was a simple program written by Ken Thompson at Bell Laboratories, as the interface to the new UNIX operating system. It allowed the user to invoke single commands, or to connect commands together by having the output of one command pass through a special file called a pipe and become input for the next command. The Thompson shell was designed as a command interpreter, not a programming language. While one could put a sequence of commands in a file and run them, i.e., create a shell script, there was no support for traditional language facilities such as flow control, variables, and functions. When the need for some flow control surfaced, the commands /bin/if and /bin/goto were created as separate commands. The /bin/if command evaluated its first argument and, if true, executed the remainder of the line. The /bin/goto command read the script from its standard input, looked for the given label, and set the seek position at that location. When the shell returned from invoking /bin/goto, it read the next line from standard input from the location set by /bin/goto. The awk command was developed in the late 1970s by Al Aho, Brian Kernighan, and Peter Weinberger of Bell Laboratories as a report generation language. A second-generation awk developed in the early 1980s was a more general-purpose scripting language, but lacked some shell features. It became very common to combine the shell and awk to write script applications. For many applications, this had the disadvantage of being slow because of the time consumed in each invocation of awk. 
The perl language, developed by Larry Wall in the mid-1980s, is an attempt to combine the capabilities of the shell and awk into a single language. Because perl is freely available and performs better than combined shell and awk, perl has a large user community, primarily at universities. In 1992, the IEEE POSIX 1003.2 and ISO/IEC 9945-2 shell and utilities standards were ratified. These standards describe a shell language that was based on the UNIX System V shell and the 1988 version of ksh. The 1993 version of ksh is a version of ksh which is a superset of the POSIX and ISO/IEC shell standards. With few exceptions, it is backward compatible with the 1988 version of ksh. In spite of its wide availability, ksh source is not in the public domain. This has led to the creation of bash, the ``Bourne again shell'', by the Free Software Foundation; and pdksh, a public domain version of ksh. Unfortunately, neither is compatible with ksh. As use of ksh grew, the need for more functionality became apparent. Like the original shell, ksh was first used primarily for setting up processes and handling I/O redirection. Newer uses required more string handling capabilities to reduce the number of process creations. The 1988 version of ksh, the one most widely distributed at the time this is written, extended the pattern matching capability of ksh to be comparable to that of the regular expression matching found in sed and grep. The popular inline editing features (vi and emacs mode) of ksh were created by software developers at Bell Laboratories; the vi line editing mode by Pat Sullivan, and the emacs line editing mode by Mike Veach. Each had independently modified the Bourne shell to add these features, and both were in organizations that wanted to use ksh only if ksh had their respective inline editor. Originally the idea of adding command line editing to ksh was rejected in the hope that line editing would move into the terminal driver. 
However, when it became clear that this was not likely to happen soon, both line editing modes were integrated into ksh and made optional so that they could be disabled on systems that provided editing as part of the terminal interface. In 1982, the UNIX System V shell was converted to K&R C, echo and pwd were made built-in commands, and the ability to define and use shell functions was added. Unfortunately, the System V syntax for function definitions was different from that of ksh. In order to maintain compatibility with the System V shell and preserve backward compatibility, I modified ksh to accept either syntax. I created the first version of ksh soon after I moved to a research position at Bell Laboratories. Starting with the form scripting language, I removed some of the form-specific code, and added useful features from the C shell such as history, aliases, and job control. At the same time, at the University of California at Berkeley, Bill Joy put together a new shell called the C shell. Like the Mashey shell, it was implemented as a command interpreter, not a programming language. While the C shell contained flow control constructs, shell variables, and an arithmetic facility, its primary contribution was a better command interface. It introduced the idea of a history list and an editing facility, so that users didn't have to retype commands that they had entered incorrectly. And while we are at it... • Unix itself comes in different flavors; the 1980s saw an incredible proliferation of Unix versions, somewhere around 100 (System V, AIX, Berkeley BSD, SunOS, Linux, ...) 
• Vendors provided (diverging) versions of Unix, optimized for their own computer architectures and supporting different features • Despite the diversity, it was still easier to “port” applications between versions of Unix than it was between different proprietary OSes • In the 90s, some consolidation took place; today Linux dominates the low-end market, while Solaris, AIX and HP-UX are leaders in the mid-to-high end A few common commands • First, commands to explore your file system; walk through directories and list files • pwd, ls, cd • mkdir, rmdir • cp, mv, rm (These commands seem somewhat terse -- Why not “remove” or “copy”? -- and some have speculated that it was because of the effort required to enter data on the early ASR-33 teletype keyboards) Another view of the filesystem; here, your Mac will display directories as folders and you navigate by clicking rather than typing commands An example
[obstacle 12 lectures] ls -al
total 8
drwxr-xr-x    8 cocteau  staff   272 Sep 30 08:11 .
drwxr-xr-x   70 cocteau  staff  2380 Sep 30 08:08 ..
drwxr-xr-x    4 cocteau  staff   136 Sep 30 08:05 lecture1
drwxr-xr-x@ 282 cocteau  staff  9588 Sep 30 08:13 lecture1.key
-rw-r--r--    1 cocteau  staff   139 Sep 21 15:34 lecture1.txt
drwxr-xr-x   11 cocteau  staff   374 Sep 30 10:24 lecture2
drwxr-xr-x@  80 cocteau  staff  2720 Sep 30 10:22 lecture2.key
Reading across the columns: the type of file (- normal file, d directory, s socket file, l link file) together with the permissions (what you and others can do to the file); the number of 512-byte blocks used by the files in the directory; your user name; the group you belong to that owns the file; the file’s size in bytes; when the file was last modified; and the file’s name -- . is shorthand for your present working directory (where you’re at) and .. for the directory one level above http://www.thegeekstuff.com/2009/07/linux-ls-command-examples/ Options • Notice that Unix commands can also be modified by adding one or more options; in the case of ls, we added the option -l for the “long” form of the output and “-a” for all directories that begin with a “.” -- Another useful option is “-h” for humanly readable output • The command man will provide you with help on Unix commands; you simply supply the name of the command you are interested in as an operand -- try typing “man ls” Permission bits • Unix can support many users (remember, that was part of its initial design goals) on a single system and each user can belong to one or more groups • Every file in a Unix filesystem is owned by some user and one of that user’s groups; each file also has a set of permissions specifying which users can • r: read • w: write (modify) or • x: execute • the file; these are specified with three “bits” and we need three sets of bits to define what the user can do, what their group (that owns the file) can do and what others can do Permission bits
[obstacle 12 lectures] ls -al
total 8
drwxr-xr-x    8 cocteau  staff   272 Sep 30 08:11 .
drwxr-xr-x   70 cocteau  staff  2380 Sep 30 08:08 ..
drwxr-xr-x    4 cocteau  staff   136 Sep 30 08:05 lecture1
drwxr-xr-x@ 282 cocteau  staff  9588 Sep 30 08:13 lecture1.key
-rw-r--r--    1 cocteau  staff   139 Sep 21 15:34 lecture1.txt
drwxr-xr-x   11 cocteau  staff   374 Sep 30 10:24 lecture2
drwxr-xr-x@  80 cocteau  staff  2720 Sep 30 10:22 lecture2.key
In the first column, the leading character is the type of file; the next nine characters are the permission bits: what you can do to the file (u), what the owning group can do (g) and what others can do (o) In general... • The command chmod changes the permissions on a file; here are some examples • % chmod g+x lecture1.txt • % chmod ug-x lecture1.txt • % chmod a+w lecture1.txt • % chmod go-w lecture1.txt • You can also use binary to express the permissions; so if we think of bits ordered as rwx, then • r-x: 1 × 2² + 0 × 2¹ + 1 × 2⁰ = 5 • rwx: 1 × 2² + 1 × 2¹ + 1 × 2⁰ = 7 • r--: 1 × 2² + 0 × 2¹ + 0 × 2⁰ = 4 and we can specify permissions with these values • % chmod 754 dates.sh Kinds of files • What you’ll notice right away is that there are different types of files having different permissions -- we will talk about different kinds of data storage and representation next time • Unix filesystem convention places (shared, commonly used) executable files in places like /usr/bin or /usr/local/bin (navigate to one of these directories and have a look!) • Different files are opened by different kinds of programs; in OSX, there is a beautiful command called open that decides which program to use Kinds of files • Filenames which contain special characters like * and ~ are candidates for filename substitution • ~ refers to your home directory and * is a wildcard for any number of characters (it’s referred to as globbing -- you can try the command glob to see what happens, something we’ll do shortly) • Other special characters like {, [ and ? 
can also be expanded, but we’ll get to them when we learn a bit more about regular expressions Unix shells • There are many flavors of Unix Shells; that is, there are many kinds of programs that operate as shells: sh, csh, tcsh, bash, ksh • They are all programs and can be found on the file system -- try % which sh An example: HTTP access logs • www.stat.ucla.edu • The department runs an Apache Web server running on taia.stat.ucla.edu • Each request, each click by a user out on the Internet browsing our site, is logged -- there are standards for these files, but in general, they can be a bit hairy to “parse” Log files • A Unix system is constantly maintaining logs about its performance; various applications log their activities as well -- some are visible by everyone, some by users with “root” access • Scraping through logs for anomalies, for system profiling, for system management, is a fairly common exercise; the logs we’ll look at have taken on extra importance given all the transactions that take place across the web Into the cloud... • This morning, I spun up an EC2 instance from Amazon Web Services -- EC2 stands for Elastic Compute Cloud -- assigned it to a domain (stat202a.org) and copied our relatively small data set over to the new machine • The machine is running Ubuntu and you’ll have access to all the Unix tools we’ve been discussing so far (and more!) From machine to machine • In courses you’ve had previously, you probably ran R or some other program on the computer on your desk -- The Statistics Department has a number of computers available to you that are more powerful (memory, speed) than those on your desk and with EC2, we can spin up any number of machines to do our bidding • You can navigate between machines by invoking the command ssh (secure shell) and move files around using scp (secure copy, example shortly)... 
% ssh [email protected] Moving data • You can also move data to your own machine (assuming it’s not too big -- at least one of the data sets we’ll consider this term will be too big to budge!) • If the data, for example, are stored on homework.stat202a.org, you can bring them to your desktop with the command • % scp [email protected]:/data/weblogs/access_log.1267056000 . The secure copy command; friends with ssh Copy it to a file of the same name on your local machine in your current working directory Copy the file access_log.1267056000 in the directory /data/weblogs from the computer homework.stat202a.org HTTP access logs • A bit of digging... Commands • pwd, ls, cd • more/less, tail, wc • cut, sort, uniq % cd /data/weblogs % ls % head access_log.1267056000 seminars.stat.ucla.edu 128.97.86.247 - - [24/Feb/2010:23:00:31 -0800] "GET /upcoming_through/2010-03-02.xml HTTP/1.0" 200 3426 "-" "-" ces.stat.ucla.edu 128.97.55.242 - - [24/Feb/2010:23:02:12 -0800] "GET /rss/ HTTP/1.1" 404 1515 "-" "Apple-PubSub/65.11" seminars.stat.ucla.edu 128.97.86.247 - - [24/Feb/2010:23:03:48 -0800] "GET /upcoming_through/2010-03-02.xml HTTP/1.0" 200 3426 "-" "-" seminars.stat.ucla.edu 128.97.86.247 - - [24/Feb/2010:23:03:53 -0800] "GET /upcoming_through/2010-03-02.xml HTTP/1.0" 200 3426 "-" "-" seminars.stat.ucla.edu 164.67.132.221 - - [24/Feb/2010:23:04:37 -0800] "GET /2007-04-10/3:00pm/1260-franz HTTP/1.0" 200 3459 "-" "UCLA%20Google% csrcb.stat.ucla.edu 164.67.132.221 - - [24/Feb/2010:23:04:38 -0800] "GET /robots.txt HTTP/1.0" 200 99 "-" "UCLA%20Google%20Search%20Appliance%20 csrcb.stat.ucla.edu 164.67.132.221 - - [24/Feb/2010:23:04:38 -0800] "GET /projects/ HTTP/1.0" 200 1179 "-" "UCLA%20Google%20Search%20Appliance%2 seminars.stat.ucla.edu 164.67.132.221 - - [24/Feb/2010:23:04:39 -0800] "GET /archive?page=24 HTTP/1.0" 200 4329 "-" "UCLA%20Google%20Search%20Ap seminars.stat.ucla.edu 164.67.132.221 - - [24/Feb/2010:23:04:39 -0800] "GET /2007-02-23/3:00pm/5200-boelter HTTP/1.0" 200 
3560 "-" "UCLA%20Googl seminars.stat.ucla.edu 164.67.132.221 - - [24/Feb/2010:23:04:39 -0800] "GET /1999-03-01/3:00pm/6627-ms HTTP/1.0" 200 3005 "-" "UCLA%20Google%20S % wc access_log.1267056000 1215118 24435535 306343201 access_log.1267056000 % tail access_log.1267056000 forums.stat.ucla.edu 164.67.132.220 - - [03/Mar/2010:23:59:54 -0800] "GET /feed.php?16,102,type=rss HTTP/1.0" 200 1570 "-" "UCLA%20Google%20Sear cran.stat.ucla.edu 164.67.132.219 - - [03/Mar/2010:23:59:55 -0800] "GET /robots.txt HTTP/1.0" 200 98261 "-" "UCLA%20Google%20Search%20Appliance% cran.stat.ucla.edu 164.67.132.219 - - [03/Mar/2010:23:59:55 -0800] "GET /web/checks/check_results_zic.html HTTP/1.0" 200 4965 "-" "UCLA%20Google www.socr.ucla.edu 207.46.199.191 - - [03/Mar/2010:23:59:56 -0800] "GET /robots.txt HTTP/1.1" 200 58 "-" "msnbot/2.0b (+http://search.msn.com/msn www.stat.ucla.edu 67.195.110.174 - - [03/Mar/2010:23:59:57 -0800] "GET /~jkokkin/gray/pb_lrn_bound_1_sharp_2_dc_2_1_nm/253055_counts.mat HTTP/1. directory.stat.ucla.edu 164.67.132.220 - - [03/Mar/2010:23:59:57 -0800] "GET /info.php?directory=443 HTTP/1.0" 200 2905 "-" "UCLA%20Google%20Sea cran.stat.ucla.edu 164.67.132.219 - - [03/Mar/2010:23:59:58 -0800] "GET /robots.txt HTTP/1.0" 200 98261 "-" "UCLA%20Google%20Search%20Appliance% cran.stat.ucla.edu 164.67.132.219 - - [03/Mar/2010:23:59:58 -0800] "GET /web/checks/check_results_xlsReadWrite.html HTTP/1.0" 200 5185 "-" "UCLA www.stat.ucla.edu 67.180.31.126 - - [03/Mar/2010:23:59:58 -0800] "GET /projects/datasets/risk_perception.html HTTP/1.1" 200 5818 "http://www.sta www.stat.ucla.edu 164.67.132.219 - - [03/Mar/2010:23:59:59 -0800] "GET /cgi-bin/man-cgi?ypmatch HTTP/1.0" 404 - "-" "UCLA%20Google%20Search%20Ap Combined log format virtual host IP address Identity Userid date Request Status Bytes Referrer Agent Slicing out a field... For speed, we’ll work with just one day’s worth of logs; look at the file access_log.1267056000 -- what time period does this cover? 
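One way to answer the time-period question is to look at the first and last entries of the log. A sketch, run here on a two-line sample laid out like the combined log format above (the sample entries themselves are hypothetical):

```shell
# A tiny stand-in for the access log, in the same space-delimited layout:
printf '%s\n' \
  'a.stat.ucla.edu 128.97.86.247 - - [24/Feb/2010:23:00:31 -0800] "GET / HTTP/1.0" 200 3426' \
  'b.stat.ucla.edu 164.67.132.219 - - [03/Mar/2010:23:59:59 -0800] "GET / HTTP/1.0" 200 98' \
  > sample_log

# The timestamp is the fifth space-delimited field; the first and
# last lines bracket the period the log covers:
head -1 sample_log | cut -d" " -f5     # [24/Feb/2010:23:00:31
tail -1 sample_log | cut -d" " -f5     # [03/Mar/2010:23:59:59
```

On the real file, substitute access_log.1267056000 for sample_log; head and tail only read the ends of the file, so this is fast even on a large log.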
Many of the built-in Unix tools are designed to digest log files of this kind; in a way, this is a very old dance, one that has been enacted for decades...

Examples • cut -d" " -f1,11 access_log.1267056000 • cut -d" " -f11 access_log.1267056000

In general...
cut -d" " -f1,5,9 -- select the first, fifth and ninth
cut -d" " -f1-5 -- select the first through the fifth
cut -d" " -f1 -- select just the first

Unix pipes • Rather than have the output scroll out on the screen, it would be better (useful, even) to "catch" the output with a more command • Many programs take some kind of input and generate some kind of output; Unix tools often take input from the user and print the output to the screen • "Redirection" of data to and from files or programs is controlled by a construction known as a pipe

% cut -d" " -f1 access_log.1267056000 | more
seminars.stat.ucla.edu
ces.stat.ucla.edu
seminars.stat.ucla.edu
seminars.stat.ucla.edu
seminars.stat.ucla.edu
csrcb.stat.ucla.edu
csrcb.stat.ucla.edu
seminars.stat.ucla.edu
seminars.stat.ucla.edu
seminars.stat.ucla.edu
seminars.stat.ucla.edu
csrcb.stat.ucla.edu
seminars.stat.ucla.edu
seminars.stat.ucla.edu
seminars.stat.ucla.edu
seminars.stat.ucla.edu
seminars.stat.ucla.edu
seminars.stat.ucla.edu
seminars.stat.ucla.edu
seminars.stat.ucla.edu
seminars.stat.ucla.edu
seminars.stat.ucla.edu
seminars.stat.ucla.edu
...
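The field selections above behave exactly as described; a tiny sketch on a toy line, where the letters stand in for log fields:

```shell
# A toy "record" with nine space-delimited fields.
echo "a b c d e f g h i" > fields.txt
cut -d" " -f1,5,9 fields.txt   # a e i      (the first, fifth and ninth)
cut -d" " -f1-5 fields.txt     # a b c d e  (the first through the fifth)
cut -d" " -f1 fields.txt       # a          (just the first)
```

Note that cut prints the selected fields rejoined by the same delimiter, in their original order.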
Sending output to a file with ">" With this form of redirection, we take a stream of processed data and store it in a file -- For example • % cut -d" " -f2 access_log.1267056000 > ~/ips.txt • % ls ~

Taking input from a file with "<" With this form of redirection, we create an input stream from a file -- For example • wc < ~/ips.txt

Tallying • As its name suggests, the command sort will order the rows in our file; by default it uses alphabetical order -- the option -n lets you sort numerically instead • The command uniq will remove repeated adjacent lines; so if your file is sorted, it will return just the unique rows • sort ~/ips.txt > ~/ips_sorted.txt • head ~/ips_sorted.txt • uniq ~/ips_sorted.txt | more

The pipeline We seem to be proliferating files -- one version of the IPs that is raw, one that is sorted, maybe one that has just the unique entries -- we can use the pipe construction to combine many of these steps into one As the name might imply, you can connect pipes and have data stream from process to process

Example • cut -d" " -f2 access_log.1267056000 | sort | uniq | wc • cut -d" " -f2 access_log.1267056000 | sort | uniq -c | sort -rn • cut -d" " -f10 access_log.1267056000 | sort | uniq -c

% cut -d" " -f2 access_log.1267056000 | sort | uniq | wc
28564 28564 404062   # what does this mean?

% cut -d" " -f10 access_log.1267056000 | sort | uniq -c | sort -rn | head -10
933724 200
127624 304
72112 404
46878 206
22145 302
9682 301
2415 403
239 405
208 401
43 416

HTTP Error 101 Switching Protocols. Again, not really an "error", this HTTP Status Code means everything is working fine. HTTP Error 200 Success. This HTTP Status Code means everything is working fine. However, if you receive this message on screen, obviously something is not right... Please contact the server's administrator if this problem persists. Typically, this status code (as well as most other 200 Range codes) will only be written to your server logs.

What are these numbers?
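Before decoding those numbers, one detail in the pipelines above deserves a demonstration: uniq only collapses *adjacent* repeats, which is exactly why sort always precedes it. A small self-contained example:

```shell
# Four lines with repeats, but no repeats adjacent to each other.
printf "b\na\nb\na\n" > demo.txt
uniq demo.txt | wc -l          # 4 lines survive: nothing adjacent collapses
sort demo.txt | uniq | wc -l   # 2 lines: sorting makes duplicates adjacent
sort demo.txt | uniq -c        # with -c, uniq also counts each run
```

The last command is the backbone of all the tallies in this lecture: sort to group, uniq -c to count, then sort -rn to rank by count.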
• A fast Google search gives us a list of possible errors • Note that Error 200 actually means a success

HTTP Error 201 Created. A new resource has been created successfully on the server. HTTP Error 202 Accepted. Request accepted but not completed yet, it will continue asynchronously. HTTP Error 203 Non-Authoritative Information. Request probably completed successfully but can't tell from original server. HTTP Error 204 No Content. The request completed successfully but the resource requested is empty (has zero length). HTTP Error 205 Reset Content. The request completed successfully but the client should clear down any cached information as it may now be invalid.

• Error 206 means that only part of the file was delivered; the user cancelled the request before it could be delivered • Error 304 is "not modified"; sometimes clients perform conditional GET requests

HTTP Error 206 Partial Content. The request was canceled before it could be fulfilled. Typically the user gave up waiting for data and went to another page. Some download accelerator programs produce this error as they submit multiple requests to download a file at the same time. HTTP Error 300 Multiple Choices. The request is ambiguous and needs clarification as to which resource was requested. HTTP Error 301 Moved Permanently. The resource has permanently moved elsewhere, the response indicates where it has gone to. HTTP Error 302 Moved Temporarily. The resource has temporarily moved elsewhere, the response indicates where it is at present. HTTP Error 303 See Other/Redirect. A preferred alternative source should be used at present.

% cut -d" " -f2 access_log.1267056000 | sort | uniq -c | sort -rn | more
130258 128.97.86.247
129354 164.67.132.220
123756 164.67.132.219
83956 164.67.132.221
83205 164.67.132.222
23554 67.195.110.174
23058 128.97.55.186
21205 76.89.146.27
15941 128.97.55.188
12967 128.97.55.185
12365 67.195.114.250
8029 137.110.114.8
7748 208.115.111.248
5256 216.9.17.231
4894 128.97.55.3
4554 66.249.67.236
4057 69.235.86.81
3857 66.249.67.183
3691 67.195.110.173
3622 67.195.113.236
3181 77.88.29.246
3144 74.6.149.154
2825 98.222.49.118
...

The pipeline • In 1972, pipes appear in Unix, and with them a philosophy, albeit after some struggle for the syntax; should it be more(sort(cut))? • [Remember this; S/R has this kind of functional syntax] • The development of pipes led to the concept of tools -- software programs that would be in a tool box, available when you need them

"And that's, I think, when we started to think consciously about tools, because then you could compose things together... compose them at the keyboard and get them right every time." from an interview with Doug McIlroy

Remember, read the man pages! As we did with ls, if the command uniq is unfamiliar, you can look up its usage Example • man uniq • man host

% host 128.97.86.247
247.86.97.128.in-addr.arpa domain name pointer limen.stat.ucla.edu.
% host 164.67.132.220
220.132.67.164.in-addr.arpa domain name pointer gsa2.search.ucla.edu.
% host 67.195.110.174
174.110.195.67.in-addr.arpa domain name pointer b3091077.crawl.yahoo.net.

But what are we really after?
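The host lookups chain naturally onto the tally pipeline. A sketch on a hypothetical three-line sample (the real version would run on access_log.1267056000; the host loop needs DNS access, so it is left commented here):

```shell
# A tiny made-up sample in the log's format; field 2 is the client IP.
cat > mini_log <<'EOF'
www.stat.ucla.edu 128.97.86.247 - - [d] "GET /a HTTP/1.0" 200 1 "-" "-"
www.stat.ucla.edu 128.97.86.247 - - [d] "GET /b HTTP/1.0" 200 1 "-" "-"
www.stat.ucla.edu 164.67.132.220 - - [d] "GET /c HTTP/1.0" 200 1 "-" "-"
EOF
# Rank the IPs by hit count (128.97.86.247 comes out on top with 2 hits).
cut -d" " -f2 mini_log | sort | uniq -c | sort -rn
# Resolving the heaviest hitters to names (commented out: requires DNS):
# cut -d" " -f2 mini_log | sort | uniq -c | sort -rn | head -5 \
#   | awk '{print $2}' | while read ip; do host "$ip"; done
```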
• Rather than splitting up the data in access_log.1267056000 by day, we might consider dividing it by IP -- that is, extract all the hits from a given IP • Once we have such a thing, we can use the command wc to tell us about the number of accesses from each user • We can also start to "fit" user-level models that can be used to predict navigation

Rudimentary pattern matching grep can be used to skim lines from a file that have (or don't have) a particular pattern -- patterns are specified via regular expressions, something we will learn more about later The name of the command comes from an editing operation on Unix: g/re/p

Example • host 98.222.49.118 • grep 98.222.49.118 access_log.1267056000 • grep /~dinov access_log.1267056000

% grep 98.222.49.118 access_log.1267056000 | head -25
www.stat.ucla.edu 98.222.49.118 - - [26/Feb/2010:23:20:37 -0800] "GET /~zyyao/ HTTP/1.1" 200 17692 "http://www.google.com/search?q=z
www.stat.ucla.edu 98.222.49.118 - - [26/Feb/2010:23:20:38 -0800] "GET /~zyyao/ HTTP/1.1" 200 17692 "http://www.google.com/search?q=z
www.stat.ucla.edu 98.222.49.118 - - [26/Feb/2010:23:20:39 -0800] "GET /~zyyao/common.css HTTP/1.1" 200 3762 "http://www.stat.ucla.ed
www.stat.ucla.edu 98.222.49.118 - - [26/Feb/2010:23:20:40 -0800] "GET /favicon.ico HTTP/1.1" 200 318 "-" "Mozilla/5.0 (Windows; U; W
www.stat.ucla.edu 98.222.49.118 - - [26/Feb/2010:23:20:40 -0800] "GET /~zyyao/CV_ZhenyuYao.pdf HTTP/1.1" 200 36901 "http://www.googl
www.stat.ucla.edu 98.222.49.118 - - [26/Feb/2010:23:20:42 -0800] "GET /~zyyao/CV_ZhenyuYao.pdf HTTP/1.1" 206 4352 "-" "Mozilla/5.0 (
www.stat.ucla.edu 98.222.49.118 - - [01/Mar/2010:19:02:35 -0800] "GET /~zyyao/projects/ABSS/all_exp.html HTTP/1.1" 200 525 "-" "Mozi
www.stat.ucla.edu 98.222.49.118 - - [01/Mar/2010:19:02:35 -0800] "GET /favicon.ico HTTP/1.1" 200 318 "-" "Mozilla/5.0 (X11; U; Linux
www.stat.ucla.edu 98.222.49.118 - - [01/Mar/2010:19:02:37 -0800] "GET /%7Ezyyao/projects/ABSS/walk.html HTTP/1.1" 200 22230 "http://
www.stat.ucla.edu 98.222.49.118 - - [01/Mar/2010:19:02:38 -0800] "GET /%7Ezyyao/projects/ABSS/walkpng/000template101.png HTTP/1.1" 2 www.stat.ucla.edu 98.222.49.118 - - [01/Mar/2010:19:02:38 -0800] "GET /%7Ezyyao/projects/ABSS/walkpng/101I1001.png HTTP/1.1" 200 113 www.stat.ucla.edu 98.222.49.118 - - [01/Mar/2010:19:02:38 -0800] "GET /%7Ezyyao/projects/ABSS/walkpng/101I1003.png HTTP/1.1" 200 113 www.stat.ucla.edu 98.222.49.118 - - [01/Mar/2010:19:02:38 -0800] "GET /%7Ezyyao/projects/ABSS/walkpng/101I1003sketch.png HTTP/1.1" 2 www.stat.ucla.edu 98.222.49.118 - - [01/Mar/2010:19:02:38 -0800] "GET /%7Ezyyao/projects/ABSS/walkpng/101I1004.png HTTP/1.1" 200 111 www.stat.ucla.edu 98.222.49.118 - - [01/Mar/2010:19:02:38 -0800] "GET /%7Ezyyao/projects/ABSS/walkpng/101I1004sketch.png HTTP/1.1" 2 www.stat.ucla.edu 98.222.49.118 - - [01/Mar/2010:19:02:38 -0800] "GET /%7Ezyyao/projects/ABSS/walkpng/101I1005.png HTTP/1.1" 200 112 www.stat.ucla.edu 98.222.49.118 - - [01/Mar/2010:19:02:38 -0800] "GET /%7Ezyyao/projects/ABSS/walkpng/101I1005sketch.png HTTP/1.1" 2 www.stat.ucla.edu 98.222.49.118 - - [01/Mar/2010:19:02:38 -0800] "GET /%7Ezyyao/projects/ABSS/walkpng/101I1006.png HTTP/1.1" 200 115 www.stat.ucla.edu 98.222.49.118 - - [01/Mar/2010:19:02:38 -0800] "GET /%7Ezyyao/projects/ABSS/walkpng/101I1006sketch.png HTTP/1.1" 2 www.stat.ucla.edu 98.222.49.118 - - [01/Mar/2010:19:02:38 -0800] "GET /%7Ezyyao/projects/ABSS/walkpng/101I1007.png HTTP/1.1" 200 112 www.stat.ucla.edu 98.222.49.118 - - [01/Mar/2010:19:02:39 -0800] "GET /%7Ezyyao/projects/ABSS/walkpng/101I1007sketch.png HTTP/1.1" 2 www.stat.ucla.edu 98.222.49.118 - - [01/Mar/2010:19:02:39 -0800] "GET /%7Ezyyao/projects/ABSS/walkpng/101I1008.png HTTP/1.1" 200 113 www.stat.ucla.edu 98.222.49.118 - - [01/Mar/2010:19:02:39 -0800] "GET /%7Ezyyao/projects/ABSS/walkpng/101I1008sketch.png HTTP/1.1" 2 www.stat.ucla.edu 98.222.49.118 - - [01/Mar/2010:19:02:39 -0800] "GET /%7Ezyyao/projects/ABSS/walkpng/101I1009.png HTTP/1.1" 200 109 
www.stat.ucla.edu 98.222.49.118 - - [01/Mar/2010:19:02:39 -0800] "GET /%7Ezyyao/projects/ABSS/walkpng/101I1009sketch.png HTTP/1.1" 2

% cut -d" " -f12 access_log.1267056000 | grep google | head -25
"http://www.google.com/search?hl=en&newwindow=1&q=filled+%22rounded+box%22+image+beamer&aq=f&aqi=&aql=&oq="
"http://www.google.com/imgres?imgurl=http://scc.stat.ucla.edu/page_attachments/0000/0064/map.jpg&imgrefurl=http://scc.s
"http://www.google.com.ph/search?client=firefox-a&rls=org.mozilla%3Aen-US%3Aofficial&channel=s&hl=tl&source=hp&q=HISTOR
"http://www.google.com/search?sourceid=navclient&ie=UTF-8&rlz=1T4SNNT_enUS355US355&q=SOCR+distributions"
"http://www.google.com/search?sourceid=navclient&ie=UTF-8&rlz=1T4SNNT_enUS355US355&q=SOCR+distributions"
"http://translate.google.gr/translate?hl=el&langpair=en%7Cel&u=http://www.stat.ucla.edu/history/"
"http://translate.googleusercontent.com/translate_c?hl=el&langpair=en%7Cel&u=http://www.stat.ucla.edu/history/&rurl=tra
"http://translate.googleusercontent.com/translate_c?hl=el&langpair=en%7Cel&u=http://www.stat.ucla.edu/history/&rurl=tra
"http://translate.googleusercontent.com/translate_c?hl=el&langpair=en%7Cel&u=http://www.stat.ucla.edu/history/index_hea
"http://www.google.com/search?q=socr&rls=com.microsoft:ko:IE-SearchBox&ie=UTF-8&oe=UTF-8&sourceid=ie7&rlz=1I7HPAB_ko"
"http://translate.googleusercontent.com/translate_c?hl=el&langpair=en%7Cel&u=http://www.stat.ucla.edu/history/index_bod
"http://www.google.co.in/search?q=Lienhart+ICME+2003+face&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:en-US:official&client=
"http://www.google.com.my/search?hl=en&client=firefox-a&channel=s&rls=org.mozilla:en-US:official&q=markov+chain+monte+c
"http://www.google.com/search?q=stata+goodness+of+fit+test&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:en-US:official&client
"http://translate.googleusercontent.com/translate_c?hl=zh-CN&sl=en&u=http://www.stat.ucla.edu/~yuille/courses/Stat202C/
"http://www.google.lt/url?sa=t&source=web&ct=res&cd=1&ved=0CAYQFjAA&url=http%3A%2F%2Fwww.stat.ucla.edu%2F~bk%2F&rct=j&q
"http://www.google.nl/search?hl=nl&source=hp&q=f+distribution+table&meta=&aq=1&oq=F+distrib"
"http://www.google.com.my/search?q=case+in+suicide&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:en-US:official&client=firefox
"http://www.google.com/url?sa=t&source=web&ct=res&cd=9&ved=0CCgQFjAI&url=http%3A%2F%2Fwww.stat.ucla.edu%2F~sczhu%2FProj
"http://www.google.com/search?client=firefox-a&rls=org.mozilla%3Aen-US%3Aofficial&channel=s&hl=en&source=hp&q=t-statist
"http://www.google.co.in/search?hl=en&q=Genetic+algorithm+tutorial&meta=&aq=o&oq="
"http://www.google.com/search?hl=en&client=firefox-a&rls=org.mozilla:en-US:official&q=an+introduction+to+stochastic+mod
"http://www.google.com/search?hl=en&client=firefox-a&hs=YlE&rls=org.mozilla%3Aen-US%3Aofficial&q=dinov+ucla+stats&aq=f&
"http://www.google.com/search?hl=en&lr=&client=safari&rls=en&q=the+reading+on+a+voltage+meter+connected+to+a+test+circu
"http://www.google.co.jp/search?hl=ja&q=Ryacas%E3%80%80r-commander+install&btnG=%E6%A4%9C%E7%B4%A2&lr=&aq=f&oq="

A puzzle • In working through these notes, you might have noticed that I've been protecting you a little from some nastiness -- Suppose, for example, we decided to use the cut command to extract the 10th field in each line of the logfile and look at just the first 10 lines of output • The code 200 is by far the most frequent, with second place going to code 304, but by looking at the first 10 lines I intentionally hid information from you...

count	code
933724	200
127624	304
72112	404
46878	206
22145	302
9682	301
2415	403
239	405
208	401
43	416

A puzzle • Looking at the full output from the command and not just the first 10 lines, you see something a bit different; below we present that table • What else do you notice?
count	code
933724	200
127624	304
72112	404
46878	206
22145	302
9682	301
2415	403
239	405
208	401
43	416
29	400
8	HTTP/1.0"
3	prep.R
3	HTTP/1.1"
2	1103
1	Post-rej/project_II.htm
1	500
1	2008

A puzzle • About 18 of the hits in this file are logged in such a way that the 10th field (as defined by a space) is not the code we were after • We've made an assumption about the data that isn't correct, and the result shows up when we have a large enough sample -- to chase down the problem, we extracted just the log line referring to a request for project_II.htm • So, what happened?

www.stat.ucla.edu 208.115.111.248 - - [25/Feb/2010:07:23:49 -0800] "GET /~wxl/research/proteomic/_Project_Rej vs Post-rej/project_II.htm HTTP/1.1" 404 - "-" "Mozilla/5.0 (compatible; DotBot/1.1; http://www.dotnetdotcom.org/, [email protected])"

Backstory • We specify web pages using a Uniform Resource Locator (URL); for example, the page http://www.stat.ucla.edu/research will be interpreted as a request for /research from the host www.stat.ucla.edu • The request and response are handled via a communication protocol called HTTP (or the Hypertext Transfer Protocol); you will see two versions of HTTP in our log files (HTTP/1.0, which opens a new "connection" per request, and HTTP/1.1, which can reuse a connection) • HTTP is the protocol by which clients (your web browser, or in the parlance of your logs, a user agent) make requests from one or more servers (computers hosting web sites; also called HTTP servers) and receive data in return (web pages, media items, and so on)

Backstory • In addition to its particular format, a URL can be made up of only certain classes of symbols taken from the US ASCII* character set -- alphanumeric characters, some special characters $-_.+!*() and some reserved characters that have their own special purpose • I don't want to go into the specifics of naming here, as we will likely take it up in earnest later, except to say that certain characters cannot appear in
a URL or should only appear with a special meaning (like / to specify the path) • When you need to include a special character, it needs to be URL encoded; that is, it is replaced with a numerical equivalent... • *The American Standard Code for Information Interchange

URL Encoding • URL encoding replaces characters with their ASCII hexadecimal equivalent, prefaced by a %; recall that the hexadecimal number system represents values in "base" 16 -- that is, it uses the sixteen symbols 0,1,...,9,A,B,C,D,E,F • Two hexadecimal digits are equivalent to 8 binary digits (8 bits or 1 byte) and represent an integer between 0 and 255 -- we will have more to say about data representation next time • For now, it's enough to know that %20 is the URL encoding of the space " "; does this bring us any closer to understanding the three lines a few slides back?

Puzzle solved • Notice that the first line of the five has spaces in the URL that was requested -- a " " that was not URL encoded -- and this means that counting spaces is not really the same as identifying items from the log • Technically, your browser takes care of this encoding (try typing in a URL with a space in it); it's worth noting that the request came from a crawler, a robot that (presumably) doesn't have this check in place

% grep Post-rej/project_II.htm access_log.* | head -5
access_log.1267056000:www.stat.ucla.edu 208.115.111.248 - - [25/Feb/2010:07:23:49 -0800] "GET /~wxl/research/proteomic/_Project_Rej vs Post-rej/projec
access_log.1267660800:www.stat.ucla.edu 67.195.110.174 - - [06/Mar/2010:04:29:33 -0800] "GET /%7Ewxl/research/proteomic/_Project_Rej%20vs%20Post-rej/p
access_log.1267660800:www.stat.ucla.edu 206.53.226.235 - - [10/Mar/2010:08:33:20 -0800] "GET /~wxl/research/proteomic/_Project_Rej%20vs%20Post-rej/pro
access_log.1267660800:www.stat.ucla.edu 206.53.226.235 - - [10/Mar/2010:08:33:21 -0800] "GET /~wxl/research/proteomic/_Project_Rej%20vs%20Post-rej/lig
access_log.1267660800:www.stat.ucla.edu
206.53.226.235 - - [10/Mar/2010:08:33:28 -0800] "GET /~wxl/research/proteomic/_Project_Rej%20vs%20Post-rej/oth

The lesson • This explains just one of the handful of errors we encountered in this one file, but the basic problem is the same -- we are relying on spaces to define a pattern for us, but this is not a completely reliable way of identifying the content we're after • Instead, we need a more elaborate language for specifying patterns; today we'll see an extremely powerful construction known as regular expressions • (Note: This material is perhaps slightly out of time order -- I decided to present it now to get us out of this puzzle -- and next time we will take a small step back and talk about data formats like CSV, XML, JSON and the relational model)

Quien es mas macho? • In online marketing, hits rule the roost; the more raw traffic you attract, the greater your opportunities for making a sale • Alright, so it's not as simple as that, but let's see what we can learn about centers of activity on our website

Quien es mas macho? • Compute the number of hits to the portions of the site owned by SongChun Zhu, Vivian Lew, Jan deLeeuw and Ivo Dinov • Who received the most hits in August? In March? • What can you say about the kinds of files that were downloaded? • What was the most popular portion of each site?

Then... • Pull back a little and tell me about the site in general and the habits of its visitors; specifically, think about • When is the site active? When is it quiet? • Do the visitors stay for very long? Do they download any of our papers or software? What applications do they run? • On balance, is our traffic "real" or mostly the result of robots or automated processes?
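For the per-person tallies, one hedged starting point is grep -c, which counts matching lines. The directory names below are assumptions about how each person's pages are addressed; check them against the real paths first. And, as the puzzle just showed, /~name can also arrive URL encoded as /%7Ename, so a careful count should catch both forms. Demonstrated here on a tiny made-up sample:

```shell
# Three hypothetical log lines; two hit ~dinov (one URL encoded), one hits ~sczhu.
cat > hits_log <<'EOF'
www.stat.ucla.edu 1.2.3.4 - - [d] "GET /~dinov/a.html HTTP/1.1" 200 1 "-" "-"
www.stat.ucla.edu 1.2.3.4 - - [d] "GET /%7Edinov/b.html HTTP/1.1" 200 1 "-" "-"
www.stat.ucla.edu 5.6.7.8 - - [d] "GET /~sczhu/c.html HTTP/1.1" 200 1 "-" "-"
EOF
for who in sczhu dinov; do
  # -E enables alternation, so the pattern matches the literal and the
  # %7E-encoded spelling of the home directory; -c counts matching lines.
  printf "%s %s\n" "$who" "$(grep -Ec "GET /(~|%7E)$who/" hits_log)"
done
```

On the real file you would swap hits_log for access_log.1267056000 and list all four names in the loop.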
Finally • Unix tools can be incredibly powerful; they will be much faster than most other options for what they're good at • That said, there are plenty of kinds of analysis that are not workable with collections of simple Unix facilities; as you do your homework, keep track of what seems hard and why • Our programming resources will keep getting more and more elaborate as the quarter progresses; this is just a first step