Lecture 2 - UCLA Statistics
Transcription
Lecture 2: The pipeline Last time • We presented some (admittedly rough) context for the material to be covered in this course: From a comparison of Stat 202a in 2004 and 2010 we saw somewhat incredible shifts in the mechanisms by which data are distributed • We also considered the term “data science” and compared current attempts at a definition with older (some really old) approaches and the classic cycle of exploratory data analysis • We ended by examining some early work on bringing data analysis to high school students, curricular approaches that should have something to say about how we train statisticians Today • We will look at the Unix operating system, the philosophy underlying its design and some basic tools • Given the abstract nature of our first lecture, we will emphasize some hands-on tasks -- We will use as our case study the last week of traffic across the department’s website www.stat.ucla.edu • You will have a chance to kick the tires on these tools and address some simple website usage statistics The shell: Why we teach it • Many of you know the computer as a Desktop and a handful of applications (a browser, Word, Excel, maybe R); to introduce the shell means having a discussion about the structure of the computer, about operating systems, about filesystems, about history • The shell offers programmatic access to a computer’s constituent parts, allows students to “do” data analysis on directories (folders), on processes, on their network • Given that there are so many flavors of shell, it is also the first time we can talk about choosing tools (that the choice is yours to make!), about evaluating which shell is best for you, and about the shifting terrain of software development The shell: Why we teach it • As a practical matter, shell tools are an indispensable part of my own practice, for data “cleaning,” for preliminary data processing, for exploratory analysis -- By design, they are simple, lightweight and easily combined to perform much 
more complex computations • The shell also becomes important as you start to make use of shared course resources (data, software, hardware) -- This year, most of our work will be done on computers outside the Statistics Department, and the shell will be your window onto these other systems • On the computers in front of you, you can access the shell through a Terminal Window (Applications -> Utilities -> Terminal) -- For those of you with a Windows machine, you will have to download an emulator like Cygwin... Operating systems • Most devices that contain a computer of some kind will have an OS; they tend to emerge when the appliance will have to deal with new applications, complex user input and possibly changing requirements of its function • Your TiVo, smartphone and even your automobile all have operating systems Operating systems • An operating system is a piece of software (code) that organizes and controls hardware and other software so your computer behaves in a flexible but predictable way • For home computers, Windows, MacOS and Linux are among the most commonly used operating systems Your computer • The Central Processing Unit (CPU) or microprocessor is a complete computational engine, capable of carrying out a number of basic commands (basic arithmetic, store and retrieve information from its memory, and so on) • The CPU itself has the capacity to store some amount of information; when it needs more space, it moves data to another kind of chip known as Random Access Memory (RAM) -- “random access” as opposed to, say, sequential access to memory locations • Your computer also has one or more storage devices that can be used to organize and store data; hard disks or drives store data magnetically, while solid state drives again use semiconductor technology Your computer • Your operating system, then, manages all of your computer’s resources, providing layers of abstraction for both you, the user, as well as developers who are writing new programs for your 
computer • With the emergence of so-called cloud computing, we imagine a variety of computing resources “out there” on the web on which we can execute our computations; in this model, computations are performed elsewhere, and your own computer might function more as a “browser” receiving results -- Google’s Chrome operating system is “minimalist” in this sense http://www.youtube.com/watch?v=0QRO3gKj3qw But let’s step back a bit... • The computers you are sitting at are running the Mac OS which is built on a Unix platform -- let’s spend some time talking about Unix (we will have more to say about three fundamental abstractions for computer components -- the memory, the interpreter and the communication link -- next time) • We’ll start with a little history that might help explain why these tools look the way they do, their underlying design philosophy and how they can help you in the early (or late, even) stages of an analysis Some history • In 1964, Bell Labs partnered with MIT and GE to create Multics (for Multiplexed Information and Computing Service) -- here is the vision they had for computing “Such systems must run continuously and reliably 7 days a week, 24 hours a day in a way similar to telephone or power systems, and must be capable of meeting wide service demands: from multiple man-machine interaction to the sequential processing of absentee-user jobs; from the use of the system with dedicated languages and subsystems to the programming of the system itself” Some history • Bell Labs pulled out of the Multics project in 1969; a group of researchers at Bell Labs started work on Unics (Uniplexed information and computing system) because initially it could support only one user; as the system matured, it was renamed Unix, which isn’t an acronym for anything • Ritchie simply says that Unix is a “somewhat treacherous pun on Multics” From “The Art of Unix 
Programming” by Raymond The Unix filesystem • In Multics, we find the first notion of a hierarchical file system -- software for organizing and storing computer files and the data they contain -- in Unix, files are arranged in a tree structure that allows separate users to have control of their own areas • In general, a file system provides a level of abstraction for the user, keeping track of how data are stored on an associated physical storage device (say, a hard drive or a CD) • Unix began (more or less) as a file system and then an interactive shell emerged to let you examine its contents and perform basic operations The Unix kernel and its shell • The Unix kernel is the part of the operating system that provides other programs access to the system’s resources (the computer’s CPU or central processing unit, its memory and various I/O or input/output devices) • The Unix shell is a command-line interface to the kernel -- keep in mind that Unix was designed by computer scientists for computer scientists and the interface is not optimized for novices Unix shells • The Unix shell is a type of program called an interpreter; in this case, think of it as a text-based interface to the kernel • The shell operates in a simple loop: It accepts a command, interprets it, executes the command and waits for another -- The shell displays a prompt to tell you that it is ready to accept a command Shells in general • The term “shell” is often applied to any program that you type commands into (we’ll see similar kinds of interfaces when we talk about Python, R, and even sqlite3 and mongo) -- the idea is that the shell is the outermost interface to the inner workings of the system it surrounds • It is sometimes useful to take a broader view of the shell concept, for example, considering a browser (like Firefox) a shell for a system that renders web pages -- in such cases, we might usefully distinguish between text-based shells (command line interfaces) and graphics-based shells 
(graphical user interfaces) • Now, open up the terminal window on your Mac and let’s explore a little... * A boring slide, but full of potential! Unix shells • The shell is itself a program that the Unix operating system runs for you (a program is referred to as a process when it is running) • The kernel manages many processes at once, many of which are the result of user commands (others provide services that keep the computer running) • Some commands are built into the shell, others have been added by users -- Either way, the shell waits until the command is executed The result of typing in the command top: a printout of all the processes running on your computer (the columns give the name of the command, its process ID, how hard the computer is thinking about it and how much memory is being used)
% hostinfo # neolab.stat.ucla.edu
Mach kernel version:
 Darwin Kernel Version 9.8.0: Wed Jul 15 16:55:01 PDT 2009; root:xnu-1228.15.4~1/RELEASE_I386
Kernel configured for up to 8 processors.
8 processors are physically available.
8 processors are logically available.
Processor type: i486 (Intel 80486)
Processors active: 0 1 2 3 4 5 6 7
Primary memory available: 16.00 gigabytes
Default processor set: 159 tasks, 364 threads, 8 processors
Load average: 10.14, Mach factor: 1.43
% hostinfo # my laptop
Mach kernel version:
 Darwin Kernel Version 10.4.0: Fri Apr 23 18:28:53 PDT 2010; root:xnu-1504.7.4~1/RELEASE_I386
Kernel configured for up to 2 processors.
2 processors are physically available.
2 processors are logically available.
Processor type: i486 (Intel 80486)
Processors active: 0 1
Primary memory available: 4.00 gigabytes
Default processor set: 110 tasks, 442 threads, 2 processors
Load average: 0.25, Mach factor: 1.74
Running top on neolab.stat.ucla.edu -- How many processes are running? How much RAM is available? Running top on taia.stat.ucla.edu (last year) -- How many processes are running? How much RAM is available? What do you reckon this computer does? 
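The questions on the slide above (how many processes? what is the load?) can also be answered without the interactive top display; here is a small sketch using the standard ps and uptime commands (the exact columns ps prints vary a little between Linux and Mac OS X):

```shell
# How many processes are running right now?  "ps aux" prints one line
# per process plus a header line, so subtract one from the count.
ps aux | wc -l

# The load averages that hostinfo and top report are also printed by:
uptime
```

Notice that ps is itself just another process; its own line shows up in the count.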
Operating systems • Processor management -- Schedules jobs (formally referred to as processes) to be executed by the computer • Memory and storage management -- Allocates the space required for each running process in main memory (RAM) or in some other temporary location if space is tight, and supervises the storage of data onto disk Operating systems • Device management -- A program called a driver translates data (files from the filesystem) into signals that devices like printers can understand; an operating system manages the communication between devices and the CPU, for example • Application interface -- An API (application programming interface) lets programmers use functions of the computer and the operating system without having to know how something is done • User interface -- Finally, the operating system turns and looks at you; the UI is a program that defines how users interact with the computer -- some are graphical (Windows is a GUI) and some are text-based (your Unix shell) Unix shell(s) • There are, in fact, many different kinds of Unix shells -- The table on the right lists a few of the most popular • Why so many? Bourne shell /bin/sh The oldest and most standardized shell. Widely used for system startup files (scripts run during system startup). Installed in Mac OS X. Bash (Bourne Again SHell) /bin/bash Bash is an improved version of sh. Combines features from csh, sh, and ksh. Very widely used, especially on Linux systems. Installed in Mac OS X. http://www.gnu.org/manual/bash/ C shell /bin/csh Provides scripting features that have a syntax similar to that of the C programming language (originally written by Bill Joy). Installed in Mac OS X. Korn shell /bin/ksh Developed at AT&T by David Korn in the early 1980s. Ksh is widely used for programming. It is now open-source software, although you must agree to AT&T's license to install it. http://www.kornshell.com TC Shell /bin/tcsh An improved version of csh. 
The t in tcsh comes from the TENEX and TOPS-20 operating systems, which provided a command-completion feature that the creator (Ken Greer) of tcsh included in his new shell. Wilfredo Sanchez, formerly lead engineer on Mac OS X for Apple, worked on tcsh in the early 1990s at the Massachusetts Institute of Technology. Z shell /bin/zsh Created in 1990, zsh combines features from tcsh, bash, and ksh, and adds many of its own. Installed in Mac OS X. http://zsh.sourceforge.net Why the choices? • A shell program was originally meant to take commands, interpret them and then execute some operation • Inevitably, one wants to collect a number of these operations into programs that execute compound tasks; at the same time you want to make interaction on the command line as easy as possible (a history mechanism, editing capabilities and so on) • The original Bourne shell is ideal for programming; the C-shell and its variants are good for interactive use; the Korn shell is a combination of both Steve Bourne, creator of sh, in 2005 From "ksh - An Extensible High Level Language" by David G. Korn At the time of these so-called ``shell wars'', I worked on a project at Bell Laboratories that needed a form entry system. We decided to build a form interpreter, rather than writing a separate program for each form. Instead of inventing a new script language, we built a form entry system by modifying the Bourne shell, adding built-in commands as necessary. The application was coded as shell scripts. We added a builtin to read form template description files and create shell variables, and a built-in to output shell variables through a form mask. We also added a built-in named let to do arithmetic using a small subset of the C language expression syntax. An array facility was added to handle columns of data on the screen. Shell functions were added to make it easier to write modular code, since our shell scripts tended to be larger than most shell scripts at that time. 
Since the Bourne shell was written in an Algol-like variant of C, we converted our version of it to a more standard K&R version of C. We removed the restriction that disallowed I/O redirection of built-in commands, and added echo, pwd, and test built-in commands for improved performance. Finally, we added a capability to run a command as a coprocess so that the command that processed the user-entered data and accessed the database could be written as a separate process. By the late 1970s, each of these shells had sizable followings within Bell Laboratories. The two shells were not compatible, leading to a division as to which should become the standard shell. Steve Bourne and John Mashey argued their respective cases at three successive UNIX user group meetings. Between meetings, each enhanced their shell to have the functionality available in the other. A committee was set up to choose a standard shell. It chose the Bourne shell as the standard. At the same time, Steve Bourne at Bell Laboratories wrote a version of the shell which included programming language techniques. A rich set of structured flow control primitives was part of the language; the shell processed commands by building a parse tree and then evaluating the tree. Because of the rich flow control primitives, there was no need for a goto command. Bourne introduced the "here-document" whereby the contents of a file are inserted directly into the script. One of the often overlooked contributions of the Bourne shell is that it helped to eliminate the distinction between programs and shell scripts. Earlier versions of the shell read input from standard input, making it impossible to use shell scripts as part of a pipeline. Unlike most earlier systems, the Thompson shell command language was a user-level program that did not have any special privileges. This meant that new shells could be created by any user, which led to a succession of improved shells. 
In the mid-1970s, John Mashey at Bell Laboratories extended the Thompson shell by adding commands so that it could be used as a primitive programming language. He made commands such as if and goto built-ins for improved performance, and also added shell variables. The original UNIX system shell was a simple program written by Ken Thompson at Bell Laboratories, as the interface to the new UNIX operating system. It allowed the user to invoke single commands, or to connect commands together by having the output of one command pass through a special file called a pipe and become input for the next command. The Thompson shell was designed as a command interpreter, not a programming language. While one could put a sequence of commands in a file and run them, i.e., create a shell script, there was no support for traditional language facilities such as flow control, variables, and functions. When the need for some flow control surfaced, the commands /bin/if and /bin/goto were created as separate commands. The /bin/if command evaluated its first argument and, if true, executed the remainder of the line. The /bin/goto command read the script from its standard input, looked for the given label, and set the seek position at that location. When the shell returned from invoking /bin/goto, it read the next line from standard input from the location set by /bin/goto. The awk command was developed in the late 1970s by Al Aho, Brian Kernighan, and Peter Weinberger of Bell Laboratories as a report generation language. A second-generation awk developed in the early 1980s was a more general-purpose scripting language, but lacked some shell features. It became very common to combine the shell and awk to write script applications. For many applications, this had the disadvantage of being slow because of the time consumed in each invocation of awk. 
The perl language, developed by Larry Wall in the mid-1980s, is an attempt to combine the capabilities of the shell and awk into a single language. Because perl is freely available and performs better than combined shell and awk, perl has a large user community, primarily at universities. In 1992, the IEEE POSIX 1003.2 and ISO/IEC 9945-2 shell and utilities standards were ratified. These standards describe a shell language that was based on the UNIX System V shell and the 1988 version of ksh. The 1993 version of ksh is a version of ksh which is a superset of the POSIX and ISO/IEC shell standards. With few exceptions, it is backward compatible with the 1988 version of ksh. In spite of its wide availability, ksh source is not in the public domain. This has led to the creation of bash, the ``Bourne again shell'', by the Free Software Foundation; and pdksh, a public domain version of ksh. Unfortunately, neither is compatible with ksh. As use of ksh grew, the need for more functionality became apparent. Like the original shell, ksh was first used primarily for setting up processes and handling I/O redirection. Newer uses required more string handling capabilities to reduce the number of process creations. The 1988 version of ksh, the one most widely distributed at the time this is written, extended the pattern matching capability of ksh to be comparable to that of the regular expression matching found in sed and grep. The popular inline editing features (vi and emacs mode) of ksh were created by software developers at Bell Laboratories; the vi line editing mode by Pat Sullivan, and the emacs line editing mode by Mike Veach. Each had independently modified the Bourne shell to add these features, and both were in organizations that wanted to use ksh only if ksh had their respective inline editor. Originally the idea of adding command line editing to ksh was rejected in the hope that line editing would move into the terminal driver. 
However, when it became clear that this was not likely to happen soon, both line editing modes were integrated into ksh and made optional so that they could be disabled on systems that provided editing as part of the terminal interface. In 1982, the UNIX System V shell was converted to K&R C, echo and pwd were made built-in commands, and the ability to define and use shell functions was added. Unfortunately, the System V syntax for function definitions was different from that of ksh. In order to maintain compatibility with the System V shell and preserve backward compatibility, I modified ksh to accept either syntax. I created the first version of ksh soon after I moved to a research position at Bell Laboratories. Starting with the form scripting language, I removed some of the form-specific code, and added useful features from the C shell such as history, aliases, and job control. At the same time, at the University of California at Berkeley, Bill Joy put together a new shell called the C shell. Like the Mashey shell, it was implemented as a command interpreter, not a programming language. While the C shell contained flow control constructs, shell variables, and an arithmetic facility, its primary contribution was a better command interface. It introduced the idea of a history list and an editing facility, so that users didn't have to retype commands that they had entered incorrectly. And while we are at it... • Unix itself comes in different flavors; the 1980s saw an incredible proliferation of Unix versions, somewhere around 100 (System V, AIX, Berkeley BSD, SunOS, Linux, ...) 
• Vendors provided (diverging) versions of Unix, optimized for their own computer architectures and supporting different features • Despite the diversity, it was still easier to “port” applications between versions of Unix than it was between different proprietary OSes • In the 90s, some consolidation took place; today Linux dominates the low-end market, while Solaris, AIX and HP-UX are leaders in the mid-to-high end A few common commands • First, commands to explore your file system; walk through directories and list files • pwd, ls, cd • mkdir, rmdir • cp, mv, rm (These commands seem somewhat terse -- Why not “remove” or “copy”? -- and some have speculated that it was because of the effort required to enter data on the early ASR-33 teletype keyboards) Another view of the filesystem; here, your Mac will display directories as folders and you navigate by clicking rather than typing commands An example
[obstacle 12 lectures] ls -al
total 8
drwxr-xr-x    8 cocteau  staff   272 Sep 30 08:11 .
drwxr-xr-x   70 cocteau  staff  2380 Sep 30 08:08 ..
drwxr-xr-x    4 cocteau  staff   136 Sep 30 08:05 lecture1
drwxr-xr-x@ 282 cocteau  staff  9588 Sep 30 08:13 lecture1.key
-rw-r--r--    1 cocteau  staff   139 Sep 21 15:34 lecture1.txt
drwxr-xr-x   11 cocteau  staff   374 Sep 30 10:24 lecture2
drwxr-xr-x@  80 cocteau  staff  2720 Sep 30 10:22 lecture2.key
Reading across the columns: the type of file (- normal file, d directory, s socket file, l link file) together with the permissions (what you and others can do to the file); the number of 512-byte blocks used by the files in the directory; your user name; the group you belong to that owns the file; the file’s size in bytes; when the file was last modified; and the file’s name -- . is shorthand for your present working directory (where you’re at) and .. for the directory one level above http://www.thegeekstuff.com/2009/07/linux-ls-command-examples/ Options • Notice that Unix commands can also be modified by adding one or more options; in the case of ls, we added the option -l for the “long” form of the output and “-a” for all directories that begin with a “.” -- Another useful option is “-h” for humanly readable output • The command man will provide you with help on Unix commands; you simply supply the name of the command you are interested in as an operand -- try typing “man ls” Permission bits • Unix can support many users (remember, that was part of its initial design goals) on a single system and each user can belong to one or more groups • Every file in a Unix filesystem is owned by some user and one of that user’s groups; each file also has a set of permissions specifying which users can • r: read • w: write (modify) or • x: execute • the file; these are specified with three “bits” and we need three sets of bits to define what the user can do, what their group (that owns the file) can do and what others can do Permission bits
[obstacle 12 lectures] ls -al
total 8
drwxr-xr-x    8 cocteau  staff   272 Sep 30 08:11 .
drwxr-xr-x   70 cocteau  staff  2380 Sep 30 08:08 ..
drwxr-xr-x    4 cocteau  staff   136 Sep 30 08:05 lecture1
drwxr-xr-x@ 282 cocteau  staff  9588 Sep 30 08:13 lecture1.key
-rw-r--r--    1 cocteau  staff   139 Sep 21 15:34 lecture1.txt
drwxr-xr-x   11 cocteau  staff   374 Sep 30 10:24 lecture2
drwxr-xr-x@  80 cocteau  staff  2720 Sep 30 10:22 lecture2.key
In the first column, the leading character is the type of file; the next nine characters are the permission bits: what you can do to the file (u), what the owning group can do (g) and what others can do (o) In general... • The command chmod changes the permissions on a file; here are some examples • % chmod g+x lecture1.txt • % chmod ug-x lecture1.txt • % chmod a+w lecture1.txt • % chmod go-w lecture1.txt • You can also use binary to express the permissions; so if we think of bits ordered as rwx, then • r-x: 1 × 2² + 0 × 2¹ + 1 × 2⁰ = 5 • rwx: 1 × 2² + 1 × 2¹ + 1 × 2⁰ = 7 • r--: 1 × 2² + 0 × 2¹ + 0 × 2⁰ = 4 and we can specify permissions with these values • % chmod 754 dates.sh Kinds of files • What you’ll notice right away is that there are different types of files having different permissions -- we will talk about different kinds of data storage and representation next time • Unix filesystem convention places (shared, commonly used) executable files in places like /usr/bin or /usr/local/bin (navigate to one of these directories and have a look!) • Different files are opened by different kinds of programs; in OSX, there is a beautiful command called open that decides which program to use Kinds of files • Filenames which contain special characters like * and ~ are candidates for filename substitution • ~ refers to your home directory and * is a wildcard for any number of characters (it’s referred to as globbing -- you can try the command glob to see what happens, something we’ll do shortly) • Other special characters like {, [ and ? 
can also be expanded, but we’ll get to them when we learn a bit more about regular expressions Unix shells • There are many flavors of Unix Shells; that is, there are many kinds of programs that operate as shells: sh, csh, tcsh, bash, ksh • They are all programs and can be found on the file system -- try % which sh An example: HTTP access logs • www.stat.ucla.edu • The department runs an Apache Web server running on taia.stat.ucla.edu • Each request, each click by a user out on the Internet browsing our site, is logged -- there are standards for these files, but in general, they can be a bit hairy to “parse” Log files • A Unix system is constantly maintaining logs about its performance; various applications log their activities as well -- some are visible by everyone, some by users with “root” access • Scraping through logs for anomalies, for system profiling, for system management, is a fairly common exercise; the logs we’ll look at have taken on extra importance given all the transactions that take place across the web Into the cloud... • This morning, I spun up an EC2 instance from Amazon Web Services -- EC2 stands for Elastic Compute Cloud -- assigned it to a domain (stat202a.org) and copied our relatively small data set over to the new machine • The machine is running Ubuntu and you’ll have access to all the Unix tools we’ve been discussing so far (and more!) From machine to machine • In courses you’ve had previously, you probably ran R or some other program on the computer on your desk -- The Statistics Department has a number of computers available to you that are more powerful (memory, speed) than those on your desk and with EC2, we can spin up any number of machines to do our bidding • You can navigate between machines by invoking the command ssh (secure shell) and move files around using scp (secure copy, example shortly)... 
% ssh [email protected] Moving data • You can also move data to your own machine (assuming it’s not too big -- at least one of the data sets we’ll consider this term will be too big to budge!) • If the data, for example, are stored on homework.stat202a.org, you can bring them to your desktop with the command • % scp [email protected]:/data/weblogs/access_log.1267056000 . The secure copy command; friends with ssh Copy it to a file of the same name on your local machine in your current working directory Copy the file access_log.1267056000 in the directory /data/weblogs from the computer homework.stat202a.org HTTP access logs • A bit of digging... Commands • pwd, ls, cd • more/less, tail, wc • cut, sort, uniq % cd /data/weblogs % ls % head access_log.1267056000 seminars.stat.ucla.edu 128.97.86.247 - - [24/Feb/2010:23:00:31 -0800] "GET /upcoming_through/2010-03-02.xml HTTP/1.0" 200 3426 "-" "-" ces.stat.ucla.edu 128.97.55.242 - - [24/Feb/2010:23:02:12 -0800] "GET /rss/ HTTP/1.1" 404 1515 "-" "Apple-PubSub/65.11" seminars.stat.ucla.edu 128.97.86.247 - - [24/Feb/2010:23:03:48 -0800] "GET /upcoming_through/2010-03-02.xml HTTP/1.0" 200 3426 "-" "-" seminars.stat.ucla.edu 128.97.86.247 - - [24/Feb/2010:23:03:53 -0800] "GET /upcoming_through/2010-03-02.xml HTTP/1.0" 200 3426 "-" "-" seminars.stat.ucla.edu 164.67.132.221 - - [24/Feb/2010:23:04:37 -0800] "GET /2007-04-10/3:00pm/1260-franz HTTP/1.0" 200 3459 "-" "UCLA%20Google% csrcb.stat.ucla.edu 164.67.132.221 - - [24/Feb/2010:23:04:38 -0800] "GET /robots.txt HTTP/1.0" 200 99 "-" "UCLA%20Google%20Search%20Appliance%20 csrcb.stat.ucla.edu 164.67.132.221 - - [24/Feb/2010:23:04:38 -0800] "GET /projects/ HTTP/1.0" 200 1179 "-" "UCLA%20Google%20Search%20Appliance%2 seminars.stat.ucla.edu 164.67.132.221 - - [24/Feb/2010:23:04:39 -0800] "GET /archive?page=24 HTTP/1.0" 200 4329 "-" "UCLA%20Google%20Search%20Ap seminars.stat.ucla.edu 164.67.132.221 - - [24/Feb/2010:23:04:39 -0800] "GET /2007-02-23/3:00pm/5200-boelter HTTP/1.0" 200 
3560 "-" "UCLA%20Googl seminars.stat.ucla.edu 164.67.132.221 - - [24/Feb/2010:23:04:39 -0800] "GET /1999-03-01/3:00pm/6627-ms HTTP/1.0" 200 3005 "-" "UCLA%20Google%20S % wc access_log.1267056000 1215118 24435535 306343201 access_log.1267056000 % tail access_log.1267056000 forums.stat.ucla.edu 164.67.132.220 - - [03/Mar/2010:23:59:54 -0800] "GET /feed.php?16,102,type=rss HTTP/1.0" 200 1570 "-" "UCLA%20Google%20Sear cran.stat.ucla.edu 164.67.132.219 - - [03/Mar/2010:23:59:55 -0800] "GET /robots.txt HTTP/1.0" 200 98261 "-" "UCLA%20Google%20Search%20Appliance% cran.stat.ucla.edu 164.67.132.219 - - [03/Mar/2010:23:59:55 -0800] "GET /web/checks/check_results_zic.html HTTP/1.0" 200 4965 "-" "UCLA%20Google www.socr.ucla.edu 207.46.199.191 - - [03/Mar/2010:23:59:56 -0800] "GET /robots.txt HTTP/1.1" 200 58 "-" "msnbot/2.0b (+http://search.msn.com/msn www.stat.ucla.edu 67.195.110.174 - - [03/Mar/2010:23:59:57 -0800] "GET /~jkokkin/gray/pb_lrn_bound_1_sharp_2_dc_2_1_nm/253055_counts.mat HTTP/1. directory.stat.ucla.edu 164.67.132.220 - - [03/Mar/2010:23:59:57 -0800] "GET /info.php?directory=443 HTTP/1.0" 200 2905 "-" "UCLA%20Google%20Sea cran.stat.ucla.edu 164.67.132.219 - - [03/Mar/2010:23:59:58 -0800] "GET /robots.txt HTTP/1.0" 200 98261 "-" "UCLA%20Google%20Search%20Appliance% cran.stat.ucla.edu 164.67.132.219 - - [03/Mar/2010:23:59:58 -0800] "GET /web/checks/check_results_xlsReadWrite.html HTTP/1.0" 200 5185 "-" "UCLA www.stat.ucla.edu 67.180.31.126 - - [03/Mar/2010:23:59:58 -0800] "GET /projects/datasets/risk_perception.html HTTP/1.1" 200 5818 "http://www.sta www.stat.ucla.edu 164.67.132.219 - - [03/Mar/2010:23:59:59 -0800] "GET /cgi-bin/man-cgi?ypmatch HTTP/1.0" 404 - "-" "UCLA%20Google%20Search%20Ap Combined log format virtual host IP address Identity Userid date Request Status Bytes Referrer Agent Slicing out a field... For speed, we’ll work with just one day’s worth of logs; look at the file access_log.1267056000 -- what time period does this cover? 
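One way to answer the time-period question is to look at the first and last entries of the log. A sketch, run here on a two-line sample laid out like the combined log format above (the sample entries themselves are hypothetical):

```shell
# A tiny stand-in for the access log, in the same space-delimited layout:
printf '%s\n' \
  'a.stat.ucla.edu 128.97.86.247 - - [24/Feb/2010:23:00:31 -0800] "GET / HTTP/1.0" 200 3426' \
  'b.stat.ucla.edu 164.67.132.219 - - [03/Mar/2010:23:59:59 -0800] "GET / HTTP/1.0" 200 98' \
  > sample_log

# The timestamp is the fifth space-delimited field; the first and
# last lines bracket the period the log covers:
head -1 sample_log | cut -d" " -f5     # [24/Feb/2010:23:00:31
tail -1 sample_log | cut -d" " -f5     # [03/Mar/2010:23:59:59
```

On the real file, substitute access_log.1267056000 for sample_log; head and tail only read the ends of the file, so this is fast even on a large log.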
Many of the built-in Unix tools are designed to digest log files of this kind; in a way, this is a very old dance, one that has been enacted for decades...

Examples • cut -d" " -f1,11 access_log.1267056000 • cut -d" " -f11 access_log.1267056000

In general...
cut -d" " -f1,5,9 -- select the first, fifth and ninth
cut -d" " -f1-5 -- select the first through the fifth
cut -d" " -f1 -- select just the first

Unix pipes • Rather than have the output scroll out on the screen, it would be better (useful, even) to "catch" the output with a more command • Many programs take some kind of input and generate some kind of output; Unix tools often take input from the user and print the output to the screen • "Redirection" of data to and from files or programs is controlled by a construction known as a pipe

% cut -d" " -f1 access_log.1267056000 | more
seminars.stat.ucla.edu
ces.stat.ucla.edu
seminars.stat.ucla.edu
seminars.stat.ucla.edu
seminars.stat.ucla.edu
csrcb.stat.ucla.edu
csrcb.stat.ucla.edu
seminars.stat.ucla.edu
seminars.stat.ucla.edu
seminars.stat.ucla.edu
seminars.stat.ucla.edu
csrcb.stat.ucla.edu
seminars.stat.ucla.edu
seminars.stat.ucla.edu
seminars.stat.ucla.edu
seminars.stat.ucla.edu
seminars.stat.ucla.edu
seminars.stat.ucla.edu
seminars.stat.ucla.edu
seminars.stat.ucla.edu
seminars.stat.ucla.edu
seminars.stat.ucla.edu
seminars.stat.ucla.edu
...
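The field selections above behave exactly as described; a tiny sketch on a toy line, where the letters stand in for log fields:

```shell
# A toy "record" with nine space-delimited fields.
echo "a b c d e f g h i" > fields.txt
cut -d" " -f1,5,9 fields.txt   # a e i      (the first, fifth and ninth)
cut -d" " -f1-5 fields.txt     # a b c d e  (the first through the fifth)
cut -d" " -f1 fields.txt       # a          (just the first)
```

Note that cut prints the selected fields rejoined by the same delimiter, in their original order.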
Sending output to a file with ">" With this form of redirection, we take a stream of processed data and store it in a file -- For example • % cut -d" " -f2 access_log.1267056000 > ~/ips.txt • % ls ~

Taking input from a file with "<" With this form of redirection, we create an input stream from a file -- For example • wc < ~/ips.txt

Tallying • As its name suggests, the command sort will order the rows in our file; by default it uses alphabetical order -- the option -n lets you sort numerically instead • The command uniq will remove repeated adjacent lines; so if your file is sorted, it will return just the unique rows • sort ~/ips.txt > ~/ips_sorted.txt • head ~/ips_sorted.txt • uniq ~/ips_sorted.txt | more

The pipeline We seem to be proliferating files -- one version of the IPs that is raw, one that is sorted, maybe one that has just the unique entries -- we can use the pipe construction to combine many of these steps into one As the name might imply, you can connect pipes and have data stream from process to process

Example • cut -d" " -f2 access_log.1267056000 | sort | uniq | wc • cut -d" " -f2 access_log.1267056000 | sort | uniq -c | sort -rn • cut -d" " -f10 access_log.1267056000 | sort | uniq -c

% cut -d" " -f2 access_log.1267056000 | sort | uniq | wc
28564 28564 404062   # what does this mean?

% cut -d" " -f10 access_log.1267056000 | sort | uniq -c | sort -rn | head -10
933724 200
127624 304
72112 404
46878 206
22145 302
9682 301
2415 403
239 405
208 401
43 416

HTTP Error 101 Switching Protocols. Again, not really an "error", this HTTP Status Code means everything is working fine. HTTP Error 200 Success. This HTTP Status Code means everything is working fine. However, if you receive this message on screen, obviously something is not right... Please contact the server's administrator if this problem persists. Typically, this status code (as well as most other 200 Range codes) will only be written to your server logs.

What are these numbers?
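Before decoding those numbers, one detail in the pipelines above deserves a demonstration: uniq only collapses *adjacent* repeats, which is exactly why sort always precedes it. A small self-contained example:

```shell
# Four lines with repeats, but no repeats adjacent to each other.
printf "b\na\nb\na\n" > demo.txt
uniq demo.txt | wc -l          # 4 lines survive: nothing adjacent collapses
sort demo.txt | uniq | wc -l   # 2 lines: sorting makes duplicates adjacent
sort demo.txt | uniq -c        # with -c, uniq also counts each run
```

The last command is the backbone of all the tallies in this lecture: sort to group, uniq -c to count, then sort -rn to rank by count.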
• A fast Google search gives us a list of possible errors • Note that Error 200 actually means a success

HTTP Error 201 Created. A new resource has been created successfully on the server. HTTP Error 202 Accepted. Request accepted but not completed yet, it will continue asynchronously. HTTP Error 203 Non-Authoritative Information. Request probably completed successfully but can't tell from original server. HTTP Error 204 No Content. The request completed successfully but the resource requested is empty (has zero length). HTTP Error 205 Reset Content. The request completed successfully but the client should clear down any cached information as it may now be invalid.

• Error 206 means that only part of the file was delivered; the user cancelled the request before it could be delivered • Error 304 is "not modified"; sometimes clients perform conditional GET requests

HTTP Error 206 Partial Content. The request was canceled before it could be fulfilled. Typically the user gave up waiting for data and went to another page. Some download accelerator programs produce this error as they submit multiple requests to download a file at the same time. HTTP Error 300 Multiple Choices. The request is ambiguous and needs clarification as to which resource was requested. HTTP Error 301 Moved Permanently. The resource has permanently moved elsewhere, the response indicates where it has gone to. HTTP Error 302 Moved Temporarily. The resource has temporarily moved elsewhere, the response indicates where it is at present. HTTP Error 303 See Other/Redirect. A preferred alternative source should be used at present.

% cut -d" " -f2 access_log.1267056000 | sort | uniq -c | sort -rn | more
130258 128.97.86.247
129354 164.67.132.220
123756 164.67.132.219
83956 164.67.132.221
83205 164.67.132.222
23554 67.195.110.174
23058 128.97.55.186
21205 76.89.146.27
15941 128.97.55.188
12967 128.97.55.185
12365 67.195.114.250
8029 137.110.114.8
7748 208.115.111.248
5256 216.9.17.231
4894 128.97.55.3
4554 66.249.67.236
4057 69.235.86.81
3857 66.249.67.183
3691 67.195.110.173
3622 67.195.113.236
3181 77.88.29.246
3144 74.6.149.154
2825 98.222.49.118
...

The pipeline • In 1972, pipes appear in Unix, and with them a philosophy, albeit after some struggle for the syntax; should it be more(sort(cut))? • [Remember this; S/R has this kind of functional syntax] • The development of pipes led to the concept of tools -- software programs that would be in a tool box, available when you need them

"And that's, I think, when we started to think consciously about tools, because then you could compose things together... compose them at the keyboard and get them right every time." from an interview with Doug McIlroy

Remember, read the man pages! As we did with ls, if the command uniq is unfamiliar, you can look up its usage Example • man uniq • man host

% host 128.97.86.247
247.86.97.128.in-addr.arpa domain name pointer limen.stat.ucla.edu.
% host 164.67.132.220
220.132.67.164.in-addr.arpa domain name pointer gsa2.search.ucla.edu.
% host 67.195.110.174
174.110.195.67.in-addr.arpa domain name pointer b3091077.crawl.yahoo.net.

But what are we really after?
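The host lookups chain naturally onto the tally pipeline. A sketch on a hypothetical three-line sample (the real version would run on access_log.1267056000; the host loop needs DNS access, so it is left commented here):

```shell
# A tiny made-up sample in the log's format; field 2 is the client IP.
cat > mini_log <<'EOF'
www.stat.ucla.edu 128.97.86.247 - - [d] "GET /a HTTP/1.0" 200 1 "-" "-"
www.stat.ucla.edu 128.97.86.247 - - [d] "GET /b HTTP/1.0" 200 1 "-" "-"
www.stat.ucla.edu 164.67.132.220 - - [d] "GET /c HTTP/1.0" 200 1 "-" "-"
EOF
# Rank the IPs by hit count (128.97.86.247 comes out on top with 2 hits).
cut -d" " -f2 mini_log | sort | uniq -c | sort -rn
# Resolving the heaviest hitters to names (commented out: requires DNS):
# cut -d" " -f2 mini_log | sort | uniq -c | sort -rn | head -5 \
#   | awk '{print $2}' | while read ip; do host "$ip"; done
```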
• Rather than splitting up the data in access_log.1267056000 by day, we might consider dividing it by IP -- that is, extract all the hits from a given IP • Once we have such a thing, we can use the command wc to tell us about the number of accesses from each user • We can also start to "fit" user-level models that can be used to predict navigation

Rudimentary pattern matching grep can be used to skim lines from a file that have (or don't have) a particular pattern -- patterns are specified via regular expressions, something we will learn more about later The name of the command comes from an editing operation on Unix: g/re/p

Example • host 98.222.49.118 • grep 98.222.49.118 access_log.1267056000 • grep /~dinov access_log.1267056000

% grep 98.222.49.118 access_log.1267056000 | head -25
www.stat.ucla.edu 98.222.49.118 - - [26/Feb/2010:23:20:37 -0800] "GET /~zyyao/ HTTP/1.1" 200 17692 "http://www.google.com/search?q=z
www.stat.ucla.edu 98.222.49.118 - - [26/Feb/2010:23:20:38 -0800] "GET /~zyyao/ HTTP/1.1" 200 17692 "http://www.google.com/search?q=z
www.stat.ucla.edu 98.222.49.118 - - [26/Feb/2010:23:20:39 -0800] "GET /~zyyao/common.css HTTP/1.1" 200 3762 "http://www.stat.ucla.ed
www.stat.ucla.edu 98.222.49.118 - - [26/Feb/2010:23:20:40 -0800] "GET /favicon.ico HTTP/1.1" 200 318 "-" "Mozilla/5.0 (Windows; U; W
www.stat.ucla.edu 98.222.49.118 - - [26/Feb/2010:23:20:40 -0800] "GET /~zyyao/CV_ZhenyuYao.pdf HTTP/1.1" 200 36901 "http://www.googl
www.stat.ucla.edu 98.222.49.118 - - [26/Feb/2010:23:20:42 -0800] "GET /~zyyao/CV_ZhenyuYao.pdf HTTP/1.1" 206 4352 "-" "Mozilla/5.0 (
www.stat.ucla.edu 98.222.49.118 - - [01/Mar/2010:19:02:35 -0800] "GET /~zyyao/projects/ABSS/all_exp.html HTTP/1.1" 200 525 "-" "Mozi
www.stat.ucla.edu 98.222.49.118 - - [01/Mar/2010:19:02:35 -0800] "GET /favicon.ico HTTP/1.1" 200 318 "-" "Mozilla/5.0 (X11; U; Linux
www.stat.ucla.edu 98.222.49.118 - - [01/Mar/2010:19:02:37 -0800] "GET /%7Ezyyao/projects/ABSS/walk.html HTTP/1.1" 200 22230 "http://
www.stat.ucla.edu 98.222.49.118 - - [01/Mar/2010:19:02:38 -0800] "GET /%7Ezyyao/projects/ABSS/walkpng/000template101.png HTTP/1.1" 2 www.stat.ucla.edu 98.222.49.118 - - [01/Mar/2010:19:02:38 -0800] "GET /%7Ezyyao/projects/ABSS/walkpng/101I1001.png HTTP/1.1" 200 113 www.stat.ucla.edu 98.222.49.118 - - [01/Mar/2010:19:02:38 -0800] "GET /%7Ezyyao/projects/ABSS/walkpng/101I1003.png HTTP/1.1" 200 113 www.stat.ucla.edu 98.222.49.118 - - [01/Mar/2010:19:02:38 -0800] "GET /%7Ezyyao/projects/ABSS/walkpng/101I1003sketch.png HTTP/1.1" 2 www.stat.ucla.edu 98.222.49.118 - - [01/Mar/2010:19:02:38 -0800] "GET /%7Ezyyao/projects/ABSS/walkpng/101I1004.png HTTP/1.1" 200 111 www.stat.ucla.edu 98.222.49.118 - - [01/Mar/2010:19:02:38 -0800] "GET /%7Ezyyao/projects/ABSS/walkpng/101I1004sketch.png HTTP/1.1" 2 www.stat.ucla.edu 98.222.49.118 - - [01/Mar/2010:19:02:38 -0800] "GET /%7Ezyyao/projects/ABSS/walkpng/101I1005.png HTTP/1.1" 200 112 www.stat.ucla.edu 98.222.49.118 - - [01/Mar/2010:19:02:38 -0800] "GET /%7Ezyyao/projects/ABSS/walkpng/101I1005sketch.png HTTP/1.1" 2 www.stat.ucla.edu 98.222.49.118 - - [01/Mar/2010:19:02:38 -0800] "GET /%7Ezyyao/projects/ABSS/walkpng/101I1006.png HTTP/1.1" 200 115 www.stat.ucla.edu 98.222.49.118 - - [01/Mar/2010:19:02:38 -0800] "GET /%7Ezyyao/projects/ABSS/walkpng/101I1006sketch.png HTTP/1.1" 2 www.stat.ucla.edu 98.222.49.118 - - [01/Mar/2010:19:02:38 -0800] "GET /%7Ezyyao/projects/ABSS/walkpng/101I1007.png HTTP/1.1" 200 112 www.stat.ucla.edu 98.222.49.118 - - [01/Mar/2010:19:02:39 -0800] "GET /%7Ezyyao/projects/ABSS/walkpng/101I1007sketch.png HTTP/1.1" 2 www.stat.ucla.edu 98.222.49.118 - - [01/Mar/2010:19:02:39 -0800] "GET /%7Ezyyao/projects/ABSS/walkpng/101I1008.png HTTP/1.1" 200 113 www.stat.ucla.edu 98.222.49.118 - - [01/Mar/2010:19:02:39 -0800] "GET /%7Ezyyao/projects/ABSS/walkpng/101I1008sketch.png HTTP/1.1" 2 www.stat.ucla.edu 98.222.49.118 - - [01/Mar/2010:19:02:39 -0800] "GET /%7Ezyyao/projects/ABSS/walkpng/101I1009.png HTTP/1.1" 200 109 
www.stat.ucla.edu 98.222.49.118 - - [01/Mar/2010:19:02:39 -0800] "GET /%7Ezyyao/projects/ABSS/walkpng/101I1009sketch.png HTTP/1.1" 2

% cut -d" " -f12 access_log.1267056000 | grep google | head -25
"http://www.google.com/search?hl=en&newwindow=1&q=filled+%22rounded+box%22+image+beamer&aq=f&aqi=&aql=&oq="
"http://www.google.com/imgres?imgurl=http://scc.stat.ucla.edu/page_attachments/0000/0064/map.jpg&imgrefurl=http://scc.s
"http://www.google.com.ph/search?client=firefox-a&rls=org.mozilla%3Aen-US%3Aofficial&channel=s&hl=tl&source=hp&q=HISTOR
"http://www.google.com/search?sourceid=navclient&ie=UTF-8&rlz=1T4SNNT_enUS355US355&q=SOCR+distributions"
"http://www.google.com/search?sourceid=navclient&ie=UTF-8&rlz=1T4SNNT_enUS355US355&q=SOCR+distributions"
"http://translate.google.gr/translate?hl=el&langpair=en%7Cel&u=http://www.stat.ucla.edu/history/"
"http://translate.googleusercontent.com/translate_c?hl=el&langpair=en%7Cel&u=http://www.stat.ucla.edu/history/&rurl=tra
"http://translate.googleusercontent.com/translate_c?hl=el&langpair=en%7Cel&u=http://www.stat.ucla.edu/history/&rurl=tra
"http://translate.googleusercontent.com/translate_c?hl=el&langpair=en%7Cel&u=http://www.stat.ucla.edu/history/index_hea
"http://www.google.com/search?q=socr&rls=com.microsoft:ko:IE-SearchBox&ie=UTF-8&oe=UTF-8&sourceid=ie7&rlz=1I7HPAB_ko"
"http://translate.googleusercontent.com/translate_c?hl=el&langpair=en%7Cel&u=http://www.stat.ucla.edu/history/index_bod
"http://www.google.co.in/search?q=Lienhart+ICME+2003+face&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:en-US:official&client=
"http://www.google.com.my/search?hl=en&client=firefox-a&channel=s&rls=org.mozilla:en-US:official&q=markov+chain+monte+c
"http://www.google.com/search?q=stata+goodness+of+fit+test&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:en-US:official&client
"http://translate.googleusercontent.com/translate_c?hl=zh-CN&sl=en&u=http://www.stat.ucla.edu/~yuille/courses/Stat202C/
"http://www.google.lt/url?sa=t&source=web&ct=res&cd=1&ved=0CAYQFjAA&url=http%3A%2F%2Fwww.stat.ucla.edu%2F~bk%2F&rct=j&q
"http://www.google.nl/search?hl=nl&source=hp&q=f+distribution+table&meta=&aq=1&oq=F+distrib"
"http://www.google.com.my/search?q=case+in+suicide&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:en-US:official&client=firefox
"http://www.google.com/url?sa=t&source=web&ct=res&cd=9&ved=0CCgQFjAI&url=http%3A%2F%2Fwww.stat.ucla.edu%2F~sczhu%2FProj
"http://www.google.com/search?client=firefox-a&rls=org.mozilla%3Aen-US%3Aofficial&channel=s&hl=en&source=hp&q=t-statist
"http://www.google.co.in/search?hl=en&q=Genetic+algorithm+tutorial&meta=&aq=o&oq="
"http://www.google.com/search?hl=en&client=firefox-a&rls=org.mozilla:en-US:official&q=an+introduction+to+stochastic+mod
"http://www.google.com/search?hl=en&client=firefox-a&hs=YlE&rls=org.mozilla%3Aen-US%3Aofficial&q=dinov+ucla+stats&aq=f&
"http://www.google.com/search?hl=en&lr=&client=safari&rls=en&q=the+reading+on+a+voltage+meter+connected+to+a+test+circu
"http://www.google.co.jp/search?hl=ja&q=Ryacas%E3%80%80r-commander+install&btnG=%E6%A4%9C%E7%B4%A2&lr=&aq=f&oq="

A puzzle • In working through these notes, you might have noticed that I've been protecting you a little from some nastiness -- Suppose, for example, we decided to use the cut command to extract the 10th field in each line of the logfile and look at just the first 10 lines of output • The code 200 is by far the most frequent, with second place going to code 304, but by looking at the first 10 lines I intentionally hid information from you...

count	code
933724	200
127624	304
72112	404
46878	206
22145	302
9682	301
2415	403
239	405
208	401
43	416

A puzzle • Looking at the full output from the command and not just the first 10 lines, you see something a bit different; below we present that table • What else do you notice?
count	code
933724	200
127624	304
72112	404
46878	206
22145	302
9682	301
2415	403
239	405
208	401
43	416
29	400
8	HTTP/1.0"
3	prep.R
3	HTTP/1.1"
2	1103
1	Post-rej/project_II.htm
1	500
1	2008

A puzzle • About 18 of the hits in this file are logged in such a way that the 10th field (as defined by a space) is not the code we were after • We've made an assumption about the data that isn't correct, and the result shows up when we have a large enough sample -- to chase down the problem, we extracted just the log line referring to a request for project_II.htm • So, what happened?

www.stat.ucla.edu 208.115.111.248 - - [25/Feb/2010:07:23:49 -0800] "GET /~wxl/research/proteomic/_Project_Rej vs Post-rej/project_II.htm HTTP/1.1" 404 - "-" "Mozilla/5.0 (compatible; DotBot/1.1; http://www.dotnetdotcom.org/, [email protected])"

Backstory • We specify web pages using a Uniform Resource Locator (URL); for example, the page http://www.stat.ucla.edu/research will be interpreted as a request for /research from the host www.stat.ucla.edu • The request and response are handled via a communication protocol called HTTP (or the Hypertext Transfer Protocol); you will see two versions of HTTP in our log files (HTTP/1.0, which opens a new "connection" per request, and HTTP/1.1, which can reuse a connection) • HTTP is the protocol by which clients (your web browser, or in the parlance of your logs, a user agent) make requests from one or more servers (computers hosting web sites; also called HTTP servers) and receive data in return (web pages, media items, and so on)

Backstory • In addition to its particular format, a URL can be made up of only certain classes of symbols taken from the US ASCII* character set -- alphanumeric characters, some special characters $-_.+!*() and some reserved characters that have their own special purpose • I don't want to go into the specifics of naming here, as we will likely take it up in earnest later, except to say that certain characters cannot appear in
a URL or should only appear with a special meaning (like / to specify the path) • When you need to include a special character, it needs to be URL encoded; that is, it is replaced with a numerical equivalent... • *The American Standard Code for Information Interchange

URL Encoding • URL encoding replaces characters with their ASCII hexadecimal equivalent, prefaced by a %; recall that the hexadecimal number system represents values in "base" 16 -- that is, it uses the sixteen symbols 0,1,...,9,A,B,C,D,E,F • Two hexadecimal digits are equivalent to 8 binary digits (8 bits or 1 byte) and represent an integer between 0 and 255 -- we will have more to say about data representation next time • For now, it's enough to know that %20 is the URL encoding of the space " "; does this bring us any closer to understanding the three lines a few slides back?

Puzzle solved • Notice that the first line of the five has spaces in the URL that was requested -- a " " that was not URL encoded -- and this means that counting spaces is not really the same as identifying items from the log • Technically, your browser takes care of this encoding (try typing in a URL with a space in it); it's worth noting that the request came from a crawler, a robot that (presumably) doesn't have this check in place

% grep Post-rej/project_II.htm access_log.* | head -5
access_log.1267056000:www.stat.ucla.edu 208.115.111.248 - - [25/Feb/2010:07:23:49 -0800] "GET /~wxl/research/proteomic/_Project_Rej vs Post-rej/projec
access_log.1267660800:www.stat.ucla.edu 67.195.110.174 - - [06/Mar/2010:04:29:33 -0800] "GET /%7Ewxl/research/proteomic/_Project_Rej%20vs%20Post-rej/p
access_log.1267660800:www.stat.ucla.edu 206.53.226.235 - - [10/Mar/2010:08:33:20 -0800] "GET /~wxl/research/proteomic/_Project_Rej%20vs%20Post-rej/pro
access_log.1267660800:www.stat.ucla.edu 206.53.226.235 - - [10/Mar/2010:08:33:21 -0800] "GET /~wxl/research/proteomic/_Project_Rej%20vs%20Post-rej/lig
access_log.1267660800:www.stat.ucla.edu
206.53.226.235 - - [10/Mar/2010:08:33:28 -0800] "GET /~wxl/research/proteomic/_Project_Rej%20vs%20Post-rej/oth

The lesson • This explains just one of the handful of errors we encountered in this one file, but the basic problem is the same -- we are relying on spaces to define a pattern for us, but this is not a completely reliable way of identifying the content we're after • Instead, we need a more elaborate language for specifying patterns; today we'll see an extremely powerful construction known as regular expressions • (Note: This material is perhaps slightly out of time order -- I decided to present it now to get us out of this puzzle -- and next time we will take a small step back and talk about data formats like CSV, XML, JSON and the relational model)

Quien es mas macho? • In online marketing, hits rule the roost; the more raw traffic you attract, the greater your opportunities for making a sale • Alright, so it's not as simple as that, but let's see what we can learn about centers of activity on our website

Quien es mas macho? • Compute the number of hits to the portions of the site owned by SongChun Zhu, Vivian Lew, Jan deLeeuw and Ivo Dinov • Who received the most hits in August? In March? • What can you say about the kinds of files that were downloaded? • What was the most popular portion of each site?

Then... • Pull back a little and tell me about the site in general and the habits of its visitors; specifically, think about • When is the site active? When is it quiet? • Do the visitors stay for very long? Do they download any of our papers or software? What applications do they run? • On balance, is our traffic "real" or mostly the result of robots or automated processes?
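For the per-person tallies, one hedged starting point is grep -c, which counts matching lines. The directory names below are assumptions about how each person's pages are addressed; check them against the real paths first. And, as the puzzle just showed, /~name can also arrive URL encoded as /%7Ename, so a careful count should catch both forms. Demonstrated here on a tiny made-up sample:

```shell
# Three hypothetical log lines; two hit ~dinov (one URL encoded), one hits ~sczhu.
cat > hits_log <<'EOF'
www.stat.ucla.edu 1.2.3.4 - - [d] "GET /~dinov/a.html HTTP/1.1" 200 1 "-" "-"
www.stat.ucla.edu 1.2.3.4 - - [d] "GET /%7Edinov/b.html HTTP/1.1" 200 1 "-" "-"
www.stat.ucla.edu 5.6.7.8 - - [d] "GET /~sczhu/c.html HTTP/1.1" 200 1 "-" "-"
EOF
for who in sczhu dinov; do
  # -E enables alternation, so the pattern matches the literal and the
  # %7E-encoded spelling of the home directory; -c counts matching lines.
  printf "%s %s\n" "$who" "$(grep -Ec "GET /(~|%7E)$who/" hits_log)"
done
```

On the real file you would swap hits_log for access_log.1267056000 and list all four names in the loop.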
Finally • Unix tools can be incredibly powerful; they will be much faster than most other options for what they're good at • That said, there are plenty of kinds of analysis that are not workable with collections of simple Unix facilities; as you do your homework, keep track of what seems hard and why • Our programming resources will keep getting more and more elaborate as the quarter progresses; this is just a first step