Google 201, April 2004

Important Information
• This presentation was created by
Patrick Crispen.
• You are free to reuse this
presentation provided that you
– Do not make any money from this
presentation.
– Give credit where credit is due.
1
Google 201:
“Advanced Googology”
a presentation by
Patrick Douglas Crispen
2
Our Goals
• Learn how Google really works.
• Discover some Google secrets no one
ever tells you.
• Play around with some of Google’s
advanced search operators.
• Find out where to get more
Google-related help and information.
• DO ALL OF THIS IN ENGLISH!
3
PART ONE:
How Google REALLY
Works
Or, at least, how I think
Google really works.
4
One Word of Warning
• For obvious reasons, the folks at Google
would rather the Wizard of Oz
stay behind the curtain, so to speak.
• So, what you are about to see on the next
few slides are just plain guesses on my
part.
• And, my guesses are probably
completely wrong! But they’re ‘pretty’.
And that’s all that matters.
5
Another Word of Warning
• I also need to warn you that my guesses
use a little bit of algebra, but I promise it
is simple algebra.
– Well, there is one intimidating-looking
equation, but we’ll get to that in a bit.
• Just remember that, in this case, X > Y >
Z, and there can be different values for
each variable (X1 > X2 > … > Xn).
• I’ve lost you already, haven’t I?
6
How Google Works - Phrases
Image source: Google
• When you search for
multiple keywords,
Google first searches
for all of your
keywords as a phrase.
I think.
• So, if your keywords
are disney
fantasyland
pirates, any pages
on which those words
appear as a phrase
receive a score of X.
7
Source: Google Hacks, p. 21
How Google Works - Adjacency
• Google then
measures the
adjacency between
your keywords and
gives those pages
a score of Y.
• What does this
mean in English?
Well …
Image source: Google
8
Source: Google Hacks, p. 21
How Adjacency Works
A page that says
“My favorite Disney attraction, outside of
Fantasyland, is Pirates of the Caribbean”
will receive a higher adjacency score than a page
that says
“Walt Disney was both a genius and a
taskmaster. The team at WDI spent many
sleepless nights designing Fantasyland. But
nothing could compare to the amount of
Imagineering work required to create Pirates
of the Caribbean.”
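One rough way to put a number on adjacency is to find the shortest run of words that contains every keyword. The Python sketch below does that for the two example pages. The function names and the "shortest window" measure are assumptions made for illustration; nothing here is a published Google method.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into words."""
    return re.findall(r"[a-z0-9']+", text.lower())


def smallest_span(text, keywords):
    """Return the length, in words, of the shortest window that contains
    every keyword, or None if some keyword is missing.
    (Hypothetical adjacency measure, for illustration only.)"""
    words = tokenize(text)
    wanted = {k.lower() for k in keywords}
    best = None
    for start, word in enumerate(words):
        if word not in wanted:
            continue
        seen = set()
        # Walk forward from this keyword until all keywords have been seen.
        for end in range(start, len(words)):
            if words[end] in wanted:
                seen.add(words[end])
            if seen == wanted:
                length = end - start + 1
                if best is None or length < best:
                    best = length
                break
    return best


close_page = ("My favorite Disney attraction, outside of Fantasyland, "
              "is Pirates of the Caribbean")
far_page = ("Walt Disney was both a genius and a taskmaster. The team at WDI "
            "spent many sleepless nights designing Fantasyland. But nothing "
            "could compare to the amount of Imagineering work required to "
            "create Pirates of the Caribbean.")
keywords = ["disney", "fantasyland", "pirates"]

close_span = smallest_span(close_page, keywords)
far_span = smallest_span(far_page, keywords)
print(close_span, far_span)
```

A smaller span means the keywords sit closer together, so the first page would earn the higher adjacency score under this measure.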
9
How Google Works - Weights
• Then, Google measures the number
of times your keywords appear on
the page (the keywords’ “weights”)
and gives those pages a score of Z.
• A page that has the word disney
four times, fantasyland three times,
and pirates seven times would
receive a higher weights score than a
page that only has those words once.
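Counting keyword "weights" is easy to sketch in Python. The helper names below are hypothetical, and simply adding the counts together is an assumption; the slides only say that more occurrences mean a higher score.

```python
from collections import Counter
import re


def keyword_weights(text, keywords):
    """Count how many times each keyword appears in the text."""
    counts = Counter(re.findall(r"[a-z0-9']+", text.lower()))
    return {k: counts[k.lower()] for k in keywords}


def weight_score(text, keywords):
    """Assumed scoring rule: add up the individual keyword counts."""
    return sum(keyword_weights(text, keywords).values())


keywords = ["disney", "fantasyland", "pirates"]
heavy_page = "disney " * 4 + "fantasyland " * 3 + "pirates " * 7
light_page = "disney fantasyland pirates"

print(keyword_weights(heavy_page, keywords))
print(weight_score(heavy_page, keywords), weight_score(light_page, keywords))
```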
Source: Google Hacks, p. 21
10
You Still
With Me?
11
Putting it All Together
• Google takes
– The phrase hits (the Xs),
– The adjacency hits (the Ys),
– The weights hits (the Zs), and
– About 100 other secret variables
• Throws out everything but the top 2,000
• Multiplies each remaining page’s individual
score by its “PageRank”
• And, finally, displays the top 1,000 in
order.
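The guessed pipeline above can be sketched in a few lines of Python. Everything here is an assumption for illustration: the field names are made up, the X, Y, and Z scores are simply added together, and the "100 other secret variables" are left out.

```python
def rank_pages(pages, shortlist_size=2000, display_size=1000):
    """pages is a list of dicts with the hypothetical keys
    'url', 'phrase', 'adjacency', 'weights', and 'pagerank'."""
    # Step 1: combine the phrase, adjacency, and weights scores (assumed: a sum).
    scored = [(p["phrase"] + p["adjacency"] + p["weights"], p) for p in pages]
    # Step 2: throw out everything but the top scores.
    scored.sort(key=lambda item: item[0], reverse=True)
    shortlist = scored[:shortlist_size]
    # Step 3: multiply each remaining score by the page's PageRank.
    final = [(score * p["pagerank"], p["url"]) for score, p in shortlist]
    # Step 4: display the top results in order.
    final.sort(reverse=True)
    return [url for _, url in final[:display_size]]


pages = [
    {"url": "a", "phrase": 5, "adjacency": 3, "weights": 2, "pagerank": 1},
    {"url": "b", "phrase": 4, "adjacency": 3, "weights": 1, "pagerank": 5},
    {"url": "c", "phrase": 3, "adjacency": 2, "weights": 2, "pagerank": 100},
]
# Tiny limits so the cut-off is visible with only three pages.
results = rank_pages(pages, shortlist_size=2, display_size=2)
print(results)
```

Note that page "c" never shows up, even with a huge PageRank, because it was cut before PageRank was applied. If the guess about the order of the steps is right, a popular page still has to match your keywords well.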
12
PageRank?
• There is a premise in higher education that the
importance of a research paper can be judged by
the number of citations the paper has from
other research papers.
• Google simply applies this premise to the Web:
the importance of a Web page can be judged by
the number of hyperlinks pointing to it …
from other pages.
• Or, to put it mathematically [brace yourself – the
next slide contains the intimidating-looking
equation I warned you about] …
13
Source: Google Hacks, p. 294
The PageRank Algorithm
$$PR(A) = (1 - d) + d \left( \frac{PR(T_1)}{C(T_1)} + \cdots + \frac{PR(T_n)}{C(T_n)} \right)$$
Where
• PR(A) is the PageRank of Page A
• PR(T1) is the PageRank of page T1
• C(T1) is the number of outgoing links from the page
T1
• d is a damping factor in the range of 0 < d < 1,
usually set to 0.85
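Because every page's PageRank depends on the PageRank of the pages linking to it, the equation is usually solved by repeating the calculation until the numbers settle down. The Python sketch below does that for a made-up three-page Web. The graph, the starting value of 1.0, and the fixed number of iterations are all assumptions chosen for illustration.

```python
def pagerank(links, d=0.85, iterations=50):
    """links maps each page to the list of pages it links to.
    Applies PR(A) = (1 - d) + d * sum(PR(T) / C(T)) repeatedly."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    pr = {page: 1.0 for page in pages}  # assumed starting value
    for _ in range(iterations):
        new_pr = {}
        for page in pages:
            # Add up PR(T) / C(T) for every page T that links to this page.
            incoming = sum(pr[src] / len(targets)
                           for src, targets in links.items()
                           if page in targets)
            new_pr[page] = (1 - d) + d * incoming
        pr = new_pr
    return pr


# Hypothetical mini-Web: A links to B and C, B links to C, C links to A.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(links)
for page in sorted(ranks):
    print(page, round(ranks[page], 3))
```

In this toy graph, page C has two incoming links and page B has only one, so C ends up with the higher PageRank.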
14
Source: Google Hacks, p. 295
You Can Start Breathing Again
• I promise there are no more equations in
this presentation.
• I just wanted to show you that the
PageRank of a Web page is (roughly) the sum of
the PageRanks of all the pages linking to it
divided by the number of links on each of
those pages.
– A page with a lot of (incoming) links to it is
deemed to be more important than a page
with only a few links to it.
– A page with few (outgoing) links to other
pages is deemed to be more important than a
page with links to lots of other pages.
15
Source: Google Hacks, p. 295
PART ONE:
Summary
• Google first searches for your keywords as a
phrase and gives those hits a score of X.
• Google then searches for keyword adjacency and
gives those hits a score of Y.
• Google then looks for keyword weights and gives
those hits a score of Z.
• Google combines the Xs, the Ys, the Zs, and a
whole bunch of unknown variables, and then
weeds out all but the top 2,000 scores.
• Finally, Google takes the top 2,000 scores,
multiplies each by its respective PageRank, and
displays the top 1,000.
• I think.
16
PART TWO:
More Stuff No One Tells You
Google’s shocking secrets
revealed!
17
Google’s Boolean Default is
AND
But there are ways to get
around that.
18
Boolean Default is AND
• If you search for more than one keyword
at a time, Google will automatically search
for pages that contain ALL of your
keywords.
• A search for disney fantasyland
pirates is the same as searching for
disney AND fantasyland AND pirates
• But, if you try to use AND on your own,
Google yells at you.
19
Source: http://www.google.com/help/basics.html
“ PHRASES ”
• To search for phrases, just put your
phrase in quotes.
• For example, disney fantasyland
“pirates of the caribbean”
– This would show you all the pages in Google’s
index that contain the word disney AND the
word fantasyland AND the phrase pirates
of the caribbean (without the quotes)
• By the way, while this search is technically
perfect, my choice of keywords contains a
(deliberate) factual mistake. Can you spot
it?
20
Source: http://www.google.com/help/refinesearch.html
Arr, There She Blows!
• Pirates of the
Caribbean
- isn’t in Fantasyland,
- it’s in Adventureland
in Orlando and New
Orleans Square in
Anaheim.
• So searching for
disney AND
fantasyland AND
“pirates of the
caribbean” probably
isn’t a good idea.
Image source: http://www.balgavy.at/
21
Boolean OR
• Sometimes the default AND gets in the
way. That’s where OR comes in.
• The Boolean operator OR is always in
all CAPS and goes between keywords.
• For example, an improvement over our
earlier search would be disney
fantasyland OR “pirates of the
caribbean”
– This would show you all the pages in Google’s
index that contain the word disney AND the
word fantasyland OR the phrase pirates of
the caribbean (without the quotes)
22
Source: http://www.google.com/help/refinesearch.html
Three Ways to OR at Google
• Just type OR between keywords
– disney fantasyland OR
“pirates of the caribbean”
• Put your OR statement in parentheses
– disney (fantasyland OR
“pirates of the caribbean”)
• Use the | (“pipe”) character in place of the word
OR
– disney (fantasyland |
“pirates of the caribbean”)
• All three methods yield the exact same results.
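To see what "AND by default, OR on request" means, here is a small Python sketch that tests a few made-up pages against disney (fantasyland OR "pirates of the caribbean"). The matches function and the sample pages are hypothetical, and plain substring matching is a crude stand-in for whatever Google really does.

```python
def matches(page_text, required, any_of):
    """True if the page contains every term in required (the default AND)
    and at least one term in any_of (the OR group)."""
    text = page_text.lower()
    has_all = all(term.lower() in text for term in required)
    has_one = any(term.lower() in text for term in any_of)
    return has_all and has_one


pages = {
    "a": "Disney fans love Fantasyland.",
    "b": "Disney built Pirates of the Caribbean in New Orleans Square.",
    "c": "Pirates of the Caribbean is a movie.",
}
hits = [name for name, text in pages.items()
        if matches(text, ["disney"], ["fantasyland", "pirates of the caribbean"])]
print(hits)
```

Page "c" is left out because it never mentions disney, which is still required.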
Source: Google Hacks, p. 3
23
OR, She Blows!
• Just remember,
Google’s Boolean
default is AND
• Sometimes the
default AND gets in
the way. That’s
where OR comes in.
Image source: http://www.phil-sears.com/
24
Capitalization
Does NOT Matter
The old AltaVista trick of
typing your keywords in lower
case is no longer necessary.
25
How Insensitive!
• Google is not case sensitive.
• So, the following searches all yield
exactly the same results:
disney
Disney
DISNEY
DiSnEy
fantasyland
Fantasyland
FANTASYLAND
FaNtAsYlAnD
pirates
Pirates
PIRATES
pIrAtEs
26
Source: http://www.google.com/help/basics.html
Google Has a Hard Limit of
10 Keywords
Bet you didn’t know THAT!
Source: Google Hacks, p. 19
27
Google’s 10 Word Limit
• Google won’t accept more than 10
keywords at a time.
• Any keyword past 10 is simply
ignored.
• How can you get around this limit?
Well, first you need to remember
that …
Source: Google Hacks, p. 19
28
Google Ignores a BUNCH
of Common Words
Words to avoid
29
Stop Words
To enhance the
speed and
relevancy of your
Web search,
Google routinely
and automatically
ignores common
words and
characters known
as “stop words.”
30
Source: http://www.google.com/press/guide/reviewguide_7.html
Stop, _ _ Name _ Love
• This is certainly not a canonical list, but
here are 28 stop words I know about.
• a, about, an, and, are, as, at, be, by,
from, how, i, in, is, it, of, on, or, that, the,
this, to, we, what, when, where, which,
with
• You can force Google to search for a stop
word by putting a + in front of it (for
example pirates +of +the caribbean)
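Here is a small Python sketch of how stop words might be dropped from a query, using the 28 words listed above. The function name is made up, and the rule that a leading + keeps a word is taken from the slide, not from any Google source code.

```python
STOP_WORDS = {
    "a", "about", "an", "and", "are", "as", "at", "be", "by", "from",
    "how", "i", "in", "is", "it", "of", "on", "or", "that", "the",
    "this", "to", "we", "what", "when", "where", "which", "with",
}


def strip_stop_words(query):
    """Return the words of the query that would actually be searched."""
    kept = []
    for word in query.split():
        if word.startswith("+"):
            kept.append(word)  # a leading + forces the word to be kept
        elif word.lower() not in STOP_WORDS:
            kept.append(word)
    return kept


print(strip_stop_words("pirates of the caribbean"))
print(strip_stop_words("pirates +of +the caribbean"))
```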
31
Source: 10/23/02 post by Bill Todd to news:google.public.support.general
Dealing with the 10 Word Limit
• Omit the stop
words in your
search terms and
you’ll probably
never run into the
10 word limit.
• Another way
around the limit is
to use wildcards.
Image source: http://www.alloyd.com/
32
Google DOES Support
Wildcard Searches … Sort Of.
When you wish upon a *.
33
Wildcards
• Wildcards are characters, usually asterisks
(*), that represent other characters.
• For example, some search engines
support a technique called “stemming”
– With stemming, you search for something like
pirate* and the search engine shows you all
the pages in its database that contain variants
of the word pirate – pirates, pirated, etc.
• But, did you notice I said …
“some search engines?”
34
Google and Wildcards
• Google doesn’t support stemming.
• Rather, Google offers full-word wildcards.
• For example, if you search Google for it’s
+a * world, Google shows you all of the
pages in its database that contain the
phrase “it’s a small world” … and “it’s a
nano world” … and “it’s a Linux world” …
and so on.
Source: Google Hacks, p. 37
35
it’s +a * world
Image source: http://themeparksource.com/
• The + before a is
required because it is
a stop word and would
otherwise be ignored.
• Most of the hits are
phrases because
that’s what Google
looks for first.
• Oh, and I defy you to
get that song out of
your head!
36
Wildcards
and the Word Limit
• Remember when I said that one way to
get around the 10 word limit was to use
wildcards?
• Google doesn’t count wildcards toward the
limit.
• For example, Google thinks that though *
mountains divide * * oceans * wide
it's * small world after all is
exactly 10 words long.
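You can check that count yourself. This Python sketch counts the words in a query while skipping the * wildcards. The function name is hypothetical, and splitting on spaces is an assumption about how words are counted.

```python
def count_keywords(query):
    """Count the words in a query, ignoring full-word * wildcards."""
    return sum(1 for word in query.split() if word != "*")


query = ("though * mountains divide * * oceans * wide "
         "it's * small world after all")
print(len(query.split()), count_keywords(query))
```

The query has 15 pieces in all, but only 10 of them count toward the limit.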
Source: Google Hacks, p. 19
37
The Order of Your
Keywords Matters
A me life for pirate’s?
38
How Google Works
• When you conduct
a search at Google,
it searches for
– Phrases,
then
– Adjacency, then
– Weights.
Image source: Google
• Because Google
searches for
phrases first, the
order of your
keywords matters.
39
Source: Google Hacks, p. 20-22
For Example
A search for disney
fantasyland
pirates yields the
same number of hits
as a search for
fantasyland
disney pirates,
but the order of
those hits –
especially the first
10 – is noticeably
different.
40
PART TWO:
Summary
• Google’s Boolean default is AND.
• Capitalization does not matter.
• Google has a hard limit of
10 keywords.
• Google ignores a BUNCH of
common words.
• Google does support
wildcard searches … sort of.
• The order of your keywords matters.
41
PART THREE:
Advanced Search Operators
Beyond plusses, minuses,
ANDs, ORs, quotes, and *s
42
How Google Finds New Pages
• Google has special
programs called
spiders (a.k.a.
“Google bots”) that
constantly search the
Internet looking for
new or updated Web
pages.
• When a spider finds a
new or updated page,
it reads that entire
page, reports back to
Google, and then visits
all of the other pages
to which that new page
links.
Image source: http://www.disobey.com/
43
“Paging Miss Muffet”
• When the spider reports back to
Google, it doesn’t just tell Google the
new or updated page’s URL.
• The spider also sends Google a
complete copy of the entire Web
page – HTML, text, images, etc.
• Google then adds that page and all
of its content to Google’s cache.
44
So What?
• When you search Google, you’re actually
searching Google’s cache of Web pages.
• And because of this, you can search for
more than text or phrases in the body of a
Web page.
• Google has some secret, advanced search
operators that let you search specific parts
of Web pages or specific types of
information.
Source: Google Hacks, p. 5
45
Advanced Operators
Query modifiers
• daterange:
• filetype:
• inanchor:
• intext:
• intitle:
• inurl:
• site:
Alternative query types
• cache:
• link:
• related:
• info:
Other information needs
• phonebook:
• stocks:
• define:
• Google Calculator
46
Query Modifiers
Stuff you can add to the end of
your regular searches
47
daterange:
• daterange: limits
your search to a
particular date or
range of dates that a
page was indexed by
Google.
• daterange: only
works with Julian
dates, so you’ll need
to find a Julian date
converter online.
• The Julian date must
be an integer
(no decimals.)
Source: Google Hacks, p. 6
48
daterange:start-stop
pirates daterange:2452401-2452766
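If you would rather not hunt for an online converter, a few lines of Python can produce the integer Julian date. The sketch below relies on the fact that Python's date.toordinal() plus 1721425 gives the Julian day number for a calendar date. The function names are made up for this example.

```python
from datetime import date

# Offset between Python's proleptic Gregorian ordinal and the Julian day number.
JDN_OFFSET = 1721425


def julian_day(d):
    """Return the integer Julian day number for a calendar date."""
    return d.toordinal() + JDN_OFFSET


def daterange(start, stop):
    """Build a daterange: operator from two calendar dates."""
    return f"daterange:{julian_day(start)}-{julian_day(stop)}"


query = "pirates " + daterange(date(2002, 5, 6), date(2003, 5, 6))
print(query)
```

Run as written, this reproduces the example query above, which means that example covers May 6, 2002 through May 6, 2003.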
49
filetype:
• filetype: restricts
your results to files
ending in ".doc" (or
.xls, .ppt, etc.), and
shows you only files
created with the
corresponding
program.
• There can be no space
between filetype:
and the file extension
• The “dot” in the file
extension – .doc – is
optional.
50
Source: http://www.google.com/help/faq_filetypes.html
Google’s Official Filetypes
• Adobe Portable
Document Format
(pdf)
• Adobe PostScript (ps)
• Lotus 1-2-3 (wk1,
wk2, wk3, wk4, wk5,
wki, wks, wku)
• Lotus WordPro (lwp)
• MacWrite (mw)
• Microsoft Excel (xls)
• Microsoft PowerPoint
(ppt)
• Microsoft Word (doc)
• Microsoft Works (wks,
wps, wdb)
• Microsoft Write (wri)
• Rich Text Format (rtf)
• Text (ans, txt)
51
Source: http://www.google.com/help/faq_filetypes.html
filetype:extension
pirates filetype:pdf
pirates -filetype:pdf
52
inanchor:
• Using inanchor:
restricts the results to
text in a page’s link
anchors.
• There can be no space
between inanchor:
and the following
word.
• You can also search
for phrases. Just put
your phrase in quotes.
53
Source: http://www.google.com/help/operators.html
Link Anchor Text?
…
<body>
<p>Pirates of the Caribbean
opened March 18, 1967.</p>
<p>Please <a
href="guestbook.html">sign our
guestbook</a></p>
</body>
…
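To make "link anchor text" concrete, the Python sketch below pulls the anchor text out of that snippet using the standard library's html.parser module. The class name is hypothetical, and this only illustrates which part of the page inanchor: looks at, not how Google extracts it.

```python
from html.parser import HTMLParser


class AnchorTextParser(HTMLParser):
    """Collect the text found between <a ...> and </a> tags."""

    def __init__(self):
        super().__init__()
        self.in_anchor = False
        self.anchors = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.in_anchor = True
            self.anchors.append("")

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_anchor = False

    def handle_data(self, data):
        if self.in_anchor:
            self.anchors[-1] += data


html = """<body>
<p>Pirates of the Caribbean opened March 18, 1967.</p>
<p>Please <a href="guestbook.html">sign our
guestbook</a></p>
</body>"""

parser = AnchorTextParser()
parser.feed(html)
# Collapse line breaks and extra spaces inside each anchor.
anchor_texts = [" ".join(text.split()) for text in parser.anchors]
print(anchor_texts)
```

Only "sign our guestbook" is anchor text. The sentence about Pirates of the Caribbean is body text, so inanchor:pirates would not match this page.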
54
inanchor:terms
inanchor:guestbook
pirates -inanchor:"walt disney"
55
intext:
• intext: ignores link
text, URLs, and titles,
and only searches
body text.
• intext: helps you
find query words that
are too common in
URLs and links.
• There can be no space
between intext: and
the following word.
• You can also search
for phrases. Just put
your phrase in quotes.
Source: Google Hacks, p. 5
56
intext:terms
intext:disney
pirates -intext:"disney.com"
57
intitle:
• Using intitle:
restricts the results to
documents containing
a particular word in their
titles.
• There can be no space
between intitle:
and the following
word.
• You can also search
for phrases. Just put
your phrase in quotes.
58
Source: http://www.google.com/help/operators.html
Title?
<!DOCTYPE HTML PUBLIC "-//W3C//DTD
HTML 4.01 Transitional//EN">
<html>
<head>
<title>
Pirates of the Caribbean
</title>
</head>
<body> ...
59
intitle:terms
intitle:pirates
pirates -intitle:"walt disney"
60
A Quick Question
What would happen if I searched for
intitle:walt disney (without the
quotes)?
• Google would look for every page
with the word walt in its title AND
the word disney somewhere in its
body.
• Remember, the quotes are kind of
important if you want to search for
phrases using intitle:
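If you build queries in a script, a tiny helper can add the quotes for you so the whole phrase stays attached to the operator. This Python function is a hypothetical convenience, not part of any Google tool.

```python
def operator_query(operator, terms):
    """Attach terms to a search operator, quoting multi-word phrases."""
    terms = terms.strip()
    if " " in terms:
        terms = f'"{terms}"'  # quotes keep the phrase with the operator
    return f"{operator}:{terms}"  # no space after the colon


print(operator_query("intitle", "pirates"))
print(operator_query("intitle", "walt disney"))
```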
61
inurl:
• Using inurl:
restricts the results
to documents
containing a
particular word in
its URL.
• There can be no
space between
inurl: and the
following word.
62
Source: http://www.google.com/help/operators.html
URL ?
A URL is a uniform resource locator,
a string that uses a standard syntax
to identify an access protocol,
location, and identifier for a file or
other Internet resource.
– http://www.disney.com/
– http://www.google.com/
– ftp://wuarchive.wustl.edu/
– news:google.public.support.general
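Python's standard urllib.parse module can split each of those example URLs into its parts, which shows what the "access protocol" and "location" pieces look like in practice. This is only an illustration of URL structure. It says nothing about which parts inurl: actually searches.

```python
from urllib.parse import urlsplit

urls = [
    "http://www.disney.com/",
    "http://www.google.com/",
    "ftp://wuarchive.wustl.edu/",
    "news:google.public.support.general",
]

parts = {url: urlsplit(url) for url in urls}
for url, split in parts.items():
    # scheme = access protocol, netloc = host name, path = the rest
    print(split.scheme, repr(split.netloc), repr(split.path))
```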
63
Source: http://search400.techtarget.com/newsItem/0,289139,sid3_gci850,00.html
inurl:term
inurl:disney
pirates -inurl:disney
64
site:
• Using site:
restricts the results
to those websites
in a domain.
• There can be no
space between
site: and the
domain.
65
Source: http://www.google.com/help/operators.html
site:domain
pirates site:disney.com
66
Using site:
• You use site: in conjunction with another
search term or phrase.
pirates site:disney.com
• You can also use site: to exclude sites.
pirates -site:disney.com
• You can use site: to exclude or include
entire domains (and, like with filetype, the
dot is optional).
pirates -site:com
pirates site:edu
67
Alternative Query Types
Stuff you can use,
if you want to search
without using any keywords
68
cache:
• Using cache: shows
the version of a web
page that Google has
in its cache.
• There can be no space
between cache: and
the URL.
• You can use cache: in
conjunction with a
keyword or phrase,
but few do.
69
Source: http://www.google.com/help/operators.html
cache:URL
cache:disney.com
70
link:
• Using link:
restricts the results
to those web pages
that have links to
the specified URL.
• There can be no
space between
link: and the
URL.
71
Source: http://www.google.com/help/operators.html
link:URL
link:disney.com
72
related:
• Using related:
lists web pages
that are "similar"
to a specified web
page.
• There can be no
space between
related: and the
URL.
73
Source: http://www.google.com/help/operators.html
related:URL
related:disney.com
74
info:
• Using info:
presents some
information that
Google has about a
particular web
page.
• There can be no
space between
info: and the
URL.
75
Source: http://www.google.com/help/operators.html
info:URL
info:disney.com
76
Other Information Needs
Did you know that Google can look up
phone numbers, stock quotes,
dictionary definitions, and
… even the answer to math problems?
77
phonebook:
• There are actually
three different Google
phonebook operators.
• Using phonebook:
searches the entire
Google phonebook.
• Using rphonebook:
searches residential
listings only.
• Using bphonebook:
searches business
listings only.
78
Source: http://www.google.com/help/operators.html
How to Use the Phonebook
• first name (or first initial), last name, city
(state is optional)
• first name (or first initial), last name,
state
• first name (or first initial), last name, area
code
• first name (or first initial), last name, zip
code
• phone number, including area code
• last name, city, state
• last name, zip code
79
phonebook:Data
phonebook:disneyland ca
phonebook:(714) 956-6425
80
stocks:
• If you begin a query
with stocks: Google
will treat the rest of
the query terms as
stock ticker symbols,
and will link to a
Yahoo finance page
showing stock
information for those
symbols.
• Go crazy with the
spaces – Google
ignores them!
81
Source: http://www.google.com/help/operators.html
stocks:Symbol1 Symbol2 …
stocks: msft
stocks: aapl intc msft macr
82
define:
• If you begin a query
with define: Google
will display definitions
for the word or phrase
that follows, if
definitions are
available.
• There can be no space
between define: and
the word or phrase
you wish to define.
• You don’t need quotes
around your phrases.
83
Source: http://www.google.com/help/features.html#definitions
define:term
define:pirate
define:barbary coast
84
Google Calculator
• Simply key in what
you'd like Google to
compute (like 2+2)
and then hit enter.
• Google’s Calculator
can solve math
problems involving
basic arithmetic, more
complicated math,
units of measure and
conversions, and
physical constants.
85
Source: http://www.google.com/help/features.html#calculator
3+44
56*78
1.21 GW / 88 mph
100 miles in kilometers
sine(30 degrees)
G*(6e24 kg)/(4000 miles)^2
0x7d3 in roman numerals
For instructions on how to use the Google Calculator, see
http://www.google.com/help/calculator.html
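A few of those calculator examples are easy to double-check in Python. The sketch below covers only the simple ones. The helper names are made up, and the mile-to-kilometer factor is the standard international mile (1.609344 km).

```python
import math

KM_PER_MILE = 1.609344  # international mile, exact by definition


def miles_to_km(miles):
    """Convert miles to kilometers."""
    return miles * KM_PER_MILE


def to_roman(n):
    """Convert a positive integer to Roman numerals."""
    numerals = [(1000, "M"), (900, "CM"), (500, "D"), (400, "CD"),
                (100, "C"), (90, "XC"), (50, "L"), (40, "XL"),
                (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I")]
    out = []
    for value, symbol in numerals:
        count, n = divmod(n, value)
        out.append(symbol * count)
    return "".join(out)


print(3 + 44)                       # 3+44
print(56 * 78)                      # 56*78
print(miles_to_km(100))             # 100 miles in kilometers
print(math.sin(math.radians(30)))   # sine(30 degrees)
print(to_roman(int("0x7d3", 16)))   # 0x7d3 in roman numerals
```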
86
PART THREE:
Advanced Operators
SUMMARY
Query modifiers
• daterange:
• filetype:
• inanchor:
• intext:
• intitle:
• inurl:
• site:
Alternative query types
• cache:
• link:
• related:
• info:
Other information needs
• phonebook:
• stocks:
• define:
• Google Calculator
87
The Last Part:
Google Resources
Where to get more
information
88
http://www.google.com/help/
• Google Help
Central
• Free guides and
FAQs that tell you
about Web
searching in
general and
Google’s features
in specific.
89
Google Support Newsgroup
• Google has a free
Usenet newsgroup:
google.public.
support.general
• You may be able to
access this
newsgroup through
your Usenet
reader.
90
Google Support Newsgroup
• You can also search
for the google.
public.support.
general newsgroup at
news.google.com.
• The easiest way to
access the newsgroup
is to just click on the
“user support
discussion forum” link
at the top of the
Google Help Central
page.
91
Google Hacks
• Google Hacks by
Calishain and Dornfest
• US$24.95 (ISBN
0596004478)
• This is an extremely
advanced book written
for Perl programmers,
NOT you and me.
• But I still highly
recommend it.
Image source: Amazon.com
92
Our Goals
• Learn how Google really works.
• Discover some Google secrets no one
ever tells you.
• Play around with some of Google’s
advanced search operators.
• Find out where to get more Google-related help and information.
• DO ALL OF THIS IN ENGLISH!
93
Fair Use Disclaimer
This presentation was created following the Fair
Use Guidelines for Educational Multimedia.
Certain materials are included under the Fair Use
exemption of the U.S. Copyright Law. Further
use of these materials and this presentation is
restricted.
94
GOOGLE 201:
‘Advanced Googology’
a presentation by
Patrick Douglas Crispen
95