Web Data Collection
Department of Communication PhD Student Workshop
Web Mining for Communication Research
April 22-25, 2014
http://weblab.com.cityu.edu.hk/blog/project/workshops
Hai Liang
• Copy and paste
• Save as
• …
Outline
I. Introduction to web data sources
II. Basic terms and procedures
III. Hands-on tutorial
– Collecting web data using NodeXL
– Collecting web data using Visual Web Ripper
– Collecting web data via APIs
– Collecting web data via scraping
SOURCES
Relevant Web Apps

Media generated
• Raw: news websites, media outlets, news sharing websites
• Processed: HERMES

User generated
• Raw: social media, discussion groups
• Processed: Tweets corpus, Obama Win Corpus (OWC), blog corpus
PROCEDURES
Overview
[Diagram: data can be retrieved from a site's database through an API, or from its webpages through web scraping; either way the result is a data file with one row per ID and columns V1, V2, V3, …]
Retrieving through API
What is an API?
• An API (Application Programming Interface) is a set of rules, specified by the web owner, that lets a small script (i.e., a program) written by the user download data directly from the site's database (rather than from its webpages).

An API request usually contains:
• Login information (if required by the owner)
• Name of the data source requested
• Names of the fields (i.e., variables) requested
• Range of date/time
• Other information requested
• Format of the output data
• etc.
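To make the list above concrete, here is a minimal sketch of an API request in Python. The endpoint and every parameter name are hypothetical (each web owner specifies its own); only the requests library is real:

import requests

# All parameter names below are illustrative, not a real API's
params = {
    "q": "#tcot",              # the data requested (a search query)
    "fields": "id,text,time",  # names of the fields (variables) requested
    "since": "2014-04-01",     # range of date/time
    "format": "json",          # format of the output data
}
r = requests.get("https://api.example.com/v1/search", params=params)
print r.status_code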
Examples: Reddit API and Twitter API
• Webpage: http://www.reddit.com/user/TheInvaderZim/comments/
• API: http://www.reddit.com/user/TheInvaderZim/comments/.json
• Webpage: https://twitter.com/search?src=typd&q=%23tcot
• API: https://api.twitter.com/1.1/search/tweets.json?src=typd&q=%23tcot
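The difference between the two kinds of URLs can be checked directly: the first returns a rendered webpage, the second structured data. A small sketch (note that Reddit may throttle requests that lack a descriptive User-Agent header):

import requests

page = requests.get("http://www.reddit.com/user/TheInvaderZim/comments/")
data = requests.get("http://www.reddit.com/user/TheInvaderZim/comments/.json")

print page.headers.get("content-type")  # text/html: a page for browsers
print data.headers.get("content-type")  # application/json: data for programs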
JSON Editor
• http://www.jsoneditoronline.org/
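Before the hands-on part, it helps to see how the JSON you inspect in the editor maps onto Python objects. A minimal sketch with a Reddit-like structure (json is part of the standard library):

import json

text = '{"data": {"children": [{"data": {"body": "a comment"}}]}}'
obj = json.loads(text)  # JSON objects become dicts, arrays become lists
print obj['data']['children'][0]['data']['body']  # prints: a comment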
Web Scraping
1. Identify Seeds
• hyperlinks on a page
• list of web URLs
• etc.
→ sampling and/or prior knowledge is required

2. Crawl Webpages
• download pages matching the seeds;
• scheduled based on the crawling policy of the target sites.

3. Parse Tags
• extract tagged elements (e.g., URLs, paragraphs, tables, lists, images, etc.)

4. Save to Data Files
• save the results in tabular format, delimited by tabs;
• readable by Excel, SPSS, etc.
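The four steps above compress into a few lines of Python. A minimal sketch, assuming the requests and beautifulsoup4 packages (installed in the Python setup later in this tutorial); the seed URL and the choice to extract links are just examples:

import requests
from bs4 import BeautifulSoup

seeds = ["http://www.reddit.com/"]            # 1. identify seeds

rows = []
for url in seeds:                             # 2. crawl webpages
    page = requests.get(url)
    soup = BeautifulSoup(page.text, "html.parser")
    for a in soup.find_all("a", href=True):   # 3. parse tags (here: links)
        rows.append((a.get_text().strip(), a["href"]))

out = open("links.txt", "w")                  # 4. save, delimited by tabs
for text, href in rows:
    out.write(text.encode("utf-8") + "\t" + href.encode("utf-8") + "\n")
out.close()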
Reddit: Seeds
URLs as seeds
Reddit: Parsing
HANDS-ON
NodeXL
I. http://nodexl.codeplex.com/
II. Twitter search network
III. Twitter user network
IV. Twitter list network
Twitter Search network
• Demo 1
– Given a list of keywords, e.g., {#tcot, #teaparty, #p2}
– Get the most recent N tweets containing the keywords (including time stamps)
– Get all user information (name, time zone, …)
– Get user relationships ∈ {follow, replies-to, mention}
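NodeXL does this through its menus; the same data can be requested directly from Twitter's 1.1 search API. A sketch, assuming the requests_oauthlib package; the four credential strings are placeholders for keys obtained by registering an app with Twitter:

import requests
from requests_oauthlib import OAuth1

# Placeholder credentials: replace with your own app's keys
auth = OAuth1("CONSUMER_KEY", "CONSUMER_SECRET",
              "ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")
r = requests.get("https://api.twitter.com/1.1/search/tweets.json",
                 params={"q": "#tcot OR #teaparty OR #p2", "count": 100},
                 auth=auth)
for tweet in r.json()["statuses"]:
    print tweet["created_at"], tweet["user"]["screen_name"], tweet["text"]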
Twitter Search network (cont’d)
Twitter user network
• Demo 2
– Given a username, e.g., ‘cityu_lianghai’
– Get the 100 most recent tweets:
• Replies-to network
• Mentions network
• All users’ information (name, time zone, …)
• Text, URL, location, time, …
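The API counterpart of Demo 2 is the 1.1 user_timeline endpoint, which returns a user's most recent tweets. A sketch, with the same placeholder OAuth credentials as in the previous sketch:

import requests
from requests_oauthlib import OAuth1

auth = OAuth1("CONSUMER_KEY", "CONSUMER_SECRET",      # placeholders
              "ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")
r = requests.get("https://api.twitter.com/1.1/statuses/user_timeline.json",
                 params={"screen_name": "cityu_lianghai", "count": 100},
                 auth=auth)
for tweet in r.json():   # user_timeline returns a plain list of tweets
    print tweet["created_at"], tweet["text"]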
Twitter user network (cont’d)
19
Twitter list network
• Demo 3
– Given a list of user names, e.g., {cityu_lianghai, aqinj, ChengJunWang, cityu_jzhu, wzzjasmine, StephanieDurr, marc_smith, toreopsahl}; N = 8
– To find the relationships between any two users:
• 8*7 = 56 pairs (directed)
• Relationship ∈ {follow, replies-to, mention}
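The 8*7 = 56 directed pairs NodeXL must check can be enumerated directly:

from itertools import permutations

users = ["cityu_lianghai", "aqinj", "ChengJunWang", "cityu_jzhu",
         "wzzjasmine", "StephanieDurr", "marc_smith", "toreopsahl"]
pairs = list(permutations(users, 2))  # ordered pairs = directed relationships
print len(pairs)  # 56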
Twitter list network (cont’d)
[Screenshots: Replies-to, Follows, and Mentions networks]
Twitter list network (cont’d)
Exercise 1
1. Try demos 1-3
2. Given a “Twitter list”, e.g., cspan/senators
Visual Web Ripper
Visual Web Ripper (cont’d)
I. Demo 4
 I. Given a “Twitter list”, e.g., cspan/senators (exercise 1)
 II. Get all member names and their tweets (NodeXL can extract only part of the members and tweets)
APIs-Python
I. Install ActivePython
II. Start -> type “cmd”
III. Type
 I. “pip install requests”
 II. “pip install beautifulsoup4”
(json does not need installing: it is part of the Python standard library)
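A quick way to confirm the setup worked (json needs no check beyond importing, since it ships with Python):

import json
import requests
import bs4

print requests.__version__
print bs4.__version__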
APIs (reddit.com)
I. Demo 5
 I. Given a username, e.g., ‘TheInvaderZim’
 II. Get comments by the user
II. Demo 6
 I. Given the URL of the front page
 II. Get the most controversial posts; extract titles and “ups”
APIs (reddit.com) (cont’d)
# Import the modules
import requests
import json

# Demo 5: get a user's comments
user = "TheInvaderZim"
r = requests.get("http://www.reddit.com/user/" + user + "/comments/.json")
# r.text

# Convert the JSON response into a Python dictionary
data = json.loads(r.text)
for item in data['data']['children']:
    # output_file = open('Demo5.txt', 'a')
    body = item['data']['body']
    # output_file.write(body + "\r\n")
    print body
    # output_file.close()

# Demo 6: get controversial posts
r2 = requests.get("http://www.reddit.com/controversial/.json")
data2 = json.loads(r2.text)
for item in data2['data']['children']:
    # output_file = open('Demo6.txt', 'a')
    title = item['data']['title']
    domain = item['data']['domain']
    sub = item['data']['subreddit']
    ups = item['data']['ups']
    # output_file.write(str(title) + '\t' + str(domain) + '\t' + str(sub) + '\t' + str(ups) + "\r\n")
    print str(title) + '\t' + str(domain) + '\t' + str(sub) + '\t' + str(ups) + "\r\n"
    # output_file.close()
Exercise 2
• Try demos 5-6
• Get 100 news titles on a topic of your interest (e.g., politics), together with:
– Author
– Time stamp
– URL
– Ups/downs
– Number of comments
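One possible starting point (a sketch, not the only solution): a subreddit's .json listing accepts a limit parameter, which Reddit caps at 100. The subreddit name and field choices below follow the exercise:

import requests
import json

r = requests.get("http://www.reddit.com/r/politics/.json?limit=100")
data = json.loads(r.text)
for item in data['data']['children']:
    d = item['data']
    print '\t'.join([d['title'], d['author'], str(d['created_utc']),
                     d['url'], str(d['ups']), str(d['downs']),
                     str(d['num_comments'])])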
Exercise 2 (cont’d)
Scraping
I. Demo 7 (similar to demo 6)
 I. Given the URL of the front page (no API this time)
 II. Get the most controversial posts; extract titles and “ups”
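A sketch of Demo 7: the same target as Demo 6, but parsed out of the HTML instead of the .json feed. The CSS classes ('thing', 'title', 'score unvoted') match Reddit's old page layout; inspect the live page source to confirm them before relying on this:

import requests
from bs4 import BeautifulSoup

r = requests.get("http://www.reddit.com/controversial/")
soup = BeautifulSoup(r.text, "html.parser")
for thing in soup.select("div.thing"):
    title = thing.select_one("p.title a.title")
    score = thing.select_one("div.score.unvoted")
    if title is not None:
        ups = score.get_text() if score is not None else "?"
        print ups + "\t" + title.get_text()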
Scraping (cont’d)
Exercise 3
• Visit debatepolitics.com (http://www.debatepolitics.com/)
• Use the scraping method to build the following table (N > 30 threads):

Thread Title | Author | Time | Number of views | Number of replies
xx | xx | xx | xx | xx
xx | xx | xx | xx | xx
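A possible starting point for the thread list (the selectors are guesses: debatepolitics.com runs vBulletin-style forum software, which usually gives thread links an id beginning with 'thread_title'; inspect the page source and adjust):

import requests
from bs4 import BeautifulSoup

r = requests.get("http://www.debatepolitics.com/")
soup = BeautifulSoup(r.text, "html.parser")
for a in soup.find_all("a", id=lambda x: x and x.startswith("thread_title")):
    print a.get_text()  # author, time, views, and replies sit in nearby tags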
Exercise 3 (cont’d)
TOOLS
Frequently Used Tools for Data Collection

Pull-down menus
• Open source: NodeXL (SNSs), VOSON (hyperlinks), IssueCrawler (pages)
• Commercial: Visual Web Ripper, Web Data Extractor

Programming-based
• Open source: Python APIs, Python Beautiful Soup
• Commercial: Twitter Firehose API