Methodological approach to a massive destination

Methodological approach to a massive destination content analysis of travel blogs and reviews
Dr. Estela Mariné‐Roig Dr. Salvador Anton Clavé
7th World Conference for Graduate Research in Tourism, Hospitality and Leisure
Istanbul 3‐8 June 2014 Department of Geography
Content
1 Aims
2 Travel blogs and reviews
3 Methodology: Case Study, Database, Content analysis
4 Concluding remarks
1 Aims
• Propose a methodology for conducting a massive, computerized, quantitative content analysis of travel blogs and reviews concerning a specific destination.
• This should help researchers uncover multiple aspects of a destination’s online image as expressed by tourists.
2 Travel blogs and reviews
• The Internet has become the main channel for seeking and disseminating information, notably in travel and tourism.
• Knowing what tourists say in Web 2.0 user‐generated content (UGC) is of major importance for destinations.
• Within Travel 2.0, blogs and reviews of travel experiences have rapidly expanded and become popular, producing great amounts of information.
2 Travel blogs and reviews
• Travel blogs and reviews have great potential as rich and meaningful information sources for destinations:
  - They are frequently updated, ordered and classified geographically and chronologically.
  - They give insights into destination image and tourists’ perceptions.
• “Future research needs to explore other frameworks that will be appropriate in maximizing the usefulness of travel blogs to the academe and the industry” (Pan et al., 2007)
3 Methodology – Case Study
Catalonia
• First‐order world tourism destination. Second top tourist region in the EU‐27.
• 2013: 15.6 million foreign tourists.
• Barcelona is among the top European tourist capitals.
• 9 regional tourist brands.
3 Methodology ‐ Database
1. Data source selection:
• Specialized travel blog and review hosting websites.
• Websites need to be chosen objectively in relation to the case study (Catalonia):
  - Check former works, bibliographical sources, subject guides, blog search engines, and standard search and metasearch engines using keywords.
  - Use a selection criterion: more than 100 entries about the case study.
• Websites: GetJealous.com, MyTripJournal.com, StaTravel.com, TravelBlog.org, TravelJournals.net, TravellersPoint.com, TravelPod.com, IgoUgo.com, TripAdvisor.com, TravBuddy.com, VirtualTourist.com
3 Methodology ‐ Database
2. Data collection and download:
• Most studies gather very small samples of blogs and reviews, because of the difficulties involved.
• Massive quantitative analyses are needed because of the great volume of information online.
• All relevant blogs and reviews about the case study should be downloaded with web copiers.
• Manual exploration is needed to understand each website's structure and to locate the HTML files related to the case study.
• More than 100,000 files were retrieved in the case of Catalonia.
TP. http://www.travelpod.com/blogs/0/State/destination.html
TB. http://www.TravelBlog.org/Europe/Spain/Catalonia/
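Once the relevant path of a website is known, the pages a web copier has found can be filtered by URL. The following is a minimal sketch, assuming the TravelBlog.org prefix shown above; the function name, the dictionary and the two sample page URLs are hypothetical and are not part of the original study.

```python
from urllib.parse import urlsplit

# Path prefix taken from the TravelBlog.org example above. Other websites
# would need their own prefixes, found by manual exploration.
CASE_STUDY_PREFIXES = {
    "www.travelblog.org": "/Europe/Spain/Catalonia/",
}


def is_case_study_url(url: str) -> bool:
    """Return True if the URL sits under a known case-study path prefix."""
    parts = urlsplit(url)
    # urlsplit().hostname is already lower-cased (or None if missing).
    prefix = CASE_STUDY_PREFIXES.get(parts.hostname)
    return prefix is not None and parts.path.startswith(prefix)


# Hypothetical page URLs, for illustration only.
urls = [
    "http://www.TravelBlog.org/Europe/Spain/Catalonia/Barcelona/blog-1.html",
    "http://www.TravelBlog.org/Europe/France/Paris/blog-2.html",
]
relevant = [u for u in urls if is_case_study_url(u)]
print(relevant)
```

Only the first URL is kept, because its path starts with the Catalonia prefix.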
3 Methodology ‐ Database
2. Data collection and download:
• Travel blog and review database

Domain (acronym)          Barcelona  Other towns  Unclassified  Empty          First entry
GetJealous.com (GJ)               0            0         1,164  371 (1)        2001-08-27
IgoUgo.com (IO)               1,073           71             0  -              2000-06-06
MyTripJournal.com (MT)          536           72             0  -              2001-07-25
StaTravel.com (ST)              243           12             0  -              2005-05-30
TravBuddy.com (TY)            1,066           80             0  11 (2)         1985-05-20
TravelBlog.org (TB)           2,348          280           106  -              1997-03-07
TravelJournals.net (TJ)         115            4             0  -              2002-08-01
TravellersPoint.com (TS)          0            0           596  -              1986-05-09
TravelPod.com (TP)              998          481             0  -              1984-12-27
TripAdvisor.com (TA)         67,882       34,519            43  112,698 (3)    2002-10-17
VirtualTourist.com (VT)      10,289        2,192           285  515 (3)        1999-12-08

Notes: (1) "This site has now expired ..."; (2) "Sorry, X has not created any entries ..."; (3) the writing body is empty.

• Figure: Travel blogs and reviews per tourism brands
3 Methodology ‐ Database
3. Data arrangement, cleaning and debugging:
• Arrangement: structure of folders and files
root\website\brand\town\entrydate_pagename[_ending].htm
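The folder and file pattern above can be generated programmatically when the downloaded pages are rearranged. This is a minimal sketch, assuming an ISO date for the entrydate part; the function name and all example values (root folder, acronym, brand, town, page name) are hypothetical.

```python
from datetime import date
from pathlib import PureWindowsPath


def entry_path(root, website, brand, town, entry_date, page_name, ending=""):
    r"""Build root\website\brand\town\entrydate_pagename[_ending].htm."""
    # Assumption: the entry date is written as YYYY-MM-DD.
    stem = f"{entry_date.isoformat()}_{page_name}"
    if ending:
        stem += f"_{ending}"
    return PureWindowsPath(root, website, brand, town, stem + ".htm")


# Hypothetical example values, for illustration only.
p = entry_path("C:\\corpus", "TB", "Barcelona", "Barcelona",
               date(2013, 7, 14), "blog-123")
print(p)
```

PureWindowsPath is used so that the backslash-separated layout is produced on any operating system.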
3 Methodology ‐ Database
3. Data arrangement, cleaning and debugging:
• Cleaning: online sources are full of “noise” (Carson, 2008)
  - Character encoding problems
  - Needless content: identified with a WYSIWYG interface and erased using a mass removal utility
  - Non‐significant words
3 Methodology ‐ Database
3. Data arrangement, cleaning and debugging:
• File size before cleaning: 21 KB
• File size after cleaning: 3 KB
• Sample of removed HTML directives:
<div id="header"> ... </div>
<div class='blog_breadcrumbs'> ... </div>
<div class='blognav'> ... </div>
<div class='ads_leader'>...</div>
<div id="footer"> ... </div>
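The removal of needless blocks can also be scripted. The sketch below is not the mass removal utility used in the study: it is an illustration with Python's standard html.parser that skips the div blocks listed above and, as a simplification, keeps only the remaining text rather than the remaining HTML. The sample page is hypothetical.

```python
from html.parser import HTMLParser

# Markers taken from the sample of removed directives above.
DROP_IDS = {"header", "footer"}
DROP_CLASSES = {"blog_breadcrumbs", "blognav", "ads_leader"}


class BoilerplateStripper(HTMLParser):
    """Collect text that lies outside the <div> blocks marked for removal."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.skip_depth = 0  # greater than 0 while inside a dropped <div>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        a = dict(attrs)
        classes = set((a.get("class") or "").split())
        if self.skip_depth:
            self.skip_depth += 1  # nested div inside a dropped block
        elif a.get("id") in DROP_IDS or classes & DROP_CLASSES:
            self.skip_depth = 1

    def handle_endtag(self, tag):
        if tag == "div" and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.chunks.append(data.strip())


def strip_boilerplate(html: str) -> str:
    parser = BoilerplateStripper()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical page, for illustration only.
page = (
    '<div id="header">Site menu</div>'
    "<div class='blognav'><div>Prev</div> Next</div>"
    "<div class='entry'><p>We loved Parc Güell.</p></div>"
    '<div id="footer">Copyright</div>'
)
print(strip_boilerplate(page))
```

Tracking the nesting depth matters because navigation blocks usually contain further div elements, and the skipped region must end at the matching closing tag.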
3 Methodology ‐ Database
3. Data arrangement, cleaning and debugging:
• Debugging: preliminary word frequency count and identification of misspelled keywords. Misspellings are especially common for non‐English‐speaking destinations.

Correct noun: misspellings
Barcelona: Bathelona, Barcellona, Barthelonaaaa, Bar‐th‐elona, Bar‐tha‐lona, Bar‐the‐lona ...
Casa Batlló: Batllo House; Casa Batillo, Batilló, Batlla, Batllao, Batllò, Bátllo, Batlo, Battllo, Battló ...
Antoni Gaudí: Antonio Gaudi; Gaudì, Gaüdi, Gaudie, Gaudii, Goudi, Goudí, Guadi, Gualdi, Gudi ...
Barri Gòtic: Barri Gotico; Bari Gotic; Ghotic Barrio, District, Quarter; Gotic area, neighborhood ...
Parc Güell: Parc Guël, Güel, Guéll, Guelle; Park Gueil, Guel, Güelle, Guelli; Parque Guelle, Güelle ...
Montjuïc: Monjuic, Montjeuic, Montjic, Montjouïc, Montjuîc, Montjuich, Montjuiic, Montjuik ...

More than 100 ways of misspelling “Sagrada Familia”!
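Once the misspellings have been identified, they can be mapped back to the correct noun before counting. This is a minimal sketch using a lookup table and a regular expression; it is one possible way to apply the corrections, not necessarily the one used in the study. The variants are a small subset of the table above, and for Gaudí only the surname is normalized, so a preceding first name is left untouched.

```python
import re

# A few variants copied from the table above; a real list would be far longer.
VARIANTS = {
    "Barcelona": ["Bathelona", "Barcellona"],
    "Gaudí": ["Gaudi", "Gaudì", "Gaudie", "Goudi", "Guadi"],  # surname only
    "Parc Güell": ["Park Guel", "Parc Güel", "Parque Guelle"],
    "Montjuïc": ["Monjuic", "Montjuich", "Montjuik"],
}

# Map each lower-cased variant to its canonical spelling.
LOOKUP = {v.lower(): canon for canon, vs in VARIANTS.items() for v in vs}

# Longest variants first, so longer variants win over shorter overlapping ones.
PATTERN = re.compile(
    r"\b("
    + "|".join(re.escape(v) for v in sorted(LOOKUP, key=len, reverse=True))
    + r")\b",
    re.IGNORECASE,
)


def normalize(text: str) -> str:
    """Replace every known misspelling by its canonical form."""
    return PATTERN.sub(lambda m: LOOKUP[m.group(0).lower()], text)


fixed = normalize("Bathelona is great: Gaudie, Park Guel and Monjuic!")
print(fixed)
```

The word boundaries keep the pattern from touching names that are already correct: "Parc Güel" does not match inside "Parc Güell".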
3 Methodology ‐ Database
4. Language detection, data mining and dissemination:
• Language detection: before content analysis, the language of each entry should be detected (naive Bayes classifier).
Figure: Language of posts
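The idea behind a naive Bayes language detector can be shown in a few lines. The sketch below is a toy illustration, not the classifier used in the study: it assumes character trigrams as features, a uniform prior and add-one smoothing, and it is trained on three hypothetical one-sentence samples. A real detector would be trained on large corpora.

```python
import math
from collections import Counter


def trigrams(text):
    """Split a lower-cased, space-padded text into character trigrams."""
    t = f"  {text.lower()} "
    return [t[i:i + 3] for i in range(len(t) - 2)]


class NaiveBayesLanguageDetector:
    def __init__(self):
        self.counts = {}   # language -> Counter of trigram counts
        self.vocab = set()  # all trigrams seen in training

    def train(self, language, text):
        grams = trigrams(text)
        self.counts.setdefault(language, Counter()).update(grams)
        self.vocab.update(grams)

    def detect(self, text):
        grams = trigrams(text)
        v = len(self.vocab)
        best, best_score = None, -math.inf
        for lang, c in self.counts.items():
            total = sum(c.values())
            # Uniform prior; add-one smoothing for unseen trigrams.
            score = sum(math.log((c[g] + 1) / (total + v)) for g in grams)
            if score > best_score:
                best, best_score = lang, score
        return best


# Hypothetical training sentences, for illustration only.
det = NaiveBayesLanguageDetector()
det.train("en", "we visited the cathedral and then walked along the beach in the evening")
det.train("es", "visitamos la catedral y luego caminamos por la playa durante la tarde")
det.train("ca", "vam visitar la catedral i després vam caminar per la platja al vespre")

print(det.detect("the beach was beautiful"))
print(det.detect("la playa era muy bonita"))
```

Log-probabilities are summed instead of multiplying probabilities, which avoids numerical underflow on long entries.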
3 Methodology ‐ Database
4. Language detection, data mining and dissemination:
• Data mining: extraction of
  - Blog titles
  - Bloggers’ hometown
Figure: Country of origin of bloggers and reviewers
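Blog titles can often be read from the title element of each downloaded page. The sketch below assumes that this is where the title is stored; the actual location of titles and hometowns depends on each website's template, which is not described here. The class name, function name and sample page are hypothetical.

```python
from html.parser import HTMLParser


class TitleExtractor(HTMLParser):
    """Collect the text found inside the <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def extract_title(html: str) -> str:
    parser = TitleExtractor()
    parser.feed(html)
    parser.close()
    return parser.title.strip()


# Hypothetical page, for illustration only.
sample = ("<html><head><title> Three days in Barcelona </title></head>"
          "<body><p>Entry text</p></body></html>")
print(extract_title(sample))
```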
3 Methodology ‐ Database
4. Language detection, data mining and dissemination:
• Dissemination: visibility; indexed pages; presence in the social media; usage; geographical distribution of users; link‐based ranks; visit‐based ranks.

Indexed pages by search engine:

Domain (acronym)               Bing      Google
GetJealous.com (GJ)          20,700     544,000
IgoUgo.com (IO)             221,000   1,470,000
MyTripJournal.com (MT)       12,100     270,000
StaTravel.com (ST)            9,430      33,900
TravBuddy.com (TY)           83,700     194,000
TravelBlog.org (TB)         256,000     888,000
TravelJournals.net (TJ)     119,000   1,400,000
TravellersPoint.com (TS)    175,000     594,000
TravelPod.com (TP)          667,000   9,260,000
TripAdvisor.com (TA)      8,260,000  85,500,000
VirtualTourist.com (VT)   2,100,000   6,870,000
3 Methodology – Content analysis
• Researchers are still trying to ascertain the ‘what’ and ‘how’ of analysing travel blogs (Banyai & Glover, 2011).
• Content analysis is the most suitable technique for conducting massive analyses of blogs and reviews.
• What makes this technique particularly rich and meaningful is its reliance on coding and categorizing of data (Stemler, 2001).
3 Methodology – Content analysis
Receptacle: Text
Approach: Quantitative
Interpretation: Thematic
Categories of analysis:
  - Geography: brand regions & region
  - Attraction factors
  - Feelings and dichotomies
  - Cultural identity references
Measuring system: Frequency counts
Software:
  - Site Content Analyzer
  - Other software: Java utility to process strings
3 Methodology – Content analysis
• With this process, data should be organized in two different ways so that different measures can be applied:

• Word groups of categories, with reference to the total database

Group or category  Count  Site‐Wide Density  Average Weight
Word_a1            ...    ...                ...
Word_a2            ...    ...                ...
Word_a3            ...    ...                ...
Word_a4            ...    ...                ...
GROUP A            ...    ...                ...
Word_b1            ...    ...                ...
Word_b2            ...    ...                ...
GROUP B            ...    ...                ...

• Matrix with content categories, file by file

          CATEGORY 1  CATEGORY 2  CATEGORY 3  CATEGORY 4
T‐BLOG 1  XXX         XXX         XXX         XXX
T‐BLOG 2  XXX         XXX         XXX         XXX
T‐BLOG 3  XXX         XXX         XXX         XXX
T‐BLOG 4  XXX         XXX         XXX         XXX
…         …           …           …           …
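Both layouts can be derived from the same word counts. The sketch below is an illustration only: the categories, keywords and entries are hypothetical, and the share of all words is a simple stand-in for a density measure, not the Site‐Wide Density or Average Weight formulas of Site Content Analyzer, which are not reproduced here.

```python
import re
from collections import Counter

# Hypothetical categories and keywords, for illustration only.
CATEGORIES = {
    "architecture": {"cathedral", "gaudí", "modernisme"},
    "beach": {"beach", "sea", "sand"},
}


def tokenize(text):
    return re.findall(r"\w+", text.lower())


def category_row(text):
    """Count, per category, how many tokens of the text belong to it."""
    tokens = Counter(tokenize(text))
    return {cat: sum(tokens[w] for w in words)
            for cat, words in CATEGORIES.items()}


# Hypothetical entries keyed by file name.
entries = {
    "blog1.htm": "The cathedral by Gaudí is near the beach. The beach has fine sand.",
    "blog2.htm": "We swam in the sea all day.",
}

# Matrix layout: one row per file, one column per category.
matrix = {name: category_row(text) for name, text in entries.items()}

# Group layout: totals per category over the whole database.
all_tokens = Counter(t for text in entries.values() for t in tokenize(text))
total_words = sum(all_tokens.values())
group_totals = {cat: sum(all_tokens[w] for w in words)
                for cat, words in CATEGORIES.items()}
group_share = {cat: n / total_words for cat, n in group_totals.items()}

print(matrix)
print(group_totals, group_share)
```

The matrix supports file‐level measures such as correlations and clustering, while the group totals describe the database as a whole.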
3 Methodology – Content analysis
• Examples of measures applied to this database:
  - Most frequent words
  - Study of outstanding elements
  - Descriptive statistics and P‐correlation
  - Cluster analysis
  - Spatial indexes
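As an illustration of one of these measures, the sketch below computes a correlation between two category columns of the file‐by‐file matrix. It assumes that "P‐correlation" refers to Pearson's correlation coefficient, which the slides do not state explicitly, and the counts are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences.

    Raises ZeroDivisionError if either sequence has zero variance.
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Hypothetical per-entry counts for two categories, for illustration only.
architecture = [2, 0, 5, 1]
beach = [1, 3, 0, 2]
r = pearson(architecture, beach)
print(round(r, 3))
```

A strongly negative value, as in this toy example, would suggest that entries mentioning one category tend not to mention the other.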
4 Concluding remarks
• Objective method for selecting the most relevant data sources for the case study, with a selection criterion established according to the research goals.
• It includes an analysis of the websites’ image dissemination, to assess the capacity of the targeted information sources to disseminate the information they convey.
• Massive analysis of data:
  - All travel blogs and reviews about our case study on the websites fulfilling the criterion.
  - Key: creation of a database. Download web pages and entries to the PC and arrange them into a structure of folders and files, followed by data cleaning and debugging, language detection and data mining.
• Quantitative content analysis of online texts, based on word counts or frequencies and on the grouping of words into categories, proved to be a useful and appropriate method for shedding light on the projected and perceived images of a destination.
• Computerized content analysis with Site Content Analyzer and other software is suitable for dealing with quantitative data and large sets of analysis.
• The category system makes it possible to look deeper into certain complex aspects, such as cultural identity and the spatial distribution of tourist image.
• The methodological framework could be used for other studies targeting different destinations and different types of online media and tourist websites.
• It contributes to the preparation of data and the systematization of procedures.
Thanks for your attention!
Estela Marine‐Roig
Salvador Anton Clavé
www.globaltur.org
Acknowledgement: The research that this paper is based on was financed by the Spanish Ministry of Science and Innovation (CSO2011‐23004/GEOG).