Descrambling the Social TV Echo Chamber

Nitya Narasimhan
Betaworks, Applied Research
Motorola Mobility, Inc.
Libertyville, IL 60048
[email protected]

Venu Vasudevan
Betaworks, Applied Research
Motorola Mobility, Inc.
Libertyville, IL 60048
[email protected]
ABSTRACT
Ubiquitous broadband access and the increasing adoption of tablet and smartphone devices are driving new trends in x-shifting and multi-screen viewing for television. These create audience and attention
fragmentation problems that impact viewer engagement. But they
also fuel demand for second screen experiences that can enhance
first screen content. This dichotomy calls for better understanding
and modeling of viewer behavior at a micro-level that can support
the design of attention-aware, engagement-boosting experiences.
In this paper we report on an exploration of social activity around
television from two perspectives: quality and quantity of attention
or behavior ‘signal’ in the social noise, and ease of extraction and
presentation of signals as real-time ‘hints’ to mobile applications.
The process taught us valuable lessons on the cost, constraints and
characteristics of different echo chambers for social television.
We share these along with early insights into collective attention
behaviors observed at different fidelities, some intriguing enough
to warrant further investigation. The results show promise for the
development of noise cancellers and signal boosters to extract the
relevant attention data, but require more work to develop a robust
solution for real-time use in second screen companion apps.
Categories and Subject Descriptors
H.1.2 [User/Machine Systems]: Human Information Processing
Keywords
Activity streams, attention management, social sensor, viewer
engagement, social television
1. INTRODUCTION
Television is a multi-billion dollar enterprise that generates the bulk of its revenue from advertising, merchandise and content sales. The cost of producing commercial content is prohibitive; analysts believe that HBO's Game of Thrones cost over 50 million dollars for just one season. Returns on such investment are measured by viewer engagement; higher ratings translate into premium ad rates and boost merchandise and content sales. As a result, the industry relies heavily on audience measurements like Nielsen's ratings to evaluate the performance of its offerings. However, changing viewer behaviors pose new challenges. For instance:

1. Viewers traditionally watched content on one device (TV), in one location (home), from a single source (cable), and at a scheduled time (broadcast). Now, x-shifted viewing is on the rise, with more users watching content across devices (mobile, PC, TV), locations (home, café, transit), times (recorded, on-demand) and sources (providers, mobile, web portals).

2. Viewing was traditionally a lean-back experience with users' attention on a single screen. Today, multi-screen viewing is growing, with users watching content on one screen (TV) and interacting with social networks or applications on a second (mobile or tablet), leading to challenges in divided attention.

Measurement mechanisms that rely on self-reporting (user diaries) or automated device reports (set meters) are ill-equipped to handle these shifts. Self-reporting incurs non-trivial user effort and recall that is exacerbated by shifted viewing and can cause omissions or inaccuracies. Device-based reporting may not scale easily to cover all the access interfaces and content repositories involved. The net result is that such systems may under-estimate actual engagement.

This is leading to efforts like the Nielsen cross-platform initiative ("Cross Platform Measurement", http://bit.ly/Nielsen-cross-platform) that can monitor broadcast and online viewing in the home. But such solutions are not yet complete or comprehensive; they cannot reliably measure place-shifted viewing, nor do they factor divided attention into their calculations. Further, they give macro insights (e.g., was the show engaging or not) but not micro behaviors (e.g., why was the show engaging, or what segments of the show drove the most engagement). As the definition of 'commercial success' gets more atomized (e.g., a sports broadcast is judged not just by its x million viewers but by selling y million jerseys), sub-program interest mining becomes less of a curiosity and more of a mandate for measurement and monetization solutions.

However, growing multi-screen usage is also driving the market for 'companion applications' on a second screen. Over 40 percent of tablet or smartphone users multi-task while watching television [1], undertaking tasks that often fall into three buckets: search (for information), social (for conversation) and interstitial (for quick, opportune tasks during perceived lulls in the first screen content). By characterizing the type and duration of second screen activity, we can establish a 'divided attention' factor to adjust engagement metrics. For instance, related search or social actions are additive, while interstitial or unrelated tasks may prove dilutive to viewers' engagement in that context.

Further, second screen applications promote social conversation (regardless of shifted viewing context), with social activity data often containing that context explicitly (e.g., the user tweets about watching a show on Hulu, hinting at a source-shift) or implicitly (e.g., Twitter appends location, time-zone or app-name annotations to tweets, hinting at place, time or device shifts). By characterizing the shift-based features and weighting them suitably, we establish 'x-shift' factors to adjust viewer engagement metrics further. Thus, we can weight live viewing higher than deferred viewing (time-shift) or prioritize remote viewing over in-home viewing (place-shift) to acknowledge the additional effort required to find a local station or set up remote streaming in order to sustain that live viewing.

While this complements the macro-level metrics available today, the real utility of social conversations is in the micro-behaviors they expose during the show. Not only does collective data reveal peaks and valleys of interest during the show, but it also surfaces data related to divided attention. Thus, if interstitial or unrelated companion tasks trigger a social update, its chronological position in that user's social activity stream can highlight an attention shift to (or away from) surrounding updates related to the first screen. By analyzing streams for statistics on frequency, content types and inter-cluster update times, we can potentially create a divided-attention adjustment for viewer engagement.

Finally, we note that while social data has value in enhancing our knowledge and characterization of viewer behaviors, it may not be representative of all viewer populations. Thus, older viewers may not embrace social media to the same extent, while younger users may more readily adopt new social paradigms like 'check-ins'. Or we may find that the same user is more engaged socially in certain contexts (shows, shift state) than others. Thus, a goal for our work is also to understand if (and how) viewer profiles and preferences fit into our model. Ideally, we see two benefits to the work. First, as a 'complementary' model to existing engagement systems, where the adjustments we compute are further weighted to reflect some representation bias for the larger viewer audience. Second, and perhaps more interesting to us, as a mechanism that reveals 'hints' about user attention or behaviors to applications, enabling them to adapt or create richer companion experiences for content.

In that context, mobile devices and 'second screen' applications serve two critical roles: they are producers of social activity traces (given rich shift context and an inherent indicator of screen focus), and they are consumers of the attention model (notably using collective attention hints to enhance the individual experience). In particular, the link between mobility and measurement of viewer engagement cannot be overstated; if we buy into the premise that situation is commitment (where the inconvenience and increased user effort in x-shifted viewing speaks to a higher degree of user loyalty to the show), then mobile devices provide implicit and fine-grained context that cannot be replicated by in-home systems.

These observations motivated our initial experiments to determine the viability of social activity streams as a mechanism to detect and characterize television viewer behaviors. For convenience, we identified two granularities of collective viewing behavior: macro (patterns across items that are indicative of shifted viewing) and micro (patterns within a single item that are indicative of divided attention). Macro trends can help identify when, where or how viewers watch a content series; micro trends help in determining which screen they are focused on at any point in time, and why.

We also indulge in a small but relevant debate on terminology: engagement and attention find interchangeable use in categorizing viewer presence in a show's audience. However, in our context, attention is the specific act of 'watching' the content; engagement covers any activity with or about it. Attention implies engagement but the reverse does not always hold true. Rather, we see certain types of engagement data as 'hints' of user attention; one check-in can suggest attention, but a flood of clustered social updates may speak more to engagement with the content than to sustained viewer attention to it. This distinction may become important in usage of this work, in both provider and application contexts.
In the following sections we describe the decision process behind selecting relevant data sources and the workflow behind the analysis. We present preliminary insights, with a focus on the more unusual (or counter-intuitive) results; we found valid explanations for some
behaviors but could only develop hypotheses for others, requiring
further experimentation for validation. The paper concludes with a
brief discussion of related work and an outline of next steps
towards design of a system that can meaningfully incorporate the
results into a real-time ‘attention sensing’ capability for mobile
devices and their resident second-screen applications.
2. SELECTING DATA SOURCES
The first step involved selecting relevant sources of social activity
data for television viewing. Three types of sources exist: generic social networks (e.g., Twitter; http://www.twitter.com, API at http://dev.twitter.com), specialized social networks (e.g., GetGlue; http://www.getglue.com, API at http://getglue.com/api) and social analytics services (e.g., Topsy; http://www.topsy.com, API at http://otter.topsy.com).
Generic networks offer a breadth of coverage of viewer activities
across application domains. On the plus side, this allows capture
of non-television activities (e.g., to track divided attention modes)
and a broader understanding of user behaviors {before, during and
after} each viewing session. On the minus side, this adds noise to
content-related signals in the stream, requiring more intelligence
to detect and suppress or attenuate the spurious data.
Specialized networks offer a depth of coverage of user activities
in a given domain. For television, services like GetGlue and Miso
allow users to broadcast what they are watching (‘check-in’), chat
with peer viewers (‘comment’) and earn digital rewards (badges,
points, bonus content) for participating. On the plus side, it makes
for cleaner signals since all activity data is relevant and there is
less room for the disambiguation errors plaguing generic streams.
On the minus side, the streams can be sparse; users often check in once per show (to accrue rewards) but mute their conversations
after, or move them to a second network. In this context however,
check-in data is invaluable; not only does it correlate directly to a
television-viewing event, but analysis shows it correlates well to
Nielsen ratings in gauging relative audience engagement [2].
Social analytics services are different in that they provide topical
intelligence (collective) rather than user-level activity (raw data).
This is useful for gaining broader perspective or overcoming the
limitations (or gaps) in data provided directly by the networks.
For instance, the Topsy analytics service provides a histogram of
Twitter activity for a given topic (query), sampling rate (slice) and
requested window (period). With adept usage, it is fairly simple to
retrieve historical patterns of behavior for a given show extending
back hours, days, weeks, even months, from the present. Twitter’s
search API, by comparison, caps query results to ~1500 items. For
a high-traffic topic (e.g., Superbowl) this window is a few hours.
For our purpose, we used Twitter for real-time data collection and
Topsy for historical data retrieval. We intend to add GetGlue as
our specialized network in the near future. However, for the initial
analysis, we leveraged the fact that Twitter is a data sink for other
networks; members link Twitter profiles and configure options to
auto-post select activities or achievements to their Twitter stream.
This makes Twitter fairly effective as a sensor for specific types
of activities from those networks. As we show later, this is true for
GetGlue ‘check-ins’ which represent the key data of interest to us.
3. DATA COLLECTION
Our next step was to select target content items for data collection.
We generated a master list of candidates from a mix of broadcast
and cable channels, with emphasis on primetime scheduling. Cable content was further pruned based on good Nielsen ratings or a well-defined social presence, to maximize the potential for social activity.
This resulted in a total of 144 shows. For each, we added relevant
metadata (e.g., Twitter account, hashtags, GetGlue URI, channel,
title and description) with manual review for quality and accuracy.
Next, we set up data collection tasks at two granularities: macro (item level, across item occurrences) and micro (segment level, within the item). Our two data sources (Topsy and Twitter) each aligned best with one of these objectives. Topsy is ideal for macro analysis given its ability to retrieve historical trends going back weeks in time. Because Topsy derives its insights primarily from Twitter data, it can be queried using comparable parameters (e.g., hashtags). In principle, Twitter supports both macro- (search API) and micro- (streaming API) data collection; in practice, search results are limited in size, with limited configuration options. Results obtained by search will be relevant, but not deterministic or comprehensive enough for analysis of temporal behaviors.
3.1 The Topsy Dataset
While Topsy provides various data retrieval APIs, we used only its histogram endpoint to retrieve historical trends. The API is rate-limited (to 7000 calls/day) and takes primarily three arguments: a slice (the desired sampling rate in seconds), a period (the number of samples requested), and a query (topic or keywords). The result is a single JSON object with the request parameters (for reference) and an array of integers, each data point representing the number of tweets seen for that query term in that 'slice' of time. The size of the array matches the requested period, with data returned in reverse-chronological order; in other words, the first value refers to the slice ending at the request time. To simplify analysis, we convert each data point into a tuple consisting of that data value and the related start-time for that slice; the latter is computed by progressively decrementing the request time by 'slice' seconds.
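To make the retrieval concrete, the minimal Python sketch below fetches one histogram and performs the tuple conversion just described. Only the slice/period/query semantics come from the description above; the endpoint path, api-key parameter and response envelope are assumptions, and a production collector would add error handling and rate-limit tracking.

import time
import requests

# Endpoint path and parameter names below are assumptions; the paper only
# specifies the histogram endpoint's slice/period/query semantics.
OTTER_HISTOGRAM_URL = "http://otter.topsy.com/searchhistogram.json"

def fetch_histogram(query, slice_secs=86400, period=30, api_key=None):
    """Fetch tweet counts for `query`, sampled every `slice_secs` seconds
    over the most recent `period` samples (returned newest-first)."""
    params = {"q": query, "slice": slice_secs, "period": period}
    if api_key:
        params["apikey"] = api_key  # assumed parameter name
    resp = requests.get(OTTER_HISTOGRAM_URL, params=params, timeout=30)
    resp.raise_for_status()
    data = resp.json()
    # Assumed envelope: the payload echoes the request parameters and
    # carries the reverse-chronological array of integer counts.
    return data["response"]["histogram"]

def to_tuples(counts, request_time=None, slice_secs=86400):
    """Convert the reverse-chronological count array into
    (slice_start_time, count) tuples, oldest first."""
    request_time = request_time or int(time.time())
    tuples = []
    for i, count in enumerate(counts):
        # the i-th value covers the slice ending i * slice_secs before request time
        start = request_time - (i + 1) * slice_secs
        tuples.append((start, count))
    return list(reversed(tuples))

# Example: one month of daily counts for a show hashtag
# counts = fetch_histogram("#gameofthrones", slice_secs=86400, period=30)
# series = to_tuples(counts, slice_secs=86400)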
The service also provides an analytics dashboard (http://analytics.topsy.com) that visualizes data as a simple time series and allows up to three queries to run in parallel to view comparative trends across them. It allows users to set slice and period values implicitly by selecting among four options: {past day, past week, past 2 weeks, past month}. We used this feature for quick insights, as we describe later. However, we note that the dashboard uses the same histogram endpoint, making the data comparable to that retrieved from a direct call to the API, with only a small difference in time offsets between the two requests.

We also note that the API allows us to set the slice and period to any values of our choosing. For instance, {slice=30, period=120} will return a histogram for the past hour sampled at 30-second intervals. In theory this lets us retrieve micro-level data in real time for shows; however, it offers no details on the content behind these counts.

For our analysis, we ran queries on the Topsy dashboard using the default 1-day, 1-week and 1-month settings per show. Moreover, we ran two concurrent queries for each show: the first focused on the keywords (hashtags) for the show to get a sense of overall user engagement, while the second scoped queries to tweets containing that show's GetGlue URI to get a coarse-grained sense of user check-in counts at different slices. In effect, we made 6 API calls per show, which results in fewer than 1000 calls to process all shows in our master list. This is well under the API's 7000 calls/day limit, but note that this makes sense only for post-hoc analysis; usage in real time for fine-grained sensing would easily exceed that limit.
3.2 The Twitter Dataset
Twitter provides two APIs: streaming and search. For the reasons mentioned earlier, the search interface is not immediately useful. In what follows, 'API' refers only to the streaming version, which exposes the Twitter firehose, a real-time stream of social activity updates (tweets) from all users. Different API endpoints provide different views into the stream; we use the filter endpoint, which returns a subset of the stream matching tracking criteria such as keywords or account names. In our case, we filter by the same keywords (hashtags) used earlier with Topsy, for each tracked show.
The API has some limitations. By default, this 'free' service returns only a representative subset (~1 percent) of the full firehose for each request. Further, that subset is selected (using an unknown relevance metric) to provide a fair distribution of results across all specified criteria. While Twitter allows up to 400 terms in filters, each additional term cannibalizes the share of results associated with pre-existing terms. To put this in context, a filter with terms for a single show is likely to be fairly comprehensive in covering that show; tracking all 144 shows concurrently results not just in poor coverage of any one show, but also incurs client-side complexity by requiring a de-multiplexing component that segments that feed into show-specific streams, potentially in real time.
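The Python sketch below illustrates that client-side de-multiplexing component: it replays a captured line-delimited JSON feed of the kind returned by the streaming filter endpoint and segments tweets into show-specific buckets by matching each show's track terms. The show keys, terms and capture filename are illustrative, and the connection/OAuth handling for the filter endpoint is deliberately omitted.

import json
from collections import defaultdict

# Track terms per show, mirroring the hashtags used with Topsy.
# The show keys and terms here are illustrative.
SHOW_TERMS = {
    "game_of_thrones": ["#gameofthrones", "game of thrones", "westeros"],
    "mad_men": ["#madmen", "mad men"],
}

def read_ndjson(path):
    """Replay a capture file holding one JSON-encoded tweet per line,
    as delivered by the streaming API's filter endpoint."""
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)

def demultiplex(tweet_stream, show_terms=SHOW_TERMS):
    """Segment a single filtered feed into show-specific streams by
    matching each show's track terms against the tweet text."""
    buckets = defaultdict(list)
    for tweet in tweet_stream:
        text = tweet.get("text", "").lower()
        for show, terms in show_terms.items():
            if any(term.lower() in text for term in terms):
                buckets[show].append(tweet)
    return buckets

# Example (capture filename is hypothetical):
# buckets = demultiplex(read_ndjson("got-premiere-capture.json"))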
As a result, we decided to instead focus on tracking one show at a time, to get a sense of micro-behaviors. For the initial experiment, we deliberately selected four 'signature' events with a fairly high expectation of social engagement: two from sports (the NCAA Final Four and Championship games) and two from highly anticipated season premieres (HBO's Game of Thrones, AMC's Mad Men). In all cases, we collected data before, during and after the show, in order to also capture pre-show and post-show activity. Given the space constraints, we will focus on analysis of just one dataset from this collection. However, we plan to publish a more detailed report later that explores subtler nuances of cadence and content, given the characteristic differences and half-life of these genres.
4. PROCESSING WORKFLOW
Fig 1. Data Collection and Processing Workflow for Twitter
Post-collection processing of Topsy data was fairly minimal. Our
only task was to create timestamped tuples corresponding to the
distinct histogram data points. By contrast, the wealth of content
and context in Twitter data called for some pre-processing to both
segment the results into meaningful datasets, and to enhance them
where useful for our analysis. Figure 1 illustrates the workflow at
a high level. Twitter stream data arrives at a fairly rapid clip and is
ingested and stored prior to processing. To explore hypotheses around subjective vs. objective behaviors, we annotate each tweet with sentiment polarity using Twitter Sentiment (https://sites.google.com/site/twittersentimenthelp), a bulk classifier that returns {0=negative, 2=neutral, 4=positive} labels per tweet.
The tweets were then fed to a ‘Slicer’ configured with a sampling
rate, effectively segmenting the stream into chunks of a specified
duration. Tweets in each chunk were analyzed for various markers
(related to behaviors) and aggregate tweet counts (per behavior)
were registered in a corresponding ‘transcript’ for that chunk.
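A minimal Python sketch of such a slicer follows. It buckets tweets into fixed-duration chunks, tallies per-behavior counts into a transcript per chunk, and renders any behavior as [timestamp-in-milliseconds, count] pairs, matching the time-series format used by the flot visualizer described later. The marker predicates are simplified stand-ins for the classifiers described next.

from collections import Counter
from datetime import datetime

TWITTER_TIME_FORMAT = "%a %b %d %H:%M:%S %z %Y"  # e.g. "Sun Apr 01 21:03:12 +0000 2012"

def slice_tweets(tweets, slice_secs=300, markers=None):
    """Bucket tweets into fixed-duration chunks and build a 'transcript'
    per chunk: aggregate counts for each behavior marker. `markers` maps
    a behavior name to a predicate over the tweet dict; the default is a
    stand-in for the real classifiers."""
    markers = markers or {"all": lambda t: True}
    transcripts = {}
    for tweet in tweets:
        ts = datetime.strptime(tweet["created_at"], TWITTER_TIME_FORMAT).timestamp()
        chunk_start = int(ts // slice_secs) * slice_secs
        counts = transcripts.setdefault(chunk_start, Counter())
        for name, predicate in markers.items():
            if predicate(tweet):
                counts[name] += 1
    return transcripts

def to_time_series(transcripts, behavior):
    """Render one behavior as [[ms_timestamp, count], ...] pairs for
    time-mode plotting."""
    return [[int(start * 1000), counts.get(behavior, 0)]
            for start, counts in sorted(transcripts.items())]

# Example:
# transcripts = slice_tweets(tweets, slice_secs=300)
# series = to_time_series(transcripts, "all")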
We identify three broad types of behaviors: basic (that focuses on
user activity types, e.g., check-in, chat, comment), x-shifted (that is indicative of a shift from the norm, e.g., timezone bias in east vs. west coast viewers) and n-screen (that is indicative of divided attention between first screen content and complementary or competitive
second screen activity occurring in that interval). The last of these
requires additional data retrieval and analysis, and is the focus of
ongoing work; we mention it here only for completeness. For the
present, our behavior classification is relatively naïve, using some mixture of natural language processing (e.g., we look for known text markers such as GetGlue's distinctive check-in signatures) and Twitter's rich set of semantic annotations (e.g., the utc-offset that is provided in streaming API results and is indicative of the timezone of that user). In a subsequent iteration, we plan to evolve the analysis to incorporate machine learning and advanced statistical tools that can improve accuracy and expand the breadth of identified features.
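As an illustration of this naïve pass, the Python sketch below tags tweets with two such markers: a check-in heuristic keyed off the @GetGlue/'checked-in' wording observed in GetGlue-sourced tweets, and a coarse timezone bucket drawn from the user object in the streaming payload (we assume it carried time_zone and utc_offset fields and key off the former here). The regular expression, zone strings and label names are illustrative approximations, not our production rules.

import re

# Illustrative approximation of the GetGlue check-in "signature" seen in
# tweet text ({@GetGlue, watching, checked-in}); not an exact template.
CHECKIN_PATTERN = re.compile(
    r"(checked-?in|i'?m watching).*(@getglue|getglue\.com)",
    re.IGNORECASE | re.DOTALL)

# Assumed profile time_zone strings for the four US zones of interest.
US_TIMEZONES = {
    "Eastern Time (US & Canada)": "eastern",
    "Central Time (US & Canada)": "central",
    "Mountain Time (US & Canada)": "mountain",
    "Pacific Time (US & Canada)": "pacific",
}

def is_checkin(tweet):
    """Basic behavior: does the tweet text look like a GetGlue check-in?"""
    return bool(CHECKIN_PATTERN.search(tweet.get("text", "")))

def timezone_bucket(tweet):
    """x-shifted behavior: coarse US timezone bucket from the embedded
    user profile (utc_offset could be used in the same way)."""
    tz = (tweet.get("user") or {}).get("time_zone")
    return US_TIMEZONES.get(tz)

def classify(tweet):
    """Produce the marker labels registered in a chunk's transcript."""
    labels = ["checkin"] if is_checkin(tweet) else ["chat"]
    zone = timezone_bucket(tweet)
    if zone:
        labels.append("tz:" + zone)
    return labels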
The behavior transcripts are analogous to a ‘social’ closed caption
transcript for the content; they summarize various behavioral
insights or contexts corresponding to that slice, and are associated
with a timestamp indicative of its start time. The transcript format
is also an artifact of our current usage requirements. In particular,
we ‘visualize’ transcripts to look for unexpected characteristics, or
evidence to validate (or disprove) different hypotheses. For this,
we use flot (http://code.google.com/p/flot/), a JavaScript library that mandates a specific format for time-series data; transcripts are generated to meet this format by default.
5. PRELIMINARY ANALYSIS
We now share some of the insights obtained from our initial
analysis. In some cases, the insights are not conclusive, but rather
hint at interesting behaviors that warrant further exploration. Due
to space constraints, we also describe only a subset of the insights
that we felt were of interest to the broader community. A detailed
report on these, and other insights, is forthcoming.
5.1 Macro-Analysis With Topsy
The Topsy dataset was useful both for studying broad trends and for instrumenting parameters for the subsequent Twitter collection phase. A
few of the insights obtained using the Topsy analytics dashboard:
1.
Sample size. Large slice values expose periodicity in viewer
engagement that can be directly correlated to episodic series.
In Fig 2(a) we can easily identify the air-times for episodes
of Alcatraz and even differentiate new episodes from repeats.
However, a micro-analysis (1-day) is more sensitive to slice
duration and shows jitter due to transient effects (e.g., a call-to-action during a specific episode driving up traffic).
2.
Term disambiguation. While Twitter advocates the use of clear 'hashtags' for each TV show for cohesive conversation, some shows (e.g., Unforgettable, House, The Office) have tags that are invariably used in other, non-television contexts. This adds noise to the signal. Fig 2(b) shows histograms for the term "#house" (yellow), the equivalent GetGlue URI (red) and the term "#house and Fox" (blue) to provide contextual relevance; House airs on the Fox channel. Observe that even knowing the scheduled slots for the program does not guarantee that tweets retrieved in that interval are related to the content.
3.
Sensor mash-up adds clarity. Correlating the activity around GetGlue (check-in) with that around a hashtag (tweet) can identify anomalies or increase clarity on behaviors. Fig 2(c) shows the check-in (red) and tweet (yellow) volumes around a popular wrestling show (WWE Raw); if we assume each check-in represents a user, we can clearly see a non-trivial ratio of tweets to users, indicative of an extremely engaged fan base (see the sketch after this list). In a different example (Fig 2d), tweets (blue) and check-ins (red) are contrasted for '2 Broke Girls', showing a low, steady tweet volume with one sudden spike. By itself, this could be misconstrued as an unscheduled airing of a new episode; however, correlating it with check-in data (which clearly indicates a lack of viewers) highlights an anomaly. A check of tweets in that region indicates the spike related to a CBS announcement of show 'renewals' including this one: a clear example of engagement not equating to attention.

We also found value in macro-analysis for understanding changes in attention under schedule face-off (i.e., two shows in the same time slot, on competing networks) or regimen changes (e.g., the series finale of show A could be traced to increased engagement in face-off show B the following week). Some of these insights were not conclusive and have been flagged for repeat experimentation.
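The sensor mash-up in item 3 above reduces to a simple per-slice ratio. The Python sketch below assumes two aligned (slice start, count) series of the kind produced by the Topsy histogram calls, one for the show hashtag and one for its GetGlue URI; the minimum check-in count and the 'chatty' threshold are illustrative.

def engagement_ratio(tweet_series, checkin_series, min_checkins=10):
    """Combine two aligned (slice_start, count) series into
    (slice_start, tweets_per_checkin) tuples. Slices with very few
    check-ins are skipped: a tweet spike without viewers (as in the
    2 Broke Girls example) is engagement, not attention."""
    ratios = []
    for (start, tweets), (_, checkins) in zip(tweet_series, checkin_series):
        if checkins >= min_checkins:
            ratios.append((start, tweets / checkins))
    return ratios

def chatty_slices(ratios, threshold=5.0):
    """Slices averaging more than `threshold` tweets per checked-in
    viewer, a hint of an unusually engaged fan base."""
    return [(start, r) for start, r in ratios if r > threshold]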
Fig 2. (a) #Alcatraz 1-month; (b) #House 1-month; (c) WWE-Raw 1-month; (d) #2brokegirls 1-month
5.2 Micro-Analysis With Twitter
While Topsy data was useful in understanding macro behaviors
for a show, it provided no details on the content under the counts.
As a result, understanding nuances of engagement (including both
x-shifting context and multi-screen behavior) is impossible. We
plan to revisit macro-analysis with the Twitter streaming data in
due course; for reasons mentioned earlier, we are constrained in
the number of concurrent ‘captures’ we can perform with the API.
Instead, in this section we focus on micro-analysis of the Twitter
data stream around a single show: HBO’s Game of Thrones.
The season premiere of the drama was highly anticipated, with social engagement exceeding expectations [3]. Our filter used a few key terms, notably the dominant hashtag (#gameofthrones), a more user-friendly version (game of thrones), and some show-specific themes (e.g., 'westeros'), which were first validated via the search API to guarantee a respectable volume of usage. The show aired simultaneously (at 9pm EST) in the east and central time zones; its latest start time was on the west coast (at 12am EST). To capture anticipation, we began collection 2 hours ahead of the first broadcast (~7pm EST) and targeted a cut-off at 2 hours after the last broadcast (~2am EST). In reality, we terminated the capture at 6am EST the next day. We captured ~65K tweets; the number of tweets within the broadcast window was somewhat lower than the '60,000' count reported by other sites, which we attribute to our accessing only the partial firehose and limiting our collection to Twitter data alone.
Post capture, the data was processed using the workflow shown in
Fig 1 earlier, with transcripts saved as files that could be input to
the flot visualizer described earlier. All data visualizations shown
used this tool with time represented in UTC format (on the x-axis)
against tweet counts for the specified feature (on the y-axis). For
convenience, we highlight the first broadcast time (1:00-2:00 am
UTC, or 9-10 pm EST) as a column in a lighter shade of white.
Some data (e.g., sentiment polarity) is not visualized here given
space constraints; it may instead be discussed briefly in context.
Sample size. The granularity of insights is a direct function of the
‘sample’ size we use in segmenting the Twitter dataset. Fig 3
shows the effect of different sample sizes (0.5s, 1s, 5s, 30s, 1min,
5min) on visualizing engagement in tweet counts; as anticipated,
we have fewer but more distinct peaks and valleys at higher sample sizes (5 min), and more jitter at low sample intervals (0.5 sec). The former allows us to easily detect areas of high or low collective engagement, which could translate to areas of sustained high or low individual attention; the intuition is that only an on-screen
event of interest could create this near-synchronized spike in
chatter across a large segment of the audience. This coherence is
diluted at lower sample sizes. However, the tradeoff is in reaction
times for mobile applications that want to ‘sense’ and respond to
such engagement in real time. Peak-detection is reactive; the
system can only report a peak after it computes the next sample
count and finds it lower. A 5-minute lag in learning about, and
reacting to, a peak-related action is unacceptable.
Fig 3. Identifying peaks and valleys at different sample rates
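The reactive nature of that peak detection can be seen in a minimal Python detector over the flot-style series produced earlier: a peak at slice t can only be confirmed once the count for slice t+1 is known, so the detection lag is at least one sample interval. This is a sketch of the tradeoff, not a tuned detector.

def detect_peaks(series, min_height=0):
    """Confirm local maxima in a [[ms_timestamp, count], ...] series.
    A peak at index i is only reportable once the value at i+1 is known,
    so with a 5-minute slice the earliest confirmation arrives roughly
    5 minutes after the peak actually happened."""
    peaks = []
    for i in range(1, len(series) - 1):
        prev_count, count, next_count = series[i - 1][1], series[i][1], series[i + 1][1]
        if count > prev_count and count > next_count and count >= min_height:
            # (peak time, peak height, time at which it became detectable)
            peaks.append((series[i][0], count, series[i + 1][0]))
    return peaks

# Example with 5-minute slices (timestamps in milliseconds):
# detect_peaks([[0, 40], [300000, 220], [600000, 90]])
#   -> [(300000, 220, 600000)]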
Viewing Behavior: Fig 3 also shows collective patterns of micro-level viewing behaviors. The first airing at 1:00-2:00am UTC (in the east and central time zones) produces the bulk of the engagement data; the pattern repeats consistently for other shows. First comes pre-show anticipation (till just after 1:00), then in-show attention (till just before 2:00), and finally post-show reaction. Note the distinct but relatively smaller peaks at the 3:00 and 4:00 marks respectively; these correspond to broadcasts in the Mountain and Pacific time zones. The reason for the markedly low chatter is unclear; speculation runs along two lines that need to be validated or disproven. The first points to the 'nothing left to say' syndrome; later viewers routinely find their Twitter streams polluted by chatter (and spoilers) from earlier viewers and simply lose interest in participating. The second points to an 'empty room' effect; at that hour, a significant segment of the US population is turning in for the night, leaving fewer users online for sustained chatter.

The Timezone Conundrum: We also have a third possibility. We segmented tweets by the 'time-zone' associated with each user's profile. This is not the same as location, which reflects GPS or place coordinates for the current location and is inevitably undefined. Rather, profile location is at the city level and is always provided. We made the assumption that most viewers will watch content in their home time-zone; the data is visualized in Fig 4 below, with focus on the segment around the first airtime.

Fig 4. Segmenting tweet volumes based on timezone in user profiles

The results are counter-intuitive; total tweets are shown in yellow, east in green, central in blue, mountain in red and pacific in purple for comparison. By this count, east coast viewers have the second-lowest contribution despite being the first to see the show. And there is a noticeable presence of west coast viewers despite it being three hours ahead of their broadcast time; this was readily explained by observing that HBO (the host cable network) offers branded East and West channels available nationwide. In effect, a west coast viewer can watch content on east coast time by adding a subscription to HBO East.

What makes this unique is the realization that west coast viewers are exhibiting a higher level of engagement than their peers by time-shifting consumption "forward"; in essence, they disrupted a normal routine (e.g., 6pm is a commute hour) just to be first in line for the show. The relative paucity of the east coast contribution is puzzling and we need to dig deeper. One hypothesis is that 9pm EST (air-time) is 'late' for families, potentially encouraging time-shifted viewing; another is that our fundamental assumption (on the validity of profile location) is flawed. Regardless, we perceive this to be an under-utilized attribute that can contribute to our better understanding of x-shifted viewing.
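A Python sketch of the segmentation behind Fig 4 follows: each tweet is bucketed by the time_zone string in its embedded user profile and one flot-style series is built per bucket. The US time_zone labels shown are the profile strings we assume Twitter used at the time; tweets without a recognized zone fall into an 'other' bucket.

from collections import defaultdict
from datetime import datetime

TWITTER_TIME_FORMAT = "%a %b %d %H:%M:%S %z %Y"

# Assumed profile time_zone strings for the four US zones of interest.
US_ZONES = {
    "Eastern Time (US & Canada)": "east",
    "Central Time (US & Canada)": "central",
    "Mountain Time (US & Canada)": "mountain",
    "Pacific Time (US & Canada)": "pacific",
}

def timezone_series(tweets, slice_secs=300):
    """Per-timezone tweet counts: {zone: [[ms_timestamp, count], ...]}.
    Tweets without a recognized US time_zone fall into 'other'; every
    tweet also counts toward 'total'."""
    counts = defaultdict(lambda: defaultdict(int))
    for tweet in tweets:
        ts = datetime.strptime(tweet["created_at"], TWITTER_TIME_FORMAT).timestamp()
        chunk = int(ts // slice_secs) * slice_secs
        zone = US_ZONES.get((tweet.get("user") or {}).get("time_zone"), "other")
        counts[zone][chunk] += 1
        counts["total"][chunk] += 1
    return {zone: [[start * 1000, n] for start, n in sorted(series.items())]
            for zone, series in counts.items()}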
Check-ins: Noise or Signal? Earlier, we described how check-in
data helps identify if tweets in an interval are more likely to relate
to attention (watching) or engagement (other activity) for a show.
We now observe that in Fig 4, the first and sharpest peak is due to
a surge in check-ins as verified by the ‘signature’ pattern of tweet
texts. This correlates well with other reports [4] that indicate a
marked user preference for ‘checking-in’ right before or after the
targeted show. However, clustered check-ins boost tweet counts in
a manner that creates a false peak of attention (within the show),
and diminishes the signal from real attention events. Thus, in this particular episode, the first real peak of user attention actually occurs about 12 minutes in; however, when compared in magnitude to the check-in peak, it creates the erroneous impression that the event was 'less' significant in comparison. This implies the need for a signal separator that can isolate the check-in signals for use in boosting macro-level analysis without distorting the micro-level picture.
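A minimal version of such a separator, in Python, reuses the check-in heuristic from the classification sketch: check-in style tweets are routed into their own per-slice series, and the residual series is treated as the attention signal. The pattern and slice size are illustrative assumptions, not a tuned separator.

import re
from collections import defaultdict
from datetime import datetime

TWITTER_TIME_FORMAT = "%a %b %d %H:%M:%S %z %Y"
# Illustrative approximation of the GetGlue check-in signature in tweet text.
CHECKIN_PATTERN = re.compile(
    r"(checked-?in|i'?m watching).*(@getglue|getglue\.com)",
    re.IGNORECASE | re.DOTALL)

def separate_checkins(tweets, slice_secs=300):
    """Split per-slice counts into a check-in series and a residual
    'attention' series, so that clustered check-ins cannot masquerade
    as an in-show attention peak."""
    checkins = defaultdict(int)
    residual = defaultdict(int)
    for tweet in tweets:
        ts = datetime.strptime(tweet["created_at"], TWITTER_TIME_FORMAT).timestamp()
        chunk = int(ts // slice_secs) * slice_secs
        if CHECKIN_PATTERN.search(tweet.get("text", "")):
            checkins[chunk] += 1
        else:
            residual[chunk] += 1

    def as_series(bucketed):
        return [[start * 1000, n] for start, n in sorted(bucketed.items())]

    return as_series(checkins), as_series(residual)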
Word Cloud: For bulk sentiment extraction, we had to produce a transcript containing just the text of all tweets, which we then also visualized as a word cloud using the popular Wordle service (http://www.wordle.net). A particularly useful Wordle feature allows us to iteratively remove words from the visualized cloud to trigger a refresh in context. By removing anticipated terms (e.g., the dominant hashtags) we can peel the layers of the onion to get new 'hints' of collective behaviors.
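The same peeling can be done directly on the text transcript: count terms, drop the filter hashtags and other anticipated tokens, and inspect what remains. The Python sketch below is a rough stand-in for the interactive Wordle step; the token pattern and default drop list are illustrative.

import re
from collections import Counter

TOKEN_PATTERN = re.compile(r"[#@]?[a-z0-9']+")

def peeled_term_counts(texts, drop_terms=("#gameofthrones", "game", "of", "thrones", "rt")):
    """Term frequencies over a tweet-text transcript, with anticipated
    terms removed so that less obvious collective behaviors surface."""
    drop = {t.lower() for t in drop_terms}
    counts = Counter()
    for text in texts:
        for token in TOKEN_PATTERN.findall(text.lower()):
            if token not in drop and len(token) > 1:
                counts[token] += 1
    return counts

# Example:
# top_terms = peeled_term_counts(tweet_texts).most_common(50)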
Fig 5. Game of Thrones tweet 'text' visualized as a word cloud
In Fig 5 we show the word cloud generated from that transcript after discarding our filter hashtags; some patterns immediately emerge. The terms {@GetGlue, watching, checked-in} all relate to the check-in template used by GetGlue; the size of those words shows that a sizeable subset of tweets came from GetGlue, which is borne out by news reports [3]. Next, we notice the dominance of "RT" (retweet), an activity that takes no cognitive effort from users but boosts overall traffic for that show. Tweet analysis showed that a significant number of these retweets came from @HBO (another dominant word) stoking anticipation for the premiere. This is in line with Twitter's own guidelines for driving engagement; however, we note that for 'attention' profiling, the RT can add noise. For instance, some 'call-to-action' tweets asked users to 'RT and be entered into a drawing to win' merchandise; many obliged, adding noise to the attention signal. To tackle such conflicts of interest, we plan to integrate pattern-matching and machine-learning mechanisms similar to spam detection, to cancel or down-weight the metrics impact of such data in an evolving manner.
6. RELATED WORK
The domination of Twitter as the digital watercooler for television viewers is not coincidental. More than any other network, Twitter has proactively advocated and supported "best practices" [3, 4] for increasing viewer engagement with television. Since then, numerous social analytics startups (e.g., Bluefin Labs, Trendrr and SocialGuide) have built on the data to deliver richer social dashboards that complement Nielsen engagement ratings.

However, our goal is somewhat different. First, these services leverage the full firehose, along with proprietary paid access to different back-end services; we wanted to understand if a best-effort solution based on a partial firehose could have utility. Second, they currently focus on macro-insights (item level); we are more interested in micro-insights (segment level) for modeling and understanding fragmented attention and audience behaviors. Third, our ultimate goal is to support application development by abstracting and presenting collective behaviors as 'hints' that can be used to develop or enhance a second screen experience. In that context, we believe that advocated best practices (e.g., add call-to-action hashtags to boost engagement) may be detrimental to our needs; it takes moments for a user to 'RT' or 'meme' tweets in the name of engagement, but the resulting collective noise may easily be misinterpreted as signal if not detected and screened out.

Rich academic research also exists in this context. A similar UK study [7] focused on using tweets to understand audience behavior in signature events (e.g., X Factor) as design guidance for related second screen apps. Prior work from our group [8] looked at the vertical domain of sports, and was the first to validate the utility of social sensing for gathering micro-level attention signals. But sports events have maximum impact when viewed live and do not suffer the extent of fragmented behaviors that other shows do; our current work seeks to extend that concept meaningfully to the larger television ecosystem. Finally, we hope to build on the extensive research in natural language processing techniques for classifying, summarizing and clustering tweets for various needs. For example, recent research on classifying collective attention [9] is relevant in that it allows us to potentially identify (and track) emergent topics around television shows that improve both our data collection and subsequent data filtering mechanisms, potentially alleviating the known issues in term disambiguation and call-to-action pollution.

7. CONCLUSION
All these factors call for continuous monitoring and understanding
of best-practices and real-world patterns of social activity around
television to more effectively select, amplify or attenuate signals
of interest to second screen applications in specific contexts. In
this paper, we described early efforts to collect and analyze social
activity data from television viewers, and determine if there was
sufficient quality and quantity of ‘signal’ to create differentiated
attention and engagement metrics for content. We shared early
results that showed ways to boost relevant signals, filter spurious
noise and cancel the negative metrics impacts of otherwise valid
data. The process also helped us identify processing flaws and
intriguing results that warranted study and will be the focus of our
next iteration on this research.
8. REFERENCES
[1] Nielsen Wire. "40% of Tablet and Smartphone Owners Use Them While Watching TV". http://bit.ly/nielsen-wire-multitask
[2] GetGlue Blog. "Analyzing Social Television: Checkins vs. Nielsen". http://bit.ly/getglue-nielsen
[3] Mashable. "Game of Thrones Premiere Crashes GetGlue; Gets 60,000 Comments". http://on.mash.to/mashable-got-premiere
[4] Miso Blog. "Most Miso Users Check in When Shows Start". http://bit.ly/miso-checkins-dist
[5] Twitter Developers. "Twitter on TV: A Producer's Guide". https://dev.twitter.com/media/twitter-tv
[6] Twitter Blog. "Watching Together: Twitter and TV". http://bit.ly/twitter-and-tv
[7] Lochrie, M. and Coulton, P. 2012. Tweeting with the telly on: mobile phones as second screen for TV. In Proceedings of the IEEE Consumer Communications and Networking Conference (Las Vegas, NV, January 14-17, 2012). CCNC '12. IEEE.
[8] Zhao, S., Zhong, L., Wickramasuriya, J. and Vasudevan, V. Human as real-time sensors of social and physical events: A case study of Twitter and sports games. ArXiv preprint, 2011. http://arxiv.org/abs/1106.4300
[9] Lehmann, J., Goncalves, B., Ramasco, J. J. and Cattuto, C. Dynamical classes of collective attention in Twitter. In Proceedings of the 21st International Conference on World Wide Web. WWW '12. ACM, New York, NY.