From Contextual Search to Automatic Content Generation

From Contextual Search to Automatic Content Generation: Scaling Human Editorial Judgment
Larry Birnbaum
Northwestern University Knight Lab
and
Narrative Science Inc.
The challenge
Watson:
Context-driven search
AI + IR
• Our point: Context can drive search
• Our starting context: Text
• Our techniques: Statistical and heuristic
• Our substrate: Search
• Our technology: Automatic query formation, management, result filtering, ranking
• Our goal: Support content creation
Fast, automatic, better than human performance
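A minimal sketch of automatic query formation and result ranking, assuming a simple term-frequency heuristic over the text the user is working on. The stopword list, the function names `form_query` and `rank_results`, and the sample text are hypothetical; the slides say only that Watson's techniques were statistical and heuristic, so this is an illustration and not a description of Watson itself.

```python
import re
from collections import Counter

# Hypothetical stopword list; a real system would use a much larger one.
STOPWORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "it",
             "as", "for", "on", "its", "that"}

TOKEN = re.compile(r"[a-z][a-z\-]+")


def form_query(context_text, max_terms=5):
    """Pick the most frequent non-stopword terms in the context as a query."""
    tokens = TOKEN.findall(context_text.lower())
    counts = Counter(t for t in tokens if t not in STOPWORDS)
    # Sort by descending count, then alphabetically, so output is deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:max_terms]]


def rank_results(query_terms, documents):
    """Order documents by how many query terms they contain; drop those with none."""
    scored = []
    for doc in documents:
        words = set(TOKEN.findall(doc.lower()))
        score = sum(1 for t in query_terms if t in words)
        if score > 0:  # result filtering
            scored.append((score, doc))
    scored.sort(key=lambda pair: -pair[0])  # stable sort keeps input order on ties
    return [doc for _, doc in scored]


context = ("Oracle tried to acquire open-source database maker MySQL. "
           "Oracle sells database software.")
query = form_query(context, max_terms=3)
print(query)
print(rank_results(query, ["Oracle database pricing", "Weather today"]))
```

The point of the sketch is that the user never types a query: the document being written supplies the terms, and filtering and ranking happen before anything is shown.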
Similarity vs. relevance
• The most similar possible document to a document that you have in your hands is… another copy of that document
• What makes a document relevant is that it is similar in certain respects and dissimilar in certain other respects
• Watson relied on noise and the size of the internet to achieve the variation necessary to provide truly relevant documents
• Can we do better?
Another example
Relevance revisited
• Determining what information people see, and in what form and order, is an editorial judgment
• This editorial judgment should be explicit and visible to both designers/engineers and users
• This means a deliberate mechanism
• Which can (only?) be realized by a semantic mechanism
• Specific dimensions of similarity/dissimilarity provide useful information to the user based on his/her context of activity
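One way to picture relevance as "similar in certain respects and dissimilar in others" is the sketch below. It assumes each document has already been reduced to a few named semantic dimensions. The dimension names (`event`, `sector`, `acquirer`), the values, and the `relevance` function are hypothetical and only illustrate the idea of making the editorial judgment explicit.

```python
def relevance(current, candidate, same_on, different_on):
    """Score a candidate document against the current one.

    The candidate should match the current document on every dimension in
    `same_on` and differ from it on every dimension in `different_on`.
    Returns the fraction of those requirements that are satisfied.
    """
    checks = [current.get(d) == candidate.get(d) for d in same_on]
    checks += [current.get(d) != candidate.get(d) for d in different_on]
    return sum(checks) / len(checks) if checks else 0.0


# Hypothetical feature dictionaries for three documents.
current = {"event": "acquisition", "sector": "open source", "acquirer": "Oracle"}
exact_copy = dict(current)
comparable = {"event": "acquisition", "sector": "open source", "acquirer": "IBM"}

# Editorial judgment made explicit: same kind of event in the same sector,
# but involving a different company.
spec = {"same_on": ["event", "sector"], "different_on": ["acquirer"]}

print(relevance(current, exact_copy, **spec))
print(relevance(current, comparable, **spec))
```

Under this specification an exact copy scores lower than a comparable story about a different company, which is the behavior a pure similarity measure cannot give.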
Beyond Broadcast
What is editorial judgment?
• What to look for: Context analysis and query formation
• Where to look for it: Source selection
• How to assess the results: Result filtering, tagging, and ranking
• How to show the results: Presentation
Compare & Contrast
Two comparable stories
Oracle tried to buy open-source MySQL
SAN FRANCISCO--Oracle tried to acquire open-source database maker MySQL, an indication of the
profound changes the software giant is willing to
make as it adapts to the increasingly significant
collaborative programming philosophy.
MySQL Chief Executive Marten Mickos confirmed
the acquisition attempt in an interview at the Open
Source Business Conference here but wouldn't
provide details such as when the approach was made
or how much money Oracle offered.
…
Oracle didn't immediately comment on the
acquisition offer.
Though it is increasingly diversified, Oracle's primary
business is selling its own proprietary database
software. MySQL, in contrast, is a leader among
several companies trying to commercialize rival
open-source products.
…
from CNET News.com
IBM Expands Paid Open Source Strategy
UPDATED: IBM is making a bid on professional
open source with the acquisition of privately held
Gluecode, officials announced Tuesday.
Officials did not discuss financial and operational
details of the merger, the first acquisition made by
Big Blue of an open source company.
Gluecode's operations will be assimilated into
IBM's software group and expand the company's
WebSphere application integration middleware
product line.
Officials plan to offer customers and business
partners Gluecode's application server software and
sell software and support services on top of the
offering, as well as let customers upgrade to IBM
WebSphere products.
…
from internetnews.com
Another example
Compare & Contrast:
The numbers
• Precision: ~70%
• “Recall”: ~60%
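For reference, a short sketch of how precision and recall are conventionally computed. It assumes a known set of relevant items, which is an assumption of the example; the document IDs are made up and are not the data behind the figures above.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved items.

    precision = relevant items retrieved / all items retrieved
    recall    = relevant items retrieved / all relevant items
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical document IDs, for illustration only.
retrieved = ["a", "b", "c", "d"]
relevant = ["a", "b", "c", "e", "f"]
print(precision_recall(retrieved, relevant))
```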
LocalSavvy
LocalSavvy cont’d
Blog search: Spectrum
query keywords
“epistemic” dimension in which the
user is interested
Spectrum results
From technologists and IT entrepreneurs
Focus on the impact on the IT industry
Spectrum results cont’d
From lawyers and law school professors
Focus on the legal issues of the case
Spectrum: The numbers
Tell Me More
Tell Me More: Actors
Tell Me More: Data
Tell Me More: Quotes
Tell Me More: Twitter
And you get back…
• With contextualized search the text box is (usually) gone…
• But the presentation model for the output is the same: a list of results!
• Users don’t want a million documents, or a hundred, or even ten; they want one document that’s written just for them, right now
• They want synthetic documents
News at Seven
Narratives from data
Narratives from data:
Derived features
Narratives from data:
The outline
Narratives from data:
The story
Big Ten Softball & Baseball
Narrative Science:
Human Insight at Machine Scale
• Technology initially developed at Northwestern University
• Company launched in 2010
• A partnership between AI and editorial
• Started in media, now 70% of our customers are in other industries
February 9, 2012, Hockey Recap:
Rio Grande Valley rolls over Laredo, 6-3
The Rio Grande Valley Killer Bees were firing on all cylinders against the Laredo Bucks, and when the final buzzer sounded Killer Bees emerged with a 6-3 win.
Zac Pearson was all over the ice for Rio Grande Valley, as he tallied two goals and one assist in the win. Pearson scored the first of his two goals at 5:23 into the first period to make the score 1-0 Rio Grande Valley. Brandon Campos picked up the assist. Pearson's next tally made the score 2-0 Rio Grande Valley with 12:44 left in the first period. David Marshall assisted on the tally.
The Killer Bees' goal total was higher than their season average. Rio Grande Valley averages two goals per game.
The Killer Bees could not stay out of the penalty box, as the team accrued 17 minutes in penalties during the game. The leading offender was Jason Beeman, who totaled five minutes in penalty time with one major. With 48 shots on target during the contest, Rio Grande Valley exceeded the 22 shots it averages per game this year.
Rio Grande Valley additionally got points from Aaron Lee, who had one goal and one assist, Marshall, who registered one goal and two assists, and Dan Gendur, who racked up one goal and one assist. Dan Nicholls also scored for Rio Grande Valley. Others to record assists for Rio Grande Valley were AJ Mikkelsen, who had two and Adam Bartholomay and Marc-Andre Carre, who each chipped in one.
Special teams units factored heavily in the game's outcome, as there were 14 penalties called on the two teams. The busiest period in the sin bins was the first period, which saw 18 minutes of penalty time combined between the two teams.
Laredo was often in penalty trouble, as it ended with six minors and one major for 17 minutes in penalty time. The leading offender was Justin Styffe, who totaled nine minutes in penalty time with two minors and one major.
The punch line...
The Rio Grande Valley Killer Bees were firing on all cylinders against the Laredo Bucks, and when the final buzzer sounded Killer Bees emerged with a 6-3 win.
More than Sports: Business…
Medicine…
Medicine…
This 48-year-old White female subject in Canada with a
medical history of diarrhea, ovarian cyst, asthma,
depression, hypertension, GI upset, cholecystectomy and
hysterectomy received CP-690,550 for the treatment of
active rheumatoid arthritis. The subject was treated with
CP-690,550 orally, 5 mg twice daily, at a total daily dose
of 10 mg, from 07 Apr 2010 (Study Day 1) to 09 Sep
2010, for a total of 156 days.
…and Unstructured Data Too
NEWT GINGRICH GAINS ATTENTION WITH HOT-BUTTON
TOPICS TAXES, CHARACTER ISSUES
Newt Gingrich received the largest increase in Tweets about him
today. Twitter activity associated with the candidate has shot up
since yesterday, with most users tweeting about taxes and
character issues. Newt Gingrich has been consistently popular on
Twitter, as he has been the top riser on the site for the last four
days. Conversely, the number of tweets about Ron Paul has
dropped in the past 24 hours. Another traffic loser was Rick
Santorum, who has also seen tweets about him fall off a bit.
While the overall tone of the Gingrich tweets is positive, public
opinion regarding the candidate and character issues is trending
negatively. In particular, @MommaVickers says, "Someone needs
to put The Blood Arm's 'Suspicious Character' to a photo
montage of Newt Gingrich. #pimp”.
On the other hand, tweeters with a long reach are on the upside
with regard to Newt Gingrich's take on taxes. Tweeting about
this issue, @elvisroy000 says, "Newt Gingrich Cut Taxes
Balanced Budget, 1n 80s and 90s, Newt experienced
Conservative with values”.
Maine recently held its primary, but it isn't talking about Gingrich.
Instead the focus is on Ron Paul and religious issues.
Why Stories?
Over the past eight quarters, AmgenCorp has
seen a steady increase in both sales and gross
profit. During this period, however, two trends
have become apparent.
While sales have increased, there has been a
steady decrease in margins which should be cause
for concern.
In addition, while we have seen quarter-to-quarter
increases in both sales and gross profit, each of
these metrics is decelerating, indicating that the
business is experiencing a slow down.
Stories as Human-Centered Data Analytics
• Stories embody high-level patterns or themes
that are efficiently communicated and
effectively grasped—e.g., comeback, fade,
low-hanging fruit, etc.
• Stories pick out, connect, and summarize the
critical aspects of a situation
• Stories explain: They convey trends, causes,
and even recommendations
• Stories make data meaningful.
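To make the idea of stories as high-level patterns concrete, here is a small sketch that detects the themes in the "Why Stories?" example (steady growth, shrinking margins, deceleration) and turns them into sentences. All function names, the company name, and the quarterly figures are hypothetical; this is an illustration under simple assumptions, not Narrative Science's method.

```python
def diffs(series):
    """Period-to-period changes."""
    return [b - a for a, b in zip(series, series[1:])]


def is_increasing(series):
    return all(d > 0 for d in diffs(series))


def is_decreasing(series):
    return all(d < 0 for d in diffs(series))


def is_decelerating(series):
    """Growing every period, but by a smaller amount each time."""
    return is_increasing(series) and is_decreasing(diffs(series))


def describe(name, sales, gross_profit):
    """Map derived features (trends) onto sentences."""
    margins = [g / s for g, s in zip(gross_profit, sales)]
    sentences = []
    if is_increasing(sales) and is_increasing(gross_profit):
        sentences.append(
            f"{name} has seen a steady increase in both sales and gross profit.")
    if is_increasing(sales) and is_decreasing(margins):
        sentences.append(
            "While sales have increased, margins have steadily decreased.")
    if is_decelerating(sales) and is_decelerating(gross_profit):
        sentences.append("Growth in both metrics is decelerating.")
    return " ".join(sentences)


# Eight hypothetical quarters of data.
sales = [100, 140, 175, 205, 230, 250, 265, 275]
gross_profit = [50, 68, 83, 95, 104, 110, 114, 116]
print(describe("ExampleCo", sales, gross_profit))
```

The numbers alone show sixteen rising values; the derived features are what let the text say that the rise is slowing and that margins are eroding.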
How Does It Work?
Narrative Science:
Big Data… Made Personal
Investment Research
[Slide graphic: overlapping spreadsheets of quarterly financial data, 12/31/2006 through 12/31/2011, with column headers including DATE, TICKER, SHORT TERM DEBT, LONG TERM DEBT, CASH, CURRENT ASSETS, ACCOUNTS PAYABLE, TOTAL ASSETS, COST OF SALES, DEBT IN CURRENT LIABILITIES, DEBT IN LONG TERM LIABILITIES, DEPRECIATION EXPENSE, DEPRECIATION AND AMORTIZATION, SALES, CAPEX, NET INCOME, INVENTORIES]
While the company appears to be investing in fixed assets, the company's total debt remained level at $20.27 billion to signal the company isn't taking on significantly more debt, but free cash flow fell 80.5% from a year earlier, signaling the company may not be generating sufficient cash flow to cover future capital expenditures.
Sector Reporting
Thursday, October 6, 2011 12:00 PM: HEALTHCARE MIDDAY COMMENTARY:
The Healthcare (XLV) sector underperformed the market in early trading on Thursday. Healthcare stocks trailed the market by 0.4%. So far, the Dow rose 0.2%, the NASDAQ saw growth of 0.8%, and the S&P500 was up 0.4%.
Here are a few Healthcare stocks that bucked the sector's downward trend.
MRK (Merck & Co Inc.) erased early losses and rose 0.6% to $31.26. The company recently announced its chairman is stepping down. MRK stock traded in the range of $31.21 - $31.56. MRK's volume was 86.1% lower than usual with 2.5 million shares trading hands. Today's gains still leave the stock about 11.1% lower than its price three months ago.
LUX (Luxottica Group) struggled in early trading but showed resilience later in the day. Shares rose 3.8% to $26.92. LUX traded in the range of $26.48 - $26.99. Luxottica Group’s early share volume was 34,155. Today's gains still leave the stock 21.8% below its 52-week high of $34.43. The stock remains about 16.3% lower than its price three months ago.
Shares of UHS (Universal Health Services Inc.) are trading at $32.89, up 81 cents (2.5%) from the previous close of $32.08. UHS traded in the range of $32.06 - $33.01…
Business Reporting
Education
Client Portfolio Data
Quill™ for Google Analytics
Narrative Science:
Human Insight at Machine Scale
In 2011: 370,000 Stories
In 2012: 2M+ Stories
Impact:
Knight Lab
• A joint laboratory of the McCormick School of Engineering and the Medill School of Journalism at Northwestern
• Mission: To significantly increase the pace of technology innovation in media and journalism
• Focus on technologies that leverage human editorial judgment at all stages of the content pipeline (intelligent information gathering, content creation, distribution, news experiences and user interaction) at scale
• Supported by a 4-year, $4.2 million grant from the John S. and James L. Knight Foundation
The Local Angle
Twitter Profiling
Twitter Profiling cont’d
Twitter Profiling cont’d
Twitter Profiling cont’d
Book recommendations
from Twitter
Thanks
• Research partner: Kris Hammond
• Students: Nick Allen, Jay Budzik, Andy Crossen, Lisa Gandy, Francisco Iacobelli, Jiahui Liu, Patrick McNally, Nate Nichols, Shawn O’Banion, John Templon, Earl Wagner
• Developers: Scott Bradley, Jenny Wilson
• Medill colleagues: Rich Gordon, Owen Youngman, Jeremy Gilbert, Miranda Mulligan, Joe Germuska
• Support: Knight Foundation, NSF, McCormick Foundation
• And of course the folks at Narrative Science!