Semantic Challenges in Getting Work Done

Yolanda Gil
Information Sciences Institute
and Department of Computer Science
University of Southern California
http://www.isi.edu/~gil
@yolandagil
[email protected]
Keynote at the International Semantic Web Conference (ISWC)
October 21, 2014
USC Information Sciences Institute
Yolanda Gil
ISWC-2014
[email protected]
1!
Outline
1. Managing work
•  Personal to do lists
2. Knowledge rich tasks in science
•  Semantic workflows
3. Collaborative tasks in science
•  Organic data science
4. Closing thoughts
Outline
1. Managing work
•  Personal to do lists (2 semantic challenges)
2. Knowledge rich tasks in science
•  Semantic workflows (2 semantic challenges)
3. Collaborative tasks in science
•  Organic data science (2 semantic challenges)
4. Closing thoughts
To Dos
Email requests
Daily to-dos
To Dos
Monthly travel and deadlines
By annual timeline (e.g. SIGAI)
By project (next 6 months)
CFPs, BAAs
This week
To Do Lists
■  To do lists are pervasive [Kirsh 01; Norman 91]
•  Used by more than 60% of people for personal information [Jones & Thomas 97]
•  Used more than calendars, contact lists, etc.
■  Prior research focused on user studies
•  [Bellotti et al 04; Dey et al 00]
■  Opportunity for assistance
•  Major potential impact on productivity
To Do List Management: Opportunities for Interpretation-based Assistance
To Do List Manager
Automate through agents
Anticipate missing entries & sub-tasks
Group and organize
Get advice from others
Get information from Web
What Are To Do Items Like: FB app
DIET!!!
Shoes? Debate… need sneakers
Be a true Christian
Mafia blog update
Think about more Facebook Money Making Ideas!
Ending of reproductive abilities
Get off my lazy arse and start achieving some stuff
Renew registration for car
Hotel reservation
Pay bills
Apply for financial aid
Print plane tickets
Renew BOFA card before you leave for summer
Mettre les images des captures sur Facebook [French: “Put the screenshot images on Facebook”]
Spinatch and Bashamel Cupcakes
Skriva kod till Simons webbsida [Swedish: “Write code for Simon’s website”]
Buy Air Blades Mk2 (Imperial)
Watch ‘arry pottaaa!
Buy Ablative Shell (Imperial)
Ruff Racing Hyperblack 278 19’’ 275/35 & 245/40 wrapped in NITTO 555R’s
Order more AA Eneloop batteries
Return Fan to Westside via UPS
Return ugly jacket
■  ~1500 items collected from ~325 people
■  Many are not amenable to automation
■  Many could be automated fully or in part
What Are To Do Items Like: Office
■  Unusual structure
•  No verb: “quarterly report to Joe”
•  Abbreviations (also typos): “Sched wed 15 ISI”
•  Questions: “How to extract data for Steve”
■  Many ways to refer to the same task
•  “Meet with John about paper”, “Discuss paper with John”, …
■  Ambiguous references out of context
•  “Meet about paper”
•  “Meet with Raytheon folks”
■  Incomplete task specifications
•  “Schedule meeting with John”
■  Personal items
•  “Walk the dog”
■  Corpus of 2400 to-do entries from users of CALO office assistant
■  77% lack a verb
■  56% missing at least one argument
■  14% could be automated by agents
Opportunities
To Do List Manager
Automate through agents
Anticipate missing entries & sub-tasks
Group and organize
Get advice from others
Get information from Web
Agents
Colleagues
On-Line Resources
Agents: Beamer for CALO and Radar [Gil & Ratnakar AAAI 2008]
■  Match agent capabilities to user’s to dos
To Do
•  Set up discussion with Bill on ISWC paper
Calendar Agent
•  SchMtg <person> <topic> <time> <loc>
A2
•  Action2 <x1> <x2>
A3
•  Action3 <z1>
Agents: Beamer for CALO and Radar [Gil & Ratnakar AAAI 2008]
■  Use paraphrase patterns of agent capabilities to match them to user’s to dos
Calendar Agent
•  SchMtg <person> <topic> <time> <loc>
To Do
•  Set up discussion with Bill on ISWC paper
Match
PARAPHRASES:
•  Set up discussion with <person> on <topic>
•  Meet about <topic> with <person>
•  …
A2
•  Action2 <x1> <x2>
A3
•  Action3 <z1>
Agents: Beamer for CALO and Radar [Gil & Ratnakar AAAI 2008]
■  Use paraphrase patterns of agent capabilities to match them to user’s to dos
Calendar Agent
•  SchMtg <person> <topic> <time> <loc>
To Do
•  Set up discussion with Bill on ISWC paper
Match
PARAPHRASES:
•  Set up discussion with <person> on <topic>
•  Meet about <topic> with <person>
•  …
■  Evaluation with CALO office assistant corpus
•  86.7% accuracy in detecting relevance to agents
•  Only 0.2 to 0.4 edits needed to set up task parameters
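The paraphrase-pattern idea on the Beamer slides can be illustrated with a minimal sketch. It assumes the patterns are plain strings with `<slot>` placeholders and translates them into regular expressions; this is only an illustration, not the matching method used in Beamer. The `PARAPHRASES` table and the function names `pattern_to_regex` and `match_todo` are hypothetical.

```python
import re

# Hypothetical paraphrase patterns for one agent capability (SchMtg).
# Slot names come from the slide; the regex translation is an assumption.
PARAPHRASES = {
    "SchMtg": [
        "set up discussion with <person> on <topic>",
        "meet about <topic> with <person>",
    ],
}


def pattern_to_regex(pattern):
    """Turn 'meet about <topic> with <person>' into a regex with named groups."""
    regex = ""
    for part in re.split(r"(<\w+>)", pattern):
        if part.startswith("<") and part.endswith(">"):
            # A slot: capture lazily under the slot's name.
            regex += "(?P<%s>.+?)" % part[1:-1]
        else:
            # Literal text between slots.
            regex += re.escape(part)
    return re.compile("^" + regex + "$", re.IGNORECASE)


def match_todo(todo):
    """Return (capability, slot bindings) for the first matching pattern, else None."""
    text = " ".join(todo.split())  # normalize whitespace
    for capability, patterns in PARAPHRASES.items():
        for pattern in patterns:
            m = pattern_to_regex(pattern).match(text)
            if m:
                return capability, m.groupdict()
    return None


print(match_todo("Set up discussion with Bill on ISWC paper"))
```

Slots that the to-do does not mention (here `<time>` and `<loc>`) stay unbound, which is one way to read the reported need for a small number of user edits to finish setting up task parameters.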
The Need for Semantics
Knowledge
To Do List Manager
Automate through agents
Anticipate missing entries & sub-tasks
Group and organize
Get advice from others
Get information from Web
Paraphrase Game [Chklovski 2005]
Common (Sense) Knowledge [Chklovski and Gil, K-CAP 2005, AAAI 2005]
Learner2
700,000+ statements collected from over 3,000 users
VerbOcean [Chklovski and Pantel, IJCNLP 2005]
(a) Relationships between tasks
“Lunch” – likely duration 1hr
“Presentation” – likely duration 10mins
Managing To Dos through Colleagues: Social Task Networks [Groth et al 2010]
To Do app for FB
■  Social task networks
•  People linked to their to-dos
•  To-dos linked to their subtasks
•  Tasks are linked to URIs which link to web resources
■  Open Task Repository using Linked Data Principles
Nodes: Person, To-do, Web resource, Shared to-do, Shared technique
Links: Person | Person, To-do | Resource, To-do | Technique, Task | Subtask
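The link types of a social task network (people to to-dos, to-dos to subtasks, tasks to web resources) can be sketched as a small set of subject-predicate-object triples. This is a minimal sketch in plain Python. The identifiers, predicate names (`hasToDo`, `hasSubtask`, `hasResource`) and the example URL are hypothetical and do not come from the Open Task Repository.

```python
# Hypothetical identifiers and predicate names; only the link types
# (person/to-do, to-do/subtask, to-do/resource) come from the slide.
triples = [
    ("person:alice", "hasToDo", "todo:plan-trip"),
    ("person:bob", "hasToDo", "todo:plan-trip"),  # a shared to-do
    ("todo:plan-trip", "hasSubtask", "todo:book-hotel"),
    ("todo:book-hotel", "hasResource", "http://example.org/hotel-tips"),
]


def objects(subject, predicate):
    """All objects linked from `subject` by `predicate`."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def subjects(predicate, obj):
    """All subjects linked to `obj` by `predicate`."""
    return [s for s, p, o in triples if p == predicate and o == obj]


def resources_for(todo):
    """Collect web resources linked to a to-do or to any of its subtasks."""
    found = list(objects(todo, "hasResource"))
    for sub in objects(todo, "hasSubtask"):
        found.extend(resources_for(sub))
    return found


print(subjects("hasToDo", "todo:plan-trip"))
print(resources_for("todo:plan-trip"))
```

Following links in both directions is what lets one person's to-do surface the colleagues who share it and the resources attached to its subtasks.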
Managing To Dos through On-Line Resources [Vrandecic and Gil IUI 2011]
Some Readings
http://www.isi.edu/~gil/publications.php
■  Yolanda Gil, Varun Ratnakar, Timothy Chklovski, Paul T. Groth, Denny Vrandecic: “Capturing Common Knowledge about Tasks: Intelligent Assistance for To Do Lists.” ACM Transactions on Interactive Intelligent Systems, 2(3). 2012.
■  Hans Chalupsky, Yolanda Gil, Craig A. Knoblock, Kristina Lerman, Jean Oh, David V. Pynadath, Thomas A. Russ, Milind Tambe: “Electric Elves: Agent Technology for Supporting Human Organizations.” AI Magazine 23(2): 11-24 (2002)
A Semantic Challenge: Managing Personal To Dos
To-Do list interfaces
To Do List Manager
Agents/services, other people, advice web sites
A Semantic Challenge: Coordinating To Dos of Different People
[Diagram: four To Do List Manager boxes]
Semantic Challenges in Getting Work Done
■  To dos
•  Managing personal to dos
•  Managing coordinated to dos
■  Knowledge rich tasks in science
■  Open science
Data-Intensive Computing in Science
The Bottleneck is the Process, Not the Data!
■  Today: significant human bottleneck in the scientific process
What is the state of the art?
What is a good problem to work on?
What is a good experiment to design?
What data should be collected?
What is the best way to analyze the data?
What are the implications of the experiments?
What are appropriate revisions of current models?
■  Need to help machines understand the scientific research process in order to assist scientists
•  Semantics can be a game changer
Text Extraction in Hanalyzer (L. Hunter, U. Colorado)
Text extraction from publications
Semantic integration of biomedical databases
Generation of interesting new hypotheses
Robot Scientist [King et al 2009]
Intelligent Science Assistants
What is the state of the art?
What is a good problem to work on?
What is a good experiment to design?
What data should be collected?
What is the best way to analyze the data?
What are the implications of the experiments?
What are appropriate revisions of current models?
Timely Analysis of Environmental Data [Gil et al ISWC 2011]
With Tom Harmon (UC Merced), Craig Knoblock and Pedro Szekely (ISI)
California’s Central Valley:
•  Farming, pesticides, waste
•  Water releases
•  Restoration efforts
A Semantic Workflow
DailySensorData isa Hydrolab_Sensor_Data
  siteLongitude rdf:datatype="float"
  siteLatitude rdf:datatype="float"
  dateStart rdf:datatype="date"
  forSite rdf:datatype="string"
  numberOfDayNights rdf:datatype="int"
  avgDepth rdf:datatype="float"
  avgFlow rdf:datatype="float"
Owens-Gibbs Model
O’Connor-Dobbins Model
Churchill Model
Semantic Workflows in Wings [Gil et al 10][Gil et al 09][Kim & Gil et al 08][Kim et al 06]
www.wings-workflows.org
■  Workflows are augmented with semantic constraints
•  Each workflow constituent has a variable associated with it
–  Workflow components, arguments, datasets
•  Constraints are used to restrict workflow variables
•  Can define abstract classes of components
–  Concrete components model exec. codes
■  Workflow reasoners propagate and use semantic constraints
■  Uses semantic web standards: OWL/RDF, SPARQL
■  Compilation of workflows to scalable execution infrastructure
Semantic Components in WINGS [Gil iEMSs 2014]
Classes of models/components
I/O Data constraints
Use constraints
;; Depth must be over .6m
[ CMInvalidity1:
  (?c rdf:type pcdom:ReaerationCMClass)
  (?c pc:hasInput ?idv)
  (?idv pc:hasArgumentID 'InputParameters')
  (?idv dcdom:depth ?depth)
  le(?depth '0.61')
  ->
  (?c pc:isInvalid 'true')]
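The use constraint above can be read as a plain predicate. The sketch below mirrors the logic of `CMInvalidity1` in Python, on the assumption that `le` means "less than or equal", as in Jena-style rule builtins. The `InputParameters` dataclass and the `is_invalid` function are hypothetical names for illustration and are not part of WINGS.

```python
from dataclasses import dataclass

MIN_DEPTH_M = 0.61  # threshold taken from the rule on the slide


@dataclass
class InputParameters:
    depth: float  # average depth, assumed to be in metres


def is_invalid(component_class, params):
    """Mirror of CMInvalidity1: a component of class ReaerationCMClass
    is marked invalid when its input depth is <= 0.61."""
    return component_class == "ReaerationCMClass" and params.depth <= MIN_DEPTH_M


# The avgDepth value 4.523957 appears in the sensor metadata on a later slide.
print(is_invalid("ReaerationCMClass", InputParameters(depth=4.523957)))
print(is_invalid("ReaerationCMClass", InputParameters(depth=0.5)))
```

Expressing the constraint as a rule rather than burying it in model code lets a workflow reasoner rule out a model before anything is executed.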
WINGS Specializes Workflow Based on Characteristics of Daily Data
<dcdom:Hydrolab_Sensor_Data rdf:ID="Hydrolab-CDEC-04272011">
  <dcdom:siteLongitude rdf:datatype="float">-120.931</dcdom:siteLongitude>
  <dcdom:siteLatitude rdf:datatype="float">37.371</dcdom:siteLatitude>
  <dcdom:dateStart rdf:datatype="date">2011-04-27</dcdom:dateStart>
  <dcdom:forSite rdf:datatype="string">MST</dcdom:forSite>
  <dcdom:numberOfDayNights rdf:datatype="int">1</dcdom:numberOfDayNights>
  <dcdom:avgDepth rdf:datatype="float">4.523957</dcdom:avgDepth>
  <dcdom:avgFlow rdf:datatype="float">2399</dcdom:avgFlow>
</dcdom:Hydrolab_Sensor_Data>
1) Parameter settings
2) Choice of models
Owens-Gibbs Model
O’Connor-Dobbins Model
Churchill Model
3) Metadata of new results
<dcdom:Metabolism_Results rdf:ID="Metabolism_Results-CDEC-04272011">
  <dcdom:siteLongitude rdf:datatype="float">-120.931</dcdom:siteLongitude>
  <dcdom:siteLatitude rdf:datatype="float">37.371</dcdom:siteLatitude>
  <dcdom:dateStart rdf:datatype="date">2011-04-27</dcdom:dateStart>
  <dcdom:forSite rdf:datatype="string">MST</dcdom:forSite>
  <dcdom:numberOfDayNights rdf:datatype="int">1</dcdom:numberOfDayNights>
  <dcdom:avgDepth rdf:datatype="float">4.523957</dcdom:avgDepth>
  <dcdom:avgFlow rdf:datatype="float">2399</dcdom:avgFlow>
</dcdom:Metabolism_Results>
WINGS Dynamically Selects Appropriate Model Based on Daily Sensor Readings
Churchill model
O’Connor-Dobbins model
Owens-Gibbs model
WINGS Workflow Reasoners
Input data for decision tree modelers (eg ID3) must be discrete
?Dataset4 dcdom:isDiscrete true
■  Key idea: Skeletal planning, where constraints for each component are propagated through a fixed workflow structure (the skeleton)
■  Phase 1: Goal Regression
•  Starting from final products, traverse workflow backwards
•  For each node, query for constraints on inputs
■  Phase 2: Forward Projection
•  Starting from input datasets, traverse workflow forwards
•  For each node, query for constraints on outputs
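The two phases can be sketched over a small, fixed skeleton. This is a simplified illustration and not the WINGS reasoner: the skeleton is linear, constraints are plain dictionaries, and the two rule functions (`input_constraints`, `output_constraints`) are hypothetical stand-ins for the component catalog rules shown in the worked example on the following slides.

```python
# Fixed workflow skeleton: (component, input variable, output variable).
# Component and variable names echo the slides; the rule functions below
# are simplified assumptions, not the catalog rules themselves.
SKELETON = [
    ("Normalize", "TrainingData", "Dataset3"),
    ("RandomSampleN", "Dataset3", "Dataset4"),
    ("ID3Modeler", "Dataset4", "Model5"),
]


def input_constraints(component, output_facts):
    """Backward rule: what must hold of the input, given facts about the output."""
    if component == "ID3Modeler":
        return {"isDiscrete": True}  # decision tree modelers need discrete data
    return dict(output_facts)  # sampler and normalizer pass metrics back unchanged


def output_constraints(component, input_facts):
    """Forward rule: what holds of the output, given facts about the input."""
    return dict(input_facts)


def goal_regression(skeleton):
    """Phase 1: walk the skeleton backwards, collecting constraints on inputs."""
    facts = {}
    for component, inp, out in reversed(skeleton):
        facts.setdefault(inp, {}).update(input_constraints(component, facts.get(out, {})))
    return facts


def forward_projection(skeleton, facts):
    """Phase 2: walk the skeleton forwards, collecting constraints on outputs."""
    for component, inp, out in skeleton:
        facts.setdefault(out, {}).update(output_constraints(component, facts.get(inp, {})))
    return facts


facts = forward_projection(SKELETON, goal_regression(SKELETON))
for variable, constraints in facts.items():
    print(variable, constraints)
```

Because the structure is fixed, each phase is a single pass over the skeleton; the reasoning effort goes into the per-component rules rather than into search.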
Example (Step 1 of 5)
Rule in Component Catalog:
[modelerSpecialCase2:
  (?c rdf:type pcdom:ID3ModelerClass)
  (?c pc:hasInput ?idv)
  (?idv pc:hasArgumentID "trainingData")
  -> (?idv dcdom:isDiscrete "true"^^xsd:boolean)]
?Dataset4 dcdom:isDiscrete true
Model5, Model6, Model7
Example (Step 2 of 5)
Rule in Component Catalog:
[samplerTransfer:
  (?c rdf:type pcdom:RandomSampleNClass)
  (?c pc:hasOutput ?odv)
  (?odv pc:hasArgumentID "randomSampleNOutputData")
  (?c pc:hasInput ?idv)
  (?idv pc:hasArgumentID "randomSampleNInputData")
  (?odv ?p ?val)
  (?p rdfs:subPropertyOf dc:hasMetrics)
  -> (?idv ?p ?val)]
?Dataset3 dcdom:isDiscrete true
?Dataset4 dcdom:isDiscrete true
Model5, Model6, Model7
Example (Step 3 of 5)
Rule in Component Catalog:
[normalizerTransfer:
  (?c rdf:type pcdom:NormalizeClass)
  (?c pc:hasOutput ?odv)
  (?odv pc:hasArgumentID "normalizeOutputData")
  (?c pc:hasInput ?idv)
  (?idv pc:hasArgumentID "normalizeInputData")
  (?odv ?p ?val)
  (?p rdfs:subPropertyOf dc:hasMetrics)
  -> (?idv ?p ?val)]
?TrainingData dcdom:isDiscrete true
?Dataset3 dcdom:isDiscrete true
?Dataset4 dcdom:isDiscrete true
Model5, Model6, Model7
Example (Step 4 of 5)
Rule in Component Catalog:
[modelerTransferFwdData:
  (?c rdf:type pcdom:ModelerClass)
  (?c pc:hasOutput ?odv)
  (?odv pc:hasArgumentID "outputModel")
  (?c pc:hasInput ?idv)
  (?idv pc:hasArgumentID "trainingData")
  (?idv ?p ?val)
  (?p rdfs:subPropertyOf dc:hasDataMetrics)
  notEqual(?p dcdom:isSampled)
  -> (?odv ?p ?val)]
?TrainingData dcdom:isDiscrete true
?Dataset3 dcdom:isDiscrete true
?Dataset4 dcdom:isDiscrete true
?Model5 dcdom:isDiscrete true
?Model6 dcdom:isDiscrete true
?Model7 dcdom:isDiscrete true
Example (Step 5 of 5)
Rule in Component Catalog:
[voteClassifierTransferDataFwd10:
  (?c rdf:type pcdom:VoteClassifierClass)
  (?c pc:hasInput ?idvmodel1)
  (?idvmodel1 pc:hasArgumentID "voteInput1")
  (?c pc:hasInput ?idvmodel2)
  (?idvmodel2 pc:hasArgumentID "voteInput2")
  (?c pc:hasInput ?idvmodel3)
  (?idvmodel3 pc:hasArgumentID "voteInput3")
  (?c pc:hasInput ?idvdata)
  (?idvdata pc:hasArgumentID "voteInputData")
  (?idvmodel1 dcdom:isDiscrete ?val1)
  (?idvmodel2 dcdom:isDiscrete ?val2)
  (?idvmodel3 dcdom:isDiscrete ?val3)
  equal(?val1, ?val2), equal(?val2, ?val3)
  -> (?idvdata dcdom:isDiscrete ?val1)]
?TrainingData dcdom:isDiscrete true
?Dataset3 dcdom:isDiscrete true
?Dataset4 dcdom:isDiscrete true
?Model5 dcdom:isDiscrete true
?Model6 dcdom:isDiscrete true
?Model7 dcdom:isDiscrete true
?TestData dcdom:isDiscrete true
WINGS Workflow Reasoners: Result
?TrainingData dcdom:isDiscrete true
?Dataset3 dcdom:isDiscrete true
?Dataset4 dcdom:isDiscrete true
?Model5 dcdom:isDiscrete true
?Model6 dcdom:isDiscrete true
?Model7 dcdom:isDiscrete true
?TestData dcdom:isDiscrete true
WINGS Automatic Workflow Generation Algorithm [Gil et al JETAI 2011]
Work with P. Gonzalez (UCM) and Jihie Kim (ISI)
Workflows with S. McWeeney & C. Zhang (OHSU)
“Pay-as-you-go” semantics
Input: unified well-formed req.
Seed workflow from request → seeded workflows
Find input data requirements → binding-ready workflows
Data source selection → bound workflows
Parameter selection → configured workflows
Workflow instantiation → workflow instances
Workflow grounding → ground workflows
Workflow ranking → top-k workflows
Workflow mapping → executable workflows
Workflows
■  Workflow systems
•  [Goble et al 2007]
•  [Ludaescher et al 2007]
•  [Freire et al 2008]
•  [Mattmann et al 2007]
•  [Mesirov et al 2009]
•  [Dinov et al 2009]
■  Workflow representations
•  [Moreau et al 2010]
•  [IBM/MSR 2002]
Semantic Process Models
■  Composition from first principles
•  [McIlraith & Son KR 2002] [Sohrabi et al ISWC 2006] [Sohrabi & McIlraith ISWC 2009] [Sohrabi & McIlraith ISWC 2010]
•  [McDermott AIPS 2002]
•  [Kuter et al ISWC 2004] [Sirin et al JWS 2005] [Kuter et al JWS 2005] [Lin et al ESWC 2008]
•  [Lecue ISWC 2009]
•  [Calvanese et al IEEE 2008]
•  [Bertolli et al ICAPS 2009]
•  [Li et al ISSC 2011]
■  Representations
•  [Burstein et al ISWC 2002] [Martin et al ISWC 2007]
•  [Domingue & Fensel IEEE IS 2008] [Dietze et al IJWSR 2011] [Dietze et al ESWC 2009]
•  [Fensel et al 2011] [Vitvar et al ESWC 2008] [Roman et al AO 2005]
Semantic Descriptions of Software Components in Geosciences
Work with C. Duffy (PSU), S. Peckham (CU), C. Mattmann (JPL), J. Howison (UT)
CSDMS Standard Names [Peckham iEMSs 2014]
http://csdms.colorado.edu/
Benefits of Semantic Workflows: 1) Automatic Workflow Elaboration [Gil et al WORKS’13]
Workflows developed with Y. Liu (USC) and C. Mattmann (JPL)
LDA
Online LDA
Parallel LDA
Benefits of Semantic Workflows: 2) Access to Data Analytics Expertise
Science, Dec 2011
Capturing Expertise through Workflows [Hauder et al e-Science 2011]
Workflows for text analytics, joint work with Yan Liu (USC) and Mattheus Hauder (TUM)
Naïve Approach
Expert Approach
Feature selection
Capturing Expertise [Gil et al 2012]
Work with Christopher Mason (Cornell University)
Workflows for population genomics
Association Tests
Variant Discovery from Resequencing
CNV Detection
Transmission Disequilibrium Test
Benefits of Semantic Workflows: 3) Saving Time Through Reuse [Sethi et al MM’13]
Work with Ricky Sethi and Hyujoon Jo of USC
Saving Time through Reuse [Garijo et al FGCS’13]
Work with D. Garijo and O. Corcho (UPM), P. Alper, K. Belhajjame, and C. Goble (UM)
Result
•  “Scientists and engineers spend more than 60% of their time just preparing the data for model input or data-model comparison” (NASA A40)
Measuring Time Savings with “Reproducibility Maps” [Garijo et al PLOS CB12]
Work with D. Garijo of UPM and P. Bourne of UCSD
■  2 months of effort in reproducing published method (in PLoS’10)
■  Authors’ expertise was required
Comparison of ligand binding sites
Comparison of dissimilar protein structures
Graph network generation
Molecular Docking
Benefits of Semantic Workflows: 4) Interoperability in a Workflow Ecosystem [Garijo et al 2014]
Work with D. Garijo and O. Corcho of UPM
[Figure: workflow ecosystem diagram. Tools: LONI Pipeline (workflow execution), Wings (workflow generation), Wexp (workflow browsing), Apache OODT (workflow execution), Organic data science wiki (workflow documentation), Pegasus/Condor (workflow mapping and execution), FragFlow (workflow mining), provenance visualization (e.g., Prov-o-viz), RO Model application, D-PROV application. Repositories: (.pipe); (PROV, P-Plan, OPMW). Legend: ecosystem tool, other workflow tool, repositories, working converter, planned converter, SPARQL construct converter, current dataflow, planned dataflow.]
Benefits of Semantic Workflows: 4) Interoperability in a Workflow Ecosystem [Garijo et al 2014]
Work with D. Garijo and O. Corcho of UPM
Workflows are:
-  Described with semantic metadata
-  Published as Web objects (linked open data)
-  Imported by systems with diverse functions (eg editing, execution, provenance browsing, workflow mining, etc)
[Figure: workflow ecosystem diagram, as on the previous slide]
Some Readings
http://www.isi.edu/~gil/publications.php
■  Yolanda Gil: “Intelligent Workflow Systems and Provenance-Aware Software.” Proceedings of the Seventh International Congress on Environmental Modeling and Software (iEMSs), San Diego, CA, 2014.
■  Yolanda Gil: “From Data to Knowledge to Discoveries: Artificial Intelligence and Scientific Workflows.” Scientific Programming 17(3), 2009.
■  Ewa Deelman, Chris Duffy, Yolanda Gil, Suresh Marru, Marlon Pierce, and Gerry Wiener: “EarthCube Report on a Workflows Roadmap for the Geosciences.” National Science Foundation, Arlington, VA. 2012.
A Semantic Challenge: Automatic Paper Generator
■  Capture knowledge about analytic methods
•  Run workflows in existing data repositories
•  Report new findings
A Semantic Challenge: A Web of Semantic Workflows/Processes
Assist people to:
■  Share
■  Copy
■  Reuse
■  Adapt
■  Remix
■  Update
■  Certify
■  Review
■  …
“Pay-as-you-go” semantics
Semantic Challenges in Getting Work Done
■  To dos
•  Managing personal to dos
•  Managing coordinated to dos
■  Knowledge rich tasks in science
•  Automatic paper generator
•  A Web of semantic workflows/processes
■  Open science
Collaboration to Develop Workflows [Gil et al 2007]
Slide from T. Jordan of USC and SCEC
High-Level Workflow: Seismic Hazard Model
Seismicity
Paleoseismology
Local site effects
Faults
Geologic structure
Stress transfer
Crustal motion
Crustal deformation
Seismic velocity structure
Rupture dynamics
Understanding the “Age of Water”
Work with P. Hanson (UWisc), C. Duffy (PSU), and J. Read (USGS)
Research Soil Survey
Lidar-derived numerical mesh
Linking Catchment Model-Data Assets Supported by NSF GEO-CZO
High Resolution Vegetation Mapping
Mapping Bedrock GP Radar
Linking Lake Model-Data Assets Supported by NSF BIO-GLEON, USGS CIDA
High-resolution sensor network data
Models of lake hydrodynamics and water quality
Read JS, et al. 2014. Ecological Modelling. 291C: 142-150. doi:10.1016/j.ecolmodel.2014.07.029
A New Kind of Collaborative Platform
■  Taxonomy of Science Communities [Bos et al 2007]
Shared Instruments: NEON
Community Data Systems: PDB
Open Community Contribution Systems: Zooniverse
Virtual Communities of Practice: GLEON
Virtual Learning Communities: VIVO
Distributed Research Centers: ENCODE
Community Infrastructure Projects: CSDMS
■  Need a platform to support science collaborations that require:
•  Significant organization and coordination
•  Maintaining a community over the longer term
•  Growing the community based on unanticipated needs
Organic Data Science
Work with F. Michel and M. Hauder of TUM
■  Organic data science is a novel approach to on-line scientific collaboration that supports:
•  Self-organization of communities by enabling any user to specify and decompose tasks
•  On-line community support by incorporating social sciences principles and best practices
•  An open science process by capturing new kinds of metadata about the collaboration that give necessary context to newcomers
Self-Organization through Task Decomposition
■  Many tasks involved
■  Necessary data resides in different repositories
■  Different people understand different kinds of data
•  Where it is
•  How to use it
■  If other data needed, unclear who has it
Social Principles for Online Communities
Social Principles: Some Examples
■  Starting communities, e.g.:
•  Organize content, people, and activities into subspaces
•  Inactive tasks should have “expected active times”
■  Encouraging contributions, e.g.:
•  Simple tasks with challenging goals are easier to comply with
•  Publicize that others have complied with requests
■  Encouraging commitment, e.g.:
•  Interdependent tasks increase commitments and reduce conflict
■  Dealing with newcomers, e.g.:
•  Design common learning experiences for newcomers
•  Provide sandboxes while they are learning
Opening Science: Polymath [Nielsen, Gowers 09]
Organic Data Science
=> Task-oriented self-organizing on-line communities for open collaboration in science
■  Organic data science is a novel approach to on-line scientific collaboration that supports:
•  Self-organization of communities by enabling any user to specify and decompose tasks
•  On-line community support by incorporating social sciences principles and best practices
•  An open science process by capturing new kinds of metadata about the collaboration that give necessary context to newcomers
Self-Organization through Dynamic Task Decomposition
Organic Data Science: Contributors
Data
Models
Workflows
Training Newcomers
Reader
Participant
Owner
What Features Are Used to Manage Tasks?
How Do Users Find Relevant Tasks?
Are Users Collaborating?
A 52% of tasks are viewed by more than one person
B 72% of tasks have more than one person signed up
C 19% of tasks have more than one person editing metadata
D 11% of tasks have more than one person editing content
What Does the Social Network of Collaborators Look Like?
■  Network of users (nodes) linked by shared tasks
■  Links across all users
■  Two distinct subgroups
1. Water
2. Software
A Semantic Challenge: Email-less Coordination for Projects
A Semantic Challenge: Open Science Processes
From: http://www.ncdc.noaa.gov/paleo/metadata/noaa-coral-1865.html
{{ #ask: [[Is a::dataset]]
| ?Domain=geochemistry
| ?Archive
| ?MeasurementMaterial
| ?MeasurementStandard
| ?MeasurementUnits}}
Datasets linked to download sites
Metadata properties created on the fly
Datasets linked to locations
Projects linked to datasets
Entities linked to linked data
Metadata added by different volunteers
Credits
Information sources are documented
People linked to projects
Semantic Challenges in Getting Work Done
■  To dos
•  Managing personal to dos
•  Managing coordinated to dos
■  Knowledge rich tasks in science
•  Automatic paper generator
•  A Web of semantic workflows/processes
■  Open science
•  Email-less coordination of projects
•  Open science processes
http://www.isi.edu/~gil
http://www.wings-workflows.org
http://www.organicdatascience.org
http://discoveryinformaticsinitiative.org
“We need bigger glasses and more hands in the water” – J. Tarter, SETI Institute
Discovery Informatics: Knowledge-Rich Science Infrastructure
[Figure: first page of an article in the Insights | Perspectives section. Published by AAAS. ILLUSTRATION: P. HUEY/]
ARTIFICIAL INTELLIGENCE
Amplify scientific discovery with artificial intelligence
Many human activities are a bottleneck in progress
By Yolanda Gil,1 Mark Greaves,2 James Hendler,3* Haym Hirsh4
1 Information Sciences Institute, University of Southern California, Marina del Rey, CA 90292, USA. 2 Pacific Northwest National Laboratory, Richland, WA 99354, USA. 3 Information Technology and Web Science, Rensselaer Polytechnic Institute, Troy, NY 12203, USA. 4 Cornell University, Ithaca, NY 14850, USA. *E-mail: [email protected]
Technological innovations are penetrating all areas of science, making
predominantly human activities a
principal bottleneck in scientific progress while also making scientific advancement more subject to error and
harder to reproduce. This is an area where a
new generation of artificial intelligence (AI)
systems can radically transform the practice of scientific discovery. Such systems are
showing an increasing ability to
POLICY automate scientific data analysis and discovery processes, can
searchInformation
systematicallySciences
and correctly
through
USC
Institute
hypothesis spaces to ensure best results, can
autonomously discover complex patterns in
data, and can reliably apply small-scale sci-
“AI-based systems that can
represent hypotheses … can
reduce the error-prone human
bottleneck in … discovery.”
trans
navig
17enthu
1
On
to me
are e
mach
wher
defin
sured
such
for th
field.
vance
gies i
the im
An
of th
mark
non-A
datab
atten
ers a
fictio
rathe
logica
of he
be a
and b
Th
lenge
gies.
brain
the b
emag.org on October 9, 2014
SCIENCE sciencemag.org
rive scientific insight from data at this scale,
cesses and assist the human scientist. Destandard methods include applying dimenvelopment of the explicit representations of
sionality-reduction techniques and
scientific
which
they are
based
10 feature
OCTOBER
2014processes
• VOL on
346
ISSUE
6206
extractors to create high-speed classifiers
is complex. When successful, the computer
based on machine-learning approaches, such
can become a real (although junior) parby
AAAS
beyond
current
as Bayesian networks or support-vector ma- information-finding
ticipant inPublished
the science
process,
doingsearch
what
limitations.
chines. Because the phenomena under study
it does best: applying algorithmic methods
Webringing
can project
a not-so-distant
future
often exist in nonstationary environments
and
knowledge
to bear in a consiswhere
“intelligent
science
assistant”
or in contexts with only small quantities of
tent, systematic, and complete manner. prolabeled data that can be used for training— grams identify and summarize relevant
described
across
the worldwide
complex, unsupervised, and reinforcement research
A VIRTUOUS
CIRCLE.
Developing
systems
spectrum
blogs, preprint
armachine-learning techniques are critical for multilingual
like these is not
just anofexercise
in AI applichives,
and
discussion
forums;
find
or
gendata analysis. These types of approaches are
cation—it affects the direction of AI research.
new hypotheses
that might
confirm
or
being used in recent projects in data-rich ar- erate
Addressing
real challenges
of science
pushes
ongoing
work; areas,
and even
rerun
eas as diverse as chemical structure predic- conflict
the AI with
envelope
in many
including
increased
the
numbers
of
interested
particiold
analyses
when
a
new
computational
tion, pathway analysis and identification in
knowledge representation, automatic inferpants;
steady ofexponential
becomes
available.
Aided by
such
systemsMoore’s
biology,law
the and
processing
large-scale method
ence, process
reasoning,
hypothesis
generaincreases
in
computing
power;
and
expoa
system,
the
scientist
will
focus
on
more
geophysics data, and others.
tion, natural language processing, machine
nential
increases
and broad
the creative
aspects interaction,
of research, and
withina
Another,
more in,
ambitious
classavailability
of intelli- of
learning,
collaborative
of,
relevant
data
in
volumes
never
previously
larger
fraction
of
the
routine
work
left
to
the
gent systems is being developed under the
telligent user interfaces. This interaction
seen.
scientificScience
efforts that
have leverartificially intelligent assistant.
rubricThose
of Discovery
or, increasingly,
aged
AI advances
have largely
harnessed
soNew types of intelligent systems that can
Discovery
Informatics
(10). These
systems
phisticated
techniques
to
enhance scientific efforts in this manner are
enhance themachine-learning
intelligent assistants
described
create
correlative
predictions
from
large
sets
earlier with the capability to attack scientific transitioning from academic and industrial
of
“bigthat
data.”
Such work
aligns
wellincreasing
with the
research laboratories. A term gathering poptasks
combine
rote work
with
current
needs
of
petaand
exascale
science.
amounts of adaptivity and freedom. These ularity for systems that intelligently process
However,
AI encoded
has far broader
capacity
to aconline information beyond search is “cognisystems use
knowledge
of scientific
domains and processes in order to assist
with tasks that previously required human
knowledge and reasoning. In fact, several
sciences have significant investments in the
creates a virtuous circle where advances in
84!
Yolanda
Gil
ISWC-2014
[email protected]
representation of vast amounts of scientific
science go hand in hand
with advances in AI.
knowledge and are poised to explore new inThis virtuous circle can only work well if baltelligent systems that exploit that knowledge
anced and well oiled.
http://www.discoveryinformaticsinitiative.org
PSB Workshop (January 2015)
KDD Workshop (August 2014):
http://ailab.ist.psu.edu/idkdd14/
AAAI Workshop (July 2014):
http://discoveryinformaticsinitiative/diw2014
AAAI Fall Symposium (Nov 2013):
http://discoveryinformaticsinitiative/dis2013
AAAI Fall Symposium (Nov 2012):
http://discoveryinformaticsinitiative/dis2012
Microsoft eScience Summit (Aug 2012)
Workshop on Web Observatories
for Discovery Informatics
PSB Workshop (Jan 2013):
on Computational Challenges of
Mass Phenotyping
NSF Workshop (Feb 2012):
http://discoveryinformaticsinitiative/diw2012
A View from Biomedical Research:
The NIH Big Data To Knowledge (BD2K) Initiative!
“Discovery informatics is in its infancy. Search engines are grappling with the need for
deep search, but it is doubtful they will fulfill the needs of the biomedical research
community when it comes to finding and analyzing the appropriate datasets. Let me
cast the vision in a use case. As a research group winds down for the day algorithms
take over, deciphering from the days on-line raw data, lab notes, grant drafts etc.
underlying themes that are being explored by the laboratory (the lab’s digital assets).
Those themes are the seeds of deep search to discover what is relevant to the lab that
has appeared since a search was last conducted in published papers, public data sets,
blogs, open reviews etc. Next morning the results of the deep search are presented to
each member as a personalized view for further post processing. We have a long way
to go here, but programs that incite groups of computer, domain and social scientists to
work on these needs will move us forward.”
A View from Geosciences:
The NSF EarthCube Initiative!
Data
Workflows Semantics Governance
http://www.earthcube.org/
What Might the Future Look Like?!
YOU: What are you working on?
OTHER PERSON: I am really busy, working on…
In the Future!
YOU: What are you working on?
OTHER PERSON: I am really busy, working on…
YOU: Yes, but aren’t you glad that we can get our
work done faster?
Thank you!!
http://www.isi.edu/~gil
http://www.wings-workflows.org
http://www.organicdatascience.org
http://discoveryinformaticsinitiative.org
■  Wings contributors: Varun Ratnakar, Ricky Sethi, Hyunjoon Jo, Jihie Kim, Yan Liu, Dave Kale (USC), Ralph Bergmann (U Trier), William Cheung (HKBU), Daniel Garijo (UPM), Pedro Gonzalez & Gonzalo Castro (UCM), Paul Groth (VUA)
■  Wings collaborators: Chris Mattmann (JPL), Paul Ramirez (JPL), Dan Crichton (JPL), Rishi Verma (JPL), Ewa Deelman & Gaurang Mehta & Karan Vahi (USC), Sofus Macskassy (ISI), Natalia Villanueva & Ari Kassin (UTEP)
■  Organic Data Science: Felix Michel and Matheus Hauder (TUM), Varun Ratnakar (ISI), Chris Duffy (PSU), Paul Hanson, Hilary Dugan, Craig Snortheim (U Wisconsin), Jordan Read (USGS)
■  Biomedical workflows: Phil Bourne & Sarah Kinnings (UCSD), Chris Mason (Cornell), Joel Saltz & Tahsin Kurc (Emory U.), Jill Mesirov & Michael Reich (Broad), Randall Wetzel (CHLA), Shannon McWeeney & Christina Zhang (OHSU)
■  Geosciences workflows: Chris Duffy (PSU), Paul Hanson (U Wisconsin), Tom Harmon & Sandra Villamizar (U Merced), Tom Jordan & Phil Maechling (USC), Kim Olsen (SDSU)
■  And many others!