PARIS Inalco, 4-8 July 2016 - JEP-TALN-RECITAL 2016

Journées d’Études sur la Parole
Traitement Automatique des Langues Naturelles
Rencontre des Étudiants Chercheurs en Informatique pour le
Traitement Automatique des Langues
PARIS Inalco, 4-8 July 2016
Organized by the laboratories of the Île-de-France region
https://jep-taln2016.limsi.fr
Invited speakers:
Christian Chiarcos (Goethe-Universität, Frankfurt)
Mark Liberman (University of Pennsylvania, Philadelphia)
Organizing committee coordinators
Nicolas Audibert & Sophie Rosset (JEP)
Laurence Danlos & Thierry Hamon (TALN)
Damien Nouvel & Ilaine Wang (RECITAL)
Philippe Boula de Mareuil, Sarra El Ayari & Cyril Grouin (Ateliers)
©2016 Association Francophone pour la Communication Parlée (AFCP) and
Association pour le Traitement Automatique des Langues (ATALA)
Proceedings of the joint conference JEP-TALN-RECITAL 2016, volume 4: Invited Talks
Table of contents
Corpora and Linguistic Linked Open Data: Motivations, Applications, Limitations
Christian Chiarcos
From Human Language Technology to Human Language Science
Mark Liberman
Corpora and Linguistic Linked Open Data: Motivations,
Applications, Limitations
Christian Chiarcos (1)
(1) Applied Computational Linguistics
Johann Wolfgang Goethe Universität Frankfurt a. M.
60054 Frankfurt am Main, Germany
Linguistic Linked Open Data (LLOD) is a technology and a movement in several disciplines working
with language resources, including Natural Language Processing, general linguistics, computational
lexicography and the localization industry. This talk describes basic principles of Linguistic Linked
Open Data and their application to linguistically annotated corpora, summarizes the current status
of the Linguistic Linked Open Data cloud, and gives an overview of selected LLOD vocabularies
and their uses.
A resource constitutes Linguistic Linked Open Data if it is published in accordance with the following
principles:
1. The dataset is relevant for linguistic research or NLP algorithms.
2. The elements in the dataset should be uniquely identified by means of a URI.
3. The URI should resolve, so users can access more information using web browsers.
4. Resolving an LLOD resource should return results using web standards such as Resource
Description Framework (RDF).
5. Links to other resources should be included to help users discover new resources and provide
semantics.
6. Data should be openly licensed using licenses such as the Creative Commons licenses.
Criterion (1) defines linguistic (or linguistically relevant) data, criteria (2-5) define linked data,
and criterion (6) defines open data; their combination thus yields Linguistic Linked Open Data.
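As a rough illustration of how some of these criteria could be checked mechanically, the sketch below runs offline checks on a toy dataset of triples. Everything in it is an assumption made for illustration: the `example.org` and `example.net` URIs, the vocabulary terms, and the function names `is_http_uri` and `check_llod` are hypothetical. Criteria (3) and (4) need a network request and are deliberately left out, and criterion (1) is a judgment that code cannot make.

```python
from urllib.parse import urlparse

# Hypothetical toy dataset of (subject, predicate, object) triples.
# All URIs are illustrative placeholders, not real LLOD resources.
DATASET_BASE = "http://example.org/corpus/"
TRIPLES = [
    (DATASET_BASE + "token1", "http://example.org/vocab/form", "Haus"),
    (DATASET_BASE + "token1", "http://example.org/vocab/lemma",
     "http://example.net/lexicon/Haus"),
]
LICENSE = "https://creativecommons.org/licenses/by/4.0/"


def is_http_uri(value):
    """True if value looks like an absolute HTTP(S) URI."""
    parts = urlparse(value)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def check_llod(triples, base, license_uri):
    """Rough offline checks for criteria (2), (5) and (6) only."""
    # Criterion (2): subjects and predicates are identified by URIs.
    uris_ok = all(is_http_uri(s) and is_http_uri(p) for s, p, _ in triples)
    # Criterion (5): at least one object URI points outside the dataset.
    external = [o for _, _, o in triples
                if is_http_uri(o) and not o.startswith(base)]
    # Criterion (6): the license is a Creative Commons license URI.
    open_license = urlparse(license_uri).netloc.endswith("creativecommons.org")
    return {"uris": uris_ok, "links": bool(external),
            "open_license": open_license}


report = check_llod(TRIPLES, DATASET_BASE, LICENSE)
```

Here `report` maps each checked criterion to a boolean. A real validator would also dereference the URIs and inspect the returned RDF.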
The primary benefits of LLOD have been identified as:
- Representation: Linked graphs are a more flexible representation format for linguistic data.
- Interoperability: Common RDF models can easily be integrated.
- Federation: Data from multiple sources can trivially be combined.
- Ecosystem: Tools for RDF and linked data are widely available under open source licenses.
- Expressivity: Existing vocabularies help express linguistic resources.
- Semantics: Common links express what you mean.
- Dynamicity: Web data can be continuously improved.
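The federation benefit can be pictured with a minimal sketch in which each source is a plain set of triples. Because a shared URI denotes the same entity in every source, combining sources amounts to a set union. The URIs, the vocabulary terms and the helper names `objects` and `glosses_for_token` are assumptions made for this example, not part of any actual LLOD resource.

```python
# Hypothetical identifiers; none of these URIs refer to real resources.
VOCAB = "http://example.org/vocab/"
TOKEN = "http://example.org/corpus/token1"
ENTRY = "http://example.net/lexicon/Haus"

# Source 1: a corpus whose token links to a lexical entry by URI.
corpus = {
    (TOKEN, VOCAB + "form", "Haus"),
    (TOKEN, VOCAB + "lemma", ENTRY),
}
# Source 2: an independently published lexicon describing that entry.
lexicon = {
    (ENTRY, VOCAB + "gloss", "house"),
    (ENTRY, VOCAB + "pos", "noun"),
}

# Federation: the shared URI ENTRY joins both graphs under a set union.
merged = corpus | lexicon


def objects(graph, subject, predicate):
    """All objects of triples matching the given subject and predicate."""
    return sorted(o for s, p, o in graph if s == subject and p == predicate)


def glosses_for_token(graph, token):
    """Follow token -> lemma entry -> gloss across the merged graph."""
    glosses = []
    for entry in objects(graph, token, VOCAB + "lemma"):
        glosses.extend(objects(graph, entry, VOCAB + "gloss"))
    return glosses
```

Calling `glosses_for_token(merged, TOKEN)` follows the link from the corpus into the lexicon, while the same query against `corpus` alone finds no gloss.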
I specifically focus on linguistically annotated corpora and discuss the potential of Linked Data in
relation to four standing problems in the field:
1. representing highly interlinked corpora (e.g., multi-layer corpora, annotated parallel corpora),
2. integrating corpora with lexical resources available from the web of data,
3. facilitating annotation interoperability using terminology resources available from the web of
data, and
4. streamlining data manipulation processes in a modular and domain-independent fashion.
These aspects will be discussed in relation to two selected resources from both general linguistics and
Natural Language Processing. Finally, the talk will discuss some of the challenges that LLOD is still
facing in both areas.
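To make problem (3) concrete, the following sketch shows one possible way a shared terminology resource could mediate between two annotation schemes. It is an illustration under stated assumptions, not the method presented in the talk: the concept URIs under `example.org`, the two small tag mappings, the subclass table and the function names are all hypothetical.

```python
# Hypothetical terminology resource: concept URIs plus one subclass link.
CONCEPT = "http://example.org/terms/"
SUBCLASS_OF = {CONCEPT + "FiniteVerb": CONCEPT + "Verb"}

# Two illustrative tagsets, each mapped to the shared concepts.
TAGSET_A = {"NN": CONCEPT + "CommonNoun", "VVFIN": CONCEPT + "FiniteVerb"}
TAGSET_B = {"NOUN": CONCEPT + "CommonNoun", "VERB": CONCEPT + "Verb"}


def ancestors(concept):
    """Return the concept followed by all of its superconcepts."""
    chain = [concept]
    while chain[-1] in SUBCLASS_OF:
        chain.append(SUBCLASS_OF[chain[-1]])
    return chain


def compatible(tag_a, tag_b):
    """True if the two tags map to the same concept or to a
    concept and one of its superconcepts."""
    a = TAGSET_A[tag_a]
    b = TAGSET_B[tag_b]
    return b in ancestors(a) or a in ancestors(b)
```

With these mappings, `compatible("VVFIN", "VERB")` holds because the finer concept is a subclass of the coarser one, whereas `compatible("NN", "VERB")` does not. Neither tagset needs to know about the other, only about the shared concepts.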
From Human Language Technology to Human Language
Science
Mark Liberman (1)
(1) Linguistic Data Consortium, University of Pennsylvania
3600 Market Street, Suite 810, Philadelphia, PA, 19104 USA
Abstract
Thirty years ago, in order to get past roadblocks in Machine Translation and Automatic Speech
Recognition, DARPA invented a new way to organize and manage technological R&D: a “common
task” is defined by a formal quantitative evaluation metric and a body of shared training data, and
researchers join an open competition to compare approaches. Over the past three decades, this method
has produced steadily improving technologies, with many practical applications now possible. And
Moore’s law has created a sort of digital shadow universe, which increasingly mirrors the real world
in flows and stores of bits, while the same improvements in digital hardware and software make it
increasingly easy to pull content out of these rivers and oceans of information.
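As an example of what a "formal quantitative evaluation metric" can look like, the sketch below computes word error rate, a standard metric in speech recognition. The abstract does not name a specific metric, so this choice, the function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the reference length.

    Tokenization by whitespace is a simplifying assumption.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words
    # processed so far (minus the current one) and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting i reference words yields an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

Because every participant scores system output against the same references with the same function, results from different approaches become directly comparable, which is the point of the common-task setup.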
It’s natural to be excited about these technologies, where we can see an open road to rapid improvements beyond the current state of the art, and an explosion of near-term commercial applications.
But there are some important opportunities in a less obvious direction. Several areas of scientific and
humanistic research are being revolutionized by the application of Human Language Technology. At
a minimum, orders of magnitude more data can be addressed with orders of magnitude less effort –
but this change also transforms old theoretical questions, and poses new ones. And eventually, new
modes of research organization and funding are likely to emerge.