Dependency Parsing for Relation Extraction in Biomedical Literature

Master Thesis in Computer Science
presented by Nicola Colic
Zurich, Switzerland
Immatriculation Number: 09-716-572
to the Institute of Computational Linguistics,
Department of Informatics at the University of Zurich
Supervisor: Prof. Dr. Martin Volk
Instructor: Dr. Fabio Rinaldi
submitted on the 20th of March, 2016
Abstract
This thesis describes the development of a system for extracting entities from biomedical literature, as well as the relationships between them. We leverage efficient dependency parsers to provide fast relation extraction, so that the system can potentially process large collections of publications (such as PubMed) in useful time. The main contributions are the selection and integration of a suitable dependency parser, and the development of a system for creating and executing rules to find relations. For the evaluation of the system, a previously annotated corpus was further refined, and insights for the further development of this and similar systems are drawn.
Acknowledgements
I would like to thank Prof. Martin Volk for supervising the writing of this thesis, and especially my direct instructor Dr. Fabio Rinaldi for his never-ceasing help and motivation.
Contents

1 Introduction 1
   1.1 The Need for Biomedical Text Mining 1
   1.2 Related Work 2
       1.2.1 Named Entity Recognition 2
       1.2.2 Relation Extraction 3
   1.3 Beyond Automated Curation 4
   1.4 Importance of PubMed 5
   1.5 This Thesis 5

2 python-ontogene Pipeline 7
   2.1 OntoGene Pipeline 7
   2.2 python-ontogene 8
       2.2.1 Architecture of the System 8
       2.2.2 Configuration 8
       2.2.3 Backwards Compatibility 8
   2.3 Usage 9
   2.4 Module: Article 9
       2.4.1 Implementation 10
       2.4.2 Usage 10
       2.4.3 Export 11
   2.5 Module: File Import and Accessing PubMed 12
       2.5.1 Updating the PubMed Dump 12
       2.5.2 Downloading via the API 13
       2.5.3 Dealing with the large number of files 13
       2.5.4 Usage 13
   2.6 Module: Text Processing 14
       2.6.1 Usage 14
   2.7 Module: Entity Recognition 15
       2.7.1 Usage 15
   2.8 Evaluation 16
       2.8.1 Speed 16
       2.8.2 Accuracy 17
   2.9 Summary 18

3 Parsing 20
   3.1 Selection Process 20
       3.1.1 spaCy 21
       3.1.2 spaCy + Stanford POS tagger 21
       3.1.3 Stanford Parser 22
       3.1.4 Charniak-Johnson 22
       3.1.5 Malt Parser 23
   3.2 Evaluation 23
       3.2.1 Ease of Use and Documentation 23
       3.2.2 Evaluation of Speed 24
       3.2.3 Evaluation of Accuracy 26
       3.2.4 Prospective Benefits 47
       3.2.5 Selection 47
   3.3 Summary 47

4 Rule-Based Relation Extraction 49
   4.1 Design Considerations 49
   4.2 Implementation 50
       4.2.1 stanford_pos_to_db 51
       4.2.2 Database 52
       4.2.3 query_helper 54
       4.2.4 browse_db 59
   4.3 Data Set 62
       4.3.1 Conversion 64
       4.3.2 Categorization 65
       4.3.3 Development and Test Subsets 69
   4.4 Queries 70
       4.4.1 HYPHEN queries 70
       4.4.2 ACTIVE queries 71
       4.4.3 DEVELOP queries 74
   4.5 Summary 78
       4.5.1 Arity of Relations 78
       4.5.2 Query Development Insights 79
       4.5.3 Augmented Corpus 79

5 Evaluation 80
   5.1 Evaluation of epythemeus 80
       5.1.1 Query Evaluation 80
       5.1.2 Speed Evaluation and Effect of Indices 85
   5.2 Processing PubMed 87
       5.2.1 Test Set 87
       5.2.2 Timing 87
       5.2.3 Downloading PubMed 87
       5.2.4 Tagging and Parsing 88
       5.2.5 Running Queries 88
       5.2.6 Results 88
   5.3 Summary 89
       5.3.1 epythemeus 89
       5.3.2 Processing PubMed 90

6 Conclusion 91
   6.1 Our Contributions 91
       6.1.1 python-ontogene 91
       6.1.2 Parser Evaluation 92
       6.1.3 epythemeus 92
       6.1.4 Fragments 92
       6.1.5 Corpus 92
   6.2 Future Work 93
       6.2.1 Improving spaCy POS tagging 93
       6.2.2 Integration of spaCy and python-ontogene 93
       6.2.3 Improvements for epythemeus 93
       6.2.4 Evaluation Methods 93
   6.3 Processing PubMed 94
Chapter 1
Introduction
1.1 The Need for Biomedical Text Mining
One of the defining factors of our time is an unprecedented growth in information, and resulting from this is the challenge of information overload both
in the personal and professional space.
Independent of the respective domain, recent years have seen a shift in focus from information retrieval to information extraction. That is, rather than attempting to bring the right document containing relevant information to the user, research is now concerned with processing and extracting specific information contained in unstructured text [1].
This holds particularly true in the biomedical domain, where the rate
at which biomedical papers are published is ever increasing, leading to what
Hunter and Cohen call literature overload [16]. In their 2006 paper, they show
that the number of articles published on PubMed, the largest collection of
biomedical publications, is growing at double-exponential rate.
Because of this, biology researchers need to rely on manually curated
databases that list information relevant to their research in order to stay
up-to-date. From PubMed, information is manually curated, that is, human
experts compile key facts of publications into dedicated databases. This
process of curation is expensive and labor intensive, and causes a substantial
time lag between publication and appearance of its key information in the
respective database [31]. Curators, too, struggle to cope with the amount of
papers published, and thus need to turn to automated processing, that is,
biomedical text mining.
However, the field of biomedical text mining is not limited to aiding or automating the curation of databases. It covers a variety of applications, ranging from simpler information extraction to question answering and literature-based discovery. Generally speaking, it is concerned with discovering facts, as well as the associations between them, in unstructured text. These associations can be explicit or implicit. As Simpson [33] notes, advances in biomedical text mining can help prevent or alter the course of many diseases, and are thus of relevance not only to professional researchers, but also to the general public. Furthermore, such advances rely on combined efforts by experts in the biomedical domain and computational linguists.
We describe the different applications of biomedical text mining below.
1.2 Related Work
Simpson [33] attributes much of the development in the field to community-wide evaluations and shared tasks such as BioCreative [15] and BioNLP [18]. Such shared tasks focus on different aspects of biomedical text mining. Named entity recognition (NER) and relation extraction are the main tasks, and are briefly discussed here.
1.2.1 Named Entity Recognition
In NER, biological and medical terms are identified and marked in unstructured text. Examples of such entities include proteins, drugs and diseases, or any other semantically well-defined data. This task is often coupled with assigning each found entity a unique identifier, a step called entity normalization.

Named entity recognition is particularly difficult in the biomedical domain, given the constant discovery of new concepts and entities. Because of this, approaches that utilize a dictionary of known entities need to take extraordinary measures to keep their dictionaries up to date with current research, mirroring the problem of database curation described above.
In spite of this, dictionary-based methods can achieve favorable results [22].
In particular, dictionaries can automatically be generated from pre-existing
ontologies [11], making them easier to maintain and to be kept up-to-date.
Other approaches to NER are rule-based, exploiting patterns in protein names [12], for example, or statistical, in which features such as word sequences or part-of-speech tags are used by machine learning algorithms to infer occurrences of a named entity [14].
The related task of entity normalization is made difficult by the fact that there is often no universal accord on the preferred name of a specific entity. Particularly with protein and gene names, variations can come down to the authors' personal preference. Another complication is abbreviations, which are largely context-dependent: the same abbreviation can refer to very different entities in different contexts. However, as Zweigenbaum et al. note, problems such as this can essentially be considered solved [42].
1.2.2 Relation Extraction
The goal of relation extraction is to extract interactions between entities. In the biomedical domain, extracting drug-drug interactions [32], chemical-disease relations (CDR) [39] or protein-protein interactions (PPI) [26] are particularly relevant examples. However, these are highly specialized problems, and require specialized methods of relation extraction.
Simpson [33] distinguishes between relation extraction and event extraction: relation extraction finds binary associations between entities, whereas event extraction is concerned with more complex associations between an arbitrary number of entities.
The simplest approach to extracting relations relies on statistical evaluation of the co-occurrence of entities. A second class of approaches is rule-based. The rules used by these approaches are either created manually by experts, or stem from automated analysis of annotated texts. Simpson [33] notes that co-occurrence approaches commonly exhibit high recall and low precision, while rule-based approaches typically demonstrate high precision and low recall. A third class of approaches uses machine learning to directly identify relations, using a variety of features. These approaches can be used for both relation and event extraction. For rule-based as well as machine learning approaches, syntactic information is an invaluable feature [37] [2]. In particular, dependency-based representations of the syntactic structure of a sentence have proven especially useful for text mining purposes.
Since the approach used in this thesis is rule-based and capable of extracting both binary and more complex relations, we will not adhere to the distinction between relation and event extraction, and use the term relation extraction for both problems.
1.3 Beyond Automated Curation
More complex applications of biomedical text mining are summarization,
question answering and literature-based discovery.
In summarization, the goal is to extract important facts or passages from a
single document or a collection of documents, and represent them in a concise
fashion. This is particularly relevant in the light of the aforementioned literature overload. The approaches employed here either identify representative
components of the articles using statistical methods, or extract important
facts and use them to generate summaries.
Question answering aims at providing precise answers, rather than relevant documents, in response to natural language queries. This relies on natural language processing of the user-supplied queries on the one hand, and on processing a collection of documents potentially containing the answer on the other. For the latter, named entity recognition and relation extraction are of major importance.
The problems described above are all concerned with finding facts explicitly stated in biomedical literature. Literature-based discovery, however, aims at revealing implicit facts, namely relations that have not previously been discovered. Swanson was one of the first to explore this research field. The following is a simplified account of his definition from 1991: given two related scientific communities that do not communicate, where one community establishes a relationship between A and B, and the other a relationship between B and C, infer the relation between A and C [35]. As Simpson [33] explains, recent systems use semantic information to uncover such A-C relations, and thus build heavily on relation extraction.
With the rapid growth of publications, literature-based discovery becomes ever more important. The body of publications is quickly becoming too vast to be processed manually, and research communities find it impossible to keep up to date with their peers, leading to disjoint scientific communities. Literature-based discovery thus holds the promise of leveraging the unprecedented scale of publications and contributing to advances in biomedical science that would not otherwise be possible.
1.4 Importance of PubMed
MEDLINE is the largest database of articles from the biomedical domain, and is maintained by the US National Library of Medicine (NLM). Currently, it contains more than 25 million articles published since 1946¹, and as Hunter and Cohen note, it is growing at an extraordinary pace [16]. Between 2011 and today, the number of articles it contains has more than doubled.

The abstracts of MEDLINE can be freely accessed and downloaded via PubMed², making it one of the most important resources for biomedical text mining [42] [33]. Furthermore, thanks to a National Institutes of Health (NIH) Policy on Enhancing Public Access to Archived Publications Resulting From NIH-Funded Research issued in 2005, more than 3.8 million full-text articles can now be freely downloaded from PubMed Central³. The goal of this endeavor, as stated by the NLM, is "to integrate the literature with a variety of other information resources such as sequence databases and other factual databases that are available to scientists, clinicians and everyone else interested in the life sciences. The intentional and serendipitous discoveries that such links might foster excite us and stimulate us to move forward" [16].
In the course of this thesis, we will focus on the article abstracts available
via PubMed. Given its importance and size, we conduct our efforts with the
processing of the entire PubMed database in mind.
1.5 This Thesis
With such a large corpus of freely available biomedical texts, the efficiency of biomedical text mining becomes increasingly important: text mining systems need to cope with rapidly growing collections of text, and in order to be relevant and timely, need to do so in an efficient manner.
The goal of this thesis is to explore how relation extraction can
be efficiently performed using dependency parsing. Recent technological advances make dependency parsing computationally cheap, and as
explained in Sections 1.2.2 and 1.3, it lies at the core of many other aspects
of biomedical text mining. We explore how to leverage this availability of efficient dependency parsing, especially in regard to processing the entire PubMed.

¹ https://www.nlm.nih.gov/pubs/factsheets/medline.html
² https://www.ncbi.nlm.nih.gov/pubmed/
³ http://www.ncbi.nlm.nih.gov/pmc/
We first describe our pipeline for named entity recognition and part-of-speech tagging in Chapter 2. We expand on this previous work by finding an
accurate and efficient dependency parser in Chapter 3. These results are then
used to develop a new, independent system aimed at exploiting dependency
parse information to find relations using manually written rules (Chapter 4).
This system uses a novel way of creating rules, and is evaluated against a
manually annotated corpus in Chapter 5. Furthermore, we give an estimate of the time it would take to process the entire PubMed and search it for relations
using our approach. In Chapter 6 we draw some conclusions on the results
of this research.
Chapter 2
python-ontogene Pipeline
This chapter describes the development of a new text processing pipeline
that performs tokenization, tagging and named entity recognition, building
on previous work by Rinaldi et al. [28] [29] [30], and their OntoGene pipeline,
in particular.
2.1 OntoGene Pipeline
The OntoGene system is a pipeline patched together from different modules written in different programming languages, which communicate with each other via files. Each module takes a file as input and produces a file, typically in a predefined XML format (called OntoGene XML). The subsequent module then reads the files produced by the preceding modules. The modules themselves are coordinated by bash scripts.

This is inefficient for two reasons:

1. Every module needs to parse the preceding module's output. The repeated disk access for reading and writing considerably slows down processing.

2. The pipeline is not easy for new users to operate, since the different modules are written in different languages, and there is no centralized documentation.

The low processing speed described in point 1 makes it impossible to process larger collections of text, such as PubMed. Because of this, there is demand for a streamlined pipeline.
2.2 python-ontogene
Consequently, the existing OntoGene pipeline was rewritten in Python 3, with a particular focus on reducing file-based communication between modules. This accelerates processing and makes processing the entire PubMed feasible. Furthermore, the new pipeline has consistent documentation, and is hence easier for users to understand.

The pipeline is currently developed up to the point of entity recognition, and can be found online¹ or in the python-ontogene directory that accompanies this thesis.
2.2.1 Architecture of the System
The python-ontogene pipeline is composed of several independent modules,
which are coordinated by a control script. The main mode of communication
between the modules is via objects of a custom Article class, which mimics
an XML structure. All modules read and return objects of this class, which
ensures independence of the modules.
The modules are coordinated via a control script written in python, which
passes the various Article objects produced by the modules to the subsequent modules.
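This pattern can be sketched as follows. Note that this is an illustrative simplification, not the actual python-ontogene code: the class body and stage functions are hypothetical stand-ins, showing only how a control script passes a single in-memory object from module to module instead of writing intermediate files.

```python
class Article:
    """Hypothetical stand-in for the pipeline's Article class."""
    def __init__(self, text):
        self.text = text
        self.tokens = []
        self.entities = []

def tokenize(article):
    """Stage 1: naive whitespace tokenization (toy implementation)."""
    article.tokens = article.text.split()
    return article

def recognize_entities(article):
    """Stage 2: toy entity recognition (treats all-uppercase tokens as entities)."""
    article.entities = [t for t in article.tokens if t.isupper()]
    return article

def run_pipeline(text):
    """Control script: pass the same object through all stages in order."""
    article = Article(text)
    for stage in (tokenize, recognize_entities):
        article = stage(article)
    return article
```

Because every stage takes and returns the same object, a stage can be swapped out without touching the others, which is the flexibility the module design aims for.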
2.2.2 Configuration
All variables relevant for the pipeline (such as input files, output directories
and bookkeeping parameters) are stored in a single file, which is read by the
control script. The control script will then supply the relevant arguments
read from the configuration file to the individual modules. This ensures that
the user only has to edit a single file, while at the same time keeping the
modules independent.
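The idea can be illustrated with Python's standard configparser; the section and key names below are our own examples, not the pipeline's actual parameters.

```python
import configparser

# A single configuration file read once by the control script,
# which then hands each module only the values it needs.
CONFIG_TEXT = """
[paths]
input_file = data/articles.txt
output_directory = out/
"""

config = configparser.ConfigParser()
config.read_string(CONFIG_TEXT)

# the control script extracts the relevant arguments for each module
input_file = config["paths"]["input_file"]
output_directory = config["paths"]["output_directory"]
```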
2.2.3 Backwards Compatibility
In order to preserve compatibility with the existing pipeline (see Section 2.1),
the Article objects can be exported to OntoGene XML format at various
stages of processing.
¹ https://gitlab.cl.uzh.ch/colic/python-ontogene
Figure 2.1: The architecture of the python-ontogene pipeline
2.3 Usage
The exact usage of the individual modules is described in the subsequent chapters. Furthermore, the in-file documentation in the example.py file can provide a more concrete idea of how to use the pipeline.
2.4 Module: Article
The article module is a collection of various classes, such as Token, Sentence and Section. The classes are hierarchically organized (e.g. an Article has Sections), but kept flexible to allow for future variations in the structure.
Each class offers methods particularly suited to dealing with its contents,
such as writing to file or performing further processing.
However, in order to keep the pipeline flexible, the Article class relies on other modules to perform tasks such as tokenization or entity recognition.
While this leads to coupling between the modules, it also allows for easy
replacement of modules. For example, if the tokenizer that is currently used
needs replacing, it is easy to just supply a new tokenization module to the
Article object to perform tokenization.
2.4.1 Implementation
Currently, there are the following classes, all of which implement an abstract
Unit class: Article, Section, Sentence, Token and Term. Each of these
classes has a subelements list, which contains objects of other classes. In
this fashion, a tree-like structure is built, in which an Article object has
a subelements list of Sections, which each have a subelements list of
Sentences and so on.
The abstract Unit class implements, amongst others, the get_subelement() function, which traverses the object's subelements list recursively until elements of the requested type have been found. In this fashion, the data structure is kept flexible for future changes. For example, Articles may be gathered in Collections, or Sections might contain Paragraphs.
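A minimal sketch of this design might look as follows. The class bodies are our own simplification; only the subelements list and the recursive get_subelement() lookup are taken from the description above.

```python
class Unit:
    """Simplified sketch of the abstract Unit class."""
    def __init__(self):
        self.subelements = []

    def get_subelement(self, unit_type):
        """Recursively collect all subelements of the given type."""
        found = []
        for element in self.subelements:
            if isinstance(element, unit_type):
                found.append(element)
            else:
                found.extend(element.get_subelement(unit_type))
        return found

class Article(Unit): pass
class Section(Unit): pass
class Sentence(Unit): pass

class Token(Unit):
    def __init__(self, text):
        super().__init__()
        self.text = text
```

Because the lookup is purely recursive, inserting a new level (say, a Paragraph between Section and Sentence) requires no change to callers asking an Article for its Tokens.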
For tokenization, the Article class expects the tokenize() function to be called with a tokenizer object as argument. This tokenizer object
needs to implement the following two functions: tokenize_sentences(), and
tokenize_words(). The first function is expected to return a list of strings;
the second one to return a list of tuples, which store token text as well as
start and end position in text.
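The interface can be illustrated with a naive whitespace tokenizer; this toy implementation is our own, not part of the pipeline.

```python
class NaiveTokenizer:
    """Toy tokenizer satisfying the two-function interface described above."""

    def tokenize_sentences(self, text):
        # returns a list of sentence strings
        return [s.strip() + "." for s in text.split(".") if s.strip()]

    def tokenize_words(self, sentence):
        # returns (token text, start position, end position) tuples
        tokens, search_from = [], 0
        for word in sentence.split():
            start = sentence.index(word, search_from)
            end = start + len(word)
            tokens.append((word, start, end))
            search_from = end
        return tokens
```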
Finally, the Article class implements functions such as add_section()
and add_term(), which internally create the corresponding objects. This is
done so that other modules only need to import the Article class, which in
turn will take care of accessing and creating other classes.
2.4.2 Usage
The example below will create an Article object, manually add a Section
with some text, tokenize it and print it to console and file.
import article

my_article = article.Article("12345678")  # constructor needs ID
my_article.add_section("S1", "abstract", "this is an example text")
my_article.tokenize()
print(my_article)
my_article.print_xml('path/to/output_file.xml')
2.4.3 Export
At the time of writing, the Article class implements a print_xml() function, which allows exporting the data structure to a file. This function in turn recursively calls an xml() function on the elements of the data structure; each class is thus responsible for implementing its own xml() function.

The goal of this function is to export the Article object in its current state of processing. For example, if no tokenization has yet taken place, it will not try to export tokens. This, however, requires considerable processing logic. Because of this, this function and the related functions need to be updated as the pipeline is updated.
Pickling
To store and load Article objects without exporting them to a specific
format, the Article class implements the pickle() and unpickle() functions.
These allow dumping the current Article object as a pickle file, and restoring
a previously pickled Article object.
import article

my_article = None

# create Article object

my_article.pickle('path/to/pickle')
new_article = article.Article.unpickle('path/to/pickle')
Exporting Entities
The Article class implements a print_entities_xml() function, which exports the found entities to an XML file. As with the general export function,
the XML file is built recursively by calling an entity_xml() function on the
Entity objects that are linked to the Article.
2.5 Module: File Import and Accessing PubMed
This module allows importing texts from files or downloading them from PubMed, and converts them into the Article format discussed above. From there, they can be handed to the other modules and exported to XML. There are three ways to access PubMed:
• PubMed dump. After applying for a free licence, the whole of PubMed can be downloaded as a collection of around 700 .xml.gz files, each of which contains about 30,000 PubMed articles. This dump is updated once a year (in November / December).

• API. This allows downloading individual PubMed articles given their ID. If the entrez library is used, PubMed returns XML; if the BioPython library is used, PubMed returns Python objects. However, PubMed throttles the download speed in order to prevent overloading of its systems: if more than three articles are downloaded per second, the user risks being denied further access to PubMed via the API.

• BioC. For the BioCreative V: Task 3 challenge, participants are supplied with data in BioC format. BioC is an XML format tailored towards representing annotations in the biomedical domain [8].
2.5.1 Updating the PubMed Dump
Since the PubMed dump is only updated once per year, additional articles published throughout the year need to be downloaded separately using the API.

This takes substantial effort: between the last publication of the PubMed dump in December 2014 and August 1st 2015, 800,000 new articles were published. Given the aforementioned download speed limitations, these take about 3 days to download using the API.
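The estimate follows directly from the API's rate limit:

```python
# Back-of-the-envelope check of the "about 3 days" figure above.
new_articles = 800_000    # articles published since the last dump
rate_per_second = 3       # PubMed API throttling limit
seconds = new_articles / rate_per_second
days = seconds / 86_400   # seconds per day
# days comes out at roughly 3.1
```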
2.5.2 Downloading via the API
In order to prevent repeated downloads from PubMed, the module keeps a
copy of downloaded articles as python pickle objects.
2.5.3 Dealing with the large number of files
Since the pipeline operates on the basis of single articles, the PubMed dump was converted into multiple files, each of which corresponds to one article. However, most file systems, such as FAT32 and ext2, cannot cope with 25 million files in one directory. Because of this, the following structure was chosen:

Every article has a PubMed ID with up to 8 digits; shorter IDs are padded from the left with zeros. All articles are then grouped by the first 4 digits of their ID into directories, resulting in up to 10,000 directories with up to 10,000 files each. For example, the file with ID 12345678 would reside in the directory 1234.
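This scheme can be expressed as a small helper function; the function name and the .xml suffix are our own illustration, not part of the pipeline.

```python
def pmid_to_path(pmid):
    """Map a PubMed ID to its directory and file name, as described above."""
    padded = str(pmid).zfill(8)  # pad from the left with zeros to 8 digits
    return "{}/{}.xml".format(padded[:4], padded)
```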
However, different solutions for dealing efficiently with the large number of files could be explored in the continuation of this project. NoSQL databases, inherently suited to large data sets, seem especially promising.
2.5.4 Usage
The following code snippet demonstrates how to import from file and from PubMed. The import_file module allows specifying a directory rather than a path; in that case, it will load all files in the directory and convert them to Article objects.
from text_import.pubmed_import import pubmed_import

article = pubmed_import("12345678", "mail@example.com")
# email can be omitted if file has already been downloaded
# if file was downloaded before, the module will load it
# from the local dump_directory
article.print_xml('path/to/file')

from text_import.file_import import import_file

articles = import_file('/path/to/directory/or/file.txt')
# always returns a list
for article in articles:
    print(article)
2.6 Module: Text Processing
This module wraps around the NLTK library to make sentence splitting, tokenization and part-of-speech tagging usable by the Article class. This module can be swapped out for a different one in the future, provided the functions tokenize_sentences(), tokenize_words() and pos_tag() are implemented.
2.6.1 Usage
Since NLTK offers several tokenizers based on different modules, and allows you to train your own models, this wrapper requires you to specify which model to use. The config module provides convenient ways to do this.
from config.config import Configuration
from text_processing.text_processing import Text_processing as tp

my_config = Configuration()
my_tp = tp(word_tokenizer=my_config.word_tokenizer_object,
           sentence_tokenizer=my_config.sentence_tokenizer_object)

for pmid, article in pubmed_articles.items():
    article.tokenize(tokenizer=my_tp)
2.7 Module: Entity Recognition
This module implements a dictionary-based entity recognition algorithm, in which a list of known entities is used to find entities in a text. This approach is not without limitations: notably, considerable effort must be undertaken to keep the dictionary up to date in order to find newly discovered entities, and entities not previously described cannot be found [41].
We alleviate this problem by using an approach put forth by Ellendorf et al. [11], in which a dictionary is automatically generated from a variety of different ontologies. Their approach also helps to address the problem of homonymy described in [21], by mapping every term to an internal concept ID and to the ID of the respective origin database.
We opted for this approach in order to deliver a fast solution able to cope
with large amounts of data. This aspect has so far received little attention
in the field.
2.7.1 Usage
The user first needs to instantiate an Entity_recognition object, which will hold the entity list in memory. This object is then passed to the recognize_entities() function of the Article object, which will then use it to find entities. While this is slightly convoluted, it ensures that different entity recognition approaches can be used in conjunction with the Article class.

When creating the Entity_recognition object, the user needs to supply an entity list as discussed above, and a Tokenizer object. The Tokenizer object is used to tokenize multi-word entries in the entity list. The tokenization applied here should be the same as the one used to tokenize the articles; the config module ensures this.
from config.config import Configuration
from text_processing.text_processing import Text_processing as tp
from entity_recognition.entity_recognition import Entity_recognition as er

my_config = Configuration()
my_tp = tp(word_tokenizer=my_config.word_tokenizer_object,
           sentence_tokenizer=my_config.sentence_tokenizer_object)
my_er = er(my_config.termlist_file_absolute,
           my_config.termlist_format, word_tokenizer=my_tp)

# create tokenised Article object
my_article.recognize_entities(my_er)
my_article.print_entities_xml('output/file/path',
                              pretty_print=True)
2.8 Evaluation
Two factors have been evaluated: speed and accuracy of named entity recognition.
2.8.1 Speed
Both the existing OntoGene pipeline and the new python-ontogene pipeline were run on the same machine on the same data set, and their running times were measured using the Unix time command. The test data set consists of 9559 randomly selected text files, each containing the abstract of a PubMed article. References to the test set can be found in the data/pythonontogene_comparison directory.
The Unix time command returns three values: real, user and system. real refers to the so-called wall-clock time, that is, the time that actually passed during the execution of the command. user and system refer to the time during which the CPU was engaged in the respective mode: for example, system calls add to the system time, while normal user-mode programs add to the user time. Table 2.1 lists the measured results.
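The per-article figures in Table 2.1 are simply the total CPU time (user + system) divided by the number of articles. A quick sanity check using the values from the table:

```python
N_ARTICLES = 9559  # abstracts in the test data set

def seconds_per_article(cpu_seconds, n_articles=N_ARTICLES):
    """CPU time per article, as reported in Table 2.1."""
    return cpu_seconds / n_articles

ontogene = seconds_per_article(59323)        # ~6.21 s/article
python_ontogene = seconds_per_article(1280)  # ~0.13 s/article
speedup = ontogene / python_ontogene         # ~46x
```

This reproduces the roughly 46-fold speedup noted in the summary of this chapter.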
Table 2.1: Speed evaluation for the OntoGene and python-ontogene pipelines

pipeline           real         user + system          s / article
OntoGene           37m5.153s    59 323s (16.5 hours)   6.206
python-ontogene    21m22.359s    1 280s ( 0.36 hours)  0.133
Note that the OntoGene pipeline is explicitly parallelized; because of this, its real time is relatively low. The python-ontogene pipeline is not explicitly parallelized. However, this could be the subject of future development, resulting in even faster real operation.
2.8.2 Accuracy
To compare the named entity recognition results of both pipelines, testing was done on the same test data set of 9559 files as above. The data set contains both chemical named entities and diseases, which are listed separately in the evaluation below.

A testing script compares the entities found by one pipeline against a gold standard; here, we used the output of the old OntoGene pipeline as the gold standard. The test script requires its input to be in BioC format, so the output of both pipelines was first converted to BioC format. The test scripts can be found in the accompanying data/pythonontogene_comparison directory.

The script calculates TP, FP and FN counts, as well as precision and recall values, on a per-document basis, along with average values for the entire data set. Table 2.2 lists the results returned by the evaluation script:
Table 2.2: Evaluation of python-ontogene against OntoGene NER

Entity Type   Precision   Recall   F-Score
Chemical      0.835       0.865    0.850
Disease       0.946       0.826    0.882
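The F-scores in Table 2.2 are the usual harmonic mean of precision and recall; a small sketch reproduces them from the table's precision and recall columns:

```python
def f_score(precision, recall):
    """F1 score: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

chemical = f_score(0.835, 0.865)  # ~0.850, matching Table 2.2
disease = f_score(0.946, 0.826)   # ~0.882, matching Table 2.2
```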
Note that precision and recall are measured against the output of the OntoGene pipeline. This means that true positives found by the new python-ontogene pipeline that the old pipeline did not find are treated as false positives by the evaluation script. In Table 2.3 we list some examples of differences between what the two pipelines produce.

As example 15552512 in Table 2.3 shows, the new pipeline lists many entities several times, due to them having several entries with different IDs in the term list. While this can be useful, future development should allow for this behavior to be optional.

Simpson [33] reports that community-wide evaluations have demonstrated that NER systems are typically capable of achieving favorable results. While the values obtained above cannot be directly compared, systems in the BioCreative gene mention recognition tasks were able to obtain F-scores between 0.83 and 0.87 [34].
2.9 Summary
In this chapter, we presented an efficient pipeline for tokenization, POS tagging and named entity recognition, which focuses on modularity and well-documented code. In the rest of this dissertation, we describe finding a dependency parser to be included as a module. The modular nature of the python-ontogene pipeline should make the inclusion of new modules easy, as well as facilitate the use of different modules for POS tagging, for example.

We especially note the considerable improvements in speed shown in Table 2.1: the new python-ontogene pipeline runs approximately 46 times faster than the old OntoGene pipeline, making it a promising starting point for future developments.
Table 2.3: Differences in NER between the two pipelines

PMID 15552511
Original text: These indices include 3 types of measures, which are derived from a health professional [joint counts, global]; a laboratory [erythrocyte sedimentation rate (ESR) and C-reactive protein (CRP)]; or a patient questionnaire [physical function, pain, global].
Comment: Here the new pipeline doesn't mark C-reactive protein as an entity, but the old one does (false negative). This is probably due to different tokenization with regard to parentheses.

PMID 15552512
Original text: Patient-derived measures have been increasingly recognized as a valuable means for monitoring patients with rheumatoid arthritis.
Comment: The new pipeline lists both rheumatoid arthritis and arthritis as entities in separate entries. This behavior is quite common: the new pipeline will try to match as many entities as possible. Other examples include tumor necrosis and necrosis (in article 15552517). This behavior makes the python-ontogene pipeline more robust.

PMID 15552518
Original text: It is now accepted that rheumatoid arthritis is not a benign disease.
Comment: Here, the old pipeline marks not a as an entity and lists it with the preferred form 1,4,7-triazacyclononane-N,N',N''-triacetic acid. This is obviously a mistake, which the new pipeline does not make, attributable to the quality of the dictionary used.
Chapter 3
Parsing
This chapter describes the process of finding a suitable parser to be integrated into the python-ontogene pipeline and used as the basis for relation extraction. We evaluate a set of different dependency parsers in terms of speed, ease of use, and accuracy, and select the most promising one.
3.1 Selection Process
The 2009 BioNLP shared task was concerned with event extraction, and Kim
et al. list the parsers that were used in this challenge [17]. Building on this
work, Kong et al. recently evaluated 8 parsers for speed and accuracy [20].
To our knowledge, this is the most recent and substantiated evaluation of
parsers. Based on their findings, we selected a set of parsers for our own
evaluation. We included only parsers for which a readily available implementation exists, and which performed above average in the respective evaluation
above.
Recall that the python-ontogene pipeline is entirely written in python3
and aims at reducing time lost at reading and writing to disk by keeping as
much communication between modules in memory as possible. In trying to
maintain this advantage, we narrow our selection further by only choosing
parsers that are either written in python or have already been interfaced for
python.
Given the considerations described above, the following dependency parsers
were selected for further evaluation.
• Stanford parser, as it was described as the state-of-the-art parser by Kim et al. as well as Kong et al., and has recently been updated.
• Charniak-Johnson (also known as BLLIP or Brown reranking parser), as it was the most accurate parser in Kong et al.'s study mentioned above.
• Malt parser, as it performed fastest in the above evaluation when using
its Stack-Projective algorithm.
Furthermore, we also include spaCy, a dependency parser written entirely in python3 that, to our knowledge, has not yet been the subject of scientific evaluation. Except for spaCy, all parsers mentioned above are written in languages other than python, but claim to offer python interfaces.
3.1.1 spaCy
spaCy¹ is a library including a dependency parser written entirely in python3, with a focus on good documentation and use in production systems; it is published under the MIT license. To our knowledge, there are no publications that evaluate its performance; however, the developer self-reports on the project's website² that the parser outperforms the Stanford parser in terms of accuracy and speed. For our tests, we used version v0.100.

spaCy attempts to achieve high performance by having the user interfaces written in python, while the actual algorithms are written in cython. cython is a programming language and compiler that aims to provide C's optimized performance and python's ease of use simultaneously [3].

spaCy also provides tokenization and POS tagging models trained on the OntoNotes 5 corpus³. The Universal POS Tag set the tagger maps to is described in [25], and the dependency parsing annotation scheme in [7].
3.1.2 spaCy + Stanford POS tagger
Our preliminary evaluation, however, showed that the POS tagger the spaCy library provides does not perform well on biomedical texts, and thus affects the accuracy of the dependency parser. We found that the results of spaCy's dependency parser can be improved when it is used in conjunction with a more accurate POS tagger. For part-of-speech tagging, we thus employed the widely used Stanford POS tagger 3.6.0⁴ with the pre-trained english-left3words-distsim.tagger model, which is the model recommended by the developers⁵. The results obtained by combining spaCy and the Stanford POS tagger are included in the evaluation below.

1 https://spacy.io/
2 https://spacy.io/blog/parsing-english-in-python
3 https://catalog.ldc.upenn.edu/LDC2013T19
3.1.3 Stanford Parser
The Stanford parsing suite⁶ is a collection of different parsers written in Java. The parsers annotate according to the Universal Dependencies scheme⁷ or to the older Stanford dependencies described in [9]. In our tests, we used version 3.5.2 with the englishPCFG parser (see [19]), which is the default setting.
3.1.4 Charniak-Johnson
The most recent release (4.12.2015) of the implementation of the Charniak-Johnson parser⁸ was originally described in [6]. The parser is written in C++, and suffers from two major shortcomings:

1. It does not compile under OS X.

2. It does not perform sentence splitting, but requires the input to be already split into sentences.

Because of 1., we conducted our tests for this parser on a 2.6GHz Intel Xeon E5-2670 machine running Ubuntu 14.04.3 LTS. Note that all other parsers were tested on a different machine running OS X. Given this difference, and because all other parsers perform sentence splitting themselves, the results obtained for the Charniak-Johnson parser cannot be directly compared.

4 http://nlp.stanford.edu/software/tagger.shtml
5 http://nlp.stanford.edu/software/pos-tagger-faq.shtml#h
6 http://nlp.stanford.edu/software/lex-parser.shtml
7 http://universaldependencies.github.io/docs/
8 https://github.com/BLLIP/bllip-parser
3.1.5 Malt Parser
The MaltParser was first described in [24] and is written in Java. Version 1.8.1 of the MaltParser⁹ requires its input to be already tagged with the Penn Treebank PoS tag set. As in the case of spaCy, we prepared the test set using the Stanford POS tagger 3.6.0 with the pre-trained english-left3words-distsim.tagger model.
3.2 Evaluation
Following a preliminary assessment of ease of use and quality of documentation, the parsers were first tested for speed in their native environments (e.g. Java or python). In a second step, the fastest parsers were manually evaluated in terms of accuracy.
3.2.1 Ease of Use and Documentation
• spaCy offers a centralized documentation¹⁰ and tutorials. Furthermore, being written entirely in python3, it suffers little from difficulties that arise in cross-platform use.

• The Stanford parser has an extensive FAQ¹¹, but documentation is spread across several files as well as JavaDocs. There is no centralized documentation: the user is dependent on sample files and in-code documentation. However, the code is well documented, and there is a wealth of options, most of which can be applied on the command line, making the software very easy to use.

• The Charniak-Johnson parser offers little documentation on how to use it, and being written in C++ it is not trivial to use across different platforms.

• The Malt parser offers a centralized documentation¹², but it focuses mostly on training a custom model and offers little help on using pre-trained models. The need for tagged data as input is a major shortcoming, necessitating additional steps in order to use it.

Table 3.1 summarizes these results.

9 http://www.maltparser.org/index.html
10 https://spacy.io/docs
11 http://nlp.stanford.edu/software/parser-faq.shtml
12 http://www.maltparser.org/optiondesc.html
parser        cross-platform use   documentation                                        further comments
spaCy         easy (python)        centralized documentation, tutorials                 inferior POS tagger
Stanford      easy (Java)          extensive FAQ, well-documented code, sample files
Charniak-J.   difficult (C++)      little documentation                                 requires sentence-split input
Malt          easy (Java)          centralized documentation                            requires tagged input

Table 3.1: Summary of assessment of ease of use for different parsers
3.2.2 Evaluation of Speed
The parsers were compared on a test set consisting of 1000 randomly selected text files containing abstracts from PubMed articles, averaging 1277 characters each. The test set as well as intermediary results can be found in the data/parser_evaluation directory accompanying this thesis. The tests were run on a 3.5 GHz Intel Core i5 machine with 8GB RAM.

Table 3.2 lists the various processing speeds measured using the Unix time command. In reading the table, bear in mind the following points:
• The spaCy library takes considerable time to load, but then processes documents comparably fast. To demonstrate this, we separately list the time for processing the test set including loading of the library (loading in the table) and excluding loading time (loaded). We do so since the overhead of loading the library diminishes in significance with increasing size of the data to be processed.

• We also take separate note of spaCy's performance when using plain text files as input and applying its own part-of-speech tagger (plain text in the table), and when provided with previously tagged text (tagged text). In the latter case, a small parsing step takes place to extract tags and tokens from the output produced by the Stanford POS tagger.

• The evaluation of the Charniak-Johnson parser should not be directly compared to the other two, since it was performed on a different machine (see 3.1.4).
parser                          time           characters / s
Stanford POS tagger (SPT)       29.126s        43 840
spaCy (plain text, loading)     49.236s        25 933
spaCy (plain text, loaded)      26.113s        48 896
spaCy (tagged text, loading)    48.342s        26 413
spaCy (tagged text, loaded)     23.662s        53 962
spaCy + SPT (loading)           77.468s        12 482
Stanford                        2 430.141s     525
Charniak-Johnson                6 069.198s     210
Malt                            52 509.288s    24
Malt + SPT                      52 538.414s    24

Table 3.2: Processing time for different parsers
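For most rows, the characters-per-second column follows from the size of the test set (1000 abstracts averaging 1277 characters, roughly 1,277,000 characters in total) divided by the measured time; a quick sanity check against two rows of the table:

```python
TOTAL_CHARS = 1000 * 1277  # test set: 1000 abstracts averaging 1277 characters

def chars_per_second(seconds):
    """Throughput over the whole test set, as in Table 3.2."""
    return TOTAL_CHARS / seconds

stanford_pos = chars_per_second(29.126)  # ~43 844 (table: 43 840)
spacy_loaded = chars_per_second(26.113)  # ~48 903 (table: 48 896)
```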
Discussion
Table 3.2 shows that the simple parsing step needed to make Stanford POS tagger output usable by spaCy, together with loading the tags thus provided, takes approximately the same amount of time as relying on spaCy's internal tagger. Furthermore, the time to load the spaCy library is substantial in relative terms, although negligible in absolute terms.

The combination of spaCy + Stanford POS tagger significantly slows down spaCy's overall throughput. However, as we show in Section 3.2.3, it is practically inevitable given the poor accuracy of spaCy's part-of-speech tagger.
Apart from algorithmic differences, the big gap in speed between the parsers is probably due to the fact that a new Java virtual machine is invoked for the processing of every document by the Stanford and Malt parsers. This could be remedied by configuring the parsers so that the Java virtual machine acts as a server that processes requests; however, this is beyond the scope of this work.
3.2.3 Evaluation of Accuracy
10 sentences from the test set were selected in order to evaluate the parsers' output, visualized as parse trees, by hand. The parses were converted into CoNLL 2006 format [4] and then visualized using the Whatswrong visualizer¹³. Of the 10 sentences, the first five are considered easy to parse, while the latter five are more difficult. While we do not provide a quantitative evaluation, the qualitative evaluation below gives a good indication of the individual parsers' performance. We only present the parse trees relevant for the discussion below; for a complete list and higher-resolution images refer to the additional material¹⁴ that accompanies this dissertation.
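For illustration, writing a parsed sentence in the ten-column CoNLL-X (2006) format can be sketched as follows (a hypothetical helper, not part of the thesis code; unused columns are filled with underscores):

```python
def to_conll2006(tokens):
    """Render tokens as CoNLL-X (2006) lines.

    Each token is (form, pos, head, deprel), with head as a 1-based
    index into the sentence (0 = root). The ten columns are:
    ID FORM LEMMA CPOSTAG POSTAG FEATS HEAD DEPREL PHEAD PDEPREL.
    """
    lines = []
    for i, (form, pos, head, deprel) in enumerate(tokens, start=1):
        lines.append('\t'.join(
            [str(i), form, '_', pos, pos, '_', str(head), deprel, '_', '_']))
    return '\n'.join(lines)

# toy parse of "Neurons require transport"
sentence = [
    ('Neurons', 'NNS', 2, 'nsubj'),
    ('require', 'VBP', 0, 'root'),
    ('transport', 'NN', 2, 'dobj'),
]
```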
Only the parses of spaCy, Stanford parser and Malt parser are considered, as well as all the parses produced by the combination of spaCy + Stanford POS tagger. Given the Charniak-Johnson parser's lack of ease of use and the difficulty of producing parse trees with it, it is omitted from this evaluation.

The parse trees below highlight how poorly the spaCy parser performs using its own tagger (for example in sentence 8), often yielding parses that would make a meaningful extraction of relations impossible. The Malt parser never yields parses that are superior to Stanford's, and sometimes makes mistakes that the Stanford parser does not (for example in sentence 5). However, using spaCy + Stanford POS tagger, results comparable to the Stanford parser are achieved, with the exception of minor mistakes (see sentences 3 and 6, for example).

13 https://code.google.com/p/whatswrong/
14 data/parser_evaluation/accuracy_evaluation/parse_tree
Sentence 1
Neurons and other cells require intracellular transport of essential
components for viability and function.
All three parsers accurately mark require as the root of the sentence and the
phrase neurons and other cells as its subject. None of the parsers accurately
depicts the dependency of the phrase for viability and function on require,
assigning it to either transport or components.
Figure 3.1: spaCy parser
Figure 3.2: Malt parser
Sentence 2
Strikingly, PS deficiency has no effect on an unrelated cargo vesicle class containing synaptotagmin, which is powered by a different kinesin motor.
All three parsers correctly identify root and subject. Noticeably, they also all
correctly recognize PS deficiency as a compound. Unlike spaCy and Stanford parser, the Malt parser here incorrectly indicates a dependency between
synaptotagmin and effect (rather than between effect and class (containing synaptotagmin)). Without expert knowledge it is not possible to decide whether the dependency between synaptotagmin and the relative clause which is powered by a different kinesin motor is correct, or if the relative clause depends on class (containing synaptotagmin).
Figure 3.3: spaCy parser
Figure 3.4: Malt parser
Sentence 3
However, it is unclear how mutations in NLGN4X result in neurodevelopmental defects.
Stanford and Malt parser deal with the sentence similarly. spaCy, however, incorrectly marks NLGN4X result as a compound (as opposed to mutations in NLGN4X). This error seems to be caused by result being tagged as NN (noun) rather than VBP (verb, non-3rd person singular present) by the spaCy tagger. Providing spaCy with tags from the Stanford tagger helps to ameliorate this problem, but it still incorrectly marks NLGN4X result as a compound.
Figure 3.5: spaCy parser
Figure 3.6: Stanford tagger, spaCy parser
Figure 3.7: Stanford parser
Sentence 4
Diurnal and seasonal cues play critical and conserved roles in
behavior, physiology, and reproduction in diverse animals.
spaCy has a peculiar way of describing the dependencies between the phrase diurnal and seasonal cues and play. This is likely caused by diurnal being tagged as NNP (proper noun); indeed, this problem is solved by providing the Stanford tagger's tags to spaCy. Furthermore, all three parsers assign different dependencies to the phrase in diverse animals: spaCy marks it as dependent on play, Stanford on behaviour, physiology, and reproduction, and Malt on reproduction only. Without expert knowledge, it is hard to decide which is the most correct; Stanford's assessment seems most plausible, while spaCy's is the most simplistic.
Figure 3.8: spaCy parser
Figure 3.9: Stanford tagger, spaCy parser
Figure 3.10: Stanford parser
Figure 3.11: Malt parser
Sentence 5
The nine articles contained within this issue address aspects of circadian signaling in diverse taxa, utilize wide-ranging approaches,
and collectively provide thought-provoking discussion of future
directions in circadian research.
None of the parsers identifies address as the root of the sentence. Both spaCy and the Malt parser mark utilize as the root of the phrase, and consider the actual root address either as the root of an adverbial clause or as part of a composite noun (this) issue address aspects. Stanford also fails to mark address as the root, but captures the dependencies between address, utilize and provide appropriately. spaCy only captures the dependency between utilize and provide, while the Malt parser falsely identifies a dependency between approaches and provide.

The phrase aspects of circadian signaling is correctly parsed by the Malt parser, while spaCy and Stanford both mark signaling as an ACI of aspects of circadian. Utilizing the Stanford tagger in conjunction with the spaCy parser yields the best results: while the true root address of the sentence is still not found, it parses the phrase aspects of circadian signaling in diverse taxa correctly, and accurately describes it as the object of address.
Figure 3.12: spaCy parser
Figure 3.13: Stanford tagger, spaCy parser
Figure 3.14: Stanford parser
Figure 3.15: Malt parser
Sentence 6
Thus, perturbations of APP/PS transport could contribute to early neuropathology observed in AD, and highlight a potential novel therapeutic pathway for early intervention, prior to neuronal loss and clinical manifestation of disease.
spaCy, unlike Stanford and Malt, fails to correctly identify the dependency of observed in AD on neuropathology. It also does not accurately mark highlight as dependent on contribute, which Stanford and Malt do. This is not fixed by providing it with the Stanford tagger's tags, although doing so results in a slight improvement, marking novel as an adverbial modifier of pathway.

Without expert knowledge it cannot be established whether spaCy's dependency of prior to neuronal loss ... on highlight is correct, or whether Stanford's and Malt's attachment to intervention is.
Figure 3.16: spaCy parser
Figure 3.17: Stanford tagger, spaCy parser
Figure 3.18: Stanford parser
Sentence 7
Presenilin controls kinesin-1 and dynein function during APP-vesicle transport in vivo.
All parsers parse this sentence correctly. Without expert knowledge, it cannot be decided whether it is more correct to mark the phrase during APP-vesicle transport in vivo as dependent on function (as spaCy and Malt do) or on controls (as Stanford does).
Figure 3.19: spaCy parser
Figure 3.20: Stanford parser
Figure 3.21: Malt parser
Sentence 8
Log EuroSCORE I of octogenarians was significantly higher (30
±5 17 vs 20 ±5 16, P < 0.001).
spaCy does not recognize Log EuroSCORE I of octogenarians as one phrase. In fact, the I is tagged as a personal pronoun. Stanford and Malt do recognize it correctly, and Malt in particular identifies the I as a cardinal number. Consequently, providing the spaCy parser with Stanford tags yields much better results.

The phrase in parentheses is marked differently by all parsers: spaCy marks it as an attribute of was, while Stanford and Malt mark it as an unclassified dependency on was higher. Within the parentheses, only Stanford recognizes vs as the root of the phrase, and P < 0.001 as an apposition. Using the spaCy parser in conjunction with the Stanford tagger also improves on the parse produced.
Figure 3.22: spaCy parser
Figure 3.23: Stanford tagger, spaCy parser
Figure 3.24: Stanford parser
Figure 3.25: Malt parser
Sentence 9
Introduction to the symposium–keeping time during evolution:
conservation and innovation of the circadian clock.
spaCy incorrectly marks symposium–keeping time as one phrase. It parses this phrase correctly once it uses the Stanford tagger's tags. In that setting, it also parses the phrase keeping time during evolution as an ACI that depends on symposium, while Stanford marks keeping time during evolution as an unclassified dependency of introduction, and Malt as an adjectival modifier. Syntactically, the ACI is the most accurate interpretation, but in this special constellation the Stanford or Malt parser's results may be more accurate.

All parsers, however, deal well with the segmentation of syntactically independent phrases by the colon, marking it as an apposition to introduction (spaCy), or as an unclassified dependency of time (Stanford) or of introduction (Malt).
Figure 3.26: spaCy parser
Figure 3.27: Stanford tagger, spaCy parser
Figure 3.28: Stanford parser
Figure 3.29: Malt parser
Sentence 10
Genetic mutations in NLGN4X (neuroligin 4), including point
mutations and copy number variants (CNVs), have been associated with susceptibility to autism spectrum disorders (ASDs).
This sentence is parsed surprisingly well by all parsers. However, spaCy marks (neuroligin 4) as an adverbial modifier of including rather than as an apposition of NLGN4X (as Stanford does). Using Stanford tags, though, it marks it as an apposition of mutations in NLGN4X, which is not as correct as the Stanford parser's result, but an improvement over its default usage.
Figure 3.30: spaCy parser
Figure 3.31: Stanford tagger, spaCy parser
Figure 3.32: Stanford parser
Figure 3.33: Malt parser
3.2.4 Prospective Benefits
spaCy's POS tagger can be trained on user-supplied data. While this is beyond the scope of this work, spaCy's part-of-speech tagger could be trained on data tagged with the Stanford POS tagger, hopefully yielding better results than its default model. It could then be used instead of the Stanford tagger in the pipeline. This would greatly increase performance for two reasons:

1. Switching environments (python3 and Java) relies on reading and writing to file. As Table 3.2 shows, the small parsing step introduced by having to make Stanford POS tagger output available to spaCy further slows down processing. If tagging and parsing can both be done in python3, disk access and conversion become superfluous, further speeding up the pipeline.

2. spaCy's tagger itself seems comparably fast. If retraining does not impact its performance, it could yield a further increase in speed.
3.2.5 Selection
The combination of spaCy + Stanford POS tagger outperforms the other parsers by at least two orders of magnitude in terms of speed, and maintains comparable accuracy. Because of this, and taking the prospective benefits described in 3.2.4 into account, we opt to use spaCy in conjunction with the Stanford POS tagger in the course of this dissertation.

Given the modular nature and loose coupling with the part-of-speech tagger in the python-ontogene pipeline, integrating a retrained spaCy POS tagger should be easy, and would hopefully yield a further increase in processing speed.
3.3 Summary
In this chapter we described the selection process for a suitable dependency parser for the python-ontogene pipeline. We evaluated a series of different parsers and decided to use the spaCy parser in conjunction with the Stanford POS tagger. Not only does this approach outperform the other parsers in terms of speed, it also offers potential for further improvement: namely, if the spaCy POS tagger is trained using the output of the Stanford POS tagger, or another means is found to improve the spaCy POS tagger's performance, we presume that accuracy and speed can be increased dramatically.
Chapter 4
Rule-Based Relation Extraction
In this chapter, we explain our approach to relation extraction based on hand-written rules. Building on the parsing methods described in Chapter 3, we created an independent system, which we call epythemeus. It allows searching a corpus of parsed texts for specific relations defined by rules provided by the user.

We first discuss fundamental design decisions made for the epythemeus system in Section 4.1. The system and its components are described in Section 4.2. A brief account of the data set used for development and evaluation follows in Section 4.3. To demonstrate the functionality of our system, we present a set of manually created rules aimed at finding a large portion of relations in a specific domain of medical literature (Section 4.4), and conclude with a summary in Section 4.5. The system is evaluated in Chapter 5.

All modules and queries described in this chapter can be found in the python-ontogene/epythemeus directory that accompanies this dissertation.
4.1 Design Considerations
While rule-based approaches usually perform well, Simpson explain that the
manual generation of ... rules is a time-consuming process [33]. Considerable effort has been taken to facilitate the writing of rules, and thus reduce
development time. We attempt to make these efforts benefit a wider audience
by converting rules into queries of a common, widely-used format. We opted
for the Structured Query Language (SQL), the most widely used query lan49
CHAPTER 4. RULE-BASED RELATION EXTRACTION
50
guage for relational databases. This dictates the architecture of our system
described at the beginning of Section 4.2.
The epymetheus system builds solely on the syntactic information produced by dependency parsing as described in Chapter 3, and explicitly does not yet take named entity recognition into account. While we point out that including NER information can improve results, allowing epymetheus to utilize such information only at a future stage of development offers the following advantages:
1. Systems utilizing different approaches in a sequential manner can be subject to cascading errors. In the case at hand, this means that a relation may not be found if the system does not detect the corresponding named entity in a previous step. Postponing the inclusion of named entity recognition prevents such cascading errors from occurring as a consequence of the system architecture.
2. Given our focus on aiding the query development process, limiting the features available for phrasing rules allows us to explain our approach with greater clarity and conciseness.
3. In not developing the epymetheus system as a component of the python-ontogene pipeline, but as an independent system, we can ensure that our contributions can be of use to a greater audience.
Especially with regard to point 3, we attempt to keep epymetheus as independent as possible, allowing it to be used with different parsers and allowing further features for rules to be included easily.
4.2 Implementation
The epymetheus system consists of three python modules (stanford_pos_to_db, browse_db, query_helper) and a database. The stanford_pos_to_db module populates the database given a previously tagged corpus as input. The database can be accessed either via the browse_db module or through third-party software. The query_helper module facilitates the creation of queries used by either browse_db or third-party software to extract relations from the database.
Figure 4.1: Schematic overview of the architecture of the epymetheus system.
4.2.1 stanford_pos_to_db
This module uses spaCy to parse a previously POS-tagged text. While spaCy offers POS tagging functionality, we found that parsing quality increases when a different tagger is used (see Chapter 3). In the implementation at hand, the module expects as input a directory of plain text articles containing tokens and tags as produced by the Stanford POS tagger. The module takes the input, creates a new spaCy object for every article, uses spaCy to parse the articles, and commits both the spaCy objects and all dependencies to the database.
Within_IN the_DT last_JJ several_JJ years_NNS ,_, previously_RB rare_JJ liver_NN tumors_NNS have_VBP been_VBN seen_VBN in_IN young_JJ women_NNS using_VBG oral_JJ contraceptive_JJ steroids_NNS ._.

Listing 4.1: Example of the format expected by the stanford_pos_to_db module.
Note that this module can be swapped out to convert the output of a
different parser into the database without affecting the remainder of the
system.
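The conversion step performed on each tagged line can be sketched as follows. This is an illustrative assumption about the format handling, not the module's actual code; the helper name read_tagged is invented for the example:

```python
# Hypothetical sketch: split Stanford-tagger output ("token_TAG" pairs)
# back into (token, tag) tuples before handing the tokens to spaCy.
# Not the original module's code; read_tagged is an invented name.
def read_tagged(line):
    """Split a line of token_TAG pairs into (token, tag) tuples."""
    pairs = []
    for item in line.split():
        # rpartition splits on the LAST underscore, so tokens that
        # themselves contain "_" are handled correctly
        token, _, tag = item.rpartition("_")
        pairs.append((token, tag))
    return pairs

print(read_tagged("previously_RB rare_JJ liver_NN tumors_NNS"))
# [('previously', 'RB'), ('rare', 'JJ'), ('liver', 'NN'), ('tumors', 'NNS')]
```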
4.2.2 Database
The database is implemented using SQLite^1, which was chosen for two reasons:

1. Its python3 interface, sqlite3, allows for easy integration with the rest of the epymetheus system and with the python-ontogene pipeline.

2. It can potentially cope with large amounts of data (up to 140 TB^2). While the system has only been tested with comparably small data sets (see 5.1), this allows epymetheus to be used with much larger data sets such as PubMed in the future.
Schema
The database has two tables: dependency and article. While the dependency table stores dependency tuples generated by the stanford_pos_to_db module, the article table contains serialized python objects generated by the spaCy library.

This approach was chosen to make use of the highly optimized search algorithms employed by SQLite in order to find articles or sentences containing a relation given a certain pattern. At the same time, we maintain the ability to load the corresponding python object containing additional information such as part-of-speech tags, dependency trees and lemmata for further analysis and processing.
The tuples saved in the dependency table have the following format:
dependency(article_id, sentence_id, dependency_type, head_id, head_token, dependent_token, dependent_id, dependency_id)
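A minimal sketch of this two-table layout in SQLite, using the column names of the tuple format above (the exact DDL and serialization of the original system may differ):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE article (
    article_id   INTEGER PRIMARY KEY,
    spacy_object BLOB                 -- serialized spaCy object
);
CREATE TABLE dependency (
    article_id      INTEGER,
    sentence_id     INTEGER,
    dependency_type TEXT,
    head_id         INTEGER,
    head_token      TEXT,
    dependent_token TEXT,
    dependent_id    INTEGER,
    dependency_id   INTEGER PRIMARY KEY
);
""")
# one sample tuple, taken from Table 4.1 below
conn.execute("INSERT INTO dependency VALUES (2004, 6, 'amod', 115, "
             "'arrhythmias', 'ventricular', 114, 56733)")
row = conn.execute("SELECT head_token FROM dependency "
                   "WHERE dependency_id = 56733").fetchone()
print(row[0])  # arrhythmias
```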
To demonstrate the relation of database entries and dependency parses, consider the following sample sentence, and the related parse tree (Figure 4.2) and set of tuples (Table 4.1).

    The ventricular arrhythmias responded to intravenous administration of lidocaine and to direct current electric shock ...^3
^1 https://www.sqlite.org/
^2 https://www.sqlite.org/whentouse.html
^3 2004 | 6 in the development set
Figure 4.2: Parse tree of a sample sentence.
aid   sid  type   hid  head_token      dep_token       did  id
2004  6    amod   115  arrhythmias     ventricular     114  56733
2004  6    nsubj  116  responded       arrhythmias     115  56734
2004  6    ccomp  132  required        responded       116  56735
2004  6    prep   116  responded       to              117  56736
2004  6    amod   119  administration  intravenous     118  56737
2004  6    pobj   117  to              administration  119  56738
2004  6    prep   119  administration  of              120  56739
2004  6    pobj   120  of              lidocaine       121  56740
2004  6    cc     117  to              and             122  56741
2004  6    aux    124  direct          to              123  56742
2004  6    xcomp  116  responded       direct          124  56743
2004  6    amod   127  shock           current         125  56744
2004  6    amod   127  shock           electric        126  56745
2004  6    dobj   124  direct          shock           127  56746

Table 4.1: Dependency tuples for a sample sentence (abbreviated header names).
Indices

Indices are used to increase the efficiency of querying the database. Finding relations relies heavily on joining on the dependent_id and head_id columns, while maintaining that a relationship cannot extend over several articles. This forces all potential constituents of a relation to have the same article_id. In order to maintain an easy mapping between the position of the tokens in the article and the dependent_id or head_id, respectively, the dependent_id and head_id are not unique across the database, but rather commence at 0 for every new article_id.
Because of this, so-called compound indices that allow easy joining on
several columns are created on the column pairs (article_id, head_id) and
(article_id, dependent_id). The effects of these compound indices on
query performance are described in Section 5.1.2.
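The effect can be illustrated with SQLite's EXPLAIN QUERY PLAN; the following is a minimal sketch (the index names are invented for the example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE dependency (
    article_id INTEGER, sentence_id INTEGER, dependency_type TEXT,
    head_id INTEGER, head_token TEXT, dependent_token TEXT,
    dependent_id INTEGER, dependency_id INTEGER)""")
# compound indices on the column pairs used for self-joins
conn.execute("CREATE INDEX idx_head ON dependency (article_id, head_id)")
conn.execute("CREATE INDEX idx_dep ON dependency (article_id, dependent_id)")
# the self-join used to chain dependencies within one article
plan = conn.execute("""EXPLAIN QUERY PLAN
    SELECT * FROM dependency AS d1, dependency AS d2
    WHERE d1.article_id = d2.article_id
      AND d1.dependent_id = d2.head_id""").fetchall()
# the plan should show d2 being searched via one of the compound indices
print(any("idx_" in str(r) for r in plan))
```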
4.2.3 query_helper
This module aids with the creation of complex SQL queries for extracting relations. The key idea is that relation patterns can be split into fragments, which are then combined in various ways. The module is thus particularly useful in automatically generating queries that exhaust all possible combinations of fragments.
For example, a relationship might be expressed in the pattern of X causes
Y or X induces Y, which are equivalent in terms of their dependency pattern.
Another way to express the same relation is X appears to cause Y, or X
seems to cause Y. In this example, six different queries are needed to capture
all possibilities. This highlights the usefulness of a tool that automatically
generates all possible queries given a minimal set of building blocks.
We have thus created our own short-hand notation for such fragments,
which are parsed by the query_helper module. The module then in turn
offers functions that generate queries based on the user-supplied fragments.
Fragments
Fragments represent conditions that apply to dependencies in the database, and that can be chained together. Example 4.2 is a comparatively simple fragment that would match the phrase result in. It is used here to explain the notation of fragments used by the query_helper module. Fragments are saved in plain text files, and a single text file can contain multiple fragments. Having multiple fragments in a single file allows similar fragments to reside together, and thus helps organization.
// result in
d1.head_id, d1.head_id
d1.head_token LIKE 'result%'
d1.dependent_token LIKE 'in'

Listing 4.2: Simple fragment matching the phrase result in.
The first two lines in every fragment carry special meaning. Line 1 is
the title line, and contains the name of the fragment so that it can later be
referred to. The title line is marked by being prefixed with //. It marks
the beginning of a new fragment: Every subsequent line that does not begin
with // is considered a part of the fragment.
Line 2 is the joining elements line, the use of which will be explained
below.
The remaining lines contain conditions. Every fragment is defined by a
set of conditions that apply to a set of dependency tuples as they are stored
in the database (see Section 4.2.2). A single dependency tuple is referred to by a name such as d1 within a fragment. Inspired by SQL notation, the elements of a tuple are referred to using the notation dependency_name.element_name. Conditions on the elements can be expressed either by = or by LIKE, which behaves as the operator of the same name in SQL: it allows the right-hand operand to contain the wild-card %, which represents missing letters, and makes the matching case-insensitive. The condition d1.head_token LIKE 'result%' thus applies to all dependency tuples in which the head_token begins with result, including results, resulted and resulting.
Using different names for tuples allows the fragment to describe patterns that extend over several tuples, as example 4.3 shows. The condition d1.dependent_id = d2.head_id indicates the connection between the dependency tuples. If no condition specifies the relation between the two tuples, the system will merely assume that the two dependency tuples have to be in the same sentence.
// be responsible
d1.head_id, d2.head_id
d1.dependency_type = 'acomp'
d1.dependent_id = d2.head_id
d2.head_token LIKE 'responsible%'

Listing 4.3: Fragment involving multiple dependency tuples, matching phrases such as is responsible or be responsible.
The following sample sentence contains such a phrase. Again, a simplified
parse tree and dependency tuples are provided below.
    ... different anatomical or pathophysiological substrates may be responsible for the generation of parkinsonian 'off' signs and dyskinesias^4
Figure 4.3: Simplified parse tree of a sample sentence containing the be
responsible fragment.
aid       sid  type   hid  head_token   dep_token           did  id
11099450  5    amod   227  substrates   different           223  7319
11099450  5    amod   227  substrates   anatomical          224  7320
11099450  5    cc     224  anatomical   or                  225  7321
11099450  5    conj   224  anatomical   pathophysiological  226  7322
11099450  5    nsubj  229  be           substrates          227  7323
11099450  5    aux    229  be           may                 228  7324
11099450  5    acomp  229  be           responsible         230  7326
11099450  5    prep   230  responsible  for                 231  7327
11099450  5    det    233  generation   the                 232  7328
11099450  5    pobj   231  for          generation          233  7329
11099450  5    prep   233  generation   of                  234  7330
11099450  5    amod   239  signs        parkinsonian        235  7331
11099450  5    punct  239  signs        ‘                   236  7332
11099450  5    amod   239  signs        off                 237  7333
11099450  5    punct  239  signs        ’                   238  7334
11099450  5    pobj   234  of           signs               239  7335

Table 4.2: Dependency tuples corresponding to the sample sentence containing the be responsible fragment.

^4 11099450 | 5 in the development set
Joining Fragments
In both example 4.2 and 4.3, the second line does not represent a condition. The first non-empty line to follow the title line describes the left and right elements of the fragment. These specify which elements of the fragment to use when several fragments are joined together using the join_fragments(left_fragment, right_fragment) function of the query_helper module. Consider fragment 4.4 below, and the results produced by calling join_fragments(subj, result in) (4.5) and join_fragments(subj, be responsible) (4.6), respectively.
// subj
d1.head_id, d1.head_id
d1.dependency_type = 'nsubj'

Listing 4.4: subj fragment matching any phrase containing a subject.

d1.head_id, d1.head_id
d1.dependency_type = 'nsubj'
d1.head_id = d2.head_id
d2.head_token LIKE 'result%'
d2.dependent_token LIKE 'in'

Listing 4.5: Result of joining fragments 4.4 (subj) and 4.2 (result in), which matches phrases in which the word result governs a subject.

d1.head_id, d3.head_id
d1.dependency_type = 'nsubj'
d1.head_id = d2.head_id
d2.dependency_type = 'acomp'
d2.dependent_id = d3.head_id
d3.head_token LIKE 'responsible%'

Listing 4.6: Result of joining fragments 4.4 (subj) and 4.3 (be responsible), which matches phrases in which a subphrase like is responsible or be responsible governs a subject.
Note that in example 4.5 (and analogously 4.6), the resulting fragment is much more specific than one merely matching phrases in which a subject exists and which contain the word result. Specifying left and right joining elements in the fragment definition allows the join_fragments() function to connect the fragments in a more meaningful manner, and truly chain fragments together.
As can be seen, the join_fragments() function also automatically renames the tuple identifiers, and adds a condition equating the left-hand tuple's right element with the right-hand tuple's left element. In this fashion, the relevant elements for joining fragments can be defined as part of the fragment definition. This allows for the automated joining of fragments, instead of having to specify the element on which to join individually for every join.
Setting the option cooccur=True when calling the join_fragments()
function disables this behavior, and will merely rename tuples and merge
conditions.
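The renaming and the added join condition can be sketched in a few lines. This is a simplified re-implementation for illustration, not the original query_helper code; modeling fragments as dictionaries with left, right, and conditions keys is an assumption made for the example:

```python
import re

# Sketch (not the original implementation) of joining two fragments:
# rename the right fragment's tuple identifiers so they do not clash,
# then equate the left fragment's right element with the right
# fragment's left element.
def join_fragments(left, right, cooccur=False):
    # highest tuple number used by the left fragment
    offset = max(int(n) for n in re.findall(
        r"\bd(\d+)\b", " ".join(left["conditions"] + [left["right"]])))

    def rename(s):
        return re.sub(r"\bd(\d+)\b",
                      lambda m: f"d{int(m.group(1)) + offset}", s)

    conditions = left["conditions"] + [rename(c) for c in right["conditions"]]
    if not cooccur:
        # the condition that truly chains the fragments together
        conditions.append(f'{left["right"]} = {rename(right["left"])}')
    return {"left": left["left"], "right": rename(right["right"]),
            "conditions": conditions}

subj = {"left": "d1.head_id", "right": "d1.head_id",
        "conditions": ["d1.dependency_type = 'nsubj'"]}
result_in = {"left": "d1.head_id", "right": "d1.head_id",
             "conditions": ["d1.head_token LIKE 'result%'",
                            "d1.dependent_token LIKE 'in'"]}
print(join_fragments(subj, result_in)["conditions"])
```

Joining the subj and result in fragments with this sketch yields the conditions of Listing 4.5, up to ordering.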
Alternatives
The fragment notation allows for an effortless listing of alternatives. Consider example 4.7, which describes phrases such as lead to or leads to. There are several verbs that behave like lead, such as attribute or relate. In order to easily account for such structurally equivalent verbs, the notation shown in example 4.8 can be used.
// lead to
d1.head_id, d1.head_id
d1.head_token LIKE 'lead%'
d1.dependent_token LIKE 'to'

Listing 4.7: Fragment matching the phrase lead to or leads to.

// to
d1.head_id, d1.head_id
d1.head_token LIKE 'attribute%'
|| led%
|| lead%
|| relate%
d1.dependent_token LIKE 'to'

Listing 4.8: Fragment matching several phrases similar to lead to.
Every line beginning with || refers to the closest previous line that is not preceded by ||, and describes an alternative to that line's right-hand operand.
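The expansion of || lines into a SQL OR-group can be sketched as follows; this is a simplified stand-in for the actual parser in query_helper, and the function name is invented:

```python
# Sketch of expanding the "||" alternative notation into a SQL OR-group.
def expand_alternatives(lines):
    conditions = []
    for line in lines:
        line = line.strip()
        if line.startswith("||"):
            # alternative to the right-hand operand of the base condition
            base = conditions[-1][0]
            operand = line[2:].strip()
            prefix = base.rsplit("'", 2)[0]  # everything up to the opening quote
            conditions[-1].append(f"{prefix}'{operand}'")
        else:
            conditions.append([line])
    # each group of alternatives becomes a parenthesized OR-expression
    return [alts[0] if len(alts) == 1 else "(" + " OR ".join(alts) + ")"
            for alts in conditions]

fragment = ["d1.head_token LIKE 'attribute%'",
            "|| led%", "|| lead%", "|| relate%",
            "d1.dependent_token LIKE 'to'"]
for cond in expand_alternatives(fragment):
    print(cond)
```

Run on the body of example 4.8, this produces the OR-group that appears in Listing 4.9 below.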
From Fragments to SQL Queries
Fragments can either be directly translated into SQL queries, or first joined as many times as necessary, before being turned into queries usable by the browse_db module with the querify() function. This function ensures that all dependency tuples have the same article_id as well as sentence_id. For example, calling querify(join_fragments(subj, to)), using the fragments from examples 4.4 and 4.8, results in the following SQL query:
SELECT d1.article_id, d1.sentence_id
FROM dependency AS d1, dependency AS d2
WHERE d1.article_id = d2.article_id AND d1.sentence_id = d2.sentence_id
AND d1.dependency_type = 'nsubj'
AND (d2.head_token LIKE 'attribute%'
  OR d2.head_token LIKE 'led%'
  OR d2.head_token LIKE 'lead%'
  OR d2.head_token LIKE 'relate%')
AND d2.dependent_token LIKE 'to'
AND d1.head_id = d2.head_id

Listing 4.9: Query generated from the result of joining the fragments subj and to.
Automated Joining of Fragments
The introduction of this section highlighted the importance of automatically generating all possible combinations of fragments. The function active() in the query_helper module gives an example of how a large set of queries can be generated from a very limited set of fragments. The data/example_generate directory that accompanies this thesis contains both the fragments and the generated queries, showcasing the usefulness of the query_helper module.
4.2.4 browse_db
The browse_db module is a shell-like environment that serves three purposes:
• the execution of custom queries
• the easy execution of predefined queries on the database
• the execution of related queries in one batch
When calling the browse_db module, the argument -d ’path/to/file’
can be used to access a custom database. This is particularly useful when
the same queries need to be executed on different data sets.
When started, the browse_db module presents a command-line prompt composed of the database name and the $ sign, waiting for the user to input one of the commands explained below.
user_shell$ python3 browse_db.py
dependency_db.sqlite$
Custom Queries
The database can be queried from the browse_db environment using the q
command followed by the SQL query in quotes. For example, a simple search
for a specific token can be performed as follows:
$ python3 browse_db.py
dependency_db.sqlite$ q "SELECT * FROM dependency WHERE article_id = 2004 AND head_id = 5"
2004,0,pobj,5,in,patients,6,56625
Predefined Queries
Predefined queries written in SQL are saved in plain text files, which are
loaded by browse_db. Every file contains one query, and can be called from
within the browse_db environment using the q command and its file name.
For example, a query stored in the file x_causes_y.sql can be executed as
follows:
$ python3 browse_db.py
dependency_db.sqlite$ q x_causes_y
Several predefined queries are described in Section 4.4. More queries can easily be added, either by placing a new file containing the query in the predefined_queries directory, or by adding the file to a custom directory and calling browse_db.py as follows:
$ python3 browse_db.py -q path/to/custom/directory
Specialized Queries and Helper Functions
browse_db furthermore offers helper functions that perform subtree traversal (subtree()), negation detection (is_negated()) and relative clause resolution (relative_clause()), given an article_id and a token_id. These functions can be used in user mode as follows:
dependency_db.sqlite$ subtree 2004 5
in patients receiving psychotropic drugs
Furthermore, browse_db allows for specialized functions that not only execute a query, but in addition perform further analysis of the results. These functions need to be specifically written, and can utilize the helper functions described above. For example, such a specialized function has been written for the query x_cause_y; the listing below highlights the difference in output between custom queries and queries with specialized functions:
dependency_db.sqlite$ q "SELECT * FROM dependency WHERE article_id = 2004"
2004,0,amod,1,changes,Electrocardiographic,0,56619
...
dependency_db.sqlite$ q x_cause_y
ID 6504332: that (55) cause disorders (59)
-> subj: that
-> obj: movement disorders
...
These specialized functions are automatically used if their function name and the query name coincide. This allows the user to easily add further specialized functions to the browse_db module.
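The name-based dispatch can be sketched as a simple lookup; the handler and table names here are illustrative, not the module's actual internals:

```python
# Sketch of name-based dispatch: if a function named after the query
# exists, run it for further analysis; otherwise print the raw rows.
def x_cause_y(rows):
    # hypothetical specialized handler for the x_cause_y query
    return [f"ID {r[0]}: relation found" for r in rows]

SPECIALIZED = {"x_cause_y": x_cause_y}

def run_query(name, rows):
    handler = SPECIALIZED.get(name)
    if handler is not None:
        return handler(rows)                      # further analysis
    return [",".join(map(str, r)) for r in rows]  # raw rows

print(run_query("x_cause_y", [(6504332,)]))
print(run_query("other", [(2004, 0)]))
```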
Categories of Queries
When loading predefined queries from a directory, the browse_db module also keeps the names of the directory and all the queries it contains, considering subdirectories as well. This allows for the organization of related queries into directories, which can then be called using the name of the directory. For example, if the predefined_queries directory or the directory provided to browse_db contains a sub-directory example, all the queries contained in the example directory can be executed in one command as follows:
dependency_db.sqlite$ q example
Running query 'x_cause_y' from category 'example'...
10539815,47,it,50,cause,function,53
10728962,31,they,32,cause,vasodilation,33
...
Again, the system will check if any specialized function has been written
that matches the name of any of the queries supplied in the directory. If so,
it will use that function rather than the query provided.
Command-Line Mode
The execution of all queries in one category can also be initiated without entering the shell-like mode. For this, the argument --ids is used when calling the module. In that case, the system will not consider any specialized functions and will only execute queries as they are provided in text files as described above. It will also return only the first two fields of every row, assumed to always be article_id and sentence_id.
$ python3 browse_db.py --ids ACTIVE_GENERATED
9041081 | 10
2004 | 2
2886572 | 10
...
This way of using the module is suited for subsequent automatic use, especially when the output of the module is redirected to a file (using python3 browse_db.py --ids category > output.txt).
4.3 Data Set
In their foundational book Mining Text Data, Aggarwal et al. [1] describe the wealth of annotated corpora for domain-independent text mining. However, all these data sets draw on broadcast news, newspaper and newswire data (as in the case of ACE [10]), on the Wall Street Journal (MUC [13]) or on the Reuters corpus (English CoNLL-2003 [36]).
As Simpson explains, a major obstacle to the advancement of biomedical text mining is the shortage of quality annotated corpora for this specific domain [33]. Neves [23], for example, gives an overview of 36 annotated corpora in the biomedical domain, most of which, however, do not offer annotations of relations between entities. The study points to the quality of the corpora released in conjunction with the BioCreative challenges^5, which evaluate text mining and information extraction systems applied to the biological domain and release annotated corpora for evaluation.
For the development of predefined queries as well as the evaluation of our epymetheus system, we use the annotated corpus originally provided for the BioCreative V challenge [39]. It contains 1500 PubMed article abstracts that have been manually annotated for chemicals and diseases, as well as Chemical-Disease Relations (CDRs). It is split into three data sets (development, training, testing), each containing 500 documents. The data is presented both in BioC format, an XML standard for biomedical text mining [8], and PubTator format, a tab-delimited text format used by the PubTator annotation tool [38].
One major shortcoming of the data set, however, is that CDR annotations are made on document level, not on mention level. This means that for every document, the annotation notes which relations are found in the entire document, but does not offer further information on which occurrence of an entity is an argument of the relation and where it is found within the document. The PubTator annotation tool highlights named entities as shown in Figure 4.4, but it does not provide out-of-the-box visualization for relations, and hence is not fit for our purpose.
Based on the BioCreative V corpus, we automatically extracted candidate sentences, which are likely to contain a relation (Section 4.3.1). These sentences were then manually categorized according to the pattern that contains the relation (Section 4.3.2), in order to develop queries that match the patterns and to be able to evaluate the effectiveness of the epymetheus system.
Note that we chose a corpus containing CDR annotations not because the epymetheus system is specific to that subdomain, but due to the scarcity of high-quality annotated corpora in the biomedical domain. In fact, our system is just as suitable for relation extraction in any other subdomain.

^5 http://www.biocreative.org/about/background/description/
Figure 4.4: Exemplary view of named entity highlighting on PubTator.
4.3.1 Conversion
For the development of queries and evaluation, we only consider relations that are confined to a single sentence. While the epymetheus system is technically able to deal with relations that transcend sentence boundaries, this is beyond the scope of this work. We thus converted the documents of the corpus^6 as follows:

The document is split into sentences using spaCy, and only sentences that contain both entities of an annotated relation are retained. These sentences are printed out separately, and the entities participating in the annotated relationship are capitalized to facilitate human evaluation.
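This extraction step can be sketched as follows, with a naive period split standing in for spaCy's sentence splitter; the function and variable names are illustrative, not those of the conversion script:

```python
# Sketch: keep only sentences containing both entities of an annotated
# relation, capitalizing the entity mentions for human evaluation.
def candidate_sentences(text, entity_pairs):
    out = []
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    for i, sent in enumerate(sentences):
        lower = sent.lower()
        for e1, e2 in entity_pairs:
            if e1.lower() in lower and e2.lower() in lower:
                marked = sent
                for e in (e1, e2):
                    marked = marked.replace(e, e.upper())
                out.append((i, marked))
    return out

text = ("Light chain proteinuria was found in 9 of 17 tuberculosis "
        "patients treated with rifampin. A control sentence follows.")
print(candidate_sentences(text, [("proteinuria", "rifampin")]))
# [(0, 'Light chain PROTEINURIA was found in 9 of 17 tuberculosis patients treated with RIFAMPIN')]
```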
804391|t|Light chain proteinuria and cellular mediated immunity in rifampin treated patients with tuberculosis.
804391|a|Light chain proteinuria was found in 9 of 17 tuberculosis patients treated with rifampin. ...
804391    12    23    proteinuria    Disease     D011507
804391    58    66    rifampin       Chemical    D012293
...
804391    CID    D012293    D011507

Listing 4.10: Example of PubTator format.

^6 using the script python-ontogene/converters/pubtator_to_relations.py
804391 | 0 | Light chain PROTEINURIA and cellular mediated immunity in RIFAMPIN treated patients with tuberculosis.
804391 | 1 | Light chain PROTEINURIA was found in 9 of 17 tuberculosis patients treated with RIFAMPIN

Listing 4.11: Extracted relations after conversion.
Table 4.3 lists the number of sentences containing a probable mention of
an annotated relationship extracted from the respective subset of the corpus.
subset       articles in subset  sentences extracted
development  500                 623
training     500                 581
test         500                 604

Table 4.3: Sentences extracted per subset.
Note that table 4.3 also lists the training data set for the sake of completeness and comparison. However, that set is not used in the course of this
work.
4.3.2 Categorization
From the manual analysis of the sentences in the development subset, a set of 8 categories was derived, and each of the sentences was manually assigned to one of these categories. The categories describe the structure of the sentence that points towards the relation it contains. Following this, the sentences in the test set were likewise assigned to the same set of categories.
Below, we describe the categories and the criteria that determine the association of a sentence with the respective category. While the categories could apply to other domains too, they have been developed from sentences containing chemical-disease relations, and thus their precise definitions are specific to the CDR domain.
ACTIVE
This category involves active sentences in the form of X causes Y (or X cause
Y ). Included are constructions with modal verbs such as X may cause Y or
X did cause Y, as well as extended patterns such as X appears to cause Y. The following sentence stands as an example for this category.

    This is the first documentation that METOCLOPRAMIDE provokes TORSADE DE POINTES clinically.^7
A collection of verbs that establish a relation in the development subset has been compiled:

• accompany
• associate
• attenuate
• attribute
• cause
• decrease
• elicit
• enhance
• increase
• induce
• invoke
• kindle
• lead to
• precipitate
• produce
• provoke
• recur on
• relate
• reflect
• be responsible
• resolve
• result in
• suppress
• use
DEVELOP
A common setting for establishing relationships between chemicals and diseases is to expose a subject to a chemical X and observe a subsequent case of disease Y [27]. This category captures sentences that express such cases. It is the broadest category, including a vast variety of patterns. An example of a simple pattern is X in _ on Y, where X is a disease, _ represents an entity, usually patient, and Y is the chemical. More complicated patterns are case of X within _ receiving Y or X in _ admitted to using Y. Many of these patterns also contain a temporal component, such as development of X following Y treatment or X within _ of administration of Y, where _ represents some time period.
The sentence below is a typical example of this category.
    Five patients with cancer who developed ACUTE RENAL FAILURE that followed treatment with CIPROFLOXACIN are described ...^8

^7 11858397 | 6 in the development set
^8 8494478 | 2 in the development set
DUE
This simple category captures sentences in the form of X due to Y and related variants that contain the word due. An example of such a sentence is listed below:
    Fatal APLASTIC ANEMIA due to INDOMETHACIN–lymphocyte transformation tests in vitro.^9
HYPHEN
A large proportion of annotated relations were found in the pattern X-induced Y, such as APOMORPHINE-induced HYPERACTIVITY^10. The category also includes more complicated variations of the pattern, such as KETAMINE- or diazepam-induced NARCOSIS^11 or PILOCARPINE (416 mg/kg, s.c.)-induced limbic motor SEIZURES^12. It also extends to the same pattern using different words, namely:
• associate
• attribute
• induce
• kindle
• mediate
• relate
NOUN
This category revolves around nouns that can express relations in patterns such as the action of X on Y or the X action of Y. For example, a sentence containing the dual action of MELATONIN on pharmacological NARCOSIS seems ...^13 is considered to belong to this category. Nouns that have been found to express relations in this sense in the development subset are:
• action
• association
• case
• cause
• complication
• effect
• enhancement
• factor
• induction
• marker
• pathogenesis
• relationship
• role

^9 7263204 | 0 in the development set
^10 6293644 | 2 in the development set
^11 11226639 | 6 in the development set
^12 9121607 | 2 in the development set
^13 11226639 | 7 in the development set
NOUN+VERB
This category extends the previous one in that it applies to sentences in which one of the nouns of the NOUN category is used in conjunction with a verb to express a relation. The pattern X plays a role in Y, as expressed in the sentence below, is a prime example of this category.
    ERGOT preparations continue to play a major role in MIGRAINE therapy^14
PASSIVE
Sentences in the form of X associated with Y or X is associated with Y belong to this category. This includes all tenses (X was associated with Y), sentences of the pattern X appears to be associated with Y, as well as the rare case of X associated by Y. The same set of verbs as in the ACTIVE category applies here. For example, the following sentences are assigned to this category.

    The HYPERACTIVITY induced by NOMIFENSINE in mice remained ...^15

    Symptomatic VISUAL FIELD CONSTRICTION thought to be associated with VIGABATRIN ...^16
NO CATEGORY
Sentences that did not match any of the previously mentioned categories were assigned the NO CATEGORY label. Note that the sentences extracted from the development subset of the original corpus do not necessarily express the annotated relation, even though both entities of the relation appear in the sentence. A large proportion of sentences were assigned to this category for that reason. For example, the following sentence does not establish any relation between the two entities sirolimus and capillary leak:

    Systemic toxicity following administration of SIROLIMUS (formerly rapamycin) for psoriasis: association of CAPILLARY LEAK syndrome with apoptosis of lesional lymphocytes.^17
^14 3300918 | 4 in the development set
^15 2576810 | 3 in the development set
^16 11077455 | 1 in the development set
^17 10328196 | 0 in the development set
Another minor reason for attribution to this category are entities with short names that also occur in natural language, and are thus extracted falsely by the system. An extraction process more elaborate than the one described in Section 4.3.1, for example involving tokenization or even entity normalization, could ameliorate this shortcoming, but lies beyond the scope of this work.
4.3.3 Development and Test Subsets
Tables 4.4 and 4.5 list the number of sentences for every category in the
development subset and test subset, respectively, as well as their percentage.
category      sentences   percentage
ACTIVE        49          7.865%
DEVELOP       146         23.43%
DUE           7           1.124%
HYPHEN        181         29.05%
NO CATEGORY   109         17.5%
NOUN          23          3.692%
NOUN+VERB     11          1.766%
PASSIVE       97          15.57%
Total         623         100%

Table 4.4: Categorization for sentences extracted from the development subset.
category      sentences   percentage
ACTIVE        47          8.09%
DEVELOP       128         22.03%
DUE           6           1.033%
HYPHEN        150         25.82%
NO CATEGORY   122         21%
NOUN          21          3.614%
NOUN+VERB     22          3.787%
PASSIVE       85          14.63%
Total         581         100%

Table 4.5: Categorization for sentences extracted from the test subset.
The annotated corpora, the extracted sentences and their categorization
as well as related material can be found in the data/manual_corpus directory
that accompanies this thesis.
4.4 Queries
This section describes the development of query sets, which should provide
the reader with a fair notion of how to use the epythemeus system. Based
on the development set and using the query_helper module, a set of queries
was developed for three categories:
• the trivial case of the HYPHEN category
• the ACTIVE category, considered relatively simple
• the complex DEVELOP category
The query sets were aimed at having near-perfect recall for their respective category on the development set, while generalizing as much as possible.
The fragments and generated queries for each query set can be found in the
data/query_set directory that accompanies this work.
4.4.1 HYPHEN queries
Queries for this category are trivially easy to make. The following single fragment produces a query that achieves almost perfect recall on the development
set:
// hyphen
d1.dependent_id, d1.head_id
d1.dependent_token LIKE '%-induced'
|| %-associated
|| %-attributed
|| %-kindled
|| %-mediated
|| %-related
Note that it is the dependent_token where we expect words such as levodopa-induced to occur. This is because most commonly, phrases in the pattern X-induced Y are parsed as an amod-dependency, where Y will be the head_token and X-induced the dependent_token. Table 4.6 below shows the corresponding dependency tuples.
aid       sid  type  hid  head_token   dep_token         did  id
10091616  0    prep  0    Worsening    of                1    631
10091616  0    amod  3    dyskinesias  levodopa-induced  2    632
10091616  0    pobj  1    of           dyskinesias       3    633

Table 4.6: Dependency tuples representing a HYPHEN relation.
4.4.2 ACTIVE queries
In order to maximize generalization, a set of minimal fragments was determined that would cover as many sentences from the development set as possible, and then all possible combinations of these were automatically generated. This required manual analysis of every sentence's structure and key words. We found that sentences in the ACTIVE category are made up of up to three sets of fragments.

A first set of fragments describes a set of verbs that express a direct relationship between two entities. These words may either take direct objects (such as to cause), or require a preposition (such as to associate with). We also added to this set of fragments the case of to be responsible for. This set also includes variations involving modal verbs (may cause), different numeri (X causes Y and X and Y cause Z) as well as tempora (X causes Y and X caused Y). Below is an example of fragments in this set. For a full account of such fragments, refer to the data/example_generate directory.
// with
d1.head_id, d1.head_id
d1.head_token LIKE 'associate%'
|| co-occur%
|| coincide%
d1.dependent_token LIKE 'with'

// active
d1.head_id, d1.head_id
d1.head_token LIKE 'accompan%'
|| associate%
|| attenuate%
|| cause%
...
|| use%
d1.dependency_type = 'dobj'

// be responsible
d1.head_id, d2.head_id
d1.dependency_type = 'acomp'
d1.dependent_id = d2.head_id
d2.head_token LIKE 'responsible%'
Figure 4.5 and Table 4.7 show the parse tree for a typical sentence in this category, as well as the corresponding dependency tuples.
Figure 4.5: Typical parse tree for an ACTIVE sentence.
aid       sid  type      hid  head_token  dep_token       did  id
11858397  6    nsubj     132  provokes    metoclopramide  131  13197
11858397  6    dobj      132  provokes    pointes         135  13201
11858397  6    advmod    132  provokes    clinically      136  13202
11858397  6    compound  135  pointes     torsade         133  13199
11858397  6    nsubj     135  pointes     de              134  13200

Table 4.7: Dependency tuples for a typical ACTIVE sentence.

A second set captures cases in the pattern of X verb_a and verb_b Y, where verb_b expresses the relation in question. An example of such a case is the following sentence, where the relation enhances(oral hydrocortisone, pressor responsiveness) is captured by this pattern. Note that because of the way the sentence is parsed, this relation would not be discovered without this fragment (see figure 4.6).
Oral hydrocortisone increases blood pressure and enhances pressor responsiveness in normal human subjects.18
Figure 4.6: Parse tree of a sentence in the pattern X verb_a and verb_b Y.
// conj
d1.head_id, d1.dependent_id
d1.dependency_type = 'conj'
A third set entails structures like X appears to cause Y or X seems to
cause Y.
// appears
d1.head_id, d1.dependent_id
d1.head_token LIKE 'appear%'
|| seem%
d1.dependency_type = 'xcomp'
Note that such patterns can be combined in various ways: for example,
the verb to cause can occur in the pattern X causes Y, X appears to cause
Y, X some_verb and causes Y, X appears to some_verb and cause Y and
X some_verb and appears to cause Y. Queries that match the latter case,
however, are not generated, as there are no such sentences in the development
set.
From these fragments, a set of 29 queries was automatically generated
using the query_helper module. The set of generated queries can be found
in the data/example_generate directory that accompanies this thesis.
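The automatic combination of fragments can be sketched as follows. This is not the actual query_helper implementation; the fragment representation, names and function below are simplified assumptions for illustration.

```python
from itertools import product

# Simplified stand-ins for fragments: each alternative within a slot is a
# list of WHERE conditions; a query is one alternative per slot, joined by AND.
verb_fragments = [
    ["d1.head_token LIKE 'cause%'", "d1.dependency_type = 'dobj'"],
    ["d1.head_token LIKE 'associate%'", "d1.dependent_token LIKE 'with'"],
]
modifier_fragments = [
    [],  # plain 'X causes Y'
    ["d0.head_token LIKE 'appear%'", "d0.dependency_type = 'xcomp'"],
]

def generate_queries(*slots):
    """Yield one WHERE clause per combination of fragment alternatives."""
    for combo in product(*slots):
        conditions = [c for fragment in combo for c in fragment]
        yield " AND ".join(conditions)

queries = list(generate_queries(verb_fragments, modifier_fragments))
print(len(queries))  # 4 combinations from 2 x 2 fragment alternatives
```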
18: 2722224 | 1 in the original development set of the BioCreative corpus
4.4.3 DEVELOP queries
The patterns of sentences in the DEVELOP category are certainly the most varied. Recall that sentences in the DEVELOP category describe a situation where a chemical is administered to a recipient, and a disease is observed in that recipient. Every pattern must thus have a part describing the disease, and one describing the chemical.
Since the database does not store entity recognition information, the
epythemeus system needs to rely on parsing patterns to identify diseases and
chemicals, respectively. The fact that the administration of the chemical as
well as the observation of the disease need to be described in the sentence
makes it possible to identify the elements of a chemical-disease relationship.
While the fragments presented here do not cover all the cases in the
development set, they give an idea of how more complicated relations can be
found.
Chemicals
The patterns that identify chemicals revolve around the administration of
the chemical, which can manifest in a variety of ways. Below we give an
example of the kind of structures that can express the administration of a
chemical. The fragment titles should give sufficient description of the pattern
the fragments describe.
// X therapy
d1.head_id, d1.head_id
d1.head_token = 'therapy'
|| injection%
d1.dependency_type = 'amod'

// therapy with X
d1.head_id, d2.dependent_id
d1.head_token LIKE 'therap%'
|| injection%
d1.dependent_id = d2.head_id
d2.head_token = 'with'
d2.dependency_type = 'pobj'

// injection of X
d1.head_id, d2.dependent_id
d1.head_token LIKE 'injection%'
|| administration
|| dose%
d1.dependent_id = d2.head_id
d2.head_token = 'of'
d2.dependency_type = 'pobj'
A particular case was often encountered in which the chemical administration is not explicitly described, as in the following sentence:

... effects were ... VOMITING in the FLUMAZENIL group.19

Here, the chemical administration is only implicitly indicated as a quality of the recipient. The fragment below describes the pattern X group, but there are many other sentences of this kind, such as women on ORAL CONTRACEPTIVES 20 or occurrence of SEIZURES and neurotoxicity in D2R-/- mice treated with the cholinergic agonist PILOCARPINE 21.
// X group
d1.head_id, d1.head_id
d1.dependency_type = 'compound'
d1.head_token LIKE 'group'
Diseases
A simple example of the description of the occurrence of a disease follows:
The development of CARDIAC HYPERTROPHY was studied...22
In fact, such constructions involving nouns similar to those in the NOUN category are quite common, and it might be fruitful to explore possible synergies between queries for the two categories. The fragments below exemplify how such constructions can be represented as fragments:
19: 1286498 | 10 in the development set
20: 839274 | 0 in the development set
21: 11860278 | 4 in the development set
22: 6203632 | 1 in the development set
// development of X
d1.head_id, d1.head_id
d1.head_token LIKE 'development%'
d1.dependent_id = d2.head_id
d2.dependency_type = 'pobj'

// effects of X
d1.head_id, d2.dependent_id
d1.dependent_token LIKE 'effect%'
d1.dependency_type = 'nsubj'
d1.head_id = d2.head_id
Chemical Disease Relation
The patterns that actually capture the structures representing a relation between a disease and a chemical are very varied. We have identified three ways of finding them:

1. The subject exposed to the chemical can be used to establish the connection between the disease and the chemical.
2. A time word establishes a temporal relation between the administration of a chemical and disease onset.
3. A preposition is used instead of a verb.
The first case is the most straightforward approach given our system. However, such sentences are surprisingly rare. While the sentence below is a good example of the kind of sentences that can be found with this approach, we found that the second case is far more fruitful.

... NICOTINE-treated rats develop LOCOMOTOR HYPERACTIVITY ...23

In fact, it seems that time words such as after are often used when describing chemical administration, which could be exploited to create more robust queries. The following fragments can be joined to fragments describing chemical administration as described above.
23: 3220106 | 8 in the development set
// after X
d1.head_id, d1.dependent_id
d1.dependent_token = 'after'
d1.dependency_type = 'prep'

// following X
d1.head_id, d1.dependent_id
d1.head_token = 'following'
d1.dependency_type = 'dobj'
The resulting fragment, called time word+chemical for the purposes of this discussion, can then be joined directly to the occurrence of a disease, which allows finding sentences such as the one below:
Delayed asystolic CARDIAC ARREST after DILTIAZEM overdose; resuscitation with high dose intravenous calcium.24
It could also be joined to a verb expressing disease occurrence to find
sentences such as the following:
A 42-year-old white man developed acute hypertension with severe HEADACHE and vomiting 2 hours after the first doses of
amisulpride 100 mg and TIAPRIDE 100 mg.25
It seems, however, that not joining the time word+chemical fragment to anything directly, but instead creating a query that merely checks for the co-occurrence of the pattern expressed by the time word+chemical fragment and a disease, yields good results with few false positives. While this claim needs further substantiation, we suggest that this is because the time word+chemical fragment is used almost exclusively in sentences expressing a chemical-disease relation.
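A minimal sketch of such a co-occurrence query follows, using a toy database with an assumed schema: both patterns must merely occur within the same sentence, with no direct structural link between them. The article ids and tokens are invented for illustration.

```python
import sqlite3

# Toy database with an assumed schema; sentence 0 matches both patterns,
# sentence 3 matches only the time-word pattern.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE dependencies (
    article_id TEXT, sentence_id INTEGER, dependency_type TEXT,
    head_token TEXT, dependent_token TEXT)""")
conn.executemany(
    "INSERT INTO dependencies VALUES (?, ?, ?, ?, ?)",
    [("a1", 0, "prep", "arrest", "after"),     # 'after X' pattern
     ("a1", 0, "pobj", "of", "hypertension"),  # disease-occurrence pattern
     ("a1", 3, "prep", "measured", "after")])  # time word only

# Require co-occurrence of both patterns in the same sentence, without
# joining them into one dependency path.
query = """SELECT DISTINCT d1.article_id, d1.sentence_id
           FROM dependencies d1, dependencies d2
           WHERE d1.dependent_token = 'after' AND d1.dependency_type = 'prep'
             AND d2.dependency_type = 'pobj' AND d2.head_token = 'of'
             AND d1.article_id = d2.article_id
             AND d1.sentence_id = d2.sentence_id"""
print(conn.execute(query).fetchall())  # [('a1', 0)]
```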
A third way that is often used to express relations in the development set is to rely on prepositions rather than verbs. While this is especially common in titles, the example below shows how this can also be the case in normal text.
24: 12101159 | 0 in the development set
25: 15811908 | 2 in the development set
Two cases of postinfarction ventricular SEPTAL RUPTURE in
patients on long-term STEROID therapy are presented ...26
However, since prepositions such as in in the example above are so common, it is very difficult to write queries that return only sentences that use them to express a chemical-disease relation.
4.5 Summary
In summary, we created a system capable of extracting relations of any kind,
and introduced the concept of fragments that aids with the process of writing
queries. Both are domain-independent, and while we developed them with
biomedical text mining in mind, they are just as applicable to other fields.
4.5.1 Arity of Relations
Note that no constraints are put on the number of entities participating in
a relation. The distinction between relation and event extraction, as it has
been suggested by Simpson [33], for example (see Section 1.3), thus has little
meaning.
Currently, the query_helper module will generate queries that return the
identifier of the sentence in which the relation is found. Using specialized
queries, and making use of the subtree() function as described in Section
4.2.4, however, the epythemeus system can be adapted to return the individual entities participating in relationships of arbitrary complexity.
In fact, the queries developed for the ACTIVE and HYPHEN categories
return relations consisting of three entities each, where the verb expresses
the quality of the relation. For example, relations of the pattern X increases
Y and X suppresses Y are currently both part of the ACTIVE category, but
could be assigned to different categories to allow for a more differentiated
extraction of relations.
In the same spirit, queries in the DEVELOP category in particular can
extract relations consisting of various entities: capturing dosage of drug administration, for example, is made quite easy using fragments.
26: 9105126 | 1 in the development set
4.5.2 Query Development Insights
The examples above showcase how the concept of fragments greatly facilitates the creation of queries, especially in cases where many possible combinations of similar structural patterns occur. However, the writing of queries could be further facilitated if the dependency tuples also stored lemmata (forgoing the need for the LIKE operator and allowing for more concise queries), and if word lists could be supplied for alternatives, rather than listing every word individually. This might be especially useful to increase re-use: for example, queries for the PASSIVE category are very likely to use the same verbs as are used in the ACTIVE category.
While the manual creation of queries requires a good understanding of the
annotation scheme used by the parser, the automatic generation of possible
variations allows the system to cover a large proportion of relations. The use
of the system has been demonstrated using the example of chemical-disease
relation extraction, and the queries written for the demonstration are specific
to that domain.
Note that the fragments and queries developed are specific to the dependency scheme employed by the parser. While efforts are made to establish
universally accepted standards such as the Universal Dependency scheme27 ,
these are not yet widely used, limiting the re-use of existing fragments and
queries.
4.5.3 Augmented Corpus
In order to evaluate the system, and building on a previously annotated corpus, we manually categorized over 1000 sentences extracted from PubMed articles according to the pattern that defines the relation they contain. While this categorization does not follow any particular standard such as the ones laid out by Wilbur [40], and in particular offers no measure of inter-annotator agreement, we hope that it will help the reader to understand how to use the epythemeus system, and that it may be useful for other related research.
27: http://universaldependencies.github.io/docs/
Chapter 5
Evaluation
In this chapter we evaluate the epythemeus system against the test corpus described in Chapter 4. Furthermore, in order to give an estimate of the effort required to process the entire PubMed using the approach described in this thesis, we apply our approach to a small test set and extrapolate the measured results.
5.1 Evaluation of epythemeus

5.1.1 Query Evaluation
We evaluate the query sets developed for the HYPHEN and the ACTIVE category (described in Sections 4.4.1 and 4.4.2, respectively). While we also list the results for the queries written for the DEVELOP category (Section 4.4.3), that set of queries only served to exemplify how queries for more complicated patterns can be obtained. The results for this query set thus do not give any account of the efficacy of the epythemeus system, but serve only to further demonstrate the query development process.
For evaluation, the query sets were executed on the development set and the test set. The queries return article_id and sentence_id for every sentence in which a relation is found. From the manually categorized sentences of the development set, the article_id and sentence_id for every sentence belonging to the category in question are extracted using the categories.py script. The article_id and sentence_id pairs extracted in this fashion are taken as the gold standard.
The gold standard and the output produced by the query sets are then
compared using the evaluate.py script. All scripts used for evaluation, as well
as intermediate results can be found in the data/manual_corpus directory.
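The comparison can be sketched as follows. This is a minimal reimplementation of the idea behind evaluate.py, not the script itself, and the id pairs below are invented for illustration:

```python
# Gold standard and query output as sets of (article_id, sentence_id) pairs.
def evaluate(gold, found):
    """Compute precision, recall and F1 from two sets of id pairs."""
    tp = len(gold & found)
    precision = tp / len(found) if found else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall > 0 else 0.0)
    return precision, recall, f1

gold = {("a1", 0), ("a2", 5), ("a3", 1)}   # invented id pairs
found = {("a1", 0), ("a2", 5), ("a4", 2)}  # two hits, one spurious result
p, r, f1 = evaluate(gold, found)
print(round(p, 3), round(r, 3), round(f1, 3))  # 0.667 0.667 0.667
```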
HYPHEN queries
Table 5.1 shows how the query set has very high recall on the development
set, and comparable recall on the test set.
set          recall  precision  F1 measure  TP   FP   FN
development  0.961   0.316      0.476       174  376  7
test         0.953   0.294      0.449       142  341  7

Table 5.1: Results of the HYPHEN query set executed on the development and test set.
The false negatives (FN) on the development set are all sentences in which the hyphen connects an element in parentheses, such as the sentence below:

decreased the incidence of LIDOCAINE (50 mg/kg)-induced CONVULSIONS.1

Such cases cannot easily be covered, given that in such situations the spaCy parser will treat the hyphen as an individual token, and produce a considerably more complex parse tree.
Of the seven FNs on the test set, five are due to the same issue. The remaining two are explained by a spelling mistake in the original text (the appearance of these LEVODOPA-induce OCULAR DYSKINESIAS 2), and by the use of a word not previously encountered (rats with LITHIUM-treated DIABETES INSIPIDUS 3).

The huge number of false positives (FP) warrants a more thorough discussion: While a systematic evaluation of these is beyond the scope of this work, ten randomly selected FPs were manually evaluated.

In one sentence4 counted as FP, the relation between amiodarone and pulmonary lesion should have been found. However, the sentence only contains
1: 11243580 | 5 in the development set
2: 11835460 | 3 in the test set
3: 6321816 | 2 in the test set
4: 18801087 | 3 in the test set
the words amiodarone-related lesion, and thus was not extracted as one of the sentences for the gold standard, but was found by the query. Similar problems arise with abbreviations: In the example below, the relation between streptozotocin and nephropathy is not recognised:

... STZ-induced diabetic nephropathy in ... mice.5

However, in the original annotation, the abbreviation STZ is annotated and given the same identifier as streptozotocin. In fact, it is the pubtator_to_relations.py converter that fails to resolve this correctly.
Another problem with the same converter, which led to two FPs, is that sentences are not extracted from the original PubTator file (see Section 4.3.1) if the participants of the relation occur in the text with their starting letters capitalized. This occurs occasionally in titles, and is a trivial bug to fix. However, fixing this bug would partially invalidate the results obtained so far. Two of the ten randomly selected FPs are attributable to this fault.
In one case6, a possible relation (glucocorticoid-induced glaucoma) that has not been annotated was returned. However, without expert knowledge, it is not possible to decide whether this is an oversight of the original annotation, or correctly classified as a FP.

The remaining three sentences are indeed false positives, returning phrases such as amphetamine-induced group 7 or 5HT-related behaviors 8.
Table 5.2 below summarizes the findings from the manual evaluation of the randomly selected sample of false positives.

reason                        number of sentences  projected
upper case in conversion      3                    102.3
different names for entities  3                    102.3
correct FPs                   3                    102.3
requires expert knowledge     1                    34.1

Table 5.2: Summary of reasons for a random sample of FPs in the test set, and projected numbers for the entire test set.
5: 20682692 | 7 in the test set
6: 24691439 | 3 in the test set
7: 24739405 | 8 in the test set
8: 24114426 | 3 in the test set
ACTIVE queries
Table 5.3 lists the results of the ACTIVE query set performed on the development and test set, respectively.
set          recall  precision  F1 measure  TP  FP   FN
development  0.939   0.121      0.215       46  333  3
test         0.596   0.065      0.117       28  403  19

Table 5.3: Results of the ACTIVE query set executed on the development and test set.
All three FNs on the development set are due to incorrect parses. For
example, in the sentence below, the spaCy parser considers isoniazid increase
a compound noun.
High doses of ISONIAZID increase HYPOTENSION induced by
vasodilators9
Again, ten randomly selected sentences from the FPs on the test set were manually evaluated. In contrast to the HYPHEN query set, the FPs for this set seem to fall into one of two categories. In six cases, the sentence returned did seem to contain a relation, but not the one that was annotated. Without expert knowledge, it is not possible to make a definite assessment, but the example below shows that it is plausible that a relation was indeed found, and illustrates the complexity of sentences that are still recognized by the query set.

Application of an irreversible inhibitor of GABA transaminase, gamma-vinyl-GABA (D,L-4-amino-hex-5-enoic acid), 5 micrograms, into the SNR, bilaterally, suppressed the appearance of electrographic and behavioral seizures produced by pilocarpine10
Note that the above sentence was categorized as PASSIVE (for the relation between pilocarpine and seizures, which are the annotated entities). However, the relation between gamma-vinyl-GABA and seizures, which caused the sentence to be returned by the ACTIVE query set, was not annotated in the original corpus.
9: 9915601 | 1 in the development set
10: 3708328 | 7 in the test set
Three sentences are correctly marked as FPs. For example, the sentence we used Poisson regression 11 is found, indicating that the word use may be too ambiguous to be used in ACTIVE queries without other structures. The sentence below is a correct FP, which however could hint at a possible relation, if the reference each drug could be resolved.

Administration of each drug and their combinations did not produce any effect on locomotor activity.12

One exception is the following sentence, in which an incorrect parse causes it to be found.

Naloxone (0.3, 1, 3, and 10 mg/kg) injected prior to training attenuated the retention deficit with a peak of activity at 3 mg/kg.13

The following table summarizes the random sample evaluation of false positives.
reason                     number of sentences  projected
correct FPs                3                    120.9
requires expert knowledge  6                    241.8
incorrect parse            1                    40.3

Table 5.4: Summary of reasons for a random sample of FPs.
DEVELOP queries
As explained above, the DEVELOP query set is intended to demonstrate
the query creation process, and does not aim at high performance. Table 5.5
lists its results to convey a notion of what a few fragments can achieve.
set          recall  precision  F1 measure  TP  FP  FN
development  0.233   0.362      0.283       34  60  112
test         0.102   0.157      0.123       13  70  115

Table 5.5: Results of the DEVELOP query set executed on the development and test set.
11: 25907210 | 5 in the test set
12: 15991002 | 12 in the test set
13: 3088653 | 3 in the test set
Query Evaluation Discussion
The results of the HYPHEN and ACTIVE query sets indicate that the
epythemeus system is capable of delivering useful results.
The biggest obstacle to favorable performance is the low precision (0.294 on the test set for the HYPHEN queries, 0.065 on the test set for the ACTIVE queries, and 0.157 for the DEVELOP queries). Systems in the BioNLP '09 shared tasks achieved F1 measures of up to 0.52 [17], and thus surpass our results (F1 measures of 0.449 for the HYPHEN queries, 0.117 for the ACTIVE queries and 0.123 for the DEVELOP queries) by far.

As the discussion above shows, these values are partially due to the inferior quality of the gold standard used for evaluation: It is very possible that our queries find relations that experts would consider as such, but that are not annotated in our reference corpus.
Besides that, further action needs to be taken to prune false positives. As stated in Section , we deliberately do not include named entity information in our current approach. However, future versions of the epythemeus system could use NER information to reduce the number of FPs, and thus increase the F1 measure. For example, all FPs returned by the HYPHEN query set on the test set could have been identified as such if NER information had been made use of.
While we suggest here that NER information be used to prune the results returned by queries based solely on syntactic information, it is certainly more common to reverse the order of these approaches. As we describe in , however, this introduces the problem of cascading errors. It would thus be interesting to compare the outcomes of systems using NER to prune previously obtained results with those of systems using NER as the basis for further refinement.
5.1.2 Speed Evaluation and Effect of Indices
The effect of using the compound indices described in 4.2.2 on query execution time was evaluated using three sample queries:
Q1 X-induced Y
Q2 X causes Y
Q3 X is caused by Y
Additionally, one meaningless query (Q4) was created that uses a larger number of self-joins. The queries were executed on two different databases: D1, containing 323 004 actual dependencies, and D2, containing 1 000 000 randomly generated entries, using the command line tool of SQLite. Queries have been slightly modified to match the random nature of the data in D214.

Since the creation of D1 using the stanford_pos_to_db module involves other processing, such as the extraction of dependencies from spaCy objects, we only take note of the different creation times for D2 with and without indices. As table 5.6 below shows, adding new entries into the database took about 5 times longer when using indices. However, the database, once created, is not expected to change frequently. Thus these numbers have little relevance compared to the increase in query speed displayed in tables 5.7 and 5.8. As these tables show, the querying time can be reduced by a factor of about 1.98 to 12.93 depending on the number of self-joins.
Table 5.6: Table and entry creation speeds with and without indexing

indexing         total creation time  time per entry
without indices  82.368s              0.0824 ms
with indices     401.685s             0.402 ms
query  self-joins in query  without index  with index
Q1     0                    17ms           17ms
Q2     1                    29ms           21ms
Q3     2                    194ms          15ms
Q4     5                    1549ms         781ms

Table 5.7: Querying times for D1
query  self-joins in query  without index  with index
Q1     0                    1588ms         1579ms
Q2     1                    4641ms         14811ms
Q3     2                    53281ms        19553ms

Table 5.8: Querying times for D2
The execution of Q4 was interrupted after 600s, and thus is not listed in table 5.8. Note that in table 5.8, Q2 takes about 3.19 times longer for the indexed D2 than for D2 without the index. We attribute this to the random nature of the data, and to the fact that the index cannot fully unfold its potential for queries that contain only one self-join.

14: The materials and data used to generate the numbers described in this section can be found in the data/db_indices directory that accompanies this thesis.
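The effect of an index on a self-join query can be inspected without timing by asking SQLite for its query plan. The schema and index below are illustrative assumptions, not the exact definitions of Section 4.2.2:

```python
import sqlite3

# Minimal schema with a compound index over the columns used for filtering
# and joining (an assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE dependencies (
    head_id INTEGER, dependent_id INTEGER,
    head_token TEXT, dependency_type TEXT)""")
conn.execute("CREATE INDEX idx_dep ON dependencies (dependency_type, head_id)")

# Query plan for an 'X is caused by Y'-style query with one self-join.
plan = conn.execute("""EXPLAIN QUERY PLAN
    SELECT d1.head_id, d2.dependent_id
    FROM dependencies d1, dependencies d2
    WHERE d1.dependency_type = 'nsubjpass'
      AND d2.dependency_type = 'agent'
      AND d1.head_id = d2.head_id""").fetchall()
for row in plan:
    print(row[-1])  # the plan should mention idx_dep rather than a full scan
```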
5.2 Processing PubMed
As Section 1.4 explains, processing the entire PubMed database of over 25 million articles is considered the ultimate goal of our research. In this section, we thus attempt to give an estimate of the time it would take to process PubMed in its entirety using the approaches described in this dissertation. The test set used for this evaluation, as well as other intermediary results, can be found in the data/pubmed_projection directory that accompanies this thesis.
5.2.1 Test Set
We selected a random set of 1000 article abstracts from PubMed. The test
set has an average document length of 828 characters.
5.2.2 Timing
We measured the processing time for the individual stages using the Unix time command (taking the sum of the user and system values). This means that the times noted below are in terms of processor time for a single core15, and do not take into account that this task can be easily parallelized.
5.2.3 Downloading PubMed
As described in Section 2.5, there are several ways to access PubMed: Downloading the complete PubMed dump published on a yearly basis is certainly the most efficient, but it needs to be updated to include more recent publications. Because of this variability, we do not include the time it takes to prepare the PubMed article abstracts in our calculations.
15: 1.8 GHz Intel Core i7
5.2.4 Tagging and Parsing
We used the Stanford POS tagger as described in Chapter 3 with the english-left3words-distsim.tagger model for POS tagging, and the stanford_pos_to_db.py module as described in Section 4.2.1 for database conversion.
5.2.5 Running Queries
The queries for the ACTIVE, HYPHEN and DEVELOP categories as described in Section 4.4 are executed using the browse_db module with the
--ids all argument.
Note that the queries written as part of this thesis do not cover all possible relations, and that they are specific to the CDR task. In order to generalize, we thus make the following assumptions:

• Query sets for applications other than CDR behave similarly in terms of execution time.
• The execution time of a query set is proportional to the number of relations it is intended to find.

Based on these assumptions, we note that the queries written as part of this thesis are aimed to cover the ACTIVE and HYPHEN categories completely, and achieve 23.3% recall on the DEVELOP category, thus covering about 42.374% of all relations. We thus note an extrapolated processing time (running queries* in table 5.9), which multiplies the actual processing time by a factor of 2.36.
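The arithmetic behind this extrapolation can be reproduced as follows (our own restatement; values match Table 5.9 up to rounding):

```python
# Scale factor from the 1000-abstract sample to ~25 million PubMed articles.
SCALE = 25_000_000 / 1000

coverage = 0.42374            # share of all relations the query sets cover
measured_query_time = 30.584  # measured seconds for running the queries

# 'running queries*': extrapolate to a hypothetical complete set of queries;
# 1 / 0.42374 is the factor of about 2.36 mentioned above.
extrapolated = measured_query_time / coverage  # ~72.18s
projected_seconds = extrapolated * SCALE       # ~1.8 million s, ~21 days
print(round(extrapolated, 2), round(projected_seconds / 86400, 1))
```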
5.2.6 Results
Table 5.9 below lists measured and estimated processing times. While the total projected time is an estimate that relies on many assumptions, it also shows that the systems presented in this thesis are indeed capable of processing the entire PubMed in reasonable time, given appropriate infrastructure.
step                 measured time  projected for PubMed
POS tagging          23.832s        595 800s (6 days, 22h)
database conversion  48.882s        1 222 100s (14 days, 3h)
running queries      30.584s        764 600s (8 days, 20h)
running queries*     72.176s        1 804 401s (20 days, 21h)
TOTAL                144.89s        3 622 301s (41 days, 22h)

Table 5.9: Estimated processing time for the entire PubMed
5.3 Summary

5.3.1 epythemeus
While the performance of the epythemeus system is inferior to current state-of-the-art systems, our evaluation points to the validity of our approach. We identify several key factors that could unlock gains in performance, such as the inclusion of NER and lemmatization information, the employment of a more suitable evaluation corpus, and the further development of queries.
However, extending the database to store such information would not compromise its independence. Lemmata and named entity information could be provided by the python-ontogene pipeline, as well as by other systems. This information could be used directly by fragments and queries alike to improve precision, without necessitating further development of the system.
More complex means of improving the performance of the epythemeus system could include pruning of dependency trees, as suggested by [5]. This could make queries more robust to variations in parsing.
While the manually categorized sentences proved very useful both for query development and evaluation, the gold standard against which the queries were evaluated could have been improved. This is partly an extension of a shortcoming of the original BioCreative V corpus, in which relations are not annotated on a mention level, but rather on a document basis. This prompted the need for an error-prone extraction process, and led to lower precision in the evaluation. Given that the epythemeus system is not specific to chemical-disease relation extraction, however, other corpora could be used to obtain more reliable results.
5.3.2 Processing PubMed
As Table 5.9 indicates, we estimate a total of almost 42 days of processing time to process PubMed and run a hypothetical set of queries to extract relations. This estimate assumes that query processing time is linear in database size. While such a number may seem daunting, recall that this measure is in terms of processing time for a single core, and that the tests were performed on a general-purpose home machine. Using a dedicated infrastructure with several cores, the goal of processing PubMed seems to be within reach.
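A back-of-the-envelope calculation illustrates this point (under our own assumption of near-linear speed-up, which is plausible since documents can be processed independently; the core counts are purely illustrative):

```python
# Single-core total from Table 5.9, divided across hypothetical core counts.
single_core_s = 3_622_301          # ~41 days, 22 h on one core

for cores in (8, 32, 64):
    days = single_core_s / cores / 86_400
    print(f"{cores} cores: about {days:.1f} days")
```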
Chapter 6
Conclusion
In this thesis, we explored efficient rule-based relation extraction, and presented a set of systems as well as a novel way to facilitate the process of generating hand-written rules. We recapitulate our contributions briefly in Section 6.1. Special attention is devoted to processing speed: the final objective of this research is the extraction of relations from the entire set of 25 million article abstracts that PubMed contains. This has not been possible so far, but our results put such an endeavor within reach. In this short chapter, we conclude by assessing shortcomings and highlighting potential for future research.
6.1 Our Contributions
In order to extract biomedical relations from unstructured text, three systems
are used:
1. The python-ontogene pipeline
2. The combination of Stanford POS tagger and spaCy
3. The epythemeus system
6.1.1 python-ontogene
The python-ontogene pipeline revolves around a custom Article class, which is well suited to storing biomedical articles in memory at various stages of processing. Special care was taken to keep this class flexible for various applications. The pipeline currently uses NLTK to provide tokenization and POS tagging, but was developed with modularity in mind, allowing the NLTK library to be replaced by other tokenizers and POS taggers. Dictionary-based named entity recognition is used to extract named entities. By avoiding file-based communication between modules, the python-ontogene pipeline outperforms existing systems by far in terms of speed while maintaining comparable levels of accuracy.
6.1.2 Parser Evaluation
To our knowledge, the spaCy parser included in our evaluation has not previously been the subject of scientific evaluation. We evaluated it together with three state-of-the-art parsers in terms of accuracy and speed. spaCy by far outperforms the other parsers in terms of speed, but does not yield satisfying accuracy. We show how this shortcoming can be overcome by using the spaCy parser in conjunction with the Stanford POS tagger.
6.1.3 epythemeus
The epythemeus system builds on the work described above, but is an independent system: it takes Stanford-POS-tagged files as input, dependency-parses them using spaCy, and saves the results in a database; other approaches can also be used to populate the database. The database can then be queried using manually created rules, either interactively or programmatically.
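This architecture can be illustrated with a toy version (the table layout and the rule below are our own simplification, not the actual epythemeus schema): dependency parse rows are stored in a database, and a hand-written rule becomes an SQL query over them.

```python
import sqlite3

# Hypothetical dependency rows as a parser like spaCy would produce them
# (token, dependency label, head token) for: "Aspirin induced hepatitis."
rows = [("Aspirin", "nsubj", "induced"),
        ("induced", "ROOT", "induced"),
        ("hepatitis", "dobj", "induced"),
        (".", "punct", "induced")]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE deps (token TEXT, dep TEXT, head TEXT)")
conn.executemany("INSERT INTO deps VALUES (?, ?, ?)", rows)

# A hand-written rule as a query: subject and object of the same verb.
pair = conn.execute(
    "SELECT s.token, o.token FROM deps s JOIN deps o ON s.head = o.head "
    "WHERE s.dep = 'nsubj' AND o.dep = 'dobj'").fetchone()
# pair == ('Aspirin', 'hepatitis')
```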
6.1.4 Fragments
The main contribution of the epythemeus system, however, lies in a new approach to phrasing rules and turning them into executable queries. A special shorthand notation has been developed for so-called fragments, which represent building blocks of rules. These fragments can be programmatically combined to create a set of queries that generalize well, which greatly aids the development of rules. The fragments are converted into SQL queries, allowing the concept of fragments to be useful for other systems.
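As a rough illustration of the idea (the real shorthand notation and query structure differ), fragments can be thought of as small reusable conditions that are combined programmatically into full SQL queries:

```python
# Hypothetical "fragments": reusable pieces of a dependency-pattern rule.
SUBJ = "s.dep = 'nsubj'"          # a subject token
OBJ = "o.dep = 'dobj'"            # an object token
SAME_HEAD = "s.head = o.head"     # both attached to the same verb

def combine(*fragments):
    """Assemble fragments into one executable SQL query string."""
    return ("SELECT s.token, o.token FROM deps s, deps o WHERE "
            + " AND ".join(fragments))

query = combine(SUBJ, OBJ, SAME_HEAD)
print(query)
```

Because each fragment is independent, a handful of fragments can be recombined into many distinct queries, which is what makes the approach generalize well.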
6.1.5 Corpus
In order to develop queries and to evaluate them, a set of over 1000 sentences containing chemical-disease relations has been manually categorized according to the structure that points to the relation. We hope that this categorization can be useful in similar research.
6.2 Future Work

6.2.1 Improving spaCy POS tagging
While using the spaCy parser and the Stanford POS tagger together yields good results, the switching of environments (python3 and Java) considerably slows down processing. Given spaCy's ability to train POS tagging models, its own POS tagger could be improved. In particular, training spaCy's POS tagger on the output of the Stanford POS tagger would allow spaCy to deliver high-quality parses while forgoing the need to leave the python3 environment.
6.2.2 Integration of spaCy and python-ontogene
Development of python-ontogene preceded the evaluation of parsers described in Chapter 3. Using the spaCy library in the fashion described above would allow it to be integrated easily into the pipeline for POS tagging. Building on that, a mapping between the spaCy objects containing dependency parses and the above-mentioned Article objects would allow the python-ontogene pipeline to also include dependency parsing, again forgoing the need for file-based communication between modules and repeated parsing.
6.2.3 Improvements for epythemeus
While the performance of the epythemeus system largely depends on the quality of the queries, the system itself has two shortcomings: since the database stores neither lemmatization nor named entity information, precision cannot be improved as easily. Named entity information in particular would allow queries to be more robust and yield much more satisfactory results. Again, the integration of the systems would alleviate this problem.
6.2.4 Evaluation Methods
The test set used for the evaluation described in Chapter 4 suffers from errors in the software that generated it. While this does not jeopardize the quality of the epythemeus system, a more reliable evaluation could be performed.
6.3 Processing PubMed
As we explain in Section 5.3.2, the ultimate goal of processing the entire PubMed is put within reach, owing to the special attention we paid to efficiency when developing the aforementioned systems.
Bibliography
[1] Charu C Aggarwal and ChengXiang Zhai. Mining text data. Springer
Science & Business Media, 2012.
[2] Sophia Ananiadou, Sampo Pyysalo, Jun’ichi Tsujii, and Douglas B
Kell. Event extraction for systems biology by text mining the literature. Trends in biotechnology, 28(7):381–390, 2010.
[3] Stefan Behnel, Robert Bradshaw, Craig Citro, Lisandro Dalcin,
Dag Sverre Seljebotn, and Kurt Smith. Cython: The best of both worlds.
Computing in Science & Engineering, 13(2):31–39, 2011.
[4] Sabine Buchholz and Erwin Marsi. CoNLL-X shared task on multilingual
dependency parsing. In Proceedings of the Tenth Conference on Computational Natural Language Learning, pages 149–164. Association for
Computational Linguistics, 2006.
[5] Ekaterina Buyko, Erik Faessler, Joachim Wermter, and Udo Hahn.
Event extraction from trimmed dependency graphs. In Proceedings of
the Workshop on Current Trends in Biomedical Natural Language Processing: Shared Task, pages 19–27. Association for Computational Linguistics, 2009.
[6] Eugene Charniak. A maximum-entropy-inspired parser. In Proceedings
of the 1st North American chapter of the Association for Computational
Linguistics conference, pages 132–139. Association for Computational
Linguistics, 2000.
[7] Jinho D Choi and Martha Palmer. Guidelines for the CLEAR style constituent to dependency conversion. Technical Report 01-12, University of Colorado at Boulder, 2012.
[8] Donald C Comeau, Rezarta Islamaj Doğan, Paolo Ciccarese, Kevin Bretonnel Cohen, Martin Krallinger, Florian Leitner, Zhiyong Lu, Yifan
Peng, Fabio Rinaldi, Manabu Torii, et al. BioC: a minimalist approach to
interoperability for biomedical text processing. Database, 2013:bat064,
2013.
[9] Marie-Catherine De Marneffe and Christopher D Manning. Stanford typed dependencies manual. Technical report, Stanford University, 2008.
[10] George R Doddington, Alexis Mitchell, Mark A Przybocki, Lance A
Ramshaw, Stephanie Strassel, and Ralph M Weischedel. The automatic
content extraction (ACE) program: tasks, data, and evaluation. In LREC,
volume 2, page 1, 2004.
[11] Tilia Renate Ellendorff, Adrian van der Lek, Lenz Furrer, and Fabio
Rinaldi. A combined resource of biomedical terminology and its statistics. Proceedings of the conference Terminology and Artificial Intelligence (Granada, Spain), 2015.
[12] Ken-ichiro Fukuda, Tatsuhiko Tsunoda, Ayuchi Tamura, Toshihisa Takagi, et al. Toward information extraction: identifying protein names
from biological papers. In Pac Symp Biocomput, volume 707, pages
707–718. Citeseer, 1998.
[13] Ralph Grishman and Beth Sundheim. Message Understanding Conference-6: A brief history. In COLING, volume 96, pages 466–471, 1996.
[14] Jörg Hakenberg, Steffen Bickel, Conrad Plake, Ulf Brefeld, Hagen Zahn,
Lukas Faulstich, Ulf Leser, and Tobias Scheffer. Systematic feature evaluation for gene name recognition. BMC bioinformatics, 6(1):1, 2005.
[15] Lynette Hirschman, Alexander Yeh, Christian Blaschke, and Alfonso
Valencia. Overview of BioCreative: critical assessment of information
extraction for biology. BMC bioinformatics, 6(Suppl 1):S1, 2005.
[16] Lawrence Hunter and K Bretonnel Cohen. Biomedical language processing: what's beyond PubMed? Molecular cell, 21(5):589–594, 2006.
[17] Jin-Dong Kim, Tomoko Ohta, Sampo Pyysalo, Yoshinobu Kano, and
Jun’ichi Tsujii. Overview of BioNLP’09 shared task on event extraction.
In Proceedings of the Workshop on Current Trends in Biomedical Natural
Language Processing: Shared Task, pages 1–9. Association for Computational Linguistics, 2009.
[18] Jin-Dong Kim, Sampo Pyysalo, Tomoko Ohta, Robert Bossy, Ngan
Nguyen, and Jun’ichi Tsujii. Overview of BioNLP Shared Task 2011. In
Proceedings of the BioNLP Shared Task 2011 Workshop, pages 1–6. Association for Computational Linguistics, 2011.
[19] Dan Klein and Christopher D Manning. Accurate unlexicalized parsing.
In Proceedings of the 41st Annual Meeting on Association for Computational Linguistics-Volume 1, pages 423–430. Association for Computational Linguistics, 2003.
[20] Lingpeng Kong and Noah A Smith. An empirical comparison of parsing
methods for stanford dependencies. arXiv preprint arXiv:1404.4314,
2014.
[21] Michael Krauthammer and Goran Nenadic. Term identification in the
biomedical literature. Journal of biomedical informatics, 37(6):512–526,
2004.
[22] Ulf Leser and Jörg Hakenberg. What makes a gene name? named entity
recognition in the biomedical literature. Briefings in bioinformatics,
6(4):357–369, 2005.
[23] Mariana Neves. An analysis on the entity annotations in biological
corpora. F1000Research, 3, 2014.
[24] Joakim Nivre. An efficient algorithm for projective dependency parsing. In Proceedings of the 8th International Workshop on Parsing Technologies (IWPT). Citeseer, 2003.
[25] Slav Petrov, Dipanjan Das, and Ryan McDonald. A universal part-ofspeech tagset. arXiv preprint arXiv:1104.2086, 2011.
[26] Longhua Qian and Guodong Zhou. Tree kernel-based protein–protein
interaction extraction from biomedical literature. Journal of biomedical
informatics, 45(3):535–543, 2012.
[27] W Scott Richardson, Mark C Wilson, Jim Nishikawa, and Robert S Hayward. The well-built clinical question: a key to evidence-based decisions.
Acp j club, 123(3):A12–3, 1995.
[28] Fabio Rinaldi, Thomas Kappeler, Kaarel Kaljurand, Gerold Schneider,
Manfred Klenner, Simon Clematide, Michael Hess, Jean-Marc Von Allmen, Pierre Parisot, Martin Romacker, et al. OntoGene in BioCreative II. Genome Biology, 9(Suppl 2):S13, 2008.
[29] Fabio Rinaldi, Gerold Schneider, and Simon Clematide. Relation mining
experiments in the pharmacogenomics domain. Journal of Biomedical
Informatics, 45(5):851–861, 2012.
[30] Fabio Rinaldi, Gerold Schneider, Kaarel Kaljurand, Simon Clematide,
Therese Vachon, and Martin Romacker. OntoGene in BioCreative II.5.
IEEE/ACM Transactions on Computational Biology and Bioinformatics
(TCBB), 7(3):472–480, 2010.
[31] Fabio Rinaldi, Gerold Schneider, Kaarel Kaljurand, Michael Hess, and
Martin Romacker. An environment for relation mining over richly annotated corpora: the case of genia. BMC bioinformatics, 7(Suppl 3):S3,
2006.
[32] Isabel Segura Bedmar, Paloma Martínez, and María Herrero Zazo. SemEval-2013 Task 9: Extraction of drug-drug interactions from biomedical texts (DDIExtraction 2013). Association for Computational Linguistics, 2013.
[33] Matthew S Simpson and Dina Demner-Fushman. Biomedical text mining: A survey of recent progress. In Mining Text Data, pages 465–517.
Springer, 2012.
[34] Larry Smith, Lorraine K Tanabe, Rie Johnson nee Ando, Cheng-Ju Kuo,
I-Fang Chung, Chun-Nan Hsu, Yu-Shi Lin, Roman Klinger, Christoph M
Friedrich, Kuzman Ganchev, et al. Overview of BioCreative II gene mention recognition. Genome biology, 9(Suppl 2):1–19, 2008.
[35] Don R Swanson. Complementary structures in disjoint science literatures. In Proceedings of the 14th annual international ACM SIGIR
conference on Research and development in information retrieval, pages
280–289. ACM, 1991.
[36] Erik F Tjong Kim Sang and Fien De Meulder. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In
Proceedings of the seventh conference on Natural language learning at
HLT-NAACL 2003-Volume 4, pages 142–147. Association for Computational Linguistics, 2003.
[37] Tuangthong Wattarujeekrit, Parantu K Shah, and Nigel Collier. Pasbio:
predicate-argument structures for event extraction in molecular biology.
BMC bioinformatics, 5(1):155, 2004.
[38] Chih-Hsuan Wei, Hung-Yu Kao, and Zhiyong Lu. PubTator: a web-based text mining tool for assisting biocuration. Nucleic Acids Research,
41, 07 2013.
[39] Chih-Hsuan Wei, Yifan Peng, Robert Leaman, Allan Peter Davis, Carolyn J Mattingly, Jiao Li, Thomas C Wiegers, and Zhiyong Lu. Overview
of the BioCreative V chemical disease relation (CDR) task. In Proceedings
of the fifth BioCreative challenge evaluation workshop, Sevilla, Spain,
2015.
[40] W John Wilbur, Andrey Rzhetsky, and Hagit Shatkay. New directions
in biomedical text annotation: definitions, guidelines and corpus construction. BMC bioinformatics, 7(1):1, 2006.
[41] Alexander Yeh, Alexander Morgan, Marc Colosimo, and Lynette
Hirschman. BioCreative task 1A: gene mention finding evaluation. BMC
bioinformatics, 6(Suppl 1):S2, 2005.
[42] Pierre Zweigenbaum, Dina Demner-Fushman, Hong Yu, and Kevin B
Cohen. Frontiers of biomedical text mining: current progress. Briefings
in bioinformatics, 8(5):358–375, 2007.