Dependency Parsing for Relation Extraction in Biomedical Literature
Master Thesis in Computer Science
presented by Nicola Colic
Zurich, Switzerland
Immatriculation Number: 09-716-572

to the Institute of Computational Linguistics, Department of Informatics at the University of Zurich

Supervisor: Prof. Dr. Martin Volk
Instructor: Dr. Fabio Rinaldi

submitted on the 20th of March, 2016

Abstract

This thesis describes the development of a system for the extraction of entities in biomedical literature, as well as of their relationships with each other. We leverage efficient dependency parsers to provide fast relation extraction, so that the system can potentially process large collections of publications (such as PubMed) in useful time. The main contributions are the selection and integration of a suitable dependency parser, and the development of a system for creating and executing rules to find relations. For the evaluation of the system, a previously annotated corpus was further refined, and insights for the further development of this and similar systems are drawn.

Acknowledgements

I would like to thank Prof. Martin Volk for supervising the writing of this thesis, and especially my direct instructor Dr. Fabio Rinaldi for his never-ceasing help and motivation.

Contents

1 Introduction
  1.1 The Need for Biomedical Text Mining
  1.2 Related Work
    1.2.1 Named Entity Recognition
    1.2.2 Relation Extraction
  1.3 Beyond Automated Curation
  1.4 Importance of PubMed
  1.5 This Thesis

2 python-ontogene Pipeline
  2.1 OntoGene Pipeline
  2.2 python-ontogene
    2.2.1 Architecture of the System
    2.2.2 Configuration
    2.2.3 Backwards Compatibility
  2.3 Usage
  2.4 Module: Article
    2.4.1 Implementation
    2.4.2 Usage
    2.4.3 Export
  2.5 Module: File Import and Accessing PubMed
    2.5.1 Updating the PubMed Dump
    2.5.2 Downloading via the API
    2.5.3 Dealing with the large number of files
    2.5.4 Usage
  2.6 Module: Text Processing
    2.6.1 Usage
  2.7 Module: Entity Recognition
    2.7.1 Usage
  2.8 Evaluation
    2.8.1 Speed
    2.8.2 Accuracy
  2.9 Summary

3 Parsing
  3.1 Selection Process
    3.1.1 spaCy
    3.1.2 spaCy + Stanford POS tagger
    3.1.3 Stanford Parser
    3.1.4 Charniak-Johnson
    3.1.5 Malt Parser
  3.2 Evaluation
    3.2.1 Ease of Use and Documentation
    3.2.2 Evaluation of Speed
    3.2.3 Evaluation of Accuracy
    3.2.4 Prospective Benefits
    3.2.5 Selection
  3.3 Summary

4 Rule-Based Relation Extraction
  4.1 Design Considerations
  4.2 Implementation
    4.2.1 stanford_pos_to_db
    4.2.2 Database
    4.2.3 query_helper
    4.2.4 browse_db
  4.3 Data Set
    4.3.1 Conversion
    4.3.2 Categorization
    4.3.3 Development and Test Subsets
  4.4 Queries
    4.4.1 HYPHEN queries
    4.4.2 ACTIVE queries
    4.4.3 DEVELOP queries
  4.5 Summary
    4.5.1 Arity of Relations
    4.5.2 Query Development Insights
    4.5.3 Augmented Corpus

5 Evaluation
  5.1 Evaluation of epythemeus
    5.1.1 Query Evaluation
    5.1.2 Speed Evaluation and Effect of Indices
  5.2 Processing PubMed
    5.2.1 Test Set
    5.2.2 Timing
    5.2.3 Downloading PubMed
    5.2.4 Tagging and Parsing
    5.2.5 Running Queries
    5.2.6 Results
  5.3 Summary
    5.3.1 epythemeus
    5.3.2 Processing PubMed

6 Conclusion
  6.1 Our Contributions
    6.1.1 python-ontogene
    6.1.2 Parser Evaluation
    6.1.3 epythemeus
    6.1.4 Fragments
    6.1.5 Corpus
  6.2 Future Work
    6.2.1 Improving spaCy POS tagging
    6.2.2 Integration of spaCy and python-ontogene
    6.2.3 Improvements for epythemeus
    6.2.4 Evaluation Methods
  6.3 Processing PubMed

Chapter 1 Introduction

1.1 The Need for Biomedical Text Mining

One of the defining factors of our time is an unprecedented growth in information, and resulting from this is the challenge of information overload, both in the personal and the professional space. Independent of the respective domain, recent years have seen a shift in focus from information retrieval to information extraction.
That is, rather than attempting to bring the right document containing relevant information to the user, research is now concerned with processing unstructured text and extracting the specific information it contains [1]. This holds particularly true in the biomedical domain, where the rate at which biomedical papers are published is ever increasing, leading to what Hunter and Cohen call literature overload [16]. In their 2006 paper, they show that the number of articles published on PubMed, the largest collection of biomedical publications, is growing at a double-exponential rate. Because of this, biology researchers need to rely on manually curated databases that list information relevant to their research in order to stay up to date. From PubMed, information is manually curated, that is, human experts compile the key facts of publications into dedicated databases. This process of curation is expensive and labor-intensive, and causes a substantial time lag between the publication of an article and the appearance of its key information in the respective database [31]. Curators, too, struggle to cope with the number of papers published, and thus need to turn to automated processing, that is, biomedical text mining.

However, the field of biomedical text mining is not limited to aiding or automating the curation of databases. It covers a variety of applications, ranging from simpler information extraction to question answering to literature-based discovery. Generally speaking, it is concerned with the discovery of facts, as well as of the associations between them, in unstructured text. These associations can be explicit or implicit. As Simpson [33] notes, advances in biomedical text mining can help prevent or alter the course of many diseases, and thus are not only of relevance to professional researchers, but also benefit the general public. Furthermore, they rely on the combined efforts of both experts in the biomedical domain and computational linguists.
We describe the different applications of biomedical text mining below.

1.2 Related Work

Simpson [33] attributes much of the development in the field to community-wide evaluations and shared tasks such as BioCreative [15] and BioNLP [18]. Such shared tasks focus on different aspects of biomedical text mining: Named entity recognition (NER) and relation extraction are the main tasks, which are briefly discussed here.

1.2.1 Named Entity Recognition

In NER, biological and medical terms are identified and marked in an unstructured text. Examples of such entities include proteins, drugs or diseases, or any other semantically well-defined data. This task is often coupled with assigning each of the found entities a unique identifier, called entity normalization.

Named entity recognition is particularly difficult in the biomedical domain, given the constant discovery of new concepts and entities. Because of this, approaches that utilize a dictionary of known entities need to take extraordinary measures to keep their dictionaries up to date with current research, mirroring the problem of database curation described above. In spite of this, dictionary-based methods can achieve favorable results [22]. In particular, dictionaries can be generated automatically from pre-existing ontologies [11], making them easier to maintain and to keep up to date.

Other approaches to NER are rule-based, exploiting patterns in protein names [12], for example, or statistically inspired, in which features such as word sequences or part-of-speech tags are used by machine learning algorithms to infer occurrences of a named entity [14].

The related task of entity normalization is made difficult by the fact that there is often no universal accord on the preferred name of a specific entity. Particularly with protein and gene names, variations can come down to the authors' personal preference.
Abbreviations are another complication; they are largely context-dependent: The same abbreviation can refer to very different entities in different contexts. However, as Zweigenbaum et al. note, problems such as this can essentially be considered solved [42].

1.2.2 Relation Extraction

The goal of relation extraction is to extract interactions between entities. In the biomedical domain, extracting drug-drug interactions [32], chemical-disease relations (CDR) [39] or protein-protein interactions (PPI) [26] are particularly relevant examples. However, these are highly specialized problems, and require specialized methods of relation extraction.

Simpson [33] distinguishes between relation extraction and event extraction: Relation extraction is defined as finding binary associations between entities, whereas event extraction is concerned with more complex associations between an arbitrary number of entities.

The simplest approach to extracting relations relies on statistical evaluation of the co-occurrence of entities. A second class of approaches is rule-based: The rules used by these approaches are either created manually by experts, or stem from automated analysis of annotated texts. Simpson [33] notes that co-occurrence approaches commonly exhibit high recall and low precision, while rule-based approaches typically demonstrate high precision and low recall. A third class of approaches uses machine learning to directly identify relations, using a variety of features. These approaches can be used for both relation and event extraction.

For both rule-based and machine learning approaches, syntactic information is an invaluable feature [37] [2]. Dependency-based representations of the syntactic structure of a sentence, in particular, have proven very useful for text mining purposes.
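To make the contrast between these classes concrete, the simplest co-occurrence approach mentioned above can be sketched in a few lines. This is only an illustration: the entity annotations are assumed to come from a prior NER step, and the example sentence and entity names are invented.

```python
from itertools import combinations

def cooccurrence_relations(sentences):
    """Pair up all entities that appear in the same sentence.

    `sentences` is a list of (sentence_text, entities) tuples, where the
    entity lists are assumed to be produced by a prior NER step.
    """
    relations = []
    for text, entities in sentences:
        # Every unordered pair of co-occurring entities becomes a
        # candidate relation, regardless of what the sentence says.
        for a, b in combinations(sorted(set(entities)), 2):
            relations.append((a, b, text))
    return relations

sample = [("MTOR phosphorylates AKT1.", ["MTOR", "AKT1"])]
print(cooccurrence_relations(sample))
```

Such an extractor finds every plausible pair (high recall) but makes no attempt to verify that the sentence actually asserts a relation (low precision), matching the trade-off noted above.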
Since the approach used in this thesis is rule-based and capable of extracting both binary and more complex relations, we will not adhere to the distinction between relation and event extraction, and use the term relation extraction for both problems.

1.3 Beyond Automated Curation

More complex applications of biomedical text mining are summarization, question answering and literature-based discovery.

In summarization, the goal is to extract important facts or passages from a single document or a collection of documents, and to represent them in a concise fashion. This is particularly relevant in the light of the aforementioned literature overload. The approaches employed here either identify representative components of the articles using statistical methods, or extract important facts and use them to generate summaries.

Question answering aims at providing precise answers, rather than relevant documents, in response to natural language queries. This relies on natural language processing of the user-supplied queries on the one hand, and on the processing of a collection of documents potentially containing the answer on the other. For the latter, named entity recognition and relation extraction are of major importance.

The problems described above are all concerned with finding facts explicitly stated in biomedical literature. Literature-based discovery, however, aims at revealing implicit facts, namely relations that have not previously been discovered. Swanson was one of the first to explore this research field. The following is a simplified account of his definition from 1991: Given two related scientific communities that do not communicate, where one community establishes a relationship between A and B, and the other a relationship between B and C, infer the relation of A and C [35].
As Simpson [33] explains, recent systems use semantic information to uncover such A-C relations, and thus build heavily on relation extraction. With the rapid growth of publications, literature-based discovery becomes ever more important. The body of publications is quickly becoming too vast to be manually processed, and research communities find it impossible to keep up to date with their peers, leading to disjoint scientific communities. Literature-based discovery thus holds the promise of leveraging the unprecedented scale of publications and contributing to advancements of biomedical science that would not otherwise be possible.

1.4 Importance of PubMed

MEDLINE is the largest database of articles from the biomedical domain, and is maintained by the US National Library of Medicine (NLM). Currently, it contains more than 25 million articles published since 1946¹, and as Hunter and Cohen note, it is growing at an extraordinary pace [16]. Between 2011 and today, the number of articles it contains has more than doubled. The abstracts of MEDLINE can be freely accessed and downloaded via PubMed², making it one of the most important resources for biomedical text mining [42] [33]. Furthermore, thanks to the National Institutes of Health (NIH) Policy on Enhancing Public Access to Archived Publications Resulting From NIH-Funded Research, issued in 2005, more than 3.8 million full-text articles can now be freely downloaded from PubMed Central³. The goal of this endeavor, as stated by the NLM, is as follows:

    To integrate the literature with a variety of other information resources such as sequence databases and other factual databases that are available to scientists, clinicians and everyone else interested in the life sciences. The intentional and serendipitous discoveries that such links might foster excite us and stimulate us to move forward [16].
In the course of this thesis, we will focus on the article abstracts available via PubMed. Given its importance and size, we conduct our efforts with the processing of the entire PubMed database in mind.

1.5 This Thesis

With such a large corpus of freely available biomedical texts, the efficiency of biomedical text mining becomes increasingly important: Text mining systems need to be able to cope with rapidly growing collections of text, and in order to be relevant and timely, need to do so in an efficient manner. The goal of this thesis is to explore how relation extraction can be performed efficiently using dependency parsing. Recent technological advances make dependency parsing computationally cheap, and as explained in Sections 1.2.2 and 1.3, it lies at the core of many other aspects of biomedical text mining. We explore how to leverage this availability of efficient dependency parsing, especially in regard to processing the entire PubMed.

We first describe our pipeline for named entity recognition and part-of-speech tagging in Chapter 2. We expand on this previous work by finding an accurate and efficient dependency parser in Chapter 3. These results are then used to develop a new, independent system aimed at exploiting dependency parse information to find relations using manually written rules (Chapter 4). This system uses a novel way of creating rules, and is evaluated against a manually annotated corpus in Chapter 5. Furthermore, we give an estimate of the time it would take to process the entire PubMed and search it for relations using our approach. In Chapter 6 we draw some conclusions from the results of this research.

¹ https://www.nlm.nih.gov/pubs/factsheets/medline.html
² https://www.ncbi.nlm.nih.gov/pubmed/
³ http://www.ncbi.nlm.nih.gov/pmc/
Chapter 2 python-ontogene Pipeline

This chapter describes the development of a new text processing pipeline that performs tokenization, tagging and named entity recognition, building on previous work by Rinaldi et al. [28] [29] [30], and their OntoGene pipeline in particular.

2.1 OntoGene Pipeline

The OntoGene system is a pipeline that is patched together from different modules written in different programming languages, which communicate with each other via files. Each module takes a file as input and produces a file, typically in a predefined XML format (called OntoGene XML). The subsequent module then reads the files produced by the antecedent modules. These different modules themselves are coordinated by bash scripts. This is inefficient for two reasons:

1. Every module needs to parse the precedent module's output. The repeated disk access for reading and writing considerably slows down processing.

2. The pipeline is not easy for new users to operate, since the different modules are written in different languages, and there is no centralized documentation.

The low processing speed described in point 1 makes it impossible to process larger collections of text, such as PubMed. Because of that, there is demand for a streamlined pipeline.

2.2 python-ontogene

Consequently, the existing OntoGene pipeline was rewritten in python3, with a particular focus on reducing communication between modules via files. This accelerates processing, and makes the processing of the entire PubMed possible. Furthermore, the new pipeline has consistent documentation, and is hence easier to understand for the user. The pipeline is currently developed up to the point of entity recognition, and can be found online¹ or in the python-ontogene directory that accompanies this thesis.
2.2.1 Architecture of the System The python-ontogene pipeline is composed of several independent modules, which are coordinated by a control script. The main mode of communication between the modules is via objects of a custom Article class, which mimics an XML structure. All modules read and return objects of this class, which ensures independence of the modules. The modules are coordinated via a control script written in python, which passes the various Article objects produced by the modules to the subsequent modules. 2.2.2 Configuration All variables relevant for the pipeline (such as input files, output directories and bookkeeping parameters) are stored in a single file, which is read by the control script. The control script will then supply the relevant arguments read from the configuration file to the individual modules. This ensures that the user only has to edit a single file, while at the same time keeping the modules independent. 2.2.3 Backwards Compatibility In order to preserve compatibility to the existing pipeline (see Section 2.1), the Article objects can be exported to OntoGene XML format at various stages of processing. 1 https://gitlab.cl.uzh.ch/colic/python-ontogene CHAPTER 2. PYTHON-ONTOGENE PIPELINE 9 Figure 2.1: The architecture of the python-ontogene pipeline 2.3 Usage The exact usage of the individual modules is described in the subsequent chapters. Furthermore, the in-file documentation in example.py file can serve to provide a more concrete idea of how to use the pipeline. 2.4 Module: Article The article module is a collection of various classes, such as Token, Sentence andSection. The classes are hierarchically organized (e.g. Article has Sections), but kept flexible to allow to future variations in the structure. Each class offers methods particularly suited to dealing with its contents, such as writing to file or performing further processing. CHAPTER 2. 
However, in order to keep the pipeline flexible, the Article class relies on other modules to perform tasks such as tokenization or entity recognition. While this leads to coupling between the modules, it also allows for easy replacement of modules. For example, if the tokenizer that is currently used needs replacing, it is easy to supply a new tokenization module to the Article object to perform tokenization.

2.4.1 Implementation

Currently, there are the following classes, all of which implement an abstract Unit class: Article, Section, Sentence, Token and Term. Each of these classes has a subelements list, which contains objects of other classes. In this fashion, a tree-like structure is built, in which an Article object has a subelements list of Sections, each of which has a subelements list of Sentences, and so on.

The abstract Unit class implements, amongst others, the get_subelement() function, which traverses the object's subelements list recursively until the elements of the type given as argument have been found. In this fashion, the data structure is kept flexible for future changes. For example, Articles may be gathered in Collections, or Sections might contain Paragraphs.

As for tokenization, the Article class expects the tokenize() function to be called with a tokenizer object as argument. This tokenizer object needs to implement the following two functions: tokenize_sentences() and tokenize_words(). The first function is expected to return a list of strings; the second, a list of tuples which store the token text as well as its start and end positions in the text.

Finally, the Article class implements functions such as add_section() and add_term(), which internally create the corresponding objects. This is done so that other modules only need to import the Article class, which in turn takes care of accessing and creating the other classes.
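A minimal sketch of this data structure may help to visualize the recursive traversal; the exact attributes and method signatures of the real code base may differ from this simplified version.

```python
class Unit:
    """Abstract base for all units in the Article hierarchy."""

    def __init__(self):
        self.subelements = []

    def get_subelement(self, cls):
        """Recursively collect all subelements of the given type."""
        found = []
        for element in self.subelements:
            if isinstance(element, cls):
                found.append(element)
            else:
                # Not the type we want: descend into its subelements.
                found.extend(element.get_subelement(cls))
        return found

class Article(Unit): pass
class Section(Unit): pass
class Sentence(Unit): pass

# Build a small Article -> Section -> Sentence tree and query it.
my_article = Article()
section = Section()
section.subelements.append(Sentence())
my_article.subelements.append(section)
print(len(my_article.get_subelement(Sentence)))  # 1
```

Because the traversal only depends on the subelements lists, inserting an intermediate level (say, Paragraphs between Sections and Sentences) would not change callers of get_subelement(), which is the flexibility the design aims for.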
2.4.2 Usage

The example below creates an Article object, manually adds a Section with some text, tokenizes it, and prints it to console and file.

```python
import article

my_article = article.Article('12345678')  # constructor needs ID
my_article.add_section('S1', 'abstract', 'this is an example text')
my_article.tokenize()
print(my_article)
my_article.print_xml('path/to/output_file.xml')
```

2.4.3 Export

At the time of writing, the Article class implements a print_xml() function, which allows exporting the data structure to a file. This function in turn recursively calls an xml() function on the elements of the data structure. In this way, it is the responsibility of each respective class to implement the xml() function. The goal of this function is to export the Article object in its current state of processing: For example, if no tokenization has yet taken place, it will not try to export tokens. This, however, requires considerable bookkeeping; because of this, this function and the related functions need to be updated as the pipeline is updated.

Pickling

To store and load Article objects without exporting them to a specific format, the Article class implements the pickle() and unpickle() functions. These allow dumping the current Article object as a pickle file, and restoring a previously pickled Article object.

```python
import article

# ... create Article object ...

my_article.pickle('path/to/pickle')
new_article = article.Article.unpickle('path/to/pickle')
```

Exporting Entities

The Article class implements a print_entities_xml() function, which exports the found entities to an XML file. As with the general export function, the XML file is built recursively by calling an entity_xml() function on the Entity objects that are linked to the Article.
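The recursive export described above can be sketched roughly as follows. The element names and the string-based serialization here are purely illustrative, not the actual OntoGene XML schema or implementation.

```python
class Unit:
    """Simplified sketch of a serializable unit in the hierarchy."""
    tag = 'unit'

    def __init__(self, text=''):
        self.text = text
        self.subelements = []

    def xml(self):
        # Each class is responsible for serializing itself; containers
        # recurse into whatever subelements currently exist, so an
        # article that has not been tokenized simply produces no
        # token elements.
        inner = ''.join(e.xml() for e in self.subelements) or self.text
        return '<{0}>{1}</{0}>'.format(self.tag, inner)

class Token(Unit): tag = 'token'
class Sentence(Unit): tag = 'sentence'

s = Sentence()
s.subelements.append(Token('example'))
print(s.xml())  # <sentence><token>example</token></sentence>
```

The follow-on cost mentioned in the text is visible here: any new unit type or attribute added to the pipeline needs a matching change in its xml() method.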
2.5 Module: File Import and Accessing PubMed

This module allows importing texts from files or downloading them from PubMed, and converts them into the Article format discussed above. From there, they can be handed to the other modules and exported to XML. There are three ways in which PubMed can be accessed:

• PubMed dump. After applying for a free licence, the whole of PubMed can be downloaded as a collection of around 700 .xml.gz files, each of which contains about 30 000 PubMed articles. This dump is updated once a year (in November / December).

• API. This allows the downloading of individual PubMed articles given their IDs. If the entrez library is used, PubMed returns XML; if the BioPython library is used, PubMed returns python objects. However, PubMed throttles the download speed in order to prevent overloading of their systems: If more than three articles are downloaded per second, the user risks being denied further access to PubMed via the API.

• BioC. For the BioCreative V: Task 3 challenge, participants are supplied with data in BioC format. BioC is an XML format tailored towards representing annotations in the biomedical domain [8].

2.5.1 Updating the PubMed Dump

Since the PubMed dump is only updated once per year, additional articles published throughout the year need to be downloaded separately using the API. This takes substantial effort: Between the last publication of the PubMed dump in December 2014 and August 1st 2015, 800 000 new articles were published. Given the aforementioned limitation of the download speed, these take about 3 days to download using the API.

2.5.2 Downloading via the API

In order to prevent repeated downloads from PubMed, the module keeps a copy of downloaded articles as python pickle objects.
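This caching behaviour can be sketched as follows. The fetch function, cache directory and sleep interval are placeholder assumptions for illustration; the real module's interface is the one shown in Section 2.5.4.

```python
import os
import pickle
import time

def cached_download(pmid, fetch, cache_dir='pubmed_cache'):
    """Return a cached article if present; otherwise download and cache it.

    `fetch` stands in for whatever function actually contacts PubMed.
    Sleeping before each real request keeps us under the roughly three
    requests per second that PubMed tolerates.
    """
    os.makedirs(cache_dir, exist_ok=True)
    path = os.path.join(cache_dir, str(pmid) + '.pickle')
    if os.path.exists(path):
        # Cache hit: no network access, no throttling needed.
        with open(path, 'rb') as f:
            return pickle.load(f)
    time.sleep(0.34)  # stay below 3 requests per second
    article = fetch(pmid)
    with open(path, 'wb') as f:
        pickle.dump(article, f)
    return article
```

On a cache hit the function never touches the network, so re-running the pipeline on already-downloaded articles is fast and safe with respect to PubMed's rate limit.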
2.5.3 Dealing with the large number of files

Since the pipeline operates on the basis of single articles, the PubMed dump was converted into multiple files, each of which corresponds to one article. However, most file systems, such as FAT32 and ext2, cannot cope with 25 million files in one directory. Because of this, the following structure was chosen: Every article has a PubMed ID with up to 8 digits; shorter IDs are padded with zeros from the left. All articles are then grouped by the first 4 digits of their IDs into directories, resulting in up to 10 000 folders with up to 10 000 files each. For example, the file with ID 12345678 would reside in the directory 1234.

However, different solutions for efficiently dealing with the large number of files could be explored in the continuation of this project. Databases, being inherently suited to large data sets (NoSQL systems in particular), seem especially promising.

2.5.4 Usage

The following code snippet demonstrates how to import articles from file and from PubMed. The import_file module allows specifying a directory rather than a single file path; in that case, it will load all files in the directory and convert them to Article objects.

```python
from text_import.pubmed_import import pubmed_import

# The email can be omitted if the file has already been downloaded;
# in that case, the module will load it from the local dump_directory.
article = pubmed_import('12345678', 'mail@example.com')
article.print_xml('path/to/file')

from text_import.file_import import import_file

articles = import_file('/path/to/directory/or/file.txt')  # always returns a list
for article in articles:
    print(article)
```

2.6 Module: Text Processing

This module wraps around the NLTK library to make sentence splitting, tokenization and part-of-speech tagging usable to the Article class.
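Returning briefly to the directory scheme of Section 2.5.3, the ID-to-path mapping can be sketched as a small helper. The function name, root directory and file extension are illustrative assumptions, not the module's actual API.

```python
import os

def article_path(pmid, root='pubmed'):
    """Map a PubMed ID to its storage path: pad the ID to 8 digits
    and group files by the first 4 digits."""
    padded = str(pmid).zfill(8)
    return os.path.join(root, padded[:4], padded + '.xml')

print(article_path(12345678))  # e.g. pubmed/1234/12345678.xml
print(article_path(42))        # e.g. pubmed/0000/00000042.xml
```

Because both the directory name and the file name are derived from the ID alone, any article can be located in constant time without scanning directories.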
This module can be swapped out for a different one in the future, provided the functions tokenize_sentences(), tokenize_words() and pos_tag() are implemented.

2.6.1 Usage

Since NLTK offers several tokenizers based on different models, and allows training your own, this wrapper requires you to specify which model you want to use. The config module provides convenient ways to do this.

```python
from config.config import Configuration
from text_processing.text_processing import Text_processing as tp

my_config = Configuration()
my_tp = tp(word_tokenizer=my_config.word_tokenizer_object,
           sentence_tokenizer=my_config.sentence_tokenizer_object)

for pmid, article in pubmed_articles.items():
    article.tokenize(tokenizer=my_tp)
```

2.7 Module: Entity Recognition

This module implements a dictionary-based entity recognition algorithm, in which a list of known entities is used to find entities in a text. This approach is not without limitations: Notably, considerable effort must be undertaken to keep the dictionary up to date in order to find newly discovered entities, and entities not previously described cannot be found [41]. We alleviate this problem by using an approach put forth by Ellendorf et al. [11], in which a dictionary is automatically generated from a variety of different ontologies. Their approach also takes into consideration the problem of homonymy as described by [21], by mapping every term to an internal concept ID and to the IDs of the respective origin databases. We opted for this approach in order to deliver a fast solution able to cope with large amounts of data, an aspect that has so far received little attention in the field.

2.7.1 Usage

The user first needs to instantiate an Entity Recognition object, which will hold the entity list in memory.
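Internally, the lookup performed by such an object can be sketched as follows. This is an illustrative stand-in, not the real Entity_recognition class: the class name, term entries and single-token matching are simplifying assumptions (the real module also handles multi-word terms, as described below).

```python
class EntityRecognitionSketch:
    """Illustrative stand-in for a dictionary-based recognizer:
    holds a term list in memory and looks tokens up in it."""

    def __init__(self, term_dict):
        # term_dict maps lower-cased entity names to concept IDs.
        self.term_dict = term_dict

    def recognize(self, tokens):
        """Return (text, start, end, concept_id) for each known token.

        `tokens` are (text, start, end) tuples as produced by the
        tokenizer.
        """
        entities = []
        for text, start, end in tokens:
            concept_id = self.term_dict.get(text.lower())
            if concept_id is not None:
                entities.append((text, start, end, concept_id))
        return entities

er = EntityRecognitionSketch({'mtor': 'CONCEPT:0001'})  # invented entry
print(er.recognize([('MTOR', 0, 4), ('signalling', 5, 15)]))
```

Keeping the whole term list in a dictionary makes each lookup constant-time, which is what makes this approach fast enough for large document collections.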
This object is then passed to the recognize_entities() function of the Article object, which uses the Entity Recognition object to find entities. While this is slightly convoluted, it ensures that different entity recognition approaches can be used in conjunction with the Article class.

When creating the Entity Recognition object, the user needs to supply an entity list as discussed above, and a Tokenizer object. The Tokenizer object is used to tokenize multi-word entries in the entity list; the tokenization applied here should be the same as the one used to tokenize the articles, which the config module ensures.

    from config.config import Configuration
    from text_processing.text_processing import Text_processing as tp
    from entity_recognition.entity_recognition import Entity_recognition as er

    my_config = Configuration()
    my_tp = tp(word_tokenizer=my_config.word_tokenizer_object,
               sentence_tokenizer=my_config.sentence_tokenizer_object)
    my_er = er(my_config.termlist_file_absolute,
               my_config.termlist_format,
               word_tokenizer=my_tp)

    # create tokenised Article object

    my_article.recognize_entities(my_er)
    my_article.print_entities_xml('output/file/path', pretty_print=True)

2.8 Evaluation

Two factors have been evaluated: speed and accuracy of named entity recognition.

2.8.1 Speed

Both the existing OntoGene pipeline and the new python-ontogene pipeline were run on the same machine on the same data set, and their running times were measured using the Unix time command. The test data set consists of 9559 randomly selected text files, each containing the abstract of a PubMed article. References to the test set can be found in the data/pythonontogene_comparison directory.

The Unix time command returns three values: real, user and system. real refers to the so-called wall-clock time, that is, the time that has actually passed during the execution of the command.
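The dictionary lookup described above can be sketched as a greedy longest match over a tokenized sentence. The function name, term list and concept IDs below are invented for illustration; note also that this sketch skips past each match, whereas the actual pipeline additionally records shorter overlapping matches, as table 2.3 shows.

```python
# Illustrative sketch of dictionary-based entity recognition:
# greedy longest match of tokenized term-list entries against tokens.
def find_entities(tokens, term_dict):
    """term_dict maps token tuples to concept IDs."""
    longest = max((len(t) for t in term_dict), default=0)
    hits, i = [], 0
    while i < len(tokens):
        match = None
        # try the longest candidate span first
        for n in range(min(longest, len(tokens) - i), 0, -1):
            candidate = tuple(tokens[i:i + n])
            if candidate in term_dict:
                match = (i, i + n, term_dict[candidate])
                break
        if match:
            hits.append(match)
            i = match[1]   # skip past the matched span
        else:
            i += 1
    return hits

terms = {("rheumatoid", "arthritis"): "D001172", ("arthritis",): "D001168"}
tokens = "patients with rheumatoid arthritis".split()
# greedy longest match reports only the two-token entity here
```

Holding the term list as a dictionary keyed by token tuples keeps each lookup O(1), which is what makes this approach fast enough for large collections.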
user and system refer to the time during which the CPU was engaged in the respective mode; for example, system calls add to the system time, while normal user-mode code adds to the user time. Table 2.1 lists the measured results.

Table 2.1: Speed evaluation for the OntoGene and python-ontogene pipelines

    pipeline          real         user + system          s / article
    OntoGene          37m5.153s    59 323s (16.5 hours)   6.206
    python-ontogene   21m22.359s   1 280s (0.36 hours)    0.133

Note that the OntoGene pipeline is explicitly parallelized, which is why its real time is relatively low. The python-ontogene pipeline is not explicitly parallelized. However, parallelization could be the subject of future development, resulting in still faster real operation.

2.8.2 Accuracy

To compare the named entity recognition results of both pipelines, testing was done on the same test data set of 9559 files as above. The data set contains both chemical named entities and diseases, which are listed separately in the evaluation below.

A testing script compares the entities found by one pipeline against a gold standard; here, we used the output of the old OntoGene pipeline as the gold standard. The test script requires its input to be in BioC format, so the output of both pipelines was first converted to BioC format. The test scripts can be found in the accompanying data/pythonontogene_comparison directory. The script calculates TP, FP and FN counts, as well as precision and recall, on a per-document basis, along with average values for the entire data set. Table 2.2 lists the results returned by the evaluation script:

Table 2.2: Evaluation of python-ontogene against OntoGene NER

    Entity Type   Precision   Recall   F-Score
    Chemical      0.835       0.865    0.850
    Disease       0.946       0.826    0.882

Note that precision and recall are measured against the output of the OntoGene pipeline.
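The per-document metrics the evaluation script computes follow the standard definitions; a minimal sketch (the function name is illustrative):

```python
# Standard precision/recall/F-score from true positive, false positive
# and false negative counts, as computed by the evaluation script.
def prf(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score
```

Averaging these per-document values over the whole data set yields the figures reported in table 2.2.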
This means that true positives found by the new python-ontogene pipeline that the old pipeline did not find are treated as false positives by the evaluation script. In table 2.3 we list some examples of differences between what the two pipelines produce.

As example 15552512 in table 2.3 shows, the new pipeline lists many entities several times, because they have several entries with different IDs in the term list. While this can be useful, future development should make this behavior optional.

Simpson [33] reports that community-wide evaluations have demonstrated that NER systems are typically capable of achieving favorable results. While our values obtained above cannot be directly compared, systems in the BioCreative gene mention recognition tasks were able to obtain F-scores between 0.83 and 0.87 [34].

2.9 Summary

In this chapter, we presented an efficient pipeline for tokenization, POS tagging and named entity recognition, which focuses on modularity and well-documented code. In the rest of this dissertation, we describe the search for a dependency parser to be included as a module. The modular nature of the python-ontogene pipeline should make the inclusion of new modules easy, as well as facilitate the use of different modules for POS tagging, for example. We especially note the considerable improvement in speed shown in table 2.1: the new python-ontogene pipeline runs approximately 46 times faster than the old OntoGene pipeline, making it a promising starting point for future developments.

Table 2.3: Differences in NER between the two pipelines

PMID 15552511
Original text: These indices include 3 types of measures, which are derived from a health professional [joint counts, global]; a laboratory [erythrocyte sedimentation rate (ESR) and C-reactive protein (CRP)]; or a patient questionnaire [physical function, pain, global].
Comment: Here the new pipeline does not mark C-reactive protein as an entity, but the old one does (False Negative). This is probably due to different tokenization in regards to parentheses.

PMID 15552512
Original text: Patient-derived measures have been increasingly recognized as a valuable means for monitoring patients with rheumatoid arthritis.
Comment: The new pipeline lists both rheumatoid arthritis and arthritis as entities in separate entries. This behavior is quite common: the new pipeline will try to match as many entities as possible. Other examples include tumor necrosis and necrosis (in article 15552517). This behavior makes the python-ontogene pipeline more robust.

PMID 15552518
Original text: It is now accepted that rheumatoid arthritis is not a benign disease.
Comment: Here, the old pipeline marks not a as an entity and lists it with the preferred form of 1,4,7-triazacyclononane-N,N',N''-triacetic acid. This is obviously a mistake, attributed to the quality of the dictionary used, which the new pipeline does not make.

Chapter 3

Parsing

This chapter describes the process of finding a suitable parser to be integrated into the python-ontogene pipeline and to be used as the basis for relation extraction. We evaluate a set of different dependency parsers in terms of speed, ease of use, and accuracy, and select the most promising parser.

3.1 Selection Process

The 2009 BioNLP shared task was concerned with event extraction, and Kim et al. list the parsers that were used in this challenge [17]. Building on this work, Kong et al. recently evaluated 8 parsers for speed and accuracy [20]. To our knowledge, this is the most recent and substantiated evaluation of parsers. Based on their findings, we selected a set of parsers for our own evaluation. We included only parsers for which a readily available implementation exists, and which performed above average in the respective evaluation above.
Recall that the python-ontogene pipeline is entirely written in python3 and aims at reducing time lost on reading from and writing to disk by keeping as much communication between modules in memory as possible. To maintain this advantage, we narrow our selection further to parsers that are either written in python or have already been interfaced for python.

Given the considerations described above, the following dependency parsers were selected for further evaluation:

• Stanford parser, as it was described as the state-of-the-art parser by Kim et al. as well as Kong et al., and has recently been updated.

• Charniak-Johnson (also known as BLLIP or Brown reranking parser), as it was the most accurate parser in Kong et al.'s study mentioned above.

• Malt parser, as it performed fastest in the above evaluation when using its Stack-Projective algorithm.

Furthermore, we also include spaCy, a dependency parser written entirely in python3 that, to our knowledge, has not yet been the subject of scientific evaluation. Except for spaCy, all parsers mentioned above are written in languages other than python, but claim to offer python interfaces.

3.1.1 spaCy

spaCy (https://spacy.io/) is a library including a dependency parser written entirely in python3, with a focus on good documentation and use in production systems, and is published under the MIT license. To our knowledge, there are no publications that evaluate its performance; however, the developer self-reports on the project's website (https://spacy.io/blog/parsing-english-in-python) that the parser outperforms the Stanford parser in terms of accuracy and speed. For our tests, we used version v0.100.

spaCy attempts to achieve high performance by writing the user interfaces in python while implementing the actual algorithms in cython. cython is a programming language and a compiler that aims at providing C's optimized performance and python's ease of use simultaneously [3].
spaCy also provides tokenization and POS tagging models trained on the OntoNotes 5 corpus (https://catalog.ldc.upenn.edu/LDC2013T19). The Universal POS tag set the tagger maps to is described in [25], and the dependency parsing annotation scheme in [7].

3.1.2 spaCy + Stanford POS tagger

Our preliminary evaluation, however, showed that the POS tagger the spaCy library provides does not perform well on biomedical texts, and thus affects the accuracy of the dependency parser. We found that the results of spaCy's dependency parser can be improved when it is used in conjunction with a more accurate POS tagger. For part-of-speech tagging, we thus employed the widely-used Stanford POS tagger 3.6.0 (http://nlp.stanford.edu/software/tagger.shtml) with the pre-trained english-left3words-distsim.tagger model, which is the model recommended by the developers (http://nlp.stanford.edu/software/pos-tagger-faq.shtml#h). The results obtained by combining spaCy and the Stanford POS tagger are included in the evaluation below.

3.1.3 Stanford Parser

The Stanford parsing suite (http://nlp.stanford.edu/software/lex-parser.shtml) is a collection of different parsers written in Java. The parsers annotate according to the Universal Dependencies scheme (http://universaldependencies.github.io/docs/) or to the older Stanford dependencies described in [9]. In our tests, we used version 3.5.2, with the englishPCFG parser (see [19]), which is the default setting.

3.1.4 Charniak-Johnson

The most recent release (4.12.2015) of the implementation of the Charniak-Johnson parser (https://github.com/BLLIP/bllip-parser) was originally described in [6]. The parser is written in C++ and suffers from two major shortcomings:

1. It does not compile under OS X.

2. It does not perform sentence splitting, but requires the input to be already split into sentences.

Because of 1., we conducted our tests for this parser on a 2.6GHz Intel Xeon E5-2670 machine running Ubuntu 14.04.3 LTS. Note that all other parsers were tested on a different machine running OS X.
Given this difference, and because all other parsers perform sentence splitting themselves, the results obtained for the Charniak-Johnson parser cannot be directly compared.

3.1.5 Malt Parser

The MaltParser was first described in [24] and is written in Java. Version 1.8.1 of the MaltParser (http://www.maltparser.org/index.html) requires the input to be already tagged with the Penn Treebank PoS set in order to work. As in the case of spaCy, we prepared the test set using the Stanford POS tagger 3.6.0 with the pre-trained english-left3words-distsim.tagger model.

3.2 Evaluation

Following a preliminary assessment of ease of use and quality of documentation, the parsers were first tested in their native environment (e.g. Java or python) for speed. In a second step, the fastest parsers were then manually evaluated in terms of accuracy.

3.2.1 Ease of Use and Documentation

• spaCy offers centralized documentation (https://spacy.io/docs) and tutorials. Furthermore, being written entirely in python3, it suffers little from difficulties that arise in cross-platform use.

• The Stanford parser has an extensive FAQ (http://nlp.stanford.edu/software/parser-faq.shtml), but documentation is spread across several files as well as JavaDocs. There is no centralized documentation: the user depends on sample files and in-code documentation. However, the code is well documented, and there is a wealth of options, most of which can be applied on the command line, making the software very easy to use.

• The Charniak-Johnson parser offers little documentation on how to use it, and being written in C++, it is not trivial to use across different platforms.
• The Malt parser offers centralized documentation (http://www.maltparser.org/optiondesc.html); however, it focuses mostly on training a custom model and offers little help on using pre-trained models. The need for tagged input data is a major shortcoming, necessitating additional steps in order to use the parser.

Table 3.1 summarizes these results.

Table 3.1: Summary of assessment of ease of use for different parsers

    parser        cross-platform use   documentation                          further comments
    spaCy         easy (python)        centralized documentation, tutorials   inferior POS tagger
    Stanford      easy (Java)          extensive FAQ, well-documented
                                       code, sample files
    Charniak-J.   difficult (C++)      little documentation                   requires sentence-split input
    Malt          easy (Java)          centralized documentation              requires tagged input

3.2.2 Evaluation of Speed

The parsers were compared on a test set consisting of 1000 randomly selected text files containing abstracts from PubMed articles, averaging 1277 characters each. The test set as well as intermediary results can be found in the data/parser_evaluation directory accompanying this thesis. The tests were run on a 3.5 GHz Intel Core i5 machine with 8GB RAM. Table 3.2 lists the various processing speeds measured using the Unix time command. In reading the table, bear in mind the following points:

• The spaCy library takes considerable time to load, but then processes documents comparably fast. To demonstrate this, we list separately the time for processing the test set including loading of the library (loading in the table) and excluding loading time (loaded). We do so because the overhead of loading the library diminishes in significance with increasing size of the data to be processed.
• We also take separate note of spaCy's performance when using plain text files as input and applying its own part-of-speech tagger (plain text in the table), and when provided with previously tagged text (tagged text). In the latter case, a small parsing step takes place to extract tags and tokens from the output produced by the Stanford POS tagger.

• The evaluation of the Charniak-Johnson parser should not be directly compared to the other two, since it was performed on a different machine (see 3.1.4).

Table 3.2: Processing time for different parsers

    parser                          time           characters / s
    Stanford POS tagger (SPT)       29.126s        43 840
    spaCy (plain text, loading)     49.236s        25 933
    spaCy (plain text, loaded)      26.113s        48 896
    spaCy (tagged text, loading)    48.342s        26 413
    spaCy (tagged text, loaded)     23.662s        53 962
    spaCy + SPT (loading)           77.468s        12 482
    Stanford                        2 430.141s     525
    Charniak-Johnson                6 069.198s     210
    Malt                            52 509.288s    24
    Malt + SPT                      52 538.414s    24

Discussion

Table 3.2 shows that the small parsing step needed to make Stanford POS tagger output usable by spaCy, together with loading the tags thus provided, takes approximately the same amount of time as relying on spaCy's internal tagger. Furthermore, the time to load the spaCy library is substantial, although negligible in absolute terms. In relative terms, the combination of spaCy + Stanford POS tagger significantly slows down spaCy's performance. However, as we shall show in Section 3.2.3, it is practically inevitable given the poor accuracy of spaCy's part-of-speech tagger.

Apart from algorithmic differences, the big gap in speed between the parsers is probably due to the fact that a new Java virtual machine is invoked for the processing of every document for the Stanford and Malt parsers. This could be amended by configuring the parsers in such a way that the Java virtual machine acts as a server that processes requests; however, this is beyond the scope of this work.
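The characters / s column in table 3.2 follows from dividing the total character count of the test set (1000 files averaging 1277 characters) by the measured wall-clock time; a small sketch, using the Stanford parser row as an example:

```python
# Throughput as reported in table 3.2: total characters / wall-clock seconds.
def chars_per_second(n_files, avg_chars, seconds):
    return n_files * avg_chars / seconds

rate = chars_per_second(1000, 1277, 2430.141)  # Stanford parser row, ~525
```

The reported figures match this calculation to within rounding of the average document length.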
3.2.3 Evaluation of Accuracy

10 sentences from the test set were selected in order to evaluate the output of the parsers, visualized as parse trees, by hand. The parses were converted into CoNLL 2006 format [4], and then visualized using the Whatswrong visualizer (https://code.google.com/p/whatswrong/). Of the 10 sentences, the first five are considered easy to parse, while the latter five are more difficult. While we do not provide a quantitative evaluation, the qualitative evaluation below gives a good indication of the individual parsers' performance. We only present the parse trees relevant for the discussion below; for a complete list and higher-resolution images, refer to the additional material (data/parser_evaluation/accuracy_evaluation/parse_tree) that accompanies this dissertation.

Only the parses of spaCy, Stanford parser and Malt parser are considered, as well as all the parses produced by the combination of spaCy + Stanford POS tagger. Given the lack of ease of use of the Charniak-Johnson parser, and the difficulty of producing parse trees with it, it is omitted from this evaluation.

The parse trees below highlight how poorly the spaCy parser performs using its own tagger (for example in sentence 8), often yielding parses that would make a meaningful extraction of relations impossible. The Malt parser never yields parses that are superior to Stanford's, and sometimes makes mistakes that the Stanford parser does not (for example in sentence 5). However, using spaCy + Stanford POS tagger, results comparable to the Stanford parser are achieved, with the exception of minor mistakes (see sentences 3 and 6, for example).

Sentence 1: Neurons and other cells require intracellular transport of essential components for viability and function.

All three parsers accurately mark require as the root of the sentence and the phrase neurons and other cells as its subject.
None of the parsers accurately depicts the dependency of the phrase for viability and function on require, assigning it instead to either transport or components.

Figure 3.1: spaCy parser. Figure 3.2: Malt parser.

Sentence 2: Strikingly, PS deficiency has no effect on an unrelated cargo vesicle class containing synaptotagmin, which is powered by a different kinesin motor.

All three parsers correctly identify root and subject. Noticeably, they also all correctly recognize PS deficiency as a compound. Unlike spaCy and the Stanford parser, the Malt parser here incorrectly indicates a dependency between synaptotagmin and effect (rather than between effect and class (containing synaptotagmin)). Without expert knowledge it is not possible to decide whether the dependency between synaptotagmin and the relative clause which is powered by a different kinesin motor is correct, or whether the relative clause depends on class (containing synaptotagmin).

Figure 3.3: spaCy parser. Figure 3.4: Malt parser.

Sentence 3: However, it is unclear how mutations in NLGN4X result in neurodevelopmental defects.

The Stanford and Malt parsers deal with the sentence similarly. spaCy, however, incorrectly marks NLGN4X result as a compound (as opposed to mutations in NLGN4X). This error seems to be caused by result being tagged as NN (noun) rather than VBP (verb, non-3rd person singular present) by the spaCy tagger. Indeed, providing spaCy with tags from the Stanford tagger helps to ameliorate this problem, but it still incorrectly marks NLGN4X result as a compound.

Figure 3.5: spaCy parser. Figure 3.6: Stanford tagger, spaCy parser. Figure 3.7: Stanford parser.

Sentence 4: Diurnal and seasonal cues play critical and conserved roles in behavior, physiology, and reproduction in diverse animals.

spaCy has a peculiar way of describing the dependencies between the phrase diurnal and seasonal cues and play.
This is likely caused by diurnal being tagged as NNP (proper noun). Indeed, this problem is solved by providing the Stanford tagger's tags to spaCy. Furthermore, all three parsers assign different dependencies to the phrase in diverse animals: spaCy marks it as dependent on play, Stanford on behavior, physiology, and reproduction, and Malt on reproduction only. Without expert knowledge it is hard to decide which is the most correct reading; Stanford's assessment seems most plausible, while spaCy's is the most simplistic.

Figure 3.8: spaCy parser. Figure 3.9: Stanford tagger, spaCy parser. Figure 3.10: Stanford parser. Figure 3.11: Malt parser.

Sentence 5: The nine articles contained within this issue address aspects of circadian signaling in diverse taxa, utilize wide-ranging approaches, and collectively provide thought-provoking discussion of future directions in circadian research.

None of the parsers identifies address as the root of the sentence. Both the spaCy and Malt parsers mark utilize as the root of the phrase, and consider the actual root address either as the root of an adverbial clause, or as part of a composite noun (this) issue address aspects. Stanford also fails to mark address as the root, but captures the dependencies between address, utilize and provide appropriately. spaCy only captures the dependency between utilize and provide, while the Malt parser falsely identifies a dependency between approaches and provide. The phrase aspects of circadian signaling is correctly parsed by the Malt parser, while spaCy and Stanford both mark signaling as an ACI of aspects of circadian. Using the Stanford tagger in conjunction with the spaCy parser yields the best results: while the true root address of the sentence is still not found, it parses the phrase aspects of circadian signaling in diverse taxa correctly, and accurately describes it as the object of address.
Figure 3.12: spaCy parser. Figure 3.13: Stanford tagger, spaCy parser. Figure 3.14: Stanford parser. Figure 3.15: Malt parser.

Sentence 6: Thus, perturbations of APP/PS transport could contribute to early neuropathology observed in AD, and highlight a potential novel therapeutic pathway for early intervention, prior to neuronal loss and clinical manifestation of disease.

spaCy, unlike Stanford and Malt, fails to correctly identify the dependency of observed in AD on neuropathology. It also does not accurately mark highlight as dependent on contribute, which Stanford and Malt do. This is not fixed by providing it with the Stanford tagger's tags, although doing so results in a slight improvement by marking novel as an adverbial modifier of pathway. Without expert knowledge it cannot be established whether spaCy's dependency of prior to neuronal loss ... on highlight is correct, or whether Stanford's and Malt's attachment to intervention is.

Figure 3.16: spaCy parser. Figure 3.17: Stanford tagger, spaCy parser. Figure 3.18: Stanford parser.

Sentence 7: Presenilin controls kinesin-1 and dynein function during APP-vesicle transport in vivo.

All parsers parse this sentence correctly. Without expert knowledge, it cannot be decided whether it is more correct to mark the phrase during APP-vesicle transport in vivo as dependent on function (as spaCy and Malt do) or on controls (as Stanford does).

Figure 3.19: spaCy parser. Figure 3.20: Stanford parser. Figure 3.21: Malt parser.

Sentence 8: Log EuroSCORE I of octogenarians was significantly higher (30 ±5 17 vs 20 ±5 16, P < 0.001).

spaCy does not recognize Log EuroSCORE I of octogenarians as one phrase; in fact, the I is tagged as a personal pronoun. Stanford and Malt do recognize it correctly, and Malt in particular identifies the I as a cardinal number. Consequently, providing the spaCy parser with Stanford tags yields much better results.
The phrase in parentheses is marked differently by all parsers: spaCy marks it as an attribute of was, while Stanford and Malt mark it as an unclassified dependency on was higher. Within the parentheses, only Stanford recognizes vs as the root of the phrase, and P < 0.001 as an apposition. Using the spaCy parser in conjunction with the Stanford tagger also improves on the parse provided.

Figure 3.22: spaCy parser. Figure 3.23: Stanford tagger, spaCy parser. Figure 3.24: Stanford parser. Figure 3.25: Malt parser.

Sentence 9: Introduction to the symposium–keeping time during evolution: conservation and innovation of the circadian clock.

spaCy incorrectly marks symposium-keeping time as one phrase. It parses this phrase correctly once the Stanford tagger's tags are used. In that setting, it also parses the phrase keeping time during evolution as an ACI that depends on symposium, while Stanford marks keeping time during evolution as an unclassified dependency of introduction, and Malt as an adjectival modifier. Syntactically, the ACI is the most accurate interpretation, but in this particular constellation the Stanford or Malt parser's results may be more accurate. All parsers, however, deal well with the segmentation of syntactically independent phrases by the colon, marking the second phrase as an apposition to introduction (spaCy), or as an unclassified dependency of time (Stanford) or of introduction (Malt).

Figure 3.26: spaCy parser. Figure 3.27: Stanford tagger, spaCy parser. Figure 3.28: Stanford parser. Figure 3.29: Malt parser.

Sentence 10: Genetic mutations in NLGN4X (neuroligin 4), including point mutations and copy number variants (CNVs), have been associated with susceptibility to autism spectrum disorders (ASDs).

This sentence is parsed surprisingly well by all parsers.
However, spaCy marks (neuroligin 4) as an adverbial modifier of including rather than as an apposition of NLGN4X (as Stanford does). Using Stanford tags, it instead marks it as an apposition of mutations in NLGN4X, which is not as correct as the Stanford parser's result, but an improvement over its default usage.

Figure 3.30: spaCy parser. Figure 3.31: Stanford tagger, spaCy parser. Figure 3.32: Stanford parser. Figure 3.33: Malt parser.

3.2.4 Prospective Benefits

spaCy's POS tagger can be trained on user-supplied data. While this is beyond the scope of this work, spaCy's part-of-speech tagger could be trained on data tagged with the Stanford POS tagger, hopefully yielding better results than its default model. It could then be used instead of the Stanford tagger in the pipeline. This would greatly increase performance for two reasons:

1. Switching environments (python3 and Java) relies on reading and writing to file. As table 3.2 shows, the small parsing step introduced by having to make Stanford POS tagger output available to spaCy further slows down processing. If tagging and parsing can both be done in python3, disk access and conversion become superfluous, further speeding up the pipeline.

2. spaCy's tagger itself seems comparably fast. If retraining does not impact its speed, it could yield a further increase in speed.

3.2.5 Selection

The combination of spaCy + Stanford POS tagger outperforms the other parsers by at least two orders of magnitude in terms of speed, and maintains comparable accuracy. Because of this, and taking the prospective benefits described in 3.2.4 into account, we opt to use spaCy in conjunction with the Stanford POS tagger in the course of this dissertation. Given the modular nature and loose coupling with the part-of-speech tagger in the python-ontogene pipeline, integrating a retrained spaCy POS tagger should be easy, and would hopefully yield a further increase in processing speed.
3.3 Summary

In this chapter we described the selection process for a suitable dependency parser for the python-ontogene pipeline. We evaluated a series of different parsers, and decided to use the spaCy parser in conjunction with the Stanford POS tagger. Not only does this approach outperform the other parsers in terms of speed, it also offers potential for further improvement: if the spaCy POS tagger is trained using the output of the Stanford POS tagger, or another means is found to improve the spaCy POS tagger's performance, we presume that both accuracy and speed can be increased dramatically.

Chapter 4

Rule-Based Relation Extraction

In this chapter, we explain our approach to relation extraction based on hand-written rules. Building on the methods of parsing described in Chapter 3, we created an independent system, which we call epythemeus. It allows searching a corpus of parsed texts for specific relations defined by rules provided by the user.

We first discuss fundamental design decisions made for the epythemeus system in Section 4.1. The system and its components are described in Section 4.2. A brief account of the data set used for development and evaluation follows in Section 4.3. We present a set of manually created rules aimed at finding a large portion of relations in a specific domain of medical literature to demonstrate the functionality of our system (Section 4.4), and conclude with a summary in Section 4.5. The system is evaluated in Chapter 5. All modules and queries described in this chapter can be found in the python-ontogene/epythemeus directory that accompanies this dissertation.

4.1 Design Considerations

While rule-based approaches usually perform well, Simpson [33] explains that the manual generation of ... rules is a time-consuming process. Considerable effort has been taken to facilitate the writing of rules, and thus reduce development time.
We attempt to make these efforts benefit a wider audience by converting rules into queries in a common, widely-used format. We opted for the Structured Query Language (SQL), the most widely used query language for relational databases. This dictates the architecture of our system described at the beginning of Section 4.2.

The epythemeus system builds solely on the syntactic information produced by dependency parsing as described in Chapter 3, and explicitly does not yet take named entity recognition into account. While we point out that including NER information can improve results, allowing epythemeus to utilize such information only at a future stage of development offers the following advantages:

1. Systems utilizing different approaches in a sequential manner can be subject to cascading errors. In the case at hand, this means that a relation may not be found if the system does not detect the corresponding named entity in a previous step. Postponing the inclusion of named entity recognition prevents such cascading errors from occurring as a consequence of the system architecture.

2. Given our focus on aiding the query development process, limiting the features available for phrasing rules allows us to explain our approach with greater clarity and conciseness.

3. By developing epythemeus not as a component of the python-ontogene pipeline but as an independent system, we ensure that our contributions can be of use to a greater audience.

Especially in regard to 3., we attempt to keep epythemeus as independent as possible, allowing it to be used with different parsers and allowing further rule features to be included easily.

4.2 Implementation

The epythemeus system consists of three python modules (stanford_pos_to_db, browse_db, query_helper) and a database. The stanford_pos_to_db module populates the database given a previously tagged corpus as input.
The database can be accessed either via the browse_db module, or through third-party software. The query_helper module facilitates the creation of queries used by either browse_db or the third-party software to extract relations from the database.

Figure 4.1: Schematic overview of the architecture of the epythemeus system.

4.2.1 stanford_pos_to_db

This module uses spaCy to parse a previously POS tagged text. While spaCy offers POS tagging functionality, we found that parsing quality is increased when using a different tagger (see Chapter 3). In the implementation at hand, the module expects as input a directory of plain text articles containing tokens and tags as produced by the Stanford POS tagger. The module will take the input, create a new spaCy object for every article, use spaCy to parse the articles, and commit both the spaCy objects as well as all dependencies to the database.

Within_IN the_DT last_JJ several_JJ years_NNS ,_, previously_RB rare_JJ liver_NN tumors_NNS have_VBP been_VBN seen_VBN in_IN young_JJ women_NNS using_VBG oral_JJ contraceptive_JJ steroids_NNS ._.

Listing 4.1: Example of the format expected by the stanford_pos_to_db module.

Note that this module can be swapped out to convert the output of a different parser into the database without affecting the remainder of the system.

4.2.2 Database

The database is implemented using SQLite (https://www.sqlite.org/), which was chosen for two reasons:

1. Its python3 interface, the sqlite3 module, allows for easy integration with the rest of the epythemeus system and with the python-ontogene pipeline.

2. It can potentially cope with large amounts of data (up to 140 TB, see https://www.sqlite.org/whentouse.html). While the system has only been tested with comparably small data sets (see Section 5.1), this allows epythemeus to be used with much larger data sets such as PubMed in the future.

Schema

The database has two tables: dependency and article.
While the dependency table stores dependency tuples generated from the stanford_pos_to_db module, the article table contains serialized python objects generated by the spaCy library. This approach was chosen to make use of the highly optimized search algorithms employed by SQLite in order to find articles or sentences containing a relation given a certain pattern. At the same time, we maintain the ability to load the corresponding python object containing additional information such as part-of-speech tags, dependency trees and lemmata for further analysis and processing.

The tuples saved in the dependency table have the following format:

dependency(article_id, sentence_id, dependency_type, head_id, head_token, dependent_token, dependent_id, dependency_id)

To demonstrate the relation of database entries and dependency parses, consider the following sample sentence, and the related parse tree (figure 4.2) and set of tuples (table 4.1).

The ventricular arrhythmias responded to intravenous administration of lidocaine and to direct current electric shock ... (2004 | 6 in the development set)

Figure 4.2: Parse tree of a sample sentence.

aid  | sid | type  | hid | head_token     | dep_token      | did | id
2004 | 6   | amod  | 115 | arrhythmias    | ventricular    | 114 | 56733
2004 | 6   | nsubj | 116 | responded      | arrhythmias    | 115 | 56734
2004 | 6   | ccomp | 132 | required       | responded      | 116 | 56735
2004 | 6   | prep  | 116 | responded      | to             | 117 | 56736
2004 | 6   | amod  | 119 | administration | intravenous    | 118 | 56737
2004 | 6   | pobj  | 117 | to             | administration | 119 | 56738
2004 | 6   | prep  | 119 | administration | of             | 120 | 56739
2004 | 6   | pobj  | 120 | of             | lidocaine      | 121 | 56740
2004 | 6   | cc    | 117 | to             | and            | 122 | 56741
2004 | 6   | aux   | 124 | direct         | to             | 123 | 56742
2004 | 6   | xcomp | 116 | responded      | direct         | 124 | 56743
2004 | 6   | amod  | 127 | shock          | current        | 125 | 56744
2004 | 6   | amod  | 127 | shock          | electric       | 126 | 56745
2004 | 6   | dobj  | 124 | direct         | shock          | 127 | 56746

Table 4.1: Dependency tuples for a sample sentence (abbreviated header names).
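The schema and tuple format can be exercised directly with Python's built-in sqlite3 module. The following sketch is an illustrative reconstruction (the text does not show the actual DDL, so the table definition and index names are assumptions), populated with two tuples from table 4.1:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory database for demonstration
cur = conn.cursor()

# Table mirroring the tuple format dependency(article_id, sentence_id,
# dependency_type, head_id, head_token, dependent_token, dependent_id,
# dependency_id). Column names follow the text; the DDL itself is assumed.
cur.execute("""
    CREATE TABLE dependency (
        article_id      INTEGER,
        sentence_id     INTEGER,
        dependency_type TEXT,
        head_id         INTEGER,
        head_token      TEXT,
        dependent_token TEXT,
        dependent_id    INTEGER,
        dependency_id   INTEGER PRIMARY KEY
    )""")

# Compound indices supporting joins within a single article
# (discussed under Indices below).
cur.execute("CREATE INDEX idx_head ON dependency (article_id, head_id)")
cur.execute("CREATE INDEX idx_dep ON dependency (article_id, dependent_id)")

# Two tuples from table 4.1: responded --prep--> to --pobj--> administration.
cur.executemany("INSERT INTO dependency VALUES (?,?,?,?,?,?,?,?)", [
    (2004, 6, "prep", 116, "responded", "to", 117, 56736),
    (2004, 6, "pobj", 117, "to", "administration", 119, 56738),
])

# Chaining tuples: d1's dependent is d2's head, within the same article.
cur.execute("""
    SELECT d1.head_token, d2.dependent_token
    FROM dependency AS d1, dependency AS d2
    WHERE d1.article_id = d2.article_id
      AND d1.dependent_id = d2.head_id""")
print(cur.fetchall())  # [('responded', 'administration')]
```

The self-join in the last query is the basic operation underlying all relation patterns described in this chapter.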
Indices

Indices are used to increase the efficiency of querying the database. Finding relations relies heavily on joining on the dependent_id and head_id columns, while maintaining that a relationship cannot extend over several articles. This forces all potential constituents of a relation to have the same article_id. In order to maintain an easy mapping between the position of the tokens in the article and the dependent_id or head_id, respectively, the dependent_id and head_id are not unique across the database, but rather start at 0 for every new article_id. Because of this, so-called compound indices that allow efficient joining on several columns are created on the column pairs (article_id, head_id) and (article_id, dependent_id). The effects of these compound indices on query performance are described in Section 5.1.2.

4.2.3 query_helper

This module aids in the creation of complex SQL queries for extracting relations. The key idea is that relation patterns can be split into fragments, which are then combined in various ways. The module is thus particularly useful in automatically generating queries that exhaust all possible combinations of fragments. For example, a relationship might be expressed in the pattern X causes Y or X induces Y, which are equivalent in terms of their dependency pattern. Another way to express the same relation is X appears to cause Y, or X seems to cause Y. In this example, six different queries are needed to capture all possibilities. This highlights the usefulness of a tool that automatically generates all possible queries given a minimal set of building blocks. We have thus created our own short-hand notation for such fragments, which is parsed by the query_helper module. The module in turn offers functions that generate queries based on the user-supplied fragments.
Fragments

Fragments represent conditions that apply to dependencies in the database, and that can be chained together. Example 4.2 is a comparatively simple fragment that would match the phrase result in. It is used here to explain the notation of fragments used by the query_helper module. Fragments are saved in plain text files, and a single text file can contain multiple fragments. Having multiple fragments in a single file allows similar fragments to reside in the same file, and thus helps organization.

// result in
d1.head_id, d1.head_id
d1.head_token LIKE 'result%'
d1.dependent_token LIKE 'in'

Listing 4.2: Simple fragment matching the phrase result in.

The first two lines in every fragment carry special meaning. Line 1 is the title line, and contains the name of the fragment so that it can later be referred to. The title line is marked by being prefixed with //. It marks the beginning of a new fragment: every subsequent line that does not begin with // is considered part of the fragment. Line 2 is the joining elements line, the use of which will be explained below. The remaining lines contain conditions.

Every fragment is defined by a set of conditions that apply to a set of dependency tuples as they are stored in the database (see Section 4.2.2). A single dependency tuple is referred to by a name such as d1 within a fragment. Inspired by SQL notation, the elements of a tuple are referred to by the notation dependency_name.element_name. Conditions on the elements can be expressed either by = or by LIKE, which is the same operator as in SQL. Namely, it allows the right-hand operand to contain the wild-card %, which represents missing letters, and also allows for case-insensitive matching. The condition d1.head_token LIKE 'result%' thus applies to all dependency tuples in which the head_token begins with result, including results, resulted and resulting.
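The fragment notation is simple enough that a parser for it fits in a few lines. The sketch below is an illustrative reconstruction (parse_fragments is a hypothetical name, not necessarily the query_helper API): it splits fragment-notation text into named fragments, each with its joining-elements line and its list of conditions.

```python
def parse_fragments(text):
    """Split fragment-notation text into
    {name: {'join': (left, right), 'conditions': [...]}}."""
    fragments = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("//"):
            # title line: starts a new fragment
            current = line[2:].strip()
            fragments[current] = {"join": None, "conditions": []}
        elif fragments[current]["join"] is None:
            # joining elements line: left and right element, comma-separated
            left, right = [e.strip() for e in line.split(",")]
            fragments[current]["join"] = (left, right)
        else:
            # every remaining line is a condition
            fragments[current]["conditions"].append(line)
    return fragments

text = """\
// result in
d1.head_id, d1.head_id
d1.head_token LIKE 'result%'
d1.dependent_token LIKE 'in'
"""
frags = parse_fragments(text)
print(frags["result in"]["join"])        # ('d1.head_id', 'd1.head_id')
print(frags["result in"]["conditions"])  # the two LIKE conditions
```

Because a file may contain several fragments, the title line doubles as a separator, which is why the parser only needs a single pass.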
Using different names for tuples allows the fragment to describe patterns that extend over several tuples, as example 4.3 shows. The condition d1.dependent_id = d2.head_id indicates the connection between the dependency tuples. If no condition specifies the relation between the two tuples, the system will merely assume that the two dependency tuples have to be in the same sentence.

// be responsible
d1.head_id, d2.head_id
d1.dependency_type = 'acomp'
d1.dependent_id = d2.head_id
d2.head_token LIKE 'responsible%'

Listing 4.3: Fragment involving multiple dependency tuples, matching phrases such as is responsible or be responsible.

The following sample sentence contains such a phrase. Again, a simplified parse tree and dependency tuples are provided below.

... different anatomical or pathophysiological substrates may be responsible for the generation of parkinsonian 'off' signs and dyskinesias (11099450 | 5 in the development set)

Figure 4.3: Simplified parse tree of a sample sentence containing the be responsible fragment.

aid      | sid | type  | hid | head_token  | dep_token          | did | id
11099450 | 5   | amod  | 227 | substrates  | different          | 223 | 7319
11099450 | 5   | amod  | 227 | substrates  | anatomical         | 224 | 7320
11099450 | 5   | cc    | 224 | anatomical  | or                 | 225 | 7321
11099450 | 5   | conj  | 224 | anatomical  | pathophysiological | 226 | 7322
11099450 | 5   | nsubj | 229 | be          | substrates         | 227 | 7323
11099450 | 5   | aux   | 229 | be          | may                | 228 | 7324
11099450 | 5   | acomp | 229 | be          | responsible        | 230 | 7326
11099450 | 5   | prep  | 230 | responsible | for                | 231 | 7327
11099450 | 5   | det   | 233 | generation  | the                | 232 | 7328
11099450 | 5   | pobj  | 231 | for         | generation         | 233 | 7329
11099450 | 5   | prep  | 233 | generation  | of                 | 234 | 7330
11099450 | 5   | amod  | 239 | signs       | parkinsonian       | 235 | 7331
11099450 | 5   | punct | 239 | signs       | '                  | 236 | 7332
11099450 | 5   | amod  | 239 | signs       | off                | 237 | 7333
11099450 | 5   | punct | 239 | signs       | '                  | 238 | 7334
11099450 | 5   | pobj  | 234 | of          | signs              | 239 | 7335

Table 4.2: Dependency tuples corresponding to the sample sentence containing the be responsible fragment.
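Hand-translating the conditions of fragment 4.3 into SQL shows how such a multi-tuple fragment matches against stored rows. This is an illustrative sketch using two rows from table 4.2, not code from the epythemeus modules:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("""CREATE TABLE dependency (
    article_id INTEGER, sentence_id INTEGER, dependency_type TEXT,
    head_id INTEGER, head_token TEXT, dependent_token TEXT,
    dependent_id INTEGER, dependency_id INTEGER)""")

# The relevant tuples from table 4.2: be --acomp--> responsible --prep--> for.
cur.executemany("INSERT INTO dependency VALUES (?,?,?,?,?,?,?,?)", [
    (11099450, 5, "acomp", 229, "be", "responsible", 230, 7326),
    (11099450, 5, "prep", 230, "responsible", "for", 231, 7327),
])

# The three conditions of the 'be responsible' fragment, written out as SQL.
cur.execute("""
    SELECT d1.head_token, d2.head_token
    FROM dependency AS d1, dependency AS d2
    WHERE d1.article_id = d2.article_id
      AND d1.dependency_type = 'acomp'
      AND d1.dependent_id = d2.head_id
      AND d2.head_token LIKE 'responsible%'""")
print(cur.fetchall())  # [('be', 'responsible')]
```

The condition d1.dependent_id = d2.head_id is what chains the two tuples: the dependent of the acomp dependency must itself be the head of the second dependency.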
Joining Fragments

In both examples 4.2 and 4.3, the second line does not represent a condition. The first non-empty line to follow the title line describes the left and right elements of the fragment. These specify which elements of the fragment to use when several fragments are joined together using the join_fragments(left_fragment, right_fragment) function of the query_helper module. Consider fragment 4.4 below, and the results produced by calling join_fragments(subj, result in) (4.5) and join_fragments(subj, be responsible) (4.6), respectively.

// subj
d1.head_id, d1.head_id
d1.dependency_type = 'nsubj'

Listing 4.4: subj fragment matching any phrase containing a subject.

d1.head_id, d1.head_id
d1.dependency_type = 'nsubj'
d1.head_id = d2.head_id
d2.head_token LIKE 'result%'
d2.dependent_token LIKE 'in'

Listing 4.5: Result of joining fragments 4.4 (subj) and 4.2 (result in), which will match phrases in which the word result is governed by a subject.

d1.head_id, d3.head_id
d1.dependency_type = 'nsubj'
d1.head_id = d2.head_id
d2.dependency_type = 'acomp'
d2.dependent_id = d3.head_id
d3.head_token LIKE 'responsible%'

Listing 4.6: Result of joining fragments 4.4 (subj) and 4.3 (be responsible), which will match phrases in which a subphrase like is responsible or be responsible is governed by a subject.

Note that in example 4.5 (and analogously 4.6), the resulting fragment is much more specific than just matching phrases in which a subject exists and which contain the word result. Specifying left and right joining elements in the fragment definition allows the join_fragments() function to connect the fragments in a more meaningful manner, and truly chain fragments together.
As can be seen, the join_fragments() function also automatically renames the tuple identifiers, and adds a condition equating the left-hand fragment's right element with the right-hand fragment's left element. In this fashion, the relevant elements for joining fragments can be defined as part of the fragment definition. This allows for the automated joining of fragments, instead of having to specify the element on which to join individually for every join. Setting the option cooccur=True when calling the join_fragments() function disables this behavior, and will merely rename tuples and merge conditions.

Alternatives

The fragment notation allows for an effortless listing of alternatives. Consider example 4.7, which describes phrases such as lead to or leads to. There are several verbs that behave like lead, such as attribute or relate. In order to easily account for such structurally equivalent verbs, the notation shown in example 4.8 can be used.

// lead to
d1.head_id, d1.head_id
d1.head_token LIKE 'lead%'
d1.dependent_token LIKE 'to'

Listing 4.7: Fragment matching the phrase lead to or leads to.

// to
d1.head_id, d1.head_id
d1.head_token LIKE 'attribute%'
|| led%
|| lead%
|| relate%
d1.dependent_token LIKE 'to'

Listing 4.8: Fragment matching several phrases similar to lead to.

Every line beginning with || refers to the closest previous line that is not preceded by ||, and describes an alternative to that line's right-hand operand.

From Fragments to SQL Queries

Fragments can be directly translated into SQL queries, or first joined as many times as necessary before being turned into queries that can be used by the browse_db module, using the querify() function. This function ensures that all dependency tuples have the same article_id as well as sentence_id.
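The || notation expands mechanically into an OR-ed group of conditions. The helper below is an illustrative sketch (expand_alternatives is a hypothetical name; the actual query_helper implementation is not shown in the text): each || line is rebuilt into a full condition by swapping in the alternative right-hand operand, and the group is wrapped in parentheses with OR.

```python
def expand_alternatives(lines):
    """Expand '||' continuation lines into a single OR-ed SQL condition.
    Each '||' line replaces the right-hand operand of the closest
    preceding non-'||' line (illustrative sketch of the notation)."""
    expanded, group = [], []

    def flush():
        if len(group) == 1:
            expanded.append(group[0])
        elif group:
            expanded.append("(" + " OR ".join(group) + ")")
        group.clear()

    for line in lines:
        line = line.strip()
        if line.startswith("||"):
            # rebuild the condition with the alternative operand,
            # e.g. base "d1.head_token LIKE" plus operand "led%"
            base = group[0].rsplit(" ", 1)[0]
            operand = line[2:].strip()
            group.append(f"{base} '{operand}'")
        else:
            flush()
            group.append(line)
    flush()
    return expanded

conds = expand_alternatives([
    "d1.head_token LIKE 'lead%'",
    "|| led%",
    "|| relate%",
    "d1.dependent_token LIKE 'to'",
])
print(conds[0])
# (d1.head_token LIKE 'lead%' OR d1.head_token LIKE 'led%'
#  OR d1.head_token LIKE 'relate%')
print(conds[1])  # d1.dependent_token LIKE 'to'
```

The OR-ed group that results is exactly the parenthesized block visible in the generated query shown below in listing 4.9.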
For example, calling querify(join_fragments(nsubj, to)), using the fragments from examples 4.4 and 4.8, results in the following SQL query:

SELECT d1.article_id, d1.sentence_id
FROM dependency AS d1, dependency AS d2
WHERE d1.article_id = d2.article_id
AND d1.sentence_id = d2.sentence_id
AND d1.dependency_type = 'nsubj'
AND (d2.head_token LIKE 'attribute%'
  OR d2.head_token LIKE 'led%'
  OR d2.head_token LIKE 'lead%'
  OR d2.head_token LIKE 'relate%')
AND d2.dependent_token LIKE 'to'
AND d1.head_id = d2.head_id

Listing 4.9: Query generated from the result of joining the fragments subj and to.

Automated Joining of Fragments

The introduction of this section highlighted the importance of automatically generating all possible combinations of fragments. The function active() in the query_helper module gives an example of how a large set of queries can be generated from a very limited set of fragments. The data/example_generate directory that accompanies this thesis contains both the fragments as well as the generated queries, showcasing the usefulness of the query_helper module.

4.2.4 browse_db

The browse_db module is a shell-like environment that serves three purposes:

• the execution of custom queries
• the easy execution of predefined queries on the database
• the execution of related queries in one batch

When calling the browse_db module, the argument -d 'path/to/file' can be used to access a custom database. This is particularly useful when the same queries need to be executed on different data sets. After calling the browse_db module, it will present a new command-line prompt composed of the database name and the $ sign, waiting for the user to input one of the commands explained below.

user_shell$ python3 browse_db.py
dependency_db.sqlite$

Custom Queries

The database can be queried from the browse_db environment using the q command followed by the SQL query in quotes. For example, a simple search for a specific token can be performed as follows:

$ python3 browse_db.py
dependency_db.sqlite$ q "SELECT * FROM dependency WHERE article_id = 2004 AND head_id = 5"
2004,0,pobj,5,in,patients,6,56625

Predefined Queries

Predefined queries written in SQL are saved in plain text files, which are loaded by browse_db. Every file contains one query, which can be called from within the browse_db environment using the q command and the file name. For example, a query stored in the file x_causes_y.sql can be executed as follows:

$ python3 browse_db.py
dependency_db.sqlite$ q x_causes_y

Several predefined queries are described in Section 4.4. More queries can easily be added, either by creating a new file containing the new query in the predefined_queries directory, or by adding the new file to a custom directory and calling browse_db.py as follows:

$ python3 browse_db.py -q path/to/custom/directory

Specialized Queries and Helper Functions

browse_db furthermore offers helper functions that perform subtree traversal (subtree()), negation detection (is_negated()) and relative clause resolution (relative_clause()) given an article_id and a token_id. These functions can be used in user mode as follows:

dependency_db.sqlite$ subtree 2004 5
in patients receiving psychotropic drugs

Furthermore, browse_db allows for specialized functions that will not only execute the query, but in addition perform further analysis of the results. These functions need to be specifically written, and can utilize the helper functions described above.
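The subtree() helper can be reconstructed from the dependency table alone: starting from a token, repeatedly collect its dependents and return the covered tokens in surface order. The sketch below is an illustrative reconstruction, not the actual browse_db code:

```python
import sqlite3

def subtree(cur, article_id, token_id):
    """Return the tokens of the dependency subtree rooted at token_id,
    in surface order (illustrative reconstruction of browse_db's helper)."""
    ids, tokens, frontier = {token_id}, {}, [token_id]
    while frontier:
        head = frontier.pop()
        cur.execute("""SELECT dependent_id, dependent_token, head_token
                       FROM dependency
                       WHERE article_id = ? AND head_id = ?""",
                    (article_id, head))
        for dep_id, dep_tok, head_tok in cur.fetchall():
            tokens[head] = head_tok
            if dep_id not in ids:
                ids.add(dep_id)
                tokens[dep_id] = dep_tok
                frontier.append(dep_id)
    # token ids restart at 0 per article, so sorting restores surface order
    return " ".join(tokens[i] for i in sorted(tokens))

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("""CREATE TABLE dependency (
    article_id INTEGER, sentence_id INTEGER, dependency_type TEXT,
    head_id INTEGER, head_token TEXT, dependent_token TEXT,
    dependent_id INTEGER, dependency_id INTEGER)""")
# The "administration of lidocaine" chain from table 4.1.
cur.executemany("INSERT INTO dependency VALUES (?,?,?,?,?,?,?,?)", [
    (2004, 6, "prep", 119, "administration", "of", 120, 56739),
    (2004, 6, "pobj", 120, "of", "lidocaine", 121, 56740),
])
print(subtree(cur, 2004, 119))  # administration of lidocaine
```

A traversal of this kind is what lets a specialized function print the full phrase governed by a matched token rather than the token alone.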
For example, such a specialized function has been written for the query x_cause_y; the listing below highlights the difference in output between custom queries and queries with specialized functions:

dependency_db.sqlite$ q "SELECT * FROM dependency WHERE article_id = 2004"
2004,0,amod,1,changes,Electrocardiographic,0,56619
...

dependency_db.sqlite$ q x_cause_y
ID 6504332: that (55) cause disorders (59)
-> subj: that
-> obj: movement disorders
...

These specialized functions are used automatically if their function name and the query name coincide. This allows the user to easily add further specialized functions to the browse_db module.

Categories of Queries

When loading predefined queries from a directory, the browse_db module will also keep the names of the directory and all the queries it contains, considering also subdirectories. This allows for the organization of related queries into directories, which can then be called using the name of the directory. For example, if the predefined_queries directory or the directory provided to browse_db contains a sub-directory example, all the queries contained in the example directory can be executed in one command as follows:

dependency_db.sqlite$ q example
Running query 'x_cause_y' from category 'example'...
10539815,47,it,50,cause,function,53
10728962,31,they,32,cause,vasodilation,33
...

Again, the system will check if any specialized function has been written that matches the name of any of the queries supplied in the directory. If so, it will use that function rather than the query provided.

Command-Line Mode

The execution of all queries in one category can also be initiated without entering the shell-like mode. For this, the argument --ids is used when calling the module.
In that case, the system will not consider any specialized functions and will only execute queries as they are provided in text files as described above. It will also only return the first two fields of every row, which are assumed to always be article_id and sentence_id.

$ python3 browse_db.py --ids ACTIVE_GENERATED
9041081 | 10
2004 | 2
2886572 | 10
...

This way of using the module is suited for subsequent automatic use, especially when the output of the module is redirected to a file (using python3 browse_db --ids category > output.txt).

4.3 Data Set

In their foundational book Mining Text Data, Aggarwal et al. [1] describe the wealth of annotated corpora for domain-independent text mining. However, all these data sets draw on broadcast news, newspaper and newswire data (as in the case of ACE [10]), on the Wall Street Journal (MUC [13]), or on the Reuters corpus (English CoNLL-2003 [36]).

As Simpson et al. explain, however, a major obstacle to the advancement of biomedical text mining is the shortage of quality annotated corpora for this specific domain [33]. Neves [23], for example, gives an overview of 36 annotated corpora in the biomedical domain, most of which, however, do not offer annotations of relations between entities. The study points to the quality of the corpora released in conjunction with the BioCreative challenges (http://www.biocreative.org/about/background/description/), which organize evaluations of text mining and information extraction systems applied to the biological domain, and release annotated corpora for evaluation.

For the development of predefined queries as well as the evaluation of our epythemeus system, we use the annotated corpus originally provided for the BioCreative V challenge [39]. It contains 1500 PubMed article abstracts that have been manually annotated for chemicals and diseases, as well as Chemical-Disease Relations (CDRs).
It is split into three data sets (development, training, test), each containing 500 documents. The data is presented both in BioC format, an XML standard for biomedical text mining [8], and in PubTator format, a special tab-delimited text format used by the PubTator annotation tool [38].

One major shortcoming of the data set, however, is that CDR annotations are made on document level, not on mention level. This means that for every document, the annotation notes which relations are found in the entire document, but does not offer further information on which occurrence of an entity is an argument of the relation and where it is found within the document. The PubTator annotation tool highlights named entities as shown in figure 4.4, but it does not provide out-of-the-box visualization for relations, and hence is not fit for our purpose.

Based on the BioCreative V corpus, we automatically extracted candidate sentences, which are likely to contain a relation (Section 4.3.1). These sentences were then manually categorized according to the pattern that contains the relation (Section 4.3.2), in order to develop queries that match the patterns and to be able to evaluate the effectiveness of the epythemeus system. Note that we chose to use a corpus containing CDR annotations not because the epythemeus system is specific to that subdomain, but due to the scarcity of high-quality annotated corpora in the biomedical domain. In fact, our system is just as suitable for relation extraction in any other subdomain.

Figure 4.4: Exemplary view of named entity highlighting on PubTator.

4.3.1 Conversion

For the development of queries and the evaluation, we only consider relations that are confined within a single sentence. While the epythemeus system is technically able to deal with relations that transcend sentence boundaries, this is beyond the scope of this work.
We thus converted the documents of the corpus as follows (using the script python-ontogene/converters/pubtator_to_relations.py): The document is split into sentences using spaCy, and only sentences that contain both entities of an annotated relation are retained. These sentences are printed out separately, and the entities participating in the annotated relationship are capitalized to facilitate human evaluation.

804391|t|Light chain proteinuria and cellular mediated immunity in rifampin treated patients with tuberculosis.
804391|a|Light chain proteinuria was found in 9 of 17 tuberculosis patients treated with rifampin.
...
804391 12 23 proteinuria Disease D011507
804391 58 66 rifampin Chemical D012293
...
804391 CID D012293 D011507

Listing 4.10: Example of PubTator format.

804391 | 0 | Light chain PROTEINURIA and cellular mediated immunity in RIFAMPIN treated patients with tuberculosis.
804391 | 1 | Light chain PROTEINURIA was found in 9 of 17 tuberculosis patients treated with RIFAMPIN

Listing 4.11: Extracted relations after conversion.

Table 4.3 lists the number of sentences containing a probable mention of an annotated relationship extracted from the respective subset of the corpus.

subset      | articles in subset | sentences extracted
development | 500                | 623
training    | 500                | 581
test        | 500                | 604

Table 4.3: Sentences extracted per subset.

Note that table 4.3 also lists the training data set for the sake of completeness and comparison. However, that set is not used in the course of this work.

4.3.2 Categorization

From the manual analysis of the sentences in the development subset, a set of 8 categories was derived, and each of the sentences was manually assigned to one of these categories. The categories describe the structure of the sentence pointing towards the relation it contains. Following this, the sentences in the test set were each assigned to the same set of categories.
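The extraction step of Section 4.3.1 can be sketched as follows. This is an illustrative reconstruction of what pubtator_to_relations.py does, under simplifying assumptions labeled in the code (naive sentence splitting instead of spaCy, plain substring matching of mentions):

```python
import re

def extract_candidates(text, mentions, relations):
    """Keep sentences containing both entities of an annotated relation,
    with the participating mentions capitalized. Naive sketch: sentences
    are split on sentence-final periods rather than with spaCy."""
    sentences = re.split(r"(?<=\.)\s+", text)
    results = []
    for sid, sentence in enumerate(sentences):
        for chem_id, dis_id in relations:
            chems = [m for m, cid in mentions if cid == chem_id]
            dises = [m for m, cid in mentions if cid == dis_id]
            found_c = next((m for m in chems if m in sentence), None)
            found_d = next((m for m in dises if m in sentence), None)
            if found_c and found_d:
                out = sentence.replace(found_c, found_c.upper())
                out = out.replace(found_d, found_d.upper())
                results.append((sid, out))
    return results

# Mentions and the CID relation from listing 4.10 (PMID 804391).
mentions = [("proteinuria", "D011507"), ("rifampin", "D012293")]
relations = [("D012293", "D011507")]
text = ("Light chain proteinuria and cellular mediated immunity in "
        "rifampin treated patients with tuberculosis. "
        "Light chain proteinuria was found in 9 of 17 tuberculosis "
        "patients treated with rifampin.")
for sid, s in extract_candidates(text, mentions, relations):
    print(sid, "|", s)
```

On the abstract of listing 4.10 this yields the two capitalized candidate sentences of listing 4.11.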
Below, we describe the categories and the criteria that determine the association of a sentence with the respective category. While the categories could apply to other domains, too, they have been developed from sentences containing chemical-disease relations, and thus their precise definition is specific to the CDR domain.

ACTIVE

This category involves active sentences in the form of X causes Y (or X cause Y). Included are constructions with modal verbs such as X may cause Y or X did cause Y, as well as extended patterns such as X appears to cause Y. The following sentence stands as an example for this category.

This is the first documentation that METOCLOPRAMIDE provokes TORSADE DE POINTES clinically. (11858397 | 6 in the development set)

A collection of verbs that establish relations in the development subset has been compiled:

• accompany • associate • attenuate • attribute • cause • decrease • elicit • enhance • increase • induce • invoke • kindle • lead to • precipitate • produce • provoke • recur on • relate • reflect • be responsible • resolve • result in • suppress • use

DEVELOP

A common setting for establishing relationships between chemicals and diseases is to expose a subject to a chemical X and observe a subsequent case of disease Y [27]. This category captures sentences that express such cases. It is the broadest category, including a vast variety of patterns. An example of a simple pattern is X in _ on Y, where X is a disease, _ represents an entity, usually a patient, and Y is the chemical. More complicated patterns are case of X within _ receiving Y or X in _ admitted to using Y. Many of these patterns also contain a temporal component, such as development of X following Y treatment or X within _ of administration of Y, where _ represents some time period. The sentence below is a typical example of this category.
Five patients with cancer who developed ACUTE RENAL FAILURE that followed treatment with CIPROFLOXACIN are described ... (8494478 | 2 in the development set)

DUE

This simple category captures sentences in the form of X due to Y and related variants that contain the word due. An example of such a sentence is listed beneath:

Fatal APLASTIC ANEMIA due to INDOMETHACIN--lymphocyte transformation tests in vitro. (7263204 | 0 in the development set)

HYPHEN

A large proportion of annotated relations were found in the pattern of X-induced Y, such as APOMORPHINE-induced HYPERACTIVITY (6293644 | 2 in the development set). The category also includes more complicated variations of the pattern such as KETAMINE- or diazepam-induced NARCOSIS (11226639 | 6 in the development set) or PILOCARPINE (416 mg/kg, s.c.)-induced limbic motor SEIZURES (9121607 | 2 in the development set). It also extends to the same pattern using different words, namely:

• associate • attribute • induce • kindle • mediate • relate

NOUN

This category revolves around nouns that can express relations in patterns such as the action of X on Y or the X action of Y. For example, a sentence containing the dual action of MELATONIN on pharmacological NARCOSIS seems ... (11226639 | 7 in the development set) is considered to belong to this category. Nouns that have been found to express relations in this sense in the development subset are:

• action • association • case • cause • complication • effect • enhancement • factor • induction • marker • pathogenesis • relationship • role

NOUN+VERB

This category extends the previous one in that it applies to sentences in which one of the nouns of the NOUN category is used in conjunction with a verb to express a relation. The pattern X plays role in Y as expressed in the sentence below is a prime example of this category.
ERGOT preparations continue to play a major role in MIGRAINE therapy (3300918 | 4 in the development set)

PASSIVE

Sentences in the form of X associated with Y or X is associated with Y belong to this category. This includes all tenses (X was associated with Y), sentences of the pattern X appears to be associated with Y, as well as the rare case of X associated by Y. The same set of verbs as used in the ACTIVE category applies here. For example, the following sentences are assigned to this category.

The HYPERACTIVITY induced by NOMIFENSINE in mice remained ... (2576810 | 3 in the development set)

Symptomatic VISUAL FIELD CONSTRICTION thought to be associated with VIGABATRIN ... (11077455 | 1 in the development set)

NO CATEGORY

Sentences that did not match any of the previously mentioned categories were assigned the NO CATEGORY label. Note that the sentences extracted from the development subset of the original corpus do not necessarily express the annotated relation, even though both entities of the relation appear in the sentence. A large proportion of sentences were assigned to this category for that reason. For example, the following sentence does not establish any relation between the two entities sirolimus and capillary leak:

Systemic toxicity following administration of SIROLIMUS (formerly rapamycin) for psoriasis: association of CAPILLARY LEAK syndrome with apoptosis of lesional lymphocytes. (10328196 | 0 in the development set)

Another minor reason for attribution to this category are entities with short names that also occur in natural language, and are thus extracted falsely by the system. An extraction process more elaborate than the one described in Section 4.3.1, for example involving tokenization or even entity normalization, could ameliorate this shortcoming, but lies beyond the scope of this work.
4.3.3 Development and Test Subsets Tables 4.4 and 4.5 list the number of sentences for every category in the development subset and test subset, respectively, as well as their percentage. category ACTIVE DEVELOP DUE HYPHEN NO CATEGORY NOUN NOUN+VERB PASSIVE Total sentences 49 146 7 181 109 23 11 97 623 percentage 7.865% 23.43% 1.124% 29.05% 17.5% 3.692% 1.766% 15.57% 100% Table 4.4: Categorization for sentences extracted from the development subset. category ACTIVE DEVELOP DUE HYPHEN NO CATEGORY NOUN NOUN+VERB PASSIVE Total sentences 47 128 6 150 122 21 22 85 581 percentage 8.09% 22.03% 1.033% 25.82% 21% 3.614% 3.787% 14.63% 100% Table 4.5: Categorization for sentences extracted from the test subset. CHAPTER 4. RULE-BASED RELATION EXTRACTION 70 The annotated corpora, the extracted sentences and their categorization as well as related material can be found in the data/manual_corpus directory that accompanies this thesis. 4.4 Queries This section describes the development of query sets, which should provide the reader with a fair notion of how to use the epythemeus system. Based on the development set and using the query_helper module, a set of queries was developed for three categories: • the trivial case of the HYPHEN category • the ACTIVE category, considered relatively simple • the complex DEVELOP category The query sets were aimed at having near-perfect recall for their respective category on the development set, while generalizing as much as possible. The fragments and generated queries for each query set can be found in the data/query_set directory that accompanies this work. 4.4.1 HYPHEN queries Queries for this category are trivially easy to make. The following single fragment produces a query that achieves almost perfect recall on the development set: 1 2 3 4 5 6 7 8 // hyphen d1 . dependent_id , d1 . head_id d1 . 
d1.dependent_token LIKE '%-induced'
    || %-associated
    || %-attributed
    || %-kindled
    || %-mediated
    || %-related

Note that it is the dependent_token where we expect words such as levodopa-induced to occur. This is because, most commonly, phrases in the pattern X-induced Y are parsed as an amod dependency, where Y will be the head_token and X-induced the dependent_token. Table 4.6 below shows the corresponding dependency tuples.

id   aid       sid  type  hid  head_token   did  dep_token
631  10091616  0    prep  0    Worsening    1    of
632  10091616  0    amod  3    dyskinesias  2    levodopa-induced
633  10091616  0    pobj  1    of           3    dyskinesias

Table 4.6: Dependency tuples representing a HYPHEN relation.

4.4.2 ACTIVE queries

In order to maximize generalization, a set of minimal fragments was determined that would cover as many sentences from the development set as possible, and then all possible combinations of these were automatically generated. This required manual analysis of every sentence's structure and key words. We found that sentences in the ACTIVE category are made up of up to three sets of fragments:

A first set of fragments describes a set of verbs that express a direct relationship between two entities. These words may either take direct objects (such as to cause), or require a preposition (such as to associate with). We also added to this set of fragments the case of to be responsible for. This set also includes variations involving modal verbs (may cause), different numeri (X causes Y and X and Y cause Z) as well as tempora (X causes Y and X caused Y). Below is an example of fragments in this set. For a full account of such fragments, refer to the data/example_generate directory.

// with
d1.head_id, d1.head_id
d1.head_token LIKE 'associate%'
    || co-occur%
    || coincide%
d1.dependent_token LIKE 'with'

// active
d1.head_id, d1.head_id
d1.head_token LIKE 'accompan%'
    || associate%
    || attenuate%
    || cause%
    ...
    || use%
d1.dependency_type = 'dobj'

// be responsible
d1.head_id, d2.head_id
d1.dependency_type = 'acomp'
d1.dependent_id = d2.head_id
d2.head_token LIKE 'responsible%'

Figure 4.5 and table 4.7 show the parse tree for a typical sentence in this category, as well as the corresponding dependency tuples.

Figure 4.5: Typical parse tree for an ACTIVE sentence.

id     aid       sid  type      hid  head_token  did  dep_token
13197  11858397  6    nsubj     132  provokes    131  metoclopramide
13201  11858397  6    dobj      132  provokes    135  pointes
13202  11858397  6    advmod    132  provokes    136  clinically
13199  11858397  6    compound  135  pointes     133  torsade
13200  11858397  6    nsubj     135  pointes     134  de

Table 4.7: Dependency tuples for a typical ACTIVE sentence.

A second set captures cases in the pattern of X verb_a and verb_b Y, where verb_b expresses the relation in question. An example of such a case is the following sentence, where the relation enhances(oral hydrocortisone, pressor responsiveness) is captured by this pattern. Note that because of the way the sentence is parsed, this relation would not be discovered without this fragment (see figure 4.6).

Oral hydrocortisone increases blood pressure and enhances pressor responsiveness in normal human subjects.18

Figure 4.6: Parse tree of a sentence in the pattern X verb_a and verb_b Y.

// conj
d1.head_id, d1.dependent_id
d1.dependency_type = 'conj'

A third set entails structures like X appears to cause Y or X seems to cause Y.

// appears
d1.head_id, d1.dependent_id
d1.head_token LIKE 'appear%'
    || seem%
d1.dependency_type = 'xcomp'

Note that such patterns can be combined in various ways: for example, the verb to cause can occur in the patterns X causes Y, X appears to cause Y, X some_verb and causes Y, X appears to some_verb and cause Y, and X some_verb and appears to cause Y. Queries that match the latter case, however, are not generated, as there are no such sentences in the development set. From these fragments, a set of 29 queries was automatically generated using the query_helper module. The set of generated queries can be found in the data/example_generate directory that accompanies this thesis.

18 2722224 | 1 in the original development set of the BioCreative corpus

4.4.3 DEVELOP queries

The patterns of sentences in the DEVELOP category are certainly the most varied ones. Recall that sentences in the DEVELOP category describe a situation where a chemical is administered to a recipient, and a disease observed in that recipient. Every pattern must thus have a part describing the disease, and one describing the chemical. Since the database does not store entity recognition information, the epythemeus system needs to rely on parsing patterns to identify diseases and chemicals, respectively. The fact that the administration of the chemical as well as the observation of the disease need to be described in the sentence makes it possible to identify the elements of a chemical-disease relationship. While the fragments presented here do not cover all the cases in the development set, they give an idea of how more complicated relations can be found.

Chemicals

The patterns that identify chemicals revolve around the administration of the chemical, which can manifest in a variety of ways. Below we give an example of the kind of structures that can express the administration of a chemical. The fragment titles should give a sufficient description of the pattern the fragments describe.

// X therapy
d1.head_id, d1.head_id
d1.head_token = 'therapy'
    || injection%
d1.dependency_type = 'amod'

// therapy with X
d1.head_id, d2.dependent_id
d1.head_token LIKE 'therap%'
    || injection%
d1.dependent_id = d2.head_id
d2.head_token = 'with'
d2.dependency_type = 'pobj'

// injection of X
d1.head_id, d2.dependent_id
d1.head_token LIKE 'injection%'
    || administration
    || dose%
d1.dependent_id = d2.head_id
d2.head_token = 'of'
d2.dependency_type = 'pobj'

A particular case was often encountered when chemical administration is not explicitly described, such as in the following sentence:

... effects were ... VOMITING in the FLUMAZENIL group.19

Here, the chemical administration is only implicitly indicated as a quality of the recipient. The fragment below describes the pattern X group, but there are many other sentences in this sense, such as women on ORAL CONTRACEPTIVES20 or occurrence of SEIZURES and neurotoxicity in D2R -/- mice treated with the cholinergic agonist PILOCARPINE21.

// X group
d1.head_id, d1.head_id
d1.dependency_type = 'compound'
d1.head_token LIKE 'group'

Diseases

A simple example of the description of the occurrence of a disease follows:

The development of CARDIAC HYPERTROPHY was studied...22

In fact, such constructions involving similar nouns as in the NOUN category are quite common, and it might be fruitful to explore possible synergies between queries for the two categories. The fragments below exemplify how such constructions can be represented as fragments:

19 1286498 | 10 in the development set
20 839274 | 0 in the development set
21 11860278 | 4 in the development set
22 6203632 | 1 in the development set

// development of X
d1.head_id, d1.head_id
d1.head_token LIKE 'development%'
d1.dependent_id = d2.head_id
d2.dependency_type = 'pobj'

// effects of X
d1.head_id, d2.dependent_id
d1.dependent_token LIKE 'effect%'
d1.dependency_type = 'nsubj'
d1.head_id = d2.head_id

Chemical Disease Relation

The patterns that actually capture the structures representing a relation between a disease and a chemical are very varied. We have identified three ways of finding them:

1. The subject exposed to the chemical can be used to establish the connection between the disease and the chemical.
2. A time word establishes a temporal relation between the administration of a chemical and disease onset.
3. A preposition is used instead of a verb.

The first case is the most straightforward approach given our system. However, such sentences are surprisingly rare. While the sentence below is a good example of the kind of sentences that can be found with this approach, we found that the second case is far more fruitful.

... NICOTINE-treated rats develop LOCOMOTOR HYPERACTIVITY ...23

In fact, it seems that time words such as after are often used when describing chemical administration, which could be exploited to create more robust queries. The following fragments can be joined to fragments describing chemical administration as described above.

23 3220106 | 8 in the development set

// after X
d1.head_id, d1.dependent_id
d1.dependent_token = 'after'
d1.dependency_type = 'prep'

// following X
d1.head_id, d1.dependent_id
d1.head_token = 'following'
d1.dependency_type = 'dobj'

The resulting fragment, called time word+chemical for the purposes of this discussion, can then be joined directly to the occurrence of a disease, which would allow for the finding of sentences such as the one below:

Delayed asystolic CARDIAC ARREST after DILTIAZEM overdose; resuscitation with high dose intravenous calcium.24

It could also be joined to a verb expressing disease occurrence to find sentences such as the following:

A 42-year-old white man developed acute hypertension with severe HEADACHE and vomiting 2 hours after the first doses of amisulpride 100 mg and TIAPRIDE 100 mg.25

It seems, however, that not joining the time word+chemical fragment to anything directly, but rather creating a query that merely checks for the co-occurrence of the pattern expressed by the time word+chemical fragment and a disease, yields good results with few false positives. While this claim needs further substantiation, we suggest that this is because the time word+chemical pattern is used almost exclusively in sentences expressing a chemical-disease relation.

A third way that is often used to express relations in the development set is to rely on prepositions rather than verbs. While this is common especially in titles, the example below shows how this can also be the case in normal text.

24 12101159 | 0 in the development set
25 15811908 | 2 in the development set

Two cases of postinfarction ventricular SEPTAL RUPTURE in patients on long-term STEROID therapy are presented ...26

However, since prepositions such as in in the example above are so common, it is very difficult to write queries that will only return sentences that use them to express a chemical-disease relation.

4.5 Summary

In summary, we created a system capable of extracting relations of any kind, and introduced the concept of fragments, which aids the process of writing queries.
Both are domain-independent, and while we developed them with biomedical text mining in mind, they are just as applicable to other fields.

4.5.1 Arity of Relations

Note that no constraints are put on the number of entities participating in a relation. The distinction between relation and event extraction, as it has been suggested by Simpson [33], for example (see Section 1.3), thus has little meaning. Currently, the query_helper module will generate queries that return the identifier of the sentence in which the relation is found. Using specialized queries, and making use of the subtree() function as described in Section 4.2.4, however, the epythemeus system can be adapted to return the individual entities participating in relationships of arbitrary complexity. In fact, the queries developed for the ACTIVE and HYPHEN categories return relations consisting of three entities each, where the verb expresses the quality of the relation. For example, relations of the patterns X increases Y and X suppresses Y are currently both part of the ACTIVE category, but could be assigned to different categories to allow for a more differentiated extraction of relations. In the same spirit, queries in the DEVELOP category in particular can extract relations consisting of various entities: capturing the dosage of drug administration, for example, is made quite easy using fragments.

26 9105126 | 1 in the development set

4.5.2 Query Development Insights

The examples above showcase how the concept of fragments greatly facilitates the creation of queries, especially in cases where many possible combinations of similar structural patterns occur. However, the writing of queries could be further facilitated if the dependency tuples also stored lemmata (forgoing the need to use the LIKE operator and allowing for more concise queries), and if word-lists could be supplied for alternatives, rather than listing every word individually.
This might be especially useful to increase re-use: for example, queries for the PASSIVE category are very likely to use the same verbs as are used in the ACTIVE category. While the manual creation of queries requires a good understanding of the annotation scheme used by the parser, the automatic generation of possible variations allows the system to cover a large proportion of relations. The use of the system has been demonstrated using the example of chemical-disease relation extraction, and the queries written for the demonstration are specific to that domain. Note that the fragments and queries developed are specific to the dependency scheme employed by the parser. While efforts are made to establish universally accepted standards such as the Universal Dependency scheme27, these are not yet widely used, limiting the re-use of existing fragments and queries.

4.5.3 Augmented Corpus

In order to evaluate the system, and building on a previously annotated corpus, we manually categorized over 1000 sentences extracted from PubMed articles according to the pattern that defines the relation they contain. While this categorization does not follow any particular standard such as the ones laid out by Willibur [40], and in particular offers no measure of inter-annotator agreement, we hope that this categorization will help the reader to understand how to use the epythemeus system, and may be useful for other related research.

27 http://universaldependencies.github.io/docs/

Chapter 5 Evaluation

In this chapter we evaluate the epythemeus system against the test corpus described in Chapter 4. Furthermore, in order to give an estimate of the effort required to process the entire PubMed using the approach described in this thesis, we apply our approach to a small test set and extrapolate the measured results.
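The extrapolation performed later in this chapter amounts to simple linear scaling from a 1000-abstract sample to the roughly 25 million abstracts in PubMed. A minimal sketch of the arithmetic (the measured value used here is one of those reported in the results below; the function name is illustrative):

```python
# Linear extrapolation from a 1000-abstract sample to all of PubMed,
# assuming processing time scales proportionally with the number of documents.
SAMPLE_SIZE = 1_000
PUBMED_SIZE = 25_000_000

def project(seconds_for_sample: float) -> float:
    """Projected processing time in seconds for the whole of PubMed."""
    return seconds_for_sample * PUBMED_SIZE / SAMPLE_SIZE

# e.g. POS tagging took 23.832s on the sample (cf. Table 5.9)
projected = project(23.832)
print(f"{projected:.0f}s = {projected / 86400:.1f} days")  # 595800s = 6.9 days
```

The assumption of linear scaling is discussed, and partially relaxed, in Section 5.2.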
5.1 Evaluation of epythemeus

5.1.1 Query Evaluation

We evaluate the query sets developed for the HYPHEN and the ACTIVE category (described in Sections 4.4.1 and 4.4.2, respectively). While we also list the results for the queries written for the DEVELOP category (Section 4.4.3), that set of queries only served to exemplify how queries for more complicated patterns can be obtained. The results for this query set thus do not give any account of the efficacy of the epythemeus system, but serve only to further illustrate the query development process.

For evaluation, the query sets were executed on the development set and the test set. The queries return article_id and sentence_id for every sentence in which a relation is found. From the manually categorized sentences of the development set, the article_id and sentence_id for every sentence belonging to the category in question are extracted using the categories.py script. The article_id and sentence_id pairs extracted in this fashion are taken as the gold standard. The gold standard and the output produced by the query sets are then compared using the evaluate.py script. All scripts used for evaluation, as well as intermediate results, can be found in the data/manual_corpus directory.

HYPHEN queries

Table 5.1 shows how the query set has very high recall on the development set, and comparable recall on the test set.
set          recall  precision  F1 measure  TP   FP   FN
development  0.961   0.316      0.476       174  376  7
test         0.953   0.294      0.449       142  341  7

Table 5.1: Results of the HYPHEN query set executed on the development and test set.

The false negatives (FN) on the development set are all sentences in which the hyphen connects an element in parentheses, such as the sentence below:

decreased the incidence of LIDOCAINE (50 mg/kg)-induced CONVULSIONS.1

Such cases cannot be easily covered, given that in such situations the spaCy parser will treat the hyphen as an individual token, and produce a considerably more complex parse tree. Of the seven FNs on the test set, five are due to the same issue. The remaining two are explained by a spelling mistake in the original text (the appearance of these LEVODOPA-induce OCULAR DYSKINESIAS2), and by the use of a word not previously encountered (rats with LITHIUM-treated DIABETES INSIPIDUS3).

The huge number of false positives (FP) warrants a more thorough discussion: While a systematic evaluation of these is beyond the scope of this work, ten randomly selected FPs were manually evaluated.

In one sentence4 marked as FP, the relation between amiodarone and pulmonary lesion should have been found. However, the sentence only contains the words amiodarone-related lesion, and thus was not extracted as one of the sentences for the gold standard, although it was found by the query. Similar problems arise with abbreviations: In the example below, the relation between streptozotocin and nephropathy is not recognised by the extraction process:

... STZ-induced diabetic nephropathy in ... mice.5

However, in the original annotation, this abbreviation STZ is annotated and given the same identifier as streptozotocin. In fact, it is the pubtator_to_relations.py converter that fails to resolve this correctly.

1 11243580 | 5 in the development set
2 11835460 | 3 in the test set
3 6321816 | 2 in the test set
4 18801087 | 3 in the test set
Another problem with the same converter, which led to two FPs, is that sentences are not extracted from the original PubTator file (see Section 4.3.1) if the participants of the relation occur in the text with their starting letters capitalized. This occurs occasionally in titles, and is a trivial bug to fix. However, fixing this bug would partially invalidate the results obtained so far. Two of the ten randomly selected FPs are attributed to this fault.

In one case6, a possible relation (glucocorticoid-induced glaucoma) that has not been annotated was returned. However, without expert knowledge, it is not possible to decide whether this is an oversight of the original annotation, or correctly classified as a FP.

The remaining three sentences are indeed false positives, returning phrases such as amphetamine-induced group7 or 5HT-related behaviors8. Table 5.2 below summarizes the findings from the manual evaluation of the randomly selected sample of false positives.

reason                        sentences  projected
upper case in conversion      3          102.3
different names for entities  3          102.3
correct FPs                   3          102.3
requires expert knowledge     1          34.1

Table 5.2: Summary of reasons for a random sample of FPs in the test set, and projected numbers for the entire test set.

5 20682692 | 7 in the test set
6 24691439 | 3 in the test set
7 24739405 | 8 in the test set
8 24114426 | 3 in the test set

ACTIVE queries

Table 5.3 lists the results of the ACTIVE query set executed on the development and test set, respectively.

set          recall  precision  F1 measure  TP  FP   FN
development  0.939   0.121      0.215       46  333  3
test         0.596   0.065      0.117       28  403  19

Table 5.3: Results of the ACTIVE query set executed on the development and test set.

All three FNs on the development set are due to incorrect parses. For example, in the sentence below, the spaCy parser considers isoniazid increase a compound noun.
High doses of ISONIAZID increase HYPOTENSION induced by vasodilators9

Again, ten randomly selected sentences from the FPs on the test set were manually evaluated. In contrast to the HYPHEN query set, the FPs for this set seem to fall into one of two categories. In six cases, the sentence returned did seem to contain a relation, but not the one that was annotated. Without expert knowledge, it is not possible to make a definite assessment, but the example below shows that it is plausible that a relation was indeed found, as well as the complexity of sentences that are still recognized by the query set.

Application of an irreversible inhibitor of GABA transaminase, gamma-vinyl-GABA (D,L-4-amino-hex-5-enoic acid), 5 micrograms, into the SNR, bilaterally, suppressed the appearance of electrographic and behavioral seizures produced by pilocarpine10

Note that the above sentence was categorized as PASSIVE (for the relation between pilocarpine and seizures, which are the annotated entities). However, the relation between gamma-vinyl-GABA and seizures, which caused the sentence to be returned by the ACTIVE query set, was not annotated in the original corpus.

9 9915601 | 1 in the development set
10 3708328 | 7 in the test set

Three sentences are correctly marked as FPs. For example, the sentence we used Poisson regression11 is found, indicating that the word use may be too ambiguous to be used in ACTIVE queries without other structures. The sentence below is a correct FP, which however could hint at a possible relation, if the reference each drug could be resolved.

Administration of each drug and their combinations did not produce any effect on locomotor activity.12

One exception is the following sentence, in which an incorrect parse causes it to be found.
Naloxone (0.3, 1, 3, and 10 mg/kg) injected prior to training attenuated the retention deficit with a peak of activity at 3 mg/kg.13

The following table summarizes the random sample evaluation of false positives.

reason                     sentences  projected
correct FPs                3          120.9
requires expert knowledge  6          241.8
incorrect parse            1          40.3

Table 5.4: Summary of reasons for a random sample of FPs.

DEVELOP queries

As explained above, the DEVELOP query set is intended to demonstrate the query creation process, and does not aim at high performance. Table 5.5 lists its results to convey a notion of what a few fragments can achieve.

set          recall  precision  F1 measure  TP  FP  FN
development  0.233   0.362      0.283       34  60  112
test         0.102   0.157      0.123       13  70  115

Table 5.5: Results of the DEVELOP query set executed on the development and test set.

11 25907210 | 5 in the test set
12 15991002 | 12 in the test set
13 3088653 | 3 in the test set

Query Evaluation Discussion

The results of the HYPHEN and ACTIVE query sets indicate that the epythemeus system is capable of delivering useful results. The biggest obstacle to favorable performance is the low precision (0.294 on the test set for the HYPHEN queries, 0.065 for the ACTIVE queries, and 0.157 for the DEVELOP queries). Systems in the BioNLP '09 shared task achieved F1 measures of up to 0.52 [17], and thus surpass our results (F1 measures of 0.449 for the HYPHEN queries, 0.117 for the ACTIVE queries and 0.123 for the DEVELOP queries) by far. As the discussion above shows, these values are partially due to the inferior quality of the gold standard used for evaluation: It is very possible that our queries find relations that experts would consider as such, but that are not annotated in our reference corpus. Besides that, further action needs to be taken to prune false positives. As stated in Section , we deliberately do not include named entity information in our current approach.
However, future versions of the epythemeus system could use NER information to reduce the number of FPs, and thus increase the F1 measure. For example, all FPs returned by the HYPHEN query set on the test set could have been identified as such if NER information had been made use of. While we suggest here that NER information be used to prune the results returned by queries based solely on syntactic information, it is certainly more common to reverse the order of these approaches. As we describe in , however, this introduces the problem of cascading errors. It would thus be interesting to compare the outcomes of systems using NER as a means of pruning previously obtained results, or using NER as the basis for further refinement, respectively.

5.1.2 Speed Evaluation and Effect of Indices

The effect of using the compound indices described in 4.2.2 on query execution time was evaluated using 3 sample queries:

Q1 X-induced Y
Q2 X causes Y
Q3 X is caused by Y

Additionally, one meaningless query (Q4) was created that uses a larger number of self-joins. The queries were executed on two different databases: D1, containing 323 004 actual dependencies, and D2, containing 1 000 000 randomly generated entries, using the command line tool of SQLite. The queries have been slightly modified to match the random nature of the data in D214. Since the creation of D1 using the stanford_pos_to_db module involves other processing, such as the extraction of dependencies from spaCy objects, we only take note of the different creation times for D2 with and without indices. As table 5.6 below shows, adding new entries into the database took about 5 times longer when using indices. However, the database, once created, is not expected to change frequently. Thus these numbers have little relevance compared to the increase in query speed displayed in tables 5.7 and 5.8.
As these tables show, querying time can be reduced by a factor of about 1.98 to 12.93, depending on the number of self-joins.

indexing         total creation time  time per entry
without indices  82.368s              0.0824 ms
with indices     401.685s             0.402 ms

Table 5.6: Table and entry creation speeds with and without indexing.

query  self-joins in query  without index  with index
Q1     0                    17ms           17ms
Q2     1                    29ms           21ms
Q3     2                    194ms          15ms
Q4     5                    1549ms         781ms

Table 5.7: Querying times for D1.

query  self-joins in query  without index  with index
Q1     0                    1588ms         1579ms
Q2     1                    4641ms         14811ms
Q3     2                    53281ms        19553ms

Table 5.8: Querying times for D2.

The execution of Q4 on D2 was interrupted after 600s, and is thus not listed in table 5.8. Note that in table 5.8, Q2 takes about 3.19 times longer on the indexed D2 than on D2 without the index. We attribute this to the random nature of the data, and to the fact that the index cannot fully unfold its potential for queries that contain only one self-join.

14 The materials and data used to generate the numbers described in this section can be found in the data/db_indices directory that accompanies this thesis.

5.2 Processing PubMed

As Section 1.4 explains, processing the entire PubMed database of over 25 million articles is considered the ultimate goal of our research. In this section, we thus attempt to give an estimate of the time it would take to process PubMed in its entirety using the approaches described in this dissertation. The test set used for this evaluation, as well as other intermediary results, can be found in the data/pubmed_projection directory that accompanies this thesis.

5.2.1 Test Set

We selected a random set of 1000 article abstracts from PubMed. The test set has an average document length of 828 characters.

5.2.2 Timing

We measured the processing time for the individual stages using the Unix time command (taking the sum of the user and system values).
This means that the times noted below are in terms of processor time for a single core15, and do not take into account that this task can be easily parallelized.

5.2.3 Downloading PubMed

As described in Section 2.5, there are several ways to access PubMed: Downloading the complete PubMed dump published on a yearly basis is certainly the most efficient, but it needs to be updated to include more recent publications. Because of this variability, we do not include the time it takes to prepare the PubMed article abstracts in our calculations.

15 1.8 GHz Intel Core i7

5.2.4 Tagging and Parsing

We used the Stanford POS tagger as described in Chapter 3 with the english-left3words-distsim.tagger model for POS tagging, and the stanford_pos_to_db.py module as described in Section 4.2.1 for database conversion.

5.2.5 Running Queries

The queries for the ACTIVE, HYPHEN and DEVELOP categories as described in Section 4.4 are executed using the browse_db module with the --ids all argument. Note that the queries written as part of this thesis do not cover all possible relations, and that they are specific to the CDR task. In order to generalize, we thus make the following assumptions:

• Query sets for applications other than CDR behave similarly in terms of execution time.
• The execution time of a query set is proportional to the number of relations it is intended to find.

Based on these assumptions, we note that the queries written as part of this thesis aim to cover the ACTIVE and HYPHEN categories completely, and achieve 23.3% recall on the DEVELOP category, thus covering about 42.374% of all relations. We thus note an extrapolated processing time (running queries* in table 5.9), which multiplies the actual processing time by a factor of 2.36.

5.2.6 Results

Table 5.9 below lists measured and estimated processing times.
While the total projected time is an estimate that relies on many assumptions, it also shows how the systems presented in this thesis are indeed capable of processing the entire PubMed in reasonable time, given appropriate infrastructure.

step                 measured time  projected for PubMed
POS tagging          23.832s        595 800s (6 days, 22h)
database conversion  48.882s        1 222 100s (14 days, 3h)
running queries      30.584s        764 600s (8 days, 20h)
running queries*     72.176s        1 804 401s (20 days, 21h)
TOTAL                144.89s        3 622 301s (41 days, 22h)

Table 5.9: Estimated processing time for the entire PubMed.

5.3 Summary

5.3.1 epythemeus

While the performance of the epythemeus system is inferior to current state-of-the-art systems, our evaluation points at the validity of our approach. We identify several key factors that could unlock gains in performance, such as the inclusion of NER and lemmatization information, the employment of a more suitable evaluation corpus, and the further development of queries. Changing the database to store such information would, however, not compromise its independence. Lemmata and named entity information could be provided by the python-ontogene pipeline, as well as by other systems. This information could directly be used by fragments and queries alike to improve precision, without necessitating further development of the system. More complex means to improve the performance of the epythemeus system could include the pruning of dependency trees as suggested by [5]. This could make queries more robust to variations in parsing. While the manually categorized sentences proved very useful both for query development and evaluation, the gold standard against which the queries were evaluated could have been improved. Partially, this is an extension of a shortcoming of the original BioCreative V corpus, in which relations are not annotated on a mention level, but rather on a document basis.
This prompted the need for an error-prone extraction process, and led to lower precision in the evaluation. Given that the epythemeus system is not specific to chemical-disease relation extraction, however, other corpora could be used to obtain more reliable results.

5.3.2 Processing PubMed

As table 5.9 indicates, we estimate a total of almost 42 days of processing time to process PubMed and run a hypothetical set of queries to extract relations. This estimate assumes that query processing time is linear in database size. While such a number may seem daunting, recall that this measure is in terms of processing time for a single core, and that the tests were performed on a general-use home machine. Using a dedicated infrastructure with several cores, the goal of processing PubMed seems to be within reach.

Chapter 6 Conclusion

In this thesis, we explored efficient rule-based relation extraction, and presented a set of systems as well as a novel way to facilitate the process of generating hand-written rules. We recapitulate our contributions briefly in section 6.1. Special attention is devoted to processing speed: The final objective of this research is the extraction of relations from the entire set of 25 million article abstracts that PubMed contains. This has not been possible so far, but our results put such an endeavor within reach. In this short chapter, we conclude by assessing shortcomings and highlighting potential for future research.

6.1 Our Contributions

In order to extract biomedical relations from unstructured text, three systems are used:

1. The python-ontogene pipeline
2. The combination of the Stanford POS tagger and spaCy
3. The epythemeus system

6.1.1 python-ontogene

The python-ontogene pipeline revolves around a custom Article class, which is well-suited to store biomedical articles in memory at various stages of processing. Special care was taken to keep this class flexible for various applications.
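A minimal sketch of what such a container class might look like (the attribute names here are illustrative, not the actual python-ontogene API):

```python
from dataclasses import dataclass, field

@dataclass
class Article:
    """Illustrative in-memory container for an article at various
    processing stages (raw text, tokens, POS tags, entities)."""
    article_id: str
    text: str = ""
    tokens: list = field(default_factory=list)    # filled by tokenization
    pos_tags: list = field(default_factory=list)  # filled by POS tagging
    entities: list = field(default_factory=list)  # filled by dictionary-based NER

# modules pass Article objects along in memory instead of writing intermediate files
article = Article(article_id="10091616", text="Worsening of levodopa-induced dyskinesias")
article.tokens = article.text.split()
print(len(article.tokens))  # 4
```

Keeping every stage's output on one object is what allows later modules to be swapped without changing the interchange format.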
The pipeline currently uses NLTK to provide tokenization and POS tagging, but was developed with modularity in mind, allowing the NLTK library to be replaced by other tokenizers and POS taggers. A dictionary-based named entity recognition is used to extract named entities. By avoiding file-based communication between modules, the python-ontogene pipeline outperforms existing systems by far in terms of speed, while maintaining comparable levels of accuracy.

6.1.2 Parser Evaluation

To our knowledge, the spaCy parser included in our evaluation has not previously been the subject of a scientific evaluation. We evaluated it together with three state-of-the-art parsers in terms of accuracy and speed. spaCy by far outperforms the other parsers in terms of speed, but does not yield satisfactory accuracy. We show how this shortcoming can be overcome by using the spaCy parser in conjunction with the Stanford POS tagger.

6.1.3 epythemeus

The epythemeus system builds on the work described above, but is an independent system: it takes Stanford-POS-tagged files as input, dependency-parses them using spaCy, and saves the results in a database; other approaches can, however, be used to populate the database. The database can then be queried using manually created rules, either interactively or programmatically.

6.1.4 Fragments

The main contribution of the epythemeus system, however, lies in a new approach to phrasing rules and turning them into executable queries. A special shorthand notation has been developed for so-called fragments, which represent the building blocks of rules. These fragments can be programmatically combined to create a set of queries that generalize well, which greatly aids the development of rules. The fragments are converted into SQL queries, allowing the concept of fragments to be useful for other systems as well.

6.1.5 Corpus

In order to develop queries and to evaluate them, a set of over 1000 sentences containing chemical-disease relations has been manually categorized
according to the structure that points to the relation. We hope that this categorization can be useful in similar research.

6.2 Future Work

6.2.1 Improving spaCy POS tagging

While using the spaCy parser and the Stanford POS tagger together yields good results, the switching of environments (python3 and Java) considerably slows down processing. Given spaCy's ability to train POS tagging models, its own POS tagger could be improved. In particular, training spaCy's POS tagger on the output of the Stanford POS tagger would allow spaCy to deliver high-quality parses while forgoing the need to leave the python3 environment.

6.2.2 Integration of spaCy and python-ontogene

The development of python-ontogene preceded the evaluation of parsers described in Chapter 3. Using the spaCy library in the fashion described above would allow it to be integrated easily into the pipeline for POS tagging. Building on that, a mapping between the spaCy objects containing dependency parses and the above-mentioned Article objects would allow the python-ontogene pipeline to also include dependency parsing, again forgoing the need for file-based communication between modules and repeated parsing.

6.2.3 Improvements for epythemeus

While the performance of the epythemeus system largely depends on the quality of the queries, the system itself has two shortcomings: since the database stores neither lemmatization nor named entity information, precision cannot easily be improved. Named entity information in particular would allow queries to be more robust, and would yield much more satisfactory results. Again, the integration of the systems would alleviate this problem.

6.2.4 Evaluation Methods

The test set used for the evaluation described in Chapter 4 suffers from errors in the software that generated it. While this does not jeopardize the quality of the epythemeus system, a more reliable evaluation could be performed.
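To make the fragment mechanism of Section 6.1.4 and the schema extension proposed in Section 6.2.3 more concrete, the following is a minimal sketch. The table layout, the `fragment_to_sql` function, and the `entity_type` column are illustrative assumptions, not the actual epythemeus implementation or its shorthand notation:

```python
import sqlite3

# Hypothetical sketch: dependency parses stored in a `tokens` table, and a
# single dependency-edge "fragment" compiled into an SQL query. The extra
# entity_type column corresponds to the extension proposed in Section 6.2.3.

def fragment_to_sql(deprel: str, head_text: str, entity_type: str) -> str:
    """Compile one dependency-edge fragment into an SQL query.
    (Parameter binding is omitted for brevity in this sketch.)"""
    return (
        "SELECT t.sentence_id, t.text FROM tokens t "
        "JOIN tokens h ON t.sentence_id = h.sentence_id "
        "AND t.head_id = h.token_id "
        f"WHERE t.deprel = '{deprel}' AND h.text = '{head_text}' "
        f"AND t.entity_type = '{entity_type}'"
    )

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tokens "
             "(sentence_id, token_id, text, head_id, deprel, entity_type)")
conn.executemany("INSERT INTO tokens VALUES (?,?,?,?,?,?)", [
    (1, 1, "aspirin", 2, "nsubj", "chemical"),
    (1, 2, "induces", 0, "ROOT", None),
    (1, 3, "asthma", 2, "dobj", "disease"),
])
# Find chemicals that are the syntactic subject of "induces":
rows = conn.execute(fragment_to_sql("nsubj", "induces", "chemical")).fetchall()
print(rows)  # [(1, 'aspirin')]
```

Because each fragment compiles to ordinary SQL, fragments for the two arguments of a relation can be combined with a join on the sentence identifier, which is what makes the approach portable to other systems that store parses in a relational database.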
6.3 Processing PubMed

As we explain in Section 5.3.2, the ultimate goal of processing the entire PubMed is put within reach, owing to the special attention we paid to efficiency when developing the aforementioned systems.

Bibliography

[1] Charu C Aggarwal and ChengXiang Zhai. Mining text data. Springer Science & Business Media, 2012.
[2] Sophia Ananiadou, Sampo Pyysalo, Jun'ichi Tsujii, and Douglas B Kell. Event extraction for systems biology by text mining the literature. Trends in Biotechnology, 28(7):381–390, 2010.
[3] Stefan Behnel, Robert Bradshaw, Craig Citro, Lisandro Dalcin, Dag Sverre Seljebotn, and Kurt Smith. Cython: The best of both worlds. Computing in Science & Engineering, 13(2):31–39, 2011.
[4] Sabine Buchholz and Erwin Marsi. CoNLL-X shared task on multilingual dependency parsing. In Proceedings of the Tenth Conference on Computational Natural Language Learning, pages 149–164. Association for Computational Linguistics, 2006.
[5] Ekaterina Buyko, Erik Faessler, Joachim Wermter, and Udo Hahn. Event extraction from trimmed dependency graphs. In Proceedings of the Workshop on Current Trends in Biomedical Natural Language Processing: Shared Task, pages 19–27. Association for Computational Linguistics, 2009.
[6] Eugene Charniak. A maximum-entropy-inspired parser. In Proceedings of the 1st North American Chapter of the Association for Computational Linguistics Conference, pages 132–139. Association for Computational Linguistics, 2000.
[7] Jinho D Choi and Martha Palmer. Guidelines for the Clear style constituent to dependency conversion. Technical Report 01-12, University of Colorado at Boulder, 2012.
[8] Donald C Comeau, Rezarta Islamaj Doğan, Paolo Ciccarese, Kevin Bretonnel Cohen, Martin Krallinger, Florian Leitner, Zhiyong Lu, Yifan Peng, Fabio Rinaldi, Manabu Torii, et al. BioC: a minimalist approach to interoperability for biomedical text processing. Database, 2013:bat064, 2013.
[9] Marie-Catherine De Marneffe and Christopher D Manning. Stanford typed dependencies manual. Technical report, Stanford University, 2008.
[10] George R Doddington, Alexis Mitchell, Mark A Przybocki, Lance A Ramshaw, Stephanie Strassel, and Ralph M Weischedel. The automatic content extraction (ACE) program: tasks, data, and evaluation. In LREC, volume 2, page 1, 2004.
[11] Tilia Renate Ellendorff, Adrian van der Lek, Lenz Furrer, and Fabio Rinaldi. A combined resource of biomedical terminology and its statistics. In Proceedings of the Conference Terminology and Artificial Intelligence, Granada, Spain, 2015.
[12] Ken-ichiro Fukuda, Tatsuhiko Tsunoda, Ayuchi Tamura, Toshihisa Takagi, et al. Toward information extraction: identifying protein names from biological papers. In Pac Symp Biocomput, volume 707, pages 707–718. Citeseer, 1998.
[13] Ralph Grishman and Beth Sundheim. Message Understanding Conference-6: A brief history. In COLING, volume 96, pages 466–471, 1996.
[14] Jörg Hakenberg, Steffen Bickel, Conrad Plake, Ulf Brefeld, Hagen Zahn, Lukas Faulstich, Ulf Leser, and Tobias Scheffer. Systematic feature evaluation for gene name recognition. BMC Bioinformatics, 6(1):1, 2005.
[15] Lynette Hirschman, Alexander Yeh, Christian Blaschke, and Alfonso Valencia. Overview of BioCreative: critical assessment of information extraction for biology. BMC Bioinformatics, 6(Suppl 1):S1, 2005.
[16] Lawrence Hunter and K Bretonnel Cohen. Biomedical language processing: what's beyond PubMed? Molecular Cell, 21(5):589–594, 2006.
[17] Jin-Dong Kim, Tomoko Ohta, Sampo Pyysalo, Yoshinobu Kano, and Jun'ichi Tsujii. Overview of BioNLP'09 shared task on event extraction. In Proceedings of the Workshop on Current Trends in Biomedical Natural Language Processing: Shared Task, pages 1–9. Association for Computational Linguistics, 2009.
[18] Jin-Dong Kim, Sampo Pyysalo, Tomoko Ohta, Robert Bossy, Ngan Nguyen, and Jun'ichi Tsujii.
Overview of BioNLP Shared Task 2011. In Proceedings of the BioNLP Shared Task 2011 Workshop, pages 1–6. Association for Computational Linguistics, 2011.
[19] Dan Klein and Christopher D Manning. Accurate unlexicalized parsing. In Proceedings of the 41st Annual Meeting on Association for Computational Linguistics – Volume 1, pages 423–430. Association for Computational Linguistics, 2003.
[20] Lingpeng Kong and Noah A Smith. An empirical comparison of parsing methods for Stanford dependencies. arXiv preprint arXiv:1404.4314, 2014.
[21] Michael Krauthammer and Goran Nenadic. Term identification in the biomedical literature. Journal of Biomedical Informatics, 37(6):512–526, 2004.
[22] Ulf Leser and Jörg Hakenberg. What makes a gene name? Named entity recognition in the biomedical literature. Briefings in Bioinformatics, 6(4):357–369, 2005.
[23] Mariana Neves. An analysis on the entity annotations in biological corpora. F1000Research, 3, 2014.
[24] Joakim Nivre. An efficient algorithm for projective dependency parsing. In Proceedings of the 8th International Workshop on Parsing Technologies (IWPT). Citeseer, 2003.
[25] Slav Petrov, Dipanjan Das, and Ryan McDonald. A universal part-of-speech tagset. arXiv preprint arXiv:1104.2086, 2011.
[26] Longhua Qian and Guodong Zhou. Tree kernel-based protein–protein interaction extraction from biomedical literature. Journal of Biomedical Informatics, 45(3):535–543, 2012.
[27] W Scott Richardson, Mark C Wilson, Jim Nishikawa, and Robert S Hayward. The well-built clinical question: a key to evidence-based decisions. ACP J Club, 123(3):A12–3, 1995.
[28] Fabio Rinaldi, Thomas Kappeler, Kaarel Kaljurand, Gerold Schneider, Manfred Klenner, Simon Clematide, Michael Hess, Jean-Marc Von Allmen, Pierre Parisot, Martin Romacker, et al. OntoGene in BioCreative II. Genome Biology, 9(Suppl 2):S13, 2008.
[29] Fabio Rinaldi, Gerold Schneider, and Simon Clematide. Relation mining experiments in the pharmacogenomics domain.
Journal of Biomedical Informatics, 45(5):851–861, 2012.
[30] Fabio Rinaldi, Gerold Schneider, Kaarel Kaljurand, Simon Clematide, Therese Vachon, and Martin Romacker. OntoGene in BioCreative II.5. IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB), 7(3):472–480, 2010.
[31] Fabio Rinaldi, Gerold Schneider, Kaarel Kaljurand, Michael Hess, and Martin Romacker. An environment for relation mining over richly annotated corpora: the case of GENIA. BMC Bioinformatics, 7(Suppl 3):S3, 2006.
[32] Isabel Segura Bedmar, Paloma Martínez, and María Herrero Zazo. SemEval-2013 Task 9: Extraction of drug-drug interactions from biomedical texts (DDIExtraction 2013). Association for Computational Linguistics, 2013.
[33] Matthew S Simpson and Dina Demner-Fushman. Biomedical text mining: A survey of recent progress. In Mining Text Data, pages 465–517. Springer, 2012.
[34] Larry Smith, Lorraine K Tanabe, Rie Johnson nee Ando, Cheng-Ju Kuo, I-Fang Chung, Chun-Nan Hsu, Yu-Shi Lin, Roman Klinger, Christoph M Friedrich, Kuzman Ganchev, et al. Overview of BioCreative II gene mention recognition. Genome Biology, 9(Suppl 2):1–19, 2008.
[35] Don R Swanson. Complementary structures in disjoint science literatures. In Proceedings of the 14th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 280–289. ACM, 1991.
[36] Erik F Tjong Kim Sang and Fien De Meulder. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003 – Volume 4, pages 142–147. Association for Computational Linguistics, 2003.
[37] Tuangthong Wattarujeekrit, Parantu K Shah, and Nigel Collier. PASBio: predicate-argument structures for event extraction in molecular biology. BMC Bioinformatics, 5(1):155, 2004.
[38] Chih-Hsuan Wei, Hung-Yu Kao, and Zhiyong Lu. PubTator: a web-based text mining tool for assisting biocuration.
Nucleic Acids Research, 41, July 2013.
[39] Chih-Hsuan Wei, Yifan Peng, Robert Leaman, Allan Peter Davis, Carolyn J Mattingly, Jiao Li, Thomas C Wiegers, and Zhiyong Lu. Overview of the BioCreative V chemical disease relation (CDR) task. In Proceedings of the Fifth BioCreative Challenge Evaluation Workshop, Sevilla, Spain, 2015.
[40] W John Wilbur, Andrey Rzhetsky, and Hagit Shatkay. New directions in biomedical text annotation: definitions, guidelines and corpus construction. BMC Bioinformatics, 7(1):1, 2006.
[41] Alexander Yeh, Alexander Morgan, Marc Colosimo, and Lynette Hirschman. BioCreative task 1A: gene mention finding evaluation. BMC Bioinformatics, 6(Suppl 1):S2, 2005.
[42] Pierre Zweigenbaum, Dina Demner-Fushman, Hong Yu, and Kevin B Cohen. Frontiers of biomedical text mining: current progress. Briefings in Bioinformatics, 8(5):358–375, 2007.