How to make a natural language interface to query databases... An example ⁎ Miguel Llopis
Transcription
How to make a natural language interface to query databases... An example ⁎ Miguel Llopis
Computer Standards & Interfaces 35 (2013) 470–481 Contents lists available at SciVerse ScienceDirect Computer Standards & Interfaces journal homepage: www.elsevier.com/locate/csi How to make a natural language interface to query databases accessible to everyone: An example Miguel Llopis ⁎, Antonio Ferrández Dept. Languages and Information Systems, University of Alicante, Spain a r t i c l e i n f o Available online 12 October 2012 Keywords: Natural language interface Relational database Ontology extraction Concept hierarchy Query-authoring services a b s t r a c t Natural Language Interfaces to Query Databases (NLIDBs) have been an active research field since the 1960s. However, they have not been widely adopted. This article explores some of the biggest challenges and approaches for building NLIDBs and proposes techniques to reduce implementation and adoption costs. The article describes {AskMe*}, a new system that leverages some of these approaches and adds an innovative feature: query-authoring services, which lower the entry barrier for end users. Advantages of these approaches are proven with experimentation. Results confirm that, even when {AskMe*} is automatically reconfigurable against multiple domains, its accuracy is comparable to domain-specific NLIDBs. © 2012 Elsevier B.V. All rights reserved. 1. Introduction 2. Classification of existing NLIDBs A natural language interface to query databases (NLIDB) is a system that allows users to access information stored in a database by means of typing requests expressed in some natural language [1,2,22], such as English, Spanish, etc. NLIDBs have been a field of investigation since 1960s [2]. There have been many interesting theories and approaches about how an NLIDB could be built, in order to improve their accuracy [3], how to make them more open in terms of the natural language expressions that they accept [4,16], or even more, how to make them guess the real intend of the user who is trying to construct a query where some pieces are missing [21], etc. We will analyze these approaches in this article. While the research work on NLIDBs has led to many different systems being implemented in academic and research environments (e.g. [2–5,8,9,17–19]), it is difficult to find many of these systems being used in business environments or being commercialized in companies expanding across various market segments or domain niches [22]. In this article, we will explore previous NLIDB systems and classify them based on the different approaches that they implement. At the same time, we will explain which of these approaches lead to reduced costs at different stages of the NLIDB lifecycle. Finally, we will look at how we have implemented our proposals to minimize implementation, configuration, portability and learning costs, by analyzing the implementation of {AskMe*}, an ongoing NLIDB research work. As we outlined in the previous section, there have been many different approaches to the construction of NLIDBs. There are various ways for classifying them. In this article, we will explore two of the most common taxonomies for classification of NLIDBs which appear across various overview articles about the NLIDB field (e.g. [2,22]) and is also complemented by our own research observations: - Based on user interface: textual NLIDBs vs. graphical NLIDBs. - Based on domain-dependency: domain-dependent vs. domainindependent NLIDBs. ο As part of the previous classification, we will divide these NLIDBs in subcategories, based on their degree of portability and reconfiguration capabilities. This particular classification is not something that we have found on previous work per se, but rather a pattern that we have extracted based on the characteristics of systems that we have analyzed and previous research papers on the field of NLIDBs that we have taken into account for our work. In the next sections, we will explore the idiosyncrasies of each of these approaches. It is important to emphasize that we do not claim one of these approaches to be better than others, as each of the approaches has its advantages and disadvantages [2,22]. However, we will evaluate the convenience of each of these approaches in regards to the main goal of our research work: optimize costs of NLIDBs. 2.1. NLIDBs by their user interface: textual interfaces vs. graphical interfaces ⁎ Corresponding author. E-mail addresses: [email protected] (M. Llopis), [email protected] (A. Ferrández). 0920-5489/$ – see front matter © 2012 Elsevier B.V. All rights reserved. http://dx.doi.org/10.1016/j.csi.2012.09.005 One of the biggest questions in the space of NLIDBs through decades has been the disjunctive of choosing a textual or a graphical user M. Llopis, A. Ferrández / Computer Standards & Interfaces 35 (2013) 470–481 interface to build the system. Each of these two alternatives has its own advantages and disadvantages that are worth considering, as described in [2,22]: - Textual NLIDBs: examples of this type of NLIDB are HEY [18], AT&T [19], LUNAR [2,24] or PRECISE [3]. ο Advantage: ■ User is not required to learn any additional language. ο Disadvantages: ■ Linguistic coverage of the system is not obvious. ■ Overlap of linguistic and conceptual failures. - Graphical NLIDBs: an example of this type of NLIDBs is NL-Menu [30]. ο Advantage: ■ Easy to dynamically constrain query formulation based on user selections, in order to only build valid queries. ο Disadvantages: ■ Lack of flexibility in query formulation. ■ Expressivity power reduced to the user interface design (less expressivity power than a textual natural language). While most of the NLIDBs built in the past can be classified in one of these two categories, there is an intermediate option between both which consists on combining the expressivity power of a textual NLIDB with the visual feedback to the user provided by a graphical NLIDB as we presented in our previous work [23]. This can be achieved by including query authoring services such as syntax coloring, text completions or keyword highlighting as part of the system design; our proposal is the first NLIDB that incorporates these features to the design of the system, to our best knowledge. Moreover, {AskMe*} helps the user to make valid queries by automatically distinguishing between linguistic and conceptual failures. 2.2. NLIDBs by their degree of portability and re-configurability: domain-dependent vs. domain-independent NLIDBs A second taxonomy in NLIDB classification can be made by considering the different approaches for how an NLIDB relates to the knowledge domain of the database that is being queried. - Domain-dependent NLIDBs: These NLIDBs need to “know” particularities about the underlying domain entities and restrictions in order to work. ο Non-Reconfigurable: Many of the NLIDBs in this group are designed ad-hoc for a particular problem domain (database). An example on this category is LUNAR [2,24]. ο Reconfigurable: Another group of NLIDBs are domain-dependent but can be reconfigured towards being used to query a database that belongs to a different domain. In most cases, this reconfiguration consists on remapping domain entities and terms from the DB in the query DSL.1 This, very often, requires the intervention of a technical user in order to perform these adjustments. Examples in this category include AT&T [19] and ASK [2,22]. ο Auto-reconfigurable: This bucket is the most interesting from a cost-saving perspective [23,30], as it will allow NLIDBs that are knowledgeable about the underlying domain data (and therefore, they can provide more accurate information, error messages, etc.) and at the same time it enables non-technical users to connect to multiple databases without the need for manual reconfiguration. The system knows who (the database connection string) and what (entities, properties, data types, etc. generally captured in an underlying source of knowledge, such as ontologies) to ask in order to learn how to deal with the user queries. Examples of this category include HEY [18], GINLIDB [29], FREyA [28] and an NLIDB for CINDI virtual library [4]. 1 Domain Specific Language 471 - Domain-independent NLIDBs: There are many other NLIDBs that allow the user to write queries in a natural language and that do not store any knowledge about the underlying domain; they simply translate NL queries into SQL queries and execute them against the underlying database [22]. Since the system does not know anything about the domain, it is not able to warn the user about conceptual errors in the query (entity–property mismatch, data type mismatch, etc.) and therefore the error-catching will happen in the database, thus, making the system slower and less user-friendly when the query is ill-formed. An example of NLIDB system in this category is PRECISE [3]. The problem of portability of NLIDBs is, from our perspective, one of the most critical ones to be solved. By itself, the cost of developing an NLIDB can be very high, and in most of the approaches taken for creating NLIDBs, the resulting systems are tightly coupled to the underlying databases [22]. In the last few years, there have been interesting approaches to the design of NLIDBs that are database-independent (e.g. [3,4]), in the sense that they can cope effectively with queries targeting different domains without requiring substantial reconfiguration efforts. One of the best examples of this approach is PRECISE [3]. This system combines the latest advances in statistical parsers with a new concept of semantic tractability. This approach allows PRECISE to easily become highly reconfigurable. In addition, this was one of the first NLIDB systems that used the parser as a plug-in, so it could be changed with relative ease in order to leverage newest advantages in the parsers' space. An interesting advantage of adapting the parsing process to each of the knowledge domains that the system connects to is that analyzing an input question in NLIDB systems is often based on a part-of-speech (POS) tagging, followed by a syntactic analysis (partial or full) and finally, a more or less precise semantic interpretation. Although there are broadly accepted techniques for POS tagging (e.g. [5–7]) and syntactic analysis (e.g. [6]), techniques for semantic parsing are still very diverse and ad hoc. In an open-domain situation, where the user can ask questions on any topic, this task is often very difficult and relies mainly on lexical semantics only. However, when the domain is limited (as is the case of an NLIDB), the interpretation of a question becomes easier as the space of possible meanings is smaller, and specific templates can be used [8]. It has been demonstrated [9] that meta-knowledge of the database, namely the schema of the database, can be used as an additional resource to better interpret the question in a limited domain. Another interesting existing solution, based on the creation of a new NLIDB every time that the system is connected to a new database, is the system developed for the CINDI virtual library [4], which is based in the use of semantic templates. The input sentences are syntactically parsed using the Link Grammar Parser [10], and semantically parsed through the use of domain-specific templates. The system is composed of a pre-processor and a run-time module. The pre-processor builds a conceptual knowledge base from the database schema using WordNet [13]. This knowledge base is then used at run-time to semantically parse the input and create the corresponding SQL query. The system is meant to be domain independent and has been tested with the CINDI database that contains information on a virtual library. The improvements that our research work in {AskMe*} provides in regards to the portability problem space are described in the next sections. 3. Most significant costs in NLIDBs Building an NLIDB system and bringing it into production has a significant cost [2,22,27,28,30]. This cost can be analyzed and divided across the different stages of the NLIDB lifecycle: system implementation, deployment and configuration, and finally system users' adoption. - System implementation: Creating an NLIDB is not a trivial task; it represents an engineering effort that must be taken into account when 472 M. Llopis, A. Ferrández / Computer Standards & Interfaces 35 (2013) 470–481 considering the creation of an NLIDB [2]. During the implementation phase, and independent from the planning methodology being used, the costs can be mostly divided in the following three categories: ο Design: The design of the system has an expensive cost; it is in this phase when various decisions must be taken: whether the system is designed to be domain-dependent or independent, what the different modules should look like—lexer, syntactic and semantic parsers, translation to SQL, etc. This design phase might require weeks, even months, of engineering and architectural work [22]. ο Development: Even when there are tools and frameworks to assist in the process, creating a natural language interface is a laborious task. Being able to provide high expressivity power while also processing queries efficiently is hard [2,28]. ο Testing: In order to create a system than can be reliable, efficient and error-free, it is important to significantly invest in testing it: unit testing independent modules of the system, verifying the robustness and validity of the system when integrating various pieces or validating the usability or the system in end-to-end scenarios or queries are just some of the testing activities that must be done in this phase [2]. - Deployment and configuration: This phase comprises the different activities required to deploy and adapt the system, once it has been fully implemented and tested, for the real use in a concrete enterprise. It includes, among others, the following tasks: deploying system components, configuring connections between components, connecting the system to the domain database, mapping database entities to system keywords, training the system to understand users' expressions, and ensuring robustness and high-availability of the deployed system [22]. - Users' learning process: Last but certainly not the least, once the system has been deployed to an enterprise environment, it has to be accepted and understood by end-users. This is not a trivial process, in fact, making a system easy to understand, learn and use for the target user must be considered as the most important principle from the design stage and across all the other phases: the most complete and sophisticated NLIDB is worthless if users are not happy and satisfied while using and interacting with it, or even more, if they reject using it because they do not like it. Thus, the learning process must be made smooth and compelling for the users, and this implies that a few different factors must be taken into account: users' learning curve for NLI constructions, database entities and relationships, etc. may be slow without the help from the system; users' learning curve for system graphical user interface may require additional learning effort, users need to be trained in order to be able to troubleshoot the most frequent system errors by themselves (connectivity issues, users access, etc.) [27]. It is due to all the costs enumerated previously that we believe that an NLIDB, in order to be successfully and widely adopted in the real-world enterprise has to be designed once, being portable and able to target different databases and knowledge domains, can be reconfigured easily in order to connect to a different database without the need of specialized deployment or reconfiguration steps that end-users cannot understand and, finally, must allow users to be productive using the system since the first day of use, while implementing a mechanism for letting users learn more advanced concepts of the system as they use it. 4. Contributions of our approach compared to previous related work The main improvements of our proposal compared to other existing systems are the significant reduction of costs: implementation and reconfiguration costs are optimized due to the dynamic nature of the system; and learning costs for end users are greatly reduced as well thanks to the use of query-authoring services. Some other NLIDB systems developed in the past few years include GINLIDB [29], WASP [22,25] and NALIX [22,26]. GINLIDB represents an interesting attempt at creating a fully auto-reconfigurable or “generic interactive” (as the “G” and “I” letters in the acronym stand for) approach to the creation of NLIDBs. This system has been an inspiration for the work developed in {AskMe*}, however our system tries to go a step beyond what GINLIDB accomplished in auto-reconfiguration of the NLIDB; while GINLIDB lets the user define custom mappings between words in their input queries and actual database entities by means of graphical menus that are displayed after a query with errors or ambiguities has been introduced by the user, {AskMe*} attempts to provide richer query-authoring services, which are aimed at helping users to easily learn how to ask questions in a new domain, by providing query suggestions, error highlighting and domain-specific error descriptions, as we will describe later. WASP (Word Alignment-based Semantic Parsing) is a system developed at the University of Texas by Yuk Wah Wong [25]. While the system is designed to address the broader goal of constructing “a complete, formal, symbolic, meaningful representation of a natural language sentence”, it can also be applied to the NLIDB domain. A predicate logic (Prolog) was used as the formal query language. WASP learns to build a semantic parser given a corpus a set of natural language sentences annotated with their correct formal query languages. The strength of WASP comes from the ability to build a semantic parser from annotated corpora. This approach is beneficial because it uses statistical machine translation with minimal supervision. Therefore, the system does not have to manually develop a grammar in different domains. In spite of the strength, WASP also has two weaknesses. The first is: the system is based solely on the analysis of a sentence and its possible query translation, and the database part is therefore left untouched. There is a lot of information that can be extracted from a database, such as the lexical notation, the structure, and the relations within. Not using this knowledge prevents WASP to achieve better performances, and this is an approach that {AskMe*} tries to improve as we will see later. The second problem is that the system requires a large amount of annotated corpora before it can be used, and building such corpora requires a large amount of work [22]. NALIX is a “Natural Language Interface for an XML Database” [26]. The database used for this system is extensible markup language (XML) database with Schema-Free XQuery as the database query language. Schema-Free XQuery is a query language designed mainly for retrieving information in XML. The idea is to use keyword search for databases. However, pure keyword search certainly cannot be applied. Therefore, some richer query mechanisms are added [26]. Given a collection of keywords, each keyword has several candidate XML elements to relate. All of these candidates are added to MQF (Meaningful Query Focus), which will automatically find all the relations between these elements. The main advantage of Schema-Free XQuery is that it is not necessary to map a query into the exact database schema, since it will automatically find all the relations given certain keywords. In NALIX the transformation processes are done in three steps: generating a parse tree, validating the parse tree, and translating the parse tree to an XQuery expression. This approach is being leveraged by our system as well, in the sense that user queries are validated by {AskMe*} before being executed against the database, thanks to the information available from the database schema, but with {AskMe*} we try to go beyond this in order to provide richer query-authoring services to the user in order to make the “writing a query” step more interactive and educational for the user, as we will describe later. One of the first natural language interfaces that provide a notion of suggestions to the user in order to author a query is OWLPath [27]. This system suggests to the user how to complete a query by combining the knowledge of two ontologies, namely, the question and the domain ontologies. The question ontology plays the role of a grammar, providing the basic syntactic structure for building sentences. The domain ontology characterizes the structure of the application-domain knowledge in terms of concepts and relationships. The system makes then suggestions M. Llopis, A. Ferrández / Computer Standards & Interfaces 35 (2013) 470–481 based on the content of the question ontology and its relationships with the domain ontology. Once the user has finished formulating the natural language query, OWLPath transforms it into a SPARQL query and issues it to the ontology repository. In the end, the results of the query are shown back to the user. This is an interesting approach in natural language interfaces to query ontologies that were published just a few months before the first publication about {AskMe*} [23]. While both systems leverage ontologies in order to provide the user suggestions in how to complete their queries, the systems are considerably different in a few aspects: OWLPath is a natural language interface to query ontologies, while {AskMe*} is a natural language interface to query databases that leverages ontology generation as a technique to capture the characteristics and semantics of the underlying database schema. In addition, while OWLPath provides query suggestions or auto-completions for terms that exist in the underlying ontology, it does not provide error information that is specific to the domain, in case that a query contains errors, which is something that {AskMe*} tries to emphasize in order to educate users and help them learn how to use the system and understand the logical model of the underlying domain. Another interesting and recent approach that inspires our work is FREyA [28], which combines syntactic parsing with the knowledge encoded in ontologies in order to reduce the customization effort. If the system fails to automatically derive an answer, it will generate clarification dialogs for the user. The user's selections are saved and used for training the system in order to improve its performance over time. While this is an interesting approach and inspires {AskMe*} in its principles, it differs from our research work in the sense that {AskMe*} focuses on helping users create valid queries from the beginning, as opposed to FREyA's approach of letting them introduce wrong queries and help the system correct them by means of clarification dialogs which are used for auto-correcting errors in the future. {AskMe*} is the first NLIDB system that proposes the combination of textual NLIDBs with rich query-authoring services (syntax coloring, error squiggles, tooltips, etc.). This provides a substantial improvement in the user experience when writing queries, especially in regards to query accuracy in order to solve both linguistic failures and conceptual failures, which could not be fully solved by the use of menu-based user interfaces either. The use of query-authoring services helps to reinforce the conceptual center of the dialog between the user and the NLIDB around the domain entities in focus. 473 In order to achieve this, we combine domain-specific information, captured in concept-hierarchy ontologies any time the system is connected to a new database. The system automatically generates the syntactic and semantic parsing templates and the rest of components needed in order to provide query-authoring services. In addition, the system is fully auto-reconfigurable without the need of any specialized knowledge. This is a significant improvement compared to the existing portable solutions mentioned before, because it makes the entire reconfiguration process fully transparent to the end users as opposed to having to perform some reconfiguration steps for entity mapping, disambiguation, etc. This is a substantial improvement not only by the amount of extra work that is saved in the reconfiguration steps, but also because it enables the system to be automatically managed without user intervention. This represents a step towards the democratization of NLIDBs, as users fitting a non-technical profile will be able to use the system on their own, throughout the entire system lifecycle, from the very early steps of adoption and deployment of the system towards a real-world production environment to the management, reconfiguration and diagnosis steps across multiple domains, for which the system is able to adapt itself automatically. In this sense, also, the role of queryauthoring services is fundamental because they enable to perform the very little manual reconfigurations needed, if any, driven by intuitive real-time hints in the query authoring process. 5. {AskMe*}: an NLIDB that reduces adoption, portability and users' learning costs {AskMe*} is a database-independent NLIDB and uses a templatebased approach for the dynamic generation of the lexer, syntactic and semantic parsers. Fig. 1 shows the different modules of the system. An exhaustive description of every component of the system is out of the scope for this paper, instead we will focus on describing the most relevant techniques that enable the proposed improvements of the system compared to other state-of-the-art systems: dynamic generation of the system and query-authoring services. In order to make this analysis easier to follow, we will use a case of study and complement the description of each of these components with the application on a given domain. We will use a sub-set of Northwind [14] (see Fig. 2), a canonical example of a relational database Fig. 1. {AskMe*}'s high level architecture. 474 M. Llopis, A. Ferrández / Computer Standards & Interfaces 35 (2013) 470–481 Fig. 2. Sub-set of Northwind database schema. that captures the domain of a fictitious trading company, containing information about products, orders, suppliers, employees, etc. 5.1. Ontology builder The first operation performed once {AskMe*} connected to a database is to search for the ontology representing that domain in the ontology repository. This repository consists on a dictionary that stores ontology references for any given tuples bServer, Database> that the system has been connected to. If the ontology for that particular domain does not exist, the ontology generation process is triggered. This process [10] analyzes the database catalog and schema in order to build the ontology that captures the domain entities, properties, relationships and constraints. {AskMe*} is using OWL for representing ontologies. In order to build the ontology for each database, and keep the system within a manageable range of data volume, only the minimal information needed from the database is stored into the ontology. Concretely, entity names, properties and value types are mapped from the database into the ontology, while the actual data is not. The reason that motivates this decision is that we are using OWL as a way to represent domain characteristics (entity names, entity properties, relationships, etc.) in the underlying database, however, the actual data is much bigger in size than the schema and also changes more often than the schema. Therefore, we decided to perform an analysis of the database schema that allows us to capture the nature of the domain, while the actual data retrieval process part of each user query execution is performed directly against the database, after validating that all domain restrictions are being satisfied by the user query, for which we leverage the domain representation captured in the OWL ontology. As a matter of fact, if a user query that complies to all domain restrictions stored in OWL is then executed against the database and the result indicates that there has been a change in the underlying domain that makes the OWL be out of date, a new XML–OWL generation process is triggered automatically, in order to refresh the domain ontology and keep it accurate at any time with the underlying database. In order to build the ontology capturing the mapping described above, {AskMe*} leverages OWLminer's approach [7] which consists on implementing the algorithm known as Feature and Relation Selection (FARS) [11]. FARS is multi relation feature selection that uses target tables and attributes in order to create join chains with other tables using foreign keys as links. The algorithm also uses Levenshtein Distance [12] as a metric for determining whether features are related or not. This metric is based on closeness between text and feature's value of the dataset. During this approximate search, every set of input texts from the set of relations and tables in the given database is analyzed. The result of this analysis is a set of attributes that meet the constraint: all members must be columns (properties) within the current database table (entity) as described in Table 1. After this first-level search has been performed for a given table, the next steps consist on finding the cross-table relationships, taxonomic and non-taxonomic relations and dependencies, in order to make the ontology grow in this dimension too. The attributes identified in the previous step are now used to analyze and discover the set of corresponding tables. As part of this process, also the primary and foreign keys are identified. The output of the feature and relation selection algorithm described in Fig. 3 is represented as an XML document where the first-level nodes in the tree represent tables. An example of the output of this algorithm can be found on Fig. 4, based on the Northwind schema described previously. Table 1 Database–ontology mapping. Database component OWL component Table/entity Column Column metadata: - Data type. - Mandatory/non-nullable. - Nullable. Class Functional property OWL property restriction: - All values from restriction. - Cardinality() restriction. - MaxCardinality() restriction. M. Llopis, A. Ferrández / Computer Standards & Interfaces 35 (2013) 470–481 475 Fig. 3. Feature and relation selection algorithm. As can be seen in Fig. 4, all entities in the previously described sample based on Northwind are being captured in a custom XML tree structure. This XML tree contains not only the entity (table) names but also columns and column types for each table, as well as primary key and foreign key information. The next step in the extraction process consists on converting this XML tree into an ontology that can be used to generate all queryauthoring services information needed. In order to achieve this, the previous XML tree needs to be converted into OWL and for that the tree is processed and each table node is converted into an OWL class in the resulting document. For this sample we will only focus on the foreign key relations for Category, Product, Supplier, Order and OrderDetail, but all other object properties would be represented in this OWL as well. In a similar way, each foreign key is expressed as an OWL object property in which two primary classes are related using domain and range attributes (see Fig. 5). By using this approach, the building process of the OWL ontology is accelerated and also the use of background knowledge helps to extract the required knowledge from the database. This approach is considerably better in cost (time and space) than simply mirroring the database schema to the ontology, based on multiple experiments as described in [11]. 5.2. Dynamic parser generation After building the ontology that captures the overall characteristics of the database domain, the next step consists on automatically building the parsers that will help understand users' queries and translate them into SQL queries to be executed against the database. As described previously, {AskMe*} is fully auto-reconfigurable and it can be pointed against multiple domains, while at the same time it is able to offer domain-specific features such as lexical, semantic and conceptual error detection. The key for these capabilities resides in the ability to perform this dynamic parser generation at all three levels: lexical, syntactic and semantic. 5.2.1. Lexicon A lexicon is formed by the set of terms that can be understood by the system; that is, the set of terms that have a special meaning in a given NLIDB. This particularly means the set of entities and properties that have been identified in the database schema. In the case of {AskMe*}, these terms are also captured in the domain ontology. In order to build the lexicon, we combine the set of nouns derived from the domain knowledge contained in the database, namely the entity and property names, with a general-knowledge vocabulary terms, mostly verbs, adjectives and adverbs. We are retrieving these general-knowledge vocabulary terms from WordNet [13], a large lexical database of English. This database classifies nouns, verbs, adjectives and adverbs into sets of cognitive synonyms. Cognitive synonyms (also named in WordNet as “synsets” [13]) are terms which belong to a different syntactic category (i.e. nouns, verbs, etc.) but represent related concepts; an example of a set of cognitive synonyms could be “approximation” (noun), “approximated” (adjective) and “approximate” (verb). Thanks to these cognitive synonyms sets, we are also able to complement the existing set of domain-specific nouns (entities and properties from the domain ontology) with an important amount of synonyms, into the system lexicon. This aspect is very important, as it will allow the lexer to automatically accept terms that, even when they are not the exact noun used in the underlying database schema, represent the same concept for the user. For example, the database may contain a property called “Telephone” for the entity “Customer”; while the user probably refers to it simply as “Phone”. The lexer is able to recognize “phone” as a valid term as well. In case the term is not in WordNet, such as “ProductID”, several heuristics are applied (e.g. by splitting a term into several terms when there is an uppercase letter in the middle of a term in lower case letters: “Product+ ID”). Finally, the user can review this lexicon in order to add or suppress synonyms (e.g. the term “Emp” is not in WordNet so the user could add synonyms such as “Employee”). Following the example of Northwind described previously (see Fig. 2), the domain-specific lexicon of nouns built from the ontology (WordNet synonyms in parentheses) is presented in Table 2. Note that it contains Entities and Properties as specialized terms, this classification is not relevant to the lexicon itself, but will be used lately for semantic analysis as we will describe. By having the dynamic lexicon generation process, {AskMe*} can implement an interesting feature such as the lexical error detection 476 M. Llopis, A. Ferrández / Computer Standards & Interfaces 35 (2013) 470–481 Fig. 4. Generated custom XML tree containing all Northwind entities in the sample. capability. Once the system has been configured, a user can start typing in a query and it will be processed by the lexer at first. Every time that a white space has been added to the buffer, the lexer analyzes the term Fig. 5. OWL capturing classes and relations from Northwind. that goes immediately before this white space and decides whether it is valid from the lexical perspective or not. If the term does not appear in the lexicon, the lexer will tag it as an invalid lexical item. This tag information is automatically retrieved by the query-authoring services component, that will underline the invalid term with red squiggles in the query bar, making it evident to the user that the underlined part is wrong in his query, even before he finishes writing it and offering tooltip information about the invalid term (Fig. 6). In the example shown in Fig. 6, a query about “projects” is being provided by the user. The system analyzes this query in real time and determines that “projects” is the entity that needs to be found in the underlying domain. In order to do this, {AskMe*} looks for this entity on the lexicon (from Table 2) and determines that it does not exist. As a result of this, the system notifies the user about the error in the query by adding red squiggles as an underline to the term that has not been found in the lexicon. When the user places the mouse on this term, a tooltip containing additional information about the error is displayed to the user. The other query-authoring service offered by {AskMe*} at lexer level is the completion suggestions mechanism, which offers in a dropdown pop-up menu, which appears below the word that the user is currently typing, the set of suggested words that contain the portion typed by the user as a fragment. This helps the user to remember the exact word that he is trying to write, and also to autocomplete it, making him write queries faster (Fig. 7). M. Llopis, A. Ferrández / Computer Standards & Interfaces 35 (2013) 470–481 477 Table 2 Northwind's lexicon with WordNet synonyms. Entities (and synonyms) Properties Product (merchandise, ware) Product ID (product identifier), product name (product denomination), supplier ID (provider identifier), category ID (type identifier, class identifier), quantity per unit, unit price (unit cost), units in stock, units in order, reorder level, discontinued. Order ID (order identifier), employee ID (worker identifier), order date (command date), required date (due date), shipped date, ship via, freight (cargo), ship name, ship address, ship city (town, municipality), ship region, ship postal code (ship zip code), ship country. Order ID (order identifier), product ID (product identifier, ware identifier, merchandise identifier), unit price (unit cost), quantity (amount), discount (deduction, reduction, allowance). Category ID (type identifier, class identifier), category name (category denomination, class name, class denomination, type denomination, type name), description (representation, information), picture (photo, photograph, image). Supplier ID (dealer identifier, provider identifier, vendor identifier), company name (business name, enterprise name), contact name (correspondent name), contact title (correspondent appellation), address (direction, domicile), city (town, municipality), region (territory, district), postal code (zip code), country (nation, state), phone (telephone), fax (facsimile), home page. Order (command) Order details Categories (types, classes) Suppliers (dealers, providers, vendors) 5.2.2. Syntactic parser {AskMe*} leverages the Link Grammar Parser [10] for the core syntactic parsing operations. The Link Grammar Parser is a syntactic parser of English, based on link grammar, an original theory of English syntax. Given a sentence, the system assigns to it a syntactic structure, which consists of a set of labeled links connecting pairs of words. The parser also produces a “constituent” representation of a sentence (showing noun phrases, verb phrases, etc.), like the one shown in Fig. 8. The parser has a dictionary of about 60,000 word forms. It has coverage of a wide variety of syntactic constructions, including many rare and idiomatic ones. The parser is robust; it is able to skip over portions of the sentence that it cannot understand, and assign some structure to the rest of the sentence. It is able to handle unknown vocabulary, and make intelligent guesses from context and spelling about the syntactic categories of unknown words. It has knowledge of capitalization, numerical expressions, and a variety of punctuation symbols. A full description of the Link Grammar is out of scope for this article, however it is noteworthy that, by using the Link Grammar API, the totality of this parser's capabilities can be leveraged in {AskMe*}, thus enabling our efforts to focus on other innovative areas such as the combination of query-authoring services within the proposed NLIDB, as well as the portability of the system. The concurrency mechanisms implemented on top of the Link Grammar Parser API are based on event notifications for all the syntactic parser events: every time the parser processes and tags a fragment of the input query an event is generated, containing information about the syntactic classification for each token. This is a key component for driving the syntactic query-authoring service that {AskMe*} implements: syntactic error squiggles (green). These squiggles warn the user about syntactic errors in a query, even before the query authoring has been fully completed (Fig. 9). 5.2.3. Semantic parser The third parsing step performed to an input query is the semantic parsing. In {AskMe*}, given its dynamic domain-specific knowledge acquisition nature, it may be feasible to find that a certain query is valid according to the lexical and syntactic analysis, but does not represent a concept that fits into the current domain. For example, the query “Name and date of the customers from the country where most orders were made in 2010” could be lexically and syntactically valid, all the terms in the sentence may be present in the dynamic lexicon, and the syntactic construction and order of words match one of the valid categories of phrases in the Link Grammar Parser. However, as you will notice, the concept of Date may not exist for the entity Customer. This is definitely an error in the input query, a semantic error. In order to detect this kind of errors, the semantic parsing step is applied to the input query. The semantic parser is guided by the use of semantic templates which are filled with the concepts captured in the domain ontology. The set of rules that are modeled by these dynamically-generated semantic templates are: - Entity–Property correspondence: This rule enforces that all the requested properties for an entity in a query are indeed part of the current domain schema. - Cross-entities relationships: This rule is applied to queries that contain multiple sub-phrases, and its purpose is to enforce that there exists a foreign-key relationship in the database schema between the entities in the query. Fig. 6. Lexical error squiggles and tooltip error information. Fig. 8. Constituent tree for the query “Suppliers that are not in United States”. Fig. 7. Completions for supplier properties starting with “Co”. Fig. 9. Syntactic error squiggles and tooltip error information. 478 M. Llopis, A. Ferrández / Computer Standards & Interfaces 35 (2013) 470–481 Table 3 Examples of semantic rules behavior. Schema relationships Input query Result Reason Customers–orders and orders–products Products from customers whose last name is Llopis. Fail Customers–orders and orders–products Products that were ordered by more than 100 customers in 2010. Success There is not an existing relationship between products and customers in this domain. Products are related to orders, and every order references a customer. 6. Evaluation Table 4 Examples of template-based semantic error messages. Inconsistency type Error description message Entity–property mismatch Missing relationship “Entity A” does not contain a property called “Property A” (where Entity A and Property A are the values in a query). “Entity A” and “Entity B” are not related to each other. - Entities' default attributes: There are cases in which the query is valid from a lexical, syntactic and semantic analysis, but it does not specify which attributes must be present in the result. For instance, the query in Table 3, “Products that were ordered by more than 100 customers in 2010”, does not specify which product properties we are interested in. This semantic rule does not invalidate a given input query, but rather imposes that the resulting SQL query must return all the product attributes that are not-null in the database schema, such as the Product ID, Product Name, Price, etc. This information, as we explained previously, was captured in the domain ontology as an OWL cardinality metadata attribute. Some examples of these rules are presented and analyzed in Table 3. In the case that one or more of these semantic requirements are not met by the input query, the semantic analysis would report errors. These errors are notified to the system in the form of events. The query-authoring services component is subscribed to these semantic events, in the same way as it is to the lexical and syntactic ones, and would therefore notify the user in a visual way about the issue, by highlighting the portions of the input query that cause the inconsistency. When the user hovers with the mouse over these highlighted regions, a tooltip containing a description of the inconsistency comes up. This description is also template-based, see Table 4. In order to evaluate the effectiveness of our approach, we are applying three different experiments: - Accuracy in query interpretation for a concrete domain. - Effectiveness of query authoring services for a concrete domain. - Portability of the system across domains. 6.1. Accuracy in query interpretation for a concrete domain The first experiment consists on evaluating the accuracy of our query interpretation process in a concrete domain. For that purpose, we evaluated our system using data from the Air Travel Information (ATIS) domain [15]. The ATIS database is based on air travel data obtained from tile Official Airline Guide (OAG) in June 1992 and current at that time. The database includes information for 46 cities and 52 airports in the US and Canada. The largest table in the expanded database, the flight table, includes information on 23,457 flights. A complete reference about the ATIS domain can be found at [15]. The selection of ATIS was motivated by three concerns. First, a large corpus of ATIS sentences already exists and is readily available. Second, ATIS provides an existing evaluation methodology, complete with independent training and test corpora, and scoring programs. Finally, evaluation on a common corpus makes it easy to compare the performance of the system with those based on different approaches. Our experiments utilized the 448 context independent questions in the ATIS “Scoring Set A”, which is one of the sets of questions of the ATIS benchmark, generally the most commonly used for the evaluation of other systems, and the one that lets us compare with most of them. {AskMe*} produced an accuracy rate of 94.8%. System accuracy rate is calculated based on the equation in Fig. 11. Fig. 10. Examples of queries from ATIS and results obtained with {AskMe*}. Fig. 11. System accuracy equation. M. Llopis, A. Ferrández / Computer Standards & Interfaces 35 (2013) 470–481 Table 5 Accuracy comparison using ATIS between various NLIDB systems. HEY [16] SRI [17] PRECISE [3] {AskMe*} MIT [18 ] AT&T [19] 92.5 93 94 94.8 95.5 96.2 Table 5 contains a comparison of the results obtained by {AskMe*} in the ATIS benchmark, to other state-of-the art systems. In some cases, as displayed in Fig. 10, some of the failures are due to domain-specific information or query shortcuts (such as “tomorrow” →Date–Time, etc.) which {AskMe*} does not support yet because other functional work was prioritized higher, such as domain-portability or query-authoring services. These results confirm that, even when {AskMe*} is a fully reconfigurable system that can be targeted to multiple knowledge domains, its accuracy results against a particular domain are very similar to the results for other state-of-the-art systems which are tailored to the underlying domain. 6.2. Effectiveness of query authoring services for a concrete domain The second experiment that we are using to evaluate our system is measuring how using query-authoring services improves the overall usability of the system, by enabling early detection of query errors. In order to do that, we asked a set of ten users to write fifty queries per user in a given domain. These users were completely new to the system and they did not have any previous knowledge about the underlying domain. We gave them an initial description of the Northwind database, without schema representation or concrete entity/property names, and let them query the system in an exploratory way. This description was as simple as explaining them that the database contained information about products, product categories, orders, order details and suppliers. During this process, users are very likely to introduce mistakes in most of the queries they come up with for the first time. We captured traces for all of these queries and recorded in which stage of the parsing process they were raised. Our results indicate that, from the set of fifty input queries per user, almost 90% of them contained errors, from which roughly the 80% of these wrong queries could be detected before they were translated into SQL and, therefore, before they were being executed against the database. This fact results in significant improvements in terms of latency time for wrong queries, since thanks to the query-authoring services that {AskMe*} implements, they are locally detected by the system instead of being translated into SQL and executed against the database. The results of this experiment show that while an important amount of errors (23%) are due to lexical errors (usually things like typos), and 26% of them correspond to syntactic errors (mostly ill-formed sentences in the English language), most of the errors are due to semantic errors (51%). In order to help minimizing the probability of having lexical errors in a query, the system provides auto-completion for 479 entities and properties, and also auto-correction of typos based on distance-editing algorithms. Table 6 shows some of the most interesting queries written by users and how {AskMe*} guided them towards the right query. In terms of semantic error distribution classified by the main semantic rules that {AskMe*} implements, this evaluation determines that 51% of them fall in the rule of entity–property mismatch, thus being the most common semantic error, 41% of errors correspond to queries trying to refer to a missing relationship that does not exist in the domain and the remaining 8% represents semantic errors due to the query specifying invalid values in property conditions. 6.3. Portability of the system across multiple domains Our third experiment focuses on evaluating the portability of the system. For this purpose, we have created a script that simulates the user actions through the visual interface. In this test, the system will be connected to three different databases that we have previously configured: ATIS, AdventureWorksDB [20] and Northwind [14]. For each of these database connections, a custom benchmark made up of fifty different queries that are relevant to the correspondent domains (ATIS as described in the first experiment, Northwind as described through different sections of this paper and Adventure Works shown in Fig. 12) is executed against the system, asserting that the queryauthoring services work as expected and that the resulting SQL query is generated as expected as well. Finally, the test also evaluates the behavior when the system is connected to a database that had been already connected before, checking that the ontology generation process is not kicked-off again, but rather the existing ontology for that source is pulled back from the store and brought into the current connection context. The results of this experiment indicate that there is not any lose in accuracy after a reconnection to a different database, and the results are the same as if the system was only connected to a single database for its lifetime. This means that the same results observed in the first and second experiments apply to the scenario of multiple database reconnections without degrading the overall accuracy of the system after connecting to multiple domains. 7. Conclusions and future work {AskMe*} is an adaptive natural language interface and environment system to query arbitrary databases. Internally, the system leverages an ontology-based approach in which a new ontology is auto-generated every time the system is connected to a different database. Once this ontology has been generated, the rest of the system – domain-specific grammar, query-authoring services, etc. – reconfigures itself based on the set of language terms and relationships contained in the ontology. This automatic reconfiguration enables an effective lexical, syntactic and semantic validation of an input query, which will result in a higher Table 6 Sample queries fixed by user interaction with query services. User query Corrected query List all categories of products List all categories of products How was it fixed? Auto-correction of typos and lexical level (distance-editing algorithm comparing to “known” valid tokens) Products from customers whose last name is Llopis Products ordered by customers whose last name is Llopis A semantic error tooltip is displayed in user query; they learn that the relation products–customers is transitive, via order details and orders (which contain customer ID). The user follows guidance in order to end up with a valid query. Products from whose last name is Llopis Products from customers whose last name is Llopis The syntactic parser detects a syntactic error when the user types “whose” as the entity is missing. By providing an error squiggle and tooltip, the user is able to identify the missing piece on the query and correct it in order to fix and complete the rest of the query. 480 M. Llopis, A. Ferrández / Computer Standards & Interfaces 35 (2013) 470–481 Fig. 12. Adventure Works database simplified schema used in the second experiment. accuracy of the system. The evaluation process showed how, despite the system is not specific to any concrete domain, the result of 94.8% of accuracy against the ATIS benchmark is relatively good compared to other existing state-of-the-art systems, both domain-dependent and independent. Furthermore, this approach enables full portability of the system without any reconfiguration steps needed for the system to successfully execute queries against any new database. Extra mapping reconfigurations, user preferred ways to refer to elements of the domain-model, can be done through easy user interface gestures such as right-clicking elements (i.e. words) of a given query. We believe that the simplification of the reconfiguration process when connecting to new database schemas is a very important step towards the democratization of NLIDBs in real world setup, as it enables non-technical users to be able to fully control the system through its entire lifecycle. In addition, it enables the construction of a customized textual query environment in which a set of query-authoring services can be provided to the user, to help authoring and disambiguating queries. These query-authoring services play a fundamental role in systems' usability, making it possible to early detect query errors, as demonstrated in the evaluation section, where we observed that around the 80% of the queries that contained errors could be detected before they were actually translated into SQL, resulting in a more efficient, lower-latency, user-interactive system. The classification of these errors based on the parsing stage in which they are detected, as shown in the evaluation, gives us the possibility to selectively focus on improving the quality and functionality of query-authoring services at each stage of the parsing process, in order to maximize the investment in relation to the gain of the overall user experience. Finally, just remark that {AskMe*} helps the user to make valid queries as well by automatically distinguishing between linguistic and conceptual failures. Based on our very positive evaluation results for early error detection, thanks to the use of query-authoring services, as future work, we are trying to maximize this benefit by experimenting with new query-authoring services and improving the existing ones. Moreover, we will add anaphora and ellipsis resolution capabilities in {AskMe*}. Anaphora and ellipsis resolution are an active research field in the space of NLIDBs; this capability enables users to have the possibility to dramatically abbreviate the number of words to be written when asking different questions about different aspects of the same entity, which will result, again, in another important usability shift for {AskMe*} [21]. The main drawback of using this kind of resolution is its low precision. However, we plan to overcome the low precision of anaphora and ellipsis resolution by means of benefiting of query authoring services. Acknowledgments This research has been partially funded by the Valencia Government under Project PROMETEO/2009/119, and by the Spanish Government under Project Textmess 2.0 (TIN2009-13391-C04-01) and TIN2012-31224. References [1] S. Abiteboul, V. Hull, R. Viannu, Foundations of Database Systems, Addison Wesley, 1995. [2] L. Androutsopoulos, Natural language interfaces to databases—an introduction, Journal of Natural Language Engineering 1 (1995) 29–81. [3] A. Popescu, A. Armanasu, O. Etzioni, D. Ko, A. Yates, PRECISE on ATIS: Semantic Tractability and experimental results, in: Proceedings of the National Conference on Artificial Intelligence – AAAI, 2004, pp. 1026–1027. [4] N. Stratica, L. Kosseim, B.C. Desai, Using Semantic Templates for a natural language interface to the CINDI virtual library, Data & Knowledge Engineering Journal 55 (1) (2004) 4–19. [5] D. Jurafsky, J. Martin, Speech and Language Processing: An Introduction to Natural Language Processing, Speech Recognition and Computational Linguistics, Prentice Hall, 2000. [6] C. Manning, H. Schutze, Foundations of Statistical Natural Language Processing, MIT Press, 1999. [7] H. Santoso, S. Haw, Z.T. Abdul-Mehdi, Ontology extraction from relational database: concept hierarchy as background knowledge, Knowledge-Based Systems 24 (3) (2011) 457–464. [8] M. Watson, NLBean(tm) version 4: a natural language interface to databases, www.markwatson.com. [9] R. Bartolini, C. Caracciolo, E. Giovanetti, A. Lenci, S. Marchi, V. Pirrelli, C. Renso, L. Spinsanti, Creation and use of lexicons and ontologies for NL interfaces to databases, in: Proceedings of the International Conference on Language Resources and Evaluation, vol. 1, 2006, pp. 219–224. [10] D. Sleator, D. Temperley, Parsing English with a link grammar, in: Proceedings of the Third International Workshop on Parsing Technologies, 1991. [11] B. Hu, H. Liu, J. He, X. Du, FARS: multi-relational feature and relation selection approach for efficient classification, in: Proceedings of the Advance Data Mining and Application Conference, vol. 1, 2008, pp. 73–86. M. Llopis, A. Ferrández / Computer Standards & Interfaces 35 (2013) 470–481 [12] V.I. Levenhstein, Binary Codes capable of correcting deletions, insertions, and reversals, Soviet Physics – Doklady 10 (8) (1966) 707–710. [13] G.A. Miller, WordNet: a lexical database for English, Communications of the ACM Journal - CACM 38 (11) (1995) 39–41. [14] Northwind, http://msdn.microsoft.com/en-us/library/aa276825(SQL.80).aspx. [15] M. Bates, S. Boisen, J. Makhoul, Developing an evaluation methodology for spoken language systems, in: Proceedings of the Speech and Natural Language Workshop, vol. 1, 1990, pp. 102–108. [16] H. Young, S. Young, A data-driven spoken language understanding system, in: Proceedings of the IEEE Workshop on Automatic Speech Recognition and Understanding, vol. 1, 2003, pp. 583–588. [17] R.C. Moore, D.E. Appelt, SRI's experience with the ATIS evaluation, in: Proceedings of the Workshop on Speech and Natural Language, 1990, pp. 147–148. [18] V. Zue, J. Glass, D. Goddeau, D. Goodine, L. Hirschman, M. Phillips, J. Polifroni, S. Seneff, The MIT ATIS system: February 1992 Progress Report, in: Proceedings of the Workshop on Speech and Natural Language, 1992, pp. 84–88. [19] D. Hindle, An analogical parser for restricted domains, in: Proceedings of the Workshop on Speech and Natural Language, 1992, pp. 150–154. [20] Adventure Works, http://msdn.microsoft.com/en-us/library/ms124659.aspx. [21] J.L. Vicedo, A. Ferrandez, Importance of pronominal anaphora resolution in question answering systems, in: Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics, 2000, pp. 555–562. [22] N. Nihalani, S. Silakari, M. Motwani, Natural language interface for database: a brief review, International Journal of Computer Science Issues 8 (2) (2011) 600–608. [23] M. Llopis, A. Ferrandez, {AskMe*}: Reducing the costs of adoption, portability and learning process in a natural language interface to query databases, in: Proceedings of the 8th International Workshop on Natural Language Processing and Cognitive Science, vol. 1, 2011, pp. 75–89. [24] W.A. Woods, R.M. Kaplan, B.N. Webber, The Lunar Sciences Natural Language Information System: Final Report, in: BBN Report, 2378, 1972. [25] Y.W. Wong, Learning for semantic parsing using statistical machine translation techniques, in: Technical Report UT-AI-05-323, University of Texas, Austin, 2005. [26] Y. Li, H. Yang, H.V. Jagadish, NALIX: an interactive natural language interface for querying XML, in: Proceedings of the International Conference on Management of Data, 2005, pp. 900–902. [27] R. Valencia-Garcia, F. Garcia-Sanchez, D. Castellanos-Nieves, J.T. Fernandez-Breis, OWLPath: an OWL ontology-guided query editor, IEEE Transactions on Systems, Man, and Cybernetics—Part A: Systems and Humans 41 (1) (2011) 121–136. [28] D. Damljanovic, M. Agatonovic, H. Cunningham, Natural language interfaces to ontologies: combining syntactic analysis and ontology-based lookup through the 481 user interaction, in: Proceedings of the 7th Extended Semantic Web Conference, 2010, pp. 106–120. [29] P.R. Devale, A. Deshpande, Probabilistic context free grammar: an approach to generic interactive natural language interfaces to databases, Journal of Information, Knowledge and Research in Computer Engineering 1 (2) (2010) 52–58. [30] H.R. Tennant, K.M. Ross, M. Saenz, C.W. Thompson, J.R. Miller, Menu-based natural language understanding, in: Proceedings of the 21st Annual Meeting of ACL, 1983, pp. 151–158. Miguel Llopis is a Ph.D. Student at the Department of Software and Computing Systems in the University of Alicante (Spain). His research interests include: Natural Language Processing, Question Answering and Domain-Specific Languages. He has written various papers in journals and participated in international conferences related to his research topics. Besides his Ph.D. studies and research activity, Miguel works as a Program Manager in the SQL Server Team at Microsoft Corporation (Redmond, Washington). Contact him at [email protected]. Antonio Ferrández is a Full-time Lecturer at the Department of Software and Computing Systems in the University of Alicante (Spain). He obtained his Ph.D. in Computer Science from the University of Alicante (Spain). His research interests are: Natural Language Processing, Anaphora Resolution, Information Extraction, Information Retrieval and Question Answering. He has participated in numerous projects, agreements with private companies and public organizations related to his research topics. Finally, he has supervised Ph.D. Thesis and participated in many papers in Journals and Conferences related to their research interests. Contact him at [email protected].