REAP em Português Information Systems and Computer - INESC-ID

REAP em Português
Luís Carlos dos Santos Marujo
Dissertation for obtaining the Master’s Degree in
Information Systems and Computer Engineering
Jury
President: Professor Ana Maria Severino de Almeida e Paiva
Advisors: Professor Nuno João Neves Mamede, Professor Isabel Maria Martins Trancoso
Evaluation Jury: Professor Jorge Manuel Evangelista Baptista
July 2009
To my parents,
Acknowledgements
I would like to thank my advisors, Professor Nuno Mamede and Professor Isabel Trancoso,
for their guidance throughout the development of this work. I want to thank all the members of
the REAP project at CMU, namely Professor Maxine Eskenazi and Juan Pino, for their help
and support at the beginning of this thesis, especially during my visit to CMU, where they
provided me with much information and insider knowledge of the English version of REAP.
I also want to thank all the members of L2F, in particular for providing me with the support and tools
necessary for the completion of this work. Many thanks to Tiago Luís and Miguel Bugalho
for many discussions and for their help using and installing some tools on the L2F servers.
I want to express my gratitude to the members of the Universidade do Algarve for their collaboration in providing the list of focus words and the cloze questions.
I would also like to thank Lisboa Editora and Porto Editora for giving access to
the training corpus and the on-line dictionary.
Another special thanks to FCT for sponsoring this research under grant CMUPT/HuMach/0053/2008.
Finally, I would like to acknowledge the support of my family and friends throughout the
process of writing this thesis.
Lisboa, July 27, 2009
Luís Carlos dos Santos Marujo
Resumo
Actualmente a aquisição contínua de conhecimento é essencial para a integração activa no
mercado de trabalho. Aprender uma língua é um exemplo deste processo de aprendizagem
contínuo. O Ensino de Língua Assistido por Computador é uma área de investigação que se
concentra no desenvolvimento de ferramentas que podem melhorar o processo de aprendizagem duma língua.
O sistema REAP (desenvolvido na CMU), assim como o sistema REAP.PT (a versão portada
desenvolvida neste trabalho), são exemplos de sistemas tutores de língua. A construção de
tais sistemas envolve diversas tarefas que foram desenvolvidas no âmbito desta dissertação:
(i) O desenvolvimento e integração de recursos linguísticos, como por exemplo dicionários,
lista académica de palavras e perguntas de escolha múltipla;
(ii) O desenvolvimento de uma arquitectura escalável e distribuída, capaz de processar grandes
quantidades de informação, como um corpus Web (ex.: WPT05); no REAP.PT, esta arquitectura inclui uma cadeia de filtros, que excluem documentos inapropriados, e classificadores
de tópicos e de inteligibilidade, que permitem aos alunos aprender a partir de documentos que
estejam de acordo com os seus interesses e nível de conhecimento da língua estrangeira;
(iii) A adaptação e a extensão da interface Web;
(iv) O desenvolvimento e adaptação de ferramentas transversais ao Processamento de Língua
Natural, como por exemplo classificadores de categorias gramaticais, sintetizadores de fala
e reconhecedores de fala. A flexibilidade do REAP foi melhorada através da introdução de
áudio, baseado tanto na síntese de fala como no alinhamento automático de documentos previamente gravados (livros falados e notícias). A análise do seu impacto em alunos estrangeiros
é um dos objectivos em futuros testes.
Abstract
Nowadays the continuous acquisition of knowledge is essential for active integration in the
technical job market. Language learning is an example of this continuous process. The Computer Assisted Language Learning (CALL) research area is concerned with the development
of tools that can improve the language learning process.
The REAP system (developed at CMU), as well as REAP.PT (the ported version developed
in this work), are examples of CALL tutoring systems. Building such systems involves several
tasks, which were developed in the scope of this thesis:
(i) The development and integration of linguistic resources, such as dictionaries, academic
word lists, and cloze questions;
(ii) The development of a scalable, parallel data-processing architecture capable of processing
vast amounts of data, such as a Web corpus (e.g., WPT05); in REAP.PT, this architecture
comprises a chain of filters that excludes inappropriate documents, and topic and readability classifiers that allow the students to learn from documents matching their interests and
their knowledge level of the foreign language;
(iii) The adaptation and extension of the Web interface;
(iv) The development or adaptation of transversal Natural Language Processing tools, such as
Part-of-Speech (POS) taggers, morphological generators, and text-to-speech and speech-to-text
converters. REAP's flexibility has been improved by adding audio playing capabilities, based
on either text-to-speech synthesis or automatic alignment of previously recorded documents
(digital talking books and broadcast news stories). Their impact on L2 learners of European
Portuguese is also one of the target goals of the forthcoming field trials.
Palavras Chave
Ensino da Língua Assistido por Computador
Inteligibilidade / Apreensibilidade textual
Tópicos
Extracção de informação
Keywords
Computer Assisted Language Learning
Readability
Topics
Information Retrieval
List of Abbreviations
APACALL - Asia-Pacific Association for CALL is an on-line association of CALL researchers
and practitioners whose sponsoring organization is the University of Southern
Queensland in Australia.
API - Application Programming Interface defines a set of routines, object classes, and data
structures provided by libraries to support building new applications or extending existing
ones.
ARC - Internet Archive's ARC file format defines a standard for combining multiple
web resources, together with related information, into a single aggregate archival file.
The .arc file extension is used to represent different file types that have in common
being some kind of archive file. ARC is also a lossless
data compression and archival format developed by System Enhancement Associates.
ASR - Automatic Speech Recognition, also known as computer speech recognition, is the
process of converting the speech signal into written text.
BN - Broadcast News transmitted over radio and TV.
CAI - Computer Assisted Instruction is another name for CALL.
CALICO - The Computer Assisted Language Instruction Consortium is a professional organization dedicated to computer assisted language learning.
CALL - Computer Assisted Language Learning, also known as Computer Aided Language
Learning, is an approach to teaching and learning in which the computer and computer-based
resources such as the Internet are used to present, reinforce and assess material
to be learned; it usually includes a substantial interactive element.
CAPT - Computer Assisted/Accelerated Pronunciation Training is a subfield of CALL that
addresses the problem of teaching pronunciation.
CERCLES - Confédération Européenne des Centres de Langues de l'Enseignement Supérieur
(European Confederation of Language Centres in Higher Education) is a confederation of 290 Language Centres, Departments, Institutes, Faculties or Schools in Higher
Education whose main responsibility is the teaching of languages.
CMU - Carnegie Mellon University is a private research university in Pittsburgh.
CSV - Comma Separated Values is a file format that is used to store tabular data. It uses a
comma to separate values, but many applications, such as WEKA, allow alternative
separators.
DAT - A DAT(a) file contains metadata about the documents stored in ARC files. The .dat file
extension indicates a data file; however, the format of the data in .dat files
is often specific to the software that created them. It is one of the most common file
extensions.
DIXI - A popular Latin expression, literally translated as "I have spoken"; it is the name
given to the Portuguese speech synthesizer licensed to the L2F spin-off company (Voice
Interaction) and is now a commercial product.
DTB - Digital Talking Books, also known as audio books, are a multimedia representation
of a print publication, where a human voice is used to render the audio.
DTD - Document Type Definition defines the XML document structure with a list of legal
elements and attributes.
EUROCALL - The European Association for CALL is an organisation of language teaching
professionals from Europe and world-wide.
FLEAT - Foreign Language Education and Technology is a conference sponsored by IALLT.
GB - Gigabyte is a multiple of the byte, a unit for measuring digital information size.
GFS - Google File System is a distributed file system developed by Google.
GLoCALL - The Globalization and Localization in CALL is a conference jointly sponsored by
APACALL and PacCALL.
HDFS - Hadoop Distributed File System is a highly fault-tolerant distributed file system
designed to be deployed on low-cost hardware.
HTTP - Hypertext Transfer Protocol is an application layer network protocol, built on top of
TCP, that provides a standard for Web browsers and servers to communicate.
IALLT - International Association for Language Learning Technology, which is made up of eight
U.S. regional groups, is a professional organization dedicated to promoting effective
uses of media centres for language teaching, learning, and research.
INESC-ID - Institute for Systems and Computer Engineering: Research and Development is
a non-profit organization devoted to research in the field of information and communication technologies.
ISCA - International Speech Communication Association is a non-profit organization which
aims to promote, in an international context, activities and exchanges in
all fields related to speech communication science and technology.
JALTCALL - Japan Association for Language Teaching CALL is a special-interest group
supported by The Japan Association for Language Teaching.
JPython - Java Python is an implementation of object-oriented Python integrated with the
Java platform.
JVM - Java Virtual Machine is a virtual software process running inside a system that is
responsible for making Java code platform independent.
L2F - Spoken Language Systems Laboratory is a research department at INESC-ID.
L2 - Second language, also designated as Foreign Language, is any language a person
knows or is acquiring in addition to his or her native language.
LET - Japan Association for Language Education and Technology is a Japanese association
that explores theories, methods, and knowledge related to the use of educational media
among foreign-language-teaching professionals, with the goals of advancing the
field and sharing resources among members.
LTI - Language Technologies Institute is a research department in the School of Computer
Science at Carnegie Mellon University.
MAC - Macintosh is a brand of Apple Computer.
MIME - Multipurpose Internet Mail Extensions is an Internet standard that allows e-mail
systems to support non-ASCII character sets, non-text attachments, and message
bodies containing multiple parts. Web browsers also rely on the MIME type to accurately display the wide variety of documents, or to launch a separate application to
handle them. The MIME type was more recently renamed Internet Media Type, and it is designated
as Content-Type in several web protocols.
MMORPG - Massively Multiplayer Online Role-Playing Game is a game genre that enables thousands or
millions of players to play a game simultaneously in a virtual world via the Internet.
MOB - Mobile object, a term in widespread use in the MMORPG genre, refers to monsters and
non-player attackers and opponents.
NASA - National Aeronautics and Space Administration is an independent agency of the
United States government responsible for aviation and space flight.
NLP - Natural Language Processing is a field of artificial intelligence and linguistics that
studies the problems intrinsic to the processing and manipulation of natural language.
OAI-ORE - Open Archives Initiative Object Reuse and Exchange is a data-exchange
standard proposed by the Open Archives group. It provides a model for the description
and exchange of aggregations of Web resources. A resource can be any object that is
identified by a URI, such as a web site.
ODP - Open Directory Project is a multilingual open-content directory of the World Wide Web
that uses a hierarchical ontology scheme for organizing content.
OOV - Out-Of-Vocabulary, i.e., a word that is not known or is not included in the dictionary
or model.
PacCALL - Pacific CALL is an association in the Pacific, including countries from East and
Southeast Asia, Oceania, and some from the Americas, dedicated to the CALL field.
PBS - Portable Batch System is a queuing system developed for NASA in the early 1990s.
PC - Personal Computer is any general-purpose computer.
PHP - PHP: Hypertext Preprocessor is an HTML-embedded scripting language.
PHP5 - PHP: Hypertext Preprocessor 5 is the version 5 of PHP.
PO - Proportional Odds model is a regression model for ordinal dependent variables.
POS - Part of Speech tags, also known as word classes or lexical classes, are traditional categories of words intended to reflect their functions within a sentence.
POSIX - Portable Operating System Interface (Unix) is a standard that defines an interface
between programs and Unix operating systems.
RDF - Resource Description Framework is a standard proposed by the World Wide Web
Consortium (W3C) for a metadata model and a component of the W3C's proposed
Semantic Web.
REAP - REAder-specific Practice is a tutoring system developed at the Language Technologies Institute (LTI) of Carnegie Mellon University (CMU) to support the teaching of
a language to either native or foreign speakers, through the activity of reading,
focusing the students on learning vocabulary in context.
REAP.PT - REAder-specific Practice Portuguese is the Portuguese version of REAP.
REST - REpresentational State Transfer is an "architectural style" for hypermedia systems
such as the WWW.
RMSE - Root Mean Square Error, also known as RMSD, is a commonly used measure of the
differences between the values predicted by a model and the values observed.
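A minimal sketch of the computation this entry describes (function and variable names are illustrative, not from the thesis):

```python
import math

def rmse(expected, observed):
    """Root Mean Square Error between model predictions and observations:
    the square root of the mean of the squared differences."""
    assert len(expected) == len(observed) and expected
    return math.sqrt(
        sum((e - o) ** 2 for e, o in zip(expected, observed)) / len(expected)
    )
```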
SBARs - Subordinate Clauses are non-terminal nodes in the parse tree.
SLaTE - Speech and Language Technology in Education is a recent (2006) special interest
group of the International Speech Communication Association (ISCA).
SMO - Sequential Minimal Optimization is an algorithm used to train SVMs.
SMOreg - Sequential Minimal Optimization regression is the generalization of the SMO algorithm applied to regression problems.
SOAP - Simple Object Access Protocol is an XML-based protocol to let applications exchange
information over HTTP. It is a W3C recommendation.
soapUI - SOAP User Interface is a tool for testing web services.
SVMs - Support Vector Machines are a set of supervised learning methods used for classification and regression based on the Structural Risk Minimization inductive principle.
SVN - Subversion is a version control system used to maintain current and historical versions
of files such as source code, documentation, and web pages.
TELL - Technology Enhanced Language Learning is an alternative term for CALL.
TTS - Text-To-Speech synthesizer is a system that converts text into speech.
URI - Uniform Resource Identifier is a platform-independent way to identify or name a
resource somewhere on the web.
URL - Uniform Resource Locator is a subset of URI. It is the unique address for a file
available on the web.
UTF-8 - 8-bit Unicode Transformation Format is a variable-length character encoding using
one to four bytes to represent any character.
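A quick Python illustration of the one-to-four-byte range mentioned above (the sample characters are chosen for illustration):

```python
# "a" (ASCII), "é" (Latin-1 range), "€" (Basic Multilingual Plane), and
# "𝄞" (outside the BMP) encode to 1, 2, 3, and 4 bytes respectively in UTF-8.
samples = ["a", "é", "€", "𝄞"]
byte_lengths = [len(ch.encode("utf-8")) for ch in samples]
```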
W3C - World Wide Web Consortium is an industry consortium which seeks to produce and
promote standards for the evolution and interoperability of the Web.
WARC - Web ARChive is an international standard (ISO 28500:2009) that revises the
Internet Archive's ARC file format. The WARC format provides new possibilities,
notably the recording of HTTP request headers, the recording of arbitrary metadata
(e.g., language, encoding, topic classification, and readability level), the allocation of
a globally unique identifier for every contained file, the storage of revisit events for migrated
records, and the segmentation of records.
WEKA - Waikato Environment for Knowledge Analysis is a collection of machine learning
algorithms for data mining tasks.
WorldCALL - World CALL is a worldwide association for exchanging information in the
areas of CALL and for creating professional relationships between teachers, researchers,
and industry leaders around the world, namely through annual conferences with
around 500 attendees.
WSDL - Web Services Description Language is an XML-based language that provides a model
for describing Web services.
WWW - World Wide Web, frequently abbreviated as the "Web", is described by the W3C as
the universe of network-accessible information.
WYSIWYG - What You See Is What You Get is used to describe an editor in which the
content displayed during editing appears very similar to the final output, which
can be a web page, a printed document, etc.
XLDB - eXtremely Large Data Bases is a research group hosted by the Faculty of Sciences,
University of Lisbon.
XML - Extensible Markup Language is a specification defined by the W3C that allows the
creation of extensible markup languages, with the main purpose of sharing structured data
between different information systems.
XSS - Cross-Site Scripting is an attack against web applications in
which malicious scripting code is injected into web pages.
Contents

1 Introduction
1.1 Goals
1.2 Structure of this Document
2 State of the Art
2.1 CALL
2.1.1 Associations devoted to CALL
2.2 Readability Algorithms
2.2.1 Flesch Reading Ease formula
2.2.2 Dale-Chall
2.2.3 Fog Index
2.2.4 SMOG
2.2.5 Flesch-Kincaid
2.2.6 Lexile
2.2.7 Collins-Thompson & Callan
2.2.8 Schwarm & Ostendorf
2.2.9 Heilman, Collins-Thompson & Eskenazi
2.2.10 Summary of Readability Measures
3 Architecture
3.1 Web interface
3.2 Oral Comprehension
3.2.1 TTS integration
3.2.2 Multimedia documents integration
3.3 Linguistic resources
3.3.1 Portuguese dictionary
3.3.2 Web document corpus
3.3.3 School textbooks and exams classified by level
3.3.4 Focus word list
3.3.5 Cloze questions
3.4 Documents filtering
3.4.1 Hadoop
3.4.2 Chain of Filters
3.5 Readability
3.5.1 Creation of the Training Data File List
3.5.2 Generation of a Feature and Data Set
3.5.3 Training a WEKA model
3.5.4 Creating and Testing a Readability model
3.6 Topic classification
4 Evaluation
4.1 Experimental Results
4.1.1 Readability classifier results
4.1.2 Topic classifier results
4.1.3 Chain of filters results
4.1.4 Cluster details
5 Conclusions
5.1 Final remarks
5.2 Future work
5.2.1 Integration of syntactic information in the Readability Classifier
5.2.2 Integration of text simplification tools with the Readability Classifier
5.2.3 Expanding the Topics Classifier
5.2.4 Graphical Interface
5.2.5 Automatic generation of Cloze Questions
5.2.6 Integration of Automatic Translation Tools
5.2.7 Integration of Automatic Summarization tools
List of Figures

1.1 Top 10 languages on the internet.
3.1 REAP.PT Architecture.
3.2 Individual Reading Interface.
3.3 Cloze Questions Interface.
3.4 Oral Comprehension Interface.
3.5 XIP-L2F Chain based on [53].
4.1 WPT05 readability distribution.
4.2 Example of a WPT05 document.
4.3 Statistics about the WPT05 filtering process.
4.4 Chain of Filters details.
List of Tables

2.1 Dale-Chall grade correction chart.
2.2 Summary of the readability measures.
3.1 Statistics of the school textbooks and exercise books corpus for each level.
3.2 Statistics of the national exams corpus retrieved.
3.3 Statistics about the List of Focus Words.
3.4 Readability Classifier training results using 10-fold cross-validation.
4.1 First stage evaluation metrics.
4.2 Evaluation of the readability classifier.
1 Introduction
According to Internet World Stats¹, Portuguese is the 8th most used internet language in the
world, as shown in Figure 1.1. This website also reports a growth of 668% between 2000
and 2008, making Portuguese the 3rd fastest growing internet language. These figures are
very good indicators of the importance of the Portuguese language on the Web and in the
world. The most interesting figure for the topic of this thesis, however, would be the ranking
of Portuguese learned as a second language. Although we could not find such statistics, we
believe that this ranking is much lower, thus adding further motivation to develop computer-aided
tools for assisting the learning of Portuguese.
Figure 1.1: Top 10 languages on the internet.
¹ http://www.internetworldstats.com/stats7.htm (visited in Dec. 2008)
Nowadays the continuous acquisition of knowledge is essential for active integration in the
technical job market. Language learning is an example of this continuous process, since it is
a multi-level task that includes several elements such as words, sentence structures (syntax),
semantics, pronunciation, and culture. The activity of reading is a major component of first
and second language learning, namely vocabulary learning [71]. Learning how to read implies
a set of knowledge components including not only grammatical rules and their exceptions,
but all the lexical items in a language. Dictionaries and other lexical resources define word
meanings in a limited way and are not the most suitable for direct study. The Internet, on the
other hand, can be used for direct study because it provides a vast amount of information and services.
However, finding suitable reading materials is a challenge for language teachers, because
most pages are not suitable for reading practice. Current general search engines (e.g.,
Google², AltaVista³) are designed to run short queries against a huge collection of hyperlinked
documents in a quick way and are optimized for the most frequent queries submitted [9]. For
instance, they do not filter the results of a search by levels of reading skill, nor do they remove
unwanted material containing sensitive information, such as profanity. As a result, these
tools may need to include more advanced support from information retrieval technologies
than is currently used, in order to answer the increasingly complex needs of users.
This motivates the development of a varied set of properties and data models that
capture several aspects of user needs, such as his/her context and goals. High importance is
given to the creation of tools that help people find relevant information suited to
their interests.
The framework for this thesis is a tool for language learning (REAP), specifically lexical
practice, that may be tuned to a particular student by selecting materials from the Web
which are adequate to the student's fluency level and current interests.
The acronym REAP stands for "REAder-specific Practice"; it is a tutoring system developed at the Language Technologies Institute (LTI) of Carnegie Mellon University (CMU) to
support the teaching of a language to either native or foreign speakers, through the activity
of reading, focusing the students on learning vocabulary in context.
1.1 Goals
The main purpose of this work is to build a Portuguese version of REAP (REAP.PT). The
development of REAP.PT is one of the goals of the CMU-Portugal dual PhD program in
the area of language technologies, which brings together an interdisciplinary team of engineers from the
Spoken Language Systems Lab (L2F) of INESC-ID Lisboa and linguists from the
Universities of Algarve and Lisbon.

² http://www.google.com
³ http://www.altavista.com/
Porting REAP to a new language, such as Portuguese, involves several tasks, the most obvious
one being the adaptation of the interface to Portuguese. But it also involves integrating
new linguistic resources, such as dictionaries for looking up the meaning of unfamiliar words
and academic word lists (also designated as focus or target word lists throughout the following
chapters); retrieving and processing Web corpora; integrating new tools, such as topic
classifiers; and making the necessary adaptations for this typologically different language.
There are several other tasks, however, where the project goals are not limited to porting
and involve research challenges. Readability is one such task, where the aim is to attribute
a reading difficulty value to a document. Generating statistics about the readability (the ease
of understanding or comprehension of a text) and topic classifications of a Portuguese Web
corpus, such as WPT05, is also an interesting goal.
In addition to these porting and research-oriented tasks, REAP.PT is also concerned with
the integration of oral comprehension features in the tutoring system. Learning a new word
does not only mean learning how to write it, but also how to understand its spoken form.
This may be especially important for a language such as European Portuguese, characterized
by strong vowel reduction (ranging from quality change to shortening and deletion), where
familiarization with the spoken/written language in multimedia documents may turn out to be
very helpful for non-native speakers.
The work developed in this thesis was accepted for oral presentation at the SLaTE 2009⁴ workshop:
[44] Luís Marujo, José Lopes, Nuno Mamede, Isabel Trancoso, Juan Pino, Maxine Eskenazi,
Jorge Baptista, and Céu Viana. Porting REAP to European Portuguese. In SLaTE
2009 - Speech and Language Technology in Education, Brighton, UK, 2009. Elsevier.
1.2 Structure of this Document
This thesis consists of 5 chapters and is structured as follows:
• REAP.PT is a kind of software that can be classified as CALL (Computer Aided Language Learning). For that reason, Chapter 2 presents the state of the art in Computer
Assisted Language Learning (CALL).
• Chapter 3 describes the work done and the resulting REAP.PT solution architecture.
• Chapter 4 introduces the main evaluation and the results obtained.
• The document ends with the presentation of the conclusions and future work.

⁴ http://www.eee.bham.ac.uk/SLaTE2009/index.html (validated in July 2009)
2 State of the Art
This chapter is divided in two parts: the first part describes the CALL state of the art, and
the second part details the currently available readability measures.
2.1 CALL
CALL is a term that appeared in the early 1980s, replacing the older expression CALI (Computer Assisted Language Instruction), which had become associated with programmed learning.
This distinction was necessary since, during the 1980s, CALL widened its scope, including the
communicative approach and a range of new technologies. The field of CALL is intrinsically
multidisciplinary. It encompasses research from linguistics, psychology, second language acquisition, sociology, cognitive science, natural language processing, artificial intelligence, and
computer science. It has now established itself as an important area of research in higher
education. CALL now includes highly interactive and communicative support for listening,
speaking, reading, and writing, including extensive use of multimedia CD-ROMs and the Internet. An alternative term to CALL emerged in the late 1980s, called Technology Enhanced
Language Learning (TELL), which tried to provide a more accurate description of the same
activities. This acronym was adopted by the TELL Consortium¹.
The most succinct and accepted definition of CALL is provided by Levy [40]: Computer
Assisted Language Learning (CALL) may be defined as "the search for and study of applications of the computer in language teaching and learning". CALL is an approach to teaching
and learning in which the computer and computer-based resources such as the Internet are
used to present, reinforce and assess material to be learned; it usually includes a substantial
interactive element.
Warschauer and Healey [73] tried to model the progress of CALL into three phases,
known as "Behaviorist", "Communicative" and "Integrative". Behaviorist CALL,
situated between the 1960s and 1970s, featured essentially repetitive language drills developed to run on mainframes. The subsequent stage, Communicative CALL,
positioned between the 1970s and 1980s, targeted PCs and usually included
text reconstruction programs and simulations, which motivated dialogue and discovery among
students working in groups. Finally, Integrative CALL (starting in the 21st century) is
centered on multimedia and the Internet. It is the result of the shift toward globalization, where
teachers turn into facilitators instead of being the source of knowledge, and students are expected
to interpret and organize the given information in an active way.

¹ http://www.hull.ac.uk/cti/tell/ (validated on Oct. 2008)
This model also mentioned that the stages are not precisely restricted, because it says that
as a new state emerges, the previous stages remain.
The described model is widely cited, although some researchers do not accept this attempt to describe the evolution of CALL [7]. Bax argues that there are inconsistencies in the theory across Warschauer's publications: for example, Behaviorist CALL was later renamed Structural CALL and its dates shifted by ten years. He contends that the chosen names led to some confusion, and proposes three stages of his own: Restricted CALL, Open CALL and Integrated CALL. This theory states that we are in the Open CALL stage and are moving towards Integrated CALL and normalization.
The latter theory can be considered a refinement of the first one; furthermore, the concept of normalization can be considered its main contribution. The author states that a technology becomes normalized when it is "invisible, hardly even recognized as technology". Normalization of a technology occurs when it is taken for granted in our daily life. CALL is far from reaching this situation: for instance, the acronym CALL is almost completely unknown among most people, as are the applications developed in this field. The author nevertheless developed a model to classify which stage of normalization CALL has reached. He established an evolutionary model with 7 stages: Early Adopters, Ignorance/Skepticism, Try Once, Try Again, Fear/Awe, Normalising and Normalisation. He suggests that many teachers and some institutions are at the "Fear/Awe" and "Normalising" stages. Finally, he concludes that "One criterion of CALL's successful integration into language learning will be that it ceases to exist as a separate concept and a field for discussion. CALL practitioners should be aiming at their own extinction". Although earlier technologies, such as the book or the pen, did not require specialization, the complexity and breadth of CALL make that impossible. While normalization can be a goal for students and teachers, as [34] comments, it is not obvious why extinguishing CALL as a field is useful or a compulsory target. Moreover, [34] ends his paper saying: "I believe the future of CALL and teacher education is bright, but as noted earlier in this paper, there are a number of obstacles. The greatest of these is the limited number of qualified personnel able to integrate technology into language effectively . . . " and "If CALL is to survive and prosper, then we need a dedicated cadre of graduate students, especially doctoral students, willing to select CALL as their area of specialization. The paths of CALL and language teacher education will increasingly be determined by such students and those they will educate in the decades to come.". These citations show how much importance this author attaches to work in this field, and hence the potential relevance of the present work.
2.1.1 Associations devoted to CALL
There are several associations devoted to CALL, the most well known being:
• APACALL2 - the Asia-Pacific Association for CALL is an on-line association of CALL researchers and practitioners, sponsored by the University of Southern Queensland in Australia. The activities of APACALL include: an e-list to share information between members; the creation of SIGs (Special Interest Groups) on the basis of common research areas; professional meetings and conferences (namely the GLoCALL3 Conference, with over 500 participants in 2008, co-organized with PacCALL); and the production of an electronic journal named International Journal of Pedagogies and Learning.
• CALICO4 - the Computer Assisted Language Instruction Consortium is a professional organization dedicated to computer assisted language learning, which focuses its activity on publications (comprising the CALICO Journal and software reviews), annual conferences and some special interest groups: Courseware, Computer Assisted Communication5, Intelligent Computer Assisted Language Learning6, Virtual Worlds, Second Language Acquisition and Technology.
• JALTCALL7 - is a special interest group supported by The Japan Association for Language Teaching. It organizes annual meetings and publishes a journal.
• PacCALL8 - Pacific CALL is an association in the Pacific, including countries from East and Southeast Asia, Oceania and some from the Americas. It produces a journal and organises some conferences, namely the GLoCALL conference co-organized with APACALL.
• EUROCALL9 - the European Association for CALL is an organisation of language teaching professionals from Europe and worldwide, which works on: providing information and advice on all aspects of the use of technology for language learning; spreading information via the ReCALL Journal; organising special interest meetings and annual conferences; and the development of electronic communications systems. It presently has 3 SIGs: Computer Mediated Communication (CMC)10, CorpusCALL11, and Natural Language Processing12. The association counts over 239 individual and 80 corporate members. CALICO and IALLT are affiliated with EUROCALL.

2 http://www.apacall.org/ (validated on Oct. 2008)
3 http://glocall.org (validated on Oct. 2008)
4 http://www.calico.org/ (validated on Oct. 2008)
5 http://teacheredsig.ning.com/ (validated on Oct. 2008)
6 http://purl.org/calico/icall (validated on Oct. 2008)
7 http://jaltcall.org/ (validated on Oct. 2008)
8 http://www.paccall.org/ (validated on Oct. 2008)
9 http://eurocall-languages.org/ (validated on Oct. 2008)
• IALLT13 - the International Association for Language Learning Technology is made up of eight U.S. regional groups. IALLT is a professional organization dedicated to promoting effective uses of media centres for language teaching, learning, and research. Its major activities are its Journal and its conferences, organized every two years with several hundred attendees, as well as local meetings hosted by the regional groups. Furthermore, according to the information available on the official site, IALLT co-sponsors FLEAT (Foreign Language Education and Technology) with the Japanese Association for Language Education and Technology14 (LET).
• WorldCALL15 - a worldwide association for exchanging information in the area of CALL and for creating professional relationships between teachers, researchers and industry leaders around the world, namely through annual conferences with around 500 attendees. It also provides scholarships to enable postgraduate students and junior academics to attend the conferences. The existing members are: EUROCALL, CALICO, IALLT, LET, CERCLES.
• SLaTE16 - Speech and Language Technology in Education is a recent (2006) special interest group of the International Speech Communication Association17 (ISCA). Its activities are focused on the organization of workshops and related meetings.
2.2 Readability Algorithms
The emphasis of this thesis on readability justifies a more in-depth study of the state of the art in this area. Readability assumes great relevance for this work because we want REAP.PT to be capable of supplying students with documents at their level of knowledge.
Readability has two important meanings. The first and more general meaning is defined as reading ease. A tutoring system such as REAP.PT needs to be capable of automatically determining the level of a document retrieved from the Web. This readability level may for instance range from one (first grade, or first year of elementary school) to twelve (twelfth grade, or last year of high school).

10 http://groups.yahoo.com/group/CMC_SIG/ (validated on Oct. 2008)
11 http://www.corpuscall.org.uk/ (validated on Oct. 2008)
12 http://siglp.eurocall-languages.org/ (validated on Oct. 2008)
13 http://www.iallt.org/ (validated on Oct. 2008)
14 http://www.j-let.org/en/ (validated on Oct. 2008)
15 http://www.worldcall.org/ (validated on Oct. 2008)
16 http://www.sigslate.org/ (validated on Oct. 2008)
17 http://isca-speech.org/index.php (validated on Oct. 2008)
The second meaning [21] comes from the field of computer science and refers to the ease with which a programmer can understand the usage, control flow and procedures in source code. It is an important aspect because programmers spend most of their time reading, trying to comprehend, and changing source code. According to a study done at the General Motors Research Lab, about 75 percent of all programmers' time was spent on program modification. Although this study is more than thirty years old, the reality has not changed significantly. Moreover, code with low readability is also error/bug prone, inefficient and difficult to maintain. To improve this type of readability, the programmer must follow or develop a programming style, which can include specific indentation, capitalization of constants, use of comments, etc., applied coherently across all the source code. Although these guidelines will be followed during the implementation of this work, the rest of this section refers to the first meaning.
Readability ("inteligibilidade" or "apreensibilidade" in Portuguese) is commonly confused with legibility, which concerns typeface and layout. Legibility is an important part of readability; however, not everything that is legible is readable, nor the reverse [6].
We can find several definitions of readability, which stress different parts of the concept. For example, Klare [38] describes readability as "the ease of understanding or comprehension due to the style of writing", focusing his definition on writing style and ignoring other issues such as content, coherence, and organization. Another famous definition of readability was given by the creator of the SMOG readability formula [50], which focuses on the interaction between a text and a group of readers sharing a common base of knowledge, motivation and reading skill. He defines readability as "the degree to which a given class of people find certain reading matter compelling and comprehensible". The definition given by Edgar Dale and Jeanne Chall [17] is probably the most comprehensive: "The sum total (including all the interactions) of all those elements within a given piece of printed material that affect the success a group of readers have with it. The success is the extent to which they understand it, read it at an optimal speed, and find it interesting". In more formal terms, a reading difficulty measure can be described as a function that maps a text to a numeric value corresponding to a difficulty level or grade.
According to DuBay [20], the first readability formulas were developed by educators in the 1920s, using vocabulary difficulty and sentence length to estimate the difficulty of a text. These formulas only became widely used in the 1950s, when researchers like Edgar Dale, Jeanne Chall, George Klare and Rudolf Flesch started developing and applying them on a regular basis in several fields, such as law, industry and journalism. There has been extensive work on predicting the readability of texts: by the 1980s, the number of readability formulas was about 200 [20]. Most of them were developed for English and are not very different from each other. The coefficients in these formulas are presented as experimental results obtained by the authors.
The most widely used readability measures are: Flesch Reading Ease formula (1943), Dale-Chall (1948), Gunning's "Fog Index" (1952), SMOG (1969), Flesch-Kincaid (1975), Lexile (1988), Collins-Thompson & Callan (2004), Schwarm & Ostendorf (2005) and Heilman et al. (2007/2008).
2.2.1 Flesch Reading Ease formula
In 1943, Rudolf Flesch received a Ph.D. in education research for his dissertation [22], which contains his first readability formula for measuring adult reading material. In 1948, Flesch revised his initial formula and split it into two parts. The first part, the Reading Ease formula, dropped the use of affixes and used only two variables: the number of syllables and the number of sentences for each 100-word sample. It scores reading ease on a scale from 1 to 100, where 30 is "very difficult" and 70 is "easy". A score of 100 indicates reading material completely understood by readers who have completed the fourth grade. The second part of Flesch's formula estimates the human interest of the reading matter by counting the number of personal words (such as pronouns and names) and personal sentences (such as quotes, exclamations and incomplete sentences). The formula for the Flesch Reading Ease measure is:
Result = 206.835 − (1.015 × ASL) − (84.6 × ASW)   (2.1)
Where:
Result = value on a scale of 0 to 100.
ASL = average sentence length, i.e. the number of words divided by the number of sentences.
ASW = average number of syllables per word, i.e. the number of syllables divided by the
number of words.
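Equation 2.1 can be transcribed directly into code. In the sketch below, the syllable, word and sentence counts are assumed to be precomputed, since syllabification is language-specific:

```python
def flesch_reading_ease(n_words, n_sentences, n_syllables):
    """Flesch Reading Ease (Equation 2.1): higher scores mean easier text."""
    asl = n_words / n_sentences   # average sentence length
    asw = n_syllables / n_words   # average syllables per word
    return 206.835 - 1.015 * asl - 84.6 * asw

# A 100-word sample with 5 sentences and 130 syllables (ASL = 20, ASW = 1.3):
score = flesch_reading_ease(100, 5, 130)  # about 76.6, i.e. fairly easy text
```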
The use of this measure is so popular that it is included in popular word processing software such as Microsoft Word18, Lotus WordPro19, WordPerfect20, and Google Docs21. However, it is usually only available when the user is writing in English.
18 http://office.microsoft.com
19 http://www-01.ibm.com/software/lotus/products/smartsuite/wordprofeatures.html
20 http://www.corel.com/servlet/Satellite/us/en/Product/1207676528492#tabview=tab0 (validated on Dec. 2008)
21 http://docs.google.com/
2.2.2 Dale-Chall
The original Dale-Chall readability measure [16], [18] was developed for adults and children above the 4th grade. According to this model, reading difficulty is a linear combination of the mean sentence length and the percentage of rare words. It uses a list of 3000 words known to at least 80% of fourth-grade students. The percentage of words not in this list is used as a measure of the lexical difficulty of a text. This measure of lexical difficulty rests on the hypothesis that the percentage of rare words in a text increases linearly with the readability level. Under this assumption, however, the percentage of rare words should also vary within the grades below fourth, where not all the common words are likely to be known yet. As such, a lexical model this simple, defined in relation to a single grade, may be problematic. The measure selects 100-word samples throughout the text (for books, every tenth page is recommended) and then calculates the result of the equation:
Raw Result = 0.1579 × PDW + 0.0496 × ASL + 3.6365   (2.2)
Where:
Raw Result = “reading grade score of a pupil who could answer one-half of the test questions
correctly”.
PDW = percentage of hard words (words outside the Dale-Chall word list).
ASL = average sentence length.
The Raw Result needs to be corrected at the higher grades, so [16] suggests the correction chart in Table 2.1.
Raw result | Dale-Chall Score
4.9 and below | Grade 4 and below
5.0 to 5.9 | Grades 5-6
6.0 to 6.9 | Grades 7-8
7.0 to 7.9 | Grades 9-10
8.0 to 8.9 | Grades 11-12
9.0 to 9.9 | Grades 13-15 (college)
10 and above | Grades 16 and above (college graduate)

Table 2.1: Dale-Chall grade correction chart.
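Equation 2.2 and the Table 2.1 correction can be sketched together as follows; the small `easy_words` set passed in the example stands in for the real 3000-word Dale list:

```python
def dale_chall(words, n_sentences, easy_words):
    """Dale-Chall raw score (Equation 2.2) plus the Table 2.1 correction."""
    hard = [w for w in words if w.lower() not in easy_words]
    pdw = 100.0 * len(hard) / len(words)   # percentage of hard words
    asl = len(words) / n_sentences         # average sentence length
    raw = 0.1579 * pdw + 0.0496 * asl + 3.6365
    if raw < 5.0:
        grade = "4 and below"
    elif raw < 10.0:
        # one band per integer step of Table 2.1
        bands = ["5-6", "7-8", "9-10", "11-12", "13-15 (college)"]
        grade = bands[int(raw) - 5]
    else:
        grade = "16 and above (college graduate)"
    return raw, grade

easy = {"the", "cat", "sat", "on", "mat", "with", "a"}
words = ["the", "cat", "sat", "on", "the", "mat",
         "with", "a", "peculiar", "ambivalence"]
raw, grade = dale_chall(words, 2, easy)  # PDW = 20%, ASL = 5, grade "9-10"
```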
2.2.3 Fog Index
In the mid 1930s, teachers were starting to notice high school graduates who were unable to read. Robert Gunning understood that the main reason for the reading problem was in fact a writing problem: he found that magazines, newspapers and business papers did not present information directly and included pointless complexity. In [26] and [27], Gunning published the Fog Index, a readability measure that uses average sentence length as a measure of grammatical difficulty and the percentage of words with more than two syllables as an indicator of lexical difficulty. The measure is a linear combination of the two components, as the following formula shows:
Result = 0.4 × (ASL + hard words)   (2.3)
Where:
Result = reading grade of a reader.
ASL = Average Sentence Length in words.
hard words = percentage of words of 3 or more syllables.
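A direct transcription of Equation 2.3, again assuming the counts are precomputed:

```python
def fog_index(n_words, n_sentences, n_hard_words):
    """Gunning Fog Index (Equation 2.3): an estimated reading grade."""
    asl = n_words / n_sentences                # average sentence length
    pct_hard = 100.0 * n_hard_words / n_words  # % of words with 3+ syllables
    return 0.4 * (asl + pct_hard)

# 100 words in 5 sentences, 10 of them with three or more syllables:
grade = fog_index(100, 5, 10)  # 0.4 * (20 + 10) = 12.0
```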
2.2.4 SMOG
The SMOG readability formula [50] also employs word length. McLaughlin published his SMOG formula in the belief that word length and sentence length should be multiplied rather than added. As he says in [50]:
Therefore, a readability formula should not be of the usual form, Readability = a
+ b (Word Length)+ c (Sentence Length), but should be of the form Readability
= a + b (Word Length x Sentence Length) where a, b and c are constants. By a
stroke of good fortune, the more valid type of formula is easier to calculate, not
merely because it has one constant less than the traditional type, but because,
with a bit of ingenuity, one can eliminate the chore of multiplication completely!
Furthermore, he also wanted to simplify the computation of readability estimates. The SMOG prediction for a given text is calculated from a sample of thirty sentences distributed throughout the text, in which the number of polysyllabic words (words with three or more syllables) is counted. It is important to note that sentence length is implicit in the calculation: if the sampled sentences are longer, the predicted readability is probably higher, because polysyllabic words are a priori more likely in longer sentences. The simple formula given by McLaughlin is:
SMOG grading = 3 + √(polysyllabic word count)   (2.4)
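Equation 2.4, as stated in the text, reduces to a one-liner:

```python
import math

def smog_grade(polysyllable_count):
    """SMOG grading (Equation 2.4), applied to a 30-sentence sample."""
    return 3 + math.sqrt(polysyllable_count)

# 25 polysyllabic words in the 30-sentence sample:
grade = smog_grade(25)  # 3 + 5 = 8.0
```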
2.2.5 Flesch-Kincaid
The Flesch-Kincaid measure [37] is probably the most widely used readability measure. Like the Flesch Reading Ease, it is implemented in common word processing programs such as Microsoft Word. This measure is a linear combination of the mean number of syllables per word and the mean number of words per sentence. As such, it employs the same assumptions as the Fog Index and the SMOG measure, but calculates a finer-grained measure of word length.
Result = (11.8 × ASW) + (0.39 × ASL) − 15.59   (2.5)
Where:
Result = reading grade of a reader.
ASW = average number of syllables per word.
ASL = average sentence length.
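Equation 2.5 in code, with the counts again assumed precomputed:

```python
def flesch_kincaid_grade(n_words, n_sentences, n_syllables):
    """Flesch-Kincaid grade level (Equation 2.5)."""
    asw = n_syllables / n_words     # average syllables per word
    asl = n_words / n_sentences     # average sentence length
    return 11.8 * asw + 0.39 * asl - 15.59

# A 100-word sample with 5 sentences and 130 syllables maps to
# roughly grade 7.5:
grade = flesch_kincaid_grade(100, 5, 130)
```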
2.2.6 Lexile
The Lexile Framework [66] uses average sentence length and average word frequency estimates for measuring readability. It assumes the existence of two components: a semantic component and a syntactic component. The mean log word frequency is used as the semantic component, as an indicator of lexical difficulty; the assumption is that the difficulty of a word depends simply on its frequency. The frequency estimates come from the American Heritage Intermediate Corpus, a large, miscellaneous corpus of text [10]. The log of the mean sentence length is used as the grammatical component. The two components are combined in a Rasch model [59] to estimate the readability level of a given text. The Rasch model is used for analyzing data, since it represents the structure that data should exhibit in order to be measurable. It is based on a logistic regression model that estimates student ability levels from performance on tasks of varying difficulty. Rasch models are used because the performance of students on tasks of varying difficulty does not have a linear relationship with the students' abilities. For instance, if a high school text is given to two first-grade students of different ability levels, their levels of comprehension are not likely to differ, because neither would comprehend much of the text. However, if the same text were given to two ninth-grade students of different ability levels, their levels of comprehension might be very different. In Lexile's case, the Rasch model estimates that if a student has the same Lexile measure as a text, he will comprehend 75% of that text. The Lexile measure ranges from 0L (Lexile) to
2000L. Lexile is a converted log odds ratio, but this value can be easily mapped to elementary through high school grade levels by the following formulas: Lexile = 500 × ln(Grade Level), or conversely, Grade Level = e^(0.002 × Lexile).
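The two mappings are exact inverses of each other (since 0.002 = 1/500), as a quick sketch confirms:

```python
import math

def lexile_from_grade(grade_level):
    """Lexile = 500 ln(Grade Level)."""
    return 500 * math.log(grade_level)

def grade_from_lexile(lexile):
    """Grade Level = e^(0.002 * Lexile), the inverse mapping."""
    return math.exp(0.002 * lexile)

# Grade 9 corresponds to roughly 1099L, and the round trip recovers 9:
ninth_grade_lexile = lexile_from_grade(9)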
The equation used by the Lexile framework is:
Theoretical Logit = (9.82247 × LMSL) − (2.14634 × MLWF) − constant   (2.6)
Where:
LMSL = Log of the Mean Sentence Length.
MLWF = Mean of the Log Word Frequencies.
Constant = this value is not specified in public information.
Lexile text measure = [(Logit + 3.3) × 180] + 200   (2.7)

2.2.7 Collins-Thompson & Callan
The researchers Collins-Thompson & Callan at Carnegie Mellon University applied probabilistic language modeling techniques to the development of a new readability measure [13] and [68]. It is based on a variation of the multinomial naïve Bayes classifier. For brevity, they call their model the Smoothed Unigram model, since it uses smoothed unigram language modeling to capture the predictive ability of individual words based on their frequency at each reading difficulty level. Estimates of word frequency are gathered from a corpus of texts labeled by reading level. However, it is common to encounter unseen types (words) that were not included in the model, known as "out-of-vocabulary" words, as well as rare types (types with few observations), neither of which should receive a probability of 0. The redistribution of probability mass from known types to unseen or rare types is called smoothing. The simple Good-Turing [23] algorithm is the smoothing method used by this measure. Having a language model for each readability level enables this approach to report differences in the predictive power of individual words at different levels. During the development of this measure, it was found that some words were highly predictive of certain levels: for instance, the word "grownup" was very predictive of grade 1 and the word "essay" was very predictive of grade 12. For a given text, this readability measure estimates the likelihood that the text was generated by each reading difficulty level's language model. The readability prediction is the level of the model with the highest likelihood of generating the text. It is obtained by creating partitions of at most 100 tokens from the text, calculating the likelihood L(T|Gi) for each partition, and averaging the top N values (the reported results used N = 2).
L(T|Gi) = Σw∈V C(w) × log P(w|Gi)   (2.8)

Where:
C(w) = count of token w.
Gi = grade language model (i ranges from 1 to 12).
T = text.
Even though the lexical component of reading difficulty is reasonably sophisticated in this
readability measure, there is no grammatical component.
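The classification step can be sketched as follows. This is a minimal illustration: add-one smoothing stands in for the simple Good-Turing smoothing of the actual measure, and the 100-token partitioning and top-N averaging are omitted.

```python
import math
from collections import Counter

def train_models(corpus):
    """corpus: {grade: token list}. Builds one smoothed unigram model per
    grade; add-one smoothing stands in for simple Good-Turing here."""
    vocab = {w for tokens in corpus.values() for w in tokens}
    models = {}
    for grade, tokens in corpus.items():
        counts = Counter(tokens)
        total = len(tokens) + len(vocab) + 1  # +1 reserves mass for OOV words
        models[grade] = ({w: (counts[w] + 1) / total for w in vocab},
                         1 / total)           # (word probs, OOV prob)
    return models

def predict_grade(models, text):
    """Equation 2.8: return the grade whose model best explains the text."""
    def loglik(model):
        probs, oov = model
        return sum(c * math.log(probs.get(w, oov))
                   for w, c in Counter(text).items())
    return max(models, key=lambda g: loglik(models[g]))

# Toy two-level corpus echoing the "grownup" vs. "essay" example:
corpus = {1: ["grownup", "cat", "dog", "grownup"],
          12: ["essay", "thesis", "analysis", "essay"]}
models = train_models(corpus)
```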
2.2.8 Schwarm & Ostendorf
In 2005, Sarah Schwarm and Mari Ostendorf [62] developed a system for measuring readability. It uses Support Vector Machines (SVMs) [14] to combine a variety of lexical and grammatical features for classification. The features consist of the average sentence length, the average number of syllables per word, the Flesch-Kincaid score [37], 6 out-of-vocabulary (OOV) rate scores, 4 parse features (per sentence) and 12 n-gram language model perplexity scores. The OOV measures are similar to the Dale-Chall component for lexical difficulty, while the language model perplexity scores are analogous to the methods employed by Collins-Thompson & Callan [68] [13]. The language models applied here include n-gram models, which assume that the word sequence follows an (n−1)th order Markov chain [28]; they used n = 1, 2 and 3 (unigram, bigram and trigram models, respectively). As with the previous measure, the classification models are trained on a corpus of labeled texts. The 4 parse features are derived from syntactic parses of the text: the mean parse tree height, the mean number of noun phrases, the mean number of verb phrases, and the mean number of "SBARs", which are non-terminal nodes in the parse tree associated with subordinate clauses. One assumption made by this model is that grammatical difficulty is satisfactorily encapsulated by a combination of n-gram language models, sentence length, and these four structural features derived from parse trees. Parse trees offer more information about syntactic structure than what was used in the previously discussed measure. However, this raises an important question: whether that information is accurate enough to be included as the grammatical component of a readability measure. Evaluations of parsers generally show accuracy values around 90% [11]; consequently, estimates of grammatical difficulty based on syntactic parses are subject to some amount of additional variance. For natural language technologies in general, it is still an open question whether the reduced bias of using more complex syntactic features outweighs the variance resulting from parser errors.
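The mixed feature vector fed to the SVM can be sketched as follows; the dictionary keys are illustrative names, not the authors' exact feature set:

```python
def feature_vector(doc_stats):
    """Assemble the 25-dimensional feature vector (3 scalar features,
    6 OOV rates, 4 parse features, 12 LM perplexities) for one document.
    doc_stats holds precomputed statistics under illustrative key names."""
    return [
        doc_stats["avg_sentence_length"],
        doc_stats["avg_syllables_per_word"],
        doc_stats["flesch_kincaid"],
        *doc_stats["oov_rates"],        # 6 out-of-vocabulary rate scores
        *doc_stats["parse_features"],   # tree height, NPs, VPs, SBARs
        *doc_stats["lm_perplexities"],  # 12 n-gram LM perplexity scores
    ]

doc = {"avg_sentence_length": 18.0, "avg_syllables_per_word": 1.4,
       "flesch_kincaid": 6.2, "oov_rates": [0.1] * 6,
       "parse_features": [5.0, 2.0, 1.5, 0.3],
       "lm_perplexities": [100.0] * 12}
vector = feature_vector(doc)
```

One such vector per labeled document would then be passed to an off-the-shelf SVM trainer.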
2.2.9 Heilman, Collins-Thompson & Eskenazi
This readability measure, explained in [32], is based on a linear function of lexical and grammatical components. In this model the lexical features are word unigrams, as already used in [13]. The authors also explored higher-order n-grams, like bigrams and trigrams, but the results did not improve in their preliminary tests. The grammatical features are automatically obtained from context-free grammar parses of sentences. This measure extends the work of [31], which used the frequencies of approximately 20 grammatical structures (e.g., passive voice, relative clauses and various verb tenses) retrieved from English as a Second Language grammar textbooks. Instead of using the frequencies of features manually defined based on knowledge of the language, they extract the frequencies using an automatically defined set of subtree patterns derived from the parse trees of all texts in the training corpus. The authors claim that an ordinal scale of measurement is more appropriate than a nominal or interval one. The statistical models tested were linear regression (interval), proportional odds (ordinal) and multi-class logistic regression (nominal). More details about linear regression and multi-class logistic regression can be found in [70], and about proportional odds in [49].
2.2.10 Summary of Readability Measures
The readability measures described above were developed for English. There are some studies that applied readability formulas like Flesch-Kincaid to Brazilian Portuguese texts [42] and, more recently, to Brazilian government websites [43] [6]. However, these works do not consider recent work done in the area [68]. Table 2.2 summarizes the readability measures described. In this work we applied the Collins-Thompson & Callan readability measure, because it provides more accurate results than the earlier measures. Moreover, all subsequent readability measures are based on its word unigrams, extended to include grammatical features.
Measure | Year | Lexical Features | Grammatical Features
Flesch Reading Ease | 1943 | Average number of syllables per word | Average sentence length
Dale-Chall | 1948 | Percentage of hard words (words outside the Dale-Chall word list) | Average sentence length
Gunning's "Fog Index" | 1952 | Percentage of polysyllable words | Average sentence length
SMOG | 1969 | Square root of polysyllable word count | Average sentence length (implicit)
Flesch-Kincaid | 1975 | Average number of syllables per word | Average sentence length
Lexile | 1988 | Word frequency | Average sentence length
Collins-Thompson & Callan | 2004 | Word unigrams | (none)
Schwarm & Ostendorf | 2005 | Word n-grams, . . . | Average sentence length, SBARs, parse tree depth, . . .
Heilman, Collins-Thompson, Callan & Eskenazi | 2007 | Word unigrams | Manually defined grammatical constructions
Heilman, Collins-Thompson & Eskenazi | 2008 | Word unigrams | Automatically defined, extracted syntactic sub-tree features

Table 2.2: Summary of the readability measures.
3 Architecture
Porting REAP to a new language requires several tasks, the most obvious being the adaptation of the interface to Portuguese. However, it also requires the integration of new linguistic tools and resources, as well as the indispensable adaptations for this typologically different language.
We shall start by discussing the Web interface, followed by the new oral comprehension section, absent from the English version of REAP. The description proceeds with the linguistic resources and then addresses the modules related to document filtering. The chapter ends by describing the readability and topic classifiers.
Before diving into the Web interface, an overview of the REAP.PT architecture is presented in Figure 3.1, and a detailed view of the Chain of Filters module is displayed in Figure 4.4. In addition, the percentages of documents processed and excluded are included for each filter. In both figures, the "trash can"/"recycle bin" is used as a metaphor for the deletion/exclusion of documents.
3.1 Web interface
Students interact with REAP via a Web interface supported by any available Web browser, for example Firefox, Safari, Camino, Internet Explorer, etc.
REAP.PT is a student-oriented tool that helps both students and teachers. For instance, each year or each semester, the teacher may organize the focus word list that the students must learn and retain. This kind of list is common for English and other languages such as Mandarin. In Portugal, however, such lists are not known to us, so this is one of the topics currently under investigation. The construction and wide dissemination of such lists via REAP.PT will therefore be an additional outcome of this project.
Since REAP is essentially student-oriented, it needs to evaluate the student's level of knowledge and ask the students to define their topic interests. So, after the first login, the interface runs a pretest in which the student answers a certain number of questions in order to be assigned a level of language proficiency. Then the system offers 3 options: group readings, individual readings (shown in Figure 3.2) and topic interests. The group readings and
[Figure 3.1 depicts the REAP.PT architecture: users interact through the Web interface (user interaction, dictionary access, action logging, document and cloze question retrieval), which connects to the DIXI TTS, the cloze questions, and the REAP.PT database; a spider/crawler gathers documents from the World Wide Web and the WPT05 collection, guided by the Academic Word List, and feeds them through the Chain of Filters, whose adequate-quality check either excludes documents or passes them to the readability and topic classifiers before storage in the database.]

Figure 3.1: REAP.PT Architecture.
individual readings are very similar, the only difference being that in the first case the text
selected for reading is chosen by the teacher and it is common to all the students in the class.
The topic interest menu displays a catalogue of topics and asks the student to classify them
by checking one box from “not interested” to “very interest” and the information is stored in
the database. The focus words are highlighted in the text and the student can search for the
meaning of any the words by clicking in them or using the search field of the system. This
is important because we want to track all the actions of the student, namely the access to
the dictionary, in order to keep updated the progress of the student. The individual reading
exercise is followed by a series of cloze questions (shown in Figure 3.3), or fill-in-the-blank
questions about, in both cases, the words that were highlighted. Please read Section 3.3.5 for
more details concerning cloze questions. The interface also has a teacher menu. It allows the
teacher to classify the quality of the document, rate the readability level, select documents for
group reading, and discard documents. It also supports the creation of a teacher report. The
Figure 3.2: Individual Reading Interface.
Figure 3.3: Cloze Questions Interface.
latter is still a work in progress, because it is not yet fully established which information should be included in the report. Finally, the interface was extended to incorporate an oral comprehension module (shown in Figure 3.4), described in Section 3.2.
3.2 Oral Comprehension
One of the most striking differences between the Brazilian and European varieties of Portuguese concerns vowel reduction, which is much more extreme in the latter ([5], [48]). In the European variety, unstressed high vowels are often deleted, and rather long consonant clusters may surface both within and across word boundaries, which are not allowed in the Brazilian variety. This makes European Portuguese typically more difficult to understand for foreign learners
Figure 3.4: Oral Comprehension Interface.
and is one of the motivations for including audio playing options in REAP.PT as a first
step towards integrating oral comprehension in the system. We endeavour to familiarize the
student with the way each word/sentence sounds in two ways: by integrating a text-to-speech
synthesizer (TTS) for European Portuguese, and by letting the student learn not only from
text documents but also from multimedia documents containing audio (and possibly video)
as well.
3.2.1 TTS integration
The first component is available for every text document. Students can highlight words, or word sequences of any length, in the document and click the “listen” button.
When searching for the meaning of a particular word, the dictionary window also includes
the same listening option. For this purpose, we have integrated DIXI, a concatenative unit
selection synthesizer [57] based on Festival1. Word highlighting and word-sequence selection are enabled by applying some transformations to each displayed document (HTML page). The modifications include injecting some JavaScript code and wrapping a tag around each word of the document. When a student clicks the “listen” option, the JavaScript code calls a PHP proxy server, which invokes the DIXI server using Web Services. The DIXI server provides a description of the operations offered by the service, written in the Web Service Description Language (WSDL). It receives XML messages that follow the Simple Object Access Protocol (SOAP) standard. The approach of communicating using REST
1 http://www.cstr.ed.ac.uk/projects/festival/
was also analysed, as it is simpler to use than SOAP because it does not require a toolkit. Unfortunately, REST is not always the best solution for every Web service. Some data needs to be transmitted securely and should not be sent as parameters in URIs. Even if this problem could be solved with cryptographic algorithms, REST faces another problem when there are large amounts of data to transfer, such as audio, which can exceed the practical length limits of a URI. In these cases, SOAP is a solid solution and cannot be replaced by REST. To establish a connection with the DIXI server, we needed a client program. The client was written in PHP to provide tighter integration with the Web interface. To ensure the client's portability between systems, we decided to use the NuSOAP SOAP Toolkit for PHP2, because it allows developers to create and consume Web services based on SOAP 1.1 and WSDL 1.1 without relying on PHP extensions, which are not available on some systems (for example, Mac OS X 10.5 ships with PHP5 built in, but without support for such extensions). For debugging, we resorted to soapUI3.
3.2.2 Multimedia documents integration
The second component involves multimedia documents, consisting of either pre-recorded digital talking books (DTB) or broadcast news (BN) stories. DTBs are most often used for entertainment and inclusion applications (e.g., for visually impaired or dyslexic users). Their application in the area of CALL is not so typical [69]; however, the possibility of listening to a word or word sequence may be very important for L2 students. In addition, compared with other languages, such as English, there is a scarcity of materials that provide quality-controlled recordings at both the segmental and prosodic levels. DTBs could therefore be a very good way for non-native students to become familiar with the language.
The alignment of each spoken word with the read text is achieved using the automatic speech
recognition system (ASR) in a forced alignment mode. AUDIMUS [52] is a hybrid recognizer
whose acoustic models combine the temporal modeling capabilities of Hidden Markov Models
with the pattern classification capabilities of multi-layer perceptrons. Its decoder, which is
based on weighted finite state transducers, proved very robust even for aligning very long
recordings.
The repository of aligned DTBs is still quite limited, being mostly used for demonstration
purposes, but it already includes a wide range of genres: fiction, poetry, children’s stories,
and didactic text books. This repository, however, does not typically cover the potentially
2 http://sourceforge.net/projects/nusoap/
3 http://www.soapui.org/
very wide areas of interest of L2 students. That was the main motivation for adding a totally
different repository of BN stories, taking advantage of the large corpus that was manually
transcribed for the purpose of training/testing AUDIMUS. The corpus includes over 80 hours
of manually transcribed news shows. Because transcriptions were only manually aligned at the
utterance level, AUDIMUS is again used in its forced alignment mode to produce word-level
alignment.
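The word-level output of forced alignment can be represented as one record per word with a start time and a duration. The sketch below parses a CTM-style alignment file, a common convention for such output; whether AUDIMUS emits exactly this layout is an assumption.

```python
# Parse forced-alignment output into word-level records. The CTM-style
# line layout (file channel start duration word) is a common convention;
# the exact format the recognizer emits may differ.
from typing import NamedTuple

class AlignedWord(NamedTuple):
    word: str
    start: float   # seconds from the beginning of the recording
    end: float

def parse_ctm(lines):
    words = []
    for line in lines:
        if not line.strip() or line.startswith(";;"):
            continue  # skip blank lines and CTM comments
        _file, _chan, start, dur, word = line.split()[:5]
        words.append(AlignedWord(word, float(start), float(start) + float(dur)))
    return words

sample = [
    "book1 1 0.00 0.42 era",
    "book1 1 0.42 0.30 uma",
    "book1 1 0.72 0.51 vez",
]
for w in parse_ctm(sample):
    print(f"{w.word}: {w.start:.2f}-{w.end:.2f}s")
```

With such records, the interface can locate and play the audio span matching any word the student selects.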
The “First Law” formulated by Dr. Tomatis states: “you cannot reproduce a sound you cannot hear”4. This theory assumes that a person's auditory system is constantly listening to the native language and becomes attuned to its frequencies, so that, when one learns a second language, one cannot hear well the sounds that are not present in the native language. The English language, for instance, uses high-pitched tones ranging from 2,000 to 12,000 Hz; French, on the other hand, rarely uses such tones. To test this hypothesis, the Coomen experiment was carried out. It involved 30 high school students with no knowledge of English, divided into 2 groups. One group was taught regular English lessons for 1 year, while the other was given 3 months of perception training followed by 6 months of regular English lessons. At the end of that year, the second group had outperformed the first, and the year after that the differences became even more evident.
The interdependence of speech perception and production is one of the motivations behind this section. To a certain extent, this component can therefore be considered a very rudimentary Computer Assisted Pronunciation Training (CAPT) system. In addition, the broadcast news story videos can provide visual training, which is viewed as an essential complement to the speech signal. The speech signal alone is often adequate for communication, but its intelligibility is greatly improved by visual cues such as movements of the lips, tongue and jaw [45]. Facial expressions, emotions and gestures also enhance communication. The visual components of speech are particularly important for individuals with hearing loss or in noisy environments. Understanding the visual signals can spare a person a life of isolation from oral society. To help improve pronunciation, talking heads, like Baldi [46], were made transparent to render the vocal tract visible. Baldi has a tongue, a hard palate and 3-D teeth, and his internal articulatory movements were based on data from electropalatography and ultrasound [36]. The use of Baldi was tested successfully both with autistic children and with children with hearing loss [47]. The effectiveness of visual training (using Baldi) was shown in [46]. Another system based on this concept is Ville [58]5, which is used to teach Swedish as a second language, as is Timo, which is oriented to children with disabilities.
The second component is being actively explored by José David as part of his PhD research work. Our contribution was essentially support, since he was not familiar with
4 http://www.tomatis.com/English/Articles/languages.htm (validated in July 2009)
5 read with the help of Google Translate (http://translate.google.com/)
the REAP.PT interface or the programming languages involved (PHP, JavaScript, HTML). We also helped him use Subversion (SVN) and manipulate the database structure and content, and we discussed some implementation details, such as how to insert meta-data information in the documents to properly extract the audio matching the selected text.
3.3 Linguistic resources
In terms of linguistic resources, porting started with the integration of a Portuguese dictionary, which allows students to look up the meaning of unfamiliar words. Logging user (student) actions is a strong requirement of the system, because the act of looking up an unknown word can be taken into account when adjusting the student's level or promoting the student to another grade.
3.3.1 Portuguese dictionary
There are three main European Portuguese dictionaries available on the Web: Priberam, Porto Editora and Wikcionário. Priberam and Porto Editora have dictionaries as their core business, and as a result these two companies do not even consider licensing or selling the full electronic version of their dictionaries. Furthermore, they limit the number of online accesses to their dictionaries, to prevent the retrieval of all the data they contain.
Wikcionário is a dictionary that follows the same philosophy as Wikipedia, built on the MediaWiki engine. It currently contains around 51,000 entries for the Portuguese language. This number includes entries for varieties other than European Portuguese, such as Brazilian Portuguese, which may be problematic, as the orthography of the two varieties is not yet unified. The Wikimedia Foundation creates XML dumps of the content of Wikcionário, usually at least twice a month. The dump dated November 26th, 2008 was retrieved and processed in order to filter out unwanted information. The Wikcionário entries do not follow a common internal or external structure: some entries have pronunciation, translations into other languages, or an anagram section, while others do not. For instance, consider the words “casa” (home) and “universidade” (university). While “casa” has pronunciations for both European and Brazilian Portuguese, as well as translations into Romanian and several Spanish and Italian dialects, “universidade” has pronunciation entries for several Brazilian accents but lacks a European Portuguese one, its translation section is empty, and its anagram section is missing. Another example that illustrates the lack of cohesion is the word “corrida” (run), which lacks pronunciation information and presents the first meaning as a noun, then a translation into English, and ends with the second meaning (adjective) illustrated by an example. Internally, the tags used to identify each section of an entry have several variants, which causes great problems for automatic parsing. Furthermore, the number of Portuguese word entries is much smaller than 51,000: several entries belong to other languages, and a large number of entries were added automatically by Web robots (which usually leave a footprint in the <contributor><username> tag – SpaceBirdyBot, RobotGMwikt, etc.).
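The bot footprint mentioned above can be used to filter the dump. The sketch below operates on a stripped-down stand-in for the dump's XML schema and drops pages whose contributor's username looks like a bot; the marker list is illustrative.

```python
# Filter bot-added entries out of a (simplified) MediaWiki XML dump by
# inspecting the <contributor><username> footprint. The XML below is a
# stripped-down stand-in for the real dump schema.
import xml.etree.ElementTree as ET

DUMP = """<mediawiki>
  <page><title>casa</title>
    <revision><contributor><username>SomeEditor</username></contributor></revision>
  </page>
  <page><title>corrida</title>
    <revision><contributor><username>RobotGMwikt</username></contributor></revision>
  </page>
</mediawiki>"""

def human_edited_titles(xml_text, bot_markers=("Bot", "Robot")):
    titles = []
    for page in ET.fromstring(xml_text).iter("page"):
        user = page.findtext("revision/contributor/username") or ""
        if not any(m.lower() in user.lower() for m in bot_markers):
            titles.append(page.findtext("title"))
    return titles

print(human_edited_titles(DUMP))  # the bot-contributed "corrida" is dropped
```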
Both Priberam and Porto Editora provide an online European Portuguese dictionary; however, to protect their servers and their core business, they restrict the number of accesses. The Priberam dictionary contains about 96,000 lexical entries, while the Porto Editora dictionary has more than 920,000 word entries, which makes it almost an order of magnitude larger. Given that their quality is comparable, we chose the more complete one, the Porto Editora dictionary. The English version of REAP includes the Cambridge English dictionary inside the system database. We could not obtain a similar resource; therefore, we opted for remote access to the electronic dictionary of Porto Editora, which displays the meaning, together with the part-of-speech (POS) tag, of each searched (possibly inflected) word.
In order to communicate with the dictionary server, an intermediate server/proxy was needed to register the words that are looked up and to update student models. One way to do this is to use AJAX technology, which allows communication between Web servers using XML requests. However, modern browsers do not allow these requests to access outside domains, in order to prevent cross-site scripting (XSS) vulnerabilities [72]. The adopted solution was a proxy server written in PHP that receives the XML requests and establishes an HTTP connection to the Porto Editora server. To establish the HTTP connection, the Snoopy PHP library6 was used, because it automates and simplifies the task of retrieving Web page content from the Porto Editora server. Instead of retrieving the HTML of the Web portal, we opted to access its Google Gadget interface, since it provides a more compact and elegant version of the HTML for word-entry lookup. Even so, some extra post-processing is done to remove irrelevant content (HTML tags). Finally, since access to the dictionary assumes an ISO-LATIN1 encoding, some format conversions from other text encodings were adopted.
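A minimal sketch of the proxy's two duties follows (in Python rather than PHP, for brevity): logging the lookup for the student model and re-encoding the request for a server that expects ISO-LATIN1. The base URL and parameter name are hypothetical.

```python
# Sketch of the dictionary proxy's duties: log the looked-up word for
# the student model, and encode the request for a server that expects
# ISO-LATIN1. The URL and parameter name are hypothetical.
from urllib.parse import quote

lookup_log = []  # stand-in for the database table of student actions

def build_lookup_url(student_id, word,
                     base="http://dictionary.example/lookup"):
    lookup_log.append((student_id, word))        # track dictionary access
    # UTF-8 input must be re-encoded: the dictionary assumes ISO-LATIN1.
    return f"{base}?word={quote(word.encode('latin-1'))}"

url = build_lookup_url(42, "coração")
print(url)
print(lookup_log)
```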
3.3.2 Web document corpus
The second step in terms of linguistic resources was to build the Web document corpus. The hypothesis of crawling the Web for Portuguese pages was contemplated; however, it is a time-consuming task. Currently, it takes at least 2 months7 to crawl a Web corpus using
6 http://sourceforge.net/projects/snoopy/
7 http://boston.lti.cs.cmu.edu/Data/Web08-bst/planning.html (validated in July 2009)
Web robots. Web robots [56], also known as Internet robots, WWW robots or just bots, are software applications that automatically execute tasks over the Web. Although they can be used for malicious actions8, the role of bots in our society is immense [39], because all search engines, and a huge number of Web applications such as REAP, rely heavily on bots to acquire documents. Bots that crawl Web documents by navigating through hyperlinks are also called “spiders” or “crawlers”. For example, the Viúva Negra crawler (Black Widow crawler) [25] was used to build the WPT05 Web corpus. See Section 3.4 for more details about the WPT05 corpus. To avoid crawling, REAP.PT uses WPT059 as its main document source. This collection of over ten million documents was obtained by the crawler of the Tumba! search engine, the Viúva Negra crawler, developed by the XLDB Node of Linguateca. The contents were crawled in 2005 and were harvested among documents written in Portuguese either hosted in a .pt domain, or hosted in a .com, .org, .net or .tv domain and referenced by a hyperlink from at least one page hosted in a .pt domain. The WPT05 collection is available in two formats:
• The RDF/XML version contains the metadata and text extracted from the retrieved Web documents of the following Internet media types: application/pdf, application/postscript, application/vnd.ms-office, text/html, text/plain and text/rtf. It has the Web pages arranged in a hierarchy following the OAI-ORE standard. All the extracted text is encoded in UTF-8 and each file is a valid XML file. The average document size is 3,000 characters. This version occupies 7.8 GB compressed and about 43 GB uncompressed.
• The ARC version contains the raw documents as they were retrieved, without any kind of post-processing, such as elimination of duplicate documents, detection and removal of non-text-rich documents, or encoding normalization. It adopts the ARC format10 from the Internet Archive. Searching for tools to handle ARC files is quite difficult, because the file extension was previously used by a popular lossless data compression and archival format developed by System Enhancement Associates. This version occupies 86 GB compressed and about 306 GB uncompressed.
Each record in an RDF file has the name of the corresponding ARC file containing the full document content. Therefore, the RDF files can be seen as indexes of the ARC records. Unfortunately, they fail to provide certain useful information, such as the uncompressed ARC file offset, which would greatly improve navigation within the ARC file. In order to provide the meta-data information, the ARC files are usually complemented by a DAT file11. This disjunction
8 http://meta.wikimedia.org/wiki/Bot (visited in Jun. 2009)
9 http://xldb.di.fc.ul.pt/wiki/WPT_05_in_English (validated in July 2009)
10 http://www.digitalpreservation.gov/formats/fdd/fdd000235.shtml (validated in July 2009)
11 http://www.archive.org/Web/researcher/dat_file_format.php (validated in July 2009)
of meta-data content was one of the reasons that led to the creation of the WARC format. Published in June 2009, ISO 28500:2009 specifies the WARC file format, which is a revision of the Internet Archive's ARC file format. The WARC format provides new possibilities, such as the recording of HTTP request headers, the recording of arbitrary metadata (e.g., language, encoding, topic classification, and readability level), the allocation of a globally unique identifier to every contained file, the storage of revisit events for migrated records, and the segmentation of records. Standardization is seen as a guarantee of durability and evolution for the WARC format. This argument is strengthened by several WARC-compliant applications:
• Heritrix12 - open-source Web crawler;
• warc-tools13 - library and tools to manipulate WARC files;
• NutchWAX14 - open source Web search software;
• Lemur Toolkit15 - open-source toolkit for language modeling and information retrieval.
The standardization, extensibility, and recent support by the Lemur Toolkit offered by the WARC format outweighed the advantages of developing code for the ARC format. The WPT05 ARC version was migrated to the WARC format with the help of warc-tools. During this process, 2 out of 871 files were found to be corrupted. Since these 2 files amount to about 0.23% of the whole corpus, they were simply removed from the collection.
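A WARC record starts with a version line followed by named header fields, including the globally unique WARC-Record-ID mentioned above. The sketch below parses such a header; warc-tools handles the full format, so this only illustrates the record layout.

```python
# Parse the header of a WARC record to read its per-record metadata
# (the WARC-Record-ID is the globally unique identifier required by
# ISO 28500). Minimal sketch; warc-tools handles the full format.
RECORD = (
    "WARC/1.0\r\n"
    "WARC-Type: response\r\n"
    "WARC-Record-ID: <urn:uuid:1b2c3d4e>\r\n"
    "Content-Length: 0\r\n"
    "\r\n"
)

def parse_warc_header(raw):
    head, _, _body = raw.partition("\r\n\r\n")
    lines = head.split("\r\n")
    version = lines[0]                              # e.g. "WARC/1.0"
    fields = dict(l.split(": ", 1) for l in lines[1:])
    return version, fields

version, fields = parse_warc_header(RECORD)
print(version, fields["WARC-Type"])
```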
3.3.3 School textbooks and exams classified by level
As in other versions of REAP, the standard unit for reading difficulty is the grade level. This
first version of REAP.PT is intended both for native high school students and non-native
(L2) students. Given that we had no access to enough materials for training distinct level
classifiers for the latter, we opted for training classifiers for levels 5-12. The training and test
corpora consist of 47 textbooks and exercise books. The statistics are shown in Table 3.1. Two books from each grade constitute the held-out test set. The same literary texts may be included in more than one textbook from the same level.
This test set of textbooks was complemented by a set of national exams16 for the 6th, 9th and 12th grades. Statistics about this test set are provided in Table 3.2.
12 http://crawler.archive.org/
13 http://code.google.com/p/warc-tools/
14 http://archive-access.sourceforge.net/projects/nutch/
15 http://www.lemurproject.org/
16 http://www.gave.min-edu.pt
Grade Level   #Books   #Word Tokens   #Word Types
5                5        367,584        18,048
6                6        436,814        21,409
7                6        510,350        25,859
8                5        434,814        21,409
9                7        862,754        31,944
10               8      1,163,924        40,966
11               5        962,800        36,427
12               5      1,085,640        36,229
Total           47      6,862,024        94,857
Table 3.1: Statistics of the school textbooks and exercise books corpus for each level.
Grade Level   #Exams   #Word Tokens   #Word Types
6                5         3,490          1,384
9                7         5,334          2,072
12               6         3,658          1,558
Total           18        12,482          4,024
Table 3.2: Statistics of the national exams corpus retrieved.
3.3.4 Focus word list
To retrieve documents from the Web, a list of words is required as search keywords. Additionally, students generally prefer more recent documents. One approach to obtaining such a list is to extract the words from a dictionary. However, using the whole dictionary as a list would consume an unfeasible amount of time and space. Hence, both the English and French versions of REAP use a smaller list.
In the English version, this word list was created by Coxhead [15]. It contains 3,000 words, selected because they are among the most often used in English academic writing. In the French version, the list was adopted from the “Dubois-Buyse” scale [67].
The list of words must fulfil some requirements: there must be an injective function that maps each word to a readability level; and the list should be made of ordinary vocabulary, leaving out slang, technical words, and function words (or grammatical words: articles, pronouns, conjunctions, auxiliary verbs, interjections, particles, expletives, pro-sentences, etc.), which are the most frequent words in texts although they carry very little, or even ambiguous, lexical meaning.
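These requirements can be checked mechanically. The sketch below validates a toy focus-word list: each word must carry a single readability level, and function words (here a tiny illustrative stop list) are rejected.

```python
# Validate a toy focus-word list against the requirements above: every
# word carries exactly one readability level, and function words (the
# tiny stop list below is illustrative) are excluded.
FUNCTION_WORDS = {"o", "a", "de", "que", "e"}   # illustrative subset

def validate_focus_list(entries):
    """entries: iterable of (word, level) pairs. Returns problems found."""
    problems, seen = [], {}
    for word, level in entries:
        if word in FUNCTION_WORDS:
            problems.append(f"function word: {word}")
        if word in seen and seen[word] != level:
            problems.append(f"ambiguous level: {word}")
        seen[word] = level
    return problems

good = [("casa", 5), ("universidade", 7)]
bad = [("casa", 5), ("casa", 8), ("de", 5)]
print(validate_focus_list(good))   # []
print(validate_focus_list(bad))
```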
Both the English and the French versions have word lists that are based solely on orthographic complexity; they do not consider other aspects, like the semantic difficulty
POS           #Lemmas   #Inflected Forms   #Inflected Forms (without pairs of words)
Adverb           222        227                 208
Preposition        2          –                   –
Conjunction        5          –                   –
Noun             935      3,132               3,132
Adjective        493          –                   –
Verb             443     22,871              22,871
Total          2,100     26,230              26,210
Table 3.3: Statistics about the List of Focus Words.
of words or their written or spoken frequency. The authors argue that frequencies are not always precise enough to determine the semantic difficulty of a word. They exemplify this with the pair of words knife and fork, which have frequencies of 1.65 × 10^-5 and 6.77 × 10^-6, respectively, according to the Cambridge English dictionary.
After obtaining the list of words, it is necessary to generate the morphological variants of each word, because we do not know the stemming algorithms of the most widely used Web search engines, namely AltaVista17. The REAP system uses AltaVista instead of other popular Web search engines, such as Google18 or Yahoo!19, since it gives users the option of selecting pages from a specific range of time, for example between Oct 5th, 1999 and Oct 5th, 2004. The other search engines, Google included, only allow locating pages from any time, or from the last year, month, week or day. This feature is important to reduce the number of repeated results and to speed up the workflow.
The English focus word list was the starting point for the elaboration of the Portuguese word list. The list, elaborated at Algarve University, is made of 2,100 lemmas. See Table 3.3 for more detailed statistics about the list.
The expansion of the lemmas into inflected forms was made using regular expressions for all POS except verbs, for which we used VerbForms20, a verb-tense generator.
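The regular-expression expansion can be sketched as follows; the two pluralisation rules shown are a deliberate simplification of the real rule set.

```python
# Expand lemmas to inflected forms with regular expressions, as done for
# the non-verb POS classes. The two pluralisation rules below are a
# deliberately simplified illustration of the approach.
import re

def expand_noun(lemma):
    forms = [lemma]
    if re.search(r"ão$", lemma):
        forms.append(re.sub(r"ão$", "ões", lemma))   # coração -> corações
    elif re.search(r"[aeiou]$", lemma):
        forms.append(lemma + "s")                     # casa -> casas
    return forms

print(expand_noun("casa"))      # ['casa', 'casas']
print(expand_noun("coração"))   # ['coração', 'corações']
```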
3.3.5 Cloze questions
The cloze questions, or fill-in-the-blank questions about words highlighted in the texts, are presented at the end of each reading session. The sentences used in the cloze questions are
17 http://www.altavista.com/
18 http://www.google.com/
19 http://www.yahoo.com/
20 https://www.l2f.inesc-id.pt/wiki/index.php/VerbForms
manually selected or created at Algarve University and L2F. The cloze question distractors, however, are chosen randomly from the list of focus words, using the part-of-speech (POS) classification to restrict the level of randomness. Selecting distractors based on POS is suggested by [29] as a simple way of avoiding obviously wrong answers, such as those with the opposite gender or number, or with a completely unrelated POS. For each focus word, described in Section 3.3.4, there are at least 5 questions, and this number is higher for polysemic words. Therefore, there will be more than 10,720 cloze questions. Currently, however, only 1,246 questions have been inserted in the system, because the full list of questions is not yet complete.
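Distractor selection can be sketched as random sampling restricted to the focus word's POS; the toy word list and tag set below are illustrative.

```python
# Pick cloze distractors at random, restricted to the focus word's POS,
# mirroring the distractor-selection strategy described above. The word
# list and tag names are illustrative.
import random

FOCUS = [("rápido", "ADJ"), ("lento", "ADJ"), ("feliz", "ADJ"),
         ("correr", "V"), ("casa", "N"), ("alto", "ADJ")]

def pick_distractors(answer, pos, n=3, rng=random):
    pool = [w for w, p in FOCUS if p == pos and w != answer]
    return rng.sample(pool, n)

rng = random.Random(0)           # seeded for a reproducible example
print(pick_distractors("rápido", "ADJ", rng=rng))
```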
3.4 Documents filtering
The document filtering task deals with large collections of data, as Section 3.3.2 describes, which require a significant amount of resources, both in terms of processing time and space. Of course, this is a recurrent problem in Natural Language Processing (NLP). Distributed data-parallel computing is one of the most common approaches to solving such problems. There are several parallel programming models that fit in this category, and they are discussed in [41]. This paper also suggests the Hadoop21 framework as the most suitable one for NLP tasks.
Following that suggestion, we analysed the several options described before. Simple job schedulers, such as Condor22, were excluded because they do not move computations closer to their input data. In other words, they lack a distributed file system that provides information about the position of the multiple fragments of a given file. Their integration with MapReduce frameworks can nevertheless be useful: for example, Hadoop On Demand bonds Hadoop with schedulers like Condor and PBS23 to allow fair and efficient use of a cluster. Otherwise, the Hadoop system assumes that a cluster is made of dedicated servers, and as a result it lacks a scheduling policy.
Besides MapReduce frameworks, none of the other parallel programming models supports a distributed file system; therefore, they were automatically excluded.
Despite the fact that Dryad extends the MapReduce programming style with dataflow graphs to solve computation tasks, we consider that the extra complexity it introduces is not worthwhile. In addition, it is built over the .NET platform and is tied to Microsoft Server 2008. Mono 2.424, an open source UNIX version of the Microsoft .NET development platform, does not fully support .NET 3.5, nor anything else related to the .NET initiative, such as Passport or
21 http://hadoop.apache.org/core/ (validated in July 2009)
22 http://www.cs.wisc.edu/condor/
23 http://www.pbsgridworks.com/
24 http://mono-project.com/ (validated in July 2009)
software-as-a-service, where Dryad fits. As a result, Dryad is not currently available for UNIX systems, which was considered a requirement because the INESC-ID cluster only runs Linux (OpenSuse 10 and 11). Thus, we decided to develop our solution over Hadoop; Kimball et al. [63] made the same decision. The following Section 3.4.1 briefly describes Hadoop and explains some of the options involved.
3.4.1 Hadoop
Apache Hadoop is an open source project that, at its core, provides two main abstractions: a distributed file system, HDFS, and a MapReduce [19] programming framework. HDFS is a distributed file system designed for batch-processing applications that need to store very large files across the machines of a large cluster. These goals lead to a list of assumptions:
• Handling hardware failures through replication and checksums;
• “Moving computation is cheaper than moving data”: this holds for large data sets, where transferring the data between network nodes causes network congestion and consequently decreases the overall throughput of the system;
• Relaxed file-access semantics based on the write-once-read-many access model, which allows relaxing some consistency requirements of the POSIX API. See Ananthanarayanan et al. [4] for more details, as well as an interesting comparison between HDFS and GFS (the Google File System), an HDFS precursor;
• Optimization for large read-only files;
• Portability across several hardware and software platforms.
Hadoop MapReduce tasks store their final and intermediate outputs in HDFS. A MapReduce job usually splits the input dataset into independent chunks, which are processed by the map tasks in parallel. A map task, also called a mapper, executes a user function to transform input (key, value) pairs into a new set of (key, value) pairs. The framework sorts the outputs of the maps and forwards them to the reduce tasks (reducers). A reduce task combines all (key, value) pairs with the same key into new (key, value) pairs. Finally, the reduced outputs are stored in a file system. If the number of reducers is set to zero, the reduce step is not executed and the output of the mappers is considered final. Our filtering execution uses all mappers available in the cluster and zero reducers. The statistics about filtering and classification use counters added to Hadoop's Reporter object. The Reporter is a way for a MapReduce application to report its progress, to let the cluster know it is still alive, and to provide information to the user.
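The map-only configuration can be simulated in plain Python: each document passes through a mapper that either emits it unchanged or drops it, while counters record statistics as Hadoop's Reporter counters would. The 300-word threshold anticipates one of the filters described in the next subsection.

```python
# Plain-Python simulation of a map-only filtering job: every document
# goes through a mapper, rejected ones are simply not emitted (zero
# reducers), and counters record statistics as Hadoop's Reporter would.
from collections import Counter

counters = Counter()

def mapper(doc_id, text):
    """Emit the (key, value) pair unchanged, or nothing if filtered."""
    if len(text.split()) < 300:
        counters["TOO_SHORT"] += 1
        return []                 # document filtered out of the pipeline
    counters["ACCEPTED"] += 1
    return [(doc_id, text)]

docs = [("d1", "word " * 500), ("d2", "too short")]
kept = [pair for d in docs for pair in mapper(*d)]
print([k for k, _ in kept], dict(counters))
```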
3.4.2 Chain of Filters
The filtering process follows the Chain of Responsibility design pattern [24]. This pattern is particularly suitable for filtering because the filters forward each document along the chain until one filter marks the document for removal. As soon as a document is excluded, it is removed from the processing pipeline. The filtering sequence of the pipeline is designed to maximize processing speed. The chain of filters is made of:
1. Text/HTML MIME-type documents are the only type of documents accepted in the REAP.PT Web interface. Hence, the first filter removes all documents not belonging to the text/html MIME type;
2. The second filter removes small documents (fewer than 300 words). It is followed by the profanity-words filter, whose purpose is to detect documents containing obscene language. It removes documents containing words from a list of 160 words we created, based on our knowledge of rude words and borrowing a small subset of words from the Dicionário aberto de calão e expressões idiomáticas25 (Open Dictionary of Slang and Idiomatic Expressions);
3. The next filter removes documents lacking at least 3 words from the focus word list (see Section 3.3.4 for more details);
4. Some of the documents that passed through the previous filters may contain nothing but lists of words. To trim them out, we compare the proximity between each document and a reference document (a 255,557-word extract of CETEMPublico26). The proximity metric assumes that the document being processed and the reference are represented as vectors: each POS 3-gram is a dimension and its number of occurrences is the magnitude. Thus, the proximity between two vectors d^a and d^b can be calculated using the cosine similarity:

                           sum_{i=1}^{n} d^a_i * d^b_i
   cos(d^a, d^b) = -----------------------------------------------------------   (3.1)
                   sqrt(sum_{i=1}^{n} (d^a_i)^2) * sqrt(sum_{i=1}^{n} (d^b_i)^2)
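Equation 3.1 can be checked numerically over POS 3-gram count vectors; the tag sequences below are illustrative.

```python
# Numeric check of Equation 3.1: cosine similarity between the POS
# 3-gram count vectors of a candidate document and a reference document.
import math
from collections import Counter

def pos_trigrams(tags):
    """Each POS 3-gram is one dimension; its count is the magnitude."""
    return Counter(zip(tags, tags[1:], tags[2:]))

def cosine(da, db):
    dot = sum(da[k] * db[k] for k in da)
    na = math.sqrt(sum(v * v for v in da.values()))
    nb = math.sqrt(sum(v * v for v in db.values()))
    return dot / (na * nb) if na and nb else 0.0

reference = pos_trigrams(["DET", "N", "V", "DET", "N", "ADJ"])
prose     = pos_trigrams(["DET", "N", "V", "DET", "N"])
word_list = pos_trigrams(["N", "N", "N", "N", "N"])

print(round(cosine(prose, reference), 3))      # high: prose-like structure
print(round(cosine(word_list, reference), 3))  # 0.0: a bare list of nouns
```

A document whose similarity to the reference falls below a threshold is treated as a word list and removed.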
The integration of the POS tagger was another challenge. Currently, the NLP chain (XIP-L2F) aggregates the following tools:
• Palavroso [51] is a POS tagger, also called a morphological analyser;
25 http://natura.di.uminho.pt/jjbin/dac
26 http://www.linguateca.pt/cetempublico/
• MARv [60] is a probabilistic disambiguation module;
• RuDriCo (an improved and extended version of PAsMo [35]) is a post-morphological analyser that rewrites the results of the POS tagger by applying transformation rules based on pattern matching;
• XIP [1] (Xerox Incremental Parser) is an analyser for syntactic dependency extraction.
The XIP-L2F chain (Figure 3.5) provides a detailed text analysis, containing syntactic information and some named-entity recognition (NER), which is not relevant for this filter and is time-consuming. In addition, there is the overhead of data conversion between the 4 modules. For a complete POS disambiguation, it is necessary to run the whole chain except for the XIP tool. We roughly estimate that 48 hours would be needed to classify 1 of the 77 RDF files of the WPT05 collection. This estimate is based on a linear interpolation of the time needed to classify 10 documents running on a single machine (system details in Section 4.1.4).
Thus we searched for another POS tagger, either one already available for Portuguese or one that could be trained for this purpose. POS taggers can be classified into three categories [30]: rule-based, stochastic, or transformation-based learning approaches. Palavroso fits in the first category, where a set of hand-written rules assigns one or several tags to each word. Stochastic (or probabilistic) taggers are based on first/second order Markov models or on Maximum Entropy. The transformation-based learning approaches are hybrid versions combining the rule-based and stochastic approaches, e.g., Palavroso + MARv + RuDriCo. The only other POS tagger found for Portuguese was Tree Tagger27 (implemented in C), which provides a model for Portuguese and Galician28. The number of tags and the quality of the provided model are considerably low. Therefore, we decided to train the OpenNLP tagger for Portuguese. We chose OpenNLP29 over Tree Tagger to reduce integration costs. We already had the filter threshold value for the POS 3-gram filter based on the tags output by the XIP-L2F chain (without running the XIP module), so the XML output was converted to meet the required training file format. The output of XIP-L2F contains multi-word lexical units, which needed to be expanded to fit the one-tag-per-word requirement. Since expanding all multi-word lexical units would have required a considerable amount of time, it was decided to assign the noun tag to words that did not have a tag, because most of the words in multi-word lexical units are nouns, e.g., “Universidade Nova de Lisboa”, “Presidência da República”, “Bilhete de Identidade”.
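The expansion step with the noun fallback can be sketched as follows (an illustrative reconstruction; the `(surface, tag)` pair representation is hypothetical, not the actual conversion script):

```python
def expand_multiword_units(tagged_units):
    """Expand multi-word lexical units into one (word, tag) pair per word.

    Each unit is a (surface_form, tag) pair. When a multi-word unit is
    split, each word inside it receives the noun tag ("NOUN") as a
    fallback, since most words in such units are nouns.
    """
    pairs = []
    for surface, tag in tagged_units:
        words = surface.split()
        if len(words) == 1:
            pairs.append((surface, tag))
        else:
            # heuristic: tag every word of the multi-word unit as a noun
            pairs.extend((w, "NOUN") for w in words)
    return pairs

units = [("a", "DET"), ("Universidade Nova de Lisboa", "NOUN"), ("fica", "VERB")]
print(expand_multiword_units(units))
```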
27 http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/ (validated in July 2009)
28 http://gramatica.usc.es/~gamallo/ (validated in July 2009)
29 http://opennlp.sourceforge.net/
[Figure 3.5 shows the XIP-L2F chain as a pipeline: the input text goes through Palavroso, whose output (input + ambiguous POS tags) feeds RuDriCo; RuDriCo's rewritten output (input' + ambiguous POS tags) feeds MARv, which disambiguates the POS tags (input'' + POS); the result then feeds XIP, which produces the final output.]
Figure 3.5: XIP-L2F Chain, based on [53].
3.5 Readability

The readability module available for English classifies texts on an ordinal scale from five (fifth grade or first year of elementary school) to 12 (twelfth grade or last year of high school). It annotates the downloaded documents with a readability level. This module is included in the last stage of filtering.

The first step in developing the readability classifier is to have a corpus of documents that are theoretically well classified; for example, texts in books for third-grade students are assumed to have readability level 3. The model will be trained on the school textbooks corpus described in Section 3.3.3. The baseline model is based on lexical features, such as statistics of word unigrams. It can be improved with grammatical features, as in the English version of REAP [68], through the use of a syntactic parser. The French version did not include this feature, because no free syntactic parser was found for French.
Our experiments with Support Vector Machines (SVMs) were made using the SMO tool implemented in the WEKA [74] machine learning toolkit. At this stage, no lemmatization was adopted. The use of lemmatization needs further research, as some verbal forms (nowadays mostly found in literary texts rather than in everyday conversation) may influence reading difficulty.
While grade levels are assigned evenly spaced integers, the ranges of reading difficulty corresponding to these grades are not necessarily evenly spaced. In order to take this into account, Heilman et al. [32] tested models under increasingly strong assumptions about the relationships between grade values: nominal, ordinal, interval, and ratio. The best results were obtained using the Proportional Odds (PO) Model [49], which assumes an ordinal relationship. This justifies adopting the PO model in this project.
The creation of the readability models comprises 4 steps:
Frequency      Number of    Function   Names   Correlation   Root Mean
of Features    Features     Words              Coefficient   Squared Error
0.000010       10000        ✓          ✓       0.9482        0.7886
0.000030       4400 (a)     ✓          ✓       0.9564        0.6765
0.000005       15000        ✓          ✓       0.9476        0.8396
0.000005       25000        ✓          ✓       0.9175        0.9243
0.000010       11800 (a)    ✓          ✗       0.9486        0.8228
0.000030       5000         ✓          ✗       0.9235        0.8972
0.000010       10000        ✗          ✓       0.9290        0.9087
0.000030       5000         ✗          ✓       0.9508        0.7307
0.000010       10000        ✗          ✗       0.9291        0.9351
0.000030       10000        ✗          ✗       0.9217        0.9116

(a) This value corresponds to the maximum number of features available according to the imposed restriction.

Table 3.4: Readability Classifier training results using 10-fold cross-validation.
3.5.1 Creation of the Training Data File List
The system needs a list of training files which will be used to create its readability models.
Our training list includes the school textbooks described in Section 3.3.3 except two from
each grade level, which were used for testing.
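The construction of this split can be sketched as follows (an illustrative sketch; the grade/filename pair representation and naming scheme are hypothetical, not the real corpus layout):

```python
from collections import defaultdict

def split_by_grade(textbooks, held_out_per_grade=2):
    """Split (grade, filename) pairs, holding out N textbooks per grade for testing."""
    by_grade = defaultdict(list)
    for grade, name in textbooks:
        by_grade[grade].append(name)
    train, test = [], []
    for grade in sorted(by_grade):
        files = sorted(by_grade[grade])
        test.extend(files[:held_out_per_grade])    # first N books go to the test set
        train.extend(files[held_out_per_grade:])   # the rest are training data
    return train, test

books = [(5, f"grade5_book{i}.txt") for i in range(4)] + \
        [(6, f"grade6_book{i}.txt") for i in range(3)]
train, test = split_by_grade(books)
print(len(train), len(test))  # 3 4
```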
3.5.2 Generation of a Feature and Data Set

The second step comprises the creation of a feature set of the most common features, which serves as a proxy for better feature selection. These can be either lexical or grammatical features. Currently only lexical features were explored in this REAP.PT version; the contribution of grammatical features is left for future work. It is necessary to specify the thresholds for the minimum feature frequency and for the maximum number of features to choose. 0.0001 and 10,000 were the recommended values for the English version. Of course, these values are heavily dependent on the amount of training data, because the frequency of words (unigrams) decreases as the number of words in the corpus increases. Table 3.4 shows the results of some of the experiments conducted to determine the best values. In addition, we also analysed the effects of names and function words on the readability classifier. The feature set is created with the most common features. The next step is to generate a dataset in tab-delimited format (.csv) that can be read by WEKA. The feature set can be viewed as an intermediate step, created to improve the interchangeability of dataset modules and feature selectors.
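The two thresholds can be illustrated with a small sketch (our own simplified version; WEKA's actual feature generation differs in detail):

```python
from collections import Counter

def select_unigram_features(documents, min_rel_freq=0.0001, max_features=10000):
    """Keep the most common word unigrams whose relative corpus frequency
    is at least min_rel_freq, capped at max_features features."""
    counts = Counter(w for doc in documents for w in doc.split())
    total = sum(counts.values())
    frequent = [(w, c) for w, c in counts.most_common()
                if c / total >= min_rel_freq]
    return [w for w, _ in frequent[:max_features]]

docs = ["o gato dorme", "o cao dorme", "o gato come"]
print(select_unigram_features(docs, min_rel_freq=0.2, max_features=3))
```

With the recommended values for the English version, only unigrams covering at least 0.01% of the corpus survive, and at most 10,000 of them become feature dimensions.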
3.5.3 Training a WEKA model

The dataset is loaded into WEKA to create a regression model. In practice, SVM regression (SMOreg) appears to be the most effective type of model for both English and Portuguese. SVMs have two main parameters which can be fine-tuned by the user: the complexity parameter (C), which controls the trade-off between the margin width and the classification error, and the kernel function. The SVM formulation does not include criteria for selecting a kernel function that provides good generalisation. Chin [12] discusses the subject of tuning the various user-configurable parameters in SVMs and includes experimental results from applying them to speech pattern classification. Although we experimented with the six kernel functions available in WEKA, the SVMs only converged for the Polynomial and Radial Basis Function kernels. The Radial Basis Function provided the best results, which can be found in Table 3.4. During the SVM training we targeted a low root mean squared error, as it is a good measure of accuracy.
3.5.4 Creating and Testing a Readability model

Wrapping the WEKA model is the last step in creating a Readability model, which can then be used to classify text documents. Table 4.2 shows the test results.
3.6 Topic classification

Another feature included in this module is topic categorization. The data used to create the topic models for both the English and the French versions came from the Open Directory Project (ODP)30. Topic models were created using Support Vector Machines (SVMs); the SVMLight31 public-domain toolkit was used for this purpose.

Unfortunately, the number of Portuguese documents in the ODP is significantly smaller than for English or French. At the Spoken Language Systems Lab of INESC-ID, topic classification has been applied to stories in Broadcast News. This was first done using a hierarchical thesaurus adopted by the European Broadcast News [3] and later with a much simpler topic list adopted by a media watch company [2]. All stories are indexed using 10 topic labels:
• Economy
• Education
30 http://www.dmoz.org/
31 http://svmlight.joachims.org/
• Environment
• Health
• Justice
• Meteorology
• Politics
• Security
• Society
• Sports
Besides these 10 main topics, the classifier includes two extra topics: National and International. REAP.PT currently integrates the 10-class classifier due to the better quality of its trained models, although the set of topics is not the most adequate for teaching purposes. The topic classification method can be retrained for other domains more suitable for REAP.PT, depending on the availability of manually classified training data from such domains. For each of these 10 classes, topic and non-topic unigram language models were created using the stories of the media-watch corpus, which were pre-processed in order to remove function words and lemmatize the remaining ones. Topic classification is based on the log likelihood ratio between the topic likelihood p(W|Ti) and the non-topic likelihood p(W|T̄i). A topic is detected in a story whenever the corresponding score is higher than a predefined threshold. The threshold is different for each topic, in order to account for differences in the modeling quality of the topics. The average accuracy is 91.8% on a held-out test set for the 10 topic labels. Expanding the classification to the 12 topic labels lowers the average accuracy to 90.7%. The original classifier receives as input an XML file generated by the AUDIMUS ASR [54][55].
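A simplified sketch of this scoring scheme follows (the add-α smoothing and the toy language models are ours, for illustration only; the actual classifier's smoothing and pre-processing are not shown):

```python
from math import log

def llr_score(words, topic_lm, nontopic_lm, vocab_size, alpha=1.0):
    """Sum over words of log p(w|T) - log p(w|~T), with add-alpha smoothing."""
    t_total = sum(topic_lm.values())
    n_total = sum(nontopic_lm.values())
    score = 0.0
    for w in words:
        p_t = (topic_lm.get(w, 0) + alpha) / (t_total + alpha * vocab_size)
        p_n = (nontopic_lm.get(w, 0) + alpha) / (n_total + alpha * vocab_size)
        score += log(p_t) - log(p_n)
    return score

def detect_topics(words, models, thresholds):
    """A topic is detected whenever its score exceeds its own threshold."""
    detected = []
    for topic, (t_lm, n_lm, vocab) in models.items():
        if llr_score(words, t_lm, n_lm, vocab) > thresholds[topic]:
            detected.append(topic)
    return detected

# Toy unigram counts standing in for the media-watch topic models.
sports_lm = {"golo": 50, "jogo": 30, "equipa": 20}
non_sports_lm = {"governo": 40, "jogo": 5, "economia": 55}
models = {"Sports": (sports_lm, non_sports_lm, 1000)}
print(detect_topics(["golo", "jogo", "equipa"], models, {"Sports": 0.0}))
```

The per-topic thresholds play the role described above: each topic can be tuned separately to compensate for its modeling quality.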
The topic classifier extracts both the word content and the confidence parameter (conf.), whose value corresponds to the level of confidence estimated by the ASR for that word. In order to adapt the topic classifier to the REAP.PT documents, the classifier, previously written in Python, was rewritten in Java. Several factors led to this decision. First of all, the start-up time of the Python classifier is about 15 seconds (result obtained using Python 2.5.1 on an Intel(R) Xeon(R) X5355 @ 2.66 GHz, with 12 GB RAM and an 80 GB 7200 rpm HD, under openSUSE 11.1). During the start-up time, all topic models (about 35 Megabytes) are loaded into memory. The Python I/O library performance can be pointed out as the main reason for this slow result. Although the latest version of Python (3.0) provides a new I/O library32, Python 3.0 runs the pystone benchmark 10% slower than Python 2.5 and it
32 http://docs.python.org/3.0/whatsnew/3.0.html#performance
breaks compatibility with older versions. One of the first options explored was using the Jpythonc utility to compile the Python source code to Java. Jpythonc is part of JPython, an implementation of a compiler that compiles Python source code down to Java bytecode, which can run directly on a JVM; it also includes a set of support libraries used by the compiled Java bytecode. JPython has recently been renamed to Jython. Unfortunately, the Java code produced by Jpythonc did not work properly: the Python source was written in a procedural rather than object-oriented style, which caused mapping problems when the Python code was mapped to Java objects. Integrating the topic classifier with the chain of filters written in Java was another motivation. Finally, we also tested wrapping the Python code in a Java class through system calls, but we faced integration problems with Apache Hadoop Core. See Section 3.4 for more details about Hadoop and the filtering process.
4 Evaluation
Although we cannot yet report field trial results, we shall attempt to evaluate the different modules of REAP.PT separately. The original evaluation plan included evaluating REAP.PT in the field (classrooms). These field trials, which should ideally involve adults, were scheduled for the first week of June 2009. Unfortunately, they were not performed, for several reasons. In the first place, the Portuguese academic word list (focus word list) was not completed, nor were the associated cloze questions. In the second place, the chain of filters was not finished, as the integration of the just-list-of-words filter was only completed in the first week of June. It was therefore recognized that the system was not yet ready for field trials. The hypothesis of holding the field trials in the first weeks of August 2009, during Portuguese courses at the University of the Algarve, is still being explored. The field trials are relevant to evaluate the system in terms of usability and usefulness. To analyse whether the system is useful for a student, we would like to have two classes of 30 or more students each (the minimum sample size to allow a normal distribution approximation), with similar backgrounds and similar learning capacities. Only one class would have access to the system. After at least two or three weeks, a test would be given to the students to evaluate their learning progress. Based on these results we can infer the usefulness of the system. Regarding usability, after each reading the student is asked about the level of interest and difficulty of the given text. Of course, this questionnaire is going to be expanded to include more questions about usability. Other techniques of gathering information may also be used, such as video recording. While video recording allows capturing a huge amount of information, it also introduces some disadvantages: it can generate more information than we have the capacity to analyse, and it frequently causes the Hawthorne effect (a change in behavior when people know they are the object of study).
4.1 Experimental Results

The experimental results are divided into three subsections: readability classifier results, topic classifier results, and chain of filters results.
4.1.1 Readability classifier results

It is not clear what the best measure for analyzing the quality of the classifier is, as Heilman et al. [32] note. However, in order to establish a comparison with the work done at LTI, the metrics chosen were root mean squared error (RMSE), the Pearson product-moment correlation coefficient, and accuracy within 1 grade level.

RMSE (also called root mean squared deviation, RMSD) is a commonly used measure of the differences between the values predicted by a model and the values observed. In the framework of this project, it can be understood as the average number of grade levels by which predictions diverge from the manually assigned text grade level labels. RMSE is a good measure of accuracy because it strongly penalizes errors that are further away from the expected value. For further details about RMSE see [8].
The Pearson product-moment correlation coefficient, or just Pearson's correlation coefficient [61], is a common measure of correlation, i.e., of linear dependence or similarity of tendency, between two random variables X and Z. Despite its name, it was first introduced in the 1880s by Francis Galton. It assumes values on a scale between +1 and -1. A high correlation (near or equal to +1) indicates that difficult texts would probably receive high predicted grade values and easier texts low ones. A correlation near or equal to zero would indicate no relationship between grade levels and text difficulty. A correlation around -1 would indicate an anomalous relationship in this case, because students in higher school years are expected to read more difficult texts. Thus the scale will be restricted to values between 0 and 1, inclusive.
Adjacent accuracy within 1 grade level is the percentage of predictions that are equal to, or within one grade level of, the manually assigned label. Measuring strict accuracy is considered too demanding, because manually assigned labels may not always be flawless or consistent. For example, one school class might read the Portuguese book O Cavaleiro da Dinamarca (The Knight of Denmark) by Sophia de Mello Breyner Andresen in the 6th grade while another might read it in the 7th grade. The main drawback of this accuracy metric is that it treats predictions that are two levels off the same as predictions that are eleven levels off, a distinction which is relevant to evaluate whether the classifier is giving acceptable or completely wrong classifications.
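The three metrics can be sketched in a few lines of Python (our own illustrative implementation, not the evaluation scripts actually used):

```python
from math import sqrt

def rmse(pred, gold):
    """Root mean squared error between predicted and gold grade levels."""
    return sqrt(sum((p - g) ** 2 for p, g in zip(pred, gold)) / len(gold))

def pearson(pred, gold):
    """Pearson product-moment correlation coefficient."""
    n = len(gold)
    mp, mg = sum(pred) / n, sum(gold) / n
    cov = sum((p - mp) * (g - mg) for p, g in zip(pred, gold))
    sp = sqrt(sum((p - mp) ** 2 for p in pred))
    sg = sqrt(sum((g - mg) ** 2 for g in gold))
    return cov / (sp * sg)

def adjacent_accuracy(pred, gold, tolerance=1):
    """Fraction of predictions within `tolerance` grade levels of the label."""
    return sum(abs(p - g) <= tolerance for p, g in zip(pred, gold)) / len(gold)

gold = [5, 6, 7, 8, 9]
pred = [5, 7, 7, 9, 12]
print(rmse(pred, gold), pearson(pred, gold), adjacent_accuracy(pred, gold))
```

In the example, the single three-level error inflates the RMSE far more than it lowers the adjacent accuracy, illustrating the complementary behaviour of the two metrics.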
Table 4.1 summarizes the evaluation metrics for the first stage.
Experimental results from the readability classifier are divided into two parts: the first part includes the training and testing results, while the second shows the results of classifying all WPT05 documents not excluded by the chain of filters.
With respect to the first part, the readability results are shown in Table 4.2 for the 10-fold cross-validation set, the held-out test set, and the exams test set. The 10-fold cross-validation
Measure        Description                          Details                                      Range
Pearson's      Analyzes the strength of the         Measures the tendency, but does not          [0, 1]
Correlation    relationship between predictions     assert the degree of variation between
Coefficient    and grade level labels.              classifier predictions and hand-classified
                                                    grade level labels.
Adjacent       Percentage of predictions that       Near-miss predictions are treated in the     [0, 1]
Accuracy       were within 1 grade of the label.    same manner as predictions that are
                                                    eleven levels off.
RMSE           Square root of the mean squared      Strongly penalizes large errors.             [0, ∞[
               difference of predictions from
               labels.

Table 4.1: First stage evaluation metrics.
set is obtained from WEKA when the SVM training ends. The held-out test set is made of two school textbooks (described in Section 3.3.3) per grade level. Before training the readability classifier, we divided the school textbooks corpus into training and test sets: the test set had two school textbooks per grade level, and the training set had all the remaining school textbooks. We found that testing our classifier on a national exams test set (also described in Section 3.3.3) would be a remarkable way to evaluate its performance. It is interesting to notice that for most exams the assigned level is either correct or one level below. As expected, the correlation results of about 0.9 show a strong correlation between the grade level and text difficulty (readability level). The RMSE results show that, on average, the documents classified in the test sets diverge one level from the expected readability level. A team of linguists was asked to perform the same task, and the results are comparable. Regarding the adjacent accuracy, we obtained a worse result for the exams test set because several exams were classified with a lower readability level, which also helps explain the higher RMSE of the exams test set when compared with the other two test sets. It is important to note that although the exams were classified at the correct level or one level below, we cannot infer exam difficulty based only on text difficulty; we would also need to analyze the semantic difficulty of the questions and/or establish a relationship with the students' scores. Such analysis is outside the scope of this work.
The second part comprises the readability distribution of the WPT05 corpus, shown in Figure 4.1. Considering that in 2005 compulsory school attendance was still defined at the 9th grade, the school drop-out rates (see Statistics Portugal [65] for more details), and the number of internet accesses in Portugal (see Statistics Portugal [64] for more details), we expected that most documents would be classified between readability levels 9 and 11. This expectation was confirmed, as about 71% of the documents were classified between
                        Correlation   RMSE    Adjacent Acc.
Cross-validation set    0.956         0.676   0.876
Held-out test set       0.994         0.448   1.000
Exams test set          0.898         1.450   0.550

Table 4.2: Evaluation of the readability classifier.
Figure 4.1: WPT05 readability distribution.
readability levels 9 and 11. It is also interesting to notice that the documents classified between levels 5 and 7 totalled about 6%.
4.1.2 Topic classifier results

The topic classifier was initially designed for, and has been applied to, stories in Broadcast News. Broadcast News audio recognized by the AUDIMUS ASR was the training material of the classifier. Although we knew that broadcast news and web documents are two different domains, the classifier was applied without retraining to the WPT05 documents.

The results did not meet the minimum level of quality suitable for presentation, due to the specific format of the WPT05 documents. The WPT05 documents contain HTML which has been pre-processed by removing the HTML tags. However, this pre-processing does not remove some undesirable content, such as menus, lists of links (which often contain URLs), and date and time information. Figure 4.2 exemplifies this problem, showing one of the documents found in the WPT05 collection.
<rdf:Description rdf:about="http://xldb.di.fc.ul.pt/linguateca/primeira_proposta.html">
...
<dc:format rdf:resource="text/html" />
<wpt:arcName rdf:resource="WPT-9-20080822122528-00677" />
<wpt:filteredText>
XLDB Group - primeira proposta
fcul
Home | Publications | Members | Tumba! | GREASE | Linguateca-XLDB | ReBIL | XMLBase
XLDB Group primeira proposta
...
O primeiro passo a dar foi estabelecer e caracterizar as entidades que pretendemos
identificar e de que forma serao anotadas.
..
Naquele ano, as Brigadas Vermelhas ( BR ) estavam no auge da actividade terrorista, ...
...
Na confusao que se segue, parte um primeiro tiro, depois um segundo, e os dois homens
caem ao mar.
Autora : Cristina ...
Data da ultima revisao : 26 de Fevereiro de 2003
http://xldb.fc.ul.pt Home | Publications | Members | Tumba! | GREASE | Linguateca-XLDB
| ReBIL | XMLBase
http://xldb.fc.ul.pt Home | Publications | Members | Tumba! | GREASE | Linguateca-XLDB
| ReBIL | XMLBase
</wpt:filteredText>
</rdf:Description>
Figure 4.2: Example of a WPT05 document.
4.1.3 Chain of filters results

As described in the previous chapter, the process of filtering and classifying documents by readability and topic was executed on a cluster (described in Section 4.1.4). The run time of this process is about 30 hours. The number of documents excluded by each filter is shown in Figure 4.3. Not surprisingly, about 58% of the documents are small texts with fewer than 300 words. Regarding the "just list of words" filter, 0.8 was the cosine similarity threshold value that yielded the best results. The cosine similarity between documents is calculated based on the POS 3-grams of the documents, where 0 means a document shares no POS 3-grams with the reference and 1 means it contains all of them. Although this filter is responsible for excluding about 29% of the documents (Figure 4.4), a longer document containing both a large section of text and a list of words can still pass the filter. Thus an extra filter to analyse documents containing both text and lists of words is left as future work. The significant percentage of documents excluded by the "just list of words" filter is explained by the huge amount of undesirable content found in the WPT05 collection (see Section 4.1.2 for more details).
Only about 3.5% of all documents passed all filters.
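The chain of filters can be modelled as a sequence of predicates (a simplified sketch: the real chain runs as Hadoop map tasks and starts with an HTML check; the word lists and the `is_list_of_words` predicate below are placeholders):

```python
def passes_filters(doc, profanity, focus_words, is_list_of_words):
    """Return True if a document survives the whole chain of filters.

    `doc` is a dict holding the extracted text; `is_list_of_words` is a
    predicate standing in for the POS-3-gram cosine similarity test
    (threshold 0.8 against the reference document).
    """
    words = doc["text"].lower().split()
    if len(words) <= 300:                         # too short to be useful
        return False
    if any(w in profanity for w in words):        # contains profanity
        return False
    if not any(w in focus_words for w in words):  # no focus word to teach
        return False
    if is_list_of_words(doc["text"]):             # "just list of words" filter
        return False
    return True                                   # goes on to the readability classifier

long_text = "escola " * 200 + "aprender " * 150
doc = {"text": long_text}
print(passes_filters(doc, {"palavrao"}, {"aprender"}, lambda t: False))
```

Ordering the cheap length check first and the expensive POS-based check last mirrors the observed exclusion statistics: most documents are rejected before the costly filters ever run.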
Figure 4.3: Statistics about the WPT05 filtering process.
4.1.4 Cluster details

The document filtering and classification processes were performed on a cluster of 20 machines. Each machine has an Intel Quad-Core Q6600 processor and 8 GB of DDR2 RAM at 667 MHz, and is connected to a gigabit Ethernet network. The Hadoop framework is installed on these machines, and each one has been configured to run 8 map tasks and 8 reduce tasks simultaneously. The Hadoop framework uses HDFS as storage. This file system has been configured to split each file into 64 MB chunks. These blocks are replicated across machines in the cluster in order to tolerate machine failures; the HDFS replication factor is three.
[Figure 4.4 shows the chain of filters as a flowchart. Starting from the full WPT05 collection (100%): 2.67% of the documents are not HTML and are excluded; of the remaining 97.33%, 58.10% have fewer than 300 words and are excluded; 6.70% contain profanity words; 0.46% contain no focus words; 28.57% are "just lists of words"; the remaining 3.50% proceed to the readability classifier.]
Figure 4.4: Chain of Filters details.
5 Conclusions
This last chapter presents the final remarks of this thesis, summarizing the work that was carried out. It concludes by presenting some ideas for future work.
5.1 Final remarks
In this work, REAP, described in Chapter 3, was progressively ported to a Portuguese version named REAP.PT. REAP is a tutoring platform that may integrate a large number of language processing tools and resources. Despite the great amount of work already done, there is still much room to enhance the Portuguese version. Nevertheless, we have already met the minimum requirements for progressing to the first field trials, which may be scheduled for August 2009 at the University of the Algarve. REAP has also been ported to a prototype French version. Porting to Portuguese has extended our experience of the general issues encountered when porting this software to other languages. Apart from encoding issues, an obvious difficulty is the relative lack of computational linguistic resources in languages other than English. For example, when building a readability and/or topic classifier, a stemmer might be needed. In addition, in order to use syntactic features in the classifier, a syntactic parser is required. Although lexical features might be enough, it could be argued that in some languages syntactic features are more important, for example in languages that are morphologically richer than English. If a set of Web pages is not previously available, one may need to generate queries to crawl the Web, which may mean applying a morphological generator to the focus words for query expansion. We suggest that the reader use the ClueWeb091 data set as a starting point to create a version of REAP in another language or to develop another CALL tool. ClueWeb has Web pages in 10 languages: English, Chinese, Spanish, Japanese, German, French, Korean, Italian, Portuguese, and Arabic. However, if the target language is European Portuguese and quality guarantees are desired, it is necessary to use a news media corpus, such as CETEMpúblico2. We have already started crawling a corpus, initially restricted to the Euronews3 and PressEurop4 Web pages. Using these Web pages
1 http://boston.lti.cs.cmu.edu/Data/clueweb09/ (validated in July 2009)
2 http://www.linguateca.pt/cetempublico/
3 http://www.euronews.net/
4 http://www.presseurop.eu/
has an important advantage over other news media: they provide the same news in 8 and 10 languages, respectively, which may be extremely helpful if we integrate Automatic Translation Tools in a future version of REAP.PT.

A syntactic parser and a morphological generator were harder to find for French and Portuguese (in the absence of XIP-L2F), and the same can probably be said for many other languages. A POS tagger should also be used to measure text quality. Again, there were no directly available tools for French or Portuguese, which motivated their adaptation from English. The training materials for the readability and topic classifiers are two other very important resources that are not always easy to find. The amount and quality of training data are of great importance when training a readability classifier, as we can infer by comparing our results with those obtained for the English version. Dictionary integration may be a major issue, because open-source dictionaries such as Wiktionary do not provide enough quality; moreover, good dictionaries are either not available in electronic format or have access restrictions. Finally, in order to generate synonym questions, one might prefer to build a thesaurus automatically rather than rely on non-free resources such as EuroWordNet5. REAP's flexibility has been improved by adding audio playing capabilities, based either on text-to-speech synthesis or on the automatic alignment of previously recorded documents (DTBs and BN stories). Their impact on L2 learners of European Portuguese is one of the target goals of the forthcoming tests.
5.2 Future work

This section presents some of the lines that can be explored in future work. The following subsections should not be seen as a complete list, but as some topics that we believe are worth researching.
5.2.1 Integration of syntactic information in the Readability Classifier

Currently, the Readability Classifier classifies documents using only lexical features. Since the English version already integrates syntactic features, the inclusion of these features may provide even better results for the Portuguese version.
5.2.2 Integration of text simplification tools with the Readability Classifier

Investigating text simplification techniques and linking them with the Readability Classifier, and possibly a WYSIWYG text editor, is another interesting topic that we might pursue in future PhD work.
5 http://www.illc.uva.nl/EuroWordNet/ (validated in July 2009)
5.2.3 Expanding the Topics Classifier

New topics that have more impact on the students' interests could be included in the classifier. These topics include computer science, astronomy, music (e.g., news about a popular pop band), movies (e.g., reviews and critiques), etc.
5.2.4 Graphical Interface

Although the current graphical interface is characterized by its simplicity (a help section is also left as future work), we believe that allowing the student to customize it, by choosing an optional template or creating his or her own, may help motivate students to use the system. Creating social network software that includes REAP.PT is another interesting research topic. We also believe that porting REAP into a game could further motivate students learning a language. There are already CALL games, such as Alelo's Tactical Language Training™6, which the US Army uses to train troops sent to Iraq, teaching them not only the Iraqi language but also the Iraqi culture. An interesting aspect of this system is that its users learn by playing an interactive game set in Iraq. Players have to accomplish certain tasks which require them to engage in conversations with Iraqi NPCs (non-playable characters). Another example of a game-based language learning system is DEAL [33], which is a free-standing part of Ville. The game takes place in a trade market, where the player is given some currency in order to purchase certain objects. The actual game involves bargaining with the shopkeeper so as to spend as little as possible. There is a certain degree of challenge involved, since the shopkeeper is offended if he does not recognize what the player says or if the player gets too greedy. Finally, Timo™7 is a game system that teaches English to children suffering from speech pathologies, including autism, hearing impairments, and developmental delays. This system is designed to teach vocabulary and grammar, improve speech articulation, and develop linguistic and phonological awareness. One possible idea for a REAP game-based system would be to create or expand an MMORPG such as World of Warcraft (the most popular MMORPG, with more than 11.5 million subscribers worldwide8), including texts and questions in the target language at the adequate readability level. In addition to the usual level-up mechanics of killing mobs and crafting or finding weapons and armor, small linguistic tests, such as cloze questions, could be presented as an additional requirement to level up. The questions should contain lexicon relevant to the student and could be about a completed quest.
6 http://www.tacticallanguage.com/ (validated in July 2009)
7 http://www.animatedspeech.com/ (validated in July 2009)
8 http://www.blizzard.com/us/press/081121.html (validated in July 2009)
5.2.5 Automatic generation of Cloze Questions

Automatic generation of cloze questions is another area of investigation, currently being pursued by Rui Correia in his Master's thesis work.
5.2.6 Integration of Automatic Translation Tools

The next version of REAP.PT might show definitions of words translated into other languages; e.g., when a student clicks on a word, beyond showing the dictionary entry, a translation of the word could also be presented. Furthermore, the videos shown in the Oral Comprehension module could include subtitles in a desired language.
5.2.7 Integration of Automatic Summarization Tools
Providing a summary or abstract of the documents could also be introduced in the system,
to give quick information about each document's content. The development of such a tool is
part of Ricardo Ribeiro's PhD work. After its completion, the tool might be integrated into
REAP.PT.
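As a placeholder for such a tool (and not Ricardo Ribeiro's actual method), a naive frequency-based extractive summarizer could be sketched as follows: sentences whose words occur most often in the document are assumed to be the most representative.

```python
import re
from collections import Counter

def summarize(text, n_sentences=1):
    """Naive extractive summary: select the sentences whose words are most
    frequent in the document, preserving their original order."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"\w+", text.lower()))

    def score(sentence):
        tokens = re.findall(r"\w+", sentence.lower())
        return sum(freq[t] for t in tokens) / max(len(tokens), 1)

    selected = sorted(sentences, key=score, reverse=True)[:n_sentences]
    return " ".join(s for s in sentences if s in selected)
```

A real summarization tool would of course go well beyond word frequencies, but even this sketch shows how a one- or two-sentence preview of a document could be attached to the search results.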