Media Service Profile

Media Service Highlights
ATAPY Software was established in 2001 with the active participation of ABBYY
Software House, the manufacturer of the FineReader OCR product family. ATAPY
focuses on custom software development in the fields of OCR and data capture,
document imaging, document management, and computer linguistics.
Media Service (scanning, recognition, data entry, proofreading, formatting, XML
markup, etc.) is another important field of the company's activity. Compared to
conventional media service bureaux, ATAPY is able to offer better results in a
shorter time by developing on-demand software tools that streamline or even
fully automate certain jobs. This approach is especially efficient in
large-scale digitization projects, where it considerably reduces the amount of
manual labor and ensures the highest quality within reasonable digitization
budgets.
The ATAPY Media Service Department was established to help libraries, data
archives, publishing houses, and other information-intensive organizations in
their digitization and electronic publishing endeavors. For materials dating
back many decades or even centuries, digitization is synonymous with
preservation and successful dissemination of cultural values.
ATAPY has worked with various materials, such as books in old European
languages, backlogs of periodicals (including wide-format ones), theater
scripts, and others. However complicated the material is - pale or uneven
print, symbols from several languages on one page, outdated fonts, scientific
formulae, and other elements that are obstacles for the majority of modern OCR
systems - ATAPY possesses sufficient means and resources to transform it into a
searchable, accessible, and well-structured electronic archive that will not be
lost to a fire, a flood, or a careless reader.
ATAPY also carries out mass data capture from standard structured and
semi-structured documents (forms). The sensitive nature of the information
contained in forms (e.g., financial documents or education tests) often
requires exceptionally high OCR accuracy, which cannot be achieved without
manual verification and data validation. ABBYY FormReader and ABBYY
FlexiCapture technology, strengthened by ATAPY's engineering experience and
backed up by a pool of qualified operators, ensures the required accuracy in
practically any European language with minimal manual intervention. ATAPY
has hands-on experience in developing custom data validation tools, export
modules for third-party information and document management systems, pre-OCR
image enhancement tools, and other technical means that enable smart and
error-free data entry, as opposed to the traditional brute-force approach.
Another strong point of ATAPY's Media Service is the ability to handle material with complex layout, such as
newspapers and magazines. ATAPY has a track record of creating digital archives for several European magazines
published in English, Danish, German, and Swedish.
This experience proved useful in developing the Smart Newspaper Page Zoning Tool - a specialized page
segmentation product targeted at newspaper-type layouts. It has been developed continuously for a number of
years and improved through testing on various periodicals, with a particular focus on old editions (first half of
the XX century). A distinctive feature of the tool is a flexible set of parameters that affect the segmentation
process. Users can tune the tool to achieve the best results on a particular type of material.
The intelligence accumulated in this development makes it possible to correctly identify column borders and
difficult headings, and to sort out decorative and layout-specific elements - such as frames and separators - that
often mislead modern OCR systems. This sharply reduces the manual labor otherwise required to correct the
segmentation produced by most OCR systems.
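To make the idea of column-border detection concrete, here is a minimal sketch of a classic textbook approach: a vertical projection profile over a binarized page. It is not ATAPY's algorithm, which the brochure does not describe; the function name `column_gaps` and the `min_gap` parameter are illustrative assumptions. The page is modeled as a list of rows, with 1 for an ink pixel and 0 for background.

```python
def column_gaps(image, min_gap=2):
    """Return (start, end) x-ranges of blank vertical bands at least min_gap wide.

    `image` is a list of equal-length rows; 1 = ink, 0 = background.
    `min_gap` is a hypothetical tuning parameter, in pixels.
    """
    width = len(image[0])
    # Count ink pixels in every pixel column (the vertical projection profile).
    ink = [sum(row[x] for row in image) for x in range(width)]
    gaps, start = [], None
    # The trailing sentinel value closes a gap that runs to the right edge.
    for x, count in enumerate(ink + [1]):
        if count == 0 and start is None:
            start = x
        elif count != 0 and start is not None:
            if x - start >= min_gap:
                gaps.append((start, x))
            start = None
    return gaps


page = [
    [1, 1, 0, 0, 0, 1, 1],
    [1, 0, 0, 0, 0, 1, 1],
]
print(column_gaps(page))  # [(2, 5)]
```

A vertical separator line or a decorative frame running through the gutter would fill the blank band and defeat this simple profile, which illustrates why such layout elements tend to mislead generic segmentation.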
These are actual examples of one and the same newspaper page segmented natively by ABBYY FineReader and
segmented by ABBYY FineReader with the help of ATAPY's Smart Newspaper Page Zoning Tool. Segmentation
using ATAPY's product is visibly more accurate and requires no correction.
[Figure panels: ABBYY FineReader; ABBYY FineReader + Smart Newspaper Page Zoning Tool]
The way it works at ATAPY
Media Service: services and techniques
I. Scanning
ATAPY is well equipped to provide scanning services. For Metzler Verlag, a German publishing house, ATAPY
carried out scanning and recognition of the 85-volume Pauly's Encyclopedia of Antiquity (Realencyclopädie der
classischen Altertumswissenschaft). 59,500 pages were scanned in high-resolution grayscale mode and stored on
198 CD-ROMs.
The University of Innsbruck entrusted ATAPY with a batch of XIX-century Austrian books for high-resolution
scanning. Because of the high value of the books, they were shipped back and forth by courier; even so, the
postal charges did not outweigh the cost savings that the University gained by outsourcing the task to ATAPY.
Scanning services can also be provided at the recently opened ATAPY Sales and Technical Support office in
Munich, Germany, a facility intended to bring ATAPY one step closer to its customers. When reasonable, the
material can be scanned locally and sent to ATAPY over the Internet in a mutually agreed format.
II. Pre-OCR image processing
This phase is required when recognition quality suffers from source image flaws such as noise (speckles of
various kinds), page skew, colored or patterned background, etc. ATAPY uses a number of its own imaging
tools, as well as third-party ones, to enhance image quality prior to OCR and so make the automatic
recognition phase as efficient as possible.
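As an illustration of the simplest kind of speckle removal, the sketch below deletes isolated ink pixels from a binarized page. It is a generic technique shown under simplifying assumptions (a list-of-lists bitmap, speckles exactly one pixel in size); the function name `despeckle` is hypothetical and does not refer to any ATAPY or ABBYY tool.

```python
def despeckle(image):
    """Remove isolated ink pixels (1) that have no ink among their 8 neighbours."""
    height = len(image)
    width = len(image[0]) if height else 0
    cleaned = [row[:] for row in image]  # work on a copy, keep the input intact
    for y in range(height):
        for x in range(width):
            if image[y][x] != 1:
                continue
            has_neighbour = any(
                image[ny][nx] == 1
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
                if (ny, nx) != (y, x)
            )
            if not has_neighbour:
                cleaned[y][x] = 0
    return cleaned


page = [
    [0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0],  # a lone speckle
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],  # part of a glyph, must survive
]
for row in despeckle(page):
    print(row)
```

Real speckles are often larger than one pixel, so practical tools usually filter connected components by size instead; the single-pixel rule is used here only to keep the example short.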
III. Recognition
Scanned images are recognized using ABBYY FineReader OCR/ICR technology. Sometimes certain programming
effort is required to tune up and customize ABBYY products. This may happen if the customer puts forth specific
requirements not covered by “off-the-shelf” products, or if the material exhibits special characteristics, such as
unusual fonts, unsupported language dialects, non-standard symbols and characters, specific layout, and the
like.
A good example is our contribution to the international Meta-E initiative - a project undertaken by a
consortium of 14 universities from 7 European countries and the US and co-funded by the European
Commission. ATAPY's part of the project included tuning ABBYY FineReader to work with text printed in old
European languages. ATAPY implemented Language Models for five Old European languages to be used in
ABBYY FineReader OCR dictionaries. These dictionaries, together with ABBYY's part of the project - adapting
FineReader to read Frakturschrift (a Gothic blackletter typeface typical of old European books) - formed the
basis of ABBYY FineReader XIX, a product later announced by ABBYY as a specialized FineReader version
targeted at old printed sources.
Since then, ABBYY FineReader XIX has become one of the main tools used by ATAPY when dealing with old
material and sources printed in Fraktur; it has proved its efficiency in a number of projects for European
organizations.
IV. Verification and QA
Despite the intelligence applied at the pre-OCR phase and powerful ABBYY OCR
technology, in many cases - especially with difficult materials such as old print or
scientific content - a human eye is required to back up machine recognition.
ATAPY employs a number of professional multilingual operators well trained in
proofreading and correction techniques. When the highest recognition quality is a
project requirement, the double verification technique is used, which means that
each page is verified by two independent operators. This approach drastically
improves the results compared to ordinary single verification, making it possible to
achieve accuracy rates as high as 99.997% (a real figure obtained by the ATAPY QA
Department in one of the projects).
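One way to picture double verification is as a comparison of two independently corrected versions of the same page: wherever the two operators disagree, at least one of them has made a mistake, so only those spots need a third look. The sketch below uses Python's standard `difflib` module for the comparison. The function name `disagreements` and the sample strings are assumptions made for illustration; the brochure does not say how ATAPY reconciles the two passes.

```python
from difflib import SequenceMatcher


def disagreements(text_a, text_b):
    """Return (fragment_a, fragment_b) pairs where two operators' texts differ."""
    matcher = SequenceMatcher(None, text_a, text_b, autojunk=False)
    return [
        (text_a[i1:i2], text_b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


operator_a = "Alterthumskunde"
operator_b = "Allerthumskunde"  # hypothetical slip: "t" read as "l"
print(disagreements(operator_a, operator_b))  # [('t', 'l')]
```

An error survives this check only if both operators make the same mistake at the same place, which is far less likely than a single operator missing it.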
ATAPY's engineering potential comes into play again by supplying handy custom utilities that help operators
automate routine tasks, such as inserting non-keyboard characters or fixing OCR mistakes specific to certain
types of material. Over the years of providing Media services, ATAPY has developed an arsenal of such utilities,
which are reused in new projects.
V. Pre-publishing
If required, ATAPY converts the entire mass of recognized and verified material into any specific data format:
multi-layer PDF, XML/XHTML, database, etc. - including those not offered by off-the-shelf OCR products. Most
of ATAPY's tools and algorithms used at the preceding phases are already optimized for subsequent XML
conversion; therefore, this phase is largely automatic. In some cases additional processing is required, such as
specific XML markup, DTP layout correction, etc.
An example of work in this area is the project for the Royal Danish Library, in which ATAPY converted 230
old Danish books forming the entire Danish Literary Canon into XML. ATAPY marked up the text with
XML tags and validated the resulting files against the customer's XML schema.
In the Landolt-Börnstein Encyclopedia digitization project, which ATAPY Software
carried out for Springer Verlag, the material was converted into a customer-specific XML
format known at Springer as A++.
VI. Additional IT services
ATAPY offers the services of enhancing third-party Electronic Record Management
Systems and Document Management Systems with OCR modules based on ABBYY
OCR toolkits.
For PRNet, a media monitoring agency operating in Turkey, ATAPY has provided such
integration, and has also enhanced PRNet's web application with a number of
additional features and modules, such as Statistics and Reporting, Web-based
Administration, etc.
Location
The ATAPY head office is located in Novosibirsk, Russia. Novosibirsk has always been one of the largest Russian
hubs of scientific research related to Artificial Intelligence technology. The first industrial ICR system for
reading ZIP codes, created back in the 1970s and still in use by the Russian postal service, was developed here
in Novosibirsk. At the same time, this location allows us to offer services at attractive prices that fit into the
budgets of libraries and public organizations.
The recently opened Sales and Technical Support office in Munich, Germany, provides for closer interaction
between ATAPY Software and its European clientele. By signing a contract with a German company, ATAPY
Software GmbH, customers can eliminate the concerns associated with foreign contracts.
©2010-2013 ATAPY Software. All rights reserved.
ABBYY and ABBYY FineReader are registered trademarks of ABBYY Software House.
All the other trademarks are the property of their respective owners.
ATAPY Software
630090, Engineernaya Street, 16
Novosibirsk, Russia
Tel. +7 383 36 39 699 Fax +7 383 36 39 698
www.atapy.com [email protected]
ATAPY Software GmbH
Elsenheimerstrasse 47
80687 Munich, Germany
Tel. +49 89 5111 5968
[email protected]
ATAPY for Science
H. Landolt's reference book "Physikalisch-chemische Tabellen" (first issued in
1883 in Germany) presents physicochemical constants of organic and inorganic
substances in tabular format. The completely reworked 6th edition, named
"Landolt-Börnstein Zahlenwerte und Funktionen aus Naturwissen-Physik, Chemie,
Astronomie, Geophysik und Technik", was issued in 1950-1980.
The emergence of new research methods led to the release of the "New Series", a
reference work named "Landolt-Börnstein. New Series. Numerical Data and
Functional Relationships in Science and Technology". Since 1961 more than 150
volumes have been issued.
Long-term cooperation between the largest European scientific
publisher Springer Verlag and ATAPY Software
Both departments (Software Development & Media Service) of ATAPY
Software have been continuously contributing to cooperation with
Springer Verlag, the world’s second-largest scientific Publishing house, a
part of Springer Science+Business Media group which unites 70
publishing companies all over the globe.
The cooperation started in 2003 with a pilot project intended to assess
the efficiency of combining ABBYY FineReader technology with the
expertise of ATAPY engineer-linguists and media operators in digitizing
scientific content. ATAPY was entrusted with a small part of the
Landolt-Börnstein Numerical Data and Functional Relationships in Science and
Technology Encyclopedia - a systematic and comprehensive collection of
critically assessed data from all fields of physics, physical chemistry, bio-
and geophysics, astronomy, materials science, and technology.
The task of converting scientific information into an electronic document
format is not trivial; prior to contacting ATAPY, Springer had made such
attempts, which were not efficient enough because of an out-of-date technology
base.
ATAPY successfully accomplished the pilot project, which involved TIFF to
Microsoft® Word conversion of a series of Encyclopedia pages. The pilot made
it possible to understand the nature of the upcoming project, determine the
necessary skills and technologies, identify the main complexities, and
work out solutions for them.
The most serious challenge was a large amount of purely scientific data
(formulae, tables, etc.) that contained special symbols missing from the
Unicode character map. This issue was partially overcome by creating special
dictionaries inside ABBYY FineReader - the software package used for full-text
recognition - and by implementing a specialized program to help operators
promptly insert non-keyboard symbols at the verification phase.
The second goal to achieve was an accuracy level of
over 99.99%, which meant less than one mistake per
10,000 characters. The goal was achieved with the use
of excellent OCR technologies of ABBYY Software
House adjusted for this particular task by ATAPY, and
due to the meticulous verification work of ATAPY
Media Service department.
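The relationship between an accuracy percentage and "one mistake per N characters" is simple arithmetic, sketched below. The function names are illustrative assumptions; `Decimal` is used so that the percentages are handled exactly instead of as binary floating-point approximations.

```python
from decimal import Decimal


def chars_per_error(accuracy_percent):
    """Average number of characters per mistake for a given accuracy (as a string)."""
    error_rate = (Decimal(100) - Decimal(accuracy_percent)) / Decimal(100)
    return 1 / error_rate


def accuracy_percent(errors, total_chars):
    """Accuracy in percent for `errors` mistakes found in `total_chars` characters."""
    return Decimal(100) * (total_chars - errors) / total_chars


print(chars_per_error("99.99"))     # 1E+4, i.e. one mistake per 10,000 characters
print(accuracy_percent(1, 10000))   # 99.99
```

By the same formula, an accuracy of 99.997% corresponds to roughly one mistake per 33,333 characters.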
This promising beginning grew into full-fledged
cooperation between Springer and ATAPY.
The second project involved digitization of larger
amounts of the same Edition with further conversion
into A++, the customer-specific XML format. As a part
of this task, ATAPY engineers automated the A++
conversion of the reference lists following each
chapter of the Edition.
ATAPY Software also worked on digitization of the
numerous charts "populating" the Edition in order to
allow the material's online usage and eventual
interactivity for scientists of the XXI century.
[Workflow diagram labels: Scanning; PDF document; Segmenting, recognition; Layout correction, formulae editing; Double proofreading; Online publishing]
Springer Science+Business Media, or Springer, is a worldwide Publishing house based in Germany
which publishes textbooks, academic reference books, and peer-reviewed topical journals with a
focus on science, technology, mathematics, and medicine. Within the science, technology, and
medicine sector, Springer is the largest book publisher, and second-largest journal publisher
worldwide, with over 60 publishing houses, 1,900 journals, 5,500 new books published each year, sales
of 924 million euro (in 2006) and 5,000 employees. Springer has major offices in Berlin, Heidelberg,
Dordrecht (the Netherlands) and New York.
Springer Science+Business Media
Heidelberger Platz 3, 14197 Berlin, Germany
Tel. +49 6221 4870 Fax +49 6221 3450
www.springer.com
ATAPY Software Participates in the
Development of International
Computer Dictionaries
“ATAPY reached 99.992%
text accuracy in the German-Russian Dictionary (1
mistake per 8,760 symbols),
and 99.997% quality for the
Spanish-Russian Dictionary
project (1 mistake per
31,500 symbols). They also
corrected many mistakes in
the source dictionary text,
including typographical
misprints and even mistakes
in special marks that are
almost impossible to detect
without special programming tools and profound
knowledge of linguistics.”
Anna Zhavoronkova
Project Manager,
ABBYY Software House
Electronic dictionaries and translation systems are an area of great
practical importance in the ever-globalizing world. ABBYY Software
House, a world leader in OCR/ICR and linguistic technologies, develops
and sells Lingvo electronic dictionaries. For many years Lingvo has been
known as the best English-Russian dictionary on the market. In version
8.0, support for 3 more languages was planned; to introduce them to
Lingvo, it was necessary to digitize the latest best-of-breed dictionaries
reflecting the modern state of the newly supported languages.
The ABBYY Lingvo 8.0 product line includes ABBYY Lingvo 8.0 Multilingual
Edition, ABBYY Lingvo 8.0 for Pocket PC, as well as an updated and expanded
version of ABBYY Lingvo English-Russian Edition.
ABBYY Lingvo 8.0 Multilingual Edition supports eight translation
directions: English-Russian, German-Russian, French-Russian,
Italian-Russian, Russian-English, Russian-German, Russian-French, and
Russian-Italian. This Edition of ABBYY Lingvo includes more than 40
dictionaries containing more than 2,400,000 entries.
ABBYY turned to ATAPY Software, its outsourcing partner in Novosibirsk,
for digital conversion of two dictionaries from the list picked out by the
Linguistics Department. The 3-volume, 1750-page Leping German-Russian
Dictionary and the 830-page Narumov Spanish-Russian Dictionary were to be
recognized and proofread for automatic conversion into the ABBYY Lingvo
database.
The highest possible text recognition accuracy was obviously a must. A single
mistake could break the alphabetical order of the words and tear a word
away from its paradigm. If the number of such mistakes exceeded even
a very modest threshold, the dictionary would become
unsearchable. Adequate interpretation of special dictionary marks was
no less vital for the project. They were used as field delimiters in the
automatic database conversion process and had to be recognized 100%
accurately. Special marks appeared either as text characteristics
(bold/italics), or as special symbols (brackets, asterisks), or as a
combination of the two (e.g., italics + brackets indicated a dictionary
comment). Omitting a single bracket or missing italicization would break
the article's structure. This is why the project required both intelligent
programming and highly qualified manual effort - a true challenge for any
contractor in the media service area.
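A check for missing brackets of the kind described above is easy to automate. The sketch below reports the positions of brackets that have no matching partner in a dictionary entry. It is an illustrative assumption about what such a check could look like, not ATAPY's actual tooling; the function name and the sample entry are made up.

```python
PAIRS = {")": "(", "]": "[", "}": "{"}  # closing bracket -> opening bracket


def unbalanced_brackets(entry):
    """Return sorted character positions of brackets that have no matching partner."""
    stack = []      # opening brackets still waiting for a partner
    problems = []   # closing brackets that matched nothing
    for pos, ch in enumerate(entry):
        if ch in PAIRS.values():
            stack.append((ch, pos))
        elif ch in PAIRS:
            if stack and stack[-1][0] == PAIRS[ch]:
                stack.pop()
            else:
                problems.append(pos)
    problems.extend(pos for _, pos in stack)  # openers that were never closed
    return sorted(problems)


print(unbalanced_brackets("Haus n (-es, Häuser) house"))  # []
print(unbalanced_brackets("Haus n (-es, Häuser house"))   # [7]
```

Font-based marks such as bold or italics cannot be verified on plain text in this way; they would need the formatting information that the OCR output carries alongside the characters.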
The dictionaries were scanned and automatically
recognized with ABBYY FineReader specially tuned
for processing this material. Then a team of qualified
operators proofread and cross-checked the results
using the double verification technique to ensure
recognition accuracy. Double verification also
revealed certain unexpected cases, such as typos in the
source dictionary text, which were corrected
according to ABBYY's guidelines. In its effort to
automate the proofreading work to the maximum
possible extent, ATAPY developed and customized a
number of in-house utilities. One of them was
Glyphica, a tool for quick input of characters that
cannot be found on the keyboard. For the Leping
Dictionary ATAPY developed a custom converter with
built-in spellchecking and punctuation checking
utilities, which made it possible to weed out mistakes
missed during the previous stages and finally
convert the material into the Lingvo vocabulary
database.
ABBYY Software House (www.abbyy.com) is based in Moscow, Russia. The
company was founded in 1989. Today ABBYY has over 880 employees worldwide,
including offices in Russia, USA, Ukraine, UK, Germany, Taiwan, Japan and
Cyprus. ABBYY develops software products in the fields of artificial intelligence, document
recognition, data capture and applied linguistics. ABBYY is most notable for their optical character
recognition package ABBYY FineReader.
AutoStore is a registered trademark of Notable Solutions, Inc.
ABBYY Software House
P.O. Box #20, Moscow, Russia, 127273
Tel. +7 (495) 783 3700 Fax +7 095 783 2663
www.abbyy.com [email protected]
Meeting the Challenge of Time
ATAPY Software participates in the development of ABBYY
FineReader XIX - an OCR system for reading old European books
“I've got FineReader XIX
installed on my computer.
The Frakturschrift recognition is very good. Even
though old text recognition
is not a large and growing
market, I am sure all the
service bureaus here in
Germany will be ordering 1
or 2 copies and have it run
7x24”
Johannes Stöpetie
CEO
ABBYY Europe GmbH
Based on FineReader 7.0
Meta-E (http://meta-e.uibk.ac.at) is a collaborative initiative undertaken
by a consortium of 14 universities from 7 European countries and the US,
co-funded by the European Union. The project is focused on providing a
technology basis for digitization and web publishing of valuable old
printed sources spanning several centuries of European history. For this
purpose, an OCR system was required that was capable of recognizing historical
texts from the period 1800-1938, including those printed in
Frakturschrift (an old-style blackletter typeface prevalent at that time).
Until now, no omnifont OCR systems for reading Frakturschrift have been
available: the OCR products had to be trained on each individual book
before processing. Meta-E coordinators searched for a high quality OCR
package that could be customized to meet their requirements. ABBYY
FineReader was chosen due to its unrivalled recognition accuracy, support
for 177 modern languages, and ease-of-use. ABBYY Software House, the
international manufacturer of FineReader product line, took up the
project as a direct contractor to carry out the development of the
omnifont part (introducing the Frakturschrift graphics to FineReader). The
linguistic part of the project was subcontracted to ATAPY Software,
ABBYY's long-term partner in OCR and computer linguistics
development.
ATAPY's role in the Meta-E project was constructing Old Language Models (LMs) for 5 European languages:
English, French, German, Italian, and Spanish. An LM is a computer database that describes the vocabulary
of a language. FineReader uses LMs during recognition for building OCR
hypotheses and spellchecking. LMs are not just full lists of words in all
possible grammatical forms: such a database would be enormous in size and
hardly manageable. FineReader LMs store only the stem of each word and
describe the grammar as a set of inflection rules (paradigms). Each stem is
assigned a list of paradigms; applying them to the stem produces all
possible forms of the word. ATAPY's task was to study a large number of
authentic dictionaries and original old European texts dating back to the
targeted time span, review the word stock, add the words that had since
fallen out of use, and correct the paradigm assignments to
bring the LMs in line with the actual grammatical practice of that
time.
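The stem-plus-paradigm idea can be sketched in a few lines. The data layout below is a deliberately simplified assumption made for illustration: the real FineReader LM format is not described in this text, and the paradigm names, endings, and sample words are invented.

```python
# Hypothetical paradigms: each one is a list of endings appended to a stem.
PARADIGMS = {
    "regular_noun": ["", "s"],
    "regular_verb": ["", "s", "ed", "ing"],
}

# Hypothetical lexicon: each stem is assigned a list of paradigm names.
LEXICON = {
    "book": ["regular_noun", "regular_verb"],
    "walk": ["regular_verb"],
}


def expand(stem):
    """Generate every word form of `stem` by applying its assigned paradigms."""
    forms = set()
    for paradigm_name in LEXICON[stem]:
        for ending in PARADIGMS[paradigm_name]:
            forms.add(stem + ending)
    return sorted(forms)


print(expand("walk"))  # ['walk', 'walked', 'walking', 'walks']
print(expand("book"))  # ['book', 'booked', 'booking', 'books']
```

Two stems and two paradigms here stand in for eight stored word forms; with hundreds of thousands of stems the saving is what keeps the database manageable. In this picture, correcting a historical form means reassigning a stem to a different paradigm or adding a new paradigm, without touching any word list.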
To complete this task, ATAPY's linguists carefully selected 10 dictionaries
reflecting the state of the 5 languages, published between 1808 and
1930. ATAPY also thoroughly analyzed 105 authentic books of that
period, comprising more than 50 MB of text. The next step was to build
the FineReader LMs. ATAPY's linguists manually compared the information
from authentic dictionaries and texts - about 500,000 entries in total - to
the existing FineReader vocabularies. This work turned up a total of
458,767 words, of which 61% remained unchanged and 36% were
added to the vocabularies from the analyzed sources. About 3% of the
words had their paradigms corrected to match the XVIII-early XX century
grammar rules. To carry out this correction, the linguists had to add 159
historic grammar paradigms that were missing from the contemporary
models.
Finally, the LMs were compiled and tested on the
control text corpus. They manifested 98.91% vocabulary coverage for Old English, 99.16% for Old French,
96.58% for Old German, 98.58% for Old Italian, and
98.79% for Old Spanish languages.
To illustrate the above, let's look at a few samples. A
regular FineReader package, or any other contemporary OCR system, will make a lot of mistakes here.
For example, “Alterthumskunde” may become “Allerlhumskunde” in the first fragment; in the second
fragment, “UEBERSICHT” (“Übersicht” in modern
German) is recognized as two words, “UEBER SICHT”,
etc. These mistakes occur because of two factors. The
first is the low quality of the printing, about which
nothing can be done. The second is the
old spelling used in these texts. All existing OCR
systems are targeted at modern texts and therefore
only know modern spelling.
Once the five LMs were merged into the FineReader 7
shell, ABBYY was able to offer a special version of
FineReader which “knows” the spelling and font
specifics of Old European languages. This version can
read old texts more accurately, eliminating many of
the mistakes made by OCR systems built only for modern
texts. As a result, users are able to scan and
recognize old texts with higher quality, saving much of
the time previously spent on error correction.
The updated ABBYY FineReader promises to be a
powerful tool for use by the Meta-E consortium in its
large-scale digitization work. In addition, shortly
afterwards ABBYY released FineReader XIX
(www.frakturschrift.com) - the industry's first boxed
OCR product to recognize Renaissance and Late
Medieval sources, a product specially targeted at
European libraries and public organizations engaged
in preservation and publishing of cultural assets, and
at service bureaus helping them fulfill this mission.
ABBYY Europe GmbH is a European department of ABBYY Software House based
in Munich, Germany. ABBYY Software House is the manufacturer of software
products in the fields of artificial intelligence, document recognition and applied
linguistics. One of the most notable products by ABBYY Software House is the optical character
recognition package ABBYY FineReader.
ABBYY, ABBYY FineReader and FineReader XIX are registered trademarks of ABBYY Software House.
ABBYY Europe Software House
Elsenheimerstrasse 49
80687 Munich, Germany
Tel. +49 89 5111590 Fax +49 89 51115959
[email protected] www.abbyyeu.com
ATAPY Helps UNESCO School
in Transatlantic Slave Trade
Education Project
Fredensborg is the best
documented wreck of a
Transatlantic slave trade
ship located so far.
The ship left Copenhagen in June 1767 and
traded for 265 slaves at
Danish-Norwegian forts
along the Gold Coast of
Africa. About 10% of the
slaves died during the
ship's middle passage,
but over one-third of the
crew also died during the
voyage. The Dutch sold
their human cargo at St.
Croix, and then loaded
the ship with sugar,
tobacco, and other tropical products for the return trip. The ship had
almost reached its destination when it wrecked
during a violent storm.
On September 15, 1974, divers Odd K. Osmundsen, Tore Svalesen
and Leif Svalesen discovered wreckage and giant elephant tusks at
the bottom of the sea near Tromøy, off the southern coast of
Norway. Along with the ivory, cannons, ship timber and other
interesting objects were found. Almost everything was hidden
under layers of seaweed, rocks, and sand. However, as a result of
thorough planning and intense study of old documents from the
archives, the three divers knew exactly what they had found.
The Danish-Norwegian slave ship Fredensborg, which sank on
December 1, 1768, was a typical ship engaged in the so-called
Triangular Trade.
Triangular Trade is the name given to the trading route used by
European merchants who exchanged goods with Africans for
slaves, shipped the slaves to the Americas, sold them and brought
goods from the Americas back to Europe. Ships left Europe with
cargoes of a broad assortment of goods considered suitable for
the slave trade. Once anchored in the forts, the interiors of the
ships were rebuilt to accommodate enslaved Africans.
Det Digitale Nordjylland
In September 2003 ATAPY was contacted by
Mr. Jeff Klinto, an educator at Vesthimmerlands
Gymnasium. This UNESCO school participates
actively in the Transatlantic Slave Trade
Education Project supported by the Danish
UNESCO Committee and The Digital North
Denmark (Det Digitale Nordjylland) project.
As a part of the project, Mr. Klinto initiated the creation of a CD-ROM
with teaching materials about the Danish involvement in the
Triangular Trade. The CD-ROM had to contain contemporary
materials as well as materials from the age when the use of Gothic
letters was common. That is where the project faced a challenge.
The Transatlantic Slave Trade Project aims
to break the silence surrounding the
Transatlantic Slave Trade. By learning about the
past, young people can fully understand the
present and prepare a better future together in a
world free of all types of enslavement, injustice,
discrimination and prejudice.
Photo: Reconstruction of a slave ship by students of Vesthimmerlands Gymnasium
Mr. Klinto:
"Gothic letters caused a number of problems in
relation to the OCR programs that are currently
available on the market. These problems led us to
approach the Royal Library in Copenhagen,
where they recommended that we address the
Russian company ATAPY Software, which
handled the Royal Library's Gothic materials.
However, the thought of having Gothic texts
which were written in Danish handled by Russian
employees seemed unrealistic. The materials
were from an age when there was no national
orthography yet; the dictionaries in the OCR
program would be useless. How would they be able to work? Add to this the very uneven quality of
printing in the old works, and the task seems rather impossible."
[Figure: Old Danish books page samples]
The Media Service Department of ATAPY accepted the challenge and did excellent work on
recognition, proofreading, and export to HTML of more than 5500 pages of Old Danish books.
Mr. Klinto sums up the project in his own words:
"The materials that were returned were of a high standard, and ATAPY was
incredibly obliging and helpful. The communication with project leadership
functioned excellently and the team worked wonders with the materials which
were often of a poor quality. Therefore I would like to sincerely recommend this
company. Thanks to the competent staff of ATAPY, it is now possible for the public
to have access to materials which may not be issued at libraries anymore because
of their age and rarity. Incidentally, it is worth noticing that the work was done for a
very favorable price.”
UNESCO and the UNESCO logo are copyrights of UNESCO, the United Nations Educational, Scientific and Cultural Organization.
Det Digitale Nordjylland
Projekt 202
Att. Jeff Klinto
Vesthimmerlands Gymnasium
[email protected]
The Royal Danish Library in
Copenhagen and Arkiv for Dansk
Litteratur
"Working with ATAPY has
been a pleasure. We have
been impressed with the
high level of concern for
producing the best possible
text of the works and the
accuracy of the results.”
Virginia Laursen
Webmaster
Royal Danish Library
The Royal Danish Library
in Copenhagen is the national library of Denmark
and the largest library in the
Nordic countries. It contains
numerous historical treasures; all works that have been
printed in Denmark since
the XVIIth century are deposited there. Thanks to
extensive donations in the
past the library holds nearly
all known Danish printed
works back to the first Danish book printed in 1482.
ATAPY Software converts the entire Danish Classic Literature
Canon into XML
The Royal Danish Library in Copenhagen (http://www.kb.dk) has the
largest book collection in Northern Europe and strives to facilitate access
to its resources using advanced technologies. It undertook an ambitious
project named "Arkiv for Dansk Litteratur" geared towards converting
the whole of Danish literary canon (the works of 70 carefully selected
Danish authors from the XIth to the early part of the XXth century) into
computer text - namely, to XML format, and making it available on the
web.
The great number of books, their diverse contents (verses, prose, pictures,
tables, notes and comments), as well as layout and typesetting preservation requirements made this project a very special task. In order to
succeed, the contractor had to possess some seemingly incompatible
qualities. On the one hand, this company had to be competent with
modern Optical Character Recognition packages, proficient in XML
coding, and capable of designing specialized software instruments to
facilitate the conversion process. This required high IT qualification and
extensive hands-on experience in data capture technologies. On the other
hand, almost all real-life mass data input projects still entail a lot of
manual labor. No matter how accurate an OCR system is, it will make
mistakes - especially on such difficult material as old books with their
complex layout. Besides, on the Library’s material, full automation of XML
coding wasn’t possible due to the diversity of attributes. Therefore, the
contractor had to be able to offer many qualified operators at a
reasonable cost; otherwise the project’s price tag would exceed the financial
capabilities of any library.
The IT staff of the Library attempted to solve this problem by searching for
a partner outside the EU. Their attention was drawn to Russia, the home
of the world-renowned OCR system ABBYY FineReader. Following
months of trials, the Library settled on ATAPY Software, a leading
developer of custom OCR solutions based on FineReader technologies
and an experienced media service provider. The pilot projects demonstrated that ATAPY had combined high IT professionalism with access to
an extensive pool of qualified multi-lingual operator resources.
The book conversion process was organized in three
large phases:
1. Reading scanned images into text format.
The Library provided ATAPY with scanned pages in TIFF
format. The quality of the images was remarkably good,
which was an important contribution to the efficiency of
the remaining stages. The images were automatically
analyzed by ABBYY FineReader, which segmented them
to distinguish text from pictures and revealed the table
structure. The segmentation results were reviewed by
layout operators. After that, pages were recognized
using FineReader's outstanding omnifont capabilities
augmented with many font-specific patterns that raised
the recognition quality for most old books. Then a group
of operators proofread the OCR results. Special attention
was paid to non-Danish inclusions, some of which could
not even be OCRed (Old Greek, Hebrew etc).
Danish page samples
2. Preparation of initial XML documents.
Verified text was exported to Microsoft® Word format. A group of XML operators armed with an arsenal of
custom tools and macro programs used Microsoft® Word as the environment for adding XML tags. This was
an intellectually demanding task, since the full list of tags contained over 50 entries and only half of them lent themselves to
automatic identification. The remaining half had to be spotted and marked manually, all of it in
Danish.
3. Assembly of book XML files.
Once markup was finished, XML specialists assembled the books, adding supplementary "entire-book" tags and
bibliographic information.
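As an illustration of phase 3, the following Python sketch wraps per-chapter XML fragments in a book-level element and prepends bibliographic information. The element names (`book`, `bibl`, `chapter`), the `assemble_book` helper and the sample text are hypothetical; the actual tag set used in the project is not given here.

```python
import xml.etree.ElementTree as ET


def assemble_book(title, author, year, chapter_fragments):
    """Wrap marked-up chapter fragments in one book-level XML document.

    Tag names are illustrative assumptions, not the project's real schema.
    """
    book = ET.Element("book")

    # Supplementary "entire-book" information goes first.
    bibl = ET.SubElement(book, "bibl")
    ET.SubElement(bibl, "title").text = title
    ET.SubElement(bibl, "author").text = author
    ET.SubElement(bibl, "year").text = str(year)

    # Each fragment is an XML string produced during the markup phase.
    for number, fragment in enumerate(chapter_fragments, start=1):
        chapter = ET.fromstring(fragment)
        chapter.set("n", str(number))
        book.append(chapter)

    return ET.tostring(book, encoding="unicode")


# Made-up sample fragments, for demonstration only.
fragments = [
    "<chapter><head>Kapitel 1</head><p>Der var engang ...</p></chapter>",
    "<chapter><head>Kapitel 2</head><p>Og saa ...</p></chapter>",
]
book_xml = assemble_book("Eksempel", "N. N.", 1850, fragments)
print(book_xml)
```

Parsing every fragment before appending it means that a malformed chapter fails at assembly time rather than in the published file.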
Being a software company in addition to a media service company made it possible for ATAPY to dispatch
experienced customization engineers and develop project-specific program utilities for every conversion phase.
This allowed ATAPY, as the project moved on, to gradually decrease processing time by another 10 to 20%,
passing the savings to the client. Since the successful completion of the project, all books are available online at
http://www.adl.dk.
After years of working in the area of media service, ATAPY became a true expert in this field, having dealt with
texts of different layouts, structures and languages. Those included library cards, encyclopedia articles, magazine
publications, rarities that date back to the XIXth century, and other materials of all genres and formats. In addition
to the Royal Danish Library, ATAPY's Media Service clients include Springer Publishing house (Germany),
University of Innsbruck (Austria), J.B. Metzler Verlag (Germany), EasyData B.V. (Netherlands), Consodata (France),
PRNet (Turkey), and many other institutions and companies. ATAPY's data capture process is highly effective in terms of both IT infrastructure and human resources. High-speed, high-quality multi-language material
processing, client communication in 4 languages, and very affordable pricing are ATAPY's trademarks, which it
continues to exhibit in every contract, big or small.
ABBYY and ABBYY FineReader are registered trademarks of ABBYY Software House.
All the other trademarks are the property of their respective owners.
The Royal Danish Library / Det Kongelige Bibliotek
P.O.Box 2149 DK-1016
Copenhagen, Denmark
Tel. +45 33 47 47 47 Fax: +45 33 93 22 18
www.kb.dk [email protected]
Backlog Conversion
of Danish Musical Magazines
ATAPY acquires new clients in the field of media service
"All files validated against
the schema, very nice! I
took a closer look at a
random selection of files,
and I am very impressed by
the quality of your work!
Metadata as well as OCR-treated text is of excellent
quality, so, I think we can
regard this as "mission
accomplished" and sign the
Act of Acceptance."
Henning Olesen, IT Project
Manager
The State and University
Library
Universitetsparken
DK-Aarhus
ATAPY prides itself on being able to
handle the most challenging data capture
tasks by employing its intelligent
digitization approach - the one that
involves using specialized software
tools at all phases of the process to
ensure high accuracy of the results with
minimum manual effort.
One of the recent projects requiring
such an approach was carried out
for "Nordic Sounds", a versatile Danish
musical magazine uniting under its
cover all kinds of musical genres that can be found in contemporary
North European music. The magazine is widely distributed outside
Denmark and is therefore published in English.
The editors of "Nordic Sounds" requested the creation of a digital
archive for all issues up to the present moment. The resulting
archive had to be not just a collection of digitized texts, but also a
true archive that could be searched and structured. This goal was
perfectly achievable using XML as an output format.
The task was entrusted to the Media Service Department of
ATAPY. In order to meet the customer's requirements, it was
necessary to create one XML file per article, which meant that
the contents of each magazine issue had to be analyzed. Although the majority of
magazine materials were in English, a lot of proper names and
quotations were in North European languages (Danish, Swedish,
Norwegian, Finnish, and Icelandic). This peculiarity required special
attention from the engineer-linguists who worked on "Nordic Sounds"
digitization and XML conversion. Past and current experience
in processing multi-language information sources (Danish in
particular) proved very useful here.
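The "one XML file per article" requirement can be pictured with a short Python sketch. The element names, the file naming scheme, the `write_article_files` helper and the sample articles are all assumptions made for illustration; the schema actually used for the archive is not described here.

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path


def write_article_files(issue_id, articles, out_dir):
    """Write one XML file per article and return the created paths."""
    out_dir = Path(out_dir)
    paths = []
    for index, article in enumerate(articles, start=1):
        root = ET.Element("article", {"issue": issue_id, "n": str(index)})
        ET.SubElement(root, "title").text = article["title"]
        ET.SubElement(root, "author").text = article["author"]
        body = ET.SubElement(root, "body")
        for paragraph in article["paragraphs"]:
            ET.SubElement(body, "p").text = paragraph
        path = out_dir / f"{issue_id}_{index:03d}.xml"
        # UTF-8 keeps Nordic characters in names and quotations intact.
        ET.ElementTree(root).write(str(path), encoding="utf-8", xml_declaration=True)
        paths.append(path)
    return paths


# Made-up sample data, for demonstration only.
articles = [
    {"title": "New Music from Reykjavík", "author": "A. Author",
     "paragraphs": ["First paragraph.", "Second paragraph."]},
    {"title": "Festival Report", "author": "B. Writer",
     "paragraphs": ["Only paragraph."]},
]
with tempfile.TemporaryDirectory() as tmp:
    created = write_article_files("ns-2004-1", articles, tmp)
    names = [p.name for p in created]
    first_title = ET.parse(str(created[0])).getroot().findtext("title")
print(names, first_title)
```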
ATAPY Software achieved the project goals on time (the project
lasted for approximately two months), with excellent quality
acknowledged by "Nordic Sounds".
Thanks to ATAPY, the Nordic Sounds magazine
archive has been available online since May 2005 as
part of the Online Music Research Library
(www.dvm.nu).
GAFFA and MM covers
After such a successful start, ATAPY digitized
backlogs for two more popular Danish musical
magazines: "MM" and "GAFFA". The GAFFA
magazine archive (1983 to 2008) is also
published online, providing full keyword search
and retrieval of original page images.
GAFFA spread samples
AARHUS
UNIVERSITY
Aarhus University (www.au.dk), located in the city of Aarhus, Denmark, is Denmark's second oldest
and second largest university, after the University of Copenhagen. The university was founded in 1928
and has an annual enrollment of more than 35,000 students. Aarhus University housed Denmark's
first professor of sociology (Theodor Geiger, from 1938–1952) and in 1997 professor Jens Christian
Skou received the Nobel Prize for Chemistry for his discovery of the sodium-potassium pump.
Aarhus University
Nordre Ringgade 1
8000 Arhus C
Tel. +45 8942 1111 Fax: +45 8942 1109
[email protected] www.au.dk
ATAPY Media Service Operations: Helping Preserve
Swedish Cultural Heritage
ATAPY’s track record in Scandinavian countries includes such projects as the digitization of a large
collection of books for the Royal Danish Library, the creation of a mini-archive of XVIII-century North-European
prints for a UNESCO educational project, and backlog conversion for several Danish musical
magazines. As a result of this work, ATAPY was recently entrusted with several new projects in
Scandinavia, particularly in Sweden.
Building an archive of Old Swedish plays for Riksteatern Sweden
The Swedish National Touring Theatre (Riksteatern
Sweden) needed to convert its collection
of Old Swedish plays into digital format.
According to the project requirements, texts were to
be converted to Microsoft® Word, with original page
design preserved as much as possible. Due to the age
of the material and to its layout specifics, the task
required both expert knowledge of OCR technology
and heavy manual formatting effort.
In this project, ATAPY processed more than 12,000 pages in Old Swedish. For
almost 50% of the material, double verification was used in order to ensure high
recognition accuracy and excellent searchability of the text. All the digitized material is
now available online on one of Riksteatern’s web sites, in Microsoft® Word and PDF
formats.
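The text does not spell out how double verification was organized. Assuming it means that two operators proofread the same page independently and only their disagreements are reviewed, the comparison step could look like the Python sketch below. The `find_disagreements` helper and the sample line are invented for illustration.

```python
import difflib


def find_disagreements(version_a, version_b):
    """Return (words_from_a, words_from_b) pairs where two proofreads differ."""
    words_a = version_a.split()
    words_b = version_b.split()
    matcher = difflib.SequenceMatcher(a=words_a, b=words_b, autojunk=False)
    disagreements = []
    for tag, a_start, a_end, b_start, b_end in matcher.get_opcodes():
        if tag != "equal":
            disagreements.append(
                (" ".join(words_a[a_start:a_end]), " ".join(words_b[b_start:b_end]))
            )
    return disagreements


# Made-up sample line as keyed by two operators.
operator_1 = "Hwad är det för en herre som kommer der"
operator_2 = "Hwad är det för en herre som kommer där"
print(find_disagreements(operator_1, operator_2))
```

Only the flagged pairs need a third look, so the cost of the second pass is mostly the second reading itself.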
Riksteatern is the name of the popular "National Touring Theater"/"National Theater
Company" (English) in Sweden. Established in 1933 with the goal of promoting and producing
quality theater throughout Sweden, Riksteatern is now the biggest theater company on tour
in Sweden, financed and owned by 240 local Swedish economic associations.
Digitization of Old Prints Collection for Gothenburg
University
In the same year, 2008, ATAPY converted into text format a series of old
Swedish printed sources dating back to the XVIII-XIX centuries for Gothenburg
University Library.
In this ongoing project, which comprises several phases and by now exceeds
75,000 pages, about 65% of the material was subject to full
verification. The remaining material, which yielded better OCR results,
underwent partial verification (of uncertainly recognized symbols only).
This approach delivered considerable cost savings. As a next step,
ATAPY performed manual markup of files for subsequent conversion to XML.
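The split between full and partial verification can be sketched in Python as a routing decision based on OCR confidence. This is only an illustration: the per-character confidence values, both thresholds and the `plan_verification` helper are assumptions, and the criteria actually applied in the project are not stated.

```python
def plan_verification(pages, page_threshold=0.95, char_threshold=0.80):
    """Decide per page whether operators check everything or only doubtful characters.

    `pages` maps a page id to a list of (character, confidence) pairs.
    Both thresholds are illustrative assumptions.
    """
    plan = {}
    for page_id, chars in pages.items():
        if not chars:
            continue  # nothing recognized, nothing to verify
        mean_conf = sum(conf for _, conf in chars) / len(chars)
        if mean_conf < page_threshold:
            # Poor OCR overall: every character position is reviewed.
            plan[page_id] = ("full", list(range(len(chars))))
        else:
            # Good OCR overall: only uncertain characters are reviewed.
            doubtful = [i for i, (_, conf) in enumerate(chars) if conf < char_threshold]
            plan[page_id] = ("partial", doubtful)
    return plan


# Made-up confidence data for two pages.
pages = {
    "good": [(ch, 0.99) for ch in "Götheborg "] + [("å", 0.70)],
    "poor": [(ch, 0.85) for ch in "blekt tryck"],
}
plan = plan_verification(pages)
print(plan["good"], plan["poor"][0])
```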
The University of Gothenburg is one of the major universities in
northern Europe (approximately 37,000 students). The University’s
40 Departments cover most scientific disciplines, making the University one of Sweden’s most
diversified higher education institutions.
In both projects, ATAPY faced the typical challenges associated with old books:
Low quality of the original page images (old weathered paper, pale print, etc.)
Uneven lines, “jumping” print, different spacing between letters, words, lines, etc.
Old Swedish words and grammar (ABBYY FineReader and sometimes even
FineReader XIX dictionaries failed)
By using ABBYY FineReader XIX (a specialized package for processing prints
in Old European languages and typefaces), segmenting the material intelligently and applying
qualified manual services where necessary, ATAPY managed to overcome these difficulties. As always,
ATAPY’s strategy was to automate the work where possible, minimizing customers’ expenses without
sacrificing quality.
Creation of an electronic archive of Selma Lagerlof’s works for
National Library of Sweden
Selma Lagerlof (1858-1940), one of the most prominent Swedish authors, a winner of the
Nobel Prize in Literature and a Swedish Academy Member, has left a literary heritage of
more than 2,500 pages. In 2010, the National Library of Sweden undertook a project
aimed at making this material available online. One of ATAPY’s former customers
recommended the company as an excellent service provider with an affordable price tag
and hands-on experience with sources in Scandinavian
languages.
The project involved the following phases:
OCR of scanned images
Full verification of OCR results
XML markup of basic layout elements: titles, page numbers, separator elements, etc.
ATAPY processed the material within a short timeframe with the workforce of
three media service operators. That same year, the Library began publishing
selected portions of its Lagerlof Collection online to commemorate the 150th
anniversary of Selma Lagerlof's birth.
The National Library of Sweden is a state agency with offices in Stockholm. The Library
has been collecting virtually everything printed in Sweden or in Swedish since 1661.
Currently the Library coordinates services and programs for all research libraries in
Sweden and administers LIBRIS, the Swedish national library catalog system.
ABBYY, ABBYY FineReader and ABBYY FineReader XIX are registered trademarks of ABBYY Software House.
All the other trademarks used are the property of their respective owners.
Data Input Project for Novosibirsk Mayor’s Office
ATAPY took part in a social project for
Novosibirsk Mayor’s Office in cooperation
with «Zolotaya Korona», a Russian nationwide retail electronic payments network.
The project was aimed at technical modernization of
the fare collection procedure in public transport
(metro, buses, trolley buses and tramways) by
introducing electronic transportation passes based on
microprocessor plastic cards.
A serious challenge for the project was the large
number of passengers using social security benefits
and discounts. Novosibirsk Public Transportation
Authority compensated transport carriers for
transportation of such passengers from the city
budget. One of the goals of the «Transportation pass»
project was to retain the benefits/discounts scheme
for people who were entitled to them, and at the same
time provide a precise and convenient mechanism for
gathering the transportation statistics for such
passengers. It was necessary to exclude every
possibility of fraud. Neglecting this issue was a weak
point of the formerly used transportation pass system.
«Zolotaya Korona» offered a solution based on
contactless microprocessor cards — personalized ones
for citizens entitled to the benefits/discounts scheme and
non-personal ones for ordinary passengers. «Zolotaya
Korona» issued separate types of personal cards for
each appointed category of beneficiaries: students
(the Student Transportation Pass), school children (the
School Child Transportation Pass), and social security
beneficiaries (the Social Security Transportation Pass).
There was yet another challenge. To obtain a personal
transportation pass, each person had to fill in a paper
machine-readable form. «Zolotaya Korona» had to
deal with tens of thousands of application forms,
which needed to be processed within a short period
of time, and with high accuracy. An additional
requirement was to store the color photo of the
applicant in the database.
The peak of the «paper flood» was anticipated to
come with the issue of Student Transportation Passes.
To deal with the challenge, «Zolotaya Korona» turned
to ATAPY Software — a data capture company with a
proven track record.
The data capture process involved the following
phases:
1. Filling-in the form by applicant (in handprint)
2. Scanning
3. Machine recognition
4. Verification of recognized data
5. Export to the database
6. Card production
7. Card issue
Working in close cooperation with engineers from
«Zolotaya Korona», ATAPY developers helped to
design a machine-readable application form to be
filled in by students at Phase 1 - a form specially
optimized for processing by ABBYY FormReader.
Then, ATAPY engineers developed a special preprocessing algorithm to be applied to form images
between Phase 2 and Phase 3. It removed the color
background from the form and improved the handprinted text recognition quality, while retaining the
color photo. The procedure significantly
reduced the form processing turnaround at phases 3-5 (recognition and verification of data using ABBYY
FormReader) and ensured high accuracy of the captured data.
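The preprocessing algorithm itself is not described in detail. As a rough illustration of the idea, the Python sketch below whitens light, colored background pixels outside a known photo rectangle while leaving dark handprint and the photo untouched. The simple darkness threshold, the fixed `photo_box` and the `drop_color_background` helper are assumptions, not ATAPY's actual method.

```python
WHITE = (255, 255, 255)


def drop_color_background(image, photo_box, dark_limit=100):
    """Whiten background pixels outside the photo area, keep ink and photo.

    `image` is a list of rows of (r, g, b) tuples; `photo_box` is
    (left, top, right, bottom) with right/bottom exclusive.
    """
    left, top, right, bottom = photo_box
    cleaned = []
    for y, row in enumerate(image):
        new_row = []
        for x, pixel in enumerate(row):
            inside_photo = left <= x < right and top <= y < bottom
            is_ink = max(pixel) < dark_limit  # dark in every channel
            new_row.append(pixel if inside_photo or is_ink else WHITE)
        cleaned.append(new_row)
    return cleaned


# Tiny made-up 3x2 "form": pale green background, dark ink, photo in the last column.
bg, ink, photo = (200, 230, 200), (20, 20, 30), (180, 120, 90)
form = [
    [bg, ink, photo],
    [bg, bg, photo],
]
cleaned = drop_color_background(form, photo_box=(2, 0, 3, 2))
print(cleaned)
```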
Thanks to combined efforts of «Zolotaya Korona» and ATAPY Software, Novosibirsk students can now travel by
all means of transport using the Student Transportation Pass, without reaching for money, student ID or entering
a PIN code.
In the near future, «Zolotaya Korona» plans to extend this valuable experience to other Russian cities.
Zolotaya Korona is a Russian nation-wide retail electronic payments network uniting
220 banks from 75 regions of Russia, CIS and foreign countries. «Zolotaya Korona»
cards are accepted in 273 cities of Russia, and also in Ukraine, Belarus, Kyrgyzstan,
Mongolia and China. The «Zolotaya Korona» system was established in 1994 by
Center of Financial Technologies.
©2010-2013 ATAPY Software. All rights reserved.
Zolotaya Korona
630055, Shaturskaya street, 2
Novosibirsk, Russia
Tel. +7 383 336 49 49 +7 383 335 80 88
www.korona.net [email protected]
sense your media
PRNet is a Media Monitoring
and Analysis company
serving over 300 corporate
clients in Turkey.
The company acts as a
strategic partner for communication specialists and
executives, who aim to develop corporate reputation
and who need to assess the
results of their communication strategies. PRNet
provides access to their
online database where customers can search among
more than 25 thousand clips
and 80 million results stored
since 2000, survey 4,500
pages of newspapers and
magazines, view videos of
74 TV channels recorded on
a 24/7 basis, and access
more than 1,000 Internet
portals.
According to ISO 500
research, 7 of the top 10
companies of Turkey, and 84
of the top 100, prefer PRNet
for serving their media-monitoring and industrial
information needs.
©2011 ATAPY Software. All rights
reserved. ABBYY, the ABBYY logo
and ABBYY FineReader are registered trademarks of ABBYY Software
House. PRNet and the PRNet logo are
registered trademarks of PRNet.
A Networked Media
Clipping System
For more than a century, daily, systematic analysis of printed media
has been an important tool for successful businesses worldwide.
Media clipping companies, using tools suited to the last century,
provided the analysis business demanded. All those years, the
rustling of pages and jingle of scissors were the constant audio
background of media clipping companies' operations.
Arrival of the digital age healed the callused hands of operators.
Fewer and fewer scissors were used as companies switched to
scanning printed material. Paper no longer left the scanner room,
and reading was done from computer monitors.
But overall processing of newspapers and magazines still required
too much human input to automate, so the amount of labor spent
by media clipping companies remained largely the same. Early 90s
OCR programs worked for letters and faxes, but turned out to be
useless when confronted with the complex layout and font variety
of newspapers.
In 1997, a Turkish media research company
named PRNet approached ABBYY Software
House, the manufacturer of FineReader OCR
products, with the request to design a system to
streamline the media clipping process. Dalian
1.0 went into operation in 1998, delivering
subscribers a service previously unheard of. As
early as nine in the morning, subscribers could
log on to PRNet's web site, click on their own
customized albums, and view a new page with
clippings from that very day's morning
newspapers. Only clippings containing this
subscriber's keywords went to his/her albums.
Content was delivered as text and pictures in
HTML format, allowing the subscriber to copy &
paste it into other software for distribution or
editing. Pictures were delivered as well.
Keywords were highlighted. All major Turkish
publications were covered (50 titles). The clippings were preserved in an MS SQL Server database
for long-term storage and future reference.
All this was achieved with an average staff
presence of 14 operators - a fantastic efficiency
compared to less sophisticated systems. The
workload was largely shifted to unattended
computers: OCR PCs had to be rack-mounted 10
units tall to fit into a single room, with one hot-switchable monitor for control.
When the new version of FineReader OCR came
out, PRNet invited ABBYY to migrate Dalian to
this new platform. Pursuant to new corporate
outsourcing policies, ABBYY transferred the
project to ATAPY Software, an IT development
company specializing in custom OCR tools.
Besides migration, PRNet asked ATAPY to add
web-based administration, system statistics and
reports, a web client for extended media search,
improved output for clippings, and many other
features and enhancements.
The new Dalian 2.0 went into operation in
2003, providing media insights to about 80
clients, including the Turkish offices of Alcatel,
Compaq, Toyota, Unilever, Vestel, CNN, Reebok,
and Siemens, as well as such local giants as
members of the Koç Group and the leading
banks of Turkey.
The dramatic improvement in recognition rate,
the possibility to employ home-based operators
working through web interfaces, and other
serious advancements in system functionality
and manageability place Dalian 2.0 in the top
rank of modern media clipping software
solutions.
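The keyword-driven routing of clippings into subscriber albums, with keywords highlighted in the HTML output, can be sketched in Python as follows. This is not Dalian's implementation: the `route_clipping` helper, the plain substring matching, the `<b>` markup and the sample data are all assumptions made for illustration.

```python
import html
import re


def route_clipping(clipping_text, subscriptions):
    """Return {subscriber: highlighted_html} for subscribers whose keywords occur."""
    escaped = html.escape(clipping_text)
    albums = {}
    for subscriber, keywords in subscriptions.items():
        if not keywords:
            continue  # an empty pattern would match every clipping
        # Longer keywords first, so they win over shorter overlapping ones.
        ordered = sorted(keywords, key=len, reverse=True)
        pattern = re.compile(
            "|".join(re.escape(html.escape(k)) for k in ordered),
            re.IGNORECASE,
        )
        if pattern.search(escaped):
            albums[subscriber] = pattern.sub(
                lambda match: f"<b>{match.group(0)}</b>", escaped
            )
    return albums


# Made-up clipping and subscriptions.
subscriptions = {
    "subscriber_1": ["Acme Bank"],
    "subscriber_2": ["Example Motors"],
    "subscriber_3": [],
}
albums = route_clipping("Acme Bank opens a new branch in Ankara", subscriptions)
print(albums)
```

A production system would also need language-aware matching (Turkish casing rules, inflected forms), which this sketch leaves out.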
PRNet
Spring Giz Plaza B Blok 17/18,
Maslak 80670 Istanbul, Turkey
Tel. +90 212 328 18 09 Fax +90 212 328 18 07
www.prnet.com.tr [email protected]
