Media Service Profile
Media Service Highlights

ATAPY Software was established in 2001 with active participation from ABBYY Software House, the manufacturer of the FineReader OCR product family. ATAPY focuses on custom software development in the fields of OCR and data capture, document imaging, document management and computer linguistics. Media Service (scanning, recognition, data entry, proofreading, formatting, XML markup, etc.) is another important field of the company's activity.

Compared to conventional media service bureaux, ATAPY is able to offer better results in a shorter time by developing on-demand software tools that streamline or even fully automate certain jobs. This approach is especially efficient in large-scale digitization projects, where it considerably reduces the amount of manual labor and ensures the highest quality within reasonable digitization budgets.

The ATAPY Media Service Department was established to help libraries, data archives, publishing houses, and other information-intensive organizations in their digitization and electronic publishing endeavors. For materials dating back many decades or even centuries, digitization is synonymous with preservation and successful dissemination of cultural values. ATAPY has worked with various materials, such as books in old European languages, backlogs of periodicals (including wide format), theater scripts, and others. However complicated the material is - featuring pale or uneven print, symbols of multiple languages on one page, outdated fonts, scientific formulae and other elements that are obstacles for the majority of modern OCR systems - ATAPY possesses sufficient means and resources to transform it into a searchable, accessible and well-structured electronic archive that will not be lost due to a fire, flood, or uncivilized reader.

ATAPY also carries out mass data capture from standard structured and semi-structured documents (forms).
The sensitive nature of information contained in forms often requires exceptionally high OCR accuracy (e.g., for financial documents or education tests), which cannot be achieved without manual verification and data validation. ABBYY FormReader and ABBYY FlexiCapture technology, strengthened by ATAPY's engineering experience and backed up by a pool of qualified operators, makes it possible to ensure the required accuracy in practically any European language, with minimal manual intervention. ATAPY has hands-on experience in developing custom data validation tools, export modules for third-party information and Document Management systems, pre-OCR image enhancement tools, and other technical means that enable smart and error-free data entry, as compared to the traditional brute-force approach.

Another strong point of ATAPY's Media Service is the ability to handle material with complex layout, such as newspapers and magazines. ATAPY has a track record of creating digital archives for several European magazines issued in English, Danish, German, and Swedish. This experience proved useful in the development of the Smart Newspaper Page Zoning Tool - a specialized page segmentation product targeted at newspaper-type layouts. It has been continuously developed for a number of years and improved through testing on various periodicals, with a particular focus on old editions (first half of the XX century). A distinctive feature of the tool is a flexible set of parameters that affect the segmentation process. Users can tune the tool to achieve the best results on a particular type of material. The intelligence accumulated in this development allows correct identification of column borders and difficult headings, and filters out decorative and layout-specific elements - such as frames and separators - that often mislead modern OCR systems.
This greatly reduces the manual labor otherwise required to correct the segmentation produced by most OCR systems.

These are actual examples of one and the same newspaper page natively segmented by ABBYY FineReader and segmented by ABBYY FineReader with the help of ATAPY's Smart Newspaper Page Zoning Tool. Segmentation using ATAPY's product is visibly more accurate and requires no correction.

[Figure panels: ABBYY FineReader; ABBYY FineReader + Smart Newspaper Page Zoning Tool]

The way it works at ATAPY Media Service: services and techniques

I. Scanning

ATAPY is well equipped to provide scanning services. For Metzler Verlag, a German publishing house, ATAPY carried out scanning and recognition of the 85-volume Pauly's Encyclopedia of Antiquity (Realencyclopädie der classischen Altertumswissenschaft). 59,500 pages were scanned in high-resolution grayscale mode and stored on 198 CD-ROMs. The University of Innsbruck entrusted ATAPY with a batch of XIX-century Austrian books for high-resolution scanning. Because of the high value of the books, they were shipped back and forth by courier; yet the postal charges did not outweigh the cost savings that the University gained by outsourcing the task to ATAPY. Scanning services can be provided at the recently opened ATAPY Sales and Technical Support office in Munich, Germany - a facility specially intended to bring ATAPY one step closer to its customers. When reasonable, the material can be scanned locally and sent to ATAPY over the Internet in a mutually agreed format.

II. Pre-OCR image processing

This phase is required when recognition quality suffers due to source image flaws such as noise (speckles of various kinds), page skew, colored or patterned background, etc. ATAPY uses a number of its own imaging tools, as well as third-party ones, to enhance image quality prior to OCR, in order to ensure maximum efficiency of the automatic recognition phase.

III.
Recognition

Scanned images are recognized using ABBYY FineReader OCR/ICR technology. Sometimes a certain programming effort is required to tune and customize ABBYY products. This may happen if the customer puts forth specific requirements not covered by "off-the-shelf" products, or if the material exhibits special characteristics, such as unusual fonts, unsupported language dialects, non-standard symbols and characters, specific layout, and the like.

A good example is our contribution to the international Meta-E initiative - a project undertaken by a consortium of 14 universities from 7 European countries and the US and co-funded by the European Commission. ATAPY's part of the project included tuning ABBYY FineReader to work with text printed in old European languages. ATAPY implemented Language Models for five Old European languages to be used in ABBYY FineReader OCR dictionaries. These dictionaries, together with ABBYY's part of the project - adaptation of FineReader for reading Frakturschrift (a Gothic blackletter typeface typical of old European books) - formed the basis of ABBYY FineReader XIX, a product later announced by ABBYY as a specialized FineReader version targeted at old printed sources. Since then, ABBYY FineReader XIX has become one of the main tools used by ATAPY when dealing with old material and sources printed in Fraktur; it has proved its efficiency in a number of projects for European organizations.

IV. Verification and QA

Despite the intelligence applied at the pre-OCR phase and powerful ABBYY OCR technology, in many cases - especially with difficult materials such as old print or scientific content - a human eye is required to back up machine recognition. ATAPY employs a number of professional multilingual operators well trained in proofreading and correction techniques.
When the highest recognition quality is a project requirement, the double verification technique is used, which means that each page is verified by two independent operators. Such an approach drastically improves the results as compared to ordinary, single verification, making it possible to achieve quality rates as high as 99.997% (a real figure obtained by the ATAPY QA Department in one of the projects). ATAPY's engineering potential comes into play again by supplying handy custom utilities that help operators automate routine tasks, such as inserting non-keyboard characters or fixing OCR mistakes specific to certain types of material. Over the years of providing Media services, ATAPY has developed an arsenal of such utilities, which are re-used in new projects.

V. Pre-publishing

If required, ATAPY converts the entire mass of recognized and verified material into any specific data format: multi-layer PDF, XML/XHTML, database, etc. - including those not offered by off-the-shelf OCR products. Most of ATAPY's tools and algorithms used at the preceding phases are already optimized for subsequent XML conversion; therefore, this phase is largely automatic. In some cases additional processing is required, such as specific XML markup, DTP layout correction, etc. An example of work in this area is the project for the Royal Danish Library, in which ATAPY converted 230 old Danish books, forming the entire Danish Literary Canon, into XML. ATAPY marked up the text with XML tags and validated the resulting files against the customer's XML schema. In the Landolt-Bornstein Encyclopedia digitization project, which ATAPY Software carried out for Springer Verlag, the material was converted into a customer-specific XML format known at Springer as A++.

VI. Additional IT services

ATAPY offers the service of enhancing third-party Electronic Record Management Systems and Document Management Systems with OCR modules based on ABBYY OCR toolkits.
For PRNet, a media monitoring agency operating in Turkey, ATAPY provided such integration and also enhanced PRNet's web application with a number of additional features and modules, such as Statistics and Reporting, Web-based Administration, etc.

Location

The ATAPY head office is located in Novosibirsk, Russia. Novosibirsk has always been one of the largest Russian hubs of scientific research related to Artificial Intelligence technology. The first industrial ICR system for reading ZIP codes, created in the 1970s and still in use by the Russian postal service, was developed here in Novosibirsk. At the same time, this location allows us to offer services at attractive prices that fit into the budgets of libraries and public organizations.

The recently opened Sales and Technical Support office in Munich, Germany, provides for closer interaction between ATAPY Software and its European clientele. By signing a contract with a German company, ATAPY Software GmbH, customers can eliminate the concerns associated with foreign contracts.

©2010-2013 ATAPY Software. All rights reserved. ABBYY and ABBYY FineReader are registered trademarks of ABBYY Software House. All the other trademarks are the property of their respective owners.

ATAPY Software
630090, Engineernaya Street, 16
Novosibirsk, Russia
Tel. +7 383 36 39 699
Fax +7 383 36 39 698
www.atapy.com
[email protected]

ATAPY Software GmbH
Elsenheimerstrasse 47
80687 Munich, Germany
Tel. +49 89 5111 5968
[email protected]

ATAPY for Science

H. Landolt's reference book "Physikalisch-chemische Tabellen" (first issued in 1883 in Germany) presents physicochemical constants of organic and inorganic substances in tabular format. The completely revised 6th edition, named "Landolt-Bornstein Zahlenwerte und Funktionen aus Naturwissen-Physik, Chemie, Astronomie, Geophysik und Technik", was issued in 1950-1980.
The emergence of new research methods led to the release of the "New Series" - a reference book named "Landolt Bornstein. New Series. Numerical Data and Functional Relationships in Science and Technology". Since 1961 more than 150 volumes have been issued.

Long-term cooperation between the largest European scientific publisher Springer Verlag and ATAPY Software

Both departments of ATAPY Software (Software Development and Media Service) have been continuously contributing to cooperation with Springer Verlag, the world's second-largest scientific publishing house and a part of the Springer Science+Business Media group, which unites 70 publishing companies all over the globe.

The cooperation started in 2003 with a pilot project intended to assess the efficiency of the combined use of ABBYY FineReader technology and the expertise of ATAPY engineer-linguists and media operators in digitizing scientific content. ATAPY was entrusted with a small part of the Landolt-Bornstein Numerical Data and Functional Relationships in Science and Technology Encyclopedia - a systematic and comprehensive collection of critically assessed data from all fields of physics, physical chemistry, bio- and geophysics, astronomy, materials science, and technology.

The task of converting scientific information into electronic document format is not trivial; prior to contacting ATAPY, Springer had made such attempts, which were not efficient enough because of an out-of-date technology base. ATAPY successfully accomplished the pilot project, which involved TIFF to Microsoft® Word conversion of a series of Encyclopedia pages. The pilot made it possible to understand the nature of the upcoming work, identify the necessary skills and technologies, understand the main complexities and work out solutions for them. The most serious challenge was a large amount of purely scientific data (formulae, tables, etc.) that contained special symbols missing from the Unicode character map.
This issue was partially overcome by creating special dictionaries inside ABBYY FineReader - the software package used for full-text recognition - and by implementing a specialized program to help operators promptly insert non-keyboard symbols at the verification phase. The second goal was an accuracy level of over 99.99%, which meant less than one mistake per 10,000 characters. The goal was achieved with the use of the excellent OCR technologies of ABBYY Software House, adjusted for this particular task by ATAPY, and thanks to the meticulous verification work of the ATAPY Media Service department. This promising beginning grew into full-fledged cooperation between Springer and ATAPY.

The second project involved digitization of larger amounts of the same Edition with further conversion into A++, the customer-specific XML format. As a part of this task, ATAPY engineers automated the A++ conversion of the reference lists following each chapter of the Edition. ATAPY Software also worked on digitization of the numerous charts "populating" the Edition in order to allow the material's online usage and eventual interactivity for scientists of the XXI century.

[Workflow diagram labels: Scanning; PDF document; Segmenting, recognition; Layout correction, formulae editing; Double proofreading; Online publishing]

Springer Science+Business Media, or Springer, is a worldwide publishing house based in Germany which publishes textbooks, academic reference books, and peer-reviewed topical journals with a focus on science, technology, mathematics, and medicine. Within the science, technology, and medicine sector, Springer is the largest book publisher and second-largest journal publisher worldwide, with over 60 publishing houses, 1,900 journals, 5,500 new books published each year, sales of 924 million euro (in 2006) and 5,000 employees. Springer has major offices in Berlin, Heidelberg, Dordrecht (the Netherlands) and New York.
Springer Science+Business Media
Heidelberger Platz 3
14197 Berlin, Deutschland
Tel. +49 6221 4870
Fax +49 6221 3450
www.springer.com

ATAPY Software Participates in the Development of International Computer Dictionaries

"ATAPY reached 99.992% text accuracy in the German-Russian Dictionary (1 mistake per 8,760 symbols), and 99.997% quality for the Spanish-Russian Dictionary project (1 mistake per 31,500 symbols). They also corrected many mistakes in the source dictionary text, including typographical misprints and even mistakes in special marks that are almost impossible to detect without special programming tools and profound knowledge of linguistics."
Anna Zhavoronkova, Project Manager, ABBYY Software House

Electronic dictionaries and translation systems are an area of great practical importance in the ever-globalizing world. ABBYY Software House, a world leader in OCR/ICR and linguistic technologies, develops and sells Lingvo electronic dictionaries. For many years Lingvo has been known as the best English-Russian dictionary on the market. Support for three more languages was planned for version 8.0; to introduce them to Lingvo, the latest best-of-breed dictionaries reflecting the modern state of the newly supported languages had to be digitized.

The ABBYY Lingvo 8.0 product line includes ABBYY Lingvo 8.0 Multilingual Edition, ABBYY Lingvo 8.0 for Pocket PC, as well as an updated and expanded version of ABBYY Lingvo English-Russian Edition. ABBYY Lingvo 8.0 Multilingual Edition supports eight translation directions: English-Russian, German-Russian, French-Russian, Italian-Russian, Russian-English, Russian-German, Russian-French, and Russian-Italian.
This Edition of ABBYY Lingvo includes more than 40 dictionaries containing more than 2,400,000 entries.

ABBYY turned to ATAPY Software, its outsourcing partner in Novosibirsk, for digital conversion of two dictionaries from the list picked out by the Linguistics Department. The 3-volume, 1750-page Leping German-Russian Dictionary and the 830-page Narumov Spanish-Russian Dictionary were to be recognized and proofread for automatic conversion into the ABBYY Lingvo database.

The highest possible text recognition accuracy was obviously a must. A single mistake could break the words' alphabetical order and tear a word away from its paradigm. If the number of such mistakes exceeded even a very modest threshold, the dictionary would become unsearchable.

Adequate interpretation of special dictionary marks was no less vital for the project. They were used as field delimiters in the automatic database conversion process and had to be recognized 100% accurately. Special marks appeared either as text characteristics (bold/italics), or as special symbols (brackets, asterisks), or as a combination of the two (e.g., italics + brackets indicated a dictionary comment). Omitting a single bracket or missing italicization would break the article's structure. This is why the project required both intelligent programming and highly qualified manual effort - a true challenge for any contractor in the media service area.

The dictionaries were scanned and automatically recognized with ABBYY FineReader specially tuned for processing this material. Then a team of qualified operators proofread and cross-checked the results using the double verification technique to ensure recognition accuracy. Double verification also made it possible to detect certain unexpected cases, such as typos in the source dictionary text, which were corrected according to ABBYY's guidelines.
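The cross-checking idea behind double verification can be illustrated with a short sketch. It is only an assumption about how two independently proofread versions of the same entry might be compared so that a reviewer sees just the disagreements; the function name `find_disagreements` and the sample dictionary entry are invented for illustration and do not describe ATAPY's in-house utilities.

```python
from difflib import SequenceMatcher


def find_disagreements(text_a: str, text_b: str):
    """Return (fragment_a, fragment_b, offset_in_a) for every span where
    the two independently verified texts differ."""
    matcher = SequenceMatcher(None, text_a, text_b, autojunk=False)
    disagreements = []
    for tag, a0, a1, b0, b1 in matcher.get_opcodes():
        if tag != "equal":
            disagreements.append((text_a[a0:a1], text_b[b0:b1], a0))
    return disagreements


# Invented sample entry: operator B missed an umlaut and a closing bracket,
# the kind of slip that would break a dictionary article's structure.
operator_a = "Haus n -es, Häuser (дом)"
operator_b = "Haus n -es, Hauser (дом"

result = find_disagreements(operator_a, operator_b)
for fragment_a, fragment_b, offset in result:
    print(f"offset {offset}: operator A has {fragment_a!r}, operator B has {fragment_b!r}")
```

Only the flagged spans need a third look, so the cost of the second pass stays close to the cost of reading the differences rather than the whole page.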
In its effort to automate the proofreading work to the maximum possible extent, ATAPY developed and customized a number of in-house utilities. One of them was Glyphica, a tool for quick input of characters that cannot be found on the keyboard. For the Leping Dictionary, ATAPY developed a custom converter with built-in spellchecking and punctuation checking utilities, which made it possible to weed out mistakes not spotted during the previous stages and finally convert the material into the Lingvo vocabulary database.

ABBYY Software House (www.abbyy.com) is based in Moscow, Russia. The company was founded in 1989. Today ABBYY has over 880 employees worldwide, including offices in Russia, USA, Ukraine, UK, Germany, Taiwan, Japan and Cyprus. ABBYY develops software products in the fields of artificial intelligence, document recognition, data capture and applied linguistics. ABBYY is most notable for its optical character recognition package ABBYY FineReader.

ABBYY Software House
P.O. Box #20, Moscow, Russia, 127273
Tel. +7 (495) 783 3700
Fax +7 095 783 2663
www.abbyy.com
[email protected]

Meeting the Challenge of Time

ATAPY Software participates in the development of ABBYY FineReader XIX - an OCR system for reading old European books

"I've got FineReader XIX installed on my computer. The Frakturschrift recognition is very good.
Even though old text recognition is not a large and growing market, I am sure all the service bureaus here in Germany will be ordering 1 or 2 copies and have it run 7x24"
Johannes Stöpetie, CEO, ABBYY Europe GmbH

Based on FineReader 7.0

Meta-E (http://meta-e.uibk.ac.at) is a collaborative initiative undertaken by a consortium of 14 universities from 7 European countries and the US, co-funded by the European Union. The project is focused on providing a technology basis for digitization and web publishing of valuable old printed sources spanning several centuries of European history. For this purpose, an OCR system was required that could recognize historical texts from the period 1800-1938, including those printed in Frakturschrift (an old-style blackletter typeface prevalent at that time). Until now, no omnifont OCR systems for reading Frakturschrift have been available: OCR products had to be trained on each individual book before processing.

Meta-E coordinators searched for a high-quality OCR package that could be customized to meet their requirements. ABBYY FineReader was chosen due to its unrivalled recognition accuracy, support for 177 modern languages, and ease of use. ABBYY Software House, the international manufacturer of the FineReader product line, took up the project as a direct contractor to carry out the development of the omnifont part (introducing the Frakturschrift graphics to FineReader). The linguistic part of the project was subcontracted to ATAPY Software, ABBYY's long-term partner in OCR and computer linguistics development.

ATAPY's role in the Meta-E project was constructing Old Language Models (LMs) for 5 European languages: English, French, German, Italian, and Spanish. An LM is a computer database that describes the vocabulary of a language. FineReader uses LMs during recognition for building OCR hypotheses and spellchecking.
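To make the role of a vocabulary in hypothesis building more concrete, here is a minimal sketch. It is not FineReader's algorithm: the function `pick_hypothesis`, the candidate readings and their image-match scores are all invented for illustration, and the vocabulary is a plain set of word forms chosen for simplicity.

```python
def pick_hypothesis(candidates, vocabulary):
    """Choose among competing readings of one word image.

    candidates: list of (text, image_score) pairs; the scores are made-up
    stand-ins for how well each reading matches the printed glyphs.
    A reading found in the vocabulary wins over one that is not;
    the image score only breaks ties.
    """
    best_text, _ = max(candidates, key=lambda c: (c[0].lower() in vocabulary, c[1]))
    return best_text


# Two readings of the same worn Fraktur word (scores are illustrative only).
candidates = [("Allerlhumskunde", 0.58), ("Alterthumskunde", 0.55)]

modern_vocabulary = {"altertumskunde"}          # modern spelling only
old_spelling_vocabulary = {"alterthumskunde"}   # historical spelling included

print(pick_hypothesis(candidates, modern_vocabulary))
print(pick_hypothesis(candidates, old_spelling_vocabulary))
```

With only the modern spelling available, neither reading is confirmed and the slightly better-looking misreading wins; once the historical form is in the vocabulary, the correct reading is preferred.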
LMs are not just full lists of words in all possible grammar forms: such a database would be enormous in size and hardly manageable. FineReader LMs store only the stem of each word and describe the grammar as a set of inflection rules (paradigms). Each stem is assigned a list of paradigms; applying them to the stem produces all possible forms of the word.

ATAPY's task was to study a large number of authentic dictionaries and original old European texts dating back to the targeted time span, review the word stock, add the words that had been phased out of the languages, and correct the paradigm assignments to synchronize the LMs with the actual grammatical practice of that time.

To complete this task, ATAPY's linguists carefully selected 10 dictionaries reflecting the state of the 5 languages, published between 1808 and 1930. ATAPY also thoroughly analyzed 105 authentic books of that period, comprising more than 50 MB of text.

The next step was to build FineReader LMs. ATAPY's linguists manually compared the information from authentic dictionaries and texts - about 500,000 entries in total - to the existing FineReader vocabularies. This work turned up a total of 458,767 words, of which 61% remained unchanged and 36% were added to the vocabularies from the analyzed sources. About 3% of the words had their paradigms corrected towards the XVIII-early XX century grammar rules. To carry out such correction, the linguists had to add 159 historic grammar paradigms that were missing in the contemporary models.

Finally, the LMs were compiled and tested on the control text corpus. They showed 98.91% vocabulary coverage for Old English, 99.16% for Old French, 96.58% for Old German, 98.58% for Old Italian, and 98.79% for Old Spanish.

To illustrate the above, let's look at a few samples. A regular FineReader package, or any other contemporary OCR system, will make a lot of mistakes here.
For example, "Alterthumskunde" may become "Allerlhumskunde" in the first fragment; in the second fragment, "UEBERSICHT" ("Übersicht" in modern German) is recognized as two words, "UEBER SICHT", etc. These mistakes occur because of two factors. The first is the low quality of the printing, about which nothing can be done. The second is the old spelling used in those texts. All existing OCR systems are targeted at modern texts and therefore only know modern spelling.

Once the five LMs were merged into the FineReader 7 shell, ABBYY was able to offer a special version of FineReader which "knows" the spelling and font specifics of Old European languages. This version can read old texts more accurately, eliminating many of the mistakes made when only modern OCR systems are used. In effect, users will be able to scan and recognize old texts with higher quality, saving much of the time previously spent on error correction.

The updated ABBYY FineReader promises to be a powerful tool for use by the Meta-E consortium in its large-scale digitization work. In addition, shortly afterwards ABBYY released FineReader XIX (www.frakturschrift.com) - the industry's first boxed OCR product to recognize Renaissance and Late Medieval sources, a product specially targeted at European libraries and public organizations engaged in preservation and publishing of cultural assets, and at service bureaus helping them fulfill this mission.

ABBYY Europe GmbH is a European department of ABBYY Software House based in Munich, Germany. ABBYY Software House is the manufacturer of software products in the fields of artificial intelligence, document recognition and applied linguistics. One of the most notable products by ABBYY Software House is the optical character recognition package ABBYY FineReader.
ABBYY Europe Software House
Elsenheimerstrasse 49
80687 Munich, Germany
Tel. +49 89 5111590
Fax +49 89 51115959
[email protected]
www.abbyyeu.com

ATAPY Helps UNESCO School in Transatlantic Slave Trade Education Project

Fredensborg is the best documented wreck of a Transatlantic slave trade ship located so far. The ship left Copenhagen in June 1767 and traded for 265 slaves at Danish-Norwegian forts along the Gold Coast of Africa. About 10% of the slaves died during the ship's middle passage, and over one-third of the crew also died during the voyage. The traders sold their human cargo at St. Croix, and then loaded the ship with sugar, tobacco, and other tropical products for the return trip. The ship had almost reached its destination when it was wrecked during a violent storm.

On September 15, 1974, divers Odd K Osmundsen, Tore Svalesen and Leif Svalesen discovered wreckage and giant elephant tusks at the bottom of the sea near Tromoy, off the southern coast of Norway. Along with the ivory, they found cannons, ship timber and other interesting objects. Almost everything was hidden under layers of seaweed, rocks, and sand. However, as a result of thorough planning and intense study of old documents from the archives, the three divers knew exactly what they had found.

The Danish-Norwegian slave ship Fredensborg, which sank on December 1, 1768, was a typical ship engaged in the so-called Triangular Trade. Triangular Trade is the name given to the trading route used by European merchants who exchanged goods with Africans for slaves, shipped the slaves to the Americas, sold them and brought goods from the Americas back to Europe. Ships left Europe with cargoes of a broad assortment of goods considered suitable for the slave trade.
Once anchored at the forts, the interiors of the ships were rebuilt to accommodate enslaved Africans.

In September 2003 ATAPY was contacted by Mr. Jeff Klinto, an educator at Vesthimmerlands Gymnasium. This UNESCO school participates actively in the Transatlantic Slave Trade Education Project supported by the Danish UNESCO Committee and The Digital North Denmark (Det Digitale Nordjylland) project. As a part of the project, Mr. Klinto initiated the creation of a CD-ROM with teaching materials about the Danish involvement in the Triangular Trade. The CD-ROM had to contain contemporary materials as well as materials from the age when the use of Gothic letters was common. That is where the project faced a challenge.

The Transatlantic Slave Trade Project aims to break the silence surrounding the Transatlantic Slave Trade. By learning about the past, young people can fully understand the present and prepare a better future together in a world free of all types of enslavement, injustice, discrimination and prejudice.

[Photo: Reconstruction of a slave ship by students of Vesthimmerlands Gymnasium]

Mr. Klinto: "Gothic letters caused a number of problems in relation to the OCR programs that are currently available on the market. These problems led us to approach the Royal Library in Copenhagen, where they recommended that we address the Russian company ATAPY Software, which handled the Royal Library's Gothic materials. However, the thought of having Gothic texts which were written in Danish handled by Russian employees seemed unrealistic. The materials were from an age when there was no national orthography yet; the dictionaries in the OCR program would be useless. How would they be able to work? Add to this the very uneven quality of printing in the old works, and the task seems rather impossible."

[Image: Old Danish books page samples]
The Media Service Department of ATAPY accepted the challenge and did excellent work on recognition, proofreading and exporting to HTML of more than 5500 pages of Old Danish books. Let Mr. Klinto sum up the project in his own words:

"The materials that were returned were of a high standard, and ATAPY was incredibly obliging and helpful. The communication with project leadership functioned excellently and the team worked wonders with the materials, which were often of a poor quality. Therefore I would like to sincerely recommend this company. Thanks to the competent staff of ATAPY, it is now possible for the public to have access to materials which may not be issued at libraries anymore because of their age and rarity. Incidentally, it is worth noticing that the work was done for a very favorable price."

UNESCO and the UNESCO logo are copyrights of UNESCO, the United Nations Educational, Scientific and Cultural Organization.

Det Digitale Nordjylland
Projekt 202
Att. Jeff Klinto
Vesthimmerlands Gymnasium
[email protected]

The Royal Danish Library in Copenhagen and Arkiv for Dansk Litteratur

"Working with ATAPY has been a pleasure. We have been impressed with the high level of concern for producing the best possible text of the works and the accuracy of the results."
Virginia Laursen, Webmaster, Royal Danish Library

The Royal Danish Library in Copenhagen is the national library of Denmark and the largest library in the Nordic countries. It contains numerous historical treasures; all works that have been printed in Denmark since the XVIIth century are deposited there.
Thanks to extensive donations in the past, the library holds nearly all known Danish printed works back to the first Danish book, printed in 1482.

ATAPY Software converts the entire Danish Classic Literature Canon into XML

The Royal Danish Library in Copenhagen (http://www.kb.dk) has the largest book collection in Northern Europe and strives to facilitate access to its resources using advanced technologies. It undertook an ambitious project named "Arkiv for Dansk Litteratur", aimed at converting the whole Danish literary canon (the works of 70 carefully selected Danish authors from the XIth to the early part of the XXth century) into machine-readable text in XML format and making it available on the web.

The great number of books, their diverse contents (verses, prose, pictures, tables, notes and comments), and the requirement to preserve layout and typesetting made this project a very special task. In order to succeed, the contractor had to possess some seemingly incompatible qualities.

On the one hand, the company had to be competent with modern Optical Character Recognition packages, proficient in XML coding, and capable of designing specialized software instruments to facilitate the conversion process. This required high IT qualifications and extensive hands-on experience in data capture technologies.

On the other hand, almost all real-life mass data input projects still entail a lot of manual labor. No matter how accurate an OCR system is, it will make mistakes, especially on such difficult material as old books with their complex layout. Besides, full automation of XML coding was not possible on the Library's material because of the diversity of attributes. The contractor therefore had to be able to offer many qualified operators at a reasonable cost; otherwise the project's price tag would exceed the financial capabilities of any library.

The IT staff of the Library attempted to solve this problem by searching for a partner outside the EU.
Their attention was drawn to Russia, the home of the world-renowned OCR system ABBYY FineReader. Following months of trials, the Library settled on ATAPY Software, a leading developer of custom OCR solutions based on FineReader technologies and an experienced media service provider. The pilot projects demonstrated that ATAPY combined high IT professionalism with access to an extensive pool of qualified multilingual operators.

The book conversion process was organized in three large phases:

1. Reading scanned images into text format. The Library provided ATAPY with scanned pages in TIFF format. The quality of the images was remarkably good, which contributed significantly to the efficiency of the remaining stages. The images were automatically analyzed by ABBYY FineReader, which segmented them to distinguish text from pictures and detected the table structure. The segmentation results were reviewed by layout operators. After that, pages were recognized using FineReader's omnifont capabilities, augmented with many font-specific patterns that raised the recognition quality for most old books. Then a group of operators proofread the OCR results. Special attention was paid to non-Danish inclusions, some of which could not be OCRed at all (Old Greek, Hebrew, etc.).

Image: Danish page samples

2. Preparation of initial XML documents. Verified text was exported to Microsoft® Word format. A group of XML operators, equipped with custom tools and macro programs, used Microsoft® Word as the environment for adding XML tags. This was an intellectually demanding task: the full list of tags contained over 50 entries, and only half of them could be identified automatically. The remaining half had to be spotted and marked manually, and all of this in Danish.

3. Assembly of book XML files. Once markup was finished, XML specialists assembled the books, adding supplementary "entire-book" tags and bibliographic information.
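The third phase can be illustrated with a minimal sketch. It assumes, purely for illustration, that each marked-up chapter is already a well-formed XML fragment; the function name `assemble_book` and the tag names (`book`, `bibliography`, `title`, `author`, `body`, `chapter`) are hypothetical, since the project's actual tag set and tools are not described here.

```python
import xml.etree.ElementTree as ET


def assemble_book(title, author, chapter_fragments):
    """Wrap marked-up chapter fragments in hypothetical book-level tags."""
    book = ET.Element("book")

    # Bibliographic information added at the "entire-book" level.
    bib = ET.SubElement(book, "bibliography")
    ET.SubElement(bib, "title").text = title
    ET.SubElement(bib, "author").text = author

    # Each fragment is parsed before being attached, so malformed
    # markup raises xml.etree.ElementTree.ParseError early.
    body = ET.SubElement(book, "body")
    for fragment in chapter_fragments:
        body.append(ET.fromstring(fragment))
    return book


# Hypothetical sample data, not taken from the project.
chapters = [
    "<chapter><head>I</head><p>Første kapitel.</p></chapter>",
    "<chapter><head>II</head><p>Andet kapitel.</p></chapter>",
]
book = assemble_book("Eksempel", "Ukendt forfatter", chapters)
xml_text = ET.tostring(book, encoding="unicode")
```

Parsing every fragment before assembly is one way to catch markup errors per chapter rather than per book; a real pipeline would typically also validate the result against the target schema.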
Being a software company as well as a media service company, ATAPY was able to assign experienced customization engineers to the project and develop project-specific utilities for every conversion phase. As the project moved on, this allowed ATAPY to gradually decrease processing time by a further 10 to 20%, passing the savings on to the client.

Since the successful completion of the project, all books have been available online at http://www.adl.dk.

After years of working in the area of media service, ATAPY has become a true expert in this field, having dealt with texts of different layouts, structures and languages. These included library cards, encyclopedia articles, magazine publications, rarities that date back to the XIXth century, and other materials of all genres and formats. In addition to the Royal Danish Library, ATAPY's media service clients include the Springer publishing house (Germany), the University of Innsbruck (Austria), J.B. Metzler Verlag (Germany), EasyData B.V. (Netherlands), Consodata (France), PRNet (Turkey), and many other institutions and companies.

ATAPY's data capture process is highly effective in terms of both IT infrastructure and human resources. High-speed, high-quality processing of multi-language material, client communication in four languages, and very affordable pricing are ATAPY's trademarks, which it continues to exhibit in every contract, big or small.

©2011-2013 ATAPY Software. All rights reserved. ABBYY and ABBYY FineReader are registered trademarks of ABBYY Software House. All the other trademarks are the property of their respective owners.

The Royal Danish Library / Det Kongelige Bibliotek
P.O. Box 2149, DK-1016 Copenhagen, Denmark
Tel. +45 33 47 47 47, Fax +45 33 93 22 18
www.kb.dk
[email protected]

Backlog Conversion of Danish Musical Magazines

ATAPY acquires new clients in the field of media service

"All files validated against the schema, very nice! I took a closer look at a random selection of files, and I am very impressed by the quality of your work! Metadata as well as OCR-treated text is of excellent quality, so, I think we can regard this as 'mission accomplished' and sign the Act of Acceptance."
Henning Olesen, IT Project Manager, The State and University Library, Universitetsparken, DK-Aarhus

ATAPY prides itself on being able to handle the most challenging data capture tasks through its intelligent digitization approach: using specialized software tools at all phases of the process to ensure highly accurate results with minimum manual effort.

One of the recent projects requiring such an approach was carried out for "Nordic Sounds", a versatile Danish musical magazine that covers all the musical genres found in contemporary North European music. The magazine is widely distributed outside Denmark and is therefore published in English.

The editors of "Nordic Sounds" requested the creation of a digital archive of all issues published to date. The resulting archive had to be more than a collection of digitized texts: it had to be a true archive that could be searched and structured. This goal was readily achievable with XML as the output format.

The task was entrusted to the Media Service Department of ATAPY. To meet the customer's requirements, it was necessary to create one XML file per article, which meant that the contents of each magazine issue had to be analyzed. Although the majority of the magazine materials were in English, many proper names and quotations were in North European languages (Danish, Swedish, Norwegian, Finnish, and Icelandic).
This peculiarity required special attention from the engineer-linguists who worked on the "Nordic Sounds" digitization and XML conversion. ATAPY's past and ongoing experience in processing multi-language sources, Danish in particular, proved very useful here.

ATAPY Software achieved the project goals on time (the project lasted approximately two months), with excellent quality acknowledged by "Nordic Sounds". Thanks to ATAPY, the Nordic Sounds magazine archive has been available online since May 2005 as part of the Online Music Research Library (www.dvm.nu).

Image: GAFFA and MM covers

After such a successful start, ATAPY digitized backlogs for two more popular Danish musical magazines: "MM" and "GAFFA". The GAFFA magazine archive (1983 to 2008) is also published online, providing full keyword search and retrieval of the original page images.

Image: GAFFA spread samples

AARHUS UNIVERSITY

Aarhus University (www.au.dk), located in the city of Aarhus, Denmark, is Denmark's second oldest and second largest university, after the University of Copenhagen. The university was founded in 1928 and has an annual enrollment of more than 35,000 students. Aarhus University housed Denmark's first professor of sociology (Theodor Geiger, from 1938 to 1952), and in 1997 Professor Jens Christian Skou received the Nobel Prize in Chemistry for his discovery of the sodium-potassium pump.

©2011-2013 ATAPY Software. All rights reserved. All trademarks used are the property of their respective owners.

Aarhus University
Nordre Ringgade 1, 8000 Aarhus C
Tel. +45 8942 1111, Fax +45 8942 1109
[email protected]
www.au.dk

ATAPY Media Service Operations: Helping Preserve Swedish Cultural Heritage

ATAPY's track record in Scandinavian countries includes the digitization of a large collection of books for the Royal Danish Library, the creation of a mini-archive of XVIII-century North European prints for a UNESCO educational project, and backlog conversion for several Danish musical magazines. As a result of this work, ATAPY was recently entrusted with several new projects in Scandinavia, particularly in Sweden.

Building an archive of Old Swedish plays for Riksteatern Sweden

The Swedish National Touring Theatre (Riksteatern Sweden) needed to convert its collection of Old Swedish plays into digital format. According to the project requirements, the texts were to be converted to Microsoft® Word, with the original page design preserved as far as possible. Because of the age of the material and the specifics of its layout, the task required both expert knowledge of OCR technology and heavy manual formatting effort.

In this project, ATAPY processed more than 12,000 pages in Old Swedish. For almost 50% of the material, double verification was used in order to ensure high recognition accuracy and excellent searchability of the text. All the digitized material is now available online on one of Riksteatern's web sites, in Microsoft® Word and PDF formats.

Riksteatern is Sweden's popular national touring theater (in English, "National Touring Theater" or "National Theater Company"). Established in 1933 with the goal of promoting and producing quality theater throughout Sweden, Riksteatern is now the biggest theater company on tour in Sweden, financed and owned by 240 local Swedish economic associations.

Digitization of Old Prints Collection for Gothenburg University

In the same year, 2008, ATAPY converted into text format a series of old Swedish printed sources dating back to the XVIII-XIX centuries for Gothenburg University Library.
In this ongoing project, which comprises several phases and by now exceeds 75,000 pages, about 65% of the material was subject to full verification. The remaining material, which yielded better OCR results, underwent partial verification (of uncertainly recognized symbols only). This approach delivered considerable cost savings. As a next step, ATAPY performed manual markup of the files for subsequent conversion to XML.

The University of Gothenburg is one of the major universities in northern Europe (approximately 37,000 students). The University's 40 departments cover most scientific disciplines, making it one of Sweden's most diversified higher education institutions.

In both projects, ATAPY faced the typical challenges associated with old books:

- Low quality of the original page images (old weathered paper, pale print, etc.)
- Uneven lines, "jumping" print, and varying spacing between letters, words and lines
- Old Swedish words and grammar (the dictionaries of ABBYY FineReader, and sometimes even of FineReader XIX, failed)

ATAPY overcame these difficulties by using ABBYY FineReader XIX, a specialized package for processing prints in Old European languages and typefaces, by segmenting the material intelligently, and by applying qualified manual services where necessary. As always, ATAPY's strategy was to automate the work where possible, minimizing customers' expenses without sacrificing quality.

Creation of an electronic archive of Selma Lagerlöf's works for the National Library of Sweden

Selma Lagerlöf (1858-1940), one of the most prominent Swedish authors, a winner of the Nobel Prize in Literature and a member of the Swedish Academy, left a literary heritage of more than 2,500 pages. In 2010, the National Library of Sweden undertook a project aimed at making this material available online.
One of ATAPY's former customers recommended the company as an excellent service provider with an affordable price tag and hands-on experience with sources in Scandinavian languages. The project involved the following phases:

- OCR of scanned images
- Full verification of OCR results
- XML markup of basic layout elements: titles, page numbers, separator elements, etc.

ATAPY processed the material within a short timeframe with a team of three media service operators. That same year, the Library began publishing selected portions of its Lagerlöf Collection online to commemorate the 150th anniversary of Selma Lagerlöf's birth.

The National Library of Sweden is a state agency with offices in Stockholm. The Library has been collecting virtually everything printed in Sweden or in Swedish since 1661. Currently the Library coordinates services and programs for all research libraries in Sweden and administers LIBRIS, the Swedish national library catalog system.

©2011-2013 ATAPY Software. All rights reserved. ABBYY, ABBYY FineReader and ABBYY FineReader XIX are registered trademarks of ABBYY Software House. All the other trademarks used are the property of their respective owners.

Data Input Project for Novosibirsk Mayor's Office

ATAPY took part in a social project for the Novosibirsk Mayor's Office in cooperation with «Zolotaya Korona», a Russian nationwide retail electronic payments network. The project was aimed at the technical modernization of fare collection in public transport (metro, buses, trolley buses and tramways) by introducing electronic transportation passes based on microprocessor plastic cards.

A serious challenge for the project was the large number of passengers using social security benefits and discounts.
The Novosibirsk Public Transportation Authority compensated transport carriers from the city budget for carrying such passengers. One of the goals of the «Transportation pass» project was to retain the benefits/discounts scheme for people who were entitled to it, and at the same time to provide a precise and convenient mechanism for gathering transportation statistics for such passengers. It was necessary to exclude every possibility of fraud; neglect of this issue had been a weak point of the previous transportation pass system.

«Zolotaya Korona» offered a solution based on contactless microprocessor cards: personalized ones for citizens entitled to the benefits/discounts scheme and non-personal ones for ordinary passengers. «Zolotaya Korona» issued a separate type of personal card for each designated category of beneficiaries: students (the Student Transportation Pass), school children (the School Child Transportation Pass), and social security beneficiaries (the Social Security Transportation Pass).

There was yet another challenge. To obtain a personal transportation pass, each person had to fill in a machine-readable paper form. «Zolotaya Korona» had to deal with tens of thousands of application forms, which needed to be processed within a short period of time and with high accuracy. An additional requirement was to store a color photo of the applicant in the database. The peak of the «paper flood» was expected to come with the issue of Student Transportation Passes.

To deal with the challenge, «Zolotaya Korona» turned to ATAPY Software, a data capture company with a proven track record.

The data capture process involved the following phases:

1. Filling in of the form by the applicant (in handprint)
2. Scanning
3. Machine recognition
4. Verification of recognized data
5. Export to the database
6. Card production
7. Card issue

Working in close cooperation with engineers from «Zolotaya Korona», ATAPY developers helped to design the machine-readable application form to be filled in by students at Phase 1, a form specially optimized for processing by ABBYY FormReader. Then ATAPY engineers developed a special preprocessing algorithm to be applied to form images between Phase 2 and Phase 3. It removed the color background from the form and improved the recognition quality of the handprinted text, while retaining the color photo. The procedure significantly reduced the form processing turnaround at Phases 3-5 (recognition and verification of data using ABBYY FormReader) and ensured high accuracy of the captured data.

Thanks to the combined efforts of «Zolotaya Korona» and ATAPY Software, Novosibirsk students can now travel by all means of transport using the Student Transportation Pass, without reaching for money or a student ID or entering a PIN code. In the near future, «Zolotaya Korona» plans to extend this valuable experience to other Russian cities.

Zolotaya Korona is a Russian nationwide retail electronic payments network uniting 220 banks from 75 regions of Russia, the CIS and foreign countries. «Zolotaya Korona» cards are accepted in 273 cities of Russia, and also in Ukraine, Belarus, Kyrgyzstan, Mongolia and China. The «Zolotaya Korona» system was established in 1994 by the Center of Financial Technologies.

©2010-2013 ATAPY Software. All rights reserved. ABBYY and ABBYY FineReader are registered trademarks of ABBYY Software House. All the other trademarks are the property of their respective owners.

Zolotaya Korona
630055, Shaturskaya street, 2, Novosibirsk, Russia
Tel. +7 383 336 49 49, +7 383 335 80 88
www.korona.net
[email protected]

sense your media

PRNet is a Media Monitoring and Analysis company serving over 300 corporate clients in Turkey. The company acts as a strategic partner for communication specialists and executives who aim to develop corporate reputation and need to assess the results of their communication strategies. PRNet provides access to its online database, where customers can search among more than 25 thousand clips and 80 million results stored since 2000, survey 4,500 pages of newspapers and magazines, view videos from 74 TV channels recorded on a 24/7 basis, and access more than 1,000 Internet portals. According to ISO 500 research, 7 of the top 10 companies in Turkey, and 84 of the top 100, prefer PRNet for their media-monitoring and industrial information needs.

©2011 ATAPY Software. All rights reserved. ABBYY, the ABBYY logo and ABBYY FineReader are registered trademarks of ABBYY Software House. PRNet and the PRNet logo are registered trademarks of PRNet.

A Networked Media Clipping System

For more than a century, daily, systematic analysis of printed media has been an important tool for successful businesses worldwide. Media clipping companies, using tools suited to the last century, provided the analysis that business demanded. All those years, the rustling of pages and the jingle of scissors were the constant audio background of media clipping companies' operations.

The arrival of the digital age healed the callused hands of operators. Fewer and fewer scissors were used as companies switched to scanning printed material. Paper no longer left the scanner room, and reading was done from computer monitors. But the overall processing of newspapers and magazines still required too much human input to automate, so the amount of labor spent by media clipping companies remained largely the same.
Early 90s OCR programs worked for letters and faxes, but turned out to be useless when confronted with the complex layout and font variety of newspapers.

In 1997, a Turkish media research company named PRNet approached ABBYY Software House, the manufacturer of FineReader OCR products, with a request to design a system to streamline the media clipping process.

Dalian 1.0 went into operation in 1998, delivering a service that subscribers had never seen before. As early as nine in the morning, subscribers could log on to PRNet's web site, click on their own customized albums, and view a new page with clippings from that very day's morning newspapers. Only clippings containing a subscriber's keywords went to that subscriber's albums. Content was delivered as text and pictures in HTML format, allowing the subscriber to copy and paste it into other software for distribution or editing. Keywords were highlighted. All major Turkish publications were covered (50 titles). The clippings were preserved in an MS SQL Server database for long-term storage and future reference.

All this was achieved with an average staff presence of 14 operators, a fantastic efficiency compared to less sophisticated systems. The workload was largely shifted to unattended computers: OCR PCs had to be rack-mounted 10 units tall to fit into a single room, with one hot-switchable monitor for control.

When the new version of FineReader OCR came out, PRNet invited ABBYY to migrate Dalian to this new platform. Pursuant to new corporate outsourcing policies, ABBYY transferred the project to ATAPY Software, an IT development company specializing in custom OCR tools. Besides migration, PRNet asked ATAPY to add web-based administration, system statistics and reports, a web client for extended media search, improved output for clippings, and many other features and enhancements.
The new Dalian 2.0 went into operation in 2003, providing media insights to about 80 clients, including the Turkish offices of Alcatel, Compaq, Toyota, Unilever, Vestel, CNN, Reebok, and Siemens, as well as such local giants as members of the Koç Group and the leading banks of Turkey.

The dramatic improvement in recognition rate, the possibility of employing home-based operators working through web interfaces, and other serious advancements in system functionality and manageability place Dalian 2.0 in the top rank of modern media clipping software solutions.

©2011-2013 ATAPY Software. All rights reserved. ABBYY, the ABBYY logo and ABBYY FineReader are registered trademarks of ABBYY Software. PRNet and the PRNet logo are registered trademarks of PRNet.

PRNet
Spring Giz Plaza B Blok 17/18, Maslak 80670 Istanbul, Turkey
Tel. +90 212 328 18 09, Fax +90 212 328 18 07
www.prnet.com.tr
[email protected]