Devanagari Font Design for Optical Character Recognition

Dual Degree Dissertation

Submitted in partial fulfillment of the requirements of the degree of
(Bachelor of Technology & Master of Technology)

By
Mustafa Saifee
07D07004

Supervisor: Prof. Ravi Poovaiah

Department of Electrical Engineering
INDIAN INSTITUTE OF TECHNOLOGY BOMBAY
May 2012

Approval Sheet

Devanagari Font Design for Optical Character Recognition by Mustafa Saifee is approved for the degree of Bachelor of Technology in Electrical Engineering & Master of Technology in Communication and Signal Processing.

Examiner
Examiner
Supervisor
Chairman

Date
Place

Declaration

I declare that this written submission represents my ideas in my own words and where others’ ideas or words have been included, I have adequately cited and referenced the original sources. I also declare that I have adhered to all principles of academic honesty and integrity, and I understand that any violation of the above will be cause for disciplinary action by the institute and can also evoke penal action from the sources which have thus not been properly cited or from whom proper permission has not been taken when needed.

(Signature)
(Name)
(Roll No.)
Date

Acknowledgements

I express my sincere gratitude to Prof. Ravi Poovaiah for his support, guidance and constant encouragement. I would like to thank Manoj G. from CDAC for his guidance and support. At this time I remember my parents and brother with great reverence, whose support and prayers are always my strength. I would also like to thank my faculty advisor and HoD Prof. Abhay Karandikar for his support.

Mustafa Saifee
07D07004

Abstract

The original motivations for developing optical character recognition technologies were modest: to convert printed text on flat physical media to digital form, producing machine-readable digital content.
By doing this, words that had been inert and bound to physical material would be brought into the digital realm and thus gain new and powerful functionalities and analytical possibilities. OCR is crucial to the computerization of printed texts so that they can be electronically searched, stored more compactly, displayed on-line, and used in machine processes such as machine translation, text to speech and text mining.

We design a Devanagari script font optimized for OCR. In this report we first study the basics of OCR systems and the working of Devanagari OCR. We also study Latin fonts designed for OCR and the precautions that need to be taken while designing a Devanagari script font for OCR. We then apply this knowledge to design the font which is showcased in the report. We also discuss the different features of letterforms which help in recognition. In conclusion, we tested the font and found the results encouraging.

Contents

1. Introduction
2. Devanagari Script
   2.1 Alphabets
   2.2 Anatomy
   2.3 Character Frequency in Hindi
3. Optical Character Recognition
   3.1 History of OCR
   3.2 Applications of OCR
   3.3 Recognition of Devanagari Script
4. Typefaces for OCR
   4.1 Latin Fonts for OCR
   4.3 Devanagari Font for OCR
5. Font Design for OCRs
   5.1 Design Features Important for OCR
   5.2 Special Care for Devanagari
6. Proposed Font Designed for OCRs
   6.1 Designing Characters
   6.2 Evolution of Typeface
   6.3 Anatomy of the Typeface
   6.4 Testing of the Typefaces
7. Future Work
References

List of Figures

Figure 2.1 Bhagwat’s grouping of letters on the basis of graphical similarities
Figure 2.2 Bhagwat’s guidelines
Figure 2.3 Naik’s grouping of letters on the basis of the position of the vertical bar or the kana
Figure 3.1 OCR-A (top); OCR-B (bottom)
Figure 3.2 Procedure of Devanagari Recognition
Figure 3.3 Image before binarization (left); Image after binarization (right)
Figure 3.4 Horizontal Projection Profiles of a document for line segmentation
Figure 3.5 Vertical Projection Profiles of a document for word segmentation
Figure 3.6 Three parts of a Devanagari word
Figure 3.7 The procedure of Hindi character segmentation
Figure 4.1 Characters of OCR-A
Figure 4.2 Characters of OCR-B
Figure 4.3 Optimal Overlapping of characters
Figure 4.4 More coverage ratio because of serifs
Figure 4.5 First test version of OCR-B
Figure 4.6 First published version of OCR-B
Figure 4.7 OCR-B (top), OCRBczyk (bottom)
Figure 5.1 Addition of elements in OCR-B to differentiate two characters
Figure 5.2 Serifs in a typeface (grey serifs)
Figure 5.3 Shadow Characters
Figure 5.4 Counter is the circular negative space (grey)
Figure 5.5 Example of a few characters after the removal of Shiro Rekha
Figure 5.6 Similarities between different characters once the Shiro Rekha is removed
Figure 5.7 Similarities between different characters once the bottom strip is removed
Figure 5.8 Similarities between descender and Halanta
Figure 5.9 Example of open counter (grey)
Figure 6.1 Extension in the diagonal stem of र to differentiate it from स
Figure 6.2 Difference in the shape of the bowl
Figure 6.3 Final design of letters र स and ख
Figure 6.4 Common element of ग न भ and म
Figure 6.5 Final design of letters ग न भ and म
Figure 6.6 Difference in the vertical and horizontal distance of the vertical and horizontal bar
Figure 6.7 Difference in भ and म once Shiro Rekha is removed
Figure 6.8 Closed counter in the letters क ब and व
Figure 6.9 Final design of the letters क ब and व
Figure 6.10 Small diagonal stroke and the openness of the counter in ल
Figure 6.11 Overlapping of the letters ल and त
Figure 6.12 Final design of the letters ल and त
Figure 6.13 Distance between horizontal and vertical bar of the letters
Figure 6.14 Difference in ध and घ once Shiro Rekha is removed
Figure 6.15 Final design of the letters च ज छ ध and घ
Figure 6.17 Final design of the letters प फ ण and ष
Figure 6.18 Final design of the letters ट ठ ढ and द
Figure 6.19 Equal character height of all the letters
Figure 6.20 Final design of the letters इ ई ङ ड झ and ह
Figure 6.21 Overlap of the letters थ and य without the Shiro Rekha
Figure 6.22 Final design of the letters थ and य
Figure 6.23 Overlap of the letters अ and उ
Figure 6.24 Final design of the letters अ आ उ and ऊ
Figure 6.25 Final design of the letters ए ऐ ञ and श
Figure 6.26 Small white space between the letters and lower matras
Figure 6.27 Final design of matras
Figure 6.28 First version of OCR-D (top); final version of OCR-D (bottom)
Figure 6.29 Grid used in OCR-D
Figure 6.30 H:V Ratio
Figure 6.31 Test document typed in OCR-D
Figure 6.32 Test document after binarization
Figure 6.33 Extracted text from the image
Figure 6.34 Scanned document typeset in OCR-D
Figure 6.35 Extracted text when OCR algorithm is executed on document typed in OCR-D
Figure 6.36 Scanned document typeset in Yogesh
Figure 6.37 Extracted text when OCR algorithm is executed on document typed in Yogesh
Figure 6.38 Scanned document typeset in Surekh
Figure 6.39 Extracted text when OCR algorithm is executed on document typed in Surekh

List of Tables

Table 2.1 Vowels in Devanagari
Table 2.2 Consonants in Devanagari
Table 2.3 Character Frequency in Hindi

Chapter 1: Introduction

Machines replicating human functions, like reading, is an old dream. However, over the last five decades, machine reading has grown from a dream to reality. Machine reading uses the principles of Optical Character Recognition (OCR).
OCR has also become one of the most successful applications of technology in the field of pattern recognition and artificial intelligence. Since the mid 1950s, OCR has been a very active field of research and development. While the OCR technology for some scripts like Latin is fairly mature, and commercial OCR systems like Nuance OmniPage Pro or ABBYY FineReader can perform with high accuracy, it is still under development for other scripts like Chinese and Devanagari. Although a great deal of research has been done on OCR applications for Latin script, even these OCR based machines are still not able to compete with human reading capabilities. This problem is more prominent for other scripts for which OCR technology is relatively newer.

Typefaces are very important in determining the performance of OCR technology. Hence, in order to improve the accuracy of an OCR system, typefaces which are specially designed for OCR are required. For Latin script, quite a few typefaces have been designed which are optimized for OCR. These specially designed typefaces have a unique and well defined character set which allows for greater accuracy in recognition. This in turn helps in building low cost systems which can recognize characters using simple algorithms. However, no Devanagari script font is available which is designed specifically for machine reading, and we address this problem in this report.

In general, documents contain text, graphics, and images. The procedure of reading the text component in such a document can be divided into three steps:

1. Document layout analysis, in which the text component of the document is extracted.
2. Segmentation, i.e. extraction of characters from the text component of the document.
3. Recognition of the segmented characters.
Typically, the OCR character segmentation stage needs to be redesigned for each new script, while the other stages are easier to port from one script to another and can be generalized over large classes of languages.

There is a great need for OCR related research in Indian languages as there are many technical challenges which are specific to Devanagari script. With the spread of computers in organizations and offices, automatic processing and machine reading of paper documents is gaining importance in India. Although a lot of research is going on in Devanagari script recognition, there are no commercial OCR systems focusing on Devanagari based languages. OCR for Devanagari is still in the research and development stage.

In chapter 2, we give an overview of the Devanagari script. We discuss the alphabets in the Devanagari script and how they are grouped. Thereafter we discuss the anatomy of the script and the graphical grouping of the alphabets.

In chapter 3, we look at the basics of OCR systems. First, we discuss the history and the applications of OCR systems and then we look at one of the algorithms used in OCR systems for Devanagari script. We analyse this algorithm, discussing all the steps that are involved in character recognition.

In chapter 4, we discuss the features of typefaces which are designed specifically for OCR systems. We discuss the need for a specially designed typeface for OCR and perform an in-depth analysis of one of the most commonly used Latin typefaces for OCR systems, OCR-B. We discuss the precautions taken by its designer while designing the typeface. Finally, we look at the lack of Devanagari typefaces designed for OCR systems.

In chapter 5, we discuss the precautions that need to be taken while designing Devanagari typefaces for OCR systems. We also look at the design decisions taken specifically for Devanagari script because of the difference in the recognition algorithms of Devanagari and Latin script.
In chapter 6, we design a new Devanagari font optimized for OCR systems. We discuss each character in detail and how the features of the letters are designed for improved performance in OCR systems. We also present the evolution of our font from the first version to the final version. Thereafter, we test this font on an OCR system which is available for free download on the internet. Finally, we discuss the scope of future work and the improvements in the design which can further enhance the performance of the font.

Chapter 2: Devanagari Script

Devanagari script is the most important and widely used script in India. It is the script used by many Indian languages like Hindi, Marathi, Nepali and Sanskrit. Several other languages like Punjabi and Kashmiri use close variations of this script. It was also formerly used to write Gujarati.

Devanagari is a part of the Brahmi script family. An evolutionary transition can be seen from Brahmi script to the Gupta script to the Nagari script to Devanagari script. It was first seen in the 7th century A.D. and the transition to a more stable form can be seen from the 11th century onwards. The current appearance of Devanagari was reached sometime around the 12th century.

Etymologically, the word Devanagari is considered to be a combination of two Sanskrit words: ‘Deva’, meaning God, Brahma or sometimes the king, and ‘nagara’, meaning city. Literally, the two combine to form the ‘city of god’, or the script used in the city of god. The use of the name Devanagari is relatively recent; the older term Nagari is also used.

The Devanagari script represents sounds consistently. Unlike letters of the English alphabet, which can be pronounced in different ways, the letters of the Devanagari script have the same pronunciations (with a few minor exceptions). Some of the conceptual differences between Latin and Devanagari scripts are as follows:

• In Devanagari script each character has a horizontal bar (Shiro Rekha) at the top. In contemporary usage the Shiro Rekha is broken between words, to differentiate between two words.
• Devanagari alphabets do not have distinct letter cases, i.e. upper and lower case characters.
• The concept of matras is not present in Latin script. Matras can occur as standalone characters or with other alphabets to modify their sound.

2.1 Alphabets

There are around 50 basic characters in the script. Vowels and consonants are called Swaras and Vyanjanas respectively. The grouping of vowels and consonants in Devanagari is done on the basis of the phonetic point of articulation. Within a word, vowels often take modified shapes called modifiers or matras. Consonant modifiers are also possible. Moreover, 2 to 5 consonants can combine to form compound characters called conjuncts, which may partly retain the shape of the constituent consonants. Along with these there also exists a set of signs or diacritical marks which indicate the nasalization of vowels, the use of Persian sounds, etc.

Vowels

Devanagari in its most elaborate form has 18 vowels, out of which 11 are frequently used. The others can be seen in Vedic and non-Vedic Sanskrit text. Vowels in Devanagari are transcribed in two distinct forms: the independent form, and the dependent (matra) form. The independent form is used when the vowel letter appears alone, at the beginning of a word, or immediately following another vowel letter. Matras are used when the vowel follows a consonant.

Independent form    Modifier or Matra    Independent form    Modifier or Matra
अ                   None                 ए                   ◌े
आ                   ◌ा                   ऐ                   ◌ै
इ                   ि◌                   ऎ                   ◌ॆ
ई                   ◌ी                   ओ                   ◌ो
उ                   ◌ु                   औ                   ◌ौ
ऊ                   ◌ू                   ऒ                   ◌ॊ
ऋ                   ◌ृ                   ऌ                   ◌ॢ
ॠ                   ◌ॄ                   ॡ                   ◌ॣ

Table 2.1 Vowels in Devanagari

Apart from these, there also exists another set of vowels which has been added to traditional Devanagari to expand its range. For example, ऑ is used to write transliterations or English loan words like ball (बॉल).

Consonants

There are around 33 consonants in Devanagari script, which are grouped phonetically.
The first set of 25 consonants are called occlusive, and the remaining 8 are called non-occlusive. The occlusive consonants are further divided into five groups: gutturals, palatals, cerebrals (or retroflex), dentals and labials. The first four consonants in each of these groups are divided into two groups, plosive and voiced plosive, and the last consonant is the nasal consonant. The plosives and voiced plosives are again divided into unaspirated and aspirated versions (each having one character). The 8 non-occlusive consonants are divided into three groups, semivowels (or approximants), sibilants and the aspirate, which have four, three and one character respectively.

Occlusive Consonants

             Plosive                    Voiced Plosive             Nasal
             Unaspirated   Aspirated    Unaspirated   Aspirated
Gutturals    क             ख            ग             घ            ङ
Palatals     च             छ            ज             झ            ञ
Cerebrals    ट             ठ            ड             ढ            ण
Dentals      त             थ            द             ध            न
Labials      प             फ            ब             भ            म

Non Occlusive Consonants

Semivowels   य र ल व
Sibilants    श ष स
Aspirate     ह

Table 2.2 Consonants in Devanagari

Conjunct

Conjuncts are combinations of two to five consonants. There are about a thousand conjuncts in Devanagari script. Some of these conjuncts partly retain the shape of the constituent consonants, while there are others like द्य (द् + य) which are not clearly derived from the letters making up their components.

Diacritics

Diacritics are glyphs added to a letter, or basic glyph, to change the sound of the letter. Some of the commonly used diacritics in Devanagari are Visarga, Chandra, Chandrabindu, Halanta and Nukta.

• Visarga (◌ः) is an unvoiced variation of ह.
• Chandra is an open mid front rounded independent vowel ऍ. In its dependent form it is placed on top of the consonant (◌ॅ).
• Chandrabindu (◌ँ) is used to represent the inherent nasalization of the vowel.
• Halanta (◌्) is used to represent a lone consonant without a vowel. It kills the vowel अ and reduces the consonant to its base form.
• Nukta (◌़) is used to represent the Persian sounds encountered in some borrowed Urdu words, like ज़ for ظ ز ذ or ग़ for غ.
2.2 Anatomy

The anatomy of a letter can be defined as a system which depicts the structural form of a letter, describing the key features of a letter in a typeface. The first attempt at a graphical classification of Devanagari script was made by S. V. Bhagwat. He grouped letters on the basis of graphical similarities as shown in figure 2.1.

Figure 2.1 Bhagwat’s grouping of letters on the basis of graphical similarities[1]

He also defined guidelines for the letters and terminology for some of the graphical elements present in the letters, which are shown in figure 2.2.

Figure 2.2 Bhagwat’s guidelines[1]

The topmost line is the Rafar Line, which is followed by the Matra Line. The Matra Line denotes the top of the upper matras. After the Matra Line comes the Head Line, which is the top of the Shiro Rekha. The Head Line is followed by the Upper Mean Line and the Lower Mean Line. The Upper Mean Line indicates the point where the actual letter starts, for example the upper part of the counter of ‘ब’ or ‘व’. The Lower Mean Line denotes the point where the characteristic feature of the letter comes to an end, for example the lower part of the counter of ‘ब’ or ‘व’. This is followed by the Base Line, which denotes the end of the character and the point where the lower matra starts. The lowermost line is the Rukar Line, which denotes the end of the lowest portion of the Rukar.

Bapurao Naik also attempted a graphical grouping of letters. Naik organized letters graphically in five groups on the basis of the position of the kana or the vertical bar. A notable aspect of this grouping is that ए and ऐ are missing from it. Naik's grouping of letters is shown in figure 2.3.

Figure 2.3 Naik’s grouping of letters on the basis of the position of the vertical bar or the kana[1]

A few others, like M. W. Gokhale and Mahendra Patel, have also proposed different methods to group the letters and create the vocabulary of Devanagari script.
A comprehensive study of the anatomy of Devanagari script can be found in [2].

2.3 Character Frequency in Hindi

Table 2.3 shows the frequencies of letters in the Hindi language. The list is based on Hindi texts totalling 978,430 characters (238,604 words), of which 736,216 characters were used for the counting. The texts consist of a good mix of different literary genres. If other texts were used as a basis, the result would of course be slightly different.

◌ा 8.22%    व 1.62%    फ 0.35%
क 7.14%    ◌ु 1.45%    इ 0.31%
◌े 6.85%    ज 1.39%    ◌ँ 0.30%
र 5.91%    ए 1.34%    ष 0.27%
ह 4.82%    ग 1.31%    घ 0.20%
स 3.78%    च 1.16%    ई 0.20%
न 3.48%    थ 1.15%    झ 0.19%
◌ी 3.47%    अ 1.01%    ठ 0.17%
◌ं 3.44%    औ 0.94%    ◌ौ 0.15%
म 3.28%    ◌ू 0.81%    ण 0.13%
ि◌ 3.20%    उ 0.78%    ◌ृ 0.10%
◌् 3.02%    श 0.76%    ओ 0.10%
त 2.89%    ड 0.75%    ◌ॉ 0.10%
प 2.66%    ख 0.70%    ढ 0.09%
ल 2.45%    ◌़ 0.67%    ऊ 0.05%
◌ो 2.21%    भ 0.67%    ऐ 0.03%
य 2.20%    आ 0.66%    ऑ 0.03%
◌ै 1.96%    ट 0.57%    ञ 0.01%
ब 1.78%    छ 0.45%    ◌ः 0.01%
द 1.68%    ध 0.36%    ◌ॅ 0.00%

Table 2.3 Character Frequency in Hindi[3]

Chapter 3: Optical Character Recognition

With the recent emergence and widespread application of multimedia technologies, there is increasing demand to create a paperless environment in our daily life. A wide variety of information which has conventionally been stored on paper is now converted to electronic form for better storage and intelligent processing. The primary purpose of such systems is to facilitate the retrieval of information based on a given query. Representing documents as images is undesirable because it does not allow the user to edit or search the document. These limitations can be overcome by representing the data as text, which takes less storage space and is also easier to process. This kind of conversion can be achieved by Optical Character Recognition.

Optical Character Recognition, or OCR, is a technology which allows a machine to recognize text from an image. It is the conversion of a scanned image of printed or hand-written text to machine encoded text.
It is important for computerizing printed texts so that they can be searched electronically, stored compactly or used for machine processing like translation or text to speech conversion.

3.1 History of OCR

The dream of making machines perform human tasks like reading is not new. The first attempt was in 1870, when C. R. Carey invented an image transmission system. During the first decades of the 20th century many attempts were made, but the modern version of OCR came into existence in the 1940s.

The First OCR

The first OCR was installed at Reader’s Digest in 1954. It was used to convert typewritten reports to punched cards so that they could be input into the computer.

First Generation OCR

The first commercial OCRs appeared between 1960 and 1965. These OCRs could read only constrained letter shapes. The characters were specifically designed for machine recognition and were not very natural. With time, the OCRs were able to recognize up to 10 different fonts.

Second Generation OCR

The reading machines of the second generation appeared in the middle of the 1960s and early 1970s. These systems were able to recognize regular machine printed characters and also had hand-printed character recognition capabilities. The first one of this kind was the IBM 1287. In this period, characters in Latin script were also standardized. OCR-A and OCR-B were designed in this period. These fonts were designed so that they could be recognized by a machine but were still readable by a human.

Figure 3.1 OCR-A (top); OCR-B (bottom)

Third Generation OCR

These first appeared in the middle of the 1970s. The challenge was to recognize poorly scanned documents and hand-written character sets. Low cost and high accuracy were also main objectives, which were achieved in part because of the advancement in technology.

Present OCR

Today OCRs are available at a very low cost and OCR systems are also available as software packages.
Omnifont OCRs are available for Latin script. Although systems are available for Latin, Cyrillic, Far Eastern and many Middle Eastern scripts, such systems for Devanagari are still in the research and development stage. This is mainly due to the lack of a commercial market.

3.2 Applications of OCR

OCR has been used to computerize data for dissemination and processing. The first major use of OCR was in the banking industry, where it was first used to read credit card numbers. Nowadays OCRs are widely used for automated data entry, especially in banks, where they are used to read account numbers, customer identification, amounts of money, etc.

OCR is also used for text entry, i.e. extracting text out of a scanned document. The reading machine is used to process large amounts of text, which can then be used for several other purposes, like searching within the document.

OCRs also have a huge application for the blind. This was one of the earliest envisioned applications of OCR. Combined with text to speech conversion, OCRs would enable blind people to read printed documents.

OCR can also be used for automated license plate reading and can help in reading specially designed forms automatically. Once the text is computerized it can be used for machine processes like text to speech conversion, language translation and text mining.

3.3 Recognition of Devanagari Script

The most important principle of automatic pattern recognition is teaching the machine what kinds of patterns may be present and what they look like. In OCR the patterns are letters, numbers and punctuation marks. The machine is trained to recognize the patterns by showing it all the kinds of characters present in the script. This period is referred to as the training period. On the basis of these examples the machine builds a prototype of each character. Then, during recognition, the machine compares the unknown character to the prototypes and assigns the character which is the closest match.
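The training and closest-match idea described above can be illustrated with a minimal sketch. It assumes characters are equally sized binary bitmaps, that a prototype is a per-pixel majority vote over the training samples, and that "closest" means the smallest number of differing pixels; the thesis does not specify any of these, and the labels "bar" and "box" and the 3x3 toy glyphs are hypothetical.

```python
from typing import Dict, List

Bitmap = List[List[int]]  # 1 = ink, 0 = background


def hamming_distance(a: Bitmap, b: Bitmap) -> int:
    """Number of pixel positions at which two equally sized bitmaps differ."""
    return sum(pa != pb for row_a, row_b in zip(a, b) for pa, pb in zip(row_a, row_b))


def build_prototypes(samples: Dict[str, List[Bitmap]]) -> Dict[str, Bitmap]:
    """Training (assumed scheme): a pixel is ink if it is ink in a strict majority of samples."""
    prototypes = {}
    for label, bitmaps in samples.items():
        rows, cols = len(bitmaps[0]), len(bitmaps[0][0])
        prototypes[label] = [
            [1 if 2 * sum(b[r][c] for b in bitmaps) > len(bitmaps) else 0 for c in range(cols)]
            for r in range(rows)
        ]
    return prototypes


def classify(unknown: Bitmap, prototypes: Dict[str, Bitmap]) -> str:
    """Recognition: return the label of the prototype closest to the unknown bitmap."""
    return min(prototypes, key=lambda label: hamming_distance(unknown, prototypes[label]))


# Hypothetical toy glyphs: a vertical bar (with one noisy sample) and a box.
BAR = [[0, 1, 0], [0, 1, 0], [0, 1, 0]]
NOISY_BAR = [[0, 1, 0], [0, 1, 0], [0, 1, 1]]
BOX = [[1, 1, 1], [1, 0, 1], [1, 1, 1]]

PROTOTYPES = build_prototypes({"bar": [BAR, BAR, NOISY_BAR], "box": [BOX]})
UNKNOWN = [[0, 1, 0], [1, 1, 0], [0, 1, 0]]  # a bar with one stray pixel

print(classify(UNKNOWN, PROTOTYPES))
```

A real system would normalize size and position before comparing, which this sketch leaves out.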
The four steps in recognition, shown in figure 3.2, are as follows:

1. Preprocessing
2. Segmentation
3. Recognition
4. Post Processing

Preprocessing

The text document is generally scanned at 300 or 400 DPI. Preprocessing is done to improve the accuracy of the recognition algorithm. The main steps in preprocessing are noise removal, binarization and skew correction.

Noise Removal or De-Noising

The main sources of noise in the input image are as follows:

• Noise due to the quality of the paper on which the printing is done.
• Noise induced by printing on both sides of the paper or by the quality of printing.
• Noise added by the scanner source brightness and sensors.

All this noise reduces the accuracy of the OCR system, so a noise correction routine becomes essential. To reduce the amount of noise, the image is passed through a mean filter; in this filter the intensity of each pixel is replaced by the average intensity of the pixels surrounding it. After de-noising, the image is subjected to binarization and skew (or tilt) correction.

Figure 3.2 Procedure of Devanagari Recognition: Preprocessing (noise removal, binarization and skew (or tilt) correction) -> Segmentation (line, word and character segmentation) -> Recognition -> Post Processing -> Output Text

Binarization

Printed documents generally have black text on a white background. Hence most OCR algorithms are designed to interpret bi-level images (images that have only two possible pixel values, i.e. black and white). The process of converting colored or grayscale images to bi-level images is known as binarization or thresholding. A comprehensive study of binarization methods for OCR can be found in [4].

Figure 3.3 Image before binarization (left); Image after binarization (right)

Skew (or Tilt) Correction

When a document is scanned, a small amount of skew (or tilt) is unavoidable. The skew angle is the angle that the text lines make with the horizontal line.
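The de-noising and binarization steps described above can be sketched as follows. This is an illustration under stated assumptions, not the method of [4]: the mean filter uses a 3x3 window that includes the pixel itself and is clipped at the image borders, and binarization uses a fixed global threshold of 128. The toy page (a three-pixel-wide stroke plus one isolated speck) is hypothetical.

```python
from typing import List

GrayImage = List[List[int]]  # intensities from 0 (black) to 255 (white)
Bitmap = List[List[int]]     # 1 = ink (black), 0 = background (white)


def mean_filter(image: GrayImage) -> GrayImage:
    """Replace each pixel by the integer mean of its 3x3 neighbourhood (assumed window size)."""
    rows, cols = len(image), len(image[0])
    filtered = []
    for r in range(rows):
        out_row = []
        for c in range(cols):
            window = [
                image[rr][cc]
                for rr in range(max(0, r - 1), min(rows, r + 2))
                for cc in range(max(0, c - 1), min(cols, c + 2))
            ]
            out_row.append(sum(window) // len(window))
        filtered.append(out_row)
    return filtered


def binarize(image: GrayImage, threshold: int = 128) -> Bitmap:
    """Global thresholding (assumed): pixels darker than the threshold become ink."""
    return [[1 if pixel < threshold else 0 for pixel in row] for row in image]


# Hypothetical 5x9 page: a black stroke in columns 2 to 4 and a single dark speck.
PAGE = [[0 if 2 <= c <= 4 else 255 for c in range(9)] for _ in range(5)]
PAGE[2][7] = 0  # isolated noise pixel

RAW_BITMAP = binarize(PAGE)                  # the speck survives as ink
CLEAN_BITMAP = binarize(mean_filter(PAGE))   # the speck is averaged away

for row in CLEAN_BITMAP:
    print("".join("#" if p else "." for p in row))
```

Note that a mean filter also blurs thin strokes: a stroke only one pixel wide would be averaged away by this window, which is one reason stroke weight matters for a font meant to survive preprocessing.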
Skew estimation and correction are important preprocessing steps of document layout analysis and character recognition. One of the popular skew estimation techniques is based on the projection profile of the document. The horizontal/vertical projection profile is a histogram of the number of black pixels along horizontal/vertical scan lines. In Devanagari, the Shiro Rekha is used to find the skew angle. The algorithm for skew correction can be found in [5].

Segmentation

Segmentation is the process of dividing the page into its constituent elements. The aim of segmentation is to extract all the characters from the text in the image so that they can be recognized. The segmentation phase is a very crucial stage since this is where most of the errors occur. Even in good quality documents, adjacent characters sometimes touch each other due to inappropriate scanning resolution or the design of the characters. This can create problems in segmentation, and incorrect segmentation leads to incorrect recognition. Segmentation in OCR occurs in three steps: line segmentation, word segmentation and character segmentation. While the precise algorithms for segmentation can be found in [6] and [7], an overview of the segmentation process is given below.

Line Segmentation

In line segmentation our aim is to separate out the lines of text from the image. For this the global horizontal projection profile method is used, which constructs a histogram of all the black pixels in every row as shown in figure 3.4. Based on the peak/valley points of the histogram, individual lines are separated. The steps for line segmentation are as follows:

1. The horizontal projection profile for the image is created.
2. Using the projection profile, the points at which each line starts and ends are found.
3. For a line of text, the upper line is drawn at the point where black pixels start to appear and the lower line is drawn where black pixels stop appearing. The process continues for the next line and so on.

Figure 3.4 Horizontal Projection Profiles of a document for line segmentation[8]

Word Segmentation

After line segmentation the boundary of each line (i.e. the top and bottom of the line) is known. Word segmentation is extracting the boundaries of the words from these segmented lines. Word segmentation is done in the same way as line segmentation, but in place of horizontal profiling, vertical projection profiling is done as shown in figure 3.5. The steps for word segmentation are as follows:

1. The vertical projection profile for the line is created.
2. Using the projection profile, the points at which each word starts and ends are found.
3. Vertical lines are created at the start and end of each word. The process continues for the next word and so on.

Figure 3.5 Vertical Projection Profiles of a document for word segmentation[8]

Character Segmentation

Once the words are segmented, the next step is to extract the characters from these words. A word in Devanagari script is further divided into three parts, as shown in figure 3.6:

1. Top
2. Core (or Middle)
3. Bottom

The top strip and the core strip are separated by the Head Line or Shiro Rekha, but there is no separation between the core strip and the bottom strip. The top strip contains the top matras and the bottom strip contains the bottom matras or the descenders of some of the characters. The Shiro Rekha is a unique feature of Devanagari script and helps to identify Devanagari in a multilingual document. It also helps in the identification of the baseline of the text.

Figure 3.6 Three parts of a Devanagari word (the example word अकुलीन, labelled with Top Strip, Head Line (Shiro Rekha), Core Strip and Bottom Strip)

The steps of character segmentation, shown in figure 3.7, are as follows:

1. The Shiro Rekha is identified and the top strip is separated from the core and bottom strips. The text is now divided into two parts: (a) the Shiro Rekha and the top matras, and (b) the core-bottom part of the text.
2. The core strip and bottom strip are identified from the core-bottom part of the text, and the lower matras are extracted.
3. The core strip is segmented into different letters or characters, which may include conjuncts, punctuation or numerals.
4. Conjuncts are segmented into single characters.
5. The Shiro Rekha is removed from the extracted top strip and the top matras are extracted.
6. Once the segmentation of the core characters is done, the Shiro Rekha is put back on top of the individual characters.

Figure 3.7 The procedure of Hindi character segmentation[6]

Recognition

Segmentation is followed by recognition of the characters. The two main methods used for recognizing characters are as follows:

• Template Matching
• Feature Based Recognition

Template Matching

In this method a matrix containing the image of the input character is matched with the set of prototypes created in the training period. The distance between the pattern and each prototype is computed, and the character which is the best match is assigned to the pattern. The technique is simple and easy to implement in hardware. However, it is sensitive to noise and style variations and has no way of handling rotated characters.

Feature Based Recognition

In this method significant features of the pattern are measured and examined. These features are then compared to the prototypes developed in the training phase. The description which provides the closest match provides the recognition. Such features can be, for example, the presence of a vertical bar or the number of conjunctions. The recognition algorithm can be found in detail in [9].

Post Processing

The result of recognition is a set of individual characters. However, these characters do not contain the complete information.
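The projection-profile segmentation described in the Segmentation section (line segmentation, word segmentation, and locating the Shiro Rekha) can be sketched as follows. This is a minimal illustration rather than the algorithms of [6] and [7]: it assumes a clean, skew-free bi-level image, treats every blank row or column as a separator, and takes the Shiro Rekha to be the row with the most ink in a word. The toy page, drawn with '#' for ink, is hypothetical.

```python
from typing import List, Tuple

Bitmap = List[List[int]]  # 1 = ink, 0 = background


def to_bitmap(rows: List[str]) -> Bitmap:
    """Build a bitmap from strings in which '#' marks ink."""
    return [[1 if ch == "#" else 0 for ch in row] for row in rows]


def horizontal_profile(image: Bitmap) -> List[int]:
    """Number of ink pixels in every row."""
    return [sum(row) for row in image]


def vertical_profile(image: Bitmap) -> List[int]:
    """Number of ink pixels in every column."""
    return [sum(column) for column in zip(*image)]


def ink_runs(profile: List[int]) -> List[Tuple[int, int]]:
    """Half-open (start, end) index ranges over which the profile is non-zero."""
    runs, start = [], None
    for index, count in enumerate(profile):
        if count > 0 and start is None:
            start = index
        elif count == 0 and start is not None:
            runs.append((start, index))
            start = None
    if start is not None:
        runs.append((start, len(profile)))
    return runs


def crop(image: Bitmap, top: int, bottom: int, left: int, right: int) -> Bitmap:
    return [row[left:right] for row in image[top:bottom]]


def shiro_rekha_row(word: Bitmap) -> int:
    """Assumed heuristic: the Shiro Rekha is the row with the largest ink count."""
    profile = horizontal_profile(word)
    return profile.index(max(profile))


PAGE = to_bitmap([
    "...........",
    ".#.....#...",   # top matras
    "#####.#####",   # Shiro Rekha of two words
    "#.#.#.#...#",
    "#.#.#.#...#",
    "...........",
    "####.######",   # second text line
    "#..#.#....#",
    "...........",
])

LINES = ink_runs(horizontal_profile(PAGE))                    # row ranges of text lines
first_line = crop(PAGE, LINES[0][0], LINES[0][1], 0, len(PAGE[0]))
WORDS = ink_runs(vertical_profile(first_line))                # column ranges of words
first_word = crop(first_line, 0, len(first_line), WORDS[0][0], WORDS[0][1])
HEAD_ROW = shiro_rekha_row(first_word)                        # rows above it form the top strip

print(LINES, WORDS, HEAD_ROW)
```

Because the Shiro Rekha joins the characters of a word, the vertical profile of a Devanagari line drops to zero only between words, which is why the same run-finding routine serves for both line and word boundaries here.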
We would like to combine these individual characters to form strings. This process is called grouping. Grouping depends on the location of the characters in the document. Characters which are close to each other are grouped together to form a word, since the distance between two words is greater than the distance between the letters of a word.

Chapter 4: Typefaces for OCR
A typeface is the design of a collection of alphanumeric symbols. A typeface may include letters, numerals, punctuation, various symbols and more, often for multiple languages. It is usually grouped together in a family containing individual fonts for italic, bold and other variations of the primary design. Although the terms typeface and font are often used interchangeably, a font is strictly the physical embodiment of the typeface (i.e. a computer file or a metal piece in letterpress). A typeface is what we see whereas a font is what we use. In the rest of this thesis, font and typeface are used interchangeably.

4.1 Latin Fonts for OCR
Typefaces are designed for OCR so that they can be read by low cost systems. These fonts have unique and well defined character sets which allow for greater accuracy in recognition. The most popular Latin script fonts designed for OCR are as follows:
• OCR-A
• OCR-B

OCR-A
OCR-A is a monospaced font designed by American Type Founders. It was developed to meet the standards set by the American National Standards Institute for the processing of documents by banks, credit card companies and similar businesses. The design was kept simple so that it can be read by machines, but it is very difficult for humans to read.

ABCDEFGHIJKLM NOPQRSTUVWXYZ abcdefghijklm nopqrstuvwxyz 0123456789 &$£%.!?
Figure 4.1 Characters of OCR-A

OCR-B
OCR-B was designed by Adrian Frutiger for the European Computer Manufacturers Association (ECMA). It is a monospaced font and was designed following the standards of ECMA. The first version contained 109 characters.
The main objective was to create an international standard for optical recognition. ECMA also wanted to avoid the wider acceptance of OCR-A because of its unnatural look. Therefore, OCR-B was designed to be pleasant to human eyes.

ABCDEFGHIJKLM NOPQRSTUVWXYZ abcdefghijklm nopqrstuvwxyz 0123456789 &$£%.!?
Figure 4.2 Characters of OCR-B

OCR-B pushed the limits of optical recognition. It was the first machine readable typeface to give consideration to aesthetics. OCR-B was declared a worldwide standard in 1973. The principle of OCR-B was that all characters must differ from each other by at least 7% in the worst possible case. To check this, two characters were placed on top of each other so that they overlapped optimally, as shown in figure 4.3. This test was also carried out using two different printing weights: a fine weight, caused by a lack of ink in the typewriter ribbon, and a fat weight, caused by ink blots.

Figure 4.3 Optimal Overlapping of characters

Generous character spacing was provided since it is important for correct recognition, whereas serifs were avoided because they increase the common coverage area of the characters and therefore the similarity between characters, as shown in figure 4.4.

Figure 4.4 More coverage ratio because of serifs

First Test Version
The first test version was designed in 1963 and had 109 characters. It is shown in figure 4.5. In this version the bowl shape of the uppercase letters was constant, whereas the lowercase letters had two types of bowl shape: a round bowl, for example in c d p, and a flat bowl in b g q. Initially the height of the numerals and the uppercase letters was kept the same, but this was changed before the first test version. All the numerals also had a dynamic shape but their curves were different. The uppercase O was very similar to the numeral 0.
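The 7% rule described above can be checked mechanically on rasterized glyphs. The following is a minimal sketch under stated assumptions, not the procedure used for OCR-B: glyphs are given as small ASCII bitmaps with '#' for ink, "difference" is taken to mean the pixels covered by exactly one of the two glyphs divided by the pixels covered by either, and optimal overlap is approximated by trying every shift of up to a few pixels. The function names, the `max_shift` parameter and the two toy glyphs are hypothetical.

```python
def to_pixels(rows):
    """Return the set of (row, col) positions that carry ink ('#')."""
    return {(r, c)
            for r, line in enumerate(rows)
            for c, ch in enumerate(line) if ch == "#"}


def difference_ratio(a, b):
    """Pixels covered by exactly one glyph, relative to pixels covered by either."""
    union = a | b
    return len(a ^ b) / len(union) if union else 0.0


def worst_case_difference(rows_a, rows_b, max_shift=2):
    """Smallest difference ratio over all relative shifts up to max_shift pixels.

    The smallest value corresponds to the best possible overlap, i.e. the
    worst case for telling the two glyphs apart.
    """
    a, b = to_pixels(rows_a), to_pixels(rows_b)
    best = 1.0
    for dr in range(-max_shift, max_shift + 1):
        for dc in range(-max_shift, max_shift + 1):
            shifted = {(r + dr, c + dc) for r, c in b}
            best = min(best, difference_ratio(a, shifted))
    return best


# Two toy glyphs (assumed for illustration): a bar, and a bar with a foot.
BAR = ["..#..", "..#..", "..#..", "..#..", "..#.."]
BAR_FOOT = ["..#..", "..#..", "..#..", "..#..", ".###."]

# The pair passes a 7% threshold if its worst-case difference is at least 0.07.
print(worst_case_difference(BAR, BAR_FOOT) >= 0.07)  # prints True
```

Running the same check on glyphs rendered at a lighter and a heavier weight would mirror the two printing weights mentioned in the text.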
Figure 4.5 First test version of OCR-B[10]

First Published Version
OCR-B was first published as Standard ECMA-11 in 1965, containing 112 characters, and is shown in figure 4.6. Some characters underwent considerable changes. The flat dynamic bowl of b g q was converted to a static bowl shape to match the rest of the characters. The height of the uppercase characters was also reduced to differentiate them more from the numerals. W's outer diagonals were curved in the new version. The numeral 0 also received a more oval shape to differentiate it from the character O, and the dot of j was now normally placed. Altogether the typeface now had a more consistent look and feel.

Figure 4.6 First published version of OCR-B[10]

Some further corrections took place from 1969 onwards. The British pound sign was considerably changed. There were still problems in differentiating D O 0 and B 8 &. With D, the curved stroke now started directly at the stem; O was given a much more oval shape and 0 became more angular. B was made wider, which resolved the problem of the B-8 pair, and the upper bowl of & was made smaller. A horizontal bar was added to j (just like i) and the descender of y was curved. None of these changes was beneficial in terms of shape or aesthetics, but they were very important for differentiating the characters.

Even with the international standard in 1976 the OCR-B project was far from over. The number of characters increased constantly, from 121 characters in 1976 to 147 in 1994. Also in 1994, a designer named Alexander Branczyk designed a proportional version of OCR-B called OCRBczyk. It featured much finer visual features but remained true to the design of OCR-B.
Figure 4.7 OCR-B (top), OCRBczyk (bottom)

Application of OCR-B
Since the 1960s machine readable typefaces have been used for data recognition. They can be found on cheques, bank statements, credit cards and postal forms. OCR-B can also be found on the paying-in forms and identity cards of many countries. Most barcode numbers are also set in OCR-B.

4.3 Devanagari Font for OCR
Although the development of OCR for Indian scripts is an active area of research today, not much work has been done on designing a Devanagari font optimized for OCR. Unlike the Latin script, there is not even one commercially available Devanagari font which is optimized for Devanagari OCR systems. A few of the most common Hindi fonts are KrutiDev, Mangal, Surekh and Yogesh, but none of them is designed for OCR. All these fonts have some letters with parts above the Shiro Rekha. KrutiDev and Yogesh also have some letters which are not connected horizontally, like स. In addition, the stroke width of Yogesh is too thin for an OCR font, and Surekh is not a monolinear font, which is why it cannot be used for OCR. Therefore there is a need for a Devanagari font designed for OCR.

Chapter 5: Font Design for OCRs
Font design is the art and process of designing typefaces. Regardless of the method used to specify a type design, all the characters of a typeface should have artistic consistency. No character should look small or large compared to the other characters in the font. However, while designing type for OCR systems, special precautions have to be taken for better accuracy. Many of the principles of type design for Latin OCR fonts apply directly to Devanagari fonts, but because the segmentation algorithm is different, extra care needs to be taken while designing for a Devanagari OCR system.
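The difference in segmentation mentioned here comes from splitting every word at the Shiro Rekha, as described in Chapter 3. The sketch below illustrates that split on a toy bitmap. It assumes that the Shiro Rekha is the densest band of rows in the horizontal projection profile; the thesis does not state how the head line is located, so this heuristic, the `tolerance` parameter and all function names are assumptions made for illustration.

```python
def horizontal_profile(rows):
    """Count ink pixels ('#') in every row of an ASCII bitmap."""
    return [line.count("#") for line in rows]


def find_shiro_rekha(rows, tolerance=0.9):
    """Return (first_row, last_row) of the band assumed to be the head line.

    The band is grown upwards and downwards from the densest row while the
    neighbouring rows hold at least `tolerance` times the peak ink count.
    """
    profile = horizontal_profile(rows)
    peak_row = profile.index(max(profile))
    threshold = tolerance * profile[peak_row]
    top = bottom = peak_row
    while top > 0 and profile[top - 1] >= threshold:
        top -= 1
    while bottom < len(profile) - 1 and profile[bottom + 1] >= threshold:
        bottom += 1
    return top, bottom


def split_strips(rows):
    """Split a word image into (top strip, head line, core-bottom part)."""
    top, bottom = find_shiro_rekha(rows)
    return rows[:top], rows[top:bottom + 1], rows[bottom + 1:]


# Toy word image (assumed): two rows of top matra, a two-row head line,
# and four rows of core and bottom strokes.
WORD = [
    "...#......",
    "..#.......",
    "##########",
    "##########",
    ".#..#...#.",
    ".#..###.#.",
    ".#..#...#.",
    "....#.....",
]

top_strip, head_line, core_bottom = split_strips(WORD)
print(len(top_strip), len(head_line), len(core_bottom))  # prints 2 2 4
```

Any stroke that rises into the head-line band or above it ends up outside the core-bottom part, which is why distinguishing features placed there are lost to the recognizer.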
5.1 Design Features Important for OCR
Every letter has to be more strongly differentiated than is customary in type design. Most of the principles for designing type for OCR remain the same as for Latin, while special care needs to be taken for Devanagari because of the difference in character segmentation. However, many constraints which were present while designing OCR-B no longer apply because of advances in technology. For example, earlier OCR systems were only able to read monospaced fonts, whereas current systems can also recognize proportional fonts accurately. Some of the things that should be taken care of while designing type for OCR are as follows.

One Character should Never be Contained in Another Character
No character, when overlapped with another, should lie completely inside the other letter. This is very important for correct recognition. To achieve this, additional features or elements are added to a character to differentiate it from the others, as shown in figure 5.1. Similar looking characters like प and ष can also be given different counter sizes.

Figure 5.1 Addition of elements in OCR-B to differentiate two characters

Font should be Monolinear
Monolinear fonts are fonts whose vertical and horizontal strokes have the same visual weight. If a font has varying stroke widths, the thin strokes may break at small point sizes during scanning or binarization, which creates problems in recognition.

Font should be Sans Serif
A serif is a small decorative line added at the end of some of the strokes that make up the basic form of a character, as shown in figure 5.2. A typeface with serifs is called a serif typeface and a typeface without serifs is called sans serif. Sans serif typefaces are preferred for OCR because serifs increase the common coverage area of the characters and therefore the similarity between characters.
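The containment rule stated earlier in this section lends itself to an automatic check across a whole character set. The following is a minimal sketch, assuming glyphs are available as ASCII bitmaps with '#' for ink and that a small positional tolerance (`max_shift`) is enough to model the overlap; the function names and the toy glyphs are hypothetical and do not correspond to real letterforms.

```python
def ink(rows):
    """Return the set of (row, col) positions that carry ink ('#')."""
    return {(r, c)
            for r, line in enumerate(rows)
            for c, ch in enumerate(line) if ch == "#"}


def is_contained(rows_inner, rows_outer, max_shift=1):
    """True if some small shift puts every ink pixel of inner on ink of outer."""
    inner, outer = ink(rows_inner), ink(rows_outer)
    for dr in range(-max_shift, max_shift + 1):
        for dc in range(-max_shift, max_shift + 1):
            if {(r + dr, c + dc) for r, c in inner} <= outer:
                return True
    return False


def contained_pairs(glyphs, max_shift=1):
    """List (inner, outer) names for every glyph that fits inside another."""
    return [(a, b)
            for a in glyphs for b in glyphs
            if a != b and is_contained(glyphs[a], glyphs[b], max_shift)]


# Toy glyphs (assumed): "barred" is "plain" plus one extra pixel, so "plain"
# is contained in it. "plain_fixed" adds a distinguishing pixel of its own.
PLAIN = ["#..#", "#..#", "####", "...#"]
BARRED = ["#..#", "##.#", "####", "...#"]
PLAIN_FIXED = ["#..#", "#..#", "####", "..##"]

print(contained_pairs({"plain": PLAIN, "barred": BARRED}))
# prints [('plain', 'barred')]
print(contained_pairs({"plain_fixed": PLAIN_FIXED, "barred": BARRED}))
# prints []
```

The second call shows the effect of adding a differentiating element: once each glyph has ink that the other lacks, neither can be mistaken for a damaged copy of the other.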
Figure 5.2 Serifs in a typeface (grey serifs)

Generous Character Spacing
Character spacing is the distance between two characters. White space between two characters helps in character segmentation, but it should not be comparable to the width of a word space (' '). If the character spacing is insufficient, the characters can end up touching each other because of noise added while scanning, which creates problems in character segmentation. Shadow characters should also be avoided. A character is said to be under the shadow of another character if they do not physically touch each other but cannot be separated merely by drawing a vertical line. An example of shadow characters is shown in figure 5.3. Although the segmentation algorithm handles shadow characters, they reduce its accuracy in some cases.

Figure 5.3 Shadow Characters[6]

Big Closed Counters
The counter is the enclosed or partially enclosed circular or curved negative space (white space) of some letters such as d, o and s, as shown in figure 5.4.

Figure 5.4 Counters are the circular negative space (grey)

While designing for OCR, counters need to be kept large so that they do not get completely filled by noise while scanning or by ink while printing. A filled counter can result in faulty recognition, as the character can be confused with another character. For example, if the counter of ढ is filled, the OCR can confuse it with द.

Bold Strokes
Stroke width is another feature of a font which is very important for recognition, as thin strokes can get smudged or broken because of poor quality printing and scanning. A bold stroke is also helpful in the process of binarization.

5.2 Special Care for Devanagari
Apart from the precautions stated above, some special care has to be taken for Devanagari because of the complicated segmentation process. For character segmentation the script is divided into three parts: top, core (or middle) and bottom, and all these parts are recognized separately.
This increases the complication because, unlike in Latin script, the descenders and ascenders of the characters in the core strip are not treated as part of the character in Devanagari script. So no differentiating feature should be placed in the ascender or descender of a character. The special precautions that need to be taken are discussed below.

Removal of Shiro Rekha and the Top Strip
Removal of the Shiro Rekha is the second step in character segmentation (as shown in figure 3.7). When the Shiro Rekha is removed, all the features of the character at the level of the Shiro Rekha or above it are also removed from the core strip, as shown in figure 5.5.

Figure 5.5 Example of few characters after the removal of Shiro Rekha

When important features of a character lie at the level of the Shiro Rekha or above it, they are removed with it, resulting in no recognition or in recognition of a different character. For example, भ has a curve at the level of the Shiro Rekha; when this is removed, the character looks like म. Similarly ध looks like घ when the Shiro Rekha is removed. This can be seen in figure 5.6.

Figure 5.6 Similarities between different characters once the Shiro Rekha is removed

Also, the differentiating characteristic between the kana (ा) and the purna viraam (।) is the presence of the Shiro Rekha above the kana. Once the Shiro Rekha is removed there is no differentiating feature between these two characters, and one can be confused with the other. So while designing, some differentiating feature has to be added to one of the two characters so that they can be recognized accurately.

Removal of Bottom Strip
The step after the removal of the top strip in character segmentation is the removal of the bottom strip. The bottom strip is the strip which contains the lower matras, the halanta and the descenders of the letters in the core strip.
The most difficult part of this step is to determine where the core strip ends and the bottom strip begins, because in Devanagari script the lower matras are connected to the characters in the core strip. Also, a few characters like इ and झ have characteristic features extending into the bottom strip. When these features are removed, the character might closely resemble other characters, as shown in figure 5.7.

Figure 5.7 Similarities between different characters once the bottom strip is removed

In some cases the descender also resembles a particular lower matra or a diacritical mark. While recognizing the lower matras in the bottom strip, the descender can be confused with a lower matra, which results in incorrect recognition of both the character and the lower matra, as shown in figure 5.8.

Figure 5.8 Similarities between descender and Halanta

Recognition of Characters
Recognition of characters is much more complicated in Devanagari than in Latin because the graphical similarities between the letters are much greater. Some letters differ by just one stroke; for example, ष has only an additional diagonal stroke compared to प. Others differ from each other only by the presence of a vertical line, like न and म. Also, unlike Latin script, Devanagari has letters which are disjoint horizontally. Disjoint letters result in inaccurate recognition, so this should be avoided wherever possible; for example, the disjoint form रव can instead be designed as the connected form ख. The open counters in the letters should also be designed carefully. An open counter is the partially enclosed negative space formed by the curved strokes of some letters, as shown in figure 5.9.

Figure 5.9 Example of open counter (grey)

While designing the counters, special care needs to be taken so that the strokes forming these curves do not get connected because of noise or smudging.
Such a connection causes the algorithm to confuse two letters. For example, if the strokes of ल connect together, it can be recognized as न.

Chapter 6: Proposed Font Designed for OCRs
The proposed version contains 53 characters including letters and matras. The font is Unicode based. A reduction in the calligraphic strokes can be seen. All the characters are designed to have the same height, i.e. no part of a character goes above the Shiro Rekha or below into the bottom strip. Characters which were similar in design were given additional features. A small gap is also left between the lower matras and the core strip, which helps in segmentation. The font is monolinear and has a bold stroke so that the strokes are not broken in the process of binarization.

अ ऊ क च ट त प य श ह
आ ए ख छ ठ थ फ र ष
इ ऐ ग ज ड द ब ल स
ई ओ घ झ ढ ध भ व
उ औ ङ ञ ण न म

[Type specimen: the line "किसी जाती का जीवन तथा इन" set in the proposed font at 6, 8, 9, 10, 11, 12, 14, 18, 24, 36, 48, 60, 72 and 96 pts]

[Type specimen: the following paragraph set in the proposed font at 8, 10, 11, 12, 14, 18 and 24 pts]
हालाकि सूर के जीवन के बारे मे कई जनश्रुतिया प्रचलित है पर इन मे कितनी सच्चाई है यह कहना कठिन है। कहा जाता है उनका जन्म दिल्ली के पास एक गरीब ब्राह्मीण परिवार मे हुआ। जनश्रुति के अनुसार सूरदास जन्म से ही अन्धे थे। आजकल थी अन्धे आदमी अक्सर सूरदास कहलाते है। कई लोगो ने उन्हे गुरु के रूप मे अपनाया और उनकी पूजा करना शुरु कर दिया।

6.1 Designing Characters
रसख
The characters र, स and ख have similar features. Hence, while designing these characters, care must be taken so that the OCR is able to differentiate between them.
In order to incorporate differences in the features of these characters, the following steps are taken:
• The diagonal bar of र is elongated so that it can be differentiated from स. The elongated part of र is shown in Figure 6.1.
• The bowls of these characters are designed differently so that even if there is smudging and the horizontal bar in स breaks, the characters can be differentiated by the shapes of their bowls. The bowls of र and स are kept the same, as these two can be differentiated using the elongated diagonal bar, while the bowls of र and ख are different. The difference in the shape of the bowls of र and ख is shown in Figure 6.2 by overlapping these characters.

Figure 6.1 Extension in the diagonal stem of र to differentiate it from स
Figure 6.2 Difference in the shape of the bowl

In the first version the width of ख was also compressed on the assumption that this would give better results, but test results showed that the width did not have a prominent impact, and hence the width of ख was changed to the standard width in the final design. The final design of र, स and ख is shown in Figure 6.3.

Figure 6.3 Final design of letters र, स and ख

गनभम
The common element in this group is the filled counter shown in figure 6.4, which is the most distinguishing feature of these characters. The final design of the letters can be seen in figure 6.5.

Figure 6.4 Common element of ग न भ and म
Figure 6.5 Final design of letters ग न भ and म

To differentiate the letters in this group, the horizontal and vertical distances of the horizontal and vertical bars are not kept the same, as shown in figure 6.6.

Figure 6.6 Difference in the vertical and horizontal distance of the vertical and horizontal bar

Also, the top part of भ does not go above the Shiro Rekha, so that it does not look like म once the Shiro Rekha is removed, as shown in figure 6.7.

Figure 6.7 Difference in भ and म once Shiro Rekha is removed

कबव
The counter of these letters should be large so that it does not get filled with noise.
Also, a closed counter was designed so that if a joint is broken at one end, the letter still does not look like the half form of the letter, as shown in figure 6.8. The diagonal bar of ब also has to be bold so that it is not broken while printing or scanning or in the process of binarization.

Figure 6.8 Closed counter in the letters क ब व

The final design of the letters can be seen in figure 6.9.

Figure 6.9 Final design of the letters क ब and व

तल
The challenge while designing this group was that ल should not look like �. For this, the length of the diagonal stroke was reduced, which also made the counter more open, as shown in figure 6.10. Also, ल should not look like त if the lower counter is filled. The difference in the shapes of त and ल can be seen in figure 6.11.

Figure 6.10 Small diagonal stroke and the openness of the counter in ल
Figure 6.11 Overlapping of the letters ल and त

The final design of the letters can be seen in figure 6.12.

Figure 6.12 Final design of the letters ल and त

चजछधघ
While designing these characters, the following care must be taken so that the OCR is able to recognize them:
• The horizontal bar shown in figure 6.13 should not touch the vertical bar at the left even in the presence of noise. To ensure this, the distance between them should be kept large.
• The closed counter of छ should be large so that it is not filled at small sizes or in the presence of noise.
• The letter ध should not look like घ after the Shiro Rekha is removed, as shown in figure 6.14.

Figure 6.13 Distance between horizontal and vertical bar of the letters
Figure 6.14 Difference in ध and घ once Shiro Rekha is removed

The final design of the letters can be seen in figure 6.15.

Figure 6.15 Final design of the letters च ज छ ध and घ

पफणष
While designing these characters, the following care must be taken so that the OCR is able to recognize them:
• No character, when overlapped with another, should lie completely inside the other letter.
• The diagonal bar of ष has to be bold so that it is not broken while printing or scanning or in the process of binarization.

The final design of the letters can be seen in figure 6.17.

Figure 6.17 Final design of the letters प फ ण and ष

टठढद
The counter of ढ has to be large so that it does not get filled with noise, because if it is filled the OCR system confuses ढ with द. The final design of the letters can be seen in figure 6.18.

Figure 6.18 Final design of the letters ट ठ ढ and द

इईङडझह
While designing these characters, the following care must be taken so that the OCR is able to recognize them:
• The letters इ and झ are designed to have the same character height as the other letters, as shown in figure 6.19.
• The ending of the stroke of ड has to be extended more than is required for the normal design so that it is not confused with इ.

Figure 6.19 Equal character height of all the letters

The final design of the letters can be seen in figure 6.20.

Figure 6.20 Final design of the letters इ ई ङ ड झ and ह

थय
The main concern while designing these letters is that थ should not look like य after the Shiro Rekha is removed. An overlap of these letters without the Shiro Rekha is shown in figure 6.21 and the final design of the letters is shown in figure 6.22.

Figure 6.21 Overlap of the letters थ and य without the Shiro Rekha
Figure 6.22 Final design of the letters थ and य

अआउऊ
While designing these characters, the following care must be taken so that the OCR is able to recognize them:
• The horizontal bar of अ should be bold so that it is not broken while scanning or printing.
• The top of अ should not go above the Shiro Rekha.
• The letter अ should not completely overlap उ. The difference is shown in figure 6.23.

Figure 6.23 Overlap of the letters अ and उ

The final design of the letters can be seen in figure 6.24.

Figure 6.24 Final design of the letters अ आ उ and ऊ

एऐञश
These characters don't have any common element.
The final design of the letters is shown in figure 6.25.

Figure 6.25 Final design of the letters ए ऐ ञ and श

Matras
While designing the lower matras, a small white space was left between the lower matras and the bottom of the letters, as shown in figure 6.26. The final design is shown in figure 6.27.

पु पू का कि की कु कू के कै को कौ क्
Figure 6.26 Small white space between the letters and lower matras
Figure 6.27 Final design of matras

6.2 Evolution of the Typefaces
The first version contained 52 characters. All the characters had the same character height. The stroke of the first version was very thin. The bowls of र and स had a dynamic shape. Some of the letters like ख were compressed so that the character widths of all the letters were comparable. The lower mean line was also kept higher.

The final version, on the other hand, has a much bolder stroke so that the strokes are not broken while printing or scanning. Some characters underwent considerable corrections. The horizontal bars of न and त were brought closer to the Shiro Rekha. The knot at the bottom of इ, ई, झ and द was removed to make the characters more open, and the diagonal stroke at the bottom was converted to a straight stroke. क was given a more calligraphic look. Altogether the typeface now has a more natural look and better stroke consistency than the first version. A comparison of the first and final versions is shown in figure 6.28.

अआइउऊएऐओऔकख गघङचछजझटठडढण तथदधनपफबभमयर लवशषसह
अआइईउऊएऐओऔ कखगघङचछजझञ टठडढणतथदधनप फबभमयरलवशषस ह
Figure 6.28 First version of OCR-D (top); final version of OCR-D (bottom)

6.3 Anatomy of the Typefaces
The grid used is shown in figure 6.29.

Figure 6.29 Grid used in OCR-D (sample बुटे, marking Upper Matra, Shiro Rekha, Character Height and Lower Matra)

The H:V ratio used is 1.1, as shown in figure 6.30.

Figure 6.30 H:V Ratio

6.4 Testing of the Typefaces
Once the font is designed, the next step is to test its accuracy on an OCR system.
For this an OCR system called HindiOCR is used. HindiOCR was designed by Oliver Hellwig of the Department for Languages and Cultures of Southern Asia, Freie Universität Berlin. It converts printed Hindi texts into rich-text documents (RTF) in Devanagari Unicode encoding and processes standard image formats, i.e. *.jpeg, *.png and *.bmp. A free demo version of HindiOCR can be found at [11]. The test document can be seen in figure 6.31. Figure 6.32 shows the text after the process of binarization and figure 6.33 shows the result of HindiOCR when the algorithm was run on the test document in figure 6.31.

Figure 6.31 Test document typed in OCR-D
Figure 6.32 Test document after binarization
Figure 6.33 Extracted text from the image

Comparison with Other Fonts
The performance of the font was compared with that of other fonts. For testing purposes the fonts Surekh and Yogesh were used. The same text was typeset in all three fonts and the results were compared. OCR-D gave the best result and Surekh had the most errors. Although Yogesh performed much better than Surekh, some errors occurred consistently: 'रु' was recognized as 'दु' most of the time, and 'स' was sometimes recognized as a combination of र and न because of the disjoint in स. Figure 6.34 shows the scanned text document typeset in OCR-D and figure 6.35 shows the test result when the OCR algorithm was executed on this scanned document. Figures 6.36 and 6.38 show the scanned text documents typeset in Yogesh and Surekh respectively, and figures 6.37 and 6.39 show their results.
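The comparison above is reported through figures rather than a numeric score. One common way to quantify such a comparison, assumed here and not taken from the thesis, is the character error rate: the edit distance between the typeset text and the OCR output, divided by the length of the typeset text. The sketch below counts Unicode code points, so a consonant and its matra count as two characters; the function names are hypothetical.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two strings, counted in code points."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def character_error_rate(reference, hypothesis):
    """Edit distance divided by the length of the reference text."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# The confusion reported for Yogesh, 'रु' read as 'दु', inside the word गुरु:
print(character_error_rate("गुरु", "गुदु"))  # prints 0.25
```

Computing this rate for the same passage typeset in OCR-D, Yogesh and Surekh would give one number per font that can be compared directly.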
Figure 6.34 Scanned document typeset in OCR-D
Figure 6.35 Extracted text when OCR algorithm is executed on document in figure 6.34
Figure 6.36 Scanned document typeset in Yogesh
Figure 6.37 Extracted text when OCR algorithm is executed on document in figure 6.36
Figure 6.38 Scanned document typeset in Surekh
Figure 6.39 Extracted text when OCR algorithm is executed on document in figure 6.38

Chapter 7: Future Work
All the letters of the Devanagari script have been designed and tested on an OCR system. Although all the vowels (independent and dependent forms) and the consonants have been designed, the numerals and some of the diacritics still have to be designed. Recognition of conjuncts in Devanagari has to be studied. This includes the algorithm for recognition and the algorithm for separating the half form of a letter from the full form. The design of the conjuncts has to be completed. The kerning table also has to be decided upon, keeping in mind generous character spacing. Comprehensive testing of the font also needs to be done and, based on the results of the tests, the design of the characters has to be tweaked appropriately.

References
[1] Bapurao S. Naik, Typography of Devanagari
[2] Girish Dalvi, Anatomy of Devanagari Typefaces
[3] www.stefantrost.de
[4] Tushar Patnaik, Shalu Gupta, Deepak Arya, Comparison of Binarization Algorithm in Indian Language OCR
[5] B.B. Chaudhuri and U. Pal, Skew Angle Detection of Digitized Indian Script Documents
[6] Huanfeng Ma, David Doermann, Adaptive Hindi OCR Using Generalized Hausdorff Image Comparison
[7] Vijay Kumar, Pankaj K. Sengar, Segmentation of Printed Text in Devanagari Script and Gurmukhi Script
[8] Mudit Agrawal, M. N. S. S. K. Pavan Kumar, C. V. Jawahar, Indexing and Retrieval of Devanagari Text in Printed Documents
[9] R. Jayadevan, Satish R. Kolhe, Pradeep M. Patil, and Umapada Pal, Offline Recognition of Devanagari Script: A Survey
[10] Heidrun Osterer, Philipp Stamm, Adrian Frutiger Typefaces. The Complete Work
[11] http://www.indsenz.com