Offline Urdu OCR using Ligature based Segmentation for Nastaliq Script
Indian Journal of Science and Technology, Vol 8(35), DOI: 10.17485/ijst/2015/v8i35/86807, December 2015
ISSN (Print): 0974-6846, ISSN (Online): 0974-5645

Ankur Rana1* and Gurpreet Singh Lehal2
1 Department of Computer Engineering, Punjabi University, Patiala - 147002, Punjab, India; [email protected]
2 Department of Computer Science, Punjabi University, Patiala - 147002, Punjab, India; [email protected]
*Author for correspondence

Abstract

Urdu has two popular writing styles, Naskh and Nastaliq. In Arabic OCR research, an ample amount of work has been done on the Naskh style, whereas Urdu, which uses the Arabic character set, is commonly written in the Nastaliq style. Because of the Nastaliq style, Urdu OCR poses many distinct challenges, such as compactness, diagonal orientation and context-sensitive character shapes, that an OCR system must overcome to recognize an Urdu text image correctly. Owing to the compactness and slanting nature of Nastaliq, existing methods for the Naskh style would not give desirable results. In this paper, we therefore present a ligature based segmentation OCR system for the Urdu Nastaliq script. We discuss in detail the unique challenges of Urdu OCR and different feature extraction techniques for ligature recognition using SVM and kNN classifiers. The system is trained to recognize 11,000 Urdu ligatures. We achieved an overall accuracy of 90.29% when tested on Urdu text images.

Keywords: Feature Extraction (DCT, Directional, Gabor and Gradient), K-Nearest Neighbor, SVM, Urdu OCR

1. Introduction

Optical character recognition (OCR) is a technique used to convert image data into an editable text format. It is used to digitize printed text material such as books, newspapers and old literature, and in applications such as book readers for the blind and banking. A lot of research is being done for Arabic-script languages, and Urdu shares its character set with Arabic. Arabic is generally written from right to left in the Naskh style, whereas Urdu is written from right to left in the Nastaliq style. Nastaliq is a cursive writing style, and because of it very little work has been done for the Urdu language. For a cursive script like Urdu, researchers have used both segmentation based and segmentation free approaches. Most researchers have worked either on isolated Urdu character recognition or on a small set of Urdu ligatures.

Shamsher et al. [1] and Ahmad et al. [2] developed OCR for printed Urdu script. They used a feed forward neural network for training and classification of Urdu characters. Shamsher et al. [1] extracted the extreme points of all characters as the feature vector for the neural network and reported 98.3% accuracy in recognizing individual Urdu characters. Ahmad et al. [2] experimented with structural features of machine printed ligatures and reported 70% accuracy. Pathan [3] worked on handwritten isolated Urdu characters. Al Muhtaseb et al. [4] used HMM for developing Urdu OCR. They used a vertical sliding window three pixels wide and extracted 16 features by dividing the window into 8 equal parts and summing the number of black pixels in each sub-window. They used 2500 lines for training and 266 lines for testing, and achieved 99% accuracy for the Arial font. Their system is dependent on font style and font size. Hasan et al. [5] used an LSTM (Long Short-Term Memory) neural network to recognize printed Urdu Nastaliq text. They also used a sliding window, of width 30, for feature extraction, took pixel values as the feature vector, and reported 94.85% character recognition accuracy. Lehal and Rana [6] explored ligature recognition for Urdu OCR. They divided each ligature into two components, i.e. a primary component (the main shape of the ligature without diacritics) and a secondary component (the diacritics). Using SVM as the classifier, they achieved a 98% recognition rate for primary components and a 99.9% recognition rate for diacritic components. Lehal [7] and Hussain [8] worked on segmentation for Nastaliq Urdu OCR.

We have considered 10082 ligatures for the development of our Urdu OCR. Our approach is segmentation based, and we consider pre-segmented text for recognition. We have divided our ligatures into two sub-components, i.e. i) Primary Component and ii) Secondary Component. As a result, the total number of classes is reduced to 1845 classes for primary components and 19 classes for secondary components. We achieved 98.15% accuracy for primary components and 99.91% accuracy for secondary components. The overall accuracy of the Urdu OCR on input Urdu text images (having 500 ligatures on average) is 90.29%.
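To make the class reduction concrete, the sketch below groups a handful of two-letter ligatures by their primary component. It is only an illustration under stated assumptions: the `LETTER_PARTS` table, its shape labels (`BEH`, `HAH`, `ALEF`), its diacritic labels and both function names are hypothetical and cover just ten letters, and a real system extracts primary components from images rather than from Unicode text.

```python
from collections import defaultdict

# Hypothetical toy table (an assumption, not taken from the paper):
# letter -> (base-shape label, diacritic label or None).
LETTER_PARTS = {
    "\u0628": ("BEH", "one dot below"),      # beh
    "\u067e": ("BEH", "three dots below"),   # peh
    "\u062a": ("BEH", "two dots above"),     # teh
    "\u0679": ("BEH", "small tah above"),    # tteh
    "\u062b": ("BEH", "three dots above"),   # theh
    "\u062c": ("HAH", "one dot below"),      # jeem
    "\u0686": ("HAH", "three dots below"),   # tcheh
    "\u062d": ("HAH", None),                 # hah
    "\u062e": ("HAH", "one dot above"),      # khah
    "\u0627": ("ALEF", None),                # alef
}


def split_components(ligature):
    """Return (primary, secondary) label tuples for a ligature string."""
    primary = tuple(LETTER_PARTS[ch][0] for ch in ligature)
    secondary = tuple(
        LETTER_PARTS[ch][1] for ch in ligature if LETTER_PARTS[ch][1] is not None
    )
    return primary, secondary


def group_by_primary(ligatures):
    """Group ligatures that share the same primary component."""
    groups = defaultdict(list)
    for lig in ligatures:
        primary, _ = split_components(lig)
        groups[primary].append(lig)
    return dict(groups)


ALEF = "\u0627"
# Every non-alef letter of the toy table followed by alef, e.g. beh + alef.
LIGATURES = [letter + ALEF for letter in LETTER_PARTS if letter != ALEF]
GROUPS = group_by_primary(LIGATURES)
print(len(LIGATURES), "ligatures ->", len(GROUPS), "primary classes")
```

In this toy set, ligatures that differ only in their dots fall into the same primary class, which is the effect the class counts above rely on.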
2. Overview of Urdu

Urdu is the national language of Pakistan and an official language of six Indian states: Delhi, Jammu and Kashmir, Uttar Pradesh, Bihar, Andhra Pradesh and Telangana. The Urdu script is derived from the Farsi script. The Urdu language is particularly important as it has been widely used by poets for composing their poetry. Urdu is written in the Nastaliq style whereas Arabic is written in the Naskh style. Figure 1 shows the same Urdu text in the two writing styles.

Figure 1. Red color and Blue color text is in Nastaliq and Naskh writing style respectively.

Urdu has 38 basic letters. Figure 2 shows all the basic letters of the Urdu script. It is written from right to left, whereas Urdu numerals are written like Roman numerals, i.e. from left to right.

Figure 2. Urdu alphabets and number.

Urdu characters are classified as joiners and non-joiners. Urdu characters which join with the preceding character but not with the succeeding character are termed non-joiners. There are 12 non-joiners in Urdu, as shown in Figure 3. All characters other than these 12 non-joiners connect with both the preceding and the succeeding character and change their shape accordingly.

Figure 3. Non-joiner in Urdu.
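The joiner and non-joiner rule can be sketched as a small text-level routine: a ligature ends after a non-joiner or at a space. This is a minimal sketch, not the paper's image-based segmentation. The `NON_JOINERS` set is an assumption and possibly incomplete, since the paper lists its 12 non-joiners only in Figure 3, and the function name is hypothetical. The test word is badshah, which Section 4 decomposes into four ligatures.

```python
# Assumed, possibly incomplete set of non-joiners (the paper shows its
# 12 non-joiners only in Figure 3, so this set is an assumption).
NON_JOINERS = {
    "\u0627",  # alef
    "\u0622",  # alef with madda
    "\u062f",  # dal
    "\u0688",  # ddal
    "\u0630",  # thal
    "\u0631",  # reh
    "\u0691",  # rreh
    "\u0632",  # zain
    "\u0698",  # jeh
    "\u0648",  # waw
    "\u06d2",  # yeh barree
}


def split_into_ligatures(text):
    """Split text into ligatures; each one ends after a non-joiner or at a space."""
    ligatures = []
    current = ""
    for ch in text:
        if ch.isspace():
            # A space always closes the ligature in progress.
            if current:
                ligatures.append(current)
            current = ""
            continue
        current += ch
        if ch in NON_JOINERS:
            # A non-joiner does not connect to the next character.
            ligatures.append(current)
            current = ""
    if current:
        ligatures.append(current)
    return ligatures


WORD = "\u0628\u0627\u062f\u0634\u0627\u0647"  # badshah: beh alef dal sheen alef heh
LIGATURES_OF_WORD = split_into_ligatures(WORD)
print(LIGATURES_OF_WORD)
```

For this word the routine yields four ligatures: beh+alef, dal, sheen+alef and heh.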
3. Challenges in Urdu Recognition

The development of OCR for the Urdu script involves many unique complexities, which are as follows:

• Urdu is written diagonally from right to left and top to bottom. All ligatures look tilted at some angle, running from top right to bottom left, because of the way the different joiners are written in Urdu. Numerals add another level of complexity to Urdu OCR. Books were observed to contain both Roman and Arabic numerals, and both Urdu and Roman numerals are written from left to right within Urdu text, as shown in Figure 4.

Figure 4. Urdu writing style in nastalique from right to left and number written from left to right.

• As Urdu is mostly written diagonally from top right to bottom left, segmentation is a problem, at both the character and the word level. Figure 5 shows overlapped ligatures marked in red, green and blue.

Figure 5. Ligature overlapping.

• Urdu is a context-sensitive script as well. Context sensitive means that the shape of a character depends on the succeeding character. When different characters are written together, the shape of a character depends on the shape of the character next to it, as shown in Table 1. The character in red on the left hand side of the equals sign is the character that changes its shape on the right hand side when it is written with other characters. In Naskh each character has four shapes, i.e. isolated, middle, left and right, as shown in the table, but Urdu in the Nastaliq script has more than four shapes per character. Naz [9] reported 32 different shapes of one character in the Nastaliq script.

Table 1. Different shapes of tey, tay and meem with different other characters
ت + و = تو    ت + ب = تب    ت + ن = تن
ٹ + ب = ٹب    ٹ + ن = ٹن    ٹ + ڈ = ٹڈ
م + ل = مل    م + ن = من    م + ر = مر

• One of the major problems of Urdu OCR is broken characters in the image. If one of the diacritics is not printed or not correctly scanned, the OCR cannot identify it correctly. In Figure 6 the ligature is broken at the end. Consequently, the OCR will be confused about whether to treat the broken ligature as one component or as more than one.

Figure 6. Broken ligatures in Urdu text.

• Another problem for Urdu OCR is the case of merged characters. In Urdu, different ligatures have the same basic shape but different diacritics, and the diacritics identify the correct meaning of the ligature. Sometimes, however, the diacritics merge with the primary shape of the ligature, and it is difficult to segment even the primary shape of the ligature from its diacritics, as shown in Figure 7.

Figure 7. Merged diacritics with ligature in Urdu text.

• There is very little space between words in Urdu. A ligature ends either with a space or with a non-joiner as its last character. In Figure 8, the green bars show the boundaries between different words, whereas the red bars mark the space between two ligatures within one word. From the spacing between ligatures and words alone, we cannot find the boundary between different words.

Figure 8. Urdu text with very small space between ligatures.

• Line segmentation also poses a problem in Urdu OCR. Because of the diagonal writing style of the script, two ligatures in two different lines can get merged, as shown in Figure 9.

Figure 9. Two ligature in different lines get merged.

• It is observed that in some Urdu books the footer of the page is written in the Naskh style, which adds another complexity to the recognition. Having both Naskh and Nastaliq writing styles on one page adds another level of challenge to the correct recognition of the page. As shown in Figure 10, the Urdu text in green is in the Nastaliq writing style and the Urdu text in red is in the Naskh writing style. Having both types of script on a page is a form of multiscript recognition, which further increases the complexity of the system.

Figure 10. Urdu Text written in two different styles (multiscripts on one page).
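As a small illustration of the four Naskh shapes mentioned in the context-sensitivity item above, Unicode encodes the positional forms of each Arabic letter (named isolated, initial, medial and final) as separate compatibility characters in the Arabic Presentation Forms-B block. The sketch below lists them for teh using Python's standard `unicodedata` module. The function name `presentation_forms` is hypothetical, and the additional Nastaliq shapes have no such code points, so this only illustrates the Naskh-style baseline.

```python
import unicodedata


def presentation_forms(letter):
    """Map positional form name -> presentation-form character for one letter."""
    target = f"{ord(letter):04X}"
    forms = {}
    # Arabic Presentation Forms-B occupies U+FE70..U+FEFF.
    for cp in range(0xFE70, 0xFF00):
        parts = unicodedata.decomposition(chr(cp)).split()
        # Keep single-letter forms only, e.g. "<initial> 062A";
        # two-letter ligature forms have more than two parts.
        if len(parts) == 2 and parts[1] == target:
            forms[parts[0].strip("<>")] = chr(cp)
    return forms


TEH_FORMS = presentation_forms("\u062a")  # teh
for name, ch in sorted(TEH_FORMS.items()):
    print(f"{name:9s} U+{ord(ch):04X}")
```

A non-joiner such as alef has only two such forms (isolated and final), which matches the joiner and non-joiner distinction drawn earlier.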
From the space 3 1995 use 1995 are all hand written whileJournal majority of the and books published after Vol 8 (35) | December 2015 | www.indjst.org Indian of Science Technology between words, we cannot find characters. the word boundary between different words. with the Urduligature OCR and is the case merged In Urdu different ligatures have same problem fonts such as Alvi Nastaliq or Noori Nastaliq. Shape of the characters in Urdu pe but different diacritics. From the diacritics of the ligature, we can identify the correct meaning character it follows. That’s why we cannot take the character unit as the classifi Ligature at character level is shown in Figure 11. ature. But some time diacritics get merged with the primary shape of the ligature and it is segmentation difficult yles (multiscripts on one page).Text written in two different styles (multiscripts on one page). Figure 10. Urdu Offline multiscripts on one page).Urdu OCR using Ligature based Segmentation for Nastaliq Script Urdu Data Preparation multiscripts on on4.one one page). multiscripts page). We have gathered training data from various scanned all hand written while majority of the books published ers can be divided intoprinted two generations. Thenewspapers books printed Urdu text in books and can before be divided into two generations. The books printed before books. Some of the least frequent ligatures, as mentioned after 1995 use computer generated Nastaliq fonts such as of published useThe computer generated Nastaliq 1995 allafter hand1995 written while majority of the books published after 1995 10use computer generated Nastaliq anthe be books divided intoare two generations. books printed before by Lehal , rarely occur books. Alvi Nastaliq or Noori Nastaliq. Shape of the characters aliq. Shape of the such characters Urdu depends onNastaliq. 
the shape of the fonts Alviin Nastaliq orbooks Noori Shape of the characters in Urdu depends on in thesome shape of theTo generate the e books published 1995 use computer generated an be divided into after twoas generations. The printedNastaliq before take the character unitit depends as classification unitthe for the recognition. data for those primary components of the ligain Urdu on thewhy shape of the character follows. unit astraining character follows. we cannot take character the classification unit for the recognition. the characters inthe Urdu depends on shape of the theit eShape booksofpublished after 1995 useThat’s computer generated Nastaliq shown in Figure 11.as ture, we made some synthetic images with different font That’s why we cannot take the character unit as the claseShape the character unit the classification unit for the recognition. Ligature segmentation at character level is shown in Figure 11. of the characters in Urdu depends on the shape of the allenge in correct recognition of the page. As shown in figure 10, Urdu text in green color is in nastaliq n correct recognition of the page. As shown in figure 10, Urdu text in green color is in nastaliq size i.e. 35, 38, 40, 45, 50, 55 and different format option incharacter sification the recognition. Ligature Figure 11. ewn ge in correct recognition ofclassification thefor page. Asunit shown in figure 10,segmentation Urdu text in green color is in nastaliq the unit as theunit for the recognition. = = ng andcolor red text Naskh style. Having type of script page type likepage bold regular. ndstyle red color Urdu textUrdu is in is Naskh writing style. Having both typeboth of script on page ison type ofis atcolor character level isin shown inwriting Figure 11. wn in Figure 11. yle and red Urdu text in is Naskh writing style. Having both type of script on is or type of ofWe have removed diacritics marks from these ligatures to get the primary component. 
We For the development of Urdu OCR we have taken script recognition which further increases the complexity of the system. ognition which further increases the complexity of the system. t recognition which further increases the complexity of the system. have gathered total 1200 primary component from the ligatures as a classification unit. Ligature is the connected Figure 11. characters Urdu word with scanned books and rest of the primary components were component and has different Urdu andcharacter end segmentation. word with character segmentation. generated synthetically. Even in 1200 primary compocharacter is either non joiner or space. Urdu words can ve taken ligatures a classification the taken connected For theas development of unit. Urdu Ligature OCR we ishave ligatures as a classification unit. Ligature is the connected word with with character character segmentation. word segmentation. nent, some of primary component samples have more than onejoiner ligature. As for example, the word is either ers and endcomponent character isand either or space. words cancharacter hasnon different Urdu and end non joiner or space. Urdu words can are less than aken ken ligatures as a classification unit. Ligature ischaracters theUrdu connected the required number of samples. To complete ( ) (badshah) composed four ligatures: two ligample, the word ()ﺑﺎﺩﺷﺎﻩ is composed of four ligatures: two have thannon one ligature. As for example, and character isclassification either joiner or space. words the can word (( )ﺑﺎﺩﺷﺎﻩbadshah) is composed of four ligatures: two the samples ken end ligatures as amore unit. Ligature isUrdu the connected ﺑand and )ﺷﺎ and have two single character ligature (ﺩ and )ﻩ. The two we mixed synthetic and (ﺩ scanned samples. 
We have tures having multiple are ((ﺑﺎ andtwo and have two single multiple ligatures havingnon ))ﺷﺎand character ligature and )ﻩ.books The two theend word ()ﺑﺎﺩﺷﺎﻩ (badshah) is composed of four ligatures: character is either joiner characters or space. Urdu words can er composed of two characters each (ﺏ+ ﺍ = ﺑﺎ and ﺵ+ ﺍ = )ﺷﺎ also trained Urdu two single character ligature (( ﺩand The two ligatures character ligature two ligatures with multiple character are further composed of two characters each (ﺏ+ ﺑﺎ = ﺍnumerals and ﺵ+ = ﺍand )ﺷﺎroman numerals as the dthe )ﺷﺎword and have two single )ﻩ.). The ()ﺑﺎﺩﺷﺎﻩ (badshah) is composed of four ligatures: two .t Urdu Text written indifferent two different styles (multiscripts on one page). written inhave two styles (multiscripts on one page). Text written indifferent two styles (multiscripts on one page). primary component. Sample of the primary components with multiple character are further composed of two of two characters each (ﺏ+ ﺍ = ﺑﺎ and ﺵ+ ﺍ = )ﺷﺎ dumposed )ﺷﺎand two single character ligature (ﺩ and )ﻩ. The two cognizable unit for10Urdu OCR. He hasanalysis taken 6,533,057 words corpus did the statistical of the recognizable unit for Urdu OCR. He has taken 6,533,057 words corpus Lehal mposed of two characters each (ﺏ+ ﺍ = ﺑﺎ and ﺵ+ ﺍ = )ﺷﺎ are shown in Figure 13. characters ( and ) identified nearly 10082 ligatures which are usedwords 99% time inUrdu the ueta Data reparation and identified 25,957 unique ligatures. He identified nearly ligatures which are used 99% time in the llenge inPreparation correct recognition of the page. As shown in figure 10, text10082 in green color is in nastaliq izable unit for Urdu OCR. has taken 6,533,057 corpus Preparation 10 He We have also collected samples of the merged ligatures. Lehal did the statistical analysis of the recognizable whole corpus. ntified nearly 10082 ligatures which are used 99% timeHaving in theboth type of script on page is type of for Urdu OCR. 
He taken 6,533,057 words corpus gizable style unit and red color Urdu texthas is in Naskh writing style. Merged ligature are those ligature in which diacritics merge unit for Urdu OCR. He has taken 6,533,057 words corxt in books and newspapers can be divided into two The books printed before inprinted books and newspapers canwhich be divided into twotime generations. The The books printed before ntified 10082 ligatures aredivided used 99% thegenerations. nted innearly books and newspapers can be into two generations. books printed before cript recognition which further increases the complexity ofin the system. with these 10082 ligatures. But books toour the training data for with the component shape. merged characters pus and identified 25,957 unique ligatures. He10082 identified We have developed proposed system with these 10082 ligatures. But to primary collect the training data for As 10082 ehand all hand written while majority ofcollect the books published after 1995 use computer generated Nastaliq written while majority of the published after 1995 use computer generated Nastaliq written while majority of the books published after 1995 use computer generated Nastaliq 10 10for the recognition or touching character pose problem has me task. To decrease the number of recognition classes, Lehal nearly 10082 ligatures which are used 99% time in the these 10082 ligatures. But to collect the training data for 10082 ligatures from the books very cumbersome task. To decrease the number recognition chAlvi as Nastaliq Alvi or Nastaliq or Noori Nastaliq. of characters the characters in Urdu depends on shape the shape of the classes, Lehal has Nastaliq Noori Nastaliq. Shape ofisShape the in Urdu depends on the ofof the or Noori Nastaliq. Shape of characters the in Urdu depends on shape the of the 10 10 in any OCR system. whole corpus. dary component as shown in Figure 12. r it follows. 
That’s why we cannot take the character unit as the classification unit for the recognition. ws. That’s why we cannot take the character unit as the classification unit for the recognition. has ask. To decrease the number of recognition classes, Lehal these 10082 ligatures. But to collect the training data for 10082 separated ligature into primary and secondary component as shown in Figure 12. ollows. That’s why we cannot take the character unit as the classification unit for the recognition. After analyzing scanned images from 10 10 books, we have found some diacritics components merged Wenumber have our proposed system segmentation at character level is shown in ation at decrease character level isFigure shown in Figure 11.Figure haswith these ask. To the of12. recognition classes, component shown in mentation at as character level isdeveloped shown in Figure 11. 11.Lehal with the primary component shape. We have collected 10082 ligatures. But to collect the training data for 10082 component as shown in Figure 12. = = = total 41 such primary components merged with the existligatures from the books is very cumbersome task. To ing primary components. Some of the diacritics touching decrease the number of recognition classes, Lehal10 has primary components are shown in the Figure 14. separated ligature into primary and secondary compoUrdu Text in two different styles (multiscripts on one page). (b) written (c)11. Figure 11. Urdu word with character segmentation. Figure Urdu word with character segmentation. (a) (b) (c) 11. Urdu word with We have collected150 samples for each secondnentFigure as shown in Figure 12.character segmentation. gature Ligature secondary component. Figure 12. (a) Urdu Ligature (b) Ligature primary component (c) Ligaturecomponents secondary component. (b) primary component (c) After (c) Even of in the 1200 primary com ary component fromwere thegenerated scanned synthetically. books. 
Sample removing diacritics from the 10082ligatures and Data development of(c)Urdu OCR we have taken ligatures as a classification unit. Ligature is connected the connected ment ofPreparation Urdu OCR we(c)have taken ligatures as a as classification unit.unit. Ligature is theis connected lopment of Urdu OCR we have taken ligatures a classification Ligature the e(b) primary component Ligature secondary component. (b) samples are less than the required number of samples. To complete (c) s econdary component is shown in the Figure 15: grouping ligatures having same we ligatures gatures and grouping ligatures having sameend primary component, ent anddifferent has different characters and character iscomponent, either non or space. Urdu can After removing diacritics the 10082ligatures andwe grouping having same primary component, we asprimary different Urdu characters and end character isprimary either non joiner or joiner space. Urdu words canwords has Urdu characters and from end character is either non joiner or space. Urdu words canbooks component (c)Urdu Ligature secondary component. eendprimary component (c) Ligature secondary component. scanned samples. We have also trained Urdu numerals and roma Complete statistics of training data is given in 1845 classes of theword primary component and 16 secprinted in books and newspapers can be divided into (badshah) two generations. The books printed before ponent and 16have secondary components. When these secondary re than one ligature. As for example, the word ()ﺑﺎﺩﺷﺎﻩ is16 composed of four ligatures: two these one ligature. As for example, theof word ()ﺑﺎﺩﺷﺎﻩ (badshah) is composed of four ligatures: two two res and grouping ligatures having same primary component, we have 1845 classes the primary component secondary components. When secondary han one ligature. As for example, the ()ﺑﺎﺩﺷﺎﻩ (badshah) isand composed of four ligatures: Sample of the primary components are shown in Figure 13. 
all hand written while of the books published after 1995 use ligature computer generated ftotal ollowing Table 2. two ondary components. When these secondary components having multiple characters are (ﺑﺎ and )ﺷﺎ have two single character (ﺩThe and )ﻩ.Nastaliq The multiple characters are (ﺑﺎ and )ﺷﺎ and have two single character ligature (ﺩligature and )ﻩ. two mary component we getmajority aare total of 10082ligatures. ing multiple characters (ﺑﺎ and )ﺷﺎ and have two single character and The two ent and 16 components secondary components. When these secondary res and grouping ligatures having same primary component, we are used along withand 1845 primary component we get a(ﺩ of)ﻩ. 10082ligatures. h as Alvi Nastaliq or Noori Nastaliq. Shape of the characters in Urdu depends on the shape of the with multiple character are further of characters twocomponent characters (ﺏ+ ﺵﺍ+ = ﺍ ﺑﺎand tiple character are composed ofWhen two characters eacheach (ﺏ+each ﺏ(ﺍ+ = ﺑﺎget = ﺵ ﺍ)ﺷﺎ+ are used along withcomposed 1845 primary we hent multiple character further composed of two ﺍand = ﺑﺎa and ﺵ+ = )ﺷﺎ = ﺍ)ﺷﺎ and 16 secondary components. these secondary component we get further aare total of 10082ligatures. it follows. That’s why we cannot take the character unit as the classification unit for the recognition. arious scanned books. Some oftraining the least ligatures, as books. Some of the least frequent ligatures, as 10082ligatures. We have gathered datafrequent from various scanned component wetotal get aoftotal of 10082ligatures. Urdu word with character segmentation. component we a total ofthe 10082ligatures. the statistical analysis oflevel unit for Urdu OCR. Hetaken has taken 6,533,057 words corpus atistical analysis of the recognizable unit for Urdu OCR. He has taken 6,533,057 words corpus egmentation at get character isrecognizable shown inthose Figure 11. 10 eid statistical analysis of the recognizable for Urdu OCR. Heas has 6,533,057 words corpus books. 
To generate the training data forunit primary components us scanned books. Some of the least frequent ligatures, , rarely occur in some books. To generate the training data for those mentioned by Lehal ntified 25,957 unique ligatures. He identified nearly 10082 ligatures which are used 99% time in theprimary components ,957 unique ligatures. He identified nearly 10082 ligatures which are used 99% time in the d 25,957 unique ligatures. Hei.e. identified nearly 10082 ligatures which are used 99% time in the mages with different font size 35, 38, 40, 45, 50, 55images and different oks. ks. To generate the training data for those primary components us scanned books. Some of the least frequent ligatures, as of the ligature, we made some synthetic with different font size i.e. 35, 38, 40, 45, 50, 55 and different = orpus. . ve removed diacritics marks from these ligatures to get the primary ks. To generate the training data for those primary components ess with different fontoption size i.e. 35,bold 38, or 40,regular. 45, 50, We 55 and different format like have removed diacritics marks from these ligatures to get the primary our proposed system with these 10082 ligatures. But to collect thedata training data for 10082 ed our different proposed system with these 10082 ligatures. But to the training for 10082 primary component from the scanned and rest of the primary these moved diacritics marks from ligatures toligatures. get the primary sdeveloped with font size i.e. 35, 38, 40,books 45, 50, 55 and different eloped our proposed system with these 10082 Butcollect to collect the training for 10082 component. We have gathered total 1200 primary component from thedata scanned books and rest of the primary 10 10 10 has from the books is very cumbersome task. To decrease the number of recognition classes, Lehal has e books is very cumbersome task. 
To decrease the number of recognition classes, Lehal moved diacritics marks from these ligatures to get the primary thenumber primaryof recognition classes, Lehal has from the scanned books rest ofthe mary thecomponent books is very cumbersome task. Toand decrease Figure 11. Urdu word with character segmentation. Figure 11. Urdu wordcomponent with character segmentation. dinto ligature intoand primary and secondary as shown in Figure eature primary secondary component as shown inthe Figure 12. ary component from the scanned books and rest of primary into primary and secondary component as shown in Figure 12.5 12. Page Journal Page 5 evelopment of Urdu OCR we have taken ligatures as a classification unit. Ligature is the connected Page 5 nt and has different Urdu characters and end character is either non joiner or space. Urdu words can Page 5 is composed of four ligatures: two e than one ligature. As for example, the word (( )ﺑﺎﺩﺷﺎﻩbadshah) having multiple characters are ( ﺑﺎand )ﺷﺎand have two single character ligature ( ﺩand )ﻩ. The two with multiple character are further composed of two characters each (ﺏ+ ﺑﺎ = ﺍand ﺵ+ )ﺷﺎ = ﺍ (a) (a) (b) (b) (c) (a) (a) (b) (b) (c) (c) (c) d12. theFigure statistical analysis of the(b) recognizable unitcomponent for Urdu OCR. He hassecondary taken 6,533,057 words corpus 12. (a) Urdu Ligature Ligature primary (c) Ligature component. e (a) Urdu Ligature (b) Ligature primary component (c) Ligature secondary component. Figure 12. (a) Urdu Ligature (b) Ligature primary component (c) Ligature secondary component. Figureligatures. 12. (a)He Urdu Ligature Ligature primary ified 25,957 unique identified nearly(b) 10082 ligatures which are used 99% time in the component (c) Ligature secondary component. Figure 13. wePrimary pus. 
moving diacritics from the 10082ligatures and grouping ligatures having same primary component, acritics from the 10082ligatures and grouping ligatures having same primary component, ng diacritics from the 10082ligatures and grouping ligatures having same primary component, we wecomponent training sample data. Figure 13. Primary component training sample data. 45 classes of the primary component and 16 secondary components. When these secondary es of the primary component and 16 secondary components. When these secondary classes of our the proposed primary system component and 16 secondary these data secondary developed with these 10082 ligatures.components. But to collectWhen the training for 10082 ents are used along with 1845 primary component we get a total of 10082ligatures. used along with 1845 primary component we get a total of 10082ligatures. We have10also samples of the merged ligatures. Merged ligatu are used along with 1845 primary component we get a total of 10082ligatures. hascollected rom the books is very To decrease the number of recognition classes, Lehal 4 Vol 8 (35)cumbersome | December 2015task. | www.indjst.org Indian Journal of Science and Technology merge with the primary component shape. As merged characters or to ligature intodata primary and secondary component as shown inthe Figure ethered gathered training data from various scanned books. of 12. the least frequent ligatures, ed training from various scanned books. Some of Some least frequent ligatures, as as inasany OCR system. After analyzing scanned images from training data from various scanned books. Some of the least frequent ligatures, recognition 10 10 10 , rarely occur in some books. To generate the training data for those primary components ed by Lehal , rarely occur in some books. To generate the training data for those primary components al components merged with the primary component shape. We ha y Lehal , rarely occur in some books. 
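The ligature unit used above can be illustrated on Unicode text. The following is a minimal sketch, not the paper's method: the paper segments ligatures as connected components in the scanned image, whereas this sketch only splits a typed word after letters that do not join to the next letter. The function name `split_into_ligatures` and the `NON_JOINERS` set are assumptions made for illustration, and the set is not claimed to be complete.

```python
# Assumed (possibly incomplete) set of Urdu letters that never join to the
# following letter: alef, alef madda, dal, ddal, thal, reh, rreh, zain, jeh,
# waw, yeh barree.
NON_JOINERS = set(
    "\u0627\u0622\u062f\u0688\u0630\u0631\u0691\u0632\u0698\u0648\u06d2"
)


def split_into_ligatures(text):
    """Split text into ligatures: a ligature ends at a non-joiner or a space."""
    ligatures = []
    current = ""
    for ch in text:
        if ch.isspace():
            # A space always closes the ligature that is being built.
            if current:
                ligatures.append(current)
                current = ""
            continue
        current += ch
        if ch in NON_JOINERS:
            # A non-joiner cannot connect forward, so the ligature ends here.
            ligatures.append(current)
            current = ""
    if current:
        ligatures.append(current)
    return ligatures


# The example word from the text: beh, alef, dal, sheen, alef, heh (badshah).
word = "\u0628\u0627\u062f\u0634\u0627\u06c1"
parts = split_into_ligatures(word)
```

For the example word this yields four ligatures, matching the decomposition described in the text (two two-character ligatures and two single-character ligatures).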
To generate the training data for those primary components components merged with the primary component shape. We have collected total 41 such primary merge with the primary component shape. As merged characters or touching character pose problem for the components merged with the existing primary components. Some of the diacritics touching primary recognition in any OCR system. After analyzing scanned images from books, we have found some diacritics components are shown in the Figure 14. components merged with the primary component shape. We have collected total 41 such primary components merged with the existing primary components. Some of the diacritics touching primary Ankur Rana and Gurpreet Singh Lehal components are shown in the Figure 14. Figure 14. Sample of merged primary component with secondary component. a(q) = We have each secondary component Figure 14. collected150 Sample ofsamples mergedforprimary component with from the scanned books. Figure 14. Sample merged component with secondary component. secondary component is of shown in primary the Figure 15: secondary component. 1 , v=0 N 2 Sample , 1of≤ qthe ≤ N −1 N the images × 32 for the DCT. We have collected150 samples for each secondary component We fromhave the scaled scannedallbooks. Sampletoof32 the secondary component is shown in the Figure 15: Whole image DCT gives us 1024 frequency components. DCT has one important property that left top of the DCT matrix gives high frequency component. High freFigure 15. Secondary component training sample data. Figure 15. Secondary component training sample data. quency component means maximum information about Complete statistics of training data is given in following Table 2. the image is stored on top left the DCT matrix. We have 15. Secondary component training sample data. Table 2. Figure Training data statistics Table 2. 
Training data statistics extracted total 100 features from the total of 1024 feature Complete statistics of training data is given in following Total Ligatures 10082 Table 2. values in zigzag manner as shown in Figure 16. Primary Components (After removing diacriticsTable 2.1845 Training data statistics from the ligatures) Total Ligatures 5.2 Gabor10082 Features English Numerals 10 Gabor function G(i,j) is the linear filter. Rajneesh Rani Urdu Numerals 10 Journal Page 6 Total Ligatures 10082 for et al.11 used gabor features the script identification. It is Secondary Components 10 used for the edge detection in image processing. It is mulPunctuation marks as Primary Component 9 Journal Page 6 tiplication of harmonic function and Gaussian function. Punctuation marks as secondary Component 6 G(m, n) =P (m, n) * C(m, n) 5. Features Extraction For the classification of any pattern, relevant features have to be extracted. For Urdu OCR many researchers use different features. We have recognized primary and secondary components. For classification of the primary component of the ligature, we calculated DCT, Gabor, directional and gradient features. 5.1 Discrete Cosine Transformation (DCT) DCT is the statistical feature extraction technique. DCT maps ligature image from spatial domain to the frequency domain. DCT maps the entire high frequency component to the upper right corner of the image matrix and low frequency components maps to the bottom right corner of the image. DCT coefficients f(p, q) of image I(m, n) are computed by equation(1) : f ( p, q) = a( p)a(q) Where M −1 N −1 This is used both for the orientation and spatial frequency. A Gabor filter is defined as n 2 2 −1 R21 + R22 j 2pR1 s 2 s m, n, q , sm , s = e m n *e l G where R1 = m cos q + n sin q and R2 = − m sin q + n cos q (2m + 1)p p (2n + 1)pq cos 2M 2N ∑ ∑ f (m, n)cos m= 0 N = 0 (1) 1 , u=0 M a( p) = 2 , 1 ≤ p ≤ M −1 M Vol 8 (35) | December 2015 | www.indjst.org Figure 16. 16. 
Figure 16. DCT coefficients of the image selected in zigzag direction into one vector.

In the Gabor filter, θ is the orientation of the sinusoidal plane wave, λ is the wavelength, and σ_m and σ_n are the standard deviations. We have taken both standard deviations to be equal for the feature extraction. To calculate the features of the input, the image is first scaled to 32 × 32 pixels. It is then partitioned into four equal non-overlapping sub-regions of size 16 × 16. Each of these sub-regions is further partitioned into 4 non-overlapping sub-sub-regions of size 8 × 8, which gives a total of 16 small regions. The resulting 21 images (the whole image, the 4 sub-regions and the 16 sub-sub-regions) are then convolved with odd-symmetric and even-symmetric Gabor filters at nine different orientations θ, spaced 20 degrees apart, to obtain a feature vector of 21 × 9 = 189 values.

5.3 Directional Features

Directional features [12] calculate the distance of black and white pixels in eight different directions for each pixel. We scaled the input image to 36 × 36 pixels. After scaling, the directional feature vector is calculated in eight different directions: 0°, 45°, 90°, 135°, 180°, 225°, 270° and 315°. This gives a directional feature vector of length 16 for each pixel. To down-sample the resulting 20736 (16 × 36 × 36) directional feature values, we divided the image into 9 (3 × 3) zones. We then generated a feature vector of length 16 from each zone by adding the corresponding directional feature values of all the pixels in that zone. This gives a directional feature vector of length 144, i.e. 16 feature values × 9 zones.
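The zoning step that reduces 20736 per-pixel values to 144 features can be sketched as follows. This is a minimal illustration under stated assumptions: the per-pixel directional vectors are taken as already computed and passed in as a nested list `per_pixel[row][col][k]` (a hypothetical layout, not specified in the paper), and the function name `zone_features` is likewise an assumption.

```python
def zone_features(per_pixel, zones=3):
    """Sum per-pixel feature vectors over a zones x zones grid of the image.

    per_pixel is a nested list indexed as [row][col][k]; for the setup in the
    text it would be 36 x 36 x 16. The result has zones * zones * depth values.
    """
    height = len(per_pixel)
    width = len(per_pixel[0])
    depth = len(per_pixel[0][0])
    if height % zones or width % zones:
        raise ValueError("image size must be divisible by the zone count")
    zone_h = height // zones
    zone_w = width // zones
    features = [0.0] * (zones * zones * depth)
    for r in range(height):
        for c in range(width):
            # Zones are numbered row by row, left to right.
            zone = (r // zone_h) * zones + (c // zone_w)
            base = zone * depth
            for k, value in enumerate(per_pixel[r][c]):
                features[base + k] += value
    return features


# Dummy input: every pixel carries a 16-value vector of ones.
dummy = [[[1.0] * 16 for _ in range(36)] for _ in range(36)]
vector = zone_features(dummy)
```

With a 36 × 36 input and 3 × 3 zones, each zone covers 12 × 12 pixels, so each of the 144 output values is the sum of 144 per-pixel values.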
5.4 Gradient Features

A gradient feature [12,13] gives the magnitude and direction of the maximum change in intensity in the neighborhood of a pixel. For gradient feature extraction, the image is first normalized to 63 × 63 pixels. After normalization, the gradient vector is calculated in both the x and y directions at each pixel position using the Sobel operator shown in Figure 17:

\[
g_x(x, y) = z(x+1, y-1) + 2z(x+1, y) + z(x+1, y+1) - z(x-1, y-1) - 2z(x-1, y) - z(x-1, y+1)
\]

\[
g_y(x, y) = z(x-1, y+1) + 2z(x, y+1) + z(x+1, y+1) - z(x-1, y-1) - 2z(x, y-1) - z(x+1, y-1)
\]

Figure 17. Sobel operator.
  -1  -2  -1          1   0  -1
   0   0   0          2   0  -2
   1   2   1          1   0  -1

After calculating the gradient vector, we calculated the magnitude and direction as

\[
\text{Magnitude} = \sqrt{g_x(x, y)^2 + g_y(x, y)^2}
\]

\[
\text{direction} = \tan^{-1}\!\left(\frac{g_x(x, y)}{g_y(x, y)}\right)
\]

The direction of the gradient vector is then decomposed along the 8 chain code directions (D0, D1, D2, D3, D4, D5, D6 and D7), as shown in Figure 18. The character image is divided into 9 × 9 blocks (81 blocks). If the gradient vector lies between two directions, it is decomposed into components along those two directions; otherwise the magnitude of the vector is retained. This results in 63 × 63 × 8 values. Next, the spatial resolution is reduced from 9 × 9 to 5 × 5 by down-sampling every two horizontal and vertical blocks with a 5 × 5 Gaussian filter, which gives 5 × 5 × 8 = 200 feature values per image.

Figure 18. (a) Eight (D0–D7) chain code directions. (b) Breakdown of gradient vector.

6. Experiments

6.1 Primary Component Recognition

Lehal and Rana [6] reported 98.01% and 96.78% accuracy on 2190 primary component classes using a Support Vector Machine (SVM) [14] and k nearest neighbor (kNN) respectively. We found that, among the 2190 primary components, many primary shapes of the ligatures look the same. Our primary component count is therefore reduced to 1873. We used DCT, Gabor, directional and gradient features for the feature extraction of the primary component, and SVM (linear and polynomial kernel) and kNN classifiers for primary component recognition. Results of primary component recognition with the SVM classifier using a linear kernel and polynomial kernels of degree 3 and 4 are given in Table 3.

Table 3. Primary component of ligature recognition using SVM classifier (accuracy in %)
Features                              Feature vector length   Linear SVM   Polynomial SVM (Degree 3)   Polynomial SVM (Degree 4)
Discrete Cosine Transformation (DCT)  100                     98.15        94.62                       -
Discrete Cosine Transformation (DCT)  32                      95.93        92.97                       -
Discrete Cosine Transformation (DCT)  64                      97.03        94.12                       -
Gabor                                 189                     94.68        94.55                       -
Directional Feature                   144                     90.10        89.53                       -
Gradient Features                     200                     95.14        91.56                       -
A dash marks a value whose row could not be recovered. Detached accuracy values in the source near this table: 92.87, 90.57, 90.15, 92.08.

We also experimented with the k nearest neighbor classifier with different values of k (1, 3 and 5). Results of primary component recognition using the kNN classifier are shown in Table 4. With an increase in the value of k, we observed that the accuracy goes up to 98.30%. We also observed that, due to the large number of classes, SVM takes on average 559 seconds to recognize 3746 primary components of ligatures, whereas the kNN classifier takes only 170 seconds to recognize the same.

Table 4. Primary component of ligature recognition using kNN classifier (accuracy in %)
Features                              Feature vector length   K=1     K=3     K=5
Discrete Cosine Transformation (DCT)  100                     95.17   97.96   -
Discrete Cosine Transformation (DCT)  32                      93.34   97.23   -
Discrete Cosine Transformation (DCT)  64                      94.90   97.78   -
Gabor                                 189                     88.56   94.93   96.39
Directional Feature                   144                     86.32   94.07   -
Gradient Features                     200                     93.60   97.25   -
A dash marks a value whose row could not be recovered. Detached K=5 accuracy values in the source: 95.79, 97.83, 98.30, 97.88, 98.27.

6.2 Secondary Component Recognition

We have 19 secondary component classes, for which we extracted DCT, Gabor and zoning features. For training, we have 100 samples for each class. Table 5 shows the results of secondary component recognition with different features and classifiers. As we can see, the combination of DCT and polynomial SVM (degree = 3) attains 99.50% recognition accuracy. We also used the kNN classifier for secondary component recognition with different values of k (1 and 3). The experimental results of secondary component recognition with kNN are given in Table 6.

Table 5. Secondary component of ligature recognition using SVM classifier (accuracy in %)
Features                              Feature vector length   Linear SVM   Polynomial SVM (Degree = 3)
DCT                                   100                     -            99.50
Gabor                                 189                     -            -
Directional Feature                   144                     95.16        95.69
Gradient Features                     200                     96.77        93.01
A dash marks a value whose row could not be recovered. Detached accuracy values in the source near this table: 96.87, 94.37, 95, 89.50, 93.96.

Table 6. Secondary component of ligature recognition using kNN classifier (accuracy in %)
Features                              Feature vector length   K=1     K=3
Discrete Cosine Transformation (DCT)  100                     94.11   98.93
Discrete Cosine Transformation (DCT)  32                      93.58   98.93
Discrete Cosine Transformation (DCT)  64                      94.11   98.93
Gabor                                 189                     93.04   97.32
Directional Feature                   144                     94.11   99.99
Gradient Features                     200                     93.04   99.46

7. Formation of Ligature

After getting the primary and secondary components, we form the ligature by grouping the two codes (the primary component code and the secondary component code). We manually crafted a code book that comprises the primary component code and the secondary string code.
From the combination of the primary and secondary component codes, we retrieve the ligature. To search for a combination of primary and secondary code, we implemented a Binary Search Tree (BST). Each node of the BST contains a primary code and a linked list of nodes holding the secondary codes and their ligature codes in Unicode. The BST and the linked-list nodes of our code book structure are shown in Figure 19.

Figure 19. A sample Binary Search Tree depiction of the code book (primary code nodes: 45, 40, 53, 38, 43, 46, 56).

Suppose the primary component classifier gives a code of 46 and the secondary string code is hB. The BST search then gives ﮔﯩﻮ as the identified ligature.
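The code book lookup described in Section 7 can be sketched as a binary search tree keyed on the primary code, with a linked list of secondary codes at each node. This is a minimal sketch under stated assumptions, not the authors' implementation: the class names `SecondaryNode`, `PrimaryNode` and `CodeBook` are hypothetical, the primary codes are the node labels that survive from Figure 19, and all entries other than (46, hB) are placeholders.

```python
class SecondaryNode:
    """Linked-list node: one secondary string code and its ligature."""

    def __init__(self, secondary_code, ligature, next_node=None):
        self.secondary_code = secondary_code
        self.ligature = ligature
        self.next_node = next_node


class PrimaryNode:
    """BST node keyed on the primary component code."""

    def __init__(self, primary_code):
        self.primary_code = primary_code
        self.secondary_head = None  # head of the linked list of secondaries
        self.left = None
        self.right = None


class CodeBook:
    def __init__(self):
        self.root = None

    def insert(self, primary_code, secondary_code, ligature):
        if self.root is None:
            self.root = PrimaryNode(primary_code)
        node = self.root
        # Walk down the tree, creating the primary node if it is missing.
        while node.primary_code != primary_code:
            side = "left" if primary_code < node.primary_code else "right"
            child = getattr(node, side)
            if child is None:
                child = PrimaryNode(primary_code)
                setattr(node, side, child)
            node = child
        # Prepend the new secondary entry to this node's linked list.
        node.secondary_head = SecondaryNode(
            secondary_code, ligature, node.secondary_head
        )

    def lookup(self, primary_code, secondary_code):
        node = self.root
        while node is not None and node.primary_code != primary_code:
            node = node.left if primary_code < node.primary_code else node.right
        entry = node.secondary_head if node is not None else None
        while entry is not None:
            if entry.secondary_code == secondary_code:
                return entry.ligature
            entry = entry.next_node
        return None  # combination not present in the code book


example_ligature = "ﮔﯩﻮ"  # ligature given in the text for (46, hB)
book = CodeBook()
for code in (45, 40, 53, 38, 43, 46, 56):
    # Placeholder entries, only there to build the tree shape.
    book.insert(code, "b", "placeholder-%d" % code)
book.insert(46, "hB", example_ligature)
result = book.lookup(46, "hB")
```

Searching the tree costs time proportional to its height, and the linked list at each node stays short because only a handful of secondary codes combine with any one primary component. A lookup that finds no match returns `None`, which a caller could treat as an unrecognized ligature.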
8. Conclusions and Future Scope

We have used DCT features with a linear kernel SVM for primary component recognition. For secondary component recognition we used DCT features (feature vector length 100) and a polynomial kernel SVM (degree 3). We tested our system on 110 pages of Urdu text and obtained 83% accuracy. Urdu images with no broken or merged primary or secondary components and no Naskh style Urdu text have an accuracy of nearly 90.29%. A new methodology needs to be devised to handle broken or merged ligatures. Also, to recognize the Naskh writing style on the same Urdu text page, Naskh OCR and font identification need to be developed.

9. References

1. Shamsher I, Ahmad Z, Orakzai JK, Adnan A. OCR for printed Urdu script using feed forward neural network. Proceedings of World Academy of Science, Engineering and Technology. 2007; 172–5. ISSN: 1307-6884.
2. Ahmad Z, Orakzai JK, Shamsher I, Adnan A. Urdu Nastaleeq optical character recognition. Proceedings of World Academy of Science, Engineering and Technology. 2007; 26:249–52.
3. Pathan IK, Ahmed Ali A, Ramteke RJ. Recognition of offline handwritten isolated Urdu character. Journal of Advances in Computational Research. 2012; 4(1):117–21.
4. Al-Muhtaseb HA, Mahmoud SA, Qahwaji RS. Recognition of off-line printed Arabic text using Hidden Markov models. Journal on Signal Processing. 2008; 2902–12. DOI: 10.1016/j.sigpro.2008.06.013.
5. Ul-Hasan A, Ahmed SB, Rashid SF, Shafait F, Breuel TM. Offline printed Urdu Nastaleeq script recognition with bidirectional LSTM networks. International Conference on Document Analysis and Recognition; 2013. p. 1061–5. DOI: 10.1109/ICDAR.2013.212.
6. Lehal GS, Rana A. Recognition of Nastalique Urdu Ligatures. Proceedings of 4th International Workshop of Multilingual OCR, ACM; Washington DC, USA. 2013. DOI: 10.1145/2505377.2505379.
7. Lehal GS. Ligature Segmentation for Urdu OCR. Proceedings 12th International Conference on Document Analysis and Recognition (ICDAR). IEEE. 2013. p. 1130–4. DOI: 10.1109/ICDAR.2013.229.
8. Sarmad H, Salman A, Qurat ul Ain Akram A. Nastalique segmentation-based approach for Urdu OCR. International Journal on Document Analysis and Recognition. 2015. DOI: 10.1007/s10032-015-0250-2.
9. Saeeda N, Khizar H, Muhammad Imran R, Mohammad Waqas A, Madani Sajjat A, Same KU. The Optical Character recognition of Urdu-like cursive script. Journal of Pattern Recognition. 2013 Oct; 11:1229–48.
10. Lehal GS. Choice of recognizable units for Urdu OCR. Proceedings of the Workshop on Document Analysis and Recognition (Mumbai, India, DAR 2012), Publisher ACM, USA. 2012; 79–85.
11. Rani R, Dhir R, Lehal GS. Gabor features based script identification of lines within a bilingual/trilingual document. International Journal of Advanced Science and Technology. 2014; 66(2014):1–12. ISSN: 2005-4238. Available from: http://dx.doi.org/10.14257/ijast.2014.66.01
12. Singh J, Lehal GS. Comparative Performance Analysis of Feature(S)-Classifier Combination for Devanagari Optical Character Recognition System. International Journal of Advanced Computer Science and Applications (IJACSA). 2014; 5(6). Available from: http://dx.doi.org/10.14569/IJACSA.2014.050608
13. Aggarwal A, Rani R, Dhir R. Recognition of devanagari handwritten numerals using gradient features and SVM. International Journal of Computer Applications. 2012. DOI: 10.5120/7371-0151.
14. Hsu C-W, Chang C-C, Lin C-J. A Practical Guide to Support Vector Classification. 2010. Available from: http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf
15. Akram Q, Hussain S, Adeeba F, Rehman S, Saeed M. Framework of Urdu Nastalique Optical Character Recognition System. In the Proceedings of Conference on Language and Technology (CLT 14), Karachi, Pakistan. 2014.
16. Sabbour N, Shafait F. A segmentation-free approach to Arabic and Urdu OCR. SPIE Document Recognition and Retrieval XX, DRR'13; San Francisco, CA, USA. 2013 Feb.