“Improved Synonym Approach to Linguistic Steganography” Design and Proof-of-Concept Implementation
Aniket M. Nanhe, Mayuresh P. Kunjir, Sumedh V. Sakdeo
B.Tech Comp. Sci., College of Engineering, Pune, India.

Abstract

This paper develops a linguistically robust approach to linguistic steganography using synonym replacement, which converts a message into semantically innocuous text. Drawing upon linguistic criteria, the approach uses word replacement with substitution classes based on traditional word replacement features (syntactic categories and subcategories) as well as features under-exploited in earlier works: semantic criteria, inflectional class, and frequency statistics. The original message is hidden in a cover text which is shared between sender and receiver. This paper also presents a new way of sharing the cover text and changing it periodically to make the algorithm resistant to steganalysis.

1. Introduction

While current encryption techniques are sufficiently advanced to make code-breaking practically impossible, one major drawback of current encryption methods is the ease of identifying an encrypted text: it does not resemble natural text in any way. Steganography attempts to answer this need by concealing the message's existence, so that encrypted messages can be transmitted without arousing suspicion.

Steganography is the art and science of concealing a secret message inside a cover object. When the secret message is in digital form, there is a wide choice of cover objects: one could hide digital information inside images, audio, binaries, video, text, and so on. Steganography can be classified by the type of cover object used; in our case, the cover object is natural language text.
The study of this subject in the scientific literature dates back to 1983, when Simmons formulated it as “The Prisoners’ Problem”. Alice and Bob are in jail and wish to hatch an escape plan; all of their communication passes through the warden, Wendy; and if Wendy detects any suspicious message, she will frustrate their plan by throwing them into solitary confinement. So they must find some way of hiding their secret message in an innocuous-looking cover text.

Information hiding has taken one form in image-based steganography, which uses minimal changes in pixels or watermarking techniques. Text has also been used in image-like maneuvers, for example by modifying the white space between letters or by minutely changing the fonts, but this has proved less fruitful because text can be retyped and is often altered in the conversion from one program version or platform to another. More productive, and resistant to the difficulties surrounding the retyping of text-based messages, is lexical steganography, which uses linguistic structures to disguise the encryption of text messages so that the message remains semantically and syntactically innocent in appearance.

This paper presents a new approach to linguistic steganography using synonym replacement, a linguistically informed alternative to existing text-based steganography systems. The approach adds features such as inflectional class and frequency statistics, and thereby produces semantically and syntactically correct text that looks more natural to the human eye.

One more area is under-exploited in earlier works: sharing the cover text and changing it periodically. The cover text is critical because it is the text that is transformed and sent over the communication channel. Our aim is to make steganalysis difficult by changing the cover text frequently, so that repeated messages do not arouse suspicion.
This paper discusses a new way of achieving this by exploiting the part of the cover text that remains unchanged in the transformation. Section 2 reviews past research, Section 3 presents the basic algorithm, and Section 4 describes cover text selection.

2. Past Research

Lexical steganography has had three main veins of research: watermarking techniques that manipulate sentences through syntactic transformations (a.k.a. ontological techniques), word replacement systems both with and without cover texts, and context-free grammars such as NICETEXT. We review the work carried out in each of these techniques.

2.1. Syntactical Steganography

Approaches to syntactical steganography exploit the syntactic structures of a text. They use context-free grammars (CFGs) to build syntactically correct sentences. The well-known CFG-based mimicry algorithm developed by Peter Wayner [1] falls into this category. Another well-known algorithm, NICETEXT by Chapman et al. [8], is also based on CFGs.

NICETEXT uses the cover text simply as a source of syntactic patterns: by running the cover text through a part-of-speech tagger, NICETEXT obtains a set of "sentence frames," e.g. [(noun) (verb) (prep) (det) (noun)] for ‘I sat in the tree.’ It also compiles a lexicon of the words found in the cover text, grouped by part-of-speech tag, with each word in the lexicon associated (arbitrarily) with one of the binary digits 0 or 1. In encryption, the plain text message is converted into a sequence of binary digits. A random sentence frame is chosen, and the part-of-speech tags in it are replaced by words from the lexicon according to the sequence of binary digits.

Although NICETEXT produces sentences that follow syntactically valid frames, it fails on the count of semantics. The output text is almost always a set of ungrammatical and semantically anomalous sentences. Another factor worth considering is the density of encryption within the cover text.
Ideally, the cover text should work to hide the word frequencies and syntactic structure of the hidden plain text message. Steganographic goals encourage sparse encryption, in which word replacement leaves the majority of the text unaltered. NICETEXT encryption is maximally dense: every word in the final encrypted cover text conveys hidden information. Given that each encrypted word is part of the original information-bearing message and that common word usage patterns are unavoidable, this is problematic for the original steganographic intent of avoiding detection and producing naturalistic text.

In general, syntactical steganography techniques produce text that is syntactically well-formed without being semantically well-formed. Chomsky’s famous sentence illustrates this: “Colorless green ideas sleep furiously”.

2.2. Lexical Steganography

In lexical steganography, lexical units of natural language text, such as words, are used to hide secret bits. The most straightforward subliminal channel in natural language is probably the choice of words. A word can be replaced by one of its synonyms, and which synonym is chosen from the list depends on the secret bits. For example, consider the sentence:

“Pune is a nice little city”

Now suppose the list of synonyms for “nice” is {nice, wonderful, great, decent}. Each of the synonyms can be represented by two bits, as shown in the table:

Word        Code
Nice        00
Wonderful   01
Great       10
Decent      11

Table 1: Lexical code table

Depending upon the input secret bits, the appropriate synonym for ‘nice’ is selected and put in the stego text. So the possible stego texts are:

a) Pune is a nice little city.
b) Pune is a wonderful little city.
c) Pune is a decent little city.
d) Pune is a great little city.

Lexical techniques produce better quality text than syntactical techniques, and statistical attacks have difficulty detecting the presence of a hidden message. The choice of replacements is the critical part of these techniques.
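The code table in Table 1 can be turned into a small embedding and extraction routine. The following Python sketch is illustrative only and is not taken from the paper: the function names `embed` and `extract` are hypothetical, the table is the one from Table 1, and the sketch assumes fixed-width codes and that the receiver knows how many bits to expect.

```python
# Table 1 from the text: each synonym of "nice" carries two secret bits.
CODE_TABLE = {"nice": "00", "wonderful": "01", "great": "10", "decent": "11"}


def embed(cover_words, bits, table=CODE_TABLE):
    """Replace each word found in the table by the synonym whose code
    matches the next chunk of secret bits. Returns (stego_words, bits_used)."""
    by_code = {code: word for word, code in table.items()}
    width = len(next(iter(table.values())))  # all codes have the same width
    stego, pos = [], 0
    for word in cover_words:
        if word in table and pos + width <= len(bits):
            stego.append(by_code[bits[pos:pos + width]])
            pos += width
        else:
            # Not replaceable, or no secret bits left: keep the word as is.
            # (A real system would need an end-of-message convention;
            # that is an assumption outside the text.)
            stego.append(word)
    return stego, pos


def extract(stego_words, table=CODE_TABLE):
    """Read the hidden bits back by looking up every table word."""
    return "".join(table[w] for w in stego_words if w in table)


stego, used = embed("Pune is a nice little city".split(), "10")
print(" ".join(stego), used, extract(stego))
```

With the secret bits `10`, the sketch selects “great”, which corresponds to stego text d) in the list above.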
To give an example, in the synonym replacement approach described above, some words have more than one sense. (The noun “bank” has two senses: “a long pile or heap” or “an institution for receiving, lending, and safeguarding money and transacting other financial business”.) If we do not use a synonym with the same sense as the original word, the output will look suspicious.

A further impediment to synonym-based word replacement is inflection classes (i.e., legal and illegal word combinations):

1. Bring those instruments.
2. Bring those tool.

(2) replaces the plural noun ‘instruments’ of (1) by its singular synonym ‘tool’, thus making the sentence grammatically incorrect.

2.3. Ontological Technique

Of the techniques considered here, the ontological one is the most sophisticated with respect to modeling semantics. Instead of implicitly leaving semantics intact by replacing only synonymous words while embedding information into an innocuous text, it uses an explicit model of “meaning” to evaluate equivalence between texts. Atallah et al. [4] watermark texts by manipulating and exploiting the syntax (formal word order and grammatical voice) of sentences. Through common generative transformations (clefting (4), adjunct fronting (5), passivization (6), adverbial insertion (7)), the syntax of each sentence is altered:

3. The lion ate the food yesterday. (original sentence)
4. It was the lion that ate the food yesterday.
5. Yesterday, the lion ate the food.
6. The food was eaten by the lion yesterday.
7. Surprisingly, the lion ate the food yesterday.

The ontological techniques have some problems, though. The transformations sometimes affect the semantics of a text. Newer theories of language argue for the interconnectedness of the semantic and syntactic levels, demonstrating that the syntactic pattern is itself inherently meaningful.
Furthermore, various syntactic structures (word orders) are statistically not equal in distribution: different genres of text have wildly different syntactic structures, and replacing such structures freely could create a text that is trivially broken by statistical methods, which is a security threat to the program.

3. Basic Algorithm

The algorithm replaces the nouns, adjectives, verbs and adverbs of the cover text by their respective synonyms. A semantically and orthographically correct text is used as the cover text in which messages are hidden. We use a word dictionary to obtain synonyms. The input text to be hidden is compressed using the Huffman compression algorithm, which generates a string of bits. These input bits are consumed in the selection of synonyms. The algorithm works in the following stages.

3.1. Part-of-speech Tagging

The basic requirement of this algorithm is that a cover text is shared between sender and receiver. Natural language processing is applied to the cover text to determine the part of speech of each word. This is an essential part of the algorithm, as we replace only the common nouns, adjectives, adverbs and verbs in the cover text. A part-of-speech tagger is applied to the cover text and outputs each word followed by its part of speech.

3.2. Input Compression

The input secret text is treated as a binary bit string. These bits are used in the synonym replacement stage to choose the synonym that replaces a word. Using the standard ASCII representation, we need 7 bits for each character of input, so we need to hide 7 × (number of characters) bits. We can improve on this number by exploiting characteristics of the English language. Some characters appear more frequently in normal English text than others. If we use fewer bits for such characters, we can reduce the number of input bits to be hidden. To achieve this, we use the Huffman compression algorithm.
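A minimal Python sketch of this compression step is given here for illustration; it is not the authors' implementation. The names `huffman_codes`, `encode` and `decode` are hypothetical. The sketch builds the code table from the character counts of the message itself, which is an assumption made for brevity: in practice sender and receiver would have to share a fixed frequency table (for example, general English letter frequencies) so that the receiver can rebuild the same codes.

```python
import heapq
from collections import Counter
from itertools import count


def huffman_codes(freqs):
    """Map each symbol to a prefix-free bit string.
    Symbols with higher frequency receive shorter codes."""
    tiebreak = count()  # unique counter so heap entries never compare dicts
    heap = [(f, next(tiebreak), {sym: ""}) for sym, f in freqs.items()]
    heapq.heapify(heap)
    if len(heap) == 1:  # degenerate case: a single symbol still needs one bit
        return {sym: "0" for sym in heap[0][2]}
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)   # two least frequent subtrees
        f2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (f1 + f2, next(tiebreak), merged))
    return heap[0][2]


def encode(text, codes):
    """Concatenate the code of every character into one bit string."""
    return "".join(codes[ch] for ch in text)


def decode(bits, codes):
    """Walk the bit string and emit a symbol whenever a full code is read."""
    inverse = {code: sym for sym, code in codes.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:
            out.append(inverse[current])
            current = ""
    return "".join(out)


secret = "escape from jail today evening"
codes = huffman_codes(Counter(secret))
bits = encode(secret, codes)
print(len(bits), "bits instead of", 7 * len(secret))
```

Because the codes are prefix-free, `decode(bits, codes)` recovers the secret text exactly. The same `huffman_codes` function could also be applied to a table of word frequencies instead of character counts.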
On average, Huffman coding reduces the size of the input bit string to 33% of the original.

3.3. Synonym Replacement

This stage is the core of the algorithm. The actual task of hiding bits in the cover text is carried out here. At the sender’s side, the inputs to this stage are the compressed bit string and the tagged cover text; the receiver needs the encoded text (a.k.a. stego text) and the tagged cover text. This stage uses dictionaries to find replacements for a word.

3.3.1 Dictionaries

We use three dictionaries here:

a. WordNet 2.1 English dictionary

WordNet is an open source English dictionary with broad coverage of English words. We can get all synonyms of a word from this dictionary. A word may have more than one sense in which it can be used. WordNet provides its output in terms of ‘synsets’. A synset defines one sense of a word and contains all the synonyms of the given word that can be used in that particular sense.

e.g.: The synonym sets of the word “traveling” are:

1. travel, go, move, locomote
2. travel, journey
3. travel, trip, jaunt
4. travel, journey

WordNet also provides the frequency of occurrence of a word in normal English text. This information is very useful in our algorithm, because we use it to encode the synonyms. Huffman coding is used here again, so that more frequently occurring synonyms get shorter codes and vice versa. This is important for word sense disambiguation: WordNet assigns a frequency to each synonym of a word, depending on how often that synonym is used for that particular word in normal English text. Assigning the shortest code to the most frequently used synonym helps to maintain the proper word sense.

b. Verb Inflection dictionary

A verb can have many inflected forms, such as the present participle, past tense, past participle and base form. When we look up synonyms for a verb, WordNet always gives them in their base form, irrespective of the inflected form of the input verb. If we replaced a verb by a synonym in its base form, the output sentence would become grammatically incorrect. To avoid this, we use a separate verb dictionary. We maintain a list of verbs (16,064 verbs) along with all their inflected forms in a separate file. Before replacing a verb by its synonym, we check whether the inflected forms of the original verb and its synonym match. If they do not match, we select the appropriate inflected form of the synonym from the verb dictionary and replace the original by it.

e.g.: Tagged Cover Text: A/DT group/NN of/IN frogs/NNS were/VBD traveling/VBG through/IN the/DT woods/NNS ,/,

Suppose the synonym “go” is selected from WordNet as the replacement for “traveling”. The stego text should then contain the present participle form of “go” (“going”) to maintain the tense of the sentence. The verb dictionary provides this inflected form of the base form “go” for the actual replacement.

c. Noun Inflection dictionary

A noun can be in either singular or plural form. As with verbs, WordNet always gives the synonyms of a noun in their singular form. If we replaced a plural noun by a singular synonym, we would get a grammatically incorrect sentence. So we use a separate noun dictionary to avoid this situation. We maintain a list of nouns (89,051 nouns) in their singular as well as plural forms. Before replacing a noun by its synonym, we check whether both are in the same form. If not, we select the appropriate form of the synonym from the noun dictionary and replace the original by it.

e.g.: Tagged Cover Text: A/DT group/NN of/IN frogs/NNS were/VBD traveling/VBG through/IN the/DT woods/NNS ,/,

The noun “frogs” is the plural form of the word “frog”. Suppose the synonym “Gaul” is selected by the input bit string; we need to ensure that it is used in the plural, as ‘frogs’ is plural. So we use this dictionary to obtain the singular and plural forms of the nouns present in the dictionary.

3.4. Synonym Replacement

Figure 1 shows the mechanism that is carried out at the sender end. The tagged text obtained from stage 1 is scanned. Whenever a noun, adjective, verb or adverb is found, its synonyms are obtained from WordNet. All synonyms are put in a frequency table; the frequencies are obtained from WordNet. Huffman coding is applied to this frequency table to obtain codes for all synonyms. By using frequencies we also achieve word sense disambiguation, as more frequently used senses get shorter codes and therefore have a higher probability of being used. After building the encode table, we use the input bit string to select one of the synonyms from the table. If we are replacing a verb, the inflected forms are checked and the appropriate form of the verb is obtained from the verb dictionary. Similarly, if we are replacing a noun, the singular or plural form is selected from the noun dictionary in accordance with the original noun’s form. Otherwise, the selected synonym is put in place of the original word. The Appendix shows a sample stego text generated from a cover text by hiding a secret text.

Figure 1: Sender end

The reverse algorithm is carried out at the receiver end. Figure 2 shows the mechanism. The tagged text is scanned for nouns, adjectives, verbs and adverbs. When one is found, its synonyms are obtained from WordNet. As at the sender end, the frequency table and then the encode table are formed using the frequencies of the words. At the same time, the stego text is also scanned. From the stego text we obtain the synonym that was selected at the sender’s side. Then, from the table, its code bits are obtained and appended to the output string. The output string, when decompressed, produces the original secret text.

Figure 2: Receiver end

4. Cover Text Selection

Steganalysis aims to identify the existence of a secret message, since steganography conceals the existence of a message rather than scrambling it. Our approach uses word replacement in a cover text. As only a few words are replaced by their synonyms, the majority of the text remains unchanged. If the same cover text is used again and again, a “Known Stego-Text” attack becomes possible, in which an intruder keeps track of the text being sent over the communication medium. To protect the text from steganalysis, the cover text needs to be changed periodically.

Our approach uses a book, that is, a collection of different chapters. The book should be privately shared by sender and receiver. One of the chapters from the book is selected as the cover text. The choice of chapter is made randomly by the sender. For the reverse transformation at the receiver end, the same chapter must be selected as the cover text. To achieve this, we exploit the unchanged part of the cover text. The receiver decides from the stego text which chapter is to be used as the cover text: it calculates the difference between each chapter in the book and the stego text, and selects the chapter with the minimum difference.

Initially, each chapter in the book is scanned once to obtain a code for each sentence. Words which are common nouns, adjectives, adverbs or verbs are ignored in the calculation of the code. The values of the other words are calculated from the ASCII values of their characters and the position of the word in that particular sentence. All these values are summed to determine the code for the sentence. Codes for all sentences are calculated in the same way.

e.g.: “This is important for me”.

Code = ‘t’ * 1 + ‘h’ * 1 + ‘i’ * 1 + ‘s’ * 1 + ‘i’ * 2 + ‘s’ * 2 + ‘f’ * 4 + ‘o’ * 4 + ‘r’ * 4 + ‘m’ * 5 + ‘e’ * 5

“important” is not used in the calculation of the code for this sentence, as it is an adjective.

The same code generation algorithm is applied to the stego text. The codes of the stego text are compared with the codes of all chapters. Ideally, the codes for all sentences of the stego text should match the codes of the chapter that was used as cover text by the sender. But our algorithm allows compound word replacements (“travel” can be replaced by “move around”).
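The sentence code and the minimum-difference chapter choice can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' code: the names `sentence_code`, `chapter_difference` and `select_chapter` are hypothetical; each word arrives with a flag saying whether it is a candidate for synonym replacement (the worked example counts “is”, so the flag is taken as input rather than derived from part-of-speech tags); characters are lowercased as in the worked example; and the difference between two texts is assumed to be the sum of absolute differences of sentence codes compared in order, which the text does not specify.

```python
from itertools import zip_longest


def sentence_code(sentence):
    """sentence: list of (word, replaceable) pairs in sentence order.
    Replaceable words are skipped; every other word contributes the sum
    of its character codes multiplied by its 1-based position."""
    total = 0
    for position, (word, replaceable) in enumerate(sentence, start=1):
        if replaceable:
            continue  # may have been swapped for a synonym, so ignore it
        total += position * sum(ord(ch) for ch in word.lower())
    return total


def chapter_difference(chapter, stego):
    """Assumed metric: compare sentence codes pairwise, in order."""
    codes_a = [sentence_code(s) for s in chapter]
    codes_b = [sentence_code(s) for s in stego]
    return sum(abs(a - b) for a, b in zip_longest(codes_a, codes_b, fillvalue=0))


def select_chapter(book, stego):
    """book: dict mapping chapter name to its list of sentences.
    Returns the name of the chapter closest to the stego text."""
    return min(book, key=lambda name: chapter_difference(book[name], stego))


example = [("This", False), ("is", False), ("important", True),
           ("for", False), ("me", False)]
print(sentence_code(example))  # 3238 for the worked example in the text
```

In this sketch a compound replacement such as “move around” shifts the positions of the following words, so the matching chapter yields a small non-zero difference rather than an exact match.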
Such compound replacements cause a difference between the codes of the stego text and those of the chapter used as cover text. This difference is, however, very small compared to the difference for other chapters. So the chapter with the minimum difference is selected as the cover text.

5. Conclusion

The field of linguistic steganography is very interesting because it conceals the very existence of a secret message from an intruder, which is not achievable by cryptography. The synonym replacement approach to linguistic steganography produces innocuous-looking English text, thereby making detection of the secret message very hard. The well-known Stego-Turing Test states that it is very hard for a computer to alter a natural language text in a way that is undetectable to a human. Many approaches have attempted this in the past, but none has been able to solve the problem. Though our solution does not solve the problem either, it produces better quality output than previous approaches. We also give a new way to choose a cover text dynamically from a body of text shared by sender and receiver. This allows the user to hide each message in a different cover text, thus making steganalysis difficult. The synonym replacement approach to linguistic steganography using inflection classes, frequency statistics and dynamic cover text selection is a new improvement in the field of linguistic steganography and provides a very good, efficient tool for information hiding.

6. References

[1] P. Wayner, “Mimic functions,” Cryptologia XVI, pp. 193–214, July 1992.
[2] P. Wayner, “Strong theoretical steganography,” Cryptologia XIX, pp. 285–299, July 1995.
[3] P. Wayner, “Disappearing Cryptography: Information Hiding: Steganography & Watermarking,” 2nd ed., Morgan Kaufmann Publishers, Los Altos, CA 94022, USA, 2002, pp. 67–126, pp. 303–314.
[4] M. J. Atallah, V. Raskin, M. Crogan, C. Hempelmann, F. Kerschbaum, D. Mohamed, and S.
Naik, “Natural language watermarking: Design, analysis, and a proof-of-concept implementation,” in Information Hiding: Fourth International Workshop, I. S. Moskowitz, ed., Lecture Notes in Computer Science 2137, pp. 185–199, Springer, April 2001.
[5] M. J. Atallah, V. Raskin, C. F. Hempelmann, M. Karahan, R. Sion, U. Topkara, and K. E. Triezenberg, “Natural language watermarking and tamperproofing,” in Information Hiding: Fifth International Workshop, F. A. P. Petitcolas, ed., Lecture Notes in Computer Science 2578, pp. 196–212, Springer, October 2002.
[6] K. Bennett, “Linguistic steganography: Survey, analysis, and robustness concerns for hiding information in text,” Tech. Rep. TR 2004-13, Purdue CERIAS, May 2004.
[7] M. T. Chapman, “Hiding the hidden: A software system for concealing ciphertext as innocuous text,” Master’s thesis, University of Wisconsin-Milwaukee, May 1997.
[8] M. T. Chapman and G. I. Davida, “Hiding the hidden: A software system for concealing ciphertext as innocuous text,” in Information and Communications Security: First International Conference, Y. Han, T. Okamoto, and S. Qing, eds., Lecture Notes in Computer Science 1334, Springer, August 1997.
[9] M. T. Chapman, G. I. Davida, and M. Rennhard, “A practical and effective approach to large-scale automated linguistic steganography,” in Information Security: Fourth International Conference, G. I. Davida and Y. Frankel, eds., Lecture Notes in Computer Science 2200, p. 156ff, Springer, October 2001.
[10] R. Bergmair, “Towards linguistic steganography: A systematic investigation of approaches, systems, and issues.” Final year thesis, April 2004. Handed in in partial fulfillment of the degree requirements for the degree “B.Sc. (Hons.) in Computer Studies” to the University of Derby.
[11] I. A. Bolshakov, “A method of linguistic steganography based on collocationally-verified synonymy,” in Information Hiding: 6th International Workshop, J. J. Fridrich, ed., Lecture Notes in Computer Science 3200, pp.
180–191, Springer, May 2004.
[12] K. Winstein, “Lexical steganography through adaptive modulation of the word choice hash,” January 1999. Was disseminated during secondary education at the Illinois Mathematics and Science Academy. The paper won the third prize in the 2000 Intel Science Talent Search.
[13] A. J. Tenenbaum, “Linguistic steganography: Passing covert data using text-based mimicry.” Final year thesis, April 2002. Submitted in partial fulfillment of the requirements for the degree of “Bachelor of Applied Science” to the University of Toronto.
[14] Vineeta Chand and C. Orhan Orgun, “Exploiting linguistic features in Lexical Steganography,” Proceedings of the 39th Hawaii International Conference on System Sciences, 2006.

APPENDIX

Secret Text: Escape from jail today evening.

Sample Cover Text: EVER since I have been scrutinizing political events, I have taken a tremendous interest in propagandist activity. I saw that the Socialist-Marxist organizations mastered and applied this instrument with astounding skill. And I soon realized that the correct use of propaganda is a true art which has remained practically unknown to the bourgeois parties. Only the Christian-Social movement, especially in Lueger's time, achieved a certain virtuosity on this instrument, to which it owed many of its successes. But it was not until the War that it became evident what immense results could be obtained by a correct application of propaganda. Here again, unfortunately, all our studying had to be done on the enemy side, for the activity on our side was modest, to say the least. The total miscarriage of the German 'enlightenment ' service stared every soldier in the face, and this spurred me to take up the question of propaganda even more deeply than before. There was often more than enough time for thinking, and the enemy offered practical instruction which, to our sorrow, was only too good.
Sample StegoText: ever since I have been scrutinizing political cases, I have taken a tremendous interest in propagandist action. I experienced that the SocialistMarxist organizations subdued and practiced this tool with astounding accomplishment. And I shortly recognized that the right use of propaganda is a truthful art which has stayed much unknown to the businessperson parties. merely the Christian-Social move, particularly in Lueger 's time, accomplished a sure virtuosity on this instrument, to which it owed many of its successes. But it was not until the War that it turned evident what immense effects could be found by a right application of propaganda. hither once more, unfortunately, all our considering had to be made on the enemy side, for the action on our side was modest, to say the least. The total stillbirth of the German ` Nirvana ' service starred every soldier in the face, and this spurred me to bring up the inquiry of propaganda even more deeply than ahead. There was frequently more than plenty sentence for mentation, and the opposition offered practical education which, to our regret, was only too good.