“Improved Synonym Approach to Linguistic Steganography” Design and Proof-of-Concept Implementation

Aniket M. Nanhe, Mayuresh P. Kunjir, Sumedh V. Sakdeo
B.Tech Comp. Sci., College of Engineering, Pune, India.
Abstract
This paper develops a linguistically robust approach
to linguistic steganography based on synonym
replacement, which converts a message into
semantically innocuous text. Drawing upon linguistic
criteria, this approach uses word replacement, with
substitution classes based on traditional word
replacement features (syntactic categories and
subcategories), as well as features under-exploited
in earlier works: semantic criteria, inflectional class,
and frequency statistics. The original message is
hidden through use of a cover text which is shared
between sender and receiver. This paper also
presents a new approach of sharing the cover text
and changing it periodically to make the algorithm
safe from steganalysis.
1. Introduction
While current encryption techniques are
sufficiently advanced to make code-breaking
practically impossible, one major drawback of
current encryption methods is the ease of identifying
an encrypted text: it does not resemble natural text
in any way. Steganography attempts to answer this
need by concealing the message's existence, so
that encrypted messages can be transmitted without
arousing suspicion. Steganography is the art and
science of concealing a secret message inside a
cover object. When the secret message is in digital
form, there is an enormous choice of cover objects:
one could hide digital information inside images,
audio, video, binaries, text and so on.
Steganography can be classified by the type of
cover object used. In our case, the cover object is
natural language text.
The study of this subject in the scientific literature
dates back to 1983, when Simmons formulated it as
“The Prisoners’ Problem”. Alice and Bob are
in jail and wish to hatch an escape plan; all their
communication passes through the warden, Wendy;
and if Wendy detects any suspicious message, the
warden will frustrate their plan by throwing them into
solitary confinement. So they must find some way of
hiding their secret message in an innocuous-looking
cover text.
Information hiding has taken one form in
image-based steganography, which makes minimal
changes to pixels or uses watermarking techniques.
Text has also been used as a carrier in image-like
ways, by modifying the white space between
letters or by minutely changing the fonts, but this has
proved less fruitful because text can be retyped and
is often altered in the conversion from one program
version or platform to another. Lexical
steganography has proved more productive and is
resistant to the difficulties surrounding re-typing: it
uses linguistic structures to disguise the encoding of
text messages such that the message remains
semantically and syntactically innocent in
appearance.
This paper presents a new approach to
linguistic steganography using synonym
replacement, a linguistically informed alternative to
existing text-based steganography systems. The
approach adds features such as inflectional class
and frequency statistics, thereby producing
semantically and syntactically correct text which
appears more natural to the human eye.
One more area is under-exploited in earlier
works: sharing the cover text and changing it
periodically. The cover text is critical, as it is the
text which is transformed and sent over the
communication channel. Our aim is to make
steganalysis difficult by changing the cover text
frequently, so that no suspicion is aroused. This
paper discusses a new way of achieving this by
exploiting the part of the cover text which remains
unchanged in the transformation.
Section 2 reviews past research, Section 3
presents the basic algorithm and Section 4 describes
cover text selection.
2. Past Research
Lexical steganography has had three main veins
of research: watermarking techniques that
manipulate sentences through syntactic
transformations (a.k.a. ontological techniques), word
replacement systems both with and without cover
texts, and context-free grammars such as
NICETEXT. We review the work carried out on each
of these techniques below.
2.1. Syntactical Steganography
Approaches to syntactical steganography
exploit the syntactic structures of a text. They
make use of Context Free Grammars
(CFG) to build syntactically correct sentences. The
well-known CFG-based mimicry algorithm developed
by Peter Wayner [1] falls into this category, as does
NICETEXT by Chapman et al. [8].
NICETEXT uses the cover text simply as a
source of syntactic patterns: by running the cover
text through a part-of-speech tagger, NICETEXT
obtains a set of "sentence frames," e.g. [(noun)
(verb) (prep) (det) (noun)] for ‘I sat in the tree.’ It also
compiles a lexicon of the words found in the cover
text, grouped by part-of-speech tag, with each word
in the lexicon associated (arbitrarily) with one of the
binary digits 0 or 1. In encoding, the plain text
message is converted into a sequence of binary
digits. A random sentence frame is chosen and the
part-of-speech tags in it are replaced by words from
the lexicon according to the sequence of binary digits.
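The frame-filling step can be illustrated with a minimal sketch. This is not NICETEXT's actual code: the toy lexicon, the single fixed frame and the zero padding are all assumptions made for illustration.

```python
# Hypothetical toy lexicon: for each part-of-speech tag, one word per bit value.
LEXICON = {
    "noun": {"0": "dog", "1": "cat"},
    "verb": {"0": "sat", "1": "slept"},
    "prep": {"0": "in", "1": "under"},
    "det": {"0": "the", "1": "a"},
}
# Sentence frame quoted in the text: (noun) (verb) (prep) (det) (noun).
FRAME = ["noun", "verb", "prep", "det", "noun"]


def encode(bits, frame=FRAME):
    """Fill one frame with lexicon words selected by successive bits."""
    # Assumption: pad with zeros so that the frame is always completely filled.
    padded = bits.ljust(len(frame), "0")
    return [LEXICON[tag][bit] for tag, bit in zip(frame, padded)]


def decode(words, frame=FRAME):
    """Recover one bit per word by reverse lookup in the lexicon."""
    bits = ""
    for tag, word in zip(frame, words):
        reverse = {w: b for b, w in LEXICON[tag].items()}
        bits += reverse[word]
    return bits


sentence = encode("10110")
print(" ".join(sentence))  # cat sat under a dog
print(decode(sentence))    # 10110
```

Every word in the output carries one bit, which is the maximally dense encoding discussed in this section.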
Although NICETEXT produces syntactically
correct sentences, it fails on the count of semantics.
The output text is almost always a set of
ungrammatical and semantically anomalous
sentences.
Another factor worth considering is the density
of encryption within the cover text. Ideally, the cover
text should work to hide the word frequencies and
syntactic structure of the hidden plain text message.
Steganographic goals encourage sparse encryption,
which does not alter a majority of the text by the
word replacement. NICETEXT encryption is
maximally dense—every word within the final
encrypted cover text is conveying hidden
information. Given that each encrypted word is part
of the original information bearing message and
common word usage patterns are unavoidable, this
is problematic for the original steganographic intent:
avoiding detection and producing naturalistic text.
In general, syntactical steganography
techniques produce text that is syntactically
well-formed without being semantically
well-formed, as can be seen from Chomsky’s
famous sentence “Colorless green ideas sleep
furiously”.
2.2. Lexical Steganography
In lexical steganography, lexical units of natural
language text, such as words, are used to hide secret
bits. The most straightforward subliminal channel in
natural language is probably the choice of words. A
word can be replaced by one of its synonyms, and
the synonym chosen from the list depends upon the
secret bits. For example, consider the sentence
“Pune is a nice little city”
Now, suppose the list of synonyms for ‘nice’ is {nice,
wonderful, great, decent}. Each of the
synonyms can be represented by two bits, as shown
in the table:
Word        Code
nice        00
wonderful   01
great       10
decent      11
Table 1: Lexical code table
Depending upon the input secret bits, the
appropriate synonym for ‘nice’ is selected and
put in the stego text. So the possible stego texts
are:
a) Pune is a nice little city.
b) Pune is a wonderful little city.
c) Pune is a decent little city.
d) Pune is a great little city.
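The example above can be sketched in a few lines of Python. The code table is the one from Table 1; the function names and the sentence template are illustrative assumptions.

```python
# Code table from Table 1.
CODE_TABLE = {"nice": "00", "wonderful": "01", "great": "10", "decent": "11"}
WORD_FOR_CODE = {code: word for word, code in CODE_TABLE.items()}


def embed(two_bits, template="Pune is a {} little city."):
    """Hide two secret bits by choosing the matching synonym."""
    return template.format(WORD_FOR_CODE[two_bits])


def extract(sentence):
    """Recover the two bits from the synonym found in the sentence."""
    for token in sentence.rstrip(".").split():
        if token in CODE_TABLE:
            return CODE_TABLE[token]
    raise ValueError("no code word found in sentence")


stego = embed("10")
print(stego)           # Pune is a great little city.
print(extract(stego))  # 10
```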
Lexical techniques produce better quality
text than syntactical techniques, and it is hard for
statistical attacks to detect the presence of a hidden
message. The replacements are the critical part of
these techniques. For example, in the synonym
replacement approach described above, some
words have more than one sense. (The noun “bank”
has two senses: “a long pile or heap” or “an
institution for receiving, lending, and safeguarding
money and transacting other financial business”.) If
we do not use a synonym with the same sense as
the original word, the output will look suspicious.
1. Bring those instruments.
2. Bring those tool.
A further impediment to synonym-based word
replacement is inflection classes (i.e., legal and
illegal word combinations). (2) replaces the plural
noun ‘instruments’ of (1) by its singular synonym
‘tool’, thus making the sentence grammatically
incorrect.
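The mismatch in (2) disappears if the synonym is put into the same number as the word it replaces. The sketch below assumes a small hand-written singular-to-plural table; a real system would need a full noun inflection list.

```python
# Hypothetical singular -> plural table, for illustration only.
NOUN_FORMS = {"instrument": "instruments", "tool": "tools"}


def is_plural(word):
    """A word counts as plural if it appears as a plural form in the table."""
    return word in NOUN_FORMS.values()


def match_number(original, synonym_singular):
    """Return the synonym in the same number (singular/plural) as the original."""
    if is_plural(original):
        return NOUN_FORMS[synonym_singular]
    return synonym_singular


print("Bring those " + match_number("instruments", "tool") + ".")  # Bring those tools.
```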
2.3. Ontological Technique
Of the techniques considered herein, the
ontological one is the most sophisticated approach
with respect to modeling semantics. Instead of
implicitly leaving semantics intact by replacing only
synonymous words while embedding information
into an innocuous text, an explicit model for
“meaning” is used to evaluate equivalence between
texts.
Atallah et al.[4] watermark texts by manipulating
and exploiting the syntax (formal word order and
grammatical voice) of sentences. Through common
generative transformations (clefting (4), adjunct
fronting (5), passivization (6), adverbial insertion (7)),
the syntax of each sentence is altered:
3. The lion ate the food yesterday. (original sentence)
4. It was the lion that ate the food yesterday.
5. Yesterday, the lion ate the food.
6. The food was eaten by the lion yesterday.
7. Surprisingly, the lion ate the food yesterday.
Ontological techniques have some problems,
however. The transformations sometimes affect the
semantics of a text. Newer theories of language
argue for the interconnectedness of the semantic
and syntactic levels, demonstrating that the syntactic
pattern is itself inherently meaningful. Furthermore,
statistically, various syntactic structures (word
orders) are not equal in distribution: different genres
of text have wildly different syntactic structures, and
replacing such structures freely could create a text
which is trivially broken by statistical methods—a
security threat to the program.
3. Basic Algorithm
The algorithm replaces all the nouns, adjectives,
verbs and adverbs of the cover text by their
respective synonyms. A semantically and
orthographically correct text is used as cover text to
hide messages. We use a word dictionary to obtain
synonyms. The input text to be hidden is compressed
using the Huffman compression algorithm to
generate a string of bits. The input bits are consumed
in the selection of synonyms.
The algorithm works in the stages described
below.
3.1. Part-of-speech Tagging
The basic requirement of this algorithm is that a
cover text be shared between sender and
receiver. Natural language processing is applied to
the cover text to determine the part of
speech of each word. This is an essential part of the
algorithm, as we replace only common
nouns, adjectives, adverbs and verbs in the cover
text. A part-of-speech tagger is applied to the cover
text and outputs each word followed by its
part-of-speech tag.
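A minimal sketch of reading such tagger output follows. It assumes word/TAG tokens with Penn Treebank style tags, as in the tagged examples of this paper; the helper names are hypothetical.

```python
# Tag prefixes for nouns, adjectives, adverbs and verbs (Penn Treebank style).
REPLACEABLE_PREFIXES = ("NN", "JJ", "RB", "VB")
# Proper nouns are excluded because only common nouns are replaced.
PROPER_NOUN_TAGS = ("NNP", "NNPS")


def parse_tagged(text):
    """Split 'word/TAG' tokens into (word, tag) pairs."""
    pairs = []
    for token in text.split():
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag))
    return pairs


def is_replaceable(tag):
    """True for common nouns, adjectives, adverbs and verbs."""
    return tag.startswith(REPLACEABLE_PREFIXES) and tag not in PROPER_NOUN_TAGS


tagged = ("A/DT group/NN of/IN frogs/NNS were/VBD "
          "traveling/VBG through/IN the/DT woods/NNS ,/,")
candidates = [word for word, tag in parse_tagged(tagged) if is_replaceable(tag)]
print(candidates)  # ['group', 'frogs', 'were', 'traveling', 'woods']
```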
3.2. Input Compression
The input secret text is treated as a binary bit
string. These bits are used in the synonym
replacement stage to choose the synonym
that replaces a word. Using the standard
ASCII representation, we need 7 bits for each
character of input, so we need to hide (7 * number
of characters) bits. We can improve on this number
by exploiting characteristics of the English language.
Some characters appear more frequently in normal
English text than others. If we use fewer
bits for such characters, we can
reduce the number of input bits to be hidden. To
achieve this, we use the Huffman compression
algorithm.
On average, Huffman coding reduces the
size of the input bit string to 33% of the original.
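The compression stage can be sketched as follows. One simplifying assumption is made here: the code table is built from the character frequencies of the message itself, whereas sender and receiver would in practice have to share a fixed table (for example one derived from English letter frequencies).

```python
import heapq
from collections import Counter


def huffman_codes(freqs):
    """Return {symbol: bit string} for a {symbol: frequency} mapping."""
    if len(freqs) == 1:
        return {next(iter(freqs)): "0"}
    # Heap entries are (weight, tie_breaker, partial code table).
    heap = [(f, i, {s: ""}) for i, (s, f) in enumerate(sorted(freqs.items()))]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (f1 + f2, counter, merged))
        counter += 1
    return heap[0][2]


def compress(text):
    """Encode text as a bit string; also return the code table used."""
    codes = huffman_codes(Counter(text))
    return "".join(codes[ch] for ch in text), codes


def decompress(bits, codes):
    """Decode a bit string using the prefix-free code table."""
    inverse = {code: sym for sym, code in codes.items()}
    out, buffer = [], ""
    for bit in bits:
        buffer += bit
        if buffer in inverse:
            out.append(inverse[buffer])
            buffer = ""
    return "".join(out)


secret = "escape from jail today evening"
bits, codes = compress(secret)
print(len(bits), "bits instead of", 7 * len(secret))
print(decompress(bits, codes) == secret)
```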
3.3. Synonym Replacement
This stage is the core of the algorithm. The
actual task of hiding bits in the cover text is carried
out here.
At the sender's side, the inputs to this stage are
the compressed bit string and the tagged cover text;
the receiver needs the encoded text
(a.k.a. stego text) and the tagged cover text. This
stage makes use of dictionaries to find replacements
for words.
3.3.1 Dictionaries
We use three dictionaries here:
a. WordNet 2.1 English dictionary
WordNet is a freely available English lexical
database covering a large part of the English
vocabulary. We can obtain all the synonyms of a
word from it. A word may have more than one sense
in which it can be used. WordNet provides its output
in terms of ‘synsets’. A synset defines one sense of a
word and contains all the synonyms of the given
word which can be used in that particular sense.
e.g.: Synonym sets of the word “travelling” are:
1. travel, go, move, locomote
2. travel, journey
3. travel, trip, jaunt
4. travel, journey
WordNet also provides the frequency of
occurrence of a word in normal English text. We use
this information to encode the synonyms: Huffman
coding is applied again, so that more frequently
occurring synonyms get shorter codes and vice
versa. This is important for word sense
disambiguation: WordNet's first-sense information
assigns a frequency to each synonym of a word,
depending on how often that synonym is used for
that particular word in normal English text.
Assigning a shorter code to the most frequently
used synonym helps to maintain the proper word
sense.
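The frequency-based encoding of synonyms can be sketched as below. The frequency counts are made up for illustration and are not taken from WordNet; the rule for what happens when the secret bits run out is likewise an assumption.

```python
import heapq


def huffman_codes(freqs):
    """Return {word: bit string}; more frequent words get shorter codes."""
    heap = [(f, i, {w: ""}) for i, (w, f) in enumerate(sorted(freqs.items()))]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)
        f2, _, c2 = heapq.heappop(heap)
        merged = {w: "0" + c for w, c in c1.items()}
        merged.update({w: "1" + c for w, c in c2.items()})
        heapq.heappush(heap, (f1 + f2, counter, merged))
        counter += 1
    return heap[0][2]


def choose_synonym(bits, codes):
    """Pick the synonym whose code is a prefix of the secret bits.

    Returns (synonym, remaining bits). Assumption: if no code matches
    (the bits have run out), None is returned and the caller keeps the
    original word.
    """
    for word, code in codes.items():
        if bits.startswith(code):
            return word, bits[len(code):]
    return None, bits


# Hypothetical frequencies for one synset of "travel".
freqs = {"travel": 8, "go": 4, "move": 2, "locomote": 1}
codes = huffman_codes(freqs)
print(codes)  # {'locomote': '000', 'move': '001', 'go': '01', 'travel': '1'}

word, rest = choose_synonym("0110", codes)
print(word, rest)   # go 10
print(codes[word])  # 01, the bits the receiver recovers from "go"
```

Because the codes are prefix-free, at most one synonym can match the front of the bit string, so sender and receiver agree on how many bits each replacement carries.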
b. Verb Inflection dictionary
A verb can have many inflected forms, such as
the base form, present participle, past tense and
past participle. When we look up synonyms for a
verb, WordNet always gives them in their base form,
irrespective of the inflected form of the input verb. If
we replace a verb by a synonym in its base form, the
output sentence becomes grammatically incorrect.
To avoid this situation, we use a separate verb
dictionary.
We maintain a list of all verbs (about 16,064
verbs) along with all their inflected forms in a
separate file. Before replacing a verb by its
synonym, we check whether the inflected forms of
the original verb and its synonym match. If they do
not match, we select the appropriate inflected form
of the synonym from the verb dictionary and replace
the original by it.
e.g.: Tagged Cover Text:
A/DT group/NN of/IN frogs/NNS were/VBD
traveling/VBG through/IN the/DT woods/NNS ,/,
Suppose the synonym “go” is selected from
WordNet as the replacement for “traveling”. The
stego text should contain the present participle form
of “go” to maintain the tense of the sentence. The
verb dictionary provides this inflected form of the
base form “go” for the actual replacement.
c. Noun Inflection dictionary
A noun can be in either singular or plural form.
As with verbs, WordNet always gives the
synonyms of a noun in their singular form. If we
replace a plural noun by a singular synonym, we
get a grammatically incorrect sentence. So we use a
separate noun dictionary to avoid this situation.
We maintain a list of nouns (about 89,051
nouns) in their singular as well as plural forms.
Before replacing a noun by its synonym, we check
whether both are in the same form. If not, we select
the appropriate form of the synonym from the noun
dictionary and replace the original by it.
e.g.: Tagged Cover Text:
A/DT group/NN of/IN frogs/NNS were/VBD
traveling/VBG through/IN the/DT woods/NNS ,/,
The noun “frogs” is the plural form of the word
“frog”. Suppose the synonym “Gaul” is selected by
the input bit string; we need to ensure that it is put in
its plural form, as “frogs” is plural. We use this
dictionary to obtain the singular and plural forms of
the nouns it contains.
3.4. Synonym Replacement
Figure 1 shows the mechanism carried out at
the sender's end. The tagged text obtained from
stage 1 is scanned. Whenever a noun, adjective,
verb or adverb is found, its synonyms are obtained
from WordNet. All synonyms are put in a frequency
table, with the frequencies obtained from WordNet.
Huffman coding is applied to this frequency table to
obtain codes for all synonyms. Using frequencies
also achieves word sense disambiguation, as more
frequently used senses get shorter codes and
therefore have a higher probability of being used.
After building the encode table, we use the input
bit string to select one of the synonyms from the
table. If we are replacing a verb, the inflected forms
are checked and the appropriate form of the verb is
obtained from the verb dictionary. Similarly, if we are
replacing a noun, the singular or plural form is
selected from the noun dictionary in accordance with
the original noun's form. Otherwise the selected
synonym is put in place of the original word as is.
The Appendix shows a sample stego text generated
from a cover text by hiding a secret text.
Figure 1: Sender end
The reverse algorithm is carried out at the
receiver's end. Figure 2 shows the mechanism. The
tagged cover text is scanned for nouns, adjectives,
verbs and adverbs. When one is found, its synonyms
are obtained from WordNet. As at the sender's end,
a frequency table and then an encode table are
formed using the frequencies of the words.
At the same time, the stego text is also scanned.
From the stego text, we obtain the synonym
selected at the sender's side. Its bits are then looked
up in the table and appended to the output string.
The output string, when decompressed, produces
the original secret text.
Figure 2: Receiver End
4. Cover Text Selection
Steganalysis is the task of identifying the
existence of a secret message. This follows from the
fact that steganography aims to conceal the
existence of a message, not to scramble it. Our
approach uses word replacement in the cover text.
As only a few words are replaced by their synonyms,
the majority of the text remains unchanged. If the
same cover text is used again and again, a “Known
Stego-Text” attack, in which an intruder keeps track
of the text being sent over the communication
medium, becomes possible. To protect the text from
steganalysis, the cover text needs to be changed
periodically. Our approach uses a book, i.e. a
collection of different chapters. The book should be
privately owned by sender and receiver. One of the
chapters of the book is selected as cover text; the
choice of chapter is made randomly by the sender.
For the reverse transformation at the receiver's end,
the same chapter must be selected as cover text.
To achieve this, we exploit the unchanged part of
the cover text.
The receiver decides from the stego text which
chapter is to be used as cover text. The approach
calculates the difference between each chapter of
the book and the stego text and selects the chapter
with the minimum difference.
Initially, each chapter of the book is scanned
once to obtain a code for each sentence. Words
which are common nouns, adjectives, adverbs or
verbs are ignored in the calculation of the code. The
values of the other words are calculated using the
ASCII values of their characters and the position of
the word in the sentence. All these values are
summed to determine the code for that sentence.
Codes for all sentences are calculated in the same
way.
e.g.: “This is important for me”.
Code = ‘t’ * 1 + ‘h’ * 1 + ‘i’ * 1 + ‘s’ * 1 + ‘i’ * 2 +
‘s’ * 2 + ‘f’ * 4 + ‘o’ * 4 + ‘r’ * 4 + ‘m’ * 5 + ‘e’ * 5
“important” is not used in the calculation of the
code for this sentence as it is an adjective.
A similar code generation algorithm is applied
to the stego text. The codes of the stego text are
compared with the codes of all chapters. Ideally, the
codes of all sentences of the stego text should match
the codes of the chapter that was used as cover text
by the sender. But our algorithm allows compound
word replacements (“travel” can be replaced by
“move around”), which change the number of words
in a sentence. This causes a difference between the
codes of the stego text and those of the chapter used
as cover text. This difference is, however, very small
compared to the difference for other chapters. So the
chapter with the minimum difference is selected as
cover text.
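The sentence code and the chapter choice can be sketched as follows. Two assumptions are made: the set of replaceable words is passed in explicitly (in the real system it would come from the part-of-speech tagger), and the difference between two texts is taken as the sum of absolute differences of their sentence codes, since the exact measure is not specified here.

```python
from itertools import zip_longest


def sentence_code(sentence, replaceable):
    """Sum of ASCII value * word position over all non-replaceable words."""
    words = sentence.split()
    return sum(
        ord(ch) * position
        for position, word in enumerate(words, start=1)
        if word.lower() not in replaceable
        for ch in word.lower()
    )


def difference(chapter, stego, replaceable):
    """Assumed measure: sum of absolute differences of sentence codes."""
    a = [sentence_code(s, replaceable) for s in chapter]
    b = [sentence_code(s, replaceable) for s in stego]
    return sum(abs(x - y) for x, y in zip_longest(a, b, fillvalue=0))


def pick_chapter(chapters, stego, replaceable):
    """Index of the chapter whose codes are closest to the stego text."""
    return min(range(len(chapters)),
               key=lambda i: difference(chapters[i], stego, replaceable))


# Hypothetical two-chapter "book" with one sentence per chapter.
replaceable = {"important", "significant", "lion", "ate", "food", "yesterday"}
chapters = [["This is important for me"], ["The lion ate the food yesterday"]]
stego = ["This is significant for me"]

print(sentence_code("This is important for me", replaceable))  # 3238
print(pick_chapter(chapters, stego, replaceable))              # 0
```

Replacing “important” by “significant” leaves the code unchanged, because replaceable words never enter the sum; only replacements that change the word count shift the positions of later words.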
5. Conclusion
The field of linguistic steganography is very
interesting, as it conceals the very existence of a
secret message from an intruder, which
cryptography does not achieve. The synonym
replacement approach to linguistic
steganography produces innocuous-looking English
text, thereby making detection of the secret message
very hard. The Stego-Turing Test states that
it is very hard for a computer to alter a natural
language text in a way that is undetectable to a
human. Many approaches to this have been tried in
the past, but none has been able to
solve the problem. Though our solution does not
solve the problem either, it produces better quality
output than previous approaches.
We also give a new approach for dynamically
choosing a cover text from a body of text shared
by sender and receiver. This allows the user to use a
different cover text for hiding each message,
thus making steganalysis difficult.
The synonym replacement approach to linguistic
steganography using inflection classes, frequency
statistics and dynamic cover text selection is a
new improvement in the field of linguistic
steganography and provides a good, efficient
tool for information hiding.
6. References
[1] P. Wayner, “Mimic functions,” Cryptologia XVI,
pp. 193–214, July 1992.
[2] P. Wayner, “Strong theoretical steganography,”
Cryptologia XIX, pp. 285–299, July 1995.
[3] P. Wayner, “Disappearing Cryptography:
Information Hiding: Steganography &
Watermarking”, 2nd ed., Morgan Kaufmann
Publishers, Los Altos, CA 94022, USA, 2002,
pp. 67-126, pp. 303-314.
[4] M. J. Atallah, V. Raskin, M. Crogan, C.
Hempelmann, F. Kerschbaum, D. Mohamed,
and S. Naik, “Natural language watermarking:
Design, analysis, and a proof-of-concept
implementation,” in Information Hiding: Fourth
International Workshop, I. S. Moskowitz, ed.,
Lecture Notes in Computer Science 2137, pp.
185–199, Springer, April 2001.
[5] M. J. Atallah, V. Raskin, C. F. Hempelmann, M.
Karahan, R. Sion, U. Topkara, and K. E.
Triezenberg, “Natural language watermarking
and tamperproofing,” in Information Hiding: Fifth
International Workshop, F. A. P. Petitcolas, ed.,
Lecture Notes in Computer Science 2578, pp.
196–212, Springer, October 2002.
[6] K. Bennett, “Linguistic steganography: Survey,
analysis, and robustness concerns for hiding
information in text,” Tech. Rep. TR 2004-13,
Purdue CERIAS, May 2004.
[7] M. T. Chapman, “Hiding the hidden: A software
system for concealing ciphertext as innocuous
text,” Master’s thesis, University of
Wisconsin-Milwaukee, May 1997.
[8] M. T. Chapman and G. I. Davida, “Hiding the
hidden: A software system for concealing
ciphertext as innocuous text,” in Information and
Communications Security: First International
Conference, O. S. Q. Yongfei Han Tatsuaki, ed.,
Lecture Notes in Computer Science 1334,
Springer, August 1997.
[9] M. T. Chapman, G. I. Davida, and M. Rennhard,
“A practical and effective approach to
large-scale automated linguistic steganography,” in
Information Security: Fourth International
Conference, G. I. Davida and Y. Frankel, eds.,
Lecture Notes in Computer Science 2200, p.
156ff, Springer, October 2001.
[10] R. Bergmair, “Towards linguistic steganography:
A systematic investigation of approaches,
systems, and issues,” final year thesis, April
2004. Handed in in partial fulfillment of the
degree requirements for the degree “B.Sc.
(Hons.) in Computer Studies” to the University of
Derby.
[11] I. A. Bolshakov, “A method of linguistic
steganography based on collocationally-verified
synonymy,” in Information Hiding: 6th
International Workshop, J. J. Fridrich, ed.,
Lecture Notes in Computer Science 3200, pp.
180–191, Springer, May 2004.
[12] K. Winstein, “Lexical steganography through
adaptive modulation of the word choice hash,”
January 1999. Was disseminated during
secondary education at the Illinois Mathematics
and Science Academy. The paper won the third
prize in the 2000 Intel Science Talent Search.
[13] A. J. Tenenbaum, “Linguistic steganography:
Passing covert data using text-based mimicry,”
final year thesis, April 2002. Submitted in partial
fulfillment of the requirements for the degree of
“Bachelor of Applied Science” to the University
of Toronto.
[14] Vineeta Chand and C. Orhan Orgun, “Exploiting
linguistic features in Lexical Steganography”,
Proceedings of the 39th Hawaii International
Conference on System Sciences, 2006.
APPENDIX
Secret Text:
Escape from jail today evening.
Sample Cover Text:
EVER since I have been scrutinizing political
events, I have taken a tremendous interest in
propagandist activity. I saw that the Socialist-Marxist
organizations mastered and applied this instrument
with astounding skill. And I soon realized that the
correct use of propaganda is a true art which has
remained practically unknown to the bourgeois
parties. Only the Christian-Social movement,
especially in Lueger's time, achieved a certain
virtuosity on this instrument, to which it owed many
of its successes.
But it was not until the War that it became
evident what immense results could be obtained by
a correct application of propaganda. Here again,
unfortunately, all our studying had to be done on the
enemy side, for the activity on our side was modest,
to say the least. The total miscarriage of the German
'enlightenment ' service stared every soldier in the
face, and this spurred me to take up the question of
propaganda even more deeply than before.
There was often more than enough time for
thinking, and the enemy offered practical instruction
which, to our sorrow, was only too good.
Sample StegoText:
ever since I have been scrutinizing political
cases, I have taken a tremendous interest in
propagandist action. I experienced that the Socialist-Marxist
organizations subdued and practiced this
tool with astounding accomplishment. And I shortly
recognized that the right use of propaganda is a
truthful art which has stayed much unknown to the
businessperson parties. merely the Christian-Social
move, particularly in Lueger 's time, accomplished a
sure virtuosity on this instrument, to which it owed
many of its successes. But it was not until the War
that it turned evident what immense effects could be
found by a right application of propaganda. hither
once more, unfortunately, all our considering had to
be made on the enemy side, for the action on our
side was modest, to say the least. The total stillbirth
of the German ` Nirvana ' service starred every
soldier in the face, and this spurred me to bring up
the inquiry of propaganda even more deeply than
ahead. There was frequently more than plenty
sentence for mentation, and the opposition offered
practical education which, to our regret, was only too
good.