Proceedings of the
Conference on Digital Forensics, Security, and Law
2012

Richmond, Virginia
May 30-31, 2012
Conference Chair
Glenn S. Dardick
Longwood University
Virginia, USA
Association of Digital Forensics, Security and Law
Copyright © 2012 ADFSL, the Association of Digital Forensics, Security and Law. Permission to make digital or printed
copies of all or any part of this journal is granted without fee for personal or classroom use only and provided that such
copies are not made or distributed for profit or commercial use. All copies must be accompanied by this copyright notice
and a full citation. Permission from the ADFSL is required to make digital or printed copies of all or any part of this journal
for-profit or commercial use. Permission requests should be sent to Dr. Glenn S. Dardick, Association of Digital Forensics,
Security and Law, 1642 Horsepen Hills Road, Maidens, Virginia 23102 or emailed to [email protected].
ISSN 1931-7379
Sponsors
Contents
Committee ................................................................................................................................................ 4
Schedule ................................................................................................................................................... 5
Update on the State of the Science of Digital Evidence Examination ................................................. 7
Fred Cohen*
A Proposal for Incorporating Programming Blunder as Important Evidence in
Abstraction-Filtration-Comparison Test ............................................................................................ 19
P. Vinod Bhattathiripad*
The Xbox 360 and Steganography: How Criminals and Terrorists could be “Going Dark” ......... 33
Ashley Podhradsky*, Rob D’Ovidio and Cindy Casey
Double-Compressed JPEG Detection in a Steganalysis System ........................................................ 53
Jennifer L. Davidson* and Pooja Parajape
Toward Alignment between Communities of Practice and Knowledge-Based Decision
Support ................................................................................................................................................... 79
Jason Nichols*, David Biros* and Mark Weiser
A Fuzzy Hashing Approach Based on Random Sequences and Hamming Distance ...................... 89
Frank Breitinger* and Harald Baier
Cloud Forensics Investigation: Tracing Infringing Sharing of Copyrighted Content in Cloud .. 101
Yi-Jun He, Echo P. Zhang*, Lucas C.K. Hui, Siu Ming Yiu* and K.P. Chow
iPad2 Logical Acquisition: Automated or Manual Examination? .................................................. 113
Somaya Ali*, Sumaya AlHosani*, Farah AlZarooni and Ibrahim Baggili
Multi-Parameter Sensitivity Analysis of a Bayesian Network from a Digital Forensic
Investigation......................................................................................................................................... 129
Richard E. Overill, Echo P. Zhang* and Kam-Pui Chow
Facilitating Forensics in the Mobile Millennium Through Proactive Enterprise Security .......... 141
Andrew R. Scholnick*
A Case Study of the Challenges of Cyber Forensics Analysis of Digital Evidence in a
Child Pornography Trial .................................................................................................................... 155
Richard Boddington*
After Five Years of E-Discovery Missteps: Sanctions or Safe Harbor? ......................................... 173
Milton Luoma* and Vicki Luoma*
Digital Evidence Education in Schools of Law ................................................................................. 183
Aaron Alva* and Barbara Endicott-Popovsky
Pathway into a Security Professional: A new Cyber Security and Forensic Computing
Curriculum .......................................................................................................................................... 193
Elena Sitnikova* and Jill Slay
* Author Presenting and/or Attending
Conference Committee
The 2012 ADFSL Conference on Digital Forensics, Security and Law is pleased to have the following
members of the conference committee.
Glenn Dardick
[email protected]
General chair
Longwood University
Virginia
USA
John Bagby
[email protected]
The Pennsylvania State
University
Pennsylvania
USA
Nick Vincent Flor
[email protected]
University of New Mexico
New Mexico
USA
Jigang Liu
[email protected]
Metropolitan State University
Minnesota
USA
Diane Barrett
[email protected]
University of Advanced
Technology, Arizona
USA
Felix Freiling
[email protected]
University of Erlangen-Nürnberg
Germany
John Riley
[email protected]
Bloomsburg University
Pennsylvania
USA
David Biros
[email protected]
Oklahoma State University
Oklahoma
USA
Simson Garfinkel
[email protected]
Naval Postgraduate School
Monterey, CA
USA
Pedro Luis Prospero Sanchez
[email protected]
University of Sao Paulo
Sao Paulo
Brazil
Mohamed Chawki
[email protected]
University of Aix-Marseille III
France
Andy Jones
[email protected]
Khalifa University
UAE
Jill Slay
[email protected]
Polytechnic of Namibia
Windhoek
Namibia
Fred Cohen
[email protected]
California Sciences Institute
Livermore, CA
USA
Gregg Gunsch
[email protected]
Defiance College
Ohio
USA
Eli Weintraub
[email protected]
Afeka Tel Aviv Academic College
of Engineering
Israel
David Dampier
[email protected]
Mississippi State University
Mississippi
USA
Gary Kessler
[email protected]
Gary Kessler Associates
Vermont
USA
Craig Valli
[email protected]
Edith Cowan University
Western Australia
Australia
Denis Edgar-Nevill
[email protected]
Canterbury Christ Church
University
UK
Ki Jung Lee
[email protected]
Drexel University
Pennsylvania
USA
Schedule
Wednesday, May 30
• 07:30 AM CONTINENTAL BREAKFAST
• 07:30 AM On-site Registration
• 08:30 AM Introductions
• Glenn S. Dardick, Conference Chair and Director of the ADFSL
• 08:40 AM Welcome
• Paul Barrett, Dean of the College of Business and Economics at Longwood University
• 08:50 AM Papers/Presentation Session I
• Fred Cohen, USA: Update on the State of the Science of Digital Evidence Examination
• P. Vinod Bhattathiripad, India: A Proposal for Incorporating Programming Blunder as Important Evidence in Abstraction-Filtration-Comparison Test
• 10:10 AM BREAK
• 10:30 AM Papers/Presentation Session II
• Ashley L. Podhradsky, USA: The Xbox 360 and Steganography: How Criminals and Terrorists Could Be “Going Dark”
• Jennifer Davidson, USA: Double-Compressed JPEG Detection in a Steganalysis System
• Jason Nichols and David Biros, USA: Toward Alignment Between Communities of Practice and Knowledge-Based Decision Support
• 12:30 PM LUNCH (provided)
• 01:15 PM Keynote
• Mohamed Chawki, Senior Judge at the Council of State, Egypt and Founder-Chairman of the International Association of Cybercrime Prevention (AILCC) in Paris - "IT and Regime Change"
• 01:45 PM Papers/Presentation Session III
• Frank Breitinger, Germany: A Fuzzy Hashing Approach based on Random Sequences and Hamming Distance
• 03:00 PM BREAK
• 03:20 PM Papers/Presentation Session IV
• Echo P. Zhang and Siu Ming Yiu, China: Cloud Forensics Investigation: Tracing Infringing Sharing of Copyrighted Content in the Cloud
• Somaya Ali AlWejdani and Sumaya AbdulAziz AlHosani, UAE: iPad2 Logical Acquisition: Automated or Manual Examination?
• 04:40 PM Conference Close for Day
Schedule
Thursday, May 31
• 07:30 AM CONTINENTAL BREAKFAST
• 07:30 AM On-site Registration
• 08:30 AM Papers/Presentation Session I
• Paul Poroshin, US Branch Vice-President, Group-IB, Russia: The Analysis of the 2011 Russian Cybercrime Scene
• Gareth Davies, UK: NAND Memory Technology Forensics
• 10:15 AM BREAK
• 10:30 AM Papers/Presentation Session II
• Ping Zhang, China: Multi-Parameter Sensitivity Analysis of a Bayesian Network From a Digital Forensic Investigation
• Andrew Scholnick, USA: Facilitating Forensics in the Mobile Millennium Through Proactive Enterprise Security
• 12:30 PM LUNCH (provided)
• 01:15 PM Keynote Speech
• Nigel Wilson, Senior Lecturer at the University of Adelaide Law School in Australia - "Digital Experts – Why is Effective Communication So Important to Both the Admissibility and Persuasiveness of Expert Opinion?"
• 01:45 PM Papers/Presentation Session III
• Richard Boddington, Australia: A Case Study of the Challenges of Cyber Forensics Analysis of Digital Evidence in a Child Pornography Trial
• Milton Luoma and Vicki Luoma, USA: After Five Years of E-Discovery Missteps: Sanctions or Safe Harbor?
• 03:00 PM BREAK
• 03:20 PM Papers/Presentation Session IV
• Aaron Alva, USA: Digital Evidence Education in Schools of Law
• Elena Sitnikova, Australia: Pathway into a Security Professional: A New Cyber Security and Forensic Computing Curriculum
• 04:40 PM Conference Close
UPDATE ON THE STATE OF THE SCIENCE OF DIGITAL
EVIDENCE EXAMINATION
Fred Cohen
CEO, Fred Cohen & Associates
President, California Sciences Institute
ABSTRACT
This paper updates previous work on the level of consensus in foundational elements of digital
evidence examination. Significant consensus is found present only after definitions are made explicit,
suggesting that, while there is a scientific agreement around some of the basic notions identified, the
use of a common language is lacking.
Keywords: Digital forensics examination, terminology, scientific methodology, testability, validation, classification, scientific consensus
1. INTRODUCTION AND BACKGROUND
There have been increasing calls for scientific approaches and formal methods, (e.g.,
[1][2][3][4][5][6]), and at least one study has shown that, in the relatively mature area of evidence
collection, there is a lack of agreement among and between the technical and legal community about
what constitutes proper process. [7] The National Institute of Standards and Technology has
performed testing on limited sorts of tools used in digital forensics, including substantial efforts
related to evidence collection technologies, and it has found that the tools have substantial limitations
about which the user and examiner must be aware if reliable tool usage and results are to be assured.
[8]
In an earlier paper seeking to understand the state of the science in digital evidence examination
(i.e., analysis, interpretation, attribution, reconstruction, and aspects of presentation),[26] results
suggested a lack of consensus and a lack of common language usage. A major question remained as to
whether the lack of consensus stemmed from the language differences or the lack of a common body
of agreed-upon knowledge. This paper updates the results from that previous work by using a survey
to try to differentiate between these two possibilities. In the context of the legal mandates of the US
Federal Rules of Evidence [9] and relevant case law, this helps to clarify the extent to which expert
testimony may be relied upon.
1.1 The rules and rulings of the courts
The US Federal Rules of Evidence (FRE) [9], rulings in the Daubert case[10], and in the Frye case
[11], express the most commonly applied standards with respect to issues of expert witnesses (FRE
Rules 701-706). Digital forensic evidence is normally introduced by expert witnesses except in cases
where non-experts can bring clarity to non-scientific issues by stating what they observed or did.
According to the FRE [9], only expert witnesses can address issues based on scientific, technical, or
other specialized knowledge. A witness qualified as an expert by knowledge, skill, experience,
training, or education, may testify in the form of an opinion or otherwise, if (1) the testimony is based
on sufficient facts or data, (2) the testimony is the product of reliable principles and methods, and (3)
the witness has applied the principles and methods reliably to the facts of the case. If facts are
reasonably relied upon by experts in forming opinions or inferences, the facts need not be admissible
for the opinion or inference to be admitted; however, the expert may in any event be required to
disclose the underlying facts or data on cross-examination.
The Daubert standard [10] essentially allows the use of accepted methods of analysis that reliably and
accurately reflect the data they rely on. The Frye standard [11] is basically: (1) whether or not the
findings presented are generally accepted within the relevant field; and (2) whether they are beyond
the general knowledge of the jurors. In both cases, there is a fundamental reliance on scientific
methodology properly applied.
The requirements for the use of scientific evidence through expert opinion in the United States and
throughout the world are based on principles and specific rulings that dictate, in essence, that the
evidence be (1) beyond the normal knowledge of non-experts, (2) based on a scientific methodology
that is testable, (3) characterized in specific terms with regard to reliability and rates of error, (4) that
the tools used be properly tested and calibrated, and (5) that the scientific methodology is properly
applied by the expert as demonstrated by the information provided by the expert.[9][10][11][12]
Failures to meet these requirements are, in some cases, spectacular. For example, in the Madrid bombing case, the US FBI declared that a fingerprint from the scene demonstrated the presence of an Oregon attorney. However, that attorney, after having been arrested, was clearly demonstrated to have been on the other side of the world at the time in question. [13] A side effect is that fingerprints are now challenged as scientific evidence around the world. [24]
1.2 The foundations of science
Science is based on the notion of testability. In particular, and without limit, a scientific theory must be
testable in the sense that an independent individual who is reasonably skilled in the relevant arts
should be able to test the theory by performing experiments that, if they produced certain outcomes,
would refute the theory. Once refuted, such a theory is no longer considered a valid scientific theory,
and must be abandoned, hopefully in favor of a different theory that meets the evidence, at least in
circumstances where the refutation applies. A statement about a universal principle can be disproven
by a single refutation, but any number of confirmations cannot prove it to be universally true. [14]
In order to make scientific statements regarding digital evidence, there are some deeper issues that
may have to be addressed. In particular, there has to be some underlying common language that allows
the scientists to communicate both the theories and experiments, a defined and agreed upon set of
methods for carrying out experiments and interpreting their outcomes (i.e., a methodology), and a
predefined set of outcomes with a standard way of interpreting them (i.e., a system of measurement)
against which to measure tests. These ultimately have to come to be accepted in the scientific
community as a consensus.
One way to test for science is to examine peer reviewed literature to determine if these things are
present. This was undertaken in a 2011 study [26] which suggested a lack of common language and a
subsequent proposal to move toward a common language based on archival science.[27] One way to
test for consensus is to poll individuals actively participating in a field (e.g., those who testify as
expert witnesses and authors publishing in relevant peer reviewed publications) regarding their
understandings to see whether and to what extent there is a consensus in that community. This method
is used across fields [15][16][17], with >86% agreement and <5% disagreement for climatologist
consensus regarding the question “Do you think human activity is a significant contributing factor in
changing mean global temperatures?” in one survey. [18]
1.3 The previous study being updated
In the previous study of consensus in digital forensics evidence examination, which we quote liberally
from with permission in this section,[25] results suggested a lack of consensus surrounding a series of
basic statements:
1. Digital Evidence consists only of sequences of bits.
2. The physics of digital information is different from that of the physical world.
3. Digital evidence is finite in granularity in both space and time.
4. It is possible to observe digital information without altering it.
5. It is possible to duplicate digital information without removing it.
6. Digital evidence is trace evidence.
7. Digital evidence is not transfer evidence.
8. Digital evidence is latent in nature.
9. Computational complexity limits digital forensic analysis.
10. Theories of digital forensic evidence form a physics.
11. The fundamental theorem of digital forensics is "What is inconsistent is not true".
These statements were evaluated by survey participants against a scale of “I disagree.”, “I don't
know.”, and “I agree.”, and participants were solicited from the members of the Digital Forensics
Certification Board (DFCB), individuals who have authored or co-authored a paper or attended the
International Federation of Information Processing (IFIP) working group 11.9 (digital forensics)
conference over the last three years in Kyoto, Orlando, and Hong Kong, members of a Bay Area
chapter of the High Tech Crime Investigators Association (HTCIA), and a group of largely university
researchers at an NSF-sponsored event. Control questions were used to control for random guessing
and consensus around other areas of science and untrue statements of fact.
Analysis was undertaken to identify responses exceeding 86% consensus (i.e., that for global climate
change among climatologists), not exceeding 5% non-consensus for refutation, and failing to refute
the null hypothesis. Consensus margin of error calculations were done per the t-test by computing the
margin of error for 86% and 5% consensus based on the number of respondents and size of the
population with a Web-based calculator.[22] Similar calculations were done using the confidence
interval for one proportion and sample size for one proportion, and they produced similar results.[23]
It was identified that the scale applied (3-valued instead of a Likert scale) leads to an inability to
validate the statistical characteristics using common methods.
No agreement reached 86% confidence levels or were within the margin of error (.77), and only a
control question, #4 (∑a/N=.68), #5 (∑a/N=.75), and #9 (∑a/N=.64) (N=54) exceeded random
levels of agreement. For disagreement, (N=28) only the same and one other control question, #5
(∑d/N=.14), and #9 (∑d/N=.10) were within the margin of error of not refuting consensus by
disagreement (.05+.09=.14) levels. Only #1 (∑d/N=.53) and #11 (∑d/N=.50) were within random
levels of refutation of consensus from disagreements. In summary, only #5 and #9 are viable
candidates for overall community consensus of any sort, and those at levels of only 75% and 64%
consensus respectively.[25]
The previous effort also involved a literature survey. 125 reviews of 95 unique published articles (31%
redundant reviews) were undertaken. Of these, 34% are conference papers, 25% journal articles, 18%
workshop papers, 8% book chapters, and 10% others. Publications surveyed included, without limit,
IFIP (4), IEEE (16), ACM (6), HTCIA (3), Digital Investigation (30), doctoral dissertations (2), books,
and other similar sources. A reasonable estimate is that there were less than 500 peer reviewed papers
at that time that speak directly to the issues at hand. Results from examining 95 of those papers, which
represent 19% of the total corpus, produces a 95% confidence level with a 9% margin of error. Of
these reviews, 88% have no identified common language defined, 82% have no identified scientific
concepts or basis identified, 76% have no identified testability criteria or testing identified, 75% have
no identified validation identified, while 59% identify a methodology.
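As a rough check on the sampling error quoted above, the margin of error for examining 95 of an estimated 500 papers can be computed as follows (a minimal sketch; the 50% proportion, normal approximation, and finite-population correction are assumptions of this sketch, since the exact calculator settings are not stated):

    # Margin of error for sampling n = 95 papers from an estimated corpus of 500,
    # at 95% confidence, assuming the most conservative proportion p = 0.5 and a
    # finite-population correction (these assumptions are the sketch's, not the paper's).
    import math

    n, population, z, p = 95, 500, 1.96, 0.5
    moe = z * math.sqrt(p * (1 - p) / n) * math.sqrt((population - n) / (population - 1))
    print(round(moe, 3))  # -> 0.091, roughly the 9% margin of error reported above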
Internal consistency of these results was checked by testing redundant reviews to determine how often
reviewers disagreed as to the “none” designation. Out of 20 redundant reviews (40 reviews, 2 each of
20 papers), inconsistencies were found for Science (3/20 = 15%), Physics (0/20 = 0%), Testability
(4/20 = 20%), Validation (1/20 = 5%), and Language (1/20 = 5%). This indicates an aggregate error
rate of 9/100 = 9% of entries in which reviewers disagreed about the absence of these indicators of
scientific basis.
2. THE PRESENT STUDY
In order to differentiate between the problems associated with a lack of common terminology,
language use, and methodological issues in the field, a short study was undertaken to try to
differentiate actual consensus from linguistic issues. To do this, we created a survey that defines each
term and measures agreement with the definition, and then tells the participant to assume the definition
and evaluate the statement.
2.1 The survey methodology
As in the previous study, surveys were performed using the “SurveyMonkey” Web site. No identity-related data was collected or retained, although the survey mechanism prevents individuals from
taking the survey from the same computer more than once unless they act to circumvent the
mechanism. No accounting was taken to try to identify individuals who may have taken the survey as
members of more than one group because the group overlap is relatively small. The new survey was
introduced as follows:
“This is a survey designed to identify, to a first approximation, whether or not there is
a consensus in the scientific community with regard to some of the basic principles of
the examination of digital forensic evidence.
This survey is NOT about the physical realization of that evidence and NOT about the
media in which it is stored, processed, or transported. It is ONLY about the bits.
Please read carefully before answering.
For each numbered definition, review the definition and example(s) and indicate
the extent to which you agree or disagree with the definition. For each follow-up,
assume that the definition above it is correct, and respond to the statement in
light of that definition, indicating the extent to which you agree or disagree with
it.”
Following the instructions, the questions are provided in a format approximately as in Figure 1:

                                            Strongly              Don't agree               Strongly
                                            disagree   Disagree   or disagree    Agree      Agree
  1: Definition: ...                           ʘ           ʘ           ʘ            ʘ          ʘ
  Assuming the definition as the basis
  for your answer ...                          ʘ           ʘ           ʘ            ʘ          ʘ
  2: Definition: ...                           ʘ           ʘ           ʘ            ʘ          ʘ

Figure 1 – The survey appearance
The set of questions and statements in the survey were as follows:
1: Definition: Digital forensics (as opposed to computer forensics) deals with sequences of bits
and their use in legal actions (as opposed to attack detection or other similar areas). Example:
A law suit or criminal charges with digital evidence will normally involve digital forensics. Do
you agree with this definition?
Assuming this definition as the basis for your answer, respond to the following statement:
Digital evidence is only sequences of bits.
2: Definition: Finite granularity means that, at the end of the day, you can only get so small,
and no smaller. Finite granularity in space means that there is a minimum size (i.e., the bit)
and finite granularity in time means that there is a minimum time unit (i.e., the number of bits
that represent a time unit, or alternatively, the maximum clock speed of the mechanisms
generating the bits). Example: At the level of digital evidence, as described earlier, no matter
how much of it you have, there is no unit of that evidence smaller than a bit, and no matter how
much precision measurements are made with, the times associated with those bits cannot be
infinitesimally small. Do you agree with this definition?
Assuming this definition as the basis for your answer, respond to the following statement:
Digital evidence is finite in granularity in both space and time.
3: Definition: Observation of digital information means to be able to determine what binary
value each bit of the information has, while alteration is the changing of a bit value to the other
bit value (digital information by definition only has two different values for bits). Example: At
the level of digital evidence, when reading bits from a disk, the bits on the disk are observed
by the mechanisms of the disk drive. When writing to a disk, some of the bits on the disk may
be altered. Do you agree with this definition?
Assuming this definition as the basis for your answer, respond to the following statement: It is
normally possible to observe digital information without altering it.
4: Definition: Duplication of digital information means to make an exact copy of the bit
sequences comprising the digital information. Removal of digital information means to make
alterations so that the original information (i.e., bit sequence) is no longer present where it
originally was present before. Example: A duplicate of the bit sequence 1 0 1 will also be the
bit sequence 1 0 1. The removal of the bit sequence 1 0 1 from a location would mean that the
bit sequence at that location was no longer 1 0 1. Do you agree with this definition?
Assuming this definition as the basis for your answer, respond to the following statement: It is
normally possible to duplicate digital information without removing it.
5: Definition: Trace evidence is evidence that is produced as the result of a process, so that
the presence of the evidence is consistent with the execution of the process. Example: When
a pen writes on paper, the indentations in the paper resulting from the writing are traces of the
process of writing. Similarly, when a computer program produces bit sequences that are
stored on a disk, the bit sequences are traces of the execution of the computer program that
produced them. Do you agree with this definition?
Assuming this definition as the basis for your answer, respond to the following statement:
Digital evidence is normally trace evidence.
6: Definition: Transfer is a concept in which two objects coming into contact with each other
each leaves something of itself with the other. Example: When a shirt rubs up against a sharp
object, some shards from the object may get transferred to the shirt while some fibers from the
shirt may get transferred to the sharp object. Do you agree with this definition?
Assuming this definition as the basis for your answer, respond to the following statement:
Digital evidence is normally not transfer evidence.
7: Definition: Latent evidence is evidence that cannot be directly observed by human senses.
Example: DNA evidence is normally latent evidence in that the DNA has to be processed by a
mechanism to produce a result that can be examined by human senses. Do you agree with
this definition?
Assuming this definition as the basis for your answer, respond to the following statement:
Digital evidence is normally latent in nature.
8: Definition: Computational complexity is the number of low-level computing operations
required in order to perform an algorithm which processes information. Example: It takes more
computer time to determine the best possible route from house to house throughout a city than
to find any route that passes by all of those same houses. Do you agree with this definition?
Assuming this definition as the basis for your answer, respond to the following statement:
Computational complexity limits digital forensic analysis.
9: Definition: A physics is a set of mathematical equations or other rules in context for
describing and predicting behaviors of a system. Example: In the physical world, it is thought
to be impossible to observe anything without altering it, because the act of observation alters
the thing observed, and the physical world has no limits to granularity of space or time, so that
no matter how small something is, there is always something smaller. Similarly, the physics of
digital information, if you agree to statements above to that effect, is such that it is possible to
observe bits without altering them, make an exact duplicate without altering the original, and
so forth. Do you agree with this definition?
Assuming this definition as the basis for your answer, respond to the following statement: The
physics of digital information is different than that of the physical world.
10: Definition: Consistency between two or more things means that each is the way you would
expect it to be if the other ones are the way you observe them to be. Example: If you see a
black box and someone else viewing the same object under the same conditions states that it
is a white sphere, your observation is inconsistent with their statement. Similarly, if a sworn
statement states that a particular file was created at 10AM on a particular day in a particular
place, and the metadata for the file indicates that the same file was created at a different time
on a different day, the sworn statement is inconsistent with the metadata. Do you agree with
this definition?
Assuming this definition as the basis for your answer, respond to the following statement: As a
fundamental of digital forensics, what is inconsistent is not true. (or in other words, the
inconsistent things cannot all be true)
These correspond to questions 1, 3-11 of the previous survey, question 2 being removed because of the
confusion surrounding it in discussions following the previous survey.
2.2 The raw data
The raw data from the survey is shown in Figure 2. Numbered (gray) columns correspond to
definitions with the level of agreement to statements in the unnumbered (white) columns to their right.
Rows represent responses, with responses 22-24 from the DFCB group and 1-21 from the IFIP group.
At the bottom of the table there are two rows with responses from 2 other groups (Ignore 1 and Ignore
2), one response per group. The data from these groups was not included in the analysis because we
could not adequately characterize these groups in terms of size or expertise, could not assure that they
were independent of the DFCB and IFIP groups, and the number of samples is so small that no
independent meaningful statistics can reasonably be gleaned. We will comment on their potential
impact later.
In this table, -2 is “strongly disagree”, -1 is “disagree”, 0 is “don't agree or disagree”, 1 is “agree”, and
2 is “strongly agree”. In this table, a non-answer is treated as a “0”. At the end of the table (black
background) are rows with calculations across columns (within responses or pairs of responses). LD is
the level of disagreement between a definition and the relevant statement. That is, the number of
instances where respondents agree (disagree) with the definition and disagree (agree) with the related
statement (i.e., -1 or -2 for the definition and 1 or 2 for the statement or vice versa). A lower level of
disagreement indicates a stronger correlation between the view of the definition and the view of the
corresponding statement. “Agree” and “Disagree” are rows representing the number of responses
agreeing (>0) and disagreeing (<0) respectively. Columns D, A, and DD represent the number of
disagreements, agreements, and definition disagreements, respectively, for each respondent.
[Figure 2 - Data collected from the combined surveys; the raw response matrix (24 respondents across the 10 definition/statement pairs) is not recoverable in this transcription]
2.3 Analysis of survey results
This analysis covers the collection undertaken from the IFIP (N=21) and DFCB (N=3) groups, for a
total population of 24 respondents. As depicted above, -2 is “strongly disagree”, -1 is “disagree”, 0 is
“don't agree or disagree”, 1 is “agree”, and 2 is “strongly agree”. N is the number of respondents
expressing an opinion (either agree or disagree), μ is the mean, and σ the standard deviation. Treating
negative answers (-1, -2) as rejections of the asserted definition or statement (D=disagree), and
positive answers (1 and 2) as affirmations of the asserted definition or statement (A=agree), we present
the ratio of agreement (A/N) and disagreement (D/N) out of all respondents indicating a preference.
The margin of error for 95% confidence for the identified sample sizes (from 17 to 24 out of a total
estimated population of 250) is indicated under the column labeled M.[22] The C column indicates
consensus above the margin of error for agreements (A) or disagreements (D) and contains a “-”
when no such consensus levels were found.
 #   -2  -1   0   1   2    N     μ     σ    D    A   D/N   A/N    M   C   Issue
 1    4   5   3   6   6   21   .21   1.44   9   12   .43   .57   .21  -   Definition
      6   5   2   6   5   22  -.04   1.51  11   11   .50   .50   .20  -   Only sequences of bits.
 2    0   2   7   7   8   17   .88    .97   2   15   .12   .88   .23  A   Definition
      0   4   4   7   9   20   .88   1.09   4   16   .20   .80   .22  A   Finite granularity space and time.
 3    0   4   0  12   8   24  1.00   1.00   4   20   .17   .83   .20  A   Definition
      2   4   2   8   8   22   .66   1.31   6   16   .27   .73   .20  A   Can observe bits w/out alteration.
 4    3   4   2   9   6   22   .46   1.35   7   15   .32   .68   .20  -   Definition
      1   3   3   8   9   21   .88   1.17   4   17   .19   .81   .21  A   Can duplicate without removal.
 5    0   2   0  11  11   24  1.29    .84   2   22   .08   .92   .20  A   Definition
      3   4   3   8   6   21   .42   1.35   7   14   .33   .67   .21  -   Digital evidence trace evidence.
 6    0   4   2   8  10   22  1.00   1.08   4   18   .18   .82   .20  A   Definition
      2   5   3   8   6   21   .46   1.29   7   14   .33   .67   .21  -   Digital evidence not transfer.
 7    0   2   3   8  11   21  1.17    .94   2   19   .10   .90   .21  A   Definition
      3   0   3  11   7   21   .79   1.22   3   18   .14   .86   .21  A   Digital evidence latent.
 8    2   0   3   9  10   21  1.04   1.14   2   19   .10   .90   .21  A   Definition
      1   2   5   6  10   19   .92   1.15   3   16   .16   .84   .22  A   Computational complexity limits.
 9    2   2   1  14   5   23   .75   1.13   4   19   .17   .83   .20  A   Definition
      4   2   2   8   8   22   .58   1.44   6   16   .27   .73   .20  A   Digital != real world physics.
10    1   2   1  14   7   24   .96   1.02   3   21   .14   .86   .20  A   Definition
      2   4   5   9   4   19   .38   1.18   6   13   .32   .68   .22  -   What is inconsistent is not true.

Figure 3 – Analysis of Consensus
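For readers who wish to re-derive the table, the following is a minimal sketch (not the authors' own code) of how each row's summary statistics can be computed from a vector of responses coded -2 to 2, with non-answers treated as 0 as described above; the normal approximation, 50% proportion, and finite-population correction for the assumed population of 250 are this sketch's choices, and they approximately reproduce the M column (the Web-based calculator used in the paper may round differently):

    # Per-row statistics of Figure 3 from one response vector (coded -2..2,
    # non-answers as 0). Population standard deviation is taken over all
    # respondents; the 95% margin of error assumes p = 0.5 and a
    # finite-population correction for an assumed population of 250.
    import math

    def row_stats(responses, population=250, z=1.96, p=0.5):
        n_all = len(responses)                     # all respondents (here 24)
        mu = sum(responses) / n_all                # mean over all respondents
        sigma = math.sqrt(sum((r - mu) ** 2 for r in responses) / n_all)
        D = sum(1 for r in responses if r < 0)     # disagreements
        A = sum(1 for r in responses if r > 0)     # agreements
        N = D + A                                  # respondents expressing an opinion
        M = z * math.sqrt(p * (1 - p) / N) * math.sqrt((population - N) / (population - 1))
        return N, mu, sigma, D, A, D / N, A / N, M

    # Rebuilding the "Digital evidence latent" statement row from its counts
    # (3, 0, 3, 11, 7 for -2, -1, 0, 1, 2):
    responses = [-2] * 3 + [-1] * 0 + [0] * 3 + [1] * 11 + [2] * 7
    print(row_stats(responses))  # ~ (21, .79, 1.22, 3, 18, .14, .86, .21)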
Consensus above the margin of error from random is present for agreement with statements #2 (.80),
#3 (.73), #4 (.81), #7 (.86), #8 (.84) and #9 (.73). This may be reasonably interpreted as indicating that
for the full sample of the two organizations combined, there is a 95% chance that agreement would be
above the margin of error from random for these 6 statements. In addition, agreement to #7 and #8 are
within the margin of error of the 86% level of consensus seen among climatologists for global climate
change.[18] That is, consensus was shown for:
Digital evidence is finite in granularity in both space and time.
It is normally possible to observe digital information without altering it.
It is normally possible to duplicate digital information without removing it.
Digital evidence is normally latent in nature.
Computational complexity limits digital forensic analysis.
The physics of digital information is different than that of the physical world.
And consensus was NOT shown for:
Digital evidence is only sequences of bits.
Digital evidence is normally trace evidence.
Digital evidence is normally not transfer evidence.
As a fundamental of digital forensics, what is inconsistent is not true. (or in other words, the
inconsistent things cannot all be true)
A more important point is that consensus above the margin of error is present for 6 statements in the
present study, whereas without the use of definitions in the survey, only two (corresponding to items
#4 and #8 in this survey) were above consensus.[25] In addition, consensus levels are higher (#4 went
from .74 to .81 and #8 went from .64 to .84) when definitions were included.
It appears that a significant source of lack of consensus in the previous study was related to the lack of
common language and agreed upon terminology in the field also identified in that study.[25]
Definitional disagreements are also worthy of commentary. While a few respondents (i.e., #1, #3, and
#12) indicated disagreement or strong disagreement to at least half (5/10) of the definitions, only the
definitions of what constitutes digital evidence (#1) and what constitutes duplication (#4) fail to reach
consensus above random levels (8/10 definitions are at consensus levels above random). Definitions
#2(.88), #5(.92), #7(.90), #8(.90) and #10(.86) (5/10) meet or exceed the 86% level of agreement for
global climate change. There appears to be substantial disagreement regarding what constitutes digital
evidence, and this is reflected in much of the methodological literature in the field. The lack of
consensus levels for the definition of duplication is less clear based on the literature.
Analysis shows that of the 96 total disagreements, 49 of them (51%) stem from 4 respondents (#1, #3,
#4, and #6). Without these respondents, consensus levels would be far higher. In addition, the two
samples not included add only 2 disagreements, one of which is the definitional disagreement over
what constitutes digital evidence, and the other a disagreement to statement 8. Adding these results in
would drive consensus higher for all but statement 8 (which would go from 84% to 81%). None of
these changes would result in moving non-consensus statements outside of the margin of error for
randomness.
The internal level of disagreement (LD) between agreement on definitions and related statements is
also of interest. Despite the lack of consensus around Definition #1, 8/24 respondents indicated
different agreement to the definition than the statement. This suggests that respondents were able, as a
group, to overcome differences in views on definitions to express views on statements in the context
of the definitions provided. Among the 4 respondents constituting 51% of the disagreements, 2 of the
4 gave different answers to the statement than to the definition, again suggesting their ability to
differentiate between their disagreement with the definitions and their agreement/disagreement with
the statements. Looking at this more closely, only respondents #3, #8, and #23 always disagreed with
statements when disagreeing with definitions. The skew of results toward “Agree” largely invalidates
such an analysis regarding agreements, where 6 respondents agreed to all of the statements and
definitions. Only respondent #11 had all identical answers (strongly agree), indicating that
respondents, as a whole, considered to some level their responses to each question.
3. COMMENTS FROM REVIEWERS
The review process yielded the following residual comments/questions which I address here.
Reviewers seemed to comment on two basic issues: what the paper was about, and statistical issues.
The reviewers seemed to think that the paper was about the use of language and not a consensus
around science. Comments included:
“the research sought people's opinions on generally accepted terminology in the field
of digital forensics.”, “the [authors] claim that use of terminology is somehow a
measure of science.”, “I think it is a bridge too far to suggest that there is no science
(testability) without common terminology used by those in the discipline. ...”, “I think
fundamentally that this paper has examined legal terminology, not underpinning
science ...”
This represents a misunderstanding of the purpose of the research and paper. Quoting from the
abstract: “This paper updates previous work on the level of consensus in foundational elements of
digital evidence examination. Significant consensus is found present only after definitions are made
explicit, suggesting that, while there is a scientific agreement around some of the basic notions
identified, the use of a common language is lacking.” The purpose was to mitigate the differences in
use of language which were suspected as a partial cause of the lack of consensus identified in the
previous paper. The question being addressed was whether the lack of identified consensus in prior
research was due to an actual lack of agreement on the content or on differences in use of the
language. This paper suggests that there is a lack of common definition that must be compensated for
in order to measure consensus in this field, and that there is more consensus than previously thought.
The statistical question identifies correctly that 24 respondents is a seemingly small sample size. This
is addressed by computing the margin of error for that sample size out of the total population, in this
case estimated at 250. As such, this sample represents almost 10% of the total population and is
proportionally a very large sample size compared to most statistical studies. Full details are provided
so the reader can do further analysis and evaluate the actual responses using any desired method. The
larger statistical problem is that the respondents are self-selected from the larger population, all of
whom were notified of the study. We know of no way to compensate for this limitation through
analysis and have no means to force compliance or expectation of gaining adequate samples from
random polling.
4. SUMMARY, CONCLUSIONS, AND FURTHER WORK
It appears that this study confirms the hypothesis that the lack of consensus suggested in the previous
study was due, at least in part, to a lack of common definitions and language in the digital forensics
community. This study suggests that consensus is substantially present in many of the fundamental
areas that are foundational to the acceptance of such evidence in legal proceedings. Clarity around
definitions appears to be necessary for the field of digital forensics to reach levels of consensus present
in other areas of science.
5. REFERENCES
[1] R. Leigland and A. Krings, "A Formalization of Digital Forensics", International Journal of Digital
Evidence, Fall 2004, Volume 3, Issue 2.
[2] Ryan Hankins, T Uehara, and J Liu, "A Comparative Study of Forensic Science and Computer
Forensics", 2009 Third IEEE International Conference on Secure Software Integration and Reliability
Improvement.
[3] Committee on Identifying the Needs of the Forensic Sciences Community, "Strengthening
Forensic Science in the United States: A Path Forward", ISBN: 978-0-309-13130-8, 254 pages,
(2009).; Committee on Applied and Theoretical Statistics, National Research Council.
[4] Scientific Working Group on Digital Evidence (SWGDE) Position on the National Research
Council Report to Congress - Strengthening Forensic Science in the United States: A Path Forward
[5] S Garfinkel, P. Farrella, V Roussev, G Dinolt, "Bringing science to digital forensics with
standardized forensic corpora", Digital Investigation 6 (2009) S2-S11
[6] M. Pollitt, "Applying Traditional Forensic Taxonomy to Digital Forensics", Advances in Digital
Forensics IV, IFIP TC11.9 Conference Proceedings, 2009.
[7] G. Carlton and R. Worthley, "An evaluation of agreement and conflict among computer forensics
experts", Proceedings of the 42nd Hawaii International Conference on System Sciences, 2009
[8] NIST, "Computer Forensics Tool Testing (CFTT) Project", http://www.cftt.nist.gov/
[9] The Federal Rules of Evidence, Section 702.
[10] Daubert v. Merrell Dow Pharmaceuticals, Inc., 509 U.S. 579, 125 L. Ed. 2d 469, 113 S. Ct. 2786
(1993).
[11] Frye v. United States, 293 F 1013 D.C. Cir, 1923
[12] Reference Manual on Scientific Evidence - Second Edition - Federal Judicial Center, available at
http://air.fjc.gov/public/fjcweb.nsf/pages/16
[13] U.S. Department of Justice, “A Review of the FBI’s Handling of the Brandon Mayfield Case”,
unclassified executive summary, January 2006. (http://www.justice.gov/oig/special/s0601/exec.pdf)
[14] K. Popper, The Logic of Scientific Discovery (1959), Hutchins and Company, London. ISBN10:
0415278449.
[15] J. Jones and D. Hunter, “Qualitative Research: Consensus methods for medical and health
services research”, Volume 311, Number 7001, BMJ 1995; 311 : 376 (5 August 1995).
[16] Karin D. Knorr, “The Nature of Scientific Consensus and the Case of the Social Sciences”, in
Karin D. Knorr, Karin Knorr-Cetina, Hermann Strässer, Hans-Georg Zilian, “Determinants and
controls of scientific development”, Institut für Höhere Studien und Wissenschaftliche Forschung
(Vienna, Austria), pp 227-256, 1975.
[17] A. Fink, J. Kosecoff, M. Chassin, and R. Brook, “Consensus Methods: Characteristics and
Guidelines for Use”, AJPH September 1984, Vol. 74, No. 9.
[18] Margaret R. K. Zimmerman, “The Consensus on the Consensus: An Opinion Survey of Earth
Scientists on Global Climate Change”, Dissertation, 2008.
[19] North Eastern Forensics Exchange, Georgetown University, 8/13 – 8/14, 2010.
[20] Forensics Data Base is available at http://calsci.org/ under the “FDB” menu selection.
[21] Edmond Locard and D. J. Larson, “The Analysis of Dust Traces” (in 3 parts), The American
Journal of Police Science, V1 #4, 1930.
[22] A calculator from http://www.raosoft.com/samplesize.html was used to perform this calculation,
based on the Z value method, which is imprecise at sample sizes under 30, but close enough for the
purposes applied.
[23] Lenth, R. V. (2006-9). Java Applets for Power and Sample Size [Computer software]. Retrieved
2010-09-27 from http://www.stat.uiowa.edu/~rlenth/Power.
[24] Cole, Simon A. “Out of the Daubert Fire and Into the Frying Pan? Self-validation, meta-expertise,
and the admissibility of Latent Print Evidence in Frye Jurisdictions”, Minn. Journal of Law, Science,
and Technology, V9#2, pp 453-541, 2008.
[25] Bar-Anan, Yoav; Wilson, Timothy D.; Hassin, Ran R., “Inaccurate self-knowledge formation as a
result of automatic behavior.”, J. of Experimental Social Psychology, V46, #6, pp 884-895, 2010
[26] F. Cohen, J. Lowrie, and C. Preston, “The State of the Science of Digital Evidence Examination”,
IFIP TC11, Jan 2011.
[27] F. Cohen, “Putting the Science in Digital Forensics“, Journal of Digital Forensics, Security and
Law, Vol. 6(1) 7 Column 1, July, 2011.
A PROPOSAL FOR INCORPORATING PROGRAMMING BLUNDER AS IMPORTANT
EVIDENCE IN ABSTRACTION-FILTRATION-COMPARISON TEST¹
P. Vinod Bhattathiripad
Cyber Forensic Consultant
Polpaya Mana, Thiruthiyad
Calicut-673004
Kerala, India
Telephone: +91-495-2720522, +91-94470-60066 (m)
E-mail: [email protected]; [email protected]
ABSTRACT
This paper investigates an unexplored concept in Cyber Forensics, namely, a Programming Blunder.
Programming Blunder is identified as a variable or a code segment or a field in a database table, which
is hardly used or executed in the context of the application or the user’s functionality. Blunder genes
can be found in many parts of any program. It is the contention of this paper that this phenomenon of
blunders needs to be studied systematically from its very genetic origins to their surface realizations in
contrast to bugs and flaws, especially in view of their importance in software copyright infringement
forensics. Some suggestions as to their applicability and functional importance for cyber forensics are
also given, including the vital need and a way to incorporate programming blunders into the Abstraction-Filtration-Comparison test, the official software copyright infringement investigation procedure of the US judiciary.
Keywords: Bug, error, blunder, genes, software piracy, software copyright, software copyright
infringement, software piracy forensics, AFC, idea-expression dichotomy
1. INTRODUCTION
A programming flaw occasionally survives in well tested and implemented software. It can surface in
the form of a variable or a code segment or a field in a database table, which is hardly used or
executed in the context of the application or the user’s functionality. Such a flaw in design can be
called a Programming Blunder² (Bhattathiripad and Baboo, 2009, 2011). The term programming
blunder has already been casually used (in many publications, for instance, in (McConnell, S., 1996))
to denote bad practices in programming.
The phenomenon of blunder needs to be specifically contrasted with a programming error, as unlike an
error, the blunder is most unlikely to cause problems during the execution. Ideally, all blunders (like
all errors) in software should be and are routinely removed at the various quality control stages of the
software development. Even if it (unfortunately) makes it through all quality control stages, there is
again only a slim chance that it will be detected and removed at the implementation stage. Even so, occasionally,
a few programming blunders may survive all these stages of software development and may finally
appear unattended (or unnoticed) in the implemented software. Despite their apparent status as
¹ This paper is an enhanced form of the paper “Software Piracy Forensics: Programming Blunder as an important
evidence” that was accepted as a short paper (but not presented) in the Third ICST International Conference on Digital
Forensics and Cyber Crime, Dublin, Ireland, 26 - 28 October, 2011. Also, this paper contains points extracted from the
author’s Ph D thesis “Judiciary-friendly software piracy forensics with POSAR”.
² This type of programming flaw has been christened as “Programming Blunder” because the very existence of it (in a well-tested and implemented program) is a mark of blunder-like weaknesses of the respective programmer / quality engineer. The
naming is done from the forensic perspective of such programming flaws.
harmless vestiges of inattentive vetting, these blunders do provide an important service to the cyber
forensic expert. They can form an important basis for providing evidence in case of an allegation and
ensuing investigation of copyright infringement of such software. It is this increased cyber forensic
importance (despite their being less important in areas such as software engineering and software
testing) that underscores the need to specifically understand and study them, not just in the cyber
forensic perspective but right from their definitional aspects.
2. OBJECTIVES OF THIS PAPER
Spafford and Weeber (1992) have already anticipated the importance of (blunder-like) execution paths
as cyber forensic evidence in “providing clues to the author (of the suspect / alleged program)” and
this anticipation is the point of departure for this study. The emergent concept of programming blunders (in this paper) is a natural outcome of a specific study of all such execution paths in the context of software piracy³. The objectives of this paper can be set thus: (1) to thoroughly investigate
the phenomenon of blunders in detail and by doing so attempt to denotationally concretize and define
the term “programming blunder”; (2) to discretely identify the study of programming blunders as
different from other software bugs and (3) to discuss the cyber forensic importance of programming
blunders in the investigation of an allegedly pirated (copyright infringed) software.
3. DEFINING THE TERM PROGRAMMING BLUNDER
The term programming blunder has already been introduced and identified (but not properly defined)
in some previous publications (Bhattathiripad and Baboo, 2009; 2011). Additionally, without using the
term “Programming Blunder”, Spafford and Weeber (1992) have already mentioned certain execution
paths (as said above) of a code that “cannot be executed”.
A common factor found when analyzing student programs and also when
analyzing some malicious code is the presence of code that cannot be executed.
The code is present either as a feature that was never fully enabled, or is
present as code that was present for debugging and not removed. This is
different from code that is present but not executed because of an error in a
logic condition—it is code that is fully functional, but never referenced by any
execution path. The manner in which it is elided leaves the code intact, and
may provide some clue to the manner in which the program was developed.
(Spafford and Weeber, 1992)
By taking a cue from Spafford and Weeber, can one define programming blunder as “any execution
path in the program that need not and so will not be executed during the lifetime of the program on
any execution platform”? Possibly not, because, such a definition has an inherent limitation in that it
considers only inoperative statements (non-executed path) in the program. It overlooks and excludes
those operative statements (executed paths) which are very much still present there but are not
necessary for the successful functioning of the program. That means, it excludes those statements
which may have been executed at some stage of the program but are not necessary for producing the
final result. In other words, it does not consider those operative statements which are incoherently,
redundantly and/or dysfunctionally appearing in the text of the program and/or which may have been
executed at some stage but are hardly used in the user’s functionality (or to arrive at the final results).
So, programming blunders turn out to be a lot more than what Spafford and Weeber had suggested.
Like Spafford and Weeber (1992), several other researchers also have already mentioned
programming flaws of this genre (without using the term programming blunder) and studied their
importance in software testing, author identification and other software forensic areas. For instance,
³ In this article, the term ‘piracy’ refers to the piracy of a copyrighted software.
through a recent comprehensive paper⁴ (Hayes and Offutt, 2010), Jane Huffman Hayes and Jeff Offutt
examine (among other things) whether lint⁵ (a static analyzer that detects poor programming practices
such as variables that are defined but not used) for a program can be used to identify the author or a
small set of potential authors of the program. So, the notion of a programming blunder may not
entirely be new. Nevertheless, none of the previous publications (where the concept of programming
blunder was used in the research related to software testing, author identification and other software
forensic areas) have tried to seriously explore the concept in some detail in an effort to denotationally
concretize / crystallize the term programming blunder, and so differentiate it from other programming
bugs and finally, study its forensic importance. This is the reason for setting the primary objective of
this paper, viz. to thoroughly investigate the phenomenon of blunders in detail and by doing so attempt
to concretize and define the term “programming blunder”.
Even though the existing definitions of “programming blunders” subsume execution paths of code
that “cannot be executed” (Spafford and Weeber, 1992) and variables that are defined but not used
(Hayes and Offutt, 2010), a more cautious definition employed in this study is:
A programming blunder found (in well tested, implemented and allegedly
pirated software) can be defined as a variable in a program or a program code
segment or a field in a database table which is hardly executed in the context of
the application and/or is unnecessary for the user’s functionality
(Bhattathiripad and Baboo, 2009).
This definition subsumes not only the execution paths of code that “cannot be executed” and
variables that are defined but not used, but also unnecessary non-execution paths (like comment lines
and blocked execution paths).
A blunder in a program gains significance during the piracy forensics (or copyright infringement
forensics) of the program (see below).
4. GENETIC ORIGIN OF PROGRAMMING BLUNDERS
A proper investigation of a blunder, like that of any organism, should ideally start with a look into its
genetic origin. Blunder genes6 (or genes of programming blunders) are those elements in the program
that can often form the basis for (the existence or surfacing of) a programming blunder. Blunder genes
can be traced to many parts of the program, such as a variable, class, object, procedure, function, or
field (in a database table structure). A blunder gene is developmentally (and perhaps philologically)
different from a blunder, just as an embryo is different from a baby. While every blunder gene has
significance in software engineering research, a blunder has additional forensic significance. What the
programmer leaves in the program is a blunder gene, and this blunder gene can develop into and
surface as a blunder during the piracy forensic analysis (or copyright infringement forensic analysis).
What elements in the program can then form the genetic basis of a blunder? The simple answer is that
any item in (or a segment of) a program which is unnecessary or redundant to customer requirements
can form the genetic basis for a programming blunder. Such items can, however, surface in a program
in three different ways. In other words, programming blunders can be categorized in three ways
according to their genetic differences.
1. Any defined but unused item (or code segment) in a program.
2. Any defined item (in the program) which is further used for data entry and calculation
but never used for the user’s functionality of the program.
3. Any blocked operative statement in a program.

4 A note of gratitude to the reviewers of the Third ICST International Conference on Digital Forensics and Cyber Crime for drawing my attention to this paper.
5 The UNIX utility lint is commonly used to bring out flaws like programming blunders as compiler warnings.
6 My sincere gratitude to Dr. P. B. Nayar, Lincoln University, UK, for his valuable suggestions.
Primarily, any defined but unused variable, class, object, procedure, function, or field (in a database
table structure) can appear as a programming blunder. Hence, any such defined, concrete, tangible
item (or blunder gene) in the body of a program (or in an external routine or database that is part of the
software) which is subsequently found unnecessary or irrelevant to the operation of the program /
software can evolve or materialize as a programming blunder during a forensic analysis of the
program. Thus, a programming blunder may be an item (or a segment of a program) that is well
defined at the beginning of an execution path in a program but is not part of the remaining stages of
the execution path (for example, the processing or reporting stages). For instance, the integer variable
‘a’ in the C program given in Table 1 is a programming blunder, as this variable is not used anywhere
else in the program and has no relevance to the operation (to producing the intended output) of the
program.
#include <stdio.h>
#include <conio.h>
main()
{
int a=0, b=2, c=0; /* 'a' is defined but never used */
c=b*b;
printf("The result is %d", c);
getch();
}
Table 1. A defined but unused variable ‘a’ in a C-program
Secondly, any defined item (or blunder gene) at the beginning of an execution path in a program
which is further used for data entry but never used in the remaining stages of the execution path can
also appear as a programming blunder during a forensic analysis of the program. Thus, the integer
variable ‘a’ in the C program given in Table 2 surfaces as a programming blunder, as this variable has
been defined and used for data entry but is not used anywhere else in the program.
#include <stdio.h>
#include <conio.h>
main()
{
int a=0;
scanf("%d", &a); /* reading the value of a*/
printf("Hello, World");
getch();
}
Table 2. Unnecessary declaration and input statements in a C-program
Thirdly, a blocked operative statement (or a commented-out operative programming statement), which
is practically an inoperative element of the program, can appear as a programming blunder. Thus, the
remark statement (or blunder gene) /*int a=0;*/ in the C program given in Table 3 can turn out to be
a programming blunder during a forensic analysis of the program, as this statement need not be there
in the first place and does not justify remaining there, long unattended (unlike the other programming
remark /* This program prints “Hello, World” */, which has a purpose in the program).
#include <stdio.h>
#include <conio.h>
main()
{
/* This program prints “Hello, World” */
/*int a=0;*/
printf("Hello, World");
getch();
}
Table 3. A program in C-language
All the above suggests that any defined variable, class, object, procedure, function, or field (in a
database table structure) in a program which has no relevance to the final output (the user’s
functionality) of the (well-tested and implemented) program can manifest itself as a blunder during the
copyright infringement forensic analysis of the program.
5. COMMONALITIES AND SIMILARITIES AMONG PROGRAMMING BLUNDERS
Irrespective of their genetic origin, all programming blunders do share some features, properties,
attributes and characteristics. A programming blunder
1) is a concrete, tangible item in (or segment of) a program and not a process.
2) can be an execution path which got past the quality control stage, undetected.
3) does not affect the syntax, or sometimes even the semantics, of a program, which makes
it hard to detect.
4) is not necessary to execute the program.
5) is not necessary for user’s functionality.
6) does not justify its being there at all.
7) is a matter related to the design pattern and programming pattern of the software.
6. ETIOLOGY OF PROGRAMMING BLUNDERS
The etiology of programming blunders can be discussed in terms of three different weaknesses of the
programmer / quality engineer. Firstly, his/her inability to completely remove from the program those
elements that do not help meet customer requirements can be a cause of a blunder. Secondly, his/her
failure to completely remove statements that were abandoned as a result of the programmer’s
afterthoughts can also be a cause of a blunder. Thirdly, his/her failure to identify and remove items
that do not contribute to either strong coupling between modules or strong cohesion within any module
of the program can also be a cause of a blunder. These three weaknesses of the programmer / quality
engineer can thus be reasons for programming blunders.
7. PROGRAMMING BLUNDERS JUXTAPOSED WITH SOFTWARE BUGS
The next objective of this paper is to establish the study of programming blunders as different and
discrete from that of other software bugs. Quite naturally, even experts in software engineering
might need some convincing as to why programming blunders need or demand a status distinct from
that of software bugs. There is a genuine need to be convincing because the above-mentioned genetic
origins, manifestations, features, properties, attributes, characteristics and etiology of blunders may
prima facie be identified with those of software bugs as well. Therefore, what makes a programming
blunder deserve special consideration, different from other software bugs?
A software bug is the common term used to describe an error, flaw, mistake, failure, or fault in a
computer program or system that produces an incorrect or unexpected result, or causes it to behave in
unintended ways (IEEE610.12-1990, 1990). An error is “the difference between a computed,
observed, or measured value or condition and the true, specified, or theoretically correct value or
condition” (IEEE610.12-1990, 1990, p31). Other related terms in the computer science context are
fault, failure and mistake. A fault is “an incorrect step, process, or data definition in a computer
program” (IEEE610.12-1990, 1990, p31). A failure is “the [incorrect] result of a fault” (IEEE610.121990, 1990, p31) and mistake is “a human action that produces an incorrect result” (IEEE610.12-1990,
1990, p31). Most bugs arise from mistakes, errors, or faults made by people or flaws and failures in
either a program's source code or its design, and a few are caused by compilers producing incorrect
code7. A programming blunder (as defined at the beginning of the article) does not resemble a bug
either in raison d’être or function (see above). In other words, a software bug is different from a
programming blunder and this difference (which is significantly relevant for the forensic expert) may
look simple, but is by no means simplistic.
8. PROGRAMMING BLUNDERS AND THE IDEA-EXPRESSION DICHOTOMY
The idea-expression dichotomy (Newman, 1999) provides an excellent theoretical perspective to look
at and explain blunders. Any genuine idea which is properly expressed in the definition stage but
improperly (or not at all) expressed in the remaining stages (in a program) in a manner that does not
adversely affect the syntax (or sometimes even the semantics) of the program can become a
programming blunder. So, the integer variable ‘a’ in the C-program given in Table-2, for example,
when looked at in the idea-expression perspective, is a programming blunder. So, from the perspective
of the idea-expression dichotomy, programming blunder is a partly-made8 functional expression of an
idea. This clearly opens the door to linking blunders directly to copyright infringements of any
program because the idea-expression perspective is the basis of formulation of software copyright
laws in several countries (Hollaar, 2002; Newman, 1999).
Copyright laws of several countries (especially the US copyright laws) say that if there is only one
(exclusive) way of effectively expressing an idea, this idea and its expression tend to “merge”
(Walker, 1996, p83) and in such instances an idea is not protectable through copyright (Hollaar, 2002).
However, if the same idea can be realized through more than one expression, all such different
realizations are protected by copyright laws. Interestingly this means that the copyright of a program is
directly related to the concept of the merger between idea and expression and that when there is no
merger, the copyright of a program can extend to the blunders contained therein as well.
9. FORENSIC IMPORTANCE OF PROGRAMMING BLUNDERS
Despite their apparently functionally inactive and thus innocuous nature in a program, blunders, when
copyrighted, can be of great value and assistance to the cyber forensic expert. They provide evidence
of software copyright infringement, and a discussion of this evidence is one of the prime objectives of
this article. On the forensic importance of programming blunders, Spafford and Weeber (1992) have noted
that:
Furthermore, it (a programming blunder) may contain references to variables
and code that was not included in working parts of the final program —
possibly providing clues to the author and to other sources of code used in this
program.
7 http://en.wikipedia.org/wiki/Software_bug, visited on 6th Feb, 2011.
8 By partly-made functional expression, what is meant or intended is an element which is defined and implemented but left unused or inoperative in the remaining stages.
Since Spafford and Weeber (1992), a variety of experiments (for instance, Krsul (1994)) have been
performed on authorship analysis of source code and on establishing copyright infringement of
software. Also, at least half a dozen techniques and procedures (for instance, AFC (Hollaar, 2002) and
SCAP (Frantzeskou, 2007)) have been put forward for establishing authorship and copyright
infringement of software. However, none of them has taken a cue from the above note of Spafford and
Weeber and considered programming blunders as evidence in court.
Not all blunders are “substantial” in a copyright infringement forensic analysis. Blunders which can
provide direct (substantial) evidence to establish piracy (and thus copyright infringement), or which
can provide probable, corroborative or supporting evidence, are forensically important. The
repetition of programming blunders (even if these blunders are ‘generic’ by nature) in identical
contexts in both the original9 and the pirated10 would be a serious indication of piracy. If, for
instance, a variable with a universally uncommon name ‘PXRN_CODE_AUCQ CHAR[6]’ is defined
in identical places of identical procedures in the original11 as well as the pirated12 software, but
not used anywhere else (see the three categories of blunders, above), that is an instance of a
programming blunder, and such blunders deserve forensic attention. The forensic importance of
blunders arises from the obvious fact that it is highly unlikely that two programmers will blunder in
exactly the same way, thus elevating the similarity into possible evidence of piracy (see also
Appendix 1).
While trying to establish a variable name or a code segment as a programming blunder, the expert
needs to confirm that it is (i) absent elsewhere in the original, (ii) present in the allegedly pirated
exactly in the same context as it was found in the original and (iii) absent elsewhere in the allegedly
pirated (Bhattathiripad & Baboo, 2010).
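As an aid to visualizing these three confirmation steps, the following minimal C sketch (prepared for this discussion and not part of any cited procedure; the file names are purely illustrative, and the identifier simply reuses the example variable name given earlier) counts the occurrences of a suspect identifier in the original and in the allegedly pirated source files. A defined-but-unused identifier would normally appear only at its definition, so roughly one occurrence per file is consistent with conditions (i) and (iii); condition (ii), identical context, must still be verified manually by the expert.

#include <stdio.h>
#include <string.h>

/* Count occurrences of a token in a text file (naive substring match). */
static int count_occurrences(const char *path, const char *token)
{
    FILE *fp = fopen(path, "r");
    if (fp == NULL) {
        fprintf(stderr, "cannot open %s\n", path);
        return -1;
    }
    char line[4096];
    int count = 0;
    while (fgets(line, sizeof line, fp) != NULL) {
        const char *p = line;
        while ((p = strstr(p, token)) != NULL) {
            count++;
            p += strlen(token);
        }
    }
    fclose(fp);
    return count;
}

int main(void)
{
    /* Illustrative inputs: the suspect identifier and the two code bases. */
    const char *token    = "PXRN_CODE_AUCQ";   /* candidate blunder from the text   */
    const char *original = "original.c";       /* complainant's source (hypothetical) */
    const char *pirated  = "pirated.c";        /* allegedly pirated source (hypothetical) */

    int in_original = count_occurrences(original, token);
    int in_pirated  = count_occurrences(pirated, token);

    printf("'%s': %d occurrence(s) in the original, %d in the allegedly pirated\n",
           token, in_original, in_pirated);

    /* One hit per file is consistent with conditions (i) and (iii);
       condition (ii), identical context, is left to the human examiner. */
    if (in_original == 1 && in_pirated == 1)
        printf("Occurrence pattern is consistent with a repeated blunder; "
               "verify the surrounding context manually.\n");
    return 0;
}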
In the absence of other direct evidence of piracy, programming blunders can form the only basis for
the expert to convince the judiciary about the possibility of piracy.
9 Throughout this article, original means the version of the software that the complainant submits to the law enforcement agency for software piracy forensics. This article presupposes that the law enforcement agency has satisfactorily verified the legal aspects of the documentary evidence of copyright produced by the complainant and is convinced that the complainant is the copyright holder of this version of the alleged software.
10 Throughout this article, pirated means the allegedly pirated software.
Fig 1: The software copyright infringement forensic procedure of AFC13

13 Abstraction-Filtration-Comparison Analysis Guidelines for Expert Witnesses, http://web20.nixonpeabody.com/np20/np20wiki/Wiki%20Pages/Abstraction-FiltrationComparison%20Analysis%20Guidelines%20for%20Expert%20Witnesses.aspx
10. PROGRAMMING BLUNDERS AND THE AFC TEST
As things stand, it appears that forensic procedural protocols for software copyright infringement have
not fully recognized the importance of programming blunders, nor have there been many attempts to
ensure their inclusion in the forensic expert’s repertoire. Programming blunders are very unlikely to be
detected, for instance, by the AFC test (see Appendix 3 and Fig 1), which is the only judiciary-accepted
procedure for establishing software copyright infringement in the US. They are not detected, or even
considered, because during the abstraction of the program only the functionally active parts (of the
program) are considered for abstraction and used for further investigation (Hollaar, 2002, p86). The
functionally inactive parts (those items irrelevant to the user’s functionality), like programming
blunders, will not be considered for abstraction14. Moreover, abstraction may remove many individual
lines of code from consideration, and so blunder genes as well as blunders may also be removed
automatically during the abstraction stage of AFC. In such a case of non-consideration, the
programming blunders will not be available for final comparison, and this unavailability may make the
final results of AFC incomplete and thus unreliable. This is one of the fallibilities of AFC
(Bhattathiripad and Baboo, 2010) and can probably be a reason for the reluctance to use the AFC test
in the copyright infringement forensics of modern software (however, the US judiciary continues to
use the AFC test for software copyright infringement forensics15). Hence, this paper proposes that,
along with the AFC results, any evidence concerning programming blunders should also be identified
and gathered separately by the expert (by comparing the original with the pirated before applying the
AFC test) and reported, with his/her final findings and inferences, to the court16. This paper also
proposes that the AFC test should be re-designed so as to ensure that possible evidence like
programming blunders is available for final comparison.
Before concluding, a note on what a judge expects from an AFC expert will add value to the special
attention and consideration given to programming blunders. In any software comparison report, what
the judge expects from the AFC expert is a set of details that help the court arrive at a decision on the
copyrightable aspects of programming blunders. Jon O. Newman (1999), one of the judges on the
panel (in the 2nd Circuit of the U.S. judiciary) that received an amicus brief17 (concerning a software
copyright infringement case), indicates what he needs from expert witnesses in a trial (or in the amicus
brief) on an appeal involving software copyright infringement:
These professionals would be well advised not to tell me simply that the source code
is or is not protectable expression. Their opinions are relevant, but, as with all opinions,
what renders them persuasive is not the vehemence of their assertion and not even the
credentials of those asserting them; it is the cogency and persuasive force of the
reasons they give for their respective positions. These reasons had better relate to the
specifics of the computer field. For example, as Altai (United States Court of Appeals,
1992) indicates, even with its overly structured mode of analysis, it will be very
important for me to know whether the essential function being performed by the
copyrighted program is a function that can be accomplished with a variety of source
codes, which will strengthen the case for protection, or, on the other hand, is a
function capable of execution with very few variations in source code, or variations
of such triviality as to be disregarded, in which event protection will be unlikely. For
me, this mode of analysis is essentially what in other contexts we call the merger
doctrine – the expression is said to have merged with the idea because the idea can be
expressed in such limited ways that protection of the plaintiff’s expression unduly
risks protecting the idea itself. (Newman, 1999)

14 Because the AFC test does allow for ‘literal’ similarities between two pieces of code, there is a general belief that AFC can make available “literal” pieces like programming blunders (for example, variables that are defined but unused and execution paths that “cannot be executed”) for final comparison. But in practice, AFC either does not abstract these “literal” pieces of evidence or filters them out in the filtration stage; either way, such programming-blunder-like “literal” pieces are unavailable for final comparison.
15 See United States District Court of Massachusetts (2010), Memorandum and Order, Civil Action number 07-12157 PBS, Real View LLC. vs. 20-20 Technologies, p.2.
16 The ultimate decisions, such as whether these pieces of evidence (concerning programming blunders) (1) are useful or not, (2) form direct or secondary evidence, and (3) are generic (and hence non-copyrightable) by nature or not, are judicial matters and are subject to the provisions of the prevailing Evidence Act of the respective country. However, the cyber forensic report can remain suggestive on these three aspects and also on the (conclusive, corroborative, and supportive) nature of the programming blunders found in the original and the pirated.
17 Amicus Brief of Computer Scientists, Harbor Solutions v. Applied Systems No. 97-7197 (2nd Circuit, 1998) at 8-9 (citations omitted).
So, what is expected in the case of programming blunders is likewise a set of details on the merger
aspects of the ideas and expressions contained in any programming blunder. In conclusion, the
following points emerge. It seems reasonable to state that any variable, code segment or field in a
database table which is hardly used or executed in the context of the application or the user’s
functionality can form the basis of a programming blunder. The copyright of the software can often
extend to the blunders contained therein. Also, any programming blunder that does not fall under the
merger doctrine is copyrightable, and its repetition in another program can be a serious indication of
copyright infringement. As a result, programming blunders can greatly help the expert to give
supporting evidence for his or her findings, thus indirectly contributing to the judge’s decision.
Because of this, the identification and reporting of programming blunders need to be part of the
copyright infringement investigation. Hence, procedures like the AFC test need to be re-designed so
as to ensure that possible evidence like programming blunders is not filtered out.
11. REFERENCES
Bhattathiripad, P. V., and Baboo, S. S. (2009), Software Piracy Forensics: Exploiting Nonautomated
and Judiciary-Friendly Technique, Journal of Digital Forensic Practice, 2:4, pages 175-182
Bhattathiripad, P. V., and Baboo, S. S. (2011), Software Piracy Forensics: Impact and Implications of
post-piracy modifications, Conference on Digital Forensics, Security & Law, Richmond, USA
Bhattathiripad, P. V., and Baboo, S. S. (2010), Software Piracy Forensics: The need for further
developing AFC, 2nd International ICST Conference on Digital Forensics & Cyber Crime, 4-6
October 2010, Abu Dhabi, UAE
European Software Analysis Laboratory (2007), The “SIMILE Workshop”: Automating the detection
of counterfeit software, available at www.esalab.com
Frantzeskou, G., Stamatatos, E., Gritzalis, S., Chaski, C. E., and Howald, B. S. (2007), Identifying
Authorship by Byte-Level N-Grams: The Source Code Author Profile (SCAP) Method, International
Journal of Digital Evidence, 6, 1
Hayes, J. H., and Offutt, J. (2010), Recognizing authors: an examination of the consistent programmer
hypothesis, Software Testing, Verification & Reliability - STVR , vol. 20, no. 4, pp. 329-356
Hollaar L. A. (2002), Legal Protection Of Digital Information, BNA Books
IEEE 610.12-1990 (1990), IEEE Standard Glossary of Software Engineering Terminology
Krsul, I., (1994), Identifying the author of a program, Technical Report, CSD-TR-94-030, Purdue
University, available at http://isis.poly.edu/kulesh/forensics/docs/krsul96authorship.pdf , visited
on 14th April, 2011
McConnell, S. (1996), Who cares about software construction?, Software, IEEE, Volume 13, Issue 1,
Jan 1996, p.127-128
Newman J. O. (1999), New Lyrics For An Old Melody, The Idea/Expression Dichotomy In The
Computer Age, 17, Cardozo Arts & Ent. Law Journal, p.691
Raysman R. and Brown P. (2006), Copyright Infringement of computer software and the Altai test,
New York Law Journal, Volume 235, No. 89
Spafford, E. H., and Weeber, S. A. (1992), Software forensics: Can we track the code back to its
authors?, Purdue Technical Report CSD–TR 92–010, SERC Technical Report SERC–TR 110–P,
Department of Computer Sciences, Purdue University
United States Court of Appeals (1992), Second Circuit, Computer Associates International, Inc. v.
Altai, Inc., 982 F.2d 693; 1992 U.S. App. LEXIS 33369; 119 A.L.R. Fed. 741; 92 Cal. Daily, Op.
Service 10213, January 9, 1992, Argued, December 17, 1992, Filed
United States District Court of Massachusetts (2010), Memorandum and Order, Civil Action number
07-12157 PBS, Real View LLC. Vs. 20-20 Technologies, p.2
Walker, J. (1996), "Protectable 'Nuggets': Drawing the Line Between Idea and Expression in Computer
Program Copyright Protection", Journal of the Copyright Society of the USA, Vol. 44, Issue 79
APPENDIX-1: SECONDARY OR INCONCLUSIVE PROGRAMMING BLUNDER GENES
Most genes of programming blunders can be conclusively identified. However, there are certain
elements in a program that may not yet have surfaced as blunders but are prone to surfacing later
as programming blunders. Conversely, ideas which are successfully expressed but are superfluous
to customer requirements may also fail to surface as blunders, because such expressions (or code
segments) may affect the semantics of the program and thus end up as errors at some point during
the lifetime of the program (if not during its pre-implementation testing stage). Any such code
segment is basically the gene of an error and not of a blunder. However, any such code segment in
a time-tested program may eventually form the basis of a blunder, because such an item or code
segment does not justify remaining there, long unattended. But such blunders are very difficult to
conclusively identify and use.
APPENDIX-2: BUG REPEATED V. BLUNDER REPEATED
A bug repeated (in a pirated program) may be noticed during the functioning of the pirated
software. Unlike a bug, a programming blunder repeated in a pirated program, though noticeable,
would remain unnoticed during the functioning of the pirated software. In the absence of a
thorough quality control round on the pirated software (the absence of which is very likely), these
programming blunders would remain intact in the pirated software and may turn into evidence of
piracy.
APPENDIX-3: A NOTE ON ABSTRACTION-FILTRATION-COMPARISON TEST
AFC (Abstraction-Filtration-Comparison) test was primarily developed by Randall Davis
of the Massachusetts Institute of Technology for evaluating copyright infringement
claims involving computer software and used in the 1992 Computer Associates vs. Altai
case, in the Court of Appeals for the Second Circuit of the United States (Walker, 1996).
Since 1992, AFC has been recognized as a legal precedent for evaluating copyright
infringement claims involving computer software in several appeal courts in the United
States, including the fourth, tenth, eleventh and federal circuit courts of appeals
(European Software Analysis Laboratory, 2007; Raysman & Brown, 2006; United States
District Court of Massachusetts, 2010). AFC is basically a manual comparison approach.
The theoretical framework of AFC not only makes possible the determination of “literal”
similarities between two pieces of code, but it also takes into account their functionality
to identify “substantial” similarities (Walker, 1996). In the AFC test, both the pirated as
well as the original software are put through three stages, namely, abstraction, filtration
and comparison. While other approaches (and the automated tools based on these
approaches) generally compare two software packages literally, without considering
globally common elements in the software, AFC, as the name indicates, first abstracts the
original as well as the allegedly pirated, then filters out the globally common elements in
them to zero in on two sets of comparable elements and finally compares these two sets
to bring out similarities or “nuggets” (Walker, 1996).
On the copyright aspects of the software, the AFC-test specifies three categories (more
aptly, levels) of exclusions, called doctrines (Walker, 1996). Firstly, if there is only one
way of effectively expressing an idea (a function), this idea and its expression tend
to “merge”. Since the idea itself would not be protectable, the expression of this idea
would also escape from the field of the copyright protection. Secondly, there is the
“scènes à faire” doctrine, which excludes from the field of protection any code that has
been made necessary by the technical environment or some external rules imposed on the
programmer. Thirdly, there are those elements that are in the public domain. These three
categories of elements in the software are filtered out of the original as well as the
allegedly pirated before arriving at the two sets of comparable elements mentioned above.
THE XBOX 360 AND STEGANOGRAPHY: HOW CRIMINALS
AND TERRORISTS COULD BE “GOING DARK”
Ashley Podhradsky
Drexel University
Rob D’Ovidio
Drexel University
Cindy Casey
Drexel University
ABSTRACT
Video game consoles have evolved from single-player embedded systems with rudimentary
processing and graphics capabilities to multipurpose devices that provide users with parallel
functionality to contemporary desktop and laptop computers. Besides offering video games with rich
graphics and multiuser network play, today's gaming consoles give users the ability to communicate
via email, video and text chat; transfer pictures, videos, and files; and surf the World Wide Web.
These communication capabilities have, unfortunately, been exploited by people to plan and commit a
variety of criminal activities. In an attempt to cover the digital tracks of these unlawful undertakings,
anti-forensic techniques, such as steganography, may be utilized to hide or alter evidence. This paper
will explore how criminals and terrorists might be using the Xbox 360 to convey messages and files
using steganographic techniques. Specific attention will be paid to the "going dark" problem and the
disjoint between forensic capabilities for analyzing traditional computers and forensic capabilities for
analyzing video game consoles. Forensic approaches for examining Microsoft's Xbox 360 will be
detailed and the resulting evidentiary capabilities will be discussed.
Keywords: Digital Forensics, Xbox Gaming Console, Steganography, Terrorism, Cyber Crime
1. INTRODUCTION
The use of nontraditional computing devices as a means to access the internet and communicate with
one another has become increasingly popular [1]. People are not just going online through traditional
means with a PC anymore, they are now frequently using cell phones, smart phones, and gaming
consoles as well. Criminals and terrorists also rely on these technologies to communicate while
maintaining confidentiality and anonymity. When information-masking techniques are combined with
non-traditional communication devices, the chances of interception or discovery are significantly
reduced. Hiding information in plain sight by altering image, text, or sound data has been going on for
centuries. Steganography, the discipline of concealing the fact that a message or some form of
communication exists, poses a major threat to our national security particularly when it is being
transmitted over exploitable communication channels [2].
2. STEGANOGRAPHY
Steganography is often confused with cryptography, the latter being the art of obscuring a message so
that it is meaningless to anyone except the person it is intended for [3]. Essentially, a cryptographic
message hides the meaning of a message, whereas steganography conceals the fact that a message
even exists [3]. The word steganography comes from the Greek steganos (στεγανός), meaning
“covered”, and graphia (γραφή), meaning “writing” [4]. Unlike encryption, which scrambles or
encodes text, with steganography the text is inserted or hidden in another medium such as a
photograph, webpage, or audio file, called the carrier file. The goal of concealing the message is to
keep the communication a secret even though it is in plain view. IT security professionals refer to this
concept as security through obscurity. Unlike cryptography, where an encrypted channel sends a red
flag to digital investigators, steganography offers covert communication channels which typically go
unnoticed.
Although today’s technologies provide a multiplicity of avenues in which to conceal messages,
steganography is not new [5]. In fact, the origins of steganography can be traced back to the ancient
Greek historian Herodotus (c. 484-c 425 B.C.) [6]. In his Histories, Herodotus describes how a secret
message was tattooed onto a slave’s head. Once the slave’s hair grew back, the message was
concealed, thus enabling him to travel through enemy territory without the communication being
discovered. Once the slave arrived at his destination, his head was shaved and the message read [6].
Another method of steganography described by Herodotus is a technique employed by the King of
Sparta when he needed to send covert messages to the Greeks. The message was written onto a
wooden tablet that was then covered with wax so that it appeared empty [7]. Perhaps the most well-known form of steganography is invisible ink, which became popular during the Second World War.
Invisible ink can be synthetic, like the invisible ink pens many of us used to write secret messages
with as children, or organic, such as body fluids or lemon juice.
There are two primary methods of steganography. The first is referred to as insertion and involves
taking data from one file (the secret) and embedding it into another file (the host or carrier). With
insertion, the size of the image changes once the hidden message has been added [8].
The second form of steganography is called substitution. Here bits of the host file are replaced with
other bits of information. For example, in an 8-bit graphic file, the bits furthest to the left are
referred to as the Most Significant Bits (MSB), while the bits furthest to the right are the Least
Significant Bits (LSB). By replacing the least significant bits, the pixel changes the least. So a byte
which might read 11110000 can be changed to 11110001, and the effect on the image will be
minuscule, or undetectable to the human eye [8].
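To make the substitution idea concrete, the following minimal C sketch (written for this discussion and not taken from Invisible Secrets or any other tool) overwrites the least significant bit of successive cover bytes with the bits of a short message and then reads them back; in a real carrier the cover bytes would simply be the raw pixel data of an image.

#include <stdio.h>
#include <string.h>

/* Embed each bit of `msg` (including its terminating '\0') into the least
   significant bit of successive cover bytes. Returns 0 on success. */
static int embed_lsb(unsigned char *cover, size_t cover_len, const char *msg)
{
    size_t bits_needed = (strlen(msg) + 1) * 8;
    if (bits_needed > cover_len)
        return -1;                      /* cover too small for the message */
    for (size_t i = 0; i < bits_needed; i++) {
        int bit = (msg[i / 8] >> (7 - (i % 8))) & 1;
        cover[i] = (unsigned char)((cover[i] & 0xFE) | bit);  /* e.g. 11110000 -> 11110001 */
    }
    return 0;
}

/* Recover the message by reading the LSB of each cover byte until '\0'. */
static void extract_lsb(const unsigned char *cover, size_t cover_len,
                        char *out, size_t out_len)
{
    memset(out, 0, out_len);
    for (size_t i = 0; i < cover_len && i / 8 < out_len - 1; i++) {
        out[i / 8] = (char)((out[i / 8] << 1) | (cover[i] & 1));
        if (i % 8 == 7 && out[i / 8] == '\0')
            break;                      /* stop at the embedded terminator */
    }
}

int main(void)
{
    unsigned char cover[256];           /* stand-in for raw pixel bytes */
    for (size_t i = 0; i < sizeof cover; i++)
        cover[i] = (unsigned char)(0xF0 + (i % 8));

    char recovered[32];
    if (embed_lsb(cover, sizeof cover, "next week") == 0) {
        extract_lsb(cover, sizeof cover, recovered, sizeof recovered);
        printf("Recovered hidden message: %s\n", recovered);
    }
    return 0;
}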
Although this may initially sound complex, there is a plethora of tools available on the internet, many
of which are open-source or can be downloaded for a minimal investment or free trial, which makes
substitution steganography a relatively simple task.
3. INVISIBLE SECRETS
Invisible Secrets is an encryption and steganography tool which also provides an array of privacy
options such as erasing internet and computer traces, shredding data to avoid recovery, and locking
computer applications [9]. One of the key attributes of this software is that it affords individuals with
little or no experience the opportunity to utilize the ancient practice of steganography with minimal, if
any, difficulty. Invisible Secrets utilizes both encryption and steganography to create carrier files
which can only be read by the individual they are intended for. By employing cryptography to transfer
the key necessary to unlock the secret message as it traverses over an unsecure connection, potential
Man in the Middle attacks are thwarted.
The following images illustrate how a secret message can be hidden in a JPG photograph. The two
images look identical, however, a closer look shows that they differ in both size and hash value. Using
Invisible Secrets, a short text message was inserted into the photograph. Hidden messages are not just
limited to text - maps can also be embedded. For demonstration purposes, the message was not
encrypted or compressed. Had the message been compressed, the size differences between the two
images would be considerably less, making discovery more challenging.
Creating a hidden message using Invisible Secrets
Image 1 - Inserting the secret message into the photograph
Image 2 - Carrier image created
4. STEGANALYSIS
To the human eye, the two images (titled original and carrier) are identical. However, an initial
examination reveals that they differ in both size and hash value. Another way to determine if an image
has been altered is to compare the histograms of the photos in question. A histogram is a visual
impression or graph of the data distribution of an object obtained by mathematically calculating an
object’s density [10]. It is relevant to note that a histogram depends upon the bin size chosen [11]; if
the bins are chosen poorly, the tabulated data can give misleading or incomplete information about the
data [11]. Although this task is typically performed by a computer, variances should be expected, and
a histogram on its own, independent of other supporting evidence, may not be sufficient to determine
whether an image has been altered.
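As a rough illustration of the underlying idea (this is not the calculation Adobe Photoshop performs), the following C sketch builds a 256-bin brightness histogram over a buffer of 8-bit pixel values and counts how many bins differ between two buffers; the pixel data here is synthetic and merely stands in for the original and carrier images.

#include <stdio.h>
#include <stddef.h>

#define BINS 256

/* Build a 256-bin histogram over 8-bit pixel values. */
static void histogram(const unsigned char *pixels, size_t n, unsigned long bins[BINS])
{
    for (int i = 0; i < BINS; i++)
        bins[i] = 0;
    for (size_t i = 0; i < n; i++)
        bins[pixels[i]]++;
}

int main(void)
{
    /* Synthetic stand-ins for the original and carrier pixel data. */
    unsigned char original[1000], carrier[1000];
    for (size_t i = 0; i < sizeof original; i++) {
        original[i] = (unsigned char)(i % 256);
        carrier[i]  = (unsigned char)((i % 256) ^ (i % 2));  /* LSB disturbed on odd bytes */
    }

    unsigned long h1[BINS], h2[BINS];
    histogram(original, sizeof original, h1);
    histogram(carrier, sizeof carrier, h2);

    /* Report how many brightness levels have different counts. */
    int differing = 0;
    for (int i = 0; i < BINS; i++)
        if (h1[i] != h2[i])
            differing++;
    printf("%d of %d brightness levels differ between the two histograms\n",
           differing, BINS);
    return 0;
}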
Image 3 - Original unaltered image
MD5 checksum 3e8d80d0e03324331215d83cba00caf8
Size 2.31 MB
Image 4 - Carrier image
MD5 checksum a463f9edbeeea630fb320671c5a65895
Size 4.62 MB
By comparing the histograms of the two image files using Adobe Photoshop Elements [12], an
apparent difference between the two is noted. Each histogram consists of precisely 256 bars, each
representing a different level of brightness in the image [13]. The higher the bar on the graph, the more
pixels at that brightness level [13]. When the histograms are placed side by side, we can see that the
smoothness of the distribution, where pixel values sit in relation to each true color and shading
transition, is present only in the original image file [14].
Image 5 - Examining the histogram of the image with Adobe Photoshop
Image 6 - Comparing the two histograms
A common tool used by digital examiners when analyzing suspected steganographic images is a hex
editor. Hex editors enable investigators to examine the raw data of a file at a very granular level. Using
WinHex Hexadecimal Editor [15], the two images were compared. The hexadecimal characters which
denote the beginning and end of a JPG file are “FF D8” and “FF D9” respectively [16]. Within a
matter of seconds we can see that data has been appended to the end of the carrier image file. Further
analysis showed that there was a considerable variance in byte values between the two files. Typically,
forensic examiners are not privy to both images for analysis. While there are some steganalysis tools
available, investigators usually have to rely on more complex methodologies such as looking for
embedded ASCII text, utilizing some type of classification or statistical algorithm such as quadratic
mirror filters or raw quick pairs, or developing a model of what the suspect image should look like
[17,18]. Not only do these techniques require advanced skill, but they are also time consuming and
costly.
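The appended-data check described above lends itself to automation. The following C sketch (an illustration written for this paper, not a feature of WinHex) reads a JPEG file into memory, locates the last FF D8/FF D9 markers' closing FF D9 End-Of-Image marker, and reports how many bytes follow it; the file name carrier.jpg is hypothetical.

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const char *path = "carrier.jpg";        /* hypothetical suspect file */
    FILE *fp = fopen(path, "rb");
    if (fp == NULL) {
        perror(path);
        return 1;
    }

    /* Read the whole file into memory. */
    fseek(fp, 0, SEEK_END);
    long size = ftell(fp);
    rewind(fp);
    unsigned char *buf = malloc((size_t)size);
    if (buf == NULL || fread(buf, 1, (size_t)size, fp) != (size_t)size) {
        fprintf(stderr, "read error\n");
        fclose(fp);
        free(buf);
        return 1;
    }
    fclose(fp);

    /* Find the last End-Of-Image marker (FF D9). */
    long eoi = -1;
    for (long i = 0; i + 1 < size; i++)
        if (buf[i] == 0xFF && buf[i + 1] == 0xD9)
            eoi = i;

    if (eoi < 0)
        printf("No FF D9 marker found; not a well-formed JPEG?\n");
    else if (eoi + 2 < size)
        printf("%ld byte(s) appended after the FF D9 marker -- possible inserted payload\n",
               size - (eoi + 2));
    else
        printf("No data found after the FF D9 marker\n");

    free(buf);
    return 0;
}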
Image 7 - Comparing the two images in WinHex
StegSpy is an open-source utility that searches files for steganography signatures and identifies the
program used to create the file [17]. StegSpy was initially developed to aid in the detection of hidden
messages following the terrorist attacks of September 11, 2001 [18]. According to the developer’s
website, StegSpy is capable of detecting the following steganography programs: Hiderman, JPHide
and Seek, Masker, JPegX, and Invisible Secrets [17]. Although the tool did detect the presence of a
hidden message in the carrier file, it incorrectly identified the program used to create the file.
In his thesis paper, “Using an Artificial Neural Network to Detect the Presence of Image
Steganography”, Aron Chandrababu tested several steganalysis tools, including StegSpy, to determine
their usefulness in detecting covert messages embedded in images. Chandrababu’s research concluded
that StegSpy was unreliable when used to examine a sample of 100 color images - 50 containing
embedded messages, and 50 containing no message at all [18]. Chandrababu’s experiment found that
StegSpy was only able to detect 8 of the 50 carrier images tested with the program [18]. Thus, the
probability that StegSpy is capable of revealing an image containing an embedded message is only
0.16, or 16%.
Image 8 - Examining the carrier file with StegSpy V2.1
Steganalysis, regardless of the technique employed (human calculated or computer generated
algorithm), is an empirical process dependent upon two primary dynamics. First, the algorithm used,
and secondly, the model the dataset is being measured against [19]. A shift in any of these variables
can alter the results. An additional factor investigators must consider is the fact that technology is in
constant flux.
5. TERRORISM AND STEGANOGRAPHY
Shortly after the terrorist attacks in the United States on September 11, 2001, the Washington Post
reported that federal agents had gathered at least three years of intelligence confirming that Osama bin
Laden and members of al-Qa’ida had been communicating through embedded messages via email and
on web sites [20]. Although there was no definitive evidence proving that the terrorists used
steganography to plot the attacks of September 11th, it is highly probable [20].
With over 21 percent of the world’s population communicating by means of the internet, it should not
come as a surprise that terrorists also use the internet to communicate with each other, spread
propaganda, fund their activities, and recruit [21].
In the second issue of Technical Mujahid, a bimonthly digital training manual for Jihadis, section one, titled Covert Communications and Hiding
Secrets inside Images, discusses steganography techniques, specifically hiding messages in images
and audio files [22,23].
Unlike emails, which can be easily traced, steganography enables terrorists to communicate with each
other while maintaining anonymity. Security expert Bruce Schneier uses the analogy of a dead drop to
demonstrate exactly how steganography benefits terrorists [22]. A dead drop is a communication
technique used by accused Russian spy Robert Hanssen. Hanssen never directly communicated with
his Russian cohorts, but rather left messages, documents, or money in plastic bags under a bridge [22].
Chalk lines left in public places, in plain sight, would direct Hanssen where to collect or leave
packages [22]. Consequently, the parties communicating never had to meet or even know each other’s
identity. According to Schneier, dead drops “…can be used to facilitate completely anonymous,
asynchronous communications” [22]. So if we think of steganography as an electronic dead drop [22],
terrorist groups such as al-Qa’ida, Hezbollah, and Hamas are capable of communicating anonymously
and with no shortage of places to leave their virtual chalk lines.
6. CRIME AND THE XBOX 360
The use of nontraditional computing devices as a means to access the internet and communicate with
one another has become increasingly popular [1]. However, average users are not the only ones
reaping the benefits of these evolving technologies, so are the criminals. Gaming consoles, specifically
the Xbox 360, have become a favorite nontraditional computing medium, not only as an instrument to
perform illegal activities but as a target as well.
According to a recent FBI report, Bronx Blood gang members were communicating through Sony’s
PlayStation 3 gaming console while under house arrest [23]. Similar to the Xbox 360, the PS3
provides users with a multiplicity of services which facilitate communication such as chat, social
networking sites, instant messaging, multiplayer gaming, video downloading and sharing, and cross-game chat rooms [23].
New Jersey Regional Operations Intelligence’s Threat Analysis Program reported in September 2010
that as of June 2010 Mara Salvatrucha (MS-13) gang members were conducting criminal activities,
including ordering the murder of a witness, using Microsoft’s Xbox 360 and Sony’s PS3 [24].
Robert Lynch, a 20-year-old Michigan man, was arrested in March 2011 and charged with attempting
to accost school-aged girls for immoral purposes. Lynch used his Xbox 360 gaming console to meet,
befriend, and lure his young victims, who were between 11 and 14 years of age [25].
In January 2011, 36-year-old Rachel Ann Hicks lied about her age to befriend underage boys on
Xbox Live. Once she gained their trust, she sent them illicit photos of herself and X-rated movies.
The California resident drove from her home state to Maryland to molest one 13-year-old boy [26].
Gaming consoles enable gangs and terrorist organizations to communicate internationally while
avoiding detection by the Central Intelligence Agency (CIA) and the National Security Agency/Central
Security Service (NSA/CSS) [27]. Defendants sentenced to house arrest, particularly sex offenders,
are often prohibited from using a computer to access the internet [30,31,32]. However, if gaming
consoles are not prohibited, the offender still has the capability of accessing the internet.
While many gaming consoles exist, Microsoft’s Xbox 360 is the most popular among American
consumers, selling over thirty-nine million consoles, six million more than their top competitor the
PS3 [28]. In October 2011, Microsoft announced plans for integrating their Xbox 360 gaming
dashboard with a pay television feature called Live TV. Live TV will enable Xbox Live users to
access Comcast and Verizon services directly from their gaming consoles [29]. With this rise in
popularity, the Xbox 360 has also become a popular medium for criminals. When Bill Gates first
announced his plans for the original Xbox gaming system in 2000, at the International Consumer
Electronics Show in Las Vegas, some critics proclaimed that the new console was nothing more than
a “...PC in a black box” [30]. These critics were not too far off the mark.
The Xbox 360 is not only similar to a personal computer - it is actually more powerful than most
average personal computers. The hardware and technical specifications of today’s Xbox 360 console
include a detachable 250GB hard drive; an IBM-customized PowerPC-based CPU containing three
symmetrical cores, each capable of running at 3.2 GHz; 512 MB of GDDR3 RAM (which reduces the
heat dispersal burden and is capable of transferring 4 bits of data per pin in 2 clock cycles for increased
throughput); and 700 MHz DDR memory (theoretically supplying a swift 1400 MB per second
maximum bandwidth) [31].
7. XBOX 360 IMAGE STEGANOGRAPHY
Using open-source game modification tools, the carrier image created earlier with Invisible Secrets,
and a USB 2.0 to SATA adaptor with a 50/60 Hz power supply cable, researchers tested the feasibility
of inserting a steganographic image into an Xbox 360 hard drive. The process was straightforward and
the results significant.
Modio, an open-source Xbox 360 modification tool popular with gamers because it enables users to
customize their console, was used to open the hard drive [32]. Once the drive was opened, theme
creator was selected. Theme creator allows users to create custom Xbox 360 dashboard themes. The
user interface supports four images: main, media library, game library, and system settings. The carrier
image was uploaded as the main image, and the original, unaltered picture of the clouds was uploaded
to the media and game libraries, as well as to the system settings. The theme was named Stego and saved to
the device.
Image 9 - Creating a dashboard theme in Modio using the carrier image created with Invisible Secrets
Image 10 - Saving the new theme to the hard drive
When the drive was reopened, our newly created theme, containing the carrier image complete with
secret message, was found in Partition 3 under Profile Storage/Skins. The file was then extracted to
the desktop and opened with wxPirs [33]. WxPirs is another open-source utility commonly used by
gamers seeking to modify their gaming consoles. It enables users to open PIRS, CON, and LIVE files
- commonly found on the Xbox 360 drive. When opened in wxPirs, the entire contents of the Stego
theme file can be viewed. Although the contents of the newly created theme file (wallpaper1) can also
be viewed in Modio by right-clicking Open in Resigner and selecting the General File Info tab,
opening the file in wxPirs reveals that the file was created in Modio. Knowing that a game
modification tool created the file could warrant further investigation. The carrier file, Wallpaper1, was
then extracted to the desktop and opened with Windows Photo Viewer. Although an MD5 checksum
showed that the hash value and file size had changed (MD5 0ea9f8bfa3f54fb214028f1e2f578b02, size
190 KB), when the image was opened with Invisible Secrets our secret message remained intact and
unaltered (Image 15).
Image 11 - Newly created Stego theme saved to the hard drive
Image 12 - Stego theme opened in wxPirs
Image 13 - Stego theme examined with Modio’s Resigner
Image 14 - Extracted image opened with Windows Photo Viewer - although checksum and size have
changed, no viewable differences are noticed
Image 15 - Wallpaper1 is then opened with Invisible Secrets to reveal the hidden message
8. VIDEO STEGANOGRAPHY
Although digital images are the most popular carriers due to their rapid proliferation and the excessive
numbers of bytes available for manipulation, messages can also be embedded in audio and video files,
programming and web codes, and even in network control protocols [34]. Because video files are
essentially a collection of audio sounds and images, many of the techniques used for image or audio
steganography can likewise be used to hide data in videos [35]. Furthermore, because a video is a
moving stream of images and sound, not only is there a vast area to work with, but the continual
motion makes detection extremely challenging [35]. By treating a digital video file as single frames,
rather than one continuous bit stream, data can be evenly dispersed into any number of frames or
images. By slightly altering the output images displayed on each video frame, when the video is
played back it should not be distorted or changed enough to be recognized by the human eye [36].
This approach provides a vast field of workspace and eliminates the need for compression. Although
this may involve utilizing techniques such as the Least Significant Bit (LSB), which can be tricky
when working with grey-scaled images [36], inserting a message into a video file does not require
expertise in video editing. There is a plethora of tools, many of which are right on the average desktop,
to assist in the process. Furthermore, if the host video is one that is assumed highly unlikely to have
been altered, such as a proprietary video game trailer, it may evade inspection altogether.
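A minimal C sketch of the frame-by-frame idea follows, under the simplifying assumption that each frame is already available as a raw, uncompressed byte buffer (a real container format such as WMV would first have to be decoded). One message bit is written into the least significant bit of a fixed byte position in each successive frame, so the payload is spread thinly across the stream.

#include <stdio.h>
#include <string.h>

#define FRAMES      64          /* number of frames in the synthetic clip */
#define FRAME_BYTES 128         /* bytes of pixel data per frame          */

/* Spread the bits of `msg` across frames: one bit per frame, stored in the
   least significant bit of the byte at `offset` inside each frame. */
static void embed_across_frames(unsigned char frames[FRAMES][FRAME_BYTES],
                                size_t offset, const char *msg)
{
    size_t bits = (strlen(msg) + 1) * 8;
    for (size_t i = 0; i < bits && i < FRAMES; i++) {
        int bit = (msg[i / 8] >> (7 - (i % 8))) & 1;
        frames[i][offset] = (unsigned char)((frames[i][offset] & 0xFE) | bit);
    }
}

/* Rebuild the message by collecting one LSB per frame. */
static void extract_across_frames(unsigned char frames[FRAMES][FRAME_BYTES],
                                  size_t offset, char *out, size_t out_len)
{
    memset(out, 0, out_len);
    for (size_t i = 0; i < FRAMES && i / 8 < out_len - 1; i++)
        out[i / 8] = (char)((out[i / 8] << 1) | (frames[i][offset] & 1));
}

int main(void)
{
    static unsigned char frames[FRAMES][FRAME_BYTES]; /* synthetic frame data */
    char recovered[16];

    embed_across_frames(frames, 10, "Semaine");       /* hide a short message */
    extract_across_frames(frames, 10, recovered, sizeof recovered);
    printf("Recovered from frames: %s\n", recovered);
    return 0;
}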
To demonstrate how this is possible, researchers extracted, altered, and then reinstated a proprietary
game trailer from an Xbox 360 hard drive using Modio and Windows Live Movie Maker. In partition
3, under contents/downloads, the file for Forza Motorsport 3 was located and opened in Modio’s
resigner (Image 16). Under the general info tab, the Forza Motorsport 3’s movie trailer was extracted
to the desktop (Image 17).
Image 16 – Forza Motorsport 3 opened in resigner
Image 17 – Forza Motorsport 3 game trailer exported to desktop
The trailer was opened from the desktop using Windows Live Movie Maker. Several “secret messages”
were inserted throughout the video (Image 18), including names and locations. At the end of the video,
credits were added using the names of some of the game designers who spoke throughout the video,
and one secret message. The third “name” on the list, Semaine Suivante, isn’t a name at all but rather
French for “next week” (Image 19). The modified video was saved as Stego.wmv.
Image 18 – Name inserted in frame
Image 19 – Credits added with hidden message
Using resigner’s replace file option, the steganographic video, Stego.wmv, not only replaced the
original default.wmv but retained the original file name and date (default.wmv, 11/22/2005) as well.
This is rather significant because it means that digital examiners investigating Xbox 360 hard drives
may not see any evidence that a file was altered (Image 20). Thus, a steganographic file could go
undiscovered unless the probable hundreds of proprietary files on the drive were extracted and
examined one by one. Because the forensic analysis of gaming consoles is in its infancy, the tools and
methodologies investigators use to examine computers are not well suited to analyzing gaming
consoles; this is problematic at best. When the new default.wmv is extracted and played on the desktop
or on a television screen, the secret messages are revealed (Image 21).
Image 20 – Steganographic file replaced the original game trailer, retaining the original file’s name
and creation date
Image 21 - Altered game trailer with hidden message, Semaine Suivante, revealed
9. LINGUISTIC STEGANOGRAPHY
The decision to use a French phrase in the previous example was deliberate. Although it may initially
seem somewhat simplistic, given the availability of so many user-friendly software programs and
complex steganographic techniques, the use of foreign languages is still a very valuable method of
hiding messages. The syntactical nature of any language can make the interpretation of a message
written in a foreign language challenging enough, but when that language is deliberately skewed to
conceal its true meaning, challenging may very well be an understatement [37].
Following the terrorist attacks of September 11, 2001, it became apparent that the Federal Bureau of
Investigation (FBI) and the Central Intelligence Agency (CIA) did not have language specialists
capable of interpreting vital documents that could have forewarned security experts about the planned
terrorist attacks [38]. Recognizing the significance of foreign languages as they pertain to security, the
Defense Language Institute Foreign Language Center (DLIFLC) was created on November 1, 1941,
upon America’s entry into World War II [39].
Today the DLIFLC educational and research institute provides linguistic training through intense
studies and cultural immersion to the Department of Defense (DoD) and other Federal Agencies [39].
Approximately forty languages are taught at DLIFLC including Arabic, Chinese, French, German,
Russian, Spanish, and Japanese [39]. Some examples of linguistic steganography include, but are not
limited to:
• Use of Arabic letter points and extensions – A technique where the extensions of pointed letters
hold one bit and the un-pointed letters zero bits [40].
• Format-based methodologies – The physical formatting of text to provide space for hiding
information. May include insertion of spaces, non-displayed characters, variations in font size, and
deliberate misspellings [41]. (A minimal sketch of this approach follows this list.)
• Synonym substitution – A common form of steganography where selected words from the cover
text are replaced with one of their synonyms, as predetermined by the established encoding rules [42].
• Feature-specific encoding – Encoding secret messages into formatted text through the alteration of
text attributes (i.e., length, size) [43].
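To illustrate the format-based approach mentioned in the list above (a toy construction prepared for this paper, not one of the cited techniques), the following C sketch hides one bit per line of cover text by appending, or not appending, a trailing space; the cover sentences and the hidden byte are invented for the example.

#include <stdio.h>
#include <string.h>

#define LINES 8

/* Hide one bit per cover line: a trailing space encodes 1, none encodes 0. */
static void encode_lines(const char *cover[LINES], unsigned char byte,
                         char out[LINES][128])
{
    for (int i = 0; i < LINES; i++) {
        int bit = (byte >> (7 - i)) & 1;
        snprintf(out[i], 128, "%s%s", cover[i], bit ? " " : "");
    }
}

/* Recover the hidden byte by inspecting each line ending. */
static unsigned char decode_lines(char lines[LINES][128])
{
    unsigned char byte = 0;
    for (int i = 0; i < LINES; i++) {
        size_t len = strlen(lines[i]);
        int bit = (len > 0 && lines[i][len - 1] == ' ') ? 1 : 0;
        byte = (unsigned char)((byte << 1) | bit);
    }
    return byte;
}

int main(void)
{
    const char *cover[LINES] = {
        "The match starts at noon.", "Bring the blue folder.",
        "Lunch is at the usual place.", "Call me when you arrive.",
        "The report is nearly done.", "Tickets are on the desk.",
        "The weather looks fine.", "See you soon."
    };
    char stego[LINES][128];

    encode_lines(cover, 'K', stego);          /* hide the single byte 'K' */
    printf("Hidden byte recovered as: %c\n", decode_lines(stego));
    return 0;
}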
Because the Xbox 360 is an international gaming platform, it is not uncharacteristic to find multiple
languages on the hard drive. On the specific drive used for this project, two instances of foreign
languages, French and Dutch, were found upon examination. Both appear to be game character dialogs
and gaming instructions (Images 22, 23). From an investigative perspective, what is perplexing is that
these dialogs were the only part of the proprietary game files not encrypted or compressed. This
suggests that Microsoft may employ foreign languages as a security measure (security through
obscurity). Previous research found Spanish, French, and German in both the marketplace files and
networking files (i.e., NAT) [44]. To date, researchers have recorded French, Spanish, Dutch, German, Russian, and possibly Chinese Unicode on the Xbox 360 [44].
From a steganalysis perspective, this suggests that there is an abundance of proprietary files on the
hard drive where a message could be inserted using a foreign language in order to evade discovery.
Digital investigators have traditionally focused on user data, but that is no longer sufficient; proprietary files must also be diligently examined.
Image 22 – French dialog as viewed in FTK Imager
Image 23 – French translated to English using Google Translator
10. CONCLUSION
Steganography is concerned with secret communication, not indecipherable communication. Videos,
proprietary and redundant programming files, and audio and digital images are all potential carriers. In
the Art of War, Sun Tzu said that “The natural formation of the country is the soldier's best ally…”
[45]. However, when dealing with the Xbox 360, the topography is unknown.
The file data format used on the Xbox is FATX, which is an offshoot of the more familiar FAT32
found on older computers and storage devices [46]. Although the two possess similar format and file
data layouts, they are not the same. FATX does not contain the backup boot or file system information
sectors found in FAT32. The reasoning behind these variations in the file format is that the Xbox was designed primarily for entertainment as opposed to productivity. Thus, redundancy and legacy support are apparently forfeited in order to increase the system’s speed. It is also relevant to note that the specific
Xbox 360 model, applied Microsoft updates, and condition of the drive all have an impact on what the
examiner may or may not find upon examination. This, combined with the veiled nature of
steganography, can make analysis very difficult.
11. RECOMMENDATIONS
The first thing the investigator should do upon receiving an Xbox 360 console is to record the 12-digit
serial number which is located on the back side of the console where the USB slots are located [47].
The examiner will need to push the oval-shaped USB door open in order to view this number [47].
Each console contains a unique serial number which corresponds to a serial number stored throughout
the drive. It is pertinent to record this number, as it not only identifies the system on Xbox Live but
could indicate if the drive has been changed [47]. However, because gamers frequently switch hard
drives in and out, these numbers may differ.
The SATA hard drive is housed securely within the detachable hard drive case inside a second
enclosure. To access the actual drive, two Torx wrenches, sizes T6 and T10, are required. The T6 wrench is needed to remove the screws from the exterior housing. Three of the screws are easily visible, but the fourth screw is located beneath the Microsoft sticker (Image 24). Once this sticker has been
removed, the Xbox warranty is voided. A missing sticker could be indicative of a drive that has been
modified or tampered with.
Image 24: Removing Microsoft sticker reveals the fourth screw and indicates the drive has been
accessed
As technology evolves, so must the methodologies used by examiners. Although steganography dates
back to antiquity, digital steganalysis is a relatively new discipline which is in flux. Consequently, the
steganalysis of an Xbox drive is a long, slow, systematic process. It is difficult to identify structural
abnormalities or signs of manipulation in a digital environment which is still fundamentally undefined
[48]. Compounding this is the fact that there are no current reference guides or approved tools
available for forensically examining Xbox 360 drives.
12. REFERENCES
1. The Diffusion Group. TDG Releases New Report Examining Web Browsing from Top Consumer Electronics Devices: No Keyboard, No Mouse, No Problem? The Diffusion Group Press Releases. [Online] 9 8, 2001. [Cited: 1 1, 2012.] http://tdgresearch.com/blogs/pressreleases/archive/2011/09/19/tdg-releases-new-report-examining-web-browsing-from-top-consumerelectronics-devices-no-keyboard-no-mouse-no-problem.aspx.
2. Covert computer and network communications. Newman, Robert C. New York : ACM. Proceedings of the 4th Annual Conference on Information Security Curriculum Development, InfoSec CD 2007.
3. On The Limits of Steganography. Ross J. Anderson, Fabien A.P. Petitcolas. 1998, IEEE Journal on Selected Areas in Communications, pp. 474-481.
4. Fridrich, Jessica. Steganography in Digital Media: Principles, Algorithms, and Applications.
Cambridge : Cambridge University Press, 2009.
5. Hide and Seek: An Introduction to Steganography. Niels Provos and Peter Honeyman. 2003, IEEE Security and Privacy, pp. 32-44.
6. Hempstalk, Kathryn. A Java Steganography Tool. Source Forge. [Online] 3 24, 2005. [Cited: 11
1, 2011.] http://diit.sourceforge.net/files/Proposal.pdf.
7. Strassler, Robert B. The Landmark Herodotus: The Histories. New York : Random House, 2009.
8. Frank Enfinger, Amelia Phillips, Bill Nelson, Christopher Steuart. Guide to Computer
Forensics and Investigations . Boston : Thomson Course Technology, 2005.
9. NeoByte Solutions. Invisible Secrets 4. Invisible Secrets . [Online] 2011. [Cited: 11 2, 2011.]
http://www.invisiblesecrets.com/.
10. Pearson, K. Contributions to the Mathematical Theory of Evolution. II. Skew Variation in Homogeneous Material. Philosophical Transactions of the Royal Society of London. 1885-1887, vol. 186, pp. 343–414.
11. Jenkinson, Mark. Histogram Bin Size . FMRIB Centre Research Facility . [Online] 5 10, 2000.
[Cited: 1 2, 2012.] http://www.fmrib.ox.ac.uk/analysis/techrep/tr00mj2/tr00mj2/node24.html.
12. Adobe. Elements Product Family. Adobe. [Online] 2012. [Cited: 1 1, 2012.]
http://www.adobe.com/products/photoshop-elements.html.
13. Patterson, Steve. How To Read An Image Histogram In Photoshop. Adobe Photoshop Tutorials
and Training. [Online] 2012. [Cited: 1 2, 2012.] http://www.photoshopessentials.com/photoediting/histogram/.
14. Matthew Stamm and K. J. Ray Liu. Blind Forensics of Contrast Enhancement in Digital Images. Dept. of Electrical and Computer Engineering, University of Maryland. [Online] 6 13, 2008. [Cited: 1 2, 2012.] http://www.ece.umd.edu/~mcstamm/Stamm_Liu%20-%20ICIP08.pdf.
15. X-Ways Software Technology. WinHex 16.3. WinHex: Computer Forensics & Data Recovery
Software Hex Editor & Disk Editor. [Online] 2012. [Cited: 1 2, 2012.] http://x-ways.net/winhex/.
16. Bill Nelson, Amelia Phillips, Christopher Steuart. Guide to Computer Forensics and Investigations. Florence : Cengage, 2009.
17. spy-hunter.com. StegSpy. spy-hunter.com. [Online] 2009. [Cited: 1 3, 2012.] http://www.spyhunter.com/stegspy.
18. Chandrababu, Aron. Using an Artificial Neural Network to Detect The Presence of Image
Steganography. Thesis, The University of Akron. Akron, OH : s.n., 5 2009.
19. Ingemar Cox, Matthew Miller, Jeffrey Bloom, Jessica Fridrich, Ton Kalker. Digital
Watermarking and Steganography. Waltham : Morgan Kaufmann, 2007.
20. Ariana Eunjung Cha, Jonathan Krim. Terrorists' Online Methods Elusive . Washington Post.
[Online] 12 19, 2001. [Cited: 1 3, 2012.] http://www.washingtonpost.com/ac2/wpdyn?pagename=article&node=&contentId=A52687-2001Sep18.
21. Denning, Dorothy E. Terror’s Web: How the Internet Is Transforming Terrorism. [book auth.]
Yvonne Jewkes and Majid Yar. Handbook on Internet Crime. Devon : Willan, 2009, pp. 194-214.
22. Schneier, Bruce. Terrorists and steganography. ZDNet. [Online] 9 24, 2001. [Cited: 1 4, 2012.]
http://www.zdnet.com/news/terrorists-and-steganography/116733.
23. Federal Bureau of Investigation. Situational Information Report, Criminal Tradecraft Alert.
New York : F.B.I., 2011.
24. NJ Regional Operations Intelligence Center . NJ ROIC Analysis Element, Threat Analysis
Program, AE201009‐839 . West Trenton : NJ ROIC , 2010.
25. FOX News. Police: Man Used Xbox to Lure Middle School Girls for Sex. FOX News. [Online] 3
29, 2011. [Cited: 1 2, 2012.] http://www.myfoxdetroit.com/dpp/news/local/police%3A-man-usedxbox-to-lure-middle-school-girls-for-sex-20110329-mr.
26. Mick, Jason. Woman Arrested For Molesting 13-Year-Old Xbox Friend. Daily TECH. [Online] 1 10, 2011. [Cited: 1 2, 2012.] http://www.dailytech.com/Woman+Arrested+For+Molesting+13YearOld+Xbox+Friend/article20615.htm.
27. Storm, Darlene. Law Enforcement: Gangs, terrorists plot evil over Xbox and PS3. Computerworld. [Online] 8 2, 2011. [Cited: 1 3, 2012.] http://blogs.computerworld.com/18725/law_enforcement_gangs_terrorists_plot_evil_over_xbox_and_ps3.
28. Bloomberg Businessweek. Microsoft's Xbox Sales Beat Wii, PS3 in February on "BioShock". Bloomberg.com. [Online] 3 11, 2010. [Cited: 1 2, 2011.] http://www.businessweek.com/news/201003-11/microsoft-s-xbox-sales-beat-wii-ps3-in-february-on-bioshock-.html.
29. June, Laura. Microsoft announces Xbox Live TV partners including Verizon, HBO, and
Comcast. The Verge. [Online] 10 5, 2011. [Cited: 1 2, 2012.] http://www.xbox.com/enUS/Xbox360/Consoles.
30. Official Xbox Magazine staff. The Complete History of Xbox. CVG Gaming. [Online] 12 13, 2005. [Cited: 1 1, 2012.] http://www.computerandvideogames.com/article.php?id=131066.
31. Berardini, C. The Xbox 360 System Specifications. Team Xbox. [Online] 12 5, 2005. [Cited: 1 2,
2012.] http://hardware.teamxbox.com/articles/xbox/1144/The-Xbox-360-System-Specifications/p1.
32. Game-Tuts. Modio. Game-Tuts. [Online] [Cited: 1 2, 2012.] http://www.game-tuts.com/community/index.php?pageid=modio.
33. Xbox-Scene. http://www.xbox-scene.com/xbox360-tools/wxPirs.php. Xbox-Scene. [Online] 2010.
[Cited: 1 2, 2012.] http://www.xbox-scene.com/xbox360-tools/wxPirs.php.
34. T. Morkel, J.H.P. Eloff, M.S. Olivier. An Overview of Steganography . University of Pretoria,
South Africa. [Online] 8 27, 2005. [Cited: 5 2012, 1.] http://martinolivier.com/open/stegoverview.pdf.
35. Steganography and steganalysis. Krenn, Robert. California : University of California , 2004.
Proceedings of IEEE Conference,. p. 143.
36. Hiding Data in Video File: An Overview. A.K. Al-Frajat, H.A. Jalab, Z.M. Kasirun, A.A.
Zaidan and B.B. Zaidan. 2010, Journal of Applied Sciences, pp. 1644-1649.
37. A Natural Language Steganography Technique for Text Hiding Using LSB's. Salman, Dr. Hana'a M. 2004, Engineering and Technology, Vol. 26, No. 3, University of Technology, Baghdad, Iraq, p. 351.
38. Alfred Cumming, Todd Masse. Intelligence Reform Implementation at the Federal Bureau of
Investigation: Issues and Options for Congress. Washington : Congressional Research Service, The
Library of Congress, 2005.
39. Defense Language Institute . Foreign Language Center . Defense Language Institute . [Online]
2012. [Cited: 1 1, 2012.] http://www.dliflc.edu/index.html.
40. A Novel Arabic Text Steganography Method Using Letter Points and Extensions. Adnan Abdul-Aziz Gutub and Manal Mohammad Fattani. 2007, World Academy of Science, Engineering and Technology, pp. 28-31.
41. Bennett, Krista. Linguistic Steganography: Survey, Analysis, and Robustness Concerns for
Hiding Information in Text. West Lafayette : Purdue University, 2004.
42. Cuneyt M. Taskiran, Umut Topkara, Mercan Topkara, and Edward J. Delp. Attacks on Lexical Natural Language Steganography Systems. Indiana : School of Electrical and Computer Engineering, Purdue University, 2006.
43. Dunbar, Bret. A Detailed Look at Steganographic Techniques and their use in an Open-Systems
Environment . SANS Institute InfoSec Reading Room . [Online] 1 18, 2002. [Cited: 1 7, 2012.]
http://www.sans.org/reading_room/whitepapers/covert/detailed-steganographic-techniques-opensystems-environment_677.
44. A Practitioners Guide to the Forensic Investigation of Xbox 360 Gaming Consoles. Dr. Ashley
Podhradsky, Dr. Rob D'Ovidio, Cindy Casey. Richmond : Digital Forensics, Security and Law ,
2011. ADFSL Conference.
45. Tzu, Sun. The Art Of War. New York : Tribeca, 2010.
46. Burke, Paul K. Xbox Forensics. Journal of Digital Forensic Practice. [Online] 2006. [Cited: 12
22, 2011.] http://dx.doi.org/10.1080/15567280701417991.
47. Microsoft. How to find the Xbox 360 console serial number and console ID. Microsoft Support.
[Online] 7 27, 2010. [Cited: 1 7, 2012.] http://support.microsoft.com/kb/907615.
48. Gary Kessler. An Overview of Steganography for the Computer Forensics Examiner. Gary Kessler Associates. [Online] 2011. [Cited: 1 3, 2012.] http://www.garykessler.net/library/fsc_stego.html.
49. Benchmarking steganographic and steganalysis techniques. Mehdi Kharrazi, Husrev T. Sencar,
Nasir Memon. San Jose : CA, 2005. EI SPIE .
50. Marcus, Ilana. Steganography Detection. University of Rhode Island . [Online] 2011. [Cited: 1 2,
2012.] http://www.uri.edu/personal2/imarcus/stegdetect.htm.
51. Bakier, Abdul Hameed. The New Issue of Technical Mujahid, a Training Manual for Jihadis. The Jamestown Foundation. [Online] 3 30, 2007. [Cited: 1 2, 2012.] http://www.jamestown.org/programs/gta/single/?tx_ttnews[tt_news]=1057&tx_ttnews[backPid]=182&no_cache=1.
52. United States Court of Appeals, Ninth Circuit. UNITED STATES v. REARDEN. Los Angeles,
CA : s.n., 10 9, 2003.
53. STATE of North Dakota. State v Ehli . Bismarck, N.D. : s.n., 6 30, 2004.
54. State of Colorado. People vs. Harrison. 2 8, 2005.
DOUBLE-COMPRESSED JPEG DETECTION IN A
STEGANALYSIS SYSTEM
Jennifer L. Davidson
Department of Mathematics
Iowa State University, Ames, IA 50011
Phone: (515) 294-0302
Fax: (515) 294-5454
[email protected]
Pooja Parajape
Department of Electrical and Computer Engineering
Iowa State University, Ames, IA 50011
[email protected]
ABSTRACT
The detection of hidden messages in JPEG images is a growing concern. Current detection of JPEG
stego images must include detection of double compression: a JPEG image is double compressed if it
has been compressed with one quality factor, uncompressed, and then re-compressed with a different
quality factor. When detection of double compression is not included, erroneous detection rates are
very high. The main contribution of this paper is to present an efficient double-compression detection algorithm that has relatively lower feature dimensionality and relatively lower computational time for the detection step than current comparative classifiers. We use a model-based approach for creating features, using a subclass of Markov random fields called partially ordered Markov models (POMMs) to model the bit changes that occur in an image after an application of steganography. We model the embedding process as noise and create features to capture this noise characteristic. We show that the nonparametric conditional probabilities that are modeled using a
POMM can work very well to distinguish between an image that has been double compressed and one
that has not, with lower overall computational cost. After double compression detection, we analyze
histogram patterns that identify the primary quality compression factor to classify the image as stego
or cover. The latter is an analytic approach that requires no classifier training. We compare our results
with another state-of-the-art double compression detector.
Keywords: steganalysis; steganography; JPEG; double compression; digital image forensics.
1. INTRODUCTION
Steganography is the practice of hiding a secret message or a payload in innocent objects such that the
very existence of the message is undetectable. The goal of steganography is to embed a secret payload
in a cover object in such a way that nobody apart from the sender and the receiver can detect the
presence of the payload. Steganalysis, on the other hand, deals with finding the presence of such hidden messages. Steganalysis can be categorized as passive or active. Passive steganalysis detects only the presence of an embedded message, while active steganalysis seeks further information about the secret message such as its length, the embedding algorithm used, and the actual content. Since JPEG is a compressed file format, it requires lower bandwidth for transmission and less space for storage. Many embedding programs for JPEG, such as Outguess, F5, and Jsteg, are freely available on the Internet [1]. This makes JPEG a good medium for steganography. Hence, we focus our attention on the steganalysis of JPEG images.
If a JPEG image is compressed twice, each time using a different compression factor or quantization
matrix, then the image is said to be double-compressed. Steganographic algorithms such as F5 [2] and
Outguess [3] can produce such double-compressed images during the process of embedding the
payload. Existence of double-compression thus suggests manipulation of the original image. Blind
steganalyzers built on the assumption of single-compressed images give misleading results for the
double-compressed images. In Table 1, we present results from a state-of-the-art stego classifier that has a very accurate stego vs. cover detection rate [4], but here applied to JPEG data that has been double-compressed. The quality factor for the JPEG images is 75, and the amount of data embedded in the JPEG coefficients is described in terms of bits per non-zero (JPEG) coefficient, or bpnz. A rate of 0.05 bpnz designates that roughly 5% of the available bits for embedding were used, a significantly small amount; 0.4 bpnz is typically the maximum capacity available for embedding. Table 1 shows that the accuracies of Canvass, which are typically in the range of 40% to 100% on JPEG images with no double compression, fall to the range of 28%-75% when double-compressed images are fed into a Canvass stego detector designed for images with no double compression. Thus, it is clearly important to detect the presence of double-compression for satisfactory detection of stego images. Detection of double-compression is a binary classification problem where a given image can
be classified as either single-compressed or double-compressed. A double-compression detector can
be thought of as a pre-classifier to multi-classifiers that detects a number of different steganographic
methods used for embedding the payload.
Table 1. Detection accuracy of the GUI software package Canvass applied to double-compressed JPEG data.

Percentage of Detection Accuracy of Canvass (SQF = 75)
                 Cover     F5       Outguess
0.05 bpnz        45.39     28.84    42.04
0.1 bpnz                   34.54    57.73
0.4 bpnz                   58.29    74.84
Average          45.39     40.56    58.02
2. JPEG COMPRESSION, DOUBLE COMPRESSION, AND MODE HISTOGRAMS
JPEG stands for Joint Photographic Experts Group. It is a lossy compression algorithm that stores the
data in the form of quantized DCT coefficients. Figure 1 shows the process involved in JPEG
compression. Each 8x8 block in the original image has the Discrete Cosine Transform (DCT) applied,
then those real values are quantized into integer values using the quantization matrix; those integers
are scanned in a zig-zag order and the resulting string is entropy encoded using the Huffman coding.
The resulting file is then stored as a .jpg file.
Figure 1. JPEG Compression Process.
The DCT coefficients $b_{pq}$ of an 8x8 block $A$ are given by the standard two-dimensional DCT,
$$b_{pq} = \alpha_p \alpha_q \sum_{m=0}^{7}\sum_{n=0}^{7} A_{mn}\,\cos\frac{\pi(2m+1)p}{16}\,\cos\frac{\pi(2n+1)q}{16}, \qquad 0 \le p, q \le 7,$$
where $\alpha_0 = 1/\sqrt{8}$ and $\alpha_p = 1/2$ for $p > 0$.
The DCT coefficients bpq are then quantized by dividing each value point-wise using a 8x8 matrix Q
followed by rounding to the nearest integer. Q is called the quantization matrix (QM). The quantized
DCT coefficients are given by
where [ ] denotes rounding to the nearest integer and 0 ≤ p, q ≤ 7. Each location (p,q) in the
quantization matrix refers to a unique frequency and is called a mode. The coefficients at mode (0,0)
are called DC coefficients and the coefficients at all the other modes are called AC coefficients.
Quantization is an irreversible, lossy step. As we can see, division by larger quantization steps results in more heavily compressed data. The JPEG standard allows 100 different quality factors. A quality factor is an integer between 1 and 100, inclusive. The value 100 corresponds to $Q_{pq} = 1$ for 0 ≤ p, q ≤ 7 and thus to the highest-quality image. Each quality factor is associated with a unique quantization matrix that determines the amount of compression at that quality level. The quantized
DCT coefficients are then Huffman encoded for further compression of data. This step is lossless.
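The following minimal sketch (ours, not code from the paper; the conventional 128 level shift is assumed) shows the DCT-and-quantize step for a single 8x8 block:

import numpy as np
from scipy.fftpack import dct

def dct2(block):
    # Two-dimensional type-II DCT with orthonormal scaling, rows then columns.
    return dct(dct(block, axis=0, norm='ortho'), axis=1, norm='ortho')

def quantize_block(block, Q):
    # B_pq = [ b_pq / Q_pq ]: point-wise division by the quantization matrix,
    # then rounding to the nearest integer.
    b = dct2(block.astype(float) - 128.0)
    return np.rint(b / Q).astype(int)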
Next, we look at the effect of double-compression on the DCT coefficients of the JPEG image. A
JPEG image is double-compressed if it is compressed twice with different quantization matrices Q1
and Q2. In that case we have
b  Q1 
Eq. 1
Brs   rs1 * rs2 .
Q
Q
 rs  rs 
Here, Q1 is called the primary quantization matrix (PQM) and Q2 is called the secondary quantization
matrix (SQM). Figure 2 shows the double-compression process for a JPEG image. In this example, the
image is first compressed at a lower quality factor QF = 50, uncompressed and then recompressed at a
higher quality factor QF= 75. Note that a larger quality factor results in a quantization matrix with
smaller values. We observe that, depending on the values of the quantization steps in primary and
secondary quantization matrices, the histograms of the DCT coefficients exhibit different
characteristic patterns. These patterns are discussed later and are key to our efficient detectors.
Figure 2. JPEG Double Compression Process. First compression is at QF=50% and second
compression is at QF=75%.
When the Primary Quality Factor (PQF) is smaller than the Secondary Quality Factor (SQF), the
image is first coarsely quantized and then finely quantized such as in Figure 2. The quantized DCT
coefficients can thus take values only from the set {0, n, 2n, 3n, ...}, where n is determined from Eq. 1. The histogram exhibits zeros at the remaining points. The histogram in Figure 3(a) is from a single-compressed
image, and in Figure 3(b) the histogram shows the characteristic zero pattern for a double-compressed
image. Here the primary quality factor is 63 and secondary quality factor is 75. The primary
quantization step for mode (2,2) is 9 whereas the secondary quantization step is 6. After the first
compression step, the de-quantized DCT coefficients can thus take values that are multiples of 9. After
the second compression that includes requantization with step size of 6 and rounding, the quantized
coefficients can take values only from the set {0, 2, 3, 5, 6, 8, 9, 11, ...}. We notice that these values are
the rounded integer multiples of n = 9/6 = 1.5. As a result, zeros occur at locations 1, 4, 7, 10 and so
on. This is summarized in Figure 4. We exploit this characteristic to develop features to classify single
versus double-compressed images as well as cover versus stego images.
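The zero pattern can be reproduced with a small simulation of Eq. 1 (a sketch under our own rounding assumption, not the authors' code): with primary step 9 and secondary step 6, the reachable coefficient values are rounded multiples of 1.5, so bins 1, 4, 7, 10, ... of the mode histogram stay empty.

import numpy as np

def double_quantize(b, Q1, Q2):
    # Eq. 1 for a single coefficient, using round-half-up for [ . ].
    rnd = lambda x: int(np.floor(x + 0.5))
    return rnd(rnd(b / Q1) * Q1 / Q2)

# Mode (2,2) example from the text: Q1 = 9 (PQF 63), Q2 = 6 (SQF 75).
reachable = sorted({double_quantize(b, 9.0, 6.0) for b in range(0, 60)})
missing = [v for v in range(max(reachable)) if v not in reachable]   # 1, 4, 7, 10, ...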
Figure 3. Histograms of quantized DCT coefficients for mode (2, 2) in (a) single-compressed image,
(b) double-compressed image with PQF < SQF
Figure 4. (a) Example of double-compressed coefficient values and (b) resulting zero patterns at mode
(2,2).
When the primary quality factor is greater than the secondary quality factor, the mode histograms of
the DCT coefficients exhibit peaks at certain places. The locations of these peaks vary according to the
combinations of primary and secondary quantization steps in a way similar to the previous case. By
experimentation on the image data, we determine that a 20% increase in the histogram values (bin heights) followed by a drop in the values indicates a peak. The value 20% was chosen after a series of trial-and-error experiments. Equation 1 also models this phenomenon. In Figures 5(a) and 5(b), we
present histograms to demonstrate this phenomenon. In this case, the Primary Quality Factor is 80 and
the Secondary Quality Factor is 75. The primary quantization step for mode (2,2) is 5 and the
secondary quantization step is 6. The dequantized coefficients after the first compression are thus
integer multiples of 5. After requantization with step 6, peaks occur at locations 3, 8 and 13 and so on.
These are indicated by the arrows in Figure 5(b). Figure 6 explains this phenomenon in detail. A peak
value occurs at bin value 3 after second compression because bin value 3 and 4 from the first
compression end up in bin values 3 after the second compression.
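A hedged sketch of this peak rule (our reading of the 20% criterion, not the authors' implementation) applied to one mode histogram:

def peak_locations(hist, rise=0.20):
    # A bin is flagged as a peak when it is at least 20% higher than its
    # predecessor and is followed by a drop, per the rule described above.
    peaks = []
    for k in range(1, len(hist) - 1):
        if hist[k] >= (1.0 + rise) * hist[k - 1] and hist[k + 1] < hist[k]:
            peaks.append(k)
    return peaks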
3. PATTERN CLASSIFICATION
Detection of double-compression is a binary classification problem where an input image is classified
as single-compressed or double-compressed. We propose to use a blind steganalyzer to solve this
classification problem. Generally, blind steganalyis is carried out by converting the input image to a
lower dimensional feature space. These features are then used to train a pattern classifier. Among
various pattern classification techniques, Support Vector Machines (SVMs) have proven to be very
powerful for the binary classification. SVMs use either linear or kernel-based supervised machine
learning algorithms. Linear SVMs are seldom suitable for real world data that is hard to separate
without error by a hyperplane. We next discuss the basic workings of an SVM.
Figure 5. Histograms of quantized DCT coefficients for mode (2,2) in (a) single-compressed image
(b) double-compressed image with PQF > SQF.
Figure 6. Example of double-compressed coefficient values and resulting peak patterns at mode (2,2).
Bins 3 and 4 (row 1) combine to produce bin 3 (row 3).
Consider a given set of training pairs $(\mathbf{x}_i, y_i)$, $i = 1, 2, \ldots, k$, where $\mathbf{x}_i \in \mathbb{R}^n$, $n$ represents the number of features, and $y \in \{-1, 1\}^k$ represents the classes. The SVM requires a solution in the form of a weight vector $\mathbf{w}$ to the following optimization problem:
$$\min_{\mathbf{w},\,b,\,\xi}\;\frac{1}{2}\mathbf{w}^{T}\mathbf{w} + C\sum_{i=1}^{k}\xi_i \quad \text{subject to} \quad y_i\!\left(\mathbf{w}^{T}\Phi(\mathbf{x}_i) + b\right) \ge 1 - \xi_i,\; \xi_i \ge 0. \qquad \text{(Eq. 2)}$$
The training data are mapped into a higher dimensional space by function Φ. Then SVM finds a linear
separating hyperplane with the maximal margin of space between the hyperplane and data in the
higher dimensional space. In this case, $K(\mathbf{x}_i, \mathbf{x}_j) = \Phi(\mathbf{x}_i)^{T}\Phi(\mathbf{x}_j)$ is called the kernel function, whereas C > 0 is the penalty parameter of the error term ξ.
There are two distinct phases involved in using an SVM for classification. In the first phase, the SVM is trained on the features extracted from a large number of images; that is, a weight vector $\mathbf{w}$ is found satisfying Eq. 2. In the second phase, the SVM is tested on data which is previously unseen; that is, a vector of data with unknown class is passed through the SVM and its class is output.
We next discuss the different steps involved in training the SVM. First, the input data is linearly scaled so that all the elements are in the range [-1, 1] or [0, 1]. The same scaling is used for both the training and the testing data. This step is necessary to prevent the large feature values from dominating the small values. Then, we choose a kernel function. There are various options for the kernel function, such as Gaussian, Radial Basis Function (RBF), polynomial, and sigmoid. We use the RBF kernel for all the experiments. The RBF kernel is the most suitable when the number of features is small and the data needs to be mapped to a higher dimensional space using a nonlinear kernel. Also, the RBF kernel is numerically less complex than the polynomial and sigmoid kernels. It is given by
$$K(\mathbf{x}_i, \mathbf{x}_j) = \exp\left(-\gamma \lVert \mathbf{x}_i - \mathbf{x}_j \rVert^{2}\right),$$
where γ is the kernel parameter and the norm used is the Euclidean distance.
To run the code to implement the SVM, we used a standard SVM library in [5], which determines the
optimum values for the kernel parameters C and γ. Selection of the optimum values is done by
performing an exhaustive grid search on the predefined points. Cross validation is required to ensure
good performance of the detector on the unknown data. In v-fold cross-validation, the training data is
divided into v subsets of equal sample size and v-1 SVMs are created. Each SVM uses v-1 subsets for
training and the remaining subset is used as unknown test data. This gives the estimate of prediction
accuracy for the unknown data by averaging the accuracy values. Cross validation also prevents the
over-fitting problem resulting in better test accuracy [5]. In our experiments we use v = 5.
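For illustration only (the paper uses the LIBSVM package [5]; the scikit-learn calls below are a stand-in for the same procedure), the scaling, RBF kernel, grid search over (C, γ), and 5-fold cross-validation can be sketched as:

import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

def train_rbf_svm(X_train, y_train):
    # Linearly scale every feature to [0, 1]; the same scaler must be reused on test data.
    scaler = MinMaxScaler(feature_range=(0, 1)).fit(X_train)
    param_grid = {'C': 2.0 ** np.arange(-5, 16, 2), 'gamma': 2.0 ** np.arange(-15, 4, 2)}
    # Exhaustive grid search with v = 5 -fold cross-validation on the training set.
    search = GridSearchCV(SVC(kernel='rbf'), param_grid, cv=5)
    search.fit(scaler.transform(X_train), y_train)
    return scaler, search.best_estimator_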
4. LITERATURE REVIEW
Double-compression in a JPEG image almost always indicates image manipulation, although an innocent image with a certain quality factor can sometimes be saved at a different default quality factor, resulting in double-compression of an innocent image. In image forensics, double-compression is used to detect
image forgery. A doctored image may be created by pasting a small part of one image into another
image. If the quality factors of these two images are different, it results in double-compression of the
pasted region.
The embedding algorithms we used were F5, Jsteg-jpeg, JPHide and Seek, Outguess and StegHide.
These algorithms are easily accessible to many non-technical users and are easy to install from code available on the Internet. Other, more recent algorithms, including nsF5, MB, and YASS, are
not considered here due to their more recent introduction and unfamiliarity outside the steganalysis
community.
In [6] Fridrich et al. proposed a targeted attack for JPEG images embedded using F5 steganographic
algorithm. The method mainly considered single-compressed images and was based on estimating the
cover histogram from the stego image and determining the relative number of modifications in the
histogram values. The authors also suggested a method to estimate the primary quantization steps
based on simulating the double-compression using all the candidate quantization matrices. However,
in [6], the authors assumed that the images are already known to be double-compressed. No method
was implemented for the double-compression detection. Lukas and Fridrich extended this idea in [7]
and proposed three methods for detection of primary quantization steps in double-compressed JPEG
images. Two of these methods were based on simulated double-compression as in [6]. The methods
were computationally intensive and did not lead to reliable results. The third method was based on
neural networks and it outperformed the first two methods. However, in the latter paper the authors
assumed that the double-compressed images were cover images only.
He et al. [8] proposed a scheme to detect doctored JPEG images based on double quantization effects
seen in the DCT coefficient histograms. The method used a Bayesian approach for detecting doctored blocks in the images. A support vector machine was trained on four-dimensional features similar to a Fisher discriminant in pattern recognition. This method is not efficient when a high quality
image is recompressed using a lower quality factor. Also, for more accurate formulation of the feature
vectors, the method needs the primary quality factor that cannot be estimated from the proposed
algorithm. In [9], M. Sorell developed a method to detect the primary quantization coefficients of
double-compressed images for forensic purposes. The method was used to detect re-quantization in
innocent JPEG images to identify the source of the images. Methods for detecting doctoring of JPEG
images, where a region undergoes double-compression after being cut and pasted into a larger image,
have been investigated in [8,9,10]. Although some of these methods lead to accurate detection of
double-compression, the application was for forensic purposes, not steganalysis. Since the embedding
process for steganography can change the image statistics significantly, application of these methods
for steganalysis might involve significant algorithmic modifications.
In [11], T. Pevny and J. Fridrich proposed a stego detector for both single- and double-compressed images. They used a neural network-based primary quantization step detector to detect double-compression. Depending on the result of this detector, the image was sent to one of two banks of multi-classifiers designed for single- and double-compressed images. The detection of double-compression was based on the naive approach of comparing the primary and secondary quantization steps, and it led to inaccuracies in the detection. To overcome this, a separate double-compression detector was designed in [12] and used as a pre-classifier to the stego detector. Instead of neural networks, the authors used Support Vector Machines. The SVMs were trained on features based on first order statistics of the mode histograms of quantized DCT coefficients. The primary quantization step detection followed the double-compression detection. For each possible combination of primary and secondary quantization steps, the authors used a separate SVM, which led to a total of 159 SVMs. This approach, while increasing accuracies, is very compute-intensive.
C. Chen et al. proposed a machine learning based scheme for JPEG double-compression detection [7] similar to [12]. The proposed 324-dimensional features were modeled using a Markov process, and transition probability matrices of 2D difference arrays of quantized DCT coefficients were used to create the feature vectors. Then a support vector machine was trained on these features. The method is discussed in detail in Section 6 and the results are compared with our detector. Chen’s model is highly compute-intensive, and we show that its accuracies are lower than those resulting from our approach. Pevny et al. created a complete multi-classifier for single- and double-compressed images [2] by combining all the modules from [10,12]. The double-compression detector was based on histogram mode statistics. The primary quality step estimation was done by a bank of SVMs designed for all the possible combinations of primary and secondary quality steps. Then two banks of multi-
classifiers were created for single- and double-compressed images. This model is also highly compute-intensive.
On a related topic, in [17], the authors perform an extensive analysis of the errors introduced by initial
JPEG compression and double compression. The main difference is that they do not apply their
analysis to stego images, only to cover or innocent images. They investigate three different types of forensics, one of which is determination of the primary standard JPEG quantization table in a double-compressed JPEG image. Their detection schemes are based on sound and novel analysis of the
characteristics of the error that can occur during JPEG compression. The authors use a simple formula
and have relatively high detection accuracy, ranging from roughly 70 percent to 100 percent. The
method we implement gives ranges of roughly 50-100% accuracy, but for cover and stego images, the
latter embedded with a range of embedding rates, from 0.05 to 0.4 bpnz. See Figure 16 below. The
two detection schemes cannot be compared directly as different databases and image types (cover and
stego) were used in the two experiments. Other methods such as in [18], use models to characterize the
effects of rounding and recompression, but again do not use stego images in the analysis. Our methods
are applied to both cover and stego image data.
5. OUR APPROACH
The model proposed by Chen in [7] detected double compressed images, as did the model in [12]. The
main disadvantage we see is very heavy computational resource requirements. In [12], 15 SVMs are
required for stego vs. cover detection of single-compressed images, 159 additional SVMs are required for primary quality step estimation, whereas 3 SVMs are required for stego vs. cover detection of
double-compressed images. Considering the time required for training and testing of each of these
SVMs and the computational resources at hand, we decided to create a different approach that would
minimize the number of SVMs required while maintaining accuracies, thus saving a lot of
computational time.
A major contribution of our approach towards time saving is achieved in the primary quality factor
estimation step. Instead of using an SVM for each possible combination of primary and secondary quality steps as in [2], we use an analytical approach based on the signature patterns in the histograms. The analytical method does not involve training and testing of an SVM and is significantly less
computationally intensive.
We created a steganalyzer that (refer to Figure 7):
1. Classifies an image as single-compressed or double-compressed (Box 1).
2. Classifies a double-compressed image as cover or stego and estimates the primary quality factor from the set of standard JPEG quality factors for further analysis of the image using targeted attacks [6,13] (Box 2).
3. Classifies a single-compressed image as cover or stego (Box 3). (Note: only the QF = 75 case is implemented in the current software.)
We assume three classes for double-compressed images: cover, F5 and Outguess. In practice, double-compressed images can have any quality factor, but we restrict the scope to secondary quality factors 75 and 80, which are the default quality factors for Outguess and F5 respectively.
In section 3.1, we propose the overall scheme for steganalysis of single- and double-compressed images. In section 3.2, we describe the image database used for training and testing the various SVMs. Section 3.3 gives a detailed description of the features used for steganalysis, whereas section 3.4 contains a brief discussion of the features used by state-of-the-art double-compression detectors.
Figure 7. Our steganalyzer.
Figure 7 shows the complete flow of our steganalyzer. If the observed quality factor of the input image is 75 or 80, there is a high probability of double-compression. The image is passed to the double-compression detector. If the image is single-compressed at QF = 75, it is passed to the existing single-compression steganalyzer software. If the image has a QF other than 75 or 80, then we do not process it at this time.
If the image is found to be double-compressed, it is further given to the PQF splitter module. The PQF
splitter determines if the PQF of the image is less than or greater than its SQF and accordingly
classifies the image into two classes. This step helps in the further processing because it is observed
that the signature patterns in the histograms are different based on the relation between the PQF and
the SQF. Depending on this relation, two different approaches are followed. When the PQF is less
than the SQF, we use the analytical approach based on the histogram signature patterns to achieve
stego vs. cover classification. When the PQF is greater than the SQF, we use a multi-class SVM-based
pattern classifier.
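The routing in Figure 7 can be summarized by the following sketch (ours; the callables supplied in the detectors dictionary are placeholders for the modules described above, not actual code from the paper):

def analyze(image, detectors):
    # detectors: a dict of callables supplied by the surrounding system (placeholders here):
    #   'qf', 'is_double', 'pqf_lt_sqf', 'single_qf75', 'analytic', 'multiclass_svm'
    sqf = detectors['qf'](image)
    if sqf not in (75, 80):
        return 'not processed at this time'
    if not detectors['is_double'](image, sqf):          # Box 1: double-compression detector
        return detectors['single_qf75'](image) if sqf == 75 else 'not processed at this time'
    if detectors['pqf_lt_sqf'](image, sqf):             # PQF splitter: PQF < SQF branch
        return detectors['analytic'](image)             # histogram zero-pattern classifier
    return detectors['multiclass_svm'](image)           # PQF > SQF: multi-class SVM bank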
6. IMAGE DATA
In this section, we describe the database used for all the experiments. This description is necessary
because the performance of a SVM-based classifier depends on the database used for training.
We use images from the BOWS-2 database available from [7]. This database was created for the BOWS-2
contest. It consists of approximately 10000 grey-scale images in pgm format. The images are of
uniform size of 512x512 pixels. We divide the BOWS-2 database into two mutually exclusive sets of
equal size. One set is used for creating the training database and other set is used for creating the
testing database. This division allows use of test images that are previously not seen by the SVM. For
all the experiments, a total of 30000 images are used for training, 15000 of each class.
For single-compressed images, five steganographic algorithms are considered: F5, Jsteg, JP Hide and
Seek, Outguess and StegHide. For double-compressed images, only F5 and Outguess are considered
because we assume that these are the only algorithms that are likely to produce double-compressed
images. For all the embedding algorithms except Outguess, three embedding levels are used: 0.05
bpnz, 0.1 bpnz and 0.4 bpnz. Outguess fails to embed the payload at 0.4 bpnz, so the 0.2 bpnz embedding rate is used in its case only. Here, bits per non-zero AC coefficient (bpnz) is used to describe the payload length. Each steganographic algorithm embeds a binary payload consisting of
a randomly generated bitstream into the quantized DCT coefficients array of the JPEG image.
In our experiments, we consider 33 primary quality factors in the set
S = {63, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92,
93, 94}.
Secondary quality factors are limited to 75 and 80. We ignore primary quality factors 64, 65, 66 and
67 because F5 and Outguess fail to embed the payload in the images with these quality factors.
7. FEATURES
There are a wide variety of features that have been used for pattern classification. Farid [14] used
features based on the linear prediction error of wavelet sub-band coefficients. In [15], features based
on quantized DCT coefficients were used for blind steganalysis. Shi et al. [16] used a Markov process
to model four directional differences between the neighboring quantized DCT coefficients to create
features. In all the methods, the idea is to use features that are sensitive to the changes introduced by
the embedding process. Such features contain the information that can effectively separate the classes
under consideration. We used the conditional probabilities that are from Partially Ordered Markov
Models (POMMs) for double-compression detection and steganalysis of single- and double-compressed images. POMMs exploit neighborhood dependencies amongst the quantized DCT
coefficients and have been successfully used for the forward and inverse texture modeling problem. In
order to improve the performance of the POMMs, we additionally use features based on the histogram
signature patterns. We next present the details of these features.
Markov random field models (MRFs) are a well-known stochastic model applied to many different
problems. In imaging problems, a MRF is defined in terms of local neighborhoods of pixels called
cliques that exhibit the probabilistic dependency in the imaging problem. There is an implicit
underlying graph when using a MRF. The neighborhoods of pixels for POMMs, however, have an
implicit underlying acyclic directed graph, which in turn is related mathematically to a partially
ordered set. We omit the details of the relation to partial orders, but now describe the basic POMM
model in terms of an acyclic directed graph.
Let (V, E) be a finite acyclic digraph with corresponding poset (V, ≺). Here V = {V1, V2, ..., Vk} is the set of vertices and E is the set of directed edges given by E = {(i, j) : Vi, Vj ∈ V and (i, j) is an edge with tail on i and head on j}. (V, E) is called acyclic when there does not exist a set of edges (i1, j1), (i2, j2), ..., (ik, jk) where jn = in+1 for n = 1, 2, ..., k-1, and jk = i1. Figure 8 shows an acyclic digraph. We notice that there is no path of directed edges that starts and ends at the same vertex.
Figure 8. Example of an acyclic directed graph.
Definition 1: A set of elements V with a binary relation ≺ is said to have a partial order with respect to ≺ if:
1. a ≺ a, ∀a ∈ V (reflexivity)
2. a ≺ b, b ≺ c ⇒ a ≺ c (transitivity)
3. If a ≺ b and b ≺ a, then a = b (anti-symmetry).
In this case, (V, ≺) is called a partially ordered set or a poset.
Definition 2: For any B ∈ V, the cone of B is the set cone B = {C ∈ V : C ≺ B, C ≠ B}.
Definition 3: For any B ∈ V, the adjacent lower neighbors of B are the elements C such that (C, B) is a directed edge in (V, E): adj≺B = {C : (C, B) is a directed edge in (V, E)}.
Definition 4: Any B ∈ V is a minimal element if there is no element C ∈ V such that C ≺ B, that is,
there is no directed edge (C, B). Let L0 be the set of minimal elements in the poset.
Figure 9 explains these definitions pictorially.
Figure 9. Adjacent lower neighborhoods, cone sets, minimal set.
With this background, we now proceed towards defining POMMs. Let P(A) be the discrete probability
for r.v. A and P(A|B) be the conditional probability of r.v. A given another r.v. B. L0 is the set of
minimal elements in the poset.
Definition 5: Consider a finite acyclic digraph (V, E) of r.v.s with the corresponding poset (V, ≺). For B ∈ V, consider a set YB of r.v.s not related to B, YB = {C : B and C are not related}. Then (V, ≺) is called a partially ordered Markov model (POMM) if for any B ∈ V \ L0 and any subset UB ⊂ YB we have
P(B | cone B, UB) = P(B | adj≺B).
In this case, the adjacent lower neighbors of B describe the Markovian property of the model.
Next, we are interested in using POMMs for modeling steganographic changes to an image. For steganalysis applications, we create features that use directly the quantized DCT coefficient values that are modeled by a POMM. We describe a general approach that lets the steganalyst use her
expertise to construct such a POMM. First, subsets of pixels are chosen that contain relative
information that may have changed after embedding a payload. Next, a function f from the set of
subsets to the real numbers is found that is used to quantify the change in values that occur after
embedding, and applied to each subset under consideration. Then, an acyclic directed graph is created
from the set of subsets and values in the range of f (vertices) and the binary relation describing the
function and its range (edges). The induced poset is constructed from this acyclic digraph, and a
POMM is created using this poset. This gives rise to the conditional probabilities P(B | adj ≺B) which
are used as features for steganalysis.
Let A be an MxN array of quantized DCT coefficients: A = {Ai,j : 1 ≤ i ≤ M, 1 ≤ j ≤ N}. Let Y = {Y1, Y2, ..., Yt} be a collection of subsets of r.v.s in A. Here, every Yi is a shift-invariant ordered set. For example, consider a case where Y1h = {A1,1, A1,2}, Y2h = {A1,2, A1,3}, and so on. Each Yih is a set containing two pixels that are adjacent in the horizontal direction. Let Yh = {Y1h, Y2h, ..., Ykh} contain all such sets of r.v.s in the array A. Let f be the function f : Y → R, where R is a set of real numbers, defined by
f(y1, y2) = y1 − y2, so that
f(Yih) = f(Aj,k, Aj,k+1) = Aj,k - Aj,k+1, for some indices i, j and k.
In this case, f(Yi) is the image of Yi under f and Yi is the pre-image of f(Yi). An acyclic digraph (V, E)
is created from the set of vertices V = Y ∪ f(Y ) and the set of edges E = {Ei} between an element of
the range from f and an element of Y. Thus, edge Ei = (f(Yi),Yi) has tail on image f(Yi) and head on
pre-image Yi . Thus (V, E) forms a function-subset or f-Y acyclic digraph. This acyclic digraph is used
to create a sequence of POMMs whose conditional probabilities are used as features. If f is a function
that exploits the dependency among the quantized DCT coefficients, then it is considered useful for
steganalysis. For such function f, P(Yi | f(Yi)), which is a frequency of occurrence of pre-image of
f(Yi), can be used to distinguish between cover and stego images. This is the motivation for using f-Y
acyclic digraphs.
Figure 10 shows an example of the f-Y directed graphs we use for our features. There are two subsets
Y112 and Y249 shown which represent a portion of the quantized DCT coefficients array. Function f,
when applied to these subsets produces value of -1. So the probability P (3, 4 | (3 − 4)) measures the
frequency of occurrence of this pattern. A collection of all such conditional probabilities are captured
by the POMMs.
As mentioned above, function f = y1 − y2, where y1 and y2 are adjacent pixels, is a very useful feature
for steganalysis. The directional difference arrays created in this manner have been successfully used
in [7], [16]. We use this function f to create a series of POMMs, one in each of the horizontal, vertical,
diagonal, and minor diagonal (the reverse diagonal) directions. The conditional probabilities thus
obtained are averaged over a collection of POMMs.
For a rectangular array A of r.v.s we consider four directional subsets in horizontal, vertical, diagonal
and minor diagonal directions given by Yi,kh = {Ai,k, Ai,k+1}, Yi,kv = {Ai,k, Ai+1,k}, Yi,kd = {Ai,k, Ai+1,k+1},
and Yi,km = {Ai+1,k, Ai,k+1}. From the four sets Yh, Yv, Yd, Ym thus obtained, we create four acyclic
digraphs (V*,E*) where V* = (Y* ∪f(Y*)) and E* = {Ei* : Ei* = (f(Yi*), Yi*)}, and * ∈ {h, v, d, m}. For
each digraph, a POMM is defined by its conditional probability given by:
P*(Y* | f(Y*)) = P*(Y*, f(Y*)) / P*(f(Y*)).
These probabilities are calculated from the histogram bins. The histograms are clipped between [-T,
T]. This not only avoids the sparse probability density function at the histogram ends, but also reduces
the number of computations and the size of the feature vector. Thus, we have -T ≤ Ai,j ≤ T. This limits the number of values of P* to (2T + 1)^2 in each direction. In our experiments, we use T = 3. We now define a matrix F* of size (2T + 1) × (2T + 1):
F*(w, z) = P*(Y* | f(Y*)) = P*(w, z | f(w − z)),
From these, a set of (2T + 1)^2 features can then be defined by averaging the four directional matrices:
$$F(w, z) = \frac{1}{4}\sum_{* \in \{h, v, d, m\}} F^{*}(w, z). \qquad \text{(Eq. 3)}$$
The POMMs are applied on the global array to capture intrablock dependencies and on the mode
arrays to capture interblock dependencies. These conditional probabilities are used to define our
features.
The intrablock features are defined on the set of global Q-DCT coefficients, on the entire array itself.
Eq. 3 is used to calculate the average values over the four directions. This produces (2T + 1)^2 features. Interblock features are defined on the mode arrays. 64 mode arrays are created from the 64 modes by collecting all the Q-DCT values at each mode. Eq. 3 is then applied to each of the 64 arrays. Those values are averaged over all 64 arrays, giving an additional (2T + 1)^2 feature values.
We apply calibration to the image before extracting features. An image is calibrated by decompressing it to the spatial domain and then compressing it back after cropping by a few pixels in both directions in the spatial domain. This helps to create features that are dependent on the data embedded in the image rather than on the image content itself [16]. For an image Io, we obtain a calibrated image Ic. The intrablock and interblock POMMs are calculated for both Io and Ic. The final feature vector F is obtained by taking the difference between Fo and Fc, F = (Fo - Fc)(w, z). We get a total of 2(2T + 1)^2 POMM features. For our experiments, we use the threshold value T = 3. This results in 49 intrablock and 49 interblock features, giving a total of 98 POMM features.
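A minimal sketch of the intrablock POMM features (our reconstruction under stated assumptions, not the authors' code; the interblock mode arrays and the calibration difference Fo − Fc would reuse the same routine) estimates, for each direction, the conditional probability P(w, z | w − z) from pairs of adjacent clipped coefficients and averages over the four directions:

import numpy as np

def _adjacent_pairs(A, dr, dc):
    # Return the flattened (y1, y2) values of all pairs offset by (dr, dc).
    H, W = A.shape
    r0, r1 = max(0, -dr), H - max(0, dr)
    c0, c1 = max(0, -dc), W - max(0, dc)
    return A[r0:r1, c0:c1].ravel(), A[r0 + dr:r1 + dr, c0 + dc:c1 + dc].ravel()

def intrablock_pomm_features(A, T=3):
    A = np.clip(np.asarray(A, dtype=int), -T, T)
    size = 2 * T + 1
    diff = np.subtract.outer(np.arange(size), np.arange(size))   # diff[i, j] = w - z
    F = np.zeros((size, size))
    for dr, dc in [(0, 1), (1, 0), (1, 1), (1, -1)]:              # h, v, d, m directions
        y1, y2 = _adjacent_pairs(A, dr, dc)
        joint = np.zeros((size, size))
        np.add.at(joint, (y1 + T, y2 + T), 1)                     # counts of each pair (w, z)
        cond = np.zeros_like(joint)
        for d in range(-2 * T, 2 * T + 1):                        # normalise within w - z = d
            mask = diff == d
            total = joint[mask].sum()
            if total > 0:
                cond[mask] = joint[mask] / total
        F += cond / 4.0                                           # average over the 4 directions
    return F.ravel()                                              # (2T + 1)^2 = 49 features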
The second set of features is derived from the histogram patterns noted in the above section. As shown above, when the primary quality factor is smaller than the secondary quality factor, the mode histograms of Q-DCT coefficients have zeros at some places. These zeros can be captured by looking at the forward differences between histogram bin values. For a single-compressed image, the distribution of DCT coefficients is ideally Gaussian. The forward differences are thus almost always small and positive. For double-compressed images, the zeros can take the form of local minima instead of being exact zeros. So instead of finding absolute zeros, a drop of more than 75% between successive histogram values is considered to indicate the presence of a zero. This approach is followed for images with primary quality factors less than the secondary quality factor. For a secondary quality factor
of 75, the primary quality factors under consideration are given by:
s ∈ S75 = {63,68,69,70,71,72,73}
The quality factor 74 is not taken into account because the quantization steps for the 9 lower modes we
use are identical to the quantization steps for quality factor 75. Thus images having primary quality
factor 74 and secondary quality factor 75 are considered as single-compressed for classification
purposes.
When the secondary quality factor is 80, there are 13 primary quality factors under consideration
which are given by:
s ∈ S80 = {63,68,69,70,71,72,73,74,75,76,77,78,79}
For any given image, we first inspect the first 22 bin values in the histograms of the nine low
frequency DCT coefficients. The nine low frequency modes are:
H = {(0, 1), (0, 2), (0, 3), (1, 0), (1, 1), (1, 2), (2, 0), (2, 1), (3, 1)}
The remaining modes are not considered because most of the quantized DCT coefficients have value
zero or close to zero. The histogram location k for mode m is represented by hm(k) where 0 < (m = i+j)
≤ 3; i and j are the coordinates of the location in each 8×8 block. Using the values from hm(k), we
create a 9 dimensional matrix Mˆ for the candidate image.

where 1 ≤ m ≤ 9 and 1 ≤ k ≤ 22, and χ is the indicator function. Each value Mˆ(m, k) is a zero feature. The matrix Mˆ is called the zero feature matrix. We use zero features for double-compression detection, primary quality factor estimation, and cover vs. stego detection when the primary quality factor is smaller than the secondary quality factor.
We achieve a highly accurate double-compression detection rate by combining the POMM features and the zero features. We observed that the accuracy of the double-compression detector improves significantly when the zero features are added to the 98 POMM features. The zero features are particularly effective when the primary quality factor of an image is lower than the secondary quality factor. For a given input image, the matrix Mˆ of zero locations is created, and then each matrix row of values is summed to produce 9 features. These features are appended to the 98 POMM features and used to train the SVMs.
We also observed that the signature patterns in the histograms are generally distinctive for a given combination of primary and secondary quality factors. We quantified the effect of this phenomenon for estimating the primary quality factor of the image. For a given combination of primary and secondary quality factors, we create standard matrices Mst, where t is the secondary quality factor. These matrices capture the expected locations of zero-valued histogram bins. Thus, for primary quality factor s and secondary quality factor t, s < t, we define Mst(m, k) as the indicator that histogram bin k of mode m is expected to be zero for that combination of quality factors.
All of the matrices Mst can be calculated beforehand and stored. To use these matrices to estimate the primary quality factor of an unknown image that has already been detected as double-compressed, and whose primary quality factor is less than its secondary quality factor t = 75 (or 80), we calculate the matrix Mˆ for the unknown image. Then we find the matrix Mst that is the “closest” to Mˆ from the standard set of matrices already precalculated. The closest estimated standard matrix Mest is formed by minimizing the absolute difference between each Mst and Mˆ:
$$M_{est} = \arg\min_{M_{st}} \sum_{m,k} \left| M_{st}(m, k) - \hat{M}(m, k) \right|.$$
We also observed that for cover images the dips in the histograms assume an absolute zero value, whereas for stego images the dips are characterized by a partial drop in the histogram values. We used this fact in the successful detection of stego images.
Cover vs. stego detection for PQF < SQF. The histogram zero patterns can also be used for stego detection in images with PQF < SQF. It was stated earlier that a 75% drop in the histogram bin values characterizes a zero for a double-compressed image. Ideally a zero bin should have a value exactly equal to zero, but the statistics of the DCT coefficients change due to the process of embedding. In order to account for these changes, a 75% drop is considered to indicate a zero. For double-compressed cover images, the drop is 100% and the zero bins actually take the value zero. For double-compressed stego images, we consider even a small drop in the bin value as a zero. This acts as a distinguishing feature between cover and stego images. For all the standard matrices Mst from the standard set Sstd, we count the number of absolute zero bins that are expected to occur for each combination of PQF and SQF.
For an input image, first the PQF is estimated. For the estimated PQF, the expected count of zero bins
is obtained. From matrix Mˆ derived from an input image, we count the total number of absolute zero
values. If this count matches the expected standard count exactly, the image is a cover image;
otherwise it is a stego image.
In this case, classification is done using an analytical method as opposed to SVM-based detection. For
a given detector, set Sstd of standard matrices is created and stored beforehand. This method saves a lot
of time required in feature extraction and SVM training, and is also very accurate.
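The analytical decision rule described above can be summarized as follows, continuing the earlier sketches. It assumes that Mˆ flags zeros via the relative-drop criterion, that the per-mode histograms are available, and that the expected counts of absolute zeros have been tabulated beforehand per (PQF, SQF) pair; these interfaces are illustrative, not the authors' implementation.

```python
def is_cover_image(hists, m_hat, pqf, sqf, expected_zero_count):
    """Analytical cover vs. stego decision for PQF < SQF.

    A bin flagged in m_hat counts as an *absolute* zero only if its histogram
    value is exactly 0. The image is declared a cover image when this count
    matches the precomputed expected count for the (pqf, sqf) combination.
    """
    absolute_zeros = sum(int(hists[m][k] == 0)
                         for m in range(m_hat.shape[0])
                         for k in range(m_hat.shape[1])
                         if m_hat[m, k] == 1)
    return absolute_zeros == expected_zero_count[(pqf, sqf)]
```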
8. RESULTS
We compare our double compression detector with a state-of-the-art double-compression detector by
Chen et al. [7]. In this method, nearest-neighbour differences in the vertical, horizontal, diagonal and
minor diagonal directions are calculated, resulting in a set of four 2-dimensional difference arrays. In
order to reduce the computations, these arrays are thresholded to values between -4 and +4. Each
difference array is modeled using a one-step Markov process characterized by a transition
probability matrix (TPM). This results in four 9×9 TPMs, which leads to 324 features. These features
are used to train a support vector machine. See [7] for more details.
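For orientation, a rough sketch of this feature construction is given below. It operates on a single 2-D array of quantized DCT coefficients and, for brevity, uses horizontal transitions for every difference direction; the reference implementation in [7] differs in such details.

```python
import numpy as np

def transition_probability_matrix(diff, T=4):
    """Empirical TPM of a one-step Markov chain over a thresholded array."""
    d = np.clip(diff, -T, T).astype(int)
    tpm = np.zeros((2 * T + 1, 2 * T + 1))
    for a, b in zip(d[:, :-1].ravel(), d[:, 1:].ravel()):
        tpm[a + T, b + T] += 1
    row_sums = tpm.sum(axis=1, keepdims=True)
    return np.divide(tpm, row_sums, out=np.zeros_like(tpm), where=row_sums > 0)

def chen_features(coeff_plane, T=4):
    """4 x 9 x 9 = 324 features: one TPM per difference direction, flattened."""
    diffs = [coeff_plane[:, :-1] - coeff_plane[:, 1:],      # horizontal
             coeff_plane[:-1, :] - coeff_plane[1:, :],      # vertical
             coeff_plane[:-1, :-1] - coeff_plane[1:, 1:],   # main diagonal
             coeff_plane[1:, :-1] - coeff_plane[:-1, 1:]]   # minor diagonal
    return np.concatenate([transition_probability_matrix(d, T).ravel()
                           for d in diffs])
```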
The authors assume that the double-compressed images are cover images only and the primary quality
factors considered are in the range 50 to 95 in steps of 5. The detection accuracies obtained are very
high. However, when the method is used for double-compressed stego images, the detection
accuracies drop. We compare this detector with our POMM-based detector.
POMM detector. An unknown image is input to the steganalyzer. If the observed quality factor of the
image is 75 or 80, it is likely to be double-compressed. This image is passed on to the double-compression
detector (referred to as the DCDetector). Two separate detectors are created for secondary
quality factors 75 and 80. Each detector is a binary classifier trained on 15000 single-compressed and
15000 double-compressed images. For the first detector, the images with PQF 74 and 75 are
excluded, whereas for the second detector, images with PQF 80 are excluded from the training
database. This is because the statistics of images with PQF equal to the SQF are the same as those of
single-compressed images. The stego images are further divided amongst the number of classes and
subdivided amongst three embedding levels. Double-compressed images are also divided equally
amongst the different primary quality factors.
For testing, we start with 1500 unique cover images at each PQF in the range 63 to 94. In case of F5
and outguess, we use 500 images at each of the three embedding levels, resulting in 1500 images at
each PQF.
Our double-compression detector uses a combination of 98 POMM features and 9 zero features, giving a
total of 107 features. A double-compression detector is required to have a very low false positive rate.
The false positive rate is the percentage of single-compressed test images that get classified as double-compressed.
If a single-compressed image is classified as double-compressed, it can be assigned to
only cover, F5 or outguess instead of the six classes cover, F5, jp hide & seek, jsteg, outguess and
steghide. This leads to classification errors. In Figure 11, we present the accuracy plots.
Figure 11. Accuracy of double-compression detectors on double-compressed cover [(a),(b)], F5
[(c),(d)] and outguess [(e),(f)] images.
The DCDetector achieves very high accuracy for many of the PQF values. In Figure 11, it can be seen that the
DCDetector can classify cover images accurately except for two cases. The first case is when the PQF is
equal to or close to the SQF (74, 75 and 76 in the left column plots and 80 and 81 in the right column
plots), and the second case is when the PQF is greater than 90. PQF = 74 and SQF = 75 is a special case
because the nine low frequency locations that we consider are identical for quality factors 74 and 75.
This makes it impossible to detect images with this combination of quality factors using only this
information. More generally, when the PQF is very close to the SQF, the statistics of a
double-compressed image are close to those of the corresponding single-compressed image. Also, for a
PQF greater than 90, most of the quantization steps are equal to one, due to which the effect of
double-compression is not obvious. This explains the drop in the detection accuracies in these two
cases. A similar trend is observed in the case of F5 and outguess with respect to the PQF. Also, the
detection accuracies drop when the embedding level increases. For the cases mentioned above where
the PQF is close to the SQF, if the detector classifies the image as single-compressed, it is considered
a correct decision. In order to determine the overall detection accuracy, we consider the average of
the true positive rate for double-compressed test images and the true negative rate for the
single-compressed test images.
In Table 2, we present the false positive rates. Except for F5 at 0.4 bpnz and outguess at 0.2
bpnz, the false positive rates are lower than 8% for SQF = 75. For SQF = 80, the false positive
rates are lower than the corresponding SQF = 75 cases by at least 0.5 percentage points.
Table 2. False positive rates for cover and stego for the double compression detector.
We also implemented Chen’s double compression detector. In Figure 12, we give the percent false
positive rate for the three levels of embedding for the 5 embedding algorithms for both Chen’s
detector and the POMM detector. In Figure 13, we give the overall average detector accuracy rates.
These two figures show that Chen's detector has accuracies comparable to our POMM-based detector
for a limited number of PQFs. For most of the PQFs, its accuracies are lower than those of our detector.
Figure 12. False positive rates for Chen’s detector and POMM detector.
Figure 13. Overall average detector accuracy for Chen’s detector and the POMM detector.
To put the histogram information to its best use, we created an SVM that classifies the primary
quality factor of double-compressed images into two groups: greater than 75 (or 80)
and less than 75 (or 80). We call this the primary quality factor splitter, or PQF splitter. The PQF
splitter is accurate because the signature patterns in the histograms of the quantized DCT
coefficients are different when the PQF is smaller than the SQF and when the PQF is greater than the
SQF. In order to decide which approach to take, it is necessary to find the relation between the PQF and
the SQF for a particular image. The PQF Splitter classifies an image based on whether the PQF is smaller
than or greater than the SQF. Two separate binary classifiers are created for SQFs of 75 and 80. For
SQF = 75, there are 7 PQFs in the range below 75 (63, 68, .., 73) and 19 PQFs above 75 (76, 77, .., 94).
We use 15000 training images for each class. The images are further divided into cover, F5 and
outguess categories and then subdivided based on the PQF and embedding levels as well. In the same
way, a database is created for the SQF = 80 case. In this case, there are 13 PQFs below 80 and 14 PQFs
above 80, so the division of images changes accordingly. The classification accuracies of the PQF
Splitters are shown in Figure 14. The plots on the left show accuracies for SQF = 75 and the plots on
the right show accuracies for SQF = 80 case.
Note that when the SQF = 75, Figure 14 shows the detection accuracies are almost always close to
100% over the entire range of PQFs for cover, F5 and outguess. When SQF = 80, the detection
accuracies for cover images are close to 100% over the entire range of PQFs. When SQF = 80, the
detection accuracies for F5 and outguess images drop when the PQF is 79. In this case, the double-compressed
image is close to its single-compressed version. For the PQF Splitter, the detection accuracies
do not vary with the embedding levels.
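As an illustration of how such a splitter could be trained, the sketch below uses scikit-learn's SVC (a wrapper around LIBSVM [5]); the feature set, kernel choice and class labeling are assumptions made for the example rather than the configuration used in this work.

```python
import numpy as np
from sklearn.svm import SVC  # scikit-learn's SVC wraps LIBSVM [5]

def train_pqf_splitter(features, pqf_labels, sqf=75):
    """Binary classifier: is the PQF of a double-compressed image below
    or above the secondary quality factor?

    features: (n_images, n_features) array, e.g. POMM + zero features.
    pqf_labels: primary quality factor of each training image.
    """
    y = (np.asarray(pqf_labels) > sqf).astype(int)   # 0: PQF < SQF, 1: PQF > SQF
    splitter = SVC(kernel="rbf")
    splitter.fit(features, y)
    return splitter
```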
Cover vs. stego detector. Once the range of PQF is determined by the PQF splitter, we perform cover
Vs. stego detection. Depending on whether the PQF of an image is smaller or larger than SQF, we use
different methods for classification.
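To summarize how the modules fit together, the decision flow of the steganalyzer can be sketched as follows; the four component objects and their method names are placeholders for this illustration, not the authors' API.

```python
def analyze_image(image, dc_detector, pqf_splitter, analytic_detector, svm_detector):
    """High-level decision flow for an image whose observed quality factor
    (the SQF) is 75 or 80."""
    if not dc_detector.is_double_compressed(image):
        # Single-compressed path: SVM-based cover vs. stego detection.
        return svm_detector.classify_single_compressed(image)
    if pqf_splitter.pqf_below_sqf(image):
        # PQF < SQF: analytical method based on histogram zero patterns.
        return analytic_detector.classify(image)
    # PQF > SQF: majority-vote SVM multiclassifier.
    return svm_detector.classify_double_compressed(image)
```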
Cover Vs. stego detector for Primary Quality Factor < Secondary Quality Factor. When the PQF of
an image is less than the SQF, the analytical method is used. The method is based on the histogram
signature patterns. This approach saves time and computational resources required for the feature
extraction process and intensive training of support vector machines. Figure 15 shows the detection
accuracy for when the observed quality factor (SQF) is 75 (left column) and for when the observed
quality factor is 80 (right column). We observe that the detection accuracies are almost always close to
100% for F5 and outguess. For cover, the detection accuracies are also close to 100%. When the PQF is
very close to the SQF, there is a drop in the detection accuracy.
Cover Vs. stego classifier for Primary Quality Factor > Secondary Quality Factor. We tried three
different approaches for this multi-classification problem: Cover Vs. stego binary classifier,
multiclassifier, and majority vote classifier. Our results showed that the majority vote multiclassifier
worked the best. We summarize the results in Table 4.
Table 4. Confusion matrices for cover Vs. stego detection for PQF > SQF and (a) SQF = 75 and (b)
SQF = 80.
We summarize the results of cover Vs. stego detection for PQF > SQF case in the confusion matrix
given in Table 4. We pass test images of known class to the detector and calculate the probabilities of
both true and false detection. We observe that the detection accuracies for cover and outguess are
above 96%. For F5, 9.5% of the images get misclassified as cover when SQF is 75 and this
corresponds to the lowest detection accuracy. The low accuracies for F5 are due to the fact that F5
does not preserve the global statistics of the image. The artifacts of double-compression are thus lost
during the process of embedding.
At this point, we have cover Vs. stego detectors for both PQF < SQF and PQF > SQF case. We
determine the overall accuracy of the detector by averaging the accuracies of these two detectors over
all PQFs and all embedding levels. This gives us the cover Vs. stego detection accuracy of our
detector for double-compressed images. In Table 5, we compare these results with those obtained from
the previous Canvass software. We can clearly see a significant rise in the detection accuracies.
Table 5. Overall detection accuracy of (a) the previous Canvass software and (b) our detector, tested on double-compressed images.
Primary Quality Factor Detector. The last step in the blind steganalysis is estimation of the primary
quality factor. It can be used to extract further information regarding the secret payload, although we
do not use it for further processing in this work. The signature zero patterns in the histograms are used
for the PQF detection. Separate detectors are created for SQF = 75 and SQF = 80 case. Figure 16
shows the detection accuracy plots. Again, the plots on the left represent SQF = 75 case and plots on
the right represent SQF = 80 case.
We first discuss the results when SQF = 75. These are shown in the left column of Figure 16. The
PQF detection accuracies for double-compressed cover images are almost always 100% when PQF <
SQF. In general, when PQF is less than the SQF, the zero patterns in the histograms are prominent.
This leads to high detection accuracies. Double-compressed F5 images with PQF 70 get misclassified
as PQF = 71. Out of the nine low-frequency modes considered for the detection, the quantization
matrices for quality factors 70 and 71 differ in only 2 modes. We do not detect PQF 89. This is because
the histogram patterns arising from the combinations of PQFs 88 and 89 with SQF 75 are identical.
The quality factor 89 gets detected as 88.
Therefore we exclude it from the detection algorithm. In general, the detection accuracies are low
when PQF > SQF because the histogram patterns are not very prominent.
We now discuss the results for SQF = 80 case. These are shown in the right column of Figure 16.
Similar to the SQF = 75 case, the detection accuracies are close to 100% when PQF < SQF. PQF 75
gets detected as 74. This is because the values at the nine low-frequency modes under consideration are
identical for these two quality factors. Therefore we exclude the quality factor 75. Similarly, for F5
and outguess, PQF 85 almost always gets detected as 84 and PQF 88 gets detected as 89. This explains
the drops in the detection accuracies for these PQFs. We do not detect PQFs 92 and 93. This is
because the histogram patterns arising from the combinations of PQFs 91, 92 and 93 with SQF 80 are
identical. The quality factors 92 and 93 get detected as 91. Therefore we exclude these quality factors
from the detection algorithm.
Figure 14. Accuracy of PQF Splitters on double-compressed Cover, F5 and Outguess images.
Figure 15. Accuracy of the analytical cover Vs. stego detector (PQF < SQF) on double-compressed Cover, F5 and Outguess images.
Figure 16. Accuracy of PQF detector.
9. CONCLUSIONS
In this work, we created a complete steganalyzer system for single as well as double-compressed
images. We introduced a new statistical modeling tool to measure the changes caused by various
steganographic algorithms as well as by double-compression. We showed that the POMM features
perform better than a state-of-the-art double-compression detector. We also demonstrated the utility of
POMMs for solving a variety of classification problems such as double-compression detection, PQF
splitting and cover Vs. stego detection.
We introduced analytical methods for cover Vs. stego detection and primary quality factor detection.
The methods are based on the signature patterns in the histograms of quantized DCT coefficients, as
opposed to the other SVM-based classification methods. Each SVM in our experiments is trained on
30,000 images and tested on approximately 135,000 images for different cases. Creating each
SVM involves 1 day to extract the training features from 30,000 images, 1.5 days of actual SVM
training and 2 days for testing. The SVM-based approach [6] requires 1 SVM for the double-compression
detector, 159 SVMs for primary quality
step estimation, 3 binary classifiers for stego detection of double-compressed images and 15 SVMs for
stego detection of single-compressed images. Thus a total of 178 SVMs are required. Our approach on
the other hand uses 1 SVM for double-compression detection, 1 for PQF splitting, 15 SVMs for
single-compressed stego detection and 3 SVMs for double-compressed stego detection, which gives a
total of 20 SVMs. These novel analytical methods thus save a large amount of time required for
feature extraction and intensive SVM training. The histogram pattern features were also used in
addition to POMMs to improve the detection accuracies of SVM-based classifiers.
We show that the detection scheme works better if a double-compression detector is used as a pre-classifier. The detection accuracies for double-compressed images improve significantly compared to
the previous software. We also compare the performance of the various modules with those presented
in [12]. The POMM-based PQF detector has high detection accuracies for PQF < SQF. The
conditional probabilities given by POMMs describe the inter- and intra-block relations between pixels
in a JPEG image. In the future, other functions could be investigated to describe other pixel
dependencies. Also, currently we limit the steganalysis of single-compressed images to quality factor
75 and that of double-compressed images to secondary quality factors of 75 and 80. But this novel
method for steganalysis of double-compressed data looks promising and could be generalized for any
combination of primary and secondary quality factors.
10. REFERENCES
1. Stegoarchive. Available online at http://www.stegoarchive.com
2. A. Westfeld. F5 A Steganographic Algorithm: High Capacity Despite Better Steganalysis.
Information Hiding, 4th International Workshop, I. S. Moskowitz, Ed. Pittsburgh, PA, USA, April
2001, pages 289-302.
3. N. Provos. Outguess: Universal Steganography. http://www.Outguess.org, August 1998.
4. Davidson, J., Jalan, J. Canvass - A Steganalysis forensic tool for JPEG images, 2010 ADFSL
Conference on Digital Forensics, Security and Law, St. Paul, MN, May 2010.
5. Chih-Chung Chang and Chih-Jen Lin. LIBSVM: a library for support vector machines. Software
available at http://www.csie.ntu.edu.tw/~cjlin/libsvm, 2001.
6. J. Fridrich, M. Goljan and D. Hogea. Steganalysis of JPEG Images: Breaking the F5 Algorithm. 5th
Information Hiding Workshop, Noordwijkerhout, The Netherlands, October 2002 , pages 310-323.
7. C. Chen, Y. Q. Shi, and W. Su. A machine learning based scheme for double JPEG compression
detection. Proceedings IEEE Int. Conf. Pattern Recognition, Tampa, FL. pages 8-11, 2008.
8. J. He, Z. Lin, L. Wang, X. Tang. Detecting doctored JPEG images via DCT coefficient analysis.
Proceedings of the 9th European Conference on Computer Vision, Lecture Notes in Computer
Sciences, vol. 3953, Springer, Berlin, pages 423-435, 2006.
9. M. Sorell. Conditions for effective detection and identification of primary quantization of re-quantized JPEG images. E-FORENSICS, 1st International ICST Conference on Forensic Applications
and Techniques in Telecommunications, Information and Multimedia. ICST, 2008.
10. Z. Lin, J. He, X. Tang, C. Tang. Fast, automatic and fine-grained tampered JPEG image detection
via DCT coefficient analysis. PR(42), No. 11, pages 2492-2501, November 2009.
11. J. Fridrich, T. Pevny. Determining the Stego Algorithm for JPEG Images, Special Issue of IEE
Proceedings-Information Security, 153(3), pp. 75-139, 2006.
12. T. Pevny and J. Fridrich. Detection of double-compression in JPEG images for applications in
steganography. IEEE Transactions on Information Forensics and Security, vol. 3, no. 2, pages 247-258, June 2008.
13. J. Fridrich, M. Goljan and D. Hogea. Attacking the outguess. Proc. of the ACM Workshop on
Multimedia and Security, Juan-les-Pins, France, December 2002.
14. Siwei Lyu and Hany Farid. Detecting hidden messages using higher-order statistics and support
vector machines. Information Hiding, 5th International Workshop, IW 2002, volume 2578 of Lecture
Notes in Computer Science, pages 340-354. Springer-Verlag, New York.
15. T. Pevny and J. Fridrich. Merging Markov and DCT features for multi-class JPEG steganalysis. In
Proceedings of SPIE, Electronic Imaging, Security, Steganography, and Watermarking of Multimedia
Contents IX, San Jose, CA, volume 6505, pages 0304, January 2007
16. Y. Q. Shi, C. Chen, and W. Chen. A Markov process based approach to effective attacking jpeg
steganography. In Proceedings of the 8th Information Hiding Workshop, volume 4437 of LNCS,
pages 249-264. Springer, 2006.
17. W. Luo, J. Huang, G. Qui. JPEG error analysis and its applications to digital image forensics.
IEEE Transactions on Information Forensics and Security, Vol. 5, No. 3, pages 480-491, Sep. 2010.
18. T. Gloe. Demystifying histograms of multi-quantised DCT coefficients. IEEE Conf. on
Multimedia and Expo (ICME), pages 1-6, 2011.
TOWARD ALIGNMENT BETWEEN COMMUNITIES OF
PRACTICE AND KNOWLEDGE-BASED DECISION
SUPPORT
Jason Nichols, David Biros, Mark Weiser
Management Science and Information Systems
Spears School of Business
Oklahoma State University
Stillwater, OK 74078
ABSTRACT
The National Repository of Digital Forensics Information (NRDFI) is a knowledge repository for law
enforcement digital forensics investigators (LEDFI). Over six years, the NRDFI has undertaken
significant design revisions in order to more closely align the architecture of the system with theory
addressing motivation to share knowledge and communication within ego-centric groups and
communities of practice. These revisions have been met with minimal change in usage patterns by
LEDFI community members, calling into question the applicability of relevant theory when the
domain for knowledge sharing activities expands beyond the confines of an individual organization to
a community of practice. When considered alongside an empirical study that demonstrated a lack of
generalizability for existing theory on motivators to share knowledge, a call for deeper investigation is
clear. In the current study, researchers apply grounded theory methodology through interviews with
members of the LEDFI community to discover aspects of community context that appear to position
communities of practice along a continuum between process focus and knowledge focus. Findings
suggest that these contextual categories impact a community’s willingness to participate in various
classes of knowledge support initiatives, and community positioning along these categories dictates
prescription for the design of knowledge-based decision support systems beyond that which can be found
in the current literature.
Keywords: grounded theory, decision support, communities of practice, knowledge management
1. INTRODUCTION
The Center for Telecommunications and Network Security (CTANS), a recognized National Security
Agency Center of Excellence in Information Assurance Education (CAEIAE), has been developing,
hosting, and continuously evolving web-based software to support law enforcement digital forensics
investigators (LEDFI) via access to forensics resources and communication channels for the past 6
years. The cornerstone of this initiative has been the National Repository of Digital Forensics
Information (NRDFI), a collaborative effort with the Defense Cyber Crime Center (DC3), which has
evolved into the Digital Forensics Investigator Link (DFILink) over the past two years. DFILink is
soon to receive additional innovations tailored to its LEDFI audience, and the manuscript herein is an
account of recent grounded theory research efforts targeting the LEDFI community in order to form a
baseline to match their needs with the resources and services contained within DFILink. More
broadly, the grounded theory that is emerging from this study highlights critical characteristics of
context for a knowledge-based decision support implementation that the current literature on
motivating knowledge sharing appears to be lacking. In order to motivate the need for this grounded
theory work, the following sub-sections briefly describe the theory-driven approaches to early NRDFI
design, the evolution from NRDFI to DFILink, and replication of a prior empirical study that
highlights the potential gap in theory as it relates to motivators for knowledge sharing and actual system
use.
1.1. NRDFI
The development of the NRDFI was guided by the theory of the ego-centric group and how these
groups share knowledge and resources amongst one another in a community of practice (Jarvenpaa &
Majchrzak, 2005). Within an ego-centric community of practice, experts are identified through
interaction, knowledge remains primarily tacit, and informal communication mechanisms are used to
transfer this knowledge from one participant to the other. The informality of knowledge transfer in
this context can lead to local pockets of expertise as well as redundancy of effort across the broader
community as a whole. In response to these weaknesses, the NRDFI was developed as a hub for
knowledge transfer between local law enforcement communities. The NRDFI site was locked down
so that only members of law enforcement were able to access content, and members were provided the
ability to upload knowledge documents and tools that may have developed locally within their
community, so that the broader law enforcement community of practice could utilize their
contributions and reduce redundancy of efforts. The Defense Cyber Crime Center, a co-sponsor of the
NRDFI initiative, provided a wealth of knowledge documents and tools in order to seed the system
with content.
Response from the LEDFI community was positive, and membership to the NRDFI site quickly
jumped to over 1000 users. However, the usage pattern for these members was almost exclusively
unidirectional. LEDFI members would periodically log on, download a batch of tools and knowledge
documents, and then not log on again until the knowledge content on the site was extensively
refreshed. The mechanisms in place for local LEDFI communities to share their own knowledge and
tools sat largely unused. From here, CTANS began to explore the literature with regards to motivating
knowledge sharing, and began a re-design of NRDFI driven by the extant literature, and focused on
promoting sharing within the LEDFI community through the NRDFI.
1.2. Motivating Knowledge Sharing and the DFILink
DFILink is a redesign of NRDFI that shifts the focus of sharing within the community from formal
knowledge documents and tools to informal discussion and collaboration surrounding existing
documents and tools within the system. The same broad set of knowledge resources from NRDFI is
available through DFILink, however the ability to discuss these resources has been given equal
importance in the design of the system.
This shift in focus was driven primarily by two discoveries in the literature surrounding motivation for
knowledge sharing: First, the primary motivators for sharing knowledge are intrinsic in nature (i.e.
through positive feedback, a sense of community, and incremental praise). Second, these intrinsic
motivators are more effective when the overhead for making a contribution is low (Bock & Kim,
2002; Bock, Lee, Zmud, & Kim, 2005). These two discoveries were taken from what appears to be
the prevailing model in the literature for motivating knowledge sharing, and formed the backbone for a
redesign strategy that emphasized the social aspect of participating in a community of practice. The
ability to pose questions, make comments, and informally engage the community across all aspects of
the system and the resources contained therein was underscored in the resulting transition to DFILink.
Additionally, these informal communications mechanisms served to bring the system closer in
alignment to theory for how egocentric groups actually communicate (Fisher, 2005). In short,
DFILink was built to embody the best lessons from the literature with regards to motivating sharing
and supporting communication within a community of practice.
However, two years after the transition, usage patterns for DFILink mirror those of its predecessor
NRDFI. LEDFI members will log on to pull down resources, but rarely if ever upload and share their
own or utilize the informal communications channels embedded within the system. Design based
upon the prevailing theory surrounding motivating knowledge sharing within communities of practice
appears to have had little-to-no impact on sharing within the LEDFI community itself. Empirical
research performed by the investigators during the transition from NRDFI to DFILink further
highlights the potential gap in the literature between the theory of motivating knowledge sharing and
what can be observed in communities of practice such as LEDFI.
1.3. Re-examining Motivation to Share Knowledge
One of the preeminent works in the area of motivators to share knowledge examines the relative
importance of intrinsic versus extrinsic motivators in the context of a broad sampling of Asian firms
(Bock, et al., 2005). The outcome of this study demonstrates that there is a strong link between
intrinsic motivation and intention to share knowledge, and extrinsic motivators can actually serve as a
demotivational factor in the long run. The literature has used this study as a foundation for further
work (e.g. Chiu, Hsu, & Wang, 2006; Hsu, Ju, Yen, & Chang, 2007; Kankanhalli, Bernard, & W.,
2005), and the notion that intrinsic motivators drive the sharing of knowledge is widely held within the
domain. The transition from NRDFI to DFILink adhered to this principle through the incorporation of
social mechanisms for positive feedback and contribution through informal communications. Still, we
were interested in the generalizability of the prior study to the context of egocentric groups and, more
broadly, distributed communities of practice such as LEDFI. A replication of the study was performed
with a sample of LEDFI members, and the results called into question the findings of the earlier work
(Hass, et al., 2009).
In a community of practice such as LEDFI, the link between intrinsic motivation and intention to share
knowledge was observed to be significantly weaker, and bordering on non-existent. Interestingly,
while the link between extrinsic motivators and intention to share was no longer significantly negative
as in the previous study, it too remained tenuous at best. In short, when the commonly accepted model
of motivation to share knowledge was applied to the LEDFI community, neither intrinsic nor extrinsic
motivators appeared to provide strong support for what would drive an LEDFI member to share their
knowledge.
With this in mind, and coupled with the observation of stagnant usage patterns throughout the
theory-driven transition from NRDFI to DFILink, the investigators noted a potential gap in the literature
as it relates to theory regarding willingness to share knowledge in a distributed community of practice.
What follows is an account of the first round of grounded theory research regarding this gap, initial
findings from interviews and a focus group with a sample of the LEDFI community, and a discussion
of resulting prescription for knowledge-based decision support systems targeting communities of this
nature.
2. METHODOLOGY
The investigators selected grounded theory, a specifically qualitative approach, based upon their
experience applying the results of existing quantitative studies to the design of DFILink and meeting
minimal success in their objectives, as well as the discovery of contradictory findings when applying
an accepted quantitative model to the context of the LEDFI community. Grounded theory is markedly
process-driven in its focus (Strauss & Corbin, 1998), and avoids a priori assumptions regarding the
processes underlying the phenomena of interest. This is in contrast to a deductive quantitative
approach, and is appropriate in scenarios where the accepted theory in a domain is unable to
adequately capture behaviors of practitioners in the field. The process-focus of grounded theory
allows the researcher to examine directly what occurs in practice, and the inductive nature of the
methodology supports contributions to existing theory that can more adequately capture and explain
behavior in the field.
Interviews were carried out at the 2012 Department of Defense Cyber Crimes Conference in Atlanta,
in order to purposefully sample members of the LEDFI community of various positions within their
respective departments. Our initial five interview subjects spanned the range of positions from direct
forensics investigators to mid-level forensic lab managers to higher-level departmental management.
Early interviews were purposefully unstructured and open ended, focusing on the identification of
patterns in process for applying knowledge in order to complete digital forensics tasks. Nightly
coding of interview notes took place in accordance with guidelines for grounded theory (Glaser,
1978), which followed the pattern of initial “open coding” to first identify key concepts or dimensions
(referred to as categories), and subsequent “selective coding” once uniformities in the interview notes
were revealed.
As the resulting categories became saturated, interviews became more tightly structured in order to
explore these categories further, until no new properties emerged from additional investigation. A
total of 20 interviews were conducted in this first round of investigation, which is within guidelines for
the volume of interviews recommended to begin to answer research questions through grounded
theory (McCracken, 1988). Subsequently, a summary of the findings and resulting implications for
practice was shared with a focus group comprised of an additional 10 LEDFI members. Glaser (1978,
1992, 2001) emphasizes the following criteria for assessing rigor and validity of grounded theory
studies: fit, relevance, workability, modifiability, parsimony and scope. Table 1 is provided as a
summary of the investigators’ effort within this framework (in line with similar grounded theory
studies e.g. Mello, Stank, & Esper, 2008).
Table 1. An assessment of rigor for grounded theory

Criteria: Fit
Definition: Do the findings match the conditions within the domain under investigation?
Evidence: Findings were drawn based on patterns across all interviews; initial theory and implications were presented and validated by a focus group of community members.

Criteria: Relevance
Definition: Does the outcome contribute to solving a real problem in practice? Do the results contribute to existing theory through a broader understanding?
Evidence: Findings from the study directly impact the evolution of an existing artifact within the community, in a fashion validated by community members; continuing research seeks to position these findings within the knowledge management, decision support, and task/technology fit domains.

Criteria: Workability
Definition: Do the findings directly address what is happening within the domain?
Evidence: Early theory derived from interviews was shared and confirmed by participants of the study.

Criteria: Modifiability
Definition: Can contradictions be included in the emerging theory through modification?
Evidence: The emergent categories from this first round of inquiry will be tested and augmented as necessary through continuing theoretical sampling and data collection.

Criteria: Parsimony
Definition: Is the theory limited to a minimum of categories needed to explain the phenomenon?
Evidence: Selective coding was applied to the open-coded data in order to reduce the number of categories while maintaining explanatory coverage across all cases in the study.

Criteria: Scope
Definition: Is the theory flexible enough to provide insight into a variety of situations?
Evidence: Scope for the categories discovered in this first round of data collection will be examined through continuing theoretical sampling of a broader range of communities of practice.
3. FINDINGS
An analysis of the data collected from the interviews revealed three critical categories that impact the
way in which a LEDFI member is willing to participate in knowledge sharing activities: organizational
structure, task complexity, and workload. These characteristics were a recurring theme across the
interviews conducted, and revealed themselves as key aspects driving the processes and mechanisms
LEDFI members selected when either gathering or sharing knowledge within the community. Across
each category, the impact of the category on selection of knowledge sharing mechanisms was
explored. The overall result is an almost exclusive reliance on local knowledge silos and existing
informal communications mechanisms within the community of practice. Each category is addressed
individually below.
3.1. Organizational Structure
LEDFI members exist in a rigid organizational context. From the interviews, this exposes itself in a
number of different ways. First, due to the legal requirements surrounding the validity of their work,
investigators are encouraged to maintain an autonomous core of knowledge and tools within their own
departments. These knowledge cores are the first targets of inquiry when performing an activity that
requires support. Introduction of external sources for knowledge and tools often requires the approval
of organizational management, and is frequently limited to knowledge gathering rather than
knowledge sharing. Further, there are frequently strict guidelines regarding the sharing of internally
developed resources, which limits the participation of members in formal external knowledge sharing
efforts.
Members within this rigid organizational context prefer to offer support to their community colleagues
individually, informally, and on a case-by-case basis. While the community as a whole recognizes the
potential for inefficiency in this approach, members are often constrained by the rigidity of their
organizational boundaries and procedures from making their knowledge cores available to the broader LEDFI
community in general. If identified as an expert and approached individually, however, they are likely
to be willing to share their expertise with an LEDFI colleague on a one-to-one basis.
3.2. Task Complexity
Subjects uniformly identified an 80-20 rule with respect to the complexity of the tasks they perform.
80% of the time, their tasks are routine and require little to no knowledge support for completion. The
other 20% of their tasks require knowledge support, but that support can be achieved through access to
their department’s internal knowledge core or through informal requests to the broader community by
utilizing existing communication channels. They recognize that there may exist better tools and
solutions than what they can find within their own knowledge cores or through informal requests for
assistance, but the relatively low frequency for which they require external assistance acts as a
disincentive for exploring, becoming familiar with, and investing time in external formal knowledge
repositories. They identify a trade-off between the time and effort required to become familiar with
and actively use these external resources, and the amount of time and effort such familiarity would
potentially save them in their daily operations. For them, considering how little they find themselves
in need of knowledge support, the tradeoff does not favor active involvement in external formal
knowledge repositories.
3.3. Workload
The vast majority of subjects interviewed reported a significant backlog of work within their
department. Following the 80-20 rule identified regarding their tasks, this translated for the subjects
into heavy time pressure to apply their existing expertise towards routine tasks as quickly as possible
in order to work down the backlog. When facing a task that requires knowledge support, this time
pressure influences their preference to use existing informal and asynchronous communications
channels to seek assistance, as they can then move on to backlogged routine tasks while they wait for a
response. In essence, the backlog of work they often face means that, even if they wanted to become
active members of an external knowledge community and gain expertise with the resources available
therein, they are forced to repurpose the time that this would take as time to continue working down
their backlog of routine tasks while they wait for informal support.
A profile of the LEDFI community across these categories is presented in figure 1. Through the
interviews performed, these categories emerged as the primary influence within the community over
how knowledge is shared and discovered amongst participants. Based upon their positioning along
these categories, LEDFI members exhibit a strong preference for locally developed knowledge cores
and existing informal communication channels when seeking support. Virtually all subjects noted
listservs as the external communication channel of choice when seeking support from the broader
community. They also recognized and were willing to accept the potential for inefficiency in
knowledge discovery through this communications channel. For them, the tradeoff in effort required
to become active users in a more structured knowledge management approach did not support the
potential gains in process improvement for their infrequent knowledge-intensive tasks. Put simply,
they recognize there may be valuable resources available externally. However, due to their rigid
organizational structure, relatively routine tasks, and heightened workload, they are willing to forego
these resources in favor of support mechanisms that fold seamlessly into their existing workflow.
Figure 1. LEDFI Community Profile
4. DISCUSSION
4.1. Implications for Theory
This first round of data collection supports a broader research objective to identify and examine
communities of practice that vary along the discovered categories of structure, complexity, and
workload. Based on findings from our work with LEDFI, it is proposed that communities of practice
experience contextual pressures related to knowledge sharing that set them apart from communities
within a formal organizational boundary. For communities of practice, the link between intrinsic
reward and active knowledge sharing may be moderated by the communities’ positioning along these
three contextual dimensions. Additional evidence of this moderation effect will serve to broaden the
organizational climate construct in the motivation literature to include external influences, rather than
the current internal focus on fairness, affiliation, and innovativeness (Bock, et al., 2005). Our
continued efforts will seek to expand the predominant model on motivation to share knowledge, so
that the model fits in the context of communities of practice as well as in the context of individual
organizations.
Further, the work done here suggests that a community’s position along these dimensions may dictate
the degree to which knowledge management efforts must either conform to existing workflows and
processes within the community, or are free to influence the workflows and processes themselves.
This tradeoff is represented in figure 2. Continued work to explore this tradeoff within a broader set
of diverse communities of practice seeks to contribute to the literature related to task/technology fit
(Goodhue, 1995; Goodhue & Thompson, 1995). We find partial alignment with existing research in
this domain that maps task characteristics to appropriate technology support mechanisms (Zigurs &
Buckland, 1998). However, rather than focus on the capabilities availed through the technology, we
will continue to focus on the tradeoff between technology support that can achieve the greatest
hypothetical advantage, and technology support that will actually be used. In some ways, then, we are
looking to broaden the focus from task/technology fit to community/technology fit. The initial finding
here is that the best knowledge management option is not the one with the greatest performance
potential, but the one that will actually be used.
Figure 2. Tradeoff between process vs. knowledge focused support
For example, NRDFI and DFILink were designed to offer a tight integration between resource
discovery and the sharing of knowledge related to these resources by way of community involvement
within the site itself. Through this tight coupling of centralized discovery and sharing, formal
knowledge resources can be surrounded by informal, community-driven knowledge that incrementally
increases the value of the resource over time. However, the potential benefit of this tightly coupled
architecture assumes that community participants are willing to integrate use of the knowledge
repository within their existing workflows. As we have discovered here, LEDFI simply is not. The
result is a powerful knowledge management solution, engineered within the guidelines of best practice
from the literature, recognized by the community as a source of valuable content, that by and large sits
on the shelf unused. What the LEDFI community has shared with us on this issue is that rigid
organizational structure, an abundance of routine tasks, and a heavy workload all contribute to a
context where knowledge support must be folded into existing workflows if it is to be utilized. This
seamless mapping into existing workflows takes priority over the relative power of the knowledge
management capabilities available. In other words, the best knowledge management solution is the
one that gets used.
4.2. Implications for Practice
While we continue to explore the categories that influence communities of practice along the process-centric/knowledge-centric continuum, the message is clear for a process-centric community such as
LEDFI: seamless integration of knowledge support into existing workflows and communications
channels is a requirement for knowledge discovery and use. Therefore, primary methods of
communication within the community must be identified, and knowledge management technology
must evolve to take an active role within these communications channels. For the LEDFI community,
listservs represent a primary form of communication when members seek assistance outside of their
organization. Taking cues from agent-based decision support research (Bui & Lee, 1999), the next
evolution of DFILink will be the development of a listserv agent that matches requests from users on
the listserv to resources that may prove useful. A sequence diagram for listserv agent interaction is
presented below in figure 3.
Figure 3. Sequence for user/agent interaction via listservs
The DFILink listserv agent will be designed so that it can subscribe and contribute to not only a
specific DFILink listserv, but also any partnering listserv from the LEDFI community that wishes to
participate. The agent will monitor traffic on the listservs, and respond with resource matches based
on the content of the initial question posted. As the conversation thread continues, the agent will
continue to monitor traffic so that, if any listserv member would like to interact further with the agent,
a short list of hash-tag command options are at their disposal and can be sent as a reply to the listserv
itself. For instance, if a participant would like to see additional resource matches, they can reply with
“#more”, and the agent will perform an additional search based on not only the text from the original
posting, but all subsequent postings in the email thread. Further, these email threads will be
maintained as resources within DFILink and the agent will potentially include them as matches to
future inquiries. In this fashion, the primary communications channel for the community is
strengthened by the inclusion of relevant knowledge resources, maintains a long-term memory of tacit
knowledge transfer, and does not require any adaptation of existing workflows and processes on the
part of the community members.
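As a concrete illustration of this interaction, a minimal fragment of the agent's reply logic might look like the sketch below; the function names, the '#more' handling and the resource search interface are hypothetical, since the agent described here is still being developed.

```python
def handle_listserv_message(thread_messages, search_resources, top_k=3):
    """Reply to a listserv thread with links to matching DFILink resources.

    thread_messages: list of message bodies in the thread, oldest first.
    search_resources: callable mapping a text query to a ranked list of links.
    """
    latest = thread_messages[-1].strip().lower()
    if latest.startswith("#more"):
        # Re-search using the text of the whole thread so far, as described above.
        query = " ".join(thread_messages[:-1])
    else:
        query = thread_messages[0]            # the initial question posted to the list
    matches = search_resources(query)[:top_k]
    if not matches:
        return None                           # stay silent rather than spam the list
    return "Possibly relevant DFILink resources:\n" + "\n".join(matches)
```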
5. CONCLUSION
Theory regarding motivation for knowledge sharing appears to lack fit in the context of communities
of practice. The research presented here applied a grounded theory methodology in the examination of
one such community: law enforcement digital forensics investigators. The results point towards three
community characteristics, organizational rigidity, task complexity, and participant workload, as
determinants for a community’s preference between process-centric versus knowledge-centric decision
support. Continuing research will explore the impact of these characteristics within a broader set of
communities of practice, with the aim to contribute to broader theory for motivation to share
knowledge as well as task/technology fit in the context of a community of practice. However, the
findings of this study directly impact the design of successful knowledge-based decision support
technologies for communities that share the LEDFI profile. Technologies must integrate seamlessly
into existing community workflows and processes, even at the expense of greater knowledge
management capability. For a process-centric community, knowledge management capabilities will
be ignored otherwise.
6. REFERENCES
Bock, G., & Kim, Y. (2002). Breaking the Myths of Rewards: An Exploratory Study of Attitudes
About Knowledge-Sharing. Information Resources Management Journal, 15(2), 14.
Bock, G., Lee, J., Zmud, R., & Kim, Y. (2005). Behavioral Intention Formation in Knowledge
Sharing: Examining the Roles of Extrinsic Motivators, Social-Psychological Forces, and
Organizational Climate. MIS Quarterly, 29(1), 87.
Bui, T., & Lee, J. (1999). An Agent-Based Framework for Building Decision Support Systems.
Decision Support Systems, 25(3), 225.
Chiu, C., Hsu, M., & Wang, E. (2006). Understanding Knowledge Sharing in Virtual Communities:
An Integration of Social Capital and Social Cognitive Theories. Decision Support Systems, 42(3),
1872.
Fisher, D. (2005). Using Egocentric Networks to Understand Communication. Internet Computing,
IEEE, 9(5), 20.
Glaser, B. (1978). Theoretical Sensitivity: Advances in the Methodology of Grounded Theory. Mill
Valley, CA: The Sociology Press.
Glaser, B. (1992). Basics of Grounded Theory Analysis. Mill Valley, CA.: Sociology Press.
Glaser, B. (2001). The Grounded Theory Perspective: Conceptualization Contrasted with Description.
Mill Valley, CA.: Sociology Press.
Goodhue, D. (1995). Understanding User Evaluations of Information Systems. Management Science,
41(12), 1827.
Goodhue, D., & Thompson, R. (1995). Task-Technology Fit and Individual Performance. MIS
Quarterly, 19(2), 213.
Hass, M., Nichols, J., Biros, D., Weiser, M., Burkman, J., & Thomas, J. (2009). Motivating
Knowledge Sharing in Diverse Organizational Contexts: An Argument for Reopening the Intrinsic vs.
Extrinsic Debate. Paper presented at the AMCIS 2009 Proceedings.
Hsu, M., Ju, T., Yen, C., & Chang, C. (2007). Knowledge Sharing Behavior in Virtual Communities:
The Relationship Between Trust, Self-Efficacy, and Outcome Expectations. International Journal of
Human-Computer Studies, 65(2), 153.
Jarvenpaa, S., & Majchrzak, A. (2005). Developing Individuals' Transactive Memories of their Ego-Centric Networks to Mitigate Risks of Knowledge Sharing: The Case of Professionals Protecting
CyberSecurity. Paper presented at the Proceedings of the Twenty-Sixth International Conference on
Information Systems.
Kankanhalli, A., Tan, B. C. Y., & Wei, K.-K. (2005). Contributing Knowledge to Electronic Knowledge
Repositories: An Empirical Investigation. MIS Quarterly, 29(1), 113.
McCracken, G. (1988). The Long Interview. Thousand Oaks, CA.: Sage Publications.
Mello, J., Stank, T., & Esper, T. (2008). A Model of Logistics Outsourcing Strategy. Transportation
Journal, 47(4), 21.
Strauss, A., & Corbin, J. (1998). Basics of Qualitative Research: Techniques and Procedures for
Developing Grounded Theory. Thousand Oaks, CA: Sage Publications.
Zigurs, I., & Buckland, B. (1998). A Theory of Task/Technology Fit and Group Support Systems
Effectiveness. MIS Quarterly, 22(3), 313.
A FUZZY HASHING APPROACH BASED ON RANDOM
SEQUENCES AND HAMMING DISTANCE
Frank Breitinger & Harald Baier
Center for Advanced Security Research Darmstadt (CASED) and
Department of Computer Science, Hochschule Darmstadt,
Mornewegstr. 32, D – 64293 Darmstadt, Germany,
Mail: {frank.breitinger, harald.baier}@cased.de
ABSTRACT
Hash functions are well-known methods in computer science to map arbitrarily large input to bit strings
of a fixed length that serve as unique input identifiers/fingerprints. A key property of cryptographic
hash functions is that even if only one bit of the input is changed the output behaves pseudo-randomly,
and therefore similar files cannot be identified. However, in the area of computer forensics it is also
necessary to find similar files (e.g. different versions of a file), for which we need a similarity
preserving hash function, also called a fuzzy hash function.
In this paper we present a new approach for fuzzy hashing called bbHash. It is based on the idea of
‘rebuilding’ an input as well as possible using a fixed set of randomly chosen byte sequences called
building blocks of byte length l (e.g. l = 128). The procedure is as follows: slide through the input
byte-by-byte, read out the current input byte sequence of length l, and compute the Hamming
distances of all building blocks against the current input byte sequence. Each building block with a
Hamming distance smaller than a certain threshold contributes to the file’s bbHash. We discuss
(dis)advantages of our bbHash compared to other fuzzy hashing approaches. A key property of bbHash is that it is the
first fuzzy hashing approach based on a comparison to external data structures.
Keywords: Fuzzy hashing, similarity preserving hash function, similarity digests, Hamming distance,
computer forensics.
1. INTRODUCTION
The distribution and usage of electronic devices has increased over recent years. Traditional books,
photos, letters and LPs became ebooks, digital photos, email and mp3. This transformation also
influences the capacity of today's storage media ([Walter, 2005]), which changed from a few megabytes to
terabytes. Users are able to archive all their information on one simple hard disk instead of several
cardboard boxes in the garret. This convenience for consumers complicates computer forensic
investigations (e.g. by the Federal Bureau of Investigation), because the investigator has to cope with
an information overload: a search for relevant files no longer resembles finding a needle in a haystack,
but rather a needle in a hay-hill.
The crucial task in coping with this data overload is to distinguish relevant from non-relevant information.
In most cases automated preprocessing is used, which tries to filter out some irrelevant data
and reduces the amount of data an investigator has to look at by hand. As of today the best practice of
this preprocessing is quite simple: Hash each file of the evidence storage medium, compare the
resulting hash value (also called fingerprint or signature) against a given set of fingerprints, and put it
in one of the three categories: known-to-be-good, known-to-be-bad, and unknown files. For instance,
unmodified files of a common operating system (e.g. Windows, Linux) or binaries of a wide-spread
application like a browser are said to be known-to-be-good and need not be inspected within an
investigation. The most common set/database of such non-relevant files is the Reference Data Set
(RDS) within the National Software Reference Library (NSRL, [NIST, 2011]) maintained by the US
National Institute of Standards and Technology (NIST)1.
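In essence, this preprocessing amounts to the following (a schematic Python sketch; the hash algorithm and the format of the reference sets vary between tools and are assumptions here):

```python
import hashlib
from pathlib import Path

def triage(evidence_dir, known_good, known_bad):
    """Sort files into known-to-be-good / known-to-be-bad / unknown by hash value.

    known_good, known_bad: sets of hex digests (e.g. derived from the NSRL RDS).
    """
    categories = {"known-to-be-good": [], "known-to-be-bad": [], "unknown": []}
    for path in Path(evidence_dir).rglob("*"):
        if not path.is_file():
            continue
        digest = hashlib.sha1(path.read_bytes()).hexdigest()
        if digest in known_bad:
            categories["known-to-be-bad"].append(path)
        elif digest in known_good:
            categories["known-to-be-good"].append(path)
        else:
            categories["unknown"].append(path)
    return categories
```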
Normally cryptographic hash functions are used for this purpose, which have one key property:
Regardless of how many bits change between two inputs (e.g. 1 bit or 100 bits), the output behaves
pseudo-randomly. However, in the area of computer forensics it is also convenient to find similar files
(e.g. different versions of a file), for which we need a similarity preserving hash function, also called a
fuzzy hash function.
It is important to understand that in contrast to cryptographic hash functions, there is currently no
sharp understanding of the defining properties of a fuzzy hash function. We will address this topic in
Sec. 2, but we emphasize that the output of a fuzzy hash function need not be of fixed length.
In general we consider two different levels for generating similarity hashes. On the one hand there is the byte level, which works independently of the file type and is very fast as we need not interpret the input. On the other hand there is the semantic level, which tries to interpret a file and is mostly used for multimedia content, i.e. images, videos, audio. In this paper we only consider the first class. As explained in Sec. 2, all existing approaches of the first class come with drawbacks with respect to security or efficiency.
In this paper we present a new fuzzy hashing technique that is based on the idea of data deduplication
(e.g. [Maddodi et al., 2010, Sec. II]) and eigenfaces (e.g. [Turk and Pentland, 1991, Sec. 2]).
Deduplication is a backup scheme for saving files efficiently. Instead of saving files as a whole, it
makes use of small pieces. If two files share a common piece, it is only saved once, but referenced for
both files. Eigenfaces are a similar approach: they are used in biometrics for face recognition. Roughly speaking, if we have a set of N eigenfaces, then any face can be represented by a combination of these standard faces. Eigenfaces resemble the well-known method in linear algebra of representing each vector of a vector space by a linear combination of the basis vectors.
Our approach uses a fixed set of random byte sequences called building blocks. In this paper we denote the number of building blocks by N. The length in bytes of a building block is denoted by l. It should be 'short' compared to the file size (e.g. l = 128). To find the optimal representation of a given file by the set of building blocks, we slide through the input file byte-by-byte, read out the current input byte sequence of length l, and compute the Hamming distances of all building blocks against the current input byte sequence. If the smallest of these Hamming distances is also below a certain threshold, the index of the corresponding building block contributes to the file's bbHash.
Besides measuring the similarity of whole files, we are also able to match parts of files back to their origin, e.g. fragments that arise due to file fragmentation and deletion.
1.1 Contribution and Organization of this Paper
Similarity preserving hash functions on the byte level are a rather new area in computer forensics and become more and more important due to the increasing amount of data. Currently the most popular approach is implemented in the software package ssdeep ([Kornblum, 2006]). However, ssdeep can be exploited very easily ([Baier and Breitinger, 2011]). Our approach bbHash is more robust against an active adversary. With respect to the length of the hash value our approach is superior to sdhash ([Roussev, 2009, Roussev, 2010]), which generates hash values of about 2.6% to 3.3% of the input size, while our bbHash similarity digest only comprises about 0.5% of the input size. Additionally, sdhash covers only about 80% of the input, while our approach is designed to cover the whole input.
Furthermore, in contrast to ssdeep and sdhash, our similarity digest computation is based on a comparison to external data structures, the building blocks. The building blocks are randomly chosen static byte blocks, independent of the processed input. Although this implies a computational drawback compared to other fuzzy hashing approaches, we consider this a security benefit, as bbHash withstands active anti-blacklisting better.

1. NIST points out that the term known-to-be-good depends on national laws. Therefore, NIST calls files within the RDS non-relevant. However, the RDS does not contain any illicit data.
The rest of the paper is organized as follows: In the subsequent Sec. 1.2 we introduce the notation used in this paper. Sec. 2 reviews the related work. The core of this paper is Sec. 3, where we present our algorithm bbHash. We give a first analysis of our implementation in Sec. 4. Sec. 5 concludes the paper and gives an overview of future work.
1.2 Notation and Terms used in this Paper
In this paper, we make use of the following notation and terms:
• A building block is a randomly chosen byte sequence used to rebuild files.
• N denotes the number of different building blocks used in our approach. We make use of the default value N = 16.
• For 0 ≤ k < N we refer to the k-th building block as bb_k.
• l denotes the length of a building block in bytes. Throughout this paper we assume a default value of l = 128.
• l_bit denotes the length of a building block in bits. In this paper we assume a default value of l_bit = 8 · 128 = 1024.
• L_f denotes the length of an input (file) in bytes.
• bbHash denotes our newly proposed fuzzy hash function based on building blocks.
• BS denotes a byte string of length l: BS = B_0 B_1 B_2 ... B_{l−1}.
• BS_i denotes the byte string of length l starting at byte offset i within the input file: BS_i = B_i B_{i+1} B_{i+2} ... B_{i+l−1}.
• t denotes the threshold value; t is an integer with 0 ≤ t ≤ l_bit.
2. FOUNDATIONS AND RELATED WORK
According to [Menezes et al., 1997] hash functions have two basic properties: compression and ease of computation. In this context compression means that regardless of the length of the input, the output has a fixed length. This is why the term fuzzy hashing might be a little confusing and similarity digest is more appropriate – most similarity preserving algorithms do not output a fixed-size hash value. Despite this fact we call a variable-sized compression function hf a fuzzy hash function if two similar inputs yield similar outputs.
The first fuzzy hashing approach on the byte level was proposed by Kornblum in 2006 [Kornblum,
2006], which is called context-triggered piecewise hashing, abbreviated as CTPH. Kornblum’s CTPH
is based on a spam detection algorithm of Andrew Tridgell [Tridgell, 2002]. The main idea is to
compute cryptographic hashes not over the whole file, but over parts of the file, which are called
chunks. The end of each chunk is determined by a pseudo-random function that is based on a current
context of 7 bytes. In recent years, Kornblum's approach has been examined carefully and several papers have been published.
[Chen and Wang, 2008, Roussev et al., 2007, Seo et al., 2009] find ways to improve the existing
implementation called ssdeep with respect to both efficiency and security. On the other hand, [Baier and Breitinger, 2011, Breitinger, 2011] show attacks against CTPH with respect to blacklisting and whitelisting and also propose some improvements for the pseudo-random function.
[Roussev, 2009, Roussev, 2010] present a similarity digest hash function sdhash, where the idea is “to
select multiple characteristic (invariant) features from the data object and compare them with features
selected from other objects”. He uses multiple unique 64-byte features selected on the basis of their
entropy. In other words, files are similar if they have the same features/byte-sequences. [Sadowski and
Levin, 2007] explains a tool called Simhash which is another approach for fuzzy hashing “based on
counting the occurrences of certain binary strings within a file”. As we denote a similarity preserving
hash function a fuzzy hash function, we rate ssdeep, sdhash and Simhash as possible implementations
for fuzzy hashing.
[Roussev et al., 2006] presents a tool called md5bloom where "the goal is to pick from a set of
forensic images the one(s) that are most like (or perhaps most unlike) a particular target”. In addition,
it can be used for object versioning detection. The idea is to hash each hard disk block and insert the
fingerprints into a Bloom filter. Hard disks are similar if their Bloom filters are similar.
3. A NEW FUZZY HASHING SCHEME BASED ON BUILDING BLOCKS
In this section, we explain our fuzzy hashing approach bbHash. First, in Sec. 3.1 we describe the
generation of the building blocks followed by the algorithm details in Sec. 3.2. Finally, Sec. 3.3
describes how to compare two bbHash similarity digests.
bbHash aims at the following design paradigms:
1. Full coverage: Every byte of the input file is expected to lie within at least one offset of the input file for which a building block is triggered to contribute to the bbHash. This behavior is very common for hash functions: each byte influences the hash value.
2. Variable length: The length of a file's bbHash is proportional to the length of the original file. This is in contrast to classical hash functions. However, it ensures that sufficient information about the input is stored to obtain good similarity preserving properties of bbHash.
3.1 Building Blocks
We first turn to the set of building blocks. Their number is denoted by N. In our current implementation we set the default value to N = 16. Then we can index each building block by a unique hex digit 0, 1, 2, ..., f (half a byte); therefore we have the building blocks bb_0, bb_1, ..., bb_15. This index is later used within the bbHash to uniquely reference a certain building block.
The length of a building block in bytes is denoted by l and influences the following two aspects:
1. A larger l decreases the speed, as the Hamming distance is computed at each offset i over l bytes.
2. A larger l shortens the hash value, as there should be a trigger sequence approximately every l bytes (depending on the threshold t).
Due to run time efficiency reasons we decided to use a ‘short’ building block size compared to the file
size. Currently we make use of l= 128 .
The generation of the building blocks is given in Fig. 1. We use the rand() function to fill an array of
unsigned integers. Hence, all building blocks are stored in one array whereby the boundaries can be
determined by using their size. As rand() uses the same seed each time, it is a deterministic generation
process. Using a fixed set of building blocks is comparable to the use of a fixed initialization vector
(IV) of well-known hash functions like SHA-1. A sample building block is given in Fig. 2.
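To make the generation step concrete, the following C sketch reproduces the idea of Fig. 1 under stated assumptions: the explicit seed value, the byte-wise array layout and the hex dump of block 0 are illustrative choices of this sketch, not necessarily the exact code of the prototype (which fills an array of unsigned integers).

```c
#include <stdio.h>
#include <stdlib.h>

#define N 16   /* number of building blocks      */
#define L 128  /* building block length in bytes */

/* All N building blocks live in one flat array; block k occupies
 * bytes [k*L, (k+1)*L). Because the seed is fixed, the block set is
 * static, comparable to the fixed IV of a classical hash function. */
static unsigned char building_blocks[N * L];

static void generate_building_blocks(void)
{
    srand(1);  /* illustrative fixed seed; the prototype relies on rand()'s default seeding */
    for (size_t i = 0; i < sizeof building_blocks; i++)
        building_blocks[i] = (unsigned char)(rand() & 0xff);
}

int main(void)
{
    generate_building_blocks();
    /* dump building block 0 as hex, cf. Fig. 2 */
    for (int i = 0; i < L; i++)
        printf("%02x", building_blocks[i]);
    putchar('\n');
    return 0;
}
```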
Fig. 1. Generation of the building blocks
Fig. 2: Building block with index 0
3.2 Our Algorithm bbHash
In this section we give a detailed description of our fuzzy hash function bbHash. It works as follows: To find the optimal representation of a given file by the set of building blocks, we slide through the input file byte-by-byte, read out the current input byte sequence of length l, and compute the Hamming distances of all building blocks against the current input byte sequence. If the smallest of these Hamming distances is also below a certain threshold, the index of the corresponding building block contributes to the file's bbHash.
We write L_f for the length of the input file in bytes. The pseudocode of our algorithm bbHash is given in Algorithm 1. It proceeds as follows for each offset i within the input file, 0 ≤ i ≤ L_f − 1 − l: If BS_i denotes the byte sequence of length l starting at the i-th byte of the input, then the algorithm computes the N Hamming distances of BS_i to all N building blocks: hd_{k,i} = HD(bb_k, BS_i) is the Hamming distance of the two parameters bb_k and BS_i, 0 ≤ k < N. As the Hamming distance is the number of differing bits, we have 0 ≤ hd_{k,i} ≤ 8 · l. In Sec. 3.1 we defined the default length of a building block in bytes to be 128, i.e. we assume l = 128. As an example, HD(bb_2, BS_100) returns the Hamming distance of the building block bb_2 and the bytes B_100 to B_227 of the input. In other words, the algorithm slides through the input, byte-by-byte, and computes the Hamming distance at each offset for all N building blocks as shown in Fig. 3.
The bbHash value is formed by the ordered indices of the triggered building blocks. In order to trigger a building block to contribute to the bbHash, it has to fulfill two further conditions:
1. For a given i (fixed offset), we only make use of the closest building block, i.e. we are looking for the index k with the smallest Hamming distance hd_{k,i}.
2. This smallest hd_{k,i} also needs to be smaller than a certain threshold t.
Each BS_i that fulfills both conditions is called a trigger sequence. To create the final bbHash value, we concatenate the indices k of all triggered building blocks (in case two building blocks are triggered for a BS_i, only the smallest index k is chosen).
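The following C sketch illustrates the complete triggering logic described above with the default parameters N = 16, l = 128 and t = 461. It is a straightforward reference implementation, not the authors' code: the naive bit counting and the byte-wise block generation are simplifications (the paper's implementation uses a precomputed 16-bit popcount table, see Sec. 4.4).

```c
#include <stdio.h>
#include <stdlib.h>

#define N 16    /* number of building blocks           */
#define L 128   /* building block length l in bytes    */
#define T 461   /* threshold t on the Hamming distance */

static unsigned char building_blocks[N * L];

/* Deterministic generation as in the Sec. 3.1 sketch (fixed seed). */
static void generate_building_blocks(void)
{
    srand(1);
    for (size_t i = 0; i < sizeof building_blocks; i++)
        building_blocks[i] = (unsigned char)(rand() & 0xff);
}

/* Hamming distance of two L-byte strings, counted bit by bit. */
static int hamming_distance(const unsigned char *a, const unsigned char *b)
{
    int d = 0;
    for (int i = 0; i < L; i++) {
        unsigned char x = (unsigned char)(a[i] ^ b[i]);
        while (x) { d += x & 1; x >>= 1; }
    }
    return d;
}

/* Slide through the input byte-by-byte; whenever the closest building
 * block is below the threshold, append its hex index to the digest. */
static void bbhash(const unsigned char *input, size_t len)
{
    for (size_t i = 0; i + L <= len; i++) {
        int best_k = 0, best_d = 8 * L + 1;
        for (int k = 0; k < N; k++) {
            int d = hamming_distance(&building_blocks[k * L], &input[i]);
            if (d < best_d) { best_d = d; best_k = k; }
        }
        if (best_d < T)              /* BS_i is a trigger sequence */
            printf("%x", best_k);
    }
    putchar('\n');
}

int main(int argc, char **argv)
{
    if (argc != 2) { fprintf(stderr, "usage: %s <file>\n", argv[0]); return 1; }
    FILE *f = fopen(argv[1], "rb");
    if (!f) { perror("fopen"); return 1; }
    fseek(f, 0, SEEK_END);
    long sz = ftell(f);
    fseek(f, 0, SEEK_SET);
    unsigned char *buf = malloc((size_t)sz);
    if (!buf || fread(buf, 1, (size_t)sz, f) != (size_t)sz) { fclose(f); return 1; }
    fclose(f);

    generate_building_blocks();
    if ((size_t)sz >= L)
        bbhash(buf, (size_t)sz);     /* files shorter than l cannot be hashed */
    free(buf);
    return 0;
}
```

Note that ties between building blocks are resolved towards the smallest index k, matching the rule above, and that inputs shorter than l yield an empty digest.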
Fig. 3: Workflow of the Algorithm
In Sec. 3.1 we have already motivated why the choices l = 128 and N = 16 are appropriate for our approach. In what follows we explain how to choose a fitting threshold t. Our first paradigm in the introduction to Sec. 3 states full coverage, i.e. every byte of the input file is expected to lie within at least one offset of the input file for which a building block is triggered to contribute to the bbHash. Thus we expect to trigger approximately every l-th byte, i.e. every 128th byte. In order to have some overlap, we decrease the statistical expectation so that a trigger occurs on average every 100th byte.
For our theoretical considerations we assume a uniform probability distribution on the input file blocks of byte length l. Let d be a non-negative integer (the distance) and let P(hd_{k,i} = d) denote the probability that the Hamming distance of building block bb_k at offset i of the input file is equal to d.
1. We first consider the case d = 0, i.e. the building block and the input block coincide. Then we simply have P(hd_{k,i} = 0) = 0.5^{l_bit} = 0.5^{1024}.
2. For d ≥ 0 there are C(l_bit, d) possibilities to find an input file block of Hamming distance d to bb_k. Thus P(hd_{k,i} = d) = C(l_bit, d) · 0.5^{l_bit} = C(1024, d) · 0.5^{1024}.
3. Finally, the probability to receive a Hamming distance smaller than t for bb_k is

   p_1 = P(hd_{k,i} < t) = Σ_{d=0}^{t−1} C(1024, d) · 0.5^{1024}.    (1)
The binomial coefficients in Eq. (1) are large integers and we make use of the computer algebra system LiDIA² to evaluate Eq. (1) (LiDIA is a C++ library maintained by the Technical University of Darmstadt).
2. http://www.cdc.informatik.tu-darmstadt.de/TI/LiDIA/ ; visited 05.09.2011
Let p_t denote the probability that at least one of the N building blocks satisfies Eq. (1), i.e. we trigger at the current offset of the input file and obtain a contribution to the bbHash. This is easily computed via the complementary probability that none of the building blocks triggers, that is p_t = 1 − (1 − p_1)^N. As explained above we aim at p_t = 0.01. Thus we have to find a threshold t with 1 − (1 − p_1)^N ≈ 0.01. Now we use our LiDIA computations of Eq. (1) to identify a threshold of t = 461. An example hash value of a 68,650 byte (≈ 70 kbyte) JPG image is given in Fig. 4. Overall the hash value consists of 693 digits, which is 346.5 bytes and therefore approximately 0.5% of the input.
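As a rough numerical cross-check of the threshold choice, the following C sketch approximates Eq. (1) with floating-point arithmetic via lgamma() instead of the exact big-integer arithmetic of LiDIA; around t = 461 it yields p_t close to the targeted 0.01. The scan range and output format are assumptions of this sketch, not taken from the paper (compile with -lm).

```c
#include <math.h>
#include <stdio.h>

#define LBIT 1024.0   /* building block length in bits */
#define NB   16.0     /* number of building blocks     */

/* log of the binomial coefficient C(n, d), computed via lgamma() */
static double log_choose(double n, double d)
{
    return lgamma(n + 1.0) - lgamma(d + 1.0) - lgamma(n - d + 1.0);
}

int main(void)
{
    /* p1 = P(hd_{k,i} < t) = sum_{d=0}^{t-1} C(1024, d) * 0.5^1024, cf. Eq. (1) */
    for (int t = 455; t <= 465; t++) {
        double p1 = 0.0;
        for (int d = 0; d < t; d++)
            p1 += exp(log_choose(LBIT, d) + LBIT * log(0.5));
        /* probability that at least one of the N blocks triggers */
        double pt = 1.0 - pow(1.0 - p1, NB);
        printf("t = %d   p1 = %.3e   pt = %.4f\n", t, p1, pt);
    }
    return 0;
}
```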
3.3 Hash Value Comparison
This paper focuses on the hash value generation and not on its comparison. Currently there are two approaches from existing fuzzy hash functions which could also be used for bbHash:
• Roussev uses Bloom filters, where similarity is identified via the Hamming distance of the filters.
• Kornblum uses a weighted edit distance to compare the similarity of two hash values.
Fig. 4: bbHash of a 68,650 byte JPG-Image
Both approaches may easily be adapted for use with bbHash.
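As an illustration of the second option, the sketch below computes a plain (unweighted) edit distance between two bbHash digit strings and maps it to a match score. The weighting used by ssdeep and the concrete score scale are omitted, and the two digests are hypothetical placeholders.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Plain (unweighted) edit distance between two digit strings.
 * ssdeep uses a weighted edit distance; the weights are omitted here. */
static int edit_distance(const char *a, const char *b)
{
    size_t la = strlen(a), lb = strlen(b);
    int *prev = malloc((lb + 1) * sizeof *prev);
    int *cur  = malloc((lb + 1) * sizeof *cur);
    for (size_t j = 0; j <= lb; j++) prev[j] = (int)j;
    for (size_t i = 1; i <= la; i++) {
        cur[0] = (int)i;
        for (size_t j = 1; j <= lb; j++) {
            int sub = prev[j - 1] + (a[i - 1] != b[j - 1]);
            int del = prev[j] + 1, ins = cur[j - 1] + 1;
            int m = sub < del ? sub : del;
            cur[j] = m < ins ? m : ins;
        }
        memcpy(prev, cur, (lb + 1) * sizeof *prev);
    }
    int d = prev[lb];
    free(prev); free(cur);
    return d;
}

int main(void)
{
    /* hypothetical bbHash digests of two similar files */
    const char *h1 = "0f3a9c418d27b5e6";
    const char *h2 = "0f3b9c418d27b0e6";
    int d = edit_distance(h1, h2);
    size_t longest = strlen(h1) > strlen(h2) ? strlen(h1) : strlen(h2);
    /* map to a 0..100 score: identical digests score 100 */
    int score = 100 - (int)((100u * (unsigned)d) / longest);
    printf("edit distance %d, match score %d\n", d, score);
    return 0;
}
```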
4. ANALYSIS OF BBHASH
In this section we analyze different aspects of our approach. The length of the hash values and how it can be influenced by different parameters is presented in Sec. 4.1. Our current choices yield a bbHash of length 0.5% of the input size. This is an improvement by a factor of 6.5 compared to sdhash, where for practical data the similarity digest comprises 3.3% of the input size. Next, in Sec. 4.2 we discuss the applicability of bbHash depending on the file size of the input. An important topic in computer forensics is the identification of segments of files, which is addressed in Sec. 4.3. The run time performance is shown in Sec. 4.4. Finally, in Sec. 4.5 we have a look at attacks and compare our approach against existing ones. We show that bbHash is more robust against an active adversary than ssdeep.
4.1 Hash Value Length
The hash value length depends on three different parameters: the file size L_f, the threshold t and the building block length l. If the other two parameters are kept fixed, then
• a larger L_f will increase the hash value length, as the input is expected to contain more trigger sequences;
• a higher t will increase the hash value length, as more BS_i will have a Hamming distance below the threshold t;
• a larger l will decrease the performance and the hash value length³.
In order to have full coverage, our aim is a final hash value in which every input byte influences the result. For performance reasons we decided on a building block length of l = 128 bytes. As we set the threshold to t = 461, the final hash value consists of roughly L_f/100 digits, where every digit takes half a byte. Thus the final hash value is approximately 0.5% of the original file size.
Compared to the existing approach sdhash where “the space requirements for the generated similarity
digests do not exceed 2.6% of the source data” [Roussev, 2010], it is quite short. However, for real
data sdhash generates similarity digests of 3.3% of the input size.
Besides these two main aspects the file type may also influence the hash value length. We expect that
random byte strings will have more trigger sequences. Therefore we think that compressed file formats
like ZIP, JPG or PDF have very random byte strings and thus yield a bbHash of 0.5% of the input size.
This relation may differ significantly when processing TXT, BMP or DOC formats as they are less
random.
4.2 Applicability of bbHash Depending on the File Size
Our first prototype is very restricted in terms of the input file size and will not work reliably for small files. Files smaller than the building block length l cannot contain a trigger sequence and cannot be hashed. To obtain a reasonable number of trigger sequences the input file should be at least 5000 bytes, which results in 5000 − l − 1 possible trigger sequences. On the other hand, large files result in very long hashes, wherefore we recommend processing files smaller than a couple of megabytes.
By changing the threshold t it is possible to influence the hash value length. We envisage customizing this parameter depending on the input size, which will be a future topic.
4.3 Detection of Segments of Files
A file segment is a part of a file, which could be the result of fragmentation and deletion. For instance, if a fragmented file is deleted but not overwritten, then we can find a byte sequence but do not know anything about the original file. Our approach allows comparing hashes of such pieces against hash values of complete files. Depending on the fragment size, ssdeep is not able to detect fragments (roughly, ssdeep can only match fragments that are at least about half the size of the original file).
Fig. 5 simulates the aforementioned scenario. We first copy a middle segment of 20000 bytes from the JPG image of Fig. 4 using dd. We then compute the hash value of this segment. If we compare this hash value against Fig. 4⁴ (starting at line 4), we recognize that they are completely identical. Thus we can identify short segments as part of their origin.

3. Remember, we would like to have a trigger approximately every l-th byte and therefore we have to adjust t.
4. We rearranged the output by hand to make an identification easier.
Fig. 5: Hash value of a segment. The bbHash of the whole file is listed in Fig. 4.
4.4 Run Time Performance
The run time performance is the main drawback of bbHash, which is quite slow compared to other approaches: ssdeep needs about 0.15 seconds to process a 10 MiB file, whereas bbHash needs about 117 seconds for the same file. This is due to the use of building blocks as external comparison data structures and the computation of their Hamming distances to the currently processed input byte sequence. Recall that at each position i we have to compute the Hamming distance of 16 building blocks, each with a length of 1024 bits. To obtain the Hamming distance we XOR both sequences and count the remaining one-bits (bitcount(bb_k ⊕ BS_i)). To speed up the counting process, we precomputed the number of one-bits for all sequences from 0 to 2^16 − 1. Thus we can look up each 16-bit sequence with a complexity of O(1). But since we have N · l_bit/16 = 16 · 1024/16 = 1024 lookups at each processed byte of the input, it is quite slow.
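The table-based bit counting can be sketched in C as follows. The 64 KiB table size and the way two input bytes are combined into a 16-bit index are assumptions of this sketch; the paper only states that the number of one-bits is precomputed for all 16-bit values.

```c
#include <stdint.h>
#include <stdio.h>

#define L 128   /* building block length in bytes */

static uint8_t popcount16[1 << 16];   /* ones-count for every 16-bit value (64 KiB) */

static void init_popcount16(void)
{
    for (uint32_t v = 1; v < (1u << 16); v++)
        popcount16[v] = (uint8_t)(popcount16[v >> 1] + (v & 1u));
}

/* Hamming distance of two 128-byte sequences: XOR 16 bits at a time and
 * look up the ones-count, i.e. l_bit/16 = 64 table lookups per block. */
static int hamming_distance16(const uint8_t *a, const uint8_t *b)
{
    int d = 0;
    for (int i = 0; i < L; i += 2) {
        uint16_t x = (uint16_t)((a[i] ^ b[i]) | ((a[i + 1] ^ b[i + 1]) << 8));
        d += popcount16[x];
    }
    return d;
}

int main(void)
{
    uint8_t x[L] = {0}, y[L] = {0};
    y[0] = 0xff;   /* the two buffers differ in exactly 8 bits */
    init_popcount16();
    printf("%d\n", hamming_distance16(x, y));   /* prints 8 */
    return 0;
}
```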
Although the processing in this first implementation is quite slow, which results from straightforward programming, this problem is often discussed in the literature and there are several avenues for improvement. For instance, [Amir et al., 2000] state that their algorithm finds all locations where the pattern has at most t errors in time O(L_f · √(t log t)). Compared to our algorithm, which needs time O(L_f · l), that is a great improvement. As we make use of N building blocks, we have to multiply both run time estimations by N. For the next version we will focus on improving the run time performance.
4.5 Attacks
This section focuses on attacks with respect to the two forensic use cases, blacklisting and whitelisting. From an attacker's point of view, anti-blacklisting and anti-whitelisting can be used to hide information and increase the amount of work for investigators.
Anti-blacklisting means that an active adversary manipulates a file in such a way that fuzzy hashing will not identify the files as similar – the hash values become too different. We rate an attack as successful if a human observer cannot see a change between the original and the manipulated version (the files look/work as expected). If a file is manipulated successfully, then it will not be identified as a known-to-be-bad file and will be categorized as an unknown file.
The most obvious idea is to change the triggering, whereby the scope of each change depends on the Hamming distance. For instance, if at position i the smallest Hamming distance is 450, then an active adversary has two possibilities:
1. He needs to change at least 11 bits in this segment to exceed the threshold t and kick this trigger sequence out of the final hash.
2. He needs to manipulate it in such a way that another building block has a closer Hamming distance.
In the worst case each building block has a Hamming distance of 460 and a 'one-bit change' is enough to manipulate the triggering. In this case an active adversary needs to change approximately L_f/100 bits, one bit for each triggering position i. Actually many more changes need to be made, as there are also positions where the Hamming distance is much lower than 460. This is an improvement compared to sdhash, where it is enough to change exactly one bit in every identified feature. Compared to Kornblum's ssdeep this is also a great improvement, as [Baier and Breitinger, 2011] state that in most cases 10 changes are enough to obtain a non-match.
Our improvement is mainly due to the fact that in contrast to ssdeep and sdhash we do not rely on
internal triggering, but on external. However, the use of building blocks results in the bad run time
performance as discussed in Sec. 4.4.
Anti-whitelisting means that an active adversary takes a hash value from a whitelist (hashes of known-to-be-good files) and manipulates a file (normally a known-to-be-bad file) so that its hash value becomes similar to one on the whitelist. Again we rate an attack as successful if a human observer cannot see a change between the original and the manipulated version (the files look/work as expected).
In general this approach is not preimage-resistant, as it is possible to create files for a given signature: generate valid trigger sequences for each building block and add some zero-strings in between. The manipulation of a specific file towards a given hash value should also be possible, but will result in a useless file. In a first step an active adversary has to remove all existing trigger sequences (resulting in approximately L_f/100 changes). Second, he needs to imitate the triggering behavior of the white-listed file, which will cause a lot more changes.
5. CONCLUSION & FUTURE WORK
In the paper at hand we have discussed a new approach to fuzzy hashing. Our main contribution is a tool called bbHash that is more robust against anti-blacklisting than ssdeep and sdhash due to the use of external building blocks. The final signature is about 0.5% of the original file size, is not of fixed length and can be adjusted by several parameters. This allows us to compare very small parts of a file, which could arise due to file fragmentation and deletion, against their origin.
In general there are two next steps. On the one hand, the run time performance needs to be improved, for which we will use existing approaches (e.g. those given in [Amir et al., 2000]). On the other hand, we have to perform a security analysis of our approach to give more details about possible attacks. Knowing attacks also allows further improvements of bbHash.
6. ACKNOWLEDGEMENTS
This work was partly supported by the German Federal Ministry of Education and Research (project
OpenC3S) and the EU (integrated project FIDELITY, grant number 284862).
7. REFERENCES
[Amir et al., 2000] Amir, A., Lewenstein, M., and Porat, E. (2000). Faster algorithms for string
matching with k mismatches. In 11th annual ACM-SIAM symposium on Discrete algorithms, SODA
’00, pages 794–803, Philadelphia, PA, USA. Society for Industrial and Applied Mathematics.
[Baier and Breitinger, 2011] Baier, H. and Breitinger, F. (2011). Security Aspects of Piecewise
Hashing in Computer Forensics. IT Security Incident Management & IT Forensics, pages 21–36.
[Breitinger, 2011] Breitinger, F. (2011). Security Aspects of Fuzzy Hashing. Master’s thesis,
Hochschule Darmstadt.
[Chen and Wang, 2008] Chen, L. and Wang, G. (2008). An Efficient Piecewise Hashing Method for
Computer Forensics. Workshop on Knowledge Discovery and Data Mining, pages 635–638.
[Kornblum, 2006] Kornblum, J. (2006). Identifying almost identical files using context triggered
piecewise hashing. In Digital Investigation, volume 3S, pages 91–97.
[Maddodi et al., 2010] Maddodi, S., Attigeri, G., and Karunakar, A. (2010). Data Deduplication Techniques and Analysis. In Emerging Trends in Engineering and Technology (ICETET), pages 664–668.
[Menezes et al., 1997] Menezes, A., Oorschot, P., and Vanstone, S. (1997). Handbook of Applied
Cryptography. CRC Press.
[NIST, 2011] NIST (2011). National Software Reference Library.
[Roussev, 2009] Roussev, V. (2009). Building a Better Similarity Trap with Statistically Improbable Features. 42nd Hawaii International Conference on System Sciences, 0:1–10.
[Roussev, 2010] Roussev, V. (2010). Data fingerprinting with similarity digests. International Federation for Information Processing, 337/2010:207–226.
[Roussev et al., 2006] Roussev, V., Chen, Y., Bourg, T., and Richard, G. G. (2006). md5bloom: Forensic filesystem hashing revisited. Digital Investigation 3S, pages 82–90.
[Roussev et al., 2007] Roussev, V., Richard, G. G., and Marziale, L. (2007). Multi-resolution
similarity hashing. Digital Investigation 4S, pages 105–113.
[Sadowski and Levin, 2007] Sadowski, C. and Levin, G. (2007). Simhash: Hash-based similarity detection. http://simhash.googlecode.com/svn/trunk/paper/SimHashWithBib.pdf.
[Seo et al., 2009] Seo, K., Lim, K., Choi, J., Chang, K., and Lee, S. (2009). Detecting Similar Files
Based on Hash and Statistical Analysis for Digital Forensic Investigation. Computer Science and its
Applications (CSA ’09), pages 1–6.
[Tridgell, 2002] Tridgell, A. (2002). Spamsum README. http://samba.org/ftp/unpacked/junkcode/spamsum/README, accessed 03.01.2012.
[Turk and Pentland, 1991] Turk, M. and Pentland, A. (1991). Face recognition using eigenfaces. In
Computer Vision and Pattern Recognition, 1991. IEEE, pages 586 –591.
[Walter, 2005] Walter, C. (2005). Kryder's Law. http://www.scientificamerican.com/article.cfm?id=kryders-law&ref=sciam, accessed 18.01.2012.
CLOUD FORENSICS INVESTIGATION: TRACING
INFRINGING SHARING OF COPYRIGHTED CONTENT IN
CLOUD
Yi-Jun He, Echo P. Zhang, Lucas C.K. Hui, Siu Ming Yiu, K.P. Chow
Department of Computer Science, The University of Hong Kong
Phone: +852-22415725; Fax: +852-25598447;
E-mail:{yjhe, pzhang2, hui, smyiu, chow}@cs.hku.hk
ABSTRACT
Cloud computing is becoming a significant technology trend nowadays, but its abrupt rise also creates a brand new front for cybercrime investigation with various challenges. One of the challenges is to track down infringing sharing of copyrighted content in the cloud. To solve this problem, we study a typical type of content sharing technology in cloud computing, analyze the challenges that the new technologies bring to forensics, formalize a procedure to get digital evidences, and obtain analytical results based on the evidences to track down the illegal uploader. Furthermore, we propose a reasoning model based on the probability distribution in a Bayesian Network to evaluate the analytical results of forensics examinations. The proposed method can accurately and scientifically track down the original uploader and owner of the infringing content.
Keywords: cloud forensics, peer to peer, file sharing, tracking, CloudFront
1. INTRODUCTION
With broadband Internet connection and with P2P programs such as Gnutella, FrostWire, BearShare,
BitTorrent, and eMule, it takes very little effort for someone to download songs, movies, or computer
games. But nowadays, people make use of it to share copyrighted files, or even images of child sexual
exploitation. Since October 2009, over 300,000 unique installations of Gnutella have been observed
sharing known child pornography in the US. Thus, many research works have been focused on
criminal investigations of the trafficking of digital contraband on Peer-to-Peer (P2P) file sharing
networks (Chow et al. 2009, Ieong et al. 2009, Ieong et al. 2010, Liberatore et al. 2010).
Recently, cloud computing has developed dramatically and will soon become the dominant IT environment. It is a new computing platform where thousands of users around the world can access a shared pool of computing resources (e.g., storage, applications, services) without having to download or install everything on their own computers, and it requires only minimal management effort from the service provider. Several big service providers (Amazon, Apple, Google, etc.) have started to provide content sharing services in the cloud which offer file sharing functionality like what P2P networks offer. With the cloud environment, file sharing becomes more convenient and efficient, since sharing can be done through a web browser without requiring software installation, and the cloud provides strong computation power and fast transmission speed. Thus it is possible that cloud based content sharing will replace the existing file sharing programs one day.
1.1 Can Existing Investigative Models be Applied to Cloud Computing?
Before the emergence of cloud computing, most forensic investigation models were built on P2P networks. When cloud based infringing content sharing happens, can the existing investigative models for analyzing P2P networks be applied to analyze cloud based file sharing networks? The answer is NO, because cloud content sharing systems differ from P2P sharing systems in the following aspects:
1. Cloud computing provides dedicated storage capacity, and data is automatically geographically dispersed on edge servers in the cloud; P2P file sharing systems, in contrast, support the trading of storage capacity between a network of 'peers'.
2. Typically, cloud file sharing systems operate in a centralized fashion in which the end user is directed to a nearby edge server, selected by a complex load balancing process, to get the data; P2P services operate in a decentralized way in which nodes on the Internet cooperate in an overlay network consisting of peers, where each peer is both a consumer and a producer of data and gets different pieces of data from other peers.
3. Further, cloud file sharing services are paid services, where the user pays, for instance, per amount of data kept in storage and per amount of network traffic generated to upload or download files. Contrary to cloud storage services, P2P file sharing systems are typically not subscription based, but rather depend on group members that are part of a peer network to trade resources, primarily disk capacity and network bandwidth.
1.2 Contributions
This is the first paper providing accurate investigations of such content sharing networks in the cloud. We analyze the functionality of a typical cloud content sharing system, CloudFront, formalize an investigation procedure, and build an analysis model on it. Our research can help investigators:
1. Confidently state from where and how various forms of evidences are acquired in the cloud;
2. Understand the relative strength of each evidence;
3. Validate evidences obtained as the fruits of a search warrant;
4. Assess the accuracy of the investigation result based on the evidences obtained, using a Bayesian network.
In section 2, we give an overview of related works. In section 3, we introduce the background of the
content sharing system. In section 4, we simulate the crime, and describe the investigation process. In
section 5, we propose a Bayesian Network based model to obtain analytical results based on the
evidences. In section 6, we analyze the proposed model in several aspects. Finally, we conclude the
whole paper.
2. RELATED WORK
Many works have been proposed to solve security and privacy problems in the cloud field (Angin et
al. 2010, Bertino et al. 2009, He et al. 2012). They focus on aspects for protecting user privacy and
anonymous authentication. The works (Wang et al. 2010, Wang et al. 2011) improve data security in the cloud using cryptographic technologies.
Also, traditional computer forensics already has many published works on how to establish principles and guidelines to retrieve digital evidence (Grance et al. 2006).
Kwan et al. 2007) show how to formulate hypotheses from evidence and evaluate the hypotheses’
likelihood for the sake of legal arguments in court. In addition, the aspects of forensics in tracking
infringing files sharing in P2P networks have been addressed by several works (Chow et al. 2009,
Ieong et al. 2009, Ieong et al. 2010, Liberatore et al. 2010).
In contrast with the maturity of research on cloud security and privacy and traditional computer forensics, the research on forensic investigations in cloud computing is relatively immature. To the best of our knowledge, our paper is the first work addressing the problem of tracking infringing sharing of copyrighted content in the cloud. There are other works (Birk et al. 2011, Marty 2011, Zafarullah et al. 2011) discussing the issues of forensic investigations in the cloud, but (Marty 2011, Zafarullah et al. 2011) focus on how to log the data needed for forensic investigations, and (Birk et al. 2011) gives an overview of forensic investigation issues without providing concrete solutions or an investigation model.
3. BACKGROUND
In this section, we provide a technical overview of a content sharing system: CloudFront, which is a
typical cloud based content sharing network.
Amazon CloudFront (Amazon 2012) is a web service for content delivery. It delivers your content
through a worldwide network of edge locations. End users are routed to the nearest edge location, so
content is delivered with the best possible performance.
A CloudFront network is made up of four types of entities:
• Objects are the files that the file owner wants CloudFront to deliver. This typically includes web pages, images, and digital media files, but can be anything that can be served over HTTP or a version of RTMP.
• Origin Server is the location where you store the original, definitive version of your objects.
• Distribution is a link between your origin server and a domain name that CloudFront automatically assigns. If your origin is an Amazon S3 bucket, you use this new domain name in place of standard Amazon S3 references. For example, http://mybucket.s3.amazonaws.com/image.jpg would instead be http://somedomainname.cloudfront.net/image.jpg.
• Edge location is a geographical site where CloudFront caches copies of your objects. When an end user requests one of your objects, CloudFront decides which edge location is best able to serve the request. If the edge location doesn't have a copy, CloudFront goes to the origin server and puts a copy of the object in the edge location.
To share a file using CloudFront, the file owner first makes and publishes a distribution. The process
is shown in Figure 1 and it includes:
1. Register an account on the origin server, and place objects in the origin server and make them
publicly readable.
2. Create CloudFront distribution and get the distribution’s domain name that CloudFront
assigns. Example distribution ID: EDFDVBD632BHDS5, and Example domain name:
d604721fxaaqy9.cloudfront.net. The distribution ID will not necessarily match the domain
name.
3. Create the URLs that end users will use to get the objects and include them as needed in any
web application or website. Example URL:
http://d604721fxaaqy9.cloudfront.net/videos/video.mp4.
Fig. 1. The working protocol of Amazon CloudFront.
To download a shared file from CloudFront, the end user needs to do the following steps. For
simplicity, we assume the end user resides in Hong Kong.
1. After clicking the URL from the web application or website, CloudFront determines which
edge location would be best to serve the object. In this case, it is the Hong Kong location.
2. If the Hong Kong edge location doesn’t have a copy of video.mp4, CloudFront goes to the
origin server and puts a copy of video.mp4 in the Hong Kong edge location.
3. The Hong Kong edge location then serves video.mp4 to the end user and then serves any other
requests for that file at the Hong Kong location.
4. Later, video.mp4 expires, and CloudFront deletes it from the Hong Kong location. CloudFront doesn't put a new copy of video.mp4 in the Hong Kong location until an end user requests video.mp4 again and CloudFront determines that the Hong Kong location should serve the file.
4. INVESTIGATION PROCESS
The objective of investigation is to obtain evidences through observation of data from the Internet and
other possible parties, such as service providers or seized devices. In this section, we discuss
techniques and methods for collecting evidences from Amazon CloudFront.
As cloud based content sharing is new, we do not have a real case in Hong Kong to study. Thus we have to suppose there is a crime and simulate the whole crime process based on the most common and regular criminal behaviors, and find out what evidences the criminal may leave. Our paper gives good guidance for collecting evidences if a real case happens in the future.
In this part, we simulate a suspect who intends to share infringing files using Amazon CloudFront. A general case is that the suspect has a movie on his computer and wants to share the movie publicly. The suspect needs to perform the following steps:
1. Subscribe to Amazon CloudFront service, providing email address, user name and credit card
information to service provider.
2. Register an origin server, providing email address and user name to the server administrator.
3. Upload infringing files from local disk to the origin server. This step may involve installing a
FTP/FTPS/SFTP client software in order to do the uploading.
4. Log in to Amazon CloudFront to create a distribution for the infringing file. The distribution is the URL link to the file in the CloudFront network.
5. Register a forum, and publish the distribution to the forum. When end users click the
distribution, CloudFront will retrieve the files from the nearest edge location.
Once such an illegal sharing happens, we can follow the guidance below to track the suspect:
First, we can trace the suspect file link uploader’s IP address through four steps (Ei represents
evidence i):
1. E1: The suspect posted the distribution, the posted message is found.
2. E2: The suspect has a forum account and he is logged in. So, the suspect’s forum account is
found.
3. E3: The IP address must be recorded by the forum. Check with the forum administrator for the
IP address of the user who created the posts.
4. E4: Check with Internet Service Providers for the assignment record of IP address to get its
subscribed user.
Through the above four steps, we are only sure that the suspect has posted a link on the forum, but not
sure about whether it is the same suspect who created the link. Thus the second step is to check with
the CloudFront provider for four issues:
1. E5: The suspect has an Amazon CloudFront account, and logged in the CloudFront with the
tracked IP address.
2. E6: The origin server domain name is found under the suspect’s CloudFront account.
3. E7: The infringing file distribution creation record is found under the suspect’s CloudFront
account.
4. E8: The registered credit card holder of that CloudFront account is the suspect.
The third step is to check with the origin server administrator that the suspect is the infringing file
owner. The evidences include
1. E9: The suspect has an origin server account, and logged in the origin server with the tracked
IP address.
2. E10: The infringing file exists under the suspect origin server account.
The last step is to find out the devices that the suspect used to do the infringing file sharing. The
evidences include
1. E11: Hash value of the infringing file matches that of the file existing on the suspect’s devices
(including Tablet PC, Laptop, Desktop Computer or Mobile Phone).
2. E12: A FTP/FTPS/SFTP client software is installed on the devices.
3. E13: Origin server connection record is found in FTP/FTPS/SFTP client software.
4. E14: Infringing file transmission record is found in FTP/FTPS/SFTP client software.
5. E15: Cookie of the Origin server is found on the devices.
6. E16: Internet history record on Origin server is available.
7. E17: URL of Origin server is stored in the web browser.
8. E18: The distribution origin name is the origin server.
9. E19: Credit card charge record of CloudFront is found.
10. E20: Cookie of the CloudFront website is found.
11. E21: CloudFront service log-in record is found.
12. E22: Distribution creation record is found.
13. E23: Removing the file in origin server will affect the distribution validity.
14. E24: Web browser software is available on the devices.
15. E25: Internet connection is available.
16. E26: Internet history record on publishing forum is found.
17. E27: The distribution link posted on the forum is as the same as what is created in CloudFront.
18. E28: Cookie of the publishing forum is found.
5. THE PROPOSED MODEL FOR ANALYZING EVIDENCES
In this part, we construct a Bayesian model to analyze the evidences found above and calculate the
probability of guilt. We begin the construction with the set up of the top-most hypothesis, which is
Hypothesis H: “The suspect is the origin file owner and the uploader”
Usually, this hypothesis represents the main argument that the investigator wants to determine. It is the root node. It is an ancestor of every other node in the Bayesian Network, hence its states' probabilities are unconditional. To support the root hypothesis, sub-hypotheses may be added to the Bayesian Network, since they are useful for adjusting the model into a more clearly structured graph. As shown in Table 1, in the CloudFront model, four sub-hypotheses are created for the root node and four sub-sub-hypotheses are created for hypothesis H4.
Table 1. Hypotheses with CloudFront
H1: The suspect posted a link to the forums.
H2: The suspect created the link using CloudFront.
H3: The suspect is the file owner.
H4: The seized devices (including Tablet PC, Laptop, Desktop Computer, Mobile Phone) have been used as the initial sharing machine to share the infringing file on the Internet.
H4-1: Has the pirated file been uploaded from the seized devices to the origin server?
H4-2: Has the distribution on the origin server been created using CloudFront?
H4-3: Has the connection between the seized devices and CloudFront been maintained?
H4-4: Has the distribution been posted to a newsgroup forum for publishing?
The built Bayesian model is shown in Figure 2:
Fig. 2. Bayesian Network Diagram for Amazon CloudFront.
5.1 How To Use the Model for Assessment
Initialization. Take the Bayesian network model for Amazon CloudFront as an example. The possible states of all evidences are "Yes, No, Uncertain". All hypotheses are set to "Unobserved" when no observation has been made of any evidence. The initial prior probability P(H) is set to Yes: 0.3333, No: 0.3333, Uncertain: 0.3333. The initial conditional probability value of each evidence is set as shown in Figure 3. Take E9 as an example: we assign an initial value of 0.85 for the situation when H3 and E9 are both "Yes". That means that when the suspect is the file owner, the chance that the suspect has an origin server account and logged in to the origin server with the tracked IP address is 85%. The resulting posterior probabilities P(Hi) and P(Hij), that is the certainty of Hi and Hij based on the initialized probability values of the evidences, should be evenly distributed amongst their states, as shown in Table 2.
Fig. 3. The Initial Conditional Probabilities.
Table 2. Bayesian Network Initial Posterior Probability
Hypotheses   Initial Posterior Probability
H1           Yes: 0.3333, No: 0.3333, Uncertain: 0.3333
H2           Yes: 0.3333, No: 0.3333, Uncertain: 0.3333
H3           Yes: 0.3333, No: 0.3333, Uncertain: 0.3333
H4           Yes: 0.3333, No: 0.3333, Uncertain: 0.3333
H4-1         Yes: 0.3333, No: 0.3333, Uncertain: 0.3333
H4-2         Yes: 0.3333, No: 0.3333, Uncertain: 0.3333
H4-3         Yes: 0.3333, No: 0.3333, Uncertain: 0.3333
H4-4         Yes: 0.3333, No: 0.3333, Uncertain: 0.3333
Assessment. In the investigation process, if an evidence Ei is found, then the state of that evidence should be changed to "Yes", and the prior probability P(Ei) should be set to 1. On the other hand, if the evidence is not found, then the state of that evidence should be changed to "No" or "Uncertain". If "No", the prior probability of that evidence should be 0; if "Uncertain", the prior probability of that evidence should be set subjectively between 0 and 1. If we assume all evidences are found, and switch all the entailing evidences to state "Yes", the propagated probability values of the hypotheses are as shown in Table 3. According to the Bayesian Network calculation, the posterior probability of H at state "Yes" reaches the highest value, 99.8819%, under this circumstance, which means that there is a maximum chance of 99.8819% that the suspect is the origin file owner and uploader.
Table 3. Propagated Probability Values of the Hypothesis
Hypotheses   Posterior probability when all evidences are found
H            Yes: 0.998819, No: 0.00118103, Uncertain: 0
H1           Yes: 0.999823, No: 0.000177368, Uncertain: 0
H2           Yes: 0.999823, No: 0.000177368, Uncertain: 0
H3           Yes: 0.994364, No: 0.00563629, Uncertain: 0
H4           Yes: 0.998967, No: 0.00103306, Uncertain: 0
H4−1         Yes: 0.999999, No: 9.7077e-007, Uncertain: 0
H4−2         Yes: 0.999994, No: 5.50123e-006, Uncertain: 0
H4−3         Yes: 0.849277, No: 0.150723, Uncertain: 0
H4−4         Yes: 0.999025, No: 0.000975118, Uncertain: 0
However, in reality, some evidences may not be found, so we should correspondingly amend the states of the absent evidences from "Yes" to "No". For example, if we assume the evidences E17, E10, E7 are not found, we change their states from "Yes" to "No". As a result, the posterior probability of H at state "Yes" is reduced from 99.8819% to 99.3379%. The result shows that the posterior probability of H would be reduced if some evidences are absent.
We used the software "MSBNx" (Microsoft, 2001) to calculate the probabilities of the hypotheses. The above analysis result for the root hypothesis tells the judge the probability that a hypothesis is true. In our experiment, if all evidences are found, the probability would be more than 99%. The numerical result is a good scientific reference for the judge. If real cases happen, investigators can use this model to evaluate the digital forensic findings, adjust the evidences found, and thus get a quantitative probability of the crime.
6. ANALYSIS OF THE MODEL
Though Bayesian Networks have been used for a while in security and forensics, our analysis below shows that their construction and usage in the cloud have their own characteristics.
6.1 Difficulties in Building the Model
Following the guidance in section 4, it is not difficult to find out the origin file link uploader and file owner. However, the nature of the cloud causes some difficulties when following our guidance in an investigation.
• First, in the cloud environment, file sharing is centrally controlled. Inevitably, investigators need assistance from cloud service providers (SPs), such as Amazon, in order to lock in the suspect fast. However, many of the cloud SPs are international organizations, so there are a number of restrictions placed on collecting evidences from foreign organizations. Some of these restrictions are the decision of the foreign government, while others are the result of international organizations being unwilling to leak customer information. Thus, one may wonder: if the cloud SPs do not collaborate with forensic investigators, does the model still work?
• Fortunately, the following evaluation result of our model shows that it is still possible to track the suspect with a high probability even without the help of the cloud SP. Without the help of the cloud SP, E5...E8 would be absent. Thus, we set the states of E5...E8 to "No", and keep the states of the other evidences at "Yes". As a result, P(H2) is dramatically reduced to 0.53%, and P(H) is reduced from 99.8819% to 96.38%. The result shows that, though the cloud SP is an important third party for providing valuable evidences, its absence does not affect the final result much when all other evidences are found.
• The second difficulty is that the suspect can use mobile phones or other persons' computers to upload the files. Also, cloud computing uses web based technology, which makes content sharing easy from any device that supports a web browser. As a result, investigators may not be able to get any evidence if they only investigate the suspect's personal computers, or may miss some important evidences existing on other devices. Thus it expands the scope of the investigation. For example, when finding the evidences supporting H4 in our model, investigators must investigate all devices with browsers, including Tablet PC, laptop, desktop computer, and mobile phone.
6.2 Sensitive Data Analysis
From the model, we can find some evidences which have the greatest or the least impact on the result, and some evidences which have the most interconnections with other evidences.
1. As shown in Figure 3, if P(Ei|Hj), the initial posterior probability of each evidence Ei caused by the hypothesis Hj, is the same, for example P(E1|H1) = P(E9|H3) = 0.85, then E9 and E10 have the most significant effect on the posterior probability of H. For example, if we change the state of E9 from "Yes" to "No", P(H) is reduced from 99.8819% to 99.43%. If we change any other evidence, such as E1, P(H) is reduced from 99.8819% to 99.86%.
2. If we adopt the same initial posterior probabilities as above, then E24 and E25 have the minimal impact on P(H). If we change the state of E24 from "Yes" to "No", P(H) is reduced from 99.8819% to 99.8818%.
3. The hypothesis H4 is in a diverging connection with H4−1, H4−2, H4−3, H4−4, and H4−1 is in a
diverging connection with E11...E17, hence, provided the states of H4−1 and H4 are
unobservable, change in E11...E17 will also change the probability values of H4−1 and H4. When
H4−1 or H4 changes, the likelihood of E11...E17 will change also.
4. Similarly, since H, H4, H4−1 and E11 are in serial connection, hence change in E11 will also
propagate the variation to H if H4 and H4−1 remain unobservable.
5. Nodes E24 and E25 are common nodes for hypotheses H4−2, H4−3, H4−4. In other words, there is
a converging connection to E24 and E25 from hypothesis nodes H4−2, H4−3, H4−4. According to
the rules of probability propagation for converging connection in Bayesian network, when the
states of E24 and E25 are known, the probabilities of H4−2, H4−3, H4−4 will influence each other.
Therefore, change in the state of E24 or E25 will change the probability of these three
hypotheses. Further, since H4−1, H4−2, H4−3, H4−4 are in divergent connection with parent
hypothesis H4, hence changes in H4−2, H4−3, H4−4 will also influence the probability of H4−1.
6.3 Initial Probability Assignment
As shown in Table 2 and Figure 3, for simplicity, in the initialization phase of the assessment we set the initial prior probability P(H) to Yes: 0.3333, No: 0.3333, Uncertain: 0.3333, and set the initial posterior probability of each evidence Ei caused by the hypothesis Hj to the same value, such as P(E1|H1) = P(E9|H3) = 0.85. However, in reality, the assignment of the initial prior probability and the initial posterior probabilities may not be like this. Such assignments often rely on subjective personal belief, which is affected by professional experience and knowledge. Also, an individual digital forensic examiner's belief may not represent the generally accepted view in the forensic discipline. Thus, the assignment needs to be done by a group of forensic specialists. In order to help investigators perform a more accurate assignment of the initial probabilities, we first classify the evidences into two levels in section 6.4.
6.4 Critical Set of Evidences
The investigation found 28 evidences. Each of them affects P(H) to a varying degree. Thus we classify the evidences into two levels, L0 and L1, according to their degree of importance to H. L0 is the lowest degree and L1 is the highest degree; the higher the level, the more important the evidence. Please note that such a classification needs an intensive understanding of each evidence, and it needs to be done by a group of forensic specialists.
• L1: E1, E3, E5, E7, E10, E11, E18, E22, E23, E27
• L0: E2, E4, E6, E8, E9, E12, E13, E14, E15, E16, E17, E19, E24, E20, E21, E25, E26, E28
According to the critical set classification, we define the following 7 deductions, which help investigators understand the logical relations among the evidences and provide basic knowledge for assigning initial posterior probabilities to each evidence.
1. If E1 and E3 are found, H1 has a high probability of being true, regardless of whether E2 or E4 is found.
2. If E5 and E7 are found, H2 has a high probability of being true, regardless of whether E6 or E8 is found.
3. If E10 is found, H3 has a high probability of being true, regardless of whether E9 is found.
4. If E11 is found, H4−1 has a high probability of being true, regardless of whether E12, E13, E14, E15, E16 or E17 is found.
5. If E18, E22, E23 are found, H4−2 has a high probability of being true, regardless of whether E19, E20, E21, E24 or E25 is found.
6. If E23 is found, H4−3 has a high probability of being true, regardless of whether E24 or E25 is found.
7. If E27 is found, H4−4 has a high probability of being true, regardless of whether E24, E25, E26 or E28 is found.
Take deduction 1 as an example: it means that if the posted distribution and the poster's IP address are found, it is most likely that the suspect posted a distribution with this IP address. The existence of E2 or E4 will not affect the probability of H1 much. The basis for making such a deduction is that some forums support anonymous posting or unregistered user posting, so E2 may not exist even if E1 and E3 are found; also, the IP subscriber may not be the suspect because he can use a public network for posting, so E4 may not exist. Thus E2 and E4 are just supplementary evidences for H1, not the critical ones. Finally, according to deduction 1, investigators should assign higher initial posterior probabilities to E1 and E3, and lower posterior probabilities to E2 and E4. The exact posterior probabilities should be carefully decided among specialists. Here, we just give an example to demonstrate the importance of the initialization.
Fig. 4. The Different Initial Posterior Probabilities Assigned to Elements in L1 and L0.
Example: We assign each L1 element the same posterior probabilities as E1, and each L0 element the
same posterior probabilities as E2, as shown in Figure 4. To test deduction 1, we set up the four
situations shown in Table 4. We found that if E1 and E3 are found but E2 and E4 are not, P(H1) is
0.9498, which is not much different from situation 1; but if E1 and E3 are not found and E2 and E4 are
found, P(H1) is only 0.0935. This supports deduction 1. Similarly, if all evidences in L1 are found and
all evidences in L0 are not found, P(H) is 0.9935, which is still a high value; but if all evidences in L1
are not found and all evidences in L0 are found, P(H) is only 0.0937. This demonstrates the importance
of the L1 evidences.
Table 4. The Impact of Critical Data

Situation   E1, E3   E2, E4   P(H1)
1           Yes      Yes      0.9907
2           No       No       0.0180
3           Yes      No       0.9498
4           No       Yes      0.0935
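For concreteness, the kind of update that produces the figures in Table 4 can be sketched in a few lines of Python. The snippet below treats the four evidences as conditionally independent given H1 and recomputes P(H1) for the four situations; the conditional probabilities it uses are illustrative assumptions (stronger values for the critical evidences E1 and E3 than for the supplementary E2 and E4) rather than the exact CPTs of Figure 4, so the resulting numbers only approximate those reported above.

    # Sketch of the posterior update behind Table 4. The CPT values below are
    # illustrative assumptions, not the exact probabilities used in the paper.
    PRIOR = {"Yes": 1.0 / 3, "No": 1.0 / 3, "Uncertain": 1.0 / 3}

    # P(evidence found | H1 state) for each evidence item.
    CPT = {
        "E1": {"Yes": 0.85, "No": 0.05, "Uncertain": 0.50},
        "E3": {"Yes": 0.85, "No": 0.05, "Uncertain": 0.50},
        "E2": {"Yes": 0.60, "No": 0.30, "Uncertain": 0.50},
        "E4": {"Yes": 0.60, "No": 0.30, "Uncertain": 0.50},
    }

    def posterior_h1(findings):
        """findings maps evidence name -> True (found) / False (not found)."""
        joint = {}
        for state, prior in PRIOR.items():
            p = prior
            for evidence, found in findings.items():
                likelihood = CPT[evidence][state]
                p *= likelihood if found else (1.0 - likelihood)
            joint[state] = p
        return joint["Yes"] / sum(joint.values())

    for situation, findings in [
        ("1: E1,E3 and E2,E4 found", dict(E1=True, E3=True, E2=True, E4=True)),
        ("2: nothing found", dict(E1=False, E3=False, E2=False, E4=False)),
        ("3: only E1,E3 found", dict(E1=True, E3=True, E2=False, E4=False)),
        ("4: only E2,E4 found", dict(E1=False, E3=False, E2=True, E4=True)),
    ]:
        print(situation, "-> P(H1 = Yes) = %.4f" % posterior_h1(findings))

With any CPT values that favour E1 and E3 over E2 and E4, situations 1 and 3 produce comparably high posteriors while situation 4 stays low, which is the qualitative behaviour the deduction relies on.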
6.5 Other Cloud Content Sharing Networks
There exist many other cloud content sharing networks, such as the Seagate FreeAgent GoFlex Net
Media Sharing service (Engines), which represents hardware-assisted technologies that turn a user's
storage device into personal cloud storage. In such a GoFlex network there is no origin server as in
CloudFront; instead, investigators need to examine the user's storage device, such as USB storage,
because the origin files reside there. Apart from that, investigators can follow the CloudFront
investigation process to find the other evidences in GoFlex and build the Bayesian Network model.
7. CONCLUSION
Performing forensics on crimes based on a cloud content sharing network is new. We analyzed a
typical cloud content sharing network and proposed guidance for tracking the origin file uploader and
owner when illegal sharing happens in such a network. A Bayesian Network model was also built
(Section 5) for analyzing the collected evidences to obtain a scientific evaluation of the probability of a
crime. Analyses of the model construction difficulties, the initialization, the sensitive data and the
critical data set were carried out. One interesting result is that although the cloud SP is an important
third party for providing valuable evidences, its absence does not greatly affect the probability of
tracking the suspect when all the other evidences are found. If the proposed guidance is followed, there
is a chance of more than 99% of tracking the origin file uploader and owner.
ACKNOWLEDGEMENTS
The work described in this paper was partially supported by the General Research Fund from the
Research Grants Council of the Hong Kong Special Administrative Region, China (Project No. RGC
GRF HKU 713009E), the NSFC/RGC Joint Research Scheme (Project No. N_HKU 722/09), HKU
Seed Fundings for Applied Research 201102160014, and HKU Seed Fundings for Basic Research
201011159162 and 200911159149.
REFERENCES
C.G.G. Aitken and F. Taroni (2004), Statistics and the evaluation of evidence for forensic scientists,
John Wiley.
Amazon (2012), ‘Amazon cloudfront’, http://aws.amazon.com/cloudfront/, Accessed in April, 2012.
P. Angin, B. K. Bhargava, R. Ranchal, N. Singh, M. Linderman, L. B. Othmane, and L. Lilien (2010).
‘An Entity-Centric Approach for Privacy and Identity Management in Cloud Computing’. 29th IEEE
Symposium on Reliable Distributed Systems (SRDS). October 31 - November 3. New Delhi, Punjab,
India.
E. Bertino, F. Paci, R. Ferrini, and N. Shang (2009), “Privacy-preserving Digital Identity Management
for Cloud Computing”, IEEE Data Engineering Bulletin, Vol 32(Issue 1): Page 21–27.
D. Birk and C. Wegener (2011), ‘Technical issues of forensic investigations in cloud computing
environments’. IEEE Sixth International Workshop on Systematic Approaches to Digital Forensic
Engineering (SADFE). May 26. Oakland, CA, USA.
Yi-Jun He, Sherman S. M. Chow, Lucas C.K. Hui, S.M. Yiu (2012), ‘SPICE – Simple Privacy-Preserving Identity-Management for Cloud Environment’, 10th International Conference on Applied
Cryptography and Network Security (ACNS), June, Singapore.
K. P. Chow, R. Ieong, M. Kwan, P. Lai, F. Law, H. Tse, and K. Tse (2009), ‘Security analysis of foxy
peer-to-peer file sharing tool’. HKU Technical Report TR-2008-09.
C. Engines. ‘Pogoplug’, http://www.pogoplug.com/. Accessed in April, 2012.
R. S. C. Ieong, P. K. Y. Lai, K. P. Chow, M. Y. K. Kwan, and F. Y. W. Law (2010). ‘Identifying first
seeders in foxy peer-to-peer networks’. 6th IFIP International Conference on Digital Forensics. January
4-6. Hong Kong, China.
R. S. C. Ieong, P. K. Y. Lai, K. P. Chow, F. Y. W. Law, M. Y. K. Kwan, and K. Tse (2009). ‘A model
for foxy peer-to-peer network investigations’. 5th IFIP WG 11.9 International Conference on Digital
Forensics. January 26-28. Orlando, Florida, USA.
M. Y. Kwan, K. Chow, F. Y. Law, and P. K. Lai (2007), ‘Computer forensics using Bayesian network:
A case study’. http://www.cs.hku.hk/research/techreps/document/ TR-2007-12.pdf.
M. Liberatore, B. N. Levine, and C. Shields (2010). ‘Strengthening forensic investigations of child
pornography on P2P networks’. Proceedings of the 2010 ACM Conference on Emerging Networking
Experiments and Technology (CoNEXT). November 30 - December 03. Philadelphia, PA, USA.
R. Marty (2011). ‘Cloud application logging for forensics’. 26th ACM Symposium On Applied
Computing (SAC). March 21 – 24. TaiChung, Taiwan.
Microsoft (2001). ‘Microsoft Bayesian Network Editor’,
http://research.microsoft.com/en-us/um/redmond/groups/adapt/msbnx/. Accessed in April, 2012.
K. Kent, S. Chevalier, T. Grance and H. Dang (2006), ‘Guide to computer and network data analysis:
Applying forensic techniques to incident response’. National Institute of Standards and Technology.
C. Wang, Q. Wang, K. Ren, and W. Lou (2010). ‘Privacy-preserving public auditing for data storage
security in cloud computing’. The 29th IEEE International Conference on Computer Communications
(INFOCOM). March 15-19. San Diego, CA, USA.
Q. Wang, C. Wang, K. Ren, W. Lou, and J. Li (2011), ‘Enabling public auditability and data dynamics
for storage security in cloud computing’. IEEE Transactions on Parallel and Distributed Systems,
22(5): Page 847–859.
Zafarullah, F. Anwar, and Z. Anwar (2011). ‘Digital forensics for eucalyptus’. 9th International
Conference on Frontier of Information Technology (FIT). December 19-21. Islamabad, Pakistan.
iPAD2 LOGICAL ACQUISITION: AUTOMATED OR
MANUAL EXAMINATION?
Somaya Ali: [email protected], P.O. Box 144426, Abu Dhabi, U.A.E
Sumaya AlHosani: [email protected], P.O. Box 144426, Abu Dhabi, U.A.E
Farah AlZarooni: [email protected], P.O. Box 144426, Abu Dhabi, U.A.E
Ibrahim Baggili: [email protected], P.O. Box 144534, Abu Dhabi, U.A.E
Advanced Cyber Forensics Research Laboratory
Zayed University
ABSTRACT
Due to their usage increase worldwide, iPads are on the path of becoming key sources of digital
evidence in criminal investigations. This research investigated the logical backup acquisition and
examination of the iPad2 device using the Apple iTunes backup utility, both by manually examining the
backup data (manual examination) and by automatically parsing the backup data with the Lantern
software (automated examination). The results indicate that a manual examination of the logical backup
structure from iTunes reveals more digital evidence, especially if installed application data is required
for an investigation. However, the researchers note that if a quick triage is needed of an iOS device,
then automated tools provide a faster method for obtaining digital evidence from an iOS device. The
results also illustrate that the file names in the backup folders have changed between iOS 3 and iOS 4.
Lastly, the authors note the need for an extensible software framework for future automated logical
iPad examination tools.
Keywords: iPad, forensics, logical backup, iOS, manual examination.
1. INTRODUCTION
The popularity of the iPad continues to rise. Consumers are attracted to its unique features, such as
increased storage capacity, user-friendliness, and the incorporation of an interactive touch screen. It is
considered to be more personal than a laptop and more advanced than a mobile phone (Miles, January
27, 2010). The rapid distribution of tablets to businesses and individual utilization is primarily due to
the Apple iPad device (Pilley, October 8, 2010). When Apple first released the iPad, they targeted
consumer sectors rather than business sectors. Their advertisements and marketing campaigns
reflected this strategy. However, the iPad is currently being adopted by an extensive scope of
consumers, ranging from children to executives. It is also employed in unconventional settings, like in
the hands of restaurant waiters as they take your order, or in boutique windows showcasing
promotional material (Geyer & Felske, 2011).
Since 2010, more than 25 million iPads have been sold in the US (KrollOntrack, 2011). Furthermore,
70% of the iPad2 buyers were new to the iPad world and 47% purchased a 3G model. This
demonstrates that not only are more buyers attracted to the iPad, but that consumers are looking for
always-connected iPad devices (Elmer-DeWitt, March 13, 2011). Due to the ubiquity of these devices,
criminals are starting to target iPad users, thereby committing a wider range of crimes (Pilley, October
8, 2010). Hence, digital forensic scientists need to focus on investigating these emerging devices.
Fortunately, new digital forensic tools and methods are being developed by researchers and private
corporations to help law enforcement officers forensically examine iPads, despite the challenging fact
that the iPad technology is continuously changing (Pilley, October 8, 2010).
2. RESEARCH QUESTIONS
This research focused on logical forensic methods for examining iPad devices. It intended to answer
the following questions:

- What is the difference between using automated logical analysis of the backup files versus using a manual approach?
- What data is saved in the backup folder of an iPad2?
- Where is the data saved and how can it be extracted?
- What is the difference between the backup structure of the old iOS and the new iOS?
3. BACKGROUND
3.1. iPad
On January 27, 2010, the late CEO of Apple, Steve Jobs, announced the launch of a new tablet device
called iPad. On March 2, 2011, the second generation iPad2 was launched. Users could enjoy features
such as browsing the Internet, watching videos, viewing photos, and listening to music (Apple,
2011a).
With two cameras, high definition (HD) video recording capabilities, a dual core A5 chip, extended
battery life, WiFi Internet connectivity, third generation (3G) mini SIM card, an endless variety of
applications, and a thin and light design, the iPad2 stormed the technology market and created a new
era in the tablet world (Apple, 2011b). The iPad2 was released with the new version of Apple’s
operating system, iOS 4.3 (Halliday, March 2, 2011).
The following section explores the iOS file system in more detail.
3.1.1. The iOS File System
In 2008, Apple released iOS – an operating system for the iPhone, iPod, and iPad. This did not trouble
forensic investigators since the iOS used on the iPad was the same as that of the iPhone. Therefore, at
the time, no further studies were required, as they had been previously performed for the iPhone and
iPod touch. In April 2010, Apple took a major step by releasing iOS 4 which introduced further
notable features, such as multitasking, gaming features, video possibilities, and others (Morrissey,
2010).
Being familiar with all the features and having a good understanding of the Apple ecosystem was a
fundamental requirement for establishing a solid understanding of iOS forensics (Morrissey, 2010).
The iOS is a mini version of the OS X, which uses a modification of the Mac OS X kernel and its
development is based on Xcode and Cocoa (Morrissey, 2010). iOS is composed of four main
components; Cocoa Touch, Media, Core Services, and the OS X kernel. The first component, Cocoa
Touch, provides the technological infrastructure to employ the applications’ visual interface (Hoog &
Strzempka, 2011). The second component, Media, contains graphics, audio and video technologies
consisting of OpenAL, video playback, image files formats, quartz, core animation, and OpenGL
(Morrissey, 2010). According to Hoog & Strzempka (2011), the third component, Core Services,
delivers the primary system services for applications such as networking, SQLite databases, core
location, and threads. The OS X kernel which is the fourth component, delivers low level networking,
peripheral access, and memory management/file system handling (Hoog & Strzempka, 2011). It
consists of TCP/IP, sockets, power management, file system, and security (Morrissey, 2010).
Morrissey (2010) stated that the Hierarchical File System (HFS) was developed by Apple in order to
support users' increased storage requirements. At the physical level, HFS-formatted disks use 512-byte
blocks, identical to Windows-based sectors. An HFS system has two kinds of blocks, logical blocks
and allocation blocks (Morrissey, 2010). The logical blocks are numbered from the first to the last on a
given volume; they are static and 512 bytes in size, just like the physical blocks. Allocation blocks, on
the other hand, are used by the HFS system to reduce fragmentation.
They are collections of logical blocks joined together as clusters (Morrissey, 2010).
The HFS file system manages files using a balanced tree (B*tree), a catalog file system that uses a
catalog file and extents overflows in its organization scheme (Morrissey, 2010). B*trees consist of
nodes that are assembled together in a linear manner; this linear method increases data access speed by
continuously rebalancing the extents when data is deleted or added. The HFS file system assigns a
unique number (the catalog ID number) to every file that is created, increasing it by one each time a
file is added. The numbering of the catalog IDs is tracked by the HFS volume header (Morrissey,
2010).
Figure 1. Structure of HFS+ file system
Adapted from IOS Forensic Analysis: for iPhone, iPad and iPod Touch, by Morrissey, S, 2010: Apress. Copyright 2010 © by Sean Morrissey
Figure 1 illustrates the structure of an HFS+ file system as illustrated by Morrissey (2010). The boot
blocks retain the first 1024 bytes. The subsequent 1024 bytes are retained for the volume header as
well as the last 1024 bytes of the HFS volume, which is reserved as a backup volume header. The
volume header stores information about the structure of the HFS volume. Next is the allocation file
which traces the allocation blocks that the file system is using. The following is the extents overflow
file which traces all the allocation blocks that belong to a file’s data forks. A list of all extents used by
a file and the connected blocks in the proper order is stored in this file.
The catalog file stores the information on the volume's files and folders in a hierarchical system of
nodes: a header node, index nodes, leaf nodes, and map nodes. The attributes file is reserved for future
use of data forks, and the startup file is designed to boot a system that does not have built-in ROM
support. The space after the startup file is used by the file system for saving and tracking all the data in
the volume. The alternate volume header, as mentioned previously, is where the backup of the volume
header is retained; disk repair is the main purpose of the alternate volume header. Finally, the last
chunk of the HFS file system structure is a reserved area of 512 bytes (Morrissey, 2010).
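As a concrete illustration of the layout just described, the sketch below reads a few volume header fields from a raw image of an HFS+ or HFSX volume, assuming the field ordering of Apple's published HFSPlusVolumeHeader structure. The image file name is a hypothetical example and is not something produced in this study.

    # Minimal sketch of reading the HFS+/HFSX volume header (signature, file and
    # folder counts, allocation block size) from a raw volume image. Field order
    # follows Apple's published HFSPlusVolumeHeader layout; the image name is a
    # hypothetical example.
    import struct

    def read_volume_header(image_path):
        with open(image_path, "rb") as f:
            f.seek(1024)                 # boot blocks occupy the first 1024 bytes
            header = f.read(512)         # the volume header is one 512-byte block
        signature, version = struct.unpack_from(">2sH", header, 0)
        (file_count, folder_count, block_size,
         total_blocks, free_blocks) = struct.unpack_from(">5I", header, 32)
        return {
            "signature": signature.decode("ascii", "replace"),  # 'H+' for HFS+, 'HX' for HFSX
            "version": version,
            "files": file_count,
            "folders": folder_count,
            "allocation_block_size": block_size,
            "total_blocks": total_blocks,
            "free_blocks": free_blocks,
        }

    if __name__ == "__main__":
        print(read_volume_header("ipad_data_partition.dd"))     # hypothetical image name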
The file system for Apple portable devices is HFSX, but has one main difference from the HFS+. The
difference is that the HFSX is case sensitive, thereby allowing the file system to distinguish between
two files that have the exact identical name. For instance, both test.doc and Test.doc can be created on
the HFSX file system (Morrissey, 2010).
According to Cellebrite (2011), two partitions are included in the Apple devices. The first partition is
the system partition, which is about 1 GB. The iOS and basic applications are stored in the first
partition (Gómez-Miralles & Arnedo-Moreno, 2011). Although encryption may be enabled on the
iDevice, the system partition is not encrypted (Cellebrite, July 3, 2011). However, the second partition
is the one of interest, as it holds all the user data that can have evidentiary value. This partition
includes all the photos, messages, GPS, videos, and other data that is generated by the user (Cellebrite,
July 3, 2011). Extracting the protected files can be a challenge if encryption is enabled (Cellebrite,
2011).
3.1.2 Related Work
A paper by Luis Gómez-Miralles and Joan Arnedo-Moreno explored the method of obtaining a
forensic image of the iPad using a Camera Connection Kit. The advantage of their method was the
ability to decrease time by attaching an external hard disk to the iPad using a USB connection. Their
method consisted of two parts. The first part was setting up the device by jailbreaking it. The second
part was to acquire a forensic image, by connecting Apple’s Camera Connection Kit, which was used
to mount the USB hard drive. Next, a forensic image was obtained using the dd command (Gómez-Miralles & Arnedo-Moreno, 2011).
Another research study by Mona Bader and Ibrahim Baggili, explored the forensic processing of the
third generation of Apple iPhone 3GS 2.2.1 by using the Apple iTunes backup utility to retrieve a
logical backup. Their research showed that the Backup folder with its backup files contains data that
has evidentiary value, as the iPhone file system saves the data in binary lists and database files.
Additionally, their results showed that XML format plist files are used to store device configuration,
status, and application settings. Bader and Baggili (2010) stated that the backed up files from the
iPhone contain information that could have evidentiary significance such as text messages, email
messages, and GPS locations. The manual examination introduced in this paper is an extension of their
work to the iPad2 device.
4. DESIGN – METHODS AND PROCEDURES
The methodology in this research takes into consideration the Computer Forensics Tool Testing
guidelines established by the National Institute of Standards and Technology (NIST) (NIST, 2003). At
a high level, the authors followed the following procedures:
1. Certain scenarios were entered onto three iPad2 devices
2. The devices were backed up (logically acquired)
3. The backup folders were then parsed using automated forensic tools and manually
4. The results were compared
After the scenario was created, the iPad2 was geared up for logical acquisition. The specifications of
the devices used in the acquisition process are listed in Table 1. Then, the forensic tools were chosen
to perform the logical acquisition. After reviewing some of the available tools' specifications and their
ability to support iPad logical data extraction, Katana Forensic Lantern was chosen.
Table 1. Hardware Specifications

Device                 Specification
Forensic Workstation   MacBook Pro, Mac OS X Version 10.6.8,
                       Processor: 2.26 GHz Intel Core 2 Duo,
                       Memory: 2 GB 1067 MHz DDR3
iPad                   iPad 3G + WiFi, Storage: 64 GB, iOS: 4.10.01
Table 2. Software Specifications

Software                  Details
Katana Forensic Lantern   Version 2.0, operable on a minimum of an Intel-based Mac OS X 10.6
                          with preferably 4 GB memory.
                          Downloaded from: http://katanaforensics.com/
iTunes                    iTunes 10, version 10.4.1 (10)
iPhone Backup Extractor   Version 3.0.8
                          Downloaded from: www.iphonebackupextractor.com
PlistEdit Pro             Version 1.7 (1.7)
                          Downloaded from: http://www.fatcatsoftware.com/plisteditpro/
SQLite Database Browser   Version 1.3
                          Downloaded from: http://sqlitebrowser.sourceforge.net
Google Earth              Version 6.0.3.2197
                          Downloaded from: http://www.google.com/earth/download/ge/agree.html
Logical Acquisition Approach
Logical backups contain a vast amount of data that includes all sorts of evidence (Bader & Baggili,
2010). They are able to connect many dots in a case by providing further information about the
computer that was previously used to synchronize the iDevice and about the owner of the device,
including his/her interests and whereabouts (Bader & Baggili, 2010).
The logical backup was captured through the iTunes utility using a Mac OS X version 10.6.8. The
system utilizes Apple’s synchronization protocol in order to have all of the iDevice's data copied to the
forensic workstation (Bader & Baggili, 2010).
It is important to note that any automatically initiated backup on iTunes may contaminate the iDevice
by transferring data from the forensic workstation to the iDevice as a part of the synchronization
(Bader & Baggili, 2010). This was noted, and steered clear of, prior to connecting the iPad to the
forensic workstation by disabling the synchronization option in iTunes.
The logical backup folder was parsed using different methods: first, by using Katana Forensic Lantern.
This software directly acquires the backup from the iDevice – provided that the automatic
synchronization with iTunes is turned off; second, by using the iPhone Backup Extractor tool that
presents the backup data in a more visually appealing form than the original data; and third, through a
manual examination of the backup folder.
Other software and tools were required to manually investigate the backup data such as the SQLite
Database browser, PlistEdit Pro, and TextEdit as mentioned in Table 2. In order to prepare the iPad2
and backup data, a scenario was created on the iPad, by adding notes, email accounts, address book
entries, photos (taken with the location option being On and Off), calendar entries and bookmarks.
Then, the backup process was initiated using Mac OS X version 10.6.8 through iTunes version 10.4.1
(10). More specifications are included in Tables 1 and 2.
It must be noted that all the tests were conducted under forensically acceptable conditions, thereby
avoiding any illegal access to the device through jailbreaking, especially since the main objective was
to not contaminate the original data stored on the iPad2. According to Kruse & Heiser (2002), there
are four key phases to follow in computer forensics processes, which are, access, acquire (logical),
analyse, and report. These steps were followed in this research.
4.1. Instruments
4.1.1. Katana Forensic Lantern
Lantern is one of the iOS forensic solutions currently available on the market, developed by Katana
Forensics. It is well known for its low cost, yet fast and effective results. It enables acquiring the
logical portion of three different iOS devices: iPhones, iPads and iPod Touch (Morrissey, 2010). After
the device is connected to the forensic computer, Lantern backs up the files and allows the
examination of data while it is still processing. Timeline analysis and geographic data features are also
supported (KatanaForensics, 2011b).
The software has an intuitive user interface. It is as simple as opening a new case, entering the case
and evidence details, and then acquiring the device (KatanaForensics, 2011b). The maximum time for
acquiring the iDevice ranges between 5 to 30 minutes, depending on the size of the device (Morrissey,
2010).
4.1.2. iTunes
Media player, digital media organizer, and iPhone/iPod/iPad content manager are the features that the
iTunes application has offered since its introduction in 2001. It also connects to the iTunes Store and
enables online content purchases (Apple, 2011c).
iTunes operates on both Mac OS X and Windows, and is available for download, free of charge,
through Apple's portal. It is primarily used to maintain media, applications, books, and content
purchases, all of which are synchronized with the owner's Apple devices. It creates backups of all the
settings and media of the connected devices, such as iPods, iPhones and iPads, and these backups can
restore the captured settings and details of the devices (Apple, 2011c).
4.1.3. iPhone Backup Extractor
iPhone Backup Extractor is a freely available tool that can parse data from the iPhone’s backup folder
(ReincubateLtd, 2011). According to Morrissey (2010), it has the ability to convert the backup folder
into CSV, VCard, or ICAL formats, so they can be easily viewed. It can also convert Keyhole Markup
Language (KML) files for use with Google Earth if any location data is included. KML is an XML
markup language used by map-related software for marking maps (Goldberg & Castro, 2008).
Although iPhone Backup Extractor is not considered to be a forensic tool, it provides a means for
examiners to analyse backup folders (Morrissey, 2010).
4.2. Data Collection and Analysis
4.2.1. Katana Forensic Lantern
When the acquisition process was complete, the following extracted data was shown:
Device Information: It provided general information about the device being acquired such as device
name, type, model, serial number, and software version.
Contacts: This pane showed the phone book saved on the iDevice. It also showed all the related data
to a contact in one single screen.
Notes: Thoughts, ideas, points, lists, and many more things can be typed in this application and may
be used as evidence in an investigation.
Calendar: The calendar may include a great deal of information. It may contain appointments, events,
notes, and important dates synchronized from many different devices and applications used by the
user. Lantern parses all of this data and alerts the examiner to entries as they occur (Morrissey, 2010).
Internet: Considering the nature of the investigation being carried on, this section is of great
importance. Lantern provides a clear list of all Internet bookmarks, displayed pages, and browsing
history from the web surfing program on the iPad, which is Safari.
Gmail: This pane shows all the registered email accounts on the iPad. In this case, the tested device
had a Gmail account set up, and all related emails, their contents, sender and receiver related
information were displayed.
Dictionary: Each time text is typed on the iPad, it is logged and saved, and can be acquired during the
investigation. Those entries can appear from texts typed into the device and are ordered in the dynamic
dictionary file.
Maps: iPad users may record maps and routes into their handheld devices. All those details are parsed
and can be exported to Google Maps to show more graphical details and more specific locations.
Those details, in turn, can be exported to Google Earth so they can be plotted on a map.
WiFi: Shows a chronological order of all the WiFi connections attempted using the iPad, SSID,
BSSID, and the security of that connection. This can be useful in capturing the access points the
device was connected to.
Photos: This pane may be one of the most important panes for the investigator due to the richness of
the information that it offers. All images taken with the built-in camera include Exchangeable Image
File Format (EXIF) and embedded geographical data (KatanaForensics, 2011b). Those images can be
identified using Lantern and are also clearly viewed within the program itself. Lantern can extract the
geographical data from EXIF so that the location can be easily plotted on Google Maps, illustrating
where the photo was taken (Morrissey, 2010).
Google Earth: Interesting data was found within the Exif associated photos taken using the built-in
camera of the iPad. If the location service is on, it records the longitude and latitude of the exact
location of where the photo was taken, along with other details about the camera type, aperture value,
and many more related data. This data may be exported into a .kmz format, which then could be
opened using Google Earth and show its exact location on the map.
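The export just described amounts to wrapping the recovered latitude and longitude in a small KML document that Google Earth can open. The sketch below builds such a placemark by hand; the photo name and the coordinates are hypothetical examples.

    # Sketch of wrapping EXIF-derived coordinates in a KML placemark for Google
    # Earth. The photo name and coordinates are hypothetical examples.
    def make_kml(name, latitude, longitude):
        # KML expects "longitude,latitude" ordering inside <coordinates>
        return (
            '<?xml version="1.0" encoding="UTF-8"?>\n'
            '<kml xmlns="http://www.opengis.net/kml/2.2">\n'
            '  <Placemark>\n'
            '    <name>%s</name>\n'
            '    <Point><coordinates>%f,%f</coordinates></Point>\n'
            '  </Placemark>\n'
            '</kml>\n' % (name, longitude, latitude)
        )

    if __name__ == "__main__":
        with open("photo_location.kml", "w") as f:
            f.write(make_kml("IMG_0001.JPG", 24.4539, 54.3773))  # example coordinates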
Report: Lantern provides an option of exporting all the acquired data into html format. This
summarizes all the panes and their related data into one page – which is easier to navigate through. It
also provides the source file from which the data was extracted.
Third Party Applications: Although Lantern was of a great assistance in examining the evidence on
iDevices, it did not provide significant details about third party applications installed on the iPad. Yet,
it was possible to manually navigate through the exported report and find some related data to those
applications.
4.2.2. iPhone Backup Extractor
iPhone Backup Extractor is not intended to serve as a forensics tool (Morrissey, 2010). It essentially
parses the extracted backup folder into useful folders and file formats that, in turn, make it easier for
the examiner to go through (ReincubateLtd, 2011). Moreover, it parses the data in a more organized
way and sorts it neatly e.g. .plist, .sqlite3, & .db files (Morrissey, 2010). iPhone Backup Extractor is
not complicated and does not need any special knowledge to run and use (ReincubateLtd, 2011).
The following descriptions are files of interest extracted using the iPhone Backup Extractor:
Keychain-backup.plist: This file contains important tables such as the genp and inet. The tables
contain accounts, their encrypted passwords, and related services (Bader & Baggili, 2010).
Library: This folder contains most of the important data that can be found within the backup, such as
the saved entries in the Address Book (and the related images to each entry), Calendar, Keyboard
entries, Notes, Safari and more. The files are in .sqlite3 format and .db or .plist.
Media: Media folder parses the photos and their data into their relative folders.
Ocspcache.sqlite3: This database seems to include the digital certificates.
SystemConfiguration: This folder shows a collection of PList files that contain data about networks
the device is connected to, their preferences, and the power management details.
TrustStore.sqlite3: This database contains the CA certificates that the iPad trusts. The CA certificates
are used for security credentials and message encryption by using public keys (SearchSecurity.com,
2011).
Com.facebook.Facebook: This backup folder holds all the data related to the facebook account
associated with the iPad under investigation. A database of all the facebook friends’ networks can also
be found in .db format.
Dynamic-text.dat file (Dictionary): Any text entered on the iPad is logged into a file called
dynamic-text.dat. This file can be of great help to the investigator due to the insight it provides about
the suspect's usage of the device. It can potentially direct the whole investigation onto the proper path
by providing solid facts.
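A simple, format-agnostic way to pull the readable keyboard entries out of dynamic-text.dat is a strings-style pass over the file, as sketched below. The file's internal header layout is not parsed here, and the file name is assumed to be a local working copy.

    # Strings-style extraction of readable tokens from dynamic-text.dat. The
    # internal header format is not parsed; only runs of printable ASCII bytes
    # are reported. The file name is a hypothetical working copy.
    import re

    def keyboard_tokens(path, min_length=3):
        data = open(path, "rb").read()
        # runs of printable ASCII characters of at least min_length bytes
        for match in re.finditer(rb"[ -~]{%d,}" % min_length, data):
            yield match.group().decode("ascii")

    for token in keyboard_tokens("dynamic-text.dat"):
        print(token)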
com.skype.Skype: Skype provides video calls, screen sharing, and many more features. It can
potentially reveal a lot about the suspect's personality and provide corroborating evidence. The
database contains many tables such as Accounts, CallMembers and ChatMembers.
com.apple.ibooks: This database mainly holds details about all electronic books and .pdf files added
to the iBook application. It shows plenty of interesting related details such as the title of the item, the
author, and the categories. This could provide further details about the personality of the suspects and
their interests.
Com.linkedin.Linkedin.plist: This PList file holds details of the Linkedin account used on the
iDevice. It mainly shows general user profile details such as user name, job title, associated twitter
accounts, pictures, and a link to the user profile and photo URL.
4.2.3. Manual Examination of iTunes Backup Folder:
A manual examination of the iTunes backup folder was conducted without using any automated tools
to parse the files. The backup folder contained more than 2000 files. These files consisted of different
kinds of files, such as PLists, SQL databases, images, HTML files, and more.
As a part of the backup folder, three main PList files were generated: Info.plist, Status.plist, and
Manifest.plist. Manifest.plist is a metadata file that identifies the .mddata and .mdinfo files. The
Status.plist file holds the status of the backup, i.e. whether it was completed successfully, and the date
and time of the backup.
Another file contained within the backup folder was the Info.plist. This plist file holds details about
the backup and other information regarding the iDevice (GitHub, 2011). Other files found in the
backup folder are Manifest.mbdb and Manifest.mbdx. These are binary index files that contain the
names and paths of the files in the backup (LeMere, 2010). It also contains the real names of the files
representing the random strings in the backup folder (Ley, 2011). The .mbdb file holds the data of the
original files, and the .mbdx contains pointers to those files in the .mbdb file as well as the hex file
names included within the backup folder (ROYVANRIJN, April 21, 2011). Furthermore, the .mbdb
also contains a file header which holds the version number of mbdx and the number of records present
in the mbdx file (LeMere, 2010).
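The hashed file names and the Manifest.mbdb index can be tied together programmatically. The sketch below walks the commonly documented mbdb record layout (a 6-byte header, length-prefixed domain and path strings, then fixed-size metadata fields) and derives each backup file name as the SHA-1 of the string "domain-relativePath"; it is an illustration of the structure described above, not the tooling used in this research.

    # Minimal sketch of walking an iOS 4 Manifest.mbdb file and deriving the
    # hashed backup file names described above. The record layout follows the
    # commonly documented mbdb format; this is an illustration, not the paper's tool.
    import hashlib
    import struct
    import sys

    def read_string(data, offset):
        """Read a 2-byte big-endian length-prefixed string; 0xFFFF means empty."""
        length, = struct.unpack_from(">H", data, offset)
        offset += 2
        if length == 0xFFFF:
            return "", offset
        value = data[offset:offset + length]
        return value.decode("utf-8", errors="replace"), offset + length

    def parse_mbdb(path):
        data = open(path, "rb").read()
        if data[:6] != b"mbdb\x05\x00":
            raise ValueError("Not a Manifest.mbdb file")
        offset = 6
        records = []
        while offset < len(data):
            domain, offset = read_string(data, offset)
            rel_path, offset = read_string(data, offset)
            _link_target, offset = read_string(data, offset)
            _data_hash, offset = read_string(data, offset)
            _enc_key, offset = read_string(data, offset)
            # Fixed fields: mode, inode, uid, gid, mtime, atime, ctime, size, flag, n_props
            mode, _inode, _uid, _gid, _mt, _at, _ct, size, _flag, n_props = \
                struct.unpack_from(">HQIIIIIQBB", data, offset)
            offset += 40
            for _ in range(n_props):          # skip per-file properties
                _, offset = read_string(data, offset)
                _, offset = read_string(data, offset)
            # iOS 4 backups store each file under SHA-1("domain-relativePath")
            name = hashlib.sha1(("%s-%s" % (domain, rel_path)).encode("utf-8")).hexdigest()
            records.append((name, domain, rel_path, size))
        return records

    if __name__ == "__main__":
        for name, domain, rel_path, size in parse_mbdb(sys.argv[1]):
            print("%s  %-10d %s  %s" % (name, size, domain, rel_path))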
The following is the list of some of the files of interest found using the manual examination method:
Dynamic Dictionary: Found in 0b68edc697a550c9b977b77cd012fa9a0557dfcb.
iBooks: The SQLite database of iBooks is contained within the file
1c955dc79317ef8679a99f24697666bf9e4a9de6.
Skype: Many files related to the Skype application were found and examined – most of which were
text files containing encrypted data, except that the user names involved in the video calls were in
clear text. In addition, a SQLite database was found with data relevant to the calls carried out, their
duration, the usernames involved, and the language used. The SQLite database was named
a6414be1fc3678b0ea60492a47d866db3a6d4818 and its related PList file was named
5c7504d4b4aa4395d7b3651bc0d523de121c3159. It was also found that the hash values for this
application were not constant amongst the multiple tested devices.
Bookmarks: The Bookmarks database is contained within the file named
d1f062e2da26192a6625d968274bfda8d07821e4 and contains many different tables with data related
to the bookmark URL and the title of each link.
Notes: Contains details about the notes saved on the device. The SHA-1 value is
ca3bc056d4da0bbf88b5fb3be254f3b7147e639c.
Twitter: Twitter data was in PList format and contained the usernames and content of tweets
exchanged amongst them. The related SHA-1 values for the twitter application files were as follows:
1ae8b59701a8ef5a48a50add6818148c9cbed566
2ba214fcde984bbc5ea2b48acd151b61b651c1c8
4eb24cb35ff5209d08774a6d2db4c98a0449a9ff
7a0c2551ecd6f950316f55d0591f8b4922910721
71127e4db8d3d73d8871473f09c16695a1a2532c
c5dc95a1b0c31173c455605e58bffcca83d8b864
WiFi Connections: In all three tested iPads, the file 3ea0280c7d1bd352397fc658b2328a7f3b1243b
contained data related to the WiFi and network connections.
Gmail Contacts List: An interesting finding of the manual examination was a SQLite database file
containing the email accounts, names, and mobile numbers of the contacts of the Gmail account
registered on the iPad. The findings contained information that was not initially entered for each
contact in the list but had been synchronized from the owner's BlackBerry Address Book. The Gmail
account was set up on the BlackBerry and synchronized with that device's contact list; because the
same Gmail account was also set up on the iPad under examination, all of the contacts' mobile
numbers were found in the backup folder. This could lead an investigator to a suspect's contact
numbers, so long as they were saved on his/her mobile device.
LinkedIn: The PList file for this application is named
9c404eb0aa691005cdbd1e97ca74685c334f3635 and provides details such as the first name, last name,
user name, profile URL, and the job title.
Calendar: The calendar contains two relative files in the backup folder, a SQLite database and PList.
The former is saved under 2041457d5fe04d39d0ab481178355df6781e6858 and the latter under
d351344f01cbe4900c9e981d1fb7ea5614e7c2e5.
Keychain: This file was found in 51a4616e576dd33cd2abadfea874eb8ff246bf0e and contains genp,
cert, inet, and keys tables. These tables provide more data about the accounts registered on the device
along with their relative passwords and details.
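Each of the SQLite files above can be opened with the SQLite Database Browser listed in Table 2, or inspected programmatically. The sketch below uses Python's sqlite3 module to enumerate the tables of one backup database and report their row counts; the file name is the Calendar database identified above, and the script should be run against a working copy rather than the original evidence.

    # Sketch of inspecting one of the backup SQLite databases listed above:
    # enumerate the tables recorded in sqlite_master and report row counts.
    # Run against a working copy, not the original evidence.
    import sqlite3

    def summarise(db_path):
        conn = sqlite3.connect(db_path)
        try:
            cur = conn.execute("SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")
            for (table,) in cur.fetchall():
                count = conn.execute('SELECT COUNT(*) FROM "%s"' % table).fetchone()[0]
                print("%-30s %d rows" % (table, count))
        finally:
            conn.close()

    summarise("2041457d5fe04d39d0ab481178355df6781e6858")   # Calendar SQLite backup file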
Table 3. Backup files and their relative applications

Backup                  File Type   Backup File
Keychain                PList       51a4616e576dd33cd2abadfea874eb8ff246bf0e
WiFi Connections        PList       3ea0280c7d1bd352397fc658b2328a7f3b1243b
Dynamic Dictionary      Text        0b68edc697a550c9b977b77cd012fa9a0557dfcb
iBooks                  SQLite      1c955dc79317ef8679a99f24697666bf9e4a9de6
Twitter                 PList       1ae8b59701a8ef5a48a50add6818148c9cbed566
                                    2ba214fcde984bbc5ea2b48acd151b61b651c1c8
                                    4eb24cb35ff5209d08774a6d2db4c98a0449a9ff
                                    7a0c2551ecd6f950316f55d0591f8b4922910721
                                    71127e4db8d3d73d8871473f09c16695a1a2532c
                                    c5dc95a1b0c31173c455605e58bffcca83d8b864
LinkedIn                PList       9c404eb0aa691005cdbd1e97ca74685c334f3635
Bookmarks               SQLite      d1f062e2da26192a6625d968274bfda8d07821e4
Notes                   SQLite      ca3bc056d4da0bbf88b5fb3be254f3b7147e639c
                        PList       52c03edfc4da9eba398684afb69ba503a2709667
Calendar                SQLite      2041457d5fe04d39d0ab481178355df6781e6858
                        PList       d351344f01cbe4900c9e981d1fb7ea5614e7c2e5
Google+, Mail           PList and   ade0340f576ee14793c607073bd7e8e409af07a8
Accounts, Airplane      SQLite      51fca3a3004e8f8e08f37a0a5ac3d7512274ee24
Mode, Browser History,              5f0a990d1c729a8b20627e18929960fc94f3ee6b
Keyboard options,                   5fd03a33c2a31106503589573045150c740721dd
Last Sleep, Audio                   06af93e6265bf32205de534582c3e8b8b3b5ee9e
Player details, Safari              19cb8d89a179d13e45320f9f3965c7ea7454b10d
Active, Documents,                  36eb88809db6179b2fda77099cefce12792f0889
Mailbox details,                    06642c767b9d974f12af8d72212b766709ba08fe
Photos, Dropbox, Ocsp               59445c4fae86445d6326f08d3c3bcf7b60ac54d3
                                    9281049ff1d27f1129c0bd17a95c863350e6f5a2
                                    a2d4e045244686a6366a266ba9f7ef88e747fe4b
                                    bedec6d42efe57123676bfa31e98ab68b713195f
                                    d00b419a7c5cbffd19f429d69aff468291d53b00
                                    f936b60c64de096db559922b70a23faa8db75dbd
4.3. Challenges
Generally, investigators may face challenges no matter what device is being examined. However,
since criminals continue to develop novel methods to cover their tracks, digital forensics has to
constantly evolve with many technologies being introduced to defend against those harmful attacks
(Husain, Baggili, & Sridhar, 2011).
A challenge that can be faced in a case is a locked iPad device. Since the passcode is user-defined, no
default codes are provided in the user manual, and an investigator has only a limited number of
attempts at different passwords. As a security precaution offered by Apple on all iDevices, multiple
unsuccessful attempts may lead to erasing all data on the device.
Examiners should consider the following when a locked iDevice is being investigated (Apple,
September 22, 2011):
- Repeatedly entering the wrong password on the iDevice will lead to it being disabled for a certain period of time, starting with 1 minute; the more unsuccessful attempts, the longer the disabled interval will be.
- Too many unsuccessful attempts will prevent access to the iDevice until it is connected to the computer it was originally synchronized with.
- The user may have configured the device to erase itself after a certain number of consecutive unsuccessful attempts.
There are many ways to obtain the PIN of a locked device during an investigation (Punja & Mislan,
2008):
- Using common PIN numbers, such as the last 4 digits of the owner's phone number, the year of birth, or any other related PIN that is found during the investigation.
- Obtaining it directly from the owner.
- A brute force attack (which may be inapplicable when only a limited number of attempts is available).
A few software vendors now claim to provide features that take the investigation further by decrypting
the captured forensic images and obtaining the passcode through brute force within 20-40 minutes,
depending on the passcode complexity. Katana Forensics provides a complimentary application to
their original Lantern program that images iOS devices. Since this software was released after the
completion of this experiment, the newly added features were not tested.
Another software, Elcomsoft iOS Forensic Toolkit, provides similar features to Lantern Lite. The
former has iOS 4.x and iOS 5 Passcode Recovery. This forensic tool also performs a brute-force
attack, even for complex passcodes. It also obtains the encryption keys from the iDevice.
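As a rough sanity check on the quoted time window, the arithmetic below estimates the brute-force time for a simple 4-digit passcode under an assumed per-attempt cost; both figures are illustrative and are not vendor specifications.

    # Rough estimate of brute-force time for a 4-digit passcode. The per-attempt
    # cost is an illustrative assumption, not a vendor specification.
    passcode_space = 10 ** 4          # all 4-digit PINs
    seconds_per_attempt = 0.15        # assumed on-device attempt cost

    worst_case_minutes = passcode_space * seconds_per_attempt / 60.0
    average_minutes = worst_case_minutes / 2.0
    print("worst case: %.1f minutes, average: %.1f minutes"
          % (worst_case_minutes, average_minutes))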
Another challenge is the differences in the iPhone backup structures depending on the installed iOS
version. Since the early versions of the Apple firmware, there have been changes in the backup formats
and structures: the format moved from .mdbackup files to .mddata and .mdinfo files with the iPhone
3G and iOS 3.0, where the former file type contained data related to the phone and the latter held the
respective metadata (Morrissey, 2010).
Once iOS 4.0 was released, a new backup structure was introduced. The hashed filenames still held
the data but without the .mddata and .mdinfo extensions (Morrissey, 2010). Instead, everything was
stored in manifest.mbdb. This database contains the full path and filenames, and could be viewed
using TextEdit (Hoog & Strzempka, 2011).
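The two backup generations can therefore be told apart quickly from the files present in a backup folder, as sketched below: iOS 3-era backups contain .mddata/.mdinfo pairs, while iOS 4-era backups carry a Manifest.mbdb index. The folder path is a hypothetical example.

    # Sketch of distinguishing the two backup generations described above by the
    # files present in the backup folder. The folder path is a hypothetical example.
    import os

    def backup_generation(backup_dir):
        names = os.listdir(backup_dir)
        if "Manifest.mbdb" in names:
            return "iOS 4-style backup (hashed file names indexed by Manifest.mbdb)"
        if any(n.endswith(".mddata") for n in names):
            return "iOS 3-style backup (.mddata/.mdinfo files)"
        return "unrecognised backup layout"

    print(backup_generation("/Users/examiner/Library/Application Support/MobileSync/Backup/<UDID>"))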
5. SUMMARY OF FINDINGS
We share below the summary of our findings from performing this research. A comparative summary
is also provided in Table 4.
Katana Forensic Lantern:
- Provides detailed, neat, and well organized results.
- Exports the output into an html file that is very easy to scroll and navigate through.
- Makes it easy to export photos and WiFi connections data into a .kmz file that plots their locations on the map using Google Earth.
- Limits previewing of data related to third party applications in the main results window, but it is possible to find third party details manually in the reports exported from the tool itself.
- Requires third party applications to view PList and SQLite files.
iPhone Backup Extractor:
- Organizes the backup folder into sub folders, separating third party applications from the iOS files, which contain most of the iDevice related data.
- Some files found using this tool, such as TrustStore.sqlite3, were not captured through the manual examination of the backup folder.
- The output does not map the extracted files back to their related files in the backup folder and their related SHA-1 file names.
- It is not considered to be a forensics tool.
Manual Examination:
- Provides a lot of data, which can be considered both good and bad: good because no logical backup data is missed if the backup is fully examined, and bad because the data is messy, randomly spread, and cumbersome to examine manually, since the backup folders contain thousands of files.
- Provides more digital evidence from third party applications installed on the device, such as Twitter, Skype and LinkedIn. Evidence includes user interactions with the application, such as chat details, private messages sent/received, usernames, and the date/time stamps associated with this digital evidence.
- Some of the databases found through manual examination contained information synchronized between the user's handheld mobile phone and the email address set up on the iDevice. This can be of great relevance to certain investigations.
- If evidence from certain installed applications is needed for an investigation, then this is the most appropriate method for an investigator to use.
Differences between iOS 3 and iOS 4:
Previously conducted research on the iOS backup folder related the backup files to certain file names
expressed as SHA-1 hash values. It showed that those SHA-1 values were constant for certain
applications in any backup folder, which made it easier to fetch the desired data directly from the
backup folder without the need to go through each and every file.
This research revealed some changes to the iOS backup structure and illustrated how some of those
SHA-1 values have changed across some applications, as shown in Table 5.
Table 4. Comparison between the tools

                                   Katana Forensic        iPhone Backup        Manual
                                   Lantern                Extractor            Examination
Cost                               Government Rate        Free Edition;        Free
                                   $599.00; Corporate     Registered $34.95
                                   Rate $699.00
Data Acquisition
  Logical Acquisition              Yes                    Yes                  Yes
  Physical Acquisition             *Released Post         No                   No
                                   Experiment
Analysis
  Automated                        Yes                    No                   No
  Built-in Image Viewer            Yes                    No                   No
  Built-in Text Viewer             Yes                    No                   No
  Deleted Data Recovery            No                     No                   No
  File Sorting                     Yes                    Partial              Yes
  Third-party applications data    Partial                No                   Yes
  (e.g. Twitter, LinkedIn,
  Skype, etc.)
  Report Export (HTML)             Yes                    No                   No
Table 5. iOS 4 backup files compared to iOS 3 backup files

Backup     File Type  iOS 4.10.01                                iOS 3.1.2                                         Match?
Notes      SQLite     ca3bc056d4da0bbf88b5fb3be254f3b7147e639c   ca3bc056d4da0bbf88b5fb3be254f3b7147e639c          Yes
Notes      PList      52c03edfc4da9eba398684afb69ba503a2709667   N/A                                               N/A
Calendar   SQLite     2041457d5fe04d39d0ab481178355df6781e6858   2041457d5fe04d39d0ab481178355df6781e6858.mddata   Yes
Calendar   PList      d351344f01cbe4900c9e981d1fb7ea5614e7c2e5   N/A                                               N/A
Bookmarks  SQLite     d1f062e2da26192a6625d968274bfda8d07821e4   04cc352fd9943f7c0c3f0781d4834fa137775761.mddata   No
Bookmarks  PList      5d04e8e7049cdf56df0bf824820cddb1db08a8e2   N/A                                               N/A
History    SQLite     01e25a02a0fccbd18df66862ad23a2784b0fe534   1d6740792a2b845f4c1e6220c43906d7f0afe8ab.mddata   No
History    PList      bdadf7ce9d86a2b33c9ba6537311538f48e03996   N/A                                               N/A
6. CONCLUSION
In sum, automated tools provide instant and visual evidentiary data from investigated iPads. Each
automated tool has different capabilities; therefore different sets of results can be obtained from each
tool. However, the manual examination provides more thorough and detailed evidentiary data from
investigated iPads. This research points out that forensic software developers need to advance the
capabilities of their automated tools. A robust technology that is capable of easing the manual
examination process is needed. Overall, the authors propose that if a quick triage of an iOS device is
needed, then an automated tool would suffice. However, if a more thorough examination is needed,
especially if unique installed application data is needed, then the manual examination process should
be utilized.
7. FUTURE WORK
On April 12, 2011, AccessData released the “Mobile Phone Examiner Plus (MPE +),” which is a
standalone cell phone forensics application (AcessData, April 12, 2011). The main feature of this
software solution is its ability to obtain the physical image of the iPad, iPhone and iPod Touch without
jailbreaking the iOS device (AccessData, 2011). Another software application that supports the
physical acquisition of iOS devices is the Lantern Light Physical Imager, which was due to be released
in mid October 2011 (KatanaForensics, 2011a). These tools should be tested in order to assure that
they are operating as required (Baggili, Mislan, & Rogers, 2007). Further studies should attempt to
examine the error rates in the forensic tools used when examining iOS devices. Lastly, this research
points to the direction of creating an extensible software framework for examining digital evidence on
iOS devices. This framework should allow the forensic software to continuously be updated with
backup file signatures depending on the version of the iOS. It should also allow the software to update
signatures of backup files for newly released iOS applications.
REFERENCES
AccessData. (2011). MPE+ MOBILE FORENSICS SOFTWARE SUPPORTS 3500+ PHONES,
INCLUDING IPHONE®, IPAD®, ANDROID™ AND BLACKBERRY® DEVICES, from
http://accessdata.com/products/computer-forensics/mobile-phone-examiner
AcessData. (April 12, 2011). ACCESSDATA RELEASES MOBILE PHONE EXAMINER PLUS 4.2
with PHYSICAL IMAGING SUPPORT FOR IPHONE®, IPAD® AND IPOD TOUCH® DEVICES.
Apple. (2011a). Apple Launches iPad 2, from http://www.apple.com/pr/library/2011/03/02Apple-Launches-iPad-2.html
Apple. (2011b). iPad Technical Specification from http://www.apple.com/ipad/specs/
Apple. (2011c). iTunes, from http://www.apple.com/itunes/
Apple. (September 22, 2011). iOS: Wrong passcode results in red disabled screen, from
http://support.apple.com/kb/ht1212
Bader, M., & Baggili, I. (2010). iPhone 3GS Forensics: Logical analysis using Apple iTunes Backup
Utility. SMALL SCALE DIGITAL DEVICE FORENSICS JOURNAL, 4(1).
Baggili, I. M., Mislan, R., & Rogers, M. (2007). Mobile Phone Forensics Tool Testing: A Database
Driven Approach. International Journal of Digital Evidence, 6(2).
Cellebrite. (July 3, 2011). Cellebrite Physical Extraction Manual for iPhone & iPad.
Elmer-DeWitt, P. (March 13, 2011). Piper Jaffray: iPad 2 totally sold out, 70% to new buyers, Cable
News Network. Retrieved from http://tech.fortune.cnn.com/2011/03/13/piper-jaffray-ipad-2-totally-sold-out-70-to-new-buyers/
Gómez-Miralles, L., & Arnedo-Moreno, J. (2011). Universal, fast method for iPad forensics imaging
via USB adapter. Paper presented at the 2011 Fifth International Conference on Innovative Mobile
and Internet Services in Ubiquitous Computing, Korea
Geyer, M., & Felske, F. (2011). Consumer toy or corporate tool: the iPad enters the workplace.
interactions, 18(4), 45-49. doi: 10.1145/1978822.1978832
GitHub. (2011). mdbackup-organiser, from https://github.com/echoz/mdbackup-organiser
Goldberg, K. H., & Castro, E. (2008). XML: Peachpit Press.
Halliday, J. (March 2, 2011). iPad 2 launch: live coverage of Apple's announcement. Retrieved from
http://www.guardian.co.uk/technology/2011/mar/02/ipad-2-launch-apple-announcement-live
Hoog, A., & Strzempka, K. (2011). IPhone and IOS Forensics: Investigation, Analysis and Mobile
Security for Apple IPhone, IPad and IOS Devices: Elsevier Science.
Husain, M. I., Baggili, I., & Sridhar, R. (2011). A Simple Cost-Effective Framework for iPhone
Forensic Analysis. Paper presented at the Digital Forensics and Cyber Crime: Second International
ICST Conference (October 4-6, 2010), Abu Dhabi, United Arab Emirates.
http://books.google.ae/books?id=fiDXuEHFLhQC
KatanaForensics. (2011a). Lantern Light, from http://katanaforensics.com/forensics/lantern-lite/
KatanaForensics. (2011b). Lantern Version 2.0, from http://katanaforensics.com/forensics/lantern-v20/
KrollOntrack. (2011). Tablet Forensics – A Look at the Apple iPad®, from
http://www.krollontrack.com/resource-library/legal-articles/imi/tablet-forensics-a-look-at-the-apple-ipad/?utm_source=Newsletter&utm_medium=Email&utm_campaign=IMISept2011&utm_content=image
LeMere, B. (2010). Logical Backup Method, from
http://dev.iosforensics.org/acquisition/acquisition_logical_method.html
Ley, S. (2011). Processing iPhone / iPod Touch Backup Files on a Computer from
http://www.appleexaminer.com/iPhoneiPad/iPhoneBackup/iPhoneBackup.html
Miles, S. (January 27, 2010). Apple tablet unveiled as the iPad, Pocket-lint Retrieved from
http://www.pocket-lint.com/news/31084/steve-jobs-launches-ipad-apple-tablet
Morrissey, S. (2010). IOS Forensic Analysis: for IPhone, IPad and IPod Touch: Apress.
NIST. (2003). CFTT Methodology Overview, from
http://www.cftt.nist.gov/Methodology_Overview.htm
Pilley, J. (October 8, 2010). iPad and Smartphone Digital Forensics, from
http://www.articleslash.net/Computers-and-Technology/Computer-Forensics/578641__iPad-and-Smartphone-Digital-Forensics.html
Punja, S. G., & Mislan, R. P. (2008). Mobile Device Analysis. SMALL SCALE DIGITAL DEVICE
FORENSICS JOURNAL, 2(1).
ReincubateLtd. (2011). iPhone Backup Extractor, from www.iphonebackupextractor.com
ROYVANRIJN. (April 21, 2011). Reading iPhone GPS data from backup (with Java), from
http://www.redcode.nl/blog/2011/04/reading-iphone-gps-data-from-backup-with-java/
SearchSecurity.com. (2011). certificate authority (CA), from
http://searchsecurity.techtarget.com/definition/certificate-authority
MULTI-PARAMETER SENSITIVITY ANALYSIS OF A
BAYESIAN NETWORK FROM A DIGITAL FORENSIC
INVESTIGATION
Richard E. Overill
Department of Informatics
King’s College London
Strand, London WC2R 2LS, UK
+442078482833
+442078482588
[email protected]
Echo P. Zhang
Department of Computer Science
University of Hong Kong
Pokfulam Road, Hong Kong
+85222417525
+85225998477
[email protected]
Kam-Pui Chow
Department of Computer Science
University of Hong Kong
Pokfulam Road, Hong Kong
+85228592191
+85225998477
[email protected]
ABSTRACT
A multi-parameter sensitivity analysis of a Bayesian network (BN) used in the digital forensic
investigation of the Yahoo! email case has been performed using the principle of ‘steepest gradient’ in
the parameter space of the conditional probabilities. This procedure delivers a more reliable result for
the dependence of the posterior probability of the BN on the values used to populate the conditional
probability tables (CPTs) of the BN. As such, this work extends our previous studies using single-parameter sensitivity analyses of BNs, with the overall aim of more deeply understanding the
indicative use of BNs within the digital forensic and prosecutorial processes. In particular, we find that
while our previous conclusions regarding the Yahoo! email case are generally validated by the results
of the multi-parameter sensitivity analysis, the utility of performing the latter analysis as a means of
verifying the structure and form adopted for the Bayesian network should not be underestimated.
Keywords: Bayesian network; digital forensics; multi-parameter sensitivity analysis; steepest
gradient.
1. INTRODUCTION
The use of Bayesian networks (BNs) to assist in digital forensic investigations of e-crimes is
continuing to increase [Kwan, 2008; Kwan, 2010; Kwan, 2011] since they offer a valuable means of
reasoning about the relationship between the recovered (or expected) digital evidential traces and the
forensic sub-hypotheses that explain how the suspected e-crime was committed [Kwan, 2008].
One of the principal difficulties encountered in constructing BNs is to know what conditional
probability values are appropriate for populating the conditional probability tables (CPTs) that are
found at each node of the BN. These values can be estimated by means of a survey questionnaire of a
group of experienced experts [Kwan, 2008], but as such they are always open to challenge. There are
also a number of alternative methods for estimating CPT values, including reasoning from historical
databases, but none are entirely exempt from the potential criticism of lacking quantitative rigour. In
these circumstances it is important to know how sensitive the posterior output of the BN is to the
numerical values of these parameters. If the degree of sensitivity can be shown to be low then the
precise values adopted for these parameters is not critical to the stability of the posterior output of the
BN.
In this paper we generalize the concept of sensitivity value introduced by Renooij and van der Gaag
[Renooij, 2004] to a multi-parameter space and adapt the concept of steepest gradient from the domain
of numerical optimization to define a local multi-parameter sensitivity value in the region of the
chosen parameter set. This metric defines the steepest gradient at the chosen point in the parameter
space, which is a direct measure of the local multi-parameter sensitivity of the BN. It should be
emphasised that in this work we are not aiming to optimize either the conditional probabilities or the
posterior output of the BN as was the case with the multi-parameter optimization scheme of Chan and
Darwiche [Chan, 2004]; our objective here is to measure the stability of the latter with respect to
simultaneous small variations of the former by determining the steepest local gradient in the multi-parameter space.
2. SENSITIVITY ANALYSIS
A number of types of BN sensitivity analysis have been proposed. The most straightforward, although
tedious, is the direct manipulation method which involves the iterative variation of one parameter at a
time, keeping all the others fixed. This single-parameter approach was used to demonstrate the low
sensitivity of the BitTorrent BN in [Overill, 2010]. Three somewhat more sophisticated approaches to
single-parameter sensitivity analysis, namely, bounding sensitivity analysis, sensitivity value analysis
and vertex likelihood analysis were proposed in [Renooij, 2004], and were each applied to the Yahoo!
Email BN to demonstrate its low sensitivity in [Kwan, 2011]. However, a valid criticism of each of
these single-parameter approaches is that the effect of simultaneous variation of the parameters is not
considered. The multi-parameter sensitivity analysis scheme proposed in [Chan, 2004] requires the determination of kth-order partial derivatives in order to perform a k-way sensitivity analysis on a BN of tree-width w with n parameters, with an associated computational complexity that grows rapidly with k, w, and the CPT sizes F(Xi). In order to develop a computationally tractable approach
to local multi-parameter BN sensitivity analysis, we have generalized the original concept of the
single-parameter sensitivity value [Renooij, 2004] to multi-parameter space and then applied the
steepest gradient approach from numerical optimization to produce a metric for the local (or
instantaneous) multi-parameter sensitivity value at the selected point in parameter space.
3. YAHOO! CASE (HONG KONG)
3.1 Background of Yahoo! Case (Hong Kong)
On April 20, 2004, the Chinese journalist Shi Tao used his private Yahoo! email account to send a brief of the Number 11 document, which had been released by the Chinese government that day, to an overseas web site called the Asia Democracy Foundation. When the Chinese government discovered this, the Beijing State Security Bureau requested that the e-mail service provider, Yahoo! Holdings (Hong Kong), provide details of the sender's personal information, such as identifying information, login times, and e-mail contents. According to Article 45 of the PRC Criminal Procedure Law ("Article 45"), Yahoo! Holdings (Hong Kong) was legally obliged to comply with the demand. Mr. Shi was accused of the crime of "providing state secrets to foreign entities." Following the investigation, Mr. Shi was convicted and sentenced to ten years in prison in 2005 [Case No. 29, 2005].
As mentioned in [Cap.486, 2007]:
“In the verdict (the “Verdict”) delivered by the People’s Court on 27 April 2005, it stated that Mr.
Shi had on 20 April 2004 at approximately 11:32 p.m. leaked information to “an overseas hostile
element, taking advantage of the fact that he was working overtime alone in his office to connect
to the internet through his phone line and used his personal email account ([email protected]) to send his notes. He also used the alias ‘198964’ as the name of the
provider …”. The Verdict reported the evidence gathered to prove the commission of the offence
which included the following: “Account holder information furnished by Yahoo! Holdings (Hong
Kong) Ltd., which confirms that for IP address 218.76.8.21 at 11:32:17 p.m. on April 20, 2004,
the corresponding user information was as follows: user telephone number: 0731-4376362 located
at the Contemporary Business News office in Hunan; address: 2F, Building 88, Jianxing New
Village, Kaifu District, Changsha.”
3.2 Digital Evidence in Yahoo! Case (Hong Kong)
From the above verdict, we can see that the information provided by Yahoo! Holdings (Hong Kong) Ltd. was admitted as digital evidence in court. Since we utilize a Bayesian network as our analysis model, we have to construct a hypothesis-evidence system. In this case, the main (or root) hypothesis
is:
Hypothesis H0: “The seized computer has been used to send the material document as an
email attachment via a Yahoo! Web mail account”
Under the main hypothesis, we have six sub-hypotheses and their corresponding evidential traces, which are listed below:
Table 1 Sub-hypothesis H1: Linkage between the material document and the suspect's computer.
ID | Description | Evidence Type
DE1 | The subject document exists in the computer | Digital
DE2 | The "Last Access Time" of the subject file lags behind the IP address assignment time by the ISP | Digital
DE3 | The "Last Access Time" of the subject file lags behind or is close to the sent time of the Yahoo! email | Digital

Table 2 Sub-hypothesis H2: Linkage between the suspect and his computer.
ID | Description | Evidence Type
PE1 | The suspect was in physical possession of the computer | Physical
DE4 | Files in the computer reveal the identity of the suspect | Digital

Table 3 Sub-hypothesis H3: Linkage between the suspect and the ISP
ID | Description | Evidence Type
DE5 | The ISP subscription details (including the assigned IP address) match the suspect's particulars | Digital

Table 4 Sub-hypothesis H4: Linkage between the suspect and Yahoo! email account
ID | Description | Evidence Type
DE6 | The subscription details (including the IP address that sent the email) of the Yahoo! email account match the suspect's particulars | Digital

Table 5 Sub-hypothesis H5: Linkage between the computer and the ISP
ID | Description | Evidence Type
DE7 | Configuration setting of the ISP Internet account is found in the computer | Digital
DE8 | Log data confirms that the computer was powered up at the time when the email was sent | Digital
DE9 | Web browsing program (e.g. Internet Explorer) or email user agent program (e.g. Outlook) was active at the time the email was sent | Digital
DE10 | Log data reveals the assigned IP address and the assignment time by the ISP to the computer | Digital
DE11 | The ISP confirms the assignment of the IP address to the suspect's account | Digital

Table 6 Sub-hypothesis H6: Linkage between the computer and Yahoo! email account
ID | Description | Evidence Type
DE12 | Internet history logs reveal the access of the Yahoo! email account by the computer | Digital
DE13 | Internet cached files reveal that the subject document has been sent as an attachment via the Yahoo! email account | Digital
DE14 | Yahoo! confirms the IP address of the Yahoo! email with the attached document | Digital
3.3 CPT values for Sub-Hypothesis and Evidence in Yahoo! Case (Hong Kong)
Before we set up the Bayesian network, we have to obtain the CPT values for the sub-hypotheses and evidence. In this study, all of the probability values are assigned as subjective beliefs based on expert professional opinion and experience in digital forensic analysis [Kwan, 2011]. From Table 7, we can see that there are two states – "yes" and "no" – for hypothesis H0, and likewise for sub-hypotheses H1 to H6. For the CPT values between sub-hypotheses and evidence, there are still two states for each sub-hypothesis – "yes" and "no" – but three states for each evidential trace – "yes", "no" and "uncertain". The state "uncertain" means that the evidence cannot be concluded to be either positive ("yes") or negative ("no") after examination.
Table 7 Likelihood values for sub-hypotheses H1 to H6 given hypothesis H
State | H1, H5, H6 = "yes" | H1, H5, H6 = "no" | H2, H3, H4 = "yes" | H2, H3, H4 = "no"
H = "yes" | 0.65 | 0.35 | 0.8 | 0.2
H = "no" | 0.35 | 0.65 | 0.2 | 0.8
Table 8 Conditional probability values for DE1 to DE3 given sub-hypothesis H1
State | DE1 = yes | DE1 = no | DE1 = u | DE2, DE3 = yes | DE2, DE3 = no | DE2, DE3 = u
H1 = yes | 0.85 | 0.15 | 0 | 0.8 | 0.15 | 0.05
H1 = no | 0.15 | 0.85 | 0 | 0.15 | 0.8 | 0.05

Table 9 Conditional probability values for DE4 given sub-hypothesis H2
State | DE4 = yes | DE4 = no | DE4 = u
H2 = yes | 0.75 | 0.2 | 0.05
H2 = no | 0.2 | 0.75 | 0.05
Table 10 Conditional probability values for DE5 given sub-hypothesis H3
State | DE5 = yes | DE5 = no | DE5 = u
H3 = yes | 0.7 | 0.25 | 0.05
H3 = no | 0.25 | 0.7 | 0.05

Table 11 Conditional probability values for DE6 given sub-hypothesis H4
State | DE6 = yes | DE6 = no | DE6 = u
H4 = yes | 0.1 | 0.85 | 0.05
H4 = no | 0.05 | 0.9 | 0.05
Table 12 Conditional probability values for DE7 to DE11 given sub-hypothesis H5
State | DE7, DE8, DE10 = yes | DE7, DE8, DE10 = no | DE7, DE8, DE10 = u | DE9, DE11 = yes | DE9, DE11 = no | DE9, DE11 = u
H5 = yes | 0.7 | 0.25 | 0.05 | 0.8 | 0.15 | 0.05
H5 = no | 0.25 | 0.7 | 0.05 | 0.15 | 0.8 | 0.05
Table 13 Conditional probability values for DE12 to DE14 given sub-hypothesis H6
State | DE12, DE13 = yes | DE12, DE13 = no | DE12, DE13 = u | DE14 = yes | DE14 = no | DE14 = u
H6 = yes | 0.7 | 0.25 | 0.05 | 0.8 | 0.15 | 0.05
H6 = no | 0.25 | 0.7 | 0.05 | 0.15 | 0.8 | 0.05
4. LOCAL MULTI-PARAMETER SENSITIVITY VALUE
4.1 Conditional Independence
Before we proceed with the multi-parameter sensitivity value, we have to discuss the relationship between each sub-hypothesis and its set of evidential traces. In the Yahoo! case, the connections between a sub-hypothesis and its evidence belong to the class of diverging connections in a Bayesian network [Taroni, 2006] (see Figure 1). In a diverging connection model, E1 and E2 are conditionally independent given H. This means that: (1) with the knowledge of H, the state of E1 does not change the belief about the possible states of E2, i.e. P(E2 | H, E1) = P(E2 | H); (2) without the knowledge of H, the state of E1 provides information about the possible states of E2, i.e. in general P(E2 | E1) ≠ P(E2).
Figure 1: Diverging Connection Bayesian Network (parent node H with child nodes E1 and E2)
In general, we cannot conclude that E1 and E2 are also conditionally independent given ¬H. However, for some special cases, if there are only two possible (mutually complementary) values for H, H = {H1, H2}, then given "H = H1", E1 and E2 are conditionally independent, while, given "H = H2", E1 and E2 are also conditionally independent. Hence we can conclude that E1 and E2 are also conditionally independent given ¬H.
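To make these two properties concrete, the short Python sketch below (our illustration, not part of the original analysis; the probability values are arbitrary) builds the joint distribution of a diverging connection and checks both statements numerically.

```python
# Diverging connection H -> E1 and H -> E2 with binary nodes.
# Illustrative probabilities only; these are not the Yahoo! BN values.
p_h = 0.5                                  # P(H = yes)
p_e_given_h = {True: 0.9, False: 0.2}      # P(Ei = yes | H) for both E1 and E2

def joint(h, e1, e2):
    """P(H=h, E1=e1, E2=e2) under the diverging-connection factorisation."""
    ph = p_h if h else 1 - p_h
    pe1 = p_e_given_h[h] if e1 else 1 - p_e_given_h[h]
    pe2 = p_e_given_h[h] if e2 else 1 - p_e_given_h[h]
    return ph * pe1 * pe2

def prob(event):
    """Probability of an event given as a predicate over (h, e1, e2)."""
    return sum(joint(h, e1, e2)
               for h in (True, False)
               for e1 in (True, False)
               for e2 in (True, False)
               if event(h, e1, e2))

# (1) Given H, knowing E1 does not change the belief about E2:
p_e2_given_h_e1 = prob(lambda h, e1, e2: h and e1 and e2) / prob(lambda h, e1, e2: h and e1)
p_e2_given_h = prob(lambda h, e1, e2: h and e2) / prob(lambda h, e1, e2: h)
print(p_e2_given_h_e1, p_e2_given_h)       # both 0.9

# (2) Without H, E1 does carry information about E2:
p_e2_given_e1 = prob(lambda h, e1, e2: e1 and e2) / prob(lambda h, e1, e2: e1)
p_e2 = prob(lambda h, e1, e2: e2)
print(p_e2_given_e1, p_e2)                 # about 0.77 versus 0.55 -- not equal
```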
4.2 Multi-Parameter Sensitivity Value
Under the standard assumption of proportional co-variation [Renooij, 2004] the theorems characterizing the algebraic structure of marginal probabilities [Castillo, 1997] permit the single-parameter sensitivity function of [Renooij, 2004] to be generalized to the multi-parameter case as follows.

If the parameters are given by x = (x1, …, xn) then the sensitivity function F is given by the multi-linear quotient:

F(x) = N(x) / D(x)    (1)

where N and D are multi-linear functions of the parameters x1, …, xn. The components of the gradient vector ∇F(x) are given by:

∂F/∂xi = [D(x) ∂N/∂xi − N(x) ∂D/∂xi] / D(x)²    (2)

The multi-parameter sensitivity value is the value of the steepest gradient at the point x and is given by the Euclidean norm of the gradient vector ∇F(x):

S(x) = ||∇F(x)|| = √( Σi (∂F/∂xi)² )    (3)

This last result follows from the first order Taylor expansion of F:

F(x + δx) ≈ F(x) + ∇F(x)·δx    (4)

since, for a perturbation δx of fixed magnitude, the change in F is greatest when δx lies along ∇F(x).
In principle the multi-linear forms in the numerator and the denominator of F are both of order n and contain 2^n terms, leading to expressions for each ∂F/∂xi containing 2^(2n) terms in both numerator and denominator. However, this analysis does not take into account the conditional dependencies implied by the structure of the BN. If the BN can be represented as a set of m sub-hypotheses Hj (j = 1, …, m), each of which is conditionally dependent on a disjoint subset of the evidential traces containing nj of the n traces, so that the subsets together exhaust the evidence, then the total number of parameters is given by n = n1 + n2 + … + nm, but each Hj is conditionally dependent only upon its own subset of nj evidential traces. This is known
as the local Markov property of the BN. Although the conditional probabilities associated with sub-hypothesis Hj will influence those associated with sub-hypothesis Hk (j ≠ k) via the process of Bayesian inference propagation through the network [Kwan, 2008], to a reasonable first approximation the sensitivity values for each sub-hypothesis may be evaluated by disregarding these indirect effects. Then, for the sub-hypotheses, the sensitivity function F is a multi-linear quotient of order m, while for the evidential traces associated with sub-hypothesis Hj, F is a multi-linear quotient of order nj. The Markov factorisation of the BN thus substantially reduces the number of terms involved in the expressions for F and ∂F/∂xi. Nevertheless a symbolic algebraic manipulation program, such as MatLab [MathWorks, 2011], is required to perform the differentiations and computations reliably for a BN representing a real-world situation such as the Yahoo! email case.

In order to derive the coefficients of the multinomial quotient for F we proceed as follows. Bayes' formula for the likelihood conditional probability is:

P(H | E) = P(E | H) P(H) / P(E)    (5)
Considering the conjunctive combination of the evidential traces {E1, E2, …, En}, (5) transforms into:

P(H | E1, E2, …, En) = P(E1, E2, …, En | H) P(H) / P(E1, E2, …, En)    (6)

Then we obtain the multi-parameter posterior probability as:

P(H | E1, …, En) = P(E1, …, En | H) P(H) / [P(E1, …, En | H) P(H) + P(E1, …, En | ¬H) P(¬H)]    (7)

Here,

P(E1, …, En) = P(E1, …, En | H) P(H) + P(E1, …, En | ¬H) P(¬H)    (8)

As mentioned in Section 4.1, given H, E1, E2, …, En are conditionally independent of each other. Therefore,

P(E1, …, En | H) = P(E1 | H) P(E2 | H) … P(En | H)    (9)

According to the definition of conditional independence, we cannot in general assume that E1, E2, …, En are conditionally independent of each other under ¬H. However, in the specific Yahoo! email case, there are only two possible (mutually complementary) states for each hypothesis. Therefore, we also have:

P(E1, …, En | ¬H) = P(E1 | ¬H) P(E2 | ¬H) … P(En | ¬H)    (10)
Denoting P(E1 | H), P(E2 | H), …, P(En | H) by x1, x2, …, xn, and P(E1 | ¬H), P(E2 | ¬H), …, P(En | ¬H) by c1, c2, …, cn, we obtain the multi-parameter sensitivity function as:

F(x) = P(H) x1 x2 … xn / [P(H) x1 x2 … xn + P(¬H) c1 c2 … cn]    (11)

As an un-biased pre-condition, it is usual to take the prior probabilities as P(H) = P(¬H) = 0.5, so (11) simplifies to:

F(x) = x1 x2 … xn / (x1 x2 … xn + c1 c2 … cn)

When finding the sensitivity value of the posterior of the root node H0, it is necessary to use a variant of (3), namely, the weighted Euclidean norm of the sensitivity values of the individual sub-hypotheses:

S(H0) = √( Σj wj Sj² )    (12)

where wj is the multi-parameter sensitivity value of sub-hypothesis Hj as given by (3), acting as the weight of component j, and Sj is the component of the multi-parameter sensitivity of H0 associated with sub-hypothesis Hj, reflecting the different contributions of the individual sub-hypotheses to the posterior of H0.
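As an illustration of the mechanics just described (our own sketch in Python, not the authors' MatLab implementation), the functions below evaluate the simplified sensitivity function F(x) = x1…xn / (x1…xn + c1…cn), its analytic gradient, and the Euclidean norm of equation (3) for one sub-hypothesis. The sketch is not calibrated to reproduce the component values reported in Table 14, which depend on the precise co-variation scheme used by the authors; it simply shows how a local multi-parameter sensitivity value can be computed.

```python
import math

def sensitivity_function(x, c):
    """F(x) = prod(x) / (prod(x) + prod(c)), i.e. equation (11) with uniform priors,
    where x[i] = P(Ei = observed state | H = yes) and c[i] = P(Ei = observed state | H = no)."""
    px, pc = math.prod(x), math.prod(c)
    return px / (px + pc)

def gradient(x, c):
    """Analytic partial derivatives dF/dx_i of the multi-linear quotient."""
    px, pc = math.prod(x), math.prod(c)
    denom = (px + pc) ** 2
    return [(px / xi) * pc / denom for xi in x]

def multi_param_sensitivity(x, c):
    """Euclidean norm of the gradient -- the local steepest gradient of equation (3)."""
    return math.sqrt(sum(g * g for g in gradient(x, c)))

# Example using the H1 likelihoods of Table 8, all traces observed as "yes":
x = [0.85, 0.80, 0.80]   # P(DEi = yes | H1 = yes)
c = [0.15, 0.15, 0.15]   # P(DEi = yes | H1 = no)
print(sensitivity_function(x, c))      # posterior belief in H1, roughly 0.994
print(multi_param_sensitivity(x, c))   # local steepest gradient, roughly 0.013
```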
5. RESULTS AND DISCUSSION
In Table 14 the multi-parameter sensitivity values for each sub-hypothesis of the BN for the Yahoo! email case
[Kwan, 2011] are set out and compared with the corresponding single-parameter sensitivity values
from [Kwan, 2011]. Before proceeding to discuss these results, however, it should be mentioned at
this point that while validating the MatLab code for the multi-parameter sensitivity values, a number
of numerical discrepancies were noted when reproducing the previously reported single-parameter
sensitivity values ([Kwan, 2011], Table 4). All but one of these discrepancies are not numerically
significant in terms of the conclusions drawn; however, in the case of digital evidential trace DE6 the
correct sensitivity value is actually 2.222, not 0.045, which implies that the effect of DE6 on the
posterior of H4 should be significant. This revised result brings the single-parameter sensitivity value
for DE6 into line with the vertex proximity value for DE6 ([Kwan, 2011], Table 5) which indicated that
the posterior of H4 is indeed sensitive to variations in DE6, thereby resolving the apparent
disagreement between the two sensitivity metrics for the case of DE6 noted in [Kwan, 2011]. The
corrected results for the single-parameter sensitivity values are given in Table 14 below. However, a
review of sub-hypothesis H4 reveals that it is not critical to the prosecution case since its associated
evidence DE6 is only weakly tied into the case; the fact that Mr Shi registered with Yahoo! for a
webmail account at some time in the past cannot be assigned a high probative value and this is
reflected in the ‘non-diagonal’ structure of the corresponding CPT (see Table 11), unlike all the other
CPTs in this BN.
It is reasonable to interpret the significance of a sensitivity value by comparing it with unity [Renooij,
2004; Kwan, 2011]; a value below unity implies a lack of sensitivity to small changes in the associated
conditional probability parameters, and vice versa. In other words, a sensitivity value less than unity
implies that the response of the BN is smaller than the applied perturbation. It cannot be assumed that
single-parameter sensitivity values as computed in [Kwan, 2011] are necessarily either smaller or
larger than the corresponding multi-parameter sensitivity values computed here, since the form of the
sensitivity function being used is not identical. In the Yahoo! email case, three of the sub-hypotheses,
namely H2, H3 and H4, are associated with only a single evidential trace so each of their sensitivity
values is unchanged in the multi-parameter analysis. Of the three remaining sub-hypotheses, H1 and H6
both have three associated evidential traces whereas H5 has five.
Table 14: Single- and multi-parameter sensitivity values for H1 – H6 of the Yahoo! email case
Sub-hypothesis | Digital Evidence | Single-parameter Sensitivity value | Component j of Multi-parameter Sensitivity value | Multi-parameter Sensitivity value
H1 | DE1 | 0.1500 | 0.0125 | 0.0225
H1 | DE2 | 0.1662 | 0.0134 |
H1 | DE3 | 0.1662 | 0.0134 |
H2 | DE4 | 0.2216 | 0.2216 | 0.2216
H3 | DE5 | 0.2770 | 0.2770 | 0.2770
H4 | DE6 | 2.2222 | 2.2222 | 2.2222
H5 | DE7 | 0.2770 | 0.0051 | 0.0110
H5 | DE8 | 0.2770 | 0.0051 |
H5 | DE9 | 0.1662 | 0.0045 |
H5 | DE10 | 0.2770 | 0.0051 |
H5 | DE11 | 0.1662 | 0.0045 |
H6 | DE12 | 0.2770 | 0.0565 | 0.0939
H6 | DE13 | 0.2770 | 0.0565 |
H6 | DE14 | 0.1662 | 0.0494 |
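The single-parameter column of Table 14 can be reproduced from the CPTs of Tables 8–13 if one assumes the simple two-node form of the sensitivity function with a uniform prior, F(x) = x / (x + c), whose derivative is c / (x + c)². The sketch below is our own consistency check rather than the authors' code, but it prints exactly the values 0.1500, 0.1662, 0.2216, 0.2770 and 2.2222 reported above.

```python
def single_param_sensitivity(x, c):
    """|dF/dx| for F(x) = x / (x + c) with a uniform prior, where
    x = P(E = yes | H = yes) and c = P(E = yes | H = no)."""
    return c / (x + c) ** 2

# 'yes'-column CPT entries taken from Tables 8-13
likelihoods = {
    "DE1": (0.85, 0.15),
    "DE2, DE3": (0.80, 0.15),
    "DE4": (0.75, 0.20),
    "DE5, DE7, DE8, DE10, DE12, DE13": (0.70, 0.25),
    "DE6": (0.10, 0.05),
    "DE9, DE11, DE14": (0.80, 0.15),
}
for traces, (x, c) in likelihoods.items():
    print(f"{traces}: {single_param_sensitivity(x, c):.4f}")
# Output: 0.1500, 0.1662, 0.2216, 0.2770, 2.2222, 0.1662 -- matching Table 14
```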
From Table 14 it will be noted that, with the exception of the somewhat anomalous case of H4
discussed earlier, the largest multi-parameter sensitivity value is an order of magnitude smaller than
the smallest of the single-parameter sensitivity values reported previously [Kwan, 2011]. This finding
is at first sight somewhat surprising in that it might be expected that permitting many parameters to
vary simultaneously would produce the opportunity for greater sensitivity values to be found.
However, it must be remembered that, unlike the case of classical numerical optimization, the
parameters of the BN are not completely independent due to the conditional independence referred to
in Section 4.1 as well as the interdependence produced by the propagation of belief (posterior
probabilities) through the BN [Kwan, 2008]. In certain circumstances, these forms of co-variance can
result in smaller multi-parameter sensitivity values than might otherwise have been anticipated, as
explained below.
In Table 15 we give the single- and multi-parameter sensitivity values for the root node H0. The multi-parameter sensitivity value is computed using the weighted Euclidean norm given in (12).
Table 15: Single- and multi-parameter sensitivity values for H0 of the Yahoo! email case
Root hypothesis | Sub-hypothesis | Single-parameter Sensitivity value | Component of Multi-parameter Sensitivity value | Weight of Component | Multi-parameter Sensitivity value
H0 | H1 | 0.3500 | 0.0064 | 0.0225 | 0.0052
H0 | H2 | 0.2000 | 0.0030 | 0.2216 |
H0 | H3 | 0.2000 | 0.0030 | 0.2770 |
H0 | H4 | 0.2000 | 0.0030 | 2.2222 |
H0 | H5 | 0.3500 | 0.0037 | 0.0110 |
H0 | H6 | 0.3500 | 0.0037 | 0.0939 |
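As a check on the reconstructed form of equation (12), the following short calculation (ours) combines the 'Component' and 'Weight of Component' columns of Table 15 as a weighted Euclidean norm of the assumed form √(Σj wj Sj²); it returns 0.0052, in agreement with the multi-parameter sensitivity value reported above for H0.

```python
import math

# (component S_j, weight w_j) for sub-hypotheses H1 .. H6, read from Table 15
rows = [
    (0.0064, 0.0225),   # H1
    (0.0030, 0.2216),   # H2
    (0.0030, 0.2770),   # H3
    (0.0030, 2.2222),   # H4
    (0.0037, 0.0110),   # H5
    (0.0037, 0.0939),   # H6
]

# Weighted Euclidean norm -- the assumed reading of equation (12)
s_h0 = math.sqrt(sum(w * s ** 2 for s, w in rows))
print(round(s_h0, 4))   # 0.0052
```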
The multi-parameter sensitivity results in Tables 14 and 15 raise a number of important issues for
discussion. Firstly, it appears that a multi-parameter sensitivity analysis should be performed in every
case in which a BN is used to reason about digital evidence. Since a multi-parameter sensitivity
analysis can often yield substantially different sensitivity values from a single-parameter sensitivity
analysis, it is necessary to assess the sensitivity of the posterior output of the BN as rigorously as
possible before putting forward forensic conclusions based on that BN. In the present Yahoo! email
case we found consistently smaller multi-parameter sensitivity values than the corresponding single-parameter results, but further investigations have shown that this is by no means a universal trend. By
creating CPTs with different ratios of ‘diagonal’ to ‘off-diagonal’ values we have been able to explore
the circumstances under which the multi-parameter sensitivity analysis is likely to yield larger values
than the corresponding single-parameter analysis. Table 16 offers a few such examples. In particular
we find that when the structure of one or more CPTs deviates significantly from the typical
‘diagonal’ form, as was the case with H4-DE6 here, then the single- and multi-parameter sensitivity
values will increase towards and quite possibly exceed unity. This observation can be understood as
follows: A typically ‘diagonal’ CPT signifies that the truth or falsehood of the sub-hypothesis strongly
predicts the presence or absence of the associated evidence; they are logically tightly-coupled, which
is a highly desirable property. An anomalously ‘non-diagonal’ CPT, however, signals that there is
little correlation between the truth of the sub-hypothesis and the presence or absence of the evidence.
In other words, the evidence is a very poor indicator of the truth of the sub-hypothesis, and does not
discriminate effectively. This manifests itself as a steep gradient on the associated CPT parameters’
hyper-surface indicating the direction of a ‘better’ choice of parameters. Similarly, a combination of
‘diagonal’ and ‘non-diagonal’ CPTs associated with the same sub-hypothesis can create a numerical
tension or balance between the belief reinforcing effects propagated by the former and the belief
weakening effects propagated by the latter, resulting in an increased sensitivity value.
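The effect of 'non-diagonal' CPT structure can be seen even in the simple two-node single-parameter sensitivity value used earlier. The sweep below (our illustration, not taken from the paper) keeps the 'uncertain' mass fixed at 0.05 and shifts probability from the diagonal entry x to the off-diagonal entry c; the sensitivity value climbs steadily towards unity as the diagonal structure is lost.

```python
def single_param_sensitivity(x, c):
    """|dF/dx| for F(x) = x / (x + c), with x = P(E | H = yes), c = P(E | H = no)."""
    return c / (x + c) ** 2

# Sweep a CPT from strongly 'diagonal' (x >> c) to 'non-diagonal' (x << c),
# keeping the 'uncertain' probability mass fixed at 0.05 as in the Yahoo! CPTs.
for x in (0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20, 0.10):
    c = 0.95 - x
    print(f"x = {x:.2f}, c = {c:.2f} -> sensitivity = {single_param_sensitivity(x, c):.3f}")
# The sensitivity rises from about 0.055 (x = 0.90) to about 0.94 (x = 0.10);
# the 0.720 value for the (0.3, 0.65) pair matches the DE3 and DE6 rows of Table 16.
```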
Secondly, the above observations lead us to propose that single- and multi-parameter sensitivity
analyses can be effectively employed to test an existing BN for logical consistency between its sub-hypotheses and their associated evidence, through the CPT values. If the single- or multi-parameter sensitivity analysis results suggest that the BN's posterior output is sensitive to the values of one or
more of the BN’s CPTs, it is necessary to review those CPT values critically with a view to possibly
revising them, after which the single- or multi-parameter sensitivity analysis should be repeated. It
should be noted, however, that this is not equivalent to optimizing the posterior output of the BN with respect to the CPT values. Rather, it is using the sensitivity analysis to highlight possible issues in the
process by which the original CPT values were generated. If repeated reviews of the CPT values do
not lead to a stable posterior output from the BN, this would suggest that the structure of the BN or of
the underlying sub-hypotheses and their associated evidential traces concerning the way in which the
digital crime was committed, may be faulty and require revision. We recall that this was the case with
H4-DE6 in the present study. Note, in particular, that the single-parameter sensitivity results in Table
16 would not necessarily cause concern, while the corresponding multi-parameter results clearly
indicate a serious sensitivity problem. Hence we contend that in general single-parameter sensitivity
analyses are not sufficient in and of themselves and we recommend that multi-parameter sensitivity
analyses should be undertaken as a matter of course.
A final, and more general, consideration is whether there are any special characteristics of digital
forensic analysis which would dictate that the requirements of a sensitivity analysis should differ from
traditional forensics. Because digital forensics is a much more recent discipline than most of
traditional forensics there has been less time for it to establish a corpus of verified knowledge and a
‘track record’ of demonstrably sound methodologies. This means that digital forensics needs to strive
to establish itself as a mature scientific and engineering discipline, and routinely performing sensitivity
analyses is one means of helping to achieve this goal. Indeed, it may result in the analytical methods of
traditional forensics being required to proceed in a similar fashion in order to avoid the imputation of
‘reasonable doubt’ by wily defence lawyers.
Table 16: Some examples of other CPT values yielding large multi-parameter sensitivity values
Sub-hypothesis | Digital Evidence | CPT values | Single-parameter Sensitivity value | Multi-parameter Sensitivity value
H1 | DE1 | 0.7, 0.3 | 0.300 | 2.4448
H1 | DE2 | 0.3, 0.7 | 0.300 |
H1 | DE3 | 0.3, 0.65 | 0.7202 |
H2 | DE4 | 0.7, 0.25 | 0.2770 | 1.8680
H2 | DE5 | 0.6, 0.3 | 0.3704 |
H2 | DE6 | 0.3, 0.65 | 0.7202 |
H2 | DE7 | 0.4, 0.55 | 0.6094 |
6. CONCLUSIONS
In this paper we have described a multi-parameter sensitivity analysis of the BN from the Yahoo! email case. As a result of the present analysis, it can be concluded that one of the Yahoo! email sub-hypotheses (H4) exhibits a significant degree of single- and multi-parameter sensitivity and hence its
associated evidence and conditional probabilities should be reviewed. That review indicated that while
the CPT was in fact quite accurate, it represented a poor choice of evidence and sub-hypothesis which
should either be discarded or revised.
More generally, we have shown that the definition and computation of local multi-parameter
sensitivity values is made feasible for BNs describing real-world digital crimes by the use of a
symbolic algebra system such as Matlab [MathWorks, 2011] to evaluate the steepest gradient
analytically.
Finally, we should reiterate that the principal aim of this work is to devise a metric to assess the
instantaneous (or local) stability of a BN with respect to the values chosen to populate its CPTs. The
value of such a metric is that it enables the digital forensic examiner to know whether or not the set of
conditional probabilities chosen to populate the CPTs of the BN lies on a flat or a steep part of the
hyper-surface in parameter space. The problem of attempting to find local (or global) optima on the
parametric hyper-surface of a BN is a separate issue which has been addressed elsewhere [Chan, 2004].
ACKNOWLEDGEMENT
The work described in this paper was partially supported by the General Research Fund from the
Research Grants Council of the Hong Kong Special Administrative Region, China (Project No. RGC
GRF HKU 713009E), the NSFC/RGC Joint Research Scheme (Project No. N_HKU 722/09), HKU
Seed Fundings for Applied Research 201102160014, and HKU Seed Fundings for Basic Research
201011159162 and 200911159149.
REFERENCES
Castillo, E., Gutierrez, J.M. and Hadi, A.S. (1997), “Sensitivity Analysis in Discrete Bayesian
Networks”, IEEE Trans. Systems, Man & Cybernetics, Pt. A, 27 (4) 412 – 423.
Chan, H. and Darwiche, A., (2004) ‘Sensitivity Analysis in Bayesian Networks: From Single to
Multiple Parameters’, 20th Conference on Uncertainty in Artificial Intelligence, July 7-11, Banff, Canada.
Report published under Section 48(2) of the Personal Data (Privacy) Ordinance (Cap.486) (2007),
issued by Office of the Privacy Commissioner for Personal Data, Hong Kong,
http://www.pcpd.org.hk/english/publications/files/Yahoo_e.pdf (accessed on 14 March 2007)
First Instance Reasons for Verdict of the Changsha Intermediate People’s Court of Hunan Province
(2005), delivered by the Changsha Intermediate Criminal Division One Court, in Case No. 29 of 2005,
China. http://lawprofessors.typepad.com/china_law_prof_blog/files/ShiTao_verdict.pdf (accessed on 9
October 2010).
Kwan, M., Chow, K., Law, F. and Lai, P. (2008), ‘Reasoning about evidence using Bayesian networks’,
Advances in Digital Forensics IV, Springer, Boston, Mass., pp.275-289.
Kwan, Y.K., Overill, R.E., Chow, K.-P., Silomon, J.A.M., Tse, H., Law, Y.W. and Lai, K.Y., (2010)
‘Evaluation of Evidence in Internet Auction Fraud Investigations’, Advances in Digital Forensics VI,
Springer, Boston, Mass., pp.121-132.
Kwan, M., Overill, R., Chow, K.-P., Tse, H., Law, F. and Lai, P., (2011) ‘Sensitivity Analysis of
Digital Forensic Reasoning in Bayesian Network Models’, Advances in Digital Forensics VII,
Springer, Boston, Mass., pp.231-244.
MathWorks (2011) MatLab: http://www.mathworks.co.uk/ (accessed on 11 November 2011).
Overill, R.E., Silomon, J.A.M., Kwan, Y.K., Chow, K.-P., Law, Y.W. and Lai, K.Y. (2010),
‘Sensitivity Analysis of a Bayesian Network for Reasoning about Digital Forensic Evidence’, 4th
International Workshop on Forensics for Future Generation Communication Environments, August
11-13, Cebu, Philippines.
Renooij, S. and van der Gaag, L.C. (2004), ‘Evidence-invariant Sensitivity Bounds’. 20th Conference
on Uncertainty in Artificial Intelligence, July 7-11, Banff, Canada.
Taroni, F., Aitken, C., Garbolino, P., and Biedermann, A. (2006), Bayesian Networks and Probabilistic
Inference in Forensic Science, John Wiley & Sons Ltd., Chichester, UK.
FACILITATING FORENSICS IN THE MOBILE
MILLENNIUM THROUGH PROACTIVE ENTERPRISE
SECURITY
Andrew R. Scholnick
SNVC LC
ABSTRACT
This work explores the impact of the emerging mobile communication device paradigm on the
security-conscious enterprise, with regard to providing insights for proactive Information Assurance
and facilitation of eventual Forensic analysis. Attention is given to technology evolution in the areas
of best practices, attack vectors, software and hardware performance, access and activity monitoring,
and architectural models.
Keywords: Forensics, enterprise security, mobile communication, best practices, attack vectors.
1. INTRODUCTION
The exploding popularity of smartphone technology has greatly outpaced related advancements in the
area of information security. Exacerbating this problem is the fact that few organizations properly
understand the new contexts in which security is (and is not) an applicable concern. As a result, many
enterprises that should be concerned about the security posture of their smartphone device users are, in
actuality, unknowingly at risk of being compromised. These evolutionary changes regarding
communications technology have precipitated a need to rethink both the interrelationships and the
distinguishing factors influencing information security in enterprise voice and data infrastructure.
With ever-expanding functional capabilities and the ever increasing nuances they impose on
information security models, a new way of thinking about enterprise security is needed. At a time
when many traditional enterprise infrastructures are only now acclimating to models integrating
internet presence and/or VOIP capabilities, and while they are still struggling to accommodate the
emergence of social media concerns, the new breed of mobile devices which have emerged since the
introduction of the iPhone and the iPad have shattered old paradigms for data protection by
introducing entirely new methods for transporting and accessing data and data networks.
Paramount within the discussion of how data security issues have been impacted by these evolving
technologies is an understanding of the significant paradigm shifts which are emerging in the digital
and communications worlds. The still-embryonic emergence of personally-focused digital mobility,
which is itself an outgrowth of changes in wireless communications capabilities, has triggered a fast
and furious stream of innovations which are continuing to revolutionize how people think about their
personal manner of interaction with all aspects of digital technology, especially with regard to
professional assets available from their employer’s enterprise environments. Overall, the confusion
surrounding rapid evolution in any technology arena often results in draconian posturing from the
enterprise security community until such time as things become more ‘sorted out’1. This document
attempts to identify the current primary areas of confusion surrounding secure adoption of mobile
technology, and examines the impact of the current paradigm shifts on the enterprise by evaluating the
security of both the underlying communications technologies in play and the resulting changes in
access technologies being built to exploit their evolving capabilities. Working within the context of
this paradigm shift, information assurance and enterprise security issues are considered and problems
regarding the enablement of better and more informative forensic tools are discussed.
1 Gallagher, Sean (2012), “Why the next ‘ObamaBerry’ might run Android or iOS”, http://arstechnica.com/business/news/2011/11/will-the-next-obamaberry-be-a-nexus-or-an-ipad.ars, 28-JAN-2012
2. THE NEW PARADIGM
“What does it do?” This is a question which has probably greeted major advancements in technology
since the invention of the wheel. The answers to this question can have profound implications in the
realm of security. Nowhere is this truer than with the introduction of mobile phones into the enterprise
environment. When first introduced, for example, one might have reasonably surmised that these
devices were ‘just wireless phones’. As the technology evolved however, they quickly incorporated
the basic functionality of pagers – an advance which has further evolved into the Simple Message
Service (SMS) text messaging capability prevalent today. Further evolution of the devices allowed for
the incorporation of cameras, geo-location technology, and practical computer-equivalent
functionality. This evolutionary metamorphosis has introduced powerful technological changes which
represent a considerable shift in the overall security posture of the mobile phone.
What was once a simple voice communication tool is now a powerful ‘Swiss army knife’ offering a
wealth of ever increasing capabilities2 with an ever broadening spectrum of potential points of
compromise. In short, a potential nightmare for any organization concerned with information security.
Unfortunately, the technological changes have been so pervasive, and adoption of the resulting
computer powered multifunction portable communication technology (better known as ‘smartphone’
or ‘smart device’ technology) has occurred so rapidly, that accurate and appropriate identification of
potential security risks has not kept up, resulting in a growing concern for impending crisis3.
2.1 An Evolutionary Shift
Key to understanding the shift in paradigms is a need to appreciate the factors driving modern-day
technological change. Today’s workforce has learned that it is possible to have access to high quality
business resources regardless of where they are, when they want it, or what type of device (desktop,
laptop, tablet, or smartphone) they wish to use. Arguably, there are three primary categories of end-user demands which are contributing to the still-evolving technology solutions coming into
prominence:
• Simplicity – easy to use, well integrated, interoperable
• Performance – high speed, full color, real time
• Comfort – safe to use, feels good, easy to access
The breakthroughs in communications technology represented by the new breed of mobile devices
have fueled a headlong charge by innovators striving to create newer and better resources for a
ravenous market. Resulting from this explosion of creative energy is a multifaceted shift in access
paradigms, usage models, and interaction scenarios which are necessitating the rethink of outdated
security practices4.
2.1.1 User Perception
At the heart of many current enterprise security problems is the rapidly emerging shift in social
attitudes towards digital communication capabilities. Succinctly, the user community knows it is now
possible to integrate everything needed for doing business into a single device they can carry with
them at all times, and they want that greater flexibility and lower cost capability now. When
combined with the rapidly changing technological environment, this ‘I want it now’ attitude
2 Fei, Liang (2012), “Mobile app market set for increased growth”, http://www.globaltimes.cn/NEWS/tabid/99/ID/692610/Mobile-app-market-set-for-increased-growth-Gartner.aspx, 29-JAN-2012
3 Thierer, Adam (2012), ‘Prophecies of Doom & the Politics of Fear in Cybersecurity Debates’, http://techliberation.com/2011/08/08/prophecies-of-doom-the-politics-of-fear-in-cybersecurity-debates/, 28-JAN-2012
4 Simple Security (2012), “Mobile security market in for exponential growth”, http://www.simplysecurity.com/2011/09/30/mobile-security-market-in-for-exponential-growth/, 31-JAN-2012
encourages potentially disastrous snap judgment decision-making which can result in impractical
security models based on outmoded demand, usage, management, and maintenance models. Thus, the
shifts in underlying communications technologies influencing this attitudinal progression embody
many of the primary factors to consider when defining an appropriate way forward.
2.1.2 Tools and Resources
For most people, the dynamic shift in the function, performance, and scope of communication tools
which is currently being experienced, can be summed up in two catch-phrases: social media and cloud
computing. These two areas of influence dovetail beautifully with the perpetual enterprise-organization search to enhance collaboration and standardize capabilities. Therein lies the problem…
Powerful new technologies have already become ubiquitous for private use, and the modern worker is
demanding that they be allowed as business resources too. Powerful new search engines make it easy
to find data; but in most workplaces the user is still tied to the old-fashioned relational database.
Amazing and versatile collaboration environments are popping up all over the internet facilitating
geographically agnostic coordination and transfer of data among friends, classmates, families, job
seekers, and gossips; yet the average team leader must continue to ‘make do’ with email (if lucky,
with remote access), voicemail, conference rooms, and, where available, occasional access to a VPN
connection or video conference. Yet talk of new file sharing and desktop virtualization services
abound in the media, and 4G ‘hotspots’ are advertised at the local coffee shop. Meanwhile, at present,
few, if any, of these innovative tools and resources are provided effectively to the workforce by their
employment enterprise.
2.1.3 Boundary Changes
Who would have believed back in 2000 that security professionals in 2012 would look back on those days,
nostalgically thinking about how much simpler things were? What once were clearly defined borders
for wire-line digital access and data-centric information security have morphed into a world of
network-integrated real-time video feeds, geolocation, universal access, and terabyte pocket-drives.
The combination of high-volume portability, target tracking, and unsecured endpoints has rendered obsolete
many formerly effective best practices, virtually overnight. Because of these changes, information
believed to be well protected by firewalls and access controls is being found, with increasing
frequency, to be exposed in previously unanticipated ways.
3. CONCERNS OF THE ENTERPRISE
Yesterday’s designs for tomorrow’s solutions must be rewritten today. Sufficient information exists to
predict where infrastructure needs and technological capabilities are headed. Technology plans
derived based on goals statements established prior to 2010 should be considered suspect and
reviewed for necessary course-correction. Any such goals and plans derived prior to 2007 should be
reassessed with even greater prejudice. Why? Two words – iPhone (introduced in 2007) and iPad
(introduced in 2010). The introduction of these devices has revolutionized the way in which society
views everything in both the personal and business communications realms. The explosive emergence
of corresponding open source (Google Android) and proprietary (RIM Blackberry PlayBook)
commercially viable technology in the same arena requires a reassessment of the very meaning behind
a concept like Enterprise Security. These new tools have redefined the framework upon which future-facing productive work environments must be built. By analyzing the nature of these changes it
becomes possible to implement integrated proactive and reactive tools and architectures focused on
affordably and effectively protecting the enterprise.
3.1 Problem Definition
Within today’s multifaceted communications technology framework it has become necessary for
security professionals to identify logical areas for conceptual delineation and use them to define
appropriate methods for segmenting the overall security problem into more manageable pieces. This
section will identify three major areas of concern, and provide a high-level perspective for why they
are applicable to protecting the enterprise environment.
3.1.1 Technology Trends
Perhaps the most significant advancements in technology impacting the integration of mobile device
use with enterprise security models are the advent of both deployed 4G cellular networks, which offer
access to greater bandwidth, and cool-running quad-core CPU technology5 for use at the heart of
mobile devices. These advances present a ‘good news / bad news’ dichotomy to enterprise security
professionals. The bad news is that we can now expect to see more sophisticated and powerful attacks
against and through mobile devices (thanks to the CPU enhancements), and more damaging
exfiltrations capitalizing on the higher bandwidth. The good news is that more complex and effective
defenses are now possible, thanks also to the CPU and bandwidth enhancements.
3.1.2 Device Vulnerability
From an enterprise security standpoint, the most noteworthy threats represented by mobile devices
stem from three primary attack vectors: eavesdropping, infection, and theft. All three of these risks are
magnified by the lack of sufficient protection on mobile devices both through their operating systems
and the applications they run. Insufficient security-focused device management and control options
distress any enterprise operations team attempting to implement acceptable mobile device
management (MDM) solutions while ineffective application marketplace quality controls facilitate
unsuspecting installation of apps containing malware. Added to the already challenging problems
presented by browser exploitations and popular document viewer and other third-party tool
vulnerabilities6, the intricate problem of mitigating mobile device vulnerabilities can seem daunting.
3.1.3 Infrastructure Vulnerability
Depending upon specifics of the enterprise architecture, additional aspects of two attack vectors,
eavesdropping and infection, may exist. Again, the problem stems from insufficient security-focused
device management and control options available in device operating systems, and from the resulting shortcomings of the diverse MDM solutions currently available.
3.2 Understanding Protection Issues
The primary issues regarding protection of enterprise environments without impeding use of mobile
technology result from the intersection of two questions: what type of access does the user need (the
access model) and what will happen to the data being accessed (the usage model). By understanding
the answers to these questions, many of the necessary steps for providing effectual mitigations and for
facilitating development and implementation of responsive and dynamic forensics tools become self-evident. With this information in hand it is subsequently possible to define effective and
comprehensive security architectures for the modern enterprise.
3.2.1 Access Models
As identified earlier in this work, recent advances in technology are fueling the already explosive
evolution and adoption of mobile technologies in ways that are significantly impacting worker
perspectives and expectations. Consider, for example, that less than a decade ago the VPN was
generally considered to be an enterprise-level tool. Utilizing them to provide personal access into an
enterprise environment was deemed inappropriate for the vast majority of individual workers. In fact,
few workers actually requested such capabilities from their employers because the tools and network
5 Purewal, Sarah Jacobsson (2012), “Nvidia Quad Core Mobile Processors Coming in August”, http://www.pcworld.com/article/219768/nvidia_quad_core_mobile_processors_coming_in_august.html, 30-JAN-2012
6 Quirolgico, Steve, Voas, Jeffrey and Kuhn, Rick (2011), “Vetting Mobile Apps”, published by the IEEE Computer Society, JUL/AUG 2011
connectivity required to properly utilize such access were prohibitively expensive. Even when costs
were not a major factor and where the enterprise environment supported such access, the personal
equipment and home connectivity used to exploit such access was often ponderously slow, rendering it
undesirable to many. In light of existing technology improvements, as well as looming exponential
performance leaps, this entire paradigm has become invalid.
Within the same span of time, technologies have emerged which allow an attacker to target resources
that were previously seen as well-protected. Email servers, customer service portals, limited access
kiosks, and other common tools and utilities have been, and continue to be, successfully compromised
by direct and indirect attack. Similarly, techniques believed sufficient to protect the data flowing to an
endpoint believed to be secure have proven to be just as fallible. Successful attacks against
cryptographic keys, security and authentication certificates, and even the protocols that utilize them
have been repeatedly in the news and recounted at numerous technical conferences. Introduction of
mobile device capabilities requires that formerly improbable attack vectors be reevaluated and
mitigations and protections identified for use.
It should also be noted that the primary access models discussed below are generally used in
combination, to provide a form of defense-in-depth, by layering the overall security model. Thus,
while requiring a password for user authentication provides a degree of protection, collecting the
password from the user through an encrypted envelope (like an SSL tunnel) is even better. In fact,
encrypting the password before passing it through the tunnel can enhance protection of the
authentication credential even further.
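As a sketch of this layering (ours; the host name, port and wire format are hypothetical, and the fragment is illustrative rather than a recommended protocol), the Python code below transmits a challenge-response digest of a password, rather than the password itself, and only does so inside a TLS envelope.

```python
import hashlib
import hmac
import socket
import ssl

def submit_credential(host: str, port: int, username: str, password: str, challenge: bytes) -> None:
    """Send a challenge-response credential over a TLS tunnel.

    The password itself never leaves the device: only an HMAC keyed with the
    password over a server-supplied challenge is transmitted, and even that
    travels only inside an encrypted (TLS) envelope.  Hypothetical wire format.
    """
    digest = hmac.new(password.encode(), challenge, hashlib.sha256).hexdigest()
    context = ssl.create_default_context()           # verifies the server certificate
    with socket.create_connection((host, port)) as raw:
        with context.wrap_socket(raw, server_hostname=host) as tls:
            tls.sendall(f"{username}:{digest}\n".encode())

# Example call (hypothetical enterprise endpoint and challenge):
# submit_credential("vpn.example.com", 8443, "alice", "s3cret", b"nonce-from-server")
```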
3.2.1.1 Authentication
Arguably the oldest and most prominent access model used for computer security is password
authentication, where a secret is held by the user and shared with a system or application to gain
access. While more complex authenticators have evolved over the years (such as time-sensitive and
out-of-band tokens, biometric sensors, physical authentication credentials, and digital certificates), the
basic principle of this access model is that users must authenticate themselves to the system using
theoretically unimpeachable credentials such as one or more shared secret and/or dynamic
authenticators. The degree of geographic independence obtainable though introduction of mobile
technology to enterprise environments can severely upset the dependability of many, previously
trusted authenticators. Biometrics, for example, are virtually useless as remote authenticators since the assurance they provide is tied to physical presence. A security model that
considers accepting them remotely risks introduction of attack possibilities based on credential
counterfeiting and hijacking which would otherwise have been improbable or impossible within an
enterprise infrastructure. Similar issues exist for physical keying devices, such as DoD CAC and PIV
cards because their authentication data is static, and could thus be intercepted in transit, bypassing the
need to authenticate to their embedded credentialing.
3.2.1.2 Protective Tunneling of Data in Transit
Without straying into a lengthy discussion of the ultimate ineffectiveness of existing computer
encryption techniques, it should be taken as historically axiomatic that as computing power increases,
even the best encryption algorithms eventually get cracked. Since all data tunneling protocols rely on
some form of computer based encryption (such as SSL, TLS, and various VPN technologies) it must
be accepted that, until encryption technology becomes significantly advanced beyond current designs,
this particular weakness will be an ongoing threat. Fortunately, the evolution of more sophisticated
algorithms has effectively mitigated this risk until now. However, the protocols and underlying
software engines which are utilized to incorporate encryption into their various protection schemes
have, themselves, proven to be unnervingly susceptible to attack. (This particular threat vector is a
lingering and well-discussed security issue with impact beyond the direct scope of this work.)
Many infrastructure analysts believe that this lingering issue for securely moving data between mobile
device users and the enterprise at the back end is effectively solved by use of VPN technology. With
the advent of convenient, acceptable-performance mobile VPN technology rapidly cresting the
horizon, it therefore might appear that satisfactory protection for corporate connectivity is at hand.
This is a false perception.
The manner in which all current mobile device operating systems implement network support for VPN
connectivity has a gaping security hole in it. This hole is a feature often referred to as ‘split
tunneling’. This feature of mobile devices neglects to perform a function deemed essential to the
default functioning of their larger computer cousins – routing all network traffic to the created VPN
tunnel unless explicitly instructed otherwise, through overrides to configuration defaults. In order to
address this specific problem, enterprise implementers must be able to alter the configured
functionality of the mobile device OS, until such time as device OS vendors begin incorporating better
enterprise-level security tools into their systems. While problematic but conceivable for open source
systems such as Google’s Android OS, only a cooperative vendor can mitigate this risk for proprietary operating systems such as those from Apple, Microsoft, and Research In Motion (RIM).
3.2.1.3 Misdirection and Brute Force
Perhaps the conceptually simplest group of protective access models involves both active and passive
techniques for hiding in plain sight and for manual inspection and validation. Often clustered together
under the banners of firewalling and intrusion detection these techniques involve brute-force
processing of flowing data, occasionally requiring collaboration from endpoint systems and software.
For instance, while firewalling is largely about restricting the flow of data based on one or more
elements of network addressing, many network services (such as email, web servers, and VPNs) allow
the system administrator to modify a key component of addressing – the logical access port. While
requiring configurations changes at the enterprise back-end as well as on the users’ endpoint devices,
this misdirection can be exploited to provide a degree of camouflage to services provided for remote
access. In recent years this capability has become ubiquitous and can be found in many home wireless
technologies. On the ‘down side’, support for port reassignment on mobile devices is limited to
individual application-specific implementation. On the ‘up side’, app-level support for port
reassignment is prominent enough that it remains a useful tool for evolving enterprise environments.
Strictly the province of back-end infrastructure, manual inspection and validation of moving data
presents an interesting conundrum for enterprise security. While malware and intrusion detection
systems based on this principle can prove to be extremely effective, they are not only costly, but also
may be defeated by some of the very-same tools used with the intent of protecting the enterprise.
After all, you cannot detect and protect against threats you cannot see, and SSL, TLS, VPNs, and other
encryption-based technologies can prevent brute-force data traffic inspection tools from ever seeing
the threat. The thoughtful enterprise administrator may mitigate this failing by employing malware
and threat detection tools on their deployed desktop, laptop, and server systems, but comparable
technologies for mobile devices are still in their infancy and provide little, if any, real protection.
3.2.2 Usage Models
While there are many variations of usage models for systems and data, when discussing the paradigm
shift in enterprise security being caused by advances in mobile device technology, only two broad-brush usage models are pertinent: securely accessing data and securely storing data. Although valid
arguments can be made for the use of application-level security models for mildly sensitive data, the
corresponding overhead involved in providing secondary protections and tracking user-specific access
contexts quickly becomes unmanageable. For this reason, these two usage models are discussed as
systemic models applicable to enterprise security, rather than application level models. Stemming in
large part from the lack of sufficient security-focused functionality available from mobile device
operating systems, addressing these two areas of concern is often perceived to be either prohibitively
costly or completely out of reach.
Continuing to avoid straying into a lengthy discussion of the inherent risks in using existing computer
encryption techniques, it should be duly noted that issues regarding the ultimate failings of modern
computer encryption implicitly complicates any discussion of how to protect data while in use and at
rest. However, modern encryption tools can invariably slow attackers down and should therefore be
utilized wherever and whenever available.
3.2.2.1 Data at Rest
Obviously there is a need to protect valuable enterprise data intended for local storage on a mobile
device. Because not all mobile device operating systems provide native support for file encryption, provision for encrypting stored data, even where it is available, is generally difficult to access and may interfere with application functionality, device performance, and data portability. Also, when
implemented, effective enterprise employment of this capability can be impeded by those
manufacturers desiring to ‘protect the user experience’ by allowing dangerous manual overrides.
Further compounding the risk, should the security posture of the mobile device operating system
become compromised, enterprise-sensitive security keys could be revealed to malicious parties.
Make no mistake, if a device is lost or stolen then the information on it could eventually be seen by
undesirable viewers. Only through use of the most current and strongest encryption systems can this
eventuality be delayed.
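A minimal sketch of application-side encryption of data at rest follows (ours; it assumes the third-party Python 'cryptography' package, which is not part of any mobile OS, and leaves open the key-management question raised above).

```python
# Illustrative only; assumes the third-party 'cryptography' package is installed.
from cryptography.fernet import Fernet

def encrypt_file(plaintext_path: str, ciphertext_path: str, key: bytes) -> None:
    """Encrypt a local file with an authenticated symmetric cipher before storage."""
    with open(plaintext_path, "rb") as f:
        data = f.read()
    token = Fernet(key).encrypt(data)        # AES-CBC with an HMAC under the hood
    with open(ciphertext_path, "wb") as f:
        f.write(token)

def decrypt_file(ciphertext_path: str, key: bytes) -> bytes:
    """Recover the plaintext; raises if the stored data has been tampered with."""
    with open(ciphertext_path, "rb") as f:
        return Fernet(key).decrypt(f.read())

# key = Fernet.generate_key()   # in practice the key must itself be protected,
#                               # e.g. derived from enterprise credentials and
#                               # never stored alongside the data it protects
```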
3.2.2.2 Data in Use
Probably the most serious inherent threat to the enterprise security posture of a mobile device
operating system that has unfiltered, unprotected access to the internet is that the operating system
may become unknowingly compromised through ‘drive by’ attacks from websites, side-channel
exploitations, or malicious attachments to non-enterprise email messages. Although most mobile
device operating systems provide a system-enforced isolation between running applications, once the
OS is compromised this protection is easily circumvented by malware. Malicious tools such as keyloggers, screen-grabbers, and memory scrapers are then employed to acquire seemingly protected
enterprise data. Thus providing some form of enterprise-level protection to the mobile device
operating system from internet-based attack becomes a critical element of ensuring protection of data
in use, a logical association necessitated specifically by the advent of mobile device technology.
4. ENHANCING PROACTIVE DEFENSE AND EVIDENTIARY DISCOVERY
In the evolving universe of cyberspace, counterattack is not an option – it is illegal. This leaves
mitigation (proactive defense) and forensics (evidentiary discovery) as the primary weapons for
protecting enterprise security. Without adequate enabling enterprise-centric tools, effective enterprise
security is nearly impossible. Comprehensively unaddressed until recently, strategies and tools for
effectively protecting the enterprise while enabling seeming unrestricted use of mobile technology is
finally beginning to emerge. This section will discuss several of the most promising of these
advancements and provide practical examples for deployment and use in the enterprise environment.
Incorporating device-level technologies, back-end control tools and techniques, and supplemental
enhancing services, these innovations present realistic solutions that are available to the enterprise
security administrator right now.
4.1 Device Ownership
The core principle behind many, sometimes draconian, enterprise security policies is a simple one: “If
we don’t own it, we can’t trust it”. Therein lies the riddle… how to take ownership of a user’s mobile
device without argument, anxiety, and legal malediction? Although the obvious simple solution is to
provide a separate enterprise-use device, another viable answer is to manage perceptions so that
enterprise ‘assimilation’ of a personal device is perceived to be of benefit to the user. New
technologies are emerging which allow this assimilation at little or no additional cost to the enterprise,
which can provide their users with a variety of services including:
•  Automated secure backup of all device data
     o  Including assurance of secure and private handling of personal assets.
•  An unfettered ‘sandbox’ for personal use
     o  With less restrictive protections still available from corporate firewalls and filters
•  One device supporting personal and business phone numbers
     o  Shared number or unique7
•  Access to all the tools and resources they have been begging for
     o  Online files, intranet, collaboration tools, search engines, video and teleconferencing, etc.
•  Company-paid phone service8
     o  Much less expensive than many people think, competitive with existing phone systems
•  Eliminate phone-carrier bloat-ware on the phones
     o  Only enterprise-vetted apps are allowed in the device’s protected enterprise section
It should be stressed that there already exist practical, cost-effective, secure options for providing these
benefits to mobile device users who need access to enterprise attachments and resources. Central to all
of them is the need to address security limitations imposed by the device operating system. This is
why the enterprise must own the device. The cornerstone of any effective mobile device enabled
enterprise security architecture is the ability to rely on the security posture of the endpoint device. A
key component for one such solution is available today for free, courtesy of the NSA. As of January
2012, the NSA has made available a Security Enhanced (SE) version of the Android operating system9
for open use, and is planning widespread adoption within the agency itself10. This phone operating
system even boasts support from a comprehensive security compliance monitoring and management
tool. The resulting OS-level improvement in encryption, security policy enforcement, compromise
detection, and overall device control represents a solution for the problem upon which all other
significant resolutions depend.
That was the good news. The bad news is that, at present, each of the remaining primary non-Android device OS providers cannot, or explicitly will not, support many or all of these requirements. For this reason, the majority of solutions possible today are Android-centric. Within the remainder of this Device Ownership discussion, unless otherwise explicitly noted, the solutions referenced should be considered specific to the Android universe.
7 There are several options including VoIP clients such as Skype and OoVoo, or number consolidation platforms like Ribbit, Phonebooth, and GoogleVoice.
8 Gray, Benjamin; Kane, Christian (2011), “10 Lessons Learned From Early Adopters Of Mobile Device management Solutions”, Forrester Research, Inc., Cambridge, MA
9 Naraine, Ryan (2012), “NSA releases security-enhanced Android OS”, http://www.zdnet.com/blog/security/nsa-releases-security-enhanced-android-os/10108, 29-JAN-2012
10 Hoover, Nicholas (2012), “National Security Agency Plans Smartphone Adoption”, http://www.informationweek.com/news/government/mobile/232600238, 05-FEB-2012
4.1.1.1 Mobile Device Management (MDM) and Mobile Risk Management (MRM)
The core of any device’s enhanced forensic potential, regardless of OS, will center on the availability
of more comprehensive monitoring and tracking capabilities both inside the device and at the back
end. For this reason, available management tools should be closely scrutinized before an enterprise decides which vendor’s product will be selected to provide this capability.
Although still somewhat lacking in effective security control features, because of various OS
limitations, several MDM vendors are rumored to already be implementing support for the NSA’s SE
Android OS11. Until these products begin to appear, one MRM solution provider, Fixmo12, already
offers support for an OS-integrated monitoring and management solution comprehensive enough to
have been deemed acceptable for Sensitive But Unclassified (SBU) use with DoD resources. Among
this solution’s features are the ability to monitor, track, and control various device characteristics; enhanced app security controls; sandboxing; FIPS-certified cryptography; control of device resources (camera, radios, etc.); and overall device security policy management, compliance monitoring, and control. This product also provides comparable capabilities, where possible, for Apple iOS and RIM Blackberry, as well as integrated features supporting Good Technology secure
enterprise email apps. It should be noted that several other vendors, such as 3LM, BoxTone, and
AirWatch, have voiced plans to improve their support for enterprise security in ways which would
provide comparable functionality, but in most cases delivery dates have not yet been specified.
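The compliance-monitoring idea behind such products can be pictured with a vendor-neutral sketch. The following is a minimal illustration only, not any vendor’s actual schema or API; the field names, policy values and device report are assumptions made for the example.

```python
# Hypothetical back-end compliance check of device posture reports against an
# enterprise policy. Field names and thresholds are illustrative assumptions,
# not any vendor's actual schema.
from dataclasses import dataclass

@dataclass
class DevicePosture:
    device_id: str
    os_version: str
    storage_encrypted: bool
    rooted: bool
    camera_policy_enforced: bool

POLICY = {
    "min_os_version": "4.0",
    "require_encryption": True,
    "allow_rooted": False,
}

def compliance_violations(p: DevicePosture) -> list[str]:
    """Return a list of policy violations for one device report."""
    violations = []
    if p.os_version < POLICY["min_os_version"]:  # simple lexical check for the sketch
        violations.append("OS version below enterprise minimum")
    if POLICY["require_encryption"] and not p.storage_encrypted:
        violations.append("stored data not encrypted")
    if p.rooted and not POLICY["allow_rooted"]:
        violations.append("device is rooted/compromised")
    if not p.camera_policy_enforced:
        violations.append("camera policy not enforced")
    return violations

if __name__ == "__main__":
    report = DevicePosture("device-042", "2.3.6", False, True, True)
    for v in compliance_violations(report):
        print(f"{report.device_id}: {v}")
```

Reports of this kind, logged over time at the back end, are also exactly the monitoring data that later supports forensic reconstruction.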
4.1.1.2 Mandatory Boot-Time Captive Tunnel (A Truly Secure VPN)
Just as MDM/MRM capabilities are at the core of a device’s forensic potential, a ‘mandatory VPN’ is
at the center of any manageable device protection solution. The reason for this is straightforward: if
the device must connect to the enterprise VPN before it can touch the internet, then all of the
enterprise’s existing investment in network-level infrastructure protections can be brought to bear to
protect browsing, email, and other network communications without unduly jeopardizing the existing
enterprise security posture. In this way the enterprise can also monitor and control access to apps,
whether through redirection to an enterprise-run ‘app store’, monitoring of installable apps as content,
or outright blocking of access to non-enterprise app resources. While this capability is not currently
supported or allowed by any major cellular carrier, there is a way around them. This leads us to the
discussion of the Mobile Virtual Network Operator…
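Before turning to the MVNO discussion, the ‘mandatory tunnel’ concept can be illustrated with a minimal sketch. It assumes a rooted Linux/Android build where the standard iptables tool is available; the gateway address, port and interface name are hypothetical, and a production policy would be enforced inside the hardened OS rather than by a user-space script.

```python
# Conceptual sketch of a 'mandatory tunnel': drop all outbound traffic except
# loopback, the VPN concentrator itself, and packets leaving through the
# established tunnel interface. Address, port and interface are illustrative.
import subprocess

VPN_GATEWAY = "203.0.113.10"   # hypothetical enterprise VPN concentrator
VPN_PORT = "1194"              # hypothetical (OpenVPN-style) UDP port
TUNNEL_IFACE = "tun0"          # interface created once the tunnel is up

RULES = [
    ["iptables", "-F", "OUTPUT"],                                      # start clean
    ["iptables", "-A", "OUTPUT", "-o", "lo", "-j", "ACCEPT"],          # allow loopback
    ["iptables", "-A", "OUTPUT", "-d", VPN_GATEWAY, "-p", "udp",
     "--dport", VPN_PORT, "-j", "ACCEPT"],                             # allow tunnel setup
    ["iptables", "-A", "OUTPUT", "-o", TUNNEL_IFACE, "-j", "ACCEPT"],  # allow traffic in tunnel
    ["iptables", "-A", "OUTPUT", "-j", "DROP"],                        # block everything else
]

def enforce_captive_tunnel() -> None:
    """Apply the deny-by-default outbound rules so no traffic bypasses the VPN."""
    for rule in RULES:
        subprocess.run(rule, check=True)

if __name__ == "__main__":
    enforce_captive_tunnel()
```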
4.1.1.3 Mobile Virtual Network Operator (MVNO) Hosting
Although a secure device operating system and the management tools to control it are critical, they are of no value if the enterprise administrator cannot use them on a device. Since existing cellular carriers currently do not support enterprise customization of the device OS to the degree needed to meet these needs, a way must be identified for obtaining affordable service for
enterprise devices which does permit use of an enterprise-customized OS. This is where the concept
of an MVNO becomes important, and the distinction for the enterprise between ‘owning’ the MVNO
and having it ‘hosted’.
The simplest way to understand what an MVNO is and why it is important would be to think of it as a mobile communications carrier which does not actually own any cellular towers. Instead, an MVNO purchases bandwidth from other carriers at a discount and resells it under its own brand.
Cricket Wireless is an example of an MVNO. Realizing the potential value of supporting enterprise
infrastructure with this technology, companies such as CDS Telecommunications of Ashburn
Virginia13 have begun packaging hosted services tailored toward providing this secure architectural
solution to the market. By enabling the enterprise itself to act as a cellular service provider with
11 Project website: http://selinuxproject.org/page/SEAndroid
12 Company website: http://www.fixmo.com/
13 Corporate website: http://www.CDSTelecom.com
competitive rates, the MVNO solution gives the enterprise the ability to lock their mobile operating
system and device management infrastructure into the ‘subscribing’ device – controlling all aspects of
data flow, call monitoring, security, and even application availability and installation. By using a
hosted source of their MVNO activity, the enterprise gains the advantages of high bandwidth
utilization from the primary carrier, resulting in lower overall rates for connectivity. Also, rather than
needing to customize the Android OS themselves, or purchase MDM/MRM licenses in limited
volumes, the MVNO host would bear the burden of providing and maintaining the modified OS and
passing through high volume licensing and sublicensing discounts for the preferred management tools.
4.2 Application Management
Having established the preferred infrastructure necessary to support Device Ownership, it is now
appropriate to discuss a variety of enhanced Application Management capabilities which can allow
the enterprise to further protect itself from malicious or risky device software (a.k.a. apps) without the
need for burdensome resource allocations. By exploiting the various app and resource monitoring
capabilities made possible through a comprehensive MRM solution, not only does on-device
enforcement of user and app compliance with enterprise security policies become possible, but the
ability to collect and monitor more comprehensive data for forensic use at the back-end is also
enhanced.
4.2.1 Vetting
The most monumental conundrum accompanying the introduction of mobile device support into any enterprise ecosystem is how to establish whether apps comply with the enterprise security posture before they are installed, without unduly impeding the user’s access. With over 400,000 apps available directly through Google’s Android Market and over half a million currently available in the Apple App Store14, and annual growth for both expected to number in the hundreds of thousands this year, the task of establishing the acceptability of desirable apps has already far exceeded the reasonable limit for manual inspection. Further, since an overwhelming number of these apps unnecessarily or inappropriately demand access to resources, both on-device and off, that the enterprise would rather restrict, the already difficult management task can seem intimidating in its vastness. As luck would have it, there already exist a few time- and cost-effective solutions for mitigating these difficulties, and many vendors are promising that more are in the works.
4.2.1.1 Eyes-On
Although still an imperfect solution, nothing beats eyes-on inspection of source code performed by a
well-trained vulnerability analyst when evaluating an application for safety and security. However,
being somewhat time-consuming and labor intensive, this method of vetting software for enterprise
use quickly becomes cost-prohibitive for any small enterprise environment. Even for a large and well-funded security team, utilizing eyes-on inspection would be viable only for the most highly suspect
mobile applications whose functionality was deemed essential. Third party organizations exist, such
as VeriSign’s iDefense team, which can be contracted to provide some of the industry’s best talent for
this purpose, but this type of service comes at a high dollar cost and, as such, is likely only to be
employed by very large enterprises and MVNO Hosting services.
4.2.1.2 Automation
The field of apps available for mobile devices is already tremendous and is expanding with increasing
rapidity. Only an automated evaluation system can be expected to effectively handle the constant
onslaught, not to mention the backlog, of available mobile device apps. Although automation is the obvious solution, almost no effort has been expended in this area to date. The key word here is ‘almost’. In a February 2012
14 Fitchard, Kevin (2012), “Android development speeds up: Market tops 400,000 apps”, http://www.abiresearch.com/press/3799Android+Overtakes+Apple+with+44%25+Worldwide+Share+of+Mobile+App+Downloads, 20-JAN-2012
article15, CNN reported that Professor Angelos Stavrou of George Mason University has designed such
an automated vetting system for the US Government, and it has already churned through over 200,000
of the apps in the Android backlog over the past few months. Kryptowire, a company formed early in
2011 by Professor Stavrou, is preparing to launch a version of this tool for public use sometime in the
first half of 2012, as well as some eyes-on vetting services.
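The flavour of such automated vetting can be conveyed with a minimal sketch of a single check: extracting the permissions an app requests and flagging those the enterprise considers high risk. This is not Kryptowire’s system; it assumes the Android SDK’s aapt tool is available on the PATH, and the risk list and APK name are illustrative assumptions.

```python
# Minimal sketch of one automated vetting step: extract the permissions an
# Android app requests and flag those an enterprise considers high risk.
# Assumes 'aapt' (Android SDK) is on the PATH; the risk list is illustrative.
import re
import subprocess
import sys

HIGH_RISK = {
    "android.permission.READ_SMS",
    "android.permission.RECORD_AUDIO",
    "android.permission.READ_CONTACTS",
    "android.permission.ACCESS_FINE_LOCATION",
}

def requested_permissions(apk_path: str) -> set[str]:
    """Run 'aapt dump badging' and pull out the android.permission.* names."""
    out = subprocess.run(["aapt", "dump", "badging", apk_path],
                         capture_output=True, text=True, check=True).stdout
    return set(re.findall(r"android\.permission\.[A-Z_]+", out))

def vet(apk_path: str) -> None:
    """Print a pass/review verdict for one app based on its requested permissions."""
    flagged = requested_permissions(apk_path) & HIGH_RISK
    if flagged:
        print(f"REVIEW {apk_path}: requests {', '.join(sorted(flagged))}")
    else:
        print(f"PASS {apk_path}: no high-risk permissions requested")

if __name__ == "__main__":
    vet(sys.argv[1] if len(sys.argv) > 1 else "example.apk")
```

A production vetting pipeline would, of course, add static and dynamic analysis of the app’s behaviour; the permission check above is only the cheapest first filter.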
4.2.2 Access Control
In the world of mobile devices, the term ‘access control’ (AC) has three connotations. The obvious
and most common understanding of the term refers to the ability to moderate access to the
functionality of the device itself, for example - by setting a device password for all activity other than
answering an incoming call. Additionally, this term refers to the ability to control the permissions
which each app installed on the device receives for accessing device resources, such as a camera, local
storage, radios, and audio I/O resources. Lastly, AC refers to the ability to regulate the manner in
which the device and its apps connect with Internet resources. It is this last area which is most
problematic when attempting to define a secure enterprise usage model, in part due to previously
discussed Secure VPN concerns. In all of these areas, appropriate on-device (in the OS and the apps)
and back-end logging of historical data is imperative to support enterprise forensic needs.
4.2.2.1 Logging On
Perhaps the only area of mobile device security which has been addressed to an adequate degree by existing technology is device-level access control. Although there is certainly room for
improvement, multiple technologies providing user authentication are available which offer features
including:
•  Remote-controlled access revocation
•  Password retry limits
•  Password complexity
•  Password aging
•  External credentials (CAC/PIV)
•  Check-in timers requiring periodic back-end connection
to name a few. In most cases, these solutions provide both on-device and back-end monitoring and
management capabilities which are even adequate for use in highly sensitive environments.
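A minimal sketch of how several of these controls might be evaluated by a back-end policy engine follows; the thresholds, field names and device state are assumptions for illustration rather than any product’s actual policy format.

```python
# Hypothetical sketch of how listed access controls (retry limits, password
# aging, check-in timers) might be evaluated by a back-end policy engine.
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class DeviceAccessState:
    failed_attempts: int
    password_set_on: datetime
    last_checkin: datetime

MAX_RETRIES = 5                         # illustrative thresholds
PASSWORD_MAX_AGE = timedelta(days=90)
CHECKIN_INTERVAL = timedelta(hours=24)

def access_actions(state: DeviceAccessState, now: datetime) -> list[str]:
    """Return enforcement actions implied by the current device state."""
    actions = []
    if state.failed_attempts >= MAX_RETRIES:
        actions.append("revoke access pending administrator review")
    if now - state.password_set_on > PASSWORD_MAX_AGE:
        actions.append("force password change at next unlock")
    if now - state.last_checkin > CHECKIN_INTERVAL:
        actions.append("lock device until it checks in with the back end")
    return actions

if __name__ == "__main__":
    now = datetime(2012, 3, 1, 12, 0)
    state = DeviceAccessState(failed_attempts=6,
                              password_set_on=datetime(2011, 11, 1),
                              last_checkin=datetime(2012, 2, 27))
    for action in access_actions(state, now):
        print(action)
```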
4.2.2.2 Device Resources
Of all the mobile device manufacturers participating in today’s marketplace, Apple stands out as singularly the worst prepared for enterprise use when considering the ability to mandate access controls
from the back-end. Although a limited degree of control is possible through the use of their access
policy mechanism, the device user has the ability to ignore policies without detection. More thorough
controls are available for the three other major players, RIM (Blackberry), Google (Android), and
Microsoft (WinMobile). However, additional controls are still necessary. For example, the ability for
an enterprise to require high-level encryption for stored data, restrict or allow specific Bluetooth
devices, and to control access to the device microphone, camera, and GPS on an app-specific basis is
only available to a limited degree. The previously mentioned MRM solution from Fixmo is, arguably,
the best example of an existing product which addresses this need, pushing available controls to their
limit regarding each of the four primary device OS manufacturers. It is also worth noting that in late
15 Milian, Mark (2012), “U.S. government, military to get secure Android phones”, http://www.cnn.com/2012/02/03/tech/mobile/government-android-phones/index.html, 03-FEB-2012
2011 Dell announced a secured version of its Streak device which was deemed acceptable for some uses within the DoD.
4.2.2.3 Network Virtualization Facts and Fallacies
With regard to access control, the question of how to safely provide enterprise users with the ability to access the back-end network infrastructure may be the most misunderstood. Earlier in this work the need to ‘own’ the device OS was discussed. Nowhere is the justification for this requirement clearer than with regard to implementing mobile device access strategies for virtual private network connectivity. The defining logic is simple: if an attacker has a path to the OS that does not force data to travel through ‘industrial strength’ protection systems, then the device’s integrity cannot be guaranteed. If the device’s integrity is in question, then the security provided by the VPN must also be suspect. As described previously, utilizing an MVNO architecture mitigates the problem of device trustworthiness. Without such an architecture, no aspect of VPN authentication from a mobile device can be considered safe from compromise.
5. CONCLUSION
The introduction of mobile device technology into enterprises creates a multitude of new problems for
security professionals. What were once simple voice communication tools are now powerful
multifaceted devices offering a multitude of ever increasing capabilities with an ever broadening
spectrum of potential points of compromise. These new tools have redefined the very framework
upon which modern work environments are being built. This metamorphosis has resulted in the
introduction of powerful technological changes which represent a considerable shift in the overall
security posture of the mobile phone, a potential nightmare for any organization concerned with
information security. In short, the exploding popularity of smartphone technology has greatly
outpaced the ability of many enterprises to update their security infrastructure.
Emerging technology is rapidly making its way to market which greatly enhances the enterprise
security posture and provides data monitoring and collection capability to enhance management
activity and proactively support potential forensic needs. Affordable commercial solutions, which
provide Government and Military grade protections, are emerging in today’s marketplace which
greatly enhance the ability of businesses, small and large, to achieve an acceptable security posture
supporting integrated use of mobile device resources.
6. ACKNOWLEDGEMENTS
Technology inputs from Professor Angelos Stavrou (George Mason University), Mr. Rick Segal
(Fixmo), and Dr. Mark Gaborik (DoD) are gratefully acknowledged as contributing factors to this
work.
7. AUTHOR BIOGRAPHY
A subject matter expert at SNVC LC, Andrew Scholnick is a cyber security professional with over 30 years’ experience in the field. In his most recent position as a technical team leader for the US Army, he guided the efforts of vendor, DoD, and government contributors through the integration of technologies which resulted in DoD approval for the first off-the-shelf Android smartphone solution allowed for Army use. He previously headed the VeriSign iDefense Vulnerability Analysis Lab and was one of the principal technology innovators who securely connected AOL to the internet.
8. REFERENCES
Fei, Liang (2012), “Mobile app market set for increased growth”, http://www.globaltimes.cn/NEWS/tabid/99/ID/692610/Mobile-app-market-set-for-increased-growthGartner.aspx, 29-JAN-2012

Fitchard, Kevin (2012), “Android development speeds up: Market tops 400,000 apps”, http://www.abiresearch.com/press/3799Android+Overtakes+Apple+with+44%25+Worldwide+Share+of+Mobile+App+Downloads, 20-JAN-2012

Gallagher, Sean (2012), “Why the next ‘ObamaBerry’ might run Android or iOS”, http://arstechnica.com/business/news/2011/11/will-the-next-obamaberry-be-a-nexus-or-an-ipad.ars, 28-JAN-2012

Gray, Benjamin; Kane, Christian (2011), “10 Lessons Learned From Early Adopters Of Mobile Device management Solutions”, Forrester Research, Inc., Cambridge, MA

Hoover, Nicholas (2012), “National Security Agency Plans Smartphone Adoption”, http://www.informationweek.com/news/government/mobile/232600238, 05-FEB-2012

Milian, Mark (2012), “U.S. government, military to get secure Android phones”, http://www.cnn.com/2012/02/03/tech/mobile/government-android-phones/index.html, 03-FEB-2012

Naraine, Ryan (2012), “NSA releases security-enhanced Android OS”, http://www.zdnet.com/blog/security/nsa-releases-security-enhanced-android-os/10108, 29-JAN-2012

Purewal, Sarah Jacobsson (2012), “Nvidia Quad Core Mobile Processors Coming in August”, http://www.pcworld.com/article/219768/nvidia_quad_core_mobile_processors_coming_in_august.html, 30-JAN-2012

Quirolgico, Steve, Voas, Jeffrey and Kuhn, Rick (2011), “Vetting Mobile Apps”, Published by the IEEE Computer Society, JUL/AUG 2011

Simple Security (2012), “Mobile security market in for exponential growth”, http://www.simplysecurity.com/2011/09/30/mobile-security-market-in-for-exponential-growth/, 31-JAN-2012

Thierer, Adam (2012), “Prophecies of Doom & the Politics of Fear in Cybersecurity Debates”, http://techliberation.com/2011/08/08/prophecies-of-doom-the-politics-of-fear-in-cybersecurity-debates/, 28-JAN-2012
A CASE STUDY OF THE CHALLENGES OF CYBER
FORENSICS ANALYSIS OF DIGITAL EVIDENCE IN A
CHILD PORNOGRAPHY TRIAL.
Richard Boddington
School of IT
Murdoch University
Perth, WA 6150
Australia.
[email protected]
Tel: +61 893602801. Fax: +61 89360 2941.
ABSTRACT
Perfunctory case analysis, lack of evidence validation, and an inability or unwillingness to present
understandable analysis reports adversely affect the outcome of legal trials reliant on digital evidence. These issues have serious consequences for defendants facing heavy penalties or imprisonment, who nonetheless expect their defence counsel to have a clear understanding of the evidence. Poorly
reasoned, validated and presented digital evidence can result in conviction of the innocent as well as
acquittal of the guilty. A possession of child pornography Case Study highlights the issues that appear
to plague case analysis and presentation of digital evidence relied on in these odious crimes; crimes
increasingly consuming the time, resources and expertise of law enforcement and the legal fraternity.
The necessity to raise the standard of, and formalise, examinations of digital evidence used in child pornography cases seems timely. The case study shows how structured analysis and presentation processes can enhance examinations. It also emphasises the urgency of integrating vigorous validation processes into cyber forensics examinations so that they meet an acceptable standard. The processes proposed in this Case Study enhance clarity in case management and
ensure digital evidence is correctly analysed, contextualised and validated. This will benefit the
examiner preparing the case evidence and help legal teams better understand the technical
complexities involved.
Keywords: Digital evidence, evidence analysis, evidence validation, presentation of evidence, digital
evidence standards.
1. INTRODUCTION
Because the legal fraternity generally understands little about computer science, the potential for miscarriages of justice is great. The cyber forensics community sometimes exploits this situation and obfuscates the environment by focusing on issues such as preserving, collecting, and presenting digital evidence, with evidence validation understated or ignored (Caloyannides, 2003). Ultimately, juries must evaluate the evidence and, if they misread or misunderstand it because of inadequate forensics analysis and presentation and faulty validation processes, unreliable decisions as to the guilt of those accused are inevitable. The disappearance of baby Azaria Chamberlain at Ayers Rock more than thirty years ago and the subsequent coronial inquests, a court case featuring controversial forensics evidence, and the subsequent Royal Commission into her sad death resulted in a fundamental reconsideration of forensic practices in Australia (Carrick, 2010). Digital evidence may, on occasions, also be failing to meet the same high standards expected of more established forensics regimes.
Criminal trials relying on digital evidence are increasingly common and, regrettably, trials where innocents are convicted are not rare (George, 2004; Lemos, 2008). Defendants are pleading guilty based on what appears to be overwhelming hearsay evidence, mainly digital evidence presented without robust defence rebuttal. Reasons for this may be that the evidence is compelling, the defendant may have limited
financial resources, the defence lawyers misread the evidence, plea-bargaining offers lesser sentences,
etc. Various factors can affect the validity of the evidence, including failure of the prosecution or a
plaintiff to report exculpatory data, evidence taken out of context and misinterpreted, failure to
identify relevant evidence, system and application processing errors, and so forth (Cohen, 2006;
Palmer, 2002).
Since 2008, the author has provided expert analysis of digital evidence to defence criminal lawyers in Western Australia. This involved re-examination and validation of digital evidence presented in state and federal law enforcement cases. A number of defendants were able to convince the jury of their innocence, partly with the assistance of the author’s re-examination and testing of the digital evidence. The selected Case Study, a possession of child pornography case, highlights the incomplete and incorrect analysis common in cyber forensics analysis of child pornography cases in Western Australia. According to the Australian Federal Police, possession of and trafficking in child pornography cases are becoming more frequent in Australia, with a 30% increase in arrests for child pornography offences in 2011
compared with 2010 (Sydney Morning Post, 2012). Child pornography cases are technically complex
and require analysis of evidence that supports claims of criminal intent including knowledge and
control of offensive materials. Locating evidence that proves more than simple possession of the
materials requires skill and expertise in presenting evidence to prove deliberate actions by suspects.
Linking the user to the crime may happen too quickly on some occasions without sound validation of
the evidence.
Officers who undertake these important but tedious tasks may well be under-resourced, over-burdened
with complex cases, sometimes made more difficult by inexperienced analysts and communicators.
There is some anecdotal evidence suggesting that these prosecutions are in response to political lobbies determined to eradicate any form of immoral cyber behaviour through draconian, result-oriented legislation. The inherent problem with such an approach is that it places too much pressure on examiners, which puts at risk the innocent inadvertently caught up in criminal investigations. These problems are not unique to Western Australia and the Case Study is not intended to criticise law enforcement agencies. What it attempts is to identify some common problems affecting their forensics examinations and to suggest enhancements to improve outcomes. What is evident to the author, and his fellow workers in the field, are two related problems: 1) faulty case management through inadequate analysis and presentation of the digital evidence, and 2) incomplete and incorrect validation of the digital
evidence. The Case Study highlights typical problems, identifies best standards, and offers processes
to achieve outcomes that are more acceptable and helpful to the examiners and legal practitioners.
2. INTRODUCTION TO THE CASE
The case selected is the State of Western Australia versus Buchanan (2009) on a charge of possession
of child pornography, an indictable offence usually resulting in imprisonment on conviction in the
jurisdiction. The defendant’s computer was seized and he was charged with possession of offensive
photographs and movies contrary to the Western Australian Criminal Code Section 321. The offence
of possession is contingent on the images being of an offensive nature and depicting, or purporting to depict, a person under the statutory age of sixteen years. Mere possession is not sufficient to
convict under the legislation and some degree of knowledge, control of the offensive material, and
criminal intent has to be proven by the prosecution. Nevertheless, in reality, it is a reversal of the
presumption of innocence and possession carries more weight than perhaps it should. On occasion,
disproportionate onus is placed on a defendant to explain away incriminating digital evidence that is
sometimes indiscriminate in signifying blame. It is easy to overlook the gap between possession and
possession with criminal intent. Ownership or possession of a computer is tantamount to criminal guilt in some mindsets, yet this view ignores the requirement to link a specific computer user with the evidence.
A number of computers were seized and examined, and a perfunctory analysis report was produced describing the clearly offensive digital evidence located on one of the computers. The
defence team was instructed by the defendant to analyse the digital evidence and seek a better
understanding of the nature of the prosecution’s evidence and to develop a possible defence strategy.
During the re-examination of the forensic image, exculpatory digital evidence exhibits were identified
and tendered at the trial. This new evidence, and demonstrations of the unreliability of some existing evidence, challenged some key prosecution assertions and contributed to the defendant’s swift
acquittal by the jury.
3. PROBLEMS OF ANALYSIS AND PRESENTATION
Forensics examiners overlooking or mis-reading evidence and, worse still, resorting to ‘cherry-picking’ when choosing or omitting evidence to gain legal advantage, is a common phenomenon of the digital domain (Berk, 1983; Flushe, 2001; Koehler & Thompson, 2006). Moreover, bias, examiner inexperience and pressures to process cases with limited time and resources can also explain the
phenomenon. The reasoning used in the analysis may be faulty, lacking any safeguards to ensure
complete and thorough analysis occurs. If abductive reasoning was used in the case presented for
study, and it probably was as it is commonly used in such circumstances, then it seems to have been
done poorly.
Abductive reasoning is inferential reasoning based on a set of accepted facts from which the most likely explanation for them is inferred, and it is most commonly used in legal proceedings (Australian Law
Dictionary, 2012). In contrast, deductive reasoning abstracts directly from data while inductive
reasoning is based on but extrapolates partially beyond data. Abductive reasoning extrapolates
inductive reasoning even further (Walton, 2004, p. 13). Abductive reasoning is used to develop
plausible hypotheses and evaluate candidate hypotheses to seek the best explanation based on a
preponderance of supporting facts (Walton, 2004, pp. 22, 174). Walton (2004, pp. 1, 4, 20 and 33)
asserts that logic is expected to be exact whereas abduction is inexact, uncertain, tentative, fallible,
conjectural, variable and presumptive, labelling it as, ". . . the judgment of likelihood". Abductive
reasoning draws its conclusions from incomplete evidence; a guess but an, ". . . intelligent guess. . ."
according to Walton (2004, pp. 3, 215).
Abductive reasoning involves a series of questions and answers and eliciting and ultimately evaluating
competing explanations that emerge from the process (Walton, 2003, p. 32). Such legal debate and
opinion of the hypotheses are passed to the jury for their consideration, but of concern is that incorrect and incomplete reasoning, abductive or otherwise, about technically complex digital evidence hardly serves the course of justice. Certainly, no questioning or answering process was shared with the defence team involved in the Case Study, and the guesswork seemed banal rather than intelligent.
Cases relying on digital evidence are often complex and involve various groups of related and unrelated evidence that make up the threads forming the rope of evidence. The problem confronting the examiner is locating and selecting the evidence, which requires careful, unbiased reasoning. Each thread complements the whole, but important threads are often subject to misinterpretation or are overlooked. Pulling the threads together then requires validation to check
and test the evidence. That accomplished, the evidence must be presented in an easily understood form
which defines the evidence, explains its relevance and evidentiary strength, and includes potential
rebuttal based on validation and other issues that may challenge the claim.
4. ANALYSIS AND PRESENTATION ISSUES IN THE CASE STUDY
Subjective assumptions in the Case Study, evidently based on a perfunctory understanding of the evidence and made with no attempt to ensure its completeness and correctness, were also common to other child pornography cases examined by the author, even though the innocence of those defendants was less clear than in the Case Study.
The prosecution analyst’s original analysis report provided no narrative on the groups of catalogued
evidence or the relationship between them or their combined contribution to the criminal charge. The
report lacked complete contextualisation to help the legal teams and defendant understand the
significance of the evidence. No timeline, no storyboard, and no detailed explanation of the
significance of the exhibits were provided. The prosecution analyst’s lack of reliable case analysis, an absence of evidence validation, and confusing analysis presentation were problematic for the defence lawyers. The prosecution lawyer, too, seemed to misunderstand the digital evidence, judging by the weak cross-examination of the defence expert.
Expedient use of the evidence selected by the prosecution analyst, combined with questionable inferences about the probity of the evidence, suggested a disregard for the defendant’s presumed innocence in the selected case. The charge of possession with intent hinged on the defendant’s ownership of and exclusive access to the computer. No explicit evidence was offered to support the truth of the contention, nor was any attempt made to establish whether others had access to the computer. Offensive pictures and video files and access to child pornography websites were offered as prima facie evidence of guilt, presumably based on abductive reasoning. Whatever reasoning was used to determine the merit of the case from a prosecution perspective, it appeared cursory, and little thought was given to the possibility of any alternative hypotheses. The power of the ‘smoking gun’ alone was enough to lay
charges.
Exculpatory digital evidence recovered from extant and carved files, earlier identified but disregarded
by the prosecution analyst’s seemingly Procrustean disposition as “being irrelevant”, suggested that
the defendant was not the only suspect, nor the most obvious one. This evidence was not catalogued,
nor was it voluntarily shared with the defence team, yet its existence was acknowledged prior to the
trial and during cross-examination of the prosecution analyst. It is common for the re-examination of the evidence in these cases to identify extra incriminating evidence as well as to provide helpful analysis presentation that sometimes benefits the prosecution, at monetary cost to the defendant and disadvantage to the defence strategy. It seems unjust that defence analysts are required to complete
and improve the case evidence because of the shoddy work of the prosecution. In this case vital
exculpatory evidence was recovered and inculpatory evidence was challenged.
Although unlikely to face the same charge as the defendant, or charges for perjuring themselves during their testimony, others were the likely culprits. Other witness testimony further implicated one or more of
these persons who were tenants in the defendant’s home at the time of the offence. Why the defendant
was charged and not others, when available evidence contradicted the evidence of two of the
prosecution witnesses, remains puzzling to the author. This was an exceptional case and even the
prosecutor intimated its likely collapse, but the judge directed that the proceedings continue and that the jury be allowed the final decision.
In this and other cases, the problem seems to be that the forensics examiners provide statements of evidence selection but no explanation of why exhibits were selected in terms of their relevance and significance to the cases. Nonetheless, examiners should possess the experience and expertise to state the relevance of
the evidence and its relationship to other evidence in the case. It seems this task is for the legal teams
to elicit through various means; hardly efficacious case management. This raises problems of evidence
reliability; notably its accuracy, authenticity, corroboration, consistency, relevance, etc. From this
morass, some formal process is required to convey the gist of the examiner’s analysis and conclusions.
5. POTENTIAL SOLUTIONS TO ANALYSIS AND PRESENTATION ISSUES IN THE CASE
STUDY
Inman and Rudin (2001) state, “Before the criminalist ever picks up a magnifying glass, pipette or
chemical reagent, he must have an idea of where he is headed; he must define a question that science
can answer”, neatly defining a basic cyber forensics case management problem. According to Pollitt
(2008), this places in a quandary the forensics examiner, who must have a sound understanding of defining investigative or legal questions and be able to convert them into scientific questions. Pollitt
(2008) stresses the importance of first defining the legal and investigative questions before defining
the digital forensic (scientific) questions. Had that advice been heeded in the Case Study, the prosecution analyst would have benefited from more direction from the outset of the examination and would probably have been more inclined to use more reliable logic in assembling the case. Assuming sound scientific logic is applied during analysis, presenting the evidence still requires some dedicated thought. Yet there are various simple, effective processes, such as Toulmin’s model discussed below, that can help organise analysis and presentation of digital evidence.
Toulmin’s model, based on his theory of The Layout of Argumentation (1958), has been used to construct arguments at various stages of litigation and works for legal argument because of its accuracy, flexibility, and effectiveness, according to Saunders (1994, p. 169). Toulmin (1958, p. 7) was concerned that sound claims should be based on solid grounds and backed with firmness to support claims used in arguments. The model accommodates lawyers’ reliance on precedential and statutory authority and incorporates the elements of inference and uncertainty in judicial reasoning and decision-making (Saunders, 1994, p. 169). Most importantly, it includes in the argument the element of rebuttal, anticipating lawyers’ refutation of counter-arguments. The same process is just as relevant
to the forensics examiner.
Toulmin (1958) asserted that formal rules of logic were incompatible with common practices of argument as a means of critiquing and analysing public argument, and his model has been used by lawyers to understand the constraints in legal cases when defining reasonableness standards. Toulmin’s theory
(1958) defines six aspects of argument common to any type of argument as described below and
illustrated in Figure 1:
Data (Object) is the evidence, facts, data and information for the claim. It establishes the basis of the argument. Data can be grounded in anecdotal evidence, and a specific instance can provide the basis for the argument. Data can be human testimony or grounded in statistical data.

Warrant is the component of the argument that establishes the logical connection or reasoning process between the Data and the Claim:
-  Authoritative warrants rely on expert testimony offering conclusions about the Data to support the Claim.
-  Motivational warrants rely on appeals to the audience offering conviction, virtues and values to support the Claim.
-  Substantive warrants more closely resemble traditional forms of logical reasoning, including Cause-Effect/Effect-Cause, and generalisation based on example and classification.

Claim is the point of an argument and represents the conclusion advocated by the arguer based on the Data linked to the Warrant.

Backing is the material that supports the Warrant and explains the reasoning used in the Warrant. Backing adds credibility to an argument; without it the argument seems lacking:
-  Statistical Backing relies on numbers to quantify information but can create an illusion of truth.
-  Example Backing consists of concrete instances that illustrate the Warrant and provide a real-world view, but caution is required when using generalised examples that may not be true in a given argument.
-  Testimony Backing is based on expert opinion and personal experience, which adds credibility to the argument.

Qualifier represents the soundness, the strength and the worthiness of an argument.

Reservation (Rebuttal) is an exception to the claim made by the arguer, as arguments may not be universally true.
Figure 1. Toulmin’s Model of Argumentation (Toulmin, 1958, p. 105)
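To make the relationships between the six elements concrete, the model can also be expressed as a simple data structure and populated with an overview of the Case Study similar to that in Figure 2. The sketch below is illustrative only; the field wording is a paraphrase, not a reproduction of the trial documents.

```python
# Toulmin's six elements expressed as a simple data structure and populated
# with a paraphrased overview of the Case Study (cf. Figure 2).
from dataclasses import dataclass, field

@dataclass
class ToulminArgument:
    data: str          # the evidence (Object) grounding the argument
    warrant: str       # reasoning linking the Data to the Claim
    backing: str       # support and credibility for the Warrant
    qualifier: str     # strength/soundness of the argument
    claim: str         # conclusion advocated by the arguer
    reservations: list[str] = field(default_factory=list)  # rebuttals/exceptions

prosecution_argument = ToulminArgument(
    data="Offensive images and browsing artefacts found on the seized computer",
    warrant="The defendant owned the computer and, per witness testimony, had sole access",
    backing="Analyst testimony describing the recovered material",
    qualifier="Presented as proof beyond reasonable doubt",
    claim="The defendant knowingly possessed child pornography",
    reservations=[
        "Exculpatory artefacts link the tenants to the computer during the relevant periods",
        "The browser deletion occurred while the defendant was demonstrably elsewhere",
    ],
)

if __name__ == "__main__":
    print(f"Claim: {prosecution_argument.claim}")
    for r in prosecution_argument.reservations:
        print(f"Reservation: {r}")
```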
The flexibility of Toulmin’s model is that it can be used to take a top-down view of the case based on validation of the combined evidence, as well as of individual and clustered related evidence at an elementary level. The model offers the opportunity to present the evidence (Data) and explain the reasoning for the Claim (Warrant). As highlighted in Toulmin’s example, providing explanation and credibility in support of the Warrant (Backing) offers some measurement of the claim (Qualifier) and permits a rebuttal of the Claim (Reservation), all of which promotes a better understanding of the argument inherent in the Claim.
The Case Study suggests that no consideration was given to the possibility that exculpatory evidence might exist or that the evidence might not be correct, and that no validation was undertaken. Toulmin’s model offers the opportunity to develop templates to ensure that a more complete and vigorous examination of the evidence occurs. If Toulmin’s model is applied to the Case Study, it replaces the disorder with a thorough, complete and structured perspective of the elements of the charge against the defendant. An overview of the Case Study, showing the prosecution claim and the defence rebuttal, is illustrated in Figure 2. The model shows assertions made by the prosecution, such as that the defendant had sole access to the computer based on witness testimony. This assertion appears false, as other valid digital evidence suggested the witness testimony was dubious and contradicted their perjuries.
Fig. 2. An overview of the Case Study using Toulmin’s Model.
The benefit of representing the argument in this format should be evident to the reader. Backing and, for that matter, Reservation can contain expert opinion and statistical research, but both should be validated. The model allows the examiner to build a simple list of facts from which assertions are derived and to check those against a list of facts that may exist and be used in the Reservation to test the original argument. This visual aid encourages rebuttal of the counter-claim to complete the analysis and show the lines of reasoning; it is a simple yet powerful model. The model can be used to show an overview of the case evidence and can be broken down into individual evidence exhibits or groups.
Computer events should be checked and tested to avoid ambiguous, inaccurate and irrelevant
outcomes and this can also be represented in the model.
6. VALIDATION PROBLEMS
The International Standard ISO/IEC DIS 27037 sets broad guidelines to validate forensics tools and processes used in the evidence retrieval stages, with an expectation that the evidence is what it purports to be. The Standards Australia International’s Guidelines for the Management of IT evidence: A handbook (2003) states that, “The objective of this stage of the lifecycle is to persuade decision-makers (e.g. management, lawyer, judge, etc.) of the validity of the facts and opinion deduced from the
evidence.” The guidelines were intended to assist a range of stakeholders, including investigators and
the legal fraternity, who rely on digital evidence held in electronic records in various legal cases.
These standards are broad but offer no formal validation process of digital evidence per se during the
analysis stage.
Dardick (2010) offers guidance, stressing the need for standards of assurance, and asserts that
digital evidence validation requires a rigorous examination of the quality of evidence available and
proposes a checklist that ensures cases are thoroughly validated through examination of: accuracy,
authenticity, completeness, confidentiality, consistency, possession, relevance and timeliness.
Certainly, digital evidence validation requires more study and research as it underpins a crucial
forensic tenet. Defining validation and testing applicable processes that make it useful to examiners
and lawyers seems overdue.
A definition of validation or correctness in the context of cyber forensics may be taken from Sippl &
Sippl (1980, p. 608) as being a relative measure of the quality of being correct, sound, efficient, etc.,
and defining data validity as a relation or measure of validity. Such relations and measures based on
specific examination must rely on tests and checks to verify the reliability of the data, thereby
validating the data or determining a degree of acceptability of the evidence (Sippl & Sippl, 1980, p.
140). The Dictionary of Psychology (2009) defines validation as, “. . . soundness or adequacy of
something or the extent to which it satisfies certain standards or conditions.” The Oxford Dictionary
of English (2005) defines validation as, “. . . The process of establishing the truth, accuracy, or
validity of something. The establishment by empirical means of the validity of a proposition.”
More relevant to digital evidence is a legal definition of validity and soundness of judicial inference
that involves conceptualisation of the actual event described by language for the trial, while abstract
law approaches actuality as it is interpreted for application in the verdict (Azuelos-Atias, 2007).
Consequently, to apply an abstract law to a concrete occurrence, the occurrence is abstracted and the
law is interpreted.
Verification of the evidence involves complete, objective checking of conformity to some well-defined
specification, whereas validation refers to a somewhat subjective assessment of likely suitability in the
intended environment (A Dictionary of Computing, 2008). If the verification and validation of the existing digital evidence and the additional digital evidence had provided contradictory or ambiguous findings, the case might not have proceeded to trial because of the weakness of the evidence linking the defendant to the crime.
According to Cohen (2006), incomplete scrutiny of the available evidence during the validation stage
of the investigative process and failure to validate the evidence at that point is where the investigation
can fail. But what is sometimes missing in cyber forensics is some formal and practical process of
validating evidence to measure the extent to which the evidence is what it purports to be: a simple,
reliable validation test. The introduction of validation standards and compliance with such standards
should encourage correctness and completeness in cyber forensics analysis. Selecting evidence
relevant to only one party to a case contravenes legal discovery requirements. Failing to validate that
same evidence is unacceptable neglect.
7. VALIDATION ISSUES IN THE CASE STUDY
The prosecution analyst in the Case Study did establish implicit relationships between the exhibits but failed to describe these relationships in full, meaningful terms and certainly provided no proof of
comprehensive evidence validation. These issues seem to be an inherent deficiency in other child
pornography cases examined by the author.
One argument presented by the prosecution analyst in the Case Study suggested the defendant
installed software to delete browser cookies and history cache files automatically and opined this was
done to conceal browsing for child pornography. The general assumption that computer users install
such software for anti-forensics purposes was based on the expert opinion of the analysts and not
backed with any meaningful statistics which may have been admissible had they been available. It was
guesswork and well outside proper expert opinion. The defendant’s explanation, never sought by the prosecution, and later examination of the computer confirmed conclusively that the software related to a browser not installed on his computer; the validity of the claim was never tested before being presented to the jury. Use of such ‘proclivity’ evidence in child pornography trials is always controversial and is on occasion introduced under legislation to bolster prosecution cases. Had it not
been challenged by the defence it is likely that the argument would have been accepted at face value
by the jury. More properly, this evidence should have been debated and rejected as inadmissible in the
absence of the jury based on the outcome of argument between the prosecution analyst and defence
expert and not been allowed to influence the jury unnecessarily.
The prosecution analyst argued that the Internet browser was uninstalled just prior to the seizure of the
evidence to conceal wrongdoing but the files were recovered from the Recycler. The prosecution
lawyer made much of the allegation that the defendant had attempted to uninstall the browser based on
the prosecution analyst’s testimony. The matter of whether the browser was uninstalled or deleted was
patently irrelevant to the argument; the fact that the application was removed was a strong claim the
prosecution could use without needlessly obfuscating the issue. What the prosecution failed to validate
was whether the time of deletion corresponded with the defendant’s access to the computer. The
deletion timestamp was never validated nor was it used to check the defendant’s alibi that he was
elsewhere when the deletion occurred. As it transpired, the browser uninstall operation was a deletion;
appearing to be a panic reaction by the real culprits who were inadvertently warned of the raid during
the absence of the defendant and later most likely perjured themselves when denying all access to the
computer. The reader can be forgiven for believing the prosecution analyst was determined to secure
conviction on clearly dubious facts.
Significantly, the exculpatory evidence located on the computer included correspondence and browser
events that pointed to the use of the computer by the defendant’s tenants who rented an adjacent
building with free and open access to the defendant’s residence. Some of this evidence was recovered
from unallocated space and contained little or no temporal metadata. What it did contain was content
linking it to the tenants, notably Internet access logs and cookies and private documents and
photographs exclusive to the tenants and their friends. It was also possible to calculate the creation
dates of recovered job applications, school projects, photographs of school parties and curriculum
vitae. The creation dates of more than twenty of these files corresponded with periods when the
computer was used to browse for child pornography. This required corroboration and validation of the
evidence against known events in the lives of the tenants. This information was known to the prosecution analyst but was for some reason discounted and ignored.
Browsing activities for child pornography may be corroborated through search histories, downloaded
files, browser caches, and viewing and storage behaviour by the user. There was no attempt made to
link the browsing activities to known or suspected users of the computer, despite the defendant’s
explanation provided at the time of his arrest and subsequent interview. A simple validation check
would have identified exculpatory evidence raising the possibility that others were involved in illicit
browsing and downloading offensive material. Reconstruction of browsing activities is part of the validation process; checking and testing each file is crucial to measuring the truth of the matter. This
must be done before any attempt can be made to test the weight of the evidence.
8. POTENTIAL SOLUTION TO THE VALIDATION ISSUES IN THE CASE STUDY
Validation of the digital evidence presented in the Case Study and in the majority of prosecution cases examined by the author appeared superficial or absent. In themselves, the reports were meaningless to the non-technical lawyers, requiring the author to translate and interpret the evidence for them in
meaningful terms. The prosecution analyst corroborated some digital evidence exhibits presented in
the case but not all, seemingly taking others at face value or interpreting their status to suit his
viewpoint when testifying.
Digital evidence is circumstantial evidence and considered hearsay evidence that may only be
admitted in legal proceedings under established procedures (Anderson & Twinning, 1991). Business
records, including electronic records, are admissible in evidence as an exception to the hearsay rule but are subject to certain requirements, such as being maintained in normal business activities and accompanied by assurances that reasonable measures were taken to protect the authenticity and accuracy of the records
(Chaikin, 2006).
In line with many other jurisdictions, Western Australian jurisprudence gives the benefit of doubt to
accused parties when circumstantial evidence cannot be corroborated. For example, undated offensive
images recovered through data carving of unallocated space will be regarded as inadmissible during a trial if they cannot be corroborated by other evidence (SOWA versus Sabourne, 2010). Consequently, the
corroboration of a digital evidence exhibit may be seen as a mandatory part of the validation process.
From a forensics perspective, the measurement of evidentiary weight requires validation checking and
testing of the admissibility and plausibility of digital evidence, and then confirming corroboration.
Admissible evidence means it was legally acquired; plausible evidence means that it is relevant,
accurate and consistent with known facts; and corroboration means proving the existence of
independent evidence to validate the exhibit and its relationship with the former exhibit.
Some structure is required if validation is to be attempted, and here the author offers a formal validation process, as distinct from an ad hoc, intuitive process, to measure the evidentiary weight of
digital evidence through measurements of its admissibility, plausibility and corroboration. Evidentiary
weight is the strength of the evidence against a pre-set threshold above which the evidence may be
considered likely to be true in a specific legal case.
Validation requires checking and testing of each exhibit as well as the relationship between
corroborating exhibits to measure their evidentiary weights:
Checking is examining an exhibit to measure validity of the data, metadata and relationships with
other exhibits. For example, in the Case Study, the time of deletion of the Internet browser was
obtained from the Recycler, compared with the defendant’s alibi, triangulated with the computer
clock and was then checked against other events to complete the validation process.
Testing uses experiments, existing experiment data or reliable statistics to measure the validity of
the relationship between exhibits. Testing whether the browser was deleted complemented the
testing of the deletion and showed through modelling that deletion and uninstallation left different
artefacts on a computer; a thorough and conclusive exercise that provided a more reliable and
complete reconstruction of events compared with the prosecution analyst’s report.
Using the process, validation is depicted as three steps consisting of an object (the
evidence), a claim (a statement about the evidence) and a test (to validate the claim). As shown in
the example in Figure 3, the process would test the correctness of the timestamp of a photograph
on a storage device. The process starts with a statement that the evidence (Object) was correct
(Claim); the claim is then checked and tested (Test) to produce a result. In this example, six
outcomes are offered, but this may vary according to case context. Whether the weight of the
evidence passes a predetermined, acceptable threshold will depend on the outcomes of the test.
Fig. 3. Validation process to test the evidence.
The testing stage entails measuring the strength of the claim through checking and testing the
admissibility of the evidence (o), the plausibility of the evidence (p) and the corroboration of the evidence
(c) to determine whether the strength of the evidence is higher than a predetermined threshold. A
higher threshold would be appropriate in criminal cases, where a greater burden of proof is placed
on the prosecution and the soundness of the evidence must be established beyond reasonable doubt. In civil
litigation the burden of proof is set at a lower threshold based on the balance of probabilities.
The admissibility of the evidence requires confirmation that the evidence was obtained lawfully and is
relevant to the case (claim, o), although relevance issues may be decided during pre-trial ‘hot-tubbing’
of the analysts and during the trial. If the evidence fails this test it is inadmissible and must
be excised from the case. If the evidence is admissible the claim may be set alongside the plausibility
test to determine its plausibility (claim, p). At this stage it does not matter whether plausibility is
proved to be true or false as long as it is sufficiently reasonable to include it as part of the argument
that will ultimately require validation of each supporting exhibit used in the claim.
Corroboration requires confirmation that the exhibit is corroborated by at least one other valid exhibit, although a
requirement for more than one corroborating exhibit may be factored in depending on the case type.
The claim may then be set alongside corroboration (claim, c). Assuming the conditions of
admissibility are met, examination of the results of the plausibility and corroboration tests can be used
to measure whether the strength of the evidence is higher or lower than the set threshold:
If (c,p) > threshold then the claim has a high degree of probability.
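To illustrate how such a threshold rule might be operationalised, the following sketch uses a hypothetical scoring scheme devised purely for this discussion: it assigns simple numeric scores for admissibility, plausibility and corroboration and compares their combination against a preset threshold. The scores and the product rule are illustrative assumptions, not the formal validation model described here.

```python
# Minimal sketch of the validation threshold rule described above.
# The numeric scores and the combination rule are illustrative assumptions.

from dataclasses import dataclass

@dataclass
class Exhibit:
    admissible: bool      # legally acquired and relevant (o)
    plausibility: float   # 0.0 - 1.0, relevance/accuracy/consistency (p)
    corroboration: float  # 0.0 - 1.0, strength of independent support (c)

def claim_is_probable(exhibit: Exhibit, threshold: float = 0.7) -> bool:
    """Return True if the claim based on this exhibit exceeds the threshold."""
    if not exhibit.admissible:
        # Inadmissible evidence is excised from the case; no further testing.
        return False
    # Combine plausibility and corroboration; a simple product is assumed here.
    strength = exhibit.plausibility * exhibit.corroboration
    return strength > threshold

# Plausible but weakly corroborated evidence fails the threshold.
print(claim_is_probable(Exhibit(True, 0.9, 0.4)))   # False
print(claim_is_probable(Exhibit(True, 0.9, 0.9)))   # True
```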
In the Case Study, the assertion was made that the defendant deliberately uninstalled the browser to
conceal illegal browsing activity. Comparing the time of the removal (later confirmed to be a deletion)
of the browser with the whereabouts of the defendant, confirmed by an alibi, established that the
defendant was not present and could not have been involved. No further testing was required: the
evidence was irrelevant and had to be rejected. As shown in Figure 4, the evidence, while legally acquired
under warrant, was irrelevant to the case, although it was allowed to be presented to the jury. The plausible
argument was demolished by a lack of corroboration and by additional evidence that rebutted the original
argument.
Fig. 4. Inadmissibility of the evidence.
If the evidence is admissible the plausibility and corroboration stages will test and check the evidence
so that the two sets of results may be used to measure the strength of the evidence and determine the
likelihood the claim is true. In the hypothetical example in Figure 5 the claim that the defendant
deliberately deleted the browser to conceal evidence of a crime is a plausible assertion based on the
presence of evidence of deletion and the defendant’s ownership of the computer. It is not proven and
still requires corroboration. Some other evidence, such as evidence linking the defendant to the computer
at the time of the deletion, would bolster the strength of the evidence. If the plausibility and corroboration are
calculated to be above the threshold then the claim is substantiated as being highly probable.
Fig. 5. Example of the plausibility and corroboration of digital evidence.
Testing and checking the validity of these exhibits might require comparison of the timestamps of the
deleted files recovered from the Recycler and the sent email messages, as illustrated in Figure 6. The
email messages would also require examination to determine the plausibility of the messages being
created by the defendant or an impostor.
Fig. 6. Comparison of the corroboratory exhibit.
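As a concrete, hedged illustration of this kind of timestamp comparison, the sketch below checks whether two recorded event times coincide within a tolerance once a known clock offset has been corrected for. The timestamps, offset and tolerance are invented for illustration and are not drawn from the Case Study.

```python
# Illustration of correlating timestamps from two sources, e.g. a file
# deletion time from the Recycler and the sent time of an email message.
# All values below are invented for the example.

from datetime import datetime, timedelta

def timestamps_corroborate(t_primary: datetime,
                           t_corroborating: datetime,
                           clock_offset: timedelta = timedelta(0),
                           tolerance: timedelta = timedelta(minutes=5)) -> bool:
    """True if the two events coincide within the tolerance after
    correcting the corroborating source for a known clock offset."""
    corrected = t_corroborating + clock_offset
    return abs(t_primary - corrected) <= tolerance

deletion_time = datetime(2009, 3, 14, 21, 47)   # from the Recycler record
email_sent    = datetime(2009, 3, 14, 21, 50)   # from the mail server log

print(timestamps_corroborate(deletion_time, email_sent))  # True: within 5 minutes
```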
The checking process for the primary exhibit is shown in Figure 7, which involves
plausibility checking to determine the relevance, accuracy and consistency (unambiguity) of the
evidence. For example: testing whether the timestamps relate to the user or to another user accessing the
Internet; consistency checking to identify ambiguities, such as timestamps that are unclear as to whether they
were edited or changed as a result of some other process; and checking the accuracy of
timestamps to see if the files are relevant to the case.
Fig. 7. Checking the relevance, accuracy and consistency during the plausibility stage.
Figure 8 shows the process continued through to the corroboration stage where the corroborating
exhibit and the relationship between the primary exhibit and the corroborating exhibit are checked and
tested. Corroboration between exhibits may involve: relevance checking to show that the file accessed
on the external drive was the same file shown in the Jump List Log; consistency checking to see if a
deleted file was identical to a file shown in an application log or to other files with the same name; and
accuracy checking of the timestamps to show a correlation between browsing and other critical user
events.
Fig. 8. Checking corroboration and relationships between exhibits.
Presenting the findings of the analysis and the validation checking on which the claims are evaluated
confronts forensics examiners and prompts more thorough examination. This sub-argument, based on
the validation testing in the Case Study, could be represented using Toulmin’s model as shown in
Figure 9. The reservation (rebuttal) statements are based on testing the plausibility and corroboration
of relevant exhibits, and cast a different outlook on the case than the prosecution argument.
Validation issues are shown, such as evidence that the deletion of the browser was incompatible with
the defendant’s alibi.
Of course, if appropriate, the rebuttal may also be rebutted by the analyst and legal team, and so forth.
The point is that the argument is clearer and the evidence relied upon is that much easier to understand
and evaluate, providing both parties with a more reliable prognosis.
Fig. 9. A sub-argument showing a strong rebuttal.
The process can be used to measure each individual thread of evidence as well as to measure the
combined weight of the evidence that comprises the trial case. The process offers some structural form to
ensure that all relevant evidence is validated and presented with greater clarity in child pornography
cases and potentially in a range of other digital evidence-based cases. The process minimises the
chance that key validation issues are overlooked or trivialised. Ideally, it provides the means to
measure the strength of each exhibit and evidence string that form part of the case, although this Case
Study does not address the complexities of combining and measuring results of the process that affect
the validity of the evidence.
9. CONCLUSION
The Case Study identifies issues of inadequate reasoning during analysis, compounded by poor
presentation of even the basic facts, further degraded by inadequate validation of the digital evidence.
Innocent or not, all facing the court should expect that the evidence presented is complete and correct.
It is reasonable to expect that rigorous processes were used to measure its completeness and
correctness; that the evidence is valid and is what it purports to be. Ideally, the trial should proceed
with the best evidence available and with a degree of confidence in its integrity.
Validation seeks a rigorous examination of the admissibility, plausibility and corroboration of the
evidence. Presenting the evidence that has undergone complete validation using well-established
argument models such as Toulmin’s model has much to commend it. Correct and complete validation
of digital evidence, presented with clarity, offers great benefits to the forensics
examiner and the legal practitioner. It allows the evidence analysis to be independently scrutinised and
audited; it makes the examiner accountable; and it engenders thorough and diligent analysis of a high
standard. Most importantly, it seeks the truth of the matter without bias, a fundamental hallmark
of forensic science.
The author has adopted Toulmin’s model and the validation processes in case analysis presentations
and already notes improved efficiencies in case management and communication with legal
practitioners. Further research into refining the validation processes to serve a range of different case
scenarios appears worthwhile.
AUTHOR BIOGRAPHY
Mr. Richard Boddington holds a B.Sc (Hons) 1st Class and is completing Ph.D. research in digital
evidence validation at Murdoch University, Australia where he teaches and researches information
security and cyber forensics. He has a police and security intelligence background and provides cyber
forensic analysis and expert testimony for the legal fraternity in a range of civil and criminal cases.
ACKNOWLEDGEMENTS
Sincere thanks are extended to Drs. Valerie Hobbs and Graham Mann of Murdoch University for their
support and feedback during the preparation of the paper.
REFERENCES
A Dictionary of Computing. (2008). Eds. John Daintith and Edmund Wright. Oxford University Press,
Oxford Reference Online. Accessed 21 December 2010
http://0-www.oxfordreference.com.prospero.murdoch.edu.au/views/ENTRY.html?subview=Main&ent
ry=t11.e5680
A Dictionary of Psychology. (2010). Ed. Andrew M. Colman. Oxford University Press, Oxford
Reference Online. Accessed 21 December 2010
http://0-www.oxfordreference.com.prospero.murdoch.edu.au/views/ENTRY.html?subview=Main&ent
ry=t87.e8721
Anderson, T., & Twining, W. (1991). Analysis of evidence: How to do things with facts based on
Wigmore's Science of Judicial Proof. Evanston, IL: Northwestern University Press.
Australian Law Dictionary. (2012). Accessed 20 January 2012:
http://0-www.oxfordreference.com.prospero.murdoch.edu.au/views/ENTRY.html?subview=Main&ent
ry=t317.e10
Azuelos-Atias, S. (2007). A pragmatic analysis of legal proofs of criminal intent. Philadelphia: J.
Benjamins Pub. Co.
Berk, R. A. (1983). An introduction to sample selection bias in sociological data. American
Sociological Review, 48, 386 - 398.
Caloyannides, M. A. (2003). Digital evidence and reasonable doubt. IEEE Security and Privacy, 1(6),
89 - 91.
Carrick, D. (2010). The Chamberlain case: The lessons learned. Melbourne: ABC Radio National.
Chaikin, D. (2006). Network investigations of cyber attacks: The limits of digital evidence. Crime Law
& Social Change, 46, 239 - 256.
Cohen, F. (2006). Challenges to digital forensic evidence. Accessed 22 June, 2006, from
http://all.net/Talks/CyberCrimeSummit06.pdf.
Dardick, G. S. (2010). Cyber forensic assurance. Paper presented at the 8th Australian Digital
Forensics Conference.
Flusche, K. J. (2001). Computer forensic Case Study: Espionage, Part 1 Just finding the file is not
enough! Information Security Journal, 10(1), 1 - 10.
George, E. (2004). Trojan virus defence: Regina v Aaron Caffrey, Southwark Crown Court. Digital
Investigation, 1(2), 89.
Guidelines for the management of IT evidence: A handbook (HB171). (2003).
Inman, K., & Rudin, N. (2001), Principles and Practices of Criminalistics: The Profession of Forensic
Science. CRC Press: Boca Raton, Florida.
Jones, A. (2011, November). Meet the DF Professionals. Digital Forensics, 9, 37 – 38.
Koehler, J. J., & Thompson, W. C. (2006). Mock jurors’ reactions to selective presentation of
evidence from multiple-opportunity searches: American Psychology-Law Society/Division 41 of the
American Psychological Association.
Lemos, R. (2008). Lax security leads to child-porn charges [Electronic Version]. Security Focus.
Accessed 22 November 2008 from http://www.securityfocus.com/brief/756.
Palmer, G. L. (2002). Forensic analysis in the digital world. International Journal of Digital Evidence,
1(1).
Pollitt, M. M. (2008). Applying traditional forensic taxonomy to digital forensics. Advances in Digital
Forensics IV IFIP International Federation for Information Processing, 285, 17 - 26.
Saunders, K. M. (1994). Law as Rhetoric, Rhetoric as Argument. Journal of Legal Education, 44,
566.
State of Western Australia versus Sabourne. (2010). Perth District Court.
State of Western Australia versus Buchanan. (2009). Perth District Court.
Sippl, C. J., & Sippl, R. J. (1980). Computer Dictionary (3rd ed.). Indianapolis: Howard W Sams &
Co.
Sydney Morning Herald. (2012). Accessed 8 February 2012
http://www.smh.com.au/national/growing-alarm-over-child-porn-epidemic-20120207-1r667.html
The Oxford Dictionary of English. (2010). (revised edition). Eds. Catherine Soanes and Angus
Stevenson. Oxford University Press, Oxford Reference Online. Accessed 21 December 2010
http://0-www.oxfordreference.com.prospero.murdoch.edu.au/views/ENTRY.html?subview=Main&entry=t140.e85868
Toulmin, S. E. (1958). The uses of argument. Cambridge: University Press.
Walton, D. (2004). Abductive Reasoning. Tuscaloosa: The University of Alabama Press.
AFTER FIVE YEARS OF E-DISCOVERY MISSTEPS:
SANCTIONS OR SAFE HARBOR?
Milton Luoma
Metropolitan State University
700 East Seventh Street
St. Paul, Minnesota 55106
651.793.1481
651.793.1246 (fax)
[email protected]
Vicki Luoma
Minnesota State University
145 Morris Hall
Mankato, Minnesota 56001
507.389.1916
507.389.5497(fax)
[email protected]
ABSTRACT
In 2003 the Zubulake case became the catalyst of change in the world of e-discovery. In that case
Judge Shira Scheindlin of the United States District Court for the Southern District of New York set
guidelines for e-discovery that served as the basis for amending the Federal Rules of Civil Procedure
(FRCP) in December 2006. The amendments incorporated a number of concepts that were described
by Judge Scheindlin in the Zubulake case (Zubulake v. UBS Warburg LLC, 2003). Since the
Zubulake case and the FRCP amendments, numerous cases have interpreted these rule changes, but
one of the main points of court decisions is the preservation of electronically stored information
(ESI). A litigation hold to preserve ESI must be put into place as soon as litigation is reasonably
anticipated. The failure to preserve ESI has been the most common ground on which judges have
imposed sanctions, but certainly not the only one. This paper reviews the cases to answer the question:
are the courts granting safe harbor protection when litigants fail to follow the rules and best
practices, rather than imposing sanctions?
Keywords: e-discovery, electronic discovery, sanctions, safe harbor, electronically stored information, ESI
1. ELECTRONICALLY STORED INFORMATION AND THE PROBLEM
The biggest problem in complying with discovery requests is the enormous amount of electronically
stored information that exists. According to a study conducted at Berkeley in 2003, more than five
exabytes of information were stored electronically. Five exabytes of information would be the same
as 37,000 libraries the size of the Library of Congress. (Lyman, 2003) They further predicted that,
based on past growth rates, the amount of ESI doubles every three years. (Lyman, 2003) To put this
information in visual terms, one exabyte of information is equal to roughly 500 trillion
(500,000,000,000,000) typewritten pages of paper. (Luoma M. V., 2011) Beyond the sheer volume of data, other problems
include how to determine which of the information is relevant to the litigation at hand, how to retrieve
it, in what format it must be provided to the opposing party, and most importantly, how to ensure that none of
the relevant or potentially relevant data has been deleted or lost.
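As a rough check on the page-count figure above, the following back-of-the-envelope calculation assumes approximately 2 kilobytes of plain text per typewritten page; the page size is an illustrative assumption rather than a figure taken from the cited studies.

```python
# Back-of-the-envelope check of the exabyte-to-pages figure.
# Assumes roughly 2 KB of plain text per typewritten page (an assumption
# made for illustration, not drawn from the cited studies).

BYTES_PER_PAGE = 2_000          # ~2 KB per page
EXABYTE = 10**18                # one exabyte in bytes

pages_per_exabyte = EXABYTE // BYTES_PER_PAGE
print(f"{pages_per_exabyte:,} pages")   # 500,000,000,000,000 pages (~500 trillion)
```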
Yet another problem with electronically stored information is that it contains metadata. If the metadata
is not provided in response to a discovery request, court sanctions are a real possibility. Metadata is
data about data, so when a discovery request demands data it normally includes the metadata. Litigants
must ensure that the metadata is intact and not altered. Metadata will not only provide information
concerning the creator, recipient, and creation dates and times, but also whether there have been any
alterations or deletions, the identity of the person making these changes, and how and when the
alterations were made. It can also provide a timeline.
Three types of metadata are of interest, namely file metadata, system metadata, and embedded
metadata. File metadata includes information about the creator, reviser and editor, and the relevant
dates and times. It is important because, if data has been altered, this metadata will show that
information; it is also one area in which spoliation occurs, since metadata can be altered or deleted
if the collection of data is not done properly. System metadata is normally recovered from an
organization’s IT system and will include the path a file took and where it is located on a hard drive or
server; this metadata is often held in databases in the computer system. Embedded metadata contains the
data, content and numbers that are not found in the native format of a document. One example of
embedded data would be the formulas used in a spreadsheet. Metadata can also demonstrate an
evidence timeline.
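As a small illustration of the file metadata discussed above, the sketch below reads basic file-system metadata (size and several timestamps) using only standard library calls. The file name is a placeholder created for the example, and the fields shown are only a fraction of what forensic collection tools capture.

```python
# Reading basic file-system metadata with the standard library. Real
# collections use forensic tooling that preserves these values rather
# than reading them from a live system.

import os
from datetime import datetime, timezone

def file_metadata(path: str) -> dict:
    """Return a small subset of file-system metadata for a file."""
    st = os.stat(path)
    to_utc = lambda ts: datetime.fromtimestamp(ts, tz=timezone.utc).isoformat()
    return {
        "size_bytes": st.st_size,
        "modified": to_utc(st.st_mtime),         # last content change
        "accessed": to_utc(st.st_atime),         # last access (often unreliable)
        "metadata_changed": to_utc(st.st_ctime), # inode/attribute change on Unix
    }

# Create a throwaway file so the example runs end to end.
with open("example_document.txt", "w") as f:
    f.write("sample content")

print(file_metadata("example_document.txt"))
```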
In a 2005 case, Williams v. Sprint, the court set the standard concerning metadata when it ruled as
follows:
When a party is ordered to produce electronic documents as they are maintained in the
ordinary course of business, the producing party should produce the electronic
documents with their metadata intact, unless that party timely objects to production of
metadata, the parties agree that the metadata should not be produced, or the producing
party requests a protective order. (Williams v. Sprint/United Mgmt. Co., 2005)
Metadata is an important element in discovery. It can be used as a search tool to help find
relevant documents, to discover attempts at hiding or deleting information, and to reveal
who knew what when. For example, in the Martha Stewart case, in which she was found
guilty of four counts of obstruction of justice and lying to investigators about a stock sale, it was the
metadata that showed she had made alterations to her electronic calendar (Securities and Exchange
Commission v. Martha Stewart and Peter Bacanovic, 2003), and in the Qualcomm case metadata
helped to locate emails that Qualcomm had denied receiving. (Luoma & Luoma, 2009) The sheer volume
of information and the layers of information available to a litigant make discovery a difficult dilemma.
When does a party have enough information?
2. PRIOR TO THE CIVIL RULE CHANGES
The former version of the FRCP required litigants to comply with discovery requests in an effort to
prevent surprises at trial and to ensure a fair trial. Although some courts interpreted the rules to
include electronic data as well as paper data, there were no well-established guidelines in this matter.
The case that was the catalyst for change in electronic discovery was the Zubulake case. The Zubulake
case began as a gender discrimination case, but became the first authoritative case in the United States
on electronic discovery issues. ( Zubulake v. UBS Warburg LLC, 2003)
The Zubulake case forced the legal community to review the rules of civil procedure and make the
necessary changes to the Federal Rules of Civil Procedure to include specific instructions concerning
electronically stored information. In 1999 Laura Zubulake was hired as the director and senior
salesperson of the UBS Asian Equities Sales Desk and was told that she would be considered for her
supervisor’s position when he left. When Laura’s supervisor took another position, another male was
given that position without consideration of any other candidates, including Laura Zubulake. Her new
supervisor made it clear that he did not feel a woman belonged in Laura’s position and earning
$650,000 a year. Laura Zubulake complained that her new supervisor made sexist remarks, excluded
her from office activities and demeaned her in the presence of her co-workers. In response to this
treatment Laura Zubulake filed a gender discrimination claim with the EEOC against her employer
UBS Warburg LLC. She lost the case and was fired. She then brought suit against her employer under
federal law, New York state law, and New York City law for both gender discrimination and
retaliation. ( Zubulake v. UBS Warburg LLC, 2003)
This case was atypical in that there were five pre-trial motions dealing with discovery issues. When
Warburg could not produce the desired emails and other electronic documents requested, Zubulake
brought a motion requesting access to UBS backup media. UBS responded by asking the court to shift
the cost of restoring the backup tapes to the plaintiff. What followed were four more motions to
resolve discovery issues. During the restoration effort it was determined that many backup tapes were
missing and some were erased. The court ordered that a sampling of the backup tapes be done. From these
results it was evident that Zubulake’s supervisor had taken steps to conceal or destroy particularly
relevant emails. As a result of this action, the court set standards for retention and deletion
requirements, litigation holds and cost shifting. It became clear after the Zubulake case that the old
rules of civil procedure were no longer adequate. Clarification had to be made concerning electronic
discovery. The guidelines forged by the judge in the Zubulake case were debated in setting new rules
with trial adoption of the rule changes in December 2006 and permanent adoption one year later. (
Zubulake v. UBS Warburg LLC, 2003)
3. NEW FEDERAL RULES
In April of 2006 the U.S. Supreme Court determined that the amount of electronically stored information
involved in discovery requests required clarification and standards. The changes in the Federal Rules of
Civil Procedure that dealt with ESI issues, and that took effect on December 1, 2006, included Rules 16, 26,
34 and 37. Rule 26 provided that parties must supply an inventory, description and location of
relevant ESI and required the parties to meet and develop a discovery plan. Rule 34 set out rules for
document requests and Rule 37 addressed the safe harbor provision.
Rule 26(a)(1)(A)(ii) requires the litigants to provide the following:
(ii) a copy — or a description by category and location — of all documents, electronically
stored information, and tangible things that the disclosing party has in its possession, custody,
or control and may use to support its claims or defenses, unless the use would be solely for
impeachment; (Federal Rules of Civil Procedure, 2006)
This revision requires a party to disclose the information it may use to support its claims or defenses
without waiting for the opposing party to ask for it. The provision requires attorneys who have
traditionally been adversaries to cooperate. In addition, Rule 26(f) requires that the parties confer as
soon as practicable, and in any event at least 21 days before a scheduling conference is to be held or a scheduling order is due under Rule
16(b). (Federal Rules of Civil Procedure, 2006)
4. SANCTIONS
Since the Zubulake case it has been clear that parties must put a litigation hold in place as soon as
litigation is known or reasonably anticipated. That one issue has been the primary reason for courts imposing
sanctions on a party. However, there are other reasons for sanctions that have arisen since the Federal
Rules of Civil Procedure were modified. Those reasons have included data dumping, data wiping,
intentional destruction and certainly failure to have a litigation hold.
4.1 Data Dumping
If a producing party provides too much information it may be guilty of data dumping, while
demanding too much information can result in cost shifting and a litigation nightmare. Failure to retain
electronic data in a retrievable format for litigation that a litigant knew, or should have known, might be
imminent can also result in sanctions.
In a 2008 case, ReedHycalog v. United Diamond, a Texas court found that in discovery the producing
party must cull out irrelevant information from document production and must not engage in data
dumping. (ReedHycalog v. United Diamond, 2008) The court ruled that “there are two ways to lose a
case in the Eastern District of Texas: on the merits or by discovery abuse.” (ReedHycalog v. United
Diamond, 2008) The court stated, “While these provisions generally ensure that parties do not
withhold relevant information, a party may not frustrate the spirit of discovery rules — open,
forthright, and efficient disclosure of information potentially relevant to the case — by burying
relevant documents amid a sea of irrelevant ones.” (ReedHycalog v. United Diamond, 2008)
In a 2010 data dumping case the court did not award sanctions, even though the defendant dumped
the computer hard drive on the plaintiff, because the plaintiff did not make the motion in a timely
fashion. However, the court did sanction the defendant for other discovery violations and warned that, if the
motion on dumping had been timely, there would have been additional sanctions. (Cherrington Asia
Ltd. v. A & L Underground, Inc, 2010)
In yet another data dumping case, a third party produced three servers in response to a subpoena and
court orders without conducting a review for either privilege or responsiveness. Later the party asked
the court for the right to search the 800 GB of data and 600,000 documents for relevant materials at its own cost,
in exchange for the return of any privileged documents. The court found that the party had made a
voluntary disclosure that resulted in a complete waiver of applicable privileges. The court pointed out
that in nearly three months the party had not flagged even one document as privileged, so the court
rejected its "belatedly and casually proffered" objections as "too little, too late." (In re Fontainebleau
Las Vegas Contract Litig., 2011)
In a 2011 District of Columbia case, the defendants had produced thousands of e-mails just days
before trial and continued to "document dump" even after the trial ended. The court found
"repeated, flagrant, and unrepentant failures to comply with Court orders" and "discovery abuse so
extreme as to be literally unheard of in this Court." The court also repeatedly noted the defendants'
failure to adhere to the discovery framework provided by the Federal Rules of Civil Procedure and
advised the defendants to invest time spent "ankle-biting the plaintiffs" into shaping up its own
discovery conduct. (DL v. District of Columbia, 2011)
4.2 Failure to Maintain a Litigation Hold
Failing to maintain a litigation hold when litigation can be reasonably anticipated is another ground for
court-imposed sanctions. In one case, Kolon executives and employees had deleted thousands of emails and other
records relevant to DuPont’s trade secret claims. The court fined the company’s attorneys and
executives, reasoning that they could have prevented the spoliation through an effective litigation hold
process. Three litigation hold notices were sent, but all were deficient in some manner. (E.I. du Pont de
Nemours v. Kolon Industries, 2011) The court issued a severe penalty against defendant Kolon
Industries for failing to issue a timely and proper litigation hold. The court gave an adverse inference
jury instruction that stated that Kolon executives and employees destroyed key evidence after the
company’s preservation duty was triggered. The jury responded by returning a stunning $919 million
verdict for DuPont.
4.3 Data Wiping
In a recent shareholders’ action in Delaware extreme sanctions were ordered against the defendant
Genger. Genger and his forensic expert used a wiping program to remove any data left on the
unallocated spaces on his ESI sources after a status quo order had been entered and after the opposing
side had searched his computer and obtained all ESI. The sanctions included attorney fees of
$750,000, costs of $3.2 million, changing the burden of proof from “a preponderance of the evidence”
to “clear and convincing evidence,” and requiring corroboration of Genger’s testimony before it would
be admitted in evidence. On appeal, the defendant argued the sanctions were disproportionate and
excessive since he erased only unallocated free space and did not erase this information until after the
plaintiff had searched all of the data sources. Further, the status quo order did not specifically mention
preserving the unallocated space on the computer. The defendant also argued that normal computer
use would likely cause overwriting of unallocated space to occur. Further, if this order were affirmed
by the court, computer activities would have to be suspended every time a discovery order was issued. In
2011 the Supreme Court of Delaware upheld the sanctions. (Genger v. TR Investors, LLC, 2011)
In a patent infringement case, plaintiffs asked for sanctions against the defendants for their failure to
produce relevant electronic documents. The defendants’ excuse was that their e-mail servers were not
designed for archival purposes. The company policy was that employees should preserve valuable e-mails. The court refused to grant safe harbor protection, citing the forensic expert’s declaration, which
failed to state that the destruction was the result of a "routine, good-faith operation." (Phillip M. Adams &
Assocs. LLC v. Dell, Inc., 2009)
In another case, almost every piece of media ordered to be produced was wiped, altered or destroyed.
In addition, the last modified dates for critical evidence were backdated and modified. The court
found that the plaintiff had acted in "bad faith and with willful disregard for the rules of discovery" and
ordered a default judgment, dismissed the plaintiff's complaint with prejudice and ordered $1,000,000
in sanctions. In addition, the plaintiff's counsel was ordered to pay all costs and attorney fees for their
part in the misconduct. (Rosenthal Collins Grp., LLC v. Trading Tech. Int'l Inc., 2011)
4.4 Social Media
In a social media case an attorney was sanctioned for having his client purge unflattering
posts and photos from his Facebook account. The court found that the attorney, Murray, told his client,
the plaintiff in a wrongful death suit brought after his wife was killed in an auto accident, to remove
unflattering photos, including one in which the distraught widower was holding a beer and wearing a t-shirt emblazoned with “I [heart] hot moms.” (Lester v Allied Concrete et al, 2012) Murray instructed
his client, through his assistant, to “clean up” his Facebook account. Murray’s assistant emailed the
client: “We do not want blow ups of other pics at trial, so please; please clean up your Facebook and
MySpace!!” (Lester v Allied Concrete et al, 2012) The attorney was fined $522,000 even though he
argued that he did not consider the social media site an ESI site. (Lester v Allied Concrete et al, 2012)
In another interesting case, several key employees intentionally and in bad faith destroyed 12,836 emails and 4,975 electronic files. The defendant argued that most of these files were recovered and thus
the plaintiffs were not harmed. Declaring these deletions significant in substance and number, the
court imposed an adverse inference instruction and ordered payment of attorney fees and costs
incurred as a result of the spoliation. (E.I. Du Pont De Nemours & Co. v. Kolon Indus., Inc, 2011)
In another case, the plaintiff sought default judgment sanctions, alleging the defendants intentionally
deleted relevant ESI by lifting a litigation hold, erasing a home computer, delaying preservation of
computer hard drives and deleting files, defragmenting disks, and destroying server backup tapes,
ghost images, portable storage devices, e-mails and a file server. The court determined that a default
judgment was appropriate given the defendants’ “unabashedly intentional destruction of relevant,
irretrievable evidence” and egregious conduct. (Gentex Corp. v. Sutter, 2011)
4.5 Egregious Behavior
After five years of case law, rules, and The Sedona Conference best practices principles, there are still
cases that so blatantly flout the rules and good practices that all that can be said is: what were they
thinking? In an intellectual property dispute, the plaintiff and third-party defendants appealed the
district court’s decision to grant default judgment as a sanction for ESI spoliation. The court found that
the plaintiff willfully and in bad faith destroyed ESI. The plaintiff and his employees videotaped the
employees talking about their deliberate destruction of the potentially harmful evidence. The employees
also tossed one laptop off a building and drove a car over another one, and one
employee said, '[If] this gets us into trouble, I hope we’re prison buddies.' Finding this behavior
demonstrated bad faith and a general disregard for the judicial process, the court affirmed the default
judgment and award of attorneys' fees and costs. (Daynight, LLC v. Mobilight, Inc, 2011)
In a products liability case, the plaintiff sought to re-open a case and asked that sanctions be ordered
against the defendant for systematically destroying evidence, failing to produce relevant documents and
committing other discovery violations in bad faith. The plaintiff's attorney uncovered documents
requested in the litigation nearly a year after trial ended when conducting discovery in another case.
The court determined the unproduced e-mails were extremely valuable and prejudiced the plaintiff.
Further, the court found the defendant's discovery efforts were unreasonable. The defendant put one
employee who was admittedly "as computer literate —illiterate as they get" in charge. In addition, the
defendant failed to search the electronic data, failed to institute a litigation hold, instructed employees
to routinely delete information and rotated its backup tapes, thus permanently deleting data. The court
ordered the defendant to pay $250,000 in civil contempt sanctions. The unusual part of the order
was that the court imposed a "purging" sanction of $500,000, extinguishable if the defendant furnished
a copy of the order to every plaintiff in every lawsuit proceeding against the company for the past two
years and filed a copy of the order with its first pleading or filing in all new lawsuits for the next five
years. (Green v. Blitz U.S.A., Inc., 2011)
5. THE SAFE HARBOR
The court has great leeway in imposing sanctions on litigants who fail to produce relevant materials to
the opposing party or who have tampered with or destroyed information or metadata. Sanctions can
vary from fines, attorney fees, an award of costs, and adverse-inference instructions to outright
dismissal of the case. However, in an effort to distinguish between inadvertent mistakes and outright
obstruction, Federal Rule 37(f) provides: “Absent exceptional circumstances, a court may not impose
sanctions under these rules on a party for failing to provide electronically stored information lost as a
result of the routine, good-faith operation of an electronic information system.” (Federal Rule of Civil
Procedure, 2006)
This rule indicates that parties could implement and rely on a document retention and deletion
policy that provides for the routine and frequent destruction of electronic evidence and still be
protected from civil sanctions when they are unable to produce relevant electronic evidence.
However, case law has not always followed that simple interpretation. With all this information, from
rules to cases to think tanks on best practices, the question that remains is whether the safe harbor
provision protects litigants, and whether it matters if the failure is based on ignorance, mistake or
willfulness.
5.1 Ignorance
In a 2011 case the defendants sought sanctions alleging the plaintiff failed to produce relevant
information and further was guilty of spoliation of evidence by donating her personal computer to an
overseas school after commencing the action. The court found that the plaintiff did not act in bad faith
and that, unlike corporate parties, the plaintiff in this case was unsophisticated and unaccustomed to
preservation requirements. Further, the court stated that the e-mails were withheld under good faith
misconceptions of relevance and privilege. Therefore, the court found no evidence of prejudice or bad
faith and declined to impose sanctions. (Neverson-Young v. Blackrock, Inc, 2011)
5.2 Mistake
In another 2011 case, the plaintiff brought a motion for sanctions in a wrongful termination case
alleging the defendants failed to: (1) issue a prompt litigation hold resulting in the destruction of
electronically stored information (ESI); and (2) provide emails in their native file format, producing
them in paper instead. The court denied the plaintiff’s motion for sanctions holding that FRCP 37(e)
grants safe harbor for the defendants’ automated destruction of emails based on their existing
document retention policy. (Kermode v. University of Miss. Med. Ctr, 2011)
5.3 Lack of Good Faith
In a 2008 case, the defendants sought a spoliation jury instruction alleging the plaintiff failed to
preserve and produce requested copies of critical e-mails. Further, the plaintiff failed to notify the
defendants that evidence had been lost prior to the defendant sending an employee out to the plaintiff’s
site to inspect it. The court determined there was an absence of bad faith and denied both requests for
sanctions. (Diabetes Ctr. of Am. v. Healthpia Am., Inc, 2008)
In a 2011 products liability litigation case, the plaintiffs sought ESI sanctions against the defendants
for failure to preserve data. Despite the discovery violations alleged by the plaintiff, including the
failure to preserve and produce relevant e-mails, the court noted that procedural defects and the Rule
37(e) safe harbor provision barred the imposition of sanctions, as the e-mails were deleted as part of a
routine system operation.
5.4 Not Enough Evidence
In another 2011 case, an employment law matter, the court refused to compel data restoration or to make a
finding of bad faith. The defendant requested damages from the plaintiff (the former employee) for
erasing data from his company-issued laptop. The employee claimed that he did not have
any way of removing his personal data. The court found that it was far from clear whether the plaintiff
deleted the files in bad faith and that there was a lack of evidence that the defendant was harmed.
6. EXTREME MEASURES
In September 2011 Chief Judge Randall R. Rader introduced a Model Order to be used in patent cases
that severely limited electronic discovery and the production of metadata. For example, item 5 of the
Model Order read: “General ESI production requests under Federal Rules of Civil Procedure 34 and 45
shall not include metadata absent a showing of good cause.” (Rader, 2011)
This order is contrary to Sedona Principle 12, which holds that the production of data should take into
account “the need to produce reasonably accessible metadata that will enable the receiving party to
have the same ability to access, search, and display the information as the producing party where
appropriate or necessary in light of the nature of the information and the needs of the case.” (Sedona,
2009)
7. CONCLUSION
In conclusion, since the new Federal Rules of Civil Procedure, especially Rule 26(f), took effect in
2006, both federal and state courts have found sanctions appropriate in an increasing number of cases.
The probability of receiving sanctions seems to depend on the harm the missing information has caused
or can cause, as well as the bad faith involved in the failure to provide discovery. The more severe the
bad faith conduct, the more severe the sanction imposed. Often the litigant has committed numerous
acts of misconduct that result in adverse-inference jury instructions and summary judgment.
In a landmark case decided by Judge Paul Grimm, the defendant responded to discovery using
boilerplate objections. Judge Grimm ruled that use of such broad objections violates FRCP 33(b)(4)
and 34(b)(2). Judge Grimm stated that the parties must have a meet-and-confer conference prior to
discovery and discuss any controversial issues, including timing, costs, and the reasonableness of the
discovery request in proportion to the value of the case. (Mancia v. Mayflower Textile Servs. Co,
2008)
Lawyers have been trained to be advocates and adversaries and to represent their clients zealously.
However, with the volume of ESI today, lawyers must redefine what zealous representation means.
Cooperation in the discovery process by full disclosure of the ESI and cooperation in devising a
discovery plan will save the client money in the long run, avoid sanctions, and allow a full and fair
trial on the merits.
REFERENCES
Securities and Exchange Commission v. Martha Stewart and Peter Bacanovic, 03 CV 4070 (NRB)
Complaint (United States District Court, Southern District of New York 2003).
Zubulake v. UBS Warburg LLC, 220 F.R.D. 212 (S.D.N.Y 2003).
E.I. du Pont de Nemours v. Kolon Industries, 3:09cv58 (E.D. VA July 21, 2011).
Cherrington Asia Ltd. v. A & L Underground, Inc, WL 126190 (D. Kansas January 8, 2010).
Daynight, LLC v. Mobilight, Inc, WL 241084 (Utah App Jan 27, 2011).
Diabetes Ctr. of Am. v. Healthpia Am., Inc, WL 336382 (S.D. Texas Feb 5, 2008).
DL v. District of Columbia, No. 05-1437 (D.D.C. May 9, 2011).
E.I. Du Pont De Nemours & Co. v. Kolon Indus., Inc, WL 2966862 (E.D. VA July 21, 2011).
Federal Rule of Civil Procedure, 37(f) (2006).
Federal Rules of Civil Procedure, Rule 26 (A)(1)(ii) (2006).
Federal Rules of Civil Procedure, Rule 26(f) (December 2006).
Genger v. TR Investors, LLC, WL 2802832 (Del Supreme July 18, 2011).
Gentex Corp. v. Sutter, WL 5040893 (M.D. PA October 24, 2011).
Green v. Blitz U.S.A., Inc., WL 806011 (E.D. Tex March 1, 2011).
In re Fontainebleau Las Vegas Contract Litig., WL 65760 (S.D. Fla Jan 7, 2011).
Kermode v. University of Miss. Med. Ctr, 3:09-CV-584-DPJ-FKB (S.D. Jul 1, 2011).
Lester v Allied Concrete et al, CL108-150 (CC Virginia 2012).
Luoma, M., & Luoma, V. (2009). Qualcomm v. Broadcom: Implications for Electronic Discovery. Proceedings of
the 7th Australian Digital Forensics Conference. Perth: Edith Cowan University School of Computer
and Information Science.
Luoma, M. V. (2011). ESI a Global Problem. International Conference on Technology and Business
Management, Dubai, UAE.
Lyman, P. V. (2003). How Much Information? Retrieved August 6, 2010, from University of
California Berkeley School of Information Management and Systems: www.sims.berkeley.edu/how-much-info
Mancia v. Mayflower Textile Servs. Co, WL 4595275 (D. MD October 15, 2008).
Neverson-Young v. Blackrock, Inc, WL 3585961 (S.D.N.Y. August 11, 2011).
Phillip M. Adams & Assocs., LLC v. Dell, Inc, WL 910801 (D. Utah March 30, 2009).
Rader, R. (2011, September). Model Order. Retrieved January 21, 2012, from An E-Discovery Model
Order:
http://memberconnections.com/olc/filelib/LVFC/cpages/9008/Library/Ediscovery%20Model%20Order.pdf
ReedHycalog v. United Diamond, Lexis 93177 (E.D. Texas October 3, 2008).
Residential Funding Co. v. DeGeorge Financial Co, 306 F.3d 99 (2d Cir 2003).
Rosenthal Collins Grp., LLC v. Trading Tech. Int'l Inc., WL 722467 (N.D. Ill Feb 23, 2011).
Sedona. (2009). Commentary on Achieving Quality in the E-Discovery Process. Sedona conference.
Sedona: Sedona Publishers.
Victor Stanley, Inc. v. Creative Pipe, Inc., WL 3530097 (D. MD 2010).
Williams v. Sprint/United Mgmt. Co., LEXIS 21966 (U.S. Dist. Kansas September 20, 2005).
DIGITAL EVIDENCE EDUCATION IN SCHOOLS OF LAW
Aaron Alva
Barbara Endicott-Popovsky
Center for Information Assurance and Cybersecurity
University of Washington
4311 11th Ave NE Suite 400 Box 354985
Seattle, Washington 98105
[email protected], [email protected]
ABSTRACT
An examination of State of Connecticut v. Julie Amero provides insight into how a general lack of
understanding of digital evidence can cause an innocent defendant to be wrongfully convicted. By
contrast, the 101-page opinion in Lorraine v. Markel American Insurance Co. provides legal
precedent and a detailed consideration of the admission of digital evidence. An analysis of both
cases leads the authors to recommend additions to law school curricula designed to raise the
awareness of the legal community and to ensure that such travesties of justice as occurred in the Amero
case do not happen in the future. Work underway at the University of Washington designed to address this
deficiency is discussed.
Keywords: digital forensics, law education, ESI, admissibility, evidence
1. INTRODUCTION / BACKGROUND
There is an alarming gap in the legal and judicial community’s understanding of digital evidence. This
was brought home vividly after one of the authors received a call from an attorney acquaintance
seeking advice on a case involving incriminating emails that were pending admission as evidence in a
contentious divorce case. The accused wife was an injured Iraqi War veteran whose husband sought
dissolution, as well as all of the couple’s assets and full custody of their children, based on her alleged
crimes of cyber stalking. As evidence, the husband’s lawyer provided paper copies to the judge of
incriminating emails that did indeed appear to emanate from the wife’s account. Without legal
challenge, the judge was inclined to admit them. Endicott-Popovsky recommended that the related digital
email files be requested from the husband’s lawyer, who agreed to forward them immediately. Three
weeks later, the digital email files still were not forthcoming, and the lawyer and his client dropped
the allegations that the wife had violated Washington State’s strict cyber stalking statutes and
withdrew the ”evidence” from the judge’s consideration. You can draw your own conclusion why this
“evidence,” initially so compelling according to the other side, was quietly withdrawn.
This entire episode gave pause. Had the wife’s attorney not known of one author’s interest in digital
forensics, he might not have called. Had the judge then admitted the evidence proposed by the
husband’s counsel, a serious miscarriage of justice almost surely would have ensued. The tragedy of a
veteran of active duty service, and a woman at that, officially being labeled a cyberstalker, stripped of
her parental rights and losing her assets (not to mention the effect that this would have on the children)
was chilling to contemplate and reminiscent of the Julie Amero case which has become a legend
among digital forensics experts and which will be discussed in more detail later. Examination of
additional cases confirms this experience, resulting in our recommendation that education in a range of
subjects related to digital evidence be added to law school curricula where, unfortunately, it often is
not taught today.
Failing to provide lawyers and judges with sufficient education in digital evidence can result in serious
miscarriages of justice and disruption of the legal system. The innocent will be wrongly convicted and
incarcerated; those deserving of punishment will get away with crimes. Society as a whole would be
better served by increasing the legal and judiciary communities’ understanding of digital evidence
(Endicott-Popovsky and Horowitz 2012). This will require that schools of law engage in an effort
to identify what needs to be taught.
2. STATE OF CONNECTICUT V. JULIE AMERO
State of Connecticut v. Julie Amero exposed how legal ignorance of digital evidence could have a
profound impact on an individual’s life. The defendant, Amero, was convicted on four charges of Risk of
Injury to a Child, which carried up to a 40-year sentence (Kantor 2007). Following four delays in
sentencing, a new trial was granted when the conviction was overturned on appeal. After years of
suffering under a cloud of suspicion, wanting to put the nightmare behind her, Amero pled guilty to
disorderly conduct, her teaching license was revoked, and she paid a $100 fine (Krebs 2008).
To summarize the facts of the case, Julie Amero was substitute-teaching a seventh grade classroom on
October 19, 2004. After stepping out into the hallway for a moment, she returned to find two students browsing a
hairstyling site. Shortly afterwards, the computer browser began continuously opening pop-ups with
pornographic content. She was explicitly told not to turn off the computer, and was unaware that the
monitor could be shut off. Students were exposed to the pornography. The primary evidence admitted
by the court was the forensic duplicate of the hard drive on the computer in question. While the
forensic investigator did not use industry standards to duplicate the hard drive, the information was
used in the investigation (Eckelberry et al. 2007). The evidence purported to show Internet history of
pornographic links that indicated the user deliberately went to those sites (Amero Trial Testimony
2007).
Later, pro bono forensics experts for the defendant showed that antivirus definitions were not updated
regularly and at the time were at least three months out-of-date. Additionally, no antispyware or client
firewall was installed and the school's content filter had expired (Eckelberry et al. 2007).
2.1 Evaluation of digital evidence by Judge and attorneys
During the trial, the judge refused to allow full testimony from the defense expert witness Herb
Horner, noting that the information was not made available beforehand. That information was relevant
to gaining a full understanding of the digital evidence crucial to the case. The decision not to admit it
indicates a troubling lack of understanding of the nature of digital evidence. Horner’s evidence should
have been provided to the jury (Amero Trial Testimony 2007).
Maintaining a digital chain of evidence is essential for admissibility of any digital evidence. In State v.
Amero, this is questionable based on the uncertainty of the forensic duplicate process (Eckelberry et al.
2007). Additionally, timestamp differences between the e-mail server (the time authority in the school
system’s network), and the computer in question, as a witness stated, “was ten or twelve minutes. I
don't remember who was faster and who was slower” (Amero Trial Testimony 2007). Both of these
discrepancies should put into question the authenticity of the digital evidence, although neither arose
as a strong defense argument. This indicates that the attorneys did not have sufficient technical
knowledge to evaluate the evidence.
Further, the judge did not find relevant the preparations taken by one expert witness to examine the
hard drive forensically. This is necessary to provide foundation for admissibility and authenticity of
digital evidence (Amero Trial Testimony 2007). The defense attorney did not question authenticity
based on the time stamp differences between PC and server, nor did the defense make an argument
regarding the process of the forensic investigation. Similarly, the prosecution did not have a proper
understanding of how to show the digital evidence in ways consistent with the actual event, as seen by
their display of full size pornographic pictures instead of thumbnails.
On the basis of the transcript of the case, the questions attorneys asked (or did not ask) of witnesses
also indicated low computer literacy. While expert witnesses are important to the case, the technical
knowledge of the attorneys (with the judge’s permission) guided the questioning of the witnesses.
The prosecuting attorney, when questioning the defense expert witness, apparently did not understand
the information being provided. A lack of understanding on the attorney’s part resulted in a lack of
precision in direct questioning that could have elicited more meaningful answers from the witness.
The following dialogue between the prosecuting attorney (questioning) and the defense expert witness
is an example:
Q So in order for it to show up on the temporary Internet files, that person would have to actively
go to that site, correct? They were not redirected to that site, correct?
A Wrong.
Q Okay.
A You don’t understand.
Q I hear you saying that. Give me more questions. (Amero Trial Testimony 2007)
One line of questioning by the defense regarded computer functions related to the Internet, adware,
spyware, and viruses (Amero Trial Testimony 2007). Cross-examination containing phrases such as
‘parasites’ and other incompetent questioning further displayed a low level of computer literacy from
attorneys on both sides. When the prosecutor displayed full-size pornographic pictures in the
courtroom, the defense did not argue, as it should have, that the relative size of the pictures displayed
for the jury was not consistent with the pop-up size thumbnails displayed in the classroom. The lack of
a specific argument by the defense against using a full size display in the courtroom contributed to a
false impression that prejudiced the jury (Willard 2007).
2.2 Expert witness testimony
The following section will detail the expert witness testimony in the Julie Amero case.
2.2.1 Bob Hartz, IT Manager
This case heard several expert witnesses, each of whom provided information that proved to be
misleading. The jury first heard testimony from Bob Hartz, IT Manager for the Norwich Public Schools.
Hartz’s testimony described server logs that provided a history of sites accessed
from the computer in question. His testimony also provided a basic understanding of the computer
environment at the school district, including notice that the timestamps between server/firewall logs
and the computer were either 10-12 minutes ahead or behind. Additionally, he testified that the content
filtering was not working, as it “had not been updated correctly” (Amero Trial Testimony 2007).
During Hartz’s testimony, he was asked a series of questions regarding the possibility of certain events
occurring, based on his 20-plus years of experience in the field. His answers provided misleading
information. When asked if it were possible to be in an ‘endless loop of pornography,’ referring to the
continual pornographic popups, Hartz stated, “I've never seen that, so I would have to say probably
not” (Amero Trial Testimony 2007). Additionally, when asked whether spyware and adware generate
pornography, Hartz replied, “I’m not aware they do” (Amero Trial Testimony 2007). Both of these
replies were speculative and should have been challenged, as should his ability to respond
competently to questions involving malicious activity on the Internet.
2.2.2 Detective Mark Lounsbury
Detective Mark Lounsbury, the computer crimes officer for the Norwich Police Department, was the
officer who personally copied and examined the hard drive. From testimony given by Lounsbury, it
appears that he may have investigated the original hard drive rather than the copy he claimed to have
made. Best practice is to preserve the original hard drive for evidentiary purposes, and to perform the
investigation on the forensic duplicate (Noblett, Pollitt, et al. 2000). Any digital forensic expert should
know that direct access to the hard drive, including opening files, will alter the files from the original
state—which in turn alters the evidence.
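In practice, examiners demonstrate that a working copy is faithful to the acquired image by comparing cryptographic hashes recorded at acquisition time. The following minimal sketch illustrates the idea only; the file paths and the choice of SHA-256 are assumptions made for illustration and are not drawn from the case record.

```python
import hashlib

def sha256_of(path, chunk_size=1024 * 1024):
    """Return the SHA-256 digest of a file (e.g., a raw disk image), read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Hypothetical image files; in practice the first hash would be recorded when
# the seized drive is imaged, and re-checked before and after examination.
original_hash = sha256_of("/evidence/original_image.dd")      # acquired image of the seized drive
working_hash = sha256_of("/evidence/working_copy.dd")         # examiner's working duplicate

if original_hash == working_hash:
    print("Working copy verified: hashes match, so the duplicate is bit-for-bit identical.")
else:
    print("Hash mismatch: the working copy differs from the acquired image.")
```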
Further insight into the inadequacy of Lounsbury's hard drive examination came in his answer to the question, "Did you examine the hard drive for spyware, adware, viruses or parasites?" (Amero Trial Testimony 2007) He responded that he had not. Digital forensics best practices include event correlation, which searches for the causes of the activities in question (NIST 800-61 2008). It is safe to say that poor examination procedures led to missed findings that were directly relevant to the case.
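Event correlation across sources is essentially a matter of matching records on common attributes after normalizing their clocks. The sketch below is purely illustrative: the log entries, the 11-minute offset and the tolerance window are hypothetical values chosen for demonstration (the testimony indicated only a rough 10-12 minute discrepancy between the workstation and the server/firewall logs).

```python
from datetime import datetime, timedelta

# Hypothetical, simplified records: pop-up URLs from the workstation's browser
# history and requests seen in the school's firewall log.
browser_events = [
    ("2004-10-19 09:02:10", "http://ads.example-spyware.test/popup1"),
    ("2004-10-19 09:02:12", "http://ads.example-spyware.test/popup2"),
]
firewall_events = [
    ("2004-10-19 09:13:30", "http://ads.example-spyware.test/popup1"),
]

# Assume an 11-minute clock skew purely for illustration.
CLOCK_SKEW = timedelta(minutes=11)
TOLERANCE = timedelta(minutes=2)

def parse(ts):
    return datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")

# A firewall entry supports a browser entry if the URLs match and the
# skew-corrected timestamps fall within the tolerance window.
for b_ts, b_url in browser_events:
    for f_ts, f_url in firewall_events:
        if b_url == f_url and abs((parse(f_ts) - CLOCK_SKEW) - parse(b_ts)) <= TOLERANCE:
            print(f"Correlated: {b_url} at {b_ts} (workstation) / {f_ts} (firewall)")
```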
2.2.3 Herb Horner, defense expert witness
Herb Horner, a self-employed computer consultant, was called in by the defense as an expert witness.
Horner obtained the hard drive copy from the police, and then created copies for his investigation.
Horner’s testimony was cut short by the judge’s decision not to continue with information that was not
provided beforehand to the prosecution. Horner later stated, "This was one of the most frustrating
experiences of my career, knowing full well that the person is innocent and not being allowed to
provide logical proof. If there is an appeal and the defense is allowed to show the entire results of the
forensic examination in front of experienced computer people, including a computer literate judge and
prosecutor, Julie Amero will walk out the court room as a free person." (Kantor 2007). The
information that was to be presented by Horner was evidence of spyware on the computer, which caused the pornographic pop-ups (Amero Trial Testimony 2007). As Horner stated to the judge while the
jury was dismissed, “there were things done before the 19th that led to this catastrophe, and this is a
fact” (Amero Trial Testimony 2007). Despite this, evidence that spyware caused the pop-ups still was
not allowed into the record.
2.3 Legal results
Legal precedent cited in State v. Amero was limited. There was one mention of Frye v. United States during an objection to the admissibility of expert testimony by Hartz on the basis that proper foundation had not been laid for his testimony (Amero Trial Testimony 2007). The prosecution argued that Hartz's expertise was based on his described twenty years of experience, and the judge found this sufficient
(Amero Trial Testimony 2007).
During the same exchange, the defense cited State v. Porter to argue that the reliability of scientific
information was a major issue. In State v. Porter, the Connecticut Supreme Court adopted the federal standard set by Daubert v. Merrell Dow; however, questioning continued due to a lack of clarity about what specifically the defense was objecting to (Amero Trial Testimony 2007).
The primary law cited in this case was risk of injury to a minor, Connecticut General Statute Section
53-21(a)(1), from which the charges stemmed. There were no laws or regulations cited relevant to
digital evidence admissibility.
The defendant was found guilty on four counts of Risk of Injury to a Child, with the possibility of a 40-year prison sentence. After four delays in sentencing, the State Court of Appeals reversed the lower court decision and a motion for a new trial was accepted. Amero pled guilty to a misdemeanor to put the ordeal behind her. As part of the deal, she agreed to have her teaching license revoked. The emotional toll on her and her family was extreme, and the stress of the trial resulted in many health problems. She has become a cause célèbre for digital forensics experts ever since.
3. LORRAINE V. MARKEL AM. INS. CO.
There are a few examples where the justice system has properly handled the admission of digital
evidence. We transition from an example of glaring misunderstandings and lack of knowledge of
digital evidence (Amero case), to an analysis of a competent judicial opinion regarding digital
evidence, the case of Lorraine v. Markel American Insurance Company. The opinion issued by Chief United States Magistrate Judge Paul W. Grimm has set a major legal precedent (Lorraine v. Markel Am. 2007). The case arose from an insurance payment dispute. The
actual facts of the case are not important; the opinion of Judge Grimm is. It provides a detailed
procedure for admitting digital evidence. The following section will summarize each step and
requirement of the process.
3.1 Digital evidence admissibility process
The first criterion for gauging the admissibility of digital evidence is based on Federal Rule of Evidence 104. Judge Grimm explained that determining the authenticity of Electronically Stored
Information (ESI) is a two-step process (Lorraine v. Markel Am. 2007). First,
“...before admitting evidence for consideration by the jury, the district court must determine
whether its proponent has offered a satisfactory foundation from which the jury could reasonably
find that the evidence is authentic” (Lorraine v. Markel Am. quoting U.S. v. Branch).
Secondly,
“...because authentication is essentially a question of conditional relevancy, the jury ultimately
resolves whether evidence admitted for its consideration is that which the proponent
claims” (Lorraine v. Markel Am. quoting U.S. v. Branch).
Relevance is the first requirement for admissibility of evidence (Fed. R. Evid. 401). Evidence is
sufficient, in terms of relevance, if it has "'any tendency' to prove or disprove a consequential fact in litigation" (Lorraine v. Markel Am. 2007). Importantly, if evidence is not relevant, it is never admissible, according to Fed. R. Evid. 402. In the case at hand, Lorraine, the emails were
determined to be relevant to the case (Lorraine v. Markel Am. 2007).
The next step in determining admissibility of ESI is its authenticity as guided by Federal Rules of
Evidence 901 and 902. Judge Grimm submits that the authenticity of digital evidence is an important
requirement, and that the required showing need only be sufficient to support a finding that the evidence is what it is purported to be (Lorraine v. Markel Am. 2007). Specifically, the non-exclusive
examples provided by Rule 901(b) can be studied to know how to address authentication of ESI in
Rule 901(a) (Lorraine v. Markel Am. 2007).
Federal Rule of Evidence 902 lists twelve non-exclusive methods that can be used for 'self-authentication' of digital evidence (Lorraine v. Markel Am. 2007):
(1) Domestic Public Documents That Are Sealed and Signed.
(2) Domestic Public Documents That Are Not Sealed but Are Signed and Certified.
(3) Foreign Public Documents.
(4) Certified Copies of Public Records.
(5) Official Publications.
(6) Newspapers and Periodicals.
(7) Trade Inscriptions and the Like.
(8) Acknowledged Documents.
(9) Commercial Paper and Related Documents.
(10) Presumptions Under a Federal Statute.
(11) Certified Domestic Records of a Regularly Conducted Activity.
(12) Certified Foreign Records of a Regularly Conducted Activity.
These methods establish a practice of authentication that can be performed without the need for expert
witness testimony, although the lack of such testimony does not exempt evidence of authentication
from challenge (Lorraine v. Markel Am. 2007; Fed. R. Evid. 902).
Procedurally, evidence that is already deemed relevant and authentic must also withstand any
argument of hearsay. Rule 801 governs hearsay, and states that electronic writings or other
information generated entirely by a “computerized system or process” is not made by a person, and
therefore cannot be considered hearsay (Lorraine v. Markel Am. 2007). There are exceptions to the
hearsay rule within the Federal Rules of Evidence, which Judge Grimm linked to ESI in his opinion.
These included Fed. R. Evid. 803(1-3, 6, 7, 17) (Lorraine v. Markel Am. 2007).
The Original Writing Rule is considered the next hurdle for electronic evidence admissibility, which is
included in Federal Rules of Evidence 1001-1008. Rule 1002 ensures the 'best evidence' available and applies when a writing, recording or photograph is being introduced to "prove the content of a writing, recording or photograph" (Fed. R. Evid. 1002). Rule 1003 states that duplicates of evidence
can be admitted in place of originals unless the authenticity of the original is in question (Lorraine v.
Markel Am. 2007). In determining whether secondary evidence can be included in place of an original,
Rule 1004 provides guidance.
The summary of the guidelines from the Federal Rules of Evidence cited in the Lorraine opinion
follows:
Table 1: Rules of Evidence Identified in Lorraine v. Markel as Guidance for Digital Evidence
Admissibility
Federal Rules of Evidence 104(a) and 104(b): Preliminary questions; the relationship between judge and jury
Federal Rules of Evidence 401 and 402: Relevance
Federal Rule of Evidence 901: Authenticity, including examples of how to authenticate
Federal Rule of Evidence 902: Self-authentication, including examples
Federal Rules of Evidence 801, 803, 804 and 807: Hearsay, including exceptions to the hearsay rule
Federal Rules of Evidence 1001 through 1008: Original Writing Rule, also known as the "Best Evidence Rule"; includes use of accurate duplicates
Federal Rule of Evidence 403: Balance of probative value with unfair prejudice
(Lorraine v. Markel Am. 2007; LexisNexis 2007)
3.2 Applicability to State of Connecticut v. Julie Amero
The next section retrospectively applies the digital evidence guidelines established in Lorraine v. Markel
to State of Connecticut v. Julie Amero to determine whether this would have resulted in a different
outcome. As a note, the trial portion of the Amero case preceded the Lorraine v. Markel Am. opinion
by four months in 2007. While Amero could not have capitalized on the Lorraine opinion, an analysis
helps identify knowledge the legal participants in Amero should have had to ensure the case was
properly adjudicated.
4. ANALYSIS
4.1 Digital evidence admitted
The primary piece of digital evidence in the Amero case was the hard drive. The opinion of Judge
Grimm includes methods describing how to handle the admissibility of this type of evidence (Section
4.5).
4.2 Evaluation of digital evidence by Judge and attorney
State v. Amero is an appalling example of the blindness of the legal system when the computer literacy of the judge and lawyers is at the very low end of the spectrum. The proceedings of this case highlight
the need for even the most basic understanding of how computers operate, including the threats to
computer systems that can cause the pop-up symptoms argued by the defense. In State v. Amero, the
lawyers did not question the authenticity of expert testimony describing the behaviors of viruses and
spyware. Of more importance, the lawyers did not put forth a clear and full objection to the digital
evidence presented. The Lorraine v. Markel Am. opinion specifically states that the burden of ensuring
that digital evidence is what it purports to be depends largely on objections by opposing counsel. Thus,
it is the responsibility of lawyers to be sufficiently knowledgeable to object competently to faulty
evidence.
4.3 Expert and witness testimony
Amero provided an example of the effects expert testimony can have on a case. The numerous errors
of so-called experts swayed the outcome of this case, and further exposed the lack of basic computer
literacy among the professional legal participants. The opinion rendered in Lorraine v. Markel Am.
references examples provided by Fed. R. Evid. 901 as methods to authenticate digital evidence,
including the call for an expert witness. Laying proper foundation qualifying the expert witness, as
well as directing a competent line of questioning, rely heavily on the computer literacy of the lawyers
involved. Reliance on digital forensic "expert" witnesses was key to State v. Amero, although the line between testimony and speculation was crossed. An objection to speculation about whether spyware could produce pop-up pornography would have been warranted, had opposing counsel possessed some knowledge of computer threats.
4.4 Legal precedent
Legal precedent is scarce in the area of digital evidence, although it is building. Some cases establish sound precedent, as indicated in Lorraine. While Daubert and Frye have established general guidance for admission of scientific evidence, the science of digital forensics is new and evolving (Daubert v. Merrell Dow 1993; Frye v. U.S. 1923). The speed of technological change is at odds with the time it takes for legal precedent to accumulate. This affects nearly every court case in which a computer is involved, from civil matters such as divorces to criminal cases where digital evidence plays a significant role. The legal system lags in handling digital evidence adequately (Endicott-Popovsky, Chee, et al. 2007). In State v. Amero, Frye and Daubert were cited in an
objection by the defense, yet neither was applied appropriately. In order to build an adequate digital
evidence curriculum for law schools, extensive review of relevant legal cases will be required. This is
a matter for future research.
4.5 Laws and regulations identified
In Amero, first the court should have determined whether a satisfactory foundation had been laid
before the jury viewed the evidence (United States v. Branch 1992; Fed. R. Evid. 104). Then relevance
should have been considered (Fed. R. Evid. 401 and 402), then authenticity guided by Fed. R. Evid.
901 and 902. Application of the hearsay Rules (Fed. R. Evid. 801-807) and the Original Writing Rule
(Fed. R. Evid. 1001-1008) would follow in order to fully ensure the evidence, and duplicates of the
evidence, were what they purported to be (Lorraine v. Markel 2007).
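Viewed procedurally, the Lorraine sequence can be thought of as a series of gates that a piece of ESI must pass in order. The sketch below is purely illustrative: the boolean fields stand in for legal determinations a court would actually make, and the hypothetical hard-drive example mirrors the authentication gap alleged in Amero.

```python
# Illustrative only: each check stands in for a legal determination made by the court,
# in the order laid out in the Lorraine opinion.
ADMISSIBILITY_GATES = [
    ("Preliminary foundation (Fed. R. Evid. 104)", lambda e: e["foundation_laid"]),
    ("Relevance (Fed. R. Evid. 401-402)", lambda e: e["relevant"]),
    ("Authenticity (Fed. R. Evid. 901-902)", lambda e: e["authenticated"]),
    ("Hearsay analysis (Fed. R. Evid. 801-807)", lambda e: e["not_hearsay_or_exception"]),
    ("Original Writing Rule (Fed. R. Evid. 1001-1008)", lambda e: e["original_or_accurate_duplicate"]),
]

def assess(evidence):
    """Walk the gates in order; report the first one the evidence fails, if any."""
    for name, passes in ADMISSIBILITY_GATES:
        if not passes(evidence):
            return f"Inadmissible: fails {name}"
    return "Candidate for admission (subject to Fed. R. Evid. 403 balancing)"

# Hypothetical hard-drive evidence whose duplication and chain of custody were never established.
hard_drive = {
    "foundation_laid": True,
    "relevant": True,
    "authenticated": False,
    "not_hearsay_or_exception": True,
    "original_or_accurate_duplicate": False,
}
print(assess(hard_drive))
```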
4.6 Legal Result
What is missing from most analyses of the Amero case is the larger picture of the emotional and health
effects on the innocent. Julie Amero, who faced forty years in jail, went through tremendous emotional
stress, and had a series of health issues arise including a tragic miscarriage (Cringely 2008). Her
family relationships were also impacted. While the legal system eventually righted the wrong inflicted
on Amero, the accumulated health and personal impacts cannot be undone. This will be repeated again
and again until we have a more predictable legal system when it comes to the use of digital evidence.
Judge Paul W. Grimm's opinion in Lorraine v. Markel Am. provides a basis for guidance in this area. Other such opinions are emerging, but they remain few.
5. COMPUTER AND DIGITAL FORENSICS EDUCATION RECOMMENDATIONS
State v. Amero and Lorraine v. Markel Am. provide a basis for recommending digital evidence
educational requirements in schools of law. The following recommendations will inform future
curricula for the University of Washington’s School of Law, and may be useful to others that wish to
mitigate deficiencies in computer knowledge of lawyers and the judiciary. At a minimum, the
curriculum must include:
• Basic computer literacy. This includes an understanding of computer vulnerabilities. A basic understanding that a compromised computer may show erratic actions not performed (or intended) by the user is important and would have prevented the Amero tragedy. This knowledge will enable lawyers to establish proper foundation and a proper line of questioning.
• Understanding of the digital forensics process. This includes basic knowledge of how easily digital evidence can be altered and what it means to have a proper chain of evidence, including storage and control. In addition, there should be sufficient knowledge of how evidence is collected on a computer hard drive (and on a network), how a hard drive is appropriately duplicated for forensic purposes, and how it is then searched by forensic tools. This recommendation arises from the several glaring errors in the Amero case. It was never established that proper steps were taken to maintain a proper chain of evidence during forensic duplication and investigation (State v. Amero 2008). Additionally, because police initially failed to search for malware, crucial evidence was missed.
• Knowledge of the Federal Rules of Evidence, and how they apply to electronic evidence. The Federal Rules of Evidence are integral to understanding the process for admitting digital evidence. Lorraine v. Markel Am. Insurance Co. provides a framework for applying these rules to digital evidence. Fed. R. Evid. 901 and 902 specifically deal with authentication of digital evidence, including examples of how to do so. This is directly relevant to the Amero case; abiding by this framework would have provided a basis for questioning whether the digital chain of evidence was reliable, and not broken, during the investigatory process.
• Survey of case law. A thorough search for relevant cases and an extraction of precedent should be conducted before developing digital forensics curriculum. Knowledge of how to apply the Federal Rules of Evidence as well as Daubert and Frye in cases involving digital evidence will provide material for any classes developed (Daubert v. Merrell Dow 1993; Frye v. U.S. 1923). A thorough survey of other cases will provide an even more comprehensive understanding of the state of the practice regarding digital evidence. Lorraine v. Markel Am. emphasizes that the burden of ensuring digital evidence admissibility rests largely on objections to such evidence by opposing counsel. The inability to competently challenge the testimony of the State's 'expert' witness led to a travesty of justice in the Amero case.
6. CONCLUSIONS
The above review provides insights into the legal and judicial communities’ lack of sufficient
knowledge of how to handle digital evidence appropriately and consistently. The authors believe this
argues for additional curriculum in schools of law that educates law students in the challenges of digital evidence: evidence collection, chain of custody, cybersecurity, and basic computer literacy. The authors recommend that law schools consider adding courses in these subjects as they relate to digital evidence. A previously published digital forensics course conducted at the University of Washington for a combined audience of technical and law students is an example of what can be accomplished through collaboration between digital forensics/computer science and law faculty. Based
on a successfully and competently prosecuted case of online digital theft and compromise, the course
culminates in a moot court that requires law students to participate in the preparation and questioning
of digital forensics experts (computer science students taking the same course) (Endicott-Popovsky,
Frincke, et al. 2004). This is just the beginning of innovations that the authors recommend be
incorporated into legal education and training.
7. FUTURE WORK
Future work will involve researching existing case law in order to assist the University of
Washington’s School of Law, in partnership with the Information School, in revamping their
curriculum to include interdisciplinary courses that will improve digital evidence literacy among law
students. It is expected that a thorough analysis of cases where digital evidence has been
inappropriately handled will further refine recommendations for curriculum content made above.
Work on this project has begun, with preliminary findings discussed in this paper. Insights from a
thorough examination of case law will be disseminated broadly to the digital forensics community.
AUTHOR BIOGRAPHIES
Aaron Alva is a candidate in the Master of Science in Information Management program and an incoming Juris Doctor candidate at the University of Washington. He earned his bachelor's degree at the University of Central Florida, studying political science with a minor in digital forensics (2011).
His interests are in cybersecurity law and policy creation, particularly digital evidence admissibility in
U.S. courts. He is a current recipient of the National Science Foundation Federal Cyber Service:
Scholarship For Service.
Barbara Endicott-Popovsky, Ph.D., Director of the Center for Information Assurance and Cybersecurity at the University of Washington, holds a joint faculty appointment with the Information School and the Masters in Strategic Planning for Critical Infrastructure program, following a 20-year industry
career marked by executive and consulting roles in IT and project management. Her research interests
include forensic-ready networks and the science of digital forensics. She earned her Ph.D. in
Computer Science/Computer Security from the University of Idaho (2007), and holds an MS in
Information Systems Engineering from Seattle Pacific University (1987), and an MBA from the University of Washington (1985).
REFERENCES
Connecticut General Statute Section 53-21(a)(1)
Cringely, Robert (2008), 'The Julie Amero Case: A Dangerous Farce', PC World, http://www.pcworld.com/businesscenter/article/154768/the_julie_amero_case_a_dangerous_farce.html, December 2, 2008.
Daubert v. Merrell Dow Pharmaceuticals, Inc., 113 S. Ct. 2786. (1993).
Eckelberry, Alex; Glenn Dardick; Joel Folkerts; Alex Shipp; Eric Sites; Joe Stewart; Robin Stuart
(2007), ‘Technical review of the Trial Testimony State of Connecticut vs. Julie Amero’, Technical
Report, http://www.sunbelt-software.com/ihs/alex/julieamerosummary.pdf, March 21, 2007.
Endicott-Popovsky, B. and Horowitz, D. (2012). Unintended consequences: Digital evidence in our legal system. IEEE Security and Privacy. (TBD)
Endicott-Popovsky, B., Chee, B., & Frincke, D. A. (2007). “Calibration Testing Of Network Tap
Devices”, IFIP International Conference on Digital Forensics, 3–19.
Endicott-Popovsky, B.E., Frincke, D., Popovsky, V.M. (2004), “Designing A Computer Forensics
Course For an Information Assurance Track,” Proceedings of CISSE 8th Annual Colloquium.
Fed. R. Evid. 104, 401-402, 801-807, 901-902, 1001-1008
Frye v. United States. 293 F. 1013 D.C. Cir. (1923).
Kantor, Andrew (2007), ‘Police, school get failing grade in sad case of Julie Amero’, USA Today
http://www.usatoday.com/tech/columnist/andrewkantor/2007-02-22-julie-amaro_x.htm, February 25,
2007.
Krebs, Brian (2008), ‘Felony Spyware/Porn Charges Against Teacher Dropped’, Washington Post,
http://voices.washingtonpost.com/securityfix/2008/11/ct_drops_felony_spywareporn_ch.html?nav=rss
_blog, November 24, 2008.
LexisNexis (2007), ‘Lorraine v. Markel: Electronic Evidence 101’,
http://www.lexisnexis.com/applieddiscovery/LawLibrary/whitePapers/ADI_WP_LorraineVMarkel.pd
f Accessed February 1, 2012.
Lorraine v. Markel American Insurance Company, 241 F.R.D. 534 D.Md (2007).
NIST 800-61 Rev. 1 (2008), “Computer Security Incident Handling Guide”, National Institute of
Standards and Technology, http://csrc.nist.gov/publications/nistpubs/800-61-rev1/SP800-61rev1.pdf,
March 2008.
Noblett, M.G., Pollitt, M. M., & Presley, L. A. (2000). ‘Recovering and Examining Computer
Forensic Evidence’ Forensic Science Communications, 2(4).
http://www.fbi.gov.hq/lab/fsc/backissu/oct2000/computer.htm, October 2000.
State of Connecticut v. Christian E. Porter, 241 Conn. 57, 698 A.2d 739, (1997).
State of Connecticut v. Julie Amero Trial Testimony (2007), Retrieved from http://drumsnwhistles.com/pdf/amero-text.zip, January 3-5, 2007.
State of Connecticut v. Julie Amero, (2008).
United States v. Branch, 970 F.2d 1368 4th Cir. (1992)
Willard, Nancy (2007), ‘The Julie Amero Tragedy’, Center for Safe and Responsible Use of the
Internet, http://csriu.org/onlinedocs/AmeroTragedy.pdf, February, 2007.
PATHWAY INTO A SECURITY PROFESSIONAL: A NEW
CYBER SECURITY AND FORENSIC COMPUTING
CURRICULUM
Elena Sitnikova
University of South Australia
F2-26 Mawson Lakes Campus,
Mawson Lakes SA 5095
Ph +61 8 8302 5442
Fax +61 8 8302 5233
[email protected]
Jill Slay
University of South Australia
MC2-22 Mawson Lakes Campus,
Mawson Lakes SA 5095
Ph +61 8 8302 3840
Fax +61 8 8302 5785
[email protected]
ABSTRACT
The University of South Australia has developed a new two-and-a-half-year (full-time equivalent) qualification to meet Australian law enforcement's established demand for Masters-level graduates in cyber security and forensic computing who can provide expertise to the Australian courts. It offers a pathway through a suite of nested programs including Graduate Certificates, a Graduate Diploma, a Masters and possible continuation to PhD level. These are designed to attract a diverse group of learners traditionally coming from two cohorts of industry practitioners: one with an engineering background and the other with an IT background. It also enables industry-trained and qualified learners without undergraduate degrees to gain the required security qualification through access to tertiary study as a career choice. This paper describes this curriculum; provides an overview of how these nested programs were conceived, developed and implemented; describes their current state and their first outcomes, assessing the effectiveness of the pathway through students' feedback; and outlines future planned initiatives.
Keywords: cyber security curriculum, computer forensic education, security professionals
1. INTRODUCTION
In today’s Information Age a rapidly increasing number of organisations in the Australian Defence
sector, security, critical infrastructures (oil, gas, water, electricity, food production and transport),
banking and other commercial and government industries are facing significant challenges in
protecting information assets and their customers’ sensitive information from cyber-attacks.
Cyber security, computer forensics, network security and critical infrastructure protection are
emerging subfields of computer science that are gaining much attention. This will create many career
opportunities for information technology (IT), computer science and engineering graduates in law
enforcement, Australian Federal and State departments, large accounting companies and banks. Some
employment is available in small and medium enterprises but this is less common with the move
towards IT outsourcing in general and security and forensic outsourcing in particular. Students are
thus being prepared for careers such as:
• Forensic IT Specialist
• Electronic Evidence Specialist
• Forensic Computer Investigator
• Information Security Professional
• IT Risk Advisor
• Information Assurance Professional
• Critical Infrastructure Protection Specialist
Frost & Sullivan's global information security workforce study (Frost & Sullivan, 2011), conducted on behalf of (ISC)2, estimates that there were 2.28 million information security professionals worldwide in 2010, with signs of strong growth. The number of computer forensic and related professionals has been experiencing double-digit growth and is set to increase. According to Frost & Sullivan, by 2015 the number of security professionals will increase to nearly 4.2 million, with a 14.2% increase in the US and 11.9% in the Asia-Pacific region (APAC). The survey shows that the average salary for security professionals worldwide is US$98,600 with (ISC)2 certification and US$78,500 without. For the APAC region it is US$74,500 and US$48,600 respectively. The survey also identifies a clear gap in the skills needed to protect organisations in the near future, not only from cyber-attacks on an organisation's systems and data, but also on its reputation, end-users and customers.
This paper will discuss the design and development of a curriculum for the new program in cyber
security and forensic computing within the School of Computer Science at the University of South
Australia. The program has been established with the support of Australian Law Enforcement to meet
the high demand for information security professionals at Masters degree level with the skills and expertise required by the Australian courts. It prepares both IT and engineering students for the workplace by
covering industry recommended competencies for information assurance, cyber security, electronic
evidence, forensic computing and critical infrastructure professionals. The curriculum is unique – no
other Masters Degrees in Australia have been developed around these competencies.
According to the international commentator and INL infrastructure protection strategist Michael Assante, physical security has always been a priority for many managers and engineers responsible for control systems. For many of them, information security is a new field; they must understand the importance of cyber security requirements for critical infrastructure (CI) systems and the associated risks, and then deploy proper security measures. In his testimony to the US government on process control security issues, Assante (2010) criticises the last decade's considerable body of research on implementing yesterday's general IT security approaches in today's operational process control systems. He asserts it is "proven ineffective in general IT systems against more advanced threats". He
also notes that as more technological advancements are introduced to SCADA (Supervisory Control
and Data Acquisition System) and process control systems, more complexity and interconnectedness
are added to the systems, requiring higher levels of specialty skills to secure such systems. Training
managers and engineers in the field will help to meet this need.
Other literature notes that vulnerabilities in SCADA and process control systems can increase from a lack of communication between IT and engineering departments (ITSEAG, 2012). As highlighted in the authors' previous work (Slay and Sitnikova, 2008), engineers are responsible for deploying and maintaining control systems, whilst network security is handled by staff from an IT background. The gap between these two disciplines needs to be bridged to recognise, identify and mitigate the vulnerabilities in SCADA and process control system networks. Broader awareness and the sharing of good practice on SCADA security between utility companies themselves is a key step in beginning to secure the Australian nation's critical resources.
Industry-trained professionals and qualified learners, often working in process control services and government organisations today, bring to class fundamentally different skills, objectives and operating philosophies regarding security in an enterprise IT context. Thus, the authors are
challenged to develop a curriculum that aims to address both discipline-specific engineering and IT
issues and bridge the educational gap between IT network specialists and process control engineers
within the post-graduate cyber security and forensic computing nested programs. To address these
issues, the curriculum is designed to accommodate both process control engineers (with no or limited
IT skills) and IT specialists (with no or limited engineering skills).
The paper thus describes the factors that have prompted and supported our collaboration with industry and government, and the resulting nexus with academia, especially with respect to curriculum development and delivery and the maintenance of practical work with a strong hands-on, industry-relevant component. It explains the program's goals and objectives, and highlights the pedagogical aspects of curriculum design and some challenges related to educating part-time mature aged students, both engineers and IT specialists. The design involves:
• Addressing security education across the program, including cyber security of process control systems and forensic computing;
• Educational issues involved in teaching foundation skills in technologically diverse engineering industrial sectors;
• Some challenges related to educating part-time mature aged students;
• Determination of the success of problem based learning outcomes.
2. TARGET STUDENTS AND STAKEHOLDERS
The defining characteristics of the majority of students coming into the program are as follows:
• they are experienced engineering or IT specialists and technical officers with no less than 6 years' experience in the area, who come from government and industry organisations;
• their main motivation to study is to gain a post-graduate qualification and become better positioned for employment in the industry;
• they are part-time mature aged students with a degree in IT or engineering (electrical and electronic), or technicians with more than 6 years' industry experience, articulating from multiple pathways (work, Technical and Further Education (TAFE)).
Traditionally, process control engineers are mature professionals who have been in the industry for
much of their career. They have a depth of experience in the operation and maintenance of the
SCADA systems. IT network specialists are often in the earlier stages of their careers, they have a
networking background and a good understanding of the security and reliability issues involved.
From a curriculum perspective the challenge is to help these professionals from diametrically opposed
backgrounds to bridge this gap. That is, to educate the process control engineers in basic IT network security and then apply the relevant aspects to process control systems, while building the IT network specialists' knowledge of process control systems from basics to the stage where they can understand the risks that need to be mitigated within Internet-connected process control systems. The course in
critical infrastructure and control systems security, for example, seeks to provide students from both
backgrounds with the necessary skills to improve the security and resilience of process control
systems. Similarly, educators face a tension when they must constrain, for example, an electronic evidence course in forensic computing so that it still provides the skills needed for digital forensic examinations to control systems engineers with a limited IT background.
There are multiple stakeholders involved. Directly, governments, operators (management), the
operator’s shareholders, and the operator’s employees gain from education on how to improve the
security and resilience of these systems. Indirectly, SCADA and control system owners and
consumers also benefit. Particularly through the support of government bodies like GovCERT.au, and
through activities like the ‘red/blue team war games’ to test and demonstrate the possible threats, the
operators of these systems can clearly see the importance of this work and realise that it is in
everyone’s interests to improve these systems. This support from operators is crucial to gain funding
for the improvements and training required to bring about these changes. Employees benefit from an
improved knowledge and skill base, increasing their employability and value to their employer. These
stakeholders have different desired outcomes depending on their role and the curriculum needs to
balance these.
3. GOALS AND OBJECTIVES
The major goal of the curriculum for the new program in cyber security and forensic computing
(‘CSFC’) is to introduce students to the concepts and theoretical underpinnings in cyber security,
forensic computing, intrusion detection, secure software lifecycle and critical infrastructure protection
with a focus on developing and extending a foundational skill set in these areas and preparing students
to take two major international IT certifications (CISSP - Certified Information Systems Security Professional and CSSLP - Certified Secure Software Lifecycle Professional).
The program objectives are defined depending on the levels offered. For the Graduate Certificates and Graduate Diploma the objectives are:
• To develop knowledge and skills in forensic computing / electronic evidence to enable students to analyse and investigate computer-based systems;
• To develop knowledge and skills in cyber security to enable students to protect and defend information and information systems by ensuring their availability, integrity, authentication, confidentiality and non-repudiation.
For Masters, in addition to the objectives above:
• To enable students to develop and demonstrate their technical expertise, independent learning attributes, research and critical appraisal skills through the application of taught theory, processes, methods, skills and tools to a pure or applied forensic computing, cyber security or information assurance research question.
4. DESIGNING AND IMPLEMENTING THE CSFC CURRICULUM - A NESTED SUITE OF PROGRAMS IN CYBER SECURITY AND FORENSIC COMPUTING
The Cyber Security and Forensic Computing ('CSFC') curriculum was developed as a set of nested programs of forensic computing and cyber security courses designed for the professional development of ICT practitioners within Australian Law Enforcement, so that they may be recognised as Forensic Computing/Electronic Evidence specialists in the Australian courts. The program also provides postgraduate qualifications in both forensic computing and cyber security, to meet the high demand of the Australian Defence and banking industries for professionally qualified IT security staff who have undergraduate qualifications in ICT or engineering.
The program allows students to follow a pathway from Graduate Certificate to Graduate Diploma to
Masters and PhD in forensic computing and cyber security. Students undertaking this program apply
for entry to the appropriate study stream and must meet the University’s entry requirements for
postgraduate study. Since the program is ‘nested’ there are exit points after the Graduate Certificate
and Graduate Diploma phases. (Refer Figure 1).
4.1 Courses in CSFC Curriculum
The curriculum consists of the following two streams, each of which comprises four courses (4.5 units each).
I. Forensic Computing Stream:
• Electronic Evidence 1 – Forensic Computing
• Electronic Evidence 2 – Network and Internet Forensics
• Electronic Evidence Analysis and Presentation
• e-Crime, e-Discovery and Forensic Readiness
II. Cyber Security Stream:
• Intrusion Analysis and Response
• Critical Infrastructure and Control Systems Security
• Information Assurance and Security
• Software Security Lifecycle
Course objectives are listed in Table 1. After successfully completing all eight courses, students continuing at Masters level enrol in Minor Thesis 1 & 2 (18 units), bringing the program to 54 units in total.
Figure 1: The CSFC nested programs. A pathway for professional staff in law enforcement and other professions to increase their capability in cyber security and forensic computing:
Graduate Certificate in Science (Forensic Computing) or Graduate Certificate in Science (Cyber Security)
Graduate Diploma in Science (Cyber Security and Forensic Computing)
Master of Science (Cyber Security and Forensic Computing)
A pathway to a PhD in the areas of Cyber Security and Forensic Computing
4.2 COURSE STRUCTURE
The CSFC curriculum consists of two distinct parts: the courseware and the hands-on practical
supplementary materials.
4.2.1 CSFC curriculum courseware
Each course is broken into weekly sessions, each of which is associated with a particular topic. Each topic contains
detailed study instructions, recorded lectures, notes and a list of required reading materials.
Most sessions contain the following set of information:
• Audio-recorded lectures over slide presentations using the Adobe Presenter tool, including detailed notes for the corresponding slides. Students are required to listen to and study the topic in their own time prior to the face-to-face lecture (internal class) or virtual seminar (external class).
• Required reading materials - these readings are expected to be done in the students' own time in advance of the session scheduled for that week.
• Additional reading materials provided to students for further study and to help with their assignments.
• Quizzes - this assessment activity is designed to encourage students to learn the materials and check the skills they have gained.
• Discussions - a virtual space created for students to practise writing their opinions about questions or problems posted by the instructor and to collaborate with their fellow students.
4.2.2 Delivery options
As a majority of our students are working and studying part-time, we accommodate their needs and
make their study arrangements as flexible as possible.
Table 1. CSFC curriculum - Course Objectives
COMP 5064 Electronic Evidence 1 – Forensic Computing: This course is designed to provide students with a sound knowledge and understanding to enable them to recover admissible evidence from PC-based computers, together with the skills and competencies to prepare such evidence for presentation in a Court of Law, and to develop knowledge and understanding of advanced forensic computing techniques and the skills to apply these successfully.
COMP 5065 Electronic Evidence 2 – Network Forensics: This course is designed to enable students to develop knowledge, understanding and skills for the recovery of admissible evidence from computers which are connected to a network and from computers which have been used to exchange data across the Internet.
COMP 5066 Electronic Evidence Analysis and Presentation: This course is designed to provide students with a sound knowledge and understanding to enable them to apply NIST and ISO 17025 lab standards for validation and verification to a forensic computing lab, to comprehend continuity and exhibit management systems and documentation systems, to understand key legal aspects of computer crime and to provide expert evidence to the court.
COMP 5063 e-Crime, e-Discovery and Forensic Readiness: This course is designed to provide students with a sound knowledge and understanding to enable them to apply NIST and ISO 17025 lab standards for validation and verification to a forensic computing lab, to comprehend continuity and exhibit management systems and documentation systems, to understand key legal aspects of computer crime and to provide expert evidence to the court.
COMP 5067 Intrusion Analysis and Response: This course is designed to develop knowledge and understanding of the strategies, techniques and technologies used in attacking and defending networks and how to design secure networks and protect against intrusion, malware and other hacker exploits.
COMP 5069 Critical Infrastructure and Control Systems Security: This course is designed to develop understanding of the key policy issues and technologies underlying critical control infrastructures in various industries and the design considerations for these systems in light of threats of natural or man-made catastrophic events, with a view to securing such critical systems.
Information Assurance and Security: This course is designed to provide students with a deep understanding of the technical, management and organisational aspects of Information Assurance within a holistic legal and social framework.
COMP 5062 Software Security Lifecycle: This course is designed to provide students with a deep understanding of, and the ability to implement and manage, security throughout the software development lifecycle.
COMP 5005, 5003 Masters Computing Minor Thesis 1 & 2: These courses are designed to develop the student's ability to carry out a substantial Computer and Information Science research and development project under supervision, and to present results both at a research seminar and in the form of written documentation.
To maximise flexibility, availability and convenience, the courses are delivered in different study modes: either as traditional face-to-face teaching, requiring effort spread over 13 weeks including an intensive week of workshops, or as online teaching in a virtual class plus a single intensive week of face-to-face in-class workshops.
• Face-to-face study mode (internal class) – one face-to-face class per course per week over 12 weeks, plus one week of half-day intensive in-class study per subject (15 hours).
• Online distance study mode (external class) – one virtual online class per course per week over 12 weeks, plus one week of half-day intensive in-class study per subject (15 hours).
External students are provided with an online synchronous environment where they meet virtually
with the course coordinator and fellow students. We use two e-Learning tools to facilitate online
learning for external students – the Moodle Learning Management System and the Adobe Connect Pro
System. These tools allow us to facilitate online connect meetings, virtual classrooms and other
synchronous online activities in a seamless and integrated environment. Students are able to hear online presentations and participate in live class discussions. However, students with limited bandwidth or other technological constraints (for example, no USB ports for headphones/microphones) might experience difficulties with poor video connections.
The intensive study week is the cornerstone of the curriculum. Students from both external and internal classes attend face-to-face workshops in Adelaide. This is a unique opportunity for all students (internals and externals) to attend hands-on practicals and guest speakers' presentations, and to network with peers.
4.2.3 Assessment
The assessment structure and weightings vary from course to course; however, all CSFC curriculum courses are based on Biggs' theory of reflective knowledge and the model of constructive
alignment for designing teaching and using criterion-referenced assessment. “Once you understand a
sector of knowledge it changes that part of the world; you don’t behave towards that domain in the
same way again” (Biggs 2007).
The curriculum design is based on the authors’ previous observations of the struggles faced by mature
aged students who come with a technical background to a university environment (Sitnikova and Duff,
2009). These students in their transition to post-graduate levels have much life experience but little
exposure to the ‘academic ways’ of university life. The new curriculum aims not only to increase
students’ knowledge and capabilities in cyber security and computer forensics, but also to strengthen
their academic skills and performance.
The students tend to come from full or part-time employment so the course delivery, and program,
need to be flexible. Due to the nature of the industry, these people are critical to the operation and
maintenance of these systems so they will need to negotiate their study arrangements with their
employers to ensure business continuity. This has benefits for the employer and employee. As
discussed, the differing backgrounds of the students impact on the range of learning styles preferred by
students. To address this, a range of different teaching and assessment activities are used.
There is less discussion in the literature around issues facing mature aged students and even less so for
those coming from a technical background to a more academic engineering background. Maturity is
considered a factor in student success and few lecturers would disagree that the eager, mature aged
student strives for excellent marks. Some of the motivations for mature aged students going back to
study include general interest; an ambition to fulfill degree or qualification aspirations and the desire
for career advancement (Leder and Forgasz 2004). Linke et al., cited in Evans (2000), found in a study of 5000 South Australian students who had deferred their studies and returned later that they perceived personal life experience as 'valuable' and useful to subsequent study.
However, mature aged students have particular struggles in a new academic environment. Yorke
(2000) goes so far as to suggest mature students moving into the ‘detached’ environment of a
university can constitute a crisis for some and cites feelings of insecurity, rusty academic skills and
discomfort with younger students as just a few hurdles to overcome. This is supported by Leder and
Forgasz (2004) who found mature aged students were concerned by their own lack of critical
background knowledge and skills.
The new curriculum proposed includes a structured assessment which aims to foster the attributes of
life-long learning, while engaging the students with theory through problem-based outcomes. It is hoped this approach will maximise students' engagement with their
‘new ways’ of learning through a well-thought process of scaffolding and constructive alignment
(Biggs 2007, Vygotsky 2002).
The assessment strategy proposed for the program applies a scaffolded approach to course assessment
which embeds generic communication skills. Students will have the opportunity to ‘practice’ their
growing skills through their hands-on practical in the lab, writing a preparatory assignment on the
relevant literature (post-class essay 1) and practicing through online quizzes and discussions prior to
their major assignment (post-class essay 2).
Assessment includes the following components:
• In-class activity (group hands-on exercises in the security lab) – 20%
• Post-class essay 1 – Literature review – 25%
• Online exercises (quizzes and discussion forums) – 10%
• Post-class essay 2 – Major assignment (security report or security plan) – 45%
4.2.4 In-class activities
The cornerstone of each course is an intensive study week in which students attend half-day face-to-face sessions (15 hours in total) at the Mawson Lakes campus of UniSA. During this intensive week students attend lectures and presentations by guest speakers from industry, police, and law enforcement. Students receive the majority of the course materials and hands-on practicals in class. Practicals are based on specific tools (for example EnCase), with licences available in the security lab at the University of South Australia.
While the material in this curriculum does not necessarily assume that a student has knowledge of both control systems engineering principles and IT skills, the structure of the coursework, weekly seminars and discussions, as well as group exercises in a security lab, helps students to share their experience and learn from each other.
4.2.5 Post-class activities
Post-class essays for all CSFC curriculum courses comprise two scholarly components: a literature review (post-class essay one) and a research report or security plan (assignment two), both of which require academic writing skills.
Even high performing students do not necessarily understand the fundamental concepts of research
and research methodologies. Mature aged practitioners and part-time study students coming from
industry are often new to the concepts of academic writing and searching library databases. Some students are accepted into the program with no degree but more than six years' experience in the industry. Others, with undergraduate IT or engineering degrees, succeeded by learning how to study laid-out coursework and master well-known solution methods. In their previous experience they had limited opportunity to learn about research because, in their institutions, research activity is usually
limited. The students have received little training in how to think independently, invent new problems,
and figure out ways to solve problems in relatively unconstrained situations.
To overcome this problem and stimulate students' interest in a research career, two scholarly face-to-face sessions and one library session on searching databases have been introduced to the program.
During these sessions students can learn about the nature of research, its value and rewards; apply this
knowledge to their writing assignments and prepare themselves for their pathway to Masters or PhD
levels.
5. CHARACTERISTICS OF SUCCESS
5.1 Students
The main focus of the curriculum is on developing and extending a foundational skill set in these
areas and preparing students to become security professionals able to help organisations to be more
effective, efficient and secure in their business operations. This requires a student who is willing and
able to approach both IT and control systems support and administration from a holistic perspective,
including business, people and processes, not simply from a technology perspective.
The diversity of the students' backgrounds is challenging for educators and students alike. For students who have worked in industry as IT specialists, it can be difficult to study process control system fundamentals such as PLCs and the MODBUS and DNP3 SCADA protocols. For students who have been control systems administrators for a while, studying network security and electronic evidence may be difficult, because many SCADA and process control engineers regard an organisation's information technology as an independent function carried out by a separate IT department.
A student who is willing to step beyond their existing perceptions and see cyber security and computer
forensics as an essential skill set for the security professional has a good chance to succeed in this
curriculum.
This curriculum is new and flexible in delivery to accommodate part-time mature aged students.
Students have to be motivated and engaged in course activities during face-to-face intensive weeks
and online seminars. To be active learners, students must prepare before virtual seminars using the lecture recordings, reading materials and online discussions. Students who are flexible in their approach to this educational environment, organised in their study plans and willing to accept new challenges are more likely to succeed.
One of the major goals of the programs is the applicability of knowledge gained to the students’ work
environments. We received positive feedback to this effect:
“The subject matter covered throughout the course was generally directly relevant to the industry
I work in. The assignments were very helpful and relevant. Data collected during the first
assignment and the report generated as part of the second assignment was able to link directly
with issue within my own enterprise and been able to submit internally within the enterprise for
further action”
“It's been so much more than I expected - a good balance of technical along with practical skills
that will hopefully help me gain employment in the Computer Forensics field. The lecturers have
all been knowledgeable, enthusiastic and encouraging of students. You should be congratulated
for producing such a fine program and on the first go too!”
5.2 Instructors
Courses are developed and offered by academics who are active researchers and industry practitioners,
and who bring the resulting expertise to their teaching. The program is directly supported by the
relevant University of South Australia’s research institutes and has been developed to meet the needs
of industry through collaborative links with NIATEC’s Director Professor Corey Schou and the South
Australian Police. Our strong links with industry and research organisations ensure that the degrees
offered are highly relevant to industry employers at local, national and international level.
6. CONCLUSIONS
This program has completed its first two-year cycle and is still a work in progress. It is now in its third year of implementation, with the first three students having passed all eight courses and decided to continue to Masters level and pursue Minor Thesis 1 and Minor Thesis 2.
The program has been modified every year to reflect rapid changes in the field. It has effective
intensive week practicals where industry practitioners enhance the hands-on exercises by embedding
new technologies and up-to-date case studies. Key elements in keeping the programs current are the
active researcher academics who teach our courses. Additional expertise is provided by international
security advisors as well as close contacts with industry vendors and practitioners locally, interstate
and internationally, and alliances with other academic institutions and Government organizations. It is
also important to highlight a very close relationship with the law enforcement community in order to
address national need in security professionals and establish expertise in Australian courts.
7. ACKNOWLEDGEMENT
The authors acknowledge Dr Patricia Kelly’s helpful comments on this paper.
8. REFERENCES
Assante, M. J. (2010) Testimony on Securing Critical Infrastructure in the Age of Stuxnet, National
Board of Information Security Examiners, 17 November 2010.
Biggs, J. B. and Tang, C. (2007) Teaching for Quality Learning at University, Open University
Press/McGraw-Hill Education.
Evans, M. (2000) 'Planning for the transition to tertiary study: A literature review', Journal of
Institutional Research (South East Asia), Vol. 9.
Frost and Sullivan (2011) The 2011 (ISC)2 Global Information Security Workforce Study.
ITSEAG, SCADA Security – Advice for CEOs: Executive Summary,
http://www.ag.gov.au/.../(930C12A9101F61D43493D44C70E84EAA)~SCADA+Security.../SCADA+Security.pdf,
viewed 27 January 2012.
Leder, G. C. and Forgasz, H. J. (2004) 'Australian and international mature students: the daily
challenges', Higher Education Research and Development, Vol. 23, No. 2, pp. 183–198.
Sitnikova, E. and Duff, A. (2009) 'Scaffolding the curriculum to enhance learning, motivation and
academic performance of mature aged students in engineering', Proceedings of the 2009 AaeE
Conference, Adelaide.
Slay, J. and Sitnikova, E. (2008) 'Developing SCADA Systems Security Course within a Systems
Engineering Program', Proceedings of the 12th Colloquium for Information Systems Security Education,
Dallas, US.
Vygotsky, L. S. (2002) 'Mind in society and the ZPD', in A. Pollard (Ed.), Readings for Reflective
Teaching (pp. 112–114), London: Continuum.
Yorke, M. (2000) 'Smoothing the transition into higher education: What can be learned from student
non-completion', Journal of Institutional Research, Vol. 9, No. 1, pp. 35–45.
Subscription Information
The Proceedings of the Conference on Digital Forensics, Security and Law is a publication of the
Association of Digital Forensics, Security and Law (ADFSL). The proceedings are published on a
non-profit basis.
The proceedings are published in both print and electronic form under the following ISSNs:
ISSN: 1931-7379 (print)
ISSN: 1931-7387 (online)
Subscription rates for the proceedings are as follows:
Institutional - Print & Online: $120 (1 issue)
Institutional - Online: $95 (1 issue)
Individual - Print: $25 (1 issue)
Individual - Online: $25 (1 issue)
Subscription requests may be made to the ADFSL.
The offices of the Association of Digital Forensics, Security and Law (ADFSL) are at the following
address:
Association of Digital Forensics, Security and Law
1642 Horsepen Hills Road
Maidens, Virginia 23102
Tel: 804-402-9239
Fax: 804-680-3038
E-mail: [email protected]
Website: http://www.adfsl.org