CT-ISG: Trustworthy Media:
Blind Passive Media Forensics for Content Integrity Verification

Project Summary
Can audio-visual recordings be accepted as reliable and trustworthy evidence? This problem,
confronting many sectors of society, has become urgent as emerging digital editing tools make us
increasingly vulnerable to forgeries. The conventional wisdom that “a photo is fact” is no longer valid, and digital audio-visual content can nowadays be modified and simulated with unprecedented realism.
There is a critical need to develop robust and flexible techniques for verifying the integrity of multimodal digital information in order to restore its trustworthiness: this project addresses that need.
In response to the challenge of malicious media manipulation, we propose a team effort to develop new
theories, methods, and a comprehensive suite of tools that can be used to verify content integrity in
various modalities (audio, visual, and combinations) at multiple levels (object, frame, shot, stream)
under diverse contexts. Our efforts will be based on an important paradigm, called blind passive media
forensics, which fundamentally differs from conventional approaches using cryptographic signatures and
watermarking. The proposed methods extract latent, unique signatures from the signals and sensing
devices to detect tampering anomalies, so that verification requires only the media content at hand
without any additional information or preparation.
This project will explore several novel directions that promise to confirm media integrity in the face of
various manipulations. First, unique signatures of sensing devices (e.g., cameras and audio recorders)
will be extracted from received signals, including sensor nonlinearities, filters, noise patterns, cut-off
bandwidth, etc. Anomalies are detected by finding inconsistencies of such cues among different parts of
the content. Secondly, tampering artifacts resulting from digital alterations such as splicing and
re-compression will be modeled and detected. Finally, a new criterion for comparing correlations in
audio-visual channels will be used to verify the authenticity of multiple near-simultaneous videos of a
single event from nearby locations. This is particularly valuable as the proliferation of recorders makes it
less and less likely that a forger will control all recordings of an event.
Faced with diverse tampering scenarios, our work will focus on the discovery of fundamental knowledge
and generic approaches, culled from a systematic understanding of the content editing pipeline, audio-visual device characterization, content process/coding models, and realistic context modeling.
Specifically, we propose (1) a systematic framework for device signature consistency checking and (2) a
joint multi-modal approach for verifying media integrity in audio, video, as well as their combined
context. The device signature based framework and multi-modal context-based approaches are novel and
general, representing important intellectual merits of the proposed project.
While in practice it may be impossible for the detection system to anticipate the full spectrum of
tampering to guarantee complete immunity to attacks, our aim is to make any attacking maneuver as
difficult as possible for an expert (and beyond the reach of a casual faker). To assess progress, we will
adopt rigorous and comprehensive mechanisms for evaluation, with detailed plans for characterizing
attack scenarios, the construction of new benchmark datasets and performance metrics, and proactive
steps in sharing results and promoting awareness. As an extension of our numerous ongoing efforts in
resource dissemination and on-line public testing, results from the proposed project will be widely
accessible to researchers and general users at large – leading to broad impacts on this emerging
scientific area as well as many practical applications of national interest, such as trustworthy news
reporting, surveillance security, intelligence gathering, criminal investigation, financial transactions, and
many others.
Chang and Ellis
CT-ISG: Trustworthy Media:
Blind Passive Media Forensics for Content Integrity Verification
1 Introduction and Motivation
Information integrity is a fundamental requirement for cyberspace, in which users need to ensure received
information is trustworthy and free from malicious tampering or forgery. Audio-visual information
(photos, video, and audio) is becoming increasingly important in a wide range of applications including
surveillance, news reporting, intelligence, insurance, and criminal investigation. However, with advanced
editing technologies making manipulation easier and easier, it no longer holds that “a picture is a fact” [1].
There is a critical need to develop robust and flexible techniques for verifying the authenticity of audio-visual information and thereby restoring its trustworthiness.
The need for trustworthy media is increasingly obvious as users become content creators and publishers.
In December 2006, Reuters and Yahoo News announced a joint effort called You Witness News [2], in
which users submit and share newsworthy photos and video clips captured with consumer equipment.
Though such “citizen journalism” is exciting, it raises many concerns over altered or forged content [3, 4].
Digitally altered images have already appeared unwittingly in mainstream media, such as the news photograph published in 2003 by the Los Angeles Times in a report on the Iraq war, which was later confirmed to be a photomontage (Fig 1(a)). Some popular web sites [5, 6] highlight the best photo
manipulation or computer graphics (CG) photo-realistic effects (Fig 1(b)), which human eyes find
indistinguishable from unaltered photos. When verifying the integrity of video sources, for instance in
intelligence analysis, it may also be important to decide whether different clips (like those shown in
Fig 1(c)) are indeed captured by the same camera at the same location, or are the result of digital mixing.
Fig 1. (a) Manipulated image published in the LA Times; (b) a computer graphics image indistinguishable by humans; (c) video shots to be tested for ensuring consistent camera sources.
Faced with real-world content tampering, we propose to develop a set of tools able to answer questions
such as: Has any area in an image or video been tampered with? Is the image/video captured by a camera,
or created synthetically? Were two images or videos captured by the same camera? Were they taken from
the same location of the same event? While complete robustness against all forgery attacks is unrealistic,
a comprehensive suite of methods for detecting tampering anomalies will make content tampering and
forgery much less likely to succeed. This is the goal we set for the proposed project.
Current techniques for content integrity verification such as content hashing or watermarking require
cooperative sources and compatible end-to-end protocols, making them impractical in many real-world
situations. In contrast, we and several other groups have embarked on a new direction, blind and passive
media forensics. Such methods extract unique signatures from real-world signals and sensing devices and
detect anomalies caused by tampering, so that verification requires only the final media content without
additional information. These techniques have shown promising results on images; in this project, we
propose to extend this paradigm to solve the problem of integrity verification for audio-visual information,
including photos, videos, and associated audio streams.
This project will explore several directions that promise to confirm media integrity in the face of many
different kinds of manipulations. First, the devices used for capturing video or audio signals have unique
device signatures, including sensor nonlinearity, filters, noise patterns, cut-off bandwidth, etc. Careful
recovery of such device signatures may be used to verify the source consistency among different parts of
an image or video clip, since a video showing inconsistent device signatures is likely to be faked.
Secondly, digital alteration is not a completely transparent process; instead, each manipulation may leave
tampering artifacts that become detectable traces. For example, the effects of re-compression on
compression frame structures and coefficients can be used to check whether editing has taken place.
Finally, near-simultaneous videos of a single event should show strong correlations in the audio-visual channels, which will all reflect the same acoustic events as well as the shared visual background. The lack
of such contextual consistency between multiple audio-video streams captured under the same context
may be used as a basis for raising forgery alerts.
In this project, we propose systematic research to explore these directions and develop novel methods for
verifying content integrity in video and audio combinations. We will leverage and extend results we have
accomplished in our current Cyber-Trust project (2004-7) on blind-passive image forensics. Specifically,
we propose (1) a systematic framework for device signature consistency and (2) a joint multi-modal
approach for verifying media integrity in audio, video, as well as their combined context. The device
signature based framework and multi-modal approaches are general and novel, presenting great
opportunity for research innovation that has not been explored to date.
The proposal is organized as follows. In Sec. 2, we analyze the problem by considering the editing
process and identify potential areas for innovative solutions. We then review our prior work and the state
of the art in the identified solution space (Sec. 3). Details of our proposed research in visual, audio, and
combined context integrity verification are described in Sec. 4. Plans for evaluation (including attack
scenarios, datasets, and metrics) are then presented in Sec. 5. Finally, in Sec. 6 and Sec. 7, we discuss our
prior NSF work, the integrated education component, and broader impact of the proposed project.
2 Problem Analysis and Solution Space
The best way of understanding the problem of digital content tampering is through modeling of the
editing process. Fig 2 shows a basic model in which multiple sources are used to produce a new media stream.
In each path, signals from real-world events are recorded by imaging or audio devices, followed by
capturing and encoding steps. Content from one or more streams is then manipulated, mixed, post-processed, and re-encoded to render the final output. Based on the processes involved, typical tampering operations may be categorized as follows; special attack tactics are considered in Sec. 5.
Fig 2. A basic model for audio-video editing: signals from a scene/event pass through a recording device, capture/encoding, editing (e.g., shot mixing, object splicing, material from a media archive), and final post-processing/re-encoding. Unique signatures or artifacts are generated in each stage of the process (shown in numbered circles) and may be used to verify the authenticity and consistency of the received content.
Deletion (Cut): Part of an image, video, or audio stream is deleted to remove information from the original content. The removed part may correspond to a video shot, a set of image frames, object(s) in the image/video, a speech segment, audio object(s) in the sound track, etc. The attacker tries to conceal evidence of such deletion operations so that content recipients treat the result as a complete, untampered piece.
Insertion (Paste): Forged content is added to the original source, such as new visual objects, sound
events, image frames, a video shot, a speech segment, etc. Usually additional tools are used to make the
insertion boundaries smooth and unnoticeable to human perception.
Combination (Cut and Paste): Most tampering scenarios actually involve both deletion and insertion. In
the simplest form, shots from a single stream may be cut and pasted to change their order, in an effort to
remove the temporal relations among events that occur in the real world. Additionally, an audio or video
object may be deleted from the foreground of a shot, with the original area replaced by a new object or
background scene from different streams or different areas of the same stream.
The above end-to-end model for content acquisition/editing includes multiple stages, starting from scenes,
through recording devices, encoding, editing, to the final post-processing and re-encoding stage. Each of
these stages, indicated with numbered circles in Fig 2, imposes unique constraints or artifacts in the
process and thus leaves differentiable signatures in the final output signals. First, audio-visual scenes
often have distinct lighting and sound. An unaltered recording should have consistent attributes for such
environmental conditions. Multiple videos taken at the same location for the same event should also have
matched audio-visual attributes related to the scene and event. Second, the recording devices (cameras
and microphones) are actually not completely transparent apparatuses. Many characteristics (such as
nonlinearity, device noise pattern, recording bandwidth, filters, etc) are different even for different units
of the same device model. Disagreements of such device signatures between multiple parts of a stream
readily indicate suspicious cases of tampering. Third, the encoding process used in the capturing procedure usually has some unique structures, such as framing in audio (26 ms), block tiling in images (8x8 pixels), and the parameters (e.g., quantization tables) used in compression methods. Such coding structures and parameters are likely to be destroyed if subsequent editing is not performed with extra care – e.g., cutting at a location not aligned with the audio frame or image block boundaries. Finally, each
editing operation such as splicing, scaling, and re-encoding may leave traces in the signals that are
differentiable. Detection of such editing artifacts may be used to trigger alerts about potential editing. In
the case that synthetic content such as CG images is inserted, the lack of natural scene characteristics and physical device signatures provides important clues about the synthetic content.
In summary, the processes of audio-video content acquisition and editing are not transparent. To the
contrary, they are full of tell-tale signatures and artifacts resulting from the uniqueness of the scenes,
devices, and processes. Given a piece of audio-video content, many questions about content integrity (Sec.
1) may be answered by estimation of such signatures and verification of their consistency. In an ideal
scenario, metadata encoding the above information may exist, such as camera ID, GPS location, and even
editing history. However, complete dependence on such metadata is impractical, as it is easily edited or removed. Therefore, automatic estimation and matching methods are needed to utilize and integrate the large array of tell-tale signatures described above. In the next section, we review promising results in these areas from our prior work and the work of others.
3 Review of Related Work
Digital Signatures, Watermarking, Steganalysis
There has been much work using watermarking or digital signatures to protect the authenticity of images. The main idea is to imperceptibly embed a digital watermark in an image for monitoring image manipulation. Fragile watermarks [7-11] are sensitive to any minor image modification, while semi-fragile digital watermarks [12-15] and content-based digital signatures [16-20] can accommodate some operations such as compression and resizing. Additionally, in the trustworthy digital camera [21], a
digital signature is generated at the moment an image is captured and the key-encrypted digital signature
is used for image authentication. Unfortunately, all of the above techniques, falling in the class of active
methods, require compatible protocols and end-to-end cooperation, which are often difficult to achieve. In
this project, we focus on blind passive techniques that do not have such dependence.
Although different from our focus, steganography (i.e., hiding secret information) and steganalysis (i.e., detection of steganography) [22] have many subtle similarities with, respectively, the creation and the detection of image forgery. When a message is hidden in an image, certain image statistics are disturbed
and artifacts are introduced [23]. To detect such artifacts, [24-26] examined the statistics of the least
significant bit value, the pixel group correlation, and image quality respectively. Similar to image
steganalysis, image forgery can be detected by examining possible disturbances in image statistics. For
instance, in [27], we investigated the signal-level perturbation in the form of bipolar signals introduced by
image splicing. Other techniques inspired by steganalysis [28, 29] have also been proposed recently.
Recovery and consistency checking of device signatures
A device signature is an important clue for detecting image tampering. A typical image taken from a
camera goes through a number of operations, as shown in Fig 3. The incoming light hits the lens,
activates CCD sensors, undergoes interpolation by the demosaicking filter, and then nonlinear
transformation by the camera response function (CRF). Optional digital operations are also often used,
e.g., white balance adjustment, contrast enhancement, lossy compression, etc. Each step in this imaging
pipeline introduces characteristic imperfections, which leave traces of camera information in the image. By modeling these operations and examining the resulting images, one can recover camera
properties and use them to detect tampered images.
Fig 3. A basic camera model. Characteristics of each component (like noise, demosaicking filter, camera response function) may be recovered from an image as unique camera signatures.
Considerable research effort has been devoted to this direction recently. [30, 31] used novel ideas based on CCD
sensor noise to identify camera sources and to detect tampered images. However, a drawback for such
methods is that multiple image samples need to be collected in advance for each camera to estimate its
noise pattern. [32-34] proposed EM-based methods and least-squared methods for demosaicking filter
estimation, and used the detected abnormality to find tampered images. Lin et al [35] used co-linearity of
edge pixel colors to estimate the CRF’s for color images, and further analyzed the abnormal estimation
results to detect tampered images [36]. In [37, 38], we proposed a geometry based approach to estimate
the CRF from a single channel image and used it to detect copy-and-pasted (spliced) images from two
different camera sources [39]. The CRF estimation and consistency checking technique proved to be quite
effective, with low estimation mean square errors with respect to ground truth CRFs and a high detection
rate of 86% over a well-defined benchmark set. A similar concept has also been applied in printed document analysis; e.g., [40] used unique fluctuations of photoconductors to identify printer sources and
to reveal document tampering.
Detection of Tampering/Processing Artifacts
Image post-processing clues raise suspicion of image forgery. In [41], wavelet higher-order statistics were used to detect image print-and-scan and steganography. Avcibas et al. used image quality measures to identify brightness adjustment, contrast adjustment, and so on [42]. In [43], image operations, such as resampling, JPEG compression, and addition of noise, are modeled as linear operators
and estimated by image deconvolution. In [44, 45], it is observed that double JPEG compression results in
a periodic pattern in the JPEG DCT coefficient histogram. Based on this observation, an automatic system
that performs image forensics is developed [46]. In [47], the distribution of the first digit of the JPEG
DCT coefficients is used to distinguish a singly JPEG compressed image from a doubly compressed one.
In [48], the JPEG quantization tables for cameras and image editing software are shown to be different
and can be a useful forensics clue. Double MPEG compression artifacts can also be observed when a
video sequence is modified and re-encoded [49]. In [50, 51], image splicing detection is addressed
directly using higher-order statistics, with a splicing model proposed in [27]. Finally, there is also work on detecting duplicated image fragments within an image resulting from copy-and-paste operations [52-54].
Computer Graphics vs. Photo
One problem of concern in image source identification is the classification of photographic images (PIM)
and photorealistic computer graphics (PRCG). The work in [41] uses the wavelet-based natural image
statistics for the PIM and PRCG classification. In [55], we approached the problem by analyzing the physical differences between the two image generative processes, hence providing a physical explanation for the actual differences between PIM and PRCG.
Audio Forensics and Speaker Identification
Audio recordings have been used as evidence for decades, frequently raising questions of authenticity. The field of Audio Forensics has traditionally relied more on expert listeners than on
advanced technology, although the recent formation of an Audio Forensics Technical Committee by the
Audio Engineering Society reflects both the broadening possibilities and greater challenges that result
from digital processing [56].
While rarely used in legal situations, automatic speaker identification/speaker verification is a relatively
mature technology in which a statistical model learns the full range of sounds produced by a particular
speaker in their normal speech. Evaluating the likelihood of an unknown speech recording under this
model gives a precise measure of the confidence that the voices match; accuracy is improved by
normalization to remove irrelevant factors such as fixed channel filtering [57, 58]. Such biometric
techniques are reliable enough to be used for commercial applications such as telephone access to
sensitive information [59].
4 Description of Proposed Research
We adopt a systematic and integrative approach to exploring opportunities originating from all of the
major components in the content pipeline (as shown in Fig 2) – scene, device, processing, and editing,
across multiple modalities (both visual and audio). In each area, we will leverage prior results from our
own work and other groups for audio-visual content analysis and authentication. In addition, a unique
framework for joint audio-visual authentication will be used to verify the consistency between audio and
visual information and correlations among multiple audio-visual streams claimed to be from the same
context (event/location). We describe the specific proposed research tasks in the following subsections.
4.1 Visual
Among the many opportunities identified in Fig 2, we focus on general approaches that are applicable to
most scenarios, rather than ad hoc solutions customized for specific conditions. First, we will extend our
prior work on robust camera signature estimation to establish a general consistency framework for
detecting spliced areas in both the spatial and temporal domains. Second, we will investigate the
fundamental physics-based properties of natural visual scenes, thereby developing sound principles for
detecting abnormal synthetic content. Finally, we will investigate the theories of quantization to model the
effects of re-encoding – the most fundamental operation involved in the editing process.
Device Signature Consistency Framework:
Among the several device components shown in Fig 3, the camera response function (CRF), which maps
incoming light irradiance to output electronic image intensity in a non-linear way, provides excellent
clues for differentiating different cameras, even different units of the same camera model. There is
established knowledge about its parametric forms, the simplest being a power law called the gamma
function. In [37, 38], we have derived new theorems and properties that relate the geometry in the image
to the estimation of CRF. Specifically, we showed that CRF parameters can be reliably recovered from
the local planar patches in the irradiance image by computing a derivative-based measure called
the geometric invariance:

G(R) = R_{xx} / (R_x)^2 = R_{yy} / (R_y)^2 = R_{xy} / (R_x R_y)
The geometric invariance quantity G is related to the first- and second-order intensity derivatives at each location, revealing the local geometry such as linearity and curvature. It is invariant to changes in orientation, scale, and translation of the local patches as long as they are planar. As the CRF adds a unique non-linearity to such planar patches, the computed geometric invariance values can be used to effectively recover this non-linear transformation, and thus the corresponding CRF.
In [37, 38], we tested the proposed method over 100 images from 5 different cameras from major
manufacturers and demonstrated excellent estimation accuracy. Compared to alternative methods for CRF
estimation [35, 36], our method is effective and advantageous. It can be flexibly applied to images of
diverse modalities, including multi-color channels (e.g., RGB), single color channel (e.g., greyscale), and
multiple frames in a video. Such flexibility is important in the proposed project, as the target media
content may be of different modalities. In our preliminary experiments, we applied the CRF estimation
method to videos from broadcast news in order to verify whether two shots in a story are taken by the
same camera. Fig 4 shows very encouraging results – consistent CRFs are found for two consecutive shots from a single broadcaster (CNN), Fig 4(a), while distinct CRFs are found for two shots of the same event from two different broadcasters (CNN and ABC), Fig 4(b).
Fig 4. Using the consistency of recovered camera response functions to verify whether two shots are taken by the same camera. Top row (a, same camera): two consecutive shots in a story from CNN. Bottom row (b, different cameras): two shots (from CNN and ABC) of the same event show different camera curves. The curves are automatically estimated using our geometric-invariant-based method.
The proposed camera signature estimation framework can be readily extended to verify the consistency
between different parts of an image or a video. Such extension has been shown promising in our initial
experiments [39], in which object splicing is detected by computing CRF inconsistency between the
suspect region and the rest of the image. A CRF is estimated using local patches extracted from each
region and cross-fitting scores are computed to measure the degree of fitness of the local patches from
one region with respect to the CRF estimated from the others. Our evaluation over a set of 363 spliced and authentic images has shown an encouraging accuracy as high as 87%. However, our results so
far have been semi-automatic in that suspicious objects are manually selected. In the proposed project, we
will combine the consistency checking framework with automatic image/video segmentation. We will
investigate the tradeoffs between region segmentation and fixed-block partitioning. The former has the
potential to locate spliced objects precisely in simple scenes, but is often susceptible to segmentation errors under complex backgrounds. The latter is simple and less sensitive to segmentation errors, though the chance of detecting accurate boundaries or small objects is compromised.
Note that the above device signature consistency framework is flexible and general; other camera signatures, such as the demosaicking filter and noise patterns, can be easily incorporated. For example, least-squares
fitting methods were used in [33, 34] to estimate the demosaicking filter based on an image or a local
image region. Scores from such fitting and multi-region cross-fitting may be fused with the CRF fitting
scores described above to measure the camera-signature consistency between different parts of an image.
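The consistency check itself can be summarized schematically as below; `estimate_crf` and `fit_error` stand in for the geometric-invariance-based estimation and fitting routines described above and are assumptions of this sketch, not actual interfaces.

```python
def crossfit_inconsistency(patches_A, patches_B, estimate_crf, fit_error):
    """Schematic sketch of region cross-fitting: estimate a CRF from each
    region's locally planar patches, then compare how well patches from one
    region fit the CRF estimated from the other region."""
    crf_A = estimate_crf(patches_A)
    crf_B = estimate_crf(patches_B)
    self_A = fit_error(patches_A, crf_A)      # baseline (self-fit) errors
    self_B = fit_error(patches_B, crf_B)
    cross_A = fit_error(patches_A, crf_B)     # cross-fit errors
    cross_B = fit_error(patches_B, crf_A)
    # A large excess of cross-fit error over self-fit error suggests the two
    # regions come from different cameras, i.e. possible splicing.
    return (cross_A - self_A) + (cross_B - self_B)
```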
Physics-based Natural Scene Properties:
In this section, we describe the proposed research using physics-based features of natural scenes to
distinguish synthetic content such as CG images from natural photos. Such features are culled from the
fundamental understanding of real-world image generation process, which involves complex interaction
among object geometry, surfaces, lighting, and cameras. The surfaces of real-world objects, with the exception of man-made objects, are rarely smooth or of simple geometry. Mandelbrot [60] has shown the abundance of
fractals in nature and also related the formation of fractal surfaces to basic physical processes such as
erosion, aggregation, and fluid turbulence. In addition, as photographic images are captured by an
acquisition device, they also bear the characteristics of the device, such as those shown in Fig 3.
Inspired by the above observations, in [55] we have developed physics-based image features based on a
two-scale image description framework. At the finest scale, the image intensity function is related to the fine-grained details of the surface properties of a 3D object, whose geometry can be characterized by the
local fractal dimension and also by the statistics of the local patches. At an intermediate scale, when the
fine-grained details give way to a smoother and differentiable structure, the geometry is described in the
language of differential geometry, where we compute the surface gradient, the second fundamental form
and the Beltrami flow vectors. These features are then aggregated into a combined representation (205
dimensions) upon which discriminative classifiers such as Support Vector Machines (SVM) are trained to
separate natural photos from photo-realistic CG images.
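For example, once the 205-dimensional feature vectors have been extracted, the discriminative classification step might look like the following sketch (scikit-learn is an assumption here; the kernel and regularization settings are illustrative, not the ones used in [55]).

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def train_photo_vs_cg(X, y):
    """X: (n_images, 205) array of aggregated geometry features as described
    above; y: 1 for natural photo, 0 for photorealistic CG. Feature
    extraction itself is outside this sketch."""
    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10.0, gamma="scale"))
    acc = cross_val_score(clf, X, y, cv=5).mean()   # cross-validated accuracy
    clf.fit(X, y)
    return clf, acc
```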
The above computable features have been proven to be effective. Our experiments showed a promising
detection accuracy of 85% with cross validation over a diverse, challenging benchmark dataset [61].
Fig 5(a) shows a few examples of the test photos and CG images, which indeed exhibit very high photo-realism.
In this project, we propose to extend the above method in several dimensions. First, as the techniques for
CG creation improve rapidly, it is impractical to expect a static classification system to maintain accurate
detection for all future CG content created by new, more advanced tools. It will be critical to continually acquire new datasets, refine the image feature set, and update the detection models accordingly. To this end, we have developed and deployed an online photo vs. CG classification system [62], to which public users may submit any image of interest and receive automatic classification results and
comparative feedback on the fly. A snapshot of the user interface is shown in Fig 5(b). As the first and
only public test system for photo/CG classification, it has attracted considerable interest, with more than 1500 submitted test images so far. With such a constantly expanding corpus, we will investigate online learning
methods for selecting image features, updating classification models, and analyzing the performance gaps
over different CG content subclasses.
Second, we will investigate methods to apply the CG detection framework to local regions in an image or
video, rather than just the global level. Such extension is non-trivial, as there are important tradeoffs
between the feature robustness and the region location precision. Some of the proposed features are
statistical in nature (like local image patch statistics and fractals) – thus increasing the location precision
may cause a loss of statistical reliability. Additionally, straightforward application of such methods to detect potential CG areas in a long video sequence is prohibitively time-consuming. We will develop multi-stage
solutions that use simple features, such as cartoon features or wavelet features [41, 63], to filter unlikely
cases and reduce the data space for finer examination.
Fig 5. (a) Sample images used in natural photo vs. computer graphics image classification (the left two are photos while the right two are CG); (b) user interface of our public online system for photo vs. CG classification [62].
Analytical Models for Double Compression and Manipulation Effects:
Another important clue for detecting editing or splicing is related to double compression. In a typical
splicing scenario, an object is cropped from an existing compressed source, scaled to fit the target splicing
area, shifted to the right position, and then the entire mixed content is recompressed. Such a double compression process adds important clues about the editing operation, since the effects of double compression are very likely distinct in the spliced area and the background area. It has been observed that double JPEG compression results in a periodic pattern in the JPEG DCT coefficient histogram [44-46]. This
pattern (as shown in Fig 6) is sensitive to the relative relation between the compression parameters (e.g.,
quantization step sizes) used in the first and second passes. It also depends on whether the inserted object
is shifted or scaled before insertion and recompression. By characterizing and differentiating different
compression effects in different parts of an image/video, we will be able to detect cases in which splicing might have taken place, and in some cases distinguish the specific operations that have been applied to the cropped objects. For example, as shown in our prior work [64], downscaling and shifting result in different levels of noise and relative image quality when employed between two compression passes. In this project, we will investigate further properties and analytical models of
combinations of double compression and various manipulation functions, as a basis for developing robust
tamper detectors.
Fig 6. Double quantization produces distinct patterns in the
quantized coefficient histogram due to the use of different
quantization step sizes – 5 followed by 2 on the left case and
2 followed by 3 for the right case. (from [46])
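The periodic pattern can be reproduced with a toy simulation of double quantization on a single coefficient distribution; the Laplacian coefficient model and the step sizes below are illustrative choices (matching the left-panel setting of Fig 6), and numpy/matplotlib are assumptions of this sketch.

```python
import numpy as np
import matplotlib.pyplot as plt

# Simulate double quantization of one DCT coefficient: quantize with step q1,
# dequantize, then re-quantize with step q2. Some bins of the resulting
# histogram become systematically empty or over-populated, producing the
# periodic pattern exploited for double-compression detection.
rng = np.random.default_rng(0)
coeffs = rng.laplace(scale=20.0, size=200_000)        # model DCT coefficients
q1, q2 = 5, 2
once = np.round(coeffs / q2)                          # singly compressed
twice = np.round(np.round(coeffs / q1) * q1 / q2)     # doubly compressed
bins = np.arange(-30.5, 31.5)
plt.hist(once, bins=bins, alpha=0.5, label="single (q=2)")
plt.hist(twice, bins=bins, alpha=0.5, label="double (q1=5, q2=2)")
plt.legend(); plt.xlabel("quantized coefficient value"); plt.show()
```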
4.2 Audio
Following the framework of Fig 2, the audio signal will carry clues to the components and stages
involved in its creation, including scene characteristics (used to verify that the target of the recording is as
claimed), characteristics of the device and processing chain (used to verify that a signal or set of signals
all have the same, single origin), and cues to continuity (which can reveal deletion or insertion edits).
Scene characteristics
Statistical analysis of audio recordings can verify or refute claims that particular signals originate from a
single source or location. In our work on analyzing recordings from a body-worn microphone, we
showed that the statistics of the background ambience – the sound between the foreground events such as
speech – are a reliable basis for identifying and classifying locations [65], since the particular spectral
shape and variability of the ambient noise (e.g., sounds of air-conditioning plant) are frequently specific
to a particular location. As an illustration, Fig 7 presents an 8 hour recording file from a body-worn
microphone, visualizing the power and fluctuation of energy in different frequency channels within each one-minute segment. These features can be seen to discriminate well between the different environments separated
by the hand-marked boundaries (indicating for example changes from indoor to outdoor, or different
locations).
Fig 7. Visualization of the ambient sound statistics from an 8 hour ‘personal audio’ recording,
revealing clear changes in properties corresponding to hand-marked changes in location shown as
vertical lines (from [66]).
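As a sketch of how such per-minute ambience statistics might be computed (librosa and the specific band and frame settings are assumptions of this sketch, not the exact features of [65, 66]):

```python
import numpy as np
import librosa

def minute_level_ambience_features(path, sr=16000, n_bands=20):
    """Summarize each minute of a long recording by the mean and standard
    deviation of log band energies, in the spirit of Fig 7."""
    y, sr = librosa.load(path, sr=sr, mono=True)
    S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_bands,
                                       hop_length=sr // 10)   # 100 ms hop
    logS = librosa.power_to_db(S)
    frames_per_min = 600                       # 60 s at 100 ms per frame
    n_min = logS.shape[1] // frames_per_min
    feats = []
    for m in range(n_min):
        seg = logS[:, m * frames_per_min:(m + 1) * frames_per_min]
        feats.append(np.concatenate([seg.mean(axis=1), seg.std(axis=1)]))
    return np.array(feats)                     # shape: (minutes, 2 * n_bands)
```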
In addition to modeling nonspeech ambience, we also plan to extend speaker identification techniques
(mentioned in Sec. 3) for this kind of scenario by employing cues such as the pitch track that can be
reliably extracted even in very noisy recordings [67]. To compensate for the diminished spectral
information in such cases, we plan to use wider temporal contexts to identify pitch dynamics and
idiosyncratic pitch gestures indicative of individual speakers.
Device and Encoding Characteristics
Although the goal of high-quality recording equipment is to be as neutral or invisible as possible, there
are frequently tell-tale characteristics that reveal details of the source equipment. Audio recording
circuitry will leave its mark in terms of:
• Bandwidth limitations: i.e., a low-frequency (rumble) filter and a high-frequency (anti-aliasing) cutoff. These cutoffs will usually be implemented with analog components prior to the analog-to-digital converter, and will thus vary slightly even between different units of the same type.
• Automatic gain control (AGC): A typical consumer video or audio recorder will include automatic gain adjustment to equalize the scene signal level. If the source becomes quiet, such circuitry increases gain at a fixed, slow rate characterized by the “decay time”, and quickly reduces gain after a sudden increase in level with a time constant known as the “attack time”. Even when the source material is at a relatively constant level, the AGC is constantly making small adjustments to the system gain, which allows its attack and decay times to be estimated; these will usually be specific to a particular piece of equipment.
• Background noise: In addition to the acoustic noise detected by the microphone, there is intrinsic electrical noise generated by the equipment that is exposed when the original source is quiet and/or the recorder is of poor quality (see the sketch after this list). In preliminary investigations, we examined recordings made by a portable MP3 player/recorder (the iRiver iFP-799), shown in Fig 8. These recordings show a clear cutoff at 13.5 kHz (related to the compressed representation); there is a relatively strong and steady harmonic at around 10.6 kHz, as well as weaker peaks at multiples of 500 Hz, still clearly visible in the average spectrum. Most surprising is additional, variable noise in the 10-13 kHz region (arising from crosstalk from the CPU) which actually characterizes both the recorder and its firmware version [68].
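A minimal sketch of how the intrinsic noise signature in the last item could be estimated (SciPy, the frame length, and the quiet-frame quantile are assumptions of this sketch):

```python
import numpy as np
from scipy.signal import stft

def quiet_segment_noise_spectrum(x, fs, frame_s=0.064, quantile=0.1):
    """Estimate a recorder's intrinsic noise spectrum by averaging the
    spectra of the quietest frames, where electrical noise (e.g. the
    10.6 kHz harmonic visible in Fig 8) dominates the acoustic signal."""
    f, t, Z = stft(x, fs=fs, nperseg=int(frame_s * fs))
    power = np.abs(Z) ** 2
    frame_energy = power.sum(axis=0)
    quiet = frame_energy <= np.quantile(frame_energy, quantile)  # quietest 10%
    return f, 10 * np.log10(power[:, quiet].mean(axis=1) + 1e-12)
```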
Similar to the case of image compression (Sec. 4.1), audio compression and other formatting leave clearly discernible features in the audio stream that may persist through subsequent re-encoding to reveal the tandem encoding resulting from editing. Common audio compression schemes are based on
psychoacoustic masking, in which quantization levels are chosen independently and dynamically in
separate frequency bands to ensure that the distortion remains below the threshold of audibility [69]. For
instance, in MPEG Audio (e.g. MP3) the spectrum is divided into thirty-two subbands of 690 Hz
bandwidth with new quantization bit allocations every 26 ms. Some high-frequency bands often contain
no perceptual energy at all and will be switched off for one or more frames, leading to clearly visible
holes and blobs in the spectrum (see Fig 9). The particular behavior of these quantization artifacts, easily
visible in a spectrogram, can indicate the particular compression algorithm in use along with its settings
(bitrate etc.); where these are inconsistent with the final audio encoding, varied source material and
compositing are revealed. In general, each implementation of an encoding algorithm will make slightly
different choices for compression algorithm parameters, leading to another device signature.
Fig 8. Audio recordings made by an iRiver iFP-799 portable player/recorder in quiet. Left column:
average spectrum, showing characteristic spectral features. Right column: spectrogram of 5 s
excerpt, showing dynamic structured high-frequency noise in the 10-15 kHz region.
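A rough sketch of how the band-gating behavior of such coders could be quantified directly from a decoded waveform (the uniform subband pooling, 26 ms frames, and the silence threshold are simplifying assumptions of this sketch):

```python
import numpy as np
from scipy.signal import stft

def subband_gating_profile(x, fs, n_subbands=32, frame_s=0.026):
    """Measure how often each of 32 uniform subbands drops to near silence
    in ~26 ms frames, a signature of psychoacoustic coders that switch off
    perceptually empty bands (see the gated regions in Fig 9)."""
    _, _, Z = stft(x, fs=fs, nperseg=int(frame_s * fs))
    power = np.abs(Z) ** 2
    edges = np.linspace(0, power.shape[0], n_subbands + 1, dtype=int)
    band = np.array([power[edges[i]:edges[i + 1]].mean(axis=0)
                     for i in range(n_subbands)])
    floor = np.quantile(band, 0.99, axis=1)[:, None] * 1e-6  # per-band threshold
    gated = band < floor
    return gated.mean(axis=1)      # fraction of "switched-off" frames per band
```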
Cues to Continuity
Edits such as insertions, deletions, duplication, and mixing may leave tell-tale signatures. Fig 9 shows an
example of a video soundtrack (from YouTube) where there is a clear gap in one background track, indicated by the vertical lines, during which a foreground track continues. The foreground track,
however, appears to have been originally recorded at a lower sampling rate and hence has little or no
energy above 7.5 kHz. The 80 dB range of the color bar approaches the dynamic range of human hearing,
so the difference between the presence and absence of a background noise floor in the top part of the
spectrum is not easily perceived in the recording; in the spectrogram, it is clearly visible.
Fig 9. Spectrogram of a video soundtrack excerpt, showing a clear gap (between the vertical lines)
in the wideband background signal, mixed with a foreground signal that has a lower cutoff frequency
apparently due to a difference in recording equipment. The box in the top left highlights the gating of
subbands characteristic of MP3 compression.
If the modification involves lengthening the original recording (e.g. to insert some new foreground event),
an obvious technique for preserving the background sound is to copy a section of background sound from
elsewhere in the recording; in the absence of obvious foreground sound events, such duplication will not
be noticed by listeners -- but it can be detected as a highly improbable repetition of background noise,
revealed for instance through cross correlation. Exhaustive correlation of all segments is very expensive
(particularly since it would be best performed separately in multiple frequency bands to avoid distortion
by louder, added foreground events) but can be made much more efficient by audio hashing; our recent
work [70] investigated hashes consisting of pairs of prominent spectral peaks nearby in time, then
searching for clusters of hashes at the same relative timings to find repeating stereotyped events (such as
phone rings) in long-duration environmental recordings.
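A very reduced sketch of the hashing idea follows (one peak per frame rather than multiple peaks per band, and a plain dictionary index; these simplifications are ours, not the method of [70]):

```python
import numpy as np
from collections import defaultdict
from scipy.signal import stft

def peak_pair_hashes(x, fs, max_dt=32):
    """Reduced sketch of landmark hashing: pick the strongest bin per frame,
    pair it with peaks in the following frames, and index by (f1, f2, dt).
    Duplicated background material shows up as many hash keys recurring at a
    consistent relative time offset between two regions of the recording."""
    _, _, Z = stft(x, fs=fs, nperseg=1024)
    peaks = np.abs(Z).argmax(axis=0)                 # one peak bin per frame
    hashes = defaultdict(list)
    for t1 in range(len(peaks) - max_dt):
        for dt in range(1, max_dt):
            hashes[(peaks[t1], peaks[t1 + dt], dt)].append(t1)
    return hashes
```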
4.3 Joint Audio-Visual Scene Authentication
Having both audio and visual channels available makes possible further, cross-modal checks for validity.
One possible forgery scenario occurs when a soundtrack is doctored to change the words (for instance by
splicing-in sounds from elsewhere in the recording) without altering the video. This can be surprisingly
convincing, since human observers are largely insensitive to audio-visual asynchrony that is smaller than
around 100 ms [71]. Thus, the edited-in audio need not correlate exactly with the original video to
convince the viewer. Automatic analysis can, however, make a more exact comparison between audio
and video channels, and detect the decrease in synchrony that would result from such a splice. Firstly, the
region of the mouth corresponding to the speech can be identified by measuring the mutual information
between audio features and video features at each location [72-74]; non-mouth parts of the video will be
unrelated to the speech signal whereas the visible state of the mouth is strongly informative about the
speech signal. Secondly, correlation can be calculated between linear subspace projections of mouth
motion and audio signal energy to detect the best temporal alignment between the two; this may be
nonzero given the finite speed of sound and differences in processing chains, but should remain constant
within a recording. Finally, any edited or doctored regions will most likely show a statistically significant
reduction in this correlation during the region of the edit.
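A simple sketch of the lag-correlation step is given below; the specific 1-D mouth-motion projection and audio energy features, and their alignment to the video frame rate, are assumed to be computed elsewhere.

```python
import numpy as np

def av_lag_correlation(audio_energy, mouth_motion, max_lag=10):
    """Correlate frame-rate audio energy with a 1-D projection of mouth-region
    motion over a range of relative lags; a splice should show a drop (or
    shift) of the best-lag correlation within the edited region."""
    a = (audio_energy - audio_energy.mean()) / (audio_energy.std() + 1e-9)
    v = (mouth_motion - mouth_motion.mean()) / (mouth_motion.std() + 1e-9)
    lags, corr = [], []
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            c = np.mean(a[lag:] * v[:len(v) - lag]) if lag < len(a) else 0.0
        else:
            c = np.mean(a[:lag] * v[-lag:])
        lags.append(lag)
        corr.append(c)
    return np.array(lags), np.array(corr)
```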
Fig 10 (from [75]) shows the result of a time-varying audio-visual correlation between a speech signal
and mouth image; the vertical axis is the relative timing, the horizontal axis is the time within the clip,
and the darkness indicates the strength of correlation between the two modes averaged over windows of
different lengths. A longer window gives a more accurate indication of the true correlation, but is less
well able to detect short-term changes. Our approach will be to build a statistical model of vertical slices
through the shorter-window version, then look for regions where the correlation properties do not match
the rest of the recording.
Fig 10. Correlation between audio
and video features as a function
of relative time lag (vertical axis)
and time within the clip (horizontal
axis) for four different averaging
windows (from [75]).
4.4 Authentication through Location-Event Context
The requirements of scene consistency in authentic media can be extended to another important scenario
in which multiple audio-video streams are captured at the same site covering the same event, like news
reporting, social events (weddings, parties), and popular tourist events. For a media stream to be trusted as
associated with a specific event, its audio-visual features need to show adequate correlation with other
streams from the same event-location context. Such correlations may be manifested in the visual domain as overlapping backgrounds, or in the audio domain as shared sound events and ambient noise similar to those mentioned in Sec. 4.2.
Verification of context consistency requires the solution of two sub-problems. First, media streams
sharing the same context as a target stream in question need to be identified in order to establish the
appropriate event-location context. Then, adequate computational measures are needed to estimate the
agreement between the target and the context. For the former, one option is to rely on the external
metadata (e.g., GPS and time) if available, and then use scene reconstruction techniques to refine the
location information. For example, in the newly announced Photosynth service from Microsoft Live [76],
consumer photos taken at the same site (over different times) are used to estimate approximate camera
poses (location, orientation, and field of view) associated with each image and construct a sparse 3D
scene model for interactive browsing. Such techniques, based on the principle of structure from motion
[77], are feasible given a sufficient number of images with overlapping views. The resulting location
information, in terms of distance, direction, and field of view of the camera, is more precise than that
given by the GPS information or other coarse location tags from users. When GPS information is not
available, we may also use Web image search systems (e.g., Google Image Search or the Flickr photo-sharing forum) to find images that come from the same claimed location as the target image.
Given images/videos originating from the same claimed location and event, we will then estimate their
correlation in terms of audio-visual scene characteristics. In [78], we have developed a robust technique
for detecting near-duplicate images that are captured using cameras with different poses (angles, distances, etc.) and conditions (lighting, exposure, etc.). As shown in Fig 11, our method extracts salient parts from
each image and then learns a generative statistical model to explain the geo-photometric transformation
relation between near-duplicate scenes. Bayesian detection rules are then applied to determine whether
two images are a near-duplicate pair – indicating high likelihood of originating from the same location. If
none of the images/videos in the same context group shares a strong correlation with the target image, suspicion will be raised and additional processing is needed to verify the claimed source of the target. Our
near-duplicate detection method has been shown as effective by systematic evaluation over a publicly
available benchmark dataset [79]. In this project, we will extend it to handle the temporal dimension of
the video and the multi-modal integration over multiple streams as discussed below.
Fig 11. Detecting near-duplicate images by part-based graph models. Contents sharing strong audio-visual correlations are more likely to be from the same location/event.
On the audio side, we will compute similar correlation measures based on the environment characteristics
(e.g., air conditioning noise) and audio events (e.g., explosion, clapping) as discussed in Sec 4.2. Such
correlations are expected to be frequent for sounds captured at the same location over the same time, due
to the omni-directional nature of the audio recordings. Furthermore, we will combine the scene
correlations across audio and visual modalities, as discussed in Sec 4.3. Given the multiple audio-visual
streams from the same context, such cross-modal correlations are likely to be numerous and strong, since
sound events in one stream may be correlated with the visual activities captured in a different video. If the
target stream is claimed to be of an event that is simultaneously captured by multiple recorders, it is
reasonable to expect strong correlations in the sound track, the visual scene, and/or across audio-visual
tracks over multiple streams. Multi-modal consistency characteristics like this are critical and novel,
presenting a very promising research direction for media forensics.
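As one concrete instance of such a cross-stream check, the following sketch (SciPy-based; the band limits, frame size, and the use of a single broadband envelope are simplifying assumptions) correlates the log-energy envelopes of two soundtracks claimed to come from the same event:

```python
import numpy as np
from scipy.signal import stft, correlate

def ambience_agreement(x1, x2, fs, band=(50, 4000)):
    """Compare the broadband energy envelopes of two soundtracks claimed to
    come from the same event; a genuinely co-located pair should show a
    strong peak in the normalized cross-correlation at some small lag."""
    def envelope(x):
        f, _, Z = stft(x, fs=fs, nperseg=int(0.05 * fs))     # 50 ms frames
        sel = (f >= band[0]) & (f <= band[1])
        e = np.log((np.abs(Z[sel]) ** 2).sum(axis=0) + 1e-12)
        return (e - e.mean()) / (e.std() + 1e-9)
    e1, e2 = envelope(x1), envelope(x2)
    n = min(len(e1), len(e2))
    xc = correlate(e1[:n], e2[:n], mode="full") / n
    return xc.max()    # agreement score; near zero suggests no shared context
```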
5 Evaluation and Milestones
The utility of the proposed detection system should be measured not only by its detection accuracy, but also by how informatively it explains its detection results. This is an important consideration in view of the
diverse array of attack scenarios and the large number of tools applicable to various components of the
media content at different levels. Without an intuitive and flexible evaluation mechanism, it will be quite
difficult to develop a sound strategy for coping with such diverse issues. To this end, our evaluation
efforts will adopt a multi-fold approach covering all of the important aspects: toolkit, dataset, and attacks.
Organization and Characterization of Toolkit
Our toolkit will comprise a rich set of software prototypes resulting from the proposed research. Each tool
will be categorized according to the target modality (audio, visual, cross-modal), the corresponding point
in the edit/processing pipeline (scene, device, coding, editing), and the applicable data granularity (local region, image/audio frame, shot, set of streams). Such explicit information will help us match the right dataset and experimental conditions to the evaluation of each tool. In addition, the detection output of each tool
is not just a binary decision (pass or fail). Instead, it will include other relevant information such as the
suspected location of tampering, confidence scores of the detection, and reliability of the detection tool
based on separate validation experiments. The provision of such an expanded set of information allows us to integrate diverse tools and summarize results for users in an intuitive way.
Benchmark Datasets
To test the performance of individual tools and combined systems, we will take a proactive approach to
constructing diverse benchmark datasets from multiple domains. For this, we will largely leverage our
extensive resources established in prior works, including several widely used datasets for image forensics
developed in our previous Cyber Trust project.
• Columbia’s Image Splicing Dataset [80, 81]: 1845 spliced and authentic image blocks, originating
from the CalPhoto image library [82].
• Columbia’s photo vs. CG classification dataset [61, 83]: 3200 images including natural photos from a
personal collection, Internet, photo-realistic CG images from 3D CG developer sites, and recaptured
CG images. This set has been downloaded by more than 50 groups so far.
• Raw images captured by multiple cameras from different manufacturers and models (Nikon, Canon,
Kodak, SONY etc). These will be used to evaluate camera signature estimation methods.
• TRECVID [84] videos used for video retrieval evaluation in 2004-2006. It consists of more than 200
hours of broadcast news videos from 6 different channels in 3 different languages over the same
period of time. This will be an excellent dataset for testing the location-event contextual consistency discussed in Sec 4.4, since multiple video programs from different channels often cover the same events. It is also publicly available through the NIST TRECVID organization.
• Audio LifeLog dataset: a 62 hour audio dataset that has been hand-marked with the wearer’s location.
This can be used to verify algorithms for identifying locations based on acoustic properties. We also
have two Microsoft SenseCams which can be used to capture simultaneous audio and time-lapse-style
image sequences for long-duration recordings. We will use these in combination with other recorders
to create our own multiple simultaneous recordings to test “contextual consistency”.
• Commercial movies provide an additional test case for audio-visual scene authentication, since they
frequently have soundtracks re-recorded by actors speaking “in time” to the original video.
Discriminating the synchrony between dubbed and original “production sound” scenes (where ground
truth is typically known) will be a demanding test for our techniques.
New datasets will also be created whenever necessary. Furthermore, we will apply several typical editing
and post-processing techniques (scaling, smoothing, and double compression) to the above datasets to
evaluate the impact of such post-processing operations on the performance of each detection tool.
Attack Scenarios and Performance Metrics
Typical classes of tampering attacks have been discussed in Sec 2 – deletion, insertion, and combinations.
These operations are relatively well-defined and can be evaluated quantitatively. Many of the datasets
discussed above have been designed to simulate such attacks, such as splicing, CG content synthesis, and
video mixing. Here, standard performance metrics can be used, such as precision-recall, miss, false alarm,
and detection speed. In many cases, other performance factors will also be considered: the capability of locating forged areas, sensitivity to small forged areas, and robustness over different image content classes and imaging conditions (lighting, background, and camera settings). From the attacker's perspective, we may also assess performance in terms of the cost (time and computing resources) required for the attacker to create forged content that passes the detection system without compromising perceptual quality too much.
Besides the standard tampering operations mentioned above, there are many other special tactics that may
be employed by the forger. It is almost impossible for the detection system to anticipate the full spectrum
of tactics and guarantee complete immunity to attacks. In view of this, we focus our research on discovery
of fundamental knowledge and development of generic methods, leading to a sound foundation for
developing useful engineering solutions. In the following, we briefly discuss a few special attack tactics
and implications for our research.
Consideration of Special Attack Tactics
Once a forgery creator has unlimited access to the detector, an oracle attack may be launched. The
forger can incrementally modify the forgery guided by the detection results until it passes the detector
with a minimal visual quality loss. Some ideas have been proposed to partially address this issue. [85]
proposes a method of converting a parametric decision boundary into a fractal (non-parametric) one, so
that an accurate estimation of the boundary requires a much larger number of trials. [86] modifies the
temporal behavior of the detector such that the duration for returning a decision is lengthened when
observing a sequence of similar-content input images.
Apart from the protocol level, forgers could also apply various post-processing operations (smoothing,
compression etc) to mask forgery artifacts. This problem may be addressed by the post-processing
detection techniques mentioned in Sec 4.1. Heavy post-processing is often needed to mask forgery artifacts, and its detection would greatly reduce the trustworthiness of the content.
A more sophisticated post-processing would be to simulate the device signature after content alteration so
that the forgery has a consistent device signature. However, such an attack is difficult to implement in practice, as the simulated device signature has to be strong enough to mask the inconsistency in the original device signature, possibly resulting in a large loss of image quality. Furthermore, our proposed
method is based on checking multiple device signatures, which makes this attack more difficult as
simulation of all the device signatures is needed.
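To illustrate why simulating a device signature convincingly is hard, the sketch below performs the kind of consistency check we rely on for one such signature, sensor pattern noise: the noise residual of each image tile is correlated against the camera's reference pattern, and tiles pasted in from another device should correlate poorly. The Gaussian-filter denoiser, tile grid, and function names are simplifications for illustration; our actual framework checks several signatures of this kind jointly.

import numpy as np
from scipy.ndimage import gaussian_filter

def noise_residual(gray, sigma=1.0):
    """Crude noise residual: image minus a smoothed version of itself."""
    g = gray.astype(float)
    return g - gaussian_filter(g, sigma)

def region_signature_consistency(gray, reference_pattern, grid=4):
    """Correlate the noise residual of each image tile with the camera's
    reference pattern noise. Tiles whose correlation is far below the rest
    are candidates for content pasted in from a different device."""
    res = noise_residual(gray)
    h, w = res.shape
    th, tw = h // grid, w // grid
    scores = np.zeros((grid, grid))
    for i in range(grid):
        for j in range(grid):
            a = res[i*th:(i+1)*th, j*tw:(j+1)*tw].ravel()
            b = reference_pattern[i*th:(i+1)*th, j*tw:(j+1)*tw].ravel()
            a = (a - a.mean()) / (a.std() + 1e-9)
            b = (b - b.mean()) / (b.std() + 1e-9)
            scores[i, j] = np.mean(a * b)
    return scores   # inspect for tiles that deviate strongly from the median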
An attacker can also produce a seemingly authentic image or video by recapturing the sound and sight produced by an image print/display or audio playback. However, such an attack is not easy in practice: producing a good-quality recaptured duplicate requires a careful setup that renders realistic 3D sound and sight, which is not easily achieved. Furthermore, recapturing may not remove all of the inconsistencies in an image or video, especially scene inconsistencies.
There is also the issue of distinguishing innocuous operations, such as resizing and transcoding, from malicious attacks or manipulations. The innocuous operations share a common property: they are mainly global, in contrast to malicious manipulations, which are mainly local.
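A minimal sketch of this global-versus-local test: given a per-tile artifact-score map produced by any of the detection cues, innocuous global operations shift all tiles in roughly the same way, whereas a local malicious edit shows up as outlier tiles. The robust z-score thresholding below is an illustrative choice, not our final decision rule.

import numpy as np

def local_vs_global_flag(tile_scores, z_thresh=3.0):
    """Given a 2-D map of per-tile artifact scores (from any detector cue),
    decide whether the evidence looks global (innocuous processing) or
    local (possible malicious edit) by looking for outlier tiles."""
    s = np.asarray(tile_scores, dtype=float)
    med = np.median(s)
    mad = np.median(np.abs(s - med)) + 1e-9       # robust spread estimate
    z = np.abs(s - med) / (1.4826 * mad)          # MAD-based z-scores
    outliers = np.argwhere(z > z_thresh)
    if len(outliers) == 0:
        return "global-or-clean", outliers
    return "suspected-local-edit", outliers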
Milestones:
In Year 1, we will focus on developing individual detection tools and the required benchmark datasets. These include camera signature consistency checking in image and video, local graphics object detection, audio-based location detection, and modeling of editing/double compression. In Year 2, we will extend the research to joint audio-visual authentication and multi-modal, multi-stream context-consistency checking. We will select suitable test data from our existing corpora of LifeLog wearable recordings and TRECVID multi-channel news video. An integrated prototype system will be developed and deployed in Year 3 so that users/developers may simulate various scenarios, test the strengths and weaknesses of the various components, and thereby develop strategies for fusion and integration. Diagnostic and explanatory mechanisms will be added to the prototype to give useful feedback for refining the component solutions as well as the overall fusion strategies. Throughout the project period, we will broadly disseminate software, data, and other results to the public whenever permissible.
6 Team Expertise and Related Results from Prior NSF Support
Our team comprises the complementary expertise required to address the challenging problems of multimedia forensics. PI Chang is an established researcher with extensive experience in image authentication and visual content analysis. Co-PI Ellis has pioneered the development of theories and tools for audio scene analysis and speech/music processing. We have worked closely in several projects in the past,
including a recent joint project on consumer video indexing. Chang is in the third year of project IIS-0430258 "Cyber Trust – Blind Detection of Digital Photograph Tampering" ($740,000, 2004-07, Chang as PI), which has developed new theories, methods, and large benchmark datasets for detecting image splicing and CG images and for using camera response functions as device signatures. These results provide a sound foundation for pursuing the new research tasks proposed in this project.
Co-PI Ellis is in the fourth year of project IIS-0238201 "CAREER: The Listening Machine" ($500,000, award period 2003-02-01 to 2008-01-31), which has developed several novel audio information extraction techniques, including the audio lifelog work described above. He is also in the second year of IIS-0535168 "Separating Speech from Speech Noise" ($750,000 total, 2006-01-01 to 2008-12-31), which aims to improve source separation and speech enhancement through closer investigation of how listeners perceive distorted and noisy speech.
7 Broader Impact and Integrated Research/Education
This project will have major impact on many areas of national priority, including surveillance security,
news reporting, intelligence gathering, criminal investigation, financial transactions, and many others. It
is motivated by a problem confronting every sector of society: Can we accept audio/visual recordings as
reliable and trustworthy evidence? This problem is urgent because emerging advanced digital editing
tools are making us particularly vulnerable: formats such as video, which have hitherto been trustworthy, are becoming easier and easier to modify and simulate with unprecedented realism. Thus, the tools we propose to develop are crucial for forestalling the potentially grave consequences of people and organizations being exploited through their tacit assumption that media are valid. Without adequate foresight and advance development, society runs the risk of scrambling to develop authentication tools and protocols in the face of a rash of forgeries made possible by novel editing tools.
Broad Dissemination of Results: We will promote awareness of the critical issues and potential solutions among the public. We will disseminate results through multiple channels, including conventional academic publishing and collaborative experimentation with actual users such as those in Columbia's Journalism School. We will propose organizing a special session on multimodal forensics at one or more relevant venues in the second year of the project to showcase our results and to help bring together other labs working in this area to discuss the most significant threats and the best evaluation metrics and datasets. To that end, we will prepare and distribute both datasets and tools (as discussed in Sec. 5) to encourage and support researchers interested in working in this area, and to facilitate common, comparable evaluation results. This builds on our existing database and tools distribution from our work in image authentication, and on our efforts to organize community-wide evaluations in music information retrieval [87] and a large-scale concept ontology for multimedia [88].
Integrated Research and Education: This project will integrate several education objectives through
graduate student training, new course material development, and broadened outreach to the external
community. This project will support two graduate students who will specialize in audio and video
modalities respectively and both engage in research of multi-modal media forensics. Additionally, the
research results will feed into several existing courses: “Statistical Methods for Video Indexing and
Analysis” (ELEN E6882), “Statistical Pattern Recognition” (ELEN E6887) and “Digital Image
Processing” (ELEN E4830), all by PI Chang, and “Speech and Audio Processing and Recognition”
(ELEN E6820), by co-PI Ellis. The research results will form excellent modules for these courses,
providing new teaching materials, illustrative examples, and new topics for student projects as part of the
courses. For outreach, we participate in an NSF-sponsored GK12 program run by Dr. Jack McGourty of the Engineering School, including presentations and lab demos to visitors from local middle schools, which, given Columbia's location on the fringe of Harlem, include many underrepresented minority students.
8 References
[1] W. J. Mitchell, The Reconfigured Eye: Visual Truth in the Post-Photographic Era. Cambridge, Mass.: MIT Press, 1992.
[2] Yahoo! News. (2006). You Witness News. http://news.yahoo.com/you-witness-news.
[3] S. Gavard, "Photo-graft: A critical analysis of image manipulation," MA Thesis, McGill University, Montreal, Quebec, 1999.
[4] F. Baker. (2004). Is Seeing Believing? A Resource for Educators. http://www.med.sc.edu:1081/isb.htm.
[5] FakeorFoto. Fake or Foto. http://www.autodesk.com/eng/etc/fakeorfoto/quiz.html.
[6] Worth1000. Image Editing Contest Site. http://www.worth1000.com/.
[7] M. M. Yeung and F. Mintzer, "An invisible watermarking technique for image verification," IEEE International Conference on Image Processing, 1997.
[8] M. Wu and B. Liu, "Watermarking for Image Authentication," IEEE International Conference on Image Processing, 1998.
[9] I. J. Cox, M. L. Miller, and J. A. Bloom, Digital Watermarking. Morgan Kaufmann, 2002.
[10] J. Fridrich, M. Goljan, and A. C. Baldoza, "New Fragile Authentication Watermark for Images," IEEE International Conference on Image Processing, Vancouver, Canada, 2000.
[11] P. W. Wong, "A watermark for image integrity and ownership verification," IS&T Conference on Image Processing, Image Quality and Image Capture Systems, Portland, Oregon, 1998.
[12] E. T. Lin, C. I. Podilchuk, and E. J. Delp, "Detection of Image Alterations Using Semi-Fragile Watermarks," SPIE International Conference on Security and Watermarking of Multimedia Contents II, San Jose, CA, 2000.
[13] C.-Y. Lin and S.-F. Chang, "A Robust Image Authentication Method Surviving JPEG Lossy Compression," SPIE Storage and Retrieval of Image/Video Database, San Jose, 1998.
[14] J. Fridrich, "Image Watermarking for Tamper Detection," IEEE International Conference on Image Processing, Chicago, 1998.
[15] N. Memon and P. Vora, "Authentication Techniques for Multimedia Content," SPIE Multimedia Systems and Applications, Boston, MA, 1998.
[16] C.-Y. Lin and S.-F. Chang, "A Robust Image Authentication Method Distinguishing JPEG Compression from Malicious Manipulation," IEEE Transactions on Circuits and Systems for Video Technology, 2000.
[17] M. Schneider and S.-F. Chang, "A Robust Content Based Digital Signature for Image Authentication," IEEE International Conference on Image Processing, Lausanne, Switzerland, 1996.
[18] S. Bhattacharjee, "Compression Tolerant Image Authentication," IEEE International Conference on Image Processing, Chicago, 1998.
[19] E.-C. Chang, M. S. Kankanhalli, X. Guan, H. Zhiyong, and W. Yinghui, "Image authentication using content based compression," ACM Multimedia Systems, vol. 9, pp. 121-130, 2003.
[20] N. Memon, P. Vora, B.-L. Yeo, and M. Yeung, "Distortion bounded authentication techniques," SPIE Security and Watermarking of Multimedia Contents II, 2000.
[21] G. L. Friedman, "The trustworthy digital camera: restoring credibility to the photographic image," IEEE Transactions on Consumer Electronics, vol. 39, pp. 905-910, 1993.
[22] P. Moulin and J. A. O'Sullivan, "Information-Theoretic Analysis of Information Hiding," IEEE Transactions on Information Theory, vol. 49, pp. 563, 2003.
[23] A. Martin, G. Sapiro, and G. Seroussi, "Is image steganography natural?" IEEE Transactions on Image Processing, vol. 14, pp. 2040, 2005.
[24] A. Westfeld and A. Pfitzmann, "Attacks on Steganographic Systems," Lecture Notes in Computer Science, vol. 1768, pp. 61-75, 2000.
[25] J. Fridrich, M. Goljan, and R. Du, "Reliable Detection of LSB Steganography in Grayscale and Color Images," ACM Special Session on Multimedia Security and Watermarking, Ottawa, Canada, 2001.
[26] I. Avcibas, N. Memon, and B. Sankur, "Steganalysis based on Image Quality Metrics - Differentiating between Techniques," IEEE Workshop on Multimedia Signal Processing, Cannes, France, 2001.
[27] T.-T. Ng and S.-F. Chang, "A Model for Image Splicing," IEEE International Conference on Image Processing, Singapore, 2004.
[28] D. Fu, Y. Q. Shi, and W. Su, "Detection of image splicing based on Hilbert-Huang transform and moments of characteristic functions with wavelet decomposition," International Workshop on Digital Watermarking, Jeju, Korea, 2006.
[29] W. Chen, Y. Q. Shi, and S. Wei, "Image splicing detection using 2-D phase congruency and statistical moments of characteristic function," SPIE Electronic Imaging, San Jose, CA, 2007.
[30] J. Lukas, J. Fridrich, and M. Goljan, "Determining Digital Image Origin Using Sensor Imperfections," SPIE, 2005.
[31] J. Lukas, J. Fridrich, and M. Goljan, "Detecting Digital Image Forgeries Using Sensor Pattern Noise," SPIE, 2006.
[32] A. C. Popescu and H. Farid, "Exposing Digital Forgeries in Color Filter Array Interpolated Images," IEEE Transactions on Signal Processing, vol. 53, pp. 3948-3959, 2005.
[33] A. Swaminathan, M. Wu, and K. J. R. Liu, "Component Forensics for Digital Camera: A Non-intrusive Approach," CISS, 2006.
[34] A. Swaminathan, M. Wu, and K. J. R. Liu, "Non-intrusive Forensic Analysis of Visual Sensors Using Output Images," ICASSP, 2006.
[35] S. Lin, J. Gu, S. Yamazaki, and H.-Y. Shum, "Radiometric Calibration from a Single Image," CVPR, 2004.
[36] S. Lin and L. Zhang, "Determining the Radiometric Response Function from a Single Grayscale Image," IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2005.
[37] T.-T. Ng, S.-F. Chang, and M.-P. Tsui, "Using Geometry Invariants for Camera Response Function Estimation," submitted, 2006.
[38] T.-T. Ng, S.-F. Chang, and M.-P. Tsui, "Camera Response Function Estimation from a Single-channel Image Using Differential Invariants," Columbia University ADVENT Technical Report #216-2006-2, March 2006.
[39] Y.-F. Hsu and S.-F. Chang, "Detecting Image Splicing Using Geometry Invariants and Camera Characteristics Consistency," ICME, 2006.
[40] A. K. Mikkilineni, G. N. Ali, P.-J. Chiang, G. T.-C. Chiu, J. P. Allebach, and E. J. Delp, "Signature-embedding in printed documents for security and forensic applications," Security, Steganography, and Watermarking of Multimedia Contents, 2004.
[41] S. Lyu and H. Farid, "How Realistic is Photorealistic?" IEEE Transactions on Signal Processing, vol. 53, pp. 845-850, 2005.
[42] I. Avcibas, S. Bayram, N. Memon, M. Ramkumar, and B. Sankur, "A classifier design for detecting image manipulations," IEEE International Conference on Image Processing, Singapore, 2004.
[43] A. Swaminathan, M. Wu, and K. J. R. Liu, "Image Tampering Identification using Blind Deconvolution," IEEE International Conference on Image Processing, Atlanta, GA, 2006.
[44] J. Lukas and J. Fridrich, "Estimation of primary quantization matrix in double compressed JPEG images," Digital Forensic Research Workshop, 2003.
[45] A. C. Popescu and H. Farid, "Statistical Tools for Digital Forensics," 6th International Workshop on Information Hiding, Toronto, Canada, 2004.
[46] J. He, Z. Lin, L. Wang, and X. Tang, "Detecting doctored JPEG images via DCT coefficient analysis," European Conference on Computer Vision, 2006.
[47] D. Fu, Y. Q. Shi, and W. Su, "A generalized Benford's law for JPEG coefficients and its applications in image forensics," SPIE Electronic Imaging, San Jose, CA, 2007.
[48] H. Farid, "Digital Image Ballistics from JPEG Quantization," Department of Computer Science, Dartmouth College, 2006.
[49] W. Wang and H. Farid, "Exposing Digital Forgeries in Video by Detecting Double MPEG Compression," ACM Multimedia and Security Workshop, Geneva, Switzerland, 2006.
[50] H. Farid, "Detecting Digital Forgeries Using Bispectral Analysis," MIT AI Memo, MIT, 1999.
[51] T.-T. Ng, S.-F. Chang, and Q. Sun, "Blind Detection of Photomontage Using Higher Order Statistics," IEEE International Symposium on Circuits and Systems, Vancouver, Canada, 2004.
[52] J. Fridrich, D. Soukal, and J. Lukas, "Detection of copy-move forgery in digital images," Digital Forensic Research Workshop, Cleveland, OH, 2003.
[53] A. C. Popescu and H. Farid, "Exposing Digital Forgeries by Detecting Duplicated Image Regions," Department of Computer Science, Dartmouth College, 2004.
[54] W. Luo, J. Huang, and G. Qiu, "Robust Detection of Region-Duplication Forgery in Digital Image," International Conference on Pattern Recognition, 2006.
[55] T.-T. Ng, S.-F. Chang, Y.-F. Hsu, L. Xie, and M.-P. Tsui, "Physics-Motivated Features for Distinguishing Photographic Images and Computer Graphics," ACM Multimedia, Singapore, 2005.
[56] D. Begault, E. B. Brixen, G. Reid, R. Sanders, and L.-L. Tjellesen, "Audio Forensics," 121st Audio Engineering Society Convention (AES), San Francisco, CA, 2006.
[57] D. A. Reynolds, "An overview of automatic speaker recognition technology," Proc. ICASSP, Orlando, FL, 2002.
[58] D. A. Reynolds, T. F. Quatieri, and R. B. Dunn, "Speaker Verification Using Adapted Gaussian Mixture Models," Digital Signal Processing, vol. 10, pp. 19-41, 2000.
[59] Nuance Inc. (2006). Secure Sensitive Transactions with Nuance Speech Secure. http://www.nuance.com/speakerverification/.
[60] B. B. Mandelbrot, The Fractal Geometry of Nature. San Francisco: W. H. Freeman, 1983.
[61] T.-T. Ng and S.-F. Chang. (2005). Columbia Photographic Images and Photorealistic Computer Graphics Dataset. http://www.ee.columbia.edu/ln/dvmm/downloads/PIM_PRCG_dataset/.
[62] T.-T. Ng and S.-F. Chang. (2005). Columbia Online Demo: Photographic Image vs. Computer Graphics Detector (Version 4). http://apollo.ee.columbia.edu/trustfoto/trustfoto/natcgV4.html.
[63] T. I. Ianeva, A. P. d. Vries, and H. Rohrig, "Detecting cartoons: A case study in automatic video-genre classification," IEEE International Conference on Multimedia and Expo, 2003.
[64] S. F. Chang and A. Eleftheriadis, "Error accumulation of repetitive image coding," IEEE International Symposium on Circuits and Systems, ISCAS'94, 1994.
[65] D. P. W. Ellis and K. Lee, "Accessing minimal-impact personal audio archives," IEEE MultiMedia, vol. 13, pp. 30-38, 2006.
[66] D. Ellis and K. S. Lee, "Features for Segmenting and Classifying Long-Duration Recordings of Personal Audio," ISCA Tutorial and Research Workshop on Statistical and Perceptual Audio Processing SAPA-04, Jeju, Korea, 2004.
[67] K. S. Lee and D. P. W. Ellis, "Voice Activity Detection in Personal Audio Recordings Using Autocorrelogram Compensation," Interspeech ICSLP-06, Pittsburgh, 2006.
[68] D. P. W. Ellis. (2004). iRiver iFP-799T Recording Noise Analysis.
[69] T. Painter and A. Spanias, "Perceptual coding of digital audio," Proc. IEEE, vol. 80, pp. 451-513, 2000.
[70] J. Ogle and D. P. W. Ellis, "Fingerprinting to Identify Repeated Sound Events in Long-Duration Personal Audio Recordings," Proc. ICASSP, Hawai'i, 2007.
[71] W. Fujisaki and S. Nishida, "Temporal frequency characteristics of synchrony-asynchrony discrimination of audio-visual signals," Experimental Brain Research, vol. 166, pp. 455-464, 2005.
[72] J. Hershey and J. Movellan, "Audio-vision: Using audio-visual synchrony to locate sounds," Advances in Neural Information Processing Systems, 1999.
[73] J. W. Fisher III, T. Darrell, W. T. Freeman, and P. Viola, "Learning joint statistical models for audio-visual fusion and segregation," Advances in Neural Information Processing Systems, vol. 13, 2000.
[74] H. J. Nock, G. Iyengar, and C. Neti, "Speaker Localisation Using Audio-Visual Synchrony: An Empirical Study," in Lecture Notes in Computer Science: Image and Video Retrieval, vol. 2728/2003, 2003, pp. 488-499.
[75] M. Slaney and M. Covell, "FaceSync: A Linear Operator for Measuring Synchronization of Video Facial Images and Audio Tracks," Advances in Neural Information Processing Systems, 2000.
[76] Photosynth. (2006). Microsoft Live Photosynth service. http://labs.live.com/photosynth/.
[77] N. Snavely, S. M. Seitz, and R. Szeliski, "Photo tourism: exploring photo collections in 3D," ACM Transactions on Graphics (TOG), vol. 25, pp. 835-846, 2006.
[78] D. Q. Zhang and S. F. Chang, "Detecting image near-duplicate by stochastic attributed relational graph matching with learning," Proceedings of the 12th Annual ACM International Conference on Multimedia, 2004.
[79] D. Q. Zhang and S. F. Chang. Columbia Image Duplicate Benchmark Data Set. http://www.ee.columbia.edu/dvmm/newDownloads.htm.
[80] T.-T. Ng and S.-F. Chang, "A Data Set of Authentic and Spliced Image Blocks," Columbia University ADVENT Technical Report, June 2004.
[81] T.-T. Ng and S.-F. Chang. (2004). Columbia Image Splicing Detection Evaluation Dataset. http://www.ee.columbia.edu/ln/dvmm/downloads/AuthSplicedDataSet/AuthSplicedDataSet.htm.
[82] Calphoto. (2000). A database of photos of plants, animals, habitats and other natural history subjects.
[83] T.-T. Ng, S.-F. Chang, Y.-F. Hsu, and M. Pepeljugoski, "Columbia Photographic Images and Photorealistic Computer Graphics Dataset," Columbia University ADVENT Technical Report, Feb 2005.
[84] TRECVID. (2001-2006). National Institute of Standards and Technology: TREC Video Retrieval Evaluation. http://www-nlpir.nist.gov/projects/t01v/.
[85] A. Tewfik and M. Mansour, "Secure Watermark Detection with Non-Parametric Decision Boundaries," IEEE International Conference on Acoustics, Speech, and Signal Processing, 2002.
[86] I. Venturini, "Counteracting Oracle attacks," ACM Multimedia and Security Workshop, Magdeburg, Germany, 2004.
[87] G. Poliner, D. Ellis, A. Ehmann, E. Gomez, S. Streich, and B. Ong, "Melody Transcription from Music Audio: Approaches and Evaluation," IEEE Transactions on Audio, Speech, and Language Processing, accepted for publication, 2007.
[88] M. Naphade, J. R. Smith, J. Tesic, S. F. Chang, W. Hsu, L. Kennedy, A. Hauptmann, and J. Curtis, "Large-scale concept ontology for multimedia," IEEE MultiMedia Magazine, vol. 13, pp. 86-91, 2006.