CT-ISG: Trustworthy Media: Blind Passive Media Forensics for Content Integrity Verification

Project Summary

Can audio-visual recordings be accepted as reliable and trustworthy evidence? This problem, confronting many sectors of society, has become urgent as emerging digital editing tools make us increasingly vulnerable to forgeries. The conventional wisdom that "a photo is fact" is no longer valid, and digital audio-visual content can nowadays be modified and simulated with unprecedented realism. There is a critical need to develop robust and flexible techniques for verifying the integrity of multimodal digital information in order to restore its trustworthiness: this project addresses that need.

In response to the challenge of malicious media manipulation, we propose a team effort to develop new theories, methods, and a comprehensive suite of tools that can be used to verify content integrity in various modalities (audio, visual, and combinations) at multiple levels (object, frame, shot, stream) under diverse contexts. Our efforts will be based on an important paradigm, called blind passive media forensics, which fundamentally differs from conventional approaches using cryptographic signatures and watermarking. The proposed methods extract latent, unique signatures from the signals and sensing devices to detect tampering anomalies, so that verification requires only the media content at hand, without any additional information or preparation.

This project will explore several novel directions that promise to confirm media integrity in the face of various manipulations. First, unique signatures of sensing devices (e.g., cameras and audio recorders) will be extracted from received signals, including sensor nonlinearities, filters, noise patterns, cut-off bandwidth, etc. Anomalies are detected by finding inconsistencies in such cues among different parts of the content. Second, tampering artifacts resulting from digital alterations such as splicing and re-compression will be modeled and detected. Finally, a new criterion for comparing correlations in audio-visual channels will be used to verify the authenticity of multiple near-simultaneous videos of a single event from nearby locations. This is particularly valuable as the proliferation of recorders makes it less and less likely that a forger will control all recordings of an event.

Faced with diverse tampering scenarios, our work will focus on the discovery of fundamental knowledge and generic approaches, culled from a systematic understanding of the content editing pipeline, audio-visual device characterization, content processing/coding models, and realistic context modeling. Specifically, we propose (1) a systematic framework for device signature consistency checking and (2) a joint multi-modal approach for verifying media integrity in audio, video, and their combined context. The device signature based framework and the multi-modal context-based approaches are novel and general, representing important intellectual merits of the proposed project. While in practice it may be impossible for the detection system to anticipate the full spectrum of tampering and guarantee complete immunity to attacks, our aim is to make any attacking maneuver as difficult as possible for an expert (and beyond the reach of a casual faker).
To assess progress, we will adopt rigorous and comprehensive mechanisms for evaluation, with detailed plans for characterizing attack scenarios, constructing new benchmark datasets and performance metrics, and taking proactive steps in sharing results and promoting awareness. As an extension of our numerous ongoing efforts in resource dissemination and online public testing, results from the proposed project will be widely accessible to researchers and general users at large, leading to broad impacts on this emerging scientific area as well as many practical applications of national interest, such as trustworthy news reporting, surveillance security, intelligence gathering, criminal investigation, financial transactions, and many others.

1 Introduction and Motivation

Information integrity is a fundamental requirement for cyberspace, in which users need to ensure received information is trustworthy and free from malicious tampering or forgery. Audio-visual information (photos, video, and audio) is becoming increasingly important in a wide range of applications including surveillance, news reporting, intelligence, insurance, and criminal investigation. However, with advanced editing technologies making manipulation easier and easier, it no longer holds that "a picture is a fact" [1]. There is a critical need to develop robust and flexible techniques for verifying the authenticity of audio-visual information and thereby restoring its trustworthiness.

The need for trustworthy media is increasingly obvious as users become content creators and publishers. In December 2006, Reuters and Yahoo News announced a joint effort called You Witness News [2], in which users submit and share newsworthy photos and video clips captured with consumer equipment. Though such "citizen journalism" is exciting, it raises many concerns over altered or forged content [3, 4]. Digitally altered images have already unwittingly appeared in mainstream media, such as the news photograph published in 2003 by the Los Angeles Times in a report on the Iraq war, which was later confirmed to be a photomontage (Fig 1(a)). Some popular web sites [5, 6] highlight the best photo manipulation or computer graphics (CG) photo-realistic effects (Fig 1(b)), which human eyes find indistinguishable from unaltered photos. When verifying the integrity of video sources, for instance in intelligence analysis, it may also be important to decide whether different clips (like those shown in Fig 1(c)) are indeed captured by the same camera at the same location, or are the result of digital mixing.

Fig 1. (a) Manipulated image published in the LA Times. (b) A computer graphics image indistinguishable by humans. (c) Video shots to be tested for consistent camera sources.

Faced with real-world content tampering, we propose to develop a set of tools able to answer questions such as: Has any area in an image or video been tampered with? Is the image/video captured by a camera, or created synthetically? Were two images or videos captured by the same camera? Were they taken from the same location of the same event? While complete robustness against all forgery attacks is unrealistic, a comprehensive suite of methods for detecting tampering anomalies will make content tampering and forgery much less likely to succeed. This is the goal we set for the proposed project.
Current techniques for content integrity verification, such as content hashing or watermarking, require cooperative sources and compatible end-to-end protocols, making them impractical in many real-world situations. In contrast, we and several other groups have embarked on a new direction, blind and passive media forensics. Such methods extract unique signatures from real-world signals and sensing devices and detect anomalies caused by tampering, so that verification requires only the final media content without additional information. These techniques have shown promising results on images; in this project, we propose to extend this paradigm to solve the problem of integrity verification for audio-visual information, including photos, videos, and associated audio streams.

This project will explore several directions that promise to confirm media integrity in the face of many different kinds of manipulations. First, the devices used for capturing video or audio signals have unique device signatures, including sensor nonlinearity, filters, noise patterns, cut-off bandwidth, etc. Careful recovery of such device signatures may be used to verify the source consistency among different parts of an image or video clip, since a video showing inconsistent device signatures is likely to be faked. Second, digital alteration is not a completely transparent process; instead, each manipulation may leave tampering artifacts that become detectable traces. For example, the effects of re-compression on compression frame structures and coefficients can be used to check whether editing has taken place. Finally, near-simultaneous videos of a single event should show strong correlations in the audio-visual channels, which will all reflect the same acoustic events as well as the shared visual background. The lack of such contextual consistency between multiple audio-video streams captured under the same context may be used as a basis for raising forgery alerts.

In this project, we propose systematic research to explore these directions and develop novel methods for verifying content integrity in video and audio combinations. We will leverage and extend results we have accomplished in our current Cyber Trust project (2004-7) on blind passive image forensics. Specifically, we propose (1) a systematic framework for device signature consistency and (2) a joint multi-modal approach for verifying media integrity in audio, video, as well as their combined context. The device signature based framework and multi-modal approaches are general and novel, presenting great opportunity for research innovation that has not been explored to date.

The proposal is organized as follows. In Sec. 2, we analyze the problem by considering the editing process and identify potential areas for innovative solutions. We then review our prior work and the state of the art in the identified solution space (Sec. 3). Details of our proposed research in visual, audio, and combined context integrity verification are described in Sec. 4. Plans for evaluation (including attack scenarios, datasets, and metrics) are then presented in Sec. 5. Finally, in Sec. 6 and Sec. 7, we discuss our prior NSF work, the integrated education component, and the broader impact of the proposed project.

2 Problem Analysis and Solution Space

The best way of understanding the problem of digital content tampering is through modeling of the editing process. Fig 2 shows a basic model in which multiple sources are used to produce a new media stream.
In each path, signals from real-world events are recorded by imaging or audio devices, followed by capturing and encoding steps. Content from a single stream or multiple streams is then manipulated, mixed, post-processed, and re-encoded to render the final output. Based on the processes involved, typical tampering operations may be categorized as follows. Consideration of special attack tactics will be discussed in Sec. 5.

Fig 2. A basic model for audio-video editing (scene/event, device, capture/encode, shot mixing and object splicing, edit, post-process/re-encode, media archive). Unique signatures or artifacts are generated in each stage of the process (shown in numbered circles) and may be used to verify the authenticity and consistency of the received content.

Deletion (Cut): Part of an image, video, or audio stream is deleted to remove information from the original content. The removed part may correspond to a video shot, a set of image frames, object(s) in the image/video, a speech segment, audio object(s) in the sound track, etc. The attacker tries to conceal evidence of such deletion operations so that content recipients may treat the result as a complete, untampered piece.

Insertion (Paste): Forged content is added to the original source, such as new visual objects, sound events, image frames, a video shot, a speech segment, etc. Usually additional tools are used to make the insertion boundaries smooth and unnoticeable to human perception.

Combination (Cut and Paste): Most tampering scenarios actually involve both deletion and insertion. In the simplest form, shots from a single stream may be cut and pasted to change their order, in an effort to remove the temporal relations among events that occur in the real world. Additionally, an audio or video object may be deleted from the foreground of a shot, with the original area replaced by a new object or background scene from different streams or different areas of the same stream.

The above end-to-end model for content acquisition/editing includes multiple stages, starting from scenes, through recording devices, encoding, and editing, to the final post-processing and re-encoding stage. Each of these stages, indicated with numbered circles in Fig 2, imposes unique constraints or artifacts in the process and thus leaves differentiable signatures in the final output signals. First, audio-visual scenes often have distinct lighting and sound. An unaltered recording should have consistent attributes for such environmental conditions. Multiple videos taken at the same location for the same event should also have matched audio-visual attributes related to the scene and event. Second, the recording devices (cameras and microphones) are not completely transparent apparatuses. Many characteristics (such as nonlinearity, device noise pattern, recording bandwidth, filters, etc.) differ even between different units of the same device model. Disagreements of such device signatures between multiple parts of a stream readily indicate suspicious cases of tampering. Third, the encoding process used in the capturing procedure usually has unique structures, such as framing in audio (26 ms), block tiling in images (8x8 pixels), and the parameters (e.g., quantization tables) used in compression methods.
Such coding structures and parameters are likely to be destroyed if subsequent editing is not carried out with extra care, e.g., cutting at a location not aligned with the audio frame or image block boundaries. Finally, each editing operation such as splicing, scaling, and re-encoding may leave traces in the signals that are differentiable. Detection of such editing artifacts may be used to trigger alerts about potential editing. When synthetic content such as CG images is inserted, the lack of natural scene characteristics and physical device signatures provides important clues about the synthetic content.

In summary, the processes of audio-video content acquisition and editing are not transparent. To the contrary, they are full of tell-tale signatures and artifacts resulting from the uniqueness of the scenes, devices, and processes. Given a piece of audio-video content, many questions about content integrity (Sec. 1) may be answered by estimating such signatures and verifying their consistency. In an ideal scenario, metadata encoding the above information may exist, such as camera ID, GPS location, and even editing history. However, complete dependence on such metadata is impractical as it is easily editable and removable. Therefore, automatic estimation and matching methods are needed for utilizing and integrating the large array of tell-tale signatures described above. In the next section, we review promising results in these areas from our prior works and others.

3 Review of Related Work

Digital Signatures, Watermarking, Steganalysis

There has been much work using watermarking or digital signatures to protect the authenticity of images. The main idea is to imperceptibly embed a digital watermark into an image for monitoring image manipulation. Fragile watermarks [7-11] are sensitive to any minor image modification, while semi-fragile digital watermarks [12-15] and content-based digital signatures [16-20] can accommodate some operations such as compression and resizing. Additionally, in the trustworthy digital camera [21], a digital signature is generated at the moment an image is captured and the key-encrypted digital signature is used for image authentication. Unfortunately, all of the above techniques, falling in the class of active methods, require compatible protocols and end-to-end cooperation, which are often difficult to achieve. In this project, we focus on blind passive techniques that do not have such dependence.

Although different from our focus, steganography (i.e., hiding secret information) and steganalysis (i.e., detection of steganography) [22] have many subtle similarities with the creation and the detection of image forgery, respectively. When a message is hidden in an image, certain image statistics are disturbed and artifacts are introduced [23]. To detect such artifacts, [24-26] examined the statistics of the least significant bit values, pixel group correlation, and image quality, respectively. Similar to image steganalysis, image forgery can be detected by examining possible disturbances in image statistics. For instance, in [27], we investigated the signal-level perturbation, in the form of bipolar signals, introduced by image splicing. Other techniques inspired by steganalysis [28, 29] have also been proposed recently.

Recovery and consistency checking of device signatures

A device signature is an important clue for detecting image tampering.
A typical image taken by a camera goes through a number of operations, as shown in Fig 3. The incoming light hits the lens, activates the CCD sensors, undergoes interpolation by the demosaicking filter, and then nonlinear transformation by the camera response function (CRF). Optional digital operations are also often used, e.g., white balance adjustment, contrast enhancement, lossy compression, etc. Each step in this imaging pipeline introduces its own sensor imperfections, which leave traces of camera information in the image. By modeling these operations and examining the resulting images, one can recover camera properties and use them to detect tampered images.

Fig 3. A basic camera model. Characteristics of each component (such as noise, demosaicking filter, and camera response function) may be recovered from an image as unique camera signatures.

Considerable research effort has been devoted to this direction recently. [30, 31] used novel ideas based on CCD sensor noise to identify camera sources and to detect tampered images. However, a drawback of such methods is that multiple image samples need to be collected in advance for each camera to estimate its noise pattern. [32-34] proposed EM-based and least-squares methods for demosaicking filter estimation, and used the detected abnormality to find tampered images. Lin et al. [35] used co-linearity of edge pixel colors to estimate the CRFs for color images, and further analyzed abnormal estimation results to detect tampered images [36]. In [37, 38], we proposed a geometry-based approach to estimate the CRF from a single-channel image and used it to detect copy-and-pasted (spliced) images from two different camera sources [39]. The CRF estimation and consistency checking technique proved to be quite effective, with low estimation mean square errors with respect to ground truth CRFs and a high detection rate of 86% over a well-defined benchmark set. A similar concept has also been applied in printed document analysis, e.g., [40] used unique fluctuations of photoconductors to identify printer sources and to reveal document tampering.

Detection of Tampering/Processing Artifacts

Image post-processing clues raise suspicion of image forgery. In [41], wavelet higher-order statistics were used for detecting image print-and-scan and steganography. Avcibas et al. used image quality measures for identifying brightness adjustment, contrast adjustment, and so on [42]. In [43], image operations, such as resampling, JPEG compression, and the adding of noise, are modeled as linear operators and estimated by image deconvolution. In [44, 45], it is observed that double JPEG compression results in a periodic pattern in the JPEG DCT coefficient histogram. Based on this observation, an automatic system that performs image forensics is developed [46]. In [47], the distribution of the first digit of the JPEG DCT coefficients is used to distinguish a singly JPEG compressed image from a doubly compressed one. In [48], the JPEG quantization tables for cameras and image editing software are shown to be different and can serve as a useful forensic clue. Double MPEG compression artifacts can also be observed when a video sequence is modified and re-encoded [49]. In [50, 51], image splicing detection is addressed directly using higher-order statistics, where a splicing model is proposed [27]. Finally, there is also work on detecting duplicate image fragments within an image due to the copy-and-paste operation [52-54].
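To make the duplicate-fragment idea concrete, the following is a minimal block-matching sketch in the spirit of the copy-move detectors cited above [52-54]; it is an illustration only, not a reimplementation of any cited method. It assumes the image is available as a 2-D grayscale NumPy array and reports only exact matches of coarsely quantized blocks.

```python
import numpy as np

def copy_move_candidates(gray, block=16, step=4, min_shift=24):
    """Sketch of copy-move detection: find pairs of image blocks whose coarsely
    quantized contents are identical but which lie far apart in the image."""
    h, w = gray.shape
    feats, coords = [], []
    for y in range(0, h - block + 1, step):
        for x in range(0, w - block + 1, step):
            b = gray[y:y + block, x:x + block].astype(float)
            # Mean-removed, coarsely quantized block as a simple matching feature.
            feats.append(np.round((b - b.mean()) / 8.0).astype(np.int16).ravel())
            coords.append((y, x))
    feats = np.array(feats)
    coords = np.array(coords)
    order = np.lexsort(feats.T)          # lexicographic sort groups identical blocks
    pairs = []
    for i, j in zip(order[:-1], order[1:]):
        if np.array_equal(feats[i], feats[j]):
            dy, dx = coords[j] - coords[i]
            if dy * dy + dx * dx >= min_shift ** 2:   # ignore trivial near-neighbors
                pairs.append((tuple(coords[i]), tuple(coords[j]), (int(dy), int(dx))))
    return pairs
```

In practical copy-move detectors the matched pairs are further filtered by requiring many pairs to share the same offset vector (dy, dx), which is what distinguishes a duplicated region from coincidental texture matches.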
Computer Graphics vs. Photo

One problem of concern in image source identification is the classification of photographic images (PIM) versus photorealistic computer graphics (PRCG). The work in [41] uses wavelet-based natural image statistics for PIM/PRCG classification. In [55], we approached the problem by analyzing the physical differences between the image generative processes, hence providing a physical explanation for the actual differences between PIM and PRCG.

Audio Forensics and Speaker Identification

Audio recordings have been used as evidence for decades, which frequently raises questions of authenticity. The field of audio forensics has traditionally relied more on expert listeners than on advanced technology, although the recent formation of an Audio Forensics Technical Committee by the Audio Engineering Society reflects both the broadening possibilities and the greater challenges that result from digital processing [56]. While rarely used in legal situations, automatic speaker identification/verification is a relatively mature technology in which a statistical model learns the full range of sounds produced by a particular speaker in their normal speech. Evaluating the likelihood of an unknown speech recording under this model gives a precise measure of the confidence that the voices match; accuracy is improved by normalization to remove irrelevant factors such as fixed channel filtering [57, 58]. Such biometric techniques are reliable enough to be used for commercial applications such as telephone access to sensitive information [59].

4 Description of Proposed Research

We adopt a systematic and integrative approach to exploring opportunities originating from all of the major components in the content pipeline (as shown in Fig 2): scene, device, processing, and editing, across multiple modalities (both visual and audio). In each area, we will leverage prior results from our own work and other groups on audio-visual content analysis and authentication. In addition, a unique framework for joint audio-visual authentication will be used to verify the consistency between audio and visual information and the correlations among multiple audio-visual streams claimed to be from the same context (event/location). We describe the specific proposed research tasks in the following subsections.

4.1 Visual

Among the many opportunities identified in Fig 2, we focus on general approaches that are applicable to most scenarios, rather than ad hoc solutions customized for specific conditions. First, we will extend our prior work on robust camera signature estimation to establish a general consistency framework for detecting spliced areas in both the spatial and temporal domains. Second, we will investigate the fundamental physics-based properties of natural visual scenes, thereby developing sound principles for detecting abnormal synthetic content. Finally, we will investigate the theory of quantization to model the effects of re-encoding, the most fundamental operation involved in the editing process.

Device Signature Consistency Framework:

Among the several device components shown in Fig 3, the camera response function (CRF), which maps incoming light irradiance to output electronic image intensity in a non-linear way, provides excellent clues for differentiating different cameras, even different units of the same camera model. There is established knowledge about its parametric forms, the simplest being a power law called the gamma function.
In [37, 38], we have derived new theorems and properties that relate the geometry in the image to the estimation of the CRF. Specifically, we showed that CRF parameters can be reliably recovered from locally planar patches in the irradiance image by computing a derivative-based measure called the geometric invariance:

G(R) = R_xx / (R_x)^2 = R_yy / (R_y)^2 = R_xy / (R_x R_y)

The geometric invariance quantity G is related to the first- and second-order intensity derivatives at each location, revealing the local geometry such as linearity and curvature. It is invariant to changes in orientation, scale, and translation of the local patches as long as they are planar. As the CRF adds a unique non-linearity to such planar patches, the computed geometric invariance values can be used to effectively recover this non-linear transformation, and thus the corresponding CRF. In [37, 38], we tested the proposed method over 100 images from 5 different cameras from major manufacturers and demonstrated excellent estimation accuracy.

Compared to alternative methods for CRF estimation [35, 36], our method is effective and advantageous. It can be flexibly applied to images of diverse modalities, including multiple color channels (e.g., RGB), a single color channel (e.g., greyscale), and multiple frames in a video. Such flexibility is important in the proposed project, as the target media content may be of different modalities. In our preliminary experiments, we applied the CRF estimation method to videos from broadcast news in order to verify whether two shots in a story were taken by the same camera. Fig 4 shows very encouraging results: consistent CRFs are found for two consecutive shots from a single broadcaster (CNN), Fig 4(a), while distinct CRFs are found for two shots of the same event from two different broadcasters (CNN and ABC), Fig 4(b).

Fig 4. Using the consistency of recovered camera response functions to verify whether two shots are taken by the same camera. Top row: two consecutive shots in a story from CNN (same camera). Bottom row: two shots (from CNN and ABC) of the same event show different camera curves (different cameras). The curves are automatically estimated using our geometric-invariance based method.

The proposed camera signature estimation framework can be readily extended to verify the consistency between different parts of an image or a video. Such an extension has been shown promising in our initial experiments [39], in which object splicing is detected by computing the CRF inconsistency between the suspect region and the rest of the image. A CRF is estimated using local patches extracted from each region, and cross-fitting scores are computed to measure the degree of fitness of the local patches from one region with respect to the CRF estimated from the others. Our evaluation over a set of 363 spliced and authentic images has shown an encouraging accuracy as high as 87%.
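As an illustration of the invariance measure above, the sketch below computes G from discrete image derivatives and, under the simplifying assumptions of a pure power-law (gamma) CRF and a locally planar irradiance patch, reads off the exponent: for R = r^gamma with planar r, G(R) * R = (gamma - 1) / gamma. This is a toy illustration of the principle only, not the estimation procedure of [37, 38]; it assumes NumPy and a small synthetic patch.

```python
import numpy as np

def geometric_invariance(R):
    """Return the three ratios G = R_xx/R_x^2 = R_yy/R_y^2 = R_xy/(R_x R_y),
    computed from discrete derivatives of the observed intensity patch R."""
    Ry, Rx = np.gradient(R)            # first-order partials (axis 0 = y, axis 1 = x)
    Rxy, Rxx = np.gradient(Rx)         # derivatives of Rx along y and x
    Ryy, _ = np.gradient(Ry)
    eps = 1e-12
    return Rxx / (Rx**2 + eps), Ryy / (Ry**2 + eps), Rxy / (Rx * Ry + eps)

def gamma_from_planar_patch(R):
    """Toy gamma-CRF estimate from one locally planar patch: for R = r**gamma with
    planar irradiance r, G*R = (gamma - 1)/gamma, so gamma = 1/(1 - G*R)."""
    g1, g2, g3 = geometric_invariance(R)
    G = np.median(np.concatenate([g.ravel() for g in (g1, g2, g3)]))
    return 1.0 / (1.0 - G * np.median(R))

# Sanity check on synthetic data: a planar irradiance patch pushed through gamma = 2.2.
yy, xx = np.mgrid[0:64, 0:64].astype(float)
irradiance = 0.004 * xx + 0.003 * yy + 0.3
print(gamma_from_planar_patch(irradiance ** 2.2))   # should land near 2.2
```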
However, our results so far have been semi-automatic in that suspicious objects are manually selected. In the proposed project, we will combine the consistency checking framework with automatic image/video segmentation. We will investigate the tradeoffs between region segmentation and fixed-block partitioning. The former has the potential of locating the precise position of spliced objects in simple scenes, but is often susceptible to segmentation errors under complex backgrounds. The latter is simple and less sensitive to segmentation errors, though the chance of detecting accurate boundaries or small objects is compromised.

Note that the above device signature consistency framework is flexible and general; other camera signatures can be easily incorporated, such as the demosaicking filter and noise patterns. For example, least-squares fitting methods were used in [33, 34] to estimate the demosaicking filter based on an image or a local image region. Scores from such fitting and multi-region cross-fitting may be fused with the CRF fitting scores described above to measure the camera-signature consistency between different parts of an image.

Physics-based Natural Scene Properties:

In this section, we describe the proposed research using physics-based features of natural scenes to distinguish synthetic content such as CG images from natural photos. Such features are culled from a fundamental understanding of the real-world image generation process, which involves complex interactions among object geometry, surfaces, lighting, and cameras. The surfaces of real-world objects, except for man-made objects, are rarely smooth or of simple geometry. Mandelbrot [60] has shown the abundance of fractals in nature and also related the formation of fractal surfaces to basic physical processes such as erosion, aggregation, and fluid turbulence. In addition, as photographic images are captured by an acquisition device, they also bear the characteristics of the device, such as those shown in Fig 3.

Inspired by the above observations, in [55] we developed physics-based image features based on a two-scale image description framework. At the finest scale, the image intensity function is related to the fine-grained details of the surface properties of a 3D object, whose geometry can be characterized by the local fractal dimension and also by the statistics of the local patches. At an intermediate scale, where the fine-grained details give way to a smoother and differentiable structure, the geometry is described in the language of differential geometry, where we compute the surface gradient, the second fundamental form, and the Beltrami flow vectors. These features are then aggregated into a combined representation (205 dimensions) upon which discriminative classifiers such as Support Vector Machines (SVM) are trained to separate natural photos from photo-realistic CG images. These computable features have proven effective: our experiments showed a promising detection accuracy of 85% with cross validation over a diverse, challenging benchmark dataset [61]. Fig 5(a) shows a few examples of the test photos and CG images, which indeed exhibit very high photorealism.
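As a schematic illustration of this feature-plus-classifier pipeline, the sketch below extracts a couple of crude local-geometry statistics (gradient and curvature cues standing in for the full 205-dimensional physics-motivated set, which it does not reproduce) and trains an SVM with scikit-learn. The feature set, parameters, and helper names here are illustrative assumptions, not the method of [55].

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def toy_scene_features(gray):
    """Crude stand-in for the physics-motivated features: gradient-magnitude
    statistics plus a simple curvature (Laplacian) roughness cue."""
    gy, gx = np.gradient(gray.astype(float))
    mag = np.hypot(gx, gy)
    lap = np.gradient(gx, axis=1) + np.gradient(gy, axis=0)
    return np.array([mag.mean(), mag.std(), np.percentile(mag, 95),
                     np.abs(lap).mean(), np.abs(lap).std()])

def train_photo_vs_cg(images, labels):
    """images: list of grayscale arrays; labels: 1 = photo, 0 = photorealistic CG."""
    X = np.vstack([toy_scene_features(im) for im in images])
    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10.0, gamma="scale"))
    clf.fit(X, labels)
    return clf
```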
In this project, we propose to extend the above method in several directions. First, as the techniques for CG creation improve rapidly, it is impractical to expect a static classification system to maintain accurate detection for all future CG content created by new, advanced tools. It will be critical to continue acquiring new datasets, refining the image feature set, and updating the detection models accordingly. To this end, we have developed and deployed an online photo vs. CG classification system [62], to which public users may submit any image of interest and receive automatic classification results and comparative feedback on the fly. A snapshot of the user interface is shown in Fig 5(b). As the first and only public test system for photo/CG classification, it has attracted a lot of interest, with more than 1500 submitted test images so far. With such constantly expanding corpora, we will investigate online learning methods for selecting image features, updating classification models, and analyzing the performance gaps over different CG content subclasses.

Second, we will investigate methods to apply the CG detection framework to local regions in an image or video, rather than just at the global level. Such an extension is non-trivial, as there are important tradeoffs between feature robustness and region location precision. Some of the proposed features are statistical in nature (like local image patch statistics and fractals); thus increasing the location precision may cause a loss of statistical reliability. Additionally, straightforward application of such methods to detect potential CG areas in a long video sequence is time prohibitive. We will develop multi-stage solutions that use simple features, such as cartoon features or wavelet features [41, 63], to filter unlikely cases and reduce the data space for finer examination.

Fig 5. (a) Sample images used in natural photo vs. computer graphics image classification (the left two are photos while the right two are CG). (b) User interface of our public online system for photo vs. CG classification [62].

Analytical Models for Double Compression and Manipulation Effects:

Another important clue for detecting editing or splicing is related to double compression. In a typical splicing scenario, an object is cropped from an existing compressed source, scaled to fit the target splicing area, shifted to the right position, and then the entire mixed content is recompressed. Such a double compression process adds important clues about the editing operation, since the effects of double compression are very likely distinct in the spliced area and the background area. It has been observed that double JPEG compression results in a periodic pattern in the JPEG DCT coefficient histogram [44-46]. This pattern (as shown in Fig 6) is sensitive to the relation between the compression parameters (e.g., quantization step sizes) used in the first and second passes. It also depends on whether the inserted object is shifted or scaled before insertion and recompression. By characterizing and differentiating the compression effects in different parts of an image/video, we will be able to detect suspect cases in which splicing might have taken place, and in some cases distinguish the specific operations that have been applied to the cropped objects. For example, as shown in our prior work [64], downscaling and shifting result in different levels of noise and relative image quality when they are employed between two compression passes. In this project, we will investigate further properties and analytical models of combinations of double compression and various manipulation functions, as a basis for developing robust tamper detectors.

Fig 6. Double quantization produces distinct patterns in the quantized coefficient histogram due to the use of different quantization step sizes: 5 followed by 2 in the left case and 2 followed by 3 in the right case (from [46]).
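The periodicity cue can be illustrated with a short sketch: build the histogram of one DCT mode over the 8x8 block grid of the decoded image and measure how peaky the Fourier spectrum of that histogram is, since double quantization tends to raise this peakiness. This is a simplified, hedged illustration of the cue exploited in [44-46], not a reproduction of their detectors; it assumes NumPy, SciPy, and a grayscale image array.

```python
import numpy as np
from scipy.fft import dctn

def dct_mode_histogram(gray, u=2, v=2, max_abs=60):
    """Histogram of one DCT mode (u, v) collected over the 8x8 block grid."""
    h, w = gray.shape
    h, w = h - h % 8, w - w % 8
    blocks = gray[:h, :w].reshape(h // 8, 8, w // 8, 8).transpose(0, 2, 1, 3).reshape(-1, 8, 8)
    coeffs = np.array([dctn(b.astype(float), norm="ortho")[u, v] for b in blocks])
    q = np.clip(np.round(coeffs).astype(int), -max_abs, max_abs)
    return np.bincount(q + max_abs, minlength=2 * max_abs + 1)

def periodicity_score(hist):
    """Strength of the strongest non-DC peak in the Fourier spectrum of the
    mean-removed histogram; markedly periodic histograms score high."""
    spec = np.abs(np.fft.rfft(hist - hist.mean()))
    return float(spec[1:].max() / (spec[1:].mean() + 1e-9))
```

Comparing such a score between a suspect region and the rest of the image (or between an image and its own recompressed copies) is one way to turn the histogram pattern into a consistency check.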
4.2 Audio

Following the framework of Fig 2, the audio signal will carry clues to the components and stages involved in its creation, including scene characteristics (used to verify that the target of the recording is as claimed), characteristics of the device and processing chain (used to verify that a signal or set of signals all have the same, single origin), and cues to continuity (which can reveal deletion or insertion edits).

Scene characteristics

Statistical analysis of audio recordings can verify or refute claims that particular signals originate from a single source or location. In our work on analyzing recordings from a body-worn microphone, we showed that the statistics of the background ambience, the sound between foreground events such as speech, are a reliable basis for identifying and classifying locations [65], since the particular spectral shape and variability of the ambient noise (e.g., the sound of an air-conditioning plant) are frequently specific to a particular location. As an illustration, Fig 7 presents an 8 hour recording from a body-worn microphone, visualizing the power and fluctuation of energy in different channels within each one-minute segment. These features can be seen to discriminate well between the different environments separated by the hand-marked boundaries (indicating, for example, changes from indoor to outdoor, or different locations).

Fig 7. Visualization of the ambient sound statistics from an 8 hour 'personal audio' recording, revealing clear changes in properties corresponding to hand-marked changes in location, shown as vertical lines (from [66]).

In addition to modeling nonspeech ambience, we also plan to extend speaker identification techniques (mentioned in Sec. 3) to this kind of scenario by employing cues such as the pitch track, which can be reliably extracted even in very noisy recordings [67]. To compensate for the diminished spectral information in such cases, we plan to use wider temporal contexts to identify pitch dynamics and idiosyncratic pitch gestures indicative of individual speakers.

Device and Encoding Characteristics

Although the goal of high-quality recording equipment is to be as neutral or invisible as possible, there are frequently tell-tale characteristics that reveal details of the source equipment. Audio recording circuitry will leave its mark in terms of:

• Bandwidth limitations: i.e., the low-frequency (rumble) filter and high-frequency (anti-aliasing) cutoff. These cutoffs will usually be implemented with analog components prior to the analog-to-digital converter, and will thus vary slightly even between different units of the same type.

• Automatic gain control (AGC): A typical consumer video or audio recorder will include automatic gain adjustment to equalize the scene signal level. If the source becomes quiet, such circuitry increases gain at a fixed, slow rate characterized by the "decay time", and quickly reduces gain after a sudden increase in level with a time constant known as the "attack time". Even when the source material is at a relatively constant level, the AGC is constantly making small adjustments to the system gain, which allows its attack and decay times to be estimated; these will usually be specific to a particular piece of equipment.
• Background noise: In addition to the acoustic noise detected by the microphone, there is intrinsic electrical noise generated by the equipment, which is exposed when the original source is quiet and/or the recorder is of poor quality.

In preliminary investigations, we examined recordings made by a portable MP3 player/recorder (the iRiver iFP-799), shown in Fig 8. These recordings show a clear cutoff at 13.5 kHz (related to the compressed representation); there is a relatively strong and steady harmonic at around 10.6 kHz, as well as weaker peaks at multiples of 500 Hz, still clearly visible in the average spectrum. Most surprising is additional, variable noise in the 10-13 kHz region (arising from crosstalk from the CPU), which actually characterizes both the recorder and its firmware version [68].

Similar to the case of image compression (Sec. 4.1), audio compression and other formatting leave clearly discernible features in the audio stream that may persist through subsequent re-encoding to reveal the tandem encoding resulting from editing. Common audio compression schemes are based on psychoacoustic masking, in which quantization levels are chosen independently and dynamically in separate frequency bands to ensure that the distortion remains below the threshold of audibility [69]. For instance, in MPEG Audio (e.g., MP3) the spectrum is divided into thirty-two subbands of 690 Hz bandwidth, with new quantization bit allocations every 26 ms. Some high-frequency bands often contain no perceptual energy at all and will be switched off for one or more frames, leading to clearly visible holes and blobs in the spectrum (see Fig 9). The particular behavior of these quantization artifacts, easily visible in a spectrogram, can indicate the particular compression algorithm in use along with its settings (bitrate, etc.); where these are inconsistent with the final audio encoding, varied source material and compositing are revealed. In general, each implementation of an encoding algorithm will make slightly different choices for compression algorithm parameters, leading to another device signature.

Fig 8. Audio recordings made by an iRiver iFP-799 portable player/recorder in quiet. Left column: average spectrum, showing characteristic spectral features. Right column: spectrogram of a 5 s excerpt, showing dynamic structured high-frequency noise in the 10-15 kHz region.

Cues to Continuity

Edits such as insertions, deletions, duplication, and mixing may leave tell-tale signatures. Fig 9 shows an example of a video soundtrack (from YouTube) where there is a clear gap in one background track, indicated by the vertical lines, during which a foreground track continues. The foreground track, however, appears to have been originally recorded at a lower sampling rate and hence has little or no energy above 7.5 kHz. The 80 dB range of the color bar approaches the dynamic range of human hearing, so the difference between the presence and absence of a background noise floor in the top part of the spectrum is not easily perceived in the recording; in the spectrogram, it is clearly visible.

Fig 9. Spectrogram of a video soundtrack excerpt, showing a clear gap (between the vertical lines) in the wideband background signal, mixed with a foreground signal that has a lower cutoff frequency, apparently due to a difference in recording equipment. The box in the top left highlights the gating of subbands characteristic of MP3 compression.
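The bandwidth and cutoff cues above (the 13.5 kHz recorder cutoff in Fig 8 and the 7.5 kHz foreground track in Fig 9) suggest a simple consistency check: estimate the effective bandwidth of successive segments of a soundtrack and flag large jumps. The following is a minimal sketch of that idea using SciPy's Welch spectrum estimate; the 25 dB threshold and segment length are illustrative choices, not values from the proposal.

```python
import numpy as np
from scipy.signal import welch

def effective_cutoff_hz(x, sr, drop_db=25.0):
    """Estimate effective bandwidth: the highest frequency whose average level is
    within `drop_db` of the spectral peak."""
    f, pxx = welch(x, fs=sr, nperseg=4096)
    level_db = 10.0 * np.log10(pxx + 1e-20)
    above = np.where(level_db >= level_db.max() - drop_db)[0]
    return float(f[above[-1]]) if above.size else 0.0

def cutoff_profile(x, sr, seg_sec=5.0):
    """Per-segment cutoff estimates over a soundtrack; abrupt jumps between
    segments suggest mixed source material or an insertion edit."""
    n = int(seg_sec * sr)
    return [effective_cutoff_hz(x[i:i + n], sr)
            for i in range(0, len(x) - n + 1, n)]
```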
If the modification involves lengthening the original recording (e.g., to insert some new foreground event), an obvious technique for preserving the background sound is to copy a section of background sound from elsewhere in the recording; in the absence of obvious foreground sound events, such duplication will not be noticed by listeners, but it can be detected as a highly improbable repetition of background noise, revealed for instance through cross-correlation. Exhaustive correlation of all segments is very expensive (particularly since it would be best performed separately in multiple frequency bands to avoid distortion by louder, added foreground events) but can be made much more efficient by audio hashing; our recent work [70] investigated hashes consisting of pairs of prominent spectral peaks nearby in time, then searching for clusters of hashes at the same relative timings to find repeating stereotyped events (such as phone rings) in long-duration environmental recordings.

4.3 Joint Audio-Visual Scene Authentication

Having both audio and visual channels available makes possible further, cross-modal checks for validity. One possible forgery scenario occurs when a soundtrack is doctored to change the words (for instance by splicing in sounds from elsewhere in the recording) without altering the video. This can be surprisingly convincing, since human observers are largely insensitive to audio-visual asynchrony smaller than around 100 ms [71]. Thus, the edited-in audio need not correlate exactly with the original video to convince the viewer. Automatic analysis can, however, make a more exact comparison between audio and video channels, and detect the decrease in synchrony that would result from such a splice. Firstly, the region of the mouth corresponding to the speech can be identified by measuring the mutual information between audio features and video features at each location [72-74]; non-mouth parts of the video will be unrelated to the speech signal, whereas the visible state of the mouth is strongly informative about the speech signal. Secondly, correlation can be calculated between linear subspace projections of mouth motion and audio signal energy to detect the best temporal alignment between the two; this may be nonzero given the finite speed of sound and differences in processing chains, but should remain constant within a recording. Finally, any edited or doctored regions will most likely show a statistically significant reduction in this correlation during the region of the edit.

Fig 10 (from [75]) shows the result of a time-varying audio-visual correlation between a speech signal and a mouth image; the vertical axis is the relative timing, the horizontal axis is the time within the clip, and the darkness indicates the strength of correlation between the two modes averaged over windows of different lengths. A longer window gives a more accurate indication of the true correlation, but is less able to detect short-term changes. Our approach will be to build a statistical model of vertical slices through the shorter-window version, then look for regions where the correlation properties do not match the rest of the recording.

Fig 10. Correlation between audio and video features as a function of relative time lag (vertical axis) and time within the clip (horizontal axis) for four different averaging windows (from [75]).
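A minimal sketch of the sliding, lag-resolved correlation described above follows. It assumes two per-frame series have already been extracted (an audio energy envelope resampled to the video frame rate and a mouth-region motion energy series, both hypothetical inputs here); a sustained drop in the windowed peak correlation, or a jump in the best lag, would flag a candidate soundtrack edit. This illustrates the principle rather than the subspace-projection method of [75].

```python
import numpy as np

def lagged_corr(a, v, max_lag):
    """Normalized correlation between audio envelope `a` and mouth-motion series `v`
    (both sampled at the video frame rate) for lags in [-max_lag, +max_lag] frames."""
    a = (a - a.mean()) / (a.std() + 1e-9)
    v = (v - v.mean()) / (v.std() + 1e-9)
    out = []
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            c = np.mean(a[lag:] * v[:len(v) - lag])
        else:
            c = np.mean(a[:lag] * v[-lag:])
        out.append(c)
    return np.array(out)

def sliding_av_sync(a, v, win=125, hop=25, max_lag=10):
    """Peak A/V correlation (and its lag) in each sliding window; a sustained drop
    in the peak, or a jump in the best lag, flags a possible soundtrack edit."""
    peaks, lags = [], []
    for s in range(0, len(v) - win + 1, hop):
        c = lagged_corr(a[s:s + win], v[s:s + win], max_lag)
        peaks.append(float(c.max()))
        lags.append(int(np.argmax(c)) - max_lag)
    return np.array(peaks), np.array(lags)
```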
4.4 Authentication through Location-Event Context

The requirement of scene consistency in authentic media can be extended to another important scenario, in which multiple audio-video streams are captured at the same site covering the same event, as in news reporting, social events (weddings, parties), and popular tourist attractions. For a media stream to be trusted as associated with a specific event, its audio-visual features need to show adequate correlation with other streams from the same event-location context. Such correlations may be manifested in the visual domain as overlapping backgrounds, or as audio events and noises similar to those mentioned in Sec 4.2.

Verification of context consistency requires the solution of two sub-problems. First, media streams sharing the same context as the target stream in question need to be identified in order to establish the appropriate event-location context. Then, adequate computational measures are needed to estimate the agreement between the target and the context. For the former, one option is to rely on external metadata (e.g., GPS and time) if available, and then use scene reconstruction techniques to refine the location information. For example, in the newly announced Photosynth service from Microsoft Live [76], consumer photos taken at the same site (at different times) are used to estimate the approximate camera poses (location, orientation, and field of view) associated with each image and to construct a sparse 3D scene model for interactive browsing. Such techniques, based on the principle of structure from motion [77], are feasible given a sufficient number of images with overlapping views. The resulting location information, in terms of the distance, direction, and field of view of the camera, is more precise than that given by GPS information or other coarse location tags from users. When GPS information is not available, we may also use Web image search systems (e.g., Google Image Search, or the Flickr photo-sharing forum) to find images that come from the same claimed location as the target image.

Given images/videos originating from the same claimed location and event, we will then estimate their correlation in terms of audio-visual scene characteristics. In [78], we have developed a robust technique for detecting near-duplicate images that are captured using cameras with different poses (angle, distance, etc.) and conditions (lighting, exposure, etc.). As shown in Fig 11, our method extracts salient parts from each image and then learns a generative statistical model to explain the geo-photometric transformation relation between near-duplicate scenes. Bayesian detection rules are then applied to determine whether two images are a near-duplicate pair, indicating a high likelihood of originating from the same location. If none of the images/videos in the same context group shares a strong correlation with the target image, suspicion will be raised and additional processes will be needed to verify the claimed source of the target. Our near-duplicate detection method has been shown to be effective through systematic evaluation over a publicly available benchmark dataset [79]. In this project, we will extend it to handle the temporal dimension of video and the multi-modal integration over multiple streams as discussed below.

Fig 11. Detecting near-duplicate images by part-based graph models. Contents sharing strong audio-visual correlations are more likely to be from the same location/event.
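For a concrete, if much simpler, picture of what adequate visual overlap can mean computationally, the sketch below scores the overlap between two images using OpenCV ORB keypoints, ratio-test matching, and a RANSAC homography inlier count. It is only a crude stand-in for the part-based generative graph model of [78]; the feature type, thresholds, and score are illustrative assumptions.

```python
import cv2
import numpy as np

def scene_overlap_score(img1, img2, ratio=0.75):
    """Crude stand-in for near-duplicate scene matching: ORB keypoints, ratio-test
    matching, and the number of RANSAC homography inliers as an overlap score."""
    orb = cv2.ORB_create(nfeatures=2000)
    k1, d1 = orb.detectAndCompute(img1, None)
    k2, d2 = orb.detectAndCompute(img2, None)
    if d1 is None or d2 is None:
        return 0
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING)
    good = []
    for pair in matcher.knnMatch(d1, d2, k=2):
        if len(pair) == 2 and pair[0].distance < ratio * pair[1].distance:
            good.append(pair[0])
    if len(good) < 8:
        return len(good)
    src = np.float32([k1[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
    dst = np.float32([k2[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
    _, mask = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
    return int(mask.sum()) if mask is not None else 0
```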
On the audio side, we will compute similar correlation measures based on environmental characteristics (e.g., air conditioning noise) and audio events (e.g., explosions, clapping), as discussed in Sec 4.2. Such correlations are expected to be frequent for sounds captured at the same location over the same time, due to the omni-directional nature of audio recordings. Furthermore, we will combine the scene correlations across audio and visual modalities, as discussed in Sec 4.3. Given multiple audio-visual streams from the same context, such cross-modal correlations are likely to be numerous and strong, since sound events in one stream may be correlated with the visual activities captured in a different video. If the target stream is claimed to be of an event that is simultaneously captured by multiple recorders, it is reasonable to expect strong correlations in the sound track, the visual scene, and/or across audio-visual tracks over multiple streams. Multi-modal consistency characteristics like this are critical and novel, presenting a very promising research direction for media forensics.

5 Evaluation and Milestones

The utility of the proposed detection system should be measured not only by its accuracy of detection, but also by how informatively it explains its detection results. This is an important consideration in view of the diverse array of attack scenarios and the large number of tools applicable to various components of the media content at different levels. Without an intuitive and flexible evaluation mechanism, it will be quite difficult to develop a sound strategy for coping with such diverse issues. To this end, our evaluation efforts will adopt a multi-fold approach covering all of the important aspects: toolkit, datasets, and attacks.

Organization and Characterization of Toolkit

Our toolkit will comprise a rich set of software prototypes resulting from the proposed research. Each tool will be categorized according to the target modality (audio, visual, cross-modal), the corresponding point in the edit/processing pipeline (scene, device, coding, editing), and the applicable data granularity (local region, image/audio frame, shot, set of streams). Such explicit information will help us match the right dataset and experiment conditions to the evaluation of each tool. In addition, the detection output of each tool is not just a binary decision (pass or fail). Instead, it will include other relevant information such as the suspected location of tampering, confidence scores of the detection, and the reliability of the detection tool based on separate validation experiments. The provision of such an expanded set of information allows us to integrate diverse tools and summarize results to users in an intuitive way.

Benchmark Datasets

To test the performance of individual tools and combined systems, we will take a proactive approach to constructing diverse benchmark datasets from multiple domains. For this, we will largely leverage the extensive resources established in our prior work, including several widely used datasets for image forensics developed in our previous Cyber Trust project.

• Columbia's Image Splicing Dataset [80, 81]: 1845 spliced and authentic image blocks, originating from the CalPhoto image library [82].

• Columbia's photo vs. CG classification dataset [61, 83]: 3200 images including natural photos from a personal collection and the Internet, photo-realistic CG images from 3D CG developer sites, and recaptured CG images.
This set has been downloaded by more than 50 groups so far.

• Raw images captured by multiple cameras from different manufacturers and models (Nikon, Canon, Kodak, Sony, etc.). These will be used to evaluate camera signature estimation methods.

• TRECVID [84] videos used for video retrieval evaluation in 2004-2006. This collection consists of more than 200 hours of broadcast news videos from 6 different channels in 3 different languages over the same period of time. It will be an excellent dataset for testing the location-event contextual consistency discussed in Sec 4.4, since multiple video programs from different channels often cover the same events. It is also publicly available through the NIST TRECVID organization.

• Audio LifeLog dataset: a 62 hour audio dataset that has been hand-marked with the wearer's location. This can be used to verify algorithms for identifying locations based on acoustic properties. We also have two Microsoft SenseCams, which can be used to capture simultaneous audio and time-lapse-style image sequences for long-duration recordings. We will use these in combination with other recorders to create our own multiple simultaneous recordings to test "contextual consistency".

• Commercial movies provide an additional test case for audio-visual scene authentication, since they frequently have soundtracks re-recorded by actors speaking "in time" to the original video. Discriminating the synchrony between dubbed and original "production sound" scenes (where ground truth is typically known) will be a demanding test for our techniques.

New datasets will also be created whenever necessary. Furthermore, we will apply several typical editing and post-processing techniques (scaling, smoothing, and double compression) to the above datasets to evaluate the impact of such post-processing operations on the performance of each detection tool.

Attack Scenarios and Performance Metrics

Typical classes of tampering attacks have been discussed in Sec 2: deletion, insertion, and combinations. These operations are relatively well-defined and can be evaluated quantitatively. Many of the datasets discussed above have been designed to simulate such attacks, such as splicing, CG content synthesis, and video mixing. Here, standard performance metrics can be used, such as precision-recall, miss rate, false alarm rate, and detection speed. In many cases, other performance factors will also be considered: the capability of locating forgery areas, sensitivity to small forgery areas, and robustness over different image content classes and imaging conditions (lighting, background, and camera settings). From the attacker's perspective, we may also assess performance in terms of the cost (time and computing resources) required for the attacker to create a successful fake that passes the detection system without compromising the perceptual quality too much.
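For concreteness, the standard metrics named above can be computed from per-item decision counts as in this small helper (a generic illustration, not project-specific code):

```python
def detection_metrics(tp, fp, fn, tn):
    """Precision, recall (detection rate), miss rate, and false alarm rate
    from true/false positive/negative counts."""
    precision   = tp / (tp + fp) if (tp + fp) else 0.0
    recall      = tp / (tp + fn) if (tp + fn) else 0.0
    miss        = fn / (tp + fn) if (tp + fn) else 0.0   # equals 1 - recall
    false_alarm = fp / (fp + tn) if (fp + tn) else 0.0
    return {"precision": precision, "recall": recall,
            "miss": miss, "false_alarm": false_alarm}
```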
Besides the standard tampering operations mentioned above, there are many other special tactics that may be employed by the forger. It is almost impossible for the detection system to anticipate the full spectrum of tactics and guarantee complete immunity to attacks. In view of this, we focus our research on the discovery of fundamental knowledge and the development of generic methods, leading to a sound foundation for developing useful engineering solutions. In the following, we briefly discuss a few special attack tactics and their implications for our research.

Consideration of Special Attack Tactics

Once a forgery creator has unlimited access to the detector, an oracle attack may be launched. The forger can incrementally modify the forgery, guided by the detection results, until it passes the detector with minimal visual quality loss. Some ideas have been proposed to partially address this issue. [85] proposes a method of converting a parametric decision boundary into a fractal (non-parametric) one, so that an accurate estimation of the boundary requires a much larger number of trials. [86] modifies the temporal behavior of the detector such that the time taken to return a decision is lengthened when a sequence of similar-content input images is observed.

Apart from the protocol level, forgers could also apply various post-processing operations (smoothing, compression, etc.) to mask forgery artifacts. This problem may be addressed by the post-processing detection techniques mentioned in Sec 4.1. Heavy post-processing is often needed to mask forgery artifacts, and its detection would greatly reduce the trustworthiness of the content. A more sophisticated post-processing approach would be to simulate the device signature after content alteration so that the forgery has a consistent device signature. However, such an attack is difficult to implement in practice, as the simulated device signature has to be strong enough to mask the inconsistency in the original device signature, possibly resulting in a large image quality loss. Furthermore, our proposed method is based on checking multiple device signatures, which makes this attack more difficult, as all of the device signatures would need to be simulated.

An attacker can also produce a seemingly authentic image or video by recapturing the sound and sight produced by an image print/display or audio playback. However, such an attack is not easy in practice: producing a good-quality recaptured duplicate requires an elaborate setup for rendering realistic 3D sound and sight, which is not easily achieved. Furthermore, the recapturing may not remove all the inconsistencies in an image or a video, especially scene inconsistencies.

There is also the issue of distinguishing innocuous operations, such as resizing and transcoding, from malicious attacks or manipulations. Innocuous operations have the common property that they are mainly global operations, in contrast to malicious manipulations, which are mainly local.

Milestones: In Year 1, we will focus on the development of individual detection tools and required benchmark datasets. These include camera signature consistency checking in image and video, local graphics object detection, audio-based location detection, and modeling of editing/double compression. In Year 2, we will extend the research to joint audio-visual authentication and multi-modal, multi-stream context-consistency checking. We will select suitable test data from our existing corpora of LifeLog wearable recordings and TRECVID multi-channel news video. An integrated prototype system will be developed and deployed in Year 3 so that users and developers may simulate various scenarios and test the strengths and weaknesses of various components, and thereby develop strategies for fusion and integration. Diagnostic and explanatory mechanisms will be added to the prototype to give useful feedback for refining component solutions as well as the overall fusion strategies.
Throughout the entire project period, we will broadly disseminate software, data, and other results to the public whenever permissible.
6 Team Expertise and Related Results from Prior NSF Support
Our team combines the complementary expertise required to solve the challenging problems of multimedia forensics. PI Chang is an established researcher with extensive experience in image authentication and visual content analysis. Co-PI Ellis has pioneered theories and tools for audio scene analysis and speech/music processing. We have worked closely together on several past projects, including a recent joint project on consumer video indexing. Chang is in the third year of project IIS-0430258, “Cyber Trust – Blind Detection of Digital Photograph Tampering” ($740,000, 2004-07, Chang as PI), which has developed new theories, methods, and large benchmark datasets for detecting image splicing and CG images and for using camera response functions as device signatures. These results provide a sound foundation for the new research tasks proposed in this project. Co-PI Ellis is in the fourth year of project IIS-0238201, “CAREER: The Listening Machine” ($500,000, award period 2003-02-01 to 2008-01-31), which has developed several novel audio information extraction techniques, including the audio lifelog work described above. He is also in the second year of IIS-0535168, “Separating Speech from Speech Noise” ($750,000 total, 2006-01-01 to 2008-12-31), which aims to improve source separation and speech enhancement through closer investigation of how listeners perceive distorted and noisy speech.
7 Broader Impact and Integrated Research/Education
This project will have major impact on many areas of national priority, including surveillance security, news reporting, intelligence gathering, criminal investigation, financial transactions, and many others. It is motivated by a problem confronting every sector of society: can we accept audio-visual recordings as reliable and trustworthy evidence? This problem is urgent because emerging digital editing tools are making us particularly vulnerable: formats such as video, which have hitherto been trustworthy, are becoming easier and easier to modify and simulate with unprecedented realism. The tools we propose to develop are thus crucial to forestall the potentially grave consequences of people and organizations being exploited through their tacit assumptions of media validity. Without adequate foresight and advance development, society runs the risk of scrambling to build authentication tools and protocols in the face of a rash of forgeries made possible by novel editing tools.
Broad Dissemination of Results: We will promote public awareness of critical issues and potential solutions. We will disseminate results through multiple channels, including conventional academic publishing and collaborative experimentation with actual users such as those in Columbia’s Journalism School. We will propose organizing a special session on multimodal forensics at one or more relevant venues in the second year of the project to showcase our results and to bring together other labs working in this area to discuss the most significant threats and the best evaluation metrics and datasets. To that end, we will prepare and distribute both datasets and tools (as discussed in Sec. 5) to encourage and support researchers interested in working in this area and to facilitate common, comparable evaluation results.
This builds on our existing database and tool distribution from our work in image authentication, and on our efforts to organize community-wide evaluations in music information retrieval [87] and a large-scale concept ontology for multimedia [88].
Integrated Research and Education: This project will integrate several education objectives through graduate student training, new course material development, and broadened outreach to the external community. It will support two graduate students who will specialize in the audio and video modalities, respectively, and will both engage in research on multi-modal media forensics. Additionally, the research results will feed into several existing courses: “Statistical Methods for Video Indexing and Analysis” (ELEN E6882), “Statistical Pattern Recognition” (ELEN E6887), and “Digital Image Processing” (ELEN E4830), all taught by PI Chang, and “Speech and Audio Processing and Recognition” (ELEN E6820), taught by co-PI Ellis. The research results will form excellent modules for these courses, providing new teaching materials, illustrative examples, and new topics for student projects. For outreach, we participate in an NSF-sponsored GK12 program run by Dr. Jack McGourty of the Engineering School, including presentations and lab demos for visitors from local middle schools, which, given Columbia's location on the fringe of Harlem, include many underrepresented minority students.
8 References
[1] W. J. Mitchell, The Reconfigured Eye: Visual Truth in the Post-Photographic Era. Cambridge, Mass.: MIT Press, 1992.
[2] You Witness News. (2006). http://news.yahoo.com/you-witness-news.
[3] S. Gavard, "Photo-graft: A critical analysis of image manipulation," MA Thesis, McGill University, Montreal, Quebec, 1999.
[4] F. Baker. (2004). Is Seeing Believing? A Resource For Educators. http://www.med.sc.edu:1081/isb.htm.
[5] FakeorFoto. Fake or Foto. http://www.autodesk.com/eng/etc/fakeorfoto/quiz.html.
[6] Worth1000. Image Editing Contest Site. http://www.worth1000.com/.
[7] M. M. Yeung and F. Mintzer, "An invisible watermarking technique for image verification," IEEE International Conference on Image Processing, 1997.
[8] M. Wu and B. Liu, "Watermarking for Image Authentication," IEEE International Conference on Image Processing, 1998.
[9] I. J. Cox, M. L. Miller, and J. A. Bloom, Digital Watermarking. Morgan Kaufmann, 2002.
[10] J. Fridrich, M. Goljan, and B. A.C., "New Fragile Authentication Watermark for Images," IEEE International Conference on Image Processing, Vancouver, Canada, 2000.
[11] P. W. Wong, "A watermark for image integrity and ownership verification," IS&T Conference on Image Processing, Image Quality and Image Capture Systems, Portland, Oregon, 1998.
[12] E. T. Lin, C. I. Podilchuk, and E. J. Delp, "Detection of Image Alterations Using Semi-Fragile Watermarks," SPIE International Conference on Security and Watermarking of Multimedia Contents II, San Jose, CA, 2000.
[13] C.-Y. Lin and S.-F. Chang, "A Robust Image Authentication Method Surviving JPEG Lossy Compression," SPIE Storage and Retrieval of Image/Video Database, San Jose, 1998.
[14] J. Fridrich, "Image Watermarking for Tamper Detection," IEEE International Conference on Image Processing, Chicago, 1998.
[15] N. Memon and P. Vora, "Authentication Techniques for Multimedia Content," SPIE Multimedia Systems and Applications, Boston, MA, 1998.
[16] C.-Y. Lin and S.-F. Chang, "A Robust Image Authentication Method Distinguishing JPEG Compression from Malicious Manipulation," IEEE Transactions on Circuits and Systems for Video Technology, 2000.
Chang, "A Robust Image Authentication Method Distinguishing JPEG Compression from Malicious Manipulation," IEEE Transactions on Circuits and Systems for Video Technology, 2000. [17] M. Schneider and S.-F. Chang, "A Robust Content Based Digital Signature for Image Authentication," IEEE International Conference on Image Processing, Lausanne, Switzerland, 1996. [18] S. Bhattacharjee, "Compression Tolerant Image Authentication," IEEE International Conference on Image Processing, Chicago, 1998. [19] E.-C. Chang, M. S. Kankanhalli, X. Guan, H. Zhiyong, and W. Yinghui, "Image authentication using content based compression," ACM Multimedia Systems, vol. 9, pp. 121-130, 2003. [20] N. Memon, P. Vora, B.-L. Yeo, and M. Yeung, "Distortion bounded authentication techniques," SPIE Security and Watermarking of Multimedia Contents II, 2000. Trustworthy Media (Chang & Ellis) 16 [21] G. L. Friedman, "The trustworthy digital camera: restoring credibility to the photographic image," IEEE Transactions on Consumer Electronics, vol. 39, pp. 905-910, 1993. [22] P. Moulin and J. A. O’Sullivan, "Information-Theoretic Analysis of Information Hiding," IEEE TRANSACTIONS ON INFORMATION THEORY, vol. 49, pp. 563, 2003. [23] A. Martin, G. Sapiro, and G. Seroussi, "Is image steganography natural?" Image Processing, IEEE Transactions on, vol. 14, pp. 2040, 2005. [24] A. Westfeld and A. Pfitzmann, "Attacks on Steganographic Systems," Lecture Notes in Computer Science, vol. 1768, pp. 61-75., 2000. [25] J. Fridrich, M. Goljan, and R. Du, "Reliable Detection of LSB Steganography in Grayscale and Color Images," ACM Special Session on Multimedia Security and Watermarking, Ottawa, Canada, 2001. [26] I. Avcibas, N. Memon, and B. Sankur, "Steganalysis based on Image Quality Metrics Differentiating between techniques," IEEE Workshop on Multimedia, Cannes, France, 2001. [27] T.-T. Ng and S.-F. Chang, "A Model for Image Splicing," IEEE International Conference on Image Processing, Singapore, 2004. [28] D. Fu, Y. Q. Shi, and W. Su, "Detection of image splicing based on Hilbert-Huang transform and moments of characteristic functions with wavelet decomposition," International Workshop on Digital Watermarking, Jeju, Korea, 2006. [29] W. Chen, Y. Q. Shi, and S. Wei, "Image splicing detection using 2-D phase congruency and statistical moments of characteristic function," SPIE Electronic Imaging, San Jose, CA, 2007. [30] J. Lukas, J. Fridrich, and M. Goljan, "Determining Digital Image Origin Using Sensor Imperfections," SPIE, 2005. [31] J. Lukas, J. Fridrich, and M. Goljan, "Detecting Digital Image Forgeries Using Sensor Pattern Noise," SPIE, 2006. [32] A. C. Popescu and H. Farid, "Exposing Digital Forgeries in Color Filter Array Interpolated Images," IEEE Transactions on Signal Processing, vol. 53, pp. 3948-3959, 2005. [33] A. Swaminathan, M. Wu, and K. J. R. Liu, "Component Forensics for Digital Camera: A Nonintrusive Approach," CISS, 2006. [34] A. Swaminathan, M. Wu, and K. J. R. Liu, "Non-intrusive Forensic Analysis of Visual Sensors Using Output Images," ICASSP, 2006. [35] S. Lin, J. Gu, S. Yamazaki, and H.-Y. Shum, "Radiometric Calibration from a Single Image," CVPR, 2004. [36] S. Lin and L. Zhang, "Determining the Radiometric Response Function from a Single Grayscale Image," IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2005. [37] T.-T. Ng, S.-F. Chang, and M.-P. Tsui, "Using Geometry Invariants for Camera Response Function Estimation," submitted, 2006. [38] T.-T. Ng, S.-F. Chang, and M.-P. 
Tsui, "Camera Response Function Estimation from a Singlechannel Image Using Differential Invariants," Columbia University ADVENT Technical Report #216-2006-2, March 2006. [39] Y.-F. Hsu and S.-F. Chang, "Detecting Image Splicing Using Geometry Invariants And Camera Characteristics Consistency," ICME, 2006. Trustworthy Media (Chang & Ellis) 17 [40] A. K. Mikkilineni, G. N. Ali, P.-J. Chiang, G. T.-C. Chiu, J. P. Allebach, and E. J. Delp, "Signature-embedding in printed documents for security and forensic applications," Security, Steganography, and Watermarking of Multimedia Contents, 2004. [41] S. Lyu and H. Farid, "How Realistic is Photorealistic?" IEEE Transactions on Signal Processing, vol. 53, pp. 845-850, 2005. [42] I. Avcibas, S. Bayram, N. Memon, M. Ramkumar, and B. Sankur, "A classifier design for detecting image manipulations," IEEE International Conference on Image Processing, Singapore, 2004. [43] A. Swaminathan, M. Wu, and K. J. R. Liu, "Image Tampering Identification using Blind Deconvolution," IEEE International Conference on Image Processing, Atlanta, GA, 2006. [44] J. Lukas and J. Fridrich, "Estimation of primary quantization matrix in double compressed JPEG images," Digital Forensic Research Workshop, 2003. [45] A. C. Popescu and H. Farid, "Statistical Tools for Digital Forensics," 6th International Workshop on Information Hiding, Toronto, Canada, 2004. [46] J. He, Z. Lin, L. Wang, and X. Tang, "Detecting doctored JPEG images via DCT coefficient analysis," European Conference on Computer Vision, 2006. [47] D. Fu, Y. Q. Shi, and W. Su, "A generalized Benford's law for JPEG coefficients and its Applications in image forensics," SPIE Electronic Imaging, San Jose, CA, 2007. [48] H. Farid, "Digital Image Ballistics from \protectJPEG Quantization," Department of Computer Science, Dartmouth College 2006. [49] W. Wang and H. Farid, "Exposing Digital Forgeries in Video by Detecting Double \protectMPEG Compression," ACM Multimedia and Security Workshop, Geneva, Switzerland, 2006. [50] H. Farid, "Detecting Digital Forgeries Using Bispectral Analysis," MIT, MIT AI Memo 1999. [51] T.-T. Ng, S.-F. Chang, and Q. Sun, "Blind Detection of Photomontage Using Higher Order Statistics," IEEE International Symposium on Circuits and Systems, Vancouver, Canada, 2004. [52] J. Fridrich, D. Soukal, and J. Lukas, "Detection of copy-move forgery in digital images," Digital Forensic Research Workshop, Cleveland, OH, 2003. [53] A. C. Popescu and H. Farid, "Exposing Digital Forgeries by Detecting Duplicated Image Regions," Computer Science, Dartmouth College 2004. [54] W. Luo, J. Huang, and G. Qiu, "Robust Detection of Region-Duplication Forgery in Digital Image," International Conference on Pattern Recognition, 2006. [55] T.-T. Ng, S.-F. Chang, Y.-F. Hsu, L. Xie, and M.-P. Tsui, "Physics-Motivated Features for Distinguishing Photographic Images and Computer Graphics," ACM Multimedia, Singapore, 2005. [56] E. B. B. Durand Begault, Gordon Ried, Richard Sanders, Lise-Lotte Tjellesen, "Audio Forensics," 121th Audio Engineering Society Convention (AES), San Francisco, CA, 2006. [57] D. A. Reynolds, "An overview of automatic speaker recognition technology," Proc. ICASSP, Orlando FL, 2002. [58] D. A. Reynolds, T. F. Quatieri, and R. B. Dunn, "Speaker Verification Using Adapted Gaussian Mixture Models," Digital Signal Processing, vol. 10, pp. 19-41, 2000. [59] I. Nuance. (2006). Secure Sensitive Transactions with Nuance Speech Secure.http://www.nuance.com/speakerverification/. 
Trustworthy Media (Chang & Ellis) 18 [60] B. B. Mandelbrot, The fractal geometry of nature: W.H.~Freeman, San Francisco, 1983. [61] T.-T. Ng and S.-F. Chang. (2005). Columbia Photographic Images and Photorealistic Computer Graphics Dataset.http://www.ee.columbia.edu/ln/dvmm/downloads/PIM_PRCG_dataset/. [62] T.-T. Ng and S.-F. Chang. (2005). Columbia Online Demo: Photographic Image vs. Computer Graphics Detector (Version 4).http://apollo.ee.columbia.edu/trustfoto/trustfoto/natcgV4.html. [63] T. I. Ianeva, A. P. d. Vries, and H. Rohrig, "Detecting cartoons: A case study in automatic videogenre classification," IEEE International Conference on Multimedia and Expo, 2003. [64] S. F. Chang and A. Eleftheriadis, "Error accumulation of repetitive image coding," IEEE International Symposium on Circuits and Systems, ISCAS'94, 1994. [65] D. P. W. Ellis and K. Lee, "Accessing minimal-impact personal audio ar\chives," IEEE MultiMedia, vol. 13, pp. 30-38, 2006. [66] D. Ellis and K. S. Lee, "Features for Segmenting and Classifying Long-Duration Recordings of Personal Audio," ISCA Tutorial and Research Workshop on Statistical and Perceptual Audio Processing SAPA-04, Jeju, Korea, 2004. [67] K. S. Lee and D. P. W. Ellis, "Voice Activity Detection in Personal Audio Recordings Using Autocorrelogram Compensation," Interspeech ICSLP-06, Pittsburgh, 2006. [68] D. P. W. Ellis. (2004). iRiver iFP-799T Recording Noise Analysis. [69] T. Painter and A. Spanias, "Perceptual coding of digital audio," Proc. IEEE, vol. 80, pp. 451-513, 2000. [70] J. Ogle and D. P. W. Ellis, "Fingerprinting to Identify Repeated Sound Events in Long-Duration Personal Audio Recordings," Proc. ICASSP, Hawai'i, 2007. [71] W. Fujisaki and S. y. Nishida, "Temporal frequency characteristics of synchrony-asynchrony discrimination of audio-visual signals," Experimental Brain Research, vol. V166, pp. 455-464, 2005. [72] J. Hershey and J. Movellan, "Audio-vision: Using audio-visual synchrony to locate sounds," Advances in Neural Information Processing Systems, 1999. [73] J. W. Fisher Iii, T. Darrell, W. T. Freeman, and P. Viola, "Learning joint statistical models for audio-visual fusion and segregation," Advances in Neural Information Processing Systems, vol. 13, 2000. [74] H. J. Nock, G. Iyengar, and C. Neti, "Speaker Localisation Using Audio-Visual Synchrony: An Empirical Study," in Lecture Notes in Computer Science: Image and Video Retrieval, vol. 2728/2003, 2003, pp. 488-499. [75] M. Slaney and M. Covell, "FaceSync: A Linear Operator for Measuring Synchronization of Video Facial Images and Audio Tracks," Advances in Neural Information Processing Systems, 2000. [76] Photosynth. (2006). Microsoft Live Photosynth service.http://labs.live.com/photosynth/. [77] N. Snavely, S. M. Seitz, and R. Szeliski, "Photo tourism: exploring photo collections in 3D," ACM Transactions on Graphics (TOG), vol. 25, pp. 835-846, 2006. [78] D. Q. Zhang and S. F. Chang, "Detecting image near-duplicate by stochastic attributed relational graph matching with learning," Proceedings of the 12th annual ACM international conference on Multimedia, 2004. Trustworthy Media (Chang & Ellis) 19 [79] D. Q. Zhang and S. F. Chang. Columbia Image Duplicate Benchmark Data Set.http://www.ee.columbia.edu/dvmm/newDownloads.htm. [80] T.-T. Ng and S.-F. Chang, "A Data Set of Authentic and Spliced Image Blocks," Columbia University, ADVENT Technical Report June 2004. [81] T.-T. Ng and S.-F. Chang. (2004). 
Columbia Image Splicing Detection Evaluation Dataset.http://www.ee.columbia.edu/ln/dvmm/downloads/AuthSplicedDataSet/AuthSplicedDataS et.htm. [82] Calphoto. (2000). A database of photos of plants, animals, habitats and other natural history subjects. [83] T.-T. Ng, S.-F. Chang, Y.-F. Hsu, and M. Pepeljugoski, "Columbia Photographic Images and Photorealistic Computer Graphics Dataset," Columbia University, ADVENT Technical Report Feb 2005. [84] TRECVID. (2001-2006). National Institute of Standards and Technology: TREC Video Retrieval Evaluation.http://www-nlpir.nist.gov/projects/t01v/. [85] A. Tewfik and M. Mansour, "Secure Watermark Detection with Non-Parametric Decision Boundaries," IEEE International Conference on Acoustics, Speech, and Signal Processing, 2002. [86] I. Venturini, "Counteracting Oracle attacks," ACM multimedia and security workshop on Multimedia and security, Magdeburg, Germany, 2004. [87] G. Poliner, D. Ellis, A. Ehmann, E. Gomez, S. Streich, and B. Ong, "Melody Transcription from Music Audio: Approaches and Evaluation," EEE Tr. Audio, Speech, Lang. Proc., vol. accepted for pub., 2007. [88] M. Naphade, J. R. Smith, J. Tesic, S. F. Chang, W. Hsu, L. Kennedy, A. Hauptmann, and J. Curtis, "Large-scale concept ontology for multimedia," IEEE MultiMedia Magazine, vol. 13, pp. 86-91, 2006. Trustworthy Media (Chang & Ellis) 20