ALADAN: Analyze Large Data Now
PROPOSAL: VOLUME 1
DARPA-BAA-12-38
Technical Area 1 (TA1): Scalable analytics and data processing technology
"ALADAN: Analyze Large Data Now"
By University of Southern California, Information Sciences Institute
Type of Business: Other Educational
ISI proposal no.: 3064-0

Technical POC: Ke-Thia Yao, USC ISI, 4676 Admiralty Way, Ste 1001, Marina del Rey, CA 90292, Ph: 310-448-8297, [email protected]
Administrative POC: Andrew Probasco, USC Dept. of Contracts & Grants-Marina Office, 4676 Admiralty Way, Ste 1001, Marina del Rey, CA 90292, Ph: 310-448-8412, [email protected]

Award Instrument Requested: Cooperative Agreement
Place and Period of Performance: USC/ISI, 10/01/2012 to 03/31/2017
Proposal Validity: 120 days
DUNS: 0792333393
CAGE: 1B729
TIN: 95-1642394

Contents
Cover Page
Table of Contents
1 Executive Summary
2 Goals and Impact
3 Technical Plan
3.1 Introduction
3.2 Driving Example
3.2.1 The Military Problem
3.2.2 Broader Applicability
3.3 Technical Approach Overview
3.4 Detailed Technical Approach
3.4.1 Scalable Mapping and Summarization of Multiple Data Streams
3.4.2 Probabilistic Inferencing using Factor Graphs
3.4.3 Performance Diagnostics and Automatic Tuning
3.4.4 Quantum Inferencing
3.5 Deliverables and Milestones
3.6 Technical Risk and Mitigation
4 Management Plan
5 Capabilities
5.1 Facilities
5.2 Organizational Experience
6 Statement of Work
6.1 Phase 1 - Two Years
6.1.1 Year 1
6.1.2 Year 2
6.2 Phase 2 - One Year
6.2.1 Year 3
6.3 Phase 3 - One Year
6.3.1 Year 4
6.4 Phase 4 - Six Months
6.4.1 Year 5 - Six Months
7 Schedule and Milestones
8 Cost Summary
Appendix A
Appendix B
1 Executive Summary

The rapid proliferation of sensors and the exponential growth of the data they are returning, together with publicly available information, have led to a situation in which the Armed Forces of the United States have more information available to them than they can possibly digest, understand, and exploit. This situation is further complicated by the geographical distribution of the data, the many formats in which it is stored, noise, inconsistencies, and falsehoods. The University of Southern California's Information Sciences Institute proposes to address this challenge in two ways. We propose research into scalable probabilistic inferencing over distributed, high-volume, semi-structured, noisy data to aid understanding of trends, patterns, and relationships. In addition, we will create domain-specific languages and automatic tuning mechanisms for them to enable these new analytical methods to exploit emerging heterogeneous computing systems and keep up with the unending stream of input data.

We are already familiar with many of the big data challenges faced by the Defense community. Over the course of the last decade, we have participated in numerous JFCOM experiments ranging from Urban Resolve, which investigated the viability of next-generation sensors for urban operations, to Joint Integrated Persistence Surveillance, which studied the effectiveness of planning and tasking of sensors. We contributed scalable data logging and analysis tools, enabling analysts not only to log all simulated data traffic, but also to issue multi-dimensional queries in real time, something which had previously required weeks of post-processing. We also created tools for aggregating the results of queries to multiple, distributed databases and for creating data cubes for analysts.

Building on this experience with querying large, distributed databases and aggregating the results into data cubes, we propose herein to extend this technology in multiple dimensions. We will create a factor graph layer over the multi-dimensional queries to provide probabilistic inferencing with propagation of uncertainty over noisy data. In addition, we will extend our logging and abstraction mechanism to work with high-bandwidth, streaming data from sensors. Our methods will allow for heterogeneous data sources, on-the-fly transformation of this raw data into aggregated and summarized forms suitable for DOD Measures of Effectiveness (MOE) and Performance (MOP) analysis, and uncertainty reasoning. Probabilistic MOE/MOP graphs will support trade-off and what-if analysis. We will leverage ISI's unique adiabatic quantum computer to adjust classification algorithms on the fly, adapting to concept drift caused by changing conditions. In addition, we will experiment with quantum-based classification algorithms to reduce the dimensional complexity of factor graphs.

We shall develop principled implementations of our new methods based on a layered architecture with well-defined interfaces using open-source, domain-specific languages (DSLs). Our DSLs will enable interoperability with other tools and libraries, portability across platforms, and automated optimization for specific platforms. With the recent plateauing of CPU clock rates and the emergence of heterogeneous systems, programming time has become a key barrier to developing new analytic techniques.
The next generation of computers requires developers to master at least three different programming models. The enthusiasm for cloud computing is partly due to the difficulty of writing big data applications, yet MapReduce is just one inefficient point in the programming design space. Using DSLs, we will develop efficient implementations for emerging heterogeneous computers, and use automatic performance tuning to alleviate the additional burden of performance portability. We shall demonstrate the broad applicability of this approach with FPGA and GPU accelerators.

2 Goals and Impact

Military history is replete with examples of missed opportunities and unrecognized threats that led to failed missions and doomed nations. These catastrophes were often due to a lack of intelligence. Today, the opposite is commonly true; the analyst and the Warfighters are deluged with too much information or befuddled by a lack of context. The goal of this project is to develop scalable analytical methods for probabilistic combination of evidence over distributed, high-volume, semi-structured, noisy data to support DoD intelligence gathering, mission planning, situational awareness, and other related problems. If successful, ALADAN will increase the number of data source types and the data volume available for real-time data collection and summarization, and provide confidence interval-based evaluation instead of single point-based evaluation. The innovative aspects of the ALADAN project are fourfold, as described below.

We will create a big-data-oriented domain-specific language (DSL) for real-time filtering and summarization of high-volume data sources. This Data DSL goes beyond the current data logging and analysis systems used by DOD by providing an abstraction mechanism that allows data to be collected from heterogeneous data sources, including sensors, and by providing fast, adaptable implementations on Field Programmable Gate Arrays (FPGAs) that do not interfere with normal computing and communication operations.

We will also create a second, factor graph DSL for enabling probabilistic reasoning with uncertainty about militarily relevant measures of effectiveness and measures of performance. This Factor Graph DSL goes beyond state-of-the-art probabilistic reasoning systems, like Factorie [22] and BLOG [12], by connecting the factor graph model with live data streams from the Data DSL, and by providing efficient implementations of the belief propagation algorithm over high-performance architectures to perform probabilistic inferencing.

To address concerns about porting to future, heterogeneous computing platforms, we will automate performance analysis and software tuning for the Data DSL and Factor Graph DSL. Our approach goes beyond ad hoc system selection and manual tuning by performing systematic exploration of programming parameters to optimally adapt implementations to specific architectures. We will leverage state-of-the-art software development and compiler tools pioneered by the DOE SciDAC PERI and SUPER projects, which are led by USC ISI.

Finally, we will explore the use of adiabatic quantum computing (AQC) for providing strong machine learning classification models and strong factor graph inferencing solutions. Our approach goes beyond heuristic approximation approaches by providing high-confidence solutions for small-scale problems using quantum annealing.
We are uniquely positioned to exploit the AQC system at the USC-Lockheed Martin Quantum Computing Center, which will double in qubit count every two years for the rest of this decade (i.e., an AQC Moore's Law). Our deliverables include the Data DSL, the Factor Graph DSL, efficient implementations of these DSLs over multiple heterogeneous architectures, and strong quantum-based classifiers and algorithms.

ISI is well known for its practical contributions, e.g., the Domain Name System, contributions to the IP system, and one of the original nodes on the ARPAnet. It has spun off several organizations, e.g., ICANN and GLOBUS, and ISI has close relations with many defense contractors. After this research is proven useful, ISI plans to seek out our colleagues at any of a number of contractors to commercialize this product. That is seen as the most certain way to provide the technology to the Warfighter as soon and as reliably as possible.

3 Technical Plan

This section presents the major technical challenges addressed by this proposal and our approach to solving them. We begin by motivating the need for this research and discussing key technical challenges. We then present our approach, along with measures of effectiveness and success. Lastly, we discuss potential risks and our strategy to mitigate them.

3.1 Introduction

The adequate defense of the Nation critically depends on providing its Armed Forces decision makers with the best information regarding potential threats so that timely and decisive actions can be taken with appropriate impact. This is an extremely challenging task, as data must be converted into coherent information in a timely fashion. Data emanates from a plethora of sensors, is resident in untold numbers of digital archives with disparate formats and contents, and is continually updated at very high data rates. This big data issue is exacerbated by the unreliability of sources, requiring sophisticated and thus onerous methods for accurate and reliable knowledge extraction. Sophisticated analyses, scalable computational systems, and the most powerful hardware platforms are all required to avoid risks such as mistaken identity or undetected association, with the catastrophic unintended results that may flow therefrom. While today's warfighters have performed miraculously well with present capabilities, the sources and diversity of future threats increase the strain on already stressed and underperforming analytic systems.

Two main technical factors are tragically lacking in today's supporting big data analytics infrastructure. First, the systems have not been designed to cope with the large scale, heterogeneity, and geographic distribution of data, due to limited bandwidth and the limited ability to intelligently filter noise from relevant data. Increasing the sophistication of the filtering while retaining the key information would ameliorate this technical issue, improving the scalability of the solutions at hand. Second, current data aggregation and summarization approaches are either inadequate, impose infeasible constraints (such as the need to store all the sensor data), or are simply not powerful enough to extract the needed information from the massive volumes of data. The ISI Team has a strong track record in the big data field, based on decades-long participation in advanced military research and more recently with the DoD's Joint Experimentation Directorate in Suffolk, Virginia [13, 32, 34, 35].
We intend to attack all of the above-mentioned challenges with a balanced approach, starting with proven techniques and augmenting them with higher-risk, and thus higher-payoff, approaches, as follows. To cope with the very high data rates seen at the sensor, or front-end, and to address the inability to retain this data in persistent storage, we will seek flexible and programmable filtering and classification solutions based on reconfigurable technology (e.g., using FPGA-based hardware systems). In the middle-end we will leverage AQC techniques for high-throughput classification and metric optimization, in effect pruning the space of possible inferences to be processed by subsequent computational resources. Lastly, at the back-end, or close to the human analyst, we will develop scalable algorithms for processing irregular and distributed data, creating effective human-computer interaction tools to facilitate rapidly customizable reasoning for diverse missions. Given the heterogeneity of existing and emerging platforms (e.g., multi-core, GPUs, FPGAs) and the corresponding complexity in programming models (e.g., MPI, OpenMP, CUDA), we will develop DSLs geared towards the definition of probabilistic classifier metrics and target architectures, thus facilitating the deployment and portability of the developed solutions.

3.2 Driving Example

3.2.1 The Military Problem. The Joint Force Commander requires adequate capability to integrate and focus national and tactical collection assets to achieve persistent surveillance of a designated geographic area for a specific mission. Our research is motivated by our participation in the Joint Forces Command's Joint Integrated Persistence Surveillance (JIPS) experiments. Persistent surveillance is the capability to continually track targets without losing contact. The persistent surveillance infrastructure and protocols should be robust enough to track multiple targets, to support dynamic tasking of new targets, and to perform regular standing surveillance tasks. The JIPS experiments were designed to use simulation to evaluate proposed capabilities and to make recommendations for improvements. One of the purposes of the experiments was to evaluate the performance of a simulated large, heterogeneous sensor network for performing numerous intelligence functions.

At a high level, analysts specify many surveillance tasks. Each task specifies one or more targets and the type of surveillance needed. For instance, a task could be to take a single picture of a geographical location sometime in a 24-hour period. Or it could be to provide continuous coverage of a moving target for a specified period. Many types of sensors are involved, including optical, radar, infrared, and seismic. Sensors operate on platforms. A platform may be a UAV, a satellite, or some other vehicle. Sensors and platforms all have constraints. For instance, sensors have limited range, limited resolution, and error rates. Visible-spectrum optical sensors cannot be used at night. Weather affects the performance of different sensors by varying amounts. Platforms also have limitations such as range and duration. Maintenance is required. Both sensors and platforms have failure rates. Given a list of requests and an inventory of available sensors and platforms, a scheduler assigns specific tasks to each sensor and platform, such as take off at 0900 hours, fly to a target, turn on radar, return at 1300, refuel, take off at 1330, and fly to a second target. A minimal sketch of this task, sensor, and platform model appears below.
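To make the structure of this assignment problem concrete, the following Python sketch shows one way the entities above might be represented and screened for feasibility before scheduling. The class names, fields, and feasibility rules are illustrative assumptions only, not drawn from the JIPS software.

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class SurveillanceTask:          # e.g., one image of a location within a 24-hour window
        target_id: str
        required_sensor: str         # "optical", "radar", "infrared", or "seismic"
        start_hour: float
        end_hour: float
        continuous: bool = False     # True for persistent tracking of a moving target

    @dataclass
    class Sensor:
        sensor_type: str
        max_range_km: float
        error_rate: float
        night_capable: bool          # visible-spectrum optical sensors are not

    @dataclass
    class Platform:                  # a UAV, satellite, or other vehicle carrying sensors
        platform_id: str
        endurance_hours: float
        sensors: List[Sensor] = field(default_factory=list)

    def feasible(task: SurveillanceTask, platform: Platform, at_night: bool) -> bool:
        """Screen out assignments that violate basic sensor/platform constraints."""
        for s in platform.sensors:
            if s.sensor_type != task.required_sensor:
                continue
            if at_night and not s.night_capable:
                continue
            if (task.end_hour - task.start_hour) <= platform.endurance_hours:
                return True
        return False

A scheduler would then search over the feasible (task, platform) pairs subject to maintenance windows, refueling, weather, and failure rates, as described above.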
JIPS experiments involved millions of simulated entities running on multiple computing clusters, each with hundreds of processor nodes. There were thousands of sensors and millions of detection events. Each of these experiments generated many terabytes of data per day. The data is both high-volume and high-rate. Simply collecting the data without disrupting the experiments was itself a challenge. The raw data collected from these experiments was too low-level to be of direct use by the analysts. For example, the raw data includes the locations of entities at specific times, contact reports from sensor-target pairs, weapon fire events, and damage reports. Data composed of individual messages, events, and reports had to be aggregated and abstracted. These abstractions then served as input to evaluation metrics designed by analysts. These evaluation metrics were divided into two levels: measures of effectiveness (MOEs) and measures of performance (MOPs). The MOPs are lower level and often more quantitative. The MOEs take the MOPs as input to evaluate whether overall mission tasks are satisfied.

The mission of the JIPS experiment was to develop more efficient and effective use of sensors to optimize surveillance in restricted and denied areas. The Measure of Effectiveness (MOE) was to maximize the use of the collection capacity of all assets given a prioritized list of intelligence tasks to perform. Measures of performance (MOPs) included the number of collection hours that went unused, collection gaps, and the impact of ad hoc collection requests on the pre-planned requests. MOPs also included the number of targets detected, the number lost, and the number later reacquired.

Characteristics of the JIPS data included large volumes collected at geographically distributed sites, heterogeneity from the many sensor types, and noise and other errors. ISI developed the Jlogger tool to capture the data on the fly and store it as relations for further analysis. We also created the Scalable Data Grid (SDG) to issue distributed analytic queries, aggregate results from multiple sources, and perform reasoning involving trade-offs of uncertainties and options based on uncertain evidence and inferences.

3.2.2 Broader Applicability. The JIPS experiments share features and requirements with a range of problems confronting both industry and government. A large volume of data is generated at multiple sites, requiring compute-intensive probabilistic analysis to generate conclusions and predictions about the current and future state of the system. ISI has also worked with Chevron on other such problems in the CiSoft project for oil field management, using analysis of data from multiple sensors on oil rigs to statistically predict failures [20, 21] and using factor graph structure learning to understand complex oil reservoir geometries [17, 18, 19].

The current approach for problems of this class is to develop custom methods for each specific concern. This approach to software development is inefficient, and further wastes resources by not obtaining the best results possible with the existing data. Our proposed ALADAN approach is to use a general method, customized only where necessary. The customization process is facilitated with two simple Domain Specific Languages: one to describe the raw data, and one to describe the high-level analysis desired (an illustrative sketch follows). The problems related to distributed data sources and noisy data analysis are solved once, with general, widely applicable software algorithms.
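As an illustration only, the two DSL roles might look like the following when embedded in Python. The names (ContactReport, window, within, summarize) are hypothetical; the actual Data DSL syntax is a deliverable of this project.

    from dataclasses import dataclass
    from typing import Callable, Iterable

    @dataclass
    class ContactReport:                 # one raw record type from a sensor feed
        t: float                         # timestamp (seconds)
        sensor_id: str
        lat: float
        lon: float
        detected: bool

    # "Describe the raw data": select records by time window and geographic box.
    def window(records: Iterable[ContactReport], t0: float, t1: float):
        return (r for r in records if t0 <= r.t < t1)

    def within(records, lat_range, lon_range):
        (la0, la1), (lo0, lo1) = lat_range, lon_range
        return (r for r in records if la0 <= r.lat <= la1 and lo0 <= r.lon <= lo1)

    # "Describe the analysis": aggregate near the source to reduce bandwidth.
    def summarize(records, key: Callable, value: Callable):
        out = {}
        for r in records:
            out[key(r)] = out.get(key(r), 0) + value(r)
        return out

    # Example: detections per sensor in a one-hour window over an area of interest.
    # detections = summarize(
    #     within(window(feed, 0, 3600), (33.9, 34.1), (-118.5, -118.3)),
    #     key=lambda r: r.sensor_id, value=lambda r: int(r.detected))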
The process is made efficient with fast commodity hardware such as GPUs and FPGAs, parallel software optimizations, and probabilistic statistical algorithms.

3.3 Technical Approach Overview

We address the challenges outlined in the driving example described above with a multi-pronged technical approach combining proven algorithmic and system-level techniques with higher-risk but high-payoff quantum computing techniques. We use sound software engineering methodology by defining a layered architecture with explicit domain-specific languages at each layer to facilitate interoperability with other tools and software. Figure 1 illustrates the overall structure and data processing flow of the ALADAN research project.

The bottom layer of our architecture is defined by the Data DSL, which is designed to map data from heterogeneous sources to a common format to facilitate further processing. In addition, this DSL will be responsible for aggregating and summarizing the high-volume and high-speed data emanating from these heterogeneous sources. The aggregation and summarization step at the data source will make use of the data locality principle, dramatically reducing the bandwidth requirements to other remote and local components within the system. Using Field-Programmable Gate Arrays (FPGAs) will provide flexible and fast implementations that can be deployed in a physically distributed fashion.

The next layer up is defined by the Factor Graph DSL, which is designed to model the relationships among the aggregated and summarized data. Specifically, the DSL will represent the MOPs and MOEs that are of interest to military analysts. Our probabilistic factor graphs will encode local relationships amongst variables. We will provide efficient implementations of the belief propagation message-passing approximation used to perform probabilistic inferencing over the factor graph on high-performance architectures and systems, including distributed multi-core systems and systems with GPUs.

[Figure 1: Level diagram of data structure and data processing. Bottom to top: high-speed/high-volume data sources (simulation and sensors); a map/summarize layer producing aggregated/abstracted data, scalable to high data volumes and distributed heterogeneous sources, implemented with the Data DSL for filtering and summarization and with strong classifiers generated by quantum annealing on heterogeneous and custom FPGA+GPU architectures; a factor graph layer connecting Measures of Effectiveness and Measures of Performance with simulations and sensors, using the Factor Graph DSL to define probabilistic relationships and to interface with distributed data sources; and a probabilistic inference layer providing reasoning with uncertainty over missing, noisy, and incomplete data via high-performance distributed parallel computation (MPI+OpenMP+CUDA), parallelization and performance tuning of probabilistic inferencing algorithms, and novel quantum inferencing algorithms.]

The key advantages of defining DSLs are that they enhance programmer productivity by providing higher-level abstractions, and that they increase code portability by hiding system-specific details. A minimal sketch of how the two DSL layers might connect appears below.
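The following is a minimal, hypothetical sketch of the hand-off between the layers in Figure 1: an aggregated statistic produced at the Data DSL layer is bound to an observed variable of an MOE/MOP factor graph declared at the Factor Graph DSL layer. The dictionary layout and all names are illustrative assumptions, not the DSLs themselves.

    # Hypothetical description of a two-layer ALADAN model: variables, a factor
    # connecting MOPs to an MOE, and bindings from summarized data streams.
    mop_moe_model = {
        "variables": {
            "unused_collection_hours": ["High", "Medium", "Low"],   # an MOP
            "collection_gaps":         ["High", "Medium", "Low"],   # an MOP
            "collection_capacity_met": ["Yes", "No"],               # the MOE
        },
        "factors": {
            # factor name -> variables it touches; tables come from analyst priors
            # or are gathered automatically from constructive simulation runs
            "f_moe": ("unused_collection_hours", "collection_gaps",
                      "collection_capacity_met"),
        },
        "bindings": {
            # variable -> the Data DSL summarization feeding its observed distribution
            "collection_gaps": "summarize(contacts, key=target_id, value=gap_hours)",
        },
    }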
To provide efficient implementations we will perform automatic performance tuning (autotuning) and custom code generation. Autotuning technology automates the performance tuning process by empirically searching a space of optimized code variants of a computation to determine the one that best satisfies an evaluation criterion (such as performance or energy cost).

The final element of the ALADAN technical approach is to exploit AQC to generate strong machine learning classifiers for data mapping/summarization and to compare their quality to heuristic belief propagation algorithms. ISI hosts the only commercially delivered AQC machine, the D-Wave One. Quantum annealing, as implemented by the D-Wave One, probabilistically finds ground states of programmable spin-glass Ising models using a restricted form of open-system AQC, and is expected to scale better than classical alternatives [10, 30]. Probabilistic inference on some types of factor graphs can be reduced to the problem of finding the ground state of an Ising spin glass.

3.4 Detailed Technical Approach

3.4.1 Scalable Mapping and Summarization of Multiple Data Streams. When dealing with big data, a major concern is the inability to store the data for further processing, making early filtering and classification of paramount importance. While filtering vast amounts of data at line speed will be a key performance requirement, early classification of data inferences and relations, while retaining references for possible revisiting and reclassification of the selected few data items, is a key challenge. We address this problem by developing a domain-specific language for data filtering and early classification in which developers can augment a data flow in an incremental fashion with specific matching patterns and classifiers. The overall approach rests on the notion of a sequence of data and correlation filters that communicate not only scalar data or individual streamlets but also localized graphs or data cubes. These filters are then composed, as in a coarse-grain data-flow arrangement with possible feedback loops, to generate summarized views of the input data ready for processing by subsequent system-level components.

The high-level abstraction offered by the DSL we propose in this research is akin to the execution model offered by popular programming environments such as Simulink [16], thus lowering the barrier to adoption of this DSL. The added value we consider is the inclusion of predefined constructs for defining fields of interest in each data item and specific correlation or goodness metrics that are used to define the selected data pairs. Once the developer specifies the data flow, a translation and tuning engine can map it to a specific target architecture. In this context we foresee several opportunities for transformation and optimization in addition to the autotuning approach described in Section 3.4.3, namely:

• Define data-specific filters on timestamps, sensor type, location, value, and sequences of values. For instance, one filter may be interested only in a specific data tag for data records that have GPS location values within a given range and in a specific time window. Another filter may want to explore similar time windows but with different tags or locations;

• Merge data filters to promote reuse of scanned data while the data is inside the device, i.e., in transit. For instance, if two or more filters use the same sources (or share data sources), it is beneficial to co-locate them in space so that the data each needs to examine has as short a lifetime as possible;
• Explore natural streaming concurrency by mapping independent data-flow branches to distinct components of the system in space and/or in time;

• Reorganize filtered data into graph-ready data structures, possibly with some very limited (and controlled) replication, for improved locality of subsequent computation.

Clearly, each of these transformations and optimizations elicits trade-offs that are highly target-architecture-specific. For example, when using an FPGA, a sophisticated filter can be implemented in hardware that effectively selects key data records or values that are spatially or temporally related. Internal FPGA wires can be configured for exact and approximate matches, and internal data can be re-routed for maximal internal bandwidth. Previous work in the context of data reorganization and pattern matching has highlighted the performance and data volume reduction benefits of this approach [25]. Similarly, when targeting GPU cores, a given filter can be materialized as code that manipulates a cube of data already populated with significant data as the output of a previous filter, thus effectively exploiting the large internal bandwidth of GPUs and thread-level concurrency. By raising the level of abstraction in describing the flow of data, we allow a mapping (compiler) tool to generate code, either in the form of traditional serial code or using hardware-oriented languages such as Verilog/VHDL, that exploits the specific features of each target device. This is accomplished without burdening the developers and without the subsequent code maintainability and programmer portability costs.

ISI implemented and operated data logging facilities for the multiyear JFCOM JESPP series of large distributed military simulations. The simulations ran on multiple, geographically distributed (Hawaii, Ohio, Virginia) Linux clusters, as well as scores of workstations. Multiple terabytes of data were logged. ISI implemented the Scalable Data Grid (SDG) to summarize the logged data at each site and supported web-based, real-time interactive queries. Multiple cores on the simulation processors were used to provide scalability to support the logging function. The proposed system is a technological update of that system using newer hardware, FPGAs and GPUs, and more capable algorithms such as factor graphs.

3.4.2 Probabilistic Inferencing using Factor Graphs. Military decision making often deals with complicated evaluation functions of many variables. Systematic exploration of the exponential variable space using brute-force techniques is not feasible. However, these evaluation functions often have exploitable structure that allows them to be decomposed into a product of local functions, each of which depends on only a few variables. The military approach of breaking down evaluation functions into Measures of Effectiveness (MOEs) and then into Measures of Performance (MOPs) is a strong indication that natural exploitable structures do exist. The factorization of a global function into local functions can be visualized as a bipartite graph, called a factor graph, that has a variable node for each variable x_i and a factor node for each local function f_j. An edge connects a factor node to a variable node if and only if the corresponding variable x_i is an argument of the corresponding function f_j.
Here is an example of a factor function f1(x_ratio, x_overlap) that takes two variables: the ratio of available sensors to targets, x_ratio, and the frequency of overlapping sensors on the same persistent surveillance target, x_overlap. This factor function can be part of a function that evaluates the effectiveness of sensor planning and tasking. Tasking multiple sensors on the same target provides redundancy. This is desirable when the ratio of available sensors to targets is high. But when available sensors are scarce, everything else being equal, high overlap on some targets may lead to lost coverage of other persistent surveillance targets or of standing surveillance targets.

                    Overlap
                    High   Medium   Low
    Ratio  High      5       5       1
           Medium    3       3       4
           Low       1       3       5

[Figure 2: Factor function f1 expressed (a) as a table of function values and (b) as part of a factor graph snippet.]

The function values in the table can be provided by analysts based on prior experience, or they can be automatically gathered from multiple constructive simulation runs. Table 2a shows the function values for the f1 factor function. The factor function f1 has two arguments, so the corresponding data representation is a table. If a factor function has more than two arguments, then the data representation becomes a cube. In existing approaches these values are typically provided by analysts based on their prior experience. Our Factor Graph DSL allows such prior knowledge, but in addition it interfaces with the Data DSL to automatically gather such statistics from simulation runs and from sensor feeds.

For complicated situations the number of variables may range into the hundreds or thousands. Inferencing with very high-dimensional cubes is not practical, because the time complexity grows exponentially with the number of dimensions. In machine learning this problem is known as the curse of dimensionality. The factor graph approach exploits local structures in the problem space to decompose the high-dimensional cube into many interacting cubes, i.e., factors f_j:

    f(x_1, ..., x_n) = f_1(X_1) f_2(X_2) ... f_m(X_m),   where each X_j is a subset of {x_1, ..., x_n}.

If the factor graph representation of the above decomposition is a tree, then probabilistic inferencing can be performed in linear time. As an example, the Viterbi decoding problem can be encoded as a trellis tree. This allows the Viterbi algorithm to decode in linear time using a forward and backward inference approach. However, if the resulting factor graph contains cycles, no polynomial-time exact algorithms are known. Belief propagation provides an approximate solution. Although it is known to work better than other algorithms on difficult problems, such as the Boolean satisfiability problem near the phase transition boundary, it may still require high time complexity, with large numbers of message exchanges. Our approach leverages existing and emerging high-performance platforms to provide efficient implementations. Our Factor Graph DSL represents the factor graph and the message-passing algorithm, including message scheduling and graph transformations. The specificity and formal representation provided by the DSL allow for optimization and tuning for target platforms. A minimal sketch of the message-passing computation appears below.
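To ground the discussion, the following is a minimal, single-threaded Python sketch of sum-product belief propagation over table factors, using the f1 example above. The data structures and the flooding schedule are illustrative; the production implementations will be generated from the Factor Graph DSL for multi-core, GPU, and distributed targets.

    import itertools

    # Variables and their domains (the f1 example from Figure 2).
    domains = {"ratio": ["High", "Medium", "Low"], "overlap": ["High", "Medium", "Low"]}

    # Factors: name -> (tuple of variable names, table mapping value tuples to scores).
    f1_table = {("High", "High"): 5, ("High", "Medium"): 5, ("High", "Low"): 1,
                ("Medium", "High"): 3, ("Medium", "Medium"): 3, ("Medium", "Low"): 4,
                ("Low", "High"): 1, ("Low", "Medium"): 3, ("Low", "Low"): 5}
    factors = {"f1": (("ratio", "overlap"), f1_table)}

    def normalize(msg):
        z = sum(msg.values()) or 1.0
        return {k: v / z for k, v in msg.items()}

    def run_bp(domains, factors, iters=10):
        """Sum-product belief propagation with a simple flooding schedule."""
        msgs = {}                                   # messages keyed by (source, target)
        for fname, (vs, _) in factors.items():
            for v in vs:
                msgs[(fname, v)] = {x: 1.0 for x in domains[v]}
                msgs[(v, fname)] = {x: 1.0 for x in domains[v]}
        for _ in range(iters):
            # variable -> factor: product of messages from the variable's other factors
            for fname, (vs, _) in factors.items():
                for v in vs:
                    out = {x: 1.0 for x in domains[v]}
                    for gname, (ws, _) in factors.items():
                        if gname != fname and v in ws:
                            for x in out:
                                out[x] *= msgs[(gname, v)][x]
                    msgs[(v, fname)] = normalize(out)
            # factor -> variable: marginalize the table weighted by incoming messages
            for fname, (vs, table) in factors.items():
                for v in vs:
                    out = {x: 0.0 for x in domains[v]}
                    others = [w for w in vs if w != v]
                    for assign in itertools.product(*(domains[w] for w in others)):
                        partial = dict(zip(others, assign))
                        for x in domains[v]:
                            partial[v] = x
                            weight = table[tuple(partial[w] for w in vs)]
                            for w in others:
                                weight *= msgs[(w, fname)][partial[w]]
                            out[x] += weight
                    msgs[(fname, v)] = normalize(out)
        # beliefs: product of incoming factor messages at each variable
        beliefs = {}
        for v, dom in domains.items():
            b = {x: 1.0 for x in dom}
            for fname, (vs, _) in factors.items():
                if v in vs:
                    for x in dom:
                        b[x] *= msgs[(fname, v)][x]
            beliefs[v] = normalize(b)
        return beliefs

    # print(run_bp(domains, factors))

With a single factor the beliefs reduce to the normalized marginals of the f1 table; with many interacting factors the same message updates propagate evidence through the whole MOE/MOP graph.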
3.4.3 Performance Diagnostics and Automatic Tuning. The filters and data analysis specified in the proposed DSLs must have high-performance implementations on a variety of target architectures or heterogeneous system components. Manual code optimization is a time-consuming and error-prone process that must be repeated for each new architecture. Automatic performance tuning, or autotuning, addresses these problems by automatically deriving a space of code implementations (or variants) of a computation and empirically searching for the best variant on a target architecture. The automatic generation of optimized code variants is based on architecture models and known architectural parameters, as well as on static analysis or user knowledge of the application.

Autotuning technology can be grouped into three main categories. In library autotuning, self-tuning library generators automatically derive and empirically search optimized implementations; examples are the libraries ATLAS, PhiPAC, and OSKI [2, 31, 33] for linear algebra, and SPIRAL and FFTW [11, 27] for signal processing. In compiler autotuning, a compiler or tool automatically generates a space of alternative code variants to be empirically searched; examples are CHiLL, CUDA-CHiLL, PLUTO, and Orio [5, 15, 26, 29]. In programmer-directed autotuning, users guide the tuning process using language extensions, such as in PetaBricks and Sequoia [1, 9, 28].

The ISI team will use compiler-based autotuning for domain-specific code generation to achieve high performance and performance portability on heterogeneous systems. Domain-specific code generation is particularly well suited for autotuning because domain-specific knowledge can be incorporated in the decision algorithms that derive optimized code variants. Such knowledge can also be used to guide the empirical search for the best variant. To maintain the generality of compiler-based autotuning, our system will support the integration of domain-specific knowledge using a higher-level user interface to a compiler-based autotuning system.

The ISI team has been conducting research in autotuning for over eight years, and some of its members are co-developers of an autotuning framework based on CHiLL, a code transformation and generation tool [4, 6, 14]. We will leverage CHiLL as the core of the autotuning system for our DSLs, and use CHiLL's high-level transformation interface to develop new domain-specific optimizations by composing basic code transformations already supported by CHiLL and adding new basic transformations as needed. CHiLL is a code transformation tool based on the polyhedral model that supports a rich set of compiler optimizations and code transformations. CHiLL takes as input an original computation and a transformation recipe [14] and generates parameterized code variants to be evaluated by an empirical search. A transformation recipe consists of a sequence of code transformations to be applied to the original computation, and can be derived by a compiler decision algorithm or provided by an expert user. CHiLL's transformation interface also supports the specification of knowledge about the application code or input data set that can be used to derive highly specialized code variants.

The software will be instrumented to measure and monitor the performance of relevant functions and to generate reports of the same. The reports will provide data indicating where additional or different hardware, or different software optimizations, are required. A minimal sketch of the empirical-search loop appears below.
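The following sketch illustrates the empirical-search idea behind autotuning: enumerate parameterized code variants, time each on a representative workload, and keep the best. It is illustrative only; in the ALADAN system the variants would be produced by the compiler-based infrastructure (e.g., CHiLL transformation recipes), not by the toy make_variant closure below.

    import itertools
    import time

    def measure(variant, workload):
        """Time one execution of a code variant on a representative workload."""
        start = time.perf_counter()
        variant(workload)
        return time.perf_counter() - start

    def empirical_autotune(make_variant, parameter_space, workload, trials=3):
        """Enumerate parameterized variants, time each, and keep the fastest.
        In a real system make_variant would invoke a compiler-based code
        generator (e.g., applying a transformation recipe) and load the result."""
        best_params, best_time = None, float("inf")
        for params in parameter_space:
            variant = make_variant(params)
            elapsed = min(measure(variant, workload) for _ in range(trials))
            if elapsed < best_time:
                best_params, best_time = params, elapsed
        return best_params, best_time

    # Toy example: a blocked summation tuned over (tile size, unroll factor),
    # standing in for the loop transformations a recipe would apply to a kernel.
    def make_variant(params):
        tile, unroll = params
        def run(data):
            total = 0.0
            for i in range(0, len(data), tile):
                block = data[i:i + tile]
                for j in range(0, len(block), unroll):
                    total += sum(block[j:j + unroll])
            return total
        return run

    search_space = itertools.product([64, 256, 1024], [1, 2, 4, 8])
    # best, secs = empirical_autotune(make_variant, search_space, list(range(100000)))

The same loop applies unchanged whether the evaluation criterion is execution time or energy cost.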
3.4.4 Quantum Inferencing. It is the ISI Team's intention to conceptualize and investigate the use of the ISI D-Wave quantum optimization engine to create strong classifiers for isolating and characterizing associated data elements, in order to recognize objects (or patterns) of interest in extensive and heterogeneous data. As data bandwidth grows, it becomes impossible to analyze it exhaustively in real time, or even to store all of it for processing at a later time. Analytic methods would be severely limited, possibly missing important correlations or developing associations in unacceptably tardy time frames. We propose to implement automatic but flexible filtering based on machine learning techniques. Humans will then be able to examine the products of the analysis.

Results recently published on a collaboration between Google and D-Wave suggest that quantum annealing can outperform classical algorithms in the design of strong classifiers [23]. The next generation of the D-Wave quantum annealing chip will address problems which are infeasible in practice for classical algorithms. Two ISI researchers are already using the D-Wave One to create strong classifiers as optimal linear combinations of weak classifiers. They are then using these classifiers to filter control algorithms, and to search for optimal donor molecules for organic photovoltaics (in collaboration with Harvard University). These classifiers, designed to mark important data, are constructed by providing the algorithm developer with the following resources:

• A training set, T, of S input vectors {x_i} labeled (or associated) with their correct classifications {y_i}, for i = 1, ..., S.

• A dictionary of N potential weak classifiers, h_j(x). These can be either classifiers that are known but not sufficiently accurate for the application, or reasonable guesses at classifiers that may suit the application. (Note that the weak classifiers h(x) are not associated with the label vector y, which has been obtained by other means.)

The goal is to construct the most accurate strong classifier possible using as few weak classifiers as is reasonable. This strong classifier generalizes the information learned from the labeled examples. This is accomplished by optimizing, over the training set, the accuracy of a subset of weak classifiers. The strong classifier then makes its decision according to a weighted majority vote over the weak classifiers that are involved. Selecting a subset of weak classifiers is done by associating with each weak classifier h_j(x) a binary variable w_j in {0, 1}, which denotes whether the weak classifier is selected for the strong classifier. The value for each variable w_j is set by minimizing a cost function L(w) during the training phase. This cost function measures the error over the set of S training examples. We also add a regularization term to prevent the classifier from becoming too complex and over-fitting (where the classifier performs perfectly on the training set but generalizes poorly to other vectors). In summary, the machine-learning problem of training the strong classifier is formulated as a binary optimization problem, of the type that can be addressed with quantum annealing. The binary optimization of training the strong classifier can be approximated with a Quadratic Unconstrained Binary Optimization (QUBO). The work by Google and D-Wave approximated this QUBO directly with the Ising model that is implemented natively by the quantum annealing hardware.
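One standard way to write such a training cost, following the boosting-as-QUBO formulation in the literature cited above (normalization conventions vary), is:

    L(w) = \sum_{i=1}^{S} \Big( \sum_{j=1}^{N} w_j \, h_j(x_i) - y_i \Big)^2 + \lambda \sum_{j=1}^{N} w_j,  \qquad  w_j \in \{0, 1\}.

Expanding the square yields a quadratic form in the binary variables w, i.e., a QUBO; the substitution w_j = (1 + s_j)/2 with s_j in {-1, +1} then gives the Ising model executed natively by the annealing hardware.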
A new approach is to use heuristic optimization algorithms, fine-tuned for the problem at hand, similar in spirit to sequential optimization techniques and related to genetic algorithms. The algorithm proceeds by refining several populations, sequentially approximating the best solutions of each population with an Ising model. By using this machine learning approach, we can produce a strong classifier that, from a small set of labeled data samples, generalizes their properties in order to classify new data. In this manner, we can filter data for post-processing.

Today's ISI D-Wave One installation has 128 qubits, of which 108 are calibrated and functioning as of May 2012. The second-generation machine will have 512 qubits, and ISI expects an upgrade at the end of the calendar year. Projecting the same delivery percentage for effectively functioning qubits, one would anticipate this would yield O(400) qubits. The third-generation D-Wave is planned to have 2,048 qubits, again delivering O(1,600); it is planned for the end of 2014. Benchmarking done at ISI confirms that if we take into account only the "computational time", the time used for the quantum annealing evolution, the D-Wave One is already around two orders of magnitude faster than optimized classical algorithms (based on belief propagation [7]). Nevertheless, this time is dominated by engineering overheads, such as programming, thermalization, and read-out. The performance of the D-Wave One is therefore similar to that of optimal classical algorithms [7, 24]. For the next upgrade to this hardware, with 512 physical superconducting qubits, the engineering overhead will be reduced by more than two orders of magnitude. Classical algorithms to solve spin glasses at this size have prohibitive memory (1 TB of SRAM for belief propagation [7]) or time (thousands of years for TABU probabilistic search [24]) requirements, depending on the algorithm. Scaling analyses done at ISI are compatible with the expectation that open-system quantum annealing will still perform well at this problem size. Experimental and theoretical [3] studies suggest that this method is indeed quite robust against natural noise.

We emphasize that quantum annealing can be used exclusively during the training phase of the strong classifiers for data filtering. The resulting strong classifiers can be deployed on the appropriate hardware, such as FPGAs and GPUs, as detailed in Sec. 3.4.1. Nevertheless, during the lifetime of the project we will explore the possibility of continuous access to specialized quantum annealing hardware for the future refinement of strong classifiers and the definition of new ones.

3.5 Deliverables and Milestones

We now describe the anticipated project deliverables and milestones, along with their timeline and relevant references to the tasks in the SOW. We identify each deliverable (D) and milestone (M) as in M.2.1, denoting the first milestone of the second project task.

Mar 2013: M.2.1, M.3.1: Definition of the Data DSL and Factor Graph DSL, and preliminary mapping strategies for the CPU engine. M.4.1: Prototype autotuning restricted to CPU and FPGA. M.5.1: Definition of weak and strong classifiers for data filtering using quantum annealing.

Sept 2013: D.2.2, M.2.2: Prototype implementation of the domain-specific compilation tool with back ends for C programs and for Verilog/VHDL suitable for synthesis on the selected FPGA-based board. D.3.2, M.3.2: Prototype implementation of belief propagation on multi-cores and/or GPUs.
D.3.2: Prototype autotuning for FPGA. M.5.2: Prototype strong classifier for data filtering using quantum annealing; demonstration of machine learning filtering with quantum annealing and with classical alternatives; report on the performance of the first prototype.

Mar 2014: D.2.3, D.3.3, D.4.3, D.5.3: Analyze the function and performance of the ALADAN system. Extend the system to run on GPUs. Initial port of the quantum classifier model onto FPGA. Report on the performance evaluation of the prototype ALADAN system, including compiler-generated code on selected streaming data inputs and corresponding measures of effectiveness (MOE).

Sept 2014: D.1.4, D.2.4, D.3.4, D.4.4, D.5.4: Final Phase 1 report, including performance tuning of code generated for GPUs and traditional processors using the proposed autotuning compilation infrastructure on the C codes produced by the prototype domain-specific compiler. Delivery of the initial version of the ALADAN system.

Mar 2015: M.2.5, M.3.5, M.4.5: Extend the functionality and optimize the performance of the ALADAN system. Report on performance tuning of code generated for GPUs and traditional processors using the proposed autotuning compilation infrastructure on the C codes produced by the prototype domain-specific compiler. M.5.6: Refinement of strong classifiers for data filtering. Training of classifiers with D-Wave Two quantum annealing hardware, if available.

Sept 2015: D.1.6, D.2.6, D.3.6, D.4.6, D.5.6: Implement and deliver a full-capability ALADAN system. Final report for Phase 2.

Sept 2016: D.1.7, D.2.7, D.3.7, D.4.7, D.5.7: Install/deploy the ALADAN system at a client site, and demonstrate the system on new test cases.

Mar 2017: D.1.8, D.2.8, D.3.8, D.4.8, D.5.8: Delivery of the integrated ALADAN system and the final report for Phase 4.

3.6 Technical Risk and Mitigation

While the proposed approach is grounded on solid principles and previous USC Team experience in large-scale distributed simulation and data analytics, we do recognize technical risks in some areas of the proposed project, namely:

• Algorithm Design and Adaptation: There is a risk of not being able to adequately extend and adapt existing solutions to the dynamic and noisy nature of the input data. Mitigation Factor: Identify appropriate subsets. Perform additional data cleansing manually.

• DSL Development: There is a potential risk in the development of a DSL for code generation targeting FPGAs and GPUs for aggregation and summarization. Mitigation Factor: The team has extensive experience in compilation and program translation using DSLs for FPGAs [8] and in GPU programming using languages such as MPI/OpenMP and CUDA. There are also adequate facilities with hardware testing equipment that can be used for testing/debugging purposes.

• FPGA-based Demonstrator: There is a risk of not being able to develop and assemble an FPGA-based demonstrator. Mitigation Factor: Several team members have built, over the years, a vast array of hardware prototypes for various research demonstrators. At USC/ISI there are also adequate facilities fitted with hardware testing equipment that can be used for testing/debugging purposes.

• Quantum annealing for machine learning: There is a risk that quantum annealing for machine learning will not offer an advantage over state-of-the-art classical algorithms. Mitigation Factor: We are constantly benchmarking the performance of the D-Wave One against optimized classical alternatives. If the D-Wave Two does not perform as expected, we will use the best classical alternative found.
In addition, we are working on using quantum annealing as a specialized binary optimization subroutine within classical heuristic optimization algorithms. Quantum annealing can be switched on and off in these algorithms; therefore, the performance is never worse than the classical alternative.

Lastly, we do not anticipate management risks associated with the execution of the project. The team has collaborated extensively in the past and includes a very diversified set of skills, from hardware-oriented design to high-level algorithmic solutions, that will be leveraged in this project. In addition, recent interfacing with the government in projects that also involved demonstrators provides additional confidence that risks in this context are also negligible.

4 Management Plan

The ISI Team is an experienced research team, with more than a decade of experience working together on large DoD projects. They are not an ad hoc team of researchers assembled just for this project. They will be led by the Principal Investigator (PI), Dr. Ke-Thia Yao, who is a Research Scientist and Project Leader at USC's ISI. He is a widely published author and lecturer on data management and teaches an undergraduate course on the topic at USC. Within the JESPP project he is developing a distributed data management system, the Scalable Data Grid, and a suite of monitoring/logging/analysis tools to help users better understand the computational and behavioral properties of large-scale simulations. He received his B.S. degree in EECS from UC Berkeley, and his M.S. and Ph.D. degrees in Computer Science from Rutgers University.

He will be assisted by three experienced research scientists from ISI: Dr. Pedro Diniz, whose expertise is in FPGA applications to analyze large data sets; Dr. Jacqueline Chame, whose expertise is in compiler design to support Domain Specific Languages; and Dr. Sergio Boixo, who is one of the key researchers on the D-Wave quantum computing project at USC. They, in turn, will be supported by Messrs. Gene Wagenbreth and Craig Ward, both experienced High Performance Computing (HPC) and distributed database programmers, and by CDR Dan Davis, who will be responsible to the PI for documentation, progress reporting, the web site, and military intelligence issues. Of these, Dr. Yao and Mr. Wagenbreth hold current TS/SCI clearances; CDR Davis is currently cleared to the TS level and has held three SCI tickets in the past; and Mr. Ward holds a current Secret clearance. Further, the facility where the bulk of the research will be done has an area certified to the SCI level.

Drs. Yao, Diniz, Chame, and Boixo are all considered key to the project. All of the team members are widely published in the fields indicated in the paragraph above. Experience levels are all demonstrable and exceptional; e.g., Gene Wagenbreth has been doing HPC programming for four decades (since beginning on the ILLIAC IV), and CDR Davis has a 24-year career as a cryptographic linguist, analyst, and intelligence manager in the Naval Services. The key personnel all have more than two decades of experience in their respective fields. With the exception of the PI, the staff will all contribute on the order of 800 hours per year to this project.
CDR Davis will lead the drafting of the Project Communications Plan and create a web page with interactive capabilities for ensuring the coordination of the team, the dissemination of data, the preparation of progress reports, and the complete visibility of the research to the Program Manager at DARPA. The Team has effectively used this system in the past and found it intuitive, useful, and conducive to good administration of the research. Collaboration will be supported by:

1. A common wiki will be used to exchange documents and retain a history of interactions. An svn repository will be used to support software collaboration and development.

2. Bi-Annual PI & Research Staff Meetings: Given the geographic location of the project partners, a program meeting is planned twice a year to exchange results and discuss multidisciplinary research issues.

3. Quarterly Research Staff Meetings: These meetings, supported mostly via video-conference during the second phase of the project, will aim at a close integration of the prototype tools designed during Phase 1 and developed during Phase 2. These short meetings (1 day) will mitigate the potential integration risks for these tools.

4. Teleconferences or seminars every two weeks: Key issues, milestones, and deliverables will be discussed.

The PI will prepare quarterly reports for the government summarizing key research results, task progress, deliverables/publications, and other research staff/student success stories.

5 Capabilities

5.1 Facilities

The University of Southern California's Information Sciences Institute is a large, university-based research center with an emphasis on programs that blend basic and applied research through exploratory system development. It has a distinguished history of producing exceptional research contributions and successful prototype systems under government support. USC/ISI has built a reputation for excellence and efficiency in both experimental computer services and production services. USC/ISI originally developed and provided, and/or now helps to support, many mature software packages for the entire Internet community.

The computer center has been an integral part of ISI since its founding in 1972. Today's Information Processing Center (IPC) maintains a state-of-the-art computing environment and staff to provide the technical effort required to support the performance of research. Resources include client platform and server hardware support, distributed print services, network and remote access support, operating systems and application software support, computer center operations, and help desk coverage. The IPC also acts as a technical liaison to the ISI community on issues of acquisition and integration of computing equipment and software. The Center's servers are protected by an uninterruptible power supply and backup generator to ensure availability 24 hours a day, 365 days a year. A rich mix of computer and network equipment, along with modern software tools for the research community's use, provides a broad selection of capabilities, including Unix- and Linux-based servers and Windows-based servers used for electronic mail and group calendaring, web services, and file and mixed application serving. File servers utilize high-performance RAID and automated backup to facilitate performance and data protection. Computer room space is also available to researchers for hosting project-related servers.
In addition, research staff members have access to grid-enabled cluster computing, and to USC's 11,664-CPU compute cluster with low-latency Myrinet interconnect, which is the largest academic supercomputing resource in Southern California. Research project staff have an average of over one workstation per staff member, connected to a high-performance switched 10 Gbps Ethernet LAN backbone with 10 Gbps connectivity to research networks such as Internet2, as well as additional network resources such as IP multicast, 802.11g and 802.11n wireless, H.323 point-to-point and multipoint videoconferencing, webcasting, and streaming media.

The USC-Lockheed Martin Quantum Computing Center houses a D-Wave One quantum optimization engine having 128 qubits (Rainier chip). The D-Wave One performs binary function minimization using quantum annealing. There are plans to upgrade the D-Wave One to its next generation of 512 qubits (Vesuvius chip) when it becomes available, approximately in the next one to two years. The Center is currently operational for access by researchers in quantum computing technology. This quantum optimization engine, the only fully operational quantum computing engine in the world, will be used to validate and prove the technical feasibility of the quantum computing concepts discussed herein. The D-Wave One at USC's LM-QCC is connected to a network of high-performance computers (a classical HPC environment). The entire computing configuration can be accessed remotely both at USC and ISI, as well as by external researchers, using either a private or a public network protected with encryption for security. Also, it can communicate with private or public cloud computing clusters using encrypted processing configurations for privacy and security protection. Operationally, the computationally intensive problems (QC appropriate) can be partitioned with respect to the respective QC and classical HPC environments. For these problems, the corresponding QC-centric mathematical kernels can execute in the QC environment (i.e., the D-Wave One), and the HPC-centric kernels can execute in the classical HPC environment. The results are properly combined and coordinated to produce the end product communicated to the researchers performing the research activities and associated investigations.

5.2 Organizational Experience

ISI personnel have experience and expertise in all of the areas critical to this proposal:

• Probabilistic statistical inference: Dr. Ke-Thia Yao, with multiple years working on the JESPP project (JFCOM distributed simulation) and on CiSoft (oilfield equipment failure prediction).

• Parallel processing and distributed computing: Dr. Jacqueline Chame, Dr. Pedro Diniz, and Gene Wagenbreth, with many years of experience in all aspects of high-performance computing: hardware and software, system software for compiling, translating, measuring, and tuning, application software, and FPGA and GPGPU programming.

• Quantum computing: Dr. Sergio Boixo, a key researcher on the D-Wave quantum computing project at USC.

6 Statement of Work

6.1 Phase 1 - Two Years

The goals of Phase 1 are to design the system; create multiple test cases to test components and demonstrate capabilities; implement a reduced-capability system; prepare preliminary documentation; test and debug; and identify design deficiencies and redesign as needed.

6.1.1 Year 1.

Y1 Task 1 Management: Management tasks initiate immediately with a kick-off meeting, to which the PM will be invited, and the ensuing effort will be focused on standing up the systems to support the project for four and a half years.
The next major activity will be assessing the final Contract Data Requirements List (CDRL) to ensure all mandated documents are submitted in a timely, complete, and compliant manner. A second thrust will be conceiving, designing, and programming a project web site, complete with schedules, deadlines, publication targets and drafts, and outreach information for stakeholders. Sections not intended for public use will be encrypted and password protected. Wikis and other interactive communications will be created, maintained, and decommissioned as appropriate. Travel will be at a minimum early in Year 1, but PI travel to DC and a representative's travel to the summer test evolution are being proposed. The ISI team will deliver all required technical papers, reports, software, documentation, and final reports.

Y1 Task 2 Mapping/Aggregation/Summarization: Define the syntax and semantics of a DSL for the specification of data filtering and aggregation, to be implemented on both FPGA-based platforms and GPUs. Develop and validate an open-source parser and the corresponding code translation schemes, making use of existing parallel programming languages such as MPI, OpenMP, and CUDA (for the GPU target).

Y1 Task 3 Probabilistic Inferencing: Define the syntax and semantics of a DSL for encoding relationships using factor graphs and for interfacing with the aggregated/summarized data from Task 2. Develop an initial implementation of the belief propagation algorithm on GPGPUs and/or other heterogeneous architectures.

Y1 Task 4 Performance Monitoring and Automatic Tuning: Identify performance bottlenecks and basic transformation opportunities in application code generation (in particular, partitioning between multi-core and accelerator engines (FPGAs and GPUs), as well as explicit concurrency using MPI/OpenMP and CUDA). Define optimization strategies and the corresponding parameters and schemes.

Y1 Task 5 Quantum Inferencing: Initial prototyping of the machine-learning approach for data filtering. Identification of an initial set of weak classifiers and prototyping of strong classifiers. Benchmarking of strong classifiers, with the training phase using quantum annealing, against related purely classical algorithms. Initial interfacing with the DSL defined in Task 2.

6.1.2 Year 2

Y2 Task 1 Management: Year 2 management tasks are a continuation of Year 1. All meetings, communications, and project deliverables will receive a re-evaluation at the beginning of the year; unnecessary or unproductive activities will be curtailed and obviated tasks will be discontinued. New and useful tasks will be added as appropriate. Travel will increase as team members begin to publish results deemed publication-worthy and appropriate for dissemination by the PM and other DARPA-associated stakeholders.

Y2 Task 2 Mapping/Aggregation/Summarization: Test and evaluate DSL-generated codes on both the FPGA-based system and GPU accelerators for a sample set of filtering and data aggregation strategies. Demonstrate (at month 18 of the project) the filtering and data summarization sub-system and its interface with the overall system.

Y2 Task 3 Probabilistic Inferencing: Test and evaluate factor graph DSL-generated codes on GPUs and/or other heterogeneous architectures. Develop the factor graph for the MOE/MOP evaluation of the persistent surveillance example. Demonstrate (at month 18 of the project) the factor graph DSL and belief propagation algorithms.
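To make this demonstration concrete, the short sketch below illustrates the sum-product message computation that a belief propagation implementation of this kind evaluates. It uses a toy chain-structured factor graph with invented factor values, since the actual DSL syntax and factor graphs remain to be defined under Tasks 2 and 3; it is a sketch of the computation, not the proposed implementation.

# Minimal sum-product belief propagation on a toy chain-structured factor
# graph (x1 -- f12 -- x2 -- f23 -- x3). All factor values are illustrative.
import numpy as np

f1  = np.array([0.7, 0.3])                # unary factor phi_1(x1), binary variable
f12 = np.array([[0.9, 0.1], [0.2, 0.8]])  # pairwise factor phi_12(x1, x2)
f23 = np.array([[0.6, 0.4], [0.3, 0.7]])  # pairwise factor phi_23(x2, x3)

# Forward sum-product messages along the chain:
#   m_{f12->x2}(x2) = sum_{x1} phi_1(x1) * phi_12(x1, x2)
#   m_{f23->x3}(x3) = sum_{x2} m_{f12->x2}(x2) * phi_23(x2, x3)
m_f12_to_x2 = f12.T @ f1
m_f23_to_x3 = f23.T @ m_f12_to_x2

# The chain is a tree, so the normalized incoming message is the exact marginal.
p_x3 = m_f23_to_x3 / m_f23_to_x3.sum()

# Sanity check against brute-force enumeration of the joint distribution.
joint = np.einsum('i,ij,jk->ijk', f1, f12, f23)
assert np.allclose(p_x3, joint.sum(axis=(0, 1)) / joint.sum())
print("P(x3):", p_x3)

For graphs with cycles, the same per-edge message updates would typically be applied iteratively; parallelizing these independent updates across the much larger factor graphs produced by the DSL is what the GPGPU implementation is intended to exploit.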
Y2 Task 4 Performance Monitoring and Automatic Tuning: Implement and perform a preliminary analysis of application code partitioning between multi-core and accelerator engines (FPGAs and GPUs). Perform initial evaluation and integration with the GPU and FPGA programming efforts.

Y2 Task 5 Quantum Inferencing: Generate, test, and evaluate machine-learning strong classifiers for data filtering. Improve the dictionaries of weak classifiers. Continuously benchmark quantum-annealing-generated classifiers against improved classical alternatives. Demonstrate (at month 18 of the project) machine-learning filtering. Conduct initial tests with the D-Wave Two for generation of strong classifiers, if available. Begin porting strong classifiers to the appropriate hardware, as specified by Task 4.

6.2 Phase 2 - One Year

The goals of Phase 2 are to implement a full-capability system; reformulate test cases as appropriate; add more real-world cases; test and debug; measure performance; and produce documentation.

6.2.1 Year 3

Y3 Task 1 Management: The first year of Phase 2 will be characterized by reassessing the management procedures required to adequately support both the existing goals and the new goals set out by the PM and by the shift in research focus. A deep analysis of the efficacy of the management systems will be led by the PI and driven by CDR Davis. Meetings, travel, and communications systems in place will be continued as deemed advisable. Travel is anticipated to be limited to the PI's trips to DC, trips to present technical papers, and the representative's trips to Northern Virginia for the summer test evolution.

Y3 Task 2 Mapping/Aggregation/Summarization: Integrate the FPGA-based and GPU-based accelerator sub-systems with the higher-level probabilistic inference algorithmic designs. Refine DSL semantics and code generation schemes whenever appropriate.

Y3 Task 3 Probabilistic Inferencing: Enhance the scalability of the belief propagation algorithm by extending it to multiple nodes with multiple GPUs.

Y3 Task 4 Performance Monitoring and Automatic Tuning: Refine tuning strategies and code generation schemes for all system components, with emphasis on the data analytics algorithms. Test and evaluate sample test input cases.

Y3 Task 5 Quantum Inferencing: Refine the strong classifiers for data filtering (an illustrative sketch of the underlying training formulation appears below). Integrate closely with Tasks 2 and 3. If available, use the D-Wave Two for the training phase of machine learning, benchmarking against related classical algorithms (which are used otherwise).

6.3 Phase 3 - One Year

The goals of Phase 3 are to create new test cases that utilize the capabilities of the system to generate new results; install/deploy the system at actual sites; and design and implement new features to enable more innovative results.

6.3.1 Year 4

Y4 Task 1 Management: Year 4 will be a year of delivery and publication. The research will have borne fruit that the PM will want communicated to the appropriate audiences. Delivery of programs and data will be carefully monitored and screened for utility, completeness, and documentation. The PI and CDR Davis will confirm publication propriety and ensure the project's successes are recognized by both the research and the user communities. Meetings and communications systems, e.g., the web site, will be re-evaluated for efficacy.

Y4 Task 2 Mapping/Aggregation/Summarization: Test the FPGA-based and GPU-based accelerator sub-systems with the higher-level probabilistic inference algorithmic designs in real (live) data scenarios. Demonstrate and evaluate the integrated system.
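As an illustrative companion to the Quantum Inferencing tasks above (Task 5 in each year), the sketch below shows how a strong classifier can be assembled from a small dictionary of weak classifiers by minimizing a binary quadratic (QUBO) objective, in the spirit of the hardware quantum-annealing classification of Neven et al. [23]. All data, weak classifiers, and the regularization weight are invented for illustration, and exhaustive search stands in for the annealer; in practice an objective of this general form would be handed to the D-Wave hardware, with classical solvers such as multistart tabu search [24] available for benchmarking.

# Toy QUBO formulation for selecting weak classifiers to form a voting
# "strong" classifier. Everything here (data, weak-classifier dictionary,
# penalty weight) is illustrative; brute force stands in for quantum annealing.
import itertools
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))              # 40 toy training samples, 3 features
y = np.sign(X[:, 0] + 0.5 * X[:, 1])      # labels in {-1, +1}
H = np.sign(X)                            # weak classifiers: H[s, i] = h_i(x_s)

# Expand  sum_s (sum_i w_i h_i(x_s) - y_s)^2 + lam * sum_i w_i  into a QUBO
# w^T Q w over w in {0,1}^K (the constant sum_s y_s^2 is dropped; w_i^2 = w_i
# lets the linear terms be folded onto the diagonal).
lam = 0.5
K = H.shape[1]
Q = H.T @ H + np.diag(-2.0 * (H.T @ y) + lam)

best_w, best_e = None, np.inf
for bits in itertools.product([0, 1], repeat=K):   # annealer stand-in
    w = np.array(bits)
    e = w @ Q @ w
    if e < best_e:
        best_w, best_e = w, e

strong = np.sign(H @ best_w)              # vote of the selected weak classifiers
print("selected weights:", best_w, " training accuracy:", np.mean(strong == y))

On the D-Wave hardware, the couplings in Q would be mapped onto qubit biases and coupler strengths (subject to hardware connectivity), and the exhaustive loop above would be replaced by repeated anneal-and-readout cycles.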
Y4 Task 3 Probabilistic Inferencing: Extend the factor graph DSL as needed to accommodate additional test cases. Implement the belief propagation algorithm on additional heterogeneous architectures, such as Tilera.

Y4 Task 4 Performance Monitoring and Automatic Tuning: Evaluate the base systems with additional, larger input cases. Refine code generation for scalability as needed.

Y4 Task 5 Quantum Inferencing: Test and evaluate machine-learning filtering in real (live) data scenarios. Continuously improve the weak classifier dictionaries, training algorithms, and strong classifier algorithms.

6.4 Phase 4 - Six Months

The goals of Phase 4 are to deliver the system for testing and demonstration; demonstrate system capabilities on multiple test cases; and integrate the system at customer sites.

6.4.1 Year 5 - Six Months

Y5 Task 1 Management: Phase 4 is a six-month period devoted to finalizing and conveying the products of the project's research to the PM. It will entail the drafting and delivery of the final report and the conveyance of all open-source code developed by this project. Attention will be paid to usability by the Warfighters.

Y5 Task 2 Mapping/Aggregation/Summarization: Perform field-testing and deployment of the FPGA-based and GPU-based accelerator sub-systems with the higher-level probabilistic inference algorithmic designs.

Y5 Task 3 Probabilistic Inferencing: Perform field-testing and deployment of the factor graph DSL and belief propagation implementations.

Y5 Task 4 Performance Monitoring and Automatic Tuning: Integrate the generated code into demonstrators at customer sites. Provide support and reporting.

Y5 Task 5 Quantum Inferencing: Perform field-testing and deployment of strong classifiers for machine-learning-based data filtering.

7 Schedule and Milestones

8 Cost Summary

Appendix A

i. Team Member Identification

Prime                 Organization   Non-US?   Clearance   FFRDC or Government?
Ke-Thia Yao           USC ISI        No        Yes         No
Jacqueline N. Chame   USC ISI        Yes       No          No
Pedro C. Diniz        USC ISI        Yes       No          No
Federico Spedalieri   USC ISI        Yes       No          No
Sergio Boixo          USC ISI        Yes       No          No
Gene Wagenbreth       USC ISI        No        Yes         No
Craig Ward            USC ISI        No        Yes         No

Subcontractor         Organization   Non-US?   Clearance   FFRDC or Government?
N/A

Consultant            Organization   Non-US?   Clearance   FFRDC or Government?
Dan Davis             USC ISI        No        Yes         No

ii. Government or FFRDC Team Member
NONE

iii. Organizational Conflict of Interest Affirmations and Disclosure
Pursuant to FAR, Subpart 9.5, the University of Southern California/Information Sciences Institute does not support any DARPA offices and therefore has no Organizational Conflict of Interest.

iv. Intellectual Property
NONE

v. Human Use
NONE

vi. Animal Use
NONE

vii. Representations Regarding Unpaid Delinquent Tax Liability or a Felony Conviction under Any Federal Law
(1) The proposer represents that it is [ ] is not [X] a corporation that has any unpaid Federal tax liability that has been assessed, for which all judicial and administrative remedies have been exhausted or have lapsed, and that is not being paid in a timely manner pursuant to an agreement with the authority responsible for collecting the tax liability.
(2) The proposer represents that it is [ ] is not [X] a corporation that was convicted of a felony criminal violation under Federal law within the preceding 24 months.

viii. Subcontractor Plan
N/A

Appendix B

Bibliography and Cited References

[1] Jason Ansel, Cy Chan, Yee Lok Wong, Marek Olszewski, Qin Zhao, Alan Edelman, and Saman Amarasinghe. PetaBricks: a language and compiler for algorithmic choice.
In Proceedings of the 2009 ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '09), pages 38–49, New York, NY, USA, 2009. ACM.
[2] Jeff Bilmes, Krste Asanovic, Chee-Whye Chin, and Jim Demmel. Optimizing matrix multiply using PHiPAC: A portable, high-performance, ANSI C coding methodology. In International Conference on Supercomputing, pages 340–347, 1997.
[3] S. Boixo, E. Knill, and R. D. Somma. Eigenpath traversal by phase randomization. Quantum Information and Computation, 9:833–855, 2009.
[4] C. Chen, J. Chame, and M. Hall. Combining models and guided empirical search to optimize for multiple levels of the memory hierarchy. In Proc. of the Intl. Symp. on Code Generation and Optimization, Mar 2005.
[5] Chun Chen. Model-Guided Empirical Optimization for Memory Hierarchy. PhD thesis, University of Southern California, May 2007.
[6] Chun Chen, Jacqueline Chame, and Mary W. Hall. CHiLL: A framework for composing high-level loop transformations. Technical Report 08-897, University of Southern California, Jun 2008.
[7] Rina Dechter. Bucket elimination: a unifying framework for probabilistic inference. In Proceedings of the Twelfth International Conference on Uncertainty in Artificial Intelligence (UAI'96), pages 211–219, San Francisco, CA, USA, 1996. Morgan Kaufmann Publishers Inc.
[8] P. Diniz, M. Hall, J. Park, B. So, and H. Ziegler. Automatic mapping of C to FPGAs with the DEFACTO compilation and synthesis systems. Elsevier Journal on Microprocessors and Microsystems, 29(2-3):51–62, April 2005.
[9] Kayvon Fatahalian, Daniel Reiter Horn, Timothy J. Knight, Larkhoon Leem, Mike Houston, Ji Young Park, Mattan Erez, Manman Ren, Alex Aiken, William J. Dally, and Pat Hanrahan. Sequoia: Programming the memory hierarchy. In Proc. of the ACM/IEEE Supercomputing Conference (SC), November 2006.
[10] A. B. Finnila, M. A. Gomez, C. Sebenik, C. Stenson, and J. D. Doll. Quantum annealing: A new method for minimizing multidimensional functions. Chemical Physics Letters, 219(5-6):343–348, March 1994.
[11] Matteo Frigo and Steven G. Johnson. The fastest Fourier transform in the West. Technical Report MIT-LCS-TR-728, MIT Lab for Computer Science, 1997.
[12] Noah Goodman, Vikash Mansinghka, Daniel Roy, Keith Bonawitz, and Daniel Tarlow. Church: a language for generative models. In Proceedings of the Twenty-Fourth Annual Conference on Uncertainty in Artificial Intelligence (UAI-08), pages 220–229, Corvallis, Oregon, 2008. AUAI Press.
[13] Robert J. Graebener, Gregory Rafuse, Robert Miller, and Ke-Thia Yao. The road to successful joint experimentation starts at the data collection trail. In Interservice/Industry Training, Simulation, and Education Conference (I/ITSEC) 2003, Orlando, Florida, December 2003.
[14] Mary W. Hall, Jacqueline Chame, Chun Chen, Jaewook Shin, Gabe Rudy, and Malik Murtaza Khan. Loop transformation recipes for code generation and auto-tuning. In Proceedings of the 22nd International Workshop on Languages and Compilers for Parallel Computing, Oct 2009.
[15] Albert Hartono, Boyana Norris, and P. Sadayappan. Annotation-based empirical performance tuning using Orio. In IEEE International Parallel and Distributed Processing Symposium (IPDPS), Rome, Italy, 2009.
[16] The MathWorks Inc. Simulink: Simulation and Model-Based Design.
[17] Hyokyeong Lee, Ke-Thia Yao, and Aiichiro Nakano. Dynamic structure learning of factor graphs and parameter estimation of a constrained nonlinear predictive model for oilfield optimization.
In Proceedings of the 2010 International Conference on Artificial Intelligence, July 2010.
[18] Hyokyeong Lee, Ke-Thia Yao, Olu Ogbonnaya Okpani, Aiichiro Nakano, and Iraj Ershaghi. Hybrid constrained nonlinear optimization to injector-producer relationships in oil fields. International Journal of Computer Science, 2010.
[19] Hyokyeong Lee, Ke-Thia Yao, Olu Ogbonnaya Okpani, Aiichiro Nakano, and Iraj Ershaghi. Identifying injector-producer relationship in waterflood using hybrid constrained nonlinear optimization. In SPE Western Regional Meeting, May 2010. SPE 132359.
[20] Yintao Liu, Ke-Thia Yao, Shuping Liu, Cauligi S. Raghavendra, Oluwafemi Balogun, and Lanre Olabinjo. Semi-supervised failure prediction for oil production wells. In Workshop on Domain Driven Data Mining at International Conference on Data Mining. IEEE, 2011.
[21] Yintao Liu, Ke-Thia Yao, Shuping Liu, Cauligi S. Raghavendra, Tracy L. Lenz, Lanre Olabinjo, Burcu Seren, Sanaz Seddighrad, and Chinnapparaja Gunaskaran Dinesh Babu. Failure prediction for artificial lift systems. In SPE Western Regional Meeting, May 2010. SPE 133545.
[22] Andrew McCallum, Karl Schultz, and Sameer Singh. FACTORIE: Probabilistic programming via imperatively defined factor graphs. In Neural Information Processing Systems (NIPS), 2009.
[23] H. Neven, V. S. Denchev, M. Drew-Brook, J. Zhang, W. G. Macready, and G. Rose. NIPS 2009 demonstration: Binary classification using hardware implementation of quantum annealing. 2009.
[24] G. Palubeckis. Multistart tabu search strategies for the unconstrained binary quadratic optimization problem. Annals of Operations Research, 131(1):259–282, 2004.
[25] Jooseok Park and Pedro C. Diniz. Using FPGAs for data reorganization and pre-fetching of pointer-based data structures for scientific computations. IEEE Trans. on Design and Test in Computers (IEEE D&T), Special Issue on Design Methods and Tools for FPGA-Based Acceleration of Scientific Computing, July/Aug. 2011.
[26] Louis-Noël Pouchet, Cédric Bastoul, Albert Cohen, and Nicolas Vasilache. Iterative optimization in the polyhedral model: Part II, multidimensional time. In ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI'08), Tucson, AZ, 2008.
[27] Markus Püschel, José M. F. Moura, Jeremy Johnson, David Padua, Manuela Veloso, Bryan Singer, Jianxin Xiong, Franz Franchetti, Aca Gačić, Yevgen Voronenko, Kang Chen, Robert W. Johnson, and Nicholas Rizzolo. SPIRAL: Code generation for DSP transforms. Proceedings of the IEEE, 93(2):232–275, 2005.
[28] Manman Ren, Ji Young Park, Mike Houston, Alex Aiken, and William J. Dally. A tuning framework for software-managed memory hierarchies. In International Conference on Parallel Architectures and Compilation Techniques, pages 280–291, October 2008.
[29] Gabe Rudy, Chun Chen, Mary W. Hall, Malik Murtaza Khan, and Jacqueline Chame. A programming language interface to describe transformations and code generation. In Proceedings of the 23rd International Workshop on Languages and Compilers for Parallel Computing, Oct 2010.
[30] Rolando D. Somma and Sergio Boixo. Spectral gap amplification. arXiv:1110.2494, October 2011.
[31] Richard Vuduc, James W. Demmel, and Katherine A. Yelick. OSKI: A library of automatically tuned sparse matrix kernels. Journal of Physics: Conference Series, 16(1):521–530, 2005.
[32] Gene Wagenbreth, Ke-Thia Yao, Dan M. Davis, Robert F. Lucas, and Thomas D. Gottschalk. Enabling 1,000,000-entity simulation on distributed Linux clusters. In M. E. Kuhl, N. M. Steiger, F. B.
Armstrong, and J. A. Joines, editors, Proceedings of the 2005 Winter Simulation Conference, 2005.
[33] R. Clint Whaley and David B. Whalley. Tuning high performance kernels through empirical compilation. In ICPP, June 2005.
[34] Ke-Thia Yao and Gene Wagenbreth. Simulation data grid: Joint experimentation data management and analysis. In Interservice/Industry Training, Simulation, and Education Conference (I/ITSEC) 2005, Orlando, Florida, 2005.
[35] Ke-Thia Yao, Gene Wagenbreth, and Craig Ward. Agile data logging and analysis for large-scale simulations. In Interservice/Industry Training, Simulation, and Education Conference (I/ITSEC) 2006, Orlando, Florida, 2006.

Team Members Biographies and CVs
http://www.hpc-educ.org/XDATA/ALADAN-Bios.html