ALADAN: Analyze Large Data Now
PROPOSAL: VOLUME 1
DARPA-BAA-12-38
Technical Area 1 (TA1) : Scalable analytics and data processing technology
“ALADAN: Analyze Large Data Now”
By
University of Southern California
Information Sciences Institute
Type of Business: Other Educational
ISI proposal no.: 3064-0
Technical POC:
Ke-Thia Yao
USC ISI
4676 Admiralty Way, Ste 1001
Marina del Rey, CA 90292
Ph: 310-448-8297
[email protected]
Administrative POC:
Andrew Probasco
USC Dept. of Contracts & Grants-Marina Office
4676 Admiralty Way, Ste 1001
Marina del Rey, CA 90292
Ph: 310-448-8412
[email protected]
Award Instrument Requested: Cooperative Agreement
Place and Period of Performance:
USC/ISI: 10/01/2012 - 03/31/2017
Proposal Validity: 120 days
DUNS: 0792333393
CAGE: 1B729
TIN: 95-1642394
Contents
Cover Page
Table of Contents
1 Executive Summary
2 Goals and Impact
3 Technical Plan
  3.1 Introduction
  3.2 Driving Example
    3.2.1 The Military Problem
    3.2.2 Broader Applicability
  3.3 Technical Approach Overview
  3.4 Detailed Technical Approach
    3.4.1 Scalable Mapping and Summarization of Multiple Data Streams
    3.4.2 Probabilistic Inferencing using Factor Graphs
    3.4.3 Performance Diagnostics and Automatic Tuning
    3.4.4 Quantum Inferencing
  3.5 Deliverables and Milestones
  3.6 Technical Risk and Mitigation
4 Management Plan
5 Capabilities
  5.1 Facilities
  5.2 Organizational Experience
6 Statement of Work
  6.1 Phase 1 - Two Years
    6.1.1 Year 1
    6.1.2 Year 2
  6.2 Phase 2 - One Year
    6.2.1 Year 3
  6.3 Phase 3 - One Year
    6.3.1 Year 4
  6.4 Phase 4 - Six Months
    6.4.1 Year 5 - Six Months
7 Schedule and Milestones
8 Cost Summary
Appendix A
Appendix B
1 Executive Summary
The rapid proliferation of sensors and the exponential growth of the data they are returning,
together with publicly available information, have led to a situation in which the Armed Forces of the
United States have more information available to them than they can possibly digest, understand,
and exploit. This situation is further complicated by the geographical distribution of the data,
the many formats in which it is stored, noise, inconsistencies, and falsehoods. The University
of Southern California’s Information Sciences Institute proposes to address this challenge in two
ways. We propose research into scalable probabilistic inferencing over distributed high-volume,
semi-structured, noisy data to aid understanding of trends/patterns and relationships. In addition,
we will create domain-specific languages and automatic tuning mechanisms for them to enable
these new analytical methods to exploit emerging heterogeneous computing systems to keep up
with the unending stream of input data.
We are already familiar with many of the big data challenges faced by the Defense community.
Over the course of the last decade, we have participated in numerous JFCOM experiments ranging
from Urban Resolve, that investigated the viability of next-generation sensors for urban operations,
to Joint Integrated Persistence Surveillance, that studied the effectiveness of planning and tasking
of sensors. We contributed scalable data logging and analysis tools, enabling analysts not only to
log all simulated data traffic, but also to issue multi-dimensional queries in real time, something
which had previously required weeks of post-processing. We also created tools for aggregating the
results of queries to multiple, distributed databases, and creating data cubes for analysts.
We already know how to issue queries to large, distributed databases, and aggregate the results
into data cubes. We propose herein to extend this technology in multiple dimensions. We will
create a factor graph layer over the multi-dimensional queries to provide probabilistic inferencing with propagation of uncertainty over noisy data. In addition, we will extend our logging and
abstraction mechanism to work with high bandwidth, streaming data from sensors. Our methods
will allow for heterogeneous data sources, on-the-fly transformation of this raw data into aggregated and summarized forms suitable for DOD Measures of Effectiveness (MOE) and Performance
(MOP) analysis, and uncertainty reasoning. Probabilistic MOE/MOP graphs will support trade-off
and what-if analysis. We will leverage ISI's unique adiabatic quantum computer to adjust classification algorithms on the fly, adapting to concept drift caused by changing conditions. In addition,
we will experiment with quantum-based classification algorithms to reduce the dimensional complexity of factor graphs.
We shall develop principled implementations of our new methods based on a layered architecture
with well-defined interfaces using open source, domain specific languages (DSLs). Our DSLs will
enable interoperability with other tools and libraries, portability across platforms, and automated
optimization for specific platforms. With the recent plateauing of CPU clock rate and emergence
of heterogeneous systems, programming time has become a key barrier to developing new analytic
techniques. The next generation of computers requires developers to master at least three different
programming models. The enthusiasm for cloud computing is partly due to the difficulty of writing
big data applications, yet MapReduce is just one inefficient point in the programming design space.
Using DSLs, we will develop efficient implementations for emerging heterogeneous computers,
and use automatic performance tuning to alleviate the additional burden of performance portability.
We shall demonstrate the broad applicability of this approach with FPGA and GPU accelerators.
2 Goals and Impact
Military history is replete with examples of missed opportunities and unrecognized threats that
led to failed missions and doomed nations. These catastrophes were often due to a lack of intelligence. Today, the opposite is commonly true; the analyst and the Warfighters are deluged with too
much information or befuddled by a lack of context. The goal of this project is to develop scalable analytical methods for probabilistic combination of evidence over distributed high-volume
semi-structured noisy data to support DoD intelligence gathering, mission planning, situational
awareness, and other related problems. If successful, ALADAN will increase the number of data
source types and data volume available for real-time data collection and summarization, and provide confidence interval-based evaluation instead of single point-based evaluation. The innovative
aspects of the ALADAN project are fourfold, as described below.
We will create a big data oriented domain specific language (DSL) for real-time filtering and
summarization of high volume data sources: This Data DSL goes beyond current data logging
and analysis systems used by the DOD by providing an abstraction mechanism that allows data to be
collected from heterogeneous data sources, including sensors, and by providing fast, adaptable
implementations on Field Programmable Gate Arrays (FPGA) that do not interfere with normal
computing and communication operations.
We will also create a second, factor graph DSL for enabling probabilistic reasoning with uncertainty about militarily relevant measures of effectiveness and measures of performance. This Factor Graph DSL goes beyond state-of-the-art probabilistic reasoning systems, such as Factorie [22] and
BLOG [12], by connecting the factor graph model with live data streams from the Data DSL, and
by providing efficient implementations of the belief propagation algorithm over high-performance
architectures to perform probabilistic inferencing.
To address concerns about porting to future, heterogeneous computing platforms, we will automate performance analysis and software tuning for the Data DSL and Factor Graph DSL. Our
approach goes beyond the ad hoc system selection and manual tuning by performing systematic
exploration of programming parameters to optimally adapt implementations to specific architectures. We will leverage state-of-the-art software development and compiler tools pioneered by the
DOE SciDAC PERI and SUPER projects, which are led by USC ISI.
Finally, we will explore the use of adiabatic quantum computing (AQC) for providing strong
machine learning classification models and strong factor graph inferencing solutions. Our approach
goes beyond heuristic approximation approaches by providing high confidence solutions for small
scale problems using quantum annealing approaches. We are uniquely positioned to exploit the
AQC system at the USC-Lockheed Martin Quantum Computing Center, which will double in the
number of qubits every two years for the rest of this decade (i.e., an AQC Moore's Law).
Our deliverables include the Data DSL, the Factor Graph DSL, efficient implementations of these DSLs on multiple heterogeneous architectures, and strong quantum-based classifiers and algorithms. ISI is well known for its practical contributions, e.g., the Domain Name System, contributions to the IP protocol suite, and serving as one of the original nodes on the ARPAnet. It has spun off several organizations, e.g., ICANN and Globus, and ISI has close relations to many defense contractors. Once this research is proven useful, ISI plans to seek out colleagues at any of a number of contractors to
commercialize this product. That is seen as the most certain way to provide the technology to the
Warfighter as soon and as reliably as possible.
3 Technical Plan
This section presents the major technical challenges addressed by this proposal and our approach
to solve them. We begin by motivating the need for this research and discussing key technical
challenges. We then present our approach, along with measures of effectiveness and success.
Lastly, we discuss potential risks and our strategy to mitigate them.
3.1 Introduction
The adequate defense of the Nation critically depends on providing its Armed Forces decision makers with the best information regarding potential threats so that timely and decisive actions can be taken with appropriate impact. This is an extremely challenging task
as data must be converted into coherent information in a timely fashion. Data emanates from a
plethora of sensors, is resident in untold numbers of digital archives with disparate formats and
contents, and is continually updated at very high data rates. This big data issue is exacerbated by
the unreliability of sources requiring sophisticated and thus onerous methods for accurate and reliable knowledge extraction. Sophisticated analyses, scalable computational systems and the most
powerful hardware platforms are all required to avoid risks such as mistaken identity or undetected
association, with catastrophic unintended results that may flow therefrom.
While today’s war fighters have performed miraculously well with present capabilities, the
sources and diversity of future threats increase the strain on already stressed and underperforming
analytic systems. Two main technical factors are tragically lacking in today’s supporting big data
analytics infrastructure. First, the systems have not been designed to cope with the large scale,
heterogeneity, and geographic distribution of data due to limited bandwidth and the limited ability
to intelligently filter noise from relevant data. Increasing the sophistication of the filtering while
retaining the key information would ameliorate this technical issue, improving the scalability of
the solutions at hand. Second, current data aggregation and summarization approaches are either inadequate, impose infeasible constraints (such as the need to store all the sensor data), or are simply
not powerful enough to extract the needed information from the massive volumes of data.
The ISI Team has a strong track record in the big data field, based on decades-long participation
in advanced military research and more recently with the DoD’s Joint Experimentation Directorate
in Suffolk, Virginia [13, 32, 34, 35]. We intend to attack all of the above-mentioned challenges with
a balanced approach starting with proven techniques and augmenting them with higher risk, and
thus high-payoff, approaches, as follows. To cope with the very high data rates seen at the sensor,
or front-end, and to address the inability to retain this data in persistent storage, we will seek flexible and programmable filtering and classification solutions based on reconfigurable technology
(e.g. using FPGA-based hardware systems). In the middle-end we will leverage AQC techniques
for high-throughput classification and metric optimization, in effect pruning the space of possible inferences to be processed by subsequent computational resources. Lastly, at the back-end, or
close to the human analyst, we will develop scalable algorithms for processing irregular and distributed data, creating effective human-computer interaction tools to facilitate rapidly customizable
reasoning for diverse missions. Given the heterogeneity of existing and emerging platforms (e.g.,
multi-core, GPUs, FPGAs) and the corresponding complexity in programming models (e.g., MPI,
OpenMP, CUDA), we will develop DSLs geared towards the definition of probabilistic classifier
metrics and target architectures thus facilitating the deployment and portability of the developed
solutions.
3.2 Driving Example
3.2.1 The Military Problem. The Joint Force Commander requires adequate capability to integrate and focus national and tactical collection assets to achieve the persistent surveillance of a
designated geographic area for a specific mission. Our research is motivated by our participation in the Joint Forces Command's Joint Integrated Persistence Surveillance (JIPS) experiments.
Persistent surveillance is the capability to continually track targets without losing contact. The persistent surveillance infrastructure and protocols should be robust enough to track multiple targets,
to support dynamic tasking of new targets, and to perform regular standing surveillance tasks. The
JIPS experiments were designed to use simulation to evaluate proposed capabilities and to make
recommendations for improvements.
One of the purposes of the experiments was to evaluate the performance of a simulated large,
heterogeneous sensor network for performing numerous intelligence functions. At a high level,
analysts specify many surveillance tasks. Each task specifies one or more targets and the type
of surveillance needed. For instance, a task could be to take a single picture of a geographical
location sometime in a 24 hour period. Or, it could be to provide continuous coverage of a moving
target for a specified period. Many types of sensors including optical, radar, infrared and seismic
are included. Sensors operate on platforms. A platform may be a UAV, a satellite, or some other
vehicle.
Sensors and platforms all have constraints. For instance, sensors have limited range, limited resolution, and error rates. Visible-spectrum optical sensors cannot be used at night. Weather
affects performance of different sensors by a varying amount. Platforms also have limitations such
as range and duration. Maintenance is required. Both sensors and platforms have failure rates.
Given a list of requests and an inventory of available sensors and platforms, a scheduler assigns
specific tasks to each sensor and platform, such as take off at 0900 hours, fly to a target, turn on
radar, return at 1300, refuel, take off at 1330 and fly to a second target.
JIPS experiments involved millions of simulated entities running on multiple computing clusters,
each with hundreds of processor nodes. There were thousands of sensors and millions of detection
events. Each of these experiments generated many terabytes of data, per day. The data is both
high-volume and high-rate. Simply collecting the data without disrupting the experiments was
a challenging issue. The raw data collected from these experiments was too low level to be of direct
use by the analysts. For example, the raw data includes the location of entities at specific times,
contact reports from sensor target pairs, weapon fire events, and damage reports. Data composed of
individual messages, events, and reports had to be aggregated and abstracted. These abstractions
then served as input to evaluation metrics designed by analysts. These evaluation metrics were
divided into two levels: measure of effectiveness (MOE) and measure of performance (MOP). The
MOP’s are lower level and often more quantitative. The MOEs take as input the MOPs to evaluate
if overall mission tasks are satisfied.
The mission of the JIPS experiment was to develop more efficient and effective use of sensors
to optimize surveillance in restricted and denied areas. The measure of effectiveness (MOE) was
to maximize the use of the collection capacity of all assets given a prioritized list of intelligence
tasks to perform. Measures of performance (MOPs) included the number of collection hours that
went unused, collection gaps, and the impact of ad hoc collection requests on the pre-planned requests.
MOPs also included the number of targets detected, the number lost, and the number later reacquired.
Characteristics of the JIPS data included large volumes collected at geographically distributed sites,
heterogeneity, as there were many sensor types, as well as noise and other errors. ISI developed
the Jlogger tool to capture the data on-the-fly and store it as relations for further analysis. We also
created the Scalable Data Grid (SDG) to issue distributed analytic queries, aggregate results from
multiple sources, and perform reasoning involving trade-offs of uncertainties and options based on
uncertain evidence and inferences.
3.2.2 Broader Applicability. The JIPS experiments share features and requirements with a
range of problems confronting both industry and government. A large volume of data is generated
at multiple sites requiring compute intensive probabilistic analysis to generate conclusions and
predictions about the current and future state of the system. ISI has also worked with Chevron on
other such problems in the CiSoft project for oil field management, using analysis of data from
multiple sensors on oil rigs to statistically predict failures [20, 21] and using factor graph structure learning to understand complex oil reservoir geometries [17, 18, 19]. The current approach
for problems of this class is to develop custom methods for each specific concern. This approach to
software development is inefficient, and further wastes resources by not obtaining the best results
possible with the existing data. Our proposed ALADAN approach is to use a general method, customized only where necessary. The customization process is facilitated with two simple Domain
Specific Languages, one to describe the raw data, and one to describe the high level analysis desired. The problems related to distributed data sources and noisy data analysis are solved once with
general, widely applicable software algorithms. The process is made efficient with fast commodity
hardware such as GPUs and FPGAs, parallel software optimizations, and probabilistic statistical
algorithms.
3.3 Technical Approach Overview
We address the challenges outlined in the driving example described above by a multi-pronged
technical approach combining proven algorithmic, programming, and system-level techniques with higher-risk but high-payoff quantum computing techniques. We
use sound software engineering methodology by defining a layered architecture with explicit domain specific languages at each layer to facilitate interoperability with other tools and software.
Figure 1 illustrates the overall structure and data processing flow of the ALADAN research project.
The bottom layer of our architecture is defined by the Data DSL, which is designed to map
data from heterogeneous sources to a common format to facilitate further processing. In addition,
this DSL will be responsible for aggregating and summarizing the high-volume and high-speed
data emanating from these heterogeneous sources. The aggregation and summarization step at
the data source will make use of the data locality principle, dramatically reducing the bandwidth
requirements to other remote and local components within the system. Using Field-Programmable
Gate Arrays (FPGAs) will provide flexible and fast implementations that can be deployed in a
physically distributed fashion.
The next layer up is defined by the Factor Graph DSL, which is designed to model the relationships among the aggregated and summarized data. Specifically, the DSL will represent the MOPs and MOEs that are of interest to military analysts. Our probabilistic factor graphs will encode local
relationships amongst variables. We will provide efficient implementations of the belief propagation message-passing approximation used to perform probabilistic inferencing over the factor graph on high-performance architectures and systems, including distributed multi-core systems
and systems with GPUs.
[Figure 1 is a layered diagram of the ALADAN data structure and processing flow. From bottom to top: high-speed/high-volume data sources (simulations and sensors); a map/summarize layer driven by the Data DSL on heterogeneous and custom architectures (FPGA+GPU), scalable to high data volumes and distributed heterogeneous data sources and incorporating strong classifiers generated by quantum annealing; a factor graph layer connecting Measures of Effectiveness and Measures of Performance with simulations and sensors, with a factor graph DSL for defining probabilistic relationships and interfacing with distributed data sources; and a probabilistic inference layer providing reasoning with uncertainty over missing, noisy, and incomplete data, using high-performance distributed parallel computation (MPI+OpenMP+CUDA), parallelization and performance tuning of probabilistic inferencing algorithms, and novel quantum data inferencing algorithms.]
Figure 1: Layered diagram of data structure and data processing.
The key advantages of defining DSLs are that they enhance programmer productivity by providing higher-level abstractions, and that they increase code portability by hiding system specific details. To provide efficient implementations we will perform automatic performance tuning
(autotuning) and custom code generation. Autotuning technology automates the performance
tuning process by empirically searching a space of optimized code variants of a computation to
determine the one that best satisfies an evaluation criterion (such as performance or energy cost).
The final element of the ALADAN technical approach is to exploit AQC to generate strong machine learning classifiers for data mapping/summarization and to compare their quality to heuristic
belief propagation algorithms. ISI hosts the only commercially delivered AQC machine, the D-Wave One. Quantum annealing, as implemented by the D-Wave One, probabilistically finds ground states of programmable spin-glass Ising models using a restricted form of open-system AQC, and is
expected to scale better than classical alternatives [10, 30]. Probabilistic inference on some types
of factor graphs can be reduced to the problem of finding the ground state of an Ising spin glass.
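To make this reduction concrete, the following Python sketch (our own illustration, not D-Wave's programming interface) converts a QUBO cost x^T Q x over binary variables x in {0,1} into an equivalent Ising energy over spins s in {-1,+1} via the substitution x = (1+s)/2, the form accepted by a quantum annealer:

import numpy as np

def qubo_to_ising(Q):
    """Return (h, J, offset) such that, for s = 2x - 1 with x binary,
    x^T Q x == sum_i h[i]*s[i] + sum_{i<j} J[i,j]*s[i]*s[j] + offset."""
    Q = np.asarray(Q, dtype=float)
    n = Q.shape[0]
    h = np.zeros(n)
    J = np.zeros((n, n))
    offset = 0.0
    for i in range(n):
        # diagonal term: Q[i,i] * x_i = Q[i,i] * (1 + s_i) / 2
        h[i] += Q[i, i] / 2.0
        offset += Q[i, i] / 2.0
        for j in range(i + 1, n):
            q = Q[i, j] + Q[j, i]          # combine symmetric off-diagonal entries
            # q * x_i * x_j = q/4 * (1 + s_i + s_j + s_i*s_j)
            J[i, j] += q / 4.0
            h[i] += q / 4.0
            h[j] += q / 4.0
            offset += q / 4.0
    return h, J, offset

# Quick self-check on a random instance.
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 4))
x = rng.integers(0, 2, size=4)
s = 2 * x - 1
h, J, off = qubo_to_ising(Q)
assert np.isclose(x @ Q @ x, h @ s + s @ J @ s + off)

Mappings of this kind are what allow classically formulated binary optimization and inference subproblems to be posed to the annealing hardware described above.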
3.4 Detailed Technical Approach
3.4.1 Scalable Mapping and Summarization of Multiple Data Streams. When dealing with big data, a major concern is the inability to store the data for further processing, making early filtering
and classification of paramount importance. While filtering vast amounts of data at line speeds
will be a key performance requirement, a further key challenge is the early classification of data inferences and relations while retaining references so that the selected few data items can later be revisited and reclassified.
We address this problem by developing a domain-specific language for data filtering and early
classification where developers can augment a data-flow in an incremental fashion with specific
matching patterns and classifiers. The overall approach rests on the notion of a sequence of data and correlation filters that communicate not only scalar data or individual streamlets but also
localized graphs or data cubes. These filters are then composed, as in a coarse-grain data-flow
arrangement with possible feedback loops, to generate summarized views of the input data ready
for processing by subsequent system-level components.
The high-level abstraction offered by the DSL we propose in this research is akin to the execution model offered by popular programming environments such as Simulink [16], thus lowering the barrier to adoption of this DSL. The added value we consider is the inclusion of predefined constructs for defining fields of interest in each data item and specific correlation or goodness metrics that are used to define the selected data pairs.
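To make this concrete, the following Python sketch illustrates how such a filter-and-summarize pipeline might look when expressed as an embedded DSL. The record fields, class names, and method names below are our own illustrative assumptions, not the syntax the proposed Data DSL will actually adopt.

from dataclasses import dataclass, field
from collections import defaultdict

@dataclass
class Record:                      # hypothetical record layout
    tag: str
    t: float                       # timestamp in seconds
    lat: float
    lon: float
    value: float

@dataclass
class Pipeline:
    filters: list = field(default_factory=list)

    def where(self, predicate):
        """Add a filter stage; stages compose like a coarse-grain data flow."""
        self.filters.append(predicate)
        return self

    def summarize(self, stream, key, bucket_seconds):
        """Count and average records passing all filters, grouped by
        (key(record), time bucket) -- a tiny two-dimensional data cube."""
        cube = defaultdict(lambda: [0, 0.0])       # (key, bucket) -> [count, sum]
        for rec in stream:
            if all(f(rec) for f in self.filters):
                cell = cube[(key(rec), int(rec.t // bucket_seconds))]
                cell[0] += 1
                cell[1] += rec.value
        return {k: (n, s / n) for k, (n, s) in cube.items()}

# Example: keep only 'contact' records inside a lat/lon box, then summarize
# them per tag over 60-second windows.
pipeline = (Pipeline()
            .where(lambda r: r.tag == "contact")
            .where(lambda r: 33.9 <= r.lat <= 34.1 and -118.5 <= r.lon <= -118.3))
stream = [Record("contact", 12.0, 34.0, -118.4, 0.8),
          Record("contact", 70.0, 34.0, -118.4, 0.6),
          Record("status", 15.0, 34.0, -118.4, 1.0)]
print(pipeline.summarize(stream, key=lambda r: r.tag, bucket_seconds=60))

In the actual system, the same declarative pipeline description would be translated and tuned for FPGA or GPU targets rather than interpreted in software, as discussed next.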
Once the developer specifies the data flow, a translation and tuning engine can map it to a specific target architecture. In this context, we foresee several opportunities for transformation and optimization in addition to the auto-tuning approach described in Section 3.4.3, namely:
• Define data-specific filters on time-stamps, sensor type, location, value, and sequences of values. For instance, one filter may be interested in only a specific data tag for data records that have GPS location values within a given range and in a specific time window. Another filter may want to explore similar time windows but with different tags or locations;
• Merge data filters to promote reuse of scanned data while the data is inside the device, i.e., in transit. For instance, if two or more filters use the same sources (or share data sources), it is beneficial to co-locate them in space so that the data each needs to examine has as short a lifetime as possible;
• Explore natural streaming concurrency by mapping independent data-flow branches to distinct components of the system, in space and/or in time;
• Reorganize filtered data into graph-ready data structures, possibly with some very limited (and controlled) replication, for improved locality of subsequent computation.
As is clear, each of these transformations and optimizations elicits trade-offs that are highly target-architecture-specific. For example, when using an FPGA, a sophisticated filter can be implemented
in hardware that effectively selects key data records or values that are spatially or temporally related.
Internal FPGA wires can be configured for exact and approximate matches and internal data can be
re-routed for maximal internal bandwidth. Previous work in the context of data reorganization and
pattern-matching has highlighted the performance and data volume reduction of this approach [25].
Similarly, when targeting GPU cores, a given filter can be materialized as code that manipulates a cube of data already populated with significant data as the output of a previous filter, thus
effectively exploiting the large internal bandwidth of GPUs and thread-level concurrency.
By raising the level of abstraction in describing the flow of data, we allow a mapping (compiler) tool to generate code, either in the form of traditional serial code or in hardware-oriented languages such as Verilog/VHDL, that exploits the specific features of each target device. This is accomplished without burdening the developers and without the subsequent code maintainability
and programmer portability costs.
ISI implemented and operated data logging facilities for the multiyear JFCOM JESPP series of
large distributed military simulations. The simulations ran on multiple, geographically distributed
(Hawaii, Ohio, Virginia) Linux clusters, as well as scores of workstations. Multiple terabytes of
data were logged. ISI implemented the Scalable Data Grid (SDG) to summarize the logged data at
each site and supported web based realtime interactive queries. Multiple cores on the simulation
processors were used to provide scalability to support the logging function. The proposed system is
a technological update of that system using newer hardware, FPGAs and GPUs, and more capable
algorithms such as factor graphs.
3.4.2 Probabilistic Inferencing using Factor Graphs. Military decision making often deals
with complicated evaluation functions of many variables. Systematic exploration of the exponential variable space using brute force techniques is not feasible. However, these evaluation functions
often have exploitable structure that allows them to be decomposed into a product of local functions, each of which depends on only a few variables. The military approach of breaking down
evaluation functions into Measures of Effectiveness (MOEs) and then into Measures of Performance
(MOPs) is a strong indication that natural exploitable structures do exist.
Factorization of global functions into local functions can be visualized as bipartite graphs, called factor graphs, that have a variable node for each variable xi and a factor node for each local function fj. An edge connects a factor node to a variable node if and only if the corresponding variable xi is an argument of the corresponding function fj.
Here is an example of a factor function f1(xratio, xoverlap) that takes two variables: the ratio of available sensors to targets, xratio, and the frequency of overlapping sensors on the same persistent surveillance target, xoverlap. This factor function can be part of a function that evaluates the effectiveness of sensor planning and tasking. Tasking multiple sensors on the same target provides redundancy. This is desirable when the ratio of available sensors to targets is high. But when available sensors are scarce, everything else being equal, high overlap on some targets may lead to lost coverage of other persistent surveillance targets or standing surveillance targets.
                        Overlap
                  High   Medium   Low
  Ratio  High      5       5       1
  Ratio  Medium    3       3       4
  Ratio  Low       1       3       5
(a) Function values for factor function f1.   (b) Factor graph snippet (graphic not reproduced).
Figure 2: Factor function f1 expressed as a table and as a part of a factor graph. The function values in the table can be provided by the analysts based on prior experience, or they can be automatically gathered from multiple constructive simulation runs.
Table 2a shows the function values for the f1 factor function. The factor function f1 has two arguments, so the corresponding data representation is a table. If the factor function has more than two arguments, then the data representation becomes a cube. In existing approaches these values
are typically provided by analysts based on their prior experience. Our factor graph DSL allows
such prior knowledge, but in addition it interfaces with the Data DSL to automatically gather such
statistics from simulation runs and from sensor feeds.
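As an illustration of this interface (a sketch under our own assumptions, not the final DSL syntax), the following Python fragment shows how factor-table entries like those in Figure 2 could be estimated by counting co-occurrences of discretized (ratio, overlap) levels streamed from simulation runs or sensor feeds:

import numpy as np

LEVELS = ["high", "medium", "low"]          # discretized levels, as in Figure 2

def build_factor_table(observations, smoothing=1.0):
    """observations: iterable of (ratio_level, overlap_level) pairs.
    Returns a 3x3 table of (smoothed) co-occurrence counts that can be
    normalized or rescaled into factor values for f1."""
    index = {name: i for i, name in enumerate(LEVELS)}
    table = np.full((len(LEVELS), len(LEVELS)), smoothing)
    for ratio, overlap in observations:
        table[index[ratio], index[overlap]] += 1
    return table

# Toy stream of discretized observations from simulation runs or sensor feeds.
obs = [("high", "high"), ("high", "medium"), ("medium", "low"), ("low", "low")]
print(build_factor_table(obs))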
For complicated situations, the number of variables may range into the hundreds or thousands.
Inferencing with very high dimensional cubes is not practical, because the time complexity grows
exponentially with the number of dimensions. In machine learning this problem is known as
the curse of dimensionality. The factor graph approach exploits local structures in the problem space to decompose the high-dimensional cube into many interacting cubes, i.e., factors fi: f(x1, . . . , xn) = f1(X1) f2(X2) · · · fm(Xm), where each Xj is a subset of {x1, . . . , xn}.
If the corresponding factor graph representation of the above decomposition is a tree, then probabilistic inferencing can be performed in linear time. As an example, the Viterbi decoding problem
can be encoded as a trellis tree. This allows the Viterbi algorithm to decode in linear time using the backward and forward inference approach. However, if the resulting factor graph contains cycles, no polynomial-time exact algorithms are known. Belief propagation provides an approximate solution. Although it is known to work better than other algorithms on difficult problems, such as the Boolean satisfiability problem near the phase transition boundary, it may still require high time complexity with large numbers of message exchanges.
Our approach leverages existing and emerging high-performance platforms to provide efficient implementations. Our factor graph DSL represents the factor graph and the message-passing algorithm, including message scheduling and graph transformations. The specificity and formal representation provided by the DSL allow for optimization and tuning for target platforms.
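To fix ideas, the following Python sketch implements a minimal sum-product belief propagation over a discrete factor graph, using the f1 table of Figure 2 as a toy example. The class structure, flooding message schedule, and normalization choices are our own illustrative assumptions; they are not the Factor Graph DSL or the optimized implementations the project will deliver.

import numpy as np

class FactorGraph:
    """Minimal discrete factor graph with sum-product belief propagation
    under a flooding schedule (exact on trees, approximate with cycles)."""

    def __init__(self, card):
        self.card = card                      # variable name -> number of states
        self.factors = []                     # list of (scope, table)

    def add_factor(self, scope, table):
        self.factors.append((scope, np.asarray(table, dtype=float)))

    def marginals(self, iters=10):
        f2v = {(i, v): np.ones(self.card[v])
               for i, (scope, _) in enumerate(self.factors) for v in scope}
        v2f = {(v, i): np.ones(self.card[v]) for (i, v) in f2v}
        for _ in range(iters):
            # variable -> factor messages: product of the other factors' messages
            for (v, i) in v2f:
                msg = np.ones(self.card[v])
                for (j, u), m in f2v.items():
                    if u == v and j != i:
                        msg = msg * m
                v2f[(v, i)] = msg / msg.sum()
            # factor -> variable messages: multiply in messages, sum out the rest
            for (i, v) in f2v:
                scope, table = self.factors[i]
                t = table
                for axis, u in enumerate(scope):
                    if u != v:
                        shape = [1] * t.ndim
                        shape[axis] = self.card[u]
                        t = t * v2f[(u, i)].reshape(shape)
                other_axes = tuple(a for a, u in enumerate(scope) if u != v)
                msg = t.sum(axis=other_axes)
                f2v[(i, v)] = msg / msg.sum()
        beliefs = {}
        for v in self.card:
            b = np.ones(self.card[v])
            for (i, u), m in f2v.items():
                if u == v:
                    b = b * m
            beliefs[v] = b / b.sum()
        return beliefs

# The f1 factor from Figure 2 (rows: ratio levels, columns: overlap levels).
g = FactorGraph({"ratio": 3, "overlap": 3})
g.add_factor(["ratio", "overlap"], [[5, 5, 1], [3, 3, 4], [1, 3, 5]])
print(g.marginals())          # normalized beliefs over ratio and overlap levels

The DSL and its optimized back ends would generate equivalent message-passing kernels tuned for multi-core, GPU, and distributed targets rather than interpreting a structure like this one.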
3.4.3 Performance Diagnostics and Automatic Tuning. The filters and data analysis specified
in the proposed DSL must have high-performance implementations on a variety of target architectures or heterogeneous system components. Manual code optimization is a time-consuming
and error-prone process that must be repeated for each new architecture. Automatic performance
tuning, or autotuning, addresses these problems by automatically deriving a space of code implementations (or variants) of a computation, and empirically searching for the best variant on a
target architecture. The automatic generation of optimized code variants is based on architecture
models and known architectural parameters, as well as on static analysis or user knowledge of the
application.
Autotuning technology can be grouped into three main categories. In library autotuning, self-tuning library generators automatically derive and empirically search optimized implementations.
Examples are the libraries ATLAS, PhiPAC and OSKI [2, 31, 33] for linear algebra, and SPIRAL
and FFTW [11, 27] for signal processing. In compiler autotuning a compiler or tool automatically
generates a space of alternative code variants to be empirically searched. Examples are CHiLL,
CUDA-CHiLL, PLUTO and Orio [5, 15, 26, 29]. In programmer-directed autotuning users guide
the tuning process using language extensions, such as in PetaBricks and Sequoia [1, 9, 28].
The ISI team will use compiler-based autotuning for domain-specific code generation to achieve
high-performance and performance portability on heterogeneous systems. Domain-specific code
generation is particularly well-suited for autotuning because domain-specific knowledge can be
incorporated in the decision algorithms that derive optimized code variants. Such knowledge can
also be used to guide the empirical search for the best variant. To maintain the generality of
compiler-based autotuning, our system will support the integration of domain-specific knowledge
using a higher-level user interface to a compiler-based autotuning system.
The ISI team has been conducting research in autotuning for over eight years, and some of its
members are co-developers of an autotuning framework based on CHiLL, a code transformation
and generation tool [4, 6, 14]. We will leverage CHiLL as the core of the autotuning system for
our DSL, and use CHiLL’s high-level transformation interface to develop new domain-specific
optimizations, by composing basic code transformations already supported by CHiLL and adding
new basic transformations as needed.
CHiLL is a code transformation tool based on the polyhedral model that supports a rich set of
compiler optimizations and code transformations. CHiLL takes as input an original computation
and a transformation recipe [14] and generates parameterized code variants to be evaluated by
a empirical search. A transformation recipe consists of a sequence of code transformations to
be applied to the original computation, and can be derived by a compiler decision algorithm or
provided by an expert user. CHiLL’s transformation interface also supports the specification of
knowledge about the application code or input data set that can be used to derive highly specialized
code variants.
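As a simplified illustration of the empirical-search idea (and not of the CHiLL interface itself), the Python sketch below times a small space of parameterized variants of a blocked matrix multiply and keeps the best-performing block size; the parameter space and timing harness are our own assumptions for exposition.

import time
import numpy as np

def blocked_matmul(A, B, block):
    """One point in the 'space of code variants': a blocked matrix multiply
    parameterized by its block (tile) size."""
    n = A.shape[0]
    C = np.zeros((n, n))
    for i in range(0, n, block):
        for k in range(0, n, block):
            for j in range(0, n, block):
                C[i:i+block, j:j+block] += A[i:i+block, k:k+block] @ B[k:k+block, j:j+block]
    return C

def autotune(n=512, candidates=(32, 64, 128, 256)):
    """Empirically search the candidate block sizes and keep the fastest."""
    A = np.random.rand(n, n)
    B = np.random.rand(n, n)
    best = None
    for block in candidates:
        start = time.perf_counter()
        blocked_matmul(A, B, block)
        elapsed = time.perf_counter() - start
        print(f"block={block}: {elapsed:.3f} s")
        if best is None or elapsed < best[1]:
            best = (block, elapsed)
    return best

print("best variant:", autotune())

In the proposed system, the variants would be generated by CHiLL from transformation recipes and domain-specific knowledge, and the search would be guided rather than exhaustive.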
The software will be instrumented to measure and monitor performance of relevant functions
and generate reports of the same. The reports will provide data to indicate where additional or
different hardware or different software optimizations are required.
3.4.4 Quantum Inferencing. It is the ISI Team’s intention to conceptualize and investigate the
use of the ISI D-Wave quantum optimization engine to create strong classifiers for isolating and
characterizing associated data elements to recognize objects (or patterns) of interest in extensive
and heterogeneous data. As data bandwidth grows, it becomes impossible to analyze it exhaustively in real time, or even to store all of it for processing at a later time. Analytic methods would
be severely limited, possibly missing important correlations or developing associations in unacceptably tardy time frames. We propose to implement automatic but flexible filtering based on
machine learning techniques. Humans will then be enabled to look at the products of the analysis.
Results recently published from a collaboration between Google and D-Wave suggest that quantum
annealing can outperform classical algorithms in the design of strong classifiers [23]. The next
generation of the D-Wave quantum annealing chip will address problems which are in practice
infeasible for classical algorithms.
Two ISI researchers are already using D-Wave One to create strong classifiers as optimal linear
combinations of weak classifiers. They are then using these classifiers to filter control algorithms,
and to search for optimal donor molecules for organic photovoltaics (in collaboration with Harvard
University). These classifiers, designed to mark important data, are constructed by providing to the
algorithm developer the following resources:
• A training set, T, of S input vectors {xi} labeled (or associated) with their correct classifications {yi}, for i ∈ {1, . . . , S}.
• A dictionary of N potential weak classifiers, hj(x). These can be either classifiers that are known but not sufficiently accurate for the application, or reasonable guesses at classifiers that may suit the application. (Note that the weak classifiers h(x) are not associated with the label vector y, which has been obtained by other means.)
The goal is to construct the most accurate strong classifier possible using as few weak classifiers
as is reasonable. This strong classifier generalizes the information learned from the labeled examples. This is accomplished by optimizing the accuracy over the training set of a subset of weak
classifiers. The strong classifier then makes its decision according to a weighted majority vote over
the weak classifiers that are involved.
Selecting a subset of weak classifiers is done by associating with each weak classifier hj(x) a binary variable wj ∈ {0, 1}, which denotes whether the weak classifier is selected for the strong classifier.
The value for each variable wj is set by minimizing a cost function L(w) during the training phase.
This cost function measures the error over the set of S training examples. We also add a regularization term to prevent the classifier from becoming too complex, which would lead to over-fitting (where the classifier performs perfectly over the training set but generalizes poorly to other vectors). In
summary, the machine-learning problem of training the strong classifier is formulated as a binary
optimization problem, of the type that can be addressed with quantum annealing.
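For concreteness, the sketch below shows one common way, in the spirit of the cited Google/D-Wave work [23], to cast the selection of weak classifiers as a binary quadratic optimization; the quadratic training loss, the regularization weight, and the brute-force stand-in for the annealer are our own illustrative assumptions, not the exact formulation this project will use.

import itertools
import numpy as np

def build_qubo(H, y, lam=0.1):
    """H: S x N matrix with H[i, j] = h_j(x_i) in {-1, +1}; y: labels in {-1, +1}.
    Minimizing w^T Q w over binary w corresponds (up to an additive constant) to
    minimizing sum_i ((1/N) sum_j w_j h_j(x_i) - y_i)^2 + lam * sum_j w_j,
    using the fact that w_j^2 = w_j for binary w."""
    S, N = H.shape
    Q = (H.T @ H) / N**2                    # quadratic (coupling) terms
    linear = -2.0 * (H.T @ y) / N + lam     # per-classifier linear terms
    Q[np.diag_indices(N)] += linear         # fold linear terms into the diagonal
    return Q

def solve_brute_force(Q):
    """Stand-in for the quantum annealer: enumerate all w for small N."""
    N = Q.shape[0]
    best_w, best_e = None, np.inf
    for bits in itertools.product([0, 1], repeat=N):
        w = np.array(bits)
        e = w @ Q @ w
        if e < best_e:
            best_w, best_e = w, e
    return best_w, best_e

# Toy problem: 3 weak classifiers evaluated on 6 labeled examples.
H = np.array([[+1, +1, -1], [+1, -1, -1], [-1, -1, +1],
              [-1, +1, +1], [+1, +1, +1], [-1, -1, -1]])
y = np.array([+1, +1, -1, -1, +1, -1])
w, energy = solve_brute_force(build_qubo(H, y))
print("selected weak classifiers:", w, "energy:", energy)

The resulting selection vector w defines the strong classifier as a majority vote over the chosen weak classifiers; on the annealer, the brute-force step is replaced by quantum annealing over the same cost function.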
The binary optimization of training the strong classifier can be approximated with a Quadratic Unconstrained Binary Optimization (QUBO) problem. The work by Google and D-Wave approximated this
QUBO directly with the Ising model that is implemented natively by the quantum annealing hardware. A new approach is to use heuristic optimization algorithms, fine-tuned for the problem at
hand, similar in spirit to sequential optimization techniques and related to genetic algorithms. The
algorithm proceeds by refining several populations, approximating the best solutions of each population with an Ising model sequentially. By using this machine learning approach, we can produce
a strong classifier that, from a small set of labeled data samples, generalizes their properties in order to classify new data to which it is applied. In this manner, we can filter data for post-processing.
Today’s ISI D-Wave One installation has 128 qubits, of which 108 are calibrated and functioning
as of May 2012. The second generation machine will have 512 qubits, and ISI expects an upgrade
at the end of the calendar year. Projecting that same delivery percentage for effectively functioning
qubits, one would anticipate roughly O(400) functioning qubits. The third-generation D-Wave is planned to have 2048 qubits, again likely delivering O(1,600) functioning qubits. It is planned for the end of 2014.
Benchmarking done at ISI confirms that if we take into account only the “computational time”,
the time used for the quantum annealing evolution, D-Wave One is already around 2 orders of
magnitude faster than optimized classical algorithms (based on belief propagation [7]). Nevertheless, the total run time is dominated by engineering overheads, such as programming, thermalization and
read-out. The performance of D-Wave One is therefore similar to that of optimal classical algorithms [7, 24]. For the next upgrade to this hardware, with 512 physical superconducting qubits, the
engineering overhead will be reduced by more than two orders of magnitude. Classical algorithms
to solve spin glasses at this size have prohibitive memory (1 TB SRAM for belief propagation [7])
or time (thousands of years for TABU probabilistic search [24]) requirements, depending on the
algorithm. Scaling analyses done at ISI are compatible with the expectation that open-system quantum annealing will still perform well at this problem size. Experimental and theoretical [3] studies suggest that this method is indeed quite robust against natural noise.
We emphasize that quantum annealing can be used exclusively during the training phase of
the strong classifiers for data filtering. The resulting strong classifiers can be deployed in the
appropriate hardware, such as FPGAs and GPUs, as detailed in Sec. 3.4.1. Nevertheless, during the
lifetime of the project we will explore the possibility of continuous access to specialized quantum
annealing hardware for the future refinement of strong classifiers and the definition of new ones.
3.5 Deliverables and Milestones
We now describe the anticipated project deliverables and milestones along with their timeline
and relevant references to the tasks in the SoW. We identify each deliverable (D) and milestone
(M) as in M.2.1 denoting the first milestone of the second project task.
Mar 2013: M.2.1,M.3.1: Definition of the Data DSL and factor graph DSL, and preliminary
mapping strategies for CPU engine. M.4.1: Prototype autotuning restricted to CPU and FPGA.
M.5.1: Definition of weak and strong classifiers for data filtering using quantum annealing.
Sept 2013: D.2.2,M.2.2 Prototype implementation of domain-specific compilation tool with back
end for C programs and Verilog/VHDL suitable for synthesis in the selected FPGA-based board.
D.3.2,M.3.2: Prototype implementation of belief propagation on multi-cores and/or GPUs. D.3.2
Prototype autotuning for FPGA. M.5.2 Prototype strong classifier for data filtering using quantum
annealing. Demonstration of machine learning filtering with quantum annealing and with classical
alternatives. Report on performance of first prototype.
Mar 2014: D.2.3,D.3.3,D.4.3,D.5.3: Analyze function and performance of ALADAN system.
Extend system to run on GPUs. Initial port of quantum classifier model onto FPGA. Report on
performance evaluation of the prototype ALADAN system, including compiler generated code on
selected streaming data inputs and corresponding measures of effectiveness (MOE).
Sept 2014: D.1,2,3,4,5.4: Final Phase I report, including performance tuning on generated code
for GPUs and traditional processors using the proposed auto-tuning compilation infrastructure for
the generated C codes using the prototype domain-specific compiler. Delivery of initial version of
ALADAN system.
Mar 2015 M.2.5,M.3.5,M.4.5: Extend the functionality and optimize performance of the ALADAN system. Report on performance tuning on generated code for GPUs and traditional processors using the proposed auto-tuning compilation infrastructure for the generated C codes using
the prototype domain-specific compiler. M.5.6 Refinement of strong classifiers for data filtering.
Training classifiers with D-Wave Two quantum annealing hardware, if available.
Sept 2015 D.1.6,D.2.6,D.3.6,D.4.6,D.5.6 Implement and deliver a full capability ALADAN system. Final report for Phase 2.
Sept 2016 D.1.7,D.2.7,D.3.7,D.4.7,D.5.7 Install/deploy ALADAN system at client site, and demonstrate the system on new test cases.
Mar 2017 D.1.8,D.2.8,D.3.8,D.4.8,D.5.8 Delivery of integrated ALADAN system and the final
report for Phase 4.
3.6 Technical Risk and Mitigation
While the proposed approach is grounded on solid principles and previous USC Team experience in large-scale distributed simulation and data analytics, we do recognize technical risks in some areas
of the proposed project, namely:
• Algorithm Design and Adaptation: There is a risk of not being able to adequately extend and
adapt existing solutions to the dynamic and noisy nature of the input data.
Mitigation Factor: Identify appropriate subsets. Perform additional data cleansing manually.
• DSL Development: There is a potential risk in the development of a DSL for code generation
targeting FPGAs and GPUs for aggregation and summarization.
Mitigation Factor: The team has extensive experience in compilation and program translation using DSLs for FPGAs [8] and GPUs, and in parallel programming using languages such as MPI/OpenMP and CUDA. There are also adequate facilities with hardware testing equipment that can be used for testing/debugging purposes.
• FPGA-based Demonstrator: There is a risk of not being able to develop and assemble an
FPGA-based demonstrator.
Mitigation Factor: Several team members have built over the years a vast array of hardware
prototypes for various research demonstrators. At USC/ISI there are also adequate facilities
fitted with hardware testing equipment that can be used for testing/debugging purposes.
• Quantum annealing for machine learning: There is a risk that quantum annealing for machine
learning will not offer an advantage over state-of-the-art classical algorithms.
Mitigation Factor: We are constantly benchmarking the performance of D-Wave One against
optimized classical alternatives. If the D-Wave Two does not perform as expected, we will use
the best classical alternative found. In addition, we are working on using quantum annealing as
a specialized binary optimization subroutine within classical heuristic optimization algorithms.
Quantum annealing can be switched on and off in these algorithms. Therefore, the performance
is never worse than the classical alternative.
Lastly, we do not anticipate management risks associated with the execution of the project. The
team has collaborated extensively in the past and includes a very diversified set of skills from
hardware-oriented design to high-level algorithmic solutions that will be leveraged in this project.
In addition, recent interactions with the government on projects that also involved demonstrators provide additional confidence that risks in this context are also negligible.
4 Management Plan
The ISI Team is an experienced research team, with more than a decade of experience working on large DoD projects together. They are not an ad hoc team of researchers assembled just for this project. They will be led by the Principal Investigator (PI), Dr. Ke-Thia Yao, who is a Research Scientist and Project Leader at USC's ISI. He is a widely published author and lecturer on data management and teaches an undergraduate course on the topic at USC. Within the JESPP project he is developing a distributed data management system, the Scalable Data Grid, and a suite of monitoring/logging/analysis tools to help users better understand the computational and behavioral properties of large-scale simulations. He received his B.S. degree in EECS from UC Berkeley, and his M.S. and Ph.D. degrees in Computer Science from Rutgers University.
He will be assisted by three experienced research scientists from ISI: Dr. Pedro Diniz, whose expertise is in FPGA applications to analyze large data sets; Dr. Jacqueline Chame, whose expertise is in compiler design to support Domain Specific Languages; and Dr. Sergio Boixo, who is one of
the key researchers on the D-Wave quantum computing project at USC. They, in turn, will be
supported by Messrs. Gene Wagenbreth and Craig Ward, both experienced High Performance Computing (HPC) and distributed database programmers, and by CDR Dan Davis, who will be responsible to the PI for documentation, progress reporting, the web site, and military intelligence
issues.
Of these, Dr. Yao and Mr. Wagenbreth hold current TS/SCI clearance, CDR Davis is currently
cleared to the TS level, and has held three SCI tickets in the past, and Mr. Ward holds a current
Secret Clearance. Further, the facility where the bulk of the research will be done has an area
certified to the SCI level. Drs. Yao, Diniz, Chame, and Boixo are all considered key to
the project. All of the team members are widely published in the fields indicated in the paragraph
above. Experience levels are all demonstrable and exceptional, e.g. Gene Wagenbreth has been
doing HPC programming for four decades (since beginning on the ILLIAC IV) and CDR Davis
has a 24-year career as a cryptographic linguist, analyst and intelligence manager in the Naval
Services. The key personnel all have more than two decades experience in their respective fields.
With the exception of the PI, the staff will all contribute on the order of 800 hours per year to this
project.
CDR Davis will lead the drafting of the Project Communications Plan and create a web page
with interactive capabilities for ensuring the coordination of the team, the dissemination of data, the
preparation of progress reports and the complete visibility of the research to the Program Manager
at DARPA. The Team has effectively used this system in the past and found it intuitive, useful and
conducive to good administration of the research.
Collaboration will be supported by:
1. A common wiki will be used to exchange documents and retain a history of interactions. An svn
repository will be used to support software collaboration and development.
2. Bi-Annual PI & Research Staff Meetings: Given the geographic location of the project partners,
a program meeting is planned twice a year to exchange results and discuss multidisciplinary
research issues.
3. Quarterly Research Staff Meetings: These meetings, supported mostly via video-conference and held during the second phase of the project only, will aim at a close integration of the prototype
tools designed during Phase 1 and developed during Phase 2. These short meetings (1 day) will
mitigate the potential integration risks for these tools.
4. Teleconferences or seminars every two weeks: Key issues, milestones, and deliverables will be
discussed.
The PI will prepare quarterly reports for the government summarizing key research results, task
progress, deliverables/publications, and other research staff/student success stories.
5 Capabilities
5.1 Facilities
The University of Southern California's Information Sciences Institute is a large, university-based research center with an emphasis on programs that blend basic and applied research through
exploratory system development. It has a distinguished history of producing exceptional research
contributions and successful prototype systems under government support. USC/ISI has built a
reputation for excellence and efficiency in both experimental computer services and production
services. USC/ISI originally developed and provided and/or now helps to support many mature
software packages for the entire Internet community.
The computer center has been an integral part of ISI since its founding in 1972. Today’s Information Processing Center (IPC) maintains a state-of-the-art computing environment and staff
to provide the technical effort required to support the performance of research. Resources include
client platform and server hardware support, distributed print services, network and remote access support, operating systems and application software support, computer center operations, and
help desk coverage. The IPC also acts as a technical liaison to the ISI community on issues of
acquisition and integration of computing equipment and software.
The Center’s servers are protected by an uninterruptible power supply and backup generator to
ensure availability 24 hours a day, 365 days a year. A rich mix of computer and network equipment
along with modern software tools for the research community's use provides a broad selection of
capabilities, including Unix and Linux-based servers and
Windows-based servers used for electronic mail and group calendaring, web services, file and
mixed application serving. File servers utilize high-performance RAID and automated backup to
facilitate performance and data protection. Computer room space is also available to researchers
for hosting project-related servers. In addition, research staff members have access to grid-enabled
cluster computing, and to USC's 11,664-CPU compute cluster with low-latency Myrinet interconnect, which is the largest academic supercomputing resource in Southern California.
Research project staff have an average of over one workstation per staff member, connected
to a high performance switched 10Gbps Ethernet LAN backbone with 10Gbps connectivity to
research networks such as Internet 2, as well as additional network resources such as IP multicast,
802.11g and 802.11n wireless, H323 point-to-point and multipoint videoconferencing, webcasting
and streaming media.
USC's Lockheed Martin Quantum Computing Center houses a D-Wave One quantum optimization engine with 128 qubits (Rainier chip). The D-Wave One performs binary function minimization using quantum annealing. There are plans to upgrade the D-Wave One to its next generation of 512 qubits (Vesuvius chip) when it becomes available, approximately within the next one to two years.
The Center is currently operational for access to researchers in quantum computing technology.
This quantum optimization engine, the only fully operational quantum computing engine in
the world, will be used to validate and prove the technical feasibility of the quantum computing
concepts discussed herein. The D-Wave One at USC’s LM-QCC is connected to a network of
high-performance computers (classical HPC environment). The entire computing configuration
can be accessed remotely both at USC and ISI, as well as by external researchers using either a
private or a public network protected with encryption for security. Also, it can communicate to
private or public cloud computing clusters using encrypted processing configurations for privacy
and security protection. Operationally, the computationally-intensive problems (QC appropriate)
can be partitioned with respect to the respective QC and classical HPC environments. For these
problems, the corresponding QC-centric mathematical kernels can execute in the QC environment
(i.e., D-Wave One), and the HPC-centric kernels can execute in the classical HPC environment.
The results are properly combined and coordinated to produce the end-product communicated to
the researchers performing the research activities and associated investigations.
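As a purely illustrative sketch of this division of labor (not the project’s actual interface; all names and data here are hypothetical), the fragment below expresses a QC-appropriate kernel as a quadratic unconstrained binary optimization (QUBO), the form minimized by quantum annealing, with a classical brute-force solver standing in for the D-Wave hardware:

    import itertools
    import numpy as np

    def solve_qubo_bruteforce(Q):
        """Stand-in for the quantum annealer: exhaustively minimize x^T Q x
        over binary vectors x.  In the deployed configuration this call would
        be dispatched to the D-Wave hardware; everything else stays on the
        classical HPC side."""
        n = Q.shape[0]
        best_x, best_e = None, float("inf")
        for bits in itertools.product([0, 1], repeat=n):
            x = np.array(bits)
            e = x @ Q @ x
            if e < best_e:
                best_x, best_e = x, e
        return best_x, best_e

    # Classical (HPC) side: build the QUBO from problem data, then combine
    # and post-process the returned assignments.
    Q = np.array([[-1.0, 2.0, 0.0],
                  [ 0.0, -1.0, 2.0],
                  [ 0.0,  0.0, -1.0]])
    x, energy = solve_qubo_bruteforce(Q)   # QC-centric kernel (here simulated)
    print("minimizing assignment:", x, "energy:", energy)

In the operational configuration only the QUBO minimization would be offloaded to the annealer; the construction of Q and all pre- and post-processing remain HPC-centric kernels.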
5.2 Organizational Experience
ISI personnel have experience and expertise in all of the areas critical to this proposal:
• Probabilistic statistical inference: Dr. Ke-Thia Yao, with multiple years on the JESPP project supporting JFCOM distributed simulation and on CiSoft oilfield equipment failure prediction.
• Parallel processing and distributed computing: Dr. Jacqueline Chame, Dr. Pedro Diniz, and Gene Wagenbreth, with many years of experience in all aspects of high-performance computing, hardware and software, including system software for compiling, translating, measuring and tuning, application software, FPGAs and GPGPUs.
• Quantum computing: Dr. Sergio Boixo, a key researcher on the D-Wave quantum computing project at USC.
6 Statement of Work
6.1 Phase 1 - Two Years
The goals of Phase 1 are to design the system; create multiple test cases to exercise components and demonstrate capabilities; implement a reduced-capability system; produce preliminary documentation; test and debug; and identify design deficiencies and redesign as needed.
6.1.1 Year 1.
Y1 Task 1 Management: Management tasks begin immediately with a kick-off meeting, to which the PM will be invited, and the ensuing effort will be focused on standing up the systems to support the project for four and a half years. The next major activity will be assessing the final Contract Data Requirements List (CDRL) to ensure all mandated documents are submitted in a timely, complete and compliant manner. A second thrust will be conceiving, designing and programming a project web site, complete with schedules, deadlines, publication targets and drafts, and outreach information for stakeholders. Sections not intended for public use will be encrypted and password protected. Wikis and other interactive communications will be created, maintained and decommissioned as appropriate. Travel will be kept to a minimum early in Year 1, but PI travel to DC and a representative’s travel to the summer test evolution are proposed. The ISI team will deliver all required technical papers, reports, software, documentation and final reports.
Y1 Task 2 Mapping/Aggregation/Summarization: Define the syntax and semantics of a DSL for the specification of data filtering and aggregation, to be implemented on both FPGA-based platforms and GPUs. Develop and validate an open-source parser and the corresponding code translation schemes, making use of existing parallel programming models such as MPI and OpenMP, and CUDA for the GPU target.
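Because the DSL itself is a Year 1 deliverable, its concrete syntax is not yet fixed. The fragment below is only a hypothetical sketch of the kind of declarative filter/aggregate specification the DSL is intended to capture, together with a sequential reference interpretation against which generated MPI/OpenMP or CUDA code could be checked; all field names and values are invented for illustration:

    from collections import Counter
    from dataclasses import dataclass

    @dataclass
    class StreamSpec:
        source: str          # input data stream (hypothetical name)
        filter_expr: str     # predicate applied to each record
        group_by: str        # key used for aggregation
        aggregate: str       # reduction applied within each group
        window_seconds: int  # summarization window

    spec = StreamSpec(
        source="uav_track_reports",
        filter_expr="confidence > 0.8 and region == 'AO-1'",
        group_by="track_id",
        aggregate="count",
        window_seconds=60,
    )

    def reference_interpreter(records, spec):
        """Sequential reference semantics for the spec (count aggregate only);
        the generated parallel code must agree with this interpretation."""
        groups = Counter()
        for r in records:                          # r is a dict per record
            if eval(spec.filter_expr, {}, r):      # illustrative only
                groups[r[spec.group_by]] += 1
        return dict(groups)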
Y1 Task 3 Probabilistic Inferencing: Define the syntax and semantics of a DSL for encoding relationships using factor graphs and for interfacing with the aggregated/summarized data from Task 2. Develop an initial implementation of the belief propagation algorithm on GPGPUs and/or other heterogeneous architectures.
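For reference, a minimal CPU-only sum-product sketch on a tiny, illustrative factor graph is shown below; the GPGPU implementation would parallelize exactly these message products and marginalizations across many factors and variables. The potentials are arbitrary examples:

    import numpy as np

    # Tiny factor graph: binary variables A, B with unary factors fA, fB and
    # a pairwise factor fAB.  Sum-product on a tree gives exact marginals.
    fA  = np.array([0.7, 0.3])              # phi(A)
    fB  = np.array([0.4, 0.6])              # phi(B)
    fAB = np.array([[0.9, 0.1],
                    [0.2, 0.8]])            # phi(A, B), rows index A

    # Variable-to-factor messages are the unary evidence (A and B are leaves);
    # factor-to-variable messages sum out the other variable.
    m_fAB_to_B = fAB.T @ fA                 # sum_A phi(A,B) * msg(A)
    m_fAB_to_A = fAB @ fB                   # sum_B phi(A,B) * msg(B)

    marg_A = fA * m_fAB_to_A
    marg_B = fB * m_fAB_to_B
    marg_A /= marg_A.sum()
    marg_B /= marg_B.sum()
    print("P(A):", marg_A, "P(B):", marg_B)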
Y1 Task 4 Performance Monitoring and Automatic Tuning: Identify performance bottlenecks and basic transformation opportunities in application code generation, in particular the partitioning between multi-core processors and accelerator engines (FPGAs and GPUs), as well as explicit concurrency using MPI/OpenMP and CUDA. Define optimization strategies and the corresponding parameters and schemes.
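The concrete optimization strategies are to be defined in this task; as a hedged illustration of the intended mechanism only, the sketch below performs a simple empirical search over a single hypothetical tuning parameter (a block size), timing each candidate and keeping the fastest. The same scheme would drive partitioning and concurrency parameters in the actual system:

    import time
    import numpy as np

    def blocked_sum(a, block):
        """Toy kernel whose performance depends on a tunable block size."""
        total = 0.0
        for i in range(0, a.size, block):
            total += a[i:i + block].sum()
        return total

    def autotune(kernel, data, candidates, repeats=3):
        """Empirical search: time each candidate parameter and keep the best.
        A real tuner would also vary host/accelerator partitioning."""
        best_param, best_time = None, float("inf")
        for p in candidates:
            t0 = time.perf_counter()
            for _ in range(repeats):
                kernel(data, p)
            elapsed = (time.perf_counter() - t0) / repeats
            if elapsed < best_time:
                best_param, best_time = p, elapsed
        return best_param, best_time

    data = np.random.rand(1_000_000)
    param, secs = autotune(blocked_sum, data, candidates=[256, 1024, 4096, 16384])
    print(f"best block size: {param} ({secs * 1e3:.2f} ms)")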
Y1 Task 5 Quantum Inferencing: Perform initial prototyping of the machine-learning approach for data filtering. Identify an initial set of weak classifiers and prototype strong classifiers. Benchmark strong classifiers whose training phase uses quantum annealing against related, purely classical algorithms. Begin interfacing with the DSL defined in Task 2.
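Following the binary-classification-by-quantum-annealing approach of Neven et al. [23], training a strong classifier amounts to selecting a binary weight vector over the dictionary of weak classifiers by minimizing a QUBO. The sketch below builds such a QUBO from weak-classifier outputs; the data are invented for illustration, and for benchmarking the QUBO can be minimized either on the annealer or with a classical stand-in such as the brute-force solver sketched in Section 5.1:

    import numpy as np

    def build_boosting_qubo(H, y, lam=0.05):
        """QUBO for selecting weak classifiers (quantum-boosting style, after [23]).
        H[s, i] = output (+1/-1) of weak classifier i on sample s,
        y[s]    = label (+1/-1).  Minimizing x^T Q x over binary x selects the
        subset of weak classifiers that forms the strong classifier."""
        S, K = H.shape
        Q = (H.T @ H) / (K * K)                # pairwise correlation terms
        linear = lam - (2.0 / K) * (H.T @ y)   # data-fit plus sparsity penalty
        Q[np.diag_indices(K)] = np.diag(Q) + linear
        return Q

    # Illustrative data: 3 weak classifiers evaluated on 6 labeled samples.
    H = np.array([[+1, +1, -1],
                  [+1, -1, -1],
                  [-1, -1, +1],
                  [-1, +1, +1],
                  [+1, +1, +1],
                  [-1, -1, -1]], dtype=float)
    y = np.array([+1, +1, -1, -1, +1, -1], dtype=float)
    Q = build_boosting_qubo(H, y)
    # On hardware this QUBO would be handed to the quantum annealer.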
6.1.2 Year 2.
Y2 Task 1 Management: Year 2 management tasks are a continuation of Year 1. All meetings, communications and project deliverables will be re-evaluated at the beginning of the year; unnecessary or unproductive activities will be curtailed, and obviated tasks will be discontinued. New and useful tasks will be added as appropriate. Travel will increase as team members begin to publish results deemed publication-worthy and appropriate for dissemination by the PM and other DARPA-associated stakeholders.
Y2 Task 2 Mapping/Aggregation/Summarization: Test and evaluate DSL-generated codes on both FPGA-based systems and GPU accelerators for a sample set of filtering and data aggregation strategies. Demonstrate (at month 18 of the project) the filtering and data summarization sub-system and its interface with the overall system.
Y2 Task 3 Probabilistic Inferencing: Test and evaluate factor graph DSL-generated codes on GPUs and/or other heterogeneous architectures. Develop a factor graph for the MOE/MOP evaluation of the persistent surveillance example. Demonstrate (at month 18 of the project) the factor graph DSL and belief propagation algorithms.
Y2 Task 4 Performance Monitoring and Automatic Tuning: Implement and perform preliminary analysis of application code partitioning between multi-core and accelerator engines (FPGAs and GPUs). Perform initial evaluation and integration with GPU and FPGA programming
efforts.
Y2 Task 5 Quantum Inferencing: Generate, test and evaluate machine-learning strong classifiers for data filtering. Improve the dictionaries of weak classifiers. Continue benchmarking quantum-annealing-generated classifiers against improved classical alternatives. Demonstrate (at month 18 of the project) machine-learning filtering. Perform initial tests with the D-Wave Two for generation of strong classifiers, if available. Begin porting strong classifiers to the appropriate hardware, as specified by Task 4.
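The composition of the weak-classifier dictionary is itself a subject of this task. As a hedged illustration only, the sketch below builds a dictionary of simple threshold decision stumps over the input features, the simplest form such a dictionary might take; evaluating it on data yields the matrix of weak-classifier outputs used in the QUBO construction sketched under Year 1, Task 5:

    import numpy as np

    def stump_dictionary(X, thresholds_per_feature=3):
        """Dictionary of weak classifiers as threshold decision stumps: each
        stump votes +1/-1 depending on whether one feature exceeds a quantile
        threshold.  Returns a list of callables."""
        stumps = []
        n_samples, n_features = X.shape
        for f in range(n_features):
            qs = np.quantile(X[:, f], np.linspace(0.25, 0.75, thresholds_per_feature))
            for t in qs:
                stumps.append(lambda x, f=f, t=t: np.where(x[:, f] > t, 1.0, -1.0))
        return stumps

    # Usage: evaluate the dictionary on data to obtain the weak-output matrix H.
    X = np.random.rand(100, 4)
    stumps = stump_dictionary(X)
    H = np.column_stack([h(X) for h in stumps])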
6.2 Phase 2 - One Year
The goals of Phase 2 are to implement a full-capability system; reformulate test cases as appropriate; add more real-world cases; test and debug; measure performance; and complete documentation.
6.2.1 Year 3.
Y3 Task 1 Management: Phase 2 will begin with a reassessment of the management procedures required to adequately support both the existing and the new goals set out by the PM and by any shift in research focus. A thorough analysis of the efficacy of the management systems will be led by the PI and driven by CDR Davis. Meetings, travel and communications systems already in place will be continued as deemed advisable. Travel is anticipated to be limited to the PI’s trips to DC, trips to present technical papers, and the representative’s trips to Northern Virginia for the summer test evolution.
Y3 Task 2 Mapping/Aggregation/Summarization: Integrate the FPGA-based and GPU-based accelerator sub-systems with the higher-level probabilistic inference algorithmic designs. Refine DSL semantics and code generation schemes as appropriate.
Y3 Task 3 Probabilistic Inferencing: Enhance the scalability of the belief propagation algorithm by extending it to multiple nodes with multiple GPUs.
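The precise multi-node decomposition will be determined during this phase. The sketch below only illustrates the intended pattern on a small chain-structured model: the graph is partitioned into edge sets, each partition recomputes its messages locally each sweep, and only the messages crossing the cut would need to be communicated between MPI ranks or GPUs (here the partitions are iterated sequentially as a stand-in):

    import numpy as np

    # Chain of N binary variables: unary potentials U[i], shared pairwise P.
    # msgs_fwd[i] is the message from variable i to i+1; msgs_bwd[i] the reverse.
    N = 8
    rng = np.random.default_rng(0)
    U = rng.random((N, 2)) + 0.1
    P = np.array([[0.8, 0.2],
                  [0.2, 0.8]])
    msgs_fwd = np.ones((N - 1, 2))
    msgs_bwd = np.ones((N - 1, 2))

    # Each "partition" owns a set of edges; in deployment each would live on a
    # different node/GPU, exchanging only the boundary messages per sweep.
    partitions = [range(0, (N - 1) // 2), range((N - 1) // 2, N - 1)]

    for sweep in range(10):                      # Jacobi-style sweeps
        new_fwd, new_bwd = msgs_fwd.copy(), msgs_bwd.copy()
        for owned_edges in partitions:           # executed in parallel in practice
            for e in owned_edges:
                inc_f = msgs_fwd[e - 1] if e > 0 else np.ones(2)
                inc_b = msgs_bwd[e + 1] if e + 1 < N - 1 else np.ones(2)
                new_fwd[e] = (U[e] * inc_f) @ P          # var e -> var e+1
                new_bwd[e] = P @ (U[e + 1] * inc_b)      # var e+1 -> var e
                new_fwd[e] /= new_fwd[e].sum()
                new_bwd[e] /= new_bwd[e].sum()
        msgs_fwd, msgs_bwd = new_fwd, new_bwd

    # Marginal of a variable combines its unary potential with incoming messages.
    i = N // 2
    marg = U[i] * msgs_fwd[i - 1] * msgs_bwd[i]
    print("P(x_%d):" % i, marg / marg.sum())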
Y3 Task 4 Performance Monitoring and Automatic Tuning: Refine tuning strategies and code generation schemes for all system components, with emphasis on data analytics algorithms. Test and evaluate sample input cases.
Y3 Task 5 Quantum Inferencing: Refine strong classifiers for data filtering. Integrate closely with Tasks 2 and 3. If available, use the D-Wave Two for the training phase of machine learning, benchmarking against related classical algorithms (which are used otherwise).
6.3 Phase 3 - One Year
The goals of Phase 3 are to create new test cases that utilize the capabilities of the system to generate new results; install/deploy the system at actual sites; and design and implement new features that allow more innovative results.
6.3.1 Year 4.
Y4 Task 1 Management: Year 4 will be a year of delivery and publication. The research will have borne fruit that the PM will want communicated to the appropriate audiences. Delivery of programs and data will be carefully monitored and screened for utility, completeness and documentation. The PI and CDR Davis will confirm publication propriety and ensure the project’s successes are recognized by both the research and the user communities. Meetings and communications systems, e.g. the web site, will be re-evaluated for efficacy.
Y4 Task 2 Mapping/Aggregation/Summarization: Test the FPGA-based and GPU-based accelerator sub-systems with the higher-level probabilistic inference algorithmic designs in real (live) data scenarios. Demonstrate and evaluate the integrated system.
Y4 Task 3 Probabilistic Inferencing: Extend the factor graph DSL as needed to accommodate additional test cases. Implement the belief propagation algorithm on additional heterogeneous architectures, such as Tilera.
Y4 Task 4 Performance Monitoring and Automatic Tuning: Evaluate base systems with additional larger input cases. Possibly refine code generation for scalability.
Y4 Task 5 Quantum Inferencing: Test and evaluate machine-learning filtering in real (live) data scenarios. Continuously improve the weak-classifier dictionaries, training algorithms, and strong-classifier algorithms.
6.4 Phase 4 - Six Months
The goals of Phase 4 are to deliver the system for testing and demonstration; demonstrate system capabilities on multiple test cases; and integrate at customer sites.
6.4.1 Year 5 - Six Months.
Y5 Task 1 Management: Phase 4 is a six-month period devoted to finalizing and conveying the product of the project’s research to the PM. It will entail the drafting and delivery of the final report and the conveyance of all open-source code developed by this project. Attention will be paid to usability by the Warfighters.
Y5 Task 2 Mapping/Aggregation/Summarization: Perform field-testing and deployment of the FPGA-based and GPU-based accelerator sub-systems with the higher-level probabilistic inference algorithmic designs.
Y5 Task 3 Probabilistic Inferencing: Perform field-testing and deployment of factor graph DSL
and belief propagation implementations.
Y5 Task 4 Performance Monitoring and Automatic Tuning: Integrate generated code into demonstrators at customer sites. Provide support and reporting.
Y5 Task 5 Quantum Inferencing: Perform field-testing and deployment of strong classifiers for machine-learning-based data filtering.
7 Schedule and Milestones
8 Cost Summary
Appendix A
i. Team Member Identification
Prime                Organization   Non-US?   Clearance   FFRDC or Government?
Ke-Thia Yao          USC ISI        No        Yes         No
Jacqueline N. Chame  USC ISI        Yes       No          No
Pedro C. Diniz       USC ISI        Yes       No          No
Federico Spedalieri  USC ISI        Yes       No          No
Sergio Boixo         USC ISI        Yes       No          No
Gene Wagenbreth      USC ISI        No        Yes         No
Craig Ward           USC ISI        No        Yes         No

Subcontractor        Organization   Non-US?   Clearance   FFRDC or Government?
N/A

Consultant           Organization   Non-US?   Clearance   FFRDC or Government?
Dan Davis            USC ISI        No        Yes         No
ii. Government or FFRDC Team Member
NONE
iii. Organizational Conflict of Interest Affirmations and Disclosure
Pursuant to FAR Subpart 9.5, the University of Southern California/Information Sciences Institute does not support any DARPA offices and therefore has no Organizational Conflict of Interest.
iv. Intellectual Property
NONE
v. Human Use
NONE
vi. Animal Use
NONE
vii. Representations Regarding Unpaid Delinquent Tax Liability or a Felony Conviction under
Any Federal Law
(1) The proposer represents that it is [ ] is not [X] a corporation that has any unpaid Federal
tax liability that has been assessed, for which all judicial and administrative remedies
have been exhausted or have lapsed, and that is not being paid in a timely manner
pursuant to an agreement with the authority responsible for collecting the tax liability.
(2) The proposer represents that it is [ ] is not [X] a corporation that was convicted of a felony criminal violation under any Federal law within the preceding 24 months.
viii. Subcontractor Plan
N/A
Appendix B Bibliography and Cited References
Bibliography
[1] Jason Ansel, Cy Chan, Yee Lok Wong, Marek Olszewski, Qin Zhao, Alan Edelman, and Saman Amarasinghe.
Petabricks: a language and compiler for algorithmic choice. In Proceedings of the 2009 ACM SIGPLAN conference on Programming language design and implementation, PLDI ’09, pages 38–49, New York, NY, USA,
2009. ACM.
[2] Jeff Bilmes, Krste Asanovic, Chee-Whye Chin, and Jim Demmel. Optimizing matrix multiply using PHiPAC:
A portable, high-performance, ANSI C coding methodology. In International Conference on Supercomputing,
pages 340–347, 1997.
[3] S. Boixo, E. Knill, and R. D. Somma. Eigenpath traversal by phase randomization. Quantum Information and Computation, 9:833–855, 2009.
[4] C. Chen, J. Chame, and M. Hall. Combining models and guided empirical search to optimize for multiple levels
of the memory hierarchy. In Proc. of the Intl. Symp. on Code Generation and Optimization, Mar 2005.
[5] Chun Chen. Model-Guided Empirical Optimization for Memory Hierarchy. PhD thesis, University of Southern
California, May 2007.
[6] Chun Chen, Jacqueline Chame, and Mary W. Hall. CHiLL: A framework for composing high-level loop transformations. Technical Report 08-897, University of Southern California, Jun 2008.
[7] Rina Dechter. Bucket elimination: a unifying framework for probabilistic inference. In Proceedings of the Twelfth International Conference on Uncertainty in Artificial Intelligence, UAI’96, pages 211–219, San Francisco, CA, USA, 1996. Morgan Kaufmann Publishers Inc.
[8] P. Diniz, M. Hall, J. Park, B. So, and H. Ziegler. Automatic Mapping of C to FPGAs with the DEFACTO
compilation and synthesis systems. Elsevier Journal on Microprocessors and Microsystems, 29(2-3):51–62,
April 2005.
[9] Kayvon Fatahalian, Daniel Reiter Horn, Timothy J. Knight, Larkhoon Leem, Mike Houston, Ji Young Park,
Mattan Erez, Manman Ren, Alex Aiken, William J. Dally, and Pat Hanrahan. Sequoia: Programming the
Memory Hierarchy. In Proc. of the ACM/IEEE Supercomputing Conference (SC), November 2006.
[10] A. B. Finnila, M. A. Gomez, C. Sebenik, C. Stenson, and J. D. Doll. Quantum annealing: A new method for
minimizing multidimensional functions. Chemical Physics Letters, 219(5-6):343–348, March 1994.
[11] Matteo Frigo and Steven G. Johnson. The fastest Fourier transform in the West. Technical Report MIT-LCS-TR-728, MIT Lab for Computer Science, 1997.
[12] Noah Goodman, Vikash Mansinghka, Daniel Roy, Keith Bonawitz, and Daniel Tarlow. Church: a language
for generative models. In Proceedings of the Twenty-Fourth Conference Annual Conference on Uncertainty in
Artificial Intelligence (UAI-08), pages 220–229, Corvallis, Oregon, 2008. AUAI Press.
[13] Robert J. Graebener, Gregory Rafuse, Robert Miller, and Ke-Thia Yao. The road to successful joint experimentation starts at the data collection trail. In Interservice/Industry Training, Simulation, and Education Conference
(I/ITSEC) 2003, Orlando, Florida, December 2003.
[14] Mary W. Hall, Jacqueline Chame, Chun Chen, Jaewook Shin, Gabe Rudy, and Malik Murtaza Khan. Loop
transformation recipes for code generation and auto-tuning. In Proceedings of the 22nd International Workshop
on Languages and Compilers for Parallel Computing, Oct 2009.
[15] Albert Hartono, Boyana Norris, and P. Sadayappan. Annotation-based empirical performance tuning using Orio.
In IEEE International Parallel and Distributed Processing Symposium (IPDPS), Rome, Italy, 2009.
[16] The MathWorks Inc. Simulink: Simulation and Model-based Design.
[17] Hyokyeong Lee, Ke-Thia Yao, and Aiichiro Nakano. Dynamic structure learning of factor graphs and parameter
estimation of a constrained nonlinear predictive model for oilfield optimization. In Proceedings of the 2010
International Conference on Artificial Intelligence, July 2010.
[18] Hyokyeong Lee, Ke-Thia Yao, Olu Ogbonnaya Okpani, Aiichiro Nakano, and Iraj Ershaghi. Hybrid constrained
nonlinear optimization to injector-producer relationships in oil fields. International Journal of Computer Science,
2010.
[19] Hyokyeong Lee, Ke-Thia Yao, Olu Ogbonnaya Okpani, Aiichiro Nakano, and Iraj Ershaghi. Identifying injectorproducer relationship in waterflood using hybrid constrained nonlinear optimization. In SPE Western Regional
Meeting, May 2010. SPE 132359.
[20] Yintao Liu, Ke-Thia Yao, Shuping Liu, Cauligi S. Raghavendra, Oluwafemi Balogun, and Lanre Olabinjo. Semisupervised failure prediction for oil production wells. In Workshop on Domain Driven Data Mining at International Conference on Data Mining. IEEE, 2011.
[21] Yintao Liu, Ke-Thia Yao, Shuping Liu, Cauligi S. Raghavendra, Tracy L. Lenz, Lanre Olabinjo, Burcu Seren,
Sanaz Seddighrad, and Chinnapparaja Gunaskaran Dinesh Babu. Failure prediction for artificial lift systems. In
SPE Western Regional Meeting, May 2010. SPE 133545.
[22] Andrew McCallum, Karl Schultz, and Sameer Singh. FACTORIE: Probabilistic programming via imperatively
defined factor graphs. In Neural Information Processing Systems (NIPS), 2009.
[23] H. Neven, V. S. Denchev, M. Drew-Brook, J. Zhang, W. G. Macready, and G. Rose. NIPS 2009 demonstration:
Binary classification using hardware implementation of quantum annealing. 2009.
[24] G. Palubeckis. Multistart tabu search strategies for the unconstrained binary quadratic optimization problem.
Annals of Operations Research, 131(1):259–282, 2004.
[25] Jooseok Park and Pedro C. Diniz. Using fpgas for data reorganization and pre-fetching of pointer-based data
structures for scientific computations. IEEE Trans. on Design and Test in Computers (IEEE-D&T) Special Issue
on Design Methods and Tools for FPGA-Based Acceleration of Scientific Computing, July/Aug. 2011.
[26] Louis-Noël Pouchet, Cédric Bastoul, Albert Cohen, and Nicolas Vasilache. Iterative optimization in the polyhedral model: Part II, multidimensional time. In ACM SIGPLAN Conference on Programming Language Design
and Implementation (PLDI’08), Tucson, AZ, 2008.
[27] Markus Püschel, José M. F. Moura, Jeremy Johnson, David Padua, Manuela Veloso, Bryan Singer, Jianxin
Xiong, Franz Franchetti, Aca Gačić, Yevgen Voronenko, Kang Chen, Robert W. Johnson, and Nicholas Rizzolo.
SPIRAL: Code generation for DSP transforms. Proceedings of the IEEE, 93(2):232–275, 2005.
[28] Manman Ren, Ji Young Park, Mike Houston, Alex Aiken, and William J. Dally. A Tuning Framework for
Software-Managed Memory Hierarchies. In International Conference on Parallel Architectures and Compilation
Techniques, pages 280–291, October 2008.
[29] Gabe Rudy, Chun Chen, Mary W. Hall, Malik Murtaza Khan, and Jacqueline Chame. A programming language
interface to describe transformations and code generation. In Proceedings of the 23rd International Workshop
on Languages and Compilers for Parallel Computing, Oct 2010.
[30] Rolando D Somma and Sergio Boixo. Spectral gap amplification. arXiv:1110.2494, October 2011.
[31] Richard Vuduc, James W. Demmel, and Katherine A. Yelick. OSKI: A library of automatically tuned sparse
matrix kernels. Journal of Physics: Conference Series, 16(1):521–530, 2005.
[32] Gene Wagenbreth, Ke-Thia Yao, Dan M. Davis, Robert F. Lucas, and Thomas D. Gottschalk. Enabling
1,000,000-entity simulation on distributed linux clusters. In M. E. Kuhl, N. M. Steiger, F. B. Armstrong, and
J. A. Joines, editors, Proceedings of the 2005 Winter Simulation Conference, 2005.
[33] R. Clint Whaley and David B. Whalley. Tuning high performance kernels through empirical compilation. In
ICPP, June 2005.
[34] Ke-Thia Yao and Gene Wagenbreth. Simulation data grid: Joint experimentation data management and analysis. In Interservice/Industry Training, Simulation, and Education Conference (I/ITSEC) 2005, Orlando, Florida,
2005.
[35] Ke-Thia Yao, Gene Wagenbreth, and Craig Ward. Agile data logging and analysis for large-scale simulations. In
Interservice/Industry Training, Simulation, and Education Conference (I/ITSEC) 2006, Orlando, Florida, 2006.
Team Member Biographies and CVs
http://www.hpc-educ.org/XDATA/ALADAN-Bios.html