QMapper for Smart Grid: Migrating SQL
QMapper for Smart Grid: Migrating SQL-based Application to Hive

Yue Wang1,2, Yingzhong Xu1,2, Yue Liu1,2, Jian Chen3 and Songlin Hu1,2
1 Institute of Computing Technology, Chinese Academy of Sciences, China
2 University of Chinese Academy of Sciences, China
3 Zhejiang Electric Power Corporation, China
1 {wangyue89,xuyingzhong,liuyue01,husonglin}@ict.ac.cn, 3 [email protected]

ABSTRACT
Apache Hive has been widely used by Internet companies for big data analytics applications. It can compile high-level languages into efficient MapReduce workflows, which frees users from complicated and time-consuming programming. The popularity of Hive and HiveQL-compatible systems like Impala and Shark has attracted attention from traditional enterprises as well. However, enterprise big data processing systems such as Smart Grid applications often have to migrate their RDBMS-based legacy applications to Hive rather than directly writing new logic in HiveQL. Considering the differences in syntax and cost model, manual translation from SQL in RDBMS to HiveQL is very difficult, error-prone, and often leads to poor performance. In this paper, we propose QMapper, a tool for automatically translating SQL into proper HiveQL. QMapper consists of a rule-based rewriter and a cost-based optimizer. The experiments based on the TPC-H benchmark demonstrate that, compared to manually rewritten Hive queries provided by Hive contributors, QMapper dramatically reduces the query latency on average. Our real world Smart Grid application also shows its efficiency.

Categories and Subject Descriptors
H.2 [Database Management]: Systems

Keywords
SQL on Hadoop; Hive; System Migration; Join Optimization

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]. SIGMOD'15, May 31–June 4, 2015, Melbourne, Victoria, Australia. Copyright © 2015 ACM 978-1-4503-2758-9/15/05 ...$15.00. http://dx.doi.org/10.1145/2723372.2742792.

1. INTRODUCTION
In recent years, high-level query languages such as Hive [17, 18], Pig [16] and JAQL [3] based on MapReduce have been widely used to deal with big data problems [4]. For example, more than 95% of MapReduce jobs running in Facebook are generated by Hive [12]. By providing the SQL-like query language HiveQL, Hive enables users who have experience with traditional RDBMS to run familiar queries on MapReduce. Moreover, HiveQL queries can also be run in systems like Shark [20] and Impala¹, which utilize in-memory storage to accelerate iterative computations and are very popular in Internet companies as well.

Due to the increasing amount of data produced, performance bottlenecks of the current RDBMS-based infrastructure appear in traditional enterprises. It is critical for them to leverage a new sustainable and scalable platform, which can guarantee high performance, robust data processing and a controllable budget at the same time. For example, in Zhejiang Province, the State Grid Corporation of China (SGCC) has deployed more than 17 million smart meters, which will grow to 22 million within two years. Data collected by these sensors was previously stored with complex indexes in a commercial RDBMS to facilitate selective data reading. Based on that, statistic data processing is implemented in the form of SQL stored procedures.
We observe that with the growth of both the data collecting frequency and the number of sensors, the performance of data writing becomes very low and the global statistical analysis on big tables tends to be a bottleneck as well. The weak scalability of traditional RDBMS and the high cost of commercial parallel databases force Zhejiang Grid to find another way to meet its business requirements. Considering the high write throughput, good scalability, high performance offline batch data processing capability and lower cost of Hadoop/Hive, SGCC tries to leverage Hive as an extra engine to perform metering data collection and offline analysis.

However, different from Internet companies, traditional enterprises have many legacy applications running on RDBMS. Though new applications can be implemented on the Hadoop/Hive environment, legacy ones still need to be smoothly migrated to Hive. In the Smart Electricity Consumption Information Collection System (SECICS) of the Zhejiang Smart Grid, for example, the business logic has already been implemented by hundreds of stored procedures in RDBMS. These procedures have been accumulated for years and are usually written by different developers. It would be a huge cost to let engineers figure out the logic of each query, which they are probably not familiar with, and re-implement it using Hive.

(Footnote 1: http://www.cloudera.com)

They will face great challenges during the course of migration:

• Hive cannot fully support the SQL syntax at the moment. Statements such as UPDATE and DELETE are not well supported, so the way to modify existing files is to overwrite them entirely (we will analyze the support for UPDATE and DELETE in Hive 0.14 in Section 2.3). Moreover, subqueries such as EXISTS and IN are also not fully supported. Migrating these queries to Hive is time-consuming and the correctness can hardly be guaranteed.
• Even if some SQL queries used in RDBMS can be directly accepted by Hive, their performance might be very low in the Hadoop ecosystem because of the difference between the cost models of RDBMS and MapReduce. SQL queries run in a traditional centralized RDBMS are tuned according to the cost of physical operations, mainly CPU cycles and disk I/O. In MapReduce, by contrast, the materialization of intermediate data and the data shuffle between mappers and reducers make disk I/O and network I/O the dominant performance factors. Successful migration needs to consider the cost model shift in addition to syntax mapping.

In this paper, we introduce QMapper [21], an intelligent, rather than a "direct", translator that enables automatic rule-based SQL-to-HiveQL mapping as well as cost-based optimization of the translated queries. We divide the translation process of QMapper into three phases. A given SQL query is first parsed by the SQL Interpreter into the Query Graph Model (QGM) [7]. Then, the Query Rewriter applies translation rules to generate equivalent queries, and further figures out the near-optimal join order for each equivalent query variation based on our cost model. Finally, a Plan Evaluator is used to identify the optimal rewritten query.

The specific contributions of this paper include:

• A tool, QMapper, is implemented to automatically translate SQL to optimized HiveQL. Our current work focuses on optimization on the stack of Hive and Hadoop. However, since the translation rules and cost model are pluggable, QMapper can also be extended to support HiveQL-compatible systems such as Shark and Impala.

• A cost model is proposed to reflect the execution time of MapReduce jobs. Our cost model combines the costs of the MapReduce framework and the costs of internal Hive operators. With our cost model, there is no need to pre-run the corresponding query.

• An algorithm is designed to reorganize the join structure so as to construct the near-optimal query.
It is known that a bushy plan can exploit multiple servers by allowing joins to run in parallel, but may potentially increase the sizes of intermediate relations, whereas a left-deep plan can minimize the intermediate relations but cannot exploit parallel resources [6]. In this paper, our algorithm makes a trade-off between intermediate data volume and job concurrency.

• Extensive experiments have been conducted on TPC-H workloads. The results demonstrate that queries optimized by QMapper outperform the queries manually rewritten by experienced users. Moreover, a real application of QMapper for Smart Grid is reported.

The rest of this paper is organized as follows. In Section 2, we describe the big smart meter data problem, the motivation for migration and our solution in detail. Section 3 gives an overview of QMapper. Section 4 introduces our rewriting rules and the cost-based optimization. QMapper's cost model and MapReduce workflow estimation are discussed in Section 5. Section 6 briefly describes the implementations of statistics collection and query evaluation. In Section 7, we validate QMapper with extensive experiments. In Section 8, we briefly review recent work related to our paper. Finally, we summarize our work in Section 9.

2. BACKGROUND
In this section, we first give an overview of the Smart Electricity Consumption Information Collection System in Zhejiang Grid, then discuss the difficulties of smoothly migrating legacy RDBMS applications to the Hadoop platform.

2.1 Smart Electricity Consumption Information Collection System
Figure 1 shows the data flow and system architecture of SECICS in Zhejiang Grid. Currently, 17 million smart meters are deployed in Zhejiang Province to collect meter data at a fixed frequency, for example, once every 15 minutes. After the massive collected data is decoded by the Data Collection Server Cluster, it is written to a commercial RDBMS, which is deployed on two high-end servers.
The total amount of data is 20TB and about 30GB of new data is added to the database every day. There are mainly three kinds of data in SECICS:

• Meter data. Meter data is collected by smart meters. Because of the large number of smart meters and the high collecting frequency, its volume is very large. Since meter data is the real measurement of the physical world, it is rarely updated. The massive meter data needs to be stored in time, otherwise it may be overwritten by the next dataflow.

• Archive data. Archive data records the detailed archived information of meter data, such as user and device information of a smart meter. Compared with the meter data, archive data has a much smaller scale and is updated frequently.

• Statistic data. Statistic data is the result of offline batch analysis. Its computation is based on meter data and archive data. It is used for report generation and online query by staff in Zhejiang Grid.

SECICS needs to deeply analyze the data it collects to support scenarios such as computing user electricity consumption, district line loss calculation, statistics of data acquisition rates, terminal traffic statistics, exception handling, and fraud detection. The offline batch analysis logic currently consists of several stored procedures, each of which contains tens of SQL statements. They are executed at fixed frequencies every day to calculate the corresponding statistic data. As many queries in the stored procedures involve both meter data and archive data, and usually contain multiple join operations, the performance of joins significantly affects the efficiency of the business application.

[Figure 1: The System Migration of the Smart Electricity Consumption Information Collection System. Labels in the figure include: Data Collection Server Cluster; Collection; Meter Data; Archive Data; Statistic Data; OLTP; Offline Batch Analysis; Online Query; Stored Procedure 1 ... Stored Procedure n; Workflow 1 ... Workflow n; 100,000 lines of stored procedures migration; Copy same table schema; Archive data synchronization; ETL.]

The middle part of Figure 1 shows the previous solution of SECICS in Zhejiang Grid. The solution is based on RDBMS deployed on a high performance 2*98 cores commercial RDBMS cluster and an advanced storage system. Careful optimization work had been done by database administrators and experts, so as to guarantee that all the tasks can be finished between 1am and 7am. Otherwise they will block the business operations in working hours. However, with the growing number of installed meters and the increasing collecting frequency, the previous solution encounters some bottlenecks.

2.2 System Improvement Requirements and Solution
The previous solution cannot provide a sustainable and scalable infrastructure, mainly because of:

• Low data write throughput. RDBMS with complex indexes cannot provide enough write throughput, which results in serious problems as the data scale becomes larger and larger. Arriving data would be dropped if the data previously put in the queue could not be stored in time.

• Unsatisfactory statistical analysis capability. Smart Grid business involves a large number of global statistical analysis tasks. They are implemented by SQL stored procedures, each of which may need to execute on whole big tables and perform complex join operations among them. Even with the current scale of smart meters and collecting frequency, the tasks can hardly be completed in time. For example, in order to compute users' electricity consumption, a query needs to perform complex join operations on 5 tables, which contain 60GB of data in total. The average processing time even reaches 3 to 4 hours.

• Weak scalability.
The number of metering devices has increased 30 times in the last five years, and the collecting frequency is still rising. The meter data scale is consequently growing and the system is expected to scale correspondingly. However, the scalability of the previous solution is fairly weak. Scaling out an RDBMS mostly leads to a redesign of the sharding strategies as well as a lot of application logic, while scaling up means purchasing more expensive hardware, and the limit will easily be reached. Both cases need huge human labor and financial resources.

• Uncontrollable resource competition. As online query processing and offline batch analysis tasks are put on a single RDBMS, they compete for computing resources. In the worst cases, this incurs a significant performance penalty.

Moreover, continuing to scale the RDBMS requires more powerful hardware, which brings additional cost. The finances of SGCC are another factor that must be taken into consideration.

In order to meet the requirements, we propose a new solution that leverages both Hadoop/Hive and RDBMS. It is shown in the right part of Figure 1. In this solution, the meter data is directly written into HDFS instead of RDBMS, which makes use of the high write throughput of HDFS. The archive data is still stored in RDBMS to support frequent CRUD operations. The offline batch analysis logic is migrated to Hive. Before each offline batch analysis runs, the archive data used in the queries is copied to HDFS via Sqoop (an open source ETL tool). After the analysis finishes, the resulting statistic data is written back to RDBMS for online query. To make migration and data sharing easy, the schema of both meter and archive data remains the same in Hive.

The new solution takes advantage of both RDBMS and Hadoop. The powerful OLTP and index abilities make RDBMS still more suitable for archive data management and online query processing.
The good write throughput of HDFS guarantees that the meter data can be written into the system in time. In addition, migrating offline batch analysis tasks to Hadoop and Hive frees RDBMS from complicated big data processing. It improves the statistical analysis performance significantly by making full use of the computing capability of the Hadoop ecosystem. Moreover, uncontrollable resource competition in RDBMS is avoided. Most importantly, the new solution provides good scalability: in a Hadoop/Hive cluster, we only need to add cheap servers to scale out the system. Finally, since Hadoop/Hive platforms are open source, the solution is cost effective. The future budget can even be reduced, as the heavy burden on the RDBMS has been removed and we do not need to buy extra hardware for it to cope with the growing meter data.

2.3 Smooth SQL Migration
When migrating applications from RDBMS to Hadoop/Hive, we first create tables for meter data and archive data in Hive, each of which keeps the same schema used in RDBMS. The constraints and indexes are ignored. Instead, we create DGFIndex [15] (a cost-effective multi-dimensional index for Hive that we developed for the application of Smart Grid) for specific tables as needed to improve the reading efficiency for some selective queries.

[Figure 2: The migration of Stored Procedures. Labels in the figure include: Stored Procedure; SQL 1 ... SQL n; SQL Interpreter; Query Rewriter; HiveQL 1 ... HiveQL n; Sample Data; Result (Count, Sum, Avg); Correctness Validation; Workflow; Deploy.]

Then, lots of stored procedures need to be migrated/translated to Hive, which is the most technically challenging problem in the system migration. Currently, the offline batch analysis logic consists of more than 200 stored procedures with more than 4000 statements in RDBMS, adding up to around 100,000 lines of SQL code in total.
This code had been maintained for 5 years by a team of 100 developers, who have done enormous optimizations on these SQL statements based on the cost model of RDBMS. Since members of the team change frequently and the documents may not be up-to-date with all the modifications and improvements to the code, it is impossible for us to find anyone to fully explain even a single procedure. As the business logic is really complex and the SQL developers are not familiar with Hadoop and Hive, it is not realistic to re-implement this code manually and then optimize it. Thus, it is ideal to provide a fairly "automatic" way to enable translation without deep concern about the internal business logic. However, when we try to migrate this code to Hive, two challenges arise:

First, the stored procedures cannot be run directly, as Hive does not support full SQL, for example, UPDATE, DELETE (before Hive 0.14), MERGE INTO, EXISTS subqueries (as of Hive 0.13, only top-level conjunctive EXISTS subqueries are supported), etc. In fact, these occur frequently in real engineering. For instance, the ratio of DML statements is very high, as they form more than 70% of the offline batch analysis logic (detailed statistics can be found in [11]). Although Hive 0.14, released in November 2014, only several days before the submission of this paper, starts to support DML operations, it still has some restrictions on file format, etc. The UPDATE and DELETE operations need to operate on ORC (Optimized RCFile) format files and the table must be bucketed. Its performance and stability still need to be validated in practice.

Second, since the cost model of Hive on Hadoop is different from that of RDBMS, a statement optimized in RDBMS might yield worse performance in Hive. It is also far from practical to ask developers to manually rewrite one query and enumerate all its variations.
We cannot ask developers to compute the cost of each candidate variation and choose the best one by hand either. A new approach for selecting the optimal statement for a query is needed.

From our observation, there are some regular mapping patterns between SQL and HiveQL, so we propose an automatic translation tool named QMapper. It can accelerate the SQL migration process of legacy RDBMS applications and avoid manual mistakes. Besides, one SQL statement may generate several equivalent candidate HiveQL statements, and QMapper can be used to choose the near-optimal one based on our cost model.

Figure 2 shows the translation work in detail. We divide this work into three phases. In phase one, with QMapper, each SQL statement is translated into the corresponding HiveQL, and the near-optimal variation is selected automatically. Details about translation will be given in the rest of this paper. In phase two, we validate the correctness of each translated HiveQL statement based on sample data. We run count, sum and avg queries on the SQL statement and its corresponding HiveQL separately. If both results are the same, the HiveQL is considered to be correct. In phase three, we organize the HiveQL queries and generate HiveQL script files, where each file corresponds to a stored procedure in RDBMS. If the script files pass the correctness validation tests, they are carefully organized and deployed as a workflow in Oozie. The timing of running these script files is coordinated by Oozie.

3. QMAPPER OVERVIEW

[Figure 3: QMapper Architecture Overview. Labels in the figure include: RDBMS (DBMS-X, DBMS-Y, DBMS-...); SQL; QMapper; SQL Interpreter; Query Graph Model; Query Rewriter; Rule-Based Rewriter; Cost-Based Optimizer; Statistics Collector; Dumping Collector; Background Collector; Plan Evaluator; Hive Compiler; MR Plan DAGs; Cardinality Estimation; PostgreSQL; Pg_stat; Statistics; HiveQL Queries; HiveQL; Hive; Impala; Shark.]

Figure 3 shows the architecture of QMapper, which contains four components:

SQL Interpreter: This component resolves the SQL query provided by a user and parses that query into an Abstract Syntax Tree (AST). Then a Tree Walker traverses the AST and further interprets it into a Query Graph Model (QGM) [7], which is more convenient to manipulate. This interpretation process includes analyzing the relationship between tables and subqueries, etc. After a QGM is built, it is sent to the rewriter.

Query Rewriter: The Query Rewriter is composed of two phases. In the first phase, a Rule-Based Rewriter (RBR) checks if a query matches a series of static rules. These rules can be triggered to adjust that query, and new equivalent queries are generated. In the second phase, the Cost-Based Optimizer (CBO) is used to further optimize the join structure of each query. In this phase, cardinality estimation in PostgreSQL is involved to calculate the cost. The cost model is discussed in Section 5.

Statistics Collector: The Statistics Collector is responsible for collecting statistics of related tables and their columns. This information is important for estimating the cardinality used to calculate the cost. The collecting approach is discussed in Section 6.1.

Plan Evaluator: The Plan Evaluator is the final stage in QMapper. The queries with equivalent join cost generated by RBR are sent to it. It takes into account non-join operators like UNION ALL, GROUP BY, etc. to distinguish them.
For each query, the Plan Evaluator first invokes Hive to generate the MapReduce workflows, then uses PostgreSQL's cardinality estimation component (Pg_stat, shown in Figure 3, is a system table used for storing column statistics) to estimate the data selectivity of each operator in map and reduce. Finally, our cost model is applied to the MapReduce workflows to calculate their cost, and the query with the minimal cost is returned to the user.

4. QUERY REWRITING
Query rewriting in QMapper aims at generating more equivalent candidate HiveQL queries and increasing the probability of triggering more optimization mapping rules. The more candidates are available, the more chances there are to find the optimal solution. In QMapper, the rewriting rules are pluggable, which makes them easy to extend. Moreover, since query rewriting is performed outside Hive, it can also work together with MapReduce-level optimizers like YSmart [12]. Consider the following example, which retrieves the information of all Shipments of the Parts stored in Hangzhou or provided by Suppliers in the same city.

    SELECT * FROM Shipments SP
    WHERE EXISTS (
        SELECT 1 FROM Suppliers S
        WHERE S.SNO = SP.SNO AND S.city = 'Hangzhou')
    OR EXISTS (
        SELECT 1 FROM Parts P
        WHERE SP.PNO = P.PNO AND P.city = 'Hangzhou')

QMapper will generate a variety of versions of HiveQL that yield the same results. One version divides the two disjunctive predicates into separate SEMI JOINs and then performs a UNION ALL. Another version transforms each EXISTS into a LEFT OUTER JOIN and replaces the EXISTS predicate with joinColumn IS NOT NULL. The first one uses SEMI JOIN so that it can remove unnecessary tuples earlier than OUTER JOIN. However, a tuple may satisfy both predicates at the same time, which produces duplicates after UNION ALL. Thus, an additional operation can be added by QMapper to eliminate the duplicates. Next, we first introduce the rewriting rules of RBR and then describe the mechanism of CBO in the Query Rewriter.
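The equivalence of the two rewritten versions for the Shipments example can be checked on toy data. The following is a minimal sketch, not QMapper's output: it runs on SQLite through Python's `sqlite3` module rather than on Hive, the schema and rows are invented for illustration, and because SQLite has no LEFT SEMI JOIN each semi join is emulated by an inner join against DISTINCT keys. Removing duplicates with DISTINCT assumes that (SNO, PNO) is a key of Shipments, which is an assumption made here and not stated in the paper.

```python
import sqlite3

# Toy schema and rows; all values are illustrative, not SECICS data.
SETUP = """
CREATE TABLE Suppliers (SNO TEXT PRIMARY KEY, city TEXT);
CREATE TABLE Parts     (PNO TEXT PRIMARY KEY, city TEXT);
CREATE TABLE Shipments (SNO TEXT, PNO TEXT, qty INTEGER, PRIMARY KEY (SNO, PNO));
INSERT INTO Suppliers VALUES ('S1', 'Hangzhou'), ('S2', 'Ningbo');
INSERT INTO Parts     VALUES ('P1', 'Hangzhou'), ('P2', 'Wenzhou');
INSERT INTO Shipments VALUES ('S1','P1',10), ('S1','P2',20), ('S2','P1',30), ('S2','P2',40);
"""

# The original query with two disjunctive EXISTS subqueries.
ORIGINAL = """
SELECT SP.SNO AS SNO, SP.PNO AS PNO, SP.qty AS qty FROM Shipments SP
WHERE EXISTS (SELECT 1 FROM Suppliers S WHERE S.SNO = SP.SNO AND S.city = 'Hangzhou')
   OR EXISTS (SELECT 1 FROM Parts P WHERE SP.PNO = P.PNO AND P.city = 'Hangzhou')
"""

# Version 1: one semi-join-like branch per predicate, combined with UNION ALL.
# Joining against DISTINCT keys stands in for LEFT SEMI JOIN (assumption).
SEMI_UNION_ALL = """
SELECT SP.SNO AS SNO, SP.PNO AS PNO, SP.qty AS qty FROM Shipments SP
  JOIN (SELECT DISTINCT SNO FROM Suppliers WHERE city = 'Hangzhou') S ON SP.SNO = S.SNO
UNION ALL
SELECT SP.SNO AS SNO, SP.PNO AS PNO, SP.qty AS qty FROM Shipments SP
  JOIN (SELECT DISTINCT PNO FROM Parts WHERE city = 'Hangzhou') P ON SP.PNO = P.PNO
"""

# The extra step that removes tuples satisfying both predicates twice.
# Valid only because (SNO, PNO) is a key of Shipments in this toy schema.
DEDUPLICATED = "SELECT DISTINCT SNO, PNO, qty FROM (" + SEMI_UNION_ALL + ") AS u"

# Version 2: LEFT OUTER JOIN plus joinColumn IS NOT NULL.
OUTER_JOIN = """
SELECT SP.SNO AS SNO, SP.PNO AS PNO, SP.qty AS qty FROM Shipments SP
  LEFT OUTER JOIN (SELECT DISTINCT SNO FROM Suppliers WHERE city = 'Hangzhou') S
    ON SP.SNO = S.SNO
  LEFT OUTER JOIN (SELECT DISTINCT PNO FROM Parts WHERE city = 'Hangzhou') P
    ON SP.PNO = P.PNO
WHERE S.SNO IS NOT NULL OR P.PNO IS NOT NULL
"""


def build_db():
    """Create an in-memory database holding the toy tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SETUP)
    return conn


def rows(conn, sql):
    """Run a query and return its rows in a deterministic order."""
    return sorted(conn.execute(sql).fetchall())


if __name__ == "__main__":
    conn = build_db()
    print("original     :", rows(conn, ORIGINAL))
    print("union all    :", rows(conn, SEMI_UNION_ALL))
    print("deduplicated :", rows(conn, DEDUPLICATED))
    print("outer join   :", rows(conn, OUTER_JOIN))
```

In this data set the shipment (S1, P1) satisfies both predicates, so the UNION ALL version returns it twice until the duplicate-elimination step is applied, while the outer join version needs no such step.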
4.1 Rule-based Rewriter
The Rule-based Rewriter tries to detect the SQL clauses that are not supported well by Hive and transform them into HiveQL. Having found regular mapping patterns between SQL and HiveQL, we summarize them into different rules. In the translation process, some initial rules are first invoked to check if the query can be rewritten. If the condition is satisfied, the rule is applied to the query and some new equivalent queries are returned. Then the RBR traverses the subqueries of each query and applies rules to them recursively. After that, all rewritten queries are generated and sent to the CBO. Here, we only introduce some typical rules, like UPDATE and (NOT) EXISTS, to demonstrate how RBR works.

    lvRate(uid, deviceid, isMissing, date, type)
    dataProfile(dataid, uid, isActive)
    dataRecord(dataid, date, consumption)
    powerCut(uid, date)
    gprsUsage(deviceid, dataid, date, gprs)
    deviceInfo(deviceid, region, type)

The above tables are designed according to SECICS's real scenarios, and they will be used to explain the rules. Table lvRate is calculated on a daily basis, and can be looked up to check whether a device should re-collect the consumption information or to indicate the malfunction of sensors. dataRecord stores the consumption data uploaded from sensors, and dataProfile is the mapping between low-voltage customers and sensors' data. Moreover, powerCut records the power cut information. If some users experience a power cut, their data should not be collected.

4.1.1 Basic UPDATE Rule
Trigger Pattern:

    UPDATE table SET column = CONSTANT, ...
    LEFT OUTER JOIN table ON condition, ...
    WHERE simpleCondition, ...

Example:

    UPDATE lvRate a SET a.isMissing = true
    LEFT OUTER JOIN dataProfile b ON a.uid = b.uid
    LEFT OUTER JOIN dataRecord c ON b.dataid = c.dataid AND a.date = c.date
    WHERE c.dataid IS NULL

Description: The above example updates lvRate.isMissing to true if there is no corresponding record stored in dataRecord, which means the consumption data on that day was not collected because of some failure. Setting isMissing to true will let the device re-upload its data. This rule translates UPDATE into a SELECT statement by moving the simpleCondition into the selectList. In this case, the output of this rule will be:

    INSERT OVERWRITE TABLE lvRate
    SELECT a.uid, a.deviceid,
           IF(c.dataid IS NULL, true, false) AS isMissing,
           a.date, a.type
    FROM lvRate a
    LEFT OUTER JOIN dataProfile b ON a.uid = b.uid
    LEFT OUTER JOIN dataRecord c ON b.dataid = c.dataid AND a.date = c.date

4.1.2 (NOT) EXISTS Rule
Trigger Pattern:

    SELECT selectList FROM table
    JOIN table ON condition, ...
    WHERE simpleCondition, ...

Example:

    DELETE FROM lvRate a
    WHERE NOT EXISTS (
        SELECT 1 FROM powerCut b
        WHERE a.uid = b.uid AND a.date = b.date)

Description: The simpleCondition includes EXISTS subqueries and comparisons between columns and constants that are connected by conjunctions. Since (NOT) EXISTS is not fully supported by HiveQL, this rule is applied after the initial rules to flatten the subqueries into JOIN clauses. As shown in the example at the beginning of Section 4, there are two approaches to flatten an EXISTS subquery:

1. Transforming the subquery into a SEMI JOIN and then extracting the JOIN condition from the WHERE clause in the subquery. This approach only works for the EXISTS clause.

2. Transforming the subquery into a LEFT OUTER JOIN and replacing the (NOT) EXISTS condition with joinColumn IS (NOT) NULL. This approach can be applied to both EXISTS and NOT EXISTS subqueries.
In this example, the second approach is used to generate:

    INSERT OVERWRITE TABLE lvRate
    SELECT a.uid, a.deviceid, a.isMissing, a.date, a.type
    FROM lvRate a
    LEFT OUTER JOIN (
        SELECT uid, date FROM powerCut) b
    ON a.uid = b.uid AND a.date = b.date
    WHERE b.uid IS NULL

In order to make the rule easier to understand, the original subquery style is kept in the new JOIN clause even if there is only one table. This will not harm the performance, since Hive's internal optimizer can further optimize it.

4.2 Cost-based Optimizer

[Figure 4: Join Plan Example. (a) Left-deep Plan; (b) Bushy Plan. Both plans join the tables A, B, C, D and E.]

As the cost models of Hive and RDBMS are different, an efficient SQL query may not achieve high performance in Hive if we only translate it directly into HiveQL. QMapper's Cost-based Optimizer is used to optimize the join order of a query. Consider the following example, which tries to get the daily GPRS usage of normally running devices in a region according to the device type:

    SELECT sum(gprs), type
    FROM gprsUsage A
    JOIN deviceInfo B ON A.deviceid = B.deviceid
    JOIN dataRecord C ON A.dataid = C.dataid AND A.date = C.date
    JOIN dataProfile D ON C.dataid = D.dataid
    LEFT OUTER JOIN powerCut E ON D.uid = E.uid AND A.date = E.date
    WHERE E.uid IS NULL AND A.date = '2014-01-01'
    GROUP BY B.type

A left-deep tree plan is applied for this query, as Figure 4(a) shows. One goal of the optimization here is to reduce the execution time by running jobs concurrently.
QMapper can adjust the join order by rewriting the query like this:

    SELECT sum(gprs), type
    FROM (
        SELECT T1.gprs, T1.date, T1.type, T2.uid
        FROM (
            SELECT A.gprs, A.dataid, A.date, B.type
            FROM gprsUsage A
            JOIN deviceInfo B ON A.deviceid = B.deviceid
            WHERE A.date = '2014-01-01'
        ) T1
        JOIN (
            SELECT C.dataid, C.date, D.uid
            FROM dataRecord C
            JOIN dataProfile D ON C.dataid = D.dataid
        ) T2
        ON T1.dataid = T2.dataid AND T1.date = T2.date
    ) T
    LEFT OUTER JOIN powerCut E ON T.uid = E.uid AND T.date = E.date
    WHERE E.uid IS NULL
    GROUP BY type

Now, an equivalent bushy plan is generated, as Figure 4(b) shows. Different from traditional databases, MapReduce-based query processing writes intermediate join results back to HDFS, and the next join operation reads them from HDFS too, causing large I/O costs. So, another significant goal of join optimization in MapReduce is to reduce the size of intermediate results. Comparing the above left-deep plan A ⋈ B ⋈ C ⋈ D ⋈ E with the bushy plan (A ⋈ B) ⋈ (C ⋈ D) ⋈ E, the main difference in intermediate results is that the left-deep plan generates A ⋈ B ⋈ C and the bushy plan generates C ⋈ D. Thus, the sizes of A ⋈ B ⋈ C and C ⋈ D are important for comparing the two plans. On the other hand, in the bushy tree plan, A ⋈ B and C ⋈ D may execute concurrently, reducing the total execution time. So, concurrent jobs should also be taken into consideration. QMapper's CBO evaluates these join plans according to our cost model and chooses the best one for the query.

In the query optimization, both left-deep trees and bushy trees are explored, as shown in Figure 4. In left-deep trees, intermediate results are joined with one base table at a time until all join tables are covered. Since the join operations proceed step by step, we can add up the cost of each join step to get the final cost of the plan. However, this may not yield the optimal plan, as computing resources may not be fully exploited.
In bushy trees, intermediate results can be joined with each other, so join operations can be executed concurrently. Concurrent jobs may seem more efficient, but they can also lead to worse performance, because the jobs compete for computing resources. We believe that considering left-deep plans and bushy plans is enough to obtain a near-optimal plan. So, we establish a cost model to evaluate them and choose the one with the minimum cost.

We use a bottom-up method to construct the join tree. To prune the search space, a dynamic programming algorithm is used to find the best join plan for a query. The cost model applied by the algorithm is introduced in Section 5. The input is a query with its join tables and their relations. Initially, the tables involved are recorded in the query plan as base tables. Then, the number of joined tables gradually increases from two to all. We build left-deep and bushy plans for them. Meanwhile, we compare the plans using our cost model and continually prune the inferior ones. In the end, the best plan covering all join tables is returned as the result.

5. COST MODEL

QMapper's cost model is inspired by [8], which is used by [9] for tuning MapReduce parameters. However, that cost model cannot be directly applied to QMapper, since it tries to simulate the whole process of the MapReduce framework, and some parameters used in the model are highly specific to each task, such as the per-byte CPU cost of compressing the output. As a result, sampling is required to gather the parameters for each task. In QMapper, the cost model is also designed to capture the cost of each phase in MapReduce. We focus on the cost of Hive's operators and on the most time-consuming factors such as I/O operations. By filling the parameters with data collected by probes and the estimated cardinality, the cost model can be universally applied.

Table 1: Parameters of QMapper's Cost Model

    Symbol               Description
    $Disk_r$             Disk reading speed (bytes/second)
    $Disk_w$             Disk writing speed (bytes/second)
    $HDFS_w$             HDFS writing speed (bytes/second)
    $M_{local}$          Map Local Ratio
    $M_{spsize}$         Map Split Size (bytes)
    $M_{out.rec}$        Number of Map Output Records
    $M_{out.avgbytes}$   Average Bytes of Map Output Records
    $R_{out.rec}$        Number of Reduce Output Records
    $R_{out.avgbytes}$   Average Bytes of Reduce Output Records
    $Network$            Data Transfer Speed via Network (bytes/second)
    $N_{map}$            Estimated Number of mappers
    $N_{m\_max}$         Maximum Number of mappers
    $N_{reduce}$         Estimated Number of reducers
    $N_{r\_max}$         Maximum Number of reducers

It is well known that disk and network I/O costs are the main reasons that slow down MapReduce tasks. However, no data has been available indicating how, and to what extent, I/O costs affect each phase of MapReduce. We add a few counters to the MapReduce framework in order to find the time-consuming factors. Figure 5 shows the detailed time usage of one representative MapReduce job generated by Hive. We run the job on our cluster and collect the time usage in each phase. Hive's operators, together with I/O-dominated phases such as Merge and Shuffle, clearly take up more than 80% of the execution time. In this figure, other costs include sorting during the Merge phase as well as data serialization and deserialization in the map and reduce functions.

[Figure 5: Execution Time of Each MapReduce Phase. Stacked bars for Map and Reduce; y-axis: time usage (seconds). Map segments: Map Read, Map Operators, Map Other, Spill Sort, Spill Write, Merge Read, Merge Write. Reduce segments: Shuffle, Merge Read, Merge Sort, Merge Write, Reduce Operators, Reduce Write, Reduce Other.]

In this section, we first introduce the cost model for evaluating a single MapReduce job. Then, the approach of estimating the costs of Hive operators is discussed. Finally, we describe how to evaluate the cost of MapReduce workflows.

5.1 Cost of MapReduce

MapReduce's programming model is abstracted as two parts, Map and Reduce. The Map phase can be divided into three subphases: Map, Spill and Merge. The Reduce phase also includes three parts: Shuffle, Merge and Reduce. We analyze the cost of each phase and combine them into the overall cost of one MapReduce job.

In the Map phase, the mappers read data from DFS and process them. For the reading part, Hadoop tries to assign each mapper to the node where its input data is stored, and each mapper processes $M_{spsize}$ bytes of data concurrently. We can model this part by estimating the time consumption of one mapper. The expected cost of reading data is:

$$Cost(M_{read}) = M_{local} \times \frac{M_{spsize}}{Disk_r} + (1 - M_{local}) \times \frac{M_{spsize}}{Network}$$

So the total cost of the map subphase is:

$$Cost(M_{map}) = Cost(M_{read}) + Cost(M_{ops})$$

where $Cost(M_{ops})$ denotes the time spent on processing data, which is discussed in Section 5.2.

After the (key, value) pairs are partitioned and collected into a memory buffer, a spill thread may simultaneously clear the buffer and write the records into a Segment file if the size of the data stored in memory exceeds the maximum of either the accounting buffer (which stores metadata for each record) or the key/value buffer. Even if the memory buffer has enough space to store the data, the spill thread materializes them into a single file before the merge phase. Thus, the writing cost of the spill phase can be simply estimated as:

$$Cost(M_{spill}) = \frac{M_{out.rec} \times M_{out.avgbytes}}{Disk_w}$$

The spill phase may generate multiple Segment files, and the goal of the merge phase is to sort and merge these Segment files into a single file. If there is only one Segment, the merge phase is bypassed, since the Segment can be shuffled to the reducer directly. This phase performs an external merge-sort, so it may need multiple rounds to generate the final output. Here we simplify the processing logic and assume that the merge phase completes in a single round.
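The map-side read and spill estimates of Section 5.1 can be sketched as follows. This is a minimal sketch, not QMapper's implementation: the function names are hypothetical, the cluster parameters are the values the paper reports in Table 2, and the workload values (`OUT_RECORDS`, `OUT_AVG_BYTES`, `OPS_SECONDS`) are invented for illustration.

```python
def map_read_cost(split_bytes, local_ratio, disk_read_bps, network_bps):
    """Cost(M_read): expected seconds for one mapper to read its split."""
    local = local_ratio * split_bytes / disk_read_bps
    remote = (1.0 - local_ratio) * split_bytes / network_bps
    return local + remote


def map_spill_cost(out_records, out_avg_bytes, disk_write_bps):
    """Cost(M_spill): seconds to write the map output to local disk."""
    return out_records * out_avg_bytes / disk_write_bps


def map_subphase_cost(read_seconds, ops_seconds):
    """Cost(M_map) = Cost(M_read) + Cost(M_ops)."""
    return read_seconds + ops_seconds


# Cluster parameters as reported in Table 2 of the paper.
DISK_R = 66_584_576     # bytes/second
DISK_W = 61_027_123     # bytes/second
NETWORK = 44_040_192    # bytes/second
M_LOCAL = 0.67          # map local ratio
M_SP_SIZE = 67_108_864  # bytes, default dfs.block.size

# Hypothetical workload values, chosen only for illustration.
OUT_RECORDS = 1_000_000
OUT_AVG_BYTES = 50
OPS_SECONDS = 3.0

read_s = map_read_cost(M_SP_SIZE, M_LOCAL, DISK_R, NETWORK)
spill_s = map_spill_cost(OUT_RECORDS, OUT_AVG_BYTES, DISK_W)
map_s = map_subphase_cost(read_s, OPS_SECONDS)
print(round(read_s, 3), round(spill_s, 3), round(map_s, 3))
```

Because only about two thirds of the split is read locally, the slower network share contributes a comparable amount to the read time even though it covers less data.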
Under this single-round assumption, the cost of the merge phase simplifies to:

$$Cost(M_{merge}) = \frac{M_{out.rec} \times M_{out.avgbytes}}{Disk_r} + \frac{M_{out.rec} \times M_{out.avgbytes}}{Disk_w}$$

Unlike normal MapReduce jobs, in Hive the internal logic of the mappers may vary depending on the specific table to be processed. Thus, the costs of processing each table are evaluated independently. Suppose the number of input tables is $n$; the total cost of Map is then determined by the mapper that takes the longest time:

$$Cost(M) = \max_{1 \le i \le n} \{ Cost(M^i_{map}) + Cost(M^i_{spill}) + Cost(M^i_{merge}) \}$$

In the reduce phase, shuffle is responsible for fetching the mappers' outputs to their corresponding reducers. Here we treat the input tables as a whole. For simplicity, we assume the output data is evenly distributed, so the total data size each reducer receives from Segment files is

$$Seg_{r.size} = \frac{\sum_{i=1}^{n} (M^i_{out.rec} \times M^i_{out.avgbytes} \times N^i_{map})}{N_{reduce}}$$

where $n$ is the number of input tables. So, the network cost of the shuffle phase is:

$$Cost(R_{shuffle}) = \frac{Seg_{r.size}}{Network}$$

The merge phase is executed concurrently with shuffle. When the reduce memory buffer reaches its threshold, the merge threads are triggered. Similar to the merge phase on the Map side, the merge phase of Reduce sorts and merges the Segment files from the Map tasks. The difference is that some Segment files may already have been merged during the shuffle phase. Here, we again assume that a single round is enough to merge the Segment files into the final one. Thus, the cost of this phase can be simplified as:

$$Cost(R_{merge}) = \frac{Seg_{r.size}}{Disk_r} + \frac{Seg_{r.size}}{Disk_w}$$

The cost of the reduce function includes the costs of Hive operators (discussed in Section 5.2) and writing costs. In Hive, the writing part is handled by FileSinkOperator, which writes data directly to HDFS. So the cost is:

$$Cost(R_{reduce}) = Cost(R_{ops}) + \frac{R_{out.avgbytes} \times R_{out.rec}}{HDFS_w}$$

5.1.1 Total Cost

Note that the cost model does not consider the costs and effects of the Combiner in the spill and merge phases, since Hive does not use a Combiner for data processing. The total cost of the whole MapReduce job is the sum of the costs discussed above. When the required number of mappers or reducers exceeds the cluster's capacity, some mappers and reducers have to wait; $P_m = \frac{N_{map}}{N_{m\_max}}$ and $P_r = \frac{N_{reduce}}{N_{r\_max}}$ are then used as punishment coefficients. Thus, the total cost is:

$$Cost(Job) = P_m \times Cost(M) + P_r \times \{ Cost(R_{shuffle}) + Cost(R_{merge}) + Cost(R_{reduce}) \}$$

5.2 Cost of Operators in Map and Reduce

Hive embeds the operators in the map and reduce functions, and the records are processed by each operator iteratively. The cost of each operator is considered CPU cost. To calculate these costs, a few sample queries based on TPC-H are designed as probes to collect the execution time of operators such as FilterOperator, JoinOperator and GroupByOperator. Also, since we have added counters to Hive's operators, these statistics can be updated according to logs. The cost of FileSinkOperator is ignored, since its responsibility is to write output data to HDFS, whose I/O cost is already accounted for in Section 5.1.

Because of the variety of the operators' internal implementations, it is hard to estimate precisely the time consumption of each operator for different input data and filters. As we observe in real applications, the time usage of a Hive operator increases almost linearly as the number of input records grows. Thus, we treat each operator as a black box whose cost is defined as the execution time of processing a given number of records. Since there may be multiple tables as input, $N^j_{tab.rec}$ is used to denote the number of records of table $j$. For an operator, linear regression is used to build the cost function, represented as $f(\sum_{j=1}^{m} N^j_{tab.rec})$. After the cost functions of the operators have been derived by analyzing the logs of sample queries, the cost of a chain with $n$ operators is evaluated as:

$$\sum_{i=1}^{n} f_i\Big(\sum_{j=1}^{m} N^j_{tab.rec}\Big)$$

Notice that, in each iteration $i$, the internal $m$ and table parameters may differ due to the effects of JoinOperator and FilterOperator. How these parameters are calculated is discussed in Section 6.2.

5.3 Cost of Workflow

A HiveQL query is finally compiled into a MapReduce workflow (a directed acyclic graph), where each node is a single MapReduce job and each edge represents the dataflow. As in the example shown in Figure 6, QMapper groups MapReduce jobs (each stage stands for a MapReduce job) and task chains that can be executed concurrently into the same depth. Here, we simply assume that the resources are sufficient for concurrent jobs. Since the cost of each job is already known, the sum of the costs of all depths is the total cost of a query:

$$Cost(MR_{workflow}) = \sum_{i=1}^{n} Cost(Depth_i)$$

where $Cost(Depth_i) = \max\{ \sum_{k=1}^{m} Cost(Stage^k_i) \}$. The expression $\sum_{k=1}^{m} Cost(Stage^k_i)$ denotes the cost of a task chain within Depth-$i$.

[Figure 6: MapReduce Workflow Depth. A DAG of stages (MapReduce jobs) grouped into Depth 1 to Depth 4.]

6. IMPLEMENTATION

6.1 Statistics Collection

QMapper leverages PostgreSQL for cardinality estimation. To do that, PostgreSQL's source code has been modified so as to fetch statistics from an external source and write them into the system table Pg_stat. In this way, there is no need to actually insert data into PostgreSQL. The statistics are collected by the Statistics Collector. As in PostgreSQL, besides the number of tuples, five metrics are collected for each column: (1) Most Common Values, (2) Histogram, (3) Null Fraction, (4) Distinct Ratio and (5) Average Bytes. We implement MapReduce programs based on PostgreSQL's analyzing module, and the Statistics Collector runs these programs periodically to calculate the statistics.
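The five per-column metrics listed above can be sketched for a single in-memory sample as follows. This is a simplified illustration, not the MapReduce programs or PostgreSQL's analyzing module: `column_stats` and its parameters are hypothetical, the histogram is a plain equi-depth histogram over all non-null values, and byte sizes are approximated by the UTF-8 length of each value's string form.

```python
from collections import Counter


def column_stats(values, n_common=2, n_buckets=4):
    """Compute simplified per-column statistics from a sample of values."""
    total = len(values)
    non_null = [v for v in values if v is not None]
    if total == 0 or not non_null:
        # Nothing to summarize beyond the share of NULLs.
        return {"null_fraction": 1.0 if total else 0.0}
    counts = Counter(non_null)
    ordered = sorted(non_null)
    last = len(ordered) - 1
    return {
        # (value, frequency as a fraction of all sampled rows)
        "most_common_values": [
            (v, c / total) for v, c in counts.most_common(n_common)
        ],
        # Bucket boundaries of an equi-depth histogram.
        "histogram": [
            ordered[(i * last) // n_buckets] for i in range(n_buckets + 1)
        ],
        "null_fraction": (total - len(non_null)) / total,
        "distinct_ratio": len(counts) / len(non_null),
        "average_bytes": sum(len(str(v).encode("utf-8")) for v in non_null)
        / len(non_null),
    }


sample = [3, 1, 3, None, 7, 3, 9, 1]
stats = column_stats(sample)
print(stats)
```

Each metric depends only on the values of one attribute, so the computation for different attributes is independent.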
The mappers perform sampling on the tuples and send them to the reducers, where the key is the attribute name and the value is the value of that attribute. Thus, one reducer is responsible for calculating the statistics of only one attribute, and all attributes can be processed concurrently.

6.2 Query Cost Evaluation

Query cost evaluation is used in two stages: in and after CBO. For example, given an input query A, RBR may generate two candidate queries A1 and A2. Then, the two queries are sent to CBO to optimize the join order. For each query, CBO adjusts its join plan and evaluates it using the cost model in Section 5. In this stage, we do not need Hive to compile the query into MapReduce workflows. Because only Join and Filter are of concern in this stage and Hive's strategy for processing join operations is predictable, CBO can generate the workflows for evaluation itself. After CBO, there are still two queries A1 and A2, both now optimized. Then, QMapper invokes Hive's planner to compile them into MapReduce workflows, and the cost model is used again to evaluate the whole plan. The one with the minimum cost is then returned to the end user.

As previously mentioned, QMapper leverages PostgreSQL for cardinality estimation, and the key problem is how to map the estimated results to the MapReduce plan. First, to make sure that PostgreSQL's optimizer does not rewrite the SQL query, the join optimizer in PostgreSQL is disabled by setting join_collapse_limit = 1. Then, using the explain command, the query is sent to PostgreSQL to obtain its cardinality estimation result in JSON format. The estimation result can also be treated as a DAG. The difference between this DAG and the MapReduce plan DAG is that it is composed of operators such as Hash Join, Seq Scan, Sort and GroupAggregate. The operators' dependencies are constructed based on the structure of the blocks in a query, which can be mapped to our MapReduce plan. For example, the output row number of a Seq Scan operator with filters can be mapped to a table scan mapper to calculate its output data size, and the output row number of a Hash Join operator can be used to estimate the output data size of the Reduce phase. We use filters, aliases, table names and join conditions as identifiers to retrieve the cardinality information of the corresponding operator correctly.

6.3 Tuning the Number of Reducers

We observe that another important factor affecting the performance of a MapReduce job is the number of reducers. By default, Hive sets this parameter to $\frac{MapInputSize}{2 \times 1024^3}$. However, this is not always a good choice, because it assumes that many records are filtered out in the Map phase. In some cases, a single reducer ends up with too much work. In QMapper, since the total output data size of the Map side can be estimated, we enable Hive to set the number of reducers per stage. The recommended setting provided by QMapper is:

$$N_{reduce} = \frac{M_{out.avgbytes} \times M_{out.rec} \times N_{map}}{512 \times 1024^2}$$

7. EXPERIMENTS

In this section, we evaluate the correctness and efficiency of QMapper. Here, efficiency has two aspects: the efficiency of translating SQL into HiveQL, and the execution efficiency of the HiveQL produced by QMapper compared with manually translated queries. Experiments based on TPC-H demonstrate the execution efficiency of the HiveQL generated by QMapper, and the Smart Grid application shows the correctness and translation efficiency of QMapper.

7.1 TPC-H Workloads Evaluation

7.1.1 Experiment Environment

We perform the experiments on our in-house cluster, which consists of 30 virtual nodes. Each of them has 8 cores and 8GB RAM. All nodes are installed with CentOS 6.4, Java 1.6.0_22 and Hadoop 1.2.1. Hive 0.12.0 is deployed, and 10GB, 20GB, 50GB and 100GB TPC-H data sets are generated as workloads. By default, the cluster contains 30 nodes and the size of the data set is 10GB. The replica number of files stored in HDFS is set to 3. The values of the parameters collected from this cluster are listed in Table 2. In most cases, more than 2 tasks are running on the same machine, so the values of $Disk_w$, $Disk_r$ and $Network$ are smaller than the ideal ones because of resource contention. Other parameters are set to their default values.

Table 2: Parameters of Experiments

    Symbol            Value
    $Disk_r$          66584576 (Bytes/second)
    $Disk_w$          61027123 (Bytes/second)
    $HDFS_w$          46137344 (Bytes/second)
    $P_{r.task.mem}$  1024 (MB)
    $M_{local}$       0.67
    $M_{spsize}$      Default: dfs.block.size=67108864 (Bytes)
    $Network$         44040192 (Bytes/second)
    $N_{m\_max}$      140
    $N_{r\_max}$      84

7.1.2 Overall Performance

[Figure 7: Performance of Full TPC-H Queries. x-axis: query (1 to 22); y-axis: execution time (seconds); series: HiveM, QMapper.]

Figure 7 shows the overall performance of the full set of TPC-H queries. We choose HiveM (see footnote 2) and QMapper for comparison. In general, we can divide the TPC-H queries into three types: queries with few translation variations, queries containing subqueries and queries with multiple join tables. For queries with few translation variations (Q1, Q6, Q12, Q13, Q14, Q19), the translation results of QMapper are quite similar to HiveM, so the execution time of these queries is almost the same. For queries containing subqueries (Q2, Q4, Q11, Q15, Q16, Q17, Q18, Q20, Q21, Q22), QMapper and HiveM choose different translation strategies. QMapper converts the subqueries into join operations, whereas HiveM separates the subqueries from the original ones, so the final result consists of several queries. Usually, as the translation methods of HiveM involve more MapReduce tasks, the execution time of QMapper's results is better than that of HiveM (Q4, Q16, Q17, Q18), but the performance improvement is not that obvious.
Sometimes the results of HiveM are even better, since the extra join operations introduced by QMapper take longer (Q15, Q22). Q11 is a special case: the join operations in the subquery are the same as those outside, so HiveM reuses the results of the subquery and dramatically reduces the overall execution time. For the other queries (Q2, Q20, Q21), QMapper further adjusts their join orders, which greatly shortens the execution time. For queries with multiple join tables (Q2, Q3, Q5, Q7, Q8, Q9, Q10, Q20, Q21), QMapper leverages the cost-based optimizer to adjust the join orders. For Q2, Q3 and Q10, as there are not many join tables, the join orders in HiveM are already good and leave little room for optimization by QMapper. For the other queries, QMapper dramatically improves the performance. In the next section, we analyze these typical queries in detail.

Footnote 2: These queries are TPC-H queries rewritten by Hive contributors for benchmark purposes. https://issues.apache.org/jira/browse/HIVE-600

7.1.3 Join Performance

Figure 8 shows the performance of five typical TPC-H queries with multiple join tables. Here we add HiveM+YSmart and QMapper+YSmart for comparison. In our settings, YSmart is enabled by setting hive.optimize.correlation = true in Hive. In all five cases, QMapper achieves the best performance. For Q5, Q8 and Q20, QMapper achieves an improvement of about 44%. For Q7 and Q9, QMapper improves the performance by more than 50%. Even for experienced engineers, it is very difficult to manually optimize complex queries that involve five or more tables, and intuitive decisions often lead to sub-optimal plans. Thus, it is necessary to estimate the costs and identify the best query automatically; QMapper provides such a solution and performs well. As shown in Figure 8, YSmart does not perform any transformations on the queries except for Q20. Figures 9 and 10 show Q20's MapReduce plans generated by QMapper+YSmart and QMapper respectively.
We can see that YSmart merges Stage-4 and Stage-8 into a single job. However, this does not bring any improvement, since the query rewritten by QMapper is already highly parallel. In this case, reducing the number of jobs does not affect the depth of the DAG. Thus, no improvement is gained when there are enough resources to run multiple jobs concurrently.

7.1.4 Scalability

Figures 11, 12, 13 and 14 show the scalability of the optimized queries. Figures 11 and 12 show the performance of Q7 and Q20 as the data size increases from 10GB to 100GB. Figures 13 and 14 show the performance of Q5 and Q8 with a data size of 20GB as the number of nodes scales from 10 to 30. In Figures 13 and 14, the execution time of the queries optimized by QMapper hardly decreases once there are more than 20 nodes in the cluster, while the execution time of the original queries keeps decreasing. Because QMapper balances the trade-off between reducing the size of intermediate data and improving parallelism, a 20-node cluster is already able to support the optimal plan, and adding more nodes does not help much at that data size.

7.1.5 Accuracy of Cost Model

Some experiments have been done to evaluate QMapper's cost model. We choose Q9, the most complicated query in TPC-H, to validate the accuracy of the cost model. Figure 15 shows the execution time of several query variations of Q9. These variations are selected from the intermediate results of QMapper's CBO. We can see that different query variations differ hugely in performance. Since our cost model does not consider factors such as the CPU costs of sorting and compressing, it cannot fully reflect the real execution time of queries. However, the cost model considers the main factors in each phase, which cover more than 80% of the execution time (Figure 5), so it is sufficient for choosing a good variation. Furthermore, we validate the cost model using a real-world data set.
We pick a typical query from SECICS, which joins six tables and is used to calculate the line loss rate, and execute some variations of the query on the 30-node cluster with a 25GB real-world data set. Figure 16 shows the comparison results. The performance of QMapper's cost model on the real-world data set is a little worse than on TPC-H. Since our cost model assumes that the data is evenly distributed (for example, we assume that the output of the map phase is evenly shuffled to each reducer), the data skew in the real-world data set affects the accuracy of the cost model. However, the cost model is practical enough to pick out a superior variation of the query.

7.2 Smart Grid Application

We built the Hadoop/Hive platform on a cluster composed of eight commodity servers (8 cores and 32GB memory each). Then we started to translate the stored procedures. With the help of QMapper, our six-member team finished the migration from SQL to HiveQL in only 6 weeks. According to the statistics, 90% of the SQL queries can be perfectly translated by QMapper. QMapper can handle queries with complex subqueries very well, but we still need to adjust some queries manually: (1) It is very difficult to find general translation rules for some SQL functions, such as window functions, so we have to deal with them manually. (2) Queries containing for or while loops still have to be translated by hand. (3) Queries containing rownum need to be analyzed carefully to choose a proper translation that ensures the correctness of the results. Since we performed the translation only with QMapper, we do not know how long a fully manual translation would have taken, but understanding the logic of the SQL code and translating it by hand would certainly have taken much longer. Correctness validation of the HiveQL and deployment of the workflows were performed manually. It took us another 8 weeks to finish this work.
We had to spend a fairly long time preparing the environment and importing data for validation. Correctness validation and workflow deployment are time-consuming, mainly because: (1) In some cases, it is difficult to guarantee that a SQL query and the corresponding HiveQL generate the same results. For example, for queries containing rownum, different records might be selected, and we must validate the impact of these results. (2) Considering the cost of initiating MapReduce tasks, we sometimes chose to merge queries to minimize the number of MapReduce tasks. (3) It is very complicated to debug at the workflow level. As we used new data to validate the workflow every day, once we found that the results of the workflow differed from those of the SQL stored procedure, we had to analyze the results of the HiveQL statement by statement to locate the error. (4) The workflow deployed on Oozie has to be carefully arranged to make full use of the cluster's resources. Some tasks were arranged to execute concurrently to improve the overall efficiency.

We use QMapper for the migration of five main businesses in SECICS: computing user electricity consumption, district line loss calculation, statistics of data acquisition rates, terminal traffic statistics and exception handling. After migration, 95% of the queries are executed by Hive and the others are deployed on Impala. To guarantee the stability of the migration, we have not removed the offline batch analysis tasks from the RDBMS. The meter data is currently duplicated and stored in both the RDBMS and HDFS. We compare the analysis results of the RDBMS and Hadoop/Hive every day to verify their consistency. The new solution has now been in production for eight months and works very well. Results show that, compared with the execution time in the RDBMS, the new environment is on average more than 3 times faster. With the growth of the meter data scale, the Zhejiang Grid plans to let Hive take full charge of the offline analysis tasks.

[Figure 8: Overall Join Performance. x-axis: query (Q5, Q7, Q8, Q9, Q20); y-axis: execution time (seconds); series: HiveM, HiveM+YSmart, QMapper, QMapper+YSmart.]
[Figure 9: Execution Plan of Q20, QMapper & YSmart.]
[Figure 10: Execution Plan of Q20, QMapper.]
[Figure 11: Effect of Data Size for Q7. x-axis: data size (10G, 20G, 50G, 100G); y-axis: execution time (seconds); series: HiveM, QMapper.]
[Figure 12: Effect of Data Size for Q20. x-axis: data size (10G, 20G, 50G, 100G); y-axis: execution time (seconds); series: HiveM, QMapper.]
[Figure 13: Effect of Cluster Size for Q5. x-axis: number of nodes (10 to 30); y-axis: execution time (seconds); series: HiveM, QMapper.]
[Figure 14: Effect of Cluster Size for Q8. x-axis: number of nodes (10 to 30); y-axis: execution time (seconds); series: HiveM, QMapper.]
[Figure 15: Accuracy of Optimizer for Q9. x-axis: optimal query variations of Q9 (1 to 4); series: Execution Time, Cost Estimation.]
[Figure 16: Accuracy of Optimizer for A Real World Query. x-axis: optimal variations of a real-world query (1 to 4); series: Execution Time, Cost Estimation.]

8. RELATED WORK

Automatic SQL-to-MapReduce mapping and optimization have attracted a lot of attention from both industry and academia. Current work mainly falls into three categories.

The first category adopts a rule-based approach to guide the SQL-to-MapReduce mapping procedure. Tenzing [14], proposed by Google, provides a SQL92-compatible interface and introduces four join algorithms based on MapReduce. It also supports a simple cost-based approach to switch the join orders. HadoopDB [1] uses MapReduce as a computing layer on top of multiple DBMS instances and pushes SQL queries down to the DBMS so as to utilize their sophisticated optimizers.
These systems provide the foundation for running SQL on MapReduce. However, most of the translations follow a specific pattern, which is to map each part of a SQL query into MapReduce jobs directly. Compared with the optimizations in RDBMS, many optimization opportunities are left unexploited, such as query rewriting and cost-based optimization.

In the second category, works including [2, 9, 10] propose cost-based approaches to determine appropriate configuration parameters for MapReduce jobs. They treat MapReduce as a black-box system. Using the statistics of pre-run jobs, more than 10 parameters are tuned, and the cost of a MapReduce job is evaluated during the optimization phase.

The work in the third category aims at optimizing MapReduce workflows directly. YSmart [12] is a rule-based, correlation-aware SQL-to-MapReduce translator that aims to reduce the number of jobs by merging jobs according to their relations. Stubby [13] is a cost-based optimizer for MapReduce workflows. It is designed to perform transformations on MapReduce workflows and searches through the workflow variations to find a near-optimal solution.

Moreover, many works focus on the optimization of join operations in MapReduce. AQUA [19] is a query optimizer that aims to reduce the cost of storing intermediate results by optimizing the order of join operations. AQUA uses a top-down approach. The join operators are divided into several groups, each of which is processed by a single MapReduce job. Then, a heuristic approach is used to connect these groups into a final join tree.

The works above are very important improvements for running SQL queries on MapReduce. However, current approaches mostly focus on optimization at the MapReduce level, while ignoring the varieties of SQL queries and their influences on the performance.
Moreover, unlike the well-known cost model described in [8], which simulates every stage of MapReduce and needs to pre-run the jobs on a sampled data set, QMapper's cost model only captures the I/O costs of each phase and computes the cost of each involved Hive operator based on its sampled cost and the run-time I/O data size. Besides, some benchmarks have been done to compare Hive and Impala. [5] compares the performance of Hive and Impala in detail, which can serve as a reference for the improvement of SECICS.

9. CONCLUSION

Traditional enterprises seek tools to migrate legacy RDBMS-based data analysis applications to Hive. This paper proposes QMapper, which applies rule-based rewriting and cost-based optimization to a given SQL query to generate efficient HiveQL. The experimental evaluations based on TPC-H demonstrate the effectiveness of QMapper as well as the accuracy of its cost model. Our real-world application in Smart Grid also shows its superiority. One direction of our future work is to integrate this approach with Shark. Multiple query optimization will be another direction for improvement.

10. ACKNOWLEDGEMENTS

This work is supported by the National Natural Science Foundation of China under Grant No. 61070027, 61020106002 and 61161160566.

11. REFERENCES

[1] A. Abouzeid, K. Bajda-Pawlikowski, D. Abadi, A. Silberschatz, and A. Rasin. Hadoopdb: an architectural hybrid of mapreduce and dbms technologies for analytical workloads. VLDB, 2(1):922–933, 2009.
[2] S. Babu. Towards automatic optimization of mapreduce programs. In SoCC, pages 137–142, 2010.
[3] K. Beyer, V. Ercegovac, R. Gemulla, A. Balmin, M. Eltabakh, C.-C. Kanne, F. Ozcan, and E. J. Shekita. Jaql: A scripting language for large scale semistructured data analysis. In VLDB, 2011.
[4] Y. Chen, S. Alspaugh, and R. Katz. Interactive analytical processing big data systems: A cross-industry study of mapreduce workloads. VLDB, 5(12):1802–1813, 2012.
[5] A. Floratou, U. F. Minhas, and F. Ozcan. Sql-on-hadoop: Full circle back to shared-nothing database architectures. Proceedings of the VLDB Endowment, 7(12):1295–1306, 2014.
[6] M. J. Franklin, B. T. Jónsson, and D. Kossmann. Performance tradeoffs for client-server query processing. ACM SIGMOD Record, 25(2):149–160, 1996.
[7] L. M. Haas, W. Chang, G. M. Lohman, J. McPherson, P. F. Wilms, G. Lapis, B. Lindsay, H. Pirahesh, M. J. Carey, and E. Shekita. Starburst mid-flight: as the dust clears. TKDE, 2(1):143–160, 1990.
[8] H. Herodotou. Hadoop performance models. arXiv preprint arXiv:1106.0940, 2011.
[9] H. Herodotou and S. Babu. Profiling, what-if analysis, and cost-based optimization of mapreduce programs. VLDB, 4(11):1111–1122, 2011.
[10] H. Herodotou, H. Lim, G. Luo, N. Borisov, L. Dong, F. B. Cetin, and S. Babu. Starfish: A self-tuning system for big data analytics. In CIDR, volume 11, pages 261–272, 2011.
[11] S. Hu, W. Liu, T. Rabl, S. Huang, Y. Liang, Z. Xiao, H.-A. Jacobsen, X. Pei, and J. Wang. Dualtable: A hybrid storage model for update optimization in hive. In ICDE, 2015. To appear.
[12] R. Lee, T. Luo, Y. Huai, F. Wang, Y. He, and X. Zhang. Ysmart: Yet another sql-to-mapreduce translator. In ICDCS, pages 25–36, 2011.
[13] H. Lim, H. Herodotou, and S. Babu. Stubby: A transformation-based optimizer for mapreduce workflows. VLDB, 5(11):1196–1207, 2012.
[14] L. Lin, V. Lychagina, W. Liu, Y. Kwon, S. Mittal, and M. Wong. Tenzing a sql implementation on the mapreduce framework. 2011.
[15] Y. Liu, S. Hu, T. Rabl, W. Liu, H.-A. Jacobsen, K. Wu, J. Chen, and J. Li. DGFIndex for Smart Grid: Enhancing Hive with a Cost-Effective Multidimensional Range Index. Proceedings of the VLDB Endowment, 7(13):1496–1507, 2014.
[16] C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. Pig latin: a not-so-foreign language for data processing. In SIGMOD, pages 1099–1110, 2008.
[17] A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, S. Anthony, H. Liu, P. Wyckoff, and R. Murthy. Hive: a warehousing solution over a map-reduce framework. VLDB, 2(2):1626–1629, 2009.
[18] A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, N. Zhang, S. Antony, H. Liu, and R. Murthy. Hive-a petabyte scale data warehouse using hadoop. In ICDE, pages 996–1005, 2010.
[19] S. Wu, F. Li, S. Mehrotra, and B. C. Ooi. Query optimization for massively parallel data processing. In SoCC, page 12, 2011.
[20] R. S. Xin, J. Rosen, M. Zaharia, M. J. Franklin, S. Shenker, and I. Stoica. Shark: Sql and rich analytics at scale. In SIGMOD, pages 13–24, 2013.
[21] Y. Xu and S. Hu. Qmapper: a tool for sql optimization on hive using query rewriting. In WWW, pages 211–212, 2013.