MapReduce Join Strategies for Key-Value Storage

2014 11th International Joint Conference on Computer Science and Software Engineering (JCSSE)

Duong Van Hieu, Sucha Smanchat, and Phayung Meesad
Faculty of Information Technology
King Mongkut’s University of Technology North Bangkok
Bangkok 10800, Thailand

Abstract—This paper analyses MapReduce join strategies used for big data analysis and mining, known as map-side and reduce-side joins. The most commonly used joins are analysed, namely the theta-join algorithms including all pair partition join, repartition join, broadcasting join, semi join, and per-split semi join. This paper can be considered a guideline for MapReduce application developers in the selection of join strategies. The analysis of several join strategies for big data analysis and mining is accompanied by comprehensive examples.

Keywords—MapReduce; join strategy; NoSQL

I. INTRODUCTION

With the continuous development of big data and cloud computing, it is believed that traditional database technologies are insufficient for data storage and access, as well as for performance and flexibility requirements. In the new era of big data, NoSQL databases are more appropriate than relational databases [1]. A Key-Value store, a kind of NoSQL database, is an appropriate choice for applications that use the MapReduce model for distributed processing. Key-Value stores offer only four underlying operators: inserting <key, value> pairs into a data collection, updating the values of existing pairs, finding the values associated with a specific key, and deleting pairs from a data collection [2].

Joining two data collections to produce a new dataset based on joining fields is the responsibility of programmers or application developers rather than of database management systems. However, several join strategies exist, each with different advantages and disadvantages. To provide programmers with a guideline for the selection of join strategies, this study analyses several join strategies for big data analysis and mining, accompanied by comprehensive examples.

The content of this paper is organised into four main sections. Section 2 gives an overview of the MapReduce programming model, Section 3 explains MapReduce join strategies, and Section 4 concludes with a comparison of the join strategies used in MapReduce.

II. MAPREDUCE OVERVIEW

MapReduce has been used at Google since February 2003. It was first presented in 2004 by Dean and Ghemawat [3] and later in Communications of the ACM in 2008 [4]. It is used for processing large datasets in a parallel or distributed computing environment. It is a combination of map processes and reduce processes. A map process is a function that processes a set of input <key, value> pairs, which is a portion of a large input dataset, to generate a set of intermediate <key, value> pairs. A reduce process with a reduce function merges all intermediate values generated by the map processes that are associated with the same intermediate key to form a possibly smaller set of <key, value> pairs, called final output <key, value> pairs.

Fig. 1 is a simple word counting example. The input string data “Advanced Research Methodology, Advanced Information Modelling and Database, Advanced Network and Information Security, Advanced Database and Distributed Systems” is divided into four blocks corresponding to each subject name separated by commas. A hash function mod(code(upper(left(key,1))),k)+1 is used for distributing intermediate <key, value> pairs to reduce tasks. Here left(key,1) takes the first letter of key, upper(x) changes x to upper case, code(x) returns the ASCII code of character x, and mod(m, k) returns the remainder after m is divided by k.

Fig. 1. Map and reduce processes of a simple word counting example (input blocks 1 to 4, map tasks 1 to 4, intermediate <key, value> pairs, groups 1 to 4, and reduce tasks 1 to 4)

III. MAPREDUCE KEY JOIN STRATEGIES

Physically, data in a Key-Value format can be stored in a data structure such as a B-Tree, a queue, or a hash table [5, 6]. Logically, each record in a Key-Value store is a single entry consisting of a key and a value. To make it easy to understand, a set of <key, value> pairs, called a data collection, can be considered as a two-column table. The first column stores keys and the second one, which can be a combination of more than two columns, stores the values associated with the keys.

Joins using MapReduce can be categorized as map-side join, reduce-side join, memory-backed join, join using Bloom filter, and map-reduce merge [7]. However, this paper follows the categories proposed by Tom White [8, 9], which group joins into two types: map-side joins and reduce-side joins.
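As a concrete reference point for the map, partition, and reduce steps that all of the join strategies in this section build on, the word-counting example of Fig. 1 can be mimicked in a few lines of Python. This is only a single-process sketch, assuming that blocks are separated by commas and words by whitespace. The function names (partition, map_block, word_count) are hypothetical and do not belong to Hadoop or any other MapReduce library.

```python
from collections import defaultdict

# Input string of the word-counting example (Fig. 1).
TEXT = ("Advanced Research Methodology, "
        "Advanced Information Modelling and Database, "
        "Advanced Network and Information Security, "
        "Advanced Database and Distributed Systems")


def partition(key: str, k: int) -> int:
    """Hash function from the text: mod(code(upper(left(key,1))),k)+1."""
    return ord(key[0].upper()) % k + 1


def map_block(block: str):
    """Map step: emit an intermediate <word, 1> pair for every word in a block."""
    for word in block.split():
        yield word, 1


def word_count(text: str, k: int = 4) -> dict:
    """Simulate map, distribution to k groups, and reduce in one process."""
    blocks = [b.strip() for b in text.split(",")]   # one block per subject name
    groups = defaultdict(lambda: defaultdict(list))  # group id -> key -> values
    for block in blocks:
        for key, value in map_block(block):
            groups[partition(key, k)][key].append(value)
    # Reduce step: sum the values that share the same intermediate key.
    return {
        group: {key: sum(values) for key, values in keys.items()}
        for group, keys in sorted(groups.items())
    }


RESULT = word_count(TEXT, k=4)
```

With k = 4, both "Advanced" and "and" are routed to group 2, where the reduce step sums their values to 4 and 3, the counts shown in Fig. 1.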
Map-side joins are joins performed by mappers, used to join two large input datasets before the data is fed to the map functions. Reduce-side joins are joins performed by reducers. They are more general than map-side joins because the inputs do not need to be structured in any particular way [9]. In some cases, reduce-side joins are less efficient than map-side joins because the datasets go through the MapReduce shuffle. Reduce-side joining involves several components, namely multiple inputs and secondary sorting [8].

Multiple inputs means that inputs from different sources can have different formats or presentations. To deal with this situation, the inputs need to be parsed separately. Hadoop provides this parsing on a per-path basis [10]. Secondary sorting occurs when reducers obtain inputs from two sources, each of which can be sorted in a different order. For example, when the first dataset comes from source A sorted by key1 and the second dataset comes from source B sorted by key2, the merged data should be sorted by a composite key (key1, key2) before reducing.

Among the join algorithms used in the MapReduce literature listed in [11-15], it is believed that the equi-join strategies used in [11] are more efficient than those used in Yahoo Pig, Facebook Hive, and IBM Jaql. This paper focuses on the theta-join implementation strategies proposed by Blanas et al. [11] and Okcan [16]. Theta join algorithms will be analysed in the following sections.

A. Theta Joins

Theta join is a kind of join that uses comparison operators such as <, <=, >, >=, =, <> in the join predicates. Among these, equi-join is the most used join for joining two datasets to obtain the intersection between them. Fig. 2 is an example of equi-join. This join matches every record from table L with every record from table R that has the same value in the join field. The results of joining can be projected to eliminate redundant fields and produce only the required fields.

Fig. 2. A simple equi-join example (using equi-join on the field Profs; query: SELECT * FROM L, R WHERE L.Profs = R.Profs)

B. All Pair Partition Joins

Given table R having |R| records and table L having |L| records, the product of R and L is a set of |R|*|L| records. Computing this product in the traditional way takes a long time when joining two very large tables. To compute this product in MapReduce, table R and table L are divided into u and v disjoint partitions, respectively. The |R|*|L| records can then be obtained from u*v partition products. Each product, partition (1, 1), partition (1, 2), ..., partition (u, v), can be processed by a map or a reduce function. This method is called all pairs partition join in the MapReduce model [16].

Fig. 3. All pairs partition join

Each compound partition is assigned to a map task. The output of the map task is a set of <compound key, tagged record> pairs. A compound key is a combination of the partition names from tables R and L, such as (1, 1), (1, 2), and (1, 3). To identify which record comes from which table, each record from table R or L is tagged with its table name and is then called a tagged record. Each group of <compound key, tagged record> pairs is passed to the reducers. Before reducing, this input data is split back into the records of table R and table L, which are then joined in the same way as in the traditional joining method.
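The all pairs partition join described in Section III-B can be sketched as follows. This is a minimal single-process simulation, not a Hadoop job. The helper names (split, map_compound, reduce_partition, all_pairs_partition_join) are hypothetical, the even split into partitions is an assumption, and the sample records are modelled on the partitions shown in Fig. 4. The join predicate is passed in as theta, so that operators other than equality can be tried.

```python
import operator
from itertools import product

# Sample records (Stds, Profs), modelled on the partitions of Fig. 4.
TABLE_R = [("Hin", "Sup"), ("Hiu", "Sup"), ("Jia", "PhMe"), ("Ling", "Su")]
TABLE_L = [("Aj", "PhMe"), ("Hiu", "PhMe"), ("Lo", "PhMe"),
           ("Dit", "PhMe"), ("Su", "Mar"), ("Sun", "Un")]


def split(records, n_parts):
    """Divide a table into n_parts disjoint, contiguous partitions."""
    size = -(-len(records) // n_parts)  # ceiling division
    return [records[i * size:(i + 1) * size] for i in range(n_parts)]


def map_compound(r_parts, l_parts):
    """Map step: emit <compound key, tagged record> for every partition pair."""
    pairs = product(enumerate(r_parts, 1), enumerate(l_parts, 1))
    for (i, r_part), (j, l_part) in pairs:
        for rec in r_part:
            yield (i, j), ("R", *rec)
        for rec in l_part:
            yield (i, j), ("L", *rec)


def reduce_partition(tagged_records, theta):
    """Reduce step: split by tag, then join the two sides pairwise."""
    r_side = [rec[1:] for rec in tagged_records if rec[0] == "R"]
    l_side = [rec[1:] for rec in tagged_records if rec[0] == "L"]
    return [(r, l) for r in r_side for l in l_side if theta(r[1], l[1])]


def all_pairs_partition_join(table_r, table_l, u, v, theta=operator.eq):
    """Join R and L through u*v compound partitions; returns rows per key."""
    groups = {}
    for key, tagged in map_compound(split(table_r, u), split(table_l, v)):
        groups.setdefault(key, []).append(tagged)
    return {key: reduce_partition(groups[key], theta) for key in sorted(groups)}


JOINED = all_pairs_partition_join(TABLE_R, TABLE_L, u=2, v=3)
```

In this run all six compound partitions are enumerated, but only (2, 1) and (2, 2) produce rows, which illustrates the overhead of enumerating every pair of partitions.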
In Fig. 4, each record from table L and table R is tagged with ‘L’ and ‘R’, respectively. These records are called tagged records. Only the compound keys whose partitions hold records from both table L and table R with the same join key are fed to the reduce functions. In this example, only partition (2, 1) and partition (2, 2) hold records from tables R and L that share a join key, and these are used for joining. The remaining partitions are ignored. A disadvantage of this join is that every pair of partitions is enumerated, although some of them may not be processed by the reducers.

Fig. 4. An example of all pairs partition joins (using equi-join on the field Profs)

C. Repartition Join

Repartition join is the most used join strategy in MapReduce. Datasets L and R are dynamically split into parts based on the join key, and pairs of partitions from L and R are joined [15]. It has two versions, called standard repartition join and improved repartition join.

The standard version is the same as the partitioned sort-merge join that is used in parallel Relational Database Management Systems [11]. In the map phase, each map task works on a block of either table L or table R. To identify which table an input record comes from, the map function tags each record with its original table and produces the extracted join key and the tagged record. The output of the map function is a set of <join_key, tagged_record> pairs, where join_key is the attribute used to join the two tables and tagged_record is a compound of the table name and the record. These outputs are then partitioned, sorted, and merged. Then, all records for each join key are grouped together and fed to a reducer. In the reduce phase, for each join key, the reducer first separates and buffers the input records into two sets according to the table tag, and then performs a cross-product between the two sets. The example in Fig. 5 uses the hash function mod(code(upper(left(join key,1))),2)+1 for distributing intermediate <key, value> pairs to each reducer (the same hash function as used earlier, with k = 2).

Fig. 5. An example of standard repartition joins (using equi-join on the field Profs)

All records from tables L and R are buffered before joining, which may lead to an insufficient memory problem, as encountered by Yahoo Pig and Facebook Hive [11, 17, 18]. To deal with this, the improved repartition join is proposed. In the improved version, the map function is changed. The output key of the map function is changed to a composite of the join key and the table tag. The table tags are generated in a way that guarantees that records from table R are stored ahead of those from table L for a given join key. The partition function is also customised so that the hash code is computed from just the join key instead of the composite key. Records are then grouped by just the join key instead of the composite key. The grouping function in the reducer groups records on the join key and ensures that records from table R are stored ahead of those from table L for a given key. To decrease the buffer size, only the records whose composite keys contain all table tags are written into the buffer.

Fig. 6. Example of improved repartition joins (using equi-join on the field Profs)

D. Broadcasting Join

Broadcast join is used when table R is much smaller than table L. Instead of passing both tables R and L across the network, the smaller table is broadcast to the larger table. This technique reduces sorting time and network traffic.

In some cases, however, a large portion of table R may not be referenced by any record from table L. For example, R is a table of users containing millions of records, while L is a table of the activities that users perform during an hour. In this situation, only a few records from table R are referenced by records from table L. However, when joining based on broadcasting, a large number of records from table R are shipped across the network and loaded into the hash table. If these records are not referenced through the join key, the network resources used for shipping them are wasted.
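A minimal sketch of the broadcast idea is given below, assuming that the small table R fits in memory and is loaded into a hash table that every map task can probe. It is a single-process simulation rather than a Hadoop job. The function names (build_hash_table, map_broadcast_join, broadcast_join) are hypothetical, and the sample records are modelled on the student tables of Fig. 7 and Fig. 8.

```python
from collections import defaultdict

# Small table R (StdId, subject) and two splits of the large table L (StdId, Name).
TABLE_R = [(55501, 701), (55501, 371), (55502, 555), (56701, 511), (56702, 814)]
L_SPLITS = [
    [(55701, "Lo"), (55702, "Mo"), (56700, "Bo"),
     (56701, "Dit"), (56702, "Hiu"), (56703, "Cha")],
    [(55501, "Sul"), (55502, "Sher"), (55503, "Jia"),
     (55504, "Dih"), (55505, "Tha"), (56501, "Ling")],
]


def build_hash_table(small_table, key_index=0):
    """Load the broadcast table into an in-memory hash table keyed on the join key."""
    table = defaultdict(list)
    for rec in small_table:
        table[rec[key_index]].append(rec)
    return table


def map_broadcast_join(l_split, r_hash, key_index=0):
    """Map-only join: probe the hash table with the join key of each record of L."""
    for l_rec in l_split:
        # Records of L without a reference in R produce no output.
        for r_rec in r_hash.get(l_rec[key_index], ()):
            yield l_rec + r_rec


def broadcast_join(l_splits, small_table):
    """Run one simulated map task per split of L; no shuffle or reduce is needed."""
    r_hash = build_hash_table(small_table)  # on a cluster, each node would hold a copy
    return [list(map_broadcast_join(l_split, r_hash)) for l_split in l_splits]


OUTPUT = broadcast_join(L_SPLITS, TABLE_R)
```

Each simulated map task emits its join output directly, so the large table L is never sorted or shuffled. Only the small table is copied, which is where the savings in sorting time and network traffic come from.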
At the beginning of each map function, broadcast join checks whether R is stored on the local file system. If not, it retrieves table R from the distributed file system, splits R into partitions on the join key, and stores these partitions on the local file system. A hash table is built from table L or table R, depending on which one is smaller.

If R is smaller than a partition of L, then all partitions of R are loaded into memory to build the hash table. The map function then extracts the join key value from each record of L and uses it to probe the hash table and generate the join output. If R is bigger than a split of L, joining is not done in the map function. Instead, the map function maps each partition of L with each partition of R using other join strategies. Then, the results from R and L are joined at the end of the map process.

In Fig. 7 and Fig. 8, table R is smaller than a part of table L, so it is broadcast to each node. The map function loads all records from table R to build a hash table. For each record from a partition of table L, the map function looks up its reference in the hash table and outputs only the records that have a reference. All unreferenced records from table L are ignored.

E. Semi Join

The semi join, proposed to solve the problem mentioned above, comprises three phases. The first phase runs as a full MapReduce job. In the map function, a main-memory hash table is used to determine the set of unique join key values in a part of table L. By sending only unique key values to the map output, the number of records that need to be sorted is reduced. The reduce function processes the unique join keys. In Fig. 9, all unique join keys are consolidated by a single reducer, and the result of this phase is a single file called L.uk.
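The first phase of the semi join (Section III-E) can be sketched as below. This is a single-process simulation under the assumption that the join key is the first field of each record. The names map_unique_keys, reduce_unique_keys, and semi_join_phase_one are hypothetical, the sorted list returned by the reducer stands in for the file L.uk, and the sample records are modelled on table L of Fig. 9.

```python
# Two splits of table L (StdId, subject), modelled on Fig. 9.
L_SPLITS = [
    [(55501, 701), (55501, 371), (55502, 555)],
    [(56701, 511), (56701, 814), (56702, 814)],
]


def map_unique_keys(l_split, key_index=0):
    """Map step: emit each join key of the split only once."""
    seen = set()  # main-memory hash table of the keys already emitted
    for rec in l_split:
        key = rec[key_index]
        if key not in seen:
            seen.add(key)
            yield key, None  # only the key matters, so the value is empty


def reduce_unique_keys(map_outputs):
    """Reduce step (single reducer): consolidate the keys of all map tasks."""
    return sorted({key for output in map_outputs for key, _ in output})


def semi_join_phase_one(l_splits):
    """Return the consolidated unique join keys, standing in for file L.uk."""
    return reduce_unique_keys(map_unique_keys(split) for split in l_splits)


L_UK = semi_join_phase_one(L_SPLITS)
```

Because each map task removes duplicates before emitting, the first split sends two keys instead of three records to the shuffle. That is the reduction in sorted records that the phase is designed to achieve.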
Fig. 7. Building hash table when R is smaller than any part of L

Fig. 8. Example of broadcasting joins when R is smaller than any part of L (using equi-join)

Fig. 9. Example of the first phase in Semi joins (using equi-join)

The second phase, similar to the broadcast join, runs as a map job. First, L.uk is loaded into an in-memory hash table. The map function then iterates over each record of table R and outputs it if its join key can be found in L.uk. Each part of table R produces one file, called Ri. The output of this phase is a list of files Ri, as shown in Fig. 10.

Fig. 10. Example of the second phase in Semi joins (using equi-join)

The third phase joins all files Ri with table L using broadcast join, as shown in Fig. 11. One challenge of semi join is that not every record in the Ri of R will join with a particular part Li of table L. To solve this issue, per-split semi join is proposed.

Fig. 11. Example of the last phase in Semi joins (using equi-join)

F. Per-Split Semi Join

Per-split semi join consists of three phases. The first and the last phases are map jobs, and the second phase is a full MapReduce job. The first phase generates the set of unique join keys in a split Li of table L and stores them in the distributed file system in a file called Li.uk. The second phase loads all records from a split of table R into a main-memory hash table, reads the unique keys from file Li.uk, and probes the hash table for matching records from R.
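The phases of the per-split semi join can be sketched end to end as follows. This is a single-process simulation, not a Hadoop job, and the function names (phase1_unique_keys, phase2_filter_r, phase3_join) are hypothetical. It assumes that the join key is the first field of each record. Matched records of R are tagged RLi so that they can be collected per split Li, and each Li is then joined with its own RLi. The sample records are modelled on the student tables of Figs. 9 to 13.

```python
from collections import defaultdict

# Splits of L (StdId, subject) and of R (StdId, Name).
L_SPLITS = [
    [(55501, 701), (55501, 371), (55502, 555)],
    [(56701, 511), (56701, 814), (56702, 814)],
]
R_SPLITS = [
    [(55701, "Lo"), (55702, "Mo"), (56700, "Bo"),
     (56701, "Dit"), (56702, "Hiu"), (56703, "Cha")],
    [(55501, "Sul"), (55502, "Sher"), (55503, "Jia"),
     (55504, "Dih"), (55505, "Tha"), (56501, "Ling")],
]


def phase1_unique_keys(l_splits):
    """Phase 1 (map only): one set of unique join keys Li.uk per split Li."""
    return {i: sorted({rec[0] for rec in split})
            for i, split in enumerate(l_splits, 1)}


def phase2_filter_r(r_splits, li_uk):
    """Phase 2: probe each split of R with every Li.uk and collect matches per tag."""
    collected = defaultdict(list)  # plays the role of the reduce step
    for r_split in r_splits:
        r_hash = defaultdict(list)  # hash table over one split of R
        for rec in r_split:
            r_hash[rec[0]].append(rec)
        for i, keys in li_uk.items():
            for key in keys:
                for rec in r_hash.get(key, ()):
                    collected[f"RL{i}"].append(rec)
    return dict(collected)


def phase3_join(l_splits, rl):
    """Phase 3 (map only): join each split Li directly with its own RLi."""
    rows = []
    for i, l_split in enumerate(l_splits, 1):
        r_hash = defaultdict(list)
        for rec in rl.get(f"RL{i}", []):
            r_hash[rec[0]].append(rec)
        for std_id, subject in l_split:
            for _, name in r_hash.get(std_id, ()):
                rows.append((std_id, name, subject))
    return rows


LI_UK = phase1_unique_keys(L_SPLITS)
RL = phase2_filter_r(R_SPLITS, LI_UK)
FINAL = phase3_join(L_SPLITS, RL)
```

In this run every record kept in RL1 and RL2 joins with its own split of L, so no filtered record of R is shipped to a split where it has no match.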
Each matched record is output with a tag RLi, which is used by the reduce function to collect all records from table R that will join with Li. In the last phase, the results of the second phase and Li are joined directly, as shown in Fig. 12 and Fig. 13.

Fig. 12. Example of the first phase and second phase in Per-Split Semi Join

Fig. 13. Example of the last phase in Per-Split Semi Join (using equi-join)

IV. CONCLUSION

Many big data mining problems can be solved by using MapReduce in association with a Key-Value store. Based on the advantages and drawbacks of the strategies explained above in terms of time and network resource consumption, we provide a comparison of join strategies, as shown in Table 1.

TABLE 1. COMPARISON OF JOIN STRATEGIES

Strategy | Pros/Cons | Suggestion
All pair partition join | Easy to implement. Some compound partitions transferred to reducers may not be processed by the reducers. | Used when the two datasets have much data in common and are sorted by the same fields.
Standard repartition join | Easy to implement. All records from both tables are buffered before joining, which may lead to an insufficient memory problem. | Same as all pair partition join.
Improved repartition join | Reduces buffer size. Implementation is more complex than the standard version. | Used when the two joined datasets have little data in common and are sorted by the same fields.
Broadcasting join | Reduces sorting time and network traffic. May waste network resources. | Used when one table is much smaller than the other table.
Semi-join | Some records from parts of a table broadcast to another table may not be joined. | Used when a large portion of a table may not be referenced by any record from the other table.
Per-split semi join | Complicated implementation, with more reading and writing operations. | Same as semi-join.

Which strategy should be used for a given problem depends on the nature of the data and the available network resources. If the two joined tables have much data in common, or sufficient network resources are available, all pair partition join or repartition join should be used because their implementation is not as complex as the others. If the two joined tables have little data in common, or network resources are inadequate, broadcasting join, semi join, or per-split semi join should be used because they may reduce time and resource consumption.

Data in a NoSQL database can be structured, semi-structured, or unstructured, and can be stored in many types of data structures, such as an indexed table of a relational database, a B-Tree, a queue, or a hash table. Therefore, in addition to the considerations presented in this paper, the selection of join strategies is also affected by data structures. MapReduce programmers may also need to consider data access time and data sorting time when selecting a join strategy. This issue is beyond the scope of this paper and is left for future research.

REFERENCES

[1] Mapanga, I. and P. Kadebu, Database Management Systems: A NoSQL Analysis. International Journal of Modern Communication Technologies & Research (IJMCTR), 2013. 1: p. 12-18.
[2] Hecht, R. and S. Jablonski, NoSQL evaluation: A use case oriented survey, in Cloud and Service Computing (CSC), 2011 International Conference on. 2011.
[3] Dean, J. and S. Ghemawat, MapReduce: Simplified Data Processing on Large Clusters, in OSDI '04: Sixth Symposium on Operating Systems Design and Implementation. 2004, USENIX: San Francisco, California, USA. p. 137–150.
[4] Dean, J. and S. Ghemawat, MapReduce: simplified data processing on large clusters, in Communications of the ACM - 50th anniversary issue: 1958 - 2008. 2008. p. 107-113.
[5] Celko, J., Chapter 6. Key–Value Stores, in Joe Celko's complete guide to NoSQL: what every SQL professional needs to know about nonrelational databases, A. Dierna and H. Scherer, Editors. 2014, Morgan Kaufmann, Elsevier: USA. p. 81-88.
[6] Oracle, Chapter 1. Introduction to Berkeley DB, in Oracle Berkeley DB: Getting Started with Berkeley DB for C. 2013. p. 8-15.
[7] Jadhav, V., J. Aghav, and S. Dorwani, Join Algorithms Using MapReduce: A Survey, in International Conference on Electrical Engineering and Computer Science. 2013, IOAJ INDIA: Coimbatore, Tamil Nadu, India. p. 40-44.
[8] White, T., Chapter 8. MapReduce Features, in Hadoop: The Definitive Guide, Second Edition, M. Loukides, Editor. 2011, O'Reilly Media, Inc.: USA. p. 225-257.
[9] White, T., Chapter 8. MapReduce Features, in Hadoop: The Definitive Guide, Third Edition, M. Loukides and M. Blanchette, Editors. 2012, O'Reilly Media, Inc.: USA. p. 259-295.
[10] White, T., Chapter 7. MapReduce Types and Formats, in Hadoop: The Definitive Guide, Third Edition, M. Loukides and M. Blanchette, Editors. 2012, O'Reilly Media, Inc.: USA. p. 223-258.
[11] Blanas, S., et al., A comparison of join algorithms for log processing in MapReduce, in Proceedings of the 2010 ACM SIGMOD International Conference on Management of data. 2010, ACM: Indianapolis, Indiana, USA. p. 975-986.
[12] Özsu, M.T. and P. Valduriez, Chapter 3. Distributed Database Design, in Principles of Distributed Database Systems, Third Edition. 2011, Springer New York. p. 71-125.
[13] Bernstein, P.A., et al., Query processing in a system for distributed databases (SDD-1). ACM Trans. Database Syst., 1981. 6(4): p. 602-625.
[14] Lee, K.-H., et al., Parallel data processing with MapReduce: a survey. SIGMOD Rec., 2012. 40(4): p. 11-20.
[15] Okcan, A. and M. Riedewald, Processing theta-joins using MapReduce, in Proceedings of the 2011 ACM SIGMOD International Conference on Management of data. 2011, ACM: Athens, Greece. p. 949-960.
[16] Shim, K., MapReduce algorithms for big data analysis, in Proceedings of the VLDB Endowment. 2012, VLDB Endowment. p. 2016-2017.
[17] Olston, C., et al., Pig latin: a not-so-foreign language for data processing, in Proceedings of the 2008 ACM SIGMOD international conference on Management of data. 2008, ACM: Vancouver, Canada. p. 1099-1110.
[18] Hive, A., Theta Join. 2013.