Distributed Data Management - Databases and Information Systems
Distributed Data Management
Summer Semester 2015
TU Kaiserslautern
Prof. Dr.-Ing. Sebastian Michel
Databases and Information Systems Group (AG DBIS)
http://dbis.informatik.uni-kl.de/

Distributed Data Management, SoSe 2015, S. Michel

General Note
• For exercise sheet 2 we have put some data files online for download.
• To access these from outside the university network, you need the following login information.
  – Username: ddm
  – Password:
• Remember this login; it will also be needed for future protected downloads.

Map Reduce from High Level
[Figure: the input DATA is split into chunks, each processed by a MAP task; the intermediate results are passed on to REDUCE tasks, each of which produces a result.]

Map and Reduce: Types
• Map (k1,v1) → list(k2,v2)
• Reduce (k2, list(v2)) → list(k3, v3)
• Keys allow grouping data to machines/tasks
• For instance:
  – k1 = document identifier
  – v1 = document content
  – k2 = term
  – v2 = count
  – k3 = term
  – v3 = final count

MR Key Principle
• Many data chunks
• Map function is applied to each of the chunks
• Map process outputs data with keys => partitions based on keys
• Aggregate (i.e., reduce) mapped data per key
• Output is written to the (distributed) file system
• E.g., count the number of occurrences of each term in a set of documents.

MR: Example of Principle
• For instance, a CSV file in the file system, e.g., weather.csv of temperature readings
    2/12/2004;64;5;2.46
    9/6/2006;80;14;10.15
    6/1/2002;9;16;16.01
    10/30/2014;73;19;23.81
    8/30/2002;64;4;16.16
    1/29/2007;40;24;-2.16
    11/10/2012;85;10;12.20
    …..
  with the format date;station_id;hour_of_day;temp

MR: Example of Principle (2)
• The Mapper is responsible for “parsing” the lines
• Simple example: get all tuples from 2014, like in the “grep” example. No reducer.
E.g., we have

File 1
    11/24/2014;21;3;-0.47
    3/13/2014;40;6;12.79
    10/14/2014;26;22;22.41
    2/5/2014;17;12;7.87
File 2
    11/1/2014;84;1;4.62
    2/24/2014;35;13;-2.44
    11/17/2014;59;17;26.31
    6/9/2014;23;13;23.60
File 3
    2/24/2014;11;11;6.80
    11/17/2014;12;2;4.85
    10/8/2014;3;9;12.71
    8/28/2014;33;12;7.27
……..

One file per Mapper is written directly to the file system; no partitioning by key, no sorting. But let’s add a reducer now ….

MR: Example of Principle (3)
• Let’s say we are interested in the average temperature for each hour of the day, in 2014.
• This can be done by the reducer (after the mapper), which we add now. So the mapper “sends” its output to the reducer, sorted by key.
• Say, we have two reducers (= number of partitions)
  – Partitions of tuples are created (by default) using key.hashCode() % number_of_partitions
  – So there is in general more than one group of hour_of_day in a partition.

MR: Example of Principle (4)
• The reducer then obtains, for each group of tuples with the same hour_of_day, the temperature values and can compute the average.
• Output is one file per Reducer:
    14;17.34
    17;14.01
    23;9.11

    4;7.19
    16;16.35
    22;9.89
  Inside each file, lines are sorted by hour_of_day. But not globally!

More Details
• While there are in general several keys (and corresponding values) in a partition, it is assured that all tuples for a key are in a single partition.
• The sorting is done based on the key.
• Data is grouped by key. The reduce function is called once for each such group.
To get a strong understanding of these concepts, play around with the Hadoop MapReduce implementation (see exercise sheet 2).

Map-Only Job vs. IdentityReducer
• In map-only jobs (zero reducers) no sorting takes place.
• With an identity reducer that simply outputs the data it receives, sorting will take place.
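To make the map, partition, sort, and reduce steps of the weather example concrete, here is a minimal single-process sketch in Python. It is not Hadoop code: the function names (`map_fn`, `partition`, `reduce_fn`, `run_job`) are hypothetical, Python's `hash` on integer keys stands in for `key.hashCode()`, and the sample lines are a hand-picked subset of the readings shown on the slides.

```python
from collections import defaultdict

# Sample lines in the format date;station_id;hour_of_day;temp (taken from the slides).
LINES = [
    "11/24/2014;21;3;-0.47",
    "3/13/2014;40;6;12.79",
    "2/12/2004;64;5;2.46",      # not from 2014, dropped by the mapper
    "10/14/2014;26;22;22.41",
    "2/24/2014;35;13;-2.44",
    "6/9/2014;23;13;23.60",
    "8/28/2014;33;12;7.27",
]
NUM_REDUCERS = 2  # = number of partitions


def map_fn(line):
    """Parse one line; emit (hour_of_day, temp) only for readings from 2014."""
    date, _station, hour, temp = line.split(";")
    if date.endswith("/2014"):
        yield int(hour), float(temp)


def partition(key, num_partitions):
    """Stand-in for key.hashCode() % number_of_partitions (assumption: int keys)."""
    return hash(key) % num_partitions


def reduce_fn(hour, temps):
    """Average all temperatures that share the same hour_of_day."""
    return hour, round(sum(temps) / len(temps), 2)


def run_job(lines, num_reducers):
    # Map phase: every emitted pair is routed to exactly one partition by its key.
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for line in lines:
        for key, value in map_fn(line):
            partitions[partition(key, num_reducers)][key].append(value)
    # Reduce phase: each reducer processes its groups in key order
    # and produces its own output "file" (here: a list of result tuples).
    return [[reduce_fn(k, part[k]) for k in sorted(part)] for part in partitions]


output_files = run_job(LINES, NUM_REDUCERS)
for i, rows in enumerate(output_files):
    print(f"reducer {i}:", rows)
```

Each inner list is sorted by hour_of_day, but concatenating the lists does not give a globally sorted result, which mirrors the "sorted per file, but not globally" remark above.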

Multiple MapReduce Jobs Together
• One can combine multiple MapReduce jobs, in general forming a workflow of jobs that can be described with a directed acyclic graph (DAG).
• Each MR job outputs its results to the (distributed) file system.
• Subsequent jobs can consume these outputs (files, one per reducer, or one per mapper if there is no reducer).

MAPREDUCE AND SQL / JOINS

Processing SQL Queries in MR
• Given a relation R(A,B,…) that is stored in a file (one tuple per line)
• How to process standard SQL queries?
    SELECT B              “projection”
    FROM R
    WHERE predicate(B)    “selection”

SQL in MR: Group-By, Aggregate, Having
    SELECT department, avg(salary)
    FROM salaries
    GROUP BY department
    HAVING avg(salary) > 50000
• Group-By, Aggregate
  – Map: Send each tuple to a reducer, using the grouping attribute as key (here: department)
  – Reducer: Hence receives all tuples for one department and can group and aggregate
• Having
  – Having is a predicate evaluated on the aggregate of a group; hence, it is executed at the reducer.

(Equi) Joins in Map Reduce
• Two relations R(A,B) and S(B,C):
    SELECT * FROM R, S WHERE R.B = S.B
• Obviously: Join “partners” have to end up at the same node. How to achieve this?
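One way to bring join partners to the same node, developed on the following slides as the reduce-side join, is to use the join attribute as the map output key and to tag each tuple with the relation it came from. The sketch below simulates this in plain Python on the station data of the example slide; it is an illustration under simplifying assumptions (everything runs in one process, the shuffle is a dictionary), and all function names are hypothetical.

```python
from collections import defaultdict

STATIONS = [(1, "A"), (2, "B")]          # (station_id, station_name)
READINGS = [                             # (station_id, timestamp, temperature)
    (1, 12434343434300, 25),
    (2, 12434343434500, 27),
    (1, 12434343434700, 31),
    (1, 12434343434900, 28),
    (2, 12434343435200, 29),
]


def map_tuple(source, t):
    """Emit (join key, (source tag, remaining attributes))."""
    return t[0], (source, t[1:])


def reduce_join(key, tagged_values):
    """Join all station tuples with all reading tuples that share this key."""
    stations = [rest for tag, rest in tagged_values if tag == "stations"]
    readings = [rest for tag, rest in tagged_values if tag == "readings"]
    for s in stations:
        for r in readings:
            yield (key, *s, *r)


def reduce_side_join(stations, readings):
    groups = defaultdict(list)  # simulated shuffle: group map output by key
    for t in stations:
        k, v = map_tuple("stations", t)
        groups[k].append(v)
    for t in readings:
        k, v = map_tuple("readings", t)
        groups[k].append(v)
    return [row for k in sorted(groups) for row in reduce_join(k, groups[k])]


joined = reduce_side_join(STATIONS, READINGS)
for row in joined:
    print(row)
```

Because the key is the join attribute alone, all tuples of both relations with the same station ID reach the same reduce call; the source tag is what lets the reducer tell the two sides apart.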

Example

    Station ID   Timestamp        Temperature
    1            12434343434300   25
    2            12434343434500   27
    1            12434343434700   31
    1            12434343434900   28
    2            12434343435200   29

    Station ID   Station Name
    1            A
    2            B

Join on Station ID:

    Station ID   Station Name   Timestamp        Temperature
    1            A              12434343434300   25
    2            B              12434343434500   27
    1            A              12434343434700   31
    1            A              12434343434900   28
    2            B              12434343435200   29

Reduce Side Join
• Two relations R(A,B) and S(B,C).
• Map:
  – Send tuple t to the reducer of key t.B
  – Together with the information where t is from (R or S)
• Reduce:
  – Join tuples t1, t2 with t1.B=t2.B and t1 in R and t2 in S

Map-Side Join with one entirely known Relation
• Two relations R(A,B) and S(B,C).
• One relation is small, say R
• Map:
  – each map process knows the entire relation R
  – can perform the join on its subset of S
  – outputs the joined tuples
• Reduce:
  – no reducer needed

Reduce-Side Join with “Semi Join” Optimization (Filtering) in Map Phase
• Two relations R(A,B) and S(B,C).
• Unique values in R.B are small in number
• Map:
  – knows the unique values of R.B
  – send tuples t in R by key t.B
  – send tuples t in S only if t.B in R.B
• Reduce:
  – perform the actual join (why is this still required?)

Global Sharing of Information
• Implemented as “Distributed Cache”
• For small data
• E.g., dictionary or “stopwords” file for text processing, or “small” relation for joins
• Read at initialization of the Mapper

Reduce Side Join with Map-Side Filtering, but now with Bloom Filters
• Reduce-side join with map-side filtering
• Compact representation of join attributes
• Using a Bloom filter*
  – very generic data structure with wide applications to distributed data management / systems
• We will see them again later (so worth introducing)
*) - Bloom, Burton H. (1970), "Space/time trade-offs in hash coding with allowable errors", Communications of the ACM 13 (7): 422–426.
   - Broder, Andrei; Mitzenmacher, Michael (2005), "Network Applications of Bloom Filters: A Survey", Internet Mathematics 1 (4): 485–509

Bloom Filter
• Bit array of size m (all bits = 0 initially)
• Encode the elements of a set in that array
  – the set is, for instance, the distinct attribute values of a table column or a set of words. How to hash non-numbers? E.g., use the byte representation of the string
• How is the bit array constructed?
  – Hash the element to a bucket number and set this bit to 1 (if the bit is already 1, ok, keep it 1)
  – Use multiple (= k) hash functions hi
    Bucket number:  1 2 3 4 5 6 7 8

Bloom Filter: Insert + Query
    h1(x) = 3*x mod 8        h2(x) = 5*x mod 8
    h1(17) = 3               h2(17) = 5
    h1(59) = 1               h2(59) = 7

    Bit:            0 1 0 1 0 1 0 1
    Bucket number:  0 1 2 3 4 5 6 7
• Query: is x contained in the set (= filter)?
  – Check if the bits at both h1(x) and h2(x) are set to 1. Yes? Then x “might be” in the set. No? Then x is for sure not in!

Bloom Filter: False Positives
• In case all bits at the hash positions are 1, the element might be in the set, but maybe it’s a mistake.
• Is x=45 contained?
    h1(45) = 7               h2(45) = 1

    Bit:            0 1 0 1 0 1 0 1
    Bucket number:  0 1 2 3 4 5 6 7
• Looks like it, but actually it is not! (i.e., we didn’t insert it on the slide before)
• It is a false positive!

Bloom Filter: Probability of False Positives
• Bloom filter of size m (bits)
• k hash functions
• n inserted elements
• Probability of a false positive: $(1 - (1 - 1/m)^{kn})^k \approx (1 - e^{-kn/m})^k$
• Thus, it can be controlled: tradeoff between compression and “failures”

Implications of False Positives on Join
• Reconsider the reduce-side join with map-side filtering of relations R(A,B) and S(B,C).
• We have a Bloom filter for R.B, etc. (see slide before)
• What do false positives cause?
  – additional (and useless) network traffic and also more work for the reducer
  – but no erroneous results, as the reducer will check whether the join can in fact be done

Map-Side Join
• Reduce-side joins are quite straightforward
• But potentially lots of data needs to be transferred
• How can we realize a join on the map side (without distributed cache for one small relation)?

Map-Side Join: Idea
• Assume that the input relations are already sorted by the key over which the join is computed
[Figure: relations R and S, each a list of (key, value) tuples sorted by key (A to E) and split into chunks.]
How can this work out properly?

Map-Side Join: Solution
• Need to make sure the data is properly aligned
• That is, Mappers need to read the input in pairs of chunks (one from R, one from S) that contain all join partners
• Achieved through two prior MR jobs* that use the same number of reducers, sorting tuples by the key (of course, in the same order)
• Then the final MR job performs the join in the Map phase
*) If not sorted/aligned properly already

Map-Side Join: Solution (Cont’d)
• In Hadoop there exists a special input format named CompositeInputFormat
• You can specify the join you want (e.g., inner and outer join) and the input.

Map Reduce vs. Databases

                 Traditional RDBMS            Map Reduce
    Data Size    Gigabytes                    Petabytes
    Access       Interactive and batch        Batch
    Updates      Read and write many times    Write once, read many times
    Structure    Static schema                Dynamic schema
    Integrity    High                         Low
    Scaling      Non linear                   Linear

source: T. White, Hadoop, The Definitive Guide, 3rd edition
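Returning to the Bloom filter slides: the insert and query example with h1(x) = 3*x mod 8 and h2(x) = 5*x mod 8 can be reproduced with a few lines of Python, and the same filter can be used to mimic the map-side filtering of a reduce-side join. This is a minimal sketch, not a production Bloom filter; the class and the small S-side key list are assumptions made for illustration (17, 59, and 45 are the values from the slides, 20 is an extra value chosen here).

```python
class BloomFilter:
    """Bit array of size m with k hash functions (passed in as callables)."""

    def __init__(self, m, hash_functions):
        self.m = m
        self.bits = [0] * m
        self.hash_functions = hash_functions

    def add(self, x):
        # Set the bit at every hash position; bits that are already 1 stay 1.
        for h in self.hash_functions:
            self.bits[h(x) % self.m] = 1

    def might_contain(self, x):
        # True means "maybe in the set"; False means "definitely not in the set".
        return all(self.bits[h(x) % self.m] == 1 for h in self.hash_functions)


def h1(x):
    return 3 * x % 8


def h2(x):
    return 5 * x % 8


# Build the filter over the join values of R.B (the inserted elements of the slide).
r_b_values = {17, 59}
bf = BloomFilter(8, [h1, h2])
for value in r_b_values:
    bf.add(value)

# Map side: forward a tuple of S only if its B value passes the filter.
s_b_values = [17, 45, 20]  # hypothetical B values of tuples in S
forwarded = [b for b in s_b_values if bf.might_contain(b)]

# Reduce side: the exact comparison removes the false positive again.
joined = [b for b in forwarded if b in r_b_values]

print(bf.bits)     # bit array after inserting 17 and 59
print(forwarded)   # values that passed the filter (may include false positives)
print(joined)      # values that really have a join partner
```

The value 45 passes the filter although it was never inserted, so its tuple travels over the network for nothing, but the exact check at the reducer still keeps the join result correct.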

Objectives/Benefits
• Simple model (see also criticisms) ;)
• Scalable (depends also on the problem, of course)
• Aims at high throughput
• Tolerant against node failures

Limitations
• Very low-level routines
• Can have quite slow response time for individual, small tasks
• Writing complex queries can be a hassle
  – Think: declarative languages like SQL
    SELECT * FROM WHERE GROUP BY …

Criticism
• Some people claim MR is a major step backward
• Why?
  – Too low level
  – No indices
  – No updates
  – No transactions
• But: was it really made to replace a DB?
http://craig-henderson.blogspot.de/2009/11/dewitt-and-stonebrakers-mapreducemajor.html

HADOOP (A MR IMPLEMENTATION)

Hands on MapReduce (with Hadoop)
• Apache Hadoop. Open Source MR
• Wide acceptance:
  – http://wiki.apache.org/hadoop/PoweredBy
  – Amazon.com, Apple, AOL, eBay, IBM, Google, LinkedIn, Last.fm, Microsoft, SAP, Twitter, …

Hadoop Distributed File System (HDFS): Basics
• A given file is cut into big pieces (blocks), e.g., 64 MB each
• The blocks are then assigned to (different) nodes
[Figure: blocks of a file assigned to nodes.]

HDFS Architecture
[Figure: HDFS architecture. Clients send metadata operations to the NameNode, which holds the metadata (name, replicas, …; e.g., /home/foo/data, 3, …). Clients read and write blocks on the DataNodes (block ops); blocks are replicated between DataNodes in Rack 1 and Rack 2.]
source: http://hadoop.apache.org

Replication
• Can specify a default replication factor (or per directory/file)
• Replication is pipelined
  – if a block is full, the NameNode is asked for other DataNodes (that can hold a replica)
  – DataNode is contacted, receives data
  – Forwards to third replica, etc.

A Note on Input Splits
• An input split is a chunk of the input data, processed by a single map.
• For instance, a set of lines of the original big file.
• The size of a split is usually the size of a file system block.
• But splits do not, in general, fit precisely with the block boundaries. So one needs to read a bit across boundaries.
• Luckily, for the applications we consider, we “do not care” and use the available input formats.

MR job execution in Hadoop 1.x
[Figure: on the client node, the MapReduce program runs a Job inside the client JVM.]
source: T. White, Hadoop, The Definitive Guide, 3rd edition

MR job execution in Hadoop 1.x (Cont’d)
[Figure: job submission and initialization, with the following steps.]
  1: run job (MapReduce program to Job, in the client JVM on the client node)
  2: get new job ID (from the JobTracker)
  3: copy job resources (to the shared filesystem, e.g., HDFS)
  4: submit job (to the JobTracker)
  5: init job (JobTracker, on the jobtracker node)
  6: retrieve input splits (from the shared filesystem)

MR job execution in Hadoop 1.x (Cont’d)
[Figure: task assignment and execution, with the following steps.]
  7: heartbeat (returns task), between TaskTracker and JobTracker
  8: retrieve job resources (from the shared filesystem)
  9: launch (child JVM on the tasktracker node)
  10: run (Map or Reduce task in the child)

Job Submission, Initialization, Assignment, Execution
Job client (submission):
• asks for a new job id
• checks if input/output directories exist
• computes input splits
• writes everything to HDFS
• submits job to JobTracker (step 4)
JobTracker (initialization):
• Retrieves splits (chunks) from HDFS
• Creates a Map task for each split
TaskTracker:
• Is responsible for executing a certain assigned task (multiple on one physical machine)

Stragglers and Speculative Execution
• The JobTracker continuously monitors progress (see Web user interface)
• Stragglers are slow nodes
  – we have to wait for the slowest one (think: only one out of 1000 is slow and delays the overall response time)
• Speculative execution
  – run the same task on more nodes if the first instance is observed to underperform (after some time)
  – wasted resources vs. improved performance

Typical Setup
[Figure: a switch connects Rack 1 (Node 1, Node 2, Node 3) and Rack 2 (Node 4, Node 5, Node 6); each node has its own disks.]
source: T. White, Hadoop, The Definitive Guide, 3rd edition

Locality
• data-local
• rack-local
• off-rack map tasks
[Figure: map tasks and HDFS blocks placed on nodes, within racks, within a data center.]
source: T. White, Hadoop, The Definitive Guide, 3rd edition

Cost Model + Configuration for Rack Awareness
• Simple cost model applied in Hadoop:
  – Same node: 0
  – Same rack: 2
  – Same data center: 4
  – Different data center: 6
• Hadoop needs help: You have to specify the configuration (topology)
• Sample configuration:
    '13.2.3.4' : '/datacenter1/rack0',
    '13.2.3.5' : '/datacenter1/rack0',
    '13.2.3.6' : '/datacenter1/rack0',
    '10.2.3.4' : '/datacenter2/rack0',
    '10.2.3.4' : '/datacenter2/rack0'
    ....

Shuffle and Sort
• Output of map is partitioned by key by default
• A reducer is guaranteed to get an entire partition
• Sorted by key (but not by value within each group)
• Output of each reducer is also sorted by this key
• Selecting which key to use, hence, affects partitions and sort order (see a few slides later how to customize)

Shuffle and Sort
[Figure: copy phase. The map task reads its input split, buffers the map output in memory, and merges it on disk into partitions; the reduce task fetches its partition from this map task and from other maps and merges the fetched data; the remaining partitions go to other reducers.]

Shuffle and Sort (Cont’d)
[Figure: “Sort” phase and Reduce phase. The reduce task merges the fetched map outputs (a mixture of in-memory and on-disk data), passes the merged data to reduce, and writes the output.]

Secondary Sort
• In MapReduce (Hadoop) tuples/records are sorted by key before reaching the reducers.
• For a single key, however, tuples are not sorted in any specific order (and the order can also vary from one execution of the job to another).
• How can we impose a specific order?

Partitioning, Grouping, Sorting
• Consider weather data, temperature (temp) for each day. Want: maximum temp per year
• So, we want the data per year sorted by temp:
    1900   35°C    max for year 1900
    1900   34°C
    1900   34°C
    ...
    1901   36°C    max for year 1901
    1901   35°C
• Idea: composite key: (year, temp)
example source: T. White, Hadoop, The Definitive Guide, 3rd edition

Partitioning, Grouping, Sorting (Cont’d)
• Obviously, this doesn’t work: (1900, 35°C) and (1900, 34°C) can end up in different partitions
• Solution(?): Write a custom partitioner that considers only the year for partitioning, and a sort comparator for sorting by temperature

Need for Custom Grouping
• With that custom partitioner by year, and still year and temp as key, we get
    1900   35°C
    1900   34°C
    1900   34°C
    ...
    1901   36°C
    1901   35°C
  [Figure: the tuples above, annotated with their partition and group boundaries.]
• Problem: the reducer still consumes groups by key (within correct partitions)

Custom Grouping
• Solution: Define a custom grouping method (class) that considers only the year for grouping
    1900   35°C
    1900   34°C
    1900   34°C
    ...
    1901   36°C
    1901   35°C
  [Figure: the tuples above, annotated with their partition and group boundaries.]

Custom Sorting
• Finally, we provide a custom sorting that sorts the keys by temperature in descending order (= large values first)
• What happens then?
  Hadoop uses the year for grouping (as said on the previous slide), but which temp is used as the key (remember, we still have composite keys)?
• The first one observed is used as key, i.e., the largest (max) temperature is used for the temp.
Note that this example specifically aims at computing the max using secondary sort. How would you implement a job such that the output is sorted by (year, temp)?

Secondary Sort: Summary
• Recipe to get sorting by value
  – Use a composite key of natural key and natural value
  – The sort comparator has to order by the composite key (i.e., both natural key and natural value)
  – Partitioner and grouping comparator for the composite key should use only the natural key for partitioning and grouping.
Hint (for Hadoop):
    job.setMapperClass(…);
    job.setPartitionerClass(…);
    job.setSortComparatorClass(…);
    job.setGroupingComparatorClass(…);
    job.setReducerClass(…);

Failure/Recovery in MR
• TaskTracker failure:
  – detected by the master through periodic heartbeats
  – can also be blacklisted if too many failures occur
  – just restart if dead
  – JobTracker re-schedules tasks
• Master failure:
  – unlikely to happen (only one machine), but if it does, all running jobs fail
  – improved in Hadoop 2.x (YARN)

And Specifically in HDFS
• The NameNode marks DataNodes without recent heartbeats as dead
• The replication factor of some blocks can fall below their specified value
• The NameNode constantly tracks which blocks need to be replicated and initiates replication whenever necessary.
• If the NameNode crashes: manual restart/recovery.

Literature
• Read on: hadoop.apache.org, there is also a tutorial
• Hadoop Book: Tom White. Hadoop: The Definitive Guide. O’Reilly.
• Hadoop Illuminated: http://hadoopilluminated.com/hadoop_book/
• Websites, e.g., http://bradhedlund.com/2011/09/10/understandinghadoop-clusters-and-the-network/
• http://lintool.github.io/MapReduceAlgorithms/MapReduce-book-final.pdf
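
As a closing illustration of the secondary sort recipe from the summary slide (composite key, partitioning and grouping on the natural key only, sorting on the full composite key), the following Python sketch simulates the maximum-temperature-per-year example. It is a single-process simulation, not Hadoop code; the function names are hypothetical stand-ins for the partitioner, sort comparator, and grouping comparator, and the records are the (year, temp) pairs from the slides.

```python
from itertools import groupby

# Composite keys (year, temp) as emitted by the mappers, in arbitrary order.
RECORDS = [(1900, 34), (1901, 35), (1900, 35), (1901, 36), (1900, 34)]
NUM_PARTITIONS = 2


def partitioner(composite_key, num_partitions):
    """Partition on the natural key (year) only."""
    year, _temp = composite_key
    return year % num_partitions


def sort_key(composite_key):
    """Sort by year ascending, then by temperature descending."""
    year, temp = composite_key
    return (year, -temp)


def grouping_key(composite_key):
    """Group on the natural key (year) only."""
    return composite_key[0]


def max_temp_per_year(records, num_partitions):
    partitions = [[] for _ in range(num_partitions)]
    for key in records:
        partitions[partitioner(key, num_partitions)].append(key)

    results = []
    for part in partitions:          # one iteration per simulated reducer
        part.sort(key=sort_key)
        for _year, group in groupby(part, key=grouping_key):
            # The first composite key of each group carries the largest temp.
            results.append(next(group))
    return sorted(results)


print(max_temp_per_year(RECORDS, NUM_PARTITIONS))
```

Without the custom grouping, each distinct (year, temp) pair would form its own group, and the reducer would be called once per pair instead of once per year.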