© inovex Academy 1

HDFS & Map-Reduce
Agenda
introduction
1. HDFS
2. Map-Reduce
What?
1. framework for distributed data processing
2. highly scalable: TBs and PBs
3. originated at Google
4. open-source implementation: Apache Hadoop
Why?
1. too much data for one machine
2. processing speed
3. scaling out vs. scaling up
Photo by Flo P.
The Big Picture
[Diagram: a webserver farm ships its logs to the Hadoop cluster]
Agenda
1. HDFS (Hadoop Distributed File System)
2. Map-Reduce
HDFS Architecture
[Diagram: one name node (with a standby NN for failover) and twelve data nodes, 01 to 12, spread over racks 1 to 3. A client holds a file split into blocks blk 1 to blk 4.]
1. client asks the name node: "Where do I store block 1?"
2. name node answers: data nodes 03, 05, 08
3. client sends blk 1 together with the target list (03, 05, 08) to data node 03; the block is passed on until data nodes 05 and 08 hold a copy as well
4. each of the three data nodes reports "Done!"
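The arithmetic behind the block diagram can be sketched as follows. This is a minimal illustration, not HDFS code: the 500 MB file size is an assumed example chosen so that four blocks result, as in the diagram, and the 128 MB block size and replication factor 3 are the usual Hadoop defaults, which a real cluster may override.

```java
final class HdfsBlocks {
    static final long MB = 1024L * 1024L;

    // Number of blocks needed for a file: ceiling of fileBytes / blockBytes.
    static long blockCount(long fileBytes, long blockBytes) {
        return (fileBytes + blockBytes - 1) / blockBytes;
    }

    // Size of the final block, which is usually only partially filled.
    static long lastBlockBytes(long fileBytes, long blockBytes) {
        long remainder = fileBytes % blockBytes;
        return (remainder == 0 && fileBytes > 0) ? blockBytes : remainder;
    }

    // Raw cluster storage consumed when every block is stored `replication` times.
    static long rawBytes(long fileBytes, int replication) {
        return fileBytes * replication;
    }

    public static void main(String[] args) {
        long file = 500 * MB;   // assumed example file size
        long block = 128 * MB;  // assumed block size (common default)
        System.out.println("blocks:        " + blockCount(file, block));
        System.out.println("last block MB: " + lastBlockBytes(file, block) / MB);
        System.out.println("raw MB (x3):   " + rawBytes(file, 3) / MB);
    }
}
```

With these assumed numbers the file occupies four blocks, the last one holding 116 MB, and the cluster stores 1500 MB in total.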
Fault tolerance
1. cluster nodes fail
2. MTBF of the cluster as a whole decreases as nodes are added
3. failures are unavoidable
Failure detection
1. heartbeat protocol
2. namenode blacklists unresponsive nodes
3. automatic re-replication of the affected blocks to another datanode
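The three steps above can be sketched as a small bookkeeping class. This is a simplified model under stated assumptions, not the namenode's actual implementation: the class name, the node names and the 30-second timeout are hypothetical, and timestamps are passed in explicitly so the behaviour is easy to follow.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class HeartbeatMonitor {
    private final long timeoutMillis;
    // Last time each datanode reported in (sorted map for stable output).
    private final Map<String, Long> lastSeen = new TreeMap<>();

    HeartbeatMonitor(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    // Step 1: a datanode sends a heartbeat.
    void heartbeat(String node, long nowMillis) {
        lastSeen.put(node, nowMillis);
    }

    // Step 2: nodes whose last heartbeat is older than the timeout count as dead.
    List<String> deadNodes(long nowMillis) {
        List<String> dead = new ArrayList<>();
        for (Map.Entry<String, Long> e : lastSeen.entrySet()) {
            if (nowMillis - e.getValue() > timeoutMillis) {
                dead.add(e.getKey());
            }
        }
        return dead;
    }

    // Step 3: blocks with fewer live replicas than the target need a new copy.
    // Nodes that never sent a heartbeat are not tracked and count as live here.
    List<String> underReplicated(Map<String, List<String>> replicas, int target, long nowMillis) {
        List<String> dead = deadNodes(nowMillis);
        List<String> blocks = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : new TreeMap<>(replicas).entrySet()) {
            long live = e.getValue().stream().filter(n -> !dead.contains(n)).count();
            if (live < target) {
                blocks.add(e.getKey());
            }
        }
        return blocks;
    }
}
```

For example, if dn03, dn05 and dn08 all hold blk 1 and dn05 stops sending heartbeats, `deadNodes` returns dn05 and `underReplicated` lists blk 1 as a candidate for re-replication.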
Availability
1. client requests storage location from namenode
2. client fetches data from datanode directly
3. highly available HDFS:
   1. standby namenode
   2. shared state
Durability
1. datanodes: replication
2. namenode: edit log
   1. fast append operations
   2. replay upon restart
3. secondary namenode: synchronization of FS image with edit log (≠ standby!)
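The interplay of FS image and edit log can be sketched as follows. This is a toy model under stated assumptions: the namespace is reduced to a set of paths, and the textual CREATE/DELETE entries are a hypothetical format chosen for readability, not the real on-disk edit log format.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

final class EditLogSketch {
    // Rebuilds the current namespace by replaying the edit log on top of
    // the last FS image. The inputs are left unmodified.
    static Set<String> replay(Set<String> fsImage, List<String> editLog) {
        Set<String> state = new TreeSet<>(fsImage);
        for (String entry : editLog) {
            // Each entry is "<OP> <path>" (hypothetical format).
            String[] parts = entry.split(" ", 2);
            switch (parts[0]) {
                case "CREATE":
                    state.add(parts[1]);
                    break;
                case "DELETE":
                    state.remove(parts[1]);
                    break;
                default:
                    throw new IllegalArgumentException("unknown operation: " + entry);
            }
        }
        return state;
    }

    public static void main(String[] args) {
        Set<String> image = Set.of("/data/candy/mandms-100k.txt");
        List<String> log = List.of(
            "DELETE /data/candy/mandms-100k.txt",
            "CREATE /data/candy/mandms-100k.txt",
            "CREATE /data/candy/notes.txt");  // hypothetical second file
        System.out.println(replay(image, log));
    }
}
```

Writing a change means appending one line to the log, which is why it is fast. A checkpoint, the job of the secondary namenode, amounts to storing the result of `replay` as the new FS image and starting an empty log, so that a restart has less to replay.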
HDFS Features
1. replication
2. failure detection
3. high availability
4. durability
... in a highly distributed, scalable manner
Exercise: HDFS CLI
1. handout: HDFS commands
2. inspect directory structure in HDFS
3. remove data/candy/mandms-100k.txt from HDFS
4. upload local copy to HDFS
5. inspect data
Agenda
1. HDFS
2. Map-Reduce
Processing w/ Map-Reduce
[Diagram: the input is split across six map tasks running on the datanodes; the shuffle routes their output to three reduce tasks]
Embarrassingly parallel
1. map-only jobs, e.g., filtering, extraction
2. no synchronization needed
3. Example: field selection (player event)
"5000000006394954865";"6395770414";"2013-02-05
23:26:01.111000";"movies.xyz.de";"shows";"0";
"null";"other";"prerollAd_start";"x-files";
"/shows/the-x-files/ourplayer/full/21591/player/
MapReduce for aliens(1b_blaG A)";"full";"0"
[Diagram: independent map tasks, no reduce step]
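A map-only field selection over such a record might look like the sketch below. The class and method names are hypothetical, the chosen columns (timestamp, event, show) are only an example, and the sample record is the one from the slide with its wrapped lines joined, which is an assumption about the original line breaks.

```java
import java.util.ArrayList;
import java.util.List;

final class FieldSelection {
    // Sample record from the slide, wrapped lines joined (assumed).
    static final String SAMPLE =
        "\"5000000006394954865\";\"6395770414\";\"2013-02-05 23:26:01.111000\";"
        + "\"movies.xyz.de\";\"shows\";\"0\";\"null\";\"other\";\"prerollAd_start\";\"x-files\";"
        + "\"/shows/the-x-files/ourplayer/full/21591/player/MapReduce for aliens(1b_blaG A)\";"
        + "\"full\";\"0\"";

    // Splits one record of the form "a";"b";"c" into its unquoted fields.
    // Assumes no field contains the separator sequence ";" itself.
    static String[] parse(String line) {
        String trimmed = line.trim();
        // Drop the leading and trailing quote, then split on the quoted separator.
        String inner = trimmed.substring(1, trimmed.length() - 1);
        return inner.split("\";\"", -1);
    }

    // The map function: keeps only the requested columns (0-based) of one record.
    // Each record is handled on its own, so no synchronization is needed.
    static List<String> map(String line, int... columns) {
        String[] fields = parse(line);
        List<String> selected = new ArrayList<>();
        for (int c : columns) {
            selected.add(fields[c]);
        }
        return selected;
    }

    public static void main(String[] args) {
        // prints [2013-02-05 23:26:01.111000, prerollAd_start, x-files]
        System.out.println(map(SAMPLE, 2, 8, 9));
    }
}
```

Because `map` looks at one record at a time and keeps no state, any number of map tasks can run it side by side on different input splits.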
Introducing reducers
1. aggregate mapper output
2. single point of synchronization
Some code
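import java.util.*;

// Color, MM and Pair are the slide's own illustrative types; sum() adds up a list of integers.

// map: emit a count of 1 for the color of each M&M
Pair<Color, Integer> map( Long key, MM val ) {
    return new Pair<>( val.color, 1 );
}

// reduce: add up all counts collected for one color
Pair<Color, Integer> reduce( Color key, List<Integer> vals ) {
    return new Pair<>( key, sum( vals ) );
}

List<Pair<Color, Integer>> main( List<Pair<Long, MM>> mmJar ) {
    // group the mapper output by key
    Map<Color, List<Integer>> mapped = new HashMap<>();
    for ( Pair<Long, MM> mapIn : mmJar ) {
        // in parallel
        Pair<Color, Integer> mapOut = map( mapIn.first, mapIn.second );
        // create the value list on first sight of a key
        mapped.computeIfAbsent( mapOut.first, k -> new ArrayList<>() ).add( mapOut.second );
    }
    List<Pair<Color, Integer>> reduced = new ArrayList<>();
    for ( Color c : mapped.keySet() ) {
        // in parallel
        Pair<Color, Integer> reduceOut = reduce( c, mapped.get( c ) );
        reduced.add( reduceOut );
    }
    return reduced;
}
Photo by Shawn Rossi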
Keys & values
1. typed key-value pairs
2. partitioning of data based on keys
[Diagram: map emits four 1s for the same key; they are grouped into [1,1,1,1]; reduce sums them to 4]
Photo by wayne’s eye view
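Partitioning by key can be sketched as below. The class name and the string color keys are hypothetical; the partition formula is, to my knowledge, the one Hadoop's default hash partitioner uses, but treat that as an assumption rather than a reference.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class KeyPartitioner {
    // Maps a key to a reducer index in [0, numReducers).
    // Masking with Integer.MAX_VALUE clears the sign bit, so negative
    // hash codes cannot produce a negative index.
    static int partition(Object key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // Collects the values of all pairs that belong to one reducer, grouped by key.
    static Map<String, List<Integer>> groupFor(int reducer, int numReducers,
                                               List<Map.Entry<String, Integer>> mapOutput) {
        Map<String, List<Integer>> groups = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : mapOutput) {
            if (partition(pair.getKey(), numReducers) == reducer) {
                groups.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
            }
        }
        return groups;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> mapOutput = List.of(
            Map.entry("red", 1), Map.entry("blue", 1), Map.entry("red", 1),
            Map.entry("red", 1), Map.entry("red", 1));
        for (int r = 0; r < 2; r++) {
            System.out.println("reducer " + r + ": " + groupFor(r, 2, mapOutput));
        }
    }
}
```

The point of the design is that equal keys always yield the same index, so every value for one key ends up at the same reducer, which can then see the complete list, such as [1,1,1,1], and reduce it to 4.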
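The shuffle
[Diagram: data flow between one map task and one reduce task]
map task:
1. input split
2. map
3. buffer in memory
4. partition, sort and spill to disk
5. merge on disk into partitions
copy phase:
6. reduce task fetches its partition from each map (http); the other partitions go to other reduces
"sort" phase (misnomer: it merges):
7. merge, over a mixture of in-memory and on-disk data
reduce task:
8. reduce
9. output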
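The merge in the "sort" phase can be illustrated with a k-way merge of already sorted runs. This is a simplified in-memory sketch with hypothetical names: each list stands for the sorted output fetched from one map task, whereas the real shuffle merges files and memory buffers.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

final class MergeSortedRuns {
    // A read position inside one sorted run.
    private static final class Cursor {
        final List<String> run;
        int pos;

        Cursor(List<String> run) {
            this.run = run;
        }

        String head() {
            return run.get(pos);
        }
    }

    // Merges runs that are each sorted into one sorted list.
    // No re-sorting happens: the heap only compares the current heads.
    static List<String> merge(List<List<String>> sortedRuns) {
        PriorityQueue<Cursor> heap =
            new PriorityQueue<>((a, b) -> a.head().compareTo(b.head()));
        for (List<String> run : sortedRuns) {
            if (!run.isEmpty()) {
                heap.add(new Cursor(run));
            }
        }
        List<String> merged = new ArrayList<>();
        while (!heap.isEmpty()) {
            Cursor c = heap.poll();   // run with the smallest head
            merged.add(c.head());
            c.pos++;
            if (c.pos < c.run.size()) {
                heap.add(c);          // re-insert with its next element as head
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        // prints [blue, blue, green, red, red]
        System.out.println(merge(List.of(
            List.of("blue", "red"), List.of("green", "red"), List.of("blue"))));
    }
}
```

Since the map side has already sorted each run, the reduce side only needs this merge to see all values of a key next to each other, which is why the phase name is called a misnomer.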