Yahoo Audience Expansion: Migration from Hadoop

Transcription

Yahoo Audience Expansion: Migration from Hadoop Streaming to Spark
Gavin Li, Jaebong Kim, Andy Feng (Yahoo)

Agenda
•  Audience Expansion Spark application
•  Spark scalability: problems and our solutions
•  Performance tuning

AUDIENCE EXPANSION
How we built audience expansion on Spark

Audience Expansion
•  Train a model to find users who behave similarly to sample users
•  Find more potential "converters"

System
•  Large scale machine learning system
•  Logistic regression
•  TBs of input data, up to TBs of intermediate data
•  Hadoop pipeline uses 30000+ mappers, 2000 reducers, 16 hrs run time
•  All Hadoop streaming, ~20 jobs
•  Use Spark to reduce latency and cost

Pipeline
Labeling → Feature Extraction → Model Training → Score/Analyze Models → Validation/Metrics
•  Label positive/negative samples
•  6-7 hrs, IO intensive, 17 TB of intermediate IO in Hadoop
•  Extract features from raw events
•  Logistic regression phase, CPU bound
•  Validate trained models and parameter combinations, select the new model
•  Validate and publish the new model
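The talk keeps the existing Hadoop Streaming code for these stages and runs it through a compatibility layer (ZIPPO, described next) rather than rewriting them. Purely as an illustration of the pipeline shape, here is a minimal native-Spark sketch of the labeling/feature-extraction, training, and scoring steps; the paths, the tab-separated input layout, and the use of MLlib's LogisticRegressionWithSGD are assumptions, not the talk's implementation.

    // Hypothetical sketch of the pipeline stages in native Spark
    // (not the actual ZIPPO-based implementation described in the talk).
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint

    object AudienceExpansionSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("audience-expansion-sketch"))

        // Labeling + feature extraction: parse raw events into labeled feature vectors.
        // The parsing here is a placeholder for the real feature-extraction logic.
        val training = sc.textFile("hdfs:///events/labeled")
          .map { line =>
            val cols = line.split('\t')
            LabeledPoint(cols(0).toDouble, Vectors.dense(cols.drop(1).map(_.toDouble)))
          }

        // Model training: logistic regression (the CPU-bound phase in the talk).
        val model = LogisticRegressionWithSGD.train(training, 100)

        // Scoring: apply the model to candidate users to find likely "converters".
        val scores = sc.textFile("hdfs:///events/candidates")
          .map(line => Vectors.dense(line.split('\t').map(_.toDouble)))
          .map(model.predict)

        scores.saveAsTextFile("hdfs:///audience-expansion/scores")
        sc.stop()
      }
    }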
How to adopt Spark efficiently?
•  Very complicated system
•  20+ Hadoop streaming map reduce jobs
•  20k+ lines of code
•  TBs of data, person-months to do data validation
•  6+ people for 3 quarters to rewrite the system from scratch in Scala

Our migration solution
•  Build a transition layer that automatically converts Hadoop streaming jobs to Spark jobs
•  Don't need to change any Hadoop streaming code
•  2 person-quarters
•  Private Spark

ZIPPO
(Diagram: the Audience Expansion pipeline of 20+ Hadoop Streaming jobs runs on ZIPPO, a Hadoop Streaming layer over Spark, on top of HDFS)
•  A layer (ZIPPO) between Spark and the application
•  Implemented all Hadoop Streaming interfaces
•  Migrate the pipeline without code rewriting
•  Can focus on rewriting perf bottlenecks
•  Plan to open source

ZIPPO: Supported Features
•  Partition related
  –  Hadoop Partitioner class (-partitioner)
  –  num.map.key.fields, num.map.partition.fields
•  Distributed cache
  –  -cacheArchive, -file, -cacheFile
•  Independent working directory for each task instead of each executor
•  Hadoop Streaming aggregation
•  Input data combination (to mitigate many small files)
•  Customized OutputFormat, InputFormat
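The slides do not show ZIPPO's internals, but the core trick of running unmodified streaming mapper/reducer commands inside Spark tasks can be approximated with Spark's built-in RDD.pipe. A rough sketch under that assumption follows; mapper.py and reducer.py are hypothetical stand-ins for existing streaming code, and a real layer such as ZIPPO additionally has to emulate partitioners, the distributed cache, per-task working directories, sort order, and the other features listed above.

    import org.apache.spark.{SparkConf, SparkContext}

    // Rough approximation of the Hadoop Streaming model on Spark: pipe each
    // partition through an external mapper command, shuffle by the tab-separated
    // key, then pipe each reduce partition through an external reducer command.
    object StreamingOnSparkSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("streaming-on-spark-sketch"))

        val mapOut = sc.textFile("hdfs:///input")
          .pipe("python mapper.py")          // "map" phase: lines in via stdin, lines out via stdout
          .map { line =>
            val i = line.indexOf('\t')       // Hadoop Streaming's default key\tvalue split
            if (i < 0) (line, "") else (line.substring(0, i), line.substring(i + 1))
          }

        val reduceOut = mapOut
          .groupByKey(2000)                  // shuffle, standing in for Hadoop's partition/sort step
          .flatMap { case (k, vs) => vs.map(v => s"$k\t$v") }
          .pipe("python reducer.py")         // "reduce" phase over each partition

        reduceOut.saveAsTextFile("hdfs:///output")
        sc.stop()
      }
    }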
Performance Comparison (1 TB data)
•  ZIPPO Hadoop streaming: Spark cluster, 40 hosts, 1 hard drive each
  –  Perf data: 1 hr 25 min
•  Original Hadoop streaming: Hadoop cluster, 40 hosts, 1 hard drive each
  –  Perf data: 3 hrs 5 min

SPARK SCALABILITY

Spark Shuffle
•  The mapper side of the shuffle writes all of its output to disk (shuffle files)
•  Data can be large scale, so it cannot all be held in memory
•  Reducers transfer all the shuffle files for each partition, then process
(Diagram: mappers 1..m each write shuffle files 1..n; reducer partitions 1..n fetch their shuffle files from every mapper)

On each Reducer
•  Every partition needs to hold all the data from all the mappers
•  In memory, in a hash map, uncompressed
(Diagram: a reducer with 4 cores holds partitions 1-4, each aggregating shuffle output from mappers 1..n)

How many partitions?
•  Need partitions small enough that each fits entirely in memory
(Diagram: hosts with 4 cores each, working through partitions 1..n)

Spark needs many Partitions
•  So a common pattern of using Spark is to have a big number of partitions
On each Reducer
•  For a 64 GB memory host with 16 CPU cores
•  With a 30:1 compression ratio and 2x overhead
•  To process 3 TB of data, we need 46080 partitions
•  To process 3 PB of data, we need 46 million partitions
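As a sanity check on the 46080 figure, the arithmetic implied by the slide seems to be: 3 TB of compressed input grows by the 30:1 compression ratio and the 2x processing overhead, and each of the 16 cores can devote about 64 GB / 16 = 4 GB to the partition it is processing. A small sketch of that calculation, with the accounting itself being an assumption reconstructed from the slide's numbers:

    // Back-of-the-envelope partition count, reconstructed from the slide's numbers.
    object PartitionEstimate {
      def requiredPartitions(compressedBytes: Long,
                             compressionRatio: Int = 30,   // on-disk data expands ~30x in memory
                             overheadFactor: Int = 2,      // extra working memory while processing
                             hostMemoryBytes: Long = 64L * 1024 * 1024 * 1024,
                             coresPerHost: Int = 16): Long = {
        val inMemoryBytes = compressedBytes * compressionRatio * overheadFactor
        val memoryPerCore = hostMemoryBytes / coresPerHost  // each core handles one partition at a time
        inMemoryBytes / memoryPerCore
      }

      def main(args: Array[String]): Unit = {
        // 3 TB of input -> 46080 partitions, matching the slide
        println(requiredPartitions(3L * 1024 * 1024 * 1024 * 1024))
      }
    }

Because the bound is per core rather than per cluster, adding hosts does not reduce the required partition count, which is the scalability problem the next slide calls out.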
Non Scalable
•  Not linearly scalable
•  No matter how many hosts we have in total, we always need 46k partitions

Issues of a huge number of partitions
•  Issue 1: OOM on the mapper side
  –  Each mapper core needs to write to 46k shuffle files simultaneously
  –  1 shuffle file = OutputStream + FastBufferStream + CompressionStream
  –  Memory overhead: FD and related kernel overhead; FastBufferStream (for turning random IO into sequential IO), default 100k buffer per stream; CompressionStream, default 64k buffer per stream
  –  So by default the total buffer size is 164k * 46k * 16 = 100+ GB
•  Our solution to mapper OOM
  –  Set spark.shuffle.file.buffer.kb to 4k for FastBufferStream (the kernel block size), based on our contributed patch https://github.com/mesos/spark/pull/685
  –  Set spark.storage.compression.codec to spark.storage.SnappyCompressionCodec to enable Snappy and reduce the footprint
  –  Set spark.snappy.block.size to 8192 to reduce buffer size (while Snappy can still get a good compression ratio)
  –  Total buffer size after this: 12k * 46k * 16 = ~10 GB
•  Issue 2: large number of small files
  –  Each input split in the mapper is broken down into at least 46k partitions
  –  A large number of small files means lots of random R/W IO
  –  When each shuffle file is less than 4k (the kernel block size), the overhead becomes significant
  –  Significant metadata overhead in the FS layer
  –  Example: merely deleting the whole tmp directory by hand can take 2 hours because there are so many small files
  –  Especially bad when splits are not balanced
  –  5x slower than Hadoop
(Diagram: input splits 1..n each write shuffle files 1 through 46080)

Reduce side compression
•  In the current shuffle, reducer-side data in memory is not compressed
•  It can take 10-100 times more memory
•  With our patch https://github.com/mesos/spark/pull/686 we reduced memory consumption by 30x, while compression overhead is less than 3%
•  Without this patch it doesn't work for our case
•  5x-10x performance improvement
•  Reducer side
  –  With compression: 1.6k files
  –  Without compression: 46k shuffle files

Reducer Side Spilling
(Diagram: reducer-side compression buckets 1..n, with over-size buckets spilled to disk)
•  Spills the over-size data in the aggregation hash table to disk
•  Spilling: more IO, more sequential IO, fewer seeks
•  All in memory: less IO, more random IO, more seeks
•  Fundamentally resolved Spark's scalability issue

Align with previous Partition function
•  Our input data come from another map reduce job
•  We use exactly the same hash function to reduce the number of shuffle files
•  A new hash function gives a more even distribution, but many more shuffle files
(Diagram: the previous job generating the input data partitions keys 0,4,8.../1,5,9.../2,6,10.../3,7,11... with mod 4; a Spark job partitioning with mod 5 produces 5 shuffle files for every input partition)
•  Use the same hash function
(Diagram: with the same mod 4 hash function, each input partition produces exactly 1 shuffle file)
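No code is shown for this on the slides, but the idea maps directly onto Spark's Partitioner API: reuse the hash that the upstream MapReduce job used when it wrote the input, so each existing input partition feeds exactly one shuffle file. A minimal sketch, where the Hadoop-style hash formula and the partition count are assumptions:

    import org.apache.spark.Partitioner

    // Partitioner that mirrors the hash used by the upstream MapReduce job that
    // produced our input, so the Spark shuffle lines up 1:1 with the existing
    // partitioning instead of re-splitting every input partition.
    class UpstreamAlignedPartitioner(upstreamPartitions: Int) extends Partitioner {
      override def numPartitions: Int = upstreamPartitions

      override def getPartition(key: Any): Int =
        // Same formula as Hadoop's default HashPartitioner: non-negative hash mod N.
        (key.hashCode & Integer.MAX_VALUE) % upstreamPartitions
    }

    // Usage sketch (pairs is an RDD[(K, V)] keyed the same way as the upstream job;
    // 2000 is an illustrative upstream partition count):
    //   val grouped = pairs.groupByKey(new UpstreamAlignedPartitioner(2000))

For this to line up, the key's hashCode must also match what the upstream job hashed, which is why the talk reuses the partition function rather than inventing a new, more evenly distributed one.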
Align with previous Hash function
•  Our case:
  –  16M shuffle files, 62 KB on average (5-10x slower)
  –  8k shuffle files, 125 MB on average
•  Several different input data sources
•  Use the partition function from the major one

PERFORMANCE TUNING

All About Resource Utilization
•  Maximize resource utilization
•  Use as much CPU, memory, disk, and network as possible
•  Monitor vmstat, iostat, sar

Resource Utilization
•  (This is an old diagram, to be updated)
•  Ideally CPU/IO should be fully utilized
•  Mapper phase: IO bound
•  Final reducer phase: CPU bound

Shuffle file transfer
•  Spark transfers all shuffle files to reducer memory before it starts processing
•  Non-streaming (very hard to change to streaming)
•  This can cause poor resource utilization, so:
  –  Make sure maxBytesInFlight is set big enough
  –  Consider allocating 2x more threads than the physical core count

Thanks.
Gavin Li liyu@yahoo-inc.com
Jaebong Kim pitecus@yahoo-inc.com
Andrew Feng afeng@yahoo-inc.com