崔宝秋 - Bigdata Blog
Transcription
HBase at Xiaomi
Liang Xie / Honghua Feng
{xieliang, fenghonghua}@xiaomi.com
www.mi.com

About Us
- Honghua Feng
- Liang Xie

Outline
- Introduction
- Latency practice
- Some patches we contributed
- Some ongoing patches
- Q&A

About Xiaomi
- Mobile internet company founded in 2010
- Sold 18.7 million phones in 2013
- Over $5 billion revenue in 2013
- Sold 11 million phones in Q1 2014

Hardware

Software

Internet Services

About Our HBase Team
- Founded in October 2012
- 5 members: Liang Xie, Shaohui Liu, Jianwei Cui, Liangliang He, Honghua Feng
- Resolved 130+ JIRAs so far

Our Clusters and Scenarios
- 15 clusters: 9 online / 2 processing / 4 test
- Scenarios: MiCloud, MiPush, MiTalk, Perf Counter

Our Latency Pain Points
- Java GC
- Stable page write in OS layer
- Slow buffered IO (FS journal IO)
- Read/write IO contention

HBase GC Practice
- Bucket cache with off-heap mode
- Xmn / SurvivorRatio / MaxTenuringThreshold
- PretenureSizeThreshold & repl src size
- GC concurrent thread number
- GC time per day: [2500, 3000]s -> [300, 600]s !!!

Write Latency Spikes
HBase client put
  -> HRegion.batchMutate
  -> HLog.sync
  -> SequenceFileLogWriter.sync
  -> DFSOutputStream.flushOrSync
  -> DFSOutputStream.waitForAckedSeqno   [often stuck here!]

DataNode pipeline write, in BlockReceiver.receivePacket():
  -> receiveNextPacket
  -> mirrorPacketTo(mirrorOut)   // write packet to the mirror
  -> out.write/flush             // write data to local disk (buffered IO)

Added instrumentation (HDFS-6110) showed that the stalled write was the culprit; strace results confirmed it.

Root Cause of Write Latency Spikes
- write() is expected to be fast
- But it is sometimes blocked by write-back!
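
The instrumentation idea described above (time each local write and report the slow ones) can be sketched roughly as follows. This is a minimal Python illustration, not the HDFS-6110 patch itself: `timed_write`, `threshold_ms`, `slow_log`, and the injectable `clock` are hypothetical names chosen for this example.

```python
import time
from typing import Callable, List, Tuple


def timed_write(write: Callable[[bytes], int],
                data: bytes,
                threshold_ms: float,
                slow_log: List[Tuple[int, float]],
                clock: Callable[[], float] = time.monotonic) -> int:
    """Call write(data) and record it in slow_log if it took longer than threshold_ms."""
    start = clock()
    written = write(data)
    elapsed_ms = (clock() - start) * 1000.0
    if elapsed_ms > threshold_ms:
        # Keep the payload size and the latency so stalls can be correlated later.
        slow_log.append((len(data), elapsed_ms))
    return written
```

A caller could wrap any buffered write this way, for example `timed_write(f.write, payload, 100.0, slow_log)` for a file opened in binary mode. The clock is a parameter only so that the helper can be exercised deterministically.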

Stable Page Write Issue Workaround
- Workaround: 2.6.32-279 (6.3) -> 2.6.32-220 (6.2), or 2.6.32-279 (6.3) -> 2.6.32-358 (6.4)
- Try to avoid deploying RHEL 6.3 / CentOS 6.3 in an extremely latency-sensitive HBase cluster!

Root Cause of Write Latency Spikes
  ...
  0xffffffffa00dc09d : do_get_write_access+0x29d/0x520 [jbd2]
  0xffffffffa00dc471 : jbd2_journal_get_write_access+0x31/0x50 [jbd2]
  0xffffffffa011eb78 : __ext4_journal_get_write_access+0x38/0x80 [ext4]
  0xffffffffa00fa253 : ext4_reserve_inode_write+0x73/0xa0 [ext4]
  0xffffffffa00fa2cc : ext4_mark_inode_dirty+0x4c/0x1d0 [ext4]
  0xffffffffa00fa6c4 : ext4_generic_write_end+0xe4/0xf0 [ext4]
  0xffffffffa00fdf74 : ext4_writeback_write_end+0x74/0x160 [ext4]
  0xffffffff81111474 : generic_file_buffered_write+0x174/0x2a0 [kernel]
  0xffffffff81112d60 : __generic_file_aio_write+0x250/0x480 [kernel]
  0xffffffff81112fff : generic_file_aio_write+0x6f/0xe0 [kernel]
  0xffffffffa00f3de1 : ext4_file_write+0x61/0x1e0 [ext4]
  0xffffffff811762da : do_sync_write+0xfa/0x140 [kernel]
  0xffffffff811765d8 : vfs_write+0xb8/0x1a0 [kernel]
  0xffffffff81176fe1 : sys_write+0x51/0x90 [kernel]

XFS in the latest kernel can relieve the journal IO blocking issue and is friendlier to metadata-heavy scenarios like HBase + HDFS.

Write Latency Spikes Testing
- 8 YCSB threads; write 20 million rows, each 3*200 bytes; 3 DN; kernel: 3.12.17
- Count the stalled write() calls that cost > 100 ms
- The largest write() latency in ext4: ~600 ms!

Hedged Read (HDFS-5776)

Other Meaningful Latency Work
- Long first "put" issue (HBASE-10010)
- Token invalid (HDFS-5637)
- Retry/timeout setting in DFSClient
- Reduce write traffic? (HLog compression)
- HDFS IO priority (HADOOP-10410)

Wish List
- Real-time HDFS, esp. priority related
- Core data structure GC friendly
- More off-heap; Shenandoah GC
- TCP/disk IO characteristic analysis
- Need more eyes on OS
- Stay tuned…

Some Patches Xiaomi Contributed
- New write thread model (HBASE-8755)
- Reverse scan (HBASE-4811)
- Per table/CF replication (HBASE-8751)
- Block index key optimization (HBASE-7845)

1. New Write Thread Model
Old model:
  WriteHandler x 256 -> Local Buffer
  -> WriteHandler: write to HDFS (x 256)
  -> WriteHandler: sync to HDFS (x 256)
Problem: WriteHandler does everything, severe lock race!

New Write Thread Model
New model:
  WriteHandler x 256 -> Local Buffer
  -> AsyncWriter: write to HDFS (x 1)
  -> AsyncSyncer: sync to HDFS (x 4)
  -> AsyncNotifier: notify writers (x 1)

New Write Thread Model
- Low load: no improvement
- Heavy load: huge improvement (3.5x)

2. Reverse Scan
1. All scanners seek to 'previous' rows (SeekBefore)
2. Figure out next row: max 'previous' row
3. All scanners seek to first KV of next row (SeekTo)
[Figure: KVs of Row1 to Row6 spread across several scanners]
Performance: 70% of forward scan

3. Per Table/CF Replication
[Diagram: PeerA (backup); tables T1: cfA, cfB and T2: cfX, cfY; Source; PeerB (T2:cfX)]
- PeerB creates T2 only: replication can't work!
- PeerB creates T1 & T2: all data replicated!
- Need a way to specify which data to replicate!

Per Table/CF Replication
  add_peer 'PeerA', 'PeerA_ZK'
  add_peer 'PeerB', 'PeerB_ZK', 'T2:cfX'
[Diagram: PeerA; tables T1: cfA, cfB and T2: cfX, cfY; Source; PeerB (T2:cfX)]

4. Block Index Key Optimization
- Before: 'Block 2' block index key = "ah, hello world/…"
- Now: 'Block 2' block index key = "ac/…" (k1 < key <= k2)
[Figure: Block 1 ends with k1: "ab"; Block 2 starts with k2: "ah, hello world"]
- Reduces block index size
- Saves seeking the previous block if the search key is in ['ac', 'ah, hello world']

Some Ongoing Patches
- Cross-table cross-row transaction (HBASE-10999)
- HLog compactor (HBASE-9873)
- Adjusted delete semantic (HBASE-8721)
- Coordinated compaction (HBASE-9528)
- Quorum master (HBASE-10296)

1. Cross-Row Transaction: Themis
- http://github.com/xiaomi/themis
- Google Percolator: "Large-scale Incremental Processing Using Distributed Transactions and Notifications"
- Two-phase commit: strong cross-table/row consistency
- Global timestamp server: global strictly incremental timestamp
- No touch to HBase internals: based on HBase client and coprocessor
- Read: 90%, write: 23% (same degradation as Google Percolator)
- More details: HBASE-10999

2. HLog Compactor
[Diagram: HLog 1, 2, 3; memstores of Region 1, Region 2, Region x; HFiles]
- Region x: few writes, but scattered across many HLogs
- PeriodicMemstoreFlusher: flushes old memstores forcefully
- 'flushCheckInterval' / 'flushPerChanges': hard to configure
- Results in 'tiny' HFiles
- HBASE-10499: problematic region can't be flushed!

HLog Compactor
[Diagram: HLog 1, 2, 3, 4; HLog x; memstores of Region 1, Region x, Region 2; HFiles]
- Compact: HLog 1, 2, 3, 4 -> HLog x
- Archive: HLog 1, 2, 3, 4

3. Adjusted Delete Semantic
Scenario 1
1. Write kvA at t0
2. Delete kvA at t0, flush to hfile
3. Write kvA at t0 again
4. Read kvA
Result: kvA can't be read out

Scenario 2
1. Write kvA at t0
2. Delete kvA at t0, flush to hfile
3. Major compact
4. Write kvA at t0 again
5. Read kvA
Result: kvA can be read out

Fix: "delete can't mask kvs with larger mvcc (put later)"

4. Coordinated Compaction
[Diagram: three RSs on top of HDFS (global resource); compact storm!]
Compaction uses a global resource (HDFS), while whether to compact is decided locally!
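
The fix quoted on the Adjusted Delete Semantic slide ("delete can't mask kvs with larger mvcc") can be illustrated with a small sketch. This is not the HBASE-8721 patch: `Cell`, `masked_old`, and `masked_adjusted` are hypothetical names, and the sketch assumes a single column with a delete marker that covers all timestamps at or below its own.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    ts: int    # user-visible timestamp of the put or delete marker
    mvcc: int  # write sequence number; a larger value means "written later"


def masked_old(put: Cell, delete: Cell) -> bool:
    """Old semantic: the delete marker hides every put at or below its timestamp."""
    return put.ts <= delete.ts


def masked_adjusted(put: Cell, delete: Cell) -> bool:
    """Adjusted semantic: the marker only hides puts that were written before it."""
    return put.ts <= delete.ts and put.mvcc < delete.mvcc
```

Under the adjusted rule, the second write of kvA at t0 carries a larger mvcc than the delete marker, so it stays readable whether or not a major compaction has already removed the marker. That makes Scenario 1 agree with Scenario 2.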

Coordinated Compaction
[Diagram: each RS asks the master "Can I?"; the master answers "OK" or "NO"; HDFS is the global resource]
Compaction is scheduled by the master, so there is no compact storm any longer

5. Quorum Master
[Diagram: masters (labels A, X); ZooKeeper (zk1, zk2, zk3); RSs; arrow label: Read info/states]
- When the active master serves, the standby master stays 'really' idle
- When the standby master becomes active, it needs to rebuild in-memory status

Quorum Master
[Diagram: Master 1, Master 2, Master 3 (labels A, X); RSs]
- Better master failover perf: no phase to rebuild in-memory status
- Better restart perf for BIG clusters (10K+ regions)
- No external (ZooKeeper) dependency
- No potential consistency issue
- Simpler deployment

Acknowledgement
- Hangjun Ye, Zesheng Wu, Peng Zhang
- Xing Yong, Hao Huang, Hailei Li
- Shaohui Liu, Jianwei Cui, Liangliang He
- Dihao Chen

Thank You!
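
As a closing illustration of the Coordinated Compaction slides, the "Can I?" / "OK" / "NO" exchange between region servers and the master might look roughly like the sketch below. The slides do not say how the master decides, so the fixed concurrency cap is purely an assumption, and `CompactionCoordinator`, `request`, and `finish` are hypothetical names rather than anything from HBASE-9528.

```python
from typing import Set


class CompactionCoordinator:
    """Toy master-side gate that caps how many region servers compact at once."""

    def __init__(self, max_concurrent: int) -> None:
        if max_concurrent < 1:
            raise ValueError("max_concurrent must be at least 1")
        self.max_concurrent = max_concurrent
        self.running: Set[str] = set()

    def request(self, region_server: str) -> bool:
        """Answer a region server's "Can I?" with True (OK) or False (NO)."""
        if region_server in self.running:
            return True  # already granted; repeating the request is harmless
        if len(self.running) >= self.max_concurrent:
            return False
        self.running.add(region_server)
        return True

    def finish(self, region_server: str) -> None:
        """Release the slot once the region server reports its compaction done."""
        self.running.discard(region_server)
```

Because the decision is made in one place, the load that compactions put on the shared HDFS stays bounded regardless of how many region servers want to compact at the same moment.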