崔宝秋 - Bigdata Blog

Transcription

HBase at Xiaomi
Liang Xie / Honghua Feng
{xieliang, fenghonghua}@xiaomi.com
About Us
Honghua Feng
Liang Xie
Outline
- Introduction
- Latency practice
- Some patches we contributed
- Some ongoing patches
- Q&A
About Xiaomi
- Mobile internet company founded in 2010
- Sold 18.7 million phones in 2013
- Over $5 billion revenue in 2013
- Sold 11 million phones in Q1 2014
Hardware
Software
Internet Services
About Our HBase Team
- Founded in October 2012
- 5 members: Liang Xie, Shaohui Liu, Jianwei Cui, Liangliang He, Honghua Feng
- Resolved 130+ JIRAs so far
Our Clusters and Scenarios
- 15 clusters: 9 online / 2 processing / 4 test
- Scenarios: MiCloud, MiPush, MiTalk, Perf Counter
Our Latency Pain Points
- Java GC
- Stable page write in OS layer
- Slow buffered IO (FS journal IO)
- Read/Write IO contention
HBase GC Practice
- Bucket cache in off-heap mode (see the config sketch below)
- Xmn / SurvivorRatio / MaxTenuringThreshold
- PretenureSizeThreshold & replication source size
- GC concurrent thread number
GC time per day: [2500, 3000]s -> [300, 600]s !!!
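As a concrete illustration of the caching part of this tuning, here is a minimal sketch of the off-heap BucketCache settings involved; the class name, the cache size, and the JVM flags mentioned in the trailing comment are illustrative assumptions, not our production values.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class OffHeapBlockCacheConfig {
    public static Configuration create() {
        Configuration conf = HBaseConfiguration.create();
        // Keep cached data blocks off the Java heap so they no longer add GC pressure.
        conf.set("hbase.bucketcache.ioengine", "offheap");
        // Off-heap cache size in MB (illustrative value).
        conf.set("hbase.bucketcache.size", "4096");
        return conf;
    }
    // The heap itself is tuned in hbase-env.sh with flags such as -Xmn,
    // -XX:SurvivorRatio, -XX:MaxTenuringThreshold, -XX:PretenureSizeThreshold
    // and -XX:ConcGCThreads, matching the knobs listed above.
}
```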
Write Latency Spikes
HBase client put
->HRegion.batchMutate
->HLog.sync
->SequenceFileLogWriter.sync
->DFSOutputStream.flushOrSync
->DFSOutputStream.waitForAckedSeqno <Stuck here often!>
===================================================
DataNode pipeline write, in BlockReceiver.receivePacket() :
->receiveNextPacket
->mirrorPacketTo(mirrorOut) //write packet to the mirror
->out.write/flush //write data to local disk. <- buffered IO
Added instrumentation (HDFS-6110) showed the stalled local write was the culprit; strace results confirmed it (a sketch of that kind of timing instrumentation follows below).
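The instrumentation idea is simply to time the local buffered write/flush and log when it stalls. A rough, self-contained sketch of that pattern (not the actual HDFS-6110 patch; the class name and threshold are illustrative):

```java
import java.io.IOException;
import java.io.OutputStream;

public class TimedWrite {
    private static final long SLOW_WRITE_THRESHOLD_MS = 300; // illustrative threshold

    // Write a packet to the local disk and report when the buffered write stalls.
    static void writePacket(OutputStream out, byte[] packet) throws IOException {
        long start = System.nanoTime();
        out.write(packet);
        out.flush();
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        if (elapsedMs > SLOW_WRITE_THRESHOLD_MS) {
            // A buffered write() should normally return quickly; a long stall here
            // points at write-back/journal contention in the OS layer.
            System.err.println("Slow local write: " + elapsedMs + " ms for "
                + packet.length + " bytes");
        }
    }
}
```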
Root Cause of Write Latency Spikes
- write() is expected to be fast
- But blocked by write-back sometimes!
Stable page write issue workaround
Workaround :
2.6.32.279(6.3) -> 2.6.32.220(6.2)
or
2.6.32.279(6.3) -> 2.6.32.358(6.4)
Try to avoid deploying RHEL 6.3 / CentOS 6.3 in an extremely latency-sensitive HBase cluster!
Root Cause of Write Latency Spikes
...
0xffffffffa00dc09d : do_get_write_access+0x29d/0x520 [jbd2]
0xffffffffa00dc471 : jbd2_journal_get_write_access+0x31/0x50 [jbd2]
0xffffffffa011eb78 : __ext4_journal_get_write_access+0x38/0x80 [ext4]
0xffffffffa00fa253 : ext4_reserve_inode_write+0x73/0xa0 [ext4]
0xffffffffa00fa2cc : ext4_mark_inode_dirty+0x4c/0x1d0 [ext4]
0xffffffffa00fa6c4 : ext4_generic_write_end+0xe4/0xf0 [ext4]
0xffffffffa00fdf74 : ext4_writeback_write_end+0x74/0x160 [ext4]
0xffffffff81111474 : generic_file_buffered_write+0x174/0x2a0 [kernel]
0xffffffff81112d60 : __generic_file_aio_write+0x250/0x480 [kernel]
0xffffffff81112fff : generic_file_aio_write+0x6f/0xe0 [kernel]
0xffffffffa00f3de1 : ext4_file_write+0x61/0x1e0 [ext4]
0xffffffff811762da : do_sync_write+0xfa/0x140 [kernel]
0xffffffff811765d8 : vfs_write+0xb8/0x1a0 [kernel]
0xffffffff81176fe1 : sys_write+0x51/0x90 [kernel]
XFS on recent kernels can relieve the journal IO blocking issue and is friendlier to metadata-heavy scenarios like HBase on HDFS.
Write Latency Spikes Testing
8 YCSB threads; write 20 million rows, each 3*200 bytes; 3 DataNodes; kernel 3.12.17
We counted stalled write() calls that took > 100 ms
The largest write() latency on ext4: ~600 ms!
Hedged Read (HDFS-5776)
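HDFS-5776 lets the DFSClient fire a "hedged" second read at another replica when the first replica is slow, and use whichever response arrives first. A minimal client-side sketch of enabling it (the pool size and threshold values are illustrative):

```java
import org.apache.hadoop.conf.Configuration;

public class HedgedReadConfig {
    public static Configuration create() {
        Configuration conf = new Configuration();
        // Number of threads available for hedged reads; 0 (the default) disables the feature.
        conf.setInt("dfs.client.hedged.read.threadpool.size", 20);
        // Wait this long for the first replica before hedging to another one.
        conf.setLong("dfs.client.hedged.read.threshold.millis", 50);
        return conf;
    }
}
```

In an HBase deployment the same keys can go into hbase-site.xml, since the RegionServer hands its configuration to the embedded DFSClient.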
Other Meaningful Latency Work
- Long first "put" issue (HBASE-10010)
- Token invalid (HDFS-5637)
- Retry/timeout settings in DFSClient (see the sketch below)
- Reduce write traffic? (HLog compression)
- HDFS IO priority (HADOOP-10410)
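For the retry/timeout item above, a hedged sketch of the kind of client-side knobs involved; the keys come from the Hadoop core/HDFS defaults, but the class name and values here are illustrative, not a recommendation.

```java
import org.apache.hadoop.conf.Configuration;

public class LowLatencyDfsClientConfig {
    public static Configuration create() {
        Configuration conf = new Configuration();
        // Give up on a slow/dead DataNode sooner than the default 60s read timeout.
        conf.setInt("dfs.client.socket-timeout", 10000);
        // Keep connection setup attempts short as well (illustrative values).
        conf.setInt("ipc.client.connect.timeout", 3000);
        conf.setInt("ipc.client.connect.max.retries", 2);
        return conf;
    }
}
```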
Wish List
- Real-time HDFS, esp. priority related
- GC-friendly core data structures
- More off-heap; Shenandoah GC
- TCP/disk IO characteristic analysis
Need more eyes on the OS
Stay tuned…
Some Patches Xiaomi Contributed
- New write thread model (HBASE-8755)
- Reverse scan (HBASE-4811)
- Per table/CF replication (HBASE-8751)
- Block index key optimization (HBASE-7845)
1. New Write Thread Model
Old model:
[Diagram: 256 WriteHandler threads, each of which appends to the local buffer, writes to HDFS, and syncs to HDFS by itself]
Problem: WriteHandler does everything, severe lock race!
New Write Thread Model
New model:
[Diagram: 256 WriteHandler threads append to the local buffer; 1 AsyncWriter writes the buffer to HDFS; 4 AsyncSyncer threads sync to HDFS; 1 AsyncNotifier notifies the waiting writers]
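To make the division of labor concrete, here is a toy Java sketch of the same producer/consumer idea: many handlers enqueue an edit and wait, a single writer drains the queue, and a syncer completes the edit and wakes the handler (the notifier role is folded into the syncer via a latch). The names and structure are illustrative, not HBase's actual FSHLog code.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;

// Toy version of the HBASE-8755 pipeline: handlers enqueue, one writer drains,
// one syncer "syncs" and wakes up the waiting handlers.
public class WriteModelSketch {
    static class Edit {
        final byte[] payload;
        final CountDownLatch synced = new CountDownLatch(1);
        Edit(byte[] payload) { this.payload = payload; }
    }

    private final BlockingQueue<Edit> buffer = new LinkedBlockingQueue<>();
    private final BlockingQueue<Edit> pendingSync = new LinkedBlockingQueue<>();

    // Called by the many WriteHandler threads: enqueue the edit and wait for the sync.
    public void append(byte[] payload) throws InterruptedException {
        Edit e = new Edit(payload);
        buffer.put(e);
        e.synced.await();          // the AsyncNotifier's job in the real model
    }

    // Single AsyncWriter: drain the shared buffer and hand edits to the syncer.
    private final Thread asyncWriter = new Thread(() -> {
        try {
            while (true) {
                Edit e = buffer.take();
                // write e.payload to the WAL stream here
                pendingSync.put(e);
            }
        } catch (InterruptedException ignored) { }
    }, "AsyncWriter");

    // Single AsyncSyncer in this sketch (the slide uses a small pool of them).
    private final Thread asyncSyncer = new Thread(() -> {
        try {
            while (true) {
                Edit e = pendingSync.take();
                // hflush()/sync the WAL stream here
                e.synced.countDown();   // wake up the waiting handler
            }
        } catch (InterruptedException ignored) { }
    }, "AsyncSyncer");

    public void start() {
        asyncWriter.setDaemon(true);
        asyncSyncer.setDaemon(true);
        asyncWriter.start();
        asyncSyncer.start();
    }
}
```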
New Write Thread Model
- Low load: no improvement
- Heavy load: huge improvement (3.5x)
2. Reverse Scan
1. All scanners seek to ‘previous’ rows (SeekBefore)
2. Figure out next row : max ‘previous’ row
3. All scanners seek to first KV of next row (SeekTo)
[Diagram: KeyValues of Row1 to Row6 spread across several store files, illustrating the seek steps above]
Performance : 70% of forward scan
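On the client side, a reverse scan is requested through the Scan API introduced by HBASE-4811. A minimal usage sketch (table and row names are illustrative):

```java
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class ReverseScanExample {
    public static void scanBackwards(HTable table) throws Exception {
        Scan scan = new Scan();
        // For a reversed scan the start row is the largest row you want back.
        scan.setStartRow(Bytes.toBytes("Row6"));
        scan.setReversed(true);
        ResultScanner scanner = table.getScanner(scan);
        try {
            for (Result result : scanner) {
                // Rows arrive in descending order: Row6, Row5, Row4, ...
                System.out.println(Bytes.toString(result.getRow()));
            }
        } finally {
            scanner.close();
        }
    }
}
```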
3. Per Table/CF Replication
- PeerB creates T2 only: replication can't work!
- PeerB creates T1 & T2: all data replicated!
[Diagram: a source cluster with T1 (cfA, cfB) and T2 (cfX, cfY) replicating to PeerA (a full backup) and to PeerB, which only wants T2:cfX]
Need a way to specify which data to replicate!
Per Table/CF Replication
- add_peer 'PeerA', 'PeerA_ZK'
- add_peer 'PeerB', 'PeerB_ZK', 'T2:cfX'
[Diagram: the source cluster with T1 (cfA, cfB) and T2 (cfX, cfY); PeerA receives everything, PeerB receives only T2:cfX]
4. Block Index Key Optimization
Before: 'Block 2' block index key = "ah, hello world/…"
Now: 'Block 2' block index key = "ac/…" (k1 < key <= k2)
[Diagram: Block 1 ends with k1 = "ab"; Block 2 starts with k2 = "ah, hello world"]
- Reduces block index size
- Saves seeking into the previous block when the search key is in ['ac', 'ah, hello world'] (see the sketch below)
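The shortened key only has to sort strictly after the last key of Block 1 and no later than the first key of Block 2. A toy sketch of how such a separator can be computed (an illustration of the idea, not HBase's actual implementation):

```java
import java.util.Arrays;

public class ShortestSeparator {
    // Return a short key k with left < k <= right, e.g. "ab" / "ah, hello world" -> "ac".
    static byte[] shortestSeparator(byte[] left, byte[] right) {
        int common = 0;
        while (common < left.length && common < right.length
                && left[common] == right[common]) {
            common++;
        }
        // If bumping the first differing byte of 'left' still sorts before 'right',
        // the separator is commonPrefix + (left[common] + 1).
        if (common < left.length && common < right.length
                && (left[common] & 0xff) + 1 < (right[common] & 0xff)) {
            byte[] sep = Arrays.copyOf(left, common + 1);
            sep[common]++;
            return sep;
        }
        // Otherwise fall back to the full first key of the block.
        return right;
    }

    public static void main(String[] args) {
        byte[] sep = shortestSeparator("ab".getBytes(), "ah, hello world".getBytes());
        System.out.println(new String(sep)); // prints "ac"
    }
}
```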
Some ongoing patches
- Cross-table cross-row transaction (HBASE-10999)
- HLog compactor (HBASE-9873)
- Adjusted delete semantic (HBASE-8721)
- Coordinated compaction (HBASE-9528)
- Quorum master (HBASE-10296)
1. Cross-Row Transaction : Themis
http://github.com/xiaomi/themis
- Google Percolator: "Large-scale Incremental Processing Using Distributed Transactions and Notifications"
- Two-phase commit: strong cross-table/cross-row consistency
- Global timestamp server: globally, strictly increasing timestamps
- No touching of HBase internals: based on the HBase client and coprocessors
- Read: ~90% of raw HBase performance; write: ~23% (a similar degradation to Google Percolator)
- More details: HBASE-10999
2. HLog Compactor
[Diagram: memstores of Region 1, Region 2, …, Region x referencing HLogs 1, 2, 3 and flushing to HFiles]
Region x: few writes, but they are scattered across many HLogs
PeriodicMemstoreFlusher: flushes old memstores forcefully
- 'flushCheckInterval'/'flushPerChanges': hard to configure (see the sketch below)
- Results in 'tiny' HFiles
- HBASE-10499: a problematic region can't be flushed!
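For reference, a sketch of the configuration keys behind those two knobs; the class name is illustrative and the values only illustrate the trade-off, they are not a recommendation.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class PeriodicFlushConfig {
    public static Configuration create() {
        Configuration conf = HBaseConfiguration.create();
        // flushCheckInterval: force-flush a memstore that has gone unflushed
        // for this long, in ms (illustrative value).
        conf.setInt("hbase.regionserver.optionalcacheflushinterval", 3600000);
        // flushPerChanges: ... or once it has accumulated this many edits
        // (illustrative value).
        conf.setLong("hbase.regionserver.flush.per.changes", 30000000L);
        return conf;
    }
}
```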
HLog Compactor
- Compact: HLog 1,2,3,4 → HLog x
- Archive: HLog 1,2,3,4
[Diagram: after compaction, only HLog x is referenced by the memstores of Region 1 … Region x, which flush to HFiles]
3. Adjusted Delete Semantic
Scenario 1
1. Write kvA at t0
2. Delete kvA at t0, flush to hfile
3. Write kvA at t0 again
4. Read kvA
Result : kvA can’t be read out
Scenario 2
1. Write kvA at t0
2. Delete kvA at t0, flush to hfile
3. Major compact
4. Write kvA at t0 again
5. Read kvA
Result : kvA can be read out
Fix: "a delete can't mask KVs with a larger MVCC (i.e., ones put later)" (Scenario 1 is reproduced in the sketch below)
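Scenario 1 can be reproduced with the plain client API. A hedged sketch using the 0.94/0.98-era client of the time; the table, family and qualifier names are illustrative:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class DeleteSemanticScenario1 {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        byte[] row = Bytes.toBytes("rowA");
        byte[] cf = Bytes.toBytes("cf");
        byte[] q = Bytes.toBytes("q");
        long t0 = 1000L;

        HTable table = new HTable(conf, "test_table");

        Put put = new Put(row);
        put.add(cf, q, t0, Bytes.toBytes("v1"));
        table.put(put);                                  // 1. write kvA at t0

        Delete delete = new Delete(row);
        delete.deleteColumn(cf, q, t0);
        table.delete(delete);                            // 2. delete kvA at t0 ...
        new HBaseAdmin(conf).flush("test_table");        //    ... and flush to an HFile

        table.put(put);                                  // 3. write kvA at t0 again

        Get get = new Get(row);
        get.setTimeStamp(t0);
        Result result = table.get(get);                  // 4. read kvA
        // result is empty: the flushed delete marker still masks the newer put.
        // Run a major compaction between steps 2 and 3 and the put becomes
        // visible again (Scenario 2), hence the adjusted semantics.
        System.out.println("kvA visible? " + !result.isEmpty());
    }
}
```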
4. Coordinated Compaction
[Diagram: several RegionServers compacting at the same time against the shared HDFS: a compaction storm!]
- Compaction uses a global resource (HDFS), but whether to compact is decided locally!
Coordinated Compaction
[Diagram: each RegionServer asks the Master "Can I?" before compacting; the Master answers OK or NO]
- Compactions are scheduled by the Master, so there are no compaction storms any longer (a toy sketch of the idea follows)
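A toy sketch of the idea, not the HBASE-9528 implementation: the master hands out a bounded number of compaction permits, so only a few RegionServers can hit the shared HDFS at once. All names here are illustrative.

```java
import java.util.concurrent.Semaphore;

// Toy illustration: a central coordinator limits how many compactions may
// run against the shared HDFS at the same time.
public class CompactionCoordinator {
    private final Semaphore permits;

    public CompactionCoordinator(int maxConcurrentCompactions) {
        this.permits = new Semaphore(maxConcurrentCompactions);
    }

    // A region server calls this ("Can I?") before starting a compaction.
    public boolean requestPermit() {
        return permits.tryAcquire();   // "OK" or "NO"
    }

    // ... and releases the permit once the compaction finishes.
    public void releasePermit() {
        permits.release();
    }
}
```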
5. Quorum Master
[Diagram: an active Master and a standby Master backed by a ZooKeeper ensemble (zk1, zk2, zk3); info/states are read from ZooKeeper; RegionServers talk to the active Master]
- While the active master serves, the standby master stays 'really' idle
- When the standby master becomes active, it needs to rebuild the in-memory state
Quorum Master
[Diagram: a quorum of Masters (Master 1, Master 2, Master 3) serving the RegionServers directly, with no ZooKeeper in the picture]
- Better master failover performance: no phase to rebuild in-memory state
- Better restart performance for a BIG cluster (10K+ regions)
- No external (ZooKeeper) dependency
- No potential consistency issues
- Simpler deployment
Acknowledgement
Hangjun Ye, Zesheng Wu, Peng Zhang
Xing Yong, Hao Huang, Hailei Li
Shaohui Liu, Jianwei Cui, Liangliang He
Dihao Chen
Thank You!
xieliang@xiaomi.com
fenghonghua@xiaomi.com