Flaky Tests and Bugs in Apache Software (eg Hadoop)

ApacheCon Core North America (May 12, 2016, at Vancouver)
Flaky Tests and Bugs in
Apache Software (e.g. Hadoop)
Akihiro Suda <[email protected]>
NTT Software Innovation Center
Copyright© 2016 NTT Corp. All Rights Reserved.
Who am I
• Software Engineer at NTT Corporation
• NTT: the largest telecom in Japan
• Working on improving the reliability of distributed systems
• Some contributions to ZooKeeper / Hadoop, including critical bug fixes (non-committer)
• GitHub: https://github.com/AkihiroSuda
2
Agenda
• Current "flakiness" in Apache software
• Why do flaky tests matter?
• What causes a flaky test?
• How can we find, reproduce, and fix a flaky test?
  • Existing work in Apache communities
  • Our work: Namazu (鯰, catfish) https://github.com/osrg/namazu
3
Good News: Apache software is well tested!
Data measured on 14/01/2016, using CLOC
Software    Production code (LOC)   Test code (LOC)
MapReduce   95K                     87K
YARN        178K                    121K
HDFS        152K                    150K
ZooKeeper   33K                     27K
HBase       571K                    222K
Spark       167K                    128K
Flume       46K                     34K
Cassandra   168K                    78K
[Figure: bar chart of production (Prod) vs test (Test) LOC per project]
5
Bad News: https://builds.apache.org/job/%s-trunk/
Data captured on 14/01/2016
[Figure: Jenkins build trend charts for MapReduce, YARN, HDFS, ZooKeeper, and HBase; axes: build, build time; blue = success, red = failure]
I've never seen a fully successful Hadoop build, even on my local machine...
6
Bad News: JIRA QL: project = ? AND text ~ "test fail*"
(just for approximation)
Software    #Matched        #All Issues
MapReduce   2,441 (38%)     6,373
YARN        2,290 (63%)     4,756
HDFS        5,141 (53%)     9,672
ZooKeeper   828 (35%)       2,384
HBase       6,595 (42%)     15,542
Spark       794 (6%)        14,047
Flume       342 (12%)       2,882
Cassandra   1,656 (15%)     11,430
• Roughly speaking, about half of Hadoop development is dedicated to debugging test failures.
• Interestingly, flakiness does not seem to be uniform across projects (discussed later).
7
Not all test failures are critical for production
• 97% of unit test failures in Apache software are reported to be harmless for production ("false alarms")
• Information source: "An Empirical Study of Bugs in Test Code" (A. Vahabzadeh et al., ICSME'15)
9
So flaky tests don't matter, since they don't affect production?
They still matter!
For developers:
• It's a barrier to the adoption of CI
  • If many tests are flaky, developers tend to ignore CI failures and overlook real bugs
• It's also a psychological barrier to contribution
  • A developer may be blamed for a test failure
For users:
• It's a barrier to risk assessment for production
  • No one can tell flaky tests from real bugs
10
image: http://guides.lib.jjay.cuny.edu/nypd/brokenwindows
So flaky tests don't matter, since they don't affect production?
SemaphoreCI suggests a "No broken windows" strategy for flaky tests
https://semaphoreci.com/community/tutorials/how-to-deal-with-and-eliminate-flaky-tests
11
Basic cause: async operation
• A typical flaky test is caused by an improperly handled async operation like this
(A. Vahabzadeh et al., ICSME'15 / Q. Luo et al., ACM FSE'14 / YARN-4478)
invokeAsyncOperation();
// some tests lack even this sleep
sleep(certainHardcodedTimeout);
assertTrue(checkSomethingGoodHasHappened());
• Basically, it can be fixed by increasing the timeout and retries
  • But it's not easy to find a reasonable timeout value (e.g. YARN-{4804, 4807, 4929...})
  • A long timeout is expensive
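One common way to soften the sleep-then-assert pattern above is to poll the condition until a deadline, so a generous timeout only costs time when the operation is actually slow. The following is a minimal Go sketch of that idea, not code from Hadoop or from the cited studies; `waitFor` and the stand-in async operation are hypothetical names introduced for illustration.

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

// waitFor polls cond every interval until it returns true or the timeout
// elapses. It reports whether cond became true in time.
func waitFor(cond func() bool, timeout, interval time.Duration) bool {
	deadline := time.Now().Add(timeout)
	for {
		if cond() {
			return true
		}
		if time.Now().After(deadline) {
			return false
		}
		time.Sleep(interval)
	}
}

func main() {
	var done int32

	// Stand-in for invokeAsyncOperation(): completes after an unknown delay.
	go func() {
		time.Sleep(50 * time.Millisecond)
		atomic.StoreInt32(&done, 1)
	}()

	// Stand-in for checkSomethingGoodHasHappened().
	finished := func() bool { return atomic.LoadInt32(&done) == 1 }

	// The 5s timeout is an upper bound; the wait ends as soon as the
	// condition holds.
	ok := waitFor(finished, 5*time.Second, 10*time.Millisecond)
	fmt.Println("condition met:", ok)
}
```

With this shape the test returns shortly after the condition becomes true, so the timeout can be chosen conservatively without making every passing run slow.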
13
Testbed (e.g. CI) can cause test failures as well
• Host configuration
• Host performance
• Docker is great! But it still has some issues
14
CI host configuration can cause test failures
• HADOOP-12687
  • Many YARN tests fail when /etc/hosts has multiple loopback entries
• ZOOKEEPER-2252
  • Test: nslookup("a") should fail
  • It does not fail when a host named "a" actually exists
• INFRA-11811
  • The JDK was not set up properly on a Jenkins slave
• Such a test fails only when the job is assigned to a specific buildbot, so it looks like a flaky test
15
CI host performance: they're not made equal
• Hadoop's buildbot
https://builds.apache.org/computer/
16
• Spark's buildbot
https://amplab.cs.berkeley.edu/jenkins/computer/
17
• Significant difference in the response time!
Target    Hadoop    Spark
Average   1163ms    3ms
Max       1482ms    6ms
Min       30ms      0ms
• This may be related to the fact that Spark has only a small number of test-related issues (e.g. YARN 63% vs Spark 6%, see slide 7)
18
Docker issues
Docker is great for testing!
• Some Apache projects use Docker on their CI (via Apache Yetus)
• Apache BigTop also uses Docker for provisioning Hadoop
• People also love Docker for setting up test beds on their workstations and laptops
  • Me too, of course
19
Docker #18180: Java VM unkillable zombie
• Mentioned in several Apache-related issue tickets:
  • jupyter/docker-stacks#75: Spark hanging
  • docker-library/cassandra#43, #46
  • docker-solr/docker-solr#4
  • ALLURA-8039
  • AMBARI-14706
  • IGNITE-2377
  • YETUS-229 …
• Fortunately, Apache Buildbot (Yetus) didn't hit the bug, but it made people's local testbeds flaky in a weird way.
  • Fixed in recent kernels (so, strictly speaking, it's not a Docker issue)
20
Other potential Docker-related issues
• AUFS: fcntl(F_SETFL, O_APPEND) was not supported (#20199)
  • Can cause data corruption (Dovecot is known to be affected)
  • Fixed in recent AUFS
• Overlay: you should not open a file with O_RDWR and O_RDONLY simultaneously (#10180)
  • Can cause data corruption (RPM is known to be affected)
  • Expected behavior, won't get fixed
More information: https://github.com/AkihiroSuda/docker-issues
21
Flaky tests are not limited to xUnit in CI
• Some issues can occur only in a deployed environment rather than in CI
• e.g. TCP packet corruption
• Very flaky and critical
22
TCP packet corruption
https://www.pagerduty.com/blog/the-discovery-of-apachezookeepers-poison-packet/
• TCP checksum was ignored in some IPsec configurations
  • ZooKeeper behaved strangely and intermittently due to corrupted TCP packets
https://tech.vijayp.ca/linux-kernel-bug-delivers-corrupt-tcp-ipdata-to-mesos-kubernetes-docker-containers4986f88f7a19#.gq8chzply
• TCP checksum was ignored in some veth configurations
  • Mesos and Kubernetes are affected
23
TCP packet corruption
• It's very hard to notice (and reproduce) flaky TCP packet corruption...
• Should distributed systems be tolerant of TCP corruption...?
  • The probability is very low in regular environments, but it is not zero (32-bit Ethernet CRC + 16-bit TCP checksum)
• JIRA issues: ZOOKEEPER-2175, HDFS-8161…
24
Efforts to find/reproduce a flaky test
• determine-flaky-tests-hadoop.py
• Apache Kudu's CI (dist_test)
• Google's TAP
• Our work: Namazu
https://github.com/osrg/Namazu
• and similar great tools
26
determine-flaky-tests-hadoop.py
• Picks up failed tests using Jenkins API
• Included in hadoop.git/dev-support (HADOOP-11045)
$ determine-flaky-tests-hadoop.py --job Hadoop-YARN-trunk
****Recently FAILED builds in url:
https://builds.apache.org/job/Hadoop-YARN-trunk
...
Among 15 runs examined, all failed tests <#failedRuns: testName>:
7: TestContainerManagerRecovery.testApplicationRecovery
...
27
determine-flaky-tests-hadoop.py
• Great tool, but it doesn't support running a specific test repeatedly
• Also, there is a Maven dependency issue (YARN-4478)
  • B depends on A
  • TestB is never executed if TestA fails
  • So if TestA is flaky, we can't evaluate the flakiness of TestB!
28
Kudu's CI: flaky test dashboard
[Screenshot: http://dist-test.cloudera.org:8080/ (Apr 25)]
Recently open-sourced and introduced at Apache: Big Data (Monday)
https://github.com/cloudera/dist_test
29
Kudu's CI: flaky test dashboard
• Tests are run repeatedly on CI to find flaky tests
• KUDU_FLAKY_TEST_ATTEMPTS
• KUDU_FLAKY_TEST_LIST
(From https://github.com/apache/incubator-kudu/commit/1a24338a)
Fix flakiness of client_failover-itest
The reason this test was flaky is that there is a race between..
..
Looped 100x and they all passed:
http://dist-test.cloudera.org/job?job_id=mpercy.1454486819.10566
Author: Mike Percy, Jan 29, 2016 8:01 AM
Committer: Todd Lipcon, Feb 4, 2016 2:14 PM
Commit: 1a24338ad60a8842d1ae5e227f8f03e58faea8c0
30
Google's TAP
• Google's internal CI
• 1.6M test failures per day
• 73K (4.5%) are flaky
• Repeats a failing test 10 times to label flaky tests
• Information source: "An Empirical Analysis of Flaky Tests" (Q. Luo et al., ACM FSE'14)
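The rerun-to-label idea above can be sketched in a few lines of Go. This is only an illustration of the general technique, not TAP's actual implementation (which is internal to Google); the names `labelFailure`, `Flaky`, and `ConsistentFailure` are hypothetical, and the rule "any passing rerun means flaky" is an assumption.

```go
package main

import "fmt"

// Verdict is the label assigned to a test that has already failed once.
type Verdict string

const (
	Flaky             Verdict = "flaky"              // passed on at least one rerun
	ConsistentFailure Verdict = "consistent-failure" // failed on every rerun
)

// labelFailure reruns an already-failed test up to reruns times.
// rerun must return true when the test passes.
func labelFailure(rerun func() bool, reruns int) Verdict {
	for i := 0; i < reruns; i++ {
		if rerun() {
			return Flaky // one pass is enough to show non-determinism
		}
	}
	return ConsistentFailure
}

func main() {
	attempt := 0
	// Hypothetical test that passes only on its third rerun.
	test := func() bool {
		attempt++
		return attempt == 3
	}
	fmt.Println(labelFailure(test, 10))
}
```

Note that a test labeled as a consistent failure may still be flaky with a low pass rate; more reruns reduce that risk at the cost of CI time.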
31
Challenge: poor non-determinism
• Modern CIs run jobs repeatedly to find / reproduce flaky tests
• But they don't control non-determinism
  • A flaky test may be overlooked
  • A failure may not be reproducible, so it cannot be analyzed
• Our suggestion: increase non-determinism for finding and reproducing flaky tests
32
NAMAZU: PROGRAMMABLE FUZZY SCHEDULER
NOTE: Namazu was formerly named "Earthquake"
33
Namazu: programmable fuzzy scheduler
• Increases non-determinism for finding and reproducing flaky tests
• 鯰 (namazu) means "catfish" in Japanese
[Figure: Namazu takes events (Filesystem, Packet, Java, Go [planned], Linux threads) and produces a fuzzed (randomized) schedule]
34
Namazu: programmable fuzzy scheduler
Namazu uses non-invasive techniques
• can be easily applied to any environment
• can avoid false positives
Target and technique:
• Filesystem: FUSE
• Packet: OpenFlow, Netfilter
• Java: AspectJ, Byteman
• Go [planned]: AspectGo [wip] (https://github.com/AkihiroSuda/golang-exp-aspectgo)
• Linux threads: sched_setattr(2)
35
Namazu targets
• xUnit tests
• 😃 Easy to get started; just run `mvn`
• 😃 Can reproduce test failures observed in CI
• 😞 Limited testable scope
• Integration tests on a distributed cluster
• 😃 Can test everything
• 😞 Need to write a script to set up the cluster
  • But Docker helps us a lot!
36
Namazu targets
We support both scenarios
[Figure: single-node mode (for xUnit tests) wrapping `$ mvn test`; distributed mode (for integration tests) with an Orchestrator reached via RPC]
37
NAMAZU + XUNIT TESTS
$ mvn test
38
Namazu + xUnit tests
• Namazu is a comprehensive framework...
• Quick start: "renice" threads for xUnit tests
  • POSIX.1 requires that all threads in a process share a single nice (priority) value, but the actual Linux implementation (NPTL) does not follow this.
  • Not always effective, but it's generic and easy to get started
[Figure: Namazu targets (Filesystem, Packet, Java, Go [planned], Linux threads)]
39
Namazu + xUnit tests
$ cd hadoop; ./start-build-env.sh
[container]$ mvn test -Dtest=TestFoo#testBar
$ PID=$(docker inspect $(docker ps -q -f ancestor=hadoopbuild-ubuntu) | jq .[0].State.Pid)
$ sudo nmz inspectors proc -pid $PID
Namazu periodically sets random nice values for all the child processes and threads under $PID.
It also uses non-default kernel schedulers (e.g. SCHED_BATCH).
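To make the renicing idea concrete, here is a simplified Go sketch of a single renice pass over the threads of one process on Linux. It is not Namazu's implementation: it ignores child processes and scheduler classes, and the helper names `parseTIDs` and `reniceThreads` as well as the nice range 0..19 are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"math/rand"
	"os"
	"path/filepath"
	"strconv"
	"syscall"
)

// parseTIDs converts directory entry names under /proc/<pid>/task into
// thread IDs, skipping anything that is not a number.
func parseTIDs(names []string) []int {
	tids := make([]int, 0, len(names))
	for _, name := range names {
		if tid, err := strconv.Atoi(name); err == nil {
			tids = append(tids, tid)
		}
	}
	return tids
}

// reniceThreads gives every thread of pid a random nice value in [lo, hi].
func reniceThreads(pid int, rng *rand.Rand, lo, hi int) error {
	entries, err := os.ReadDir(filepath.Join("/proc", strconv.Itoa(pid), "task"))
	if err != nil {
		return err
	}
	names := make([]string, 0, len(entries))
	for _, e := range entries {
		names = append(names, e.Name())
	}
	for _, tid := range parseTIDs(names) {
		nice := lo + rng.Intn(hi-lo+1)
		// On Linux, PRIO_PROCESS with a thread ID affects only that thread.
		err := syscall.Setpriority(syscall.PRIO_PROCESS, tid, nice)
		if err != nil && err != syscall.ESRCH { // ESRCH: thread already exited
			return fmt.Errorf("renice tid %d to %d: %w", tid, nice, err)
		}
	}
	return nil
}

func main() {
	rng := rand.New(rand.NewSource(1))
	// Demo on this process. Raising nice values (0..19) needs no root,
	// but lowering them again does, so periodic renicing usually runs as root.
	if err := reniceThreads(os.Getpid(), rng, 0, 19); err != nil {
		fmt.Fprintln(os.Stderr, "renice failed:", err)
	}
}
```

Calling `reniceThreads` from a ticker loop would approximate the "periodically" part; the per-thread effect relies on the Linux-specific behavior noted on the previous slide.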
40
Namazu + xUnit tests: Reproducibility
Testcase                                     Traditional   Namazu
YARN-4548 (RM/TestCapacityScheduler)         11%           82%
YARN-4556 (RM/TestFifoScheduler)             2%            44%
ZOOKEEPER-2137 (ReconfigTest)                2%            16%
YARN-4168 (NM/TestLogAggregationService)     1%            8%
YARN-1978 (NM/TestLogAggregationService)     0%            4%
YARN-4543 (NM/TestNodeStatusUpdater)         0%            1%
More information: osrg/namazu#125
41
Namazu + xUnit tests: Reproducibility
• "Renicing" is not always effective...
• But even when renicing is ineffective, you can sometimes reproduce the flaky test by injecting delays or reordering packets
Testcase                                Traditional   Namazu
ZOOKEEPER-2080 (ReconfigRecoveryTest)   14.0%         61.9%
$ sudo iptables ... -j NFQUEUE --queue-num 42
$ sudo nmz inspectors ethernet -nfq-number 42
42
NAMAZU + INTEGRATION TESTS
43
Namazu + Integration tests
• ZooKeeper: distributed coordination service
  • used in Hadoop, Spark, Mesos, Kafka..
• ZooKeeper 3.5 (alpha) introduced dynamic reconfiguration
• We performed an integration test to evaluate the reliability of the reconfiguration
  • We found a flaky bug!
44
Namazu + Integration tests
• We permuted some specific Ethernet packets in random order using Namazu
  • TCP retransmissions are eliminated to reduce the possible state space
[Figure: ZooKeeper cluster connected through Open vSwitch + Ryu SDN Framework + Namazu]
45
Found ZOOKEEPER-2212
• Bug: a new node cannot join the ZK cluster properly
  • The new node cannot become a leader of the ZK cluster itself
  • (More technically, it remains an "observer")
• Cause: distributed race (ZAB packet vs FLE packet)
  • ZAB: atomic broadcast protocol for data
  • FLE: leader election protocol for the ZK cluster itself
  • They use different TCP connections, so the packet order is non-deterministic
[Figure: the leader of the ZK cluster and the new ZK node exchange ZAB [2888/tcp] and FLE [3888/tcp] packets]
46
47
• Expected: the ZK cluster works as long as fewer than N/2 nodes have crashed
• Real: a single node failure can terminate the 3-node ensemble
[Figure: the new node is not participating properly (remains an "observer")]
48
How hard is it to reproduce?
• Reproducibility: 0.0% → 21.8% with Namazu (tested 1,000 times)
• We could not reproduce the bug even after 5,000 runs of traditional testing (60 hours!)
• It is also reproducible by "renicing" threads, but the reproducibility is just 0.7%
49
Why can we hit the bug?
We define the distributed execution pattern based on code coverage:
\[
P = \begin{pmatrix}
p_{1,1} & \cdots & p_{1,N} \\
\vdots & \ddots & \vdots \\
p_{L,1} & \cdots & p_{L,N}
\end{pmatrix}
\]
• L: LOC
• N: number of nodes (3 in this case)
• p_{i,j}: 1 if node j covers the branch in line i, otherwise 0
• We used JaCoCo, the Java Code Coverage Library (patch: ZOOKEEPER-2266)
Namazu achieves faster pattern growth.
That's why we can hit the bug.
50
HOW TO USE NAMAZU?
51
How to use Namazu?
Easy to install
$ sudo apt-get install lib{netfilter-queue,zmq3}-dev
$ go get github.com/osrg/namazu/nmz
Easy to get started
• Provides Docker-like CLI
• No code instrumentation needed
• No configuration needed (default: just renice threads)
$ sudo nmz container run -it -v /foo:/foo ubuntu
[container]$ cd /foo && mvn test
52
How to use Namazu?
For threads ("renicing")
$ sudo nmz inspectors proc -pid $TARGET_PID
For filesystem
$ sudo nmz inspectors fs -mount-point /nmzfs
For network packets
$ sudo iptables ... -j NFQUEUE --queue-num 42
$ sudo nmz inspectors ethernet -nfq-number 42
Need distributed mode? (for integration testing)
Just add `--orchestrator-url http://foobar:10080/api/v3` to the CLI.
53
Namazu API (Go)
Namazu defines a REST API, so you can also use other languages
type ExplorePolicy interface {
	QueueEvent(Event)
	ActionChan() chan Action
}
func (p *MyPolicy) QueueEvent(event Event) {
	// You can also inject fault actions here
	action := event.DefaultAction()
	// The action is fired at a random time within [10ms, 30ms]
	p.timeBoundedQ.Enqueue(action,
		10 * Millisecond, 30 * Millisecond)
}
func (p *MyPolicy) ActionChan() chan Action {
	return p.timeBoundedQ.DequeueChan
}
54
API use case: found YARN-4301
• We found a bug: YARN cannot detect disk failures in which mkdir()/rmdir() blocks
[Figure: one case where mkdir() returns EIO explicitly, and another case where mkdir() blocks]
• While reading the code, we noticed that the bug could occur in theory, and then actually reproduced it using Namazu
  • The timing of the fault injection is known in advance, so we manually wrote a concrete scenario using the Namazu API
  • Much more realistic than JUnit + mocking
55
Interactive testing is often easier than writing a JUnit testcase
func (p *MyPolicy) signalHandler() {
	// We use SIGUSR1 here, but it would also be interesting to implement
	// a human-friendly CLI or GUI for interactive testing
	signal.Notify(sigChan, syscall.SIGUSR1)
	for {
		<-sigChan
		p.sleep = 10 * time.Minute // fault: blocks for 10 minutes
	}
}
go p.signalHandler()
func (p *MyPolicy) QueueEvent(event Event) {..}
func (p *MyPolicy) ActionChan() chan Action {..}
$ go run mypolicy.go inspectors fs -mount-point /nmzfs
Set "yarn.nodemanager.local-dirs" to "/nmzfs/nm-local-dir",
then send SIGUSR1 to Namazu when you (and YARN) are ready
56
57
Another API use case: "semi"-deterministic replay
• If you know the protocol, you can compute a hash for each packet
  • Note that you have to eliminate time-dependent and random bytes when you hash the packet
• Using the hash and the Namazu API, you can "semi"-deterministically replay the scenario
  • Not fully deterministic; it is only best-effort
  • Record-less! You just need to remember the "seed" for replaying
• PoC: ZOOKEEPER-2212: up to 65% reproducibility
  • More information: osrg/namazu#137
  • See also (for Go): https://github.com/AkihiroSuda/go-replay
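The seed-plus-hash idea above can be sketched as follows in Go. This is an assumption-laden illustration, not the code from osrg/namazu#137 or go-replay: the packet layout, the `volatile` byte ranges, and the helpers `stableHash` and `delayFor` are all hypothetical, and FNV-1a is just a convenient stand-in hash.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"hash/fnv"
	"time"
)

// stableHash hashes a packet after zeroing the byte ranges known (from
// protocol knowledge) to be time-dependent or random. Each range is a
// half-open [start, end) pair of offsets.
func stableHash(packet []byte, volatile [][2]int) uint64 {
	buf := append([]byte(nil), packet...) // copy; do not mutate the input
	for _, r := range volatile {
		for i := r[0]; i < r[1] && i < len(buf); i++ {
			buf[i] = 0
		}
	}
	h := fnv.New64a()
	h.Write(buf)
	return h.Sum64()
}

// delayFor derives a delay in [lo, hi) from the seed and the packet hash,
// so the same seed and the same logical packet always get the same delay.
func delayFor(seed, hash uint64, lo, hi time.Duration) time.Duration {
	if hi <= lo {
		return lo
	}
	var b [16]byte
	binary.LittleEndian.PutUint64(b[:8], seed)
	binary.LittleEndian.PutUint64(b[8:], hash)
	h := fnv.New64a()
	h.Write(b[:])
	return lo + time.Duration(h.Sum64()%uint64(hi-lo))
}

func main() {
	// Hypothetical packet layout: bytes 4..7 hold a timestamp.
	volatile := [][2]int{{4, 8}}
	first := []byte{0x01, 0x02, 0x03, 0x04, 0xAA, 0xBB, 0xCC, 0xDD}
	replay := []byte{0x01, 0x02, 0x03, 0x04, 0x11, 0x22, 0x33, 0x44}

	const seed = 42 // the only thing that must be remembered for a replay
	for _, p := range [][]byte{first, replay} {
		d := delayFor(seed, stableHash(p, volatile), 10*time.Millisecond, 30*time.Millisecond)
		fmt.Println(d) // both packets get the same delay
	}
}
```

Nothing is recorded during the first run: the schedule is a pure function of the seed and the packet contents, which is why the replay is only best-effort when packets differ in bytes that were not masked.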
58
SIMILAR GREAT TOOLS
59
Similar great tool: Jepsen
• Network partitioner + Linearizability tester
• Famous for the "Call Me Maybe" blog: http://jepsen.io/
  • "Call Me Maybe" by Carly Rae Jepsen (vevo): https://www.youtube.com/watch?v=fWNaR-rxAic
• Randomly injects network partitions using iptables
• "Linearizability" ∈ "Strong consistency"
• Integration tests on a flaky network rather than flaky xUnit tests
60
Similar great tool: Jepsen
• Has been used to test several Apache projects
  • Cassandra: 9851, 10001, 10068, 10231, 10413, 10674
    • http://www.datastax.com/dev/blog/testing-apache-cassandra-with-jepsen
  • HBase
  • Kafka
  • Solr: 6530, 6583, 6610
    • http://lucidworks.com/blog/2014/12/10/call-maybe-solrcloud-jepsenflaky-networks
  • ZooKeeper
61
Namazu + Jepsen?
• Namazu is much more general
  • The bugs we found/reproduced are basically beyond the scope of Jepsen (threads, disks..)
• Namazu can also be combined with Jepsen! That will be our next work..
Jepsen
• causes network partitions
• tests linearizability
...
Namazu
• increases non-determinism
• injects filesystem faults
62
Similar great tool: CharybdeFS
• Makes the filesystem flaky using FUSE
• Used in testing ScyllaDB (Apache Cassandra's clone)
  • https://github.com/scylladb/charybdefs
• Similar to Namazu FS
  • Both support an API
  • Also similar to PetardFS (not active since 2007)
  • CharybdeFS can also be combined with Namazu
• CharybdeFS is specialized in FS; Namazu is much more comprehensive.
63
Similar great tool: DEMi (appeared in NSDI'16)
https://github.com/NetSys/demi
• Found some akka-raft bugs and reproduced a few Spark bugs
  • Challenge: reducing false positives related to instrumentation
• DEMi and Namazu are complementary to each other
  • DEMi is powerful, but has some limitations
  • Namazu is comprehensive and easy to get started with
                          Namazu                                      DEMi
Target                    Generic (Network, Filesystem, Thread..)     Akka
Getting Started           Easy                                        Need to write AspectJ code
Deterministic Replay?     No                                          Yes
Bug Cause Minimization?   No                                          Yes
64
SO... HOW CAN WE FIX FLAKY TESTS?
65
How can we fix flaky tests?
• Namazu finds/reproduces flaky tests, but it doesn't automatically fix them 😞
• Basic approach for async-related flakiness: adjust the values for sleep() and retries in the test code
66
How can we fix flaky tests?
• Suggestion: the timeout (and retries) should be a configurable parameter rather than a hard-coded value
Timeout value   Cost (time)   Risk (timeout)   Appropriate for
Long            High          Low              Slow machine (e.g. CI), conservative person
Short           Low           High             Fast machine, risk-tolerant person
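As a minimal sketch of the suggestion above, a test helper can read its timeout from the environment and fall back to a default, so a slow CI host and a fast laptop can use different values without code changes. The variable name `TEST_TIMEOUT` and the helper `parseTimeout` are hypothetical and not taken from any Apache project.

```go
package main

import (
	"fmt"
	"os"
	"time"
)

// parseTimeout returns the duration encoded in raw (e.g. "30s", "500ms").
// It falls back to def when raw is empty, malformed, or not positive.
func parseTimeout(raw string, def time.Duration) time.Duration {
	d, err := time.ParseDuration(raw)
	if err != nil || d <= 0 {
		return def
	}
	return d
}

func main() {
	// TEST_TIMEOUT is a hypothetical variable name; a slow CI job could
	// export TEST_TIMEOUT=60s while developers keep the short default.
	timeout := parseTimeout(os.Getenv("TEST_TIMEOUT"), 10*time.Second)
	fmt.Println("using timeout:", timeout)
}
```

Falling back silently on a malformed value keeps the tests running; a stricter variant could fail fast instead so that typos in CI configuration are noticed.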
67
CONCLUSION
68
Conclusion
• Apache software is well tested
• But its tests are flaky
• Let's improve them
  • Improve asynchronous code
  • Repeat tests
• Our tool can control non-determinism so as to reproduce flaky tests
69
