REmatch: High-performance Regular Expression Matching for Network Security

Petabi, Inc.
Contact: Victor Valgenti ([email protected])
May 6, 2015
2082 Business Center Drive #170 Irvine CA 92606 ([email protected])

1 Regular Expression Matching in Network Security

Regular expression matching is an important part of most network security solutions. Unfortunately, regular expressions can be resource-intensive to match, and most regular expression engines do not scale with the number of regular expressions. This becomes a problem for a network security system when an attacker targets the regular expression engine with worst-case traffic [4,8], which places an excessive burden on the matching system and can even lead to lost packets. Further, because regular expressions are deemed resource-intensive, many network security systems use arbitrary, redundant filters just to reduce the amount of traffic forwarded to regular expression matching.

Our philosophy, however, is to make regular expression matching efficient enough to run at line-speed, so that filtering to avoid regular expressions becomes obsolete and regular expressions can be used for all pattern matching. This is what we call total inspection with regular expressions.

Our patented regular expression engine, REmatch, can match at speeds up to 2,000 times faster than other matching libraries for large regular expression sets. REmatch takes advantage of the parallelism common in general purpose processors and uses a traversal-friendly layout of the matcher, so that these matching speeds are possible on general purpose processors. Total inspection of network traffic with regular expressions removes the need for many of the complex filters used in the pipelines of network security applications, because those filters can often be eliminated or replicated with regular expressions.
Even better, REmatch consists of a C++ automata construction library and a C matching library that can easily be used in place of other regular expression matching libraries for immediate improvements in matching speed. REmatch maintains line-speed matching for thousands of rules and scales linearly with the number of cores used. REmatch makes total inspection of network traffic against a set of hundreds, or thousands, of regular expressions a reality without the need for expensive hardware or specialized chipsets.

2 REmatch Performance

Figure 1 illustrates the basic strengths of the REmatch regular expression matching engine. For this example, three sets of regular expressions were used: ClamAV, Petabi, and Snort (please see Section 4.3 for specifics concerning test setup and execution). First, Figure 1 shows that REmatch scales linearly with the number of regular expressions. When the number of regular expressions is small (fewer than 3), PCRE matches exceptionally fast. However, with only 5 regular expressions REmatch is already 70-500% faster; for 1,245 rules REmatch is 243 to 1,343 times faster than PCRE, and at 2,264 rules REmatch is nearly 2,000 times faster. Further, even with more than two thousand regular expressions, REmatch maintains greater than 1Gbps on a single core. In this manner, REmatch can maintain total inspection of traffic and still meet line-speed requirements.

To demonstrate the versatility and power of REmatch, we created a test in which we substituted the PCRE calls in Snort with calls to the REmatch libraries (please refer to Section 4.4 for details). The performance comparison is summarized in Table 1, and the trends mirror those from Figure 1.
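The Speedup column of Table 1 appears to be the ratio of the two throughput columns. As a minimal sketch, the ratio can be recomputed from the published Mbps figures. The function and variable names here are illustrative only and are not part of REmatch or Snort.

```python
# Throughput figures copied from Table 1 (Mbps).
# Each row: (number of regular expressions, Snort with REmatch, Snort with PCRE)
TABLE_1 = [
    (1, 440, 666),
    (100, 422, 28),
    (200, 422, 12.2),
    (300, 410, 8),
    (500, 390, 4.8),
    (700, 350, 2.6),
    (779, 334, 1.1),
]


def speedup(rematch_mbps, pcre_mbps):
    """Speedup of REmatch over PCRE as a plain throughput ratio."""
    return rematch_mbps / pcre_mbps


def speedup_table(rows=TABLE_1):
    """Return (number of regular expressions, speedup) for every row."""
    return [(n, speedup(rematch, pcre)) for n, rematch, pcre in rows]


if __name__ == "__main__":
    for n, s in speedup_table():
        print(f"{n:>4} regex: {s:8.2f}x")
```

A ratio below 1, as in the single-expression row, means PCRE was the faster engine for that data point.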
[Figure 1 plot: throughput in Mbps (log scale, 0.1 to 10000) against the total number of regular expressions (log scale, 1 to 10000) for ClamAV-REmatch, Petabi-REmatch, Snort-REmatch, ClamAV-PCRE, Petabi-PCRE, and Snort-PCRE, annotated with speedups of 4x, 37x, 206x, 382x, and 1,900x.]

Figure 1: REmatch performance vs PCRE performance as the number of regular expressions increases.

Table 1: Drop-in Replacement of PCRE with REmatch in Snort.

# Regex  Snort with REmatch (Mbps)   Snort with PCRE (Mbps)   Speedup
1        440                         666                      0.66
100      422                         28                       15
200      422                         12.2                     35
300      410                         8                        51
500      390                         4.8                      81
700      350                         2.6                      135
779      334                         1.1                      304

3 REmatch to enhance your products

As we have demonstrated, REmatch could serve to boost performance anywhere you match multiple regular expressions against a particular set of data. REmatch is versatile enough that it can be added to current products with minimal changes to the code-base. Optionally, it could serve as a new core matching engine. If you are interested in REmatch, please contact us and we will set up further evaluations and answer any questions you may have concerning our products. Finally, we are flexible with licensing terms and willing to work with partners. We encourage you to seriously consider REmatch to improve the regular expression matching in your products.

4 Test Details

This section explains the details of the test environment, test methodology, and data used.

4.1 Explanation of REmatch

REmatch [7] is our patented regular expression matching engine. REmatch makes use of the parallelism inherent in most commodity general purpose processors while using a traversal-friendly and architecture-friendly memory layout. One of the primary goals of REmatch is to shrink the working-set so that it can reside entirely in cache memory. Thus, the memory alignment of the REmatch matching automata is designed to maximize cache-line use and locality during matching.
REmatch further improves performance by using Non-deterministic Finite Automata (NFA) rather than Deterministic Finite Automata (DFA) during matching. Without delving into the complexities involved, DFA simply grow too large to be practical for any set of regular expressions larger than a couple hundred. NFA, however, scale linearly with the number of regular expressions. This is one of the reasons why REmatch scales far better with the number of regular expressions.

4.2 Explanation of PCRE

The Perl Compatible Regular Expression (PCRE) [2] library is a common, full-featured regular expression matching library. We compare against this library because it is one of the best libraries freely available. We note that all of the regular expressions are compiled prior to matching. Thus, the times for PCRE only consider matching time, not construction or deconstruction time.

4.3 Throughput and Scalability of REmatch Evaluation

Figure 1 illustrates both the high performance and the scalability of REmatch. To generate the data for this graph, we matched against sample-sets of regular expressions containing 1, 2, 5, 10, 50, 100, 200, 400, 600, 800, 1,000, 1,200, 1,245, 2,264 and 30,596 regular expressions respectively. To create a sample-set of regular expressions for each point, x rules were randomly selected (without replacement) from the targeted rule-set. This sample rule-set was then used by the respective matching engine to match against the traffic three separate times to compute an average throughput for that sample rule-set. This process was repeated 10 times to determine the average throughput of all ten sample rule-sets, and the result is displayed in Figure 1 and reiterated in Table 2. This process was repeated once for each regular expression set for each matching engine (REmatch or PCRE). Note that if the sample size was larger than the total number of rules in the rule-set, then that data-point is set to N/A.
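The sampling procedure for a single data point can be sketched as follows. This is only an illustration under stated assumptions: `match_fn` is a hypothetical wrapper around whichever engine is being measured, `measure_throughput_mbps` is an assumed timing helper, and none of these names come from REmatch or PCRE.

```python
import random
import statistics
import time


def measure_throughput_mbps(match_fn, rules, packets):
    """Time one pass of match_fn over all packets and return Mbps."""
    total_bits = 8 * sum(len(p) for p in packets)
    start = time.perf_counter()
    match_fn(rules, packets)  # hypothetical engine wrapper (assumption)
    elapsed = time.perf_counter() - start
    return total_bits / elapsed / 1e6


def average_throughput(match_fn, rule_set, sample_size, packets,
                       num_samples=10, runs_per_sample=3,
                       rng=None, measure=measure_throughput_mbps):
    """Return the mean throughput for one data point, or None for N/A."""
    rng = rng or random.Random()
    if sample_size > len(rule_set):
        return None  # reported as N/A: more rules requested than exist
    sample_means = []
    for _ in range(num_samples):
        # Draw sample_size rules without replacement; rule_set must be a list.
        # When sample_size equals len(rule_set) this is simply the whole set.
        sample = rng.sample(rule_set, sample_size)
        # Three timed runs per sample rule-set, averaged.
        runs = [measure(match_fn, sample, packets)
                for _ in range(runs_per_sample)]
        sample_means.append(statistics.mean(runs))
    # Average over the ten sample rule-sets.
    return statistics.mean(sample_means)
```

The `measure` parameter is injectable so that the sampling logic can be checked separately from any real timing.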
Further, no sampling is done when the size of the sample is equal to the size of the rule-set.

4.3.1 Regular Expression Sets for this Test

Table 3 lists specific statistics for each of the rule-sets. Aside from the average length of the rules and the name of the rule-set, each entry is the total number of occurrences of the investigated feature within the set. The greater the number of these features, the more complex the rules. These data provide a statistical view of the regular expressions involved. Each rule-set is described in more detail below. The ClamAV regular expression set and the Snort regular expression set can be provided upon request. The Petabi rules require a Non-disclosure Agreement prior to release.

Table 2: REmatch vs PCRE throughput comparison in tabular format (values in Mbps)

# Regex  ClamAV-REmatch   Petabi-REmatch   Snort-REmatch   ClamAV-PCRE   Petabi-PCRE   Snort-PCRE
1        1,736.81         2,054.31         2,831.48        3,406.39      4,633.56      5,002.04
2        1,921.04         2,201.65         2,037.06        2,124.09      1,697.65      3,911.48
5        1,991.06         2,325.42         2,280.55        972.58        410.14        1,436.80
10       2,312.89         2,311.58         2,316.74        517.42        193.78        504.21
50       2,191.82         2,273.08         2,268.02        109.19        28.40         92.37
100      2,043.93         2,202.18         2,208.63        54.91         15.55         38.08
200      1,748.67         2,162.94         2,132.88        27.47         7.51          17.13
400      1,442.60         1,972.59         2,008.29        13.63         3.70          8.78
600      1,291.76         1,805.33         1,811.03        9.09          2.52          6.01
800      1,218.31         1,758.72         1,805.36        6.80          1.88          4.47
1,000    1,122.65         1,711.88         1,668.91        5.45          1.44          3.54
1,200    1,069.19         1,622.91         1,622.50        4.53          1.24          3.02
1,245    1,063.23         1,590.23         1,587.03        4.36          1.18          2.88
2,264    918.57           1,287.03         N/A             2.40          0.65          N/A
30,596   437.09           N/A              N/A             0.17          N/A           N/A

1. ClamAV: The ClamAV [5] regular expression rule-set represents the ClamAV database as of January 8, 2015. The ClamAV format of regular expressions is very close to normal PCRE. Thus, we created a script that converted them from the ClamAV format to a standard PCRE format.
The large average length of the regular expressions stems from the fact that the rules are denoted as binary strings and the conversion process then uses PCRE’s ‘\x’ notation. Another interesting note is that ClamAV rules represent mostly fixed binary strings.

2. Petabi: These rules are part of our business and were crafted specifically for REmatch. They are currently used in Network Intrusion Detection Systems deployed in client businesses and handle line speeds without issue. These rules were crafted in-house and/or gathered from many sources, some of which are covered by strong Non-disclosure Agreements. As such, these rules cannot be provided until such an agreement has been signed by all parties privy to the rules.

3. Snort: The Snort regular expressions were harvested from the Sourcefire Vulnerability Research Team Snort [6] 2.9.7.2 Registered Users rule-set for April 22, 2015. Where possible, the rules represent the merging of content and PCRE tags into a single regular expression. This was done because Snort rules employ a pipeline approach in which later tags rely on earlier tags to filter content and reduce workload. For rules where there were no such tags, other Snort rule features were converted to regex (where possible) and prepended to the regular expression to better simulate the intent of the Snort rule.

Table 3: Regular Expression Set Breakdown. AVG length is the average length, in number of characters, of all of the regular expressions. Wildcard characters are the ‘.’ character and the character class [\x00-\xFF]. Repetition represents ‘?’ (zero or one), ‘*’ (zero or many), and ‘+’ (one or many). Counting represents any counted repetition like ‘a{1,5}’. Alternation indicates the number of times alternate branches are used, like ‘a(b|c)d’.
Types represent one of the following PCRE types: \d (any digit), \w (any word char), \s (any whitespace), \D (any non-digit), \W (any non-word char), and \S (any non-whitespace). Classes represent any character class like: [abc].

RE Set   Total RE   Avg Length   Alternation   Wildcards   Types   Classes   Repetition   Counting
ClamAV   30,596     270.1        3             2,349       0       0         1,226        212
Petabi   2,264      89.6         177           1,828       3,821   1,525     7,836        890
Snort    1,245      100.8        400           845         3,774   973       6,090        738

4.3.2 Test Traffic Data for this Test

For traffic we used a synthetic traffic capture with 100,000 packets, all with random data. Each packet carries 79 bytes of data. This figure was taken as the average data size from several publicly available packet captures used in Intrusion Detection System evaluation [1, 3]. The count of 100,000 packets was chosen arbitrarily as sufficient to test the system and small enough to do so quickly. Figure 2 shows that the distribution of bytes is even across all possible byte values, illustrating the uniformly random nature of the data. Tests were performed by reading the packet capture through the libpcap library. All tests were performed with the same traffic capture.

4.3.3 Hardware Setup

All tests were run on a single core in a single thread.

1. Operating System: FreeBSD 10.1 v2-RELEASE
2. CPU: Intel Core i5-3570K Ivy Bridge Quad-Core 3.4GHz (6MB L3 Cache)
3. RAM: 16GB

4.4 Drop-in Replacement of PCRE in Snort with REmatch Evaluation

For this evaluation we examined the impact of using REmatch in Snort rather than PCRE matching. The idea was to demonstrate a drop-in replacement of REmatch into Snort with minimal changes to Snort overall. It was a simple matter to replace the pcre_compile calls in Snort with the REmatch automata construction calls. Then, during matching, rather than using pcre_search we used the library call to the REmatch matcher.
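The control flow of this kind of substitution, where the whole rule set is evaluated in one pass and per-rule queries are answered from cached results, can be sketched as below. This is not the REmatch API: the class and method names are hypothetical, and Python's `re` module is swapped in as the matcher. Because `re` still tries each pattern in turn, the sketch shows only the caching pattern and says nothing about REmatch's performance.

```python
import re


class MatchAllOnce:
    """Hypothetical stand-in for a multi-pattern matcher with a result cache."""

    def __init__(self, patterns):
        # patterns: dict mapping a rule ID to a bytes regular expression.
        self._compiled = {rid: re.compile(src) for rid, src in patterns.items()}
        self._payload = None      # payload the cache was built for
        self._hits = frozenset()  # rule IDs that matched that payload
        self.scans = 0            # number of full scans performed

    def _scan(self, payload):
        # Evaluate the whole rule set against the payload and remember the hits.
        self._hits = frozenset(
            rid for rid, rx in self._compiled.items() if rx.search(payload)
        )
        self._payload = payload
        self.scans += 1

    def rule_matches(self, rule_id, payload):
        # Called once per candidate rule, the way a per-rule search would be.
        # Only the first call for a given payload triggers a scan.
        if payload != self._payload:
            self._scan(payload)
        return rule_id in self._hits


if __name__ == "__main__":
    matcher = MatchAllOnce({1: rb"ab.*cd", 2: rb"xyz"})
    packet = b"..ab..cd.."
    matched = [rid for rid in (1, 2) if matcher.rule_matches(rid, packet)]
    print(matched, matcher.scans)
```

Keeping the per-rule query interface unchanged is what would let the calling code stay almost untouched while the matching work moves into a single scan per payload.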
The primary difference with this approach is that all regular expressions are matched at once, with any matches cached and returned as Snort’s list of possible matches is traversed. As can be seen from the data in Table 1, if there are more than one or two regular expressions to match, then REmatch can provide vastly superior matching speed.

[Figure 2 plot: CDF (0 to 1) of byte values (axis from 0 to 250) for the random traffic data.]

Figure 2: Diversity of data byte values.

4.4.1 Regular Expression Set for this Test

For this evaluation 779 regular expressions were selected from the Sourcefire Vulnerability Research Team Snort [6] 2.9.7.0 Registered Users rule-set. Regular expressions were selected if they defined the intent of the entire rule. For example, a Snort rule might have a content option of ‘ab’ and another of ‘cd’. If the PCRE option for this rule contained both ‘ab’ and ‘cd’, then that regular expression was included in the regular expression set.

4.4.2 Test Traffic Data for this Test

For this evaluation traffic was generated using a Spirent SPT-2000 traffic generator. TCP packets of 1,400 bytes with random payloads were generated and sent across network links to the target box at 1Gbps.

4.4.3 Hardware Setup

All tests show numbers for a single core and single thread.

1. Operating System: Ubuntu 11.04 (Kernel 2.6.38-16)
2. CPU: Intel dual-Core i3-3220 3.30GHz (3MB L3 Cache)
3. RAM: 2GB

References

[1] National CyberWatch Center. Mid-Atlantic Collegiate Cyber Defense Competition, 2012.
[2] P. Hazel. Perl compatible regular expressions. http://www.pcre.org/.
[3] B. Sangster, T. J. O’Connor, T. Cook, R. Fanelli, E. Dean, W. J. Adams, C. Morrell, and G. Conti. Towards instrumenting network warfare competitions to generate labeled datasets.
In Proceedings of USENIX Security Workshop on Cyber Security Experimentation and Test, 2009.
[4] R. Smith, C. Estan, and S. Jha. Backtracking algorithmic complexity attacks against a NIDS. In Proceedings of the 22nd Annual Computer Security Applications Conference, 2006.
[5] Sourcefire. ClamAV Open Source Antivirus Engine 0.98.6, 2015. Available at http://www.clamav.net/.
[6] Sourcefire Vulnerability Research Team. Sourcefire Vulnerability Research Team (VRT) Snort Rule-set, Apr. 2015. Available at http://www.snort.org/vrt.
[7] V. C. Valgenti, J. Chhugani, Y. Sun, N. Satish, M. S. Kim, C. Kim, and P. Dubey. GPPgrep: High-speed regular expression processing engine on general purpose processors. In International Symposium on Research in Attacks, Intrusions, and Defense. Springer, 2012.
[8] V. C. Valgenti, H. Sun, and M. S. Kim. Protecting run-time filters for network intrusion detection systems. In Advanced Information Networking and Applications (AINA), 2014 IEEE 28th International Conference on, pages 116–122, May 2014.