Skip to content

Latest commit

 

History

History
420 lines (221 loc) · 63.1 KB

README.md

File metadata and controls

420 lines (221 loc) · 63.1 KB

Recent Papers Related To Bug Detection

All Papers (Classification according to Conference)

All Papers (Classification according to Subject)

Concurrency Bugs

A true positives theorem for a static race detector (POPL 2019)

Abstract: RacerD is a static race detector that has been proven to be effective in engineering practice: it has seen thousands of data races fixed by developers before reaching production, and has supported the migration of Facebook's Android app rendering infrastructure from a single-threaded to a multi-threaded architecture. We prove a True Positives Theorem stating that, under certain assumptions, an idealized theoretical version of the analysis never reports a false positive. We also provide an empirical evaluation of an implementation of this analysis, versus the original RacerD.

The theorem was motivated in the first case by the desire to understand the observation from production that RacerD was providing remarkably accurate signal to developers, and then the theorem guided further analyzer design decisions. Technically, our result can be seen as saying that the analysis computes an under-approximation of an over-approximation, which is the reverse of the more usual (over of under) situation in static analysis. Until now, static analyzers that are effective in practice but unsound have often been regarded as ad hoc; in contrast, we suggest that, in the future, theorems of this variety might be generally useful in understanding, justifying and designing effective static analyses for bug catching.

Detecting Concurrency Memory Corruption Vulnerabilities (FSE 2019)

Abstract: Memory corruption vulnerabilities can occur in multithreaded executions, known as concurrency vulnerabilities in this paper. Due to non-deterministic multithreaded executions, they are extremely difficult to detect. Recently, researchers tried to apply data race detectors to detect concurrency vulnerabilities. Unfortunately, these detectors are ineffective on detecting concurrency vulnerabilities. For example, most (90%) of data races are benign. However, concurrency vulnerabilities are harmful and can usually be exploited to launch attacks. Techniques based on maximal causal model rely on constraints solvers to predict scheduling; they can miss concurrency vulnerabilities in practice. Our insight is, a concurrency vulnerability is more related to the orders of events that can be reversed in different executions, no matter whether the corresponding accesses can form data races. We then define exchangeable events to identify pairs of events such that their execution orders can be probably reversed in different executions. We further propose algorithms to detect three major kinds of concurrency vulnerabilities. To overcome potential imprecision of exchangeable events, we also adopt a validation to isolate real vulnerabilities. We implemented our algorithms as a tool ConVul and applied it on 10 known concurrency vulnerabilities and the MySQL database server. Compared with three widely-used race detectors and one detector based on maximal causal model, ConVul was significantly more effective by detecting 9 of 10 known vulnerabilities and 6 zero-day vulnerabilities on MySQL (four have been confirmed). However, other detectors only detected at most 3 out of the 16 known and zero-day vulnerabilities.

RacerD: compositional static race detection (OOPSLA 2018)

Abstract: Automatic static detection of data races is one of the most basic problems in reasoning about concurrency. We present RacerD—a static program analysis for detecting data races in Java programs which is fast, can scale to large code, and has proven effective in an industrial software engineering scenario. To our knowledge, RacerD is the first inter-procedural, compositional data race detector which has been shown to have non-trivial precision and impact. Due to its compositionality, it can analyze code changes quickly, and this allows it to perform continuous reasoning about a large, rapidly changing codebase as part of deployment within a continuous integration ecosystem. In contrast to previous static race detectors, its design favors reporting high-confidence bugs over ensuring their absence. RacerD has been in deployment for over a year at Facebook, where it has flagged over 2500 issues that have been fixed by developers before reaching production. It has been important in enabling the development of new code as well as fixing old code: it helped support conversion of part of the main Facebook Android app from a single-threaded to a multi-threaded architecture. In this paper we describe RacerD’s design, implementation, deployment and impact.

UFO: Predictive Concurrency Use-After-Free Detection (ICSE 2018)

Abstract: Use-After-Free (UAF) vulnerabilities are caused by the program operating on a dangling pointer and can be exploited to compromise critical software systems. While there have been many tools to mitigate UAF vulnerabilities, UAF remains one of the most common attack vectors. UAF is particularly di cult to detect in concurrent programs, in which a UAF may only occur with rare thread schedules. In this paper, we present a novel technique, UFO, that can precisely predict UAFs based on a single observed execution trace with a provably higher detection capability than existing techniques with no false positives. The key technical advancement of UFO is an extended maximal thread causality model that captures the largest possible set of feasible traces that can be inferred from a given multithreaded execution trace. By formulating UAF detection as a constraint solving problem atop this model, we can explore a much larger thread scheduling space than classical happens-before based techniques. We have evaluated UFO on several real-world large complex C/C++ programs including Chromium and FireFox. UFO scales to real-world systems with hundreds of millions of events in their execution and has detected a large number of real concurrency UAFs.

OWL: Understanding and Detecting Concurrency Attacks (DSN 2018)

Abstract: Just like bugs in single-threaded programs can lead to vulnerabilities, bugs in multithreaded programs can also lead to concurrency attacks. We studied 31 real-world concurrency attacks, including privilege escalations, hijacking code executions, and bypassing security checks. We found that compared to concurrency bugs' traditional consequences (e.g., program crashes), concurrency attacks' consequences are often implicit, extremely hard to be observed and diagnosed by program developers. Moreover, in addition to bug-inducing inputs, extra subtle inputs are often needed to trigger the attacks. These subtle features make existing tools ineffective to detect concurrency attacks. To tackle this problem, we present OWL, the first practical tool that models general concurrency attacks' implicit consequences and automatically detects them. We implemented OWL in Linux and successfully detected five new concurrency attacks, including three confirmed and fixed by developers, and two exploited from previously known and well-studied concurrency bugs. OWL has also detected seven known concurrency attacks. Our evaluation shows that OWL eliminates 94.1% of the reports generated by existing concurrency bug detectors as false positive, greatly reducing developers' efforts on diagnosis. All OWL source code, concurrency attack exploit scripts, and results are available on github.com/hku-systems/owl.

A Runtime Verification Tool for Detecting Concurrency Bugs in FreeRTOS Embedded Software (ISPDC 2018)

Abstract: This article presents a runtime verification tool for embedded software executing under the open source real-time operating system FreeRTOS. The tool detects and diagnoses concurrency bugs such as deadlock, starvation, and suspension based-locking. The tool finds concurrency bugs at runtime without debugging and tracing the source code. The tool uses the Tracealyzer tool for logging relevant events. Analysing the logs, our tool can detect the concurrency bugs by applying algorithms for diagnosing each concurrency bug type individually. In this paper, we present the implementation of the tool, as well as its functional architecture, together with illustration of its use. The tool can be used during program testing to gain interesting information about embedded software executions. We present initial results of running the tool on some classical bug examples running on an AVR 32-bit board SAM4S.

ConPredictor: Concurrency Defect Prediction in Real-World Applications (TSE 2018)

Abstract: Concurrent programs are difficult to test due to their inherent non-determinism. To address this problem, testing often requires the exploration of thread schedules of a program; this can be time-consuming when applied to real-world programs. Software defect prediction has been used to help developers find faults and prioritize their testing efforts. Prior studies have used machine learning to build such predicting models based on designed features that encode the characteristics of programs. However, research has focused on sequential programs; to date, no work has considered defect prediction for concurrent programs, with program characteristics distinguished from sequential programs. In this paper, we present ConPredictor, an approach to predict defects specific to concurrent programs by combining both static and dynamic program metrics. Specifically, we propose a set of novel static code metrics based on the unique properties of concurrent programs. We also leverage additional guidance from dynamic metrics constructed based on mutation analysis. Our evaluation on four large open source projects shows that ConPredictor improved both within-project defect prediction and cross-project defect prediction compared to traditional features.

Learning from mistakes: a comprehensive study on real world concurrency bug characteristics (ASPLOS 2008)

Abstract: The reality of multi-core hardware has made concurrent programs pervasive. Unfortunately, writing correct concurrent programs is difficult. Addressing this challenge requires advances in multiple directions, including concurrency bug detection, concurrent program testing, concurrent programming model design, etc. Designing effective techniques in all these directions will significantly benefit from a deep understanding of real world concurrency bug characteristics.

This paper provides the first (to the best of our knowledge) comprehensive real world concurrency bug characteristic study. Specifically, we have carefully examined concurrency bug patterns, manifestation, and fix strategies of 105 randomly selected real world concurrency bugs from 4 representative server and client open-source applications (MySQL, Apache, Mozilla and OpenOffice). Our study reveals several interesting findings and provides useful guidance for concurrency bug detection, testing, and concurrent programming language design.

Some of our findings are as follows: (1) Around one third of the examined non-deadlock concurrency bugs are caused by violation to programmers' order intentions, which may not be easily expressed via synchronization primitives like locks and transactional memories; (2) Around 34% of the examined non-deadlock concurrency bugs involve multiple variables, which are not well addressed by existing bug detection tools; (3) About 92% of the examined concurrency bugs canbe reliably triggered by enforcing certain orders among no more than 4 memory accesses. This indicates that testing concurrent programs can target at exploring possible orders among every small groups of memory accesses, instead of among all memory accesses; (4) About 73% of the examinednon-deadlock concurrency bugs were not fixed by simply adding or changing locks, and many of the fixes were not correct at the first try, indicating the difficulty of reasoning concurrent execution by programmers.

Side Channels Attack

Profit: Detecting and Quantifying Side Channels in Networked Applications (NDSS 2019)

Abstract: We present a black-box, dynamic technique to detect and quantify side-channel information leaks in networked applications that communicate through a TLS-encrypted stream. Given a user-supplied profiling-input suite in which some aspect of the inputs is marked as secret, we run the application over the inputs and capture a collection of variable-length network packet traces. The captured traces give rise to a vast side-channel feature space, including the size and timestamp of each individual packet as well as their aggregations (such as total time, median size, etc.) over every possible subset of packets. Finding the features that leak the most information is a difficult problem.

Our approach addresses this problem in three steps: 1) Global analysis of traces for their alignment and identification of emph{phases} across traces; 2) Feature extraction using the identified phases; 3) Information leakage quantification and ranking of features via estimation of probability distribution.

We embody this approach in a tool called Profit and experimentally evaluate it on a benchmark of applications from the DARPA STAC program, which were developed to assess the effectiveness of side-channel analysis techniques. Our experimental results demonstrate that, given suitable profiling-input suites, Profit is successful in automatically detecting information-leaking features in applications, and correctly ordering the strength of the leakage for differently-leaking variants of the same application.

DifFuzz: Differential Fuzzing for Side-Channel Analysis (ICSE 2019)

Abstract: Side-channel attacks allow an adversary to uncover secret program data by observing the behavior of a program with respect to a resource, such as execution time, consumed memory or response size. Side-channel vulnerabilities are difficult to reason about as they involve analyzing the correlations between resource usage over multiple program paths. We present DifFuzz, a fuzzing-based approach for detecting side-channel vulnerabilities related to time and space. DifFuzz automatically detects these vulnerabilities by analyzing two versions of the program and using resource-guided heuristics to find inputs that maximize the difference in resource consumption between secret-dependent paths. The methodology of DifFuzz is general and can be applied to programs written in any language. For this paper, we present an implementation that targets analysis of Java programs, and uses and extends the Kelinci and AFL fuzzers. We evaluate DifFuzz on a large number of Java programs and demonstrate that it can reveal unknown side-channel vulnerabilities in popular applications. We also show that DifFuzz compares favorably against Blazer and Themis, two state-of-the-art analysis tools for finding side-channels in Java programs.

Attack Directories, Not Caches: Side Channel Attacks in a Non-Inclusive World (S&P 2019)

Abstract: Although clouds have strong virtual memory isolation guarantees, cache attacks stemming from shared caches have proved to be a large security problem. However, despite the past effectiveness of cache attacks, their viability has recently been called into question on modern systems, due to trends in cache hierarchy design moving away from inclusive cache hierarchies. In this paper, we reverse engineer the structure of the directory in a sliced, non-inclusive cache hierarchy, and prove that the directory can be used to bootstrap conflict-based cache attacks on the last-level cache. We design the first cross-core Prime+Probe attack on non-inclusive caches. This attack works with minimal assumptions: the adversary does not need to share any virtual memory with the victim, nor run on the same processor core. We also show the first high-bandwidth Evict+Reload attack on the same hardware. We demonstrate both attacks by extracting key bits during RSA operations in GnuPG on a state-of-the-art non-inclusive Intel Skylake-X server.

CaSym: Cache Aware Symbolic Execution for Side Channel Detection and Mitigation (S&P 2019)

Abstract: Cache-based side channels are becoming an important attack vector through which secret information can be leaked to malicious parties. implementations and Previous work on cache-based side channel detection, however, suffers from the code coverage problem or does not provide diagnostic information that is crucial for applying mitigation techniques to vulnerable software. We propose CaSym, a cache-aware symbolic execution to identify and report precise information about where side channels occur in an input program. Compared with existing work, CaSym provides several unique features: (1) CaSym enables verification against various attack models and cache models, (2) unlike many symbolic-execution systems for bug finding, CaSym verifies all program execution paths in a sound way, (3) CaSym uses two novel abstract cache models that provide good balance between analysis scalability and precision, and (4) CaSym provides sufficient information on where and how to mitigate the identified side channels through techniques including preloading and pinning. Evaluation on a set of crypto and database benchmarks shows that CaSym is effective at identifying and mitigating side channels, with reasonable efficiency.

Synesthesia: Detecting Screen Content via Remote Acoustic Side Channels (USENIX Security 2019)

Abstract: We show that subtle acoustic noises emanating from within computer screens can be used to detect the content displayed on the screens. This sound can be picked up by ordinary microphones built into webcams or screens, and is inadvertently transmitted to other parties, e.g., during a videoconference call or archived recordings. It can also be recorded by a smartphone or ``smart speaker'' placed on a desk next to the screen, or from as far as 10 meters away using a parabolic microphone.

Empirically demonstrating various attack scenarios, we show how this channel can be used for real-time detection of on-screen text, or users' input into on-screen virtual keyboards. We also demonstrate how an attacker can analyze the audio received during video call (e.g., on Google Hangout) to infer whether the other side is browsing the web in lieu of watching the video call, and which web site is displayed on their screen.

Memory Error Detection

Leopard: identifying vulnerable code for vulnerability assessment through program metrics (ICSE 2019)

Abstract: Identifying potentially vulnerable locations in a code base is critical as a pre-step for effective vulnerability assessment; i.e., it can greatly help security experts put their time and effort to where it is needed most. Metric-based and pattern-based methods have been presented for identifying vulnerable code. The former relies on machine learning and cannot work well due to the severe imbalance between non-vulnerable and vulnerable code or lack of features to characterize vulnerabilities. The latter needs the prior knowledge of known vulnerabilities and can only identify similar but not new types of vulnerabilities.

In this paper, we propose and implement a generic, lightweight and extensible framework, Leopard, to identify potentially vulnerable functions through program metrics. Leopard requires no prior knowledge about known vulnerabilities. It has two steps by combining two sets of systematically derived metrics. First, it uses complexity metrics to group the functions in a target application into a set of bins. Then, it uses vulnerability metrics to rank the functions in each bin and identifies the top ones as potentially vulnerable. Our experimental results on 11 real-world projects have demonstrated that, Leopard can cover 74.0% of vulnerable functions by identifying 20% of functions as vulnerable and outperform machine learning-based and static analysis-based techniques. We further propose three applications of Leopard for manual code review and fuzzing, through which we discovered 22 new bugs in real applications like PHP, radare2 and FFmpeg, and eight of them are new vulnerabilities.

Hunting for bugs in code coverage tools via randomized differential testing (ICSE 2019)

Abstract: Reliable code coverage tools are critically important as it is heavily used to facilitate many quality assurance activities, such as software testing, fuzzing, and debugging. However, little attention has been devoted to assessing the reliability of code coverage tools. In this study, we propose a randomized differential testing approach to hunting for bugs in the most widely used C code coverage tools. Specifically, by generating random input programs, our approach seeks for inconsistencies in code coverage reports produced by different code coverage tools, and then identifies inconsistencies as potential code coverage bugs. To effectively report code coverage bugs, we addressed three specific challenges: (1) How to filter out duplicate test programs as many of them triggering the same bugs in code coverage tools; (2) how to automatically reduce large test programs to much smaller ones that have the same properties; and (3) how to determine which code coverage tools have bugs? The extensive evaluations validate the effectiveness of our approach, resulting in 42 and 28 confirmed/fixed bugs for gcov and llvm-cov, respectively. This case study indicates that code coverage tools are not as reliable as it might have been envisaged. It not only demonstrates the effectiveness of our approach, but also highlights the need to continue improving the reliability of code coverage tools. This work opens up a new direction in code coverage validation which calls for more attention in this area.

SMOKE: Scalable Path-Sensitive Memory Leak Detection for Millions of Lines of Code (ICSE 2019)

Abstract: Detecting memory leak at industrial scale is still not well addressed, in spite of tremendous efforts from both the industrial and academia in the past decades. Existing work suffers from an unresolved paradox – a highly precise analysis limits its scalability and an imprecise one seriously hurts its precision or recall. In this work, we present SMOKE, a staged approach to resolve this paradox. Instead of using a uniform precise analysis for all paths, in the first stage, we use a scalable but imprecise analysis to compute a succinct set of candidate memory leak paths. In the second stage, we leverage a more precise analysis to verify the feasibility of the candidates. Our first stage analysis is scalable, due to the design of a new sparse program representation, namely use-flow graph (UFG), which enables us to model the problem as a polynomial-time state analysis. Our second stage analysis is precise and still efficient, due to the smaller number of candidates and the design of a dedicated constraint solver. Experimental results demonstrated that SMOKE can finish checking industrial-sized projects, up to 8MLoC, in forty minutes with an average false positive rate of 24.4%. Besides, SMOKE is significantly faster than the state-of-the-art research techniques as well as the industrial tools, with the speedup ranging from 27.9X to 105.9X. In the twenty-nine mature and extensively checked benchmark projects, SMOKE has discovered thirty previously-unknown memory leaks which were confirmed by developers, and one even got a CVE ID.

Target-Driven Compositional Concolic Testing with Function Summary Refinement for Effective Bug Detection (FSE 2019)

Abstract: Concolic testing is popular in unit testing because it can detect bugs quickly in a relatively small search space. But, in system-level testing, it suffers from the symbolic path explosion and often misses bugs. To resolve this problem, we have developed a focused compositional concolic testing technique, FOCAL, for effective bug detection. Focusing on a target unit failure v (a crash or an assert violation) detected by concolic unit testing, FOCAL generates a system-level test input that validates v. This test input is obtained by building and solving symbolic path formulas that represent system-level executions raising v. FOCAL builds such formulas by combining function summaries one by one backward from a function that raised v to main. If a function summary φa of function a conflicts with the summaries of the other functions, FOCAL refines φa to φa′ by applying a refining constraint learned from the conflict. FOCAL showed high system-level bug detection ability by detecting 71 out of the 100 real-world target bugs in the SIR benchmark, while other relevant cutting edge techniques (i.e., AFL-fast, KATCH, Mix-CCBSE) detected at most 40 bugs. Also, FOCAL detected 13 new crash bugs in popular file parsing programs.

Locating Vulnerabilities in Binaries via Memory Layout Recovering (FSE 2019)

Abstract: Locating vulnerabilities is an important task for security auditing, exploit writing, and code hardening. However, it is challenging to locate vulnerabilities in binary code, because most program semantics (e.g., boundaries of an array) is missing after compilation. Without program semantics, it is difficult to determine whether a memory access exceeds its valid boundaries in binary code. In this work, we propose an approach to locate vulnerabilities based on memory layout recovery. First, we collect a set of passed executions and one failed execution. Then, for passed and failed executions, we restore their program semantics by recovering fine-grained memory layouts based on the memory addressing model. With the memory layouts recovered in passed executions as reference, we can locate vulnerabilities in failed execution by memory layout identification and comparison. Our experiments show that the proposed approach is effective to locate vulnerabilities on 24 out of 25 DARPA’s CGC programs (96%), and can effectively classifies 453 program crashes (in 5 Linux programs) into 19 groups based on their root causes.

CRCount: Pointer Invalidation with Reference Counting to Mitigate Use-after-free in Legacy C/C++ (NDSS 2019)

Abstract: Pointer invalidation has been a popular approach adopted in many recent studies to mitigate use-after-free errors. The approach can be divided largely into two different schemes: explicit invalidation and implicit invalidation. The former aims to eradicate the root cause of use-after-free errors by invalidating every dangling pointer one by one explicitly. In contrast, the latter aims to prevent dangling pointers by freeing an object only if there is no pointer referring to it. A downside of the explicit scheme is that it is expensive, as it demands high-cost algorithms or a large amount of space to maintain every up-to-date list of pointer locations linking to each object at all times. Implicit invalidation is more efficient in that even without any explicit effort, it can eliminate dangling pointers by leaving objects undeleted until all the links between the objects and their referring pointers vanish by themselves during program execution. However, such an argument only holds if the scheme knows exactly when each link is created and deleted. Reference counting is a traditional method to determine the existence of reference links between objects and pointers. Unfortunately, impeccable reference counting for legacy C/C++ code is very difficult and expensive to achieve in practice, mainly because of the type unsafe operations in the code. In this paper, we present a solution, called CRCount, to the use-after-free problem in legacy C/C++. For effective and efficient problem solving, CRCount is armed with the pointer footprinting technique that enables us to compute, with high accuracy, the reference count of every object referred to by the pointers in the legacy code. Our experiments demonstrate that CRCount mitigates the use-after-free errors with a lower performance-wise and space-wise overhead than the existing pointer invalidation solutions.

SoK: Sanitizing for Security (S&P 2019)

Abstract: The C and C++ programming languages are notoriously insecure yet remain indispensable. Developers therefore resort to a multi-pronged approach to find security issues before adversaries. These include manual, static, and dynamic program analysis. Dynamic bug finding tools—henceforth "sanitizers"—can find bugs that elude other types of analysis because they observe the actual execution of a program, and can therefore directly observe incorrect program behavior as it happens. A vast number of sanitizers have been prototyped by academics and refined by practitioners. We provide a systematic overview of sanitizers with an emphasis on their role in finding security issues. Specifically, we taxonomize the available tools and the security vulnerabilities they cover, describe their performance and compatibility properties, and highlight various trade-offs.

NAR-Miner: Discovering Negative Association Rules from Code for Bug Detection (FSE 2018)

Abstract: Inferring programming rules from source code based on data mining techniques has been proven to be effective to detect software bugs. Existing studies focus on discovering positive rules in the form of A ⇒ B, indicating that when operation A appears, operation B should also be here. Unfortunately, the negative rules (A ⇒ ¬ B), indicating the mutual suppression or conflict relationships among program elements, have not gotten the attention they deserve. In fact, violating such negative rules can also result in serious bugs.

In this paper, we propose a novel method called NAR-Miner to automatically extract negative association programming rules from large-scale systems, and detect their violations to find bugs. However, mining negative rules faces a more serious rule explosion problem than mining positive ones. Most of the obtained negative rules are uninteresting and can lead to unacceptable false alarms. To address the issue, we design a semantics-constrained mining algorithm to focus rule mining on the elements with strong semantic relationships. Furthermore, we introduce information entropy to rank candidate negative rules and highlight the interesting ones. Consequently, we effectively mitigate the rule explosion problem. We implement NAR-Miner and apply it to a Linux kernel (v4.12-rc6). The experiments show that the uninteresting rules are dramatically reduced and 17 detected violations have been confirmed as real bugs and patched by kernel community. We also apply NAR-Miner to PostgreSQL, OpenSSL and FFmpeg and discover six real bugs.

Effectivesan: type and memory error detection using dynamically typed C/C++ (PLDI 2018)

Abstract: Low-level programming languages with weak/static type systems, such as C and C++, are vulnerable to errors relating to the misuse of memory at runtime, such as (sub-)object bounds overflows, (re)use-after-free, and type confusion. Such errors account for many security and other undefined behavior bugs for programs written in these languages. In this paper, we introduce the notion of dynamically typed C/C++, which aims to detect such errors by dynamically checking the "effective" type of each object before use at runtime. We also present an implementation of dynamically typed C/C++ in the form of the Effective Type Sanitizer (EffectiveSan). EffectiveSan enforces type and memory safety using a combination of low-fat pointers, type meta data and type/bounds check instrumentation. We evaluate EffectiveSan against the SPEC2006 benchmark suite and the Firefox web browser, and detect several new type and memory errors. We also show that EffectiveSan achieves high compatibility and reasonable overheads for the given error coverage. Finally, we highlight that EffectiveSan is one of only a few tools that can detect sub-object bounds errors, and uses a novel approach (dynamic type checking) to do so.

Spatio-Temporal Context Reduction: A Pointer-Analysis-Based Static Approach for Detecting Use-After-Free Vulnerabilities (ICSE 2018)

Abstract: Zero-day Use-After-Free (UAF) vulnerabilities are increasingly popular and highly dangerous, but few mitigations exist. We introduce a new pointer-analysis-based static analysis, CRed, for finding UAF bugs in multi-MLOC C source code efficiently and effectively. CRed achieves this by making three advances: (i) a spatio-temporal context reduction technique for scaling down soundly and precisely the exponential number of contexts that would otherwise be considered at a pair of free and use sites, (ii) a multi-stage analysis for filtering out false alarms efficiently, and (iii) a path-sensitive demand-driven approach for finding the points-to information required.

We have implemented CRed in LLVM-3.8.0 and compared it with four different state-of-the-art static tools: CBMC (model checking), Clang (abstract interpretation), Coccinelle (pattern matching), and Supa (pointer analysis) using all the C test cases in Juliet Test Suite (JTS) and 10 open-source C applications. For the ground-truth validated with JTS, CRed detects all the 138 known UAF bugs as CBMC and Supa do while Clang and Coccinelle miss some bugs, with no false alarms from any tool. For practicality validated with the 10 applications (totaling 3+ MLOC), CRed reports 132 warnings including 85 bugs in 7.6 hours while the existing tools are either unscalable by terminating within 3 days only for one application (CBMC) or impractical by finding virtually no bugs (Clang and Coccinelle) or issuing an excessive number of false alarms (Supa).

Stack Bounds Protection with Low Fat Pointers (NDSS 2017)

Abstract: Object bounds overflow errors are a common source of security vulnerabilities. In principle, bounds check instrumentation eliminates the problem, but is hampered by limited compatibility against un-instrumented code and high overheads. On 64-bit systems, low-fat pointers are a recent scheme for implementing efficient and compatible bounds checking by transparently encoding meta information within the native pointer representation itself. However, low-fat pointers are traditionally used for heap objects only, where the allocator has sufficient control over object location necessary for the encoding. This is a problem for stack allocation, where there exist strong constraints regarding the location of stack objects that is apparently incompatible with low-fat pointers. In this paper, we present an extension of low-fat pointers to stack objects by using a collection of techniques, such as pointer mirroring and memory aliasing, thereby allowing stack objects to enjoy bounds error protection from instrumented code. Our extension is compatible with common special uses of the stack, such as alloca, setjmp and longjmp, exceptions, and multi-threading, which rely on direct manipulation of the stack pointer. Our experiments show that we successfully extend the advantages of the low-fat pointer encoding to stack objects. The end result is competitive bounds checking instrumentation for the stack and heap with low memory and runtime overheads, and high compatibility with un-instrumented legacy code.

Learning-based Approach

Mining Software Defects: Should We Consider Affected Releases? (ICSE 2019)

Abstract: With the rise of the Mining Software Repositories (MSR) field, defect datasets extracted from software repositories play a foundational role in many empirical studies related to software quality. At the core of defect data preparation is the identification of post-release defects. Prior studies leverage many heuristics (e.g., keywords and issue IDs) to identify post-release defects. However, such the heuristic approach is based on several assumptions, which pose common threats to the validity of many studies. In this paper, we set out to investigate the nature of the difference of defect datasets generated by the heuristic approach and the realistic approach that leverages the earliest affected release that is realistically estimated by a software development team for a given defect. In addition, we investigate the impact of defect identification approaches on the predictive accuracy and the ranking of defective modules that are produced by defect models. Through a case study of defect datasets of 32 releases, we find that the heuristic approach has a large impact on both defect count datasets and binary defect datasets. Surprisingly, we find that the heuristic approach has a minimal impact on defect count models, suggesting that future work should not be too concerned about defect count models that are constructed using heuristic defect datasets. On the other hand, using defect datasets generated by the realistic approach leads to an improvement in the predictive accuracy of defect classification models.

The importance of accounting for real-world labelling when predicting software vulnerabilities (FSE 2019)

Abstract: Previous work on vulnerability prediction assume that predictive models are trained with respect to perfect labelling information (includes labels from future, as yet undiscovered vulnerabilities). In this paper we present results from a comprehensive empirical study of 1,898 real-world vulnerabilities reported in 74 releases of three security-critical open source systems (Linux Kernel, OpenSSL and Wiresark). Our study investigates the effectiveness of three previously proposed vulnerability prediction approaches, in two settings: with and without the unrealistic labelling assumption. The results reveal that the unrealistic labelling assumption can profoundly mis- lead the scientific conclusions drawn; suggesting highly effective and deployable prediction results vanish when we fully account for realistically available labelling in the experimental methodology. More precisely, MCC mean values of predictive effectiveness drop from 0.77, 0.65 and 0.43 to 0.08, 0.22, 0.10 for Linux Kernel, OpenSSL and Wiresark, respectively. Similar results are also obtained for precision, recall and other assessments of predictive efficacy. The community therefore needs to upgrade experimental and empirical methodology for vulnerability prediction evaluation and development to ensure robust and actionable scientific findings.

AI Bugs

CRADLE: Cross-Backend Validation to Detect and Localize Bugs in Deep Learning Libraries (ICSE 2019)

Abstract: Deep learning (DL) systems are widely used in domains including aircraft collision avoidance systems, Alzheimer’s disease diagnosis, and autonomous driving cars. Despite the requirement for high reliability, DL systems are difficult to test. Existing DL testing work focuses on testing the DL models, not the implementations (e.g., DL software libraries) of the models. One key challenge of testing DL libraries is the difficulty of knowing the expected output of DL libraries given an input instance. Fortunately, there are multiple implementations of the same DL algorithms in different DL libraries. Thus, we propose CRADLE, a new approach that focuses on finding and localizing bugs in DL software libraries. CRADLE (1) performs cross-implementation inconsistency checking to detect bugs in DL libraries, and (2) leverages anomaly propagation tracking and analysis to localize faulty functions in DL libraries that cause the bugs. We evaluate CRADLE on three libraries (TensorFlow, CNTK, and Theano), 11 datasets (including ImageNet, MNIST, and KGS Go game), and 30 pre-trained models. CRADLE detects 12 bugs and 104 unique inconsistencies, and highlights functions relevant to the causes of inconsistencies for all 104 unique inconsistencies.

A Comprehensive Study on Deep Learning Bug Characteristics (FSE 2019)

Abstract: Deep learning has gained substantial popularity in recent years. Developers mainly rely on libraries and tools to add deep learning capabilities to their software. What kinds of bugs are frequently found in such software? What are the root causes of such bugs? What impacts do such bugs have? Which stages of deep learning pipeline are more bug prone? Are there any antipatterns? Understanding such characteristics of bugs in deep learning software has the potential to foster the development of better deep learning platforms, debugging mechanisms, development practices, and encourage the development of analysis and verification frameworks. Therefore, we study 2716 high-quality posts from Stack Overflow and 500 bug fix commits from Github about five popular deep learning libraries Caffe, Keras, Tensorflow, Theano, and Torch to understand the types of bugs, root causes of bugs, impacts of bugs, bug-prone stage of deep learning pipeline as well as whether there are some common antipatterns found in this buggy software. The key findings of our study include: data bug and logic bug are the most severe bug types in deep learning software appearing more than 48% of the times, major root causes of these bugs are Incorrect Model Parameter (IPS) and Structural Inefficiency (SI) showing up more than 43% of the times. We have also found that the bugs in the usage of deep learning libraries have some common antipatterns that lead to a strong correlation of bug types among the libraries.

Others

ReduKtor: How We Stopped Worrying About Bugs in Kotlin Compiler (ASE 2019)

Abstract: Bug localization is well-known to be a difficult problem in software engineering, and specifically in compiler development, where it is beneficial to reduce the input program to a minimal reproducing example; this technique is more commonly known as delta debugging. What additionally contributes to the problem is that every new programming language has its own unique quirks and foibles, making it near impossible to reuse existing tools and approaches with full efficiency. In this experience paper we tackle the delta debugging problem w.r.t. Kotlin, a relatively new programming language from JetBrains. Our approach is based on a novel combination of program slicing, hierarchical delta debugging and Kotlin-specific transformations, which are synergistic to each other. We implemented it in a prototype called ReduKtor and did extensive evaluation on both synthetic and real Kotlin programs; we also compared its performance with classic delta debugging techniques. The evaluation results support the practical usability of our approach to Kotlin delta debugging and also shows the importance of using both language-agnostic and language-specific techniques to achieve best reduction efficiency and performance.

Understanding Exception-Related Bugs in Large-Scale Cloud Systems (ASE 2019)

Abstract: Exception mechanism is widely used in cloud systems. This is mainly because it separates the error handling code from main business logic. However, the huge space of potential error conditions and the sophisticated logic of cloud systems present a big hurdle to the correct use of exception mechanism. As a result, mistakes in the exception use may lead to severe consequences, such as system downtime and data loss. To address this issue, the communities direly need a better understanding of the exception-related bugs, i.e., eBugs, which are caused by the incorrect use of exception mechanism, in cloud systems.

In this paper, we present a comprehensive study on 210 eBugs from six widely-deployed cloud systems, including Cassandra, HBase, HDFS, Hadoop MapReduce, YARN, and ZooKeeper. For all the studied eBugs, we analyze their triggering conditions, root causes, bug impacts, and their relations. To the best of our knowledge, this is the first study on eBugs in cloud systems, and the first eBug study that focuses on triggering conditions. We find that eBugs are severe in cloud systems: 74% eBugs affect system availability or integrity. Luckily, exposing eBugs through testing is possible: 54% eBugs are triggered by non-semantic conditions such as network errors; 40% eBugs can be triggered by simulating the conditions at simple system states. Interestingly, we find that exception triggering conditions are useful for detecting eBugs. Based on such relevant findings, we build a static analysis tool, called DIET, which reports 31 bugs and bad practices from the latest versions of the studied systems. So far developers have confirmed that 23 of them are “previously-unknown” bugs or bad practices.

How Bad Can a Bug Get? An Empirical Analysis of Software Failures in the OpenStack Cloud Computing Platform (FSE 2019)

Abstract: Cloud management systems provide abstractions and APIs for programmatically configuring cloud infrastructures. Unfortunately, residual software bugs in these systems can potentially lead to high-severity failures, such as prolonged outages and data losses. In this paper, we investigate the impact of failures in the context widespread OpenStack cloud management system, by performing fault injection and by analyzing the impact of the resulting failures in terms of fail-stop behavior, failure detection through logging, and failure propagation across components. The analysis points out that most of the failures are not timely detected and notified; moreover, many of these failures can silently propagate over time and through components of the cloud management system, which call for more thorough run-time checks and fault containment.

Postcards from the Post-HTTP World: Amplification of HTTPS Vulnerabilities in the Web Ecosystem (S&P 2019)

Abstract: HTTPS aims at securing communication over the Web by providing a cryptographic protection layer that ensures the confidentiality and integrity of communication and enables client/server authentication. However, HTTPS is based on the SSL/TLS protocol suites that have been shown to be vulnerable to various attacks in the years. This has required fixes and mitigations both in the servers and in the browsers, producing a complicated mixture of protocol versions and implementations in the wild, which makes it unclear which attacks are still effective on the modern Web and what is their import on web application security. In this paper, we present the first systematic quantitative evaluation of web application insecurity due to cryptographic vulnerabilities. We specify attack conditions against TLS using attack trees and we crawl the Alexa Top 10k to assess the import of these issues on page integrity, authentication credentials and web tracking. Our results show that the security of a consistent number of websites is severely harmed by cryptographic weaknesses that, in many cases, are due to external or related-domain hosts. This empirically, yet systematically demonstrates how a relatively limited number of exploitable HTTPS vulnerabilities are amplified by the complexity of the web ecosystem.

Scalable Scanning and Automatic Classification of TLS Padding Oracle Vulnerabilities (USENIX Security 2019)

Abstract: The TLS protocol provides encryption, data integrity, and authentication on the modern Internet. Despite the protocol's importance, currently-deployed TLS versions use obsolete cryptographic algorithms which have been broken using various attacks. One prominent class of such attacks is CBC padding oracle attacks. These attacks allow an adversary to decrypt TLS traffic by observing different server behaviors which depend on the validity of CBC padding.

We present the first large-scale scan for CBC padding oracle vulnerabilities in TLS implementations on the modern Internet. Our scan revealed vulnerabilities in 1.83% of the Alexa Top Million websites, detecting nearly 100 different vulnerabilities. Our scanner observes subtle differences in server behavior, such as responding with different TLS alerts, or with different TCP header flags.

We used a novel scanning methodology consisting of three steps. First, we created a large set of probes that detect vulnerabilities at a considerable scanning cost. We then reduced the number of probes using a preliminary scan, such that a smaller set of probes has the same detection rate but is small enough to be used in large-scale scans. Finally, we used the reduced set to scan at scale, and clustered our findings with a novel approach using graph drawing algorithms.

Contrary to common wisdom, exploiting CBC padding oracles does not necessarily require performing precise timing measurements. We detected vulnerabilities that can be exploited simply by observing the content of different server responses. These vulnerabilities pose a significantly larger threat in practice than previously assumed.

Automatically Finding Bugs in Cyber-Physical System Development Tool Chains With Slforge (ICSE 2018)

Abstract: Cyber-physical system (CPS) development tool chains are widely used in the design, simulation, and verification of CPS data-flow models. Commercial CPS tool chains such as MathWorks' Simulink generate artifacts such as code binaries that are widely deployed in embedded systems. Hardening such tool chains by testing is crucial since formally verifying them is currently infeasible. Existing differential testing frameworks such as CyFuzz can not generate models rich in language features, partly because these tool chains do not leverage the available informal Simulink specifications. Furthermore, no study of existing Simulink models is available, which could guide CyFuzz to generate realistic models.

To address these shortcomings, we created the first large collection of public Simulink models and used the collected models' properties to guide random model generation. To further guide model generation we systematically collected semi-formal Simulink specifications. In our experiments on several hundred models, the resulting SLforge generator was more effective and efficient than the state-of-the-art tool CyFuzz. SLforge also found 8 new confirmed bugs in Simulink.