LangSec Workshop

at IEEE Security & Privacy, May 21, 2020

Papers & Slides

Keynote

David Walker (Princeton University).

Keynote detail to be announced.

Jeremy Yallop (University of Cambridge).

Keynote detail to be announced.

Invited talks

"A Story About JavaScript". Natalie Silvanovich (Google Inc.).

Every year, dozens of vulnerabilities are reported in JavaScript engines, despite strong efforts from defenders. This talk explains why JavaScript has been so difficult to secure. It goes through the history of JavaScript features, and describes how design choices that ignored implementation realities reduced the security and performance of JavaScript engines, while also making them more difficult to maintain. This talk will end with suggestions on how to create better specifications by keeping implementation in mind.

Bio: Natalie Silvanovich is a security researcher on Google Project Zero. Her current focus is browser security, including script engines, WebAssembly and WebRTC. Previously, she worked in mobile security on the Android Security Team at Google and as a team lead of the Security Research Group at BlackBerry, where her work included finding security issues in mobile software and improving the security of mobile platforms. Outside of work, Natalie enjoys applying her hacking and reverse engineering skills to unusual targets and has spoken at several conferences on the subject of Tamagotchi hacking.

"Weird Circuits in CPU Microarchitectures". Abhrajit Ghosh (Perspecta Labs) and Dmitry Evtyushkin (College of William & Mary).

Spectre and Meltdown have demonstrated the feasibility of accomplishing cyber exploits using capabilities present in modern day CPU microarchitectures such as speculative execution and out-of-order execution. While microarchitectural features are performance optimization mechanisms that are invisible at the architectural level, they can be implicitly manipulated by programs running at the architectural level. In our work we have structured micorarchitectural manipulations into implementations of Boolean logic computation that we refer to as Weird Circuits. We see Weird Circuits as an emergent execution capability where the actions of a program result in unintended computations via manipulation of microarchitectural state. We have implemented simple Weird Circuits capable of performing several Boolean logic operations at the same time while maintaining architectural invisibility. We are now exploring approaches for scaling up Weird Circuit computations. We are also exploring nuanced formalisms for Weird Circuits that will support a better understanding of their computing power. Finally, we are also exploring methods to detect Weird Circuits in existing software corpora, and methods to automate their construction.

Research Papers

"Language-agnostic Injection Detection". Lars Hermerschmidt (AXA Konzern AG), Andreas Straub (RWTH Aachen University), and Goran Piskachev (Fraunhofer IEM).

Formal languages are ubiquitous wherever software systems need to exchange or store data. Unparsing into and parsing from such languages is an error-prone process that has spawned an entire class of security vulnerabilities. There has been ample research into finding vulnerabilities on the parser side, but outside of language specific approaches, few techniques targeting unparser vulnerabilities exist. This work presents a language-agnostic approach for spotting injection vulnerabilities in unparsers. It achieves this by mining unparse trees using dynamic taint analysis to extract language keywords, which are leveraged for guided fuzzing. Vulnerabilities can thus be found without requiring prior knowledge about the formal language, and in fact, the approach is even applicable where no specification thereof exists at all. This empowers security researchers and developers alike to gain deeper understanding of unparser implementations through examination of the unparse trees generated by the approach, as well as enabling them to find new vulnerabilities in poorly-understood software.

"Toward Automated Grammar Extraction via Semantic Labeling of Parser Implementations". Carson Harmon, Bradford Larsen, and Evan S. Sultanik (Trail of Bits).

This paper introduces a new approach for labeling the semantic purpose of the functions in a parser. An input file with a known syntax tree is passed to a copy of the target parser that has been instrumented for universal taint tracking. This paper introduces a novel algorithm for merging that syntax tree ground truth with the observed taint and control-flow information from the parser's execution. This produces a mapping from types in the file format to the set of functions most specialized in operating on that type. The resulting mapping has applications in mutational fuzzing, reverse engineering, differential analysis, as well as automated grammar extraction. We demonstrate that even a single execution of an instrumented parser with a single input file can lead to a mapping that a human would identify as intuitively correct. We hope that this approach will lead to both safer subsets of file formats, as well as safer parsers.

"The Geometry of Syntax and Semantics for Directed File Transformations". Steve Huntsman (BAE Systems FAST Labs) and Michael Robinson (American University).

We introduce a conceptual framework that associates syntax and semantics with vertical and horizontal directions in principal bundles and related constructions. This notion of geometry corresponds to a mechanism for performing goal-directed file transformations such as ``eliminate unsafe syntax'' and suggests various engineering practices.

"Armor Within: Defending against Vulnerabilities in Third-Party Libraries". Sameed Ali, Prashant Anantharaman and Sean Smith (Dartmouth College).

Vulnerabilities in third-party software modules have resulted in severe security flaws, including remote code execution and denial of service. However, current approaches to securing such libraries suffer from one of two problems. First, they do not perform sufficiently well to be applicable in practice and incur high CPU and memory overheads.

Second, they are also harder to apply to legacy and proprietary systems when the source code of the application is not available. There is, therefore, a dire need to secure the internal boundaries within an application to ensure vulnerable software modules are not exploitable via crafted input attacks.

We present a novel approach to secure third-party software modules without requiring access to the source code of the program. First, using the foundations of language-theoretic security, we build a validation filter for the vulnerable module. Using the foundations of linking and loading, we present two different ways to insert that filter between the main code and the vulnerable module. Finally, using the foundations of ELF-based access control, we ensure any entry into the vulnerable module must first go through the filter.

We evaluate our approaches using three known real-world exploits in two popular libraries---libpng and libxml. We were able to successfully prevent all three exploits from executing.

Research Reports

"The Parsley Data Format Definition Language". Prashanth Mundkur (SRI International), Prashant Anantharaman (Dartmouth College), Sameed Ali (Dartmouth College), Linda Briesemeister (SRI International), Natarajan Shankar (SRI International), and Sean Smith (Dartmouth College).

Any program that reads formatted input relies on parsing software to check the input for validity and transform it into a representation suitable for further processing. Many security vulnerabilities can be attributed to poorly defined grammars, incorrect parsing, and sloppy input validation. In contrast to programming languages, grammars for even common data formats such as DNS and PDF are typically context-sensitive and heterogenous. However, as in programming languages, a standard notation or language to express these data format grammars can address poor or ambiguous definitions, and the automated generation of correct-by-construction parsers from such grammar specifications can yield correct and type- and memory-safe data parsing routines. We present our ongoing work on developing such a data format description language. Parsley is a declarative data format definition language that combines grammars and constraints in a modular way. We show how it can be used to capture complex data formats such as PDF, DNS, MAVLink, and ELF. We briefly describe the processing pipeline we are designing to generate verified parsers from these specifications.

"Formally-Verified ASN.1 Protocol C-language Stack". Nika Pona (Digamma.ai) and Vadim Zaliva (Carnegie Mellon University).

We describe our approach and progress in verification of an existing mature open-source ASN.1 compiler ASN1C using the Coq proof assistant. Once completed, our project will provide state-of-the-art high assurance which is suitable for mission-critical systems. Furthermore, since formal verification will be layered atop a mature existing ASN.1 stack, it will combine the benefits of high-performance portable stack implementation with formal correctness guarantees. As an essential and necessary step in our approach, the project will also provide a formalization of a key part of the ASN.1 standard. Such formal specification could subsequently be used by others to analyze ASN.1 properties and validate other implementations.

"Building a Wide Reach Corpus for Secure Parser Development". Timothy Allison, Wayne Burke, Chris Mattmann, Anastasia Menshikova, Philip Southam, Ryan Stonebraker and Virisha Timmaraju (NASA Jet Propulsion Laboratory).

Whether developing from a specification or deriving parsers from samples, LangSec parser developers require widereach corpora of their target file format in order to identify key edge cases or common deviations from the format’s specification. In this work-in-progress paper, we report the details of several methods we’ve used to gather 30 million files, extract features and make these features amenable to search and other analytics. This paper documents opportunities and limitations of some popular open source data and tools and this paper will benefit researchers who need to efficiently gather a large file corpus.

"ICARUS: Building Parsers out of Feathers and Wax". Sam Cowger, Yerim Lee, Nichole Schimanski, Mark Tullsen, Walter Woods, Richard Jones, Ew Davis, William Harris (Galois, Inc.), Trent Brunson, Carson Harmon, Bradford Larsen and Evan Sultanik (Trail of Bits).

When a data format achieves a significant level of adoption, the presence of multiple format implementations expands the original specification in often-unforeseen ways. This results in an implicit, de facto format, which can create vulnerabilities in programs handling the associated data files. In this paper we present our initial work on ICARUS: a toolchain for dealing with the problem of understanding and hardening de facto file formats. We show our the results of our work in progress towards labeling and categorizing a corpora of data format samples to understand accepted alterations to a format; the detection of sublanguages within the de facto format using both entropy- and taint-tracking-based methods, as a means of breaking down the larger problem of learning how the grammar has evolved; grammar inference via reinforcement learning, as a means of tying together the learned sublanguages; and the defining of both safe subsets of the de facto grammar, as well as translations from unsafe regions of the de facto grammar into safe regions. Real-world data formats evolve as they find use in real-world applications, and a comprehensive ICARUS toolchain for understanding and hardening the resulting de facto formats can identify and address security risks arising from this evolution.

Work-in-Progress Reports

"Reverse Engineering Binary Messages through Design Patterns" Jared Chandler and Kathleen Fisher (Tufts University).

Reverse engineering message formats from binary data is a difficult and time consuming task. Many approaches rely on byte aligned data across messages. Without this alignment, the complexity of the task is greatly increased. Methods from computational biology attempt to discover alignments between sections of messages, but depend on data commonalities such as values repeated across messages, or frequently used delimiters. Our interest in this problem is motivated by our efforts to automatically reverse engineer command and control (CnC) message formats used by the Mirai botnet malware. In particular, messages commanding compromised systems to attack contain fields of both varying quantity and varying length. That these messages proved time-consuming and difficult to reverse engineer highlighted the need for an automated generalized method. Here we present a parser-based approach to message format inference utilizing common design patterns employed in binary data encoding.

"On Session Languages". Prashant Anantharaman and Sean Smith (Dartmouth College).

The LangSec approach defends against crafted input attacks by defining a formal language specifying correct inputs and building a recognizer/parser for that language. However, each successive input is not necessarily in the same basic language---e.g., most communication protocols use formats that either depend on values previously received, or on some other additional context. When we try to use LangSec in these real-world scenarios, most parsers we write need additional mechanisms to change the recognized language as the execution progresses.

This paper discusses approaches researchers have taken in the past to build parsers for such protocols, and provides formal descriptions of new sets of languages that could be considered to be a sequence of languages, instead of a single language describing an entire protocol---thus bringing LangSec theory closer to practice.

"Weird Machine-inducing Vulnerabilities: a Symbolic Execution Model". Jordy Gennissen (Royal Holloway, University of London).

Weird machines are considered the stepping stone towards exploitability or proving unexploitaibility. Existing formalisations of weird machines tend to misalign with real-world exploitation and can lead to a false sense of (in)security if used without care. Instead, we suggest to stick with a formulation as close to the intuitive concept of a weird machine.

Hence, in this paper we introduce a symbolic execution-based model defining weird machines. We compare two traces with the same path condition in different execution models. When two traces divert, it is due to a discrepancy in execution models. We argue this discrepancy is a result of a bad assumption in the execution model and creates a weird machine. Isolating the violated assumption provides a classification for these weird machine-inducing vulnerabilities. Finally, we expand on the notion of input to explicitly include indirect input that can influence the execution of a program.

"A Survey of Connections between Exploitation Primitives". Avi Weinstock.

We demonstrate several exploit primitives, and connections between those primitives. We sketch an approach to formalizing said primitives an an attempt to make progress on the intermediate exploitation state problem.