Much of the research in language-based security, like that presented at LangSec 2020, focuses on hardening the software that makes up the servers, databases, and other systems that process inputs from the external world. The LangSec philosophy argues that we need to develop formal linguistic models for representing such systems as well as efficient algorithms for analyzing them. In this talk, we argue that a similar kind of language-based modeling and analysis can and should be applied to the networks that sit in front of such servers. It is these networks that form the first line of defence against malicious adversaries.
More specifically, we will describe NV, a new domain-specific language that is capable of modeling network routing protocols, such as OSPF and BGP, and their configurations. NV models can be obtained via automatic translation from ordinary router configurations, like those used to configure CISCO and Juniper devices. Once in hand, these models may be analyzed using a BDD-based simulator inspired by the literature on software model checking. To get the simulator to scale, we develop a variety of abstractions of network state and use ideas from abstract interpretation to ensure these abstractions are sound. Surprisingly, we show that, in the right circumstances, the right abstractions can make an asymptotic difference in simulator performance, allowing our system to scale to the point that analyzing the correctness of large modern data centers is possible for the first time.
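The flavor of such a simulation can be sketched in a few lines (a toy distance-vector fixpoint computation, not NV's actual algorithm; the topology and unit-cost model are invented for illustration):

```python
# Toy route propagation to a fixpoint, in the spirit of the simulator
# described above (an illustration only, not NV itself).

def simulate(graph, dest):
    """Iterate distance-vector updates until the network converges."""
    INF = float("inf")
    routes = {n: (0 if n == dest else INF) for n in graph}
    changed = True
    while changed:
        changed = False
        for node, neighbors in graph.items():
            # Merge step: keep the best of the current route and each
            # neighbor's advertised route plus a unit link cost.
            best = min([routes[node]] + [routes[n] + 1 for n in neighbors])
            if best < routes[node]:
                routes[node] = best
                changed = True
    return routes

topology = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
print(simulate(topology, "a"))  # {'a': 0, 'b': 1, 'c': 2}
```

The abstractions mentioned above collapse such per-node state; this sketch only shows the concrete fixpoint that those abstractions accelerate.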
Ideas in this talk arose from joint research with Ryan Beckett, Nick Giannarakis, Aarti Gupta, Devon Loehr, and Ratul Mahajan.
Programmers often find parser combinators an appealing approach to constructing parsers. Parser combinators are simply functions in some general purpose programming language, and so building a parser becomes a matter of writing a program that encodes the structure of the corresponding grammar.
Unfortunately, standard parser combinator designs can introduce security vulnerabilities: they make it easy to write backtracking parsers that have exponential complexity or that fail to terminate, and they have complex semantics that make it easy to introduce bugs.
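The exponential behaviour is easy to reproduce with a deliberately naive combinator library (a sketch, unrelated to any production design):

```python
# A naive backtracking combinator library: a parser maps (input,
# position) to the list of positions at which it can finish, so
# alternation keeps every backtracking branch alive.

def char(c):
    return lambda s, i: [i + 1] if i < len(s) and s[i] == c else []

def alt(p, q):
    return lambda s, i: p(s, i) + q(s, i)   # try both branches

def seq(p, q):
    return lambda s, i: [k for j in p(s, i) for k in q(s, j)]

# An ambiguous grammar: n nested copies of (a | a), then a final a.
n = 12
p = char("a")
for _ in range(n):
    p = seq(alt(char("a"), char("a")), p)

results = p("a" * (n + 1), 0)
print(len(results))   # 4096 == 2**12 duplicate parses of one string
```

Each ambiguous alternative doubles the number of surviving branches, so parse time grows as 2^n; the type system described below rules out exactly this kind of grammar.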
Asp (Algebraic, Staged Parsing) is a new design that eliminates these problems using the tools of programming language theory. Asp grammars are context-free expressions with simple semantics, equipped with a type system that guarantees linear-time parsing with no backtracking and single-token lookahead. Furthermore, Asp's implementation uses multi-stage programming to significantly increase performance, making combinator parsing competitive with code generated by tools such as yacc. A fully certified implementation of Asp is underway.
Speaker Biography: Jeremy Yallop is a senior lecturer at the Computer Laboratory in the University of Cambridge and a fellow of Robinson College. His research investigates how to design languages that support advanced abstractions with uncompromised performance by building in support for user-defined optimization. Recent honours include a distinguished paper award and distinguished artifact award from PLDI (2019) and a best paper award from PEPM (2016).
Bio: Natalie Silvanovich is a security researcher on Google Project Zero. Her current focus is browser security, including script engines, WebAssembly and WebRTC. Previously, she worked in mobile security on the Android Security Team at Google and as a team lead of the Security Research Group at BlackBerry, where her work included finding security issues in mobile software and improving the security of mobile platforms. Outside of work, Natalie enjoys applying her hacking and reverse engineering skills to unusual targets and has spoken at several conferences on the subject of Tamagotchi hacking.
Spectre and Meltdown have demonstrated the feasibility of accomplishing cyber exploits using capabilities present in modern-day CPU microarchitectures, such as speculative execution and out-of-order execution. While microarchitectural features are performance-optimization mechanisms that are invisible at the architectural level, they can be implicitly manipulated by programs running at the architectural level. In our work we have structured microarchitectural manipulations into implementations of Boolean logic computation that we refer to as Weird Circuits. We see Weird Circuits as an emergent execution capability in which the actions of a program result in unintended computations via manipulation of microarchitectural state. We have implemented simple Weird Circuits capable of performing several Boolean logic operations at the same time while maintaining architectural invisibility. We are now exploring approaches for scaling up Weird Circuit computations. We are also exploring nuanced formalisms for Weird Circuits that will support a better understanding of their computing power. Finally, we are exploring methods to detect Weird Circuits in existing software corpora, as well as methods to automate their construction.
Formal languages are ubiquitous wherever software systems need to exchange or store data. Unparsing into and parsing from such languages is an error-prone process that has spawned an entire class of security vulnerabilities. There has been ample research into finding vulnerabilities on the parser side, but outside of language specific approaches, few techniques targeting unparser vulnerabilities exist. This work presents a language-agnostic approach for spotting injection vulnerabilities in unparsers. It achieves this by mining unparse trees using dynamic taint analysis to extract language keywords, which are leveraged for guided fuzzing. Vulnerabilities can thus be found without requiring prior knowledge about the formal language, and in fact, the approach is even applicable where no specification thereof exists at all. This empowers security researchers and developers alike to gain deeper understanding of unparser implementations through examination of the unparse trees generated by the approach, as well as enabling them to find new vulnerabilities in poorly-understood software.
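The core idea can be illustrated with a toy sketch (a hypothetical unparser, not the paper's tooling): mark which output characters derive from untrusted field data, then flag tainted characters that coincide with the format's structural delimiters — exactly the positions an injection needs.

```python
# Toy taint-based injection detection for a naive unparser.

DELIMS = {",", "\n", '"'}

def unparse_row(fields):
    """A naive CSV-style unparser with no escaping: the bug under test."""
    out, taint = [], []
    for i, field in enumerate(fields):
        if i:
            out.append(",")
            taint.append(False)       # separator comes from the unparser
        for ch in field:
            out.append(ch)
            taint.append(True)        # field bytes are attacker-controlled
    return "".join(out), taint

def injection_candidates(text, taint):
    # Tainted characters that are also delimiters can change structure.
    return [i for i, (ch, t) in enumerate(zip(text, taint)) if t and ch in DELIMS]

text, taint = unparse_row(["alice", "x,y"])
print(injection_candidates(text, taint))  # [7]: the ',' smuggled in via a field
```

A real analysis mines such taint observations across many unparse trees to recover the language's keywords, which then guide the fuzzer.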
This paper introduces a new approach for labeling the semantic purpose of the functions in a parser. An input file with a known syntax tree is passed to a copy of the target parser that has been instrumented for universal taint tracking. This paper introduces a novel algorithm for merging that syntax tree ground truth with the observed taint and control-flow information from the parser's execution. This produces a mapping from types in the file format to the set of functions most specialized in operating on that type. The resulting mapping has applications in mutational fuzzing, reverse engineering, differential analysis, and automated grammar extraction. We demonstrate that even a single execution of an instrumented parser with a single input file can lead to a mapping that a human would identify as intuitively correct. We hope that this approach will lead to both safer subsets of file formats, as well as safer parsers.
We introduce a conceptual framework that associates syntax and semantics with vertical and horizontal directions in principal bundles and related constructions. This notion of geometry corresponds to a mechanism for performing goal-directed file transformations such as "eliminate unsafe syntax" and suggests various engineering practices.
Vulnerabilities in third-party software modules have resulted in severe security flaws, including remote code execution and denial of service. However, current approaches to securing such libraries suffer from one of two problems. First, they do not perform well enough to be applicable in practice, incurring high CPU and memory overheads. Second, they are hard to apply to legacy and proprietary systems where the source code of the application is unavailable. There is, therefore, a dire need to secure the internal boundaries within an application to ensure vulnerable software modules are not exploitable via crafted-input attacks.
We present a novel approach to secure third-party software modules without requiring access to the source code of the program. First, using the foundations of language-theoretic security, we build a validation filter for the vulnerable module. Using the foundations of linking and loading, we present two different ways to insert that filter between the main code and the vulnerable module. Finally, using the foundations of ELF-based access control, we ensure any entry into the vulnerable module must first go through the filter.
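Conceptually, the filter works like the following sketch (the paper interposes at the ELF linking and loading level with no source access; here the same idea is shown in-process, and `vulnerable_parse` and the "digits:payload" wire format are invented stand-ins):

```python
import re

def vulnerable_parse(data: bytes) -> str:
    # Stand-in for a third-party library entry point with a
    # crafted-input bug somewhere in its parsing logic.
    return data.decode("ascii")

# Validation filter: a recognizer for the language of well-formed inputs.
WELL_FORMED = re.compile(rb"[0-9]{1,4}:[a-z]*")

def filtered_parse(data: bytes) -> str:
    # Every entry into the vulnerable module must pass the recognizer.
    if not WELL_FORMED.fullmatch(data):
        raise ValueError("rejected by input filter")
    return vulnerable_parse(data)

print(filtered_parse(b"3:abc"))   # prints 3:abc
```

The ELF-based mechanisms described above guarantee that callers cannot bypass `filtered_parse` and reach the module directly.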
We evaluate our approaches using three known real-world exploits in two popular libraries---libpng and libxml. We were able to successfully prevent all three exploits from executing.
Any program that reads formatted input relies on parsing software to check the input for validity and transform it into a representation suitable for further processing. Many security vulnerabilities can be attributed to poorly defined grammars, incorrect parsing, and sloppy input validation. In contrast to programming languages, grammars for even common data formats such as DNS and PDF are typically context-sensitive and heterogeneous. However, as in programming languages, a standard notation or language to express these data format grammars can address poor or ambiguous definitions, and the automated generation of correct-by-construction parsers from such grammar specifications can yield correct and type- and memory-safe data parsing routines. We present our ongoing work on developing such a data format description language. Parsley is a declarative data format definition language that combines grammars and constraints in a modular way. We show how it can be used to capture complex data formats such as PDF, DNS, MAVLink, and ELF. We briefly describe the processing pipeline we are designing to generate verified parsers from these specifications.
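The grammar-plus-constraint combination can be sketched as follows (a hypothetical example in the spirit of the approach, not Parsley syntax): a grammar rule for a length-prefixed record paired with the semantic constraint that the declared length match the bytes actually present.

```python
import struct

def parse_record(buf: bytes) -> bytes:
    # Grammar: record := length:u16be payload:byte[length]
    if len(buf) < 2:
        raise ValueError("truncated header")
    (length,) = struct.unpack_from(">H", buf, 0)
    payload = buf[2:2 + length]
    # Constraint: exactly `length` payload bytes must be available.
    if len(payload) != length:
        raise ValueError("length constraint violated")
    return payload

print(parse_record(b"\x00\x03abc"))   # b'abc'
```

Length fields that disagree with the data they describe are a classic source of parser vulnerabilities; making the constraint explicit lets a generated parser reject such inputs by construction.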
We describe our approach and progress in the verification of an existing mature open-source ASN.1 compiler, ASN1C, using the Coq proof assistant. Once completed, our project will provide state-of-the-art high assurance suitable for mission-critical systems. Furthermore, since formal verification will be layered atop a mature existing ASN.1 stack, it will combine the benefits of a high-performance portable stack implementation with formal correctness guarantees. As an essential and necessary step in our approach, the project will also provide a formalization of a key part of the ASN.1 standard. Such a formal specification could subsequently be used by others to analyze ASN.1 properties and validate other implementations.
Whether developing from a specification or deriving parsers from samples, LangSec parser developers require wide-reaching corpora of their target file format in order to identify key edge cases and common deviations from the format's specification. In this work-in-progress paper, we report the details of several methods we've used to gather 30 million files, extract features, and make those features amenable to search and other analytics. The paper documents the opportunities and limitations of some popular open-source data sources and tools, and will benefit researchers who need to efficiently gather a large file corpus.
When a data format achieves a significant level of adoption, the presence of multiple format implementations expands the original specification in often-unforeseen ways. This results in an implicit, de facto format, which can create vulnerabilities in programs handling the associated data files. In this paper we present our initial work on ICARUS: a toolchain for understanding and hardening de facto file formats. We show the results of our work in progress towards labeling and categorizing a corpus of data format samples to understand accepted alterations to a format; the detection of sublanguages within the de facto format using both entropy- and taint-tracking-based methods, as a means of breaking down the larger problem of learning how the grammar has evolved; grammar inference via reinforcement learning, as a means of tying together the learned sublanguages; and the definition of both safe subsets of the de facto grammar and translations from unsafe regions of the de facto grammar into safe ones. Real-world data formats evolve as they find use in real-world applications, and a comprehensive ICARUS toolchain for understanding and hardening the resulting de facto formats can identify and address security risks arising from this evolution.
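The entropy-based side of the sublanguage detection can be sketched as follows (window size and data are illustrative, not ICARUS's actual parameters): compute Shannon entropy over fixed windows of the file; an abrupt jump suggests a boundary between sublanguages, such as a text header followed by compressed or binary data.

```python
import math
from collections import Counter

def entropy(window: bytes) -> float:
    # Shannon entropy in bits per byte over one window.
    n = len(window)
    return -sum(c / n * math.log2(c / n) for c in Counter(window).values())

def profile(data: bytes, size: int = 16):
    # Entropy of each consecutive, non-overlapping window.
    return [entropy(data[i:i + size]) for i in range(0, len(data) - size + 1, size)]

data = b"A" * 32 + bytes(range(32))    # uniform region, then varied region
print(profile(data))                   # entropy jumps at the boundary
```

On this input the profile jumps from 0 bits (a constant region) to 4 bits (16 distinct byte values per window), marking the candidate sublanguage boundary.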
Reverse engineering message formats from binary data is a difficult and time-consuming task. Many approaches rely on byte-aligned data across messages. Without this alignment, the complexity of the task is greatly increased. Methods from computational biology attempt to discover alignments between sections of messages, but depend on data commonalities such as values repeated across messages or frequently used delimiters. Our interest in this problem is motivated by our efforts to automatically reverse engineer the command and control (CnC) message formats used by the Mirai botnet malware. In particular, messages commanding compromised systems to attack contain fields of both varying quantity and varying length. That these messages proved time-consuming and difficult to reverse engineer highlighted the need for an automated, generalized method. Here we present a parser-based approach to message format inference utilizing common design patterns employed in binary data encoding.
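One such design pattern is count-prefixed repetition of length-prefixed fields, which yields exactly the kind of message with fields of varying quantity and varying length (the layout below is a hypothetical illustration, not Mirai's actual wire format):

```python
def parse_message(buf: bytes):
    """Parse: count:u8, then `count` fields of (length:u8, bytes[length])."""
    count, pos = buf[0], 1
    fields = []
    for _ in range(count):
        length, pos = buf[pos], pos + 1
        if pos + length > len(buf):
            raise ValueError("field overruns message")
        fields.append(buf[pos:pos + length])
        pos += length
    return fields

print(parse_message(b"\x02\x03abc\x01x"))   # [b'abc', b'x']
```

An inference engine built from a library of such parser patterns can hypothesize a structure, attempt to parse the captured messages, and keep the patterns that succeed consistently.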
The LangSec approach defends against crafted input attacks by defining a formal language specifying correct inputs and building a recognizer/parser for that language. However, each successive input is not necessarily in the same basic language---e.g., most communication protocols use formats that depend either on values previously received or on some other additional context. When we try to use LangSec in these real-world scenarios, most parsers we write need additional mechanisms to change the recognized language as the execution progresses.
This paper discusses approaches researchers have taken in the past to build parsers for such protocols, and provides formal descriptions of new sets of languages that could be considered to be a sequence of languages, instead of a single language describing an entire protocol---thus bringing LangSec theory closer to practice.
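A minimal example of such a sequence of languages (a toy protocol invented for illustration): the first message fixes a block size, and every subsequent message must be exactly that long, so the language the parser accepts changes as the session progresses.

```python
class SessionParser:
    def __init__(self):
        self.block = None                  # context carried between messages

    def accept(self, msg: bytes) -> bool:
        if self.block is None:             # language 1: one nonzero size byte
            if len(msg) == 1 and msg[0] > 0:
                self.block = msg[0]
                return True
            return False
        return len(msg) == self.block      # language 2: fixed-size blocks

p = SessionParser()
print(p.accept(b"\x04"), p.accept(b"abcd"), b"toolong" and p.accept(b"toolong"))
```

No single context-free language describes the whole session; the formal descriptions in the paper aim to capture exactly this dependence of language number i on the values accepted at steps before it.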
Weird machines are considered a stepping stone towards proving exploitability or unexploitability. Existing formalisations of weird machines tend to misalign with real-world exploitation and can lead to a false sense of (in)security if used without care. Instead, we suggest sticking to a formulation as close as possible to the intuitive concept of a weird machine.
Hence, in this paper we introduce a symbolic execution-based model defining weird machines. We compare two traces with the same path condition in different execution models. When two traces diverge, it is due to a discrepancy in execution models. We argue this discrepancy is the result of a bad assumption in the execution model and creates a weird machine. Isolating the violated assumption provides a classification for these weird machine-inducing vulnerabilities. Finally, we expand on the notion of input to explicitly include indirect input that can influence the execution of a program.
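A small concrete analogue of the comparison (both models below are our own toy constructions, not the paper's): run the same input under an idealized model that assumes memory safety and a "real" model with a flat address space. The traces diverge exactly when the model's assumption (writes stay inside the array) is violated.

```python
def ideal_model(idx, val):
    arr = [0] * 4
    secret = 42
    if 0 <= idx < len(arr):               # the model assumes bounds checks
        arr[idx] = val
    return secret                         # so `secret` can never change

def real_model(idx, val):
    memory = [0] * 4 + [42]               # flat memory: array, then `secret`
    memory[idx] = val                     # no bounds check, as in C
    return memory[4]

print(ideal_model(4, 7), real_model(4, 7))   # 42 7: the traces diverge
```

Here the violated assumption is bounds safety; in the paper's terms, isolating that assumption both classifies the vulnerability and characterizes the weird machine it induces.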
Emergent Computations attempt to formalize the unintended computations that arise when unsafe programs surrender computational power to untrusted attacker programs ("exploits"). Emergent computations abound in programs written in memory-unsafe languages like C and C++. This work-in-progress talk discusses how traditional deductive techniques inspired by Hoare's axiomatic semantics are ill-suited to reasoning about attacker knowledge, whether in concrete or symbolic form. We introduce a logical representation of the transducer model introduced in [Vanegue'14] and used in [Dullien'17] to show how attacker knowledge can be symbolically represented in formal proofs, helping to fill the gap between program verification and attack proofs. We draw a parallel with recent work on Incorrectness Logic [O'Hearn 2020] and Incorrectness Separation Logic [Raad et al. 2020].
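As a rough orientation on the parallel drawn here (our paraphrase, not the talk's formalism): where a Hoare triple over-approximates reachable states, an incorrectness triple under-approximates them, which is the direction needed to prove that an attacker-desired state is actually reachable.

```latex
% Hoare logic (over-approximation): every execution from p ends in q.
\{p\}\; C\; \{q\} \iff \mathit{post}(C)(p) \subseteq q
% Incorrectness logic (under-approximation): every state in q is
% actually reachable by some execution of C from a state in p.
[p]\; C\; [q] \iff q \subseteq \mathit{post}(C)(p)
```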