case studies

Performance fuzzing compilers a journey with kotlinc

Compiler and IDE performance are paramount to a good developer experience. Howerver, optimizing a modern compiler is tricky due to the sheer size and complexity: just the frontend of kotlinc the Kotlin compiler weighs in at 80KLOC.

This makes expert-driven optimization expensive and suggests the need for improved tooling.

Indeed, the security community is well familiar with fuzzing — automatically feeding a program randomized inputs to find crashes, memory corruption, and other misbehaviour. The discovered interesting inputs are then investigated by experts.

Inspired by a 2018 ISSTA paper, we worked with Jetbrains Research’s MLSE lab to apply fuzzing to the problem of finding performance anomalies in kotlinc.

kotperffuzz diagram

Fuzzing redux

Fuzzing methods differ on three axes:

  • whether they generate inputs from scratch or mutate existing ones
  • how the input format structure is encoded
  • how much feedback they get from the target program

For instance, the poster child of fuzzing is AFL, a mutation-based unstructured graybox fuzzer. To use it, one seeds it with samples of input format (i.e. a few PNG images to fuzz libpng) which it then mutates with built-in heuristics. The user need not supply a specification of PNG — AFL hooks into the program to observe its execution graph in response to each input and optimizes inputs so as to maximize graph coverage.

Source code in a high-level language is a way more structured input than normally tackled by fuzzers, where structure \sim interdependency.

Imagine generating a network packet: it has a structured header and a data section. The fuzzer is free to generate any flags permutations or mutate source/dest leaving the data as-is; the only dependence it has to learn is the relation between header.packet_length and actual length(data) — if they don’t match, the packet will be rejected early and thus the bulk of the program unexplored.

The property-based testing crowd frames this as sampling from the generator domain and filtering only valid inputs encoded in an executable oracle. When the generator domain is >>>> set of valid inputs, random generation becomes intractable.

Theory in Kotlin practice

When working with highly-structured data, it is tempting to provide the generator an explicit model of the format. Unfortunately, Kotlin has no formal description of the type system, and designing one is very much out of scope of a fuzzing project. An ANTLR grammar is available though!

This rules out whitebox fuzzing.

Generally, the blacker the fuzzing, the faster you have to fuzz to find interesting cases (AFL generally works with at least 1000 samples/sec). kotlinc is slow, compiling on the order of 10samplesseccore10 \mathrm{\frac{samples}{sec \ast core}}, so blackbox would seem intractable.

Prior work shows neural generation of source code coupled with a quick pre-filter make for a practical blackbox fuzzer. That is, a neural generator + a fast heuristic oracle generate valid enough inputs to make fuzzing of even slow programs tractable.

Still, OpenCL is much simpler than Kotlin. We also could not quickly reproduce authors’ results; and so moved on to graybox approaches. We currently believe a DeepSmith-Kotlin to be viable, but computationally expensive.

Going graybox

Exploration of the program graph à la AFL sounds like a great fit for a compiler: one expects to exercise various analysis and optimisation routines when optimizing inputs for breadth.

Our first executable result was a kotlinc driver for JQF, a port of AFL for Java (ISSTA 2019). JQF’s graybox instrumentation further reduced performance to 5samplesseccore5 \mathrm{\frac{samples}{sec \ast core}}.

Could we remedy that with very good input generation?

We tried two strategies:

  • domain-general semantic fuzzing with Zest
  • domain-specific seeds and mutators

We found Zest doesn’t work with slow feedback (convergence is much too slow).

Domain-specific mutators

Here’s a diagram of kotlinc’s internal representations:


Observe we can pick a layer to start generation at. For instance, to generate Kotlin sources we could start with the ANTLR grammar mentioned above.

We can even isolate a part of the compiler and only fuzz that!

For this project, we bootstrapped from the compiler testsuite (that’s kotlin/compiler/testData and kotlin/benchmarks) and implemented mutation through recombination.

Our technique annotates the seed files with a source map of type information extracted from the compiler. This lets us separate slow-offline-correct analysis provided by the compiler from fast-online-heuristic analysis we do.

The recombinator is cognizant of types and scopes and only replaces item xx' for xx if TxTxT_x' \subseteq T_x and ΓxΓx\Gamma_x' \subseteq \Gamma_x.

We find this approach promising and look forward to putting in some additional engineering work to make it scale to hundreds of cores (remember, kotlinc is slow).

In this case study, we saw how Multicore evaluated the landscape of fuzzing and applied their findings to fuzzing kotlinc, an industrial compiler. In the span of a three months’s contract, we prototyped two approaches inspired by the latest research and earmarked future work directions.

This work reflects our values of striving for multiplicative impact — an improvement in kotlinc improves every Kotlin developer’s experience — and applying state-of-the art techniques.

Additional (raw) research notes are available at wldhx/kotperffuzz.