K2 Optimizing Compiler
Synthesizing Safe and Efficient Kernel Extensions for Packet Processing
Extended Berkeley Packet Filter (BPF) has emerged as a powerful method to extend packet-processing functionality in the Linux operating system. BPF allows users to write code in high-level languages (like C or Rust) and execute them at specific hooks in the kernel, such as the network device driver. To ensure safe execution of a user-developed BPF program in kernel context, Linux uses an in-kernel static checker. The checker allows a program to execute only if it can prove that the program is crash-free, always accesses memory within safe bounds, and avoids leaking kernel data.
BPF programming is not easy. One, even modest-sized BPF programs are deemed too large to analyze and rejected by the kernel checker. Two, the kernel checker may incorrectly determine that a BPF program exhibits unsafe behaviors. Three, even small performance optimizations to BPF code (e.g., 5% gains) must be meticulously hand-crafted by expert developers. Traditional optimizing compilers for BPF are often inadequate since the kernel checker’s safety constraints are incompatible with rule-based optimizations.
We present K2, a program-synthesis-based compiler that automatically optimizes BPF bytecode with formal correctness and safety guarantees. K2 produces code with 6–26% reduced size, 1.36%–55.03% lower average packet-processing latency, and 0–4.75% higher throughput (packets per second per core) relative to the best clang-compiled program, across benchmarks drawn from Cilium, Facebook, and the Linux kernel. K2 incorporates several domain-specific techniques to make synthesis practical by accelerating equivalence-checking of BPF programs by 6 orders of magnitude.
Frequently Asked Questions
What are the limitations of the proposed approach and prototype (size of the eBPF programs, usage of other hooks than XDP, how to handle new versions of the eBPF kernel verifier, …)?
(1) Scale: In our opinion, the relatively small size of the programs that can be realistically optimized within a given time budget is the most significant conceptual limitation of K2’s current approach. Having said that, K2 already handles programs much larger than many state of the art synthesizers. Due to optimizations based on modular verification, it is still possible to optimize smaller regions of a large program. However, clearly, more work is needed to make the techniques scale to larger programs.
(2) Handling non-XDP hooks: In addition to XDP programs, K2 currently handles socket filters and also hook points in some system calls in the TCP/IP stack. Adding new hooks is a matter of setting up the K2 interpreter and verifier with the correct input state (i.e., the registers that are valid upon initialization, along with semantics, like the fact that the register points into a memory of a certain size). This is a matter of engineering the specific initialization settings to K2’s interpreter and verifier rather than a conceptual limitation of the approach.
(3) Keeping up with the kernel checker: While the kernel checker is a moving target with more code and features added all the time, its fundamental safety conditions have been fairly static: (i) don’t run instructions that might cause exceptions and crash the OS, e.g., divide by zero (ii) don’t access memory out of bounds, (iii) don’t leak contents of kernel memory that might be visible to a BPF program into user-space or out of the system. In our opinion, everything that the kernel checker targets these goals and it is possible to capture these conditions with first-order logic. The post-processing pass indicated in the response to point M1 (above) will ensure that a little lag in keeping up with the kernel checker won’t “break” K2’s output. The first-order checks within K2 must be updated as necessary, at a “human time scale”, when one notices that many outputs start failing the checker.
There is a post-processing safety check to K2 that weeds out programs that do not pass the kernel checker. Why not add this safety check within K2 and checked in each iteration?
There are three problems with this approach:
(1) It would provide no insight into why the program fails on the safety front. Our static analysis and first order checks do provide such insight, potentially paving the path to a safety score that smoothly quantifies the “unsafe”-ness of a program, guiding the search towards safer programs.
(2) It makes the optimization CEGIS loop inefficient. First-order checks for safety provide counterexamples, which can be used to weed out unsafe programs through interpretation in subsequent iterations rather than a potentially costly kernel loading operation (system call + kernel analysis time).
(3) In the longer term, we believe that encoding the safety checks more declaratively in logic is the right way to build a synthesizer that emits safe programs. As one of the reviewers themselves (E) pointed out, the kernel checker is an idiosyncratic moving target — encoding the intent declaratively at least seems like it would make it possible to understand the constraints over programs in the first place.
The approach we propose (adding the kernel checker as a final post-processing step) combines the best of the two designs: use the first-order logic in the inner optimization loop to get rid of most unsafe programs in the synthesizer, have understandable safety checks and scores, and yet, finally emit safe programs that are guaranteed to be accepted by the checker.
Why the synthesized programs perform better than the standard ones and why such optimizations cannot be added to compilers?
We categorize and describe the non-obvious optimizations that K2 found in the section 9 according to the evaluation results.
The cost of executing an eBPF instruction is measured by executing it millions of times. It’s strange as the performance of such an instruction will also depend on caches and when it is executed millions of time, you carry an evaluation that is not representative of how the instruction will be executed surrounded by other instructions. Maybe you could compare the predicted latency of the generated eBPF code with the measurements carried out later in the paper.
It is indeed true that the predictions aren’t always representative of the actual performance of an instruction surrounded by other code. We have mainly used the performance objective as a “weak” predictor of performance. Unlike pattern-matching compilers like Clang, our synthesis-based approach affords us the luxury of generating many variants of the same program and just running the top few variants to determine which one has the best run-time performance. Better performance models for programs may result in better programs to be discovered by the search and such models are of independent interest. However, building such models will require a study of its own.
The paper states that the kernel checker’s safety analysis is incomplete and imprecise. So, instead of writing a compiler that generates code to meet the requirements of this whimsical kernel checker, shouldn’t we be focusing our efforts on fixing the kernel checker and make it easier to write eBPF code normally?
In our view, improving the kernel checker does not invalidate the need or existence of a safety-aware optimizing compiler. Peephole optimizations in compilers are often unaware of the context of the transformation within the overall program (e.g., such as the set of registers or memory that can be read or written safely at a given point in the program). Attempting to produce code that is always safe will preclude many optimizations, while producing optimized code that might be risky under some contexts will result in many legitimate and safe programs being rejected. We believe that the separation of the compiler and the checker in the context of BPF results in a strict necessity for collaboration between the compiler and the run-time environment (i.e., the checker that loads the program). We’re not the first to make this observation: the developers of the BPF backend of Clang/LLVM and the developers of the kernel checker already collaborate quite closely as evidenced by the BPF documentation (https://www.kernel.org/doc/html/latest/bpf/bpf_devel_QA.html#llvm) and posts on the kernel BPF mailing lists.
The paper seems to ignore concurrency in its modeling of maps.
Our formalization mainly guarantees equivalence to the original program under the condition that the map is only modified by a single thread running the BPF program. BPF developers recently introduced a helper function that allows programs to use spinlocks to provide mutual exclusion for map access. We haven’t modeled that helper (our corpus did not include code that used the helper), hence, we haven’t modeled mutual exclusion under concurrency. In the absence of mutual exclusion, race conditions are permissible, and this is not a limitation specific to K2 — Clang-compiled code provides the same semantics.