Adding Parallelization to C0

By Joshua Nichols and Lucca Rodrigues

Summary

We will implement an optimizing C0 compiler with support for thread parallelism, along with an easy-to-use standard library that makes parallel programming simple. We will perform a detailed analysis of our compiler's performance and benchmark it against optimized C code on multi-core CPU platforms.

Background

Modern computational hardware is explicitly designed to be parallel, with multi-core processors and Single Instruction, Multiple Data (SIMD) execution units delivering significant speedups. However, the C0 imperative programming language (a memory-safe subset of C) is strictly sequential by design. Because C0 allows memory operations and mutable state, it is extremely difficult for a compiler to automatically determine which sections of code can be parallelized, and overly aggressive optimizations can produce incorrect behavior.

To avoid these issues, modern languages fall along a spectrum between declarative and fine-grained imperative programming. At the declarative end is pure functional programming, where programmers specify the logic of a computation without specifying its control flow. At the imperative end, languages expose low-level control-flow management to the user through thread spawning and synchronization (e.g., pthreads), which can achieve high performance but is difficult to develop in and offers few guarantees about safety and correctness.

This project proposes a middle ground between the two approaches: a small set of extensions to the C0 language that provide high-level parallel primitives. By introducing structured abstractions that allow declarative computation without explicit thread or synchronization management, the compiler can verify that parallel execution is safe. The objective is to keep the imperative structure of C0 while providing a verifiable framework for writing high-performance code.

The Challenge

We previously implemented a C0 compiler in CMU's compilers course, 15-411, but we intend to take a different approach for our 15-418 project by targeting LLVM as the compiler backend. Targeting the LLVM intermediate representation (IR) will allow us to focus on parallelism features rather than spending time implementing and debugging general compiler-related features in the backend. To achieve this goal, we need to implement a translation step that converts either the frontend abstract syntax tree (AST) or our IR into LLVM IR.

Language parallelism primitives

The design philosophy of our C0 compiler will be to implement thread spawning and joining as wrappers around the operating system's native threads, and to implement higher-level operations, such as loop parallelism, in the language's standard library. Rust takes a similar approach in its std::thread module, which calls into the OS kernel to create and manage threads.

The C0 language does not provide a foreign function interface (FFI), so we must implement an extern keyword to allow native system calls from the language. This could be done either by wrapping the system calls underneath dedicated spawn and sync keywords (as in Cilk), or as a general FFI implementation.
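Either way, the extern declarations ultimately need a small C runtime shim to bind against. The sketch below shows one plausible shape for such a shim, assuming pthreads as the native thread layer; the names c0_spawn and c0_join are our own placeholders, not existing C0 runtime functions.

```c
// Sketch of a C runtime shim that a hypothetical C0 `extern` declaration
// could bind to. Assumes pthreads; c0_spawn/c0_join are illustrative names.
#include <pthread.h>
#include <stdlib.h>

typedef void (*c0_task)(void *arg);

struct c0_thread_state {
    c0_task task;
    void *arg;
};

// Adapter with the void *(*)(void *) signature pthread_create expects.
static void *c0_trampoline(void *p) {
    struct c0_thread_state *st = p;
    st->task(st->arg);
    free(st);
    return NULL;
}

// Spawn a thread running task(arg); returns a handle for c0_join.
pthread_t c0_spawn(c0_task task, void *arg) {
    struct c0_thread_state *st = malloc(sizeof *st);
    st->task = task;
    st->arg = arg;
    pthread_t tid;
    pthread_create(&tid, NULL, c0_trampoline, st);
    return tid;
}

// Block until the spawned thread finishes.
void c0_join(pthread_t tid) {
    pthread_join(tid, NULL);
}

// Tiny demo: a task that writes to a shared cell through its argument.
static void write_seven(void *arg) { *(int *)arg = 7; }

int demo_spawn_join(void) {
    int cell = 0;
    c0_join(c0_spawn(write_seven, &cell));
    return cell;  // 7 once the spawned thread has run and been joined
}
```

The trampoline indirection is what lets the C0-facing interface stay typed while pthreads only sees void pointers.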

Language features

Writing parallel programs directly with threads is difficult and error-prone. For example, the most obvious way to sum the elements of an array is for all threads to update a shared accumulator variable, but this introduces a race condition. The correct approach is for each thread to keep a separate accumulator and combine the results at the end, but rewriting code this way is tedious and error-prone. In short, writing high-performance code should not be harder than writing sequential code.

This can be addressed by supporting functions as first-class values. For example, we could easily implement the summing of squared values in an array with help from the standard library.

// Standard library functions
int[] map(int[] data, int len, fn(int) -> int transform);
int reduce(int[] data, int len, fn(int, int) -> int combine);

// User code
int sum_sqs(int[] data, int len) {
  int[] sqs = map(data, len, |x| { x * x });
  return reduce(sqs, len, |x, y| { x + y });
}

Figure 1: Sum of squared values using map and reduce

The ability to support first-class functions and closures allows code to be written in the expressive style of a functional language while keeping the performance of an imperative language like C.

To support first-class functions, we need to add support for function pointers and lambda lifting in the C0 compiler.
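A brief sketch of what lambda lifting produces may help: a C0 closure like |x| { x * k } is lifted to a top-level function that reads its captured variable k from an explicit environment struct, and the closure value becomes a (function pointer, environment) pair. The names below are illustrative, not actual compiler output.

```c
// Lambda lifting, sketched in C: the captured variable k moves into an
// environment struct, and the lambda body becomes a top-level function.
struct scale_env { int k; };

// The lifted body: the free variable k is now read from the environment.
static int scale_lifted(void *env, int x) {
    return ((struct scale_env *)env)->k * x;
}

// A map that accepts the (function pointer, environment) pair a closure
// lowers to after lifting; applies it to each element in place.
void map_closure(int *data, int len, int (*f)(void *, int), void *env) {
    for (int i = 0; i < len; i++)
        data[i] = f(env, data[i]);
}
```

Usage mirrors the original closure: `struct scale_env e = { 3 }; map_closure(a, n, scale_lifted, &e);` computes the same result as mapping |x| { 3 * x }.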

Standard library

The standard library will make heavy use of thread parallelism and first-class functions to implement map, reduce, scan, and other common algorithms. An example interface:

int[] map(int[] data, int len, fn(int) -> int transform);
int reduce(int[] data, int len, fn(int, int) -> int combine);
int[] scan(int[] data, int len, fn(int, int) -> int combine);
int[] filter(int[] data, int len, fn(int) -> bool predicate);

Figure 2: Standard library interface

Further optimizations would include a shared thread pool to reduce OS overhead and a dynamic task queue to balance workloads.
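The dynamic task queue can be as simple as a shared atomic counter that workers use to claim indices: faster threads automatically claim more work, balancing uneven workloads. The sketch below illustrates the idea (a real library would claim chunks of indices rather than one at a time to amortize the atomic overhead); it is an illustration of the strategy, not our planned implementation.

```c
// Dynamic load balancing via an atomic "next task" counter: workers
// repeatedly claim the next unprocessed index until the queue drains.
#include <pthread.h>
#include <stdatomic.h>

#define NWORKERS 4

struct queue {
    atomic_int next;   // next unclaimed index
    int len;
    int *data;
};

static void *square_worker(void *p) {
    struct queue *q = p;
    for (;;) {
        int i = atomic_fetch_add(&q->next, 1);  // claim one task
        if (i >= q->len)
            break;                              // queue drained
        q->data[i] = q->data[i] * q->data[i];
    }
    return NULL;
}

// Square every element of data in place using NWORKERS threads.
void parallel_square(int *data, int len) {
    struct queue q;
    q.len = len;
    q.data = data;
    atomic_init(&q.next, 0);

    pthread_t tids[NWORKERS];
    for (int t = 0; t < NWORKERS; t++)
        pthread_create(&tids[t], NULL, square_worker, &q);
    for (int t = 0; t < NWORKERS; t++)
        pthread_join(tids[t], NULL);
}
```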

Resources

We have a C0 compiler from 15-411 to use as a reference for compiler correctness. However, since the focus of this course is on parallel computing rather than compiler design, we will target LLVM as the compiler backend. Using LLVM has several important benefits:

  1. The code produced by our compiler is portable since LLVM can compile to many different platforms, unlike the 15-411 compiler backend.
  2. LLVM has a mature representation of parallel computation in its intermediate representation, which our compiler frontend can produce after performing program analysis.
  3. The development time of our project can be shifted to focusing on the program analysis stages that are relevant to parallel computation, rather than spending time on unrelated aspects of compiler design.

Goals and Deliverables

Plan to achieve

The project can be divided into three main parts: adding parallelism features to the compiler, implementing simple functional language features, and developing the language's standard library (consisting of only a few functions). The must-haves for the project are the parallelism features themselves:

  1. Implementing code generation to convert the frontend AST into LLVM IR.
  2. Adding FFI support to the C0 compiler to enable native system calls.
  3. Developing a standard library with support for thread spawning and joining.

Hope to achieve

The ideal case would be to develop an easy-to-use standard library built on our parallelism and functional language features:

  1. Implement support for functions as first-class values.
  2. Develop a standard library with common algorithms implemented using parallelism.
  3. Add a shared thread pool to reduce the overhead of spawning and joining threads.
  4. Add a dynamic task queue to balance workloads in standard library algorithms.

The ultimate goal for our project would be to show benchmarks for algorithms implemented in both OpenMP and our language on both the GHC and PSC clusters, demonstrating both excellent baseline performance and scalability.

Platform

We will primarily be testing our compiler on GHC machines during the compiler development process. Once we have achieved satisfactory performance on the GHC cluster, we will benchmark the code produced by our compiler on the PSC machines to examine the scaling properties of the executable code.

We chose to focus on multi-core CPU systems since many programs can be straightforwardly improved through parallelism with multiple cores, but parallelizing the same computations on a GPU would require a fundamental restructuring of the code.

Schedule

  1. Week 3/29-4/4: Implement the AST to LLVM IR translation step and add support for the extern keyword in the compiler data structures. Note that we are using the inkwell Rust library, which wraps the LLVM API, and have already made significant progress on the translation, so we will not be starting from scratch here.
  2. Week 4/5-4/11: Add FFI support to the compiler and successfully demonstrate calling native system functions from C0. Write the threading portion of the standard library (this should be very small, with only a few functions).
  3. Week 4/12-4/18: Implement functions as first-class values in the compiler. This step must be completed before the standard library work, since we need a way to pass functions as arguments to the threading library for spawn/join.
  4. Week 4/19-4/25: Implement the standard library with common algorithms using parallelism. Begin benchmarking.
  5. Week 4/26-4/30: Add support for shared task pool and dynamic task queue if time is available. Also write final report and presentation for poster session.