
15-418 Milestone Report

By Joshua Nichols (joshuani) and Lucca Rodrigues (luccar)

Progress Summary

We have fully implemented a compiler for C0 in Rust with several new language features. The compiler targets LLVM, using the inkwell bindings crate to convert the LLVM intermediate representation (IR) into machine code. The most significant new feature is support for parallelism:

We are able to call the native OS's pthread_create and pthread_join to allow users to write parallel C0 code. Implementing parallel map and reduce in C0 showed an approximately 7.8x speedup on 8 processors.

Goals and Deliverables Progress

Deliverables

The three main deliverables we specified in the project proposal are:

  1. Implement code generation to target LLVM backend.
  2. Add foreign function interface (FFI) to support native system calls.
  3. Develop standard library for spawning and joining threads.

We have successfully achieved all our deliverables:

  1. We implemented LLVM code generation through a series of IR lowerings. First, the source code is parsed into an AST which is defined as a recursive data structure. Next, we lower the AST to an IR which stores expressions and statements in an arena allocator and desugars symbol names. Typechecking and semantic analysis are performed on this IR. Finally, we lower the higher-level IR to a three-address assembly IR which we can directly lower to the LLVM representation.

  2. We added the extern keyword to specify externally defined functions for the FFI. The compiler tells LLVM to assume these functions exist without explicit definitions; they are resolved later, during the linking phase.

  3. We added user defined functions to call into the system's native threads:

extern fn pthread_create(thread: void*, attr: void*, start_routine: fn(void*) -> void*, arg: void*) -> int;
extern fn pthread_join(thread: void*, retval: void*) -> int;

Note that this required functions as values (function pointers), which were initially included in the nice to haves.

Nice to Haves

The nice to haves we specified in the project proposal are:

  1. Implement support for functions as first-class values.
  2. Develop standard library with common algorithms implemented using parallelism.
  3. Add shared thread pool to reduce overhead of spawning and joining threads.
  4. Add dynamic task queue to balance workload in standard library algorithms.

We successfully added support for functions as first-class values and have made significant progress on the parallel standard library; we have not yet started on the last two items.

We implemented functions as first-class values using function pointers by adding a function type, e.g. fn(int) -> int, which is represented as a pointer under the hood. To support generic functions and structs, we added the void* type and type casting, e.g. x as void*.

Updated Deliverables

The remaining deliverables we want to accomplish before the poster session are:

  1. Finish standard library with parallel filter and scan algorithms.
  2. Implement shared thread pool to reduce thread spawning/joining overhead.
  3. Implement dynamic task queue to support workload balancing.

Poster Demo

During the poster session, we would like to showcase speedup graphs for several common parallel algorithms implemented in C0 using our parallel DSL and compiler. We plan to focus on our performance debugging, data collection, and analysis journey rather than on elaborate demos.

Preliminary Results

To test our DSL, we implemented a simple example in C0: a hand-written, parallel map-reduce over an array of integers. At a high level, it does the following:

  1. Creates an array A of length 40.
  2. Fills every entry with 40.
  3. In parallel, replaces each entry with fib(40) (parallel map).
  4. In parallel, sums all those results together (parallel reduce).
  5. Returns the final sum.

So mathematically, the result is:

40 × fib(40)

because there are 40 elements, and each one becomes fib(40). The Fibonacci function is the classic recursive implementation and is intentionally expensive, making it a good benchmark for parallelizing work.

It is worth noting that this algorithm is embarrassingly parallel. As shown by the results below, our program's speedup scaled almost perfectly linearly with the number of processors. But this is by design: our main goal was to implement a proof of concept for our parallelization primitives and establish a performance baseline. If we had somehow gotten less than 5x on 8 processors, we would have turned our attention to identifying bottlenecks in the language implementation itself. With this baseline in place, we are confident that our compilation pipeline delivers the expected speedup trends, on par with a typical multi-threading implementation in a general-purpose systems language (e.g. C/C++ with POSIX threads).

Number of threads | Computation time (s) | Speedup
1 | 5.62 | 1.00
2 | 2.79 | 2.01
4 | 1.41 | 3.99
8 | 0.719 | 7.81

Table 1: Benchmark which computes and sums Fibonacci numbers

Issues and Concerns

One issue we have noticed is that, because pthread_create and pthread_join take their arguments as void pointers rather than typed generics, we inevitably have to define and cast a new data type every time we call the multi-threading API functions.

We believe that stronger type-checker features and ergonomics could solve this (e.g. C++ templates/STL, Rust traits, closures/lambdas). But we are also aware that this project is time-constrained, and we want to focus on implementing and benchmarking parallelization rather than building an entire new programming language from scratch.

Updated Schedule