This implementation of SCAF supports Linux and FreeBSD, and is based on the techniques presented in the conference paper.

This version has the following notable limitations:

  • Non-malleable and long-running parallel sections are supported, but not automatically detected.
  • The SunOS/UltraSparc T2 platform is no longer supported.

Repository access

git clone


Mainly, SCAF requires a C compiler, ZeroMQ, HWLoc, and PAPI.

On Debian-like systems, try:

apt-get install build-essential libhwloc-dev libzmq3-dev libpapi-dev


SCAF uses GNU Autotools: build with ./configure && make, then install with make install.

SCAF consists of a centralized SCAF daemon (scafd) and clients. In this example, we start up scafd and a few clients.

Starting scafd

First, we start scafd. The -b option tells it to ignore any background load on the system, while the -t 1 option tells it to print its status to standard output once per second.

$ scafd -b -t 1

Starting clients

With scafd running, we can now launch clients in another shell. The clients will coordinate in order to avoid oversubscribing the machine. Furthermore, they will perform live performance experiments in order to determine how much of the machine's available hardware contexts they should each use.

Note that scafwrap only affects OpenMP programs.

$ scafwrap ./cg.C.x
$ scafwrap ./lu.B.x

The two tabs below contain videos showing a test scenario without and with SCAF. The user runs NAS benchmarks lu.A.x and cg.B.x at the same time, with each benchmark printing its speedup over serial (single-threaded) execution upon completion.

The 36 bars at the top of the display show CPU usage for each of the machine's hardware contexts. Each bar is colorized: green usage indicates that the left shell is using the CPU, while blue usage indicates that the right shell is using the CPU.

Without SCAF, there are 36*2 = 72 runnable threads on the system, and scheduling is left to the Linux scheduler. This is usually fine, except the Linux scheduler does not schedule OpenMP processes: it schedules individual threads. The result is "fair", but threads compete for hardware contexts. In this case, the fine-grained sharing results in a huge performance hit because lu.A.x uses a type of spinning synchronization which assumes that hardware contexts are not shared.

With SCAF, we see that both benchmarks perform better using the same kernel and the same binaries. The difference is that SCAF coordinates the degree of parallelism used by each process in order to avoid oversubscription. A total of about 36 runnable threads is maintained at all times.

We also see that SCAF gives CG a larger portion of the machine than LU because CG is observed to scale better than LU. SCAF is able to determine this without any profiling or ahead-of-time analysis by using "serial experiments" at the beginning of execution.
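To make the idea of efficiency-based sharing concrete, here is a minimal sketch of one simple policy that is consistent with the behavior described above: split the hardware contexts in proportion to each process's measured parallel efficiency, with every process keeping at least one context. This is an illustration only and is not taken from the SCAF sources; the function name allocate_contexts and the efficiency values are hypothetical, and the paper describes the policy SCAF actually uses.

```python
def allocate_contexts(efficiencies, total_contexts):
    """Split hardware contexts among processes in proportion to efficiency.

    efficiencies: dict mapping a process name to its measured parallel
        efficiency (a positive number; hypothetical values in this sketch).
    total_contexts: number of hardware contexts on the machine.

    Every process gets at least one context, and the allocations always
    sum to total_contexts, so the machine is never oversubscribed.
    """
    n = len(efficiencies)
    if n == 0:
        return {}
    if total_contexts < n:
        raise ValueError("need at least one hardware context per process")
    if any(eff <= 0 for eff in efficiencies.values()):
        raise ValueError("efficiencies must be positive")

    # Reserve one context per process, then share out the rest.
    spare = total_contexts - n
    total_eff = sum(efficiencies.values())
    ideal = {name: spare * eff / total_eff
             for name, eff in efficiencies.items()}

    # Start from the rounded-down ideal share.
    alloc = {name: 1 + int(ideal[name]) for name in efficiencies}

    # Hand out contexts lost to rounding, largest fractional part first.
    leftover = total_contexts - sum(alloc.values())
    by_remainder = sorted(efficiencies,
                          key=lambda name: ideal[name] - int(ideal[name]),
                          reverse=True)
    for name in by_remainder[:leftover]:
        alloc[name] += 1
    return alloc


if __name__ == "__main__":
    # Hypothetical efficiencies: CG scales better than LU.
    print(allocate_contexts({"cg": 0.75, "lu": 0.5}, 36))
    # prints {'cg': 21, 'lu': 15}
```

With these made-up efficiencies, the better-scaling process receives the larger share while the total stays at 36 runnable threads, which mirrors the qualitative behavior seen in the test scenario.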

The SCAF paper

The SCAF paper describes the design, implementation, and performance of SCAF in more detail.

To reference this white paper, please use the following:

T. Creech, A. Kotha, and R. Barua, "Efficient Multiprogramming for Multicores with SCAF." In Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 334-345. ACM, 2013.
Alternatively, you can use this BibTeX file. It includes the document's URL in a \url{...} command, so don't forget to add \usepackage{url} to your main tex file.

Documentation is available in the form of man pages once SCAF is installed. Without installing, you can view the man pages from the source code repository:

These are the people currently working on the SCAF project.

Tim Creech

Tim Creech has graduated with his PhD; he worked on SCAF as a student under Prof. Barua. Tim's work on this project was funded by a NASA Space Technology Research Fellowship.

Rajeev Barua

Rajeev Barua is an associate professor in the department of Electrical and Computer Engineering at the University of Maryland. Dr. Barua is the principal investigator of the SCAF project.