SCAF stands for Scheduling and Allocation with Feedback.
SCAF is a multiprogramming strategy for malleable multithreaded processes that aims to improve system efficiency. It was presented at MICRO46 in 2013.
The idea is to implement space sharing, with allocations chosen according to live performance feedback. Notably, the techniques used do not require any programming paradigm shift, code modification, or even recompilation. The initial implementation targets OpenMP.
Currently, the SCAF implementation runs on any Linux or FreeBSD SMP system that supports GCC's OpenMP implementation and can count cycles and instructions with hardware counters through PAPI.
See the SCAF poster.
This implementation of SCAF supports Linux and FreeBSD, and is based on the techniques presented in the conference paper.
This version has the following notable limitations:
git clone https://bitbucket.org/tcreech/scaf.git
Mainly, SCAF requires a C compiler, ZeroMQ, HWLoc, and PAPI.
On Debian-like systems, try:
$ apt-get install build-essential libhwloc-dev libzmq3-dev libpapi-dev
SCAF uses GNU Autotools, so installation is easy: build with ./configure && make, then install with make install.
First, we start scafd. The -b option tells it to ignore any background load on the system, while the -t 1 option tells it to print its status to standard output every 1 second.
$ scafd -b -t 1
With scafd running, we can now launch clients in another shell. The clients will coordinate in order to avoid oversubscribing the machine. Furthermore, they will perform live performance experiments in order to determine how much of the machine's available hardware contexts they should each use.
Note that scafwrap only affects OpenMP programs.
$ scafwrap ./cg.C.x
$ scafwrap ./lu.B.x
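As a rough illustration of what a live performance experiment could compute, the sketch below turns cycle and instruction counts (the kind of data PAPI exposes) into a per-thread efficiency estimate. It assumes that instructions per cycle is a usable proxy for useful progress, which would not hold for threads that spin while waiting. The function names and the formula are hypothetical and are not taken from the SCAF source.

```python
def ipc(instructions, cycles):
    """Instructions per cycle from two hardware counter readings."""
    if cycles <= 0:
        raise ValueError("cycles must be positive")
    return instructions / cycles


def estimated_efficiency(serial_ipc, parallel_ipcs):
    """Estimate parallel efficiency from IPC measurements.

    serial_ipc    -- IPC observed while running on a single thread
    parallel_ipcs -- IPC observed on each thread of the parallel run

    Assumption (hypothetical): the sum of the per-thread IPCs divided by
    the serial IPC approximates speedup; dividing by the thread count
    gives efficiency.
    """
    if serial_ipc <= 0:
        raise ValueError("serial_ipc must be positive")
    if not parallel_ipcs:
        raise ValueError("need at least one parallel measurement")
    speedup = sum(parallel_ipcs) / serial_ipc
    return speedup / len(parallel_ipcs)


if __name__ == "__main__":
    # Made-up counter readings, for illustration only.
    serial = ipc(instructions=2_000_000, cycles=2_000_000)
    parallel = [0.75, 0.75, 0.75, 0.75]
    print(estimated_efficiency(serial, parallel))
```

An efficiency near 1 would suggest that the process makes good use of every context it is given, while a low value would suggest that some of its contexts could be better used by another client.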
The two tabs below contain videos showing a test scenario without and with SCAF. The user runs NAS benchmarks lu.A.x and cg.B.x at the same time, with each benchmark printing its speedup over serial (single-threaded) execution upon completion.
The 36 bars at the top of the display show CPU usage for each of the machine's hardware contexts. Each bar is colorized: green usage indicates that the left shell is using the CPU, while blue usage indicates that the right shell is using the CPU.
Without SCAF, there are 36*2 = 72 runnable threads on the system, and scheduling is left to the Linux scheduler. This is usually fine, except the Linux scheduler does not schedule OpenMP processes: it schedules individual threads. The result is "fair", but threads compete for hardware contexts. In this case, the fine-grained sharing results in a huge performance hit because lu.A.x uses a type of spinning synchronization which assumes that hardware contexts are not shared.
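The thread count above can be checked with a few lines of arithmetic. This is only a sketch of the numbers in the scenario, assuming each OpenMP process starts one thread per hardware context; the function name is made up for illustration.

```python
def oversubscription(contexts, threads_per_process):
    """Return (runnable threads, runnable threads per hardware context)."""
    if contexts <= 0:
        raise ValueError("contexts must be positive")
    runnable = sum(threads_per_process)
    return runnable, runnable / contexts


if __name__ == "__main__":
    # Two benchmarks, each starting 36 threads on a 36-context machine.
    runnable, ratio = oversubscription(36, [36, 36])
    print(runnable, ratio)  # 72 2.0
```

A ratio above 1 means that some threads must wait for a hardware context, which is exactly the situation that hurts code relying on spinning synchronization.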
With SCAF, we see that both benchmarks perform better using the same kernel and the same binaries. The difference is that SCAF is coordinating the degrees of parallelism used by each process in order to avoid oversubscription. A total of about 36 runnable threads is maintained at all times.
We also see that SCAF gives CG a larger portion of the machine than LU because CG is observed to scale better than LU. SCAF is able to determine this without any profiling or ahead-of-time analysis by using "serial experiments" at the beginning of execution.
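To make the idea of giving better-scaling processes a larger share concrete, here is a minimal sketch of one possible policy: split the hardware contexts in proportion to each process's measured efficiency, with at least one context per process. This is a hypothetical policy chosen for illustration; the actual SCAF allocation policy is described in the paper and may differ. The efficiency values in the example are made up.

```python
def allocate_contexts(efficiency, total_contexts):
    """Split total_contexts among processes in proportion to efficiency.

    efficiency     -- dict mapping process name to a non-negative score
    total_contexts -- number of hardware contexts to hand out

    Every process gets at least one context. Contexts left over after
    rounding down go to the processes with the largest fractional share.
    """
    if total_contexts < len(efficiency):
        raise ValueError("need at least one context per process")
    alloc = {name: 1 for name in efficiency}
    remaining = total_contexts - len(efficiency)
    total_eff = sum(efficiency.values())
    if total_eff <= 0:
        # No usable feedback yet: fall back to an even split.
        shares = {name: remaining / len(efficiency) for name in efficiency}
    else:
        shares = {n: remaining * e / total_eff for n, e in efficiency.items()}
    for name, share in shares.items():
        alloc[name] += int(share)
    leftover = total_contexts - sum(alloc.values())
    by_remainder = sorted(
        shares, key=lambda n: shares[n] - int(shares[n]), reverse=True
    )
    for name in by_remainder[:leftover]:
        alloc[name] += 1
    return alloc


if __name__ == "__main__":
    # Made-up efficiencies: cg is assumed to scale better than lu.
    print(allocate_contexts({"cg": 0.9, "lu": 0.45}, 36))
```

Under this sketch the total always equals the number of hardware contexts, so the machine is fully used but never oversubscribed.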
The SCAF paper describes the design, implementation, and performance of SCAF in more detail.
To reference this white paper, please use the following:
The entry uses \url{...}, so don't forget to \usepackage{url} in your main tex file.
Documentation is available in the form of man pages once SCAF is installed. Without installing, you can view the man pages from the source code repository:
Tim Creech has graduated with his PhD; he worked on SCAF as a student under Prof. Barua. Tim's work on this project was funded by a NASA Space Technology Research Fellowship.
Rajeev Barua is an associate professor in the department of Electrical and Computer Engineering at the University of Maryland. Dr. Barua is the principal investigator of the SCAF project.