offence tutorial

Version 0.01

July 12, 2011

We wrote offence to illustrate the techniques of our article “Stability in Weak Memory Models” [2]. The prototype offence takes a litmus test in the style of diy and outputs a synchronised version of this litmus test. When run on the original X86-TSO [6] or Power [3] model (i.e. on some x86 or Power machine) the synchronised test will behave as it were running on the Sequential Consistent memory model [4]. In effect, offence restores SC, by inserting fences, atomic instructions or locks.

The main purpose of offence is to demonstrate the soundness of our semantics-based technique for restoring SC, by performing experiments. It is to be noticed that offence builds over previous work [7, 5]. Our contribution mostly resides in using memory model semantics to justify and extend those previous works.

Background information on offence is available at http://offence.inria.fr. The authors of offence are Jade Alglave and Luc Maranget, re-using code of diy, whose additional authors are Susmit Sarkar and Peter Sewell.

1 Introduction to offence
2 Some words on offence internals
3 Usage of offence

1 Introduction to offence

Let us first build two classical tests for x86: SB.litmus that exercises the store-buffering relaxation, and SB-BIS.litmus that also exercises the store-buffering relaxation.

% diyone -arch X86 -name SB Rfi PodRR Fre PodWR Fre
% diyone -arch X86 -name SB-BIS Rfi PodRR Fre PodWW Wse

We use diyone and get the following two tests:

X86 SB
"Rfi PodRR Fre PodWR Fre"
{}
 P0          | P1          ;
 MOV [y],$1  | MOV [x],$1  ;
 MOV EAX,[y] | MOV EAX,[y] ;
 MOV EBX,[x] |             ;

exists (0:EAX=1 /\ 0:EBX=0 /\ 1:EAX=0)

X86 SB-BIS
"Rfi PodRR Fre PodWW Wse"
{}
 P0          | P1         ;
 MOV [y],$2  | MOV [x],$1 ;
 MOV EAX,[y] | MOV [y],$1 ;
 MOV EBX,[x] |            ;

exists (y=2 /\ 0:EAX=2 /\ 0:EBX=0)

When run on an x86 machine with litmus, both tests are highly likely to exhibit the non-SC outcome highlighted in their final “exists” condition.

Such non-SC outcomes can be avoided by inserting fences at the appropriate places in test code, which operation offence performs with its default settings:

% offence SB.litmus SB-BIS.litmus

offence outputs three files, an index file @out and two litmus tests, SB-F.litmus and SB-BIS-F.litmus.

X86 SB-F
"Rfi PodRR Fre PodWR Fre"

{}

 P0          | P1          ;
 MOV [y],$1  | MOV [x],$1  ;
 MOV EAX,[y] | MFENCE      ;
 MFENCE      | MOV EAX,[y] ;
 MOV EBX,[x] |             ;

exists (0:EAX=1 /\ 0:EBX=0 /\ 1:EAX=0)

X86 SB-BIS-F
"Rfi PodRR Fre PodWW Wse"

{}

 P0          | P1         ;
 MOV [y],$2  | MOV [x],$1 ;
 MOV EAX,[y] | MOV [y],$1 ;
 MFENCE      |            ;
 MOV EBX,[x] |            ;

exists (y=2 /\ 0:EAX=2 /\ 0:EBX=0)

Observe that no fence is inserted between the two stores of P₁ in test SB-BIS: inserting a fence here would be useless by TSO semantics.

For Power, offence inserts both the sync and the lightweight lwsync barriers:

% diyone -arch PPC -name PPC-SB-BIS Rfi PodRR Fre PodWW Wse
% offence PPC-SB-BIS.litmus

We get the new test with fences PPC-SB-BIS-F, where the barrier lwsync is inserted between the two stores of P₁.

2 Some words on offence internals

2.1 Critical cycles

The tool offence proceeds by analysing its input, so as to construct compile time approximations of critical cycles [7].

X86 SB P0 \| P1 ; (a)Wy \| (d)Wx ; (b)Ry \| (e)Ry ; (c)Rx \| ;		X86 SB-BIS P0 \| P1 ; (a)Wy \| (d)Wx ; (b)Ry \| (e)Wy ; (c)Rx \| ;

In critical cycles, conflict edges (in red, between memory accesses by different threads, one of which one at least is a write) and program order edges (in black, oriented, between memory accesses to different locations by the same thread) alternate. Here, the graphs of SB and SB-BIS contain the same critical cycle each, namely (a) → (c) → (d) → (e) → (a) , while the graph of contains the additional critical cycle (b) → (c) → (d) → (e) → (b). In the TSO-model, it suffices to insert fences between W → R pairs involved in cycles (or fragile pairs). Hence, mfence is inserted between (a) and (c) for both tests, and between (d) and (e) for SB only.

For Power, a fence must be inserted between all fragile pairs, hence the lwsync barrier inserted between the two stores (d) and (e) of P₁ in PPC-SB-BIS-F.

PPC PPC-SB-BIS-F
"Rfi PodRR Fre PodWW Wse"

{0:r2=y; 0:r5=x; 1:r2=x; 1:r4=y;}

 P0           | P1           ;
 li r1,2      | li r1,1      ;
 stw r1,0(r2) | stw r1,0(r2) ;
 lwz r3,0(r2) | lwsync       ;
 sync         | li r3,1      ;
 lwz r4,0(r5) | stw r3,0(r4) ;

exists (y=2 /\ 0:r3=2 /\ 0:r4=0)

Notice that the lightweight lwsync is admissible, based upon our semantics of the Power memory model. On P₀ we have two fragile pairs: (a):Wy → (c):Rx and (b):Ry → (c):Rx. The tool offence inserts the strong sync just before (c), which suffices to maintain those two fragile pairs.

2.2 Global fence insertion

The distribution of offence includes a mixer tool, to mix two tests together. The tool mixer takes two tests as arguments. It first randomly permutes the threads of one of the argument tests, convert its registers and memory locations to fresh ones, and randomly interleaves its code with the code of the other test, thread by thread.

% mixer SB.litmus SB-BIS.litmus > MIXED.litmus
% offence < MIXED.litmus > MIXED-F.litmus

We get the litmus tests MIXED.litmus and MIXED-F.litmus:

X86 SB+SB-BIS

 P0          | P1          ;
 MOV [a],$1  | MOV [b],$2  ;
 MOV [y],$1  | MOV [x],$1  ;
 MOV [b],$1  | MOV EBX,[b] ;
 MOV EAX,[y] | MOV ECX,[a] ;
 MOV EBX,[x] | MOV EAX,[y] ;

⇒

X86 SB+SB-BIS-F

 P0          | P1          ;
 MOV [a],$1  | MOV [b],$2  ;
 MOV [y],$1  | MOV [x],$1  ;
 MOV [b],$1  | MFENCE      ;
 MOV EAX,[y] | MOV EBX,[b] ;
 MFENCE      | MOV ECX,[a] ;
 MOV EBX,[x] | MOV EAX,[y] ;

There are 7 critical cycles in test MIXED, resulting in 8 W → R fragile pairs.

Nevertheless, offence inserts two mfence instructions only, thanks to the optimising fence insertion algorithm of [5]. With naive fence insertion, offence inserts 5 mfence instructions:

% offence -fencing naive < MIXED.litmus > MIXED-N.litmus

We get the (naively fence) litmus tests MIXED-N.litmus:

X86 SB+SB-BIS

 P0          | P1          ;
 MOV [a],$1  | MOV [b],$2  ;
 MOV [y],$1  | MOV [x],$1  ;
 MOV [b],$1  | MOV EBX,[b] ;
 MOV EAX,[y] | MOV ECX,[a] ;
 MOV EBX,[x] | MOV EAX,[y] ;

⇒

X86 SB+SB-BIS-F

 P0          | P1          ;
 MOV [a],$1  | MOV [b],$2  ;
 MFENCE      | MFENCE      ;
 MOV [y],$1  | MOV [x],$1  ;
 MFENCE      | MFENCE      ;
 MOV [b],$1  | MOV EBX,[b] ;
 MFENCE      | MOV ECX,[a] ;
 MOV EAX,[y] | MOV EAX,[y] ;
 MOV EBX,[x] |             ;

A similar experiment can be performed on Power:

% diyone -arch PPC -name PPC-SB-BIS Rfi PodRR Fre PodWW Wse
% mixer PPC-SB-BIS.litmus > PPC-MIXED.litmus
% offence < PPC-MIXED.litmus > PPC-MIXED-F.litmus
% offence -fencing naive < PPC-MIXED.litmus > PPC-MIXED-N.litmus

We get the fenced test PPC-MIXED-F and the naively fenced test PPC-MIXED-N, or:

PPC MIXED-F
{0:r2=y; 0:r5=x;  0:r7=a; 0:r9=b;
 1:r2=x; 1:r4=y; 1:r6=b; 1:r9=a;}

 P0           | P1           ;
 li r6,1      | li r1,1      ;
 li r1,2      | stw r1,0(r2) ;
 stw r1,0(r2) | li r5,2      ;
 stw r6,0(r7) | li r3,1      ;
 sync         | stw r5,0(r6) ;
 lwz r3,0(r2) | lwz r7,0(r6) ;
 sync         | lwsync       ;
 li r8,1      | stw r3,0(r4) ;
 lwz r4,0(r5) | sync         ;
 stw r8,0(r9) | lwz r8,0(r9) ;

PPC PPC-MIXED-N
{0:r2=y; 0:r5=x; 0:r7=a; 0:r9=b;
 1:r2=x; 1:r4=y; 1:r6=b; 1:r9=a; }

 P0           | P1           ;
 li r6,1      | li r1,1      ;
 li r1,2      | stw r1,0(r2) ;
 stw r1,0(r2) | sync         ;
 sync         | li r5,2      ;
 stw r6,0(r7) | li r3,1      ;
 sync         | stw r5,0(r6) ;
 lwz r3,0(r2) | sync         ;
 sync         | lwz r7,0(r6) ;
 li r8,1      | sync         ;
 lwz r4,0(r5) | stw r3,0(r4) ;
 stw r8,0(r9) | sync         ;
              | lwz r8,0(r9) ;

Here there are 13 fragile pairs. Notice that naive fence insertion is not naive to the point of inserting 13 barriers: when several barriers are commanded at the same place, only one (the stronger, if they are different) survives. With respect to naive fence insertion, global fence insertion spares 4 sync instructions (one on P₀ and 3 on P₁), for the price of one additional lwsync instruction (on P₁).

2.3 Atomic instructions

There are other hardware primitives than fences that restore SC. The x86 architecture features the atomic (or locked) exchange instruction xchg. When instructed to do so by option -technique atom, offence will use atomic instructions:

% offence -technique atom SB-BIS.litmus

offence outputs the new test SB-BIS-A, which, according to X86-TSO semantics, will never produce the non-SC outcome detected by the final condition.

X86 SB-BIS-A
"Rfi PodRR Fre PodWW Wse"

{}

 P0             | P1         ;
 MOV %atom,$2   | MOV [x],$1 ;
 XCHG [y],%atom | MOV [y],$1 ;
 MOV EAX,[y]    |            ;
 MOV EBX,[x]    |            ;

exists (y=2 /\ 0:EAX=2 /\ 0:EBX=0)

In the case of x86, the insertion of xchg instructions is optimised, as illustrated by the more complex test MIXED:

% offence -technique atom MIXED.litmus # output to MIXED-A.litmus
% offence -technique atom -fencing naive  < MIXED.litmus > MIXED-AN.litmus

We get the two test MIXED-A and MIXED-AN:

X86 MIXED-A

 P0             | P1             ;
 MOV [a],$1     | MOV [b],$2     ;
 MOV [y],$1     | MOV %atom,$1   ;
 MOV %atom,$1   | XCHG [x],%atom ;
 XCHG [b],%atom | MOV EBX,[b]    ;
 MOV EAX,[y]    | MOV ECX,[a]    ;
 MOV EBX,[x]    | MOV EAX,[y]    ;

X86 MIXED-AN

 P0             | P1             ;
 MOV %atom,$1   | MOV %atom,$2   ;
 XCHG [a],%atom | XCHG [b],%atom ;
 MOV %atom,$1   | MOV %atom,$1   ;
 XCHG [y],%atom | XCHG [x],%atom ;
 MOV %atom,$1   | MOV EBX,[b]    ;
 XCHG [b],%atom | MOV ECX,[a]    ;
 MOV EAX,[y]    | MOV EAX,[y]    ;
 MOV EBX,[x]    |                ;

As we see above, optimised insertion spares 3 xchg instructions. The optimisation derives from X86-TSO semantics, which gives mfence semantics to the xchg instruction.

In the case of the Power architecture, offence uses atomic stores (STA) and loads idioms (called fetch and store and fetch and no-op (FNO) in Power documentation [1, p.719]). Those idioms are built from load-and-reserve/store conditional instruction pairs.

% offence -technique atom PPC-SB-BIS.litmus

Output is test PPC-SB-BIS-A. The STA and FNO idioms are expressed by macros, which litmus expands before running the test. Here is the test PPC-SB-BIS-A expanded. Notice that this usage of STA/FNO to restore SC is incorrect according to our semantics for Power. Nevertheless, we have experimented it and found it valid.

Finally, with option -technique lock, offence will insert conflicting pairs into critical sections guarded by standard low-level locks.

3 Usage of offence

Arguments

offence takes file names as command line arguments. Those files are either a single litmus test, when having extension .litmus, or an index file (i.e. a file that contains a list of file names), when prefixed by @. The tool offence outputs one .litmus file per test it successfully processes ¹, plus one index file @out. The names of the litmus tests output by offence depends on the technique (see below).

Command line options

The tool offence accepts the following command line options:

-version

Show version number and exit.

-v

Be verbose, can be repeated to increase verbosity.

-o <dest>

Save all offence output files into <dest>. If argument <dest> is an archive (extension .tar) or a compressed archive (extension .tgz), offence builds that archive. Otherwise, <dest> is interpreted as the name of an existing directory and output files are created in it.

-technique <fence|atom|lock>

Commands the technique used to restore SC. This is offence main control, default is fence. Techniques are:

fence: Use fences, mapping ’F’ in the paper.
atom: Use atomic instructions, mapping ’A’ in the paper.
lock: Use locks, mapping ’L’ in the paper.

The mapping name X (i.e. F, A or L) appears in the names of output litmus tests: the result of processing a test NAME have name NAME-X and goes into file NAME-X.litmus.

-alt

Use alternative technique, if any. There is one alternative to the atom technique for Power. It correspond to mapping ’P’ in the paper.

-fencing <global|naive>

Use the optimising fence insertion algorithm (global) or naive fence insertion (naive). Default is global.

1: offence can fail, noticeably when the addresses of loads and stores are not constants.

References

[1]: Power ISA Version 2.06. 2009.
[2]: J. Alglave and L. Maranget. Stability in Weak Memory Models. In CAV, 2011.
[3]: J. Alglave, L. Maranget, S. Sarkar, and P. Sewell. Fences in Weak Memory Models. In CAV, 2010.
[4]: L. Lamport. How to make a correct multiprocess program execute correctly on a multiprocessor. IEEE Trans. Comput., 46(7):779–782, 1979.
[5]: J. Lee and D.A. Padua. Hiding relaxed memory consistency with a compiler. IEEE Transactions on Computers, 50:824–833, 2001.
[6]: S. Owens, S. Sarkar, and P. Sewell. A Better x86 Memory Model: x86-TSO. In TPHOL, 2009.
[7]: Dennis Shasha and Marc Snir. Efficient and correct execution of parallel programs that share memory. ACM Trans. Program. Lang. Syst., 10(2):282–312, 1988.

This document was translated from L^AT_EX by H^EV^EA.