Timing experiments

In this note we describe the experiments we performed to time synchronisation constructs.

The complete sources and logs are available: X86-SPEED (archive) for x86 and PPC-SPEED (archive) for Power.

Contents

  1  Generating tests
  2  Running the tests
  3  Measures
    3.1  x86
    3.2  Power

1  Generating tests

We build a series of 8 tests, named X01 to X08, involving from 1 to 8 threads. The code of each thread consists of a write to a location, say x_i, and a read from a location, say x_{i+1} (the read of the last thread being from x_1). In practice we generate the tests with diyone. For instance, X03 for x86 is produced by:

% diyone -arch X86 -name X03 PodWR Fre PodWR Fre PodWR Fre

While X03 for Power is produced by:

% diyone -arch PPC -name X03 PodWR Fre PodWR Fre PodWR Fre
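
For reference, the code of the three threads of X03 presumably has the following shape (a sketch obtained by removing the MFENCE instructions from the F03 code shown in Section 2 below; the location and register names are those appearing in that log):

 P0          | P1          | P2          ;
 MOV [z],$1  | MOV [x],$1  | MOV [y],$1  ;
 MOV EAX,[x] | MOV EAX,[y] | MOV EAX,[z] ;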

Note that the first test of a batch (X01 for x86 and X01 for Power) is written by hand, as it cannot be induced by a cycle of candidate relaxations. We then build several series of synchronised tests, either with offence or by hand.

For x86, from the batch ’X’ we produce three batches:

  - a batch ’F’, where an mfence fence is inserted between the write and the read of each thread;
  - a batch ’A’, where synchronisation is achieved with the atomic exchange instruction xchg;
  - a batch ’L’, where the accesses are protected by locks.

Notice that, in the case of locks, there are two LOCK/UNLOCK sequences per thread, to different lock variables. The alternative of one LOCK/UNLOCK sequence per thread would result in less concurrent execution and has not been tested.
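
To make the lock structure concrete, a plausible shape for one thread of an ’L’ test is sketched below. This is pseudocode only, and it assumes that lock l_i is the lock protecting location x_i, so that each thread takes one lock around its write and a different lock around its read; the code actually generated may place the locks differently:

 (* thread i of an ’L’ test: two critical sections, on two assumed locks l_i and l_{i+1} *)
 LOCK(l_i)     ; write x_i     ; UNLOCK(l_i)     ;
 LOCK(l_{i+1}) ; read  x_{i+1} ; UNLOCK(l_{i+1}) ;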

For Power, from the batch ’X’ we produce four batches:

  - a batch ’F’, where a sync fence is inserted between the write and the read of each thread;
  - a batch ’W’, where the fence is lwsync instead (see note 1 below);
  - a batch ’A’, where synchronisation is achieved with atomic pairs (sta/fno);
  - a batch ’L’, where the accesses are protected by locks.

2  Running the tests

In the previous (soundness) experiment, the test harness of litmus accounted for a significant part of the running times. In this experiment, we minimise the impact of the harness code by:

  1. running test code only once, with litmus options -a 1 -s 1 -r 1;
  2. and inserting the code of each thread of tests in a loop of size 100 · 10^6, with litmus option -loop 100M.

We perform these settings with litmus configuration file speed.cfg.
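
For a single test, the same settings can presumably be applied directly on the command line; for instance, assuming the generated test file is named X03.litmus:

% litmus -a 1 -s 1 -r 1 -loop 100M X03.litmus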

The altered test code can be seen in the litmus logs; for instance, here is the code actually executed by the first thread of F03:

...
 P0          | P1          | P2          ;
 MOV [z],$1  | MOV [x],$1  | MOV [y],$1  ;
 MFENCE      | MFENCE      | MFENCE      ;
 MOV EAX,[x] | MOV EAX,[y] | MOV EAX,[z] ;
...
 _litmus_P0_0_: cmpl $0,%edx
 _litmus_P0_1_: jmp Lit__L1
 _litmus_P0_2_: Lit__L0:
 _litmus_P0_3_: movl $1,(%r8)
 _litmus_P0_4_: mfence
 _litmus_P0_5_: movl (%r9),%eax
 _litmus_P0_6_: decl %edx
 _litmus_P0_7_: Lit__L1:
 _litmus_P0_8_: jg Lit__L0
...

Here, %edx is the loop counter: it is decremented at each iteration, and the final jg branches back to Lit__L0 while it remains positive. We run all test batches at least five times on our 8-core machines: chianti for x86 and power7 for Power.

3  Measures

3.1  x86

We get the log files C.00 to C.04, from which we extract the following timings (times are wall-clock times of a test run, in seconds):

        C.00    C.01    C.02    C.03    C.04
 X01    0.08    0.08    0.08    0.08    0.08
 X02    0.39    0.38    0.39    0.39    0.39
 X03    0.29    0.29    0.29    0.29    0.29
 X04    0.47    0.47    0.47    0.47    0.47
 X05    0.42    0.42    0.40    0.40    0.40
 X06    0.52    0.52    0.52    0.53    0.52
 X07    0.48    0.49    0.48    0.48    0.47
 X08    0.60    0.60    0.60    0.60    0.59

        C.00    C.01    C.02    C.03    C.04
 F01    1.29    1.30    1.30    1.30    1.29
 F02    5.76    5.75    5.72    5.81    5.72
 F03   15.28   15.02   15.18   14.55   17.48
 F04   18.24   18.10   14.40   14.55   14.60
 F05   17.59   17.53   17.87   17.42   17.44
 F06   17.00   16.53   16.56   14.91   17.04
 F07   16.88   15.57   15.50   17.00   16.90
 F08   19.15   17.69   16.03   15.86   16.56

        C.00    C.01    C.02    C.03    C.04
 A01    0.72    0.72    0.73    0.72    0.72
 A02    5.18    5.18    5.08    5.14    5.24
 A03   11.65   11.56   11.51   12.01   10.81
 A04   16.99   12.17   13.27   11.53   14.26
 A05   18.28   18.59   18.51   18.05   18.64
 A06   13.78   16.42   14.38   13.69   13.16
 A07   18.30   18.53   16.84   18.27   16.02
 A08   12.51   12.43   12.22   17.88   12.29

        C.00    C.01    C.02    C.03    C.04
 L01    1.85    1.85    1.86    1.85    1.86
 L02   60.99   60.00   57.57   60.20   60.74
 L03   69.28   68.39   68.43   69.89   67.88
 L04   71.31   83.92   83.76   85.05   80.58
 L05   88.06   88.27   87.24   88.09   86.55
 L06   96.73  106.32   96.39  101.97   94.78
 L07  108.03   98.52   97.45   98.99   99.07
 L08   93.50   98.42   89.53   90.48   88.08

To approximate the time taken by one synchronisation construct, we first select a value for each test, taking the median of the five measures performed. Then we subtract the value found for a given ’X’ test from the corresponding ’F’, ’A’ or ’L’ values and divide the result by the iteration count (10^8). The final numbers are reasonable approximations of synchronisation costs. We plot them below, both including and excluding the plot for the mapping ’L’:

[Plots of the approximate synchronisation costs on x86, including and excluding the ’L’ mapping.]
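
As a worked example of this computation, consider F03: the median of its five measures is 15.18 s, while the median for X03 is 0.29 s. The approximate cost of one mfence in this 3-thread configuration is therefore (15.18 - 0.29)/10^8 ≈ 1.5 · 10^-7 s, i.e. roughly 150 ns.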

Such time measures are to be treated with caution, due to the non-determinism of the tests themselves, to interventions of the system scheduler, etc. However, we argue that we can draw a few conclusions from them:

  1. Locks are much more expensive than fences and atomic exchange.
  2. While fast in isolation (nproc=1), fences and atomic exchange become more expensive, by a factor of at least ten, when communication through shared memory actually occurs.
  3. mfence and xchg incur similar time penalties.

3.2  Power

We get the log files P.00 to P.04 and P.10 to P.14, from which we extract the following timings (times are wall-clock times, in seconds):

        P.00    P.01    P.02    P.03    P.04
 X01    0.14    0.13    0.13    0.13    0.13
 X02    0.57    0.33    0.57    0.61    0.58
 X03    0.17    0.23    0.17    0.17    0.17
 X04    0.24    0.34    0.33    0.32    0.19
 X05    0.29    0.43    0.42    0.36    0.31
 X06    0.88    0.76    0.92    0.78    0.67
 X07    0.37    0.45    0.38    0.37    0.40
 X08    0.36    0.39    0.37    0.41    0.47

        P.00    P.01    P.02    P.03    P.04
 F01    2.05    2.05    2.05    2.05    2.05
 F02    7.99    8.25    8.95    8.95    8.95
 F03   16.79   15.61   16.21   16.22   16.63
 F04   27.34   27.18   26.91   26.91   26.70
 F05   38.67   40.12   39.92   39.92   40.00
 F06   54.55   55.49   55.25   54.65   55.75
 F07   71.51   67.92   68.20   67.52   68.64
 F08   66.35   64.25   64.48   64.09   64.33

        P.00    P.01    P.02    P.03    P.04
 A01    4.74    5.25    5.24    5.68    5.25
 A02   15.08   15.08   13.21   13.20   13.22
 A03   25.08   25.67   25.06   25.25   25.67
 A04   41.31   40.60   40.61   41.49   40.92
 A05   58.27   55.48   56.49   57.50   60.78
 A06   84.57   84.24   84.35   84.17   84.20
 A07   99.29   99.11   99.95   97.40   98.05
 A08  100.71   98.10   99.04   95.32   99.82

        P.00    P.01    P.02    P.03    P.04
 L01    9.26    9.80    9.80    9.80    9.26
 L02   36.66   38.35   38.50   35.33   37.05
 L03   89.80   90.58   83.56   89.62   86.43
 L04   82.83   83.14   82.43   82.49   82.47
 L05  114.12  106.56  107.68  118.98  106.05
 L06  184.04  181.77  181.49  187.03  183.95
 L07  207.62  207.85  205.07  206.18  210.64
 L08  263.62  272.39  266.76  253.83  270.28

Times for batch ’W’ (lwsync) are from different runs:

        P.10    P.11    P.12    P.13    P.14
 W01    0.80    0.82    0.80    0.80    0.89
 W02    2.94    2.95    3.21    3.30    3.16
 W03    7.76    7.14    7.84    7.37    7.85
 W04   14.20   14.33   13.62   14.14   14.48
 W05   20.87   21.73   22.02   22.37   22.21
 W06   31.26   30.56   30.40   30.34   29.62
 W07   28.35   29.00   31.13   31.02   28.14
 W08   22.50   30.37   29.26   28.15   31.38

We compute approximations of synchronisation cost as we did for x86:

[Plots of the approximate synchronisation costs on Power.]
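
As a worked example on the Power side, consider the 3-thread tests: the medians are 0.17 s for X03, 16.22 s for F03 and 7.76 s for W03, giving approximate costs of (16.22 - 0.17)/10^8 ≈ 160 ns per sync and (7.76 - 0.17)/10^8 ≈ 76 ns per lwsync, consistent with lwsync being the cheaper fence.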

Again, our numbers are to be treated with caution. However, we can draw a few conclusions from them:

  1. Locks are more expensive than fences and atomic pairs.
  2. While fast in isolation (nproc=1), fences and atomic pairs become more expensive, by a factor of at least ten, when communication through shared memory actually occurs.
  3. lwsync, sync and sta/fno incur increasing time penalties, both from lwsync to sta/fno and as the number of processors involved grows.

Complete sources and logs: X86-SPEED (archive) for x86 and PPC-SPEED (archive) for Power.


Note 1: A lwsync fence between a store and a load is useless. However, this is irrelevant to our purpose of measuring its cost, as we checked with similar examples on store-store pairs.
