B-cumulativity of lwsync towards the reads of atomic pairs |
This note describes the experiments we performed so as to check that lwsync is B-cumulative towards a read issued by the lwarx of an atomic pair1.
The complete sources and logs are available as BCWW (archive).
In terms of our Power model, lwsync is not B-cumulative when inserted between two stores. This is proved for instance by the following test X00:
PPC X00 "Rfe DpAddrdR Fre LwSyncdWW Rfe DpAddrdR Fre LwSyncdWW" { 0:r2=a; 0:r5=x; 1:r2=x; 1:r4=y; 2:r2=y; 2:r5=z; 3:r2=z; 3:r4=a; } P0 | P1 | P2 | P3 ; lwz r1,0(r2) | li r1,1 | lwz r1,0(r2) | li r1,1 ; xor r3,r1,r1 | stw r1,0(r2) | xor r3,r1,r1 | stw r1,0(r2) ; lwzx r4,r3,r5 | lwsync | lwzx r4,r3,r5 | lwsync ; | li r3,1 | | li r3,1 ; | stw r3,0(r4) | | stw r3,0(r4) ; exists (0:r1=1 /\ 0:r4=0 /\ 2:r1=1 /\ 2:r4=0)
This test derives from the following cycle of candidate relaxations:
Which we may write as “BCLwSyncdWW DpAddrdRR Fre BCLwSyncdWW DpAddrdRR Fre”,
where BCLwSyncdWW stands for the sequence of LwSyncdWW
and Rfe (see diy documentation for the syntax of candidate relaxations).
As DpAddrdRR and Fre are safe,
we say that the cycle above targets BCLwSyncdWW.
Indeed, if the outcome that characterises the cycle
(here 0:r1=1 /\ 0:r4=0 /\ 2:r1=1 /\ 2:r4=0
)
shows up during experiments, BCLwSyncdWW is observed to be relaxed,
and we say that the test witnesses that fact.
From a test that witnesses BCLwSyncdWW to be relaxed, we easily build a new cycle that targets the B-cumulativity of lwsync towards the read of an atomic pair, expressed by the new, undocumented, candidate relaxation “LwSyncdWW Rf*e”: it suffices to change the sequences “LwSyncdWW Rfe” into “LwSyncdWW Rf*e”. For instance, assuming that X00 witnesses BCLwSyncdWW to be relaxed, we get the new cycle “LwSyncdWW Rf*e DpAddrdRR Fre LwSyncdWW Rf*e DpAddrdRR Fre”, from which diyone builds the the new test Y00.
PPC Y00 "Rf*e DpAddrdR Fre LwSyncdWW Rf*e DpAddrdR Fre LwSyncdWW" { 0:r2=a; 0:r5=x; 1:r2=x; 1:r4=y; 2:r2=y; 2:r5=z; 3:r2=z; 3:r4=a; } P0 | P1 | P2 | P3 ; DIY00: | li r1,1 | DIY01: | li r1,1 ; lwarx r1,r0,r2 | stw r1,0(r2) | lwarx r1,r0,r2 | stw r1,0(r2) ; xor r3,r1,r1 | lwsync | xor r3,r1,r1 | lwsync ; lwzx r4,r3,r5 | li r3,1 | lwzx r4,r3,r5 | li r3,1 ; stwcx. r1,r0,r2 | stw r3,0(r4) | stwcx. r1,r0,r2 | stw r3,0(r4) ; bne DIY00 | | bne DIY01 | ; exists (0:r1=1 /\ 0:r4=0 /\ 2:r1=1 /\ 2:r4=0)
Observe that the first load instructions of threads 0 and 2 of X00 are replaced by a lwarx (load and reserve) followed by a stwcx. (store conditional) and forming a read-modify write idiom (here a fetch and no-op).
While exploring Power machines, our automated front end dont generated many cycles that target BCLwSyncdWW. Amongst those, we select the 36 such cycles that witnessed BCLwSyncdWW to be relaxed on our 8 cores, 4-ways SMT, Power 7 machine power7, resulting in the list of cycles bcww.txt. Then we edit the list, replacing sequences “LwSyncdWW Rfe” by “LwSyncdWW Rf*e”, resulting in the new list of cycles bcww-star.txt. The tests themselves are built by diyone as follows:
% diyone < bcww.txt % diyone < bcww-star.txt
We compile the tests with litmus, configured by power7.32.cfg (i.e. we run “litmus -mach power7.32…”). Then, we run the tests several times on two machines, abducens (Power 6, 4 cores 2-ways SMT) and power7 (Power 7, 8 cores 4-ways SMT), resulting in aggregated log files A.X and P.X.
The objective of this test is twofold:
In other words, the outcome observed should show up in ’X’ tests (“Ok” in the tables below), and should not show up in ’Y’ tests (“No” in the tables below).
|
|
The experiment succeeds, since “Ok” appears in every cell of the first table, while “No” appears in every cell of the second table. For each test and machine, the tables also show the number of specified outcome collected/total number of outcome collected.
This document was translated from LATEX by HEVEA.