Introduction
#

There are two kinds of memory barriers:

Compiler memory barriers
CPU memory barriers

1. Compiler memory barriers
#

Consider the following code:

#include <stdio.h>
int main(void) {
    int a = 0;
    int b = 0;

    a = b + 1;
    b = 0;

    printf("%d %d\n", a, b);
    return 0;
}

// -O0: aarch64-linux-gnu-gcc memory_order_test.c -S -O0
    ...
    str	wzr, [sp, 28] //a=0
    str	wzr, [sp, 24] //b=0
    ldr	w0, [sp, 24]  
    add	w0, w0, 1     //a=b+1
    str	w0, [sp, 28] 
    str	wzr, [sp, 24] //b=0
    ldr	w2, [sp, 24]
    ldr	w1, [sp, 28]
    adrp	x0, .LC0
    add	x0, x0, :lo12:.LC0
    bl	printf
    ...

// -O1: aarch64-linux-gnu-gcc memory_order_test.c -S -O1
    ...
    mov	x29, sp
    mov	w2, 0 // b = 0
    mov	w1, 1 // a = 1
    adrp	x0, .LC0
    add	x0, x0, :lo12:.LC0
    bl	printf
    ...

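As the listings show, a and b are assigned in a different order at the two optimization levels: at -O0 the store to a comes before the store to b, while at -O1 the compiler folds both values into constants and materializes b before a.

This raises an interesting question: what happens if execution switches to another thread, or an interrupt fires, right at this point? For local variables like these nothing can observe the difference, but when the variables are shared with other code the result can be unexpected.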

Compiler barrier
#

References: https://preshing.com/20120625/memory-ordering-at-compile-time/

https://coffeebeforearch.github.io/2020/11/21/compiler-memory-ordering.html

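To avoid this kind of problem at compile time, use a compiler barrier to tell the compiler explicitly not to reorder memory accesses across that point: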

...
a = b + 1;
// This line tells the compiler not to move b = 0 ahead of it.
asm volatile("" ::: "memory");
b = 0;
...

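This guarantees that, in the generated code, the assignment to b comes after the assignment to a. It constrains only the compiler; it does not stop the CPU from reordering the accesses at run time.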
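
The same idea can be sketched with the barrier wrapped in a macro. The names `barrier()` and `update()` are illustrative, and the variables are made global here (an assumption; the original uses locals) so that the stores are real memory accesses the compiler cannot simply fold away. The empty asm statement with a "memory" clobber is a GCC/Clang extension.

```c
/* Compiler-only barrier (GCC/Clang extension): emits no instruction, but
 * forbids the compiler from moving memory accesses across it. */
#define barrier() __asm__ volatile("" ::: "memory")

/* Globals (assumption for this sketch), so the stores are observable. */
int a = 0;
int b = 0;

void update(void)
{
    a = b + 1;   /* load b, store a */
    barrier();   /* the store to a is emitted before the store to b */
    b = 0;
}
```

Compiling this with -O1 -S and comparing against a build without the `barrier()` line is a quick way to see what the compiler is and is not allowed to move.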

Example: a spin lock scenario
#

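In bare-metal programming we need to implement a spin_lock. On a single core this usually amounts to disabling interrupts, but in practice we frequently hit CPU exceptions such as prefetch abort and undef after returning from an interrupt.

The cause was that the spin lock was not actually taking effect: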

// Problematic code:
do_something();
//spin_lock
disable_irq();
do_something_important();

Because of compiler optimization the code order can change, and part of do_something_important() may end up executing before disable_irq().

// Correct code
do_something();
//spin_lock
disable_irq();
asm volatile("" ::: "memory");
do_something_important();
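
One way to make the barrier hard to forget is to build it into the lock and unlock helpers. The sketch below is hypothetical: `irq_lock()`, `irq_unlock()` and `increment_shared()` are illustrative names, and the interrupt mask is simulated with an ordinary variable so the example runs on a host. On real hardware `disable_irq()` and `enable_irq()` would write the CPU's interrupt mask instead, and such implementations often carry the "memory" clobber in their own inline asm. A second barrier sits before re-enabling interrupts so that protected accesses cannot sink below the unlock.

```c
#define barrier() __asm__ volatile("" ::: "memory")

/* Stand-in for the CPU interrupt mask (hypothetical host-side simulation). */
volatile int irq_masked;

static void disable_irq(void) { irq_masked = 1; }
static void enable_irq(void)  { irq_masked = 0; }

static inline void irq_lock(void)
{
    disable_irq();
    barrier();      /* protected accesses may not be hoisted above this point */
}

static inline void irq_unlock(void)
{
    barrier();      /* protected accesses may not sink below this point */
    enable_irq();
}

/* Data that would be shared with an interrupt handler. */
int shared_counter;

/* Increments the counter inside the critical section and reports whether
 * interrupts were masked while doing so. */
int increment_shared(void)
{
    irq_lock();
    int seen_masked = irq_masked;
    shared_counter++;
    irq_unlock();
    return seen_masked;
}
```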

2. CPU memory barriers
#

Consider the following code:

    ...
    a = 1; // line 1
    printf("hello1:0x%x \n", somedata1); // line 2
    printf("hello2:0x%x \n", somedata2); // line 3
    f = a; // line 4
    ...

According to Arm's official documentation (memory_systems__ordering__and_barriers; the rest of this article is a reading of that document):

At the hardware level, an ARM CPU may reorder the execution of the code above, for example to:
line 2
line 3
line 1
line 4

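Inside the CPU, reads and writes can proceed in parallel, so lines 2 and 3 may execute before lines 1 and 4. That looks like it should cause problems, so let us go through a few scenarios: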

Scenario 1: single CPU, single thread
#

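As long as line 4 is ordered after line 1, we do not care where lines 2 and 3 fall. The documentation confirms this: the CPU detects that lines 1 and 4 access the same location (a read-after-write hazard) and preserves their order: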

For accesses to the same bytes, ordering must be maintained. The processor needs to detect the read-after-write hazard and ensure that the accesses are ordered correctly for the intended outcome.

Scenario 2: single CPU, multiple threads (no DMA)
#

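This is not much of a problem either. If a thread switch happens right after line 3, while the write on line 1 has not yet been performed, the new thread runs on the same CPU, so a read of a is still subject to the same hazard rule and sees the pending write.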

Scenario 3: multiple CPUs (including DMA by a controller)
#

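This is where things go wrong. If another CPU reads a, it may see a stale value, because it has no way of knowing that a write to a is still pending.

Take the following code: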

// shared variables and their initial values
int a = 0;
bool flag = false;

// thread 1
a = 1;
flag = true;

// thread 2
while (flag == false);
printf("a=%d", a);

// may print a=0
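
In portable C, one way to rule out the a=0 outcome is a release store paired with an acquire load from C11 `<stdatomic.h>`. The sketch below assumes a is written once and flag is the only synchronization; the function names `producer()` and `consumer()` are illustrative. On AArch64 compilers typically implement these operations with store-release and load-acquire instructions or with dmb, so no hand-written barrier is needed.

```c
#include <stdatomic.h>
#include <stdbool.h>

int a = 0;                  /* payload, an ordinary variable */
atomic_bool flag = false;   /* publication flag */

/* Thread 1: write the payload, then publish it. */
void producer(void)
{
    a = 1;
    /* release: the write to a is visible to anyone who observes flag == true */
    atomic_store_explicit(&flag, true, memory_order_release);
}

/* Thread 2: wait for the flag, then read the payload. */
int consumer(void)
{
    /* acquire: the read of a below cannot be satisfied before the flag load */
    while (!atomic_load_explicit(&flag, memory_order_acquire))
        ;
    return a;   /* 1 once flag has been observed as true */
}
```

In a real program `producer()` and `consumer()` would run on two different threads; the acquire/release pair also acts as a compiler barrier, so the compiler cannot reorder the accesses either.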

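For this reason, on SMP systems memory barriers are usually required wherever threads hand data over to one another, for example in the Linux kernel spin lock code (the smp_mb() call):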

static __always_inline void arch_spin_lock(arch_spinlock_t *lock)
{
	u32 val = atomic_fetch_add(1<<16, lock);
	u16 ticket = val >> 16;

	if (ticket == (u16)val)
		return;

	/*
	 * atomic_cond_read_acquire() is RCpc, but rather than defining a
	 * custom cond_read_rcsc() here we just emit a full fence.  We only
	 * need the prior reads before subsequent writes ordering from
	 * smb_mb(), but as atomic_cond_read_acquire() just emits reads and we
	 * have no outstanding writes due to the atomic_fetch_add() the extra
	 * orderings are free.
	 */
	atomic_cond_read_acquire(lock, ticket == (u16)VAL);
	smp_mb();
}

Example: data inconsistency in an application on ARM64
#

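There are two threads, one writing data and one reading it, synchronized through a variable named flag.

Problem: the reader can see flag already changed while the data writes have not yet landed. A memory barrier is needed between the data writes and the flag write. The code below uses dsb, although a dmb is normally sufficient for ordering between cores.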


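#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>
#include <pthread.h>

static uint8_t test_array[128] = {0};
// volatile so the compiler reloads flag on every loop iteration;
// it adds no CPU-level ordering by itself
static volatile bool flag = false;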

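void* read_thread(void * parameter) {
    (void)parameter;
    while(1) {
        if(flag == true) {
            // order the flag read before the data reads below
            __asm__ volatile("dsb sy" ::: "memory");
            // read and check the data
            for(size_t i=0;i< (sizeof(test_array) - 1);i++) {
                if(test_array[i] != test_array[i+1]) {
                      // inconsistent data: report the error
                      printf("error test_array[%zu]=0x%x but test_array[%zu]=0x%x\n", i, test_array[i], i+1, test_array[i+1]);
                }
            }
            // finish the data reads before handing the buffer back
            __asm__ volatile("dsb sy" ::: "memory");
            flag = false;
        }
    }
}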

void* write_thread(void * parameter) {
    int count = 0;
    while(1) {
        if(flag == false) {
            count++;
            for(int i=0;i<sizeof(test_array);i++) {
                if((count % 2) == 0)
                    test_array[i] = 0xAA;
                else
                    test_array[i] = 0xBB;
            }
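            // make the array writes visible before flag is set; the "memory"
            // clobber also keeps the compiler from moving them past this point
            __asm__ volatile("dsb sy" ::: "memory");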

            flag = true;
        }
    }
}


int main() {
    // Two threads, one writing test_array and one reading it: can they ever disagree?
    pthread_t thread1;
    pthread_t thread2;
    if (pthread_create(&thread1, NULL, read_thread, NULL) != 0) {
        perror("pthread_create1");
        return 1;
    }
    if (pthread_create(&thread2, NULL, write_thread, NULL) != 0) {
        perror("pthread_create2");
        return 1;
    }

    // wait for the threads (both loop forever, so this blocks indefinitely)
    pthread_join(thread1, NULL);
    pthread_join(thread2, NULL);

    return 0;
}

Other references
#

1. Kernel barrier documentation
#

2. Difference from volatile
#

https://stackoverflow.com/questions/1787450/how-do-i-understand-read-memory-barriers-and-volatile
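volatile only stops the compiler from optimizing away or caching accesses to the object; it is not a CPU barrier. For example:

int *a = (int *)0x24001010;
while (*a == 0); // expected to read the memory at 0x24001010 on every iteration

Without volatile, the read of *a in the code above may be hoisted out of the loop, so the while loop does not actually fetch the value at 0x24001010 each time.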
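
For comparison, a minimal sketch of the volatile version. The helper name `wait_nonzero()` is hypothetical, and it takes a pointer instead of hard-coding 0x24001010 so that it can also be exercised on a host with an ordinary variable.

```c
/* Spin until the location pointed to by reg becomes nonzero, then return
 * the value read. The volatile qualifier forces a real load on every
 * iteration; it does not order other memory accesses around the loop. */
int wait_nonzero(volatile int *reg)
{
    int v;
    while ((v = *reg) == 0)
        ;
    return v;
}
```

On the target this would be called as `wait_nonzero((volatile int *)0x24001010)`. If data published alongside that location must be read afterwards, a CPU barrier is still needed after the loop.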
