Bus Error - Large shared memory space (Ubuntu 22.04.3, Instance type: r5ad.4xlarge)

Question

I encountered an issue when creating C++ programs that share memory space. On the r5ad.4xlarge EC2 instance, MemTotal is 133,873,930,240 bytes. A large portion of this was set up as a shared memory object. When the programs transfer data through this shared memory object, a Bus Error core dump occurs.

I created a small C++ test program that reproduces the error. It simply reads up or down through an 80,000,000,000-byte shared memory object. Reading up from the start of the object, the SIGBUS occurs 66,936,954,880 bytes in. Reading down from the top of the object, the SIGBUS occurs after 66,936,958,976 bytes. This is completely repeatable on two different r5ad.4xlarge EC2 instances. I find it interesting that the increment count to failure and the decrement count to failure differ by 4096, the size of a page. It's also interesting that both failure counts are close to half of the total memory amount. This doesn't seem to be a program error. Could it be an AWS issue? A Linux kernel issue? Other thoughts?

Thanks,

Gene

```
// g++ -std=c++20 -O3 test2.cpp -W -Wall -Wextra -pedantic -pthread -o test2

#include <cstdint>     // uint_fast64_t
#include <cstdlib>     // exit, EXIT_FAILURE
#include <iostream>
#include <string>
#include <fcntl.h>     // O_CREAT, O_EXCL, O_RDWR
#include <sys/mman.h>  // shm_open, shm_unlink, mmap
#include <sys/stat.h>  // S_IRUSR, S_IWUSR
#include <unistd.h>    // ftruncate64

int main() {

    uint_fast64_t mem_amt = 80000000000;
    std::cout << "mem_amt = " << mem_amt << "\n";

    int fd;
    std::string shmpath = "/foo";

    // Remove any existing shared memory object
    shm_unlink(shmpath.c_str());
    // Create the shared memory object with read-write access.
    fd = shm_open(shmpath.c_str(), O_CREAT | O_EXCL | O_RDWR, S_IRUSR | S_IWUSR);

    if (fd == -1) {
        std::cerr << "\nshm_open shmbuf failure. Exiting program.\n\n";
        exit(EXIT_FAILURE);
    }

    // Truncate (set) the size.
    if (ftruncate64(fd, mem_amt) == -1) {
        std::cerr << "\nftruncate shmbuf failure. Exiting program.\n\n";
        exit(EXIT_FAILURE);
    }

    // Map the shared memory object.
    char* pool = (char*)mmap(NULL, mem_amt, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (pool == MAP_FAILED) {
        std::cerr << "\nmmap pool failure. Exiting program.\n\n";
        exit(EXIT_FAILURE);
    }

    std::cout << "pool = " << (uint_fast64_t)pool << "\n";

    char temp;
    // Swap the commented-out loop header in to read up from the start
    // instead of down from the top.
//    for (uint_fast64_t i=0; i<mem_amt; i++) {
    for (uint_fast64_t i=mem_amt-1; i>0; i--) {
        temp = pool[i];
        if (i % 5000000000 == 0) {
            std::cout << "i = " << i << "\n";
        }
    }
    std::cout << "temp = " << temp << "\n";
}
```

gdb output of the core files from incrementing and decrementing, respectively:

```
Core was generated by `./test2'.
Program terminated with signal SIGBUS, Bus error.
#0  0x00005570b7fd1373 in main () at test2.cpp:47
47	        temp = pool[i];
(gdb) bt full
#0  0x00005570b7fd1373 in main () at test2.cpp:47
        i = 66936954880
        mem_amt = 80000000000
        fd = 
        shmpath = "/foo"
        pool = 0x7fa09da0e000 ""
        temp = 
(gdb)

Core was generated by `./test2'.
Program terminated with signal SIGBUS, Bus error.
#0  0x000055e242fdc379 in main () at test2.cpp:47
47	        temp = pool[i];
(gdb) bt full
#0  0x000055e242fdc379 in main () at test2.cpp:47
        i = 13063041023
        mem_amt = 80000000000
        fd = 
        shmpath = "/foo"
        pool = 0x7f7366a0e000 ""
        temp = 
(gdb)
```

Answer

Hello gene_weber,

The difference between the increment count to failure and the decrement count to failure is 4096 bytes, the size of a page. To determine whether this is a program, kernel, or AWS issue, please run the same code on Ubuntu 20 and Amazon Linux 2 and observe the behavior. If you see the same pattern, it might be related to the kernel, and we can take action accordingly.
