I encountered an issue when creating C++ programs that share memory. On an r5ad.4xlarge EC2 instance, MemTotal is 133,873,930,240 bytes. A large portion of this was set up as a shared memory object. When the programs transfer data through this shared memory object, a bus error (SIGBUS) occurs and the process dumps core.
I created a small C++ test program that reproduces the error. It simply reads up or down through an 80,000,000,000-byte shared memory object. Reading up from the start of the object, the SIGBUS occurs 66,936,954,880 bytes in. Reading down from the top of the object, the SIGBUS occurs after 66,936,958,976 bytes. This is completely repeatable on two different r5ad.4xlarge EC2 instances. I find it interesting that the two counts differ by exactly 4096, the size of a page, and that both are close to 1/2 of the total memory amount. This doesn't seem to be a program error. Could it be an AWS issue? A Linux kernel issue? Other thoughts?
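To make the arithmetic behind those observations explicit, here is a minimal sketch that only restates the figures quoted above. The constant and function names are made up for illustration, and the 4096-byte page size is taken from the post rather than queried from the system.

```cpp
#include <cassert>
#include <cstdint>

// Figures quoted in the post (all in bytes).
constexpr std::uint64_t kMemTotal  = 133'873'930'240ULL; // MemTotal of the instance
constexpr std::uint64_t kObjSize   =  80'000'000'000ULL; // shared memory object size
constexpr std::uint64_t kUpCount   =  66'936'954'880ULL; // bytes read before SIGBUS, incrementing
constexpr std::uint64_t kDownCount =  66'936'958'976ULL; // bytes read before SIGBUS, decrementing
constexpr std::uint64_t kPage      = 4096;               // page size assumed in the post

// Number of whole pages covered by `bytes`.
constexpr std::uint64_t pages(std::uint64_t bytes) { return bytes / kPage; }

// How far `count` falls short of exactly half of MemTotal.
constexpr std::uint64_t gap_below_half(std::uint64_t count) {
    assert(count <= kMemTotal / 2); // otherwise the subtraction would wrap
    return kMemTotal / 2 - count;
}

// The two failure counts differ by exactly one page.
static_assert(kDownCount - kUpCount == kPage);
// Both counts, and the object size, are whole multiples of the page size.
static_assert(kUpCount % kPage == 0 && kDownCount % kPage == 0);
static_assert(kObjSize % kPage == 0);
// Both counts sit a few KiB below half of MemTotal.
static_assert(gap_below_half(kUpCount) == 10'240);
static_assert(gap_below_half(kDownCount) == 6'144);
```

So in both directions the read fails on a page boundary, after roughly 16.34 million pages, and within three pages of half of MemTotal.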
Thanks,
Gene
// g++ -std=c++20 -O3 test2.cpp -W -Wall -Wextra -pedantic -pthread -o test2
#include <cstdint>
#include <cstdlib>
#include <iostream>
#include <string>
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/types.h>

int main() {
    std::uint_fast64_t mem_amt = 80000000000;
    std::cout << "mem_amt = " << mem_amt << "\n";
    std::string shmpath = "/foo";

    // Remove any existing shared memory object.
    shm_unlink(shmpath.c_str());

    // Create the shared memory object with read-write access.
    int fd = shm_open(shmpath.c_str(), O_CREAT | O_EXCL | O_RDWR, S_IRUSR | S_IWUSR);
    if (fd == -1) {
        std::cerr << "\nshm_open shmbuf failure. Exiting program.\n\n";
        return EXIT_FAILURE;
    }

    // Truncate (set) the size.
    if (ftruncate64(fd, mem_amt) == -1) {
        std::cerr << "\nftruncate shmbuf failure. Exiting program.\n\n";
        return EXIT_FAILURE;
    }

    // Map the shared memory object.
    char* pool = static_cast<char*>(
        mmap(NULL, mem_amt, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0));
    if (pool == MAP_FAILED) {
        std::cerr << "\nmmap pool failure. Exiting program.\n\n";
        return EXIT_FAILURE;
    }
    std::cout << "pool = " << reinterpret_cast<std::uintptr_t>(pool) << "\n";

    char temp = 0;
    // Incrementing variant (swap with the loop header below):
    // for (std::uint_fast64_t i = 0; i < mem_amt; i++) {
    // Decrementing variant: visits mem_amt-1 down to 0 inclusive.
    for (std::uint_fast64_t i = mem_amt; i-- > 0; ) {
        temp = pool[i];
        if (i % 5000000000 == 0) {
            std::cout << "i = " << i << "\n";
        }
    }
    std::cout << "temp = " << temp << "\n";
}
gdb output of the core files from incrementing and decrementing, respectively:
Core was generated by `./test2'.
Program terminated with signal SIGBUS, Bus error.
#0 0x00005570b7fd1373 in main () at test2.cpp:47
47 temp = pool[i];
(gdb) bt full
#0 0x00005570b7fd1373 in main () at test2.cpp:47
i = 66936954880
mem_amt = 80000000000
fd = <optimized out>
shmpath = "/foo"
pool = 0x7fa09da0e000 ""
temp = <optimized out>
(gdb)
Core was generated by `./test2'.
Program terminated with signal SIGBUS, Bus error.
#0 0x000055e242fdc379 in main () at test2.cpp:47
47 temp = pool[i];
(gdb) bt full
#0 0x000055e242fdc379 in main () at test2.cpp:47
i = 13063041023
mem_amt = 80000000000
fd = <optimized out>
shmpath = "/foo"
pool = 0x7f7366a0e000 ""
temp = <optimized out>
(gdb)
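One check that might help narrow this down, given that both counts land near half of MemTotal: compare the object size against the capacity of the filesystem backing the shared memory object. The sketch below is only a diagnostic idea, not a confirmed explanation. It assumes the object created by shm_open is backed by a filesystem mounted at /dev/shm, which is typical on Linux but not verified on these instances. The helper names are hypothetical; statvfs and its f_blocks, f_bavail, and f_frsize fields are the standard POSIX interface.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <sys/statvfs.h>

// Total capacity, in bytes, of the filesystem containing `path`.
// Returns std::nullopt if statvfs fails (for example, the path does not exist).
std::optional<std::uint64_t> fs_capacity_bytes(const char* path) {
    assert(path != nullptr);
    struct statvfs vfs{};
    if (statvfs(path, &vfs) != 0) {
        return std::nullopt;
    }
    return static_cast<std::uint64_t>(vfs.f_blocks) * vfs.f_frsize;
}

// Space currently available to an unprivileged process, in bytes.
std::optional<std::uint64_t> fs_available_bytes(const char* path) {
    assert(path != nullptr);
    struct statvfs vfs{};
    if (statvfs(path, &vfs) != 0) {
        return std::nullopt;
    }
    return static_cast<std::uint64_t>(vfs.f_bavail) * vfs.f_frsize;
}

// True if an object of `object_size` bytes would fit in the space currently
// available on the filesystem containing `path`; false if it would not fit
// or if the filesystem could not be queried.
bool object_fits(std::uint64_t object_size, const char* path) {
    const auto avail = fs_available_bytes(path);
    return avail.has_value() && object_size <= *avail;
}
```

In the test program this could be called as object_fits(mem_amt, "/dev/shm") before the mmap, printing fs_capacity_bytes("/dev/shm") alongside mem_amt. If the reported capacity turned out to be close to 66.9 GB, that would line up with the failure counts above; if it is comfortably larger than 80,000,000,000, this idea can be ruled out.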