Hello,
We experienced recurring issues with our production real-time matches: every week or so, players were no longer able to play in real-time matches.
After checking the logs and metrics, we noted that the available game sessions were either dropping to 0 or were no longer being reported. Our auto-scaling policy is not triggered, though, since it is based on the percentage of available game sessions, which somehow still shows as healthy (see image below).
This also seems like a bug, but it is not the one I want to address in this question.
The Event tab was also flooded with crashing processes.
Our workaround is to manually increase the number of desired instances; eventually, the crashing instance is automatically killed by the system, and we are good for another week or two of peace.
Recently, we managed to find out that the issue happens because the instance runs out of disk space:
10/3/2023 7:52:15 AM [ERROR] Error caught in the beginning of Main!
System.IO.IOException: No space left on device
at System.IO.FileStream.WriteNative(ReadOnlySpan`1 source)
at System.IO.FileStream.FlushWriteBuffer()
at System.IO.FileStream.FlushInternalBuffer()
at System.IO.FileStream.Flush(Boolean flushToDisk)
at System.IO.FileStream.Flush()
at System.IO.StreamWriter.Flush(Boolean flushStream, Boolean flushEncoder)
at System.IO.StreamWriter.WriteLine(String value)
at Z3.Gameplay.Backend.ConsoleLogger.Log(String message)
at Z3.Gameplay.Backend.Program.Main(String[] args)
************************************************
We also know that the folder taking up all the space is whitewater, which we don't have access to and which is managed by GameLift, as shown in the first image.
That's why we're asking for your help in investigating why this folder ends up consuming all the space, or in understanding what we can do on our end to prevent it. Please let us know if we can provide any more information. Thank you.
Hi Shashank,
Thank you for your answer.
The build is not a debug build; it is a production one. I am afraid the processes crashed because the disk was full, and not the other way around.
Using GameLift Anywhere to try to reproduce the issue is definitely a good option to investigate, thank you for the tip.
Otherwise, I have some updates: in the build causing the disk to fill up, we were dumping logs to a separate file for each match (using the matchId as part of the file name). I must insist that the log files themselves were not the cause of the disk filling up (as you can see in the screenshot of the "du" command). But somehow, after we stopped writing to a different file per match, the problem has not reappeared. This makes me believe your intuition about leaks was right: maybe the way we were handling file writing was flawed in how we opened and closed the files. I'll try to investigate more if I have the time.
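For anyone hitting a similar symptom: the usual culprit with a file-per-match scheme is handles that are opened per match but never deterministically closed, so buffers and descriptors accumulate until writes fail. A minimal sketch of the safe pattern (here in Python for brevity; our backend is .NET, where the equivalent is wrapping the StreamWriter in a using block, and the function name write_match_log is just an illustration):

```python
import os
import tempfile

def write_match_log(log_dir, match_id, lines):
    # One file per match, keyed by matchId. The "with" block guarantees
    # the handle is flushed and closed even if a write raises, so open
    # descriptors don't accumulate as matches come and go.
    path = os.path.join(log_dir, f"match-{match_id}.log")
    with open(path, "a", encoding="utf-8") as f:
        for line in lines:
            f.write(line + "\n")
    return path

# Usage: each match appends to its own file and releases the handle.
log_dir = tempfile.mkdtemp()
p = write_match_log(log_dir, "abc123", ["session start", "session end"])
```

The key point is deterministic disposal per match rather than relying on garbage collection to eventually release the handles.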
Thank you so much for your time and suggestions, it was truly appreciated.
Stephane
Actually, I just discovered that the GameLift Server SDK build we used was a development build. Thank you for your insights, I will change that.