Unable to reserve a process on a fleet generally means one of two things:
- You have no spare processes left to host game sessions. You should be able to see this in your fleet metrics (https://docs.aws.amazon.com/gamelift/latest/developerguide/monitoring-cloudwatch.html); there is an available-game-sessions metric. If you don't see available game sessions, my guess is that you are not calling TerminateGameSession / ProcessEnding to clean up game sessions. Confirm your tick loop is running as expected and that you see calls to ProcessEnding being made.
- You are out of threads/memory/CPU, so your server cannot respond in time or successfully. Check your instance metrics to confirm. Try larger instances or fewer servers per instance, and ensure you are not CPU- or memory-bound.
Otherwise you're hitting an SDK/GameLift communication bug, in which case we can get the GameLift service team to investigate, but please check points 1 and 2 first.
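To illustrate the first point, the cleanup path should look roughly like this (a sketch in Node.js with the SDK object stubbed out; `allPlayersDisconnected` and `tick` are illustrative names, not SDK calls):

```javascript
// Sketch of the cleanup flow from the first point above: when the last
// player disconnects, the server must call ProcessEnding so GameLift can
// reclaim the process slot. The SDK object is injected here so the logic
// runs standalone; in a real server it would be GameLiftServerAPI.
function allPlayersDisconnected(connectedPlayers) {
  return connectedPlayers.size === 0;
}

function tick(connectedPlayers, gameLiftApi) {
  if (allPlayersDisconnected(connectedPlayers)) {
    gameLiftApi.ProcessEnding(); // frees the process for new game sessions
    return true; // session ended on this tick
  }
  return false;
}

// Demonstration with a stub in place of the real SDK:
const calls = [];
const stubApi = { ProcessEnding: () => calls.push('ProcessEnding') };
const players = new Set(['player-1']);

tick(players, stubApi); // player still connected: no ProcessEnding call
players.delete('player-1');
tick(players, stubApi); // last player gone: ProcessEnding fires once
```

If ProcessEnding never fires on this path, the process keeps occupying a slot and the fleet eventually has nothing left to reserve.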
Hi Pip -- hope you're well,
First, in the past week we've experienced this issue on:
- May 24th at 10:30 AM PST
- May 24th at 10:36 AM PST
- May 27th at 6:09 PM PST
Now for your points, I'll attach some graphs / logs here. So:
Available game sessions
It appears we have enough available game sessions:
REMOVEDUPLOAD
Are we calling TerminateGameSession / ProcessEnding correctly?
Here are some logs from a particular process ending:
28 May 2020 17:54:00,382 [INFO] (server.js) 1646: All players disconnected. Ending game
28 May 2020 17:54:00,397 [INFO] (gamelift.js) 103: Calling GameLiftServerAPI.ProcessEnding
28 May 2020 17:54:00,713 [INFO] (gamelift.js) 108: GameLiftServerAPI.ProcessEnding succeeded with optional result: undefined
28 May 2020 17:54:00,713 [INFO] (server.js) 601: Completed process ending with: true
Should we be concerned about the undefined optional result?
GameLiftServerAPI.ProcessEnding succeeded with optional result: undefined
CPU usage
REMOVEDUPLOAD
Memory usage
The red, orange, & blue lines represent the first, second, & third instance in our fleet over time.
REMOVEDUPLOAD
Threads -- I'm not sure if/how to track this metric. Instead, here are the number of established connections, again with red, orange, & blue for the first, second, & third instance in each fleet over time:
REMOVEDUPLOAD
Based on these metrics, I don't see anything glaringly obvious? Maybe our CPU usage is a little too high? We only topped out at ~80%.
Thanks for the update. I will file an issue with the GameLift service team to take a look.
Hello Kip,
I have investigated the FleetCapacityExceededExceptions you are seeing and found that they occur because server processes begin registration with our service but do not complete it for some time. This causes server processes to look available in metrics while not actually being available for game session placement.
From the logs, the cause is that the on-box process that registers server processes is starved for threads and doesn't have the resources to complete process registration. Looking at EC2 metrics, I also see that some instances in the fleet hit 100% CPU utilization at times. These FleetCapacityExceededExceptions can likely be resolved by reducing the number of concurrent server processes.
I'll work to make the AvailableGameSessions metric more accurate. In the meantime, reducing the number of concurrent server processes should resolve the issue you're seeing.
Looking at the debug messages, they're pretty much all related to log upload. Looking closer, it appears the same files are being uploaded many times. The need to copy/compress/upload all of these files is why you're seeing large log volumes as well as high CPU.
It looks like when one process terminates, aux proxy kicks off log uploads for all processes, not just the one that terminated. My guess is that when you called ProcessReady you passed in /local/game/logs instead of (or possibly in addition to) /local/game/logs/<pid>. You need to ensure that you only specify the process-specific log directory to be pulled. If you correct that, I believe your problem should go away.
To clarify, I believe you are specifying the per-pid directory correctly in ProcessReady. The issue is that you also have the parent directory specified on the fleet itself. If you remove the parent directory from the fleet configuration, I think your issue will go away.
Hi @REDACTEDUSER
Just to clarify, we're using the GameLift Realtime Servers product. I'm a bit confused by your comment, I don't think this is something we control on our end?
Taking a look at the application logs:
[gl-user-remote@ip-10-22-80-142 ~]$ cd /local/game/logs
[gl-user-remote@ip-10-22-80-142 logs]$ grep -R "Realtime server started" | head -n 5
2952/server.log.2020-06-01-02:01 Jun 2020 02:19:45,742 [INFO] (gamelift.js) 223: Realtime server started! Calling GameLiftServerAPI.ProcessReady with processParameters: {"Port":1908,"LogParameters":{"LogPaths":["/local/game/logs/2952"]}}
3422/server.log.2020-06-01-02:01 Jun 2020 02:19:46,531 [INFO] (gamelift.js) 223: Realtime server started! Calling GameLiftServerAPI.ProcessReady with processParameters: {"Port":1922,"LogParameters":{"LogPaths":["/local/game/logs/3422"]}}
2875/server.log.2020-06-01-02:01 Jun 2020 02:19:46,197 [INFO] (gamelift.js) 223: Realtime server started! Calling GameLiftServerAPI.ProcessReady with processParameters: {"Port":1916,"LogParameters":{"LogPaths":["/local/game/logs/2875"]}}
3377/server.log.2020-06-01-02:01 Jun 2020 02:19:47,142 [INFO] (gamelift.js) 223: Realtime server started! Calling GameLiftServerAPI.ProcessReady with processParameters: {"Port":1938,"LogParameters":{"LogPaths":["/local/game/logs/3377"]}}
2961/server.log.2020-06-01-02:01 Jun 2020 02:19:46,793 [INFO] (gamelift.js) 223: Realtime server started! Calling GameLiftServerAPI.ProcessReady with processParameters: {"Port":1931,"LogParameters":{"LogPaths":["/local/game/logs/2961"]}}
Assuming LogPaths is the parameter we're talking about, they do seem to be specified on a per-pid basis.
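For reference, per-pid paths like the ones in those log lines could be produced by something along these lines (a sketch; `buildProcessParameters` is a hypothetical helper, but the object shape matches the processParameters in the logs above):

```javascript
// Build ProcessReady parameters whose log path is scoped to this process's
// pid, matching the shape in the log lines above:
// {"Port":...,"LogParameters":{"LogPaths":["/local/game/logs/<pid>"]}}
function buildProcessParameters(port, pid) {
  return {
    Port: port,
    LogParameters: {
      // Per-pid directory so only this process's logs get uploaded.
      LogPaths: [`/local/game/logs/${pid}`],
    },
  };
}

const params = buildProcessParameters(1908, process.pid);
console.log(JSON.stringify(params));
```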
There's a log paths parameter attached at the fleet level. This is combined with the paths that you register for each process. It looks like you can't update this -- it is only specified when you create the fleet. How did you create your RTS fleet? via the console, CLI, or something else? If CLI, do you have the command that you used (and also where you got the command from). I am wondering if we might have bad/misleading instructions out there somewhere, but I was unable to find them looking around a bit.
@REDACTEDUSER
https://docs.aws.amazon.com/cli/latest/reference/gamelift/create-fleet.html
--log-paths (list) This parameter is no longer used. Instead, to specify where Amazon GameLift should store log files once a server process shuts down, use the Amazon GameLift server API ProcessReady() and specify one or more directory paths in logParameters . See more information in the Server API Reference .
Edit: Looks like you're right about the LogPaths parameter, though:
$ aws gamelift describe-fleet-******
{
    "FleetAttributes": [
        {
            ...
            "LogPaths": [
                "/local/game/logs"
            ],
            ...
        }
    ]
}
So we should try creating a new fleet via the CLI and specifying an empty list for the --log-paths parameter? Seems like it's still a bug that this is the default value?
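For the record, the command we'd try would look something like this (a sketch; the script id, instance type, and runtime configuration below are placeholders for our real settings, and it assumes the CLI accepts a JSON empty list for --log-paths):

```shell
# Recreate the fleet, explicitly passing an empty list for --log-paths so
# the fleet-level LogPaths attribute is never populated. All ids and the
# runtime configuration here are placeholders, not our real values.
aws gamelift create-fleet \
  --name "realtime-fleet-no-logpaths" \
  --script-id "script-REPLACE_ME" \
  --ec2-instance-type "c5.large" \
  --log-paths "[]" \
  --runtime-configuration "ServerProcesses=[{LaunchPath=/local/game/server.js,ConcurrentExecutions=50}]"
```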
Yeah, that's what it sounds like to me. I'll see if I can track down where this log path is being assigned. If nothing else, it's definitely doing the wrong thing when customers run more than one server per host.
Let us know if creating via the CLI allows you to work around this issue.
Hi @REDACTEDUSER
Thanks for taking a look at this! We'll reduce the # of concurrent server processes and see how that fares.
About CPU usage: looking at one of our fleet instances, I notice that much of the CPU goes to a java auxproxy process. I assume this is the "...on-box process that registers server-processes...". At this particular moment, it doesn't seem to dip below 50% CPU usage and very often sits between 75% and 100% CPU usage. This is the fleet in question:
Region: us-east-1 FleetId: fleet-******
Working with GameLift, we were looking to find a way to specify a log level for the base server process. We ran across this todo:
// TODO (GLIFT-10345): allow developers to specify log level in --devargs
We find the current default info level very verbose for our production environment. We'd ideally like to bump this up to at least warn.
So, without any knowledge of what the auxproxy process does or how it works, I'm curious how much CPU time is spent on logging. Taking a look at one instance in the fleet above:
# pwd
/local/whitewater/Logs
# ls -lSh
total 1.6G
-rw-r--r-- 1 gl-auxproxy gl-auxproxy 945M Jun 2 04:51 auxproxy.log
-rw-r--r-- 1 gl-auxproxy gl-auxproxy 123M Jun 2 04:51 auxproxy-metrics.log
# cat auxproxy.log | grep "\[DEBUG\]" | wc -l | xargs printf "%'d\n"
3,932,761
# cat auxproxy.log | grep "\[INFO\]" | wc -l | xargs printf "%'d\n"
690,643
This fleet has been up for ~26 hours now, and auxproxy has produced almost 4 million lines of debug logs. Between the logs auxproxy generates itself and the verbose application-level logs it ingests, processes, and uploads -- does this seem correct/normal?
Once again, I don't know much about this process, so please let me know if I'm missing something or whether this is completely normal. Otherwise, maybe a higher log level for our production environment might help us out?
P.S. We really would love a supported way to specify the log level (GLIFT-10345) 😄
Cheers
@REDACTEDUSER
Happy to say that starting a fleet via the CLI with an empty --log-paths attribute worked a charm. Thanks for your help debugging this issue, and also thanks to @REDACTEDUSER
We're back up to 50 concurrent processes per instance and our CPU usage is looking much better. Here's a graph over the past week. The blue line on the right is the new instance in the fleet without that pesky LogPaths attribute; the other lines are the various instances we ran over the week:
REMOVEDUPLOAD
Cheers
The underlying issue which caused the log paths to be incorrect for multiple servers on a host should be fixed. Please let us know if you experience any issues!