Hi all, I'm writing a Java Spring application that uses Monte Carlo simulations to predict outcomes from a statistical model. I have previously used on-premises "nodes" to perform the simulations but am now looking to scale with AWS Lambda.
My nodes are powerful machines capable of running 50,000 sims in ~1 second. With 2 nodes I can run 100k simulations in 1 second.
A single Lambda is less powerful and manages about 100 simulations in ~1s, so if I want to perform 100,000 simulations in the same time, I should spread them across 1000 Lambda instances.
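The batch arithmetic above can be sketched in a few lines of plain Java. This is only an illustration: `BatchPlanner` and `plan` are hypothetical names and not part of the orchestrator, and the figure of 100 simulations per invocation is the rough estimate quoted above.

```java
import java.util.ArrayList;
import java.util.List;

final class BatchPlanner {
    /** Splits totalSims into batches of at most simsPerInvocation simulations each. */
    static List<Integer> plan(int totalSims, int simsPerInvocation) {
        if (totalSims < 0 || simsPerInvocation <= 0) {
            throw new IllegalArgumentException("totalSims must be >= 0 and simsPerInvocation > 0");
        }
        List<Integer> batches = new ArrayList<>();
        for (int remaining = totalSims; remaining > 0; remaining -= simsPerInvocation) {
            // The last batch may be smaller if the total does not divide evenly.
            batches.add(Math.min(remaining, simsPerInvocation));
        }
        return batches;
    }

    public static void main(String[] args) {
        // 100,000 simulations at 100 per invocation gives one batch per Lambda call.
        List<Integer> batches = plan(100_000, 100);
        System.out.println(batches.size() + " invocations");
    }
}
```

Each entry in the returned list is the number of simulations to send in one invocation, so the list size is the number of concurrent calls needed.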
See diagrams below for a better understanding:
Previous:

Proposed:

The "Orchestrator" Spring Boot app uses the AWS SDK for Java v2 (software.amazon.awssdk), and I'm trying to kick off 1000 asynchronous calls at once. At best, I'm getting around 300 concurrent calls, and the whole process takes much longer than expected (about 30s).
Average completion time is about 2s per Lambda invocation.

I've been reading various documentation to try to tune the client so that I can make these 1000 calls in parallel. I've even provided my own thread pool with a pool size of 1000, but still no luck.

    LambdaAsyncClient client = LambdaAsyncClient.builder()
            .region(Region.BLA_BLA)
            .httpClientBuilder(
                    NettyNioAsyncHttpClient
                            .builder()
                            .maxConcurrency(1000)
                            .maxPendingConnectionAcquires(1000)
                            .connectionAcquisitionTimeout(Duration.of(20000, ChronoUnit.MILLIS))
                            .connectionTimeout(Duration.of(20000, ChronoUnit.MILLIS))
                            .useNonBlockingDnsResolver(true)
            )
            .credentialsProvider(StaticCredentialsProvider.create(awsCredentials))
            .asyncConfiguration(
                    ClientAsyncConfiguration.builder()
                            .advancedOption(SdkAdvancedAsyncClientOption.FUTURE_COMPLETION_EXECUTOR, threadpool)
                            .build())
            .build();

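For context, the fan-out pattern and the way "concurrent calls" can be counted look roughly like the sketch below. It is a simplified, JDK-only illustration and not the actual orchestrator code: `FanOutSketch`, `invokeAll` and `fakeInvoke` are hypothetical names, and `fakeInvoke` is a stand-in that completes after a fixed delay where the real code would call the SDK client. Note that the counter tracks calls that have been submitted and not yet completed, which is not the same thing as open connections.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.IntFunction;
import java.util.stream.Collectors;

final class FanOutSketch {
    static final AtomicInteger inFlight = new AtomicInteger();
    static final AtomicInteger peakInFlight = new AtomicInteger();

    /** Starts n calls without waiting in between, then blocks until all have completed. */
    static <T> List<T> invokeAll(int n, IntFunction<CompletableFuture<T>> invoke) {
        List<CompletableFuture<T>> futures = new ArrayList<>(n);
        for (int i = 0; i < n; i++) {
            // Count the call as in flight before it starts and remember the highest value seen.
            peakInFlight.accumulateAndGet(inFlight.incrementAndGet(), Math::max);
            futures.add(invoke.apply(i)
                    .whenComplete((result, error) -> inFlight.decrementAndGet()));
        }
        CompletableFuture.allOf(futures.toArray(new CompletableFuture[0])).join();
        return futures.stream().map(CompletableFuture::join).collect(Collectors.toList());
    }

    /** Hypothetical stand-in for one Lambda invocation: reports 100 sims after ~200 ms. */
    static CompletableFuture<Integer> fakeInvoke(int batchIndex) {
        return CompletableFuture.supplyAsync(() -> 100,
                CompletableFuture.delayedExecutor(200, TimeUnit.MILLISECONDS));
    }

    public static void main(String[] args) {
        List<Integer> results = invokeAll(1000, FanOutSketch::fakeInvoke);
        int totalSims = results.stream().mapToInt(Integer::intValue).sum();
        System.out.println("sims=" + totalSims + " peakInFlight=" + peakInFlight.get());
    }
}
```

With the stub, nothing limits how many calls can be pending at once, so any ceiling observed with the real client would have to come from the HTTP layer, the network, or the service rather than from the fan-out loop itself.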
Could it be related to the concurrency scaling rate? The limit is stated as 1000 per 10s, but does that mean that in the first second I can only scale up to 100 Lambda instances?
Firing up Wireshark, I can see the requests being made to Amazon, and it's clear they're taking a long time to be processed. Is the AWS Java SDK blocking while it waits for the response for some reason? Or could it just be that my computer can't handle that many concurrent connections (1000 isn't that much traffic!)?
| time | src ip | dst ip | protocol | length | info |
|---|---|---|---|---|---|
| 4.007815 | 192.168.1.69 | 3.8.129.52 | TLSv1.3 | 498 | Client Hello (SNI=lambda.eu-west-2.amazonaws.com) |
| 4.007817 | 192.168.1.69 | 3.8.129.52 | TLSv1.3 | 498 | Client Hello (SNI=lambda.eu-west-2.amazonaws.com) |
| 4.007818 | 192.168.1.69 | 3.8.129.8 | TLSv1.3 | 498 | Client Hello (SNI=lambda.eu-west-2.amazonaws.com) |
| 4.007819 | 192.168.1.69 | 3.8.129.56 | TLSv1.3 | 498 | Client Hello (SNI=lambda.eu-west-2.amazonaws.com) |
| 4.00782 | 192.168.1.69 | 3.8.129.56 | TLSv1.3 | 498 | Client Hello (SNI=lambda.eu-west-2.amazonaws.com) |
| 4.029959 | 192.168.1.69 | 3.8.129.56 | TLSv1.3 | 498 | Client Hello (SNI=lambda.eu-west-2.amazonaws.com) |
| 4.029962 | 192.168.1.69 | 3.8.129.52 | TLSv1.3 | 498 | Client Hello (SNI=lambda.eu-west-2.amazonaws.com) |
| 4.035489 | 192.168.1.69 | 3.8.129.9 | TLSv1.3 | 498 | Client Hello (SNI=lambda.eu-west-2.amazonaws.com) |
| 4.035686 | 192.168.1.69 | 3.8.129.9 | TLSv1.3 | 498 | Client Hello (SNI=lambda.eu-west-2.amazonaws.com) |
| ... | ... | ... | ... | ... | ... |
| 25.010523 | 192.168.1.69 | 3.8.129.8 | TLSv1.3 | 498 | Client Hello (SNI=lambda.eu-west-2.amazonaws.com) |
| 25.010526 | 192.168.1.69 | 3.8.129.54 | TLSv1.3 | 498 | Client Hello (SNI=lambda.eu-west-2.amazonaws.com) |
| 25.013757 | 192.168.1.69 | 3.8.129.27 | TLSv1.3 | 498 | Client Hello (SNI=lambda.eu-west-2.amazonaws.com) |
| 25.023627 | 192.168.1.69 | 3.8.129.56 | TLSv1.3 | 650 | Client Hello (SNI=lambda.eu-west-2.amazonaws.com) |
| 25.023776 | 192.168.1.69 | 3.8.129.52 | TLSv1.3 | 498 | Client Hello (SNI=lambda.eu-west-2.amazonaws.com) |
| 25.045933 | 192.168.1.69 | 3.8.129.36 | TLSv1.3 | 650 | Client Hello (SNI=lambda.eu-west-2.amazonaws.com) |
| 25.109813 | 192.168.1.69 | 3.8.129.27 | TLSv1.3 | 498 | Client Hello (SNI=lambda.eu-west-2.amazonaws.com) |
| 25.120254 | 192.168.1.69 | 3.8.129.9 | TLSv1.3 | 650 | Client Hello (SNI=lambda.eu-west-2.amazonaws.com) |
| 25.14439 | 192.168.1.69 | 3.8.129.30 | TLSv1.3 | 498 | Client Hello (SNI=lambda.eu-west-2.amazonaws.com) |
Thanks!
EDIT!!
There are lots of SYN retransmissions at the start: the connection below waits ~20s for a SYN-ACK before the request is handled normally.
| number | time | src ip | dst ip | protocol | length | info |
|---|---|---|---|---|---|---|
| 40279 | 17.193426 | 192.168.1.69 | 3.8.129.55 | TCP | 78 | 49174 → 443 [SYN] Seq=0 Win=65535 Len=0 MSS=1460 WS=64 TSval=2021385807 TSecr=0 SACK_PERM |
| 67825 | 18.194271 | 192.168.1.69 | 3.8.129.55 | TCP | 78 | [TCP Retransmission] 49174 → 443 [SYN] Seq=0 Win=65535 Len=0 MSS=1460 WS=64 TSval=2021386807 TSecr=0 SACK_PERM |
| 95346 | 19.194267 | 192.168.1.69 | 3.8.129.55 | TCP | 78 | [TCP Retransmission] 49174 → 443 [SYN] Seq=0 Win=65535 Len=0 MSS=1460 WS=64 TSval=2021387808 TSecr=0 SACK_PERM |
| 122934 | 20.195233 | 192.168.1.69 | 3.8.129.55 | TCP | 78 | [TCP Retransmission] 49174 → 443 [SYN] Seq=0 Win=65535 Len=0 MSS=1460 WS=64 TSval=2021388809 TSecr=0 SACK_PERM |
| 151995 | 21.195689 | 192.168.1.69 | 3.8.129.55 | TCP | 78 | [TCP Retransmission] 49174 → 443 [SYN] Seq=0 Win=65535 Len=0 MSS=1460 WS=64 TSval=2021389809 TSecr=0 SACK_PERM |
| 181983 | 22.195368 | 192.168.1.69 | 3.8.129.55 | TCP | 78 | [TCP Retransmission] 49174 → 443 [SYN] Seq=0 Win=65535 Len=0 MSS=1460 WS=64 TSval=2021390809 TSecr=0 SACK_PERM |
| 239798 | 24.19601 | 192.168.1.69 | 3.8.129.55 | TCP | 78 | [TCP Retransmission] 49174 → 443 [SYN] Seq=0 Win=65535 Len=0 MSS=1460 WS=64 TSval=2021392810 TSecr=0 SACK_PERM |
| 345785 | 28.196351 | 192.168.1.69 | 3.8.129.55 | TCP | 78 | [TCP Retransmission] 49174 → 443 [SYN] Seq=0 Win=65535 Len=0 MSS=1460 WS=64 TSval=2021396810 TSecr=0 SACK_PERM |
| 524196 | 36.197391 | 192.168.1.69 | 3.8.129.55 | TCP | 78 | [TCP Retransmission] 49174 → 443 [SYN] Seq=0 Win=65535 Len=0 MSS=1460 WS=64 TSval=2021404811 TSecr=0 SACK_PERM |
| 524825 | 36.225522 | 3.8.129.55 | 192.168.1.69 | TCP | 74 | 443 → 49174 [SYN, ACK] Seq=0 Ack=1 Win=26847 Len=0 MSS=1452 SACK_PERM TSval=856648105 TSecr=2021404811 WS=256 |
| 524843 | 36.225861 | 192.168.1.69 | 3.8.129.55 | TCP | 66 | 49174 → 443 [ACK] Seq=1 Ack=1 Win=132480 Len=0 TSval=2021404839 TSecr=856648105 |
| 524846 | 36.226384 | 192.168.1.69 | 3.8.129.55 | TLSv1.3 | 498 | Client Hello (SNI=lambda.eu-west-2.amazonaws.com) |
| 525459 | 36.251007 | 3.8.129.55 | 192.168.1.69 | TCP | 66 | 443 → 49174 [ACK] Seq=1 Ack=433 Win=28160 Len=0 TSval=856648128 TSecr=2021404839 |
| 525460 | 36.251008 | 3.8.129.55 | 192.168.1.69 | TLSv1.3 | 1506 | Server Hello, Change Cipher Spec, Application Data |
| 525462 | 36.251009 | 3.8.129.55 | 192.168.1.69 | TCP | 1506 | 443 → 49174 [ACK] Seq=1441 Ack=433 Win=28160 Len=1440 TSval=856648128 TSecr=2021404839 [TCP segment of a reassembled PDU] |
| 525463 | 36.25101 | 3.8.129.55 | 192.168.1.69 | TCP | 1506 | 443 → 49174 [ACK] Seq=2881 Ack=433 Win=28160 Len=1440 TSval=856648128 TSecr=2021404839 [TCP segment of a reassembled PDU] |
| 525464 | 36.251011 | 3.8.129.55 | 192.168.1.69 | TLSv1.3 | 1263 | Application Data, Application Data, Application Data |
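To make the retransmission pattern in the capture easier to see, the gaps between the SYN packets can be computed from the timestamps in the table above. The sketch below is only a convenience for reading the trace; `SynGaps` and `gaps` are hypothetical names, and the numbers are copied from the `time` column for port 49174.

```java
final class SynGaps {
    /** Returns the gap in seconds between each pair of consecutive timestamps. */
    static double[] gaps(double[] times) {
        double[] result = new double[Math.max(0, times.length - 1)];
        for (int i = 1; i < times.length; i++) {
            result[i - 1] = times[i] - times[i - 1];
        }
        return result;
    }

    public static void main(String[] args) {
        // Original SYN followed by its eight retransmissions (seconds since capture start).
        double[] synTimes = {17.193426, 18.194271, 19.194267, 20.195233, 21.195689,
                             22.195368, 24.19601, 28.196351, 36.197391};
        double synAckTime = 36.225522;

        for (double gap : gaps(synTimes)) {
            System.out.printf("gap: %.3f s%n", gap);
        }
        System.out.printf("first SYN to SYN-ACK: %.3f s%n", synAckTime - synTimes[0]);
    }
}
```

The gaps come out at roughly 1, 1, 1, 1, 1, 2, 4 and 8 seconds, and the SYN-ACK arrives about 19 seconds after the first SYN, so the first eight SYNs for this connection went unanswered.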
Thanks. Yes, I had considered that as my main route for when we want to scale beyond 100k sims. I just want to prove that it's possible to run 1000 concurrently before assuming I can do the same with beefier Lambdas.