AWS Glue PySpark, AWS OpenSearch and 429 Too Many Requests


Hi forum,

I'm on AWS and trying to write ~1.2 million documents from an AWS Glue 2.0 Python/PySpark job to an OpenSearch 1.2 "t3.small.search"/SSD cluster.

The issue I'm facing is that after a while the job fails with "429 Too Many Requests":

org.elasticsearch.hadoop.rest.EsHadoopInvalidRequest: [PUT] on [MY/doc/_bulk] failed; server[https://SOME_ENDPOINT_HERE] returned [429|Too Many Requests:]

From what I understand and have read so far, this is mostly a matter of configuration: throttling indexing requests on the client side gives the server more time to process queued requests. That's what I tried, but somehow the configuration of the Hadoop connector does not work for me.

I already tried sending smaller batches of documents to Elasticsearch and increasing the retry wait time: I set 'es.batch.size.entries' to 100 and 'es.batch.write.retry.wait' to 30s:

df \
    .write \
    .mode('overwrite') \
    .format('org.elasticsearch.spark.sql') \
    .option('es.nodes', 'SOME_ENDPOINT_HERE') \
    .option('es.port', 443) \
    .option('es.net.ssl', 'true') \
    .option('es.net.http.auth.user', 'SOME_USER_NAME_HERE') \
    .option('es.net.http.auth.pass', 'SOME_PASS_HERE') \
    .option('es.nodes.wan.only', 'true') \
    .option('es.nodes.discovery', 'false') \
    .option('es.resource', 'SOME_NAME_HERE') \
    .option('es.index.auto.create', 'true') \
    .option('es.mapping.id', 'SOME_FIELD_HERE') \
    .option('es.write.operation', 'index') \
    .option('es.batch.size.entries', '100') \
    .option('es.batch.write.retry.policy', 'simple') \
    .option('es.batch.write.retry.count', '-1') \
    .option('es.batch.write.retry.limit', '-1') \
    .option('es.batch.write.retry.wait', '30s') \
    .save()

I already set the 'org.elasticsearch.hadoop.rest' logger to DEBUG level:

Bulk Flush #[12653715211658247214404]: Sending batch of [34000] bytes/[1000] entries
Bulk Flush #[12653715211658247214404]: Response received
Bulk Flush #[12653715211658247214404]: Completed. [1000] Original Entries. [1] Attempts. [1000/1000] Docs Sent. [0/1000] Docs Skipped. [0/1000] Docs Aborted.
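For reference, one way to set that logger to DEBUG from a PySpark job is through the driver JVM's log4j (a sketch only, assuming Spark 2.4 with log4j 1.x and an existing SparkContext 'sc'; '_jvm' is a private PySpark API):

# Sketch: raise the es-hadoop REST logger to DEBUG via the driver's log4j.
log4j = sc._jvm.org.apache.log4j
log4j.LogManager \
    .getLogger('org.elasticsearch.hadoop.rest') \
    .setLevel(log4j.Level.DEBUG)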

From what I understand, the Hadoop connector is sending batches of 1000 documents, not the 100 from my config. Furthermore, I cannot see any wait time.

My actual setup on AWS is:

Spark: 2.4.3
Python: 3.7
OpenSearch: 1.2
Elasticsearch Hadoop: 7.13.4 (elasticsearch-spark-20_2.11-7.13.4.jar)

Any hints or ideas on my setup?

Many Thanks, Matthias

matze79
Asked 2 years ago · 1,711 views
1 Answer

I'd rather increase the batch size to reduce the overall number of requests to OpenSearch. You may also want to increase the refresh interval: https://aws.amazon.com/ru/premiumsupport/knowledge-center/opensearch-indexing-performance/ On the other point, a t3.small cluster is really small, so you might need to use a different instance type.
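A refresh-interval change along those lines could look roughly like this (a sketch only; endpoint, index name, credentials and the 60s value are placeholders, and you would reset the interval once the bulk load is done):

import requests

# Sketch: raise the index refresh interval for the duration of the bulk load
# so OpenSearch refreshes segments less often; reset it (e.g. to '1s') afterwards.
endpoint = 'https://SOME_ENDPOINT_HERE'
auth = ('SOME_USER_NAME_HERE', 'SOME_PASS_HERE')

resp = requests.put(
    endpoint + '/SOME_NAME_HERE/_settings',
    json={'index': {'refresh_interval': '60s'}},
    auth=auth,
)
resp.raise_for_status()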

AWS
Alex_T
Answered 2 years ago
  • Hi, thanks for your quick reply @Alex_T.

    I updated the cluster instances and the indexing of the 1.2 million records then completed successfully with AWS Glue in 7 minutes. :) So the hint about upscaling the cluster instances was a useful one.

    For those who are interested: I also asked the fine people at Elastic about this:

    https://discuss.elastic.co/t/aws-es-hadoop-and-429/310124

    The folks over there mentioned that the option "es.batch.size.entries" is not respected in all circumstances. In my use case, for example, I had enabled PySpark's overwrite mode in AWS Glue:

    df.write.mode('overwrite')...

    With overwrite mode, the index is first emptied before the new documents are indexed. It turns out there is no config option in the elasticsearch-hadoop module for that initial delete, so I always saw "1000 records" in my logs. Maybe this will help someone later.
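    For illustration only (untested on my side): one could sidestep that non-configurable delete phase by dropping the index separately and then writing with 'append' mode, so the configured batch settings apply to the actual indexing:

    import requests

    # Sketch (untested): delete the index up front so es-hadoop's overwrite
    # delete phase never runs; 'es.index.auto.create' will recreate it.
    requests.delete('https://SOME_ENDPOINT_HERE/SOME_NAME_HERE',
                    auth=('SOME_USER_NAME_HERE', 'SOME_PASS_HERE'))

    # Then write with 'append' (plus the same connection options as above)
    # so 'es.batch.size.entries' takes effect for the indexing itself.
    df.write \
        .mode('append') \
        .format('org.elasticsearch.spark.sql') \
        .option('es.resource', 'SOME_NAME_HERE') \
        .option('es.batch.size.entries', '100') \
        .save()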

    Thanks again.

    Best, Matthias
