@RobertoH,
if you are reading from a relational database, you can push a query down to the database using the connection option sampleQuery, as described here.
Hope this helps,
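In case a concrete shape helps: a minimal sketch of what that might look like, reusing the placeholder connection details from the thread. The `create_dynamic_frame.from_options` call is commented out because it only runs inside a Glue job where `glueContext` exists.

```python
# Sketch of a pushed-down read via the Glue JDBC "sampleQuery" option.
# Connection details are placeholders from the thread; the WHERE clause
# in sampleQuery is executed by the database, not by Spark.
connection_options = {
    "url": "jdbc:postgresql://ip:5432/db",
    "user": "xxx",
    "password": "xxx",
    "sampleQuery": "SELECT * FROM pushed_checkpoints WHERE pushed_at > '2022-12-01'",
}

# Inside a Glue job (with a glueContext available) this would be:
# dyf = glueContext.create_dynamic_frame.from_options(
#     connection_type="postgresql",
#     connection_options=connection_options,
# )
```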
The DataFrame returned by toDF() has a show method that displays only a limited number of rows. You can use that if you want to see a subset of the data. https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-extensions-dynamic-frame.html
Hi. I have a table with more than 300 million rows, and I only need to retrieve records after a given date, but if I use toDF() it tries to fetch all the records.
There's also a filter method mentioned in the documentation that you can use instead of count. Try that. https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-extensions-dynamic-frame.html#aws-glue-api-crawler-pyspark-extensions-dynamic-frame-filter
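One caveat worth noting: filter runs in Spark after the rows have been read, so it does not reduce what is pulled from the database. A sketch of the call and its predicate, where `dyf` is a hypothetical DynamicFrame:

```python
# DynamicFrame.filter applies a per-record predicate in Spark (client side),
# after the data has already been read from the source:
# filtered = dyf.filter(f=lambda row: row["pushed_at"] > "2022-12-01")

# The predicate itself is plain Python; demonstrated here on stand-in records:
rows = [{"pushed_at": "2022-11-15"}, {"pushed_at": "2022-12-05"}]
kept = [r for r in rows if r["pushed_at"] > "2022-12-01"]  # drops the earlier row
```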
Thanks jschwar313, but that doesn't apply the filter at the database server level. I'm starting to think it isn't really possible.
Thanks. I tried it, but it didn't work. It still issues a SELECT * FROM the table.
```python
DataSource0 = glueContext.create_dynamic_frame.from_options(
    connection_type="postgresql",
    connection_options={
        "url": "jdbc:postgresql://ip:5432/db",
        "user": "xxx",
        "password": "xxx",
        "dbtable": "pushed_checkpoints",
        "query": "SELECT * FROM pushed_checkpoints where pushed_at>'2022-12-01'",
    },
)
```
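For what it's worth, plain Spark JDBC reads accept a parenthesized subquery with an alias in place of a table name for dbtable. Whether Glue's postgresql connector passes that option through unchanged is an assumption on my part, but it may be worth trying alongside sampleQuery:

```python
# Assumption: Glue forwards "dbtable" to the Spark JDBC reader, which
# accepts a derived table "(SELECT ...) AS alias" instead of a table name,
# so the WHERE clause would run on the database side.
connection_options = {
    "url": "jdbc:postgresql://ip:5432/db",
    "user": "xxx",
    "password": "xxx",
    "dbtable": "(SELECT * FROM pushed_checkpoints WHERE pushed_at > '2022-12-01') AS t",
}

# In the Glue job this would replace the failing call:
# DataSource0 = glueContext.create_dynamic_frame.from_options(
#     connection_type="postgresql",
#     connection_options=connection_options,
# )
```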