
DynamicFrame.split_rows on boolean column


I am trying to create two DynamicFrames based on a boolean column. I have tried:
dyf.split_rows({'mybool': {'=': 'true'}}, 'is_true', 'is_not_true')
dyf.split_rows({'mybool': {'=': True}}, 'is_true', 'is_not_true')
dyf.split_rows({'mybool': {'=': 1}}, 'is_true', 'is_not_true')

The documentation at https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-extensions-dynamic-frame.html#pyspark-split_rows-example is not helping me figure this out. For now I am using DataFrame.filter, but I was wondering if there was a way to use split_rows.
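For reference, the filter workaround can also be written against the DynamicFrame directly, without converting to a DataFrame. This is only a minimal sketch: it assumes `dyf` is a Glue DynamicFrame, relies on `DynamicFrame.filter(f=...)`, and assumes records can be indexed by column name as in the Glue Filter examples. The helper names `make_bool_predicates` and `split_on_bool` are hypothetical, not part of Glue.

```python
def make_bool_predicates(column):
    """Build (is_true, is_not_true) predicates for one boolean column.

    Hypothetical helper for illustration; not part of the Glue API.
    """
    def is_true(rec):
        # Strict identity check so 1 or 'true' are not treated as True.
        return rec[column] is True

    def is_not_true(rec):
        # Matches False and also None (null) values.
        return rec[column] is not True

    return is_true, is_not_true


def split_on_bool(dyf, column):
    """Return two frames: rows where `column` is True, and all other rows.

    `dyf` is assumed to expose filter(f=...), as a Glue DynamicFrame does.
    """
    is_true, is_not_true = make_bool_predicates(column)
    return dyf.filter(f=is_true), dyf.filter(f=is_not_true)
```

Usage would look like `is_true, is_not_true = split_on_bool(dyf, 'mybool')`. Note that this makes two passes over the data, and that null values end up in the second frame.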

EDIT (2024-05-28): I am adding my test code to show that I never get a count above zero in the invalid frame:

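from awsglue.dynamicframe import DynamicFrame
from pyspark.sql import Row

# `spark` and `glueContext` are assumed to exist already (Glue job or notebook session).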

fake_data = [
    {'id': '00001', 'mybool': True},
    {'id': '00002', 'mybool': False},
    {'id': '00003', 'mybool': True},
    {'id': '00004', 'mybool': False},
    {'id': '00005', 'mybool': True},
]

df_fake_data = spark.createDataFrame(Row(**x) for x in fake_data)

df_fake_data.show()

comparison_dicts = [
    {'=': 'true'},
    {'=': True},
    {'=': 1},
    {'=': 'yes'},
    {'=': 'no'},
    {'=': '1'},
    {'=': '0'},
    {'=': 'True'},
    {'=': 'False'},
    {'==': True},
]

for comparison_dict in comparison_dicts:
    print(f'comparison_dict: {comparison_dict}')
    # fromDF takes a name for the resulting DynamicFrame as its third argument.
    dyc = DynamicFrame.fromDF(df_fake_data, glueContext, 'fake_data').split_rows(
        {'mybool': comparison_dict}, 'invalid', 'valid'
    )
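    # Print the counts explicitly; a bare expression inside a loop shows nothing.
    print('invalid count:', dyc.select('invalid').count())
    print('valid count:', dyc.select('valid').count())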
    dyc.select('invalid').show()
    dyc.select('valid').show()
    print()
asked 2 years ago, 305 views
2 Answers

Hello,

Please check whether either of the approaches below works:

split_rows_collection = dyf.split_rows({"mybool": {"=": "1"}}, "is_true", "is_not_true")
split_rows_collection = dyf.split_rows({"mybool": {"=": "yes"}}, "is_true", "is_not_true")

Thanks !

AWS
SUPPORT ENGINEER
answered 2 years ago
  • Thank you for the suggestions, but neither of those works.


To split a DynamicFrame based on a boolean column using split_rows, use the following code:

is_true, is_not_true = dyf.split_rows(comparison_dict={"mybool": {"==": True}})

This will create two DynamicFrames: is_true, containing rows where the mybool column is True, and is_not_true, containing rows where it is False.

Source: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-transforms-SplitRows.html#aws-glue-api-crawler-pyspark-transforms-SplitRows-__call__

EXPERT
answered 2 years ago
