How to use a transformation rule in a DMS task to perform a replace operation (remove spaces) for all columns


I have created a DMS task to migrate data from MongoDB to S3 in Parquet format, and I will be using the Parquet files in Glue. But the column names contain spaces, which makes the Parquet files unreadable in Glue; it throws the error - AnalysisException: Attribute name "column A" contains invalid character(s) among " ,;{}()\n\t=". Please use alias to rename it. I have been renaming columns in the rule, but I have multiple columns with this space issue.

Is there a way to add a transformation rule in DMS that does this for all columns with this issue? I tried doing it in the Glue code, but then I was unable to even read the Parquet file.

I also tried applying the schema after changing it and reading the file again, but that leads to nulls in the fields.

# Re-read the file with its own schema and write it back out;
# this still produces nulls in the fields.
file = "/content/LOAD00000001.parquet"
df = spark.read.parquet(file)
newdf = spark.read.schema(df.schema).parquet(file)
newdf.coalesce(1).write.format('parquet').save('/content/4/')

Please help me find a way to handle this dynamically.
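
Roughly, something like the following is what I am after on the Spark side (an untested sketch; the paths are the same placeholders as above, and the rename logic is only an example):

# Rough, untested sketch of the dynamic rename I am after:
# read the file once, then rename every column in bulk,
# replacing spaces with underscores.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.read.parquet("/content/LOAD00000001.parquet")

# toDF takes the complete list of new names, so no per-column rules are needed
clean_names = [c.replace(" ", "_") for c in df.columns]
df_clean = df.toDF(*clean_names)

df_clean.coalesce(1).write.mode("overwrite").parquet("/content/4/")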

1 Answer

This is a very interesting scenario; I reproduced it.

Source MongoDB collection:

rs0:PRIMARY> db.testdata.find().pretty()
{
	"_id" : ObjectId("628c8ab5f530bdb073b54017"),
	"column 1" : "john",
	"column 2" : "mark"
}
{
	"_id" : ObjectId("628c8accf530bdb073b54018"),
	"column 1" : "mike",
	"column 2" : "mark"
}

DMS transformations on column names let you rename columns, but a plain rename won't help here.

Also, it is not possible to replace just the space (" "); DMS does, however, let you replace a column-name prefix or suffix.

In my case the columns are "column 1" and "column 2", which share the prefix "column ", so I made the table mapping as follows:

{
  "rules": [
    {
      "rule-type": "transformation",
      "rule-id": "378097618",
      "rule-name": "378097618",
      "rule-target": "column",
      "object-locator": {
        "schema-name": "mydata",
        "table-name": "%",
        "column-name": "%"
      },
      "rule-action": "replace-prefix",
      "value": "column_",
      "old-value": "column "
    },
    {
      "rule-type": "selection",
      "rule-id": "378089419",
      "rule-name": "378089419",
      "object-locator": {
        "schema-name": "mydata",
        "table-name": "%"
      },
      "rule-action": "include",
      "filters": [

      ]
    }
  ]
}

This mapping changes the prefix "column " to "column_".
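
If you would rather apply this mapping to an existing task programmatically instead of pasting it into the console, a minimal sketch with boto3 could look like the following (the file path and the replication task ARN are placeholders, and the task must be stopped before it can be modified):

# Sketch: apply the table mapping above to an existing DMS task via boto3.
# "table-mapping.json" holds the JSON shown above; the task ARN is a placeholder.
import boto3

dms = boto3.client("dms")

with open("table-mapping.json") as f:
    table_mappings = f.read()

dms.modify_replication_task(
    ReplicationTaskArn="arn:aws:dms:us-east-1:123456789012:task:EXAMPLE",
    TableMappings=table_mappings,
)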

Final S3 Parquet file schema and data:

[ec2-user@ip-10-0-0-4 ~]$ parq LOAD00000001.parquet -s

 # Schema
 <pyarrow._parquet.ParquetSchema object at 0x7fda6fa8b5d0>
oid__id: BYTE_ARRAY String
column_1: BYTE_ARRAY String
column_2: BYTE_ARRAY String

[ec2-user@ip-10-0-0-4 ~]$ parq LOAD00000001.parquet --head
   oid__id                   column_1 column_2
0  628c8ab5f530bdb073b54017  john     mark
1  628c8accf530bdb073b54018  mike     mark

I hope this dynamic change of column prefix helps.
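
Back in Glue/Spark, the underscored column names should no longer trigger the AnalysisException; a quick sanity check could look like this (the S3 path is a placeholder, and spark is the session provided by the Glue job):

# Quick check that the DMS output now reads cleanly in Spark/Glue
# (the S3 path is a placeholder).
df = spark.read.parquet("s3://my-dms-bucket/mydata/testdata/")
df.printSchema()
df.show()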

AWS
answered 2 years ago
  • I tried this but it is not working. Below is a subset of a document from Mongo that I am trying to migrate to S3: { "_id": { "$oid": "123" }, "synonyms": { "Ceramic": [ "Ceramic" ], "Stainless Steel": [ "Stainless Steel" ] } }. The columns printed from the Parquet file are ['oid__id', 'array_synonyms.Ceramic', 'array_synonyms.Stainless Steel'], so the spaces are still there. Some columns also contain commas; please help me with those as well.

  • In your data set the field names are unique values rather than sharing a common prefix like "column ", so a prefix replacement will not catch them. Did you try making a copy of the collection and replacing the spaces at the source itself? A sketch of that idea follows below.
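
A minimal sketch of that idea with PyMongo (the connection string, database, and collection names are placeholders), copying every document into a new collection with spaces in field names, nested ones included, replaced by underscores:

# Sketch of the "fix it at the source" idea: copy the collection,
# replacing spaces in field names (nested ones included) with underscores.
# Connection string, database, and collection names are placeholders.
from pymongo import MongoClient


def clean_keys(value):
    # Recursively rename keys in dicts and walk into lists.
    if isinstance(value, dict):
        return {k.replace(" ", "_"): clean_keys(v) for k, v in value.items()}
    if isinstance(value, list):
        return [clean_keys(v) for v in value]
    return value


client = MongoClient("mongodb://localhost:27017")
db = client["mydata"]

source = db["testdata"]
target = db["testdata_clean"]   # point the DMS task at this copy

for doc in source.find():
    target.insert_one(clean_keys(doc))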
