
Questions tagged with AWS Glue


AWS Glue Studio Data Preview Fails instantly (Py4JJavaError)

Hi, I'm using AWS Glue Studio, and as soon as I click "Data Preview" it fails with the error below. The flow consists of two actions: a PostgreSQL JDBC data source and a Select Fields action. The error is thrown instantly once the "Data Preview" button is clicked. The overall flow runs successfully if I click the "Run" button: there are no errors, and I get the expected output if I add a target node to dump the results back into a PostgreSQL table. It's only the Data Preview functionality that fails. Any idea what could be wrong and how to troubleshoot it?

```
Py4JJavaError: An error occurred while calling o538.getSampleDynamicFrame.
: java.lang.UnsupportedOperationException: empty.reduceLeft
	at scala.collection.TraversableOnce$class.reduceLeft(TraversableOnce.scala:180)
	at scala.collection.AbstractTraversable.reduceLeft(Traversable.scala:104)
	at scala.collection.TraversableOnce$class.reduce(TraversableOnce.scala:208)
	at scala.collection.AbstractTraversable.reduce(Traversable.scala:104)
	at com.amazonaws.services.glue.SparkSQLDataSource.com$amazonaws$services$glue$SparkSQLDataSource$$getPaths(DataSource.scala:724)
	at com.amazonaws.services.glue.SparkSQLDataSource$$anonfun$getSampleDynamicFrame$7.apply(DataSource.scala:799)
	at com.amazonaws.services.glue.SparkSQLDataSource$$anonfun$getSampleDynamicFrame$7.apply(DataSource.scala:793)
	at com.amazonaws.services.glue.util.FileSchemeWrapper$$anonfun$executeWithQualifiedScheme$1.apply(FileSchemeWrapper.scala:89)
	at com.amazonaws.services.glue.util.FileSchemeWrapper$$anonfun$executeWithQualifiedScheme$1.apply(FileSchemeWrapper.scala:89)
	at com.amazonaws.services.glue.util.FileSchemeWrapper.executeWith(FileSchemeWrapper.scala:82)
	at com.amazonaws.services.glue.util.FileSchemeWrapper.executeWithQualifiedScheme(FileSchemeWrapper.scala:89)
	at com.amazonaws.services.glue.SparkSQLDataSource.getSampleDynamicFrame(DataSource.scala:792)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)
```
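For readers hitting the same wall, one way to narrow it down is to sample the same JDBC source outside the Studio preview path, for example from a Glue job or interactive session. This is a minimal sketch with hypothetical catalog names; it only shows that the source itself can be read and sampled.

```python
# Minimal sketch: read the same source via the Data Catalog and sample it manually.
# "my_database" and "my_postgres_table" are placeholders, not names from the question.
from pyspark.context import SparkContext
from awsglue.context import GlueContext

glueContext = GlueContext(SparkContext.getOrCreate())

dyf = glueContext.create_dynamic_frame.from_catalog(
    database="my_database",
    table_name="my_postgres_table",
)

# If this works while the Studio preview does not, the problem is specific to the
# preview's getSampleDynamicFrame call rather than to the connection or the data.
dyf.toDF().show(5)
```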
0 answers, 0 votes, 13 views, asked 4 days ago

MWAA apache-airflow-providers-amazon: DAGs can't import operator

I'm trying to use the newest package for Amazon integration in MWAA. In this particular case I want to use `GlueJobOperator`, which is part of the latest `apache-airflow-providers-amazon` package ([link to the documentation](https://airflow.apache.org/docs/apache-airflow-providers-amazon/4.0.0/operators/glue.html)).

MWAA Airflow version: `2.2.2`

I added this to *requirements.txt*:

```
apache-airflow-providers-amazon==4.0.0
```

and tried to import it and use it like in the examples:

```
from airflow.providers.amazon.aws.operators.glue import GlueJobOperator

glue_job = GlueJobOperator(
    task_id='airflow_id',
    job_name='job_name',
    wait_for_completion=True,
    script_location='bucket_script_prefix',
    s3_bucket='bucket_name',
    iam_role_name='iam_role',
    create_job_kwargs=job_arguments,
    script_args=script_arguments
)
```

Unfortunately, whenever the DAG is parsed I get this error:

```
...
from airflow.providers.amazon.aws.operators.glue import GlueJobOperator
ImportError: cannot import name 'GlueJobOperator' from 'airflow.providers.amazon.aws.operators.glue' (/usr/local/lib/python3.7/site-packages/airflow/providers/amazon/aws/operators/glue.py)
```

It's not my first rodeo with MWAA and extra packages, plugins, etc., but I am lost. In this case I tried many things, went through the docs from cover to cover, and I still couldn't find the reason. I verified in MWAA, both in the logs and in the UI, that the package was successfully installed with the version pinned in *requirements.txt*:

| Package Name | Version | Description |
| --- | --- | --- |
| apache-airflow-providers-amazon | 4.0.0 | Amazon integration (including Amazon Web Services (AWS)). |

Fun fact: I'm using `S3Hook` in some other DAGs and it parses just fine.

```
from airflow.providers.amazon.aws.hooks.s3 import S3Hook

s3_hook = S3Hook()
...
```
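A hedged note for anyone debugging the same import: older releases of the amazon provider expose this operator under the legacy name `AwsGlueJobOperator` in the same module, so a guarded import like the sketch below can reveal whether the worker is actually parsing DAGs against an older provider version than the one listed in *requirements.txt*. This is a diagnostic sketch, not a statement about which version MWAA installed here.

```python
# Sketch of a version-tolerant import: if only the legacy name resolves, the DAG is
# being parsed against an older apache-airflow-providers-amazon release.
try:
    from airflow.providers.amazon.aws.operators.glue import GlueJobOperator
except ImportError:
    # Legacy class name used by earlier provider releases.
    from airflow.providers.amazon.aws.operators.glue import (
        AwsGlueJobOperator as GlueJobOperator,
    )
```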
0 answers, 0 votes, 21 views, asked 5 days ago

Glue Hudi: get the freshly added or updated records

Hello, I'm using the Hudi connector in Glue. First, I bulk-inserted the initial dataset into a Hudi table; I'm adding daily incremental records and I can query them using Athena. What I'm trying to do is get the newly added, updated, or deleted records into a separate Parquet file. I've tried different approaches and configurations using both copy-on-write and merge-on-read tables, but I still can't get the updates into a separate file. I used these configurations in different combinations:

```
'className': 'org.apache.hudi',
'hoodie.datasource.hive_sync.use_jdbc': 'false',
'hoodie.datasource.write.precombine.field': 'ts',
'hoodie.datasource.write.recordkey.field': 'uuid',
'hoodie.payload.event.time.field': 'ts',
'hoodie.table.name': 'table_name',
'hoodie.datasource.hive_sync.database': 'hudi_db',
'hoodie.datasource.hive_sync.table': 'table_name',
'hoodie.datasource.hive_sync.enable': 'false',
# 'hoodie.datasource.write.partitionpath.field': 'date:SIMPLE',
'hoodie.datasource.write.hive_style_partitioning': 'true',
'hoodie.meta.sync.client.tool.class': 'org.apache.hudi.aws.sync.AwsGlueCatalogSyncTool',
'hoodie.datasource.write.table.type': 'COPY_ON_WRITE',
'path': 's3://path/to/output/',
# 'hoodie.datasource.write.operation': 'bulk_insert',
'hoodie.datasource.write.operation': 'upsert',
# 'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.NonPartitionedExtractor',
# 'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.MultiPartKeysValueExtractor',
'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.NonpartitionedKeyGenerator',
# 'hoodie.compaction.payload.class': 'org.apache.hudi.common.model.OverwriteWithLatestAvroPayload',
# 'hoodie.cleaner.policy': 'KEEP_LATEST_COMMITS',
'hoodie.cleaner.delete.bootstrap.base.file': 'true',
"hoodie.index.type": "GLOBAL_BLOOM",
'hoodie.file.index.enable': 'true',
'hoodie.bloom.index.update.partition.path': 'true',
'hoodie.bulkinsert.shuffle.parallelism': 1,
# 'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.CustomKeyGenerator'
```

Thank you.
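Since the question is about isolating only the records written since the last run, a Hudi incremental query is the mechanism designed for exactly that. The sketch below is illustrative only: the table path is the one from the question, while the begin instant is a hypothetical value you would normally take from the last commit time your pipeline has already consumed.

```python
# Minimal sketch of a Hudi incremental read followed by a plain Parquet dump.
# Assumes an existing SparkSession (`spark`), e.g. glueContext.spark_session.
# '20220601000000' is a placeholder begin instant.
incremental_options = {
    'hoodie.datasource.query.type': 'incremental',
    'hoodie.datasource.read.begin.instanttime': '20220601000000',
}

incremental_df = (
    spark.read.format('org.apache.hudi')
    .options(**incremental_options)
    .load('s3://path/to/output/')
)

# Only the records added or updated after the begin instant end up in this output.
incremental_df.write.mode('overwrite').parquet('s3://path/to/incremental_extract/')
```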
1 answer, 0 votes, 18 views, asked 7 days ago

Hudi Clustering

I am using EMR 6.6.0, which has Hudi 0.10.1. I am trying to bulk insert and do inline clustering using Hudi, but it does not seem to be clustering the files according to the file sizes I specified; it still produces files that are only a few KB. I tried the configuration below:

```python
hudi_clusteringopt = {
    'hoodie.table.name': 'myhudidataset_upsert_legacy_new7',
    'hoodie.datasource.write.recordkey.field': 'id',
    'hoodie.datasource.write.partitionpath.field': 'creation_date',
    'hoodie.datasource.write.precombine.field': 'last_update_time',
    'hoodie.datasource.hive_sync.enable': 'true',
    'hoodie.datasource.hive_sync.database': 'my_hudi_db',
    'hoodie.datasource.hive_sync.table': 'myhudidataset_upsert_legacy_new7',
    'hoodie.datasource.hive_sync.partition_fields': 'creation_date',
    'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.MultiPartKeysValueExtractor',
    "hoodie.datasource.write.hive_style_partitioning": "true",
    "hoodie.datasource.write.operation": "bulk_insert",
}
# "hoodie.datasource.write.operation": "bulk_insert",

try:
    inputDF.write.format("org.apache.hudi"). \
        options(**hudi_clusteringopt). \
        option("hoodie.parquet.small.file.limit", "0"). \
        option("hoodie.clustering.inline", "true"). \
        option("hoodie.clustering.inline.max.commits", "0"). \
        option("hoodie.clustering.plan.strategy.target.file.max.bytes", "1073741824"). \
        option("hoodie.clustering.plan.strategy.small.file.limit", "629145600"). \
        option("hoodie.clustering.plan.strategy.sort.columns", "pk_col"). \
        mode('append'). \
        save("s3://xxxxxxxxxxxxxx")
except Exception as e:
    print(e)
```

Here is the dataset if someone wants to reproduce it:

```python
inputDF = spark.createDataFrame(
    [
        ("1001", 1001, "2015-01-01", "2015-01-01T13:51:39.340396Z"),
        ("1011", 1011, "2015-01-01", "2015-01-01T12:14:58.597216Z"),
        ("1021", 1021, "2015-01-01", "2015-01-01T13:51:40.417052Z"),
        ("1031", 1031, "2015-01-01", "2015-01-01T13:51:40.519832Z"),
        ("1041", 1041, "2015-01-02", "2015-01-01T12:15:00.512679Z"),
        ("1051", 1051, "2015-01-02", "2015-01-01T13:51:42.248818Z"),
    ],
    ["id", "id_val", "creation_date", "last_update_time"]
)
```
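One way to see whether inline clustering ever actually ran is to look at the table's timeline: clustering shows up as `replacecommit` instants under the `.hoodie/` folder. The sketch below is a generic check with placeholder bucket and prefix values, not the actual table location from the question.

```python
# Minimal sketch: list the Hudi timeline and look for replacecommit instants, which are
# only present if a clustering plan was scheduled and executed. Bucket/prefix are placeholders.
import boto3

s3 = boto3.client("s3")
resp = s3.list_objects_v2(Bucket="my-bucket", Prefix="path/to/table/.hoodie/")

for obj in resp.get("Contents", []):
    key = obj["Key"]
    if "replacecommit" in key:
        print("clustering instant:", key)
```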
1 answer, 0 votes, 7 views, asked 9 days ago

AWS GLUE - java.sql.SQLIntegrityConstraintViolationException: Duplicate entry Error

I was doing a POC using Glue to migrate data from RDS MySQL to RDS PostgreSQL. I created connections to both the source and the target, and a crawler that connected to the source. I then created a job and tried to migrate the data without any transformation, and started getting a java.sql.SQLIntegrityConstraintViolationException: Duplicate entry error:

```
java.lang.reflect.Constructor.newInstance(Constructor.java:423)
	at com.mysql.cj.util.Util.handleNewInstance(Util.java:192)
	at com.mysql.cj.util.Util.getInstance(Util.java:167)
	at com.mysql.cj.util.Util.getInstance(Util.java:174)
	at com.mysql.cj.jdbc.exceptions.SQLError.createBatchUpdateException(SQLError.java:224)
	at com.mysql.cj.jdbc.ServerPreparedStatement.executeBatchSerially(ServerPreparedStatement.java:385)
	at com.mysql.cj.jdbc.ClientPreparedStatement.executeBatchInternal(ClientPreparedStatement.java:435)
	at com.mysql.cj.jdbc.StatementImpl.executeBatch(StatementImpl.java:794)
	at org.apache.spark.sql.jdbc.glue.GlueJDBCSink$.saveBatch(GlueJDBCSink.scala:400)
	at org.apache.spark.sql.jdbc.glue.GlueJDBCSink$.$anonfun$saveTable$4(GlueJDBCSink.scala:77)
	at org.apache.spark.sql.jdbc.glue.GlueJDBCSink$.$anonfun$saveTable$4$adapted(GlueJDBCSink.scala:77)
	at org.apache.spark.sql.jdbc.glue.GlueJDBCSink$.savePartition(GlueJDBCSink.scala:261)
	at org.apache.spark.sql.jdbc.glue.GlueJDBCSink$.$anonfun$saveTable$3(GlueJDBCSink.scala:77)
	at org.apache.spark.sql.jdbc.glue.GlueJDBCSink$.$anonfun$saveTable$3$adapted(GlueJDBCSink.scala:76)
	at org.apache.spark.rdd.RDD.$anonfun$foreachPartition$2(RDD.scala:1020)
	at org.apache.spark.rdd.RDD.$anonfun$foreachPartition$2$adapted(RDD.scala:1020)
	at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2278)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:131)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)
Caused by: java.sql.SQLIntegrityConstraintViolationException: Duplicate entry '00002b0b-1a34-3319-b003-fb073fb8248d' for key 'transformer_fragment.PRIMARY'
	at com.mysql.cj.jdbc.exceptions.SQLError.createSQLException(SQLError.java:117)
	at com.mysql.cj.jdbc.exceptions.SQLExceptionsMapping.translateException(SQLExceptionsMapping.java:122)
	at com.mysql.cj.jdbc.ServerPreparedStatement.serverExecute(ServerPreparedStatement.java:637)
	at com.mysql.cj.jdbc.ServerPreparedStatement.executeInternal(ServerPreparedStatement.java:418)
	at com.mysql.cj.jdbc.ClientPreparedStatement.executeUpdateInternal(ClientPreparedStatement.java:1092)
	at com.mysql.cj.jdbc.ClientPreparedStatement.executeUpdateInternal(ClientPreparedStatement.java:1040)
	at com.mysql.cj.jdbc.ServerPreparedStatement.executeBatchSerially(ServerPreparedStatement.java:357)
	... 19 more
```

The failed task's Spark event-log entry from the same run (truncated):

```
"Accumulator Updates":[{"ID":130,"Update":"105","Internal":false,"Count Failed Values":true},{"ID":132,"Update":"0","Internal":false,"Count Failed Values":true},{"ID":139,"Update":"0","Internal":false,"Count Failed Values":true},{"ID":140,"Update":"1","Internal":false,"Count Failed Values":true},{"ID":141,"Update":"0","Internal":false,"Count Failed Values":true},{"ID":142,"Update":"0","Internal":false,"Count Failed Values":true},{"ID":143,"Update":"1135","Internal":false,"Count Failed Values":true},{"ID":144,"Update":"0","Internal":false,"Count Failed Values":true},{"ID":145,"Update":"1","Internal":false,"Count Failed Values":true}]},"Task Info":{"Task ID":13,"Index":5,"Attempt":1,"Launch Time":1655403050036,"Executor ID":"1","Host":"172.31.22.88","Locality":"NODE_LOCAL","Speculative":false,"Getting Result Time":0,"Finish Time":1655403050162,"Failed":true,"Killed":false,"Accumulables":
```

These are the values I used while creating the job:

- Type: Spark
- Glue version: Glue 3.0 (supports Spark 3.1, Scala 2, Python 3)
- Language: Python 3
- Worker type: G.1X
- Automatically scale the number of workers: False
- Requested number of workers: 2
- Generate job insights: True
- Job bookmark: Enable
- Number of retries: 1
- Job timeout: 2880

Really appreciate any help!
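Since the root cause reported by the driver is a duplicate primary key during the JDBC batch write, one common hedge in a Glue script is to deduplicate on the key column before the sink. This is a minimal sketch with placeholder names; the frame, context, and key column are not taken from the question.

```python
# Minimal sketch: drop duplicate key values before the JDBC write so the batch insert
# cannot violate the target table's primary key. `source_dyf`, `glueContext`, and "id"
# are placeholders.
from awsglue.dynamicframe import DynamicFrame

deduped_df = source_dyf.toDF().dropDuplicates(["id"])  # "id" stands in for the real key column
deduped_dyf = DynamicFrame.fromDF(deduped_df, glueContext, "deduped_dyf")

# Write deduped_dyf to the target connection instead of the original frame.
```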
0 answers, 0 votes, 22 views, asked 18 days ago

Parameterized concurrent AWS Glue jobs run from Step Functions with bookmarks enabled throw a version mismatch exception

I have a parameterized Glue job that is called in parallel (25 Glue jobs) through Step Functions. When bookmarks are enabled, a version mismatch exception is thrown; when they are disabled, it runs fine. Below are the inputs to the Step Functions execution:

```
{
  "JobName": "gluejobname3",
  "Arguments": {
    "--cdb": "catalogdbname",
    "--ctbl": "tablename",
    "--dest": "s3://thepath/raw/",
    "--extra-py-files": "s3://thepath/mytransformation.py.zip"
  }
}
```

When bookmarks are disabled, Step Functions calls the parameterized Glue job and loads the data into the different S3 locations. Below is the Glue job script:

```python
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
import mytransformation

args = getResolvedOptions(sys.argv, ["JOB_NAME", "cdb", "ctbl", "dest"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)

# This was done so that each job would be initialized with its own unique id
job.init(args["JOB_NAME"] + args["ctbl"], args)

# Custom parameters that are dynamically passed in with each call to the parameterized glue job
db = args["cdb"]
tbl = args["ctbl"]
dest = args["dest"]

# Script generated for node MySQL table
# Dynamically creating a variable name so that transformation_ctx is unique for each glue job
globals()[f'MySQLtable_sourceDf{tbl}'] = glueContext.create_dynamic_frame.from_catalog(
    database=db,
    table_name=tbl,
    transformation_ctx='"' + f'MySQLtable_sourceDf{tbl}' + '"'
)

# Passing the same transformation_ctx into the destination frame
destination_Df = mytransformation.tansform(globals()[f'MySQLtable_sourceDf{tbl}'])

# Script generated for node Amazon S3
# : "s3://anvil-houston-preprod"
# Creating the dynamic unique transformation_ctx name since the jobs will be run concurrently
globals()[f'S3bucket_node{tbl}'] = glueContext.write_dynamic_frame.from_options(
    frame=destination_Df,
    connection_type="s3",
    format="glueparquet",
    connection_options={"path": dest, "partitionKeys": []},
    format_options={"compression": "snappy"},
    transformation_ctx='"' + f'S3bucket_node{tbl}' + '"'
)

job.commit()
```

The above runs fine while bookmarks are disabled: when the execution is started through Step Functions (25 parallel parameterized Glue jobs), the jobs run and load into 25 different locations. When bookmarks are enabled, the job fails with a version mismatch:

```
An error occurred while calling z:com.amazonaws.services.glue.util.Job.commit. Continuation update failed due to version mismatch. Expected version 1 but found version 2 (Service: AWSGlueJobExecutor; Status Code: 400; Error Code: VersionMismatchException;
```

Please help.
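A side observation on the script, offered as a hedged sketch rather than a confirmed fix for the VersionMismatchException: Glue job bookmarks key their state on the job name and the transformation_ctx strings, and the script above embeds literal double-quote characters in those strings. A cleaner per-table pattern, using the same argument names as the script above, would look like this.

```python
# Minimal sketch: plain per-table transformation_ctx values without embedded quotes.
source_dyf = glueContext.create_dynamic_frame.from_catalog(
    database=db,
    table_name=tbl,
    transformation_ctx=f"MySQLtable_sourceDf_{tbl}",
)

sink = glueContext.write_dynamic_frame.from_options(
    frame=mytransformation.tansform(source_dyf),
    connection_type="s3",
    format="glueparquet",
    connection_options={"path": dest, "partitionKeys": []},
    format_options={"compression": "snappy"},
    transformation_ctx=f"S3bucket_node_{tbl}",
)
```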
0 answers, 0 votes, 23 views, asked 19 days ago

Generating Parquet files from Glue Data Catalog

I have a Glue job that writes to a Data Catalog table. In the Data Catalog I originally set the table up as CSV, and all works fine. Now I would like to try using Parquet instead. I thought I would just have to re-create the table and select Parquet instead of CSV, so I did it like this:

```
CREATE EXTERNAL TABLE `gp550_load_database_beta.gp550_load_table_beta`(
  `vid` string,
  `altid` string,
  `vtype` string,
  `time` timestamp,
  `timegmt` timestamp,
  `value` float,
  `filename` string)
PARTITIONED BY (
  `year` int,
  `month` int,
  `day` int)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  's3://ds905-load-forecast/data_store_beta/'
TBLPROPERTIES (
  'classification'='parquet')
```

I left my Glue job unchanged. It sends its output to the Data Catalog table like this:

```
additionalOptions = {"enableUpdateCatalog": True, "updateBehavior": "LOG"}
additionalOptions["partitionKeys"] = ["year", "month", "day"]

# Data Catalog WRITE
DataCatalogtable_node2 = glueContext.write_dynamic_frame.from_catalog(
    frame = dynamicDF,
    database = db_name,
    table_name = tbl_name,
    additional_options=additionalOptions,
    transformation_ctx = "DataCatalogtable_node2",
)
```

When I checked the files being created in s3://ds905-load-forecast/data_store_beta/, they look to just be CSV. What do I need to do to get Parquet output? Can I just change the sink routine to use glueContext.write_dynamic_frame.from_options()?
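For comparison, this is what a from_options sink that writes Parquet (glueparquet) directly to the same location could look like. It is a hedged sketch reusing the frame and path from the question; whether it is the right fix here depends on how the rest of the job and the catalog table are set up.

```python
# Minimal sketch: write Parquet with from_options instead of routing through the
# catalog table's stored format. Frame and path are reused from the question.
sink_node = glueContext.write_dynamic_frame.from_options(
    frame=dynamicDF,
    connection_type="s3",
    format="glueparquet",
    connection_options={
        "path": "s3://ds905-load-forecast/data_store_beta/",
        "partitionKeys": ["year", "month", "day"],
    },
    format_options={"compression": "snappy"},
    transformation_ctx="sink_node",
)
```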
0 answers, 0 votes, 17 views, asked 25 days ago

MSCK REPAIR TABLE behaves differently when executed via Spark Context vs Athena Console/boto3

I have a Glue ETL job which creates partitions during the job:

```
additionalOptions = {"enableUpdateCatalog": True, "updateBehavior": "LOG"}
additionalOptions["partitionKeys"] = ["year", "month", "day"]
```

I don't have it update the Data Catalog, because doing so changes all my table data types. So after the job is done, the way I get the Data Catalog updated with the correct partition information is to run MSCK REPAIR TABLE.

If I do this inside the Glue ETL job using the Spark context, like so:

```
spark.sql("use gp550_load_database_beta")
spark.sql("msck repair table gp550_load_table_beta").show()
```

the following happens:

- The SerDe properties of my table are updated with `serialization.format: 1`
- The table properties are updated with `EXTERNAL: TRUE` and `spark.sql.partitionProvider: catalog`
- ALL data types in my table are set to `string` with a comment of "from deserializer"

Basically, it makes a mess.

If I instead run MSCK REPAIR TABLE from boto3, or run it manually from the Athena console, there are no issues: no SerDe properties change, no table properties change, no data types change; it simply adds my partitions to the Data Catalog like I want. I do it like this in my ETL job:

```
client = boto3.client('athena')
sql = 'MSCK REPAIR TABLE gp550_load_database_beta.gp550_load_table_beta'
context = {'Database': 'gp550_load_database_beta'}

client.start_query_execution(QueryString = sql,
                             QueryExecutionContext = context,
                             ResultConfiguration = { 'OutputLocation': 's3://aws-glue-assets-977465404123-us-east-1/temporary/' })
```

Why does it behave so differently? Is it because I somehow need to tell Spark to work with Hive? I had thought that since I already had a Spark context it would be easy to use it to kick off MSCK REPAIR TABLE, but obviously I was surprised by the result!
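One practical note on the boto3 approach: start_query_execution only submits the query, so if later steps in the ETL job depend on the partitions being registered, the job needs to wait for the repair to finish. A minimal sketch, reusing the client, sql, and context variables from the snippet above:

```python
# Minimal sketch: submit the repair query and poll until Athena reports a terminal state.
import time

query_id = client.start_query_execution(
    QueryString=sql,
    QueryExecutionContext=context,
    ResultConfiguration={"OutputLocation": "s3://aws-glue-assets-977465404123-us-east-1/temporary/"},
)["QueryExecutionId"]

while True:
    status = client.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]
    if status["State"] in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)

print("MSCK REPAIR TABLE finished with state:", status["State"])
```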
0 answers, 0 votes, 15 views, asked a month ago

Quicksight Athena - analysis error: "HIVE_UNSUPPORTED_FORMAT: Unable to create input format"

Hello. I'm trying to create an analysis from my DocumentDB instance, using the AWS services Glue, Athena and QuickSight.

In Glue I have created a connection to the DocumentDB and a crawler for auto-creating tables. This works as expected and the tables are created and displayed in Glue. Even though I specify that the crawler should not give the tables any prefix, it does add the database name as a prefix.

When I look at the Glue catalog in Athena (the default AwsDataCatalog) I do see the database that was created in Glue; however, it does not show any tables. If I click on edit, it takes me to the correct database in Glue, which displays the tables created by the previously mentioned crawler. So my first question is: **Why don't the tables show up in Athena?** This is blocking me from running queries in Athena.

When I go to QuickSight and select the default Athena Glue catalog ("AwsDataCatalog") I DO get the tables created by the crawler, and I can create datasets. However, when I try to create an analysis using these datasets, I get the error:

```
sourceErrorCode: 100071
sourceErrorMessage: [Simba][AthenaJDBC](100071) An error has been thrown from the AWS Athena client.
HIVE_UNSUPPORTED_FORMAT: Unable to create input format
```

I have looked around a bit, and some people said this error is due to the table properties **"Input format"** and **"Output format"** being empty (which they indeed are for me). I have tried entering almost all the different formats on the table, but I keep getting the QuickSight error above, only now it has the input format at the end:

```
HIVE_UNSUPPORTED_FORMAT: Unable to create input format json
```

**So my second question is:** I do not see anywhere in the crawler where I can specify the input or output format. Does it have to be done manually? And what are the correct input and output formats for my setup?
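For reference, the input/output formats and the SerDe live on the table's StorageDescriptor in the Glue catalog and can be set programmatically rather than through the console. The sketch below only shows the mechanics, with placeholder database, table, and format class names; it deliberately does not claim which formats are correct for a DocumentDB-backed table, which is the open question here.

```python
# Minimal sketch: read a catalog table, set its storage formats, and write it back.
# All names and format classes below are placeholders for illustration only.
import boto3

glue = boto3.client("glue")

table = glue.get_table(DatabaseName="my_database", Name="my_table")["Table"]

storage = table["StorageDescriptor"]
storage["InputFormat"] = "org.apache.hadoop.mapred.TextInputFormat"                       # placeholder
storage["OutputFormat"] = "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"    # placeholder
storage.setdefault("SerdeInfo", {})["SerializationLibrary"] = "org.openx.data.jsonserde.JsonSerDe"  # placeholder

# update_table expects a TableInput, so copy over only the fields it accepts.
table_input = {
    k: v
    for k, v in table.items()
    if k in ("Name", "Description", "Owner", "Retention", "StorageDescriptor",
             "PartitionKeys", "TableType", "Parameters")
}
glue.update_table(DatabaseName="my_database", TableInput=table_input)
```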
0 answers, 0 votes, 32 views, asked a month ago

How to create dynamic dataframe from AWS Glue catalog in local environment?

I have performed some AWS Glue version 3.0 job testing using Docker containers, as detailed [here](https://aws.amazon.com/blogs/big-data/develop-and-test-aws-glue-version-3-0-jobs-locally-using-a-docker-container/). The following code outputs two lists, one per connection, with the names of the tables in a database:

```python
import boto3

db_name_s3 = "s3_connection_db"
db_name_mysql = "glue_catalog_mysql_connection_db"

def retrieve_tables(database_name):
    session = boto3.session.Session()
    glue_client = session.client("glue")
    response_get_tables = glue_client.get_tables(DatabaseName=database_name)
    return response_get_tables

s3_tables_list = [table_dict["Name"] for table_dict in retrieve_tables(db_name_s3)["TableList"]]
mysql_tables_list = [table_dict["Name"] for table_dict in retrieve_tables(db_name_mysql)["TableList"]]

print(f"These are the tables from {db_name_s3} db: {s3_tables_list}\n")
print(f"These are the tables from {db_name_mysql} db {mysql_tables_list}")
```

Now, I try to create a dynamic frame with the *from_catalog* method in this way:

```python
import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.dynamicframe import DynamicFrame

source_activities = glueContext.create_dynamic_frame.from_catalog(
    database=db_name,
    table_name=table_name
)
```

When `database="s3_connection_db"`, everything works fine; however, when I set `database="glue_catalog_mysql_connection_db"`, I get the following error:

```python
Py4JJavaError: An error occurred while calling o45.getDynamicFrame.
: java.lang.ClassNotFoundException: com.mysql.cj.jdbc.Driver
```

I understand the issue is related to the fact that I am trying to fetch data from a MySQL table, but I am not sure how to solve it. By the way, the job runs fine on the Glue console. I would really appreciate some help, thanks!
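Since the ClassNotFoundException points at the MySQL JDBC driver not being on the local Spark classpath, one hedged workaround in the local container is to download the MySQL Connector/J jar and point spark.jars at it before the GlueContext is created. The jar path below is an assumption, not something that ships with the image.

```python
# Minimal sketch: make the MySQL JDBC driver visible to the local Glue/Spark session.
# The jar path is hypothetical; download mysql-connector-java into the mounted workspace first.
from pyspark import SparkConf
from pyspark.context import SparkContext
from awsglue.context import GlueContext

jar_path = "/home/glue_user/workspace/jars/mysql-connector-java-8.0.29.jar"  # hypothetical path

conf = (
    SparkConf()
    .set("spark.jars", jar_path)
    .set("spark.driver.extraClassPath", jar_path)
)

sc = SparkContext.getOrCreate(conf=conf)
glueContext = GlueContext(sc)
```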
0 answers, 0 votes, 23 views, asked a month ago

AWS Glue problems reading from PostgreSQL DB that has uppercased table and column names

I have an RDS PostgreSQL database that has table names and column names with upper-case characters. I have created a Glue crawler that connects to the database via JDBC and populates the Glue Data Catalog with the database schemas, but in that translation the upper-case characters are converted to lower case when stored in the Data Catalog tables.

When I run a Glue job to query, I get this error:

```
An error occurred while calling o94.getDynamicFrame. ERROR: column "id" does not exist
```

I made a copy of this table and changed the table name and column names to all lower-case characters, and the same Glue jobs and queries run successfully. Changing the table names and column names to lower-case characters in our production environment is just not an option, due to the extensive work it would require.

I found the "Edit Schema" option in the Glue UI where you can change column names and data types, and thought for a moment the solution had been found. However, when you change a character to upper case and then select "Save", it is reverted to lower case as it is saved.

I have edited the PySpark script directly and worked with the glueContext.create_dynamic_frame.from_catalog method, using the additional_options parameter to build my select statement with upper- and lower-case characters, but that still fails with the error message noted above.

```
# Script generated for node PostgreSQL table
PostgreSQLtable_node1 = glueContext.create_dynamic_frame.from_catalog(
    database="mydatabase",
    table_name="mytable",
    additional_options={"query": "SELECT id from mytable;"},
    transformation_ctx="PostgreSQLtable_node1"
)
```

I believe the failure is because the schema as stored in the Data Catalog contains lower-case characters while the actual schema in the database uses upper-case characters, so when Glue tries to work with the table it is looking for "id" while the actual column is "ID", and so "not found" is returned. I have read about the CaseSensitive option and am looking in that direction next for a solution. I have not seen any recent posts (less than a couple of years old) about this issue, so I am not sure if I'm missing something. Any assistance would be greatly appreciated.
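One detail worth checking, offered as a hedged sketch rather than a confirmed fix: PostgreSQL folds unquoted identifiers to lower case, so a pushed-down query against mixed- or upper-case names has to double-quote them. The table and column names below are placeholders.

```python
# Minimal sketch: double-quote upper-case identifiers in the pushed-down query so
# PostgreSQL does not fold them to lower case. Names are placeholders.
PostgreSQLtable_node1 = glueContext.create_dynamic_frame.from_catalog(
    database="mydatabase",
    table_name="mytable",
    additional_options={"query": 'SELECT "ID" FROM "MyTable";'},
    transformation_ctx="PostgreSQLtable_node1",
)
```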
2 answers, 0 votes, 27 views, asked a month ago

Data Catalog schema table getting modified when I run my Glue ETL job

I created a Data Catalog with a table that I manually defined. I run my ETL job and all works well. I then added partitions to both the table in the Data Catalog and the ETL job. It creates the partitions and I see the folders being created in S3 as well. But my table data types change. I originally had:

| column | data type |
| --- | --- |
| vid | string |
| altid | string |
| vtype | string |
| time | timestamp |
| timegmt | timestamp |
| value | float |
| filename | string |
| year | int |
| month | int |
| day | int |

But now, after the ETL job with partitions, my table ends up like this:

| column | data type |
| --- | --- |
| vid | string |
| altid | string |
| vtype | string |
| time | bigint |
| timegmt | bigint |
| value | float |
| filename | string |
| year | bigint |
| month | bigint |
| day | bigint |

Before this change of data types, I could run queries in Athena, including a query like this:

```
SELECT * FROM "gp550-load-database"."gp550-load-table-beta"
WHERE vid IN ('F_NORTH', 'F_EAST', 'F_WEST', 'F_SOUTH', 'F_SEAST')
	AND vtype='LOAD'
	AND time BETWEEN TIMESTAMP '2021-05-13 06:00:00' and TIMESTAMP '2022-05-13 06:00:00'
```

But now, with the data types changed, I get an error when trying to run a query like the one above:

```
"SYNTAX_ERROR: line 1:154: Cannot check if bigint is BETWEEN timestamp and timestamp
This query ran against the "gp550-load-database" database, unless qualified by the query. Please post the error message on our forum or contact customer support with Query Id: 2a5287bc-7ac2-43a8-b617-bf01c63b00d5"
```

If I then go into the table and change the data type back to "timestamp", I run the query and get a different error:

```
"HIVE_PARTITION_SCHEMA_MISMATCH: There is a mismatch between the table and partition schemas. The types are incompatible and cannot be coerced. The column 'time' in table 'gp550-load-database.gp550-load-table-beta' is declared as type 'timestamp', but partition 'year=2022/month=2/day=2' declared column 'time' as type 'bigint'.
This query ran against the "gp550-load-database" database, unless qualified by the query. Please post the error message on our forum or contact customer support with Query Id: f788ea2b-e274-43fe-a3d9-22d80a2bbbab"
```

With Glue crawlers, you can influence how the partitions are created. You can have the crawler modify the Data Catalog table, or not make changes to the table schema other than adding new partitions:

```
{
    "Version": 1.0,
    "CrawlerOutput": {
        "Partitions": { "AddOrUpdateBehavior": "InheritFromTable" },
        "Tables": { "AddOrUpdateBehavior": "MergeNewColumns" }
    }
}
```

Basically, this *InheritFromTable* behavior is what I am looking for with ETL jobs. Does anyone know what is happening?
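Since the mismatch centers on the time columns being written as bigint while the table declares timestamp, one hedged option inside the ETL job is to convert those columns explicitly before the sink so the written schema matches the manually defined table. This sketch assumes the frame is called dynamicDF and that the bigint values are epoch seconds; both are assumptions, not facts from the question.

```python
# Minimal sketch: convert the epoch columns back to Spark timestamps before the sink.
# Assumes the bigint values are epoch seconds; divide by 1000 first if they are milliseconds.
from pyspark.sql.functions import col, from_unixtime
from awsglue.dynamicframe import DynamicFrame

df = dynamicDF.toDF()
df = (
    df.withColumn("time", from_unixtime(col("time")).cast("timestamp"))
      .withColumn("timegmt", from_unixtime(col("timegmt")).cast("timestamp"))
)
dynamicDF_ts = DynamicFrame.fromDF(df, glueContext, "dynamicDF_ts")
```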
0 answers, 0 votes, 54 views, asked a month ago

Can't get Partitions to work with my Glue Data Catalog

I have S3 files that are uploaded to a single bucket. There are no folders or anything like that; it's just one file per hour uploaded to this bucket. I run a Glue ETL job on these files, do some transformations, and insert the data into a Glue Data Catalog table stored in a different bucket. I can then query that Glue Data Catalog with Athena, and that works.

What I would like to do is store the files in the S3 folder of the Data Catalog as YEAR/MONTH/DAY, using partitions. Even though the SOURCE data is just files uploaded every hour with no partitions, I want to store them in the Data Catalog WITH partitions. So I extracted YEAR, MONTH and DAY from the files during the Glue ETL job, created columns in my Data Catalog table accordingly, and marked them as partitions:

- Partition 1: YEAR
- Partition 2: MONTH
- Partition 3: DAY

The proper values are in these columns, and I have verified that. After creating the partitions I ran MSCK REPAIR TABLE on the table, and it came back with "Query Ok." I then ran my Glue ETL job. When I look in the S3 bucket I do not see folders created; I just see regular r-part files. When I click on the table schema it shows the columns YEAR, MONTH, DAY marked as partitions, but when I click on "View Partitions" it just shows:

```
year  month  day
No partitions found
```

What do I need to do? These are just CSV files. I can't control the process that uploads the raw data to S3; it is just going to store hourly files in a bucket. I can control the ETL job and the Data Catalog. When I try to query after creating the partitions and running MSCK REPAIR TABLE, no data is returned, yet I can go into the Data Catalog bucket, pull up one of the r-part files, and the data is there.
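For context, partition folders such as year=2022/month=6/day=1 only get created when the job's sink itself is told to partition the output; marking columns as partitions in the catalog does not by itself change how the files are written. A minimal sketch with placeholder frame and table names:

```python
# Minimal sketch: pass partitionKeys to the sink so the job writes
# year=/month=/day= folders instead of flat r-part files. Names are placeholders.
sink_node = glueContext.write_dynamic_frame.from_catalog(
    frame=transformed_dyf,
    database="my_catalog_db",
    table_name="my_catalog_table",
    additional_options={"partitionKeys": ["year", "month", "day"]},
    transformation_ctx="sink_node",
)
```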
1 answer, 0 votes, 34 views, asked a month ago

aws-glue-libs:glue_libs_3.0.0_image_01 image issue

I am running into issues with the amazon/aws-glue-libs:glue_libs_3.0.0_image_01 Docker image.

```
docker run -it -p 8888:8888 -p 4040:4040 -e DISABLE_SSL="true" -v C:/Docker/jupyter_workspace:/home/glue_user/workspace/jupyter_workspace/ --name glue_jupyter amazon/aws-glue-libs:glue_libs_3.0.0_image_01 /home/glue_user/jupyter/jupyter_start.sh
```

The container starts locally, but when I try to read a CSV file stored locally I get this error:

```
An error was encountered:
Path does not exist: file:/home/glue_user/workspace/employees.csv
Traceback (most recent call last):
  File "/home/glue_user/spark/python/pyspark/sql/readwriter.py", line 737, in csv
    return self._df(self._jreader.csv(self._spark._sc._jvm.PythonUtils.toSeq(path)))
  File "/home/glue_user/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/home/glue_user/spark/python/pyspark/sql/utils.py", line 117, in deco
    raise converted from None
pyspark.sql.utils.AnalysisException: Path does not exist: file:/home/glue_user/workspace/employees.csv
```

Or, when I try to start it with

```
docker run -it -p 8888:8888 -p 4040:4040 -e DISABLE_SSL="true" -v C:/Docker/jupyter_workspace:/home/glue_user/workspace --name glue_jupyter amazon/aws-glue-libs:glue_libs_3.0.0_image_01 /home/glue_user/jupyter/jupyter_start.sh
```

the container does not start and I get the following error:

```
Bad config encountered during initialization: No such directory: ''/home/glue_user/workspace/jupyter_workspace''
```
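A small diagnostic sketch for the first variant above, assuming the host folder is mounted at /home/glue_user/workspace/jupyter_workspace: the CSV has to be referenced under the mounted directory, not directly under /home/glue_user/workspace, and listing the directory from inside the notebook confirms whether the mount worked at all. It assumes the notebook session already exposes a `spark` SparkSession, as the Glue Jupyter image does.

```python
# Minimal sketch: confirm the mounted file is visible inside the container, then read it
# with a path under the mounted directory.
import os

mounted_dir = "/home/glue_user/workspace/jupyter_workspace"
print(os.listdir(mounted_dir))  # employees.csv should appear here if the -v mount worked

df = spark.read.csv(f"file://{mounted_dir}/employees.csv", header=True)
df.show(5)
```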
1 answer, 0 votes, 45 views, asked 2 months ago