
Questions tagged with Data Lakes


Querying Latest Available Partition

I am building an ETL pipeline using primarily state machines, Athena, and the Glue Catalog. In general, things work in the following way:

1. A table, partitioned by "version", exists in the Glue Catalog. The table represents the output destination of some ETL process.
2. A step function (managed by some other process) executes "INSERT INTO" Athena queries. The step function supplies a "version" that is used as part of the "INSERT INTO" query so that new data can be appended into the table defined in (1). The table contains all "versions" - it is a historical table that grows over time.

My question is: what is a good way of exposing a view/table that allows someone (or something) to query only the latest "version" partition for a given historically partitioned table?

I've looked into other table types AWS offers, including Governed tables and Iceberg tables. Each seems to have some incompatibility with our existing or planned future architecture:

1. Governed tables do not support writes via Athena INSERT queries; only Glue ETL/Spark seems to be supported at the moment.
2. Iceberg tables do not support Lake Formation data filters (which we'd like to use in the future to control data access).
3. Iceberg tables also seem to have poor write performance. Anecdotally, it can take several seconds to insert a small handful of rows into a given Iceberg table, so I'd worry about future performance when we want to insert a million rows.

Any guidance would be appreciated!
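One pattern that may help here (a sketch under assumptions, not a definitive answer): define an Athena view over the historical table that restricts it to the maximum "version" via a scalar subquery, so consumers only ever see the newest partition. The database, table, and column names below are placeholders, not names from the original post:

```
-- Minimal sketch: expose only the latest "version" partition through a view.
-- "etl_db.output_table" and the string-typed "version" partition column are
-- hypothetical names standing in for the table described in the question.
CREATE OR REPLACE VIEW etl_db.output_table_latest AS
SELECT t.*
FROM etl_db.output_table t
WHERE t.version = (SELECT max(version) FROM etl_db.output_table);
```

Two caveats to verify against your data: max() on a string partition column compares lexicographically, so this only returns the true latest partition if version values sort chronologically, and the subquery is re-evaluated on every read of the view.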
1 answer · 0 votes · 51 views · asked a month ago

Apache Hudi on Amazon EMR and AWS Database Migration Service

I completed the tutorial from this [link](https://aws.amazon.com/pt/blogs/big-data/apply-record-level-changes-from-relational-databases-to-amazon-s3-data-lake-using-apache-hudi-on-amazon-emr-and-aws-database-migration-service/) successfully, but when I try to do the same with other data and another table I don't have success. I receive this error:

```
hadoop@ip-10-99-2-111 bin]$ spark-submit --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer --packages org.apache.hudi:hudi-utilities-bundle_2.11:0.5.2-incubating,org.apache.spark:spark-avro_2.11:2.4.5 --master yarn --deploy-mode cluster --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.sql.hive.convertMetastoreParquet=false /usr/lib/hudi/hudi-utilities-bundle_2.11-0.5.2-incubating.jar --table-type COPY_ON_WRITE --source-ordering-field dms_received_ts --props s3://hudi-test-tt/properties/dfs-source-health-care-full.properties --source-class org.apache.hudi.utilities.sources.ParquetDFSSource --target-base-path s3://hudi-test-tt/hudi/health_care --target-table hudiblogdb.health_care --transformer-class org.apache.hudi.utilities.transform.SqlQueryBasedTransformer --payload-class org.apache.hudi.payload.AWSDmsAvroPayload --schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider --enable-hive-sync
Ivy Default Cache set to: /home/hadoop/.ivy2/cache
The jars for the packages stored in: /home/hadoop/.ivy2/jars
:: loading settings :: url = jar:file:/usr/lib/spark/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
org.apache.hudi#hudi-utilities-bundle_2.11 added as a dependency
org.apache.spark#spark-avro_2.11 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-fbc63aec-b48f-4ef4-bc38-f788919cf31c;1.0
confs: [default]
found org.apache.hudi#hudi-utilities-bundle_2.11;0.5.2-incubating in central
found org.apache.spark#spark-avro_2.11;2.4.5 in central
found org.spark-project.spark#unused;1.0.0 in central
:: resolution report :: resolve 270ms :: artifacts dl 7ms
:: modules in use:
org.apache.hudi#hudi-utilities-bundle_2.11;0.5.2-incubating from central in [default]
org.apache.spark#spark-avro_2.11;2.4.5 from central in [default]
org.spark-project.spark#unused;1.0.0 from central in [default]
---------------------------------------------------------------------
| | modules || artifacts |
| conf | number| search|dwnlded|evicted|| number|dwnlded|
---------------------------------------------------------------------
| default | 3 | 0 | 0 | 0 || 3 | 0 |
---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent-fbc63aec-b48f-4ef4-bc38-f788919cf31c
confs: [default]
0 artifacts copied, 3 already retrieved (0kB/7ms)
22/08/25 21:39:37 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform...
using builtin-java classes where applicable 22/08/25 21:39:38 INFO RMProxy: Connecting to ResourceManager at ip-10-99-2-111.us-east-2.compute.internal/10.99.2.111:8032 22/08/25 21:39:38 INFO Client: Requesting a new application from cluster with 1 NodeManagers 22/08/25 21:39:38 INFO Client: Verifying our application has not requested more than the maximum memory capability of the cluster (12288 MB per container) 22/08/25 21:39:38 INFO Client: Will allocate AM container, with 2432 MB memory including 384 MB overhead 22/08/25 21:39:38 INFO Client: Setting up container launch context for our AM 22/08/25 21:39:38 INFO Client: Setting up the launch environment for our AM container 22/08/25 21:39:39 INFO Client: Preparing resources for our AM container 22/08/25 21:39:39 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME. 22/08/25 21:39:41 INFO Client: Uploading resource file:/mnt/tmp/spark-4c327077-6693-4371-9e41-10e2342e0200/__spark_libs__5969710364624957851.zip -> hdfs://ip-10-99-2-111.us-east-2.compute.internal:8020/user/hadoop/.sparkStaging/application_1661296163923_0003/__spark_libs__5969710364624957851.zip 22/08/25 21:39:41 INFO Client: Uploading resource file:/usr/lib/hudi/hudi-utilities-bundle_2.11-0.5.2-incubating.jar -> hdfs://ip-10-99-2-111.us-east-2.compute.internal:8020/user/hadoop/.sparkStaging/application_1661296163923_0003/hudi-utilities-bundle_2.11-0.5.2-incubating.jar 22/08/25 21:39:41 INFO Client: Uploading resource file:/home/hadoop/.ivy2/jars/org.apache.hudi_hudi-utilities-bundle_2.11-0.5.2-incubating.jar -> hdfs://ip-10-99-2-111.us-east-2.compute.internal:8020/user/hadoop/.sparkStaging/application_1661296163923_0003/org.apache.hudi_hudi-utilities-bundle_2.11-0.5.2-incubating.jar 22/08/25 21:39:41 INFO Client: Uploading resource file:/home/hadoop/.ivy2/jars/org.apache.spark_spark-avro_2.11-2.4.5.jar -> hdfs://ip-10-99-2-111.us-east-2.compute.internal:8020/user/hadoop/.sparkStaging/application_1661296163923_0003/org.apache.spark_spark-avro_2.11-2.4.5.jar 22/08/25 21:39:41 INFO Client: Uploading resource file:/home/hadoop/.ivy2/jars/org.spark-project.spark_unused-1.0.0.jar -> hdfs://ip-10-99-2-111.us-east-2.compute.internal:8020/user/hadoop/.sparkStaging/application_1661296163923_0003/org.spark-project.spark_unused-1.0.0.jar 22/08/25 21:39:41 INFO Client: Uploading resource file:/etc/spark/conf/hive-site.xml -> hdfs://ip-10-99-2-111.us-east-2.compute.internal:8020/user/hadoop/.sparkStaging/application_1661296163923_0003/hive-site.xml 22/08/25 21:39:42 INFO Client: Uploading resource file:/mnt/tmp/spark-4c327077-6693-4371-9e41-10e2342e0200/__spark_conf__6985991088000323368.zip -> hdfs://ip-10-99-2-111.us-east-2.compute.internal:8020/user/hadoop/.sparkStaging/application_1661296163923_0003/__spark_conf__.zip 22/08/25 21:39:42 INFO SecurityManager: Changing view acls to: hadoop 22/08/25 21:39:42 INFO SecurityManager: Changing modify acls to: hadoop 22/08/25 21:39:42 INFO SecurityManager: Changing view acls groups to: 22/08/25 21:39:42 INFO SecurityManager: Changing modify acls groups to: 22/08/25 21:39:42 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hadoop); groups with view permissions: Set(); users with modify permissions: Set(hadoop); groups with modify permissions: Set() 22/08/25 21:39:43 INFO Client: Submitting application application_1661296163923_0003 to ResourceManager 22/08/25 21:39:43 INFO YarnClientImpl: Submitted application 
application_1661296163923_0003 22/08/25 21:39:44 INFO Client: Application report for application_1661296163923_0003 (state: ACCEPTED) 22/08/25 21:39:44 INFO Client: client token: N/A diagnostics: AM container is launched, waiting for AM container to Register with RM ApplicationMaster host: N/A ApplicationMaster RPC port: -1 queue: default start time: 1661463583358 final status: UNDEFINED tracking URL: http://ip-10-99-2-111.us-east-2.compute.internal:20888/proxy/application_1661296163923_0003/ user: hadoop 22/08/25 21:39:45 INFO Client: Application report for application_1661296163923_0003 (state: ACCEPTED) 22/08/25 21:39:46 INFO Client: Application report for application_1661296163923_0003 (state: ACCEPTED) 22/08/25 21:39:47 INFO Client: Application report for application_1661296163923_0003 (state: ACCEPTED) 22/08/25 21:39:48 INFO Client: Application report for application_1661296163923_0003 (state: ACCEPTED) 22/08/25 21:39:49 INFO Client: Application report for application_1661296163923_0003 (state: ACCEPTED) 22/08/25 21:39:50 INFO Client: Application report for application_1661296163923_0003 (state: RUNNING) 22/08/25 21:39:50 INFO Client: client token: N/A diagnostics: N/A ApplicationMaster host: ip-10-99-2-253.us-east-2.compute.internal ApplicationMaster RPC port: 33179 queue: default start time: 1661463583358 final status: UNDEFINED tracking URL: http://ip-10-99-2-111.us-east-2.compute.internal:20888/proxy/application_1661296163923_0003/ user: hadoop 22/08/25 21:39:51 INFO Client: Application report for application_1661296163923_0003 (state: RUNNING) 22/08/25 21:39:52 INFO Client: Application report for application_1661296163923_0003 (state: RUNNING) 22/08/25 21:39:53 INFO Client: Application report for application_1661296163923_0003 (state: RUNNING) 22/08/25 21:39:54 INFO Client: Application report for application_1661296163923_0003 (state: RUNNING) 22/08/25 21:39:55 INFO Client: Application report for application_1661296163923_0003 (state: ACCEPTED) 22/08/25 21:39:55 INFO Client: client token: N/A diagnostics: AM container is launched, waiting for AM container to Register with RM ApplicationMaster host: N/A ApplicationMaster RPC port: -1 queue: default start time: 1661463583358 final status: UNDEFINED tracking URL: http://ip-10-99-2-111.us-east-2.compute.internal:20888/proxy/application_1661296163923_0003/ user: hadoop 22/08/25 21:39:56 INFO Client: Application report for application_1661296163923_0003 (state: ACCEPTED) 22/08/25 21:39:57 INFO Client: Application report for application_1661296163923_0003 (state: ACCEPTED) 22/08/25 21:39:58 INFO Client: Application report for application_1661296163923_0003 (state: ACCEPTED) 22/08/25 21:39:59 INFO Client: Application report for application_1661296163923_0003 (state: ACCEPTED) 22/08/25 21:40:00 INFO Client: Application report for application_1661296163923_0003 (state: RUNNING) 22/08/25 21:40:00 INFO Client: client token: N/A diagnostics: N/A ApplicationMaster host: ip-10-99-2-253.us-east-2.compute.internal ApplicationMaster RPC port: 34591 queue: default start time: 1661463583358 final status: UNDEFINED tracking URL: http://ip-10-99-2-111.us-east-2.compute.internal:20888/proxy/application_1661296163923_0003/ user: hadoop 22/08/25 21:40:01 INFO Client: Application report for application_1661296163923_0003 (state: RUNNING) 22/08/25 21:40:02 INFO Client: Application report for application_1661296163923_0003 (state: RUNNING) 22/08/25 21:40:03 INFO Client: Application report for application_1661296163923_0003 (state: RUNNING) 22/08/25 
21:40:04 INFO Client: Application report for application_1661296163923_0003 (state: FINISHED) 22/08/25 21:40:04 INFO Client: client token: N/A diagnostics: User class threw exception: java.io.IOException: Could not load schema provider class org.apache.hudi.utilities.schema.FilebasedSchemaProvider at org.apache.hudi.utilities.UtilHelpers.createSchemaProvider(UtilHelpers.java:101) at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer$DeltaSyncService.<init>(HoodieDeltaStreamer.java:364) at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.<init>(HoodieDeltaStreamer.java:95) at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.<init>(HoodieDeltaStreamer.java:89) at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.main(HoodieDeltaStreamer.java:294) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:685) Caused by: org.apache.hudi.exception.HoodieException: Unable to instantiate class at org.apache.hudi.common.util.ReflectionUtils.loadClass(ReflectionUtils.java:80) at org.apache.hudi.common.util.ReflectionUtils.loadClass(ReflectionUtils.java:89) at org.apache.hudi.utilities.UtilHelpers.createSchemaProvider(UtilHelpers.java:99) ... 9 more Caused by: java.lang.reflect.InvocationTargetException at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:423) at org.apache.hudi.common.util.ReflectionUtils.loadClass(ReflectionUtils.java:78) ... 11 more Caused by: org.apache.hudi.exception.HoodieNotSupportedException: Required property hoodie.deltastreamer.schemaprovider.source.schema.file is missing at org.apache.hudi.DataSourceUtils.lambda$checkRequiredProperties$1(DataSourceUtils.java:173) at java.util.Collections$SingletonList.forEach(Collections.java:4824) at org.apache.hudi.DataSourceUtils.checkRequiredProperties(DataSourceUtils.java:171) at org.apache.hudi.utilities.schema.FilebasedSchemaProvider.<init>(FilebasedSchemaProvider.java:55) ... 
16 more ApplicationMaster host: ip-10-99-2-253.us-east-2.compute.internal ApplicationMaster RPC port: 34591 queue: default start time: 1661463583358 final status: FAILED tracking URL: http://ip-10-99-2-111.us-east-2.compute.internal:20888/proxy/application_1661296163923_0003/ user: hadoop 22/08/25 21:40:04 ERROR Client: Application diagnostics message: User class threw exception: java.io.IOException: Could not load schema provider class org.apache.hudi.utilities.schema.FilebasedSchemaProvider at org.apache.hudi.utilities.UtilHelpers.createSchemaProvider(UtilHelpers.java:101) at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer$DeltaSyncService.<init>(HoodieDeltaStreamer.java:364) at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.<init>(HoodieDeltaStreamer.java:95) at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.<init>(HoodieDeltaStreamer.java:89) at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.main(HoodieDeltaStreamer.java:294) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:685) Caused by: org.apache.hudi.exception.HoodieException: Unable to instantiate class at org.apache.hudi.common.util.ReflectionUtils.loadClass(ReflectionUtils.java:80) at org.apache.hudi.common.util.ReflectionUtils.loadClass(Refl ```
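The decisive line is near the end of the stack trace: `Required property hoodie.deltastreamer.schemaprovider.source.schema.file is missing`. The `FilebasedSchemaProvider` passed via `--schemaprovider-class` reads the source (and optionally target) Avro schema locations from the properties file supplied with `--props`, so that entry has to exist in `dfs-source-health-care-full.properties`. A minimal sketch of the relevant entries follows; the S3 paths and field names are placeholders, not values from the original setup:

```
# Sketch of the schema-provider entries expected by FilebasedSchemaProvider.
# Replace the S3 paths and field names with the ones for your table.
hoodie.deltastreamer.schemaprovider.source.schema.file=s3://hudi-test-tt/schema/health_care_source.avsc
hoodie.deltastreamer.schemaprovider.target.schema.file=s3://hudi-test-tt/schema/health_care_target.avsc
# Root folder of the parquet files produced by the DMS full load / CDC task
hoodie.deltastreamer.source.dfs.root=s3://<dms-output-bucket>/<schema>/health_care
# Hudi record key and partition path for the new table
hoodie.datasource.write.recordkey.field=<primary-key-column>
hoodie.datasource.write.partitionpath.field=<partition-column>
```

The blog post's sample properties file carries the same schemaprovider entries for its table, so copying that pattern and pointing it at schema files generated for the new table should clear this particular exception.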
1 answer · 0 votes · 21 views · asked a month ago

Connecting Users to AWS Athena and AWS Lake Formation via Tableau Desktop using the Simba Athena JDBC Driver and Okta as Identity Provider

Hello, according to the step-by-step guide in the official AWS Athena user guide (link at the end of the question), it should be possible to connect Tableau Desktop to Athena and Lake Formation via the Simba Athena JDBC driver using Okta as the IdP. The challenge I am facing right now is that although I followed each step as documented in the Athena user guide, I cannot get the connection to work. The error message I receive whenever I try to connect from Tableau Desktop states:

> [Simba][AthenaJDBC](100071) An error has been thrown from the AWS Athena client. The security token included in the request is invalid. [Execution ID not available] Invalid Username or Password.

My athena.properties file, used to configure the driver in Tableau via a connection string URL, looks as follows (user name and password are masked):

```
jdbc:awsathena://AwsRegion=eu-central-1;
S3OutputLocation=s3://athena-query-results;
AwsCredentialsProviderClass=com.simba.athena.iamsupport.plugin.OktaCredentialsProvider;
idp_host=1234.okta.com;
User=*****.*****@example.com;
Password=******************;
app_id=****************************;
ssl_insecure=true;
okta_mfa_type=oktaverifywithpush;
LakeFormationEnabled=true;
```

The configuration settings used here are from the official Simba Athena JDBC driver documentation (version 2.0.31). Furthermore, I assigned the required permissions for my users and groups inside Lake Formation as stated in the step-by-step guide linked below. Right now I cannot figure out why the connection does not work, so I would be very grateful for any support or ideas for finding a solution.

Best regards

Link: https://docs.aws.amazon.com/athena/latest/ug/security-athena-lake-formation-jdbc-okta-tutorial.html#security-athena-lake-formation-jdbc-okta-tutorial-step-1-create-an-okta-account
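One detail worth double-checking (an assumption on my part, not a confirmed fix): when the settings are passed to Tableau as a connection string URL rather than loaded from an athena.properties file, the driver generally expects a single line of semicolon-separated key=value pairs with no spaces or line breaks between them, along these lines:

```
jdbc:awsathena://AwsRegion=eu-central-1;S3OutputLocation=s3://athena-query-results;AwsCredentialsProviderClass=com.simba.athena.iamsupport.plugin.OktaCredentialsProvider;idp_host=1234.okta.com;User=*****.*****@example.com;Password=******************;app_id=****************************;ssl_insecure=true;okta_mfa_type=oktaverifywithpush;LakeFormationEnabled=true;
```

The values are the masked ones from the question. If the single-line form still fails, the "Invalid Username or Password" part of the message suggests the Okta authentication step itself (user, password, or app_id) rather than the Lake Formation permissions.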
0 answers · 0 votes · 79 views · asked 3 months ago