Apache Hudi on Amazon EMR and AWS Database
I followed the tutorial from this link successfully, but when I try to do the same with different data and a different table, it fails. I receive this error:
[hadoop@ip-10-99-2-111 bin]$ spark-submit \
  --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
  --packages org.apache.hudi:hudi-utilities-bundle_2.11:0.5.2-incubating,org.apache.spark:spark-avro_2.11:2.4.5 \
  --master yarn --deploy-mode cluster \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.sql.hive.convertMetastoreParquet=false \
  /usr/lib/hudi/hudi-utilities-bundle_2.11-0.5.2-incubating.jar \
  --table-type COPY_ON_WRITE \
  --source-ordering-field dms_received_ts \
  --props s3://hudi-test-tt/properties/dfs-source-health-care-full.properties \
  --source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
  --target-base-path s3://hudi-test-tt/hudi/health_care \
  --target-table hudiblogdb.health_care \
  --transformer-class org.apache.hudi.utilities.transform.SqlQueryBasedTransformer \
  --payload-class org.apache.hudi.payload.AWSDmsAvroPayload \
  --schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider \
  --enable-hive-sync
Ivy Default Cache set to: /home/hadoop/.ivy2/cache
The jars for the packages stored in: /home/hadoop/.ivy2/jars
:: loading settings :: url = jar:file:/usr/lib/spark/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
org.apache.hudi#hudi-utilities-bundle_2.11 added as a dependency
org.apache.spark#spark-avro_2.11 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-fbc63aec-b48f-4ef4-bc38-f788919cf31c;1.0
confs: [default]
found org.apache.hudi#hudi-utilities-bundle_2.11;0.5.2-incubating in central
found org.apache.spark#spark-avro_2.11;2.4.5 in central
found org.spark-project.spark#unused;1.0.0 in central
:: resolution report :: resolve 270ms :: artifacts dl 7ms
:: modules in use:
org.apache.hudi#hudi-utilities-bundle_2.11;0.5.2-incubating from central in [default]
org.apache.spark#spark-avro_2.11;2.4.5 from central in [default]
org.spark-project.spark#unused;1.0.0 from central in [default]
---------------------------------------------------------------------
| | modules || artifacts |
| conf | number| search|dwnlded|evicted|| number|dwnlded|
---------------------------------------------------------------------
| default | 3 | 0 | 0 | 0 || 3 | 0 |
---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent-fbc63aec-b48f-4ef4-bc38-f788919cf31c
confs: [default]
0 artifacts copied, 3 already retrieved (0kB/7ms)
22/08/25 21:39:37 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
22/08/25 21:39:38 INFO RMProxy: Connecting to ResourceManager at ip-10-99-2-111.us-east-2.compute.internal/10.99.2.111:8032
22/08/25 21:39:38 INFO Client: Requesting a new application from cluster with 1 NodeManagers
22/08/25 21:39:38 INFO Client: Verifying our application has not requested more than the maximum memory capability of the cluster (12288 MB per container)
22/08/25 21:39:38 INFO Client: Will allocate AM container, with 2432 MB memory including 384 MB overhead
22/08/25 21:39:38 INFO Client: Setting up container launch context for our AM
22/08/25 21:39:38 INFO Client: Setting up the launch environment for our AM container
22/08/25 21:39:39 INFO Client: Preparing resources for our AM container
22/08/25 21:39:39 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
22/08/25 21:39:41 INFO Client: Uploading resource file:/mnt/tmp/spark-4c327077-6693-4371-9e41-10e2342e0200/__spark_libs__5969710364624957851.zip -> hdfs://ip-10-99-2-111.us-east-2.compute.internal:8020/user/hadoop/.sparkStaging/application_1661296163923_0003/__spark_libs__5969710364624957851.zip
22/08/25 21:39:41 INFO Client: Uploading resource file:/usr/lib/hudi/hudi-utilities-bundle_2.11-0.5.2-incubating.jar -> hdfs://ip-10-99-2-111.us-east-2.compute.internal:8020/user/hadoop/.sparkStaging/application_1661296163923_0003/hudi-utilities-bundle_2.11-0.5.2-incubating.jar
22/08/25 21:39:41 INFO Client: Uploading resource file:/home/hadoop/.ivy2/jars/org.apache.hudi_hudi-utilities-bundle_2.11-0.5.2-incubating.jar -> hdfs://ip-10-99-2-111.us-east-2.compute.internal:8020/user/hadoop/.sparkStaging/application_1661296163923_0003/org.apache.hudi_hudi-utilities-bundle_2.11-0.5.2-incubating.jar
22/08/25 21:39:41 INFO Client: Uploading resource file:/home/hadoop/.ivy2/jars/org.apache.spark_spark-avro_2.11-2.4.5.jar -> hdfs://ip-10-99-2-111.us-east-2.compute.internal:8020/user/hadoop/.sparkStaging/application_1661296163923_0003/org.apache.spark_spark-avro_2.11-2.4.5.jar
22/08/25 21:39:41 INFO Client: Uploading resource file:/home/hadoop/.ivy2/jars/org.spark-project.spark_unused-1.0.0.jar -> hdfs://ip-10-99-2-111.us-east-2.compute.internal:8020/user/hadoop/.sparkStaging/application_1661296163923_0003/org.spark-project.spark_unused-1.0.0.jar
22/08/25 21:39:41 INFO Client: Uploading resource file:/etc/spark/conf/hive-site.xml -> hdfs://ip-10-99-2-111.us-east-2.compute.internal:8020/user/hadoop/.sparkStaging/application_1661296163923_0003/hive-site.xml
22/08/25 21:39:42 INFO Client: Uploading resource file:/mnt/tmp/spark-4c327077-6693-4371-9e41-10e2342e0200/__spark_conf__6985991088000323368.zip -> hdfs://ip-10-99-2-111.us-east-2.compute.internal:8020/user/hadoop/.sparkStaging/application_1661296163923_0003/__spark_conf__.zip
22/08/25 21:39:42 INFO SecurityManager: Changing view acls to: hadoop
22/08/25 21:39:42 INFO SecurityManager: Changing modify acls to: hadoop
22/08/25 21:39:42 INFO SecurityManager: Changing view acls groups to:
22/08/25 21:39:42 INFO SecurityManager: Changing modify acls groups to:
22/08/25 21:39:42 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hadoop); groups with view permissions: Set(); users with modify permissions: Set(hadoop); groups with modify permissions: Set()
22/08/25 21:39:43 INFO Client: Submitting application application_1661296163923_0003 to ResourceManager
22/08/25 21:39:43 INFO YarnClientImpl: Submitted application application_1661296163923_0003
22/08/25 21:39:44 INFO Client: Application report for application_1661296163923_0003 (state: ACCEPTED)
22/08/25 21:39:44 INFO Client:
client token: N/A
diagnostics: AM container is launched, waiting for AM container to Register with RM
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: default
start time: 1661463583358
final status: UNDEFINED
tracking URL: http://ip-10-99-2-111.us-east-2.compute.internal:20888/proxy/application_1661296163923_0003/
user: hadoop
22/08/25 21:39:45 INFO Client: Application report for application_1661296163923_0003 (state: ACCEPTED)
22/08/25 21:39:46 INFO Client: Application report for application_1661296163923_0003 (state: ACCEPTED)
22/08/25 21:39:47 INFO Client: Application report for application_1661296163923_0003 (state: ACCEPTED)
22/08/25 21:39:48 INFO Client: Application report for application_1661296163923_0003 (state: ACCEPTED)
22/08/25 21:39:49 INFO Client: Application report for application_1661296163923_0003 (state: ACCEPTED)
22/08/25 21:39:50 INFO Client: Application report for application_1661296163923_0003 (state: RUNNING)
22/08/25 21:39:50 INFO Client:
client token: N/A
diagnostics: N/A
ApplicationMaster host: ip-10-99-2-253.us-east-2.compute.internal
ApplicationMaster RPC port: 33179
queue: default
start time: 1661463583358
final status: UNDEFINED
tracking URL: http://ip-10-99-2-111.us-east-2.compute.internal:20888/proxy/application_1661296163923_0003/
user: hadoop
22/08/25 21:39:51 INFO Client: Application report for application_1661296163923_0003 (state: RUNNING)
22/08/25 21:39:52 INFO Client: Application report for application_1661296163923_0003 (state: RUNNING)
22/08/25 21:39:53 INFO Client: Application report for application_1661296163923_0003 (state: RUNNING)
22/08/25 21:39:54 INFO Client: Application report for application_1661296163923_0003 (state: RUNNING)
22/08/25 21:39:55 INFO Client: Application report for application_1661296163923_0003 (state: ACCEPTED)
22/08/25 21:39:55 INFO Client:
client token: N/A
diagnostics: AM container is launched, waiting for AM container to Register with RM
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: default
start time: 1661463583358
final status: UNDEFINED
tracking URL: http://ip-10-99-2-111.us-east-2.compute.internal:20888/proxy/application_1661296163923_0003/
user: hadoop
22/08/25 21:39:56 INFO Client: Application report for application_1661296163923_0003 (state: ACCEPTED)
22/08/25 21:39:57 INFO Client: Application report for application_1661296163923_0003 (state: ACCEPTED)
22/08/25 21:39:58 INFO Client: Application report for application_1661296163923_0003 (state: ACCEPTED)
22/08/25 21:39:59 INFO Client: Application report for application_1661296163923_0003 (state: ACCEPTED)
22/08/25 21:40:00 INFO Client: Application report for application_1661296163923_0003 (state: RUNNING)
22/08/25 21:40:00 INFO Client:
client token: N/A
diagnostics: N/A
ApplicationMaster host: ip-10-99-2-253.us-east-2.compute.internal
ApplicationMaster RPC port: 34591
queue: default
start time: 1661463583358
final status: UNDEFINED
tracking URL: http://ip-10-99-2-111.us-east-2.compute.internal:20888/proxy/application_1661296163923_0003/
user: hadoop
22/08/25 21:40:01 INFO Client: Application report for application_1661296163923_0003 (state: RUNNING)
22/08/25 21:40:02 INFO Client: Application report for application_1661296163923_0003 (state: RUNNING)
22/08/25 21:40:03 INFO Client: Application report for application_1661296163923_0003 (state: RUNNING)
22/08/25 21:40:04 INFO Client: Application report for application_1661296163923_0003 (state: FINISHED)
22/08/25 21:40:04 INFO Client:
client token: N/A
diagnostics: User class threw exception: java.io.IOException: Could not load schema provider class org.apache.hudi.utilities.schema.FilebasedSchemaProvider
at org.apache.hudi.utilities.UtilHelpers.createSchemaProvider(UtilHelpers.java:101)
at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer$DeltaSyncService.<init>(HoodieDeltaStreamer.java:364)
at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.<init>(HoodieDeltaStreamer.java:95)
at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.<init>(HoodieDeltaStreamer.java:89)
at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.main(HoodieDeltaStreamer.java:294)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:685)
Caused by: org.apache.hudi.exception.HoodieException: Unable to instantiate class
at org.apache.hudi.common.util.ReflectionUtils.loadClass(ReflectionUtils.java:80)
at org.apache.hudi.common.util.ReflectionUtils.loadClass(ReflectionUtils.java:89)
at org.apache.hudi.utilities.UtilHelpers.createSchemaProvider(UtilHelpers.java:99)
... 9 more
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.hudi.common.util.ReflectionUtils.loadClass(ReflectionUtils.java:78)
... 11 more
Caused by: org.apache.hudi.exception.HoodieNotSupportedException: Required property hoodie.deltastreamer.schemaprovider.source.schema.file is missing
at org.apache.hudi.DataSourceUtils.lambda$checkRequiredProperties$1(DataSourceUtils.java:173)
at java.util.Collections$SingletonList.forEach(Collections.java:4824)
at org.apache.hudi.DataSourceUtils.checkRequiredProperties(DataSourceUtils.java:171)
at org.apache.hudi.utilities.schema.FilebasedSchemaProvider.<init>(FilebasedSchemaProvider.java:55)
... 16 more
ApplicationMaster host: ip-10-99-2-253.us-east-2.compute.internal
ApplicationMaster RPC port: 34591
queue: default
start time: 1661463583358
final status: FAILED
tracking URL: http://ip-10-99-2-111.us-east-2.compute.internal:20888/proxy/application_1661296163923_0003/
user: hadoop
22/08/25 21:40:04 ERROR Client: Application diagnostics message: User class threw exception: java.io.IOException: Could not load schema provider class org.apache.hudi.utilities.schema.FilebasedSchemaProvider
at org.apache.hudi.utilities.UtilHelpers.createSchemaProvider(UtilHelpers.java:101)
at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer$DeltaSyncService.<init>(HoodieDeltaStreamer.java:364)
at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.<init>(HoodieDeltaStreamer.java:95)
at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.<init>(HoodieDeltaStreamer.java:89)
at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.main(HoodieDeltaStreamer.java:294)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:685)
Caused by: org.apache.hudi.exception.HoodieException: Unable to instantiate class
at org.apache.hudi.common.util.ReflectionUtils.loadClass(ReflectionUtils.java:80)
at org.apache.hudi.common.util.ReflectionUtils.loadClass(Refl
Asked 2 years ago · 538 views

1 Answer
Hi,
You need to specify the property hoodie.deltastreamer.schemaprovider.source.schema.file, as can be seen in the log trace ("Required property hoodie.deltastreamer.schemaprovider.source.schema.file is missing").
Let me know if you succeed!
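As a rough sketch, the properties file passed via --props (s3://hudi-test-tt/properties/dfs-source-health-care-full.properties in your command) would need entries along these lines. The schema file path and field names below are placeholders for illustration, not values from your setup:

```
# Required by FilebasedSchemaProvider: Avro schema of the incoming source data
hoodie.deltastreamer.schemaprovider.source.schema.file=s3://hudi-test-tt/schema/health_care_source.avsc

# Target schema; often the same file when the transformer does not change the shape
hoodie.deltastreamer.schemaprovider.target.schema.file=s3://hudi-test-tt/schema/health_care_target.avsc

# Example Hudi write settings (adjust to your table's key and partition columns)
hoodie.datasource.write.recordkey.field=id
hoodie.datasource.write.partitionpath.field=dms_received_ts
```

Since the stack trace shows FilebasedSchemaProvider failing in its constructor on exactly that missing key, double-check that the properties file the job actually reads contains the source.schema.file entry and that the S3 path it points to is reachable from the cluster.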
Answered 2 years ago
To make this work, I already updated the schema and properties files in S3 for the new data and uploaded the new data to S3 correctly. Do I need to do anything else for this to work?