- Newest
- Most votes
- Most comments
As per my understanding, EMR doesn't include the Spark-Kinesis connector by default. If you're using Spark Structured Streaming, make sure you've added the connector like this: --packages org.apache.spark:spark-sql-kinesis_2.12:3.5.0 (Use a version compatible with Spark 3.5.4 and EMR 7.8.0 — let me know if you need help with the exact version.)
IAM Permissions: Check that your EMR EC2 instance profile has the following permissions: kinesis:DescribeStream kinesis:GetRecords kinesis:GetShardIterator kinesis:ListStreams Also ensure that the role is properly attached and assumed by the job (especially when submitting from Livy or notebooks).
Networking / VPC: If your Kinesis stream is in a different VPC or region, or if EMR is in a private subnet, make sure there are no VPC endpoint or NACL restrictions blocking access.
Logs: Check the logs in s3://yourbucketname/booking/ for: The full exception message (e.g., StreamingQueryException, FileNotFoundException, etc.) The job configuration or Spark submit command — missing config values like streamName or region could cause this.
All neccesary permissions are apllied, I have add a jar spark-streaming-sql-kinesis-connector_2.12-1.1.0.jar, VPC is correct, the full error sounds like " [DATA_SOURCE_NOT_FOUND] Failed to find the data source: kinesis. Please find packages at
https://spark.apache.org/third-party-projects.html
. at org.apache.spark.sql.errors.QueryExecutionErrors$.dataSourceNotFoundError(QueryExecutionErrors.scala:726)" even if package was added
To debug your EMR cluster failure and Kinesis-to-S3 pipeline issues, you can follow these steps:
-
Check the EMR cluster logs: Navigate to the S3 bucket where your logs are stored (s3://test-lab-bucket-s3/booking/). Look for the master node logs, particularly the bootstrap action logs and application logs for Spark, Hive, and Livy. These logs may contain more detailed error messages about the "data source not found" errors.
-
Examine the last state change: In the EMR console, view the details pane for your cluster. Look for the "Last State Change" information, which can provide insights into why the cluster failed. You can also use the AWS CLI with the 'describe-cluster' command to get this information.
-
Verify Kinesis data source configuration: Since the cluster can't see the Kinesis data source, double-check your Kinesis stream configuration and ensure that the correct stream name and region are specified in your EMR applications' configurations.
-
Check IAM roles and permissions: Although you mentioned that all necessary permissions are applied, it's worth double-checking the IAM roles associated with your EMR cluster. Ensure they have the required permissions to access both Kinesis and S3.
-
Use development mode for deeper investigation: If the issue persists, consider launching your EMR cluster in development mode. To do this, use the AWS CLI or SDK to set the "--additional-info" parameter to '{"clusterType":"development"}' when creating the cluster. This will keep the cluster alive even if provisioning fails, allowing you to SSH into the primary node for more detailed troubleshooting.
-
Analyze application-specific logs: Once you have access to the primary node (either through development mode or by examining S3 logs), look into application-specific logs:
- For Spark: Check /var/log/spark/ directory
- For Hive: Check /var/log/hive/ directory
- For Livy: Check /var/log/livy/ directory
-
Verify compatibility: Ensure that your custom AMI (if used) is compatible with EMR 7.x, which requires Amazon Linux 2023-based images.
-
Check for timeouts: If the cluster is timing out during bootstrap, examine the EC2 instance system logs for any service failures or startup issues.
By systematically going through these steps, you should be able to identify the root cause of your EMR cluster failure and Kinesis data source issues. Remember to look for specific error messages related to data source connectivity, permissions, or configuration problems in the various log files.
Sources
EMR Cluster failure with "On the master instance, application provisioning failed" | AWS re:Post
EMR Cluster failure with "Failed to start the job flow due to an internal error" | AWS re:Post
Step 3: Look at the last state change - Amazon EMR
hi Sabina,
can you share your .jar is added? check all jars, also version mismatch:->
print(sc._jsc.sc().listJars())
This will print a list of the JAR file paths. Check if your Kinesis connector JAR is in this list.
Also,
spark-shell --version
spark-submit --version
OR
use --packages to automatically resolve dependencies:
spark-submit \ --packages software.amazon.kinesis:spark-sql-kinesis_2.12:1.1.0 \ --class com.example.Main \ your-app.jar <application_arguments>
Best,
Relevant content
- asked 6 months ago
- AWS OFFICIALUpdated 2 years ago
- AWS OFFICIALUpdated 8 months ago
https://aws.amazon.com/blogs/big-data/build-spark-structured-streaming-applications-with-the-open-source-connector-for-amazon-kinesis-data-streams/
You can check if the connector JAR is loaded with:
print(spark.conf.get("spark.jars"))