
Kinesis Firehose to Apache Iceberg — Input Success but No Data Delivered


Hello,

I have configured an Amazon Kinesis Data Firehose delivery stream to deliver data directly into an Apache Iceberg table, using the Direct PUT method. The destination is set as an Iceberg table registered via AWS Glue Resource Links.


Environment Setup

  • Delivery Stream: PUT-ICE-S3-ICEBERG
  • Buffer Settings: 1 MiB or 60 seconds
  • Inline JSON Parsing: Enabled
  • Operation: "insert"
  • Target Database: example_namespace
  • Target Table (via Resource Link): iceberg_table_link
  • Actual S3-Managed Table: iceberg_managed_table
  • Unique Key for Upserts: id
  • S3 Error Output Prefix: s3://my-error-logs-bucket/errors/

Observed Behavior

  • Input Bytes metric shows successful ingestion to Firehose.
  • No throttled records are reported.
  • However, the "Delivery to Apache Iceberg Tables (Success)" metric stays flat; the console shows "No data available. Try adjusting the dashboard time range."
  • No .error logs appear in the configured S3 error output path.
  • No records are available via Athena queries on the Iceberg table (iceberg_table_link).
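One way to confirm what Firehose itself reports is to pull the stream's CloudWatch metrics programmatically rather than relying on the console dashboard. A minimal sketch follows; the metric name `DeliveryToIceberg.Success` is an assumption inferred from the console label above, so verify the exact name in the CloudWatch metrics list for the stream:

```python
from datetime import datetime, timedelta, timezone

def metric_query(stream_name, metric_name, hours=3):
    """Build kwargs for CloudWatch get_metric_statistics for a Firehose stream."""
    now = datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/Firehose",
        "MetricName": metric_name,
        "Dimensions": [{"Name": "DeliveryStreamName", "Value": stream_name}],
        "StartTime": now - timedelta(hours=hours),
        "EndTime": now,
        "Period": 300,  # 5-minute buckets
        "Statistics": ["Sum"],
    }

# Usage (requires boto3 and AWS credentials):
# import boto3
# cw = boto3.client("cloudwatch", region_name="us-east-1")
# for name in ("IncomingBytes", "DeliveryToIceberg.Success"):  # second name assumed
#     resp = cw.get_metric_statistics(**metric_query("PUT-ICE-S3-ICEBERG", name))
#     print(name, sorted(dp["Sum"] for dp in resp["Datapoints"]))
```

Comparing the incoming-bytes series against the delivery-success series over the same window makes the gap between ingestion and delivery concrete.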

Access Configuration

IAM Role attached to the Firehose stream:
arn:aws:iam::123456789012:role/firehose_delivery_role

IAM Policy attached:

Glue and Lake Formation Permissions

{
  "Effect": "Allow",
  "Action": [
    "glue:GetTable",
    "glue:GetDatabase",
    "glue:UpdateTable"
  ],
  "Resource": [
    "arn:aws:glue:us-east-1:123456789012:catalog",
    "arn:aws:glue:us-east-1:123456789012:database/*",
    "arn:aws:glue:us-east-1:123456789012:table/*/*"
  ]
}
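One thing not shown in the statements above, and worth checking against the current Firehose documentation: when the destination table is governed by Lake Formation, the delivery role generally also needs `lakeformation:GetDataAccess` to obtain temporary credentials for the table's S3 location. A statement along these lines (sketched here as a suggestion, not part of the original configuration) would grant it:

```json
{
  "Effect": "Allow",
  "Action": ["lakeformation:GetDataAccess"],
  "Resource": "*"
}
```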

S3 Permissions

{
  "Effect": "Allow",
  "Action": [
    "s3:PutObject",
    "s3:GetObject",
    "s3:ListBucket",
    "s3:AbortMultipartUpload",
    "s3:DeleteObject"
  ],
  "Resource": [
    "arn:aws:s3:::my-data-bucket",
    "arn:aws:s3:::my-data-bucket/*"
  ]
}

Lake Formation permissions:
The firehose_delivery_role has ALL Lake Formation permissions on:

  • Database example_namespace
  • Table iceberg_managed_table (via resource link iceberg_table_link)
  • The corresponding S3 location

Data Format

Each record is JSON with partition fields included:

{
  "id": "uuid",
  "dataset_id": "ds1",
  "placement": "top",
...
  "time_window_start": "2025-02-15T00:00:00Z",
  "year": 2025,
  "month": 2,
  "day": 15
}
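Since silent drops are often caused by type mismatches between incoming JSON and the table schema, it can help to validate records client-side before sending. A minimal sketch using only the fields shown above; the expected types are assumptions about the Iceberg table schema and should be adjusted to the real one:

```python
# Expected field types, inferred from the sample record above; replace with the
# actual Iceberg table schema (e.g. from the Glue console or Athena DESCRIBE).
EXPECTED_TYPES = {
    "id": str,
    "dataset_id": str,
    "placement": str,
    "time_window_start": str,
    "year": int,
    "month": int,
    "day": int,
}

def validate_record(record):
    """Return a list of problems; an empty list means the record looks deliverable."""
    problems = []
    for field, expected in EXPECTED_TYPES.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            problems.append(
                f"{field}: expected {expected.__name__}, "
                f"got {type(record[field]).__name__}"
            )
    return problems
```

Running every record through a check like this before `put_record` rules out the producer side as the source of the mismatch.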

Records are sent using a Python script with boto3, via:

import json
import boto3

client = boto3.client("firehose", region_name="us-east-1")  # region from the ARNs above

client.put_record(
    DeliveryStreamName="PUT-ICE-S3-ICEBERG",
    Record={"Data": json.dumps(record)}
)
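As a side note, for higher volumes `put_record_batch` (up to 500 records per call) is usually preferable to per-record `put_record`, and it surfaces partial failures via `FailedPutCount` in the response. A sketch, with retry handling deliberately omitted:

```python
import json

def to_batches(records, batch_size=500):
    """Group records into put_record_batch payloads (API limit: 500 per call)."""
    entries = [{"Data": json.dumps(r)} for r in records]
    return [entries[i:i + batch_size] for i in range(0, len(entries), batch_size)]

# Usage (requires boto3 and AWS credentials):
# import boto3
# client = boto3.client("firehose", region_name="us-east-1")
# for batch in to_batches(records):
#     resp = client.put_record_batch(
#         DeliveryStreamName="PUT-ICE-S3-ICEBERG", Records=batch
#     )
#     if resp["FailedPutCount"]:
#         print("failed records in batch:", resp["FailedPutCount"])
```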

Request for Guidance

Despite input metrics showing success, Firehose is not delivering data to the Iceberg table and no errors are reported. I would really appreciate help with the following:

  1. What are the best ways to debug this silent delivery failure to Apache Iceberg?
  2. Are there hidden logs or troubleshooting tools to identify schema mismatches or permission issues?
  3. Are there known issues with Firehose delivery to partitioned Iceberg tables or Resource Links in Glue?

Any insights or steps to resolve or further debug this would be highly appreciated. Thank you in advance!

1 Answer

To debug the silent delivery failure of Amazon Kinesis Data Firehose to your Apache Iceberg table, consider the following approaches:

  1. Check CloudWatch Logs: Enable CloudWatch logging for your Firehose delivery stream if not already done. This can provide more detailed information about any issues occurring during the delivery process.

  2. Verify Schema Compatibility: Ensure that the schema of your incoming JSON data matches exactly with the Iceberg table schema. Even small discrepancies in field names or data types can cause issues without generating explicit errors.

  3. Examine Firehose Configuration: Double-check your Firehose configuration, particularly the database and table names. Make sure they exactly match your Glue catalog entries, including case sensitivity.

  4. Investigate S3 Permissions: Although you've provided S3 permissions, verify that the Firehose role can indeed write to the S3 location where your Iceberg table data is stored. You might want to temporarily enable S3 access logging to see if any access attempts are being made.

  5. Check Glue Data Catalog: Ensure that the Glue Data Catalog is correctly updated after data ingestion. Sometimes, the catalog might not reflect recent changes, leading to query issues in Athena.

  6. Validate Lake Formation Permissions: While you've granted ALL permissions in Lake Formation, double-check that these permissions are correctly applied and there are no conflicts with other policies.

  7. Monitor Firehose Metrics: Keep a close eye on other Firehose metrics in CloudWatch, such as "DeliveryToS3.Success" and "DeliveryToS3.Failure", which might provide additional insights.

  8. Test with Simplified Configuration: Try creating a new Firehose delivery stream with a simpler configuration (e.g., without partitioning) to isolate the issue.

  9. Examine Lambda Transformation: If you're using a Lambda function for data transformation, ensure it's not inadvertently modifying or dropping data.

  10. Check for Iceberg-Specific Issues: Verify that your Iceberg table configuration, including partitioning scheme, matches your expectations and the incoming data format.
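For point 2 in particular, the schema comparison can be automated by pulling the table definition from Glue and diffing it against a sample record's keys. A sketch (the Glue call is shown commented out; this compares column names only, since mapping Glue types to Python types would need an extra table):

```python
def diff_columns(glue_columns, record):
    """Compare Glue column names against a sample record's keys.

    Comparison is case-insensitive, since Glue/Athena treat column names
    case-insensitively.
    """
    table_cols = {c["Name"].lower() for c in glue_columns}
    record_keys = {k.lower() for k in record}
    return {
        "missing_in_record": sorted(table_cols - record_keys),
        "extra_in_record": sorted(record_keys - table_cols),
    }

# Usage (requires boto3 and AWS credentials):
# import boto3
# glue = boto3.client("glue", region_name="us-east-1")
# table = glue.get_table(DatabaseName="example_namespace", Name="iceberg_table_link")
# cols = table["Table"]["StorageDescriptor"]["Columns"]
# print(diff_columns(cols, sample_record))
```

Any name in `extra_in_record` is a field Firehose cannot map to a column, and any name in `missing_in_record` will arrive as null, both of which are worth ruling out.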

As for hidden logs or troubleshooting tools, CloudWatch Logs is your primary resource for detailed debugging information. You can also check the AWS Glue Data Catalog for any discrepancies in table definitions.

Regarding known issues: there are no specific widespread problems reported with Firehose delivery to partitioned Iceberg tables or to Glue resource links, but both features are relatively new and the integration has several moving parts. Check the AWS documentation for current limitations, and search AWS re:Post and the support channels for recently reported issues.

If the problem persists after these checks, consider reaching out to AWS Support for more in-depth troubleshooting assistance, as they can access account-specific information that might reveal the root cause of the issue.
Sources

  • firehose delivery stream destination error | AWS re:Post
  • Stream real-time data into Apache Iceberg tables in Amazon S3 using Amazon Data Firehose | AWS Big Data Blog
  • Streaming data to tables with Amazon Data Firehose - Amazon Simple Storage Service

answered 9 months ago
