Hi, I'm using the SageMaker Feature Store Feature Processor SDK to ingest data into an OfflineStore.
The problem I am having is that ingestion is very slow. Ingesting a test set of 10,000 records takes 18 minutes; at that rate, the 1M records I need to ingest would take about 30 hours!
Is this expected, or is there some way to improve this ingestion performance?
For reference, here is the code I'm using:
from sagemaker.feature_store.feature_processor import feature_processor

@feature_processor(
    inputs=[SnowflakeDataSource(query, sf_options, secret)],
    # snowflake data source implemented using code in custom data sources doc
    output=CRB_FG_ARN,
    target_stores=["OfflineStore"],
    spark_config={"spark.jars.packages": "net.snowflake:spark-snowflake_2.12:2.12.0-spark_3.3"},
)
def transform(input_df):
    from pyspark.sql.functions import col, unix_timestamp

    # lower-case all column names and convert created_at to epoch seconds
    transformed_df = (
        input_df.select([col(x).alias(x.lower()) for x in input_df.columns])
        .withColumn("created_at", unix_timestamp("created_at"))
    )
    # this print statement is shown almost immediately, implying latency isn't with Snowflake query
    print(f"dataframe shape: {(transformed_df.count(), len(transformed_df.columns))}")
    return transformed_df

transform()
EDIT1:
Well, I changed the table format of the feature group to TableFormatEnum.ICEBERG, and that allowed me to ingest the full 1M rows in 13 minutes.
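To put the two measurements side by side, here is a quick back-of-the-envelope sketch. The helper names are made up for illustration, and it assumes throughput scales linearly with record count, which may not hold in practice:

```python
def throughput(records: int, minutes: float) -> float:
    """Records ingested per second for a run of the given size and duration."""
    return records / (minutes * 60)


def projected_hours(total_records: int, records_per_second: float) -> float:
    """Hours needed to ingest total_records at a constant rate."""
    return total_records / records_per_second / 3600


# Test run with the original table format: 10,000 records in 18 minutes.
original_rate = throughput(10_000, 18)
# Full run after switching to ICEBERG: 1,000,000 records in 13 minutes.
iceberg_rate = throughput(1_000_000, 13)

print(f"original format: {original_rate:.1f} records/s")
print(f"projected time for 1M records: {projected_hours(1_000_000, original_rate):.0f} h")
print(f"ICEBERG: {iceberg_rate:.0f} records/s")
print(f"speedup: ~{iceberg_rate / original_rate:.0f}x")
```

So the table format change alone is worth roughly two orders of magnitude here.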
EDIT2:
I re-enabled the write to the OnlineStore when creating the feature group, and the ingestion is very slow again. When looking in S3, I see many files per day (rather than a single file per day, as was the case when writing only to the OfflineStore):
2024-05-16 17:25:40 4711 feature-store/sandbox/redacted/sagemaker/eu-west-1/offline-store/crb-1715868548/data/dfkchwkp/created_at_trunc=2023-05-07/20230507T164259Z_xuEYFSbUERrnKwQx.parquet
2024-05-16 17:21:10 4711 feature-store/sandbox/redacted/sagemaker/eu-west-1/offline-store/crb-1715868548/data/iyzled4v/created_at_trunc=2023-05-07/20230507T065739Z_yoFAGbgxIFClXSFy.parquet
2024-05-16 17:25:40 4711 feature-store/sandbox/redacted/sagemaker/eu-west-1/offline-store/crb-1715868548/data/qvvzi6tz/created_at_trunc=2023-05-07/20230507T180708Z_icsZiEfeGETlPgFl.parquet
2024-05-16 17:21:09 4711 feature-store/sandbox/redacted/sagemaker/eu-west-1/offline-store/crb-1715868548/data/rmmlxmbw/created_at_trunc=2023-05-07/20230507T181622Z_uyyeaMQWysmxzTez.parquet
2024-05-16 17:21:10 4711 feature-store/sandbox/redacted/sagemaker/eu-west-1/offline-store/crb-1715868548/data/yaiznhwv/created_at_trunc=2023-05-07/20230507T102844Z_kVUEXnFqcRCdaeGB.parquet
2024-05-16 17:25:40 4711 feature-store/sandbox/redacted/sagemaker/eu-west-1/offline-store/crb-1715868548/data/yzk3hxbn/created_at_trunc=2023-05-07/20230507T161916Z_eFLMXfmhPkorCPyJ.parquet
I can't find any docs about this, and I'm starting to lose a little faith in the production-worthiness of SageMaker Feature Store :/
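In case it helps anyone reproduce the file-count observation, this is a minimal sketch for counting parquet files per day partition. It assumes listing lines in the same shape as the ones above (date, time, size, key) and a partition directory named created_at_trunc=YYYY-MM-DD; the function name is made up for illustration:

```python
import re
from collections import Counter

# Captures the day partition directory that directly contains a parquet file.
PARTITION_RE = re.compile(r"/(created_at_trunc=\d{4}-\d{2}-\d{2})/[^/]+\.parquet$")


def files_per_partition(listing: str) -> Counter:
    """Count parquet files per created_at_trunc partition in an S3 listing."""
    counts = Counter()
    for line in listing.splitlines():
        match = PARTITION_RE.search(line.strip())
        if match:
            counts[match.group(1)] += 1
    return counts


# Two lines copied from the listing above.
sample = (
    "2024-05-16 17:25:40 4711 feature-store/sandbox/redacted/sagemaker/eu-west-1/"
    "offline-store/crb-1715868548/data/dfkchwkp/created_at_trunc=2023-05-07/"
    "20230507T164259Z_xuEYFSbUERrnKwQx.parquet\n"
    "2024-05-16 17:21:10 4711 feature-store/sandbox/redacted/sagemaker/eu-west-1/"
    "offline-store/crb-1715868548/data/iyzled4v/created_at_trunc=2023-05-07/"
    "20230507T065739Z_yoFAGbgxIFClXSFy.parquet\n"
)

for partition, n_files in files_per_partition(sample).items():
    print(partition, n_files)
```

Running it over the full listing gives the number of files written per day, which makes it easy to compare the OfflineStore-only run against the run with the OnlineStore enabled.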