Questions tagged with Data Lakes
Development Endpoint & Glue Version 1.0
Hello. Development Endpoint only supports Glue version <= 1.0. With the newer Glue versions available, will Glue version 1.0 eventually be deprecated? I saw the following post related to development under Glue versions 2.0 and 3.0. Is the intention that we move development to Glue Studio Notebook and Glue Interactive Sessions? Will Glue Development Endpoint eventually go away as well? Thank you!
Athena and Analytics
1. What is the best way to create a subset of factory location data? Current process: query location data for specific factories and save it in a new Athena table with a direct insert statement.
2. Get factory data and generate statistics about the factories. Objective: save the statistics in a database that automatically propagates updates to a visualization tool.
3. Is it possible to do a batch insert to put the contents of a pandas dataframe into Athena?
4. What is the best tool to connect Athena data to Excel?
Update Records with AWS Glue
I have two S3 buckets with data tables, namely A and B, and a Glue job that transforms data from A to B. Both tables contain a column called x. The Glue job performs a GroupBy operation on this column x, which transforms all other columns from table A into list type columns for table B. I activate the bookmarking mechanism for the Glue job so that it processes only new data. That requires that I also read in inputs from table B (which are outputs of the previous run of the Glue job) in this job and append new items to the list type columns in case a record with a specific value for column x already exists. It is unclear to me how I could update table B when saving the outputs of the Glue job while avoiding duplicated values of column x. Does anybody have a hint here? Thanks!
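The merge step described above can be sketched in plain Python. This is only an illustration of the logic, not Glue or Spark API code: the dictionary layout, the column name `y`, and the helper names `group_by_x` and `merge_into_b` are assumptions made for the example, and writing the merged result back to S3 is not covered.

```python
from collections import defaultdict


def group_by_x(rows):
    """Group new rows from table A by column x; every other column becomes a list."""
    grouped = defaultdict(lambda: defaultdict(list))
    for row in rows:
        key = row["x"]
        for col, value in row.items():
            if col != "x":
                grouped[key][col].append(value)
    return {key: dict(cols) for key, cols in grouped.items()}


def merge_into_b(existing_b, new_groups):
    """Append new list items to the records of B, keyed by x, so each x appears once."""
    # Copy the existing lists so the input from the previous run is not mutated.
    merged = {
        key: {col: list(vals) for col, vals in cols.items()}
        for key, cols in existing_b.items()
    }
    for key, cols in new_groups.items():
        target = merged.setdefault(key, {})
        for col, vals in cols.items():
            target.setdefault(col, []).extend(vals)
    return merged


# Hypothetical data: B already holds x = "k1" from the previous run.
existing_b = {"k1": {"y": [1, 2]}}
new_rows = [{"x": "k1", "y": 3}, {"x": "k2", "y": 10}, {"x": "k2", "y": 11}]
result = merge_into_b(existing_b, group_by_x(new_rows))
```

Because the result is keyed by x, a value of x that already exists in B is extended rather than duplicated. In a real job the equivalent would presumably be a join or union on x followed by overwriting the affected records, which depends on the storage format.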
Unsupported case of DataType
**ERROR MESSAGE:** An error occurred while calling o518.pyWriteDynamicFrame. Unsupported case of DataType: com.amazonaws.services.glue.schema.types.StringType@235d3a6f and DynamicNode: integernode. I use the Glue catalog with S3 as my database. I'm uploading CSV files and processing them with Glue jobs, but I keep getting the error message **"An error occurred while calling o518.pyWriteDynamicFrame. Unsupported case of DataType: com.amazonaws.services.glue.schema.types.StringType@235d3a6f and DynamicNode: integernode.Datatype".** I've updated and recreated the Glue tables and crawlers multiple times, and changed the schema to match the data types, but I still get the same error message. The Glue job only succeeds when I update the schema to have string as the only data type, but the data type needs to be int for integers when querying in Athena, so this will cause an issue downstream. Can anyone help?
Are you able to hide tables in a database using Lake Formation Tagging?
Hi, I have a database with around 40 tables. However, some end users don't need to see all tables in the database. I'm using Lake Formation Tagging and know that if a tag is added to the database, the tag is inherited by all the tables. I also found, by going into the table, that the inherited tags can't be removed. I tried adding a tag at just the table level and granting permissions, but the database won't appear because the tag is only at the table level. Is there a way to hide certain tables in a database by using Lake Formation Tagging?
Migrating existing data to AWS
Hi everyone, I have 270 GB of data on my NAS. Right now we have a bidirectional sync set up with Dropbox, and I have given all users access to the NAS through Windows Explorer. My questions are:
1. How can we migrate that data to AWS? I know about S3 storage but want some expert opinion on this.
2. We have implemented security roles for the users so that they can see only the relevant data. Is that possible in AWS as well?
3. What will this cost?
AWS Lake Formation: (AccessDeniedException) when calling the GetTable operation: Insufficient Lake Formation permission(s) on table
I have implemented LakeFormation on my data bucket. I have a step function in which one step consists of running a GlueJob that reads and writes to the data catalog. I have upgraded my DataLake permissions as reported [here](https://docs.aws.amazon.com/lake-formation/latest/dg/upgrade-glue-lake-formation.html). The Service Role that runs my Step Function has a root-type policy (granted just for debugging this issue):

```yaml
Statement:
  - Effect: "Allow"
    Action:
      - "*"
    Resource:
      - "*"
```

On Lake Formation the service role has:

- Administrator Rights
- Database Creation rights (and Grantable)
- Data Location access to the entire bucket (and Grantable)
- Super rights on read and write Database (and Grantable)
- Super rights on ALL tables within above Databases (and Grantable)

The bucket is not encrypted. But, somehow, its access to the tables is denied with the error:

```
(AccessDeniedException) when calling the GetTable operation: Insufficient Lake Formation permission(s) on table
```

What's really strange is that the Glue Job succeeds when writing to some tables, and fails on others. And there is no real substantial difference across tables: all of them are under the same S3 prefix, parquet files, partitioned on the same key. Given the abundance of permissions granted, I am really clueless about what is causing the error. Please, send help.
Ingesting data from external sources like Git, Slack, Zoom, Instagram, 3rd party systems
## Problem

I want to check and correct my understanding of, and approach to, setting up a data ingestion pipeline that collects "events" or "data" from any possible external application source (third-party applications). The ingestion rate is normally about 5,000 (5K) events per day; at peak it can go slightly above 20K.

### Approach I have been thinking about

I am planning to set up an AWS Lambda endpoint to which external systems can post (HTTP POST) the data, which can then be loaded into OpenSearch to form a data lake. The ingestion pipeline operates between Lambda and OpenSearch and performs the following:

- Parse the data
- Fetch more data if needed, by making API calls
- Process, transform, enrich
- Post to the corresponding OpenSearch indices

I have been googling and exploring on AWS, but so far I can't find anything that validates the above. Hence I request you experts to comment, suggest, and direct me to a practical solution.
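The steps listed in the approach can be sketched as a Lambda-style handler in plain Python. This is a rough sketch under stated assumptions, not a validated design: the event is assumed to carry the POSTed JSON as a string under `body`, the `source` field and the `events-<source>` index naming are made up for the example, and `lookup` and `index_document` are hypothetical stand-ins for the external API call and the OpenSearch write.

```python
import json


def parse(event):
    """Parse the HTTP POST body (assumed to be a JSON string under 'body')."""
    return json.loads(event["body"])


def enrich(record, lookup):
    """Fetch more data if needed; `lookup` stands in for an external API call."""
    extra = lookup(record["source"]) if "source" in record else {}
    return {**record, **extra}


def transform(record):
    """Normalize the (assumed) 'source' field before indexing."""
    return {**record, "source": record.get("source", "unknown").lower()}


def handler(event, context, lookup=lambda source: {}, index_document=print):
    """Lambda-style entry point: parse, enrich, transform, then index."""
    record = transform(enrich(parse(event), lookup))
    index_name = f"events-{record['source']}"  # hypothetical index naming scheme
    index_document(index_name, record)  # placeholder for the OpenSearch write
    return {"statusCode": 200, "body": json.dumps({"index": index_name})}
```

Passing the enrichment and indexing steps in as callables keeps the handler testable without network access. A real deployment would replace them with an HTTP client and an OpenSearch client.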
Amazon S3 connectors. Pros and cons
Hi Team, I couldn't find a list or details of the tools that integrate with Amazon S3 through an S3 connector. Which tools integrate with S3 to provide in-place querying of S3 data (i.e. data shouldn't move out of S3 in order to be queried)? How much data can be queried at a time? Do they support joins? What data formats can be queried? Any pointers would really help. Thank you
Can AWS Glue read data from different SQL Server tables, generate CSV files, and zip them to S3?
I need to load data from multiple tables in a SQL Server database to S3 for some batch processing. Can AWS Glue read data from different SQL Server tables, generate CSV files, and zip them to S3? And can AWS Glue run R script functions?
Grouping of partitioned dataframes
I have a large dataset (table) with >1e9 records (rows) in Glue. The table is partitioned by column A, which is an n-letter substring of column B. For example:

| A (partition key) | B | ... |
| --- | --- | --- |
| abc | abc123... | ... |
| abc | abc123... | ... |
| abc | abc456... | ... |
| abc | abc456... | ... |
| abc | abc456... | ... |
| abc | abc789... | ... |
| abc | abc789... | ... |
| ... | ... | ... |
| xyz | xyz123... | ... |
| xyz | xyz123... | ... |
| xyz | xyz123... | ... |
| xyz | xyz456... | ... |
| xyz | xyz456... | ... |
| xyz | xyz456... | ... |
| xyz | xyz789... | ... |
| xyz | xyz789... | ... |

There are >1e6 possible different values of column B and correspondingly significantly fewer for column A (maybe 1e3). Now I need to group records/rows by column B, and the assumption is that it could be advantageous if the table was partitioned by column A, as it would be sufficient to load dataframes from single partitions for grouping instead of running the operation on the entire table. (Partitioning by column B would lead to an unreasonably large number of partitions.) Is my assumption right? How would I tell my Glue job about the link between columns A and B so that it profits from the partitioning? Alternatively, I could handle the 1e3 dataframes (one for each partition) separately in my Glue job and merge them later on. But this looks a bit complicated to me. This question is a follow-up question to https://repost.aws/questions/QUwxdl4EwTQcKBuL8MKCU0EQ/are-partitions-advantageous-for-groupby-operations-in-glue-jobs.
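The reasoning behind the assumption can be illustrated in plain Python: because A is derived from B, every row with a given B value lands in exactly one partition, so grouping each partition separately and concatenating the results gives the same groups as grouping the whole table. This is a toy sketch, not Glue code. The prefix length of 3, the `value` column, and the function names are assumptions for the example, and it says nothing about whether Glue or Spark will exploit the partitioning automatically.

```python
from collections import defaultdict

PREFIX_LEN = 3  # assumed n: A is taken to be the first n letters of B


def group_by_b(rows):
    """Group the (hypothetical) 'value' column by column B over the given rows."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["B"]].append(row["value"])
    return dict(groups)


def group_by_b_per_partition(rows):
    """Split rows by partition key A, group each partition alone, then combine."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row["B"][:PREFIX_LEN]].append(row)
    result = {}
    for part_rows in partitions.values():
        # A given B value occurs in only one partition, so update() never overwrites.
        result.update(group_by_b(part_rows))
    return result


rows = [
    {"B": "abc123", "value": 1},
    {"B": "xyz456", "value": 2},
    {"B": "abc123", "value": 3},
    {"B": "abc456", "value": 4},
    {"B": "xyz456", "value": 5},
]
# Both strategies yield the same groups.
assert group_by_b(rows) == group_by_b_per_partition(rows)
```

The merge step is a plain concatenation because no group spans two partitions, which is what makes the per-partition approach simpler than it may first appear.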