
Questions tagged with Storage



What should be the correct Exclude Pattern and Table level when dealing with folders with different names?

Hello, I have an S3 bucket with the following path: `s3://a/b/c`. Inside this `c` folder I have one folder for each table, and inside each table folder I have a folder for each version. Each version is a database snapshot obtained on a weekly basis by a workflow. To clarify, the structure inside `c` looks like this:

1. products
   1. /version_0
      1. _temporary
         1. 0_$folder$
      2. part-00000-c5... ...c000.snappy.parquet
   2. /version_1
      1. _temporary
         1. 0_$folder$
      2. part-00000-c5... ...c000.snappy.parquet
2. locations
   1. /version_0
      1. _temporary
         1. 0_$folder$
      2. part-00000-c5... ...c000.snappy.parquet
   2. /version_1
      1. _temporary
         1. 0_$folder$
      2. part-00000-c5... ...c000.snappy.parquet

I have created a crawler (the Include path is set to the same path mentioned above, `s3://a/b/c`) with the intention of merging all the versions into one table for each table (products, locations). The schemas of the different partitions are always the same, and so is their folder structure. The `_temporary` folder is generated automatically by the workflow.

**What is the actual correct Exclude pattern (to ignore everything in the `_temporary` folders), and which Table level should I set, so that the crawler creates only ONE table per table (products, locations) merging all versions together?**

In summary I should end up with 2 tables:

1. products (containing version_0 and version_1 rows)
2. locations (containing version_0 and version_1 rows)

I really have no way of testing the exclude patterns. Is there any sandbox where the glob exclude patterns can actually be tested? I found one online but it doesn't seem to behave like what AWS is using. I have tried these exclude patterns, but neither worked (the crawler still created one table per table and per version):

1. `version*/_temporary**`
2. `/**/version*/_temporary**`
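(For illustration only: a minimal boto3 sketch of how a crawler's Include path, exclude patterns, and grouping configuration might be wired together. The role, database name, exclude pattern, and table-level value below are assumptions for the sketch, not a confirmed answer to the question.)

```python
import json
import boto3

glue = boto3.client("glue")

# Hypothetical crawler definition; the role, database, exclusion pattern and
# table level are placeholders, not a verified fix for the setup above.
glue.create_crawler(
    Name="c-snapshots-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",  # placeholder role
    DatabaseName="snapshots_db",                            # placeholder database
    Targets={
        "S3Targets": [
            {
                "Path": "s3://a/b/c",
                # Exclude patterns are evaluated relative to the include path.
                "Exclusions": ["**/_temporary/**"],          # assumed pattern
            }
        ]
    },
    # Grouping settings that would otherwise be chosen in the console's
    # "Grouping behavior for S3 data" section; the level value is assumed.
    Configuration=json.dumps(
        {
            "Version": 1.0,
            "Grouping": {
                "TableGroupingPolicy": "CombineCompatibleSchemas",
                "TableLevelConfiguration": 4,
            },
        }
    ),
)
```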
0 answers · 0 votes · 2 views · asked 23 minutes ago

Set the correct Table level, Include path and Exclude path.

Hello all, I have an S3 bucket with the following path: `s3://a/b/c/products`. Inside the products folder I have one folder for each version (each version is a database snapshot of the products table, obtained on a weekly basis by a workflow):

1. /version_0
   1. _temporary
      1. 0_$folder$
   2. part-00000-c5... ...c000.snappy.parquet
2. /version_1
   1. _temporary
      1. 0_$folder$
   2. part-00000-29... ...c000.snappy.parquet

I have created a crawler (the Include path is set to the same path mentioned above, `s3://a/b/c/products`) with the intention of merging all the versions into one table. The schemas of the different partitions are always the same, and so is their folder structure. I have tried different Table levels (4, 5 and 6) in the "Grouping Behaviour for S3 Data" section of the crawler settings, but it always created multiple tables (one table for each version). The `_temporary` folder seems to be generated automatically by the workflow; I don't know whether I have to add it to the Exclude path for this to work.

**What are the correct Include path, Exclude path and Table level so that the crawler creates only ONE table merging all versions together?**

I have checked all your general documentation links about this issue, but could you please provide an actual solution for it?
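(Illustration, not a verified answer: a sketch of how the table-level grouping for an existing crawler could be set programmatically with boto3. The crawler name is a placeholder, and the level value of 4 is an assumption based on counting the bucket as level 1 in `s3://a/b/c/products`.)

```python
import json
import boto3

glue = boto3.client("glue")

# Counting the bucket "a" as level 1 would put the "products" folder at
# level 4 for s3://a/b/c/products -- an assumption, not a confirmed setting.
grouping_config = {
    "Version": 1.0,
    "Grouping": {
        "TableGroupingPolicy": "CombineCompatibleSchemas",
        "TableLevelConfiguration": 4,
    },
}

glue.update_crawler(
    Name="products-crawler",                   # placeholder crawler name
    Configuration=json.dumps(grouping_config),
)
```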
1 answer · 0 votes · 16 views · asked a day ago

Data transfer speeds from S3 bucket -> EC2 SLURM cluster are slower than S3 bucket -> Google SLURM cluster

Hello, I am currently benchmarking big-data multi-cloud transfer speeds at a range of parallel read counts, using a cluster of EC2 instances and similar Google machines. I first detected an issue when using a `c5n.2xlarge` EC2 instance for my worker nodes, reading a 7 GB dataset in multiple formats from an S3 bucket. I have verified that the bucket is in the same cloud region as the EC2 nodes, but the data transfer ran far slower to the EC2 instances than it did to GCP. The data is not written to EBS; it is read into memory, and the data chunks are removed from memory when the process is complete. Here is a list of things I have tried to diagnose the problem:

1. Upgrading to a bigger instance type. I am aware that there is a network bandwidth limit for each instance type, and I saw a read speed increase when I changed to a `c5n.9xlarge` (from your documentation, there should be 50 Gbps of bandwidth), but it was still slower than reading from S3 to a Google VM that is further away on the network. I also upgraded the instance type again, but there was little to no speed increase. Note that hyperthreading is turned off for each EC2 instance.
2. Changing the S3 parameter `max_concurrent_requests` to `100`. I am using Python to benchmark these speeds, so this parameter was passed into a `storage_options` dictionary that is used by the different remote data access APIs (see the [Dask documentation](https://docs.dask.org/en/stable/how-to/connect-to-remote-data.html#:~:text=%22config_kwargs%22%3A%20%7B%22s3%22%3A%20%7B%22addressing_style%22%3A%20%22virtual%22%7D%7D%2C) for more info). Editing this parameter had absolutely no effect on the transfer speeds.
3. Verifying that enhanced networking is active on all worker nodes and the controller node.
4. Performing the data transfer directly from a worker node's command line, for both AWS and GCP machines. This was done to rule out my testing code being at fault, and the results were the same: S3 -> EC2 was slower than S3 -> GCP.
5. Varying how many cores of each EC2 instance were used in each SLURM job. For the Google machines, each worker node has 4 cores and 16 GB memory, so each job that I submit there takes up an entire node. However, after I upgraded my EC2 worker node instances, there are clearly more than 4 cores per node. To try to maintain a fair comparison, I configured each SLURM job to use only 8 cores per node in my EC2 cluster (I am performing 40 parallel reads at maximum, so if my understanding is correct each node will have 8 separate data stream connections, with 5 nodes being active at a time with `c5n.9xlarge` instances). I also tried allocating all of a node's resources to my 40 parallel reads (2 instances with all 18 cores active, and a third worker instance with only 4 cores active), but there was no effect.

I'm fairly confident there is a solution to this, but I am having an extremely difficult time figuring out what it is. I know that setting an endpoint shouldn't be the problem, because GCP is faster than EC2 even though egress is occurring there. Any help would be appreciated, because I want to make sure I get an accurate picture of S3 -> EC2 before presenting my work. Please let me know if more information is needed!
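(For context on point 2: a minimal sketch of how botocore client settings can be passed to s3fs through Dask's `storage_options`. The bucket path and values are placeholders, and the botocore `max_pool_connections` setting is used here in place of the CLI-style `max_concurrent_requests` the poster mentions; this is not presented as the poster's actual benchmark code.)

```python
import dask.dataframe as dd

# Sketch only: forward botocore Config settings to s3fs via Dask's
# storage_options. The path and the values below are placeholders.
storage_options = {
    "config_kwargs": {
        "max_pool_connections": 100,            # botocore connection pool size
        "s3": {"addressing_style": "virtual"},  # as in the linked Dask docs
    },
}

# Read the Parquet copy of the dataset straight into worker memory,
# without staging it on EBS first.
df = dd.read_parquet(
    "s3://example-bucket/7gb-dataset/",         # placeholder bucket/prefix
    storage_options=storage_options,
)
df = df.persist()  # materialise in distributed memory for the benchmark
```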
1 answer · 0 votes · 24 views · asked 2 days ago