
Questions tagged with Amazon Simple Storage Service



Set correct Table level, Include Path and Exclude path.

Hello all,

I have an S3 bucket with the following path: s3://a/b/c/products. Inside the products folder I have one folder for each version (each version is a database snapshot of the products table, obtained on a weekly basis by a workflow):

1. /version_0
    1. _temporary
        1. 0_$folder$
    2. part-00000-c5... ...c000.snappy.parquet
2. /version_1
    1. _temporary
        1. 0_$folder$
    2. part-00000-29... ...c000.snappy.parquet

I have created a crawler (Include path set to the same path mentioned above, s3://a/b/c/products) with the intention of merging all the versions together into one table. The schemas of the different partitions are always the same, and so is their structure. I have tried different table levels (4, 5 and 6) in the "Grouping Behaviour for S3 Data" section of the crawler settings, but it always creates multiple tables (one table for each version). The _temporary folder seems to be generated automatically by the workflow; I don't know whether I have to add it to the exclude path for this to work.

**What should the correct Include path, Exclude path and table level be so that only ONE table is created, merging all versions together?** I have checked the general documentation links about this issue, but could you please provide an actual solution?
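A minimal boto3 sketch of one possible crawler configuration is shown below. The crawler and database names are hypothetical, and the exclude patterns and table level are assumptions that would need verifying against the Glue crawler documentation; the key ideas are excluding the workflow's _temporary artifacts and grouping tables at the products folder rather than at each version folder.

```python
# Hypothetical sketch (crawler/database names invented): keep the Include path
# at the products prefix, exclude the workflow's _temporary artifacts, and
# group compatible schemas into one table at the products level.
import json
import boto3

glue = boto3.client("glue")

glue.update_crawler(
    Name="products-crawler",          # hypothetical crawler name
    DatabaseName="products_db",       # hypothetical database name
    Targets={
        "S3Targets": [
            {
                "Path": "s3://a/b/c/products/",
                # Glob-style exclude patterns; these assume the _temporary
                # folders and *_$folder$ markers are what break the grouping.
                "Exclusions": ["**/_temporary/**", "**_$folder$"],
            }
        ]
    },
    Configuration=json.dumps({
        "Version": 1.0,
        "Grouping": {
            # Combine the per-version folders into a single table when their
            # schemas are compatible.
            "TableGroupingPolicy": "CombineCompatibleSchemas",
            # The level counts path components from the bucket; with the bucket
            # as level 1, s3://a/b/c/products would be level 4 -- worth
            # double-checking against the Glue docs for this layout.
            "TableLevelConfiguration": 4,
        },
    }),
)
```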
1 answer · 0 votes · 11 views · asked 16 hours ago

How to save a .html file to S3 that is created in a Sagemaker processing container

**Error message:** "FileNotFoundError: [Errno 2] No such file or directory: '/opt/ml/processing/output/profile_case.html'"

**Background:** I am working in SageMaker using Python, trying to profile a dataframe that is saved in an S3 bucket with pandas profiling. The data is very large, so instead of spinning up a large EC2 instance, I am using an SKLearn processor. Everything runs fine, but when the job finishes it does not save the pandas profile (a .html file) to an S3 bucket or back to the instance SageMaker is running in. When I try to export the .html file that is created from the pandas profile, I keep getting errors saying that the file cannot be found. Does anyone know of a way to export the .html file out of the temporary ml.m5.24xlarge instance that the SKLearn processor is running on to S3? Below is the exact code I am using:

```
import os
import sys
import subprocess

def install(package):
    subprocess.check_call([sys.executable, "-q", "-m", "pip", "install", package])

install('awswrangler')
install('tqdm')
install('pandas')
install('botocore==1.19.4')
install('ruamel.yaml')
install('pandas-profiling==2.13.0')

import awswrangler as wr
import pandas as pd
import numpy as np
import datetime as dt
from dateutil.relativedelta import relativedelta
from string import Template
import gc
import boto3
from pandas_profiling import ProfileReport

client = boto3.client('s3')
session = boto3.Session(region_name="eu-west-2")
```

```
%%writefile casetableprofile.py

import os
import sys
import subprocess

def install(package):
    subprocess.check_call([sys.executable, "-q", "-m", "pip", "install", package])

install('awswrangler')
install('tqdm')
install('pandas')
install('botocore')
install('ruamel.yaml')
install('pandas-profiling')

import awswrangler as wr
import pandas as pd
import numpy as np
import datetime as dt
from dateutil.relativedelta import relativedelta
from string import Template
import gc
import boto3
from pandas_profiling import ProfileReport

client = boto3.client('s3')
session = boto3.Session(region_name="eu-west-2")


def run_profile():
    query = """
    SELECT * FROM "healthcloud-refined"."case"
    ;
    """
    tableforprofile = wr.athena.read_sql_query(query,
                                               database="healthcloud-refined",
                                               boto3_session=session,
                                               ctas_approach=False,
                                               workgroup='DataScientists')
    print("read in the table queried above")
    print("got rid of missing and added a new index")

    profile_tblforprofile = ProfileReport(tableforprofile,
                                          title="Pandas Profiling Report",
                                          minimal=True)
    print("Generated carerequest profile")
    return profile_tblforprofile


if __name__ == '__main__':
    profile_tblforprofile = run_profile()
    print("Generated outputs")

    output_path_tblforprofile = ('profile_case.html')
    print(output_path_tblforprofile)

    profile_tblforprofile.to_file(output_path_tblforprofile)

    # Below is the only part where I am getting errors
    import boto3
    import os
    s3 = boto3.resource('s3')
    s3.meta.client.upload_file('/opt/ml/processing/output/profile_case.html',
                               'intl-euro-uk-datascientist-prod',
                               'Mark/healthclouddataprofiles/{}'.format(output_path_tblforprofile))
```

```
import sagemaker
from sagemaker.processing import ProcessingInput, ProcessingOutput

session = boto3.Session(region_name="eu-west-2")
bucket = 'intl-euro-uk-datascientist-prod'
prefix = 'Mark'

sm_session = sagemaker.Session(boto_session=session, default_bucket=bucket)
sm_session.upload_data(path='./casetableprofile.py',
                       bucket=bucket,
                       key_prefix=f'{prefix}/source')
```

```
import boto3
#import sagemaker
from sagemaker import get_execution_role
from sagemaker.sklearn.processing import SKLearnProcessor

region = boto3.session.Session().region_name

S3_ROOT_PATH = "s3://{}/{}".format(bucket, prefix)

role = get_execution_role()
sklearn_processor = SKLearnProcessor(framework_version='0.20.0',
                                     role=role,
                                     sagemaker_session=sm_session,
                                     instance_type='ml.m5.24xlarge',
                                     instance_count=1)
```

```
sklearn_processor.run(code='s3://{}/{}/source/casetableprofile.py'.format(bucket, prefix),
                      inputs=[],
                      outputs=[ProcessingOutput(output_name='output',
                                                source='/opt/ml/processing/output',
                                                destination='s3://intl-euro-uk-datascientist-prod/Mark/')])
```

Thank you in advance!!!
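One detail worth noting, as a sketch rather than a confirmed fix: the ProcessingOutput above maps /opt/ml/processing/output to S3, so the report has to be written into that directory inside the container before the job ends. A minimal sketch of the tail of casetableprofile.py under that assumption:

```python
# Sketch of the end of casetableprofile.py, assuming `profile_tblforprofile`
# is the ProfileReport returned by run_profile() above.
import os

output_dir = "/opt/ml/processing/output"    # must match the ProcessingOutput source
os.makedirs(output_dir, exist_ok=True)      # harmless if SageMaker created it already

output_path = os.path.join(output_dir, "profile_case.html")
profile_tblforprofile.to_file(output_path)  # write the report inside the mapped directory

# When the processing job finishes, SageMaker copies everything under
# /opt/ml/processing/output to the ProcessingOutput destination in S3,
# so a manual boto3 upload_file call should not be needed.
```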
1 answer · 0 votes · 24 views · asked 2 days ago

Allowing permission to Generate a policy based on CloudTrail events where the selected Trail logs events in an S3 bucket in another account

I have an AWS account (Account A) with CloudTrail enabled and logging management events to an S3 'logs' bucket in another, dedicated logs account (Account B, which I also own). The logging part works fine, but I'm now trying (and failing) to use the 'Generate policy based on CloudTrail events' tool in the IAM console (under the Users > Permissions tab) in Account A. This is supposed to read the CloudTrail logs for a given user/region/no. of days, identify all of the actions the user performed, then generate a sample IAM policy to allow only those actions, which is great for setting up least-privilege policies etc.

When I first ran the generator, it created a new service role to assume in the same account (Account A): AccessAnalyzerMonitorServiceRole_ABCDEFGHI

When I selected the CloudTrail trail to analyse, it (correctly) identified that the trail logs are stored in an S3 bucket in another account, and displayed this warning message:

> Important: Verify cross-account access is configured for the selected trail. The selected trail logs events in an S3 bucket in another account. The role you choose or create must have read access to the bucket in that account to generate a policy. Learn more.

Attempting to run the generator at this stage fails after a short amount of time, and if you hover over the 'Failed' status in the console you see the message:

> Incorrect permissions assigned to access CloudTrail S3 bucket. Please fix before trying again.

Makes sense, but actually giving read access to the S3 bucket to the automatically generated AccessAnalyzerMonitorServiceRole_ABCDEFGHI is where I'm now stuck! I'm relatively new to AWS, so I might have done something dumb or be missing something obvious, but I'm trying to give the automatically generated role in Account A permission to the S3 bucket by adding to the bucket policy attached to the S3 logs bucket in Account B. I've added the extract below to the existing bucket policy (which is just the standard policy for a CloudTrail logs bucket, extended to allow CloudTrail in Account A to write logs to it as well), but my attempts to run the policy generator still fail with the same error message.

```
{
    "Sid": "IAMPolicyGeneratorRead",
    "Effect": "Allow",
    "Principal": {
        "AWS": "arn:aws:iam::1234567890:role/service-role/AccessAnalyzerMonitorServiceRole_ABCDEFGHI"
    },
    "Action": [
        "s3:GetObject",
        "s3:GetObjectVersion",
        "s3:ListBucket"
    ],
    "Resource": [
        "arn:aws:s3:::aws-cloudtrail-logs-ABCDEFGHI",
        "arn:aws:s3:::aws-cloudtrail-logs-ABCDEFGHI/*"
    ]
}
```

Any suggestions how I can get this working?
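For reference, cross-account S3 access generally has to be allowed on both sides: the bucket policy in Account B (as above) and an identity-based policy on the role in Account A. Below is a hedged boto3 sketch of the latter, reusing the placeholder role, bucket and account identifiers from the question; the inline policy name is invented, and if the bucket is encrypted with a KMS key in Account B the role would also need access to that key.

```python
# Hedged sketch, run in Account A: attach an inline read policy to the generated
# service role so its identity-based permissions match the bucket policy shown
# above. Identifiers reuse the question's placeholders; the policy name is
# hypothetical.
import json
import boto3

iam = boto3.client("iam")

iam.put_role_policy(
    RoleName="AccessAnalyzerMonitorServiceRole_ABCDEFGHI",
    PolicyName="CloudTrailLogsBucketRead",    # hypothetical inline policy name
    PolicyDocument=json.dumps({
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:GetObjectVersion", "s3:ListBucket"],
                "Resource": [
                    "arn:aws:s3:::aws-cloudtrail-logs-ABCDEFGHI",
                    "arn:aws:s3:::aws-cloudtrail-logs-ABCDEFGHI/*",
                ],
            }
        ],
    }),
)
```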
1 answer · 0 votes · 30 views · asked 3 days ago