Creating Hive external tables with an S3 LOCATION: Access Denied error


Hi,

I'm working on an EMR application that queries ORC data in S3 using Hive and Presto. I've been following the AWS guides, posts on this forum, Stack Overflow, and other sources, but I'm stuck trying to create an external table in Hive. I get the following error when running the CREATE TABLE statement:

FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. MetaException(message:Got exception: java.io.IOException com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: Access Denied (Service: Amazon S3; Status Code: 403; Error Code: AccessDenied; Request ID: 817E60B9FFAD0392; S3 Extended Request ID: lM_tt8slxX14Svkfn_01tHTTRsz7YIBwbpRRYNDjWMUuUurywiHQou1wHMcNXmdKLN0eeGGaokE=), S3 Extended Request ID: lM_tt8slxX14Svkfn_01tHTTRsz7YIBwbpRRYNDjWMUuUurywiHQou1wHMcNXmdKLN0eeGGaokE=)

I tried adding S3 read permissions to the EMR role and recreated the cluster, without results. I'm creating the table with the following statement:

CREATE EXTERNAL TABLE sensor
(
room string,
energy double,
temp double,
occupancy int,
awhen timestamp
)
PARTITIONED BY (year string, month string, day string)
STORED AS ORC
LOCATION 's3://s3.us-west-2.amazonaws.com/a2g-hive-test/tempsensores/data/';

Can anyone guide me on the best practice to achieve this?
Thanks.

asked 5 years ago · 3642 views
5 Answers

Hi,
First of all, let's assume that your roles/permissions are correct. If so, can you verify that the S3 folder exists? If not, try creating it first before running the statement:

s3://s3.us-west-2.amazonaws.com/a2g-hive-test/tempsensores/data/ 

If the folder exists, then you will need to carefully review the IAM permissions, making sure that the service roles that allow S3 access are properly passed/assumed so that the service making the call to S3 has the proper permissions. Can you please include all of the IAM roles/policies and trust relationships to help debug?
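
As a quick sanity check (just a sketch, assuming the AWS CLI is available on the master node, as it is by default on EMR), you can list the LOCATION using the same instance profile credentials the cluster uses. A 403 here points at IAM or the bucket policy rather than at Hive:

# run on the EMR master node; uses the EC2 instance profile credentials
aws s3 ls s3://a2g-hive-test/tempsensores/data/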

-randy

answered 5 years ago

Hi Randy,

The folder exists; I had already put data there in ORC format before starting the tutorial.

https://ibb.co/crGSGst

The IAM role being used is the default created by EMR when I created the first cluster, arn:aws:iam::753682516828:role/EMR_EC2_DefaultRole:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Resource": "*",
      "Action": [
        "cloudwatch:*",
        "dynamodb:*",
        "ec2:Describe*",
        "elasticmapreduce:Describe*",
        "elasticmapreduce:ListBootstrapActions",
        "elasticmapreduce:ListClusters",
        "elasticmapreduce:ListInstanceGroups",
        "elasticmapreduce:ListInstances",
        "elasticmapreduce:ListSteps",
        "kinesis:CreateStream",
        "kinesis:DeleteStream",
        "kinesis:DescribeStream",
        "kinesis:GetRecords",
        "kinesis:GetShardIterator",
        "kinesis:MergeShards",
        "kinesis:PutRecord",
        "kinesis:SplitShard",
        "rds:Describe*",
        "s3:*",
        "sdb:*",
        "sns:*",
        "sqs:*",
        "glue:CreateDatabase",
        "glue:UpdateDatabase",
        "glue:DeleteDatabase",
        "glue:GetDatabase",
        "glue:GetDatabases",
        "glue:CreateTable",
        "glue:UpdateTable",
        "glue:DeleteTable",
        "glue:GetTable",
        "glue:GetTables",
        "glue:GetTableVersions",
        "glue:CreatePartition",
        "glue:BatchCreatePartition",
        "glue:UpdatePartition",
        "glue:DeletePartition",
        "glue:BatchDeletePartition",
        "glue:GetPartition",
        "glue:GetPartitions",
        "glue:BatchGetPartition",
        "glue:CreateUserDefinedFunction",
        "glue:UpdateUserDefinedFunction",
        "glue:DeleteUserDefinedFunction",
        "glue:GetUserDefinedFunction",
        "glue:GetUserDefinedFunctions"
      ]
    }
  ]
}

The trust relationship is as follows:
{
  "Version": "2008-10-17",
  "Statement": [
    {
      "Sid": "",
      "Effect": "Allow",
      "Principal": {
        "Service": "ec2.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}

Is there any S3 permission missing?

answered 5 years ago

Hi,
The service role permissions look fine to me. The only two other possibilities I can think of at the moment are:

  1. If the LOCATION value does not exist, it returns 'access denied' instead of a not-found message (for security reasons, so that probing tools can't easily deduce the structure of your bucket). Try changing:

LOCATION 's3://s3.us-west-2.amazonaws.com/a2g-hive-test/tempsensores/data/';

To just:

LOCATION 's3://a2g-hive-test/tempsensores/data/';

  2. Double-check the bucket policy for a2g-hive-test to see if there is anything that might restrict reading of the data (see the command sketched after this list).
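
For the second check, one quick way to inspect the bucket policy from the command line (a sketch; it assumes the caller is allowed s3:GetBucketPolicy, and the command errors out if no policy is attached at all):

aws s3api get-bucket-policy --bucket a2g-hive-test --query Policy --output text

Look in the output for explicit Deny statements or Condition blocks that could block reads from the EMR nodes.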

-randy

answered 5 years ago

Hi Randy,

That worked like a charm. I just modified the LOCATION to reference only the bucket and path:

CREATE EXTERNAL TABLE sensor
(
room string,
energy double,
temp double,
occupancy int,
awhen timestamp
)
PARTITIONED BY (year string, month string, day string)
STORED AS ORC
LOCATION 's3://a2g-hive-test/tempsensores/data/';

With that, Hive was able to create the table without issues.
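
A note for anyone following the same steps: because the table is partitioned by year/month/day, the existing partitions still need to be registered in the metastore before queries return any rows. One way to do that, assuming the data under the LOCATION is laid out in Hive-style year=.../month=.../day=... folders, is:

hive -e "MSCK REPAIR TABLE sensor;"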

Thanks for your help,
Best.
Carlos.

answered 5 years ago

All,

Apparently the Hive S3 code does not know how to handle a region endpoint as the first element of the URI and instead expects only the globally unique bucket name there; presumably the endpoint host gets treated as a bucket name, which would explain why the failure surfaces as Access Denied rather than a clearer error.

If you want to be able to specify the region endpoint, it will probably take an EMR (or Hive) feature request to support it.

Regards,
-Kurt

klarson
answered 4 years ago
