
Refresh Redshift materialized view using a Glue Job script

0

I'm trying to refresh a materialized view from a Glue job, connecting to a Redshift cluster using boto3 and authenticating with a database username. The execution times out with no errors in CloudWatch. I'm sure the problem is in the connection, but I can't find what I'm doing wrong. I also configured a connection in AWS Glue, which I use successfully with other Visual ETL jobs, as well as with this one. I know I could schedule a refresh using the Query Editor v2 query scheduler, but I had too many issues configuring its security, with no success: the schedule executes but I see no runs in the history (that would be another post). Here's my script so far:

import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
import boto3

# Initialize Glue context
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Redshift connection details
redshift_cluster_id = 'my-warehouse'
db_name = 'my-db'
db_user = 'my-user'

# SQL to refresh materialized views
refresh_sql_1 = "REFRESH MATERIALIZED VIEW mv_name1"
refresh_sql_2 = "REFRESH MATERIALIZED VIEW mv_name2"

def refresh_materialized_view(sql):
    client = boto3.client('redshift-data')
    response = client.execute_statement(
        ClusterIdentifier=redshift_cluster_id,
        Database=db_name,
        DbUser=db_user,
        Sql=sql,
        WithEvent=True
    )
    return response

# Refresh the materialized views and commit
refresh_materialized_view(refresh_sql_1)
refresh_materialized_view(refresh_sql_2)

job.commit()
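One thing worth noting about the script, separate from the timeout itself: execute_statement in the Redshift Data API is asynchronous, so the call returns immediately and the job can reach job.commit() while the refresh is still running or has failed. A minimal sketch of waiting for completion with describe_statement, assuming the Data API endpoint is reachable (wait_for_statement is a hypothetical helper, not part of the original script):

import time
import boto3

def wait_for_statement(client, statement_id, timeout_seconds=300):
    """Poll the Data API until the statement finishes, fails, or we give up."""
    deadline = time.time() + timeout_seconds
    while time.time() < deadline:
        desc = client.describe_statement(Id=statement_id)
        status = desc['Status']
        if status == 'FINISHED':
            return desc
        if status in ('FAILED', 'ABORTED'):
            raise RuntimeError(f"Statement {statement_id} ended as {status}: {desc.get('Error')}")
        time.sleep(5)
    raise TimeoutError(f"Statement {statement_id} did not finish within {timeout_seconds}s")

# Example usage with the refresh function above:
# response = refresh_materialized_view(refresh_sql_1)
# wait_for_statement(boto3.client('redshift-data'), response['Id'])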

During execution, the job loops over these messages until it times out (15 minutes):

  • INFO LogPusher: uploading /tmp/spark-event-logs/ to s3://aws-glue-assets-etc..../sparkHistoryLogs/
  • INFO ExecutorTaskManagement: polling for executor task status
  • INFO JESSchedulerBackend: polling for JES task status

Comment: Why can’t you use a scheduled query?

Asked 1 year ago · 559 views
3 Answers
0
Accepted Answer

I did it with the Query Editor v2 scheduler, carefully following the documentation: https://docs.aws.amazon.com/redshift/latest/mgmt/query-editor-v2-schedule-query.html

The key, in my case, was to authenticate using Temporary credentials, allowing the IAM role to get database credentials for a specific db user:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "UseTemporaryCredentialsForAllDbUsers",
      "Effect": "Allow",
      "Action": "redshift:GetClusterCredentials",
      "Resource": [
        "arn:aws:redshift:*:*:dbuser:*/awsuser",
        "arn:aws:redshift:*:*:dbuser:*/myuser"
      ]
    }
  ]
}
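
For context, redshift:GetClusterCredentials is the permission behind the "Temporary credentials" option: the role exchanges it for a short-lived database password for one of the allowed db users. A minimal sketch of the equivalent boto3 call, using the cluster, database, and user names from the question as placeholders:

import boto3

# Request temporary database credentials for a db user covered by the policy above.
redshift = boto3.client('redshift')
creds = redshift.get_cluster_credentials(
    DbUser='awsuser',              # must match one of the dbuser ARNs in the policy
    DbName='my-db',
    ClusterIdentifier='my-warehouse',
    DurationSeconds=900,
    AutoCreate=False
)
# creds['DbUser'] / creds['DbPassword'] can then be used to open a SQL connection.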
Answered 1 year ago
AWS
EXPERT
Reviewed 8 months ago
0

Hi

You could use the Visual ETL approach to make sure your connection works, then copy the generated script into a new job and modify the ETL logic accordingly.

Also, an important point with Redshift is that the Glue job must use the correct VPC, otherwise the database won't be reachable. For further details, check Redshift connections.
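
If the script-based job is revisited later, keep in mind that the Glue connection (and with it the VPC, subnet, and security group) is attached to the job definition rather than referenced from the script. A minimal sketch with boto3, assuming a connection named my-redshift-connection and a script already uploaded to S3 (names and ARNs below are placeholders):

import boto3

glue = boto3.client('glue')

# The Connections field is what places the job inside the connection's VPC,
# so the Redshift cluster (and any required VPC endpoints) become reachable.
glue.create_job(
    Name='refresh-mv-job',
    Role='arn:aws:iam::000000000000:role/GlueJobRole',   # placeholder role ARN
    Command={
        'Name': 'glueetl',
        'ScriptLocation': 's3://my-bucket/scripts/refresh_mv.py',
        'PythonVersion': '3',
    },
    Connections={'Connections': ['my-redshift-connection']},
    GlueVersion='4.0',
)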

Thanks, Rama

AWS
EXPERT
Answered 1 year ago
  • That's a good approach if the job actually has to perform ETL operations, because then it needs at least a source and a target. That's not my case; I just want to run a REFRESH MATERIALIZED VIEW command. After carefully reading the documentation, I managed to get the scheduled queries in Query Editor v2 to work, so I'm leaving the Glue job on hold.

0

I added code to the script, trying to replicate what the Redshift Query Editor V2 Scheduled queries do, using temporary credentials from a database user:

# Assume IAM Role
sts_client = boto3.client('sts')
assumed_role = sts_client.assume_role(
    RoleArn='arn:aws:iam::000000:role/service-role/AmazonRedshift-CommandsAccessRole',
    RoleSessionName='RedshiftSession'
)

# Get temporary credentials
credentials = assumed_role['Credentials']

# Create redshift-data API client using temporary credentials
redshift_client = boto3.client(
    'redshift-data',
    aws_access_key_id=credentials['AccessKeyId'],
    aws_secret_access_key=credentials['SecretAccessKey'],
    aws_session_token=credentials['SessionToken']
)

# Create refresh function (same as in the original script)

Unfortunately, I got this error (which makes sense): Connect timeout on endpoint URL: "https://sts.amazonaws.com/"
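
For what it's worth, that error means the job never reached the global https://sts.amazonaws.com endpoint, which fits a Glue job running inside the connection's VPC without a path to public AWS endpoints. A sketch of one commonly tried mitigation, not a verified fix for this case: call the regional STS endpoint instead, and make sure that VPC has a NAT gateway or an STS interface VPC endpoint (the region below is a placeholder):

import boto3

# Point the client at the regional STS endpoint instead of the global one;
# the connection's VPC still needs a route to it (NAT gateway or an STS
# interface VPC endpoint).
sts_client = boto3.client(
    'sts',
    region_name='eu-west-1',                             # placeholder region
    endpoint_url='https://sts.eu-west-1.amazonaws.com',
)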

Answered 1 year ago
