
Refresh Redshift materialized view using a Glue Job script


I'm trying to refresh a materialized view from a Glue job, connecting to the Redshift cluster with boto3 and authenticating with a database username. The execution times out with no errors in CloudWatch. I'm sure the problem is in the connection, but I can't find what I'm doing wrong. I also configured a connection in AWS Glue, which I use successfully with other Visual ETL jobs as well as with this one. I know I could schedule the refresh with the Query Editor v2 query scheduler, but I had too many issues configuring its security and no success: the schedule executes but I see no runs in the history. That would be another post, though. Here's my script so far:

import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
import boto3

# Initialize Glue context
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Redshift connection details
redshift_cluster_id = 'my-warehouse'
db_name = 'my-db'
db_user = 'my-user'

# SQL to refresh materialized views
refresh_sql_1 = "REFRESH MATERIALIZED VIEW mv_name1"
refresh_sql_2 = "REFRESH MATERIALIZED VIEW mv_name2"

def refresh_materialized_view(sql):
    client = boto3.client('redshift-data')
    response = client.execute_statement(
        ClusterIdentifier=redshift_cluster_id,
        Database=db_name,
        DbUser=db_user,
        Sql=sql,
        WithEvent=True
    )
    return response

# Refresh the materialized views and commit
refresh_materialized_view(refresh_sql_1)
refresh_materialized_view(refresh_sql_2)

job.commit()

During execution, the job just loops on these messages until it hits the 15-minute timeout:

  • INFO LogPusher: uploading /tmp/spark-event-logs/ to s3://aws-glue-assets-etc..../sparkHistoryLogs/
  • INFO ExecutorTaskManagement: polling for executor task status
  • INFO JESSchedulerBackend: polling for JES task status

Comment:

  • Why can't you use a scheduled query?

Asked a year ago · 554 views
3 Answers
Accepted Answer

I did it with the Query Editor v2 scheduler, carefully following the documentation: https://docs.aws.amazon.com/redshift/latest/mgmt/query-editor-v2-schedule-query.html

The key, in my case, was to authenticate using temporary credentials, allowing the IAM role to get database credentials for specific db users:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "UseTemporaryCredentialsForAllDbUsers",
      "Effect": "Allow",
      "Action": "redshift:GetClusterCredentials",
      "Resource": [
        "arn:aws:redshift:*:*:dbuser:*/awsuser",
        "arn:aws:redshift:*:*:dbuser:*/myuser"
      ]
    }
  ]
}
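
For reference, the Glue-script route relies on the same permission: passing DbUser to the Redshift Data API obtains temporary credentials through redshift:GetClusterCredentials. Also, execute_statement is asynchronous, so a script only sees SQL errors if it polls for the result. A minimal polling sketch, with hypothetical names and assuming the Glue job role has the redshift-data and GetClusterCredentials permissions:

import time
import boto3

# Sketch only: reuses the cluster/database/user names from the question.
client = boto3.client('redshift-data')

def refresh_and_wait(sql, timeout_seconds=600):
    # execute_statement only queues the SQL and returns immediately with an Id
    response = client.execute_statement(
        ClusterIdentifier='my-warehouse',
        Database='my-db',
        DbUser='my-user',  # temporary credentials via redshift:GetClusterCredentials
        Sql=sql
    )
    statement_id = response['Id']

    # Poll describe_statement until the refresh finishes or fails
    deadline = time.time() + timeout_seconds
    while time.time() < deadline:
        status = client.describe_statement(Id=statement_id)
        if status['Status'] == 'FINISHED':
            return status
        if status['Status'] in ('FAILED', 'ABORTED'):
            raise RuntimeError(f"Refresh failed: {status.get('Error')}")
        time.sleep(10)
    raise TimeoutError(f"Statement {statement_id} did not finish within {timeout_seconds} seconds")

refresh_and_wait("REFRESH MATERIALIZED VIEW mv_name1")
refresh_and_wait("REFRESH MATERIALIZED VIEW mv_name2")

WithEvent=True, as in the original script, is only needed if you want EventBridge to receive an event when the statement finishes.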
Answered a year ago
AWS
EXPERT
Reviewed 8 months ago

Hi

You could use the visual ETL approach to make sure your connection works, then copy the generated script into a new job and modify the ETL logic accordingly.

Also, an important point with Redshift is that the Glue job must use the correct VPC, otherwise the database won't be reachable. For further details, check Redshift connections.
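
If the job runs in private subnets without a NAT gateway, public AWS API endpoints such as the Redshift Data API and STS are unreachable too, which matches the timeouts described above. One way around that, sketched here with hypothetical IDs (verify the service name for your Region), is an interface VPC endpoint for the Data API:

import boto3

ec2 = boto3.client('ec2')

# All IDs are hypothetical; use the VPC, subnets and security group of the Glue connection.
response = ec2.create_vpc_endpoint(
    VpcEndpointType='Interface',
    VpcId='vpc-0123456789abcdef0',
    ServiceName='com.amazonaws.eu-west-1.redshift-data',  # assumption: adjust to your Region
    SubnetIds=['subnet-0123456789abcdef0'],
    SecurityGroupIds=['sg-0123456789abcdef0'],
    PrivateDnsEnabled=True
)
print(response['VpcEndpoint']['VpcEndpointId'])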

Thanks, Rama

AWS
EXPERT
Answered a year ago
  • That's a good approach if the job actually performs ETL operations, because it has to have at least a source and a target. That's not the case here; I just want to run a REFRESH MATERIALIZED VIEW command. After carefully reading the documentation I managed to get scheduled queries working in Query Editor v2, so I'm leaving the Glue job on hold.


I added code to the script, trying to replicate what the Redshift Query Editor V2 Scheduled queries do, using temporary credentials from a database user:

# Assume IAM Role
sts_client = boto3.client('sts')
assumed_role = sts_client.assume_role(
    RoleArn='arn:aws:iam::000000:role/service-role/AmazonRedshift-CommandsAccessRole',
    RoleSessionName='RedshiftSession'
)

# Get temporary credentials
credentials = assumed_role['Credentials']

# Create redshift-data API client using temporary credentials
redshift_client = boto3.client(
    'redshift-data',
    aws_access_key_id=credentials['AccessKeyId'],
    aws_secret_access_key=credentials['SecretAccessKey'],
    aws_session_token=credentials['SessionToken']
)

# Create function

Unfortunately, I got this error (which makes sense): Connect timeout on endpoint URL: "https://sts.amazonaws.com/"
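
That timeout suggests the job's subnets have no route to the public STS endpoint. A minimal sketch of one workaround, assuming an STS interface VPC endpoint exists in the job's VPC, is to pin the client to the regional endpoint; and if the Glue job role itself already has the Redshift Data API and GetClusterCredentials permissions, assuming a separate role may not be needed at all:

import boto3

# Hypothetical Region. With an STS interface endpoint (private DNS enabled) in the
# job's VPC, the regional endpoint resolves inside the VPC and the timeout goes away.
sts_client = boto3.client(
    'sts',
    region_name='eu-west-1',
    endpoint_url='https://sts.eu-west-1.amazonaws.com'
)

assumed_role = sts_client.assume_role(
    RoleArn='arn:aws:iam::000000:role/service-role/AmazonRedshift-CommandsAccessRole',
    RoleSessionName='RedshiftSession'
)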

Answered a year ago
