
Refresh Redshift materialized view using a Glue Job script


I'm trying to refresh a materialized view from a Glue job, connecting to the Redshift cluster using boto3 and authenticating with a database username. The execution times out with no errors in CloudWatch. I'm sure the problem is in the connection, but I can't find what I'm doing wrong. I also configured a connection in AWS Glue, which I use successfully with other Visual ETL jobs as well as with this one. I know I could schedule the refresh with the Query Editor v2 query scheduler, but I had too many issues configuring its security: the schedule executes but I see no runs in the history. That would be another post, though. Here's my script so far:

import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
import boto3

# Initialize Glue context
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Redshift connection details
redshift_cluster_id = 'my-warehouse'
db_name = 'my-db'
db_user = 'my-user'

# SQL to refresh materialized views
refresh_sql_1 = "REFRESH MATERIALIZED VIEW mv_name1"
refresh_sql_2 = "REFRESH MATERIALIZED VIEW mv_name2"

def refresh_materialized_view(sql):
    client = boto3.client('redshift-data')
    response = client.execute_statement(
        ClusterIdentifier=redshift_cluster_id,
        Database=db_name,
        DbUser=db_user,
        Sql=sql,
        WithEvent=True
    )
    return response

# Refresh the materialized views and commit
refresh_materialized_view(refresh_sql_1)
refresh_materialized_view(refresh_sql_2)

job.commit()
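
As an aside, execute_statement is asynchronous, so even once the connection works the job will reach job.commit() before the refresh actually finishes. A rough, untested sketch of polling describe_statement for completion:

import time

def wait_for_statement(client, statement_id, timeout_seconds=600):
    # Untested sketch: poll the Redshift Data API until the statement finishes.
    deadline = time.time() + timeout_seconds
    while time.time() < deadline:
        desc = client.describe_statement(Id=statement_id)
        status = desc['Status']
        if status == 'FINISHED':
            return desc
        if status in ('FAILED', 'ABORTED'):
            raise RuntimeError(f"Statement {statement_id} ended as {status}: {desc.get('Error')}")
        time.sleep(5)
    raise TimeoutError(f"Statement {statement_id} did not finish within {timeout_seconds}s")

# e.g. wait_for_statement(boto3.client('redshift-data'), refresh_materialized_view(refresh_sql_1)['Id'])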

During execution, the job keeps looping over these log messages until it times out (15 min):

  • INFO LogPusher: uploading /tmp/spark-event-logs/ to s3://aws-glue-assets-etc..../sparkHistoryLogs/
  • INFO ExecutorTaskManagement: polling for executor task status
  • INFO JESSchedulerBackend: polling for JES task status

  • Why can't you use a scheduled query?

Asked 1 year ago · 554 views
3 Answers
Accepted Answer

I did it with the Query Editor v2 scheduler, carefully following the documentation: https://docs.aws.amazon.com/redshift/latest/mgmt/query-editor-v2-schedule-query.html

The key, in my case, was to authenticate using Temporary credentials, allowing the IAM role to get database credentials for specific db users:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "UseTemporaryCredentialsForAllDbUsers",
            "Effect": "Allow",
            "Action": "redshift:GetClusterCredentials",
            "Resource": [
                "arn:aws:redshift:*:*:dbuser:*/awsuser",
                "arn:aws:redshift:*:*:dbuser:*/myuser"
            ]
        }
    ]
}
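
For reference, this policy is what lets the scheduler (or any caller assuming the role) request temporary credentials for those db users; a minimal sketch of the underlying call, with placeholder names:

import boto3

# Minimal sketch: request temporary database credentials for one of the
# db users listed in the policy. Cluster and database names are placeholders.
redshift = boto3.client('redshift')
creds = redshift.get_cluster_credentials(
    DbUser='awsuser',                 # must match one of the dbuser ARNs above
    DbName='my-db',                   # placeholder
    ClusterIdentifier='my-warehouse', # placeholder
    DurationSeconds=900,
    AutoCreate=False
)
# creds contains DbUser, DbPassword and Expiration for a short-lived login.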
Answered 1 year ago
AWS
EXPERT
Reviewed 8 months ago

Hi

You could use the visual ETL approach to make sure your connection works, then copy the generated script into a new job and modify the ETL logic accordingly.

Also, an important point with Redshift is that the Glue job must use the correct VPC, otherwise the database won't be reachable. For further details, check Redshift connections.
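
If the job runs in a private subnet with no internet path, the Data API and STS calls in your script also need a route to those services. A sketch (all IDs and the region are placeholders) of adding interface VPC endpoints for them:

import boto3

# Sketch only: create interface VPC endpoints so a Glue job in a private
# subnet can reach the Redshift Data API and STS without internet access.
# All IDs and the region are placeholders.
ec2 = boto3.client('ec2', region_name='us-east-1')
for service in ('com.amazonaws.us-east-1.redshift-data',
                'com.amazonaws.us-east-1.sts'):
    ec2.create_vpc_endpoint(
        VpcEndpointType='Interface',
        VpcId='vpc-0123456789abcdef0',
        ServiceName=service,
        SubnetIds=['subnet-0123456789abcdef0'],
        SecurityGroupIds=['sg-0123456789abcdef0'],
        PrivateDnsEnabled=True
    )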

Thanks, Rama

AWS
EXPERT
Answered 1 year ago
  • That's a good approach if the job actually performs ETL operations, because it has to have at least a source and a target. That's not the case here: I just want to run a REFRESH MATERIALIZED VIEW command. After carefully reading the documentation I managed to get scheduled queries working in Query Editor v2, so I'm leaving the Glue job on hold.


I added code to the script to try to replicate what the Redshift Query Editor v2 scheduled queries do, using temporary credentials for a database user:

# Assume IAM Role
sts_client = boto3.client('sts')
assumed_role = sts_client.assume_role(
    RoleArn='arn:aws:iam::000000:role/service-role/AmazonRedshift-CommandsAccessRole',
    RoleSessionName='RedshiftSession'
)

# Get temporary credentials
credentials = assumed_role['Credentials']

# Create redshift-data API client using temporary credentials
redshift_client = boto3.client(
    'redshift-data',
    aws_access_key_id=credentials['AccessKeyId'],
    aws_secret_access_key=credentials['SecretAccessKey'],
    aws_session_token=credentials['SessionToken']
)

# Create function

Unfortunately, I got this error (which makes sense): Connect timeout on endpoint URL: "https://sts.amazonaws.com/"
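
That timeout suggests the job's VPC has no route to the public STS endpoint. A possible next step (untested sketch; the region is an assumption, and it relies on an STS interface VPC endpoint with private DNS in the job's VPC) would be to target the regional STS endpoint explicitly:

# Untested sketch: use the regional STS endpoint so an STS interface VPC
# endpoint (with private DNS enabled) in the job's VPC can answer the call.
# Region is a placeholder.
sts_client = boto3.client(
    'sts',
    region_name='us-east-1',
    endpoint_url='https://sts.us-east-1.amazonaws.com'
)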

Answered 1 year ago
