How to correctly provide a log4j properties file for PySpark Glue jobs?


Following this post: https://repost.aws/knowledge-center/glue-reduce-cloudwatch-logs, I have created the following Glue job:

from awsglue.context import GlueContext
from pyspark.context import SparkContext
import logging
from pyspark.sql.functions import rand, randn

DEFAULT_LOG_LEVEL = logging.DEBUG
DEFAULT_LOG_FORMAT = "%(asctime)s - %(levelname)-8s: %(message)s"
class SparkLogger(logging.Handler):
    def __init__(self, j_logger) -> None:
        self.j_logger = j_logger
        logging.Handler.__init__(self=self)

    def emit(self, record: logging.LogRecord) -> None:
        print(record)
        self.j_logger.log(None, None, record.levelno, logging.Handler.format(self, record), None, None)

def get_logger(
    name: str, level: int = DEFAULT_LOG_LEVEL, format: str = DEFAULT_LOG_FORMAT
) -> logging.Logger:
    logger = logging.getLogger(name)
    logger.setLevel(level)
    glueContext = GlueContext(SparkContext.getOrCreate())
    handler = SparkLogger(glueContext.get_logger().log())
    handler.setLevel(level)
    handler.setFormatter(logging.Formatter(fmt=format, datefmt="%Y-%m-%d %H:%M:%S"))
    logger.addHandler(handler)
    return logger
    
def main():
    sc = SparkContext.getOrCreate()
    glueContext = GlueContext(sc)
    logger = get_logger("MEEEJN")
    logger.info("info message")
    logger.warning("warn message")
    logger.error("error message")
    # Create a DataFrame with two columns: 'id' with random integers
    # and 'value' with random floats
    spark = glueContext.spark_session
    df = spark.range(0, 100).withColumn('id', (rand(seed=1234) * 100).cast('int')).withColumn('value', randn(seed=42))

    # Filter rows where 'value' > 0, then add a 'value_squared' column
    # holding the square of 'value'
    transformed_df = df.filter(df['value'] > 0).withColumn('value_squared', df['value'] ** 2)

    # Collect and print the top 6 rows of the transformed DataFrame
    top_rows = transformed_df.take(6)

    print("Top 6 Rows:")
    for row in top_rows:
        print(row)

if __name__ == "__main__":
    main()

Then, under Job details, in the Libraries section's Referenced files path input, I put the S3 URI of a log4j.properties file that contains the following:

# Root Logger
log4j.rootLogger=DEBUG, stdout
log4j.logger.org.apache.spark.api.python.PythonGatewayServer=DEBUG

# Direct log messages to stdout
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.Target=System.out
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%m%n

In my understanding, this should make my continuous logs contain only the log message, without the date and other information. Yet my logs in CloudWatch still look like this:

[Screenshot: CloudWatch logs]

What am I doing wrong?

  • Is that Glue 3 or 4?

  • We are using Glue version 4.0

  • Also, looking through CloudTrail, it does not seem Glue even tried to pick up the file; no S3 operations are being made

RipplB
asked 10 days ago · 64 views
1 Answer
Accepted Answer

Glue 4 uses Log4j 2, so the file Glue provides, named log4j2.properties, will take priority over your log4j.properties file.
Bear in mind it is not just the file name: the configuration syntax has also changed, so check the Log4j 2 documentation.
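
For illustration, a rough Log4j 2 equivalent of the properties file from the question could look like this (a sketch only, following the Log4j 2 properties syntax; the appender and logger names here are arbitrary choices, not something Glue requires):

# Root logger
rootLogger.level = debug
rootLogger.appenderRef.stdout.ref = STDOUT

# Keep the Python gateway logger at DEBUG
logger.pygateway.name = org.apache.spark.api.python.PythonGatewayServer
logger.pygateway.level = debug

# Direct log messages to stdout, message only
appender.console.type = Console
appender.console.name = STDOUT
appender.console.target = SYSTEM_OUT
appender.console.layout.type = PatternLayout
appender.console.layout.pattern = %m%n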

AWS
EXPERT
answered 10 days ago
  • Thank you for your answer. So I have to update the file to the Log4j 2 style. But if Glue overrides it anyway, does that mean there is no way to configure the logging in Glue 4?

  • No. The file you add will be downloaded and will override the one that comes out of the box (a sketch of how the file is passed to the job follows below).
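
For completeness, the Referenced files path field corresponds to the --extra-files special job parameter, so the same file can also be supplied through the job's DefaultArguments (the bucket and key below are hypothetical placeholders):

"DefaultArguments": {
    "--extra-files": "s3://my-bucket/config/log4j2.properties"
}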
