How do I use a PostgreSQL database as the external metastore for Hive on Amazon EMR?

6 minute read

I want to use an Amazon Relational Database Service (Amazon RDS) for PostgreSQL DB instance as my external metastore for Apache Hive on Amazon EMR.

Resolution

Before you begin, note the following:

  • This solution assumes that you already have an active PostgreSQL database.
  • If you're using Amazon EMR release version 5.7 or earlier, first download the PostgreSQL JDBC driver from the pgJDBC website. Then, add the driver to the Hive library path (/usr/lib/hive/lib), as shown in the example after this list. Amazon EMR release versions 5.8.0 and later come with the PostgreSQL JDBC driver in the Hive library path.
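
For example, you can pull the driver straight into the Hive library path on the primary node. The following is a minimal sketch that assumes driver version 42.6.0; pick a current release from https://jdbc.postgresql.org/download/:

    # Download the pgJDBC driver into the Hive library path
    sudo curl --location --output /usr/lib/hive/lib/postgresql-42.6.0.jar https://jdbc.postgresql.org/download/postgresql-42.6.0.jar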

Configure a PostgreSQL DB instance as the external metastore for Hive

Complete the following steps:

  1. Create an Amazon RDS for PostgreSQL DB instance and the database.
    Note: You can create the database while you create the DB instance in the Amazon RDS console. You can specify the database name in the Initial database name field under Additional configuration. Or, you can connect to the PostgreSQL DB instance and then create the database.
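    If you prefer the AWS CLI, the following is a minimal sketch of creating the DB instance and the initial database. The instance identifier, instance class, storage size, and credentials are placeholder assumptions; adjust them for your environment:

    aws rds create-db-instance \
        --db-instance-identifier mypostgresql \
        --db-instance-class db.t3.micro \
        --engine postgres \
        --allocated-storage 20 \
        --db-name mypgdb \
        --master-username database_username \
        --master-user-password database_password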

  2. To allow connections on port 5432 between your database and the ElasticMapReduce-master security group, modify the DB instance security group. For more information, see VPC security groups.
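    For example, with the AWS CLI, you can add an inbound rule on the DB instance's security group that references the ElasticMapReduce-master security group. Both sg- IDs below are placeholders:

    # sg-0123456789abcdef0: security group of the DB instance (placeholder)
    # sg-0fedcba9876543210: ElasticMapReduce-master security group (placeholder)
    aws ec2 authorize-security-group-ingress \
        --group-id sg-0123456789abcdef0 \
        --protocol tcp \
        --port 5432 \
        --source-group sg-0fedcba9876543210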

  3. Launch an Amazon EMR cluster without an external metastore. Amazon EMR uses the default MySQL database in this case.

  4. Use SSH to connect to the primary node.
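    The following is a minimal sketch, assuming an EC2 key pair file named mykey.pem and a placeholder public DNS name for the primary node:

    ssh -i ~/mykey.pem hadoop@ec2-X-X-X-X.us-west-2.compute.amazonaws.com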

  5. Update the Hive configuration.
    The following is an example Hive configuration. Replace the following values with your own:
    • mypostgresql.testabcd1111.us-west-2.rds.amazonaws.com with the endpoint of your DB instance
    • mypgdb with the name of your PostgreSQL database
    • database_username with the DB instance username
    • database_password with the DB instance password

    [hadoop@ip-X-X-X-X ~]$ sudo vi /etc/hive/conf/hive-site.xml
      <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:postgresql://mypostgresql.testabcd1111.us-west-2.rds.amazonaws.com:5432/mypgdb</value>
        <description>PostgreSQL JDBC driver connection URL</description>
      </property>
    
      <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>org.postgresql.Driver</value>
        <description>PostgreSQL metastore driver class name</description>
      </property>
    
      <property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value>database_username</value>
        <description>the username for the DB instance</description>
      </property>
    
      <property>
        <name>javax.jdo.option.ConnectionPassword</name>
        <value>database_password</value>
        <description>the password for the DB instance</description>
      </property>
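
    Optionally, verify that the edited file is still well-formed XML before restarting anything. This assumes the xmllint tool (part of libxml2) is present on the node:

    # Prints nothing on success; reports a parse error otherwise
    [hadoop@ip-X-X-X-X ~]$ xmllint --noout /etc/hive/conf/hive-site.xml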
  6. To create the PostgreSQL schema, run the following commands:

    [hadoop@ip-X-X-X-X ~]$ cd /usr/lib/hive/bin/
    [hadoop@ip-X-X-X-X bin]$ ./schematool -dbType postgres -initSchema
    Metastore connection URL:     jdbc:postgresql://mypostgresql.testabcd1111.us-west-2.rds.amazonaws.com:5432/mypgdb
    Metastore Connection Driver :     org.postgresql.Driver
    Metastore connection User:     test
    Starting metastore schema initialization to 2.3.0
    Initialization script hive-schema-2.3.0.postgres.sql
    Initialization script completed
    schemaTool completed
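
    Optionally, to confirm that the metastore now points at PostgreSQL, you can print the connection details and schema version with schematool's -info option:

    [hadoop@ip-X-X-X-X bin]$ ./schematool -dbType postgres -info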
  7. To apply the updates and changes, stop and start Hive services.
    Note: The initctl, stop, and start commands in the following example apply to Amazon EMR release versions 4.x-5.29.0. On Amazon EMR release versions 5.30.0 and later, use the equivalent systemctl commands (see the automation script that follows and the comments at the end of this article).

    [hadoop@ip-X-X-X-X bin]$ sudo initctl list | grep -i hive
    hive-server2 start/running, process 11818
    hive-hcatalog-server start/running, process 12708
    [hadoop@ip-X-X-X-X bin]$ sudo stop hive-server2
    hive-server2 stop/waiting
    [hadoop@ip-X-X-X-X bin]$ sudo stop hive-hcatalog-server
    hive-hcatalog-server stop/waiting
    [hadoop@ip-X-X-X-X bin]$ sudo start hive-server2
    hive-server2 start/running, process 18798
    [hadoop@ip-X-X-X-X bin]$ sudo start hive-hcatalog-server
    hive-hcatalog-server start/running, process 19614

    Note: To automate steps 5 through 7, run the following bash script (hive_postgres_emr_step.sh) as a step job in the EMR cluster.

    ## Automated Bash script to update the hive-site.xml and restart Hive
    
    ## Parameters
    rds_db_instance_endpoint='<rds_db_instance_endpoint>'
    rds_db_instance_port='<rds_db_instance_port>'
    rds_db_name='<rds_db_name>'
    rds_db_instance_username='<rds_db_instance_username>'
    rds_db_instance_password='<rds_db_instance_password>'
    
    ############################# Copying the original hive-site.xml
    sudo cp /etc/hive/conf/hive-site.xml /tmp/hive-site.xml
    
    ############################# Changing the JDBC URL
    old_jdbc=`grep "javax.jdo.option.ConnectionURL" -A +3 -B 1 /tmp/hive-site.xml | grep "<value>" | xargs`
    sudo sed -i "s|$old_jdbc|<value>jdbc:postgresql://$rds_db_instance_endpoint:$rds_db_instance_port/$rds_db_name</value>|g" /tmp/hive-site.xml
    
    ############################# Changing the Driver name
    old_driver_name=`grep "javax.jdo.option.ConnectionDriverName" -A +3 -B 1 /tmp/hive-site.xml | grep "<value>" | xargs`
    sudo sed -i "s|$old_driver_name|<value>org.postgresql.Driver</value>|g" /tmp/hive-site.xml
    
    ############################# Changing the database user
    old_db_username=`grep "javax.jdo.option.ConnectionUserName"  -A +3 -B 1 /tmp/hive-site.xml | grep "<value>" | xargs`
    sudo sed -i "s|$old_db_username|<value>$rds_db_instance_username</value>|g" /tmp/hive-site.xml
    
    ############################# Changing the database password and description
    connection_password=`grep "javax.jdo.option.ConnectionPassword" -A +3 -B 1 /tmp/hive-site.xml | grep "<value>" | xargs`
    sudo sed -i "s|$connection_password|<value>$rds_db_instance_password</value>|g" /tmp/hive-site.xml
    old_password_description=`grep "javax.jdo.option.ConnectionPassword" -A +3 -B 1 /tmp/hive-site.xml | grep "<description>" | xargs`
    new_password_description='<description>the password for the DB instance</description>'
    sudo sed -i "s|$password_description|$new_password_description|g" /tmp/hive-site.xml
    
    ############################# Moving hive-site to backup
    sudo mv /etc/hive/conf/hive-site.xml /etc/hive/conf/hive-site.xml_bkup
    sudo mv /tmp/hive-site.xml /etc/hive/conf/hive-site.xml
    
    ############################# Init Schema for Postgres
    /usr/lib/hive/bin/schematool -dbType postgres -initSchema
    
    ############################# Restart Hive
    ## Check Amazon Linux version and restart Hive
    OS_version=`uname -r`
    if [[ "$OS_version" == *"amzn2"* ]]; then
        echo "Amazon Linux 2 instance, restarting Hive..."
        sudo systemctl stop hive-server2
        sudo systemctl stop hive-hcatalog-server
        sudo systemctl start hive-server2
        sudo systemctl start hive-hcatalog-server
    elif [[ "$OS_version" == *"amzn1"* ]]; then
        echo "Amazon Linux 1 instance, restarting Hive"
        sudo stop hive-server2
        sudo stop hive-hcatalog-server
        sudo start hive-server2
        sudo start hive-hcatalog-server
    else
        echo "ERROR: OS version different from AL1 or AL2."
    fi
    echo "--------------------COMPLETED--------------------"

    Be sure to replace the following values in the script:

    • rds_db_instance_endpoint with the endpoint of your DB instance
    • rds_db_instance_port with the port of your DB instance
    • rds_db_name with the name of your PostgreSQL database
    • rds_db_instance_username with the DB instance username
    • rds_db_instance_password with the DB instance password

Upload the script to Amazon S3
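
First, copy the script to an S3 location that the cluster can read. A minimal sketch, reusing the example_bucket path from the step arguments below:

    aws s3 cp hive_postgres_emr_step.sh s3://example_bucket/script/hive_postgres_emr_step.sh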

You can run the script as a step job through the Amazon EMR console, the AWS Command Line Interface (AWS CLI), or the API. To use the Amazon EMR console to run the script, do the following (an AWS CLI example follows these steps):

  1. Open the Amazon EMR console.

  2. On the Cluster List page, select the link for your cluster.

  3. On the Cluster Details page, choose the Steps tab.

  4. On the Steps tab, choose Add step.

  5. In the Add step dialog box, retain the default values for Step type and Name.

  6. For JAR location, enter the following:

    command-runner.jar
  7. For Arguments, enter the following:

    bash -c "aws s3 cp s3://example_bucket/script/hive_postgres_emr_step.sh .; chmod +x hive_postgres_emr_step.sh; ./hive_postgres_emr_step.sh"

    Replace the S3 location in the command with the location where you stored the script.

  8. Choose Add to run the step job.
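
Alternatively, you can add the same step with the AWS CLI. The following is a sketch; the cluster ID (j-XXXXXXXXXXXXX) is a placeholder, and the S3 path is the example location used above:

    aws emr add-steps \
        --cluster-id j-XXXXXXXXXXXXX \
        --steps 'Type=CUSTOM_JAR,Name="hive_postgres_emr_step",Jar=command-runner.jar,ActionOnFailure=CONTINUE,Args=[bash,-c,"aws s3 cp s3://example_bucket/script/hive_postgres_emr_step.sh .; chmod +x hive_postgres_emr_step.sh; ./hive_postgres_emr_step.sh"]'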

Verify the Hive configuration updates

Complete the following steps:

  1. Log in to the Hive shell and create a Hive table.
    Note: Replace test_postgres in the following example with the name of your Hive table.

    [hadoop@ip-X-X-X-X bin]$ hive
    Logging initialized using configuration in file:/etc/hive/conf.dist/hive-log4j2.properties Async: true
    hive> show databases;
    OK
    default
    Time taken: 0.569 seconds, Fetched: 1 row(s)
    hive> create table test_postgres(a int,b int);
    OK
    Time taken: 0.708 seconds
  2. To install the PostgreSQL command line client (psql), run the following command:

    [hadoop@ip-X-X-X-X bin]$ sudo yum install postgresql
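    Note: On clusters where the default postgresql package is too old for SCRAM authentication against newer PostgreSQL engines, psql fails with "SCRAM authentication requires libpq version 10 or above" (see the comments at the end of this article). On Amazon Linux 2, one workaround is to install a newer client:

    [hadoop@ip-X-X-X-X bin]$ sudo amazon-linux-extras install postgresql14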
  3. Connect to the PostgreSQL DB instance using the command line.
    Replace the following values in the command:
    • mypostgresql.testabcd1111.us-west-2.rds.amazonaws.com with the endpoint of your DB instance
    • mypgdb with the name of your PostgreSQL database
    • database_username with the DB instance username

    [hadoop@ip-X-X-X-X bin]$ psql --host=mypostgresql.testabcd1111.us-west-2.rds.amazonaws.com --port=5432 --username=database_username --password --dbname=mypgdb
  4. When prompted, enter the password for the DB instance.

  5. To confirm that you can access the Hive table that you created earlier, run the following command:

    
    mypgdb=> select * from "TBLS";
     TBL_ID | CREATE_TIME | DB_ID | LAST_ACCESS_TIME | OWNER  | RETENTION | SD_ID |   TBL_NAME    |   TBL_TYPE    | VIEW_EXPANDED_TEXT | VIEW_ORIGINAL_TEXT | IS_REWRITE_ENABLED 
    --------+-------------+-------+------------------+--------+-----------+-------+---------------+---------------+--------------------+--------------------+--------------------
          1 |  1555014961 |     1 |                0 | hadoop |         0 |     1 | test_postgres | MANAGED_TABLE |                    |                    | f
    (1 row)
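
    You can inspect other metastore tables the same way; for example, the DBS table lists the Hive databases. A sketch (metastore table names are case-sensitive, hence the quotes):

    mypgdb=> select "DB_ID", "NAME", "DB_LOCATION_URI" from "DBS";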

Related information

Configuring an external metastore for Hive

Connecting to a DB instance running the PostgreSQL database engine

AWS OFFICIAL
Updated 9 months ago
4 Comments

To save some time for future people. 😃 Initially, when I ran:

cd /usr/lib/hive/bin/
./schematool -dbType postgres -initSchema

I got this error:

org.apache.hadoop.hive.metastore.HiveMetaException: Failed to get schema version.
Underlying cause: org.postgresql.util.PSQLException : FATAL: no pg_hba.conf entry for host "172.31.187.xx", user "hive_metastore_readwrite", database "hive_metastore_db", no encryption
SQL Error code: 0

I am using Amazon EMR release emr-6.12.0; however, it only ships an old PostgreSQL JDBC driver at /usr/lib/hive/lib/postgresql-9.4.1208.jre7.jar

I am using PostgreSQL 15 on Amazon RDS, so I needed to download the latest PostgreSQL JDBC driver from https://jdbc.postgresql.org/download/ and move it to /usr/lib/hive/lib/

Or you can use this command to download it:

sudo curl --silent --fail --show-error --location --remote-name --output-dir /usr/lib/hive/lib/ https://jdbc.postgresql.org/download/postgresql-42.6.0.jar

Then this succeeded:

cd /usr/lib/hive/bin/
./schematool -dbType postgres -initSchema
replied 2 years ago

The tutorial's sudo yum install postgresql is outdated.

I hit this issue:

psql --host=my.abc.us-west-2.rds.amazonaws.com --port=5432 --dbname=hive_metastore_db --username=hive_metastore_readwrite --password
Password for user hive_metastore_readwrite: 
psql: SCRAM authentication requires libpq version 10 or above

After installing a newer client:

sudo yum install -y amazon-linux-extras
# postgresql15 does not exist today
sudo amazon-linux-extras install postgresql14

I could connect:

psql --host=my.abc.us-west-2.rds.amazonaws.com --port=5432 --dbname=hive_metastore_db --username=hive_metastore_readwrite --password
replied 2 years ago

This section is also old. The initctl, stop, and start commands are for Amazon EMR 4.x-5.29.0 release versions:

[hadoop@ip-X-X-X-X bin]$ sudo initctl list |grep -i hive
hive-server2 start/running, process 11818
hive-hcatalog-server start/running, process 12708
[hadoop@ip-X-X-X-X bin]$ sudo stop hive-server2
hive-server2 stop/waiting
[hadoop@ip-X-X-X-X bin]$ sudo stop hive-hcatalog-server
hive-hcatalog-server stop/waiting
[hadoop@ip-X-X-X-X bin]$ sudo start hive-server2
hive-server2 start/running, process 18798
[hadoop@ip-X-X-X-X bin]$ sudo start hive-hcatalog-server
hive-hcatalog-server start/running, process 19614

For Amazon EMR 5.30.0 and later release versions, including Amazon EMR 6.x, the corresponding commands are:

systemctl status | grep -i hive
sudo systemctl stop hive-hcatalog-server.service 
sudo systemctl start hive-hcatalog-server.service 

(In my case, I only have Trino. I do not have hive-server2 running)

Reference: https://repost.aws/knowledge-center/restart-service-emr

replied 2 years ago

Thank you for your comment. We'll review and update the Knowledge Center article as needed.

AWS EXPERT
replied 2 years ago