How do I use a PostgreSQL database as the external metastore for Hive on Amazon EMR?
I want to use an Amazon Relational Database Service (Amazon RDS) for PostgreSQL DB instance as my external metastore for Apache Hive on Amazon EMR.
Resolution
Before you begin, note the following:
- This solution assumes that you already have an active PostgreSQL database.
- If you're using Amazon EMR release version 5.7 or earlier, download the PostgreSQL JDBC driver and add it to the Hive library path (/usr/lib/hive/lib), as shown in the sketch after these notes. Amazon EMR release versions 5.8.0 and later include the PostgreSQL JDBC driver in the Hive library path.
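For example, on the master node you might stage the driver like this. This is a minimal sketch: the driver version and download URL are placeholders, so confirm the correct jar for your PostgreSQL release on jdbc.postgresql.org before using it.

# Hypothetical example: download the PostgreSQL JDBC driver and copy it
# into the Hive library path (replace 42.2.27 with your driver version)
wget https://jdbc.postgresql.org/download/postgresql-42.2.27.jar
sudo cp postgresql-42.2.27.jar /usr/lib/hive/lib/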
To configure a PostgreSQL DB instance as the external metastore for Hive, do the following:
1. Create an Amazon RDS for PostgreSQL DB instance, and then create the database. You can do this while creating the DB instance in the Amazon RDS console by specifying the database name in the Initial database name field under Additional configuration. Or, you can connect to the PostgreSQL DB instance and then create the database.
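If you prefer the AWS CLI, the following is a minimal sketch of creating the DB instance with an initial database. The instance identifier, instance class, storage size, and credentials are example placeholders:

# Hypothetical example: create a PostgreSQL DB instance with an
# initial database named mypgdb
aws rds create-db-instance \
    --db-instance-identifier mypostgresql \
    --db-instance-class db.t3.medium \
    --engine postgres \
    --allocated-storage 100 \
    --db-name mypgdb \
    --master-username database_username \
    --master-user-password database_password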
2. Modify the DB instance security group to allow connections on port 5432 between your database and the ElasticMapReduce-master security group. For more information, see VPC security groups.
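A hedged AWS CLI sketch of that ingress rule follows. Both security group IDs are placeholders; replace them with the ID of your DB instance security group and the ID of the cluster's ElasticMapReduce-master security group:

# Hypothetical example: allow the EMR master security group to reach
# the PostgreSQL DB instance on port 5432
aws ec2 authorize-security-group-ingress \
    --group-id sg-0123456789dbexample \
    --protocol tcp \
    --port 5432 \
    --source-group sg-0123456789emrmaster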
3. Launch an Amazon EMR cluster without an external metastore. Amazon EMR uses the default MySQL database in this case.
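For reference, here is a minimal AWS CLI sketch of launching such a cluster. The release label, instance type, instance count, and key pair name are example placeholders:

# Hypothetical example: launch an EMR cluster with Hive and no
# external metastore configuration
aws emr create-cluster \
    --name "hive-postgres-test" \
    --release-label emr-5.36.0 \
    --applications Name=Hive \
    --instance-type m5.xlarge \
    --instance-count 3 \
    --use-default-roles \
    --ec2-attributes KeyName=mykeypair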
4. Connect to the master node using SSH.
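For example (the key file and the public DNS name of the master node are placeholders):

# Hypothetical example: connect to the master node as the hadoop user
ssh -i ~/mykeypair.pem hadoop@ec2-X-X-X-X.us-west-2.compute.amazonaws.com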
5. Replace the Hive configuration with the following properties.
Replace the following values in the example:
mypostgresql.testabcd1111.us-west-2.rds.amazonaws.com with the endpoint of your DB instance
mypgdb with the name of your PostgreSQL database
database_username with the DB instance username
database_password with the DB instance password
[hadoop@ip-X-X-X-X ~]$ sudo vi /etc/hive/conf/hive-site.xml

<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:postgresql://mypostgresql.testabcd1111.us-west-2.rds.amazonaws.com:5432/mypgdb</value>
  <description>PostgreSQL JDBC driver connection URL</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>org.postgresql.Driver</value>
  <description>PostgreSQL metastore driver class name</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>database_username</value>
  <description>the username for the DB instance</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>database_password</value>
  <description>the password for the DB instance</description>
</property>
6. Run the following commands to create the PostgreSQL schema:
[hadoop@ip-X-X-X-X ~]$ cd /usr/lib/hive/bin/
[hadoop@ip-X-X-X-X bin]$ ./schematool -dbType postgres -initSchema
Metastore connection URL:       jdbc:postgresql://mypostgresql.testabcd1111.us-west-2.rds.amazonaws.com:5432/mypgdb
Metastore Connection Driver :   org.postgresql.Driver
Metastore connection User:      test
Starting metastore schema initialization to 2.3.0
Initialization script hive-schema-2.3.0.postgres.sql
Initialization script completed
schemaTool completed
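Optionally, you can verify the initialized schema before restarting Hive. A minimal sketch using schematool's info mode:

# Optional check: print the metastore connection details and schema version
[hadoop@ip-X-X-X-X bin]$ ./schematool -dbType postgres -info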
7. Stop and start Hive services so that the updated settings take effect:
[hadoop@ip-X-X-X-X bin]$ sudo initctl list | grep -i hive
hive-server2 start/running, process 11818
hive-hcatalog-server start/running, process 12708
[hadoop@ip-X-X-X-X bin]$ sudo stop hive-server2
hive-server2 stop/waiting
[hadoop@ip-X-X-X-X bin]$ sudo stop hive-hcatalog-server
hive-hcatalog-server stop/waiting
[hadoop@ip-X-X-X-X bin]$ sudo start hive-server2
hive-server2 start/running, process 18798
[hadoop@ip-X-X-X-X bin]$ sudo start hive-hcatalog-server
hive-hcatalog-server start/running, process 19614
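Note: The preceding upstart commands apply to Amazon Linux 1 AMIs. On Amazon Linux 2 AMIs, Hive runs under systemd, so restart the services with systemctl instead (the automation script later in this article handles both cases):

# On Amazon Linux 2, restart Hive with systemctl instead of upstart
sudo systemctl stop hive-server2
sudo systemctl stop hive-hcatalog-server
sudo systemctl start hive-server2
sudo systemctl start hive-hcatalog-server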
You can automate steps 5 through 7 of the preceding process by running the following bash script (hive_postgres_emr_step.sh) as a step job on the EMR cluster.
## Automated Bash script to update hive-site.xml and restart Hive

## Parameters
rds_db_instance_endpoint='<rds_db_instance_endpoint>'
rds_db_instance_port='<rds_db_instance_port>'
rds_db_name='<rds_db_name>'
rds_db_instance_username='<rds_db_instance_username>'
rds_db_instance_password='<rds_db_instance_password>'

############################# Copying the original hive-site.xml
sudo cp /etc/hive/conf/hive-site.xml /tmp/hive-site.xml

############################# Changing the JDBC URL
old_jdbc=`grep "javax.jdo.option.ConnectionURL" -A +3 -B 1 /tmp/hive-site.xml | grep "<value>" | xargs`
sudo sed -i "s|$old_jdbc|<value>jdbc:postgresql://$rds_db_instance_endpoint:$rds_db_instance_port/$rds_db_name</value>|g" /tmp/hive-site.xml

############################# Changing the driver name
old_driver_name=`grep "javax.jdo.option.ConnectionDriverName" -A +3 -B 1 /tmp/hive-site.xml | grep "<value>" | xargs`
sudo sed -i "s|$old_driver_name|<value>org.postgresql.Driver</value>|g" /tmp/hive-site.xml

############################# Changing the database user
old_db_username=`grep "javax.jdo.option.ConnectionUserName" -A +3 -B 1 /tmp/hive-site.xml | grep "<value>" | xargs`
sudo sed -i "s|$old_db_username|<value>$rds_db_instance_username</value>|g" /tmp/hive-site.xml

############################# Changing the database password and description
connection_password=`grep "javax.jdo.option.ConnectionPassword" -A +3 -B 1 /tmp/hive-site.xml | grep "<value>" | xargs`
sudo sed -i "s|$connection_password|<value>$rds_db_instance_password</value>|g" /tmp/hive-site.xml
old_password_description=`grep "javax.jdo.option.ConnectionPassword" -A +3 -B 1 /tmp/hive-site.xml | grep "<description>" | xargs`
new_password_description='<description>the password for the DB instance</description>'
sudo sed -i "s|$old_password_description|$new_password_description|g" /tmp/hive-site.xml

############################# Moving hive-site.xml to backup
sudo mv /etc/hive/conf/hive-site.xml /etc/hive/conf/hive-site.xml_bkup
sudo mv /tmp/hive-site.xml /etc/hive/conf/hive-site.xml

############################# Initializing the schema for PostgreSQL
/usr/lib/hive/bin/schematool -dbType postgres -initSchema

############################# Restarting Hive
## Check the Amazon Linux version and restart Hive
OS_version=`uname -r`
if [[ "$OS_version" == *"amzn2"* ]]; then
    echo "Amazon Linux 2 instance, restarting Hive..."
    sudo systemctl stop hive-server2
    sudo systemctl stop hive-hcatalog-server
    sudo systemctl start hive-server2
    sudo systemctl start hive-hcatalog-server
elif [[ "$OS_version" == *"amzn1"* ]]; then
    echo "Amazon Linux 1 instance, restarting Hive..."
    sudo stop hive-server2
    sudo stop hive-hcatalog-server
    sudo start hive-server2
    sudo start hive-hcatalog-server
else
    echo "ERROR: OS version different from AL1 or AL2."
fi
echo "--------------------COMPLETED--------------------"
Be sure to replace the following values in the script:
- rds_db_instance_endpoint with the endpoint of your DB instance
- rds_db_instance_port with the port of your DB instance
- rds_db_name with the name of your PostgreSQL database
- rds_db_instance_username with the DB instance user name
- rds_db_instance_password with the DB instance password
Upload the script to Amazon S3. You can run the script as a step job using the Amazon EMR console, the AWS Command Line Interface (AWS CLI), or the API. To use the Amazon EMR console to run the script, do the following:
1. Open the Amazon EMR console.
2. On the Cluster List page, select the link for your cluster.
3. On the Cluster Details page, choose the Steps tab.
4. On the Steps tab, choose Add step.
5. In the Add step dialog box, retain the default values for Step type and Name.
6. For JAR location, enter the following:
command-runner.jar
7. For Arguments, enter the following:
bash -c "aws s3 cp s3://example_bucket/script/hive_postgres_emr_step.sh .; chmod +x hive_postgres_emr_step.sh; ./hive_postgres_emr_step.sh"
Replace the S3 location in the command with the location where you stored the script.
8. Choose Add to run the step job.
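Alternatively, to submit the same step from the AWS CLI, you can use a sketch like the following. The cluster ID and the S3 path are placeholders:

# Hypothetical example: submit the script as a step with command-runner.jar
aws emr add-steps \
    --cluster-id j-XXXXXXXXXXXXX \
    --steps 'Type=CUSTOM_JAR,Name=ConfigureHiveMetastore,ActionOnFailure=CONTINUE,Jar=command-runner.jar,Args=[bash,-c,"aws s3 cp s3://example_bucket/script/hive_postgres_emr_step.sh .; chmod +x hive_postgres_emr_step.sh; ./hive_postgres_emr_step.sh"]'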
After the step job completes, do the following to verify the Hive configuration updates:
1. Log in to the Hive shell and create a Hive table.
Note: Be sure to replace test_postgres in the example with the name of your Hive table.
[hadoop@ip-X-X-X-X bin]$ hive
Logging initialized using configuration in file:/etc/hive/conf.dist/hive-log4j2.properties Async: true
hive> show databases;
OK
default
Time taken: 0.569 seconds, Fetched: 1 row(s)
hive> create table test_postgres(a int, b int);
OK
Time taken: 0.708 seconds
2. Install the PostgreSQL client on the master node:
[hadoop@ip-X-X-X-X bin]$ sudo yum install postgresql
3. Connect to the PostgreSQL DB instance using the command line.
Replace the following values in the command:
mypostgresql.testabcd1111.us-west-2.rds.amazonaws.com with the endpoint of your DB instance
mypgdb with the name of your PostgreSQL database
database_username with the DB instance user name
[hadoop@ip-X-X-X-X bin]$ psql --host=mypostgresql.testabcd1111.us-west-2.rds.amazonaws.com --port=5432 --username=database_username --password --dbname=mypgdb
4. When prompted, enter the password for the DB instance.
5. Run the following command to confirm that you can access the Hive table that you created earlier:
mypgdb=> select * from "TBLS";
 TBL_ID | CREATE_TIME | DB_ID | LAST_ACCESS_TIME | OWNER  | RETENTION | SD_ID |   TBL_NAME    |   TBL_TYPE    | VIEW_EXPANDED_TEXT | VIEW_ORIGINAL_TEXT | IS_REWRITE_ENABLED
--------+-------------+-------+------------------+--------+-----------+-------+---------------+---------------+--------------------+--------------------+--------------------
      1 |  1555014961 |     1 |                0 | hadoop |         0 |     1 | test_postgres | MANAGED_TABLE |                    |                    | f
(1 row)
Your Amazon EMR cluster is now using the PostgreSQL database as the external metastore for Hive.
Related information
Configuring an external metastore for Hive
Connecting to a DB instance running the PostgreSQL database engine
