Adding Hive and beeline clients on AWS MWAA

Hi all,
I am using HiveOperator to execute a query on an EMR cluster via the beeline client from AWS MWAA. Rather than adding a step on EMR, I want to run the Hive query directly with HiveOperator from AWS MWAA. But because the binaries are missing on MWAA, the task fails with "No such file or directory: 'beeline'". Please see the DAG code and stack trace below.

Sample code (the code below works fine with our Airflow installation on EKS):
from airflow.operators.hive_operator import HiveOperator

hive_direct_task = HiveOperator(
    task_id='hive_direct_task',
    hive_cli_conn_id='hive_emr_dag_connection',
    hql='CREATE TABLE XXXX.XXXX STORED AS ORC AS SELECT DISTINCT * from XXXX.XXXX limit 2',
)

{{logging_mixin.py:112}} INFO - Running %s on host %s <TaskInstance: spark_hive_ssh_dag.hive_direct_task > ip-XXXX.ec2.internal
{{hive_operator.py:121}} INFO - Executing: CREATE TABLE XXXX.XXXX STORED AS ORC AS SELECT DISTINCT * from XXXX.XXXX limit 2
{{hive_operator.py:136}} INFO - Passing HiveConf: {'airflow.ctx.dag_email': 'XXXX@XXXX.com', 'airflow.ctx.dag_owner': 'airflow', 'airflow.ctx.dag_id': 'spark_hive_ssh_dag', 'airflow.ctx.task_id': 'hive_direct_task', 'airflow.ctx.execution_date': '2020-12-09T18:03:28.344312+00:00', 'airflow.ctx.dag_run_id': 'manual__'}
{{taskinstance.py:1150}} ERROR - No such file or directory: 'beeline': 'beeline'
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 984, in _run_raw_task
result = task_copy.execute(context=context)
File "/usr/local/lib/python3.7/site-packages/airflow/operators/hive_operator.py", line 137, in execute
self.hook.run_cli(hql=self.hql, schema=self.schema, hive_conf=self.hiveconfs)
File "/usr/local/lib/python3.7/site-packages/airflow/hooks/hive_hooks.py", line 258, in run_cli
close_fds=True)
File "/usr/lib64/python3.7/subprocess.py", line 800, in init
restore_signals, start_new_session)
File "/usr/lib64/python3.7/subprocess.py", line 1551, in _execute_child
raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: No such file or directory: 'beeline': 'beeline'

Similarly, if I want to use HDFSSensor, it will require the hadoop/hdfs client, and there are many more operators like this.

As for our on-premises Airflow setup on EKS: we added all the binaries (hive, hadoop, hdfs) on top of the Apache Airflow base image. The same code works fine there whenever we query EMR from Airflow using HiveOperator.

Does MWAA support integrations with AWS services only, like EMR cluster launch, EMR add step, the Athena operators, etc.?
Can I achieve the above use case with AWS MWAA? So far I have explored everything I could find and have not found a way to add binaries on MWAA.
If the above use case is possible, please let me know how I can add the binaries in MWAA.

Thanks,
Neeraj Vyas (Data Engineer)
Neerajvyas615@gmail.com

asked 3 years ago · 486 views
1 Answer

Hi!

Amazon MWAA limits the number of binaries installed on the worker images in order to keep the image size small enough for reasonable performance. As a result, some operators may not be available, for example those that depend on the Java runtime.

One possible workaround is to leverage an external container and the ECSOperator or KubernetesPodOperator to run your commands on custom images. Another alternative, if running on EMR, is to use the PythonOperator with the boto3 library for more fine-grained control over the EMR commands, as sketched below.
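For the boto3 route, here is a minimal, untested sketch of what that PythonOperator could look like; the region, cluster ID, and step name are placeholder assumptions you would replace with your own values:

import boto3
from airflow.operators.python_operator import PythonOperator

def submit_hive_step(**context):
    # Submit the same HQL as an EMR step through command-runner.jar,
    # instead of shelling out to a local beeline binary on the MWAA worker.
    emr = boto3.client('emr', region_name='us-east-1')  # assumed region
    response = emr.add_job_flow_steps(
        JobFlowId='j-XXXXXXXXXXXXX',  # placeholder: your EMR cluster ID
        Steps=[{
            'Name': 'hive_direct_task',
            'ActionOnFailure': 'CONTINUE',
            'HadoopJarStep': {
                'Jar': 'command-runner.jar',
                'Args': ['hive', '-e',
                         'CREATE TABLE XXXX.XXXX STORED AS ORC AS SELECT DISTINCT * from XXXX.XXXX limit 2'],
            },
        }],
    )
    return response['StepIds'][0]

hive_boto3_task = PythonOperator(
    task_id='hive_boto3_task',
    python_callable=submit_hive_step,
)

Note that this still adds a step on EMR rather than talking to HiveServer2 over beeline; if you need beeline itself, the container route with a custom image that bundles the Hive client is the closer match.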

Thanks!

AWS
John_J
answered 3 years ago
