How to connect On-premise Oracle database from Glue without using Crawler

0

Hello All, I have some doubts regarding designing glue job .

  1. Do we always need catalog and crawler to connect to any database ?
  2. Can we connect to database directly either using spark read or dynamic_frame_from_options by passing DB details ?
  3. I was going though https://aws.amazon.com/blogs/big-data/how-to-access-and-analyze-on-premises-data-stores-using-aws-glue/ but in this we are connecting to database via catelog . cant we connect directly ?
  4. If answer to 1 and 3 is yes then do we need to attach VPC to Glue , if yes then how ?I dont see any option in console to attach VPC directly ?
  5. In security group do we need database ip address mentioned in outbound rules ?
asked 2 years ago3973 views
1 Answer
0

1. Do we always need catalog and crawler to connect to any database ?

Ans:- If you would like to leverage the advantages of Glue Dynamic Frames then it is mandatory to have a Glue data catalog table else in case your requirement is to make use of spark read and store data in data frames then Glue Catalog is not required. The use of Crawlers is for the purpose of creation of Glue data catalog. Instead of Crawlers, you can manually create a catalog table as well.

2. Can we connect to database directly either using spark read or dynamic_frame_from_options by passing DB details ?

Ans:- Provided that you attach a new or existing JDBC connection to the Glue job, you can connect to the database directly. The JDBC connection shall contain the details about your on-premise Oracle database along with the VPC and other details that are to be used by the Glue job. Again, if you want to use dynamic_frame_from_options then Glue catalog must be present.

3. I was going though https://aws.amazon.com/blogs/big-data/how-to-access-and-analyze-on-premises-data-stores-using-aws-glue/ but in this we are connecting to database via catelog . cant we connect directly ?

Ans:- As I mentioned in my previous answer, if you want to use dynamic frame then creation of data catalog is necessary. The method followed in the link that you have mentioned, is making use of dynamic frame and thus Glue Catalog is used.

4. If answer to 1 and 3 is yes then do we need to attach VPC to Glue , if yes then how ?I dont see any option in console to attach VPC directly ?

Ans:- While creating a JDBC connection, you will be prompted to give the details of VPC, subnet and security group. So, when the Glue job is run, the backend resources of Glue shall make use of this VPC. There is no option to attach a VPC directly to the Glue job. This is done only through connection.

5. In security group do we need database ip address mentioned in outbound rules ?

Ans:- In case your security group is restrictive then, you would have to add an outbound rule mentioning the IP address. But on the other hand if there is no restriction on outbound traffic to flow out of the security group then it is not necessary to add any outbound rule. By default, the security group allows all the outgoing traffic. Please do note that an inbound self referencing rule must be added to the security group of Glue job such that it allows all TCP traffic.

profile pictureAWS
SUPPORT ENGINEER
Chaitu
answered 2 years ago
  • Hello Chaitu Thanks for sharing knowledge . I have one more doubt after reading your response :

    1. As per my knowledge while doing local development on docker using AWS glue image I think we cannot create Catelog locally . Do we need to create aws catelog using console and then use it dynamic_frame_from_options within docker .

    2. Also I would like to ask is there way to connect to On premise Database without using catelog ?

    Also can you please help me with https://repost.aws/questions/QU74cjzPDLRE2FZy12yjmBGA/aws-glue-connect-with-on-premise-db

You are not logged in. Log in to post an answer.

A good answer clearly answers the question and provides constructive feedback and encourages professional growth in the question asker.

Guidelines for Answering Questions