1. Do we always need a catalog and a crawler to connect to a database?
Ans:- No. A Glue Data Catalog table is mandatory only if you read through the catalog, i.e. with create_dynamic_frame.from_catalog. If your requirement is to use spark read and store the data in DataFrames, or to build a Dynamic Frame with create_dynamic_frame.from_options, the Glue Catalog is not required. Crawlers exist only to create Glue Data Catalog tables. Instead of using a crawler, you can also create a catalog table manually.
2. Can we connect to the database directly, using either spark read or dynamic_frame_from_options, by passing the DB details?
Ans:- Yes, provided that you attach a new or existing JDBC connection to the Glue job. The JDBC connection contains the details of your on-premises Oracle database along with the VPC, subnet and security group that the Glue job will use. Note that only create_dynamic_frame.from_catalog needs a Glue Catalog table; from_options takes the connection details directly.
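A direct read without a catalog table could look roughly like the sketch below. The helper names, host, service name and table are hypothetical, and the URL uses the standard Oracle thin-driver form, which is an assumption about your setup. The two read functions only run inside a Glue job (or a Spark session with the Oracle JDBC driver available) that has network access to the database.

```python
def oracle_jdbc_options(host, port, service, table, user, password):
    """Build the JDBC options shared by both read paths (hypothetical helper)."""
    return {
        # Standard Oracle thin-driver URL form; adjust to your listener setup.
        "url": f"jdbc:oracle:thin:@//{host}:{port}/{service}",
        "dbtable": table,
        "user": user,
        "password": password,
    }


def read_as_dynamic_frame(glue_context, options):
    """Read into a Glue DynamicFrame with no Data Catalog table involved."""
    return glue_context.create_dynamic_frame.from_options(
        connection_type="oracle",
        connection_options=options,
    )


def read_as_data_frame(spark, options):
    """Read into a plain Spark DataFrame through the JDBC data source."""
    return (
        spark.read.format("jdbc")
        .options(driver="oracle.jdbc.OracleDriver", **options)
        .load()
    )


# Placeholder values for illustration only.
opts = oracle_jdbc_options(
    "db.example.internal", 1521, "ORCLPDB1", "HR.EMPLOYEES", "glue_user", "secret"
)
```

In a real job you would avoid hard-coding the password, for example by reading it from a secret store before building the options.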
3. I was going through https://aws.amazon.com/blogs/big-data/how-to-access-and-analyze-on-premises-data-stores-using-aws-glue/ but in it we connect to the database via the catalog. Can't we connect directly?
Ans:- The method followed in the link you mentioned reads the data through Glue Catalog tables, which is why a catalog is created there. A catalog table is necessary only for that approach. A direct read with spark read or create_dynamic_frame.from_options does not need one.
4. If the answer to 1 and 3 is yes, do we need to attach a VPC to Glue? If so, how? I don't see any option in the console to attach a VPC directly.
Ans:- When you create a JDBC connection, you are prompted for the VPC, subnet and security group. When the Glue job runs with that connection attached, the backend resources of Glue use this VPC. There is no option to attach a VPC directly to the Glue job; it is done only through a connection.
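The same connection can also be created programmatically. The sketch below only builds the ConnectionInput structure that the Glue CreateConnection API accepts; the helper name and every ID and value in it are placeholders, not values from this thread.

```python
def jdbc_connection_input(name, jdbc_url, user, password,
                          subnet_id, security_group_ids, availability_zone):
    """Build a ConnectionInput dict for Glue's CreateConnection API (hypothetical helper)."""
    return {
        "Name": name,
        "ConnectionType": "JDBC",
        "ConnectionProperties": {
            "JDBC_CONNECTION_URL": jdbc_url,
            "USERNAME": user,
            "PASSWORD": password,
        },
        # This is where the VPC comes in: the subnet and security groups
        # decide which network the Glue job's resources run in.
        "PhysicalConnectionRequirements": {
            "SubnetId": subnet_id,
            "SecurityGroupIdList": list(security_group_ids),
            "AvailabilityZone": availability_zone,
        },
    }


# Placeholder IDs for illustration only.
conn = jdbc_connection_input(
    "onprem-oracle",
    "jdbc:oracle:thin:@//db.example.internal:1521/ORCLPDB1",
    "glue_user",
    "secret",
    "subnet-0123456789abcdef0",
    ["sg-0123456789abcdef0"],
    "us-east-1a",
)

# With boto3 installed and credentials configured, this could be sent as:
#   boto3.client("glue").create_connection(ConnectionInput=conn)
```

The connection is then attached to the job by name, and the job picks up the subnet and security group from it.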
5. In the security group, do we need the database IP address in the outbound rules?
Ans:- If your security group restricts outbound traffic, you have to add an outbound rule that allows traffic to the database IP address. If outbound traffic is unrestricted, no extra outbound rule is needed. By default, a security group allows all outgoing traffic. Please note that the security group used by the Glue job must also have a self-referencing inbound rule that allows all TCP traffic.
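The two rules described above can be sketched as EC2 IpPermissions entries. The helper names, the security group ID and the database IP are placeholders, and port 1521 is assumed only because it is the usual Oracle listener port; use whatever port your database actually listens on.

```python
def self_referencing_inbound_rule(security_group_id):
    """Inbound rule allowing all TCP traffic from the same security group."""
    return {
        "IpProtocol": "tcp",
        "FromPort": 0,
        "ToPort": 65535,
        "UserIdGroupPairs": [{"GroupId": security_group_id}],
    }


def outbound_rule_to_database(db_ip, port=1521):
    """Outbound rule to one database host; needed only if egress is restricted."""
    return {
        "IpProtocol": "tcp",
        "FromPort": port,
        "ToPort": port,
        "IpRanges": [{"CidrIp": f"{db_ip}/32"}],
    }


# Placeholder values for illustration only.
inbound = self_referencing_inbound_rule("sg-0123456789abcdef0")
outbound = outbound_rule_to_database("10.0.5.20")

# With boto3 installed and credentials configured, these could be applied as:
#   ec2 = boto3.client("ec2")
#   ec2.authorize_security_group_ingress(
#       GroupId="sg-0123456789abcdef0", IpPermissions=[inbound])
#   ec2.authorize_security_group_egress(
#       GroupId="sg-0123456789abcdef0", IpPermissions=[outbound])
```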