Questions tagged with AWS Glue DataBrew


PySpark code failing while connecting to Oracle database: "Invalid Oracle URL specified"

Hello all, I have created three Docker containers running on one network, using the following images: postgres, the AWS Glue image, and an Oracle image. Sharing the docker-compose file:

```
version: "2"
services:
  spark-postgres:
    image: postgres:latest
    container_name: spark-postgres
    build: ./postgresql
    restart: always
    hostname: spark-postgres
    env_file:
      - ./env/postgresdb-env-vars.env
    ports:
      - "5432:5432"
    volumes:
      - ./data/territoryhub-replication/postgresql:/var/lib/postgresql/data
    networks:
      glue-network:
        ipv4_address: 10.4.0.4
  spark-oracle:
    image: oracle:test
    container_name: spark-oracle
    build: ./oracle
    restart: always
    hostname: spark-oracle
    env_file:
      - ./env/oracledb-env-vars.env
    ports:
      - "1521:1521"
    volumes:
      - ./data/territoryhub-replication/oracle:/opt/oracle/oradata
      - ./oracle/oracle-scripts:/opt/oracle/scripts/startup
    networks:
      glue-network:
        ipv4_address: 10.4.0.5
  spark-master:
    image: spark-master
    container_name: spark-master
    build: ./spark
    hostname: spark-master
    depends_on:
      - spark-postgres
      - spark-oracle
    ports:
      - "8888:8888"
      - "4040:4040"
    env_file:
      - ./env/spark-env-vars.env
    command: "/home/jupyter/jupyter_start.sh"
    volumes:
      - ../app/territoryhub-replication:/home/jupyter/jupyter_default_dir
    networks:
      glue-network:
        ipv4_address: 10.4.0.3
networks:
  glue-network:
    driver: bridge
    ipam:
      config:
        - subnet: 10.4.0.0/16
          gateway: 10.4.0.1
```

I should also mention that I could not find an Oracle JDBC driver to connect to the Oracle database, so I added the jar to my Glue image (spark-master) myself. Sharing that Dockerfile:

```
FROM amazon/aws-glue-libs:glue_libs_1.0.0_image_01
COPY jar/ojdbc8.jar /home/spark-2.4.3-bin-spark-2.4.3-bin-hadoop2.8/jars/ojdbc8.jar
RUN mkdir -p /root/.aws
RUN echo "[default]\nregion=us-east-1" >> /root/.aws/config
```

Now I am simply trying to connect to the Oracle database on the same network, from the Jupyter notebook I run locally at http://localhost:8888/tree?. The PySpark code is:

```
from pyspark import SparkContext
from awsglue.context import GlueContext

glueContext = GlueContext(SparkContext.getOrCreate())
inputDF = glueContext.create_dynamic_frame_from_options(
    connection_type="oracle",
    connection_options={
        "url": "jdbc:oracle:thin:@//10.4.0.5:1521:ORCLCDB",
        "user": "system",
        "password": "<some pwd>",
        "dbtable": "<some table name>",
    },
)
inputDF.toDF().show()
```

I am getting the error "Invalid Oracle URL specified":
```
An error was encountered:
An error occurred while calling o303.getDynamicFrame.
: java.sql.SQLException: Invalid Oracle URL specified
    at oracle.jdbc.driver.PhysicalConnection.parseUrl(PhysicalConnection.java:1738)
    at oracle.jdbc.driver.PhysicalConnection.readConnectionProperties(PhysicalConnection.java:1419)
    at oracle.jdbc.driver.PhysicalConnection.<init>(PhysicalConnection.java:943)
    at oracle.jdbc.driver.PhysicalConnection.<init>(PhysicalConnection.java:928)
    at oracle.jdbc.driver.T4CConnection.<init>(T4CConnection.java:557)
    at oracle.jdbc.driver.T4CDriverExtension.getConnection(T4CDriverExtension.java:68)
    at oracle.jdbc.driver.OracleDriver.connect(OracleDriver.java:732)
    at oracle.jdbc.driver.OracleDriver.connect(OracleDriver.java:648)
    at com.amazonaws.services.glue.util.JDBCWrapper$$anonfun$8.apply(JDBCUtils.scala:900)
    at com.amazonaws.services.glue.util.JDBCWrapper$$anonfun$8.apply(JDBCUtils.scala:896)
    at com.amazonaws.services.glue.util.JDBCWrapper$$anonfun$connectWithSSLAttempt$1$$anonfun$apply$6.apply(JDBCUtils.scala:852)
    at scala.Option.getOrElse(Option.scala:121)
    at com.amazonaws.services.glue.util.JDBCWrapper$$anonfun$connectWithSSLAttempt$1.apply(JDBCUtils.scala:852)
    at scala.Option.getOrElse(Option.scala:121)
    at com.amazonaws.services.glue.util.JDBCWrapper$.connectWithSSLAttempt(JDBCUtils.scala:852)
    at com.amazonaws.services.glue.util.JDBCWrapper$.connectionProperties(JDBCUtils.scala:895)
    at com.amazonaws.services.glue.util.JDBCWrapper.connectionProperties$lzycompute(JDBCUtils.scala:671)
    at com.amazonaws.services.glue.util.JDBCWrapper.connectionProperties(JDBCUtils.scala:671)
    at com.amazonaws.services.glue.util.JDBCWrapper.tableDF(JDBCUtils.scala:797)
    at com.amazonaws.services.glue.util.NoCondition$.tableDF(JDBCUtils.scala:85)
    at com.amazonaws.services.glue.util.NoJDBCPartitioner$.tableDF(JDBCUtils.scala:124)
    at com.amazonaws.services.glue.JDBCDataSource.getDynamicFrame(DataSource.scala:863)
    at com.amazonaws.services.glue.DataSource$class.getDynamicFrame(DataSource.scala:97)
    at com.amazonaws.services.glue.SparkSQLDataSource.getDynamicFrame(DataSource.scala:683)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)

Traceback (most recent call last):
  File "/home/aws-glue-libs/awsglue.zip/awsglue/context.py", line 204, in create_dynamic_frame_from_options
    return source.getFrame(**kwargs)
  File "/home/aws-glue-libs/awsglue.zip/awsglue/data_source.py", line 36, in getFrame
    jframe = self._jsource.getDynamicFrame()
  File "/home/spark-2.4.3-bin-spark-2.4.3-bin-hadoop2.8/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/home/spark-2.4.3-bin-spark-2.4.3-bin-hadoop2.8/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/home/spark-2.4.3-bin-spark-2.4.3-bin-hadoop2.8/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
    format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o303.getDynamicFrame.
: java.sql.SQLException: Invalid Oracle URL specified
    (the same Java stack trace as above repeats here)
```
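For context on the error itself: the Oracle thin driver accepts two URL shapes, and the URL in the question mixes them (the `@//host:port` prefix belongs to the service-name form, while the trailing `:ORCLCDB` is SID syntax). Below is a minimal sketch of the two valid forms; whether ORCLCDB is configured as a SID or a service name on this container is an assumption, and only the form matching the database's actual configuration will connect:

```
# Minimal sketch of the two Oracle thin-driver URL forms. Whether ORCLCDB is
# a SID or a service name on this container is an assumption; only the form
# matching the database's actual configuration will connect.
host, port = "10.4.0.5", 1521

# SID form: host, port, and SID separated by colons, no "//" prefix.
sid_url = f"jdbc:oracle:thin:@{host}:{port}:ORCLCDB"

# Service-name form: "//" prefix, service name after a slash.
service_url = f"jdbc:oracle:thin:@//{host}:{port}/ORCLCDB"

print(sid_url)      # jdbc:oracle:thin:@10.4.0.5:1521:ORCLCDB
print(service_url)  # jdbc:oracle:thin:@//10.4.0.5:1521/ORCLCDB
```

Either URL would then be passed unchanged as the `"url"` value in `connection_options`.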
2 answers | 0 votes | 183 views
Purnima, asked 6 months ago

AWS Glue image certificate-related issue

Hello team, I have created the Docker compose file below:

```
version: "2"
services:
  spark:
    image: glue/spark:latest
    container_name: spark
    build: ./spark
    hostname: spark
    ports:
      - "8888:8888"
      - "4040:4040"
    entrypoint: sh
    command: -c "/home/glue_user/jupyter/jupyter_start.sh"
    volumes:
      - ../app/territoryhub-replication:/home/glue_user/workspace/jupyter_workspace
```

The Dockerfile called in the build section is as follows:

```
FROM amazon/aws-glue-libs:glue_libs_3.0.0_image_01
USER root
RUN mkdir -p /root/.aws
RUN echo "[default]\nregion=us-east-1" >> /root/.aws/config
```

The container starts but Jupyter is failing (sharing logs):

```
Starting Jupyter with SSL
/home/glue_user/jupyter/jupyter_start.sh: line 4: livy-server: command not found
[I 2022-05-12 15:41:33.032 ServerApp] jupyterlab | extension was successfully linked.
[I 2022-05-12 15:41:33.044 ServerApp] nbclassic | extension was successfully linked.
[I 2022-05-12 15:41:33.046 ServerApp] Writing Jupyter server cookie secret to /root/.local/share/jupyter/runtime/jupyter_cookie_secret
[I 2022-05-12 15:41:33.541 ServerApp] sparkmagic | extension was found and enabled by notebook_shim. Consider moving the extension to Jupyter Server's extension paths.
[I 2022-05-12 15:41:33.541 ServerApp] sparkmagic | extension was successfully linked.
[I 2022-05-12 15:41:33.541 ServerApp] notebook_shim | extension was successfully linked.
[W 2022-05-12 15:41:33.556 ServerApp] All authentication is disabled. Anyone who can connect to this server will be able to run code.
[I 2022-05-12 15:41:33.558 ServerApp] notebook_shim | extension was successfully loaded.
[I 2022-05-12 15:41:33.560 LabApp] JupyterLab extension loaded from /usr/local/lib/python3.7/site-packages/jupyterlab
[I 2022-05-12 15:41:33.560 LabApp] JupyterLab application directory is /usr/local/share/jupyter/lab
[I 2022-05-12 15:41:33.565 ServerApp] jupyterlab | extension was successfully loaded.
[I 2022-05-12 15:41:33.569 ServerApp] nbclassic | extension was successfully loaded.
[I 2022-05-12 15:41:33.569 ServerApp] sparkmagic extension enabled!
[I 2022-05-12 15:41:33.569 ServerApp] sparkmagic | extension was successfully loaded.
Traceback (most recent call last):
  File "/usr/local/bin/jupyter-lab", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.7/site-packages/jupyter_server/extension/application.py", line 584, in launch_instance
    serverapp = cls.initialize_server(argv=args)
  File "/usr/local/lib/python3.7/site-packages/jupyter_server/extension/application.py", line 557, in initialize_server
    find_extensions=find_extensions,
  File "/usr/local/lib/python3.7/site-packages/traitlets/config/application.py", line 88, in inner
    return method(app, *args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/jupyter_server/serverapp.py", line 2421, in initialize
    self.init_httpserver()
  File "/usr/local/lib/python3.7/site-packages/jupyter_server/serverapp.py", line 2251, in init_httpserver
    max_buffer_size=self.max_buffer_size,
  File "/usr/local/lib64/python3.7/site-packages/tornado/util.py", line 288, in __new__
    instance.initialize(*args, **init_kwargs)
  File "/usr/local/lib64/python3.7/site-packages/tornado/httpserver.py", line 191, in initialize
    read_chunk_size=chunk_size,
  File "/usr/local/lib64/python3.7/site-packages/tornado/tcpserver.py", line 134, in __init__
    'certfile "%s" does not exist' % self.ssl_options["certfile"]
ValueError: certfile "/home/glue_user/.certs/my_key_store.pem" does not exist
```

Please help in resolving this. Many thanks.
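The traceback shows jupyter_start.sh launching Jupyter with SSL enabled while no PEM file exists at the expected path. One possible workaround, sketched below under the assumption that a self-signed certificate is acceptable for local development, is to pre-create the PEM at that path before Jupyter starts (the Glue image also appears to honor a DISABLE_SSL environment variable, which would sidestep the certificate entirely):

```
# Hedged sketch: create a self-signed PEM at the path the Glue 3.0 image's
# Jupyter startup expects. The path comes from the traceback above; using a
# self-signed certificate for local development is an assumption.
import subprocess
from pathlib import Path

pem = Path("/home/glue_user/.certs/my_key_store.pem")
pem.parent.mkdir(parents=True, exist_ok=True)

key = pem.with_suffix(".key")
crt = pem.with_suffix(".crt")

# Generate an unencrypted key and a self-signed certificate valid for a year.
subprocess.run(
    ["openssl", "req", "-x509", "-newkey", "rsa:2048", "-nodes",
     "-keyout", str(key), "-out", str(crt),
     "-days", "365", "-subj", "/CN=localhost"],
    check=True,
)

# Jupyter/tornado accept a single PEM containing both key and certificate.
pem.write_bytes(key.read_bytes() + crt.read_bytes())
```

Such a script could run in the Dockerfile or as part of the container's entrypoint, ahead of jupyter_start.sh.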
1 answer | 0 votes | 99 views
asked 7 months ago

Data Quality Framework in AWS

I am trying to implement a data quality framework for an application that ingests data from various systems (batch, near real time, real time). A few items I want to highlight:

* The data pipelines vary widely and ingest very high volumes of data. They are developed using Spark, Python, EMR clusters, Kafka, and Kinesis streams.
* Any new system we onboard into the framework should be able to include the data quality checks with minimal coding, so some sort of metadata framework might help: for example, storing the business rules in DynamoDB so checks run automatically against different feeders or any newly created data pipeline (sketched after this question).
* Our tech stack includes AWS, Python, Spark, and Java, so kindly advise related services (AWS Glue DataBrew, PyDeequ, the Great Expectations library, and various Lambda event-driven services are some I want to focus on).
* I am also looking for some sort of audit, balance, and control mechanism: auditing the source data, balancing the number of records between two points, and having an automated mechanism to remediate (control) discrepancies.
* I am looking for testing frameworks for the different data pipelines. For data profiling, kindly advise tools/libraries; AWS Glue DataBrew and Pandas are some I am exploring.

I know there won't be one specific solution, so I appreciate any and all ideas. A flow diagram of audit, balance, and control with an automated data validation and testing mechanism for data pipelines would be very helpful.
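To make the metadata-driven idea concrete, here is a minimal sketch, assuming rules are stored as plain items (for example, DynamoDB records) and applied by a single generic runner. The rule names, fields, and sample records are illustrative assumptions, not from any particular library:

```
# Hedged sketch of a metadata-driven quality check: rules live in a table
# (e.g. DynamoDB items) and one generic runner applies them to any feed.
rules = [
    {"column": "order_id", "check": "not_null"},
    {"column": "amount", "check": "min", "value": 0},
]

def run_checks(records, rules):
    """Return (rule, failing_count) audit results for each rule."""
    results = []
    for rule in rules:
        col = rule["column"]
        if rule["check"] == "not_null":
            failing = sum(1 for r in records if r.get(col) is None)
        elif rule["check"] == "min":
            failing = sum(
                1 for r in records
                if r.get(col) is not None and r[col] < rule["value"]
            )
        else:
            raise ValueError(f"unknown check: {rule['check']}")
        results.append((rule, failing))
    return results

# Illustrative records; in practice these would come from a pipeline batch.
records = [
    {"order_id": 1, "amount": 10.0},
    {"order_id": None, "amount": -5.0},
]
for rule, failing in run_checks(records, rules):
    print(rule, "-> failing rows:", failing)
```

Onboarding a new feed then only means adding rule items to the table, which matches the "minimal coding" requirement; libraries like PyDeequ or Great Expectations play the role of `run_checks` at scale.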
1 answer | 0 votes | 304 views
asked 7 months ago

AWS Glue fails to connect to RDS PostgreSQL

I failed to connect to the RDS PostgreSQL database using Glue; the connection test returned the following message:

```
Check that your connection definition references your JDBC database with correct URL syntax, username, and password. The authentication type 10 is not supported. Check that you have configured the pg_hba.conf file to include the client's IP address or subnet, and that it is using an authentication scheme supported by the driver. Exiting with error code 30
```

My connection settings:

* Type: JDBC
* JDBC URL: jdbc:postgresql://xxx.xxx.us-west-2.rds.amazonaws.com:5432/xxx
* VPC ID: vpc-xxx
* Subnet: xxx
* Security group: sg-xxx
* SSL connection required: false

I have checked the above configuration against the PostgreSQL database and there should be no problem. My Glue IAM permissions:

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": "rds:*",
            "Resource": "*",
            "Condition": {
                "BoolIfExists": {
                    "aws:MultiFactorAuthPresent": "true"
                }
            }
        },
        {
            "Sid": "VisualEditor1",
            "Effect": "Allow",
            "Action": [
                "s3:GetAccessPoint",
                "ec2:DescribeAddresses",
                "ec2:DescribeByoipCidrs",
                "s3:GetBucketPolicy",
                "glue:*",
                "kms:*",
                "s3:GetAccessPointPolicyStatus",
                "s3:GetBucketPolicyStatus",
                "s3:GetBucketPublicAccessBlock",
                "s3:GetMultiRegionAccessPointPolicyStatus",
                "rds-db:*",
                "s3:GetMultiRegionAccessPointPolicy",
                "s3:ListAccessPoints",
                "s3:GetMultiRegionAccessPoint",
                "rds-data:*",
                "s3:ListMultiRegionAccessPoints",
                "s3:GetBucketAcl",
                "s3:DescribeMultiRegionAccessPointOperation",
                "s3:PutObject",
                "s3:GetObject",
                "s3:GetAccountPublicAccessBlock",
                "s3:ListAllMyBuckets",
                "ec2:DescribeVpcs",
                "ec2:DescribeVpcEndpoints",
                "s3:GetBucketLocation",
                "s3:GetAccessPointPolicy"
            ],
            "Resource": "*"
        }
    ]
}
```

The error message suggests three possible directions:

1. Check that your connection definition references your JDBC database with correct URL syntax, username, and password. I don't think this is wrong.
2. The authentication type 10 is not supported. I'm not exactly sure what this error means; all my Google results say to modify pg_hba.conf, but RDS does not allow modifying this file.
3. Check that you have configured the pg_hba.conf file to include the client's IP address or subnet, and that it is using an authentication scheme supported by the driver. I don't understand what this means either.
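For background: "authentication type 10 is not supported" is the message an older PostgreSQL JDBC driver emits when the server requires SCRAM-SHA-256 password authentication (the default on newer PostgreSQL/RDS versions), so this usually points at the driver version rather than pg_hba.conf or IAM. Below is a hedged sketch of one way around it in a Glue script, using Glue's documented custom JDBC driver connection options; the S3 path, driver version, credentials, and table name are placeholder assumptions:

```
# Hedged sketch: point a Glue read at a newer PostgreSQL JDBC driver that
# supports SCRAM-SHA-256. Runs inside a Glue job/container; the S3 path,
# credentials, and table name below are placeholders, not real values.
from pyspark.context import SparkContext
from awsglue.context import GlueContext

glue_context = GlueContext(SparkContext.getOrCreate())

dyf = glue_context.create_dynamic_frame_from_options(
    connection_type="postgresql",
    connection_options={
        "url": "jdbc:postgresql://xxx.xxx.us-west-2.rds.amazonaws.com:5432/xxx",
        "user": "<username>",
        "password": "<password>",
        "dbtable": "public.some_table",
        # Override the bundled driver with a newer one uploaded to S3
        # (placeholder path and version):
        "customJdbcDriverS3Path": "s3://my-bucket/drivers/postgresql-42.5.1.jar",
        "customJdbcDriverClassName": "org.postgresql.Driver",
    },
)
dyf.toDF().show()
```

An alternative sometimes used is lowering the RDS parameter group's password_encryption to md5 and resetting the password, but upgrading the driver is the less invasive direction.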
1 answer | 0 votes | 1267 views
asked 9 months ago