
Questions tagged with Database


Temp space used up in Aurora PostgreSQL whilst generating Index concurrently

I use an Aurora PostgreSQL cluster (4 nodes in total). The database is partitioned by month, with the largest partition of the table holding around 1.3 TB of data. One of the columns in the table is a JSONB type. I want to add a GIN index on that column so that I can query by fields within the JSONB object, and I am creating the index concurrently so as not to affect live traffic.

I was able to create the GIN index in the QA environment because the data there is relatively small. However, when I try to create the GIN index in production, the server runs out of temp storage while building the index ([see here](https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/AuroraPostgreSQL.Managing.html#AuroraPostgreSQL.Managing.TempStorage) for the temp storage available per node size).

An easy solution would be to scale up the nodes in the cluster so there is more temp space to build the GIN index. This would likely work, but it seems like overkill. If I scaled the nodes up just for the index creation and then scaled them back down, I would be left without sufficient hardware to rebuild the GIN index should it ever need rebuilding, which seems like a smell in production. If I scaled the instances up and left them scaled up, they would be massively overprovisioned and very expensive.

I'm curious whether there is any workaround for this temp space issue so that I would not have to scale up the servers so drastically. Also, if I scaled the servers up and then back down after the GIN index build completes, would that be risky? Is there any reason why the GIN index would have to be completely rebuilt after the initial build? Thanks
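For reference, here is a minimal sketch of the kind of concurrent GIN index build described above, driven from Python with psycopg2. The endpoint, partition, column, and index names are placeholders, not taken from the original post, and the settings shown are illustrative only:

```python
# Minimal sketch (placeholder names) of a concurrent GIN index build on a JSONB
# column, driven from Python with psycopg2. CREATE INDEX CONCURRENTLY cannot run
# inside a transaction block, so autocommit must be enabled first. On a partitioned
# table, a concurrent build like this targets an individual partition.
import psycopg2

conn = psycopg2.connect(
    host="my-aurora-writer.cluster-xxxx.us-east-1.rds.amazonaws.com",  # placeholder endpoint
    dbname="appdb",
    user="app_user",
    password="...",
)
conn.autocommit = True  # required: CONCURRENTLY is not allowed inside a transaction block

with conn.cursor() as cur:
    # A larger maintenance_work_mem lets more of the build happen in memory;
    # whether this avoids spilling to local temp storage depends on the data size.
    cur.execute("SET maintenance_work_mem = '2GB'")
    cur.execute(
        "CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_events_2024_01_payload_gin "
        "ON events_2024_01 USING GIN (payload)"
    )

conn.close()
```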
1 answer · 0 votes · 65 views · asked a day ago

How to use bolt protocol in java to directly execute cypher query in AWS Neptune Service

I am following this article to run an openCypher query directly against a Neptune instance using the Bolt protocol: [Cypher Using Bolt](https://docs.aws.amazon.com/neptune/latest/userguide/access-graph-opencypher-bolt.html). I want to execute the Cypher query directly on the AWS Neptune instance without translating it to Gremlin. Despite following the code shown in the documentation, I am getting the following error:

```
Exception in thread "main" org.neo4j.driver.exceptions.ServiceUnavailableException: Failed to establish connection with the server
    at org.neo4j.driver.internal.util.Futures.blockingGet(Futures.java:143)
Caused by: org.neo4j.driver.internal.shaded.io.netty.handler.ssl.NotSslRecordException: not an SSL/TLS record: 485454502f312e31203430302042616420526571756573740d0a5365727665723a20617773656c622f322e30 (trimmed)
```

Here is the sample Java code for reference:

```java
// Imports assumed for the Neo4j Java driver 4.x and AWS SDK for Java v1
import java.net.URI;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.TimeUnit;

import com.amazonaws.DefaultRequest;
import com.amazonaws.Request;
import com.amazonaws.auth.AWS4Signer;
import com.amazonaws.auth.AWSCredentialsProvider;
import com.amazonaws.auth.AWSStaticCredentialsProvider;
import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.http.HttpMethodName;
import com.google.gson.Gson;

import org.neo4j.driver.AuthTokens;
import org.neo4j.driver.Config;
import org.neo4j.driver.Driver;
import org.neo4j.driver.GraphDatabase;
import org.neo4j.driver.Record;

public class TestImpl {
    private static final String ACCESS_KEY = "XYZ";
    private static final String SECRET_KEY = "ABC";
    private static final String SERVICE_REGION = "AAAA";
    private static final Gson GSON = new Gson();

    public static void main(String[] args) {
        String URL = "bolt://URL:PORT";
        final Driver driver = GraphDatabase.driver(URL, AuthTokens.basic("username", getSignedHeader()), getDefaultConfig());

        String query = "MATCH (ruleSet:RULE_SET) "
                + "WHERE ruleSet.refId = \"aws-iam-best-practices\" "
                + "RETURN ruleSet.refId as refId, ruleSet.name as name, collect(ruleSet.ruleIds) as ruleIds";
        System.out.println(query);

        final Record rec = driver.session().run(query).list().get(0);
        System.out.println(rec.get("refId").asNode().toString());
    }

    private static Config getDefaultConfig() {
        return Config.builder()
                .withConnectionTimeout(30, TimeUnit.SECONDS)
                .withMaxConnectionPoolSize(1000)
                .withDriverMetrics()
                .withLeakedSessionsLogging()
                .withEncryption()
                .withTrustStrategy(Config.TrustStrategy.trustSystemCertificates())
                .build();
    }

    private static String getSignedHeader() {
        // If you are using permanent credentials, use the BasicAWSCredentials access key and secret key
        final BasicAWSCredentials permanentCreds = new BasicAWSCredentials(ACCESS_KEY, SECRET_KEY);
        final AWSCredentialsProvider creds = new AWSStaticCredentialsProvider(permanentCreds);

        // Or, if you are using temporary credentials, use BasicSessionCredentials to
        // pass the access key, secret key, and session token, like this:
        // final BasicSessionCredentials temporaryCredentials = new BasicSessionCredentials(ACCESS_KEY, SECRET_KEY, AWS_SESSION_TOKEN);
        // final AWSCredentialsProvider tempCreds = new AWSStaticCredentialsProvider(temporaryCredentials);

        String signedHeader = "";

        final Request<Void> request = new DefaultRequest<Void>("neptune-db"); // Request to Neptune
        request.setHttpMethod(HttpMethodName.GET);
        request.setEndpoint(URI.create("https://NeptuneServiceURL"));
        // Comment out the following line if you're using an engine version older than 1.2.0.0
        request.setResourcePath("/openCypher");

        final AWS4Signer signer = new AWS4Signer();
        signer.setRegionName(SERVICE_REGION);
        signer.setServiceName(request.getServiceName());
        signer.sign(request, creds.getCredentials());

        signedHeader = getAuthInfoJson(request);
        return signedHeader;
    }

    private static String getAuthInfoJson(final Request<Void> request) {
        final Map<String, Object> obj = new HashMap<>();
        obj.put("Authorization", request.getHeaders().get("Authorization"));
        obj.put("HttpMethod", request.getHttpMethod());
        obj.put("X-Amz-Date", request.getHeaders().get("X-Amz-Date"));
        obj.put("Host", request.getEndpoint().getHost());

        // If temporary credentials are used, include the security token in
        // the request, like this:
        // obj.put("X-Amz-Security-Token", request.getHeaders().get("X-Amz-Security-Token"));

        final String json = GSON.toJson(obj);
        return json;
    }
}
```

Please guide me on what my mistake is in this process. Thank you in advance :).
0 answers · 0 votes · 30 views · asked 3 days ago

Large amount of time spent in "com.amazonaws.services.glue.DynamicFrame.recomputeSchema" in Spark/AWS Glue

Dear Glue community,

I'm using PySpark with AWS Glue for a large ETL job. Most of my source data sits in an AWS RDS Postgres instance. I read all my tables directly with `create_data_frame.from_catalog` because the logic is quite complex and implemented in pure PySpark; I don't use DynamicFrames at all except when writing the final data back to S3.

The first thing that puzzled me in the AWS Glue documentation was this part: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-extensions-glue-context.html#aws-glue-api-crawler-pyspark-extensions-glue-context-create-dataframe-from-catalog

> **create_data_frame_from_catalog**
>
> create_data_frame_from_catalog(database, table_name, transformation_ctx = "", additional_options = {})
>
> Returns a DataFrame that is created using information from a Data Catalog table. **Use this function only with AWS Glue streaming sources.**

Can someone explain why we are not supposed to use `create_data_frame_from_catalog` with non-streaming sources? In my case none of my sources are streaming, so would it change anything to use `create_dynamic_frame_from_catalog().toDF()` instead?

However, my main problem is that one of my sources sits in S3 and is quite large (by my standards, 1 TB as gzipped CSV). I configured the crawler and added the table to my Glue database; the schema consists of only 3 columns, and everything seems in order for this table. However, when I try to create a DataFrame for this table with `create_data_frame.from_catalog`, I get an additional Spark stage which mostly consists of:

```
fromRDD at DynamicFrame.scala:320
org.apache.spark.sql.glue.util.SchemaUtils$.fromRDD(SchemaUtils.scala:74)
com.amazonaws.services.glue.DynamicFrame.recomputeSchema(DynamicFrame.scala:320)
com.amazonaws.services.glue.DynamicFrame.schema(DynamicFrame.scala:296)
com.amazonaws.services.glue.DynamicFrame.toDF(DynamicFrame.scala:385)
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
java.lang.reflect.Method.invoke(Method.java:498)
py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
py4j.Gateway.invoke(Gateway.java:282)
py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
py4j.commands.CallCommand.execute(CallCommand.java:79)
py4j.GatewayConnection.run(GatewayConnection.java:238)
java.lang.Thread.run(Thread.java:748)
```

It's unclear to me why recomputing the schema would be necessary here, as the table with its schema is already incorporated into my Glue database and there is no reason to assume it has changed. At the moment, with 5 G2.X workers, this step takes around 12 hours. After reading about AWS Glue streaming sources and `create_data_frame.from_catalog`, I tried loading the DataFrame with `create_dynamic_frame_from_catalog().toDF()`, but the schema recomputation step occurred nevertheless.

Does someone have any idea why recomputing the schema is necessary at that step? Could I force it not to recompute in some way or another?

A more general question: is there any other way to access this terabyte of data more efficiently than in S3? I don't want to put it in an AWS RDS instance because these data would be very infrequently accessed. Should I be looking at something else, DynamoDB, etc.?

Thanks a lot for your help,
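For context, here is a minimal sketch of the two loading paths being compared; the database and table names (`my_glue_db`, `my_s3_table`) are placeholders, not taken from the original post:

```python
# Sketch of the two ways of loading the catalog table discussed above
# (placeholder database/table names).
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Path 1: build a Spark DataFrame straight from the Data Catalog. The documentation
# quoted above says to use this only with AWS Glue streaming sources.
df_direct = glue_context.create_data_frame.from_catalog(
    database="my_glue_db",
    table_name="my_s3_table",
)

# Path 2: build a DynamicFrame first, then convert. The toDF() call is where
# DynamicFrame.recomputeSchema appears in the stack trace above.
df_via_dynamic = glue_context.create_dynamic_frame.from_catalog(
    database="my_glue_db",
    table_name="my_s3_table",
).toDF()
```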
0 answers · 0 votes · 21 views · asked 6 days ago