Questions tagged with Analytics

Hi team, we are working in an AWS ASEA (accelerator) account that has no outbound connectivity: the VPC is private only, so we cannot connect to the internet to download anything (libraries, ...). Our task is to fetch data from Twitter and do Twitter data processing and sentiment analysis. We would like to know **if there is a way to achieve this when our account doesn't have outbound (internet) connectivity**. Could you please advise on best practices/architecture for these scenarios (Twitter data processing, sentiment analysis)? Thank you
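
Not an official answer, but one building block worth noting for the question above: Amazon Comprehend supports interface VPC endpoints (AWS PrivateLink), so the sentiment-analysis part can be called without internet egress once the tweet text has been landed inside the account by some approved path. A minimal sketch, assuming such an endpoint is provisioned and boto3 is available in the runtime (the region is a placeholder):

```python
# Minimal sketch: sentiment analysis via Amazon Comprehend from a private-only VPC.
# Assumes an interface VPC endpoint for Comprehend exists in the VPC and that the
# tweet text has already been brought into the account (e.g. staged in S3).
import boto3

comprehend = boto3.client("comprehend", region_name="us-east-1")  # region is a placeholder

def tweet_sentiment(text: str) -> dict:
    """Return Comprehend's sentiment label and confidence scores for one tweet."""
    response = comprehend.detect_sentiment(Text=text, LanguageCode="en")
    return {"sentiment": response["Sentiment"], "scores": response["SentimentScore"]}

print(tweet_sentiment("Really enjoying the new release!"))
```

Fetching the data from Twitter itself still needs some egress path (for example a centrally managed proxy or a separate ingestion account), which is a question for the ASEA networking setup rather than for this sketch.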

1 answer · 0 votes · 13 views · Jess · asked 13 hours ago

Hi, I'm using Pinpoint to track mobile app events and send push notifications. For the Android app only, I'm seeing a strange value lower than 1 for the metric "Sessions per endpoint". How could this be possible? ![](/media/postImages/original/IM0gg32FZPQf2CVKIrpLt_rw) Thank you

0 answers · 0 votes · 13 views · asked 2 days ago

Glue Interactive Sessions: How can one monitor job metrics like CPU utilisation, memory usage, and network activity when running a job from interactive sessions? Thanks.
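
A hedged starting point for the question above: Glue publishes job metrics to CloudWatch under the `Glue` namespace, so one way to see whether the session's runs are emitting anything at all is to list that namespace. The sketch below only inspects what is already published; the region is a placeholder:

```python
# Hedged sketch: list whatever metrics AWS Glue is currently publishing to CloudWatch,
# to check whether the interactive-session runs show up there and under which dimensions.
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")  # region is a placeholder

paginator = cloudwatch.get_paginator("list_metrics")
for page in paginator.paginate(Namespace="Glue"):
    for metric in page["Metrics"]:
        dims = {d["Name"]: d["Value"] for d in metric["Dimensions"]}
        print(metric["MetricName"], dims)
```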

Accepted Answer · Analytics · AWS Glue
1 answer · 0 votes · 7 views · asked 2 days ago

How can one set the execution class to FLEX on a Jupyter job run? I'm using the %%configure magic cell as shown below, and I am also setting the input arguments with --execution_class = FLEX, but the jobs are still kicking off as STANDARD.

```
%%configure
{
    "region": "us-east-1",
    "idle_timeout": "480",
    "glue_version": "3.0",
    "number_of_workers": 10,
    "execution_class": "FLEX",
    "worker_type": "G.1X"
}
```

![Enter image description here](/media/postImages/original/IMgaPRfCicTAKewOu41SXTqw)
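
One hedged way to narrow this down is to read a run back with the Glue GetJobRun API, which (as far as I can tell) reports the execution class the service actually applied to that run. Job name and run ID below are placeholders:

```python
# Hedged sketch: read the run back and check what Glue actually recorded for it.
# JobName and RunId are placeholders; requires credentials allowing glue:GetJobRun.
import boto3

glue = boto3.client("glue", region_name="us-east-1")

run = glue.get_job_run(
    JobName="my-notebook-job",    # placeholder: the job backing the notebook run
    RunId="jr_0123456789abcdef",  # placeholder: the run ID shown in the console
)["JobRun"]

# ExecutionClass should read FLEX when the flexible execution class was applied.
print(run.get("ExecutionClass"), run.get("WorkerType"), run.get("GlueVersion"))
```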

2 answers · 0 votes · 54 views · asked 6 days ago

So I have 2 streams, and what I am doing is making 2 tables in Studio, merging the 2 tables using an inner join, and putting all that data in S3. The issue seems to be that, because the times are different, AWS is not able to join the data on the given key. The following is the code for joining the 2 tables. Previously, when I used simple data, everything was working; now there is no error, but the tables are not merging.

```
%flink.ssql(type=update)

INSERT INTO s3_join
SELECT *
FROM ExampleInputStream1
INNER JOIN ExampleInputStream
    ON ExampleInputStream1.seq_num = ExampleInputStream.seq_num
```

1 answer · 0 votes · 10 views · asked 6 days ago

I have a string type for a date, and that column contains the word 'None'. My query for casting the date is below (getting only the month and year from it):

```
date_format(cast(c.enddate as date), '%M') as "Month",
date_format(cast(c.enddate as date), '%Y') as "Year"
```

Error prompted: INVALID_CAST_ARGUMENT: Value cannot be cast to date: None. Can somebody help me with this problem, so that I can still get the month and year only? Thank you in advance!
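
For what it's worth, one pattern that sidesteps the failing cast is to map the literal string 'None' to NULL before casting. A minimal sketch below runs that variant through the Athena API; the database, table name, and output location are placeholders:

```python
# Sketch: treat the literal string 'None' as NULL before casting, so the month/year
# extraction no longer fails on the placeholder rows.
import boto3

athena = boto3.client("athena", region_name="us-east-1")

query = """
SELECT
    date_format(try_cast(nullif(c.enddate, 'None') AS date), '%M') AS "Month",
    date_format(try_cast(nullif(c.enddate, 'None') AS date), '%Y') AS "Year"
FROM my_table c
"""

athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "my_database"},                          # placeholder
    ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},   # placeholder
)
```

`nullif` turns the placeholder string into NULL, and `try_cast` additionally returns NULL instead of failing on any other non-date strings.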

2 answers · 0 votes · 31 views · asked 9 days ago

Hi, I'd appreciate AWS Athena support for the TIMESTAMP data type with microsecond precision for all row formats and table engines. Currently, the support is very inconsistent. See the SQL script below.

```
drop table if exists test_csv;

create external table if not exists test_csv (
    id int,
    created_time timestamp
)
row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
with serdeproperties('separatorChar'=',', 'quoteChar'='"', 'escapeChar'='\\')
location 's3://my-bucket/tmp/timestamp_csv_test/';
-- result: OK

drop table if exists test_parquet;

create external table if not exists test_parquet (
    id int,
    created_time timestamp
)
row format serde 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
stored as
    inputformat 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
    outputformat 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
location 's3://my-bucket/tmp/timestamp_parquet_test/'
tblproperties ('parquet.compress' = 'snappy');
-- result: OK

drop table if exists test_iceberg;

create table if not exists test_iceberg (
    id int,
    created_time timestamp
)
location 's3://my-bucket/tmp/timestamp_iceberg_test/'
tblproperties ( 'table_type' ='iceberg');
-- result: OK

insert into test_csv values (1, timestamp '2023-03-22 11:00:00.123456');
/*
result: ERROR [HY000][100071] [Simba][AthenaJDBC](100071) An error has been thrown from the AWS Athena client.
GENERIC_INTERNAL_ERROR: class org.apache.hadoop.hive.serde2.objectinspector.primitive.WritableIntObjectInspector
cannot be cast to class org.apache.hadoop.hive.serde2.objectinspector.primitive.StringObjectInspector
(org.apache.hadoop.hive.serde2.objectinspector.primitive.WritableIntObjectInspector and
org.apache.hadoop.hive.serde2.objectinspector.primitive.StringObjectInspector are in unnamed module of loader
io.trino.server.PluginClassLoader @1df1bd44). If a data manifest file was generated at
's3://my-bucket/athena_results/ad44adee-2a80-4f41-906a-17aa5dc27730-manifest.csv', you may need to manually clean
the data from locations specified in the manifest. Athena will not delete data in your account. [Execution ID: ***]
*/

insert into test_parquet values (1, timestamp '2023-03-22 11:00:00.123456');
-- result: OK

select * from test_parquet;
-- result: OK
-- DATA: 1,2023-03-22 11:00:00.123000
-- BUT THE TIMESTAMP VALUE IS TRUNCATED TO MILLISECONDS!

insert into test_iceberg values (1, timestamp '2023-03-22 11:00:00.123456');
-- result: OK

select * from test_csv;

select * from test_iceberg;
-- result: OK
-- DATA: 1,2023-03-22 11:00:00.123456
-- THIS IS FINE
```

0 answers · 0 votes · 24 views · asked 9 days ago

Hi community, I am trying to perform an ETL job using AWS Glue. Our data is stored in MongoDB Atlas, inside a VPC, and our AWS account is connected to MongoDB Atlas using VPC peering. To perform the ETL job in AWS Glue, I first created a connection using the VPC details and the MongoDB Atlas URI along with the username and password. The connection is used by the AWS Glue crawlers to extract the schema to AWS Data Catalog tables. This connection works! However, I am then attempting to perform the actual ETL job using the following PySpark code:

```
# My temp variables
source_database = "d*********a"
source_table_name = "main_businesses"
source_mongodb_db_name = "main"
source_mongodb_collection = "businesses"

glueContext.create_dynamic_frame.from_catalog(
    database=source_database,
    table_name=source_table_name,
    additional_options={"database": source_mongodb_db_name, "collection": source_mongodb_collection},
)
```

However, the connection times out and for some reason MongoDB Atlas is blocking the connection from the ETL job. It's as if the ETL job is using the connection differently than the crawler does. Maybe the ETL job is not able to run inside our AWS VPC that is peered with the MongoDB Atlas VPC (VPC peering is not possible?). Does anyone have any idea what might be going on or how I can fix this? Thank you!
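
Not a definitive fix, but one thing worth trying for the question above is reading MongoDB directly with `from_options`, while also making sure the Glue connection is attached to the job itself, since the attached connection is what places the job's workers in the peered VPC. A rough sketch, with placeholder URI and credentials:

```python
# Sketch: read MongoDB Atlas directly with from_options instead of from_catalog.
# The URI, database, collection, and credentials are placeholders; the Glue connection
# must also be attached to the job so it runs inside the VPC peered with Atlas.
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

dynamic_frame = glue_context.create_dynamic_frame.from_options(
    connection_type="mongodb",
    connection_options={
        "uri": "mongodb://<atlas-private-host>:27017",  # assumption: the peered/private URI
        "database": "main",
        "collection": "businesses",
        "username": "<username>",   # placeholder
        "password": "<password>",   # placeholder
        "ssl": "true",
        "ssl.domain_match": "false",
    },
)

print(dynamic_frame.count())
```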

1 answer · 0 votes · 21 views · asked 11 days ago

Hi, I'm creating a dashboard for operators to download the Athena query results. The ID column values contain hyphens (`-`). For example, if the table contains the following data:

| id | name |
| --- | --- |
| `-xyz` | `First example` |
| `a-b-c` | `Second example` |

the generated CSV contains an extra single quote in the id column in the first row:

```csv
"id","name"
"'-xyz","First example"
"a-b-c","Second example"
```

Is there any way to avoid it?

1 answer · 0 votes · 21 views · hota · asked 13 days ago

I have a KPI visual that displays the count of records I have from a dataset. Is it possible for me to make that KPI show me all the records that were included in this count?

0 answers · 0 votes · 5 views · asked 15 days ago

In Redshift, I'm trying to update a table using another table from another database. The error details:

```
SQL Error [XX000]: ERROR: Assert
  Detail:
  -----------------------------------------------
  error:     Assert
  code:      1000
  context:   scan->m_src_id == table_id -
  query:     17277564
  location:  xen_execute.cpp:5251
  process:   padbmaster [pid=30866]
```

The context is not helpful. I have used a similar join-based approach for other tables, and there the update statement has been working fine. Update syntax used:

```
UPDATE ods.schema.tablename
SET "TimeStamp" = GETDATE(),
    "col" = S."col",
FROM ods.schema.tablename T
INNER JOIN stg.schema.tablename S
    ON T.Col = S.Col;
```

1 answer · 0 votes · 21 views · asked 16 days ago

Hello, I am running a job to apply an ETL on a semicolon-separated CSV on S3. However, when I read the file using the DynamicFrame feature of AWS Glue and try to use any method like `printSchema` or `toDF`, I get the following error:

```
py4j.protocol.Py4JJavaError: An error occurred while calling o77.schema.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID 1) (52bff5da55da executor driver): com.amazonaws.services.glue.util.FatalException: Unable to parse file: s3://my-bucket/my-file.csv
```

I have already verified the encoding; it is UTF-8, so there should be no problem. When I read the CSV using `spark.read.csv`, it works fine, and the crawlers can also recognize the schema. The data has some special characters that shouldn't be there, and that's part of the ETL I am looking to perform. Neither the `from_catalog` nor the `from_options` function from AWS Glue works; the problem is the same whether I run the job locally on Docker or in Glue Studio. My data has a folder date partition, so I would prefer to avoid reading the data with Spark directly and to keep taking advantage of the Glue Data Catalog. Thanks in advance.
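
In case it helps to rule out the delimiter, below is a minimal sketch of reading the file with `from_options` and an explicit separator (the S3 path is a placeholder); when going through `from_catalog`, the equivalent check is that the catalog table's SerDe parameters carry the `;` delimiter.

```python
# Sketch: read the semicolon-separated CSV with from_options and an explicit separator,
# one thing worth checking when DynamicFrame parsing fails while spark.read.csv works.
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

dyf = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://my-bucket/my-prefix/"], "recurse": True},  # placeholder path
    format="csv",
    format_options={
        "separator": ";",     # the file is semicolon-delimited
        "withHeader": True,   # assumption: first row is a header
        "quoteChar": '"',
    },
)

dyf.printSchema()
```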

1 answer · 0 votes · 39 views · Aftu · asked 18 days ago
