How to do faster data manipulation operation

0

Hello, Its db.r7g.8xlarge aurora postgres 32VCPU, 256GB RAM instance. PG version 15.4.

A non-partitioned table having ~4.8 billion rows in it and is having data size as ~1.4TB, row size as ~313 bytes having ~127 columns in it. This has got approx. ~1billion+ duplicate rows inserted in it, and we want to get the duplicate data removed for this table and create a PK/unique constraint back so as to not have the duplicate values in future. We are struggling to do the same.

Teammates suggested to do this using CTAS method, i.e. create a new table with the unique record set and then drop the main table. Something as below

create table TAB1_New as SELECT * from TAB1 A where ID in (select min(ID) from TAB1 group by ID having count(ID)>=1 );

But for above to work faster, they mentioned to have an index created on the column using which the duplicate check will be performed i.e. ID column. So, creating the index itself took ~2hrs+ and the index size now showing as ~116GB.

Create index idx1 on public.TAB1(ID)

And now running the SELECT subquery part of the CTAS statement to see if its giving the correct unique records count. Its running for 2.5hrs+ and then we killed it. Below is the plan for the same.

explain select min(ID) from TAB1 A group by ID having count(ID)>=1

GroupAggregate (cost=0.71..6025790113.87 rows=29432861 width=46) Group Key: ID Filter: (count(ID) >= 1) -> Index Only Scan using idx1 on TAB1 a (cost=0.71..5988060903.17 rows=4883397120 width=14)

I want to understand if by anyway this can be done faster. also, I am worried that, creating PK constraint/index back after deleting the duplicate is also going to run forever. Is there any way we can make these heavy operations faster by facilitating more database resources through some parameter setup, like parallel hint etc.? We have pg_hint_plan extension added, but not seeing the parallel hint enforced when adding it to the query.

In oracle we have Parallel hints, Direct path read/write for faster read/write operations, parallel index scan etc. available for dealing with large data volume, if anything similar to that available in aurora postgresql.

asked 3 months ago230 views
No Answers

You are not logged in. Log in to post an answer.

A good answer clearly answers the question and provides constructive feedback and encourages professional growth in the question asker.

Guidelines for Answering Questions