How to do faster data manipulation operation

0

Hello, Its db.r7g.8xlarge aurora postgres 32VCPU, 256GB RAM instance. PG version 15.4.

A non-partitioned table having ~4.8 billion rows in it and is having data size as ~1.4TB, row size as ~313 bytes having ~127 columns in it. This has got approx. ~1billion+ duplicate rows inserted in it, and we want to get the duplicate data removed for this table and create a PK/unique constraint back so as to not have the duplicate values in future. We are struggling to do the same.

Teammates suggested to do this using CTAS method, i.e. create a new table with the unique record set and then drop the main table. Something as below

create table TAB1_New as SELECT * from TAB1 A where ID in (select min(ID) from TAB1 group by ID having count(ID)>=1 );

But for above to work faster, they mentioned to have an index created on the column using which the duplicate check will be performed i.e. ID column. So, creating the index itself took ~2hrs+ and the index size now showing as ~116GB.

Create index idx1 on public.TAB1(ID)

And now running the SELECT subquery part of the CTAS statement to see if its giving the correct unique records count. Its running for 2.5hrs+ and then we killed it. Below is the plan for the same.

explain select min(ID) from TAB1 A group by ID having count(ID)>=1

GroupAggregate (cost=0.71..6025790113.87 rows=29432861 width=46) Group Key: ID Filter: (count(ID) >= 1) -> Index Only Scan using idx1 on TAB1 a (cost=0.71..5988060903.17 rows=4883397120 width=14)

I want to understand if by anyway this can be done faster. also, I am worried that, creating PK constraint/index back after deleting the duplicate is also going to run forever. Is there any way we can make these heavy operations faster by facilitating more database resources through some parameter setup, like parallel hint etc.? We have pg_hint_plan extension added, but not seeing the parallel hint enforced when adding it to the query.

In oracle we have Parallel hints, Direct path read/write for faster read/write operations, parallel index scan etc. available for dealing with large data volume, if anything similar to that available in aurora postgresql.

질문됨 3달 전231회 조회
답변 없음

로그인하지 않았습니다. 로그인해야 답변을 게시할 수 있습니다.

좋은 답변은 질문에 명확하게 답하고 건설적인 피드백을 제공하며 질문자의 전문적인 성장을 장려합니다.

질문 답변하기에 대한 가이드라인

관련 콘텐츠