How to detect a duplicate row and then update it in PySpark?

Question

I have a Glue ETL Job in which some of the data has issues.  There is a case where a row is duplicated, and what I need to do is increase the value by 1 hour on the duplicate.

So imagine a set of data that looks like:

| Name | Color | Size | Value |
| --- | --- | --- | --- |
| Alpha | Blue | Large | 1 |
| Alpha | Blue | Large | 1 |
| Bravo | Red | Small | 5 |

So it would see that the Alpha row is a duplicate, and on the duplicate row it would increase Value to 2. Basically, it needs to find the duplicated row and update it. This should only happen once in my corner case, so there won't be more than 1 duplicate for any original row. In Pandas there is the .duplicated() method, but I don't see something like that in PySpark, so I am trying to think of a way to deal with it. If I could iterate over the Spark DF, then I could build dictionaries with counters and, if the count was 2, do an update. I'm not sure if that is a good way.

Answer

Hi,

If I understood correctly, you would like to achieve something like:

| Name | Color | Size | Value | Time |
| --- | --- | --- | --- | --- |
| Alpha | Blue | Large | 2 | 01-04-2022 09:58:30 |
| Alpha | Blue | Large | 1 | 01-04-2022 08:58:30 |
| Bravo | Red | Small | 5 | 01-04-2022 09:58:30 |

Is that right?

You will only have one duplicate for any original row, and the duplicate's value should be increased by 1.

You could look into something like:

```
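import pyspark.sql.functions as F
from pyspark.sql.window import Window

# Number identical rows (same Name, Color, Size, Value) in Time_col order
windowSpec = Window.partitionBy("Name", "Color", "Size", "Value").orderBy("Time_col")
df3 = df.withColumn("row_num", F.row_number().over(windowSpec))
# Add 1 to Value on the second copy only, then drop the helper column
df4 = df3.withColumn(
    "Value", F.when(F.col("row_num") == 2, F.col("Value") + 1).otherwise(F.col("Value"))
).drop("row_num")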
```
My output:

```
+-----+-----+-----+-----+-------------------+
|Name |Color|Size |Value|Time_col           |
+-----+-----+-----+-----+-------------------+
|Alpha|Blue |Large|1    |01-04-2022 08:58:30|
|Alpha|Blue |Large|2    |01-04-2022 09:58:30|
|Bravo|Red  |Small|5    |01-04-2022 09:58:30|
+-----+-----+-----+-----+-------------------+
```

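It should also work without the Time column. Note that row_number() still needs an ordered window, so you would order by another column instead.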

Hope this helps.
