Create increasing id column to pyspark data-frame

Instead of using row_number() or monotonically_increasing_id(), you can create an incremental id column with the help of the zipWithIndex() RDD function.

row_number() over a window with no partitionBy clause will force all of your data into a single partition, killing performance, while monotonically_increasing_id() stays distributed but only guarantees increasing, unique ids, not consecutive ones. A better option is to use zipWithIndex() and assign a consecutive number as the id column of your data frame.

1. Read the csv file

raw_df = (
    self.spark.read.format("csv")
    .option("header", "true")
    .load(f"{self.file_name}")
)
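As a quick aside, header=true just means the first line of the file becomes the column names and every value is read as a string. A plain-Python sketch with the csv module (file contents here are hypothetical sample data):

```python
import csv
import io

# stand-in for the file at self.file_name (hypothetical contents)
text = "name,age\nalice,30\nbob,25\n"
reader = csv.reader(io.StringIO(text))
header, *rows = list(reader)

print(header)  # ['name', 'age'] -- with header=true these become column names
print(rows)    # [['alice', '30'], ['bob', '25']] -- all values read as strings
```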

2. Create a function addColumn which takes a data frame as an argument and adds id as the first column of the data frame

from pyspark.sql import Row
from pyspark.sql.types import StructType, StructField, StringType

def addColumn(self, df):
    # Pair every row with its 0-based index, prepend str(index + 1) as the id,
    # and rebuild the data frame with an extra 'id' field at the front.
    batch_df = self.spark.createDataFrame(
        df.rdd.zipWithIndex()
        .map(lambda x: [str(x[1] + 1)] + list(x[0]))
        .map(lambda x: Row(*x)),
        StructType([StructField("id", StringType(), True)] + df.schema.fields),
    )
    return batch_df
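Outside Spark, the core transformation in addColumn can be traced with enumerate, which plays the role of zipWithIndex (the sample rows are hypothetical):

```python
rows = [("alice", "30"), ("bob", "25")]  # rows as produced by df.rdd (hypothetical)

# zipWithIndex pairs each row with its 0-based index: (row, index)
indexed = [(row, i) for i, row in enumerate(rows)]

# the first map prepends str(index + 1), so ids start at "1"
with_id = [[str(i + 1)] + list(row) for row, i in indexed]
print(with_id)  # [['1', 'alice', '30'], ['2', 'bob', '25']]
```

Because zipWithIndex() numbers rows per partition and then offsets each partition by the sizes of the ones before it, the resulting ids are consecutive across the whole data frame without shuffling everything into one partition.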

3. Call the addColumn function

import pyspark.sql.functions as F

df2 = self.addColumn(raw_df).withColumn("batchid", F.lit(1))
df2.show(5)


Rupesh Kumar Singh

An IT professional with 10+ years of experience, Python | pandas| Django | Flask | Superset | pyspark | FullStack | Hadoop | AWS | php | no-SQL | ETL | Data-pip