Create increasing id column to pyspark data-frame

Instead of using row_number() or monotonically_increasing_id(), you can create an incremental id column with the help of the zipWithIndex() RDD function.

row_number() over a window with no partitionBy clause will force all of your data into a single partition, killing performance, while monotonically_increasing_id() stays distributed but only guarantees increasing, unique ids, not consecutive ones. A better option is to use zipWithIndex() and assign a consecutive number as the id column of your data frame.

1. Read the csv file

raw_df = (
    self.spark.read.format("csv")
    .option("header", "true")
    .load(f"{self.file_name}")
)
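As a quick aside, header=true just means the first line of the file becomes the column names and every value is read as a string. A plain-Python sketch with the csv module (file contents here are hypothetical sample data):

```python
import csv
import io

# stand-in for the file at self.file_name (hypothetical contents)
text = "name,age\nalice,30\nbob,25\n"
reader = csv.reader(io.StringIO(text))
header, *rows = list(reader)

print(header)  # ['name', 'age'] -- with header=true these become column names
print(rows)    # [['alice', '30'], ['bob', '25']] -- all values read as strings
```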

2. Create a function addColumn which takes a data frame as an argument and adds id as the first column of the data frame

from pyspark.sql import Row
from pyspark.sql.types import StructType, StructField, StringType

def addColumn(self, df):
    # Pair every row with its 0-based index, prepend str(index + 1) as the id,
    # and rebuild the data frame with an extra 'id' field at the front.
    batch_df = self.spark.createDataFrame(
        df.rdd.zipWithIndex()
        .map(lambda x: [str(x[1] + 1)] + list(x[0]))
        .map(lambda x: Row(*x)),
        StructType([StructField("id", StringType(), True)] + df.schema.fields),
    )
    return batch_df
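Outside Spark, the core transformation in addColumn can be traced with enumerate, which plays the role of zipWithIndex (the sample rows are hypothetical):

```python
rows = [("alice", "30"), ("bob", "25")]  # rows as produced by df.rdd (hypothetical)

# zipWithIndex pairs each row with its 0-based index: (row, index)
indexed = [(row, i) for i, row in enumerate(rows)]

# the first map prepends str(index + 1), so ids start at "1"
with_id = [[str(i + 1)] + list(row) for row, i in indexed]
print(with_id)  # [['1', 'alice', '30'], ['2', 'bob', '25']]
```

Because zipWithIndex() numbers rows per partition and then offsets each partition by the sizes of the ones before it, the resulting ids are consecutive across the whole data frame without shuffling everything into one partition.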

3. Call the addColumn function

import pyspark.sql.functions as F

df2 = self.addColumn(raw_df).withColumn("batchid", F.lit(1))
df2.show(5)


Rupesh Kumar Singh

An IT professional with 10+ years of experience, Python | pandas| Django | Flask | Superset | pyspark | FullStack | Hadoop | AWS | php | no-SQL | ETL | Data-pip