Create an increasing id column in a PySpark DataFrame
Apr 1, 2023
You can create an incremental id column without row_number() or monotonically_increasing_id(), with the help of the zipWithIndex() RDD function. row_number() over a window with no partition key forces all of your data into a single partition, killing performance, and monotonically_increasing_id() produces ids that are increasing but not consecutive. A better option is to use zipWithIndex() and assign a unique, consecutive number as the id column of your DataFrame. A quick illustration of zipWithIndex() follows, and then the recipe step by step.
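To see what zipWithIndex() actually does, here is a minimal sketch (not from the original post). It assumes spark is an active SparkSession; inside the class used below it would be self.spark. The indices are consecutive and 0-based, and they are assigned without collapsing the data into one partition:

# Illustrative only: zipWithIndex() assigns consecutive 0-based indices
# across partitions without shuffling everything to a single executor.
rdd = spark.sparkContext.parallelize(["a", "b", "c", "d"], numSlices=2)
print(rdd.zipWithIndex().collect())
# [('a', 0), ('b', 1), ('c', 2), ('d', 3)]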
1. Read the CSV file
raw_df = (
    self.spark.read.format("csv")
    .option("header", "true")   # first line of the CSV holds the column names
    .load(f"{self.file_name}")
)
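As a quick sanity check (not part of the original snippet), you can confirm that the loaded DataFrame is spread across partitions, since keeping that layout is the whole point of avoiding the window-based approach:

# Optional check: the CSV is read into one or more partitions, and
# zipWithIndex() will keep this layout while assigning indices.
print(raw_df.rdd.getNumPartitions())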
2. Create a function addColumn that takes a DataFrame as an argument and adds id as the first column
from pyspark.sql import Row
from pyspark.sql.types import StructType, StructField, StringType

def addColumn(self, df):
    # zipWithIndex() pairs each row with a consecutive 0-based index.
    batch_df = self.spark.createDataFrame(
        df.rdd.zipWithIndex()
              .map(lambda x: [str(x[1] + 1)] + [i for i in x[0]])  # prepend index + 1 as the id value
              .map(lambda x: Row(*x)),
        StructType([StructField("id", StringType(), True)] + df.schema.fields),
    )
    return batch_df
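A note on how this works: StructType([StructField("id", StringType(), True)] + df.schema.fields) prepends the new id field to the original schema, and Row(*x) rebuilds each row positionally so its values line up with that schema. For example, a row ('alice', '30') paired with index 0 by zipWithIndex() becomes ['1', 'alice', '30'] after the first map (the values here are purely illustrative). The id is built as a string; cast it later if a numeric type is needed.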
3. Call the addColumn function and add a batchid column
from pyspark.sql import functions as F

df2 = self.addColumn(raw_df).withColumn("batchid", F.lit(1))
df2.show(5)
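For readers who want to run the flow outside the class shown above, here is a self-contained sketch. The SparkSession setup, the people.csv path, and the add_id_column name (a standalone variant of addColumn) are assumptions for illustration only; the logic mirrors the snippets above:

from pyspark.sql import SparkSession, Row
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.appName("zip-with-index-id").getOrCreate()

def add_id_column(spark, df):
    # Prepend a consecutive, 1-based id column using zipWithIndex().
    return spark.createDataFrame(
        df.rdd.zipWithIndex()
              .map(lambda x: Row(*([str(x[1] + 1)] + list(x[0])))),
        StructType([StructField("id", StringType(), True)] + df.schema.fields),
    )

raw_df = spark.read.format("csv").option("header", "true").load("people.csv")  # assumed path
df2 = add_id_column(spark, raw_df).withColumn("batchid", F.lit(1))
df2.show(5)

# If a numeric id is preferred, cast it afterwards:
# df2 = df2.withColumn("id", F.col("id").cast("long"))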