Python / Spark

How to Add Serial Number to Spark Dataframe


You may sometimes need to add a serial number to a Spark DataFrame.
One option is the Spark function monotonically_increasing_id(). It generates a new column with a unique, monotonically increasing 64-bit ID for each row. However, the IDs are not consecutive: the value assigned to a row depends on the partition it lives in, so across the DataFrame the numbers appear out of sequence, with large gaps between partitions.
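The gaps are easy to see without a cluster: per Spark's documentation, the current implementation puts the partition ID in the upper 31 bits of the ID and the row's position within its partition in the lower 33 bits. A plain-Python sketch of that layout (mono_id is a hypothetical helper mimicking it, not a Spark API):

```python
# Sketch of how monotonically_increasing_id() constructs its 64-bit values:
# upper 31 bits = partition ID, lower 33 bits = row number within the partition.
def mono_id(partition_id, row_in_partition):
    return (partition_id << 33) + row_in_partition

# Rows in partition 0 get 0, 1, 2, ... but the very first row of
# partition 1 jumps to 2**33 = 8589934592 -- hence "out of sequence".
ids = [mono_id(0, 0), mono_id(0, 1), mono_id(1, 0)]
```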

If the goal is a true serial number, you can use the zipWithIndex() method available on the DataFrame's underlying RDD.
Below is how you can achieve the same on a DataFrame.

from pyspark.sql.types import LongType, StructField, StructType

def dfZipWithIndex(df, offset=1, colName="rowId"):
    '''
    Enumerates dataframe rows in native order, like rdd.zipWithIndex(),
    but on a dataframe, and preserves the schema.

    :param df: source dataframe
    :param offset: adjustment to zipWithIndex()'s index
    :param colName: name of the index column
    '''
    new_schema = StructType(
        [StructField(colName, LongType(), True)]  # newly added field in front
        + df.schema.fields                        # previous schema
    )

    zipped_rdd = df.rdd.zipWithIndex()

    # zipWithIndex() yields (row, rowId) pairs; prepend rowId + offset to each row
    new_rdd = zipped_rdd.map(lambda pair: [pair[1] + offset] + list(pair[0]))

    return spark.createDataFrame(new_rdd, new_schema)
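To see what the RDD step is doing, here is a plain-Python sketch of the same transformation (the sample rows are made up for illustration; no Spark session is needed):

```python
# zipWithIndex() pairs each row with its position; we then prepend index + offset.
rows = [("alice", 30), ("bob", 25)]   # stand-in for df.rdd's rows
offset = 1

zipped = list(zip(rows, range(len(rows))))          # like df.rdd.zipWithIndex()
new_rows = [[idx + offset] + list(row) for row, idx in zipped]
# new_rows -> [[1, "alice", 30], [2, "bob", 25]]
```

Calling dfZipWithIndex(df) on a real DataFrame returns the same rows with a leading rowId column numbered consecutively from 1.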

Credits: stackoverflow



