
How to Add Serial Number to Spark Dataframe


Sometimes you may need to add a serial number column to a Spark DataFrame.
One option is the built-in function monotonically_increasing_id(). It generates a new column with a unique, monotonically increasing 64-bit integer for each row. However, the values are not a true sequence: the generated ID encodes the partition the row lives in, so the numbers jump between partitions. In short, the IDs are unique and increasing, but out of sequence.
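To see why the IDs jump, recall the documented layout: the partition index goes in the upper 31 bits and the record number within the partition in the lower 33 bits. Here is a minimal pure-Python sketch of that layout (the mono_id() helper is hypothetical, just mirroring the documented bit arithmetic, not Spark itself):

```python
def mono_id(partition_index, record_number):
    # Documented layout of monotonically_increasing_id():
    # partition index in the upper 31 bits, per-partition
    # record number in the lower 33 bits.
    return (partition_index << 33) + record_number

# Rows in partition 0 get consecutive ids: 0, 1, 2, ...
print([mono_id(0, r) for r in range(3)])  # [0, 1, 2]

# ...but the first row of partition 1 jumps to 2**33
print(mono_id(1, 0))  # 8589934592
```

So within a partition the values are consecutive, but each new partition starts 2**33 higher, which is why the column cannot serve as a serial number.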

If the goal is a true serial number, you can use the zipWithIndex() method available on RDDs.
Below is how you can achieve the same on a DataFrame.
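RDD.zipWithIndex() pairs each element with its global position across all partitions, starting from 0. A pure-Python model of those semantics, using enumerate (a sketch of the behavior, not Spark itself):

```python
def zip_with_index(elements):
    # Model of RDD.zipWithIndex(): pair each element with its
    # global 0-based position, in the collection's native order.
    return [(elem, idx) for idx, elem in enumerate(elements)]

print(zip_with_index(["a", "b", "c"]))  # [('a', 0), ('b', 1), ('c', 2)]
```

Unlike monotonically_increasing_id(), the indices here are gapless, which is exactly what a serial number needs.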


from pyspark.sql.types import LongType, StructField, StructType

def dfZipWithIndex(df, offset=1, colName="rowId"):
    '''
    Enumerates dataframe rows in native order, like rdd.zipWithIndex(),
    but on a dataframe, and preserves the schema.

    :param df: source dataframe
    :param offset: adjustment to zipWithIndex()'s index
    :param colName: name of the index column
    '''

    # Put the new index field in front of the existing schema
    new_schema = StructType(
        [StructField(colName, LongType(), True)]  # newly added field in front
        + df.schema.fields                        # previous schema
    )

    # zipWithIndex() pairs each row with its global position: (row, rowId)
    zipped_rdd = df.rdd.zipWithIndex()

    # Tuple-unpacking lambdas are not valid in Python 3, so unpack by index
    new_rdd = zipped_rdd.map(lambda pair: [pair[1] + offset] + list(pair[0]))

    # Assumes an active SparkSession named `spark` (as in the pyspark shell)
    return spark.createDataFrame(new_rdd, new_schema)
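The map step above simply prepends the offset-adjusted index to each row. That transformation can be modeled on plain tuples, without Spark (the sample names and ages below are made up for illustration):

```python
offset = 1

# (row, rowId) pairs, as zipWithIndex() would produce them
zipped = [(("alice", 30), 0), (("bob", 25), 1)]

# Prepend the offset-adjusted id to each row, matching the map step
new_rows = [[row_id + offset] + list(row) for row, row_id in zipped]

print(new_rows)  # [[1, 'alice', 30], [2, 'bob', 25]]
```

With the default offset of 1, the serial numbers start at 1 rather than 0.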

Credits: stackoverflow


An ambivert, music lover, enthusiast, artist, designer, coder, gamer, and content writer. He is a professional software developer with hands-on experience in Spark, Kafka, Scala, Python, Hadoop, Hive, Sqoop, Pig, PHP, HTML, and CSS. Know more about him at www.saikumar.me
