How to create Spark DataFrame from different sources

Creating DataFrame from Scala List or Sequence – In some cases, in order to test our business logic, we need a DataFrame, and in most cases we would have created the DataFrame from a sample file. Instead of doing that, we can create a List of our sample data and convert it to a DataFrame. Note: spark.implicits._ is available in spark-shell by default; to test in an IDE, we must import spark.implicits._ explicitly. The post also covers: From CSV Source – From Parquet Source – From Avro Source – From JSON Source – Using Spark StructType Schema to create DataFrame on File Sources – Using Spark StructType JSON Schema to create DataFrame on File Sources. In some cases we may require an external StructType schema; in such cases we can define the StructType as JSON, store it as a file, and during ru...
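A minimal sketch of the List-to-DataFrame idea described above (the column names and sample rows are my own illustration, not from the post):

```scala
import org.apache.spark.sql.SparkSession

object ListToDataFrame {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ListToDataFrame")
      .master("local[*]") // local mode for testing; in spark-shell a session already exists
      .getOrCreate()

    // toDF on a Seq needs the implicits of this specific SparkSession instance
    import spark.implicits._

    val sampleData = Seq(("alice", 30), ("bob", 25)) // hypothetical sample rows
    val df = sampleData.toDF("name", "age")
    df.show()

    spark.stop()
  }
}
```

This avoids maintaining a sample file alongside the test: the data lives next to the test logic.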

Reusable Spark Scala application to export files from HDFS/S3 into Mongo Collection

Application Flow – How does this application work? When the user invokes the application using spark-submit: First, the application parses and validates the input options. It then instantiates a new SparkSession with the mongo config spark.mongodb.output.uri. Depending on the input options provided by the user, a DataFrame is created for the source data file. If the user provided a transformation SQL, a temporary view is created on the source DataFrame and the transformation is applied to form the transformed DataFrame; otherwise the source DataFrame is used for writing the data to the Mongo collection. Finally, either the transformed DataFrame or the source DataFrame is written into the Mongo collection, depending on the write configuration provided by the user or the default write configuration. Read Configuration – By default, application will ...
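The flow above could be sketched roughly as follows. This is a hypothetical outline, not the application's actual code: the argument order, the CSV source choice, and the connector's short format name "mongo" (used by the MongoDB Spark Connector 3.x; the 10.x series uses "mongodb") are all assumptions.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object HdfsToMongo {
  def main(args: Array[String]): Unit = {
    // Hypothetical option parsing; the real application validates these properly
    val mongoUri     = args(0) // e.g. a mongodb://host:27017/db.collection URI
    val sourcePath   = args(1)
    val transformSql = if (args.length > 2) Some(args(2)) else None

    val spark = SparkSession.builder()
      .appName("HdfsToMongo")
      .config("spark.mongodb.output.uri", mongoUri)
      .getOrCreate()

    // Create a DataFrame for the source data file (CSV assumed here)
    val source: DataFrame = spark.read.option("header", "true").csv(sourcePath)

    // If a transformation SQL was provided, apply it via a temporary view
    val result = transformSql match {
      case Some(sql) =>
        source.createOrReplaceTempView("source_view")
        spark.sql(sql)
      case None => source
    }

    // Write the transformed (or source) DataFrame into the Mongo collection
    result.write.format("mongo").mode("append").save()
    spark.stop()
  }
}
```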

Scala Companion Object Explained

What is a companion object? A companion object is a singleton object that has the same name as its class and holds all the static methods and members. Since Scala doesn't support defining static methods in a class, we create a companion object instead. The class and its companion object must have the same name and must be defined in the same source file. Consider the below example –

[code lang="scala"]
class ListToString private(list: List[Any]) {
  def size(): Int = {
    list.length
  }

  def makeString(sep: String = ","): String = {
    list.mkString(sep)
  }
}

object ListToString {
  def makeString(list: List[Any], sep: String = ","): String = {
    list.mkString(sep)
  }

  def apply(list: List[Any]): ListToString = new ListToString(list)
}
[/code]

Class ListToString defined with two ...
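A self-contained usage sketch, restating the class from the example to show both the "static" call through the companion object and instance creation via apply (the call sites are my own illustration):

```scala
class ListToString private (list: List[Any]) {
  def size(): Int = list.length
  def makeString(sep: String = ","): String = list.mkString(sep)
}

object ListToString {
  // "Static" method: callable without an instance
  def makeString(list: List[Any], sep: String = ","): String = list.mkString(sep)

  // apply lets callers write ListToString(...) even though the constructor is private
  def apply(list: List[Any]): ListToString = new ListToString(list)
}

// Static-style call through the companion object:
val viaCompanion = ListToString.makeString(List(1, 2, 3), "-")
// Instance created via apply, despite the private constructor:
val instance = ListToString(List(1, 2, 3))

println(viaCompanion)          // 1-2-3
println(instance.makeString()) // 1,2,3
```

Note that because the companion is in the same file, it can access the class's private constructor, which is a common way to force all construction through apply.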

Binary Search in Scala (Iterative,Recursion,Tail Recursion Approaches)

Binary Search is a fast and efficient algorithm for finding an element in a sorted list of elements. It works on the technique of divide and conquer. Data structure: Array. Time complexity: worst case O(log n), average case O(log n), best case O(1). Space complexity: O(1). Let's see how it can be implemented in Scala with different approaches – Iterative Approach – Recursion Approach – Tail Recursion – Driver Program –

[code lang="scala"]
val arr = Array(1, 2, 4, 5, 6, 7)
val target = 7

println(binarySearch_iterative(arr, target) match {
  case -1 => s"$target doesn't exist in ${arr.mkString("[", ",", "]")}"
  case index => s"$target exists at index $index"
})

println(binarySearch_Recursive(arr, target)() match {
  case -1 => s"$target doesn't match"
  case index => s"$target exists a...
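The excerpt truncates before the actual search implementations. A sketch of the tail-recursion approach could look like the following; the name and single-argument-list signature are my own (the driver above suggests binarySearch_Recursive is curried), so treat this as illustrative rather than the post's code:

```scala
import scala.annotation.tailrec

def binarySearchTailRec(arr: Array[Int], target: Int): Int = {
  @tailrec
  def loop(low: Int, high: Int): Int = {
    if (low > high) -1 // search range exhausted: target absent
    else {
      val mid = low + (high - low) / 2 // avoids Int overflow of (low + high) / 2
      if (arr(mid) == target) mid
      else if (arr(mid) < target) loop(mid + 1, high) // discard left half
      else loop(low, mid - 1)                         // discard right half
    }
  }
  loop(0, arr.length - 1)
}

val arr = Array(1, 2, 4, 5, 6, 7)
println(binarySearchTailRec(arr, 7)) // 5
println(binarySearchTailRec(arr, 3)) // -1
```

Because the recursive call is in tail position, the compiler turns it into a loop, giving the O(1) space complexity quoted above.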

Understanding Tail recursion in Scala

Tail recursion is a slightly tricky concept in Scala and takes time to master completely. Before we get into tail recursion, let's look at recursion itself. A recursive function is a function that calls itself: if some action is repetitive, we can call the same piece of code again. Recursion can be applied to problems you would normally solve with regular loops. Factorial program with regular loops –

[code lang="scala"]
def factorial(n: Int): Int = {
  var fact = 1
  for (i <- 1 to n) {
    fact = fact * i
  }
  return fact
}
[/code]

The same can be re-written with recursion like below –

[code lang="scala"]
def factorialWithRecursion(n: Int): Int = {
  if (n == 0) return 1
  else return n * factorialWithRecursion(n - 1)
}
[/code]

In the recursive approach, we return e...
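The excerpt truncates before reaching the tail-recursive version the post builds toward. The standard accumulator-based form (my own naming) moves the multiplication before the recursive call, so the call is in tail position and scala.annotation.tailrec lets the compiler verify it can be compiled to a loop:

```scala
import scala.annotation.tailrec

def factorialTailRec(n: Int): Int = {
  @tailrec
  def loop(n: Int, acc: Int): Int =
    if (n == 0) acc          // nothing left to do after the base case
    else loop(n - 1, acc * n) // multiply NOW, then recurse: the call is the last action
  loop(n, 1)
}

println(factorialTailRec(5)) // 120
```

In factorialWithRecursion above, `n * factorialWithRecursion(n - 1)` still has a pending multiplication after the call returns, so each call needs a stack frame; the accumulator version does not.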

How to write Current method name to log in Scala[Code Snippet]

You will have many methods in your application framework, and if you want to trace and log the current method name, the below code will be helpful.

[code lang="scala"]
def getCurrentMethodName: String = Thread.currentThread.getStackTrace()(2).getMethodName

def test: Unit = {
  println("you are in - " + getCurrentMethodName)
  println("this is doing some functionality")
}

test
[/code]

Output:
you are in - test
this is doing some functionality

How to Calculate total time taken for particular method in Spark[Code Snippet]

In some cases where you have applied joins in a Spark application, you might want to know the time taken to complete a particular join. The below code snippet might come in handy to achieve this.

[code lang="scala"]
import java.util.Date

val current = new Date().getTime
println(current)
Thread.sleep(30000)
val end = new Date().getTime
println(end)
println("time taken " + (end - current).toFloat / 60000 + " mins")
[/code]

Output:
import java.util.Date
current: Long = 1520502573995
end: Long = 1520502603996
time taken 0.5000167 mins

All you need to do is get the current time before the method starts, get the current time after the method ends, and calculate the difference to get the total time taken to complete that particular method. Hope this code snippet helps!!
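The same idea can be wrapped once in a reusable call-by-name helper so every timed section doesn't repeat the before/after bookkeeping. The `timed` name and signature here are my own, not from the post:

```scala
// Times any block of code passed by name and prints the elapsed wall-clock time.
def timed[T](label: String)(block: => T): T = {
  val start = System.nanoTime() // monotonic clock: safer for intervals than Date
  val result = block            // block only runs here, thanks to call-by-name
  val elapsedMs = (System.nanoTime() - start) / 1e6
  println(f"$label took $elapsedMs%.2f ms")
  result
}

// Usage: wrap the work you want to measure (a join, an action, etc.)
val total = timed("sample work") {
  (1 to 1000).sum // stand-in for the join you want to measure
}
println(total) // 500500
```

System.nanoTime is preferable to new Date().getTime for measuring durations, since it is monotonic and unaffected by system clock adjustments.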

How to write current date timestamp to log file in Scala[Code Snippet]

Scala doesn't have its own library for dates and timestamps, so we need to depend on Java libraries. Here is a quick method to get the current date timestamp and format it as per your required format. Please note that all the code syntaxes are in Scala; this can be used while writing a Scala application.

[code lang="scala"]
import java.sql.Timestamp
import java.text.SimpleDateFormat
import java.util.Calendar

def getCurrentdateTimeStamp: Timestamp = {
  val today: java.util.Date = Calendar.getInstance.getTime
  val timeFormat = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
  val now: String = timeFormat.format(today)
  java.sql.Timestamp.valueOf(now)
}
[/code]

REPL output:
getCurrentdateTimeStamp: java.sql.Timestamp

getCurrentdateTimeStamp
res0: java.sql.Timestamp = 2018-03-18 07:48:00.0
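On Java 8+, the java.time API offers an alternative that skips the SimpleDateFormat round-trip (SimpleDateFormat is also not thread-safe, while DateTimeFormatter is). This sketch is my own, not from the post:

```scala
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter

// Returns the current date-time formatted as "yyyy-MM-dd HH:mm:ss"
def currentTimestampString: String = {
  val fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")
  LocalDateTime.now().format(fmt)
}

println(currentTimestampString) // e.g. 2018-03-18 07:48:00
```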

Algorithm to sort elements in an Array using Scala

Algorithm/program for sorting elements in an Array using Scala. The algorithm used is Bubble Sort, the simplest sorting algorithm; it works by repeatedly swapping adjacent elements that are out of order.

[code lang="scala"]
object SortArray {
  def main(args: Array[String]) {
    val inputarray = Array(1, 2, 3, 2, 4, 1, 4)
    println("Input")
    println(inputarray.mkString(","))
    for (i <- 0 until inputarray.length - 1) {
      for (j <- 0 until inputarray.length - i - 1) {
        if (inputarray(j) > inputarray(j + 1)) {
          val temp = inputarray(j)
          inputarray(j) = inputarray(j + 1)
          inputarray(j + 1) = temp
        }
      }
    }
    println("Sorted elements in Array")
    println(inputarray.mkString(","))
  }
}
[/code]

Output:
Input
1,2,3,2,4,1,4
Sorted elements in Array
1,1,2,2,3,4,4

Program to print only duplicate elements in an Integer Array using Scala

Write a program to print only the duplicate elements in an Integer Array. Logic: loop through each element of the array and compare it with the elements after it.

[code lang="scala"]
object PrintDuplicates {
  def main(args: Array[String]) {
    // Sample input; the original excerpt omits this definition
    val inputarray = Array(1, 2, 3, 2, 4, 1, 4)
    for (i <- 0 until inputarray.length) {
      for (j <- i + 1 until inputarray.length) {
        if (inputarray(i) == inputarray(j)) {
          println(inputarray(i))
        }
      }
    }
  }
}
[/code]
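A collections-based alternative to the nested loops (my own sketch, with a sample input of my choosing): groupBy collects equal elements together, and filtering on group size keeps only values that occur more than once. Unlike the nested-loop version, each duplicate value is reported exactly once, and it runs in roughly linear rather than quadratic time.

```scala
val inputarray = Array(1, 2, 3, 2, 4, 1, 4)

// Map each distinct value to all its occurrences, keep groups of size > 1
val duplicates = inputarray.groupBy(identity).collect {
  case (value, occurrences) if occurrences.length > 1 => value
}.toList.sorted

println(duplicates.mkString(",")) // 1,2,4
```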

Program to print triangle pattern using Scala

Write a program to print the below triangle pattern using Scala?
#
##
###
####
#####
Using Scala's functional style of programming, it's much easier to print patterns than in Java. Below is the code for printing the same using Scala for loops. Approach 1 –

[code lang="scala"]
object PrintTriangle {
  def main(args: Array[String]) {
    for (i <- 1 to 5) {
      for (j <- 1 to i) {
        print("#")
      }
      println("")
    }
  }
}
[/code]

Approach 2 –

[code lang="scala"]
object PrintTriangle {
  def main(args: Array[String]) {
    for (x <- 1 until 6) {
      println("#" * x)
    }
  }
}
[/code]

Output:
#
##
###
####
#####

How to Remove Header and Trailer of File using Scala

Removing the header and trailer of a file using plain Scala may not be a real-time use case, since you would be using Spark when dealing with large datasets. This post is helpful mainly for interviews: an interviewer might ask you to write code for this using Scala instead of Unix/Spark. Here is the code snippet to achieve the same using Scala –

[code lang="scala"]
import scala.io.Source

object RemoveHeaderTrailer {
  def main(args: Array[String]) {
    println("start")
    val input = Source.fromFile("C:/Users/Sai/input.txt")
    //input.getLines().drop(1).foreach(println) // this removes the header alone
    val lines = input.getLines().toList
    val required_data = lines.slice(1, lines.size - 1).mkString("\n")
    import{File, PrintWriter}
    val pw = new PrintWriter(new File("C:/Users/...
[/code]



24 Tutorials