Creating DataFrame from Scala List or Sequence

In some cases, to test our business logic, we need a DataFrame, and in most cases we would create it from a sample file. Instead of doing that, we can put our sample data in a List and convert it to a DataFrame. Note: spark.implicits._ is available in spark-shell by default; if we want to test in an IDE, we have to import spark.implicits._ explicitly.

From CSV Source
From Parquet Source
From Avro Source
From JSON Source
Using Spark StructType schema to create DataFrame on File Sources
Using Spark StructType JSON Schema to create DataFrame on File Sources

In some cases we may require an external StructType schema. In such cases we can define the StructType as JSON, store it as a file, and during ru...
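Below is a minimal sketch of converting a List to a DataFrame, assuming a SparkSession named spark is already in scope (as it is in spark-shell); the column names and sample rows are purely illustrative.

[code lang="scala"]
// Assumes an existing SparkSession called 'spark' (available by default in spark-shell)
import spark.implicits._

// Sample data as a plain Scala List of tuples
val sampleData = List(
  ("James", "Sales", 3000),
  ("Anna", "Finance", 4100)
)

// toDF comes from spark.implicits._ and lets us name the columns explicitly
val df = sampleData.toDF("name", "department", "salary")
df.show()
[/code]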
Application Flow

How does this application work? When the user invokes the application using spark-submit:

First, the application parses and validates the input options.
A new SparkSession is instantiated with the mongo config spark.mongodb.output.uri.
Depending on the input options provided by the user, a DataFrame is created for the source data file.
If the user provided a transformation SQL, a temporary view is created on the source DataFrame and the transformation is applied to form the transformed DataFrame; otherwise the source DataFrame is used for writing the data to the Mongo collection.
Finally, either the transformed DataFrame or the source DataFrame is written into the Mongo collection, depending on the write configuration provided by the user or the default write configuration.

Read Configuration

By default, the application will ...
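The end-to-end flow described above roughly maps to the sketch below. This is a hedged illustration rather than the application's actual code: the Mongo URI, file format, and SQL are placeholders, and the output format string assumes the MongoDB Spark connector (2.x/3.x) is on the classpath.

[code lang="scala"]
import org.apache.spark.sql.SparkSession

// 1. SparkSession with the mongo write config (URI shown here is a placeholder)
val spark = SparkSession.builder()
  .appName("file-to-mongo")
  .config("spark.mongodb.output.uri", "mongodb://127.0.0.1/test.collection")
  .getOrCreate()

// 2. DataFrame created for the source data file (CSV assumed for illustration)
val sourceDF = spark.read.option("header", "true").csv("/path/to/source.csv")

// 3. Optional transformation SQL applied through a temporary view
sourceDF.createOrReplaceTempView("source")
val transformedDF = spark.sql("SELECT * FROM source")  // placeholder transformation

// 4. Write the resulting DataFrame into the Mongo collection
transformedDF.write
  .format("com.mongodb.spark.sql.DefaultSource")
  .mode("append")
  .save()
[/code]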
What is a companion object?

A companion object is simply a singleton object that has the same name as its class and holds all the static methods and members. Since Scala doesn't support defining static methods inside a class, we create a companion object instead. Both the class and its companion object must have the same name and must live in the same source file. Consider the example below –

class ListToString private(list: List[Any]) {

  def size(): Int = {
    list.length
  }

  def makeString(sep: String = ","): String = {
    list.mkString(sep)
  }
}

object ListToString {

  def makeString(list: List[Any], sep: String = ","): String = {
    list.mkString(sep)
  }

  def apply(list: List[Any]): ListToString = new ListToString(list)
}

Class ListToString is defined with two ...
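As a quick usage sketch for the class above (the values are just illustrative): the companion's apply lets us build an instance without new, even though the constructor is private, while the object-level makeString behaves like a static method.

[code lang="scala"]
// Construct via the companion's apply, since the constructor is private
val lts = ListToString(List(1, 2, 3))

println(lts.size())                               // 3
println(lts.makeString("|"))                      // 1|2|3

// "Static" call on the companion object itself
println(ListToString.makeString(List("a", "b")))  // a,b
[/code]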
Binary Search is a fast and efficient algorithm to search for an element in a sorted list of elements. It works on the divide and conquer technique.

Data structure: Array
Time complexity:
Worst case: O(log n)
Average case: O(log n)
Best case: O(1)
Space complexity: O(1)

Let's see how it can be implemented in Scala with different approaches –

Iterative Approach –

Recursion Approach –

Tail Recursion –

Driver Program –

val arr = Array(1, 2, 4, 5, 6, 7)
val target = 7

println(binarySearch_iterative(arr, target) match {
  case -1 => s"$target doesn't exist in ${arr.mkString("[", ",", "]")}"
  case index => s"$target exists at index $index"
})

println(binarySearch_Recursive(arr, target)() match {
  case -1 => s"$target doesn't match"
  case index => s"$target exists a...
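For reference, here is a minimal sketch of the iterative approach, written to match the binarySearch_iterative signature used by the driver program above (an illustrative implementation, not necessarily the original one):

[code lang="scala"]
// Iterative binary search: repeatedly halve the search window until
// the target is found or the window becomes empty.
def binarySearch_iterative(arr: Array[Int], target: Int): Int = {
  var low = 0
  var high = arr.length - 1
  while (low <= high) {
    val mid = low + (high - low) / 2       // avoids Int overflow on large arrays
    if (arr(mid) == target) return mid
    else if (arr(mid) < target) low = mid + 1
    else high = mid - 1
  }
  -1                                       // target not present
}
[/code]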
In the data ingestion stage into Hadoop from RDBMS sources, we often need a password to hit the source tables in the RDBMS databases. Passing a hard-coded password directly is highly unsafe and bad practice in real-world applications, so the password can be encrypted by creating a JCEKS file. JCEKS is basically a keystore file saved in the Java Cryptography Extension KeyStore (JCEKS) format, used as an alternative to the Java KeyStore (JKS) format on the Java platform; it stores encoded keys. When working on a Spark application that deals with RDBMS sources, the JCEKS file needs to be decrypted to query the source tables. Below is a handy function to retrieve the password from a JCEKS file –

Using PySpark

Using Scala
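A hedged Scala sketch of such a function, built on Hadoop's credential provider API, might look like the following; the JCEKS path and alias names are placeholders supplied by the caller.

[code lang="scala"]
import org.apache.hadoop.conf.Configuration

// Retrieve the password stored under 'alias' in a JCEKS keystore,
// e.g. jceksPath = "jceks://hdfs/user/app/db.password.jceks"
def getPasswordFromJceks(jceksPath: String, alias: String): String = {
  val conf = new Configuration()
  conf.set("hadoop.security.credential.provider.path", jceksPath)
  val password = conf.getPassword(alias)   // Array[Char], or null if the alias is missing
  require(password != null, s"Alias '$alias' not found in $jceksPath")
  new String(password)
}
[/code]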
Tail recursion is a slightly tricky concept in Scala and takes time to master completely. Before we get into tail recursion, let's look at recursion itself. A recursive function is a function that calls itself: if some action is repetitive, we can call the same piece of code again. Recursion can be applied to the same problems you would solve with regular loops.

Factorial program with regular loops –

[code lang="scala"]
def factorial(n: Int): Int = {
  var fact = 1
  for (i <- 1 to n) {
    fact = fact * i
  }
  return fact
}
[/code]

The same can be re-written with recursion like below –

[code lang="scala"]
def factorialWithRecursion(n: Int): Int = {
  if (n == 0) return 1
  else return n * factorialWithRecursion(n - 1)
}
[/code]

In the recursive approach, we return e...
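The recursive version above keeps a pending multiplication on the stack after each call. A minimal sketch of the tail-recursive variant, using an accumulator and the @tailrec annotation (the names here are illustrative), looks like this:

[code lang="scala"]
import scala.annotation.tailrec

def factorialTailRec(n: Int): Int = {
  // The recursive call is the last action, so the compiler can reuse the stack frame
  @tailrec
  def loop(i: Int, acc: Int): Int = {
    if (i <= 1) acc
    else loop(i - 1, acc * i)
  }
  loop(n, 1)
}
[/code]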
There is no direct library to create a DataFrame on an HBase table the way we read a Hive table with Spark SQL. This post gives a way to create a DataFrame on top of an HBase table. You need to add the hbase-client dependency to achieve this; below is the link to get the dependency. https://mvnrepository.com/artifact/org.apache.hbase/hbase-client/2.1.0 Let's say the HBase table is 'emp' with rowKey 'empID' and columns 'name' and 'city' under the column family named 'metadata'. The case class EmpRow is used to give structure to the DataFrame. newAPIHadoopRDD is the API available in Spark to create an RDD on HBase; the configurations need to be passed as shown below. The DataFrame is created when you map this RDD onto the case class. ...
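The wiring roughly looks like the sketch below. It follows the 'emp' example above but is an illustrative reconstruction, not the post's exact code; note that for HBase 2.x, TableInputFormat ships in the hbase-mapreduce artifact, so the exact dependencies can differ by version.

[code lang="scala"]
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.sql.SparkSession

case class EmpRow(empID: String, name: String, city: String)

val spark = SparkSession.builder().appName("hbase-dataframe").getOrCreate()
import spark.implicits._

val hbaseConf = HBaseConfiguration.create()
hbaseConf.set(TableInputFormat.INPUT_TABLE, "emp")

// RDD of (rowKey, Result) pairs read directly from the HBase table
val hbaseRDD = spark.sparkContext.newAPIHadoopRDD(
  hbaseConf,
  classOf[TableInputFormat],
  classOf[ImmutableBytesWritable],
  classOf[Result])

// Map each row onto the case class and convert to a DataFrame
val empDF = hbaseRDD.map { case (rowKey, result) =>
  EmpRow(
    Bytes.toString(rowKey.get()),
    Bytes.toString(result.getValue(Bytes.toBytes("metadata"), Bytes.toBytes("name"))),
    Bytes.toString(result.getValue(Bytes.toBytes("metadata"), Bytes.toBytes("city"))))
}.toDF()

empDF.show()
[/code]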
Q1) Case Classes:

A case class is a class that may be used with the match/case statement.

Case classes can be pattern matched.
Case classes automatically define hashCode and equals.
Case classes automatically define getter methods for the constructor arguments.
Case classes can be seen as plain and immutable data-holding objects that should exclusively depend on their constructor arguments.
Case classes come with a companion object which holds the apply method; this makes it possible to instantiate a case class without the new keyword.

Q2) Pattern Matching

Scala has a built-in general pattern matching mechanism. It allows matching on any sort of data with a first-match policy.

object MatchTest1 extends App {
  def matchTest(x: Int): String = x match {
    case 1 => "one"
    case 2 =>...
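To tie the two answers together, here is a small hedged sketch (the class and values are made up for the example): a case class is instantiated without new via its companion apply, and then deconstructed with pattern matching.

[code lang="scala"]
case class Person(name: String, age: Int)

val p = Person("Ravi", 30)               // companion apply(), no 'new' needed

// Pattern matching deconstructs the case class into its constructor arguments
val description = p match {
  case Person(n, a) if a >= 18 => s"$n is an adult"
  case Person(n, _)            => s"$n is a minor"
}

println(description)                     // Ravi is an adult
println(p == Person("Ravi", 30))         // true: equals/hashCode are generated automatically
[/code]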