Archives

How to create Spark DataFrame from different sources

Creating DataFrame from Scala List or Sequence – In some cases, in order to test our business logic, we need a DataFrame, and in most cases we would have created it from a sample file. Instead of doing that, we can create a List of our sample data and convert it to a DataFrame. Note: spark.implicits._ is available in spark-shell by default; if we want to test in an IDE, we should import spark.implicits._ explicitly. From CSV Source, From Parquet Source, From Avro Source, From JSON Source, Using Spark StructType schema to create DataFrame on File Sources, Using Spark StructType JSON Schema to create DataFrame on File Sources – In some cases we may require an external StructType schema; in such cases we can define the StructType as JSON, store it as a file, and during ru...
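
A minimal sketch of the List/Seq approach described above, with illustrative column names and sample data that are not from the original post:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("DataFrameFromSeq")
  .master("local[*]") // local master only for quick testing
  .getOrCreate()

// toDF() needs the implicit encoders; spark-shell already has this import in scope
import spark.implicits._

// Sample data as a Seq of tuples converted straight into a DataFrame
val df = Seq(
  (1, "alice", 3000.0),
  (2, "bob", 4500.0)
).toDF("id", "name", "salary")

df.show()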

Best Lines from Friedrich Nietzsche Writings

Friedrich Nietzsche (1844–1900) was a German classical scholar, philosopher, and critic of culture who became one of the most influential of all modern thinkers. He wrote 15 books over seventeen years, including Beyond Good and Evil, Twilight of the Idols, God Is Dead, Ecce Homo, The Antichrist, On the Genealogy of Morality, and Thus Spoke Zarathustra. Here are some of the best lines from Nietzsche which might resonate with you – 1. One must still have chaos in oneself to be able to give birth to a dancing star. 2. Most humans distract their thoughts to cease to be aware of life. 3. Don’t just swallow the wisdom of the past masters of philosophy. Rather, strive to build your own path. Appreciate the masters from the past, but don’t just follow them blindly. 4. He who has a why t...

Captain Jack Sparrow Philosophy

Perhaps a strange source to find inspiration from, but one of the most powerful reasons to see Jack Sparrow, the unlikely hero of the Pirates of the Caribbean series, as someone we can learn from comes in his opening scene, as he rides a sinking ship into Port Royal at the start of the first film. Jack acts as the king of all he surveys, and though we can sense inklings of the delusional and the deranged in him, we believe in him. We believe in him because he believes in himself, and so in this post we’ll be going over Jack Sparrow’s philosophy from the Pirates of the Caribbean series and the ways we can apply his principles to our own conduct. 1. The problem is not the problem; the problem is your attitude about the problem. The most prevailing inspiratio...

How to connect to Snowflake from AWS EMR using PySpark

As ETL developers, we need to transport data between different platforms/services, which involves establishing connections between them. Below is one such use case: connecting to Snowflake from AWS. Here are the steps to securely connect to Snowflake using PySpark – Log in to the AWS EMR service and start Spark with the Snowflake connectors below: pyspark --packages net.snowflake:snowflake-jdbc:3.11.1,net.snowflake:spark-snowflake_2.11:2.5.7-spark_2.4 The assumption for this article is that a secret is already created in the AWS Secrets Manager service with the Snowflake credentials; in this example the secret name is ‘test/snowflake/cluster’. Using the boto3 library, connect to AWS Secrets Manager and extract the Snowflake credentials into a JSON object. Sample code snippet below – def ge...

Reusable Spark Scala application to export files from HDFS/S3 into Mongo Collection

Application Flow – How does this application work? When the user invokes the application using spark-submit, it first parses and validates the input options and instantiates a new SparkSession with the Mongo config spark.mongodb.output.uri. Depending on the input options provided by the user, a DataFrame is created for the source data file. If the user provided a transformation SQL, a temporary view is created on the source DataFrame and the transformation is applied to form a transformed DataFrame; otherwise the source DataFrame is used for writing the data to the Mongo collection. Finally, either the transformed DataFrame or the source DataFrame is written into the Mongo collection, depending on the write configuration provided by the user or the default write configuration. Read Configuration – By default, the application will ...
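
A minimal sketch of the flow described above, assuming the MongoDB Spark connector (2.x/3.x, which registers the "mongo" data source) and using hypothetical paths, SQL, and URI values:

import org.apache.spark.sql.SparkSession

// Stand-ins for the application's parsed input options
val sourcePath   = "s3://my-bucket/input/data.csv"
val transformSql = "SELECT id, name FROM source_view WHERE id IS NOT NULL"

val spark = SparkSession.builder()
  .appName("HdfsToMongoExport")
  .config("spark.mongodb.output.uri", "mongodb://host:27017/mydb.mycollection")
  .getOrCreate()

// Create the source DataFrame (CSV chosen here only for illustration)
val sourceDf = spark.read.option("header", "true").csv(sourcePath)

// Apply the optional transformation SQL through a temporary view
sourceDf.createOrReplaceTempView("source_view")
val outputDf = if (transformSql.trim.nonEmpty) spark.sql(transformSql) else sourceDf

// Write either the transformed or the source DataFrame to the Mongo collection
outputDf.write.format("mongo").mode("append").save()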

All about AWS – GLUE

What is GLUE? A fully managed ETL service that makes it simple and cost-effective to categorize your data, clean it, enrich it, and move it reliably between various data stores. It is a serverless system that automatically handles the discovery and definition of table definitions and schemas. Its main use is to serve as a central metadata repository for your data lake: it discovers schemas from your unstructured data sitting in S3 (or wherever it lives) and publishes table definitions for use with analysis tools such as Athena, Redshift, or EMR. The purpose of GLUE itself is to extract structure from your unstructured data. If you have data sitting in a data lake, it can provide a schema for it so that you can query it using SQL or SQL-like tools, including Redshift, Athena, and Amazon EMR and ...
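
As an illustration of how those published table definitions get used from EMR, here is a minimal Spark sketch, assuming the cluster is configured to use the Glue Data Catalog as its Hive metastore and using a hypothetical database and table name:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("QueryGlueCatalog")
  .enableHiveSupport() // lets Spark SQL resolve tables from the (Glue-backed) metastore
  .getOrCreate()

// "datalake_db.events" is a hypothetical table registered by a Glue crawler
val eventCounts = spark.sql(
  "SELECT event_type, COUNT(*) AS cnt FROM datalake_db.events GROUP BY event_type")
eventCounts.show()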

Scala Companion Object Explained

What is a companion object? A companion object is simply a singleton object that has the same name as its class and holds all the "static" methods and members. Since Scala doesn’t support defining static methods in a class, we create a companion object instead. Both the class and the companion object must have the same name and must be present in the same source file. Consider the example below –

class ListToString private(list: List[Any]) {
  def size(): Int = { list.length }
  def makeString(sep: String = ","): String = { list.mkString(sep) }
}

object ListToString {
  def makeString(list: List[Any], sep: String = ","): String = { list.mkString(sep) }
  def apply(list: List[Any]): ListToString = new ListToString(list)
}

Class ListToString is defined with two ...
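
A quick usage sketch of the example above (the inline comments show the expected output); note that because the constructor is private, instances can only be created through the companion object's apply method:

// apply() in the companion object lets us construct without the `new` keyword
val wrapped = ListToString(List(1, 2, 3))
println(wrapped.size())          // 3
println(wrapped.makeString("|")) // 1|2|3

// the "static" helper on the companion object works without an instance
println(ListToString.makeString(List("a", "b", "c"))) // a,b,c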

Binary Search in Scala (Iterative,Recursion,Tail Recursion Approaches)

Binary Search is a fast and efficient algorithm to search for an element in a sorted list of elements. It works on the divide-and-conquer technique. Data structure: Array. Time complexity: worst case O(log n), average case O(log n), best case O(1). Space complexity: O(1). Let’s see how it can be implemented in Scala with different approaches – Iterative Approach – Recursion Approach – Tail Recursion – Driver Program –

val arr = Array(1, 2, 4, 5, 6, 7)
val target = 7

println(binarySearch_iterative(arr, target) match {
  case -1    => s"$target doesn't exist in ${arr.mkString("[", ",", "]")}"
  case index => s"$target exists at index $index"
})

println(binarySearch_Recursive(arr, target)() match {
  case -1    => s"$target doesn't match"
  case index => s"$target exists a...
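
The excerpt above shows only the driver program; as a minimal sketch (the signatures in the original post may differ), the tail-recursive variant could look like this:

import scala.annotation.tailrec

// Tail-recursive binary search: returns the index of target, or -1 if it is absent
def binarySearch_tailrec(arr: Array[Int], target: Int): Int = {
  @tailrec
  def loop(lo: Int, hi: Int): Int = {
    if (lo > hi) -1
    else {
      val mid = lo + (hi - lo) / 2
      if (arr(mid) == target) mid
      else if (arr(mid) < target) loop(mid + 1, hi) // search the right half
      else loop(lo, mid - 1)                        // search the left half
    }
  }
  loop(0, arr.length - 1)
}

// binarySearch_tailrec(Array(1, 2, 4, 5, 6, 7), 7) == 5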

Apache Spark Interview Questions For 2020

1. What is the version of Spark you are using? Check the Spark version you are using before going to the interview. As of 2020, the latest version of Spark is 2.4.x. 2. Difference between RDD, DataFrame, Dataset? RDD – RDD stands for Resilient Distributed Dataset. It is the fundamental data structure of Spark and is an immutable collection of records partitioned across the nodes of a cluster. It allows us to perform in-memory computations on large clusters in a fault-tolerant manner. Compared with DataFrames and Datasets, an RDD does not hold a schema; it holds only the data. If users want to impose a schema over an RDD, they have to create a case class and apply the schema over the data. We will use RDDs for the cases below: – when our data is unstructured, e.g. streams of text or media streams; – when we don’...
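
A small sketch of the schema point above, using a hypothetical Person case class that is not from the original article:

import org.apache.spark.sql.SparkSession

case class Person(name: String, age: Int)

val spark = SparkSession.builder().appName("RddVsDataFrame").master("local[*]").getOrCreate()
import spark.implicits._

// An RDD holds only the data: here each record is a plain string with no schema attached
val rdd = spark.sparkContext.parallelize(Seq("alice,30", "bob,25"))

// Imposing a schema: parse each record into the case class, then convert to a DataFrame
val peopleDf = rdd
  .map(_.split(","))
  .map(fields => Person(fields(0), fields(1).trim.toInt))
  .toDF()

peopleDf.printSchema() // name: string, age: integer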

Handy Methods in SparkContext Object while writing Spark Applications

SparkContext is the main entry point for Spark functionality. It is basically a class in the Spark framework that, when initialized, gets access to the Spark libraries. A SparkContext is responsible for connecting to a Spark cluster; it can be used to create RDDs (Resilient Distributed Datasets) and to broadcast variables on that cluster, and it has many more useful methods. To create or initialize a SparkContext, a SparkConf needs to be created beforehand. SparkConf is basically the class used to set configurations for Spark applications, like setting the master, the app name, etc. Creating a SparkContext –

from pyspark import SparkConf, SparkContext
conf = SparkConf().setMaster("yarn")
sc = SparkContext(conf=conf)

In the latest versions of Spark, sparkContext is available in SparkSession (class in the Spark SQL component/main entry...

PySpark – Components

PySpark core components include –
Spark Core – All other functionality is built on top of Spark Core. Contains classes like SparkContext and RDD.
Spark SQL – Provides an API for structured data processing. Contains important classes like SparkSession, DataFrame, and Dataset.
Spark Streaming – Provides functionality for streaming data processing using a micro-batching technique. Contains classes like StreamingContext and DStream.
Spark ML – Provides an API to implement machine learning algorithms.
