24 Tutorials

Reusable Spark Scala application to export files from HDFS/S3 into Mongo Collection

Suriya May 17, 2021

Application Flow: How does this application work? When the user invokes the application using spark-submit, the application first parses and validates the input options. It then instantiates a new SparkSession with the Mongo config spark.mongodb.output.uri. Depending on the input options provided by the user, a DataFrame is created for the source data file. If the user provided a transformation SQL, a temporary view is created on the source DataFrame and the transformation is applied to form the transformed DataFrame; otherwise the source DataFrame is used for writing the data to the Mongo collection. Finally, either the transformed DataFrame or the source DataFrame is written into the Mongo collection, using the write configuration provided by the user or the default write configuration. Read Configuration: By default, application will ...

AWS

All about AWS – GLUE

Sai Kumar April 4, 2021

What is GLUE? AWS Glue is a fully managed ETL service that makes it simple and cost-effective to categorize your data, clean it, enrich it and move it reliably between various data stores. It is a serverless system that automatically handles the discovery and definition of table definitions and schemas. Its main use is to serve as a central metadata repository for your data lake: it discovers schemas in your unstructured data, sitting in S3 or elsewhere, and publishes table definitions for use with analysis tools such as Athena, Redshift or EMR. The purpose of GLUE itself is to extract structure from your unstructured data. If you have data sitting in a data lake, it can provide a schema for that data so that you can query it using SQL or SQL-like tools including Redshift and Athena and Amazon EMR and ...

Scala

Scala Companion Object Explained

Suriya May 28, 2020

What is a companion object? A companion object is a singleton object that has the same name as a class and holds the methods and members that would be static in other languages. Since Scala doesn’t support defining static methods in a class, we create a companion object for them. The class and its companion object must have the same name and must be defined in the same source file. Consider the example below:

  class ListToString private(list: List[Any]) {
    def size(): Int = {
      list.length
    }
    def makeString(sep: String = ","): String = {
      list.mkString(sep)
    }
  }

  object ListToString {
    def makeString(list: List[Any], sep: String = ","): String = {
      list.mkString(sep)
    }
    def apply(list: List[Any]): ListToString = new ListToString(list)
  }

Class ListToString defined with two ...

Programs / Scala

Binary Search in Scala (Iterative, Recursion, Tail Recursion Approaches)

Sai Kumar May 25, 2020

Binary Search is a fast and efficient algorithm to search for an element in a sorted list of elements. It works on the divide-and-conquer technique.
Data structure: Array
Time Complexity: Worst case: O(log n), Average case: O(log n), Best case: O(1)
Space complexity: O(1) (for the iterative approach)
Let’s see how it can be implemented in Scala with different approaches: Iterative Approach, Recursion Approach, Tail Recursion.
Driver Program:

  val arr = Array(1, 2, 4, 5, 6, 7)
  val target = 7
  println(binarySearch_iterative(arr, target) match {
    case -1 => s"$target doesn't exist in ${arr.mkString("[", ",", "]")}"
    case index => s"$target exists at index $index"
  })
  println(binarySearch_Recursive(arr, target)() match {
    case -1 => s"$target doesnt match"
    case index => s"$target exists a...
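The iterative approach named above can be sketched as follows. This is a minimal illustration in Python, not the post's Scala implementation; the function name `binary_search_iterative` is an assumption chosen to mirror the driver program, and the -1 "not found" convention is taken from the driver's match cases.

```python
from typing import Sequence


def binary_search_iterative(arr: Sequence[int], target: int) -> int:
    """Return the index of target in the sorted sequence arr, or -1 if absent."""
    low, high = 0, len(arr) - 1
    while low <= high:
        # Midpoint of the current search range.
        mid = low + (high - low) // 2
        if arr[mid] == target:
            return mid
        if arr[mid] < target:
            # Target can only be in the right half.
            low = mid + 1
        else:
            # Target can only be in the left half.
            high = mid - 1
    return -1


if __name__ == "__main__":
    arr = [1, 2, 4, 5, 6, 7]
    print(binary_search_iterative(arr, 7))
```

Each iteration halves the search range, which gives the O(log n) worst case, and only two index variables are kept, which gives the O(1) space bound.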

Interview Questions

Apache Spark Interview Questions For 2020

24 Tutorials May 4, 2020

1. What is the version of Spark you are using? Check the Spark version you are using before going to the interview. As of 2020, the latest version of Spark is 2.4.x.
2. What is the difference between RDD, DataFrame and Dataset? RDD: RDD stands for Resilient Distributed Dataset. It is the fundamental data structure of Spark and is an immutable collection of records partitioned across the nodes of a cluster. It allows us to perform in-memory computations on large clusters in a fault-tolerant manner. Unlike a DataFrame (DF) or a Dataset (DS), an RDD does not hold a schema; it holds only the data. If a user wants to impose a schema on an RDD, they have to create a case class and apply it to the data. We use RDDs in the cases below:
-When our data is unstructured, such as streams of text or media streams.
-When we don...

PySpark / Spark

Handy Methods in SparkContext Object while writing Spark Applications

Sai Kumar May 3, 2020

SparkContext is the main entry point for Spark functionality. It is a class in the Spark framework that, when initialized, gets access to the Spark libraries. A SparkContext is responsible for connecting to the Spark cluster; it can be used to create RDDs (Resilient Distributed Datasets) and to broadcast variables on that cluster, and it has many more useful methods. To create or initialize a SparkContext, a SparkConf needs to be created beforehand. SparkConf is the class used to set configurations for Spark applications, such as the master and the app name.
Creating SparkContext:

  from pyspark import SparkConf, SparkContext
  conf = SparkConf().setMaster("yarn")
  sc = SparkContext(conf=conf)

In latest versions of Spark, sparkContext is available in SparkSession (Class in Spark SQL component/Main entry...

PySpark / Spark

PySpark – Components

24 Tutorials May 3, 2020

PySpark core components include:
-Spark Core: All other functionality is built on top of Spark Core. Contains classes like SparkContext and RDD.
-Spark SQL: Gives an API for structured data processing. Contains important classes like SparkSession, DataFrame and Dataset.
-Spark Streaming: Gives functionality for streaming data processing using the micro-batching technique. Contains classes like StreamingContext and DStream.
-Spark ML: Provides an API to implement machine learning algorithms.

Programs

Find the average of all contiguous subarrays of fixed size in it

24 Tutorials April 28, 2020

Given an array, find the average of all contiguous subarrays of size ‘n’ in it.
Array: [1, 3, 2, 6, -1, 4, 1, 8, 2], n=5
Output: [2.2, 2.8, 2.4, 3.6, 2.8]
Solution: The Sliding Window algorithm can be used to solve this.
Time Complexity: O(N), where N is the length of the array
Space Complexity: O(1), not counting the output list
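A minimal sketch of the sliding-window idea, written in Python for illustration. The function name `sliding_window_averages` and the parameter name `k` (the window size, called ‘n’ in the problem statement) are assumptions, as is returning an empty list when the window does not fit.

```python
from typing import List, Sequence


def sliding_window_averages(nums: Sequence[float], k: int) -> List[float]:
    """Return the average of every contiguous subarray of size k."""
    if k <= 0 or k > len(nums):
        # Assumption: no valid window means no averages.
        return []
    # Sum of the first window.
    window_sum = sum(nums[:k])
    averages = [window_sum / k]
    for end in range(k, len(nums)):
        # Slide the window: add the entering element, drop the leaving one.
        window_sum += nums[end] - nums[end - k]
        averages.append(window_sum / k)
    return averages


if __name__ == "__main__":
    print(sliding_window_averages([1, 3, 2, 6, -1, 4, 1, 8, 2], 5))
```

Because each step reuses the previous window's sum instead of re-adding all elements, every element is added once and subtracted at most once, which is where the linear running time comes from.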

Programs

Find a pair in the array whose sum is equal to the given target

24 Tutorials April 28, 2020

Given an array of sorted numbers and a target sum, find a pair in the array whose sum is equal to the given target. Write a function to return the indices of the two numbers (i.e. the pair) such that they add up to the given target.
Example 1:
Input: [1, 2, 3, 4, 6], target=6
Output: [1, 3]
Explanation: The numbers at index 1 and 3 add up to 6: 2+4=6
We can use the Two Pointers approach to solve this.
Solution: Time Complexity: O(n), Space Complexity: O(1)
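The Two Pointers approach could look like the sketch below, in Python. The function name `pair_with_target_sum` and the `[-1, -1]` return value for "no pair found" are assumptions made for this illustration.

```python
from typing import List, Sequence


def pair_with_target_sum(nums: Sequence[int], target: int) -> List[int]:
    """Return indices of two numbers in the sorted sequence that sum to target."""
    left, right = 0, len(nums) - 1
    while left < right:
        current = nums[left] + nums[right]
        if current == target:
            return [left, right]
        if current < target:
            # Sum too small: move the left pointer to a larger value.
            left += 1
        else:
            # Sum too large: move the right pointer to a smaller value.
            right -= 1
    # Assumption: signal "no such pair" with [-1, -1].
    return [-1, -1]


if __name__ == "__main__":
    print(pair_with_target_sum([1, 2, 3, 4, 6], 6))
```

The approach relies on the input being sorted: moving the left pointer can only increase the sum and moving the right pointer can only decrease it, so each element is visited at most once.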

hadoop

Hadoop Setup Documents

Veeraravi April 25, 2020

Click here to download the document to set up Hadoop 2.X.
Click here to download the document for Eclipse setup.
Click here to download the document for Ubuntu OS.

Python / Spark

How to Retrieve Password from JCEKS file in Spark

Sai Kumar April 25, 2020

In the data ingestion stage from RDBMS sources into Hadoop, a password is often required to access the source tables in the RDBMS databases. Passing a hard-coded password directly is highly unsafe and bad practice in production applications, so the password can instead be stored encrypted in a JCEKS file. A JCEKS file is a keystore file saved in the Java Cryptography Extension KeyStore (JCEKS) format; it is used as an alternative to the Java KeyStore (JKS) format on the Java platform and stores encoded keys. When working on a Spark application that deals with RDBMS sources, the password has to be retrieved from the JCEKS file in order to query the source tables. Below is the handy function to retrieve the password from a JCEKS file, using PySpark and using Scala.
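A minimal PySpark-side sketch of such a function is shown here; it is an illustration under stated assumptions, not the post's original code. It goes through the Hadoop `Configuration` object, using the `hadoop.security.credential.provider.path` property and `Configuration.getPassword`. The function name, the `_jsc` access path (a private PySpark attribute), and the provider path and alias in the usage comment are all assumptions.

```python
def get_password_from_jceks(spark_context, provider_path: str, alias: str) -> str:
    """Read the password stored under `alias` from the JCEKS file at `provider_path`.

    `spark_context` is expected to be a live pyspark SparkContext; `_jsc` is a
    private attribute, so this access path is an assumption of the sketch.
    """
    hadoop_conf = spark_context._jsc.hadoopConfiguration()
    # Point Hadoop's credential provider API at the JCEKS keystore.
    hadoop_conf.set("hadoop.security.credential.provider.path", provider_path)
    # getPassword returns a Java char[] (or null when the alias is unknown).
    chars = hadoop_conf.getPassword(alias)
    if chars is None:
        raise KeyError("alias not found in credential provider: " + alias)
    # Convert the char array to a Python string one character at a time.
    return "".join(str(chars[i]) for i in range(len(chars)))


# Hypothetical usage (path and alias are placeholders):
# password = get_password_from_jceks(sc, "jceks://hdfs/user/etl/db.jceks", "db.password")
```

Keeping the password in the keystore means the job only needs the keystore location and the alias, so neither the command line nor the source code contains the secret itself.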

Spark

Joins in Spark SQL: Shuffle Hash, Sort Merge, Broadcast

Sai Kumar April 22, 2020

The Apache Spark SQL component comes with the Catalyst optimizer, which optimizes jobs by re-arranging the order of transformations and by choosing special join strategies according to the datasets. Spark performs these joins internally, or you can force it to perform them. It is worthwhile to know this topic, so that it comes to the rescue when optimizing jobs for your use case.
Shuffle Hash Join: A shuffle hash join shuffles the data based on the join keys and then performs the join. The shuffled hash join ensures that data on each partition will contain the same keys by partitioning the second dataset with the same default partitioner as the first, so that the keys with the same hash value from both datasets are in the same partition. It follows the classic map-reduce pattern: First ...
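The shuffle-then-join idea can be simulated in plain Python, as in the sketch below. This is a conceptual illustration only and does not use or reproduce Spark's implementation; the function name, the `num_partitions` default, and the choice of the right dataset as the side that builds the hash table are all assumptions.

```python
from collections import defaultdict
from typing import Hashable, List, Sequence, Tuple


def shuffle_hash_join(
    left: Sequence[Tuple[Hashable, object]],
    right: Sequence[Tuple[Hashable, object]],
    num_partitions: int = 4,
) -> List[Tuple[Hashable, object, object]]:
    """Inner-join two lists of (key, value) pairs, simulating a shuffle hash join."""
    # "Shuffle": route every record to a partition chosen by the hash of its key,
    # using the same partitioner for both sides so equal keys meet in one partition.
    left_parts = defaultdict(list)
    right_parts = defaultdict(list)
    for key, value in left:
        left_parts[hash(key) % num_partitions].append((key, value))
    for key, value in right:
        right_parts[hash(key) % num_partitions].append((key, value))

    joined = []
    for partition in range(num_partitions):
        # Build phase: hash table over one side of this partition.
        table = defaultdict(list)
        for key, value in right_parts[partition]:
            table[key].append(value)
        # Probe phase: look up each record of the other side.
        for key, left_value in left_parts[partition]:
            for right_value in table.get(key, []):
                joined.append((key, left_value, right_value))
    return joined


if __name__ == "__main__":
    orders = [(1, "a"), (2, "b"), (3, "c")]
    payments = [(1, "x"), (3, "y"), (3, "z"), (4, "w")]
    print(sorted(shuffle_hash_join(orders, payments)))
```

Since both sides use the same partitioning function, a key never has to be compared across partitions, which is why each partition can be joined independently.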
