Archives

Apache Spark Interview Questions For 2020

1. What is the version of Spark you are using? Check the Spark version you are using before going to the interview. As of 2020, the latest version of Spark is 2.4.x.

2. What is the difference between RDD, DataFrame, and Dataset? RDD – RDD stands for Resilient Distributed Dataset. It is the fundamental data structure of Spark: an immutable collection of records partitioned across the nodes of a cluster. It allows us to perform in-memory computations on large clusters in a fault-tolerant manner. Compared with a DataFrame or Dataset, an RDD does not hold a schema; it holds only the data. If a user wants to apply a schema over an RDD, they have to create a case class and apply that schema to the data. We use RDDs in the below cases:
- When our data is unstructured, e.g. streams of text or media.
- When we don't…
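To make the RDD-versus-DataFrame distinction concrete, here is a minimal PySpark sketch; the sample records and names are illustrative, not from the original post:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-vs-df").getOrCreate()

# RDD: just records, no schema attached
rdd = spark.sparkContext.parallelize([("alice", 30), ("bob", 25)])
print(rdd.collect())              # [('alice', 30), ('bob', 25)]

# DataFrame: the same data with a schema (column names and types)
df = spark.createDataFrame(rdd, ["name", "age"])
df.printSchema()                  # name: string, age: long
```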

Handy Methods in SparkContext Object while writing Spark Applications

SparkContext is the main entry point for Spark functionality. It is basically a class in the Spark framework that, when initialized, gets access to the Spark libraries. A SparkContext is responsible for connecting to the Spark cluster; it can be used to create RDDs (Resilient Distributed Datasets) and to broadcast variables on that cluster, and it has many more useful methods. To create or initialize a SparkContext, a SparkConf needs to be created beforehand. SparkConf is the class used to set configurations for Spark applications, such as the master and the application name. Creating a SparkContext:

from pyspark import SparkConf, SparkContext
conf = SparkConf().setMaster("yarn")
sc = SparkContext(conf=conf)

In the latest versions of Spark, sparkContext is available in SparkSession (a class in the Spark SQL component/main entry…
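Since the excerpt mentions that sparkContext now lives inside SparkSession, here is a short sketch of that route (the app name is a placeholder):

```python
from pyspark.sql import SparkSession

# Spark 2.x+: SparkSession wraps the SparkContext
spark = SparkSession.builder \
    .master("local[*]")       # or "yarn" on a cluster \
    .appName("demo-app") \
    .getOrCreate()

sc = spark.sparkContext       # the underlying SparkContext
print(sc.version)
```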

PySpark – Components

PySpark core components include:
- Spark Core – all functionality is built on top of Spark Core. Contains classes like SparkContext and RDD.
- Spark SQL – provides an API for structured data processing. Contains important classes like SparkSession, DataFrame, and Dataset.
- Spark Streaming – provides functionality for streaming data processing using a micro-batching technique. Contains classes like StreamingContext and DStream.
- Spark ML – provides an API to implement machine learning algorithms.

Find the average of all contiguous subarrays of fixed size in it

Given an array, find the average of all contiguous subarrays of size 'n' in it.

Array: [1, 3, 2, 6, -1, 4, 1, 8, 2], n=5
Output: [2.2, 2.8, 2.4, 3.6, 2.8]

Solution: The Sliding Window algorithm can be used to solve this.
Time Complexity: O(n)
Space Complexity: O(1)
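The solution code is not included in this excerpt, so below is a minimal sliding-window sketch in Python (the function name is mine):

```python
def find_averages(arr, k):
    """Average of every contiguous subarray of size k, using a sliding window."""
    result = []
    window_sum = 0.0
    window_start = 0
    for window_end in range(len(arr)):
        window_sum += arr[window_end]        # add the incoming element
        if window_end >= k - 1:              # window has reached size k
            result.append(window_sum / k)
            window_sum -= arr[window_start]  # drop the outgoing element
            window_start += 1
    return result

print(find_averages([1, 3, 2, 6, -1, 4, 1, 8, 2], 5))
# [2.2, 2.8, 2.4, 3.6, 2.8]
```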

Find a pair in the array whose sum is equal to the given target

Given an array of sorted numbers and a target sum, find a pair in the array whose sum is equal to the given target. Write a function to return the indices of the two numbers (i.e., the pair) such that they add up to the given target.

Example 1:
Input: [1, 2, 3, 4, 6], target=6
Output: [1, 3]
Explanation: The numbers at index 1 and 3 add up to 6: 2+4=6

Solution: We can use the Two Pointers approach to solve this.
Time Complexity: O(n)
Space Complexity: O(1)
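Here is a short two-pointers sketch in Python, since the excerpt omits the code (function name and the [-1, -1] not-found convention are mine):

```python
def pair_with_target_sum(arr, target):
    """Two pointers on a sorted array: move inward until the pair sums to target."""
    left, right = 0, len(arr) - 1
    while left < right:
        current = arr[left] + arr[right]
        if current == target:
            return [left, right]
        if current < target:
            left += 1      # need a bigger sum
        else:
            right -= 1     # need a smaller sum
    return [-1, -1]        # no pair found

print(pair_with_target_sum([1, 2, 3, 4, 6], 6))  # [1, 3]
```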

Hadoop Setup Documents

Click here to download the document to set up Hadoop 2.X.
Click here to download the document for Eclipse setup.
Click here to download the document for Ubuntu OS.

How to Retrieve Password from JCEKS file in Spark

In the data ingestion stage into Hadoop from RDBMS sources, a password is often required to hit the source tables in the RDBMS databases. Passing a hard-coded password directly is highly unsafe and bad practice in real-world applications, so the password can be encrypted by creating a JCEKS file. JCEKS is basically a keystore file saved in the Java Cryptography Extension KeyStore (JCEKS) format, used as an alternative to the Java KeyStore (JKS) format for the Java platform; it stores encoded keys. When working on a Spark application that deals with RDBMS sources, the JCEKS file needs to be decrypted to query the source tables. Below is a handy function to retrieve the password from a JCEKS file, using PySpark and using Scala.
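The original snippets are not included in this excerpt; as a hedged PySpark sketch of the usual approach via the Hadoop Credential Provider API (the keystore path and alias below are placeholders, and _jsc is an internal handle):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("jceks-demo").getOrCreate()

def get_password(jceks_path, alias):
    """Read a password from a JCEKS keystore via the Hadoop Credential Provider API."""
    hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
    hadoop_conf.set("hadoop.security.credential.provider.path", jceks_path)
    credential = hadoop_conf.getPassword(alias)   # Java char[], or None if absent
    return ''.join(str(credential[i]) for i in range(len(credential)))

# Placeholder path and alias -- substitute your own
password = get_password("jceks://hdfs/user/app/db.password.jceks", "db.password.alias")
```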

Joins in Spark SQL- Shuffle Hash, Sort Merge, BroadCast

The Apache Spark SQL component comes with the Catalyst optimizer, which smartly optimizes jobs by re-arranging the order of transformations and by choosing specialized joins based on the datasets. Spark performs these joins internally, or you can force it to perform them. This topic is worth knowing, as it comes to the rescue when optimizing jobs for your use case.

Shuffle Hash Join
A shuffle hash join shuffles the data based on the join keys and then performs the join. The shuffled hash join ensures that the data on each partition contains the same keys by partitioning the second dataset with the same default partitioner as the first, so that keys with the same hash value from both datasets end up in the same partition. It follows the classic map-reduce pattern: First …
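As a quick illustration of forcing one of these strategies, here is a PySpark sketch using the broadcast hint (the DataFrames are illustrative):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("join-demo").getOrCreate()

large = spark.createDataFrame([(1, "a"), (2, "b"), (3, "c")], ["id", "val"])
small = spark.createDataFrame([(1, "x"), (2, "y")], ["id", "tag"])

# Hint Spark to broadcast the small side instead of shuffling both datasets
joined = large.join(broadcast(small), on="id")
joined.explain()   # physical plan should show a BroadcastHashJoin
```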

Two Sum

Given an array of integers, return indices of the two numbers such that they add up to a specific target. You may assume that each input would have exactly one solution, and you may not use the same element twice.

Given nums = [1, 4, 8, 13], target = 12,
Because nums[1] + nums[2] = 4 + 8 = 12, return [1, 2].

Solution:
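The solution body is missing from this excerpt; a standard one-pass hash-map sketch in Python (function name is mine):

```python
def two_sum(nums, target):
    """Hash map of value -> index; single pass, O(n) time, O(n) space."""
    seen = {}
    for i, num in enumerate(nums):
        complement = target - num
        if complement in seen:
            return [seen[complement], i]
        seen[num] = i
    return []

print(two_sum([1, 4, 8, 13], 12))  # [1, 2]
```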

Data Structures – Linked List

What is a Linked List? A linked list is a linear data structure where each element is a separate object. Each element (we will call it a node) comprises two items: the data and a reference to the next node. The last node has a reference to null. The entry point into a linked list is called the head of the list. Note that the head is not a separate node, but a reference to the first node. If the list is empty, the head is a null reference. A linked list is a dynamic data structure: the number of nodes in a list is not fixed and can grow and shrink on demand, so any application that has to deal with an unknown number of objects can benefit from a linked list.

Types of linked list:
- Singly linked list.
- Doubly linked list.
- Circular linked list, where the last nod…
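A minimal singly linked list sketch in Python (the class and method names are mine, not from the post):

```python
class Node:
    """One list element: the data plus a reference to the next node."""
    def __init__(self, data):
        self.data = data
        self.next = None   # the last node keeps this as None (null)

class LinkedList:
    def __init__(self):
        self.head = None   # reference to the first node; None means empty

    def prepend(self, data):
        node = Node(data)
        node.next = self.head
        self.head = node

    def to_list(self):
        out, current = [], self.head
        while current:
            out.append(current.data)
            current = current.next
        return out

lst = LinkedList()
for value in (3, 2, 1):
    lst.prepend(value)
print(lst.to_list())   # [1, 2, 3]
```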

Data Structures – Array

What is an Array? An array is a collection of items stored at contiguous memory locations. The idea is to store multiple items of the same type (homogeneous) together. This makes it easy to calculate the position of each element by simply adding an offset to a base value, i.e., the memory location of the first element of the array (generally denoted by the name of the array). What is Contiguous Memory? Contiguous memory allocation is a classical memory allocation model that assigns a process consecutive memory blocks (that is, memory blocks having consecutive addresses). Contiguous memory allocation is one of the oldest memory allocation schemes. When a process needs to execute, memory is requested by the process. The size of the process is compared with the amount of contiguous main memory ava…
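A toy illustration of the offset arithmetic described above (the base address and element size are made-up values):

```python
# address_of(arr[i]) = base_address + i * element_size
base_address = 1000   # hypothetical address of arr[0]
element_size = 4      # e.g. a 4-byte integer

def address_of(index):
    return base_address + index * element_size

print(address_of(0))  # 1000
print(address_of(3))  # 1012
```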

Introduction to Data Structures

Definition: A simple definition of a data structure is a way of organizing data in memory. It is a systematic way to organize data in order to use it efficiently. There are different ways to organize data; one example is the array. An array is a collection of elements, i.e., a collection of memory locations, and it is in these memory locations that we store the values. In an array, the structure of the data is sequential: it occupies contiguous memory locations.

Types of data structures:

Type – Description
Linear – In linear data structures, the data items are arranged in a linear sequence. Example: Array
Non-Linear – In non-linear data structures, the data items are not in sequence. Example: Tree, Graph
Homogeneous – In homogeneous data structures, all the elements are of the same type. Example: Array
Non-Homogeneous – In non-h…
