Archives

Data Structures – Linked List

What is a Linked List? A linked list is a linear data structure where each element is a separate object. Each element (we will call it a node) comprises two items: the data and a reference to the next node. The last node has a reference to null. The entry point into a linked list is called the head of the list. Note that the head is not a separate node, but a reference to the first node; if the list is empty, the head is a null reference. A linked list is a dynamic data structure: the number of nodes is not fixed and can grow and shrink on demand, so any application that has to deal with an unknown number of objects can benefit from a linked list. Types of linked list: singly linked list, doubly linked list, and circular linked list, where last nod...
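As a minimal sketch of the node-plus-reference structure described above (the Node and LinkedList class names are just for this illustration, not taken from the post):

[code lang="python"]
class Node:
    """A single element of the list: the data plus a reference to the next node."""
    def __init__(self, data):
        self.data = data
        self.next = None  # the last node keeps None (the "null" reference)

class LinkedList:
    def __init__(self):
        self.head = None  # entry point; None means the list is empty

    def prepend(self, data):
        """Insert a new node at the front; the head is only a reference, not a node."""
        node = Node(data)
        node.next = self.head
        self.head = node

    def traverse(self):
        """Walk the chain of references from head to the terminating None."""
        current = self.head
        while current is not None:
            yield current.data
            current = current.next

lst = LinkedList()
for value in (3, 2, 1):
    lst.prepend(value)
print(list(lst.traverse()))  # [1, 2, 3]
[/code]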

Data Structures – Array

What is an Array? An array is a collection of items stored at contiguous memory locations. The idea is to store multiple items of the same type (homogeneous) together. This makes it easy to calculate the position of each element by simply adding an offset to a base value, i.e., the memory location of the first element of the array (generally denoted by the name of the array). What is contiguous memory? Contiguous memory allocation is a classical memory allocation model that assigns a process consecutive memory blocks (that is, memory blocks with consecutive addresses). It is one of the oldest memory allocation schemes. When a process needs to execute, it requests memory, and the size of the process is compared with the amount of contiguous main memory ava...
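To make the base-address-plus-offset idea concrete, here is a tiny sketch; the starting address of 1000 and the 4-byte element size are purely hypothetical:

[code lang="python"]
# Hypothetical layout: an int array starting at base address 1000,
# with each element occupying 4 bytes of contiguous memory.
base_address = 1000
element_size = 4

def address_of(index):
    """Address of arr[index] = base address + index * element size."""
    return base_address + index * element_size

for i in range(5):
    print(f"arr[{i}] -> address {address_of(i)}")
# arr[0] -> 1000, arr[1] -> 1004, arr[2] -> 1008, ...
[/code]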

Introduction to Data Structures

Definition: A simple definition of a data structure is a way of organizing data in memory; it is a systematic way to organize data in order to use it efficiently. There are different ways to organize data. One example is the array: an array is a collection of elements, i.e., a collection of memory locations in which we store the values. In an array the structure of the data is sequential, occupying contiguous memory locations. Types of data structures: Linear, where the data items are arranged in a linear sequence (example: Array); Non-Linear, where the data items are not in sequence (examples: Tree, Graph); Homogeneous, where all the elements are of the same type (example: Array); Non-Homogeneous, where in Non-H...

Understanding Tail recursion in Scala

Tail recursion is a slightly tricky concept in Scala and takes time to master completely. Before we get into tail recursion, let's look at recursion. A recursive function is a function that calls itself: if some action is repetitive, we can call the same piece of code again. Recursion can be applied to problems that you would otherwise solve with regular loops. Factorial program with regular loops –

[code lang="scala"]
def factorial(n: Int): Int = {
  var fact = 1
  for (i <- 1 to n) {
    fact = fact * i
  }
  return fact
}
[/code]

The same can be re-written with recursion like below –

[code lang="scala"]
def factorialWithRecursion(n: Int): Int = {
  if (n == 0) return 1
  else return n * factorialWithRecursion(n - 1)
}
[/code]

In the recursive approach, we return e...

Memory Management in Spark and its tuning

Spark has two kinds of memory: 1. Execution memory, which is used to store temporary data for shuffles, joins, sorts, and aggregations. 2. Storage memory, which is used to cache RDDs and DataFrames. An executor has some total amount of memory, which is divided into these two parts, the execution block and the storage block. This is governed by two configuration options. 1. spark.executor.memory – the total amount of memory available to executors; it is 1 gigabyte by default. 2. spark.memory.fraction – the fraction of that memory available for execution and storage. In early versions of Spark, the sizes of these two regions were fixed, and if your job filled all of the execution space, Spark had to spill data to disk, reducing the performance of the application. On the other hand, if your...
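As an illustration of how these options are typically set from code (the values below are placeholders, not recommendations from the post, and spark.memory.storageFraction is a related option not mentioned in the excerpt above):

[code lang="python"]
from pyspark.sql import SparkSession

# Example values only; tune them for your own workload and cluster.
spark = (
    SparkSession.builder
    .appName("memory-tuning-example")
    # total memory per executor (1g by default)
    .config("spark.executor.memory", "4g")
    # fraction of heap (minus reserved memory) shared by execution and storage
    .config("spark.memory.fraction", "0.6")
    # portion of that shared region protected for cached (storage) data
    .config("spark.memory.storageFraction", "0.5")
    .getOrCreate()
)
[/code]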

How to flatten JSON in Spark Dataframe

How do you flatten a whole JSON document that contains ArrayType and StructType fields? Spark has no predefined function to flatten a JSON document completely, but we can write our own. The function accepts a DataFrame and, for each field, inspects its DataType. If the field is of ArrayType, we create a new column by exploding the array column with Spark's explode_outer function. If the field is of StructType, we create a new column named parentfield_childfield for each child field of the struct. This is a recursive function: once it no longer finds any ArrayType or StructType, it returns the flattened DataFrame; otherwise, it keeps iterating through the schema to completely flatten out the JSON...
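A minimal PySpark sketch of the recursive approach described above; the function name flatten_df and the exact column-naming details are this example's own, not necessarily the post's code:

[code lang="python"]
from pyspark.sql import DataFrame
from pyspark.sql.functions import col, explode_outer
from pyspark.sql.types import ArrayType, StructType

def flatten_df(df: DataFrame) -> DataFrame:
    """Recursively explode ArrayType columns and expand StructType columns."""
    for field in df.schema.fields:
        if isinstance(field.dataType, ArrayType):
            # Explode arrays; explode_outer keeps rows whose array is null or empty.
            df = df.withColumn(field.name, explode_outer(col(field.name)))
            return flatten_df(df)
        if isinstance(field.dataType, StructType):
            # Expand structs into parentfield_childfield columns.
            expanded = [
                col(f"{field.name}.{child.name}").alias(f"{field.name}_{child.name}")
                for child in field.dataType.fields
            ]
            others = [col(c) for c in df.columns if c != field.name]
            df = df.select(others + expanded)
            return flatten_df(df)
    # No ArrayType or StructType left: the DataFrame is fully flattened.
    return df
[/code]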

How to create Spark Dataframe on HBase table [Code Snippets]

There is no direct library to create a DataFrame on an HBase table the way we read a Hive table with Spark SQL. This post shows a way to create a DataFrame on top of an HBase table. You need to add the hbase-client dependency to achieve this; below is the link to get the dependency: https://mvnrepository.com/artifact/org.apache.hbase/hbase-client/2.1.0 Let's say the HBase table is 'emp' with row key 'empID' and columns 'name' and 'city' under the column family named 'metadata'. A case class, EmpRow, is used to give structure to the DataFrame. newAPIHadoopRDD is the API available in Spark to create an RDD on HBase; configurations need to be passed as shown below. The DataFrame is created when you map this RDD onto the case class. ...

How to Add Serial Number to Spark Dataframe

You may sometimes need to add a serial number to a Spark DataFrame. It can be done with the Spark function monotonically_increasing_id(), which generates a new column with a unique, monotonically increasing 64-bit index for each row. But the values are not consecutive: the sequence depends on the partition, so in short the assigned numbers are out of sequence. If the goal is to add a consecutive serial number to the DataFrame, you can use the zipWithIndex method available on RDDs. Below is how you can achieve the same on a DataFrame. [code lang="python"] from pyspark.sql.types import LongType, StructField, StructType def dfZipWithIndex(df, offset=1, colName="rowId"): ''' Enumerates dataframe rows in native order, like rdd.zipWithIndex(), but on a dataframe and preserves a ...
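For a compact illustration of the same zipWithIndex idea (a simplified sketch, not the post's full dfZipWithIndex implementation):

[code lang="python"]
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("serial-number-example").getOrCreate()
df = spark.createDataFrame([("a",), ("b",), ("c",)], ["letter"])

# zipWithIndex pairs each row with its position in the RDD's native order;
# here we append a 1-based index to each row tuple.
with_index = df.rdd.zipWithIndex().map(lambda pair: pair[0] + (pair[1] + 1,))

spark.createDataFrame(with_index, df.columns + ["rowId"]).show()
# +------+-----+
# |letter|rowId|
# +------+-----+
# |     a|    1|
# |     b|    2|
# |     c|    3|
# +------+-----+
[/code]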

XGBoost for Regression [Case Study]

Using Gradient Boosting for Regression Problems. Introduction: The goal of the blog post is to equip beginners with the basics of the gradient boosting regressor algorithm and quickly help them build their first model. We will mainly focus on the modeling side; the data cleaning and preprocessing parts will be covered in detail in an upcoming post. Gradient boosting for regression builds an additive model in a forward stage-wise fashion; it allows for the optimization of arbitrary differentiable loss functions. In each stage a regression tree is fit on the negative gradient of the given loss function. The idea of boosting came out of the question of whether a weak learner can be modified to become better. A weak hypothesis or weak learner is defined as one whose performance is at least slig...
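A minimal sketch of fitting a gradient boosting regressor with the xgboost library; the synthetic dataset and hyperparameters are placeholders, not the case study's actual data or tuned values:

[code lang="python"]
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from xgboost import XGBRegressor

# Placeholder data standing in for the case study's dataset.
X, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Boosted regression trees: each stage fits a tree to the loss gradient.
model = XGBRegressor(n_estimators=200, max_depth=4, learning_rate=0.1)
model.fit(X_train, y_train)

predictions = model.predict(X_test)
print("Test RMSE:", mean_squared_error(y_test, predictions) ** 0.5)
[/code]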

XGBoost for Classification [Case Study]

Boost Your ML skills with XGBoost. Introduction: In this blog we will discuss a popular boosting ensemble algorithm called XGBoost. XGBoost is one of the most popular machine learning algorithms these days. Regardless of the task (regression or classification), it is well known for providing better solutions than many other ML algorithms. Extreme Gradient Boosting (XGBoost) is similar to the gradient boosting framework but more efficient; it has both a linear model solver and tree learning algorithms. What makes it fast is its capacity for parallel computation on a single machine, which makes XGBoost many times faster than older gradient boosting implementations. It supports various objective functions, including regression, classification and ranking. Since it is very high in predi...
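A minimal sketch of fitting an XGBoost classifier; as above, the synthetic dataset and hyperparameters are placeholders, not the case study's:

[code lang="python"]
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier

# Placeholder binary classification data standing in for the case study's dataset.
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1)
model.fit(X_train, y_train)

predictions = model.predict(X_test)
print("Test accuracy:", accuracy_score(y_test, predictions))
[/code]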

Understanding Principal Component Analysis (PCA)

Principal Component Analysis: implement it from scratch and validate it with the sklearn framework. Introduction: "Excess of everything is bad." This line applies especially in machine learning: when the data has too many dimensions, pattern learning becomes a problem. Too much information hurts two things: compute and execution time, and the quality of the model fit. When the dimensionality of the data is too high we need to find a way to reduce it, but the reduction has to be done in such a way that we maintain the original pattern of the data. The algorithm that we are going to discuss in this article does exactly this job. The algorithm is quite famous and widely used in a variety of tasks; its name is Principal Component Analysis, aka PCA. The main purposes of a principal component...
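A small sketch of the from-scratch-versus-sklearn validation the post describes; the random data and the choice of two components are illustrative only:

[code lang="python"]
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))

# From scratch: center the data, then eigendecompose the covariance matrix.
X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]        # sort components by explained variance
components = eigenvectors[:, order[:2]]      # keep the top 2 principal directions
X_manual = X_centered @ components

# Validate with sklearn (component signs may flip, but magnitudes should match).
X_sklearn = PCA(n_components=2).fit_transform(X)
print(np.allclose(np.abs(X_manual), np.abs(X_sklearn), atol=1e-6))
[/code]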

Simple Logistic Regression [Case Study]

Logic behind Simple Logistic Regression. Introduction: The goal of the blog post is to get beginners started with the fundamental concepts of simple logistic regression and quickly help them build their first simple logistic regression model. We will mainly focus on learning to build your first logistic regression model; the data cleaning and preprocessing parts will be covered in detail in an upcoming post. Logistic regression is one of the most fundamental and widely used machine learning algorithms, and it is usually among the first few topics people pick up while learning predictive modeling. Don't be confused by the suffix "regression" in the algorithm name: logistic regression is not a regression algorithm but actually a probabilistic classification...
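A minimal sketch of a simple (single-feature) logistic regression with sklearn; the synthetic data is a placeholder, not the case study's dataset:

[code lang="python"]
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# One predictor, two classes: the "simple" logistic regression setting.
X, y = make_classification(n_samples=500, n_features=1, n_informative=1,
                           n_redundant=0, n_clusters_per_class=1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression()
model.fit(X_train, y_train)

# The model outputs class probabilities (via the sigmoid), not raw values,
# which is why logistic regression is a probabilistic classifier.
print("Class probabilities:", model.predict_proba(X_test[:3]))
print("Predicted classes:  ", model.predict(X_test[:3]))
print("Test accuracy:      ", model.score(X_test, y_test))
[/code]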
