Author: Sai Kumar

Best Lines from Friedrich Nietzsche Writings

Friedrich Nietzsche (1844–1900) was a German classical scholar, philosopher, and critic of culture, who became one of the most influential of all modern thinkers. The German wrote 15 books in the seventeen years, including Beyond Good And Evil, Twilight of the idols, God is dead, Ecce Homo, the Antichrist, On The Genealogy of Morality, and Thus Spoke Zarathustra. Here are some best lines from Niet...

Captain Jack Sparrow Philosophy

Perhaps a strange source to find inspiration from but among the most powerful reasons to see Jack Sparrow the unlikely hero from the Pirates of the Caribbean series as someone we can learn from can be seen in the opening scene with the character as he rides a sinking ship into Port Royal during the start of the first film. Jack acts as the king of all that he surveys and though we can sense Inklin...

How to connect to Snowflake from AWS EMR using PySpark

As a ETL developer, we need to transport data between different platforms/services. It involves establishing connections between them. Below is one such use-case to connect Snowflake from AWS. Here are steps to securely connect to Snowflake using PySpark – Login to AWS EMR service and connect to Spark with below snowflake connectors pyspark --packages net.snowflake:snowflake-jdbc:3.11.1,net....

All about AWS – GLUE

What is GLUE? Fully managed ETL service that makes it simple and cost effective to categorize your data, clean it, enrich it and move it reliably between various data stores. It’s a serverless system. Automatically handle discovery and definition of table definitions and schema. Its main use is to serve as a central metadata repository for your data lake. Discover those schemas out of your u...

Binary Search in Scala (Iterative,Recursion,Tail Recursion Approaches)

Binary Search is a fast & efficient algorithm to search an element in sorted list of elements. It works on the technique of divide and conquer. Data structure: Array Time Complexity: Worst case: O(log n) Average case: O(log n) Best case: O(1) Space complexity: O(1) Let’s see how it can implemented in Scala with different approaches – Iterative Approach – Recursion Approach &#...

Handy Methods in SparkContext Object while writing Spark Applications

SparkContext is Main entry point for Spark functionality. Its basically a class in Spark framework, when initialized, gets access to Spark Libraries. A SparkContext is responsible for connecting to Spark cluster, and can be used to create RDD(Resilient Distributed Dataset), to broadcast variables on that cluster and has much more useful methods. To create or initialize Spark Context, SparkConf nee...

How to Retrieve Password from JCEKS file in Spark

In the data ingestion stage into Hadoop from RBDMS sources, it often requires password to hit source tables in RDBMS databases. Passing hard password directly is highly unsafe and bad practice in real time applications. So, password can be encrypted by creating JCEKS file. JCEKS is basically a keystore file saved in the Java Cryptography Extension KeyStore (JCEKS) format; used as an alternative ke...

Joins in Spark SQL- Shuffle Hash, Sort Merge, BroadCast

Apache Spark SQL component comes with catalyst optimizer which smartly optimizes the jobs by re-arranging the order of transformations and by implementing some special joins according to datasets. Spark performs these joins internally or you can force it to perform them. It’s worthwhile to know this topic, so that it comes to rescue when optimizing the jobs according to your use case. Shuffl...

Understanding Tail recursion in Scala

Tail recursion is little tricky concept in Scala and takes time to master it completely. Before we get into Tail recursion, lets try to look into recursion. A Recursive function is the function which calls itself. If some action is repetitive, we can call the same piece of code again. Recursion could be applied to problems where you use regular loops to solve it. Factorial program with regular loo...

Memory Management in Spark and its tuning

Spark has two kinds of memory- 1.Execution Memory which is used to store temporary data of shuffles, joins, sorts, and aggregations 2. Storage Memory     which is used to cache RDDs and data frames Executor has some amount of total memory, which is divided into two parts, the execution block and the storage block.This is governed by two configuration options. 1. spark.executor.memory > It is th...

How to create Spark Dataframe on HBase table[Code Snippets]

There is no direct library to create Dataframe on HBase table like how we read Hive table with Spark sql. This post gives the way to create dataframe on top of Hbase table. You need to add hbase-client dependency to achieve this. Below is the link to get the dependency. https://mvnrepository.com/artifact/org.apache.hbase/hbase-client/2.1.0 Lets say the hbase table is ’emp’ with rowKey ...

How to Add Serial Number to Spark Dataframe

You may required to add Serial number to Spark Dataframe sometimes. It can be done with the spark function called monotonically_increasing_id(). It generates a new column with unique 64-bit monotonic index for each row. But it isn’t significant, as the sequence changes based on the partition. In short,  random numbers will be assigned which are out of sequence. If the goal is add serial numb...

Lost Password

Register

24 Tutorials