Partitions- The data within an RDD is split into several partitions.
Properties of partitions:
- Partitions never span multiple machines, i.e., tuples in the same partition are guaranteed to be on the same machine.
- Each machine in the cluster contains one or more partitions.
- The number of partitions to use is configurable. By default, it equals the total number of cores on all executor nodes.
Two kinds of partitioning are available in Spark:
- Hash partitioning
- Range partitioning
Customizing a partitioning is only possible on Pair RDDs.
Hash partitioning- Given a Pair RDD that should be grouped:
    val purchasesPerCust = purchasesRdd.map(p => (p.customerId, p.price)) // Pair RDD
      .groupByKey()
groupByKey first computes per tuple (k, v) its partition p: p = k....
How Spark Jobs are Executed- A Spark application is a set of processes running on a cluster. All these processes are coordinated by the driver program.
The driver is:
- the process where the main() method of your program runs.
- the process running the code that creates a SparkContext, creates RDDs, and stages up or sends off transformations and actions.
The processes that run computations and store data for your application are executors.
Executors:
- Run the tasks that represent the application.
- Return computed results to the driver.
- Provide in-memory storage for cached RDDs.
Execution of a Spark program:
1. The driver program runs the Spark application, which creates a SparkContext upon start-up.
2. The SparkContext connects to a cluster manager (e.g., Mesos/YARN) which allocates resour...
Cut Command:
- cut is used to extract parts of each line of a file.
- It works best on files with column-formatted data.
Command 1: Display the character at a particular position
    cut -c3 file.txt
Command 2: Range of characters
    cut -c3-8 file.txt
    cut -c3- file.txt
    cut -c-10 file.txt
Command 3: Display columns after separation
    cut -d "|" -f2 file.txt
    cut -d "|" -f2-3 file.txt
    cut -d "|" -f2- file.txt
Command 4: Display all columns other than the given ones [--complement]
    cut -d "|" -f2 file.txt
    cut -d "|" --complement -f2 file.txt
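The cut commands above can be tried on a small sample file. This is a minimal sketch: the file contents and the variable names (workdir, ids, names, rest) are made up for illustration, and --complement is assumed to be available, which is the case for GNU cut but not for every cut implementation.

```shell
# Hypothetical sample data: three pipe-delimited records (id|name|salary).
workdir=$(mktemp -d)
printf '101|Sai|1000\n102|Amit|2500\n103|Msd|4000\n' > "$workdir/file.txt"

# Characters 1 through 3 of every line.
ids=$(cut -c1-3 "$workdir/file.txt")

# Second field only, using "|" as the delimiter.
names=$(cut -d "|" -f2 "$workdir/file.txt")

# Every field except the second (GNU cut's --complement).
rest=$(cut -d "|" --complement -f2 "$workdir/file.txt")

printf '%s\n' "$ids" "$names" "$rest"
```

Quoting the delimiter with plain ASCII quotes matters here: typographic quotes pasted from a web page are passed to cut as part of the delimiter and cause an error.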
grep - Global Regular Expression Print
It is used to search data in one or more files.
Command 1: Search pattern in file
    grep hello file.txt
    grep sai file.txt file2.txt
Command 2: Search pattern in all files with a txt extension in the current folder
    grep 1000 *.txt
Command 3: Search data in all files in current folder (*.* matches only names that contain a dot; use * to match every file)
    grep 1000 *.*
Command 4: Search ignoring case [-i]
    grep "Sai" file.txt (case sensitive by default)
    grep -i "Sai" file.txt
Command 5: Display line number [-n]
    grep -n "124" result.txt
Command 6: Get only the names of files in which the data exists [-l]
    grep -l "100" *.*
Command 7: Search exact word [-w]
    grep -w Sai file.txt
Command 8: Search lines which do not contain the data (reverse of search) [-v]
    grep -v "1000" file.txt
Command 9: Get one record before the search
    grep -B 1 "Msd" fi...
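To see how the -i, -w, -v and -n options change what matches, here is a small sketch on made-up data. The file contents and variable names are hypothetical, and the -c option (print a count of matching lines) is not part of the list above; it is used here only to keep the results compact.

```shell
# Hypothetical sample data for comparing grep options.
workdir=$(mktemp -d)
printf 'Sai 1000\nsai 2000\nSaint 1000\nAmit 3000\n' > "$workdir/file.txt"

# Case-sensitive substring match: "Sai 1000" and "Saint 1000".
exact=$(grep -c "Sai" "$workdir/file.txt")

# Case-insensitive match also picks up "sai 2000".
anycase=$(grep -ci "Sai" "$workdir/file.txt")

# Whole-word match excludes "Saint".
word=$(grep -cw "Sai" "$workdir/file.txt")

# Inverted match: lines that do NOT contain 1000.
inverted=$(grep -cv "1000" "$workdir/file.txt")

# Line number prefixed to the matching line.
numbered=$(grep -n "Amit" "$workdir/file.txt")

echo "$exact $anycase $word $inverted $numbered"
```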
SED - Stream Editor
Used to display and edit data. Editing options are insertion, update, and deletion.
2 types of operations:
- Line Addressing
- Context Addressing
Line Addressing-
Command 1: Display a specific line
    sed '2p' file.txt (prints every line, with line 2 printed twice)
    sed -n '3p' file.txt (specific line only => -n)
    sed -n '5p' file.txt
Command 2: Display last line [$]
    sed '$p' file.txt (prints the last line again along with the original output)
    sed -n '$p' file.txt (last line only)
Command 3: Range of lines
    sed -n '2,4p' file.txt
Command 4: Do not display specific lines [!]
    sed -n '2!p' file.txt
    sed -n '2,4!p' file.txt (do not display a specific range of lines)
Context Addressing:
Command 1: Display lines having a specific word
    sed -n '/Amit/p' file.txt
    sed -n '/[Aa]mi...
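The effect of -n and of the ! negation is easiest to see on a tiny file. The sketch below uses a made-up five-line file and hypothetical variable names; it only exercises the line-addressing forms and the plain /Amit/ context address listed above.

```shell
# Hypothetical sample data: one name per line.
workdir=$(mktemp -d)
printf 'Amit\nSai\nMsd\namit\nRaj\n' > "$workdir/file.txt"

# Without -n, '2p' prints every line and line 2 a second time: 6 lines in total.
dup_count=$(sed '2p' "$workdir/file.txt" | wc -l)

# With -n, only the addressed lines are printed.
third=$(sed -n '3p' "$workdir/file.txt")
last=$(sed -n '$p' "$workdir/file.txt")
range=$(sed -n '2,4p' "$workdir/file.txt")

# '!' inverts the address: everything except lines 2 to 4.
skipped=$(sed -n '2,4!p' "$workdir/file.txt")

# Context addressing: lines matching a pattern (case-sensitive).
matched=$(sed -n '/Amit/p' "$workdir/file.txt")

printf '%s\n' "$third" "$last" "$range" "$skipped" "$matched"
```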
AWK - select column data
- Search data in a file and print it on the console
- Find data of specific columns
- Format output data
- Used on files with bulk data for searching, conditional execution, updating, and filtering
Command 1: Print specific columns
    awk '{print $1}' file.txt (the default separator is whitespace: spaces or tabs)
    awk '{print $1 "--" $2}' file.txt
Command 2: Select all data from table
    awk '{print $0}' tabfile.txt
Command 3: Select columns from CSV
    awk -F "," '{print $1}' commafile.txt
    1. Separating data using -F
    awk -F "," '{print $1}' commafile.txt
    2. Using variable (FS)
    awk '{print $2}' FS="," commafile.txt
Command 4: Display content without displaying the header of the file
    awk 'NR!=1{print $1 " " $2...
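The two ways of setting the field separator can be compared side by side. This is a rough sketch on a made-up CSV file with hypothetical variable names; the header-skipping line uses NR > 1, which is one possible variant and not necessarily the exact command from the list above.

```shell
# Hypothetical CSV with a header row.
workdir=$(mktemp -d)
printf 'id,name,salary\n101,Sai,1000\n102,Amit,2500\n' > "$workdir/commafile.txt"

# Field separator set with the -F option.
with_flag=$(awk -F "," '{print $2}' "$workdir/commafile.txt")

# Field separator set by assigning FS before the file name.
with_var=$(awk '{print $2}' FS="," "$workdir/commafile.txt")

# Skip the header: NR is the current record number, so NR > 1 drops line 1.
no_header=$(awk -F "," 'NR > 1 {print $2}' "$workdir/commafile.txt")

printf '%s\n' "$with_flag" "$no_header"
```

Both forms split the first record correctly, because an assignment given as an operand takes effect before the file that follows it is read.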
Command 11: Find text at the start of a field [ ^ ]
    awk -F "|" '$2 ~ /^s/ {print $0}' tabfile.txt
Command 12: Find text at the end of a field [ $ ]
    awk -F "|" '$2 ~ /n$/ {print $0}' file1.txt
Command 13: Perform condition check using if
    awk -F "|" '{if ($3>2000) print $0;}' file2.txt
Command 14: Perform condition check using if-else
    awk -F "|" '{if($3>=20000) print $2; else print "*****" ; }' file2.txt
Command 15: Perform condition check using else if
    awk -F "|" '{ if ($3>=3000) print $2 " your tax is 30%"; else if($3>=2000) print $2 " your tax is 20%"; else print $2 " your tax is 10%"; }' file2.txt
Command 16: Begin B...
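The match operator (~) and the if / else if chain from Commands 11 to 15 can be checked on a small pipe-delimited file. The data, the shortened output labels, and the variable names below are made up for illustration.

```shell
# Hypothetical sample data: id|name|salary.
workdir=$(mktemp -d)
printf '1|sai|3500\n2|amit|2500\n3|raj|1500\n' > "$workdir/file2.txt"

# '~' tests a field against a regular expression; ^ anchors at the start.
starts_s=$(awk -F "|" '$2 ~ /^s/ {print $2}' "$workdir/file2.txt")

# $ anchors at the end of the field.
ends_t=$(awk -F "|" '$2 ~ /t$/ {print $2}' "$workdir/file2.txt")

# if / else if / else: the first condition that holds decides the output.
bands=$(awk -F "|" '{
  if ($3 >= 3000)      print $2 ": 30%";
  else if ($3 >= 2000) print $2 ": 20%";
  else                 print $2 ": 10%";
}' "$workdir/file2.txt")

printf '%s\n' "$starts_s" "$ends_t" "$bands"
```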
An application framework usually has many methods. If you want to trace and log the name of the method currently running, the code below will be helpful.
    // Index 2 is typically the caller: 0 is getStackTrace and 1 is getCurrentMethodName itself.
    def getCurrentMethodName: String = Thread.currentThread.getStackTrace()(2).getMethodName
    def test(): Unit = {
      println("you are in - " + getCurrentMethodName)
      println("this is doing some functionality")
    }
    test()
Output:
    you are in - test
    this is doing some functionality
When you apply joins in a Spark application, you might want to know how long a particular join takes to complete. The code snippet below can help with that.
    import java.util.Date
    val start = new Date().getTime   // time before the work starts, in milliseconds
    println(start)
    Thread.sleep(30000)              // stand-in for the join being measured
    val end = new Date().getTime     // time after the work ends
    println(end)
    println("time taken " + (end - start).toFloat / 60000 + "mins")
Output:
    import java.util.Date
    start: Long = 1520502573995
    end: Long = 1520502603996
    time taken 0.5000167mins
All you need to do is get the current time before the method starts and again after it ends, then calculate the difference to get the total time taken by that method. Hope this code snippet helps!
Scala doesn’t have its own library for dates and timestamps, so we need to depend on Java libraries. Here is a quick method to get the current timestamp and format it as required. Please note that all the code is in Scala, so it can be used while writing a Scala application.
    import java.sql.Timestamp
    import java.text.SimpleDateFormat
    import java.util.Calendar
    def getCurrentdateTimeStamp: Timestamp = {
      val today: java.util.Date = Calendar.getInstance.getTime
      val timeFormat = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
      val now: String = timeFormat.format(today)
      // Timestamp.valueOf expects the "yyyy-MM-dd HH:mm:ss" format used above.
      val re = java.sql.Timestamp.valueOf(now)
      re
    }
Output:
    import java.sql.Timestamp
    getCurrentdateTimeStamp: java.sql.Timestamp
    getCurrentdateTimeStamp
    res0: java.sql.Timestamp = 2018-03-18 07:48:00.0
Tricky Deployment: Once you’re done writing your app, you have to deploy it, right? That’s where things get a little out of hand. Although there are many options for deploying your Spark app, the simplest and most straightforward approach is standalone deployment. Spark also supports Mesos and YARN, but if you’re not familiar with them, it can be quite difficult to understand what’s going on. You might face some initial hiccups when bundling dependencies as well. If you don’t do it correctly, the Spark app will work in standalone mode but you’ll encounter classpath exceptions when running in cluster mode.
Memory Issues: As Apache Spark is built to process huge chunks of data, monitoring and measuring memory usage is critical. While Spark works just fine for normal usage, it has got tons of...