Archives

Program to print triangle pattern using Scala

Write a program to print the triangle pattern below using Scala:

#
##
###
####
#####

With Scala's functional style of programming it is much easier to print patterns than in Java. Below is the code for printing the pattern using Scala for loops.

Approach 1 –

[code lang="scala"]
object PrintTriangle {
  def main(args: Array[String]) {
    for (i <- 1 to 5) {
      for (j <- 1 to i) {
        print("#")
      }
      println("")
    }
  }
}
[/code]

Approach 2 –

[code lang="scala"]
object PrintTriangle {
  def main(args: Array[String]) {
    for (x <- 1 until 6) {
      println("#" * x)
    }
  }
}
[/code]

Output:

#
##
###
####
#####

How to Remove Header and Trailer of File using Scala

Removing the header and trailer of a file using Scala may not be a real-world use case, since you would typically use Spark (or Unix tools) when dealing with large datasets. This post is mainly helpful for interviews: an interviewer might ask you to write this in Scala instead of Unix or Spark. Here is the code snippet to achieve it in Scala –

[code lang="scala"]
import scala.io.Source

object RemoveHeaderTrailer {
  def main(args: Array[String]) {
    println("start")
    val input = Source.fromFile("C:/Users/Sai/input.txt")
    //input.getLines().drop(1).foreach(println) // this removes the header alone
    val lines = input.getLines().toList
    val required_data = lines.slice(1, lines.size - 1).mkString("\n")
    import java.io._
    val pw = new PrintWriter(new File("C:/Users/...
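For reference, here is one way the snippet could be completed – a minimal sketch, assuming hypothetical input and output paths, that drops the first and last lines and writes the rest back out:

[code lang="scala"]
import scala.io.Source
import java.io.{File, PrintWriter}

object RemoveHeaderTrailerSketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical paths – replace with your own input/output locations
    val inputPath = "input.txt"
    val outputPath = "output.txt"

    val source = Source.fromFile(inputPath)
    try {
      val lines = source.getLines().toList
      // slice(1, size - 1) drops the first line (header) and the last line (trailer)
      val body = lines.slice(1, lines.size - 1).mkString("\n")

      val pw = new PrintWriter(new File(outputPath))
      try pw.write(body) finally pw.close()
    } finally {
      source.close()
    }
  }
}
[/code]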

Top Apache Spark Interview Questions and Answers For 2018

According to the Stack Overflow Survey, Apache Spark is a hot, trending and highly paid skill in the IT industry, and it is extremely popular in the Big Data analytics world. Here are frequently asked Apache Spark interview questions to help you crack a Spark job in 2018.

What is Apache Spark?
Apache Spark is a lightning-fast, in-memory (RAM) computation tool for processing big data files stored in Hadoop's HDFS, NoSQL stores, or on local systems.

What are the Spark ecosystem components?
Spark Core/SQL, Spark Streaming, Spark MLlib, Spark GraphX.

Spark vs MapReduce
a. Speed: Spark is ten to a hundred times faster than MapReduce.
b. Analytics: Spark supports streaming, machine learning and complex analytics.
c. Spark is suitable for real-time processing, whereas MapReduce is suitable for batch processing.
d. Spark is ...
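To make the in-memory point concrete, here is a minimal sketch (not part of the original questions) that caches an RDD in RAM so the second action reuses it instead of re-reading the file; the input path is hypothetical:

[code lang="scala"]
import org.apache.spark.{SparkConf, SparkContext}

object CachingSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("CachingSketch").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Hypothetical input file
    val lines = sc.textFile("data.txt")

    // cache() keeps the RDD in memory once the first action has computed it
    val errors = lines.filter(_.contains("ERROR")).cache()

    // The first action materialises and caches; the second reuses the cached data
    println(s"error count: ${errors.count()}")
    println(s"distinct errors: ${errors.distinct().count()}")

    sc.stop()
  }
}
[/code]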

Ways to create DataFrame in Apache Spark [Examples with Code]

Ways to create a DataFrame in Apache Spark – a DataFrame is a table-like representation of data: like a matrix, but each column has a name and its own data type (all values within a column share the same type). When working with Spark, most of the time you are required to create a DataFrame and play around with it. A DataFrame is simply a data structure held in memory, and it can be created in the following ways –

1) Using a case class
2) Using the createDataFrame method
3) Using the SQL method
4) Using read...load methods
   i) from flat files (JSON, CSV)
   ii) from RDBMS databases

1) Using a case class

[code lang="scala"]
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
case class Employee(name: String, sal: Int)
[/code]

Below is the sample...
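Since the excerpt cuts off, here is a minimal sketch of the first two approaches plus a read-based one, written against the Spark 2.x SparkSession API instead of the SQLContext shown above; the sample data and the employees.json path are hypothetical:

[code lang="scala"]
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Case class defined at the top level so Spark can derive its encoder
case class Employee(name: String, sal: Int)

object DataFrameCreationSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("DataFrameCreationSketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // 1) Case class + toDF on a local collection
    val fromCaseClass = Seq(Employee("Sai", 1000), Employee("Hari", 2000)).toDF()
    fromCaseClass.show()

    // 2) createDataFrame with an explicit schema over an RDD of Rows
    val schema = StructType(Seq(
      StructField("name", StringType),
      StructField("sal", IntegerType)))
    val rowRDD = spark.sparkContext.parallelize(Seq(Row("Sai", 1000), Row("Hari", 2000)))
    val fromRows = spark.createDataFrame(rowRDD, schema)
    fromRows.show()

    // 3) read...load from a flat file (hypothetical path)
    val fromJson = spark.read.json("employees.json")
    fromJson.show()

    spark.stop()
  }
}
[/code]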

Hadoop questions Part2

1. What is the input split size for a 64 MB block when the minimum input split size is 32 MB and the maximum is 128 MB? (64 KB, 67 MB, 127 MB)
2. How do you change the replication factor in HDFS?
3. How do you get the size of each file under the HDFS path /user/hdfs?
4. What are the default partitioner and combiner?
5. What is the output of an inner join and of left, right and full outer joins for the tables below?
   Customer: (1, A), (2, B), (2, B), (4, C), (5, D)
   Transaction: (8, 200), (2, 100), (2, 100), (9, 200), (6, 200)
6. How do you compare two files?
7. How do you get the IDs of all jobs?
8. How do you run a process in the background, bring it back to the foreground, and kill the job?
9. How do you get the 50th line of a text file?
10. Copy lines from a file, line by line, beyond the first 10 lines, and write them to sales.txt.
11. How do you exclude two tables and import all the remaining tables (Sqoop)?
12. How do you skip special characters such as \n and \r that are in an RDBMS table when importing into HDFS?
13. Can Spark Streaming be stopped without stopping the Spark context?
14. Zlib i...

Bucketing in Hive

• Bucketing decomposes data sets into more manageable parts
• Users can specify the number of buckets for their data set
• Specifying bucketing does not guarantee that the table is properly populated
• The number of buckets does not vary with the data
• Bucketing is best suited for sampling
• Map-side joins work well with bucketed tables

In the sample code below, a hash function is applied to 'emplid' and rows with similar ids are placed in the same bucket.

SET hive.enforce.bucketing = true;
or
SET mapred.reduce.tasks = <<number of buckets>>;

CREATE TABLE empdata (emplid INT, fname STRING, lname STRING)
PARTITIONED BY (join_dt STRING)
CLUSTERED BY (emplid) INTO 64 BUCKETS;

Partitioning in Hive

Partitioning improves query performance
The way Hive structures data storage changes with partitioning
Partitions are stored as sub-directories inside the table directory
Over-partitioning is to be avoided –
– Each partition creates an HDFS directory, with many files in it
– It increases the number of small files in HDFS
– It eventually consumes the capacity of the NameNode, since Hadoop keeps the metadata in main memory
Use a partitioning scheme that creates partitions whose size is a multiple of the HDFS block size
Partition columns are declared with the 'PARTITIONED BY' clause; static partitions are populated by naming the partition value explicitly in the INSERT or LOAD statement
Hive also supports dynamic partitions, where the partition values are taken from the query itself (see the sketch below)
Dynamic partitioning is not enabled by default, and when enabled it runs in 'strict' mode
The maximum number of dynamic partitions is limite...
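The post describes Hive DDL/DML; as a rough Scala illustration (an assumption – the statements are issued through a Hive-enabled SparkSession rather than the Hive CLI, and the sales/staging_sales tables are hypothetical), static and dynamic partition loading could look like this:

[code lang="scala"]
import org.apache.spark.sql.SparkSession

object HivePartitioningSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("HivePartitioningSketch")
      .enableHiveSupport()
      .getOrCreate()

    // Declare the partition column with PARTITIONED BY; each distinct
    // sale_dt value becomes a sub-directory under the table directory
    spark.sql(
      """CREATE TABLE IF NOT EXISTS sales (id INT, amount DOUBLE)
        |PARTITIONED BY (sale_dt STRING)
        |STORED AS TEXTFILE""".stripMargin)

    // Static partition: the partition value is named explicitly in the INSERT
    spark.sql(
      """INSERT INTO TABLE sales PARTITION (sale_dt = '2018-01-01')
        |SELECT id, amount FROM staging_sales WHERE sale_dt = '2018-01-01'""".stripMargin)

    // Dynamic partitions: values come from the query and must be enabled first
    spark.sql("SET hive.exec.dynamic.partition=true")
    spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")
    spark.sql(
      """INSERT INTO TABLE sales PARTITION (sale_dt)
        |SELECT id, amount, sale_dt FROM staging_sales""".stripMargin)

    spark.stop()
  }
}
[/code]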

Hive Tables

Hive supports 2 types of tables (both are illustrated in the sketch below):
1. Managed / internal tables
2. External tables

Managed tables
– The life cycle of the data in the table is controlled by Hive
– Data is stored under the sub-directory defined by 'hive.metastore.warehouse.dir'
– When the table is dropped, both data and metadata are deleted
– Not a good choice for sharing data with other tools

External tables
– Use the keyword EXTERNAL with CREATE TABLE
– The life cycle of the data in the table is NOT controlled by Hive
– Data is stored under the directory defined by the LOCATION clause in the CREATE TABLE command
– When the table is dropped, the data is not deleted but the metadata is deleted
– A better choice for sharing data with other tools
– A few HiveQL constructs are not allowed on external tables

DESCRIBE EXTENDED <<table name>> could be used...
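As a rough Scala sketch (again an assumption – the statements are issued through a Hive-enabled SparkSession, and the table names and HDFS location are hypothetical), creating one table of each type could look like this:

[code lang="scala"]
import org.apache.spark.sql.SparkSession

object HiveTableTypesSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("HiveTableTypesSketch")
      .enableHiveSupport()
      .getOrCreate()

    // Managed table: data lives under hive.metastore.warehouse.dir,
    // and DROP TABLE removes both data and metadata
    spark.sql("CREATE TABLE IF NOT EXISTS managed_emp (id INT, name STRING)")

    // External table: data lives at the LOCATION you supply (hypothetical path),
    // and DROP TABLE removes only the metadata
    spark.sql(
      """CREATE EXTERNAL TABLE IF NOT EXISTS external_emp (id INT, name STRING)
        |ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
        |LOCATION '/data/external_emp'""".stripMargin)

    // DESCRIBE EXTENDED shows, among other details, whether a table is managed or external
    spark.sql("DESCRIBE EXTENDED external_emp").show(truncate = false)

    spark.stop()
  }
}
[/code]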

Hive Databases

Hive databases are like namespaces/catalogs
If no database name is specified, the 'default' database is used
We can also use the keyword SCHEMA instead of DATABASE in all the database-related commands below
Hive creates a directory for each database it creates
The default directory for a database is created under the top-level directory specified by the property hive.metastore.warehouse.dir
You can specify a different directory using the LOCATION option in the CREATE command

Creating a database
CREATE DATABASE IF NOT EXISTS Mydb1 WITH DBPROPERTIES ('prop1'='value1');

Listing all databases (regular expressions are also allowed when listing databases)
SHOW DATABASES;
SHOW DATABASES LIKE 'M*';

Describing a database
DESCRIBE DATABASE Mydb1; -- shows the DB name, comment & DB directory
DESCRIBE DATABASE EXT...

File Formats in Hive

A file format specifies how records are encoded in files
A record format implies how the stream of bytes for a given record is encoded
The default file format is TEXTFILE – each record is a line in the file
Hive uses different control characters as delimiters in text files: ^A (octal 001), ^B (octal 002), ^C (octal 003), \n
The term "field" is used when overriding the default delimiter: FIELDS TERMINATED BY '\001' (see the sketch below)
Supports text files – CSV, TSV
A TextFile can contain JSON or XML documents

Commonly used file formats –

TextFile format
– Suitable for sharing data with other tools
– Can be viewed/edited manually

SequenceFile
– Flat files that store binary key-value pairs
– SequenceFile offers Reader, Writer, and Sorter classes for reading, writing, and sorting respectively
– Supports – Uncompr...
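A rough sketch of the two commonly used formats above (an assumption – the table names are hypothetical and the DDL is issued through a Hive-enabled SparkSession rather than the Hive CLI):

[code lang="scala"]
import org.apache.spark.sql.SparkSession

object HiveFileFormatSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("HiveFileFormatSketch")
      .enableHiveSupport()
      .getOrCreate()

    // TEXTFILE table overriding the default ^A field delimiter with a comma
    spark.sql(
      """CREATE TABLE IF NOT EXISTS emp_text (id INT, name STRING)
        |ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
        |STORED AS TEXTFILE""".stripMargin)

    // Same columns stored as a SequenceFile (binary key-value pairs)
    spark.sql(
      """CREATE TABLE IF NOT EXISTS emp_seq (id INT, name STRING)
        |STORED AS SEQUENCEFILE""".stripMargin)

    spark.stop()
  }
}
[/code]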

Data Model and Datatypes in Hive

Data in Hive is organised into –
Databases – namespaces to separate tables and other data
Tables – homogeneous collections of data having the same schema
Partitions – divisions of table data based on a key value
Buckets – divisions of partitions based on the hash value of a particular column

Hive data types:
Hive supports primitive data types and three collection types.
Primitive types – tinyint, smallint, int, bigint, boolean, string, timestamp, float, double, binary
Collection types –
1. Struct: address struct<city:STRING; state:STRING> – e.g. struct('Bengaluru', 'Karnataka') and address.city = 'Bengaluru'
2. Array: names array('Hari', 'Sai') – e.g. names[1] = 'Sai'
3. Map: name map('first', 'Mahendra', 'last...

Hive Metastore Configurations

In order to store metadata, Hive can use any of the three strategies below –
– Embedded
– Local
– Remote

Hive Metastore – Embedded
– Mainly used for unit tests
– Only one process is allowed to connect to the metastore at a time
– Hive metadata is stored in an embedded Apache Derby database

Hive Metastore – Local
– Metadata is stored in another database such as MySQL
– The Hive client opens the connection to the datastore and makes Hive queries against it

Hive Metastore – Remote
– All Hive clients make a connection to the metastore server, and the server queries the datastore for metadata
– The metastore server and clients communicate using the Thrift protocol
