Archives

Hive Metastore Configurations

To store metadata, Hive can use any of the following three strategies: Embedded, Local or Remote.

Hive Metastore – Embedded – Mainly used for unit tests. Only one process is allowed to connect to the metastore at a time. Hive metadata is stored in an embedded Apache Derby database.

Hive Metastore – Local – Metadata is stored in a separate database such as MySQL. The Hive client opens the connection to the datastore itself and runs metastore queries against it.

Hive Metastore – Remote – All Hive clients connect to a metastore server, and the server queries the datastore for metadata. The metastore server and clients communicate using the Thrift protocol.
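As a rough illustration (not from the original notes), the local and remote modes are normally selected through properties in hive-site.xml. The host names below are placeholders: javax.jdo.option.ConnectionURL and ConnectionDriverName point a local metastore at an external MySQL datastore, while hive.metastore.uris points clients at a remote metastore server over Thrift (9083 is the conventional port).

[code lang="xml"]
<!-- Local metastore: the Hive client itself connects to an external database such as MySQL -->
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://db-host/metastore</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>

<!-- Remote metastore: clients talk to a metastore server over Thrift instead -->
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://metastore-host:9083</value>
</property>
[/code]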

Hive Architecture

Below are the major components in the Hive architecture:

UI – The user interface through which users submit queries and other operations to the system. As of 2011 the system had a command line interface, and a web based GUI was being developed.

Driver – Hive queries are sent to the driver for compilation, optimization and execution.

Compiler – The component that parses the query, does semantic analysis on the different query blocks and query expressions, and eventually generates an execution plan with the help of the table and partition metadata looked up from the metastore.

Metastore – The system catalog, which contains metadata about table schemas and other system schemas. It is stored in a separate database such as MySQL.

Execution Engine – The component which executes the execution plan created by the compiler.

Interacting with HIVE – CLI, GUI

Hive Command Line Interface (CLI) – Interaction with Hive is most commonly done through the CLI. The Hive CLI is started with the $HIVE_HOME/bin/hive command, which is a bash script, and its prompt is hive>. Using the CLI you can create tables, inspect schemas and query tables. The CLI is a thick client for Hive: it needs a local copy of all Hive and Hadoop client components along with their configurations, and it can act as a JDBC client, MapReduce client or HDFS client.

Hive Graphical User Interface (GUI) tools:
Ambari – provided with the Hortonworks distribution
HUE – provided with the Cloudera distribution
Qubole
Karmasphere

Introduction to Hive – When, What, Why

History – At Facebook the data grew from GBs (2006) to 1 TB/day (2007), and today it is 500+ TB per day. Such rapidly growing data made traditional warehousing expensive, and scaling up vertically is very costly. Hadoop is an alternative for storing and processing large data, but MapReduce is very low-level and requires custom code, so Facebook developed Hive as a solution. In September 2008 Hive became a Hadoop subproject.

What is Hive – Hive is a data warehouse solution built on Hadoop: a system for querying, managing and storing structured data on Hadoop, and an infrastructure for summarization and analysis of data. It provides an SQL dialect called HiveQL to process data on a Hadoop cluster; Hive translates HiveQL queries into MapReduce via the Java APIs. Hive is not a full database. It does not provide record level i...
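For a concrete feel of the SQL dialect, here is a small sketch (not part of the original post): a hypothetical HiveQL query submitted from the Spark shell through a HiveContext. The page_views table and its columns are invented for illustration; the same statement typed at the hive> prompt would be compiled by Hive into MapReduce jobs.

[code lang="scala"]
import org.apache.spark.sql.hive.HiveContext

// in the spark shell, sc (SparkContext) already exists; build a HiveContext on top of it
val hiveContext = new HiveContext(sc)

// a typical HiveQL aggregation over a hypothetical Hive table
hiveContext.sql(
  "SELECT page, COUNT(*) AS views FROM page_views GROUP BY page ORDER BY views DESC LIMIT 10"
).show()
[/code]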

Steps for creating DataFrames, SchemaRDD and performing operations using SparkSQL

Spark SQL: Spark SQL is a Spark module for structured data processing. One use of Spark SQL is to execute SQL queries using basic SQL syntax. There are several ways to interact with Spark SQL, including SQL, the DataFrames API and the Dataset API; the backbone of all these operations is the DataFrame and the SchemaRDD.

DataFrames – A DataFrame is a distributed collection of data organised into named columns. It is conceptually equivalent to a table in a relational database.

SchemaRDD – SchemaRDDs are made of Row objects along with the metadata (schema) information.

Spark SQL needs an SQLContext object, which is created from an existing SparkContext. Steps for creating DataFrames/SchemaRDDs and performing operations using the sql methods provided by sqlContext:

Step 1: start the spark shell by using the following command....
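The step-by-step instructions are truncated in this excerpt; the sketch below only illustrates the flow they describe, assuming the Spark 1.x shell (where sc already exists) and a made-up Person schema.

[code lang="scala"]
import org.apache.spark.sql.SQLContext

// create an SQLContext from the existing SparkContext
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// define a schema with a case class and build a DataFrame from an RDD
case class Person(name: String, age: Int)
val people = sc.parallelize(Seq(Person("Ann", 31), Person("Bob", 25))).toDF()

// register the DataFrame as a temporary table and query it with the sql method
people.registerTempTable("people")
sqlContext.sql("SELECT name FROM people WHERE age > 30").show()
[/code]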

Word count program in Spark

WordCount in Spark – The WordCount program is the basic "hello world" of the Big Data world. Below is a program that achieves WordCount in Spark with very few lines of code (run from the spark shell, where sc is available).

[code lang="scala"]
// read the input file as an RDD of lines
val inputlines = sc.textFile("/users/guest/read.txt")
// split each line into words
val words = inputlines.flatMap(line => line.split(" "))
// pair each word with a count of 1
val wMap = words.map(word => (word, 1))
// sum the counts per word
val wOutput = wMap.reduceByKey(_ + _)
// the output directory must not already exist
wOutput.saveAsTextFile("/users/guest/wordcount-output")
[/code]

Reversal of string in Scala using recursive function

Reversal of a String in Scala using a recursive function –

[code lang="scala"]
object reverseString extends App {
  val s = "24Tutorials"
  print(revs(s))

  // reverse recursively: reverse the tail, then append the head
  def revs(s: String): String = {
    if (s.length <= 1) s          // base case also covers the empty string
    else revs(s.tail) + s.head    // equivalently: revs(s.substring(1)) + s.charAt(0)
  }
}
[/code]

Output: slairotuT42

Scala Important topics-Interview questions

Q1) Case classes: A case class is a class that may be used with the match/case statement. Case classes can be pattern matched; they automatically define hashCode and equals, and they automatically define getter methods for the constructor arguments. Case classes can be seen as plain, immutable data-holding objects that should depend exclusively on their constructor arguments. Case classes come with a companion object which holds the apply method, which makes it possible to instantiate a case class without the new keyword.

Q2) Pattern matching: Scala has a built-in general pattern matching mechanism. It allows matching on any sort of data with a first-match policy, for example: object MatchTest1 extends App { def matchTest(x: Int): String = x match { case 1 => "one" case 2 =>...
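The MatchTest1 fragment above is cut off in this excerpt. The small sketch below (the Note class and its values are invented for illustration) combines both answers: a case class created without new, compared with the generated equals, and taken apart with match/case.

[code lang="scala"]
// a plain, immutable data holder
case class Note(name: String, duration: String, octave: Int)

object CaseClassDemo extends App {
  val symbol = Note("A", "Quarter", 3)            // no `new` needed: the companion apply is used

  // pattern matching extracts the constructor arguments
  val description = symbol match {
    case Note(n, d, o) => s"$n ($d) in octave $o"
    case _             => "unknown"
  }

  println(description)                            // prints: A (Quarter) in octave 3
  println(symbol == Note("A", "Quarter", 3))      // prints: true (equals is generated automatically)
}
[/code]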

Hadoop MapReduce Interview Questions

Hadoop MapReduce Interview Questions and Answers

Explain the usage of the Context object. The Context object is used to help the mapper interact with the rest of the Hadoop system. It can be used for updating counters, reporting progress and providing application-level status updates. The Context object carries the configuration details for the job and also the interfaces that help it generate output.

What are the core methods of a Reducer? The 3 core methods of a reducer are –
1) setup() – This method of the reducer is used for configuring various parameters like the input data size, distributed cache, heap size, etc. Function definition: public void setup(Context context)
2) reduce() – The heart of the reducer, called once per key with the associated values. Function Definition -...
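The answer above is cut off after reduce(). As a hedged sketch of how the reducer methods and the Context object fit together (the SumReducer name and the Text/IntWritable types are chosen for illustration, not taken from the original answer), a reducer that sums integer counts might look like this in Scala against the org.apache.hadoop.mapreduce API:

[code lang="scala"]
import scala.collection.JavaConverters._
import org.apache.hadoop.io.{IntWritable, Text}
import org.apache.hadoop.mapreduce.Reducer

// hypothetical reducer that sums the integer counts emitted for each key
class SumReducer extends Reducer[Text, IntWritable, Text, IntWritable] {

  // 1) setup(): one-time configuration; job settings are available via context.getConfiguration
  override def setup(context: Reducer[Text, IntWritable, Text, IntWritable]#Context): Unit = {
    // e.g. read job parameters or files placed in the distributed cache
  }

  // 2) reduce(): called once per key with all the values grouped for that key
  override def reduce(key: Text, values: java.lang.Iterable[IntWritable],
                      context: Reducer[Text, IntWritable, Text, IntWritable]#Context): Unit = {
    var sum = 0
    for (v <- values.asScala) sum += v.get()
    context.getCounter("demo", "keys.processed").increment(1) // Context object updating a counter
    context.write(key, new IntWritable(sum))                  // Context object emitting output
  }

  // 3) cleanup(): release anything acquired in setup()
  override def cleanup(context: Reducer[Text, IntWritable, Text, IntWritable]#Context): Unit = {}
}
[/code]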

Hadoop HDFS Interview questions

What is a block and block scanner in HDFS?
Block – The minimum amount of data that can be read or written is generally referred to as a "block" in HDFS. The default size of a block in HDFS is 64 MB (in Hadoop 1.x; Hadoop 2.x raised it to 128 MB).
Block Scanner – The block scanner tracks the list of blocks present on a DataNode and verifies them to find any kind of checksum error. Block scanners use a throttling mechanism to limit the disk bandwidth they consume on the DataNode.

Explain the difference between NameNode, Backup Node and Checkpoint NameNode.
NameNode: The NameNode is at the heart of the HDFS file system and manages its metadata, i.e. the data of the files is not stored on the NameNode; rather, it holds the directory tree of all the files present in the HDFS file system on a Hadoop cluster. The NameNode uses two files for the namespace – f...
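As a small side illustration of blocks from the client's point of view (not part of the original answer; the file path is hypothetical), the HDFS FileSystem API can report a file's block size and the DataNodes holding each block:

[code lang="scala"]
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// prints the block size of a file and where its blocks live
object BlockInfo extends App {
  val fs = FileSystem.get(new Configuration())
  val status = fs.getFileStatus(new Path("/user/guest/read.txt"))
  println(s"block size: ${status.getBlockSize} bytes")

  // one entry per block: offset into the file, length, and the DataNodes holding replicas
  for (loc <- fs.getFileBlockLocations(status, 0, status.getLen)) {
    println(s"offset=${loc.getOffset} length=${loc.getLength} hosts=${loc.getHosts.mkString(",")}")
  }
}
[/code]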

Pig Quick notes

PIG QUICK NOTES:

Pig Latin is the language used to analyze data in Hadoop with Apache Pig. A relation is the outermost structure of the Pig Latin data model, and it is a bag, where:
- A bag is a collection of tuples
- A tuple is an ordered set of fields
- A field is a piece of data

Pig Latin statements – While processing data using Pig Latin, statements are the basic constructs.
1. Statements work with relations; they include expressions and schemas.
2. Every statement ends with a semicolon (;).
3. Various operations are performed through statements, using the operators provided by Pig Latin.
4. Except for LOAD and STORE, all other Pig Latin statements take a relation as input and produce another relation as output.
5. As soon as you enter a Load statement in the Grunt...
