Archives

Hive Metastore Configurations

To store metadata, Hive can use any of the following three strategies: Embedded, Local or Remote.

Hive Metastore – Embedded – Mainly used for unit tests. Only one process is allowed to connect to the metastore at a time. Hive metadata is stored in an embedded Apache Derby database.

Hive Metastore – Local – Metadata is stored in a separate database such as MySQL. The Hive client opens the connection to the datastore itself and runs metastore queries against it.

Hive Metastore – Remote – All Hive clients connect to a metastore server, and the server queries the datastore for metadata. The metastore server and clients communicate using the Thrift protocol.
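As a rough illustration (not from the original notes), the local and remote modes are normally selected through properties in hive-site.xml. The host names below are placeholders: javax.jdo.option.ConnectionURL and ConnectionDriverName point a local metastore at an external MySQL datastore, while hive.metastore.uris points clients at a remote metastore server over Thrift (9083 is the conventional port).

[code lang="xml"]
<!-- Local metastore: the Hive client itself connects to an external database such as MySQL -->
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://db-host/metastore</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>

<!-- Remote metastore: clients talk to a metastore server over Thrift instead -->
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://metastore-host:9083</value>
</property>
[/code]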

Hive Architecture

Below are the major components in the Hive architecture:

UI – The user interface through which users submit queries and other operations to the system. As of 2011 the system had a command line interface, and a web based GUI was being developed.

Driver – Hive queries are sent to the driver for compilation, optimization and execution.

Compiler – The component that parses the query, does semantic analysis on the different query blocks and query expressions, and eventually generates an execution plan with the help of the table and partition metadata looked up from the metastore.

Metastore – The system catalog, which contains metadata about table schemas and other system schemas. It is stored in a separate database such as MySQL.

Execution Engine – The component which executes the execution plan created by the compiler.

Interacting with HIVE – CLI, GUI

Hive Command Line Interface (CLI) – Interaction with Hive is most commonly done through the CLI. The Hive CLI is started with the $HIVE_HOME/bin/hive command, which is a bash script, and its prompt is hive>. Using the CLI you can create tables, inspect schemas and query tables. The CLI is a thick client for Hive: it needs a local copy of all Hive and Hadoop client components along with their configurations, and it can act as a JDBC client, MapReduce client or HDFS client.

Hive Graphical User Interface (GUI) tools:
Ambari – provided with the Hortonworks distribution
HUE – provided with the Cloudera distribution
Qubole
Karmasphere

Introduction to Hive – When, What, Why

History – At Facebook the data grew from GBs (2006) to 1 TB/day (2007), and today it is 500+ TB per day. Such rapidly growing data made traditional warehousing expensive, and scaling up vertically is very costly. Hadoop is an alternative for storing and processing large data, but MapReduce is very low-level and requires custom code, so Facebook developed Hive as a solution. In September 2008 Hive became a Hadoop subproject.

What is Hive – Hive is a data warehouse solution built on Hadoop: a system for querying, managing and storing structured data on Hadoop, and an infrastructure for summarization and analysis of data. It provides an SQL dialect called HiveQL to process data on a Hadoop cluster; Hive translates HiveQL queries into MapReduce via the Java APIs. Hive is not a full database. It does not provide record level i...
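For a concrete feel of the SQL dialect, here is a small sketch (not part of the original post): a hypothetical HiveQL query submitted from the Spark shell through a HiveContext. The page_views table and its columns are invented for illustration; the same statement typed at the hive> prompt would be compiled by Hive into MapReduce jobs.

[code lang="scala"]
import org.apache.spark.sql.hive.HiveContext

// in the spark shell, sc (SparkContext) already exists; build a HiveContext on top of it
val hiveContext = new HiveContext(sc)

// a typical HiveQL aggregation over a hypothetical Hive table
hiveContext.sql(
  "SELECT page, COUNT(*) AS views FROM page_views GROUP BY page ORDER BY views DESC LIMIT 10"
).show()
[/code]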

Steps for creating DataFrames, SchemaRDD and performing operations using SparkSQL

Spark SQL: Spark SQL is a Spark module for structured data processing. One use of Spark SQL is to execute SQL queries using basic SQL syntax. There are several ways to interact with Spark SQL, including SQL, the DataFrames API and the Dataset API; the backbone of all these operations is the DataFrame and the SchemaRDD.

DataFrames – A DataFrame is a distributed collection of data organised into named columns. It is conceptually equivalent to a table in a relational database.

SchemaRDD – SchemaRDDs are made of Row objects along with the metadata (schema) information.

Spark SQL needs an SQLContext object, which is created from an existing SparkContext. Steps for creating DataFrames/SchemaRDDs and performing operations using the sql methods provided by sqlContext:

Step 1: start the spark shell by using the following command....
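The step-by-step instructions are truncated in this excerpt; the sketch below only illustrates the flow they describe, assuming the Spark 1.x shell (where sc already exists) and a made-up Person schema.

[code lang="scala"]
import org.apache.spark.sql.SQLContext

// create an SQLContext from the existing SparkContext
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// define a schema with a case class and build a DataFrame from an RDD
case class Person(name: String, age: Int)
val people = sc.parallelize(Seq(Person("Ann", 31), Person("Bob", 25))).toDF()

// register the DataFrame as a temporary table and query it with the sql method
people.registerTempTable("people")
sqlContext.sql("SELECT name FROM people WHERE age > 30").show()
[/code]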

Word count program in Spark

WordCount in Spark – The WordCount program is the basic "hello world" of the Big Data world. Below is a program that achieves WordCount in Spark with very few lines of code (run from the spark shell, where sc is available).

[code lang="scala"]
// read the input file as an RDD of lines
val inputlines = sc.textFile("/users/guest/read.txt")
// split each line into words
val words = inputlines.flatMap(line => line.split(" "))
// pair each word with a count of 1
val wMap = words.map(word => (word, 1))
// sum the counts per word
val wOutput = wMap.reduceByKey(_ + _)
// the output directory must not already exist
wOutput.saveAsTextFile("/users/guest/wordcount-output")
[/code]

Reversal of string in Scala using recursive function

Reversal of a String in Scala using a recursive function –

[code lang="scala"]
object reverseString extends App {
  val s = "24Tutorials"
  print(revs(s))

  // reverse recursively: reverse the tail, then append the head
  def revs(s: String): String = {
    if (s.length <= 1) s          // base case also covers the empty string
    else revs(s.tail) + s.head    // equivalently: revs(s.substring(1)) + s.charAt(0)
  }
}
[/code]

Output: slairotuT42

Scala Important topics-Interview questions

Q1) Case classes: A case class is a class that may be used with the match/case statement. Case classes can be pattern matched; they automatically define hashCode and equals, and they automatically define getter methods for the constructor arguments. Case classes can be seen as plain, immutable data-holding objects that should depend exclusively on their constructor arguments. Case classes come with a companion object which holds the apply method, which makes it possible to instantiate a case class without the new keyword.

Q2) Pattern matching: Scala has a built-in general pattern matching mechanism. It allows matching on any sort of data with a first-match policy, for example: object MatchTest1 extends App { def matchTest(x: Int): String = x match { case 1 => "one" case 2 =>...
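The MatchTest1 fragment above is cut off in this excerpt. The small sketch below (the Note class and its values are invented for illustration) combines both answers: a case class created without new, compared with the generated equals, and taken apart with match/case.

[code lang="scala"]
// a plain, immutable data holder
case class Note(name: String, duration: String, octave: Int)

object CaseClassDemo extends App {
  val symbol = Note("A", "Quarter", 3)            // no `new` needed: the companion apply is used

  // pattern matching extracts the constructor arguments
  val description = symbol match {
    case Note(n, d, o) => s"$n ($d) in octave $o"
    case _             => "unknown"
  }

  println(description)                            // prints: A (Quarter) in octave 3
  println(symbol == Note("A", "Quarter", 3))      // prints: true (equals is generated automatically)
}
[/code]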

Hadoop MapReduce Interview Questions

Hadoop MapReduce Interview Questions and Answers

Explain the usage of the Context object. The Context object is used to help the mapper interact with the rest of the Hadoop system. It can be used for updating counters, reporting progress and providing application-level status updates. The Context object carries the configuration details for the job and also the interfaces that help it generate output.

What are the core methods of a Reducer? The 3 core methods of a reducer are –
1) setup() – This method of the reducer is used for configuring various parameters like the input data size, distributed cache, heap size, etc. Function definition: public void setup(Context context)
2) reduce() – The heart of the reducer, called once per key with the associated values. Function Definition -...
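The answer above is cut off after reduce(). As a hedged sketch of how the reducer methods and the Context object fit together (the SumReducer name and the Text/IntWritable types are chosen for illustration, not taken from the original answer), a reducer that sums integer counts might look like this in Scala against the org.apache.hadoop.mapreduce API:

[code lang="scala"]
import scala.collection.JavaConverters._
import org.apache.hadoop.io.{IntWritable, Text}
import org.apache.hadoop.mapreduce.Reducer

// hypothetical reducer that sums the integer counts emitted for each key
class SumReducer extends Reducer[Text, IntWritable, Text, IntWritable] {

  // 1) setup(): one-time configuration; job settings are available via context.getConfiguration
  override def setup(context: Reducer[Text, IntWritable, Text, IntWritable]#Context): Unit = {
    // e.g. read job parameters or files placed in the distributed cache
  }

  // 2) reduce(): called once per key with all the values grouped for that key
  override def reduce(key: Text, values: java.lang.Iterable[IntWritable],
                      context: Reducer[Text, IntWritable, Text, IntWritable]#Context): Unit = {
    var sum = 0
    for (v <- values.asScala) sum += v.get()
    context.getCounter("demo", "keys.processed").increment(1) // Context object updating a counter
    context.write(key, new IntWritable(sum))                  // Context object emitting output
  }

  // 3) cleanup(): release anything acquired in setup()
  override def cleanup(context: Reducer[Text, IntWritable, Text, IntWritable]#Context): Unit = {}
}
[/code]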

Hadoop HDFS Interview questions

What is a block and block scanner in HDFS?
Block – The minimum amount of data that can be read or written is generally referred to as a "block" in HDFS. The default size of a block in HDFS is 64 MB (in Hadoop 1.x; Hadoop 2.x raised it to 128 MB).
Block Scanner – The block scanner tracks the list of blocks present on a DataNode and verifies them to find any kind of checksum error. Block scanners use a throttling mechanism to limit the disk bandwidth they consume on the DataNode.

Explain the difference between NameNode, Backup Node and Checkpoint NameNode.
NameNode: The NameNode is at the heart of the HDFS file system and manages its metadata, i.e. the data of the files is not stored on the NameNode; rather, it holds the directory tree of all the files present in the HDFS file system on a Hadoop cluster. The NameNode uses two files for the namespace – f...
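As a small side illustration of blocks from the client's point of view (not part of the original answer; the file path is hypothetical), the HDFS FileSystem API can report a file's block size and the DataNodes holding each block:

[code lang="scala"]
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// prints the block size of a file and where its blocks live
object BlockInfo extends App {
  val fs = FileSystem.get(new Configuration())
  val status = fs.getFileStatus(new Path("/user/guest/read.txt"))
  println(s"block size: ${status.getBlockSize} bytes")

  // one entry per block: offset into the file, length, and the DataNodes holding replicas
  for (loc <- fs.getFileBlockLocations(status, 0, status.getLen)) {
    println(s"offset=${loc.getOffset} length=${loc.getLength} hosts=${loc.getHosts.mkString(",")}")
  }
}
[/code]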

Pig Quick notes

PIG QUICK NOTES:

Pig Latin is the language used to analyze data in Hadoop with Apache Pig. A relation is the outermost structure of the Pig Latin data model, and it is a bag, where:
- A bag is a collection of tuples
- A tuple is an ordered set of fields
- A field is a piece of data

Pig Latin statements – While processing data using Pig Latin, statements are the basic constructs.
1. Statements work with relations; they include expressions and schemas.
2. Every statement ends with a semicolon (;).
3. Various operations are performed through statements, using the operators provided by Pig Latin.
4. Except for LOAD and STORE, all other Pig Latin statements take a relation as input and produce another relation as output.
5. As soon as you enter a Load statement in the Grunt...
