Archives

Introduction to Hive – When, What, Why

History – At Facebook, data grew from GBs (2006) to 1 TB/day (2007), and today it is 500+ TB per day. Rapidly growing data made traditional warehousing expensive, and scaling up vertically is very costly. Hadoop is an alternative for storing and processing large data, but MapReduce is very low-level and requires custom code, so Facebook developed Hive as a solution. In September 2008 Hive became a Hadoop subproject. What is Hive – Hive is a data warehouse solution built on Hadoop: a system for querying, managing and storing structured data on Hadoop, and an infrastructure on Hadoop for summarization and analysis of data. It provides an SQL dialect called HiveQL to process data on a Hadoop cluster, and it translates HiveQL queries into MapReduce jobs via the Java APIs. Hive is not a full database; it does not provide record level i...

Steps for creating DataFrames, SchemaRDD and performing operations using SparkSQL

Spark SQL: Spark SQL is a Spark module for structured data processing. One use of Spark SQL is to execute SQL queries using basic SQL syntax. There are several ways to interact with Spark SQL, including SQL, the DataFrame API and the Dataset API. The backbone for all of these operations is DataFrames and SchemaRDDs. DataFrames – a DataFrame is a distributed collection of data organised into named columns; it is conceptually equivalent to a table in a relational database. SchemaRDD – SchemaRDDs are made of Row objects along with the metadata describing the schema. Spark SQL needs an SQLContext object, which is created from the existing SparkContext. Steps for creating DataFrames and SchemaRDDs and performing some operations using the sql methods provided by sqlContext. Step 1: start the spark shell by using the following command....
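A minimal sketch of these steps, written against the Spark 1.x API the post describes (SchemaRDD predates Spark 2.x); the Employee case class, its fields, the sample rows and the table name below are illustrative assumptions, not taken from the original post:
[code lang="scala"]
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// sample schema used only for illustration
case class Employee(id: Int, name: String, salary: Double)

object SparkSqlDemo extends App {
  val sc = new SparkContext(new SparkConf().setAppName("spark-sql-demo").setMaster("local[*]"))

  // the SQLContext is created from the existing SparkContext
  val sqlContext = new SQLContext(sc)
  import sqlContext.implicits._

  // build a DataFrame from an RDD of case-class rows
  val employees = sc.parallelize(Seq(
    Employee(1, "Anu", 45000.0),
    Employee(2, "Ravi", 52000.0)
  )).toDF()

  // register a temporary table and query it with sqlContext.sql()
  employees.registerTempTable("employees")
  sqlContext.sql("SELECT name, salary FROM employees WHERE salary > 50000").show()
}
[/code]
In the spark shell, sc and sqlContext are already provided, so only the toDF/registerTempTable/sql lines are needed there.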

Word count program in Spark

WordCount in Spark – the WordCount program is the basic "hello world" of the Big Data world. Below is a program that achieves word count in Spark with very few lines of code.
[code lang="scala"]
// read the input file as an RDD of lines
val inputlines = sc.textFile("/users/guest/read.txt")
// split each line into words
val words = inputlines.flatMap(line => line.split(" "))
// pair each word with a count of 1
val wMap = words.map(word => (word, 1))
// sum the counts for each word
val wOutput = wMap.reduceByKey(_ + _)
// write the (word, count) pairs out; the output directory must not already exist
wOutput.saveAsTextFile("/users/guest/wordcount-output")
[/code]

Reversal of string in Scala using recursive function

Reversal of a String in Scala using a recursive function –
[code lang="scala"]
object reverseString extends App {
  val s = "24Tutorials"
  print(revs(s))

  // recursively append the head character after the reversed tail
  def revs(s: String): String = {
    if (s.length <= 1) s           // also guards the empty string
    else revs(s.tail) + s.head     // equivalently: revs(s.substring(1)) + s.charAt(0)
  }
}
[/code]
Output: slairotuT42

Scala Important Topics – Interview Questions

Q1) Case classes: A case class is a class that may be used with the match/case statement. Case classes can be pattern matched. Case classes automatically define hashCode and equals. Case classes automatically define getter methods for the constructor arguments. Case classes can be seen as plain and immutable data-holding objects that should exclusively depend on their constructor arguments. Case classes have a companion object which holds the apply method, which makes it possible to instantiate a case class without the new keyword. Q2) Pattern matching: Scala has a built-in general pattern-matching mechanism. It allows matching on any sort of data with a first-match policy. object MatchTest1 extends App { def matchTest(x: Int): String = x match { case 1 => "one" case 2 =>...
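To make both answers concrete, here is a small self-contained sketch (the Shape/Circle/Rectangle names are illustrative, not from the original post) showing a case class instantiated without new, its generated equality, and pattern matching over it:
[code lang="scala"]
object CaseClassMatchDemo extends App {
  sealed trait Shape
  case class Circle(radius: Double) extends Shape
  case class Rectangle(width: Double, height: Double) extends Shape

  // first-match policy: the first case that fits is used
  def area(s: Shape): Double = s match {
    case Circle(r)       => math.Pi * r * r   // constructor arguments are deconstructed by the pattern
    case Rectangle(w, h) => w * h
  }

  val c = Circle(2.0)                 // no `new`: the companion object's apply() is called
  println(c)                          // readable toString: Circle(2.0)
  println(c == Circle(2.0))           // true: equals/hashCode come from the constructor arguments
  println(area(Rectangle(3.0, 4.0)))  // 12.0
}
[/code]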

Hadoop MapReduce Interview Questions

Hadoop MapReduce Interview Questions and Answers. Explain the usage of the Context object. The Context object is used to help the mapper interact with the rest of the Hadoop system. It can be used for updating counters, reporting progress and providing any application-level status updates. The Context object also carries the configuration details for the job and the interfaces that help it generate output. What are the core methods of a Reducer? The 3 core methods of a reducer are: 1) setup() – this method of the reducer is used for configuring various parameters like the input data size, distributed cache, heap size, etc. Function definition: public void setup(context). 2) reduce() – the heart of the reducer, called once per key with the associated list of values. Function definition –...
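As a sketch of where those methods live (written in Scala against the standard org.apache.hadoop.mapreduce API to match the other examples on this site; the word-count summing logic is illustrative, not part of the original answer):
[code lang="scala"]
import org.apache.hadoop.io.{IntWritable, Text}
import org.apache.hadoop.mapreduce.Reducer

class SumReducer extends Reducer[Text, IntWritable, Text, IntWritable] {
  private val result = new IntWritable()

  // setup(): runs once per task before any reduce() call, e.g. to read job configuration
  override def setup(context: Reducer[Text, IntWritable, Text, IntWritable]#Context): Unit = ()

  // reduce(): called once per key with all of the values grouped under that key
  override def reduce(key: Text,
                      values: java.lang.Iterable[IntWritable],
                      context: Reducer[Text, IntWritable, Text, IntWritable]#Context): Unit = {
    var sum = 0
    val it = values.iterator()
    while (it.hasNext) sum += it.next().get()
    result.set(sum)
    context.write(key, result)
  }

  // cleanup(): runs once per task after the last reduce() call
  override def cleanup(context: Reducer[Text, IntWritable, Text, IntWritable]#Context): Unit = ()
}
[/code]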

Hadoop HDFS Interview questions

What is a block and block scanner in HDFS? Block – the minimum amount of data that can be read or written is generally referred to as a "block" in HDFS. The default block size in HDFS is 64 MB (128 MB from Hadoop 2.x onwards). Block Scanner – the block scanner tracks the list of blocks present on a DataNode and verifies them to find any checksum errors. Block scanners use a throttling mechanism to conserve disk bandwidth on the DataNode. Explain the difference between NameNode, Backup Node and Checkpoint NameNode. NameNode: the NameNode is at the heart of the HDFS file system and manages the metadata, i.e. the data of the files is not stored on the NameNode; rather, it holds the directory tree of all the files present in the HDFS file system on a Hadoop cluster. The NameNode uses two files for the namespace – f...
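For reference, block size and block placement can also be inspected programmatically; here is a small sketch using the Hadoop FileSystem API (the file path below is only an illustrative placeholder):
[code lang="scala"]
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object BlockInfo extends App {
  // picks up cluster settings from core-site.xml / hdfs-site.xml on the classpath
  val fs = FileSystem.get(new Configuration())
  val path = new Path("/users/guest/read.txt")

  val status = fs.getFileStatus(path)
  println(s"Block size of $path: ${status.getBlockSize} bytes")

  // each BlockLocation describes one block and the DataNodes holding its replicas
  fs.getFileBlockLocations(status, 0, status.getLen).foreach { loc =>
    println(s"offset=${loc.getOffset} length=${loc.getLength} hosts=${loc.getHosts.mkString(",")}")
  }
}
[/code]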

Pig Quick notes

PIG QUICK NOTES: Pig Latin is the language used to analyze data in Hadoop using Apache Pig. A RELATION is the outermost structure of the Pig Latin data model; it is a bag, where a bag is a collection of tuples, a tuple is an ordered set of fields, and a field is a piece of data. Pig Latin statements – while processing data using Pig Latin, statements are the basic constructs. 1. These statements work with relations; they include expressions and schemas. 2. Every statement ends with a semicolon (;). 3. Various operations are performed through statements using the operators provided by Pig Latin. 4. Except for LOAD and STORE, all other operations take a relation as input and produce another relation as output. 5. As soon as you enter a Load statement in the Grunt...

Types of SEO Techniques – White Hat, Grey Hat & Black Hat

The techniques used for search engine optimization (SEO) are generally grouped under the hat jargon: White Hat SEO, Grey Hat SEO and Black Hat SEO. While white hat techniques are considered acceptable SEO practice, black hat techniques are considered unethical. White Hat SEO: White Hat SEO is the best approach to optimizing a website. The overall strategy is to create a well-coded website that search engines can read and understand. For example, if a user finds a website and decides it is interesting, they may write an article about it, discuss it on a forum or mention it on their blog. White Hat SEO seeks to accelerate this process by writing articles, entering forum discussions or blogging as people separate from the website itself. The idea is to create enough buzz about the website t...

Basics of Website, Domain & Hosting?

What is a Website? How websites work. Details about HTTP, HTTPS & FTP. How to register a site? Hosting of a site. Domain extensions and sub-domains. HTML basics. Schema.org

Introduction to Search Engine Optimization

Search Engine Optimization is commonly known as SEO. It is the process of improving a webpage's position in the Search Engine Results Page (SERP) in an organic way (free listings). Some example search engines are Google, Bing, Yahoo, Baidu, Yandex, etc. A search engine results page (SERP) is the page displayed by a search engine in response to a query by a searcher. The main component of the SERP is the listing of results returned by the search engine in response to a keyword query. The time it takes to see results in a search engine ranges from a minimum of 3 months to a few years, depending on competition. A person who works on search engine optimization is called a "Search Engine Optimizer". Search Engine Optimization consists of On-Page and Off-Page tactics to improve ranking. Fo...
