In this post I will discuss what Hive is and walk through the steps to install it on an Ubuntu machine. So let’s begin.
What is Hive?
Apache Hive is data warehouse software built on top of Hadoop. It facilitates querying, analysis and summarization of data, and it supports the analysis of large datasets stored in Hadoop’s HDFS and in compatible file systems such as Amazon S3. It provides an SQL-like language called HiveQL (Hive Query Language). SQL knowledge is widespread, so anyone with a decent grasp of SQL can use Hive effectively. Hive is best suited to data warehousing applications where the data is structured, static and formatted. Hive is not a complete database: the Hive processor converts most queries into MapReduce jobs that run on the Hadoop cluster. Hive is designed for easy and effective data aggregation, ad-hoc querying and analysis of huge volumes of data.
Hive converts a HiveQL query into a Java MapReduce program and then submits it to the Hadoop cluster. The same outcome can be achieved with either HiveQL or hand-written Java MapReduce, but the Java route requires far more code to be written and debugged. Using Hive therefore increases developer productivity.
Hive does not offer the low latency of a traditional SQL system, because it ultimately runs MapReduce programs underneath. The MapReduce framework is built for batch processing and has high latency; even the fastest Hive query can take several minutes on a relatively small dataset of a few megabytes. We cannot simply compare the performance of traditional SQL systems with that of Hive. Hive is not an OLTP (online transaction processing) application and is not meant to be connected to systems that need interactive processing. It is meant for running batch jobs over huge, immutable datasets.
To summarize, Hive, through the HiveQL language, provides a higher-level abstraction over Java MapReduce programming. As with any other high-level abstraction, HiveQL carries a bit of performance overhead compared to hand-written Java MapReduce.
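To make the comparison concrete, here is a minimal sketch of an aggregation written in HiveQL and run from the shell. The table page_views and its country column are hypothetical, invented only for this illustration; hive -f runs the statements stored in a file. A hand-written Java MapReduce job for the same count would need a mapper, a reducer and a driver class.

```shell
# Write a one-statement HiveQL script. The page_views table and its
# country column are hypothetical and exist only for this illustration.
cat > views_by_country.hql <<'EOF'
SELECT country, COUNT(*) AS views
FROM page_views
GROUP BY country;
EOF

# Run the script only if the hive command is available on this machine.
if command -v hive >/dev/null 2>&1; then
    hive -f views_by_country.hql
else
    echo "hive not found on PATH; wrote views_by_country.hql only"
fi
```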
Installation of Hive on Ubuntu:
Before installing Hive, a few basic prerequisites must be met:
Java (the JDK) should be installed.
Hadoop must be installed and running.
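A quick way to check these prerequisites is to see whether the relevant commands are on the PATH. The sketch below is only a presence check, and the function name check_prereqs is made up for this example; it does not verify that the Hadoop daemons are actually running.

```shell
# Report whether each prerequisite command can be found on PATH.
check_prereqs() {
    for cmd in java javac hadoop; do
        if command -v "$cmd" >/dev/null 2>&1; then
            echo "$cmd: found"
        else
            echo "$cmd: MISSING"
        fi
    done
}

check_prereqs
# If Hadoop is installed, `hadoop version` prints its version, and
# `jps` (shipped with the JDK) lists running Java processes such as
# the Hadoop daemons.
```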
Once the basic prerequisites are met, we can go ahead with installing Hive.
The steps are as follows:
Step 1: Download Apache Hive and extract it.
Download from the link: http://apache.claz.org/hive/stable/
Click apache-hive-1.2.1-bin.tar.gz and save it.
Change to the Downloads directory, where the archive was saved.
$ cd Downloads
Extract the Hive tar file using the following command:
$ tar -xzvf apache-hive-1.2.1-bin.tar.gz
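If you prefer to script the extraction, the sketch below wraps it in a small function that also reports the directory the archive created. The function name extract_hive and the example paths are assumptions for illustration, and it assumes the archive has a single top-level directory, as release tarballs usually do.

```shell
# Extract a Hive tarball into a target directory and print the
# top-level directory it created.
extract_hive() {
    archive=$1
    dest=$2
    mkdir -p "$dest" || return 1
    tar -xzf "$archive" -C "$dest" || return 1
    # Take the first path component of the archive's first entry.
    tar -tzf "$archive" | head -n 1 | cut -d/ -f1
}

# Example call; both paths are assumptions, adjust them to your setup.
if [ -f "$HOME/Downloads/apache-hive-1.2.1-bin.tar.gz" ]; then
    extract_hive "$HOME/Downloads/apache-hive-1.2.1-bin.tar.gz" "$HOME/Downloads"
fi
```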
Step 2: Set the Hive environment variables:
Edit the .bashrc file to update the environment variables for the user.
Add the following at the end of the file (HIVE_HOME must already be set to the directory where Hive was extracted):
export PATH=$PATH:$HIVE_HOME/bin
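The PATH line relies on HIVE_HOME, which has to be defined first. Here is a sketch of the two lines together, assuming the archive was extracted inside ~/Downloads so that Hive lives in ~/Downloads/apache-hive-1.2.1-bin. That location is an assumption; use whichever directory you extracted to.

```shell
# Assumed install location: adjust to wherever the archive was extracted.
export HIVE_HOME="$HOME/Downloads/apache-hive-1.2.1-bin"
# Append Hive's bin directory so the hive command can be found.
export PATH="$PATH:$HIVE_HOME/bin"
```

Run source ~/.bashrc or open a new terminal for the change to take effect.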
Step 3: Create the Hive directories within HDFS.
The warehouse directory is where Hive stores its tables and their data.
$ hadoop fs -mkdir /user/hive/warehouse
Give the group write permission on the warehouse directory.
$ hadoop fs -chmod g+w /user/hive/warehouse
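The two HDFS commands can also be wrapped in a small script that confirms the result. This is a sketch: the function name setup_warehouse is made up, the -p flag (create missing parent directories) is an addition to the commands above and is available in Hadoop 2.x and later, and hadoop fs -ls -d is used only to display the directory entry with its permissions.

```shell
# Create the Hive warehouse directory in HDFS and make it group-writable.
setup_warehouse() {
    dir=${1:-/user/hive/warehouse}
    if ! command -v hadoop >/dev/null 2>&1; then
        echo "hadoop not found on PATH" >&2
        return 1
    fi
    hadoop fs -mkdir -p "$dir" &&
    hadoop fs -chmod g+w "$dir" &&
    hadoop fs -ls -d "$dir"   # show the directory entry and its permissions
}

setup_warehouse /user/hive/warehouse || echo "warehouse setup did not complete"
```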
Step 4: Set the Hadoop path in hive-config.sh (in Hive's bin directory)
Open the file and go to the line that reads:
# Allow alternate conf dir location.
Below this line, add the following:
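The line to add is not shown in this step. Going by the step's title, it is presumably an export of HADOOP_HOME pointing at the Hadoop installation. A sketch, assuming Hadoop is installed under /usr/local/hadoop (an assumed path; substitute your own):

```shell
# Placed below the "# Allow alternate conf dir location." comment.
# /usr/local/hadoop is an assumed install path; use your own.
export HADOOP_HOME=/usr/local/hadoop
```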
Step 5: Launch Hive
$ hive
Type exit; at the hive> prompt to quit.
Hive is now installed successfully.
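One way to confirm the installation without opening the interactive prompt is to run a trivial statement with hive -e, which executes the quoted HiveQL and exits. The function name verify_hive below is made up for this sketch; a fresh installation normally lists the default database.

```shell
# Run a trivial statement non-interactively to confirm Hive starts.
verify_hive() {
    if ! command -v hive >/dev/null 2>&1; then
        echo "hive not found on PATH" >&2
        return 1
    fi
    hive -e 'SHOW DATABASES;'
}

verify_hive || echo "Hive check failed; review the earlier steps"
```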