Hello readers! This blog is all about Sqoop; as the name suggests, it is SQL-to-Hadoop. So let's get started!
Introduction to Sqoop
The name Sqoop stands for “SQL to Hadoop and Hadoop to SQL”.
Sqoop is a connectivity tool designed to transfer data between Hadoop and relational database servers. It is used to import data from relational databases such as MySQL and Oracle into Hadoop HDFS, and to export data from the Hadoop file system back to relational databases.
It allows users to specify a target location inside Hadoop and instructs Sqoop to move data from Oracle, Teradata, or other relational databases to that target. Traditional application management systems, that is, applications interacting with relational databases through an RDBMS, are one of the sources that generate Big Data. Such Big Data, generated by the RDBMS, is stored in relational database servers in the relational database structure.
When Big Data storage and analysis tools of the Hadoop ecosystem, such as MapReduce, Hive, HBase, Cassandra, and Pig, came into the picture, they required a tool to interact with relational database servers to import and export the Big Data residing in them. This is where Sqoop occupies its place in the Hadoop ecosystem: it provides feasible interaction between relational database servers and HDFS.
Let us see the importance of Sqoop. Data residing in relational database management systems needs to be transferred to HDFS, worked on, and often transferred back to the relational database management systems. In the real Big Data world, developers find transferring data between relational database systems and HDFS uninteresting and tedious, yet it is frequently required. Developers can always write custom scripts to move data in and out of Hadoop, but Apache Sqoop provides an alternative.
Sqoop automates most of this process, relying on the database to describe the schema of the data to be imported. Sqoop uses the MapReduce framework to import and export the data, which provides parallelism as well as fault tolerance. Sqoop makes a developer's work easy by providing a command-line interface: developers just need to supply basic information such as the source, the destination, and the database authentication details in the Sqoop command, and Sqoop takes care of the rest.
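To make this concrete, here is a minimal sketch of a Sqoop import command with the basic information mentioned above. The hostname, database, table, and credential values are illustrative placeholders, not details from any real setup:

```shell
# Import the "employees" table from a MySQL database into HDFS.
# Host, database, table, and credentials below are illustrative placeholders.
sqoop import \
  --connect jdbc:mysql://db.example.com/corp \
  --username dbuser \
  --password-file /user/dbuser/.db-password \
  --table employees \
  --target-dir /user/hadoop/employees
```

Using `--password-file` instead of `--password` keeps the credential out of the shell history; Sqoop infers the table's schema from the database and takes care of the rest.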
Architecture of Sqoop
- Sqoop is a client-side tool, tightly coupled with the Hadoop cluster.
- As shown in Figure 3, a Sqoop command initiated by the client fetches the metadata of the tables, columns, and data types through the connector and driver interfaces.
- The import or export is translated into a map-only MapReduce job that loads the data in parallel between the database and Hadoop.
- Clients must have the appropriate connector and driver in place for the process to execute.
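The metadata-fetching step above can be observed directly. As a hedged sketch (connection details are placeholders), Sqoop's list commands query the database through the same JDBC connector before any data moves:

```shell
# Ask the RDBMS, via the JDBC connector, what databases and tables it exposes.
# Hostname, database, and username are illustrative placeholders; -P prompts
# for the password interactively.
sqoop list-databases --connect jdbc:mysql://db.example.com --username dbuser -P
sqoop list-tables --connect jdbc:mysql://db.example.com/corp --username dbuser -P
```

If these commands fail, the connector or driver is usually missing or misconfigured, which is exactly the prerequisite the last bullet describes.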
Sqoop – The Bridge between RDBMS and Hadoop
Apache Sqoop can import and export data between an RDBMS and HDFS. It can be installed on any node in the Hadoop cluster; in production it is typically installed on a separate node. A best practice is to update the .bashrc file after extracting and configuring the Sqoop binaries, which lets you run Sqoop from anywhere on that node's command line.
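The .bashrc update mentioned above amounts to two lines. The extraction path here is an assumed example; substitute wherever you unpacked the Sqoop tarball:

```shell
# Append to ~/.bashrc after extracting Sqoop.
# /usr/local/sqoop is an assumed example path; adjust to your environment.
export SQOOP_HOME=/usr/local/sqoop
export PATH=$PATH:$SQOOP_HOME/bin
```

After sourcing the file (`source ~/.bashrc`), the `sqoop` command resolves from any directory on the node.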
As shown in Figure 2, Sqoop connects to the RDBMS through JDBC drivers and first retrieves the metadata of the database table. After querying the table metadata, Sqoop submits a MapReduce job to whichever Hadoop cluster is connected to the setup. As shown in Figure 4, Sqoop submits only a mapper function; there is no reducer. This submission is the actual data transfer: the database table's data is sliced and handed to the mappers, and the transfer happens in parallel across the mappers in Hadoop.
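The parallel, mapper-only transfer can be controlled from the command line. A hedged sketch (connection values and the `id` split column are placeholders): `--split-by` names the column used to slice the table, and `--num-mappers` sets how many mapper tasks transfer slices concurrently:

```shell
# Slice the table on column "id" and transfer it with 4 parallel mapper tasks.
# Connection details and column names are illustrative placeholders.
sqoop import \
  --connect jdbc:mysql://db.example.com/corp \
  --username dbuser -P \
  --table employees \
  --target-dir /user/hadoop/employees \
  --split-by id \
  --num-mappers 4
```

Each mapper receives a range of the split column and writes its slice to HDFS; since no aggregation is needed, no reducer ever runs.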
Features of Sqoop
- Ease of use: Sqoop 1 requires client-side installation and configuration, whereas Sqoop 2 is installed and configured server-side. This means that connectors are configured in one place, managed by the Admin role and run by the Operator role. Likewise, JDBC drivers live in one place, and database connectivity is only needed on the server; hence it is easy to use.
- Ease of Extension: In Sqoop 2, connectors are no longer restricted to the JDBC model and can define their own vocabulary; for example, Couchbase no longer needs to specify a table name, which it previously had to overload for backfill or dump operations.
- Security: Sqoop 1 operates as the user who runs the `sqoop` command, so the security principal of a Sqoop job is determined by the credentials the user holds when launching it. Sqoop 2, by contrast, operates as a server-based application and secures access to external systems via role-based access to Connection objects. For additional security, Sqoop 2 no longer performs client-side code generation, does not require direct access to Hive and HBase, and does not open up access for all clients to execute jobs.
- Sqoop is used for importing data from structured data sources such as RDBMS.
- Sqoop has connector based architecture. Connectors know how to connect to the respective data source and fetch the data.
- HDFS is a destination for data import using Sqoop.
- Sqoop data load is not event driven.
- To import data from structured data sources, Sqoop is the tool of choice, because its connectors know how to interact with those sources and fetch data from them.
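Since the post also covers moving data from the Hadoop file system back to relational databases, here is a hedged sketch of the reverse direction. The target table (which must already exist in the RDBMS) and all connection details are illustrative placeholders:

```shell
# Export processed results from an HDFS directory back into an RDBMS table.
# The table "employee_summary" must already exist; all names are placeholders.
sqoop export \
  --connect jdbc:mysql://db.example.com/corp \
  --username dbuser -P \
  --table employee_summary \
  --export-dir /user/hadoop/employee_summary
```

Like imports, exports run as parallel map-only jobs: each mapper reads a slice of the HDFS files and inserts the rows into the target table.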