Apache Sqoop — SQL to Hadoop

What is Sqoop?

Apache Sqoop is a tool designed to efficiently transfer bulk data between Hadoop and relational databases or mainframes. Sqoop can be used to import data from an RDBMS such as MySQL or Oracle, or from a mainframe, into HDFS, transform the data with Hadoop MapReduce, and then export it back into an RDBMS. Sqoop automates most of this process, relying on the database to describe the schema for the data to be imported.

Sqoop Architecture

  • When you use Sqoop to transfer data, the dataset being transferred is split into multiple partitions, and a map-only job is launched (see the sketch after this list).
  • Each mapper is then responsible for transferring one slice/partition of the dataset.
  • The database's metadata is used to handle each data record in a type-safe manner. Type safety is the extent to which a programming language discourages or prevents type errors caused by mismatched data types.
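As a quick illustration of this splitting, here is a hedged sketch of an import command (the connection URI, table, and column names are placeholders, not taken from the demos below): --split-by tells Sqoop which column to use when dividing the table into slices, and -m sets the number of mappers and therefore the number of partitions.

sqoop import \
  --connect jdbc:mysql://db.example.com/shopdb \
  --username dbuser -P \
  --table orders \
  --split-by order_id \
  -m 4 \
  --target-dir /user/hadoop/orders

With these options Sqoop queries the minimum and maximum of order_id and divides that range into four roughly equal slices, one per mapper, so the table is transferred by four parallel map tasks.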

How does Sqoop use MapReduce jobs internally?

  • Once Sqoop connects to the database, it uses JDBC to examine the table to be imported by retrieving a list of all the columns and their SQL data types.
  • The SQL data types (INTEGER, VARCHAR, etc.) are mapped to Java data types (Integer, String, etc.).
  • Sqoop has a code generator that creates a table-specific Java class to hold the records extracted from the table, using the column and type information obtained through JDBC (see the sketch after this list).
  • Sqoop then connects to the cluster and submits a MapReduce job that uses the generated Java class.
  • As before, the dataset being transferred is split into multiple partitions, and a map-only job is launched.
  • The output is a set of files containing the imported data; because the import runs in parallel, the data is spread across multiple files.
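This code-generation step can also be run on its own with the codegen tool, which writes the table-specific Java class to disk without performing an import. A minimal sketch, again with placeholder connection details:

sqoop codegen \
  --connect jdbc:mysql://db.example.com/shopdb \
  --username dbuser -P \
  --table orders \
  --outdir /tmp/sqoop-codegen

The generated class (orders.java in this example) has one typed field per column, derived from the JDBC metadata, plus the read/write methods that the map tasks use to serialize records.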

How Sqoop Import & Export works?

Sqoop Tools and Help Command

Sqoop is a collection of related tools. To use Sqoop, specify the tool you want to use and the arguments that control the tool.

sqoop tool-name [tool-arguments]
sqoop help
sqoop help [tool-name]
sqoop help import

Sqoop Import Tool

The import tool imports an individual table from an RDBMS into HDFS. Each row from the table is represented as a separate record in HDFS. Records can be stored as text files (one record per line), or in binary representation as Avro data files or SequenceFiles.

sqoop import \
  --connect jdbc:mysql://jdbc-uri/database_name \
  --table mysqltablename \
  --username mysql_username \
  --password mysql_password \
  -m num_mappers_to_run \
  --target-dir target_directory
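By default the imported records are written as delimited text. As noted above, Sqoop can also store them in binary form; a hedged sketch of the relevant flags, with placeholder connection details:

sqoop import \
  --connect jdbc:mysql://db.example.com/shopdb \
  --username dbuser -P \
  --table orders \
  --as-avrodatafile \
  --target-dir /user/hadoop/orders_avro

Use --as-sequencefile instead for SequenceFile output, or omit the flag to keep the default text format.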

Sqoop Export Tool

The export tool exports a set of files from HDFS back to an RDBMS. The target table must already exist in the database.

sqoop export \
  --connect jdbc:mysql://jdbc-uri/database_name \
  --username mysql_username \
  --password mysql_password \
  --table mysqltable \
  -m num_mappers_to_run \
  --export-dir target_directory
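The command above inserts new rows into the target table. If the table may already contain some of the rows, Sqoop also supports update-style exports; a sketch with placeholder names (support for allowinsert depends on the connector):

sqoop export \
  --connect jdbc:mysql://db.example.com/shopdb \
  --username dbuser -P \
  --table orders \
  --update-key order_id \
  --update-mode allowinsert \
  --export-dir /user/hadoop/orders

With --update-key, rows whose key already exists in the table are updated rather than inserted; --update-mode allowinsert additionally inserts rows that have no match.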

Demo 1: Importing MySQL table to HDFS

Let’s create the following MySQL table ‘databare_emp’ in the dataguy database and then move its data to an HDFS location.

CREATE DATABASE dataguy;

CREATE TABLE `dataguy`.`databare_emp` (
`empid` INT NOT NULL,
`fname` VARCHAR(45) NOT NULL,
`lname` VARCHAR(45) NULL,
`dob` DATE NOT NULL,
`gender` CHAR NOT NULL,
`hire_date` DATE NOT NULL,
PRIMARY KEY (`empid`));

INSERT INTO `dataguy`.`databare_emp` (`empid`,`fname`,`lname`,`dob`,`gender`,`hire_date`)VALUES (1001,'Nitin','Khandelwal','1986-08-01','M','2017-10-20');
INSERT INTO `dataguy`.`databare_emp` (`empid`,`fname`,`lname`,`dob`,`gender`,`hire_date`)VALUES (1002,'Naveen','Khandelwal','1988-01-12','M','2018-09-19');
INSERT INTO `dataguy`.`databare_emp` (`empid`,`fname`,`lname`,`dob`,`gender`,`hire_date`)VALUES (1003,'Ravi','Aggarwal','1997-12-21','M','2020-06-03');
INSERT INTO `dataguy`.`databare_emp` (`empid`,`fname`,`lname`,`dob`,`gender`,`hire_date`)VALUES (1004,'Rupika','Kaur','1992-05-07','F','2019-05-14');
INSERT INTO `dataguy`.`databare_emp` (`empid`,`fname`,`lname`,`dob`,`gender`,`hire_date`)VALUES (1005,'Anurag','Pandiya','1977-03-21','M','2000-08-11');
SELECT * FROM `dataguy`.`databare_emp`;

sqoop import --connect jdbc:mysql://dbserver.databare.com/dataguy --table databare_emp --username dataguyAD --password mypassword -m 1 --target-dir /user/ad/emp_files/

hdfs dfs -cat /user/ad/emp_files/part*
hdfs dfs -ls /user/ad/emp_files/
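With Sqoop's default settings (comma-delimited text, one record per line), the cat command should show output close to the following for the five rows inserted above; because -m 1 was used, all records land in a single part file.

1001,Nitin,Khandelwal,1986-08-01,M,2017-10-20
1002,Naveen,Khandelwal,1988-01-12,M,2018-09-19
1003,Ravi,Aggarwal,1997-12-21,M,2020-06-03
1004,Rupika,Kaur,1992-05-07,F,2019-05-14
1005,Anurag,Pandiya,1977-03-21,M,2000-08-11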

Demo 2: Exporting data to MySQL

We will use the same example from Demo 1 and now move the data from HDFS back into a table in MySQL.

CREATE TABLE `dataguy`.`databare_employees` (
`empid` INT NOT NULL,
`fname` VARCHAR(45) NOT NULL,
`lname` VARCHAR(45) NULL,
`dob` DATE NOT NULL,
`gender` CHAR NOT NULL,
`hire_date` DATE NOT NULL,
PRIMARY KEY (`empid`));
sqoop export --connect jdbc:mysql://dbserver.databare.com/dataguy --username dataguyAD --password mypassword --table databare_employees -m 1 --export-dir /user/ad/emp_files/

USE dataguy;
SELECT * FROM databare_employees;
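If you prefer to verify the export from the Hadoop side without opening a MySQL client, Sqoop's eval tool can run an ad-hoc query over the same JDBC connection; a sketch reusing the demo's connection details:

sqoop eval \
  --connect jdbc:mysql://dbserver.databare.com/dataguy \
  --username dataguyAD -P \
  --query "SELECT COUNT(*) FROM dataguy.databare_employees"

It should report five rows if the export completed successfully.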
