
Sunday, September 27, 2020

Sqoop Complete Tutorial Part-7

This is a continuation of the "Sqoop Complete Tutorial" series. If you want to read the previous parts -


18) Importing all tables from MySQL to Hive  

The below command will copy all the tables from MySQL to a Hive user database (here, dbTest).

sqoop import-all-tables
--connect jdbc:mysql://localhost/empdept
--username root
--password cloudera
--hive-import
--hive-database dbTest

The result is below:


If you look at the result, the data is duplicated. This is because the tables "emp" and "dept" already exist in the database; adding --hive-overwrite replaces the existing data instead of appending to it.

sqoop import-all-tables
--connect jdbc:mysql://localhost/empdept
--username root
--password cloudera
--hive-import
--warehouse-dir /user/hive/warehouse/dbtest
--hive-database dbtest
--hive-overwrite

Here we have additionally provided the warehouse directory to specify the location of the database.


19) Importing all tables but excluding a few from MySQL to Hive  

I have created a table named "location" in my current database 'empdept'.


I am about to import all tables while excluding 'emp' and 'dept', since those were already imported. As "location" is the only table left to import, I could simply specify its name in a regular import; however, let's see how it can be done with sqoop import-all-tables.

sqoop import-all-tables
--connect jdbc:mysql://localhost/empdept
--username root
--password cloudera
--hive-import
--hive-database dbtest
--exclude-tables "emp,dept"


If you look at the above screenshot, the import process selects only the "location" table and excludes the tables "emp" and "dept" from the import.


The import process completed, and the table schema and data were populated into the Hive warehouse/database. Let's verify in Hive.
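
To verify from the Hive shell, something like the following should do (assuming the new table landed in dbtest as "location", per the example above):

hive -e "USE dbtest; SHOW TABLES;"
hive -e "SELECT * FROM dbtest.location LIMIT 5;"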

Sqoop Complete Tutorial Part-6

This is a continuation of the "Sqoop Complete Tutorial" series. If you want to read the previous parts -


16) Importing a table from MySQL to Hive's default database.  

The below command will copy the "emp" table and its data from MySQL to Hive's "default" database as "employees".

sqoop import
--connect jdbc:mysql://localhost/empdept
--table emp
--username root
--password cloudera
--hive-import
--hive-table employees



Verifying the data:
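
If a screenshot is not handy, a quick query from the Hive shell works as well (assuming the table landed in the default database as "employees"):

hive -e "SELECT * FROM default.employees LIMIT 5;"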


17) Importing a table from MySQL to Hive's user database.  

The below command will copy the "emp" table and its data from MySQL to Hive's user database (dbTest) as the "employees" table.

sqoop import 
--connect jdbc:mysql://localhost/empdept 
--table emp 
--username root 
--password cloudera 
--hive-import 
--hive-table employees
--hive-database dbTest


Verifying the data:



Please click here for the next part.




Monday, September 7, 2020

Sqoop Complete Tutorial Part-5

This is a continuation of the "Sqoop Complete Tutorial" series. If you want to read the previous parts -


14) Read the contents of the database without importing it. 

sqoop list-tables
--connect jdbc:mysql://localhost/empdept
--username root
--password cloudera


The above command lists the tables in the given database.

15) Get the list of the databases using Sqoop

sqoop list-databases
--connect jdbc:mysql://localhost/
--username root
--password cloudera



With the above commands, we have finished learning how to import data from MySQL to HDFS using Sqoop. The upcoming articles cover how to import data from MySQL to Hive.

Please click here for the next part


Wednesday, August 26, 2020

Sqoop Complete Tutorial Part-4

This is a continuation of the "Sqoop Complete Tutorial" series. If you want to read the previous parts -


11) Importing data from MySQL to HDFS based on a SQL query 

A free-form query import uses --query instead of --table; the query must contain the $CONDITIONS token, and --target-dir must be supplied (the orders table and its order_id column below come from the sample retail_db database):

sqoop import
--connect jdbc:mysql://localhost/retail_db
--username root
--password cloudera
--query 'SELECT * FROM orders WHERE $CONDITIONS'
--split-by order_id
--target-dir /user/cloudera/orders_query

Append & LastModified

The following is as per the Sqoop documentation.

Sqoop supports two types of incremental imports: append and lastmodified. You can use the --incremental argument to specify the type of incremental import to perform.

You should specify append mode when importing a table where new rows are continually being added with increasing row id values. You specify the column containing the row’s id with --check-column. Sqoop imports rows where the check column has a value greater than the one specified with --last-value.

An alternate table update strategy supported by Sqoop is called lastmodified mode. You should use this when rows of the source table may be updated, and each such update will set the value of a last-modified column to the current timestamp. Rows where the check column holds a timestamp more recent than the timestamp specified with --last-value are imported.

At the end of an incremental import, the value which should be specified as --last-value for a subsequent import is printed to the screen. When running a subsequent import, you should specify --last-value in this way to ensure you import only the new or updated data. This is handled automatically by creating an incremental import as a saved job, which is the preferred mechanism for performing a recurring incremental import. See the section on saved jobs later in this document for more information.

Below are two examples of incremental imports.

12) Importing data from MySQL to HDFS - Incremental Import

Assume that the below is the "emp" data which is in MySQL's "EMPDEPT" database.


Importing it into HDFS using the below command

sqoop import
--connect jdbc:mysql://localhost:3306/empdept
--table emp
--username root
--password cloudera
-m 1

After importing it into HDFS, insert two rows in MySQL's table.

These two records also need to be appended at the HDFS location. To achieve this, use the below command.

sqoop import
--connect jdbc:mysql://localhost:3306/empdept
--table emp
--username root
--password cloudera
--check-column empno
--incremental append
--last-value 7521
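
To confirm the two new rows were appended, you can list and read the part files in the table's HDFS directory (the path below assumes the default location used by the earlier import):

hdfs dfs -ls /user/cloudera/emp
hdfs dfs -cat /user/cloudera/emp/part-m-*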

13) Import and append data from MySQL to HDFS

Consider the same example specified above.

As per the data, the last entry in the table has a HireDate of 1981-09-28. Now suppose we update one or two records, as shown below.

The records that are updated in the source table also need to be updated in the destination (HDFS). The command to do that -

sqoop import
--connect jdbc:mysql://localhost:3306/empdept
--table emp
--username root
--password cloudera   
--check-column HireDate
--incremental lastmodified
--last-value 1981-09-28
--target-dir /user/cloudera/empdept
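
As the documentation quoted above notes, the preferred way to run a recurring incremental import is to save it as a Sqoop job so that --last-value is tracked automatically. A minimal sketch (the job name is arbitrary; -P makes Sqoop prompt for the password at run time):

sqoop job --create emp_lastmodified -- import
--connect jdbc:mysql://localhost:3306/empdept
--table emp
--username root
-P
--check-column HireDate
--incremental lastmodified
--target-dir /user/cloudera/empdept

sqoop job --exec emp_lastmodified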


Please click here for the next part


Tuesday, August 25, 2020

Sqoop Complete Tutorial Part-3

This is a continuation of the "Sqoop Complete Tutorial" series. If you want to read the previous parts -

Part-1, please click here


Part-2, please click here


5) Importing all the tables but excluding a few tables from MySQL to HDFS 

sqoop import-all-tables
--connect jdbc:mysql://localhost/retail_db
--username root
--password cloudera
--exclude-tables "order_items" 

Note: With import-all-tables there is no way to import only a few selected tables to HDFS; the only option available is --exclude-tables. If there is more than one table to exclude, mention them in the same property, separated by commas (,), as shown below.

sqoop import-all-tables
--connect jdbc:mysql://localhost/retail_db
--username root
--password cloudera
--exclude-tables "order_items,orders" 


6) Import part of the data from a MySQL table to HDFS using a WHERE condition. 

sqoop import
--connect jdbc:mysql://localhost/EmpDB
--username root
--table emp
--where "ename='scott'"
-m 1


7) Importing a few columns from a table from MySQL database to HDFS. 

sqoop import
--connect jdbc:mysql://localhost/EmpDB
--table EMP  
--columns "EmpID,Sal”
-m 1


8) Importing and compressing the table's data and storing it in HDFS. 

sqoop import
--connect jdbc:mysql://localhost/EmpDB
--username root 
--table emp
--target-dir empdata/emp
--compress -m 1
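
To see the effect of --compress, list the target directory; with the default codec the part files should carry a .gz extension:

hdfs dfs -ls empdata/emp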


9) Importing data from MySQL to HDFS with a specific delimiter.

sqoop import
--connect jdbc:mysql://localhost/EmpDB
--table EMP
--columns "EmpID,Sal”
--fields-terminated-by '@'
-m 1

Note: This will throw an error if you do not provide the username and password to connect to MySQL. Once the data is imported, read the contents of the file either using the Hue file browser or from the Hadoop terminal using the usual "-cat" command. You'll find the '@' delimiter between the columns.
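
A hedged sketch of that check (since no --target-dir was given, the file is assumed to land under your HDFS home directory in a folder named after the table):

hdfs dfs -cat EMP/part-m-00000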


10) Import data by splitting it based on a specific column.

sqoop import
--connect jdbc:mysql://localhost/EmpDB
--table EMPDEPT
--username root
-P
--split-by deptno

Note: This is one of the most important options when we deal with a huge volume of data. If we import such data with too few mappers or splits, the job can take a long time to complete and performance suffers. Hadoop is meant for parallelism, and to make use of it we use --split-by on a column to increase the number of splits, which means more parallel processes. 

In the above example, I have employee and department data in my database, which I combined into a single table named EmpDept. Say there are 10 departments with 50 employees in each. Once I execute the above command, given my data it generates 10 splits based on "deptno" and runs 10 import tasks simultaneously. This is the way to improve import performance and achieve better parallelism.


Sqoop Complete Tutorial Part-2

This is a continuation of the "Sqoop Complete Tutorial" series. If you want to read Part-1, please click here. We are about to discuss the following in the current and upcoming parts.

Importing data from MySQL to HDFS

  • Importing a table from MySQL to the default location of the HDFS
  • Importing a table from MySQL to user-specified location in HDFS
  • Importing all the tables from MySQL to the default location of the HDFS
  • Importing all the tables from MySQL to user-specified location in HDFS.
  • Importing all the tables but excluding a few tables from MySQL to HDFS.
  • Importing part of the data from a table from MySQL to HDFS using WHERE condition.
  • Importing a few columns from a table from MySQL to HDFS
  • Importing and compressing the table's data to HDFS
  • Importing data by specifying the column delimiter.
  • Importing data by splitting the data based on a specific column.
  • Import the data from MySQL to HDFS based on a SQL Query.
  • Incrementally import data from MySQL into HDFS.
  • Import and append the data.
  • Import data from MySQL and store in a sequential file format in HDFS.
  • Read the contents of the source database without importing it.
  • Get the list of the databases.
The format of the import process is the same; however, some additional keywords will be used if the target is not HDFS.

Importing data from MySQL to Hive
  • Importing a table from MySQL to Hive's default database.
  • Importing a table from MySQL to Hive's user's database.
  • Importing all the tables from MySQL to Hive
  • Importing all the tables but excluding a few from MySQL to Hive
  • Import the table's structure from MySQL to Hive without data.
  • Import the table from MySQL to Hive with the specific delimiter.
Sqoop Automation
  • Sqoop - Creating a job
  • Sqoop - Displaying the list of the jobs
  • Sqoop - Displaying the properties of a job
  • Sqoop - Executing the job
  • Sqoop - Deleting the job

Importing data from MySQL to HBase
  • Importing a table from MySQL to HBase
  • Importing a table with some selected columns from MySQL to HBase

Ok, let us begin the exercises with the above-mentioned points one by one.

Importing data from MySQL to HDFS


1) Importing a table from MySQL to HDFS default location.

sqoop import 
        --connect jdbc:mysql://localhost/retail_db 
        --table categories 
        --username root 
        --password cloudera 
        -m 1


Explanation: "Sqoop Import" is the keyword to be used while import. Sqoop is a JDBC utility to move data between Hadoop Ecosystems and relational databases. Username and password must be provided to connect to the RDBMS. 

Assigning mappers is the user's choice. Users can assign a number of mappers based on the size of the data that is being imported. In case the password not to be disclosed in the connection string, the user can simply pass -P argument without providing any password. During the run-time, it will prompt you to enter the password. Refer to below -

Importing a table from MySQL to HDFS default location - Hiding password.

sqoop import 
        --connect jdbc:mysql://localhost/retail_db 
        --table categories 
        --username root 
        -P
        -m 1

The above command will create a directory named “categories” under "/user/cloudera", and the table data will be stored in it as a file (part-m-00000).
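
You can read that file back directly from the terminal:

hdfs dfs -cat /user/cloudera/categories/part-m-00000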

2) Importing a table from MySQL to HDFS user-specified target location.

sqoop import
--connect jdbc:mysql://localhost/retail_db
--table categories
--username root
--password cloudera
--target-dir 'retail_db'
-m 1

3) Importing all the tables from MySQL to HDFS default location

sqoop import-all-tables
--connect jdbc:mysql://localhost/retail_db
--username root
--password cloudera

Note: All the tables will be placed as files under the user's HDFS home directory, one sub-directory per table.

4) Importing all the tables from MySQL to HDFS user-specified target location.

sqoop import-all-tables
--connect jdbc:mysql://localhost/retail_db
--username root
--password cloudera
--warehouse-dir 'retaildb'

Note: All the tables will be placed in the specified directory, one sub-directory per table. The keyword "target-dir" will not work with import-all-tables. No mapper count is assigned here since a single mapper would be insufficient to import the "retail_db" database, as its volume is larger than the HDFS block size.

 

Please click here for the next part.


Sqoop Complete Tutorial Part-1

Sqoop (SQL-to-Hadoop) is a Hadoop ecosystem component and an ETL tool that offers the capability to extract data from various structured data stores, such as relational databases, with the help of MapReduce. This command-line interpreter works efficiently to transfer huge volumes of data to the Hadoop file system (HDFS, HBase, and Hive). Similarly, it exports data from Hadoop back to SQL databases.

Sqoop offers parallel processing as well as fault tolerance along with the automation functionality that helps in scheduling the data transfer jobs.


Sqoop Import

Imports individual tables or all the tables from an RDBMS to HDFS, Hive, or HBase. Regardless of the target location, the table data is stored as text files by default.

In HDFS, each row of the table is treated as a record, each table is treated as a sub-directory, and the table data is stored as text files. Users can opt to store the table data in text format, as binary data, in Avro format, in a row-columnar file format, or in sequence file format. Users can also choose whether the data should be compressed.
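
For instance, a minimal sketch of selecting one of these storage formats during import (the table and target directory are only placeholders; --as-sequencefile is the standard Sqoop option for sequence files):

sqoop import
--connect jdbc:mysql://localhost/retail_db
--table categories
--username root
-P
--as-sequencefile
--target-dir categories_seq
-m 1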

In Hive, the target database must be created before importing all the tables from the RDBMS. If not, the tables will be imported into Hive’s ‘default’ database.

Sqoop Import will take care of the table’s creation if the table does not exist in Hive metastore.
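
For example, creating the target Hive database up front is a one-liner (the database name here is simply the one used later in this series):

hive -e "CREATE DATABASE IF NOT EXISTS dbtest;"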

Sqoop Export

Exports the files from Hadoop distributed file system to RDBMS. Sqoop will parse the contents of the files and convert them into rows and tables. Each line will be split into multiple columns based on the delimiter specified.
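
A minimal export sketch, assuming delimited files like those produced by the earlier import examples (table, directory, and delimiter are placeholders; the target table must already exist in MySQL):

sqoop export
--connect jdbc:mysql://localhost/empdept
--table emp
--username root
-P
--export-dir /user/cloudera/empdept
--input-fields-terminated-by ','
-m 1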

Before we start working with Sqoop, please note the following:

Sqoop is a command-line interface, and the regular Hadoop/Linux terminal is used to issue Sqoop commands. There is hardly any need for a GUI since the scope of Sqoop is limited to importing and exporting data. Please also note that Sqoop is case-sensitive: one has to be very careful with the names of tables, columns, directories, sub-directories, and source or target locations, along with the Sqoop reserved keywords.





Big Data & SQL

Hi Everybody, Please do visit my new blog that has much more information about Big Data and SQL. The site covers big data and almost all the...