Tuesday, August 25, 2020

Sqoop Complete Tutorial Part-3

This is a continuation of the "Sqoop Complete Tutorial" series. If you want to read the earlier parts -

Part-1, please click here


Part-2, please click here


5) Importing all the tables from MySQL to HDFS, excluding a few tables

sqoop import-all-tables \
--connect jdbc:mysql://localhost/retail_db \
--username root \
--password cloudera \
--exclude-tables "order_items"

Note: With import-all-tables there is no way to import only a few selected tables to HDFS; the only filter available is --exclude-tables. If there is more than one table to exclude, mention them all in the same option, separated by commas (,), as shown below.

sqoop import-all-tables \
--connect jdbc:mysql://localhost/retail_db \
--username root \
--password cloudera \
--exclude-tables "order_items,orders"


6) Importing part of the data from a MySQL table to HDFS using a WHERE condition.

sqoop import \
--connect jdbc:mysql://localhost/EmpDB \
--username root \
--table emp \
--where "ename='scott'" \
-m 1
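The value given to --where is appended to the generated query as-is, so more complex filters work the same way. A quick sketch, assuming the emp table also has sal and deptno columns (that schema is an assumption here):

sqoop import \
--connect jdbc:mysql://localhost/EmpDB \
--username root \
--table emp \
--where "sal > 2000 AND deptno = 10" \
-m 1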


7) Importing a few columns of a table from a MySQL database to HDFS.

sqoop import \
--connect jdbc:mysql://localhost/EmpDB \
--table EMP \
--columns "EmpID,Sal" \
-m 1
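If you need more control than a plain column list, Sqoop also supports free-form queries through --query; the query must contain the $CONDITIONS token and a --target-dir must be supplied. A minimal sketch (the target directory is only an example):

sqoop import \
--connect jdbc:mysql://localhost/EmpDB \
--username root \
-P \
--query "SELECT EmpID, Sal FROM EMP WHERE \$CONDITIONS" \
--target-dir empdata/emp_cols \
-m 1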


8) Importing and compressing a table's data and storing it in HDFS.

sqoop import \
--connect jdbc:mysql://localhost/EmpDB \
--username root \
--table emp \
--target-dir empdata/emp \
--compress \
-m 1
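--compress uses gzip by default, so the files under empdata/emp will end with .gz. If a different codec is preferred, --compression-codec lets you name one; a sketch using Snappy (assuming the Snappy libraries are available on the cluster, and the target directory is only an example):

sqoop import \
--connect jdbc:mysql://localhost/EmpDB \
--username root \
--table emp \
--target-dir empdata/emp_snappy \
--compress \
--compression-codec org.apache.hadoop.io.compress.SnappyCodec \
-m 1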


9) Importing data from MySQL to HDFS with a specific delimiter.

sqoop import \
--connect jdbc:mysql://localhost/EmpDB \
--table EMP \
--columns "EmpID,Sal" \
--fields-terminated-by '@' \
-m 1

Note: This will throw an error if you do not provide the username and password to connect to MySQL. Once the data is imported, read the contents of the file either through the Hue file browser or from the Hadoop terminal with "-cat", the Hadoop counterpart of the usual Linux command, as sketched below. You'll find '@' as the delimiter between the columns.
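A minimal sketch of that check from the terminal, assuming the files landed under the default directory for the EMP table (the exact path depends on your user and any --target-dir you set):

hadoop fs -ls EMP
hadoop fs -cat EMP/part-m-00000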


10) Import data by splitting it based on a specific column.

sqoop import \
--connect jdbc:mysql://localhost/EmpDB \
--table EMPDEPT \
--username root \
-P \
--split-by deptno

Note: This is one of the most important options when dealing with a huge volume of data. If such data is imported with too few mappers (and therefore too few splits), the import can take a long time to complete and performance suffers. Hadoop is built for parallelism, and to take advantage of it we use --split-by on a column so that Sqoop divides the work into more splits, which means more import tasks running in parallel.

In the above example, I combined the employee and department data in my database into a single table, EmpDept. Say there are 10 departments with 50 employees in each. When the above command runs, Sqoop determines the range of "deptno" values and divides the import into one split per mapper; with the mapper count raised to 10 (see the sketch below), the 10 department ranges are imported by 10 simultaneous tasks. This is how --split-by is used to improve import performance through parallelism.
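A sketch of the same import with the mapper count made explicit (ten mappers is a choice for this data set, not a requirement). Behind the scenes, Sqoop first runs a boundary query such as SELECT MIN(deptno), MAX(deptno) FROM EMPDEPT and then divides that range into one split per mapper:

sqoop import \
--connect jdbc:mysql://localhost/EmpDB \
--username root \
-P \
--table EMPDEPT \
--split-by deptno \
--num-mappers 10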

