Hi Everybody,
Please do visit my new blog that has much more information about Big Data and SQL.
The site covers big data and almost all the RDBMS platforms such as SQL Server, MySQL, Oracle SQL*Plus, PostgreSQL, etc.
Returns the input string with each character in the from argument replaced with the corresponding character in the to argument. The characters are matched in the order they appear in from and to.
This article will help you to learn how to implement chaining common table expressions in Cloudera's Impala.
-- Chaining CTEs
WITH
Temp1 AS (
    SELECT OrderID,
           OrderStatus,
           ProductID
    FROM Temp2
    WHERE OrderStatus = 'completed'
),
Temp2 AS (
    SELECT OrderID,
           OrderStatus,
           ProductID
    FROM Orders
    WHERE ProductID = 5
)
SELECT * FROM (SELECT COUNT(1) FROM Temp1) a;
Technically, the query doesn't make much sense; it can be written in a more straightforward and simpler way:
SELECT COUNT(1)
FROM Orders
WHERE ProductID = 5
  AND OrderStatus = 'completed';
However, observe the structure, or the pattern: in order to execute the first common table expression, it needs the values from the second expression.
Often, requests for reports seem simple and straightforward, but it can be challenging to find the best solution to apply. Cloudera's Impala provides several features, and to accomplish such tasks in the best possible way, an analyst needs to be aware of them.
Assume that, after filtering out specific values, there are two separate result sets that need to be combined into a single output with distinct values. How would you do that?
Let's see with an example.
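One possible approach (a minimal sketch using the Orders table from the example above with made-up filters, not the original post's solution) is to combine the two filtered queries with UNION, which, unlike UNION ALL, returns only distinct rows:
-- Combine two filtered result sets into a single, de-duplicated output
SELECT OrderID, ProductID FROM Orders WHERE OrderStatus = 'completed'
UNION
SELECT OrderID, ProductID FROM Orders WHERE ProductID = 5;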
The following commands will help start the Hive and Hive metastore services in case they didn't start automatically, without logging into Cloudera Manager.
sudo service hive-metastore start
sudo service hive-server2 start
The split function splits the data based on the delimiter provided and is one of the most commonly used functions in Apache Hive. This function is not available in Impala; however, there is an alternative to it.
Let us first see the usage of the "split" function in Hive.
Below are a patient's blood pressure readings.
TableName: PatientsData
Systolic-Diastolic
122/80, 122/83, 130/83, 130/83, 135/86, 140/95, 147/92
SELECT split(data,'\/') as split_data from PatientsData;
Result:
split_data
122,80
122,83
130,83
130,83
135,86
140,95
147,92
SELECT split(data,'\/')[0] AS Systolic,
split(data,'\/')[1] AS Diastolic
FROM PatientsData;
Result:
Systolic Diastolic
122 80
122 83
130 83
130 83
135 86
140 95
147 92
Let's do the same exercise in Impala using the "split_part" function.
-- split_part matches the delimiter literally, so the plain '/' delimiter is used (no regex escaping needed)
SELECT split_part(data, '/', 1) AS Systolic,
       split_part(data, '/', 2) AS Diastolic
FROM PatientsData;
Result:
Systolic Diastolic
122 80
122 83
130 83
130 83
135 86
140 95
147 92
As per the documentation from Apache, below is the description of the function.
SPLIT_PART(STRING source, STRING delimiter, BIGINT index)
Purpose: Returns the requested indexth part of the input source string split by the delimiter.
If the index is a positive number, returns the indexth part from the left within the source string. If the index is a negative number, returns the indexth part from the right within the source string. If the index is 0, returns an error.
The delimiter can consist of multiple characters, not just a single character.
All matching of the delimiter is done exactly, not using any regular expression patterns.
Return type: STRING
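As a small illustration of the negative index and multi-character delimiter behaviour described above (a sketch with made-up literals, not taken from the original post):
SELECT split_part('2021-06-15', '-', -1);    -- returns '15', the first part from the right
SELECT split_part('abc::def::ghi', '::', 2); -- returns 'def', using a two-character delimiter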
As we discussed earlier, HiveQL handles structured data only, much like SQL. This doesn't mean that Hive manages only structured data; it also processes and transforms unstructured data into a readable, structured form.
Unstructured data is usually not dependent on static data or other data files. But with structured data, especially when it is imported from relational systems, the tables may be connected to other tables due to normalization, and meaningful information for a few columns can only be obtained through those relationships. Hence SQL functionality such as joins, subqueries, casting, and conversion functions is also needed on big data platforms.
This article focuses on the joins that are available in Hive.
Below are the types of joins available in Hive.
I am not specifying "self-join" explicitly because it is also an inner join.
INNER JOIN is the join type that combines two tables to return records that have matching values from both the tables.
LEFT OUTER JOIN returns all records from the left table and the matched records from the right table.
RIGHT OUTER JOIN is a join that returns all records from the right table, and the matched records from the left table.
FULL OUTER JOIN displays all the records, i.e., matched and unmatched records, from both tables.
Not just the description, but the implementation too is the same as in SQL.
Please do let me know if you want to see how it is to be implemented in Hive.
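In the meantime, as a quick illustration (a sketch that reuses the Emp and Dept tables appearing later in this post), the HiveQL syntax mirrors SQL:
-- Inner join: only employees whose department exists in Dept
SELECT e.Ename, d.DName
FROM Emp e
INNER JOIN Dept d ON e.DeptNo = d.DeptNo;
-- Left outer join: all employees, with NULL DName where there is no matching department
SELECT e.Ename, d.DName
FROM Emp e
LEFT OUTER JOIN Dept d ON e.DeptNo = d.DeptNo;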
Can we create a table based on a view in Hive?
Yes, we can!
Let's create a view by combining the Emp and Dept tables.
-- To create a view
CREATE VIEW Emp_View
AS
SELECT Ename, DName FROM Emp
INNER JOIN Dept ON Emp.DeptNo = Dept.DeptNo;
Now let's create the table based on the view.
CREATE TABLE EmpDept
AS
SELECT * FROM Emp_View;
In this article, you'll learn how to create a view and table based on a select statement.
-- To create a view
CREATE VIEW Emp_View
AS
SELECT Ename, DName FROM Emp
INNER JOIN Dept ON Emp.DeptNo = Dept.DeptNo;
-- To create a table
CREATE TABLE EmpDept
AS
SELECT Ename, DName FROM Emp
INNER JOIN Dept ON Emp.DeptNo = Dept.DeptNo;
Can we create a table based on a view? Click here.
DATABASE LEVEL:
In Hive, database and schema mean the same thing; the two words can be used interchangeably.
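For example (a minimal sketch with a hypothetical database name), the two statements below are equivalent in Hive:
-- Both statements create the same object; DATABASE and SCHEMA are interchangeable
CREATE DATABASE IF NOT EXISTS sales_db;
CREATE SCHEMA IF NOT EXISTS sales_db;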
The TBLPROPERTIES clause enables you to use your own metadata key/value pairs to tag the table definition.
There are also several predefined table properties, such as last-modified-user and last-modified-time, which Hive automatically adds and manages.
To view the properties of a table, use the below command at the Hive prompt.
SHOW TBLPROPERTIES tblname;
This lists all the properties of the table.
If the table's input format is ORC (refer to the input file formats), then you'll see which compression (Snappy or ZLIB) has been opted for. You'll see whether the transactional property is set to true or false. You'll also see the predefined table properties that are managed by Hive.
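For example (a sketch with a hypothetical table name; 'orc.compress' is a standard ORC table property, while 'created.by' is a custom key/value pair):
-- Tag the table definition with custom and ORC-specific properties
CREATE TABLE sales_orc (id INT, amount DOUBLE)
STORED AS ORC
TBLPROPERTIES ('orc.compress' = 'SNAPPY', 'created.by' = 'data-team');
-- List all properties, including the predefined ones Hive manages automatically
SHOW TBLPROPERTIES sales_orc;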
This article aims to explain the usage of the SPLIT function in HiveQL. If you are looking for a similar function in SQL Server, then please click here.
Let's create a staging table to load the data temporarily.
CREATE TABLE tempData (col1 STRING);
Load the data to the table.
LOAD DATA LOCAL INPATH 'Desktop/DataFile' OVERWRITE INTO TABLE tempData;
To split the data from the above-created temp table
SELECT word, count(1) AS count FROM
(SELECT explode(split(col1, '\\s')) AS word FROM tempData) temp
GROUP BY word
ORDER BY word;
The split function splits the data based on the delimiter provided and returns an array. The explode function then turns each element of that array into a separate row. Let's see what these explode and split functions do with another example.
Below are a patient's blood pressure readings.
TableName: PatientsData
Systolic-Diastolic
122/80, 122/83, 130/83, 130/83, 135/86, 140/95, 147/92
SELECT split(data,'\/') as split_data from PatientsData;
Result:
split_data
122,80
122,83
130,83
130,83
135,86
140,95
147,92
SELECT split(data,'\/')[0] AS Systolic, split(data,'\/')[1] AS Diastolic from PatientsData;
Result:
Systolic Diastolic
122 80
122 83
130 83
130 83
135 86
140 95
147 92
SELECT explode(split(data,'\/')) as exploded_data from PatientsData;
Result:
exploded_data
122
80
122
83
130
83
130
83
135
86
140
95
147
92
The purpose of this article is to address the different file formats and compression codecs in Apache Hive that are available for different data sets. We will also explore how to use them properly and when to use them.
HiveQL handles structured data only, much like SQL. To store its metadata, Hive uses a Derby database by default, while the table data itself is stored as files in the backend file system and is presented in a structured format when retrieved. Hive can handle some special file formats, such as TEXTFILE, SEQUENCEFILE, RCFILE, ORC, AVRO, and PARQUET.
Commands are non-SQL statements, such as setting a property or adding a resource. They can be used in HiveQL scripts or directly in the CLI or Beeline. The commands below are among the most commonly used and are helpful while working with partitions, adding external jar files, and changing configuration settings.
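A few representative examples (a sketch; the property values and jar path are placeholders, not from the original post):
-- Change a configuration setting for the current session
SET hive.exec.dynamic.partition.mode=nonstrict;
-- Show the current value of a property
SET hive.exec.dynamic.partition.mode;
-- Add an external jar file as a resource
ADD JAR /tmp/custom-udf.jar;
-- List the resources added to the session
LIST JARS;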
This article will explore some Hadoop basic commands that help in our day-to-day activities.
Hadoop file system shell commands are organized in a similar way to Unix/Linux environments. For people who work with the Unix shell, it is easy to pick up the Hadoop shell commands. Such commands communicate with HDFS and other file systems that Hadoop supports.
1) List out the contents of a directory.
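For example, to list the contents of the cloudera user's HDFS home directory (path assumed for illustration):
hadoop fs -ls /user/cloudera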
2) Create or delete a directory
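For example (hypothetical directory name):
hadoop fs -mkdir /user/cloudera/empdir
hadoop fs -rm -r /user/cloudera/empdir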
To check the status of safemode
hadoop dfsadmin -safemode get
To change the safemode to ON
hadoop dfsadmin -safemode enter
To change the safemode to OFF, or to leave the safemode
hadoop dfsadmin -safemode leave
3) Copy the file from the local file system to HDFS
hadoop fs -put <sourcefilepath> <destinationfilepath>
Examples:
hadoop fs -put Desktop/Documents/emp.txt /user/cloudera/empdir
hadoop fs -copyFromLocal Desktop/Documents/emp.txt /user/cloudera/emp.txt
To know more about "copyFromLocal", "put" "copyToLocal" and "get", please click here.
4) Read the file
hadoop fs -cat /user/cloudera/emp.txt
The above command helps in reading the file; however, avoid using it for large files since it can impact I/O. This command is good for files with small amounts of data.
5) Copy the file from HDFS to the local system
This is the reverse scenario of Put and CopyFromLocal. For more information click here.
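For example (a sketch reusing the emp.txt file and paths from the earlier examples):
hadoop fs -get /user/cloudera/emp.txt Desktop/Documents
hadoop fs -copyToLocal /user/cloudera/emp.txt Desktop/Documents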
6) Move the file from one HDFS location to another (HDFS location)
hadoop fs -mv emp.txt testDir
hadoop fs -mv testDir testDir2
hadoop fs -mv testDir2/testDir /user/cloudera
7) Admin Commands
To view the config settings, browse the file system to etc --> hadoop --> conf and open hdfs-site.xml.
To change the default configuration values such as dfs.replication or dfs.blocksize in hdfs-site.xml, open the file with sudo, for example:
sudo vi /etc/hadoop/conf/hdfs-site.xml
Press "i" to switch to insert mode, i.e. to bring the file into edit mode.
Modify the values as per your requirement.
To save and exit, use :wq!
hadoop fs -tail [-f] <file>
The Hadoop fs shell tail command shows the last 1KB of a file on console or stdout.
HDFS is a distributed file system designed to run on top of the local file system. Many times we may need to copy files from different sources, i.e. from the internet, a remote network, or the local file system. The "CopyFromLocal" and "Put" commands help us perform this task. While copying a file from the local file system to HDFS, if the file already exists in the destination, the execution will fail and we will receive a 'file exists' error.
Let's assume the file "emp.txt" already exists in the path /user/cloudera.
hadoop fs -put Desktop/emp.txt /user/cloudera/emp.txt
hadoop fs -copyFromLocal Desktop/emp.txt /user/cloudera/emp.txt
hadoop fs -copyFromLocal -f Desktop/Documents/emp.txt /user/cloudera/emp.txt
The first two commands fail because the file already exists in the destination. The third command succeeds; the file is copied to the destination without any errors.
The usage of the "-f" option with -copyFromLocal will overwrite the destination if it already exists.
hadoop fs -copyFromLocal <Local system directory path> <HDFS file path>
A choice exists to overwrite an existing file using -f when using copyFromLocal. However, an error is returned if the file already exists when "put" is executed.
In short, anything you do with copyFromLocal, you can do with "put", but not vice-versa.
CopyToLocal and Get:
These two commands are just opposite to "CopyFromLocal" and "Put".
The destination is restricted to a local file reference when we use copyToLocal. While using "Get" there are no such restrictions.
Anything you do with copyToLocal, you can do with "get" but not vice-versa.
hadoop fs -get <HDFS file path> <Local system directory path>
hadoop fs -copyToLocal <HDFS file path> <Local system directory path>
For complete HDFS commands please click here. For complete Hive DDL commands please click here.
The NVL() function enables you to substitute null for a more relevant alternative in the query results. This function accepts two arguments. If the first argument is null, then it returns the second argument. If the first argument is not null, it returns the first one and will ignore the second argument. This function is available in Oracle SQL*Plus, but not in MySQL, SQL Server, and Hive.
However, as an alternative, ISNULL() and COALESCE() functions can be used to achieve the same result in MySQL and SQL Server. Since ISNULL() is not available in Hive, COALESCE() function is the only option to achieve the desired output.
The difference between NVL() and COALESCE() is that COALESCE() will return the first non-null value from the list of expressions while NVL() only takes two parameters and returns the first if it is not null, otherwise, returns the second.
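To make the difference concrete (a sketch with hypothetical phone-number columns, not from the original post):
-- Returns the first non-null value among the three columns, or 'N/A' if all are null
SELECT COALESCE(mobile_phone, home_phone, office_phone, 'N/A') AS contact_number
FROM Employees;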
Let's see what these three functions will do.
Oracle:
SELECT first_name + middle_name + last_name As EmpName
FROM Employees;
Result:
EmpName
---------------------------
Robert Finn Hill
Bruce M. Wills
Maria Andrew Brown
NULL
The last row is NULL because the employee has no middle name; NULL is returned when the null value is concatenated with first_name and last_name. This is where we use the NVL() function.
SELECT first_name + NVL(middle_name, ' ') + last_name As EmpName
FROM Employees;
Result:
EmpName
----------------------------
Robert Finn Hill
Bruce M. Wills
Maria Andrew Brown
Ashley Miller
SQL Server:
SELECT first_name + ISNULL(middle_name,'') + last_name As EmpName
FROM Employees;
SELECT first_name + COALESCE(middle_name,'') + last_name As EmpName
FROM Employees;
Hive:
-- Hive concatenates strings with concat(); the '+' operator attempts numeric addition
SELECT concat(first_name, COALESCE(middle_name, ''), last_name) AS EmpName
FROM Employees;
Create "Internal" tables when:
Timestamps were introduced in Hive 0.8.0. Hive supports the traditional UNIX timestamp with optional nanosecond precision.
The supported timestamp format is yyyy-mm-dd hh:mm:ss[.f...].
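For example (a sketch with a made-up literal), a string in this format can be cast to a TIMESTAMP directly:
-- Casts a formatted string, including fractional seconds, to a timestamp
SELECT CAST('2021-06-15 10:30:45.123456789' AS TIMESTAMP);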