Showing posts with label Hive External Tables. Show all posts
Showing posts with label Hive External Tables. Show all posts

Monday, September 28, 2020

Hive Internal vs External Tables

This article offers summary of the situations in which 
you would need to create internal (managed) tables and external tables in Apache Hive.


Create "External" tables when:

  • the data is being used outside the Hive. The data files are read and interpreted by an existing program that does not lock the files, for instance.

  • data needs to stay in the underlying position even after a DROP TABLE. In other words, the data file always stays on the HDFS server even if you delete an external table. This also means that Metadata is maintained on the master node, and deleting an external table from HIVE only deletes the metadata not the data/file.

  • you choose a custom place to be used, use external tables.

  • Hive doesn't own the data.

  • you are not creating a table based on an existing table (AS SELECT), use external tables.

  • you are okay with the fact that External table files are accessible to anyone who has access to HDFS. Security needs to be handled at the HDFS folder level.


Create "Internal" tables when:

  • the data is temporary.

  • you want Hive to completely manage the lifecycle of the table and data.

  • you want the data and metadata to be stored inside Hive's warehouse.

  • You are okay with the fact that table deletion would also erase the master-node and HDFS metadata and actual data, respectively.

  • you want the security of the data to be controlled solely via HIVE. 

Conclusion:

In "Internal" tables, the table is created first and data is loaded later.

In "External" tables, the data is already present in HDFS and the table is created on top of it.



Friday, August 14, 2020

Hive Internal Table - With External Data

Have you ever wonder what will happen if you miss the "external" keyword while creating an external table?

Let's check it out.

Here is my sample data. It has three columns namely dno, dname, location.

11, marketing, hyd
12, hr, delhi
13, finance, bang
14, retail, madras
20, db, hyd
21, mg, madras

This file has been placed in /user/cloudera/NewEmpDir as dept.txt

Let us create the external table on top of this data without using "external" keyword.

 

CREATE TABLE newDept

(Dno INT, Dname String, loc String)

ROW FORMAT DELIMITED

FIELDS TERMINATED BY ','

LOCATION '/user/cloudera/NewEmpDir'; 

Surprisingly, the table will be created and the data will also be loaded without any errors.


This is more like an internal table but the data is not stored in Hive. The following things to be observed.

  • Unlike other tables, this table cannot be visible when you browse the files through Hue, though it is visible in Metastore Manager.
  • You can query and retrieve the data from the table.
  • If you drop this table, the source file and directory will also be deleted.


Big Data & SQL

Hi Everybody, Please do visit my new blog that has much more information about Big Data and SQL. The site covers big data and almost all the...