Monday, September 28, 2020

Hive Internal vs External Tables

This article offers a summary of the situations in which 
you would create internal (managed) tables versus external tables in Apache Hive.


Create "External" tables when:

  • the data is also used outside Hive. For instance, the data files are read and interpreted by an existing program that does not lock the files.

  • the data needs to stay in its underlying location even after a DROP TABLE. In other words, the data files stay on HDFS even if you drop the external table. The table's metadata lives in the Hive metastore, and dropping an external table from Hive deletes only that metadata, not the data files.

  • you want the data kept at a custom location that you choose (the LOCATION clause) rather than in Hive's warehouse directory.

  • Hive doesn't own the data.

  • you are not creating the table from an existing table (CREATE TABLE ... AS SELECT).

  • you are okay with the fact that external table files are accessible to anyone who has access to the underlying HDFS directory. Security needs to be handled at the HDFS folder level.
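
To make the points above concrete, here is a minimal HiveQL sketch of an external table. The table name, columns, delimiter, and HDFS path are all hypothetical, chosen only for illustration; adjust them to match your own files.

```sql
-- Hypothetical table, columns, and path (assumptions for illustration).
-- The files under /data/raw/web_logs are assumed to exist already
-- and to be comma-delimited text.
CREATE EXTERNAL TABLE IF NOT EXISTS web_logs_ext (
  ip   STRING,
  ts   STRING,
  url  STRING
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/raw/web_logs';

-- Query the files in place; Hive does not move or copy them.
SELECT COUNT(*) FROM web_logs_ext;

-- Removes only the table definition from the metastore.
-- The files under /data/raw/web_logs are expected to stay where they are.
DROP TABLE web_logs_ext;
```

After the DROP TABLE, listing the directory (for example with hdfs dfs -ls /data/raw/web_logs) should still show the data files, so other programs can keep reading them.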


Create "Internal" tables when:

  • the data is temporary.

  • you want Hive to completely manage the lifecycle of the table and data.

  • you want the data stored inside Hive's warehouse directory, with the metadata kept in the Hive metastore.

  • you are okay with the fact that dropping the table deletes both the metadata in the metastore and the actual data in HDFS.

  • you want the security of the data to be controlled solely via Hive.
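
As a rough sketch of the managed case, the HiveQL below creates an internal table without a LOCATION clause, checks where Hive put it, and then drops it. The table name and columns are hypothetical, and the exact warehouse path depends on how your cluster is configured.

```sql
-- Hypothetical table and columns (assumptions for illustration).
-- No EXTERNAL keyword and no LOCATION: Hive picks a directory
-- under its configured warehouse path and owns the data there.
CREATE TABLE IF NOT EXISTS staging_orders (
  order_id  BIGINT,
  amount    DOUBLE
)
STORED AS ORC;

-- The output includes "Table Type" (MANAGED_TABLE for an internal table)
-- and "Location" (the directory Hive chose for the data).
DESCRIBE FORMATTED staging_orders;

-- Removes the table definition AND the data directory.
-- If HDFS trash is enabled, the files may first be moved to the trash.
DROP TABLE staging_orders;
```

Running DESCRIBE FORMATTED before dropping any table is a cheap way to confirm whether it is managed or external, and therefore whether the drop will take the data with it.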

Conclusion:

With "Internal" tables, the table is typically created first and the data is loaded later.

With "External" tables, the data is typically already present in HDFS and the table is created on top of it.
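
The two workflows can be sketched side by side in HiveQL. All table names, columns, and paths here are hypothetical and serve only to illustrate the order of operations.

```sql
-- Internal (managed): create the table first, load the data afterwards.
CREATE TABLE IF NOT EXISTS orders_managed (
  order_id  BIGINT,
  amount    DOUBLE
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- LOAD DATA INPATH (without LOCAL) moves the HDFS file into the
-- table's own directory; the hypothetical source path is emptied.
LOAD DATA INPATH '/tmp/incoming/orders.csv' INTO TABLE orders_managed;

-- External: the files are assumed to be in /data/raw/orders already,
-- and the table is simply declared on top of that directory.
CREATE EXTERNAL TABLE IF NOT EXISTS orders_ext (
  order_id  BIGINT,
  amount    DOUBLE
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/raw/orders';
```

Note that the load step relocates the file rather than copying it, which is one more reason to prefer an external table when other programs still expect the data at its original path.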


