Apache Hive Class: Tasting The Honey

A quick summary of all the cool things about Hive.

Hadoop is known for its MapReduce engine, which parallelizes data-processing operations using HDFS as its native file storage system. But as we know, MapReduce does not provide user-friendly libraries or interfaces for dealing with the data: an operation like a left outer join would need around 200–300 lines of Java MapReduce code, whereas in SQL it is just a couple of lines. On top of that, there is a very tight dependency on Java for using the MapReduce framework. This is why Hive came to the rescue!
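
To make that gap concrete, here is roughly what such a join looks like in HiveQL. This is a minimal sketch; the customers and orders tables and their columns are made up for illustration:

  -- Hypothetical tables: customers and orders
  SELECT c.customer_id,
         c.name,
         o.order_id,
         o.total
  FROM customers c
  LEFT OUTER JOIN orders o
    ON c.customer_id = o.customer_id;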

What is Hive?

Hive is a data warehouse system built on top of Hadoop. It lets users read, write, and manage large datasets stored in HDFS using a SQL-like language called HiveQL, which Hive translates under the hood into MapReduce jobs (or, in newer versions, Tez or Spark jobs).

Job Execution Flow

According to https://www.guru99.com/introduction-hive.html:
  1. The query is submitted from the UI (or any other Hive client, such as the Thrift Server, JDBC, or ODBC) and received by the Driver.
  2. The Driver interacts with the Compiler and hands the query over to it.
  3. The Compiler converts the query (HiveQL statements) into an execution plan, i.e. the steps MapReduce must perform to produce the output the query describes, and communicates with the Metastore to request metadata. One more feature worth mentioning here is the Optimizer: Hive has a rule-based optimizer for optimizing logical plans.
  4. The Metastore sends the metadata (database names, table names, and column names, which are stored separately from the data itself) back to the Compiler. The Metastore keeps this information in a traditional RDBMS.
  5. The Compiler returns the proposed execution plan to the Driver.
  6. The Driver sends the execution plan to the Execution Engine.
  7. The Execution Engine acts as a bridge between Hive and Hadoop to process the query. First, it contacts the NameNode and then the DataNodes to read the values stored in the tables. It also communicates bi-directionally with the Metastore to perform DDL operations (such as CREATE, DROP, and ALTER).
  8. and 9. The results are fetched and sent back to the Execution Engine and then to the Driver.
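
If you want to peek at the plan the Compiler produces for a given query, HiveQL's EXPLAIN statement prints the stages (map/reduce tasks and the final fetch) that will run. A minimal sketch, assuming a hypothetical orders table:

  -- Show the execution plan Hive generates for this query
  EXPLAIN
  SELECT customer_id, COUNT(*) AS cnt
  FROM orders
  GROUP BY customer_id;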

Data Model

Table

Hive has two types of tables which are as follows:

  • Managed Tables: By default, Hive manages the data, meaning that Hive moves the data into its warehouse directory. If the user drops the table, the data is deleted along with its metadata.
  • External Tables: In this case, the user controls the creation and deletion of the data: the data no longer resides in the warehouse directory but in a custom directory instead. As a result, the user can add the data after creating the table, and when the table is dropped, the only thing lost is the metadata (see the example below).
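
A minimal sketch of both table types in HiveQL; the table names, columns, and the HDFS location are made up for illustration:

  -- Managed table: Hive owns the data under its warehouse directory;
  -- DROP TABLE removes both the data and the metadata.
  CREATE TABLE page_views_managed (
    user_id BIGINT,
    url     STRING
  )
  STORED AS ORC;

  -- External table: the data stays in the custom HDFS location;
  -- DROP TABLE removes only the metadata.
  CREATE EXTERNAL TABLE page_views_external (
    user_id BIGINT,
    url     STRING
  )
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
  LOCATION '/data/page_views';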

Partitions

In short, partitioning is a very useful way to reduce query latency: we scan only the relevant partitions instead of the whole dataset.

Partitioning gives effective results in a few scenarios:

  • When there is a limited number of partitions (otherwise we risk creating too many small partitions, and too many directories are bad for HDFS).
  • When the partitions are of comparatively equal size.
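
A minimal sketch of a date-partitioned table in HiveQL (table and column names are made up). Each partition value becomes its own HDFS directory, and a filter on the partition column lets Hive prune directories instead of scanning the whole table:

  -- Table partitioned by date
  CREATE TABLE logs (
    user_id BIGINT,
    action  STRING
  )
  PARTITIONED BY (dt STRING)
  STORED AS ORC;

  -- Load one partition (static partitioning) from a hypothetical staging table
  INSERT OVERWRITE TABLE logs PARTITION (dt = '2021-01-01')
  SELECT user_id, action FROM staging_logs WHERE log_date = '2021-01-01';

  -- The partition filter means only the matching directory is scanned
  SELECT COUNT(*) FROM logs WHERE dt = '2021-01-01';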

Buckets

So, when should you use buckets?

If the user wants to run join queries, bucketing reduces the need for shuffle and sort operations, because that work was already done when the data was written. However, a bucketed join is only possible if the two tables were created with the same bucket column(s) and have the same number of buckets.
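
A minimal sketch in HiveQL, assuming two hypothetical tables bucketed on the join key. The SET statements shown (hive.enforce.bucketing, which older Hive versions need at insert time, and hive.optimize.bucketmapjoin) are the usual knobs for getting a bucketed map join, but the exact settings vary by Hive version:

  -- Both tables bucketed on the join key into the same number of buckets
  CREATE TABLE users_bucketed (
    user_id BIGINT,
    name    STRING
  )
  CLUSTERED BY (user_id) INTO 32 BUCKETS
  STORED AS ORC;

  CREATE TABLE orders_bucketed (
    order_id BIGINT,
    user_id  BIGINT,
    total    DOUBLE
  )
  CLUSTERED BY (user_id) INTO 32 BUCKETS
  STORED AS ORC;

  -- Older Hive versions need bucketing enforced when inserting data
  SET hive.enforce.bucketing = true;
  -- Ask the optimizer to use a bucket map join
  SET hive.optimize.bucketmapjoin = true;

  SELECT u.name, o.total
  FROM users_bucketed u
  JOIN orders_bucketed o
    ON u.user_id = o.user_id;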

Limitations

  • Hive is not capable of handling real-time data or unstructured data.
  • Hive queries have high latency.
  • Hive has only limited support for sub-queries.
  • Hive is not recommended for small data sets.
