Summarize all the cool things about Hive.
Hadoop is known for its Map-Reduce engine for parallelizing data processing operations using HDFS as its native file storage system, but as we know Map-Reduce does not provide user-friendly libraries or interfaces to deal with the data (An operation like left inner join would need around 200–300 lines of code in JAVA Map-Reduce whereas in SQL it would just be a couple of lines of code). Plus, there is a very tight dependency of JAVA for using Map-Reduce framework. Here is why Hive came to the rescue!
What is Hive?
Hive is a data warehouse (not an RDBMS) and query interface on top of Hadoop’s native Map-Reduce, it allows users to write SQL style queries in a native language known as Hive Query Language (HQL) and reads data from HDFS.
Job Execution Flow
- Executing Query from the UI (or any other Hive Client like Thrift Server, JDBC, ODBC) — Driver receives it.
- The Driver is interacting with the Compiler and transfers Query.
- The Compiler converts Query (HiveQL statements) to an execution plan (contains steps needed to be performed by the MapReduce to get the output as translated by the query) and communicating with Meta store for getting metadata requests. One more feature is Optimizer (which wasn’t mentioned above) — Hive has a rule-based optimizer for optimizing logical plans.
- Metastore sends metadata information (information about database name, table names and column names — which is saved separately from the data) back to Compiler. Metastore stores the data in a traditional RDBMS format.
- Compiler communicating with Driver with the proposed plan to execute the query.
- Driver Sending execution plans to the Execution engine.
- Execution Engine acts as a bridge between Hive and Hadoop to process the query. First, it contacts the Name Node and then to Data nodes to get the values stored in tables. Second, communicates bi-directionally with Meta store present in Hive to perform DDL operations (like CREATE, DROP and ALTERING).
- And 9. Sending results to Execution engine and Driver.
Hive’s data categorized into 3 different concepts:
Tables in Hive are the same table as in RDBMS. First of all, tables made up of the data being stored (practically, our data resides in Hadoop filesystem). Secondly, the metadata describes the layout of the data (which stored in our metastore — RDBMS). Plus, the user can perform DML commands on tables like SELECT, DELETE, INSERT and more.
Hive has two types of tables which are as follows:
- Managed Tables: By default, Hive manages the data. It means that Hive moves the data into its warehouse directory. In case the user drops the table, the data will be deleted as well as it’s metadata.
- External Tables: In this case, the user can control the creation and deletion of the data, meaning the data is no longer resides in the warehouse directory but some different custom directory instead. As a result, the user can create the data after he created the table and in case of dropping a table, the only thing lost is the metadata.
Hive has Partitions option, for grouping the same type of data together based on a column(s) or partition key(s). The big advantage is to make it easier and faster to query the data. For example: consider we have a table called Books containing some data about all the book in our library like _name, _id, _year_of_publication, _genre and so on. Now, if we perform partitioning on the basis_genre column, then the information of all the books belonging to a particular genre will be stored together in that very partition (technically, partition in Hive is just a directory under the original directory of Books in the filesystem). So, the queries regarding a specific book genre would only have to look through the data present in that specific book genre partition.
Shortly, partitioning is a very useful solution for reducing latency of queries (we are scanning only relevant partitioned data instead of the whole dataset).
Partitioning gives effective results in a few scenarios:
- When there is a limited number of partitions (there is the possibility of too many small partition creations - too many directories isn’t good for us).
- While partitions are of comparatively equal size.
Hive has Buckets option, for decomposing table data sets into more manageable parts. Tables or partition are subdivided into buckets based on the hash function of a column in the table (technically, every bucket will be a file in our filesystem under the table directory), this hash function ensures that all same values of a specific chosen column stored in the exact same file. Unlike partitions, here the user has to decide how many buckets needed, as a result, in one bucket we can found many column values. The purpose of this concept is to give extra structure to the data that may be used for faster query responses than non-bucketed tables.
So, when to use Buckets ?
In case the user wants to do some join queries, the bucketing reduce the need for shuffling and sorting operation, because those were done ahead of time. However, joining tables with bucketing is optional only if the two tables created with the same bucket column(s) and has the same number of buckets.
- Hive is “Schema ON READ”, so it doesn’t support update or modify functions (are allowed in recent releases of HIVE) and it doesn't confirm the schema until reading the data.
- Hive is not capable of handling real-time data and unstructured data.
- Hive queries contain high latency.
- Hive doesn’t support sub-queries.
- Hive is not recommended for small data sets.
- Hive partitions: https://data-flair.training/blogs/apache-hive-partitions/
- Hive metastore: https://data-flair.training/blogs/apache-hive-metastore/
- Hive bucketing video: https://www.youtube.com/watch?v=_Nk_pt5Izgg