Hive is a data warehouse software which is used for facilitates querying and managing large data sets residing in distributed storage.
Hive language almost look like SQL language called HiveQL. It also allows traditional map reduce programs to customize mappers and reducers when it is inconvenient or inefficient to execute the logic in HiveQL (User Defined Functions UDFS)
Hive metastore is a database that stores metadata about your Hive tables (eg. Table name, column names and types,table location, storage handler being used, number of buckets in the table, sorting columns if any, partition columns if any, etc.).
When you create a table,this metastore gets updated with the information related to the new table which gets queried when you issue queries on that table.
Hive is a central repository of hive metadata. it has 2 parts services and data. by default it uses derby db in local disk. it is referred as embedded metastore configuration. It tends to the limitation that only one session can be served at any given point of time.
Following classes are used by Hive to read and write HDFS files
•TextInputFormat/HiveIgnoreKeyTextOutputFormat: These 2 classes read/write data in plain text file format.
•SequenceFileInputFormat/SequenceFileOutputFormat: These 2 classes read/write data in hadoop SequenceFile format.
Hive uses ObjectInspector to analyze the internal structure of the row object and also the structure of the individual columns.
ObjectInspector provides a uniform way to access complex objects that can be stored in multiple formats in the memory, including:
•Instance of a Java class (Thrift or native Java)
•A standard Java object (we use java.util.List to represent Struct and Array, and use java.util.Map to represent Map)
•A lazily-initialized object (For example, a Struct of string fields stored in a single Java string object with starting offset for each field)
A complex object can be represented by a pair of ObjectInspector and Java Object. The ObjectInspector not only tells us the structure of the Object, but also gives us ways to access the internal fields inside the Object.
This component implements the processing framework for converting SQL to a graph of map/reduce jobs and the execution time framework to run those jobs in the order of dependencies.
There are following ways by which you can connect with the Hive Server:
Thrift Client: Using thrift you can call hive commands from a various programming languages e.g. C++, Java, PHP, Python and Ruby.
JDBC Driver : It supports the Type 4 (pure Java) JDBC Driver
ODBC Driver: It supports ODBC protocol.
Hive is not a full database. The design constraints and limitations of Hadoop and HDFS impose limits on what Hive can do.
Hive is most suited for data warehouse applications, where
1) Relatively static data is analyzed,
2) Fast response times are not required, and
3) When the data is not changing rapidly.
Hive doesn’t provide crucial features required for OLTP, Online Transaction Processing. It’s closer to being an OLAP tool, Online Analytic Processing.So, Hive is best suited for data warehouse applications, where a large data set is maintained and mined for insights, reports, etc.
Hive can use derby by default and can have three type metastore configuration. It supports
uses derby db to store data backed by file stored in disk. It can’t support multi session at same time. and services of metastore runs in same JVM as hive.
In this case we need to have stand alone db like MySql, which would be communicated by metastore services. Benefit of this approach is, it can support multiple hive session at a time. and service still runs in same process as Hive.
Metastore and Hive service would run in different process. with stand alone Mysql kind db.
Hive natively supports text file format, however hive also has support for other binary formats. Hive supports Sequence, Avro, RCFiles.
Sequence files :-General binary format. splittable, compressible and row oriented. a typical example can be. if we have lots of small file, we may use sequence file as a container, where file name can be a key and content could stored as value. it support compression which enables huge gain in performance.
Avro datafiles:-Same as Sequence file splittable, compressible and row oriented except support of schema evolution and multilingual binding support.
RCFiles :-Record columnar file, it’s a column oriented storage file. it breaks table in row split. in each split stores that value of first row in first column and followed sub subsequently..
No, it is not possible to use metastore in sharing mode. It’s recommended to use standalone “real” database like MySQL or PostGresSQL.
HCatalog is built on top of the Hive metastore and incorporates Hive’s DDL. Apache Hcatalog is a table and data management layer for hadoop,we can process the data on Hcatalog by using APache pig,Apache Mapreduce and Apache Hive. There is no need to worry in Hcatalog where data is stored and which format of data generated. HCatalog displays data from RCFile format, text files, or sequence files in a tabular view. It also provides REST APIs so that external systems can access these tables’ metadata.
Hive/HCatalog also enables sharing of data structure with external systems including traditional data management tools.
The WebHcatServer provides a REST – like web API for Hcatalog. Applications make HTTP requests to run Pig, Hive, and HCatalog DDL from within applications.
Hive creates schema and append on top of an existing data file. One can have multiple schema for one data file, schema would be saved in hive’s metastore and data will not be parsed read or serialized to disk in given schema. When s/he will try to retrieve data schema will be used. Lets say if my file have 5 column (Id,Name,Class,Section,Course) we can have multiple schema by choosing any number of column.
Whenever you run the hive in embedded mode, it creates the local metastore. And before creating the metastore it looks whether metastore already exist or not. This property is defined in configuration file hive-site.xml.
Propertyis“javax.jdo.option.ConnectionURL”withdefaultvalue“jdbc:derby:;databaseName=metastore_db;create=true”. So to change the behavior change the location to absolute path, so metastore will be used from that location.
A SerDe is a short name for a Serializer Deserializer.
Hive uses SerDe (and FileFormat) to read and write data from tables. An important concept behind Hive is that it DOES NOT own the Hadoop File System (HDFS) format that data is stored in. Users are able to write files to HDFS with whatever tools/mechanism takes their fancy(“CREATE EXTERNAL TABLE” or “LOAD DATA INPATH,” ) and use Hive to correctly “parse” that file format in a way that can be used by Hive.
A SerDe is a powerful (and customizable) mechanism that Hive uses to “parse” data stored in HDFS to be used by Hive.
Hive currently use these SerDe classes to serialize and deserialize data:
• MetadataTypedColumnsetSerDe: This SerDe is used to read/write delimited records like CSV, tab-separated control-A separated records (quote is not supported yet.)
• ThriftSerDe: This SerDe is used to read/write thrift serialized objects.
• DynamicSerDe: This SerDe also read/write thrift serialized objects, but it understands thrift DDL so the schema of the object can be provided at runtime. Also it supports a lot of different protocols,including TBinaryProtocol, TJSONProtocol, TCTLSeparatedProtocol (which writes data in delimited records).
•In most cases, users want to write a Deserializer instead of a SerDe, because users just want to read their own data format instead of writing to it.
•For example, the RegexDeserializer will deserialize the data using the configuration parameter ‘regex’, and possibly a list of column names
•If your SerDe supports DDL (basically, SerDe with parameterized columns and column types), you probably want to implement a Protocol based on DynamicSerDe, instead of writing a SerDe from scratch. The reason is that the framework passes DDL to SerDe through “thrift DDL” format, and it’s non-trivial to write a “thrift DDL” parser.
Get more questions and answers from onlineitguru trainers after completion of big data certification course
to our newsletter
Today application testing is the deciding factor to launch the application into the market. And people do not launch the application unless it goes true.
Today many people were enthusiastic, to know the exact details of things happening around him. This can get the proper knowledge on Blockchain.
Zeal to learn ethical hacking is common among college students and IT professionals. Because everybody wants to secure their system from cyber attacks.
Python is a dynamic interrupted language which is used in wide varieties of applications. It is very interactive object oriented and high-level programming language.
Tableau is a Software company that caters interactive data visualization products that provide Business Intelligence services. The company’s Head Quarters is in Seattle, USA.
Pega Systems Inc. is a Cambridge, Massachusetts based Software Company. It is known for developing software for Customer Relationship Management (CRM) and Business process Management (BPM).