Hive is a data warehouse system that facilitates reading, writing, and managing large datasets residing in distributed storage. Structure can be projected onto data already in storage. Hive offers a SQL-like language called HiveQL. Hive also offers connectors for various external applications like Tableau.
Configure Hive
Install Hive
Cloudera Manager distributes Hive in CDH and offers the following services:
- Hive Metastore Server (required) – The Hive Metastore service stores the metadata for Hive tables and partitions in a relational database, and provides clients (including Hive) access to this information via the Metastore service API. By default, the Hive Metastore Server uses a Derby database. It is best to run the Hive Metastore Server on the same server as its backing database, because large queries can cause performance issues when the two are on separate servers. It is also recommended to install the Hive Metastore Server on the same node as an HDFS NameNode. Although Hive can be configured with more than one Hive Metastore Server, Cloudera does not support multiple Hive Metastore Servers, as this can result in problems such as concurrency errors.
- HiveServer2 (required – same node as the Hive Metastore Server) – HiveServer2 (HS2) is a server interface that enables remote clients to execute queries against Hive and retrieve the results. HiveServer2 differs from HiveServer, and Beeline is the supported CLI to communicate with HiveServer2. HiveServer2 supports multiple clients making many simultaneous requests, which is an improvement over HiveServer.
- Gateway (required) – Add a Hive Gateway to any host from which you will run the Hive CLI, such as an APP server, and to the host where the Hive Metastore Server is installed. If you are not sure, add the Hive Gateway to all APP servers and to the server where you installed the Hive Metastore Server.
- WebHCat Server (optional) – can be installed onto any node. WebHCat is the REST API for HCatalog, a table and storage management layer for Hadoop. We do not use this service.
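Since Beeline is the supported CLI for HiveServer2, a quick way to verify the services above are up is to open a Beeline session. This is a sketch: the hostname is a placeholder, and 10000 is only the default HiveServer2 port.

```shell
# Connect to HiveServer2 with Beeline (hostname is a placeholder;
# 10000 is the HiveServer2 default port, -n passes the username).
beeline -u jdbc:hive2://hs2.servername:10000 -n hive

# Once connected, a simple sanity check from the Beeline prompt:
#   SHOW DATABASES;
```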
Hive Configuration
Table Lock Manager
For the Table Lock Manager, make the following configuration changes before using HiveServer2. The Table Lock Manager provides concurrency control, which is a must in databases: at a minimum, strive to support concurrent readers and writers whenever possible. Because all locks are acquired implicitly, it is also useful to have a way to discover which locks are currently held. Failing to configure the lock manager before using HiveServer2 may result in unpredictable behavior. To enable Hive's Table Lock Manager:
- Open Cloudera Manager’s UI, browse to Services, Hive, click on Configuration.
- Expand HiveServer2 Default Group, click on Advanced, and add the following to the HiveServer2 Advanced Configuration Snippet (Safety Valve) for hive-site.xml (substitute the Zookeeper servers in the hive.zookeeper.quorum <value>):
<property>
  <name>hive.support.concurrency</name>
  <description>Enable Hive's Table Lock Manager Service</description>
  <value>true</value>
</property>
<property>
  <name>hive.zookeeper.quorum</name>
  <description>Zookeeper quorum used by Hive's Table Lock Manager</description>
  <value>zk.servername12,zk.servername13</value>
</property>
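With concurrency enabled, the implicitly acquired locks can be inspected from the Hive shell. A quick check (the table name below is a placeholder):

```sql
-- List all locks currently held across the warehouse
SHOW LOCKS;

-- List locks held on a specific table
SHOW LOCKS default.tablename;
```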
Live Long and Process (LLAP)
Live Long and Process (LLAP) functionality was added in Hive 2.0 (HIVE-7926 and associated tasks). HIVE-9850 links documentation, features, and issues for this enhancement.
For configuration of LLAP, see LLAP Section of Configuration Properties.
https://cwiki.apache.org/confluence/display/Hive/LLAP
http://hortonworks.com/blog/stinger-next-enterprise-sql-hadoop-scale-apache-hive/
Test build: http://www.lewuathe.com/blog/2015/08/12/try-hive-llap/
Administer Hive
Test Hive
Enter the Hive shell:
hive
show databases;
Debug Hive using the following commands:
hive -hiveconf hive.root.logger=ALL,console
show databases;
Query Examples
Query Hive table and output the results to HDFS folder:
set mapred.job.name = operations-report-file;
INSERT OVERWRITE DIRECTORY '/path/to/hdfs/folder/'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
select
TO_DATE(from_unixtime(UNIX_TIMESTAMP(split(rowkey,":")[3], 'yyyy-MM-dd'))) AS datetime
, count(*) AS rowcount
FROM default.tablename
WHERE TO_DATE(from_unixtime(UNIX_TIMESTAMP(split(rowkey,":")[3], 'yyyy-MM-dd'))) > date_sub(FROM_UNIXTIME(UNIX_TIMESTAMP(),'yyyy-MM-dd'),14)
GROUP BY split(rowkey,":")[3]
ORDER BY datetime;
Split Function
select split(rowkey,":")[3] from tablename limit 10;
Create Table
Create Hive table using a file stored in HDFS:
CREATE TABLE default.operations_report_file (
`date` DATE,
rowcount INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
LOAD DATA INPATH '/ops/operations-report-file.tsv' INTO TABLE default.operations_report_file;
Create Hive table using a local file:
CREATE DATABASE test;
CREATE EXTERNAL TABLE test.hivetest1
(
CustomerNumber INT,
DepartmentNumber INT,
CategoryNumber INT,
VisitTime INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
LOAD DATA LOCAL INPATH 'customers.csv'
OVERWRITE INTO TABLE test.hivetest1;
Create Hive table against an HBase table:
CREATE EXTERNAL TABLE hbase_test6c (
rowKey string,
ac1 string,
ac2 string,
ac3 string,
ac4 string,
ac5 string,
ac6 string
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf:c1,cf:c2,cf:c3,cf:c4,cf:c5,cf:c6")
TBLPROPERTIES ("hbase.table.name" = "test6c");
Additional example against an HBase table:
create external table if not exists hbasepersontable (id INT, firstname STRING, lastname STRING)
stored by 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
with serdeproperties ('hbase.columns.mapping' = ':key,firstname:firstname,lastname:lastname')
tblproperties ('hbase.table.name' = 'PERSON');
Scripting Against Hive
Return results from Hive using bash:
# the -S silences the connection messages
# use -e to pass a command
my_value=$(hive -S -e 'select count(*) from salestrigger.external_triggered_stores;' | tail -n1)
echo "$my_value"
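The same result can be fetched through Beeline, the supported CLI for HiveServer2. This is a sketch: the JDBC URL host is a placeholder, and 10000 is only the default port.

```shell
# Same count via Beeline instead of the legacy hive CLI.
# --silent=true suppresses connection chatter and
# --outputformat=tsv2 emits plain tab-separated rows.
my_value=$(beeline -u jdbc:hive2://hs2.servername:10000 \
  --silent=true --outputformat=tsv2 \
  -e 'select count(*) from salestrigger.external_triggered_stores;' | tail -n1)
echo "$my_value"
```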
ODBC Connector
I found a useful ODBC Connector to connect to Hive’s HiveServer2: http://kb.tableau.com/articles/knowledgebase/hadoop-hive-connection
Troubleshooting
Hive shell debug mode
You can enter a debug mode in the hive shell with the following command:
hive -hiveconf hive.root.logger=DEBUG,console
Hive Metastore Servers will not start – Hive Metastore canary failed to create a database
The Hive Server will not connect to the Hive Metastore, and Cloudera Manager reports that the canary failed to create a database. In the Hive Metastore logs you see the following error: Metastore Thrift Server threw an exception…
MetaException(message:Hive Schema version 0.13.0 does not match metastore’s schema version 0.12.0 Metastore is not upgraded or corrupt)
Resolution: You will need to update the metastore database schema; an upgrade to the Hive Server likely skipped the matching database schema upgrade, or the database has become corrupt. Stop the Hive service, then (in Cloudera Manager) within the Hive Metastore instance, click Actions and select Update Database Schema. Without Cloudera Manager, you'll find the update-database-schema script in the bin folder. After the schema has been updated successfully, start the Hive Metastore and Hive Server.
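Outside Cloudera Manager, Hive also ships the schematool utility (Hive 0.12+), which can perform the same schema upgrade. A sketch, assuming a PostgreSQL-backed metastore; back up the database before upgrading:

```shell
# Report the current metastore schema version before touching anything
schematool -dbType postgres -info

# Upgrade the metastore schema in place to match the installed Hive version
schematool -dbType postgres -upgradeSchema
```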
From the Hive Metastore log:
8:26:23.133 AM ERROR org.apache.hadoop.hive.metastore.HiveMetaStore
Metastore Thrift Server threw an exception…
MetaException(message:Hive Schema version 0.13.0 does not match metastore’s schema version 0.12.0 Metastore is not upgraded or corrupt)
at org.apache.hadoop.hive.metastore.ObjectStore.checkSchema(ObjectStore.java:6311)
at org.apache.hadoop.hive.metastore.ObjectStore.verifySchema(ObjectStore.java:6282)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
…
8:26:22.774 AM ERROR org.apache.hadoop.hive.metastore.MetaStoreDirectSql
Database initialization failed; direct SQL is disabled
javax.jdo.JDOException: Exception thrown when executing query
at org.datanucleus.api.jdo.NucleusJDOHelper.getJDOExceptionForNucleusException(NucleusJDOHelper.java:596)
at org.datanucleus.api.jdo.JDOQuery.execute(JDOQuery.java:230)
at org.apache.hadoop.hive.metastore.MetaStoreDirectSql.<init>(MetaStoreDirectSql.java:119)
…
org.apache.hadoop.hive.metastore.ObjectStore.setConf(ObjectStore.java:224)
at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:73)
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)
at org.apache.hadoop.hive.metastore.RawStoreProxy.<init>(RawStoreProxy.java:58)
…
NestedThrowablesStackTrace:
org.postgresql.util.PSQLException: ERROR: column A0.OWNER_NAME does not exist
Position: 122
at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2102)
at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1835)
at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:257)
…
From the Hive Server log:
8:22:20.788 PM ERROR org.apache.hive.service.CompositeService
Error starting services HiveServer2
org.apache.hive.service.ServiceException: Unable to connect to MetaStore!
at org.apache.hive.service.cli.CLIService.start(CLIService.java:140)
at org.apache.hive.service.CompositeService.start(CompositeService.java:70)
at org.apache.hive.service.server.HiveServer2.start(HiveServer2.java:73)
…
Caused by: MetaException(message:Got exception: org.apache.thrift.transport.TTransportException java.net.SocketException: Broken pipe)
at org.apache.hadoop.hive.metastore.MetaStoreUtils.logAndThrowMetaException(MetaStoreUtils.java:1114)
at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getDatabases(HiveMetaStoreClient.java:839)
at org.apache.hive.service.cli.CLIService.start(CLIService.java:138)
… 9 more
Out of Memory Error Within Hive Before Being Submitted to YARN
Resolution:
It appears that the Hive query was never submitted to YARN, which points to an out-of-memory error in HiveServer2 itself. Increase the memory under Java Heap Size of HiveServer2 in Cloudera Manager.
Error:
vi /var/log/hive/hadoop-cmf-hive-HIVESERVER2-servername12.log.out
2015-05-07 10:30:09,925 WARN org.apache.hive.service.cli.thrift.ThriftCLIService: Error executing statement:
java.lang.RuntimeException: java.lang.OutOfMemoryError: Java heap space
at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:84)
at org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:37)
at org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:64)
…
Caused by: java.lang.OutOfMemoryError: Java heap space