Apache Drill is an open-source software framework that supports data-intensive distributed applications for interactive analysis of large-scale datasets. Drill is the open-source equivalent of Google's Dremel system, which Google offers as an infrastructure service called Google BigQuery. One explicitly stated design goal is the ability to scale to 10,000 servers or more and to process petabytes of data and trillions of records in seconds. Drill is an Apache top-level project.
Drill supports a variety of NoSQL databases and file systems, including HBase, MongoDB, MapR-DB, HDFS, MapR-FS, Amazon S3, Azure Blob Storage, Google Cloud Storage, Swift, NAS, and local files. Using JDBC, Drill can also query relational databases such as SQL Server and Oracle. A single query can join data from multiple datastores. For example, you can join a user profile collection in MongoDB with a directory of event logs in HDFS, as in the sketch below.
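A hedged sketch of what such a cross-datastore join could look like (the mongo.site.`users` collection, the HDFS path, and the user_id/event_type columns are hypothetical; your storage plugin names may differ):
-- Hypothetical cross-datastore join: a MongoDB collection with event logs in HDFS.
-- Plugin names (mongo, hdfs), paths, and column names are illustrative assumptions.
SELECT u.name, e.event_type
FROM mongo.site.`users` u
JOIN hdfs.`root`.`/logs/events` e
ON u.user_id = e.user_id
LIMIT 10;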
How to Use Drill
WebUI
Query Drill through the WebUI at http://servername:8047
SQLLine
SQLLine comes with the Drill distribution. To start using Drill, go to the directory into which you extracted the tar file and run SQLLine:
cd /opt/apache-drill/bin/
./sqlline -u jdbc:drill:zk=local
You can now execute SQL queries against Drill. To exit from SQLLine, run the command
!quit
You can view the available schemas using this query:
SELECT * FROM INFORMATION_SCHEMA.SCHEMATA;
Show databases:
show databases;
Show schema:
show schemas;
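You can also set a default schema with USE so that subsequent queries do not need fully qualified paths. A minimal sketch (the workspace and file path are assumptions for illustration):
-- Switch the default schema to the hdfs root workspace.
USE `hdfs`.`root`;
-- The file can now be referenced without the hdfs.root prefix.
SELECT * FROM `path/to/file.txt` LIMIT 10;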
Create a View
Create a view in Drill:
CREATE VIEW `hdfs`.`root`.`lib/drill/views/vwMyView` AS SELECT
columns[1] `zip_code`
,columns[2] `zip4`
FROM `hdfs`.`root`.`path/to/file.txt`
Update the view:
CREATE OR REPLACE VIEW `hdfs`.`root`.`lib/drill/views/vwMyView` AS SELECT
columns[1] `zip_code`
,columns[2] `zip4`
,columns[3] `address1`
FROM `hdfs`.`root`.`path/to/file.txt`
Delete the view:
DROP VIEW `hdfs`.`root`.`lib/drill/views/vwMyView`
Query a view:
SELECT * FROM `views`.`views`.`vwPlaylogs` LIMIT 100
Nested Data Functions
FLATTEN – separates the elements in a repeated field into individual records.
SELECT h.foo.id, h.foo.name
FROM (
SELECT
FLATTEN(r.holidays) foo
FROM `hdfs`.`root`.`asset/reference/timeanddate/sample` r
) h
Query dfs
dfs represents the local Linux file system.
To query the file system, try this:
SELECT * FROM dfs.root.`/web/logs`;
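For delimited text files, Drill exposes the fields through the columns array, so you can project and alias individual fields. A sketch, assuming a hypothetical CSV file under the logs folder:
-- columns[] is 0-indexed and applies to delimited text files.
SELECT columns[0] AS ip, columns[1] AS url
FROM dfs.root.`/web/logs/access.csv`
LIMIT 10;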
Query HBase
Some observations:
- "LIKE" does not work well on character-list patterns, so use "SIMILAR TO" instead (example: ...WHERE fieldX SIMILAR TO 'abc%').
- `hbase`.`TEST.TABLE1` has columns "row_key; S; SG". In order to query a parameter (sub-field) in SG, you have to include ", *" in the SELECT; without ", *" you will not get any results back for a specific parameter in column "SG". This does not reproduce for column "S".
HBase data is stored as UTF8-encoded bytes; make sure you convert from UTF8 when you return your results:
select CONVERT_FROM(row_key, 'UTF8') FROM hbase.`TEST.TABLE1` limit 10;
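Combining the observations above, a hedged sketch of querying a specific SG parameter (the sub-field name P1 is hypothetical):
-- Include ", *" so results come back for a specific SG parameter,
-- and use SIMILAR TO rather than LIKE for pattern matching.
SELECT CONVERT_FROM(t.SG.P1, 'UTF8') AS p1, *
FROM `hbase`.`TEST.TABLE1` t
WHERE CONVERT_FROM(t.SG.P1, 'UTF8') SIMILAR TO 'abc%';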
Query all 'US' 'events' from the TABLE1 table in HBase:
SELECT
CONVERT_FROM(TABLE1.S.TITLE, 'UTF8') AS Title
,CONVERT_FROM(TABLE1.S.A1, 'UTF8') AS Address
,CONVERT_FROM(TABLE1.S.CI, 'UTF8') AS City
,CONVERT_FROM(TABLE1.S.CNY, 'UTF8') AS Country
,CONVERT_FROM(TABLE1.S.DATETIME_LOCAL, 'UTF8') AS LocalDateTime
,CONVERT_FROM(TABLE1.S.VENUE, 'UTF8') AS Venue
,CONVERT_FROM(TABLE1.S.ZIP, 'UTF8') AS Zip
,CONVERT_FROM(TABLE1.SG.TYPE, 'UTF8') AS `Type`
FROM `hbase`.`TEST.TABLE1` TABLE1
WHERE CONVERT_FROM(TABLE1.SG.TYPE, 'UTF8') LIKE '%events%'
AND CONVERT_FROM(TABLE1.S.CNY, 'UTF8') LIKE 'US'
Count Rows:
Do NOT use SELECT COUNT(*) to count the number of rows; Drill will return a tiny number, MUCH lower than the actual number of rows. Instead, count on a specific column:
select COUNT(q.S.TITLE) from `hbase`.`TEST.TABLE1` `q` limit 1;
Query HDFS
List the top 100 rows from all files in a folder in HDFS:
SELECT * FROM `hdfs`.`root`.`/asset/reference/timeanddate` LIMIT 100;
Query Hive
Select from a hive table:
SELECT firstname,lastname FROM `hive`.test.customers limit 10;
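To see which tables are available in a Hive database, you can switch to its schema and list them (a minimal sketch; the test database name comes from the example above):
-- Set the default schema to the Hive test database, then list its tables.
USE hive.test;
SHOW TABLES;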
Query JSON
List the top 10 rows from all files in a folder in HDFS:
SELECT * FROM `hdfs`.`root`.`/asset/reference/timeanddate` LIMIT 10;
Select from a nested JSON:
Data:
{"version":2,"holidays":[{"id":1687,"urlid":"romania/new-year-day","url":"http://www.timeanddate.com/holidays/romania/new-year-day","country":{"id":"ro","name":"Romania"},"name":"New Year's Day","oneliner":"New Year's Day (Anul Nou) and the following day, on January 1 and 2 respectively, are annual holidays in Romania.","types":["National holiday"],"date":{"iso":"2014-01-01","datetime":{"year":2014,"month":1,"day":1}},"uid":"00069700000007de"},
Query:
SELECT h.foo.id, h.foo.name
FROM (
SELECT
FLATTEN(r.holidays) foo
FROM `hdfs`.`root`.`asset/reference/timeanddate/sample` r
) h
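Dot notation also reaches deeper levels of the flattened records. A sketch based on the sample record above (the country and date fields come from that sample):
-- Reach nested objects (country, date) inside each flattened holiday record.
SELECT h.foo.country.`name` AS country, h.foo.`date`.`iso` AS holiday_date
FROM (
SELECT FLATTEN(r.holidays) foo
FROM `hdfs`.`root`.`asset/reference/timeanddate/sample` r
) h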
Query View
To query a view:
SELECT * FROM `views`.`views`.`vwSomeView` LIMIT 100
Configure Drill
Configure Authentication
Drill can be configured to use authentication via PAM, for example against LDAP or local accounts. With the instructions below, an account is checked against /etc/passwd. Before installing, test by running id username; if you see results (a list of groups), you are ready to apply security to Drill.
Install PAM for Drill
Download the JPam-Linux_amd64-1.1.tgz tar file and copy it to /tmp/.
Create the folder structure and extract the files:
sudo mkdir -p /opt/pam/
sudo tar zxvf /tmp/JPam-Linux_amd64-1.1.tgz --directory /opt/pam/
sudo cp /opt/pam/JPam-1.1/libjpam.so /opt/pam/
Configure PAM for Drill
Edit the drill-env.sh file:
sudo vi /opt/apache-drill/conf/drill-env.sh
Include the directory where the libjpam.so file is located:
export DRILLBIT_JAVA_OPTS="-Djava.library.path=/opt/pam/"
Edit the drill-override.conf file:
sudo vi /opt/apache-drill/conf/drill-override.conf
Add the following configuration to the drill.exec block (simply enable security.user.auth if the section already exists):
security.user.auth {
enabled: true,
packages += "org.apache.drill.exec.rpc.user.security",
impl: "pam",
pam_profiles: [ "sudo", "login" ]
}
Comment out the following keystore:
# Below SSL parameters need to be set for custom transport layer settings.
#javax.net.ssl {
# keyStore: "/keystore.file",
# keyStorePassword: "ks_passwd",
# trustStore: "/truststore.file",
# trustStorePassword: "ts_passwd"
#}
Restart the drillbit:
sudo /opt/apache-drill/bin/drillbit.sh restart
Test Authentication
You can test authentication by signing on to sqlline or through the ODBC connection. For sqlline, you will first need to open sqlline and then connect; you will then be asked for your username and password.
1. Open sqlline
/opt/apache-drill/bin/sqlline
2. Connect to Drill, substitute localhost with your ZooKeeper address:
!connect jdbc:drill:zk=localhost:2181
Configure Interfaces
Overview: https://drill.apache.org/docs/interfaces-introduction/
ODBC
MapR provides an ODBC 3.8 driver that connects Windows, Mac OS X, and Linux to Apache Drill and BI tools. Install the latest version of Apache Drill with the latest version of the MapR Drill ODBC driver. An ODBC driver that you installed with an older version of Drill probably will not work with an upgraded version of Drill.
Download the Windows 64 version of the ODBC driver from MapR.
This link has details on installing the ODBC driver on Windows: https://drill.apache.org/docs/installing-the-driver-on-windows/
JDBC
You can connect to Drill through a JDBC client tool, such as SQuirreL, on Windows, Linux, and Mac OS X systems, to access all of your data sources registered with Drill. An embedded JDBC driver is included with Drill. Configure the JDBC driver in the SQuirreL client to connect to Drill from SQuirreL. This document provides instructions for connecting to Drill from SQuirreL on Windows.
Configure Drill to use HDFS High Availability
Drill's Hive storage has not been written to use HDFS HA.
Copy the hdfs-site.xml file into Drill's conf folder; use the following as a template for which properties to include:
<?xml version="1.0" encoding="UTF-8"?>
<!--Autogenerated by Cloudera Manager-->
<configuration>
<property>
<name>dfs.nameservices</name>
<value>nameservice1</value>
</property>
<property>
<name>dfs.client.failover.proxy.provider.nameservice1</name>
<value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
<property>
<name>dfs.ha.automatic-failover.enabled.nameservice1</name>
<value>true</value>
</property>
<property>
<name>dfs.ha.namenodes.nameservice1</name>
<value>namenode80,namenode171</value>
</property>
<property>
<name>dfs.namenode.rpc-address.nameservice1.namenode80</name>
<value>servername02:8020</value>
</property>
<property>
<name>dfs.namenode.rpc-address.nameservice1.namenode171</name>
<value>servername03:8020</value>
</property>
</configuration>
Troubleshooting
Logging
Drill
/var/log/apache-drill/
Increase logging verbosity:
ALTER SYSTEM SET `exec.errors.verbose` = true;
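To limit verbose errors to your own connection rather than the whole cluster, you can set the option at session scope instead; it reverts when the session closes:
ALTER SESSION SET `exec.errors.verbose` = true;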
HBase
For further HBase logging, you need to modify "$DRILL_HOME/conf/logback.xml" (here, /opt/apache-drill/conf/logback.xml) on all Drillbit nodes and add the following snippet:
<logger name="org.apache.drill.exec.store.hbase" additivity="false">
<level value="debug" />
<appender-ref ref="FILE" />
</logger>
You will then see additional HBase logs in the /var/log/apache-drill/drillbit.log log file.
Example log results:
Took xxxx ms to get yyyyy records
UnknownHostException: nameservice1
There is a bug in Drill's Hive storage when connecting to HDFS HA. When querying a Hive database on HDFS HA, the query throws the error above (UnknownHostException: nameservice1).
Resolution: As of 11/2/2015 (version 1.2), Drill has not fixed this bug. To work around it:
- NOT RECOMMENDED: Add nameservice1 to the server's hosts file, pointed at the IP address of the current Active NameNode. Obviously this defeats the purpose of HDFS HA, but until Drill fixes the bug this workaround will help.
- RECOMMENDED: Configure Drill to use HDFS HA by adding specific sections of the hdfs-site.xml file to Drill's conf folder. See the configuration section above for details.
Query Timeout – UnknownScannerException
Drill queries would hang and never complete, and canceling the queries would not work.
In sqlline, run a query; if it hangs, press CTRL-C to cancel it. If the results then display, you have this problem. The query fails because the results are returned successfully but the connection does not close on the RegionServer, so the session hangs. To resolve the problem I had to restart the RegionServer.
This problem produced many errors in /var/log/apache-drill/drillbit.log:
org.objectweb.asm.tree.analysis.AnalyzerException: Error at instruction 84: Expected an object reference, but found .
at org.objectweb.asm.tree.analysis.Analyzer.analyze(Analyzer.java:294)
at org.objectweb.asm.util.CheckMethodAdapter$1.visitEnd(CheckMethodAdapter.java:450)
at org.objectweb.asm.MethodVisitor.visitEnd(MethodVisitor.java:877)
…
Caused by: org.objectweb.asm.tree.analysis.AnalyzerException: Expected an object reference, but found .
at org.objectweb.asm.tree.analysis.BasicVerifier.copyOperation(BasicVerifier.java:80)
at org.objectweb.asm.tree.analysis.BasicVerifier.copyOperation(BasicVerifier.java:47)
at org.objectweb.asm.tree.analysis.Frame.execute(Frame.java:276)
…
When you stopped the query, it would enter a PENDING or CANCELLATION_REQUESTED state and hang; eventually the following error would be thrown:
2015-12-14 17:22:16,786 [BitServer-1] WARN o.a.d.exec.rpc.control.WorkEventBus - A fragment message arrived but there was no registered listener for that message: profile {
state: FAILED
error {
error_id: "7277ec0e-cff6-41d9-9f43-3c3d2c84fa5b"
endpoint {
address: "servername04"
user_port: 31010
control_port: 31011
data_port: 31012
}
error_type: SYSTEM
message: "SYSTEM ERROR: RemoteWithExtrasException: org.apache.hadoop.hbase.UnknownScannerException: Name: 22491, already closed?\n\tat org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:2092)\n\tat org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:31443)\n\tat org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2035)\n\tat org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:107)\n\tat org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:130)\n\tat org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:107)\n\tat java.lang.Thread.run(Thread.java:745)\n\n\nFragment 1:0\n\n[Error Id: 7277ec0e-cff6-41d9-9f43-3c3d2c84fa5b on servername31:31010]"
exception {
exception_class: "org.apache.drill.common.exceptions.DrillRuntimeException"
message: "org.apache.hadoop.hbase.client.ScannerTimeoutException: 600330ms passed since the last invocation, timeout is currently set to 60000"
stack_trace {
class_name: "org.apache.drill.exec.store.hbase.HBaseRecordReader"
file_name: "HBaseRecordReader.java"
line_number: 185
method_name: "next"
is_native_method: false
}
stack_trace {
Finally, I saw the following error in the Drill log:
2015-12-14 18:28:02,790 [2990f646-eb31-bcb7-5c41-55a4e1b1a9d8:frag:1:0] ERROR o.a.d.exec.rpc.RpcConnectionHandler - Failed to establish connection
java.util.concurrent.TimeoutException: Timeout waiting for task.
at com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:276) ~[guava-14.0.1.jar:na]
at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:96) ~[guava-14.0.1.jar:na]
at org.apache.drill.exec.rpc.ReconnectingConnection$ConnectionListeningFuture.waitAndRun(ReconnectingConnection.java:115) ~[drill-java-exec-1.2.0.jar:1.2.0]
…
at java.lang.Thread.run(Thread.java:745) [na:1.7.0_71]
2015-12-14 18:28:02,812 [BitServer-1] INFO o.a.d.e.w.fragment.FragmentExecutor - Applying request for early sender termination for 1:0 -> 0:0.
2015-12-14 18:28:02,822 [BitServer-1] WARN o.a.d.exec.rpc.control.ControlServer - Message of mode REQUEST of rpc type 7 took longer than 500ms. Actual duration was 121081ms.
2015-12-14 18:28:02,870 [2990f646-eb31-bcb7-5c41-55a4e1b1a9d8:frag:1:0] INFO o.a.d.e.w.fragment.FragmentExecutor - 2990f646-eb31-bcb7-5c41-55a4e1b1a9d8:1:0: State change requested RUNNING --> FINISHED
2015-12-14 18:28:02,870 [2990f646-eb31-bcb7-5c41-55a4e1b1a9d8:frag:1:0] INFO o.a.d.e.w.f.FragmentStatusReporter - 2990f646-eb31-bcb7-5c41-55a4e1b1a9d8:1:0: State to report: FINISHED
2015-12-14 18:30:02,822 [WorkManager.StatusThread] ERROR o.a.d.exec.rpc.RpcConnectionHandler - Failed to establish connection
java.util.concurrent.TimeoutException: Timeout waiting for task.
at com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:276) ~[guava-14.0.1.jar:na]
at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:96) ~[guava-14.0.1.jar:na]
at org.apache.drill.exec.rpc.ReconnectingConnection$ConnectionListeningFuture.waitAndRun(ReconnectingConnection.java:115) ~[drill-java-exec-1.2.0.jar:1.2.0]
…
This pointed me to an HBase problem; since this was not a simple timeout, I restarted the HBase RegionServer and the problem was resolved. You should see an error corresponding to the UnknownScannerException in the HBase RegionServer log.