Configuring Apache Drill for Azure Blob Storage

Apache Drill lets BiG EVAL test cases reach many more data storage technologies than OLEDB or ODBC drivers alone. This article describes how to configure Apache Drill to access flat files stored in Azure Blob Storage.

Please Note!

The right settings depend on your infrastructure, your Apache Drill setup, your security requirements and more. We show an easy way to set up this configuration, but we cannot show the best way for your specific needs. Please ask an administrator or another expert to set up this connection correctly.

Prerequisites

We assume that you have already installed Apache Drill, either on the same server as BiG EVAL or on a dedicated machine. You can find the installation instructions in the Apache Drill Manual.

Installing Drill Drivers

Drill comes with a couple of preinstalled drivers. At the time of writing, these don’t include drivers for any Azure services, so we need to install them first.

Download the most current version of the following two drivers (JAR files), hadoop-azure and azure-storage, and copy them to the sub-folder $DRILL_HOME\jars\3rdparty\ of your Apache Drill installation.

Please Note!

The links provided point to the most current versions of the drivers at the time of writing. You can browse for newer versions using the following links.
https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-azure
https://repo1.maven.org/maven2/com/microsoft/azure/azure-storage
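As a rough illustration of how the download links are formed, the following Python sketch builds the JAR URLs from the two repository folders above. It assumes the standard Maven repository layout (artifact/version/artifact-version.jar). The version string X.Y.Z is a placeholder, not a recommendation, and the function name jar_url is made up for this example.

```python
# Sketch: derive download URLs for the two driver JARs.
# Assumes the standard Maven repository layout; versions are placeholders.

MAVEN_BASE = "https://repo1.maven.org/maven2"


def jar_url(group_id: str, artifact_id: str, version: str) -> str:
    """Build the JAR URL for a given Maven group, artifact and version."""
    group_path = group_id.replace(".", "/")
    return (
        f"{MAVEN_BASE}/{group_path}/{artifact_id}/{version}/"
        f"{artifact_id}-{version}.jar"
    )


# Hypothetical versions: look up the current ones in the folders listed above.
drivers = [
    ("org.apache.hadoop", "hadoop-azure", "X.Y.Z"),
    ("com.microsoft.azure", "azure-storage", "X.Y.Z"),
]

for group_id, artifact_id, version in drivers:
    print(jar_url(group_id, artifact_id, version))
```

Which versions work together depends on your Drill release, so check compatibility before copying the files into the 3rdparty folder.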

Set Up Azure Blob Storage Credentials

To access secured Azure Blob Storage containers, we need to set up the credentials within Apache Drill. Drill needs to know the access URL and the access key of the Azure Blob Storage account.

Where to get the Azure Credentials from?

Navigate to the Azure Blob Storage Account in the Azure Portal.

Make note of the Name of your Azure Blob Storage Account.

Navigate to the “Access Keys” section in the navigation bar and copy the Key after making it visible.

Configure Azure Credentials within Apache Drill

Next, we edit Apache Drill's configuration file $DRILL_HOME\conf\core-site.xml in a text editor.

Add the following property element within the <configuration> section. Replace the placeholders ACCOUNTNAME and KEY with the values you gathered in the previous section.

<property>
    <name>fs.azure.account.key.ACCOUNTNAME.blob.core.windows.net</name>
    <value>KEY</value>
</property>

The end result should look like the following screenshot.

Save the configuration file and restart Apache Drill.
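If you prefer to script this step, the following Python sketch adds the same property element to the contents of a Hadoop-style configuration file. It is only an illustration: the function names and the sample values are hypothetical, and it works on a string rather than on your real file so that nothing is overwritten by accident.

```python
import xml.etree.ElementTree as ET


def account_key_property(account_name: str, key: str) -> ET.Element:
    """Create the <property> element that holds the storage account key."""
    prop = ET.Element("property")
    name = ET.SubElement(prop, "name")
    name.text = f"fs.azure.account.key.{account_name}.blob.core.windows.net"
    value = ET.SubElement(prop, "value")
    value.text = key
    return prop


def add_account_key(config_xml: str, account_name: str, key: str) -> str:
    """Return the configuration XML with the account key property appended."""
    root = ET.fromstring(config_xml)
    if root.tag != "configuration":
        raise ValueError("expected a <configuration> root element")
    root.append(account_key_property(account_name, key))
    return ET.tostring(root, encoding="unicode")


# Placeholder values: replace them with your own account name and key.
updated = add_account_key(
    "<configuration></configuration>", "ACCOUNTNAME", "KEY"
)
print(updated)
```

Because the key grants access to the whole storage account, treat the resulting file like any other secret and restrict who can read it.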

Configure Storage Plugin in Drill

Now that Apache Drill knows the credentials for the Azure Storage Account, we need to configure the storage plugin in the Apache Drill Frontend.

Each storage plugin points to exactly one Blob Storage Container. If your Storage Account has several containers you intend to access, you need a separate plugin configuration for each of them. Keep this in mind when you name the storage plugin.

  1. Open the Apache Drill Frontend and navigate to the “Storage” section in the main menu.
  2. Create a new plugin using the “Create” button at the top.
  3. Give the storage plugin a unique name.
  4. Use the following JSON code as the plugin configuration. Make sure that you replace ACCOUNTNAME and CONTAINERNAME on the third line with your own values.
{
  "type": "file",
  "connection": "wasbs://CONTAINERNAME@ACCOUNTNAME.blob.core.windows.net/",
  "config": null,
  "workspaces": {
    "tmp": {
      "location": "/tmp",
      "writable": true,
      "defaultInputFormat": null,
      "allowAccessOutsideWorkspace": false
    },
    "root": {
      "location": "/",
      "writable": false,
      "defaultInputFormat": null,
      "allowAccessOutsideWorkspace": false
    }
  },
  "formats": {
    "psv": {
      "type": "text",
      "extensions": [
        "tbl"
      ],
      "delimiter": "|"
    },
    "csv": {
      "type": "text",
      "extensions": [
        "csv"
      ],
      "extractHeader": true,
      "delimiter": ","
    },
    "tsv": {
      "type": "text",
      "extensions": [
        "tsv"
      ],
      "delimiter": "\t"
    },
    "parquet": {
      "type": "parquet"
    },
    "json": {
      "type": "json",
      "extensions": [
        "json"
      ]
    },
    "avro": {
      "type": "avro"
    },
    "sequencefile": {
      "type": "sequencefile",
      "extensions": [
        "seq"
      ]
    },
    "csvh": {
      "type": "text",
      "extensions": [
        "csvh"
      ],
      "extractHeader": true,
      "delimiter": ","
    }
  },
  "enabled": true
}

  5. Save the plugin. It is immediately added to the list of your plugins. Make sure it is enabled.
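Since every container needs its own plugin, it can help to generate the configurations instead of editing them by hand. The following Python sketch builds a reduced version of the JSON above for several containers. The container names, the "az" name prefix and the function names are assumptions made for this example. Only the root workspace and the csv format are included to keep it short.

```python
import json


def wasbs_connection(container: str, account: str) -> str:
    """Build the connection URL for one Blob Storage container."""
    return f"wasbs://{container}@{account}.blob.core.windows.net/"


def plugin_config(container: str, account: str) -> dict:
    """Return a minimal file storage plugin configuration for one container."""
    return {
        "type": "file",
        "connection": wasbs_connection(container, account),
        "workspaces": {
            "root": {
                "location": "/",
                "writable": False,
                "defaultInputFormat": None,
                "allowAccessOutsideWorkspace": False,
            }
        },
        "formats": {
            "csv": {
                "type": "text",
                "extensions": ["csv"],
                "extractHeader": True,
                "delimiter": ",",
            }
        },
        "enabled": True,
    }


# Hypothetical account and container names; one plugin per container.
account = "ACCOUNTNAME"
containers = ["raw", "curated"]
plugins = {f"az{name}": plugin_config(name, account) for name in containers}

for plugin_name, config in plugins.items():
    print(plugin_name)
    print(json.dumps(config, indent=2))
```

Each printed JSON document can be pasted into the plugin configuration of the matching plugin name in the Drill Frontend.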

Configure and use the ODBC driver for Apache Drill

If Apache Drill is set up correctly, you can use Apache Drill's ODBC driver to access the Azure Blob Storage Container.

Set up the Apache Drill ODBC driver

The following example shows how to query a CSV file stored in Azure Blob Storage using the Apache Drill ODBC driver.

SELECT DISTINCT statecode FROM azraw.`/bigeval/sampledata/insurance/2019/2019_01/fl_insurance_sample.csv`
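To try such a query outside of BiG EVAL, a small Python sketch like the following may help. It assumes that the third-party package pyodbc is installed and that an ODBC data source for Drill has already been configured. The DSN name "DrillDSN" and all function names are hypothetical. The connection is opened with autocommit enabled on the assumption that the Drill driver does not support transactions.

```python
def drill_table(plugin: str, path: str) -> str:
    """Return a table reference such as azraw.`/folder/file.csv`."""
    if "`" in path:
        raise ValueError("path must not contain a backtick")
    return f"{plugin}.`{path}`"


def distinct_query(column: str, plugin: str, path: str) -> str:
    """Build a SELECT DISTINCT statement for one column of a file."""
    return f"SELECT DISTINCT {column} FROM {drill_table(plugin, path)}"


def run_query(sql: str, dsn: str = "DrillDSN"):
    """Run the query through ODBC; requires pyodbc and a configured DSN."""
    import pyodbc  # third-party package, imported only when actually used

    connection = pyodbc.connect(f"DSN={dsn}", autocommit=True)
    try:
        cursor = connection.cursor()
        cursor.execute(sql)
        return cursor.fetchall()
    finally:
        connection.close()


sql = distinct_query(
    "statecode",
    "azraw",
    "/bigeval/sampledata/insurance/2019/2019_01/fl_insurance_sample.csv",
)
print(sql)
# rows = run_query(sql)  # uncomment once the DSN exists on your machine
```

The plugin name in front of the path ("azraw" here) is the name you gave the storage plugin, so it selects the container the file is read from.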
