Skip to Content

Tableau for Big Data Analysis – Part 2

Blog | April 26, 2022 | By Darian Pena

I. Understanding the Hadoop Ecosystem

The core technology for data storage is the Hadoop Distributed File System (HDFS). As discussed in the first part of this series Hadoop redundantly stores blocks (pieces of data) across servers. HDFS stores and manages data, while MapReduce processes and computes data in HDFS. However, there are abstractions built on top of HDFS that allow technology and analytics professionals to query data using SQL. These SQL layers introduce simplicity to the Hadoop ecosystem by removing the need to use programmatic methods of accessing and analyzing data and using query interfaces instead.

II. Hive

Hive provides a convenient SQL abstraction from the underlying data stored in HDFS. This allows users to query and analyze information in files. Furthermore, data split into multiple files can be accessed on one table instead of having to read one file at a time through HDFS. Hive, like HDFS, utilizes the MapReduce processing layer to access the underlying data. The issue with this structure is that this process creates latency issues whenever Tableau attempts to load data when dashboards are refreshed and updated. This makes it difficult to leverage the seamless real time interaction Tableau offers to users through dashboard drill downs.

III. Impala

Impala is the best performing solution because it circumvents MapReduce processing. Fast execution of SQL queries in Impala facilitates quick reads and refreshes in Tableau. There are also several ways to improve workbook performance for dashboards pointing to Impala data. On the Tableau side, some of the best optimizations include filtering and hiding unused fields, limit string related calculations in Tableau unless it’s necessary and avoiding joins of various data sources. Joins should generally be used as a process for refining the underlying data source. Performance optimizations that can be implemented with Impala include storing data in Parquet files as discussed earlier with Hive. Furthermore, partitioning data based on columns that have an ideal level of granularity allows tables to be split into similarly sized files. In addition, partitioning by small integer fields such as TINYINT instead of dates or strings make data reads much faster for Impala and consequently improves latency downstream with Tableau.

IV. Hue

The Hadoop User Experience (Hue) is an open-source web application that facilitates querying of data in Hive and Impala. It also provides users with the ability to examine table schemas and create new datasets. It also allows users to navigate HDFS interactively without relying on a command line interface. Hue is available in the Cloudera Data Platform (CDP), AWS, Google Cloud Platform, and Microsoft Azure.

Source: Cloudera

V. Using Hive & Impala with Tableau

When using dashboards that are pre-defined and cannot be customized by viewers, Hive can be a reliable data source for Tableau developers. However, there are important limitations to using Hive on Tableau. With the batch-oriented nature of Hive, it does not process small queries efficiently and will inevitably lead to slow processing of Tableau sheets and dashboards. This problem compounds when presenting visualizations that rely on calculated fields.

On the other hand, Impala by-passes the slow MapReduce architecture of Hive and supports rapid processing of interactive and ad-hoc queries. As a result, Impala is the ideal data source for creating dynamic, low-latency Tableau visualizations.

author image
About the Author
Big Data, Cloud, and Business Intelligence professional with experience supporting large scale Data & Analytics initiatives.
Darian Pena | Senior Analyst | USEReady
Back to top