Skip to Content

Data Observability – A Tech Primer

Blog | November 16, 2022 | By Shriram Sankaran

In the past, data infrastructure was built to handle small amounts of data–usually operational data from a few internal data sources–and it wasn’t expected to change drastically. However, in the recent past, organizations rely on data both from internal and external sources, and the sheer volume and velocity at which this data is collected can cause unexpected drift, schema changes, transformations, and delays.

Modern data stacks provide a wide variety of functionality, allowing users to store and query data in multiple ways. The more the functionality incorporated, the larger the complexity to just keep the platform functional. This complexity continues to grow given that one has to deal with

  • More external data sources
  • More complicated transformations
  • Sharpened focus on analytics engineering

Downstream reporting applications, BI systems rendered from these data platforms are often out of sync with source systems, seemingly unusable by data consumers across the organization. The reason being partial, erroneous, missing, or otherwise inaccurate data.  This makes data consumers to lose confidence in their ability to use data in key decision-making processes.

Fig 1: Evolution of data stack

Data Quality:

Data quality measures the condition of a set of data—how well suited it is for an organization’s needs. As rightly believed, high-quality data is far more reliable and readily available for use as compared to poor, low-quality data. Organizations constantly strive to improve the data quality.

In this direction, most organizations will measure data quality along these six distinct dimensions:

  • Accuracy, or how many errors contained within a given dataset. To measure accuracy, datasets are compared to a reference dataset. 
  • Completeness, or whether all critical fields are corrected captured? To measure completeness, calculate the percentage of records containing incomplete/partial data.
  • Consistency, or whether similar data pulled from two or more datasets agree with each other. Inconsistent data indicate inaccuracies in one or more datasets. 
  • Timeliness, or how recent is the data? Recent/fresh data tends to reflect business changes more accurately. 
  • Uniqueness, or whether there are any duplicates in a data set. Merge or purge duplicate data, as appropriate.
  • Validity, or how well data conforms to standard formats. It’s difficult to use data if it is in the wrong format

However, the concept of data quality is also a reactive process and does not save us from the damages which can happen to downstream applications. This creates an opportunity to move to a more robust & scalable solution.

Data Observability – A key DataOps process:

Data observability is an organization’s ability to fully understand the health of the data in their systems. It eliminates data downtime by applying best practices learned from DevOps to data pipelines. Data observability tools use automated monitoring, alerting, and triaging to identify and evaluate data quality and discoverability issues. This leads to healthier pipelines, more productive teams, and happier customers.

The five pillars of data observability are:

  • Freshness describes how current the data is and how often the data is updated.
  • Distribution details if data values fall within an acceptable range. Data outside this range may not be trustworthy. 
  • Volume gauges if data is complete. Inconsistent data volume indicates issues with data sources.
  • Schema tracks changes in data organization—who made what changes to the data, and when.
  • Lineage records and documents the entire flow of data from initial sources to end consumption.

How Are Data Quality and Data Observability Similar—and How Are They Different?

Both data quality and data observability are concerned with the usefulness of an organization’s data. To this end, they are both immensely important to an organization and complement each other. 

That said, data quality and data observability have slightly different goals. Data quality aims to ensure more accurate, more reliable data. Data observability seeks to ensure the quality and reliability of the entire data delivery system. Data quality is concerned with data itself, while data observability is concerned with the system that delivers that data. 

Hence data observability goes beyond just monitoring data and alerting users about data quality issues. Data observability attempts to identify data collection and management issues and fix those big-picture issues at the source. When data observability kicks in, the resultant is better quality data. 

Key differences between data quality and data observability mentioned below:

  • Data quality examines data at rest (in datasets), while data observability addresses data in motion (through data pipelines)
  • Data quality focuses on correcting individual data errors, while data observability focuses on fixing systemic problems.
  • Data quality utilizes static rules and metrics, while data observability uses dynamic automated and adaptive rules and metrics.
  • Data quality deals with the results of data issues, while data observability deals with the root causes of those issues.

Data Quality Vs Data Observability:

Key AreaTraditional Data QualityData Observability
Scope for scanning and remediationDatasetsDatasets (data at rest), Data pipelines (data in motion)
FocusCorrecting errors in dataReducing the cost of rework, remediation, and data downtime by observing data, data pipelines, and event streams
ApproachFinding ‘known’ issuesDetecting ‘unknown’ issues
Rules and metricsManual static rules and metricsAutomated dynamic, adaptive rules and metrics
Root cause investigationNoThrough data lineage, time series analysis, and cross-metric anomaly detection
Key personsData stewards, Business analystsData engineers, DataOps engineers
Common use casesTrusted reporting, compliance [Downstream]Anomaly detection, pipeline monitoring, data integration [Upstream]

Where do they overlap?

Data observability and data quality overlap when data observability is used to better data quality. When organizations adopt data observability to enhance data quality, there are bound to be great results. Some of them include

  1. Cost savings by getting data anomalies before they impact consumers-When an anomaly occurs, the data observability engine alerts the team immediately, allowing time to investigate and troubleshoot the problem before it affects consumers. Since the Data Engineering team is notified of the issue before it involves stakeholders, they could fix their pipeline and avert future anomalies from jeopardizing the integrity of their data.
  2. Improved collaboration by tracing field-level lineage-data observability helps understand dependencies.
  3. Raising productivity by keeping tabs on deprecated data sets-data observability gives greater transparency into critical data assets’ relevancy and usage patterns, informing them when different attributes are deprecated.
  4. Increase cost savings by reducing time to resolve tiresome data fire drills and regain trust in critical decision-making data.
  5. Better organization between data engineering and data analyst teams to comprehend critical dependencies between data assets.
  6. Drive greater efficiency and productivity by adding end-to-end visibility into the health, usage patterns, and relevancy of data assets.
author image
About the Author
He comes with more than a decade of experience in the field of data analytics & engineering. He has worked on several high impact projects in the field of data warehousing, data engineering, business intelligence.
Shriram Sankaran | Solutions Architect | USEReady
Back to top