When designing an enterprise architecture for business intelligence, advanced analytics, and other data-centric applications, it is often useful to capture major data flows. This may require some research into use cases and tooling and even a bit of hard thinking, but it’s a straightforward exercise. What isn’t so straightforward is capturing the state of metadata that accompanies these data flows. Understanding metadata is critical to obtaining full value and control from your data architecture, but it doesn’t lend itself as well to a typical data flow view.

Common data concerns that stem from gaps or inconsistencies within the metadata domain include:

  • Lineage – What source data contributed to a given view or data set, and how has that data been altered to get there?
  • Impact Analysis – The converse of lineage; if a data source were to be altered, which consuming applications would be affected?
  • Access & Compliance – Who is able to view or alter data? Who has exercised that ability on a given data set? Have their larger-scale access patterns put them in danger of any compliance issues?
  • Quality – How complete and accurate is a particular view on a data set?
  • Implementation – How long and how much effort is required to add a new data source or consumer?

In order to structure and assess these concerns, a metadata lifecycle model (MDLM) can be a useful architectural tool. The MDLM can be used as a framework to evaluate candidate architectures.

The Metadata Lifecycle Model

There are three user-driven events that segment different stages in the metadata lifecycle. Addressing all lifecycle stages for each data concern is a best practice; stages that are skipped or are inconsistent are primary contributors to the concerns outlined above.

The identified lifecycle stages are gated in time by the occurrence of these events:

  • Data Source Import – Incorporating a new data source involves not only ingesting the raw data, but also understanding the nature of that data. Note that this event is defined here as making the data available for use by applications; simple ingest-and-store (as per a Data Lake architecture) does not meet the definition.
    • Generate – Metadata describing the data source must be generated (this may be an automated or manual process, depending upon the nature of the metadata and the available tooling). Data import is an ideal time to capture metadata, as research and documentation are more likely to be available at this time; further, there may be additional time to invest, as there may not yet be immediate demand for that data source.
    • Persist – This generated metadata must be captured into a documentation system that will support later discovery and usage of this metadata.
  • Consumer Design – When a consuming view or application is designed, it operates in a specific metadata context. While some types of metadata may be defined by a public or private standard, this is often not the case, or the standard may not be strict enough to pin down every aspect of the metadata (the non-standard standard is standard).
    • Generate – The assumptions within the metadata domain are declared for the consumer. This ensures that the consumer’s operating environment is well-understood by the designer, and provides an opportunity below to address differences in metadata from the source(s).
    • Note that within the framework presented here, this is a declarative step; differences from source metadata are addressed below in the Discover and Mediate stages.
    • Persist – The consumer may itself become the source to another “downstream” consumer in the future, and therefore its metadata should be similarly persisted.
    • Discover – As sources are identified to feed into the consumer, their metadata documentation must be discovered within the persistence tool.
    • Mediate – With Source and Consumer metadata in hand, differences must be identified and a reconciliation strategy defined. Specific actions will depend on the type of metadata, but examples include ETL design or access log configuration.
    • A well-designed metadata management solution will not just persist and index metadata, but will also help catalog a library of mediation strategies, allowing previous work to be reused.
  • Consumer Runtime – When the consumer is run over actual data by an end user, the strategies from the metadata domain described above must be applied efficiently to the runtime data domain.
    • Apply – Mediation strategies are applied to the data stream/query. Examples include execution of an ETL job and appending to a lineage chart.
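One way to use the model as an evaluation framework is to record which stages a candidate architecture covers for each metadata type and list what is missing. The following is a minimal sketch under that assumption; the names `Event`, `STAGES_BY_EVENT`, `missing_stages`, and `coverage_gaps` are hypothetical and not part of any tool.

```python
from enum import Enum


class Event(Enum):
    """The three user-driven events that gate the lifecycle stages."""
    DATA_SOURCE_IMPORT = "Data Source Import"
    CONSUMER_DESIGN = "Consumer Design"
    CONSUMER_RUNTIME = "Consumer Runtime"


# Stages gated by each event, in the order the model lists them.
STAGES_BY_EVENT = {
    Event.DATA_SOURCE_IMPORT: ["Generate (Source)", "Persist (Source)"],
    Event.CONSUMER_DESIGN: [
        "Generate (Consumer)",
        "Persist (Consumer)",
        "Discover",
        "Mediate",
    ],
    Event.CONSUMER_RUNTIME: ["Apply"],
}

ALL_STAGES = [stage for stages in STAGES_BY_EVENT.values() for stage in stages]


def missing_stages(covered):
    """Return the lifecycle stages, in model order, that `covered` omits."""
    covered = set(covered)
    return [stage for stage in ALL_STAGES if stage not in covered]


def coverage_gaps(architecture):
    """Map each metadata type to its uncovered stages.

    `architecture` maps a metadata type name to the stages a candidate
    architecture addresses for it. Fully covered types are omitted.
    """
    gaps = {}
    for metadata_type, covered in architecture.items():
        missing = missing_stages(covered)
        if missing:
            gaps[metadata_type] = missing
    return gaps
```

For example, a candidate that covers every stage for Context but only generates, persists, and discovers source lineage would report the four remaining Lineage stages as gaps.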

Types of Metadata

The MDLM applies to metadata in general, but it is most useful to identify specific types of metadata that pertain to particular goals, and examine individual treatment of these types throughout the lifecycle. The following metadata types serve to address the concerns outlined earlier in this document:

  • Context – Schema, format, and semantics
  • Lineage – Evolution of data: starting with the System of Record (SOR), identify all “touches” in the logic chain applied to the data
  • Security – Specific access rules for given data. This may additionally include a record of previous accesses (i.e. an audit log).

Note that this is not a comprehensive list, and that the MDLM can be generally applied to other types of metadata.

To conclude, we will apply the MDLM to these metadata types. Consider a data warehousing architecture as a motivating backdrop.

Context

  • Generate (Source) – Identify schema and column formatting; correlate specific entities and attributes with Master Data definitions and business glossary terms.
  • Persist (Source) – Store this information in a metadata documentation system.
  • Generate (Consumer) – Identify specific semantic concepts required for the consumer, and select a target schema/format in which to work with those concepts. This may be identification or creation of an operational standard.
  • Persist (Consumer) – Store this information in a metadata documentation system.
  • Discover – Select data sources, and locate context documentation.
  • Mediate – Locate or design ETL (ELT, etc.) logic to transform data between source(s) and consumer.
  • Apply – Execute the ETL job.
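The Mediate and Apply stages for context metadata can be illustrated with a small sketch. It assumes that source and consumer columns are each tagged with a shared business glossary term, and that mediation amounts to a rename plus a type cast. The column names, glossary terms, and the functions `mediate` and `apply_plan` are hypothetical; real ETL logic would also handle units, formats, and nulls.

```python
# Hypothetical context metadata: column name -> (glossary term, type name).
SOURCE_CONTEXT = {
    "cust_id": ("customer identifier", "str"),
    "amt": ("order amount", "int"),
}
CONSUMER_CONTEXT = {
    "customer_id": ("customer identifier", "str"),
    "amount": ("order amount", "float"),
}

# Type names used in the metadata, mapped to Python casts.
CASTS = {"str": str, "int": int, "float": float}


def mediate(source, consumer):
    """Pair columns that share a glossary term.

    Returns a plan of the form {consumer_column: (source_column, cast)}.
    """
    by_term = {term: col for col, (term, _type) in source.items()}
    plan = {}
    for col, (term, type_name) in consumer.items():
        if term not in by_term:
            raise KeyError(f"no source column carries the term {term!r}")
        plan[col] = (by_term[term], CASTS[type_name])
    return plan


def apply_plan(plan, row):
    """Apply a mediation plan to one source row (the runtime Apply stage)."""
    return {col: cast(row[src]) for col, (src, cast) in plan.items()}
```

Keeping the plan as data, separate from the code that applies it, is one way to make a mediation strategy reusable across consumers that draw on the same source.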

Lineage

  • Generate (Source) – Document the SOR. If this is a relative SOR that sources data “upstream”, an interface may be available to provide richer lineage.
  • Persist (Source) – If this is a new SOR, add it to a central directory of data sources and processes.
  • Generate (Consumer) – Specify a designation for the new consumer.
  • Persist (Consumer) – Add this designation to the central directory of data sources and processes.
  • Discover – Locate data source SOR designations.
  • Mediate – Design an append operation to the lineage chart.
  • Apply – Execute the append operation to the lineage chart.
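The lineage chart and its append operation can be sketched as a directed graph. This is an illustration only, assuming the chart is held in memory and that nodes are identified by their designation in the central directory; the class `LineageChart` and the example designations are hypothetical. The same graph answers the impact analysis question by walking edges in the opposite direction.

```python
from collections import defaultdict


class LineageChart:
    """Directed graph of data sources and consumers, keyed by designation."""

    def __init__(self):
        # consumer -> set of (source, process) pairs that feed it
        self._upstream = defaultdict(set)
        # source -> set of consumers that read from it
        self._downstream = defaultdict(set)

    def append(self, source, consumer, process):
        """Record that `process` moved data from `source` to `consumer`."""
        self._upstream[consumer].add((source, process))
        self._downstream[source].add(consumer)

    def lineage(self, node):
        """Return every designation that contributes data to `node`."""
        seen, stack = set(), [node]
        while stack:
            current = stack.pop()
            for source, _process in self._upstream.get(current, ()):
                if source not in seen:
                    seen.add(source)
                    stack.append(source)
        return seen

    def impact(self, node):
        """Return every designation affected if `node` were altered."""
        seen, stack = set(), [node]
        while stack:
            current = stack.pop()
            for consumer in self._downstream.get(current, ()):
                if consumer not in seen:
                    seen.add(consumer)
                    stack.append(consumer)
        return seen
```

With two appends, `crm_db` to `warehouse.customers` and then `warehouse.customers` to `churn_dashboard`, the lineage of the dashboard includes both upstream nodes, and the impact of altering `crm_db` includes both downstream nodes.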

Security

  • Generate (Source) – Define access rules for the data source.
  • Persist (Source) – Store this information in a metadata store or security enforcement system.
  • Generate (Consumer) – Identify required access to data sources; specify downstream access restrictions.
  • Persist (Consumer) – Store this information in a metadata store or security enforcement system.
  • Discover – Identify available access to the required data sources.
  • Mediate – Select or create a security principal that supports all required access on the data source.
  • Apply – Utilize this principal against the data source.
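The Mediate step for security can be sketched as a search over existing principals. The sketch assumes access rules are modeled as (data source, action) pairs and, as an added design choice not stated above, prefers the covering principal with the fewest grants. The function `select_principal` and all principal names are hypothetical.

```python
def select_principal(required, principals):
    """Pick a principal whose grants cover every required access.

    `required` is a set of (data_source, action) pairs.
    `principals` maps a principal name to its set of granted pairs.
    Returns the covering principal with the fewest grants (ties broken
    by name), or None if no existing principal suffices and a new one
    would have to be created.
    """
    candidates = [
        (len(grants), name)
        for name, grants in principals.items()
        if required <= grants  # subset test: all required access is granted
    ]
    if not candidates:
        return None
    return min(candidates)[1]


# Hypothetical principals and their grants.
PRINCIPALS = {
    "etl_admin": {("orders", "read"), ("orders", "write"), ("customers", "read")},
    "report_reader": {("orders", "read"), ("customers", "read")},
    "orders_only": {("orders", "read")},
}
```

Here a consumer that needs read access to both `orders` and `customers` would be matched to `report_reader` rather than the broader `etl_admin`, while a request for access that no principal holds returns `None`.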

In conclusion, the MDLM can provide a useful framework for assessing metadata gaps in a data architecture. If these steps are well covered, the risk of a flawed architecture is greatly reduced, and the overall effort can proceed to address specific tooling solutions.

There is a nuance about Big Data analysis. It’s really about small data. While this may seem confusing and counter to the whole Big Data “movement”, small data is the product of Big Data analysis. This is not a new concept, nor is it unfamiliar to people who have been doing data analysis for any length of time. The overall working space is larger, but the answers lie somewhere in the “small”.

In the old days of traditional data analysis, we began with databases filled with customer information, product information, transactions, telemetry data, etc. Even then, there was too much data available to efficiently analyze. Systems, networks, and software didn’t have the performance or capacity to address the scale. As an industry we addressed the shortcomings by creating smaller data sets.

These smaller data sets were still fairly substantial, and we quickly discovered other shortcomings, the most glaring of which was the mismatch between the data and the working context. If I worked in accounts payable, I had to look at a large amount of unrelated data in order to do my job. Again the industry responded by creating smaller, contextually relevant data sets. Big to small to smaller still.

You may recognize this as the migration from production databases to Data Warehouses to Data Marts. More often than not, the data for the warehouses and the marts was chosen on arbitrary or experimental parameters, resulting in a great deal of trial and error. All too often, the data was chosen to support an output or a conclusion we wanted to see, as opposed to discovering something new, interesting, or anomalous. We weren’t getting the perspectives we needed, or that were possible, because the capacity reductions weren’t based on computational fact.

Enter Big Data, with all its volumes, velocities, and varieties, and the problem remains or perhaps worsens. We have addressed the shortcomings of the infrastructure and can store and process huge amounts of additional data, but we have also had to introduce new technologies specifically to help us manage Big Data. If we think this is challenging now, just wait a year or two. The emergence and inevitability of ubiquitous machine data is just around the corner. Don’t be scared, be prepared!

Despite outward appearances, this is a wonderful thing. Today and in the future we will have more data than we can imagine, and we’ll have the means to capture and manage it. What is more necessary than ever is the ability to analyze the right data in a timely enough fashion to make decisions and take actions. We will still shrink the data sets into “fighting trim”, but we can do so computationally. We process the Big Data and turn it into small data so it’s easier to comprehend. It’s more precise and, because it was derived from a much larger starting point, more contextually relevant.