Big Data Analysis in Small Sets

There is a nuance to Big Data analysis: it’s really about small data. While this may seem confusing and counter to the whole Big Data “movement”, small data is the product of Big Data analysis. This is not a new concept, nor is it unfamiliar to people who have been doing data analysis for any length of time. The overall working space is larger, but the answers lie somewhere in the “small”.

In the old days of traditional data analysis, we began with databases filled with customer information, product information, transactions, telemetry data, etc. Even then, there was too much data available to efficiently analyze. Systems, networks, and software didn’t have the performance or capacity to address the scale. As an industry we addressed the shortcomings by creating smaller data sets.

These smaller data sets were still fairly substantial, and we quickly discovered other shortcomings, the most glaring of which was the mismatch between the data and the working context. If I worked in accounts payable, I had to look at a large amount of unrelated data in order to do my job. Again the industry responded by creating smaller, contextually relevant data sets. Big to small to smaller still.

You may recognize this as the migration from production databases to Data Warehouses to Data Marts. More often than not, the data for the warehouses and the marts was chosen on arbitrary or experimental parameters, resulting in a great deal of trial and error. All too often, the data was chosen to support an output or a conclusion we wanted to see, as opposed to discovering something new, interesting or anomalous. We weren’t getting the perspectives we needed, or that were possible, because the capacity reductions weren’t based on computational fact.

Enter Big Data with all its volumes, velocities, and varieties, and the problem remains or perhaps worsens. We have addressed the shortcomings of the infrastructure and can store and process huge amounts of additional data, but we also had to introduce new technologies specifically to help us manage Big Data. If we think this is challenging now, just wait a year or two. Ubiquitous machine data is inevitable, and it is just around the corner. Don’t be scared, be prepared!

Despite outward appearances, this is a wonderful thing. Today and in the future we will have more data than we can imagine, and we’ll have the means to capture and manage it. What is more necessary than ever is the ability to analyze the right data in a timely enough fashion to make decisions and take actions. We will still shrink the data sets into “fighting trim”, but we can do so computationally. We process the Big Data and turn it into small data so it’s easier to comprehend. It’s more precise and, because it was derived from a much larger starting point, more contextually relevant.
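
To make the "big to small" step concrete, here is a minimal Python sketch of one way a large stream of records might be reduced computationally to a small, context-specific summary. Everything in it is an assumption made for illustration: the record schema, the department names, and the helper functions `synthetic_transactions` and `summarize` are hypothetical, and the data is generated rather than real.

```python
from typing import Iterable, Iterator


def synthetic_transactions(n: int) -> Iterator[dict]:
    """Yield n made-up transaction records one at a time (hypothetical schema)."""
    departments = ("accounts_payable", "sales", "support")
    for i in range(n):
        yield {
            "department": departments[i % 3],
            "amount": float(i % 100),
        }


def summarize(records: Iterable[dict], department: str) -> dict:
    """Reduce a stream of records to a small summary for one working context."""
    count = 0
    total = 0.0
    largest = None
    for rec in records:
        # Skip everything outside the working context we care about.
        if rec["department"] != department:
            continue
        count += 1
        total += rec["amount"]
        if largest is None or rec["amount"] > largest:
            largest = rec["amount"]
    return {
        "department": department,
        "count": count,
        "total": total,
        "mean": total / count if count else 0.0,
        "max": largest,
    }


# The "big" input is streamed and never held in memory all at once;
# the "small" output is a handful of numbers for a single context.
small = summarize(synthetic_transactions(300_000), "accounts_payable")
print(small)
```

The summary is computed from the full stream rather than from a hand-picked subset, which is the sense in which the reduction is computational rather than arbitrary.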