Recently, a customer asked us to help transition a set of data flows from an overwhelmed RDBMS to a “Big Data” system. The flows were batch-oriented, and there was some in-house comfort with Pig Latin, which made a Pig-on-Hadoop stack an ideal target platform for the production data flows (with architectural flexibility to adopt Spark and other technologies for new functionality, but I digress).

One wrinkle compared to a vanilla Hadoop deployment: they wanted schema enforcement soup-to-nuts. A first instinct might be that this is simply a logical data warehouse, and perhaps it is. One hears so much these days about Hadoop and Data Lakes and Schema-On-Read as the new shiny that it is easy to forget that Schema-On-Write also has a time and a place. As with most architectural decisions, there are tradeoffs, and right (bad pun… intended?) times for each.

Schema-On-Read works well when:

  • Different valid views can be projected onto a given data set, one that may not be well understood or that applies across a number of varied use cases (see the Pig Latin sketch after this list).
  • Flexibility outweighs performance.
  • The variety “V” is a dominant characteristic. Not all data will fit neatly into a given schema, and not all will actually be used; save the effort until it is known to be useful.
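
To make the first point above concrete, here is a minimal Pig Latin sketch (paths and field names are hypothetical): the raw files carry no enforced schema, and each flow projects its own view onto them at LOAD time.

    -- One flow reads the raw clickstream as typed page-view events.
    clicks = LOAD '/data/raw/clicks' USING PigStorage('\t')
             AS (user_id:chararray, url:chararray, ts:long);

    -- Another flow interprets the very same bytes differently, here
    -- keeping the timestamp as text for a pass-through audit feed.
    audit  = LOAD '/data/raw/clicks' USING PigStorage('\t')
             AS (user_id:chararray, url:chararray, ts:chararray);

Nothing is validated until the data is read; a malformed field simply becomes a null in whichever flow touches it, which is exactly the flexibility (and the risk) being traded.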

Schema-On-Write may be a better choice when:

  • Productionizing established flows using well-understood data.
  • Working with data that is more time-sensitive at use than it is at ingest. Fast interactive queries fall into this category, and traditional data warehousing reports do as well.
  • Data quality is critical – schema enforcement and other validation prevent “bad” data from being written, removing this burden from the data consumer (see the HCatalog sketch after this list).
  • Governance constraints require metadata to be tightly controlled.
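
On a Pig/Hadoop stack, one way to get that soup-to-nuts enforcement is to land curated output in Hive tables via HCatalog rather than in bare files. The sketch below is only illustrative, assuming a table created ahead of time and Pig launched with HCatalog support (e.g. pig -useHCatalog); names, paths, and the partition value are hypothetical.

    -- Assumes a Hive table defined up front with an explicit schema, e.g.:
    --   CREATE TABLE analytics.clicks_clean
    --     (user_id STRING, url STRING, ts BIGINT)
    --   PARTITIONED BY (dt STRING) STORED AS ORC;

    raw   = LOAD '/data/raw/clicks' USING PigStorage('\t')
            AS (user_id:chararray, url:chararray, ts:long);

    clean = FILTER raw BY user_id IS NOT NULL AND ts IS NOT NULL;

    -- HCatStorer checks the relation against the table's schema at write
    -- time; a mismatch fails the job instead of landing "bad" data.
    STORE clean INTO 'analytics.clicks_clean'
          USING org.apache.hive.hcatalog.pig.HCatStorer('dt=2015-01-01');

With this arrangement the schema lives with the table rather than with each script, which is what gives the data-quality and governance points above their teeth.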