You could be forgiven for thinking that Big Data systems are all similar. In fact, 'Big Data' is a family of complementary systems, each of which fills a different role in a Big Data 'value chain'. Which one you select for your application depends on the specific problem you are trying to solve. Let's familiarise ourselves with this value chain. In later posts we can probe deeper into each individual stage:
Big Data Value Chain
- Collection: getting the data we will use to create our information.
- Integration: combining data from multiple sources and applying context to the data (date & time, source, name, quality, etc.) so that we can use it.
- Modelling: applying a model of the real world the data came from -- a factory, a patient, a web-site, a social media service, with all the meaning and behaviour that goes with this context -- so that the disparate pieces of data we collected become information about the real-world environment we are interested in and which has value to us.
- Analysis: making judgements about what is happening in the real world. Depending on the context, this might be defect rates vs targets, bottlenecks, comparisons of physiological parameters to population norms, statistical trends, or free-to-paid user conversion rates.
- Presentation: remember that time -- Latency and Freshness -- is critical to the value of a Big Data system. In many Big Data applications, it is critical that information is pushed to those relying on it as it is created, or at the very least is available to them 'immediately' when they request it. Information with value always has a time attribute, so we will be saying a lot more about this.
- Storage: once we have created the information we need, we need to preserve it in as rich a way as possible so that we can come back to it over time. Many discussions of Big Data only start at this point, as if the information came into being spontaneously! As we will see, some systems create information as a natural by-product. Many do not, and the Big Data practitioner, or 'Data Scientist', will find that those that do rarely provide, by accident, the complete modelling and analysis that their Users require.
- Integration: once again, even when the information required has been safely stored, it may not all be in the same store. Information required for historical analysis often needs to be integrated from multiple stores.
- Finally we get to Historical Analysis and Reporting. We separate these because, although superficially similar, they are in fact quite different: a Report is a way to present information in the same way repeatedly, which is of use to multiple Users, while Analysis is typically a more complex manipulation, performed by an individual to explore the information and gain new insights.
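
To make the first few stages a little more concrete, here is a minimal sketch of Collection, Integration, Modelling, Analysis and Presentation chained together in Python. Everything in it is an assumption made for illustration: the factory-line example, the `Reading` type, the function names and the target value are all hypothetical, and Storage and Historical Analysis are left out entirely. A real system would use a different tool at each stage.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from statistics import mean


@dataclass
class Reading:
    """A raw value plus the context added at the Integration stage (hypothetical type)."""
    source: str
    name: str
    value: float
    timestamp: datetime


def collect():
    # Collection: raw (source, name, value) tuples, hard-coded here for illustration.
    return [("line-1", "defects", 3.0), ("line-2", "defects", 7.0), ("line-1", "defects", 5.0)]


def integrate(raw, now):
    # Integration: attach context (source, name, time) so the data can be used.
    return [Reading(source, name, value, now) for source, name, value in raw]


def model(readings):
    # Modelling: group values by the real-world thing they describe (a production line).
    by_line = {}
    for reading in readings:
        by_line.setdefault(reading.source, []).append(reading.value)
    return by_line


def analyse(by_line, target):
    # Analysis: judge each line's mean defect count against a target.
    return {line: mean(values) <= target for line, values in by_line.items()}


def present(judgements):
    # Presentation: turn the judgements into something a person can read.
    return [f"{line}: {'on target' if ok else 'over target'}"
            for line, ok in sorted(judgements.items())]


now = datetime(2024, 1, 1, tzinfo=timezone.utc)
report = present(analyse(model(integrate(collect(), now)), target=5.0))
print("\n".join(report))
```

Each function only sees the output of the stage before it, which is the point of thinking in terms of a chain: you can swap the implementation of one stage without disturbing the others.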