Monday, June 4, 2012

The Big Data Value Chain

So Big Data systems are different in kind from 'traditional' information systems, and are motivated by the need to deliver large volumes of information with little latency. What do these systems look like?

You could be forgiven for thinking that Big Data systems are all similar. In fact, 'Big Data' is actually a family of complementary systems, each of which fills a different role in a Big Data 'value chain'. Which one you select for your application depends on the specific problem you are trying to solve. Let's familiarise ourselves with this Value Chain. In later posts we can probe deeper into each individual stage:
[Figure: Big Data Value Chain]
Reading left to right -- fairly normal! -- we can see the Value Chain comprises 9 steps:
  1. Collection: getting the data we will use to create our information.
  2. Integration: combining data from multiple sources and applying context to the data -- date & time, source, name, quality, etc. -- so that we can use it.
  3. Modelling: applying a model of the real world the data came from -- a factory, a patient, a web-site, a social media service, with all the meaning and behaviour that goes with this context -- so that the disparate pieces of data we collected become information about the real-world environment we are interested in and which has value to us.
  4. Analysis: making judgements about what is happening in the real world. This might be defect rates vs targets, bottlenecks, comparisons of physiological parameters to population norms, statistical trends, free to paid user conversion rates, depending on the context.
  5. Presentation: remember that time -- Latency and Freshness -- is critical to the value of a Big Data system. In many Big Data applications, it is critical that information is pushed to those relying on it as it is created, or at the very least is available to them the moment they request it. There is always a time attribute to information with value, so we will be saying a lot more about this.
  6. Storage: once we have created the information we need, we need to preserve it in as rich a form as possible so that we can come back to it over time. Many discussions of Big Data only start at this point, as if the information came into being spontaneously! As we will see, some systems create information as a natural by-product. Many do not, and the Big Data practitioner, or 'Data Scientist', will find that even those that do rarely provide, by accident, the complete modelling and analysis their Users require.
  7. Integration: once again, even when the information required has been safely stored, it may not all be in the same store. Information required for historical analysis often needs to be integrated from multiple stores.
  8. Finally we get to Historical Analysis and Reporting. We separate these because although superficially similar, they are in fact quite different: a Report is a way to present information in the same way repeatedly, which is of use to multiple Users, while Analysis is typically more complex manipulation, performed by an individual to explore the information and gain new insights.
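To make the early stages of the chain concrete, here is a toy sketch of Collection through Analysis in Python. Everything in it -- the sources, field names and the target defect rate -- is hypothetical, invented purely for illustration:

```python
from datetime import datetime, timezone

# Collection: hypothetical raw readings from two sources.
source_a = [{"machine": "M1", "defects": 3, "units": 200}]
source_b = [{"machine": "M2", "defects": 9, "units": 150}]

def integrate(readings, source_name):
    # Integration: tag each record with context -- its source and
    # the time of collection -- so we can use it downstream.
    now = datetime.now(timezone.utc).isoformat()
    return [dict(r, source=source_name, collected_at=now) for r in readings]

records = integrate(source_a, "plant-a") + integrate(source_b, "plant-b")

def model(record):
    # Modelling: derive a real-world quantity (a defect rate)
    # from the raw counts, turning data into information.
    return dict(record, defect_rate=record["defects"] / record["units"])

information = [model(r) for r in records]

# Analysis: judge each machine against a (hypothetical) target rate.
TARGET = 0.03
alerts = [i["machine"] for i in information if i["defect_rate"] > TARGET]

print(alerts)  # machines whose defect rate exceeds the target
```

Presentation, Storage and the later stages would build on `information` and `alerts`; the point is only that each stage consumes the previous one's output and adds context, meaning or judgement to it.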
I hope this has been a useful survey of the Big Data Value Chain. As I said, we will explore the different stages, their requirements and implications in more detail over the course of this Blog.

Tuesday, May 8, 2012

What is Big Data?

Big Data is a wonderful term, but what does it mean? Many IT professionals will say that they have been providing systems which capture, analyse and present data -- often in very "Big" quantities -- for years. What's new? I think the most useful definition of a Big Data system is a very pragmatic one: you need a Big Data system to capture, analyse and present information when traditional Relational Database centered systems cannot meet your Users' requirements. The reason Big Data is such a hot topic right now is that so many businesses are running up against these limits and being forced to look beyond them. Why might this be? Here are the three key drivers, in my opinion. We will look at each in much more detail over the course of this blog:
  1. First and foremost is time. Time comes in two flavours: Latency -- how long it takes to get your data -- and what I'll call "Freshness" -- how up-to-date it is when you get it. If you cannot meet your Users' time expectations with traditional technologies, then you are into the realm of Big Data. These expectations are becoming more demanding, and data volume, which always costs time, is growing by orders of magnitude. We will talk a lot more about time in this blog.
  2. Second is flexibility. The need to be flexible and adaptable can overload a traditional approach to data systems in three ways: Variety of Sources, Variety of Structures and Future Uncertainty. A common requirement for Big Data systems is that they collect their data from an enormous variety of sources. This data comes organised in a similarly large variety of structures or "schemas". While it is possible to come up with ingenious ways to get data from many sources into relational databases, these databases do not take kindly to data which does not conform to a predefined structure, or schema. Unfortunately, Big Data systems are often required to accept data coming in a huge variety of structures, and often have to deal with new, previously unanticipated, data structures and analysis requirements as time goes on.
  3. Third is complexity. What we are talking about here is the ability to model complex real-world behaviour and perform complex analyses, often in real time. This is not something talked about much in the world of Big Data at the moment, but I believe it will become more and more important as the applications of these systems grow more sophisticated. And again, it is not something relational database technology, with its focus on selection of lists and relatively simple aggregations -- average, max, count, etc. -- is ideally suited for.
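The flexibility point above can be illustrated with a small sketch. The event shapes here are invented for illustration: a fixed relational table would need columns declared up front for every field, while a document-style store simply keeps each record whole and indexes on whatever keys the records happen to share:

```python
# Hypothetical events arriving from very different sources,
# each organised in its own structure, or "schema".
events = [
    {"type": "web",    "url": "/signup", "user": "u1"},
    {"type": "sensor", "device": "pump-7", "temp_c": 81.5},
    {"type": "social", "user": "u2", "mentions": ["u1"]},
]

# Keep each record as-is and group on the one key they all share.
by_type = {}
for e in events:
    by_type.setdefault(e["type"], []).append(e)

# A new, previously unanticipated structure needs no schema
# migration -- we just append it and carry on.
new_event = {"type": "web", "url": "/buy", "user": "u1", "ab_test": "B"}
events.append(new_event)
by_type.setdefault(new_event["type"], []).append(new_event)

print(sorted(by_type))  # the structures seen so far
```

This is, of course, only half the story -- a relational schema buys you integrity guarantees that a bag of dictionaries does not -- but it shows why variety of structure, and future uncertainty about structure, push systems beyond the traditional approach.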
So in summary, Big Data systems are by definition different in kind from 'traditional', relational database centric systems, and are driven to be so by User requirements which, beyond certain limits of volume, structure and sophistication, are not deliverable by this 'traditional' technology. So if you are an IT professional, and your User or Customer is asking for something not achievable with your existing tool-set, it is not because they are wrong; it's because they are asking for Big Data!

Tuesday, May 1, 2012

Big Data

I have been quiet on the Fraysen blog for some time now, but very busy in the 'real world'! While we have been working away with Clients and on new product development, the market for the software Fraysen develops has acquired a name: "Big Data." I have spent some time thinking about how I can best contribute to the Big Data discussion. It is a very broad term, spanning a value chain of systems which can overwhelm and confuse, so my first goal will be to try to bring some clarity and simplicity to the conversation, give you some reference points, and a context to understand it in. I hope you find it useful.