Let’s examine again why these dimensions are important. There are two major aspects to consider: the business value that can be obtained from data, and the cost of storing it and using it for meaningful query analysis. If we look at the latter, we can make the rough assumption that cost scales broadly with the volume of data we store: it costs ten times as much to store ten terabytes as it does one. We could argue this point for hours, and I admit it is a gross over-simplification, but it will serve its purpose in this discussion. The key to understanding data volume, and therefore cost, is to understand the three dimensions above. For example, if we believe there is value for a mobile communications company in storing information about the calls people make, the first thing to do is to decide at what level the call information should be held. Each individual call is manifest in a Call Detail Record (CDR): a single record generated for every individual telephone call made (or accepted).
Well, suppose we are an average-size mobile provider with 1,000,000 subscribers, each making ten calls per day, and we decide to hold individual CDRs – then we must be able to store 10,000,000 new CDRs per day.
We must next decide which attributes of the CDR are important to us for decision-making. In fact, the CDR contains many, many attributes of interest (see a later chapter for details), and each has a physical size when it comes to storage in our computer systems. Let’s say for now, however, that there are at least 200 characters of useful information in each CDR, so we can multiply the figure above by 200 to find out how much data is created each day by people making calls:
10,000,000 multiplied by 200 equals 2,000,000,000 characters (or bytes) of data daily.
This is TWO GIGABYTES of data generated every day.
We now have to tackle the issue of history. Basically, we must decide how many months’ worth of CDR history we need to store to allow us to make meaningful (and predictive) business decisions. Let’s say for now that we opt for thirteen months, so that each month can be compared with the same month a year earlier. There are approximately thirty days per month, so we must multiply the daily figure by three hundred and ninety (13 × 30):
2 gigabytes multiplied by 390 equals 780 gigabytes of stored CDR history.
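The whole sizing argument can be sketched in a few lines of Python. The figures are the ones used in the text (decimal units, so a gigabyte here is 10⁹ bytes, matching the round numbers above):

```python
# Back-of-envelope CDR storage sizing, using the figures from the text.
SUBSCRIBERS = 1_000_000      # an average-size mobile provider
CALLS_PER_DAY = 10           # calls per subscriber per day
BYTES_PER_CDR = 200          # useful characters (bytes) kept per CDR
DAYS_RETAINED = 13 * 30      # thirteen months at ~30 days each = 390 days

cdrs_per_day = SUBSCRIBERS * CALLS_PER_DAY      # 10,000,000 CDRs per day
bytes_per_day = cdrs_per_day * BYTES_PER_CDR    # 2,000,000,000 bytes (2 GB)
total_bytes = bytes_per_day * DAYS_RETAINED     # 780,000,000,000 bytes (780 GB)

print(f"{bytes_per_day / 1e9:.0f} GB per day, "
      f"{total_bytes / 1e9:.0f} GB of retained history")
```

Changing `SUBSCRIBERS` is all it takes to see how the total scales with the size of the operator, which is the point the next paragraph makes.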
This is hardly ‘big data’, but some of the mobile companies in Europe have huge subscriber bases – 20 million is not uncommon – which at the same rates generates well over 15 terabytes of stored CDR history. Now THAT is BIG.