The Data Warehouse

 

Data Warehousing is over twenty years old and still the many business and technical users out there do not wholly approve of the validity of the architecture, and still cite the enormous cost of Data Warehousing, the lack of building skills, the perceived failure rate, the poor quality of the data and the lack of the expected return on investment

When I joined Teradata, a US company, in 1987, the term ‘Data Warehouse’ was not in existence, although the nature of the applications that Teradata was used for matched all today’s criteria of a Data Warehouse. Teradata manufactured and sold a large but modular parallel processing platform that could basically store vast amounts of data and run SQL queries very quickly. Although many people shudder at the cost of Data Warehousing today, those early Teradata systems (called the DBC1012) were often sold as DASD (disk) replacement systems for DB2 mainframe applications because ‘Teradata’ storage was cheaper than the IBM equivalent.

Those early years of Teradata in Europe were very successful, and gradually the worth of holding large amounts of historical data and running very complex queries over that data began to be understood by some of Europe’s largest companies, including BT, BA and TSB. At some time in these early years a Teradata guy coined the phrase ‘Information Factory’ and from somewhere soon after came the term ‘Data Warehouse’ which has stuck with us today.

In those early days 300 gigabytes of data was deemed huge, and quite rightly so, because that volume of data could consume over six hundred disks. Now of course just one disk will do and Data Warehouses larger that several terabytes are common and running on hardware and RDBMSs from many vendors.

Although IBM and Oracle have probably competed best with Teradata in the Data Warehouse market, it has spawned many other vendors, to become the billion-dollar industry we commonly – and wrongly – call BI. There is money to be made everywhere!

So there are plenty of companies and components to help in the building of a Data Warehouse, but the basic fact is that the overall concept has not changed or improved significantly over all of these twenty years or so. We are still doing what we used to do, only quicker and with more data, and it’s pretty much this fact that frustrates me so much. I know companies on their third try at Data Warehousing and with each new iteration they change the platform, believing that it’s the platform that is failing them, not realising that they don’t actually have any real idea of what success is or how to understand what they’ve achieved so far.

Posted in Uncategorized | Leave a comment

Copy Management as the Answer?

The modern-day solution to the ‘data into intelligence’ process has been through the implementation of Data Marts and Data Warehouses. This ‘science’ is rather mature these days but there is still incredible confusion as to what they are, how they should be structured and who should use them. Incredibly, many Data Warehouses still under-perform, so in this chapter we will introduce the different architectural concepts and define clear roles for the Data Mart and Data Warehouse.

Basically, all contemporary solutions to the Business Intelligence challenge involve copy management – copying data from its source system to another ‘place’ where it can be used to provide reports and answer queries. There are several alternatives in architecture, all of which are used today. Whilst we read them we should bear in mind that copying data is an architectural sin. It always costs money and brings opportunity for error. It should be avoided at all cost.

 

The Data Mart

Basically, Data Marts are the fundamental elements of Business Intelligence today. We may not like it, it may be a flawed strategy, but the fact is that there are hundreds of thousands of Data Marts out there because they make people happy in many ways:

  • Ø They provide value.
  • Ø They are easy to build and maintain.
  • Ø They give the users a sense of power.
  • Ø The provide the vendors with lots of income.

In short, everyone wins except the purists … or do they?

 

2.8.2.1 What Is a Data Mart?

Everyone knows at least two generic things about data. They know that important data is spread all over the organisation, making it impossible to correlate in any meaningful way, and they know that nearly all reporting needs data from multiple systems.

We have already discussed these two statements in great detail, and so, given that these presumptions are true and not likely to change any time soon, there has been one overwhelmingly attractive solution. The Data Mart.

The Data Mart is a set of data copied from another system or systems that is stored in a single database and is structured in a way that supports certain pre-defined business reporting activities.

Probably the key characteristic of the Data Mart is that it supports a pre-defined type of business information requirement, and its data can thus be structured in a way to enhance performance and ease of use for these specific requirements. Because Data Marts have limited scope it is common to find many of them in the same enterprise, and because they are usually small in ‘size’, cost is not high and so Data Marts can be ‘purchased’ without complex ROI cases and CFO approval. On the same theme, and once again because the scope of the Data Mart is limited, it is often possible to avoid high-cost items common in the Data Warehouse world, such as complex ETL and backup/restore capabilities.

 

What Are the Pros and the Cons of the Data Mart?

Let’s start with the good:

  • Ø Generally the total cost of ownership of a Data Mart can be low.
  • Ø Performance for a known set of queries should be good.
  • Ø Change requests should be easy to execute.
  • Ø Privacy and security issues are generally simple.

Let’s move on to the bad:

  • Ø When an organisation examines the cost of all Data Marts deployed, the cost can be surprisingly high.
  • Ø This logic is also true about total effort and total cost of ownership.
  • Ø By splitting the overall problem into many parts (marts), computers and software are not used in an efficient way.

Let’s now finish with the ugly:

  • Ø Data Marts are rarely treated with the rigor of a Warehouse or operational system. Too often the data is corrupt, incomplete or too old, and the really ugly part is that too many times, reports from Data Marts are badly misleading or simply incorrect.
Posted in Uncategorized | 1 Comment

The start of Big Data

Let’s just examine again why these dimensions are important. There are two major aspects to be considered: the business value that can be obtained from data and the cost of storing and using it for meaningful query analysis. If we look at the latter, we can make a vague assumption that the cost is pretty much in line with the volume of data we store. Thus it costs ten times as much to store ten terabytes as it does one – we can argue this point for hours and I admit to it being a gross over-simplification, but it will serve its purpose in this discussion. The key to understanding data volume, and therefore cost, is to understand the three dimensions above. For example, if we feel that there is value for a mobile communications company in storing information about the calls people make, then the first thing to do is to decide at what level the call information should be held. Each individual call is manifest in a Call Detail Record (CDR), which is a single record generated for every individual telephone call made (or accepted).

Well, suppose we are an average-size mobile provider with 1,000,000 subscribers each making ten calls per day and we decide to hold individual CDRs – then we must be able to hold 10,000,000 per day.

We must next decide what attributes of the CDR are important to us for decision-making. In fact, the CDR contains many, many attributes of interest (see a later chapter for details) and each has a physical size when it comes to storage in our computer systems. Let’s say for now, however, that there are at least 200 characters of useful information in each CDR, so we can now multiply the sum from above by 200 to find out how much data is created each day by people making calls:

10,000,000 multiplied by 200 equals 2,000,000,000 characters (or bytes) of data daily.

 This is TWO GIGABYTES of data generated every day.

We now have to tackle the issue of history. Basically, we must decide how many months’ worth of CDR history we need to store to allow us to make meaningful (and predictive) business decisions. Let’s say for now that we opt for thirteen months to allow us to do yearly month-on-month analysis. Well, there are approximately thirty days per month, so we must now multiply the above sum by three hundred and ninety:

2 gigabytes multiplied by 390 equals 780 gigabytes of total data per year.

This is hardly ‘big data’ but some of the mobile companies in Europe have huge numbers of subscribers, 20 million is not uncommon generating 20 Tbytes of data per year. Now THAT is BIG

Posted in Uncategorized | Leave a comment

A short intro to Data Dimensions

      When we examine the characteristics of data we can see that it exists in three dimensions, namely its:

  • Ø level of detail
  • Ø content
  • Ø history

These three dimensions on their own go a long way to determining the data volume we currently bandy about when discussing issues of intelligent business analysis.

The concept of the level of ‘detail’ will be discussed later, but we should consider that the examples above describe data in common industries, at the level of the individual transaction. We are basically capturing a row of data reflecting a customer’s behaviour as it corresponds to his use or purchase of a product: a row every time someone makes a flight reservation, a row of data every time someone makes a telephone call etc.  It is possible to go to an even deeper level of detail than the examples we are using here (for example to record all the events of an individual airline reservation as it is altered before the actual flight), but is it useful? Maybe it is!

Posted in Uncategorized | Leave a comment

The Dilemma

The Dilemma

One of the most common arguments in the area of Business Intelligence today concerns the level of detail at which a company should keep its data for the purpose of informational analysis. There are many reasons for this argument, and perhaps the most pertinent three are influenced by:  What data is available  The limitations of current technology to support differing volumes of data  Existing and pending data protection laws and ownership issues The most restrictive of the above list is the third, which is governed by the fact that differing legislation in different countries regulates both the amount of data one can hold over history and the level of detail of this data, most especially if it can be used to track the behaviour of individuals. This issue of privacy is of utmost importance and will have ramifications across many aspects of decision-making. The second issue in our list is one of technology, and it’s fair to say that the state in which we find ourselves today is one in which current technology gives us the ability to store and enquire on huge (if not limitless) amounts of data. The pertinent discussion should centre not around ‘can we do it?’ but rather what business value can be obtained from differing ‘levels’ of data, and this subject is the major theme of this post. The remaining issue – what data is available – is obviously an important factor simply because you can’t keep and use what you don’t have in the first place. In fact, understanding the gap between available data and that which is required is a key objective in requirement definition and must primarily be a business-led activity, not only in defining important and missing data items, but also putting a worth on capturing them. Where Does this Data Come From? Let’s look at four very common business transactions to understand what types of data they create and manipulate.

Using an Automatic Teller Machine (ATM)

 Buying some items from a supermarket

Booking an airline ticket

Making a telephone call

Using an Automatic Teller Machine (ATM) ATMs not only give out money to the just, but also serve to capture a huge amount of data. Every time you withdraw money from an ATM, a transaction record is created and stored in the bank’s operational systems containing the following:  account number  date and time  the ATM identifier (i.e. where you are)  the amount of money involved  what you are requesting to do with that money, i.e. withdraw it Meanwhile, at that same instant, it is assessing how much money you have in your account, along with such information as credit limits etc. Buying some items from a supermarket

When you buy something from a supermarket the system records at the very least:  supermarket identification  till identification  time and day  items purchased  coupons used  price paid  payment method At best, and with the advent of loyalty cards, the system could know your:  name  address  date of birth  credit card identification  credit limit And of course the retailer knows already the full book price for the items bought, levels of discount applied and stock levels available in all outlets.

Booking an airline ticket. In this industry the reservation systems capture:  who you are  your address  payment details  data and time of reservation and flight  personal preferences (meals, seats, smoking habits)  destination  routes  arrival time  price paid  frequent flier attributes  associates flying together – family groups etc.  attributes of young fliers – age, gender etc. Making a mobile telephone call

Perhaps more data is collected when you use your mobile (or wire line telephone) than with any other sort of transaction. The Call Detail Record (CDR) which is generated by the telephone switches on a per-call basis (broadly speaking) is a great source of information for the telephone company in terms of analytical potential. For example, for mobile calls, the CDR contains:  calling number  called number  time of call  duration of call  location of caller and called  termination codes  tariffs

 

We see, therefore, that a huge amount of very detailed data is generated and made available to retailers in many industries when a customer buys or uses goods. The question is: Is it of value to hold a record of every transaction or can I derive the same value merely by summarising these transactions by some attributes? Well, to answer the question, let’s first look at some of the characteristics of data to understand the issues involved.

Posted in Uncategorized | Leave a comment

Making it to the ‘What If’ Era

 

The ‘What If’ era is predicated on an ability to modify the future in our aforementioned reality time. We have learned what has happened in the past, we have understood why it happened, and now we want to use this information to change what is likely to happen in the future. Let’s use a simple example:

What happened?

I find out that fewer people than expected flew from London to Paris on a specific flight.

Why did it happen?

I analyse competitive information and discover that a competitor has just started the same flight but with a much lower introductory price to grab tourists.

How will I change things? (What if?)

Now surely the obvious strategy is simply to compete on cost, and this illustrates the beauty of BI and illustrates why BI is the cornerstone enabling the move to the ‘What If’ era. Let’s suppose we drop our price, we can assume that this will also drop our profitability, and in doing so we start a downward spiral that no-one can win except the customer. Before we know it, the competitor drops his price, we respond and soon this once lucrative route is costing money. This is where many companies are today because they simply don’t have sophisticated BI capabilities.

However, by deep analysis and predictive capabilities my chosen strategy is to replace several rows of economy seats with business class seats and market an offer of triple air miles to all business users and anyone else flying business class.

You might be surprised at the complexity of the response when you were thinking that the obvious response would simply be to lower the price to beat the competition, but the actual response that I proposed used deep predictive modelling to come up with a hypothesis that would not only maintain profit but might even increase it, and in a way that the competitor could not copy. Basically, the analysis enables the identification of a set of passengers that did not care about the price of the flight (within reason), probably because they did not pay themselves, but did value air miles for the twice-a-year private holiday. If all else was even, then they would fly the cheaper airline – but hey! – if there’s something in it for me privately, then let’s go back to the higher price and greater comfort.

Why can’t the competitor respond? Because they have nothing to offer, no air miles programme, no greater comfort, in fact all they can offer is something that not everyone evidently values.

Now I’ve made this example up, but it illustrates one thing: ‘what if’ is very difficult because the obvious actions to take to change customer behaviour are often the most damaging.

‘What if’ analysis is usually based on predictive modelling, data mining, and we will come back to it later. One thing that data mining depends on, however, is a rich source of historical and detailed data.

Posted in Uncategorized | Leave a comment

Moving to the ‘Why Era’

You know by now that I am fond of examples so let’s use another one at this stage. Again imagine that you are that marketing executive at a mobile telephone company and you get a report on your desk every week telling you how many of your customers (they like to call them ‘subscribers’ for some strange reason) the company has lost to competitors over the previous seven days. Now of course this number is very important to the executives in the company, but it is a classic number from the ‘What Era’, often a classic report from a classic, traditional, dull old Data Warehouse.

So, we should ask ourselves, what is missing? Well there is one thing obviously missing, and that is any reason why these customers have left (churned). Wouldn’t you value the investment spent in your BI infrastructure if not only could it tell you who churned but just maybe it could have a go at telling you why they churned?

I will not lie to you here. What’- type reports are usually quite easy to create but moving into the ‘Why Era’ can be more difficult. Again, there are several reasons.

Firstly, we do not expect BI systems to help us with the ‘why’ issue, so tools aren’t too good at it. There is too much of this ‘just give me the facts and I’ll work out the rest (because that’s why I’m paid so much)’, and secondly, to understand the ‘why’ bit needs data that can be difficult to get. Standard reports just will not do, someone has to show some initiative here.

Just returning to our example, what might we do to enter the ‘Why Era’? Well, as a marketing guy I would guess that there are two things that could cause people to churn in large numbers: firstly, people churn because someone else offers a better product/price plan; secondly, people churn because of poor quality of service – too many disconnects etc.

Armed with this intuition, if I was the marketing guy I would ask for more data to be included in the Data Warehouse to better characterise the churners. I might organise some quality-based questionnaires to be commissioned to see what people think of our service, I might start collecting all competitor product/pricing information.  I might start to gather all information on failed calls in my Data Warehouse and above all, I might ask someone with good technical knowledge to help me write the telling reports which would combine all relevant data to provide me with the ‘why’ point of view.

Armed with all of this I use my BI environment to advise me that the most likely reason why ‘Jon Page’ churned is because of a better price plan being offered by a competitor who has better coverage of Mr Page’s house. Get the idea?

 

Posted in Uncategorized | Leave a comment