Sunday, February 1, 2015

Data Warehousing - The Inmon View

The Inmon view of Data Warehousing is the original view. In fact the Inmon definition is still cited in research papers. This definition sees the Data Warehouse as "a subject oriented, integrated, time-varying, and nonvolatile collection of data that is used primarily in organizational decision making." This seems like a serviceable definition until we pick it apart a little. First, if we consider a major subset of that definition " "a subject oriented, integrated... collection of data that is used primarily in organizational decision making." there is nothing that distinguishes data warehousing from enterprise databases. So, the adjectives "time-varying and nonvolatile" must be the difference. And indeed they are. But in this definition the key elements are buried in a flurry of other generic attributes.

Even if we highlight and embolden these terms, they still fail to capture the essence of a Data Warehouse. For example, in a traditional transaction processing system, time stamped transactions would satisfy "time-varying" so what is special about the Data Warehouse? And nonvolatile suggests, correctly, that the information is not updated. This is not entirely true. But in the cases where it is true, why is it true?

One final problem we have with the traditional definition is the name "Data Warehousing" itself. This is a problematic metaphor that fails to capture the essence of a Data Warehouse. The term was selected many years ago to give the impression of high volume, low cost storage where you go into the warehouse to retrieve information that is not readily at hand. Thus, Data Warehouses, over time, came to be seen as large junk heaps of historical data leaving us with some nontrivial philosophical problems such as "what does the data refer too?" and design problems such as "what are we trying to achieve with the Data Warehouse?"




Thursday, January 15, 2015

What is a Data Warehouse?

A Data Warehouse is, generally, a large repository of structured historical data. This definition was carefully constructed because there are two prevailing and competing views of data warehouses, and I wanted the initial definition to cover both of them. To understand the distinction, consider the following anecdote.

Several years ago, when I was teaching a course on data warehousing, a student in our program who was currently a data warehousing practitioner, came in my office and said,

"I hear you are teaching a course on data warehousing."

"Yes," I replied, "are you interested in taking it?"

"Well, I wanted to know," she continued, "are you an Inmonite or a Kimballite?"

From the mouths of practitioners, therein lies the difference.

Bill Inmon offers a view of data warehousing as a large repository of historical data derived from source transaction processing systems. This historical data can be analyzed and studied in support of important business decisions.

Ralph Kimball, on the other hand sees the data warehouse as a collection of historical data designed and collected to model measurable business processes.

Most people involved in data warehousing adhere to either the Inmon view or the Kimball view. Many who do so, do it unknowingly.

In the next few posts, I will elaborate on the differences.

Friday, January 9, 2015

What is a Relational Database?

First we must clear up a common misunderstanding. SQL Server, Oracle, MySQL, DB2 and other similar pieces of software are not relational databases. They are Relational Database Management Systems (RDBMS). RDBMSs support the relational data model but can be used for storage and retrieval as well.

Consider a web application that uses MySQL to store information used to construct web pages. MySQL is an RDBMS but the information in it does not constitute a relational database. It is the structure and nature of the information that makes it a relational database not the software used to store it.

Information in a relational database is inherently categorical data. That is, the information is stored according to categories. In a simple academic relational database we might have categories like Student, Course and Professor. Once the database is populated with information we use SQL to ask questions about the categories. How many students do we have? How many classes are being offered? We can ask much more complicated question such as: How many Students have taken a class with a Professor with the same last name?

 The purpose of a Relational Database Management System is to provide storage and retrieval as well as data management for databases some of which are relational. The purpose of a relational database is to store information in properly defined categories so that we can answer questions about the data and the categories. 

Thursday, January 1, 2015

What is Data Science ?

While on this theme of data, I thought it would be useful to take a step back and look at a larger emerging trend. And that is Data Science.Data Science is both very, very promising and very and very problematic. So, before I start carrying on about how important it is, let's look at some of the problems.

First, there is the definitional problem. What exactly is Data Science? Different people have different ideas about what Data Science entails. For example, some statisticians feel that Data Science is just a new name for statistics. Some data miners feel it is just a new name for data mining. And some knowledge management people feel that it is just a new name for knowledge management.

Wikipedia defines it as follows - "Data Science is, in general terms, the extraction of knowledge from data" This definition sounds good until you ask two important questions. First, what is knowledge and how do we know that what we have derived from data actually is knowledge? And, second, Are there any general principles that apply to all attempts to extract knowledge from data ? Is there, for example, a "data method" that would parallel "scientific method".

On the first question, we can go back to Plato's definition of knowledge which say something to the effect that in order for a claim to be considered knowledge, it has to be true, you have to believe it is true, and you must be able to explain why it is true. This last criterion, having to explain why it is true, is a snag for Data Science. Often times, relationships between variables can be extracted from data and possibly even verified without any understanding of the underlying mechanisms. Is this OK? Well, that has yet to be determined.

On the second question, all we can say is that there appears to be an emerging body of knowledge that might someday become a "data method" but that still lies in the future. An example of a book that I like in this new vein of data science is Nate Silver's The Signal and the Noise: Why So Many Predictions Fail — but Some Don't . Rather than discussing particular techniques, Silver talks about a lot of reasons why modelers go wrong, some cognitive and some philosophical It is works like this that will provide the foundations for this emerging science.

There are numerous, definitional, methodolical, cognitive, and philosophical issues that will need to be addressed as Data Science emerges. However, it will emerge and the questions will be answered. Consider the following progression in the evolution of our knowledge. First we used stories to explain the world around us. Second, we adopted science with it economical and verifiable theories. Now we are introducing a new intermediary between us and that confusing mess we call reality. And that intermediary is data. That is happening, not coincidentally, at a time when a vast amount of data is being produced. In many ways, Data Science is just an attempt to tame the world of Big Data before it gets away from us.