Dr Artz - Making Sense of IT

Sunday, February 1, 2015

Data Warehousing - The Inmon View

The Inmon view of Data Warehousing is the original view. In fact, the Inmon definition is still cited in research papers. This definition sees the Data Warehouse as "a subject oriented, integrated, time-varying, and nonvolatile collection of data that is used primarily in organizational decision making." This seems like a serviceable definition until we pick it apart a little. First, if we consider a major subset of that definition, "a subject oriented, integrated... collection of data that is used primarily in organizational decision making," there is nothing that distinguishes data warehousing from enterprise databases. So, the adjectives "time-varying and nonvolatile" must be the difference. And indeed they are. But in this definition the key elements are buried in a flurry of other generic attributes.

Even if we highlight and embolden these terms, they still fail to capture the essence of a Data Warehouse. For example, in a traditional transaction processing system, time-stamped transactions would satisfy "time-varying," so what is special about the Data Warehouse? And nonvolatile suggests, for the most part correctly, that the information is not updated. This is not entirely true. But in the cases where it is true, why is it true?

One final problem we have with the traditional definition is the name "Data Warehousing" itself. This is a problematic metaphor that fails to capture the essence of a Data Warehouse. The term was selected many years ago to give the impression of high volume, low cost storage where you go into the warehouse to retrieve information that is not readily at hand. Thus, Data Warehouses, over time, came to be seen as large junk heaps of historical data, leaving us with some nontrivial philosophical problems such as "what does the data refer to?" and design problems such as "what are we trying to achieve with the Data Warehouse?"

Thursday, January 15, 2015

What is a Data Warehouse?

A Data Warehouse is, generally, a large repository of structured historical data. This definition was carefully constructed because there are two prevailing and competing views of data warehouses, and I wanted the initial definition to cover both of them. To understand the distinction, consider the following anecdote.

Several years ago, when I was teaching a course on data warehousing, a student in our program, who was a data warehousing practitioner at the time, came into my office and said,

"I hear you are teaching a course on data warehousing."

"Yes," I replied, "are you interested in taking it?"

"Well, I wanted to know," she continued, "are you an Inmonite or a Kimballite?"

From the mouths of practitioners, therein lies the difference.

Bill Inmon offers a view of data warehousing as a large repository of historical data derived from source transaction processing systems. This historical data can be analyzed and studied in support of important business decisions.

Ralph Kimball, on the other hand, sees the data warehouse as a collection of historical data designed and collected to model measurable business processes.

Most people involved in data warehousing adhere to either the Inmon view or the Kimball view. Many who do so, do it unknowingly.

In the next few posts, I will elaborate on the differences.

Friday, January 9, 2015

What is a Relational Database?

First we must clear up a common misunderstanding. SQL Server, Oracle, MySQL, DB2 and other similar pieces of software are not relational databases. They are Relational Database Management Systems (RDBMS). RDBMSs support the relational data model but can be used for storage and retrieval as well.

Consider a web application that uses MySQL to store information used to construct web pages. MySQL is an RDBMS, but the information in it does not constitute a relational database. It is the structure and nature of the information that makes it a relational database, not the software used to store it.

Information in a relational database is inherently categorical data. That is, the information is stored according to categories. In a simple academic relational database we might have categories like Student, Course and Professor. Once the database is populated with information, we use SQL to ask questions about the categories. How many students do we have? How many classes are being offered? We can ask much more complicated questions, such as: How many Students have taken a class with a Professor with the same last name?
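To make the questions above concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table layout, the column names, the extra Enrollment table linking students to courses, and the handful of sample rows are all assumptions made up for illustration; a real academic database would be designed with far more care.

```python
import sqlite3

# Hypothetical schema and sample rows, invented purely for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Professor  (id INTEGER PRIMARY KEY, last_name TEXT);
    CREATE TABLE Student    (id INTEGER PRIMARY KEY, last_name TEXT);
    CREATE TABLE Course     (id INTEGER PRIMARY KEY, title TEXT,
                             professor_id INTEGER REFERENCES Professor(id));
    CREATE TABLE Enrollment (student_id INTEGER REFERENCES Student(id),
                             course_id  INTEGER REFERENCES Course(id));

    INSERT INTO Professor  VALUES (1, 'Smith'), (2, 'Jones');
    INSERT INTO Student    VALUES (1, 'Smith'), (2, 'Lee'), (3, 'Jones');
    INSERT INTO Course     VALUES (10, 'Databases', 1),
                                  (11, 'Data Warehousing', 2);
    INSERT INTO Enrollment VALUES (1, 10), (2, 10), (3, 10), (3, 11);
""")

# How many students do we have?
student_count = conn.execute("SELECT COUNT(*) FROM Student").fetchone()[0]

# How many classes are being offered?
course_count = conn.execute("SELECT COUNT(*) FROM Course").fetchone()[0]

# How many Students have taken a class with a Professor
# who has the same last name?
same_name_count = conn.execute("""
    SELECT COUNT(DISTINCT s.id)
    FROM Student s
    JOIN Enrollment e ON e.student_id = s.id
    JOIN Course c     ON c.id = e.course_id
    JOIN Professor p  ON p.id = c.professor_id
    WHERE s.last_name = p.last_name
""").fetchone()[0]

print(student_count, course_count, same_name_count)
```

Notice that every question is phrased in terms of the categories (Student, Course, Professor) and the relationships between them, which is the point of storing the data this way.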

The purpose of a Relational Database Management System is to provide storage and retrieval as well as data management for databases, some of which are relational. The purpose of a relational database is to store information in properly defined categories so that we can answer questions about the data and the categories.

Thursday, January 1, 2015

What is Data Science ?

While on this theme of data, I thought it would be useful to take a step back and look at a larger emerging trend. And that is Data Science. Data Science is both very, very promising and very, very problematic. So, before I start carrying on about how important it is, let's look at some of the problems.

First, there is the definitional problem. What exactly is Data Science? Different people have different ideas about what Data Science entails. For example, some statisticians feel that Data Science is just a new name for statistics. Some data miners feel it is just a new name for data mining. And some knowledge management people feel that it is just a new name for knowledge management.

Wikipedia defines it as follows: "Data Science is, in general terms, the extraction of knowledge from data." This definition sounds good until you ask two important questions. First, what is knowledge, and how do we know that what we have derived from data actually is knowledge? And, second, are there any general principles that apply to all attempts to extract knowledge from data? Is there, for example, a "data method" that would parallel "scientific method"?

On the first question, we can go back to Plato's definition of knowledge, which says something to the effect that in order for a claim to be considered knowledge, it has to be true, you have to believe it is true, and you must be able to explain why it is true. This last criterion, having to explain why it is true, is a snag for Data Science. Oftentimes, relationships between variables can be extracted from data and possibly even verified without any understanding of the underlying mechanisms. Is this OK? Well, that has yet to be determined.

On the second question, all we can say is that there appears to be an emerging body of knowledge that might someday become a "data method," but that still lies in the future. An example of a book that I like in this new vein of data science is Nate Silver's The Signal and the Noise: Why So Many Predictions Fail - but Some Don't. Rather than discussing particular techniques, Silver talks about a lot of reasons why modelers go wrong, some cognitive and some philosophical. It is works like this that will provide the foundations for this emerging science.

There are numerous definitional, methodological, cognitive, and philosophical issues that will need to be addressed as Data Science emerges. However, it will emerge and the questions will be answered. Consider the following progression in the evolution of our knowledge. First we used stories to explain the world around us. Second, we adopted science with its economical and verifiable theories. Now we are introducing a new intermediary between us and that confusing mess we call reality. And that intermediary is data. That is happening, not coincidentally, at a time when a vast amount of data is being produced. In many ways, Data Science is just an attempt to tame the world of Big Data before it gets away from us.

Friday, December 26, 2014

What is Business Analytics ?

Here is a simple metaphor to explain the difference between Big Data and Business Analytics. If you think of Big Data as a mining operation, you can think of Business Analytics as metallurgy.

I use this metaphor because Big Data provides very raw material in very large quantities. The output of Big Data requires a great deal of work to turn it into something valuable. This is the 'value' problem in Big Data (not one of the original V's, but included increasingly often). It is hard to look at the ore coming out of a mine and see expensive jewelry. And it is hard to look at the vast amount of Big Data being created and see useful data products.

Business Analytics, on the other hand, takes refined data products that already exist and mixes them in attempts to find a new composite or alloy that has desirable properties not available in the component products. While it is still difficult to know beforehand what value might be found in the composite data products, there are two benefits that Business Analytics has over Big Data. First, you have a better idea what the outcome might be since you are working with refined data products rather than raw materials. Second, the volume is much, much less in Business Analytics so you can afford to do more trial and error.

Here is an example of attempting to use Big Data. I should mention that this example is constructed to make Big Data comprehensible to people who are not familiar with it, as most examples are a little too arcane for the average person. Let's say you get a data feed from the Internet of Things. The Internet of Things is an emerging concept which is becoming more real over time and will continue to do so. In this feed you get information from parking meters, soda vending machines, websites, smart home appliances, EZ-pass toll booths, weather transponders, automobile computers and so on. This data comes at you as it is created. It is huge in volume with a great deal of variety. You want to use it to get a better understanding of your customers, but you have no idea whether they are distinguished by regular car maintenance, how often they fail to feed parking meters, or whether they have milk going sour in their refrigerators. You have a huge volume of very raw data and need to figure out how to get some value out of it.

Compare that with an example of attempting to use Business Analytics. You have a chain of grocery stores and want to know if people in different locations have different purchasing behaviors resulting from weather forecasts. So, you pool all of the sales data from all of your stores, acquire some information on weather forecasts, and look for correlations. The data you are using is much more orderly and the things you are looking for are understood much better.
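For readers who like to see the mechanics, here is a rough sketch of what "look for correlations" might mean in the grocery example. Everything in it is an assumption for illustration: the two store locations, the choice of forecast snowfall and bread sales as the variables, and the numbers themselves are made up, and a Pearson correlation is just one simple measure an analyst might start with.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Made-up pooled records, NOT real data:
# (store location, forecast snowfall in inches, loaves of bread sold)
records = [
    ("North", 0, 100), ("North", 2, 120), ("North", 4, 140), ("North", 6, 160),
    ("South", 0, 100), ("South", 2, 100), ("South", 4, 101), ("South", 6, 99),
]

# Group the pooled data by location so each location gets its own correlation.
by_location = {}
for location, forecast, sold in records:
    forecasts, sales = by_location.setdefault(location, ([], []))
    forecasts.append(forecast)
    sales.append(sold)

correlations = {
    location: pearson(forecasts, sales)
    for location, (forecasts, sales) in by_location.items()
}

for location, r in correlations.items():
    print(f"{location}: r = {r:.2f}")
```

In this invented data the North stores show sales rising in step with the forecast while the South stores barely react, which is the kind of difference between locations the analysis is looking for. With so little data you can rerun this with different products or weather variables as often as you like, which is the trial and error that the smaller volume makes affordable.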

Wednesday, December 17, 2014

What is Big Data?

Let's start with a question that a lot of people are wondering about. What is Big Data? First, I want to say that Big Data is a Big Deal. While technology has fueled the engines of transformation for the past few decades, data will fuel the engines of transformation for the next few. But, don't worry. I am not going to go off on a philosophical rant. Let's get right down to the brass tacks.

There is a lot of hype surrounding Big Data and a lot of misuse of the term. There are many definitions of Big Data, none of which is particularly satisfying. But I will make use of two of my favorites.

Many people define Big Data in terms of three V's which are volume, variety and velocity. This means that Big Data is a huge amount (volume) of complicated data (variety) coming at you very fast (velocity). I have read papers and seen presentations where more V's are added. For example, veracity and value are popular as well. And both of these V's raise important issues. But, they are not central to the essence of Big Data.

Another definition that I like is that Data is big when it cannot be processed using traditional relational database technology. Relational databases require information to be highly structured (i.e. anti-variety) and the transaction models used to update the database have limitations on transactions (or updates) per second (i.e. anti-volume and anti-velocity).

It is probably best to think of Big Data as a large volume of raw material from which data products can be made. These data products, in turn, can be used to make decisions which create value for a company (another V). These decisions can be large strategic decisions or small individual decisions. A problem with Big Data is that it is unclear what it refers to. This is the veracity problem (yes, I snuck another V in there). Until it is tamed (i.e. we know what it refers to) it is difficult to use it in decisions.

Note that Big Data is largely defined by the amount of it. If there were a gigantic improvement in processing power, say parallel or quantum computers, which led to computers tens of thousands of times faster, there would no longer be such a thing as Big Data. It would just be data. Unlike relational databases, which contain a particular kind of data (categorical), Big Data is largely defined by the amount and messiness of it, both of which lead to processing constraints.

Should you be concerned with Big Data? As I said in the first paragraph, Big Data is a Big Deal. However, there is a lot of very valuable data that does not rise to the level of Big Data. If you are not yet doing everything you can with your Not So Big Data (Terabytes and less), it makes more sense to focus on that first. Once you are getting all the value you can from that, it would be appropriate to start taking on Big Data.

Monday, December 15, 2014

Making Sense of Information Technology

When I first started in Information Systems, more years ago than I care to admit, there were only a few technologies we had to worry about. There were operating systems, teleprocessing monitors, databases, applications and programming languages. Everybody knew how to program and everybody specialized in one of the other four. It was still daunting, but nothing like it is today.

Since then we have had to adjust to personal computers, networks, artificial intelligence, web technologies, social interaction technologies, mobile devices, and more new programming and scripting languages than I even want to think about. But, as if that were not enough to worry about, we now have analytics and big data to contend with. And on the horizon we have virtual worlds, video games, drones, and a resurgence of artificial intelligence. A bit further off we have complexity theory and agent based modelling threatening to change a game that has already changed so many times that it can hardly even be considered the same game. This list, by the way, is by no means comprehensive. I am doing this off the top of my head. So I apologize if I have left out your pet emerging technology.

How does one keep up with all this stuff? How does one know what to be concerned about and what to ignore? I routinely hear people confusing Big Data with Analytics or Relational Databases with Data Warehousing. Most people know that Facebook is a Social Interaction Technology, but what about YouTube and Wikipedia? And what is the difference between a Wiki and Wikipedia? While we are at it, what is the difference between a wiki, a blog and a forum? What is the difference between a web server and a web service? If your business had $10,000 to play around with an emerging information technology, which one would it be? What about $100,000 or a million?

My biggest challenge since those salad days of mainframes has been to keep up with emerging technologies. And, in the process, I have learned a few things and learned a few tricks. I routinely explain things like this in my classes. So, I thought I would create a blog to reach a wider audience. This is not my first blog. In fact I have many. But I love to write and I love to figure things out. When I can figure things out and write about them, that is as good as it gets.

I should warn you, upfront, about my erratic blogging habits based on the other blogs that I have created. I write when and where I feel like it because I do my best work that way. Often, I will post a flurry of pieces to a blog and then ignore it for a while as I use other outlets for my writing. Eventually, I will come back and write some more. My goal with this blog will be to post something of interest every week or two on the average. So, if this looks interesting, please bookmark it or follow it. I also have a twitter account @DrJohnArtz which you can follow. The only thing I post to the twitter account is when there are new postings to a blog that has been fallow for a while. So, I won't fill your inbox with tweets about what I had for breakfast.