Thursday, January 1, 2015

What is Data Science?

While on this theme of data, I thought it would be useful to take a step back and look at a larger emerging trend: Data Science. Data Science is both very promising and very problematic. So, before I start carrying on about how important it is, let's look at some of the problems.

First, there is the definitional problem. What exactly is Data Science? Different people have different ideas about what Data Science entails. For example, some statisticians feel that Data Science is just a new name for statistics. Some data miners feel it is just a new name for data mining. And some knowledge management people feel that it is just a new name for knowledge management.

Wikipedia defines it as follows: "Data Science is, in general terms, the extraction of knowledge from data." This definition sounds good until you ask two important questions. First, what is knowledge, and how do we know that what we have derived from data actually is knowledge? And, second, are there any general principles that apply to all attempts to extract knowledge from data? Is there, for example, a "data method" that would parallel the "scientific method"?

On the first question, we can go back to Plato's definition of knowledge, which says something to the effect that in order for a claim to be considered knowledge, it has to be true, you have to believe it is true, and you must be able to explain why it is true. This last criterion, having to explain why it is true, is a snag for Data Science. Often, relationships between variables can be extracted from data and possibly even verified without any understanding of the underlying mechanisms. Is this OK? Well, that has yet to be determined.

On the second question, all we can say is that there appears to be an emerging body of knowledge that might someday become a "data method," but that still lies in the future. An example of a book that I like in this new vein of data science is Nate Silver's The Signal and the Noise: Why So Many Predictions Fail - but Some Don't. Rather than discussing particular techniques, Silver talks about the many reasons why modelers go wrong, some cognitive and some philosophical. It is works like this that will provide the foundations for this emerging science.

There are numerous definitional, methodological, cognitive, and philosophical issues that will need to be addressed as Data Science emerges. However, it will emerge and the questions will be answered. Consider the following progression in the evolution of our knowledge. First, we used stories to explain the world around us. Second, we adopted science with its economical and verifiable theories. Now we are introducing a new intermediary between us and that confusing mess we call reality. And that intermediary is data. That is happening, not coincidentally, at a time when a vast amount of data is being produced. In many ways, Data Science is just an attempt to tame the world of Big Data before it gets away from us.
