Two methods of analyzing data that are common in both academic and commercial fields are statistical analysis and data mining. While statistical analysis has a long scientific history, data mining is a more recent method of data analysis that has arisen from computer science. In this article I want to introduce both methods and outline what I believe is one of the main differences between them.
Statistical analysis commonly involves an analyst formulating a hypothesis and then testing its validity by running statistical tests on data, which may have been collected specifically for that purpose. For example, if an analyst were studying the relationship between income level and the ability to get a loan, the analyst might hypothesize that there is a correlation between income level and the amount of credit someone qualifies for.
The analyst could then test this hypothesis using a data set that records, for a number of people, their income levels and the credit available to them. A test could be run that indicates, for example, a high degree of confidence that there is indeed a correlation between income and available credit. The main point here is that the analyst has formulated a hypothesis first and then used a statistical test along with a data set to provide evidence for or against that hypothesis.
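To make the hypothesis-first workflow concrete, here is a minimal sketch in Python. The income and credit figures are invented for illustration, and the choice of test (a Pearson correlation with a permutation test for significance) is my own assumption; it is only one of several tests an analyst might reasonably pick.

```python
import math
import random

# Hypothetical data, invented for illustration only.
# Each position is one person: annual income and available credit, in thousands.
incomes = [22, 35, 41, 48, 55, 63, 70, 82, 95, 110]
credit = [3, 5, 4, 8, 9, 12, 11, 15, 18, 22]


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def permutation_p_value(xs, ys, trials=5000, seed=0):
    """Estimate how often randomly re-paired data shows a correlation
    at least as strong as the observed one (two-sided)."""
    rng = random.Random(seed)
    observed = abs(pearson_r(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(trials):
        rng.shuffle(shuffled)  # break any real link between xs and ys
        if abs(pearson_r(xs, shuffled)) >= observed:
            hits += 1
    # Add one to both counts so the estimate is never exactly zero.
    return (hits + 1) / (trials + 1)


# Hypothesis stated up front: income and available credit are correlated.
r = pearson_r(incomes, credit)
p = permutation_p_value(incomes, credit)
print(f"r = {r:.3f}, permutation p-value = {p:.4f}")
```

A small p-value here would count as evidence for the stated hypothesis. Note that the hypothesis came first, and the data set was only used to test it.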
Data mining is an area of data analysis that has arisen more recently from computer science, and it differs from traditional statistical analysis in a number of ways. Firstly, many data mining techniques are designed to be applied to very large data sets, while statistical analysis techniques are often designed to build evidence for or against a hypothesis from a more limited set of data.
Probably the most significant difference, however, is that data mining techniques are used not so much to build confidence in a hypothesis as to extract previously unknown relationships that may be present in the data set. This is probably best illustrated with an example. In the case above, a statistician formed a hypothesis about the relationship between income level and an applicant's ability to get a loan. In data mining, there is typically no initial hypothesis. A data mining analyst may have a large data set on loans that have been given to people, along with demographic information about those people, such as their income level, their age, any existing debts they have, and whether they have ever defaulted on a loan before.
A data mining technique may then search through this large data set and extract a previously unknown relationship between income levels, people's existing debt and their ability to get a loan.
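The following Python sketch illustrates the idea of searching without a starting hypothesis. It is a deliberately simple stand-in for a real data mining algorithm: it enumerates every single attribute value and every pair of attribute values, then ranks them by how far their loan approval rate departs from the overall rate. The records, the field names, and the `mine_rules` helper are all hypothetical and invented for this illustration, and a real data set would be far larger.

```python
from itertools import combinations

# Hypothetical loan records, invented for illustration only.
# Columns: income, existing debt, age, prior default, loan approved.
FIELDS = ("income", "debt", "age", "prior_default")
records = [
    ("high", "low", "young", "no", True),
    ("high", "low", "old", "no", True),
    ("high", "high", "young", "no", True),
    ("high", "high", "old", "no", True),
    ("high", "low", "old", "yes", True),
    ("high", "high", "young", "yes", False),
    ("low", "low", "young", "no", True),
    ("low", "low", "old", "no", True),
    ("low", "low", "young", "no", True),
    ("low", "high", "young", "no", False),
    ("low", "high", "old", "no", False),
    ("low", "high", "old", "yes", False),
    ("low", "high", "young", "no", False),
    ("low", "low", "old", "yes", False),
    ("high", "low", "young", "no", True),
    ("low", "low", "old", "no", True),
]


def mine_rules(rows, min_support=3):
    """Rank attribute-value combinations by how far their approval rate
    departs from the overall approval rate. No hypothesis is supplied."""
    baseline = sum(row[-1] for row in rows) / len(rows)
    rules = []
    for size in (1, 2):  # single attributes, then pairs of attributes
        for idxs in combinations(range(len(FIELDS)), size):
            # Group the approval outcomes by the values of the chosen attributes.
            groups = {}
            for row in rows:
                key = tuple(row[i] for i in idxs)
                groups.setdefault(key, []).append(row[-1])
            for key, outcomes in groups.items():
                if len(outcomes) < min_support:
                    continue  # too few records to say anything
                rate = sum(outcomes) / len(outcomes)
                condition = tuple((FIELDS[i], v) for i, v in zip(idxs, key))
                rules.append((abs(rate - baseline), condition, len(outcomes), rate))
    rules.sort(key=lambda rule: rule[0], reverse=True)
    return baseline, rules


baseline, rules = mine_rules(records)
print(f"overall approval rate: {baseline:.2f}")
for gap, condition, support, rate in rules[:5]:
    described = " and ".join(f"{name}={value}" for name, value in condition)
    print(f"{described}: approval rate {rate:.2f} over {support} records")
```

In this invented data, the combination of low income and high existing debt ends up at the top of the ranking, even though nobody told the search to look at that particular pairing. Practical techniques add safeguards against patterns that appear purely by chance, which this sketch does not attempt.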
While there are quite a few differences between statistical analysis and data mining, I believe this one is at the heart of the issue. A lot of statistical analysis is about analyzing data to build evidence for or against a stated hypothesis, while data mining is often more about applying an algorithm to a data set to extract previously unforeseen relationships.