Data mining isn't for the meek. Although it promises huge returns -- in predicting customer buying habits, for instance -- there's still a lot to understand.
The purist's definition of data mining is the use of mathematical models to discover patterns in data that can then be used to make predictions. This is more proactive than traditional online analytical processing (OLAP), where people ask specific questions of data and then get specific answers through canned reports or other means.
For example, in a standard query tool, one might ask what products different types of customers bought. However, a data-mining tool, if set up correctly, can tell you which products customers are likely to buy. Armed with this knowledge, customer service representatives have much more accurate profiles of the customers to whom they might successfully up-sell or cross-sell other items.
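The contrast can be made concrete with a small sketch. The purchase records, field layout, and similarity-vote scoring below are invented for illustration -- a query reports who already bought, while even a crude predictive model scores who is likely to buy next:

```python
# Toy sketch contrasting an OLAP-style query with a mining-style prediction.
# All data and fields are invented for illustration.

# Purchase history: (age, annual_spend, bought_upsell)
history = [
    (25, 3, False),
    (40, 12, True),
    (35, 8, False),
    (52, 20, True),
    (23, 2, False),
    (47, 15, True),
]

# Query: report what already happened -- who bought the up-sell product.
buyers = [row for row in history if row[2]]

# Prediction: score a new customer by similarity to past customers
# (a simple 3-nearest-neighbor vote, standing in for a real model).
def likelihood(age, spend, k=3):
    nearest = sorted(history, key=lambda r: (r[0] - age) ** 2 + (r[1] - spend) ** 2)
    votes = [r[2] for r in nearest[:k]]
    return sum(votes) / k

prob = likelihood(45, 14)
print(f"{len(buyers)} past buyers; new customer's up-sell likelihood: {prob:.2f}")
```

A production tool would use the regression and predictive-modeling techniques the article mentions rather than a nearest-neighbor vote, but the shape of the answer is the same: a score about the future, not a report about the past.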
But the accuracy of these predictions hinges on setting up the right algorithms to trawl your databases -- or data that you might buy from outside sources. As such, mining is much closer as a discipline to domains like statistics or research, and is pretty far afield from IT. True data mining experts use -- and understand -- terms like regression, time series, predictive modeling, standard deviation and so forth.
The key is using the right data and asking the right questions. Experts say it's critical to go in with a particular problem you're looking to have solved. Then the algorithms you choose or develop can help point to patterns that you might not have known about.
Although data mining does not require a data warehouse to get going, it is that much easier if you have data that's already cleaned and pressed from multiple sources. Otherwise you'll have to start with "raw" data and spend a fair amount of time cleaning it up before any mining can begin.
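The gap between "cleaned and pressed" and raw data is easy to underestimate. As a minimal sketch -- the records, field names, and formats here are invented, not drawn from any real system -- this is the kind of normalization and deduplication that raw multi-source data typically demands first:

```python
# Toy sketch of cleaning raw customer records from two sources before mining.
# Field names and formats are invented for illustration.

raw_crm = [
    {"name": " Alice Smith ", "state": "nj", "spend": "1,200"},
    {"name": "BOB JONES", "state": "N.J.", "spend": "950"},
]
raw_billing = [
    {"name": "alice smith", "state": "NJ", "spend": "1200.00"},
]

def clean(record):
    """Normalize inconsistent casing, punctuation and number formats."""
    return {
        "name": record["name"].strip().title(),
        "state": record["state"].replace(".", "").upper(),
        "spend": float(record["spend"].replace(",", "")),
    }

cleaned = [clean(r) for r in raw_crm + raw_billing]

# Deduplicate: the same customer often appears in more than one source.
unique = {(r["name"], r["state"]): r for r in cleaned}.values()
print(len(unique), "unique customers from", len(cleaned), "raw records")
```

A data warehouse does this reconciliation once, up front, which is why mining against one is so much less painful.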
While today's tools do shield users from many of the intricacies associated with data mining, it still pays to understand as much of the tool's underlying mathematical algorithms and assumptions as possible, said Herb Edelstein, a long-time data guru and now president of independent consultancy Two Crows Corp. in Potomac, Md. "It's like SQL programming. If you know how to use the optimization features, you get better code. But can you program SQL without knowing about optimization? Absolutely," he explained.
For that reason, Chubb & Son, the giant insurance concern in Warren, N.J., hired mathematics experts and statisticians before it even implemented SAS Institute's Enterprise Miner tool in January 2001, said Jeff Hoffman, vice president of business intelligence. It was important to hire experts who could help set up the right underlying algorithms and mathematical assumptions from the get-go, he said.
Payback and ROI can be elusive, which is why it's important to understand up front why you want to or need to invest in data mining. Chubb, for instance, started out using its SAS software as a marketing tool, to help understand customers' buying patterns and to help sell them more products. Now, however, the software is also being used to assess a customer's insurance risk based on multidimensional characteristics, Hoffman said.
"A focused statement usually results in the best payoff," Edelstein commented.
Data mining tools range in price from $20,000 or less to more than $250,000, depending on what you want to do. They come in several forms: traditional statistics or mining packages, OLAP tools and data mining functions that are part of other applications. In the standalone data mining field, Edelstein said that SAS, SPSS, Insightful and IBM are the leaders. Many of the OLAP tools have disappeared, he said, but one of his favorites is KXEN's Analytic Framework.
The real growth, analysts agree, will come in the last category of tools -- data mining merged with enterprise applications, including database management systems from the likes of Teradata, IBM and Oracle. Customer relationship management (CRM) software is also increasingly incorporating mining tools. Edelstein's favorite in this category is Blue Martini, but other names here include DataDistilleries, E.piphany, Identix, PeopleSoft and SAP. An up-and-coming application for data mining is supply-chain management.
But even more important than the tools is the process used to perform data mining and the team formed to do so, said Alexander Linden, research director of emerging trends and technologies at Gartner Group Inc. in Stamford, Conn. "Data mining requires completely different skill sets than most IT organizations possess," he said. "You need the right team of people, which include quantitative analysts, IT, financial and business people. It's difficult to get all of them together, which is one of the reasons why data mining is still not widely used in most organizations."
One reason for the slow growth may be that data mining, if used correctly, can change how people make decisions. But people can be pretty vested in what they perceive as their knowledge base and may be loath to change.
One of the big surprises at Chubb, Hoffman said, was the need to build workflow and other processes to be able to make effective use of all the new information that mining yields. "The ability to operationalize the use of derived data is not simple," he said. "In most cases, people have not had the type of information we're now able to supply."
Dan Vesset, manager of IDC's data-access program in Framingham, Mass., said the market is ripe to grow after a period of stagnation due to the world economic situation. "I'm hearing a lot of demand and interest," he said. "The market was fairly flat, but it was just a matter of customers postponing buying decisions." Particularly interested are telecommunications and financial institutions that want to use data mining for more efficient fraud detection, he said.