Gregory Piatetsky: Overfitting Is the Cardinal Sin of Data Science

Data scientists are among the most in-demand specialists in the IT market. What tasks do they solve? What challenges do they face? DataReview put these questions to a «pioneer» of data analysis, the founder of the KDD concept, the president of KDnuggets, and the son of a famous Soviet mathematician: Gregory Piatetsky-Shapiro.

— Gregory, you are known as one of the best specialists in data analysis. How did you realize that this area is your calling?

— Thank you, Larisa, but you are much too kind. There are now many thousands of excellent data scientists (see more on numbers and ranking in question 4), and I am glad if I am considered somewhere among them.

I am probably best known as one of the pioneers in this field. I organized the first 3 workshops/meetings on Knowledge Discovery and Data Mining (KDD-89, 91, 93), co-edited the first 2 books in this field (1991 and 1996), helped launch KDD Cup — the first large data mining competition in 1997, co-founded the SIGKDD association — ACM group for Knowledge Discovery and Data Mining in 1999, and served as SIGKDD chair from 2005 to 2009.

— How did you arrive at this field?

— As a child, I loved science fiction, especially A. & B. Strugatsky, Isaac Asimov, and Stanislaw Lem, and that, along with a mathematical inclination inherited from my father, Ilya Piatetski-Shapiro, one of the leading mathematicians in Moscow, led me to study computers and become interested in Artificial Intelligence and Machine Learning.

My PhD at New York University was on the application of machine learning methods to database optimization, and my first job was working with databases. So perhaps working with databases and being interested in Machine Learning naturally led me to try to combine the two, which led to my work on knowledge discovery in data.

I described my journey to data mining in my chapter in Journeys to Data Mining: Experiences from 15 Renowned Researchers, Mohamed Medhat Gaber (Editor), Springer, 2012.

— You invented the concept of data mining. Could you, please, explain the difference between data mining and KDD?

— Of course I did not invent data mining — analyzing facts and finding patterns is probably one of the basic human traits. Statisticians have been working on data analysis for centuries.

Regarding the different names of this field — Data Mining, Knowledge Discovery, Predictive Analytics, Data Science — here is a very brief history.

In the 1960s, statisticians used terms like «Data Fishing» or «Data Dredging» to refer to what they considered the bad practice of analyzing data without a prior hypothesis. The term «Data Mining» appeared around 1990 in the database community. I coined the term «Knowledge Discovery in Databases» (KDD) for the first workshop on the same topic (1989), and this term became popular in the academic and research community. The KDD conference, now in its 21st year, is the top research conference in the field, and there are also KDD conferences in Europe and Asia.

However, because the term «data mining» is easier to understand, it became more popular in the business community and the press.

In 2003, the term «data mining» acquired a bad image in the US because of its association with a US government program called TIA (Total Information Awareness), which was closed by the US Senate after protests by privacy advocates. In 2006, the term «Analytics» jumped to great popularity, driven by the introduction of Google Analytics (Dec 2005). According to Google Trends, «Analytics» became more popular than «Data Mining», as measured by Google searches, around 2006, and has continued to climb ever since.

The term «Data Science» appeared in early 2000, but has been used in its current meaning only since 2012, and we can see a huge demand for «Data Scientist» jobs on Indeed, a popular job site.

We can see early trends using the Google Books Ngram Viewer (1970-2008), and more recent ones with Google Trends for the search terms «Predictive Analytics, Data Mining, Data Science, Big Data».

— What are the most interesting problems you have solved with the help of data mining?

— I worked on many interesting problems, including predicting mobile customer churn, health care data analysis, microarray DNA modeling, Alzheimer’s biomarker prediction, analyzing usage of CAD software, detecting fraud, upsell and cross-sell for banking customers, and many others. I also had a very fun project helping Tiffany detect counterfeit jewelry items on eBay.

Most of my consulting work was confidential for the client, so I cannot go into details. However, I can describe an interesting research project on healthcare data analysis, which was published as a book chapter: Selecting and Reporting What is Interesting: The KEFIR Application to Healthcare Data, C. Matheus, G. Piatetsky-Shapiro, and D. McNeill, in Advances in Knowledge Discovery and Data Mining, AAAI/MIT Press, 1996.

US healthcare costs are the highest in the world, at about 18% of GDP (as of 2012). Some of these costs are due to potentially fixable problems such as fraud or misuse. Understanding where the problems are is the first step to fixing them. In the 1990s I worked for GTE, a large telephone company, which was self-insured for medical costs and was very motivated to reduce them. GTE’s healthcare costs were in the hundreds of millions of dollars.

The task for our project in support of GTE Health Care Management was to analyze employee health care data and identify problem areas which could be addressed. With Chris Matheus and Dwight McNeill, we developed a system called Key Findings Reporter, or KEFIR.

The KEFIR approach was to analyze all possible deviations, then select the most actionable findings using a measure of interestingness (see Figure 1). KEFIR also augmented key findings with explanations of plausible causes and recommendations of appropriate actions. Finally, KEFIR converted findings to a user-friendly report with text and graphics.

Fig. 1. KEFIR measure of Interestingness
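
The ranking step can be sketched as follows. This is an illustration only, not the KEFIR implementation: it assumes that interestingness is scored as the estimated cost impact of a deviation multiplied by an assumed fraction of that cost recoverable by acting on it. The `Finding` class, its field names, and all numbers are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Finding:
    """One deviation of a measure from its expected value (hypothetical schema)."""
    measure: str
    observed_cost: float      # observed cost per case
    expected_cost: float      # expected (norm) cost per case
    volume: int               # number of cases the deviation applies to
    savings_fraction: float   # assumed share of the excess an action could recover

    @property
    def impact(self) -> float:
        """Total excess cost implied by the deviation."""
        return (self.observed_cost - self.expected_cost) * self.volume

    @property
    def interestingness(self) -> float:
        """Assumed score: recoverable part of the excess; savings below the norm score 0."""
        return max(self.impact, 0.0) * self.savings_fraction

def top_findings(findings, k):
    """Return the k findings with the highest interestingness score."""
    return sorted(findings, key=lambda f: f.interestingness, reverse=True)[:k]

findings = [
    Finding("admissions", 1200.0, 1000.0, 100, 0.5),
    Finding("lab tests", 110.0, 100.0, 1000, 0.25),
    Finding("prescriptions", 45.0, 50.0, 2000, 0.5),
]

for f in top_findings(findings, 2):
    print(f.measure, f.interestingness)
```

Under these assumptions, a large deviation that nobody can act on ranks below a smaller one with a clear remedy, which is the sense in which the selected findings are "actionable".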

KEFIR was a very innovative project for its time, and received a top GTE technical award.

Currently we can see some of the same ideas in Google Analytics Intelligence, which automatically finds significant deviations from the norm across multiple hierarchies.

— What are the typical problems faced by aspiring data scientists?

— Many beginning data scientists think that the key problem is selecting the best algorithm, one which will increase the accuracy, perhaps from 78% to 79%.

When running machine learning algorithms, beginning data scientists need to avoid overfitting — the cardinal sin of data science. See my post The Cardinal Sin of Data Mining and Data Science: Overfitting
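
A minimal sketch of what overfitting looks like in practice, on synthetic data assumed purely for illustration (the data, the 1-nearest-neighbour model, and all names are not from the interview). The labels are random coin flips, so there is nothing real to learn, yet a model that memorizes its training set still looks perfect on that set.

```python
import random

def make_data(rng, n):
    """Return n (feature, label) pairs whose labels are pure coin flips."""
    return [(rng.random(), rng.randint(0, 1)) for _ in range(n)]

def predict(train, x):
    """1-nearest-neighbour: copy the label of the closest training point."""
    nearest = min(train, key=lambda point: abs(point[0] - x))
    return nearest[1]

def accuracy(train, data):
    """Fraction of points in `data` whose label the memorizing model gets right."""
    hits = sum(predict(train, x) == y for x, y in data)
    return hits / len(data)

rng = random.Random(0)          # fixed seed so the run is repeatable
train = make_data(rng, 200)
held_out = make_data(rng, 200)

train_acc = accuracy(train, train)        # evaluated on data the model has seen
held_out_acc = accuracy(train, held_out)  # evaluated on fresh data

print(f"training accuracy: {train_acc:.2f}")
print(f"held-out accuracy: {held_out_acc:.2f}")
```

Training accuracy is perfect because every training point is its own nearest neighbour, while held-out accuracy should hover around chance. The gap between the two numbers is the symptom to watch for, which is why accuracy should be reported on data the model has not seen.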

While getting the most accuracy on predictive modeling is important (and fun!), this is not the major problem.

The more difficult task is data cleaning, pre-processing, and feature engineering. That usually takes most of the time and, if done properly, gives more improvement. When trying to understand the data, the right mindset is to question the assumptions, understand the causes of missing data, and see what other external data may be available.

Data scientists also need to know how to best present the results to decision makers, how to deploy the solutions, and most importantly, frame the right questions to ask about the data.

Data scientists need to be aware of thorny ethical and privacy issues when data involves people, but those issues are usually decided at the top business level, not by beginning data scientists.

See my posts 7 Steps for Learning Data Mining and Data Science and The Do’s and Don’ts of Data Mining.

How many data scientists are out there? I estimate there are perhaps 100,000-150,000 data scientists (see my post how many data scientists), but the demand is very strong: here is a trend for «Data Scientists» jobs on Indeed.

— Tell us, please, how the idea of creating KDnuggets came about. Did you think at the time that the site would become one of the most respected and popular in its area?

— I started a newsletter called «Knowledge Discovery Nuggets» in 1993 as a way to connect researchers attending the KDD-93 workshop on Knowledge Discovery and Data Mining. The first issue went to 50 subscribers. This newsletter served as an unofficial publication of the KDD workshops, and helped them to grow and become conferences.

With the appearance of the World Wide Web, I created a website called Knowledge Discovery Mine hosted at GTE Labs, which was at the time the second site in the world covering the data mining and knowledge discovery field. When I left GTE Labs in 1997, I created a new site, KDnuggets, with the name standing for Knowledge Discovery nuggets, conveying the mission of covering the field with short, concise «nuggets» of useful information.

I was working from 1997 to 2000 as a Chief Scientist for an analytics startup (KSP) which was providing data mining and customer analytics consulting to major banks and financial institutions, and editing KDnuggets late at night after work.

When the Y2K crisis caused most of KSP’s clients to freeze any future projects (including any data mining projects), KSP lost most of its revenue and cut salaries and work hours for all employees, including me. The extra time and reduced salary led me to place the first ads to support my work on KDnuggets.

KSP was acquired by Xchange in April 2000, and I worked for Xchange for about a year, leaving it in May 2001, when Xchange went into decline during the collapse of the dot-com bubble. I have been self-employed since then. KDnuggets grew from a small hobby to my main activity, taking almost all of my work time: covering the field of Analytics, Data Mining, Data Science and Big Data, writing and editing posts, tweeting at @kdnuggets, managing the site, etc.

Of course, I could not imagine back in 1993 that I would still be publishing KDnuggets or that it would be so popular.

I moved KDnuggets from my custom platform to WordPress in Dec 2013 and now have excellent help from 3 young data science students — Anmol Rajpurohit (KDnuggets Assistant Editor), and also Grant Marshall and Ran Bi.

KDnuggets is now a very popular site covering Business Analytics, Big Data, Data Mining, and Data Science, and has received many awards and honors.

In 2014 KDnuggets had an average of about 100,000 unique visitors/month; see the chart below, which shows the recent growth of KDnuggets subscribers and visitors.

The KDnuggets newsletter is emailed 2-3 times a month to all the subscribers.

KDnuggets also has a Facebook page, a LinkedIn group, and a very popular @KDnuggets Twitter account, which was voted the Best Big Data Twitter.

KDnuggets News, published since 1993, is a unique mirror of the history and present state of the predictive analytics, data mining, and data science fields.

All KDnuggets News issues from the very first one are available online at

The most recent posts are at

— What is happening in the world of data analysis right now? Are there any new trends and directions? In your opinion, what are the most promising of them?

— With Facebook, LinkedIn, and other social networks becoming the dominant mode of interaction for many young people, social network analysis continues to be an important topic.

Dealing with fast and streaming data is another great research topic. Better use of geolocation and finding spatio-temporal patterns is also a growing area. Another important topic is how to address privacy issues and still allow data mining: is privacy-preserving data mining possible?

Applications of data mining to security, fraud, and threat detection continue to be very important.

In the US, applications to health care data analysis are growing in importance.

Deep Learning is a very hot area of Machine Learning research, with many remarkable recent successes, such as 97.5% accuracy on face recognition, nearly perfect German traffic sign recognition, or even winning the Dogs vs Cats image recognition competition with 98.9% accuracy. Many winning entries in recent Kaggle Data Science competitions have used Deep Learning. See my post Where to Learn Deep Learning – Courses, Tutorials, Software. However, recent research uncovered a potentially big flaw: see the KDnuggets post Does Deep Learning Have Deep Flaws?

Workshops are frequently the leading edge of research, so for the hottest and most important topics see a list of KDD workshops KDD 2014 Workshops – the latest in Data Mining and Data Science Research and IEEE Big Data 2014 – 21 Workshops, Posters – CFP.

Larisa Shuriga, DataReview 

