News & Events


Speaker: Professor Shaw-Hwa Lo (Department of Statistics, Columbia University)


Topic: Some Challenging Issues Facing Statistical Learning Today/Future: A Proposing Remedy of It
Date/Time: Tue., Jul. 2, 2019, 10:10-12:00, 13:30-15:30
Place: 4F-427, Assembly Building I


A brief review of existing learning methods and an introduction to the partition-based approach: We first briefly review some methods currently available in both the statistical and the machine learning literature. Meeting the challenges posed by current and future big data calls for novel, capable methods. We respond to this need by introducing an alternative interaction-partition-based strategy:

Discovering Influential Variable Sets: We consider a computer-intensive approach, Partition Retention (PR) (Chernoff, Lo and Zheng (2009)), based on an earlier method (Lo and Zheng (2002)), for detecting which of many potential explanatory variables have an influence on a dependent variable Y. This approach is suited to detecting influential variables in groups, where causal effects depend on the confluence of values of several variables. Guided by a measure of influence I, it has the advantage of avoiding a difficult direct analysis involving possibly thousands of variables. The main objective is to discover the influential variables rather than to measure their effects. Once they are detected, the much smaller group of influential variables should be amenable to standard analysis. In effect, we confine our attention to locating a few needles in a haystack.
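
As a rough illustration of an influence measure computed over a partition, the sketch below uses the form of the I-score commonly given in the PR literature, I = sum over partition cells j of n_j^2 (Ybar_j - Ybar)^2, where n_j is the number of observations in cell j. The abstract itself does not state the formula, so this form, the function name `i_score`, and the toy data are all assumptions made for illustration, and published versions may add a normalizing constant.

```python
from collections import defaultdict

def i_score(rows, y, subset):
    """Assumed form of the influence score: sum_j n_j^2 * (ybar_j - ybar)^2.

    rows   : list of tuples of discrete explanatory values
    y      : list of numeric responses
    subset : indices of the variables that define the partition cells
    """
    y_bar = sum(y) / len(y)
    cell_sum = defaultdict(float)
    cell_n = defaultdict(int)
    for row, yi in zip(rows, y):
        key = tuple(row[k] for k in subset)  # cell of the partition
        cell_sum[key] += yi
        cell_n[key] += 1
    # n_j^2 * (ybar_j - ybar)^2 equals (sum_j - n_j * ybar)^2
    return sum((cell_sum[k] - cell_n[k] * y_bar) ** 2 for k in cell_sum)

# Hypothetical toy data: Y depends on X0 and X1 only jointly (XOR); X2 is noise.
rows = [(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)]
y = [a ^ b for a, b, c in rows]

print(i_score(rows, y, (0,)))       # 0.0: X0 alone shows no marginal influence
print(i_score(rows, y, (0, 1)))     # 4.0: the pair acting together is detected
print(i_score(rows, y, (0, 1, 2)))  # 2.0: adding the noise variable lowers the score
```

In this toy case neither variable has any marginal effect, yet the pair scores highly, which is the kind of group influence the approach is meant to detect.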

We apply the PR method (2009) to big data applications, which typically involve complex and extremely high-dimensional data. We first introduce an interaction-based feature selection and prediction procedure, using a breast cancer gene expression data set as an illustrative example. The quality of the selected variables is evaluated in two ways: first by classification error rates, then by functional relevance using external biological knowledge. We demonstrate that (1) classification error rates can be significantly reduced by considering interactions, and (2) incorporating interaction information into data analysis can be very rewarding in generating novel scientific findings. Heuristic explanations of why and when the proposed methods may lead to such a dramatic gain in classification/prediction are briefly discussed.

Why Aren’t Significant Variables Automatically Good Predictors?
It has recently been noticed that newly discovered significant variables from GWAS are not very useful in improving prediction rates. Why is this so (Lo et al. (2015, 2016), PNAS)? We offer statistical insights by clarifying that what makes variables good for significance and what makes them good for prediction depend on very different distributional properties. We conclude that progress in prediction requires a new research agenda of searching for highly predictive variable sets rather than highly significant variables.

We introduce a selection and prediction approach that directly measures a variable set's ability to predict (termed “predictivity”) without relying on cross-validation (CV). We argue that the previously proposed I-score not only measures the amount of interaction but can also be related to a lower bound on the correct prediction rate, and that it does not overfit. We suggest shifting the research agenda toward searching for a new criterion to locate highly predictive variables, using the partition retention (PR) method with the I-score. PR was effective in reducing prediction error from 30% to 8% on a long-studied breast cancer data set. Furthermore, we offer recommendations for everyday applications on how to determine whether a significant variable (or variable set) is predictive or has no predictive value. When do the two concepts of significance and predictivity converge?
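
One way such a search for high-scoring variable sets might be organized is a greedy backward-dropping pass: start from a candidate subset, repeatedly drop the variable whose removal leaves the highest score, and retain the best subset seen along the way. The sketch below is a minimal illustration under stated assumptions, not the speaker's implementation: the score form (sum over cells of n_j^2 (Ybar_j - Ybar)^2), the names `i_score` and `backward_drop`, and the toy data are all hypothetical.

```python
from collections import defaultdict

def i_score(rows, y, subset):
    """Assumed score: sum over partition cells of n_j^2 * (ybar_j - ybar)^2."""
    y_bar = sum(y) / len(y)
    cell_sum = defaultdict(float)
    cell_n = defaultdict(int)
    for row, yi in zip(rows, y):
        key = tuple(row[k] for k in subset)
        cell_sum[key] += yi
        cell_n[key] += 1
    return sum((cell_sum[k] - cell_n[k] * y_bar) ** 2 for k in cell_sum)

def backward_drop(rows, y, start):
    """Greedy backward dropping; returns the best subset seen and its score."""
    current = tuple(start)
    best, best_score = current, i_score(rows, y, current)
    while len(current) > 1:
        # Try removing each variable in turn; keep the highest-scoring result.
        candidates = [tuple(v for v in current if v != d) for d in current]
        current = max(candidates, key=lambda s: i_score(rows, y, s))
        score = i_score(rows, y, current)
        if score > best_score:
            best, best_score = current, score
    return best, best_score

# Hypothetical toy data: Y is the XOR of X0 and X1; X2 is pure noise.
rows = [(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)]
y = [a ^ b for a, b, c in rows]

print(backward_drop(rows, y, (0, 1, 2)))  # ((0, 1), 4.0)
```

In practice the starting subsets would presumably be drawn at random many times from a much larger pool of variables, with the sets that keep reappearing among the top scores being retained; that outer loop is omitted here.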
Last modification time:2019-06-19 AM 9:49
