:::

News & Events

:::

Speaker: Professor Shaw-Hwa Lo (Department of Statistics, Columbia University)

Seminar
Poster: Webmasters
Event date: 2019-01-04
 
Topic: Selecting Influential & Predictive Variables for a BIGDATA set
 
Speaker: Prof. Shaw-Hwa Lo (Department of Statistics, Columbia University)
 
Date/Time: Fri., Jan 4, 2019, 10:40 AM - 11:30 AM
 
Place: 4F-427, Assembly Building I

Abstract

Identifying variables or factors that are influential for the response and identifying those that are good for prediction are two important aims. When data sets are not large and the number of variables involved is small to moderate, applying the methods (or their variations) developed during the last century produced good results and thus served the sciences well. As data have grown unwieldy over the last 10 years, searching for important variables in a much larger collection of variables, most of them noisy and carrying no useful information, has become urgent and challenging. To respond to this challenge, we consider an alternative "Partition Retention" (PR) approach for variable selection and prediction problems involving very complex and large data sets. This approach seeks to alter statistical practice and the predictive literature in the analysis of big data by changing the focus from common significance-based modeling to evaluating variables’ ability to predict.

This approach uses the I-score to directly measure a variable set's ability to predict (termed “predictivity”), without relying on cross-validation (CV). Many important and challenging problems arising in BIGDATA, spanning the natural, social, and engineering sciences, require innovative ideas and methods. We argue that the I-score not only reflects the true amount of interaction among variables but can also be related to a lower bound on the correct prediction rate, and that it does not overfit. The value of the I-score measures the amount of “influence” of the variable set under consideration. We suggest shifting the research agenda toward searching for a new criterion to locate highly predictive variables using the partition retention (PR) method with the I-score. The PR method was effective in reducing prediction error from 30% to 8% on a long-studied breast cancer data set.
 
Last modification time: 2019-01-04 11:37 AM
