-----------------------------------------------------------------------------
A v v i s o d i S e m i n a r i o
-----------------------------------------------------------------------------
Mercoledì 05 Dicembre, ore 11:00am
-----------------------------------------------------------------------------

Sala 34 (quarto piano)
Dipartimento di Scienze Statistiche
Sapienza Università di Roma

DAVID CAUSEUR
(Agrocampus Ouest, Dep. of Statistics and Computer Science)

terrà un seminario dal titolo

HANDLING DEPENDENCE OR NOT IN STATISTICAL
LEARNING FOR HIGH-DIMENSIONAL DATA

tutti gli interessati sono invitati a partecipare.

-----------------------------------------------------------------------------

Saluti

Pierpaolo Brutti

---

ABSTRACT

The proper way to handle dependence across features in high-throughput data has raised fundamental discussions with unclear general conclusions or final recommendations. One of the most obvious illustration of this point is the tremendous effort of the statistics research community to address the impact of dependence on the False Discovery Rate (FDR)-controlling method by Benjamini and Hochberg (1995), which was initially designed under an independence assumption. Another famous questioning example is provided by the strikingly good performance of a naïve Bayes procedure ignoring dependence in a comparative study of machine learning methods by Dudoit et al. (2002) to predict classes from gene expression data.

Addressing the dependence issue has often consisted in assessing its detrimental impact on the performance of standard methods designed to be optimal under independence, and deduce patches. To be valid for arbitrarily complex dependence patterns, such approaches in which dependence is viewed as a curse can lead to poorly powerful procedures. Therefore, both for machine learning and testing issues, a new generation of methods have emerged, advocating for an ad-hoc handling of dependence consisting in a preliminary whitening of the data (see Ahdesmäki and Strimmer, 2010, Hall and Jin, 2010). However, disentangling the dependent noise and the true association signal is very challenging and decorrelation can then lead to an alteration of the true association signal.

For the purpose of global testing, where the objective is to test for the significance of an association signal between a set of features and a covariate, Arias-Castro el al. (2011) suggests that the optimal handling of dependence shall be specific of the pattern of the true association signal, especially through its sparsity rate. The former global testing framework covers a wide scope of applications, such as functional Analysis of Variance (fANOVA) and association tests between a region of the genome formed by contiguous Single Nucleotide Polymorphisms (SNP) and a case/control response variable in Genome Wide Association Studies. Interestingly, in the two former fields of applications, many popular methods are just based on simple aggregation of pointwise test statistics ignoring their dependence.

The talk will start by a selection of short stories with confusing conclusions about the proper way to handle dependence. After a tentative clarification, I will introduce two methods for global testing in which whitening is adapted both to the pattern of dependence across pointwise test statistics and to the pattern of the true signal. The performance of the two testing methods will be illustrated by applications to significance analysis of ElectroEncephalogram curves in Event-Related Potentials (ERP) designs and by SNPset approaches of Genome Wide Association Studies for genetic epidemiology issues. We also discuss the applications of the former general principles to prediction in high-dimension.