-----------------------------------------------------------------------------
S e m i n a r   A n n o u n c e m e n t
-----------------------------------------------------------------------------
Wednesday, 5 December, 11:00 am
-----------------------------------------------------------------------------
Room 34 (fourth floor)
Dipartimento di Scienze Statistiche
Sapienza Università di Roma
DAVID CAUSEUR
(Agrocampus Ouest, Dep. of Statistics and Computer Science)
will give a seminar entitled
HANDLING DEPENDENCE OR NOT IN STATISTICAL
LEARNING FOR HIGH-DIMENSIONAL DATA
All interested parties are welcome to attend.
-----------------------------------------------------------------------------
Regards,
Pierpaolo Brutti
---
ABSTRACT
The proper way to handle dependence across features in high-throughput data
has raised fundamental discussions with unclear general conclusions or
final recommendations. One of the most obvious illustrations of this point
is the tremendous effort of the statistics research community to address
the impact of dependence on the False Discovery Rate (FDR)-controlling
method by Benjamini and Hochberg (1995), which was initially designed under
an independence assumption. Another famous, thought-provoking example is provided
by the strikingly good performance of a naïve Bayes procedure ignoring
dependence in a comparative study of machine learning methods by Dudoit et
al. (2002) to predict classes from gene expression data.
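For readers unfamiliar with the FDR-controlling method mentioned above, the following is a minimal sketch of the Benjamini-Hochberg step-up rule in its standard textbook form. The function name, the default level alpha and the example p-values are illustrative assumptions, not material from the talk.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans, True where the hypothesis is rejected.

    Step-up rule: sort the m p-values, find the largest rank k such that
    p_(k) <= k * alpha / m, and reject the k smallest p-values.
    """
    m = len(p_values)
    if m == 0:
        return []
    # Indices of the p-values sorted in increasing order.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Largest 1-based rank whose p-value falls under its threshold.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / m:
            k_max = rank
    rejected = [False] * m
    for idx in order[:k_max]:
        rejected[idx] = True
    return rejected


# Hypothetical p-values, for illustration only.
example_p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
example_rejections = benjamini_hochberg(example_p, alpha=0.05)
```

The rule uses only the marginal p-values, which is why its behaviour under dependence across features has been studied so extensively.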
Addressing the dependence issue has often consisted in assessing its
detrimental impact on the performance of standard methods designed to be
optimal under independence, and then deducing patches. Because they must hold
for arbitrarily complex dependence patterns, such approaches, in which
dependence is viewed as a curse, can lead to procedures with low power.
Therefore, for both machine learning and testing issues, a new generation of
methods has emerged, advocating an ad hoc handling of dependence consisting in a
preliminary whitening of the data (see Ahdesmäki and Strimmer, 2010; Hall
and Jin, 2010). However, disentangling the dependent noise from the true
association signal is very challenging, and decorrelation can then alter
the true association signal.
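To make the whitening step concrete, here is a minimal sketch assuming a known covariance matrix and a Cholesky-based transform. This is one of several possible whitening choices and is not claimed to be the one used in the cited papers or in the talk. The helper names and the 2x2 example are hypothetical.

```python
import math


def cholesky(sigma):
    """Lower-triangular L with L * L^T = sigma (sigma symmetric positive definite)."""
    n = len(sigma)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(sigma[i][i] - s)
            else:
                L[i][j] = (sigma[i][j] - s) / L[j][j]
    return L


def whiten(x, L):
    """Return z solving L z = x by forward substitution, so that Cov(z) = I."""
    n = len(x)
    z = [0.0] * n
    for i in range(n):
        s = sum(L[i][k] * z[k] for k in range(i))
        z[i] = (x[i] - s) / L[i][i]
    return z


# Assumed covariance: two features with correlation 0.8 and unit variances.
sigma = [[1.0, 0.8], [0.8, 1.0]]
L = cholesky(sigma)

# Hypothetical mean signal, equal on both features.
signal = [2.0, 2.0]
whitened_signal = whiten(signal, L)
```

In this toy case the second coordinate of the signal shrinks from 2 to about 0.67 after whitening, a small illustration of how decorrelation can alter the association signal when the signal is aligned with the dependence pattern.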
For the purpose of global testing, where the objective is to test for the
significance of an association signal between a set of features and a
covariate, Arias-Castro et al. (2011) suggest that the optimal handling of
dependence should be specific to the pattern of the true association signal,
especially through its sparsity rate. This global testing framework
covers a wide scope of applications, such as functional Analysis of
Variance (fANOVA) and association tests between a region of the genome
formed by contiguous Single Nucleotide Polymorphisms (SNPs) and a
case/control response variable in Genome-Wide Association Studies.
Interestingly, in these two fields of application, many popular
methods are just based on simple aggregation of pointwise test statistics
ignoring their dependence.
The talk will start with a selection of short stories with confusing
conclusions about the proper way to handle dependence. After a tentative
clarification, I will introduce two methods for global testing in which
whitening is adapted both to the pattern of dependence across pointwise
test statistics and to the pattern of the true signal. The performance of
the two testing methods will be illustrated by applications to significance
analysis of ElectroEncephalogram curves in Event-Related Potentials (ERP)
designs and by SNP-set approaches to Genome-Wide Association Studies for
genetic epidemiology issues. I will also discuss the application of these
general principles to prediction in high dimension.