SEMINAR IN DECISION SCIENCES
DEPARTMENT OF ECONOMICS AND FINANCE
LUISS
Viale Romania 32
00197 Roma
Title of the talk:
A general theory for Bayesian inference with linked and duplicated data
Speaker:
Brunero Liseo (Sapienza Università di Roma)
on March 8th from 12:00 to 13:00 in room 207
Abstract
Record linkage is a statistical method that aims to identify
whether two or more records refer to the same statistical entity.
Duplications of the same entity within one single source or
across different files may be interpreted as “clusters of records”,
showing strong similarities across their fields. In this paper we
frame the record linkage process as a formal Bayesian clustering
model and we investigate the role of species sampling models as prior
distributions for the clustering structure. In fact, the different
latent entities underlying one or more data sources can be treated as
the sampled species and the observed records as noisy measurements of
their features. We also discuss an important issue in the clustering
approach to entity resolution, namely the need to bound the cluster
sizes even for large data sets. The proposed statistical models will
be used both in a classical record linkage scenario and in the more
complex framework of multiple duplications within the same data source,
and will serve as supporting models for Bayesian regression analyses with
linked data.
Key words: Species Sampling, Latent structure, Hit and Miss Algorithm,
Clustering.
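
As a rough illustration of what a species sampling prior on the
clustering structure can look like, the sketch below draws a random
partition of records from a Chinese restaurant process, one well-known
species sampling model. The abstract does not say which prior the talk
uses, so the choice of process, the function name sample_crp_partition
and the parameter alpha are all assumptions made for illustration only.

```python
import random


def sample_crp_partition(n_records, alpha, seed=None):
    """Draw cluster labels for n_records from a Chinese restaurant process.

    Hypothetical helper for illustration; not the model from the talk.
    Each cluster plays the role of a latent entity ("species"), and each
    record is assigned to an existing entity or to a new one.
    """
    rng = random.Random(seed)
    labels = []  # labels[i] is the cluster index of record i
    sizes = []   # sizes[k] is the number of records in cluster k
    for _ in range(n_records):
        # An existing cluster k is chosen with weight sizes[k];
        # a new cluster is opened with weight alpha.
        weights = sizes + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(sizes):
            sizes.append(1)   # record starts a new latent entity
        else:
            sizes[k] += 1     # record duplicates an existing entity
        labels.append(k)
    return labels, sizes


labels, sizes = sample_crp_partition(n_records=20, alpha=5.0, seed=0)
print("cluster labels:", labels)
print("cluster sizes: ", sizes)
```

Under this particular prior nothing limits how large a cluster can
become as more records arrive, which is one way to see why bounding
the cluster sizes is raised as an issue in the abstract.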
References
1. Tancredi, A. and Liseo, B. (2015). Regression analysis with linked
data: problems and possible solutions. Statistica, 75(1), 19–35.
2. Lahiri, P. and Larsen, M.D. (2005). Regression analysis with
linked data. Journal of the American Statistical Association, 100,
222–230.
3. Tancredi, A. and Liseo, B. (2011). A hierarchical Bayesian approach to
record linkage and population size problems. The Annals of Applied
Statistics, 5(2B), 1553–1585.
4. Steorts, R.C., Hall, R. and Fienberg, S. (2013). A Bayesian approach to
graphical record linkage and de-duplication. Journal of the American
Statistical Association (in press), preprint arXiv:1312.4645.