SEMINAR IN DECISION SCIENCES
DEPARTMENT OF ECONOMICS AND FINANCE
LUISS
Viale Romania 32, 00197 Roma
Title of the talk: A general theory for Bayesian inference with linked and duplicated data
Speaker: Brunero Liseo (Sapienza Università di Roma)
Date and place: March 8th, from 12:00 to 13:00, in room 207
Abstract
Record linkage is a statistical method for determining whether two or more records refer to the same statistical entity. Duplications of the same entity, within a single source or across different files, may be interpreted as “clusters of records” that show strong similarities across their fields. In this paper we frame the record linkage process as a formal Bayesian clustering model and investigate the role of species sampling models as prior distributions for the clustering structure. In fact, the different latent entities underlying one or more data sources can be treated as the sampled species, and the observed records as noisy measurements of their features. We also discuss an important issue in the clustering approach to entity resolution, namely the need to bound the cluster sizes even for large data sets. The proposed statistical models will be used both in a classical record linkage scenario and in the more complex framework of multiple duplications within the same data source, and will serve as supporting models for Bayesian regression analyses with linked data.
Key words: Species Sampling, Latent structure, Hit and Miss Algorithm, Clustering.
References
1. Tancredi, A. and Liseo, B. (2015). Regression analysis with linked data: problems and possible solutions. Statistica, 75(1), 19–35.
2. Lahiri, P. and Larsen, M.D. (2005). Regression analysis with linked data. Journal of the American Statistical Association, 100, 222–230.
3. Tancredi, A. and Liseo, B. (2011). A hierarchical Bayesian approach to record linkage and population size problems. The Annals of Applied Statistics, 5(2B), 1553–1585.
4. Steorts, R.C., Hall, R. and Fienberg, S. (2013). A Bayesian approach to graphical record linkage and de-duplication. Journal of the American Statistical Association (in press). Preprint arXiv:1312.4645.