Data-driven testing for sample selection bias

Non-random sample selection is common in empirical studies. It arises when an outcome variable of interest is observed only for a restricted, non-random subsample of the data. This often occurs in sociological, medical and economic studies where individuals systematically select themselves into (or out of) the sample based on a combination of observed and unobserved characteristics. Estimates based on models that ignore such non-random selection may be biased and inconsistent. The aim of this project is to develop new procedures for testing for the presence of sample selection bias.

In its classical form, the sample selection model consists of two equations, one for the probability of inclusion in the sample and one for the outcome variable, each specified through a set of available predictors, together with a joint bivariate distribution linking the two equations. The project will build on the recently introduced framework of generalised sample selection models, which incorporates regression splines to deal with non-linear covariate-response relationships and uses copulae to handle non-normal bivariate distributions between the model equations. In such models, the absence of sample selection is equivalent to a product copula, that is, to independence between the two equations. The testing procedures considered in the project will also include a model selection step, through the choice of a copula, thus yielding flexible data-driven methods of testing. The proposed testing procedures will be compared with existing methods in a simulation study investigating their empirical power and empirical significance level.
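To make the two-equation structure concrete, the following is a minimal Python sketch of the classical special case only: bivariate normal errors, one linear predictor, and no splines or copulae. It is not one of the testing procedures of the project. All names and numerical values (the coefficients, the correlation `rho`, the sample size, the seed) are illustrative assumptions. It shows that ordinary least squares fitted to the selected subsample recovers the outcome slope when the two equations are independent, and drifts away from it when they are dependent.

```python
import numpy as np


def simulate_selected_sample(rho, n=200_000, seed=0):
    """Simulate a classical sample selection model (illustrative values).

    Selection equation: unit is observed if 0.5 + 1.0 * x + u > 0
    Outcome equation:   y = 1.0 + 2.0 * x + e
    (u, e) are bivariate normal with unit variances and correlation rho.
    Returns x and y for the selected (observed) units only.
    """
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)
    cov = [[1.0, rho], [rho, 1.0]]
    errors = rng.multivariate_normal([0.0, 0.0], cov, size=n)
    u, e = errors[:, 0], errors[:, 1]
    selected = (0.5 + 1.0 * x + u) > 0
    y = 1.0 + 2.0 * x + e
    return x[selected], y[selected]


def ols_slope(x, y):
    """Least-squares slope of y on x, with an intercept."""
    design = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[1]


# rho = 0: independent equations, so no selection bias is expected.
slope_independent = ols_slope(*simulate_selected_sample(rho=0.0))

# rho = 0.8: dependent equations, so the naive slope is biased.
slope_dependent = ols_slope(*simulate_selected_sample(rho=0.8))

print(f"true slope: 2.0")
print(f"naive OLS slope, rho = 0.0: {slope_independent:.3f}")
print(f"naive OLS slope, rho = 0.8: {slope_dependent:.3f}")
```

Under these assumptions the slope estimated with `rho = 0.0` should sit close to the true value of 2, while the one estimated with `rho = 0.8` should fall noticeably below it. A test for sample selection bias asks, in effect, whether the data are compatible with the first situation.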

Supervisor: Dr Malgorzata Wojtys