A logical difficulty with regression analysis: the estimation of non-existent parameters

Robin Willink
Industrial Research Ltd.

‘All models are wrong’ (Box), so any model used in a regression problem can only approximate the unknown function f(x). It follows that not every parameter of the model represents a quantity that actually exists, and the quantities ‘estimated’ by the calculated regression coefficients are not all properly defined. ‘Parameter estimation’ is therefore a misnomer: the values of the parameters are actually ‘chosen’. Furthermore, the confidence intervals and credible intervals often quoted for these non-existent quantities have no legitimate meaning.

We describe this logical problem in the context of univariate linear regression. Subsequently, we identify quantities that do actually exist and are efficiently estimated by the ordinary least-squares coefficients. The problem of genuine interest is often the estimation of f(x), not the choice of values for the parameters of some approximating function. So we also present a method of estimating f(x) that takes some account of the error incurred by choosing a model. Lastly, we identify other misleading terminology in mathematics and statistics.
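As a rough illustration of the point about quantities that do exist, the sketch below takes one possible reading that the abstract itself does not spell out: when f(x) is not a straight line, the ordinary least-squares coefficients can be viewed as estimating the coefficients of the least-squares line through the noise-free values f(x_i) at the design points. The function `f`, the design points, the noise level `sigma` and the variable names are all assumptions made for this example, not taken from the talk.

```python
import numpy as np


def f(x):
    # Hypothetical "true" function, chosen only for illustration; it is not linear.
    return np.exp(x)


rng = np.random.default_rng(0)

# Fixed design points and the design matrix of a straight-line model.
x = np.linspace(0.0, 1.0, 21)
X = np.column_stack([np.ones_like(x), x])

# A target that does exist: the coefficients of the least-squares line
# fitted to the noise-free values f(x_i) at these design points.
beta_star, *_ = np.linalg.lstsq(X, f(x), rcond=None)

# The straight-line model is wrong: f is not reproduced by that line.
model_error = f(x) - X @ beta_star

# Repeat the experiment many times with additive Gaussian noise (assumed).
sigma = 0.1
n_rep = 5000
Y = f(x) + rng.normal(0.0, sigma, size=(n_rep, x.size))

# One OLS fit per replicate; lstsq accepts several right-hand sides at once.
beta_hat = np.linalg.lstsq(X, Y.T, rcond=None)[0].T  # shape (n_rep, 2)
mean_beta_hat = beta_hat.mean(axis=0)

print("best-line coefficients:", beta_star)
print("mean OLS coefficients: ", mean_beta_hat)
print("largest model error:   ", np.abs(model_error).max())
```

Under these assumptions the averaged OLS coefficients settle on `beta_star`, even though no straight line equals f(x); the target depends on the chosen design points, which is one reason to treat it as a defined quantity rather than a 'true' slope or intercept.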

Session 1a, Statistical Methodology: 10:50–11:10, Room 446
