# All Presentation Abstracts

## Session 1a, Statistical Methodology

This session will be held in the Erskine Building, Room 446

10:50 — 11:10

### A logical difficulty with regression analysis: the estimation of non-existent parameters

Industrial Research Ltd.

‘All models are wrong’ (Box), so any model used in a regression problem can only provide an approximation to the unknown function f(x). Therefore, the parameters of the model do not all represent quantities that actually exist and the quantities ‘estimated’ by the calculated regression coefficients are not all properly defined. So ‘parameter estimation’ is a misnomer and the values of the parameters are actually ‘chosen’. Furthermore, confidence intervals and credible intervals often quoted for the non-existent quantities have no legitimate meaning.

We describe this logical problem in the context of univariate linear regression. Subsequently, we identify quantities that do actually exist and are efficiently estimated by the ordinary least-squares coefficients. The problem of genuine interest is often the estimation of f(x), not the choice of values for the parameters of some approximating function. So we also present a method of estimating f(x) that takes some account of the error incurred by choosing a model. Lastly, we identify other misleading terminology in mathematics and statistics.

11:10 — 11:30

### The fourth-root-n consistency and the efficiency of profile likelihood

Yuichi Hirose
Victoria University of Wellington

Profile likelihood is a popular method of estimation in the presence of nuisance parameters. It is especially useful for estimation in semi-parametric models, since the method reduces the infinite-dimensional estimation problem to a finite-dimensional one. In this presentation, we show the efficiency of a semi-parametric maximum likelihood estimator based on the profile likelihood. By introducing a new parameterization, we improve on the seminal work of Murphy and van der Vaart (2000) in two ways: we prove the no-bias condition in a general semi-parametric model context, and we deal with the direct quadratic expansion of the profile likelihood rather than an approximate one.

11:30 — 11:50

### Comparison between K–Means and Hierarchical Clustering of Dependent and Independent Data Generated from Multivariate Gaussian Copula Function

Francesca Marta Lilja Di Lascio
Department of Statistics, University of Bologna, Italy

Warren J. Ewens
Department of Biology, University of Pennsylvania, USA

We use the multivariate Gaussian copula function ([3]) to evaluate the ability of the K–means algorithm ([2]) and the hierarchical method ([1]) to identify clusters corresponding to the marginal probability functions that are linked, through the copula function, by the dependence structure of their joint distribution function.

For both K–means and hierarchical clustering we run simulations that vary (1) the sample size (small and large), (2) the value of the dependence parameter of the copula function, (3) the values of the parameters of the margins (well–separated, overlapping and distinct margins) and, finally, (4) the type of dispersion matrix (unstructured and exchangeable).

We evaluate the performance of the two clustering methods under study by means of (1) the difference between the ‘real’ value of the dependence parameter and its value post-clustering, (2) the percentage of iterations in which the number of observations in each cluster differs from the ‘real’ one, and (3) the capability to identify the exact probability model of the margins.

We find that the hierarchical method works well if the margins are clearly distinct, irrespective of cluster size, while K–means works well only if the sample is small. The performance of both clustering methods is independent of the dispersion structure.

References:

1. Everitt, B. (1993). Cluster Analysis (3rd ed.). London: E. Arnold; New York: Halsted Press.
2. Hartigan, J. A., & Wong, M. A. (1979). Algorithm AS 136: A K–Means Clustering Algorithm. Applied Statistics, 28(1), 100–108.
3. Nelsen, R. B. (2006). An Introduction to Copulas. New York: Springer.
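As a rough illustration of the kind of data such a simulation study starts from, the sketch below draws pairs whose dependence follows a bivariate Gaussian copula with dependence parameter `rho`. It is not the authors' simulation code: the function name `sample_gaussian_copula`, the restriction to two dimensions, and the use of normal margins are all assumptions made for this example.

```python
import math
import random
from statistics import NormalDist


def sample_gaussian_copula(n, rho, margins, seed=0):
    """Draw n pairs whose dependence follows a bivariate Gaussian copula.

    rho     -- dependence parameter (correlation of the latent normals)
    margins -- two NormalDist objects used as illustrative margins (assumption)
    """
    rng = random.Random(seed)
    std = NormalDist()  # standard normal for the latent variables
    eps = 1e-12         # keeps uniforms strictly inside (0, 1) for inv_cdf
    pairs = []
    for _ in range(n):
        # Correlated latent standard normals.
        z1 = rng.gauss(0.0, 1.0)
        z2 = rho * z1 + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0)
        # Map to uniforms: (u1, u2) is a draw from the Gaussian copula.
        u1 = min(max(std.cdf(z1), eps), 1.0 - eps)
        u2 = min(max(std.cdf(z2), eps), 1.0 - eps)
        # Apply the inverse CDF of each margin.
        pairs.append((margins[0].inv_cdf(u1), margins[1].inv_cdf(u2)))
    return pairs
```

Changing `rho` and the parameters of the two margins would give the dependent or independent, and well-separated or overlapping, settings that the abstract describes; a clustering routine could then be run on the resulting pairs.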

11:50 — 12:10

### Bias reduction for kernel estimates of density functionals

Professor Martin Hazelton
Institute of Information Sciences and Technology, Massey University

There are a number of important statistical functions that can be expressed as simple functionals of probability densities. These include the relative risk function (a ratio of typically bivariate densities used in geographical epidemiology and elsewhere) and the binary regression function. In many cases parametric models are insufficiently flexible to describe these functionals and a nonparametric approach is to be preferred.

Nonparametric estimation of such functionals can be achieved by substituting kernel estimates in place of the unknown densities. Moreover, in principle we can obtain improved performance in the functional estimates by applying a range of bias reduction techniques developed for density estimation per se. However, in practice this approach tends to lead to poor results.

In this talk I will describe a new methodology which combines local bias reduction techniques borrowed from the density estimation literature with global smoothing optimized for the particular functional to be estimated. The results are encouraging.

The methodology is illustrated through examples on binary regression for low birth weight data, and on geographical variation in the relative risk of cancer of the larynx.

## Session 1b, Statistical Methodology

This session will be held in the Erskine Building, Room 446

13:10 — 13:30

### A General Algorithm for obtaining Standard Errors within the EM algorithm Framework

Dr Nazim Khan
University of Western Australia

The EM algorithm is a powerful tool for parameter estimation when there are missing or incomplete data. In most applications it is easy to implement: the mathematics involved is, in principle, not very demanding, and the method does not require second derivatives. This latter feature is at once an attraction of the algorithm and one of its shortcomings: standard errors are not automatically generated during the EM computations. Various methods have been proposed for obtaining standard errors when using the EM algorithm. In 1982 Louis obtained the observed information matrix using the "missing information principle" of Orchard and Woodbury. However, the exact observed information cannot be computed using Louis' method when the data are not independent, as is the case, for example, in hidden Markov models. Hugh (1997) used Louis' idea to approximate the observed information for hidden Markov models.

We present a general algorithm to obtain the exact observed information within the EM framework. The algorithm is simple, and the computations can be performed in the last cycle of the EM algorithm. Examples using mixture models are given and some comparisons are made with the work of Louis. Finally, some simulation results and data analysis are presented in the context of hidden Markov models and ion channel data.

13:30 — 13:50

### Bayesian Analysis of Linear Regression Models Using Exact Markov Chain Monte Carlo

Jason Phillip Bentley
University of Canterbury

Bayesian variable selection (BVS) typically requires Markov chain Monte Carlo (MCMC) exploration of large sample spaces. MCMC methods provide samples distributed approximately according to the stationary distribution of a Markov chain. Coupling from the past (CFTP), proposed by Propp and Wilson (1996), outlines a framework for exact MCMC methods. We investigate the use of an exact Gibbs sampler for BVS in linear regression models using a posterior distribution proposed by Celeux et al. (2006). We consider this within the wider context of Bayesian analysis of linear regression models. We use simulated and real data studies to assess performance and inference. We consider methods proposed by Huang and Djuric (2002) and Corcoran and Schneider (2004). We find that the CFTP Gibbs sampler method provides exact samples, while the monotone version provides only close to exact samples. We conclude that exact MCMC methods for Bayesian analysis in linear regression benefit the accuracy of inference when they are available.
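To make the idea of coupling from the past concrete, here is a minimal sketch for a small finite-state chain. It is a generic textbook-style illustration, not the exact Gibbs sampler studied in the talk: the function name `cftp_sample`, the inverse-CDF update rule, and the horizon-doubling scheme are assumptions chosen for this example. It also assumes a transition matrix for which the coupled copies coalesce with probability one.

```python
import random


def cftp_sample(P, rng):
    """Return one exact draw from the stationary distribution of a finite chain.

    P   -- transition matrix as a list of rows, each summing to 1
    rng -- random.Random instance supplying the shared randomness
    """
    n = len(P)

    def step(state, u):
        # Inverse-CDF update: every copy of the chain is driven by the same u.
        acc = 0.0
        for nxt, p in enumerate(P[state]):
            acc += p
            if u < acc:
                return nxt
        return n - 1  # guard against floating-point round-off

    us = []      # us[k] drives the transition from time -(k+1) to time -k
    horizon = 1  # how far into the past the chains are started
    while True:
        # Extend the randomness further back, reusing all earlier draws.
        while len(us) < horizon:
            us.append(rng.random())
        # Start one copy of the chain in every state at time -horizon.
        states = list(range(n))
        for k in range(horizon - 1, -1, -1):
            states = [step(s, us[k]) for s in states]
        if len(set(states)) == 1:  # all copies have coalesced by time 0
            return states[0]
        horizon *= 2
```

The essential design point is that the random numbers attached to each past time step are reused when the starting time is pushed further back; drawing fresh randomness on every restart would bias the output.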

13:50 — 14:10

### Coupling and Mixing Times in Markov Chains

Jeffrey J Hunter
Massey University Auckland

The properties of the time to coupling and the time to mixing in Markov chains are explored. In particular, the expected time to coupling is compared with the expected time to mixing (as introduced by the presenter in “Mixing times with applications to perturbed Markov chains”, Linear Algebra Appl. 417, 108-123 (2006)). Comparisons in some special cases, as well as some general results, are presented.

## Session 1c, Statistics Education

This session will be held in the Erskine Building, Room 446

14:15 — 14:35

### Statistical shifts and the Curriculum

Mike Camden
Statistics New Zealand and NZSA Education Committee

Statistical practice has shifted in recent years, to be more widely applied, more accessible, and more visual. The writers of the draft curriculum were sensitive to these shifts. The new curriculum, we hope, will further improve the links between mathematics education and the world of statistical practice and communication. We will outline what we see as the shifts, and how teachers can benefit from them. There are some unfamiliar and possibly scary items in the Draft (like data cleaning, multivariate data, experimental design, and resampling). These can shift school statistics to be more accessible to students, and more relevant to their future lives and careers. We will demystify these items with activities for the classroom.

14:35 — 14:55

### A Review of a Visual Teaching Resource for Statistics and Modelling in Schools

John A. Harraway
University of Otago

At the statistics education afternoon of the 2005 New Zealand Statistical Association Conference, researchers from seven departments at the University of Otago described current research that used statistical procedures. The presentations were recorded on video, re-recorded, and edited in the studio, resulting in a DVD of 145 minutes. Statistics New Zealand contributed two further studies. Data from the nine studies were placed on a CD, and a DVD/CD pack containing the videos and data was made available to schools. Teachers at the education afternoon were enthusiastic about having access to such a resource, and this influenced the decision to proceed to the final product. It was seen as a way of providing ideas for project work in Statistics and Modelling, as well as helping to motivate the teaching of statistics by showing current research in interesting contexts. Clips from the DVD will be shown. The full DVD is available for viewing. Opinion is sought from teachers about the value of this resource as a teaching aid. Are the contexts of interest? Can the data be accessed in schools? Should there be a fuller description of the data with a list of questions for investigation? If there are positive answers to these and other questions, further similar resources will be developed.

14:55 — 15:15

### The Development of Teaching Resources for Statistics and Modelling

John A. Harraway
University of Otago

Production of a second DVD/CD pack for use in schools is currently under review. Identifying reaction to the current DVD/CD pack will be a guide to the worth of such teaching aids. Several areas being considered for future inclusion are rugby injury data, lifestyle data and health survey data. One example described here identifies the prevalence of alcohol and tobacco consumption among 6-24-month post-partum New Zealand women, many of whom are breastfeeding. Maternal alcohol consumption is known to negatively affect the fetus. Tobacco consumption is known to negatively affect exposed young children. This study, involving 318 South Island women and using a self-administered questionnaire, assesses the prevalence of these lifestyle behaviours and the categories of other socio-demographic factors. The sampling methodology and the data will be described. The data include, for each respondent, ethnicity, education level, income level, marital status and maternal status, all of which are categorical, and age, which is continuous. Some results from the data will be reported. The question is whether this study, with its potential for further analysis and investigation, would be of interest to schools that are teaching the subject Statistics and Modelling.

## Session 1d, Statistics Education Resources

This session will be held in the Erskine Building, Room 446

15:40 — 16:00

### Exploring data on a rare threatened bird - the rock wren

Ian Westbrooke
Department of Conservation

I will introduce a small threatened native bird - the rock wren, and provide resources on the bird itself to set the context. I will then explore a small dataset on this fascinating bird's population status, exactly as it came to me from remote mountains in Fiordland. There are 24 numbers - counts from 12 grids of 25 hectares in each of 1984/5 and 2005. This data gives a great opportunity to create tables and graphs, and calculate means, by hand or using a tool like Excel, tasks suitable for both junior and senior classes. For year thirteen students, the data poses the question asked by the person who sent the data - what is a confidence interval for the average change in the population per grid square? A confidence interval can be created using either traditional or resampling techniques.
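The resampling route to such an interval can be sketched as a percentile bootstrap on the per-grid differences. This is only an illustration of one common resampling technique, not necessarily the one used in the talk; the function name `bootstrap_ci_mean_change` and its parameters are assumptions, and the actual rock wren counts are not reproduced here.

```python
import random
from statistics import mean


def bootstrap_ci_mean_change(before, after, n_boot=5000, level=0.95, seed=0):
    """Percentile bootstrap interval for the mean per-grid change (after - before).

    before, after -- paired counts, one pair per grid square
    """
    rng = random.Random(seed)
    # Work with paired differences so that each grid is its own control.
    diffs = [a - b for b, a in zip(before, after)]
    # Resample the differences with replacement and record each resample mean.
    boot_means = sorted(
        mean(rng.choices(diffs, k=len(diffs))) for _ in range(n_boot)
    )
    # Read off the lower and upper percentiles of the bootstrap distribution.
    lo = boot_means[round((1 - level) / 2 * n_boot)]
    hi = boot_means[round((1 + level) / 2 * n_boot) - 1]
    return mean(diffs), (lo, hi)
```

Called with the 12 counts from each survey as `before` and `after`, the function returns the observed mean change together with an approximate 95% interval; with only 12 grids, the result should be read as a rough guide.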

16:00 — 16:20

### Maui’s dolphin: uncovering a new subspecies

National Institute of Water and Atmospheric Research Ltd. (NIWA)

Hector's dolphins (Cephalorhynchus hectori) are the smallest and rarest marine dolphin in the world. Two geographically separated populations of Hector's dolphin exist: one on the west coast near Auckland and one around the South Island. In 2002, I was asked to help uncover whether the North Island population was a separate subspecies from the larger South Island population, following a study that showed considerable genetic differences between them. To address this question, we examined the skulls of Hector's dolphins held by various museums around the country to compare the two populations. The North Island dolphins were shown to be sufficiently distinct in head size and shape from those of the South Island to be recognised as a new subspecies, Maui's dolphin (Cephalorhynchus hectori maui). Here, I will demonstrate some exploratory analysis of the skull data that were used in this study, focusing primarily on graphical methods.

16:20 — 16:40

### Data Quality: A Business Statistician's Greatest Challenge

Michelle Wood
IAG New Zealand Limited

Whilst we live to fit the model and answer the question of life, there is often much work and many hours spent at the computer looking at, and trying to decipher, the data. It can be an art form to extract meaningful information from databases not designed with statistical analysis in mind. This talk is a light-hearted look at some of the discoveries that I have made about different data that I have handled in preparation for statistical analysis within a variety of industries, ranging from clinical trials research through to commercial lines insurance pricing. There are some useful database design basics that can assist us in understanding the data structure. Once the data structure is understood, attention turns to the variables and their content, which can pose some interesting challenges, especially in industries where information may be secondary to doing business and data quality doesn't have a high priority. I hope to share some of my successes and failures in handling data from different sources and offer some advice that may assist you in maximising the value you can extract from your data sources.

16:40 — 17:00

### The first SURF: a Synthetic Unit Record File for schools

Sharleen Forbes & Emma Mawby
Statistics New Zealand

Teachers of statistics need unit-record datasets that are rich with relevant information. Official Statistics agencies have many such datasets, but need to keep them confidential. Our first SURF for schools aims to move towards meeting both these needs. It contains seven variables on 200 synthetic people. These people were designed to have very similar characteristics to 8,500 real people in the NZ Income Survey of June 2004. We will describe how we worked from the CURF (Confidentialised Unit Record File) for the Income Survey, to the SURF. The SURF is intended for learners, and the CURF is intended for researchers. We will do some school-friendly analysis on both, and assess the validity of results that learners would get from the SURF. The SURF comes with classroom-ready activities that we will outline. Copies of the SURF for Schools CD will be available.

## Session 2a, Sample Surveys

This session will be held in the Erskine Building, Room 031

10:50 — 11:10

### Assessment of Imputation Methods for Integrated Business Data

Ricardo Enrico Namay II
Statistics New Zealand

A comparison of three imputation methods for an integrated business dataset was carried out. Initially, multiple imputation, mean imputation and donor imputation were tested. Because of computational limitations, the study was subsequently restricted to mean and donor imputation. The paper demonstrates how imputation methods can be computationally compared with respect to several dimensions: order and distribution preservation, plausibility of individual values, preservation of correlations, and aggregate statistics. Through random sampling and linear programming, this paper also proposes a method to construct a rectangular subsample that replicates the pattern of missing values of the dataset to be imputed while at the same time preserving relative imputation class sizes.

11:10 — 11:30

### Stratified sampling for skewed populations: beyond the cumulative square root rule

Michael Hayward
University of Canterbury

Stratified sampling is a widely used sample selection technique, particularly for the skewed populations encountered in business, agriculture, income, and wealth. An important consideration in stratification design is strata delineation, which is often based on the cumulative square root of frequencies work of Dalenius and Hodges (1959) and Cochran (1977). This talk will cover investigations into the success of the cumulative square root approach for a range of skewed populations and comparisons with other recent methods.

References:

• Cochran, W. G. (1977). Further aspects of stratified sampling. In Sampling techniques (3rd ed.) (pp.115-149). New York, USA: John Wiley & Sons.
• Dalenius, T., & Hodges, J. L. (1959). Minimum variance stratification. Journal of the American Statistical Association, 54, 88-101.
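For readers unfamiliar with the cumulative square root of frequencies rule, the following sketch shows one simple way to compute stratum boundaries with it. It is an illustrative simplification, not the procedure evaluated in the talk: the function name `cum_sqrt_f_boundaries`, the equal-width initial bins, the default of 50 bins, and the choice of the first bin that reaches each target are all assumptions. The sketch also assumes the values are not all identical.

```python
import math


def cum_sqrt_f_boundaries(values, n_strata, n_bins=50):
    """Stratum boundaries from the cumulative square-root-of-frequency rule."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins  # assumes hi > lo
    # Step 1: frequency of the stratification variable in narrow bins.
    freq = [0] * n_bins
    for v in values:
        idx = min(int((v - lo) / width), n_bins - 1)  # top value goes in last bin
        freq[idx] += 1
    # Step 2: cumulate the square roots of the bin frequencies.
    cum, running = [], 0.0
    for f in freq:
        running += math.sqrt(f)
        cum.append(running)
    total = cum[-1]
    # Step 3: cut the cumulative scale into n_strata equal parts.
    boundaries = []
    for k in range(1, n_strata):
        target = k * total / n_strata
        # First bin whose cumulative sqrt(f) reaches the target; use its upper edge.
        idx = next(i for i, c in enumerate(cum) if c >= target)
        boundaries.append(lo + (idx + 1) * width)
    return boundaries
```

For a right-skewed variable the resulting boundaries fall closer together at the low end of the range, where most units lie, and further apart in the long tail. The result depends on the number of initial bins.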

11:30 — 11:50

### Optimal Survey Design When Nonrespondents are Subsampled for Followup

A. James O’Malley
Harvard Medical School

Healthcare surveys often first mail questionnaires to sampled members of health plans and then follow up mail nonrespondents by phone. The high unit costs of telephone interviews make it cost-effective to subsample the followup. We derive optimal subsampling rates for the phone subsample for comparison of health plans. Computations under design-based inference depart from the traditional formulae for Neyman allocation because the phone sample size at each plan is constrained by the number of mail nonrespondents and multiple plans are subject to a single cost constraint. Because plan means for mail respondents are highly correlated with those for phone respondents, more precise estimates (at fixed overall cost) for potential phone respondents are obtained by combining the direct estimates from phone followup with predictions from the mail survey using small-area estimation (SAE) models.
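As a point of reference, the traditional Neyman allocation that the abstract departs from assigns sample to each stratum in proportion to its size times its standard deviation. The sketch below shows only that classical baseline; it does not implement the constrained allocation derived in the talk, and the function and argument names are assumptions made for illustration.

```python
def neyman_allocation(total_n, stratum_sizes, stratum_sds):
    """Classical Neyman allocation: n_h proportional to N_h * S_h.

    total_n       -- overall sample size to distribute
    stratum_sizes -- population size N_h of each stratum
    stratum_sds   -- within-stratum standard deviation S_h
    """
    weights = [size * sd for size, sd in zip(stratum_sizes, stratum_sds)]
    total = sum(weights)
    # Allocations are returned unrounded and without any upper bounds.
    return [total_n * w / total for w in weights]
```

Nothing in this formula stops an allocation from exceeding the number of units available in a stratum, which is the situation the abstract describes when the phone sample at a plan is limited by the number of mail nonrespondents.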

11:50 — 12:10

### Estimation in Multiple Frame Surveys

Alastair Scott
University of Auckland

Patricia Metcalf
University of Auckland

In a multiple frame survey, independent samples are drawn from a number of frames whose union is equal to the population of interest, although individual frames might only give partial coverage. For example, the 4th Auckland Diabetes Heart and Health Survey combined samples from two frames, the standard Statistics New Zealand list of census mesh blocks and the Electoral Roll. In this case, the first frame would provide complete coverage of the population of interest (all Auckland residents between the ages of 18 and 74). However, the survey specifications stipulated that at least 1000 Maori and 1000 Pacific Islanders be included in the sample, and using the Electoral Roll provides a reasonably easy way of achieving this objective.

In this talk, we look at a class of estimation procedures that have two desirable properties. Firstly, they can be implemented using only standard software for single frame surveys and, secondly, the same set of weights is used for all variables. We examine the performance of several members of the class on data from the Auckland Diabetes Heart and Health Survey.

## Session 2b, Confidentiality Issues

This session will be held in the Erskine Building, Room 031

13:10 — 13:30

### Census tables: richness, structure and risk

Lisa Henley
Statistics New Zealand

Mike Camden
Statistics New Zealand

Detailed tables of counts from a census are highly valued by planners for their richness of information, but bring risk of disclosing particulars about individuals. Statistical agencies aim to sift the richness from the risk. We will look at the structures inside some tables, and suggest ways to measure the important features of these structures. We will examine the risks that come with sparseness, and assess the effects on the richness of rounding and suppression methods.

13:30 — 13:50

### Generating Synthetic Unit-Record Data from Published Marginal Tables

Alan Lee
University of Auckland

We survey methods for generating synthetic data sets without making use of unit-record data. The methods we describe allow the creation of unit-record data in the form of high-dimensional tables whose marginals match publicly available marginal tables. We consider methods based on integer and quadratic programming, which allow the construction of tables that exactly match the public tables, and also methods based on iterative proportional fitting, which match the public tables approximately.

We describe a set of R functions which implement the methods under study, and apply the methods to data from the 2001 Census of Population and Dwellings.
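Iterative proportional fitting, one of the approaches mentioned above, can be sketched in a few lines for a two-way table. The talk's implementation is in R and handles high-dimensional tables; the Python function below (`ipf`, with its `tol` and `max_iter` parameters) is only a hypothetical two-dimensional illustration of the general technique.

```python
def ipf(seed_table, row_targets, col_targets, tol=1e-9, max_iter=1000):
    """Iterative proportional fitting of a two-way table to target margins.

    seed_table  -- starting table as a list of rows (e.g. all ones)
    row_targets -- published row totals to match
    col_targets -- published column totals to match
    """
    table = [row[:] for row in seed_table]  # work on a copy
    for _ in range(max_iter):
        # Scale each row so that it sums to its target total.
        for i, target in enumerate(row_targets):
            s = sum(table[i])
            if s > 0:
                table[i] = [x * target / s for x in table[i]]
        # Scale each column so that it sums to its target total.
        for j, target in enumerate(col_targets):
            s = sum(row[j] for row in table)
            if s > 0:
                for row in table:
                    row[j] *= target / s
        # Columns now match; stop once the rows also match within tolerance.
        row_err = max(abs(sum(r) - t) for r, t in zip(table, row_targets))
        if row_err < tol:
            break
    return table
```

The fitted cells are generally not integers, so turning them into synthetic unit records needs a further rounding or sampling step, which this sketch does not attempt.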

13:50 — 14:10

### Global Recoding, Information Loss, and Confidentiality

Alistair Gray
Statistics Research Associates Ltd

Fraser Jackson
Victoria University of Wellington

Stephen Fienberg
Carnegie Mellon University

The problem of providing informative summaries of contingency tables has been addressed in many different ways. We regard categories as primary elements in the definition of the space over which the table is defined, and this talk suggests a method of exploring how they match the information structure in the data. It develops a systematic single category collapsing procedure which is based on finding the member of a class of restricted models which maximizes the likelihood of the data and uses this to find a parsimonious means of representing the table. The focus is on information rather than statistical testing. A feature of the procedure is that it can easily be applied to tables with up to millions of cells, providing a new way of analysing large data sets in many disciplines. An obvious application is confidentiality of Census tables, where there is a tradeoff between preserving the information in the table for the user and preserving nondisclosure for the respondent. This talk is based on the findings of the OSRDAC 2005/06 project “Impacts of global recoding to preserve confidentiality on information loss and statistical validity of subsequent data analysis”, to be published on the SNZ website.

## Session 2c, Social Surveys

This session will be held in the Erskine Building, Room 031

14:15 — 14:35

### Nothing to worry about: problems in the disaggregation of expenditure statistics

Geoffrey Jones
Massey University

Stephen Haslett
Massey University

Jamas Enright
Statistics New Zealand

This project, funded by the StatResearch programme of Statistics New Zealand through the Official Statistics Research and Data Archive Programme (OSRDAC), investigated the use of small-area estimation techniques for breaking down national expenditure statistics into different ethnic groups, with a particular focus on Maori, and into different expenditure groups. Data from the 2001 New Zealand Census were used to add strength to the direct estimates available from the 2001 Household Expenditure Survey via a unit-level regression model. While the method worked successfully for Total Expenditure, the attempt to extend the methodology to estimation of the finer CPI expenditure categories met with some interesting practical and methodological problems.

14:35 — 14:55

### The Post-Enumeration Survey - Features of the Estimation Methodology

Judith Archibald
Statistics New Zealand

Counting more than four million people throughout New Zealand is a major undertaking, and inevitably some people will be missed or counted more than once by the census. Many countries conduct surveys to estimate the populations not enumerated by their censuses. The 2006 Post-enumeration Survey (PES) was the third to be undertaken in New Zealand since the inaugural PES in 1996. The main objective of the 2006 PES was to gauge the level of national coverage (undercount and overcount) in the 2006 Census.

This paper will describe the statistical rationale behind the PES and the principles of the estimation methodology. It will also discuss some methodological extensions to deal with the practicalities of the Census environment.

14:55 — 15:15

### Modelling Social Change: The parameterisation of log-linear models to measure inter-ethnic cohabitation patterns in New Zealand

Lyndon Walker
University of Auckland

This paper discusses the application of log-linear modelling techniques to Census data in order to examine cohabitation patterns in New Zealand from 1981 to 2001. The main focus of the study is ethnic homogamy (couples where each partner has the same ethnicity) amongst couples who live together and how it has changed over this time period. A quasi-independence (or diagonal dominance) model is applied to each period in order to see the relative degree of homogamy across different ethnic groups. This model is then reparameterised to incorporate a time factor so that the changes in each group can be measured across the five Census periods. An alternative parameterisation known as the "crossing parameter model" is then applied to the data to test whether there has been a change in the degree to which people will cross ethnic boundaries in their relationships. In particular this parameterisation aims to test whether there is a difference in the cohabitation choices of people who have indicated more than one ethnicity in their Census form.

## Session 2d, Social Data

This session will be held in the Erskine Building, Room 031

15:40 — 16:00

### A new New Zealand static microsimulation model – challenges with data

Rissa Ota
Ministry of Social Development

Helen Stott
Ministry of Social Development

The New Zealand Ministry of Social Development has been developing a new static microsimulation model of the national tax and transfer system. The survey data used for the simulation is from the Survey of Family, Income and Employment (SoFIE), which is a rich source of information about income, employment, benefits and family structure changes over the interview year. The 2002/3 survey data is the first wave of a longitudinal survey which will be carried out for eight years.

As the primary use of the database will be modelling changes to the income support system, the primary focus is on benefit recipients and low income families. This paper gives an overview of the development of the database, with emphasis on the data synthesis, imputation and calibration of the beneficiary population. Calibration using generalised regression estimators has enabled a wide range of benchmarks to be used. However, there have been a range of challenges encountered along the way, including issues around updating the data as the benefit system has been undergoing major changes, and the representativeness of the data as the number of unemployed has dropped significantly since the first wave was collected.

16:00 — 16:20

### Text Mining of Te Puni Kokiri Project Data

Paul J. Bracewell
Offlode Ltd.

This presentation outlines the findings from a text mining proof of concept performed for Te Puni Kokiri (Ministry of Maori Development) by Offlode Ltd using SAS Enterprise Miner.

Almost 2,500 bilingual documents relating to Te Puni Kokiri's Whanau Development Action and Research projects were provided for analysis. These documents relate to approximately 100 projects conducted under the Ministry's direction over a two-year period. The aim was to identify and supply evidence of the actual and implied outcomes of the projects using a cost-effective methodology.

Analyses revealed that among the success indicators of the Whanau Development Action and Research Projects is the implementation of communal infrastructure targeting local communities. Plans for sustainability rely on the community with specialist assistance from external agencies.

Additionally, based on interpretation and consultation with domain experts, it was possible to quantify the quality of the final reports. This is particularly useful for determining which parties may need assistance communicating their results effectively. It is also possible to predict the quality of a final report based on preliminary documentation, such as proposals and e-mails. Preliminary documents that are not overly reworked and have an action theme tend to result in better quality documents.

16:20 — 16:40

### Small Area Estimation for ILO-Unemployment

Stephen Haslett
Massey University

Alasdair Noble
Massey University

Felibel Zabala
Statistics New Zealand

This research fits hierarchical Bayes models under a superpopulation structure to provide sound Territorial Local Authority level estimates of International Labour Organisation (ILO) unemployment. The models are fitted via R and WinBUGS using Markov Chain Monte Carlo techniques and are based on strong priors developed from extensive historical information. Unemployed count models combine survey information on ILO unemployment from the quarterly Household Labour Force Survey (HLFS) with monthly Ministry of Social Development (MSD) information on registered unemployed for the period from the first quarter of 2001 to the first quarter of 2006. The accuracy of estimates is good for levels at which sample sizes in the HLFS are otherwise too small, and the method also allows monitoring of changes in model parameters over time. Relative risk models, which incorporate census population projections, are also fitted. The outcome is improved and potentially publishable ILO-based estimates of unemployment at a finer geographic level than is currently possible from the HLFS alone. The research was funded under the Statistics New Zealand OSRDAC Official Statistics Research programme.

16:40 — 17:00

### Measuring Labour Mobility in New Zealand

Walter Davis
Statistics New Zealand

Statistics New Zealand’s Linked Employer-Employee Database (LEED) links employer payroll data with Statistics New Zealand’s Business Frame to create a view of the labour market which includes nearly every business and employee in the economy. This database provides the opportunity to investigate the dynamics of the labour market. This presentation looks at the possibilities and challenges of using LEED to measure geographic labour mobility in 58 labour market areas (LMAs) as defined by the Department of Labour (Newell & Papps 2001). Preliminary analysis will investigate inter-regional labour flows and mobility by various firm and worker characteristics.

## Session 3a, Statistical Genomics

This session will be held in the Erskine Building, Room 445

10:50 — 11:10

### Statistical methods for microarray-based gene set analysis

Sarah Song
University of Auckland

The analysis of gene sets has become a popular topic in recent times, with researchers attempting to improve the interpretability of their microarray analyses through the inclusion of supplementary biological information. While a number of options for gene set analysis exist, most do not incorporate inter-gene correlation information, despite the fact that such correlations are known to be biologically relevant. In this talk the characteristics of some of the most widely used gene set analysis methods will be examined, based on their performance in both simulated and real data sets. In particular the importance of incorporating correlation information into the analysis process will be investigated.

11:10 — 11:30

### A novel statistical model to identify biomarkers in 2D proteomic gels

Steven Wu
University of Auckland

M. Black
University of Otago

R. North
University of Auckland

A. Rodrigo
University of Auckland

Proteomic technologies are used to identify differentially expressed plasma proteins that may serve as biomarkers to predict disease. In this study, our aim is to use 2D-gel electrophoresis to identify sets of proteins in early pregnancy plasma that are associated with the subsequent development of pre-eclampsia, a severe hypertensive complication of pregnancy. However, due to technical issues, traditional statistical methods lack the power to detect significant changes in protein abundance between women with and without the disease. We have developed a novel statistical model of 2D-gel data that incorporates both the probability that a spot is expressed and the conditional probability of expression intensity. The model also takes account of threshold detection levels. Using this model, we have gone on to develop two approaches to identifying spots implicated in differences between women with and without pre-eclampsia. These approaches use either a Likelihood Ratio Test or a Bayesian MCMC procedure to identify significant spots. In this talk, I present our model and discuss the relative merits of the two approaches we have developed.

11:30 — 11:50

### Incorporating Genotype Uncertainty Into Mark-recapture-type Models For Estimating Abundance Using DNA Samples

Janine Wright
University of Otago

The use of genetic tags (from non-invasive samples such as hair and faeces) to identify individual animals is increasingly common in wildlife studies. Non-invasive genetic sampling has many advantages and huge potential, but while it is possible to generate significant amounts of data from these samples, the biggest challenge in the application of the approach is overcoming inherent errors. Genotyping errors arise when poor sample quality, due to an insufficient quantity of DNA, leads to failure of DNA amplification at one or more loci. As a result, heterozygous individuals are scored as homozygotes at those loci because only one allele is detected (termed 'allelic drop-out'). False alleles are also possible. Error rates will be species-specific, and will depend on the source of samples and the way the samples have been handled. If errors go undetected and the genotypes are naively used in mark-recapture models, significant overestimates of population size can occur. Using data from the brush-tailed possum (Trichosurus vulpecula) in New Zealand and the European badger (Meles meles), we describe a method based on Bayesian imputation that allows us to model data from samples that include uncertain genotypes.

11:50 — 12:10

### Filling in the Blanks - Inferring Genetic Relationships Between Individuals Based on Incomplete Information

Steven Miller
University of Auckland

Grant Harper
Department of Conservation

James Russell
University of Auckland

Hamish MacInnes
University of Auckland

Rachel Fewster
University of Auckland

A common method for inferring individuals’ genetic relationships is to calculate the probability that each individual is a member of each of a set of potential populations. This is achieved most simply by building population allele profiles based on individuals sampled from that population, then using the Hardy-Weinberg Equilibrium equations to calculate the probability that individuals’ genotypes could have been drawn from those genetic profiles. The individuals’ sets of probabilities can then be used to characterise groups of genetically similar individuals. However, these relative comparisons are invalid if individuals possess different levels of completeness for the selected genetic traits. It is not always feasible to characterise every individual for every genetic trait selected for the analysis. We present a novel method for inferring an individual’s complete-information probability of belonging to a population when that individual has an incomplete set of genetic information. We apply this method to data from a Rattus sp. post-eradication repopulation scenario from Pearl Island, Stewart Island (Rakiura), New Zealand. In this scenario, the reappearance of rats following eradication needed to be identified as a reinvasion from the surrounding mainland, or a failed eradication on the island itself.
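The standard calculation described above, and the reason incomplete genotypes make comparisons misleading, can be sketched as follows. This is only an illustration of the basic Hardy-Weinberg assignment probability, not the authors' complete-information method; the function names, locus names and allele frequencies are all hypothetical.

```python
from math import prod


def genotype_probability(genotype, allele_freqs):
    """Hardy-Weinberg probability of one diploid genotype at one locus."""
    a, b = genotype
    p = allele_freqs.get(a, 0.0)
    q = allele_freqs.get(b, 0.0)
    # Homozygote: p^2. Heterozygote: 2pq.
    return p * p if a == b else 2.0 * p * q


def assignment_probability(individual, population_profile):
    """Multiply per-locus probabilities over the loci that were typed.

    individual: dict mapping locus -> (allele, allele), or None if missing.
    population_profile: dict mapping locus -> {allele: frequency}.
    """
    return prod(
        genotype_probability(genotype, population_profile[locus])
        for locus, genotype in individual.items()
        if genotype is not None  # missing loci are simply skipped
    )


# Hypothetical allele profile for one candidate population.
profile = {
    "locus1": {"A": 0.5, "B": 0.5},
    "locus2": {"C": 0.2, "D": 0.8},
}
complete = {"locus1": ("A", "B"), "locus2": ("C", "C")}
partial = {"locus1": ("A", "B"), "locus2": None}

print(assignment_probability(complete, profile))
print(assignment_probability(partial, profile))
```

In this toy case the individual with a missing locus receives a larger probability only because fewer factors are multiplied, which is the kind of non-comparability the abstract refers to.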

## Session 3b, Biometrics

This session will be held in the Erskine Building, Room 445

13:10 — 13:30

### Climate Reconstruction

Matthew R. Schofield
University of Otago

Richard J. Barker
University of Otago

The study of climatological data is inhibited by the availability of data. Inference about the climate over the past hundreds or thousands of years cannot be based on direct observations, which are only available for the past century or two. To obviate this problem, proxies with many more observations, such as isotopes, tree rings and ice cores, are used to predict the missing climate observations using calibration/inverse regression methods. In this talk we will investigate the assumptions and corresponding limitations of various calibration strategies and make suggestions about the use of such methods. If time permits, an example will also be given.

13:30 — 13:50

### Who has mud on their hands? A bootstrapping technique for determining a fingerprint for sediment tracing in the Whangapoua Harbour

Judith L. McWhirter
University of Waikato

Brendan Roddy
University of Waikato

Removal of soil from the earth's surface by wind and water and subsequent delivery to streams and rivers is a natural process that operates over geological time scales. Human land use activities such as agriculture and silviculture hasten this process and can increase the erosion. The sediment is delivered to streams, where the suspended fraction is richly organic and also transports bound nutrients and chemical pollutants. These then impact plant, fish and invertebrate communities; the physical and chemical characteristics of the streams and estuaries; as well as the physical appearance of these water bodies. In the New Zealand context, estuaries are the most impacted of all coastal waters and have water quality issues arising from the surrounding land uses, but sediment fingerprinting has rarely been used to determine the source.

We discuss an innovative bootstrapping technique which allows for the fingerprinting of sediment samples to their source areas so that the relative importance of these sources can be determined. We report the results from a pilot study undertaken in 2006 where it was concluded that the technique of sediment fingerprinting could distinguish between source areas based on land use (native forest, exotic forest, agriculture) in the Waitakuri River catchment.

13:50 — 14:10

### Case studies in association mapping

Dr Roderick D. Ball
Ensis Wood Quality (NZ Forest Research Institute)

We discuss statistical analysis and experimental design for association mapping with reference to case studies from Chapters 7 and 8 of the book “Association Mapping in Plants”, Springer 2007.

Case studies include:

• A case control test for an association between the HbS mutation and malaria.
• Detectability of associations between the APOE locus and Alzheimer's disease from a whole genome scan of SNPs.
• Candidate gene-based associations in eucalyptus and maize.
• Power of TDTQ1-QQ5 tests.

We re-analyse previous results using Bayesian methods. Equivalent Bayes factors are derived from published results and posterior probabilities for putative associations assessed. Bayes factors are derived for the common association tests: the chi-squared, Fisher's exact test, and the TDT and S-TDT tests for discrete and continuous traits. Various methods are used including direct integration, MCMC, and the Savage-Dickey density ratio.

A common theme is the inadequacy of p-values as a measure of evidence for testing scientific hypotheses as noted by Berger and Berry (1988). Higher sample sizes are needed to obtain respectable Bayes factors, and even higher sample sizes are needed to obtain sufficiently high Bayes factors to overcome low prior probabilities for genomic associations.
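The interplay between Bayes factors and prior probabilities follows from the identity posterior odds = Bayes factor × prior odds. A minimal sketch, with a hypothetical function name and purely illustrative numbers (none taken from the case studies):

```python
def posterior_probability(bayes_factor, prior_probability):
    """Posterior probability of an association from a Bayes factor and a prior."""
    prior_odds = prior_probability / (1.0 - prior_probability)
    posterior_odds = bayes_factor * prior_odds
    return posterior_odds / (1.0 + posterior_odds)


# Illustrative values only: a Bayes factor of 20 in favour of an association,
# evaluated against an even prior and against a very small prior.
print(posterior_probability(20.0, 0.5))
print(posterior_probability(20.0, 1e-4))
```

With an even prior, a Bayes factor of 20 gives a posterior probability above 0.95, while with a prior of 1 in 10,000 the same Bayes factor leaves the posterior probability below 1%.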

References:

• Berger, J.O. and Berry, D.A. (1988) "Statistical analysis and the illusion of objectivity," American Scientist 159-165.

## Session 3c, Environmetrics

This session will be held in the Erskine Building, Room 445

14:15 — 14:35

### Confidence Intervals for Expected Abundance of Rare Species

David Fletcher
University of Otago

Queensland University of Technology

In many ecological research studies, abundance data are skewed and contain more zeros than might be expected. Often, the aim is to model abundance in terms of covariates, and to estimate expected abundance for a given set of covariate values. Welsh et al. (1996) have advocated use of a conditional-model approach for this purpose. This allows one to separately model presence and abundance given presence, which should lead to a more complete understanding as to how the covariates influence abundance. The focus of this talk is on the calculation of confidence intervals for expected abundance given particular values of the covariates. The Wald confidence interval used by Welsh et al. (1996) is symmetric, and therefore unlikely to be of much use for skewed data, where confidence intervals for abundance measures are likely to be asymmetric. We show how to calculate a profile likelihood confidence interval for expected abundance using a conditional model.
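In a conditional model, expected abundance is the product of the probability of presence and the expected abundance given presence. The sketch below assumes, purely for illustration, a logistic model for presence and a zero-truncated Poisson model for abundance given presence; the function name and its arguments are hypothetical, and it shows only the point estimate, not the profile likelihood interval discussed in the talk.

```python
from math import exp


def expected_abundance(presence_logit, log_rate):
    """E[Y] = P(Y > 0) * E[Y | Y > 0] under a conditional model.

    presence_logit: linear predictor of the (assumed) logistic presence model.
    log_rate: linear predictor (log scale) of the (assumed) zero-truncated
        Poisson model for abundance given presence.
    """
    p_present = 1.0 / (1.0 + exp(-presence_logit))
    rate = exp(log_rate)
    # Mean of a zero-truncated Poisson with parameter `rate`.
    mean_given_present = rate / (1.0 - exp(-rate))
    return p_present * mean_given_present


# Illustrative linear predictor values for one set of covariates.
print(expected_abundance(presence_logit=-1.0, log_rate=0.5))
```

Because the estimate is a nonlinear function of two linear predictors, its sampling distribution is typically skewed, which is one reason a symmetric interval can perform poorly.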

References

• Welsh, A.H., Cunningham, R.B., Donnelly, C.F. and Lindenmayer, D.B. 1996. Modelling the abundance of rare species: statistical models for counts with extra zeros. Ecological Modelling 88: 297-308.

14:35 — 14:55

### Screening potatoes for resistance to late blight

Arier Lee
University of Auckland

C. M. Triggs
University of Auckland

J. A. D. Anderson
New Zealand Institute for Crop & Food Research Limited

Worldwide, potato (Solanum tuberosum L.) is considered one of the most important vegetable crops. Late blight caused by Phytophthora infestans is recognized as the most serious potato disease. A biennial field screening trial for resistance to late blight has been carried out at Pukekohe for over twenty years. Trials were laid out as latinised row and column designs in a single rectangular array of plots, indexed by rows and columns. In each trial, disease severity based on the percentage of affected foliage was repeatedly assessed on a 1-9 ordinal scale, from the first sign of infection and on 4 to 6 subsequent occasions.

Based on a threshold model for ordinal responses, we developed a Bayesian nonlinear model which fits a logistic sigmoidal decay curve to the latent variable for repeated ordinal measurements and includes random effects arising from the latinised row and column design.

14:55 — 15:15

### Hidden Markov models for feeding data from groups of red deer

R. P. Littlejohn
AgResearch

A. Bryant
AgResearch

I. D. Corson
AgResearch

We present an analysis of datasets consisting of start/stop times of individual deer feeding episodes over several days of continuous automated observation. Feeding episodes occur in clusters which constitute 'meals', during which the deer is primarily feeding, while at other times it is engaged in some other activity. Individuals within the group tend to have their meals at the same time. Since activity is not observed, but only whether or not each animal is feeding, this suggests that a hidden Markov or semi-Markov model could be used to analyse the data. Such models for individual cattle, but with no group context, have been used by Allcroft et al. (2004). We also consider a generalization including feedback given by Zucchini et al. (2005).
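For readers unfamiliar with hidden Markov models, the likelihood of a feeding record for a single animal can be computed with the forward algorithm. The sketch below assumes the record has been discretised into equal time steps coded 1 (feeding) or 0 (not feeding), with two hidden states ("meal" and "other"); the function name and all parameter values are hypothetical, and the group structure and semi-Markov extensions discussed in the talk are not represented.

```python
from math import log


def hmm_log_likelihood(observations, initial, transition, emission):
    """Scaled forward algorithm for a discrete-emission hidden Markov model.

    observations: sequence of 0/1 values (not feeding / feeding).
    initial[i]: probability of starting in hidden state i.
    transition[i][j]: probability of moving from state i to state j.
    emission[i][y]: probability of observing y while in state i.
    """
    n_states = len(initial)
    alpha = [initial[i] * emission[i][observations[0]] for i in range(n_states)]
    log_lik = 0.0
    for t in range(len(observations)):
        if t > 0:
            # Propagate one step, then weight by the emission probability.
            alpha = [
                sum(alpha[i] * transition[i][j] for i in range(n_states))
                * emission[j][observations[t]]
                for j in range(n_states)
            ]
        # Rescale at every step to avoid numerical underflow.
        scale = sum(alpha)
        log_lik += log(scale)
        alpha = [a / scale for a in alpha]
    return log_lik


# Hypothetical parameters: state 0 = "meal", state 1 = "other activity".
initial = [0.5, 0.5]
transition = [[0.9, 0.1], [0.2, 0.8]]
emission = [[0.2, 0.8], [0.95, 0.05]]
record = [0, 0, 1, 1, 1, 0, 1, 1, 0, 0]

print(hmm_log_likelihood(record, initial, transition, emission))
```

Maximising this log-likelihood over the parameters, for example with a general-purpose optimiser, is one common way to fit such a model.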

## Session 3d, Medical Statistics

This session will be held in the Erskine Building, Room 445

15:40 — 16:00

### Breast Cancer Diagnosis using SHG Laser Microscopy and Statistical Image Analysis

Gregory Falzon
University of New England

Second-Harmonic Laser Microscopy promises to be a useful diagnostic modality for breast cancer. Statistical image analysis has provided key insights into the differences between images of normal, benign and malignant breast tissue. Spectral analysis of image features coupled with a support-vector machine classifier is demonstrated to accurately separate normal from tumour tissue. Further analysis of the tumour group using the multi-scale, multi-directional, steerable pyramid filter has revealed features that can be used to separate benign from malignant breast tissue. The classifier presented can serve as a prototype for devices developed to serve in a clinical setting.

16:00 — 16:20

### Incorporating Biological Information into the Tumour Classification Process

The University of Auckland

The incorporation of biological information into the microarray analysis process has become increasingly important. One reason for doing this is to provide a biologically meaningful interpretation of the analysis results. While the incorporation of such information is well documented in terms of detecting differentially expressed genes, less work has been done on extending these ideas into the classification of biologically distinct samples. We describe a method for incorporating gene set information, such as KEGG pathway or Gene Ontology details, into the classification process. This approach utilises principal co-ordinates analysis (PCO) to create a summary of gene set activity, and then uses these summaries as explanatory variables in the classification and prediction process. This procedure is illustrated via application to a breast cancer data set published by Wang et al. (2005, The Lancet, vol. 365).

16:20 — 16:40

### Comparison of optimal and balanced two-stage case-control designs under cost constraints

Jennifer Wilcock
University of Auckland

Alan Lee
University of Auckland

In two-stage case-control studies, outcome status and one or more inexpensive covariates are observed for a large sample but additional, more expensive covariates are collected for a subsample only, selected by random sampling from the strata defined at the first stage. Large efficiency and/or cost gains are possible using two-stage rather than one-stage studies of comparable cost or power. Here we demonstrate a method for designing two-stage studies to obtain the best possible precision under specified cost constraints, by applying an efficient semi-parametric maximum likelihood approach due to Scott and Wild (University of Auckland) which has been developed for the analysis of a class of generalised case-control designs.

As with all model-based approaches, the ‘optimal’ design found is sensitive to the values of the model parameters used for deriving the design. If the design parameters are particularly inaccurate this may result in an ‘optimal’ design which is less efficient than that which would have been derived using a more robust design approach. The efficiency of the design depends on the sampling fractions within each stratum, and here a method will be presented for comparing designs with ‘optimal’ second-stage sample sizes to those with balanced second-stage sample sizes, under specified cost constraints.

16:40 — 17:00

### Independent component analysis and statistical parametric mapping of the relationship between personality and brain blood flow in normal males

In Kang
University of Canterbury

Marco Reale
University of Canterbury

Carl Scarrott
University of Canterbury

Irene L Hudson
University of South Australia

Robin Turner
University of New South Wales

Medical images are an important source of information about physiological processes, but they are often degraded by noise from various sources of interference and other phenomena that affect the measurement processes in imaging and data acquisition systems. The images are mixtures of unknown combinations of sources summing differently at each of the sensors. Independent component analysis (ICA) [Hyvärinen, A. and Oja, E., 2001] is an effective method for removing artifacts and separating sources of the brain signals from medical images. In this study, we assess the relationship between regional cerebral blood flow (rCBF) and all seven of the Temperament and Character Inventory (TCI) personality traits using ICA. ICA can assess the difference in rCBF between quartile groups for each personality trait to identify brain regions. Significant clusters of activation (increasing level of trait associated with increasing blood flow) or deactivation (decreasing level of trait associated with increasing level of blood flow) were found in relation to all seven TCI traits. The ICA linear model results showed a significant relationship in specific regions of the brain. Graphs of the average regional cerebral blood flow highlighted the existence of non-linear relationships delineated by the independent components. These results support previous work showing a biological basis for the TCI model [Cloninger, C., 2002] and non-linear model [Turner et al. 2003, Turner, R. 2005].
