The dispersion of each imputation method will be assessed by computing the
variance of all estimated missing values and comparing it to that of the known
values that have been set to missing. The Proportionate Variance (PV) will be
calculated for each imputation method.

RMSD and MAD will be used to determine the closeness of the estimated values
of a parameter to the true value. They do not always give the same result; the
data are penalised more by RMSD because the difference term is squared.
Another summary measure is BIAS, which is defined as

$$\mathrm{BIAS} = \frac{1}{m}\sum_{i=1}^{m}\left(\hat{y}_i - y_i\right)$$

Mean Absolute Deviation (MAD) is defined as

$$\mathrm{MAD} = \frac{1}{m}\sum_{i=1}^{m}\left|\hat{y}_i - y_i\right|$$

where $y_i$ is the true value, $\hat{y}_i$ is the imputed value and m is the
number of missing values.

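Root Mean Square Deviation (RMSD) is defined as

$$\mathrm{RMSD} = \sqrt{\frac{1}{m}\sum_{i=1}^{m}\left(\hat{y}_i - y_i\right)^2}$$

where $y_i$ and $\hat{y}_i$ are again the true and imputed values and m is the
number of missing values.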

After imputation of the “missing values”, the performance of the
estimates will be examined using four summary measures. Two measures of
accuracy, Root Mean Square Deviation (RMSD) and Mean Absolute Deviation (MAD),
will be used.
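The four summary measures could be computed along the following lines. This is only a minimal sketch: the function name and the example values are hypothetical, BIAS is taken as the mean of (imputed minus true), and PV is assumed here to be the ratio of the variance of the imputed values to the variance of the true values, which the text does not state explicitly.

```python
import math


def imputation_accuracy(true_values, imputed_values):
    """Return BIAS, MAD, RMSD and PV for paired true and imputed values."""
    if len(true_values) != len(imputed_values) or len(true_values) < 2:
        raise ValueError("need two equally long sequences with at least 2 values")
    m = len(true_values)
    # Differences are taken as imputed minus true.
    diffs = [imp - tru for tru, imp in zip(true_values, imputed_values)]
    bias = sum(diffs) / m
    mad = sum(abs(d) for d in diffs) / m
    rmsd = math.sqrt(sum(d * d for d in diffs) / m)

    def sample_variance(values):
        mean = sum(values) / len(values)
        return sum((v - mean) ** 2 for v in values) / (len(values) - 1)

    # Assumption: PV = variance of imputed values / variance of true values.
    pv = sample_variance(imputed_values) / sample_variance(true_values)
    return {"bias": bias, "mad": mad, "rmsd": rmsd, "pv": pv}


# Hypothetical values that were set to missing and then imputed.
true_vals = [10.0, 12.0, 14.0, 16.0]
imputed_vals = [11.0, 12.0, 13.0, 18.0]
accuracy = imputation_accuracy(true_vals, imputed_vals)
```

Under this assumed definition, a PV close to 1 would suggest that the imputation method preserves the spread of the original values, while RMSD is never smaller than MAD because large errors are weighted more heavily.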


measures for imputation methods


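The subscript i denotes the ith imputation in MI, m is the number of
imputations and Q is the quantity of interest, so that $\hat{Q}_i$ and $U_i$
are the estimate and its variance from the ith imputed data set and
$\bar{Q} = \frac{1}{m}\sum_{i=1}^{m}\hat{Q}_i$ is the pooled estimate. The
total variance is

$$T = \bar{U} + \left(1 + \frac{1}{m}\right)B$$

where B is the between-imputation variance and $\bar{U}$ is the
within-imputation variance, defined by

$$B = \frac{1}{m-1}\sum_{i=1}^{m}\left(\hat{Q}_i - \bar{Q}\right)^2, \qquad \bar{U} = \frac{1}{m}\sum_{i=1}^{m}U_i$$

The RIV is defined by

$$r = \frac{\left(1 + 1/m\right)B}{\bar{U}}$$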

The estimate of the degrees of freedom in the imputed model is not influenced
by sample size. The DF increases as the number of imputations increases.

The degrees of freedom are defined by the equation

$$v = (m - 1)\left(1 + \frac{1}{r}\right)^2$$

where r is the relative increase in variance (RIV) due to nonresponse, v is
the degrees of freedom (DF) and m is the number of imputations. The FMI is
estimated based on how correlated a variable is with the other variables in the
imputation model and the percentage of missing values for this variable. If the
FMI is high for any variable, then increasing the number of imputations should
be considered.

The fraction of missing information (FMI) for a limited number of imputations
in MI is estimated by

$$\lambda = \frac{r + 2/(v + 3)}{r + 1}$$

The RE of an imputation is an indicator of how well the true population
parameters are estimated. It is related to the fraction of missing information
as well as the number (m) of imputations performed.

The relative (variance) efficiency (RE) of MI is defined as

$$RE = \left(1 + \frac{\lambda}{m}\right)^{-1}$$

where $\lambda$ is the fraction of missing information and m is the number of
imputations.
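To make the pooling quantities concrete, the sketch below combines m estimates and their variances using the standard Rubin (1987) formulas for the total variance, RIV, DF, FMI and RE. The function name and the three example estimates are hypothetical and serve only as an illustration.

```python
import math
import statistics


def pool_rubin(estimates, variances):
    """Pool m point estimates and their variances with Rubin's rules."""
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations with matching variances")
    q_bar = statistics.fmean(estimates)      # pooled estimate
    u_bar = statistics.fmean(variances)      # within-imputation variance
    b = statistics.variance(estimates)       # between-imputation variance (m - 1)
    t = u_bar + (1 + 1 / m) * b              # total variance
    r = (1 + 1 / m) * b / u_bar              # relative increase in variance
    # With no between-imputation variance, DF is unbounded.
    v = (m - 1) * (1 + 1 / r) ** 2 if r > 0 else math.inf
    fmi = (r + 2 / (v + 3)) / (r + 1)        # fraction of missing information
    re = 1 / (1 + fmi / m)                   # relative efficiency
    return {"estimate": q_bar, "total_variance": t, "riv": r,
            "df": v, "fmi": fmi, "re": re, "se": math.sqrt(t)}


# Hypothetical regression coefficient estimated in m = 3 imputed data sets.
pooled = pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
```

The pooled standard error is the square root of the total variance, so it grows with the disagreement between imputed data sets as well as with the uncertainty within each one.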



In order to assess how well the imputation performed, the following
measures will be used:

Relative increase in variance (RIV)

Fraction of missing information (FMI) (Rubin 1987)

Degrees of freedom (DF)

Relative efficiency (RE) (Rubin 1987)

Examples of how standard errors are calculated

  Imputation diagnostics

An autocorrelation plot will be useful in assessing
convergence. Autocorrelation measures the correlation between the predicted
values at successive iterations.
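As an illustration of what an autocorrelation plot displays, the sketch below computes the sample autocorrelation of a chain of values at a given lag. The function name and the example chain are hypothetical; in practice the chain would be the mean or another summary of the imputed values recorded at each iteration.

```python
def autocorrelation(chain, lag):
    """Sample autocorrelation of a sequence of iteration values at one lag."""
    n = len(chain)
    if not 0 <= lag < n:
        raise ValueError("lag must be between 0 and len(chain) - 1")
    mean = sum(chain) / n
    denominator = sum((value - mean) ** 2 for value in chain)
    if denominator == 0:
        raise ValueError("chain is constant, autocorrelation is undefined")
    numerator = sum(
        (chain[t] - mean) * (chain[t + lag] - mean) for t in range(n - lag)
    )
    return numerator / denominator


# Hypothetical chain that alternates around zero from one iteration to the next.
example_chain = [1.0, -1.0, 1.0, -1.0]
lag0 = autocorrelation(example_chain, 0)
lag1 = autocorrelation(example_chain, 1)
```

Autocorrelations that fall quickly towards zero as the lag increases are usually read as a sign that successive iterations are close to independent.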

After performing multiple imputation, it is
useful to first look at the means, frequencies and box plots comparing the
observed and imputed values to assess whether the range appears reasonable.
This will be followed by examination of the plots of residuals and outliers for
each individual imputed data set to see whether there are anomalies. Evidence
of anomalies, no matter how few, is an indication of a problem with the
imputation model (White et al 2010). Next, trace plots will be used to assess
the convergence of each imputed variable. Trace plots are plots of estimated
parameters against iteration numbers.


  Visual inspection of imputed data

In this project, the first step in MI will involve (i) identifying variables
with missing values, (ii) computing the proportion of missing values for each
variable and (iii) assessing the pattern of missing values in the data
(monotone or arbitrary). The second step will involve analysing the m
completed data sets using standard procedures. In the third step, the estimates
of the parameters from each imputed data set are combined to obtain the final
set of parameter estimates.
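The checks in the first step, (i) to (iii), can be sketched as follows. This is a simplified illustration rather than the project's actual procedure: records are assumed to be dictionaries with `None` marking a missing value, and the function and variable names are hypothetical.

```python
def missing_proportions(rows, variables):
    """Proportion of missing (None) values for each variable."""
    n = len(rows)
    return {
        name: sum(1 for row in rows if row[name] is None) / n
        for name in variables
    }


def is_monotone(rows, variables):
    """True if the variables can be ordered so that missingness is nested."""
    # Order variables from the least to the most frequently missing.
    ordered = sorted(
        variables, key=lambda name: sum(row[name] is None for row in rows)
    )
    for row in rows:
        seen_missing = False
        for name in ordered:
            if row[name] is None:
                seen_missing = True
            elif seen_missing:
                # An observed value after a missing one breaks the nesting.
                return False
    return True


# Hypothetical records with a monotone (dropout-like) pattern.
monotone_rows = [
    {"a": 1, "b": 2, "c": 3},
    {"a": 1, "b": 2, "c": None},
    {"a": 1, "b": None, "c": None},
]
# Hypothetical records with an arbitrary pattern.
arbitrary_rows = [{"a": None, "b": 2}, {"a": 1, "b": None}]

proportions = missing_proportions(monotone_rows, ["a", "b", "c"])
with_missing = [name for name, share in proportions.items() if share > 0]
```

Variables with a non-zero proportion are the ones that need imputing, and the monotone check indicates whether simpler sequential imputation methods could apply.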

Multiple imputation (MI) of missing values starts
from the core idea of regression. Imputation then adds further steps to obtain
a more realistic estimate of the standard errors or uncertainty. These involve
creating multiple sets of artificial observations in which missing values are
replaced by regression predictions plus noise. A final step then pools the
information from these multiple imputations to estimate the regression model
along with its standard errors. In multiple imputation each missing value is
replaced with multiple values that represent a distribution of possibilities
(Allison 2001). The MI procedure is simulation based, and its main purpose,
according to Schafer (1997), is not to create imputed values that are very
close to the true ones, but to handle missing data so as to achieve valid
inference.



In FIML, data are not imputed or filled in as in
multiple imputation; rather, model parameters are estimated using all the
available information (the raw data) (Enders 2001).



The approach for single imputation or
deterministic imputation involves using predicted scores from a regression
equation to replace missing values. The advantage of this imputation
method lies in its use of complete information to impute values. The
disadvantage is that the fitted (statistical) model cannot distinguish between
observed and imputed values; as a result, the error or uncertainty associated
with the imputed values is not incorporated into the model.
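A minimal sketch of deterministic regression imputation with a single predictor is given below. It assumes the predictor is fully observed and that missing outcome values are marked with `None`; the function name and data are hypothetical.

```python
def regression_impute(x, y):
    """Replace missing y values (None) with least-squares predictions from x."""
    observed = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    n = len(observed)
    mean_x = sum(xi for xi, _ in observed) / n
    mean_y = sum(yi for _, yi in observed) / n
    sxx = sum((xi - mean_x) ** 2 for xi, _ in observed)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in observed)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Imputed values lie exactly on the fitted line: no noise is added,
    # which is why their uncertainty is not reflected in later analyses.
    return [
        yi if yi is not None else intercept + slope * xi
        for xi, yi in zip(x, y)
    ]


# Hypothetical data with one missing outcome value.
x_values = [1.0, 2.0, 3.0, 4.0, 5.0]
y_values = [2.1, 3.9, None, 8.1, 9.9]
completed = regression_impute(x_values, y_values)
```

Because every imputed value falls exactly on the regression line, the completed data look less variable than real data would, which illustrates the disadvantage described above.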



In this project, using the NCDS data, the
conceptual approach will begin with complete case analysis or listwise deletion
under the assumption that the event rate in the group with missing data is the
same as the event rate in the group without missing data.


1.2.1 Complete case analysis




Key questions to be examined are:

What impact do various imputation methods have on the degree of relationship
between

How significant are the relationships between
variables with reference to imputation methods?

How do relationships between variables from
imputed datasets compare to other similar studies?

Which imputation method is best suited for the
problems that may arise when data are missing in binary or categorical
variables?


This project will be centered on the impact of
imputation methods on the following issues:

Effects of imputation on measures of
relationship between variables

Are the characteristics of subjects who provide
complete information (completers) different from those who don’t?

Different problems arise when data are missing
in binary or categorical variables. Some procedures may handle these types of
missing data better than others, and this area requires further research.

Data that are missing not at random


Because the nature and properties of missing data
can be very different from those of the originally observed data, it is
important to analyse various methods of treating missing data in order to
determine which methods work best under a given set of conditions (Cheema 2014).

To determine the best method of handling missing
data, it is beneficial to first consider the context in which the data are
missing. When MNAR data are incorrectly treated as MCAR or MAR, the missing
data process is not being modeled correctly, and parameter estimates will not
be accurate. Similarly, when MCAR and MAR data are incorrectly treated as MNAR,
the researcher is introducing unnecessary complexity into the handling of
missing data. Finally, when MAR data are incorrectly treated as MCAR, the
researcher is oversimplifying the handling of missing data and will generate
parameter estimates that are not generalizable to the
The goal of statistical methods is to ensure
that inferences made about the population of interest are valid and efficient.
A good missing data handling method, as observed by Allison (2001), should be
able to reduce bias, maximise the use of available information and yield good
estimates of uncertainty. Based on the results of the scoping review that
compared missing data handling methods in epidemiology, the impact of four
popular methods derived from the review will be assessed and evaluated using
NCDS data. The missing data methods to be compared are listwise deletion,
single imputation, multiple imputation and full information maximum likelihood.


  Missing data handling



Testing significance of predictors of missing values

The Wald test and the likelihood ratio test (LRT) of
different models will be used to test the statistical significance of each
predictor. The Wald test statistic is computed as the ratio of the parameter
estimate for each variable in the model to its corresponding standard error.
The null hypothesis for the Wald test is that each parameter coefficient is
zero. Rejection of the null hypothesis indicates that the effect of a variable
is significant. The effects of multiple predictors can also be tested
simultaneously by the Wald test. The LRT compares the -2LL of different models.
A significant LRT means that the set of variables included in a fitted model
makes a significant contribution to the model.
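For a single coefficient, the Wald statistic described here can be sketched as below. The two-sided p-value assumes that the statistic is approximately standard normal under the null hypothesis, and the coefficient and standard error are hypothetical numbers.

```python
import math


def wald_test(estimate, std_error):
    """Wald z statistic and two-sided p-value for H0: coefficient = 0."""
    z = estimate / std_error
    # Two-sided tail area of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical logistic regression coefficient and its standard error.
z_stat, p_val = wald_test(0.8, 0.25)
```

Some software reports the squared statistic and refers it to a chi-square distribution with one degree of freedom, which gives the same p-value.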

Information criteria indices

The Akaike information criterion (AIC) (Akaike, 1974)
and the Bayesian information criterion (BIC) (Schwarz, 1978), which are based
on the deviance statistic, will be used to compare non-nested models (models
with different sets of independent variables). AIC and BIC will be useful for
this project because of variations in the number of missing values in each
parameter of the fitted models. The smaller the AIC, the better the model fit.

The AIC uses the number of predictors in a model and the
sample size to adjust the deviance:

$$\mathrm{AIC} = \frac{D + 2k}{n}$$

where k is the number of parameters (the number of independent variables plus
the intercept), n is the sample size, and D or -2LL is the deviance of the
fitted model.

The BIC adjusts the deviance by its degrees of
freedom and sample size:

$$\mathrm{BIC} = D - df \times \ln(n)$$

where df is the degrees of freedom associated
with the deviance, n is the sample size and D is the deviance of the fitted
model.
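The two indices can be sketched as below. The formulas are an assumption: they use the deviance-based forms that match the symbols named in the text (D, k, n and df), namely AIC = (D + 2k)/n and BIC = D - df ln(n). Many packages instead report AIC = D + 2k and BIC = D + k ln(n), so values are only comparable within one convention. The function names and numbers are hypothetical.

```python
import math


def aic_per_observation(deviance, k, n):
    """AIC in the assumed deviance-based form (D + 2k) / n."""
    return (deviance + 2 * k) / n


def bic_from_deviance(deviance, df, n):
    """BIC in the assumed deviance-based form D - df * ln(n)."""
    return deviance - df * math.log(n)


# Hypothetical model: deviance 100, 3 parameters, 50 observations,
# so the deviance has 50 - 3 = 47 degrees of freedom.
aic_value = aic_per_observation(100.0, 3, 50)
bic_value = bic_from_deviance(100.0, 47, 50)
```

In either convention the model with the smaller value is preferred.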



Pseudo R2

Pseudo R2 will be used to compare different fitted models with the same
outcome. A higher pseudo R2 indicates which model better predicts
the outcome. The table below shows the three pseudo R2 measures that will be
employed to compare models in this project.

Pseudo R2 measures

Likelihood ratio R2 (McFadden R2)

Cox and Snell R2 (maximum likelihood R2)

Nagelkerke R2 (Cragg and Uhler’s R2)
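The three pseudo R2 measures can be computed from the log-likelihoods of the intercept-only model and the fitted model. The sketch below uses their standard definitions; the function name and the log-likelihood values are hypothetical.

```python
import math


def pseudo_r2(ll_null, ll_model, n):
    """McFadden, Cox and Snell, and Nagelkerke pseudo R2 from log-likelihoods."""
    # McFadden: 1 - LL(model) / LL(null).
    mcfadden = 1 - ll_model / ll_null
    # Cox and Snell: 1 - (L(null) / L(model)) ** (2 / n).
    cox_snell = 1 - math.exp(2 * (ll_null - ll_model) / n)
    # Nagelkerke rescales Cox and Snell by its maximum attainable value.
    max_cox_snell = 1 - math.exp(2 * ll_null / n)
    nagelkerke = cox_snell / max_cox_snell
    return {"mcfadden": mcfadden, "cox_snell": cox_snell,
            "nagelkerke": nagelkerke}


# Hypothetical log-likelihoods for a sample of 200 subjects.
r2 = pseudo_r2(ll_null=-100.0, ll_model=-80.0, n=200)
```

The Nagelkerke value is always at least as large as the Cox and Snell value, because the latter cannot reach 1 for a binary outcome.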


Likelihood ratio chi-square test

In logistic regression for a model with one predictor, the likelihood ratio
chi-square test will be used to compare the deviance between the model with
only the intercept and the model with one independent variable. If the
likelihood ratio chi-square test is significant, the null hypothesis will be
rejected with the conclusion that the model with one independent variable fits
the data better than the model with only the intercept (null model). In models
with multiple predictors, the likelihood ratio chi-square test will be used to
decide which model fits the data better.
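The comparison of deviances can be sketched as follows. The test statistic is the difference in deviance between the smaller and the larger model, referred to a chi-square distribution whose degrees of freedom equal the number of added parameters. The chi-square tail probability is computed with the closed-form series for integer degrees of freedom, and the deviances are hypothetical.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum of (x/2)^r / r! for r < df/2.
        term, total = 1.0, 1.0
        for r in range(1, df // 2):
            term *= half / r
            total += term
        return math.exp(-half) * total
    # Odd df: normal tail plus a finite series in odd powers of sqrt(x).
    term, total = math.sqrt(x), 0.0
    for r in range(1, (df - 1) // 2 + 1):
        if r > 1:
            term *= x / (2 * r - 1)
        total += term
    return (math.erfc(math.sqrt(half))
            + math.sqrt(2 / math.pi) * math.exp(-half) * total)


def likelihood_ratio_test(deviance_reduced, deviance_full, df_diff):
    """LR chi-square statistic and p-value for two nested models."""
    statistic = deviance_reduced - deviance_full
    return statistic, chi2_sf(statistic, df_diff)


# Hypothetical deviances: null model 120, model with two added predictors 110.
lr_stat, lr_p = likelihood_ratio_test(120.0, 110.0, 2)
```

A small p-value indicates that the added predictors reduce the deviance by more than would be expected by chance.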

The deviance

The deviance is one of the goodness-of-fit statistics. It compares a fitted
model with a saturated model, which fits the data perfectly, to show how well
the model fits the data. If the difference between the saturated model and the
fitted model is small, the model is a good fit. On the other hand, if the
deviance is large, the model has a poor fit; smaller deviances mean better fit
(Zhu 2014).

The following statistics will be used to assess whether the model fits the
data well: the deviance, the log likelihood ratio test, pseudo R2, AIC and BIC.




Let X denote the n x p matrix of
predictors (covariates). This can be partitioned as X = (Xobs, Xmiss), where
Xobs is the observed part and Xmiss is the missing part. R may be defined to be
the matrix of missing data indicators with (i, j)th element Rij = 1 if Xij, the
value of the jth predictor for the ith subject, is observed and 0 if it is
missing.

In this project, consideration is given to a realistic situation where Rij and
Rik (j ≠ k) are not independent. That is, there are patterns where two
covariates tend to have data missing together, or there are covariates that may
influence whether data are missing.
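The indicator matrix R can be built directly from the predictor matrix. The sketch below assumes missing entries are coded as `None` and uses hypothetical data; the second helper simply counts how often two covariates are missing for the same subject, which is one crude way to see whether Rij and Rik tend to move together.

```python
def missing_indicator_matrix(X):
    """R[i][j] = 1 if X[i][j] is observed and 0 if it is missing (None)."""
    return [[0 if value is None else 1 for value in row] for row in X]


def joint_missing_count(R, j, k):
    """Number of subjects for whom covariates j and k are both missing."""
    return sum(1 for row in R if row[j] == 0 and row[k] == 0)


# Hypothetical 4 x 3 predictor matrix; columns 1 and 2 are often missing together.
X_example = [
    [5.0, None, None],
    [3.0, 1.0, 2.0],
    [4.0, None, None],
    [2.0, 0.0, None],
]
R_example = missing_indicator_matrix(X_example)
both_missing = joint_missing_count(R_example, 1, 2)
```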

In this case the outcome Y is binary, with prespecified socioeconomic and
health predictors X1, …, Xp. The parameters in the data will be
used to fit a standard logistic model

$$\operatorname{logit}(\pi) = \ln\left(\frac{\pi}{1 - \pi}\right) = \beta_0 + \beta_1 X$$

The above logistic model is used to fit single predictors to determine the
crude estimates for each of the socioeconomic and health covariates
(predictors).

For multiple predictors, a standard linear logistic model will be fitted. This
could be expressed as

$$\operatorname{logit}(\pi) = \ln\left(\frac{\pi}{1 - \pi}\right) = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p$$

where $\pi = E(Y \mid X_1, \dots, X_p)$ and $\beta_j$ is the jth regression
coefficient, used to predict the influence of missing data in a predictor
variable on the outcome (response) variable.


Traditionally, such research questions were addressed by either ordinary least
squares (OLS) regression or linear discriminant function analysis. Both of
these techniques were subsequently found to be less than ideal for handling
dichotomous outcomes because of their strict statistical assumptions, i.e.
linearity, normality and continuity for OLS regression, and multivariate
normality with equal variances and covariances for discriminant analysis.

Unlike discriminant function analysis, logistic regression does not assume that
the predictor variables follow a multivariate normal distribution with an
equal covariance matrix. Instead, it assumes that the binomial distribution
describes the distribution of the errors, which equal the actual y minus the
predicted y.


After investigating the crude association between the missingness indicator
variable and other variables, the next stage will be to investigate the
independent contribution of each variable to the probability of the indicator
(dependent) variable being missing. This will be carried out by simultaneously
fitting a logistic regression model with all selected variables as covariates.
The logistic regression model will help to assess whether the missing values in
the indicator variable are MCAR, MAR or MNAR, given the significance of the
other variables in the model.
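The idea of regressing a missingness indicator on a covariate can be sketched with a small Newton-Raphson fit. This is only an illustration with one hypothetical covariate and invented data; a real analysis would use a statistical package and all selected covariates. Here the indicator is coded 1 when the value is missing.

```python
import math


def fit_logistic(x, y, iterations=25):
    """Fit logit(p) = b0 + b1 * x by Newton-Raphson and return (b0, b1)."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        p = [1.0 / (1.0 + math.exp(-(b0 + b1 * xi))) for xi in x]
        # Score vector (gradient of the log-likelihood).
        g0 = sum(yi - pi for yi, pi in zip(y, p))
        g1 = sum((yi - pi) * xi for xi, yi, pi in zip(x, y, p))
        # Information matrix (negative Hessian), a symmetric 2 x 2 matrix.
        w = [pi * (1.0 - pi) for pi in p]
        h00 = sum(w)
        h01 = sum(wi * xi for wi, xi in zip(w, x))
        h11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = h00 * h11 - h01 * h01
        # Newton step: add (information matrix)^-1 times the score.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


# Hypothetical covariate and missingness indicator (1 = value is missing).
covariate = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
is_missing = [0, 0, 1, 0, 1, 0, 1, 1]
intercept, slope = fit_logistic(covariate, is_missing)
```

A slope clearly different from zero would suggest that missingness depends on the observed covariate, which is evidence against MCAR. A fit of this kind cannot by itself separate MAR from MNAR, because that distinction depends on the unobserved values.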

