### Underdispersion of imputed values

Underdispersion of each imputation method will be assessed by computing the variance of all estimated missing values and comparing it with the variance of the known values that have been set to missing.
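The underdispersion check and the accuracy measures named in this section (BIAS, MAD, RMSD) can be illustrated with a minimal sketch. The data values are hypothetical, and the ratio of imputed to true variance is only an assumed stand-in for the proportionate variance, whose exact formula is not given in the text.

```python
import math
from statistics import variance


def accuracy_measures(true_vals, imputed_vals):
    """Compare imputed values with the known values that were set to missing."""
    m = len(true_vals)  # number of values set to missing
    diffs = [imp - tru for imp, tru in zip(imputed_vals, true_vals)]
    bias = sum(diffs) / m                          # mean signed deviation
    mad = sum(abs(d) for d in diffs) / m           # mean absolute deviation
    rmsd = math.sqrt(sum(d * d for d in diffs) / m)  # root mean square deviation
    # Assumed dispersion check: a ratio below 1 suggests underdispersion.
    var_ratio = variance(imputed_vals) / variance(true_vals)
    return {"bias": bias, "mad": mad, "rmsd": rmsd, "var_ratio": var_ratio}


# Hypothetical example: four values deliberately set to missing, then imputed.
true_vals = [2.0, 4.0, 6.0, 8.0]
imputed_vals = [3.0, 4.0, 5.0, 7.0]
result = accuracy_measures(true_vals, imputed_vals)
print(result)
```

In this toy example the imputed values vary less than the true values, so the variance ratio falls below 1, which is the pattern the underdispersion check is meant to detect.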

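Proportionate Variance (PV) will be calculated for each imputation method. RMSD and MAD will be used to determine the closeness of the estimated values of a parameter to the true values. They do not always give the same result: large errors are penalised more heavily by RMSD because the difference term is squared.

Another summary measure is BIAS, which is defined as

$$\mathrm{BIAS} = \frac{1}{m}\sum_{i=1}^{m}\left(\hat{y}_i - y_i\right)$$

Mean Absolute Deviation (MAD) is defined as

$$\mathrm{MAD} = \frac{1}{m}\sum_{i=1}^{m}\left|\hat{y}_i - y_i\right|$$

where $y_i$ is the true value, $\hat{y}_i$ is the imputed value and $m$ is the number of missing values.

Root Mean Square Deviation (RMSD) is defined as

$$\mathrm{RMSD} = \sqrt{\frac{1}{m}\sum_{i=1}^{m}\left(\hat{y}_i - y_i\right)^2}$$

After imputation of the "missing values", the performance of the estimates will be examined using four summary measures. Two measures of accuracy, Root Mean Square Deviation (RMSD) and Mean Absolute Deviation (MAD), will be used.

1.2.7 Performance measures for imputation methods

The total variance is

$$T = \bar{U} + \left(1 + \frac{1}{m}\right)B$$

where $B$ is the between-imputation variance and $\bar{U}$ is the within-imputation variance, defined by

$$B = \frac{1}{m-1}\sum_{i=1}^{m}\left(\hat{Q}_i - \bar{Q}\right)^2, \qquad \bar{U} = \frac{1}{m}\sum_{i=1}^{m}U_i$$

The subscript $i$ denotes the imputation in MI and $Q$ is the quantity of interest.

The RIV is defined by the equation

$$r = \frac{\left(1 + \frac{1}{m}\right)B}{\bar{U}}$$

The estimate of the degrees of freedom in the imputed model is not influenced by sample size.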

DF increases as the number of imputations increases. The degrees of freedom are defined by the equation

$$v = (m - 1)\left(1 + \frac{1}{r}\right)^2$$

where $r$ is the relative increase in variance (RIV) due to nonresponse and $v$ is the degrees of freedom (DF).

The FMI is estimated based on how correlated a variable is with the other variables in the imputation model and on the percentage of missing values for this variable. If FMI is high for any variable, then increasing the number of imputations should be considered. The fraction of missing information for a limited number of imputations in MI is estimated by

$$\lambda = \frac{r + 2/(v + 3)}{r + 1}$$

The RE of an imputation is an indicator of how well the true population parameters are estimated.
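The pooling quantities described in this section can be computed together. The following is a minimal sketch of Rubin's (1987) combining rules, assuming one point estimate and one squared standard error per imputed data set; the function name and the input values are hypothetical.

```python
import math
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool m estimates and their within-imputation variances (Rubin 1987).

    Assumes the estimates are not all identical, so that B > 0.
    """
    m = len(estimates)                 # number of imputations
    q_bar = mean(estimates)            # pooled point estimate
    u_bar = mean(variances)            # within-imputation variance
    b = variance(estimates)            # between-imputation variance (m - 1 denominator)
    total = u_bar + (1 + 1 / m) * b    # total variance
    riv = (1 + 1 / m) * b / u_bar      # relative increase in variance
    df = (m - 1) * (1 + 1 / riv) ** 2  # degrees of freedom
    fmi = (riv + 2 / (df + 3)) / (riv + 1)  # fraction of missing information
    re = 1 / (1 + fmi / m)             # relative efficiency
    return {
        "estimate": q_bar,
        "std_error": math.sqrt(total),
        "total_variance": total,
        "riv": riv,
        "df": df,
        "fmi": fmi,
        "re": re,
    }


# Hypothetical regression coefficient estimated in m = 5 imputed data sets.
estimates = [1.0, 1.2, 0.8, 1.1, 0.9]
variances = [0.04, 0.04, 0.04, 0.04, 0.04]
pooled = pool_rubin(estimates, variances)
print(pooled)
```

The pooled standard error is the square root of the total variance, which is how the standard errors referred to in this section would be obtained from the separate analyses.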

RE is related to the amount of missing information as well as the number (m) of imputations performed. The relative (variance) efficiency (RE) of MI is defined as

$$\mathrm{RE} = \left(1 + \frac{\lambda}{m}\right)^{-1}$$

where $\lambda$ is the fraction of missing information and $m$ is the number of imputations.

In order to assess how well the imputation performed, the following measures will be used:

a) Relative increase in variance (RIV)

b) Fraction of missing information (FMI) (Rubin 1987)

c) Degrees of freedom

d) Relative efficiency (RE) (Rubin 1987)

e) Examples of how standard errors are calculated

1.2.6 Imputation diagnostics

Autocorrelation plots will be useful in assessing convergence.

Autocorrelation measures the correlation between predicted values at each iteration.

After performing multiple imputation, it is useful first to look at the means, frequencies and box plots comparing the observed and imputed values, to assess whether the range appears reasonable. This will be followed by examination of the plots of residuals and outliers for each individual imputed data set to see if there are anomalies. Evidence of anomalies, no matter how few, indicates a problem with the imputation model (White et al 2010). Next is the use of trace plots to assess convergence of each imputed variable.
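As a rough illustration of the autocorrelation diagnostic, the sketch below computes the lag-k autocorrelation of a chain of values recorded at successive iterations. The chain is hypothetical (for instance, the mean of the imputed values at each iteration) and the function name is an assumption.

```python
def autocorrelation(chain, lag):
    """Sample autocorrelation of a sequence at the given lag."""
    n = len(chain)
    mu = sum(chain) / n
    denom = sum((x - mu) ** 2 for x in chain)
    num = sum((chain[t] - mu) * (chain[t + lag] - mu) for t in range(n - lag))
    return num / denom


# Hypothetical chain that drifts steadily upward, i.e. has not yet converged.
drifting_chain = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
lag1 = autocorrelation(drifting_chain, 1)
print(round(lag1, 3))
```

A lag-1 value that stays high, as in this drifting chain, would suggest that successive iterations are still strongly dependent; values close to zero at small lags are what one hopes to see once the chain has settled.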

Trace plots are plots of estimated parameters against iteration numbers.

1.2.5 Visual inspection of imputed data

In this project, the first step in MI will involve (i) identifying variables with missing values, (ii) computing the proportion of missing values for each variable and (iii) assessing the pattern of missing values (monotone or arbitrary) in the data. The second step will involve analyses of the m complete data sets using standard procedures. In the third step, the parameter estimates from each imputed data set are combined to obtain the final set of parameter estimates.
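The first step described above (proportion missing per variable and the missing data pattern) can be sketched as follows. The data, the variable names and the use of `None` as the missing value marker are all assumptions made for illustration.

```python
def missing_proportions(rows, names):
    """Proportion of missing (None) values for each variable."""
    n = len(rows)
    return {
        name: sum(row[j] is None for row in rows) / n
        for j, name in enumerate(names)
    }


def is_monotone(rows):
    """True if the columns can be ordered so that, within every row,
    once a value is missing all later columns are missing as well."""
    p = len(rows[0])
    # Order columns from fewest to most missing values.
    order = sorted(range(p), key=lambda j: sum(row[j] is None for row in rows))
    for row in rows:
        seen_missing = False
        for j in order:
            if row[j] is None:
                seen_missing = True
            elif seen_missing:
                return False  # observed value after a missing one
    return True


# Hypothetical data sets; None marks a missing value.
names = ["age", "income", "bmi"]
monotone_rows = [[30, 200, 24.1], [41, 310, None], [52, None, None]]
arbitrary_rows = [[30, None, 24.1], [None, 310, 22.5]]

print(missing_proportions(monotone_rows, names))
print(is_monotone(monotone_rows), is_monotone(arbitrary_rows))
```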

Multiple imputation (MI) of missing values starts from the core idea of regression. Imputation then adds further steps to obtain a more realistic estimate of standard errors or uncertainty. These involve creating multiple sets of artificial observations in which missing values are replaced by regression predictions plus noise.
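The idea of "regression predictions plus noise" can be illustrated with a deliberately simplified sketch. It fits one straight line to the observed cases and adds a random residual to each prediction; a full MI procedure would also reflect uncertainty in the regression coefficients, which this sketch does not attempt. All names and data values are hypothetical.

```python
import random
from statistics import mean


def fit_line(x, y):
    """Ordinary least squares for y = a + b*x; returns a, b and residual SD."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    a = my - b * mx
    resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    sigma = (sum(e * e for e in resid) / (len(x) - 2)) ** 0.5
    return a, b, sigma


def impute_with_noise(x, y, m=5, seed=0):
    """Create m completed copies of y; None marks a missing value."""
    rng = random.Random(seed)
    observed = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    a, b, sigma = fit_line([o[0] for o in observed], [o[1] for o in observed])
    datasets = []
    for _ in range(m):
        completed = [
            yi if yi is not None else a + b * xi + rng.gauss(0, sigma)
            for xi, yi in zip(x, y)
        ]
        datasets.append(completed)
    return datasets


# Hypothetical predictor x and outcome y with two missing values.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 3.9, 6.2, None, 9.8, 12.1, None, 16.0]
datasets = impute_with_noise(x, y, m=5, seed=0)
for completed in datasets:
    print([round(v, 2) for v in completed])
```

Each completed data set keeps the observed values unchanged and differs only in the imputed entries, so the spread of the imputed entries across data sets carries the uncertainty about the missing values.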

A final step then pools the information from these multiple imputations to estimate the regression model along with its standard errors. In multiple imputation each missing value is replaced with multiple values that represent a distribution of possibilities (Allison 2001). The MI procedure is simulation based, and its main purpose, according to Schafer (1997), is not to create imputed values that are very close to the true ones, but to handle missing data in a way that achieves valid inference.

1.2.4 Multiple imputation

In FIML, data are not imputed or filled in as in multiple imputation; rather, it estimates the model parameters using the available information (raw data) (Enders 2001).

1.2.3 FIML

The approach for single imputation or deterministic imputation involves using predicted scores from a regression equation to replace missing values.

The advantage of this imputation method lies in using complete information to impute values. The disadvantage is that the fitted (statistical) model cannot distinguish between observed and imputed values; as a result, the error or uncertainty associated with the imputed values is not incorporated into the model.

1.2.2 Single imputation

In this project, using the NCDS data, the conceptual approach will begin with complete case analysis or listwise deletion, under the assumption that the event rate in the group with missing data is the same as the event rate in the groups without missing data.

1.2.1 Complete case analysis

This project will be centered on the impact of imputation methods on the following issues:

- Effects of imputation on measures of relationship between variables
- Are the characteristics of subjects who provide complete information (completers) different from those who don't (non-completers)?
- Different problems arise when data are missing in binary or categorical variables. Some procedures may handle these types of missing data better than others, and this area requires further research
- Data that are missing not at random

Key questions to be examined are:

I. What impact do various imputation methods have on the degree of relationship between variables?

II. How significant are the relationships between variables with reference to imputation methods?

III. How do relationships between variables from imputed data sets compare with other similar studies?

IV. Which imputation method is best suited to problems that may arise when data are missing in binary or categorical variables?

The nature and properties of missing data can be very different from those of the originally observed data, so it is important to analyse various methods of treating missing data in order to determine which methods work best under a given set of conditions (Cheema 2014).

To determine the best method of handling missing data, it is beneficial first to consider the context in which the data are missing. When MNAR data are incorrectly treated as MCAR or MAR, the missing data process is not being modeled correctly, and parameter estimates will not be accurate. Similarly, when MCAR and MAR data are incorrectly treated as MNAR, the researcher is introducing unnecessary complexity into the handling of missing data. Finally, when MAR data are incorrectly treated as MCAR, the researcher is oversimplifying the handling of missing data and will generate parameter estimates that are not generalizable to the population (reference).

The goal of statistical methods is to ensure that inferences made on the population of interest are valid and efficient.

A good missing data handling method, as observed by Allison (2001), should be able to reduce bias, maximise the use of available information and give good estimates of uncertainty. Based on the results of the scoping review that compared missing data handling methods in epidemiology, the impact of four popular methods derived from the review will be assessed and evaluated using NCDS data. The missing data methods to be compared include listwise deletion, single imputation, multiple imputation, and full information maximum likelihood.

1.2 Missing data handling strategy

The Wald test and the likelihood ratio test (LRT) of different models will be used to test the statistical significance of each predictor. The Wald test is computed as the ratio of the parameter estimate for each variable in the model to its corresponding standard error. The null hypothesis for the Wald test is that each parameter coefficient is zero.
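The Wald statistic for a single coefficient can be sketched in a few lines. The coefficient and standard error below are hypothetical, and the p-value assumes the usual large-sample standard normal reference distribution.

```python
import math


def wald_test(estimate, std_error):
    """Wald z statistic and two-sided p-value for H0: coefficient = 0."""
    z = estimate / std_error
    # Two-sided tail area of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical logistic regression coefficient and its standard error.
z, p_value = wald_test(0.5, 0.25)
print(z, round(p_value, 4))
```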

Rejection of the null hypothesis in the Wald test indicates that the effect of a variable is significant. The effects of multiple predictors can be tested simultaneously by the Wald test. The LRT compares the -2LL of different models. A significant LRT means that the set of variables included in a fitted model makes a significant contribution to the model.

1.1.1.5 Testing significance of predictors of missing values

The BIC adjusts the deviance by its degrees of freedom and sample size:

$$\mathrm{BIC} = D - df \times \ln(n)$$

where $df$ is the degrees of freedom associated with the deviance, $n$ is the sample size and $D$ is the deviance of the fitted model.

For the AIC, $k$ is the number of parameters (the number of independent variables plus the intercept), $n$ is the sample size, and $-2LL$ or $D$ is the deviance of the fitted model.
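A small sketch may help to show how the two indices are compared across models. It assumes the per-observation form AIC = (D + 2k)/n and the deviance-based form BIC = D - df ln(n) with df = n - k; these forms are an assumption chosen to match the symbols named in the text, and other texts use AIC = D + 2k instead. The log likelihoods and model sizes are hypothetical.

```python
import math


def information_criteria(log_lik, k, n):
    """AIC and BIC from a model's log likelihood (assumed forms, see lead-in)."""
    deviance = -2.0 * log_lik           # D = -2LL
    aic = (deviance + 2 * k) / n        # assumed per-observation AIC
    df = n - k                          # assumed df associated with the deviance
    bic = deviance - df * math.log(n)   # assumed deviance-based BIC
    return aic, bic


# Hypothetical non-nested models fitted to the same n = 200 subjects.
aic_a, bic_a = information_criteria(log_lik=-120.0, k=3, n=200)
aic_b, bic_b = information_criteria(log_lik=-118.0, k=6, n=200)
print(aic_a, bic_a)
print(aic_b, bic_b)
```

In this example model A has the smaller AIC and the smaller BIC, so it would be preferred on both indices even though model B has the slightly higher log likelihood.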

The number of predictors in a model and the sample size are used by AIC to adjust the deviance:

$$\mathrm{AIC} = \frac{D + 2k}{n}$$

The Akaike information criterion (AIC) (Akaike, 1974) and the Bayesian information criterion (BIC) (Schwarz, 1978), which are based on deviance statistics, will be used to compare non-nested models (models with different sets of independent variables). AIC and BIC will be useful for this project because of variations in the number of missing values in each parameter of the fitted models. The smaller the AIC, the better the model fit.

1.1.1.4 Information criteria indices

| Pseudo R2 measure | Pseudo R2 formula |
|---|---|
| Likelihood ratio R2 (McFadden R2) | $1 - \dfrac{\ln L_M}{\ln L_0}$ |
| Cox and Snell R2 (maximum likelihood R2) | $1 - \left(\dfrac{L_0}{L_M}\right)^{2/n}$ |
| Nagelkerke R2 (Cragg and Uhler's R2) | $\dfrac{1 - \left(L_0 / L_M\right)^{2/n}}{1 - L_0^{2/n}}$ |

Here $L_0$ is the likelihood of the intercept-only model, $L_M$ is the likelihood of the fitted model and $n$ is the sample size.

Pseudo R2 will be used to compare different fitted models with the same outcome. A higher pseudo R2 indicates which model better predicts the outcome. The table above shows the three pseudo R2 measures that will be employed to compare models in this project.

1.1.1.3 Pseudo R2

In logistic regression for a model with one predictor, the likelihood ratio chi-square test will be used to compare the deviance of the model with only the intercept with that of the model with one independent variable. If the likelihood ratio chi-square test is significant, the null hypothesis will be rejected, with the conclusion that the model with one independent variable fits the data better than the model with only the intercept (null model). In a model with multiple predictors, the likelihood ratio chi-square test will be used to decide which model fits the data.

1.1.1.2 Likelihood ratio chi-square test

The deviance is one of the goodness-of-fit statistics. It compares a fitted model with a saturated model, which fits the data perfectly, to show how well the fitted model fits the data.
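The likelihood ratio chi-square test and the three pseudo R2 measures can be computed directly from the log likelihoods of the intercept-only and fitted models. The sketch below assumes a single added predictor (1 degree of freedom) so that the p-value can be obtained from the normal distribution; the log likelihood values are hypothetical.

```python
import math


def lr_chi_square(ll_null, ll_model):
    """Difference in deviance between the null and the fitted model."""
    return -2.0 * (ll_null - ll_model)


def p_value_df1(chi2):
    """Upper tail of a chi-square with 1 df (square of a standard normal)."""
    return math.erfc(math.sqrt(chi2 / 2.0))


def pseudo_r2(ll_null, ll_model, n):
    """McFadden, Cox and Snell, and Nagelkerke pseudo R2."""
    mcfadden = 1.0 - ll_model / ll_null
    cox_snell = 1.0 - math.exp(2.0 * (ll_null - ll_model) / n)
    max_cox_snell = 1.0 - math.exp(2.0 * ll_null / n)  # upper bound of Cox and Snell
    nagelkerke = cox_snell / max_cox_snell
    return mcfadden, cox_snell, nagelkerke


# Hypothetical log likelihoods for n = 200 subjects.
ll_null, ll_model, n = -130.0, -120.0, 200
chi2 = lr_chi_square(ll_null, ll_model)
p_value = p_value_df1(chi2)
mcfadden, cox_snell, nagelkerke = pseudo_r2(ll_null, ll_model, n)
print(chi2, p_value)
print(mcfadden, cox_snell, nagelkerke)
```

For models that differ by more than one parameter the same chi-square statistic applies, but the p-value would need a chi-square distribution with the corresponding degrees of freedom.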

If the difference between the saturated model and the fitted model is small, the model is a good fit. On the other hand, if the deviance is large, the model has a poor fit; smaller deviances mean better fit (Zhu 2014).

1.1.1.1 The deviance

The following statistics (the deviance, the log likelihood ratio test, pseudo R2, AIC and BIC) will be used to assess whether the model fits the data well.

1.1.1 Model validation

In this project, consideration is given to a realistic situation where $R_{ij}$ and $R_{ik}$ ($j \neq k$) are not independent. That is, there are patterns where two covariates tend to have data missing together, or there are covariates that may influence whether data are missing.

Here $X_{obs}$ is the observed part and $X_{miss}$ is the missing part of $X$. $R$ may be defined to be the matrix of missing data indicators with $(i, j)$th element $R_{ij} = 1$ if $X_{ij}$, the value of the $j$th predictor for the $i$th subject, is observed, and $0$ if it is missing.
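The indicator matrix R, and the situation where two covariates tend to be missing together, can be illustrated with a small sketch. The predictor matrix, the use of `None` for missing values and the function names are hypothetical.

```python
def indicator_matrix(X):
    """R[i][j] = 1 if X[i][j] is observed and 0 if it is missing (None)."""
    return [[0 if value is None else 1 for value in row] for row in X]


def joint_indicator_table(R, j, k):
    """Counts of each (R_ij, R_ik) combination across subjects."""
    table = {(a, b): 0 for a in (0, 1) for b in (0, 1)}
    for row in R:
        table[(row[j], row[k])] += 1
    return table


# Hypothetical predictors for six subjects; columns 1 and 2 tend to be
# missing for the same subjects.
X = [
    [34, 1.2, 7.0],
    [51, None, None],
    [29, 0.8, 6.5],
    [62, None, None],
    [45, 1.1, None],
    [38, 0.9, 7.2],
]
R = indicator_matrix(X)
table = joint_indicator_table(R, 1, 2)
print(R)
print(table)
```

When most subjects fall in the (1, 1) and (0, 0) cells of the table, as here, the two indicators are clearly not independent.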

Let $X$ denote the $n \times p$ matrix of predictors (covariates). This can be partitioned as $X = (X_{obs}, X_{miss})$.

In the multiple-predictor model, $\pi = E(Y \mid X_1, \ldots, X_p)$ and $\beta_j$ is the $j$th regression coefficient; the model is used to predict the influence of data missing in the predictor variables on the outcome (response) variable.

For multiple predictors, a standard linear logistic model will be fitted.

This could be expressed as

$$\operatorname{logit}(\pi) = \ln\left(\frac{\pi}{1 - \pi}\right) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p$$

The logistic model is first fitted with single predictors to determine the crude estimates for each socioeconomic and health covariate (predictor).

In this case the outcome $Y$ is binary and the prespecified socioeconomic and health predictors are $X_1, \ldots, X_p$. The parameters in the data will be used to fit a standard logistic model.

Unlike discriminant function analysis, logistic regression does not assume that the predictor variables follow a multivariate normal distribution with equal covariance matrices; instead, it assumes that the binomial distribution describes the distribution of the errors, which equal the actual y minus the predicted y.

Normally, the research question would be addressed by either ordinary least squares (OLS) regression or linear discriminant function analysis. Both these techniques were subsequently found to be less than ideal for handling dichotomous outcomes because of strict statistical assumptions, i.e. linearity, normality and continuity for OLS regression, and multivariate normality with equal variances and covariances for discriminant analysis.

After investigating the crude association between the missingness indicator variable and other variables, the next stage will be to investigate the independent contribution of each variable to the probability of the indicator (dependent) variable being missing. This will be carried out by simultaneously fitting a logistic regression model with all selected variables as covariates.

The logistic regression model will be used to assess whether the missing values in the indicator variable are MCAR, MAR or MNAR, given the significance of the other variables in the model.
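As an illustration of regressing a missingness indicator on a covariate, the sketch below fits a logistic model with an intercept and one predictor by Newton-Raphson. It is a minimal teaching example under stated assumptions, not the procedure used in the project: the data are hypothetical, the indicator is coded 1 when the value is missing, and a real analysis would use a statistical package and several covariates.

```python
import math


def fit_logistic(x, y, iterations=25):
    """Newton-Raphson fit of logit(p) = b0 + b1*x.

    Returns b0, b1 and the standard error of b1. Assumes the data are not
    perfectly separated, so that the maximum likelihood estimate exists.
    """
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = p * (1.0 - p)
            g0 += yi - p              # score for the intercept
            g1 += (yi - p) * xi       # score for the slope
            h00 += w                  # information matrix entries
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    # Information from the last pass; it is evaluated just before the final
    # update, which is negligible once the iterations have converged.
    se_b1 = math.sqrt(h00 / det)
    return b0, b1, se_b1


# Hypothetical covariate and missingness indicator (1 = value is missing).
covariate = [1, 2, 3, 4, 5, 6, 7, 8]
is_missing = [0, 0, 1, 0, 0, 1, 1, 1]
b0, b1, se_b1 = fit_logistic(covariate, is_missing)
print(b0, b1, b1 / se_b1)  # intercept, slope and Wald z for the slope
```

A slope that differs significantly from zero would indicate that missingness is related to the covariate, which is evidence against MCAR; on its own it cannot separate MAR from MNAR, because that distinction depends on the unobserved values themselves.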