3.2.1. Content-related validity
In criterion-related validity, you check the performance of your operationalization against some criterion. This differs from content validity, where the criterion is the construct definition itself and the comparison is direct. In criterion-related validity, we usually make a prediction about how the operationalization will perform based on our theory of the construct. The criterion-related validity types differ in the criteria they use as the standard for judgment.
i) Predictive Validity
In predictive validity, we assess the operationalization’s ability to predict something it should theoretically be able to predict. For instance, we might theorize that a measure of math ability should be able to predict how well a person will do in an engineering-based profession. We could give our measure to experienced engineers and see if there is a high correlation between scores on the measure and their salaries as engineers. A high correlation would provide evidence for predictive validity — it would show that our measure can correctly predict something that we theoretically think it should be able to predict.
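The correlation described above can be sketched in a few lines of Python. The scores and salaries below are invented purely for illustration, and `pearson_r` is a helper written for this sketch rather than something taken from the source.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(ss_x * ss_y)

# Hypothetical data: math-ability scores and salaries (in thousands)
# for six experienced engineers.
math_scores = [52, 61, 70, 74, 83, 90]
salaries = [58, 64, 66, 75, 80, 91]

r = pearson_r(math_scores, salaries)
print(f"correlation between measure and criterion: r = {r:.2f}")
```

A coefficient close to 1 in data like these would count as evidence for predictive validity; a value near 0 would not.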
ii) Concurrent Validity
In concurrent validity, we assess the operationalization’s ability to distinguish between groups that it should theoretically be able to distinguish between. For example, if we come up with a way of assessing manic-depression, our measure should be able to distinguish between people who are diagnosed with manic-depression and those diagnosed with paranoid schizophrenia. If we want to assess the concurrent validity of a new measure of empowerment, we might give the measure to both migrant farm workers and farm owners, theorizing that our measure should show that the farm owners are higher in empowerment. As in any discriminating test, the results are more powerful if you are able to show that you can discriminate between two groups that are very similar.
iii) Convergent Validity
In convergent validity, we examine the degree to which the operationalization is similar to (converges on) other operationalizations that it theoretically should be similar to. For instance, to show the convergent validity of a Head Start program, we might gather evidence that shows that the program is similar to other Head Start programs. Or, to show the convergent validity of a test of arithmetic skills, we might correlate the scores on our test with scores on other tests that purport to measure basic math ability, where high correlations would be evidence of convergent validity.
iv) Discriminant Validity
In discriminant validity, we examine the degree to which the operationalization is not similar to (diverges from) other operationalizations that it theoretically should not be similar to. For instance, to show the discriminant validity of a Head Start program, we might gather evidence that shows that the program is not similar to other early childhood programs that don’t label themselves as Head Start programs. Or, to show the discriminant validity of a test of arithmetic skills, we might correlate the scores on our test with scores on tests of verbal ability, where low correlations would be evidence of discriminant validity.
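The arithmetic-test example for convergent and discriminant validity can be sketched as a pair of correlations. All three score lists are hypothetical, and `pearson_r` is a helper defined only for this illustration.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(ss_x * ss_y)

# Hypothetical scores for the same six test takers on three tests.
arithmetic = [10, 12, 15, 18, 20, 24]   # our new arithmetic test
other_math = [11, 14, 14, 19, 22, 23]   # an existing basic-math test
verbal = [30, 22, 27, 24, 29, 25]       # a verbal-ability test

r_convergent = pearson_r(arithmetic, other_math)
r_discriminant = pearson_r(arithmetic, verbal)

print(f"arithmetic vs. other math test: r = {r_convergent:.2f}")
print(f"arithmetic vs. verbal test:     r = {r_discriminant:.2f}")
```

In this made-up data set the first correlation is high and the second is close to zero, which is the pattern the two definitions above call for.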
3.2.2. Criterion-related validity
Criterion validity (or criterion-related validity) measures how well one measure predicts an outcome for another measure. A test has this type of validity if it is useful for predicting performance or behavior in another situation (past, present, or future). For example:
· A job applicant takes a performance test during the interview process. If this test accurately predicts how well the employee will perform on the job, the test is said to have criterion validity.
The first measure (in the example above, the job performance test) is sometimes called the predictor variable or the estimator. The second measure is called the criterion variable as long as the measure is known to be a valid tool for predicting outcomes.
One major problem with criterion validity, especially when used in the social sciences, is that relevant criterion variables can be hard to come by.
Types of Criterion Validity
The three types are:
· Predictive Validity: if the test accurately predicts what it is supposed to predict. For example, the SAT exhibits predictive validity for performance in college. It can also refer to when scores from the predictor measure are taken first and then the criterion data is collected later.
· Concurrent Validity: when the predictor and criterion data are collected at the same time. It can also refer to when a test replaces another test (e.g., because it is cheaper). For example, a written driver’s test replaces an in-person test with an instructor.
· Postdictive validity: if the test is a valid measure of something that happened before. For example, does a test for adult memories of childhood events work?
3.2.3. Test-retest method
Test–retest reliability is one way to assess the consistency of a measure. The reliability of a set of scores is the degree to which the scores result from systematic rather than chance or random factors. Reliability measures the proportion of the variance among scores that is a result of true differences. True differences refer to actual differences, not measured differences. That is, if you are measuring a construct such as depression, some differences in scores will be caused by true differences and some will be caused by error. For example, if 90% of the differences are a result of systematic factors, then the reliability is .90, which indicates that 10% of the variance is based on chance or random factors.
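The .90 example can be reproduced with a small simulation. This is only a sketch under stated assumptions: true scores are drawn with a standard deviation of 3 and random error with a standard deviation of 1, so the true variance is 9 out of a total of 10. The population size, the means, and the helper `pearson_r` are all choices made for this illustration.

```python
import random
from math import sqrt
from statistics import pvariance

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(ss_x * ss_y)

random.seed(0)  # fixed seed so the sketch is repeatable

# Assumed model: observed score = true score + independent random error.
n_people = 10_000
true_scores = [random.gauss(50, 3) for _ in range(n_people)]
test = [t + random.gauss(0, 1) for t in true_scores]     # first administration
retest = [t + random.gauss(0, 1) for t in true_scores]   # second administration

# Reliability as the share of observed variance due to true differences.
variance_ratio = pvariance(true_scores) / pvariance(test)
# Test-retest estimate: correlation between the two administrations.
test_retest_r = pearson_r(test, retest)

print(f"true variance / observed variance = {variance_ratio:.2f}")
print(f"test-retest correlation           = {test_retest_r:.2f}")
```

Under these assumptions both quantities should come out close to .90. In real data the true scores are unobservable, which is why the correlation between two administrations is used as the estimate.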
3.2.4. Equivalent-Forms method
Parallel forms reliability (also called equivalent forms reliability) uses one set of questions divided into two equivalent sets (“forms”), where both sets contain questions that measure the same construct, knowledge or skill. The two sets of questions are given to the same sample of people within a short period of time and an estimate of reliability is calculated from the two sets.
Put simply, you’re trying to find out if test A measures the same thing as test B. In other words, you want to know if test scores stay the same when you use different instruments.
Example: you want to find the reliability for a test of mathematics comprehension, so you create a set of 100 questions that measure that construct. You randomly split the questions into two sets of 50 (set A and set B), and administer those questions to the same group of students a few days apart.
Steps:
· Step 1: Give test A to a group of 50 students on a Monday.
· Step 2: Give test B to the same group of students that Friday.
· Step 3: Correlate the scores from test A and test B.
In order to call the forms “parallel”, the observed scores on the two forms must have the same means and variances. If the tests are merely different versions (without the “sameness” of observed scores), they are called alternate forms.
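Step 3 and the check on means and variances can be sketched as follows. The scores for eight students are hypothetical (the example above uses 50), and `pearson_r` is a helper written for this sketch.

```python
from math import sqrt
from statistics import mean, pvariance

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(ss_x * ss_y)

# Hypothetical scores (out of 50) for the same eight students.
form_a = [31, 35, 38, 40, 42, 44, 45, 47]   # test A, given on Monday
form_b = [33, 34, 37, 41, 41, 45, 44, 47]   # test B, given on Friday

# Step 3: the correlation is the equivalent-forms reliability estimate.
reliability = pearson_r(form_a, form_b)

print(f"equivalent-forms reliability: r = {reliability:.2f}")
print(f"means:     A = {mean(form_a):.2f}, B = {mean(form_b):.2f}")
print(f"variances: A = {pvariance(form_a):.2f}, B = {pvariance(form_b):.2f}")
```

Comparing the printed means and variances gives a rough check on whether the two forms behave as parallel forms or only as alternate forms.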
3.2.5. Face Validity
Some authors are of the opinion that face validity is a component of content validity, while others believe it is not. Face validity is established when an individual (and/or researcher) who is an expert on the research subject reviews the questionnaire (instrument) and concludes that it measures the characteristic or trait of interest. Face validity involves the expert looking at the items in the questionnaire and agreeing, just on the face of it, that the test is a valid measure of the concept being measured. This means that they are evaluating whether each of the measuring items matches the conceptual domain of the concept. Face validity is often said to be very casual and soft, and many researchers do not consider it an active measure of validity. However, it is the most widely used form of validity in developing countries. (Bolarinwa, 2015)
3.2.6. Construct Validity
This refers to the nature of the psychological construct or characteristic being measured. A measure is said to possess construct validity to the degree that it conforms to predicted correlations with other theoretical propositions. It measures the degree to which scores on a test can be accounted for by the explanatory construct of sound theory. In this case, we associate a set of other propositions with the results received from using our measurement instrument. If the measurements on our devised scale correlate in a predicted way with these other propositions, we conclude that there is some construct validity. (Gakuu, Kidombo, 2010)
There are five types of evidence that can be obtained for the purpose of construct validity depending on the research problem, as discussed below by Bolarinwa:
a) Convergent validity
Convergent validity is evidence that the same concept measured in different ways yields similar results. For example, a researcher can compare a self-report measure with an observational measure of the same concept. The two scenarios below illustrate this.
A researcher could place meters on respondents’ television (TV) sets to record the time that people spend watching certain health programs. This record can then be compared with questionnaire survey results on exposure to televised health programs.
Alternatively, the researcher could send someone to observe respondents’ TV use at home and compare the observations with the questionnaire survey results.
b) Discriminant validity
Discriminant validity is evidence that one concept is different from other closely related concepts. Using the TV health program exposure scenarios above, the researcher can measure exposure to TV entertainment programs and determine whether it differs from the measures of exposure to TV health programs. In this case, the measures of exposure to TV health programs should not be highly related to the measures of exposure to TV entertainment programs.
c) Known-group validity
In known-group validity, a group in which the attribute of interest is already established is compared with a group in which it is not. Since the status of the two groups of respondents is known, the measured construct is expected to be higher in the group with the attribute and lower in the group without it. For example, consider a survey that uses a questionnaire to explore depression in two groups: patients with a clinical diagnosis of depression and patients without one. Under known-group validity, the questionnaire’s depression construct is expected to score higher among the patients with clinically diagnosed depression than among those without the diagnosis. Another example was shown in a study by Singh et al., in which a cognitive interview study was conducted among school pupils in 6 European countries.
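The depression example can be sketched as a comparison of group means. The questionnaire scores below are invented, and the standardized difference (Cohen's d with a pooled standard deviation) is one common way to summarize the gap; the source does not prescribe a particular statistic.

```python
from math import sqrt
from statistics import mean, variance

# Hypothetical depression-questionnaire scores for the two known groups.
diagnosed = [21, 24, 19, 26, 23, 22]      # clinically diagnosed with depression
not_diagnosed = [8, 11, 9, 12, 7, 10]     # no diagnosis

mean_diagnosed = mean(diagnosed)
mean_not_diagnosed = mean(not_diagnosed)

# Pooled standard deviation for two groups of equal size (sample variances).
pooled_sd = sqrt((variance(diagnosed) + variance(not_diagnosed)) / 2)
cohens_d = (mean_diagnosed - mean_not_diagnosed) / pooled_sd

print(f"mean (diagnosed)     = {mean_diagnosed:.1f}")
print(f"mean (not diagnosed) = {mean_not_diagnosed:.1f}")
print(f"standardized difference d = {cohens_d:.1f}")
```

A clearly higher mean in the diagnosed group is the pattern known-group validity predicts. In practice the difference would also be tested for statistical significance.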
d) Factorial validity
This is an empirical extension of content validity, because it validates the contents of the construct using the statistical model called factor analysis. It is usually employed when the construct of interest has many dimensions that form different domains of a general attribute. In the analysis of factorial validity, the items written to measure a particular dimension within a construct of interest are expected to be more highly related to one another than to items measuring other dimensions. Take, for instance, the health-related quality of life questionnaire Short Form 36 version 2 (SF-36v2). This tool has 8 dimensions, so the SF-36v2 items measuring social function (SF), one of the 8 dimensions, are expected to be more highly related to one another than to the items measuring the mental health domain, which is another dimension.
e) Hypothesis-testing validity
This is evidence that a research hypothesis about the relationship between the measured concept (variable) and other concepts (variables), derived from a theory, is supported. In the case of TV viewing, for example, social learning theory states that violent behavior can be learned from observing and modelling televised physical violence. From this theory, we could derive a hypothesis stating a positive correlation between physical aggression and the amount of televised physical violence viewed. If the evidence collected supports the hypothesis, we can conclude that there is a high degree of construct validity in the measurements of physical aggression and viewing of televised physical violence, since the two theoretical concepts are measured and examined in the hypothesis-testing process.
3.2.7. Internal Consistency Reliability or Homogeneity
The two methods considered so far (i.e., the test-retest and the equivalent-forms methods) require two administrations or testing sessions. However, there are other methods of estimating reliability which require only a single administration of an instrument. They are the split-half method, the Kuder-Richardson approaches and the alpha coefficient method. (Gakuu, Kidombo, 2010)
Here are the methods as discussed below by Gakuu and Kidombo:
a) The split-half method
This involves scoring two-halves of a test separately for each person and then calculating a correlation coefficient for the two sets of scores. In most cases researchers will split the instrument into the odd items and the even items. The resulting coefficient indicates the degree to which the two halves of the test provide the same results, and hence describes the internal consistency of the test.
The reliability coefficient is calculated using the Spearman-Brown prophecy formula as indicated below:
$$\text{Reliability of scores on total test} = \frac{2 \times \text{reliability for } \tfrac{1}{2} \text{ test}}{1 + \text{reliability for } \tfrac{1}{2} \text{ test}}$$
It is possible to increase the reliability of a test by increasing its length, provided the items added are similar to the original ones.
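The odd/even split and the Spearman-Brown correction, total-test reliability = 2r / (1 + r) where r is the correlation between the two halves, can be sketched as follows. The response matrix (six people, six right/wrong items) is hypothetical, and the function names are chosen for this sketch only.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(ss_x * ss_y)

def split_half_reliability(responses):
    """Return (half-test correlation, Spearman-Brown corrected reliability)."""
    odd_totals = [sum(row[0::2]) for row in responses]    # items 1, 3, 5, ...
    even_totals = [sum(row[1::2]) for row in responses]   # items 2, 4, 6, ...
    r_half = pearson_r(odd_totals, even_totals)
    r_full = 2 * r_half / (1 + r_half)                    # Spearman-Brown step
    return r_half, r_full

# Hypothetical data: one row per person, one column per item (1 = correct).
responses = [
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 0, 1],
    [1, 0, 1, 1, 0, 0],
    [1, 1, 0, 0, 1, 0],
    [0, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]

r_half, r_full = split_half_reliability(responses)
print(f"correlation between halves: {r_half:.3f}")
print(f"reliability of total test:  {r_full:.3f}")
```

The corrected value is higher than the raw half-test correlation because each half is only half as long as the full test.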
b) Kuder-Richardson approaches:
This is the method most frequently used by researchers for determining internal consistency. It uses two formulas, KR20 and KR21. The KR21 formula requires three types of information: the number of items in the test, the mean, and the standard deviation. It is important to note that this formula can only be used if we assume that the items are of equal difficulty.
The KR21 formula is stated as follows:
$$KR_{21} = \frac{K}{K-1}\left[1 - \frac{M(K-M)}{K \cdot SD^{2}}\right]$$
where K is the number of items in the test, M is the mean of the total scores, and SD is their standard deviation.
As you are aware by now, this is a coefficient: a value of .00 indicates a complete absence of a relationship and hence no reliability at all, while a value of 1.00 indicates a complete relationship. For research purposes, the rule of thumb is that the reliability should be at least .70 and preferably higher.
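As a worked sketch, the equal-difficulty Kuder-Richardson formula (commonly labelled KR21) needs only the number of items K, the mean M and the variance SD² of the total scores: KR21 = K/(K−1) × [1 − M(K−M)/(K·SD²)]. The total scores below are hypothetical, and using the population variance is an assumption of this sketch; some texts use the sample variance instead.

```python
from statistics import mean, pvariance

def kr21(total_scores, n_items):
    """KR21 from total test scores; assumes items of equal difficulty."""
    m = mean(total_scores)
    var = pvariance(total_scores)   # population variance (an assumption here)
    return (n_items / (n_items - 1)) * (1 - m * (n_items - m) / (n_items * var))

# Hypothetical total scores of six people on a six-item right/wrong test.
total_scores = [6, 5, 3, 3, 1, 0]

reliability = kr21(total_scores, n_items=6)
print(f"KR21 = {reliability:.3f}")
```

The result is read against the .70 rule of thumb in the same way as any other reliability coefficient.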
c) The Alpha coefficient (Cronbach alpha):
This is a general form of the KR20 formula, and it is used to calculate the reliability of items that are not scored right versus wrong.
Internal consistency concerns the extent to which items on the test or instrument are measuring the same thing. The appeal of an internal consistency index of reliability is that it is estimated after only one test administration and therefore avoids the problems associated with testing over multiple time periods. Internal consistency is estimated via the split-half reliability index and the coefficient alpha index, the latter being the most commonly used form of internal consistency reliability. The Kuder-Richardson formula 20 (KR-20) index is sometimes used as well.
The split-half estimate entails dividing the test into two parts (e.g., odd/even items or first half of the items/second half of the items), administering the two forms to the same group of individuals and correlating the responses. Coefficient alpha and KR-20 both represent the average of all possible split-half estimates. The two differ in when they are used to assess reliability. Specifically, coefficient alpha is typically used during scale development with items that have several response options (e.g., 1 = strongly disagree to 5 = strongly agree), whereas KR-20 is used to estimate reliability for dichotomous (i.e., yes/no; true/false) response scales.
The formula to compute KR-20 is:
$$\text{KR-20} = \frac{n}{n-1}\left[1 - \frac{\sum p_i q_i}{\mathrm{Var}(X)}\right]$$
n = Total number of items
$\sum p_i q_i$ = Sum of the product of the probability of alternative responses
$\mathrm{Var}(X)$ = Composite variance.
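A minimal Python sketch of KR-20, that is n/(n − 1) × [1 − Sum(p·q)/Var(X)], for dichotomous items follows. The response matrix is hypothetical. Here p is taken as the proportion of people answering an item correctly and q = 1 − p, and the composite variance is computed as a population variance so that it is on the same footing as the p·q terms; both choices are assumptions of this sketch.

```python
from statistics import pvariance

def kr20(responses):
    """KR-20 for a people-by-items matrix of 0/1 responses."""
    n_people = len(responses)
    n_items = len(responses[0])
    # p = proportion answering each item correctly; q = 1 - p.
    p = [sum(row[i] for row in responses) / n_people for i in range(n_items)]
    sum_pq = sum(p_i * (1 - p_i) for p_i in p)
    # Composite variance: variance of each person's total score.
    composite_var = pvariance([sum(row) for row in responses])
    return (n_items / (n_items - 1)) * (1 - sum_pq / composite_var)

# Hypothetical data: one row per person, one column per item (1 = correct).
responses = [
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 0, 1],
    [1, 0, 1, 1, 0, 0],
    [1, 1, 0, 0, 1, 0],
    [0, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]

reliability = kr20(responses)
print(f"KR-20 = {reliability:.3f}")
```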
Coefficient alpha ($\alpha$) is calculated as follows (Allen and Yen, 1979):
$$\alpha = \frac{n}{n-1}\left[1 - \frac{\sum \mathrm{Var}(Y_i)}{\mathrm{Var}(X)}\right]$$
Where n = Number of items
$\sum \mathrm{Var}(Y_i)$ = Sum of item variances
$\mathrm{Var}(X)$ = Composite variance.
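Coefficient alpha, n/(n − 1) × [1 − (sum of item variances)/(composite variance)], can be sketched in the same way for items with several response options. The 1-to-5 ratings below are hypothetical. Population variances are used throughout; because the same kind of variance appears in the numerator and the denominator, using sample variances instead would give the same alpha.

```python
from statistics import pvariance

def cronbach_alpha(responses):
    """Coefficient alpha for a people-by-items matrix of numeric responses."""
    n_items = len(responses[0])
    # Variance of each item (column) across people.
    item_vars = [pvariance([row[i] for row in responses]) for i in range(n_items)]
    # Composite variance: variance of each person's total score.
    composite_var = pvariance([sum(row) for row in responses])
    return (n_items / (n_items - 1)) * (1 - sum(item_vars) / composite_var)

# Hypothetical data: five respondents rating four items from
# 1 (strongly disagree) to 5 (strongly agree).
responses = [
    [4, 5, 4, 5],
    [3, 3, 4, 3],
    [2, 2, 1, 2],
    [5, 4, 5, 5],
    [1, 2, 2, 1],
]

alpha = cronbach_alpha(responses)
print(f"coefficient alpha = {alpha:.3f}")
```

Applied to 0/1 data, the same function reproduces KR-20, since the population variance of a dichotomous item is p × q.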
It should be noted that KR-20 and Cronbach alpha can easily be estimated using several statistical analysis software packages these days. Therefore, researchers do not have to go through the laborious exercise of memorizing the mathematical formulas given above. As a rule of thumb, the higher the reliability value, the more reliable the measure. The general convention in research has been prescribed by Nunnally and Bernstein, who state that one should strive for reliability values of 0.70 or higher.
It is worth noting that reliability values increase as test length increases. That is, the more items we have in our scale to measure the construct of interest, the more reliable our scale will become. However, the problem with simply increasing the number of scale items when performing applied research is that respondents are less likely to participate and answer completely when confronted with the prospect of replying to a lengthy questionnaire. Therefore, the best approach is to develop a scale that completely measures the construct of interest and yet does so in as parsimonious or economical a manner as possible. A well-developed yet brief scale may lead to higher levels of respondent participation and comprehensiveness of responses, so that one acquires a rich pool of data with which to answer the research question.
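The link between test length and reliability can be illustrated with the general Spearman-Brown prophecy formula, which predicts the reliability of a test lengthened by a factor k as k·r / (1 + (k − 1)·r). The sketch assumes that the added items are comparable to the original ones; the starting reliability of .60 is a made-up value.

```python
def spearman_brown(reliability, length_factor):
    """Predicted reliability when test length is multiplied by length_factor."""
    return length_factor * reliability / (1 + (length_factor - 1) * reliability)

# Hypothetical scale with a current reliability of .60.
current = 0.60
for factor in (1, 2, 3):
    predicted = spearman_brown(current, factor)
    print(f"length x{factor}: predicted reliability = {predicted:.2f}")
```

The gains shrink with each added block of items, which fits the advice above to keep a scale as short as it can be while still covering the construct.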