PSY 448 Clinical Neuropsychology
Clinical Assessment: Basic Psychometric Principles |
What do we want from the tests we use to evaluate someone's neuropsychological capabilities? We want them to be reliable, valid, and standardized.
Reliability
We want a test we can depend upon, one that gives the same results no matter who does the testing or when the testing is done. In other words, we want the test to be consistent in its findings. Tests that are reliable are consistent.
There are several ways a test can demonstrate reliability:
- Test-Retest Reliability: This means that a test given at two different points in time provides identical or highly similar scores for the subject.
- Alternate Form Reliability: This means that the test comes in at least two different forms (say, Form A and Form B). On tests with alternate-form reliability, it doesn't matter whether the subject takes Form A or Form B: the subject's score on each form should be identical or nearly identical.
- Split-Half Reliability: In this measure of reliability, the test is split into two halves, e.g., all the even-numbered items and all the odd-numbered items. The scores a subject gets on these two halves ought to be very similar.
- Interitem Consistency: In a test measuring some quality or property (for example, sad mood or visual accuracy), all the items on the test should contribute toward an overall score. Thus, a test can be examined to see whether the response to each item is related to the responses given to the other items.
- Interscorer Consistency: It shouldn't make any difference whether Bill or Mary or Kim is rating a subject's performance on a test. If all three scorers or raters mark the same test performance, the subject's final score should be the same. (A brief computational sketch of several of these reliability indices follows this list.)
Note, though, that a test can be consistent and still be inaccurate or wrong. For example, on a math test the scoring manual may say that the solution to an equation is one value when it is actually another. Someone taking the test might answer that question correctly and still be marked wrong, consistently wrong, because there is an error in the scoring key. Reliable tests, then, give consistent results but may still not be accurate or valid.
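In practice, these reliability indices are simple statistics computed from scores. As a rough, minimal sketch (the session scores and item responses below are invented purely for illustration), the test-retest coefficient is a correlation between two administrations, the split-half coefficient is a correlation between half-test totals adjusted with the standard Spearman-Brown correction, and interitem consistency is commonly summarized with Cronbach's alpha:

```python
import numpy as np

# Hypothetical data: six subjects tested at two sessions (for test-retest)
# and their responses to a six-item scale (for split-half and interitem).
time1 = np.array([52, 47, 60, 38, 55, 44])
time2 = np.array([50, 49, 58, 40, 57, 42])

# Test-retest reliability: correlation between the two administrations.
r_test_retest = np.corrcoef(time1, time2)[0, 1]

# Item responses: rows = subjects, columns = items (0-4 ratings).
items = np.array([
    [3, 4, 3, 4, 2, 3],
    [2, 2, 3, 2, 2, 2],
    [4, 4, 4, 3, 4, 4],
    [1, 2, 1, 2, 1, 1],
    [3, 3, 4, 4, 3, 3],
    [2, 3, 2, 2, 3, 2],
])

# Split-half reliability: correlate odd-item and even-item totals, then
# apply the Spearman-Brown correction to estimate full-length reliability.
odd_total = items[:, ::2].sum(axis=1)
even_total = items[:, 1::2].sum(axis=1)
r_half = np.corrcoef(odd_total, even_total)[0, 1]
r_split_half = 2 * r_half / (1 + r_half)

# Interitem consistency: Cronbach's alpha.
k = items.shape[1]
alpha = (k / (k - 1)) * (1 - items.var(axis=0, ddof=1).sum()
                         / items.sum(axis=1).var(ddof=1))

print(f"test-retest r = {r_test_retest:.2f}")
print(f"split-half (Spearman-Brown) = {r_split_half:.2f}")
print(f"Cronbach's alpha = {alpha:.2f}")
```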
Validity
Not only do we want tests to be consistent (reliable), we want them to be accurate, that is, to actually measure what they claim to measure. For example, we want spelling tests to measure spelling ability, not math ability. We want tests of visual-motor speed to measure how fast our perceptual abilities or our muscles work, not whether we can read a particular language. And, if we're measuring someone's level of depression, we don't want the test to really be measuring that person's ability to hear, to drive a car, or to write a coherent sentence.
The quality of a test's accuracy--its ability to measure what it claims to measure--is a test's validity. A claim that a test is valid is usually established on several grounds:
- Face Validity. While this claim doesn't rest on any mathematical quantity, it is simple: a test should at least look like it is measuring what it claims to measure. For example, if the test is a reading test, there ought to be some sort of reading involved. Otherwise, the test may lack "face validity"; it may not be a credible test. Note, though, that a test might appear valid on the surface but still fail to meet the other criteria below.
- Content Validity. When a test is constructed, the domain or area to be tested should be examined carefully and a range or variety of questions developed on the basis of that examination. So, for example, to test for someone's knowledge of psychology, a team might look over many different textbooks in the field and assemble a large number of topics, concepts, and vocabulary words -- from this collection of items, a random sample of items would be drawn to appear on the test. Again, this is not a type of validity that can be measured quantitatively; rather, this validity flows from the expert judgment of those involved in creating the test.
- Criterion Validity. A test is valid if it accurately predicts how a person performs on some relevant criterion. So, for example, SAT or GRE scores are judged to be reasonably valid predictors of how well someone will do in undergraduate or graduate school. A visual-motor acuity test for pilots would be valid if it predicted how well the pilot could see and react during flight. In the field of neuropsychology, the Glasgow Coma Scale offers odds about whether a person will survive a head injury, and the California Verbal Learning Test predicts different aspects of a person's memory. Two different kinds of criteria can be used to establish this kind of validity:
- Concurrent Criterion Validity predicts how well a person performs on some task in the present. The test and a criterion measure are given at about the same time, and the scores on one ought to be related or correlated with the scores on the other.
- Predictive Criterion Validity predicts how well a person will perform on some task in the future. Test scores are compared over time with performance on some future variable, e.g., SAT scores from the senior year of high school are compared to freshman grade point averages in college (see the sketch following this list).
- Construct Validity. Sometimes tests are used to predict behavior or to say something meaningful about a person vis-a-vis a concept or a trait, for example, a test of aggressiveness or generosity or creativity. The trait or concept is a theoretical notion which can't be observed directly; rather, it has to be inferred from several, and sometimes many, different kinds of behavior. In establishing construct validity, test results are compared to several other tests, tasks, or outcomes, and there should be evidence of validity across these different comparisons.
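In practice, a criterion validity claim usually comes down to a validity coefficient: the correlation between test scores and the criterion measure. A minimal sketch (the admission-test scores and grade point averages below are invented for illustration):

```python
import numpy as np

# Hypothetical predictive criterion validity: an admission-test score for
# eight students, and the criterion (first-year GPA) measured later.
test_scores = np.array([1210, 1050, 1340, 980, 1150, 1420, 1100, 1260])
first_year_gpa = np.array([3.1, 2.6, 3.6, 2.4, 2.9, 3.8, 2.8, 3.3])

# The validity coefficient is the correlation between test and criterion.
# A concurrent design is computed the same way; the only difference is that
# the criterion is measured at the same time as the test.
validity_coefficient = np.corrcoef(test_scores, first_year_gpa)[0, 1]
print(f"predictive validity coefficient r = {validity_coefficient:.2f}")
```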
Standardization
When all is said and done, a test offers a comparison: how did the subject do on the test compared to other people? So, in a test of "intelligence" or "visual motor speed" or "long-term memory", the score a subject gets on the test is meaningful because it tells us something about how that person performed vis-a-vis other people. If person X gets 15 questions right on a test with 60 possible right answers, it is important to know whether the average person gets only 6 questions right or 35 questions right. In the first case, person X's performance would be considered very strong vis-a-vis others, while in the second case it would be considered very weak (a brief computational sketch follows the list below). So, in evaluating a subject's tested performance, we have to be concerned with the standards against which that subject is being compared. Without becoming too technical, our concern for standardization is a demand to know two things:
- Comparison Group Standardization: What is the makeup of the group to which the person's performance is being compared? Ideally, this group would be a randomly selected sample drawn from the national population. Did the group include men and women, old and young, brain-injured and normal persons, etc.?
- Administration Standardization: Did the person who was tested take the test in the same way, and under the same conditions, as everyone else in the comparison group? This is an important question. A mountain of psychological research demonstrates that even small changes in the wording or conditions under which someone is tested can significantly affect how well they do on a test. Hence, tests should be administered in the same way under similar conditions.
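The comparison in the example above is usually made by converting the raw score into a standardized score relative to the comparison group. As a minimal sketch (the norm means come from the example above, but the standard deviations are invented for illustration), person X's 15 correct answers translate into very different standings depending on the group:

```python
import math

def z_score(raw_score, norm_mean, norm_sd):
    """How far the raw score falls from the comparison group's mean,
    in standard-deviation units."""
    return (raw_score - norm_mean) / norm_sd

def percentile_from_z(z):
    """Approximate percentile rank, assuming scores in the comparison
    group are roughly normally distributed."""
    return 100 * 0.5 * (1 + math.erf(z / math.sqrt(2)))

raw = 15  # person X's number of correct answers (out of 60 possible)

# Hypothetical norms for the two scenarios in the text; the standard
# deviations are invented purely for illustration.
for norm_mean, norm_sd in [(6, 3), (35, 8)]:
    z = z_score(raw, norm_mean, norm_sd)
    print(f"norm mean {norm_mean}, SD {norm_sd}: "
          f"z = {z:+.2f}, percentile rank ~ {percentile_from_z(z):.1f}")
```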