Error when comparing psychometric test results
Posted by Cynthia W on 24 August 2009 09:42 PM

The first thing to remember is that if you are using a purely ipsative personality test, you should not be comparing test results between candidates.  Ipsative tests are self-referencing - they are made up of forced-choice items.  They are useful in coaching, team-building and career guidance, but should not be used alone in recruitment and selection scenarios.

Some tests on the market, such as the Apollo Profile, are joint normative-ipsative tests, and these are fine to use for comparisons between candidates.  A normative test is one which allows the candidate to respond based on the strength of their agreement or disagreement with a statement.  The results are then compared with a group of similar others who have previously taken the test (the norm group).

Purely normative tests such as the Identity Self-Perception Questionnaire would also be good to use for comparing candidates.  Aptitude tests are by their nature normative tests and hence can be used to compare between candidates. 

So, let’s assume that we have administered a normative personality assessment to two candidates and we are particularly interested in finding a candidate with a high tendency towards creative thinking.  We have decided to use a personality assessment alongside other means of assessment, including an abstract reasoning test.  We ask Lee and Jane to complete both of these tests.  These are their scores on the scale of interest (presented in sten scores):

Lee
Creative thinking: 8

Jane
Creative thinking: 6

Now, keeping in mind that we would never use test results on their own to make a decision, let’s look at how most decision-makers would approach the above scenario based on test results alone for simplicity.

It obviously appears that Lee is somewhat better suited to the position than Jane.

However, in psychometric testing just as in any assessment procedure undertaken for Human Resources, there is always a chance of error.  In fact, it’s more than chance!  We know that error is always present. 

Error is present when interviewing somebody, and it is present when running an assessment center.  Likewise, error is present in the use of psychometric tests.  Given a desire to be scientific, reputable test publishers will actually assess their tests for error.

One way of doing this is to ask a group of respondents to complete the test today and to invite them back a month later to complete the same test.  Ignoring practice effects (which are controlled for), the expectation is that there should be a strong relationship between how a candidate scored at time one and how they score at time two.  The idea is that test results should remain consistent over time.  Psychometricians refer to this as test-retest reliability.

We hope for high test-retest reliability and we really should be choosing tests which have proven high levels.  If we don’t we will have little confidence in test results and be very limited in terms of how we use them.

The assessment for error that shows us how much confidence we can have in test scores is referred to as the standard error of measurement (SEM).  It uses an equation to ascertain how confident we can be that a candidate’s test result is a reflection of their true score as opposed to their true score PLUS error.

The equation is very simple: the standard deviation multiplied by the square root of 1 minus the test-retest reliability of the assessment, i.e. SEM = SD x sqrt(1 - reliability).  If you don’t like statistics, sorry - they really are necessary to use tests competently!
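The calculation can be sketched in a few lines of Python. The sten scale has a standard deviation of 2 by definition; the reliability of 0.44 is an illustrative figure, chosen here so the result matches the 1.5-sten SEM used in the worked example:

```python
import math

def standard_error_of_measurement(sd, reliability):
    """SEM = standard deviation * sqrt(1 - test-retest reliability)."""
    return sd * math.sqrt(1 - reliability)

# Sten scales have SD = 2 by definition; reliability of 0.44 is illustrative.
sem = standard_error_of_measurement(sd=2, reliability=0.44)
print(round(sem, 1))  # 1.5
```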

If you choose a reputable test, the publisher will often quote the SEM in the test manual.  If not, you can use the equation above to calculate it, taking the standard deviation for your scale of interest and the test-retest reliability from the manual.  (Note: if your publisher fails to provide these figures, you should probably not be using their tests!)

The point is that the lower the SEM (or the higher the test-retest reliability), the better.  Why?

Going back to Lee and Jane above: if our test has an SEM of 1.5 stens, this would mean that we are 68% confident that Lee’s true score for the creative thinking scale is between 6.5 and 9.5 (we add and subtract the SEM from the observed score).  It would also mean that we are 68% confident that Jane’s true score lies between 4.5 and 7.5 on the same test.
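The 68% band is simply the observed score plus and minus one SEM, which can be sketched as:

```python
def confidence_band(observed, sem):
    """68% confidence band for the true score: observed +/- one SEM."""
    return observed - sem, observed + sem

print(confidence_band(8, 1.5))  # Lee:  (6.5, 9.5)
print(confidence_band(6, 1.5))  # Jane: (4.5, 7.5)
```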

Now we can see that some doubt begins to arise as to whether the difference observed between the two candidates is the result of a real score difference or an error difference (i.e., the true score for both candidates could be 7!).  We don’t want to make a mistake and choose the wrong candidate, so let’s now look at how we can compare the differences.

We can take this further and calculate something called the standard error of difference (SEdiff).  This tells us how confident we can be that there is a true difference between the scores of the two candidates.  In general, SEdiff is the square root of (SEM of test A squared + SEM of test B squared).  Because both candidates completed the same test, this simplifies to: SEdiff = SEM x sqrt(2), i.e. roughly 1.414 x SEM.

Let’s say that our test has an SEM of 1.5 stens.  Using the SEdiff equation, we get 1.5 x 1.414, which is approximately 2.12.  This represents our “critical figure”.  It means that the difference between the candidates’ scores must be at least 2.12 stens before we can conclude (with 68% confidence) that there is a true score difference.

In our example, the difference between the candidates’ scores is only 2.  Hence we cannot conclude that there is a true score difference.  The implication for selection is that we should not (everything else being equal) select one candidate over the other because, although we observe a difference, it may not be a true difference; it may simply be an error difference.
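The comparison rule can be sketched as follows, using the standard formula for two scores on the same test, SEdiff = sqrt(SEM squared + SEM squared) = SEM x sqrt(2); the figures are those of the running example:

```python
import math

def standard_error_of_difference(sem):
    """SEdiff when both candidates sat the same test:
    sqrt(SEM**2 + SEM**2) = SEM * sqrt(2)."""
    return sem * math.sqrt(2)

def scores_truly_differ(score_a, score_b, sem):
    """True only if the observed gap reaches SEdiff (68% confidence)."""
    return abs(score_a - score_b) >= standard_error_of_difference(sem)

# Lee (8) vs Jane (6) with an SEM of 1.5 stens: gap of 2 < SEdiff of ~2.12.
print(scores_truly_differ(8, 6, sem=1.5))  # False

# With a more reliable test (SEM of 1 sten): gap of 2 > SEdiff of ~1.41.
print(scores_truly_differ(8, 6, sem=1.0))  # True
```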

Note that if we choose a more reliable test, it will reduce the SEM.  So for example, if we have an SEM of 1 sten, our SEdiff for the above example would be 1 x 1.414, approximately 1.41.  In this case, since the difference between the candidates’ scores is 2 stens, we could conclude that there is a true difference.  We would be at least 68% certain (in fact, roughly 92% certain).  We won’t go into degrees of certainty in this article, but the point is made!

In summary, do not compare candidates’ test results without a knowledge of the test’s reliability and standard deviation; in other words, do not ignore the SEM.  Every assessment technique carries error.  Competent users of psychometric tests will be aware of this and ensure they do not make the wrong selection decision, or give incorrect development or careers advice, on the basis of error rather than true score differences.

This article is (C) 2009 PsyAsia International. Some websites have been given permission to post this article.  The article must always contain our copyright, publisher details and a live link to our website. Please do not violate these terms.