A comparative analysis of pre-equating and post-equating in a large-scale assessment, high-stakes examination

The statistical procedure used to adjust for differences in difficulty between test forms is known as “equating”. Equating makes it possible for various test forms to be used interchangeably. In terms of where the equating method fits in the assessment cycle, there are pre-equating and post-equating methods. The major benefits of pre-equating are that it facilitates the operational processes of examination bodies in terms of rapid score reporting, quality control and flexibility in the assessment process. The purpose of this study is to ascertain whether pre- and post-equating results are comparable. Data for this study, which adopted an equivalent group design, were taken from the 2012 Unified Tertiary Matriculation Examination (UTME) pre-test and the 2013 UTME post-test in the Use of English (UOE) subject. A pre-equating model based on the 3-parameter logistic (3PL) Item Response Theory (IRT) model was used, and IRT software was used for the item calibration. Pre- and post-equating were carried out using 100 items per test form in the UOE test. The results indicate that the raw scores and ability estimates from the pre-equated model and the post-equated model were comparable.


Introduction
Developments in education, psychology and statistics have immensely assisted researchers in assessment by contributing to the rapidly growing statistical and psychometric methodologies used in test equating. In large-scale examinations such as the Unified Tertiary Matriculation Examination (UTME), where candidates' scores are used for high-stakes decisions, testing programmes require new versions of tests to be produced continually. The expectation is that the tests produced should be equivalent in difficulty as well as in functionality over time. The UTME is a computer-based test (CBT) conducted by the Joint Admissions and Matriculation Board (JAMB) for the purpose of selecting qualified candidates for admission into Nigerian tertiary institutions. The examination, which comprises 23 subjects including the UOE, is conducted at different times within a specified period of 14 days for over 1.5 million candidates. The UTME is compulsory for any candidate seeking admission into any tertiary institution in Nigeria. It is therefore a high-stakes test, since the results obtained from this examination are used in making important decisions about the candidates. Since this examination is conducted at different times and on different days using several test forms in 23 subject areas, equating of the test forms is necessary.
Equating is a statistical procedure used in adjusting scores on two or more tests such that scores on the resulting new forms of the test are comparable. In support of this, Livingston (2004) defined equating as a statistical procedure that adjusts test scores for the difficulty of the items. As a statistical process, equating refers to the derivation of transformations that place scores from different forms of a test onto a common scale such that, after transformation, the scores on the resulting forms are comparable. This definition is similar to that of Kolen and Brennan (2004), who describe equating as a process used in adjusting scores on two or more test forms such that the scores can be used interchangeably.
Equating is an important component of any testing programme that produces more than one form of a test. It places scores from different forms onto a single scale. Once scores are placed on a single scale, the scores are interchangeable (Kolen & Brennan, 2004; Holland & Dorans, 2006). This permits standardisation of scores across test forms such that what is applied to one test form is also applied to the other forms, enabling consistency and accuracy in classification decisions across test forms. It is for this reason that equating has become essential to testing programmes that use test scores for the measurement of students' growth as well as for high-stakes decisions. In the UTME, pre-equating is used to establish a conversion table prior to the operational testing. Kirkpatrick and Way (2008) affirmed that a series of advantages arise from the use of pre-equating over post-equating. Top of the list of stated benefits are more flexible assessment and a better quality-control check for the tests.
Generally, equating adjusts for test score differences that arise from differences in form difficulty. Ideally, the same group of test takers would take the new test form and the reference form at the same time. The difference in average performance on the two forms then indicates the difference in form difficulty, and scores on the new test form can be statistically adjusted to make the average performances on both forms equivalent. In practice, however, it is not possible to compel test takers to take two different tests at the same time; it is more convenient to have two different groups of test takers take the two forms of the test at the same time or on two different occasions. However, because these two groups of test takers could have different average abilities, Xuan and Rochelle (2011) are of the opinion that the difference in average performance on the two forms could reflect both group ability differences and form difficulty differences.
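The idea of adjusting new-form scores so that average performance matches the reference form can be illustrated with a minimal mean-equating sketch in Python. It assumes the idealised case in which the score difference reflects form difficulty only; the function name and the scores are hypothetical and are not taken from the UTME data.

```python
from statistics import mean

def mean_equate(new_form_scores, ref_form_scores):
    """Return a function that shifts new-form raw scores onto the reference-form scale.

    Assumes (hypothetically) that any difference in means is due to form difficulty alone.
    """
    shift = mean(ref_form_scores) - mean(new_form_scores)
    return lambda score: score + shift

# Hypothetical raw scores on the two forms
ref_scores = [52, 60, 71, 48, 66]
new_scores = [49, 57, 69, 45, 62]

to_ref_scale = mean_equate(new_scores, ref_scores)
adjusted = [to_ref_scale(x) for x in new_scores]
```

When the two groups differ in ability, a shift of this kind would also absorb the ability difference, which is the concern raised in the paragraph above.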
Equating may be classified as pre-equating or post-equating depending on when the equating is conducted. According to Tong, Wu and Xu (2008), pre-equating means conducting equating prior to the operational testing, while post-equating involves conducting equating after the operational testing. In their paper, they stated that pre-equating and post-equating are used in K-12 large-scale assessment programmes. In many large-scale, high-stakes examinations such as the UTME, where immediate reporting of scores is required, pre-equating is often preferred to post-equating since the equating transformation must be produced in a rather short period of time. Every prospective UTME candidate is expected to enrol in four UTME subjects including the UOE. The subjects are selected based on the faculty and course requirements. Normalised scores are reported for each candidate based on the four subjects. The normalised scores are based on Z-score and T-score transformations of the raw score. No other form of equating is carried out since the equating has been done prior to test administration.
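The Z-score and T-score transformations mentioned above can be illustrated with a minimal sketch. It assumes the conventional T = 50 + 10Z scaling and a population standard deviation; the source does not state which constants or reference group JAMB uses, so these choices and the sample scores are illustrative assumptions only.

```python
from statistics import mean, pstdev

def z_scores(raw):
    """Standardise raw scores to mean 0 and SD 1 (population SD assumed)."""
    m, s = mean(raw), pstdev(raw)
    return [(x - m) / s for x in raw]

def t_scores(raw):
    """Conventional T-scores: mean 50, SD 10 (an assumed scaling)."""
    return [50 + 10 * z for z in z_scores(raw)]

# Hypothetical raw scores for one subject
raw = [40, 50, 60]
t = t_scores(raw)
```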
The UTME results are used solely by the Joint Admissions and Matriculation Board (JAMB) and the tertiary institutions in Nigeria for selecting eligible candidates into the various programmes/courses offered by the institutions. The computer-based tests administered by JAMB take place at different times and on different dates, so several forms of the same test are required in each session in order to forestall overexposure of the items in the item bank. This is a strategy for curbing breaches of examination security. Since immediate score reporting is needed, all forms of the tests for all the subjects are pre-equated in order to make them equivalent. This is to ensure that no candidate is placed at a disadvantage because of the particular test form administered.
When embarking on equating, care must be exercised in order to avoid equating errors. If equating errors exceed some tolerable limits as a result of applying pre-equating, this can likely lead to multidimensionality. The probable cause of pre-equating error is the presence of bias in the item parameter estimates caused by violation of the assumption of local item independence (Kolen & Brennan, 2004). Guarding against serious equating errors by ensuring that model assumptions are complied with to a reasonable extent adds value to the final equating results.

Statement of problem
In many large-scale, high-stakes assessment enterprises such as the UTME, stakeholders need assessment evidence as quickly as possible to enable them to make informed decisions relating to admissions or other policy issues. The nature of the UTME makes it pertinent to release candidates' results as quickly as possible in order to meet score-reporting deadlines. To facilitate this, test items are often calibrated prior to the operational administration, with the raw-score-to-scale-score conversion tables prepared well ahead of the test administration to ease problems that impede quick reporting. The use of different forms of the same test often raises the issue of the comparability of test scores across forms. In order to use the scores from different forms of a test interchangeably, they must be put on a common scale. The problem is how to make the several test forms, which consist of different test items drawn from the same content areas of the syllabus, psychometrically equivalent so that whichever form is given to any candidate, s/he will not in any way be disadvantaged.

Purpose of study
Measurement equivalence is said to exist when candidates with the same scores on the latent trait have the same expected raw or true score at the item level. Raju, Laffitte and Byrne (2002: 517) inferred that without measurement equivalence, it is difficult to interpret observed mean score differences meaningfully. The purpose of this study, therefore, is to compare pre-equating and post-equating scores of candidates in the UTME high-stakes examination in order to ascertain whether the tests function the same way for students in a field test administration as in an operational test administration.

Literature review
While some researchers have varied views regarding the efficacy of pre-equating in a high-stakes examination, other studies have suggested that pre-equating can achieve satisfactory results. For instance, a study by Livingston (2004), which adopted a method similar to regression, demonstrated that pre-equating was highly accurate in three of the four New Jersey College Basic Skills Placement tests. There is also a dearth of literature on post-equating. However, Kirkpatrick and Way (2008) were of the opinion that in post-equating, new operational data can be obtained for items selected from the calibrated item pool. They explained that item parameters are estimated from the operational data, and operational items are post-equated using the pool (old) and current (new) item parameters as well as a scale transformation procedure. If new field test items were administered with the operational items, this transformation can be applied to their calibration results as well.
Furthermore, two more recent studies, by Domaleski (2006) and Tong et al. (2008), supported the use of pre-equating, finding similar pre- and post-equated scoring tables and similar accuracy in classifying students into different performance levels. Beyond these differing findings, a review of the literature indicates that little research has been conducted on whether pre-equating agrees with post-equating for a testlet-based and computer-administered testing programme. Given the controversy over the use of pre-equating and the appealing features that pre-equating can offer, more research is clearly needed in this area. To this end, this study, which employed empirical data, investigates whether the pre-equating results agree with the equating results based on operational data (post-equating). The study examined the degree to which the IRT pre-equating results agreed with those from IRT post-equating and the degree to which the two equating designs agree with each other.
Since pre-equating establishes a conversion table prior to the operational testing, a series of advantages arise from the use of pre-equating over post-equating (Kolen & Brennan, 2004; Kirkpatrick & Way, 2008). These advantages include more flexible assessment, a better quality-control check for the tests and the ability to report scores immediately after the test administration.

Equating designs and equating method
This research is based on the equivalent group equating design. The UTME test is a high-stakes standardised test made up of 100 items. Twenty-two other subjects are also tested, but candidates are only allowed to choose four subjects according to faculty and departmental requirements. The Use of English (UOE) subject is compulsory for all candidates, and all the tests are administered via a computer-based testing mode using the linear-on-the-fly testing (LOFT) method. In the UOE test, test forms C1, C2, C3 and C4 were created, each taking into account the subsections of the syllabus and their weights as stated in the UTME syllabus. In so doing, more than one parallel form was created. Each of these items, trial-tested in 2012, was used in creating tests administered in a subsequent operational examination.
The UTME therefore contains many versions of the same test (test forms) created from the same rational content domain as stored in the JAMB item banks. The test forms were built and made equivalent in terms of content and psychometric properties. For example, test form C1 in UOE from the trial test was taken as the reference form, while forms C2, C3 and C4 were made equivalent and taken as the focal groups for the pre-equating. Data in these test forms were organised such that the item difficulties b had a mean of 0 and the discrimination parameters a varied between 1 and 2. Test scores on the different forms of the 2013 operational examination were also equated by using a common reference form, D1, and adjusting the test score difficulties of the other three test forms, D2, D3 and D4. The 3-parameter logistic IRT model was used for the item analysis of the eight UOE test forms: C1, C2, C3, C4, D1, D2, D3 and D4.

Data
Data for the study were extracted from the UTME master file after the operational (post-test) administration as well as from the trial test. The trial-test data are made up of responses from a representative sample of students in Senior Secondary Class III in the Use of English subject and indeed in all the other 22 UTME subjects. The students were administered the various test forms in a classroom setting at a period when they were psychologically ready for their senior secondary examination. The tests were administered to students in a scrambled form so that the groups of students taking each form were randomly equivalent. A pre-equating model, which employed the 3-parameter logistic IRT model, was used. The Xcalibre 4.0.0 software was used for calibration, given the need to have scoring tables prior to test administration. In this study, the item parameter estimates and the raw-score-to-theta relationship (i.e., the scoring table) for the pre-equating model were calibrated and developed on the field test data. To enable a comparison of the equating results between pre- and post-equating, data from the four UOE test forms of the 2012 field test and from four UOE test forms of the 2013 CBT operational administration were used. Each test form consists of a sample of approximately 650 candidates' responses. In all, the data used are made up of 5,166 responses.

IRT pre-equating
Tong et al. (2008) defined pre-equating as conducting equating prior to the operational testing. The equating design used in pre-equating the UOE items was the IRT equivalent group equating procedure. In order to pre-equate the test forms in the 2012 UOE, the response data collected during the 2012 field test were first calibrated. Then, one of the test forms comprising response data from the trial test was calibrated using a priori information from previous operational data. Thereafter, the pre-test items were put on the same scale as the one calibrated using information from the operational items through the mean/sigma method. The item parameter estimates from the above step were then used to create the raw-to-scale conversion table for each form relative to the reference form using IRT pre-equating. The pre-equating process was carried out by applying the following procedures:
a. Estimates of item parameters were produced using the three-parameter IRT model on the 2012 trial-test data.
b. The item parameters were placed onto the reference scale by using the IRT equivalent group equating design.
c. Some items were selected from the item bank and used along with some pre-test items to build new, parallel test forms.
d. A raw-score-to-theta relationship for these new test forms was developed using the trial-test pre-equated item parameters.
Despite the advantage of pre-equating where immediate score reporting is necessary and as a guard against breaches of examination security, this equating method can be vulnerable to equating errors and bias in a test.
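The mean/sigma step used to place item parameters on the reference scale can be sketched as follows. This is a minimal illustration of the textbook mean/sigma transformation, not the study's actual Xcalibre workflow; it assumes two sets of difficulty estimates for the same linking items, and the function names and values are hypothetical.

```python
from statistics import mean, pstdev

def mean_sigma_constants(b_new, b_ref):
    """Slope A and intercept B such that b_ref is approximately A * b_new + B.

    b_new, b_ref: difficulty estimates of the same linking items on the
    new scale and on the reference scale (an assumption of this sketch).
    """
    A = pstdev(b_ref) / pstdev(b_new)
    B = mean(b_ref) - A * mean(b_new)
    return A, B

def rescale_items(items, A, B):
    """Transform 3PL (a, b, c) tuples onto the reference scale.

    Discrimination is divided by A, difficulty is mapped linearly,
    and the guessing parameter is left unchanged.
    """
    return [(a / A, A * b + B, c) for a, b, c in items]

# Hypothetical difficulty estimates
A, B = mean_sigma_constants([-1.0, 0.0, 1.0], [-1.5, 0.5, 2.5])
rescaled = rescale_items([(1.0, 0.0, 0.2)], A, B)
```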

IRT post-equating
In carrying out post-equating, the post-administration item parameters and scoring table were produced using the operational data. All the rules used in pre-equating were replicated during post-equating, such as applying the mean/sigma equating method to place the item parameter estimates and scoring tables on the same scale. The following steps, as suggested by Kolen and Brennan (2004), were applied during post-equating:
1. Calibrate all items on the operational test form, centring the post-operational item difficulties at a mean value of zero, and obtain the raw-score-to-theta scoring table.
2. Obtain the mean item difficulty using the post-administration item parameters from the previous stage.
3. Obtain the scaling constant for post-equating by subtracting the mean item difficulty from stage 2 from the mean item difficulty from pre-equating.
4. Adjust all the post-administration item parameters by adding the scaling constant obtained from stage 3.
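The four steps above can be sketched literally for the difficulty parameters. This is only an illustration under stated assumptions: the function name and the input values are hypothetical, only the b parameters are shifted, and the scoring-table part of step 1 is omitted.

```python
from statistics import mean

def post_equate_difficulties(b_post, mean_b_pre):
    """Shift post-administration difficulties onto the pre-equating scale.

    b_post: item difficulties from the operational calibration (hypothetical).
    mean_b_pre: mean item difficulty from pre-equating (hypothetical).
    """
    # Step 1: centre the post-operational difficulties at a mean of zero
    centre = mean(b_post)
    b_centred = [b - centre for b in b_post]
    # Step 2: mean item difficulty of the centred estimates (zero, up to rounding)
    mean_b_post = mean(b_centred)
    # Step 3: scaling constant = pre-equating mean minus post-administration mean
    constant = mean_b_pre - mean_b_post
    # Step 4: add the constant to every post-administration difficulty
    return [b + constant for b in b_centred], constant

adjusted, constant = post_equate_difficulties([0.2, 0.8, 1.4], 0.1)
```

Read this way, the adjusted difficulties have the same mean as the pre-equated ones by construction.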

Test calibration and analysis
A number of procedures can be used to achieve item calibration and item linking, such as separate calibration with linking, concurrent calibration or fixed parameter calibration. In this study, separate calibrations were carried out on all the test forms using the three-parameter logistic (3PL) IRT model. The 3PL is an IRT model that specifies the probability of a correct response to a dichotomously scored multiple-choice item as a logistic function that includes a guessing parameter in addition to the discrimination and difficulty parameters. Candidates' abilities were estimated using the Maximum Likelihood Estimation (MLE) method. In statistics, MLE is a method of estimating the parameters of a statistical model, given observations, by finding the parameter values that maximise the likelihood of making the observations. Thereafter, the mean/standard deviation (mean/sigma) method suggested by Livingston (2004) was used in placing the item parameters on the same scale.
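As a rough illustration of the 3PL model and of maximum likelihood ability estimation, the sketch below evaluates the 3PL response probability and finds the ability value that maximises the likelihood over a grid. The scaling constant D = 1.0, the grid search, the theta range and the item parameters are assumptions of this sketch; calibration software such as Xcalibre uses its own estimation routines.

```python
import math

def p_3pl(theta, a, b, c, D=1.0):
    """3PL probability of a correct response (D is an assumed scaling constant)."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def log_likelihood(theta, items, responses):
    """Log-likelihood of a 0/1 response pattern at a given ability."""
    ll = 0.0
    for (a, b, c), u in zip(items, responses):
        p = p_3pl(theta, a, b, c)
        ll += math.log(p) if u == 1 else math.log(1.0 - p)
    return ll

def mle_theta(items, responses, lo=-4.0, hi=4.0, steps=801):
    """Grid-search MLE of ability, bounded to [lo, hi]."""
    grid = [lo + i * (hi - lo) / (steps - 1) for i in range(steps)]
    return max(grid, key=lambda t: log_likelihood(t, items, responses))

def expected_raw_score(theta, items):
    """Test characteristic curve: expected number-correct score at theta."""
    return sum(p_3pl(theta, a, b, c) for a, b, c in items)

# Hypothetical (a, b, c) item parameters
items = [(1.2, -1.0, 0.2), (1.0, 0.0, 0.2), (1.5, 1.0, 0.2)]
```

For all-correct or all-incorrect patterns the likelihood has no interior maximum, so the estimate falls on the boundary of the grid; operational programmes handle such cases with their own conventions.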

Assessment criteria
In assessing the pre-equating and post-equating results, one major area of concern is the item parameter estimates. In order to compare the item parameters of two or more test forms from post-equating, they must be placed onto a common operational scale. Statistical methods such as correlation analysis can then be used in comparing the item parameter estimates obtained. Correlation coefficients are expected to be close to .90 and the average absolute differences between estimates are expected to be below 0.20. These same criteria may be applied when comparing pre- and post-equating results.
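The two criteria can be checked with a short sketch. The treatment of "close to .90" as a threshold of at least 0.90 is an assumption made here for illustration, and the parameter values are hypothetical.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

def mean_abs_diff(x, y):
    """Average absolute difference between paired estimates."""
    return sum(abs(xi - yi) for xi, yi in zip(x, y)) / len(x)

def parameters_comparable(pre, post, r_min=0.90, mad_max=0.20):
    """True if both criteria are met (thresholds are assumed cut-offs)."""
    return pearson_r(pre, post) >= r_min and mean_abs_diff(pre, post) < mad_max

# Hypothetical difficulty estimates for the same items
pre = [-1.0, -0.5, 0.0, 0.5, 1.0]
post = [-0.9, -0.55, 0.05, 0.45, 1.1]
```

A high correlation alone is not sufficient: a constant shift between the two sets of estimates leaves the correlation at 1 but can still fail the absolute-difference criterion.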
It is also important and interesting to observe how different the raw-score-to-theta scoring tables are between pre- and post-equating. In the large-scale assessment context, classification decisions are also important. In this study, the percentages of students in each of the performance levels are also contrasted between pre- and post-equating. Another reliability index examined is the classification accuracy, which is meant to establish what percentage of students were accurately classified. The classification method adopted by Gao, He and Ruan (2012) was applied to compute the classification accuracy index for the pre- and post-equating results. To calculate the classification reliability index for a given ability score θ, the observed score θ̂ is assumed to be normally distributed with a mean of θ and a standard deviation of SE(θ), the standard error of measurement associated with the given θ. The expected proportion of examinees with true scores in any particular level on high/low or pass/fail classification rates given by the different equating methods was also reported. Each test has two cut scores, the C and D cuts. Classification rates for the C and D cuts were reported for the UOE test in this study.
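The normal-distribution assumption described above lends itself to a simple sketch of a classification accuracy index for a single cut score. This is a simplified illustration in the spirit of that description and may differ from the exact procedure of Gao, He and Ruan (2012); the function names, ability values, standard errors and cut score are all hypothetical.

```python
import math

def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def prob_at_or_above(theta, se, cut):
    """P(observed score >= cut) when the observed score ~ Normal(theta, se)."""
    return 1.0 - normal_cdf((cut - theta) / se)

def classification_accuracy(thetas, ses, cut):
    """Average probability that an examinee is observed on the same side
    of the cut as the given ability value (a simplifying assumption)."""
    total = 0.0
    for theta, se in zip(thetas, ses):
        p_pass = prob_at_or_above(theta, se, cut)
        total += p_pass if theta >= cut else 1.0 - p_pass
    return total / len(thetas)

# Hypothetical abilities and standard errors, with a cut at theta = 0
accuracy = classification_accuracy([2.0, -2.0], [0.5, 0.5], 0.0)
```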
While there is no consensus on the best measures of equating effectiveness (Kolen & Brennan, 2004), three measures commonly employed in equating studies are the Root Mean Square Error (RMSE), the Standard Error of Equating (SEE) and the BIAS of the equated raw scores (Pomplun, Omar & Custer, 2004). These measures represent total equating error, random equating error and systematic equating error, respectively. All three indices were weighted by the frequency of the number-correct raw score at each level. Total equating error and systematic error were calculated with the formulas below:
$$\mathrm{RMSE} = \sqrt{\frac{\sum_i f_i \,(X_i' - X_i)^2}{\sum_i f_i}}, \qquad \mathrm{BIAS} = \frac{\sum_i f_i \,(X_i' - X_i)}{\sum_i f_i}$$
where $f_i$ is the frequency of number-correct raw score level $i$, $X_i'$ is the equated score at number-correct raw score level $i$ and $X_i$ is the equated score from IRT pre-equating at number-correct raw score level $i$.
The standard error of equating is a measure of random equating error and can be estimated with the RMSE and BIAS. The standard error of equating was estimated with:
$$\mathrm{SEE} = \sqrt{\mathrm{RMSE}^2 - \mathrm{BIAS}^2}$$
where RMSE and BIAS are the frequency-weighted indices defined above, with $f_i$ the frequency of number-correct raw score level $i$.
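As a rough sketch of how the three indices could be computed, the Python below assumes the commonly used frequency-weighted definitions: BIAS as the weighted mean difference between the equated score and the baseline score, RMSE as the square root of the weighted mean squared difference, and SEE as the square root of RMSE squared minus BIAS squared. The function name and the data are hypothetical.

```python
import math

def equating_error_indices(freq, x_equated, x_baseline):
    """Return (RMSE, BIAS, SEE), weighted by raw-score frequency.

    freq: number of examinees at each number-correct raw score level.
    x_equated: equated score at each level from the method being evaluated.
    x_baseline: equated score at each level from the baseline method.
    """
    n = sum(freq)
    diffs = [xe - xb for xe, xb in zip(x_equated, x_baseline)]
    bias = sum(f * d for f, d in zip(freq, diffs)) / n
    rmse = math.sqrt(sum(f * d * d for f, d in zip(freq, diffs)) / n)
    # Guard against tiny negative values caused by floating-point rounding
    see = math.sqrt(max(rmse ** 2 - bias ** 2, 0.0))
    return rmse, bias, see

# Hypothetical frequencies and equated scores at three raw score levels
rmse, bias, see = equating_error_indices(
    [1, 2, 1], [10.5, 20.0, 30.5], [10.0, 20.0, 30.0]
)
```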

Results
Table 1 shows the disparity in item parameter estimates between the pre- and post-equating results for test forms C1 and D1, representing the base test form for pre-equating and one test form from post-equating. Columns 1 and 2 in Table 1 show the p-values of test forms C1 and D1. Overall, the p-values appear to be higher for the post-equated form than for the pre-equated one. This may be attributed to the prevailing situation during the conduct of the pre-test, as most students do not take trial tests as seriously as high-stakes examinations. However, the item parameter values from pre-equating were found not to differ from the post-equating item parameter estimates: because of the mean/sigma equating, the averages of the item parameter estimates were equated to be the same for pre- and post-equating. The average absolute differences between the item parameter estimates were computed as .000342 for C1 and D1, .00491 for C2 and D2, .00572 for C3 and D3 and .00557 for C4 and D4. All were found to be less than the benchmark of .20. Judged against the criteria stated earlier under assessment criteria (i.e., correlation close to 0.90 and average absolute difference less than 0.20), the item parameter estimates from the two equating models are effectively the same.
Figures 1, 2, 3 and 4 also show scatter plots of the relationship between the pre-equating and post-equating test forms. All the items constituting the two different forms lay along the straight line, showing a very close relationship. In the same way, Figures 5, 6, 7 and 8 depict the raw-score-to-theta scoring tables based on the two equating models. The horizontal axis represents the ability estimates and the vertical axis represents raw scores. From the figures, it is clear that the raw-score-to-theta scoring tables for the pre-equating and post-equating models overlap each other.
Table 3 shows that for the classification rate, the IRT post-equating tended to pass more examinees than the pre-equating methods in total. The table shows that the IRT pre-equating method tended to pass fewer examinees than the IRT post-equating method at the C cut and in total. However, the reverse is the case for the D cut, where the pre-equating method passed more candidates in test forms C1, C2 and C4. Finally, Table 5 presents the results of the three indices used to evaluate the equating results with the IRT pre-equating results as the baseline. All three indices indicated that the IRT post-equating yielded results close to those of the IRT pre-equating method, with small RMSE, BIAS and SEE in all four of the test forms.

Discussions on results
The higher p-values observed under the post-equating method can probably be explained. During field trials, the items constituting the UOE were administered in paper-and-pencil mode, while the same items were delivered in a computer-based testing environment in the subsequent operational examination. The difference in the modes of examination could be a direct cause of the observed difference between the pre-equating method and the post-operational method. The design of the UTME delivery system made it possible to include innovations such as the use of the four arrow keys on the keyboard as an alternative to the mouse, a review of items to reveal unanswered items prior to submission, and a timer, among other things. These features added value to the test delivery system, distinguishing it from the paper-and-pencil mode of testing.
The seriousness or stakes attached to the two examinations may also have contributed to the difference in the p-values observed. Since the trial test rarely offers any incentive, students often do not take it as seriously as the high-stakes UTME. This could account for the difference in the overall performance of the candidates. Again, the level of preparedness of the students can also affect performance.
Direct examination of the p-values for test forms C1 and D1, for instance, offers more insight into the differences between the pre-equating and post-operational methods. Test form C1 represents pre-equating while D1 stands for the post-equating method. Of the 100 items tested, 56 were found to be harder in the pre-equating than in the post-equating test form. Experience has shown that in the trial-testing situation, candidates are often less serious in taking examinations, possibly because of a lack of motivation given the perceived absence of consequences. Wolf and Smith (1995) presented a study which showed that students tested in a consequential condition outperformed students in a non-consequential condition by an effect size of .26. They concluded that consequences influence motivation and motivation influences performance.
It is likely, therefore, that motivation contributed to the performance differences found in this study between students who took the field test and students who took the high-stakes UTME. Indeed, according to Domaleski (2006), it appears reasonable to say that students taking the field test would not exert as much effort since no stakes were associated with this test event and, in fact, no student-level results were ever reported. This lack of seriousness regarding trial tests often accounts for the high rates of omitted and unreached items seen in many field tests, and possibly explains why trial-test items were found to be harder, owing to the relatively large amount of missing or incomplete data.
The equality argument for fairness in assessment advocates assessing all students in a standardised manner using identical assessment methods and content and the same administration, scoring and interpretation procedures. Under this approach to assuring fairness, if different groups of test takers differ on some irrelevant knowledge or skills that can affect assessment performance, bias will exist. This situation is avoided by ensuring that pre-equating is carried out prior to the real test administration. The analysis carried out in this study has shown that the pre-equating and post-equating methods provided comparable results. This should allay the fears of stakeholders who are apprehensive about whether pre-equating is actually doing what it is supposed to do or providing validity evidence as to the equivalence of the test forms used in the UTME UOE.

Conclusion/Recommendation
The results of this study have shown that all three major indices, RMSE, BIAS and SEE, which represent total error, systematic error and standard error of equating respectively, indicated that the IRT post-equating yielded results close to those of the IRT pre-equating method; the two are therefore comparable. However, carrying out equating using IRT is complex, both conceptually and procedurally.
Another point in favour of the post-equating method is that it passed more candidates than pre-equating, especially in total and at the C cut. This shows that the field test items are predicting the performance of candidates in the UTME operational examination. These results point to the fact that item parameters obtained during the trial test were remarkably equivalent to those obtained during the operational assessment of the UTME in the UOE. All the other 22 UTME subjects were also subjected to pre-equating prior to operational test administration and similar results were achieved. The extent to which those inferences are appropriate for different groups of test takers is an important aspect of fairness. The practice of using the pre-equating method to build score tables prior to an operational assessment should be sustained since the method yielded results comparable with the post-equating method. This holds as long as the probable causes of pre-equating error, such as bias in the item parameter estimates caused by violation of the assumption of local item independence, are removed (Kolen & Brennan, 2004). Pre-equating test forms prior to the actual examination is a good way of assuring equity and fairness in assessment. When the tests given to the students are unbiased and function the same way for different groups of test takers, fairness is said to have been built into the test.

Fig. 1: Scatter plot of relationship between pre-equating and post-equating of C1 and D1
Fig. 2: Scatter plot of relationship between pre-equating and post-equating ...

Fig. 4: Scatter plot of relationship between pre-equating and post-equating of C4 and D4 test forms

Figure 5: TCC of test forms C1 and D1
Figure 6: TCC of test forms ...

Table 1: Comparisons between pre-equated and post-administration item parameter estimates of Use of English

Table 2: Correlation of pre-equating and post-equating item parameters
* Correlation is significant at the 0.05 level (2-tailed).

Table 3: Classification frequency for aggregate pass rate, C-pass and D-pass rates for the UTME UOE

The means and standard deviations of the equated scores from the different equating methods are shown in Table 4. From the table, it can be seen that the item parameters of the test forms from pre-equating and post-equating consistently yielded almost the same values, except for test form C1, representing pre-equating, and the corresponding D2 for post-equating, which has slightly higher means and SDs.

Table 4: Means and standard deviations of the equated scores from different equating methods

Table 5: Indices used to evaluate the equating results with IRT pre-equating as the baseline