Introduction to PISA Data Analytics, the Student Scores

In the previous article in this series, computing standard errors with replicate weights in PISA database, in this article we will take an overview of one of the most controversial points of these studies, the complex system of scores implemented.

We are all accustomed to the direct scores on the exams, whit each answer associated with a certain score, whose total sum corresponds to the final exam score, or that note is simply calculated as a percentage of correct answers. However, in PISA, a complex statistical model by which the results of the students are imputed is used, and notes no longer reflect simply the actual result obtained by each student in the tests.

As in previous articles, in this link you can download the PISA data analysis manual, with a deep development of all these themes.

The PISA exams consist of an extensive battery of questions, so extensive that it is impossible for a student to answer all of them, so they are distributed in a series of booklets with different combinations of questions that are randomly assigned to students. Since not all questions have the same difficulty, the first step is to establish such difficulty through a series of preliminary tests conducted with a selection of students from different participating countries. In this way, the booklets are developed so that all have the same average difficulty, in addition to contain some common questions that allow link the results using the same scale.

Since the variable we want to measure, the ability of students, can be considered a continuous variable unobservable to be deducted from the pattern of responses, it has chosen to use a variant of the Rasch model to link the pattern of responses to the student performance.

Basically, with the Rasch model we construct a continuous scale using logistic regression that expresses the probability of correctly answer a certain question depending on their difficulty and the supposed ability of the student. A student with a high skill will have a different probability distribution in a difficult question that a student with a lower capacity. A probabilistic function links the difficulties of different questions with the skills of pupils using the same scale.

The fundamental assumption for the validity of the Rasch model is the equal difficulty of the questions for all students, and this is one of the key criticisms directed against these tests, since the same question can have different difficulties in different countries. To avoid this, it has very careful in translating questions into the different languages and even it is removed some of them or new ones are included, depending on the characteristics of each participating country, in addition to preliminary studies in each country to ensure that this assumption is satisfied.

Thus, it is intended to reflect the variability inherent in all sampling so that they can get unbiased estimators and avoid underestimating or overestimating the results.

But the difficulty does not end here. Since the sample of students in each individual really represents a population, estimated by the weight assigned to each of them, the score does not correspond directly with the test result. Instead, a series of maximum likelihood estimators are used, obtained from the personal questionnaire responses of the student to obtain a posterior distribution that reflects the variability of the possible scores, from which are drawn at random five plausible values, that reflect the range of possible values obtained by this population. This distribution is constructed so that the scores have a mean of 500.

The goal, again, is to avoid underestimating or overestimating the results of the calculations in which are involved the results of the students.

The PISA tests are divided into three main sections, or domains, reading math and science. For each of these domains, there are five different plausible values. Moreover, each year the study focuses on one of these three areas, and also there are plausible values for different subdomains of the main domains.

One must be careful when comparing results between different years. In 2000, the test focuses on reading, in 2003 in mathematics, science in 2006, in 2009 again in reading and in 2012 again in mathematics. Since there have been some modifications to the calibration of the items only are directly comparable the scores from different years in the following cases:

Science: you can compare the years 2006, 2009 and 2012.
Math: you can compare 2003, 2006, 2009 and 2012.
Reading: you can compare all years.

So we have a continuous scale that links the difficulty of the questions with the student skills, and scores obtained with a technique of missing values imputation consistent in five different plausible values for each student, which are randomly obtained from a probability distribution. I do not have enough level to make a technical assessment of this system, But I understand that it may be at least surprising to the not experts accustomed to "traditional scores".

But, as our goal is not to criticize the PISA procedures, but simply perform statistical practices with a good set of data, let's move on to see a number of examples of how to work with plausible values when making estimates.

Working with the plausible values

In this link you code you can download R code to work with plausible values, with the sample functions that we will use.

We also will use two sets of data extracted from the PISA database, in this link you can download the data sample for PISA Germany 2003 , and in this other you can download sample data of PISA Spain, Peru and Finland 2012, Peru and Finland 2012, which we already used in the previous article in the series, about the replicate weights.

When we perform any calculations involving the plausible values, the procedure is always the same, we compute separately with each of the five values and, finally, we perform the mean of the five results, either regression coefficients, means, correlation coefficients, or any other estimator.

When calculating the sampling variance and standard errors, we proceed just as we viewed in the previous article, but this time the calculation is performed separately with the 80 replicate weights with each of the five plausible values. This forces us to repeat each calculation 405 times (fortunately, we have computers). Also, in this case we must consider the variance between the five plausible values, or imputation variance, which we must add to the sampling variance. This imputation variance is calculated from the squared difference of the value obtained with each of the plausible values and the mean value of the total, divided by four (the number of plausible values minus one) and multiplied by 1.2 (one plus one divided by five).

It is incorrect to first calculate the average of the five plausible values and perform the calculations with this result, since this leads to biased results. It is preferable to perform the calculations with only one of the plausible values, what results in unbiased results, although is not possible to calculate the imputation variance.

To not to overextend, since the data analysis PISA manual contains a great development of all these points, we go directly to view how to use the different functions of the examples.

Calculation of mean and standard deviation of the plausible values

Let's start loading the data set Germany 2003 in a dataframe:

datadeu<-read.csv(“datadeu-2003.csv”,sep=”;”) names(datadeu) [1] "YEAR" "STUDENTID" "SCHOOLID" "COUNTRY_NAME" "SUBNATIO_NAME" "STRATUM_NAME" [7] "CULTPOSS" "HISCED" "HISEI" "ST01Q01" "ST03Q01" "PARED" [13] "W_FSTR1" "W_FSTR2" "W_FSTR3" "W_FSTR4" "W_FSTR5" "W_FSTR6" [19] "W_FSTR7" "W_FSTR8" "W_FSTR9" "W_FSTR10""W_FSTR11""W_FSTR12" [25] "W_FSTR13""W_FSTR14""W_FSTR15""W_FSTR16""W_FSTR17""W_FSTR18" [31] "W_FSTR19""W_FSTR20""W_FSTR21""W_FSTR22""W_FSTR23""W_FSTR24" [37] "W_FSTR25""W_FSTR26""W_FSTR27""W_FSTR28""W_FSTR29""W_FSTR30" [43] "W_FSTR31""W_FSTR32""W_FSTR33""W_FSTR34""W_FSTR35""W_FSTR36" [49] "W_FSTR37""W_FSTR38""W_FSTR39""W_FSTR40""W_FSTR41""W_FSTR42" [55] "W_FSTR43""W_FSTR44""W_FSTR45""W_FSTR46""W_FSTR47""W_FSTR48" [61] "W_FSTR49""W_FSTR50""W_FSTR51""W_FSTR52""W_FSTR53""W_FSTR54" [67] "W_FSTR55""W_FSTR56""W_FSTR57""W_FSTR58""W_FSTR59""W_FSTR60" [73] "W_FSTR61""W_FSTR62""W_FSTR63""W_FSTR64""W_FSTR65""W_FSTR66" [79] "W_FSTR67""W_FSTR68""W_FSTR69""W_FSTR70""W_FSTR71""W_FSTR72" [85] "W_FSTR73""W_FSTR74""W_FSTR75""W_FSTR76""W_FSTR77""W_FSTR78" [91] "W_FSTR79""W_FSTR80""W_FSTUWT""PV1MATH" "PV2MATH" "PV3MATH" [97] "PV4MATH" "PV5MATH" "PV1READ" "PV2READ" "PV3READ" "PV4READ" [103] "PV5READ" "SCWEIGHT""BFMJ_2003" "BMMJ_2003" "BSMJ_2003"

We have five plausible values in mathematics and reading in the columns PV1MATH to PV5MATH and PV1READ to PV5READ.

Now we load the code examples and calculate the mean and standard deviation in math using the wght_meansd_pv function:

source("pisa-pv.R") wght_meansd_pv(datadeu,94:98,93,13:92) MEAN SE-MEAN STDEV SE-STDEV 502.985532 3.317209 102.592793 1.765833

In the first parameter we pass the dataframe with the data, in the second the indices of the columns with the plausible values, the third is to indicate the column with the student weight and the last one for the indices of the columns with replicate weights.

We obtain a vector with four positions with the mean, mean standard error, the standard deviation and their standard error.

With these data, we can construct confidence intervals for the mean multiplying the standard error by 1.96 (95%) for example, and adding and subtracting the result from the mean or the standard deviation.

Now, suppose we want to get the mean separately by the gender of students. This data is in the dichotomous variable in the ST03Q01 column, 1 for the boys and 2 for the girls. First we obtain a subsample eliminating the missing values. To do this, on this link you can download the PISA data sampling source code examples.

source(“pisa-sampling-code.R”) deusamp<-wght_sample(datadeu,c(9,11,13:98),5000,13:93)

With this we have selected, in addition to the weights and plausible values in mathematics, only the ST03Q01 column with the gender of the students, and the HISEI column, an index of socioeconomic status that we will use later.

Now, we can calculate the mean and standard deviation grouped by gender with the wght_meansdfact_pv function:

wght_meansdfact_pv(deusamp,84:88,2,83,3:82) ST03Q01-1 ST03Q01-2 MEAN 506.243645 521.023972 SE-MEAN 3.813855 4.021791 STDEV 92.854737 95.907179 SE-STDEV 2.096294 2.478750

In the first parameter we pass again the dataframe, in the second the indices of the plausible values, in the third the index of the column (or columns) with which we want to group the results in the fourth we pass the index od the column with the student weight and the latter again is for the column indices of the replicate weights.

As a result we have a matrix with a column with the boy data, the first, and another one for the girls.

Mean differences

Let's view now the two functions from the examples that can be used to calculate mean differences. For these examples we will work with a sample containing several countries, Finland, Spain and Peru, in 2012:

data<-read.csv("dataesp-fin-per-2012.csv",sep=";") names(data) [1] "YEAR" "STUDENTID" "SCHOOLID" "COUNTRY_NAME" "SUBNATIO_NAME" "STRATUM_NAME" [7] "CULTPOSS" "REPEAT_2012" "FAMSTRUC2" "ST35Q04_2012" "ST35Q05_2012" "ST35Q06_2012" [13] "HISEI" "TIMEINT" "ST03Q01" "BELONG" "W_FSTR1" "W_FSTR2" [19] "W_FSTR3" "W_FSTR4" "W_FSTR5" "W_FSTR6" "W_FSTR7" "W_FSTR8" [25] "W_FSTR9" "W_FSTR10" "W_FSTR11" "W_FSTR12" "W_FSTR13" "W_FSTR14" [31] "W_FSTR15" "W_FSTR16" "W_FSTR17" "W_FSTR18" "W_FSTR19" "W_FSTR20" [37] "W_FSTR21" "W_FSTR22" "W_FSTR23" "W_FSTR24" "W_FSTR25" "W_FSTR26" [43] "W_FSTR27" "W_FSTR28" "W_FSTR29" "W_FSTR30" "W_FSTR31" "W_FSTR32" [49] "W_FSTR33" "W_FSTR34" "W_FSTR35" "W_FSTR36" "W_FSTR37" "W_FSTR38" [55] "W_FSTR39" "W_FSTR40" "W_FSTR41" "W_FSTR42" "W_FSTR43" "W_FSTR44" [61] "W_FSTR45" "W_FSTR46" "W_FSTR47" "W_FSTR48" "W_FSTR49" "W_FSTR50" [67] "W_FSTR51" "W_FSTR52" "W_FSTR53" "W_FSTR54" "W_FSTR55" "W_FSTR56" [73] "W_FSTR57" "W_FSTR58" "W_FSTR59" "W_FSTR60" "W_FSTR61" "W_FSTR62" [79] "W_FSTR63" "W_FSTR64" "W_FSTR65" "W_FSTR66" "W_FSTR67" "W_FSTR68" [85] "W_FSTR69" "W_FSTR70" "W_FSTR71" "W_FSTR72" "W_FSTR73" "W_FSTR74" [91] "W_FSTR75" "W_FSTR76" "W_FSTR77" "W_FSTR78" "W_FSTR79" "W_FSTR80" [97] "W_FSTUWT" "PV1MATH" "PV2MATH" "PV3MATH" "PV4MATH" "PV5MATH" [103] "PV1READ" "PV2READ" "PV3READ" "PV4READ" "PV5READ" "PV1SCIE" [109] "PV2SCIE" "PV3SCIE" "PV4SCIE" "PV5SCIE" "SCWEIGHT"

We get a sample without missing values with the HISEI and ST03Q01 columns:

datasamp<-wght_multiple_sample(data,4,c(13,15,17:102),10000,17:97)

And we apply the wght_meandiffcnt_pv function to obtain the mean difference between the three countries:

wght_meandiffcnt_pv(datasamp,85:89,1,84,4:83) Finland-Peru Finland-Spain Peru-Spain MEANDIFF 152.16477 37.419909 -114.74486 SE 3.80965 2.747142 4.01396

In the first parameter we pass the dataframe with the data, in the second the indices of the columns with the plausible values in mathematics, the third is for the index of the column with the country, the fourth is for the index of the column with the student weight and the last one is for the indices of the columns with replicate weights.

We obtain a matrix with the mean difference and its standard error, with a column for each combination of two countries. We see that all the differences are significant simply dividing the mean value by its standard error and checking that the result is higher than 1.96. Keep in mind that between countries, being independent samples, the standard error is equal to the square root of the sum of the squared standard errors of the mean in each country.

If we want to get the mean difference grouped by a series of factors, we can use the function wght_meandifffactcnt_pv:

wght_meandifffactcnt_pv(datasamp,85:89,1,3,84,4:83) $Finland ST03Q01-0-1 MEANDIFF 0.2220833 SE 2.8871318 $Peru ST03Q01-0-1 MEANDIFF -18.963934 SE 3.951501 $Spain ST03Q01-0-1 MEANDIFF -14.467193 SE 3.326229 $BTWNCNT , , Finland-Peru ST03Q01-0-1 MEANDIFF 19.186017 SE 4.893863 , , Finland-Spain ST03Q01-0-1 MEANDIFF 14.689276 SE 4.404467 , , Peru-Spain ST03Q01-0-1 MEANDIFF -4.496741 SE 5.165091

We have added a new parameter, the fourth, to indicate the indexes of the columns with the factors for grouping.

This time, we get a list of matrices. The first positions are for the mean differences within each country between the boys and the girls. We can check that in Finland the mean difference is not significant, whereas it is significant in Peru and Spain.

The last item in the list contains a three-dimensional array, one of which dimensions is to indicate the couple of countries between which is estimated the mean difference and in the other two dimensions is a matrix similar to those with the differences within each country, but instead contains the difference between mean differences within each country. Thus we check that, between Finland and Peru, there is a difference of 19 points between the mean differences between boys and girls.

Calculating regression coefficients

In this last example, we will calculate regression coefficients using plausible values as dependent variables and the socioeconomic index HISEI and the student gender in ST03Q01 columns as independent variables. We will use the Germany 2003 dataset.

First we recode the gender variable so that it is 0 for girls and 1 for the boys:

deusamp[deusamp$ST03Q01 == 2,"ST03Q01"]<-0

And then we use the wght_lmpv function:

wght_lmpv(deusamp,"HISEI+ST03Q01",84:88,83,3:82) $PV1MATH Call: lm(formula = frmlpv, data = sdata, weights = sdata[, wght]) Coefficients: (Intercept) HISEI ST03Q01 407.216 2.344 -13.510 $PV2MATH Call: lm(formula = frmlpv, data = sdata, weights = sdata[, wght]) Coefficients: (Intercept) HISEI ST03Q01 409.167 2.313 -14.543 $PV3MATH Call: lm(formula = frmlpv, data = sdata, weights = sdata[, wght]) Coefficients: (Intercept) HISEI ST03Q01 407.989 2.322 -13.936 $PV4MATH Call: lm(formula = frmlpv, data = sdata, weights = sdata[, wght]) Coefficients: (Intercept) HISEI ST03Q01 405.827 2.373 -13.372 $PV5MATH Call: lm(formula = frmlpv, data = sdata, weights = sdata[, wght]) Coefficients: (Intercept) HISEI ST03Q01 405.368 2.392 -14.976 $RESULT (Intercept) HISEI ST03Q01 R2 ADJ.R2 407.1134620 2.3486810 -14.0673728 0.1485837 0.1481672 $SE (Intercept) HISEI ST03Q01 R2 ADJ.R2 7.61739862 0.13066454 3.54528219 0.01360461 0.01361126

In the first parameter we pass the dataframe with the data, in the second the right side of the formula as a text string, the third is for the indices of the columns with plausible values, the fourth is for the index of the column with the student weight, and the last is for the indices of the columns with replicate weights.

We get a list with a position for the coefficients of the models of each of the separate plausible values, another one for the mean of the resulting coefficients (RESULT), together with R2 and the adjusted R2 and a final position (SE) for the standard errors.