Introduction to PISA data analytics, the sample weights and replicate weights
In the previous article in this series we saw an introduction to PISA data analytics, with examples of functions in R code for sampling, and we talked about the sampling weights, which ponder each student so that it represents a group of individuals with the same characteristics rather than a single student, (remember that PISA aims to assess the effect of educational policies on the whole population of the country, not on individual students). In this article, we will see how to use these weights to calculate estimators from samples and we will see also how to calculate standard errors of these estimators using replicated weights.
In this link you have a more detailed introduction to the PISA program, and in this other you can download the PISA data analysis manual, where you will find a more rigorous and comprehensive exposition of the topics commented here.
Here you can download sample R code to perform calculations with replicate weights, these functions are what I will use in this article. I will also use two data sets as example, here you can download the data sample for PISA Germany 2003 , and in this another one you can download sample data of PISA Spain, Peru and Finland 2012.
Germany 2003 data set
This dataset contains data for all students who took the exam in Germany in the year 2003. First, we will load the data into a dataframe and examine which columns it has:
datadeu<-read.csv("datadeu-2003.csv",sep=";")
names(datadeu)
[1] "YEAR" "STUDENTID" "SCHOOLID" "COUNTRY_NAME" "SUBNATIO_NAME" "STRATUM_NAME"
[7] "CULTPOSS" "HISCED" "HISEI" "ST01Q01" "ST03Q01" "PARED"
[13] "W_FSTR1" "W_FSTR2" "W_FSTR3" "W_FSTR4" "W_FSTR5" "W_FSTR6"
[19] "W_FSTR7" "W_FSTR8" "W_FSTR9" "W_FSTR10""W_FSTR11""W_FSTR12"
[25] "W_FSTR13""W_FSTR14""W_FSTR15""W_FSTR16""W_FSTR17""W_FSTR18"
[31] "W_FSTR19""W_FSTR20""W_FSTR21""W_FSTR22""W_FSTR23""W_FSTR24"
[37] "W_FSTR25""W_FSTR26""W_FSTR27""W_FSTR28""W_FSTR29""W_FSTR30"
[43] "W_FSTR31""W_FSTR32""W_FSTR33""W_FSTR34""W_FSTR35""W_FSTR36"
[49] "W_FSTR37""W_FSTR38""W_FSTR39""W_FSTR40""W_FSTR41""W_FSTR42"
[55] "W_FSTR43""W_FSTR44""W_FSTR45""W_FSTR46""W_FSTR47""W_FSTR48"
[61] "W_FSTR49""W_FSTR50""W_FSTR51""W_FSTR52""W_FSTR53""W_FSTR54"
[67] "W_FSTR55""W_FSTR56""W_FSTR57""W_FSTR58""W_FSTR59""W_FSTR60"
[73] "W_FSTR61""W_FSTR62""W_FSTR63""W_FSTR64""W_FSTR65""W_FSTR66"
[79] "W_FSTR67""W_FSTR68""W_FSTR69""W_FSTR70""W_FSTR71""W_FSTR72"
[85] "W_FSTR73""W_FSTR74""W_FSTR75""W_FSTR76""W_FSTR77""W_FSTR78"
[91] "W_FSTR79""W_FSTR80""W_FSTUWT""PV1MATH" "PV2MATH" "PV3MATH"
[97] "PV4MATH" "PV5MATH" "PV1READ" "PV2READ" "PV3READ" "PV4READ"
[103] "PV5READ" "SCWEIGHT""BFMJ_2003" "BMMJ_2003" "BSMJ_2003"
The first six columns simply identify the student and I won't use them. The data used in the examples are in the following columns:
- HISEI: It is an index of socioeconomic status based on international ISEI index. They take the highest of the indices that correspond to the parents according to their occupation. It is a continuous variable.
- ST01Q01: It is the degree in which the student was at the time of examination. The exam are not performed to students in the same degree, but to students who are the same age, so it is possible to find, depending on certain parameters such as birth month, students from different degrees.
- ST03Q01: Here is the student gender, is a factor with two levels, 1 is for male and 2 is for female.
- W_FSTUWT: In this column we have the weight that corresponds to the student.
- W_FSTR1 to W_FSTR80: These columns are for the 80 replicate weights, we will see them later.
On the PISA official website of the OECD you can find documentation on all the exam questions, and an extensive technical report for each one of the years in which the test was performed.
Spain, Peru and Finland 2012 data set
This other dataset contains data from three different countries, Finland, Peru and Spain, and I will use it to expose examples of work on samples with several countries. As before, we load it and inspect their columns:
dataepf<-read.csv("dataesp-fin-per-2012.csv",sep=";")
names(dataepf)
[1] "YEAR" "STUDENTID" "SCHOOLID" "COUNTRY_NAME" "SUBNATIO_NAME" "STRATUM_NAME"
[7] "CULTPOSS" "REPEAT_2012" "FAMSTRUC2" "ST35Q04_2012" "ST35Q05_2012" "ST35Q06_2012"
[13] "HISEI" "TIMEINT" "ST03Q01" "BELONG" "W_FSTR1" "W_FSTR2"
[19] "W_FSTR3" "W_FSTR4" "W_FSTR5" "W_FSTR6" "W_FSTR7" "W_FSTR8"
[25] "W_FSTR9" "W_FSTR10" "W_FSTR11" "W_FSTR12" "W_FSTR13" "W_FSTR14"
[31] "W_FSTR15" "W_FSTR16" "W_FSTR17" "W_FSTR18" "W_FSTR19" "W_FSTR20"
[37] "W_FSTR21" "W_FSTR22" "W_FSTR23" "W_FSTR24" "W_FSTR25" "W_FSTR26"
[43] "W_FSTR27" "W_FSTR28" "W_FSTR29" "W_FSTR30" "W_FSTR31" "W_FSTR32"
[49] "W_FSTR33" "W_FSTR34" "W_FSTR35" "W_FSTR36" "W_FSTR37" "W_FSTR38"
[55] "W_FSTR39" "W_FSTR40" "W_FSTR41" "W_FSTR42" "W_FSTR43" "W_FSTR44"
[61] "W_FSTR45" "W_FSTR46" "W_FSTR47" "W_FSTR48" "W_FSTR49" "W_FSTR50"
[67] "W_FSTR51" "W_FSTR52" "W_FSTR53" "W_FSTR54" "W_FSTR55" "W_FSTR56"
[73] "W_FSTR57" "W_FSTR58" "W_FSTR59" "W_FSTR60" "W_FSTR61" "W_FSTR62"
[79] "W_FSTR63" "W_FSTR64" "W_FSTR65" "W_FSTR66" "W_FSTR67" "W_FSTR68"
[85] "W_FSTR69" "W_FSTR70" "W_FSTR71" "W_FSTR72" "W_FSTR73" "W_FSTR74"
[91] "W_FSTR75" "W_FSTR76" "W_FSTR77" "W_FSTR78" "W_FSTR79" "W_FSTR80"
[97] "W_FSTUWT" "PV1MATH" "PV2MATH" "PV3MATH" "PV4MATH" "PV5MATH"
[103] "PV1READ" "PV2READ" "PV3READ" "PV4READ" "PV5READ" "PV1SCIE"
[109] "PV2SCIE" "PV3SCIE" "PV4SCIE" "PV5SCIE" "SCWEIGHT"
In this data set we will use the same columns as above, but we will besides stratify by countries by means of the COUNTRY_NAME column. The gender in this data set is coded as 0 (male) and 1 (female).
The student weight
The W_FSTUWT column contains the weight value of each student. The result of the sum of weights on the full sample should give us the total number of 15 years old students in the concerned country. For example, in Germany 2003:
sum(datadeu[,"W_FSTUWT"])
884357.7
Although we actually have little more than 4000 records:
nrow(datadeu)
4660
The correct way to compute statistics, such as mean, is weighing the values of the columns involved in calculating with the student weight, and to use the total sum of weights as sample size. For this reason, when we take a subsample, we must correct the values in the columns containing the weights so that their sum remains equal to the total population. For example, we extract a subsample of 1,000 students in Germany 2003 (here you can download the PISA data sampling source code examples):
source("pisa-sampling-code.R")
deusamp<-wght_sample(datadeu,c(9,10,11,107,13:93),1000,13:93)
The second parameter is for the indexes of the columns we want to obtain, the third the maximum number of records to extract, and the last one indicates the indexes of the columns containing weights. We discard the rows with missing data, since, for simplicity, the examples are coded to work only with samples that hasn't missing values.
If we now calculate the sum of the weights in the subsample, we will see that the value is still about the total population:
sum(deusamp[,"W_FSTUWT"])
868076
As we have made a selection of approximately about one in four records of the entire sample, the weights in the subsample have been multiplied approximately by 4.
The sampling variance and the replicate weights
We have viewed that, performing the sum of weights on the full sample and in the subsample, the result is not exactly the same. If we take any other random subsample, the same thing will happen, since the data will be slightly different, and this affects the calculation of any statistic.
The sampling variance reflects precisely the variability of the values of a statistical among several different random samples taken from the same population, and its square root is the typical or standard error.
In PISA, the sampling is carried out in two stages, first they select the schools, and then, students within each school are selected. This makes that the differences between students selected within each center can differ of that with students from other schools, so they can’t be considered independent. The consequence of this is that the typical error calculated for the statistic will be higher than this for a random sample of students performed in a single stage (Randomly selecting students from the entire population).
Depending of the educational policies and, in general, of the composition of society in each of the different countries participating in the study, we will find that in certain countries, such as those in northern Europe, the variance within schools is higher than the variance between schools, indicating a more egalitarian system, while in other countries the opposite happens.
In general, due to the wide range of differences between educational systems, perform a calculation using some sort of formula for the sampling variance requires a lot of very complex procedures, so they have opted for a simpler method for such estimation, using replicate weights. Specifically they were chosen the Fay's version of balanced replicate weights (Fay BRR replicates), which it is an appropriate method for two stage samples.
Replicate weights methods use a lot of different subsamples (replicate samples) to calculate the parameter of interest in each one of them and estimate the sampling variance from the variability of the parameter between the different samples and the calculation obtained from the whole sample.
The idea is to divide the schools in a series of strata where the centers with similar characteristics are grouped (eg rural and urban centers, and within these strata are subdivided by size). Then we select randomly two centers within each stratum, with which we construct replicate subsamples.
With the BRR method used in PISA, for each replicate subsample, a school is randomly selected from each stratum, and the weights of this center are adjusted by multiplying them by a factor that reduces it value, while in the same way we increase the weight of the other center in the stratum (By using a factor of 0.5, we reduce one and the other is increased 1.5 times).
In PISA, 80 of these replicated samples are generated and, therefore, 80 replicate weights, which are in W_FSTR1 to W_FSTR80 variables.
Thus, to calculate, for example, the standard error of the mean of the HISEI variable in the data set Germany 2003, the procedure is to calculate the mean using the student weight and the mean by using each of the 80 replicate weights. The standard error is the square root of the sum of the squared differences of each of these means with the mean obtained with the student weight divided by 20 (this value is obtained by multiplying 80 by the square of 1 minus the adjustment factor, which is 0.5. In the data analysis manual you have a very full presentation of all those procedures).
With the standard error you can calculate confidence intervals for the statistic.
Now let's see the example. We calculate the mean and standard deviation of the HISEI socioeconomic index in the subsample of 1,000 students obtained above:
source("pisa-brr-mean-sd.R")
wght_meansd(deusamp,"HISEI",85,5:84)
HISEI
MEAN 48.5000493
SE-MEAN 0.6071643
STDEV 14.8578949
SE-STDEV 0.2968571
Now let's get another subsample with the maximum number of cases that have a non-null value in the variable in question (we only select, in addition from the weights, the HISEI column and the ST03Q01 column, corresponding to the student gender):
deusamphisei<-wght_sample(datadeu,c(9,11,13:93),5000,13:93)
wght_meansd(deusamphisei,"HISEI",83,3:82)
HISEI
MEAN 48.3451345
SE-MEAN 0.4175300
STDEV 15.2116896
SE-STDEV 0.1711459
As you can see, the values for the mean and standard deviation are slightly different in the two samples, and the standard errors are lower in the sample with more cases. To calculate the 95% confidence interval of the average in each one of the two cases, we just have to multiply the standard error by 1.96, and then subtract and add the calculated value to the mean. In the first case, we obtain an interval between 47.3 and 49.7, while in the second we have a more accurate value, between 47.5 and 49.2.
With another function of the examples, we can calculate the same statistics, but grouping the results by gender, for example:
wght_meansdfact(deusamphisei,1,2,83,3:82)
ST03Q01-1 ST03Q01-2
HISEI-MEAN 48.1962231 48.4997784
HISEI-MEAN-SE 0.5165389 0.4982433
HISEI-STDEV 15.0769228 15.3488631
HISEI-STDEV-SE 0.2383348 0.2171738
Where 1 is for male and 2 for female.
If a sample contains more than one country, the calculations for each country must be made separately, since sampling is independent between countries, should never mix data from samples from several countries in the same calculation.
Reviewing the full sample of Finland, Peru and Spain 2012, you can view that the number of cases is very different among the three countries:
summary(dataepf[,"COUNTRY_NAME"])
Finland Peru Spain
8829 6035 25313
So we get a subsample of 6000 cases for each country:
epfsample<-wght_multiple_sample(dataepf,4,c(13,15,17:97),6000,17:97)
summary(epfsample[,1])
Finland Peru Spain
6000 5865 6000
And again we calculate the average for each country:
wght_meansdbycnt(epfsample,2,1,84,4:83)
$Finland
HISEI
MEAN 55.6057865
SE-MEAN 0.4423959
STDEV 20.3976583
SE-STDEV 0.1703312
$Peru
HISEI
MEAN 35.0376205
SE-MEAN 0.8072296
STDEV 22.0254373
SE-STDEV 0.3894301
$Spain
HISEI
MEAN 47.2766651
SE-MEAN 0.6689927
STDEV 21.6684690
SE-STDEV 0.2214915
Standard errors in the calculation of other types of estimators
Replicated weights should be used to calculate standard errors when computing any other estimator, such as percentages, mean differences, regression coefficients, correlation coefficients, etc. The procedure is always the same, calculate the estimator using the student weight, and then the standard error is obtained by means of the differences with the value calculated using each of the 80 replicate weights.
To not to overextend this article, we will see only a few examples applied to the calculation of percentages and mean differences.
To calculate percentages, rather than counting cases what we do is to sum the weights. We begin by making a selection of data with columns of factor data type:
deusamp<-wght_sample(datadeu,c(10,11,13:93),5000,13:93)
The variables selected are ST01Q01, which is the degree at which the student is at the time of examination, and ST03Q01, which is the student gender.
We calculate the percentage for each of the levels of the ST01Q01 variable:
source("pisa-brr-ppc.R")
wght_ppc(deusamp,2,84,4:83)
ST01Q01-7 ST01Q01-8 ST01Q01-9 ST01Q01-10 ST01Q01-11
PPC 0.012842361 0.131188804 0.607029322 0.247570089 0.0013694241
SE 0.001709619 0.007742315 0.007703966 0.006291796 0.0005622651
We can also calculate the percentages crossing the gender (ST03Q01) with the grade (ST01Q01):
wght_ppccrossed(deusamp,3,2,84,4:83)
$PPC
ST01Q01-7 ST01Q01-8 ST01Q01-9 ST01Q01-10 ST01Q01-11
ST03Q01-1 0.004358764 0.05979673 0.3073397 0.1414563 0.0009367937
ST03Q01-2 0.008483597 0.07139207 0.2996896 0.1061138 0.0004326304
$SE
ST01Q01-7 ST01Q01-8 ST01Q01-9 ST01Q01-10 ST01Q01-11
ST03Q01-1 0.001231345 0.004413145 0.007526257 0.006872076 0.0004729134
ST03Q01-2 0.001443247 0.005295298 0.008157956 0.005292804 0.0003061333
If what we have is a sample with several countries, we can also calculate the proportions of one or more factors using the same procedure, considering that each country should be treated as an sample independent of the other countries.
epfsample<-wght_multiple_sample(dataepf,4,c(8,9,15,17:97),3000,17:97)
wght_ppc_bycnt(epfsample,1,c(2,3),85,5:84)
$PPC
REPEAT_2012-0 REPEAT_2012-1 FAMSTRUC2-OT FAMSTRUC2-SP FAMSTRUC2-TP
Finland 0.9612392 0.03876079 0.008453862 0.16754413 0.8240020
Peru 0.7419961 0.25800394 0.035015059 0.17617838 0.7888066
Spain 0.6931980 0.30680197 0.007125347 0.09000235 0.9028723
$SE
REPEAT_2012-0 REPEAT_2012-1 FAMSTRUC2-OT FAMSTRUC2-SP FAMSTRUC2-TP
Finland 0.007014295 0.007014295 0.003348302 0.007966431 0.008795873
Peru 0.013091412 0.013091412 0.003424426 0.007678364 0.008152635
Spain 0.016219592 0.016219592 0.002599557 0.006451242 0.007152187
Here we have two different factors, REPEAT_2012, which is 0 for no repeaters and 1 for repeaters, and FAMSTRUCT, family structure, which value is TP for two-parent families, SP for families with only one of them, and OT for other types of households. The percentages are calculated separately for each factor.
We can check if a percentage is greater than zero, say that with a significance level of 95%, simply by dividing the value by its standard error. If the result is greater than 1.96, we can say that this percentage is greater than zero, at least with 95% confidence.
Finally we will view at two examples of mean differences and calculating their standard errors. In the first example, the calculation is performed with data from Germany 2013, we calculate the mean difference in the socioeconomic index HISEI between the genders. The standard error in this case is calculated in the same way, first we calculate the mean difference using the student weight, and then using the replicate weights and calculating the 80 differences.
source("pisa-brr-differences.R")
deusamp<-wght_sample(datadeu,c(9,11,13:93),5000,13:93)
wght_meandiff(deusamp,1,2,83,3:82)
ST03Q01-1-2
HISEI-MEANDIFF -0.3035552
HISEI-MEANDIFF-SE 0.5798229
Again we can check if this value is significantly greater than zero dividing it by its standard error. In this case the absolute value is less than 1.96, so we can say with 95% confidence that there is no difference in the socio-economic index HISEI of the parents of male and female students.
In the case of the mean differences between countries, the standard error is no longer calculated in the same way, because they are independent samples. In this case, the standard error is the square root of the sum of the squared standard errors of each country.
epfsample<-wght_multiple_sample(dataepf,4,c(13,15,17:97),3000,17:97)
wght_meandiffbycnt1(epfsample,2,1,84,4:83)
Finland-Peru Finland-Spain Peru-Spain
HISEI-MEANDIFF 20.668446 9.105845 -11.562602
HISEI-MEANDIFF-SE 1.034906 0.918880 1.125643