sábado, 12 de marzo de 2016

# PISA Data Analytics, the plausible values

In this post you can download the R code samples to work with plausible values in the PISA database, to calculate averages, mean differences or linear regression of the scores of the students, using replicate weights to compute standard errors.

This post is related with the article calculations with plausible values in PISA database.

In this link you can download the R code for calculations with plausible values. These functions work with data frames with no rows with missing values, for simplicity. In what follows we will make a slight overview of each of these functions and their parameters and return values.

## Examples of calculating the mean and standard deviation with plausible values

In the script we have two functions to calculate the mean and standard deviation of the plausible values in a dataset, along with their standard errors, calculated through the replicate weights, as we saw in the article computing standard errors with replicate weights in PISA database.

### Example 1, calculating the mean and standard deviation

In this example, we calculate the value corresponding to the mean and standard deviation, along with their standard errors for a set of plausible values.

The function is wght_meansd_pv, and this is the code:

`wght_meansd_pv<-function(sdata,pv,wght,brr) { mmeans<-c(0, 0, 0, 0); mmeanspv<-rep(0,length(pv)); stdspv<-rep(0,length(pv)); mmeansbr<-rep(0,length(pv)); stdsbr<-rep(0,length(pv)); names(mmeans)<-c("MEAN","SE-MEAN","STDEV","SE-STDEV"); swght<-sum(sdata[,wght]); for (i in 1:length(pv)) { mmeanspv[i]<-sum(sdata[,wght]*sdata[,pv[i]])/swght; stdspv[i]<-sqrt((sum(sdata[,wght]*(sdata[,pv[i]]^2))/swght)- mmeanspv[i]^2); for (j in 1:length(brr)) { sbrr<-sum(sdata[,brr[j]]); mbrrj<-sum(sdata[,brr[j]]*sdata[,pv[i]])/sbrr; mmeansbr[i]<-mmeansbr[i] + (mbrrj - mmeanspv[i])^2; stdsbr[i]<-stdsbr[i] + (sqrt((sum(sdata[,brr[j]]*(sdata[,pv[i]]^2))/sbrr)-mbrrj^2) - stdspv[i])^2; } } mmeans<-sum(mmeanspv) / length(pv); mmeans<-sum((mmeansbr * 4) / length(brr)) / length(pv); mmeans<-sum(stdspv) / length(pv); mmeans<-sum((stdsbr * 4) / length(brr)) / length(pv); ivar <- c(0,0); for (i in 1:length(pv)) { ivar <- ivar + (mmeanspv[i] - mmeans)^2; ivar <- ivar + (stdspv[i] - mmeans)^2; } ivar = (1 + (1 / length(pv))) * (ivar / (length(pv) - 1)); mmeans<-sqrt(mmeans + ivar); mmeans<-sqrt(mmeans + ivar); return(mmeans);}`

To calculate the mean and standard deviation, we have to sum each of the five plausible values multiplied by the student weight, and, then, calculate the average of the partial results of each value. To calculate the standard error we use the replicate weights method, but we must add the imputation variance among the five plausible values, what we do with the variable ivar.

In the sdata parameter you have to pass the data frame with the data. The names or column indexes of the plausible values are passed on a vector in the pv parameter, while the wght parameter (index or column name with the student weight) and brr (vector with the index or column names of the replicate weights) are used as we have seen in previous articles.

As a result we obtain a vector with four positions, the first for the mean, the second for the mean standard error, the third for the standard deviation and the fourth for the standard error of the standard deviation.

### Example 2, compute the average and standard deviation grouping by the levels of a series of factors

In this example is performed the same calculation as in the example above, but this time grouping by the levels of one or more columns with factor data type, such as the gender of the student or the grade in which it was at the time of examination.

The function is wght_meansdfact_pv, and the code is as follows:

`wght_meansdfact_pv<-function(sdata,pv,cfact,wght,brr) { nc<-0; for (i in 1:length(cfact)) { nc <- nc + length(levels(as.factor(sdata[,cfact[i]]))); } mmeans<-matrix(ncol=nc,nrow=4); mmeans[,]<-0; cn<-c(); for (i in 1:length(cfact)) { for (j in 1:length(levels(as.factor(sdata[,cfact[i]])))) { cn<-c(cn, paste(names(sdata)[cfact[i]], levels(as.factor(sdata[,cfact[i]]))[j],sep="-")); } } colnames(mmeans)<-cn; rownames(mmeans)<-c("MEAN","SE-MEAN","STDEV","SE-STDEV"); ic<-1; for(f in 1:length(cfact)) { for (l in 1:length(levels(as.factor(sdata[,cfact[f]])))) { rfact<-sdata[,cfact[f]]==levels(as.factor(sdata[,cfact[f]]))[l]; swght<-sum(sdata[rfact,wght]); mmeanspv<-rep(0,length(pv)); stdspv<-rep(0,length(pv)); mmeansbr<-rep(0,length(pv)); stdsbr<-rep(0,length(pv)); for (i in 1:length(pv)) { mmeanspv[i]<-sum(sdata[rfact,wght]*sdata[rfact,pv[i]])/swght; stdspv[i]<-sqrt((sum(sdata[rfact,wght] * (sdata[rfact,pv[i]]^2))/swght)-mmeanspv[i]^2); for (j in 1:length(brr)) { sbrr<-sum(sdata[rfact,brr[j]]); mbrrj<-sum(sdata[rfact,brr[j]]*sdata[rfact,pv[i]])/sbrr; mmeansbr[i]<-mmeansbr[i] + (mbrrj - mmeanspv[i])^2; stdsbr[i]<-stdsbr[i] + (sqrt((sum(sdata[rfact,brr[j]] * (sdata[rfact,pv[i]]^2))/sbrr)-mbrrj^2) - stdspv[i])^2; } } mmeans[1, ic]<- sum(mmeanspv) / length(pv); mmeans[2, ic]<-sum((mmeansbr * 4) / length(brr)) / length(pv); mmeans[3, ic]<- sum(stdspv) / length(pv); mmeans[4, ic]<-sum((stdsbr * 4) / length(brr)) / length(pv); ivar <- c(sum((mmeanspv - mmeans[1, ic])^2), sum((stdspv - mmeans[3, ic])^2)); ivar = (1 + (1 / length(pv))) * (ivar / (length(pv) - 1)); mmeans[2, ic]<-sqrt(mmeans[2, ic] + ivar); mmeans[4, ic]<-sqrt(mmeans[4, ic] + ivar); ic<-ic + 1; } } return(mmeans);}`

In addition to the parameters of the function in the example above, with the same use and meaning, we have the cfact parameter, in which we must pass a vector with indices or column names of the factors with whose levels we want to group the data.

The result is returned in an array with four rows, the first for the means, the second for their standard errors, the third for the standard deviation and the fourth for the standard error of the standard deviation. In each column we have the corresponding value to each of the levels of each of the factors.

## Examples of calculation of mean differences

In the two examples that follow, we will view how to calculate mean differences of plausible values and their standard errors using replicate weights.

### Example 3, calculating the mean difference between countries

This function works on a data frame containing data of several countries, and calculates the mean difference between each pair of two countries.

The function is wght_meandiffcnt_pv, and the code is as follows:

`wght_meandiffcnt_pv<-function(sdata,pv,cnt,wght,brr) { nc<-0; for (j in 1:(length(levels(as.factor(sdata[,cnt])))-1)) { for(k in (j+1):length(levels(as.factor(sdata[,cnt])))) { nc <- nc + 1; } } mmeans<-matrix(ncol=nc,nrow=2); mmeans[,]<-0; cn<-c(); for (j in 1:(length(levels(as.factor(sdata[,cnt])))-1)) { for(k in (j+1):length(levels(as.factor(sdata[,cnt])))) { cn<-c(cn, paste(levels(as.factor(sdata[,cnt]))[j], levels(as.factor(sdata[,cnt]))[k],sep="-")); } } colnames(mmeans)<-cn; rn<-c("MEANDIFF", "SE"); rownames(mmeans)<-rn; ic<-1; for (l in 1:(length(levels(as.factor(sdata[,cnt])))-1)) { for(k in (l+1):length(levels(as.factor(sdata[,cnt])))) { rcnt1<-sdata[,cnt]==levels(as.factor(sdata[,cnt]))[l]; rcnt2<-sdata[,cnt]==levels(as.factor(sdata[,cnt]))[k]; swght1<-sum(sdata[rcnt1,wght]); swght2<-sum(sdata[rcnt2,wght]); mmeanspv<-rep(0,length(pv)); mmcnt1<-rep(0,length(pv)); mmcnt2<-rep(0,length(pv)); mmeansbr1<-rep(0,length(pv)); mmeansbr2<-rep(0,length(pv)); for (i in 1:length(pv)) { mmcnt1<-sum(sdata[rcnt1,wght]*sdata[rcnt1,pv[i]])/swght1; mmcnt2<-sum(sdata[rcnt2,wght]*sdata[rcnt2,pv[i]])/swght2; mmeanspv[i]<- mmcnt1 - mmcnt2; for (j in 1:length(brr)) { sbrr1<-sum(sdata[rcnt1,brr[j]]); sbrr2<-sum(sdata[rcnt2,brr[j]]); mmbrj1<-sum(sdata[rcnt1,brr[j]]*sdata[rcnt1,pv[i]])/sbrr1; mmbrj2<-sum(sdata[rcnt2,brr[j]]*sdata[rcnt2,pv[i]])/sbrr2; mmeansbr1[i]<-mmeansbr1[i] + (mmbrj1 - mmcnt1)^2; mmeansbr2[i]<-mmeansbr2[i] + (mmbrj2 - mmcnt2)^2; } } mmeans[1,ic]<-sum(mmeanspv) / length(pv); mmeansbr1<-sum((mmeansbr1 * 4) / length(brr)) / length(pv); mmeansbr2<-sum((mmeansbr2 * 4) / length(brr)) / length(pv); mmeans[2,ic]<-sqrt(mmeansbr1^2 + mmeansbr2^2); ivar <- 0; for (i in 1:length(pv)) { ivar <- ivar + (mmeanspv[i] - mmeans[1,ic])^2; } ivar = (1 + (1 / length(pv))) * (ivar / (length(pv) - 1)); mmeans[2,ic]<-sqrt(mmeans[2,ic] + ivar); ic<-ic + 1; } } return(mmeans);}`

Here the calculation of standard errors is different. You must calculate the standard error for each country separately, and then obtaining the square root of the sum of the two squares, because the data for each country are independent from the others.

Again, the parameters are the same as in previous functions. We have the new cnt parameter, in which you must pass the index or column name with the country.

The result is a matrix with two rows, the first with the differences and the second with their standard errors, and a column for the difference between each of the combinations of countries.

### Example 4, calculation of mean differences grouping by levels of a series of factors

With this function the data is grouped by the levels of a number of factors and wee compute the mean differences within each country, and the mean differences between countries.

The function is wght_meandifffactcnt_pv, and the code is as follows:

`wght_meandifffactcnt_pv<-function(sdata,pv,cnt,cfact,wght,brr) { lcntrs<-vector('list',1 + length(levels(as.factor(sdata[,cnt])))); for (p in 1:length(levels(as.factor(sdata[,cnt])))) { names(lcntrs)[p]<-levels(as.factor(sdata[,cnt]))[p]; } names(lcntrs)[1 + length(levels(as.factor(sdata[,cnt])))]<-"BTWNCNT"; nc<-0; for (i in 1:length(cfact)) { for (j in 1:(length(levels(as.factor(sdata[,cfact[i]])))-1)) { for(k in (j+1):length(levels(as.factor(sdata[,cfact[i]])))) { nc <- nc + 1; } } } cn<-c(); for (i in 1:length(cfact)) { for (j in 1:(length(levels(as.factor(sdata[,cfact[i]])))-1)) { for(k in (j+1):length(levels(as.factor(sdata[,cfact[i]])))) { cn<-c(cn, paste(names(sdata)[cfact[i]], levels(as.factor(sdata[,cfact[i]]))[j], levels(as.factor(sdata[,cfact[i]]))[k],sep="-")); } } } rn<-c("MEANDIFF", "SE"); for (p in 1:length(levels(as.factor(sdata[,cnt])))) { mmeans<-matrix(ncol=nc,nrow=2); mmeans[,]<-0; colnames(mmeans)<-cn; rownames(mmeans)<-rn; ic<-1; for(f in 1:length(cfact)) { for (l in 1:(length(levels(as.factor(sdata[,cfact[f]])))-1)) { for(k in (l+1):length(levels(as.factor(sdata[,cfact[f]])))) { rfact1<- (sdata[,cfact[f]] == levels(as.factor(sdata[,cfact[f]]))[l]) & (sdata[,cnt]==levels(as.factor(sdata[,cnt]))[p]); rfact2<- (sdata[,cfact[f]] == levels(as.factor(sdata[,cfact[f]]))[k]) & (sdata[,cnt]==levels(as.factor(sdata[,cnt]))[p]); swght1<-sum(sdata[rfact1,wght]); swght2<-sum(sdata[rfact2,wght]); mmeanspv<-rep(0,length(pv)); mmeansbr<-rep(0,length(pv)); for (i in 1:length(pv)) { mmeanspv[i]<-(sum(sdata[rfact1,wght] * sdata[rfact1,pv[i]])/swght1) - (sum(sdata[rfact2,wght] * sdata[rfact2,pv[i]])/swght2); for (j in 1:length(brr)) { sbrr1<-sum(sdata[rfact1,brr[j]]); sbrr2<-sum(sdata[rfact2,brr[j]]); mmbrj<-(sum(sdata[rfact1,brr[j]] * sdata[rfact1,pv[i]])/sbrr1) - (sum(sdata[rfact2,brr[j]] * sdata[rfact2,pv[i]])/sbrr2); mmeansbr[i]<-mmeansbr[i] + (mmbrj - mmeanspv[i])^2; } } mmeans[1,ic]<-sum(mmeanspv) / length(pv); mmeans[2,ic]<-sum((mmeansbr * 4) / length(brr)) / length(pv); ivar <- 0; for (i in 1:length(pv)) { ivar <- ivar + (mmeanspv[i] - mmeans[1,ic])^2; } ivar = (1 + (1 / length(pv))) * (ivar / (length(pv) - 1)); mmeans[2,ic]<-sqrt(mmeans[2,ic] + ivar); ic<-ic + 1; } } } lcntrs[[p]]<-mmeans; } pn<-c(); for (p in 1:(length(levels(as.factor(sdata[,cnt])))-1)) { for (p2 in (p + 1):length(levels(as.factor(sdata[,cnt])))) { pn<-c(pn, paste(levels(as.factor(sdata[,cnt]))[p], levels(as.factor(sdata[,cnt]))[p2],sep="-")); } } mbtwmeans<-array(0, c(length(rn), length(cn), length(pn))); nm <- vector('list',3); nm[]<-rn; nm[]<-cn; nm[]<-pn; dimnames(mbtwmeans)<-nm; pc<-1; for (p in 1:(length(levels(as.factor(sdata[,cnt])))-1)) { for (p2 in (p + 1):length(levels(as.factor(sdata[,cnt])))) { ic<-1; for(f in 1:length(cfact)) { for (l in 1:(length(levels(as.factor(sdata[,cfact[f]])))-1)) { for(k in (l+1):length(levels(as.factor(sdata[,cfact[f]])))) { mbtwmeans[1,ic,pc]<-lcntrs[[p]][1,ic] - lcntrs[[p2]][1,ic]; mbtwmeans[2,ic,pc]<-sqrt((lcntrs[[p]][2,ic]^2) + (lcntrs[[p2]][2,ic]^2)); ic<-ic + 1; } } } pc<-pc+1; } } lcntrs[[1 + length(levels(as.factor(sdata[,cnt])))]]<-mbtwmeans; return(lcntrs);}`

To the parameters of the function in the previous example, we added cfact, where we pass a vector with the indices or column names of the factors.

In this case, the data is returned in a list. For each country there is an element in the list containing a matrix with two rows, one for the differences and one for standard errors, and a column for each possible combination of two levels of each of the factors, from which the differences are calculated.

In the last item in the list, a three-dimensional array is returned, one dimension containing each combination of two countries, and the two other form a matrix with the same structure of rows and columns of those in each country position.

## Linear regressions with plausible values

In this last example, we will view a function to perform linear regressions in which the dependent variables are the plausible values, obtaining the regression coefficients and their standard errors.

### Example 5, linear regression using plausible values as dependent variables

The function is wght_lmpv, and this is the code:

`wght_lmpv<-function(sdata,frml,pv,wght,brr) { listlm <- vector('list', 2 + length(pv)); listbr <- vector('list', length(pv)); for (i in 1:length(pv)) { if (is.numeric(pv[i])) { names(listlm)[i] <- colnames(sdata)[pv[i]]; frmlpv <- as.formula(paste(colnames(sdata)[pv[i]],frml,sep="~")); } else { names(listlm)[i]<-pv[i]; frmlpv <- as.formula(paste(pv[i],frml,sep="~")); } listlm[[i]] <- lm(frmlpv, data=sdata, weights=sdata[,wght]); listbr[[i]] <- rep(0,2 + length(listlm[[i]]\$coefficients)); for (j in 1:length(brr)) { lmb <- lm(frmlpv, data=sdata, weights=sdata[,brr[j]]); listbr[[i]]<-listbr[[i]] + c((listlm[[i]]\$coefficients - lmb\$coefficients)^2,(summary(listlm[[i]])\$r.squared- summary(lmb)\$r.squared)^2,(summary(listlm[[i]])\$adj.r.squared- summary(lmb)\$adj.r.squared)^2); } listbr[[i]] <- (listbr[[i]] * 4) / length(brr); } cf <- c(listlm[]\$coefficients,0,0); names(cf)[length(cf)-1]<-"R2"; names(cf)[length(cf)]<-"ADJ.R2"; for (i in 1:length(cf)) { cf[i] <- 0; } for (i in 1:length(pv)) { cf<-(cf + c(listlm[[i]]\$coefficients, summary(listlm[[i]])\$r.squared, summary(listlm[[i]])\$adj.r.squared)); } names(listlm)[1 + length(pv)]<-"RESULT"; listlm[[1 + length(pv)]]<- cf / length(pv); names(listlm)[2 + length(pv)]<-"SE"; listlm[[2 + length(pv)]] <- rep(0, length(cf)); names(listlm[[2 + length(pv)]])<-names(cf); for (i in 1:length(pv)) { listlm[[2 + length(pv)]] <- listlm[[2 + length(pv)]] + listbr[[i]]; } ivar <- rep(0,length(cf)); for (i in 1:length(pv)) { ivar <- ivar + c((listlm[[i]]\$coefficients - listlm[[1 + length(pv)]][1:(length(cf)-2)])^2,(summary(listlm[[i]])\$r.squared - listlm[[1 + length(pv)]][length(cf)-1])^2, (summary(listlm[[i]])\$adj.r.squared - listlm[[1 + length(pv)]][length(cf)])^2); } ivar = (1 + (1 / length(pv))) * (ivar / (length(pv) - 1)); listlm[[2 + length(pv)]] <- sqrt((listlm[[2 + length(pv)]] / length(pv)) + ivar); return(listlm);}`

In this function, you must pass the right side of the formula as a string in the frml parameter, for example, if the independent variables are HISEI and ST03Q01, we will pass the text string "HISEI + ST03Q01". The function calculates a linear model with the lm function for each of the plausible values, and, from these, builds the final model and calculates standard errors.

As a result we obtain a list, with a position with the coefficients of each of the models of each plausible value, another with the coefficients of the final result, and another one with the standard errors corresponding to these coefficients.