PISA data analytics. Data sampling

In this post you will find examples of R code for sampling data from the PISA database. In these examples, the student, school, or parent weights are corrected according to the number of records selected for the sample. There are also examples of stratified sampling based on the values of a particular column in the data set.

The samples are made from a dataframe containing the complete data set, which is constructed from the data extracted by the WinPODUtil application.

This post is related to the article Introduction to PISA data analytics.

The R code examples can be downloaded from this link. They consist of three small functions: one for ordinary sampling and two for stratified sampling based on one of the columns of the data set, which will generally be the country to which the students belong.

The Windows version of R can be downloaded from this link.

Example 1, make a simple sampling and correct the weights

In this example we obtain a sample of n rows, keeping only some columns of the entire dataset, and correct the weights by scaling them in proportion to the size of the subsample. The idea is that the sum of the weights should always give the approximate size of the total population from which the sample comes, so each weight is multiplied by the number of rows in the entire dataset and then divided by the number of rows in the subsample.
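The scaling rule can be checked on toy data. The following is only a sketch: the data frame and its weight column W are invented for illustration and are not real PISA fields.

```r
# Toy data: 1000 rows with a hypothetical weight column W (not a real PISA field)
set.seed(1)
full <- data.frame(id = 1:1000, W = runif(1000, min = 5, max = 15))

# Draw a subsample of 100 rows
sub <- full[sample(nrow(full), 100), ]

# Scale each weight by (rows in full data) / (rows in subsample)
sub$W <- sub$W * nrow(full) / nrow(sub)

# The two sums should be close, but not identical
pop_full <- sum(full$W)
pop_sub  <- sum(sub$W)
print(c(full = pop_full, subsample = pop_sub))
```

The agreement is only approximate because the subsample is random; larger subsamples generally give a closer match.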

The name of the function is wght_sample, and this is the code:

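wght_sample <- function(data, fields, nmax, wghts, naomit = TRUE) {
  # Keep only the requested columns (drop = FALSE keeps a data frame even for one column)
  rdata <- data[, fields, drop = FALSE]
  # Optionally discard rows with missing values in those columns
  if (naomit) rdata <- na.omit(rdata)
  # The sample cannot be larger than the number of available rows
  nsamp <- min(nmax, nrow(rdata))
  rdata <- rdata[sample(nrow(rdata), nsamp), , drop = FALSE]
  if (!is.null(wghts) && nsamp > 0) {
    # Weight columns given as indexes refer to positions in data, so map them to names
    if (is.numeric(wghts)) wghts <- names(data)[wghts]
    # Scale the weights by (rows in data) / (rows in sample)
    rdata[, wghts] <- rdata[, wghts] * nrow(data) / nrow(rdata)
  }
  return(rdata)
}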

The data parameter is the dataframe with the entire dataset. In the fields parameter, pass a vector of the columns you want to keep in the final sample, either as column names or as column indexes. The nmax parameter gives the maximum size you want for the sample, and wghts is the vector of columns containing weights, which are corrected in the final sample. This vector can also be given either as names or as indexes (positions in the complete dataframe), and the weight columns must be among the columns selected in fields.
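As a small illustration of why weight indexes are translated into names before use, the sketch below selects the same columns by name and by index. The column names CNT, SCORE and W are hypothetical.

```r
# Toy data frame; column names are hypothetical, not real PISA fields
df <- data.frame(CNT = c("ESP", "FRA"), SCORE = c(480, 495), W = c(12.5, 9.0))

# Selecting by name or by index gives the same result
by_name  <- df[, c("SCORE", "W")]
by_index <- df[, c(2, 3)]
same <- identical(by_name, by_index)

# W is column 3 of df but column 2 of the subset, so an index into df
# has to be converted to a name before it is applied to the subset
w_index <- 3
w_name  <- names(df)[w_index]
position_in_subset <- match(w_name, names(by_index))
print(c(name = w_name, position = position_in_subset))
```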

In the naomit parameter you can indicate if you want to exclude the rows with missing values. By default, these rows are not included in the final sample.
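The effect of excluding rows with missing values can be seen with base R's na.omit on a toy data frame. This is a sketch with invented column names, and it assumes, as in the function above, that only the selected columns are checked for missing values.

```r
# Toy data with missing values; column names are hypothetical
df <- data.frame(
  CNT   = c("ESP", "ESP", "FRA", "FRA"),
  SCORE = c(480, NA, 495, 510),
  W     = c(10, 12, NA, 9)
)

# Rows with a missing value in any selected column are dropped
complete <- na.omit(df[, c("SCORE", "W")])
n_complete <- nrow(complete)      # 2

# Only the selected columns count: a missing W is ignored when W is not selected
score_only <- na.omit(df[, "SCORE", drop = FALSE])
n_score_only <- nrow(score_only)  # 3
```

Selecting fewer columns can therefore leave more rows available for sampling.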

The result returned by the function will be a dataframe with the selected columns and the weights updated according to the size of the sample.

Example 2, stratified sampling by one of the data fields

In this second example we again sample with weight correction, but this time using a column of the dataframe as the index for stratified sampling. This column must be of factor datatype.

Normally this column will be the country, in datasets containing several countries, so that we obtain approximately the same number of records for each of the countries in the sample. Any other column can be used instead, provided it is of factor datatype.

The name of the function is wght_multiple_sample, and this is the code:
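If the stratification column was read as character data, it can be converted before sampling. The sketch below uses an invented CNT column; it also shows that unused levels survive subsetting, which would produce empty strata unless they are dropped.

```r
# Toy data; CNT is stored as character here (hypothetical column names)
df <- data.frame(CNT = c("ESP", "FRA", "ESP", "ITA"),
                 W = c(10, 9, 12, 11),
                 stringsAsFactors = FALSE)
was_factor <- is.factor(df$CNT)   # FALSE

# Convert the column to a factor; its levels define the strata
df$CNT <- factor(df$CNT)
lv <- levels(df$CNT)              # "ESP" "FRA" "ITA"

# After subsetting, unused levels remain until droplevels() removes them
sub <- df[df$CNT != "ITA", ]
lv_before <- levels(sub$CNT)
lv_after  <- levels(droplevels(sub)$CNT)
```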

wght_multiple_sample <- function(data, cnt, fields, nmax, wghts, naomit = TRUE) {
  # The strata are the levels of the (factor) stratification column
  lindex <- levels(data[, cnt])
  # The stratification column is always the first column of rdata
  rdata <- data[, c(cnt, fields), drop = FALSE]
  if (naomit) rdata <- na.omit(rdata)
  # Weight columns given as indexes refer to positions in data, so map them to names
  if (is.numeric(wghts)) wghts <- names(data)[wghts]
  # Start with an empty data frame that has the right columns
  rsamp <- rdata[0, , drop = FALSE]
  for (i in seq_along(lindex)) {
    # Rows of the complete dataset that belong to this stratum
    total <- sum(data[, cnt] == lindex[i], na.rm = TRUE)
    subs <- rdata[which(rdata[, 1] == lindex[i]), , drop = FALSE]
    nsamp <- min(nmax, nrow(subs))
    isamp <- subs[sample(nrow(subs), nsamp), , drop = FALSE]
    if (!is.null(wghts) && nsamp > 0) {
      # Scale the weights by (rows in stratum) / (rows sampled from it)
      isamp[, wghts] <- isamp[, wghts] * total / nrow(isamp)
    }
    rsamp <- rbind(rsamp, isamp)
  }
  return(rsamp)
}

This function has the same parameters as the previous one, with the same use and meaning, plus the new cnt parameter, where you pass the name or index of the column containing the values by which the sample is stratified. Because cnt and fields are combined into a single column selection, give both as names or both as indexes.

The result is a dataframe with the selected columns, plus the stratification column, with approximately equal numbers of records for each of the values in that column.
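The same per-stratum idea can also be written with base R's split and lapply. This is an alternative sketch on invented data (columns CNT and W, maximum size nmax), not the function from this post. It shows that the strata only reach equal sizes when each one has at least nmax rows.

```r
set.seed(42)
# Toy data: hypothetical columns CNT (factor) and W (weight)
df <- data.frame(
  CNT = factor(rep(c("ESP", "FRA", "ITA"), times = c(300, 50, 150))),
  W   = runif(500, min = 5, max = 15)
)
nmax <- 100

# Sample each stratum separately and rescale its weights to the stratum size
parts <- lapply(split(df, df$CNT), function(s) {
  n <- min(nmax, nrow(s))
  out <- s[sample(nrow(s), n), , drop = FALSE]
  out$W <- out$W * nrow(s) / n
  out
})
strat <- do.call(rbind, parts)

# FRA has only 50 rows, so it contributes 50 rather than 100
counts <- table(strat$CNT)
print(counts)
```

A stratum smaller than nmax is taken whole, so its weights are left effectively unchanged.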

Example 3, stratified sampling for only one of the values

In this last example, we have a function that performs the same stratified sampling as the previous example, but returns data for only one of the values in the stratification column. For example, it can be used to extract the data for a single country from a dataset with several countries.

The name of the function is wght_cnt_sample and this is the code:

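wght_cnt_sample <- function(data, cnt, cntval, fields, nmax, wghts, naomit = TRUE) {
  # The stratification column is always the first column of rdata
  rdata <- data[, c(cnt, fields), drop = FALSE]
  if (naomit) rdata <- na.omit(rdata)
  # Keep only the rows for the requested value, e.g. a single country
  subs <- rdata[which(rdata[, 1] == cntval), , drop = FALSE]
  # Rows of the complete dataset with that value, as in wght_multiple_sample
  total <- sum(data[, cnt] == cntval, na.rm = TRUE)
  nsamp <- min(nmax, nrow(subs))
  rsamp <- subs[sample(nrow(subs), nsamp), , drop = FALSE]
  if (!is.null(wghts) && nsamp > 0) {
    # Weight columns given as indexes refer to positions in data, so map them to names
    if (is.numeric(wghts)) wghts <- names(data)[wghts]
    # Scale the weights by (rows with that value) / (rows sampled)
    rsamp[, wghts] <- rsamp[, wghts] * total / nrow(rsamp)
  }
  return(rsamp)
}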

Again we have added a parameter, cntval, which is the value that the column given in the cnt parameter must have for a row to be selected for the sample.