viernes, 21 de octubre de 2016

# PISA Data Analytics, Correspondence Analysis

The PISA database contains, in addition to the scores of students, a lot of demographic, socioeconomic and cultural data about them, collected through a series of questionnaires, that allow contextualize the academic results and make studies with a great number of variables. Most of these data are categorical, making the correspondence analysis a particularly appropriate tool to work with them. In this article I will show you how to easily perform this analysis using the ca package of the R program.

The PISA data can be obtained from the PISA official website of the OECD. In this blog I posted a SQL Server version of the PISA database, which you can download and install locally. In the link above you can also access a tool to query the database and obtain samples.

In this other link you can download a file with sample data that I will use in this article. It contains data from three countries, Finland, one that gets better results, Spain, located in a middle level, and Peru, one of the last of the classification, for the year 2012.

## The correspondence analysis

Make a deep study on the fundamentals of correspondence analysis is beyond the scope of a post on a blog, so I recommend the book Correspondence Analysis in Practice, by Michael Greenacre, where you can find an excellent and simple exposition of everything about this analysis technique.

Basically, the correspondence analysis is a technique for optimal scaling of data, which allows graphical representation in the form of maps, usually two-dimensional, where you can see the relationships between the rows and columns of a contingency table.

Consider the rows and columns of the table as a series of n-dimensional vectors, being n the number of columns in a row, or rows in a column. In this way you can see a row or column as a point in an n-dimensional space, whose coordinates are the corresponding proportions of the crossed values of the factor, with a continuous value between 0 and 1. We call these vectors row or column profiles.

The marginal frequencies of the rows and columns form a new profile called average profile. There is an average column profile and an average row profile. The correspondence analysis can be undertaken with the rows or columns as a reference, obtaining equivalent results. It is a symmetric analysis tool.

If you take, for example, the rows as the reference, you can consider a region of a space of dimension m - 1, where m is the number of coordinates or columns of each row, delimited by the points where each of the rows represent the 100% of elements of the columns. This region is called simplex, and in a three dimensional space, for example, would be the plane through the points (1,0,0), (0,1,0) and (0,0,1).

Within this space, you can plot the points corresponding to each of the columns, using the proportions as coordinates corresponding to each of the rows. These points are all on the surface of the simplex.

It is evident that these points will be found closer to the vertices corresponding to the coordinates with higher relative frequencies of each column. In correspondence analysis, the relative weight of each of the coordinates of the profiles of each row or column is called mass. Each row or column also has a total row or column mass, corresponding to its marginal frequency.

The row and column with the marginal frequencies, or row profile and column profile can be considered as the average profile for all rows or columns, and are called centroids. From here, you can calculate a distance from each of the rows or columns to its centroid or row or column average, so that, if you expect all of them are similar to the average, you can quantify how much they deviate from this hypothesis and their statistical significance.

As distance measure, the chi-squared distance is used, calculated from the ratio of the difference between the observed and expected values squared and the expected value, where the observed value are the actual coordinates of the row or column and value expected is the average profile of rows or columns.

This statistic follows a χ2 distribution with (r-1)(c-1) degrees of freedom, where r is the number of rows and c the number of columns, so we can accept or reject the hypothesis of homogeneity between rows or columns easily.

The chi-squared distance gives us an idea of the degree of dispersion of the points on the graph map of the correspondence analysis. A measure derived from this statistic, called inertia, is calculated from the distance chi-squared and the total number of elements in the row, column or entire table, which gives us a measure of the variance independent of size.

As is normal that we use this technique to analyze relatively large data tables, where is not easy to see the relationships between the data, we will have high-dimensional profile spaces, while the graphic representation is performed in a two-dimensional space, three at most, so it will be necessary to project the points. To perform this operation, we use a technique called singular value decomposition, or SVD. Optimal scaling techniques are also used to locate the projected points so that the relations between them are shown as clearly as possible.

This way, we are working with categorical data as if it were continuous values.

## Correspondence analysis with R

Now I will show how to perform in practice this type of analysis with the R program and one of the packages dedicated to correspondence analysis, the ca package.

First, we load the sample data in csv format, in a dataframe, and the scripts with the data sampling functions:

`data<-read.csv("ca-filters.csv",sep=";")source("pisa-sampling-code.r")source("sample-ppc.r")`

Recall that in the PISA database, the data from a student does not represent an individual but a group of individuals, so we must use the weights assigned to each record to weight calculations.

The data columns are the following:

`names(data) "YEAR" "STUDENTID" "SCHOOLID" "COUNTRY_NAME"  "SUBNATIO_NAME" "STRATUM_NAME" "ST115Q01_2012" "ST26Q06_2012"  "ST26Q07_2012" "ST08Q09_2012" "W_FSTUWT"`

Of these, we are only interested in COUNTRY_NAME, which is the name of the country, and the last five, which contain data on the student and the weight.

There are data from three different countries, Finland, Spain and Peru, with different number of records:

`summary(data[,"COUNTRY_NAME"])Finland Peru Spain 8829 6035 25313`

First, we obtain a sample with an approximately equal number of records for all countries without missing values, with the data of country, weight and the ST08Q09_2012 column, which contains data on full days that the student has missed class:

`summary(data[,"ST08Q09_2012"]) 1-2 3-4 5- N NA's 6356 782 372 32239 428`

The data are divided into four categories, 1 or 2 days, 3 or 4 days, 5 or more days and N to indicate that the student has not missed any day. To get a sample without missing values we will use the following function:

`datas<-wght_multiple_sample(data,"COUNTRY_NAME", c("ST08Q09_2012","W_FSTUWT"),50000,"W_FSTUWT")`

Where the parameters are the column with the country, the additional columns you want to get, the maximum number of samples per country and the column containing the weights. We rename the columns for readability of the graph. This is the result:

`names(datas)<-c("COUNTRY","D","WEIGHT")summary(datas) PAIS D PESO Finland: 8645 1-2: 6356 Min. : 1.008 Peru : 5997 3-4: 782 1st Qu.: 3.325 Spain :25107 5- : 372 Median : 8.076 N :32239 Mean : 21.463 3rd Qu.: 33.012 Max. :297.305`

Now let's get a contingency table crossing the three countries with the number of days, weighted, that the students have missed a class:

`mtx<-wght_ppc_bycnt(datas,"COUNTRY","D","WEIGHT")mtx D-1-2 D-3-4 D-5- D-NFinland 0.006319715 0.0005741573 0.0004663525 0.06328732Peru 0.055626536 0.0107576788 0.0035847888 0.42217349Spain 0.105852223 0.0115135558 0.0051497598 0.31469442`

With this table, it is very easy to perform the correspondence analysis:

`require(ca)ca1<-ca(mtx)plot(ca1,map="colprincipal",mass=c(T,T))`

With the map parameter we have indicated that we want that the main coordinates be the columns, which are shown in reference to the vertices, where are the countries. The mass parameter is used to indicate that we want to draw the symbols that represent the various points with a size proportional to their mass.

In the graph we can see that not miss any full class day is the most common, in similar proportions in all three countries, though a bit more in Finland and Peru. Miss a few days, between 1 and 4, it is more common in Spain than in the other two countries, though missing 3 or 4 days has a similar frequency between Spain and Peru. We also see that missing 5 days or more is practically anecdotal, by the small size of the symbol, and the frequency is more or less similar in all three countries, as it is quite close to the center, although somewhat higher in Spain. We can confirm this by checking the contingency table.

If we examine the ca1 object containing the data of the correspondence analysis, we can see the following:

`Principal inertias (eigenvalues): 1 2 Value 0.033218 0.000423Percentage 98.74% 1.26% Rows: Finland Peru SpainMass 0.070648 0.492142 0.437210ChiDist 0.241314 0.149918 0.205517Inertia 0.004114 0.011061 0.018467Dim. 1 1.267201 0.819762 -1.127524Dim. 2 3.398378 -0.599934 0.126178 Columns: D-1-2 D-3-4 D-5- D-NMass 0.167798 0.022845 0.009201 0.800155ChiDist 0.391888 0.201091 0.247627 0.089320Inertia 0.025770 0.000924 0.000564 0.006384Dim. 1 -2.149648 -0.825093 -1.357733 0.489967Dim. 2 0.423008 -6.487395 0.443576 0.091414`

First we have the total inertia explained by each of the two dimensions. You can see that you can distribute the 100% of the inertia between the two dimensions.

Below are the values of the analysis of the rows (countries). In this case, the mass is proportional to the population of each country. The ChiDist row contains the chi-squared distance to the centroid or average profile of each country. We have another row with the inertia of each of the countries and, finally, the coordinates of each in the dimensions 1 and 2.

The same applies to the corresponding columns with the count of the days that students missed classes.

We can view another example with two other variables from the sample data. The ST26Q06_2012 column, which indicates whether the student has an internet connection, encoded as N (not) or Y (yes), and ST26Q07_2012, indicating whether there are books of classical literature at home, encoded in the same way. First, we perform a sampling without missing values:

`datas<-wght_multiple_sample(data,4,c(8,9,11),50000,11)`

This time we have made the sampling indicating the index of columns instead of their names. Now we build a new column from the INTERNET and LITERATURE ones with the four possible combinations:

`c<-paste(datas[,2],datas[,3],sep="-")datas<-cbind(datas,as.factor(c))names(datas)<-c("COUNTRY","INTERNET","LITERATURE","WEIGHT","I+L")`

We build the contingency table with this last column:

`mtx<-wght_ppc_bycnt(datas,1,5,4)mtx I+L-N-N I+L-N-Y I+L-Y-N I+L-Y-YFinland 0.000225593 4.605372e-05 0.03448387 0.0364791Peru 0.079078933 2.073570e-01 0.05093388 0.1551462Spain 0.011895258 1.123424e-02 0.13828133 0.2748386`

And we perform the correspondence analysis again:

`ca1<-ca(mtx)plot(ca1,map="colprincipal",mass=c(T,T))`

We can see that in Peru is much more common to not have internet than in Spain and Finland. We can also check that in Finland is more normal to have internet than literature at home. In Peru is more normal to not having internet, but possess literature at home. However, keep in mind that having both have a fairly large mass and is quite close to the center, indicating that it is quite common in all countries not be confuse by the position of the N-N case (nor Internet nor literature) so close to Peru, it's just in this country where are most cases, but is not the norm, as its incidence (mass) is very small.

The correspondence analysis is symmetrical, which means that always is performed for both rows and columns. We can plot the graph using the row coordinates as the main ones simply changing the parameter map:

`plot(ca1,map="rowprincipal",mass=c(T,T))`

Now the reference categories are the columns, which occupy the vertices of a tetrahedron of three dimensions that we have projected onto two dimensions. We can see that in Spain and Finland have no internet is almost anecdotal, since they are located almost on the line joining the vertices Y-Y and Y-N.

We can also draw the map using the same scale for rows and columns:

`plot(ca1,map="symmetric",mass=c(T,T))`

These examples have a low dimension, so they are quite trivial. In practice, this type of analysis is used to study the relationships of contingency tables with large numbers of rows and columns, where there are no longer so obvious the relationships between the elements.