Introduction to PISA data analytics. Data sampling
Every three years since 2000, the OECD (Organisation for Economic Co-operation and Development) has administered a series of tests to 15-year-old students in a number of countries, at national level, to assess their knowledge in three main areas: science, reading and math. This is the PISA program, whose last edition took place in 2015.
In addition to test scores in these areas, a great deal of statistical data is collected on the socioeconomic status of the students and their attitudes towards studies, school and life in general, as well as data on schools and, in some countries, the students' parents. The governments of the participating countries use all these data to evaluate their education policies, and many technical studies are produced from them, many of which are available online for download.
This link provides a more detailed introduction to the PISA program.
All the data collected are published on the official PISA website of the OECD, and are available for download approximately one year after completion of the tests. At present, data are available up to 2012. They come as plain text files with the answers to all the questions of all the questionnaires, along with control files for loading the data into SPSS and SAS.
All this makes the data set excellent for practicing statistical analysis, since there are many published studies whose results we can try to reproduce. In this link you have examples of reports based on PISA data. As SPSS and SAS are fairly expensive, I have loaded the data into SQL Server, in the PISA database, from which you can extract data using SQL queries. You can also use the SQL Server features for data analysis and data mining.
The data can be extracted from the database using the WinPODUtil application, which produces CSV files that can then be processed with R, a free, open-source application. All the examples in this series of articles are made with this program.
Here we'll see a brief introduction to the main characteristics of these data and the main techniques required to analyze them. In this link you can download the PISA data analysis manual, which explains all those techniques in great detail and more rigorously. The manual is devoted to analysis with SPSS, and all its examples are coded for that application, but the theoretical part is excellent, and here we'll see all the examples written in R.
In this first article in the series we will see how the data for these studies are collected and what to keep in mind when sampling them. In this link you can download the source code of the PISA data sampling examples.
The data sampling in PISA
The students for the PISA studies are selected by two-stage sampling. In the first stage, the participating schools are selected, ensuring that their distribution and characteristics are representative of the educational system of the country in which the study is conducted. Then, within each school, a random selection is made among all the students who are 15 years old at the time of the test. This means that the selected students do not necessarily all belong to the same grade.
About 35 students are selected in each school, but not all schools have that many eligible students and, as participation is voluntary, not all the selected students end up taking the tests. For this reason, each student is assigned a weight, calculated from the probability of the school being selected for the study and the probability of the student being selected within the school.
The sum of the weights of all the students in a country's sample gives a value approximately equal to the total population of 15-year-old students in that country.
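To make the relation between selection probabilities and weights concrete, here is a minimal sketch in Python (used only for illustration). The school sizes, the number of schools drawn and the assumption that schools are chosen with probability proportional to their size are all hypothetical choices for this toy example, not details taken from this article. The weight is computed as the inverse of the overall probability of selection, without the non-response adjustments that real weights also include.

```python
# Hypothetical toy country: school id -> number of eligible 15-year-old students.
eligible = {"A": 120, "B": 40, "C": 200, "D": 80}

SCHOOLS_TO_SAMPLE = 2      # assumed number of schools drawn in the first stage
STUDENTS_PER_SCHOOL = 35   # "about 35 students" per school, as in the text

total_eligible = sum(eligible.values())


def school_probability(school_id):
    """Assumed design: selection probability proportional to school size."""
    return SCHOOLS_TO_SAMPLE * eligible[school_id] / total_eligible


def student_probability(school_id):
    """Probability that an eligible student is drawn within a sampled school."""
    drawn = min(STUDENTS_PER_SCHOOL, eligible[school_id])
    return drawn / eligible[school_id]


def student_weight(school_id):
    """Base weight: inverse of the overall probability of selection."""
    return 1.0 / (school_probability(school_id) * student_probability(school_id))


# Suppose schools A and C were the ones drawn in the first stage.
sampled_schools = ["A", "C"]
weights = []
for school_id in sampled_schools:
    drawn = min(STUDENTS_PER_SCHOOL, eligible[school_id])
    weights.extend([student_weight(school_id)] * drawn)

estimated_population = sum(weights)
print(round(estimated_population), total_eligible)
```

In this toy design the 70 sampled students carry weights that add up to the 440 eligible students of the whole country. With real data the match is only approximate.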
All calculations should take the weight of each student into account. Each record, therefore, should not be viewed as the personal data of an individual student; rather, each student represents a population of individuals with similar characteristics.
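As an illustration of what taking the weights into account means, the following Python sketch compares a plain average with a weighted one. The scores and weights are invented for the example; only the idea that each record stands for as many students as its weight indicates comes from the text.

```python
# Hypothetical records: (test score, final student weight).
records = [(520.0, 10.0), (480.0, 30.0), (610.0, 5.0), (450.0, 55.0)]


def unweighted_mean(pairs):
    """Treats every record as a single student."""
    return sum(score for score, _ in pairs) / len(pairs)


def weighted_mean(pairs):
    """Treats every record as 'weight' students with the same score."""
    total_weight = sum(weight for _, weight in pairs)
    return sum(score * weight for score, weight in pairs) / total_weight


print(unweighted_mean(records))  # 515.0
print(weighted_mean(records))    # 474.0
```

The two figures differ because the low-scoring records in this invented sample represent many more students than the high-scoring ones.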
Therefore, when you take a subsample of the data for a given country, containing fewer individuals than the entire sample, the weights must be corrected so that their sum still equals the total population. The adjustment is very simple: multiply each weight by the number of individuals in the full sample and divide it by the number of individuals selected for the subsample.
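The adjustment can be sketched in a few lines of Python. The weights below are randomly generated stand-ins, and the function name rescale_weights is hypothetical; the formula itself is the one described above.

```python
import random


def rescale_weights(sub_weights, full_sample_size):
    """Multiply each weight by (full sample size / subsample size)."""
    factor = full_sample_size / len(sub_weights)
    return [weight * factor for weight in sub_weights]


random.seed(0)  # fixed seed so the example is repeatable

# Invented weights standing in for the full sample of one country.
full_weights = [random.uniform(5, 50) for _ in range(5000)]

# Simple random subsample of 500 students.
sub_weights = random.sample(full_weights, 500)
adjusted = rescale_weights(sub_weights, len(full_weights))

print(round(sum(full_weights)), round(sum(adjusted)))
```

Because the subsample is random, the adjusted sum reproduces the population total only approximately; the two printed figures should be close but not identical.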
In WinPODUtil, the student weights are in the Estimates tab (you must first select at least one year and a country with its territorial divisions). Select Estimator types in the drop-down list, and then select Weights in the list of available options. Then select Weights and estimators from the drop-down list to see the different types of weights available.
The student weight is W_FSTUWT, and the corresponding weight for the school is SCWEIGHT. It is advisable to download a separate file with all the student weights for a country, as the query is quite heavy. Later you can combine this file with other files containing a selection of responses, as explained in the article on processing and merging CSV files with the Process option of the WinPODUtil tool.
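If you prefer to combine the files yourself, the idea is a simple join on the student identifier. The sketch below uses Python's standard csv module on two tiny in-memory files. The column names student_id and question_1 are hypothetical, since the actual layout of the exported files is not described here; only W_FSTUWT comes from the text.

```python
import csv
import io

# Hypothetical contents of the two exported files.
weights_csv = io.StringIO("student_id,W_FSTUWT\n1,12.5\n2,30.0\n3,8.25\n")
responses_csv = io.StringIO("student_id,question_1\n1,1\n3,2\n")


def merge_on_id(weights_file, responses_file, key="student_id"):
    """Attach the weight to every response row that has a matching student."""
    weights_by_id = {row[key]: row for row in csv.DictReader(weights_file)}
    merged = []
    for row in csv.DictReader(responses_file):
        weight_row = weights_by_id.get(row[key])
        if weight_row is not None:  # keep only students present in both files
            merged.append({**weight_row, **row})
    return merged


merged = merge_on_id(weights_csv, responses_csv)
for row in merged:
    print(row)
```

Students that appear only in the weights file, such as student 2 in this example, are simply left out of the result.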
Due to the large number of questions included in each of these studies, students cannot answer all of them, so the questions are distributed among a number of questionnaires, each containing a different set of questions. For this reason, each record contains a large number of missing values in the responses.
As a result, depending on the combination of questions you have selected, the program may not return any data. To avoid this, in the Responses tab of the WinPODUtil program you can select all the types of missing values you want to obtain. For R to recognize them automatically as missing values, it is advisable to change their value to NA by clicking on the text of the value, as shown in the picture.
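Whatever tool reads the exported file, records marked NA have to be left out of each calculation. As a minimal sketch in Python, assuming a hypothetical file with the columns student_id, W_FSTUWT and score, a weighted mean that skips the missing responses could look like this:

```python
import csv
import io

MISSING = {"NA", ""}  # markers treated as missing values in this sketch

# Hypothetical export in which student 2 did not answer the question.
data = io.StringIO(
    "student_id,W_FSTUWT,score\n"
    "1,10.0,500\n"
    "2,30.0,NA\n"
    "3,60.0,400\n"
)


def to_number(text):
    """Return None for a missing marker, otherwise the numeric value."""
    text = text.strip()
    return None if text in MISSING else float(text)


rows = list(csv.DictReader(data))
pairs = [(to_number(r["score"]), to_number(r["W_FSTUWT"])) for r in rows]

# Keep only the records where both the score and the weight are present.
valid = [(x, w) for x, w in pairs if x is not None and w is not None]

weighted_mean = sum(x * w for x, w in valid) / sum(w for _, w in valid)
print(len(valid), round(weighted_mean, 2))
```

Note that the weights of the skipped records are dropped as well, so the result describes only the population represented by the students who answered.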
In the next article in the series, you will see another set of weights, the replicate weights, which are essential for calculating the sampling variance and the standard error. You will also find examples of R code to perform these calculations, which are quite computationally heavy.