PISA Database, a brief introduction
PISA (Program for International Student Assessment) is a program of the OECD (Organization for Economic Cooperation and Development) for the evaluation of education policies. It consists of a series of tests and surveys conducted to 15 years old students, schools and parents at national level in the countries participating in the program (in 2012 there were 65 participating countries, and in 2015 there are more than 70). These tests seek to assess the level of knowledge in three main areas: Reading, science and math.
In addition to the exams, socioeconomic, attitudinal and contextual data of students and their parents are collected, as well as data on schools, through a series of additional surveys. Other socio-economic indices are added also. These tests are performed every three years, beginning in 2000, so, at present, data for five years are available, that can be used to study the evolution and impact of education policies in the different participating countries.
Constantly are published researches based on this database, which contains data of millions of students and schools (only in 2012, data were collected for 510,000 students). All this makes it very suitable to practice data analysis, not only we have a huge amount of data, but there are many studies published that we can try to reproduce.
This is a link to the PISA official website of the OECD, where you can find, plus a lot of documentation and analysis, published in several languages, the data in SAS and SPSS format. What I've done is create a series of SQL Server database and dump the data into them, so that we can extract it with a query application, which I have also developed, in csv format for use with the R statistical program, which is free. I have to say, that by the fact of having manipulated the data, these cannot be considered officially valid, because, due to their large volume, I cannot guarantee that I have preserved the integrity of the original data, I publish them here only for educational purposes, in order to make database management and data analysis practices.
In this article we will see how these data are organized and will see an overview of the main tables and the structure of databases I created, in subsequent articles we will discuss basic analysis techniques to be used with these data, using the R application.
The primary entity is the year in which each study is conducted. At the territorial level, we have the different countries or economies participating in the study. For these we have two types of subdivisions, a territorial one, which divides the country in different regions, and other socioeconomic, dividing it into strata (rural, urban, public, private, etc.). For each country, the year or years in which it participates are indicated.
The individuals in the sample are divided into three groups, students, parents and schools.
We have a series of questionnaires that collect data on the student, school or parents (only in some countries) or test questions on reading, science and math. Each questionnaire consists of a lot of questions, most of Likert type, along with a number of numerical indices calculated from groups of answers to questions and socioeconomic variables concerning the student, parent or school. You can find documentation for the questions and their possible answers for each of the studies in the official website of PISA.
Finally, there are a number of statistics with which you must perform the calculations with the data, these are the qualifications of the students, the weights of students and schools and some replicated weights to calculate standard errors.
How are structured the PISA databases
Due to the large amount of data they contain, besides the fact that every three years new data, with increasing volume, must be added, is unfeasible keep all within the same database, so I decided to create a database with general-purpose entities, as the years, countries, questionnaires and questions and, at national level, a separated database by country containing all students, schools, parents and their answers to the different questionnaires, along with their corresponding statistics. In this way, you can install only the data from countries that interest you (the full data set size is currently about 150 GB).
I have included these databases in the POD project, which I have discussed in other articles, and this is the first database that must be installed. Here you can download a copy of the POD configuration database.
Then you must download and install the PISA database. Finally, in this other link you can download and install the PISA national databases.
What follows is a brief explanation of the main tables of the PISA database. It is not necessary to install and use it; I add it only for those who might be interested.
PISA configuration database
Let's start with the configuration tables that allow integrate the database in POD. In this link you have a description of the POD configuration database. All those tables are in the cfg schema of the PISAComun database and are used as helpers to build queries to obtain the data through the POD API.
If you have read the article where I make an introduction to the configuration of POD, you know that the entities are those elements that we use to structure the information, which consists of the answers and statistics for each student, parent or school, so, the years, countries, questionnaires, questions, etc. are entities. These tables provide information to WinPODUtil application to build the user interface for filtering and selection of data of interest.
The Configuracion_entidades (entities configuration) table contains this information. id_cfgentidad is simply an auto-incremental key, the tabla (table) field indicates the table or view from where the data from a given entity are extracted, the clave (key) field indicates which is the key field of the table or view that contains the data (what the application sees), and the valor (value) field indicates what field we show to the user instead of the key value.
The field identificador (identifier) is the identifier of the entity as it are registered in the tables of the PODConfiguracion database (the query application always accesses the database PODConfiguracion, and, through it, to PISA data, so all the entities must be registered there). id_tipovalor indicates the key value type, which can be N (Numeric) or A (Alphanumeric), orden (order) is used to sort data returned by the query over the entities, contains the field order numbers list, comma separated, with which the data returned is sorted (basically are the arguments of the ORDER BY clause of the SQL statement that is built).
The valor_devuelto (returned value) field can contain an alternative field to that indicated in valor (value) field (which is the data that is displayed to the user) to return in the results data set. It is used only to indicate from which field are returned the value of responses and missing values. valor_columna (column value) is 0 for entities whose results are not returned as a unique column of the result table, but as a field within each row. If the value is 1, the returned data are built in a different column for each entity key. To understand this with an example, the year is a single column, although there are different years, but each question is a different column in the returned results. In the entidad_columna (column entity) field we can indicate code of the entity from which the values of the column is filled, in case of that is not from the entity itself, for example, in each question column the answer is returned, which is a different entity.
The interna (internal) field simply indicates if the entity is only for internal use of the database stored procedures, and codigo_resumen (summary code) and valor_resumen (summary value) are used by the query application to display a summary of the user's selection prior to obtaining the data.
Well, this was the most complicated. With the Campos_Entidades (Entity fields) table we define the fields of the entities that can be returned to the user, and which one is the default field. For example, the country entity has as default field the name of the country, but has additional fields such as the ISO code or if it was an OECD country at the time of the exam.
The table Filtros_Entidades (Entity filters) is used to configure which entities are used as a filter for other ones. For example, countries can be filtered by year, and the territorial divisions of the country by country and by year. With this, the application can be configured in order to allow the user to make the choice of data with successive filters. For those that continue reading at this point and are asking for the reason of such complications, consider that the query application knows nothing about the database or its structure, is a general-purpose program, which are used with both to PISA as any other set of data included in POD database.
The remaining tables are auxiliary to facilitate complex query building system needed to return the results.
Tables of data structure of the PISA database
These tables are located in the pisa schema, the data they contain are been extracted directly from the original files provided by the OECD, both the data itself and the control files for loading them into SPSS, and their fields are mapped to the information contained on these files, the only added information are the keys of the tables and the links between them.
The first table, which can be seen as the root entity, is años (years), is a very simple table, with only one field indicating the year of the study, there are records for 2000, 2003, 2006, 2009 and 2012.
For the territorial structure, we have the following tables:
The inicio (start) and final (end) fields indicate the initial and final column of the field data in the original files (the original data consists of plain text with columns of fixed position and size). The field nombre_pisa (PISA name) is the name given to the corresponding column in the PISA documentation.
As for the structure of tables to store different questionnaires and questions it is as follows:
The questionnaires are classified by mean of the table tipos_cuestionario (questionnaire types). Questions are stored in the tables preguntas_cuestionario (questionnaire questions) and preguntas (questions). Currently there is a 1:1 relationship between these two tables, which can be considered as forming a single entity. The original idea was to allow that one single question can be associated with more than one questionnaire. The codigo_pisa (pisa code) field contains the original code of the question as it appears in the PISA documentation, columna_inicio (start column) and columna_final (end column) contain the columns where the data of the answer start and end in the original data files from the OECD, the text of the question has been copied literally from the original data from PISA, so all are in English.
Questions can have numerical answers (such as age), or may be factors (Yes or No, agree, disagree, etc.), with the table tipos_respuesta (answer types) is identified the kind of response that corresponds to each question.
For answers of factor type, the factores (factors) table contains all the possible answer values, again literally copied from those provided in the original OECD data, and, to each question, the possible responses are associated through the factores_preguntas (factors questions) table.
Finally, the table grupos_preguntas (question groups) is used to group equivalent questions pertaining to different years, which are sometimes codified with different codes depending on the year. For example, age, sex, etc. are questions that are repeated every year. There is an option when querying the data to select to if the answers to these questions are returned in different columns, one for each year, or all in the same column, under the name of the group.
Data tables of the PISA database
These tables are contained in each of the national databases, and contain the students, schools and parents, as well as responses to questionnaires and different statistical for each student, like scores, weights and replicated weights.
The tables where the different individuals in the sample are stored are alumnos (students), escuelas (schools) and padres (parents):
Although, for each student, their corresponding school and their parents the stratum and territorial division are the same, these data are duplicated in all three tables because the queries can be done only by school or parents, and there are too many records in the student table to cross it in a query only to obtain this information, so we sacrifice normalization of the tables in favor of query optimization. In the codigo (code) field of the table of schools and students there is the code assigned in the original database to the student or school.
The statistics concerning individuals (weights, replicated weights and scores), are in the following tables:
In all of them, the valor_txt (textual value) field is the literal value in the official OECD data.
Finally, as there are a lot of different answers to the questionnaires, I have distributed them into five different tables, to optimize access:
The student questionnaire answers have been divided into two tables, respuestas_otros (other answers) and respuestas_alumnos (student responses), in respuestas_rms are the answers to reading, math and science exams. The answers to the school questionnaire are in respuestas_escuelas and those of parent questionnaire on respuestas_padres. In the field valor_alfa (alphanumeric value) is the literal value of the answer in the database of the OECD, while valor_num (numerical value) is this value converted to number, if possible.
The answers may be valid or some of the various nulls (NA, Invalid, Missing, Not Reached, Ambiguous or Ungraded, as the literal encoding in OECD database). The kind of answer is indicated in the id_clase field. The kinds of answers are in the table clase_respuesta (answer class), in the PISAComun schema:
The validez (validity) field is 1 for valid answers or 0 for missing or null values; valor_devuelto (returned value) is the default value code that is returned to the user as the value of the response. If the value is NULL, then the answer value in the corresponding table is returned.