Methods

The GEM Report team has designed a model-based approach to estimate time series of out-of-school rates from 2000 to the present. The approach consolidates administrative enrolment counts, censuses, and household surveys to produce complete and coherent estimates for the period of study. The strategy can be divided into four components: compilation, pre-processing, modelling, and post-processing.

Compilation

The data used to estimate out-of-school rates fall into two broad categories.

The first is administrative enrollment counts provided by the UNESCO Institute for Statistics (UIS). UIS collects single-age enrollment counts from Ministries of Education. These counts are combined with population estimates from the UN World Population Prospects (WPP), Eurostat, and SingStat to form out-of-school rate observations.

The second type of data is survey-based data. Household surveys ask multiple questions about school attendance, creating an important resource for enrollment indicators. The present analysis uses Demographic and Health Surveys (DHS), Multiple Indicator Cluster Surveys (MICS), other country-specific household surveys, and population censuses. Census data has been retrieved from IPUMS in extracts of 1 million observations. All survey-based data is collected at the individual level and aggregated in the next stage.

Pre-processing

Given the substantial differences in the nature of the two types of data, we divide the description of the pre-processing stage into two segments.

Administrative Data

As noted, single-age administrative enrollment counts are provided by UIS, and the corresponding population estimates are sourced elsewhere. The enrollment counts are first filtered using the theoretical school ages for each country such that only populations expected to be in school are considered. Enrollment at each country-age-year is then summed across all education levels so that all enrollment is captured. This analysis is concerned with exclusion from school, not with accelerated or delayed enrollment, so a child enrolled at any level counts as in school. Finally, to compute out-of-school rate observations, we divide total enrollment by population and subtract the result from one.
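The calculation described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the GEM Report team's code: the function name, the row layout `(age, level, count)`, and all numbers are hypothetical.

```python
def out_of_school_rates(enrollment_rows, population, school_ages):
    """Return {age: out-of-school rate} for one country-year.

    enrollment_rows: iterable of (age, level, count) tuples (hypothetical layout)
    population:      {age: population estimate}
    school_ages:     ages covered by the theoretical school-age range
    """
    # Keep only theoretical school ages; start every age at zero enrollment.
    enrolled = {age: 0.0 for age in school_ages}
    for age, _level, count in enrollment_rows:
        if age in enrolled:
            # Sum over every level: a child enrolled anywhere is in school.
            enrolled[age] += count
    # Out-of-school rate = 1 - enrollment / population.
    return {age: 1.0 - enrolled[age] / population[age] for age in sorted(enrolled)}


# Made-up numbers for illustration only.
rows = [
    (5, "pre-primary", 400),  # below school age, filtered out
    (6, "pre-primary", 50),   # late pre-primary still counts as enrolled
    (6, "primary", 900),
    (7, "primary", 980),
    (8, "primary", 1010),     # enrollment exceeds the population estimate
]
population = {6: 1000, 7: 1000, 8: 1000}
rates = out_of_school_rates(rows, population, school_ages=range(6, 9))
```

In this toy example the rate for age 8 is slightly negative because the enrollment count exceeds the separately sourced population estimate.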

The administrative observations are then assessed for data quality. We note two key aspects of the data:

First, we consider the composition of the enrollment counts. As noted, enrollment in any level in a given country-age-year is considered in-school. Occasionally, however, one or more school levels are missing from the administrative enrollment counts. In such a scenario, the out-of-school rate observations for ages corresponding to the missing level are inflated and must be excluded. Additionally, in countries with substantial delayed entry or repetition, ages beyond the missing level may be inflated and must also be excluded. Concretely, if primary school corresponds to ages 6-12 but many students enter late, we would expect substantial primary enrollment for 13-year-olds. If primary enrollment is missing, the 13-year-old out-of-school rate would be inflated despite corresponding to lower secondary school. Thus, the 13-year-old value must be excluded in addition to the values for 6-12-year-olds.
Second, as the enrollment and population data are sourced separately, discrepancies arise. Over 18% of the country-age-year observations have enrollments that exceed the population estimate, which yields a negative out-of-school rate. While one could remove all such observations, we instead recognize that these “invalid” observations contain valuable information. In addition to suggesting that the true out-of-school rate for a country-age-year is low or near-zero, the negative observations offer information on the scale of the uncertainty for all administrative observations, positive and negative. As such, we only discard negative observations if there are additional reasons to do so (e.g. age grouping in the data, jurisdictional differences in enrollment and population, etc.).
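As a rough illustration of the first check, the sketch below lists the ages to exclude when a level is missing from the enrollment counts. The function name and the `spillover_years` parameter (how many ages beyond the missing level to drop) are hypothetical; the text does not say how that margin is chosen in practice.

```python
def ages_to_exclude(level_ages, reported_levels, spillover_years=0):
    """Return the set of ages whose out-of-school rates would be inflated.

    level_ages:      {level: (first_age, last_age)} theoretical age ranges
    reported_levels: levels present in the administrative enrollment counts
    spillover_years: hypothetical margin for late entry or repetition
    """
    excluded = set()
    for level, (first_age, last_age) in level_ages.items():
        if level not in reported_levels:
            # Drop the level's own ages plus the overage margin.
            excluded.update(range(first_age, last_age + 1 + spillover_years))
    return excluded


# The example from the text: primary (ages 6-12) is missing and late entry
# is common, so age 13 is excluded as well.
level_ages = {"primary": (6, 12), "lower secondary": (13, 15)}
excluded = ages_to_exclude(level_ages, {"lower secondary"}, spillover_years=1)
```

Negative observations are deliberately not handled here, since they are kept unless there is an additional reason to drop them.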

Survey Data

Survey-based data, in contrast to administrative data, do not contain aggregate enrollment and aggregate population. Rather, each individual surveyed is typically asked if they attended school at any time in the present school year. These individual-level attendance values can be aggregated by age and sex to produce country-level out-of-school rate estimates. As these data are sample-based, we compute sampling variances for each country-age-year using the clustered jackknife method frequently used by demographic surveys for other indicators.
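A minimal sketch of a delete-one-cluster jackknife is given below. It is a simplification under stated assumptions: it ignores survey weights and stratification, which real survey estimates would normally account for, and the function name and input layout are hypothetical.

```python
def jackknife_variance(clusters):
    """Delete-one-cluster jackknife for an out-of-school rate.

    clusters: list of (out_of_school_count, respondent_count) pairs, one per
              sampling cluster, for a single country-age-year (unweighted).
    Returns (rate, variance).
    """
    k = len(clusters)
    if k < 2:
        raise ValueError("need at least two clusters")
    out_total = sum(out for out, _ in clusters)
    n_total = sum(n for _, n in clusters)
    rate = out_total / n_total

    # Re-estimate the rate k times, leaving out one cluster each time.
    replicates = [(out_total - out) / (n_total - n) for out, n in clusters]

    # Jackknife variance: (k - 1) / k * sum of squared deviations.
    variance = (k - 1) / k * sum((r - rate) ** 2 for r in replicates)
    return rate, variance


# Four made-up clusters of ten respondents each.
rate, variance = jackknife_variance([(2, 10), (1, 10), (3, 10), (2, 10)])
```

Resampling whole clusters rather than individuals reflects the fact that respondents in the same cluster are not independent draws.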

Out-of-school rate observations generated from survey data are then assessed for quality. Observations can be excluded if too few individuals were sampled for a given country-age-year or if the survey values are implausible.

Modelling

Out-of-school rate estimation is a challenging task given the underlying processes governing patterns in out-of-school rates and the difficulties in consolidating data of varying sources and quality. To address these challenges, the GEM Report team adopted a model-based approach designed to consider and address the following key observations:

Out-of-school rates are customarily reported by education level (primary, lower secondary, upper secondary) but the granular age-specific data indicate that out-of-school rates are not constant within a level. They are fluid and form distinctive curves that reflect core education challenges like late entry and dropout. A modelling exercise should thus operate on single ages, not levels.
The two types of data have distinct challenges that must be addressed simultaneously. Specifically, administrative data is not subject to sampling errors but suffers from discrepancies in component data sourcing and is not always complete. Survey data is infrequent, subject to sampling errors, and can be subject to large bias.
Students progress through education systems as cohorts. Exploiting this natural cohort structure can help link together data across years. In the out-of-school context, we note that students who drop out in one year (i.e. an increase in the out-of-school rate) are typically still absent the next (i.e. the increase in the out-of-school rate is maintained).

To this end, the GEM Report team has developed a Bayesian hierarchical model that constructs underlying out-of-school rate curves for cohorts over time. These curves are smoothed over time such that changes in the behaviour of cohorts are meaningful but gradual. The data interact with the latent out-of-school rate patterns through two likelihood formulations designed to address the specific constraints, biases, and error structures of the two classes of data.

For details on this model, see this link.

Post-processing

The model produces out-of-school rates for country-age-years. After extracting these values, the age-specific out-of-school rates are aggregated by level using population estimates. Additional aggregations by region, income, and other groups are computed as well. Finally, the out-of-school rates at each level of aggregation are multiplied by the corresponding population estimates to produce estimates of the number of out-of-school children.
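The level aggregation amounts to a population-weighted average of the age-specific rates. The sketch below illustrates this with made-up numbers; the function name and inputs are hypothetical.

```python
def aggregate_level(rates, population, ages):
    """Aggregate age-specific rates over the ages of one education level.

    rates:      {age: modelled out-of-school rate}
    population: {age: population estimate}
    ages:       ages belonging to the level
    Returns (level_rate, number_out_of_school).
    """
    level_population = sum(population[a] for a in ages)
    # Rate times population gives the number of out-of-school children.
    number_out = sum(rates[a] * population[a] for a in ages)
    # The level rate is the population-weighted mean of the age rates.
    return number_out / level_population, number_out


# Made-up numbers for illustration only.
rates = {6: 0.10, 7: 0.05, 8: 0.05}
population = {6: 1000, 7: 1000, 8: 2000}
level_rate, number_out = aggregate_level(rates, population, ages=[6, 7, 8])
```

The same weighting applies when aggregating across countries into regional or income groups, with country-level populations as weights.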

Visualization

To complement the text description of the modelling process, the following applet allows users to explore the inner mechanics of the model. In contrast with the visualizations on other pages, these plots present age-specific estimates instead of education level-specific values.

The above views offer different pictures of the out-of-school rate model. To interpret each view, please note the following:

Summary: The summary page provides an overview of the data and output of the model. The left panel plots the observed data, and the right panel plots the complete series of age-specific out-of-school rates. Each age is assigned a unique colour and each source a unique marker, both of which can be identified with the legend to the right of the plot. Note that the right panel does not include source data points, only model points.
By age: The age page separates the data and estimates by age. These plots are similar to those of the country page where the data and estimates are presented by level. The black line is the time series of out-of-school rates, and the grey shaded area is the corresponding 90% uncertainty interval. Observed data are plotted using coloured points with each source having a distinct colour identified by the legend on the right of the plot.
By year: The year page separates the data and estimates by year. Each estimate is plotted with a black point and the associated 90% uncertainty intervals are denoted by the corresponding black lines. Observed data are plotted using coloured points with each source having a distinct colour identified by the legend on the right of the plot.
By cohort: The cohort page separates the data and estimates by cohort. The year label defines the entry year for each cohort. For example, the 2000 plot covers individuals who were of the entry age in 2000, one year older than the entry age in 2001, and so on. Plots with incomplete series correspond to cohorts that only partially overlap with the data interval of 2000 onwards. For example, the 2020 cohort has not yet completed the education cycle and thus a complete series is not presented. Similarly, the 1996 cohort entered before 2000 and so only older ages are presented. Observed data are plotted using coloured points with each source having a distinct colour identified by the legend on the right of the plot.
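The cohort labelling can be made concrete with a short sketch. The entry age of 6, the final age of 17, and the last data year of 2023 are assumptions chosen for illustration; actual values vary by country and by release.

```python
def cohort_entry_year(year, age, entry_age):
    """Entry year of the cohort that is `age` years old in `year`."""
    return year - (age - entry_age)


def visible_ages(cohort, entry_age, last_age, first_year, last_year):
    """Ages of a cohort that fall inside the data interval."""
    return [
        age
        for age in range(entry_age, last_age + 1)
        if first_year <= cohort + (age - entry_age) <= last_year
    ]


# An 8-year-old observed in 2002 belongs to the 2000 cohort (entry age 6).
cohort = cohort_entry_year(2002, 8, entry_age=6)

# The 1996 cohort entered before 2000, so only older ages are visible.
ages_1996 = visible_ages(1996, 6, 17, first_year=2000, last_year=2023)

# The 2020 cohort has not finished the cycle, so only young ages are visible.
ages_2020 = visible_ages(2020, 6, 17, first_year=2000, last_year=2023)
```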

Download technical background paper

Download policy paper: New estimation confirms out-of-school population is growing in sub-Saharan Africa