Introduction

The Model of Agricultural Production and Its Impact on the Environment (MAgPIE) is a global land-use modeling framework that simulates agricultural production, land-use change, and environmental impacts under various socio-economic and climate scenarios (Potsdam Institute for Climate Impact Research (PIK), 2025). In our work, MAgPIE is used as a modeling tool to analyze key variables related to the Brazilian territory and its international interactions, including deforestation, harvested area in agriculture, crop production, cattle herd dynamics, and trade.

Within MAgPIE, spatial variables can be represented at multiple resolutions, listed here from coarsest to finest: global, regional, cluster, and cellular. In the default configuration of MAgPIE version 4.10.0, the version adopted in our study, the world is divided into 12 regions, and Brazil belongs to the LAM region, which encompasses the Latin American countries.

Within the LAM region, MAgPIE defines 26 spatial clusters, which are aggregations of cells grouped according to multiple criteria such as potential crop productivity, land availability, and climatic conditions. Importantly, these clusters are not necessarily spatially contiguous, meaning that a single cluster may contain cells spread across multiple countries. This default spatial configuration directly influences the MAgPIE outputs we are analyzing and limits how well the specificities of the Brazilian context can be represented. Since the cluster definitions do not align with national boundaries, and certain variables, such as trade, are modeled exclusively at the regional level, the reliability of the analyses for Brazil is significantly reduced.

In light of these limitations, we carried out a dedicated process to modify the default MAgPIE configuration in order to define Brazil as a distinct and isolated region. In this new configuration, the clusters within the BRA region correspond directly to the 27 Brazilian states. The purpose of this report is to document the steps involved in creating this revised spatial structure, which includes modifications to the core mapping files and the preprocessing of relevant input data.

Context and objective

This report describes the input data files, details the processing steps performed, and explains the data preparation procedures for running MAgPIE considering a new region corresponding to Brazil. The objective is to guarantee transparency and reproducibility in data processing, as well as to tailor the input datasets to the scope of this study, which centers on analyses related to the national context of Brazil.

MAgPIE default configuration

In the default configuration of MAgPIE, the global land surface is discretized into spatial grid cells at a $0.5^\circ$ resolution, which are subsequently aggregated into 200 clusters to ensure computational efficiency while preserving regional heterogeneity. These clusters are organized into 12 world regions, each containing a specific number of clusters according to its geographic extent and socioeconomic relevance. This hierarchical structure—grid cell -> cluster -> region—provides the spatial framework for land-use allocation, production modeling, and policy analysis within MAgPIE. The overall distribution of clusters across regions is depicted in Figure 1, highlighting the spatial aggregation scheme adopted in the default setup.

library(dplyr)
library(ggplot2)
library(ggspatial)

# Cluster map: one row per grid cell with its region and cluster
df <- readRDS("clustermap_rev4.117_c200_67420_h12.rds")

# Cell names have the form "lon.lat.iso", with "p" as the decimal mark
coords <- strsplit(df$cell, "\\.")
mat    <- do.call(rbind, coords)
df$iso <- mat[, 3]
df$lon <- as.numeric(gsub("p", ".", mat[, 1]))
df$lat <- as.numeric(gsub("p", ".", mat[, 2]))

# The first three characters of a cluster name are its region code
df_plot <- df %>%
  mutate(cluster3 = substr(cluster, 1, 3))

valores_unicos <- unique(df_plot$cluster)

# Count clusters per region and build legend labels such as "LAM (26)"
resultado <- tibble(valor = valores_unicos) %>%
  mutate(regiao = substr(valor, 1, 3)) %>%
  count(regiao, name = "n") %>%
  mutate(regiao_final = paste0(regiao, " (", n, ")"))

# Attach the legend label to every cell
df_final <- df_plot %>%
  left_join(resultado, by = c("cluster3" = "regiao"))
cores_custom <- c(
 #"#A8A8A8",
  "#ED9659", "#3CB44B", "#FDD61C", "#898916",
  "#FF9999", "#9DCFC9", "#4363D8", "#43D4F4", "#820505",
  "#9A6425", "#911FB4", "#E52654"
)


ggplot(df_final, aes(x = lon, y = lat, fill = regiao_final)) +
  geom_tile() +
  coord_equal() +
  scale_fill_manual(values = cores_custom, 
                    guide = guide_legend(nrow = 2,
                    title.position = "top" )) +
  theme_minimal() +
  labs(fill = "Region (number of clusters)") +
  theme(
    legend.position = "bottom",
    legend.box = "horizontal",
    legend.title.align = 0
  )

Figure 1: MAgPIE world regions and cluster settings (Default version).

The LAM region, which comprises the countries of Latin America, including Brazil, is represented in the default MAgPIE setup by 26 clusters. The spatial distribution of these clusters, along with their constituent grid cells, is illustrated in Figure 2, where each cluster is shown in a distinct color. Grid cells assigned to the same cluster are not required to be geographically contiguous. As a result, a single cluster may group together areas that share similar production conditions or land-use characteristics even if they are dispersed across different countries of the region. Of the 26 clusters in the LAM region, numbered 59 to 84, 18 contain at least one grid cell within Brazilian territory.

library(raster)
library(sp)

df<-readRDS("clustermap_rev4.117_c200_67420_h12.rds")
LAM<-subset(df, region=='LAM')
coords <- strsplit(LAM$cell, "\\.")  
mat    <- do.call(rbind, coords) 
LAM$lon <- mat[,1]
LAM$lat <- mat[,2]
LAM$iso <- mat[,3]
LAM$lon <- as.numeric(gsub("p", ".", LAM$lon))
LAM$lat <- as.numeric(gsub("p", ".", LAM$lat))
valores_unicos <- unique(LAM$cluster)
r <- rasterFromXYZ(cbind(LAM[,c("lon","lat")], z=1), 
                   res = c(min(diff(sort(unique(LAM$lon)))),
                           min(diff(sort(unique(LAM$lat))))))

polys <- rasterToPolygons(r, dissolve = FALSE)
# Attach the cluster label of each cell to its polygon
pts_sp <- SpatialPointsDataFrame(
  coords      = LAM[, c("lon","lat")],
  data        = LAM[, "cluster", drop = FALSE],
  proj4string = CRS(proj4string(polys)))

polys$cluster <- over(polys, pts_sp)$cluster

library(Polychrome)
pal <- createPalette(length(valores_unicos), c("#ff0000", "#00ff00", "#0000ff"))  # one color per cluster
cols  <- pal[ match(polys$cluster, valores_unicos) ]

plot(polys,
     col    = cols,
     border = "grey80",
     lwd    = 0.5)
legend("bottomleft", inset = c(0.05, 0),  
       legend = valores_unicos,
       fill = pal, 
       ncol = 3,
       cex = 0.6,
       pt.cex = 0.6
)

Figure 2: Spatial grid cells in the LAM region, aggregated into clusters according to the MAgPIE default configuration.
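
The count of LAM clusters that overlap Brazil, mentioned above, can be checked directly from the cluster map. The following is a minimal sketch on a toy table, assuming the columns `cell`, `region`, and `cluster` used in the plotting code and assuming that the third dot-separated field of a cell name is the ISO3 country code. The function name `count_clusters_with_country` is hypothetical.

```r
# Toy stand-in for the cluster map; the real object would come from
# readRDS("clustermap_rev4.117_c200_67420_h12.rds").
toy_map <- data.frame(
  cell    = c("-50p25.-10p25.BRA", "-49p75.-10p25.BRA",
              "-70p25.-30p25.CHL", "-60p25.-20p25.BOL",
              "-55p25.-25p25.PRY", "-54p75.-25p25.BRA"),
  region  = "LAM",
  cluster = c("LAM.59", "LAM.59", "LAM.60", "LAM.60", "LAM.61", "LAM.61"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: number of clusters in a region, and how many of
# them contain at least one cell of the given country.
count_clusters_with_country <- function(map, region_code, iso_code) {
  reg <- map[map$region == region_code, ]
  # Third dot-separated field of the cell name is taken to be the ISO3 code
  iso <- vapply(strsplit(reg$cell, ".", fixed = TRUE), `[`, character(1), 3)
  c(total        = length(unique(reg$cluster)),
    with_country = length(unique(reg$cluster[iso == iso_code])))
}

counts <- count_clusters_with_country(toy_map, "LAM", "BRA")
print(counts)
```

Applied to the default cluster map, the same call would be expected to reproduce the 26 and 18 reported in the text, provided the assumptions about the cell naming hold.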

Creation of new clusters

To analyze MAgPIE results specifically for the Brazilian territory, we redefined the model's default configuration of regions and clusters. This required modifications to the mapping files that define these classifications.

The country-to-region mapping file defines the correspondence between countries and their respective regions. It consists of three data columns: the first contains the full country names, the second the corresponding country codes, and the third the associated regional codes. To create the new region, a single change was made to this file: in the third column, the region code of Brazil was set to BRA.
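
The single change to the country-to-region mapping can be sketched as follows. This is an illustration on a toy table: the column names `CountryName`, `CountryCode`, and `RegionCode`, the helper `assign_own_region`, and the output file name are assumptions, and the delimiter of the real mapping file may differ.

```r
# Toy country-to-region mapping with the three columns described above.
# Column names are assumptions for illustration.
country_map <- data.frame(
  CountryName = c("Argentina", "Brazil", "Chile", "Germany"),
  CountryCode = c("ARG", "BRA", "CHL", "DEU"),
  RegionCode  = c("LAM", "LAM", "LAM", "EUR"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: move one country into a region of its own
assign_own_region <- function(map, iso_code, new_region) {
  hit <- map$CountryCode == iso_code
  stopifnot(sum(hit) == 1)          # exactly one row per country expected
  map$RegionCode[hit] <- new_region
  map
}

new_map <- assign_own_region(country_map, "BRA", "BRA")
print(new_map)

# Written to a temporary file here; the real file format may differ
out_file <- file.path(tempdir(), "regionmapping_example.csv")
write.csv(new_map, out_file, row.names = FALSE)
```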

Cluster redefinition is primarily implemented through changes to the cluster mapping file, which maps each grid cell to its corresponding region, country, and cluster. In the default configuration, cluster allocation follows a global optimization procedure based on multiple criteria, including potential agricultural productivity, land and water availability, land-use patterns, climatic conditions, population density, and food demand. Although this approach is suitable for global modeling, it does not adequately capture the heterogeneity and regional characteristics of the Brazilian environment.

Therefore, we introduced a new clustering scheme in which the Brazilian territory is subdivided into 27 clusters, each corresponding to one of the Brazilian states. With the separation of Brazil from the LAM region, two clusters of the default configuration, LAM.67 and LAM.70, no longer exist, as all of their cells lie within Brazilian territory. Consequently, the remaining clusters were renumbered to keep the numbering continuous. Figure 3 shows the new configuration.

library(dplyr)
library(ggplot2)
library(ggspatial)

df<-readRDS("clustermap_rev4.117_c225_67420_h13.rds")

coords <- strsplit(df$cell, "\\.")  
mat    <- do.call(rbind, coords) 
df$lon <- mat[,1]
df$lat <- mat[,2]
df$iso <- mat[,3]
df$lon <- as.numeric(gsub("p", ".", df$lon))
df$lat <- as.numeric(gsub("p", ".", df$lat))


df_plot <- df %>%
  mutate(cluster3 = substr(cluster, 1, 3))

valores_unicos <- unique(df_plot$cluster)

resultado <- tibble(valor = valores_unicos) %>%
  mutate(regiao = substr(valor, 1, 3)) %>%   
  count(regiao, name = "n")    %>%      
  mutate(regiao_final = paste0(regiao, " (", n, ")"))           

df_final <- df_plot %>%
  left_join(resultado, by = c("cluster3" = "regiao"))
# One color per region (region code = first 3 letters of the cluster name)
cores_custom <- c(
 "#A8A8A8",
  "#ED9659", "#3CB44B", "#FDD61C", "#898916",
  "#FF9999", "#9DCFC9", "#4363D8", "#43D4F4", "#820505",
  "#9A6425", "#911FB4", "#E52654"
)


ggplot(df_final, aes(x = lon, y = lat, fill = regiao_final)) +
  geom_tile() +
  coord_equal() +
  scale_fill_manual(values = cores_custom, 
                    guide = guide_legend(nrow = 2,
                    title.position = "top" )) +
  theme_minimal() +
  labs(fill = "Region (number of clusters)") +
  theme(
    legend.position = "bottom",
    legend.box = "horizontal",
    legend.title.align = 0
  )

Figure 3: MAgPIE new world regions and cluster settings (Brazil version).

Removing the Brazilian cells from the LAM clusters and reallocating them to the newly defined BRA clusters modifies the original structure of the LAM region, as illustrated in Figure 4.

library(raster)
library(sp)

map<-readRDS("clustermap_rev4.117_c225_67420_h13.rds")

LAM<-subset(map, region=='LAM')
coords <- strsplit(LAM$cell, "\\.")  
mat    <- do.call(rbind, coords) 
LAM$lon <- mat[,1]
LAM$lat <- mat[,2]
LAM$iso <- mat[,3]
LAM$lon <- as.numeric(gsub("p", ".", LAM$lon))
LAM$lat <- as.numeric(gsub("p", ".", LAM$lat))
valores_unicos <- unique(LAM$cluster)

r <- rasterFromXYZ(cbind(LAM[,c("lon","lat")], z=1), 
                   res = c(min(diff(sort(unique(LAM$lon)))),
                           min(diff(sort(unique(LAM$lat))))))

polys <- rasterToPolygons(r, dissolve = FALSE)
# Attach the cluster label of each cell to its polygon
pts_sp <- SpatialPointsDataFrame(
  coords      = LAM[, c("lon","lat")],
  data        = LAM[, "cluster", drop = FALSE],
  proj4string = CRS(proj4string(polys)))
polys$cluster <- over(polys, pts_sp)$cluster

library(Polychrome)
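# Palette size follows the number of clusters left in LAM after Brazil is removed
pal <- createPalette(length(valores_unicos), c("#ff0000", "#00ff00", "#0000ff"))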

cols  <- pal[ match(polys$cluster, valores_unicos)]

plot(polys,
     col    = cols,
     border = "grey80",
     lwd    = 0.5)
legend("bottomleft", inset = c(0.05, 0),                    
       legend = valores_unicos,
       fill = pal, 
       ncol = 3,
       cex = 0.6,
       pt.cex = 0.6
)

Figure 4: Spatial grid cells in the LAM region, aggregated into clusters according to the new settings.

The new clusters are named sequentially from BRA.199 to BRA.225. This modification allows the model to reflect more accurately the socio-environmental diversity present across Brazil and enables more detailed spatial analyses.
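
The renumbering described above can be illustrated on a toy table. This is a sketch under stated assumptions, not the procedure actually used: the columns `iso` and `state`, the helper `split_country_clusters`, and the alphabetical ordering of states are all hypothetical. Clusters that lose all of their cells disappear, the remaining clusters are renumbered consecutively, and the new per-state clusters receive the numbers that follow.

```r
toy <- data.frame(
  iso     = c("USA", "BRA", "ARG", "BRA", "CHL", "BRA"),
  state   = c(NA, "SP", NA, "SP", NA, "AM"),     # hypothetical state column
  region  = c("USA", "LAM", "LAM", "LAM", "LAM", "LAM"),
  cluster = c("USA.1", "LAM.2", "LAM.3", "LAM.2", "LAM.4", "LAM.4"),
  stringsAsFactors = FALSE
)

split_country_clusters <- function(map, iso_code, new_region) {
  is_new <- map$iso == iso_code
  # Number of the original cluster, e.g. 59 from "LAM.59"
  old_num <- as.integer(sub("^.*\\.", "", map$cluster))

  # Remaining clusters keep their relative order but get consecutive numbers
  kept_nums <- sort(unique(old_num[!is_new]))
  new_num   <- match(old_num, kept_nums)

  # New clusters, one per state, numbered after the last remaining cluster
  states <- sort(unique(map$state[is_new]))
  new_num[is_new] <- length(kept_nums) + match(map$state[is_new], states)

  map$region[is_new] <- new_region
  map$cluster <- paste(map$region, new_num, sep = ".")
  map
}

res <- split_country_clusters(toy, "BRA", "BRA")
print(res[, c("iso", "state", "cluster")])
```

In the toy data, the original LAM.2 contains only Brazilian cells and therefore vanishes, in the same way as LAM.67 and LAM.70 in the real map. With 198 remaining clusters and 27 states, the same logic yields the range BRA.199 to BRA.225.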

Moreover, unlike the default clustering, the new configuration imposes a spatial-contiguity constraint: all cells within a Brazilian cluster must be geographically adjacent. This ensures that each cluster represents a continuous geographic region, improving the interpretability of spatial patterns and reducing distortions related to non-contiguous cluster assignments. It is important to emphasize that the current cluster configuration, in which clusters correspond to the Brazilian states, is a preliminary setup intended solely to test the model's adaptability to a new spatial configuration. In future developments, we aim to increase the spatial resolution by letting each cluster consist of a single grid cell within Brazil, thereby allowing for fully disaggregated, cell-level modeling across the national territory. Figure 5 illustrates the new spatial configuration adopted.

library(raster)
library(sp)

map<-readRDS("clustermap_rev4.117_c225_67420_h13.rds")
BRA<-subset(map, region=='BRA')
coords <- strsplit(BRA$cell, "\\.")  
mat    <- do.call(rbind, coords) 
BRA$lon <- mat[,1]
BRA$lat <- mat[,2]
BRA$iso <- mat[,3]
BRA$lon <- as.numeric(gsub("p", ".", BRA$lon))
BRA$lat <- as.numeric(gsub("p", ".", BRA$lat))
valores_unicos <- unique(BRA$cluster)

r <- rasterFromXYZ(cbind(BRA[,c("lon","lat")], z=1), 
                   res = c(min(diff(sort(unique(BRA$lon)))),
                           min(diff(sort(unique(BRA$lat))))))

polys <- rasterToPolygons(r, dissolve = FALSE)
# Attach the cluster label of each cell to its polygon
pts_sp <- SpatialPointsDataFrame(
  coords      = BRA[, c("lon","lat")],
  data        = BRA[, "cluster", drop = FALSE],
  proj4string = CRS(proj4string(polys))  # use the same CRS as 'polys'
)
polys$cluster <- over(polys, pts_sp)$cluster



library(Polychrome)
# One color per cluster (one cluster per state)
pal <- createPalette(length(valores_unicos), c("#ff0000", "#00ff00", "#0000ff"))
cols  <- pal[ match(polys$cluster, valores_unicos) ]
plot(polys,
     col    = cols,
     border = "grey80",
     lwd    = 0.5)
legend("bottomleft", inset = c(0.05, 0),                 
       legend = valores_unicos,
       fill = pal, 
       ncol = 3,
       cex = 0.6,
       pt.cex = 0.6
)

Figure 5: Spatial grid cells in the BRA region, aggregated into clusters according to the new settings.
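
The spatial-contiguity constraint described above can be verified per cluster with a simple flood fill over the grid. The sketch below is one possible check, not the procedure used by the authors. It assumes a regular $0.5^\circ$ grid and treats two cells as adjacent only when they share an edge. The function name `is_contiguous` and the toy coordinates are hypothetical.

```r
# Returns TRUE when all cells form one connected patch on a regular grid.
# Neighbours are the four cells sharing an edge (rook adjacency), which is
# an assumption; diagonal adjacency would need four more offsets.
is_contiguous <- function(lon, lat, res = 0.5) {
  # Integer grid indices avoid floating-point comparison problems
  ix  <- round((lon - min(lon)) / res)
  iy  <- round((lat - min(lat)) / res)
  key <- paste(ix, iy)
  visited    <- rep(FALSE, length(key))
  visited[1] <- TRUE
  queue   <- 1L
  offsets <- rbind(c(1, 0), c(-1, 0), c(0, 1), c(0, -1))
  while (length(queue) > 0) {
    cur   <- queue[1]
    queue <- queue[-1]
    nb_keys <- paste(ix[cur] + offsets[, 1], iy[cur] + offsets[, 2])
    nb <- which(key %in% nb_keys & !visited)
    visited[nb] <- TRUE
    queue <- c(queue, nb)
  }
  all(visited)
}

toy <- data.frame(
  cluster = c("BRA.199", "BRA.199", "BRA.199", "BRA.200", "BRA.200"),
  lon = c(-50.25, -49.75, -49.75, -45.25, -43.25),
  lat = c(-10.25, -10.25,  -9.75, -20.25, -20.25)
)

contiguous <- sapply(split(toy, toy$cluster),
                     function(d) is_contiguous(d$lon, d$lat))
print(contiguous)
```

In the toy data the first cluster forms a single patch, while the two cells of the second cluster are two degrees apart and are therefore reported as non-contiguous.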

Description of input files and data processing

This chapter presents the input data files required for model execution and describes the processing steps performed to prepare them. The main goal was to reprocess the data for the newly defined BRA region, ensuring that all inputs are consistent with the requirements of the model.

The input files used for running the model are available in the public PIK repository (Potsdam Institute for Climate Impact Research (PIK), 2018), organized into five input data bundles. Each bundle was thoroughly reviewed to ensure that all datasets were adapted to the new regional configuration. The following sections present all processed input bundles along with their corresponding details.

The preprocessing procedures involved the adaptation of existing routines and the development of new ones when necessary. Several challenges emerged during this process, mainly due to the limited availability of the original scripts used by the model developers. As a result, additional adjustments and manual harmonization steps were required.

Challenges and limitations in data processing

The data processing phase revealed a series of technical issues related to data structure, metadata consistency, and intermediate scripts. This section documents these challenges and provides context on their origin.

The FAOSTAT datasets (Food and Agriculture Organization of the United Nations (FAO), 2025) used to compute the model inputs have recently undergone a structural reorganization. As a result, the automated download functions can no longer be executed successfully, as they consistently return the following error:

Error in download.file(faoMeta$FileLocation, destfile = destfile, mode = "wb"): invalid url argument

The access keys for the FAO datasets available through the Bulk Download link were changed as part of the platform's reorganization. Downloading the data has therefore become more laborious: the corresponding database must be identified for each dataset and, in some cases, adapted to the previous structure to ensure compatibility with the existing reading functions. A comprehensive adaptation of all FAO datasets used by MAgPIE is currently underway to fully reproduce the processing pipeline. Below is a list of the datasets previously used and their corresponding replacements in the new workflow.

Even after these adjustments, the FAOSTAT databases could not be automatically read and processed by the corresponding functions. Consequently, some files were modified and will be examined in greater detail at a later stage to identify and implement potential corrections.
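
The `invalid url argument` error suggests that the value passed to `download.file()` is not a usable URL, for instance because the metadata lookup no longer returns a file location. This reading is an assumption. A defensive wrapper along the following lines makes the failure explicit and falls back to a manually downloaded file. The function `fetch_or_use_local` and all file names are hypothetical.

```r
# Use a manually downloaded file when no valid URL is available.
# `url` stands for a value such as faoMeta$FileLocation; `local_copy` is a
# hypothetical path to a file obtained by hand from the bulk download page.
fetch_or_use_local <- function(url, destfile, local_copy = NULL) {
  url_ok <- is.character(url) && length(url) == 1 &&
    !is.na(url) && nzchar(url)
  if (url_ok) {
    download.file(url, destfile = destfile, mode = "wb")
    return(invisible("downloaded"))
  }
  if (!is.null(local_copy) && file.exists(local_copy)) {
    file.copy(local_copy, destfile, overwrite = TRUE)
    return(invisible("local copy"))
  }
  stop("No valid URL and no local copy available for ", basename(destfile))
}

# Demonstration without network access: the metadata lookup returned nothing
manual <- file.path(tempdir(), "manual_download.zip")
writeLines("placeholder", manual)
target <- file.path(tempdir(), "dataset.zip")

status <- fetch_or_use_local(character(0), target, local_copy = manual)
print(status)
```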

A recurring issue involved datasets derived from scientific publications, which are not always provided in a structured, ready-to-use format, nor made available for automated access via download links. In such cases, no dedicated download functions exist, and the following error was encountered:

ERROR: Sourcefolder does not contain data for the requested source type = type subtype = subtype and there is no download script which could provide the missing data. Please check your settings!

In these cases, it was necessary to manually locate the required datasets and apply the appropriate adjustments so that the data-reading function could be used.
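
The manual step can be sketched as copying the located file into the folder in which the reading function is expected to look. The sketch assumes, as the error message suggests, that the data for a source type is searched for in a subfolder named after that type inside the configured source folder. The helper `place_manual_source`, the type name `ExampleSource`, and the file name are hypothetical, and a temporary directory stands in for the real source folder.

```r
# Copy a manually obtained file into "<source folder>/<type>/".
# In a real session the source folder would come from the madrat
# configuration; a temporary directory is used so the sketch runs anywhere.
place_manual_source <- function(file, type, source_folder) {
  target_dir <- file.path(source_folder, type)
  dir.create(target_dir, recursive = TRUE, showWarnings = FALSE)
  ok <- file.copy(file, target_dir, overwrite = TRUE)
  if (!ok) stop("Could not copy ", file, " to ", target_dir)
  file.path(target_dir, basename(file))
}

source_folder <- file.path(tempdir(), "sources")             # hypothetical
manual_file   <- file.path(tempdir(), "paper_table_S1.csv")  # hypothetical
write.csv(data.frame(iso = "BRA", value = 1), manual_file, row.names = FALSE)

placed <- place_manual_source(manual_file, "ExampleSource", source_folder)
print(placed)
```

Any reshaping needed to match the layout expected by the reading function would still have to be done on the copied file.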

Overview of main datasets

This section presents the main datasets used as input for the model.

Processing Procedures and Functions Used

The preprocessing procedures were conducted in the R environment, employing the packages and functions recommended by the model developers. Among these, the madrat package (version 3.24.1) and its associated dependencies played a central role. This package, specifically designed for the preprocessing of input data used within the MAgPIE modeling framework, was essential to ensure consistency and reproducibility in data preparation.

One of the most frequently employed functions in this process was calcOutput() from the madrat package. This function was developed as a wrapper for specific routines designed to handle the various types of outputs utilized within the model. When executed with a specified output type, calcOutput() calls the corresponding function from one of the auxiliary preprocessing packages and performs the entire workflow, ranging from downloading the required datasets to aggregating the data by regions. It is also possible to provide a file containing country-to-region mappings, which enables the function to execute the regional aggregation step automatically.

Another widely used function was toolAggregate(), also from the madrat package. This function performs the aggregation (or disaggregation) of a dataset according to a relation matrix or mapping. In addition, it allows for the inclusion of weights, which are applied in the calculation of the final aggregated values.
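
The arithmetic behind a weighted aggregation can be shown in base R. The sketch below is not the madrat implementation and does not use magclass objects. It only illustrates how country values are combined into regional values with weights, using made-up numbers and a hypothetical helper `aggregate_weighted`.

```r
values  <- c(ARG = 3.0, BRA = 4.0, CHL = 5.0)   # e.g. a yield per country
weights <- c(ARG = 10,  BRA = 70,  CHL = 20)    # e.g. harvested area
mapping <- data.frame(
  country = c("ARG", "BRA", "CHL"),
  region  = c("LAM", "BRA", "LAM"),
  stringsAsFactors = FALSE
)

aggregate_weighted <- function(values, weights, mapping) {
  region <- mapping$region[match(names(values), mapping$country)]
  # Weighted mean per region: sum(w * x) / sum(w)
  tapply(values * weights, region, sum) / tapply(weights, region, sum)
}

regional <- aggregate_weighted(values, weights, mapping)
print(regional)
```

Because Brazil is the only country in the BRA region, its regional value equals the country value, whereas the LAM value becomes the weighted mean of the remaining countries.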

Bundle cellular: rev4.117_h12_fd712c0b_cellularmagpie_c200_MRI-ESM2-0-ssp370_lpjml-8e6c5eb1.tgz

This bundle contains files processed at all levels, including: cell, country, global, regional, and cluster. Files at the cell, country, and global levels do not require additional processing.

Cellular level

Country level

Global level

Regional level

Cluster level

Bundle regional: rev4.117_h12_magpie.tgz

This bundle contains the largest number of files, most of which are already processed at the regional level. However, it also includes datasets at the global and country levels, which do not require additional processing.

Country level files:

Global level files:

Regional level files:

Bundle validation: rev4.117_h12_validation.tgz

This bundle contains a validation file that was not used in this processing step.

Bundle additional: additional_data_rev4.62.tgz

This bundle contains global-level files; therefore, they were not processed again. The original files were used.

Bundle calibration: calibration_H12_FAO_13Mar25.tgz

This bundle contains regional-level calibration files. These files still need to be studied in greater detail to refine the process; however, for the new BRA region, the same values as those for the LAM region were initially used.
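
Initializing the BRA calibration values from LAM amounts to duplicating the LAM entries under the new region code. The following is a minimal sketch on a toy table. The column names, the numbers, and the helper `copy_region` are made up for illustration and do not reflect the layout of the real calibration files.

```r
calib <- data.frame(
  region = c("EUR", "LAM", "USA"),
  factor = c(1.10, 0.95, 1.02),     # made-up numbers for illustration
  stringsAsFactors = FALSE
)

# Hypothetical helper: add rows for region `to` with the values of `from`
copy_region <- function(tab, from, to) {
  stopifnot(from %in% tab$region, !(to %in% tab$region))
  new_rows <- tab[tab$region == from, , drop = FALSE]
  new_rows$region <- to
  out <- rbind(tab, new_rows)
  out <- out[order(out$region), ]
  rownames(out) <- NULL
  out
}

calib_bra <- copy_region(calib, from = "LAM", to = "BRA")
print(calib_bra)
```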

Summary and next steps

The process of creating new clusters within a newly defined region in the MAgPIE configuration involved both straightforward and complex steps. Given that input data preparation is a crucial step when modifying MAgPIE’s spatial structure, the preprocessing phase plays a key role in refining the resolution and reliability of model outputs for Brazil. While initial tasks—such as acquiring input data and modifying mapping files—were relatively simple, the workflow also required extensive, time-consuming efforts to identify and resolve errors that emerged during the generation of new input files based on the updated spatial configuration.

For the next phases, once all input data has been properly prepared, we aim to replicate the analyses performed using MAgPIE’s original spatial structure. The objective is to compare the key output variables—such as harvested crop areas, cattle herd production, and deforestation—between the newly customized spatial configuration and the original setup. Once more, we will benchmark our results against official Brazilian public datasets to validate consistency and enhance reliability. Additionally, by defining Brazil as a standalone region, we will be able to explicitly evaluate trade flows between Brazil and other global regions, enabling more accurate analyses of international trade dynamics involving Brazilian agricultural and land-use sectors.

References

Potsdam Institute for Climate Impact Research (PIK). (2018). Index of /data/magpie/public. https://rse.pik-potsdam.de/data/magpie/public/.

Potsdam Institute for Climate Impact Research (PIK). (2025). MAgPIE - an open source land-use modeling framework 4.10.0. https://rse.pik-potsdam.de/doc/magpie/4.10.0/index.htm.

Food and Agriculture Organization of the United Nations (FAO). (2025). FAOSTAT - food and agriculture data. https://www.fao.org/faostat/en/#home.

Working Report II: Input Data Processing and Preparation for MAgPIE

Letícia F. Godoi, Mário L. Vicchietti, Fernando M. Ramos

2026-03-23