OSSL: Global Soil Spectral Calibration Models

Jose L. Safanelli ([email protected]), Tomislav Hengl ([email protected]), Leandro Parente ([email protected]), and Jonathan Sanderman ([email protected]) 04 December, 2023


This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Open Soil Spectral Library

Part of: https://github.com/soilspectroscopy
Project: Soil Spectroscopy for Global Good
Last update: 2023-12-04
Dataset: OSSL

The directory/folder path:

# dir <- "/mnt/soilspec4gg/ossl/ossl_models/mlr3/"
dir <- "~/mnt-ossl/ossl_models/"
# db.dir <- "/mnt/soilspec4gg/ossl/ossl_import/"
db.dir <- "~/mnt-ossl/ossl_import/"
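The chunks below assume the required packages are already attached; a minimal setup sketch (the original repository may load these elsewhere):

library(tidyverse)          # dplyr, tidyr, readr, ggplot2, etc.
library(qs)                 # fast serialization (qread/qsave)
library(prospectr)          # spectral preprocessing (standardNormalVariate)
library(parallel)           # mclapply
library(mlr3verse)          # mlr3 tasks, learners, tuning, resampling
library(mlr3extralearners)  # provides the regr.cubist learner
library(yardstick)          # rmse_vec, msd_vec, rsq_vec, ccc_vec, rpiq_vec
library(lubridate)          # now()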

Overview

This tutorial explains the steps required to fit soil spectral prediction models for the Soil Spectroscopy for Global Good project using the mlr3 framework. After several rounds of model evaluation, including tests across several instruments and diverse independent test sets, we are releasing fine-tuned Cubist models. Cubist is a machine learning algorithm that uses a decision-tree splitting method but fits linear regression models at each terminal leaf; it also uses a boosting mechanism (sequential trees adjusted by weights) that grows a forest controlled by the number of committees. We have not used the nearest-neighbor correction of final predictions because this feature is not available in the mlr3 framework.
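As a minimal illustration of the base algorithm (using the Cubist package directly on a built-in dataset, outside the OSSL pipeline):

library(Cubist)

# Boosted, rule-based trees: each rule carries its own linear model,
# and committees control how many sequential trees are grown
fit <- cubist(x = mtcars[, -1], y = mtcars$mpg, committees = 5)

# neighbors = 0 disables the nearest-neighbor correction, matching
# the setting used in the mlr3 learner later in this tutorial
predict(fit, newdata = mtcars[, -1], neighbors = 0)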

In addition to the predicted response for each property of interest, we provide the uncertainty as an estimated standard deviation along with lower and upper bounds determined via a conformal prediction interval. Lastly, an extra column flags potential spectral underrepresentation, warning when a new sample is not well represented by the feature space of the calibration set.

The following code is implemented in a high-performance computing environment, with many steps fully parallelized and requiring plenty of RAM (> 64 GB). Running this code on a standard computer without subsetting the data is not recommended.

For each soil property of interest, the possible model combinations are:

# MIR, VisNIR and NIR from Neospectra
spectra.type <- c("mir", "visnir", "nir.neospectra")

# Using KSSL only or the whole OSSL
subset <- c("kssl", "ossl")

# Adding (ll) or not adding (na) geocovariates to models
geocovariates <- c("na")

# Basic structure
modeling.combinations <- tibble(spectra_type = spectra.type) %>%
  crossing(subset = subset) %>%
  crossing(geo = geocovariates)

At this stage, Neospectra models are fitted only on the OSSL subset, as the two subsets differ minimally (OSSL ~ KSSL). In addition, models with geocovariates are not implemented in this version, as additional tests are required.

# Target models
modeling.combinations <- modeling.combinations %>%
  dplyr::filter(!(grepl("neospectra", spectra_type) & subset == "kssl")) %>%
  dplyr::filter(!(geo == "ll"))

# Model name
modeling.combinations <- modeling.combinations %>%
  mutate(model_name = paste0(spectra_type, "_cubist_",
                             subset, "_", geo, "_v1.2"), .before = 1)

modeling.combinations %>%
  knitr::kable()
| model_name | spectra_type | subset | geo |
|---|---|---|---|
| mir_cubist_kssl_na_v1.2 | mir | kssl | na |
| mir_cubist_ossl_na_v1.2 | mir | ossl | na |
| nir.neospectra_cubist_ossl_na_v1.2 | nir.neospectra | ossl | na |
| visnir_cubist_kssl_na_v1.2 | visnir | kssl | na |
| visnir_cubist_ossl_na_v1.2 | visnir | ossl | na |

After database update v1.2, models are fitted for the following soil properties of interest:

soil.properties <- read_csv("../out/ossl_models_soil_properties.csv")

soil.properties <- soil.properties %>%
  filter(include == TRUE) %>%
  mutate(export_name = ifelse(log == TRUE, paste0("log..", soil_property), soil_property)) %>%
  select(-include, -log)

soil.properties.names <- soil.properties %>%
  pull(soil_property)

knitr::kable(soil.properties)
| soil_property | description | export_name |
|---|---|---|
| acidity_usda.a795_cmolc.kg | Acidity, BaCl2-TEA Extractable, pH 8.2 | log..acidity_usda.a795_cmolc.kg |
| aggstb_usda.a1_w.pct | Aggregate Stability | log..aggstb_usda.a1_w.pct |
| al.dith_usda.a65_w.pct | Aluminum, Crystalline, Total Pedogenic Iron | log..al.dith_usda.a65_w.pct |
| al.ext_usda.a1056_mg.kg | Aluminum, Extractable, Mehlich3 | log..al.ext_usda.a1056_mg.kg |
| al.ext_usda.a69_cmolc.kg | Aluminum, Extractable, KCl | log..al.ext_usda.a69_cmolc.kg |
| al.ox_usda.a59_w.pct | Aluminum, Amorphous, Total Non-Crystalline Iron | log..al.ox_usda.a59_w.pct |
| awc.33.1500kPa_usda.c80_w.frac | Available Water Content, Difference 33-1500 kPa | log..awc.33.1500kPa_usda.c80_w.frac |
| b.ext_mel3_mg.kg | Boron, Extractable, Mehlich3 | log..b.ext_mel3_mg.kg |
| bd_usda.a4_g.cm3 | Bulk Density, <2mm fraction, Clod | log..bd_usda.a4_g.cm3 |
| c.tot_usda.a622_w.pct | Carbon, Total NCS | log..c.tot_usda.a622_w.pct |
| ca.ext_usda.a1059_mg.kg | Calcium, Extractable, Mehlich3 | log..ca.ext_usda.a1059_mg.kg |
| ca.ext_usda.a722_cmolc.kg | Calcium, Extractable, NH4OAc | log..ca.ext_usda.a722_cmolc.kg |
| caco3_usda.a54_w.pct | Carbonate, <2mm Fraction | log..caco3_usda.a54_w.pct |
| cec_usda.a723_cmolc.kg | CEC, pH 7.0, NH4OAc, 2M KCl displacement | log..cec_usda.a723_cmolc.kg |
| cf_usda.c236_w.pct | Coarse Fragments, Greater 2mm | log..cf_usda.c236_w.pct |
| clay.tot_usda.a334_w.pct | Clay | clay.tot_usda.a334_w.pct |
| cu.ext_usda.a1063_mg.kg | Copper, Extractable, Mehlich3 | log..cu.ext_usda.a1063_mg.kg |
| ec_usda.a364_ds.m | Electrical Conductivity, (w/w) | log..ec_usda.a364_ds.m |
| efferv_usda.a479_class | Effervescence, 1N HCl | efferv_usda.a479_class |
| fe.dith_usda.a66_w.pct | Iron, Crystalline, Total Pedogenic Iron | log..fe.dith_usda.a66_w.pct |
| fe.ext_usda.a1064_mg.kg | Iron, Extractable, Mehlich3 | log..fe.ext_usda.a1064_mg.kg |
| fe.ox_usda.a60_w.pct | Iron, Amorphous, Total Non-Crystalline Iron | log..fe.ox_usda.a60_w.pct |
| k.ext_usda.a1065_mg.kg | Potassium, Extractable, Mehlich3 | log..k.ext_usda.a1065_mg.kg |
| k.ext_usda.a725_cmolc.kg | Potassium, Extractable, NH4OAc, 2M KCl displacement | log..k.ext_usda.a725_cmolc.kg |
| mg.ext_usda.a1066_mg.kg | Magnesium, Extractable, Mehlich3 | log..mg.ext_usda.a1066_mg.kg |
| mg.ext_usda.a724_cmolc.kg | Magnesium, Extractable, NH4OAc, 2M KCl displacement | log..mg.ext_usda.a724_cmolc.kg |
| mn.ext_usda.a1067_mg.kg | Manganese, Extractable, Mehlich3 | log..mn.ext_usda.a1067_mg.kg |
| mn.ext_usda.a70_mg.kg | Manganese, Extractable, KCl | log..mn.ext_usda.a70_mg.kg |
| n.tot_usda.a623_w.pct | Nitrogen, Total NCS | log..n.tot_usda.a623_w.pct |
| na.ext_usda.a1068_mg.kg | Sodium, Extractable, Mehlich3 | log..na.ext_usda.a1068_mg.kg |
| na.ext_usda.a726_cmolc.kg | Sodium, Extractable, NH4OAc, 2M KCl displacement | log..na.ext_usda.a726_cmolc.kg |
| oc_usda.c729_w.pct | Organic Carbon, Total C without CaCO3, S prep | log..oc_usda.c729_w.pct |
| p.ext_usda.a274_mg.kg | Phosphorus, Extractable, Olsen | log..p.ext_usda.a274_mg.kg |
| p.ext_usda.a1070_mg.kg | Phosphorus, Extractable, Mehlich3 | log..p.ext_usda.a1070_mg.kg |
| p.ext_usda.a270_mg.kg | Phosphorus, Extractable, Bray1 | log..p.ext_usda.a270_mg.kg |
| ph.cacl2_usda.a481_index | pH, 1:2 Soil-CaCl2 Suspension | ph.cacl2_usda.a481_index |
| ph.h2o_usda.a268_index | pH, 1:1 Soil-Water Suspension | ph.h2o_usda.a268_index |
| s.tot_usda.a624_w.pct | Sulfur, Total NCS | log..s.tot_usda.a624_w.pct |
| sand.tot_usda.c60_w.pct | Sand, Total, S prep | sand.tot_usda.c60_w.pct |
| silt.tot_usda.c62_w.pct | Silt, Total, S prep | silt.tot_usda.c62_w.pct |
| wr.1500kPa_usda.a417_w.pct | Water Retention, 15 Bar (1500 kPa) | log..wr.1500kPa_usda.a417_w.pct |
| wr.33kPa_usda.a415_w.pct | Water Retention, 1/3 Bar (33 kPa) | log..wr.33kPa_usda.a415_w.pct |
| zn.ext_usda.a1073_mg.kg | Zinc, Extractable, Mehlich3 | log..zn.ext_usda.a1073_mg.kg |

Note: some soil properties are natural-log transformed (with an offset of 1, i.e., the log1p function) to improve prediction performance for properties with skewed distributions. They are back-transformed only at the very end, after all modeling steps, including performance estimation and the definition of the uncertainty intervals.
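As a minimal sketch of this convention (illustrative values):

# Forward transform applied before model fitting: y* = log(1 + y)
oc.original <- c(0.1, 1.5, 12.0) # hypothetical organic carbon values (w.pct)
oc.transformed <- log1p(oc.original)

# Back-transform applied only after all modeling steps: y = exp(y*) - 1
all.equal(expm1(oc.transformed), oc.original) # TRUE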

Final modeling combinations:

modeling.combinations <- soil.properties %>%
  crossing(modeling.combinations)

write_csv(modeling.combinations, "../out/modeling_combinations_v1.2.csv")

Spectral ranges

A few datasets in the OSSL have distinct spectral ranges. We strictly define the ranges in the chunk below, following the OSSL column naming. MIR is represented in log10 absorbance units, while VisNIR and NIR are represented in reflectance units.

mir.spectral.range <- paste0("scan_mir.", seq(600, 4000, by = 2), "_abs")
visnir.spectral.range <- paste0("scan_visnir.", seq(400, 2500, by = 2), "_ref")
nir.neospectra.spectral.range <- paste0("scan_nir.", seq(1350, 2550, by = 2), "_ref")
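A quick sanity check on the expected number of spectral columns (the counts follow directly from the sequences above):

length(mir.spectral.range)            # 1701 columns (600-4000 cm-1, 2 cm-1 step)
length(visnir.spectral.range)         # 1051 columns (400-2500 nm, 2 nm step)
length(nir.neospectra.spectral.range) #  601 columns (1350-2550 nm, 2 nm step)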

Checking folders

target.dirs <- paste0(dir, soil.properties$export_name)

# Folders already created?
invisible(sapply(target.dirs, function(x) {
  if(!dir.exists(x)){dir.create(x)}
}))

Loading regression matrix

The regression-matrices were produced in the OSSL-imports repository.

## Reading OSSL level 1
rm.ossl <- qread(paste0(db.dir, "ossl_all_L1_v1.2.qs"))

# Preparing the level 1 soil data to be joined with the Neospectra scans
neospectra.soillab <- rm.ossl %>%
  dplyr::select(id.layer_uuid_txt, id.scan_local_c,
                all_of(soil.properties.names)) %>%
  filter(grepl("XS|XN", id.scan_local_c)) %>%
  mutate(id.scan_local_c = gsub("XS|XN", "", id.scan_local_c))

# Keeping only the important columns
rm.ossl <- rm.ossl %>%
  dplyr::select(id.layer_uuid_txt, dataset.code_ascii_txt,
                any_of(soil.properties.names), all_of(visnir.spectral.range), all_of(mir.spectral.range))

# head(names(rm.ossl), 20)
# tail(names(rm.ossl), 20)

## Reading Neospectra datasets
neospectra.nir <- qread(paste0(db.dir, "neospectra_nir_v1.2.qs"))

# head(names(neospectra.nir), 20)

# Averaging spectra collected by multiple instruments
neospectra.nir <- neospectra.nir %>%
  dplyr::select(id.sample_local_c, all_of(nir.neospectra.spectral.range)) %>%
  group_by(id.sample_local_c) %>%
  summarise_all(mean)

rm.neospectra <- inner_join(neospectra.soillab, neospectra.nir,
                            by = c("id.scan_local_c" = "id.sample_local_c"))

# Selecting only important columns
rm.neospectra <- rm.neospectra %>%
  dplyr::select(id.layer_uuid_txt,
                any_of(soil.properties.names), all_of(nir.neospectra.spectral.range))

# Preparing named list of datasets
# Selecting only important columns, spectra range and rows with available spectra
data.list <- list(
  "mir_cubist_kssl_v1.2" = {rm.ossl %>%
      dplyr::filter(dataset.code_ascii_txt == "KSSL.SSL") %>%
      dplyr::select(id.layer_uuid_txt, any_of(soil.properties.names),
                    all_of(mir.spectral.range)) %>%
      dplyr::filter(!is.na(scan_mir.1500_abs))},
  "mir_cubist_ossl_v1.2" = {rm.ossl %>%
      dplyr::select(id.layer_uuid_txt, any_of(soil.properties.names),
                    all_of(mir.spectral.range)) %>%
      dplyr::filter(!is.na(scan_mir.1500_abs))},
  "visnir_cubist_kssl_v1.2" = {rm.ossl %>%
      dplyr::filter(dataset.code_ascii_txt == "KSSL.SSL") %>%
      dplyr::select(id.layer_uuid_txt, any_of(soil.properties.names),
                    all_of(visnir.spectral.range)) %>%
      dplyr::filter(!is.na(scan_visnir.1500_ref))},
  "visnir_cubist_ossl_v1.2" = {rm.ossl %>%
      dplyr::select(id.layer_uuid_txt, any_of(soil.properties.names),
                    all_of(visnir.spectral.range)) %>%
      dplyr::filter(!is.na(scan_visnir.1500_ref))},
  "nir.neospectra_cubist_ossl_v1.2" = {rm.neospectra %>%
      dplyr::select(id.layer_uuid_txt, any_of(soil.properties.names),
                    all_of(nir.neospectra.spectral.range)) %>%
      dplyr::filter(!is.na(scan_nir.1500_ref))}
  )

# # Different sizes
# lapply(data.list, function(x) dim(x))

Preprocessing

To remove baseline offsets and additive/multiplicative scattering effects, and to reduce multicollinearity across spectra from multiple sources, Standard Normal Variate (SNV) preprocessing is applied before PCA compression and model calibration:

prep.list <- lapply(data.list, function(x){
  x %>%
    dplyr::select(-id.layer_uuid_txt, -any_of(soil.properties.names)) %>%
    as.matrix() %>%
    prospectr::standardNormalVariate(X = .) %>%
    as_tibble() %>%
    bind_cols({x %>%
      dplyr::select(id.layer_uuid_txt, any_of(soil.properties.names))}, .)
  })
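SNV simply centers and scales each spectrum (row) by its own mean and standard deviation; a minimal equivalence check in base R (toy matrix, assuming prospectr's row-wise definition):

# SNV: for each spectrum x, (x - mean(x)) / sd(x)
X <- matrix(runif(20), nrow = 2) # two toy "spectra"
snv.manual <- t(apply(X, 1, function(x) (x - mean(x)) / sd(x)))
snv.prospectr <- prospectr::standardNormalVariate(X)
all.equal(snv.manual, snv.prospectr, check.attributes = FALSE) # TRUE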

# # Checking preprocessing. Names are kept consistently across list objects and table columns
# lapply(data.list, function(x) dim(x))

PCA compression

Compress and save PCA models for the different versions of spectra. This is used to reduce the number of dimensions and to control multicollinearity (Chang, Laird, Mausbach, & Hurburgh Jr, 2001). Only the first 120 components are used in the models, but the remaining higher-order components are used for the trustworthiness test: if a sample is underrepresented because of unique features on those higher-order components that are not accounted for by the models, a flag is raised.

Objects from prcomp:

  • sdev: the standard deviations of the principal components.
  • rotation: the matrix of variable loadings (columns are eigenvectors).
  • center: the variable means (means that were subtracted).
  • scale: the variable standard deviations (the scaling applied to each variable).
  • x: the coordinates of the individuals (observations) on the principal components, known as scores.

We can omit x from the objects saved to disk and later reassign the prcomp class for making predictions (see the sketch after the saving chunk below).

Fit PCA models in parallel:

pca.list <- mclapply(1:length(prep.list), function(i) {
  
  x <- prep.list[[i]] %>%
    dplyr::select(starts_with("scan_")) %>%
    as.data.frame()
  
  prcomp(x, center = T, scale = T)
  
}, mc.cores = length(prep.list))

names(pca.list) <- names(prep.list)

# # Checking the number of components of each spectra type
# lapply(pca.list, function(x) ncol(x$rotation))
# 
# # Checking how many components explain 95% of the original variance
# # (variance explained uses the squared standard deviations)
# lapply(pca.list, function(x) {which(cumsum(x$sdev^2/sum(x$sdev^2)) > 0.95)[1]})
# 
# # Checking how many components explain 99% of the original variance
# lapply(pca.list, function(x) {which(cumsum(x$sdev^2/sum(x$sdev^2)) > 0.99)[1]})
# 
# # Checking how many components explain 99.9% of the original variance
# lapply(pca.list, function(x) {which(cumsum(x$sdev^2/sum(x$sdev^2)) > 0.999)[1]})
# 
# # Checking how many components explain 99.99% of the original variance
# lapply(pca.list, function(x) {which(cumsum(x$sdev^2/sum(x$sdev^2)) > 0.9999)[1]})

Saving as qs files:

## Saving simplified pca models only for prediction purposes (m = modified):
# Omitting 5th object of prcomp, i.e x table (scores)
for(i in 1:length(pca.list)){
  qsave(pca.list[[i]][1:4], paste0(dir, "pca.ossl/mpca_", names(pca.list)[i], ".qs"))
}
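A minimal sketch of reusing a simplified PCA model later (the file name follows the pattern above; new.spectra is a hypothetical table of SNV-preprocessed spectra with the calibration columns):

# Subsetting with [1:4] dropped the "prcomp" class, so reassign it
mpca <- qread(paste0(dir, "pca.ossl/mpca_mir_cubist_ossl_v1.2.qs"))
class(mpca) <- "prcomp"

# Project new spectra onto the saved components and keep the first 120
new.scores <- predict(mpca, newdata = new.spectra)
new.scores.120 <- new.scores[, paste0("PC", 1:120)]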

## Saving the scores to be used as predictors
for(i in 1:length(pca.list)){
  
  idata <- prep.list[[i]] %>%
    dplyr::select(id.layer_uuid_txt, all_of(soil.properties.names))
  
  iscores <- pca.list[[i]]$x %>%
    as_tibble()
  
  idata.export <- bind_cols(idata, iscores)
  
  qsave(idata.export, paste0(dir, "pca.ossl/pca_scores_", names(pca.list)[i], ".qs"))
  
}

Target models

modeling.combinations <- read_csv("../out/modeling_combinations_v1.2.csv")
count.table <- read_csv("../out/tab_dataset_count.csv")

# Defining available data from ossl-import
count.table <- count.table %>%
  filter(dataset %in% c("KSSL.SSL", "OSSL")) %>%
  rename(subset = dataset) %>%
  mutate(subset = recode(subset,
                         "KSSL.SSL" = "kssl",
                         "OSSL" = "ossl")) %>%
  pivot_longer(-all_of(c("soil_property", "subset")),
               names_to = "spectra_type", values_to = "count") %>%
  mutate(spectra_type = recode(spectra_type,
                               "n_mir" = "mir",
                               "n_visnir" = "visnir",
                               "n_neospectra" = "nir.neospectra"))

# Defining models with more than 500 observations
modeling.combinations <- left_join(modeling.combinations,
                                   count.table,
                                   by = c("soil_property", "spectra_type", "subset"))

modeling.combinations <- modeling.combinations %>%
  filter(count > 500) %>%
  filter(!(soil_property == "efferv_usda.a479_class"))

write_csv(modeling.combinations, "../out/fitted_modeling_combinations_v1.2.csv")

# Available soil properties
modeling.combinations %>%
  distinct(soil_property) %>%
  count()
## # A tibble: 1 × 1
##       n
##   <int>
## 1    42
# Final modeling combinations
modeling.combinations %>%
  count(spectra_type, subset)
## # A tibble: 5 × 3
##   spectra_type   subset     n
##   <chr>          <chr>  <int>
## 1 mir            kssl      41
## 2 mir            ossl      42
## 3 nir.neospectra ossl      29
## 4 visnir         kssl       6
## 5 visnir         ossl      23

Model fitting

For model calibration we use the first 120 PCs, a cutoff chosen considering the trade-off between spectral representation and compression. In an internal experiment we found this number to perform as well as a cutoff based on 99% cumulative explained variance (which includes many more features and hinders compression).

The use of compressed PCs is presented in Chang et al. (2001), although the authors used only the top 10 components.

The following framework is used:

  • Base learner: Cubist (regr.cubist).
  • Hyperparameter optimization (HPO) is done with internal (inner) resampling using 5-fold cross-validation on a smaller subset to speed up this operation. It uses a grid search that tests up to 5 configurations within the predefined range, with RMSE as the loss function. A final model is then fitted on the full data with the best hyperparameters.
  • Performance evaluation is performed with an external (outer) 10-fold cross-validation (cv10) of the final models.
  • Cross-validation predictions, an accuracy plot, and goodness-of-fit metrics are exported to disk.
  • Uncertainty model via conformal prediction: the cross-validated predictions are used to estimate absolute residuals (errors), which are fed to a new model with the same fine-tuned structure as the response model. The observed and predicted errors are used to estimate conformity scores for a given confidence level (68% ~ 1 standard deviation, 90%, 95%, etc.); see the sketch below.
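A minimal sketch of the conformal logic described above (toy values; q.hat naming is ours, but alpha.scores mirrors the alpha_scores computed in the uncertainty chunk later):

# Toy calibration-side quantities from cross-validation
truth <- c(1.0, 2.0, 3.0, 4.0)              # observed values
response <- c(1.2, 1.7, 3.4, 3.9)           # CV-predicted values
pred.residual <- c(0.15, 0.25, 0.30, 0.20)  # error model's predicted absolute residuals

# Conformity scores: observed error relative to predicted error
alpha.scores <- abs(truth - response) / pred.residual

# Conformity quantile at the desired confidence level (68% ~ 1 std dev)
q.hat <- quantile(alpha.scores, probs = 0.68, na.rm = TRUE)

# For a new sample, scale its predicted error by q.hat to get the interval
new.pred <- 2.5; new.pred.error <- 0.22
c(lower = new.pred - q.hat * new.pred.error,
  upper = new.pred + q.hat * new.pred.error)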
## Parallelization is done inside the autotuner
lgr::get_logger("mlr3")$set_threshold("warn")

i=1
for(i in 1:nrow(modeling.combinations)) {

  isoil_property = modeling.combinations[[i,"soil_property"]]
  imodel_name = modeling.combinations[[i,"model_name"]]
  iexport_name = modeling.combinations[[i,"export_name"]]
  ispectra_type = modeling.combinations[[i,"spectra_type"]]
  isubset = modeling.combinations[[i,"subset"]]
  igeo = modeling.combinations[[i,"geo"]]

  # Learner
  learner_cubist = lrn("regr.cubist", predict_type = "response",
                       neighbors = 0, unbiased = FALSE, seed = 1993)

  # Hyperparameters space
  search_space_ensemble = ps(
    committees = p_int(1, 20) # Up to 20 trees are fitted sequentially
  )

  # PCA scores
  n.comps <- 120
  selected.comps <- paste0("PC", seq(1, n.comps, by = 1))

  data <- qread(paste0(dir, "pca.ossl/pca_scores_", ispectra_type, "_cubist_", isubset, "_v1.2.qs"))

  # Apply log transform to soil property
  if(grepl("log..", iexport_name)){

    data <- data %>%
      mutate(!!isoil_property := log1p(!!as.name(isoil_property)))

  }

  # Defining train data
  sel.data <- data %>%
    select(id.layer_uuid_txt, # ID column
           all_of(isoil_property), # Target
           all_of(selected.comps)) %>% # Only compressed spectra
    filter(!is.na(!!as.name(isoil_property)))

  # Subset for speeding up HPO
  if(nrow(sel.data) >= 2000) {

    set.seed(1993)
    sel.data.hpo <- sel.data %>%
      sample_n(2000)

  } else {

    sel.data.hpo <- sel.data

  }

  # Exporting train data

  export.data <- paste0(dir, iexport_name, "/task_", imodel_name, ".qs")
  export.data.alt <- gsub("\\.qs", "\\.rds", export.data)

  if(file.exists(export.data)){file.remove(export.data)}
  if(file.exists(export.data.alt)){file.remove(export.data.alt)}

  tryCatch(
    expr = {qsave(sel.data, export.data)},
    error = function(e){saveRDS(sel.data, export.data.alt)}
  )

  # Create regression task
  task.hpo <- as_task_regr(sel.data.hpo, id = "hpo", target = isoil_property, type = "regression")
  task.train <- as_task_regr(sel.data, id = "train", target = isoil_property, type = "regression")

  # Defining id column
  task.hpo$set_col_roles("id.layer_uuid_txt", roles = "name")
  task.train$set_col_roles("id.layer_uuid_txt", roles = "name")

  # For block CV. If 'id.tile' not present in the data.frame, default to random CV
  if("id.tile" %in% colnames(sel.data)) {
    task.hpo$set_col_roles("id.tile", roles = "group")
    task.train$set_col_roles("id.tile", roles = "group")
  }

  # Inner resampling for HPO with 5-fold cv
  inner_resampling = rsmp("cv", folds = 5)

  # Auto tuner
  # Grid search
  # Resolution 5 means that within the range, 5 equidistant values will be tested
  at = auto_tuner(tuner = tnr("grid_search", resolution = 5, batch_size = 1),
                  learner = learner_cubist,
                  resampling = inner_resampling,
                  measure = msr("regr.rmse"),
                  search_space = search_space_ensemble,
                  terminator = trm("none"),
                  store_models = FALSE)

  # Multicore processing
  future::plan("multisession")

  # Fit autotuner
  at$train(task.hpo)

  # # Overview
  # at$tuning_result
  # at$tuning_instance

  # Final model with best HPO
  final.model <- learner_cubist
  final.model$param_set$values = at$tuning_result$learner_param_vals[[1]]
  final.model$train(task.train)

  future:::ClusterRegistry("stop")

  # # Overview
  # summary(final.model$model$regr.lm$model)

  # Saving trained final model to disk
  export.model <- paste0(dir, iexport_name, "/model_", imodel_name, ".qs")
  export.model.alt <- gsub("\\.qs", "\\.rds", export.model)

  if(file.exists(export.model)){file.remove(export.model)}
  if(file.exists(export.model.alt)){file.remove(export.model.alt)}

  tryCatch(
    expr = {qsave(final.model, export.model)},
    error = function(e){saveRDS(final.model, export.model.alt)}
  )

}
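A minimal sketch of using a saved model for prediction (file names follow the patterns above; new.scores.120 comes from the PCA sketch earlier):

# Load the fine-tuned Cubist learner saved above
model <- qread(paste0(dir, "log..oc_usda.c729_w.pct/model_mir_cubist_ossl_na_v1.2.qs"))

# Predict on the first 120 PCA scores of new samples
pred <- model$predict_newdata(as.data.frame(new.scores.120))

# Back-transform log1p-scale predictions for log-transformed properties
pred.original <- expm1(pred$response)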

Model evaluation

We can export the summary accuracy statistics and standard plot visualizations for all modeling combinations using 10-fold cross-validation.

fitted.modeling.combinations <- read_csv("../out/fitted_modeling_combinations_v1.2.csv")

## Evaluation pipeline
lgr::get_logger("mlr3")$set_threshold("warn")

i=1
for(i in 1:nrow(fitted.modeling.combinations)) {

  # Parameters
  isoil_property = fitted.modeling.combinations[[i,"soil_property"]]
  imodel_name = fitted.modeling.combinations[[i,"model_name"]]
  iexport_name = fitted.modeling.combinations[[i,"export_name"]]
  ispectra_type = fitted.modeling.combinations[[i,"spectra_type"]]
  isubset = fitted.modeling.combinations[[i,"subset"]]
  igeo = fitted.modeling.combinations[[i,"geo"]]

  cat(paste0("Run ", i, "/", nrow(fitted.modeling.combinations), " at ", lubridate::now(), "\n"))

  # Task
  sel.data <- qread(paste0(dir,
                           iexport_name,
                           "/task_",
                           imodel_name,
                           ".qs"))

  # Create regression task
  task <- as_task_regr(sel.data, id = "train",
                       target = isoil_property,
                       type = "regression")

  # Defining id column
  task$set_col_roles("id.layer_uuid_txt", roles = "name")

  # For block CV. If 'id.tile' not present in the data.frame, default to random CV
  if("id.tile" %in% colnames(sel.data)) {
    task$set_col_roles("id.tile", roles = "group")
  }

  # Autotuned model
  model <- qread(paste0(dir,
                        iexport_name,
                        "/model_",
                        imodel_name,
                        ".qs"))

  # Outer 10-fold CV evaluation, reusing the fine-tuned learner
  tuned_learner <- model

  future::plan("multisession")

  set.seed(1993)
  rr = mlr3::resample(task = task,
                      learner = tuned_learner,
                      resampling = rsmp("cv", folds = 10))

  cv.results <- lapply(1:length(rr$predictions("test")), function(i){
    as.data.table(rr$predictions("test")[[i]]) %>%
      mutate(fold = i)})

  cv.results <- Reduce(rbind, cv.results)

  cv.export <- left_join(task$row_names, cv.results, by = c("row_id" = "row_ids")) %>%
    rename(id.layer_uuid_txt = row_name) %>%
    select(-row_id)

  tryCatch(
    expr = {
      qsave(cv.export, paste0(dir,
                              iexport_name,
                              "/cvpred_",
                              imodel_name,
                              ".qs"))
    },
    error = function(e){
      write_csv(cv.export, paste0(dir,
                                  iexport_name,
                                  "/cvpred_",
                                  imodel_name,
                                  ".csv"))
    }
  )

  # Metrics
  performance.metrics <- cv.results %>%
    summarise(n = n(),
              rmse = rmse_vec(truth = truth, estimate = response),
              bias = msd_vec(truth = truth, estimate = response),
              rsq = rsq_vec(truth = truth, estimate = response),
              ccc = ccc_vec(truth = truth, estimate = response, bias = T),
              rpiq = rpiq_vec(truth = truth, estimate = response))

  performance.annotation <- paste0("Lin's CCC = ", round(performance.metrics[[1,"ccc"]], 3),
                                   "\nRMSE = ", round(performance.metrics[[1,"rmse"]], 3))

  performance.metrics <- performance.metrics %>%
    mutate(soil_property = isoil_property,
           model_name = imodel_name,
           .before = 1)

  write_csv(performance.metrics, paste0(dir,
                                        iexport_name,
                                        "/perfmetrics_",
                                        imodel_name,
                                        ".csv"))

  # Plot
  if(grepl("log..", iexport_name)) {

    p.hex <- ggplot(cv.results, aes(x = truth, y = response)) +
      geom_hex(bins = 30, alpha = 0.75) +
      geom_abline(intercept = 0, slope = 1) +
      geom_text(aes(x = -Inf, y = Inf, hjust = -0.1, vjust = 1.2),
                label = performance.annotation) +
      scale_fill_viridis_c(trans = "log10") +
      labs(x = "log(observed)", y = "log(predicted)", fill = "Count",
           title = iexport_name) +
      theme_bw(base_size = 10) +
      theme(legend.position = "bottom",
            plot.title = element_text(hjust = 0.5),
            plot.subtitle = element_text(hjust = 0.5),
            legend.key.size = unit(0.35, "cm"))

    r.max <- max(layer_scales(p.hex)$x$range$range)
    r.min <- min(layer_scales(p.hex)$x$range$range)

    s.max <- max(layer_scales(p.hex)$y$range$range)
    s.min <- min(layer_scales(p.hex)$y$range$range)

    t.max <- round(max(r.max, s.max), 1)
    t.min <- round(min(r.min, s.min), 1)

    p.hex <- p.hex + coord_equal(xlim = c(t.min, t.max), ylim = c(t.min, t.max))

    ggsave(paste0(dir,
                  iexport_name,
                  "/valplot_",
                  imodel_name,
                  ".png"),
           p.hex, dpi = 200, width = 5, height = 5, units = "in", scale = 1)

  } else {

    p.hex <- ggplot(cv.results, aes(x = truth, y = response)) +
      geom_hex(bins = 30, alpha = 0.75) +
      geom_abline(intercept = 0, slope = 1) +
      geom_text(aes(x = -Inf, y = Inf, hjust = -0.1, vjust = 1.2),
                label = performance.annotation) +
      scale_fill_viridis_c(trans = "log10") +
      labs(x = "observed", y = "predicted", fill = "Count",
           title = iexport_name) +
      theme_bw(base_size = 10) +
      theme(legend.position = "bottom",
            plot.title = element_text(hjust = 0.5),
            plot.subtitle = element_text(hjust = 0.5),
            legend.key.size = unit(0.35, "cm"))

    r.max <- max(layer_scales(p.hex)$x$range$range)
    r.min <- min(layer_scales(p.hex)$x$range$range)

    s.max <- max(layer_scales(p.hex)$y$range$range)
    s.min <- min(layer_scales(p.hex)$y$range$range)

    t.max <- round(max(r.max, s.max), 1)
    t.min <- round(min(r.min, s.min), 1)

    p.hex <- p.hex + coord_equal(xlim = c(t.min, t.max), ylim = c(t.min, t.max))

    ggsave(paste0(dir,
                  iexport_name,
                  "/valplot_",
                  imodel_name,
                  ".png"),
           p.hex, dpi = 200, width = 5, height = 5, units = "in", scale = 1)

  }

  cat(paste0("Run ", i, "/", nrow(fitted.modeling.combinations), "\n\n"))

}

Example of an accuracy plot:

Uncertainty model

## Parallelization is done inside the autotuner
lgr::get_logger("mlr3")$set_threshold("warn")

i=1
for(i in 1:nrow(modeling.combinations)) {

  isoil_property = modeling.combinations[[i,"soil_property"]]
  imodel_name = modeling.combinations[[i,"model_name"]]
  iexport_name = modeling.combinations[[i,"export_name"]]
  ispectra_type = modeling.combinations[[i,"spectra_type"]]
  isubset = modeling.combinations[[i,"subset"]]
  igeo = modeling.combinations[[i,"geo"]]

  # PCA scores
  n.comps <- 120
  selected.comps <- paste0("PC", seq(1, n.comps, by = 1))

  data <- qread(paste0(dir, "pca.ossl/pca_scores_", ispectra_type, "_cubist_", isubset, "_v1.2.qs"))

  # CV predictions
  predictions <- qread(paste0(dir, iexport_name, "/cvpred_", imodel_name, ".qs"))

  predictions <- predictions %>%
    mutate(residual = abs(truth-response)) %>%
    select(id.layer_uuid_txt, residual)

  data <- right_join(data, predictions, by = "id.layer_uuid_txt") %>%
    relocate(residual, .after = id.layer_uuid_txt)

  # Defining train data
  sel.data <- data %>%
    select(id.layer_uuid_txt, # ID column
           residual, # Target
           all_of(selected.comps)) %>% # Only compressed spectra
    filter(!is.na(residual))

  # Create regression task
  task.error <- as_task_regr(sel.data, id = "train", target = "residual", type = "regression")

  # Defining id column
  task.error$set_col_roles("id.layer_uuid_txt", roles = "name")

  # For block CV. If 'id.tile' not present in the data.frame, default to random CV
  if("id.tile" %in% colnames(sel.data)) {
    task.error$set_col_roles("id.tile", roles = "group")
  }

  # Importing the fine-tuned response model and reusing its structure for the error model
  error.model <- qread(paste0(dir, iexport_name, "/model_", imodel_name, ".qs"))

  # Train
  error.model$train(task.error)

  # Saving trained error model to disk
  export.file <- paste0(dir, iexport_name, "/error_model_", imodel_name, ".qs")
  export.file.alt <- gsub("\\.qs", "\\.rds", export.file)

  if(file.exists(export.file)){file.remove(export.file)}
  if(file.exists(export.file.alt)){file.remove(export.file.alt)}

  tryCatch(
    expr = {qsave(error.model, export.file)},
    error = function(e){saveRDS(error.model, export.file.alt)}
  )

  # Getting residual predictions and saving to disk
  results <- predict(error.model, newdata = as.data.table(task.error))

  pred.export <- sel.data %>%
    select(-starts_with("PC")) %>%
    mutate(pred_residual = results) %>%
    mutate(alpha_scores = residual/pred_residual)

  export.error <- paste0(dir, iexport_name, "/error_pred_", imodel_name, ".qs")
  export.error.alt <- gsub("\\.qs", "\\.rds", export.error)

  if(file.exists(export.error)){file.remove(export.error)}
  if(file.exists(export.error.alt)){file.remove(export.error.alt)}

  tryCatch(
    expr = {qsave(pred.export, export.error)},
    error = function(e){saveRDS(pred.export, export.error.alt)}
  )

  cat(paste0("Exported error model at ", now(), "\n\n"))

}
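A minimal sketch of combining the exported pieces at prediction time (reusing pred, q.hat, and new.scores.120 from the earlier sketches; file names follow the patterns above):

# Predict the absolute residual (error) for new samples
error.model <- qread(paste0(dir, "log..oc_usda.c729_w.pct/error_model_mir_cubist_ossl_na_v1.2.qs"))
new.pred.error <- error.model$predict_newdata(as.data.frame(new.scores.120))$response

# Conformal bounds on the modeling (log1p) scale, back-transformed at the end
lower <- expm1(pred$response - q.hat * new.pred.error)
upper <- expm1(pred$response + q.hat * new.pred.error)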

References

Chang, C.-W., Laird, D., Mausbach, M. J., & Hurburgh Jr, C. R. (2001). Near-infrared reflectance spectroscopy–principal components regression analyses of soil properties. Soil Science Society of America Journal, 65(2), 480. doi:10.2136/sssaj2001.652480x