src/dwc_mapping.Rmd

---
title: "Darwin Core mapping of VMM rattenapp data"
author:
- Damiano Oldoni
- Peter Desmet
date: "`r Sys.Date()`"
output:
  html_document:
    df_print: paged
    number_sections: yes
    toc: yes
    toc_depth: 3
    toc_float: yes
---

# Setup 

```{r setup, include = FALSE}
knitr::opts_chunk$set(echo = TRUE, warning = TRUE, message = TRUE)
```

Load libraries:

```{r load_pkgs, message = FALSE}
library(dplyr)          # To transform data
library(tidylog)        # To provide feedback on dplyr functions
library(readr)          # To read and write tabular text files
library(purrr)          # To work with lists
library(stringr)        # To work with strings
library(here)           # To find files
library(glue)           # To insert variables in strings
library(readxl)         # To read Excel files
library(DBI)            # To create and query databases
library(RSQLite)        # To work with SQLite databases
library(digest)         # To generate hashes
```

## Read observations

Read all observations as fetched and merged in `./src/fetch_data.Rmd`:

```{r read_obs}
message("Read interim data...")
occurrences <- readr::read_csv(
  file = here::here("data", "interim", "raw_occurrences.csv"),
  col_types = cols(.default = col_character()))
```

as well the ones to be removed as already published on GBIF by LIFE MICA:

```{r read_life_mica_data}
life_mica_obs <- readr::read_csv(
  file = here::here("data", "interim", "life_mica_obs.csv"),
  col_types = cols(.default = col_character()))
message("DONE")
```

## Create database

Create a SQLite database with the source data, so it can be queried with SQL in the next steps:

```{r create_db}
message("Create in memory SQLite database...")
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
# import occurrences
DBI::dbWriteTable(con, "occurrences", occurrences)
# import occurrences to filter out
DBI::dbWriteTable(con, "life_mica_obs", life_mica_obs)
message("DONE")
```

# Darwin Core mapping

Create [Occurrence](https://rs.gbif.org/core/dwc_occurrence_2022-02-02.xml) extension:

```{r occurrence}
message("Map occurrences to DwC...")
dwc_occurrence_sql <- glue::glue_sql(
  readr::read_file(here::here("sql", "dwc_occurrence.sql")), 
  .con = con)
dwc_occurrence <- DBI::dbGetQuery(con, dwc_occurrence_sql)
message("DONE")
```

# Save data to CSV

```{r save_csv}
readr::write_csv(
  dwc_occurrence,
  here::here("data", "processed", "occurrence.csv"),
  na = ""
)
```

# Overview changes

In this section we summarize the changes in the DwC output after the very last udpate.

## Read DwC files

Read DwC mapping files from both `main`:

```{r main}
dwc_occurrence_main <- readr::read_csv(
  file = "https://raw.githubusercontent.com/riparias/vmm-rattenapp-occurrences/main/data/processed/occurrence.csv",
  col_types = cols(.default = "c"))
```

and actual branch (`automatic-update-*`):
 
```{r automatic-update}
dwc_occurrence_update <- readr::read_csv(
  file = here::here("data","processed","occurrence.csv"),
  col_types = cols(.default = "c"))
```

## Overview

How many new occurrences, i.e. new `occurrenceID` values, have been added?

```{r new_occurrenceID_values}
new_occs <- 
  dwc_occurrence_update %>%
  filter(!.data$occurrenceID %in% dwc_occurrence_main$occurrenceID)
message("New occurrences:")
print(new_occs)
```

Have some occurrences been removed?

```{r removed_occs}
removed_occs <- 
  dwc_occurrence_main %>%
  filter(!.data$occurrenceID %in% dwc_occurrence_update$occurrenceID)
message("Removed occurrences:")
print(removed_occs)
```

Total number of rows of new DwC output in comparison with the previous version:

```{r nrows}
message(paste("Number of occurrences (new):", nrow(dwc_occurrence_update)))
message(paste("Number of occurrences (old):", nrow(dwc_occurrence_main)))
message(paste("Difference:", 
              nrow(dwc_occurrence_update) - nrow(dwc_occurrence_main))
)
```

## New `Sporen Waarnemingen Naam` values

The DwC `scientificName` is a manual mapping of Dutch vernacular names (e.g. `Roerdomp`) or expressions (e.g. `Eendensterfte circa 25 st`) from field `Sporen Waarnemingen Naam`. It is therefore important to get a list with all new values:

```{r new_Sporen_Waarnemingen_Naam}
occs_without_sc_name <- 
  dwc_occurrence_update %>%
  filter(is.na(scientificName)) %>%
  pull(occurrenceID)
query <- glue::glue_sql("
  SELECT 
    DISTINCT o.\"Sporen Waarnemingen Naam\" AS \"SporenWaarnemingenNaam\"
  FROM occurrences AS o 
  WHERE  
  o.\"Registratie ID\" || ':' || o.\"species_name_hash\" IN ({occs_without_sc_name*})
  ", .con = con)
query
raw_sc_names <- dplyr::as_tibble(DBI::dbGetQuery(con, query))
message("New values in field 'Sporen Waarnemingen Naam' to be mapped:")
print(raw_sc_names)
```