generated from trias-project/checklist-recipe
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathdwc_mapping.Rmd
169 lines (135 loc) · 4.62 KB
/
dwc_mapping.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
---
title: "Darwin Core mapping of VMM rattenapp data"
author:
- Damiano Oldoni
- Peter Desmet
date: "`r Sys.Date()`"
output:
html_document:
df_print: paged
number_sections: yes
toc: yes
toc_depth: 3
toc_float: yes
---
# Setup
```{r setup, include = FALSE}
knitr::opts_chunk$set(echo = TRUE, warning = TRUE, message = TRUE)
```
Load libraries:
```{r load_pkgs, message = FALSE}
library(dplyr) # To transform data
library(tidylog) # To provide feedback on dplyr functions
library(readr) # To read and write tabular text files
library(purrr) # To work with lists
library(stringr) # To work with strings
library(here) # To find files
library(glue) # To insert variables in strings
library(readxl) # To read Excel files
library(DBI) # To create and query databases
library(RSQLite) # To work with SQLite databases
library(digest) # To generate hashes
```
## Read observations
Read all observations as fetched and merged in `./src/fetch_data.Rmd`:
```{r read_obs}
message("Read interim data...")
occurrences <- readr::read_csv(
file = here::here("data", "interim", "raw_occurrences.csv"),
col_types = cols(.default = col_character()))
```
as well the ones to be removed as already published on GBIF by LIFE MICA:
```{r read_life_mica_data}
life_mica_obs <- readr::read_csv(
file = here::here("data", "interim", "life_mica_obs.csv"),
col_types = cols(.default = col_character()))
message("DONE")
```
## Create database
Create a SQLite database with the source data, so it can be queried with SQL in the next steps:
```{r create_db}
message("Create in memory SQLite database...")
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
# import occurrences
DBI::dbWriteTable(con, "occurrences", occurrences)
# import occurrences to filter out
DBI::dbWriteTable(con, "life_mica_obs", life_mica_obs)
message("DONE")
```
# Darwin Core mapping
Create [Occurrence](https://rs.gbif.org/core/dwc_occurrence_2022-02-02.xml) extension:
```{r occurrence}
message("Map occurrences to DwC...")
dwc_occurrence_sql <- glue::glue_sql(
readr::read_file(here::here("sql", "dwc_occurrence.sql")),
.con = con)
dwc_occurrence <- DBI::dbGetQuery(con, dwc_occurrence_sql)
message("DONE")
```
# Save data to CSV
```{r save_csv}
readr::write_csv(
dwc_occurrence,
here::here("data", "processed", "occurrence.csv"),
na = ""
)
```
# Overview changes
In this section we summarize the changes in the DwC output after the very last udpate.
## Read DwC files
Read DwC mapping files from both `main`:
```{r main}
dwc_occurrence_main <- readr::read_csv(
file = "https://raw.githubusercontent.com/riparias/vmm-rattenapp-occurrences/main/data/processed/occurrence.csv",
col_types = cols(.default = "c"))
```
and actual branch (`automatic-update-*`):
```{r automatic-update}
dwc_occurrence_update <- readr::read_csv(
file = here::here("data","processed","occurrence.csv"),
col_types = cols(.default = "c"))
```
## Overview
How many new occurrences, i.e. new `occurrenceID` values, have been added?
```{r new_occurrenceID_values}
new_occs <-
dwc_occurrence_update %>%
filter(!.data$occurrenceID %in% dwc_occurrence_main$occurrenceID)
message("New occurrences:")
print(new_occs)
```
Have some occurrences been removed?
```{r removed_occs}
removed_occs <-
dwc_occurrence_main %>%
filter(!.data$occurrenceID %in% dwc_occurrence_update$occurrenceID)
message("Removed occurrences:")
print(removed_occs)
```
Total number of rows of new DwC output in comparison with the previous version:
```{r nrows}
message(paste("Number of occurrences (new):", nrow(dwc_occurrence_update)))
message(paste("Number of occurrences (old):", nrow(dwc_occurrence_main)))
message(paste("Difference:",
nrow(dwc_occurrence_update) - nrow(dwc_occurrence_main))
)
```
## New `Sporen Waarnemingen Naam` values
The DwC `scientificName` is a manual mapping of Dutch vernacular names (e.g. `Roerdomp`) or expressions (e.g. `Eendensterfte circa 25 st`) from field `Sporen Waarnemingen Naam`. It is therefore important to get a list with all new values:
```{r new_Sporen_Waarnemingen_Naam}
occs_without_sc_name <-
dwc_occurrence_update %>%
filter(is.na(scientificName)) %>%
pull(occurrenceID)
query <- glue::glue_sql("
SELECT
DISTINCT o.\"Sporen Waarnemingen Naam\" AS \"SporenWaarnemingenNaam\"
FROM occurrences AS o
WHERE
o.\"Registratie ID\" || ':' || o.\"species_name_hash\" IN ({occs_without_sc_name*})
", .con = con)
query
raw_sc_names <- dplyr::as_tibble(DBI::dbGetQuery(con, query))
message("New values in field 'Sporen Waarnemingen Naam' to be mapped:")
print(raw_sc_names)
```