DBNL corpus: footnotes #1346
Labels:
- affects-elasticsearch-index: changes that require re-indexing elasticsearch data
- corpus: changes to corpus definitions or new corpora
- good first issue
A tip from a researcher:
The extractor for the DBNL data puts footnotes in the body of the text. This is inappropriate because footnotes are often later additions to the text, so it introduces anachronisms. It also breaks up the text awkwardly.
Examples:
Footnotes are marked as `<note>` in the XML, so they should be easy to extract.
I would suggest
So this section:
Currently looks like this on i-analyzer:
Should be formatted as follows.
Content:
Notes:
Note that the reference `[a]` is included in the XML as `<note n="a">`.

To do:
- Add a `notes` field to the corpus. Add an XML extractor for it, and adjust the XML extractor of the main text. You may need to write some Python functions that transform the `BeautifulSoup` tree.
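
As a starting point, the tree transformation could look roughly like the sketch below. It is not the existing DBNL extractor: the function name `split_notes`, the sample XML, and the choice to leave the `[a]` marker in the content are all assumptions made for illustration, and the real corpus definition would need to wire this into its own extractor classes.

```python
from bs4 import BeautifulSoup

# Hypothetical sample; real DBNL XML is more complex than this.
SAMPLE = '<p>A line of verse<note n="a">A later editorial note.</note> continues here.</p>'


def split_notes(xml: str) -> tuple[str, str]:
    """Return (content, notes) with footnotes moved out of the body text."""
    # html.parser ships with Python; the "xml" parser would need lxml installed.
    soup = BeautifulSoup(xml, "html.parser")
    notes = []
    for note in soup.find_all("note"):
        # Build the reference marker from the n attribute, e.g. [a].
        marker = f"[{note.get('n', '')}]"
        notes.append(f"{marker} {note.get_text(' ', strip=True)}")
        # Assumption: keep only the marker in the body where the note stood.
        note.replace_with(marker)
    content = soup.get_text()
    return content, "\n".join(notes)


content, notes = split_notes(SAMPLE)
```

Here `content` would feed the main text field and `notes` the new `notes` field. Whether the marker should remain in the content, and how nested notes are handled, is left open.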