Audit "duplicate line" warnings #77
Comments
These two lines are marked with <del></del>. Perseus 5.0 names the deleted lines "382a" and "383a".
https://github.com/PerseusDL/canonical-greekLit/blob/2d5ba46c2e4b5a0c884548c431503ef7d63d14ce/data/tlg0001/tlg001/tlg0001.tlg001.perseus-grc2.xml#L2044-L2051
One of these caused a spurious "duplicate line" warning:
    warning: Dion.: duplicate line '25.528'
Regarding Theocritus, most of the duplicates are due to typographically split lines (already handled in #27, or by me just now in 04dd4a1), but there are two cases I don't know what to do with.
21.65: No splits here; there are 5 lines between 60 and 65.
https://archive.org/details/idyllsoftheocrit00theo_0/page/139/mode/1up
27.11: There are 5 lines between 5 and 10; or 6 if you count the The
https://archive.org/details/idyllsoftheocrit00theo_0/page/168/mode/1up
The table of line totals formerly hardcoded the WORKS from Table 1 of "SEDES: Metrical Position in Greek Hexameter". But there have been changes to the corpus since then that affect line numbering, for example:
sasansom/sedes#77
sasansom/sedes#79
sasansom/sedes@04dd4a1

Furthermore, Table 1 from the "SEDES" article is produced using an xmlstarlet command running on the source TEI directly, counting l and lb elements, not on the derived CSV files. In our notes for the table we remark that this is because duplicate line numbers cause the counts to come out too low:

    For future reference:

    $ (echo "work,lines"; for a in corpus/*.xml; do echo "$a,$(xmlstarlet sel -t -m '//l' -v '"l"' -n -t -m '//lb' -v '"lb"' -n "$a" | wc -l)"; done) > corpus.csv
    > x <- read.csv("corpus.csv")
    > sum(x$lines)
    [1] 73098
    > summary(x$lines)
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
        479    1017    2434    6092    9628   21356

    ---
    Table 1 numbers checked 2022-09-17, sedes commit cf795ef740.
    ---

    > x <- bind_rows(map_dfr(Sys.glob("corpus/*.csv"), read_csv, col_types = cols(line_n = col_character(), book_n = col_character())))
    > x %>% group_by(work) %>% summarize(n = n())

    NB the line counts you get from counting distinct line numbers in the CSV
    are slightly different (smaller) from what you get from xmlstarlet,
    because of duplicated line numbers.

    > x %>% select(work, book_n, line_n) %>% unique %>% nrow
    [1] 72954
    > x %>% select(work, book_n, line_n) %>% unique %>% group_by(work) %>% summarize(n = n())

In this repository I've started adding a workaround for the duplicate line numbers, counting up a line whenever word_n fails to increase with the same work, book_n, and line_n in input order. But even with that, the automatically determined counts for Callim.Hymn and Q.S. are 1 smaller than they used to be, and unlike Dion. and Theoc., we have not made changes to those texts that should affect line count totals.
I am planning to look at those more closely, but for now, go ahead with the automatically computed line numbers, because that's what all our percentages etc. are based on.

If I repeat the xmlstarlet calculation with current SEDES files (605a27b3af22089379aad22ba96edf113970a7b0), the only change I get is 3 fewer lines in Dion. Using the automatically determined line numbers takes it down another 99 lines across 4 works.

       work        old_num_lines redo_old_num_lines diff1 new_num_lines diff2
       <chr>               <dbl>              <dbl> <dbl>         <int> <dbl>
     1 Phaen.               1155               1155     0          1155     0
     2 Argon.               5834               5834     0          5834     0
     3 Callim.Hymn           941                941     0           940    -1
     4 Hom.Hymn             2342               2342     0          2342     0
     5 Il.                 15683              15683     0         15683     0
     6 Dion.               21356              21353    -3         21259   -97
     7 Od.                 12107              12107     0         12107     0
     8 Q.S.                 8801               8801     0          8800    -1
     9 Sh.                   479                479     0           479     0
    10 Theoc.               2527               2527     0          2524    -3
    11 Theog.               1042               1042     0          1042     0
    12 W.D.                  831                831     0           831     0
    13 total               73098              73095    -3         72996  -102
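As a rough illustration of the workaround described in this comment, here is a minimal Python sketch. It is not the code in the repository. It assumes rows in input order with the columns work, book_n, line_n, and word_n, and it assumes that a change of (work, book_n, line_n) also starts a new line; the function name count_lines and the sample rows are made up for the example.

```python
import csv
import io
from collections import Counter

# Hypothetical sample: line 383 appears twice in a row, and 382 reappears later.
SAMPLE_CSV = """work,book_n,line_n,word_n
X,1,382,1
X,1,382,2
X,1,383,1
X,1,382,1
X,1,383,1
X,1,383,2
X,1,383,1
"""


def count_lines(rows):
    """Count lines per work, in input order.

    A new line starts when (work, book_n, line_n) changes, or when word_n
    fails to increase while that key stays the same.
    """
    counts = Counter()
    prev_key = None
    prev_word_n = None
    for row in rows:
        key = (row["work"], row["book_n"], row["line_n"])
        word_n = int(row["word_n"])
        if key != prev_key or word_n <= prev_word_n:
            counts[row["work"]] += 1
        prev_key, prev_word_n = key, word_n
    return counts


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))

# Counting distinct line numbers undercounts when numbers are reused.
distinct = len({(r["work"], r["book_n"], r["line_n"]) for r in rows})

print(distinct)               # 2
print(count_lines(rows)["X"])  # 5
```

In the sample, the distinct-number count sees only 382 and 383, while the order-aware count also picks up the reused numbers, including the second 383 that directly follows the first.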
Fixed in 490d519 by renumbering some lines, as in Perseus 5.0.
Theoc. 21.65: lines are out of order: 64, 66, 65, 67.
Theoc. 27.11: preceding lines renumbered 8a and 9b.
Reopening this because, despite what #77 (comment) says, nothing seems to have happened with the split lines before Theoc. 5.70 and 5.71. Lines 677 to 689 in 2ec2a12
Handled Theoc. 5.66 in 31a4f58.
At the meeting today, we talked about how the Perseus-assigned line numbers are not always unique, and how that creates problems when trying to make other tables that refer to specific lines.
We discussed adding our own, parallel set of line numbers, guaranteed to be unambiguous. But as I started looking for instances of duplicate line numbers to cite in a new issue, I found that (1) there are not that many instances of reused line numbers, and (2) they may represent source errors that should be fixed anyway.
The ones I looked at had already been fixed in Perseus 5.0 texts (cf. #57), for example https://github.com/PerseusDL/canonical-greekLit/blob/2d5ba46c2e4b5a0c884548c431503ef7d63d14ce/data/tlg0001/tlg001/tlg0001.tlg001.perseus-grc2.xml#L2044-L2051 which changed a sequence (382, 383, 382, 383) to (382a, 383a, 382, 383).
I recommend first reviewing all the cases of duplicate line numbers, because if it turns out they can all be made unambiguous, that saves us the trouble of inventing and maintaining our own numbering schema.
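One possible way to collect the cases for review is to scan the TEI source for line numbers that occur more than once within a book. The Python sketch below is only an illustration under assumptions: it treats the n attribute of the nearest enclosing div (or div1) as the book number, ignores lb elements, and uses a made-up sample document and function name (duplicate_l_numbers).

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical sample reproducing the (382, 383, 382, 383) pattern.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>
<div type="textpart" subtype="book" n="1">
<l n="381">a</l>
<l n="382">b</l>
<l n="383">c</l>
<l n="382">d</l>
<l n="383">e</l>
</div>
</body></text></TEI>"""


def local_name(tag):
    """Strip an XML namespace prefix of the form {uri} from a tag name."""
    return tag.rsplit("}", 1)[-1]


def duplicate_l_numbers(xml_text):
    """Return (book, line) pairs whose n attribute is used by more than one l element."""
    root = ET.fromstring(xml_text)
    seen = Counter()

    def walk(elem, book):
        name = local_name(elem.tag)
        # Assumption: the nearest enclosing numbered div identifies the book.
        if name in ("div", "div1") and elem.get("n") is not None:
            book = elem.get("n")
        if name == "l" and elem.get("n") is not None:
            seen[(book, elem.get("n"))] += 1
        for child in elem:
            walk(child, book)

    walk(root, None)
    return [key for key, count in seen.items() if count > 1]


print(duplicate_l_numbers(SAMPLE_TEI))  # [('1', '382'), ('1', '383')]
```

Each reported pair would still need to be checked by hand against the print edition or Perseus 5.0, since the script cannot tell a source error from a deliberately repeated number.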