Commit
Automatically compute the number of lines per work.
The table of line totals formerly hardcoded the WORKS from Table 1 of
"SEDES: Metrical Position in Greek Hexameter". But there have been
changes to the corpus since then that affect line numbering, for example:

    sasansom/sedes#77
    sasansom/sedes#79
    sasansom/sedes@04dd4a1

Furthermore, Table 1 from the "SEDES" article is produced using an
xmlstarlet command that runs on the source TEI directly, counting l and
lb elements, not on the derived CSV files. In our notes for the table we
remark that this is because duplicate line numbers cause the counts to
come out too low:

    For future reference:

    $ (echo "work,lines"; for a in corpus/*.xml; do echo "$a,$(xmlstarlet sel -t -m '//l' -v '"l"' -n -t -m '//lb' -v '"lb"' -n "$a" | wc -l)"; done) > corpus.csv

    > x <- read.csv("corpus.csv")
    > sum(x$lines)
    [1] 73098
    > summary(x$lines)
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
        479    1017    2434    6092    9628   21356

    ---
    Table 1 numbers checked 2022-09-17, sedes commit cf795ef740.
    ---

    > x <- bind_rows(map_dfr(Sys.glob("corpus/*.csv"), read_csv, col_types = cols(line_n = col_character(), book_n = col_character())))
    > x %>% group_by(work) %>% summarize(n = n())

    NB the line counts you get from counting distinct line numbers in
    the CSV are slightly different (smaller) from what you get from
    xmlstarlet, because of duplicated line numbers.

    > x %>% select(work, book_n, line_n) %>% unique %>% nrow
    [1] 72954
    > x %>% select(work, book_n, line_n) %>% unique %>% group_by(work) %>% summarize(n = n())

In this repository I've started adding a workaround for the duplicate
line numbers, counting up a line whenever word_n fails to increase
within the same work, book_n, and line_n in input order. But even with
that, the automatically determined counts for Callim.Hymn and Q.S. are 1
smaller than they used to be, and unlike Dion. and Theoc., we have not
made changes to those texts that should affect line count totals.
I am planning to look at those more closely, but for now, go ahead with
the automatically computed line counts, because that's what all our
percentages etc. are based on.

If I repeat the xmlstarlet calculation with current SEDES files
(605a27b3af22089379aad22ba96edf113970a7b0), the only change I get is 3
fewer lines in Dion. Using the automatically determined line counts
takes it down another 99 lines across 4 works. In the table, diff1 is
redo_old_num_lines minus old_num_lines, and diff2 is new_num_lines minus
old_num_lines.

       work        old_num_lines redo_old_num_lines diff1 new_num_lines diff2
       <chr>               <dbl>              <dbl> <dbl>         <int> <dbl>
     1 Phaen.               1155               1155     0          1155     0
     2 Argon.               5834               5834     0          5834     0
     3 Callim.Hymn           941                941     0           940    -1
     4 Hom.Hymn             2342               2342     0          2342     0
     5 Il.                 15683              15683     0         15683     0
     6 Dion.               21356              21353    -3         21259   -97
     7 Od.                 12107              12107     0         12107     0
     8 Q.S.                 8801               8801     0          8800    -1
     9 Sh.                   479                479     0           479     0
    10 Theoc.               2527               2527     0          2524    -3
    11 Theog.               1042               1042     0          1042     0
    12 W.D.                  831                831     0           831     0
    13 total               73098              73095    -3         72996  -102
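
For illustration only, here is a minimal sketch of the duplicate-line-number workaround described in this message. It is not the code from this commit. It assumes word-level rows in input order with the CSV columns named in the notes (work, book_n, line_n, word_n). The function names and the toy rows are made up, and the sketch is in Python so that it stands alone. It contrasts counting distinct line numbers, which undercounts, with counting up a line whenever word_n fails to increase under an unchanged (work, book_n, line_n) key.

```python
from collections import Counter


def count_distinct(rows):
    """Count distinct (work, book_n, line_n) keys per work.

    Two lines that share a line number collapse into one key here,
    so this undercounts when line numbers are duplicated.
    """
    keys = {(r["work"], r["book_n"], r["line_n"]) for r in rows}
    return Counter(work for work, _, _ in keys)


def count_lines(rows):
    """Count lines per work from word-level rows given in input order.

    A new line starts when the (work, book_n, line_n) key changes, or
    when word_n fails to increase while the key stays the same, which
    signals a second line reusing the same line number.
    """
    counts = Counter()
    prev_key = None
    prev_word_n = None
    for row in rows:
        key = (row["work"], row["book_n"], row["line_n"])
        word_n = int(row["word_n"])
        # The key comparison comes first, so prev_word_n is never
        # compared while it is still None.
        if key != prev_key or word_n <= prev_word_n:
            counts[row["work"]] += 1
        prev_key, prev_word_n = key, word_n
    return counts


# Toy data (hypothetical): line number "2" of book "1" occurs twice in a row.
ROWS = [
    {"work": "Toy", "book_n": "1", "line_n": "1", "word_n": 1},
    {"work": "Toy", "book_n": "1", "line_n": "1", "word_n": 2},
    {"work": "Toy", "book_n": "1", "line_n": "2", "word_n": 1},
    {"work": "Toy", "book_n": "1", "line_n": "2", "word_n": 2},
    {"work": "Toy", "book_n": "1", "line_n": "2", "word_n": 1},  # word_n resets
    {"work": "Toy", "book_n": "1", "line_n": "2", "word_n": 2},
    {"work": "Toy", "book_n": "1", "line_n": "3", "word_n": 1},
]

if __name__ == "__main__":
    print(count_distinct(ROWS)["Toy"])  # 3: the duplicated line is lost
    print(count_lines(ROWS)["Toy"])     # 4: the duplicated line is counted
```

In this sketch a repeated line number is only detected through the word_n reset or a change of key, so any remaining discrepancy with the xmlstarlet counts would have to come from something else in the data.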