Audit "duplicate line" warnings #77

whoopsedesy · 2022-11-10T00:14:03Z

At the meeting today, we talked about how the Perseus-assigned line numbers are not always unique, and how that creates problems when trying to make other tables that refer to specific lines.

We discussed adding our own parallel set of line numbers, guaranteed to be unambiguous. But as I started looking for duplicate line numbers to use as examples in a new issue, I found that (1) there are not that many instances of reused line numbers, and (2) they may represent source errors that should be fixed anyway.

The ones I looked at had already been fixed in Perseus 5.0 texts (cf. #57), for example https://github.com/PerseusDL/canonical-greekLit/blob/2d5ba46c2e4b5a0c884548c431503ef7d63d14ce/data/tlg0001/tlg001/tlg0001.tlg001.perseus-grc2.xml#L2044-L2051 which changed a sequence (382, 383, 382, 383) to (382a, 383a, 382, 383).

I recommend first reviewing all the cases of duplicate line numbers, because if it turns out they can all be made unambiguous, that saves us the trouble of inventing and maintaining our own numbering schema.

aratus.xml

argonautica.xml a58e572

warning: Argon.: duplicate line '2.382'
warning: Argon.: duplicate line '2.383'

callimachushymns.xml

homerichymns.xml Fix duplicate line numbers in homerichymns.xml #76

warning: Hom.Hymn: duplicate line '3.340'
warning: Hom.Hymn: duplicate line '9.5'
warning: Hom.Hymn: duplicate line '29.5'

iliad.xml

nonnusdionysiaca.xml Resolve duplicate line warnings in nonnusdionysiaca.xml #79

warning: Dion.: duplicate line '25.528'
warning: Dion.: duplicate line '28.83'
warning: Dion.: duplicate line '28.84'
warning: Dion.: duplicate line '28.93'
warning: Dion.: duplicate line '28.94'
warning: Dion.: duplicate line '32.90'
warning: Dion.: duplicate line '37.568'
warning: Dion.: duplicate line '37.625'
warning: Dion.: duplicate line '39.385'
warning: Dion.: duplicate line '40.566'
warning: Dion.: duplicate line '44.145'

odyssey.xml
quintussmyrnaeus.xml
shield.xml

theocritus.xml

 warning: Theoc.: duplicate line '5.66'
 warning: Theoc.: duplicate line '5.66'
 warning: Theoc.: duplicate line '10.20'
 warning: Theoc.: duplicate line '14.1'
 warning: Theoc.: duplicate line '14.2'
 warning: Theoc.: duplicate line '14.2'
 warning: Theoc.: duplicate line '14.3'
 warning: Theoc.: duplicate line '15.1'
 warning: Theoc.: duplicate line '15.3'
 warning: Theoc.: duplicate line '15.3'
 warning: Theoc.: duplicate line '15.14'
 warning: Theoc.: duplicate line '15.24'
 warning: Theoc.: duplicate line '15.26'
 warning: Theoc.: duplicate line '15.38'
 warning: Theoc.: duplicate line '15.60'
 warning: Theoc.: duplicate line '15.60'
 warning: Theoc.: duplicate line '15.60'
 warning: Theoc.: duplicate line '15.61'
 warning: Theoc.: duplicate line '15.72'
 warning: Theoc.: duplicate line '15.73'
 warning: Theoc.: duplicate line '21.37'
 warning: Theoc.: duplicate line '22.66'

theogony.xml
worksanddays.xml Fix duplicate WD 363 #63
```
warning: W.D.: duplicate line 'WD.363'
```
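A minimal sketch of how warnings in this format could be produced, for anyone re-auditing a file by hand. It assumes the line references are already available as `(book_n, line_n)` pairs in document order and that one warning is emitted for every repeat occurrence; the function name `duplicate_line_warnings` is hypothetical and is not the project's actual code.

```python
def duplicate_line_warnings(work, refs):
    """Return one warning string per repeated (book_n, line_n) reference.

    `refs` is a sequence of (book_n, line_n) pairs in document order.
    book_n may be None for works that have no book divisions.
    """
    seen = set()
    warnings = []
    for book_n, line_n in refs:
        key = (book_n, line_n)
        label = line_n if book_n is None else f"{book_n}.{line_n}"
        if key in seen:
            # Every occurrence after the first counts as a duplicate.
            warnings.append(f"warning: {work}: duplicate line '{label}'")
        seen.add(key)
    return warnings


# The Argon. sequence before the fix: (382, 383, 382, 383) in book 2.
argon_refs = [("2", "381"), ("2", "382"), ("2", "383"),
              ("2", "382"), ("2", "383"), ("2", "384")]
for message in duplicate_line_warnings("Argon.", argon_refs):
    print(message)
```

With the sample input this prints the two Argon. warnings listed above; a line number that occurs three times would yield two warnings.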


These two lines are marked with <del></del>. Perseus 5.0 names the deleted lines "382a" and "383a". https://github.com/PerseusDL/canonical-greekLit/blob/2d5ba46c2e4b5a0c884548c431503ef7d63d14ce/data/tlg0001/tlg001/tlg0001.tlg001.perseus-grc2.xml#L2044-L2051 #77

One of these caused a spurious "duplicate line" warning warning: Dion.: duplicate line '25.528' #77

whoopsedesy · 2023-01-03T21:54:47Z

Regarding theocritus, most of the duplicates are due to typographically split lines (already handled in #27 or by me just now in 04dd4a1); but there are two cases I don't know what to do with.

Typographical splits handled in #27: 5.70, 5.71, 10.20, 14.1, 14.2, 14.3, 15.1, 15.3, 15.24, 15.26, 15.60, 15.61, 15.72, 15.73
Typographical splits handled in 04dd4a1: 15.41, 21.40, 22.70

21.65

No splits here; there are 5 lines between 60 and 65.

<lb rend="displayNum" n="60" />a)lla\ menei=n e)pi\ ga=s kai\ tw=| xrusw=| basileu/sein.
<lb />tau=ta/ me ka)ch/geire, tu\ d' w)= ce/ne loipo\n e)/reide
<lb />ta\n gnw/man: o(/rkon ga\r e)gw\ to\n e)pw/mosa tarbw=.</p></sp>
<sp><speaker>*(Etai=ros</speaker><p>
<lb />kai\ su/ge ti/ tre/sseis; ou)k w)/mosas: ou)de\ ga\r i)xqu\n
<lb />xru/seon w(s i)/des eu(=res, i)/sa d' h)=n yeu/desin o)/yis,
<lb />e)lpi\s tw=n u(/pnwn. za/tei to\n sa/rkinon i)xqu/n,
<lb rend="displayNum" n="65" />ei) ga/r pa| knw/sswn e)/t' e)tw/sia tau=ta mateu/seis,

https://archive.org/details/idyllsoftheocrit00theo_0/page/139/mode/1up

27.11

There are 5 lines between 5 and 10; or 6 if you count the <gap /> line by Daphnis.

The <gap /> line is getting assigned line number 10 and the next line a( stafuli\s stafi/s is getting assigned line number 11. 27.10 is not being reported as a duplicate because the first instance, the <gap /> line, is blank.

<sp><speaker><add>*ko/rh</add></speaker><p><lb rend="displayNum" n="5" />to\ sto/ma meu plu/nw kai\ a)poptu/w to\ fi/lama.</p></sp>
<sp><speaker>*Da/fnis</speaker><p><lb />plu/neis xei/lea sei=o; di/dou pa/lin o)/fra fila/sw.</p></sp>
<sp><speaker>*ko/rh</speaker><p><lb />kalo/n soi dama/las file/ein, ou)k a)/zuga kw/ran.</p></sp>
<sp><speaker>*Da/fnis</speaker><p><lb />mh\ kauxw=: ta/xa ga/r se pare/rxetai w(s o)/nar h(/bh.</p></sp>
<sp><speaker>*ko/rh</speaker><p><lb />h)\n de/ ti ghra/skw, to/de pou me/li kai\ ga/la pi/nw.</p></sp>
<sp><speaker>*Da/fnis</speaker><p><lb /><gap /></p></sp>
<sp><speaker>*ko/rh</speaker><p><lb />a( stafuli\s stafi/s e)sti kai\ ou) r(o/don au)=on o)lei=tai.</p></sp>
<sp><speaker>*Da/fnis</speaker><p><lb rend="displayNum" n="10" />deu=r' u(po\ ta\s koti/nous, i(/na soi/ tina mu=qon e)ni/yw.</p></sp>
<sp><speaker>*ko/rh</speaker><p><lb />ou)k e)qe/lw: kai\ pri/n me parh/pafes a(de/i mu/qw|.</p></sp>

https://archive.org/details/idyllsoftheocrit00theo_0/page/168/mode/1up
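Both cases come down to how unnumbered `<lb />` elements pick up implicit numbers. The sketch below illustrates that mechanism under stated assumptions: an explicit `n` resets the counter, every other `<lb />` gets the previous number plus one, and blank lines (such as the `<gap />` line) are ignored when looking for duplicates. The function names and the shortened fragment are hypothetical, and `n` is assumed to be purely numeric.

```python
import xml.etree.ElementTree as ET


def number_lines(xml_fragment):
    """Return (number, text, mismatch) for every <lb/> in document order.

    mismatch is True when an explicit n disagrees with the running count.
    Assumes numeric n values only (no "382a"-style suffixes).
    """
    root = ET.fromstring(f"<div>{xml_fragment}</div>")
    rows = []
    current = None
    for lb in root.iter("lb"):
        expected = 1 if current is None else current + 1
        explicit = lb.get("n")
        mismatch = False
        if explicit is None:
            number = expected
        else:
            number = int(explicit)
            mismatch = current is not None and number != expected
        # The verse text follows the <lb/> as its tail; a <gap/> leaves it empty.
        text = (lb.tail or "").strip()
        rows.append((number, text, mismatch))
        current = number
    return rows


def nonblank_duplicates(rows):
    """Line numbers that occur more than once among non-blank lines."""
    seen, dups = set(), []
    for number, text, _ in rows:
        if not text:
            continue
        if number in seen:
            dups.append(number)
        seen.add(number)
    return dups


# Shortened version of the Theoc. 27.5-10 passage quoted above.
FRAGMENT = r"""
<p><lb rend="displayNum" n="5" />to\ sto/ma meu plu/nw</p>
<p><lb />plu/neis xei/lea sei=o</p>
<p><lb />kalo/n soi dama/las file/ein</p>
<p><lb />mh\ kauxw=</p>
<p><lb />h)\n de/ ti ghra/skw</p>
<p><lb /><gap /></p>
<p><lb />a( stafuli\s stafi/s e)sti</p>
<p><lb rend="displayNum" n="10" />deu=r' u(po\ ta\s koti/nous</p>
<p><lb />ou)k e)qe/lw</p>
"""

rows = number_lines(FRAGMENT)
print([number for number, _, _ in rows])
print(nonblank_duplicates(rows))
```

Flagging the places where an explicit `n` disagrees with the running count (the `mismatch` field) would also catch the 21.65 case, where no line is blank and the count is simply off by one.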

The table of line totals formerly hardcoded the WORKS from Table 1 of "SEDES: Metrical Position in Greek Hexameter". But there have been changes to the corpus since then that affect line numbering, for example:

sasansom/sedes#77
sasansom/sedes#79
sasansom/sedes@04dd4a1

Furthermore, Table 1 from the "SEDES" article is produced using an xmlstarlet command running on the source TEI directly counting l and lb elements, not on the derived CSV files. In our notes for the table we remark that this is because duplicate line numbers cause the counts to come out too low:

    For future reference:

    $ (echo "work,lines"; for a in corpus/*.xml; do echo "$a,$(xmlstarlet sel -t -m '//l' -v '"l"' -n -t -m '//lb' -v '"lb"' -n "$a" | wc -l)"; done) > corpus.csv
    > x <- read.csv("corpus.csv")
    > sum(x$lines)
    [1] 73098
    > summary(x$lines)
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
        479    1017    2434    6092    9628   21356

    --- Table 1 numbers checked 2022-09-17, sedes commit cf795ef740. ---

    > x <- bind_rows(map_dfr(Sys.glob("corpus/*.csv"), read_csv, col_types = cols(line_n = col_character(), book_n = col_character())))
    > x %>% group_by(work) %>% summarize(n = n())

    NB the line counts you get from counting distinct line numbers in the CSV
    are slightly different (smaller) from what you get from xmlstarlet, because
    of duplicated line numbers.

    > x %>% select(work, book_n, line_n) %>% unique %>% nrow
    [1] 72954
    > x %>% select(work, book_n, line_n) %>% unique %>% group_by(work) %>% summarize(n = n())

In this repository I've started adding a workaround for the duplicate line numbers, counting up a line whenever word_n fails to increase with the same work, book_n, and line_n in input order. But even with that, the automatically determined counts for Callim.Hymn and Q.S. are 1 smaller than they used to be, and unlike Dion. and Theoc., we have not made changes to those texts that should affect line count totals.

I am planning to look at those more closely, but for now, go ahead with the automatically computed line numbers, because that's what all our percentages etc. are based on.

If I repeat the xmlstarlet calculation with current SEDES files (605a27b3af22089379aad22ba96edf113970a7b0), the only change I get is 3 fewer lines in Dion. Using the automatically determined line numbers takes it down another 99 lines across 4 works.

       work        old_num_lines redo_old_num_lines diff1 new_num_lines diff2
       <chr>               <dbl>              <dbl> <dbl>         <int> <dbl>
     1 Phaen.               1155               1155     0          1155     0
     2 Argon.               5834               5834     0          5834     0
     3 Callim.Hymn           941                941     0           940    -1
     4 Hom.Hymn             2342               2342     0          2342     0
     5 Il.                 15683              15683     0         15683     0
     6 Dion.               21356              21353    -3         21259   -97
     7 Od.                 12107              12107     0         12107     0
     8 Q.S.                 8801               8801     0          8800    -1
     9 Sh.                   479                479     0           479     0
    10 Theoc.               2527               2527     0          2524    -3
    11 Theog.               1042               1042     0          1042     0
    12 W.D.                  831                831     0           831     0
    13 total               73098              73095    -3         72996  -102
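The workaround described above (counting up a line whenever word_n fails to increase within the same work, book_n, and line_n) could look roughly like the sketch below. The column names come from the text; the tuple layout, the integer `word_n`, and the function name `count_lines` are assumptions made for illustration, not the repository's implementation.

```python
from collections import Counter


def count_lines(rows):
    """Count lines per work from word-level rows in input order.

    Each row is (work, book_n, line_n, word_n), with word_n an int.
    A new line starts when the (work, book_n, line_n) key changes, or when
    word_n fails to increase while the key stays the same, which is what
    happens when two consecutive lines share a line number.
    """
    counts = Counter()
    prev_key = None
    prev_word_n = None
    for work, book_n, line_n, word_n in rows:
        key = (work, book_n, line_n)
        if key != prev_key or word_n <= prev_word_n:
            counts[work] += 1
        prev_key, prev_word_n = key, word_n
    return counts


# Two consecutive lines both numbered 2.382, followed by 2.383.
sample = [
    ("Argon.", "2", "382", 1), ("Argon.", "2", "382", 2),
    ("Argon.", "2", "382", 1), ("Argon.", "2", "382", 2),
    ("Argon.", "2", "383", 1),
]
print(count_lines(sample)["Argon."])
```

Counting distinct `(work, book_n, line_n)` keys on the same sample would give 2 rather than 3, which is the undercount the notes mention. The heuristic only helps when the repeated numbers are adjacent in input order.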

whoopsedesy · 2023-10-31T02:01:12Z

Fixed in 490d519 by renumbering some lines, as in Perseus 5.0.

Theoc. 21.65: lines are out of order: 64, 66, 65, 67.

Theoc. 27.11: preceding lines renumbered 8a and 9b.

whoopsedesy · 2024-01-26T07:00:46Z

Reopening this because, despite what #77 (comment) says, nothing seems to have happened with the split lines before Theoc. 5.70 and 5.71.

sedes/corpus/theocritus.xml

Lines 677 to 689 in 2ec2a12

<lb rend="displayNum" n="65" />th/nas ta\s para\ ti\n culoxi/zetai: e)/sti de\ *mo/rswn.</p></sp>
<sp><speaker>*La/kwn</speaker><p>
<lb />bwstre/wmes.</p></sp>
<sp><speaker>*Koma/tas</speaker><p>
<lb rend="displayNumAndIndent" />tu\ ka/lei nin.</p></sp>
<sp><speaker>*La/kwn</speaker><p>
<lb rend="displayNumAndIndent" />i)/q' w)= ce/ne mikko\n a)/kouson
<lb />tei=d' e)nqw/n: a)/mmes ga\r e)ri/sdomes, o(/stis a)rei/wn
<lb />boukoliasta/s e)sti. tu\ d' w)= fi/le mh/t' e)me\ *mo/rswn
<lb />e)n xa/riti kri/nh|s, mh/t' w)=n tu/ga tou=ton o)na/sh|s.</p></sp>
<sp><speaker>*Koma/tas</speaker><p>
<lb rend="displayNum" n="70" />nai\ poti\ ta=n *numfa=n *mo/rswn fi/le mh/te *koma/ta|
<lb />to\ ple/on i)qu/nh|s, mh/t' w)=n tu/ga tw=|de xari/ch|.

whoopsedesy · 2024-03-13T00:13:23Z

> Reopening this because, despite what #77 (comment) says, nothing seems to have happened with the split lines before Theoc. 5.70 and 5.71.

Handled Theoc. 5.66 in 31a4f58.

whoopsedesy mentioned this issue Nov 10, 2022

Non-rule-obeying manual scansions #75

Closed

whoopsedesy pushed a commit that referenced this issue Jan 2, 2023

Remove extraneous linebreaks from nonnusdionysiaca.xml.

2869243

One of these caused a spurious "duplicate line" warning warning: Dion.: duplicate line '25.528' #77

whoopsedesy mentioned this issue Jan 2, 2023

Resolve duplicate line warnings in nonnusdionysiaca.xml #79

Closed

whoopsedesy assigned sasansom Oct 24, 2023

sasansom added a commit that referenced this issue Oct 30, 2023

Fix numbering #77

1fa387f

whoopsedesy pushed a commit that referenced this issue Oct 31, 2023

Fix numbering #77

490d519

whoopsedesy closed this as completed Oct 31, 2023

whoopsedesy reopened this Jan 26, 2024

whoopsedesy closed this as completed Mar 13, 2024
