Audit "duplicate line" warnings #77

Closed
12 tasks done
whoopsedesy opened this issue Nov 10, 2022 · 4 comments
@whoopsedesy
Collaborator

whoopsedesy commented Nov 10, 2022

At the meeting today, we talked about how the Perseus-assigned line numbers are not always unique, and how that creates problems when trying to make other tables that refer to specific lines.

We discussed adding our own, parallel set of line numbers, guaranteed to be unambiguous. But as I started looking for duplicate line numbers to use as examples in a new issue, I found that (1) there are not many instances of reused line numbers, and (2) they may represent source errors that should be fixed anyway.

The ones I looked at had already been fixed in the Perseus 5.0 texts (cf. #57). For example, https://github.com/PerseusDL/canonical-greekLit/blob/2d5ba46c2e4b5a0c884548c431503ef7d63d14ce/data/tlg0001/tlg001/tlg0001.tlg001.perseus-grc2.xml#L2044-L2051 changed the sequence (382, 383, 382, 383) to (382a, 383a, 382, 383).

I recommend first reviewing all the cases of duplicate line numbers: if it turns out they can all be made unambiguous, that saves us the trouble of inventing and maintaining our own numbering scheme.

  • aratus.xml
  • argonautica.xml a58e572
    warning: Argon.: duplicate line '2.382'
    warning: Argon.: duplicate line '2.383'
    
  • callimachushymns.xml
  • homerichymns.xml Fix duplicate line numbers in homerichymns.xml #76
    warning: Hom.Hymn: duplicate line '3.340'
    warning: Hom.Hymn: duplicate line '9.5'
    warning: Hom.Hymn: duplicate line '29.5'
    
  • iliad.xml
  • nonnusdionysiaca.xml Resolve duplicate line warnings in nonnusdionysiaca.xml #79
    warning: Dion.: duplicate line '25.528'
    warning: Dion.: duplicate line '28.83'
    warning: Dion.: duplicate line '28.84'
    warning: Dion.: duplicate line '28.93'
    warning: Dion.: duplicate line '28.94'
    warning: Dion.: duplicate line '32.90'
    warning: Dion.: duplicate line '37.568'
    warning: Dion.: duplicate line '37.625'
    warning: Dion.: duplicate line '39.385'
    warning: Dion.: duplicate line '40.566'
    warning: Dion.: duplicate line '44.145'
    
  • odyssey.xml
  • quintussmyrnaeus.xml
  • shield.xml
  • theocritus.xml
    warning: Theoc.: duplicate line '5.66'
    warning: Theoc.: duplicate line '5.66'
    warning: Theoc.: duplicate line '10.20'
    warning: Theoc.: duplicate line '14.1'
    warning: Theoc.: duplicate line '14.2'
    warning: Theoc.: duplicate line '14.2'
    warning: Theoc.: duplicate line '14.3'
    warning: Theoc.: duplicate line '15.1'
    warning: Theoc.: duplicate line '15.3'
    warning: Theoc.: duplicate line '15.3'
    warning: Theoc.: duplicate line '15.14'
    warning: Theoc.: duplicate line '15.24'
    warning: Theoc.: duplicate line '15.26'
    warning: Theoc.: duplicate line '15.38'
    warning: Theoc.: duplicate line '15.60'
    warning: Theoc.: duplicate line '15.60'
    warning: Theoc.: duplicate line '15.60'
    warning: Theoc.: duplicate line '15.61'
    warning: Theoc.: duplicate line '15.72'
    warning: Theoc.: duplicate line '15.73'
    warning: Theoc.: duplicate line '21.37'
    warning: Theoc.: duplicate line '22.66'
    
  • theogony.xml
  • worksanddays.xml Fix duplicate WD 363 #63
    warning: W.D.: duplicate line 'WD.363'
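
Warnings of the kind listed above could be produced by a single pass over the line labels of a work. The following is a minimal sketch in Python, not the actual SEDES code: the function name `duplicate_warnings` and its inputs are hypothetical, and it assumes one warning is emitted for each repeated occurrence of a label (which would explain why '15.60' appears three times in the Theoc. list).

```python
def duplicate_warnings(abbrev, labels):
    """Return one warning string per repeated occurrence of a label.

    abbrev: work abbreviation used in the message, e.g. "Argon."
    labels: line labels such as "2.382", in document order.
    (Both names are illustrative, not taken from the SEDES code.)
    """
    seen = set()
    warnings = []
    for label in labels:
        if label in seen:
            # The label has occurred before, so it is ambiguous.
            warnings.append(f"warning: {abbrev}: duplicate line '{label}'")
        else:
            seen.add(label)
    return warnings


# The Argonautica sequence mentioned in the first comment,
# before it was renumbered to 382a, 383a, 382, 383.
argon = ["2.381", "2.382", "2.383", "2.382", "2.383", "2.384"]
for message in duplicate_warnings("Argon.", argon):
    print(message)
```

After renumbering the first pair to 2.382a and 2.383a, the same pass reports nothing, which is the outcome the audit is aiming for.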
    
whoopsedesy pushed a commit that referenced this issue Jan 2, 2023
One of these caused a spurious "duplicate line" warning
	warning: Dion.: duplicate line '25.528'
#77
@whoopsedesy
Collaborator Author

Regarding theocritus, most of the duplicates are due to typographically split lines (already handled in #27 or by me just now in 04dd4a1); but there are two cases I don't know what to do with.

Typographical splits handled in #27:
5.70, 5.71, 10.20, 14.1, 14.2, 14.3, 15.1, 15.3, 15.24, 15.26, 15.60, 15.61, 15.72, 15.73
Typographical splits handled in 04dd4a1:
15.41, 21.40, 22.70

21.65

No splits here, but there are 5 lines between 60 and 65, where 4 are expected.

<lb rend="displayNum" n="60" />a)lla\ menei=n e)pi\ ga=s kai\ tw=| xrusw=| basileu/sein.
<lb />tau=ta/ me ka)ch/geire, tu\ d' w)= ce/ne loipo\n e)/reide
<lb />ta\n gnw/man: o(/rkon ga\r e)gw\ to\n e)pw/mosa tarbw=.</p></sp>
<sp><speaker>*(Etai=ros</speaker><p>
<lb />kai\ su/ge ti/ tre/sseis; ou)k w)/mosas: ou)de\ ga\r i)xqu\n
<lb />xru/seon w(s i)/des eu(=res, i)/sa d' h)=n yeu/desin o)/yis,
<lb />e)lpi\s tw=n u(/pnwn. za/tei to\n sa/rkinon i)xqu/n,
<lb rend="displayNum" n="65" />ei) ga/r pa| knw/sswn e)/t' e)tw/sia tau=ta mateu/seis,

https://archive.org/details/idyllsoftheocrit00theo_0/page/139/mode/1up
idyllsoftheocrit00theo_0_0149

27.11

There are 5 lines between 5 and 10, where 4 are expected; or 6 if you count the <gap /> line by Daphnis.

The <gap /> line is assigned line number 10, and the next line, a( stafuli\s stafi/s, is assigned line number 11. 27.10 is not reported as a duplicate because its first instance, the <gap /> line, is blank.

<sp><speaker><add>*ko/rh</add></speaker><p><lb rend="displayNum" n="5" />to\ sto/ma meu plu/nw kai\ a)poptu/w to\ fi/lama.</p></sp>
<sp><speaker>*Da/fnis</speaker><p><lb />plu/neis xei/lea sei=o; di/dou pa/lin o)/fra fila/sw.</p></sp>
<sp><speaker>*ko/rh</speaker><p><lb />kalo/n soi dama/las file/ein, ou)k a)/zuga kw/ran.</p></sp>
<sp><speaker>*Da/fnis</speaker><p><lb />mh\ kauxw=: ta/xa ga/r se pare/rxetai w(s o)/nar h(/bh.</p></sp>
<sp><speaker>*ko/rh</speaker><p><lb />h)\n de/ ti ghra/skw, to/de pou me/li kai\ ga/la pi/nw.</p></sp>
<sp><speaker>*Da/fnis</speaker><p><lb /><gap /></p></sp>
<sp><speaker>*ko/rh</speaker><p><lb />a( stafuli\s stafi/s e)sti kai\ ou) r(o/don au)=on o)lei=tai.</p></sp>
<sp><speaker>*Da/fnis</speaker><p><lb rend="displayNum" n="10" />deu=r' u(po\ ta\s koti/nous, i(/na soi/ tina mu=qon e)ni/yw.</p></sp>
<sp><speaker>*ko/rh</speaker><p><lb />ou)k e)qe/lw: kai\ pri/n me parh/pafes a(de/i mu/qw|.</p></sp>

https://archive.org/details/idyllsoftheocrit00theo_0/page/168/mode/1up
idyllsoftheocrit00theo_0_0178
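
The numbering behavior described for Theoc. 27 can be reproduced with a small model. This is a sketch under two assumptions drawn from the comment above, not from the SEDES source: an <lb /> without an n attribute gets the previous number plus one, and blank lines (such as the <gap /> line) are never reported as duplicates. The function names are hypothetical.

```python
def assign_line_numbers(lbs):
    """Number a sequence of <lb> elements.

    lbs: list of (explicit_n, text) pairs, where explicit_n is the integer
    value of the n attribute, or None when the attribute is absent.
    Assumption: an absent n means "previous number + 1".
    """
    numbered = []
    current = 0
    for explicit_n, text in lbs:
        current = explicit_n if explicit_n is not None else current + 1
        numbered.append((current, text))
    return numbered


def reported_duplicates(numbered):
    """Return the line numbers that repeat among non-blank lines."""
    seen = set()
    duplicates = []
    for n, text in numbered:
        if not text.strip():
            continue  # assumption: blank lines are never reported
        if n in seen:
            duplicates.append(n)
        seen.add(n)
    return duplicates


# Theoc. 27, from the line marked n="5" to the line after n="10"
# (texts abbreviated; the empty string stands for the <gap /> line).
theoc_27 = [
    (5, r"to\ sto/ma meu plu/nw"),
    (None, r"plu/neis xei/lea sei=o"),
    (None, r"kalo/n soi dama/las file/ein"),
    (None, r"mh\ kauxw="),
    (None, r"h)\n de/ ti ghra/skw"),
    (None, ""),
    (None, r"a( stafuli\s stafi/s"),
    (10, r"deu=r' u(po\ ta\s koti/nous"),
    (None, r"ou)k e)qe/lw"),
]

numbered = assign_line_numbers(theoc_27)
print([n for n, _ in numbered])       # [5, 6, 7, 8, 9, 10, 11, 10, 11]
print(reported_duplicates(numbered))  # [11]
```

Under these assumptions only 27.11 is reported, even though 10 is also assigned twice, which matches what the comment describes.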

whoopsedesy pushed a commit to sasansom/breaking-hermanns-bridge that referenced this issue Jun 1, 2023
The table of line totals formerly hardcoded the WORKS from Table 1 of
"SEDES: Metrical Position in Greek Hexameter". But there have been
changes to the corpus since then that affect line numbering, for example
sasansom/sedes#77
sasansom/sedes#79
sasansom/sedes@04dd4a1

Furthermore, Table 1 from the "SEDES" article is produced using an
xmlstarlet command running on the source TEI directly counting l and lb
elements, not on the derived CSV files. In our notes for the table we
remark that this is because duplicate line numbers cause the counts to
come out too low:

	For future reference:

	$ (echo "work,lines"; for a in corpus/*.xml; do echo "$a,$(xmlstarlet sel -t -m '//l' -v '"l"' -n -t -m '//lb' -v '"lb"' -n "$a" | wc -l)"; done) > corpus.csv
	> x <- read.csv("corpus.csv")
	> sum(x$lines)
	[1] 73098
	> summary(x$lines)
	   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
	    479    1017    2434    6092    9628   21356

	---

	Table 1 numbers checked 2022-09-17, sedes commit cf795ef740.

	---

	> x <- bind_rows(map_dfr(Sys.glob("corpus/*.csv"), read_csv, col_types = cols(line_n = col_character(), book_n = col_character())))
	> x %>% group_by(work) %>% summarize(n = n())

	NB the line counts you get from counting distinct line numbers in the CSV are slightly different (smaller) from what you get from xmlstarlet, because of duplicated line numbers.
	> x %>% select(work, book_n, line_n) %>% unique %>% nrow
	[1] 72954
	> x %>% select(work, book_n, line_n) %>% unique %>% group_by(work) %>% summarize(n = n())
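
For readers without xmlstarlet, the element count in the notes above can be approximated in Python with the standard library. This is a sketch, not part of the SEDES tooling: the function name `count_line_elements` is hypothetical, and unlike the `//l` and `//lb` XPath expressions it matches on local element names, so it would also count namespaced TEI elements.

```python
import xml.etree.ElementTree as ET


def count_line_elements(xml_text):
    """Count every <l> and <lb> element in a TEI document given as a string."""
    root = ET.fromstring(xml_text)
    count = 0
    for element in root.iter():
        # Strip a "{namespace}" prefix, if any, to get the local name.
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name in ("l", "lb"):
            count += 1
    return count


# A tiny made-up fragment: two <lb /> markers and one <l> element.
sample = '<text><p><lb n="5"/>a<lb/>b</p><l n="1">c</l></text>'
print(count_line_elements(sample))  # 3
```

Because this counts elements rather than distinct line numbers, duplicated numbers do not lower the total, which is the reason the notes give for counting on the TEI source.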

In this repository I've started adding a workaround for the duplicate
line numbers, counting up a line whenever word_n fails to increase with
the same work, book_n, and line_n in input order. But even with that,
the automatically determined counts for Callim.Hymn and Q.S. are 1
smaller than they used to be, and unlike Dion. and Theoc., we have not
made changes to those texts that should affect line count totals. I am
planning to look at those more closely, but for now, go ahead with the
automatically computed line numbers, because that's what all our
percentages etc. are based on.

If I repeat the xmlstarlet calculation with current SEDES files
(605a27b3af22089379aad22ba96edf113970a7b0), the only change I get is 3
fewer lines in Dion. Using the automatically determined line numbers
takes it down another 99 lines across 4 works.

   work        old_num_lines redo_old_num_lines diff1 new_num_lines diff2
   <chr>               <dbl>              <dbl> <dbl>         <int> <dbl>
 1 Phaen.               1155               1155     0          1155     0
 2 Argon.               5834               5834     0          5834     0
 3 Callim.Hymn           941                941     0           940    -1
 4 Hom.Hymn             2342               2342     0          2342     0
 5 Il.                 15683              15683     0         15683     0
 6 Dion.               21356              21353    -3         21259   -97
 7 Od.                 12107              12107     0         12107     0
 8 Q.S.                 8801               8801     0          8800    -1
 9 Sh.                   479                479     0           479     0
10 Theoc.               2527               2527     0          2524    -3
11 Theog.               1042               1042     0          1042     0
12 W.D.                  831                831     0           831     0
13 total               73098              73095    -3         72996  -102
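
The workaround described in the commit message (start a new line whenever word_n fails to increase within the same work, book_n, and line_n) can be sketched as follows. This is an illustration under stated assumptions, not the code from the repository: the function name `count_lines` and the row layout `(work, book_n, line_n, word_n)` are hypothetical, and rows are assumed to be in input order with an integer word_n.

```python
def count_lines(rows):
    """Count lines per work from word-level rows in input order.

    A new line starts when (work, book_n, line_n) differs from the previous
    row, or when the key is the same but word_n fails to increase, which
    indicates a second line reusing the same number.
    """
    counts = {}
    previous_key = None
    previous_word_n = None
    for work, book_n, line_n, word_n in rows:
        key = (work, book_n, line_n)
        if key != previous_key or word_n <= previous_word_n:
            counts[work] = counts.get(work, 0) + 1
        previous_key = key
        previous_word_n = word_n
    return counts


# Two words per line; '5.66' is used by two adjacent lines.
rows = [
    ("Theoc.", "5", "65", 1), ("Theoc.", "5", "65", 2),
    ("Theoc.", "5", "66", 1), ("Theoc.", "5", "66", 2),
    ("Theoc.", "5", "66", 1), ("Theoc.", "5", "66", 2),
    ("Theoc.", "5", "67", 1), ("Theoc.", "5", "67", 2),
]
print(count_lines(rows))                          # {'Theoc.': 4}
print(len({(w, b, l) for w, b, l, _ in rows}))    # 3 distinct line numbers
```

Counting distinct (work, book_n, line_n) values gives 3 here, while the word_n heuristic recovers all 4 lines. The commit message notes that the real counts still come out lower than the xmlstarlet counts for some works, so a heuristic like this should not be taken as a complete fix.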
sasansom added a commit that referenced this issue Oct 30, 2023
whoopsedesy pushed a commit that referenced this issue Oct 31, 2023
@whoopsedesy
Collaborator Author

Fixed in 490d519 by renumbering some lines, as in Perseus 5.0.

Theoc. 21.65: lines are out of order: 64, 66, 65, 67.

Theoc. 27.11: preceding lines renumbered 8a and 9b.

@whoopsedesy
Collaborator Author

Reopening this because, despite what #77 (comment) says, nothing seems to have happened with the split lines before Theoc. 5.70 and 5.71.

sedes/corpus/theocritus.xml

Lines 677 to 689 in 2ec2a12

<lb rend="displayNum" n="65" />th/nas ta\s para\ ti\n culoxi/zetai: e)/sti de\ *mo/rswn.</p></sp>
<sp><speaker>*La/kwn</speaker><p>
<lb />bwstre/wmes.</p></sp>
<sp><speaker>*Koma/tas</speaker><p>
<lb rend="displayNumAndIndent" />tu\ ka/lei nin.</p></sp>
<sp><speaker>*La/kwn</speaker><p>
<lb rend="displayNumAndIndent" />i)/q' w)= ce/ne mikko\n a)/kouson
<lb />tei=d' e)nqw/n: a)/mmes ga\r e)ri/sdomes, o(/stis a)rei/wn
<lb />boukoliasta/s e)sti. tu\ d' w)= fi/le mh/t' e)me\ *mo/rswn
<lb />e)n xa/riti kri/nh|s, mh/t' w)=n tu/ga tou=ton o)na/sh|s.</p></sp>
<sp><speaker>*Koma/tas</speaker><p>
<lb rend="displayNum" n="70" />nai\ poti\ ta=n *numfa=n *mo/rswn fi/le mh/te *koma/ta|
<lb />to\ ple/on i)qu/nh|s, mh/t' w)=n tu/ga tw=|de xari/ch|.

@whoopsedesy whoopsedesy reopened this Jan 26, 2024
@whoopsedesy
Collaborator Author

Reopening this because, despite what #77 (comment) says, nothing seems to have happened with the split lines before Theoc. 5.70 and 5.71.

Handled Theoc. 5.66 in 31a4f58.
