Improving british2american #784

Closed
vr8hub opened this issue Jan 9, 2025 · 4 comments

Comments

@vr8hub
Contributor

vr8hub commented Jan 9, 2025

This has been on my list for a long time, and I finally took some time to look at it. I have made some changes and tested them on a couple of productions I've done that needed the command, and the results are greatly improved.

In summary, the changes are:

  1. Assume any instance of a closing single quote followed by a letter is an apostrophe. This replaces the two existing regexes with a single one, and I moved it to the first of the closing right quote regexes, rather than last. (This gets all of those "known" apostrophes out of the way for the rsq regexes).
  2. After running the existing regexes to identify rsq's, add two additional ones that: a) Identify valid lsq/rsq pairs and mark them as lsqm/rsqm, and b) Find plain lsq followed by a closing single quote, and mark those as lsqm/rsqm pairs, on the assumption they're much more likely to be rsq than apostrophes.
  3. Replace the tags with the appropriate quotes as before.
  4. The "common errors" updates are not needed (with the above, at least, they cause more problems than they fix).
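The steps above can be sketched roughly as follows. This is a minimal illustration only, not the actual command: the function names, the exact tag strings (`<ap>`, `<lsqm>`, `<rsqm>`), and the regexes are assumptions, step 2 is reduced to a single pairing regex, and the handling of existing double quotes is left out entirely.

```python
import re

LSQ, RSQ = "\u2018", "\u2019"  # left and right single quotes
LDQ, RDQ = "\u201c", "\u201d"  # left and right double quotes

def tag_quotes(text: str) -> str:
    # Step 1: a closing single quote directly followed by a letter
    # is assumed to be an apostrophe.
    text = re.sub(RSQ + r"(?=[^\W\d_])", "<ap>", text)
    # Step 2 (simplified): an opening single quote followed, on the same
    # line, by the next untagged closing single quote is marked as a pair.
    pair = re.compile(LSQ + "([^" + LSQ + RSQ + r"\n]*)" + RSQ)
    return pair.sub(r"<lsqm>\1<rsqm>", text)

def replace_tags(text: str) -> str:
    # Step 3: matched pairs become double quotes; apostrophes are restored.
    return text.replace("<lsqm>", LDQ).replace("<rsqm>", RDQ).replace("<ap>", RSQ)

def british_to_american(text: str) -> str:
    return replace_tags(tag_quotes(text))
```

Tagging apostrophes first means the pairing regex never sees them, which is the point of moving that step to the front.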

As before, this is not foolproof, but on what I've tested it is much better than the existing one, i.e. it makes fewer errors. In Decline and Fall (the Waugh, not the Gibbon), the results come within one change of all of my original manual corrections, with the exception of moving commas/periods inside quotes, which of course the command does not attempt to address. (Although we might be able to with the above changes; after step 2, any remaining closing single quote is theoretically an apostrophe, so we could then swap <rsq>[.,] before step 3. I have not tried that, but I think I will.)
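The untried punctuation swap mentioned in the parenthesis could look something like the sketch below. It assumes the text still carries a literal `<rsq>` tag for every real closing quote at that point; the function name is hypothetical.

```python
import re

def move_punctuation_inside(tagged: str) -> str:
    # Swap a tagged closing quote with the period or comma that follows it,
    # so the punctuation ends up inside the quotation.
    return re.sub(r"(<rsq>)([.,])", r"\2\1", tagged)
```

Because apostrophes are no longer candidates after tagging, a word-final apostrophe followed by a comma would be left alone.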

If you have any specific texts you want tested, let me know and I'll run it on them (on the text before the command), and if they look better, I'll open a PR. Otherwise, I can ask on the list for ones anyone knows they used the command on, and test several. How many would make you comfortable?

@acabal
Member

acabal commented Jan 9, 2025

OK great. I think it's important to write the actual Python test cases too - if we'd had them when this tool was created we could have updated it a long time ago, but it's one of the earliest tools and a lot of why it does what it does is fuzzy to me.
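One possible shape for such test cases is a table of input/expected pairs run against whatever conversion function is under test. Everything here is illustrative: the case list, `failing_cases`, and the `naive_swap` stand-in are hypothetical and are not part of the existing tool.

```python
from typing import Callable, List, Tuple

# (British input, expected American output)
CASES: List[Tuple[str, str]] = [
    ("\u2018Hello,\u2019 she said.", "\u201cHello,\u201d she said."),
    ("\u2018Get \u2019em!\u2019", "\u201cGet \u2019em!\u201d"),
]

def failing_cases(
    convert: Callable[[str], str],
    cases: List[Tuple[str, str]] = CASES,
) -> List[Tuple[str, str, str]]:
    # Return (input, expected, actual) for every case the converter gets wrong.
    failures = []
    for source, expected in cases:
        actual = convert(source)
        if actual != expected:
            failures.append((source, expected, actual))
    return failures

def naive_swap(text: str) -> str:
    # Stand-in converter that blindly swaps every single quote for a double
    # one, so it mangles apostrophes; useful only to show a failing case.
    return text.translate(str.maketrans({"\u2018": "\u201c", "\u2019": "\u201d"}))
```

Run against `naive_swap`, the elision case fails while the plain quotation passes, which is the kind of regression signal a case table like this is meant to give.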

I don't remember which books were especially difficult with this tool. The tricky things are elision in dialect - ‘get ’em!” but there are lots of more complex examples - and quotes before and after em dashes. I think D. H. Lawrence has a lot of dialect, and a recent one I did, The Good Companions, also had a fair amount.

@vr8hub
Contributor Author

vr8hub commented Jan 9, 2025

Yes, tests, of course.

Properly quoted elisions in dialect at the front of a word aren't the problem. That's why I moved the tagging of <ap> to the beginning, and updated it such that any closing single quote followed by a letter is an apostrophe.

The first dialect problem is (and there were a lot of these in Decline and Fall) improperly quoted dialect, e.g. ‘em instead of ’em; those are going to get changed to rdq's early on, and there's not much to be done about that without actually parsing the text (and I'm not sure that would totally solve the problem, either), and I'm not nearly smart enough for that. I'm also not sure we could with lxml anyway; it's not as flexible as Beautiful Soup with its inputs. But, these aren't correctly handled now, and they won't be with the update, so that's not a regression.

The second problem is elisions at the end of words, and telling whether a closing single quote there is an elision or a closing quote, e.g.

This is a sentence ‘bein’ hard to handle’ for what the script does.

It's pretty much impossible for an automated tool to tell whether the rsq on bein is closing the quote, or the one on handle. I even thought about loading the -ing words from our words table into an array and checking to see if the leftover closing single quotes were on one of the words, but there are over 7500 of them, so that didn't sound very performant. The changes I have so far will treat the bein quote as closing and so get it wrong, but the current version gets it wrong a different way by not closing the double quote at all.
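For what it's worth, the -ing lookup idea could be sketched as below, assuming the words were loaded into a set rather than a list, since set membership tests stay cheap even with several thousand entries. The three-word sample, the `<ap>` tag string, and the function name are hypothetical, and this is not something the changes described here actually do.

```python
import re

RSQ = "\u2019"  # right single quote

# Hypothetical sample; the real list would come from the words table.
ING_WORDS = frozenset({"being", "going", "nothing"})

def tag_dropped_g(text: str, ing_words: frozenset = ING_WORDS) -> str:
    def repl(match: re.Match) -> str:
        word = match.group(1)
        # If restoring the dropped "g" gives a known -ing word, treat the
        # closing single quote as an apostrophe.
        if (word + "g").lower() in ing_words:
            return word + "<ap>"
        return match.group(0)
    # Look only at words ending in "in" followed by a closing single quote.
    return re.sub(r"([^\W\d_]+in)" + RSQ, repl, text)
```

A lookup like this would still guess wrong when a quotation really does end on a clipped word.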

I'll test the changes against The Good Companions, look at the Lawrences, and see how it goes.

@vr8hub
Contributor Author

vr8hub commented Jan 10, 2025

I think things look pretty good, at least for a next step. I'm going to start putting together a test file, but here are the differences in ch. 1-1 of the Priestley between the old and new versions. It has colored output, so it needs the less -R command to view it correctly. I ran it on the text before you moved the punctuation inside the quotes, since that's not something we have them do in the Step by Step, but the command now moves some (hopefully most) of them itself.
c1-1diff.txt

@acabal
Member

acabal commented Jan 10, 2025

Looks great!

@vr8hub vr8hub closed this as completed Jan 21, 2025