Improving british2american #784
Comments
OK great. I think it's important to write the actual Python test cases too. If we'd had them when this tool was created we could have updated it a long time ago, but it's one of the earliest tools and a lot of why it does what it does is fuzzy to me. I don't remember which books were especially difficult with this tool. The tricky things are elision in dialect -
Yes, tests, of course. Properly quoted elisions in dialect at the front of a word aren't the problem. That's why I moved the tagging of
The first dialect problem (and there were a lot of these in Decline and Fall) is improperly quoted dialect, e.g. ‘em instead of ’em; those are going to get changed to rdq's early on, and there's not much to be done about that without actually parsing the text (and I'm not sure that would totally solve the problem, either), and I'm not nearly smart enough for that. I'm also not sure we could with lxml anyway; it's not as flexible as Beautiful Soup with its inputs. But these aren't handled correctly now, and they won't be with the update, so that's not a regression. The second problem is elisions at the end of words, and telling whether those are elisions or closing quotes, e.g.
It's pretty much impossible for an automated tool to tell whether the rsq on bein is closing the quote, or the one on handle. I even thought about loading the -ing words from our words table into an array and checking to see if the leftover closing single quotes were on one of the words, but there are over 7500 of them, so that didn't sound very performant. The changes I have so far will treat the bein quote as closing and so get it wrong, but the current version gets it wrong a different way by not closing the double quote at all. I'll test the changes against Good Companions and look at the Lawrences and see how it goes.
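For what it's worth, the word-list check described above could be sketched roughly like this. It is only an illustration, not code from the tool: `ING_WORDS` is a tiny hypothetical stand-in for the real words table, and `likely_elisions` is a made-up helper name. Using a set instead of an array keeps each lookup constant-time on average, so a list of 7500 words should not be a performance problem in itself.

```python
import re

# Hypothetical stand-in for the ~7500 "-ing" words in the words table.
ING_WORDS = frozenset({"being", "going", "nothing", "handling"})

# A word ending in "in" immediately followed by a right single quote (U+2019).
ELIDED_ING = re.compile(r"\b([A-Za-z]+in)\u2019")


def likely_elisions(text):
    """Return words whose trailing right single quote is probably an elided 'g'."""
    found = []
    for match in ELIDED_ING.finditer(text):
        word = match.group(1)
        # Restore the dropped "g" and look the result up in the set.
        if (word + "g").lower() in ING_WORDS:
            found.append(word)
    return found


# Made-up sample text, not taken from any of the books mentioned here.
print(likely_elisions("\u2018I was bein\u2019 careful with the handle\u2019"))
```

This only flags candidates. It still can't decide the case where an elided word happens to be the last word inside the quotation, so it would reduce the ambiguity rather than remove it.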
I think things look pretty good, at least for a next step. I'm going to start putting together a test file, but here are the differences in ch. 1-1 of the Priestley between the old and new versions. It has colored output, so needs
Looks great!
This has been on my list for a long time, and I finally took some time to look at it. I have made some changes and tested them on a couple of productions I've done that needed the command, and the results are greatly improved.
In summary, the changes are:
As before, this is not foolproof, but it is much better, i.e. it makes fewer errors, on what I've tested than the existing one. In Decline and Fall (the Waugh, not the Gibbon), the results come within one change of matching all of my original manual corrections, with the exception of moving commas/periods inside quotes, which of course the command does not attempt to address. (Although we might be able to with the above changes; after step 2, any closing single quote is theoretically an apostrophe, so we could then swap `<rsq>[.,]` before step 3. I have not tried that, but I think I will.)

If you have any specific texts you want tested, let me know and I'll run it on them (on the text before the command), and if they look better, I'll open a PR. Otherwise, I can ask on the list for ones anyone knows they used the command on, and test several. How many would make you comfortable?
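A minimal sketch of the `<rsq>[.,]` swap floated in the parenthetical above, assuming closing single quotes have already been replaced by a literal `<rsq>` placeholder at that point in the process. The function name and the sample sentence are hypothetical, and this has not been run against any real production.

```python
import re

# The "<rsq>" placeholder immediately followed by a comma or period.
RSQ_THEN_PUNCT = re.compile(r"(<rsq>)([.,])")


def move_punctuation_inside(text):
    """Swap '<rsq>,' to ',<rsq>' and '<rsq>.' to '.<rsq>'."""
    return RSQ_THEN_PUNCT.sub(r"\2\1", text)


# Made-up sample; the untagged apostrophe in "dogs’" is left alone.
sample = "He called it odd<rsq>, and the dogs\u2019, too."
print(move_punctuation_inside(sample))
```

Because only the placeholder is matched, any right single quote left untagged (that is, treated as an apostrophe) keeps its punctuation where it was.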