-
Notifications
You must be signed in to change notification settings - Fork 128
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
semanticate: Work on the XML elements tree #753
base: master
Are you sure you want to change the base?
Conversation
I understand the reasoning for this PR, but I think regex is acceptable for this use case 95% of the time. It's much simpler to understand and - importantly - maintain, and with full XML parsing there will likely be complex edge cases that we will have to custom code around. Also any errors introduced by a regex are easily corrected using git diff - it never results in big catastrophes. If you want to play around with it further go ahead, but I think you'll find that this will become very unwieldy very fast. If you're able to put together a complete solution that is 100% correct vs. our current implementation's 95% correct, and it doesn't look like a maintenance hellscape, then we can consider it. |
Sure 95% of the time a regex will get you there, but your existing regexes are needing to be stuffed with lookaheads and other features to compensate for the lack of context. Flatly XML cannot be parsed with regexes. On the other hand walking XML trees using something built for the job is pretty straight forward and should be much easier to maintain and contribute to in the long run. |
I understand. We do not need lookaheads in the general case (or, strictly speaking, in this case). That is because the occasions in which the regexes without lookaheads are wrong, given our narrow use cases, are rare, and undoing the mistake one of these regex makes is simple - an error is never catastrophic. Remember that these tools are not general purpose XML processors, they are for the very narrow use case of ebooks in the English language that are made with our style guide. On the other hand, parsing an entire syntax tree and dealing with the hundreds of edge cases of the various English language constructs we must examine is extremely complex, error prone, and most importantly, difficult to maintain. It will also likely be quite slow because the tree must be walked and updated for almost every check requiring even the most basic of logic. I understand the desire for intellectual purity and formal correctness, but using regex is easier to develop and maintain as a practical matter. Of course I'm happy to be proved wrong! |
I agree with the point that regexes are simpler, and I don't know where the sweet spot lies. However, a more XML-based approach could still use a lot of regexes, the main difference is choosing where they are applied. In my current implementation, the regex rules are very similar to what they are in the current implementation. I copy-pasted them from the old function and made only some minor changes: Lines 176 to 180 in b4e9357
So the representation of the rules doesn't have to change much. The difference is that these regexes are no longer applied to the entire raw XML document at once, but only to one bit of text at a time. This avoids changes to attribute values and makes it easier to say things like: "do not apply these rules inside One thing that does become a bit more difficult is when the regexes look at XML tags. For example, when deciding to add Lines 228 to 239 in b4e9357
Or a simpler example, checking we're inside a Lines 192 to 194 in b4e9357
Secondly, it also becomes more difficult to apply regexes that match text on both sides of a tag, but there currently aren't any rules that need this. I'm not convinced this is necessarily better, but it's good to know that it's not a binary choice between regex and syntax tree parsing. |
Rough example (see #752) of how the semanticate rules could be applied to the content only by walking recursively through the XML tree.
Not sure if this is a good idea, so this is a fairly quick proof-of-concept.
The tricky bit is that the regular expressions might introduce new XML elements. In this version, this works by applying the regex to text string (which doesn't contain any tags), parsing the regex output to create any new XML elements, and finally inserting those in the document at the right place.