Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix a bug that added unsolicited new lines when parsing an ordered list containing paragraphs #359

Closed

Conversation

youssefmimo
Copy link
Contributor

@youssefmimo youssefmimo commented Dec 20, 2023

When converting an ordered list where items contain paragraphs, new lines are added in between each elements.
This is a suggestion to fix this edge case.

Description

Following the CommonMarks specs, if a list item contains multiple blocks, the item can be encapsulated in a paragraph tags.
See : A list item may contain blocks that are separated by more than one blank line

Take the following example :

This code is with an ordered list and paragraphs.

1.  Yes, this is a `code` element
2.  No :

    *   `Some code we are looking at`

Using babelmark, we can confirm that most implementations will translate it to the following code

<p>
  This code is with an ordered list and paragraphs.
</p>
<ol>
  <li>
    <p>
      Yes, this is a <code>code
      </code> element
    </p>
  </li>
  <li>
    <p>
      No :
    </p>
    <ul>
      <li>
        <code>
          Some code we are looking at
        </code>
      </li>
    </ul>
  </li>
</ol>

Providing this code to the online demo of Html2Markdown will return the following markdown :

This code is with an ordered list and paragraphs. 

1.  

 Yes, this is a ``` code ``` element 
 >>New Line<<
2.  >>New Line<<
>>New Line<<
 No : 

    *   ```
          Some code we are looking at

```

To remove these new lines, I tried following the design we discussed earlier by :

  1. Adding a CommonMarkTextFormattingReplacementGroup class that will call ReplaceList with a flag indicating we want to be compliant with CommonMark
  2. Used this newly created class in CommonMark.cs to replace the initial call of TextFormattingReplacementGroup

What are the changes within HtmlReplacer ?

  • ReplaceLists now accepts a supportCommonMark flag
  • If the flag is set to true AND the currently processed item starts with a paragraph tag , we call ReplaceParagraph while we process the lists. Instead of calling it later as part of the replacement groups list. This will allow us to avoid adding new lines around the paragraph tags when they are part of a list item.
  • Independently from the supportCommonMark flag, to handle multiline html, if a closing li tags is followed by a new line. We also remove this new line. This is done to prevent cases where this html
    [...]<li><p>First Item</p>\\n</li>\\n<li>Second item[...] becomes [...]<li><p>First Item</p>\\n\\n<li>Second item[...], creating a double line break because we only removed the tag. L. 44
  • Independently from the supportCommonMark flag, when aggregating the markdownList if we notice an item that is already ending with a new line, we do not add a new line. L. 66

Unfortunately there are multiple changes for this unique PR but I tried to fix things one after another as I was discovering them.

Do you think this is a legitimate change ?

Also, do you think there is a better way of handling the Common Markdown flag ? With some kind of static global variable ? This could prevent having to play with flags and instead set everything trough the CommonMark class.

Related issue

N/A

Thanks.

@baynezy baynezy added the awaiting review This pull request is awaiting review label Dec 20, 2023
baynezy added a commit that referenced this pull request Dec 21, 2023
Closes #359

May relate to #202

Massive thanks to https://github.com/YoussefAzaroual for most of the
work in #359
@github-actions github-actions bot added the in progress In flight work label Dec 21, 2023
baynezy added a commit that referenced this pull request Dec 21, 2023
Closes #359

May relate to #202

Massive thanks to https://github.com/YoussefAzaroual for most of the
work in #359
@baynezy
Copy link
Owner

baynezy commented Dec 21, 2023

Many thanks for this @YoussefAzaroual I have a few comments that I have implemented in #364

  1. The way you have handled paragraph tags in lists is not just a good fix for Commonmark but should be a fix for Markdown too
  2. To avoid replication, I have created concrete implementations of most of the tags. See Aggregate common replacers #362

As soon as the PR runs. I will merge and create a release for NuGet.

Thanks again.

@baynezy baynezy removed the awaiting review This pull request is awaiting review label Dec 21, 2023
@baynezy baynezy closed this in #364 Dec 21, 2023
@github-actions github-actions bot added next release Completed work that is yet to be deployed and removed in progress In flight work labels Dec 21, 2023
@youssefmimo
Copy link
Contributor Author

@baynezy Thanks a lot for your reactivity on this. I looked at your other changes, moving to concrete implementations makes the Replacement group much more easier to read !

@baynezy
Copy link
Owner

baynezy commented Dec 21, 2023

@YoussefAzaroual no need to thank me. It's much easier to react to issues when people collaborate with PRs rather than issues.

Many thanks.

@baynezy baynezy removed the next release Completed work that is yet to be deployed label Dec 21, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants