WordPress lazy-loading noscript cleaner broken with libxml2 < 2.9.9 #240

Open
jtojnar opened this issue Nov 13, 2020 · 3 comments

jtojnar commented Nov 13, 2020

With libxml2 2.9.4 (included in Ubuntu 18.04 LTS), Graby’s WordPress lazy-loading noscript cleaner is unable to remove the second image in the noscript text:

<p><img data-lazyloaded="1" src="" class="aligncenter size-full wp-image-32079" data-src="https://uxmovement.com/wp-content/uploads/2020/11/layout-scalebadge.png" alt="" width="639" height="408" /><noscript><img class="aligncenter size-full wp-image-32079" src="https://uxmovement.com/wp-content/uploads/2020/11/layout-scalebadge.png" alt="" width="639" height="408" /></noscript></p>

is turned into:

<p><img data-lazyloaded="1" src="https://uxmovement.com/wp-content/uploads/2020/11/layout-scalebadge.png" class="aligncenter size-full wp-image-32079" alt="" width="639" height="408" /></p><noscript>
<p><img class="aligncenter size-full wp-image-32079" src="https://uxmovement.com/wp-content/uploads/2020/11/layout-scalebadge.png" alt="" width="639" height="408" /></p>

It works fine with libxml2 2.9.10 in later versions of Ubuntu; it was likely fixed by https://gitlab.gnome.org/GNOME/libxml2/-/commit/35e83488505d501864826125cfe6a7950d6cba78.

You can reproduce this by running

$ git clone https://github.com/jtojnar/graby-double-images && cd graby-double-images
$ composer install
$ php test.php

on a system with libxml2 older than 2.9.9, or, if you have Nix:

$ nix-shell --run 'composer install && php test.php'
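
To tell whether a given system falls in the affected range before running the reproduction, something like the following may help. It is only a sketch: `version_lt` and `libxml2_status` are hypothetical helper names, `sort -V` is assumed to be available (as in GNU coreutils), and `LIBXML_DOTTED_VERSION` reports the libxml2 version PHP was built against, which may differ from a library swapped in at runtime.

```shell
#!/bin/sh
# Sketch: report whether the libxml2 used by PHP predates 2.9.9.
# version_lt and libxml2_status are hypothetical helpers, not part of Graby.

# Succeeds when version $1 is strictly lower than version $2.
# Assumes a sort that supports -V (version sort), e.g. GNU coreutils.
version_lt() {
	[ "$1" != "$2" ] &&
		[ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# Prints "affected" for libxml2 versions before 2.9.9, "ok" otherwise.
libxml2_status() {
	if version_lt "$1" "2.9.9"; then
		echo "affected"
	else
		echo "ok"
	fi
}

# Only query PHP when it is installed; LIBXML_DOTTED_VERSION is the
# constant PHP exposes for the libxml2 version it was compiled against.
if command -v php >/dev/null 2>&1; then
	libxml_version=$(php -r 'echo LIBXML_DOTTED_VERSION;')
	if [ -n "$libxml_version" ]; then
		echo "libxml2 $libxml_version: $(libxml2_status "$libxml_version")"
	fi
fi
```

The 2.9.9 threshold is taken from the issue title. Whether a distribution has backported the fix into an older version number cannot be detected this way.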

See fossar/selfoss#1230 for more details.

jtojnar commented Nov 13, 2020

At this point I see these possible solutions:

  • Recommend using html5lib instead of libxml, though I am not sure how performant it is.
  • Try to find out whether libxml can be made to parse the noscript inside p correctly.
  • Make the ContentExtractor also look for the noscript among the parent node’s siblings.
  • Ask Ubuntu and other distros to backport the patch, since it is trivial.
  • Do nothing and ask users to upgrade. But Ubuntu 18.04 is supported at least until April 2023 😿

jtojnar commented Nov 13, 2020

There is also a separate bug in tidy that wraps the img inside the noscript in a p, resulting in invalid p > noscript > p nesting, but that does not seem to cause issues thanks to another libxml2 bug 🤷‍♀️

jtojnar commented Nov 16, 2020

Apparently, html5lib is affected even more badly, even with j0k3r/php-readability#60. I thought it might use libxml2 internally, but it happens on libxml2 2.9.10 as well:

$graby = new Graby([
	'extractor' => [
		'default_parser' => 'html5lib',
		'allowed_parsers' => ['html5lib'], // Without this it would still use libxml
	]
], new GuzzleAdapter());
