Keep h1 and other headings
Even though using h1 tags for sections inside an article is semantically
wrong, a lot of websites do it anyway. So the idea here is to stop
stripping headings, including h1, on Readability's side.

Fixes wallabag/wallabag#5805

Signed-off-by: Kevin Decherf <[email protected]>
Kdecherf committed Jun 29, 2022
1 parent 6689f19 commit 41ef592
Showing 1 changed file with 6 additions and 3 deletions.
9 changes: 6 additions & 3 deletions src/Readability.php
@@ -395,14 +395,17 @@ public function prepArticle(\DOMNode $articleContent): void
         $this->clean($articleContent, 'object');
         $this->clean($articleContent, 'iframe');
         $this->clean($articleContent, 'canvas');
-        $this->clean($articleContent, 'h1');

         /*
-         * If there is only one h2, they are probably using it as a main header, so remove it since we
+         * If there is only one h1 or h2, they are probably using it as a main header, so remove it since we
          * already have a header.
          */
+        $h1s = $articleContent->getElementsByTagName('h1');
+        if (1 === $h1s->length && mb_strlen($this->getInnerText($h1s->item(0), true, true)) < 100) {
+            $this->clean($articleContent, 'h1');
+        }
         $h2s = $articleContent->getElementsByTagName('h2');
-        if (1 === $h2s->length && mb_strlen($this->getInnerText($h2s->item(0), true, true)) < 100) {
+        if (0 === $h1s->length && 1 === $h2s->length && mb_strlen($this->getInnerText($h2s->item(0), true, true)) < 100) {
             $this->clean($articleContent, 'h2');
         }
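
The heading rule described in the commit message can be tried outside the library with a minimal standalone sketch. The function names `removeByTag` and `stripLoneMainHeading` are hypothetical and are not part of `src/Readability.php`. The sketch uses `textContent` in place of the library's `getInnerText()`, and it reads the h1 count once up front, a choice made here for predictability since `DOMNodeList` is live.

```php
<?php
// Hypothetical standalone sketch; not the code in src/Readability.php.

/** Remove every element with the given tag name from under $root. */
function removeByTag(DOMNode $root, string $tag): void
{
    // Copy the live DOMNodeList first so removals do not shift indices.
    foreach (iterator_to_array($root->getElementsByTagName($tag)) as $node) {
        $node->parentNode->removeChild($node);
    }
}

/**
 * Drop a lone, short h1; if there was no h1 at all, drop a lone, short h2.
 * Every other heading is kept.
 */
function stripLoneMainHeading(DOMElement $article, int $maxLength = 100): void
{
    $h1s = $article->getElementsByTagName('h1');
    $h2s = $article->getElementsByTagName('h2');
    // Assumption of this sketch: snapshot the count before anything is removed.
    $h1Count = $h1s->length;

    if (1 === $h1Count && mb_strlen(trim($h1s->item(0)->textContent)) < $maxLength) {
        removeByTag($article, 'h1');
    }
    if (0 === $h1Count && 1 === $h2s->length
        && mb_strlen(trim($h2s->item(0)->textContent)) < $maxLength) {
        removeByTag($article, 'h2');
    }
}

// Usage: one h1 acting as the page title, two h2 section headings.
$doc = new DOMDocument();
$doc->loadHTML('<div><h1>Title</h1><h2>One</h2><h2>Two</h2><p>Text</p></div>');
$article = $doc->getElementsByTagName('div')->item(0);
stripLoneMainHeading($article);
echo $doc->saveHTML($article), PHP_EOL;
```

In the usage above, the lone h1 is removed and both h2 elements are kept. An article with several h1 elements would keep all of them, which is the case the commit message is concerned with.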
