Mid dot is not processed #8373
Comments
The character is U+00B7 (\N{U+00B7}, https://unicodeplus.com/U+00B7), which is listed as a separator in Ingredients.pm:
Other entries also contain this middle dot, not only Catalan ones:
Many ingredients_text values seem to contain this middle dot; see the examples on Mirabelle:
Suggestion:
This may be related to data quality for ingredients. A new alert could detect products whose ingredients should be reviewed because errors from bad text extraction have not been corrected. What do you think? @stephanegigandet @alexgarel @CharlesNepote @teolemon
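A rough sketch of what such a check could look like, written in Python for illustration rather than in the project's Perl. The function name `middle_dot_report`, the returned keys, and the rule "flag only middle dots that are not directly between two word characters" are all assumptions, not existing Open Food Facts code.

```python
import re

MIDDLE_DOT = "\u00b7"

# A middle dot directly between two word characters, as in Catalan "cel·lulosa".
INTRA_WORD_DOT = re.compile(rf"(?<=\w){MIDDLE_DOT}(?=\w)")


def middle_dot_report(ingredients_text: str) -> dict:
    """Count middle dots and flag the text when some look like separators.

    Hypothetical helper: the review rule is an assumption for illustration.
    """
    total = ingredients_text.count(MIDDLE_DOT)
    intra_word = len(INTRA_WORD_DOT.findall(ingredients_text))
    other = total - intra_word
    return {
        "total": total,
        "intra_word": intra_word,
        "other": other,
        # Assumption: only dots outside words suggest a list separator or OCR debris.
        "needs_review": other > 0,
    }


report = middle_dot_report("formatge, cel·lulosa · sal")
print(report)
```

One limitation of this rule: a separator dot typed without surrounding spaces (for example between two ingredient names) would be counted as intra-word, so a real alert would likely need a stricter test.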
@benbenben2 Good idea. I think we could replace:
my $separators_except_comma = qr/(;|:|$middle_dot|\[|\{|\(|\N{U+FF08}|( $dashes ))|(\/|\N{U+FF0F})/i
by:
my $separators_except_comma = qr/(;|:| $middle_dot |\[|\{|\(|\N{U+FF08}|( $dashes ))|(\/|\N{U+FF0F})/i
(with spaces around $middle_dot).
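The effect of requiring spaces around the middle dot can be illustrated with a small Python sketch. This is not the Perl code from Ingredients.pm: the patterns `BARE` and `SPACED` are heavily simplified stand-ins that keep only `;`, `:` and the middle dot, and `split_ingredients` is a hypothetical helper.

```python
import re

MIDDLE_DOT = "\u00b7"

# Simplified stand-ins for the two variants discussed above (assumption:
# the real pattern contains many more separators).
BARE = re.compile(rf";|:|{MIDDLE_DOT}")        # middle dot always splits
SPACED = re.compile(rf";|:| {MIDDLE_DOT} ")    # splits only when flanked by spaces


def split_ingredients(text, separators):
    """Split text on the given separator pattern and drop empty pieces."""
    return [part.strip() for part in separators.split(text) if part.strip()]


text = "formatge · cel·lulosa"
print(split_ingredients(text, BARE))    # ['formatge', 'cel', 'lulosa']
print(split_ingredients(text, SPACED))  # ['formatge', 'cel·lulosa']
```

With the space-flanked variant, the dot inside cel·lulosa no longer splits the word, while a middle dot used as a list separator with spaces on both sides still does.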
Thanks @benbenben2! This is working on staging but is not yet deployed in production:
Describe the bug
The Catalan name of additive E460, cel·lulosa, is shown as two words: cel, lulosa.
To Reproduce
https://es.openfoodfacts.org/producto/8422410814206/queso-rallado-mozzarella-bonpreu
Expected behavior
cel·lulosa is accepted as E460.
Screenshots
Additional context
Word is registered:
openfoodfacts-server/taxonomies/additives.txt
Line 13146 in 76dc037
Part of