Fix issue 221 (partially) #288
Conversation
Signed-off-by: Holger Bruch <[email protected]>
Signed-off-by: Holger Bruch <[email protected]>
Signed-off-by: Holger Bruch <[email protected]>
Cool, thanks! May I ask how you test and iterate in general? Do you have a nominatim installation and feed photon after each change, or do you reindex somehow?
I'm running a local nominatim installation with a smaller region, from which I feed photon. For this region, I run a small test set via geocoders/geocoder-tester. If results look promising, I deploy it to a server running a nominatim instance with a Germany dataset, which is fed to photon again (takes about 10h). For this I run geocoder-tester for Germany plus an internal reference dataset of ~1000 addresses. For the reference dataset, some aggregated stats (max distance to reference coord < 100m? Dice similarity coefficient > 0.5? housenumber found?) give a good indication of match quality. However, these tests have a bias towards whole-address geocoding, so type-ahead behavior is something I check manually via (limited) exploratory testing.

Reindexing is a good idea and might significantly speed up development cycles, did you try that already? Another option I thought of was importing the json-dump. This would lower the initial barrier for new contributors, as they would not need to set up nominatim (but currently no JsonImporter exists, does it?). How do you iterate?
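For illustration, here is a minimal sketch of the kind of aggregated reference check described above. The photon endpoint, the CSV column names and the thresholds are assumptions for this sketch only, not something prescribed by photon or this PR:

```python
import csv
import math
import requests  # assumed available; any HTTP client works

PHOTON_URL = "http://localhost:2322/api"  # assumed local photon instance

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres."""
    r = 6371000
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def dice(a, b):
    """Dice similarity coefficient over character bigrams."""
    grams = lambda s: {s[i:i + 2] for i in range(len(s) - 1)}
    ga, gb = grams(a.lower()), grams(b.lower())
    if not ga or not gb:
        return 0.0
    return 2 * len(ga & gb) / (len(ga) + len(gb))

def check(row):
    """Geocode one reference address and compute the three indicators."""
    resp = requests.get(PHOTON_URL, params={"q": row["query"], "limit": 1})
    features = resp.json().get("features", [])
    if not features:
        return False, False, False
    props = features[0]["properties"]
    lon, lat = features[0]["geometry"]["coordinates"]  # GeoJSON order: lon, lat
    dist_ok = haversine_m(float(row["lat"]), float(row["lon"]), lat, lon) < 100
    label = " ".join(str(props.get(k, "")) for k in ("street", "housenumber", "postcode", "city"))
    sim_ok = dice(label, row["query"]) > 0.5
    hnr_ok = props.get("housenumber") == row.get("housenumber")
    return dist_ok, sim_ok, hnr_ok

if __name__ == "__main__":
    # hypothetical reference dataset with columns: query, lat, lon, housenumber
    with open("reference_addresses.csv") as f:
        rows = list(csv.DictReader(f))
    results = [check(r) for r in rows]
    n = len(results)
    print("distance < 100 m :", sum(r[0] for r in results) / n)
    print("dice > 0.5       :", sum(r[1] for r in results) / n)
    print("housenumber found:", sum(r[2] for r in results) / n)
```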
I never touched the major feeding stuff before and modified just the querying part. But as I'm probably a bit more into this topic now, I'll do it similarly to you and will also make it easier to create country extracts, see #281. We should invest in this topic as it makes developing new stuff easier. Also, performance testing is important but complex ... so many things to do :)
No
Yes, good idea. And I hope with the extracts this will improve the work too
Yes
I just realized that the JSON dump format corresponds to the elastic bulk format, so it should be feasible to provide an importer. Various bulk importers exist (stream2es, bulkEs), but photon should provide a self-contained JSON import. I think reindexing an existing photon index would not work, as all context information is dropped from the source during the import. (Whether it should be indexed at all is another question I'm currently thinking about, as many unexpected matches are caused by 'invisible' context matches...)
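A rough sketch of what such a self-contained import could look like, assuming the dump is already newline-delimited bulk data; the file name, the Elasticsearch URL and the batch size are made up for illustration, and error handling is kept minimal:

```python
import itertools
import requests  # assumed available; any HTTP client works

ES_BULK_URL = "http://localhost:9200/_bulk"   # assumed local Elasticsearch
DUMP_FILE = "photon_dump.jsonl"               # hypothetical dump file name
BATCH_LINES = 10000                           # must stay even so action/source pairs stay together

def batches(lines, size):
    """Yield successive chunks of lines from the dump."""
    it = iter(lines)
    while True:
        chunk = list(itertools.islice(it, size))
        if not chunk:
            return
        yield chunk

with open(DUMP_FILE, "r", encoding="utf-8") as f:
    for chunk in batches(f, BATCH_LINES):
        body = "".join(chunk)
        if not body.endswith("\n"):
            body += "\n"  # the bulk API requires a trailing newline
        resp = requests.post(
            ES_BULK_URL,
            data=body.encode("utf-8"),
            headers={"Content-Type": "application/x-ndjson"},
        )
        resp.raise_for_status()
        if resp.json().get("errors"):
            print("some bulk items failed in this batch")
```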
@karussell Should we create a world-wide DB with this PR to play around with? Can we move the (very interesting) testing discussion into a separate issue?
@hbruch are you using your newly added tests or can you recommend what else has to be green?
Let us fix & merge the "decay" problem first (#296) and then I'll do one global export with this PR.
Yes, though there are some failing tests because photon does not return the expected result as the first result. Returning them at least within the first couple of matches is better than not at all, but I was unsure about tuning the test cases for photon by increasing the limit.

I checked against an internal reference database of (German) addresses and, for the Germany test suite of geocoder-tester, all formerly green tests stayed green. This is hopefully the case for the planet DB as well, but that still needs to be checked. What I didn't check, but which is surely worth doing, is the behavior for British addresses, i.e. postal codes.
@lonvia please have a look at #314 (comment)
Ah, the problem is now that photon.komoot.de also runs with this data and we cannot easily see the improvements. This was my fault for writing this into the "latest DB".
There are many more tests passing now, nice! Especially in geocoder_tester/world/france/nordpasdecalais/test_public_buildings_addresses.csv, which are related to this issue, I think. It looks like some got worse, but when looking into the details it seems that the tests are too strict (radius=100m) and the OSM data is too poor (just a way, no housenumbers), like for
I noticed one oddity: partial addresses with a numerical housenumber do much worse now than the same address with a housenumber with a letter. Search for 'Kiefernstr 9a' and you get the buildings returned. Search for 'Kiefernstr 9' and there are only roads, even though there are results in the database that should match ('Kiefernstr 9, Dresden', for one example). Is that somehow related to these changes, or should I open a new issue?
Could we verify whether it performs worse than before? The current photon.komoot.de DB is built with the same version, isn't it? A slightly different weighting could occur due to an added 'unique' filter, but I think the general issue existed before. Searching for 'Kiefernstr 9' returns streets with postal codes starting with 9 (as postal codes are sliced by the edge-ngram tokenizer). I'm not sure if we should support search-as-you-type for partial numerals at all, neither postal codes nor housenumbers.
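As a toy illustration of why the prefix "9" collides with postal codes (this is not photon's actual analyzer configuration; the gram sizes are made up):

```python
def edge_ngrams(token, min_gram=1, max_gram=10):
    """Leading-edge n-grams, as an edge-ngram tokenizer/filter would emit them."""
    return [token[:n] for n in range(min_gram, min(max_gram, len(token)) + 1)]

# A postcode like "90210" is indexed as all of its prefixes ...
print(edge_ngrams("90210"))          # ['9', '90', '902', '9021', '90210']

# ... so a lone query term "9" (meant as a housenumber) also matches
# every document whose postcode starts with 9.
print("9" in edge_ngrams("90210"))   # True
```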
Ok, the partial search would explain it. I noticed some other oddities with that, but that's really for a different issue. Otherwise this looks ok to me.
The Kiefernstr anomaly has disappeared with the latest test instance. Maybe that was just due to the mixed setup. All tests I could come up with look good.
I still wonder why there are apparently more newly failing than newly passing tests now. I will have a look into it tomorrow.
Thanks! And yes, this is strange. Still, at least 10 are wrongly reported, as written in #314 (comment) regarding the geocoder-tester bug.
Thank you for setting up both instances. Analyzing the differences, I see several issues, none of which I see as related to this PR:
So to me, #288 seems to perform better on housenumbers. |
Oh, wow. Thanks for this in-depth analysis. This makes sense! So as a result: IMO we can merge this PR and need to improve our testing pipeline.
This could be an issue with the one server where I had UTF-8 problems (will try on a different one)
I think this is unrelated to the UTF-8 things, as this happened to me with geocoder-tester on my properly functioning local laptop too (you can run just these tests to see what I mean). Could we reduce the non-deterministic behaviour somehow (e.g. by using a sorted search)?
Fix #221 (partially). This PR improves housenumber matching, especially in cases where the OSM housenumber suffix
Tests are provided via geocoders/geocoder-tester#28, which highlight the improvements as well as further existing issues.
Still open: