
When the dataset doesn't have a valid URL, it cannot be serialized to n3 and ttl formats #244

Closed · 1 task done
adinuca opened this issue Jul 2, 2018 · 12 comments
@adinuca (Contributor) commented Jul 2, 2018

adinuca added the Vitamin label Jul 2, 2018
@adinuca (Contributor, Author) commented Jul 2, 2018

Hi @EricSoroos, could you please take a look at this issue?

Logs can be found here.

@EricSoroos (Collaborator) commented:

Looking into this a bit -- the error is that ckanext-dcat (and its dependency rdflib) strictly expects the url to be a valid URL, and doesn't trap the error or skip the field when it's invalid. We definitely have metadata that's not a valid url, so the combination of that metadata not conforming to the field definition and the strict requirements of these formats causes the error.

Three options to fix it:

  • Skip the links if the url isn't valid. The links are generated in three places in the templates and we have access to the url there, but the is_url helper is not helpful, as it doesn't check for invalid urls. We could add a helper to dcat, or just check for a couple of the likely invalid characters we hit, like spaces. Easy to do as a hack, a little more involved to do cleanly (see the sketch after this list).

  • Trigger a different, verbose error on the actual link. This is pretty easy, and will help get those urls out of the search engines. We should probably return a 4xx-series error, but I don't see an obvious one off hand.

  • Patch rdflib to skip the url if it's invalid. This is a bigger job, especially to do it in a way that keeps the output valid n3/turtle.
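
For the first option, a minimal sketch of such a check, assuming a hypothetical `is_valid_url` helper (the name is illustrative; this is not an existing CKAN or ckanext-dcat helper), could look like:

```python
# Hypothetical helper for option 1: reject values that would break the
# n3/ttl serialization before they are ever rendered as links.
from urllib.parse import urlparse


def is_valid_url(url):
    """Return True only if `url` looks safe to emit as an RDF URI."""
    if not url or any(ch.isspace() for ch in url):
        # Whitespace is the kind of invalid character we actually hit.
        return False
    parsed = urlparse(url)
    # Require an explicit scheme and a host, e.g. "https://example.org/data".
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)
```

The templates would then only emit the link when the helper returns True, and skip it (or fall back to a plain literal) otherwise.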

@EricSoroos (Collaborator) commented:

Sample possible error response:

[screenshot: sample error response, 2018-07-10]

@adinuca (Contributor, Author) commented Jul 10, 2018

I think the response can be shorter, something like: "Format not supported due to invalid URL".

It would also be good to catch the exception and log a message that says exactly what the issue is, instead of the long stack trace.

@EricSoroos (Collaborator) commented:

That error message is 90% the url itself. The response is essentially the result of catching the exception and returning something useful to the browser: it explains the situation and prevents a crawler from retaining the page.

I don't think we need to log it, since we know exactly what's causing it and can find all cases of this with a SQL query.
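
Roughly, the catch-and-return behaviour could be sketched as below; the function name and the `serialize` callable are illustrative stand-ins rather than the actual ckanext-dcat code, and the sketch assumes CKAN's `toolkit.abort` for raising the HTTP error:

```python
# Illustrative sketch only: wrap the serialization step and turn the rdflib
# failure into a 4xx response instead of a 500 with a long stack trace.
from ckan.plugins import toolkit


def render_dataset_rdf(serialize, dataset_dict, _format):
    """`serialize` stands in for whatever callable produces the n3/ttl output."""
    try:
        return serialize(dataset_dict, _format=_format)
    except Exception:
        # No 4xx code is a perfect fit; 409 is one option. The short message
        # explains the situation and keeps crawlers from retaining the page.
        toolkit.abort(409, "Format not supported due to invalid URL in the dataset metadata")
```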

@adinuca (Contributor, Author) commented Jul 12, 2018

OK @EricSoroos, my main reason for the message above was to avoid having so many error logs that don't help. I agree we can just catch the exception and return a proper response to the user.

@adinuca (Contributor, Author) commented Jul 27, 2018

Hi @EricSoroos, do you have any update on this? There have been a few emails regarding errors generated by these URLs.

@adinuca (Contributor, Author) commented Aug 24, 2018

Hi @EricSoroos, any update on this?

@adinuca (Contributor, Author) commented Sep 7, 2018

Hi @deirdrelee, do you know when this will get done? As with #249, it is hard to spot real problems in the logs because of the errors generated by this issue.

@EricSoroos (Collaborator) commented:

I've pushed a fix for this to staging.

@adinuca (Contributor, Author) commented Sep 10, 2018

Thank you!

@adinuca (Contributor, Author) commented Sep 10, 2018

This has been fixed and deployed to production by @EricSoroos. Thank you!

adinuca closed this as completed Sep 10, 2018