
When the dataset doesn't have a valid URL, it cannot be serialized to n3 and ttl formats #244

Closed · 1 task done
adinuca opened this issue Jul 2, 2018 · 12 comments
@adinuca (Contributor) commented Jul 2, 2018

adinuca added the Vitamin label Jul 2, 2018
@adinuca (Contributor, Author) commented Jul 2, 2018

Hi @EricSoroos, could you please take a look at this issue?

Logs can be found here.

@EricSoroos (Collaborator) commented:

Looking into this a bit -- the error is that ckanext-dcat (and its dependency rdflib) strictly expects the url to be a valid URL, and doesn't trap the error or skip the field when it's invalid. We definitely have metadata that's not a valid url, so the combination of that metadata not conforming to the field definition and the strict requirements of these formats causes the error.

Three options to fix it:

  • Skip the links if the url isn't valid. The links are generated in three places in the templates and we have access to the url there, but the is_url helper is not helpful, as it doesn't check for invalid urls. We could add a helper to dcat, or just check for a couple of the likely invalid characters we hit, like spaces. Easy to do as a hack, a little more involved to do cleanly (see the sketch after this list).

  • Trigger a different, verbose error on the actual link. This is pretty easy, and will help get those urls out of the search engines. We should probably return a 4xx-series error, but I don't see an obvious one off hand.

  • Patch rdflib to skip the url if it's invalid. This is a bigger job, especially to do it in a way that keeps the output valid n3/turtle.
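
For the first option, a minimal sketch of such a check, assuming a hypothetical `is_valid_url` helper (the name is illustrative; this is not an existing CKAN or ckanext-dcat helper), could look like:

```python
# Hypothetical helper for option 1: reject values that would break the
# n3/ttl serialization before they are ever rendered as links.
from urllib.parse import urlparse


def is_valid_url(url):
    """Return True only if `url` looks safe to emit as an RDF URI."""
    if not url or any(ch.isspace() for ch in url):
        # Whitespace is the kind of invalid character we actually hit.
        return False
    parsed = urlparse(url)
    # Require an explicit scheme and a host, e.g. "https://example.org/data".
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)
```

The templates would then only emit the link when the helper returns True, and skip it (or fall back to a plain literal) otherwise.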

@EricSoroos (Collaborator) commented:

Sample possible error response:

[screenshot: sample error response, 2018-07-10]

@adinuca (Contributor, Author) commented Jul 10, 2018

I think the response can be shorter, something like: "Format not supported due to invalid URL".

It would also be good to catch the exception and log a message that says exactly what the issue is, instead of the long stack trace.

@EricSoroos (Collaborator) commented:

That error message is 90% the url itself. The response is essentially the result of catching the exception and returning something useful to the browser: it explains the situation and prevents a crawler from retaining the page.

I don't think we need to log it, since we know exactly what's causing it and can find all cases of this with a SQL query.
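
Roughly, the catch-and-return behaviour could be sketched as below; the function name and the `serialize` callable are illustrative stand-ins rather than the actual ckanext-dcat code, and the sketch assumes CKAN's `toolkit.abort` for raising the HTTP error:

```python
# Illustrative sketch only: wrap the serialization step and turn the rdflib
# failure into a 4xx response instead of a 500 with a long stack trace.
from ckan.plugins import toolkit


def render_dataset_rdf(serialize, dataset_dict, _format):
    """`serialize` stands in for whatever callable produces the n3/ttl output."""
    try:
        return serialize(dataset_dict, _format=_format)
    except Exception:
        # No 4xx code is a perfect fit; 409 is one option. The short message
        # explains the situation and keeps crawlers from retaining the page.
        toolkit.abort(409, "Format not supported due to invalid URL in the dataset metadata")
```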

@adinuca (Contributor, Author) commented Jul 12, 2018

OK @EricSoroos, my main reason for the message above was to avoid having so many error logs that don't help. I agree we can just catch the exception and return a proper response to the user.

@adinuca (Contributor, Author) commented Jul 27, 2018

Hi @EricSoroos, do you have any update on this? There have been a few emails regarding errors generated by these URLs.

@adinuca (Contributor, Author) commented Aug 24, 2018

Hi @EricSoroos, any update on this?

@adinuca (Contributor, Author) commented Sep 7, 2018

Hi @deirdrelee, do you know when this will get done? As with #249, it is hard to spot real problems in the logs because of the errors generated by this issue.

@EricSoroos (Collaborator) commented:

I've pushed a fix for this to staging.

@adinuca (Contributor, Author) commented Sep 10, 2018

Thank you!

@adinuca (Contributor, Author) commented Sep 10, 2018

This has been fixed and deployed to production by @EricSoroos. Thank you!

adinuca closed this as completed Sep 10, 2018