-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PR for Issue #388 (Improve 50A Collection) on Main Repo #19
PR for Issue #388 (Improve 50A Collection) on Main Repo #19
Conversation
…n to enrich data gathered in officer csv
…el and additional attributes for command model
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I came across a few errors. One is pretty simple. The href
one, I'd need to dig in more to understand.
|
||
address_link = response.css(".intro > a") | ||
|
||
if 'maps' in address_link.attrib['href']: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like this key is not always assigned. I'm seeing a lot of errors thrown up like this:
...
File "/police-data-trust-scrapers/scrapers/fifty_a/fifty_a/spiders/command.py", line 22, in parse_command
commanding_officer_url, command_address, command_description = self.parse_intro(response)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/police-data-trust-scrapers/scrapers/fifty_a/fifty_a/spiders/command.py", line 48, in parse_intro
if 'maps' in address_link.attrib['href']:
~~~~~~~~~~~~~~~~~~~^^^^^^^^
KeyError: 'href'
Once it starts hitting that, no further entries are being collected.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for looking into it! It's strange, I didn't encounter this issue when testing but I'll certainly look into it and put in some defensive checks to ensure it doesn't break the spider.
def parse_age(self, response): | ||
age = response.css(".age::text").get() | ||
if age: | ||
age = parse_string_to_number(age) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In some cases, this value returns a range. That's leading to a crash in the parser and the loss of the entry.
File "/police-data-trust-scrapers/scrapers/fifty_a/fifty_a/spiders/officer.py", line 35, in parse_age
age = parse_string_to_number(age)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/police-data-trust-scrapers/scrapers/common/parse.py", line 12, in parse_string_to_number
raise NotImplementedError(e)
NotImplementedError: invalid literal for int() with base 10: '65-69'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can look into this as well, I did not realize this returned a range. Thanks for bringing it to my attention. In this scenario if it returned a range, would it be better to take some valid number in the range(i.e lowest, highest, average, median, etc.) or would it be better to instead change the type to string and keep the range as is or perhaps something else entirely?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the best move is to take the midpoint of the range. If it's an estimate anyway, that should be fine.
PR for Issue #388 on Main Repo