Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PR for Issue #388 (Improve 50A Collection) on Main Repo #19

Closed
wants to merge 16 commits into from
Closed

PR for Issue #388 (Improve 50A Collection) on Main Repo #19

wants to merge 16 commits into from

Conversation

aasnani
Copy link

@aasnani aasnani commented Jul 23, 2024

Copy link
Collaborator

@DMalone87 DMalone87 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I came across a few errors. One is pretty simple. The href one, I'd need to dig in more to understand.


address_link = response.css(".intro > a")

if 'maps' in address_link.attrib['href']:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like this key is not always assigned. I'm seeing a lot of errors thrown up like this:

...
  File "/police-data-trust-scrapers/scrapers/fifty_a/fifty_a/spiders/command.py", line 22, in parse_command
    commanding_officer_url, command_address, command_description = self.parse_intro(response)
                                                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/police-data-trust-scrapers/scrapers/fifty_a/fifty_a/spiders/command.py", line 48, in parse_intro
    if 'maps' in address_link.attrib['href']:
                 ~~~~~~~~~~~~~~~~~~~^^^^^^^^
KeyError: 'href'

Once it starts hitting that, no further entries are being collected.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for looking into it! It's strange, I didn't encounter this issue when testing but I'll certainly look into it and put in some defensive checks to ensure it doesn't break the spider.

def parse_age(self, response):
age = response.css(".age::text").get()
if age:
age = parse_string_to_number(age)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In some cases, this value returns a range. That's leading to a crash in the parser and the loss of the entry.

  File "/police-data-trust-scrapers/scrapers/fifty_a/fifty_a/spiders/officer.py", line 35, in parse_age
    age = parse_string_to_number(age)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/police-data-trust-scrapers/scrapers/common/parse.py", line 12, in parse_string_to_number
    raise NotImplementedError(e)
NotImplementedError: invalid literal for int() with base 10: '65-69'

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can look into this as well, I did not realize this returned a range. Thanks for bringing it to my attention. In this scenario if it returned a range, would it be better to take some valid number in the range(i.e lowest, highest, average, median, etc.) or would it be better to instead change the type to string and keep the range as is or perhaps something else entirely?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the best move is to take the midpoint of the range. If it's an estimate anyway, that should be fine.

@aasnani aasnani closed this by deleting the head repository Oct 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants