PR for Issue #388 (Improve 50A Collection) on Main Repo #19

aasnani · 2024-07-23T20:55:20Z

PR for Issue #388 on Main Repo

DMalone87

I came across a few errors. One is pretty simple. The href one, I'd need to dig in more to understand.

DMalone87 · 2024-07-31T02:58:29Z

scrapers/fifty_a/fifty_a/spiders/command.py

+
+        address_link = response.css(".intro > a")
+
+        if 'maps' in address_link.attrib['href']:


It looks like this key is not always assigned. I'm seeing a lot of errors thrown up like this:

...
  File "/police-data-trust-scrapers/scrapers/fifty_a/fifty_a/spiders/command.py", line 22, in parse_command
    commanding_officer_url, command_address, command_description = self.parse_intro(response)
                                                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/police-data-trust-scrapers/scrapers/fifty_a/fifty_a/spiders/command.py", line 48, in parse_intro
    if 'maps' in address_link.attrib['href']:
                 ~~~~~~~~~~~~~~~~~~~^^^^^^^^
KeyError: 'href'

Once it starts hitting that, no further entries are being collected.

aasnani · 2024-08-01

Thanks for looking into it! It's strange; I didn't encounter this issue when testing, but I'll certainly look into it and put in some defensive checks to ensure it doesn't break the spider.
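One possible shape for the defensive check discussed in this thread, sketched with a plain mapping standing in for `address_link.attrib` so that it runs without Scrapy. The helper name `is_maps_link` and the sample URL are hypothetical and not part of the PR. In Scrapy, `SelectorList.attrib` returns the first matched element's attributes, or an empty dict when nothing matches, so indexing `['href']` raises `KeyError` whenever `.intro > a` matches nothing or the first anchor has no `href`.

```python
from typing import Mapping


def is_maps_link(attrib: Mapping[str, str]) -> bool:
    """Return True only when an href is present and contains 'maps'."""
    # .get() avoids the KeyError that attrib['href'] raises when the key is missing.
    href = attrib.get("href", "")
    return "maps" in href


# Stand-ins for address_link.attrib on three kinds of command pages.
print(is_maps_link({"href": "https://maps.example.com/place/1"}))  # anchor with a maps href
print(is_maps_link({}))                                            # no anchor matched
print(is_maps_link({"class": "command"}))                          # anchor without href
```

With a helper like this, the condition in `parse_intro` could read `if is_maps_link(address_link.attrib):`, so a page without an address link would be skipped rather than raising inside the callback.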

DMalone87 · 2024-07-31T03:19:41Z

scrapers/fifty_a/fifty_a/spiders/officer.py

+    def parse_age(self, response):
+        age = response.css(".age::text").get()
+        if age:
+            age = parse_string_to_number(age)


In some cases, this value returns a range. That's leading to a crash in the parser and the loss of the entry.

  File "/police-data-trust-scrapers/scrapers/fifty_a/fifty_a/spiders/officer.py", line 35, in parse_age
    age = parse_string_to_number(age)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/police-data-trust-scrapers/scrapers/common/parse.py", line 12, in parse_string_to_number
    raise NotImplementedError(e)
NotImplementedError: invalid literal for int() with base 10: '65-69'

aasnani · 2024-08-01

I can look into this as well; I did not realize this returned a range. Thanks for bringing it to my attention. When it returns a range, would it be better to take some valid number in the range (e.g., lowest, highest, average, median), to change the type to string and keep the range as is, or to do something else entirely?

DMalone87 · 2024-08-13

I think the best move is to take the midpoint of the range. If it's an estimate anyway, that should be fine.
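A minimal sketch of the midpoint approach suggested here, assuming the age text is either a single integer or a range such as `'65-69'` (the value shown in the traceback). The function name `parse_age_text` is hypothetical, and rounding the midpoint down to keep the field an `int` is an assumption, not something decided in this thread.

```python
import re
from typing import Optional

# Matches "65-69" (hyphen or en dash), with optional surrounding whitespace.
_AGE_RANGE = re.compile(r"^\s*(\d+)\s*[-\u2013]\s*(\d+)\s*$")
# Matches a single integer such as "42".
_AGE_SINGLE = re.compile(r"^\s*(\d+)\s*$")


def parse_age_text(text: Optional[str]) -> Optional[int]:
    """Parse an age string into an int, using the midpoint for ranges."""
    if not text:
        return None
    range_match = _AGE_RANGE.match(text)
    if range_match:
        low, high = int(range_match.group(1)), int(range_match.group(2))
        # Assumption: round the midpoint down so the field stays an int.
        return (low + high) // 2
    single_match = _AGE_SINGLE.match(text)
    if single_match:
        return int(single_match.group(1))
    # Unrecognized formats yield None instead of raising, so the entry is kept.
    return None


print(parse_age_text("65-69"))  # 67
print(parse_age_text("42"))     # 42
```

Returning `None` for unrecognized text, rather than raising, is one way to keep a single odd value from dropping the whole officer entry.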

Armand added 16 commits July 21, 2024 12:22

Implemented downloading of officers CSV and edited run.sh

13b6bcf

update gitignore to ignore csv files

4a41cd7

Updated run script to include downloading officers csv

81dd408

updated officer item and scraper to only gather additional information to enrich data gathered in officer csv

6bec21e

update officer item model to use int for complaint list instead of dict as previous

8e6cb33

new command officer item to represent officer item under commmand model and additional attributes for command model

0f448a7

update command crawler to get all new attributes

9c9ce91

merge

42a82e3

add text utilities

3446a36

change officer age type from str to int

2b7322b

change officer age scraping to return int instead of str

3866a17

fix relative URL issue and use text utility

613a09e

add new command page to use for testing instead of list of commands page

41fca72

update command tests to use new command page

34fd5ad

update officer page with most recent changes

ac52a5e

update officer tests to reflect most recent changes to schema and collection

2c574fd

aasnani mentioned this pull request Jul 23, 2024

[FEATURE] Improve 50-a Data Collection codeforboston/police-data-trust#388

Open

DMalone87 requested changes Jul 31, 2024

aasnani closed this by deleting the head repository Oct 4, 2024

DMalone87 left a comment

DMalone87 Jul 31, 2024

aasnani Aug 1, 2024

DMalone87 Jul 31, 2024

aasnani Aug 1, 2024

DMalone87 Aug 13, 2024

