
New Criminal Filings still not getting all cases #54

Open
adamrlinder opened this issue Nov 11, 2020 · 9 comments
Labels
bug (Something isn't working), data-scraping

Comments

@adamrlinder
Collaborator

Even after merging in the fix for getting the correct page count, our New Criminal Filings scraping script (0_scrape.py) is not getting the right number of cases. Today's run, which scraped cases from yesterday, scraped 32 cases. There are 44 total: https://www.courts.phila.gov/NewCriminalFilings/date/default.aspx?search=2020-11-10

@adamrlinder added the bug and data-scraping labels on Nov 11, 2020
@notchia
Collaborator

notchia commented Nov 11, 2020

Is it missing filings only above/below a certain docket number, or dropping dockets in between?

@adamrlinder
Collaborator Author

It's not clear to me whether there is a pattern. It is always missing the cases from later in the day, but I think that's probably just a function of the order they appear in on the page, not that it chokes on the time (though who knows, maybe it does!).

This spreadsheet was generated by our GitHub Action. The latest entry is docket MC-51-CR-0021987-2020, assigned at 1:01 PM yesterday.
2020-11-10 - cfp scrape.xlsx

This spreadsheet was generated by the script I wrote before this project started. It is worse in almost every way, except that it manages to download every case:
2020-11-10 - adam scrape.xlsx

@douglaswlee

douglaswlee commented Nov 11, 2020

I don't think it's related to this issue, but I believe the script will fail if there is only one page's worth of records (as is the case for today). The problem is that the line below will produce an empty list if there's only a single page of filings:

# Remove the last entry, since that's just the link to the next page (the ">>" button)
pages = ul.findAll("li", recursive=False)[:-1]

Something like the following might take care of this, though I don't know whether it will address this issue specifically:

# Get all "links" to pages
pages = ul.findAll("li", recursive=False)

# If there is more than one page, the last <li> is the extra "next page" link;
# otherwise, the single <li> is the only page
end_page = len(pages) - 1 if len(pages) > 1 else 1
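The page-count logic above can be sketched as a self-contained function. This is pure Python with the `<li>` entries represented as a plain list so it runs without BeautifulSoup; the function name is illustrative, not from 0_scrape.py:

```python
def count_result_pages(li_entries):
    """Return the number of result pages, given the <li> entries of the
    pagination <ul>. When more than one page exists, the site appends an
    extra <li> for the ">>" (next) button, which must not be counted;
    a single-page result has no such button."""
    n = len(li_entries)
    return n - 1 if n > 1 else 1

# One page of filings: just ["1"], no ">>" button
print(count_result_pages(["1"]))                   # -> 1

# Three pages of filings, plus the ">>" button
print(count_result_pages(["1", "2", "3", ">>"]))   # -> 3
```

The original `[:-1]` slice fails on the single-page case because it unconditionally drops the last entry, leaving an empty list; the conditional above only drops it when a next-page button is actually present.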

@adamrlinder
Collaborator Author

Per some investigation, @douglaswlee has determined that 0_scrape.py gets the correct number of cases when run locally and only fails when run as part of the GitHub Action. Both @douglaswlee and @notchia noted that the version being run is defined by the workflow's container setting: container: rayfallon/pbf-scraping:latest

@RaymondFallon would you mind taking a look at that container and seeing if you see anything that would cause it to get the incorrect number of cases? We have since merged in a couple of changes to the script here on the master branch.

Wondering more generally whether we should move away from GitHub Actions and onto Lambda or something...

@RaymondFallon
Collaborator

Hm, I'm not seeing anything obvious... is it possible that this is not a function of the code, but of the fact that this GitHub Action is being run at 5am? (It was 6am, but the end of daylight saving time shifted it to 5am, since the cron schedule is in UTC, which doesn't shift.) Is it possible that not all of the data is available on the court website yet at 5am?
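Ray's DST explanation can be checked directly: GitHub Actions cron schedules are pinned to UTC, so a job that fired at 6am EDT fires at 5am once EST resumes. A quick sketch (the 10:00 UTC time is an assumption for illustration; the workflow's actual cron line would confirm it):

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # Python 3.9+

philly = ZoneInfo("America/New_York")

# The same fixed UTC cron time, before and after DST ended on Nov 1, 2020
before_dst_end = datetime(2020, 10, 30, 10, 0, tzinfo=timezone.utc)
after_dst_end = datetime(2020, 11, 11, 10, 0, tzinfo=timezone.utc)

print(before_dst_end.astimezone(philly).strftime("%H:%M"))  # -> 06:00 (EDT, UTC-4)
print(after_dst_end.astimezone(philly).strftime("%H:%M"))   # -> 05:00 (EST, UTC-5)
```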

@RaymondFallon
Collaborator

(That's not to say there isn't a problem with the container, just that I didn't see anything after about an hour of looking.)

@douglaswlee

I think that's a possibility. Right now no filings are listed for today (11/15) later than 1:19 PM, so it could just be that the full set of filings doesn't come in until 6 am.

@adamrlinder
Collaborator Author

Thanks for looking into it, Ray. I just tried rerunning now and it did get the correct number, 58, whereas at 5AM it got 42. I am almost positive previous tests had it getting the incorrect number of cases when rerun later the next day, but I can’t swear to it. I’m going to change the action to run at 8AM daily and see what happens.
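For reference, moving the schedule to 8AM Eastern means a cron line along these lines in the workflow file (a sketch, not the actual workflow; 13:00 UTC is 8:00 AM EST, and since Actions cron is UTC-only it will drift to 9:00 AM Eastern when daylight saving returns):

```yaml
on:
  schedule:
    # 13:00 UTC = 8:00 AM EST (becomes 9:00 AM Eastern once EDT resumes)
    - cron: '0 13 * * *'
```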

@RaymondFallon
Collaborator

If changing the time of day doesn't appear to fix the problem, maybe I could pair with someone more familiar with the script itself, and we can look at it together with and without Docker to see if we can find anything funky going on.

4 participants