
New Criminal Filings still not getting all cases #54

Open
adamrlinder opened this issue Nov 11, 2020 · 9 comments
Labels
bug (Something isn't working), data-scraping

Comments

@adamrlinder
Collaborator

Even after merging in the fix for getting the correct page count, our New Criminal Filings scraping script (0_scrape.py) is not getting the right number of cases. Today's run, which scraped cases from yesterday, scraped 32 cases. There are 44 total: https://www.courts.phila.gov/NewCriminalFilings/date/default.aspx?search=2020-11-10

@adamrlinder added the bug and data-scraping labels on Nov 11, 2020
@notchia
Collaborator

notchia commented Nov 11, 2020

Is it missing filings only above/below a certain docket number, or dropping dockets in between?

@adamrlinder
Collaborator Author

It's not clear to me whether there is a pattern. It is always missing the cases from later in the day, but I think that's probably just a function of the order they appear in on the page, not that it chokes on the time (though who knows, maybe it does!).

This spreadsheet was generated by our GitHub Action. The latest entry is docket MC-51-CR-0021987-2020, assigned at 1:01 PM yesterday.
2020-11-10 - cfp scrape.xlsx

This spreadsheet was generated by the script I wrote before this project started. It is worse in almost every way, except that it manages to download every case:
2020-11-10 - adam scrape.xlsx

@douglaswlee

douglaswlee commented Nov 11, 2020

I don't think it's related to this issue, but I believe the script will fail if there is only one page's worth of records (as is the case for today). The problem is that the line below will produce an empty list if there's only a single page of filings:

# Remove the last entry, since that's just the link to the next page (the ">>" button)
pages = ul.findAll("li", recursive=False)[:-1]

Something like the following might take care of this, though I don't know whether it will address this issue specifically:

# Get all "links" to pages
pages = ul.findAll("li", recursive=False)

# If there is more than one page, the last <li> is the extra "next page" link;
# otherwise, the single <li> is the only page
end_page = len(pages) - 1 if len(pages) > 1 else 1
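The page-count logic above can be sketched as a self-contained function. This is pure Python with the `<li>` entries represented as a plain list so it runs without BeautifulSoup; the function name is illustrative, not from 0_scrape.py:

```python
def count_result_pages(li_entries):
    """Return the number of result pages, given the <li> entries of the
    pagination <ul>. When more than one page exists, the site appends an
    extra <li> for the ">>" (next) button, which must not be counted;
    a single-page result has no such button."""
    n = len(li_entries)
    return n - 1 if n > 1 else 1

# One page of filings: just ["1"], no ">>" button
print(count_result_pages(["1"]))                   # -> 1

# Three pages of filings, plus the ">>" button
print(count_result_pages(["1", "2", "3", ">>"]))   # -> 3
```

The original `[:-1]` slice fails on the single-page case because it unconditionally drops the last entry, leaving an empty list; the conditional above only drops it when a next-page button is actually present.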

@adamrlinder
Collaborator Author

Per some investigation, @douglaswlee has determined that 0_scrape.py gets the correct number of cases when run locally and only fails when run as part of the GitHub Action. Both @douglaswlee and @notchia noted that the version being run is defined by the workflow's container setting: container: rayfallon/pbf-scraping:latest

@RaymondFallon would you mind taking a look at that container and seeing if you see anything that would cause it to get the incorrect number of cases? We have since merged in a couple of changes to the script here on the master branch.

Wondering more generally whether we should move away from GitHub Actions and onto Lambda or something...

@RaymondFallon
Collaborator

Hm, I'm not seeing anything obvious... is it possible that this is not a function of the code, but of the fact that this GitHub Action is being run at 5am? (It was 6am, but the end of daylight saving time shifted it to 5am, since the cron schedule is in UTC, which doesn't shift.) Is it possible that not all of the data is available on the court website yet at 5am?
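Ray's DST explanation can be checked directly: GitHub Actions cron schedules are pinned to UTC, so a job that fired at 6am EDT fires at 5am once EST resumes. A quick sketch (the 10:00 UTC time is an assumption for illustration; the workflow's actual cron line would confirm it):

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # Python 3.9+

philly = ZoneInfo("America/New_York")

# The same fixed UTC cron time, before and after DST ended on Nov 1, 2020
before_dst_end = datetime(2020, 10, 30, 10, 0, tzinfo=timezone.utc)
after_dst_end = datetime(2020, 11, 11, 10, 0, tzinfo=timezone.utc)

print(before_dst_end.astimezone(philly).strftime("%H:%M"))  # -> 06:00 (EDT, UTC-4)
print(after_dst_end.astimezone(philly).strftime("%H:%M"))   # -> 05:00 (EST, UTC-5)
```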

@RaymondFallon
Collaborator

(That's not to say there isn't a problem with the container, just that I didn't see anything after about an hour of looking.)

@douglaswlee

I think that's a possibility. Right now no filings are listed for today (11/15) later than 1:19 PM, so it could just be that the full set of filings doesn't come in until 6 am.

@adamrlinder
Collaborator Author

Thanks for looking into it, Ray. I just tried rerunning now and it did get the correct number, 58, whereas at 5AM it got 42. I am almost positive previous tests had it getting the incorrect number of cases when rerun later the next day, but I can’t swear to it. I’m going to change the action to run at 8AM daily and see what happens.
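For reference, moving the schedule to 8AM Eastern means a cron line along these lines in the workflow file (a sketch, not the actual workflow; 13:00 UTC is 8:00 AM EST, and since Actions cron is UTC-only it will drift to 9:00 AM Eastern when daylight saving returns):

```yaml
on:
  schedule:
    # 13:00 UTC = 8:00 AM EST (becomes 9:00 AM Eastern once EDT resumes)
    - cron: '0 13 * * *'
```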

@RaymondFallon
Collaborator

If changing the time of day doesn't appear to fix the problem, maybe I could pair with someone more familiar with the script itself, and we can look at it together with and without Docker to see if we can find anything funky going on.

4 participants