Appellate links are wrong; not sure fix #439

Open
mlissner opened this issue Dec 14, 2023 · 5 comments

Comments

@mlissner
Member

There's been some sadness lately in the Big Cases Twitter and BlueSky mentions because links to appellate cases don't seem to be working. The bot is sending links like this:

https://www.courtlistener.com/docket/68073028/1208579363/united-states-v-donald-trump/?redirect_or_modal=True

And the correct link seems to be this:

https://www.courtlistener.com/docket/68073028/01208579363/united-states-v-donald-trump/?redirect_or_modal=True

With an extra zero at the front. I'm not sure what we can do to fix this, but it's not great. Ideas?

@ERosendo
Contributor

While debugging the link issue, I initially suspected that the WebhookEvent model was causing the problem. The bot's link is generated from the docket_id, document_number, and slug stored in that model, and we're storing document numbers as BigIntegers, which removes leading zeros automatically during conversion. This could explain why the bot's link lacks the leading zero that the CL link has.
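
As a quick illustration of the conversion described above (plain Python, not CourtListener code), casting the identifier from the example link to an integer drops the leading zero, and the zero can't be recovered from the integer alone:

```python
# Value taken from the working link in this issue.
pacer_doc_id = "01208579363"

# Roughly what an integer column would end up holding.
as_integer = int(pacer_doc_id)

# What we get back when the integer is turned into a URL segment.
round_tripped = str(as_integer)

print(as_integer)
print(round_tripped == pacer_doc_id)
```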

However, upon closer inspection, I noticed the webhook payload itself also omits the leading zeros from document numbers. So even if we updated the WebhookEvent model to store them in a CharField, it wouldn't fix the issue: the bot would still generate links without the leading zeros, because that's what the webhook sends.

Here's the relevant section of the webhook payload for your reference:

{
  "payload": {
    "results": [
      {
        "id": 373186711,
        "docket": 68073028,
        "date_filed": "2023-12-13",
        "time_filed": null,
        "date_created": "2023-12-13T15:15:10.388110-08:00",
        "entry_number": 1208579363,
        "date_modified": "2023-12-13T15:15:10.391724-08:00",
        "recap_documents": [
          {
            "id": 380735093,
            "sha1": "fef8195894832d007eb4fe71033734cbfcdfd6e4",
            "file_size": 49269,
            "is_sealed": null,
            "thumbnail": null,
            "ocr_status": 2,
            "page_count": 2,
            "date_upload": "2023-12-13T15:15:12.252060-08:00",
            "description": "",
            "filepath_ia": "",
            "absolute_url": "/docket/68073028/1208579363/united-states-v-donald-trump/",
            "date_created": "2023-12-13T15:15:10.401577-08:00",
            "is_available": true,
            "pacer_doc_id": "01208579363",
            "date_modified": "2023-12-13T15:15:12.254297-08:00",
            "document_type": 1,
            "filepath_local": "recap/gov.uscourts.cadc.40415/gov.uscourts.cadc.40415.1208579363.0.pdf",
            "document_number": "1208579363",
            "is_free_on_pacer": null,
            "thumbnail_status": 0,
            "attachment_number": null,
            "ia_upload_failure_count": null
          }
        ],
        "pacer_sequence_number": null,
        "recap_sequence_number": "2023-12-13.003"
      }
    ]
  }
}

@mlissner
Member Author

mlissner commented Dec 14, 2023

This feels related, no: freelawproject/courtlistener#3385

@ERosendo ERosendo moved this from Main Backlog to ✍🏻In Progress in @erosendo's backlog Jan 3, 2024
@ERosendo
Contributor

ERosendo commented Jan 3, 2024

> This feels related, no: freelawproject/courtlistener#3385

@mlissner Yes, it's related.

I've been thinking about this issue and trying to find a workaround until freelawproject/courtlistener#3385 is merged, but it's trickier than I initially thought. It seems the bot could just check whether the entry is using a regular document number or the pacer_doc_id as the document number, and then create the link accordingly.
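
A minimal sketch of that check, for illustration only. The function names and the comparison rule are assumptions, not the bot's actual code; the URL shape and sample values come from the links and payload earlier in this issue.

```python
from typing import Optional

BASE_URL = "https://www.courtlistener.com"


def pick_link_number(document_number: str, pacer_doc_id: Optional[str]) -> str:
    """Hypothetical helper: choose the number to put in the link.

    Assumption: if the two values match once leading zeros are ignored,
    the entry is using the pacer_doc_id as its document number, so the
    full pacer_doc_id (zeros included) is what the link needs.
    """
    if pacer_doc_id and document_number.lstrip("0") == pacer_doc_id.lstrip("0"):
        return pacer_doc_id
    return document_number


def build_document_url(docket_id, document_number, pacer_doc_id, slug) -> str:
    """Hypothetical helper: build a /docket/<docket_id>/<doc_num>/<slug>/ link."""
    number = pick_link_number(str(document_number), pacer_doc_id)
    return f"{BASE_URL}/docket/{docket_id}/{number}/{slug}/"


url = build_document_url(
    68073028, "1208579363", "01208579363", "united-states-v-donald-trump"
)
print(url)
```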

Although the previous idea seems feasible and could solve the problem temporarily, it has a flaw. I noticed the document_number could be updated and sometimes doesn't include the leading zeros and, in other instances, uses the entire pacer_doc_id. This inconsistency with the document_number will always create unreliable links. So, I believe the best approach to fix this issue is to first address the issues related to the document number in CL and then make adjustments to the WebhookEvent model.

@ERosendo
Contributor

ERosendo commented Jan 3, 2024

> I noticed the document_number could be updated and sometimes doesn't include the leading zeros and, in other instances, uses the entire pacer_doc_id.

@mlissner Here's more context about this sentence:

I've tried to reproduce the payload shown in #439 (comment) in my local environment. To do so, I cloned the United States v. Donald Trump case and downloaded several HTML files and PDFs that were used to generate the entries in the docket.

I started uploading the files sequentially (mirroring the order we received them in CL) and realized the payload was only possible when we got the PDF document linked to a docket entry before the data needed to create that entry. I also noticed that a subsequent docket upload could overwrite the document number if the entry is included in the report.

I began examining the code to figure out why the document number was being overwritten. I checked the process_recap_appellate_docket function and noticed the add_docket_entries helper uses the data we got from Juriscraper (which parses the document_number properly and doesn't remove the leading zeros) to create/update the docket entries and the RECAP document linked to each entry. This method stores the entire pacer_doc_id (e.g., "01208584455") in the document_number field.

OTOH, if a RECAP document record has the full pacer_doc_id as its document_number and we receive a PDF upload for this document, the process_recap_pdf function will use the ProcessingQueue's document_number, which is stored as a BigInteger (e.g., 1208584455), to replace the value of the RD's document_number field.
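
To make the two code paths easier to compare, here is a toy model of the overwrite. The class and function names are invented for illustration and are not the CourtListener implementation; only the sample pacer_doc_id comes from the comment above.

```python
from dataclasses import dataclass


@dataclass
class ToyRecapDocument:
    """Stand-in for a RECAP document record (hypothetical)."""
    pacer_doc_id: str
    document_number: str = ""


def apply_docket_upload(rd: ToyRecapDocument, parsed_document_number: str) -> None:
    # Docket path: the parsed number is a string, leading zeros included.
    rd.document_number = parsed_document_number


def apply_pdf_upload(rd: ToyRecapDocument, queue_document_number: int) -> None:
    # PDF path: the queue holds an integer, so the leading zero is already gone.
    rd.document_number = str(queue_document_number)


rd = ToyRecapDocument(pacer_doc_id="01208584455")

apply_docket_upload(rd, "01208584455")
after_docket = rd.document_number

apply_pdf_upload(rd, int("01208584455"))
after_pdf = rd.document_number

print(after_docket, after_pdf)
```

In this toy model, whichever upload arrives last decides the stored value, which is the same order dependence described above.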

It is important to consider those cases where the 'document_number' field is updated because the Bot builds document links using the document_number in this format:

https://www.courtlistener.com/docket/<docket_id>/<doc_num>/<slug>/

Unfortunately, the bot doesn't update old posts, nor does it have a way of knowing when a document number has been updated. This means that changes to the "doc_num" in CL can silently break these links.

@mlissner
Member Author

mlissner commented Jan 4, 2024

Complicated. What fix do you suggest?

@mlissner mlissner moved this to Backlog Jan 6 - Jan 17 in Sprint (Web Team) Nov 22, 2024
@mlissner mlissner moved this from Backlog Jan 13 - Jan 24 to General Backlog in Sprint (Web Team) Jan 9, 2025
@mlissner mlissner moved this from General Backlog to PACER Data/Issues in Sprint (Web Team) Jan 10, 2025