
feat(search): Adds logic to download search results #4893

Open: ERosendo wants to merge 11 commits into main from 599-feat-add-logic-to-download-results
Conversation

@ERosendo (Contributor) commented Jan 6, 2025

This PR implements the backend logic for exporting search results (#599).

Key changes:

  • Introduces a new rate limiter that throttles CSV export requests to 5 per day.

  • Adds a new setting named MAX_SEARCH_RESULTS_EXPORTED (default: 250) to control the maximum number of rows included in the generated CSV file.

  • Refactors the views.py file within the search module. Helper functions related to fetching Elasticsearch results have been moved to search_utils.py for better organization and clarity.

  • Introduces two new helper functions: fetch_es_results_for_csv and get_headers_for_search_export.

  • Adds a new task that takes a user_id and query string as input, then sends an email with a CSV file containing at most MAX_SEARCH_RESULTS_EXPORTED rows (a sketch of the idea follows below).
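
For orientation, here is a minimal sketch of what such a task could look like, assuming Celery and Django's email API. The import path matches the module this PR touches, but the helper signatures, task name, subject line, and attachment name are assumptions, not the PR's actual code:

import csv
import io

from celery import shared_task
from django.conf import settings
from django.contrib.auth.models import User
from django.core.mail import EmailMessage

# Helpers introduced by this PR; their exact signatures may differ.
from cl.lib.search_utils import (
    fetch_es_results_for_csv,
    get_headers_for_search_export,
)


@shared_task
def email_search_results_csv(user_id: int, query: str) -> None:
    """Build a CSV of at most MAX_SEARCH_RESULTS_EXPORTED rows and mail it."""
    user = User.objects.get(pk=user_id)
    rows = fetch_es_results_for_csv(query)[: settings.MAX_SEARCH_RESULTS_EXPORTED]
    headers = get_headers_for_search_export(query)

    # Write the rows into an in-memory buffer rather than a temp file.
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=headers, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(rows)

    email = EmailMessage(
        subject="Your search results are ready",
        body="Your requested search results are attached as a CSV file.",
        to=[user.email],
    )
    email.attach("search_results.csv", buffer.getvalue(), "text/csv")
    email.send()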

@freelawproject deleted a comment from semgrep-app bot on Jan 6, 2025
@ERosendo force-pushed the 599-feat-add-logic-to-download-results branch 2 times, most recently from 535f37a to e8f6fd3 on January 13, 2025 at 18:31
@ERosendo force-pushed the 599-feat-add-logic-to-download-results branch from b3ba8d5 to 8721c41 on January 16, 2025 at 02:32

semgrep-app bot commented Jan 16, 2025

Semgrep found 3 avoid-pickle findings:

Avoid using pickle, which is known to lead to code execution vulnerabilities. When unpickling, the serialized data could be manipulated to run arbitrary code. Instead, consider serializing the relevant data as JSON or a similar text-based serialization format.
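
For context, the generic swap Semgrep is recommending looks like this (an illustrative sketch, not code from this PR):

import json
import pickle

payload = {"query": "q=apple&type=r", "user_id": 42}

# Risky: unpickling attacker-controlled bytes can execute arbitrary code.
blob = pickle.dumps(payload)
restored_unsafe = pickle.loads(blob)

# Safer: JSON is a text format with no code-execution surface.
text = json.dumps(payload)
restored_safe = json.loads(text)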

Semgrep found 1 direct-use-of-jinja2 finding:

Detected direct use of jinja2. If not done properly, this may bypass HTML escaping which opens up the application to cross-site scripting (XSS) vulnerabilities. Prefer using the Flask method 'render_template()' and templates with a '.html' extension in order to prevent XSS.

@ERosendo force-pushed the 599-feat-add-logic-to-download-results branch from 8721c41 to ae29bba on January 16, 2025 at 04:07
This commit refactors the search module by moving helper functions from `views.py` to `search_utils.py`. This improves code organization and makes these helper functions reusable across different modules.
@ERosendo force-pushed the 599-feat-add-logic-to-download-results branch from ae29bba to 92cddf5 on January 16, 2025 at 04:19
@ERosendo marked this pull request as ready for review on January 16, 2025 at 05:59
@ERosendo requested a review from mlissner on January 16, 2025 at 05:59
@mlissner (Member) left a comment

I gave this a once-over and it feels about right. Concerns I'll highlight for you guys to consider:

  1. Memory: We're putting the CSV in memory, which sure is handy. I think this is fine because it'll be pretty small, a couple hundred KB, right? This must be fine, but it's on my mind.

  2. The fields in the result might be annoying, with columns that aren't normalized to human values (like SOURCE: CR or something, and local_path: /recap/gov.xxxx.pdf instead of https://storage.courtlistener.com/recap/gov.xxx.pdf). I didn't see code to fix that, but it's probably something we should do if we can. This CSV is supposed to be for humans, in theory; a rough sketch of such normalization follows below.
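
For illustration, a minimal sketch of a normalization pass applied to each row before writing the CSV. The SOURCE_LABELS mapping, the field names, and the URL prefix behavior are placeholder assumptions, not values from this PR:

# Hypothetical per-row normalization before writing the CSV.
STORAGE_PREFIX = "https://storage.courtlistener.com/"
SOURCE_LABELS = {"CR": "Court and RECAP"}  # placeholder mapping


def humanize_row(row: dict) -> dict:
    """Map machine codes to readable labels and expand relative paths."""
    if row.get("source") in SOURCE_LABELS:
        row["source"] = SOURCE_LABELS[row["source"]]
    if row.get("local_path"):
        row["local_path"] = STORAGE_PREFIX + row["local_path"].lstrip("/")
    return row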

I appreciate the refactor, but I'd suggest it in a separate PR in the future, so it's not mixed in.

But this looks about right to me otherwise. :)

@mlissner assigned albertisfu and unassigned mlissner on Jan 17, 2025
@albertisfu (Contributor) left a comment

@ERosendo this looks good and is on the right track. I’ve left some comments and suggestions in the code, along with additional feedback here:

  • In addition to Mike's comment about normalizing values for humans, I noticed that the CSV headers don't maintain a fixed order when the CSV is generated. That made it difficult to determine when results belong to the same "Case," particularly when matching child documents. It might be a good idea to ensure that the headers are fixed for each search type and to prioritize key headers that help identify whether the results belong to the same case (see the sketch after this list). For instance, in RECAP, the headers could start like this:

    docket_id, docket_number, pacer_case_id, court_exact, case_name, document_number, attachment_number, ...

  • Highlighted fields in the results are always represented as a list of terms, even though we are only highlighting a single fragment. Most of the HL fields are not naturally lists of terms. However, some fields, such as citations in case law, can be lists and are also highlighted.

    Currently, these fields are rendered as lists:
    [Screenshot: highlighted fields rendered as lists in the CSV output]

    As in the frontend, perhaps you could use the render_string_or_list filter, or a modified version of it, to render HL fields as strings instead of lists when they are not multi-fields?

  • Regarding Judge Search, I noticed that the CSV only contains fields from PersonDocument, and some fields currently rendered as flat fields in the frontend (such as "Appointers" and other similar fields extracted from the database in merge_unavailable_fields_on_parent_document) are not included. Is this behavior expected?

  • I'd recommend adding at least one integration test to help catch and prevent regressions in the future related to the suggestions and bugs mentioned in the comments.
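
On the fixed-header-order suggestion in the first bullet above, one possible shape: the key-column list is taken from the comment, but the helper itself is a hypothetical sketch, not code from this PR.

# Key case-identifying columns come first, in a fixed order; any other
# fields are appended in a stable, sorted order.
RECAP_KEY_HEADERS = [
    "docket_id", "docket_number", "pacer_case_id", "court_exact",
    "case_name", "document_number", "attachment_number",
]


def order_headers(all_fields: set[str], key_headers: list[str]) -> list[str]:
    ordered = [h for h in key_headers if h in all_fields]
    ordered += sorted(all_fields - set(key_headers))
    return ordered

# e.g. order_headers({"judge", "case_name", "docket_id"}, RECAP_KEY_HEADERS)
# returns ["docket_id", "case_name", "judge"]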

Thank you!

Comment on lines +657 to +662
keys = set(
    [
        *DocketDocument.__dict__["_fields"].keys(),
        *ESRECAPDocument.__dict__["_fields"].keys(),
    ]
)

I think this can be simplified as:

Suggested change
keys = set(
    [
        *DocketDocument.__dict__["_fields"].keys(),
        *ESRECAPDocument.__dict__["_fields"].keys(),
    ]
)
keys = {
    *DocketDocument.__dict__["_fields"].keys(),
    *ESRECAPDocument.__dict__["_fields"].keys(),
}

The same applies for SEARCH_TYPES.OPINION.

search result.
"""
csv_rows: list[dict[str, Any]] = []
while len(csv_rows) <= settings.MAX_SEARCH_RESULTS_EXPORTED:

Could you please explain whether the while loop is required here? I did some testing, and I think this could work without the while loop, avoiding the additional per-page queries needed to reach MAX_SEARCH_RESULTS_EXPORTED results, since you're already calling do_es_search with MAX_SEARCH_RESULTS_EXPORTED rows.

To make this work for RECAP, you may need to pass a parameter to do_es_search indicating that the query is for CSV generation. This would ensure the rows variable for both RECAP and Dockets within do_es_search matches MAX_SEARCH_RESULTS_EXPORTED, since for RECAP this parameter is currently overridden to 10 documents per page by the setting rows = settings.RECAP_SEARCH_PAGE_SIZE.

Additionally, since CSV exports don't need result counts, the same parameter could be used to skip the count queries in fetch_es_results. This would avoid the overhead of the parent and child count queries currently performed via the Multi-Search API.
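
A sketch of how that could be threaded through: do_es_search and fetch_es_results are real functions in this codebase, but the for_csv flag is the reviewer's proposal here, not merged code, and the exact signature may differ.

from django.conf import settings


def fetch_es_results_for_csv(get_params):
    # Single query sized at the export cap; no while loop needed.
    return do_es_search(
        get_params,
        rows=settings.MAX_SEARCH_RESULTS_EXPORTED,
        # Proposed flag: stops RECAP from overriding `rows` with
        # settings.RECAP_SEARCH_PAGE_SIZE, and lets fetch_es_results
        # skip the parent/child count queries the CSV never shows.
        for_csv=True,
    )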

flat_results = []
for result in results.object_list:
parent_dict = result.to_dict()
child_docs = parent_dict.pop("child_docs")

I tested a search query that returned some results for empty dockets, and it failed on this line:
KeyError: 'child_docs'

Instead of pop, we should use get: when child_docs is missing, get returns None instead of raising KeyError, and the if child_docs: check below handles it.

Suggested change
child_docs = parent_dict.pop("child_docs")
child_docs = parent_dict.get("child_docs")

if child_docs:
flat_results.extend(
[
parent_dict | doc["_source"].to_dict()

I noticed that you want to include HL in the CSV results, correct?

The highlights are being shown correctly on highlighted child fields and in parent fields for parent documents with no matched child documents. However, highlights are not being shown on parent fields for results that matched child documents.
I think the issue lies in this line, where parent_dict fields are merged with the child document's fields. Since child documents also contain parent fields, and those fields are not highlighted in child documents, they override the highlighted parent fields from parent_dict.
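
One possible fix, sketched under the assumption that parent_dict holds the highlighted parent fields: reverse the union so parent values win on shared keys.

# Child-only fields are still merged in, but highlighted parent fields
# from parent_dict now take precedence over the un-highlighted copies
# carried by each child document.
flat_results.extend(
    doc["_source"].to_dict() | parent_dict
    for doc in child_docs
)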

Hi {{username}},

Your requested search results are attached as a CSV file.

I think it would be a good idea to include a link to the search query within the email body so users can identify which query the attached results belong to.

[Screenshot: proposed email body showing a link to the originating search query]

@albertisfu assigned ERosendo and unassigned albertisfu on Jan 21, 2025