Refactor integration tests to remove random collection sampling #749

mfisher87 · 2024-07-06T18:47:00Z

Resolves #215

Replaces random collection sampling with hardcoded lists of the top 100 collections per provider in popularity order, with a script to regenerate the lists as needed. Instead of sampling n random collections, we select the n most popular.
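
A minimal sketch of the selection step, assuming each list file holds one concept ID per line in popularity order and that a leading `#` disables an entry (as in the `POCLOUD.txt` excerpt in this PR). The function name `top_collections` is hypothetical and is not taken from the PR's code.

```python
def top_collections(list_text: str, n: int) -> list[str]:
    """Return the first n active concept IDs from a popularity-ordered list.

    `list_text` is the content of a per-provider list file, one concept ID
    per line. This helper is hypothetical, for illustration only.
    """
    concept_ids = []
    for line in list_text.splitlines():
        line = line.strip()
        # Skip blank lines and entries disabled with a leading "#".
        if not line or line.startswith("#"):
            continue
        concept_ids.append(line)
    # The file is already sorted by popularity, so slicing gives the top n.
    return concept_ids[:n]
```

In a test module this could be fed with something like `Path("tests/integration/popular_collections/POCLOUD.txt").read_text()`. Because the result depends only on the file contents, repeated runs select the same collections.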

There's still a clear need for refactoring of the 4 cloud/onprem download/open test modules. They share a lot of code that can be fixturized. I don't want this PR to grow larger than it already is, so IMO that should be a follow-up activity.

tests/integration/popular_collections/generate.py

mfisher87 · 2024-07-09T17:56:48Z

Worked on this with @itcarroll during hack day. Notes: #755

mfisher87 · 2024-07-09T17:58:42Z

We considered the usefulness of random sampling tests. We don't think we should be doing this for integration tests, especially when they execute on every PR. We could, for example, run them on a cron job and create reports, but that seems like overkill when we have a community to help us identify datasets and connect with the right support channel if there's an issue with the provider.

We may still consider a cron job to, for example, recalculate the most popular datasets on a monthly basis.

mfisher87 · 2024-07-09T18:47:30Z

We decided we can hardcode a small number and expand the list as we go. Other things like random tests on a cron or updating the list of popular datasets on a cron can be addressed separately.

mfisher87 · 2024-08-06T18:42:21Z

@betolink will take on work to update generate.py to generate top N collections for all providers.

@mfisher87 will continue working on test_onprem_download.py for just NSIDC_ECS for now to make it use the new source of collections.

mfisher87 · 2024-08-06T18:44:36Z

We will update the .txt files to .csv files and add a boolean field for "does the collection have a EULA?" and then we'll use that field to mark those tests as xfail.
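
A rough sketch of how such a CSV could be read, using only the standard library. The column names `concept_id` and `has_eula`, the `true`/`false` encoding, and the function name are assumptions for illustration; the PR does not specify a schema.

```python
import csv
import io


def read_collections(csv_text: str) -> list[tuple[str, bool]]:
    """Parse rows of (concept ID, has-EULA flag) from CSV text.

    Column names and the "true"/"false" encoding are hypothetical.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        (row["concept_id"], row["has_eula"].strip().lower() == "true")
        for row in reader
    ]
```

Each pair could then be turned into a test parameter, for example by wrapping the concept ID in `pytest.param(..., marks=pytest.mark.xfail(reason="EULA required"))` when the flag is true and passing it through unchanged otherwise.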

mfisher87 · 2024-08-21T00:03:30Z

Two major milestones:

@danielfromearth updated the script which generates the top collection lists to use all providers supported by earthaccess 🎉 Still TODO: Make them CSVs with a boolean representing whether the collection has a EULA
We just got the test_onprem_download.py module working without randomization! 🎉 Still TODO: Refactor the other 3 integration test modules to share this behavior. Let's try and remove duplicate code while we're at it!

Thanks to @DeanHenze and @Sherwin-14 for collaborating on this at today's hackathon!

mfisher87 · 2024-08-21T00:03:51Z

earthaccess/results.py

@@ -244,6 +244,9 @@ def _repr_html_(self) -> str:
        granule_html_repr = _repr_granule_html(self)
        return granule_html_repr

+    def __hash__(self) -> int:
+        return hash(self["meta"]["concept-id"])


@betolink @chuckwondo This seems reasonable to me, but please validate me :)

Thinking about it for like 5 minutes, this is obviously a bad idea. This class is subclassing dict. We'd need to implement like a frozendict.
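
To illustrate the concern, here is a small sketch with a hypothetical `Granule` class (a stand-in, not the earthaccess class). It shows that a hash derived from mutable dict contents can change after the object is placed in a set, and one possible alternative that removes duplicates by concept ID without requiring hashable granules.

```python
class Granule(dict):
    """Hypothetical dict subclass whose hash depends on mutable contents."""

    def __hash__(self) -> int:
        return hash(self["meta"]["concept-id"])


granule = Granule(meta={"concept-id": "G1-EXAMPLE"})
seen = {granule}

# Mutating the dict changes its hash, so the set lookup now misses it.
granule["meta"]["concept-id"] = "G2-EXAMPLE"
found_after_mutation = granule in seen


def unique_by_concept_id(granules: list[dict]) -> list[dict]:
    """Drop duplicates by concept ID, keeping first-seen order."""
    by_id: dict[str, dict] = {}
    for item in granules:
        # setdefault keeps the first granule seen for each concept ID.
        by_id.setdefault(item["meta"]["concept-id"], item)
    return list(by_id.values())
```

Keying a plain dict on the concept ID avoids defining `__hash__` on a mutable mapping at all, at the cost of a small helper.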

mfisher87 · 2024-08-21T00:06:23Z

Also still TODO: Run generate.py in GHA on a monthly/quarterly cron and auto-open a PR with the changes to top collections?

mfisher87 · 2024-08-21T16:26:44Z

If we want to determine whether a collection has a EULA, this example was provided:

curl -i -XGET "https://cmr.earthdata.nasa.gov/search/collections.json?concept_id=C1808440897-ASF&pretty=true"

The metadata "eula_identifiers" : [ "1b454cfb-c298-4072-ae3c-3c133ce810c8" ] is present in the response. We're not 100% sure whether this can be used authoritatively. Discussion in progress: https://nsidc.slack.com/archives/C2LRKMDEV/p1724179804149239
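
A hedged sketch of how that check might look in Python once the JSON response has been parsed. The `eula_identifiers` field comes from the response quoted above; the `feed`/`entry` nesting is an assumption about the shape of the CMR JSON response, and the function name is hypothetical. As noted, it is not yet settled whether this field is authoritative.

```python
def collection_has_eula(cmr_response: dict) -> bool:
    """Return True if any collection entry lists at least one EULA identifier.

    Assumes the parsed CMR JSON looks like {"feed": {"entry": [...]}}.
    """
    entries = cmr_response.get("feed", {}).get("entry", [])
    # A missing or empty "eula_identifiers" list counts as "no EULA".
    return any(entry.get("eula_identifiers") for entry in entries)


# Trimmed, illustrative response based on the field quoted in this thread.
sample_response = {
    "feed": {
        "entry": [
            {
                "id": "C1808440897-ASF",
                "eula_identifiers": ["1b454cfb-c298-4072-ae3c-3c133ce810c8"],
            }
        ]
    }
}
```

With the sample above, `collection_has_eula(sample_response)` evaluates to `True`, while a response whose entries lack the field evaluates to `False`.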

mfisher87 · 2024-10-01T17:19:52Z

tests/integration/test_onprem_open.py

TODO: Add tests for OBDAAC on-prem open. Related to #828 - we want to make sure the data streams successfully. Opening data from OBDAAC on-prem relies on both #828 and a (potentially) unreleased change to fsspec! (check the September release notes)

danielfromearth · 2024-10-29T17:59:49Z

Looks like part of this issue may be related to work on EULAs in this issue.

mfisher87 · 2024-11-27T02:11:03Z

tests/integration/popular_collections/POCLOUD.txt

@@ -0,0 +1,100 @@
+C2799438299-POCLOUD
+C1996881146-POCLOUD
+# C2204129664-POCLOUD


This collection isn't working so good 🤒

We get 0 granules:

>       assert len(granules) > 0, msg
E       AssertionError: AssertionError for C2204129664-POCLOUD
E       assert 0 > 0
E        +  where 0 = len([])

But it's still 3rd most-popular? I'm confused :)

mfisher87 · 2024-11-27T02:34:09Z

tests/integration/test_kerchunk.py

@@ -6,6 +6,7 @@
 from fsspec.core import strip_protocol

 logger = logging.getLogger(__name__)
+pytestmark = pytest.mark.skip(reason="Tests are broken.")


These tests are failing on the release. Maybe xfail is a better mark. I preferred not to get into fixing this in this PR.

fixed the tests by pinning zarr to 2.x, but we have to merge main into the PR.

Merged and removed mark!

mfisher87 · 2024-11-27T02:38:59Z

mkdocs.yml

I felt this needed simplification as I added more.

We have a few "guide" things, so I gave them a naming pattern so they can be mentally grouped.

Removed the word "Our" because it wasn't adding anything.

"Naming conventions" felt out of place and too specific, like the new integration test doc. So I created a new "Topics" subsection (but not a subdirectory, to keep the URL flatter). I don't like "Topics", but it's the best I have thought of so far.

betolink · 2024-12-02T16:55:02Z

Wow, this is a big one! I can start today but I'm not sure if I can finish today! great work @mfisher87 !!

mfisher87 · 2024-12-02T18:07:08Z

Thanks for taking a look, @betolink ! There are some opportunities for refactoring, but I really tried to keep the scope narrow in this PR to avoid growing even bigger :)

betolink

I looked at the PR again, and although we haven't settled on "should we test all these?", I think this is a great improvement in reproducibility, and we can pass the info to other DAACs about relevant datasets that fail for some reason. I think we need to merge main and regenerate the uv lock, and then this should be good to go IMO! Great work @mfisher87!!

betolink · 2024-12-03T21:18:57Z

docs/contributing/integration-tests.md

+
+Some integration tests operate on the most popular collections for each provider in CMR.
+Those collections are cached as static data in `tests/integration/popular_collections/`
+to give our test suite more stability. The list of most popular collections can be


This is awesome!

betolink · 2024-12-03T21:22:52Z

tests/integration/popular_collections/POCLOUD.txt

@@ -0,0 +1,100 @@
+C2799438299-POCLOUD
+C1996881146-POCLOUD
+# C2204129664-POCLOUD


maybe someone from PODAAC can clarify, cc @DeanHenze

tests/integration/popular_collections/generate.py

betolink · 2024-12-03T21:24:40Z

tests/integration/test_cloud_download.py

Ahh so it's configurable! great!!

What do you mean?

betolink · 2025-01-18T18:36:47Z

tests/integration/test_kerchunk.py

@@ -6,6 +6,7 @@
 from fsspec.core import strip_protocol

 logger = logging.getLogger(__name__)
+pytestmark = pytest.mark.skip(reason="Tests are broken.")


fixed the tests by pinning zarr to 2.x, but we have to merge main into the PR.

betolink

This looks good to me; waiting for Chuck and/or Joe's feedback.

jhkennedy

This looks good to me -- I left two tiny comments that you can take or leave. Approval stands either way.

tests/integration/popular_collections/generate.py

docs/contributing/integration-tests.md

Co-authored-by: Joseph H Kennedy <[email protected]>

mfisher87 · 2025-01-22T17:51:54Z

Oh no! The tests are failing 🤣

@betolink NSIDC doing maintenance?

urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='n5eil01u.ecs.nsidc.org', port=443): Max retries exceeded with url: /DP5/ATLAS/ATL08.006/2018.10.14/ATL08_20181014070058_02390107_006_02.h5 (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7fbb54137850>: Failed to establish a new connection: [Errno 111] Connection refused'))

betolink · 2025-01-22T17:55:53Z

I think so, it's Wednesday! 😆

mfisher87 · 2025-01-22T19:13:00Z

We're so close to merging this behemoth 🤣

mfisher87 changed the title from "Refactor integration tests" to "Refactor integration tests to remove random collection sampling" on Jul 6, 2024

mfisher87 commented Jul 7, 2024

tests/integration/popular_collections/generate.py (resolved)

mfisher87 force-pushed the integration-tests-refactor branch from d79f48f to 194fd29 Compare July 23, 2024 17:39

mfisher87 commented Aug 21, 2024

mfisher87 added the needs: help Extra attention is needed label Sep 3, 2024

mfisher87 commented Oct 1, 2024

asteiker mentioned this pull request Oct 29, 2024

Integration tests are flaky -- replace dataset sampling with top 50 datasets #215

Open

mfisher87 and others added 14 commits November 26, 2024 11:30

Extract duplicated function

413b086

Add popular collection script proof of concept

0ffc1b4

Use union type instead of union operator

8123e3a

Remove logic which accepts up to 10% integration test failure

7e4c0f9

Enable import of sampling test utility function

ef7bc0a

Adjust test logging/docstrings for consistent & correct language

a371dc2

Update generate script to fail if paging

668c18e

Add helper function to sample from collection list file

1b09388

Fix granule sampling bug that can result in dupes

022613d

Update test parameter schema (WIP)

cc1c932

Make granules hashable, fix granule sampling logic

b1d39ea

Remove random collection sampling from test module

e1f635b

loop through all providers while generating collection lists

3f13536

add collection text files for all currently listed providers

fae144a

mfisher87 requested review from chuckwondo, jhkennedy and betolink and removed request for chuckwondo and jhkennedy November 27, 2024 00:30

mfisher87 commented Nov 27, 2024

Add documentation on integration tests

ffc1fec

mfisher87 commented Nov 27, 2024

mfisher87 requested a review from danielfromearth November 27, 2024 23:19

Merge branch 'main' into integration-tests-refactor

bf8b3d6

betolink reviewed Jan 18, 2025

mfisher87 added 2 commits January 18, 2025 12:38

Merge branch 'main' into integration-tests-refactor

45b12c4

Unmark kerchunk tests as broken

c28b8fa

betolink previously approved these changes Jan 22, 2025

jhkennedy previously approved these changes Jan 22, 2025

tests/integration/popular_collections/generate.py (outdated, resolved)

docs/contributing/integration-tests.md (outdated, resolved)

Use pathlib method to write text to file

836258e

Co-authored-by: Joseph H Kennedy <[email protected]>

mfisher87 dismissed stale reviews from jhkennedy and betolink via 836258e January 22, 2025 17:14

Be explicit about what data we're caching

2000420

Co-authored-by: Joseph H Kennedy <[email protected]>

jhkennedy approved these changes Jan 22, 2025

mfisher87 closed this Jan 24, 2025

mfisher87 reopened this Jan 24, 2025
