Incremental Loading P2 - 1: Determine new resources #233

Open
cjohns-scottlogic opened this issue Jan 16, 2025 · 0 comments

As part of phase 2 of incremental loading, the collector needs to determine whether any NEW resources have been downloaded, that is, ones that haven't been seen before. To do this, the existing log.csv can be scanned to get the set of all known resources, and the collector can check each fetched resource against this set.

The result of this should be a new file from the collector, probably in the 'var' directory, which can be used by the next stage of incremental loading. If no new resources are downloaded, the file will be created, but will be empty.

If the existing log.csv cannot be read, or any other error occurs, this file will not be created (and any existing one should be removed) to signal that this information is not available. In this case, incremental loading will not be available.

Tech Approach

Update the collector to read the existing log.csv and get the currently known resources. If log.csv is unavailable or unreadable, print a diagnostic message but continue the collector run. In this case, no further work is done to determine new resources.
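A minimal sketch of this step, not a definitive implementation. It assumes log.csv has a header row with a column named `resource`; the function name `load_known_resources` and the `column` parameter are hypothetical. Returning `None` (rather than an empty set) is one way to signal "information not available" to the rest of the collector.

```python
import csv
import logging
from typing import Optional, Set

logger = logging.getLogger(__name__)


def load_known_resources(log_path, column="resource") -> Optional[Set[str]]:
    """Return the set of resources listed in log.csv, or None if unavailable.

    Assumption: log.csv has a header row containing `column`.
    """
    try:
        with open(log_path, newline="", encoding="utf-8") as f:
            reader = csv.DictReader(f)
            # An empty file or a missing column is treated as "unreadable".
            if reader.fieldnames is None or column not in reader.fieldnames:
                logger.warning("%s has no '%s' column; skipping new-resource detection",
                               log_path, column)
                return None
            # Ignore rows where the resource value is blank.
            return {row[column] for row in reader if row.get(column)}
    except (OSError, csv.Error, UnicodeDecodeError) as exc:
        # Diagnostic only: the collector carries on without this information.
        logger.warning("Could not read %s (%s); skipping new-resource detection",
                       log_path, exc)
        return None
```

A caller would check for `None` and skip the remaining new-resource steps, while the rest of the collector runs as usual.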

Keep a list of resources downloaded that are not in the set of known resources.
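One possible shape for this bookkeeping, as a sketch only. The class name `NewResourceTracker` and its `record` method are hypothetical, and it assumes resources are identified by a string (for example a content hash). Passing `None` as the known set models the case where log.csv could not be read, so nothing is tracked.

```python
from typing import List, Optional, Set


class NewResourceTracker:
    """Collects fetched resources that are not in the known set."""

    def __init__(self, known: Optional[Set[str]]):
        self.known = known
        # Tracking is disabled when the known set is unavailable.
        self.enabled = known is not None
        self._seen: Set[str] = set()
        self.new_resources: List[str] = []

    def record(self, resource: str) -> None:
        """Note a fetched resource, keeping it only if it is new."""
        if not self.enabled:
            return
        if resource in self.known or resource in self._seen:
            return
        # Remember it so the same resource fetched twice is listed once.
        self._seen.add(resource)
        self.new_resources.append(resource)
```

Using a list alongside a set keeps the order in which new resources were fetched while still avoiding duplicates.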

At the end of the collector run, save this list to a suitable file in var (possibly var/collection-name/new-resources.csv?). If this file cannot be created, report a diagnostic warning but continue as usual.
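A sketch of the save step under stated assumptions: the function name `save_new_resources` is hypothetical, the output has one resource per row with no header (so that "no new resources" yields a truly empty file), and `None` stands for "new-resource information is not available". In that case, or if writing fails, any existing file is removed so a stale one is not picked up by the next stage.

```python
import csv
import logging
from pathlib import Path

logger = logging.getLogger(__name__)


def _remove_stale(path: Path) -> None:
    """Delete an existing output file, ignoring the case where it is absent."""
    try:
        path.unlink()
    except FileNotFoundError:
        pass
    except OSError as exc:
        logger.warning("Could not remove stale %s (%s)", path, exc)


def save_new_resources(path, new_resources) -> bool:
    """Write new resources to `path`; return True if the file was written.

    new_resources=None means the information is unavailable: no file is
    written and any existing file is removed.
    """
    path = Path(path)
    if new_resources is None:
        _remove_stale(path)
        return False
    try:
        path.parent.mkdir(parents=True, exist_ok=True)
        with open(path, "w", newline="", encoding="utf-8") as f:
            writer = csv.writer(f)
            # An empty list produces an empty (zero-byte) file.
            for resource in new_resources:
                writer.writerow([resource])
        return True
    except OSError as exc:
        # Diagnostic only: the collector continues as usual.
        logger.warning("Could not write %s (%s)", path, exc)
        _remove_stale(path)
        return False
```

The next stage can then distinguish three states: file missing (information unavailable), file empty (nothing new), and file with rows (new resources to load).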

Acceptance Criteria

Code has appropriate tests.

The collector will generate a new file of new resources. If none are new the file will be created, but empty.

If log.csv isn't available or readable or if the output file cannot be created then the collector will run as usual, but output diagnostic messages.

@cjohns-scottlogic cjohns-scottlogic converted this from a draft issue Jan 16, 2025
Status: Backlog