-
Notifications
You must be signed in to change notification settings - Fork 17
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve: Parse POM from maven repo to extract and save license information in Java DB #21
base: main
Are you sure you want to change the base?
Conversation
…ct and save license information in Java DB
Sorry for chiming in. I am just an interested outsider who knows next to nothing about the architecture of this project. I see that you save the license as a
|
Hi @ChristianCiach , Didn't want to over complicate the change(additional tables to track licenses, map GAV i.e create a unique id in indices table, additional joins in read queries). It would definitely be an optimization , but not sure if the benefits are worth the effort especially because the difference in size isn't a lot and the DB is cached/refreshed every 3 days if I am not wrong. Even in case of CICD, there might be alternatives to mount the cache directory to avoid db download multiple times. But if everybody feels it's a mandatory change, we can work on it. Regarding licenses being a TEXT column; my bad. I had my use case in mind. It could be an array (sqlite might not support it directly) similar to the existing license type in the Package model. @DmitriyLewen let me know if you have any suggestions. |
The savings in storage are actually the least of my concerns, but I wanted to mention this because these concerns have been mentioned multiple times in the linked issue. I have a strong background in database administration, so I always try to normalize the data instead of storing them in a pre-processed way (like the joined strings here). I also think it would be easier to move from a normalized schema to a de-normalized schema later instead of the other way around. Lastly, storing the licences in an atomic way (meaning un-joined) would also address the concerns about the length of the attribute. I want to stress again that I know next to nothing about the architecture of Trivy, so all my my points may very well be, well, pointless. Anyway, thank you so much for adressing this issue! I was actually tasked by my company to develop a trivy-sbom-postprocessor to add all the missing licenses, so you might save me a lot of time :D |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
left small comments
Hello @ChristianCiach
This make sense. I don't have much experience with sql can you point out the timing nuances: I ask because we have 2 important points:
|
pkg/crawler/crawler.go
Outdated
} | ||
|
||
// TODO: Check if we can limit the length of license string i.e trim and save | ||
return strings.Join(licenses, ","), nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like we need to think about this, because some licenses (e.g. The Apache License, Version 2.0
) can contain commas. In this case will be confusion if we will split by commas (e.g. The Apache License, Version 2.0,The 2-Clause BSD License
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regarding the ,
separator, we can probably use |
Gave license classifier a try.
It expects a license file . Does not result in matches using just the license name in POM in most cases.
We essentially need to extract license URL from POM, fetch the contents of URL, dump it in a file and then use license classifier for the right results.
It will add an additional overhead. We might have to add an in memory cache as well in order to avoid redundant processing.
In cases where URL is missing in POM, we'll have to rely on the license name in POM.
Effectively a 3 step process:
- Fetch license URL contents and classify license
- If URL is missing or license couldn't be classified in step 1 , then classify license using license name in POM
- If step 2 also fails then dump the license name as is (precautionary check to limit the license string length)
In all cases where license classifier is being used, confidence threshold used could be 80%
Integrating this in the current design of java db might be a problem, since concurrent goroutines for license classification might not work. It instantiates a backend instance , expects a list of license files, spawns go routines to process them and accumulates results in the backend.
Still need to figure out the specifics.
=========
The other perspective could be not over complicate things and claim support for whats highlighted in pom.xml. Since the user controls the pom.xml , it's their responsibility to specify the license information accurately. We will show whatever resides in pom.xml (license.Name) with some precautionary checks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Integrating this in the current design of java db might be a problem, since concurrent goroutines for license classification might not work. It instantiates a backend instance , expects a list of license files, spawns go routines to process them and accumulates results in the backend.
This is really problem. We can add new step before crawler: parse all pom.xml
files (as with sha1
files) as you said (Effectively a 3 step process:
) and save to map/file (url -> license name). And use this map/file in crawler.
But there may be problem with CI/CD (work time, memory, etc...).
The other perspective could be not over complicate things and claim support for whats highlighted in pom.xml. Since the user controls the pom.xml , it's their responsibility to specify the license information accurately. We will show whatever resides in pom.xml (license.Name) with some precautionary checks.
I think it is good solution. At least we can start with this
Maven docs say(https://maven.apache.org/pom.html#Licenses): Using an SPDX identifier as the license name is recommended.
It is not required, but we can refer on it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is really problem. We can add new step before crawler: parse all pom.xml files (as with sha1 files) as you said (Effectively a 3 step process:) and save to map/file (url -> license name). And use this map/file in crawler.
But there may be problem with CI/CD (work time, memory, etc...).
Let me give this a try.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated the PR.
- Crawl to fetch GAV information + parse POM to extract license keys (hash of url if available , else name)
- Save license keys along with GAV information in the indexes cache directory and simultaneously build a map of unique license keys with meta information about license name and url
- Use license classifier to find license info associated with unique license keys found in step 2 and save them referenced by license key in a licenses cache directory
- As part of build, parse the cached data in indexes directory. Extract license keys from the same and use it to extract license metadata from the licenses cache directory.
- Write aggregated data to DB
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The DB size now is just 50MB more with license information(800MB) vs 750MB without license information.
| separated list of unique licenses :
licenses.csv
The only problem seems to be time taken for building java db. https://github.com/aquasecurity/trivy-java-db/pull/21/files#diff-526a7142f7b15a72eab215dcfa540f5b6d502951a2f9f3b34dd900013e82c675R426 seems to be a major contributor other than additional POM parsing.
Don't think it should be a problem though since we refresh the database once a week.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is good way for us. And changes of place look good. 50mb is not much.
Sorry for the late reply. I am on vacation and can only skim this PR using my smartphone. To answer your questions:, I expect the difference in performance to be barely measurable. Every database management system worth its salt will join the tables using highly specialised indexes. I wouldn't even be too surprised to see an improvement in performance because most database management systems handle large Edit: I will be mostly offline for the next two weeks or so. I am also not a Go developer, but I am highly skilled in Java and Python. Even if this PR gets merged as-is, I am curious enough to compare it to the model I would've preferred, so maybe you will see a follow-up PR by me later. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now it looks better!
I left 1 more idea.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left some thoughts about optimization.
Can you take a look?
pkg/crawler/crawler.go
Outdated
return nil | ||
} | ||
|
||
func (c *Crawler) prepareClassifierData() cmap.ConcurrentMap[string, License] { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I looked on build action. This function worked 1:25 hours.
Can we update this logic to up work speed?
What if we will use similar logic as in Crawl
function?
We can use semafore, use limit of gorutines from options, etc...
We can save urls in channel and take them in loop.
wdyt?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm suggesting this because I'm worried about maintaining this code. It will be hard to find a bug if you have to wait 1.5 hours for 1 Crawler launch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup. Have been trying to figure out why this function takes so much time. Ideally license file text shouldn't be huge. Downloading them and writing contents to files concurrently shouldn't ideally take so much time. Wondering if i am missing some nuance.
My guess is , it's because of the use of concurrent maps (uniqueLicenseKeys and filesLicenseMap) within the function. They are thread safe so blocking other goroutines.
What if we will use similar logic as in Crawl function?
We can use semafore, use limit of gorutines from options, etc...
We can save urls in channel and take them in loop.
Are you suggesting to use a similar framework? Not sure if this is going to make a difference.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can avoid filesLicenseMap
. It need not be a concurrent map. There are other ways to achieve it.
Regarding uniqueLicenseKeys
, we can extract the contents into a normal map, since we just need a get.
Let me give that a try
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if this is going to make a difference.
If i understand correctly - you use 10 parallel goroutines to write license files. Can we increase this value to 1000?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup. Http client timeout error is not being handled. Updated it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems to be fine now. Takes 31m for crawl
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep. It is really good.
Let's wait result size of db.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
new sizes:
454.36 MB is size before unpacking (Trivy will download image with this size).
792 MB - is db size in cache dir.
current sizes:
452.25 MB is size before unpacking.
743M - is db size.
CI/CD time changes:
Crawl before - 8-10 min
Crawl after - 30 min
Build actions without changes.
LGTM.
Good job, @namandf !
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great :)
@knqyf263 hello! Can you take a look? |
Hope you guys are doing great. Wondering if there is any update. Looking forward to getting this change in as soon as possible. Thank you |
Hope you guys are doing great. Wondering if there is any update. Looking forward to getting this change in as soon as possible. Thank you |
@namandf We have tons of things to do. Please wait a little longer. |
Thank you for the update @knqyf263 . FYI noticed a related bug in the latest trivy version aquasecurity/trivy#5027 |
@@ -59,7 +59,7 @@ func (db *DB) Init() error { | |||
if _, err := db.client.Exec("CREATE TABLE artifacts(id INTEGER PRIMARY KEY, group_id TEXT, artifact_id TEXT)"); err != nil { | |||
return xerrors.Errorf("unable to create 'artifacts' table: %w", err) | |||
} | |||
if _, err := db.client.Exec("CREATE TABLE indices(artifact_id INTEGER, version TEXT, sha1 BLOB, archive_type TEXT, foreign key (artifact_id) references artifacts(id))"); err != nil { | |||
if _, err := db.client.Exec("CREATE TABLE indices(artifact_id INTEGER, version TEXT, sha1 BLOB, archive_type TEXT, license TEXT, foreign key (artifact_id) references artifacts(id))"); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like we should create a license table and store foreign keys here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could have done that but it would require 2 tables. One to track licenses and the other to maintain a mapping of indices logical id to license logical id since it isn't a 1-1 mapping.
Querying the data would involve JOINS which would mean a change in all queries. Moreever the effort involved didn't seem to reap proportionate benefits since the db size with the current change is just 50MB more than normal and queries are simpler.
Intended to keep the changes to existing code as minimal as possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could have done that but it would require 2 tables. One to track licenses and the other to maintain a mapping of indices logical id to license logical id since it isn't a 1-1 mapping.
It is RDBMS. I don't think this change would be complex.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree. Since there is no use case of filtering by license strings, I thought normalization can be deferred and queries can be kept simpler where a direct access to the row in the indices table will have all required info.
It does affect the database size since information is replicated, but it didn't seem to have a major impact hence decided to stick to it.
But I agree, in the long run it might blow up and cause issues w.r.t database size. Don't foresee query performance issues since indexes will help efficiently access the artifact metadata. Will let you guys take a call.
// license classifier | ||
classifier *backend.ClassifierBackend | ||
|
||
// uniqueLicenseKeys : key is hash of license url or name in POM, whichever available |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need hash here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We saw scenarios where individuals have dumped full license files and invalid characters in the license name/url field within pom.xml so a hash ensures the key is consistent and short.
We dump the same hash in the indexes cache file and use it during the build stage to map/update the license information from the license cache files which track the license key hash to classified license string mapping.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this approach allows us to avoid using digests.
#21 (comment)
defer close(prepStatus) | ||
|
||
// process license keys channel | ||
go func() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we need concurrency for classifying licenses. We can fetch licenses when crawling URLs in parallel similar to sha1.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a side effect of google license classifier usage. If we were to only process content in pom.xml licenses header directly, this won't be needed as you mentioned.
We are deferring the process of license classification so that we don't repeat it while crawling URLs and sha. Unique URLs/name across maven are around 2k. We'll be unncessarily repeating the classification process unless we introduce some thread safe cache which might slow down crawl due to locking.
We are trying to aggregate unique license urls from pom.xml, dump license url content in files concurrently and later on process them as a batch using google license classifier to get standardised license strings
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unless we introduce some thread safe cache which might slow down crawl due to locking.
Using cache looks much simpler.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup simpler, but slower. Crawling to concurrently process maven indexes will be slow due to blocking lock/unlock operations while reading/updating cache
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you measure it? I don't think it is so slow.
|
||
licenseFileName := getLicenseFileName(c.licensedir, licenseKey) | ||
licenseMeta := uniqLicenseKeyMap[licenseKey] | ||
ok, err := c.generateLicenseFile(client, licenseFileName, licenseMeta) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need to generate license files? It accepts []byte
or io.Reader
.
https://github.com/google/licenseclassifier/blob/c1ed8fcf4babdf7c37d872cf6da5b1c32907d34c/v2/classifier.go#L321
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The v2 license classifier tool loads assets i.e uses embedded license data https://github.com/google/licenseclassifier/blob/main/v2/tools/identify_license/backend/backend.go#L50 which is used to find appropriate matches for supplied input.
The code you pointed out would require us providing a local directory with all assets/training data if I am not wrong. It is doable but we'll have to maintain a copy locally I believe.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code you pointed out would require us providing a local directory with all assets/training data if I am not wrong.
No, we don't have to do that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Didn't get you.
https://github.com/google/licenseclassifier/blob/main/v2/classifier.go#L278
https://github.com/google/licenseclassifier/blob/main/v2/classifier_test.go#L44
Am I missing something?
Just wanted to chime in and say thanks for all your efforts on this. 🙏 |
Scared to ask, but any progress on this? :) |
Discussion: aquasecurity/trivy#4236 (reply in thread)
Unit Tests: Working as expected