Look inside a jar file and run Scancode on the Java source files. #384

Open
michelescarlato opened this issue Jan 14, 2022 · 7 comments

@michelescarlato
Member

Hi @MagielBruntink Magiel,

As a follow-up to yesterday's dev call, I looked through the fasten-project GitHub organization for source code that runs Scancode on an unzipped jar file containing Java source files.
Unfortunately, I couldn't find any source code related to this task.

Could you please point me to the code you mentioned?

I appreciate any help you can provide.
M

@MagielBruntink
Member

Hi @michelescarlato

As can be seen in this example: https://github.com/fasten-project/fasten/blob/fd0e82ac5524d3b5b17c92a4a9234f7f910a5bd0/analyzer/vulnerability-packages-listener/src/test/resources/real-callable-index-message.json
the POM Analyzer puts a link to the jar file containing the sources into its output message. It is just a matter of fetching that jar file over HTTP and unzipping it.
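
A minimal sketch of that fetch-and-unzip step, using only the Java standard library (Java 11+). The class and method names (`SourcesJarFetcher`, `fetchAndUnzip`, `unzip`) are made up for illustration and are not taken from the FASTEN code base; the URL is whatever the POM Analyzer message carries.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper, not part of FASTEN.
final class SourcesJarFetcher {

    /** Downloads the jar at sourcesUrl and unpacks it into targetDir. */
    static void fetchAndUnzip(String sourcesUrl, Path targetDir)
            throws IOException, InterruptedException {
        HttpClient client = HttpClient.newBuilder()
                .followRedirects(HttpClient.Redirect.NORMAL)
                .build();
        HttpRequest request = HttpRequest.newBuilder(URI.create(sourcesUrl)).GET().build();
        HttpResponse<InputStream> response =
                client.send(request, HttpResponse.BodyHandlers.ofInputStream());
        try (InputStream body = response.body()) {
            if (response.statusCode() != 200) {
                throw new IOException("HTTP " + response.statusCode() + " for " + sourcesUrl);
            }
            unzip(body, targetDir);
        }
    }

    /** Unpacks a zip/jar stream into targetDir, rejecting entries that would land outside it. */
    static void unzip(InputStream zipped, Path targetDir) throws IOException {
        Path root = targetDir.toAbsolutePath().normalize();
        Files.createDirectories(root);
        try (ZipInputStream zip = new ZipInputStream(zipped)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                Path out = root.resolve(entry.getName()).normalize();
                if (!out.startsWith(root)) {
                    throw new IOException("Entry escapes target dir: " + entry.getName());
                }
                if (entry.isDirectory()) {
                    Files.createDirectories(out);
                } else {
                    Files.createDirectories(out.getParent());
                    // Copies only the current entry; the zip stream stays open.
                    Files.copy(zip, out, StandardCopyOption.REPLACE_EXISTING);
                }
            }
        }
    }
}
```

Calling `SourcesJarFetcher.fetchAndUnzip(url, Path.of("unpacked"))` with the sources jar URL would leave the `.java` files under `unpacked/`, where Scancode can be pointed at them. The path check is there because jar entries come from an external repository and should not be trusted to stay inside the target directory.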

@michelescarlato
Member Author

michelescarlato commented Jan 16, 2022

@MagielBruntink , thanks for the input.

A follow-up question:

Sebastian (@proksch) mentioned that Kafka messages are encapsulated, but I still don't know how they are encapsulated.

sourcesUrl belongs to the Kafka message produced to the fasten.POMAnalyzer.out topic.
How can I retrieve it when consuming the fasten.MetadataDBExtension.out topic?

@MagielBruntink
Member

It will be wrapped somewhere (this is a messy bit, tbh):

{
   "input": {
      "input": {
         "input": {
            "groupId": "org.apache.logging.log4j",
            "artifactId": "log4j-core",
            "version": "2.14.1"
         },
         "plugin_version": "0.1.2",
         "consumed_at": 1642324728,
         "payload": {
            "date": 1615093953000,
            "repoUrl": "",
            "groupId": "org.apache.logging.log4j",
            "version": "2.14.1",
            "parentCoordinate": "org.apache.logging.log4j:log4j:2.14.1",
            "artifactRepository": "https://repo.maven.apache.org/maven2/",
            "forge": "mvn",
            "sourcesUrl": "https://repo.maven.apache.org/maven2/org/apache/logging/log4j/log4j-core/2.14.1/log4j-core-2.14.1-sources.jar",
            "artifactId": "log4j-core",
            "dependencyData": {
               "dependencyManagement": {
                  "dependencies": []
               },
              ......

What I do is search for the sourcesUrl field recursively, like with the following method:

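        // Walks the (possibly nested) message and returns the first "sourcesUrl" found, or null.
        static String findSourcesUrl(JSONObject json) {
            for (var key : json.keySet()) {
                if (key.equals("sourcesUrl")) {
                    return json.getString("sourcesUrl");
                } else {
                    var other = json.get(key);
                    if (other instanceof JSONObject) {
                        // Descend into the wrapped object; findPayload is a helper not shown in this comment.
                        var nestedResult = findPayload((JSONObject) other);
                        if (nestedResult != null) return nestedResult;
                    }
                }
            }
            return null; // no sourcesUrl at this level or below
        }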
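
To try the same idea without a JSON library, here is a rough sketch that assumes the message has already been parsed into nested `java.util.Map` and `java.util.List` values. The class and method names (`NestedKeySearch`, `findFirst`) are invented for this example. The key is a parameter, so one method covers `sourcesUrl`, `repoUrl`, or `payload`.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical helper, not part of FASTEN.
final class NestedKeySearch {

    /**
     * Depth-first search for the first non-null value stored under key.
     * The current level is checked before descending into nested values.
     */
    static Optional<Object> findFirst(Object node, String key) {
        if (node instanceof Map) {
            Map<?, ?> map = (Map<?, ?>) node;
            Object direct = map.get(key);
            if (direct != null) {
                return Optional.of(direct);
            }
            for (Object child : map.values()) {
                Optional<Object> nested = findFirst(child, key);
                if (nested.isPresent()) {
                    return nested;
                }
            }
        } else if (node instanceof List) {
            for (Object child : (List<?>) node) {
                Optional<Object> nested = findFirst(child, key);
                if (nested.isPresent()) {
                    return nested;
                }
            }
        }
        return Optional.empty();
    }
}
```

Unlike the method above, this variant looks at the current level before recursing, so a key found in an outer wrapper wins over one buried deeper. Returning an `Optional` also avoids the `null` checks at each call site.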

@michelescarlato
Member Author

Hi @MagielBruntink Magiel,

I am integrating the portion of code that you are suggesting here and here (I need to extract the package name and the package version, in Debian).

The code you mentioned calls a findPayload() function. What exactly does this function do?
Can I see it somewhere?
Or is it just the same function, with payload as the JSON key instead of sourcesUrl?

@MagielBruntink
Member

Hi Michele, find the method here:

@proksch
Contributor

proksch commented May 19, 2022

I have just re-discovered this issue. xD So we did not just talk in a dev call about the problems that SIG had with the Flink sync job, we even discussed and illustrated the ease of use of ..-sources.jar files in this very issue. We could have saved ourselves a ton of headaches if we had just followed these recommendations here... well, everybody is smarter afterwards.

@michelescarlato
Member Author

The Java license detector made heavy use of the messages produced by the RepoCloner. Unfortunately, modifications to the detector's code are required to adapt it to a new approach.

Also, please consider that an approach that avoids the use of Flink but with a very similar implementation of the Java license detector has been carried out in Python (where the input Kafka topic is fasten.MetadataDBPythonExtension.out).

As you can imagine, the development and deployment of the three license detectors (Java, Python, and C) are tied to the pipeline itself, which differs between languages.

The Java license detector was mainly developed in July, when the Java pipeline relied heavily on the RepoCloner. That is the main reason the Java license detector looks iteratively into the Kafka records consumed from the RepoCloner.out topic.

I only recently discovered, while performing an analysis with @MagielBruntink, that the repoUrl (which may contain a GitHub URL that the detector uses to retrieve license information in SPDX format from the GitHub APIs) is produced by the POM Analyzer.
This means that the outbound license detection based on GitHub URLs can still be performed, even after removing the RepoCloner plugin.
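
As an illustration only (this is not the detector's actual code), the sketch below maps a repoUrl to GitHub's REST endpoint `GET /repos/{owner}/{repo}/license`, whose JSON response contains a `license.spdx_id` field. The class and method names are made up, and non-GitHub or empty repoUrl values (as in the log4j-core example earlier in this thread) simply yield no lookup.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of FASTEN.
final class GitHubLicenseLookup {

    // Matches https://github.com/<owner>/<repo>, optionally ending in ".git" or "/".
    private static final Pattern GITHUB_REPO = Pattern.compile(
            "^https?://(?:www\\.)?github\\.com/([\\w.-]+)/([\\w.-]+?)(?:\\.git)?/?$");

    /** Builds the license endpoint URI, or returns empty if repoUrl is not a GitHub repository URL. */
    static Optional<URI> licenseApiUri(String repoUrl) {
        if (repoUrl == null) {
            return Optional.empty();
        }
        Matcher m = GITHUB_REPO.matcher(repoUrl.trim());
        if (!m.matches()) {
            return Optional.empty();
        }
        return Optional.of(URI.create(
                "https://api.github.com/repos/" + m.group(1) + "/" + m.group(2) + "/license"));
    }

    /** Fetches the raw JSON body; the SPDX identifier sits under license.spdx_id. */
    static String fetchLicenseJson(URI apiUri) throws IOException, InterruptedException {
        HttpRequest request = HttpRequest.newBuilder(apiUri)
                .header("Accept", "application/vnd.github+json")
                .GET()
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IOException("HTTP " + response.statusCode() + " for " + apiUri);
        }
        return response.body();
    }
}
```

Parsing the body is left to whichever JSON library the plugin already uses. Unauthenticated calls to the GitHub API are rate limited, so a real deployment would likely need a token.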

On the other hand, following the discussion with Magiel, we understood that having a common place where unjarred Maven packages reside could benefit both plugins, Rapid and the License Detector.

As you suggested in the last dev call, this task could be performed directly by the POM Analyzer, which avoids adding another plugin to the Java pipeline.

This could be excellent for both Rapid and License Detector.
