How to get all the dependencies of Maven packages

To start off, I am still a student and trying to learn, but I am quite stuck on how to get all the dependencies of each Maven package. Initially, I got the list of packages from https://libraries.io/, but without their dependencies: I would like to construct a temporal graph that shows how those dependencies change over time, and Libraries.io does not take time into account, so I only downloaded the packages.

Now, I'm finding it quite hard to download the dependencies of each package. Initially, I thought that parsing the .pom files from the central repo https://repo1.maven.org/maven2/ would be enough to get the dependencies of each package, but later I found out that there are also external dependencies that you can only get by resolving the .jar file (I think?). I do not really understand that, and I am trying to learn how I could get all the data in this case, as parsing alone does not seem to be entirely accurate.
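As a rough illustration of the parsing approach mentioned above, here is a minimal Go sketch that reads the directly declared `<dependencies>` of a POM with `encoding/xml`. The type names, the `parsePOM` helper and the `com.example` sample POM are made up for illustration. The sketch ignores parent POMs, `<dependencyManagement>` and property placeholders such as `${project.version}`, which is one reason plain parsing can miss or misreport dependencies.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// Dependency mirrors one <dependency> element of a POM.
type Dependency struct {
	GroupID    string `xml:"groupId"`
	ArtifactID string `xml:"artifactId"`
	Version    string `xml:"version"` // may be empty or a ${...} placeholder in real POMs
	Scope      string `xml:"scope"`
}

// Project holds only the parts of a POM that this sketch cares about.
type Project struct {
	GroupID      string       `xml:"groupId"`
	ArtifactID   string       `xml:"artifactId"`
	Version      string       `xml:"version"`
	Dependencies []Dependency `xml:"dependencies>dependency"`
}

// samplePOM is a hypothetical POM, not taken from a real artifact.
const samplePOM = `<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0">
  <groupId>com.example</groupId>
  <artifactId>lib-b</artifactId>
  <version>2.0</version>
  <dependencies>
    <dependency>
      <groupId>com.example</groupId>
      <artifactId>lib-a</artifactId>
      <version>1.1</version>
      <scope>compile</scope>
    </dependency>
  </dependencies>
</project>`

// parsePOM decodes the directly declared dependencies of a POM document.
// It does not resolve parents, properties or transitive dependencies.
func parsePOM(pom string) (Project, error) {
	var p Project
	err := xml.Unmarshal([]byte(pom), &p)
	return p, err
}

func main() {
	p, err := parsePOM(samplePOM)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	for _, d := range p.Dependencies {
		fmt.Printf("%s:%s:%s -> %s:%s:%s (%s)\n",
			p.GroupID, p.ArtifactID, p.Version,
			d.GroupID, d.ArtifactID, d.Version, d.Scope)
	}
}
```

Anything beyond the direct declarations (inherited or managed versions, transitive dependencies) would still have to come from a real resolver such as Maven itself.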

Also, I should mention that I am trying to do this in golang, as it was a new language I wanted to try out and it seemed fitting for the task.

EDIT 1: For example, I would like to have this for each package from Maven:

{
  "name": "react-dom",
  "versions": {
    "1.00": {
      "timestamp": "06-05-2022T10:00:01",
      "dependencies": {
        "name": "^1.0.2",
        "name": "^2.1.2"
      }
    }
  }
}
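One possible way to model this shape in Go is sketched below. It assumes the package name is a Maven coordinate of the form `groupId:artifactId` and that the dependencies object is keyed by the dependency's coordinate, since JSON object keys have to be unique. All type names, helper names and the `com.example` data are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// VersionInfo describes one released version of a package.
type VersionInfo struct {
	Timestamp    time.Time         `json:"timestamp"`    // release time, encoded as RFC 3339
	Dependencies map[string]string `json:"dependencies"` // "groupId:artifactId" -> declared version
}

// Package groups all known versions of one artifact.
type Package struct {
	Name     string                 `json:"name"` // assumed form: "groupId:artifactId"
	Versions map[string]VersionInfo `json:"versions"`
}

// samplePackage returns made-up data for illustration only.
func samplePackage() Package {
	return Package{
		Name: "com.example:lib-c",
		Versions: map[string]VersionInfo{
			"1.0.0": {
				Timestamp: time.Date(2022, time.May, 6, 10, 0, 1, 0, time.UTC),
				Dependencies: map[string]string{
					"com.example:lib-a": "1.2",
					"com.example:lib-b": "2.1.2",
				},
			},
		},
	}
}

// toJSON encodes a package in the nested shape shown above.
func toJSON(p Package) (string, error) {
	b, err := json.MarshalIndent(p, "", "  ")
	return string(b), err
}

// fromJSON decodes a package from that same shape.
func fromJSON(s string) (Package, error) {
	var p Package
	err := json.Unmarshal([]byte(s), &p)
	return p, err
}

func main() {
	out, err := toJSON(samplePackage())
	if err != nil {
		fmt.Println("encode error:", err)
		return
	}
	fmt.Println(out)
}
```

Storing the timestamp as a `time.Time` rather than a string keeps releases directly comparable, which is what a temporal graph needs.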

EDIT 2: Yes, this is for research purposes. This is the description of the task https://imgur.com/a/D0LcbzF. Initially, I thought this research would not be this complex, but after the first meeting, I was told that we basically have to do what https://libraries.io/ does, but make it accurate by adding a time component. From what I understood from the professor, what libraries.io does not take into account is a case like this:

  • Library A releases a version at time 1, named version A v1.1
  • Library B, which depends on library A’s latest version at the current time, is releasing a version at time 2. Therefore, library B depends on library A v1.1
  • Library A releases a version at time 3, named version A v1.2
  • Library C, which depends on library B, releases a version at time 4. Therefore, it also depends on library A. Library C should depend on the latest version of library A, which is A v1.2, even though library B depends on version A v1.1.
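The rule in this example ("depend on the latest version that exists at release time") can be sketched in Go as follows. This models only the rule as described in the list above, using the abstract time steps 1 to 4; it is not how Maven itself resolves versions, and the `Release` type and `latestAt` function are names invented for this sketch.

```go
package main

import "fmt"

// Release is one published version of a library at a logical time step.
// Real data would carry an actual timestamp instead of an int.
type Release struct {
	Version string
	Time    int
}

// latestAt returns the newest release published at or before time t.
// The boolean is false when nothing had been released by then.
func latestAt(releases []Release, t int) (Release, bool) {
	var best Release
	found := false
	for _, r := range releases {
		if r.Time <= t && (!found || r.Time > best.Time) {
			best, found = r, true
		}
	}
	return best, found
}

func main() {
	// Library A from the example: v1.1 at time 1, v1.2 at time 3.
	libA := []Release{{Version: "1.1", Time: 1}, {Version: "1.2", Time: 3}}

	// B is released at time 2, C at time 4.
	dependents := []struct {
		Name string
		Time int
	}{{"B", 2}, {"C", 4}}

	for _, d := range dependents {
		if r, ok := latestAt(libA, d.Time); ok {
			fmt.Printf("%s (released at time %d) resolves A to v%s\n", d.Name, d.Time, r.Version)
		}
	}
}
```

With the data from the example, B resolves A to v1.1 and C resolves A to v1.2, matching the list above.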

So, to summarize, I have been tasked with getting the packages from https://libraries.io/api and getting the dependencies by resolving the pom files somehow (I cannot tell you more because I am very confused, I am not knowledgeable enough on this matter, I have just started). The professor sent me this: "WRT maven dependency listing: I was thinking something along the lines of mvn -DgroupId=junit -DartifactId=junit -Dversion=4.13.1 dependency:get but this only works for retrieving the pom/jar for downloading particular dependency into .m2", maybe that helps you understand something.
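For fetching a single POM without going through `mvn`, the standard Maven repository layout maps the coordinates to a path: dots in the groupId become slashes, followed by the artifactId, the version, and a file named `artifactId-version.pom`. The Go sketch below only builds that URL and performs no network request; `pomURL` and `centralBase` are names chosen for this illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// centralBase is the Maven Central root mentioned earlier in the question.
const centralBase = "https://repo1.maven.org/maven2"

// pomURL builds the location of a POM file in a repository that follows
// the standard Maven layout: group/path/artifactId/version/artifactId-version.pom
func pomURL(base, groupID, artifactID, version string) string {
	groupPath := strings.ReplaceAll(groupID, ".", "/")
	return fmt.Sprintf("%s/%s/%s/%s/%s-%s.pom",
		strings.TrimRight(base, "/"), groupPath, artifactID, version, artifactID, version)
}

func main() {
	// Coordinates taken from the mvn command quoted above.
	fmt.Println(pomURL(centralBase, "junit", "junit", "4.13.1"))
}
```

The returned URL can then be passed to any HTTP client, and the response body to a POM parser.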

Afterward, once the data is available, I have to build a temporal graph that fits the example above and, finally, see what measures we can use to determine which software is the most used.

To me, and from what you have already said, this seems very out of reach and I am literally lost on what to do, as the professor does not really guide me in any way.

Some papers I found: https://www.researchgate.net/publication/335499638_The_Maven_Dependency_Graph_A_Temporal_Graph-Based_Representation_of_Maven_Central



Solution 1:[1]

Regarding the part of your comment about using the data to see which packages are the most used: that is data you don't have.

  • For that you need the download statistics of the central repository, which you don't have access to.

  • Even if you had the download statistics of the central repository, they would not represent how much an artifact is used, because many companies use repository managers, which means an artifact is downloaded exactly once but used a lot internally.

  • Furthermore, an artifact being downloaded does not really mean it is used. Some artifacts are downloaded as transitive dependencies, or are just declared in a pom file without really being used.

  • Technically, you can download all artifacts, or at least the pom files, and analyse the dependencies, but you still lack the download statistics of the central repository.

In practice, even that is prevented: you can't download all the artifacts, because you would be blocked from the central repository.

The size of Central is, by my educated guess, about 5 TiB or more.

Another thing is that the central repository is not the only one: there are a lot of other Maven repositories available and in use.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 khmarbaise