ScholarVista is a tool that extracts and plots information from a set of academic research papers in PDF or TEI XML format. To process PDFs, it uses Grobid to generate TEI XML files; ScholarVista then extracts the relevant information from those files and generates the following outputs:
- Keyword Cloud for each paper's abstract and one for all abstracts combined.
- Links List containing the links found in each paper.
- Figures Histogram comparing the number of figures per paper.
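To make the three outputs above more concrete, here is a minimal sketch of how the underlying data (abstract text, links, figure count) could be pulled out of a TEI XML file with Python's standard library. This is not ScholarVista's actual implementation: the function name `summarize_tei` and the sample document are hypothetical, and the element and attribute names (`abstract`, `ref` with a `target` attribute, `figure` with `type="table"` for tables) are assumptions about what Grobid typically emits.

```python
import xml.etree.ElementTree as ET

# TEI namespace used by TEI XML documents.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Tiny hand-written sample standing in for a Grobid-generated file (assumption).
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <profileDesc>
      <abstract><p>Deep learning for
        document parsing.</p></abstract>
    </profileDesc>
  </teiHeader>
  <text>
    <body>
      <p>Code is at <ref type="url" target="https://example.org/code">this page</ref>.</p>
      <figure xml:id="fig_0"><head>Figure 1</head></figure>
      <figure xml:id="fig_1"><head>Figure 2</head></figure>
      <figure type="table" xml:id="tab_0"><head>Table 1</head></figure>
    </body>
  </text>
</TEI>
"""


def summarize_tei(xml_text):
    """Return the abstract text, HTTP(S) links, and figure count of one TEI document."""
    root = ET.fromstring(xml_text)

    # Abstract: join all nested text and collapse runs of whitespace.
    abstract = root.find(".//tei:abstract", TEI_NS)
    abstract_text = ""
    if abstract is not None:
        abstract_text = " ".join("".join(abstract.itertext()).split())

    # Links: keep only <ref> targets that look like web URLs.
    links = [
        ref.get("target")
        for ref in root.findall(".//tei:ref", TEI_NS)
        if ref.get("target", "").startswith(("http://", "https://"))
    ]

    # Figures: count <figure> elements, skipping those marked as tables.
    figures = sum(
        1
        for fig in root.findall(".//tei:figure", TEI_NS)
        if fig.get("type") != "table"
    )

    return {"abstract": abstract_text, "links": links, "figures": figures}
```

Calling `summarize_tei(SAMPLE_TEI)` returns a dictionary with the abstract as a single string, a one-element list of links, and a figure count of 2. The abstract text would feed the keyword cloud, and the per-paper figure counts would feed the histogram.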
Python >=3.12 is required to install the ScholarVista package. It is not needed when using the Docker image.
If you want to generate the results from a set of PDF academic papers, you must ensure that the Grobid service is installed and running on your machine. See Grobid installation instructions here.
The most straightforward way of starting and running the Grobid service is by running a Docker image. Make sure you have Docker installed on your system.
docker run --rm --init --ulimit core=0 -p 8070:8070 lfoppiano/grobid:0.8.0
This command will run Grobid and expose a web client on port 8070.
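Before processing PDFs, it can be useful to confirm that the Grobid service is actually reachable. The following is a minimal sketch using only the Python standard library. It assumes Grobid is listening on `http://localhost:8070` and relies on Grobid's `/api/isalive` health endpoint; the helper name `grobid_is_alive` is hypothetical and not part of ScholarVista.

```python
from urllib.request import urlopen


def grobid_is_alive(base_url="http://localhost:8070", timeout=3.0):
    """Return True if the Grobid health endpoint answers with HTTP 200."""
    try:
        with urlopen(f"{base_url}/api/isalive", timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # URLError, HTTPError, connection refusals, and timeouts all derive
        # from OSError, so any of them means the service is not reachable.
        return False
```

For example, `grobid_is_alive()` should return `True` once the container started above has finished booting, and `False` if nothing is listening on that port.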
If you already have the TEI XML files generated from Grobid saved in a folder, you can directly generate the information from them.
Note: The TEI XML files MUST be obtained using Grobid, as this tool is intended to work only with Grobid generated TEI XML files.
To install ScholarVista from source, clone the repository and install the package using pip. When using pip, it is good practice to use virtual environments. Check out the official documentation on virtual environments here.
git clone https://github.com/mciccale/ScholarVista
cd ScholarVista
conda create -n scholarvista-env-3.12 python=3.12
conda activate scholarvista-env-3.12
pip install .
Note: You can also use PyEnv to create a virtual environment. However, since ScholarVista needs Python >=3.12, Conda is more convenient because it lets you select the Python version when creating the environment.
If you prefer running ScholarVista from a Docker Container, you can build the Docker Image with the following commands.
git clone https://github.com/mciccale/ScholarVista
cd ScholarVista
docker build -t scholarvista-app .
This will create an image called scholarvista-app.
The most convenient way to use ScholarVista is through its CLI.
For each PDF analyzed, the CLI tool generates a keyword cloud of the paper's abstract and a list of its URLs. It also generates a histogram comparing the number of figures in each PDF and a general keyword cloud of all abstracts. All results are saved to the output directory.
Usage: scholarvista [OPTIONS] COMMAND [ARGS]...
ScholarVista's CLI main entry point.
Options:
--input-dir PATH Directory containing PDF files. [required]
--output-dir PATH Directory to save results. Defaults to current directory.
--help Show this message and exit.
Commands:
process-pdfs Process all PDFs in the given directory.
process-xmls Process all TEI XMLs in the given directory.
- Start Grobid service using the container.
docker run --rm --init --ulimit core=0 -p 8070:8070 lfoppiano/grobid:0.8.0
- Run ScholarVista's CLI to process all the PDFs in a given directory and leave the results in another directory.
# Process PDF files and save the results to a specified directory
scholarvista --input-dir ./pdfs --output-dir ./output process-pdfs
ScholarVista provides a set of classes and modules that let you use all of its functionality from your own Python code. For an example, see example.py.
If you prefer running ScholarVista with Docker, you can make use of ScholarVista CLI directly from the Docker Image you created following these instructions.
- Start Grobid service using the container.
docker run --rm --init --ulimit core=0 -p 8070:8070 lfoppiano/grobid:0.8.0
- Run ScholarVista's container with 2 mounted volumes for input and output directories and connected to the host network.
docker run -it --rm --network=host -v /path/to/input/dir:/input -v /path/to/output/dir:/output scholarvista-app
Note: By default, ScholarVista's Docker image processes PDF files. You can override this by providing the process-xmls argument after the image name.
Here's an example that processes a set of PDFs contained in the foo directory and leaves the results in the bar directory using the Docker image, assuming the Grobid service is running at localhost:8070.
docker run -it --rm --network=host -v "$(pwd)/foo":/input -v "$(pwd)/bar":/output scholarvista-app process-pdfs
You can also try running ScholarVista through Docker Compose. However, this feature is still in development and may not work as expected: ScholarVista may try to connect to Grobid before Grobid has started, in which case its container is restarted until the Grobid service is up and running. You can try it with:
INPUT_DIR=/path/to/input/dir OUTPUT_DIR=/path/to/output/dir COMMAND='process-pdfs' docker-compose up
$env:INPUT_DIR="/path/to/input/dir"; $env:OUTPUT_DIR="/path/to/output/dir"; $env:COMMAND="process-pdfs"; docker-compose up
Note: The COMMAND variable can be either process-pdfs or process-xmls. INPUT_DIR and OUTPUT_DIR are the host machine directories from which the input files are read and to which the results are written, respectively.
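The startup race described for Docker Compose (a client trying to reach Grobid before it is ready) is commonly handled by polling a readiness check with a bounded number of retries. The sketch below illustrates that general pattern; it is not how ScholarVista or its Compose file currently behaves. The names `wait_until_ready`, `probe`, `attempts`, and `delay` are hypothetical, and `probe` stands for any zero-argument callable that returns True once the service responds.

```python
import time


def wait_until_ready(probe, attempts=30, delay=2.0, sleep=time.sleep):
    """Call probe() until it returns True, waiting `delay` seconds between tries.

    Returns the number of the attempt that succeeded, or raises TimeoutError
    if the service is still not ready after `attempts` tries.
    """
    for attempt in range(1, attempts + 1):
        if probe():
            return attempt
        # Do not sleep after the final failed attempt.
        if attempt < attempts:
            sleep(delay)
    raise TimeoutError(f"service not ready after {attempts} attempts")
```

A caller would pass a function that performs a real health check against the Grobid service as `probe`. The `sleep` parameter exists only so the waiting can be replaced in tests.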
Please refer to the LICENSE file.
For further assistance or to contribute to the project, please refer to the CONTRIBUTING.md file.