feat: refactoring of the code
VinciGit00 committed Oct 21, 2024
1 parent d0ec310 commit f2e3cb7
Showing 6 changed files with 185 additions and 82 deletions.
100 changes: 42 additions & 58 deletions README.md
Original file line number Diff line number Diff line change
@@ -1,89 +1,73 @@
# ScrapeBiblio: PDF Reference Extraction and Verification Library

## Powered by Scrapegraphai
![Drag Racing](docs/scrapebiblio.png)
![ScrapeBiblio Logo](docs/scrapebiblio.png)
[![Downloads](https://static.pepy.tech/badge/scrapebiblio)](https://pepy.tech/project/scrapebiblio)

This library is designed to extract references from a PDF file, check them against the Semantic Scholar database, and save the results to a Markdown file.
ScrapeBiblio is a powerful library designed to extract references from PDF files, verify them against various databases, and convert the content to Markdown format.

## Overview
## Features

The library performs the following steps:
- Extract text from PDF files
- Extract references using OpenAI's GPT models
- Verify references using Semantic Scholar, CORE, and BASE databases
- Convert PDF content to Markdown format
- Integration with ScrapeGraph for additional reference checking

### First usage: extracting references from
1. **Extract Text from PDF**: Reads the content of a PDF file and extracts the text.
2. **Split Text into Chunks**: Splits the extracted text into smaller chunks to manage large texts efficiently.
3. **Extract References**: Uses the OpenAI API to extract references from the text.
4. **Save References**: Saves the extracted references to a Markdown file.
5. **Check References in Semantic Scholar**: (Optional) Checks if the extracted references are present in the Semantic Scholar database.
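
Step 2 above does not say how the chunks are formed. A minimal sketch, assuming fixed-size character chunks with a small overlap so that a reference straddling a boundary still appears whole in at least one chunk. `split_into_chunks`, `chunk_size`, and `overlap` are hypothetical names and are not part of ScrapeBiblio:

```python
from typing import List


def split_into_chunks(text: str, chunk_size: int = 4000, overlap: int = 200) -> List[str]:
    """Split text into overlapping character chunks (hypothetical helper)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    step = chunk_size - overlap  # how far the window advances each time
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks
```

Each chunk could then be passed to the extraction step separately; duplicates produced by the overlap would need to be removed afterwards.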

## Installation and Setup

To install the required dependencies, you can use the following command:
## Installation

Install ScrapeBiblio using pip:
```bash
pip install scrapebiblio
```

Ensure you have a `.env` file in the root directory of your project with the following content:
## Configuration

Create a `.env` file in your project root with the following content:

```plaintext
OPENAI_API_KEY="YOUR_OPENAI_KEY"
SEMANTIC_SCHOLARE_API_KEY="YOUR_SEMANTIC_SCHOLAR_KEY"
OPENAI_API_KEY=your_openai_api_key
SEMANTIC_SCHOLAR_API_KEY=your_semantic_scholar_api_key
CORE_API_KEY=your_core_api_key
BASE_API_KEY=your_base_api_key
```

## Usage

To use the library, ensure you have the required environment variables set and run the script. The extracted references will be saved to a Markdown file named `references.md`.

### Example

Here is an example of how to use the library:
Here's a basic example of how to use ScrapeBiblio:

```python
import logging
import os
from scrapebiblio.core.find_reference import process_pdf
from dotenv import load_dotenv
from biblio.find_reference import process_pdf

logging.basicConfig(level=logging.DEBUG, format='%(asctime)s - %(levelname)s - %(message)s')

import os
load_dotenv()
pdf_path = 'path/to/your/pdf/file.pdf'
output_path = 'references.md'
openai_api_key = os.getenv('OPENAI_API_KEY')
semantic_scholar_api_key = os.getenv('SEMANTIC_SCHOLAR_API_KEY')
core_api_key = os.getenv('CORE_API_KEY')
base_api_key = os.getenv('BASE_API_KEY')
process_pdf(pdf_path, output_path, openai_api_key, semantic_scholar_api_key,
core_api_key=core_api_key, base_api_key=base_api_key)
```
## Advanced Usage

def main():
"""
Main function that processes a PDF, extracts text, and saves the references.
"""
pdf_path = 'test/558779153.pdf'
references_output_path = 'references.md'

openai_api_key = os.getenv('OPENAI_API_KEY')
semantic_scholar_api_key = os.getenv('SEMANTIC_SCHOLARE_API_KEY')

if not openai_api_key:
raise EnvironmentError("OPENAI_API_KEY environment variable not set.")
if not semantic_scholar_api_key:
raise EnvironmentError("SEMANTIC_SCHOLARE_API_KEY environment variable not set.")

logging.debug("Starting PDF processing...")

process_pdf(pdf_path, references_output_path, openai_api_key, semantic_scholar_api_key)

logging.debug("Processing completed.")
ScrapeBiblio offers additional functionalities:

if __name__ == "__main__":
main()
1. Convert PDF to Markdown:
```python
from scrapebiblio.core.convert_to_md import convert_to_md
convert_to_md(pdf_path, output_path, openai_api_key)
```
2. Check references with ScrapeGraph:

```python
from scrapebiblio.utils.api.reference_utils import check_reference_with_scrapegraph
result = check_reference_with_scrapegraph("Reference Title")
```
## Contributing

We welcome contributions to this project. If you would like to contribute, please follow these steps:

1. Fork the repository.
2. Create a new branch for your feature or bugfix.
3. Make your changes.
4. Submit a pull request with a detailed description of your changes.
We welcome contributions! Please see our [Contributing Guidelines](CONTRIBUTING.md) for more details.

## License

This project is licensed under the MIT License. See the `LICENSE` file for more information.
This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.
24 changes: 24 additions & 0 deletions main.py
@@ -0,0 +1,24 @@
import asyncio
import logging
from src.core.pdf_processor import process_pdf
from src.models.api_config import APIConfig

logging.basicConfig(level=logging.DEBUG, format='%(asctime)s - %(levelname)s - %(message)s')

async def main():
    api_config = APIConfig(
        openai_api_key="your_openai_api_key",
        semantic_scholar_api_key="your_semantic_scholar_api_key",
        core_api_key="your_core_api_key",
        base_api_key="your_base_api_key"
    )

    await process_pdf(
        pdf_path="path/to/your/pdf",
        references_output_path="path/to/save/references",
        markdown_output_path="path/to/save/markdown",
        api_config=api_config
    )

if __name__ == "__main__":
    asyncio.run(main())
60 changes: 36 additions & 24 deletions scrapebiblio/find_reference.py
@@ -2,48 +2,60 @@
find_reference module
"""
import logging
from dataclasses import dataclass
from typing import List, Optional
import asyncio
from .utils.pdf_utils import extract_text_from_pdf
from .utils.openai_utils import extract_references
from .utils.reference_utils import check_reference, check_reference_with_scrapegraph

logging.basicConfig(level=logging.DEBUG, format='%(asctime)s - %(levelname)s - %(message)s')

def process_pdf(pdf_path: str, references_output_path: str,
openai_api_key: str, semantic_scholar_api_key: str,
core_api_key: str = None, base_api_key: str = None):
@dataclass
class APIConfig:
openai_api_key: str
semantic_scholar_api_key: str
core_api_key: Optional[str] = None
base_api_key: Optional[str] = None

@dataclass
class Reference:
title: str
authors: List[str]
year: int

async def process_pdf(pdf_path: str, references_output_path: str, api_config: APIConfig) -> None:
"""
Processes a PDF, extracts text, and saves the references.
Args:
pdf_path (str): Path to the PDF file.
references_output_path (str): Path to the output file for references.
openai_api_key (str): The API key for OpenAI.
semantic_scholar_api_key (str): The API key for Semantic Scholar.
core_api_key (str, optional): The API key for CORE. Defaults to None.
base_api_key (str, optional): The API key for BASE. Defaults to None.
api_config (APIConfig): Configuration object containing API keys.
"""
logging.debug("Starting PDF processing...")

pdf_text = extract_text_from_pdf(pdf_path)
pdf_text = await extract_text_from_pdf(pdf_path)
references = await extract_references(pdf_text, api_key=api_config.openai_api_key)

references = extract_references(pdf_text, api_key=openai_api_key)
await save_references(references, references_output_path)

with open(references_output_path, 'w') as file:
file.write(f"# References\n\n{references}")
tasks = [
check_reference(ref, api_config) for ref in references
]
results = await asyncio.gather(*tasks)

logging.debug(f"References saved to {references_output_path}")
for reference, result in zip(references, results):
logging.debug(f"Reference check result for {reference.title}: {result}")

for reference in references.split('\n'):
if reference.strip():
result = check_reference(reference,
semantic_scholar_api_key=semantic_scholar_api_key,
core_api_key=core_api_key,
base_api_key=base_api_key,
use_semantic_scholar=True)
logging.debug(f"Reference check result: {result}")

# Check reference with ScrapeGraph
scrapegraph_result = check_reference_with_scrapegraph(reference)
logging.debug(f"ScrapeGraph check result: {scrapegraph_result}")
scrapegraph_result = await check_reference_with_scrapegraph(reference.title)
logging.debug(f"ScrapeGraph check result for {reference.title}: {scrapegraph_result}")

logging.debug("Processing completed.")

async def save_references(references: List[Reference], output_path: str) -> None:
with open(output_path, 'w') as file:
file.write("# References\n\n")
for ref in references:
file.write(f"- {ref.title} by {', '.join(ref.authors)} ({ref.year})\n")
logging.debug(f"References saved to {output_path}")
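
The concurrent verification step in the new `process_pdf` can be illustrated in isolation. The sketch below substitutes a stub for `check_reference` (the real function's return type is not shown in this diff, so the boolean result is an assumption). It relies on `asyncio.gather` returning results in the same order as the awaitables it was given, which is what makes the `zip(references, results)` pairing safe:

```python
import asyncio
from dataclasses import dataclass
from typing import List


@dataclass
class Reference:
    title: str
    authors: List[str]
    year: int


async def check_reference_stub(ref: Reference) -> bool:
    """Stand-in for check_reference; the real one would query an external API."""
    await asyncio.sleep(0)  # yield to the event loop, as a network call would
    return ref.year >= 2000  # arbitrary rule, for illustration only


async def check_all(references: List[Reference]) -> List[bool]:
    # Build one coroutine per reference and run them concurrently.
    tasks = [check_reference_stub(ref) for ref in references]
    # gather() preserves input order, so results[i] belongs to references[i].
    return await asyncio.gather(*tasks)


refs = [
    Reference(title="Title A", authors=["Author One"], year=1995),
    Reference(title="Title B", authors=["Author Two", "Author Three"], year=2010),
]
results = asyncio.run(check_all(refs))
for ref, ok in zip(refs, results):
    print(f"{ref.title}: {ok}")
```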
36 changes: 36 additions & 0 deletions src/core/pdf_processor.py
@@ -0,0 +1,36 @@
import logging
import asyncio
from ..models.api_config import APIConfig
from ..utils.pdf_utils import extract_text_from_pdf
from ..services.reference_service import extract_and_check_references
from ..services.markdown_service import convert_to_markdown

async def process_pdf(pdf_path: str, references_output_path: str, markdown_output_path: str, api_config: APIConfig) -> None:
    logging.debug("Starting PDF processing...")

    pdf_text = await extract_text_from_pdf(pdf_path)

    references = await extract_and_check_references(pdf_text, api_config)

    await save_references(references, references_output_path)

    markdown_text = await convert_to_markdown(pdf_text, api_config.openai_api_key)

    await save_markdown(markdown_text, markdown_output_path)

    logging.debug("PDF processing completed.")

async def save_references(references, output_path: str) -> None:
    with open(output_path, 'w') as file:
        file.write("# References\n\n")
        for ref in references:
            file.write(f"- {ref.title} by {', '.join(ref.authors)} ({ref.year})\n")
    logging.debug(f"References saved to {output_path}")

async def save_markdown(markdown_text: str, output_path: str) -> None:
    with open(output_path, 'w') as file:
        file.write(markdown_text)
    logging.debug(f"Markdown saved to {output_path}")

async def main(pdf_path: str, references_output_path: str, markdown_output_path: str, api_config: APIConfig) -> None:
    await process_pdf(pdf_path, references_output_path, markdown_output_path, api_config)
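
`save_references` and `save_markdown` are declared `async` but call the blocking `open()` and `write()` directly, so the event loop is stalled while the file is written. One possible way around this, assuming Python 3.9 or later, is to hand the write to a worker thread with `asyncio.to_thread`. `write_text_async` and `demo` are hypothetical names, not part of this commit:

```python
import asyncio
import os
import tempfile
from pathlib import Path


def _write_text(path: str, text: str) -> None:
    """Plain blocking write; runs inside a worker thread."""
    Path(path).write_text(text, encoding="utf-8")


async def write_text_async(path: str, text: str) -> None:
    # to_thread() runs the blocking function in the default thread pool,
    # leaving the event loop free to service other tasks meanwhile.
    await asyncio.to_thread(_write_text, path, text)


def demo() -> str:
    """Write a small file asynchronously and read it back."""
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "references.md")
        asyncio.run(write_text_async(target, "# References\n"))
        return Path(target).read_text(encoding="utf-8")
```

For files as small as a reference list the difference is unlikely to be measurable; the pattern matters more when several PDFs are processed concurrently.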
9 changes: 9 additions & 0 deletions src/models/api_config.py
@@ -0,0 +1,9 @@
from dataclasses import dataclass
from typing import Optional

@dataclass
class APIConfig:
    openai_api_key: str
    semantic_scholar_api_key: str
    core_api_key: Optional[str] = None
    base_api_key: Optional[str] = None
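
A sketch of how an `APIConfig` might be populated from environment variables instead of hard-coded strings. The variable names follow the README's `.env` example; `config_from_env` is a hypothetical helper, and the dataclass is repeated here only so the snippet runs on its own:

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class APIConfig:
    openai_api_key: str
    semantic_scholar_api_key: str
    core_api_key: Optional[str] = None
    base_api_key: Optional[str] = None


REQUIRED_VARS = ("OPENAI_API_KEY", "SEMANTIC_SCHOLAR_API_KEY")


def config_from_env(env: Mapping[str, str] = os.environ) -> APIConfig:
    """Build an APIConfig from a mapping of environment variables (hypothetical helper)."""
    # Treat unset and empty values alike: both count as missing.
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise EnvironmentError(
            f"Missing required environment variables: {', '.join(missing)}"
        )
    return APIConfig(
        openai_api_key=env["OPENAI_API_KEY"],
        semantic_scholar_api_key=env["SEMANTIC_SCHOLAR_API_KEY"],
        core_api_key=env.get("CORE_API_KEY"),  # optional, None when unset
        base_api_key=env.get("BASE_API_KEY"),  # optional, None when unset
    )
```

Accepting the mapping as a parameter keeps the helper testable without touching the real process environment.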
38 changes: 38 additions & 0 deletions src/services/reference_service.py
@@ -0,0 +1,38 @@
import logging
from openai import OpenAI

def extract_references(text:str, model:str="gpt-4o", api_key:str=None)->str:
    """
    Extracts references from the text using the OpenAI API.
    Args:
        text (str): Text from which to extract references.
        model (str): The model to use for the API call.
        api_key (str): The API key for OpenAI.
    Returns:
        str: Extracted references.
    """
    logging.debug("Starting extraction of references from text...")
    client = OpenAI(api_key=api_key)

    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": """You are a helpful
            assistant that extracts references from text."""},
            {"role": "user", "content": f"""Extract all references from the following
            text and format them in a consistent manner: \n{text}\n.
            Format each reference as:\n
            1. \"Title\" by Authors - [Reference Number]\n
            2. \"Title\" by Authors - [Reference Number]\n
            ..."""}
        ],
        max_tokens=4096,
        n=1,
        stop=None,
        temperature=0.0,
    )
    references = response.choices[0].message.content.strip()
    logging.debug("References extracted from text.")
    return references
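
`extract_references` returns a single string, whereas `save_references` in `src/core/pdf_processor.py` iterates over objects with `title`, `authors`, and `year` attributes. A sketch of a parser for the line format requested in the prompt. It assumes the model follows that format exactly and that authors are comma-separated; `ParsedReference` and `parse_references` are hypothetical names. The prompt does not ask for a year, so none is parsed:

```python
import re
from dataclasses import dataclass
from typing import List

# Matches lines such as: 1. "Title" by Author One, Author Two - [1]
_LINE = re.compile(
    r'^\s*\d+\.\s*"(?P<title>.+?)"\s+by\s+(?P<authors>.+?)\s+-\s+\[(?P<number>[^\]]+)\]\s*$'
)


@dataclass
class ParsedReference:
    title: str
    authors: List[str]
    number: str  # the bracketed reference number, kept as text


def parse_references(raw: str) -> List[ParsedReference]:
    """Turn the model's formatted output into structured records (hypothetical helper)."""
    parsed: List[ParsedReference] = []
    for line in raw.splitlines():
        match = _LINE.match(line)
        if match is None:
            continue  # skip blank lines and anything not in the expected format
        authors = [a.strip() for a in match.group("authors").split(",") if a.strip()]
        parsed.append(
            ParsedReference(match.group("title"), authors, match.group("number"))
        )
    return parsed
```

Lines that do not match are silently dropped here; logging them would make formatting drift in the model output easier to spot.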
