feat: refactoring of the code

ScrapeGraphAI · Oct 21, 2024 · f2e3cb7
1 parent d0ec310

Showing 6 changed files with 185 additions and 82 deletions.
diff --git a/README.md b/README.md
@@ -1,89 +1,73 @@
 # ScrapeBiblio: PDF Reference Extraction and Verification Library
 
 ## Powered by Scrapegraphai
-![Drag Racing](docs/scrapebiblio.png)
+![ScrapeBiblio Logo](docs/scrapebiblio.png)
 [![Downloads](https://static.pepy.tech/badge/scrapebiblio)](https://pepy.tech/project/scrapebiblio)
 
-This library is designed to extract references from a PDF file, check them against the Semantic Scholar database, and save the results to a Markdown file.
+ScrapeBiblio is a powerful library designed to extract references from PDF files, verify them against various databases, and convert the content to Markdown format.
 
-## Overview
+## Features
 
-The library performs the following steps:
+- Extract text from PDF files
+- Extract references using OpenAI's GPT models
+- Verify references using Semantic Scholar, CORE, and BASE databases
+- Convert PDF content to Markdown format
+- Integration with ScrapeGraph for additional reference checking
 
-### First usage: extracting references from 
-1. **Extract Text from PDF**: Reads the content of a PDF file and extracts the text.
-2. **Split Text into Chunks**: Splits the extracted text into smaller chunks to manage large texts efficiently.
-3. **Extract References**: Uses the OpenAI API to extract references from the text.
-4. **Save References**: Saves the extracted references to a Markdown file.
-5. **Check References in Semantic Scholar**: (Optional) Checks if the extracted references are present in the Semantic Scholar database.
-
-## Installation and Setup
-
-To install the required dependencies, you can use the following command:
+## Installation
 
+Install ScrapeBiblio using pip:
 ```bash
 pip install scrapebiblio
 ```
 
-Ensure you have a `.env` file in the root directory of your project with the following content:
+## Configuration
+
+Create a `.env` file in your project root with the following content:
 
 ```plaintext
-OPENAI_API_KEY="YOUR_OPENAI_KEY"
-SEMANTIC_SCHOLARE_API_KEY="YOUR_SEMANTIC_SCHOLAR_KEY"
+OPENAI_API_KEY=your_openai_api_key
+SEMANTIC_SCHOLAR_API_KEY=your_semantic_scholar_api_key
+CORE_API_KEY=your_core_api_key
+BASE_API_KEY=your_base_api_key
 ```
-
 ## Usage
 
-To use the library, ensure you have the required environment variables set and run the script. The extracted references will be saved to a Markdown file named `references.md`.
-
-### Example
-
-Here is an example of how to use the library:
+Here's a basic example of how to use ScrapeBiblio:
 
 ```python
-import logging
-import os
+from scrapebiblio.core.find_reference import process_pdf
 from dotenv import load_dotenv
-from biblio.find_reference import process_pdf
-
-logging.basicConfig(level=logging.DEBUG, format='%(asctime)s - %(levelname)s - %(message)s')
-
+import os
 load_dotenv()
+pdf_path = 'path/to/your/pdf/file.pdf'
+output_path = 'references.md'
+openai_api_key = os.getenv('OPENAI_API_KEY')
+semantic_scholar_api_key = os.getenv('SEMANTIC_SCHOLAR_API_KEY')
+core_api_key = os.getenv('CORE_API_KEY')
+base_api_key = os.getenv('BASE_API_KEY')
+process_pdf(pdf_path, output_path, openai_api_key, semantic_scholar_api_key,
+            core_api_key=core_api_key, base_api_key=base_api_key)
+```
+## Advanced Usage
 
-def main():
-    """
-    Main function that processes a PDF, extracts text, and saves the references.
-    """
-    pdf_path = 'test/558779153.pdf'
-    references_output_path = 'references.md'
-
-    openai_api_key = os.getenv('OPENAI_API_KEY')
-    semantic_scholar_api_key = os.getenv('SEMANTIC_SCHOLARE_API_KEY')
-
-    if not openai_api_key:
-        raise EnvironmentError("OPENAI_API_KEY environment variable not set.")
-    if not semantic_scholar_api_key:
-        raise EnvironmentError("SEMANTIC_SCHOLARE_API_KEY environment variable not set.")
-
-    logging.debug("Starting PDF processing...")
-
-    process_pdf(pdf_path, references_output_path, openai_api_key, semantic_scholar_api_key)
-
-    logging.debug("Processing completed.")
+ScrapeBiblio offers additional functionalities:
 
-if __name__ == "__main__":
-    main()
+1. Convert PDF to Markdown:
+```python
+from scrapebiblio.core.convert_to_md import convert_to_md
+convert_to_md(pdf_path, output_path, openai_api_key)
 ```
+2. Check references with ScrapeGraph:
 
+```python
+from scrapebiblio.utils.api.reference_utils import check_reference_with_scrapegraph
+result = check_reference_with_scrapegraph("Reference Title")
+```
 ## Contributing
 
-We welcome contributions to this project. If you would like to contribute, please follow these steps:
-
-1. Fork the repository.
-2. Create a new branch for your feature or bugfix.
-3. Make your changes.
-4. Submit a pull request with a detailed description of your changes.
+We welcome contributions! Please see our [Contributing Guidelines](CONTRIBUTING.md) for more details.
 
 ## License
 
-This project is licensed under the MIT License. See the `LICENSE` file for more information.
+This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.
diff --git a/main.py b/main.py
@@ -0,0 +1,24 @@
+import asyncio
+import logging
+from src.core.pdf_processor import process_pdf
+from src.models.api_config import APIConfig
+
+logging.basicConfig(level=logging.DEBUG, format='%(asctime)s - %(levelname)s - %(message)s')
+
+async def main():
+    api_config = APIConfig(
+        openai_api_key="your_openai_api_key",
+        semantic_scholar_api_key="your_semantic_scholar_api_key",
+        core_api_key="your_core_api_key",
+        base_api_key="your_base_api_key"
+    )
+
+    await process_pdf(
+        pdf_path="path/to/your/pdf",
+        references_output_path="path/to/save/references",
+        markdown_output_path="path/to/save/markdown",
+        api_config=api_config
+    )
+
+if __name__ == "__main__":
+    asyncio.run(main())
diff --git a/scrapebiblio/find_reference.py b/scrapebiblio/find_reference.py
@@ -2,48 +2,60 @@
 find_reference module
 """
 import logging
+from dataclasses import dataclass
+from typing import List, Optional
+import asyncio
 from .utils.pdf_utils import extract_text_from_pdf
 from .utils.openai_utils import extract_references
 from .utils.reference_utils import check_reference, check_reference_with_scrapegraph
 
 logging.basicConfig(level=logging.DEBUG, format='%(asctime)s - %(levelname)s - %(message)s')
 
-def process_pdf(pdf_path: str, references_output_path: str,
-                openai_api_key: str, semantic_scholar_api_key: str,
-                core_api_key: str = None, base_api_key: str = None):
+@dataclass
+class APIConfig:
+    openai_api_key: str
+    semantic_scholar_api_key: str
+    core_api_key: Optional[str] = None
+    base_api_key: Optional[str] = None
+
+@dataclass
+class Reference:
+    title: str
+    authors: List[str]
+    year: int
+
+async def process_pdf(pdf_path: str, references_output_path: str, api_config: APIConfig) -> None:
     """
     Processes a PDF, extracts text, and saves the references.
 
     Args:
         pdf_path (str): Path to the PDF file.
         references_output_path (str): Path to the output file for references.
-        openai_api_key (str): The API key for OpenAI.
-        semantic_scholar_api_key (str): The API key for Semantic Scholar.
-        core_api_key (str, optional): The API key for CORE. Defaults to None.
-        base_api_key (str, optional): The API key for BASE. Defaults to None.
+        api_config (APIConfig): Configuration object containing API keys.
     """
     logging.debug("Starting PDF processing...")
 
-    pdf_text = extract_text_from_pdf(pdf_path)
+    pdf_text = await extract_text_from_pdf(pdf_path)
+    references = await extract_references(pdf_text, api_key=api_config.openai_api_key)
 
-    references = extract_references(pdf_text, api_key=openai_api_key)
+    await save_references(references, references_output_path)
 
-    with open(references_output_path, 'w') as file:
-        file.write(f"# References\n\n{references}")
+    tasks = [
+        check_reference(ref, api_config) for ref in references
+    ]
+    results = await asyncio.gather(*tasks)
 
-    logging.debug(f"References saved to {references_output_path}")
+    for reference, result in zip(references, results):
+        logging.debug(f"Reference check result for {reference.title}: {result}")
 
-    for reference in references.split('\n'):
-        if reference.strip():
-            result = check_reference(reference,
-                                     semantic_scholar_api_key=semantic_scholar_api_key,
-                                     core_api_key=core_api_key,
-                                     base_api_key=base_api_key,
-                                     use_semantic_scholar=True)
-            logging.debug(f"Reference check result: {result}")
-
-            # Check reference with ScrapeGraph
-            scrapegraph_result = check_reference_with_scrapegraph(reference)
-            logging.debug(f"ScrapeGraph check result: {scrapegraph_result}")
+        scrapegraph_result = await check_reference_with_scrapegraph(reference.title)
+        logging.debug(f"ScrapeGraph check result for {reference.title}: {scrapegraph_result}")
 
     logging.debug("Processing completed.")
+
+async def save_references(references: List[Reference], output_path: str) -> None:
+    with open(output_path, 'w') as file:
+        file.write("# References\n\n")
+        for ref in references:
+            file.write(f"- {ref.title} by {', '.join(ref.authors)} ({ref.year})\n")
+    logging.debug(f"References saved to {output_path}")
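Review note on the `asyncio.gather` call in `process_pdf`: with the default settings, the first lookup that raises propagates out of `gather`, so no per-reference results get logged. A minimal sketch of collecting failures alongside successes follows. `fake_check` is a hypothetical stand-in for `check_reference`, whose real signature is not shown in this commit.

```python
import asyncio
from typing import List, Union


async def fake_check(title: str) -> bool:
    # Hypothetical stand-in for a real database lookup.
    # It raises for an empty title so the error path is exercised.
    await asyncio.sleep(0)
    if not title:
        raise ValueError("empty title")
    return True


async def check_all(titles: List[str]) -> List[Union[bool, BaseException]]:
    # return_exceptions=True puts each exception into the result list,
    # in input order, instead of raising the first one.
    return await asyncio.gather(
        *(fake_check(title) for title in titles),
        return_exceptions=True,
    )


results = asyncio.run(check_all(["Paper A", "", "Paper B"]))
for item in results:
    if isinstance(item, BaseException):
        print(f"lookup failed: {item!r}")
    else:
        print(f"lookup ok: {item}")
```

Whether a failed lookup should be logged and skipped or should stop the run is a design choice for the maintainers; the sketch only shows the mechanism.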
diff --git a/src/core/pdf_processor.py b/src/core/pdf_processor.py
@@ -0,0 +1,36 @@
+import logging
+import asyncio
+from ..models.api_config import APIConfig
+from ..utils.pdf_utils import extract_text_from_pdf
+from ..services.reference_service import extract_and_check_references
+from ..services.markdown_service import convert_to_markdown
+
+async def process_pdf(pdf_path: str, references_output_path: str, markdown_output_path: str, api_config: APIConfig) -> None:
+    logging.debug("Starting PDF processing...")
+
+    pdf_text = await extract_text_from_pdf(pdf_path)
+
+    references = await extract_and_check_references(pdf_text, api_config)
+
+    await save_references(references, references_output_path)
+
+    markdown_text = await convert_to_markdown(pdf_text, api_config.openai_api_key)
+
+    await save_markdown(markdown_text, markdown_output_path)
+
+    logging.debug("PDF processing completed.")
+
+async def save_references(references, output_path: str) -> None:
+    with open(output_path, 'w') as file:
+        file.write("# References\n\n")
+        for ref in references:
+            file.write(f"- {ref.title} by {', '.join(ref.authors)} ({ref.year})\n")
+    logging.debug(f"References saved to {output_path}")
+
+async def save_markdown(markdown_text: str, output_path: str) -> None:
+    with open(output_path, 'w') as file:
+        file.write(markdown_text)
+    logging.debug(f"Markdown saved to {output_path}")
+
+async def main(pdf_path: str, references_output_path: str, markdown_output_path: str, api_config: APIConfig) -> None:
+    await process_pdf(pdf_path, references_output_path, markdown_output_path, api_config)
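Review note: `save_references` and `save_markdown` are declared `async` but call the blocking `open`/`write` directly, so the event loop is held for the duration of each write. One possible way to offload the write is sketched below, assuming Python 3.9+ for `asyncio.to_thread`. The names `write_text_async` and `_write_text` are hypothetical and not part of this commit.

```python
import asyncio
import os
import tempfile


def _write_text(path: str, text: str) -> None:
    # Plain blocking write; intended to run in a worker thread.
    with open(path, "w", encoding="utf-8") as file:
        file.write(text)


async def write_text_async(path: str, text: str) -> None:
    # Hypothetical helper: hands the blocking write to the default
    # thread pool so the event loop stays free for other tasks.
    await asyncio.to_thread(_write_text, path, text)


async def _demo() -> str:
    # Write into a temporary directory, then read the file back.
    path = os.path.join(tempfile.mkdtemp(), "references.md")
    await write_text_async(path, "# References\n\n")
    with open(path, encoding="utf-8") as file:
        return file.read()


demo_output = asyncio.run(_demo())
```

For files as small as a reference list the difference is probably negligible; the pattern matters more if many PDFs are processed concurrently.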
diff --git a/src/models/api_config.py b/src/models/api_config.py
@@ -0,0 +1,9 @@
+from dataclasses import dataclass
+from typing import Optional
+
+@dataclass
+class APIConfig:
+    openai_api_key: str
+    semantic_scholar_api_key: str
+    core_api_key: Optional[str] = None
+    base_api_key: Optional[str] = None
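Review note: the README reads keys from `.env` via `os.getenv`, while `main.py` passes placeholder strings to `APIConfig`. A minimal sketch of building the config from the environment follows, assuming the variable names from the README's `.env` example. `api_config_from_env` is a hypothetical helper, and `APIConfig` is repeated here only so the block runs standalone. The required-key check mirrors the `EnvironmentError` checks in the README example that this commit removes.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class APIConfig:
    # Same fields as the dataclass added in this commit.
    openai_api_key: str
    semantic_scholar_api_key: str
    core_api_key: Optional[str] = None
    base_api_key: Optional[str] = None


def api_config_from_env(env: Mapping[str, str] = os.environ) -> APIConfig:
    # Hypothetical helper: required keys fail early, optional keys become None.
    required = ("OPENAI_API_KEY", "SEMANTIC_SCHOLAR_API_KEY")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise EnvironmentError(
            f"Missing environment variables: {', '.join(missing)}"
        )
    return APIConfig(
        openai_api_key=env["OPENAI_API_KEY"],
        semantic_scholar_api_key=env["SEMANTIC_SCHOLAR_API_KEY"],
        core_api_key=env.get("CORE_API_KEY"),
        base_api_key=env.get("BASE_API_KEY"),
    )
```

Passing a plain dict as `env` makes the helper easy to exercise in tests without touching the real environment.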
diff --git a/src/services/reference_service.py b/src/services/reference_service.py
@@ -0,0 +1,38 @@
+import logging
+from openai import OpenAI
+
+def extract_references(text:str, model:str="gpt-4o", api_key:str=None)->str:
+    """
+    Extracts references from the text using the OpenAI API.
+
+    Args:
+        text (str): Text from which to extract references.
+        model (str): The model to use for the API call.
+        api_key (str): The API key for OpenAI.
+
+    Returns:
+        str: Extracted references.
+    """
+    logging.debug("Starting extraction of references from text...")
+    client = OpenAI(api_key=api_key)
+
+    response = client.chat.completions.create(
+        model=model,
+        messages=[
+            {"role": "system", "content": """You are a helpful
+                                            assistant that extracts references from text."""},
+            {"role": "user", "content": f"""Extract all references from the following
+                                            text and format them in a consistent manner: \n{text}\n.
+                                            Format each reference as:\n
+                                                1. \"Title\" by Authors - [Reference Number]\n
+                                                2. \"Title\" by Authors - [Reference Number]\n
+                                            ..."""}
+        ],
+        max_tokens=4096,
+        n=1,
+        stop=None,
+        temperature=0.0,
+    )
+    references = response.choices[0].message.content.strip()
+    logging.debug("References extracted from text.")
+    return references
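Review note: `extract_references` returns a single string, while `save_references` in `pdf_processor.py` iterates over objects with `title`, `authors`, and `year`. The sketch below shows one way to bridge the two by parsing the line format requested in the prompt (`1. "Title" by Authors - [Reference Number]`). `ParsedReference` and `parse_references` are hypothetical names, and the prompt format carries no year, so none is parsed. Model output may not follow the format exactly, so lines that do not match are skipped.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class ParsedReference:
    # Hypothetical container; holds only what the prompt format provides.
    title: str
    authors: List[str]
    number: int


# Matches lines such as: 1. "Some Title" by A. Author, B. Author - [1]
_REFERENCE_LINE = re.compile(
    r'^\s*\d+\.\s*"(?P<title>.+?)"\s+by\s+(?P<authors>.+?)\s*-\s*\[(?P<number>\d+)\]\s*$'
)


def parse_references(raw: str) -> List[ParsedReference]:
    references = []
    for line in raw.splitlines():
        match = _REFERENCE_LINE.match(line)
        if match is None:
            continue  # skip lines that do not follow the requested format
        # Split the author list on commas and on the word "and".
        authors = [
            name.strip()
            for name in re.split(r",|\band\b", match.group("authors"))
            if name.strip()
        ]
        references.append(
            ParsedReference(match.group("title"), authors, int(match.group("number")))
        )
    return references


sample = (
    '1. "Deep Learning" by Y. LeCun, Y. Bengio and G. Hinton - [1]\n'
    "Some unrelated line\n"
    '2. "Attention Is All You Need" by A. Vaswani - [2]'
)
parsed = parse_references(sample)
```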