Merge pull request #89 from swisstopo/LGVISIUM-84-AWS-Environment-variables-in-development

Close #LGVISIUM-84: Added the possibility to pass the AWS credentials in a `.env` file or as environment variables
dcleres authored Oct 8, 2024
2 parents 2771506 + e1a8277 commit 45af9a4
Showing 9 changed files with 248 additions and 117 deletions.
6 changes: 6 additions & 0 deletions .env.template
@@ -0,0 +1,6 @@
MLFLOW_TRACKING="True"
MLFLOW_TRACKING_URI="http://127.0.0.1:5000"

AWS_ACCESS_KEY_ID=your_access_key_id
AWS_SECRET_ACCESS_KEY=your_secret_access_key
AWS_ENDPOINT=your_endpoint_url
1 change: 1 addition & 0 deletions .vscode/settings.json
@@ -5,6 +5,7 @@
"depthcolumn",
"depthcolumnentry",
"dotenv",
"fastapi",
"fitz",
"mlflow",
"pixmap",
229 changes: 147 additions & 82 deletions README.md
@@ -89,28 +89,28 @@ To execute the data extraction pipeline, follow these steps:

1. **Activate the virtual environment**

Activate your virtual environment. On Unix systems this is done with:

```bash
source env/bin/activate
```
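
If you are on Windows, and assuming the virtual environment is also named `env`, the equivalent is typically:

```cmd
:: In PowerShell, use: .\env\Scripts\Activate.ps1
env\Scripts\activate
```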

2. **Download the borehole profiles, optional**

Use `boreholes-download-profiles` to download the files to be processed from an AWS S3 bucket. To do so, you first need to authenticate with AWS; we recommend using the AWS CLI for that purpose (see the example below). This step is optional: you can continue with step 3 using your own set of borehole profiles.
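
With the AWS CLI, authentication is typically set up once with:

```bash
# Prompts for the access key ID, secret access key, default region and output format
aws configure
```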

3. **Run the extraction script**

The main script for the extraction pipeline is located at `src/stratigraphy/main.py`. A CLI command is provided to run this script.

Run `boreholes-extract-all` to run the main extraction script. You need to specify the input directory or a single PDF file using the `-i` or `--input-directory` flag.
The script will source all PDFs from the specified directory and create PNG files in the `data/output/draw` directory.
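
For example, to process all PDFs in a hypothetical `data/my-borehole-profiles` directory (the path is only an illustration):

```bash
boreholes-extract-all -i data/my-borehole-profiles
```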

Use `boreholes-extract-all --help` to see all options for the extraction script.

4. **Check the results**

Once the script has finished running, you can check the results in the `data/output/draw` directory. The result is a `predictions.json` file as well as a PNG file for each page of each PDF in the specified input directory.

### Output Structure
The `predictions.json` file contains the results of the data extraction process from PDF files. Each key in the JSON object is the name of a PDF file, and the value is a list of extracted items in a dictionary-like object. For now, the extracted items are the material descriptions in their correct order (given by their depths).
@@ -247,46 +247,45 @@ To launch the API and access its endpoints, follow these steps:

1. **Activate the virtual environment**

Activate your virtual environment. On Unix systems, this can be done with the following command:

```bash
source env/bin/activate
```

2. **Environment variables**

Please make sure to define the environment variables needed for the API to access the S3 bucket of interest:

```python
aws_access_key_id = os.environ.get("AWS_ACCESS_KEY_ID")
aws_secret_key_access = os.environ.get("AWS_SECRET_ACCESS_KEY")
aws_endpoint = os.environ.get("AWS_ENDPOINT")
```
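
For local development, these variables can also be loaded from the `.env` file introduced by this change. The following is a minimal sketch of how such a client could be created, assuming `python-dotenv` and `boto3` are used; the actual wiring in `app.common.aws` may differ:

```python
import os

import boto3
from dotenv import load_dotenv

# Read the AWS_* variables from a local .env file (see .env.template).
load_dotenv()

# Build an S3 client from the environment variables. The custom endpoint URL
# is only needed when the bucket is not reached via the default AWS endpoint.
s3_client = boto3.client(
    "s3",
    aws_access_key_id=os.environ.get("AWS_ACCESS_KEY_ID"),
    aws_secret_access_key=os.environ.get("AWS_SECRET_ACCESS_KEY"),
    endpoint_url=os.environ.get("AWS_ENDPOINT"),
)
```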

3. **Start the FastAPI server**

Run the following command to start the FastAPI server:

```bash
uvicorn src.app.main:app --reload --host 0.0.0.0 --port 8002
```

This will start the server on port 8002 of localhost and enable automatic reloading whenever changes are made to the code. You can see the OpenAPI Specification (formerly Swagger Specification) by opening `http://127.0.0.1:8002/docs#/` in your favorite browser.

4. **Access the API endpoints**

Once the server is running, you can access the API endpoints using a web browser or an API testing tool like Postman.

The main endpoint for the data extraction pipeline is `http://localhost:8002/extract-data`. You can send a POST request to this endpoint with the PDF file you want to extract data from.

Additional endpoints and their functionalities can be found in the project's source code.

**Note:** Make sure to replace `localhost` with the appropriate hostname or IP address if you are running the server on a remote machine.

5. **Stop the server**

To stop the FastAPI server, press `Ctrl + C` in the terminal where the server is running. Please refer to the [FastAPI documentation](https://fastapi.tiangolo.com) for more information on how to work with FastAPI and build APIs using this framework.


## API as Docker Image
@@ -295,99 +294,166 @@ The borehole application offers a number of functionalities (extract text,

1. **Navigate to the project directory**

Change your current directory to the project directory:

```bash
cd swissgeol-boreholes-dataextraction
```

2. **Build the Docker image**

Build the Docker image using the following command:

```bash
docker build -t borehole-api . -f Dockerfile
```

To build the image for the `linux/amd64` platform (for example when building on an ARM-based machine) with the `test` tag, use:

```bash
docker build --platform linux/amd64 -t borehole-api:test .
```

This command will build the Docker image with the tag `borehole-api`.

3. **Verify the Docker image**

Verify that the Docker image has been successfully built by running the following command:

```bash
docker images
```

You should see the `borehole-api` image listed in the output.

4. **Run the Docker container**

4.1. **Run the Docker container without AWS credentials**

To run the Docker container, use the following command:

```bash
docker run -p 8000:8000 borehole-api
```

This command will start the container and map port 8000 of the container to port 8000 of the host machine.

4.2. **Run the Docker image with AWS credentials**

4.2.1. **Using a `~/.aws` file**

If you have the AWS credentials configured locally in the `~/.aws` file, you can forward them to the Docker container with the following command:

```bash
docker run -v ~/.aws:/root/.aws -d -p 8000:8000 borehole-api
```

To run the Docker image from `Dockerfile` with the environment variables from the `.env` file:

```bash
docker run --env-file .env -d -p 8000:8000 borehole-api
```

To run the Docker image used for AWS Lambda (`Dockerfile.aws.lambda`):

```bash
docker run --platform linux/amd64 -v ~/.aws:/root/.aws -d -p 8000:8000 borehole-api:test
```

4.2.2. **Passing the AWS credentials as environment variables**

It is also possible to set the AWS credentials as environment variables on your machine and pass them as environment variables to the Docker container you are running.

Unix-based systems (Linux/macOS)

Add the following lines to your `~/.bashrc`, `~/.bash_profile`, or `~/.zshrc` (depending on your shell):

```bash
export AWS_ACCESS_KEY_ID=your_access_key_id
export AWS_SECRET_ACCESS_KEY=your_secret_access_key
export AWS_ENDPOINT=your_endpoint_url
```

Please note that the endpoint URL has the following format: `https://{bucket}.s3.<RegionName>.amazonaws.com`. This URL can be found in AWS by opening your target S3 bucket, selecting any item in the bucket, and looking at the `Object URL` under Properties. Remove the file-specific part of the path and you end up with your endpoint URL.

After editing, run the following command to apply the changes:

```bash
source ~/.bashrc  # Or ~/.bash_profile, ~/.zshrc, depending on your configuration
```

Windows (Command Prompt or PowerShell)

For Command Prompt:

```cmd
setx AWS_ACCESS_KEY_ID your_access_key_id
setx AWS_SECRET_ACCESS_KEY your_secret_access_key
setx AWS_ENDPOINT your_endpoint_url
```

For PowerShell:

```powershell
$env:AWS_ACCESS_KEY_ID="your_access_key_id"
$env:AWS_SECRET_ACCESS_KEY="your_secret_access_key"
$env:AWS_ENDPOINT="your_endpoint_url"
```

4.2.3. **Passing the AWS credentials in an Environment File**

Another option is to store the credentials in a `.env` file and load them into your Python environment using the `python-dotenv` package:

```bash
AWS_ACCESS_KEY_ID=your_access_key_id
AWS_SECRET_ACCESS_KEY=your_secret_access_key
AWS_ENDPOINT=your_endpoint_url
```

You can find an example of such a `.env` file in `.env.template`. If you rename this file to `.env` and add your AWS credentials, you should be good to go.

5. **Access the API**

Once the container is running, you can access the API by opening a web browser and navigating to `http://localhost:8000`.

You can also use an API testing tool like Postman to send requests to the API endpoints.

**Note:** If you are running Docker on a remote machine, replace `localhost` with the appropriate hostname or IP address.


6. **Query the API**

```bash
curl -X 'POST' \
'http://localhost:8000/api/V1/create_pngs' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"filename": "10021.pdf"
}'
```

7. **Stop the Docker container**

To stop the Docker container, press `Ctrl + C` in the terminal where the container is running.

Alternatively, you can use the following command to stop the container:

```bash
docker stop <container_id>
```

Replace `<container_id>` with the ID of the running container, which can be obtained by running `docker ps`.


## AWS Lambda Deployment

AWS Lambda is a serverless computing service provided by Amazon Web Services that allows you to run code without managing servers. It automatically scales your applications by executing code in response to triggers. You only pay for the compute time used.

In this project we are using `Mangum` to wrap the FastAPI app with a handler that we package and deploy as a Lambda function in AWS. Then, using AWS API Gateway, we route all incoming requests to invoke the Lambda and handle the routing internally within our application.
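
As a rough illustration of that wrapping step (the module and endpoint below are placeholders, not the actual entry point of this project), the handler boils down to:

```python
from fastapi import FastAPI
from mangum import Mangum

app = FastAPI()


@app.get("/health")
def health() -> dict:
    # Minimal endpoint so the deployed Lambda has something to answer.
    return {"status": "ok"}


# Mangum translates API Gateway events into ASGI requests, so AWS Lambda can
# invoke `handler` while FastAPI keeps handling the routing internally.
handler = Mangum(app)
```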

We created a script that should make it possible for you to deploy the FastAPI app to AWS Lambda with a single command. The script creates all the required AWS resources to run the API. The resources that will be created for you are:
- AWS Lambda Function
@@ -397,11 +463,10 @@

To deploy the staging version of the FastAPI, run the following command:

```bash
IMAGE=borehole-fastapi ENV=stage AWS_PROFILE=dcleres-visium AWS_S3_BUCKET=dcleres-boreholes-integration-tmp ./deploy_api_aws_lambda.sh
```

## Experiment Tracking
We perform experiment tracking using MLflow. Each developer has their own local MLflow instance.
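
If you want a local tracking server that matches the `MLFLOW_TRACKING_URI` from `.env.template`, one way to start it (assuming the `mlflow` package is installed) is:

```bash
mlflow server --host 127.0.0.1 --port 5000
```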

4 changes: 1 addition & 3 deletions src/app/api/v1/endpoints/create_pngs.py
@@ -5,7 +5,6 @@

import fitz
from app.common.aws import load_pdf_from_aws, upload_file_to_s3
from app.common.config import config
from app.common.schemas import PNGResponse
from fastapi import HTTPException

@@ -49,8 +48,7 @@ def create_pngs(aws_filename: Path):
)

# Generate the S3 URL
png_url = f"https://{config.bucket_name}.s3.amazonaws.com/{s3_bucket_png_path}"
png_urls.append(png_url)
png_urls.append(s3_bucket_png_path)

# Clean up the local file
os.remove(png_path)

1 comment on commit 45af9a4

@github-actions

Coverage

Coverage Report
| File | Stmts | Miss | Cover | Missing |
| --- | ---: | ---: | ---: | --- |
| `src/stratigraphy/__init__.py` | 8 | 1 | 88% | 11 |
| `src/stratigraphy/extract.py` | 186 | 186 | 0% | 3–483 |
| `src/stratigraphy/get_files.py` | 19 | 19 | 0% | 3–47 |
| `src/stratigraphy/main.py` | 119 | 119 | 0% | 3–310 |
| `src/stratigraphy/data_extractor/data_extractor.py` | 50 | 3 | 94% | 32, 62, 98 |
| `src/stratigraphy/depthcolumn/boundarydepthcolumnvalidator.py` | 41 | 20 | 51% | 47, 57, 60, 81–84, 110–128, 140–149 |
| `src/stratigraphy/depthcolumn/depthcolumn.py` | 194 | 64 | 67% | 25, 29, 50, 56, 59–60, 84, 87, 94, 101, 109–110, 120, 137–153, 191, 228, 247–255, 266, 271, 278, 309, 314–321, 336–337, 380–422 |
| `src/stratigraphy/depthcolumn/depthcolumnentry.py` | 28 | 6 | 79% | 17, 21, 36, 39, 56, 65 |
| `src/stratigraphy/depthcolumn/find_depth_columns.py` | 106 | 19 | 82% | 42–43, 73, 86, 180–181, 225–245 |
| `src/stratigraphy/layer/layer_identifier_column.py` | 74 | 52 | 30% | 16–17, 20, 28, 43, 47, 51, 59–63, 66, 74, 91–96, 99, 112, 125–126, 148–158, 172–199 |
| `src/stratigraphy/lines/geometric_line_utilities.py` | 86 | 2 | 98% | 81, 131 |
| `src/stratigraphy/lines/line.py` | 51 | 4 | 92% | 25, 50, 60, 110 |
| `src/stratigraphy/lines/linesquadtree.py` | 46 | 1 | 98% | 75 |
| `src/stratigraphy/metadata/coordinate_extraction.py` | 108 | 5 | 95% | 30, 64, 94–95, 107 |
| `src/stratigraphy/text/description_block_splitter.py` | 70 | 2 | 97% | 24, 139 |
| `src/stratigraphy/text/extract_text.py` | 29 | 3 | 90% | 19, 53–54 |
| `src/stratigraphy/text/find_description.py` | 64 | 28 | 56% | 27–35, 50–63, 79–95, 172–175 |
| `src/stratigraphy/text/textblock.py` | 80 | 9 | 89% | 28, 56, 64, 89, 101, 124, 145, 154, 183 |
| `src/stratigraphy/util/dataclasses.py` | 32 | 3 | 91% | 37–39 |
| `src/stratigraphy/util/interval.py` | 104 | 55 | 47% | 29–32, 37–40, 46, 52, 56, 66–68, 107–153, 174, 180–196 |
| `src/stratigraphy/util/predictions.py` | 107 | 107 | 0% | 3–282 |
| `src/stratigraphy/util/util.py` | 39 | 17 | 56% | 41, 69–76, 90–92, 116–117, 129–133 |
| **TOTAL** | **1641** | **725** | **56%** | |

| Tests | Skipped | Failures | Errors | Time |
| ---: | ---: | ---: | ---: | ---: |
| 82 | 0 💤 | 0 ❌ | 0 🔥 | 5.948s ⏱️ |
