Merge pull request #89 from swisstopo/LGVISIUM-84-AWS-Environment-variables-in-development

Close #LGVISIUM-84: Added the possibility to pass the AWS credentials in a `.env` file or as environment variables
dcleres authored Oct 8, 2024
2 parents 2771506 + e1a8277 commit 45af9a4
Showing 9 changed files with 248 additions and 117 deletions.
6 changes: 6 additions & 0 deletions .env.template
@@ -0,0 +1,6 @@
MLFLOW_TRACKING="True"
MLFLOW_TRACKING_URI="http://127.0.0.1:5000"

AWS_ACCESS_KEY_ID=your_access_key_id
AWS_SECRET_ACCESS_KEY=your_secret_access_key
AWS_ENDPOINT=your_endpoint_url
1 change: 1 addition & 0 deletions .vscode/settings.json
@@ -5,6 +5,7 @@
"depthcolumn",
"depthcolumnentry",
"dotenv",
"fastapi",
"fitz",
"mlflow",
"pixmap",
229 changes: 147 additions & 82 deletions README.md
@@ -89,28 +89,28 @@ To execute the data extraction pipeline, follow these steps:

1. **Activate the virtual environment**

Activate your virtual environment. On Unix systems this is done with:

```bash
source env/bin/activate
```
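
If you are on Windows, and assuming the virtual environment is also named `env`, the equivalent is typically:

```cmd
:: In PowerShell, use: .\env\Scripts\Activate.ps1
env\Scripts\activate
```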

2. **Download the borehole profiles, optional**

Use `boreholes-download-profiles` to download the files to be processed from an AWS S3 bucket. To do so, you first need to authenticate with AWS; we recommend using the AWS CLI for that purpose (see the example below). This step is optional: you can continue with step 3 using your own set of borehole profiles.
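
With the AWS CLI, authentication is typically set up once with:

```bash
# Prompts for the access key ID, secret access key, default region and output format
aws configure
```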

3. **Run the extraction script**

The main script for the extraction pipeline is located at `src/stratigraphy/main.py`. A CLI command is provided to run this script.

Run `boreholes-extract-all` to run the main extraction script. You need to specify the input directory or a single PDF file using the `-i` or `--input-directory` flag.
The script will source all PDFs from the specified directory and create PNG files in the `data/output/draw` directory.
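
For example, to process all PDFs in a hypothetical `data/my-borehole-profiles` directory (the path is only an illustration):

```bash
boreholes-extract-all -i data/my-borehole-profiles
```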

Use `boreholes-extract-all --help` to see all options for the extraction script.

4. **Check the results**

Once the script has finished running, you can check the results in the `data/output/draw` directory. The result is a `predictions.json` file as well as a PNG file for each page of each PDF in the specified input directory.

### Output Structure
The `predictions.json` file contains the results of the data extraction process from PDF files. Each key in the JSON object is the name of a PDF file, and the value is a list of extracted items in a dictionary-like object. For now, the extracted items are the material descriptions in their correct order (given by their depths).
@@ -247,46 +247,45 @@ To launch the API and access its endpoints, follow these steps:

1. **Activate the virtual environment**

Activate your virtual environment. On Unix systems, this can be done with the following command:

```bash
source env/bin/activate
```

2. **Environment variables**

Please make sure to define the environment variables needed for the API to access the S3 bucket of interest:

```python
aws_access_key_id = os.environ.get("AWS_ACCESS_KEY_ID")
aws_secret_key_access = os.environ.get("AWS_SECRET_ACCESS_KEY")
aws_endpoint = os.environ.get("AWS_ENDPOINT")
```
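
For local development, these variables can also be loaded from the `.env` file introduced by this change. The following is a minimal sketch of how such a client could be created, assuming `python-dotenv` and `boto3` are used; the actual wiring in `app.common.aws` may differ:

```python
import os

import boto3
from dotenv import load_dotenv

# Read the AWS_* variables from a local .env file (see .env.template).
load_dotenv()

# Build an S3 client from the environment variables. The custom endpoint URL
# is only needed when the bucket is not reached via the default AWS endpoint.
s3_client = boto3.client(
    "s3",
    aws_access_key_id=os.environ.get("AWS_ACCESS_KEY_ID"),
    aws_secret_access_key=os.environ.get("AWS_SECRET_ACCESS_KEY"),
    endpoint_url=os.environ.get("AWS_ENDPOINT"),
)
```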

3. **Start the FastAPI server**

Run the following command to start the FastAPI server:

```bash
uvicorn src.app.main:app --reload --host 0.0.0.0 --port 8002
```

This will start the server on port 8002 of localhost and enable automatic reloading whenever changes are made to the code. You can see the OpenAPI Specification (formerly Swagger Specification) by opening `http://127.0.0.1:8002/docs#/` in your favorite browser.

4. **Access the API endpoints**

Once the server is running, you can access the API endpoints using a web browser or an API testing tool like Postman.

The main endpoint for the data extraction pipeline is `http://localhost:8002/extract-data`. You can send a POST request to this endpoint with the PDF file you want to extract data from.

Additional endpoints and their functionalities can be found in the project's source code.

**Note:** Make sure to replace `localhost` with the appropriate hostname or IP address if you are running the server on a remote machine.

5. **Stop the server**

To stop the FastAPI server, press `Ctrl + C` in the terminal where the server is running. Please refer to the [FastAPI documentation](https://fastapi.tiangolo.com) for more information on how to work with FastAPI and build APIs using this framework.


## API as Docker Image
@@ -295,99 +294,166 @@ The borehole application offers a number of functionalities (extract text,

1. **Navigate to the project directory**

Change your current directory to the project directory:

```bash
cd swissgeol-boreholes-dataextraction
```

2. **Build the Docker image**

Build the Docker image using the following command:

```bash
docker build -t borehole-api . -f Dockerfile
```

To build the image for the `linux/amd64` platform (for example when building on an ARM-based machine) with the `test` tag, use:

```bash
docker build --platform linux/amd64 -t borehole-api:test .
```

This command will build the Docker image with the tag `borehole-api`.

3. **Verify the Docker image**

Verify that the Docker image has been successfully built by running the following command:

```bash
docker images
```

You should see the `borehole-api` image listed in the output.

4. **Run the Docker container**

4.1. **Run the Docker container without AWS credentials**

To run the Docker container, use the following command:

```bash
docker run -p 8000:8000 borehole-api
```

This command will start the container and map port 8000 of the container to port 8000 of the host machine.

4.2. **Run the Docker image with AWS credentials**

4.2.1. **Using a `~/.aws` file**

If you have the AWS credentials configured locally in the `~/.aws` file, you can forward them to the Docker container with the following command:

```bash
docker run -v ~/.aws:/root/.aws -d -p 8000:8000 borehole-api
```

To run the Docker image from `Dockerfile` with the environment variables from the `.env` file:

```bash
docker run --env-file .env -d -p 8000:8000 borehole-api
```

To run the Docker image used for AWS Lambda (`Dockerfile.aws.lambda`):

```bash
docker run --platform linux/amd64 -v ~/.aws:/root/.aws -d -p 8000:8000 borehole-api:test
```

4.2.2. **Passing the AWS credentials as environment variables**

It is also possible to set the AWS credentials as environment variables on your machine and pass them as environment variables to the Docker container you are running.

Unix-based systems (Linux/macOS)

Add the following lines to your `~/.bashrc`, `~/.bash_profile`, or `~/.zshrc` (depending on your shell):

```bash
export AWS_ACCESS_KEY_ID=your_access_key_id
export AWS_SECRET_ACCESS_KEY=your_secret_access_key
export AWS_ENDPOINT=your_endpoint_url
```

Please note that the endpoint URL has the following format: `https://{bucket}.s3.<RegionName>.amazonaws.com`. This URL can be found in AWS by opening your target S3 bucket, selecting any item in the bucket, and looking at the `Object URL` under Properties. Remove the file-specific part of the path and you end up with your endpoint URL.

After editing, run the following command to apply the changes:

```bash
source ~/.bashrc  # Or ~/.bash_profile, ~/.zshrc, depending on your configuration
```

Windows (Command Prompt or PowerShell)

For Command Prompt:

```cmd
setx AWS_ACCESS_KEY_ID your_access_key_id
setx AWS_SECRET_ACCESS_KEY your_secret_access_key
setx AWS_ENDPOINT your_endpoint_url
```

For PowerShell:

```powershell
$env:AWS_ACCESS_KEY_ID="your_access_key_id"
$env:AWS_SECRET_ACCESS_KEY="your_secret_access_key"
$env:AWS_ENDPOINT="your_endpoint_url"
```

4.2.3. **Passing the AWS credentials in an Environment File**

Another option is to store the credentials in a `.env` file and load them into your Python environment using the `python-dotenv` package:

```bash
AWS_ACCESS_KEY_ID=your_access_key_id
AWS_SECRET_ACCESS_KEY=your_secret_access_key
AWS_ENDPOINT=your_endpoint_url
```

You can find an example of such a `.env` file in `.env.template`. If you rename this file to `.env` and add your AWS credentials, you should be good to go.

5. **Access the API**

Once the container is running, you can access the API by opening a web browser and navigating to `http://localhost:8000`.

You can also use an API testing tool like Postman to send requests to the API endpoints.

**Note:** If you are running Docker on a remote machine, replace `localhost` with the appropriate hostname or IP address.


6. **Query the API**

```bash
curl -X 'POST' \
'http://localhost:8000/api/V1/create_pngs' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"filename": "10021.pdf"
}'
```

7. **Stop the Docker container**

To stop the Docker container, press `Ctrl + C` in the terminal where the container is running.

Alternatively, you can use the following command to stop the container:

```bash
docker stop <container_id>
```

Replace `<container_id>` with the ID of the running container, which can be obtained by running `docker ps`.


## AWS Lambda Deployment

AWS Lambda is a serverless computing service provided by Amazon Web Services that allows you to run code without managing servers. It automatically scales your applications by executing code in response to triggers. You only pay for the compute time used.

In this project we are using `Mangum` to wrap the FastAPI app with a handler that we package and deploy as a Lambda function in AWS. Then, using AWS API Gateway, we route all incoming requests to invoke the Lambda and handle the routing internally within our application.
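
As a rough illustration of that wrapping step (the module and endpoint below are placeholders, not the actual entry point of this project), the handler boils down to:

```python
from fastapi import FastAPI
from mangum import Mangum

app = FastAPI()


@app.get("/health")
def health() -> dict:
    # Minimal endpoint so the deployed Lambda has something to answer.
    return {"status": "ok"}


# Mangum translates API Gateway events into ASGI requests, so AWS Lambda can
# invoke `handler` while FastAPI keeps handling the routing internally.
handler = Mangum(app)
```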

We created a script that should make it possible for you to deploy the FastAPI app to AWS Lambda with a single command. The script creates all the required AWS resources to run the API. The resources that will be created for you are:
- AWS Lambda Function
@@ -397,11 +463,10 @@

To deploy the staging version of the FastAPI, run the following command:

```bash
IMAGE=borehole-fastapi ENV=stage AWS_PROFILE=dcleres-visium AWS_S3_BUCKET=dcleres-boreholes-integration-tmp ./deploy_api_aws_lambda.sh
```

## Experiment Tracking
We perform experiment tracking using MLflow. Each developer has their own local MLflow instance.
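
If you want a local tracking server that matches the `MLFLOW_TRACKING_URI` from `.env.template`, one way to start it (assuming the `mlflow` package is installed) is:

```bash
mlflow server --host 127.0.0.1 --port 5000
```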

4 changes: 1 addition & 3 deletions src/app/api/v1/endpoints/create_pngs.py
@@ -5,7 +5,6 @@

import fitz
from app.common.aws import load_pdf_from_aws, upload_file_to_s3
from app.common.config import config
from app.common.schemas import PNGResponse
from fastapi import HTTPException

@@ -49,8 +48,7 @@ def create_pngs(aws_filename: Path):
)

# Generate the S3 URL
png_url = f"https://{config.bucket_name}.s3.amazonaws.com/{s3_bucket_png_path}"
png_urls.append(png_url)
png_urls.append(s3_bucket_png_path)

# Clean up the local file
os.remove(png_path)

1 comment on commit 45af9a4

@github-actions

Coverage

Coverage Report
| File | Stmts | Miss | Cover | Missing |
| --- | ---: | ---: | ---: | --- |
| `src/stratigraphy/__init__.py` | 8 | 1 | 88% | 11 |
| `src/stratigraphy/extract.py` | 186 | 186 | 0% | 3–483 |
| `src/stratigraphy/get_files.py` | 19 | 19 | 0% | 3–47 |
| `src/stratigraphy/main.py` | 119 | 119 | 0% | 3–310 |
| `src/stratigraphy/data_extractor/data_extractor.py` | 50 | 3 | 94% | 32, 62, 98 |
| `src/stratigraphy/depthcolumn/boundarydepthcolumnvalidator.py` | 41 | 20 | 51% | 47, 57, 60, 81–84, 110–128, 140–149 |
| `src/stratigraphy/depthcolumn/depthcolumn.py` | 194 | 64 | 67% | 25, 29, 50, 56, 59–60, 84, 87, 94, 101, 109–110, 120, 137–153, 191, 228, 247–255, 266, 271, 278, 309, 314–321, 336–337, 380–422 |
| `src/stratigraphy/depthcolumn/depthcolumnentry.py` | 28 | 6 | 79% | 17, 21, 36, 39, 56, 65 |
| `src/stratigraphy/depthcolumn/find_depth_columns.py` | 106 | 19 | 82% | 42–43, 73, 86, 180–181, 225–245 |
| `src/stratigraphy/layer/layer_identifier_column.py` | 74 | 52 | 30% | 16–17, 20, 28, 43, 47, 51, 59–63, 66, 74, 91–96, 99, 112, 125–126, 148–158, 172–199 |
| `src/stratigraphy/lines/geometric_line_utilities.py` | 86 | 2 | 98% | 81, 131 |
| `src/stratigraphy/lines/line.py` | 51 | 4 | 92% | 25, 50, 60, 110 |
| `src/stratigraphy/lines/linesquadtree.py` | 46 | 1 | 98% | 75 |
| `src/stratigraphy/metadata/coordinate_extraction.py` | 108 | 5 | 95% | 30, 64, 94–95, 107 |
| `src/stratigraphy/text/description_block_splitter.py` | 70 | 2 | 97% | 24, 139 |
| `src/stratigraphy/text/extract_text.py` | 29 | 3 | 90% | 19, 53–54 |
| `src/stratigraphy/text/find_description.py` | 64 | 28 | 56% | 27–35, 50–63, 79–95, 172–175 |
| `src/stratigraphy/text/textblock.py` | 80 | 9 | 89% | 28, 56, 64, 89, 101, 124, 145, 154, 183 |
| `src/stratigraphy/util/dataclasses.py` | 32 | 3 | 91% | 37–39 |
| `src/stratigraphy/util/interval.py` | 104 | 55 | 47% | 29–32, 37–40, 46, 52, 56, 66–68, 107–153, 174, 180–196 |
| `src/stratigraphy/util/predictions.py` | 107 | 107 | 0% | 3–282 |
| `src/stratigraphy/util/util.py` | 39 | 17 | 56% | 41, 69–76, 90–92, 116–117, 129–133 |
| **TOTAL** | **1641** | **725** | **56%** | |

| Tests | Skipped | Failures | Errors | Time |
| ---: | ---: | ---: | ---: | ---: |
| 82 | 0 💤 | 0 ❌ | 0 🔥 | 5.948s ⏱️ |
