This repository contains the code produced by the Molise.ai team in the Neural Wave Hackathon 2024 competition in Lugano.
Here is a brief explanation of the challenge. It was proposed by Ai4Privacy, a company that builds global solutions that enhance **privacy protections** in the rapidly evolving world of Artificial Intelligence. The goal is to create a machine learning model capable of detecting and masking PII (Personally Identifiable Information) in text data across several languages and locales. The task requires working with a synthetic dataset to train models that can automatically identify and redact 17 types of PII in natural language texts. The solution should aim for high accuracy while maintaining the usability of the underlying data.

The final solution could be integrated into various systems and enhance privacy protections across industries, including client support, legal, and general data anonymization tools. Success in this project will contribute to scaling privacy-conscious AI systems without compromising the UX or operational performance.
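To make the "detect and mask" task concrete, here is a minimal sketch of the masking half, assuming a detector has already produced character-level spans. The function name `mask_pii`, the span layout, and the labels `NAME` and `EMAIL` are hypothetical and are not taken from this repository or from the challenge's 17 PII types.

```python
from typing import List, Tuple

# Hypothetical span layout: (start, end, label), character offsets, end exclusive.
Span = Tuple[int, int, str]


def mask_pii(text: str, spans: List[Span]) -> str:
    """Replace each detected span with a [LABEL] placeholder."""
    masked = text
    # Work from right to left so earlier offsets stay valid after each replacement.
    for start, end, label in sorted(spans, key=lambda s: s[0], reverse=True):
        masked = masked[:start] + f"[{label}]" + masked[end:]
    return masked


example = "Contact Anna Rossi at anna@example.com."
detected = [(8, 18, "NAME"), (22, 38, "EMAIL")]  # illustrative labels only
print(mask_pii(example, detected))  # Contact [NAME] at [EMAIL].
```

Replacing spans with their label rather than deleting them keeps the sentence readable, which is one way to preserve the usability of the underlying data.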
Create a .env file by copying the .env.example file and renaming the copy to .env, then fill in the required values.
cp .env.example .env
pip install -r requirements.txt
export PYTHONPATH="${PYTHONPATH}:$PWD"
You can run inference on the complete test dataset using the following command:
python inference.py -s ./dataset/test
To perform inference on a small subset of the dataset, use the --subsample flag:
python inference.py -s ./dataset/test --subsample
To run the UI for interacting with the models and viewing results, use Streamlit:
streamlit run ui.py
To start the API for the model, you'll need FastAPI. Run the following command:
fastapi run api.py
This repository supports two main types of experiments:
- Fine-tuning models from the BERT family.
- Fine-tuning models from the GLiNER family.
Both experiment types are located in the experiments/ folder, and each fine-tuning script allows you to pass specific arguments related to model choices, datasets, output directories, and optional alternative dataset columns.
The BERT fine-tuning script enables you to fine-tune models from the BERT family on a specific dataset. Optionally, you can utilize alternative columns that are preprocessed during the data preparation phase.
python experiments/bert_finetune.py --dataset path/to/dataset --model model_name --output_dir /path/to/output [--alternative_columns]
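The flags in the command above could be parsed along the following lines. This is only a sketch of a command-line interface with the same four options; the actual argument handling in `experiments/bert_finetune.py` is not shown in this document and may differ, and the help texts and the `build_parser` name are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the flags shown in the command above (illustrative)."""
    parser = argparse.ArgumentParser(description="Fine-tune a BERT-family model (sketch).")
    parser.add_argument("--dataset", required=True, help="Path to the dataset.")
    parser.add_argument("--model", required=True, help="Model name, e.g. bert-base-cased.")
    parser.add_argument("--output_dir", required=True, help="Directory for the outputs.")
    # Optional switch: present means True, absent means False.
    parser.add_argument("--alternative_columns", action="store_true",
                        help="Use the alternative preprocessed columns.")
    return parser


# Parse an explicit argument list instead of sys.argv so the example runs anywhere.
args = build_parser().parse_args(
    ["--dataset", "path/to/dataset", "--model", "bert-base-cased",
     "--output_dir", "/path/to/output", "--alternative_columns"]
)
print(args.model, args.alternative_columns)
```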
Here is a list of available BERT models that can be used for fine-tuning. Additional models based on the BERT tokenizer may also work with minimal modifications:
- BERT classic: bert-base-uncased, bert-large-uncased, bert-base-cased, bert-large-cased
- DistilBERT: distilbert-base-uncased, distilbert-base-cased
- RoBERTa: roberta-base, roberta-large
- ALBERT: albert-base-v2, albert-large-v2, albert-xlarge-v2, albert-xxlarge-v2
- Electra: google/electra-small-discriminator, google/electra-base-discriminator, google/electra-large-discriminator
- DeBERTa: microsoft/deberta-base, microsoft/deberta-large
The GLiNER models require an additional dataset preparation step before starting the fine-tuning process. The process happens in two stages:
- Step 1: Prepare Dataset for GLiNER Models. Run the GLiNER dataset preparation script to pre-process your dataset:
python experiments/gliner_prepare.py --dataset path/to/dataset
This will create a new JSON-formatted dataset file with the same name in the specified output directory.
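As an illustration of what such a preparation step might produce, the sketch below converts token-level BIO tags into a record with `tokenized_text` and `ner` fields, where each entity is `[start_token, end_token, label]` with an inclusive end index. This layout follows the one commonly used in GLiNER training examples; whether `gliner_prepare.py` reads BIO tags or emits exactly these fields is an assumption, and `bio_to_gliner` and the labels are hypothetical.

```python
import json
from typing import Dict, List


def bio_to_gliner(tokens: List[str], tags: List[str]) -> Dict[str, list]:
    """Convert BIO tags into token-index entity spans (end index inclusive)."""
    ner = []
    start, label = None, None
    for i, tag in enumerate(tags):
        # Close the open entity when the current tag does not continue it.
        if start is not None and tag != f"I-{label}":
            ner.append([start, i - 1, label])
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    # Close an entity that runs to the end of the sentence.
    if start is not None:
        ner.append([start, len(tags) - 1, label])
    return {"tokenized_text": tokens, "ner": ner}


record = bio_to_gliner(
    ["Call", "Anna", "Rossi", "at", "555", "0101"],
    ["O", "B-NAME", "I-NAME", "O", "B-PHONE", "I-PHONE"],  # illustrative labels
)
print(json.dumps(record))
```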
- Step 2: Fine-Tune GLiNER Model
After the dataset preparation, run the GLiNER fine-tuning script:
python experiments/gliner_finetune.py --dataset path/to/prepared/dataset.json --model model_name --output_dir /path/to/output [--alternative_columns]
You can use the following GLiNER models for fine-tuning, though additional compatible models may work similarly:
gliner-community/gliner_xxl-v2.5
gliner-community/gliner_large-v2.5
gliner-community/gliner_medium-v2.5
gliner-community/gliner_small-v2.5
A results folder is available in the repository to store the results of the various experiments and related metrics.
We also provide a solution to the issue in the pii-masking-400k repository: a method that transforms natural language text into a token-tag format that can be used to train a Named Entity Recognition (NER) model with the Hugging Face AutoTrain API.
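A transformation of this kind can be sketched as follows, assuming the PII annotations are available as character spans and that a BIO tagging scheme is the target. This is not the repository's implementation: the function name `to_token_tags`, the whitespace tokenization, and the labels are assumptions made for illustration.

```python
import re
from typing import List, Tuple

# Hypothetical span layout: (start, end, label), character offsets, end exclusive.
Span = Tuple[int, int, str]


def to_token_tags(text: str, spans: List[Span]) -> List[Tuple[str, str]]:
    """Split text on whitespace and give each token a BIO tag."""
    pairs = []
    for match in re.finditer(r"\S+", text):
        tag = "O"
        for start, end, label in spans:
            # A token belongs to a span if the two character ranges overlap.
            if match.start() < end and match.end() > start:
                # The token covering the span's first character gets the B- prefix.
                prefix = "B" if match.start() <= start else "I"
                tag = f"{prefix}-{label}"
                break
        pairs.append((match.group(), tag))
    return pairs


pairs = to_token_tags(
    "Anna Rossi lives in Lugano.",
    [(0, 10, "NAME"), (20, 26, "CITY")],  # illustrative labels
)
for token, tag in pairs:
    print(token, tag)
```

Whitespace splitting keeps trailing punctuation attached to the token (here `Lugano.`), so a real pipeline would more likely align the spans with the tokenizer of the model being trained.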