Neural Wave - Hackathon 2024 - Lugano

This repository contains the code produced by the Molise.ai team in the Neural Wave Hackathon 2024 competition in Lugano.

Challenge

The challenge was proposed by Ai4Privacy, a company that builds global solutions to enhance **privacy protections** in the rapidly evolving world of Artificial Intelligence. The goal is to create a machine learning model capable of detecting and masking PII (Personally Identifiable Information) in text data across several languages and locales. The task requires working with a synthetic dataset to train models that can automatically identify and redact 17 types of PII in natural language texts. The solution should aim for high accuracy while preserving the usability of the underlying data. The final solution could be integrated into various systems to enhance privacy protections across industries, including client support, legal, and general data anonymization tools. Success in this project will contribute to scaling privacy-conscious AI systems without compromising UX or operational performance.
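To make the task concrete, here is a minimal, hypothetical example of the expected behavior; the actual label names and masking format are defined by the Ai4Privacy dataset, so the placeholders below are illustrative only:

# Illustrative only: the real PII labels come from the Ai4Privacy dataset.
source = "Hi, I'm Maria Rossi and my email is maria.rossi@example.com."
masked = "Hi, I'm [FIRSTNAME] [LASTNAME] and my email is [EMAIL]."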

Getting Started

Create a .env file by copying .env.example and filling in the required values:

cp .env.example .env

Install the dependencies:

pip install -r requirements.txt

Set PYTHONPATH if needed:

export PYTHONPATH="${PYTHONPATH}:$PWD"

Inference

Inference on the full dataset

You can run inference on the complete test dataset using the following command:

python inference.py -s ./dataset/test

Inference on a small dataset

To perform inference on a small subset of the dataset, use the --subsample flag:

python inference.py -s ./dataset/test --subsample

Run the UI

To run the UI for interacting with the models and viewing results, use Streamlit:

streamlit run ui.py

Run the API

To start the API for the model, you'll need FastAPI. Run the following command:

fastapi run api.py
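By default, the FastAPI CLI serves the application at http://localhost:8000 and exposes auto-generated interactive docs at http://localhost:8000/docs. The actual routes are defined in api.py; the endpoint in the request below is purely hypothetical:

# Hypothetical route: check api.py (or /docs) for the real endpoint and payload.
curl -X POST http://localhost:8000/mask -H "Content-Type: application/json" -d '{"text": "Hi, I am Maria Rossi"}'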

Experiments

This repository supports two main types of experiments:

  1. Fine-tuning models from the BERT family.
  2. Fine-tuning models from the GLiNER family.

Both experiment types are located in the experiments/ folder, and each fine-tuning script allows you to pass specific arguments related to model choices, datasets, output directories, and optional alternative dataset columns.

BERT Fine-Tuning

The BERT fine-tuning script fine-tunes models from the BERT family on a given dataset. Optionally, you can use the alternative columns that are preprocessed during the data preparation phase.

python experiments/bert_finetune.py --dataset path/to/dataset --model model_name --output_dir /path/to/output [--alternative_columns]
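For example, to fine-tune bert-base-uncased (the dataset and output paths below are placeholders; adjust them to your setup):

python experiments/bert_finetune.py --dataset ./dataset/train --model bert-base-uncased --output_dir ./results/bert-base-uncased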

Available BERT models

Here is a list of available BERT models that can be used for fine-tuning. Additional models based on the BERT tokenizer may also work with minimal modifications:

  • BERT classic
    • bert-base-uncased, bert-large-uncased, bert-base-cased, bert-large-cased
  • DistilBERT
    • distilbert-base-uncased, distilbert-base-cased
  • RoBERTa
    • roberta-base, roberta-large
  • ALBERT
    • albert-base-v2, albert-large-v2, albert-xlarge-v2, albert-xxlarge-v2
  • Electra
    • google/electra-small-discriminator, google/electra-base-discriminator, google/electra-large-discriminator
  • DeBERTa
    • microsoft/deberta-base, microsoft/deberta-large
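All of these checkpoints can be loaded through the Hugging Face transformers auto classes, which is what makes them largely interchangeable here; a minimal sketch, assuming PII detection is framed as token classification and using an illustrative label count:

# Sketch only: the model choice and num_labels depend on your experiment setup.
from transformers import AutoTokenizer, AutoModelForTokenClassification

model_name = "bert-base-uncased"  # any checkpoint from the list above
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name, num_labels=35)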

GLiNER Fine-Tuning

The GLiNER models require an additional dataset preparation step before starting the fine-tuning process. The process happens in two stages:

  1. Step 1: Prepare the dataset for GLiNER models. Run the GLiNER dataset preparation script to preprocess your dataset:
python experiments/gliner_prepare.py --dataset path/to/dataset

This will create a new JSON-formatted dataset file with the same name in the specified output directory.
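For reference, GLiNER fine-tuning data is typically a JSON list of records with pre-tokenized text and token-level span annotations; a sketch of a single record (the exact schema produced by gliner_prepare.py may differ):

[
  {
    "tokenized_text": ["My", "name", "is", "Maria", "Rossi"],
    "ner": [[3, 4, "person"]]
  }
]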

  2. Step 2: Fine-tune the GLiNER model. After the dataset preparation, run the GLiNER fine-tuning script:

python experiments/gliner_finetune.py --dataset path/to/prepared/dataset.json --model model_name --output_dir /path/to/output [--alternative_columns]

Available GLiNER models

You can use the following GLiNER models for fine-tuning, though additional compatible models may work similarly:

  • gliner-community/gliner_xxl-v2.5
  • gliner-community/gliner_large-v2.5
  • gliner-community/gliner_medium-v2.5
  • gliner-community/gliner_small-v2.5
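After fine-tuning, a GLiNER model is used by passing the entity label names at inference time; a minimal sketch with the gliner Python package (the checkpoint and labels below are illustrative; point it at your fine-tuned output directory and the PII label set from the dataset):

# Illustrative checkpoint and labels; substitute your fine-tuned model path.
from gliner import GLiNER

model = GLiNER.from_pretrained("gliner-community/gliner_small-v2.5")
text = "Hi, I'm Maria Rossi and my email is maria.rossi@example.com."
entities = model.predict_entities(text, labels=["person", "email"])
for entity in entities:
    print(entity["text"], "->", entity["label"])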

Results

The results folder in the repository stores the outputs and related metrics of the various experiments.

Other Information

We also provide a solution to the issue in the pii-masking-400k repository: a method to transform the natural language text into a token-tag format that can be used to train a Named Entity Recognition (NER) model with the Hugging Face AutoTrain API.
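As a rough illustration of that token-tag layout (shown here in BIO style; the exact tag names used by our conversion are illustrative):

# Illustrative BIO-style token-tag pairs; the actual tag set may differ.
tokens = ["My", "name", "is", "Maria",  "Rossi",  "."]
tags   = ["O",  "O",    "O",  "B-NAME", "I-NAME", "O"]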