Data Harmonization Benchmark

This repository contains the code and data for the Data Harmonization Benchmark. The benchmark is a collection of datasets used to evaluate the performance of data harmonization methods, including schema matching and value mapping.

Code Structure

Note: the datasets can be downloaded by following the instructions in the next section.

|-- data_harmonization_benchmark
    |-- datasets # Put the downloaded datasets here (see the next section)
        |-- parse_valentine_benchmark.ipynb # parse valentine data format to our format
    |-- matchers # Schema matching methods
        |-- Coma
        |-- ComaInst
        |-- DistributionBased
        |-- ISResMat # X. Du et al. - In Situ Neural Relational Schema Matcher (10.1109/ICDE60146.2024.00018)
        |-- JaccardDistance
        |-- Magneto # Magneto, a method from our team; source code: https://github.com/VIDA-NYU/data-integration-eval
        |-- SimilarityFlooding
        |-- Unicorn # Tu et al. Unicorn: A unified multi-tasking model for supporting matching tasks in data integration
    |-- utils
        |-- mrr.py # Mean reciprocal rank (MRR) metric; a generic sketch of MRR follows this tree
        |-- result_proc.py # Process the results of schema matching methods
    |-- config.py # Configuration file, including source, target, and running configurations
    |-- matching.py # Wrapper for different matchers
    |-- runbenchmark.py # Run benchmark tasks
|-- slurm_run # SLURM scripts for running schema matching methods on a server
    |-- benchmark_batch.sh # Run all schema matching methods
    |-- benchmark_scalabilty.sh # Run scalability benchmarks on various target samples
    |-- setup_penv.sh # Set up the Python environment with conda
    |-- slurm_job_cpu.SBATCH # SLURM job script for CPU
    |-- slurm_job_gpu.SBATCH # SLURM job script for GPU
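For reference, utils/mrr.py evaluates matcher output with mean reciprocal rank (MRR). The sketch below is a minimal, generic illustration of the metric, not the repository's implementation; it assumes each source column comes with a ranked list of candidate target columns and a single ground-truth target.

from typing import Dict, List

def mean_reciprocal_rank(ranked_candidates: Dict[str, List[str]],
                         ground_truth: Dict[str, str]) -> float:
    # For each source column, take 1 / rank of the true target in its ranked
    # candidate list (0 if it is absent), then average over all source columns.
    reciprocal_ranks = []
    for source_col, true_target in ground_truth.items():
        candidates = ranked_candidates.get(source_col, [])
        if true_target in candidates:
            reciprocal_ranks.append(1.0 / (candidates.index(true_target) + 1))
        else:
            reciprocal_ranks.append(0.0)
    return sum(reciprocal_ranks) / len(reciprocal_ranks) if reciprocal_ranks else 0.0

# Correct targets ranked 1st and 3rd -> MRR = (1 + 1/3) / 2 ≈ 0.67
print(mean_reciprocal_rank(
    {"age": ["age_at_diagnosis", "days_to_birth"],
     "gender": ["race", "ethnicity", "gender"]},
    {"age": "age_at_diagnosis", "gender": "gender"},
))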

0. Dataset Accessibility

The datasets used in this benchmark are available for download via the following links:

After downloading the datasets, unzip them under the datasets directory. The directory structure should look like this (a small sanity-check sketch follows the tree):

|-- data_harmonization_benchmark
    |-- datasets
        |-- datasets
            |-- GDC
            |-- OpenData
            |-- TPC-DI
            |-- ...
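As a quick check that the archives were unzipped in the right place, here is a minimal sketch; the GDC, OpenData, and TPC-DI folder names are taken from the tree above, and the root path should be adjusted to your checkout.

from pathlib import Path

# Root of the unzipped datasets; adjust if the repository lives elsewhere.
DATASETS_DIR = Path("data_harmonization_benchmark") / "datasets" / "datasets"

expected = ["GDC", "OpenData", "TPC-DI"]
missing = [name for name in expected if not (DATASETS_DIR / name).is_dir()]
if missing:
    print(f"Missing dataset folders {missing} -- did you unzip into {DATASETS_DIR}?")
else:
    print("Found datasets:", sorted(p.name for p in DATASETS_DIR.iterdir() if p.is_dir()))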

1. Schema Matching

Schema matching is the process of identifying correspondences between attributes from two database schemas. Typically, schema matching methods employ one or more functions to establish a similarity value between pairs of elements from the schemas, referred to as matching candidates. These functions, known as matchers, take two elements as input and estimate a similarity value between 0 and 1, where a higher value indicates greater similarity. Matchers can utilize a variety of strategies to estimate similarities, such as comparing schema element names, assessing their semantic similarity using a thesaurus, analyzing data types and cardinality, or even examining data values when available.
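To make the matcher abstraction concrete, here is a toy sketch (illustrative only, not one of the benchmarked methods) that scores a pair of columns in [0, 1] by combining name similarity with overlap of their data values:

import pandas as pd

def name_similarity(a: str, b: str) -> float:
    # Jaccard similarity over character trigrams of the column names.
    grams = lambda s: {s[i:i + 3] for i in range(len(s) - 2)} if len(s) > 2 else {s}
    ga, gb = grams(a.lower()), grams(b.lower())
    return len(ga & gb) / len(ga | gb) if ga | gb else 0.0

def value_overlap(col_a: pd.Series, col_b: pd.Series) -> float:
    # Jaccard similarity over the sets of observed values (instance-based signal).
    va, vb = set(col_a.dropna().astype(str)), set(col_b.dropna().astype(str))
    return len(va & vb) / len(va | vb) if va | vb else 0.0

def match_score(name_a, col_a, name_b, col_b) -> float:
    # Toy matcher: average of name-based and instance-based similarity.
    return 0.5 * name_similarity(name_a, name_b) + 0.5 * value_overlap(col_a, col_b)

source = pd.DataFrame({"patient_gender": ["male", "female", "female"]})
target = pd.DataFrame({"gender": ["female", "male", "unknown"]})
print(match_score("patient_gender", source["patient_gender"], "gender", target["gender"]))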

1.1. Supported Matchers

We support the following schema matching methods; all of them can be run locally or on a server with SLURM. A minimal usage sketch follows the list.

1.1.1 Coma

https://github.com/delftdata/valentine/tree/master/valentine/algorithms/coma

1.1.2 Coma++

https://github.com/delftdata/valentine/tree/master/valentine/algorithms/coma

1.1.3 Distribution-based

https://github.com/delftdata/valentine/tree/master/valentine/algorithms/distribution_based

1.1.4 Jaccard Distance

https://github.com/delftdata/valentine/tree/master/valentine/algorithms/jaccard_distance

1.1.5 Similarity Flooding

https://github.com/delftdata/valentine/tree/master/valentine/algorithms/similarity_flooding

1.1.6 Unicorn

https://github.com/ruc-datalab/Unicorn

1.1.7 ISResMat

https://github.com/duxyad/ISResMat

1.1.8 Magneto

https://github.com/VIDA-NYU/data-integration-eval
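Several of the matchers above (Coma, Coma++, Distribution-based, Jaccard Distance, Similarity Flooding) come from the Valentine package linked in each subsection. The sketch below is a rough illustration of how such a matcher is typically invoked via Valentine's public API, not the benchmark's own matching.py wrapper; the CSV paths are placeholders, exact signatures may differ across Valentine versions, and Coma additionally requires a Java runtime.

import pandas as pd
from valentine import valentine_match
from valentine.algorithms import Coma

# Placeholder files standing in for one (source, target) table pair from the benchmark.
df_source = pd.read_csv("source.csv")
df_target = pd.read_csv("target.csv")

# Run the Coma matcher; valentine_match returns a dict-like object mapping
# ((table, column), (table, column)) pairs to similarity scores in [0, 1].
matches = valentine_match(df_source, df_target, Coma())

# Print the top-5 candidate correspondences by similarity.
for pair, score in sorted(matches.items(), key=lambda kv: kv[1], reverse=True)[:5]:
    print(pair, score)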