Update the multimodal text data #117
Conversation
zechengz commented on Oct 16, 2023 (edited)
- Added multimodal data stats, performance numbers for some of the datasets, and fixed dataset issues (the split and NAs in the test set); a quick NA check is sketched right after this list.
- Note that the stats currently cover only datasets whose columns are exclusively categorical, numerical, and text; datasets with temporal or multi-categorical columns are not included (see the guard sketch after the stats script below).
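As a minimal sketch of the NA check mentioned above (not part of the PR itself): this assumes `dataset.df`, `dataset.split_col`, and `dataset.target_col` are exposed as in the torch_frame version this PR targets, and that the split column encodes train/val/test as 0/1/2, which is torch_frame's convention at the time of writing.

```python
import os.path as osp

from torch_frame.config.text_embedder import TextEmbedderConfig
from torch_frame.datasets import MultimodalTextBenchmark
from torch_frame.testing.text_embedder import HashTextEmbedder

path = osp.join('data', 'tmp')
dataset = MultimodalTextBenchmark(
    root=path,
    name='wine_reviews',
    text_embedder_cfg=TextEmbedderConfig(
        text_embedder=HashTextEmbedder(10), batch_size=None),
)

df = dataset.df
# Assumption: 2 marks the test split (train=0, val=1, test=2).
test_df = df[df[dataset.split_col] == 2]
assert test_df[dataset.target_col].notna().all(), \
    'test split still contains NaN targets'
```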
Script to compute the stats for later use:

```python
import os.path as osp

import torch_frame
from torch_frame.config.text_embedder import TextEmbedderConfig
from torch_frame.datasets import MultimodalTextBenchmark
from torch_frame.testing.text_embedder import HashTextEmbedder

### For a single dataset
path = osp.join(osp.dirname(osp.realpath(__file__)), '..', 'data', 'tmp')

### For the dataset collection
names = [
    'product_sentiment_machine_hack',
    'jigsaw_unintended_bias100K',
    'news_channel',
    'wine_reviews',
    'fake_job_postings2',
    'google_qa_answer_type_reason_explanation',
    'google_qa_question_type_reason_explanation',
    'bookprice_prediction',
    'jc_penney_products',
    'women_clothing_review',
    'news_popularity2',
]
out_channels = 10
for name in names:
    # HashTextEmbedder is a cheap stand-in for a real text encoder; the
    # stats below do not depend on embedding quality.
    dataset = MultimodalTextBenchmark(
        root=path,
        name=name,
        text_embedder_cfg=TextEmbedderConfig(
            text_embedder=HashTextEmbedder(out_channels), batch_size=None),
    ).materialize()
    tf = dataset.tensor_frame
    # One list-style row per dataset: name, #rows, #cols per stype,
    # #classes, task type, and the ratio of missing values.
    print(f' * - {name}')
    print(f' - {tf.num_rows:,}')
    for stype in [
            torch_frame.numerical, torch_frame.categorical,
            torch_frame.text_embedded
    ]:
        if stype not in tf.col_names_dict:
            num_cols = 0
        else:
            num_cols = len(tf.col_names_dict[stype])
        print(f' - {num_cols:,}')
    num_classes = 1  # Regression tasks are reported with a single target.
    if dataset.task_type.is_classification:
        num_classes = dataset.num_classes
    print(f' - {num_classes:,}')
    print(f' - {dataset.task_type.value}')
    # Fraction of missing cells over the whole raw data frame.
    ratio = dataset.df.isna().sum().sum() / (dataset.df.shape[0] *
                                             dataset.df.shape[1])
    print(f' - {100 * ratio:.1f}%')
```
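To make the exclusion of temporal and multi-categorical datasets explicit, a guard like the following could be dropped into the loop above right after `tf = dataset.tensor_frame`. This is a sketch: it assumes the excluded datasets would surface stypes other than the three listed (e.g. `torch_frame.timestamp` or `torch_frame.multicategorical`); verify the stype names against the installed version.

```python
import torch_frame

# Stypes covered by the stats table; anything else means the dataset
# falls outside the scope described in the PR summary.
ALLOWED_STYPES = {
    torch_frame.numerical,
    torch_frame.categorical,
    torch_frame.text_embedded,
}

def covered_by_stats(tf) -> bool:
    # tf.col_names_dict maps each stype present in the tensor frame to
    # its column names, so its keys tell us which stypes occur.
    return set(tf.col_names_dict.keys()) <= ALLOWED_STYPES
```

Inside the loop this would read: `if not covered_by_stats(tf): continue`.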
Thanks! I am curious about the performance gap between val and test.
```
# Best Val Acc: 0.9334, Best Test Acc: 0.8814
# ========== data_scientist_salary ==========
# Best Val Acc: 0.5355, Best Test Acc: 0.4582
# ======== jigsaw_unintended_bias100K =======
# Best Val Acc: 0.9543, Best Test Acc: 0.9511
```
Do you know why we have such a huge val/test gap? I assume we used the random split.
I don't know. Maybe overfitting? I trained the model for 100 epochs.
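If overfitting is the cause, one mitigation is to report test accuracy at the best-validation epoch with early stopping, rather than training for a fixed 100 epochs. A generic sketch follows; `train_epoch` and `evaluate` are hypothetical stand-ins for the benchmark's training loop, not its actual API.

```python
def fit_with_early_stopping(model, loaders, train_epoch, evaluate,
                            max_epochs=100, patience=10):
    best_val_acc, best_test_acc, epochs_since_best = 0.0, 0.0, 0
    for epoch in range(max_epochs):
        train_epoch(model, loaders['train'])
        val_acc = evaluate(model, loaders['val'])
        if val_acc > best_val_acc:
            # Record test accuracy at the best-validation epoch so the
            # reported val/test pair comes from the same checkpoint.
            best_val_acc = val_acc
            best_test_acc = evaluate(model, loaders['test'])
            epochs_since_best = 0
        else:
            epochs_since_best += 1
        if epochs_since_best >= patience:
            break  # Validation stopped improving; likely overfitting.
    return best_val_acc, best_test_acc
```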