
Update the multimodal text data #117

Merged 7 commits into master from zecheng_update_text_dataset on Oct 16, 2023
Conversation

@zechengz (Member) commented Oct 16, 2023

  • Added multimodal dataset stats, performance numbers for some of the datasets, and fixed the datasets (splits, NAs in the test set).
  • Note that the stats currently cover only datasets whose columns are categorical, numerical, or text (datasets with temporal or multi-categorical columns are not included).
    [screenshot: dataset stats table]

@zechengz zechengz requested a review from weihua916 October 16, 2023 05:54
@zechengz zechengz self-assigned this Oct 16, 2023
@zechengz (Member, Author) commented:

Script to compute the stats for later use:

import os.path as osp

import torch_frame
from torch_frame.config.text_embedder import TextEmbedderConfig
from torch_frame.datasets import MultimodalTextBenchmark
from torch_frame.testing.text_embedder import HashTextEmbedder

# Root directory where the datasets are downloaded and cached
path = osp.join(osp.dirname(osp.realpath(__file__)), '..', 'data', 'tmp')

# Datasets to compute stats for
names = [
    'product_sentiment_machine_hack',
    'jigsaw_unintended_bias100K',
    'news_channel',
    'wine_reviews',
    'fake_job_postings2',
    'google_qa_answer_type_reason_explanation',
    'google_qa_question_type_reason_explanation',
    'bookprice_prediction',
    'jc_penney_products',
    'women_clothing_review',
    'news_popularity2',
]
out_channels = 10  # dimensionality of the hash-based text embeddings
for name in names:
    dataset = MultimodalTextBenchmark(
        root=path,
        name=name,
        text_embedder_cfg=TextEmbedderConfig(
            text_embedder=HashTextEmbedder(out_channels), batch_size=None),
    ).materialize()
    tf = dataset.tensor_frame

    # Print one row in reStructuredText list-table format
    print(f'        * - {name}')
    print(f'          - {tf.num_rows:,}')
    for stype in [
            torch_frame.numerical, torch_frame.categorical,
            torch_frame.text_embedded
    ]:
        if stype not in tf.col_names_dict:
            num_cols = 0
        else:
            num_cols = len(tf.col_names_dict[stype])
        print(f'          - {num_cols:,}')

    num_classes = 1
    if dataset.task_type.is_classification:
        num_classes = dataset.num_classes
    print(f'          - {num_classes:,}')
    print(f'          - {dataset.task_type.value}')
    ratio = dataset.df.isna().sum().sum() / (dataset.df.shape[0] *
                                             dataset.df.shape[1])
    print(f'          - {100*ratio:.1f}%')
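
The NA ratio at the end of the script is just the fraction of NaN cells over all cells in the raw DataFrame. A minimal self-contained sketch of the same computation on a toy frame (the column names here are made up, not from the benchmark):

```python
import numpy as np
import pandas as pd

# Toy DataFrame standing in for dataset.df; columns are hypothetical.
df = pd.DataFrame({
    'text': ['good', None, 'bad', 'ok'],
    'price': [1.0, 2.0, np.nan, 4.0],
})

# Same formula as in the script: total NaN cells / total cells.
ratio = df.isna().sum().sum() / (df.shape[0] * df.shape[1])
print(f'{100 * ratio:.1f}%')  # 2 NaN cells out of 8 -> 25.0%
```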

@weihua916 (Contributor) left a comment:

Thanks! I am curious about the performance gap between val and test.

Comment on lines +26 to +30:
# Best Val Acc: 0.9334, Best Test Acc: 0.8814
# ========== data_scientist_salary ==========
# Best Val Acc: 0.5355, Best Test Acc: 0.4582
# ======== jigsaw_unintended_bias100K =======
# Best Val Acc: 0.9543, Best Test Acc: 0.9511
@weihua916 (Contributor) commented:
Do you know why we have such a huge val/test gap? I assume we used the random split.

@zechengz (Member, Author) replied:

I'm not sure; maybe overfitting? I trained the model for 100 epochs.
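
One common mitigation for this kind of gap (not necessarily what the training script here does) is to report the test metric at the epoch with the best validation metric, rather than at the final epoch of a fixed 100-epoch run. A generic sketch, with made-up per-epoch accuracy values:

```python
# Per-epoch metrics from a hypothetical training run (values are made up).
val_accs = [0.80, 0.85, 0.93, 0.92, 0.91, 0.90]
test_accs = [0.78, 0.84, 0.88, 0.87, 0.86, 0.85]

# Model selection: pick the epoch that maximizes validation accuracy,
# then report the test accuracy at that same epoch.
best_epoch = max(range(len(val_accs)), key=lambda e: val_accs[e])
print(best_epoch, test_accs[best_epoch])  # epoch 2, test acc 0.88
```

Reporting the last-epoch test number instead would understate performance here (0.85 vs 0.88), which is one way a long fixed-length run can look like a large val/test gap.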

@zechengz zechengz merged commit 759ce26 into master Oct 16, 2023
3 checks passed
@zechengz zechengz deleted the zecheng_update_text_dataset branch October 16, 2023 21:53