Update the multimodal text data #117
Conversation
zechengz commented on Oct 16, 2023 (edited)
- Added multimodal data stats, performance numbers for some of the datasets, and fixed dataset issues (the split and NAs in the test set); a quick NA check is sketched right after this list.
- Note that the stats currently cover only datasets whose columns are exclusively categorical, numerical, and text; datasets with temporal or multi-categorical columns are not included (see the guard sketch after the stats script below).
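As a minimal sketch of the NA check mentioned above (not part of the PR itself): this assumes `dataset.df`, `dataset.split_col`, and `dataset.target_col` are exposed as in the torch_frame version this PR targets, and that the split column encodes train/val/test as 0/1/2, which is torch_frame's convention at the time of writing.

```python
import os.path as osp

from torch_frame.config.text_embedder import TextEmbedderConfig
from torch_frame.datasets import MultimodalTextBenchmark
from torch_frame.testing.text_embedder import HashTextEmbedder

path = osp.join('data', 'tmp')
dataset = MultimodalTextBenchmark(
    root=path,
    name='wine_reviews',
    text_embedder_cfg=TextEmbedderConfig(
        text_embedder=HashTextEmbedder(10), batch_size=None),
)

df = dataset.df
# Assumption: 2 marks the test split (train=0, val=1, test=2).
test_df = df[df[dataset.split_col] == 2]
assert test_df[dataset.target_col].notna().all(), \
    'test split still contains NaN targets'
```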
Script to compute the stats for later use:

```python
import os.path as osp

import torch_frame
from torch_frame.config.text_embedder import TextEmbedderConfig
from torch_frame.datasets import MultimodalTextBenchmark
from torch_frame.testing.text_embedder import HashTextEmbedder

### For a single dataset
path = osp.join(osp.dirname(osp.realpath(__file__)), '..', 'data', 'tmp')

### For the dataset collection
names = [
    'product_sentiment_machine_hack',
    'jigsaw_unintended_bias100K',
    'news_channel',
    'wine_reviews',
    'fake_job_postings2',
    'google_qa_answer_type_reason_explanation',
    'google_qa_question_type_reason_explanation',
    'bookprice_prediction',
    'jc_penney_products',
    'women_clothing_review',
    'news_popularity2',
]
out_channels = 10
for name in names:
    # HashTextEmbedder is a cheap stand-in for a real text encoder; the
    # stats below do not depend on embedding quality.
    dataset = MultimodalTextBenchmark(
        root=path,
        name=name,
        text_embedder_cfg=TextEmbedderConfig(
            text_embedder=HashTextEmbedder(out_channels), batch_size=None),
    ).materialize()
    tf = dataset.tensor_frame
    # One list-style row per dataset: name, #rows, #cols per stype,
    # #classes, task type, and the ratio of missing values.
    print(f' * - {name}')
    print(f' - {tf.num_rows:,}')
    for stype in [
            torch_frame.numerical, torch_frame.categorical,
            torch_frame.text_embedded
    ]:
        if stype not in tf.col_names_dict:
            num_cols = 0
        else:
            num_cols = len(tf.col_names_dict[stype])
        print(f' - {num_cols:,}')
    num_classes = 1  # Regression tasks are reported with a single target.
    if dataset.task_type.is_classification:
        num_classes = dataset.num_classes
    print(f' - {num_classes:,}')
    print(f' - {dataset.task_type.value}')
    # Fraction of missing cells over the whole raw data frame.
    ratio = dataset.df.isna().sum().sum() / (dataset.df.shape[0] *
                                             dataset.df.shape[1])
    print(f' - {100 * ratio:.1f}%')
```
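To make the exclusion of temporal and multi-categorical datasets explicit, a guard like the following could be dropped into the loop above right after `tf = dataset.tensor_frame`. This is a sketch: it assumes the excluded datasets would surface stypes other than the three listed (e.g. `torch_frame.timestamp` or `torch_frame.multicategorical`); verify the stype names against the installed version.

```python
import torch_frame

# Stypes covered by the stats table; anything else means the dataset
# falls outside the scope described in the PR summary.
ALLOWED_STYPES = {
    torch_frame.numerical,
    torch_frame.categorical,
    torch_frame.text_embedded,
}

def covered_by_stats(tf) -> bool:
    # tf.col_names_dict maps each stype present in the tensor frame to
    # its column names, so its keys tell us which stypes occur.
    return set(tf.col_names_dict.keys()) <= ALLOWED_STYPES
```

Inside the loop this would read: `if not covered_by_stats(tf): continue`.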
Thanks! I am curious about the performance gap between val and test.
```
# Best Val Acc: 0.9334, Best Test Acc: 0.8814
# ========== data_scientist_salary ==========
# Best Val Acc: 0.5355, Best Test Acc: 0.4582
# ======== jigsaw_unintended_bias100K =======
# Best Val Acc: 0.9543, Best Test Acc: 0.9511
```
Do you know why we have such a huge val/test gap? I assume we used the random split.
I don't know. Maybe overfitting? I trained the model for 100 epochs.
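If overfitting is the cause, one mitigation is to report test accuracy at the best-validation epoch with early stopping, rather than training for a fixed 100 epochs. A generic sketch follows; `train_epoch` and `evaluate` are hypothetical stand-ins for the benchmark's training loop, not its actual API.

```python
def fit_with_early_stopping(model, loaders, train_epoch, evaluate,
                            max_epochs=100, patience=10):
    best_val_acc, best_test_acc, epochs_since_best = 0.0, 0.0, 0
    for epoch in range(max_epochs):
        train_epoch(model, loaders['train'])
        val_acc = evaluate(model, loaders['val'])
        if val_acc > best_val_acc:
            # Record test accuracy at the best-validation epoch so the
            # reported val/test pair comes from the same checkpoint.
            best_val_acc = val_acc
            best_test_acc = evaluate(model, loaders['test'])
            epochs_since_best = 0
        else:
            epochs_since_best += 1
        if epochs_since_best >= patience:
            break  # Validation stopped improving; likely overfitting.
    return best_val_acc, best_test_acc
```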