Commit
* add insight mining
* meta tags aggregator
* naive reverse grouper
* resolve the bugs when running insight mining in multiprocessing mode
* update unittests
* update unittests
* update unittests
* tags specified field
* update readme for analyzer
* doc done
* use more detailed key
* add reference
* move mm tags
* move meta key
* done
* test done
* rm nested set
* Update constant.py minor fix
* rename agg to batch meta
* export in naive reverse grouper

---------

Co-authored-by: null <[email protected]>
Co-authored-by: gece.gc <[email protected]>
Co-authored-by: lielin.hyl <[email protected]>
Co-authored-by: Daoyuan Chen <[email protected]>
1 parent 1fe821f, commit fb98c56
Showing 54 changed files with 1,230 additions and 627 deletions.
@@ -1,26 +1,48 @@
+import json
+import os
+
+from data_juicer.utils.constant import Fields
+from data_juicer.utils.file_utils import create_directory_if_not_exists
+
 from ..base_op import OPERATORS, Grouper, convert_dict_list_to_list_dict
 
 
 @OPERATORS.register_module('naive_reverse_grouper')
 class NaiveReverseGrouper(Grouper):
     """Split batched samples to samples. """
 
-    def __init__(self, *args, **kwargs):
+    def __init__(self, batch_meta_export_path=None, *args, **kwargs):
         """
         Initialization method.
+        :param batch_meta_export_path: the path to export the batch meta.
+            Just drop the batch meta if it is None.
         :param args: extra args
         :param kwargs: extra args
         """
         super().__init__(*args, **kwargs)
+        self.batch_meta_export_path = batch_meta_export_path
 
     def process(self, dataset):
 
         if len(dataset) == 0:
             return dataset
 
         samples = []
+        batch_metas = []
         for sample in dataset:
+            if Fields.batch_meta in sample:
+                batch_metas.append(sample[Fields.batch_meta])
+                sample = {
+                    k: sample[k]
+                    for k in sample if k != Fields.batch_meta
+                }
             samples.extend(convert_dict_list_to_list_dict(sample))
+        if self.batch_meta_export_path is not None:
+            create_directory_if_not_exists(
+                os.path.dirname(self.batch_meta_export_path))
+            with open(self.batch_meta_export_path, 'w') as f:
+                for batch_meta in batch_metas:
+                    f.write(json.dumps(batch_meta, ensure_ascii=False) + '\n')
 
         return samples
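
For reviewers who want to try the new `process` behavior without installing data_juicer, here is a minimal standalone sketch of the same flow: strip the batch meta from each batched sample, flatten the remaining dict of lists into single samples, and optionally write one JSON line per batch meta. Everything named here is an assumption for illustration only: `BATCH_META_KEY` is a placeholder for `Fields.batch_meta` (its real value is not shown in this diff), `dict_of_lists_to_list_of_dicts` is a guess at what `convert_dict_list_to_list_dict` does, and `os.makedirs` stands in for `create_directory_if_not_exists`. The sketch also skips directory creation when the export path has no directory part, which the diff does not do, and leaves out the early return on an empty dataset.

```python
import json
import os
import tempfile

# Placeholder for Fields.batch_meta (assumption: the real constant lives in
# data_juicer.utils.constant and its value is not visible in this diff).
BATCH_META_KEY = "batch_meta"


def dict_of_lists_to_list_of_dicts(batch):
    """Hypothetical stand-in for convert_dict_list_to_list_dict."""
    keys = list(batch.keys())
    return [dict(zip(keys, values)) for values in zip(*(batch[k] for k in keys))]


def reverse_group(dataset, batch_meta_export_path=None):
    """Split batched samples into single samples; optionally export metas."""
    samples, batch_metas = [], []
    for batch in dataset:
        if BATCH_META_KEY in batch:
            # Set the batch-level meta aside so it is not flattened per sample.
            batch_metas.append(batch[BATCH_META_KEY])
            batch = {k: v for k, v in batch.items() if k != BATCH_META_KEY}
        samples.extend(dict_of_lists_to_list_of_dicts(batch))

    if batch_meta_export_path is not None:
        # A bare filename has an empty dirname; only create real directories.
        export_dir = os.path.dirname(batch_meta_export_path)
        if export_dir:
            os.makedirs(export_dir, exist_ok=True)
        with open(batch_meta_export_path, "w", encoding="utf-8") as f:
            for meta in batch_metas:
                f.write(json.dumps(meta, ensure_ascii=False) + "\n")
    return samples


if __name__ == "__main__":
    batched = [
        {"text": ["a", "b"], "id": [1, 2], BATCH_META_KEY: {"tags": ["x"]}},
        {"text": ["c"], "id": [3]},
    ]
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "meta", "batch_meta.jsonl")
        print(reverse_group(batched, path))
        with open(path, encoding="utf-8") as f:
            print(f.read())
```

With the path left as `None`, the batch metas are collected and then dropped, which matches the docstring added in this commit.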