[QUESTION] How to use my own POS model when training a constituency model? #1356
I actually think that's a case of the logging not correctly reflecting the
reality. I added some log lines to the parser which should tell you which
model it loaded for the retagging. You can try it using the dev branch, if
you're comfortable with that. (I will need to make a new release within a
day to fix some other issues, anyway.)
Beyond that, I think there's probably something else going wrong... would
you post the entire log message?
Kudos for going ahead with the Icelandic parser. I knew that dataset
existed, but had not tried to build a model yet.
…On Thu, Feb 29, 2024 at 12:54 PM Ingunn Jóhanna Kristjánsdóttir < ***@***.***> wrote:
I am working on adding a constituency model for Icelandic. I used the
constituency treebank I have for training a POS tagger but how do I use it
when training the constituency model?
The instructions say this: "To change to a specific model (such as if you
build one yourself) use the --retag_model_path command line flag." but when
I try to run this: "python -m stanza.utils.training.run_constituency
is_icepahc --retag_model_path saved_models/pos/is_icepahc_nocharlm_tagger.pt"
it still just uses the default pos tagger
for Icelandic (which I don't want to use). Do I need to use some more
flags, other than --retag_model_path (for example --retag_package?), to
make sure it uses my model?
Here is what I get when I only use the flag --retag_model_path
saved_models/pos/is_icepahc_nocharlm_tagger.pt:
...
retag_method: xpos
retag_model_path: saved_models/pos/is_icepahc_nocharlm_tagger.pt
retag_package: default
retag_pretrain_path: None
retag_xpos: True
...
And:
2024-02-29 20:46:55 INFO: Reading trees from /stanza/constituency/data/icelandic/processed_data/is_icepahc_train.mrg
2024-02-29 20:47:16 INFO: Read 58394 trees for the training set
2024-02-29 20:47:18 INFO: Filtered 512 duplicates from train dataset
2024-02-29 20:47:18 INFO: Eliminated 3 trees with missing structure
2024-02-29 20:47:18 INFO: Reading trees from /stanza/constituency/data/icelandic/processed_data/is_icepahc_dev.mrg
2024-02-29 20:47:19 INFO: Read 7299 trees for the dev set
2024-02-29 20:47:20 INFO: Filtered 24 duplicates from dev dataset
2024-02-29 20:47:20 INFO: Retagging trees using the xpos tags from the default package...
(i.e. not using my model, and then the training fails after retagging because this default pos tagger is not compatible with my data)
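One thing that is cheap to rule out before digging into the logs is a mangled model path, for example a space introduced when a long command is wrapped in an email or terminal. The following is a minimal sketch, not part of Stanza: `build_retag_command` is a hypothetical helper that only assembles the command quoted above (the module name, treebank short name, and `--retag_model_path` flag are taken from this thread) and refuses to continue if the tagger file looks wrong.

```python
import sys
from pathlib import Path


def build_retag_command(treebank, tagger_path):
    """Return the argv list for the training command quoted in this thread.

    Hypothetical helper: it raises ValueError if the path contains
    whitespace (e.g. from a line-wrapped paste) and FileNotFoundError
    if the tagger file does not exist.
    """
    path_text = str(tagger_path)
    if any(ch.isspace() for ch in path_text):
        raise ValueError(f"whitespace in tagger path: {path_text!r}")
    if not Path(path_text).is_file():
        raise FileNotFoundError(path_text)
    return [
        sys.executable, "-m", "stanza.utils.training.run_constituency",
        treebank, "--retag_model_path", path_text,
    ]
```

The returned list could then be passed to `subprocess.run(cmd, check=True)`. This only validates the argument; it says nothing about which tagger the parser actually loads.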
|
It's also possible that the failure you saw was related to the same error found here: #1357 If you could verify if the dev branch fixes your problem, that would be great. If using the dev branch is difficult, posting the stack trace for the error you ran into would also help. I have to make a new release with a fix for issue 1357, and if the existing fix doesn't also address your problem, I can try to fix that as well. |
Were you able to make progress on this with the updated version? |
Hi, sorry for the late answer! I tried using the dev branch and that didn't seem to change much, unfortunately. Here is the entire log message:
It's still using the xpos tags from the default package which are not compatible with the data I am trying to train on. |
That's actually not complaining about a POS tag, but rather a constituent tag. I will update the error to make it clearer. You can either check in the treebank for a tree with such a typo, or you can give it the
I suppose it might be useful to have it report which tree is causing the error... I'm short on time now, but I can probably do that tonight. Are you comfortable using the dev branch? If not, I can put it on testpypi or something. Either way, we should make it easier to diagnose this problem.
One other thing I notice is that the treebank is using |
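For checking the treebank by hand, one rough approach is to count every constituent label and look at the ones that occur only once or twice, since a typo usually shows up as a very rare label. The sketch below is an assumption about how one might do this with plain Python on bracketed `.mrg` text; it does not use Stanza's own tree reader, the function names are hypothetical, and the sample trees in the usage note are made up for illustration.

```python
import re
from collections import Counter

# A constituent label is the text right after "(" when the next
# non-space character is another "(". Preterminals such as "(N word)"
# are skipped because a word, not a bracket, follows their label.
CONSTITUENT_RE = re.compile(r"\(([^\s()]+)\s*(?=\()")


def count_constituents(treebank_text):
    """Count how often each constituent label occurs in bracketed trees."""
    return Counter(CONSTITUENT_RE.findall(treebank_text))


def rare_constituents(treebank_text, max_count=1):
    """Return labels seen at most max_count times, sorted alphabetically."""
    counts = count_constituents(treebank_text)
    return sorted(label for label, n in counts.items() if n <= max_count)
```

Calling `rare_constituents(open(path, encoding="utf-8").read())` on a training file gives a short list to inspect. A label such as `IP-MTA` appearing once next to thousands of `IP-MAT` would be a likely typo.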
…nstituents barfs. Apparently that is happening with the Icelandic treebank... #1356
Alright, I added what is hopefully a very thorough message for when the constituent checker fails. Please let us know what information it gives you. If using the |
Great, thanks! The thorough message for when the constituents checker fails helped me figure out what the problem was and fix it! And yes, it would probably be a good idea to add * to the list of functional tags to cut off. |
Glad to hear that helped! I added
One thing to try for improved accuracy: the overall model will have much higher accuracy with a transformer. I know IceBERT and ScandiBert are a couple of possible options you can download from HF using the
You can also finetune those transformers specifically for the constituency task. That uses quite a lot of disk space and GPU memory, of course. The best settings I found so far are in the flag
Last random comment for now: I improved the TOP_DOWN dynamic oracle quite a bit in the last month or so, and although I haven't made it the default yet, I find that it's actually more accurate than the default IN_ORDER transition scheme. You can try that with |
Hi again and thank you for the help! I have now successfully trained a constituency model for Icelandic (it gets 90.63 on the test set, which is currently the best score for an Icelandic constituency parser), would it be possible to add it to Stanza? |
Yes, that would be great! In general, if there were any code changes to
convert the original Icelandic annotations to the format usable by the
parser, the first step would be a PR which adds that script
to stanza/utils/datasets/constituency
After that, I know there are a couple different transformers on HF which
include Icelandic. If you've tried with those models, and have some notes
on which ones give which scores, that would also be helpful. We haven't
integrated those yet into the IS
pipelines: stanza/resources/default_packages.py
Did you mention having your own POS tagging? That might be relevant, if
you think the POS tagger is more useful than one built from the IS
Universal Dependencies treebanks
|
Great, I will start working on that in the next few days! I did not have my own POS tagging in the end and I used IceBERT from HF. I got the best results with the flags
Sentences in the IcePaHC treebank are divided into matrix clauses, and the previous parsing pipelines for Icelandic text that were trained on IcePaHC (https://office.clarin.eu/v/CE-2020-1738-CLARIN2020_ConferenceProceedings.pdf pages 48-51 and https://office.clarin.eu/v/CE-2019-1512_CLARIN2019_ConferenceProceedings.pdf pages 138-141) do matrix clause boundary detection before parsing. They use this tool for the matrix clause boundary detection: https://github.com/antonkarl/iceParsingPipeline/tree/4d8e65958e7ebc9d28ab463ba27ffcbb895e6f1c/tools/splitter. I was wondering if it would be possible to add this to the Stanza pipeline for Icelandic text, so that users don't have to run the splitter on their input text themselves before parsing it with Stanza? |
It's probably easier to have used the UD POS tags! I have also found that
the TOP_DOWN model is working better with the upgraded dynamic oracle.
Perhaps I should revisit the IN_ORDER oracle to see if it can be improved.
matrix clauses
Hmm, that's effectively a constraint on the parse structure built by the
model, right? I haven't implemented that in the constituency parser at
all. It wouldn't happen any time soon, either. Another possibility would
be to use the parser itself to extract those clauses, or to try doing that
and see how accurate it is compared to gold annotations or the splitting
tool.
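To compare clause boundaries extracted by the parser with gold annotations or with the splitting tool, one simple option is to score boundary positions with precision, recall, and F1. This is only a sketch of that comparison under assumptions not stated in the thread: boundaries are represented as token offsets where a new matrix clause starts, and `boundary_scores` is a hypothetical helper, not something in Stanza or the splitter.

```python
def boundary_scores(gold, predicted):
    """Return (precision, recall, f1) for two collections of boundary offsets.

    Each boundary is assumed to be the token index at which a matrix
    clause starts; duplicates are ignored.
    """
    gold_set, pred_set = set(gold), set(predicted)
    true_pos = len(gold_set & pred_set)
    precision = true_pos / len(pred_set) if pred_set else 0.0
    recall = true_pos / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

For example, with gold boundaries `[5, 12, 20]` and predicted boundaries `[5, 12, 18, 20]`, all three gold boundaries are found but one prediction is spurious, so recall is 1.0 and precision is 0.75.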
|