Weird tokenization in Spanish #1440

Open
LazerJesus opened this issue Dec 18, 2024 · 8 comments

@LazerJesus

Describe the bug
In "Yo como carne.", como is tagged as upos SCONJ, but it should be VERB.

I am running this pipeline:

{
  "text": "Yo como carne.",
  "processors": "tokenize,mwt,pos,lemma,depparse",   
  "language": "es"
}

and get this output:

[
        {
          "index": 1,
          "token": "Yo",
          "lemma": "yo",
          "xpos": "pp1csn00",
          "upos": "PRON",
          "feats": "Case=Nom|Number=Sing|Person=1|PronType=Prs",
          "start_char": 0,
          "end_char": 2
        },
        {
          "index": 2,
          "token": "como",
          "lemma": "como",
          "xpos": "cs",
          "upos": "SCONJ",
          "feats": null,
          "start_char": 3,
          "end_char": 7
        },
        {
          "index": 3,
          "token": "carne",
          "lemma": "carne",
          "xpos": "ncfs000",
          "upos": "NOUN",
          "feats": "Gender=Fem|Number=Sing",
          "start_char": 8,
          "end_char": 13
        },
        {
          "index": 4,
          "token": ".",
          "lemma": ".",
          "xpos": "fp",
          "upos": "PUNCT",
          "feats": "PunctType=Peri",
          "start_char": 13,
          "end_char": 14
        }
      ]

The JSON format comes from this repo (mine), which I use to run Stanza:
https://github.com/vivalence/dockerized-stanza-nlp
It's really just a shallow wrapper.
The interesting lines are probably these:
https://github.com/vivalence/dockerized-stanza-nlp/blob/main/script.py#L115-L116

@LazerJesus LazerJesus added the bug label Dec 18, 2024
@AngledLuffa
Collaborator

This is an interesting / weird one. There are 3500 instances of "como" as an ADJ, SCONJ, or CCONJ in the training data, and only 3 as a first person verb. So, ultimately I don't really see any way of fixing it, since the data is so heavily biased and there aren't many first person forms of any verb in the training data to begin with. We can keep it in mind as something that needs fixing, though.
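
A count like the one above can be reproduced on any CoNLL-U treebank file. The following is only a minimal sketch: `count_upos` is a hypothetical helper (not part of Stanza), and the sample sentences are made up for illustration rather than taken from GSD or AnCora.

```python
from collections import Counter


def count_upos(conllu_text, form):
    """Count the UPOS tags assigned to `form` (case-insensitive) in CoNLL-U text."""
    counts = Counter()
    for line in conllu_text.splitlines():
        # Skip blank sentence separators and comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        # A token line has 10 columns: ID, FORM, LEMMA, UPOS, ...
        if len(cols) != 10:
            continue
        # Skip multi-word token ranges (e.g. "1-2") and empty nodes (e.g. "1.1").
        if not cols[0].isdigit():
            continue
        if cols[1].lower() == form.lower():
            counts[cols[3]] += 1
    return counts


# Tiny made-up sample, not taken from any treebank.
sample = "\n".join([
    "# text = Yo como carne",
    "1\tYo\tyo\tPRON\t_\t_\t2\tnsubj\t_\t_",
    "2\tcomo\tcomer\tVERB\t_\t_\t0\troot\t_\t_",
    "3\tcarne\tcarne\tNOUN\t_\t_\t2\tobj\t_\t_",
    "",
    "# text = Canta como ella",
    "1\tCanta\tcantar\tVERB\t_\t_\t0\troot\t_\t_",
    "2\tcomo\tcomo\tSCONJ\t_\t_\t3\tmark\t_\t_",
    "3\tella\tél\tPRON\t_\t_\t1\tadvcl\t_\t_",
])

print(count_upos(sample, "como"))
```

Running the same function over the real training files would show how lopsided the distribution is for any given form.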

@AngledLuffa
Collaborator

Maybe we could try adding 10 different sentences with it as a verb and see if that helps...

@LazerJesus
Author

LazerJesus commented Dec 19, 2024

If I can support this with data, let me know. I am running through A LOT of LLM-generated sentences and could capture them for you guys.
I also know the verb lemma, tense, person, and other annotations I am prompting the LLM with.

My flow goes:
identify the verb annotation I want to practice
-> prompt the LLM to generate a sentence
-> throw the sentence into Stanza to get annotated tokens for every word.

So I can capture structured data for certain annotations with a simple if(annotation.matches(AngledLuffasCriteria)) appendToFile({sentence,promptedAnnotation})
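
The capture step could look roughly like the sketch below. It assumes tokens shaped like the wrapper's JSON output shown earlier in this thread; `matches` and `capture` are hypothetical names, and the criterion used here (the prompted form was tagged with a different upos than expected) is only one possible choice.

```python
import io
import json


def matches(tokens, form, expected_upos):
    """Return True if `form` occurs in `tokens` with a upos other than `expected_upos`."""
    return any(
        tok["token"].lower() == form.lower() and tok["upos"] != expected_upos
        for tok in tokens
    )


def capture(out, sentence, tokens, form, expected_upos):
    """Append one JSON line to `out` when the tagger disagrees with the prompt."""
    if not matches(tokens, form, expected_upos):
        return False
    record = {"sentence": sentence, "form": form, "expected_upos": expected_upos}
    out.write(json.dumps(record, ensure_ascii=False) + "\n")
    return True


# Tokens trimmed from the output reported in the issue description.
tokens = [
    {"token": "Yo", "upos": "PRON"},
    {"token": "como", "upos": "SCONJ"},
    {"token": "carne", "upos": "NOUN"},
    {"token": ".", "upos": "PUNCT"},
]

buffer = io.StringIO()  # stands in for a file opened in append mode
captured = capture(buffer, "Yo como carne.", tokens, "como", "VERB")
print(captured)
print(buffer.getvalue(), end="")
```

Writing one JSON object per line keeps the capture file easy to append to and easy to filter later.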

@AngledLuffa
Collaborator

I see that the model gets quiero correct, or so it appears to me

# text = Yo quiero carne
# sent_id = 0
1       Yo      yo      PRON    pp1csn00        Case=Nom|Number=Sing|Person=1|PronType=Prs      2       nsubj   _       start_char=0|end_char=2|ner=O
2       quiero  querer  VERB    vmip1s0 Mood=Ind|Number=Sing|Person=1|Tense=Pres|VerbForm=Fin   0       root    _       start_char=3|end_char=9|ner=O
3       carne   carne   NOUN    ncfs000 Gender=Fem|Number=Sing  2       obj     _       start_char=10|end_char=15|ner=O|SpaceAfter=No

Maybe what we could do is this:

  • take 10 sentences with como used as a verb
  • replace como with quiero in those sentences and annotate them with Stanza
  • swap como back into those annotations and include them in the training data

Is that something already available via your LLM work? If not, I could probably find something similar. We can start with 10 - I don't know if 10 will be enough, but probably it won't outweigh any of the other typical word senses for como, which as mentioned total about 3500 in the GSD and Ancora treebanks.
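
The swap-back step could be sketched as below, assuming the quiero sentences have already been annotated and saved as tab-separated CoNLL-U text. `swap_verb` is a hypothetical helper, not part of Stanza. It only handles the lowercase form, and it drops `start_char`/`end_char` from the MISC column because those offsets go stale once the word changes length.

```python
import re


def swap_verb(conllu_text, old_form="quiero", old_lemma="querer",
              new_form="como", new_lemma="comer"):
    """Swap one verb form for another in CoNLL-U text, keeping tags and heads."""
    pattern = re.compile(r"\b" + re.escape(old_form) + r"\b")
    out = []
    for line in conllu_text.splitlines():
        if line.startswith("# text = "):
            # Update the raw sentence text, matching whole words only.
            out.append(pattern.sub(new_form, line))
            continue
        cols = line.split("\t")
        if len(cols) == 10:
            if cols[1] == old_form and cols[2] == old_lemma:
                cols[1], cols[2] = new_form, new_lemma
            # Character offsets are stale after the swap, so drop them.
            kept = [m for m in cols[9].split("|")
                    if not m.startswith(("start_char=", "end_char="))]
            cols[9] = "|".join(kept) or "_"
            line = "\t".join(cols)
        out.append(line)
    return "\n".join(out)


# The "Yo quiero carne" annotation from this thread, with tab separators.
annotated = "\n".join([
    "# text = Yo quiero carne",
    "# sent_id = 0",
    "1\tYo\tyo\tPRON\tpp1csn00\tCase=Nom|Number=Sing|Person=1|PronType=Prs"
    "\t2\tnsubj\t_\tstart_char=0|end_char=2|ner=O",
    "2\tquiero\tquerer\tVERB\tvmip1s0\tMood=Ind|Number=Sing|Person=1|Tense=Pres|VerbForm=Fin"
    "\t0\troot\t_\tstart_char=3|end_char=9|ner=O",
    "3\tcarne\tcarne\tNOUN\tncfs000\tGender=Fem|Number=Sing"
    "\t2\tobj\t_\tstart_char=10|end_char=15|ner=O|SpaceAfter=No",
])

swapped = swap_verb(annotated)
print(swapped)
```

This works because quiero and como share the same morphology here (first person singular present indicative), so the tags and features carry over unchanged. Each swapped sentence would still need a manual check before going into the training data.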

@LazerJesus
Author

Can you show me what the model input format looks like?
I can give you a lot of data from my system. I run through maybe 500+ sentences a day.

@AngledLuffa
Collaborator

Raw sentences could work: I could send back the processed output and you could tell me whether it makes sense. Alternatively, you could output the sentences in CoNLL format with

print("{:C}".format(doc))

@AngledLuffa
Collaborator

The basic format would then look like the conll output I posted above, but it's not necessary to make it by hand. I think it'd be pretty straightforward to use the doc formatting on sentences with quiero in place of como, then switch out the verbs. Probably shortish sentences so that it isn't too onerous to check and that errors elsewhere in the sentence are less likely. Hopefully not all of the format "I eat ---", though!
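
For data that only exists in the wrapper's JSON shape, a rough conversion to the 10-column CoNLL-U layout might look like the sketch below. `tokens_to_conllu` is a hypothetical helper. The JSON posted in this issue carries no head or dependency relation, so those columns are left as `_` here; the `"{:C}".format(doc)` route is preferable whenever the Stanza document itself is available.

```python
def tokens_to_conllu(text, tokens):
    """Render wrapper-style token dicts as CoNLL-U lines (HEAD and DEPREL left blank)."""
    lines = ["# text = " + text]
    for tok in tokens:
        cols = [
            str(tok["index"]),    # ID
            tok["token"],         # FORM
            tok["lemma"],         # LEMMA
            tok["upos"],          # UPOS
            tok["xpos"] or "_",   # XPOS
            tok["feats"] or "_",  # FEATS (null in the JSON becomes "_")
            "_",                  # HEAD: not present in the JSON posted in this issue
            "_",                  # DEPREL: not present in the JSON posted in this issue
            "_",                  # DEPS
            "_",                  # MISC
        ]
        lines.append("\t".join(cols))
    # CoNLL-U sentences end with a blank line.
    return "\n".join(lines) + "\n\n"


# First two tokens from the issue description, offsets omitted.
tokens = [
    {"index": 1, "token": "Yo", "lemma": "yo", "xpos": "pp1csn00", "upos": "PRON",
     "feats": "Case=Nom|Number=Sing|Person=1|PronType=Prs"},
    {"index": 2, "token": "como", "lemma": "como", "xpos": "cs", "upos": "SCONJ",
     "feats": None},
]

conllu = tokens_to_conllu("Yo como", tokens)
print(conllu, end="")
```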

@AngledLuffa
Collaborator

I just pushed out a new version, but this particular error still occurs. It's possible to update the models for the new version, though, if you have a few of the relevant sentences to add to the training data.
