Can Compromise detect words separated by punctuation? #954

ryanj11 · 2022-08-31T14:25:59Z

I have some text where the user has accidentally left out a space between punctuation and words. Is Compromise able to detect words separated like this?

For example:
Hello.Welcome.
I have apples,bananas and pears.

Example code:

import nlp from "compromise";

const nlpProcessedText = nlp("Hello.Welcome.");

The result of nlpProcessedText.document is:

[
  {
    text: 'Hello.Welcome',
    pre: '',
    post: '.',
    tags: Set(2) { 'Noun', 'Singular' },
    normal: 'hello.welcome',
    index: [ 0, 0 ],
    id: 'hello.welcome|00100000F',
    confidence: 0.1,
    chunk: 'Noun'
  }
]

If I then use the match() function to search for the word "Welcome", it finds nothing, because Compromise only recognises "Hello.Welcome" as a single word.

Is this a bug, or is there a way to use Compromise to extract the separate words from this example text?

spencermountain · 2022-08-31T15:01:18Z

Maintainer

hey ryan, yeah, I agree this seems pretty brittle. The assumption in compromise is that the text is correctly formed, but maybe we should allow for sloppier input when the meaning is clear, as it is here.

One thing you can do is swap out the sentence tokenizer for a more tolerant one, something like this:
https://runkit.com/spencermountain/630f7602fb14620009a788d9

the sentence tokenizer is here if you wanted to fork it.

not sure what the best approach is.
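
A different workaround, separate from the tokenizer swap in the RunKit link above, is to repair the spacing before the text ever reaches compromise. The sketch below is only an illustration: addMissingSpaces is a hypothetical helper, not part of compromise, and it assumes that punctuation sitting directly between two letters always marks a missing space.

```javascript
// Hypothetical pre-processing helper (not part of compromise).
// Inserts a space after . , ; : ! ? when the punctuation sits directly
// between two ASCII letters, e.g. "Hello.Welcome" or "apples,bananas".
function addMissingSpaces(text) {
  // ([A-Za-z])    a letter before the punctuation (so "3.14" is left alone)
  // ([.,;:!?])    the punctuation mark itself
  // (?=[A-Za-z])  a letter immediately after it, checked but not consumed
  return text.replace(/([A-Za-z])([.,;:!?])(?=[A-Za-z])/g, "$1$2 ");
}

console.log(addMissingSpaces("Hello.Welcome."));
// "Hello. Welcome."
console.log(addMissingSpaces("I have apples,bananas and pears."));
// "I have apples, bananas and pears."

// Assumed usage with compromise (not run here):
//   nlp(addMissingSpaces(rawText)).match("welcome")
```

With the spacing repaired, "Welcome" should reach compromise as its own word, so match() has something to find. The trade-off is that this rule is blunt: it will also split abbreviations such as "e.g.", domain names such as "example.com", and initialisms such as "U.S.A", so it only suits input where those are unlikely or are handled separately.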

1 reply

ryanj11 Aug 31, 2022
Author

Thanks @spencermountain I will try out your sample code.
