[QUESTION]Semantic Sentence Tokenization #1383

TheAIMagics · 2024-04-18T14:09:09Z

I'm working with a corpus that primarily consists of longer documents. I'm seeking recommendations for the most effective approach to semantically tokenize them.

Examples:

Original Text: "I like the ambiance but the food was terrible."
Desired Output: ["I like the ambiance"] ["but the food was terrible."]

Original Text: "I don't know. I like the restaurant but not the food."
Desired Output: ["I don't know."] ["I like the restaurant"] ["but not the food."]

Any suggestions or advice on how to achieve this would be greatly appreciated!


AngledLuffa · 2024-04-18T14:33:20Z

We don't have anything that explicitly does what you're looking for. You could run a constituency parser on each sentence and split at the top-level divisions of the tree; that might do a good job.
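To make the suggestion concrete, here is a minimal sketch of splitting at the top-level divisions of a constituency tree. Several things in it are assumptions rather than anything stated in this thread: the `Node` class is a hypothetical stand-in for a parser's tree type (it only assumes a `label` and a list of `children`, with words as childless leaves), the example tree is written by hand and is not real parser output, and the rule "start a new segment at a conjunction or clause once a clause has been seen" is one possible heuristic. With a real parser you would pass the parsed tree of each sentence in place of the hand-built one.

```python
from dataclasses import dataclass, field
from typing import List

# Penn Treebank style labels; assumed to match the parser's tag set.
CLAUSE_LABELS = {"S", "SBAR", "SINV", "SQ", "SBARQ"}
CONNECTOR_LABELS = {"CC"}  # coordinating conjunctions such as "but"


@dataclass
class Node:
    """Hypothetical stand-in for a constituency tree node."""
    label: str
    children: List["Node"] = field(default_factory=list)


def pre(tag: str, word: str) -> Node:
    """Build a preterminal (tag) node with a single word leaf."""
    return Node(tag, [Node(word)])


def leaf_words(node: Node) -> List[str]:
    """Collect the words under a node, left to right."""
    if not node.children:
        return [node.label]
    words: List[str] = []
    for child in node.children:
        words.extend(leaf_words(child))
    return words


def top_level_segments(root: Node) -> List[str]:
    """Split a sentence at the first branching level of its tree."""
    node = root
    # Skip unary wrappers such as ROOT -> S.
    while len(node.children) == 1 and node.children[0].children:
        node = node.children[0]

    segments: List[str] = []
    current: List[str] = []
    has_clause = False
    for child in node.children:
        starts_new = child.label in CLAUSE_LABELS or child.label in CONNECTOR_LABELS
        # Close the running segment once it already holds a full clause.
        if starts_new and has_clause:
            segments.append(" ".join(current))
            current, has_clause = [], False
        current.extend(leaf_words(child))
        if child.label in CLAUSE_LABELS:
            has_clause = True
    if current:
        segments.append(" ".join(current))
    return segments


# Hand-written tree for "I like the ambiance but the food was terrible."
example = Node("ROOT", [Node("S", [
    Node("S", [
        Node("NP", [pre("PRP", "I")]),
        Node("VP", [pre("VBP", "like"),
                    Node("NP", [pre("DT", "the"), pre("NN", "ambiance")])]),
    ]),
    pre("CC", "but"),
    Node("S", [
        Node("NP", [pre("DT", "the"), pre("NN", "food")]),
        Node("VP", [pre("VBD", "was"), Node("ADJP", [pre("JJ", "terrible")])]),
    ]),
    pre(".", "."),
])])

segments = top_level_segments(example)
print(segments)
```

Tokens are joined with single spaces, so punctuation comes out detached ("terrible ."). The second example in the question would likely need more than this: "but not the food" is not a full clause, so a split there depends on how the parser brackets the coordination, and the text would have to be split into sentences first.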

stale · 2025-01-21T23:22:27Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

TheAIMagics added the question label Apr 18, 2024

stale bot added the stale label Jan 21, 2025
