[QUESTION] Semantic Sentence Tokenization #1383

Open
TheAIMagics opened this issue Apr 18, 2024 · 2 comments
Comments

@TheAIMagics

I'm working with a corpus that primarily consists of longer documents. I'm seeking recommendations for the most effective approach to semantically tokenize them.

Examples:

Original Text: "I like the ambiance but the food was terrible."
Desired Output: ["I like the ambiance"] ["but the food was terrible."]

Original Text: "I don't know. I like the restaurant but not the food."
Desired Output: ["I don't know."] ["I like the restaurant"] ["but not the food."]

Any suggestions or advice on how to achieve this would be greatly appreciated!

@AngledLuffa
Collaborator

We don't have anything that explicitly does what you're looking for. You could run a constituency parse on the sentence and split it at the top-level divisions, though, and that might do a good job.
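
To make the "top-level divisions" idea concrete, here is a minimal sketch. It assumes the parse is available as a Penn Treebank style bracketed string; the tree below is written by hand for illustration and is not actual parser output. The helper names (`parse_tree`, `words`, `join`, `top_level_chunks`) are hypothetical, and the grouping rule (a `CC` starts a new chunk, punctuation joins the previous one) is just one heuristic that reproduces the first example.

```python
# Sketch: split a sentence at the top-level divisions of its constituency parse.
# Assumption: the tree is a Penn Treebank style bracketed string.

PUNCT = {".", ",", ":", "``", "''"}


def parse_tree(text):
    """Parse a bracketed string into nested (label, children) tuples.
    Leaves (the words) are plain strings."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        pos += 1  # skip "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1  # skip ")"
        return (label, children)

    return read()


def words(node):
    """Return the words under a node, left to right."""
    if isinstance(node, str):
        return [node]
    return [w for child in node[1] for w in words(child)]


def join(tokens):
    """Join words with spaces, attaching punctuation to the previous word."""
    out = ""
    for tok in tokens:
        out += tok if (not out or tok in PUNCT) else " " + tok
    return out


def top_level_chunks(tree):
    """Split at the first branching node, but only if it is a coordination."""
    _, children = tree
    # Descend through unary nodes such as ROOT -> S.
    while len(children) == 1 and not isinstance(children[0], str):
        _, children = children[0]
    # Without a top-level conjunction, keep the sentence whole.
    if not any(not isinstance(c, str) and c[0] == "CC" for c in children):
        return [join(words(tree))]
    chunks, after_cc = [], False
    for child in children:
        label = None if isinstance(child, str) else child[0]
        if label in PUNCT and chunks:
            chunks[-1].extend(words(child))  # punctuation joins previous chunk
        elif after_cc:
            chunks[-1].extend(words(child))  # clause joins its conjunction
            after_cc = False
        else:
            chunks.append(words(child))
            after_cc = label == "CC"
    return [join(c) for c in chunks]


# Hand-written tree for "I like the ambiance but the food was terrible."
TREE = """(ROOT (S
  (S (NP (PRP I)) (VP (VBP like) (NP (DT the) (NN ambiance))))
  (CC but)
  (S (NP (DT the) (NN food)) (VP (VBD was) (ADJP (JJ terrible))))
  (. .)))"""

print(top_level_chunks(parse_tree(TREE)))
# ['I like the ambiance', 'but the food was terrible.']
```

This only looks at the first branching node, so a coordination nested deeper in the tree (for example inside a verb phrase) is left unsplit. Handling that would mean recursing into the children, and how well any of this works depends on the trees the parser actually produces.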

stale bot commented Jan 21, 2025

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Jan 21, 2025