Languages other than English #3

Open
ghost opened this issue Aug 16, 2016 · 7 comments

Comments

@ghost

ghost commented Aug 16, 2016

Hi there, thanks for the cool project. The bottom of the README says the support for other languages is a thing to look forward to -- could you elaborate on it a bit? Any particular plans? Let me know if you're looking for contributors that could handle different languages.

@jjangsangy
Owner

jjangsangy commented Aug 16, 2016

Sure. So ExplainToMe currently does 3 things.

  1. Grabs the HTML from a webpage.
  2. Extracts the main article components.
  3. Generates a semantic graph and computes its centroid.

Currently, steps 1 and 2 do not care about language; they mostly deal with HTML and webpage metadata. Step 3 does care about language, but mostly for stopwords and language cleaning. If the user specifies the language of the article in advance (sometimes we can discover it from the HTML), we can provide the stopwords, and most Romance languages should generate a decent summary.

We'll most likely start by supporting those languages.

I am interested in doing non-Romance languages too, but we'll see how far we get.
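
To make the role of stopwords in step 3 concrete, here is a minimal, standard-library sketch of a TextRank-style sentence ranker. It is not the project's code: the tiny `STOPWORDS` table, the function names, and the word-overlap similarity are assumptions chosen for illustration. The only language-dependent part is the stopword lookup.

```python
import math
import re

# Hypothetical, deliberately tiny stopword lists; real lists are much longer.
STOPWORDS = {
    "english": {"the", "a", "is", "of", "and", "in", "to", "it"},
    "spanish": {"el", "la", "los", "las", "de", "y", "en", "es", "un", "una"},
}


def content_words(sentence, language):
    """Lowercase a sentence and drop the stopwords of the given language."""
    words = re.findall(r"\w+", sentence.lower())
    return {w for w in words if w not in STOPWORDS[language]}


def similarity(a, b):
    """Shared-word count, normalised by the log of each sentence's length."""
    if len(a) < 2 or len(b) < 2:
        return 0.0  # avoids log(1) + log(1) == 0 in the denominator
    return len(a & b) / (math.log(len(a)) + math.log(len(b)))


def rank_sentences(sentences, language, damping=0.85, iterations=50):
    """Score sentences by running PageRank-style iteration on a similarity graph."""
    bags = [content_words(s, language) for s in sentences]
    n = len(bags)
    if n == 0:
        return []
    # Weighted, undirected graph: one node per sentence, no self-loops.
    weights = [
        [similarity(bags[i], bags[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    out_sums = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) / n
            + damping * sum(
                scores[j] * weights[j][i] / out_sums[j]
                for j in range(n)
                if out_sums[j] > 0
            )
            for i in range(n)
        ]
    return scores
```

With the wrong stopword list, words such as "la" or "de" would count as shared content and inflate the similarity between unrelated sentences, which is why the language has to be known before this step.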

@ghost
Author

ghost commented Aug 16, 2016

Cool. I take it you only use sumy as the summarisation platform? It seems to support Czech, French, German, Portuguese, Slovak, and Spanish out-of-the-box (the stop words for these languages are included in the package).
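
As a rough illustration of how those languages could be selected, the sketch below maps a language tag such as `pt-BR` to a lowercase language name. The table covers only the languages listed above plus English; the name `resolve_language`, the English fallback, and the assumption that the summariser wants lowercase English language names are all hypothetical and should be checked against sumy itself.

```python
# ISO 639-1 codes for the languages mentioned in the thread, plus English.
# The lowercase names are an assumption about what the summariser expects.
SUPPORTED = {
    "en": "english",
    "cs": "czech",
    "fr": "french",
    "de": "german",
    "pt": "portuguese",
    "sk": "slovak",
    "es": "spanish",
}


def resolve_language(tag, default="english"):
    """Map a tag such as 'pt-BR' or 'fr_FR' to a language name, else the default."""
    # Keep only the primary subtag: 'pt-BR' -> 'pt', 'fr_FR' -> 'fr'.
    primary = tag.strip().lower().replace("_", "-").split("-")[0]
    return SUPPORTED.get(primary, default)
```

Falling back to English silently is a design choice made here for brevity; raising an error for unsupported languages would make a bad summary easier to diagnose.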


@jjangsangy
Owner

Correct. Sumy provides the right framework for building a document summarizer, and it ships with the most popular techniques already implemented.

My main concern about adding more languages is that I can't really attest to their accuracy in an intuitive way. My experience with cross-language NLP is that techniques vary in effectiveness based on latent cultural features.

@gioferreira

I'd love to help with Portuguese (Brazilian Portuguese).
I've been looking for something like this in Portuguese for ages.

@jjangsangy
Owner

jjangsangy commented May 31, 2017

Awesome. Where I would start looking is textrank.py. There is a function called run_summarizer that takes a keyword argument, language. Currently there is no function for detecting the language, so you'll have to write one based on metadata, an HTML meta tag, or some library that detects the language.
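
For the HTML-based option, a minimal standard-library sketch might look like the following. It is not part of the project: `detect_language` and `_LangSniffer` are hypothetical names, the set of meta keys it checks is a guess at common conventions, and it returns only the primary language subtag (for example `pt` for `pt-BR`), which would still need to be translated into whatever value run_summarizer expects.

```python
from html.parser import HTMLParser


class _LangSniffer(HTMLParser):
    """Collects language hints from the <html> tag and from <meta> tags."""

    def __init__(self):
        super().__init__()
        self.html_lang = None
        self.meta_lang = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "html" and attrs.get("lang"):
            self.html_lang = attrs["lang"]
        elif tag == "meta":
            key = (
                attrs.get("http-equiv")
                or attrs.get("name")
                or attrs.get("property")
                or ""
            ).lower()
            # Assumed set of commonly used keys; extend as needed.
            if key in {"content-language", "language", "og:locale"}:
                if attrs.get("content") and self.meta_lang is None:
                    self.meta_lang = attrs["content"]


def detect_language(html, default="en"):
    """Return the primary language subtag found in the page, else the default."""
    sniffer = _LangSniffer()
    sniffer.feed(html)
    tag = sniffer.html_lang or sniffer.meta_lang or default
    # 'pt-BR' and 'pt_BR' both reduce to 'pt'.
    return tag.strip().lower().replace("_", "-").split("-")[0]
```

Pages frequently omit or mislabel these hints, so a statistical language-detection library may still be needed as a fallback for pages where neither hint is present.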

@jjangsangy
Owner

Heads up: I'm making some changes that will be pushed upstream this week or next. They shouldn't affect any code in textrank.py or the original API.

The changes do, however, move a lot of files around. Mostly I've split the application into a Flask server that only displays the webpage and a summarization backend that runs asynchronously on AWS Lambda. I've been running the public Heroku server for the demo, but it's getting costly to maintain, even if it's not that much every month.
