Firecrawl extraction doesn't apply "only_main_content" parameter when set to true in v0.15.2 #13019

Open
5 tasks done
vica-admin opened this issue Jan 24, 2025 · 3 comments
Labels
🐞 bug Something isn't working

Comments

@vica-admin

Self Checks

  • This is only for bug report, if you would like to ask a question, please head to Discussions.
  • I have searched for existing issues, including closed ones.
  • I confirm that I am using English to submit this report (I have read and agree to the Language Policy).
  • [FOR CHINESE USERS] Please be sure to submit the issue in English, otherwise it will be closed. Thank you! :)
  • Please do not modify this template :) and fill in all the required fields.

Dify version

0.15.2

Cloud or Self Hosted

Self Hosted (Docker)

Steps to reproduce

  • Create a dataset
  • Select Firecrawl as extractor
  • Check the option "Extract only main content (no headers, navs, footers, etc.)"
  • Run the extraction on any URL

✔️ Expected Behavior

When the option "Extract only main content (no headers, navs, footers, etc.)" is checked, only the main content should be extracted, as was the case before.

❌ Actual Behavior

All of the content at the URL is extracted, including headers and footers, which negatively affects the embedding process. Testing the same option in the Firecrawl app works as expected.

dosubot bot commented Jan 24, 2025

Hello @vica-admin! I'm here to help you with bug reports, questions, and becoming a contributor. Let me know how I can assist you while we wait for a human maintainer.

The issue you're experiencing with the "only_main_content" parameter in the Firecrawl extractor not functioning as expected in version 0.15.2 could be due to a bug or a recent change in the codebase. The parameter is intended to extract only the main content of a webpage, excluding headers, navigation bars, footers, and other non-essential elements [1].

To address this issue, you might want to:

  1. Verify that the parameter is correctly set in your configuration and that there are no overrides or misconfigurations in your setup.
  2. Check for any recent changes or commits in the repository that might have affected this functionality, as there might be a bug introduced in the latest version.
  3. Review any related documentation or comments in the code that might provide insights into the expected behavior of this parameter.

If the issue persists, it might be helpful to look into the specific implementation details or reach out to the maintainers for further assistance.

@dosubot dosubot bot added the 🐞 bug Something isn't working label Jan 24, 2025
@crazywoola
Member

@ftonato Hello could you take a look at this issue?

@ftonato
Contributor

ftonato commented Jan 24, 2025

> @ftonato Hello could you take a look at this issue?

Hello, which method is being used?

According to our documentation:

  1. When using /scrape, by default, onlyMainContent is now True.
  2. In the /crawl method, the onlyMainContent parameter resides inside the scrapeOptions parameter.

It would be helpful if you could share the code you're using so we can test it as well.
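
For reference, the two placements described in the list above can be sketched as request bodies. This is only an illustration under stated assumptions: the key names onlyMainContent and scrapeOptions come from this comment, while the `url` field and the helper names `build_scrape_payload` and `build_crawl_payload` are hypothetical and are not taken from Dify's code or confirmed by this thread. No request is sent.

```python
from typing import Any


def build_scrape_payload(url: str, only_main_content: bool = True) -> dict[str, Any]:
    """Body for a /scrape call: onlyMainContent sits at the top level.

    Hypothetical helper; the "url" key is an assumption.
    """
    return {"url": url, "onlyMainContent": only_main_content}


def build_crawl_payload(url: str, only_main_content: bool = True) -> dict[str, Any]:
    """Body for a /crawl call: onlyMainContent is nested under scrapeOptions.

    Hypothetical helper; the "url" key is an assumption.
    """
    return {"url": url, "scrapeOptions": {"onlyMainContent": only_main_content}}


if __name__ == "__main__":
    # Print both shapes side by side to compare where the flag lives.
    print(build_scrape_payload("https://example.com"))
    print(build_crawl_payload("https://example.com"))
```

If the extractor builds a /crawl body with the flag at the top level instead of under scrapeOptions, the flag would not be where the list above says it resides. Whether that is what happens in 0.15.2 is not established in this thread.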
