Firecrawl extraction doesn't apply "only_main_content" parameter when set to true in v0.15.2 #13019

vica-admin · 2025-01-24T09:31:55Z

Self Checks

This is only for bug report, if you would like to ask a question, please head to Discussions.
I have searched for existing issues, including closed ones.
I confirm that I am using English to submit this report (I have read and agree to the Language Policy).
[FOR CHINESE USERS] Please be sure to submit the issue in English, otherwise it will be closed. Thank you! :)
Please do not modify this template :) and fill in all the required fields.

Dify version

0.15.2

Cloud or Self Hosted

Self Hosted (Docker)

Steps to reproduce

Create a dataset
Select Firecrawl as extractor
Check the option "Extract only main content (no headers, navs, footers, etc.)"
Run the extraction on any URL

✔️ Expected Behavior

When the option "Extract only main content (no headers, navs, footers, etc.)" is checked, only the main content should be extracted, as it was before.

❌ Actual Behavior

All of the URL's content is extracted, including headers and footers, which negatively impacts the embedding process. Testing the same option in the Firecrawl app works as expected.

dosubot · 2025-01-24T09:33:29Z

Hello @vica-admin! I'm here to help you with bug reports, questions, and becoming a contributor. Let me know how I can assist you while we wait for a human maintainer.

The issue you're experiencing with the "only_main_content" parameter in the Firecrawl extractor not functioning as expected in version 0.15.2 could be due to a bug or a recent change in the codebase. The parameter is intended to extract only the main content of a webpage, excluding headers, navigation bars, footers, and other non-essential elements ^[1].

To address this issue, you might want to:

Verify that the parameter is correctly set in your configuration and that there are no overrides or misconfigurations in your setup.
Check for any recent changes or commits in the repository that might have affected this functionality, as there might be a bug introduced in the latest version.
Review any related documentation or comments in the code that might provide insights into the expected behavior of this parameter.

If the issue persists, it might be helpful to look into the specific implementation details or reach out to the maintainers for further assistance.

To continue talking to Dosu, mention @dosu.


crazywoola · 2025-01-24T10:45:01Z

@ftonato Hello could you take a look at this issue?

ftonato · 2025-01-24T18:56:34Z

> @ftonato Hello could you take a look at this issue?

Hello, which method is being used?

According to our documentation:

When using /scrape, by default, onlyMainContent is now True.
In the /crawl method, the onlyMainContent parameter resides inside the scrapeOptions parameter.

It would be helpful if you could share the code you're using so we can test it as well.
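
To make the difference between the two methods concrete, here is a minimal sketch of the two request bodies. It is not Dify's actual integration code. The function names, the `formats` and `limit` fields, and the snake_case `only_main_content` argument are assumptions for illustration. Only the placement of `onlyMainContent` (top level for /scrape, inside `scrapeOptions` for /crawl) comes from the documentation quoted above.

```python
from typing import Any


def build_scrape_payload(url: str, only_main_content: bool = True) -> dict[str, Any]:
    """Hypothetical helper: body for a /scrape request.

    onlyMainContent sits at the top level of the body.
    """
    return {
        "url": url,
        "formats": ["markdown"],  # assumed output format
        "onlyMainContent": only_main_content,
    }


def build_crawl_payload(
    url: str, only_main_content: bool = True, limit: int = 10
) -> dict[str, Any]:
    """Hypothetical helper: body for a /crawl request.

    onlyMainContent must be nested inside scrapeOptions; a top-level
    key would not be where the documentation says /crawl expects it.
    """
    return {
        "url": url,
        "limit": limit,  # assumed page cap for the crawl
        "scrapeOptions": {
            "formats": ["markdown"],  # assumed output format
            "onlyMainContent": only_main_content,
        },
    }
```

If the integration sends the flag at the top level of a /crawl body, or keeps the snake_case name, that mismatch would be one thing to check when comparing against the payload actually sent.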

dosubot bot added the "🐞 bug" (Something isn't working) label · Jan 24, 2025
