Add Web Discovery content scraper, payload generator and privacy guard #24970

DJAndries · 2024-08-01T23:08:36Z

Resolves brave/brave-browser#39439

Submitter Checklist:

I confirm that no security/privacy review is needed and no other type of reviews are needed, or that I have requested them
There is a ticket for my issue
Used Github auto-closing keywords in the PR description above
Wrote a good PR/commit description
Squashed any review feedback or "fixup" commits before merge, so that history is a record of what happened in the repo, not your PR
Added appropriate labels (QA/Yes or QA/No; release-notes/include or release-notes/exclude; OS/...) to the associated issue
Checked the PR locally:
- npm run test -- brave_browser_tests, npm run test -- brave_unit_tests
- npm run presubmit, npm run gn_check, npm run tslint
Ran git rebase master (if needed)

Reviewer Checklist:

A security review is not needed, or a link to one is included in the PR description
New files have MPL-2.0 license header
Adequate test coverage exists to prevent regressions
Major classes, functions and non-trivial code blocks are well-commented
Changes in component dependencies are properly reflected in gn
Code follows the style guide
Test plan is specified in PR before merging

After-merge Checklist:

The associated issue milestone is set to the smallest version that the changes have landed on
All relevant documentation has been updated, for instance:

Test Plan:

browser/brave_tab_helpers.cc

fmarier

script/brave_license_helper.py changes 👍

components/web_discovery/browser/BUILD.gn

bsclifton

String reviewers ++

components/web_discovery/common/BUILD.gn

components/web_discovery/browser/hash_detection.cc

components/web_discovery/browser/regex_util.cc

iefremov · 2024-10-23T21:48:42Z

components/web_discovery/browser/regex_util.h

+
+// Lazily creates and caches pre-compiled regexes, mainly used for
+// privacy risk assessment of page URLs/contents.
+class RegexUtil {


IMO this one deserves a better name... I was having a hard time figuring out why we are passing the util all around :D
also its meaning is not really about regex, but rather about doing lookups for sensitive info?
I'd also move the following methods away from this class

void RemovePunctuation(std::string& str);
void TransformToAlphanumeric(std::string& str);

also, if this operates only on static global data, it's not shameful to make the whole thing a singleton and not pass it everywhere

removed the methods, but kept the name since it simply caches regex patterns
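
For readers following the naming and singleton discussion, here is a minimal sketch of the pattern being debated: a regex that is compiled lazily on first use, cached, and reached through a singleton accessor instead of being passed around. It is written in standard C++ with std::regex so that it stands alone; the PR's class may use a different regex library. LazyRegexCache, HasLongNumber, compile_count, RegexCacheExample and the pattern itself are hypothetical names invented for illustration, not taken from the PR.

```cpp
#include <cassert>
#include <optional>
#include <regex>
#include <string>

// Hypothetical stand-in for the reviewed class. Names and the pattern are
// illustrative assumptions, not the PR's actual code.
class LazyRegexCache {
 public:
  // Function-local static: constructed on first call. The construction itself
  // is thread-safe since C++11; the lazy compile below is not synchronized.
  static LazyRegexCache& GetInstance() {
    static LazyRegexCache instance;
    return instance;
  }

  LazyRegexCache(const LazyRegexCache&) = delete;
  LazyRegexCache& operator=(const LazyRegexCache&) = delete;

  // Returns true if `str` contains a run of eight or more digits.
  // The regex is compiled on the first call and reused afterwards.
  bool HasLongNumber(const std::string& str) {
    if (!long_number_regex_) {
      long_number_regex_.emplace("[0-9]{8,}");
      ++compile_count_;
    }
    return std::regex_search(str, *long_number_regex_);
  }

  // Number of times the pattern has been compiled (for demonstration only).
  int compile_count() const { return compile_count_; }

 private:
  LazyRegexCache() = default;

  std::optional<std::regex> long_number_regex_;
  int compile_count_ = 0;
};

// Usage: two lookups, one compilation.
void RegexCacheExample() {
  LazyRegexCache& cache = LazyRegexCache::GetInstance();
  assert(cache.HasLongNumber("order 1234567890"));
  assert(!cache.HasLongNumber("order 1234"));
  assert(cache.compile_count() == 1);
}
```

The trade-off raised in the thread still applies to this sketch: a singleton removes the need to thread the object through call sites, while an explicitly passed instance is easier to replace in tests.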

components/web_discovery/browser/web_discovery_service.cc

components/web_discovery/browser/hash_detection.cc

components/web_discovery/browser/regex_util.h

components/web_discovery/renderer/blink_document_extractor.cc

iefremov · 2025-01-10T17:18:38Z

components/web_discovery/renderer/blink_document_extractor.cc

+    std::string_view root_selector,
+    const std::vector<mojom::SelectAttributeRequestPtr>& requests,
+    const blink::WebVector<blink::WebElement>& elements,
+    std::vector<mojom::AttributeResultPtr>& results) {


imo it would be cleaner to make this function return the vector and append every returned vector to the final container in QueryElementAttributes

wouldn't that result in unnecessary copying?

well, that's also a fair point
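
As a point of reference for the copying question above, the following is a minimal sketch of the return-and-append alternative, assuming the result type behaves like a move-only smart pointer (std::unique_ptr stands in for mojom::AttributeResultPtr here). AttributeResult, ExtractFromElements, AppendResults and AppendExample are hypothetical names, and the element type is simplified to std::string. Under that assumption, appending moves pointers and never copies the pointed-to results, although the intermediate vector still costs an extra allocation compared with writing into an output parameter.

```cpp
#include <cassert>
#include <iterator>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-in for a move-only result pointer type.
struct AttributeResult {
  std::string value;
};
using AttributeResultPtr = std::unique_ptr<AttributeResult>;

// Builds and returns one batch of results. The local vector is moved (or
// elided) on return, so its elements are not copied.
std::vector<AttributeResultPtr> ExtractFromElements(
    const std::vector<std::string>& elements) {
  std::vector<AttributeResultPtr> results;
  results.reserve(elements.size());
  for (const std::string& element : elements) {
    results.push_back(
        std::make_unique<AttributeResult>(AttributeResult{element}));
  }
  return results;
}

// Appends a batch to the final container by moving each pointer.
// `batch` is taken by value so callers can pass a temporary directly.
void AppendResults(std::vector<AttributeResultPtr>& all,
                   std::vector<AttributeResultPtr> batch) {
  all.insert(all.end(), std::make_move_iterator(batch.begin()),
             std::make_move_iterator(batch.end()));
}

// Usage: two batches appended to one container.
void AppendExample() {
  std::vector<AttributeResultPtr> all;
  AppendResults(all, ExtractFromElements({"a", "b"}));
  AppendResults(all, ExtractFromElements({"c"}));
  assert(all.size() == 3);
  assert(all[2]->value == "c");
}
```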

components/web_discovery/browser/regex_util.h

components/web_discovery/browser/privacy_guard.cc

components/web_discovery/browser/hash_detection.cc

components/web_discovery/browser/regex_util.h

components/web_discovery/browser/web_discovery_service.h

components/web_discovery/browser/web_discovery_service.cc

components/web_discovery/browser/content_scraper.cc

iefremov · 2025-01-16T17:08:46Z

components/web_discovery/browser/payload_generator.cc

+    bool is_query_action,
+    const PayloadRule& rule,
+    const PatternsURLDetails* matching_url_details,
+    const std::vector<base::Value::Dict>& attribute_values) {


how large can this one be? this processing is probably a candidate for moving to the task runner

The reported payloads are pretty concise. IMO, I don't think it warrants a task runner

iefremov · 2025-01-16T17:23:29Z

components/web_discovery/browser/web_discovery_service.cc

+  // We use a WeakPtr within the ContentScraper for callbacks from the renderer,
+  // so using Unretained is fine here.
+  CHECK(content_scraper_);
+  content_scraper_->ScrapePage(


basically content_scraper looks like a good candidate to be used via base::SequenceBound, both of its public functions look like they should be async

yes, it could be, though I don't think ScrapePage warrants a task runner, especially considering it already dispatches the compute-heavy operations to the renderer process. ParseAndScrapePage already dispatches the HTML parsing/extraction to a task runner.

In addition, the class uses the non-thread-safe ServerConfigLoader to get scraping rules and other attributes, so the caller would be responsible for retrieving the necessary config info ahead of time, making the class slightly less convenient to use.

The status quo allows us to choose whether a separate task is needed, depending on the chosen operation.

DJAndries requested review from a team and bridiver as code owners August 1, 2024 23:08

github-actions bot assigned DJAndries Aug 1, 2024

github-actions bot added the CI/run-audit-deps Check for known npm/cargo vulnerabilities (audit_deps) label Aug 1, 2024

DJAndries mentioned this pull request Aug 1, 2024

Native re-implementation of Web Discovery (part 1) #24421

Closed

24 tasks

bridiver reviewed Aug 6, 2024


browser/brave_tab_helpers.cc (outdated)

DJAndries force-pushed the wdp-native-cred-mgr-srv-config branch from 4f11211 to 40d9461 Compare August 8, 2024 06:48

DJAndries requested review from deeppandya, fmarier and a team as code owners August 8, 2024 06:48

fmarier approved these changes Aug 8, 2024


DJAndries force-pushed the wdp-native-cred-mgr-srv-config branch 2 times, most recently from 0e17059 to a888d67 Compare August 22, 2024 04:10

DJAndries force-pushed the wdp-native-extraction-payload-gen branch 2 times, most recently from 4213a66 to 07bdd3d Compare August 23, 2024 23:40

DJAndries force-pushed the wdp-native-cred-mgr-srv-config branch from f147fc9 to 97f3219 Compare September 17, 2024 04:22

DJAndries requested a review from a team as a code owner September 17, 2024 04:22

DJAndries force-pushed the wdp-native-extraction-payload-gen branch from 07bdd3d to 3e0c02c Compare September 17, 2024 04:24

iefremov reviewed Oct 15, 2024


components/web_discovery/browser/BUILD.gn (outdated)

bsclifton approved these changes Oct 18, 2024


iefremov reviewed Oct 23, 2024


components/web_discovery/common/BUILD.gn (outdated)

iefremov reviewed Oct 23, 2024


components/web_discovery/browser/hash_detection.cc (outdated)

iefremov reviewed Oct 23, 2024


components/web_discovery/browser/hash_detection.cc (outdated)

iefremov reviewed Oct 23, 2024


components/web_discovery/browser/regex_util.cc (outdated)

iefremov reviewed Oct 23, 2024


components/web_discovery/browser/web_discovery_service.cc