Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Web Discovery content scraper, payload generator and privacy guard #24970

Open
wants to merge 8 commits into
base: wdp-native-cred-mgr-srv-config
Choose a base branch
from

Conversation

DJAndries
Copy link
Collaborator

Resolves brave/brave-browser#39439

Submitter Checklist:

  • I confirm that no security/privacy review is needed and no other type of reviews are needed, or that I have requested them
  • There is a ticket for my issue
  • Used Github auto-closing keywords in the PR description above
  • Wrote a good PR/commit description
  • Squashed any review feedback or "fixup" commits before merge, so that history is a record of what happened in the repo, not your PR
  • Added appropriate labels (QA/Yes or QA/No; release-notes/include or release-notes/exclude; OS/...) to the associated issue
  • Checked the PR locally:
    • npm run test -- brave_browser_tests, npm run test -- brave_unit_tests wiki
    • npm run presubmit wiki, npm run gn_check, npm run tslint
  • Ran git rebase master (if needed)

Reviewer Checklist:

  • A security review is not needed, or a link to one is included in the PR description
  • New files have MPL-2.0 license header
  • Adequate test coverage exists to prevent regressions
  • Major classes, functions and non-trivial code blocks are well-commented
  • Changes in component dependencies are properly reflected in gn
  • Code follows the style guide
  • Test plan is specified in PR before merging

After-merge Checklist:

Test Plan:

@DJAndries DJAndries requested review from a team and bridiver as code owners August 1, 2024 23:08
@github-actions github-actions bot added the CI/run-audit-deps Check for known npm/cargo vulnerabilities (audit_deps) label Aug 1, 2024
@DJAndries DJAndries force-pushed the wdp-native-cred-mgr-srv-config branch from 4f11211 to 40d9461 Compare August 8, 2024 06:48
@DJAndries DJAndries requested review from deeppandya, fmarier and a team as code owners August 8, 2024 06:48
Copy link
Member

@fmarier fmarier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

script/brave_license_helper.py changes 👍

@DJAndries DJAndries force-pushed the wdp-native-cred-mgr-srv-config branch 2 times, most recently from 0e17059 to a888d67 Compare August 22, 2024 04:10
@DJAndries DJAndries force-pushed the wdp-native-extraction-payload-gen branch 2 times, most recently from 4213a66 to 07bdd3d Compare August 23, 2024 23:40
@DJAndries DJAndries force-pushed the wdp-native-cred-mgr-srv-config branch from f147fc9 to 97f3219 Compare September 17, 2024 04:22
@DJAndries DJAndries requested a review from a team as a code owner September 17, 2024 04:22
@DJAndries DJAndries force-pushed the wdp-native-extraction-payload-gen branch from 07bdd3d to 3e0c02c Compare September 17, 2024 04:24
Copy link
Member

@bsclifton bsclifton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

String reviewers ++


// Lazily creates and caches pre-compiled regexes, mainly used for
// privacy risk assessment of page URLs/contents.
class RegexUtil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO this one deserves a better name... I was having hard time to figure out why we are passing the util all around :D
also it's meaning is not really about regex, but rather about doing lookups for sensitive info?
I'd also move the following methods away from this class

  void RemovePunctuation(std::string& str);
  void TransformToAlphanumeric(std::string& str);

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also if this operates with only static global data, it's not shameful to make the whole thing a singletone and don't pass it everywhere

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed the methods, but kept the name since it simply caches regex patterns

std::string_view root_selector,
const std::vector<mojom::SelectAttributeRequestPtr>& requests,
const blink::WebVector<blink::WebElement>& elements,
std::vector<mojom::AttributeResultPtr>& results) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

imo it would be cleaner to make this function return the vector and append every returned vector to the final container in QueryElementAttributes

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wouldn't that result in unnecessary copying?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

well that's also fair point

@DJAndries DJAndries force-pushed the wdp-native-cred-mgr-srv-config branch from ae789a2 to 92b2432 Compare January 10, 2025 20:39
@DJAndries DJAndries force-pushed the wdp-native-extraction-payload-gen branch 2 times, most recently from 77f1ed6 to c7089e1 Compare January 10, 2025 22:52
@DJAndries DJAndries force-pushed the wdp-native-cred-mgr-srv-config branch from 92b2432 to 859c035 Compare January 11, 2025 00:14
@DJAndries DJAndries force-pushed the wdp-native-extraction-payload-gen branch from c7089e1 to 03e489d Compare January 11, 2025 00:15
@DJAndries DJAndries force-pushed the wdp-native-extraction-payload-gen branch from b378cc1 to a598546 Compare January 15, 2025 00:13
bool is_query_action,
const PayloadRule& rule,
const PatternsURLDetails* matching_url_details,
const std::vector<base::Value::Dict>& attribute_values) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how large this one can be? this processing is probably a candidate for moving to the task runner

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reported payloads are pretty concise. IMO, I don't think it warrants a task runner

// We use a WeakPtr within the ContentScraper for callbacks from the renderer,
// so using Unretained is fine here.
CHECK(content_scraper_);
content_scraper_->ScrapePage(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

basically content_scraper looks like a good candidate to be used via base::SequenceBound, both it's public functions look like they should be async

Copy link
Collaborator Author

@DJAndries DJAndries Jan 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, it could be, though I don't think ScrapePage warrants a task runner, especially considering it already dispatches the compute-heavy operations to the renderer process. ParseAndScrapePage already dispatches the HTML parsing/extraction to a task runner.

In addition, the class uses the non-thread-safe ServerConfigLoader to get scraping rules and other attributes, so the caller would be responsible for retrieving the necessary config info ahead of time, making the class slightly less convenient to use.

The status quo allows us to choose whether a separate task is needed, depending on the chosen operation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CI/run-audit-deps Check for known npm/cargo vulnerabilities (audit_deps)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants