Add Web Discovery content scraper, payload generator and privacy guard #24970
base: wdp-native-cred-mgr-srv-config
script/brave_license_helper.py changes 👍
String reviewers ++
// Lazily creates and caches pre-compiled regexes, mainly used for
// privacy risk assessment of page URLs/contents.
class RegexUtil {
IMO this one deserves a better name... I was having a hard time figuring out why we are passing the util all around :D
also its meaning is not really about regex, but rather about doing lookups for sensitive info?
I'd also move the following methods out of this class
void RemovePunctuation(std::string& str);
void TransformToAlphanumeric(std::string& str);
also if this operates only on static global data, it's not shameful to make the whole thing a singleton and not pass it everywhere
removed the methods, but kept the name since it simply caches regex patterns
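For reference, a minimal sketch of what the singleton suggestion above could look like. This is not the code in this PR: `RegexCache`, `GetInstance`, `PartialMatch` and `compiled_count` are hypothetical names, and `std::regex` stands in for whatever regex library the real class uses.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <regex>
#include <string>

// Hypothetical stand-in for the class under review. It compiles each pattern
// on first use and keeps the compiled regex for later calls.
class RegexCache {
 public:
  // Function-local static: constructed on first call. The initialization is
  // thread-safe since C++11, but the cache itself is not synchronized, so
  // this sketch assumes single-sequence use.
  static RegexCache& GetInstance() {
    static RegexCache instance;
    return instance;
  }

  RegexCache(const RegexCache&) = delete;
  RegexCache& operator=(const RegexCache&) = delete;

  // Returns true if `pattern` matches anywhere in `text`.
  bool PartialMatch(const std::string& text, const std::string& pattern) {
    assert(!pattern.empty());
    auto it = cache_.find(pattern);
    if (it == cache_.end()) {
      // First time this pattern is seen: compile it once and cache it.
      it = cache_.emplace(pattern, std::regex(pattern)).first;
    }
    return std::regex_search(text, it->second);
  }

  // Number of distinct patterns compiled so far.
  std::size_t compiled_count() const { return cache_.size(); }

 private:
  RegexCache() = default;
  std::map<std::string, std::regex> cache_;
};
```

With something like this, callers would write `RegexCache::GetInstance().PartialMatch(text, pattern)` instead of receiving the object as a parameter. The trade-off is the usual one for singletons: less plumbing, but harder to substitute in tests.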
std::string_view root_selector,
const std::vector<mojom::SelectAttributeRequestPtr>& requests,
const blink::WebVector<blink::WebElement>& elements,
std::vector<mojom::AttributeResultPtr>& results) {
imo it would be cleaner to make this function return the vector and append every returned vector to the final container in QueryElementAttributes
wouldn't that result in unnecessary copying?
well that's also a fair point
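On the copying question, a minimal sketch of the return-by-value variant, assuming the result elements are move-only pointers. `AttributeResult`, `AttributeResultPtr`, `CollectResults` and `AppendResults` are hypothetical stand-ins here, with `std::unique_ptr` in place of the mojom pointer type. Under that assumption the elements are moved rather than deep-copied, though each call still allocates a temporary vector that the out-parameter version avoids.

```cpp
#include <iterator>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-in for a mojom result struct and its move-only pointer.
struct AttributeResult {
  std::string value;
};
using AttributeResultPtr = std::unique_ptr<AttributeResult>;

// Hypothetical helper that returns its results by value instead of filling
// an out-parameter. The local vector is moved (or elided) on return.
std::vector<AttributeResultPtr> CollectResults(
    const std::vector<std::string>& values) {
  std::vector<AttributeResultPtr> results;
  results.reserve(values.size());
  for (const auto& v : values) {
    results.push_back(std::make_unique<AttributeResult>(AttributeResult{v}));
  }
  return results;
}

// Appends `batch` to `all` by moving each pointer; no AttributeResult is
// copied, only pointer ownership is transferred.
void AppendResults(std::vector<AttributeResultPtr>& all,
                   std::vector<AttributeResultPtr> batch) {
  all.insert(all.end(), std::make_move_iterator(batch.begin()),
             std::make_move_iterator(batch.end()));
}
```

So the cost of the cleaner signature is one short-lived vector per call, not a copy of the results themselves. Whether that matters depends on how many root selectors are processed per page.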
bool is_query_action,
const PayloadRule& rule,
const PatternsURLDetails* matching_url_details,
const std::vector<base::Value::Dict>& attribute_values) {
how large can this one be? this processing is probably a candidate for moving to a task runner
The reported payloads are pretty concise. IMO, I don't think it warrants a task runner
// We use a WeakPtr within the ContentScraper for callbacks from the renderer,
// so using Unretained is fine here.
CHECK(content_scraper_);
content_scraper_->ScrapePage(
basically content_scraper looks like a good candidate to be used via base::SequenceBound, both its public functions look like they should be async
yes, it could be, though I don't think ScrapePage warrants a task runner, especially considering it already dispatches the compute-heavy operations to the renderer process. ParseAndScrapePage already dispatches the HTML parsing/extraction to a task runner.
In addition, the class uses the non-thread-safe ServerConfigLoader to get scraping rules and other attributes, so the caller would be responsible for retrieving the necessary config info ahead of time, making the class slightly less convenient to use.
The status quo allows us to choose whether a separate task is needed, depending on the chosen operation.
Resolves brave/brave-browser#39439