[new model] Identify spam comments #3994

jpangas · 2024-01-24T22:30:03Z

Closes #3377
This PR introduces the spamcomment model, which is trained to identify spam comments. Later, it will be used to tag potential spam comments.
The model is a work in progress: some features still need to be added, refined, and experimented with.

Train on Taskcluster: spamcomment

bugbug/models/spamcomment.py

bugbug/comment_features.py

bugbug/models/spamcomment.py

suhaibmujahid

Thank you, @jpangas! Please see my comments.

bugbug/models/spamcomment.py

suhaibmujahid · 2024-03-01T13:13:55Z

bugbug/comment_features.py

+class UnknownLinkAtBeginning(CommentFeature):
+    name = "Unknown Link found at Beginning of the Comment"
+
+    def __init__(self, domains_to_ignore=set()):
+        self.known_domains = domains_to_ignore
+
+    def __call__(self, comment, **kwargs):
+        urls = extract_urls_and_domains(comment["text"], self.known_domains)["urls"]
+
+        words = comment["text"].split()
+        return words[0] in urls if words else False
+
+
+class UnknownLinkAtEnd(CommentFeature):
+    name = "Unknown Link found at End of the Comment"
+
+    def __init__(self, domains_to_ignore=set()):
+        self.known_domains = domains_to_ignore
+
+    def __call__(self, comment, **kwargs):
+        urls = extract_urls_and_domains(comment["text"], self.known_domains)["urls"]
+
+        words = comment["text"].split()
+        return words[-1] in urls if words else False


Instead, we could return the index for the start of the first link and the index for the end of the last link.

@jpangas wdyt?

jpangas · 2024-03-01

This sounds good. Let me try it out and I will share the results.
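
For illustration, the suggestion above (return the start index of the first link and the end index of the last link, rather than two booleans) could look roughly like the sketch below. It is not the code from this PR: `URL_PATTERN`, `link_position_features`, the returned key names, and the `-1` sentinel for "no link" are all hypothetical, and a plain regex stands in for bugbug's `extract_urls_and_domains` helper, whose implementation is not shown here.

```python
import re

# Hypothetical stand-in for extract_urls_and_domains: matches http(s) URLs
# and captures the domain part (everything up to the first slash or space).
URL_PATTERN = re.compile(r"https?://([^\s/]+)\S*")


def link_position_features(text, domains_to_ignore=None):
    """Return character offsets of the first and last unknown link in text.

    Both values are -1 when the text has no link outside the ignored domains.
    """
    # Avoid a mutable default argument by building the set per call.
    known_domains = domains_to_ignore or set()

    # Keep (start, end) spans only for links whose domain is not known.
    spans = [
        match.span()
        for match in URL_PATTERN.finditer(text)
        if match.group(1) not in known_domains
    ]

    if not spans:
        return {"first_link_start": -1, "last_link_end": -1}

    return {
        "first_link_start": spans[0][0],
        "last_link_end": spans[-1][1],
    }
```

With offsets instead of booleans, a comment that opens with a link yields `first_link_start == 0`, and one that closes with a link yields `last_link_end == len(text)`, so the two original features remain recoverable while the model also sees links placed near, but not exactly at, either end.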

jpangas · 2024-03-14T09:11:03Z

This is currently blocked by #4097

jpangas and others added 10 commits January 12, 2024 21:04

Create spamcomment model

efef0bc

Add New Features

bff58c5

Merge remote-tracking branch 'upstream/master' into spamcom

48871dc

Include new features and change spamcom

61b0fe0

Version 0.0.534

e31fa75

Merge remote-tracking branch 'upstream/master' into spamcom

5103030

Merge remote-tracking branch 'upstream/master' into spamcom

a69cc54

Create comments extractor

d365ad3

Remove comment features from Bug Features

9ce864a

Add New features

77d534d

marco-c reviewed Jan 24, 2024

bugbug/models/spamcomment.py (outdated, resolved)

Refine Link feature

73f74a4

jpangas commented Jan 29, 2024

bugbug/comment_features.py (outdated, resolved)

jpangas and others added 17 commits January 29, 2024 12:41

Test with TomekLinks

2d65489

Change df in text vectorizer

501a89f

Use oversampling

606f743

Use max_step

41a73cb

Include and Refine features

586576d

Split Date Features

ba7a1a1

Rename features correctly

8f429d1

Remove Commenter Experience and Invalid Bugs

1ef2493

Remove first comment

5a18517

Include Links Dictionary

ea6c168

Fix Error and Lint

874b19f

Refactor the Links Dictionary

b3da2e5

Use List instead

b49485d

Merge remote-tracking branch 'origin/master' into spamcom

71fe950

Merge remote-tracking branch 'origin/spamcom' into spamcom

4626064

Use Dictionary for # of links

a7044b0

Include older bugs

13772c7

jpangas added 3 commits February 25, 2024 20:19

Test with new parameters

0a21b61

Change df

00a9f9f

Test: Include tags as feature

f55d137

suhaibmujahid reviewed Feb 26, 2024

bugbug/models/spamcomment.py (outdated, resolved)

jpangas added 4 commits February 26, 2024 17:56

Exclude comment tags

dbcb311

Exclude emails from commit authors

1b437da

Test without scale pos weight

16e14c5

Test with scale_pos_weight adjusted

94ab283

jpangas requested a review from suhaibmujahid February 27, 2024 15:07

Adjust scale pos weight

5a58108

jpangas requested a review from marco-c February 28, 2024 12:09

suhaibmujahid reviewed Mar 1, 2024

jpangas added 6 commits March 1, 2024 21:26

Test without WeekOfYear

3eab988

Include comment classifier

bd16d56

Include script in setup

0a11f3c

Fix script error

a3956b4

Fix setup error

5c93d23

Classify all comments

15c8d5a

jpangas mentioned this pull request Mar 12, 2024

Prepare for classification of comments #4094

Open

jpangas marked this pull request as ready for review March 13, 2024 11:08

jpangas added 5 commits March 13, 2024 15:45

Include spamcom in model names

5f953ac

Merge remote-tracking branch 'upstream/master' into spamcom

df77a40

Merge branch 'mozilla:master' into spamcom

4cd6c6d

Remove comment independent files

4237f8f

Merge remote-tracking branch 'origin/spamcom' into spamcom

ba2ece2

jpangas mentioned this pull request Mar 14, 2024

Introduce Base Comment Model and functionality to classify comments #4097

Open

jpangas added 2 commits March 26, 2024 17:11

Use(bug,comment) tuple

5490d01

Include BugvsCreator Feature

d95852d

marco-c removed their request for review August 5, 2024 16:33
