Comments extraction issue #198

Open
gqwang16 opened this issue Apr 3, 2021 · 16 comments

Comments

@gqwang16

gqwang16 commented Apr 3, 2021

I use get_posts("the page I scrape", pages=3, options={'comments': True}) to extract comments. However, I get a nonzero comment count but nothing in "comments_full". Does anyone know the reason, or how to extract the comments?
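
To narrow down a report like this, it can help to list exactly which posts show the mismatch. The following is a minimal sketch, assuming each post is a dict with a `comments` count and a `comments_full` list as described above. The helper name `posts_missing_comments` and the sample dicts are made up for illustration and are not part of the library.

```python
def posts_missing_comments(posts):
    """Return posts whose comment count is nonzero but whose comment list is empty."""
    missing = []
    for post in posts:
        count = post.get("comments") or 0       # treat a missing or None count as 0
        full = post.get("comments_full") or []  # treat a missing or None list as empty
        if count > 0 and len(full) == 0:
            missing.append(post)
    return missing


# Hand-written sample dicts standing in for scraper output (not real data).
sample_posts = [
    {"post_id": "1", "comments": 5, "comments_full": []},
    {"post_id": "2", "comments": 2, "comments_full": [{"text": "a"}, {"text": "b"}]},
    {"post_id": "3", "comments": 0, "comments_full": None},
]
print([p["post_id"] for p in posts_missing_comments(sample_posts)])
```

Passing the affected post IDs along with the page name makes the problem easier for maintainers to reproduce.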

@lgjluis

lgjluis commented Apr 4, 2021

Hi @gqwang16,

Facebook limits the number of requests you can make without being logged in. I use a proxy to get the information.

@lgjluis

lgjluis commented Apr 4, 2021

Hi @kevinzg,

I have a problem with the comments: they include content from other posts. Do you know how I can fix it?

@gqwang16
Author

gqwang16 commented Apr 4, 2021 via email

@lgjluis

lgjluis commented Apr 4, 2021

I use a local service as a rotating proxy. In facebook_scraper.py change:

`def __init__(self, session=None, requests_kwargs=None):
    if session is None:
        session = HTMLSession()
        session.headers.update(self.default_headers)

    if requests_kwargs is None:
        # Send every request through the proxy; {ip} and {port} are placeholders
        requests_kwargs = {'proxies': {'http': 'http://{ip}:{port}', 'https': 'http://{ip}:{port}'}}

    self.session = session
    self.requests_kwargs = requests_kwargs`

You must replace {ip} and {port} with your proxy's address and port.
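
The `{ip}` and `{port}` placeholders in the snippet above are plain text, so nothing fills them in automatically. As a minimal sketch, the same dictionary can be built from variables instead. The helper name `build_requests_kwargs` and the address `127.0.0.1:8118` are assumptions for illustration; the `proxies` mapping follows the shape that the Requests library accepts.

```python
def build_requests_kwargs(ip, port):
    """Build a requests-style kwargs dict that routes HTTP and HTTPS through one proxy."""
    proxy_url = f"http://{ip}:{port}"
    return {"proxies": {"http": proxy_url, "https": proxy_url}}


# Placeholder address for a locally running proxy (an assumption, not from the thread).
kwargs = build_requests_kwargs("127.0.0.1", 8118)
print(kwargs)
```

Building the dict this way avoids editing the library source each time the proxy address changes.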

@neon-ninja
Collaborator

neon-ninja commented Apr 7, 2021

Hi @gqwang16 - can you tell us which page or post is causing the problem? I tested with the Nintendo page (#188) and that works fine. @lgjluis same for you

@lgjluis

lgjluis commented Apr 7, 2021

Hi @neon-ninja,

If I enable comment extraction, Facebook closes the connection after a while and the scraper stops extracting information. That is why I use a proxy.

@neon-ninja
Collaborator

Gotcha. Yes, extracting comments sends more requests to Facebook's servers, which triggers a temporary IP ban sooner.

@ccolonna

@lgjluis, could you please share a working script or a simple example of how to use a rotating proxy?

@neon-ninja
Collaborator

Twint has support for Tor, and for reloading Tor if the IP is banned - might be worth porting here

@lgjluis

lgjluis commented Apr 23, 2021

Hi @Christian-Nja, I use a Docker container running a rotating proxy.

@ccolonna

Ok, thank you. So, just to be sure which technique to follow:

  • User credentials: wide access to information, but with a single user login Facebook can temporarily ban the user for massive scraping
  • No user credentials: limited access to information and a risk of IP banning, but with a rotating proxy you can do all the massive scraping you want

Is this correct? You can't get the benefits of being logged in and IP rotation at once.

@neon-ninja
Collaborator

neon-ninja commented Apr 27, 2021

> Ok, thank you. So, just to be sure which technique to follow:
>
>   • User credentials: wide access to information, but with a single user login Facebook can temporarily ban the user for massive scraping
>   • No user credentials: limited access to information and a risk of IP banning, but with a rotating proxy you can do all the massive scraping you want
>
> Is this correct? You can't get the benefits of being logged in and IP rotation at once.

Sounds about right. Depending on what you're trying to do, #212 might also be useful. Additionally, if you had multiple accounts, cookie rotation might work
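
The cookie rotation idea could be sketched roughly as follows. This is only an illustration of cycling through a pool of per-account cookie files; the file names and the helper `rotate_cookies` are hypothetical, and how each cookie file is handed to the scraper is not covered in this thread.

```python
from itertools import cycle


def rotate_cookies(cookie_files):
    """Return an endless iterator that cycles through the given cookie files in order."""
    if not cookie_files:
        raise ValueError("at least one cookie file is required")
    return cycle(cookie_files)


# Hypothetical cookie files, one per account.
cookie_pool = rotate_cookies(["account_a.txt", "account_b.txt"])

# Pick a different cookie file for each scraping run.
picked = [next(cookie_pool) for _ in range(3)]
print(picked)
```

Each scraping run would then take the next file from the pool, so no single account carries all of the traffic.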

@abubelinha

abubelinha commented May 9, 2021

@neon-ninja @lgjluis
In addition to using proxies, do you know if there is any parameter to slow down the frequency of facebook-scraper requests?
I would prefer it to wait a bit between requests (e.g. one second) if doing so keeps my IP from being banned. I am not scraping a lot; I just want a cron job that makes a database backup of a given Facebook page's comments.

When I call get_posts() just once, does that mean a single request, or does the function internally run a loop of requests that I can't slow down?

@neon-ninja
Collaborator

@abubelinha actually, get_posts returns a generator - and requests are only made when you iterate through it. Note that each page contains 4 posts. So,

import time
from facebook_scraper import get_posts
for post in get_posts("Nintendo"):
    print(post.get("post_id"))
    time.sleep(.25)

This should add roughly a one-second delay between requests: each page request returns 4 posts, and the loop sleeps 0.25 seconds per post.
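
An alternative way to express the same pacing is to pause once per page rather than a little after every post. The wrapper below is a minimal sketch, assuming 4 posts per page as stated above. The name `throttled` and its parameters are made up for illustration, and the `sleep` argument exists only so the pause can be replaced in a dry run.

```python
import time


def throttled(items, per_request=4, delay=1.0, sleep=time.sleep):
    """Yield items, pausing `delay` seconds after every `per_request` items."""
    for index, item in enumerate(items, start=1):
        yield item
        # The pause runs before the next item is pulled from the source,
        # so a lazy source makes no further request until the pause ends.
        if index % per_request == 0:
            sleep(delay)


# Dry run: record the pauses instead of actually sleeping.
pauses = []
seen = list(throttled(range(8), per_request=4, delay=1.0, sleep=pauses.append))
print(seen, pauses)
```

In real use the wrapper would sit around the generator, as in `for post in throttled(get_posts("Nintendo")):`, with the default `time.sleep`.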

@webcoderz

webcoderz commented May 14, 2021

> Twint has support for Tor, and for reloading Tor if the IP is banned - might be worth porting here

This is what brought me here today. I was looking to see whether you were planning to implement this feature; it would be a tremendous help.
