Extract direct link for image and posts #213

Open
Breizhux opened this issue Apr 21, 2021 · 10 comments

@Breizhux

Breizhux commented Apr 21, 2021

Hello,
I noticed that the URLs of the retrieved images are not always usable. Sometimes they are direct links on the domain "scontent-cdt1-1.xx.fbcdn.net". But sometimes the URL is not direct and requires authentication to resolve; in that case the domain is m.facebook.com.

Public page: https://fr-fr.facebook.com/groups/saintyves.rennes/
Post concerned: https://www.facebook.com/groups/saintyves.rennes/permalink/1360623547663812/
URL retrieved for the image: https://m.facebook.com/photo/view_full_size/?fbid=3861145620587869&ref_component=mbasic_photo_permalink&ref_page=%2Fwap%2Fphoto.php&refid=13&__tn__=%2Cg
The following URL would be far more useful: https://scontent-cdt1-1.xx.fbcdn.net/v/t1.6435-9/s960x960/176314059_3861145627254535_6708760356773320290_n.jpg?_nc_cat=110&ccb=1-3&_nc_sid=825194&_nc_ohc=nTqGUQ-o0h0AX_bnWV-&_nc_ht=scontent-cdt1-1.xx&tp=7&oh=c7ab3b2c862064ae5c12503f6707f434&oe=60A68072

While creating this ticket, I noticed that the post URL is not usable either.
The URL retrieved for the post is: https://facebook.com/439909623068547/posts/1360623547663812
Whereas this URL is usable without an account: https://www.facebook.com/groups/saintyves.rennes/permalink/1360623547663812/

I understand the idea of retrieving URLs even when an account is needed to access them. But I think it would be very convenient to also provide the direct URL, usable without an account.
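
To make the distinction between the two kinds of image URL concrete, here is a minimal sketch that classifies a URL by its host. The helper name `is_direct_image_url` is hypothetical, and the rule that direct links live under `fbcdn.net` is an assumption drawn only from the example URLs in this thread, not from the scraper itself.

```python
from urllib.parse import urlparse


def is_direct_image_url(url: str) -> bool:
    """Return True if the URL's host looks like a Facebook CDN host.

    Hypothetical helper: assumes direct image links are served from
    *.fbcdn.net, as in the examples quoted in this issue.
    """
    host = urlparse(url).hostname or ""
    return host == "fbcdn.net" or host.endswith(".fbcdn.net")


# Shortened versions of the two URLs from the report above.
indirect = "https://m.facebook.com/photo/view_full_size/?fbid=3861145620587869"
direct = (
    "https://scontent-cdt1-1.xx.fbcdn.net/v/t1.6435-9/s960x960/"
    "176314059_3861145627254535_6708760356773320290_n.jpg"
)

print(is_direct_image_url(indirect))
print(is_direct_image_url(direct))
```

A check like this could let a caller decide whether a returned image URL is fetchable without cookies.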

@neon-ninja
Collaborator

This issue seems specific to groups, not pages.

The photo issue is tricky: if an account is needed to resolve the URL to the full-quality image, the scraper cannot resolve it unless you feed it cookies. It would still be possible to extract the low-quality image. Perhaps we should always extract the low-quality image, and also try to extract the full-quality image when possible.

With the URLs problem, the regex needs to be updated for group posts. This problem was also reported in #165.
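
As an illustration of what matching both URL shapes could look like, the sketch below pulls the post ID out of either form seen in this thread. The patterns and the function name `extract_post_id` are hypothetical; they are not the scraper's own regexes and are not taken from #165 or any pull request.

```python
import re
from typing import Optional

# Hypothetical patterns covering the two URL shapes quoted in this issue.
POST_ID_PATTERNS = [
    re.compile(r"/groups/[^/]+/permalink/(\d+)"),  # group permalink form
    re.compile(r"/[^/]+/posts/(\d+)"),             # <id>/posts/<post_id> form
]


def extract_post_id(url: str) -> Optional[str]:
    """Return the numeric post ID found in the URL, or None if no pattern matches."""
    for pattern in POST_ID_PATTERNS:
        match = pattern.search(url)
        if match:
            return match.group(1)
    return None


print(extract_post_id(
    "https://www.facebook.com/groups/saintyves.rennes/permalink/1360623547663812/"
))
print(extract_post_id(
    "https://facebook.com/439909623068547/posts/1360623547663812"
))
```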

@Breizhux
Author

In my case I need the images because they can contain information. (I'm creating a Facebook page to RSS feed converter, so the goal is to work without an account.)

In general, I think it would be good to return a working link based on what is available, even if the quality ends up lower. But I understand the reason for offering the best-quality link.
Or maybe even offer all the available qualities...

Otherwise, is there a way to get the HTML of the posts? I couldn't find out whether that is possible. From there I could extract the link myself, which could be enough for me.

As for direct links to group posts, they all seem to be built the same way:
https://m.facebook.com/groups/<group_id>/permalink/<post_id>/
I tested this on several Facebook groups and several posts, and I didn't get any errors.
For Facebook pages, it seems much more complicated...
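
The permalink pattern described above can be sketched as a small builder. The function name `group_permalink` is hypothetical, and whether a group slug such as `saintyves.rennes` is always accepted in place of a numeric group ID is an assumption based only on the URLs quoted in this thread.

```python
def group_permalink(group_id: str, post_id: str) -> str:
    """Build a group post permalink following the pattern observed above.

    Hypothetical helper: the pattern is the one reported in this comment,
    not a documented Facebook URL scheme.
    """
    return f"https://m.facebook.com/groups/{group_id}/permalink/{post_id}/"


print(group_permalink("saintyves.rennes", "1360623547663812"))
```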

@neon-ninja
Collaborator

neon-ninja commented Apr 22, 2021

@Breizhux I've raised a pull request to always return the low quality image (possibly in addition to the high quality one), see #217

It is possible to get the HTML; the parameter is remove_source, e.g. get_posts(account, remove_source=False)

I've also raised a separate pull request to fix the regexes for group posts - #216
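
Once the post HTML is available, image links can be pulled out of it with the standard library alone. The sketch below works on a plain HTML string; how that string is obtained from the post returned by get_posts (for example, which key holds the source and what type it has) is not confirmed in this thread and is left as an assumption. The names `ImgSrcCollector` and `image_sources` are hypothetical.

```python
from html.parser import HTMLParser
from typing import List


class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag encountered."""

    def __init__(self) -> None:
        super().__init__()
        self.sources: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:  # skip <img> tags without a usable src
                self.sources.append(src)


def image_sources(html: str) -> List[str]:
    """Return all image URLs found in an HTML fragment, in document order."""
    parser = ImgSrcCollector()
    parser.feed(html)
    parser.close()
    return parser.sources


# Made-up fragment for illustration; real post HTML will be more complex.
sample = '<div><img src="https://scontent-cdt1-1.xx.fbcdn.net/a.jpg"><img alt="x"></div>'
print(image_sources(sample))
```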

@pmdscully

pmdscully commented Apr 29, 2021

Thanks for the change, @neon-ninja. Note that since the merge, the image and images fields are now emptied (i.e. set to None) instead of being populated.
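
Until the fields are populated reliably, a caller can guard against None values and indirect links when reading a post. This is a caller-side sketch only: the function name `first_direct_image` is hypothetical, the key `image_lowquality` is an assumption about what the pull request adds, and the `fbcdn.net` host rule is inferred from the URLs in this thread.

```python
from typing import Any, Mapping, Optional, Sequence
from urllib.parse import urlparse


def first_direct_image(
    post: Mapping[str, Any],
    keys: Sequence[str] = ("image", "image_lowquality"),  # second key is an assumption
) -> Optional[str]:
    """Return the first non-empty URL among `keys` that points at a CDN host."""
    for key in keys:
        url = post.get(key)
        if not url:
            continue  # field missing or None
        host = urlparse(url).hostname or ""
        if host.endswith(".fbcdn.net"):
            return url
    return None


# Made-up post dicts for illustration.
post_a = {"image": None, "image_lowquality": "https://scontent.xx.fbcdn.net/low.jpg"}
post_b = {"image": "https://m.facebook.com/photo/view_full_size/?fbid=1"}
print(first_direct_image(post_a))
print(first_direct_image(post_b))
```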

@VariabileAleatoria

I'm currently facing the same problem, not on a group but on a shared post on a page:

>>> list(get_posts('realgoblinhours', pages=2, cookies='cookies.txt'))[0]['image']
'https://m.facebook.com/photo/view_full_size/?fbid=796496914325572&ref_component=mbasic_photo_permalink&ref_page=%2Fwap%2Fphoto.php&refid=13&__tn__=%2Cg'

@neon-ninja
Collaborator

neon-ninja commented May 2, 2021

Since you posted that comment, that post is no longer the first on the page. This code works fine though:

posts = list(get_posts(
    post_urls=["https://m.facebook.com/story.php?story_fbid=4059210580793355&id=2261226117258486"],
    cookies="cookies.txt"
))
print(posts[0]["image"])

outputs

https://scontent.fakl1-3.fna.fbcdn.net/v/t1.6435-9/fr/cp0/e15/q65/180791935_796496920992238_7336250371130696992_n.jpg?_nc_cat=1&ccb=1-3&_nc_sid=110474&efg=eyJpIjoidCJ9&_nc_ohc=SzkHYeSZNWUAX-37aIG&_nc_ht=scontent.fakl1-3.fna&tp=14&oh=981ced36d2e8fbe4b243bae53e2f93e3&oe=60B24C27&manual_redirect=1

@VariabileAleatoria

Unfortunately this doesn't do the trick for me:

>>> posts = list(get_posts(
...     post_urls=["https://m.facebook.com/story.php?story_fbid=4059210580793355&id=2261226117258486"],
...     cookies="cookies.txt"
... ))
>>> print(posts[0]["image"])
https://m.facebook.com/photo/view_full_size/?fbid=796496914325572&ref_component=mbasic_photo_permalink&ref_page=%2Fwap%2Fphoto.php&refid=13&__tn__=%2Cg

@neon-ninja
Collaborator

You might need to recreate cookies.txt after changing your language.

@VariabileAleatoria

I thought I had already done that, but it works now.
I probably logged out in the browser, which I guess invalidated the cookies.

@neon-ninja
Collaborator

neon-ninja commented May 2, 2021

I pushed a commit (21ac8c4) that warns about non-en_US locales present in the result HTML, which should help with this kind of problem.


4 participants