Extract direct link for image and posts #213

Breizhux · 2021-04-21T10:37:38Z

Hello,
I noticed that the URLs of the retrieved images are not always usable. Sometimes they are direct links on the domain "scontent-cdt1-1.xx.fbcdn.net". But sometimes the URL is not direct and requires authentication to resolve; in that case the domain is m.facebook.com.

Public page : https://fr-fr.facebook.com/groups/saintyves.rennes/
Post concerned: https://www.facebook.com/groups/saintyves.rennes/permalink/1360623547663812/
Url that is retrieved for the image : https://m.facebook.com/photo/view_full_size/?fbid=3861145620587869&ref_component=mbasic_photo_permalink&ref_page=%2Fwap%2Fphoto.php&refid=13&__tn__=%2Cg
While the following url would be much more relevant : https://scontent-cdt1-1.xx.fbcdn.net/v/t1.6435-9/s960x960/176314059_3861145627254535_6708760356773320290_n.jpg?_nc_cat=110&ccb=1-3&_nc_sid=825194&_nc_ohc=nTqGUQ-o0h0AX_bnWV-&_nc_ht=scontent-cdt1-1.xx&tp=7&oh=c7ab3b2c862064ae5c12503f6707f434&oe=60A68072

While creating this ticket, I noticed that the URL of the post is not usable either.
The URL retrieved for the post is: https://facebook.com/439909623068547/posts/1360623547663812
Whereas this URL is usable without an account: https://www.facebook.com/groups/saintyves.rennes/permalink/1360623547663812/

I understand the idea of retrieving URLs even if an account is needed to access them. But I think it would be very convenient to also return the direct URL, usable without an account.

neon-ninja · 2021-04-21T22:41:25Z

This issue seems specific to groups, not pages.

The photos issue is tricky: if you need an account to resolve the URL to the full quality image, the scraper cannot resolve it unless you feed it cookies. It would still be possible to extract the low quality image. Perhaps we should always extract the low quality image, and also try to extract the full quality image if possible.
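
The fallback idea can be sketched on the consumer side as follows. This is only an illustration, not the scraper's code: `is_direct_link` and `pick_usable_image` are hypothetical helpers, the `image_lq` key is the name used in #217, and treating any `fbcdn.net` host as a direct link is an assumption based on the URLs quoted in this issue.

```python
from urllib.parse import urlparse


def is_direct_link(url):
    """Return True if the URL is served from a CDN host (assumed: *.fbcdn.net)."""
    if not url:
        return False
    host = urlparse(url).hostname or ""
    return host == "fbcdn.net" or host.endswith(".fbcdn.net")


def pick_usable_image(post):
    """Prefer the full quality image when it is a direct link,
    otherwise fall back to the low quality one; None if neither is direct."""
    for key in ("image", "image_lq"):
        url = post.get(key)
        if is_direct_link(url):
            return url
    return None
```

With this, a post whose `image` is an m.facebook.com `view_full_size` link still yields something usable as long as a direct low quality link is present.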

With the URLs problem, the regex needs to be updated for group posts. This problem was also reported in #165.

Breizhux · 2021-04-22T09:13:31Z

In my case I need the images because they can contain information. (I'm creating a Facebook page to RSS feed converter, so the goal is to work without an account.)

In general, I think it would be good to return a working link depending on what is available, even if the quality is lower in the end. But I understand the reason for offering the best quality link.
Or maybe even offer the different qualities available...

If not, maybe there is a way to get the HTML code of the posts? I could not find out whether that is possible. From there I could extract the link myself, which could be enough for me.

As for the direct links of the groups, they all seem to be built in the same way:
https://m.facebook.com/groups/<group_id>/permalink/<post_id>/
I tested on several Facebook groups and several posts, and I didn't get any errors.
For Facebook pages, it seems much more complicated...
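
As a rough sketch of that pattern: the helper names below are hypothetical, and the assumption that the first numeric segment of the retrieved `/posts/` URL is the group id only holds for group posts.

```python
import re


def group_permalink(group_id, post_id, host="m.facebook.com"):
    """Build a group post permalink from the pattern described above."""
    return f"https://{host}/groups/{group_id}/permalink/{post_id}/"


def permalink_from_post_url(url, host="m.facebook.com"):
    """Convert a '<group_id>/posts/<post_id>' URL into a group permalink.

    Returns None when the URL does not match that shape.
    """
    match = re.search(r"/(\d+)/posts/(\d+)", url)
    if match is None:
        return None
    group_id, post_id = match.groups()
    return group_permalink(group_id, post_id, host=host)
```

For example, `permalink_from_post_url("https://facebook.com/439909623068547/posts/1360623547663812")` gives `https://m.facebook.com/groups/439909623068547/permalink/1360623547663812/`.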

neon-ninja · 2021-04-22T22:55:44Z

@Breizhux I've raised a pull request to always return the low quality image (possibly in addition to the high quality one), see #217

It is possible to get the HTML: the parameter is remove_source, e.g. get_posts(account, remove_source=False)
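
Once the HTML of a post is available as a string, the image links can be pulled out with the standard library alone. This is a minimal sketch: `ImgSrcCollector` and `image_sources` are hypothetical names, and how the HTML string is obtained from the returned post is left out, since the thread does not show it.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag in a document."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; values are already unescaped
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def image_sources(html):
    """Return the list of <img> src URLs found in an HTML string."""
    parser = ImgSrcCollector()
    parser.feed(html)
    parser.close()
    return parser.sources
```

The resulting list can then be filtered for direct links, for instance by keeping only URLs on an fbcdn.net host.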

I've also raised a separate pull request to fix the regexes for group posts - #216

pmdscully · 2021-04-29T06:19:41Z

Thanks for the change @neon-ninja . Note that the merge now empties the image and images (i.e. =None) fields instead of populating them.

VariabileAleatoria · 2021-05-01T20:47:47Z

I'm currently facing the same problem, not on groups but on a shared post on a page:

>>> list(get_posts('realgoblinhours', pages=2, cookies='cookies.txt'))[0]['image']
'https://m.facebook.com/photo/view_full_size/?fbid=796496914325572&ref_component=mbasic_photo_permalink&ref_page=%2Fwap%2Fphoto.php&refid=13&__tn__=%2Cg'

neon-ninja · 2021-05-02T02:53:36Z

In the time since you posted that comment, that post is no longer the first on the page. This code works fine though:

posts = list(get_posts(
    post_urls=["https://m.facebook.com/story.php?story_fbid=4059210580793355&id=2261226117258486"],
    cookies="cookies.txt"
))
print(posts[0]["image"])

outputs

https://scontent.fakl1-3.fna.fbcdn.net/v/t1.6435-9/fr/cp0/e15/q65/180791935_796496920992238_7336250371130696992_n.jpg?_nc_cat=1&ccb=1-3&_nc_sid=110474&efg=eyJpIjoidCJ9&_nc_ohc=SzkHYeSZNWUAX-37aIG&_nc_ht=scontent.fakl1-3.fna&tp=14&oh=981ced36d2e8fbe4b243bae53e2f93e3&oe=60B24C27&manual_redirect=1

VariabileAleatoria · 2021-05-02T09:00:21Z

Unfortunately this doesn't do the trick for me:

>>> posts = list(get_posts(
...     post_urls=["https://m.facebook.com/story.php?story_fbid=4059210580793355&id=2261226117258486"],
...     cookies="cookies.txt"
... ))
>>> print(posts[0]["image"])
https://m.facebook.com/photo/view_full_size/?fbid=796496914325572&ref_component=mbasic_photo_permalink&ref_page=%2Fwap%2Fphoto.php&refid=13&__tn__=%2Cg

neon-ninja · 2021-05-02T11:58:11Z

You might need to recreate cookies.txt after changing your language

VariabileAleatoria · 2021-05-02T13:49:32Z

I thought I had already done that, but it works now.
I probably logged out in the browser, which I guess invalidates the cookies.

neon-ninja · 2021-05-02T21:00:54Z

I pushed a commit (21ac8c4) that warns about non-en_US locales present in the result HTML, which should help with this kind of problem.
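
A check of this kind could look roughly like the sketch below. It is not the code from that commit: `warn_if_not_english` is a hypothetical helper, and it assumes the page language is exposed as a `lang` attribute on the root `<html>` element, which is an assumption about the markup.

```python
import re
import warnings

# Assumption: the language is declared as <html ... lang="xx"> in the page.
_LANG_RE = re.compile(r'<html\b[^>]*\blang="([^"]+)"', re.IGNORECASE)


def warn_if_not_english(html):
    """Warn when the declared page language is not English.

    Returns the declared language, or None if none was found.
    """
    match = _LANG_RE.search(html)
    if match is None:
        return None
    lang = match.group(1)
    if lang.lower().replace("-", "_") not in ("en", "en_us"):
        warnings.warn(f"Page language is {lang!r}; extraction may expect en_US")
    return lang
```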

This was referenced Apr 21, 2021

regexes for GroupPostExtractor. support group posts in get_posts_by_url #216

Merged

always extract image_lq, try resolve view_full_size links #217

Merged

pmdscully mentioned this issue Apr 29, 2021

If a post has more than one image -> I receive only one low-quality image #203

Open
