Playwright content fetcher

dgtlmoon edited this page Jun 11, 2022 · 23 revisions

Fetching content using Playwright

You can fetch pages using the excellent and very fast Playwright backend, which connects to a browserless Chrome instance. See the browserless Docker quickstart: https://docs.browserless.io/docs/docker-quickstart.html

See docker-compose.yml for more examples.

Set the environment variable PLAYWRIGHT_DRIVER_URL to the WebSocket address of the browser, for example ws://127.0.0.1:3000 when it runs on the same host.
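As a minimal sketch, the variable can be exported in the shell that starts changedetection.io and sanity-checked first. The `is_ws_url` helper is hypothetical (it is not part of changedetection.io) and only checks the URL scheme, not whether anything is listening.

```shell
# Hypothetical helper: succeed only when the argument starts with ws:// or wss://
is_ws_url() {
  case "$1" in
    ws://?*|wss://?*) return 0 ;;
    *) return 1 ;;
  esac
}

# Address taken from the text above; change it if the browser runs elsewhere.
export PLAYWRIGHT_DRIVER_URL="ws://127.0.0.1:3000"

if is_ws_url "$PLAYWRIGHT_DRIVER_URL"; then
  echo "PLAYWRIGHT_DRIVER_URL looks valid: $PLAYWRIGHT_DRIVER_URL"
else
  echo "PLAYWRIGHT_DRIVER_URL must start with ws:// or wss://" >&2
fi
```

A scheme check like this catches the common mistake of pasting an http:// address where a WebSocket address is expected.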

Docker Compose based

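In docker-compose.yml, uncomment these lines (the environment entry belongs to the changedetection.io service; playwright-chrome is a separate service):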

environment:
        - PLAYWRIGHT_DRIVER_URL=ws://playwright-chrome:3000/

playwright-chrome:
        hostname: playwright-chrome
        image: browserless/chrome
        restart: unless-stopped

Docker based

docker run -d --name browserless \
  -e "DEFAULT_LAUNCH_ARGS=[\"--window-size=1920,1080\"]" \
  --rm -p 3000:3000 \
  --shm-size="2g" \
  browserless/chrome:1.53-chrome-stable
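
Before pointing changedetection.io at the container, it can help to confirm that something is accepting connections on port 3000. The following is only a sketch: `port_open` is a hypothetical helper that relies on bash's built-in /dev/tcp redirection, and it assumes the container was published on 127.0.0.1:3000 as in the command above. It checks that the TCP port is open, not that the browser behind it is healthy.

```shell
# Hypothetical helper: return 0 if a TCP connection to HOST PORT succeeds.
# Uses bash's /dev/tcp pseudo-device, so it needs bash rather than plain sh.
port_open() {
  (exec 3<>"/dev/tcp/$1/$2") 2>/dev/null
}

if port_open 127.0.0.1 3000; then
  echo "browserless is reachable on port 3000"
else
  echo "nothing is listening on port 3000 yet"
fi
```

The connection is opened in a subshell, so the file descriptor is closed again as soon as the check finishes.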

Pip install based

@todo

Playwright memory leak

Playwright appears to leak memory (see https://github.com/microsoft/playwright/issues/6319), and so far there is no known fix. Memory use can easily grow from around 200 MB to several gigabytes. Restarting the service is very fast and is currently the best way to mitigate this.

Run this script from crontab every x minutes:

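#!/bin/bash
# Kill changedetection.io once its resident memory (RSS, in KB) exceeds ~240 MB.
# The Docker container's restart policy should then start it again.
# Match on the full command line (args), since the process name alone is just "python".
ps -e -o pid= -o rss= -o args= | awk '$2 > 240000 && /python \.\/changedetection\.py -d \/datastore/ {print $1}' | while read -r pid
do
  kill -9 "$pid"
done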
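
To schedule the script above, a crontab entry along these lines could be used. This is a sketch under assumptions: the script path and the five-minute interval are hypothetical placeholders, and the install command is left commented out so that nothing changes by accident.

```shell
# Hypothetical location of the watchdog script above; adjust to your setup.
WATCHDOG=/usr/local/bin/changedetection-watchdog.sh
# Assumed interval; the text only says "every x minutes".
INTERVAL_MINUTES=5

# Standard five-field cron schedule: run every INTERVAL_MINUTES minutes.
CRON_LINE="*/${INTERVAL_MINUTES} * * * * ${WATCHDOG}"
echo "$CRON_LINE"

# To append it to the current user's crontab:
# ( crontab -l 2>/dev/null; echo "$CRON_LINE" ) | crontab -
```

A shorter interval limits how far memory can grow between checks, at the cost of more frequent restarts once the threshold is reached.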