Add proxy management #34

Open
PierreActuary opened this issue Oct 30, 2024 · 4 comments

Comments

@PierreActuary

PierreActuary commented Oct 30, 2024

  • solvency2-rfr version: 0.2.2
  • Python version: 3.9.13
  • Operating System: Windows

Description

Impossible to get the library to work when a proxy is required.
In the scraping.py module I understand that there is no way to pass the proxies as a variable.
After the "import requests" there should be a way to specify the proxy, or a way to pass it through to all the requests.get calls.
For example:

def get_links(urls: str, r: str, proxy) -> list:
    ...
    resp = requests.get(page, proxies=proxy)
And all functions calling scraping.get_links, directly or indirectly, should allow passing the proxy as well.

I understand eiopa_data.py and rfr.py should also be modified to add proxy management to
resp = urllib.request.urlopen(url)
and
urllib.request.urlretrieve(url, target_file)
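
For instance, a minimal version could look like this (the body is illustrative and only sketches the idea; it is not the real scraping.py internals):

import re
import requests

def get_links(urls: str, r: str, proxies: dict = None) -> list:
    # Illustrative body: fetch the page through the proxy (if given)
    # and return the hyperlinks matching the pattern r.
    resp = requests.get(urls, proxies=proxies)
    resp.raise_for_status()
    return [link for link in re.findall(r'href="([^"]+)"', resp.text)
            if re.search(r, link)]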

What I Did

import solvency2_data
d = solvency2_data.read("2017-12-31")

Traceback

Traceback (most recent call last):
  File "[...]AppData\Roaming\Python\Python39\site-packages\urllib3\connection.py", line 196, in _new_conn
    sock = connection.create_connection(
  File "[...]AppData\Roaming\Python\Python39\site-packages\urllib3\util\connection.py", line 60, in create_connection
    for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
  File "C:\ProgramData\PYT4ALL_Installer\lib\socket.py", line 954, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno 11001] getaddrinfo failed

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "[...]AppData\Roaming\Python\Python39\site-packages\urllib3\connectionpool.py", line 789, in urlopen
    response = self._make_request(
  File "[...]AppData\Roaming\Python\Python39\site-packages\urllib3\connectionpool.py", line 490, in _make_request
    raise new_e
  File "[...]AppData\Roaming\Python\Python39\site-packages\urllib3\connectionpool.py", line 466, in _make_request
    self._validate_conn(conn)
  File "[...]AppData\Roaming\Python\Python39\site-packages\urllib3\connectionpool.py", line 1095, in _validate_conn
    conn.connect()
  File "[...]AppData\Roaming\Python\Python39\site-packages\urllib3\connection.py", line 615, in connect
    self.sock = sock = self._new_conn()
  File "[...]AppData\Roaming\Python\Python39\site-packages\urllib3\connection.py", line 203, in _new_conn
    raise NameResolutionError(self.host, self, e) from e
urllib3.exceptions.NameResolutionError: <urllib3.connection.HTTPSConnection object at 0x000001963C6604C0>: Failed to resolve 'www.eiopa.europa.eu' ([Errno 11001] getaddrinfo failed)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "[...]AppData\Roaming\Python\Python39\site-packages\requests\adapters.py", line 667, in send
    resp = conn.urlopen(
  File "[...]AppData\Roaming\Python\Python39\site-packages\urllib3\connectionpool.py", line 843, in urlopen
    retries = retries.increment(
  File "[...]AppData\Roaming\Python\Python39\site-packages\urllib3\util\retry.py", line 519, in increment
    raise MaxRetryError(_pool, url, reason) from reason  # type: ignore[arg-type]
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='www.eiopa.europa.eu', port=443): Max retries exceeded with url: /tools-and-data/risk-free-interest-rate-term-structures/risk-free-rate-previous-releases-and-preparatory-phase_en (Caused by NameResolutionError("<urllib3.connection.HTTPSConnection object at 0x000001963C6604C0>: Failed to resolve 'www.eiopa.europa.eu' ([Errno 11001] getaddrinfo failed)"))
@wjwillemse
Owner

I have added a 'proxies' parameter (a dictionary) to the functions .read(), .get() and .refresh(), and passed this parameter on to the requests and urllib library functions.

This is implemented in version 0.2.3.

I was unable to test this myself, so could you test this?
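
For reference, usage should look something like this (the proxy address is a placeholder for your own):

import solvency2_data

# Placeholder proxy host and port; replace with your own.
proxies = {
    "http": "http://proxy.example.com:8080",
    "https": "http://proxy.example.com:8080",
}
d = solvency2_data.read("2017-12-31", proxies=proxies)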

@PierreActuary
Author

Thank you very much for that. I just tested it and unfortunately it doesn't work.
I managed to get the 0.2.2 version to work with a proxy by hard-coding it within rfr.py, scraping.py and eiopa_data.py.
What I noticed is that requests and urllib don't manage the proxy in the same way.
For requests (in scraping.py) I had to send:
requests.head(url, proxies=proxy, timeout=10, verify=False).headers
verify=False was mandatory in my case since I cannot connect in true https.

For urllib, urlopen(url) doesn't allow passing a proxy:

(function) def urlopen(
    url: str | Request,
    data: _DataType = None,
    timeout: float | None = ...,
    *,
    cafile: str | None = None,
    capath: str | None = None,
    cadefault: bool = False,
    context: SSLContext | None = None
) -> _UrlopenRet

So I defined a helper function that I call in the module.

def set_http_proxy(proxy):
    # Install a global opener so every subsequent urllib.request call
    # (urlopen, urlretrieve, ...) is routed through the proxy.
    proxy_support = urllib.request.ProxyHandler(
        {'http': f"{proxy['http']}", 'https': f"{proxy['https']}"}
    )
    opener = urllib.request.build_opener(proxy_support)
    urllib.request.install_opener(opener)

As I hardcoded it, I only needed to do it once.
If the proxy is passed via parameters, I suppose you should add
set_http_proxy(proxy)
before each urlopen call.
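
For example (the proxy address is a placeholder):

# Placeholder proxy address; replace with your own.
proxy = {
    "http": "http://proxy.example.com:8080",
    "https": "http://proxy.example.com:8080",
}
set_http_proxy(proxy)                         # installs the global opener once
resp = urllib.request.urlopen(url)            # now routed through the proxy
urllib.request.urlretrieve(url, target_file)  # urlretrieve calls urlopen internally,
                                              # so it uses the same opener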

Thanks again for trying to add this functionality to your project.

@wjwillemse
Owner

I forgot to add the proxies to requests.head. This is fixed now and available in 0.2.4. In eiopa_data.py, lines 79 to 82, the proxy handler is installed before the urllib.request.urlretrieve call (but I had already done this in 0.2.3, so it is strange that it does not work for you). If you still have issues, you can send me the lines of code I have to change to make it work for you, or you can send me a pull request.

@PierreActuary
Author

PierreActuary commented Nov 25, 2024

Hi wj, thanks for this new attempt.
I couldn't make it work.
On line 41 of scraping.py I modified:
resp = requests.get(page, proxies=proxies, timeout=10, verify=False)
(verify=False to allow an http connection with an https url).

But then I got an error "<urlopen error [Errno 11001] getaddrinfo failed>" (with the read function).
It improved when I added

def set_http_proxy(proxy):
    proxy_support = urllib.request.ProxyHandler({'http': f"{proxy['http']}",'https': f"{proxy['https']}"})
    opener = urllib.request.build_opener(proxy_support)
    urllib.request.install_opener(opener)

and a call to that function in the get_links and check_if_download functions, but I then get an error: "can't concat str to bytes".

I think I'll stay with my own modified 0.2.2 version that works (I'm not skilled enough in Python to dig deeper).

Also a suggestion: a refresh() function with a date argument, so the database doesn't refresh from 2016 every time.
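
Something like this (hypothetical signature; the parameter name is just a suggestion):

# Hypothetical: only fetch releases from this date onwards instead of from 2016.
solvency2_data.refresh(from_date="2023-12-31", proxies=proxies)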
