Add proxy management #34

Open
PierreActuary opened this issue Oct 30, 2024 · 4 comments

Comments

@PierreActuary

PierreActuary commented Oct 30, 2024

  • solvency2-rfr version: 0.2.2
  • Python version: 3.9.13
  • Operating System: Windows

Description

Impossible to get the library to work when a proxy is required.
In the scraping.py module I understand that there is no way to pass the proxies as a variable.
After the "import requests" there should be a way to specify the proxy, or a way to pass it through to all the requests.get calls.
For example:

def get_links(urls: str, r: str, proxy) -> list:
    ...
    resp = requests.get(page, proxies=proxy)
And all functions calling scraping.get_links, directly or indirectly, should allow passing the proxy as well.

I understand eiopa_data.py and rfr.py should also be modified to add proxy management to
resp = urllib.request.urlopen(url)
and
urllib.request.urlretrieve(url, target_file)
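
For instance, a minimal version could look like this (the body is illustrative and only sketches the idea; it is not the real scraping.py internals):

import re
import requests

def get_links(urls: str, r: str, proxies: dict = None) -> list:
    # Illustrative body: fetch the page through the proxy (if given)
    # and return the hyperlinks matching the pattern r.
    resp = requests.get(urls, proxies=proxies)
    resp.raise_for_status()
    return [link for link in re.findall(r'href="([^"]+)"', resp.text)
            if re.search(r, link)]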

What I Did

import solvency2_data
d = solvency2_data.read("2017-12-31")

Traceback

Traceback (most recent call last):
  File "[...]AppData\Roaming\Python\Python39\site-packages\urllib3\connection.py", line 196, in _new_conn
    sock = connection.create_connection(
  File "[...]AppData\Roaming\Python\Python39\site-packages\urllib3\util\connection.py", line 60, in create_connection
    for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
  File "C:\ProgramData\PYT4ALL_Installer\lib\socket.py", line 954, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno 11001] getaddrinfo failed

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "[...]AppData\Roaming\Python\Python39\site-packages\urllib3\connectionpool.py", line 789, in urlopen
    response = self._make_request(
  File "[...]AppData\Roaming\Python\Python39\site-packages\urllib3\connectionpool.py", line 490, in _make_request
    raise new_e
  File "[...]AppData\Roaming\Python\Python39\site-packages\urllib3\connectionpool.py", line 466, in _make_request
    self._validate_conn(conn)
  File "[...]AppData\Roaming\Python\Python39\site-packages\urllib3\connectionpool.py", line 1095, in _validate_conn
    conn.connect()
  File "[...]AppData\Roaming\Python\Python39\site-packages\urllib3\connection.py", line 615, in connect
    self.sock = sock = self._new_conn()
  File "[...]AppData\Roaming\Python\Python39\site-packages\urllib3\connection.py", line 203, in _new_conn
    raise NameResolutionError(self.host, self, e) from e
urllib3.exceptions.NameResolutionError: <urllib3.connection.HTTPSConnection object at 0x000001963C6604C0>: Failed to resolve 'www.eiopa.europa.eu' ([Errno 11001] getaddrinfo failed)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "[...]AppData\Roaming\Python\Python39\site-packages\requests\adapters.py", line 667, in send
    resp = conn.urlopen(
  File "[...]AppData\Roaming\Python\Python39\site-packages\urllib3\connectionpool.py", line 843, in urlopen
    retries = retries.increment(
  File "[...]AppData\Roaming\Python\Python39\site-packages\urllib3\util\retry.py", line 519, in increment
    raise MaxRetryError(_pool, url, reason) from reason  # type: ignore[arg-type]
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='www.eiopa.europa.eu', port=443): Max retries exceeded with url: /tools-and-data/risk-free-interest-rate-term-structures/risk-free-rate-previous-releases-and-preparatory-phase_en (Caused by NameResolutionError("<urllib3.connection.HTTPSConnection object at 0x000001963C6604C0>: Failed to resolve 'www.eiopa.europa.eu' ([Errno 11001] getaddrinfo failed)"))
@wjwillemse
Owner

I have added a 'proxies' parameter (a dictionary) to the functions .read(), .get() and .refresh(), and passed this parameter on to the requests and urllib library functions.

This is implemented in version 0.2.3.

I was unable to test this myself, so could you test this?
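
For reference, usage should look something like this (the proxy address is a placeholder for your own):

import solvency2_data

# Placeholder proxy host and port; replace with your own.
proxies = {
    "http": "http://proxy.example.com:8080",
    "https": "http://proxy.example.com:8080",
}
d = solvency2_data.read("2017-12-31", proxies=proxies)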

@PierreActuary
Author

Thank you very much for that. I just tested it and unfortunately it doesn't work.
I managed to get the 0.2.2 version to work with a proxy by hard-coding it within rfr.py, scraping.py and eiopa_data.py.
What I noticed is that requests and urllib don't manage the proxy in the same way.
For requests (in scraping.py) I had to send:
requests.head(url, proxies=proxy, timeout=10, verify=False).headers
verify=False was mandatory in my case since I cannot connect in true https.

For urllib, urlopen(url) doesn't allow passing a proxy:

(function) def urlopen(
    url: str | Request,
    data: _DataType = None,
    timeout: float | None = ...,
    *,
    cafile: str | None = None,
    capath: str | None = None,
    cadefault: bool = False,
    context: SSLContext | None = None
) -> _UrlopenRet

So I defined a helper function that I call in the module.

def set_http_proxy(proxy):
    # Install a global opener so every subsequent urllib.request call
    # (urlopen, urlretrieve, ...) is routed through the proxy.
    proxy_support = urllib.request.ProxyHandler(
        {'http': f"{proxy['http']}", 'https': f"{proxy['https']}"}
    )
    opener = urllib.request.build_opener(proxy_support)
    urllib.request.install_opener(opener)

As I hardcoded it, I only needed to do it once.
If the proxy is passed via parameters, I suppose you should add
set_http_proxy(proxy)
before each urlopen call.
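
For example (the proxy address is a placeholder):

# Placeholder proxy address; replace with your own.
proxy = {
    "http": "http://proxy.example.com:8080",
    "https": "http://proxy.example.com:8080",
}
set_http_proxy(proxy)                         # installs the global opener once
resp = urllib.request.urlopen(url)            # now routed through the proxy
urllib.request.urlretrieve(url, target_file)  # urlretrieve calls urlopen internally,
                                              # so it uses the same opener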

Thanks again for trying to add this functionality to your project.

@wjwillemse
Owner

I forgot to add the proxies to requests.head. This is fixed now and available in 0.2.4. In eiopa_data.py, lines 79 to 82, the proxy handler is installed before the urllib.request.urlretrieve call (but I had already done this in 0.2.3, so it is strange that it does not work for you). If you still have issues, you can send me the lines of code I have to change to make it work for you, or you can send me a pull request.

@PierreActuary
Author

PierreActuary commented Nov 25, 2024

Hi wj, thanks for this new attempt.
I couldn't make it work.
On line 41 of scraping.py I modified:
resp = requests.get(page, proxies=proxies, timeout=10, verify=False)
(verify=False to allow an http connection with an https url).

But then I got an error "<urlopen error [Errno 11001] getaddrinfo failed>" (with the read function).
It improved when I added

def set_http_proxy(proxy):
    proxy_support = urllib.request.ProxyHandler({'http': f"{proxy['http']}",'https': f"{proxy['https']}"})
    opener = urllib.request.build_opener(proxy_support)
    urllib.request.install_opener(opener)

and a call to that function in the get_links and check_if_download functions, but I then get an error: "can't concat str to bytes".

I think I'll stay with my own modified 0.2.2 version that works (I'm not skilled enough in Python to dig deeper).

Also a suggestion: a refresh() function with a date argument, so the database doesn't refresh from 2016 every time.
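
Something like this (hypothetical signature; the parameter name is just a suggestion):

# Hypothetical: only fetch releases from this date onwards instead of from 2016.
solvency2_data.refresh(from_date="2023-12-31", proxies=proxies)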
