This gem helps your Ruby / JRuby application make HTTP(S) requests through proxies by fetching and validating up-to-date proxy lists from multiple providers.
It provides a Manager class that can load proxy lists, validate them and return random or specific proxies. It also has a Client class that encapsulates all the logic for sending HTTP requests through proxies that are automatically fetched and validated by the gem. Take a look at the documentation below to find all the gem features.
The gem can also be used with any other programming language (Go / Python / etc.) as a standalone solution for downloading and validating proxy lists from different providers. Check out the usage examples below.
Please check the documentation for the version of proxy_fetcher you are using at: https://github.com/nbulaj/proxy_fetcher/releases
- Dependencies
- Installation
- Example of usage
- Client
- Configuration
- Proxy object
- Providers
- Contributing
- License
The ProxyFetcher gem itself requires Ruby >= 2.0.0 (or JRuby > 9.0, though earlier versions may work too; see the Travis build matrix) and the great HTTP.rb gem.
However, it requires an adapter to parse HTML. If you do not specify an adapter, it will use the default one, Nokogiri. That is fine for any Ruby on Rails project (because Rails uses it by default).
But if you want to use a different adapter (for example, your application uses Oga), then you need to manually add the dependency to your project and configure ProxyFetcher to use that adapter. Moreover, you can implement your own adapter if that suits your use case. Take a look at the Configuration section for more details.
If using bundler, first add 'proxy_fetcher' to your Gemfile:
gem 'proxy_fetcher', '~> 0.14'
or, if you want to use the latest version (from the master branch):
gem 'proxy_fetcher', git: 'https://github.com/nbulaj/proxy_fetcher.git'
And run:
bundle install
Otherwise simply install the gem:
gem install proxy_fetcher -v '0.14'
By default ProxyFetcher uses all the available proxy providers. To get the current proxy list without validation, initialize an instance of the ProxyFetcher::Manager class. By default ProxyFetcher will automatically load and parse all the proxies from all available sources:
manager = ProxyFetcher::Manager.new # will immediately load proxy list from the servers
manager.proxies
#=> [#<ProxyFetcher::Proxy:0x00000002879680 @addr="97.77.104.22", @port=3128, @country="USA",
# @response_time=5217, @type="HTTP", @anonymity="High">, ... ]
You can initialize the proxy manager without immediately loading the proxy list from the remote server by passing refresh: false on initialization:
manager = ProxyFetcher::Manager.new(refresh: false) # just initialize class instance
manager.proxies
#=> []
You can also use ProxyFetcher to load proxy lists from local files, if you have any:
manager = ProxyFetcher::Manager.new(file: "/home/dev/proxies.txt", refresh: false)
# or
manager = ProxyFetcher::Manager.from_file(file: "/home/dev/proxies.txt", refresh: false)
# or
manager = ProxyFetcher::Manager.new(
  files: Dir.glob("/home/dev/proxies/**/*.txt"),
  refresh: false
)
manager.proxies
#=> [#<ProxyFetcher::Proxy:0x00000002879680 @addr="97.77.104.22", @port=3128, @country="USA",
# @response_time=5217, @type="HTTP", @anonymity="High">, ... ]
The ProxyFetcher::Manager class is very helpful when you need to manipulate and manage proxies. To get a proxy from the list, call the .get or .pop method, which returns the first proxy and moves it to the end of the list. These methods have counterparts, get! and its alias pop!, which return the first connectable proxy and move it to the end of the list. Both are marked as dangerous methods because all dead proxies are removed from the list along the way.
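The rotation semantics described above can be illustrated with a small stand-alone sketch. This is not the gem's implementation: RotatingList and its alive predicate are hypothetical names used only for illustration, and plain strings stand in for proxy objects.

```ruby
# Illustrative stand-in for the rotation behaviour of get/pop and get!/pop!.
# RotatingList is a hypothetical class, not part of ProxyFetcher.
class RotatingList
  attr_reader :items

  # alive: callable that decides whether an entry counts as "connectable"
  def initialize(items, alive: ->(_item) { true })
    @items = items.dup
    @alive = alive
  end

  # Return the first entry and move it to the end of the list.
  def get
    return nil if @items.empty?

    first = @items.shift
    @items << first
    first
  end

  # Return the first live entry and move it to the end of the list.
  # Dead entries encountered along the way are dropped for good.
  def get!
    until @items.empty?
      candidate = @items.shift
      next unless @alive.call(candidate)

      @items << candidate
      return candidate
    end
    nil
  end
end

list = RotatingList.new(
  %w[dead:80 b:80 c:80],
  alive: ->(item) { !item.start_with?('dead') }
)
picked = list.get!
puts picked            # b:80
puts list.items.inspect # ["c:80", "b:80"]
```

The dead entry disappears from the list as a side effect of get!, which is why the bang variants are the "dangerous" ones.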
If you just need a random proxy, call manager.random_proxy or its alias manager.random.
To clean the current proxy list of dead entries that do not respond to requests, use the cleanup! or validate! method:
manager.cleanup! # or manager.validate!
This action enumerates the proxy list and removes all entries that time out or return errors.
To increase performance, proxy list validation is performed using Ruby threads. By default the gem creates a pool of 10 threads, but you can increase this number by changing the pool_size configuration option: ProxyFetcher.config.pool_size = 50.
Read more in the Proxy validation speed section.
If you need raw proxy URLs (like host:port), use the raw_proxies method, which returns an array of strings:
manager = ProxyFetcher::Manager.new
manager.raw_proxies
# => ["97.77.104.22:3128", "94.23.205.32:3128", "209.79.65.140:8080",
# "91.217.42.2:8080", "97.77.104.22:80", "165.234.102.177:8080", ...]
You don't need to initialize a new manager every time you want to load a fresh proxy list from the providers. Just refresh the proxy list by calling the #refresh_list! (or #fetch!) method on your ProxyFetcher::Manager instance:
manager.refresh_list! # or manager.fetch!
#=> [#<ProxyFetcher::Proxy:0x00000002879680 @addr="97.77.104.22", @port=3128, @country="USA",
# @response_time=5217, @type="HTTP", @anonymity="High">, ... ]
If you need to filter the proxy list, for example by country or response time, and the selected provider supports filtering with GET params, you can pass your filters as a plain Ruby hash to the Manager instance:
ProxyFetcher.config.providers = :xroxy
manager = ProxyFetcher::Manager.new(filters: { country: 'PL', maxtime: '500' })
manager.proxies
# => [...]
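Since such filters are sent to the provider as GET params, the hash above effectively becomes a query string. A minimal stdlib-only sketch of that conversion follows; the base URL is a made-up placeholder, and exactly how the gem builds its request is an assumption here.

```ruby
require 'uri'

# Hypothetical provider endpoint, used only for illustration.
base_url = 'https://provider.example/proxylist'
filters  = { country: 'PL', maxtime: '500' }

uri = URI.parse(base_url)
uri.query = URI.encode_www_form(filters) # "country=PL&maxtime=500"
request_url = uri.to_s
puts request_url
```

This also shows why filter names are provider-specific: they are whatever query parameters the provider's own site understands.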
[IMPORTANT]: Every provider has its own filtering params, so you can't just use something like country to filter all the proxies by country. If you are using multiple providers, you can split your filters by provider name:
ProxyFetcher.config.providers = [:proxy_docker, :xroxy]
manager = ProxyFetcher::Manager.new(filters: {
  hide_my_name: {
    country: 'PL',
    maxtime: '500'
  },
  xroxy: {
    type: 'All_http'
  }
})
manager.proxies
# => [...]
You can apply different filters every time you call the #refresh_list! (or #fetch!) method:
manager.refresh_list!(country: 'PL', maxtime: '500')
# => [...]
NOTE: not all providers support filtering. Take a look at a provider's class to see whether it supports custom filters.
All you need to use this gem is Ruby >= 2.0 (2.4 is recommended). You can install it in several ways. If you are using Ubuntu Xenial (16.04 LTS) then you already have Ruby 2.3 installed. In other cases you can install it with RVM or rbenv.
After installing Ruby, install the gem by running gem install proxy_fetcher in your terminal. Now you can run it:
proxy_fetcher >> proxies.txt # Will download proxies from the default provider, validate them and write to file
If you need a list of proxies from a specific provider, pass its name with the -p option:
proxy_fetcher -p xroxy >> proxies.txt # Will download proxies from the XRoxy provider, validate them and write to file
If you need the list of proxies in JSON format, pass the --json option to the command:
proxy_fetcher --json
# Will print:
# {"proxies":["120.26.206.178:80","119.61.13.242:1080","117.40.213.26:80","92.62.72.242:1080","77.53.105.155:3124",
# "58.20.41.172:35923","204.116.192.151:35923","190.5.96.58:1080","170.250.109.97:35923","121.41.82.99:1080"]}
To get all the possible options run:
proxy_fetcher --help
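Because the --json output is plain JSON, any language can consume it. As a sketch in Ruby, assuming the output has the shape shown above (a top-level "proxies" array of host:port strings), it can be parsed with the standard library. The sample string is a shortened copy of that output.

```ruby
require 'json'

# Shortened sample of the `proxy_fetcher --json` output shown above.
output = '{"proxies":["120.26.206.178:80","119.61.13.242:1080"]}'

# Split each "host:port" string into a host and an integer port.
endpoints = JSON.parse(output).fetch('proxies').map do |entry|
  host, port = entry.split(':', 2)
  { host: host, port: Integer(port) }
end

endpoints.each { |endpoint| puts "#{endpoint[:host]} -> #{endpoint[:port]}" }
```

In a real script the output string would come from running the command itself rather than from a literal.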
The ProxyFetcher gem provides a ready-to-use HTTP client that makes sending requests through proxies easy. It does all the work with the proxy lists for you (load, validate, refresh, find proxy by type, follow redirects, etc.). All you need to do is make HTTP(S) requests:
require 'proxy_fetcher'
ProxyFetcher::Client.get 'https://example.com/resource'
ProxyFetcher::Client.post 'https://example.com/resource', { param: 'value' }
ProxyFetcher::Client.post 'https://example.com/resource', 'Any data'
ProxyFetcher::Client.post 'https://example.com/resource', { param: 'value' }.to_json, headers: { 'Content-Type': 'application/json' }
ProxyFetcher::Client.put 'https://example.com/resource', { param: 'value' }
ProxyFetcher::Client.patch 'https://example.com/resource', { param: 'value' }
ProxyFetcher::Client.delete 'https://example.com/resource'
By default, ProxyFetcher::Client makes 1000 attempts to send an HTTP request if the proxy is out of order or the remote server returns an error. You can increase or decrease this number for your case, or set it to nil if you want an unlimited number of attempts (or until your Ruby process dies 💀):
require 'proxy_fetcher'
ProxyFetcher::Client.get 'https://example.com/resource', options: { max_retries: 10_000 }
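The retry behaviour can be pictured with a small stand-alone sketch. with_retries is a hypothetical helper, not part of the gem, and the client's real retry logic may differ; the sketch only illustrates how a numeric limit and nil (no limit) could be treated.

```ruby
# Hypothetical helper: run the block until it succeeds or the limit is reached.
# max_retries: nil means "keep trying forever".
def with_retries(max_retries:)
  attempts = 0
  begin
    attempts += 1
    yield attempts
  rescue StandardError
    # Give up and re-raise the last error once the limit is hit.
    raise if max_retries && attempts >= max_retries

    retry
  end
end

# Simulated request that fails twice before succeeding.
calls = 0
result = with_retries(max_retries: 5) do
  calls += 1
  raise IOError, 'proxy is out of order' if calls < 3

  'response body'
end

puts "#{result} after #{calls} attempts"
```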
You can also pass your own proxy object to the ProxyFetcher client:
require 'proxy_fetcher'
manager = ProxyFetcher::Manager.new # will immediately load proxy list from the server
#random will return random proxy object from the list
ProxyFetcher::Client.get 'https://example.com/resource', options: { proxy: manager.random }
By the way, if you need JavaScript support or some other features, you need to implement your own client using, for example, selenium-webdriver.
ProxyFetcher is a very flexible gem. You can configure the most important parts of the library and plug in your own solutions.
The default configuration looks as follows:
ProxyFetcher.configure do |config|
  config.logger = Logger.new($stdout)
  config.user_agent = ProxyFetcher::Configuration::DEFAULT_USER_AGENT
  config.pool_size = 10
  config.client_timeout = 3
  config.provider_proxies_load_timeout = 30
  config.proxy_validation_timeout = 3
  config.http_client = ProxyFetcher::HTTPClient
  config.proxy_validator = ProxyFetcher::ProxyValidator
  config.providers = ProxyFetcher::Configuration.registered_providers
  config.adapter = ProxyFetcher::Configuration::DEFAULT_ADAPTER # :nokogiri by default
end
You can change any of the options above.
For example, you can set your custom User-Agent string:
ProxyFetcher.configure do |config|
  config.user_agent = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36'
end
ProxyFetcher uses the HTTP.rb gem for HTTP(S) requests. It is fast enough and has a great chainable API. If you want to add, for example, a custom provider whose site is a Single Page Application (SPA) that relies on JavaScript, you will need something like selenium-webdriver to properly load the content of the website. For these and other cases you can write your own class for fetching HTML content by URL and set it in the ProxyFetcher config:
class MyHTTPClient
  # [IMPORTANT]: the methods below are required!
  def self.fetch(url)
    # ... some magic to return proper HTML ...
  end
end
ProxyFetcher.config.http_client = MyHTTPClient
manager = ProxyFetcher::Manager.new
manager.proxies
#=> [#<ProxyFetcher::Proxy:0x00000002879680 @addr="97.77.104.22", @port=3128, @country="USA",
# @response_time=5217, @type="HTTP", @anonymity="High">, ... ]
Take a look at lib/proxy_fetcher/utils/http_client.rb for an example.
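As one possible shape for such a class, here is a sketch built on Ruby's standard Net::HTTP rather than a browser driver. Only the self.fetch(url) entry point comes from the description above; the class name, the redirect limit and the decision to return the body as a string are assumptions.

```ruby
require 'net/http'
require 'uri'

# Hypothetical custom client; only the fetch(url) entry point is taken
# from the documentation above.
class NetHTTPFetcher
  MAX_REDIRECTS = 5 # assumption: a small cap to avoid redirect loops

  # Fetch the page at +url+ and return its body as a String.
  def self.fetch(url, redirects_left = MAX_REDIRECTS)
    response = Net::HTTP.get_response(URI.parse(url))

    case response
    when Net::HTTPSuccess
      response.body
    when Net::HTTPRedirection
      raise "Too many redirects for #{url}" if redirects_left.zero?

      # Location may be relative, so resolve it against the current URL.
      fetch(URI.join(url, response['location']).to_s, redirects_left - 1)
    else
      raise "Unexpected response #{response.code} for #{url}"
    end
  end
end
```

A class like this would then be assigned to ProxyFetcher.config.http_client in the same way as MyHTTPClient above.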
Moreover, you can write your own proxy validator to check whether a proxy is valid:
class MyProxyValidator
  # [IMPORTANT]: the methods below are required!
  def self.connectable?(proxy_addr, proxy_port)
    # ... some magic to check if proxy is valid ...
  end
end
ProxyFetcher.config.proxy_validator = MyProxyValidator
manager = ProxyFetcher::Manager.new
manager.proxies
#=> [#<ProxyFetcher::Proxy:0x00000002879680 @addr="97.77.104.22", @port=3128, @country="USA",
# @response_time=5217, @type="HTTP", @anonymity="High">, ... ]
manager.validate!
#=> [ ... ]
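A minimal sketch of such a validator, assuming that a proxy counts as valid when a TCP connection to it can be opened within a timeout. The class name and the TIMEOUT constant are illustrative; only the connectable?(proxy_addr, proxy_port) signature comes from the example above.

```ruby
require 'socket'

# Hypothetical validator that only checks whether the port accepts
# TCP connections.
class TCPProxyValidator
  TIMEOUT = 3 # seconds; assumption mirroring the default validation timeout

  # True if a TCP connection to the proxy can be opened in time.
  def self.connectable?(proxy_addr, proxy_port)
    Socket.tcp(proxy_addr, proxy_port, connect_timeout: TIMEOUT) { true }
  rescue StandardError
    # Refused, unreachable, unresolvable or timed out: treat as dead.
    false
  end
end
```

Note that a check like this only proves the port is open, not that the proxy actually forwards traffic, so it trades accuracy for speed.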
By default, the ProxyFetcher gem uses Nokogiri for parsing HTML. If you want to use Oga instead, add gem 'oga' to your Gemfile and configure ProxyFetcher as follows:
ProxyFetcher.config.adapter = :oga
You can also write your own HTML parser implementation and use it; take a look at the abstract class and implementations. Configure it as:
ProxyFetcher.config.adapter = MyHTMLParserClass
There are some tricks to increase proxy list validation performance.
In a few words, the ProxyFetcher gem uses threads to validate proxies for availability. Every proxy is checked in a separate thread. By default, ProxyFetcher uses a pool with a maximum of 10 threads. You can increase this number by setting the maximum number of threads in the config:
ProxyFetcher.config.pool_size = 50
You can experiment with the thread pool size to find the optimal maximum thread count for your PC and OS. This can give you a noticeable performance improvement.
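The pooled validation described above can be sketched with plain Ruby threads and a Queue. This illustrates the general technique only, not the gem's internals: validate_in_pool and the check callable are hypothetical, and the example uses a stubbed check instead of real network calls.

```ruby
# Keep only the entries for which +check+ returns true, running at most
# +pool_size+ checks at a time. Hypothetical helper, not part of ProxyFetcher.
def validate_in_pool(entries, pool_size:, check:)
  queue = Queue.new
  entries.each { |entry| queue << entry }
  queue.close # pop returns nil once the closed queue is drained

  alive = Queue.new # thread-safe collector for live entries
  workers = Array.new([pool_size, entries.size].min) do
    Thread.new do
      while (entry = queue.pop)
        alive << entry if check.call(entry)
      end
    end
  end
  workers.each(&:join)

  collected = []
  collected << alive.pop until alive.empty?
  entries & collected # restore the original order
end

entries = %w[10.0.0.1:80 10.0.0.2:80 10.0.0.3:80 10.0.0.4:80]
stub_check = ->(entry) { !entry.start_with?('10.0.0.3') } # pretend one is dead
live = validate_in_pool(entries, pool_size: 2, check: stub_check)
puts live.inspect
```

Because the checks are I/O-bound (mostly waiting on sockets), more threads generally means more checks in flight, up to whatever the machine and OS handle comfortably.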
Moreover, overall proxy validation speed depends on the ProxyFetcher.config.proxy_validation_timeout option, which is 3 by default. This means the gem will wait up to 3 seconds for a server response when checking whether a particular proxy is connectable. You can decrease this option to 1, for example, which will greatly increase proxy validation speed (but remember that some proxies are connectable yet slow, so with a lower timeout you will drop proxies from the list that work, just very slowly).
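To get a feel for how the two settings interact, here is a rough back-of-the-envelope estimate. It assumes the worst case, where every check runs for the full timeout and the pool is always busy, so real runs should usually be faster. worst_case_seconds is a hypothetical helper.

```ruby
# Upper-bound estimate: proxies are processed in batches of pool_size,
# and each batch takes at most +timeout+ seconds.
def worst_case_seconds(proxy_count, pool_size:, timeout:)
  batches = (proxy_count.to_f / pool_size).ceil
  batches * timeout
end

default_estimate = worst_case_seconds(500, pool_size: 10, timeout: 3) # 50 batches * 3 s
tuned_estimate   = worst_case_seconds(500, pool_size: 50, timeout: 1) # 10 batches * 1 s

puts "default: #{default_estimate}s, tuned: #{tuned_estimate}s"
# prints "default: 150s, tuned: 10s"
```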
Every proxy is a ProxyFetcher::Proxy object that has the following readers (instance variables):
- addr (IP address)
- port
- type (proxy type, can be HTTP, HTTPS, SOCKS4 and/or SOCKS5)
- country (USA or Brazil, for example)
- response_time (5217, for example)
- anonymity (Low, Elite proxy or High +KA, for example)
You can also call the following instance methods on every Proxy object:
- connectable? (whether the proxy server is available)
- http? (whether the proxy server uses the HTTP protocol)
- https? (whether the proxy server uses the HTTPS protocol)
- socks4?
- socks5?
- uri (returns a URI::Generic object)
- url (returns a formatted URL like "IP:PORT", or "http://IP:PORT" if scheme: true is provided)
Currently ProxyFetcher can work with the following proxy providers (services):
- Free Proxy List
- Free SSL Proxies
- Free Socks Proxies
- Free US Proxies
- HTTP Tunnel Genius
- Proxy List
- XRoxy
- Proxypedia
- Proxyscrape
- MTPro.xyz
If you want to use one of them, just set it in the config:
ProxyFetcher.config.provider = :free_proxy_list
manager = ProxyFetcher::Manager.new
manager.proxies
#=> ...
You can use multiple providers at the same time:
ProxyFetcher.config.providers = :free_proxy_list, :xroxy, :proxy_docker
manager = ProxyFetcher::Manager.new
manager.proxies
#=> ...
If you want to use all the possible proxy providers then you can configure ProxyFetcher as follows:
ProxyFetcher.config.providers = ProxyFetcher::Configuration.registered_providers
manager = ProxyFetcher::Manager.new
manager.proxies
#=> [#<ProxyFetcher::Proxy:0x00000002879680 @addr="97.77.104.22", @port=3128, @country="USA",
# @response_time=5217, @type="HTTP", @anonymity="High">, ... ]
Moreover, you can write your own provider! All you need is to create a class that inherits from the ProxyFetcher::Providers::Base class and register your provider like this:
ProxyFetcher::Configuration.register_provider(:your_provider, YourProviderClass)
The provider class must implement the self.load_proxy_list and #to_proxy(html_element) methods, which load and parse the provider's HTML page with the proxy list. Take a look at the existing providers in the lib/proxy_fetcher/providers directory.
You are very welcome to help improve ProxyFetcher if you have suggestions for features that other people can use.
To contribute:
- Fork the project.
- Create your feature branch (git checkout -b my-new-feature).
- Implement your feature or bug fix.
- Add documentation for your feature or bug fix.
- Run rake doc:yard. If your changes are not 100% documented, go back to the previous step.
- Add tests for your feature or bug fix.
- Run rake spec to make sure all tests pass.
- Commit your changes (git commit -am 'Add new feature').
- Push to the branch (git push origin my-new-feature).
- Create a new pull request.
Thanks.
The proxy_fetcher gem is released under the MIT License.
Copyright (c) 2017-2018 Nikita Bulai.