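Update with currently working, reliable CNN source code for character CAPTCHA recognition; move the original content into a subdirectory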
shisilu
committed
Apr 21, 2020
1 parent
4974257
commit 07c4b03
Showing
6,378 changed files
with
71,939 additions
and
71,264 deletions.
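@@ -1,52 +1,61 @@
# Anti-Anti-Spider
# author@luyishisi & leng-yue
# 2016-10-24 begin #2017-5-8 end

Special note: this project grew out of organizing my own crawler code and techniques, so some of the crawler source code may no longer work because the target sites have been redesigned. For CAPTCHA recognition, my improved version is used at work and cannot be open-sourced, so the project uses 熊猫's CNN model and 冷月's slider-cracking model. Both were personally tested as working before upload and are included with their authors' permission.

Repository: https://github.com/luyishisi/Anti-Anti-Spider (stars welcome)

This project is maintained by URLTEAM.

Author's blog: https://www.urlteam.org

Project overview:

A general-purpose code library of anti-anti-crawler techniques that uses request forgery, browser spoofing, browser automation, image processing, IP handling, and similar methods, intended to speed up future development.

It provides base code so that future collection tasks can get started quickly.

The project now contains sample code for a number of techniques.
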
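Project origin

The original idea was to build a site for tackling anti-crawler techniques. While summarizing the various techniques, I realized that the anti-anti-crawler techniques could be preserved directly in code.

That way, when a later collection task needs it, you can quickly and effectively test which anti-crawling measures a site uses, and quickly reuse the code.

What you can do: submit websites you find hard to crawl. Contact: [email protected]

Project structure tree (to be updated):

https://github.com/luyishisi/Anti-Anti-Spider/blob/master/tree.txt
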
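Key projects:

1: CAPTCHAs {Amazon CAPTCHA cracking, kNN, SVM, TensorFlow automatically generating CAPTCHAs and training on them at scale to crack them, 98% success rate}

2: Proxies {scraping Xici proxies and a highly available overseas proxy site, storing them in a database so they can be used at any time}

3: Code templates {multithreading optimization, Baidu Maps visual collection, focused crawler, Selenium simulated login, domain crawler}

5: Crawler project source code {Youku, Tencent Video, Twitter, Lagou, Baidu Maps, Meizitu, Baijiahao, Baidu Baike, CSDN, Sina Weibo, Taobao collection}

6: IP rotation techniques {proxies, Tor, ADSL}

7: Request forgery {PhantomJS, requests, Selenium}

8: PhantomJS {forging request headers, taking page screenshots, getting page source, setting timeouts}

9: Selenium {forging request headers, Alipay simulated login}

UrlSpider {collection code samples commonly used across the projects, optimized with multithreading and database operations, peak speed 6kw/d (about 60 million per day)}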
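## CNN-based CAPTCHA image recognition
### Introduction
This project provides an AlexNet model and a LeNet model (`alexnet_model` and `Letnet_model` in the code); choose between them as needed by editing the `train` function in train_model.py. A figure of 95.5% is quoted, presumably the recognition accuracy.
### A word from the author
Before I knew it, this git repository had accompanied me from 2016 to 2020 and given me the best stretch of my life so far.
I put this document together in the hope that anyone who wants to learn image recognition or play with convolutional neural networks can get hands-on as easily as possible.
Please use this technology responsibly. It is for learning only; no black- or gray-market use is supported.
See also: https://www.urlteam.cn/?p=1893 https://www.urlteam.cn/?p=1406
All of the original Anti-Anti-Spider content has been moved into the 原Anti-Anti-Spider directory.
For questions, email [email protected]
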
#### AlexNet model structure

![](src/READMEIMG2.PNG)

Training time varies considerably with the complexity of the CAPTCHA.
![](src/READMEIMG1.PNG)

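### Usage
1. Before training, edit conf/config.json
2. Split the preprocessed dataset into a validation set and a training set, and put them under the sample directory
3. Run train_model.py to start training; the trained model is saved to model_result
4. Place the trained model in model_result and run cnn_models/recognition.py; select a CAPTCHA to see how the model performs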
### Environment setup
TensorFlow CPU version: `pip install tensorflow==1.9.0`
TensorFlow GPU version: `pip install tensorflow-gpu==1.9.0`
Installing the GPU version is more involved: CUDA and cuDNN must be installed before TensorFlow can use the GPU.
The figure below shows the version compatibility between TensorFlow, Python, CUDA, and cuDNN:
![](./src/README_IMG0.PNG)
Installing CUDA and cuDNN takes two main steps:

1. Download CUDA from the official site and install it
2. Extract cuDNN and copy it into the CUDA installation directory

Links for both downloads:
CUDA: `https://developer.nvidia.com/cuda-toolkit-archive`
cuDNN: `https://developer.nvidia.com/rdp/cudnn-archive`
A web search will turn up more detailed installation instructions (the two steps differ between Linux and Windows).
### Project structure
```
├─cnn_models
│  ├─cnn_model.py      # CNN network class
│  └─recognition.py    # check the training result
├─conf
│  └─config.json       # configuration file
├─logs                 # training logs
├─model_result         # saved models
│  └─1040              # a complete CAPTCHA training set and its trained model
├─sample
│  ├─test              # validation set
│  └─train             # training set (the full dataset is typically split 9:1 into training and validation)
├─src                  # tools needed for environment setup; download them yourself as needed
├─train_model.py       # training script
└─verify_sample.py     # dataset preparation (labeling and image preprocessing)
```
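The 9:1 split mentioned in the tree can be sketched in a few lines. This is only an illustration under stated assumptions: `split_dataset`, its `seed` parameter, and the generated file names are hypothetical and are not taken from verify_sample.py or train_model.py.

```python
import random


def split_dataset(filenames, train_ratio=0.9, seed=0):
    """Shuffle file names reproducibly and return (train, validation) lists."""
    names = sorted(filenames)           # sort first so the result does not depend on input order
    random.Random(seed).shuffle(names)  # private RNG: the global random state is left untouched
    cut = int(len(names) * train_ratio)
    return names[:cut], names[cut:]


# Hypothetical file names that follow the <info>_<label>.jpg naming pattern.
files = ["1040_2019-10-13_{}_{:04d}.jpg".format(i, i) for i in range(20)]
train_files, val_files = split_dataset(files)
print(len(train_files), len(val_files))  # 18 2
```

The two lists would then be copied into sample/train and sample/test respectively.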
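### Image preprocessing
+ Label each CAPTCHA image, for example:
![](./src/1040_2019-10-13_10_1092.jpg)
The file is named 1040_2019-10-13_10_1092.jpg, where 1092 is the label and the rest is auxiliary information that you can change as needed; just separate the fields with `_`
+ Because the model requires 227*227 input, images have to be resized; verify_sample.py provides utility functions for this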
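As a minimal sketch of the naming convention above, the label can be recovered as the last `_`-separated field of the file name. The helper `label_from_filename` is hypothetical and is not necessarily how verify_sample.py does it.

```python
import os


def label_from_filename(path):
    """Return the label: the last '_'-separated field of the file name, without its extension."""
    stem = os.path.splitext(os.path.basename(path))[0]
    return stem.rsplit("_", 1)[-1]


print(label_from_filename("sample/train/1040_2019-10-13_10_1092.jpg"))  # 1092
```

Splitting from the right means the auxiliary fields may themselves contain any number of `_` characters without affecting the label.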
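### Notes
AlexNet input must be a 227*227 image, so during preprocessing either resize the image with PIL's resizing functions (linear interpolation) or scale it and paste it onto a 227*227 background.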
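The "scale, then paste onto a background" option comes down to a small amount of geometry. The sketch below computes only the target size and paste offset; `fit_on_canvas` is a hypothetical helper, and centering the image is an assumption rather than something the project specifies.

```python
def fit_on_canvas(width, height, canvas=227):
    """Scale (width, height) to fit inside a canvas x canvas square, keeping the aspect ratio.

    Returns (new_width, new_height, offset_x, offset_y); the offsets center
    the scaled image on the canvas.
    """
    scale = min(canvas / width, canvas / height)
    new_w = min(canvas, max(1, round(width * scale)))
    new_h = min(canvas, max(1, round(height * scale)))
    return new_w, new_h, (canvas - new_w) // 2, (canvas - new_h) // 2


print(fit_on_canvas(160, 60))  # (227, 85, 0, 71)
```

With PIL, these values would typically be passed to `Image.resize((new_w, new_h))` and then to `Image.paste(img, (offset_x, offset_y))` on a blank 227*227 image.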
@@ -0,0 +1 @@
# encoding: utf-8
@@ -0,0 +1,167 @@
# encoding:utf-8

import numpy as np
import random

import tensorflow as tf
from PIL import Image

class CNN(object):

    def __init__(self, image_height, image_width, max_captcha, char_set, model_save_dir):
        self.image_height = image_height  # image height
        self.image_width = image_width  # image width
        self.char_set = char_set  # character set
        self.char_set_len = len(char_set)  # size of the character set
        self.max_captcha = max_captcha  # number of characters in a CAPTCHA
        self.model_save_dir = model_save_dir  # model save path
        with tf.name_scope('parameters'):
            self.w_alpha = 0.01
            self.b_alpha = 0.1
        with tf.name_scope('data'):
            self.X = tf.placeholder(tf.float32, [None, self.image_height * self.image_width])  # flattened image features
            self.Y = tf.placeholder(tf.float32, [None, self.max_captcha * self.char_set_len])  # labels
            self.keep_prob = tf.placeholder(tf.float32)  # dropout keep probability

    @staticmethod
    def convert2gray(img):
        """
        Convert an image to grayscale.
        """
        if len(img.shape) > 2:
            r, g, b = img[:, :, 0], img[:, :, 1], img[:, :, 2]
            gray = 0.2989 * r + 0.5870 * g + 0.1140 * b
            return gray
        else:
            return img

    def text2vec(self, text):
        """
        Encode a label string as a concatenation of one-hot vectors, one per character.
        """
        text_len = len(text)
        if text_len > self.max_captcha:
            raise ValueError('A CAPTCHA may contain at most {} characters'.format(self.max_captcha))

        vector = np.zeros(self.max_captcha * self.char_set_len)

        for i, ch in enumerate(text):
            idx = i * self.char_set_len + self.char_set.index(ch)
            vector[idx] = 1
        return vector

    def alexnet_model(self):
        '''CNN model: the input is self.X, the output is y_predict.'''
        x = tf.reshape(self.X, shape=[-1, self.image_height, self.image_width, 1])

        with tf.name_scope("conv1") as scope:
            kernel1 = tf.Variable(tf.truncated_normal([11, 11, 1, 96], mean=0, stddev=0.1,
                                                      dtype=tf.float32))
            conv = tf.nn.conv2d(x, kernel1, [1, 4, 4, 1], padding="SAME")
            biases = tf.Variable(tf.constant(0, shape=[96], dtype=tf.float32), trainable=True)
            bias = tf.nn.bias_add(conv, biases)
            conv1 = tf.nn.relu(bias)
            lrn1 = tf.nn.lrn(conv1, 4, bias=1, alpha=1e-3 / 9, beta=0.75)
            pool1 = tf.nn.max_pool(lrn1, ksize=[1, 3, 3, 1], strides=[1, 2, 2, 1], padding="VALID")

        with tf.name_scope('conv2') as scope:
            kernel2 = tf.Variable(tf.truncated_normal([5, 5, 96, 256], mean=0, stddev=0.1,
                                                      dtype=tf.float32))
            conv = tf.nn.conv2d(pool1, kernel2, [1, 1, 1, 1], padding="SAME")
            biases = tf.Variable(tf.constant(0, shape=[256], dtype=tf.float32), trainable=True)
            bias = tf.nn.bias_add(conv, biases)
            conv2 = tf.nn.relu(bias)
            lrn2 = tf.nn.lrn(conv2, 4, bias=1, alpha=1e-3 / 9, beta=0.75)
            pool2 = tf.nn.max_pool(lrn2, ksize=[1, 3, 3, 1], strides=[1, 2, 2, 1], padding="VALID")

        with tf.name_scope('conv3') as scope:
            kernel3 = tf.Variable(tf.truncated_normal([3, 3, 256, 384], mean=0, stddev=0.1,
                                                      dtype=tf.float32))
            conv = tf.nn.conv2d(pool2, kernel3, [1, 1, 1, 1], padding="SAME")
            biases = tf.Variable(tf.constant(0, shape=[384], dtype=tf.float32), trainable=True)
            bias = tf.nn.bias_add(conv, biases)
            conv3 = tf.nn.relu(bias)

        with tf.name_scope('conv4') as scope:
            kernel4 = tf.Variable(tf.truncated_normal([3, 3, 384, 384], mean=0, stddev=0.1,
                                                      dtype=tf.float32))
            conv = tf.nn.conv2d(conv3, kernel4, [1, 1, 1, 1], padding="SAME")
            biases = tf.Variable(tf.constant(0, shape=[384], dtype=tf.float32), trainable=True)
            bias = tf.nn.bias_add(conv, biases)
            conv4 = tf.nn.relu(bias)

        with tf.name_scope('conv5') as scope:
            kernel5 = tf.Variable(tf.truncated_normal([3, 3, 384, 256], mean=0, stddev=0.1,
                                                      dtype=tf.float32))
            conv = tf.nn.conv2d(conv4, kernel5, [1, 1, 1, 1], padding="SAME")
            biases = tf.Variable(tf.constant(0, shape=[256], dtype=tf.float32), trainable=True)
            bias = tf.nn.bias_add(conv, biases)
            conv5 = tf.nn.relu(bias)
            pool5 = tf.nn.max_pool(conv5, ksize=[1, 3, 3, 1], strides=[1, 2, 2, 1], padding="VALID")

        with tf.name_scope('fc1') as scope:
            # With a 227x227 input, pool5 is 6x6x256, which is what the flatten size below expects.
            pool5 = tf.reshape(pool5, (-1, 6 * 6 * 256))
            weight6 = tf.Variable(tf.truncated_normal([6 * 6 * 256, 4096], stddev=0.1, dtype=tf.float32))
            ful_bias1 = tf.Variable(tf.constant(0.0, dtype=tf.float32, shape=[4096]), name="ful_bias1")
            ful_con1 = tf.nn.relu(tf.add(tf.matmul(pool5, weight6), ful_bias1))

        with tf.name_scope('fc2') as scope:
            weight7 = tf.Variable(tf.truncated_normal([4096, 4096], stddev=0.1, dtype=tf.float32))
            ful_bias2 = tf.Variable(tf.constant(0.0, dtype=tf.float32, shape=[4096]), name="ful_bias2")
            ful_con2 = tf.nn.relu(tf.add(tf.matmul(ful_con1, weight7), ful_bias2))

        with tf.name_scope('fc3') as scope:
            weight8 = tf.Variable(tf.truncated_normal([4096, 1000], stddev=0.1, dtype=tf.float32),
                                  name="weight8")
            ful_bias3 = tf.Variable(tf.constant(0.0, dtype=tf.float32, shape=[1000]), name="ful_bias3")
            ful_con3 = tf.nn.relu(tf.add(tf.matmul(ful_con2, weight8), ful_bias3))

        with tf.name_scope('y_prediction'):
            weight9 = tf.Variable(tf.truncated_normal([1000, self.char_set_len*self.max_captcha], stddev=0.1), dtype=tf.float32)
            bias9 = tf.Variable(tf.constant(0.0, shape=[self.char_set_len*self.max_captcha]), dtype=tf.float32)
            y_predict = tf.matmul(ful_con3, weight9) + bias9

        return y_predict

    def Letnet_model(self):
        '''CNN model: the input is self.X, the output is y_predict.'''
        x = tf.reshape(self.X, shape=[-1, self.image_height, self.image_width, 1])

        # w_c1_alpha = np.sqrt(2.0/(IMAGE_HEIGHT*IMAGE_WIDTH)) #
        # w_c2_alpha = np.sqrt(2.0/(3*3*32))
        # w_c3_alpha = np.sqrt(2.0/(3*3*64))
        # w_d1_alpha = np.sqrt(2.0/(8*32*64))
        # out_alpha = np.sqrt(2.0/1024)

        # 3 conv layer
        w_c1 = tf.Variable(self.w_alpha * tf.random_normal([3, 3, 1, 32]))  # sampled from a normal distribution
        b_c1 = tf.Variable(self.b_alpha * tf.random_normal([32]))
        conv1 = tf.nn.relu(tf.nn.bias_add(tf.nn.conv2d(x, w_c1, strides=[1, 1, 1, 1], padding='SAME'), b_c1))
        conv1 = tf.nn.max_pool(conv1, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME')
        conv1 = tf.nn.dropout(conv1, self.keep_prob)

        w_c2 = tf.Variable(self.w_alpha * tf.random_normal([3, 3, 32, 64]))
        b_c2 = tf.Variable(self.b_alpha * tf.random_normal([64]))
        conv2 = tf.nn.relu(tf.nn.bias_add(tf.nn.conv2d(conv1, w_c2, strides=[1, 1, 1, 1], padding='SAME'), b_c2))
        conv2 = tf.nn.max_pool(conv2, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME')
        conv2 = tf.nn.dropout(conv2, self.keep_prob)

        w_c3 = tf.Variable(self.w_alpha * tf.random_normal([3, 3, 64, 64]))
        b_c3 = tf.Variable(self.b_alpha * tf.random_normal([64]))
        conv3 = tf.nn.relu(tf.nn.bias_add(tf.nn.conv2d(conv2, w_c3, strides=[1, 1, 1, 1], padding='SAME'), b_c3))
        conv3 = tf.nn.max_pool(conv3, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME')
        conv3 = tf.nn.dropout(conv3, self.keep_prob)

        # Fully connected layer
        # 8 * 20 * 64 matches a 60x160 input after three 2x2 SAME pools (60 -> 8, 160 -> 20);
        # other input sizes need a different flatten size here.
        w_d = tf.Variable(self.w_alpha * tf.random_normal([8 * 20 * 64, 1024]))
        b_d = tf.Variable(self.b_alpha * tf.random_normal([1024]))
        dense = tf.reshape(conv3, [-1, w_d.get_shape().as_list()[0]])
        dense = tf.nn.relu(tf.add(tf.matmul(dense, w_d), b_d))
        dense = tf.nn.dropout(dense, self.keep_prob)

        w_out = tf.Variable(self.w_alpha * tf.random_normal([1024, self.char_set_len * self.max_captcha]))
        b_out = tf.Variable(self.b_alpha * tf.random_normal([self.char_set_len * self.max_captcha]))
        y_predict = tf.add(tf.matmul(dense, w_out), b_out)
        # out = tf.nn.softmax(out)
        return y_predict
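The hard-coded flatten sizes in the two models (`6 * 6 * 256` in `alexnet_model`, `8 * 20 * 64` in `Letnet_model`) follow from TensorFlow's output-size rules: `ceil(size / stride)` for `SAME` padding and `floor((size - ksize) / stride) + 1` for `VALID` padding. The sketch below re-derives both numbers in plain Python. The helper names are hypothetical, and the 60x160 input for the LeNet-style model is inferred from the `8 * 20 * 64` constant rather than stated in the source.

```python
import math


def same_out(size, stride):
    """Output size of a 'SAME'-padded conv or pool: ceil(size / stride)."""
    return math.ceil(size / stride)


def valid_pool_out(size, ksize, stride):
    """Output size of a 'VALID' pool: floor((size - ksize) / stride) + 1."""
    return (size - ksize) // stride + 1


def alexnet_flat_size(side):
    s = same_out(side, 4)         # conv1: stride 4, SAME
    s = valid_pool_out(s, 3, 2)   # pool1: 3x3, stride 2, VALID
    s = valid_pool_out(s, 3, 2)   # pool2 (conv2 is stride 1, SAME, so it keeps the size)
    s = valid_pool_out(s, 3, 2)   # pool5 (conv3 to conv5 also keep the size)
    return s * s * 256            # 256 channels after conv5


def letnet_flat_size(height, width):
    for _ in range(3):            # three 2x2 max pools, stride 2, SAME
        height, width = same_out(height, 2), same_out(width, 2)
    return height * width * 64    # 64 channels after the third conv layer


print(alexnet_flat_size(227))     # 9216, i.e. 6 * 6 * 256
print(letnet_flat_size(60, 160))  # 10240, i.e. 8 * 20 * 64
```

If the input size in conf/config.json is changed, rerunning this arithmetic gives the flatten size that the first fully connected layer would need.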