User Agent Explained: Browser Identification and Blocking Junk Search-Engine Crawlers by UA on the Server

July 20, 2022, 21:10:32, by ximagine


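User-Agent is an HTTP request header that lets the server identify the software making a request. It typically carries the application type, operating system, software vendor, and version number. The most familiar user agent is the web browser, which helps the user fetch, render, and interact with web content; an email reader can likewise be called a mail user agent. For a search engine, the crawler is the user agent that fetches and interprets web pages on its behalf. In this post, 荒岛 introduces the UA strings of common search-engine crawlers and shows how to block junk crawlers.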
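To make the header concrete, here is a minimal Python sketch that builds (but does not send) a request carrying a custom User-Agent. The crawler name `ExampleBot/1.0` and the `example.com` URLs are made up for illustration and do not come from the original article.

```python
from urllib.request import Request

# Hypothetical crawler identity and target URL, for illustration only.
EXAMPLE_UA = "Mozilla/5.0 (compatible; ExampleBot/1.0; +https://example.com/bot.html)"


def build_request(url: str, user_agent: str = EXAMPLE_UA) -> Request:
    """Create (but do not send) an HTTP request with a User-Agent header."""
    return Request(url, headers={"User-Agent": user_agent})


req = build_request("https://example.com/")
# urllib stores header names via str.capitalize(), hence "User-agent".
print(req.get_header("User-agent"))
```

The server only ever sees the string the client chooses to send, which is why a UA is a claim about identity rather than proof of it.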

Source: https://x-imagine.com/user-agen.html (荒岛)

Common crawlers

Google crawler UA string:

compatible; Googlebot/2.1; +http://www.google.com/bot.html

Googlebot is Google's web crawler. On most sites it is probably the most active crawler, and it can bring a quality blog a lot of traffic. Besides the page crawler Googlebot, you will often see the image crawler Googlebot-Image, the AdSense crawler Mediapartners-Google, and others.

Baidu crawler UA string:

compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html

Baiduspider is Baidu's web crawler and is very common on Chinese-language sites. Besides the page crawler, you will also see Baiduboxapp from the Baidu mobile app and the rendering crawler Baiduspider-render.

Microsoft crawler UA string:

compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm

bingbot is the crawler for Microsoft's Bing search. Since Microsoft began promoting the Bing brand, its older crawler MSNBot has become increasingly rare.

360 crawler UA string:

Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36; 360Spider

360Spider is the crawler for 360 Search. 360 Search currently has a small market share, so this crawler is not seen very often.

Sogou crawler UA string:

Sogou web spider/4.0(+http://www.sogou.com/docs/help/webmasters.htm#07)

Sogou web spider is the web crawler for Sogou Search. Backed by Tencent, Sogou's market share is currently rising, so its crawler is fairly active and shows up often. Searching the access log for Sogou also turns up SogouMSE and SogouMobileBrowser. These are UA strings of the Sogou mobile browser, not crawlers.

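Shenma crawler UA string:

Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.81 YisouSpider/5.0 Safari/537.36

YisouSpider is the crawler for Shenma Search (神马搜索). In its early days Shenma crawled so aggressively that some small sites went down, which drew widespread anger. As its market share grew and its index filled out, YisouSpider became fairly restrained and no longer crawls that way. The crawler's name sounds like 宜搜 (Easou), but Shenma is not the same company as Easou, which focuses on novel search. Shenma is a mobile search engine launched after UC merged into Alibaba, whereas Easou was already well known back in the 2G WAP era.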

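Yandex crawler UA string:

compatible; YandexBot/3.0; +http://yandex.com/bots

YandexBot is the web crawler of Yandex, Russia's largest search engine and an internet giant. Yandex offers a Chinese interface and Chinese-language search, and it is one of the few foreign search engines that can currently be opened directly in China. As more Chinese users learn about Yandex, YandexBot is appearing more and more often in the logs of Chinese sites.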

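DuckDuckGo UA string:

Mozilla/5.0 (Linux; Android 10) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/81.0.4044.138 Mobile Safari/537.36 DuckDuckGo/5

DuckDuckGo is a search engine focused on privacy and security that does not track user history. Its interface is clean, and it also offers a Chinese search interface. Note that the string above has the shape of the UA sent by the DuckDuckGo Android browser; DuckDuckGo's own crawler identifies itself as DuckDuckBot.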

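Apple crawler UA string:

Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/600.2.5 (KHTML, like Gecko) Version/8.0.2 Safari/600.2.5 (Applebot/0.1)

Applebot is Apple's crawler, used mainly for Siri and Spotlight Suggestions. This kind of search is not a general-purpose search engine like Google; it powers product and in-app search.

Petal crawler UA string:

Mozilla/5.0 (compatible;PetalBot;+https://webmaster.petalsearch.com/site/petalbot)

PetalBot is the crawler of Huawei's Petal Search engine, which aims to be an alternative to Google. It uses two UAs, one serving PC search and one serving mobile search.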
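The tokens from the UA strings above are enough to label requests in a log. The following Python sketch does a case-insensitive substring match; the function name `identify_crawler` and the English engine labels are choices made for this example. Because a UA can be spoofed, the result only tells you what the client claims to be.

```python
from typing import Optional

# Tokens taken from the UA strings listed above, mapped to an engine label.
CRAWLER_TOKENS = {
    "Googlebot": "Google",
    "Baiduspider": "Baidu",
    "bingbot": "Bing",
    "360Spider": "360",
    "Sogou web spider": "Sogou",
    "YisouSpider": "Shenma",
    "YandexBot": "Yandex",
    "Applebot": "Apple",
    "PetalBot": "Petal",
}


def identify_crawler(user_agent: str) -> Optional[str]:
    """Return the engine label whose token appears in the UA, else None."""
    ua = user_agent.lower()
    for token, engine in CRAWLER_TOKENS.items():
        if token.lower() in ua:
            return engine
    return None


print(identify_crawler("Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"))
print(identify_crawler("Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/71.0.3578.98"))
```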

Junk crawlers

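Crawler	Description
FeedDemon	Content scraping
Microsoft URL Control	Scanning; useless crawler
HttpClient	TCP attack
EasouSpider	Useless crawler
AhrefsBot	Useless crawler
WinHttp	Scraping; CC attack
MJ12bot	Useless crawler
jaunty	WordPress brute-force scanner
ZmEu	phpMyAdmin vulnerability scanning
YandexBot	Useless crawler
Swiftbot	Useless crawler
YYSpider	Useless crawler
ApacheBench	CC attack tool
UniversalFeedParser	Content scraping
Feedly	Content scraping
Jullo	Content scraping
Python-urllib	Content scraping
Java	Content scraping
CrawlDaddy	SQL injection
BOT/0.1 (BOT for JCE)	SQL injection
Indy Library	Scanning
FlightDeckReports Bot	Useless crawler
Linguee Bot	Useless crawler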
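Before blocking anything, it helps to see which of these crawlers actually hit your site. The sketch below tallies junk tokens in access-log lines. It assumes the common "combined" log format, where the User-Agent is the last double-quoted field; the sample lines and the function name `count_junk_hits` are invented for illustration.

```python
import re
from collections import Counter
from typing import Iterable

# A few tokens from the table above; extend the list as needed.
JUNK_TOKENS = ["AhrefsBot", "MJ12bot", "Python-urllib", "ApacheBench", "ZmEu"]

# Assumption: "combined" log format, so the UA is the last quoted field.
LAST_QUOTED = re.compile(r'"([^"]*)"\s*$')


def count_junk_hits(lines: Iterable[str]) -> Counter:
    """Count log lines whose User-Agent contains a junk token."""
    hits = Counter()
    for line in lines:
        match = LAST_QUOTED.search(line)
        if not match:
            continue  # line does not end with a quoted field
        ua = match.group(1).lower()
        for token in JUNK_TOKENS:
            if token.lower() in ua:
                hits[token] += 1
                break  # count each line at most once
    return hits


# Hypothetical log lines, for illustration only.
sample = [
    '203.0.113.5 - - [20/Jul/2022:21:10:32 +0800] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; AhrefsBot/7.0; +http://ahrefs.com/robot/)"',
    '203.0.113.9 - - [20/Jul/2022:21:11:02 +0800] "GET /a HTTP/1.1" 200 128 "-" "Python-urllib/3.9"',
    '198.51.100.7 - - [20/Jul/2022:21:12:40 +0800] "GET /b HTTP/1.1" 200 256 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"',
]
print(count_junk_hits(sample))
```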

robots

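For crawlers that honor the robots protocol, you can forbid crawling by editing robots.txt. In the example below, the empty Disallow leaves the site open to every other crawler, and the three consecutive User-agent lines form one group that shares the rule beneath them:

User-agent: *
Disallow:

User-agent: AhrefsBot
User-agent: MJ12bot
User-agent: DotBot
Disallow: /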
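One way to sanity-check robots.txt rules before deploying them is Python's standard `urllib.robotparser`. The sketch below parses an in-memory rule set written in the spirit of the example above (the exact text, the helper name `load_rules`, and the test path are assumptions of this sketch) and asks which agents may fetch a page.

```python
from urllib.robotparser import RobotFileParser

# Allow everyone except three named bots.
ROBOTS_TXT = """\
User-agent: *
Disallow:

User-agent: AhrefsBot
Disallow: /

User-agent: MJ12bot
Disallow: /

User-agent: DotBot
Disallow: /
"""


def load_rules(text: str) -> RobotFileParser:
    """Parse robots.txt content held in a string (no network access)."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)
for agent in ("Googlebot", "AhrefsBot", "MJ12bot", "DotBot"):
    print(agent, rules.can_fetch(agent, "/any/page.html"))
```

This only predicts how a well-behaved crawler should react; a crawler that ignores robots.txt has to be blocked at the server level.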

Apache

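Edit the .htaccess file in the site directory and add the following:

# Return 403 for an empty UA or any UA containing MJ12bot (case-insensitive)
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (^$|MJ12bot) [NC]
RewriteRule ^(.*)$ - [F]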
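The condition in that rule is an ordinary regular expression: `^$` matches an empty UA, `MJ12bot` matches that token anywhere in the string, and `[NC]` makes the match case-insensitive. As a rough Python illustration (Python's `re` is not the engine Apache uses, so treat this as an approximation of the semantics), the function name `is_forbidden` being a choice for this sketch:

```python
import re

# Same alternatives as the RewriteCond pattern: empty UA, or "MJ12bot"
# anywhere in the string. re.IGNORECASE plays the role of the [NC] flag.
BLOCKED_UA = re.compile(r"^$|MJ12bot", re.IGNORECASE)


def is_forbidden(user_agent: str) -> bool:
    """True when the pattern matches, i.e. the request would get a 403."""
    return BLOCKED_UA.search(user_agent) is not None


for ua in ("", "Mozilla/5.0 (compatible; mj12bot/v1.4.8)", "compatible; Googlebot/2.1"):
    print(repr(ua), is_forbidden(ua))
```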

PHP

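Copy the following code into the site's entry file, index.php:

// Read the UA; the header may be missing entirely
$ua = $_SERVER['HTTP_USER_AGENT'] ?? '';
// Blocked USER_AGENT substrings
$now_ua = array('MJ12bot');
// Reject an empty USER_AGENT
if ($ua === '') {
	header("Content-type: text/html; charset=utf-8");
	die('Please do not scrape 荒岛');
} else {
	foreach ($now_ua as $value) {
		// Case-insensitive substring match; eregi() was removed in PHP 7
		if (stripos($ua, $value) !== false) {
			header("Content-type: text/html; charset=utf-8");
			die('Please do not scrape 荒岛');
		}
	}
}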
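The same application-level check can be expressed for a Python site as a small WSGI wrapper. This is only a sketch under assumptions: `ua_filter` and `hello_app` are hypothetical names, and unlike the PHP snippet it answers with an explicit 403 status instead of a message page.

```python
from typing import Callable, Iterable, List

BLOCKED_TOKENS = ["MJ12bot"]  # same list as $now_ua in the PHP code


def ua_filter(app: Callable, blocked: List[str] = BLOCKED_TOKENS) -> Callable:
    """Wrap a WSGI app so that empty or blocked User-Agents get a 403."""
    lowered = [token.lower() for token in blocked]

    def wrapped(environ: dict, start_response: Callable) -> Iterable[bytes]:
        # WSGI exposes the request header as HTTP_USER_AGENT; it may be absent.
        ua = environ.get("HTTP_USER_AGENT", "").lower()
        if not ua or any(token in ua for token in lowered):
            start_response("403 Forbidden",
                           [("Content-Type", "text/plain; charset=utf-8")])
            return [b"Forbidden"]
        return app(environ, start_response)

    return wrapped


def hello_app(environ, start_response):
    """Hypothetical application standing in for the real site."""
    start_response("200 OK", [("Content-Type", "text/plain; charset=utf-8")])
    return [b"hello"]


application = ua_filter(hello_app)
```

Filtering in the application still costs an interpreter invocation per request, so for heavy junk traffic a web-server rule is usually the cheaper place to block.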

Nginx

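For crawlers that do not honor the robots protocol, you can block by UA at the web-server level. Add directives like the following inside the server block of the site configuration:

server {
	# ... other site configuration ...
	# Block curl / httpclient
	if ($http_user_agent ~* "curl|httpclient") {
		return 403;
	}
	# Block MauiBot and similar crawlers
	if ($http_user_agent ~* "MauiBot|AhrefsBot|DotBot") {
		return 403;
	}
	# ... other configuration ...
}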
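When the block list grows, writing the alternation by hand gets error-prone. The following Python sketch renders such an `if` block from a list of tokens. The helper name `render_ua_block` is hypothetical, and it deliberately accepts only plain alphanumeric tokens so that no regex or nginx escaping is needed; entries such as `BOT/0.1 (BOT for JCE)` would have to be escaped by hand.

```python
import re
from typing import List

# Only letters, digits, underscore and hyphen: nothing that needs escaping.
SAFE_TOKEN = re.compile(r"[A-Za-z0-9_-]+")


def render_ua_block(tokens: List[str], status: int = 403) -> str:
    """Render an nginx `if` block that rejects UAs containing any token."""
    if not tokens:
        # An empty alternation would match every request.
        raise ValueError("at least one token is required")
    for token in tokens:
        if not SAFE_TOKEN.fullmatch(token):
            raise ValueError(f"unsupported token: {token!r}")
    pattern = "|".join(tokens)
    return (
        f'if ($http_user_agent ~* "{pattern}") {{\n'
        f"\treturn {status};\n"
        "}\n"
    )


print(render_ua_block(["MauiBot", "AhrefsBot", "DotBot"]))
```

The output is meant to be pasted inside a server block and checked with the usual nginx configuration test before reloading.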
