Monitoring SERP Feature Changes with Python

Below is a complete approach to monitoring changes in search engine results page (SERP) features with Python, covering anti-scraping evasion, dynamic content parsing, and feature-change analysis:

I. Technical Design
```mermaid
graph LR
    A[Keyword list] --> B[Proxy IP pool + request scheduling]
    B --> C{Search engine}
    C -->|Google| D[Emulate real browser behavior]
    C -->|Bing| E[Direct API call]
    D --> F[Dynamic page rendering]
    E --> F
    F --> G[Structured data extraction]
    G --> H[Feature diffing]
    H --> I[Change alerts]
```
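
The Google branch is handled with browser automation in Section III; the Bing branch can skip rendering and call the Web Search API directly. A minimal sketch, assuming a Bing Web Search API v7 subscription key (replace `YOUR_KEY` with your own):

```python
import requests

def get_bing_serp(keyword, api_key="YOUR_KEY"):
    # Direct API call for the Bing branch of the diagram
    resp = requests.get(
        "https://api.bing.microsoft.com/v7.0/search",
        headers={"Ocp-Apim-Subscription-Key": api_key},
        params={"q": keyword, "count": 20},
        timeout=15,
    )
    resp.raise_for_status()
    return resp.json()  # organic results live under webPages["value"]
```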

II. Anti-Scraping Evasion

1. Request disguise system
```python
import random
from fake_useragent import UserAgent

class RequestDisguise:
    def __init__(self):
        self.ua = UserAgent()
        self.header_templates = [
            {
                "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9",
                "Accept-Language": "en-US,en;q=0.5",
                "Referer": "https://www.google.com/"
            },
            {
                "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
                "Accept-Language": "zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3",
                "Referer": "https://www.bing.com/"
            }
        ]
    
    def get_headers(self):
        # Copy the template so repeated calls do not mutate the shared list
        base_header = dict(random.choice(self.header_templates))
        base_header.update({
            "User-Agent": self.ua.random,
            "Sec-Fetch-Dest": "document",
            "Sec-Fetch-Mode": "navigate"
        })
        return base_header
```

2. Adaptive request throttling
```python
import time
import numpy as np

class RequestThrottle:
    def __init__(self, base_delay=5.0, jitter=0.3):
        self.base_delay = base_delay
        self.jitter = jitter
        self.last_request = 0
        
    def wait(self):
        current = time.time()
        elapsed = current - self.last_request
        if elapsed < self.base_delay:
            sleep_time = self.base_delay - elapsed + np.random.uniform(-self.jitter, self.jitter)
            time.sleep(max(sleep_time, 0.5))
        self.last_request = time.time()
```
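
A brief usage sketch combining the two classes with `requests` (the Bing URL is only a placeholder target; the Google path in Section III goes through Playwright instead):

```python
import requests

disguise = RequestDisguise()
throttle = RequestThrottle(base_delay=6.0, jitter=1.0)

for kw in ["python web development", "machine learning tutorial"]:
    throttle.wait()  # enforce a jittered minimum gap between requests
    resp = requests.get(
        "https://www.bing.com/search",
        params={"q": kw},
        headers=disguise.get_headers(),
        timeout=15,
    )
    print(kw, resp.status_code)
```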

III. SERP Feature Extraction

1. Dynamic rendering with Playwright
```python
import random

from playwright.sync_api import sync_playwright

def get_serp_with_playwright(keyword, proxy=None):
    with sync_playwright() as p:
        # Launch the browser (Chromium recommended)
        browser = p.chromium.launch(
            headless=True,
            proxy={"server": proxy} if proxy else None
        )
        page = browser.new_page()

        # Emulate human behaviour
        page.goto(f"https://www.google.com/search?q={keyword}")
        page.wait_for_load_state("networkidle")

        # Random scrolling
        for _ in range(random.randint(2, 5)):
            page.mouse.wheel(0, random.randint(300, 800))
            page.wait_for_timeout(random.randint(800, 1500))

        # Grab the final HTML
        html = page.content()
        browser.close()
        return html
```

2. Structured feature parser
```python
from bs4 import BeautifulSoup

class SERPParser:
    def __init__(self, html):
        self.soup = BeautifulSoup(html, 'lxml')

    def extract_features(self):
        return {
            "organic_results": self._get_organic_results(),
            "ads_count": len(self.soup.select('div[data-text-ad]')),
            "featured_snippet": self._get_featured_snippet(),
            "knowledge_panel": self._get_knowledge_panel(),
            "people_also_ask": self._get_people_also_ask(),
            "related_searches": self._get_related_searches()
        }

    def _get_organic_results(self):
        results = []
        # Position is taken from DOM order; Google's class names change often, adjust as needed
        for position, item in enumerate(self.soup.select('div.g:not(.related-question-pair)'), start=1):
            description_node = item.select_one('div.IsZvec, div.VwiC3b')
            results.append({
                "title": item.select_one('h3').text if item.select_one('h3') else None,
                "url": item.select_one('a[href]')['href'] if item.select_one('a[href]') else None,
                "description": description_node.text if description_node else None,
                "position": position
            })
        return results

    def _get_featured_snippet(self):
        snippet = self.soup.select_one('div.ifM9O')
        if not snippet:
            return None
        return {
            "text": snippet.text,
            "type": "table" if snippet.select('table') else "paragraph"
        }

    # The selectors below are illustrative only; Google's markup changes frequently
    def _get_knowledge_panel(self):
        panel = self.soup.select_one('div.kp-wholepage')
        return panel.get_text(" ", strip=True) if panel else None

    def _get_people_also_ask(self):
        return [q.get_text(strip=True) for q in self.soup.select('div.related-question-pair')]

    def _get_related_searches(self):
        return [a.get_text(strip=True) for a in self.soup.select('div#botstuff a')]
```

IV. Feature Change Detection

1. Semantic similarity comparison
```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')

class FeatureComparator:
    def __init__(self, historical_data):
        self.history = historical_data

    def detect_changes(self, new_data, threshold=0.85):
        changes = []
        # Compare organic results; guard against missing titles/descriptions
        old_embeddings = model.encode([(res['title'] or '') + ' ' + (res['description'] or '') for res in self.history['organic_results']])
        new_embeddings = model.encode([(res['title'] or '') + ' ' + (res['description'] or '') for res in new_data['organic_results']])
        
        similarity_matrix = cosine_similarity(old_embeddings, new_embeddings)
        for i in range(len(self.history['organic_results'])):
            max_sim = max(similarity_matrix[i])
            if max_sim < threshold:
                changes.append({
                    "type": "organic_change",
                    "old_position": i+1,
                    "new_position": None,
                    "similarity": max_sim
                })
        return changes
```

2. Dynamic alert thresholds
```python
import numpy as np

class DynamicThreshold:
    def __init__(self, window_size=30):
        self.scores = []
        self.window = window_size
        
    def update(self, new_score):
        self.scores.append(new_score)
        if len(self.scores) > self.window:
            self.scores.pop(0)
            
    def get_threshold(self, sensitivity=2):
        if len(self.scores) < 5:
            return 0.7
        mean = np.mean(self.scores)
        std = np.std(self.scores)
        return max(0.5, mean - sensitivity*std)
```
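
A short usage sketch with made-up volatility scores, showing how the alert threshold adapts once at least five observations have accumulated:

```python
dt = DynamicThreshold(window_size=30)
for score in [0.92, 0.88, 0.95, 0.90, 0.86, 0.61]:
    dt.update(score)

# Returns the fixed 0.7 default until 5 scores exist, then mean - 2*std (floored at 0.5)
print(round(dt.get_threshold(), 3))
```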

V. Complete Workflow Example

```python
import random

# Configuration
KEYWORDS = ["python web development", "machine learning tutorial"]
PROXY_POOL = ["111.222.33.44:8080", "222.111.55.66:3128"]

# Initialise components
disguise = RequestDisguise()
throttle = RequestThrottle()
threshold = DynamicThreshold()

# Main loop (load_historical_data, send_alert, calculate_volatility and
# handle_error are user-supplied helpers; a sketch follows below)
for keyword in KEYWORDS:
    throttle.wait()
    proxy = random.choice(PROXY_POOL)

    try:
        # Fetch the SERP
        html = get_serp_with_playwright(keyword, proxy)
        parser = SERPParser(html)
        current_features = parser.extract_features()

        # Load historical data
        historical = load_historical_data(keyword)

        # Detect changes
        comparator = FeatureComparator(historical)
        changes = comparator.detect_changes(current_features,
                                            threshold=threshold.get_threshold())

        # Trigger an alert
        if len(changes) > 0:
            send_alert({
                "keyword": keyword,
                "changes": changes,
                "snapshot": current_features
            })

        # Update the threshold model
        threshold.update(calculate_volatility(changes))

    except Exception as e:
        handle_error(e)
```
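
The loop assumes helpers that are not defined in this article. A minimal, file-based sketch of what they might look like, with snapshots stored as JSON on disk and alerts printed to stdout (adapt to your own storage and notification channels; `save_snapshot` is an extra convenience for persisting the current run):

```python
import json
from pathlib import Path

SNAPSHOT_DIR = Path("snapshots")
SNAPSHOT_DIR.mkdir(exist_ok=True)

def load_historical_data(keyword):
    # Previous snapshot for this keyword, or an empty structure on the first run
    path = SNAPSHOT_DIR / f"{keyword}.json"
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    return {"organic_results": []}

def save_snapshot(keyword, features):
    (SNAPSHOT_DIR / f"{keyword}.json").write_text(
        json.dumps(features, ensure_ascii=False, indent=2), encoding="utf-8")

def calculate_volatility(changes):
    # Lowest similarity among changed results; 1.0 means "nothing moved"
    return min((c["similarity"] for c in changes), default=1.0)

def send_alert(payload):
    print("SERP change detected:", json.dumps(payload, ensure_ascii=False, default=str)[:500])

def handle_error(exc):
    print("Scrape failed:", exc)
```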

VI. Enhanced Anti-Scraping Measures

1. Browser fingerprint spoofing
```python
def generate_browser_fingerprint():
    return {
        "webgl_vendor": "Intel Inc.",  # 伪造显卡信息
        "user_agent": disguise.get_headers()["User-Agent"],
        "screen_resolution": "1920x1080",
        "timezone": "Asia/Shanghai",
        "plugins": ["Chrome PDF Viewer", "Widevine Content Decryption Module"]
    }
```
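
The dictionary above is just data; it still has to be applied when the browser context is created. One possible sketch with Playwright (the context options are standard Playwright parameters; the WebGL override via an init script is a simplified illustration and will not defeat serious fingerprinting):

```python
def new_disguised_context(browser, fingerprint):
    width, height = map(int, fingerprint["screen_resolution"].split("x"))
    context = browser.new_context(
        user_agent=fingerprint["user_agent"],
        viewport={"width": width, "height": height},
        timezone_id=fingerprint["timezone"],
        locale="zh-CN",
    )
    # Naive WebGL vendor override (UNMASKED_VENDOR_WEBGL = 0x9245 = 37445)
    context.add_init_script(f"""
        const getParameter = WebGLRenderingContext.prototype.getParameter;
        WebGLRenderingContext.prototype.getParameter = function(p) {{
            if (p === 37445) return "{fingerprint['webgl_vendor']}";
            return getParameter.call(this, p);
        }};
    """)
    return context
```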

2. TLS fingerprint evasion
Launch with a customized browser profile; driving a real Chromium instance also yields genuine TLS fingerprints, unlike plain HTTP clients:
```python
browser = p.chromium.launch(  # inside a "with sync_playwright() as p" block
    args=[
        "--ignore-certificate-errors",
        "--user-agent=" + disguise.get_headers()["User-Agent"],
        "--disable-blink-features=AutomationControlled"
    ]
)
```

VII. Data Storage and Analysis

1. Time-series table schema
```sql
CREATE TABLE serp_features (
    keyword VARCHAR(255) NOT NULL,
    timestamp DATETIME DEFAULT CURRENT_TIMESTAMP,
    organic_results JSON,
    ads_count INT,
    featured_snippet TEXT,
    PRIMARY KEY (keyword, timestamp)
);

CREATE TABLE change_events (
    event_id INT AUTO_INCREMENT,
    keyword VARCHAR(255) NOT NULL,
    change_type ENUM('position', 'new_feature', 'content'),
    confidence FLOAT,
    details JSON,
    PRIMARY KEY (event_id)
);
```
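
A short sketch of writing one snapshot into `serp_features`, shown here with `pymysql` (an assumption; any DB-API driver with `%s` placeholders works the same way):

```python
import json
import pymysql

def store_snapshot(conn, keyword, features):
    with conn.cursor() as cur:
        cur.execute(
            "INSERT INTO serp_features (keyword, organic_results, ads_count, featured_snippet) "
            "VALUES (%s, %s, %s, %s)",
            (
                keyword,
                json.dumps(features["organic_results"], ensure_ascii=False),
                features["ads_count"],
                json.dumps(features["featured_snippet"], ensure_ascii=False),
            ),
        )
    conn.commit()

conn = pymysql.connect(host="localhost", user="serp", password="***", database="serp_monitor")
store_snapshot(conn, "python web development", current_features)
```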

2. Visualizing ranking trends
```python
import plotly.express as px

def plot_position_changes(keyword):
    df = load_from_db(keyword)  # user-supplied query returning timestamp/position rows
    fig = px.line(df, x='timestamp', y='position',
                  title=f'"{keyword}" ranking trend',
                  markers=True)
    fig.update_yaxes(autorange="reversed")  # lower rank numbers are better
    fig.show()
```

VIII. Legal Compliance

1. Respect robots.txt
   ```python
   import robotexclusionrulesparser

   rerp = robotexclusionrulesparser.RobotExclusionRulesParser()
   rerp.fetch("https://www.google.com/robots.txt")
   if not rerp.is_allowed("*", "/search"):
       raise Exception("Google Search disallows crawler access to /search")
   ```

2. Request rate control
   ```python
   # Keep the request rate conservative (typically well under 10 requests per minute)
   throttle = RequestThrottle(base_delay=7.0, jitter=2.0)
   ```

With this setup you get:
- Continuous monitoring: keyword data refreshed on a fixed schedule (e.g. every 30 minutes)
- Precise detection: title/description changes flagged when semantic similarity drops below 85%
- Anti-scraping evasion: browser fingerprinting, proxy pool, and request disguise combined
- Trend forecasting: alert thresholds adjusted dynamically from historical volatility

Notes:
- For commercial use, prefer official APIs (e.g. the Google Search Console API)
- Large-scale scraping calls for a distributed crawler deployment
- Update the browser fingerprint profiles regularly
