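A Deep Dive into the requests-html Source: How the Core Components Work
requests-html is a Pythonic HTML parsing library designed for humans; it pairs capable HTML parsing with an intuitive API. This article walks through the core architecture of [requests_html.py](https://link.gitcode.com/i/d5dccb36a1da4300f3960935e2d2000b) and examines how its key components are implemented, to help developers understand how the library works internally and the design thinking behind it.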
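Core class hierarchy
The core functionality of requests-html is built around four main classes, linked by inheritance and collaboration:
- BaseParser: the base class for all parsing functionality; it defines the core HTML parsing interface
- Element: represents a single element in an HTML document and provides element-level operations; it inherits from BaseParser
- HTML: represents a complete HTML document; it inherits from BaseParser and adds document-level functionality
- HTMLResponse: an enhanced response object that combines a requests response with HTML parsing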
[Figure: class relationship diagram]
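The relationships listed above can be sketched with a few stand-in classes. This is only an illustrative skeleton, not the library's code: the `Response` class is a hypothetical placeholder for `requests.Response`, and the constructors are simplified assumptions.

```python
class BaseParser:
    """Stand-in for the shared parsing base class."""

    def __init__(self, html: str, url: str = "") -> None:
        self.html = html
        self.url = url


class Element(BaseParser):
    """One node of a document; inherits the parsing helpers."""


class HTML(BaseParser):
    """A whole document; inherits the same helpers."""


class Response:
    """Hypothetical placeholder for requests.Response."""

    def __init__(self, url: str, text: str) -> None:
        self.url = url
        self.text = text


class HTMLResponse(Response):
    """A response that exposes its body as an HTML object."""

    def __init__(self, url: str, text: str) -> None:
        super().__init__(url, text)
        self._html = None

    @property
    def html(self) -> HTML:
        # Build the HTML wrapper on first access and reuse it afterwards.
        if self._html is None:
            self._html = HTML(self.text, url=self.url)
        return self._html
```

Note that in this sketch `HTMLResponse` is related to `HTML` by composition (it holds one), while `Element` and `HTML` are related to `BaseParser` by inheritance.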
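BaseParser: the parsing foundation
The BaseParser class in requests_html.py implements the core HTML parsing functionality and gives Element and HTML a single, shared parsing interface. It builds on several parsing libraries (lxml and PyQuery) to handle HTML flexibly.
Automatic encoding detection
BaseParser resolves the encoding in stages: it returns an explicitly set encoding if there is one, otherwise scans the HTML for a declared charset, and finally falls back to the default encoding: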
@property
def encoding(self) -> _Encoding:
    """The encoding string to be used, extracted from the HTML and
    :class:`HTMLResponse <HTMLResponse>` headers.
    """
    if self._encoding:
        return self._encoding

    # Scan meta tags for charset.
    if self._html:
        self._encoding = html_to_unicode(self.default_encoding, self._html)[0]
        # Fall back to requests' detected encoding if decode fails.
        try:
            self.raw_html.decode(self.encoding, errors='replace')
        except UnicodeDecodeError:
            self._encoding = self.default_encoding

    return self._encoding if self._encoding else self.default_encoding
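To make the fallback chain concrete, here is a minimal standard-library sketch of the same idea: sniff a charset from a meta tag, check that it names a usable codec, and otherwise fall back to a default. The function name `detect_encoding` and the regular expression are assumptions made for illustration; the library itself delegates the sniffing to `html_to_unicode`.

```python
import re

DEFAULT_ENCODING = "utf-8"

# Matches both <meta charset="..."> and
# <meta http-equiv="Content-Type" content="text/html; charset=...">.
_META_CHARSET = re.compile(rb'<meta[^>]+charset=["\']?([\w-]+)', re.IGNORECASE)


def detect_encoding(raw_html: bytes, default: str = DEFAULT_ENCODING) -> str:
    """Return the charset declared in the markup, or `default`."""
    match = _META_CHARSET.search(raw_html)
    if not match:
        return default

    candidate = match.group(1).decode("ascii")
    try:
        # Only verifies that Python knows this codec name.
        raw_html.decode(candidate, errors="replace")
    except LookupError:
        return default
    return candidate
```

For example, `detect_encoding(b'<meta charset="gbk">')` yields `'gbk'`, while markup with no declaration, or with an unknown codec name, yields the default.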
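Two parsing engines
BaseParser combines two parsing engines:
- PyQuery: jQuery-style manipulation with CSS selectors
- lxml: high-performance XML/HTML processing
Both are exposed as lazily built, cached properties in requests_html.py. The lxml tree is built first with soup_parse and, if that raises ValueError, with lxml.html.fromstring; the PyQuery object is then created from that tree: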
@property
def pq(self) -> PyQuery:
    """`PyQuery <https://pythonhosted.org/pyquery/>`_ representation
    of the :class:`Element <Element>` or :class:`HTML <HTML>`.
    """
    if self._pq is None:
        self._pq = PyQuery(self.lxml)

    return self._pq

@property
def lxml(self) -> HtmlElement:
    """`lxml <http://lxml.de>`_ representation of the
    :class:`Element <Element>` or :class:`HTML <HTML>`.
    """
    if self._lxml is None:
        try:
            self._lxml = soup_parse(self.html, features='html.parser')
        except ValueError:
            self._lxml = lxml.html.fromstring(self.raw_html)

    return self._lxml
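The pattern in these two properties (parse on first access, cache the result, fall back to a second parser if the first one fails) can be shown with the standard library alone. In this sketch the strict parser is `xml.etree.ElementTree` and the lenient fallback is `html.parser`; both are substitutes chosen for illustration, and `LazyDocument` is a hypothetical class name.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser
from typing import List, Optional


class _TagCollector(HTMLParser):
    """Lenient fallback: records start tags without validating nesting."""

    def __init__(self) -> None:
        super().__init__()
        self.tags: List[str] = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


class LazyDocument:
    def __init__(self, markup: str) -> None:
        self.markup = markup
        self._tags: Optional[List[str]] = None
        self.parse_count = 0  # for demonstration only

    @property
    def tags(self) -> List[str]:
        # Parse only once; later accesses reuse the cached list.
        if self._tags is None:
            self.parse_count += 1
            try:
                root = ET.fromstring(self.markup)  # strict parser first
                self._tags = [el.tag for el in root.iter()]
            except ET.ParseError:
                collector = _TagCollector()  # lenient fallback
                collector.feed(self.markup)
                self._tags = collector.tags
        return self._tags
```

Accessing `tags` repeatedly leaves `parse_count` at 1, and malformed markup such as an unclosed `<br>` is still handled because the fallback parser takes over.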
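Element: working with HTML elements
The Element class inherits from BaseParser and represents a single element in an HTML document. In requests_html.py it provides the fine-grained, element-level operations.
Attribute handling
Element returns an element's attributes as a dictionary and gives special treatment to class and rel, which commonly hold several space-separated values: they are split into tuples.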
@property
def attrs(self) -> _Attrs:
    """Returns a dictionary of the attributes of the :class:`Element <Element>`
    (`learn more <https://www.w3schools.com/tags/ref_attributes.asp>`_).
    """
    if self._attrs is None:
        self._attrs = {k: v for k, v in self.element.items()}

        # Split class and rel up, as there are usually many of them:
        for attr in ['class', 'rel']:
            if attr in self._attrs:
                self._attrs[attr] = tuple(self._attrs[attr].split())

    return self._attrs
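The effect of the splitting step is easiest to see on plain dictionaries. The helper below is a small sketch that mirrors the quoted loop; `normalize_attrs` is a name made up for this example.

```python
from typing import Dict, Tuple, Union

MULTI_VALUED = ("class", "rel")

AttrValue = Union[str, Tuple[str, ...]]


def normalize_attrs(raw: Dict[str, str]) -> Dict[str, AttrValue]:
    """Copy `raw`, turning space-separated class/rel values into tuples."""
    attrs: Dict[str, AttrValue] = dict(raw)
    for name in MULTI_VALUED:
        if name in attrs:
            # str.split() with no argument collapses repeated whitespace.
            attrs[name] = tuple(raw[name].split())
    return attrs
```

With this shape, a check such as `'next' in attrs.get('rel', ())` tests for a whole token rather than a substring of the raw attribute string.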
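Finding elements
Element's find method looks up elements by CSS selector and can additionally filter them by text: when containing is given, only elements whose full text contains one of the given strings (compared case-insensitively) are kept. The excerpt below shows the selection and filtering steps: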
def find(self, selector: str = "*", *, containing: _Containing = None, clean: bool = False, first: bool = False, _encoding: str = None) -> _Find:
    """Given a CSS Selector, returns a list of
    :class:`Element <Element>` objects or a single one.
    """
    # Convert a single containing into a list.
    if isinstance(containing, str):
        containing = [containing]

    encoding = _encoding or self.encoding
    elements = [
        Element(element=found, url=self.url, default_encoding=encoding)
        for found in self.pq(selector)
    ]

    if containing:
        elements_copy = elements.copy()
        elements = []

        for element in elements_copy:
            if any([c.lower() in element.full_text.lower() for c in containing]):
                elements.append(element)

        elements.reverse()
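The text filter can be tried in isolation on plain strings. The sketch below mirrors the logic of the excerpt, including the final reversal; `filter_containing` is a hypothetical helper, and the strings stand in for each element's full text.

```python
from typing import Iterable, List, Union


def filter_containing(texts: Iterable[str],
                      containing: Union[str, List[str]]) -> List[str]:
    """Keep texts that contain any needle, ignoring case."""
    # Accept a single string as shorthand for a one-item list.
    if isinstance(containing, str):
        containing = [containing]

    needles = [c.lower() for c in containing]
    kept = [t for t in texts if any(n in t.lower() for n in needles)]

    # The quoted excerpt reverses the filtered list; mirrored here.
    kept.reverse()
    return kept
```

One visible consequence of the reversal is that, after filtering, the last match in document order comes first in the result.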
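HTML: document-level processing
The HTML class handles complete HTML documents. In requests_html.py it adds higher-level features such as pagination and JavaScript rendering.
Pagination discovery
The next() method of the HTML class tries to discover the link to the next page heuristically. Among the anchors whose text contains next_symbol, it prefers one with rel="next", then one whose class name contains "next", then one whose href contains "page", and finally falls back to the last candidate. Note that the signature defaults to fetch=False, so a plain call returns a URL, even though the quoted docstring labels True as the default: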
def next(self, fetch: bool = False, next_symbol: _NextSymbol = None) -> _Next:
    """Attempts to find the next page, if there is one. If ``fetch``
    is ``True`` (default), returns :class:`HTML <HTML>` object of
    next page. If ``fetch`` is ``False``, simply returns the next URL.
    """
    if next_symbol is None:
        next_symbol = DEFAULT_NEXT_SYMBOL

    def get_next():
        candidates = self.find('a', containing=next_symbol)

        for candidate in candidates:
            if candidate.attrs.get('href'):
                # Support 'next' rel (e.g. reddit).
                if 'next' in candidate.attrs.get('rel', []):
                    return candidate.attrs['href']

                # Support 'next' in classnames.
                for _class in candidate.attrs.get('class', []):
                    if 'next' in _class:
                        return candidate.attrs['href']

                if 'page' in candidate.attrs['href']:
                    return candidate.attrs['href']

        try:
            # Resort to the last candidate.
            return candidates[-1].attrs['href']
        except IndexError:
            return None
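The priority order inside get_next() can be exercised without any HTML by representing each candidate link as an attribute dictionary. This is a sketch under that assumption; `pick_next_href` is a made-up name, and unlike the excerpt it also tolerates a last candidate that has no href.

```python
from typing import Dict, List, Optional


def pick_next_href(candidates: List[Dict]) -> Optional[str]:
    """Pick a next-page href using the same order of checks as get_next()."""
    for attrs in candidates:
        href = attrs.get("href")
        if not href:
            continue
        # 1. An explicit rel="next" token.
        if "next" in attrs.get("rel", ()):
            return href
        # 2. A class name that contains "next".
        if any("next" in name for name in attrs.get("class", ())):
            return href
        # 3. An href that mentions "page".
        if "page" in href:
            return href
    try:
        # 4. Resort to the last candidate.
        return candidates[-1]["href"]
    except (IndexError, KeyError):
        return None
```

Because the checks run per candidate in order, an earlier link that merely has "page" in its href wins over a later link marked rel="next".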
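JavaScript rendering
One of the most powerful features of the HTML class is JavaScript rendering, implemented with pyppeteer. The render() method loads the page in Chromium and retries up to retries times until content is obtained: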
def render(self, retries: int = 8, script: str = None, wait: float = 0.2, scrolldown=False, sleep: int = 0, reload: bool = True, timeout: Union[float, int] = 8.0, keep_page: bool = False, cookies: list = [{}], send_cookies_session: bool = False):
    """Reloads the response in Chromium, and replaces HTML content
    with an updated version, with JavaScript executed.
    """
    self.browser = self.session.browser  # Automatically create a event loop and browser
    content = None

    # Automatically set Reload to False, if example URL is being used.
    if self.url == DEFAULT_URL:
        reload = False

    if send_cookies_session:
        cookies = self._convert_cookiesjar_to_render()

    for i in range(retries):
        if not content:
            try:
                content, result, page = self.session.loop.run_until_complete(self._async_render(url=self.url, script=script, sleep=sleep, wait=wait, content=self.html, reload=reload, scrolldown=scrolldown, timeout=timeout, keep_page=keep_page, cookies=cookies))
            except TypeError:
                pass
        else:
            break
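The retry loop at the end of the excerpt is a small pattern of its own: keep calling an operation until it yields content, treating TypeError as "not ready yet". The sketch below isolates that loop; `run_with_retries` and `make_flaky` are hypothetical helpers, and the flaky operation stands in for the browser round trip.

```python
from typing import Callable, Optional


def run_with_retries(operation: Callable[[], Optional[str]],
                     retries: int = 8) -> Optional[str]:
    """Call `operation` until it returns content or attempts run out."""
    content = None
    for _ in range(retries):
        if content:
            break
        try:
            content = operation()
        except TypeError:
            # Same tolerance as the excerpt: swallow and try again.
            pass
    return content


def make_flaky(failures: int) -> Callable[[], str]:
    """Build an operation that raises TypeError `failures` times first."""
    state = {"calls": 0}

    def operation() -> str:
        state["calls"] += 1
        if state["calls"] <= failures:
            raise TypeError("page not ready")
        return "<html>rendered</html>"

    return operation
```

In this sketch the caller gets None when every attempt fails, so it has to decide separately how to report that case.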
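HTMLSession: the network layer
The HTMLSession and AsyncHTMLSession classes in requests_html.py provide the network layer. They build on the requests library and add HTML parsing on top. The excerpt below is condensed and may not match every release line for line: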
class HTMLSession(requests.Session):
    """A consumable session, for cookie persistence and connection pooling,
    with HTML parsing abilities.
    """

    def __init__(self, mock_browser : bool = True, verify : bool = True,
                 browser_args : list = ['--no-sandbox']):
        super(HTMLSession, self).__init__()
        self.hooks['response'].append(self.response_hook)
        self.mock_browser = mock_browser
        self.verify = verify
        self.browser_args = browser_args
        self.__browser = None

    def response_hook(self, response, **kwargs) -> HTMLResponse:
        """Response hook to create a HTMLResponse object."""
        return HTMLResponse._from_response(response, self)
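Request-response flow
HTMLSession covers the whole path from request to parsed result:
- Send the HTTP request and receive the raw HTML
- Convert the response into an HTMLResponse through response_hook
- HTMLResponse parses the HTML content into an HTML object
- Expose the parsed result through a convenient API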
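The steps above can be sketched with a tiny in-memory session that uses the same hook idea: every response passes through a list of hooks, and a hook may return a replacement object. Everything in this sketch (`MiniSession`, `RawResponse`, `ParsedResponse`, the dictionary acting as a fake network) is a hypothetical stand-in, not part of requests or requests-html.

```python
import re
from typing import Dict, Optional


class RawResponse:
    def __init__(self, url: str, body: str) -> None:
        self.url = url
        self.body = body


class ParsedResponse(RawResponse):
    """A response upgraded with a small parsing helper."""

    @classmethod
    def from_response(cls, response: RawResponse) -> "ParsedResponse":
        return cls(response.url, response.body)

    @property
    def title(self) -> Optional[str]:
        match = re.search(r"<title>(.*?)</title>", self.body,
                          re.IGNORECASE | re.DOTALL)
        return match.group(1).strip() if match else None


class MiniSession:
    def __init__(self, pages: Dict[str, str]) -> None:
        self.pages = pages  # fake network: URL -> HTML
        self.hooks = {"response": [self.response_hook]}

    def response_hook(self, response: RawResponse) -> ParsedResponse:
        return ParsedResponse.from_response(response)

    def get(self, url: str) -> RawResponse:
        response = RawResponse(url, self.pages[url])  # step 1: fetch
        for hook in self.hooks["response"]:           # step 2: hooks
            replaced = hook(response)
            if replaced is not None:
                response = replaced
        return response                               # steps 3-4: parsed API
```

The caller never sees the raw object: `MiniSession(...).get(url)` already returns the upgraded response, which is the property that lets `r.html` be available directly on the result of `session.get()`.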
Practical examples
Basic usage
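from requests_html import HTMLSession

# Create a session
session = HTMLSession()

# Send a request and get the HTML
r = session.get('https://example.com')

# Find an element with a CSS selector
title = r.html.find('title', first=True).text

# Find nodes with XPath (here: the href attribute values)
links = r.html.xpath('//a/@href')

# Collect every link on the page as an absolute URL
all_links = r.html.absolute_links

# Render JavaScript content (requires Chromium, driven through pyppeteer)
r.html.render()

# Look up the URL of the next page (fetch=False by default)
next_page = r.html.next()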
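The difference between raw href values and absolute links comes down to resolving each href against the page URL. As a rough standard-library illustration of that step, `urllib.parse.urljoin` can be used; the `absolutize` helper is an assumption for this example and not the library's own implementation.

```python
from typing import Iterable, Set
from urllib.parse import urljoin


def absolutize(base_url: str, hrefs: Iterable[str]) -> Set[str]:
    """Resolve each href against `base_url` and drop duplicates."""
    return {urljoin(base_url, href) for href in hrefs}
```

Root-relative paths, document-relative paths, and already absolute URLs are all handled by the same call.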
Asynchronous usage
from requests_html import AsyncHTMLSession

session = AsyncHTMLSession()

async def main():
    r = await session.get('https://example.com')
    # Render JavaScript asynchronously
    await r.html.arender()
    # Print the rendered content
    print(r.html.text)
    await session.close()

# run() expects coroutine functions, so pass main itself rather than main()
session.run(main)
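Test layout
The requests-html project ships a test suite under tests/ to keep the core functionality stable. Tests in the following style cover parsing, rendering, and navigation (get() is assumed here to be a helper that returns the response for the page under test):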
def test_css_selector():
    """Test CSS selector functionality."""
    r = get()
    assert len(r.html.find('h1')) == 1
    assert r.html.find('h1', first=True).text == 'Example Domain'

def test_xpath():
    """Test XPath selector functionality."""
    r = get()
    assert len(r.html.xpath('//h1')) == 1
    assert r.html.xpath('//h1', first=True).text == 'Example Domain'

def test_render():
    """Test JavaScript rendering."""
    r = get()
    r.html.render()
    assert len(r.html.find('script')) >= 1
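Project structure and resources
File layout
requests-html/
├── LICENSE                     # License file
├── Pipfile                     # Dependency management
├── README.rst                  # Project README
├── requests_html.py            # Core code
├── setup.py                    # Packaging configuration
├── docs/                       # Documentation
│   └── source/
│       └── index.rst           # Documentation home page
└── tests/                      # Test code
    ├── python.html             # HTML file used by the tests
    ├── test_internet.py        # Network tests
    └── test_requests_html.py   # Unit tests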
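Official documentation
For the full official documentation, see docs/source/index.rst.
Summary and outlook
requests-html combines capable HTML parsing with a simple, easy-to-use API. Its main strengths are:
- Intuitive API design: Pythonic in style, with a gentle learning curve
- Strong parsing: combines the advantages of lxml and PyQuery
- JavaScript rendering: dynamic content is rendered in Chromium through pyppeteer
- Pagination discovery: heuristically finds the link to the next page
- Asynchronous support: AsyncHTMLSession and arender() allow concurrent fetching and rendering
Possible future directions include:
- Better support for modern JavaScript frameworks
- Faster rendering with a smaller resource footprint
- Broader data extraction features with support for more formats
With a solid understanding of the core architecture and implementation details of requests_html.py, developers can use the library more effectively for HTML parsing tasks and build robust crawlers and data extraction applications.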