A Deep Dive into the requests-html Source: How the Core Components Work

requests-html: Pythonic HTML Parsing for Humans™. Project address: https://gitcode.com/gh_mirrors/re/requests-html

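requests-html is a Pythonic HTML parsing library designed for humans, pairing HTML parsing with an intuitive API. This article walks through the core architecture of requests_html.py and the implementation details of its key components, to help developers understand how it works internally and the design thinking behind it.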

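Core class hierarchy

The core functionality of requests-html is built around four main classes, related by inheritance and collaboration:

  • BaseParser: the base class for all parsing functionality; defines the core HTML parsing interface
  • Element: represents a single element in an HTML document and provides element-level operations
  • HTML: represents a complete HTML document; inherits from BaseParser and adds document-level functionality
  • HTMLResponse: an enhanced response object that combines a requests response with HTML parsing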

[Figure: class relationship diagram (Mermaid)]

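BaseParser: the parsing foundation

The BaseParser class in requests_html.py implements the core HTML parsing functionality and gives Element and HTML a unified parsing interface. It builds on lxml and PyQuery to handle HTML flexibly.

Automatic encoding detection

BaseParser's encoding handling lets pages in a variety of character sets parse correctly. An explicitly set encoding wins, then a charset declared in the HTML, then the default encoding: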

@property
def encoding(self) -> _Encoding:
    """The encoding string to be used, extracted from the HTML and
    :class:`HTMLResponse <HTMLResponse>` headers.
    """
    if self._encoding:
        return self._encoding

    # Scan meta tags for charset.
    if self._html:
        self._encoding = html_to_unicode(self.default_encoding, self._html)[0]
        # Fall back to requests' detected encoding if decode fails.
        try:
            self.raw_html.decode(self.encoding, errors='replace')
        except UnicodeDecodeError:
            self._encoding = self.default_encoding


    return self._encoding if self._encoding else self.default_encoding
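
The priority order above (explicit encoding, then a declared charset, then the default) can be sketched with the standard library alone. This is a minimal illustration, not the library's implementation: requests-html delegates the scan to html_to_unicode, while `detect_encoding` and `META_CHARSET` below are hypothetical names, and the regular expression is a simplifying assumption about how a charset is declared.

```python
import codecs
import re

# Hypothetical helper for illustration; requests-html itself calls html_to_unicode.
META_CHARSET = re.compile(rb'<meta[^>]+charset=["\']?([\w-]+)', re.IGNORECASE)


def detect_encoding(raw_html: bytes, default: str = "utf-8", explicit: str = None) -> str:
    """Pick an encoding: explicit setting, then <meta> charset, then the default."""
    if explicit:
        return explicit

    match = META_CHARSET.search(raw_html)
    if match:
        candidate = match.group(1).decode("ascii")
        try:
            codecs.lookup(candidate)  # raises LookupError for unknown codec names
            return candidate
        except LookupError:
            pass  # declared charset is unusable, fall through to the default

    return default


print(detect_encoding(b'<meta charset="gbk"><p>hi</p>'))
print(detect_encoding(b"<p>no declaration</p>"))
```

The sketch validates the declared name with codecs.lookup before trusting it, which is one way to guard against a page that declares a charset Python does not know.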

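Dual parsing engines

BaseParser exposes the parsed document through two interfaces:

  1. PyQuery: jQuery-style CSS selector operations
  2. lxml: high-performance XML/HTML processing

Both appear in requests_html.py as lazily built, cached properties, and the PyQuery object wraps the lxml tree: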

@property
def pq(self) -> PyQuery:
    """`PyQuery <https://pythonhosted.org/pyquery/>`_ representation
    of the :class:`Element <Element>` or :class:`HTML <HTML>`.
    """
    if self._pq is None:
        self._pq = PyQuery(self.lxml)

    return self._pq

@property
def lxml(self) -> HtmlElement:
    """`lxml <http://lxml.de>`_ representation of the
    :class:`Element <Element>` or :class:`HTML <HTML>`.
    """
    if self._lxml is None:
        try:
            self._lxml = soup_parse(self.html, features='html.parser')
        except ValueError:
            self._lxml = lxml.html.fromstring(self.raw_html)

    return self._lxml
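
Both properties follow the same "build on first access, then cache" pattern. The sketch below isolates that pattern with a hypothetical `LazyDocument` class; the string split is only a stand-in for the real parse step, and `parse_count` exists purely to make the caching observable.

```python
class LazyDocument:
    """Hypothetical stand-in for BaseParser: parses on first access, then caches."""

    def __init__(self, html: str):
        self.html = html
        self._tree = None      # cache slot, playing the role of _lxml
        self.parse_count = 0   # only here to make the caching observable

    @property
    def tree(self):
        if self._tree is None:
            self.parse_count += 1
            # Stand-in for the real parse step (soup_parse / lxml.html.fromstring).
            self._tree = self.html.split()
        return self._tree


doc = LazyDocument("<p> hello </p>")
doc.tree
doc.tree
print(doc.parse_count)
```

A document that is never queried never pays the parsing cost, and repeated queries reuse the same tree.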

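Element: working with HTML elements

The Element class inherits from BaseParser and represents a single element in an HTML document. In requests_html.py it implements fine-grained operations on that element.

Attribute handling

Element exposes an element's attributes as a dictionary. The class and rel attributes, which often carry several values, are split on whitespace into tuples: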

@property
def attrs(self) -> _Attrs:
    """Returns a dictionary of the attributes of the :class:`Element <Element>`
    (`learn more <https://www.w3schools.com/tags/ref_attributes.asp>`_).
    """
    if self._attrs is None:
        self._attrs = {k: v for k, v in self.element.items()}

        # Split class and rel up, as there are usually many of them:
        for attr in ['class', 'rel']:
            if attr in self._attrs:
                self._attrs[attr] = tuple(self._attrs[attr].split())

    return self._attrs
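
The splitting step can be tried in isolation on a plain dictionary. `normalize_attrs` is a hypothetical helper written for this illustration; it mirrors the loop in the excerpt but is not part of the library.

```python
def normalize_attrs(raw_attrs: dict) -> dict:
    """Copy the attributes and split the multi-valued ones into tuples."""
    attrs = dict(raw_attrs)
    for name in ("class", "rel"):
        if name in attrs:
            # str.split() with no argument also collapses repeated whitespace.
            attrs[name] = tuple(attrs[name].split())
    return attrs


attrs = normalize_attrs({"href": "/page/2", "class": "btn  next", "rel": "next nofollow"})
print(attrs)
```

Because the values become tuples, a membership test such as `'next' in attrs['rel']` matches whole tokens rather than substrings.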

Finding elements

Element's find method looks elements up by CSS selector and can filter them by text content:

def find(self, selector: str = "*", *, containing: _Containing = None, clean: bool = False, first: bool = False, _encoding: str = None) -> _Find:
    """Given a CSS Selector, returns a list of
    :class:`Element <Element>` objects or a single one.
    """
    # Convert a single containing into a list.
    if isinstance(containing, str):
        containing = [containing]

    encoding = _encoding or self.encoding
    elements = [
        Element(element=found, url=self.url, default_encoding=encoding)
        for found in self.pq(selector)
    ]

    if containing:
        elements_copy = elements.copy()
        elements = []

        for element in elements_copy:
            if any([c.lower() in element.full_text.lower() for c in containing]):
                elements.append(element)

        elements.reverse()

    # (Excerpt: the handling of `clean` and `first` and the final return are omitted.)
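
The `containing` filter is a case-insensitive substring match against each element's text. Here is a minimal sketch on plain strings, with a hypothetical `filter_containing` helper; unlike the excerpt, it keeps the matches in their original order instead of reversing them.

```python
def filter_containing(texts, containing):
    """Keep texts that contain any of the given substrings, ignoring case."""
    if isinstance(containing, str):
        containing = [containing]  # accept a single string as well as a list
    return [
        text for text in texts
        if any(c.lower() in text.lower() for c in containing)
    ]


headings = ["Install Guide", "API Reference", "installation notes"]
matches = filter_containing(headings, "install")
print(matches)
```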

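HTML: document-level processing

The HTML class is the core of handling a complete HTML document. In requests_html.py it implements higher-level features such as document navigation and JavaScript rendering.

Smart pagination

The next() method of the HTML class looks for a link to the next page using a few heuristics: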

def next(self, fetch: bool = False, next_symbol: _NextSymbol = None) -> _Next:
    """Attempts to find the next page, if there is one. If ``fetch``
    is ``True``, returns :class:`HTML <HTML>` object of
    next page. If ``fetch`` is ``False`` (default), simply returns the next URL.
    """
    if next_symbol is None:
        next_symbol = DEFAULT_NEXT_SYMBOL

    def get_next():
        candidates = self.find('a', containing=next_symbol)

        for candidate in candidates:
            if candidate.attrs.get('href'):
                # Support 'next' rel (e.g. reddit).
                if 'next' in candidate.attrs.get('rel', []):
                    return candidate.attrs['href']

                # Support 'next' in classnames.
                for _class in candidate.attrs.get('class', []):
                    if 'next' in _class:
                        return candidate.attrs['href']

                if 'page' in candidate.attrs['href']:
                    return candidate.attrs['href']

        try:
            # Resort to the last candidate.
            return candidates[-1].attrs['href']
        except IndexError:
            return None

    # (Excerpt: the rest of the method, which calls get_next() and acts on its result, is omitted.)
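
The order of the heuristics is easier to see on plain data. In the sketch below each candidate link is a dictionary of attributes and `pick_next_href` is a hypothetical helper. It follows the same order as get_next (rel, then class names, then "page" in the URL, then the last candidate), with one deliberate difference: it also returns None when the last candidate has no href.

```python
def pick_next_href(candidates):
    """Return the href of the most likely "next page" link, or None."""
    for attrs in candidates:
        href = attrs.get("href")
        if not href:
            continue
        if "next" in attrs.get("rel", ()):                        # rel="next"
            return href
        if any("next" in cls for cls in attrs.get("class", ())):  # e.g. class="pager-next"
            return href
        if "page" in href:                                        # e.g. /page/2
            return href
    # No heuristic matched: fall back to the last candidate, if there is one.
    try:
        return candidates[-1]["href"]
    except (IndexError, KeyError):
        return None


links = [{"href": "/about"}, {"href": "/list?p=2", "rel": ("next",)}]
print(pick_next_href(links))
```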

JavaScript rendering engine

One of the most powerful features of the HTML class is its built-in JavaScript rendering, implemented through pyppeteer:

def render(self, retries: int = 8, script: str = None, wait: float = 0.2, scrolldown=False, sleep: int = 0, reload: bool = True, timeout: Union[float, int] = 8.0, keep_page: bool = False, cookies: list = [{}], send_cookies_session: bool = False):
    """Reloads the response in Chromium, and replaces HTML content
    with an updated version, with JavaScript executed.
    """
    self.browser = self.session.browser  # Automatically create a event loop and browser
    content = None

    # Automatically set Reload to False, if example URL is being used.
    if self.url == DEFAULT_URL:
        reload = False

    if send_cookies_session:
        cookies = self._convert_cookiesjar_to_render()

    for i in range(retries):
        if not content:
            try:
                content, result, page = self.session.loop.run_until_complete(self._async_render(url=self.url, script=script, sleep=sleep, wait=wait, content=self.html, reload=reload, scrolldown=scrolldown, timeout=timeout, keep_page=keep_page, cookies=cookies))
            except TypeError:
                pass
        else:
            break

    # (Excerpt: the rest of the method, which swaps in the rendered content, is omitted.)
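
The loop in the excerpt is a bounded retry: keep attempting until some content comes back or the retry budget runs out. A self-contained sketch of the same idea follows, in which `retry_until_content` and `flaky_render` are hypothetical names and the renderer is simulated rather than a real browser.

```python
def retry_until_content(attempt, retries: int = 8):
    """Call `attempt` up to `retries` times and return the first truthy result."""
    content = None
    for _ in range(retries):
        try:
            content = attempt()
        except TypeError:
            # Mirror the excerpt: a failed attempt is swallowed and retried.
            continue
        if content:
            break
    return content


calls = {"n": 0}


def flaky_render():
    """Hypothetical renderer that fails twice before producing HTML."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise TypeError("page not ready")
    return "<html>rendered</html>"


result = retry_until_content(flaky_render)
print(result, calls["n"])
```

Swallowing an exception silently hides the cause of a failure, so a caller of a loop like this one usually wants to check for an empty result after the retries are exhausted.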

HTMLSession: the network request layer

The HTMLSession and AsyncHTMLSession classes in requests_html.py provide the network request layer. They build on the requests library and add HTML parsing:

class HTMLSession(requests.Session):
    """A consumable session, for cookie persistence and connection pooling,
    with HTML parsing abilities.
    """
    def __init__(self, mock_browser : bool = True, verify : bool = True,
                 browser_args : list = ['--no-sandbox']):
        super(HTMLSession, self).__init__()
        self.hooks['response'].append(self.response_hook)
        self.mock_browser = mock_browser
        self.verify = verify
        self.browser_args = browser_args
        self.__browser = None

    def response_hook(self, response, **kwargs) -> HTMLResponse:
        """Response hook to create a HTMLResponse object."""
        return HTMLResponse._from_response(response, self)

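Request-response flow

HTMLSession implements the full request-to-parse flow:

  1. Send an HTTP request to fetch the raw HTML
  2. Convert the response into an HTMLResponse through response_hook
  3. HTMLResponse parses the HTML content into an HTML object
  4. A convenient API gives access to the parsed result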
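
The four steps can be imitated without any network access. Everything in the sketch below (`RawResponse`, `ParsedResponse`, `MiniSession`) is a hypothetical stand-in written for this illustration; it only shows the shape of a response hook that upgrades a plain response into one with an `html` view.

```python
class RawResponse:
    """Hypothetical stand-in for a plain HTTP response."""

    def __init__(self, body: str):
        self.body = body


class ParsedResponse(RawResponse):
    """Hypothetical stand-in for HTMLResponse: adds a lazily built `html` view."""

    def __init__(self, body: str):
        super().__init__(body)
        self._html = None

    @property
    def html(self):
        if self._html is None:
            self._html = {"length": len(self.body), "has_title": "<title>" in self.body}
        return self._html


class MiniSession:
    """Hypothetical session that runs response hooks, in the spirit of requests' hooks."""

    def __init__(self):
        self.hooks = {"response": [self.response_hook]}

    def response_hook(self, response):
        return ParsedResponse(response.body)

    def get(self, url: str):
        response = RawResponse("<title>Example</title>")  # step 1: pretend to fetch `url`
        for hook in self.hooks["response"]:                # step 2: upgrade the response
            response = hook(response) or response
        return response                                    # steps 3-4: caller reads .html


r = MiniSession().get("https://example.com")
print(type(r).__name__, r.html)
```

Registering the conversion as a hook means that every request made through the session yields the upgraded response type, with no change to the calling code.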

Practical examples

Basic usage

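from requests_html import HTMLSession

# Create a session
session = HTMLSession()

# Send a request and fetch the HTML
r = session.get('https://example.com')

# Find an element with a CSS selector
title = r.html.find('title', first=True).text

# Find elements with XPath
links = r.html.xpath('//a/@href')

# Collect every link on the page as an absolute URL
all_links = r.html.absolute_links

# Render JavaScript content (Chromium is downloaded on first use)
r.html.render()

# Find the next page; fetch defaults to False, so this returns a URL
next_page = r.html.next()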

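Asynchronous example

from requests_html import AsyncHTMLSession

# Create the session once so that main() and run() share it
session = AsyncHTMLSession()

async def main():
    r = await session.get('https://example.com')

    # Render JavaScript asynchronously
    await r.html.arender()

    # Read the rendered content
    print(r.html.text)

    await session.close()

# run() takes coroutine functions and calls them itself, so pass main uncalled
session.run(main)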
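
A runner that accepts coroutine functions (uncalled) can schedule several requests concurrently. The sketch below is a standard-library imitation of that idea, not the library's implementation: `run_all`, `fetch_a`, and `fetch_b` are hypothetical names, and the sleeps stand in for network requests.

```python
import asyncio


async def fetch_a():
    await asyncio.sleep(0.01)  # stands in for `await session.get(...)`
    return "a"


async def fetch_b():
    await asyncio.sleep(0.01)
    return "b"


def run_all(*coro_functions):
    """Call each coroutine function and run the resulting coroutines concurrently."""

    async def gather_all():
        # gather() returns results in the order the coroutines were passed in.
        return await asyncio.gather(*(fn() for fn in coro_functions))

    return asyncio.run(gather_all())


results = run_all(fetch_a, fetch_b)
print(results)
```

Taking functions rather than coroutine objects lets the runner create each coroutine itself, inside the event loop it controls.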

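Test architecture

The requests-html project includes a test suite that guards the stability of the core functionality. The test cases cover key areas such as parsing, rendering, and navigation: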

def test_css_selector():
    """Test CSS selector functionality."""
    r = get()
    assert len(r.html.find('h1')) == 1
    assert r.html.find('h1', first=True).text == 'Example Domain'

def test_xpath():
    """Test XPath selector functionality."""
    r = get()
    assert len(r.html.xpath('//h1')) == 1
    assert r.html.xpath('//h1', first=True).text == 'Example Domain'

def test_render():
    """Test JavaScript rendering."""
    r = get()
    r.html.render()
    assert len(r.html.find('script')) >= 1

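Project structure and resources

Project file layout

requests-html/
├── LICENSE                 # License file
├── Pipfile                 # Dependency management
├── README.rst              # Project README
├── requests_html.py        # Core code
├── setup.py                # Installation configuration
├── docs/                   # Documentation
│   └── source/
│       └── index.rst       # Documentation home page
└── tests/                  # Test code
    ├── python.html         # HTML file used by the tests
    ├── test_internet.py    # Network tests
    └── test_requests_html.py # Unit tests

Official documentation

For the full official documentation, see docs/source/index.rst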

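Summary and outlook

requests-html combines capable HTML parsing with a concise, easy-to-use API. Its main strengths are:

  1. Intuitive API design: Pythonic in style, with a gentle learning curve
  2. Strong parsing: builds on lxml and PyQuery
  3. JavaScript rendering: dynamic content is rendered in Chromium, driven through pyppeteer
  4. Smart pagination: next-page links are discovered automatically
  5. Async support: AsyncHTMLSession and arender() for more efficient crawling

Possible future directions include:

  • Better support for modern JavaScript frameworks
  • Faster rendering with a smaller resource footprint
  • Broader data extraction features with support for more formats

With an understanding of the core architecture and implementation details of requests_html.py, developers can use the tool more effectively for HTML parsing tasks and build robust crawlers and data extraction applications.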
