一起学习网


Building a Simple Web Crawler with Python

Development · Python, web crawler, BeautifulSoup, requests, HTML parsing · 03-10

Building a Simple Web Crawler in Python: Scraping Page Titles and Links

In this tutorial we will build a simple web crawler in Python that scrapes the titles and links from a web page. We will use the requests library to send HTTP requests and BeautifulSoup to parse the HTML content. The walkthrough goes step by step from basics to a complete script, and is aimed at readers with some basic Python experience.

Step 1: Set Up the Development Environment

First, make sure the following Python libraries are installed:

  • requests
  • BeautifulSoup (bs4)

Install them with the following commands:

pip install requests
pip install beautifulsoup4

Step 2: Send an HTTP Request

We will use the requests module to fetch the page's HTML content. This step specifies the target URL and retrieves the response.

import requests

def fetch_page(url):
    try:
        response = requests.get(url)
        response.raise_for_status()  # raises an HTTPError if the request failed
        return response.text
    except requests.exceptions.HTTPError as err:
        print(f"HTTP error occurred: {err}")
    except Exception as err:
        print(f"An error occurred: {err}")

url = "http://example.com"
html_content = fetch_page(url)
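In real-world crawls it is usually wise to set a request timeout and send a User-Agent header, so a slow server cannot hang the crawler and the request identifies itself like a normal client. A minimal sketch of such a variant (the User-Agent string and the 10-second timeout are illustrative assumptions, not values from this tutorial):

```python
import requests

def fetch_page(url, timeout=10):
    """Fetch a page with a timeout and a custom User-Agent (values are illustrative)."""
    headers = {"User-Agent": "MyCrawler/1.0"}  # hypothetical crawler name
    try:
        # timeout applies to both connecting and reading the response
        response = requests.get(url, headers=headers, timeout=timeout)
        response.raise_for_status()
        return response.text
    except requests.exceptions.RequestException as err:
        # RequestException covers HTTP errors, timeouts, and connection failures
        print(f"Request failed: {err}")
        return None
```

Returning None explicitly on failure makes the later `if html_content:` check unambiguous.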

Step 3: Parse the HTML Content

Next, parse the page content with BeautifulSoup so that data in specific HTML tags can be extracted.

from bs4 import BeautifulSoup

def parse_html(html_content):
    soup = BeautifulSoup(html_content, 'html.parser')
    return soup

soup = parse_html(html_content)
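To see what parse_html gives you without touching the network, you can feed BeautifulSoup an HTML string directly. The fragment below is made up for the demo:

```python
from bs4 import BeautifulSoup

# A tiny made-up page, parsed the same way parse_html would
html = "<html><head><title>Demo</title></head><body><a href='/x'>X</a></body></html>"
soup = BeautifulSoup(html, "html.parser")

print(soup.title.string)  # the text inside the <title> tag
print(soup.a["href"])     # attribute access on the first <a> tag
```

The same navigation (`soup.title`, `soup.find_all(...)`) works identically on real fetched pages.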

Step 4: Extract the Title and Links

With the BeautifulSoup object we can easily extract the page title and the hyperlinks (<a> tags).

def extract_info(soup):
    # Extract the page title (soup.title is None if the page has no <title> tag)
    title = soup.title.string if soup.title else None
    print(f"Page Title: {title}")

    # Extract all links
    links = soup.find_all('a')
    for link in links:
        href = link.get('href')
        link_text = link.get_text(strip=True)  # unlike .string, works for nested tags
        if href:
            print(f"Link text: {link_text} - URL: {href}")

extract_info(soup)
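Many href values are relative (e.g. /about or page2.html). If you plan to follow the extracted links, you can resolve them against the page URL with the standard-library urljoin; the URLs below are made up for illustration:

```python
from urllib.parse import urljoin

base = "http://example.com/articles/index.html"

print(urljoin(base, "page2.html"))        # resolved relative to the current directory
print(urljoin(base, "/about"))            # resolved from the site root
print(urljoin(base, "http://other.com/")) # absolute URLs pass through unchanged
```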

Complete Code

import requests
from bs4 import BeautifulSoup

def fetch_page(url):
    try:
        response = requests.get(url)
        response.raise_for_status()
        return response.text
    except requests.exceptions.HTTPError as err:
        print(f"HTTP error occurred: {err}")
    except Exception as err:
        print(f"An error occurred: {err}")

def parse_html(html_content):
    soup = BeautifulSoup(html_content, 'html.parser')
    return soup

def extract_info(soup):
    title = soup.title.string if soup.title else None
    print(f"Page Title: {title}")

    links = soup.find_all('a')
    for link in links:
        href = link.get('href')
        link_text = link.get_text(strip=True)
        if href:
            print(f"Link text: {link_text} - URL: {href}")

url = "http://example.com"
html_content = fetch_page(url)

if html_content:
    soup = parse_html(html_content)
    extract_info(soup)

Conclusion

Through this simple web crawler example, we learned how to make HTTP requests with Python's requests library and how to parse HTML and extract information with BeautifulSoup. These are the foundations of web crawling and can help automate many data-collection tasks. In real projects, be sure to follow each site's robots.txt crawling rules and avoid violating the target site's terms of use.
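The robots.txt check mentioned above can be automated with Python's standard-library urllib.robotparser. Here is a sketch against a made-up robots.txt; for a real site you would call rp.set_url(...) and rp.read() to fetch the live file instead of parsing a string:

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt that blocks /private/ for all crawlers
robots_txt = """User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch("MyCrawler", "http://example.com/index.html"))    # allowed
print(rp.can_fetch("MyCrawler", "http://example.com/private/data"))  # disallowed
```

Checking can_fetch() before each request keeps the crawler within the rules the site publishes.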

