一起学习网


Building a Simple Web Crawler with Python

Development · Python, web crawler, BeautifulSoup, requests, HTML parsing · 03-10

Building a Simple Web Crawler in Python: Scraping Page Titles and Links

In this tutorial we will build a simple web crawler in Python that scrapes the titles and links from a web page. We will use the requests library to send HTTP requests and BeautifulSoup to parse the HTML content. The walkthrough goes step by step from basics to a complete script, and is aimed at readers with some basic Python experience.

Step 1: Set Up the Development Environment

First, make sure the following Python libraries are installed:

  • requests
  • BeautifulSoup (bs4)

Install them with the following commands:

pip install requests
pip install beautifulsoup4

Step 2: Send an HTTP Request

We will use the requests module to fetch the page's HTML content. This step specifies the target URL and retrieves the response.

import requests

def fetch_page(url):
    try:
        response = requests.get(url)
        response.raise_for_status()  # raises an HTTPError if the request failed
        return response.text
    except requests.exceptions.HTTPError as err:
        print(f"HTTP error occurred: {err}")
    except Exception as err:
        print(f"An error occurred: {err}")

url = "http://example.com"
html_content = fetch_page(url)
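In real-world crawls it is usually wise to set a request timeout and send a User-Agent header, so a slow server cannot hang the crawler and the request identifies itself like a normal client. A minimal sketch of such a variant (the User-Agent string and the 10-second timeout are illustrative assumptions, not values from this tutorial):

```python
import requests

def fetch_page(url, timeout=10):
    """Fetch a page with a timeout and a custom User-Agent (values are illustrative)."""
    headers = {"User-Agent": "MyCrawler/1.0"}  # hypothetical crawler name
    try:
        # timeout applies to both connecting and reading the response
        response = requests.get(url, headers=headers, timeout=timeout)
        response.raise_for_status()
        return response.text
    except requests.exceptions.RequestException as err:
        # RequestException covers HTTP errors, timeouts, and connection failures
        print(f"Request failed: {err}")
        return None
```

Returning None explicitly on failure makes the later `if html_content:` check unambiguous.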

Step 3: Parse the HTML Content

Next, parse the page content with BeautifulSoup so that data in specific HTML tags can be extracted.

from bs4 import BeautifulSoup

def parse_html(html_content):
    soup = BeautifulSoup(html_content, 'html.parser')
    return soup

soup = parse_html(html_content)
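To see what parse_html gives you without touching the network, you can feed BeautifulSoup an HTML string directly. The fragment below is made up for the demo:

```python
from bs4 import BeautifulSoup

# A tiny made-up page, parsed the same way parse_html would
html = "<html><head><title>Demo</title></head><body><a href='/x'>X</a></body></html>"
soup = BeautifulSoup(html, "html.parser")

print(soup.title.string)  # the text inside the <title> tag
print(soup.a["href"])     # attribute access on the first <a> tag
```

The same navigation (`soup.title`, `soup.find_all(...)`) works identically on real fetched pages.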

Step 4: Extract the Title and Links

With the BeautifulSoup object we can easily extract the page title and the hyperlinks (<a> tags).

def extract_info(soup):
    # Extract the page title (soup.title is None if the page has no <title> tag)
    title = soup.title.string if soup.title else None
    print(f"Page Title: {title}")

    # Extract all links
    links = soup.find_all('a')
    for link in links:
        href = link.get('href')
        link_text = link.get_text(strip=True)  # unlike .string, works for nested tags
        if href:
            print(f"Link text: {link_text} - URL: {href}")

extract_info(soup)
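Many href values are relative (e.g. /about or page2.html). If you plan to follow the extracted links, you can resolve them against the page URL with the standard-library urljoin; the URLs below are made up for illustration:

```python
from urllib.parse import urljoin

base = "http://example.com/articles/index.html"

print(urljoin(base, "page2.html"))        # resolved relative to the current directory
print(urljoin(base, "/about"))            # resolved from the site root
print(urljoin(base, "http://other.com/")) # absolute URLs pass through unchanged
```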

Complete Code

import requests
from bs4 import BeautifulSoup

def fetch_page(url):
    try:
        response = requests.get(url)
        response.raise_for_status()
        return response.text
    except requests.exceptions.HTTPError as err:
        print(f"HTTP error occurred: {err}")
    except Exception as err:
        print(f"An error occurred: {err}")

def parse_html(html_content):
    soup = BeautifulSoup(html_content, 'html.parser')
    return soup

def extract_info(soup):
    title = soup.title.string if soup.title else None
    print(f"Page Title: {title}")

    links = soup.find_all('a')
    for link in links:
        href = link.get('href')
        link_text = link.get_text(strip=True)
        if href:
            print(f"Link text: {link_text} - URL: {href}")

url = "http://example.com"
html_content = fetch_page(url)

if html_content:
    soup = parse_html(html_content)
    extract_info(soup)

Conclusion

Through this simple web crawler example, we learned how to make HTTP requests with Python's requests library and how to parse HTML and extract information with BeautifulSoup. These are the foundations of web crawling and can help automate many data-collection tasks. In real projects, be sure to follow each site's robots.txt crawling rules and avoid violating the target site's terms of use.
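The robots.txt check mentioned above can be automated with Python's standard-library urllib.robotparser. Here is a sketch against a made-up robots.txt; for a real site you would call rp.set_url(...) and rp.read() to fetch the live file instead of parsing a string:

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt that blocks /private/ for all crawlers
robots_txt = """User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch("MyCrawler", "http://example.com/index.html"))    # allowed
print(rp.can_fetch("MyCrawler", "http://example.com/private/data"))  # disallowed
```

Checking can_fetch() before each request keeps the crawler within the rules the site publishes.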

