Using Python to Build a Simple Web Crawler: Scraping Page Titles and Links
In this tutorial, we will build a simple web crawler in Python that scrapes the titles and links from a web page. We will use the requests library to send HTTP requests and BeautifulSoup to parse the HTML content. The tutorial proceeds step by step and is suitable for learners with basic Python experience.
Step 1: Prepare the Development Environment
First, make sure the following Python libraries are installed:
- requests
- BeautifulSoup (bs4)
You can install them with the following commands:
pip install requests
pip install beautifulsoup4
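As an optional sanity check, you can confirm that both libraries import correctly before moving on:

```python
# Optional sanity check: confirm both libraries are installed and importable
import requests
from bs4 import BeautifulSoup

print("requests version:", requests.__version__)
print("BeautifulSoup importable:", callable(BeautifulSoup))
```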
步骤2:发送HTTP请求
我们将使用requests模块获取网页的HTML内容。这一步需要指定目标URL,并获取响应。
import requests
def fetch_page(url):
try:
response = requests.get(url)
response.raise_for_status() # 如果请求失败,这行代码会抛出HTTPError
return response.text
except requests.exceptions.HTTPError as err:
print(f"HTTP error occurred: {err}")
except Exception as err:
print(f"An error occurred: {err}")
url = "http://example.com"
html_content = fetch_page(url)
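In practice you will usually also want a request timeout, so a slow or unresponsive server cannot hang the crawler, and many sites expect an explicit User-Agent header. The variant below is a sketch of that idea; the function name fetch_page_safe and the header value are illustrative, not part of the tutorial's code above:

```python
import requests

def fetch_page_safe(url, timeout=10):
    """Like fetch_page, but with a timeout and an explicit User-Agent.
    Returns the page text, or None on any error."""
    headers = {"User-Agent": "simple-crawler-tutorial/0.1"}  # illustrative value
    try:
        response = requests.get(url, headers=headers, timeout=timeout)
        response.raise_for_status()
        return response.text
    except requests.exceptions.RequestException as err:
        # RequestException covers HTTP errors, timeouts, and connection failures
        print(f"Request failed: {err}")
        return None

# A host that cannot resolve triggers the error path and returns None
result = fetch_page_safe("http://nonexistent-host.invalid")
```

Catching requests.exceptions.RequestException (the base class) handles timeouts and connection failures as well as HTTP status errors, which the two-clause version above does not distinguish.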
Step 3: Parse the HTML Content
Next, use BeautifulSoup to parse the page content so that we can extract data from specific HTML tags.

from bs4 import BeautifulSoup

def parse_html(html_content):
    soup = BeautifulSoup(html_content, 'html.parser')
    return soup

soup = parse_html(html_content)
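Because BeautifulSoup works on any HTML string, you can experiment with the parser without making a network request. This small self-contained example uses a made-up snippet of HTML to show what the parsed object gives you:

```python
from bs4 import BeautifulSoup

# A tiny, made-up HTML document for experimentation
sample_html = """
<html>
  <head><title>Demo Page</title></head>
  <body><a href="/about">About</a></body>
</html>
"""

soup = BeautifulSoup(sample_html, 'html.parser')
print(soup.title.string)  # → Demo Page
print(soup.a['href'])     # → /about
```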
Step 4: Extract the Title and Links
With the BeautifulSoup object, we can easily extract the page title and the hyperlinks (<a> tags).

def extract_info(soup):
    # Extract the page title
    title = soup.title.string
    print(f"Page Title: {title}")
    # Extract all links
    links = soup.find_all('a')
    for link in links:
        href = link.get('href')
        link_text = link.string
        if href:
            print(f"Link text: {link_text} - URL: {href}")

extract_info(soup)
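One detail extract_info glosses over: many href values are relative (e.g. /about) rather than full URLs. Python's standard urllib.parse.urljoin can resolve them against the page's URL. A minimal sketch, using a made-up base URL:

```python
from urllib.parse import urljoin

base_url = "http://example.com/articles/"  # illustrative base URL

# Relative hrefs are resolved against the page they appear on;
# absolute URLs pass through unchanged.
print(urljoin(base_url, "/about"))       # → http://example.com/about
print(urljoin(base_url, "page2.html"))   # → http://example.com/articles/page2.html
print(urljoin(base_url, "http://other.example/x"))  # → http://other.example/x
```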
Complete Code

import requests
from bs4 import BeautifulSoup

def fetch_page(url):
    try:
        response = requests.get(url)
        response.raise_for_status()
        return response.text
    except requests.exceptions.HTTPError as err:
        print(f"HTTP error occurred: {err}")
    except Exception as err:
        print(f"An error occurred: {err}")

def parse_html(html_content):
    soup = BeautifulSoup(html_content, 'html.parser')
    return soup

def extract_info(soup):
    title = soup.title.string
    print(f"Page Title: {title}")
    links = soup.find_all('a')
    for link in links:
        href = link.get('href')
        link_text = link.string
        if href:
            print(f"Link text: {link_text} - URL: {href}")

url = "http://example.com"
html_content = fetch_page(url)
if html_content:
    soup = parse_html(html_content)
    extract_info(soup)
Conclusion
Through this simple web crawler example, we learned how to use Python's requests library to make HTTP requests, and how to use BeautifulSoup to parse HTML and extract information. These are the fundamentals of web crawling, and they can help automate many data-collection tasks. In real-world use, be sure to follow the crawling rules in the site's robots.txt file so that you do not violate the target site's terms of service.