[點晴永久免費OA] What is a web crawler?
graph TD
A[Crawler] --> B[Compare prices to save money]
A --> C[Snap up limited-edition sneakers]
A --> D[Track your favorite idol's updates]
A --> E[Check weather and flights]
A --> F[Find rental listings]
Core principle: simulate human browsing behavior and scrape the target data from web pages in bulk.
# A down-to-earth example to understand what a crawler is
import requests

# The weather page you check in a browser every day
def get_weather():
    response = requests.get("http://tianqi.com")
    return response.text  # a crawler does exactly this, just in code!

print("The essence of a crawler: a program that fetches web page data automatically")
1️⃣ **Install Python 3.8+**: download it from the official website (python.org)
2️⃣ Install a development tool: PyCharm Community Edition (free) is recommended
3️⃣ Install the required libraries:
pip install beautifulsoup4 requests lxml xlwt
Tip: Windows users can copy the command above into cmd and run it. A quick import check to confirm the installation is sketched right below.
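If you want to verify that everything installed correctly, a minimal sanity check (just a sketch; it only imports the four packages from the command above and prints two version numbers) looks like this:

# Sanity check: make sure the libraries installed above can be imported
# (any ImportError means the corresponding package is missing)
import requests
import bs4
import lxml
import xlwt

print("requests", requests.__version__)
print("beautifulsoup4", bs4.__version__)
print("lxml and xlwt imported successfully")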
graph LR
A[Send request] --> B[Parse data]
B --> C[Store results]
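Those three steps map directly onto three small functions. The sketch below is only an illustration of how they chain together (fetch_page, parse_page and save_rows are placeholder names, and https://example.com is used as a harmless test URL); the rest of this post builds the real versions, get_html, parse_html and save_to_excel, step by step.

# Minimal end-to-end illustration of request -> parse -> store
import requests
from bs4 import BeautifulSoup

def fetch_page(url):                          # step 1: send the request
    return requests.get(url, timeout=10).text

def parse_page(html):                         # step 2: parse the data we want
    soup = BeautifulSoup(html, "html.parser")
    return [a.get_text(strip=True) for a in soup.find_all("a")]

def save_rows(rows, filename):                # step 3: store the results
    with open(filename, "w", encoding="utf-8") as f:
        f.write("\n".join(rows))

save_rows(parse_page(fetch_page("https://example.com")), "links.txt")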
import urllib.request

# The key to disguising the program as a browser!
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
}

def get_html(url):
    req = urllib.request.Request(url, headers=headers)
    response = urllib.request.urlopen(req)
    return response.read().decode("utf-8")  # decode as UTF-8 to avoid garbled Chinese

# Test: fetch the first page
print(get_html("https://movie.douban.com/top250")[:500])
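Real requests can time out or be rejected, so in practice you may want a more defensive variant. The sketch below is not part of the original tutorial; it simply wraps the same call with a timeout and the standard urllib error handling, and it reuses the headers dict defined above.

import urllib.request
import urllib.error

def get_html_safe(url):
    # Same request as get_html, but with a timeout and basic error handling
    req = urllib.request.Request(url, headers=headers)
    try:
        response = urllib.request.urlopen(req, timeout=10)
        return response.read().decode("utf-8")
    except urllib.error.HTTPError as e:
        print("HTTP error:", e.code)       # e.g. 403 if the site refuses us
    except urllib.error.URLError as e:
        print("Network error:", e.reason)
    return ""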
from bs4 import BeautifulSoup

# The secret weapon for extracting one page of movie info
def parse_html(html):
    soup = BeautifulSoup(html, "html.parser")
    movie_list = []
    for item in soup.find_all('div', class_='item'):
        movie = {}
        movie['鏈接'] = item.find('a')['href']
        movie['標題'] = item.find('span', class_='title').text
        movie['評分'] = item.find('span', class_='rating_num').text
        movie_list.append(movie)
    return movie_list

# Test the parser
html = get_html("https://movie.douban.com/top250")
print(parse_html(html)[0])
Output:
{'鏈接': 'https://movie.douban.com/subject/1292052/',
'標題': '肖申克的救贖',
'評分': '9.7'}
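If you prefer CSS selectors, BeautifulSoup's select()/select_one() offer an equivalent way to write the same extraction. The version below is just an alternative sketch that assumes the same Douban page structure parse_html relies on; it is not required for the rest of the tutorial.

# The same extraction rewritten with CSS selectors
from bs4 import BeautifulSoup

def parse_html_css(html):
    soup = BeautifulSoup(html, "html.parser")
    movies = []
    for item in soup.select('div.item'):
        movies.append({
            '鏈接': item.select_one('a')['href'],
            '標題': item.select_one('span.title').text,
            '評分': item.select_one('span.rating_num').text,
        })
    return movies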
import xlwt

def save_to_excel(data, filename):
    workbook = xlwt.Workbook(encoding='utf-8')
    sheet = workbook.add_sheet('豆瓣電影')
    # Write the header row
    headers = ['排名', '標題', '評分', '詳情鏈接']
    for col, header in enumerate(headers):
        sheet.write(0, col, header)
    # Write the data rows
    for row, movie in enumerate(data, 1):
        sheet.write(row, 0, row)
        sheet.write(row, 1, movie['標題'])
        sheet.write(row, 2, movie['評分'])
        sheet.write(row, 3, movie['鏈接'])
    workbook.save(filename)
# Put it all together and save the results
all_movies = []
for i in range(0, 10):  # scrape all 10 pages (25 movies per page)
    url = f"https://movie.douban.com/top250?start={i*25}"
    html = get_html(url)
    all_movies.extend(parse_html(html))
save_to_excel(all_movies, "豆瓣Top250.xls")
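If you would rather avoid the xlwt dependency, the same list can be written out with Python's built-in csv module. This is only an alternative sketch; it reuses the all_movies list built above and writes a UTF-8-with-BOM file so Excel displays the Chinese text correctly.

# Alternative: save the same results as a CSV file (standard library only)
import csv

def save_to_csv(data, filename):
    with open(filename, 'w', newline='', encoding='utf-8-sig') as f:
        writer = csv.writer(f)
        writer.writerow(['排名', '標題', '評分', '詳情鏈接'])  # header row
        for rank, movie in enumerate(data, 1):
            writer.writerow([rank, movie['標題'], movie['評分'], movie['鏈接']])

save_to_csv(all_movies, "豆瓣Top250.csv")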
Tip 1: Be gentle with the server and add a delay between requests:

import time
time.sleep(2)  # pause 2 seconds after each request
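Applied to the page loop above, that just means sleeping between fetches; a sketch:

# The page loop from earlier, with a 2-second pause between requests
import time

all_movies = []
for i in range(0, 10):
    url = f"https://movie.douban.com/top250?start={i*25}"
    all_movies.extend(parse_html(get_html(url)))
    time.sleep(2)  # be polite: wait before fetching the next page
save_to_excel(all_movies, "豆瓣Top250.xls")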
Tip 2: If Chinese text comes back garbled, try a different encoding when decoding the response (the line below uses the requests library's response object):

response.content.decode('utf-8')  # or 'gbk' / 'GB2312' for older sites
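With requests you can also let the library guess the page's charset instead of trying encodings by hand; a small sketch using response.apparent_encoding (the weather URL is just the example from earlier):

# Let requests detect the page encoding automatically
import requests

response = requests.get("http://tianqi.com", timeout=10)
response.encoding = response.apparent_encoding  # use the detected charset
print(response.text[:200])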
Tip 3: Before crawling a site, check its robots.txt file to see which paths it allows crawlers to visit (e.g. https://www.douban.com/robots.txt). A small check using the standard library is sketched below.
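This sketch uses the standard urllib.robotparser module and assumes the site serves a robots.txt at the given URL; in practice pass your real User-Agent string instead of "*".

# Ask the site's robots.txt whether a URL may be fetched
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://movie.douban.com/robots.txt")  # assumed location of the robots file
rp.read()
print(rp.can_fetch("*", "https://movie.douban.com/top250"))  # True means crawling is allowed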
Q&A: Frequently asked questions

Q: Does a crawler have to be written in Python?
A: Java, PHP and C# can all do the job, but Python is the friendliest for beginners.

Q: Do I need a maths background?
A: Basic arithmetic is enough; there is essentially no barrier to entry.