今天开始学爬虫。爬个啥呢还没想好。🤭我的Python已经饥渴难耐了~

Notes

正则表达式

1 2	import re ptn = r'正则表达式'

re.compile()函数是什么？
将正则表达式的字符串形式编译为Pattern实例
正则小抄图片影响文章排版放到最下面了。
Note：数量词用在 字符或 () 之后，就是说对整个 () 生效

urllib

读网页

from urllib.request import urlopen

# utf-8解码
html = urlopen("https://o--o.win").read().decode('utf-8')
print(html)

下载

1 2	from urllib.request import urlretrieve urlretrieve(IMAGE_URL, './img/image1.png')

BeautifulSoup

官方文档

这才是正餐？
Ex：

from bs4 import BeautifulSoup
from urllib.request import urlopen
import re
import random
#import webbrowser

base_url = "https://baike.baidu.com"
his = ["/item/%e5%91%a8%e6%9d%b0%e4%bc%a6"]

#his = his.encode('uft-8')
for i in range(5):
    print(i)
    url = base_url + his[-1]
    
    html = urlopen(url).read().decode('utf-8')
    
    #webbrowser.open(url)
    
    soup = BeautifulSoup(html,features='lxml')
    
    print(soup.find('h1').get_text(),'\n后缀：',his[0])
    #正则找到 属性href是/item/开头 的 a标签
    sub_urls = soup.find_all('a',{"target":"_blank","href":re.compile("/item/")})
    
    #随机选取一个a标签中的href属性值，接到his后。
    his.append(random.sample(sub_urls,1)[0]['href'])
    
    #sample(population, k)从list中随机抽k个元素，也就是说抽完的结果还是个list，因此先跟了个[0]。

若想加入容错,最后一句改成:

if len(sub_urls) != 0:
       his.append(random.sample(sub_urls, 1)[0]['href'])
   else:
       # no valid sub link found
       his.pop()

Requests

强大的外部需求模块
官方文档

Get

import requests
import webbrowser
param = {"wd": "炫猿"}  # 搜索的信息
r = requests.get('http://www.baidu.com/s', params=param)
print(r.url)
webbrowser.open(r.url)

Post

1
2
3

data = {'firstname': '莫烦', 'lastname': '周'}
r = requests.post('http://pythonscraping.com/files/processing.php', data=data)
print(r.text)

下载

1
2
3

r = requests.get(IMAGE_URL)
with open('./img/image2.png', 'wb') as f:
    f.write(r.content)

requests 能让你下一点, 保存一点, 而不是要全部下载完才能保存去另外的地方. 这就是一个 chunk 一个 chunk 的下载. 使用 r.iter_content(chunk_size) 来控制每个 chunk 的大小, 然后在文件中写入这个 chunk 大小的数据.

r = requests.get(IMAGE_URL, stream=True)    # stream loading

with open('./img/image3.png', 'wb') as f:
    for chunk in r.iter_content(chunk_size=32):
        f.write(chunk)

实战

这样基本可以入门了。
就来爬一下我每天都要上的福利网址福利吧吧。福利吧每天有个文章叫汇总。里面充满了MM。就拿这些MM练手了~
Demo1：

from bs4 import BeautifulSoup
import requests

base_URL = "http://fuliba.net/20180"
for i in range(1,10):
    URL = base_URL+str(i)+".html"
    
    print("第 %s 期" % i)
    html = requests.get(URL).text
    
    soup = BeautifulSoup(html,features="lxml")
    
    imgs = soup.find_all('img',{"class":"alignnone size-large"})
    
    for img in imgs:
        url = img['src']
        r = requests.get(url,stream=True)
        #图片地址用 / 分割，再选最后一块作文件名
        image_name = url.split('/')[-1]
        with open('/Users/lixs/Pictures/%s' % image_name, 'wb') as f:
            for chunk in r.iter_content(chunk_size=128):
                f.write(chunk)
        print('Saved %s' % image_name)

运行结果：

战果：

Problems

TypeError: expected string or bytes-like object

对着网上教程，开门就遇到这个报错。
搜了下，原来是
re.findall(r"<title>(.+?)</title>", html)
这种写法是Python2的。
在Python3中只需要改写成
res = re.findall(r"<title>(.+?)</title>", str(html))

百度为什么要用Get而不用Post呢？

于是搜到这篇文章
为什么有了post那么多优点，还有还多网站用get，比如百度搜索

网址是中文的情况

html = urlopen("https://baike.baidu.com/item/周杰伦/").read().decode('utf-8')
很明显虽然 decode('utf-8')可以解决网页内容是中文的问题，
可当网址是中文的时候，还是会报错
UnicodeEncodeError: 'ascii' codec can't encode characters in position 10-12: ordinal not in range(128)