Published 2020-03-14 18:20
I had previously scraped text and images; this time I wanted to try video, so I crawled some of the videos from Pear Video at https://www.pearvideo.com/popular for practice. To my surprise it took a whole day. My skills are limited, so I can only scrape fairly simple content.
The approach:
1. Start from the https://www.pearvideo.com/popular page and find each video's detail-page address, parsing with XPath. Chrome can copy the XPath of any selected element directly, which makes this very convenient.
2. After extracting the detail-page links, crawl each detail page to find the actual video link.
For this step I recommend BeautifulSoup, because I couldn't reach the JavaScript code with XPath (the video address here is embedded in a script block), whereas BeautifulSoup gives you the text of the whole page to search.
3. Download the video.
Only the modified parts of settings.py are shown:
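The core of step 2, pulling the .mp4 address out of the page's inline JavaScript with a regular expression, can be tried on its own before wiring it into Scrapy. The html string below is a made-up stand-in for a detail page's script block, mimicking the srcUrl="..." assignment described above:

```python
import re

# Fabricated fragment imitating the inline <script> on a detail page,
# where the player is initialized with srcUrl="...mp4".
html = '<script>var contId="123", srcUrl="https://video.pearvideo.com/mp4/demo.mp4",vdoUrl=srcUrl;</script>'

match = re.search(r'srcUrl="(.*?\.mp4)"', html)
if match:
    print(match.group(1))  # https://video.pearvideo.com/mp4/demo.mp4
```

The non-greedy (.*?\.mp4) stops at the first closing quote after the .mp4 extension, which is why XPath is not needed for this part at all.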
BOT_NAME = 'video'
SPIDER_MODULES = ['video.spiders']
NEWSPIDER_MODULE = 'video.spiders'

ROBOTSTXT_OBEY = False

DEFAULT_REQUEST_HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
    # Acceptable content encodings for the response.
    "Accept-Encoding": "gzip, deflate, br",
    # Acceptable response languages.
    "Accept-Language": "zh-CN,zh;q=0.9",
    # Connection type the client prefers to use.
    "Connection": "keep-alive",
    # Whether caching may be used for this request/response.
    "Cache-Control": "max-age=0",
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36",
    'Referer': "https://www.pearvideo.com/category_5",
}

ITEM_PIPELINES = {
    'video.pipelines.VideoPipeline': 300,
}

FILES_STORE = 'D:/xxxx'
The spider (pearvideo.py):
# -*- coding: utf-8 -*-
import re
import time

import scrapy
from scrapy.selector import Selector
from bs4 import BeautifulSoup

from video.items import VideoItem


class PearvideoSpider(scrapy.Spider):
    name = 'pearvideo'
    start_urls = ['https://www.pearvideo.com/popular']

    def parse(self, response):
        sel = Selector(response)
        video_urls = sel.xpath('//*[@id="popularList"]/li')
        for i in range(1, len(video_urls) + 1):
            video_url = sel.xpath('//*[@id="popularList"]/li[' + str(i) + ']/div[2]/a/@href').extract()
            # time.sleep(3)  # optional throttle
            yield scrapy.Request(url=self.pinjie(video_url), callback=self.newparse)

    # Parse the video detail page.
    def newparse(self, response):
        videoitem = VideoItem()
        bs = BeautifulSoup(response.text, 'html.parser')
        sel = Selector(response)
        x = bs.text
        # The real .mp4 address sits in the inline JavaScript as srcUrl="...".
        video_url = re.search(r'srcUrl="(.*?\.mp4)"', x).group(1)
        video_title = sel.xpath('//*[@id="detailsbd"]/div[1]/div[2]/div/div[1]/h1/text()').extract()[0]
        videoitem['video_url'] = video_url
        videoitem['video_title'] = video_title
        yield videoitem

    # Join the relative detail-page path onto the site root.
    def pinjie(self, video_url):
        return "https://www.pearvideo.com/" + video_url[0]
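The pinjie helper simply concatenates the site root with the relative href it extracted. The standard library's urljoin does the same job and also copes with hrefs that arrive with a leading slash or as a full URL; a minimal sketch (the example paths are made up):

```python
from urllib.parse import urljoin

base = "https://www.pearvideo.com/"

# A bare relative href, as extracted from the popular list.
print(urljoin(base, "video_1234567"))   # https://www.pearvideo.com/video_1234567

# A leading-slash href resolves to the same place.
print(urljoin(base, "/video_1234567"))  # https://www.pearvideo.com/video_1234567
```

Inside a Scrapy callback, response.urljoin(href) wraps the same logic using the current page as the base.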
items.py:
# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy


class VideoItem(scrapy.Item):
    video_url = scrapy.Field()
    video_title = scrapy.Field()
pipelines.py:
# -*- coding: utf-8 -*-
import scrapy
from scrapy.exceptions import DropItem
from scrapy.pipelines.files import FilesPipeline


class VideoPipeline(FilesPipeline):
    def get_media_requests(self, item, info):
        # Pass the item along in meta so file_path can read the title.
        yield scrapy.Request(item['video_url'], meta={'item': item})

    # Rename the downloaded file after the video title.
    def file_path(self, request, response=None, info=None):
        item = request.meta['item']
        title = item['video_title']
        file_name = title + '.mp4'
        return file_name

    def item_completed(self, results, item, info):
        file_paths = [x['path'] for ok, x in results if ok]
        if not file_paths:
            raise DropItem("Item contains no files")
        item['video_title'] = file_paths
        return item
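One pitfall with naming files after page titles: a title may contain characters that are illegal in Windows filenames, and FILES_STORE above points at a Windows drive. A small sanitizing helper could be applied inside file_path before appending '.mp4'; the helper name and replacement character below are my own choices, not part of the original code:

```python
import re


def sanitize(title):
    # Replace the characters Windows forbids in filenames: \ / : * ? " < > |
    # and trim surrounding whitespace.
    return re.sub(r'[\\/:*?"<>|]', '_', title).strip()


print(sanitize('a/b:c'))  # a_b_c
```

With this in place, file_path would return sanitize(title) + '.mp4' instead of title + '.mp4'.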
Reference blogs:
Scrapy workflow
https://blog.csdn.net/kuangshp128/article/details/80321099
Scrapy documentation
Regular expressions
Online regex tester
Original post: https://blog.csdn.net/qq_34785659/article/details/104838431
Author: 恋爱后女盆友的变化
Link: https://www.pythonheidong.com/blog/article/259333/2c59229ceb176b71882d/
Source: python黑洞网