
Scraping Government Tender Documents with Scrapy

Published on 2021-04-24 19:59





Target site: Guangdong government procurement tenders (www.ccgp-guangdong.gov.cn).
Disclaimer: this content is for study and exchange only and must not be used commercially. If you need to extract the related information, please notify me and state your purpose; otherwise any consequences have nothing to do with me.
First, the content to scrape: each notice's title, publish date, detail-page link and attachments (screenshot omitted).

The category field is also needed, so each category name is mapped to its noticeType code in a dict:

subclass_dict = {
    "采购意向公开":"59",
    "单一来源公示":"001051",
    "进口产品清单":"",
    "采购计划":"001101",
    "采购需求":"001059",
    "资格预审需求":"001052,001053",
    "采购公告":"00101",
    "中标(成交)结果公告":"00102",
    "更正公告":"00103",
    "终止公告":"001004,001006",
    "合同公告":"001054",
    "验收公告":"001009,00105A"
}
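
These codes fill the noticeType parameter of the listing API. Before wiring everything into Scrapy, the endpoint can be probed on its own; here is a minimal sketch (assuming the requests package and that the endpoint still behaves as described; the URL template and channel ID are copied from the full code further down, and the date range is an arbitrary example):

import requests

URL_TPL = ("http://www.ccgp-guangdong.gov.cn/freecms/rest/v1/notice/selectInfoMoreChannel.do"
           "?&siteId=cd64e06a-21a7-4620-aebc-0576bab7e07a&channel={}&currPage={}&pageSize=10"
           "&noticeType={}&regionCode=&purchaseManner=&title=&openTenderCode=&purchaser=&agency="
           "&purchaseNature=&operationStartTime={}%2000:00:00&operationEndTime={}%2000:00:00"
           "&selectTimeName=noticeTime")

# Probe one category ("采购公告") for a single day
channel = "fca71be5-fc0c-45db-96af-f513e9abda9d"  # general channel ID used in the post
url = URL_TPL.format(channel, 1, subclass_dict["采购公告"], "2020-10-01", "2020-10-02")
data = requests.get(url, headers={"Content-Type": "application/json;charset=utf-8"}).json()["data"]
print(len(data), data[0]["title"] if data else "no results")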

Fetching the data is a GET request; the query parameters (screenshot omitted) include siteId, channel, currPage, pageSize and noticeType. Of course, a time range also has to be passed in at the end, so it is written roughly as:

item["channel"] = '07be11ca-1511-451f-afbb-6a2cb1e990d1' if k == "进口产品清单" else 'fca71be5-fc0c-45db-96af-f513e9abda9d'
            yield scrapy.Request(
  	           url=self.start_urls[0].format(item["channel"],item["page"],v,yesterday_format,now_time),
                #"2019-12-01"
                headers=headers,
                callback=self.parse,
                meta=copy.deepcopy(item)
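
Note the copy.deepcopy(item): Scrapy only schedules the request here, so without the copy every queued request would share one item dict, and the loop's later writes to noticeType/subclass would leak into earlier requests' callbacks. A minimal plain-Python illustration of the pitfall:

import copy

item = {}
queued = []
for k in ("采购公告", "更正公告"):
    item["subclass"] = k
    queued.append(item)                   # same dict object every time
    # queued.append(copy.deepcopy(item))  # fix: snapshot one dict per request
print([q["subclass"] for q in queued])    # ['更正公告', '更正公告'] — both see the last value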

Pagination:

# Short page (fewer than 9 of the 10 requested results): last page, stop paginating
if item["height"] < 9:
    return None
item["page"] += 1
yield scrapy.Request(
    url=self.start_urls[0].format(item["channel"], item["page"], item["noticeType"], yesterday_format, now_time),
    headers=headers,
    callback=self.parse,
    meta=item,
)
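
Since the query asks for pageSize=10, a short page means the listing is exhausted. The stop condition restated as a standalone helper (the < 9 threshold mirrors the original code; a strict "page not full" check would be < 10):

PAGE_SIZE = 10  # matches pageSize=10 in the query string

def is_last_page(results):
    # The spider stops paginating once a page returns fewer than 9 items.
    return len(results) < PAGE_SIZE - 1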

Each request returns JSON data. One more point is the redirect problem; see the full code for the details:

import scrapy
import faker
import time
import json
import copy
import re
from datetime import timedelta, datetime
# Time window to scrape: yesterday through today
yesterday = datetime.today() + timedelta(-1)
yesterday_format = yesterday.strftime('%Y-%m-%d')
now_time = time.strftime('%Y-%m-%d')
subclass_dict = {
    "采购意向公开":"59",
    "单一来源公示":"001051",
    "进口产品清单":"",
    "采购计划":"001101",
    "采购需求":"001059",
    "资格预审需求":"001052,001053",
    "采购公告":"00101",
    "中标(成交)结果公告":"00102",
    "更正公告":"00103",
    "终止公告":"001004,001006",
    "合同公告":"001054",
    "验收公告":"001009,00105A"
}
site = "www.ccgp-guangdong.gov.cn"
url = "http://www.ccgp-guangdong.gov.cn/freecms/rest/v1/notice/selectInfoMoreChannel.do?&siteId=cd64e06a-21a7-4620-aebc-0576bab7e07a&channel={}&currPage={}&pageSize=10&noticeType={}&regionCode=&purchaseManner=&title=&openTenderCode=&purchaser=&agency=&purchaseNature=&operationStartTime={}%2000:00:00&operationEndTime={}%2000:00:00&selectTimeName=noticeTime"
fake = faker.Faker()  # random User-Agent generator
headers = {
    'Content-Type': 'application/json;charset=utf-8',
    'User-Agent': fake.user_agent(),
}
class GdspiderSpider(scrapy.Spider):
    name = 'GDspider'
    allowed_domains = ['www.ccgp-guangdong.gov.cn']
    start_urls = [url]
    def __init__(self, goon=None, *args, **kwargs):
        super(GdspiderSpider, self).__init__(*args, **kwargs)
        self.goon = goon
    def start_requests(self):
        item = {}
        item["site"] = site
        item["page"] = 1
        for k, v in subclass_dict.items():
            item["noticeType"] = v
            item["subclass"] = k
            item["channel"] = '07be11ca-1511-451f-afbb-6a2cb1e990d1' if k == "进口产品清单" else 'fca71be5-fc0c-45db-96af-f513e9abda9d'
            yield scrapy.Request(
                url=self.start_urls[0].format(item["channel"],item["page"],v,yesterday_format,now_time),
                #"2019-12-01"
                headers=headers,
                callback=self.parse,
                meta=copy.deepcopy(item)
            )
    def parse(self, response):
        item = response.meta
        results = response.json()["data"]
        item["height"] = len(results)
        # No results: stop here
        if not results:
            return None
        for result in results:
            # Extract the relevant fields from the JSON
            item["title"] = result["title"]
            addtime = result["fieldValues"]['f_noticeTime']
            issue_time = re.findall(r"\d{4}-\d{2}-\d{2}", ''.join(addtime))
            item["issue_time"] = issue_time
            item["page_url"] = 'http://www.ccgp-guangdong.gov.cn'+result["pageurl"]+'?noticeType='+item["noticeType"]
            # print(item["page_url"])
            #附件
            item["download_url"] = []
            flie_list = json.loads(result["fieldValues"]['attachList'])
            flie_dict = {}
            for i in flie_list:
                name = i['fileName']
                download_url = i['fileUrl']
                flie_dict[name] = download_url
            item["download_url"] = [flie_dict]
            # Body text: request the detail page
            yield scrapy.Request(
                url=item["page_url"],
                headers=headers,
                callback=self.parse_G,
                meta=item,
                dont_filter=True,
            )
        # Short page (fewer than 9 of the 10 requested results): last page, stop paginating
        if item["height"] < 9:
            return None
        item["page"] += 1
        yield scrapy.Request(
            url = self.start_urls[0].format(item["channel"],item["page"],item["noticeType"],yesterday_format,now_time),
            headers=headers,
            callback=self.parse,
            meta=item
        )
    def parse_G(self, response):
        item = response.meta
        # Grab the notice body text from the detail page
        item["content"] = response.xpath('//div[@id="content"]').get()
        yield item
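
On the redirect problem mentioned above: the listing only sets dont_filter=True, which disables duplicate filtering rather than redirect handling. A common Scrapy way to take control of redirects (a sketch, not necessarily what the author did; dont_redirect and handle_httpstatus_list are standard Scrapy Request.meta keys) would be a variant of the detail-page request in parse():

# Hypothetical variant: keep RedirectMiddleware from following 301/302
# so parse_G can see the raw response (and its Location header) itself.
yield scrapy.Request(
    url=item["page_url"],
    headers=headers,
    callback=self.parse_G,
    meta={**item, "dont_redirect": True, "handle_httpstatus_list": [301, 302]},
    dont_filter=True,
)

To try the spider, drop it into a Scrapy project and run scrapy crawl GDspider -o tenders.json to export every yielded item to JSON.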







Author: dfd323

Link: https://www.pythonheidong.com/blog/article/952866/cf06c29d9ec183df02cd/

Source: python黑洞网

Please credit the source for any form of reposting; infringements, once discovered, will be pursued under the law.
