
Running Splash on a Mac to render non-login JS pages

Published 2020-04-21 17:17



First start the Splash rendering service in Docker (with Docker Desktop on macOS the command is the same as on Linux):

docker run -p 8050:8050 scrapinghub/splash
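
Before wiring up the scraper, it is worth confirming that the container is reachable. A minimal sketch, assuming Splash is running locally on port 8050 as started above (https://example.com is just a placeholder target):

import requests

# Ask Splash to render a page and return the resulting HTML;
# 'wait' gives the page's JavaScript a couple of seconds to run.
resp = requests.get(
    'http://localhost:8050/render.html',
    params={'url': 'https://example.com', 'wait': 2},
)
print(resp.status_code)   # 200 means Splash rendered the page
print(resp.text[:200])    # first chunk of the rendered HTML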

Rendering non-login JS pages

The script below crawls a site through that endpoint, following same-site links and writing product details to a CSV file:

import requests
from bs4 import BeautifulSoup
from queue import Queue, Empty
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urljoin, urlparse

class SplashScraper:

    def __init__(self, base_url):
        self.base_url = base_url
        # Scheme + domain of the site, used to resolve relative links
        # against the crawl root.
        self.root_url = '{}://{}'.format(urlparse(self.base_url).scheme, urlparse(self.base_url).netloc)
        self.pool = ThreadPoolExecutor(max_workers=20)
        self.scraped_pages = set()
        self.to_crawl = Queue()
        self.to_crawl.put(self.base_url)

    def parse_links(self, html):
        # Collect same-site links from the rendered HTML and queue any
        # page that has not been scraped yet.
        soup = BeautifulSoup(html, 'html.parser')
        for link in soup.find_all('a', href=True):
            url = link['href']
            if url.startswith('/') or url.startswith(self.root_url):
                url = urljoin(self.root_url, url)
                if url not in self.scraped_pages:
                    self.to_crawl.put(url)

    def scrape_info(self, html):
        # Extract the name and price from each product block and append
        # them as a quoted pair to a CSV file.
        soup = BeautifulSoup(html, 'html.parser')
        products = soup.find_all('div', {'class': 'product-detail'})
        for product in products:
            name = product.find('p', {'class': 'margin-bottom-xxl'})
            price = product.find('div', {'class': 'price'})
            if name and price:
                with open('product-details.csv', 'a') as output:
                    output.write('"{}","{}"\n'.format(name.get_text(), price.get_text()))

    def post_scrape_callback(self, res):
        # Runs in a worker thread once a page has been fetched: harvest
        # new links and product data from the rendered HTML.
        result = res.result()
        if result.status_code == 200:
            self.parse_links(result.text)
            self.scrape_info(result.text)

    def scrape_page(self, url):
        # Fetch the page through the local Splash render endpoint; passing
        # the target URL via params keeps it properly URL-encoded.
        return requests.get('http://localhost:8050/render.html',
                            params={'url': url, 'timeout': 30, 'wait': 10})

    def run_scraper(self):
        while True:
            try:
                # Stop once no new URL has appeared for two minutes.
                target_url = self.to_crawl.get(timeout=120)
                if target_url not in self.scraped_pages:
                    print("Scraping URL: {}".format(target_url))
                    self.scraped_pages.add(target_url)
                    job = self.pool.submit(self.scrape_page, target_url)
                    job.add_done_callback(self.post_scrape_callback)
            except Empty:
                return
            except Exception as e:
                print(e)
                continue

if __name__ == '__main__':
    s = SplashScraper("http://www.boden.co.uk")
    s.run_scraper()
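
To sanity-check the output, here is a minimal sketch for reading back the product-details.csv file that scrape_info appends to (the path matches the one used in the script above):

import csv

# Each row written by scrape_info is a quoted "name","price" pair.
with open('product-details.csv', newline='') as f:
    for name, price in csv.reader(f):
        print(name.strip(), '->', price.strip())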
    

Full script: https://github.com/EdmundMartin/SplashCrawler/blob/master/splashscrape.py

Original article: https://blog.csdn.net/nongcunqq/article/details/105633898



Author: 天青色等烟雨

Link: https://www.pythonheidong.com/blog/article/336849/1c1b7f6ceee1d151b7df/

Source: python黑洞网
