Is Toutiao's anti-scraping mechanism really that strict?

Published 2021-04-14 17:00



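Today I'll share how to scrape Toutiao (今日头条) with Selenium. Toutiao used to be a very scraper-friendly site, and collecting its data took little effort. Recently, though, its anti-scraping measures have suddenly become much stricter, so scraping is harder, and simply attaching a proxy can still leave you with no data at all. Anyone who wants to scrape Toutiao should analyze the site first. This post shows how to scrape Toutiao through a dynamic forwarding proxy. A dynamic forwarding proxy is used differently from an API-based proxy, so here is how to configure the dynamic forwarding parameters in a program. The code is as follows: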

```python
from selenium import webdriver
import string
import zipfile

# Proxy server (product site: www.16yun.cn)
proxyHost = "t.16yun.cn"
proxyPort = "31111"

# Proxy credentials
proxyUser = "username"
proxyPass = "password"


def create_proxy_auth_extension(proxy_host, proxy_port,
                                proxy_username, proxy_password,
                                scheme='http', plugin_path=None):
    """Build a Chrome extension (zip) that sets the proxy and answers its auth prompt."""
    if plugin_path is None:
        plugin_path = r'D:/{}_{}@t.16yun.zip'.format(proxy_username, proxy_password)

    # Note: this is a Manifest V2 extension; recent Chrome releases
    # have been phasing out MV2 support.
    manifest_json = """
    {
        "version": "1.0.0",
        "manifest_version": 2,
        "name": "16YUN Proxy",
        "permissions": [
            "proxy",
            "tabs",
            "unlimitedStorage",
            "storage",
            "<all_urls>",
            "webRequest",
            "webRequestBlocking"
        ],
        "background": {
            "scripts": ["background.js"]
        },
        "minimum_chrome_version":"22.0.0"
    }
    """

    # Fill the proxy settings and credentials into the background script.
    background_js = string.Template(
        """
        var config = {
            mode: "fixed_servers",
            rules: {
                singleProxy: {
                    scheme: "${scheme}",
                    host: "${host}",
                    port: parseInt(${port})
                },
                bypassList: ["foobar.com"]
            }
        };

        chrome.proxy.settings.set({value: config, scope: "regular"}, function() {});

        function callbackFn(details) {
            return {
                authCredentials: {
                    username: "${username}",
                    password: "${password}"
                }
            };
        }

        chrome.webRequest.onAuthRequired.addListener(
            callbackFn,
            {urls: ["<all_urls>"]},
            ['blocking']
        );
        """
    ).substitute(
        host=proxy_host,
        port=proxy_port,
        username=proxy_username,
        password=proxy_password,
        scheme=scheme,
    )

    # Package both files into a zip that Chrome can load as an extension.
    with zipfile.ZipFile(plugin_path, 'w') as zp:
        zp.writestr("manifest.json", manifest_json)
        zp.writestr("background.js", background_js)

    return plugin_path


proxy_auth_plugin_path = create_proxy_auth_extension(
    proxy_host=proxyHost,
    proxy_port=proxyPort,
    proxy_username=proxyUser,
    proxy_password=proxyPass)

option = webdriver.ChromeOptions()

option.add_argument("--start-maximized")

# If you get a chrome-extensions error
# option.add_argument("--disable-extensions")

option.add_extension(proxy_auth_plugin_path)

# Turn off some of the webdriver flags
# option.add_experimental_option('excludeSwitches', ['enable-automation'])

driver = webdriver.Chrome(options=option)

# Override the navigator.webdriver getter
# script = '''
# Object.defineProperty(navigator, 'webdriver', {
#     get: () => undefined
# })
# '''
# driver.execute_cdp_cmd("Page.addScriptToEvaluateOnNewDocument", {"source": script})

driver.get("https://www.toutiao.com")
```
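To make the difference from an API-based proxy more concrete: judging from the Selenium example above, the dynamic forwarding mode points every request at one fixed gateway (`t.16yun.cn:31111`) and authenticates with a username and password, rather than fetching a list of IPs first. The following is only a minimal sketch of that idea outside a browser. The helper names `build_proxy_url` and `build_proxies` are hypothetical and not part of the original article, the credentials are the same placeholders as above, and using the same endpoint for both `http` and `https` targets is an assumption.

```python
from urllib.parse import quote


def build_proxy_url(host, port, username, password, scheme="http"):
    """Return a proxy URL of the form scheme://user:pass@host:port.

    Credentials are percent-encoded so that characters such as '@' or ':'
    inside a password do not break the URL.
    """
    user = quote(username, safe="")
    pwd = quote(password, safe="")
    return f"{scheme}://{user}:{pwd}@{host}:{port}"


def build_proxies(host, port, username, password):
    """Build a proxies mapping in the shape the `requests` library accepts."""
    url = build_proxy_url(host, port, username, password)
    # Assumption: the same forwarding endpoint serves http and https targets.
    return {"http": url, "https": url}


if __name__ == "__main__":
    # Placeholder values taken from the example above; replace with real ones.
    proxies = build_proxies("t.16yun.cn", "31111", "username", "password")
    print(proxies["http"])
```

If the `requests` library is installed, the resulting mapping could be passed as `requests.get(url, proxies=proxies, timeout=10)`. Chrome does not accept credentials embedded in a proxy address this way, which is why the Selenium version packages them into an extension instead.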



Author: yiniuyun

Link: https://www.pythonheidong.com/blog/article/936037/792e2f284e197800a31b/

Source: python黑洞网

Any reproduction in any form must credit the source; infringement, once discovered, will be pursued legally.
