Multiple scraping: problems in my code. What exactly am I doing wrong?

Published 2022-10-06 22:13


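I am trying to scrape several elements at once with Selenium (for personal study only, not for profit). The idea is to scrape multiple elements per match and combine them into rows suitable for a database. Until now I have only ever scraped single elements, never several together, so there are some problems in my code.

I want to create this row for every round of the championship (Round 1, Round 2, etc.): Round, Date, Team_Home, Team_Away, Result_Home, Result_Away. For reference, each round has 8 rows (matches), and there are 26 rounds in total. I do not get any error, but the output is only >>>, with no text and no error message. The request and the code are for personal study only, not for commercial or profit-making purposes.

For example, I would like to get this: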

#SWEDEN ALLSVENSKAN
#Round, Date, Team_Home, Team_Away, Result_Home, Result_Away

Round 1, 11/31/2021 20:45, AIK Stockholm, Malmo, 2, 1
Round 1, 11/31/2021 20:45, Elfsborg, Gothenburg, 2, 3
...and the rest of the other matches of the 1st round

Round 2, 06/12/2021 20:45, Gothenburg, AIK Stockholm, 0, 1
Round 2, 06/12/2021 20:45, Malmo, Elfsborg, 1, 1
...and the rest of the other matches of the 2nd round

Round 3, etc.

Python code for the scraping:

Values_Allsvenskan = []

#SCRAPING
driver.get("link")
driver.implicitly_wait(12)
driver.minimize_window()

for Allsvenskan in multiple_scraping:

    try:
        wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button[id='event__more event__more--static']"))).click()
    except:
        pass

    multiple_scraping = round, date, team_home, team_away, score_home, score_away

    #row/record
    round = driver.find_elements(By.CSS_SELECTOR, "a[href^='/squadra'][class^='event__round event__round--static']")
    date = driver.find_elements(By.CSS_SELECTOR, "a[href^='/squadra'][class^='event__time']")
    team_home = driver.find_elements(By.CSS_SELECTOR, "a[href^='/squadra'][class^='event__participant event__participant--home']")            
    team_away = driver.find_elements(By.CSS_SELECTOR, "a[href^='/squadra'][class^='event__participant event__participant--away']")
    score_home = driver.find_elements(By.CSS_SELECTOR, "a[href^='/squadra'][class^='event__score event__score--home']")
    score_away = driver.find_elements(By.CSS_SELECTOR, "a[href^='/squadra'][class^='event__score event__score--away']")   


    Allsvenskan_text = round.text, date.text, team_home.text, team_away.text, score_home.text, score_away.text
    Values_Allsvenskan.append(tuple([Allsvenskan_text]))
    print(Allsvenskan_text)
driver.close


    #INSERT IN DATABASE
    con = sqlite3.connect('/database.db')
    cursor = con.cursor()
    sqlite_insert_query_Allsvenskan = 'INSERT INTO All_Score (round, date, team_home, team_away, score_home, score_away) VALUES (?, ?, ?, ?, ?, ?);'
    cursor.executemany(sqlite_insert_query_Allsvenskan, Values_Allsvenskan)
    con.commit()  

Based on my Python code, can you show me how to fix it? Thanks.

Update for the database insert

#INSERT IN DATABASE
con = sqlite3.connect('database.db')
cursor = con.cursor()
sqlite_insert_query_Allsvenskan = 'INSERT INTO All_Score(current_round, date, team_home, team_away, score_home, score_away) VALUES (?, ?, ?, ?, ?, ?);'
cursor.executemany(sqlite_insert_query_Allsvenskan, results = [])
con.commit()  

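Final update on the code logic, after the final answer: I only added comments to explain the steps. If I missed a comment or something needs to be added, please go ahead. I want to make sure I understand the logic of the code.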

#I search for rows with event__round or event__match
all_rows = driver.find_elements(By.CSS_SELECTOR, "div[class^='event__round'],div[class^='event__match']")

#Initializing an empty list
results = []

#Value default of the round before the for loop
current_round = '?'

#Check which classes of event__round and event__match have lines. It is used to recognize the row with Round?????
for row in all_rows:
     classes = row.get_attribute ('class')

## If round number and match both have rows, then I use find_element to get the rest of the other data to scrape
    if.........
    else.....

Solution


You use find_elements to get lists with all rounds, all dates, all team_home, all team_away, etc., so the values sit in separate lists. You should use zip() to group the values into rows like [single round, single date, single team_home, ...]:

results = []

for row in zip(date, team_home, team_away, score_home, score_away):
    row = [item.text for item in row]
    print(row)
    results.append(row)

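I skipped round because it creates more problems and needs completely different code: there is one round header for several matches, so that list cannot simply be zipped with the others.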

import selenium
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
driver.get("https://www.diretta.it/calcio/svezia/allsvenskan/risultati/")
driver.implicitly_wait(12)
#driver.minimize_window()

wait = WebDriverWait(driver, 10)

try:
    wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button[id='event__more event__more--static']"))).click()
except Exception as ex:
    print('EX:', ex)

round = driver.find_elements(By.CSS_SELECTOR, "[class^='event__round event__round--static']")
date = driver.find_elements(By.CSS_SELECTOR, "[class^='event__time']") #data e ora è tutto un pezzo su diretta.it
team_home = driver.find_elements(By.CSS_SELECTOR, "[class^='event__participant event__participant--home']")            
team_away = driver.find_elements(By.CSS_SELECTOR, "[class^='event__participant event__participant--away']")
score_home = driver.find_elements(By.CSS_SELECTOR, "[class^='event__score event__score--home']")
score_away = driver.find_elements(By.CSS_SELECTOR, "[class^='event__score event__score--away']")   

results = []

for row in zip(date, team_home, team_away, score_home, score_away):
    row = [item.text for item in row]
    print(row)
    results.append(row)

Result:

['01.11. 19:00', 'Degerfors', 'Göteborg', '0', '1']
['01.11. 19:00', 'Halmstad', 'AIK Stockholm', '1', '0']
['01.11. 19:00', 'Mjallby', 'Hammarby', '2', '0']
['31.10. 17:30', 'Örebro', 'Djurgarden', '0', '1']
['31.10. 15:00', 'Norrkoping', 'Elfsborg', '3', '2']
['30.10. 17:30', 'Hacken', 'Kalmar', '1', '4']
['30.10. 15:00', 'Sirius', 'Malmo FF', '2', '3']
['30.10. 15:00', 'Varbergs', 'Östersunds', '3', '0']
['28.10. 19:00', 'Degerfors', 'Elfsborg', '1', '2']
['28.10. 19:00', 'Göteborg', 'Djurgarden', '3', '0']
['28.10. 19:00', 'Halmstad', 'Örebro', '1', '1']
['28.10. 19:00', 'Norrkoping', 'Mjallby', '2', '2']
['27.10. 19:00', 'Kalmar', 'Varbergs', '2', '2']
['27.10. 19:00', 'Malmo FF', 'AIK Stockholm', '1', '0']
['27.10. 19:00', 'Östersunds', 'Hacken', '1', '1']
['27.10. 19:00', 'Sirius', 'Hammarby', '0', '1']
['25.10. 19:00', 'Örebro', 'Degerfors', '1', '2']
['24.10. 17:30', 'AIK Stockholm', 'Norrkoping', '1', '0']
...

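But this method can sometimes cause problems: if a row has an empty place (a missing value), zip() moves the value from the next row into the current row, and so on, so it can create wrong rows.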
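The shifting can be reproduced without a browser. The sketch below uses made-up plain lists in place of the scraped elements and assumes that the second match has no score on the page, so both score lists come back one item short:

```python
# Hypothetical scraped columns (plain strings instead of Selenium elements).
# Assumption for illustration: the second match (Örebro) has no score yet,
# so the two score lists are one item shorter than the other lists.
dates = ["01.11. 19:00", "31.10. 17:30", "30.10. 15:00"]
teams_home = ["Degerfors", "Örebro", "Sirius"]
teams_away = ["Göteborg", "Djurgarden", "Malmo FF"]
scores_home = ["0", "2"]  # the "2" belongs to Sirius, not to Örebro
scores_away = ["1", "3"]  # the "3" belongs to Malmo FF, not to Djurgarden

# zip() pairs items purely by position and stops at the shortest list.
rows = [list(row) for row in zip(dates, teams_home, teams_away, scores_home, scores_away)]

for row in rows:
    print(row)
```

Here the second row ends up with the scores of the third match, and the third match is dropped without any error, because zip() stops at the shortest list.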

A better approach is to find all rows (div, or tr in a table), then use a for-loop to work with every row separately and call row.find_elements instead of driver.find_elements. This should also resolve the problem with round, whose value has to be read once and then duplicated in the following rows.

I search for rows with event__round or event__match and then check which classes each row has. If it has event__round, I read the round. If it has event__match, I use find_element (without the s at the end) to get a single date, a single team_home, a single team_away, etc. (a single row holds only single values) and combine them with current_round to create the row.
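The carry-forward of the round can be sketched on plain data first. In this minimal sketch, page_rows is a hypothetical stand-in for the Selenium rows: each item is a (class attribute, visible text) pair, and the match text is made up for illustration.

```python
# Hypothetical stand-ins for the page rows: (class attribute, visible text).
page_rows = [
    ("event__round event__round--static", "Giornata 26"),
    ("event__match", "Degerfors - Göteborg"),
    ("event__match", "Halmstad - AIK Stockholm"),
    ("event__round event__round--static", "Giornata 25"),
    ("event__match", "Degerfors - Elfsborg"),
]

results = []
current_round = "?"  # placeholder until the first round header is seen

for classes, text in page_rows:
    if "event__round" in classes:
        # Header row: remember its number for the match rows that follow.
        current_round = text.split(" ")[-1]
    else:
        # Match row: attach the most recently seen round.
        results.append([current_round, text])

for row in results:
    print(row)
```

Every match row receives the number from the nearest header above it. The full Selenium version follows.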

import selenium
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
driver.get("https://www.diretta.it/calcio/svezia/allsvenskan/risultati/")
driver.implicitly_wait(12)
#driver.minimize_window()

wait = WebDriverWait(driver, 10)

try:
    wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button[id='event__more event__more--static']"))).click()
except Exception as ex:
    print('EX:', ex)

all_rows = driver.find_elements(By.CSS_SELECTOR, "div[class^='event__round'],div[class^='event__match']")

results = []

current_round = '?'

for row in all_rows:
    classes = row.get_attribute('class')
    #print(classes)
    
    if 'event__round' in classes:
        #round = row.find_elements(By.CSS_SELECTOR, "[class^='event__round event__round--static']")
        #current_round = row.text  # full text `Round 20`
        current_round = row.text.split(" ")[-1]  # only `20` without `Round`
    else:
        datetime = row.find_element(By.CSS_SELECTOR, "[class^='event__time']")
        
        date, time = datetime.text.split(" ")
        date = date.rstrip('.')  # right-strip to remove `.` at the end of date
        
        team_home = row.find_element(By.CSS_SELECTOR, "[class^='event__participant event__participant--home']")            
        team_away = row.find_element(By.CSS_SELECTOR, "[class^='event__participant event__participant--away']")
        score_home = row.find_element(By.CSS_SELECTOR, "[class^='event__score event__score--home']")
        score_away = row.find_element(By.CSS_SELECTOR, "[class^='event__score event__score--away']")   

        # old version
        #row = [current_round, datetime.text, team_home.text, team_away.text, score_home.text, score_away.text]
    
        row = [current_round, date, time, team_home.text, team_away.text, score_home.text, score_away.text]
        results.append(row)
        print(row)

# --- database ---

import sqlite3

con = sqlite3.connect('database.db')
cursor = con.cursor()

query = 'DROP TABLE IF EXISTS All_Score;'
cursor.execute(query)

# old version - with only `date`
#query = 'CREATE TABLE IF NOT EXISTS All_Score(current_round, date, team_home, team_away, score_home, score_away);'
# new version - with `date` and `time`
query = 'CREATE TABLE IF NOT EXISTS All_Score(current_round, date, time, team_home, team_away, score_home, score_away);'
cursor.execute(query)

# old version - with only `date`
#query = 'INSERT INTO All_Score(current_round, date, team_home, team_away, score_home, score_away) VALUES (?, ?, ?, ?, ?, ?);'
# new version - with `date` and `time`
query = 'INSERT INTO All_Score(current_round, date, time, team_home, team_away, score_home, score_away) VALUES (?, ?, ?, ?, ?, ?, ?);'
cursor.executemany(query, results)

con.commit()   
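The insert step can be checked on its own, without scraping. The sketch below is only an illustration: it uses an in-memory SQLite database and two hand-written rows in the 7-column shape the loop builds. It also shows why the question's tuple([Allsvenskan_text]) fails: that wraps a whole row in one more tuple, so executemany() receives one value for seven placeholders.

```python
import sqlite3

# Hand-written sample rows in the shape built by the scraping loop:
# [current_round, date, time, team_home, team_away, score_home, score_away]
results = [
    ["26", "01.11", "19:00", "Degerfors", "Göteborg", "0", "1"],
    ["26", "01.11", "19:00", "Halmstad", "AIK Stockholm", "1", "0"],
]

con = sqlite3.connect(":memory:")  # in-memory database, nothing is written to disk
cursor = con.cursor()
cursor.execute(
    "CREATE TABLE All_Score(current_round, date, time, team_home, team_away, score_home, score_away);"
)

# One flat sequence per row, one item per `?` placeholder.
insert_query = "INSERT INTO All_Score VALUES (?, ?, ?, ?, ?, ?, ?);"
cursor.executemany(insert_query, results)
con.commit()

stored = cursor.execute(
    "SELECT current_round, team_home, score_home FROM All_Score ORDER BY rowid;"
).fetchall()
print(stored)

# A row wrapped in an extra tuple has length 1, not 7, so sqlite3 rejects it.
nested_error = None
try:
    cursor.executemany(insert_query, [tuple([results[0]])])
except sqlite3.Error as ex:
    nested_error = ex
    print("EX:", ex)

con.close()
```

Passing results directly works because each inner list already has exactly seven items.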

Result (from the old version, with the full round text and with date and time in one field):

['Giornata 26', '01.11. 19:00', 'Degerfors', 'Göteborg', '0', '1']
['Giornata 26', '01.11. 19:00', 'Halmstad', 'AIK Stockholm', '1', '0']
['Giornata 26', '01.11. 19:00', 'Mjallby', 'Hammarby', '2', '0']
['Giornata 26', '31.10. 17:30', 'Örebro', 'Djurgarden', '0', '1']
['Giornata 26', '31.10. 15:00', 'Norrkoping', 'Elfsborg', '3', '2']
['Giornata 26', '30.10. 17:30', 'Hacken', 'Kalmar', '1', '4']
['Giornata 26', '30.10. 15:00', 'Sirius', 'Malmo FF', '2', '3']
['Giornata 26', '30.10. 15:00', 'Varbergs', 'Östersunds', '3', '0']

['Giornata 25', '28.10. 19:00', 'Degerfors', 'Elfsborg', '1', '2']
['Giornata 25', '28.10. 19:00', 'Göteborg', 'Djurgarden', '3', '0']
['Giornata 25', '28.10. 19:00', 'Halmstad', 'Örebro', '1', '1']
# ...



Author: 黑洞官方问答小能手

Link: https://www.pythonheidong.com/blog/article/1793554/9261abfd8f6af71673be/

Source: python黑洞网
