Rvest Scrape element
I am trying to scrape the team record (3-6-2) and the year for a team on this page: https://www.pro-football-reference.com/teams/pit/1933.htm
I tried using SelectorGadget to find the correct XPath or class, but nothing is working right. The closest I got was pulling "Record:" with the following:
library(rvest)
library(curl)

read_html(
  curl("https://www.pro-football-reference.com/teams/pit/1933.htm",
       handle = curl::new_handle("useragent" = "Mozilla/5.0"))) %>%
  html_element(xpath = '//*[@id="meta"]/div[2]/p[1]/strong') %>%
  html_text()
I would like the output to be a data frame. Any clarity on how to access this element with SelectorGadget would be helpful as I try to learn to pull other elements from this and other similar pages. Thanks!
Solution 1:[1]
If you're looking exclusively for tables, rvest's html_table function does exactly what you want.
html_table(read_html("https://www.pro-football-reference.com/teams/pit/1933.htm"))
[[1]]
# A tibble: 5 x 23
`` `` `` `Tot Yds & TO` `Tot Yds & TO` `Tot Yds & TO` `` `` Passing Passing
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 Player PF Yds "Ply" "Y/P" TO FL "1st~ "Cmp" Att
2 Team ~ 67 1943 "534" "3.6" 40 0 "" "60" 196
3 Opp. ~ 208 2735 "583" "4.7" 19 0 "" "57" 142
4 Lg Ra~ 8 8 "" "" 9 1 "1" "" 1
5 Lg Ra~ 10 9 "" "" 9 1 "1" "" 2
# ... with 13 more variables: Passing <chr>, Passing <chr>, Passing <chr>, Passing <chr>,
# Passing <chr>, Rushing <chr>, Rushing <chr>, Rushing <chr>, Rushing <chr>,
# Rushing <chr>, Penalties <chr>, Penalties <chr>, Penalties <chr>
[[2]]
# A tibble: 12 x 22
`` `` `` `` `` `` `` `` `` `` Score Score Offense Offense
<chr> <chr> <chr> <lgl> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 Week Day Date NA "" "" "OT" Rec "" Opp Tm Opp "1stD" "TotYd"
2 1 Wed Septemb~ NA "box~ "L" "" 0-1 "" New ~ 2 23 "" ""
3 2 Wed Septemb~ NA "box~ "W" "" 1-1 "" Chic~ 14 13 "" ""
4 3 Wed October~ NA "box~ "L" "" 1-2 "" Bost~ 6 21 "" ""
5 4 Wed October~ NA "box~ "W" "" 2-2 "" Cinc~ 17 3 "" ""
6 5 Sun October~ NA "box~ "L" "" 2-3 "@" Gree~ 0 47 "" ""
7 6 Sun October~ NA "box~ "T" "" 2-3-1 "@" Cinc~ 0 0 "" ""
8 7 Sun October~ NA "box~ "W" "" 3-3-1 "@" Bost~ 16 14 "" ""
9 8 Sun Novembe~ NA "box~ "T" "" 3-3-2 "@" Broo~ 3 3 "" ""
10 9 Sun Novembe~ NA "box~ "L" "" 3-4-2 "" Broo~ 0 32 "" ""
11 10 Sun Novembe~ NA "box~ "L" "" 3-5-2 "@" Phil~ 6 25 "" ""
12 12 Sun Decembe~ NA "box~ "L" "" 3-6-2 "@" New ~ 3 27 "" ""
# ... with 8 more variables: Offense <chr>, Offense <chr>, Offense <chr>, Defense <chr>,
# Defense <chr>, Defense <chr>, Defense <chr>, Defense <chr>
which you can then index and filter to get the value you're looking for.
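As a rough sketch of that indexing and filtering step, the example below works on a small inline HTML string that stands in for the real page, so it runs without a network call. The mock markup, the use of header = FALSE, and the names page, sched, m, hit, records and final_record are all assumptions for illustration. Because the real schedule table has a two-row header, the sketch locates the "Rec" label by value rather than by column name.

```r
library(rvest)

# Mock page (an assumption): a dummy first table plus a schedule table
# with a two-row header, mimicking the shape of the output shown above.
page <- read_html('
  <table><tr><th>Player</th><th>PF</th></tr>
         <tr><td>Team Stats</td><td>67</td></tr></table>
  <table>
    <tr><th></th><th></th><th>Score</th><th>Score</th></tr>
    <tr><th>Week</th><th>Rec</th><th>Tm</th><th>Opp</th></tr>
    <tr><td>1</td><td>0-1</td><td>2</td><td>23</td></tr>
    <tr><td>12</td><td>3-6-2</td><td>3</td><td>27</td></tr>
  </table>')

# header = FALSE keeps both header rows as data, so duplicate or empty
# column names never get in the way.
tables <- html_table(page, header = FALSE)
sched  <- as.data.frame(tables[[2]], stringsAsFactors = FALSE)

# Work on a character matrix and find the cell holding the "Rec" label.
m   <- as.matrix(sched)
hit <- which(m == "Rec", arr.ind = TRUE)[1, ]

# Everything below the label row in that column is a running record;
# the last entry is the record at the end of the season.
records      <- m[-seq_len(hit["row"]), hit["col"]]
final_record <- unname(tail(records, 1))
final_record
```

On the real page you would replace the inline string with the URL; whether the schedule is still the second table there is something to verify rather than assume.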
If you're hoping to avoid parsing the tables, you can extract the cells whose data-stat attribute is team_record directly with
read_html("https://www.pro-football-reference.com/teams/pit/1933.htm") %>%
html_elements(xpath = "//td[@data-stat='team_record']") %>%
html_text()
which pulls every value from that column; the last one is the team's final record for the season:
[1] "0-1" "1-1" "1-2" "2-2" "2-3" "2-3-1" "3-3-1" "3-3-2" "3-4-2" "3-5-2" "3-6-2"
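Since the question asks for a data frame with the year and the record, here is one possible way to finish the job. It is only a sketch: the inline HTML is a stand-in for read_html(url) so the example runs offline, and taking the year from the four digits in the URL is my assumption, not something the page guarantees. The names url, page, records and result are illustrative.

```r
library(rvest)

url <- "https://www.pro-football-reference.com/teams/pit/1933.htm"

# Stand-in for read_html(url) (an assumption): two schedule cells that
# carry the same data-stat attribute used in the XPath above.
page <- read_html('
  <table>
    <tr><td data-stat="team_record">3-5-2</td></tr>
    <tr><td data-stat="team_record">3-6-2</td></tr>
  </table>')

records <- page %>%
  html_elements(xpath = "//td[@data-stat='team_record']") %>%
  html_text()

# One row: the year parsed from the URL and the last running record.
result <- data.frame(
  year   = as.integer(sub(".*/([0-9]{4})\\.htm$", "\\1", url)),
  record = tail(records, 1),
  stringsAsFactors = FALSE
)
result
```

To cover several seasons, the same steps could be wrapped in a function of the URL and the resulting one-row data frames combined with rbind.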
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 |
