Scraping an element with rvest

I am trying to scrape the team record (3-6-2) and the year for a team on this page: https://www.pro-football-reference.com/teams/pit/1933.htm

I tried using SelectorGadget to find the correct XPath or class, but nothing works. The closest I got was pulling the label "Record:" (without the record itself) with the following:

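library(rvest)
library(curl)

# Open the page with a browser-like user agent, then select the <strong> label
read_html(
  curl("https://www.pro-football-reference.com/teams/pit/1933.htm",
       handle = curl::new_handle("useragent" = "Mozilla/5.0"))) %>%
  html_element(xpath = '//*[@id="meta"]/div[2]/p[1]/strong') %>%
  html_text()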

I would like the output to be a data frame. Any explanation of how to locate this element with SelectorGadget would be helpful, as I am trying to learn how to pull other elements from this and similar pages. Thanks!



Solution 1:[1]

If you're looking exclusively for tables, rvest's html_table function does exactly what you want.

html_table(read_html("https://www.pro-football-reference.com/teams/pit/1933.htm"))
[[1]]
# A tibble: 5 x 23
  ``     ``    ``    `Tot Yds & TO` `Tot Yds & TO` `Tot Yds & TO` ``    ``    Passing Passing
  <chr>  <chr> <chr> <chr>          <chr>          <chr>          <chr> <chr> <chr>   <chr>  
1 Player PF    Yds   "Ply"          "Y/P"          TO             FL    "1st~ "Cmp"   Att    
2 Team ~ 67    1943  "534"          "3.6"          40             0     ""    "60"    196    
3 Opp. ~ 208   2735  "583"          "4.7"          19             0     ""    "57"    142    
4 Lg Ra~ 8     8     ""             ""             9              1     "1"   ""      1      
5 Lg Ra~ 10    9     ""             ""             9              1     "1"   ""      2      
# ... with 13 more variables: Passing <chr>, Passing <chr>, Passing <chr>, Passing <chr>,
#   Passing <chr>, Rushing <chr>, Rushing <chr>, Rushing <chr>, Rushing <chr>,
#   Rushing <chr>, Penalties <chr>, Penalties <chr>, Penalties <chr>

[[2]]
# A tibble: 12 x 22
   ``    ``    ``       ``    ``    ``    ``    ``    ``    ``    Score Score Offense Offense
   <chr> <chr> <chr>    <lgl> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>   <chr>  
 1 Week  Day   Date     NA    ""    ""    "OT"  Rec   ""    Opp   Tm    Opp   "1stD"  "TotYd"
 2 1     Wed   Septemb~ NA    "box~ "L"   ""    0-1   ""    New ~ 2     23    ""      ""     
 3 2     Wed   Septemb~ NA    "box~ "W"   ""    1-1   ""    Chic~ 14    13    ""      ""     
 4 3     Wed   October~ NA    "box~ "L"   ""    1-2   ""    Bost~ 6     21    ""      ""     
 5 4     Wed   October~ NA    "box~ "W"   ""    2-2   ""    Cinc~ 17    3     ""      ""     
 6 5     Sun   October~ NA    "box~ "L"   ""    2-3   "@"   Gree~ 0     47    ""      ""     
 7 6     Sun   October~ NA    "box~ "T"   ""    2-3-1 "@"   Cinc~ 0     0     ""      ""     
 8 7     Sun   October~ NA    "box~ "W"   ""    3-3-1 "@"   Bost~ 16    14    ""      ""     
 9 8     Sun   Novembe~ NA    "box~ "T"   ""    3-3-2 "@"   Broo~ 3     3     ""      ""     
10 9     Sun   Novembe~ NA    "box~ "L"   ""    3-4-2 ""    Broo~ 0     32    ""      ""     
11 10    Sun   Novembe~ NA    "box~ "L"   ""    3-5-2 "@"   Phil~ 6     25    ""      ""     
12 12    Sun   Decembe~ NA    "box~ "L"   ""    3-6-2 "@"   New ~ 3     27    ""      ""     
# ... with 8 more variables: Offense <chr>, Offense <chr>, Offense <chr>, Defense <chr>,
#   Defense <chr>, Defense <chr>, Defense <chr>, Defense <chr>

which you can then index and filter to get the value you're looking for.
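As a rough sketch of that indexing step: the helper below is hypothetical (the name last_record_from_tables is not part of rvest). It assumes the schedule is the second table on the page and that its first row carries the real column labels, as the printed output above suggests. It finds the column labelled "Rec" and returns its last value.

```r
library(rvest)

# Hypothetical helper: pull the final "Rec" value from the schedule table.
# Assumes the schedule is the second table on the page and that its first
# row holds the real column labels (Week, Day, ..., Rec, ...).
last_record_from_tables <- function(page) {
  schedule <- html_table(page)[[2]]
  # Read the label stored in the first row of every column
  labels <- vapply(schedule, function(col) as.character(col[1]), character(1))
  # Position of the column labelled "Rec" (NA labels are ignored by which())
  rec_col <- which(labels == "Rec")[1]
  # The last row of that column is the record after the final game
  tail(schedule[[rec_col]], 1)
}

# Usage (needs network access):
# page <- read_html("https://www.pro-football-reference.com/teams/pit/1933.htm")
# last_record_from_tables(page)
```

Looking the column up by its label rather than by a fixed position should make the helper a little more tolerant of layout differences between seasons, though that is an assumption about the other pages.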

If you're hoping to avoid parsing the tables, you can select the cells whose data-stat attribute is team_record directly with

read_html("https://www.pro-football-reference.com/teams/pit/1933.htm") %>%
  html_elements(xpath = "//td[@data-stat='team_record']") %>%
  html_text()

This pulls every value from that column; the last one is the record at the end of the season.

 [1] "0-1"   "1-1"   "1-2"   "2-2"   "2-3"   "2-3-1" "3-3-1" "3-3-2" "3-4-2" "3-5-2" "3-6-2"
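Since the question asks for a data frame, here is one possible way to wrap the selector above. The helper name season_record is hypothetical, and the year is taken from the URL on the assumption that the four digits before ".htm" are the season.

```r
library(rvest)

# Hypothetical helper: build a one-row data frame of year and final record.
# Assumes the season year is the four digits before ".htm" in the URL.
season_record <- function(page, url) {
  # All running records from the schedule, in game order
  records <- page %>%
    html_elements(xpath = "//td[@data-stat='team_record']") %>%
    html_text()
  data.frame(
    year = as.integer(sub(".*/([0-9]{4})\\.htm$", "\\1", url)),
    record = tail(records, 1),  # last entry is the end-of-season record
    stringsAsFactors = FALSE
  )
}

# Usage (needs network access):
# url <- "https://www.pro-football-reference.com/teams/pit/1933.htm"
# season_record(read_html(url), url)
```

If the page has no matching cells, data.frame() will raise an error because the columns differ in length, so a real script would probably want to check length(records) first.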

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1