Iterate over an xpath (string) in R for data scraping


I've got a (pretty simple) code to download a table with data:
library(rvest)
link = "https://hosted.dcd.shared.geniussports.com/fubb/es/competition/34409/team/2442/statistics"
aguada = read_html(link)
stats = aguada %>% html_nodes("tbody")
stats = aguada %>% html_nodes(xpath="/html/body/div[1]/div[6]/div/div/div/div[4]/table") %>% html_table()
my_df <- as.data.frame(stats)
And now I'm trying to do the same, but for the URLs for each player in the same table
for (i in 1:17){
url_path="/html/body/div[1]/div[6]/div/div/div/div[4]/table/tbody/tr[i]/td[1]/a"
jugador[i] = aguada %>% html_nodes(xpath=url_path)%>% html_attr("href")
}
I've tried the code above, and while it doesn't crash, it doesn't work as intended either. I want to create an array with the urls or something like that so I can then get the stats for each player easily. While we're at it, I'd like to know if, instead of doing 1:17 in the for and manually counting the players, there's a way to automate that too, so I can do something like for i in 1:table_length

You need to initialise the vector jugador before you can assign the links into it. Also, because the row index inside the path changes on every iteration, the path has to be built with paste, which concatenates the fixed parts of the string with the number i, as shown below:
jugador <- vector()
for(i in 1:17){
url_path <- paste("/html/body/div[1]/div[6]/div/div/div/div[4]/table/tbody/tr[", i, "]/td[1]/a", sep = "")
jugador[i] <- aguada %>% html_nodes(xpath=url_path)%>% html_attr("href")
}
Result:
> jugador
[1] "https://hosted.dcd.shared.geniussports.com/fubb/es/competition/34409/person/15257?"
[2] "https://hosted.dcd.shared.geniussports.com/fubb/es/competition/34409/person/17101?"
[3] "https://hosted.dcd.shared.geniussports.com/fubb/es/competition/34409/person/17554?"
[4] "https://hosted.dcd.shared.geniussports.com/fubb/es/competition/34409/person/43225?"
[5] "https://hosted.dcd.shared.geniussports.com/fubb/es/competition/34409/person/262286?"
[6] "https://hosted.dcd.shared.geniussports.com/fubb/es/competition/34409/person/623893?"
[7] "https://hosted.dcd.shared.geniussports.com/fubb/es/competition/34409/person/725720?"
[8] "https://hosted.dcd.shared.geniussports.com/fubb/es/competition/34409/person/858052?"
[9] "https://hosted.dcd.shared.geniussports.com/fubb/es/competition/34409/person/1645559?"
[10] "https://hosted.dcd.shared.geniussports.com/fubb/es/competition/34409/person/1651515?"
[11] "https://hosted.dcd.shared.geniussports.com/fubb/es/competition/34409/person/1717089?"
[12] "https://hosted.dcd.shared.geniussports.com/fubb/es/competition/34409/person/1924883?"
[13] "https://hosted.dcd.shared.geniussports.com/fubb/es/competition/34409/person/1924884?"
[14] "https://hosted.dcd.shared.geniussports.com/fubb/es/competition/34409/person/1931124?"
[15] "https://hosted.dcd.shared.geniussports.com/fubb/es/competition/34409/person/1950388?"
[16] "https://hosted.dcd.shared.geniussports.com/fubb/es/competition/34409/person/1971299?"
[17] "https://hosted.dcd.shared.geniussports.com/fubb/es/competition/34409/person/1991297?"
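To avoid hard-coding 17, one option is to count the table rows first and loop over seq_len() of that count. This is a minimal sketch, assuming a small inline HTML table as a stand-in for the real page; the markup, the shortened XPath, and the names page, n_rows and links are illustrative and not taken from the site:

```r
library(rvest)

# Hypothetical stand-in for the scraped page: one link per table row
page <- read_html('
<table>
  <tbody>
    <tr><td><a href="/person/1?">A. ONE</a></td><td>10</td></tr>
    <tr><td><a href="/person/2?">B. TWO</a></td><td>20</td></tr>
    <tr><td><a href="/person/3?">C. THREE</a></td><td>30</td></tr>
  </tbody>
</table>')

# Count the rows instead of counting players by hand
n_rows <- length(html_nodes(page, xpath = "//table/tbody/tr"))

# Pre-allocate the result, then build each row's XPath with sprintf()
links <- character(n_rows)
for (i in seq_len(n_rows)) {
  row_path <- sprintf("//table/tbody/tr[%d]/td[1]/a", i)
  links[i] <- page %>% html_nodes(xpath = row_path) %>% html_attr("href")
}

print(links)
```

seq_len(n_rows) is used rather than 1:n_rows so that the loop body is skipped entirely when the table turns out to be empty.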

This adds the links as the last column of the table, without a loop:
library(tidyverse)
library(rvest)
page <-
"https://hosted.dcd.shared.geniussports.com/fubb/es/competition/34409/team/2442/statistics" %>%
read_html()
df <- page %>%
html_table() %>%
pluck(1) %>%
janitor::clean_names() %>%
mutate(link = page %>%
html_elements("td a") %>%
html_attr("href") %>%
unique())
# A tibble: 17 x 21
jugador p i pts_pr pts as_pr as ro_pr rd_pr rt_pr rt bl_prom bl re_pr re min_pr tc_percent x2p_percent x3p_percent tl_percent link$value
<chr> <int> <int> <dbl> <int> <dbl> <int> <dbl> <dbl> <dbl> <int> <dbl> <int> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 F. MEDINA 22 9 6 131 1.3 29 0.5 0.8 1.3 28 0 0 0.6 13 22 37 55.6 26.8 60 https://hosted.dcd.share~
2 J. SANTISO 23 23 12 277 5.6 128 0.4 2.9 3.3 75 0 0 0.7 15 31 43.1 43.2 43 75 https://hosted.dcd.share~
3 A. ZUVICH 17 1 8.2 139 0.7 11 2 2.9 4.9 83 0.5 8 1.1 19 15.9 59.8 67.1 16.7 76.5 https://hosted.dcd.share~
4 A. YOUNG 15 14 12.5 187 1.3 20 0.4 3.3 3.7 55 0.5 7 0.6 9 30.5 36.2 41.9 32 78.8 https://hosted.dcd.share~
5 E. VARGAS 23 23 16.1 370 1.9 44 3.5 8.4 11.9 273 1.6 37 1.1 25 30.3 53.3 53.5 0 62.6 https://hosted.dcd.share~
6 L. PLANELLS 23 0 3.6 83 1.6 37 0.5 1.1 1.6 37 0.1 2 0.7 17 15.1 35.4 35.1 35.6 90 https://hosted.dcd.share~
7 T. METZGER 11 9 6.8 75 0.6 7 1.7 3.3 5 55 0.4 4 0.5 5 23.1 37 44.2 28.9 40 https://hosted.dcd.share~
8 L. SILVA 19 0 1.1 21 0.1 2 0.2 0.2 0.3 6 0.1 1 0 0 4 35 71.4 15.4 100 https://hosted.dcd.share~
9 J. STOLL 2 0 0 0 0 0 0 0 0 0 0 0 0 0 1.2 0 0 0 0 https://hosted.dcd.share~
10 G. BRUN 4 0 0.8 3 0 0 0.3 0 0.3 1 0 0 0 0 0.6 50 0 50 0 https://hosted.dcd.share~
11 A. GENTILE 3 0 0 0 0 0 0.3 0.3 0.7 2 0 0 0 0 1 0 0 0 0 https://hosted.dcd.share~
12 L. CERMINATO 19 5 8.6 163 1.7 33 1.3 3.6 4.9 93 0.7 14 0.9 17 20.9 44.1 51.9 27.1 57.1 https://hosted.dcd.share~
13 J. ADAMS 8 8 16.6 133 1.9 15 1 2.5 3.5 28 0.3 2 1.9 15 28.9 46.2 53.9 26.7 81.8 https://hosted.dcd.share~
14 K. FULLER 5 5 4.6 23 1.8 9 0.6 0.6 1.2 6 0 0 0.4 2 20.1 17.1 0 28.6 83.3 https://hosted.dcd.share~
15 S. MAC 4 4 12.5 50 2 8 0 3 3 12 0.5 2 1.8 7 29.9 37.8 35.5 42.9 76.9 https://hosted.dcd.share~
16 O. JOHNSON 12 12 15.4 185 3.4 41 1 3.2 4.2 50 0.3 4 0.8 9 31.8 47.3 53.6 34.7 75 https://hosted.dcd.share~
17 G. SOLANO 2 2 15.5 31 6.5 13 0.5 5.5 6 12 0 0 1 2 32.4 41.4 55.6 18.2 71.4 https://hosted.dcd.share~

Inside the string, i is just a regular character, and XPath doesn’t know it: it has no connection to the variables in your R session.
However, if you want to select all elements with a given XPath, you don’t need the index at all. That is, the following XPath expression works (I’ve simply removed the [i] part):
/html/body/div[1]/div[6]/div/div/div/div[4]/table/tbody/tr/td[1]/a
Here’s the corresponding ‘rvest’ code. Note that it uses no loop:
library(rvest)
link = "https://hosted.dcd.shared.geniussports.com/fubb/es/competition/34409/team/2442/statistics"
aguada = read_html(link)
jugador = aguada %>%
html_nodes(xpath = "/html/body/div[1]/div[6]/div/div/div/div[4]/table/tbody/tr/td[1]/a/@href")
Or, alternatively:
jugador = aguada %>%
html_nodes(xpath = "/html/body/div[1]/div[6]/div/div/div/div[4]/table/tbody/tr/td[1]/a") %>%
html_attr("href")
Both return the hyperlinks. The first solution has a slightly different return type (an xml_nodeset of attribute nodes rather than a character vector), but for most purposes they will be similar.
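The difference in return types can be checked on a small example. This is a sketch on a hypothetical inline table (the markup and the shortened XPath are assumptions, not the real page): selecting the href attribute directly yields attribute nodes, which html_text() converts to plain strings.

```r
library(rvest)

# Hypothetical stand-in for the scraped page
page <- read_html('
<table>
  <tbody>
    <tr><td><a href="/person/1?">A. ONE</a></td></tr>
    <tr><td><a href="/person/2?">B. TWO</a></td></tr>
  </tbody>
</table>')

# Variant 1: select the attribute nodes themselves
attr_nodes <- html_nodes(page, xpath = "//table/tbody/tr/td[1]/a/@href")
from_nodes <- html_text(attr_nodes)   # convert the nodeset to character

# Variant 2: select the <a> elements, then read their href attribute
from_attr <- page %>%
  html_nodes(xpath = "//table/tbody/tr/td[1]/a") %>%
  html_attr("href")

print(class(attr_nodes))
print(identical(from_nodes, from_attr))
```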

Related

Add new column to state data frame based on other column data [duplicate]

I am attempting to add a new column to the state sample data frame in R. I would like this column to group the state IDs into broader categories (1-4). My code is close to what I am looking for, but I am not getting it quite right. I know I could enter each state ID line by line, but is there a quicker way? Thank you!
library(tidyverse)
#Add column to denote each state
States=state.x77
States=data.frame(States)
States <- tibble::rowid_to_column(States, "ID")
States
#Create new variable for state buckets
States <- States %>%
mutate(WAGE_BUCKET=case_when(ID <= c(1,12) ~ '1',
ID <= c(13,24) ~ '2',
ID <= c(25,37) ~ '3',
ID <= c(38,50) ~ '4',
TRUE ~ 'NA'))
View(States) #It is not grouping the states in the way I want/I am still getting some NA values but unsure why!

You can use cut or findInterval if all of your groups will be using contiguous ID values:
findInterval(States$ID, c(1, 13, 25, 38, 51))
# [1] 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4 4 4 4 4 4
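The cut alternative might look like the sketch below. It assumes the same ID column as in the question, rebuilt here with base R so the snippet is self-contained; with the default right = TRUE the intervals are (0,12], (12,24], (24,37] and (37,50].

```r
# Rebuild the question's data frame with an ID column (base R only)
States <- data.frame(state.x77)
States$ID <- seq_len(nrow(States))

# Breaks are the upper ends of each bucket; labels name the buckets
States$WAGE_BUCKET <- cut(
  States$ID,
  breaks = c(0, 12, 24, 37, 50),
  labels = c("1", "2", "3", "4")
)

# Number of states per bucket
bucket_counts <- table(States$WAGE_BUCKET)
print(bucket_counts)
```

Note that cut returns a factor, whereas findInterval returns integers; wrap the result in as.character() if a character column is wanted.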
If you want to make it a bit more verbose, you can use dplyr::between in your case_when:
States %>%
mutate(
WAGE_BUCKET = case_when(
between(ID, 1, 12) ~ "1",
between(ID, 13, 24) ~ "2",
between(ID, 25, 37) ~ "3",
between(ID, 38, 50) ~ "4",
TRUE ~ NA_character_)
)
# ID Population Income Illiteracy Life Exp Murder HS Grad Frost Area WAGE_BUCKET
# 1 1 3615 3624 2.1 69.05 15.1 41.3 20 50708 1
# 2 2 365 6315 1.5 69.31 11.3 66.7 152 566432 1
# 3 3 2212 4530 1.8 70.55 7.8 58.1 15 113417 1
# 4 4 2110 3378 1.9 70.66 10.1 39.9 65 51945 1
# 5 5 21198 5114 1.1 71.71 10.3 62.6 20 156361 1
# 6 6 2541 4884 0.7 72.06 6.8 63.9 166 103766 1
# 7 7 3100 5348 1.1 72.48 3.1 56.0 139 4862 1
# 8 8 579 4809 0.9 70.06 6.2 54.6 103 1982 1
# 9 9 8277 4815 1.3 70.66 10.7 52.6 11 54090 1
# 10 10 4931 4091 2.0 68.54 13.9 40.6 60 58073 1
# 11 11 868 4963 1.9 73.60 6.2 61.9 0 6425 1
# 12 12 813 4119 0.6 71.87 5.3 59.5 126 82677 1
# 13 13 11197 5107 0.9 70.14 10.3 52.6 127 55748 2
# 14 14 5313 4458 0.7 70.88 7.1 52.9 122 36097 2
# 15 15 2861 4628 0.5 72.56 2.3 59.0 140 55941 2
# 16 16 2280 4669 0.6 72.58 4.5 59.9 114 81787 2
# 17 17 3387 3712 1.6 70.10 10.6 38.5 95 39650 2
# 18 18 3806 3545 2.8 68.76 13.2 42.2 12 44930 2
# 19 19 1058 3694 0.7 70.39 2.7 54.7 161 30920 2
# 20 20 4122 5299 0.9 70.22 8.5 52.3 101 9891 2
# 21 21 5814 4755 1.1 71.83 3.3 58.5 103 7826 2
# 22 22 9111 4751 0.9 70.63 11.1 52.8 125 56817 2
# 23 23 3921 4675 0.6 72.96 2.3 57.6 160 79289 2
# 24 24 2341 3098 2.4 68.09 12.5 41.0 50 47296 2
# 25 25 4767 4254 0.8 70.69 9.3 48.8 108 68995 3
# 26 26 746 4347 0.6 70.56 5.0 59.2 155 145587 3
# 27 27 1544 4508 0.6 72.60 2.9 59.3 139 76483 3
# 28 28 590 5149 0.5 69.03 11.5 65.2 188 109889 3
# 29 29 812 4281 0.7 71.23 3.3 57.6 174 9027 3
# 30 30 7333 5237 1.1 70.93 5.2 52.5 115 7521 3
# 31 31 1144 3601 2.2 70.32 9.7 55.2 120 121412 3
# 32 32 18076 4903 1.4 70.55 10.9 52.7 82 47831 3
# 33 33 5441 3875 1.8 69.21 11.1 38.5 80 48798 3
# 34 34 637 5087 0.8 72.78 1.4 50.3 186 69273 3
# 35 35 10735 4561 0.8 70.82 7.4 53.2 124 40975 3
# 36 36 2715 3983 1.1 71.42 6.4 51.6 82 68782 3
# 37 37 2284 4660 0.6 72.13 4.2 60.0 44 96184 3
# 38 38 11860 4449 1.0 70.43 6.1 50.2 126 44966 4
# 39 39 931 4558 1.3 71.90 2.4 46.4 127 1049 4
# 40 40 2816 3635 2.3 67.96 11.6 37.8 65 30225 4
# 41 41 681 4167 0.5 72.08 1.7 53.3 172 75955 4
# 42 42 4173 3821 1.7 70.11 11.0 41.8 70 41328 4
# 43 43 12237 4188 2.2 70.90 12.2 47.4 35 262134 4
# 44 44 1203 4022 0.6 72.90 4.5 67.3 137 82096 4
# 45 45 472 3907 0.6 71.64 5.5 57.1 168 9267 4
# 46 46 4981 4701 1.4 70.08 9.5 47.8 85 39780 4
# 47 47 3559 4864 0.6 71.72 4.3 63.5 32 66570 4
# 48 48 1799 3617 1.4 69.48 6.7 41.6 100 24070 4
# 49 49 4589 4468 0.7 72.48 3.0 54.5 149 54464 4
# 50 50 376 4566 0.6 70.29 6.9 62.9 173 97203 4

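c(1, 12) is a vector of length 2, so ID <= c(1, 12) recycles those two values along ID instead of testing a range. A comparison operator checks each element against a single bound. We could use between: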
library(dplyr)
States <- States %>%
mutate(WAGE_BUCKET=case_when(between(ID, 1, 12) ~ '1',
between(ID, 13,24) ~ '2',
between(ID, 25,37) ~ '3',
between(ID, 38,50) ~ '4',
TRUE ~ NA_character_))
Or another option is to use & with >= and <=
States %>%
mutate(WAGE_BUCKET=case_when(ID >= 1 & ID <=12 ~ '1',
ID >= 13 & ID <= 24 ~ '2',
ID >= 25 & ID <= 37 ~ '3',
ID >= 38 & ID <= 50 ~ '4',
TRUE ~ NA_character_))
Or maybe the OP meant to use %in%, which needs the full range (1:12) rather than only its endpoints (c(1, 12))
States %>%
mutate(WAGE_BUCKET=case_when(ID %in% 1:12 ~ '1',
ID %in% 13:24 ~ '2',
ID %in% 25:37 ~ '3',
ID %in% 38:50 ~ '4',
TRUE ~ NA_character_))
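The reason the original comparison misbehaves can be seen on a short vector. This is an illustrative sketch with a hypothetical six-element ID, not the full data set:

```r
ID <- 1:6

# c(1, 12) is recycled to 1, 12, 1, 12, 1, 12 before the comparison,
# so odd positions are tested against 1 and even positions against 12
recycled <- ID <= c(1, 12)
print(recycled)        # TRUE TRUE FALSE TRUE FALSE TRUE

# A range test needs both bounds applied to every element
in_range <- ID >= 1 & ID <= 12
print(in_range)        # all TRUE

# %in% with only the endpoints matches just the values 1 and 12
endpoints_only <- ID %in% c(1, 12)
print(endpoints_only)  # TRUE FALSE FALSE FALSE FALSE FALSE

# %in% with the full sequence matches the whole range
full_range <- ID %in% 1:12
print(full_range)      # all TRUE
```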

Use cases with higher value on one variable for each case of another variable in R

I am doing a meta-analysis in R. For each study (variable StudyID) I have multiple effect sizes. For some studies I have the same effect size multiple times depending on the level of acquaintance (variable Familiarity) between the subjects.
head(dat)
studyID A.C.Extent Visibility Familiarity p_t_cov group.size same.sex N published
1 1 3.0 5.0 1 0.0462 4 0 44 1
2 1 5.0 2.5 1 0.1335 4 0 44 1
3 1 2.5 3.0 1 -0.1239 4 0 44 1
4 1 2.5 3.5 1 0.2062 4 0 44 1
5 1 2.5 3.0 1 -0.0370 4 0 44 1
6 1 3.0 5.0 1 -0.3850 4 0 44 1
Those are the first rows of the data set. In total there are over 50 studies. Most studies look like study 1 with the same value in "Familiarity" for all effect sizes. In some studies, there are effect sizes with multiple levels of familiarity. For example study 36 as seen below.
head(dat)
studyID A.C.Extent Visibility Familiarity p_t_cov group.size same.sex N published
142 36 1.0 4.5 0 0.1233 5.00 0 311 1
143 36 3.5 3.0 0 0.0428 5.00 0 311 1
144 36 1.0 4.5 0 0.0986 5.00 0 311 1
145 36 1.0 4.5 1 -0.0520 5.00 0 311 1
146 36 1.5 2.5 1 -0.0258 5.00 0 311 1
147 36 3.5 3.0 1 0.1104 5.00 0 311 1
148 36 1.0 4.5 1 0.0282 5.00 0 311 1
149 36 1.0 4.5 2 -0.1724 5.00 0 311 1
150 36 3.5 3.0 2 0.2646 5.00 0 311 1
151 36 1.0 4.5 2 -0.1426 5.00 0 311 1
152 37 3.0 4.0 1 0.0118 5.35 0 123 0
153 37 1.0 4.5 1 -0.3205 5.35 0 123 0
154 37 2.5 3.0 1 -0.2356 5.35 0 123 0
155 37 3.0 2.0 1 0.1372 5.35 0 123 0
156 37 2.5 2.5 1 -0.1401 5.35 0 123 0
157 37 3.0 3.5 1 -0.3334 5.35 0 123 0
158 37 2.5 2.5 1 0.0317 5.35 0 123 0
159 37 1.0 3.0 1 -0.3025 5.35 0 123 0
160 37 1.0 3.5 1 -0.3248 5.35 0 123 0
Now, for those studies that include multiple levels of familiarity, I want to keep only the rows with one level of familiarity (two separate versions: one with the lowest, one with the highest familiarity).
I think that it can be possible with the package dplyr, but I have no real code so far.
In a second step I would like to give those rows unique studyIDs for each level of familiarity (so create out of study 36 three "different" studies).
Thank you in advance!

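If you want to use dplyr, you could create an alternate ID or case number for each combination of studyID and Familiarity with cur_group_id() (dplyr 1.0.0 or later, where it replaces the older group_indices() idiom):
df <- df %>%
group_by(studyID, Familiarity) %>%
mutate(case_num = cur_group_id()) %>%
ungroup()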

You could do:
library(dplyr)
df %>%
group_by(studyID) %>%
mutate(nDist = n_distinct(Familiarity) > 1) %>%
ungroup() %>%
mutate(
studyID = case_when(nDist ~ paste(studyID, Familiarity, sep = "_"), TRUE ~ studyID %>% as.character),
nDist = NULL
)
Output:
# A tibble: 19 x 9
studyID A.C.Extent Visibility Familiarity p_t_cov group.size same.sex N published
<chr> <dbl> <dbl> <int> <dbl> <dbl> <int> <int> <int>
1 36_0 1 4.5 0 0.123 5 0 311 1
2 36_0 3.5 3 0 0.0428 5 0 311 1
3 36_0 1 4.5 0 0.0986 5 0 311 1
4 36_1 1 4.5 1 -0.052 5 0 311 1
5 36_1 1.5 2.5 1 -0.0258 5 0 311 1
6 36_1 3.5 3 1 0.110 5 0 311 1
7 36_1 1 4.5 1 0.0282 5 0 311 1
8 36_2 1 4.5 2 -0.172 5 0 311 1
9 36_2 3.5 3 2 0.265 5 0 311 1
10 36_2 1 4.5 2 -0.143 5 0 311 1
11 37 3 4 1 0.0118 5.35 0 123 0
12 37 1 4.5 1 -0.320 5.35 0 123 0
13 37 2.5 3 1 -0.236 5.35 0 123 0
14 37 3 2 1 0.137 5.35 0 123 0
15 37 2.5 2.5 1 -0.140 5.35 0 123 0
16 37 3 3.5 1 -0.333 5.35 0 123 0
17 37 2.5 2.5 1 0.0317 5.35 0 123 0
18 37 1 3 1 -0.302 5.35 0 123 0
19 37 1 3.5 1 -0.325 5.35 0 123 0
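For the first part of the question (keeping only the lowest or only the highest familiarity level within each study), a grouped filter is one possibility. This is a sketch on a small made-up data frame; the values and the names dat, lowest and highest are assumptions for illustration only:

```r
library(dplyr)

# Hypothetical miniature of the data: study 1 has one familiarity level,
# study 36 has three
dat <- data.frame(
  studyID     = c(1, 1, 36, 36, 36, 36),
  Familiarity = c(1, 1, 0, 1, 2, 2),
  p_t_cov     = c(0.05, 0.13, 0.12, -0.05, -0.17, 0.26)
)

# Version 1: within each study keep only the rows at the lowest level
lowest <- dat %>%
  group_by(studyID) %>%
  filter(Familiarity == min(Familiarity)) %>%
  ungroup()

# Version 2: within each study keep only the rows at the highest level
highest <- dat %>%
  group_by(studyID) %>%
  filter(Familiarity == max(Familiarity)) %>%
  ungroup()

print(lowest)
print(highest)
```

Studies with a single familiarity level pass through both filters unchanged, because their minimum and maximum coincide.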

Webscraping in R - commented out table [duplicate]

I'm trying to webscrape the final table in https://www.baseball-reference.com/leagues/MLB/2015-standings.shtml
i.e. the "MLB Detailed Standings"
My R code is as follows:
library(XML)
library(httr)
library(plyr)
library(stringr)
url <- paste0("http://www.baseball-reference.com/leagues/MLB/", 2015, "-standings.shtml")
tab <- GET(url)
data <- readHTMLTable(rawToChar(tab$content))
However, it does not seem to pick up the table I want. Looking at the source code, it seems as though the table is commented out somehow?
Any help would be great.

From the answer MrFlick linked:
library(XML)
library(tidyverse)
library(rvest)
page <- xml2::read_html("https://www.baseball-reference.com/leagues/MLB/2015-standings.shtml")
alt_tables <- xml2::xml_find_all(page,"//comment()") %>% {
#Find only commented nodes that contain the regex for html table markup
raw_parts <- as.character(.[grep("\\</?table", as.character(.))])
# Remove the comment begin and end tags
strip_html <- stringi::stri_replace_all_regex(raw_parts, c("<\\!--","-->"),c("",""),
vectorize_all = FALSE)
# Loop through the pieces that have tables within markup and
# apply the same functions
lapply(grep("<table", strip_html, value = TRUE), function(i){
rvest::html_table(xml2::xml_find_all(xml2::read_html(i), "//table")) %>%
.[[1]]
})
}
tbl <- alt_tables[[2]]
tbl <- as_tibble(tbl)
tbl
# A tibble: 31 x 23
Rk Tm Lg G W L `W-L%` R RA Rdiff SOS SRS pythWL Luck Inter Home Road ExInn
<int> <chr> <chr> <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <int> <chr> <chr> <chr> <chr>
1 1 STL NL 162 100 62 0.617 4 3.2 0.8 -0.3 0.5 96-66 4 11-9 55-26 45-36 8-8
2 2 PIT NL 162 98 64 0.605 4.3 3.7 0.6 -0.3 0.3 93-69 5 13-7 53-28 45-36 12-9
3 3 CHC NL 162 97 65 0.599 4.3 3.8 0.5 -0.3 0.2 90-72 7 10-10 49-32 48-33 13-5
4 4 KCR AL 162 95 67 0.586 4.5 4 0.5 0.2 0.7 90-72 5 13-7 51-30 44-37 10-6
5 5 TOR AL 162 93 69 0.574 5.5 4.1 1.4 0.2 1.6 102-60 -9 12-8 53-28 40-41 8-6
6 6 LAD NL 162 92 70 0.568 4.1 3.7 0.4 -0.3 0.1 89-73 3 10-10 55-26 37-44 6-9
7 7 NYM NL 162 90 72 0.556 4.2 3.8 0.4 -0.4 0 89-73 1 9-11 49-32 41-40 9-6
8 8 TEX AL 162 88 74 0.543 4.6 4.5 0.1 0.2 0.4 83-79 5 11-9 43-38 45-36 5-4
9 9 NYY AL 162 87 75 0.537 4.7 4.3 0.4 0.3 0.8 88-74 -1 11-9 45-36 42-39 4-9
10 10 HOU AL 162 86 76 0.531 4.5 3.8 0.7 0.2 0.9 93-69 -7 16-4 53-28 33-48 8-6
# ... with 21 more rows, and 5 more variables: `1Run` <chr>, vRHP <chr>, vLHP <chr>, `≥.500` <chr>, `<.500` <chr>

How to index dataframe column inside a function in R

I have a function that takes in a dataframe, a percentile threshold, and the name of a given column, and computes all values that are above this threshold in the given column as a new column (0 for <, and 1 for >=). However, it won't allow me to do the df$column_name inside the quantile function because column_name is not actually a column name, but a variable storing the actual column name. Therefore df$column_name will return NULL. Is there any way to work around this and keep the code format somewhat similar to what it is currently? Or do I have to specify the actual numerical column value instead of the name? While I can do this, it is definitely not as convenient/comprehensible as just passing in the column name.
func1 <- function(df, threshold, column_name) {
threshold_value <- quantile(df$column_name, c(threshold))
new_df <- df %>%
mutate(ifelse(column_name > threshold_value, 1, 0))
return(new_df)
}
Thank you so much for your help!

I modified your function as follows. It takes a data frame, a threshold, and a column name (as a string), and needs only base R.
# Modified function
func1 <- function(df, threshold, column_name) {
threshold_value <- quantile(df[[column_name]], threshold)
new_df <- df
new_df[["new_col"]] <- ifelse(df[[column_name]] >= threshold_value, 1, 0)
return(new_df)
}
# Take the trees data frame as an example
head(trees)
# Girth Height Volume
# 1 8.3 70 10.3
# 2 8.6 65 10.3
# 3 8.8 63 10.2
# 4 10.5 72 16.4
# 5 10.7 81 18.8
# 6 10.8 83 19.7
# Apply the function
func1(trees, 0.5, "Volume")
# Girth Height Volume new_col
# 1 8.3 70 10.3 0
# 2 8.6 65 10.3 0
# 3 8.8 63 10.2 0
# 4 10.5 72 16.4 0
# 5 10.7 81 18.8 0
# 6 10.8 83 19.7 0
# 7 11.0 66 15.6 0
# 8 11.0 75 18.2 0
# 9 11.1 80 22.6 0
# 10 11.2 75 19.9 0
# 11 11.3 79 24.2 1
# 12 11.4 76 21.0 0
# 13 11.4 76 21.4 0
# 14 11.7 69 21.3 0
# 15 12.0 75 19.1 0
# 16 12.9 74 22.2 0
# 17 12.9 85 33.8 1
# 18 13.3 86 27.4 1
# 19 13.7 71 25.7 1
# 20 13.8 64 24.9 1
# 21 14.0 78 34.5 1
# 22 14.2 80 31.7 1
# 23 14.5 74 36.3 1
# 24 16.0 72 38.3 1
# 25 16.3 77 42.6 1
# 26 17.3 81 55.4 1
# 27 17.5 82 55.7 1
# 28 17.9 80 58.3 1
# 29 18.0 80 51.5 1
# 30 18.0 80 51.0 1
# 31 20.6 87 77.0 1
If you still want to use dplyr, you need to learn how to deal with non-standard evaluation; see the dplyr programming vignette (https://cran.r-project.org/web/packages/dplyr/vignettes/programming.html). The following code also works.
library(dplyr)
func2 <- function(df, threshold, column_name) {
col_en <- enquo(column_name)
threshold_value <- quantile(df %>% pull(!!col_en), threshold)
new_df <- df %>%
mutate(new_col := ifelse(!!col_en >= threshold_value, 1, 0))
return(new_df)
}
func2(trees, 0.5, Volume)
# Girth Height Volume new_col
# 1 8.3 70 10.3 0
# 2 8.6 65 10.3 0
# 3 8.8 63 10.2 0
# 4 10.5 72 16.4 0
# 5 10.7 81 18.8 0
# 6 10.8 83 19.7 0
# 7 11.0 66 15.6 0
# 8 11.0 75 18.2 0
# 9 11.1 80 22.6 0
# 10 11.2 75 19.9 0
# 11 11.3 79 24.2 1
# 12 11.4 76 21.0 0
# 13 11.4 76 21.4 0
# 14 11.7 69 21.3 0
# 15 12.0 75 19.1 0
# 16 12.9 74 22.2 0
# 17 12.9 85 33.8 1
# 18 13.3 86 27.4 1
# 19 13.7 71 25.7 1
# 20 13.8 64 24.9 1
# 21 14.0 78 34.5 1
# 22 14.2 80 31.7 1
# 23 14.5 74 36.3 1
# 24 16.0 72 38.3 1
# 25 16.3 77 42.6 1
# 26 17.3 81 55.4 1
# 27 17.5 82 55.7 1
# 28 17.9 80 58.3 1
# 29 18.0 80 51.5 1
# 30 18.0 80 51.0 1
# 31 20.6 87 77.0 1
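If the column name should stay a string, as in the question's original call, the .data pronoun from dplyr is another option. This is a sketch under that assumption; func3 and new_col are hypothetical names, and >= is used to match the "1 for >=" rule stated in the question:

```r
library(dplyr)

# column_name is a character string, e.g. "Volume"
func3 <- function(df, threshold, column_name) {
  # [[ ]] looks the column up by the string stored in column_name
  threshold_value <- quantile(df[[column_name]], threshold)
  df %>%
    # .data[[...]] does the same lookup inside a dplyr verb
    mutate(new_col = ifelse(.data[[column_name]] >= threshold_value, 1, 0))
}

res <- func3(trees, 0.5, "Volume")
print(head(res))
```

This keeps the calling convention func3(trees, 0.5, "Volume") while still using mutate inside the function.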

How to scrape tables inside a comment tag in html with R?

I am trying to scrape from http://www.basketball-reference.com/teams/CHI/2015.html using rvest. I used selectorgadget and found the tag to be #advanced for the table I want. However, I noticed it wasn't picking it up. Looking at the page source, I noticed that the tables are inside an html comment tag <!--
What is the best way to get the tables from inside the comment tags? Thanks!
Edit: I am trying to pull the 'Advanced' table: http://www.basketball-reference.com/teams/CHI/2015.html#advanced::none

You can use the XPath comment() function to select comment nodes, then reparse their contents as HTML:
library(rvest)
# scrape page
h <- read_html('http://www.basketball-reference.com/teams/CHI/2015.html')
df <- h %>% html_nodes(xpath = '//comment()') %>% # select comment nodes
html_text() %>% # extract comment text
paste(collapse = '') %>% # collapse to a single string
read_html() %>% # reparse to HTML
html_node('table#advanced') %>% # select the desired table
html_table() %>% # parse table
.[colSums(is.na(.)) < nrow(.)] # get rid of spacer columns
df[, 1:15]
## Rk Player Age G MP PER TS% 3PAr FTr ORB% DRB% TRB% AST% STL% BLK%
## 1 1 Pau Gasol 34 78 2681 22.7 0.550 0.023 0.317 9.2 27.6 18.6 14.4 0.5 4.0
## 2 2 Jimmy Butler 25 65 2513 21.3 0.583 0.212 0.508 5.1 11.2 8.2 14.4 2.3 1.0
## 3 3 Joakim Noah 29 67 2049 15.3 0.482 0.005 0.407 11.9 22.1 17.1 23.0 1.2 2.6
## 4 4 Aaron Brooks 30 82 1885 14.4 0.534 0.383 0.213 1.9 7.5 4.8 24.2 1.5 0.6
## 5 5 Mike Dunleavy 34 63 1838 11.6 0.573 0.547 0.181 1.7 12.7 7.3 9.7 1.1 0.8
## 6 6 Taj Gibson 29 62 1692 16.1 0.545 0.000 0.364 10.7 14.6 12.7 6.9 1.1 3.2
## 7 7 Nikola Mirotic 23 82 1654 17.9 0.556 0.502 0.455 4.3 21.8 13.3 9.7 1.7 2.4
## 8 8 Kirk Hinrich 34 66 1610 6.8 0.468 0.441 0.131 1.4 6.6 4.1 13.8 1.5 0.6
## 9 9 Derrick Rose 26 51 1530 15.9 0.493 0.325 0.224 2.6 8.7 5.7 30.7 1.2 0.8
## 10 10 Tony Snell 23 72 1412 10.2 0.550 0.531 0.148 2.5 10.9 6.8 6.8 1.2 0.6
## 11 11 E'Twaun Moore 25 56 504 10.3 0.504 0.273 0.144 2.7 7.1 5.0 10.4 2.1 0.9
## 12 12 Doug McDermott 23 36 321 6.1 0.480 0.383 0.140 2.1 12.2 7.3 3.0 0.6 0.2
## 13 13 Nazr Mohammed 37 23 128 8.7 0.431 0.000 0.100 9.6 22.3 16.1 3.6 1.6 2.8
## 14 14 Cameron Bairstow 24 18 64 2.1 0.309 0.000 0.357 10.5 3.3 6.8 2.2 1.6 1.1

OK, got it.
library(stringi)
library(knitr)
library(rvest)
any_version_html <- function(x){
XML::htmlParse(x)
}
a <- 'http://www.basketball-reference.com/teams/CHI/2015.html#advanced::none'
b <- readLines(a)
c <- paste0(b, collapse = "")
d <- as.character(unlist(stri_extract_all_regex(c, '<table(.*?)/table>', omit_no_match = T, simplify = T)))
e <- html_table(any_version_html(d))
kable(summary(e),'rst')
====== ========== ====
Length Class Mode
====== ========== ====
9 data.frame list
2 data.frame list
24 data.frame list
21 data.frame list
28 data.frame list
28 data.frame list
27 data.frame list
30 data.frame list
27 data.frame list
27 data.frame list
28 data.frame list
28 data.frame list
27 data.frame list
30 data.frame list
27 data.frame list
27 data.frame list
3 data.frame list
====== ========== ====
kable(e[[1]],'rst')
=== ================ === ==== === ================== === === =================================
No. Player Pos Ht Wt Birth Date   Exp College
=== ================ === ==== === ================== === === =================================
41 Cameron Bairstow PF 6-9 250 December 7, 1990 au R University of New Mexico
0 Aaron Brooks PG 6-0 161 January 14, 1985 us 6 University of Oregon
21 Jimmy Butler SG 6-7 220 September 14, 1989 us 3 Marquette University
34 Mike Dunleavy SF 6-9 230 September 15, 1980 us 12 Duke University
16 Pau Gasol PF 7-0 250 July 6, 1980 es 13
22 Taj Gibson PF 6-9 225 June 24, 1985 us 5 University of Southern California
12 Kirk Hinrich SG 6-4 190 January 2, 1981 us 11 University of Kansas
3 Doug McDermott SF 6-8 225 January 3, 1992 us R Creighton University
## Realized we should index with some names...but this is somewhat cheating as we know the start and end indexes for table titles..I prefer to parse-in-the-dark.
# Names are in h2-tags
e_names <- as.character(unlist(stri_extract_all_regex(c, '<h2(.*?)/h2>', simplify = T)))
e_names <- gsub("<(.*?)>","",e_names[grep('Roster',e_names):grep('Salaries',e_names)])
names(e) <- e_names
kable(head(e$Salaries), 'rst')
=== ============== ===========
Rk Player Salary
=== ============== ===========
1 Derrick Rose $18,862,875
2 Carlos Boozer $13,550,000
3 Joakim Noah $12,200,000
4 Taj Gibson $8,000,000
5 Pau Gasol $7,128,000
6 Nikola Mirotic $5,305,000
=== ============== ===========
