Scraping an interactive table with rvest - r

I'm attempting to scrape the second table shown at the URL below, and I'm running into issues which may be related to the interactive nature of the table.
div_stats_standard appears to refer to the table of interest.
The code runs with no errors but returns an empty list.
library(rvest)
url <- 'https://fbref.com/en/comps/9/stats/Premier-League-Stats'
data <- url %>%
  read_html() %>%
  html_nodes(xpath = '//*[(@id = "div_stats_standard")]') %>%
  html_table()
Can anyone tell me where I'm going wrong?

Look for the table nodes themselves rather than the wrapping div, then extract the one you want by position.
library(rvest)
url <- "https://fbref.com/en/comps/9/stats/Premier-League-Stats"
page <- read_html(url)
nodes <- html_nodes(page, "table") # you can use Selectorgadget to identify the node
table <- html_table(nodes[[1]]) # each element of the nodes list is one table that can be extracted
head(table)
Result:
head(table)
Playing Time Playing Time Playing Time Performance Performance
1 Squad # Pl MP Starts Min Gls Ast
2 Arsenal 26 27 297 2,430 39 26
3 Aston Villa 28 27 297 2,430 33 27
4 Bournemouth 25 28 308 2,520 27 17
5 Brighton 23 28 308 2,520 28 19
6 Burnley 21 28 308 2,520 32 23
Performance Performance Performance Performance Per 90 Minutes Per 90 Minutes
1 PK PKatt CrdY CrdR Gls Ast
2 2 2 64 3 1.44 0.96
3 1 3 54 1 1.22 1.00
4 1 1 60 3 0.96 0.61
5 1 1 44 2 1.00 0.68
6 2 2 53 0 1.14 0.82
Per 90 Minutes Per 90 Minutes Per 90 Minutes Expected Expected Expected Per 90 Minutes
1 G+A G-PK G+A-PK xG npxG xA xG
2 2.41 1.37 2.33 35.0 33.5 21.3 1.30
3 2.22 1.19 2.19 30.6 28.2 22.0 1.13
4 1.57 0.93 1.54 31.2 30.5 20.8 1.12
5 1.68 0.96 1.64 33.8 33.1 22.4 1.21
6 1.96 1.07 1.89 30.9 29.4 18.9 1.10
Per 90 Minutes Per 90 Minutes Per 90 Minutes Per 90 Minutes
1 xA xG+xA npxG npxG+xA
2 0.79 2.09 1.24 2.03
3 0.81 1.95 1.04 1.86
4 0.74 1.86 1.09 1.83
5 0.80 2.01 1.18 1.98
6 0.68 1.78 1.05 1.73
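In the result above, the group labels ("Playing Time", "Performance", ...) became the column names and the real column names ended up in row 1. A minimal sketch of promoting that first row to the header, using a hypothetical three-column miniature of the scraped table (the name `tbl` and the reduced set of columns are assumptions for illustration):

```r
# Hypothetical miniature of the scraped result: real names sit in row 1
tbl <- data.frame(
  V1             = c("Squad", "Arsenal", "Aston Villa"),
  `Playing Time` = c("MP", "27", "27"),
  Performance    = c("Gls", "39", "33"),
  check.names = FALSE, stringsAsFactors = FALSE
)

# Use the first row as column names, then drop it
names(tbl) <- as.character(unlist(tbl[1, ]))
tbl <- tbl[-1, ]
rownames(tbl) <- NULL

# Everything was read as text; convert columns that look numeric
tbl <- type.convert(tbl, as.is = TRUE)
str(tbl)
```

Columns with thousands separators such as "2,430" would stay character after `type.convert()` and would need the commas removed first.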


how to apply round() to odd or even rows only in R

Assume my original data frame is:
a b d e
1 1 1 2 1
2 20 30 40 30
3 1 2 6 2
4 40 50 40 50
5 5 5 3 5
6 60 60 60 60
I want to add a percentage row below each row.
a b d e
1 1.00 1.00 2.00 1.00
2 0.79 0.66 1.57 0.66
3 20.00 30.00 40.00 30.00
4 13.51 20.27 27.03 20.27
5 1.00 2.00 6.00 2.00
6 0.66 1.57 3.97 1.57
7 40.00 50.00 40.00 50.00
8 27.03 33.78 27.03 33.78
9 5.00 5.00 3.00 5.00
10 3.94 3.31 2.36 3.31
11 60.00 60.00 60.00 60.00
12 40.54 40.54 40.54 40.54
But as you can see, my odd rows get .00, which I do not want.
library(dplyr)
df <- data.frame(a=c(1,20,1,40,5,60),
b=c(1,30,2,50,5,60),
d=c(2,40,6,40,3,60),
e = c(1,30,2,50,5,60))
df <- df %>% slice(rep(1:n(), each=2))
df[seq_len(nrow(df)) %% 2 ==0, ] <- round(100*df[seq_len(nrow(df)) %% 2 ==0,
]/colSums(df[seq_len(nrow(df)) %% 2 ==0, ]),2)
How can I keep my odd rows without decimals?
The problem is that columns in data frames can only hold one type of data. If some of the values in a column have decimals, then the whole column must be of type double. The only way to change how your data frame appears is via its print method.
Fortunately, you can easily turn your data frame into a tibble. This is a type of data frame, but prints in such a way that the integers don't have decimal points afterwards.
df
#> a b d e
#> 1 1.00 1.00 2.00 1.00
#> 2 0.79 0.66 1.57 0.66
#> 3 20.00 30.00 40.00 30.00
#> 4 13.51 20.27 27.03 20.27
#> 5 1.00 2.00 6.00 2.00
#> 6 0.66 1.57 3.97 1.57
#> 7 40.00 50.00 40.00 50.00
#> 8 27.03 33.78 27.03 33.78
#> 9 5.00 5.00 3.00 5.00
#> 10 3.94 3.31 2.36 3.31
#> 11 60.00 60.00 60.00 60.00
#> 12 40.54 40.54 40.54 40.54
dplyr::tibble(df)
#> # A tibble: 12 x 4
#> a b d e
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1 1 2 1
#> 2 0.79 0.66 1.57 0.66
#> 3 20 30 40 30
#> 4 13.5 20.3 27.0 20.3
#> 5 1 2 6 2
#> 6 0.66 1.57 3.97 1.57
#> 7 40 50 40 50
#> 8 27.0 33.8 27.0 33.8
#> 9 5 5 3 5
#> 10 3.94 3.31 2.36 3.31
#> 11 60 60 60 60
#> 12 40.5 40.5 40.5 40.5
Created on 2022-04-26 by the reprex package (v2.0.1)
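To make the point about column types concrete, here is a small base R sketch (the vectors `x` and `y` are made up for illustration). A numeric vector is printed with one common number of decimals, so a single fractional value forces ".00" onto the whole-number entries; formatting each value on its own avoids that, at the cost of turning the values into strings:

```r
x <- c(1, 20, 40)
print(x)                  # whole numbers only: printed without decimals

y <- c(1, 0.79, 20, 13.51)
print(y)                  # mixed values: printed with a common number of decimals
is.double(y)              # the vector has a single storage type

# Format each value separately; the result is a character vector
y_chr <- vapply(y, format, character(1))
y_chr
```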
Allan Cameron is right that a tibble prints better and does what you want. To offer another solution, though: if you want output that you might send to a text file (rather than just look at on the screen), you could format the values as character strings as follows:
library(dplyr)
df <- data.frame(a=c(1,20,1,40,5,60),
b=c(1,30,2,50,5,60),
d=c(2,40,6,40,3,60),
e = c(1,30,2,50,5,60))
df %>%
  mutate(obs = row_number(),
         across(-obs, ~.x/sum(.x)),
         type = "pct") %>%
  bind_rows(df %>% mutate(obs = row_number(),
                          type = "raw")) %>%
  mutate(type = factor(type, levels=c("raw", "pct"))) %>%
  arrange(obs, type) %>%
  mutate(across(a:e, ~case_when(
    type == "raw" ~ sprintf("%.0f", .x),
    TRUE ~ sprintf("%.2f%%", .x*100)))) %>%
  select(-c(obs, type))
#> a b d e
#> 1 1 1 2 1
#> 2 0.79% 0.68% 1.32% 0.68%
#> 3 20 30 40 30
#> 4 15.75% 20.27% 26.49% 20.27%
#> 5 1 2 6 2
#> 6 0.79% 1.35% 3.97% 1.35%
#> 7 40 50 40 50
#> 8 31.50% 33.78% 26.49% 33.78%
#> 9 5 5 3 5
#> 10 3.94% 3.38% 1.99% 3.38%
#> 11 60 60 60 60
#> 12 47.24% 40.54% 39.74% 40.54%
Created on 2022-04-26 by the reprex package (v2.0.1)
Also note that I think the percentages you calculated are wrong. When I run your code, the percentage rows of column a do not sum to 100:
sum(df$a[c(2,4,6,8,10,12)])
#> [1] 86.47
With my percentages, which differ from yours, the same sum is 100, up to rounding (after turning the strings back into numbers).
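The mismatch most likely comes from recycling: dividing a data frame by the vector returned by `colSums()` recycles that vector down the rows of each column rather than matching one total to each column. A minimal base R sketch of a column-wise division using `sweep()` (the names `m` and `pct` are assumptions for illustration):

```r
df <- data.frame(a = c(1, 20, 1, 40, 5, 60),
                 b = c(1, 30, 2, 50, 5, 60),
                 d = c(2, 40, 6, 40, 3, 60),
                 e = c(1, 30, 2, 50, 5, 60))

m <- as.matrix(df)

# sweep() with MARGIN = 2 divides every column by its own total
pct <- round(100 * sweep(m, 2, colSums(m), "/"), 2)

pct
colSums(pct)   # each column should sum to about 100, up to rounding
```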

Why same category is giving different frequency in R

Process_Table = Process_Table[order(-Process_Table$Process, -Process_Table$Freq),]
#output
Process Freq Percent
17 Other Airport Services 45 15.46
5 Check-in 35 12.03
23 Ticket sales and support channels 35 12.03
11 Flight and inflight 33 11.34
19 Pegasus Plus 23 7.90
24 Time Delays 16 5.50
7 Other 13 4.47
14 Other 13 4.47
22 Other 13 4.47
25 Other 13 4.47
16 Other 11 3.78
20 Other 6 2.06
26 Other 6 2.06
3 Other 5 1.72
13 Other 5 1.72
18 Other 5 1.72
21 Other 4 1.37
1 Other 2 0.69
2 Other 1 0.34
4 Other 1 0.34
6 Other 1 0.34
8 Other 1 0.34
9 Other 1 0.34
10 Other 1 0.34
12 Other 1 0.34
15 Other 1 0.34
As you can see, it gives different frequencies for the same level,
whereas printing the levels of that column gives the following output:
levels(Process_Table$Process)
[1] "Check-in" "Flight and inflight"
[3] "Other" "Other Airport Services"
[5] "Pegasus Plus" "Ticket sales and support channels"
[7] "Time Delays"
What I want is the combined frequency of the "Other" category. Can anyone help me out with this?
Edit: this is the code used to derive the first set of output:
Process_Table$Percent = round(Process_Table$Freq/sum(Process_Table$Freq) * 100, 2)
Process_Table$Process = as.character(Process_Table$Process)
low_list = Process_Table %>%
filter(Percent < 5.50) %>%
select(Process)
Process_Table$Process = ifelse(Process_Table$Process %in% low_list$Process, 'Other', Process_Table$Process)
as.data.frame(Process_Table)
Process_Table$Process = as.factor(Process_Table$Process)
Your Process_Table needs one more aggregation step: relabelling rows as 'Other' does not merge them. Add the following as the final step of your aggregation.
Process_Table <- Process_Table %>% group_by(Process) %>% summarize(Freq = sum(Freq), Percent = sum(Percent))
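For a self-contained illustration, here is the same idea in base R using `aggregate()` in place of `group_by()`/`summarize()`. The toy data, the "Baggage" and "Lounge" categories, and the 10% cut-off are all made up for the example:

```r
# Hypothetical toy data: four raw categories, two of them rare
Process_Table <- data.frame(
  Process = c("Check-in", "Time Delays", "Baggage", "Lounge"),
  Freq    = c(35, 16, 5, 4),
  stringsAsFactors = FALSE
)
Process_Table$Percent <- round(Process_Table$Freq / sum(Process_Table$Freq) * 100, 2)

# Relabel rare categories; this still leaves one "Other" row per original category
rare <- Process_Table$Percent < 10
Process_Table$Process[rare] <- "Other"

# Collapse rows that now share a label by summing their counts
combined <- aggregate(Freq ~ Process, data = Process_Table, FUN = sum)
combined$Percent <- round(combined$Freq / sum(combined$Freq) * 100, 2)
combined[order(-combined$Freq), ]
```

Recomputing `Percent` from the summed counts, rather than summing the rounded percentages, avoids accumulating rounding error.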

How to scrape tables inside a comment tag in html with R?

I am trying to scrape from http://www.basketball-reference.com/teams/CHI/2015.html using rvest. I used selectorgadget and found the tag to be #advanced for the table I want. However, I noticed it wasn't picking it up. Looking at the page source, I noticed that the tables are inside an html comment tag <!--
What is the best way to get the tables from inside the comment tags? Thanks!
Edit: I am trying to pull the 'Advanced' table: http://www.basketball-reference.com/teams/CHI/2015.html#advanced::none
You can use the XPath comment() function to select comment nodes, then reparse their contents as HTML:
library(rvest)
# scrape page
h <- read_html('http://www.basketball-reference.com/teams/CHI/2015.html')
df <- h %>% html_nodes(xpath = '//comment()') %>%    # select comment nodes
    html_text() %>%    # extract comment text
    paste(collapse = '') %>%    # collapse to a single string
    read_html() %>%    # reparse to HTML
    html_node('table#advanced') %>%    # select the desired table
    html_table() %>%    # parse table
    .[colSums(is.na(.)) < nrow(.)]    # get rid of spacer columns
df[, 1:15]
## Rk Player Age G MP PER TS% 3PAr FTr ORB% DRB% TRB% AST% STL% BLK%
## 1 1 Pau Gasol 34 78 2681 22.7 0.550 0.023 0.317 9.2 27.6 18.6 14.4 0.5 4.0
## 2 2 Jimmy Butler 25 65 2513 21.3 0.583 0.212 0.508 5.1 11.2 8.2 14.4 2.3 1.0
## 3 3 Joakim Noah 29 67 2049 15.3 0.482 0.005 0.407 11.9 22.1 17.1 23.0 1.2 2.6
## 4 4 Aaron Brooks 30 82 1885 14.4 0.534 0.383 0.213 1.9 7.5 4.8 24.2 1.5 0.6
## 5 5 Mike Dunleavy 34 63 1838 11.6 0.573 0.547 0.181 1.7 12.7 7.3 9.7 1.1 0.8
## 6 6 Taj Gibson 29 62 1692 16.1 0.545 0.000 0.364 10.7 14.6 12.7 6.9 1.1 3.2
## 7 7 Nikola Mirotic 23 82 1654 17.9 0.556 0.502 0.455 4.3 21.8 13.3 9.7 1.7 2.4
## 8 8 Kirk Hinrich 34 66 1610 6.8 0.468 0.441 0.131 1.4 6.6 4.1 13.8 1.5 0.6
## 9 9 Derrick Rose 26 51 1530 15.9 0.493 0.325 0.224 2.6 8.7 5.7 30.7 1.2 0.8
## 10 10 Tony Snell 23 72 1412 10.2 0.550 0.531 0.148 2.5 10.9 6.8 6.8 1.2 0.6
## 11 11 E'Twaun Moore 25 56 504 10.3 0.504 0.273 0.144 2.7 7.1 5.0 10.4 2.1 0.9
## 12 12 Doug McDermott 23 36 321 6.1 0.480 0.383 0.140 2.1 12.2 7.3 3.0 0.6 0.2
## 13 13 Nazr Mohammed 37 23 128 8.7 0.431 0.000 0.100 9.6 22.3 16.1 3.6 1.6 2.8
## 14 14 Cameron Bairstow 24 18 64 2.1 0.309 0.000 0.357 10.5 3.3 6.8 2.2 1.6 1.1
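The same technique can be tried offline on a small inline document, which makes it easier to see why the direct selector fails. This is only a sketch: the HTML string below is a made-up miniature of such a page, and the variable names are assumptions.

```r
library(rvest)

# Hypothetical page: the table is wrapped in an HTML comment
html <- '<html><body>
<div id="all_advanced">
<!--
<table id="advanced">
  <tr><th>Player</th><th>PER</th></tr>
  <tr><td>Pau Gasol</td><td>22.7</td></tr>
  <tr><td>Jimmy Butler</td><td>21.3</td></tr>
</table>
-->
</div>
</body></html>'

page <- read_html(html)

# A direct selector finds nothing: to the parser the table is just comment text
n_direct <- length(html_nodes(page, "table#advanced"))

# Pull the comment text, reparse it as HTML, then select the table
tbl <- page %>%
  html_nodes(xpath = "//comment()") %>%
  html_text() %>%
  paste(collapse = "") %>%
  read_html() %>%
  html_node("table#advanced") %>%
  html_table()

n_direct
tbl
```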
Ok..got it.
library(stringi)
library(knitr)
library(rvest)
any_version_html <- function(x){
XML::htmlParse(x)
}
a <- 'http://www.basketball-reference.com/teams/CHI/2015.html#advanced::none'
b <- readLines(a)
c <- paste0(b, collapse = "")
d <- as.character(unlist(stri_extract_all_regex(c, '<table(.*?)/table>', omit_no_match = T, simplify = T)))
e <- html_table(any_version_html(d))
> kable(summary(e),'rst')
====== ========== ====
Length Class Mode
====== ========== ====
9 data.frame list
2 data.frame list
24 data.frame list
21 data.frame list
28 data.frame list
28 data.frame list
27 data.frame list
30 data.frame list
27 data.frame list
27 data.frame list
28 data.frame list
28 data.frame list
27 data.frame list
30 data.frame list
27 data.frame list
27 data.frame list
3 data.frame list
====== ========== ====
kable(e[[1]],'rst')
=== ================ === ==== === ================== === === =================================
No. Player Pos Ht Wt Birth Date  Exp College
=== ================ === ==== === ================== === === =================================
41 Cameron Bairstow PF 6-9 250 December 7, 1990 au R University of New Mexico
0 Aaron Brooks PG 6-0 161 January 14, 1985 us 6 University of Oregon
21 Jimmy Butler SG 6-7 220 September 14, 1989 us 3 Marquette University
34 Mike Dunleavy SF 6-9 230 September 15, 1980 us 12 Duke University
16 Pau Gasol PF 7-0 250 July 6, 1980 es 13
22 Taj Gibson PF 6-9 225 June 24, 1985 us 5 University of Southern California
12 Kirk Hinrich SG 6-4 190 January 2, 1981 us 11 University of Kansas
3 Doug McDermott SF 6-8 225 January 3, 1992 us R Creighton University
## Realized we should index with some names...but this is somewhat cheating as we know the start and end indexes for table titles..I prefer to parse-in-the-dark.
# Names are in h2-tags
e_names <- as.character(unlist(stri_extract_all_regex(c, '<h2(.*?)/h2>', simplify = T)))
e_names <- gsub("<(.*?)>","",e_names[grep('Roster',e_names):grep('Salaries',e_names)])
names(e) <- e_names
kable(head(e$Salaries), 'rst')
=== ============== ===========
Rk Player Salary
=== ============== ===========
1 Derrick Rose $18,862,875
2 Carlos Boozer $13,550,000
3 Joakim Noah $12,200,000
4 Taj Gibson $8,000,000
5 Pau Gasol $7,128,000
6 Nikola Mirotic $5,305,000
=== ============== ===========

Error in producing the output

I have a problem with my code and can't trace the error. I have coordinate data, coor (a 40 by 2 matrix), shown below, and rainfall data (a 14610 by 40 matrix).
No Longitude Latitude
1 100.69 6.34
2 100.77 6.24
3 100.39 6.11
4 100.43 5.53
5 100.39 5.38
6 101.00 5.71
7 101.06 5.30
8 100.80 4.98
9 101.17 4.48
10 102.26 6.11
11 102.22 5.79
12 102.28 5.31
13 102.02 5.38
14 101.97 4.88
15 102.95 5.53
16 103.13 5.32
17 103.06 4.94
18 103.42 4.76
19 103.42 4.23
20 102.38 4.24
21 101.94 4.23
22 103.04 3.92
23 103.36 3.56
24 102.66 3.03
25 103.19 2.89
26 101.35 3.70
27 101.41 3.37
28 101.75 3.16
29 101.39 2.93
30 102.07 3.09
31 102.51 2.72
32 102.26 2.76
33 101.96 2.74
34 102.19 2.36
35 102.49 2.29
36 103.02 2.38
37 103.74 2.26
38 103.97 1.85
39 103.72 1.76
40 103.75 1.47
rainfall= 14610 by 40 matrix;
coor= 40 by 2 matrix
my_prog=function(rainrain,coordinat,misss,distance)
{
rain3<-rainrain # target station i
# neighboring stations for target station i
a=coordinat # target station i
diss=as.matrix(distHaversine(a,coor,r=6371))
mmdis=sort(diss,decreasing=F,index.return=T)
mdis=as.matrix(mmdis$x)
mdis1=as.matrix(mmdis$ix)
dist=cbind(mdis,mdis1)
# NA creation
# create missing values in rainfall data
set.seed(100)
b=sample(1:nrow(rain3),(misss*nrow(rain3)),replace=F)
k=replace(rain3,b,NA)
# pick i closest stations
neig=mdis1[distance] # neighbouring selection distance
# target (with NA) and their neighbors
rainB=rainfal00[,neig]
rainA0=rainB[,2:ncol(rainB)]
rainA<-as.matrix(cbind(k,rainA0))
rain2=na.omit(rainA)
x=as.matrix(rain2[,1]) # used to calculate the correlation
n1=ncol(rainA)-1
#1) normal ratio(nr)
jum=as.matrix(apply(rain2,2,mean))
nr0=(jum[1]/jum)
nr=as.matrix(nr0[2:nrow(nr0),])
m01=as.matrix(rainA[is.na(k),])
m1=m01[,2:ncol(m01)]
out1=as.matrix(sapply(seq_len(nrow(m1)),
function(i) sum(nr*m1[i,],na.rm=T)/n1))
print(out1)
}
impute=my_prog(rainrain=rainfall[,1],coordinat=coor[1,],misss=0.05,distance=mdis<200)
I have run this code and the output obtained is:
Error in my_prog(rainrain = rainfal00[, 1], misss = 0.05, coordinat = coor[1, :
object 'mdis' not found
I have checked the program, but cannot trace the problem. I would really appreciate it if someone could help me.
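A likely cause: `mdis` is created inside `my_prog`, but the expression `distance = mdis < 200` in the call is evaluated in the calling environment, where no object named `mdis` exists. One way around this is to pass the threshold as a plain number and do the comparison inside the function. A minimal sketch, where `pick_neighbours`, `max_dist` and the distances are hypothetical stand-ins rather than the original program:

```r
# Hypothetical stand-in for the neighbour selection inside my_prog():
# the threshold arrives as a number and is compared against distances
# that are computed (here: supplied) inside the function
pick_neighbours <- function(distances, max_dist) {
  ord <- order(distances)          # station indices, nearest first
  ord[distances[ord] < max_dist]   # keep only stations closer than max_dist
}

# Made-up distances (km) from a target station to five stations
d <- c(0, 350, 120, 80, 410)
neig <- pick_neighbours(d, max_dist = 200)
neig
```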

adding new column to data frame in R

rate len ADT trks sigs1 slim shld lane acpt itg lwid hwy
1 4.58 4.99 69 8 0.20040080 55 10 8 4.6 1.20 12 FAI
2 2.86 16.11 73 8 0.06207325 60 10 4 4.4 1.43 12 FAI
3 3.02 9.75 49 10 0.10256410 60 10 4 4.7 1.54 12 FAI
4 2.29 10.65 61 13 0.09389671 65 10 6 3.8 0.94 12 FAI
5 1.61 20.01 28 12 0.04997501 70 10 4 2.2 0.65 12 FAI
6 6.87 5.97 30 6 2.00750419 55 10 4 24.8 0.34 12 PA
7 3.85 8.57 46 8 0.81668611 55 8 4 11.0 0.47 12 PA
8 6.12 5.24 25 9 0.57083969 55 10 4 18.5 0.38 12 PA
9 3.29 15.79 43 12 1.45333122 50 4 4 7.5 0.95 12 PA
I have a question about adding a new column. My data frame is called highway1, and I want to add a column named S/N, defined as slim divided by acpt. How can I do this?
Thanks
> mydf$SN <- mydf$slim/mydf$acpt
> mydf
rate len ADT trks sigs1 slim shld lane acpt itg lwid hwy SN
1 4.58 4.99 69 8 0.20040080 55 10 8 4.6 1.20 12 FAI 11.956522
2 2.86 16.11 73 8 0.06207325 60 10 4 4.4 1.43 12 FAI 13.636364
3 3.02 9.75 49 10 0.10256410 60 10 4 4.7 1.54 12 FAI 12.765957
4 2.29 10.65 61 13 0.09389671 65 10 6 3.8 0.94 12 FAI 17.105263
5 1.61 20.01 28 12 0.04997501 70 10 4 2.2 0.65 12 FAI 31.818182
6 6.87 5.97 30 6 2.00750419 55 10 4 24.8 0.34 12 PA 2.217742
7 3.85 8.57 46 8 0.81668611 55 8 4 11.0 0.47 12 PA 5.000000
8 6.12 5.24 25 9 0.57083969 55 10 4 18.5 0.38 12 PA 2.972973
9 3.29 15.79 43 12 1.45333122 50 4 4 7.5 0.95 12 PA 6.666667
I hope an explanation is not necessary for the above.
While $ is the preferred route, you can also consider cbind.
First, create the numeric vector and assign it to SN:
SN <- Data[,6]/Data[,9]
Now you use cbind to append the numeric vector as a column to the existing data frame:
Data <- cbind(Data, SN)
Again, using the dollar operator $ is preferred, but it doesn't hurt seeing what an alternative looks like.
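Both routes can be checked side by side on a small example. The sketch below uses only the slim and acpt values from the first three rows of the question and refers to columns by name instead of position, which keeps working if the column order changes (the name `alt` is an assumption for illustration):

```r
# First three rows of the two relevant columns from the question
highway1 <- data.frame(slim = c(55, 60, 60),
                       acpt = c(4.6, 4.4, 4.7))

# Route 1: assign the new column with $
with_dollar <- highway1
with_dollar$SN <- with_dollar$slim / with_dollar$acpt

# Route 2: build the vector by column name, then append it with cbind()
alt <- cbind(highway1, SN = highway1[["slim"]] / highway1[["acpt"]])

all.equal(with_dollar, alt)
```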
