rvest html_table() use second row as header - r

I am trying to scrape data from a table on fbref. However, the tables contain two header rows, and the subheader ends up as the first row of data. Does anyone know how to skip the first header line and use the second line as the table header so that data types are preserved? Here is my code:
library(rvest)
library(dplyr)
team_link = "https://fbref.com/en/squads/cff3d9bb/Chelsea-Stats-All-Competitions"
team_page = read_html(team_link)
shooting_table = team_page %>% html_nodes("#all_stats_shooting") %>%
html_table()
shooting_table = shooting_table[[1]]

You can use row_to_names() from the janitor package to promote the first row to column names:
library(janitor)
shooting_table %>%
row_to_names(1)
Which gives us:
# A tibble: 28 × 23
Player Nation Pos Age `90s` Gls Sh SoT `SoT%` `Sh/90` `SoT/90` `G/Sh` `G/SoT` Dist FK PK
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 Edouard M… sn SEN GK 29 34.0 0 0 0 "" 0.00 0.00 "" "" "" 0 0
2 Antonio R… de GER DF 28 33.7 3 48 13 "27.1" 1.42 0.39 "0.06" "0.23" "19.… 0 0
3 Thiago Si… br BRA DF 36 29.4 3 18 5 "27.8" 0.61 0.17 "0.17" "0.60" "10.… 0 0
4 Mason Mou… eng E… MF,FW 22 26.3 11 75 27 "36.0" 2.86 1.03 "0.13" "0.37" "17.… 6 1
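Note that promoting the header row does not change the column types: every column above is still <chr>. A minimal base R sketch of promoting the first row and then re-inferring the types follows. The data frame `raw` and its values are made up for illustration and are not the fbref data.

```r
# Hypothetical stand-in for a scraped table whose real header sits in row 1.
raw <- data.frame(
  X1 = c("Player", "Player A", "Player B"),
  X2 = c("Gls", "3", "11"),
  X3 = c("SoT%", "27.1", "36.0"),
  stringsAsFactors = FALSE
)

# Promote the first row to column names, then drop that row.
names(raw) <- as.character(unlist(raw[1, ]))
clean <- raw[-1, , drop = FALSE]
rownames(clean) <- NULL

# Re-infer column types; as.is = TRUE keeps text columns as character.
clean <- type.convert(clean, as.is = TRUE)
str(clean)
```

The same type.convert(as.is = TRUE) call should also work on the result of row_to_names(1), although a real scraped table may need extra cleaning first (for example, duplicated column names).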


dplyr arrange is not working while order is fine

I am trying to obtain the 10 largest investors in a country, but I get confusing results when using arrange() in dplyr versus order() in base R.
head(fdi_partner)
gives the following result:
# A tibble: 6 x 3
`Main counterparts` `Number of projects` `Total registered capital (Mill. USD)(*)`
<chr> <chr> <chr>
1 TOTAL 1818 38854.3
2 Singapore 231 11358.66
3 Korea Rep.of 377 7679.9
4 Japan 204 4325.79
5 Netherlands 24 4209.64
6 China, PR 216 3001.79
and
fdi_partner %>%
rename("Registered capital" = "Total registered capital (Mill. USD)(*)") %>%
mutate_at(c("Number of projects", "Registered capital"), as.numeric) %>%
arrange("Number of projects") %>%
head()
gives almost the same result:
# A tibble: 6 x 3
`Main counterparts` `Number of projects` `Registered capital`
<chr> <dbl> <dbl>
1 TOTAL 1818 38854.
2 Singapore 231 11359.
3 Korea Rep.of 377 7680.
4 Japan 204 4326.
5 Netherlands 24 4210.
6 China, PR 216 3002.
while the following code works fine with base R:
head(fdi_partner)
fdi_numeric <- fdi_partner %>%
rename("Registered capital" = "Total registered capital (Mill. USD)(*)") %>%
mutate_at(c("Number of projects", "Registered capital"), as.numeric)
head(fdi_numeric[order(fdi_numeric$"Number of projects", decreasing = TRUE), ], n=11)
which gives
# A tibble: 11 x 3
`Main counterparts` `Number of projects` `Registered capital`
<chr> <dbl> <dbl>
1 TOTAL 1818 38854.
2 Korea Rep.of 377 7680.
3 Singapore 231 11359.
4 China, PR 216 3002.
5 Japan 204 4326.
6 Hong Kong SAR (China) 132 2365.
7 United States 83 783.
8 Taiwan 66 1464.
9 United Kingdom 50 331.
10 F.R Germany 37 131.
11 Thailand 36 370.
Can anybody help explain what's wrong with my code?
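dplyr verbs such as arrange() (and tidyverse functions more generally) expect unquoted column names. A quoted string like "Number of projects" is treated as a constant rather than a column, so the rows are left in their original order. If your variable name has a space in it, you must wrap it in backticks: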
library(dplyr)
test <- data.frame(`My variable` = c(3, 1, 2), var2 = c(1, 1, 1), check.names = FALSE)
test
#> My variable var2
#> 1 3 1
#> 2 1 1
#> 3 2 1
# Your code (doesn't work)
test %>%
arrange("My variable")
#> My variable var2
#> 1 3 1
#> 2 1 1
#> 3 2 1
# Solution
test %>%
arrange(`My variable`)
#> My variable var2
#> 1 1 1
#> 2 2 1
#> 3 3 1
Created on 2023-01-05 with reprex v2.0.2
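If the column name is only available as a string, the `.data` pronoun is another option. The sketch below reuses the `test` data frame from the answer; the variable `col` is hypothetical. It also sorts in descending order with desc(), which is what a "largest first" ranking needs.

```r
library(dplyr)

test <- data.frame(`My variable` = c(3, 1, 2), var2 = c(1, 1, 1), check.names = FALSE)

# Hypothetical: the column name is stored in a character variable.
col <- "My variable"

# .data[[col]] looks the column up by name; desc() sorts largest first.
sorted <- test %>%
  arrange(desc(.data[[col]]))
sorted
```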

Group and add variable of type stock and another type in a single step?

I want to group by district, summing the 'incoming' values over the quarters and taking the value of 'stock' in the last quarter (3), all in a single step. 'stock' cannot be summed across quarters.
My example dataframe:
library(dplyr)
df <- data.frame ("district"= rep(c("ARA", "BJI", "CMC"), each=3),
"quarter"=rep(1:3,3),
"incoming"= c(4044, 2992, 2556, 1639, 9547, 1191,2038,1942,225),
"stock"= c(19547,3160, 1533,5355,6146,355,5816,1119,333)
)
df
district quarter incoming stock
1 ARA 1 4044 19547
2 ARA 2 2992 3160
3 ARA 3 2556 1533
4 BJI 1 1639 5355
5 BJI 2 9547 6146
6 BJI 3 1191 355
7 CMC 1 2038 5816
8 CMC 2 1942 1119
9 CMC 3 225 333
The actual dataframe has ~45,000 rows and 41 variables, of which 8 are of type stock.
The result should be:
# A tibble: 3 × 3
district stock incoming
<chr> <dbl> <dbl>
1 ARA 1533 9592
2 BJI 355 12377
3 CMC 333 4205
I know how to get to the result, but it takes three steps, which I think is inefficient and error-prone given the data.
My approach:
basea <- df %>%
group_by(district) %>%
filter(quarter==3) %>% #take only the last quarter
summarise(across(stock, sum))
baseb <- df %>%
group_by(district) %>%
summarise(across(incoming, sum))
final <- full_join(basea, baseb, by = "district")
Does anyone have any suggestions to perform the procedure in one (or at least two) steps?
Grateful,
Modus
This assumes that the dataset has only 3 quarters (not 4) and that the rows are ordered by quarter within each district. If that's not the case, use nth(stock, 3) instead of last(stock).
library(tidyverse)
df %>%
group_by(district) %>%
summarise(stock = last(stock),
incoming = sum(incoming))
# A tibble: 3 × 3
district stock incoming
<chr> <dbl> <dbl>
1 ARA 1533 9592
2 BJI 355 12377
3 CMC 333 4205
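Both last() and nth() depend on the row order within each district. A sketch that selects the stock value by the quarter itself, so it does not rely on sorting, is shown below. It assumes each district has exactly one row for its latest quarter; the shuffle step is only there to demonstrate order independence.

```r
library(dplyr)

df <- data.frame(
  district = rep(c("ARA", "BJI", "CMC"), each = 3),
  quarter  = rep(1:3, 3),
  incoming = c(4044, 2992, 2556, 1639, 9547, 1191, 2038, 1942, 225),
  stock    = c(19547, 3160, 1533, 5355, 6146, 355, 5816, 1119, 333)
)

# Shuffle the rows to show that the result does not depend on row order.
set.seed(1)
shuffled <- df[sample(nrow(df)), ]

result <- shuffled %>%
  group_by(district) %>%
  summarise(
    stock    = stock[quarter == max(quarter)],  # stock at the latest quarter
    incoming = sum(incoming)                    # flow summed over all quarters
  )
result
```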
Here is a data.table approach:
library(data.table)
setDT(df)[, .(incoming = sum(incoming), stock = stock[.N]), by = .(district)]
district incoming stock
1: ARA 9592 1533
2: BJI 12377 355
3: CMC 4205 333
Here's a refactor that removes some of the duplicated code. This also seems like a prime use case for a custom function, which can be QC'd and maintained more easily:
library(dplyr)
df <- data.frame ("district"= rep(c("ARA", "BJI", "CMC"), each=3),
"quarter"=rep(1:3,3),
"incoming"= c(4044, 2992, 2556, 1639, 9547, 1191,2038,1942,225),
"stock"= c(19547,3160, 1533,5355,6146,355,5816,1119,333)
)
aggregate_stocks <- function(df, n_quarter) {
base <- df %>%
group_by(district)
basea <- base %>%
filter(quarter == n_quarter) %>%
summarise(across(stock, sum))
baseb <- base %>%
summarise(across(incoming, sum))
final <- full_join(basea, baseb, by = "district")
return(final)
}
aggregate_stocks(df, 3)
#> # A tibble: 3 × 3
#> district stock incoming
#> <chr> <dbl> <dbl>
#> 1 ARA 1533 9592
#> 2 BJI 355 12377
#> 3 CMC 333 4205
Here is the same solution as @Tom Hoel's, but instead of using a function to subset, it just uses []:
library(dplyr)
df %>%
group_by(district) %>%
summarise(stock = stock[3],
incoming = sum(incoming))
district stock incoming
<chr> <dbl> <dbl>
1 ARA 1533 9592
2 BJI 355 12377
3 CMC 333 4205

Add leading zeros to column names

I'm surprised to find no one asked this question on Stackoverflow before. Maybe it's too stupid to ask?
So I have a dataframe that contains 48 weather variables, each representing a weather value for a month. I have drawn a simplified table shown below:
weather 1  weather 2  weather 3  weather 4  weather 5  weather 6  weather 7  weather 8  weather 9  weather 10  weather 11  weather 12
12         6          34         9          100        .01        -4         38         64         77          21          34
99         42         -3         34         34         .5         27         19         7          18          NA          20
My objective is to change the column names from "weather 1, weather 2, ..." to "weather 01, weather 02, ...", so I wrote a loop like this:
for (i in 1:9){
colnames(df) = gsub(i, 0+i, colnames(df))
}
However, instead of replacing the single-digit numbers with a leading zero, R replaced the actual letter "i" with "0+i". Can anyone let me know what's going on here and how to fix it? Or is there a better way to add leading zeros to column names?
Thank you very much!
We can use
library(stringr)
colnames(df) <- str_replace(colnames(df), "\\d+",
function(x) sprintf("%02d", as.integer(x)))
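As for why the loop has no visible effect: `0+i` is evaluated as arithmetic, so for `i = 1` the call is effectively gsub("1", "1", ...), which changes nothing. A bare "1" pattern would also match inside "10", "11" and "12". A base R sketch that instead extracts the number and re-formats it with sprintf() is shown below. It assumes every name has the form "weather <number>"; the `old_names` vector is made up for illustration.

```r
# Hypothetical column names, including two-digit months.
old_names <- c("weather 1", "weather 2", "weather 9", "weather 10", "weather 12")

# Strip the prefix, parse the month number, and pad it to two digits.
month <- as.integer(sub("^weather ", "", old_names))
new_names <- sprintf("weather %02d", month)
new_names
```

On a real data frame, the result would be assigned back with colnames(df) <- new_names.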
Here is another option:
library(tidyverse)
set.seed(35)
example <- tibble(`weather 1` = runif(2),
`weather 2` = runif(2),
`weather 3` = runif(2))
rename_with(example, ~str_replace(., "(weather )(\\d+)", "\\10\\2"), everything())
#> # A tibble: 2 x 3
#> `weather 01` `weather 02` `weather 03`
#> <dbl> <dbl> <dbl>
#> 1 0.857 0.553 0.486
#> 2 0.0108 0.950 0.0939
or with base R
colnames(example) <- gsub("(weather )(\\d+)", "\\10\\2", colnames(example))
example
#> # A tibble: 2 x 3
#> `weather 01` `weather 02` `weather 03`
#> <dbl> <dbl> <dbl>
#> 1 0.857 0.553 0.486
#> 2 0.0108 0.950 0.0939
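One caveat: the pattern "(weather )(\\d+)" also matches two-digit names, so "weather 10" would become "weather 010". A sketch that pads only single-digit names by anchoring the pattern follows, again assuming names of the form "weather <number>" (the `nms` vector is illustrative):

```r
nms <- c("weather 1", "weather 9", "weather 10", "weather 12")

# \\d without + and the $ anchor restrict the match to a single trailing digit;
# in the replacement, "\\1" is followed by a literal 0 and then "\\2".
padded <- gsub("^(weather )(\\d)$", "\\10\\2", nms)
padded
```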

What format of table is at Lineups.com and how to scrape it in R

I am new to scraping and have successfully scraped tables from these websites:
https://www.numberfire.com/nba/daily-fantasy/daily-basketball-projections/guards
https://www.dailyfantasyfuel.com/nba/projections/draftkings/
https://www.sportsline.com/nba/expert-projections/simulation/
But this website:
https://www.lineups.com/nba/nba-fantasy-basketball-projections
seems very tricky.
1. I have tried to read it as JSON
r <- read_html('https://www.lineups.com/nba/nba-fantasy-basketball-projections/') %>% html_element('script#__NEXT_DATA__') %>% html_text() %>% jsonlite::parse_json()
2. from rvest methods
data <- "https://www.lineups.com/nba/nba-fantasy-basketball-projections" %>%
read_html %>%
html_nodes('script') %>%
html_text()
3. As well as RSelenium but with no success.
Could you kindly tell me how to deal with the kind of tables found at "www.lineups.com"?
Thanks
Using RSelenium:
library(RSelenium)
library(rvest)
library(dplyr)
driver = rsDriver(browser = c("firefox"))
remDr <- driver[["client"]]
url <- 'https://www.lineups.com/nba/nba-fantasy-basketball-projections'
remDr$navigate(url)
# get all the tables from the webpage
df = remDr$getPageSource()[[1]] %>%
read_html() %>% html_table()
[[2]]
# A tibble: 51 x 31
Player Player Player Player Player Player Player Player `` `` `` `` `` `` Game Game Game Game Game `Projected Game~ `Projected Game~
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 "Name" Team Pos Proje~ Salary Pts/$~ FPPM USG% Pos Proj~ Sala~ Pts/~ FPPM USG% Opp DvP Spre~ Total O/U MINS PTS
2 "Nikola~ DEN C 58.37 $11,0~ 5.3 1.8 31.3% HOU 30 13.5 126 238 33 27.7 7.7 13.7 1.1 0.9 5 18.8
3 "Gianni~ MIL PF 57.84 $11,3~ 5.1 1.8 35.1% CHI 27 -5 122.~ 240 33 30.5 5.9 11.7 1 1.4 7.8 19.1
4 "Joel E~ PHI C 53.35 $10,8~ 4.9 1.7 37.4% CLE 5 7 112 216.5 32 29.4 4.2 11 0.9 1.4 8.9 19

Gather or transpose data with multiple rows as 'key' argument

I want to use tidyr::gather() to gather not only on the column names but also on rows 1 and 2. What I want to achieve is a data frame with 5 columns and 4 rows.
This is a little piece of the dataset I'm working with:
library(tidyverse)
# A tibble: 4 x 3
Aanduiding `Coolsingel 40 links` `Goudseweg 15 links`
<chr> <chr> <chr>
1 Gebiedsnummer 1 2
2 Postcode 3011 AD 3031 XH
3 Leefbaar Rotterdam 124 110
4 Partij van de Arbeid (P.v.d.A.) 58 65
and its reproducible dput(df) to work with:
df <- structure(list(Aanduiding = c("Gebiedsnummer", "Postcode", "Leefbaar Rotterdam",
"Partij van de Arbeid (P.v.d.A.)"), `Coolsingel 40 links` = c("1",
"3011 AD", "124", "58"), `Goudseweg 15 links` = c("2", "3031 XH",
"110", "65")), row.names = c(NA, -4L), class = c("tbl_df", "tbl",
"data.frame"), .Names = c("Aanduiding", "Coolsingel 40 links",
"Goudseweg 15 links"))
So the wanted output looks like this:
Aanduiding Gebiedsnummer Postcode adres value
<chr> <dbl> <chr> <chr> <dbl>
1 Leefbaar Rotterdam 1.00 3011 AD Coolsingel 40 links 124
2 Leefbaar Rotterdam 2.00 3031 XH Goudseweg 15 links 110
3 Partij van de Arbeid (P.v.d.A.) 1.00 3011 AD Coolsingel 40 links 58.0
4 Partij van de Arbeid (P.v.d.A.) 2.00 3031 XH Goudseweg 15 links 65.0
I use the gather() function from the tidyr package a lot, but always in cases where I only want to gather the column names with a certain value. Now I want to gather the column names and also the observations in rows 1 and 2.
Can I gather on multiple keys? Or should I paste the values in observations 1 and 2 onto the column names, then gather() and then separate()?
What's the best tactic here, if possible in a tidyr way?
Much appreciated.
There are two things that need to be done here, and you'll have to figure out how to break down your dataset accordingly.
data.frame(t(df[1:2,]))
will give you:
X1 X2
Aanduiding Gebiedsnummer Postcode
Coolsingel 40 links 1 3011 AD
Goudseweg 15 links 2 3031 XH
And
tidyr::gather(df[3:4,],key="adres",value="value", `Coolsingel 40 links`, `Goudseweg 15 links`)
will give you:
Aanduiding adres value
<chr> <chr> <chr>
1 Leefbaar Rotterdam Coolsingel 40 links 124
2 Partij van de Arbeid (P.v.d.A.) Coolsingel 40 links 58
3 Leefbaar Rotterdam Goudseweg 15 links 110
4 Partij van de Arbeid (P.v.d.A.) Goudseweg 15 links 65
How you go on from there is another problem, possibly a left_join based on adres, but that really depends on how the rest of the data is structured.
You can do this with a combination of gather and spread a few times. I do this often when I need to move a value out to serve as a denominator for a calculation.
library(tidyverse)
...
The goal is to move Gebiedsnummer and Postcode out of Aanduiding, and to gather the other two columns into one column of values. The first gather gets you this:
df %>%
gather(key = address, value = value, -Aanduiding)
#> # A tibble: 8 x 3
#> Aanduiding address value
#> <chr> <chr> <chr>
#> 1 Gebiedsnummer Coolsingel 40 links 1
#> 2 Postcode Coolsingel 40 links 3011 AD
#> 3 Leefbaar Rotterdam Coolsingel 40 links 124
#> 4 Partij van de Arbeid (P.v.d.A.) Coolsingel 40 links 58
#> 5 Gebiedsnummer Goudseweg 15 links 2
#> 6 Postcode Goudseweg 15 links 3031 XH
#> 7 Leefbaar Rotterdam Goudseweg 15 links 110
#> 8 Partij van de Arbeid (P.v.d.A.) Goudseweg 15 links 65
Using a spread after that gets:
df %>%
gather(key = address, value = value, -Aanduiding) %>%
spread(key = Aanduiding, value = value)
#> # A tibble: 2 x 5
#> address Gebiedsnummer `Leefbaar Rotter… `Partij van de Arbe… Postcode
#> <chr> <chr> <chr> <chr> <chr>
#> 1 Coolsinge… 1 124 58 3011 AD
#> 2 Goudseweg… 2 110 65 3031 XH
Then you want to gather again, but to keep address, Gebiedsnummer, and Postcode as their own columns. The select is just there to get the columns in order. So all together:
df %>%
gather(key = address, value = value, -Aanduiding) %>%
spread(key = Aanduiding, value = value) %>%
gather(key = Aanduiding, value = value, -Gebiedsnummer, -address, -Postcode) %>%
select(Aanduiding, Gebiedsnummer, Postcode, address, value) %>%
mutate_at(vars(Gebiedsnummer, value), as.numeric)
#> # A tibble: 4 x 5
#> Aanduiding Gebiedsnummer Postcode address value
#> <chr> <dbl> <chr> <chr> <dbl>
#> 1 Leefbaar Rotterdam 1 3011 AD Coolsingel 40 l… 124
#> 2 Leefbaar Rotterdam 2 3031 XH Goudseweg 15 li… 110
#> 3 Partij van de Arbeid (P.v… 1 3011 AD Coolsingel 40 l… 58
#> 4 Partij van de Arbeid (P.v… 2 3031 XH Goudseweg 15 li… 65
Created on 2018-08-24 by the reprex package (v0.2.0).
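In current tidyr, gather() and spread() are superseded by pivot_longer() and pivot_wider(). A sketch of the same three-step reshape with those functions follows, under the assumption that the data look like the sample in the question. The tibble below just re-creates that sample; `result` is an illustrative name.

```r
library(dplyr)
library(tidyr)

# Re-creation of the sample data from the question.
df <- tibble(
  Aanduiding = c("Gebiedsnummer", "Postcode", "Leefbaar Rotterdam",
                 "Partij van de Arbeid (P.v.d.A.)"),
  `Coolsingel 40 links` = c("1", "3011 AD", "124", "58"),
  `Goudseweg 15 links`  = c("2", "3031 XH", "110", "65")
)

result <- df %>%
  # 1. One row per (Aanduiding, address) pair.
  pivot_longer(-Aanduiding, names_to = "address", values_to = "value") %>%
  # 2. One row per address, so Gebiedsnummer and Postcode become columns.
  pivot_wider(names_from = Aanduiding, values_from = value) %>%
  # 3. Gather the remaining (party) columns back into long form.
  pivot_longer(-c(address, Gebiedsnummer, Postcode),
               names_to = "Aanduiding", values_to = "value") %>%
  select(Aanduiding, Gebiedsnummer, Postcode, address, value) %>%
  mutate(across(c(Gebiedsnummer, value), as.numeric))
result
```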
