I have scraped multiple tables from a basketball site using a for loop.
years <- c(2016:2021)
final_table <- {}
for(i in 1:length(years)){
  url <- paste0("https://www.basketball-reference.com/friv/free_agents.cgi?year=", years[i])
  past_free_agency_page <- read_html(url)
  past_free_agency_webtable <- html_nodes(past_free_agency_page, "table")
  past_free_agency_table <- html_table(past_free_agency_webtable, header = T)[[1]]
  final_table <- rbind(final_table, past_free_agency_table)
}
This retrieves everything correctly, but I am trying to combine all of these tables as they are created. Note that there are six tables in total (years 2016 to 2021).
The first problem is an error: when I combine the tables with rbind() at the end of the loop, it fails and says "the names do not match". I do not know how to fix this because I am new to working with loops, and converting the scraped table to a data frame did not help.
My next issue has to do with how the tables are combined. On the website, each table contains header rows within its body that repeat the main header exactly. The code treats each of them as an ordinary row, so they show up as data rows in every table. I want these to be ignored.
The last issue is making each row uniquely identifiable: I want the year of each table to be a column of its own. For example, the 2016 table should have a column that says 2016. I have tried things inside the loop such as past_free_agency_table[,1] <- c(years[i]). I want this because some players appear in several of these tables, and I need to be able to tell which table a row came from.
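This is still a loop of sorts, but written the purrr way. map_dfr() binds the per-year tables by column name and fills missing columns with NA, so mismatched names no longer cause an error.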
library(tidyverse)
library(rvest)
get_df <- function(year) {
  "https://www.basketball-reference.com/friv/free_agents.cgi?year=" %>%
    paste0(., year) %>%
    read_html() %>%
    html_table() %>%
    .[[1]] %>%
    mutate(years = year) %>%
    select(Rk, years, everything())
}
df <- map_dfr(2016:2020, get_df)
# A tibble: 1,161 × 16
Rk years Player Pos Age Type OTm `2015-16 Stats` WS NTm
<chr> <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 1 2016 Kevin Du… F-G 33-2… UFA OKC 28.2 Pts, 8.2 … 14.5 GSW
2 2 2016 LeBron J… F-G 37-1… UFA CLE 25.3 Pts, 7.4 … 13.6 CLE
3 3 2016 Hassan W… C 33-0… UFA MIA 14.2 Pts, 11.8… 10.3 MIA
4 4 2016 DeMar De… G-F 32-3… UFA TOR 23.5 Pts, 4.5 … 9.9 TOR
5 5 2016 Al Horfo… C-F 36-0… UFA ATL 15.2 Pts, 7.3 … 9.4 BOS
6 6 2016 Marvin W… F 36-0… UFA CHO 11.7 Pts, 6.4 … 7.8 CHA
7 7 2016 Andre Dr… C 28-3… RFA DET 16.2 Pts, 14.8… 7.4 DET
8 8 2016 Pau Gasol C-F 41-3… UFA CHI 16.5 Pts, 11.0… 7.1 SAS
9 9 2016 Dirk Now… F 44-0… UFA DAL 18.3 Pts, 6.5 … 6.8 DAL
10 10 2016 Dwight H… C 36-1… UFA HOU 13.7 Pts, 11.8… 6.6 ATL
# … with 1,151 more rows, and 6 more variables: Terms <chr>, Notes <chr>,
# `2016-17 Stats` <chr>, `2017-18 Stats` <chr>, `2018-19 Stats` <chr>,
# `2019-20 Stats` <chr>
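The repeated header rows from the question are still in this result (which is probably why Rk is a character column). Below is a minimal sketch of how they could be dropped, using two small made-up tables instead of the scraped ones so that it runs offline. The names tbl_2016, tbl_2017 and clean_table are hypothetical, and the sketch assumes that a repeated header row has the literal value "Rk" in the Rk column, as the question describes.

```r
library(dplyr)

# Hypothetical stand-ins for two scraped tables; each has a header row
# repeated inside the body and a year-specific stats column
tbl_2016 <- data.frame(
  Rk = c("1", "2", "Rk", "3"),
  Player = c("Player A", "Player B", "Player", "Player C"),
  `2015-16 Stats` = c("10 Pts", "8 Pts", "2015-16 Stats", "5 Pts"),
  check.names = FALSE, stringsAsFactors = FALSE
)
tbl_2017 <- data.frame(
  Rk = c("1", "Rk", "2"),
  Player = c("Player A", "Player", "Player D"),
  `2016-17 Stats` = c("12 Pts", "2016-17 Stats", "7 Pts"),
  check.names = FALSE, stringsAsFactors = FALSE
)

# Assumption: a repeated header row is any row whose Rk value is "Rk"
clean_table <- function(tbl, year) {
  tbl %>%
    filter(Rk != "Rk") %>%          # drop header rows repeated in the body
    mutate(years = year) %>%        # tag every row with its table's year
    select(years, everything())
}

# bind_rows() matches columns by name and fills the gaps with NA,
# whereas rbind() stops when the column names differ
combined <- bind_rows(
  clean_table(tbl_2016, 2016),
  clean_table(tbl_2017, 2017)
)
print(combined)
```

In get_df() above, the same filter(Rk != "Rk") step could go right after .[[1]]; the Rk column could then be converted with as.integer() if a numeric rank is needed.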
I have to calculate the inflation rate from 2015 to 2019 using the CPI, which I have for each month of the period. This means that, for each month, I have to calculate the percentage growth rate relative to the same month of the previous year.
I am asked to do this calculation for several countries and then show the average for the period 2015-2019.
This is my database:
data <- read.table("https://pastebin.com/raw/6cetukKb")
I have tried the quantmod, dplyr, and lubridate packages, but I can't get the CPI conversion to work.
I tried this but I know it is not correct:
data$year <- year(data$date)
anual_cpi <- data %>% group_by(year) %>% summarize(cpi = mean(Argentina))
anual_cpi$adj_factor <- anual_cpi$cpi/anual_cpi$cpi[anual_cpi$year == 2014]
**UPDATE**
My teacher gave us a hint on how to get the result, but when I try to add it to the code, I get an error:
data %>%
tidyr::pivot_longer(cols = Antigua_Barbuda:Barbados) %>%
group_by(name, year) %>%
summarise(value = mean(value)) %>%
mutate((change=(x-lag(x,1))/lag(x,1)*100))
| country         | value |
|-----------------|-------|
| Antigua_Barbuda | -1.55 |
| Argentina       | 1.03  |
| Aruba           | -1.52 |
| Bahamas         | -1.56 |
| Barbados        | -1.38 |
where "value" is the average inflation for each country over the whole period 2015-2019.
We can use data.table methods. Note that the adjustment factor has to be computed per country (by = variable); without it, the 2014 values of all countries are recycled across the rows and the factors come out wrong.
library(data.table)
melt(fread("https://pastebin.com/raw/6cetukKb"),
     id.vars = c('date', 'year', 'period', 'periodName'))[,
  .(value = mean(value)), .(variable, year)][,
  adj_factor := value/value[year == 2014], by = variable][]
#           variable year     value adj_factor
# 1: Antigua_Barbuda 2014  96.40000  1.0000000
# 2: Antigua_Barbuda 2015  96.55833  1.0016425
# 3: Antigua_Barbuda 2016  96.08333  0.9967151
# 4: Antigua_Barbuda 2017  98.40833  1.0208333
# 5: Antigua_Barbuda 2018  99.62500  1.0334544
# 6: Antigua_Barbuda 2019 101.07500  1.0484959
# 7: Argentina 2014 56.60000 1.0000000
# ..
You should read your data with header = TRUE, since the first row contains the column names. Then reshape the data to long format, which makes the calculation easy.
After this you can perform whichever calculation you want. For example, to reproduce the steps of your attempt, i.e. divide every value by the 2014 value of the same country, you can do the following.
library(dplyr)
data <- read.table("https://pastebin.com/raw/6cetukKb", header = TRUE)
data %>%
  tidyr::pivot_longer(cols = Antigua_Barbuda:Barbados) %>%
  group_by(name, year) %>%
  summarise(value = mean(value)) %>%
  mutate(adj_factor = value/value[year == 2014])
# name year value adj_factor
# <chr> <int> <dbl> <dbl>
# 1 Antigua_Barbuda 2014 96.4 1
# 2 Antigua_Barbuda 2015 96.6 1.00
# 3 Antigua_Barbuda 2016 96.1 0.997
# 4 Antigua_Barbuda 2017 98.4 1.02
# 5 Antigua_Barbuda 2018 99.6 1.03
# 6 Antigua_Barbuda 2019 101. 1.05
# 7 Argentina 2014 56.6 1
# 8 Argentina 2015 64.0 1.13
# 9 Argentina 2016 89.9 1.59
#10 Argentina 2017 113. 2.00
# … with 20 more rows
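For the year-over-year inflation that the question actually asks for, here is a minimal sketch on made-up data. The column names month, A and B are hypothetical stand-ins for the real month and country columns, and the sketch assumes that the monthly rows are complete, so that a lag of 12 rows within a country is the same month of the previous year.

```r
library(dplyr)
library(tidyr)

# Hypothetical monthly CPI for two made-up countries, 2014-2016
cpi <- data.frame(
  year  = rep(2014:2016, each = 12),
  month = rep(1:12, times = 3),
  A     = 100 * 1.01^(0:35),               # grows 1% every month
  B     = rep(c(100, 110, 121), each = 12) # jumps 10% every year
)

inflation <- cpi %>%
  pivot_longer(cols = c(A, B), names_to = "country", values_to = "cpi") %>%
  arrange(country, year, month) %>%
  group_by(country) %>%
  # percentage change against the same month one year earlier
  mutate(yoy = (cpi - lag(cpi, 12)) / lag(cpi, 12) * 100) %>%
  filter(!is.na(yoy)) %>%                  # the first year has no reference
  summarise(mean_inflation = mean(yoy))

print(inflation)
```

With the real data, the first year (2014) only serves as the reference for 2015, so the average covers 2015-2019. This also suggests why the hint in the question errors: it refers to x, while the column created by summarise() is called value.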
I have the following data: https://raw.githubusercontent.com/fivethirtyeight/data/master/congress-age/congress-terms.csv
I'm trying to determine how to calculate the mean age of members of Congress by year (termstart) for each party (Republican and Democrat).
I was hoping for some help on how to go about doing this. I am a beginner in R and I'm just playing around with the data.
Thanks!
Try this approach: filter for the required parties, then summarise. After that, you can reshape to wide format so that both parties appear side by side for each term start date. Here is the code using tidyverse functions:
library(dplyr)
library(tidyr)
# Data
df <- read.csv('https://raw.githubusercontent.com/fivethirtyeight/data/master/congress-age/congress-terms.csv', stringsAsFactors = FALSE)
# Mean age per term start date and party, one column per party
newdf <- df %>%
  filter(party %in% c('R', 'D')) %>%
  group_by(termstart, party) %>%
  summarise(MeanAge = mean(age, na.rm = TRUE)) %>%
  pivot_wider(names_from = party, values_from = MeanAge)
Output:
# A tibble: 34 x 3
# Groups: termstart [34]
termstart D R
<chr> <dbl> <dbl>
1 1947-01-03 52.0 53.0
2 1949-01-03 51.4 54.6
3 1951-01-03 52.3 54.3
4 1953-01-03 52.3 54.1
5 1955-01-05 52.3 54.7
6 1957-01-03 53.2 55.4
7 1959-01-07 52.4 54.7
8 1961-01-03 53.4 53.9
9 1963-01-09 53.3 52.6
10 1965-01-04 52.3 52.2
# ... with 24 more rows
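If you want the calendar year rather than the full term start date, the year can be extracted first. This is a small sketch on made-up rows (the toy data frame is hypothetical and only mimics the termstart, party and age columns); it assumes termstart is stored as text in YYYY-MM-DD form, as in the output above.

```r
library(dplyr)
library(tidyr)

# Hypothetical rows mimicking the relevant columns of the real file
toy <- data.frame(
  termstart = c("1947-01-03", "1947-01-03", "1947-01-03",
                "1949-01-03", "1949-01-03"),
  party = c("D", "D", "R", "D", "R"),
  age   = c(50, 54, 60, 48, 58),
  stringsAsFactors = FALSE
)

by_year <- toy %>%
  filter(party %in% c("R", "D")) %>%
  # pull the four-digit year out of the date string
  mutate(year = as.integer(format(as.Date(termstart), "%Y"))) %>%
  group_by(year, party) %>%
  summarise(MeanAge = mean(age, na.rm = TRUE), .groups = "drop") %>%
  pivot_wider(names_from = party, values_from = MeanAge)

print(by_year)
```

The .groups = "drop" argument (dplyr 1.0.0 or later) returns an ungrouped result, so the "Groups:" line no longer appears in the printout.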
Assume we have a DF with duplicated UserIDs that carry different names, and the names themselves can be duplicated as well.
DF <- data.frame(ID=c(101,101,101,101,101,102,102,102,102),
Name=c("Ed","Ed","Hank","Hank","Hank","Sandy","Sandy","Jessica","Jessica"),
Class=c("Junior","Junior","Junior","Junior", "Junior","High","High","Mid","Mid"),
Scoring=c(11,15,18,18,12,20,22,25,26), Other_Scores=c(15,9,34,23,43,23,34,23,23))
The aim is to aggregate by UserID and Name and calculate the mean and standard deviation of the scores for each combination. A desired output example:
UserID  Name     Class   Scoring_mean  Scoring_std
101     Ed       Junior  12.5          3
101     Hank     Junior  24.67         11.62
102     Sandy    High    24.75         6.29
102     Jessica  Mid     24.25         1.5
Hence my question:
What are the options to aggregate by Name within each UserID without losing information (e.g. Hank being collapsed into Ed, as happened in my attempts with summarise() or mutate())?
As I understand it, R has to check which Name corresponds to each UserID and, for each match, aggregate and calculate the mean and standard deviation, but I am not able to get this working with dplyr.
I also couldn't find any other post closely related to this question; the nearest ones were:
How to calculate the mean of specific rows in R?
Subtract pairs of columns based on matching column
Calculating mean when 2 conditions need met in R
average between duplicated rows in R
Here's a tidyverse option that uses some reshaping to create one column of scores and then some grouping in order to get the summary stats:
DF <- data.frame(
ID=c(101,101,101,101,101,102,102,102,102),
Name=c("Ed","Ed","Hank","Hank","Hank","Sandy","Sandy","Jessica","Jessica"),
Class=c("Junior","Junior","Junior","Junior", "Junior","High","High","Mid","Mid"),
Scoring=c(11,15,18,18,12,20,22,25,26),
Other_Scores=c(15,9,34,23,43,23,34,23,23)
)
library(tidyverse)
DF %>%
gather(score_type, score, Scoring, Other_Scores) %>% # reshape score columns
group_by(ID, Name, Class) %>% # group by combinations
summarise(scoring_mean = mean(score), # get summary stats
scoring_sd = sd(score)) %>%
ungroup() # forget the grouping
# # A tibble: 4 x 5
# ID Name Class scoring_mean scoring_sd
# <dbl> <fct> <fct> <dbl> <dbl>
# 1 101. Ed Junior 12.5 3.00
# 2 101. Hank Junior 24.7 11.6
# 3 102. Jessica Mid 24.2 1.50
# 4 102. Sandy High 24.8 6.29
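gather() still works but has been superseded in tidyr by pivot_longer(). As a sketch, the same summary could be written as follows; the column names score_type and score are just the ones chosen above, and .groups = "drop" (dplyr 1.0.0 or later) stands in for the explicit ungroup().

```r
library(dplyr)
library(tidyr)

DF <- data.frame(
  ID = c(101, 101, 101, 101, 101, 102, 102, 102, 102),
  Name = c("Ed", "Ed", "Hank", "Hank", "Hank", "Sandy", "Sandy", "Jessica", "Jessica"),
  Class = c("Junior", "Junior", "Junior", "Junior", "Junior", "High", "High", "Mid", "Mid"),
  Scoring = c(11, 15, 18, 18, 12, 20, 22, 25, 26),
  Other_Scores = c(15, 9, 34, 23, 43, 23, 34, 23, 23),
  stringsAsFactors = FALSE
)

result <- DF %>%
  # stack both score columns into a single column called score
  pivot_longer(cols = c(Scoring, Other_Scores),
               names_to = "score_type", values_to = "score") %>%
  group_by(ID, Name, Class) %>%
  summarise(scoring_mean = mean(score),
            scoring_sd = sd(score),
            .groups = "drop")

print(result)
```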
What about computing your summary stats and then joining the results to your initial data frame? Like so:
library(dplyr)
DF <- data.frame(ID=c(101,101,101,101,101,102,102,102,102),
                 Name=c("Ed","Ed","Hank","Hank","Hank","Sandy","Sandy","Jessica","Jessica"),
                 Class=c("Junior","Junior","Junior","Junior", "Junior","High","High","Mid","Mid"),
                 Scoring=c(11,15,18,18,12,20,22,25,26), Other_Scores=c(15,9,34,23,43,23,34,23,23))
# Summarise Scoring per Name, then join ID and Class back on
DF2 <- DF %>% group_by(Name) %>%
  summarise(scoring_mean=mean(Scoring), scoring_sd = sd(Scoring)) %>%
  left_join(DF[,c(1,2,3)], by="Name")
Giving:
# A tibble: 9 x 5
Name scoring_mean scoring_sd ID Class
<fct> <dbl> <dbl> <dbl> <fct>
1 Ed 13.0 2.83 101. Junior
2 Ed 13.0 2.83 101. Junior
3 Hank 16.0 3.46 101. Junior
4 Hank 16.0 3.46 101. Junior
5 Hank 16.0 3.46 101. Junior
6 Jessica 25.5 0.707 102. Mid
7 Jessica 25.5 0.707 102. Mid
8 Sandy 21.0 1.41 102. High
9 Sandy 21.0 1.41 102. High
When working with data (e.g., in a data.frame), the user can control the number of displayed digits by using
options(digits=3)
and printing the data.frame like this:
ttf.all
When the user needs to paste the data into Excel like this
write.table(ttf.all, 'clipboard', sep='\t',row.names=F)
the digits option is ignored and the numbers are not rounded.
Here is the nicely rounded console output:
> ttf.all
year V1.x.x V1.y.x ratio1 V1.x.y V1.y.y ratioR V1.x.x V1.y.x ratioAL V1.x.y V1.y.y ratioRL
1 2006 227 645 35.2 67 645 10.4 150 645 23.3 53 645 8.22
2 2007 639 1645 38.8 292 1645 17.8 384 1645 23.3 137 1645 8.33
3 2008 1531 3150 48.6 982 3150 31.2 755 3150 24.0 235 3150 7.46
4 2009 1625 3467 46.9 1026 3467 29.6 779 3467 22.5 222 3467 6.40
But what ends up in Excel (via the clipboard) is not rounded. How can I control this in write.table()?
You can use the function format(), as in:
write.table(format(ttf.all, digits=2), 'clipboard', sep='\t', row.names=FALSE)
format() is a generic function that has methods for many classes, including data frames. Unlike round(), it won't throw an error if your data frame is not all numeric. Note that digits here means significant digits, not decimal places. For more details on the formatting options, see the help file via ?format
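To see what actually gets written, here is a small sketch that sends the output to the console instead of the clipboard, so it runs on any platform. The data frame ttf is a made-up stand-in for ttf.all; quote = FALSE is added only to keep the output free of quotation marks.

```r
# Hypothetical small data frame with one character and one numeric column
ttf <- data.frame(id = c("a", "b"), ratio = c(35.22222, 8.21705),
                  stringsAsFactors = FALSE)

# format() turns every column into character, formatting the numbers
# with the requested number of significant digits
formatted <- format(ttf, digits = 3)

# file = "" writes to the console; use 'clipboard' instead (on Windows)
# to paste the result into Excel
out <- capture.output(
  write.table(formatted, "", sep = "\t", row.names = FALSE, quote = FALSE)
)
print(out)
```

Because the formatted columns are character, the values are written exactly as they are displayed; the trade-off is that formatted itself is no longer numeric, so keep the original data frame for further calculations.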
Here is a solution for a data frame with mixed character and numeric columns. We first use mutate_if() to select the numeric columns and then apply the round() function to them.
# install.packages('dplyr', dependencies = TRUE)
library(dplyr)
df <- read.table(text = "id year V1.x.x V1.y.x ratio1
a 2006 227.11111 645.11111 35.22222
b 2007 639.11111 1645.11111 38.22222
c 2008 1531.11111 3150.11111 48.22222
d 2009 1625.11111 3467.11111 46.22222",
header = TRUE, stringsAsFactors = FALSE)
df %>%
mutate_if(is.numeric, round, digits = 2)
#> id year V1.x.x V1.y.x ratio1
#> 1 a 2006 227.11 645.11 35.22
#> 2 b 2007 639.11 1645.11 38.22
#> 3 c 2008 1531.11 3150.11 48.22
#> 4 d 2009 1625.11 3467.11 46.22
### dplyr v1.0.0+
df %>%
mutate(across(where(is.numeric), ~ round(., digits = 2)))
#> id year V1.x.x V1.y.x ratio1
#> 1 a 2006 227.11 645.11 35.22
#> 2 b 2007 639.11 1645.11 38.22
#> 3 c 2008 1531.11 3150.11 48.22
#> 4 d 2009 1625.11 3467.11 46.22
Created on 2019-03-17 by the reprex package (v0.2.1.9000)