I have an overview page of student statistics, https://www.europa-uni.de/de/struktur/verwaltung/dezernat_1/statistiken/index.html, and each semester has specific information in an HTML table element, e.g. https://www.europa-uni.de/de/struktur/verwaltung/dezernat_1/statistiken/2013-Wintersemester/index.html
I would like to scrape all the information and put it together as a data frame. I manually created a character vector of all URLs (perhaps there is another way).
Edit: As was mentioned, some URL parts are capitalized and some are not. This list should be complete.
winters <- seq(from=2013, to=2021)
summers <- seq(from=2014, to=2022)
winters <- paste0(winters, "-wintersemester")
summers <- paste0(summers, "-Sommersemester")
all_terms <- c(rbind(winters, summers))
all_terms[1] <- "2013-Wintersemester"
all_terms[3] <- "2014-Wintersemester"
all_url <- paste0("https://www.europa-uni.de/de/struktur/verwaltung/dezernat_1/statistiken/", all_terms, "/index.html")
I can get data for a single page
all_url[1] %>%
read_html() %>%
html_table() %>%
as.data.frame()
Studierende gesamt 6645
weiblich 4206
männlich 2439
Deutsche 5001
Ausländer/innen 1644
1. Fachsemester 1783
1. Hochschulsemester 1110
But I fail to write a working for loop:
tables <- list()
index <- 1
for(i in length(all_url)){
table <- all_url[i] %>%
read_html() %>%
html_table()
tables[index] <- table
index <- index + 1
}
df <- do.call("rbind", tables)
It would be great to have a dataframe with each sub-page (semester / year) as rows and all student data as columns.
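For reference, the loop above runs only once, because `length(all_url)` is a single number rather than a sequence of indices. A minimal sketch of a corrected loop follows. To keep it runnable offline, `fake_scrape()` and `urls` are hypothetical stand-ins for `read_html() %>% html_table()` and `all_url`.

```r
# Hypothetical stand-in for `read_html(url) %>% html_table()`:
# returns a list holding one data frame, like a page with a single table.
fake_scrape <- function(url) {
  list(data.frame(X1 = c("Studierende gesamt", "weiblich"), X2 = c(2L, 1L)))
}
urls <- c("a", "b", "c")  # placeholder for all_url

tables <- vector("list", length(urls))
for (i in seq_along(urls)) {                 # seq_along() gives 1, 2, 3; length() gives only 3
  tables[[i]] <- fake_scrape(urls[i])[[1]]   # [[1]] picks the first table on the page
}
df <- do.call(rbind, tables)
```

With the real pages, a failing URL would still stop the loop, so some error handling is likely needed as well.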
Some pages appear not to be available. You could handle this with tryCatch and substitute NA for the pages that fail.
library(rvest)
tables <- lapply(all_url, \(x) tryCatch(as.data.frame(html_table(read_html(x))),
error=\(e) NA)) |> setNames(all_terms)
tail(tables, 3)
# $`2021-Sommersemester`
# X1 X2
# 1 Studierende gesamt 5131
# 2 weiblich 3037
# 3 männlich 2054
# 4 Deutsche 3698
# 5 Ausländer/innen 1433
# 6 1. Fachsemester 394
# 7 1. Hochschulsemester 143
#
# $`2021-Wintersemester`
# [1] NA
#
# $`2022-Sommersemester`
# X1 X2
# 1 Studierende gesamt 4851
# 2 weiblich 2847
# 3 männlich 2004
# 4 Deutsche 3360
# 5 Ausländer/innen 1491
# 6 1. Fachsemester 403
# 7 1. Hochschulsemester 189
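The tryCatch pattern can be tried without network access. In this sketch `fetch_stub()` is a hypothetical replacement for the scraping call: it fails for one input and returns a small data frame otherwise.

```r
# Hypothetical fetcher: errors for "bad", returns a data frame otherwise.
fetch_stub <- function(x) {
  if (x == "bad") stop("HTTP error 404")
  data.frame(X1 = "Studierende gesamt", X2 = 1L)
}

# Wrap the call so that an error becomes NA instead of aborting lapply().
safe_fetch <- function(x) tryCatch(fetch_stub(x), error = function(e) NA)

res <- lapply(c("ok", "bad"), safe_fetch)
missing <- is.na(res)   # on a list, TRUE only for the length-one NA entries
```

The logical vector `missing` is what allows the failed pages to be dropped before binding the rest.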
Thereafter you may want to rbind the non-missings,
na <- is.na(tables)
tables[!na] <- Map(`[<-`, tables[!na], 'sem', value=substr(all_terms[!na], 1, 6)) ## add year column*
res <- do.call(rbind, tables[!is.na(tables)])
head(res)
# X1 X2 sem
# 2013-Wintersemester.1 Studierende gesamt 6645 2013-W
# 2013-Wintersemester.2 weiblich 4206 2013-W
# 2013-Wintersemester.3 männlich 2439 2013-W
# 2013-Wintersemester.4 Deutsche 5001 2013-W
# 2013-Wintersemester.5 Ausländer/innen 1644 2013-W
# 2013-Wintersemester.6 1. Fachsemester 1783 2013-W
*better use sapply(strsplit(substr(all_terms[!na], 1, 6), '-'), \(x) paste(rev(x), collapse='_')) here to get valid names
and reshape the data.
reshape2::dcast(res, X1 ~ sem, value.var='X2')
# X1 2013-W 2014-S 2014-W 2015-S 2016-S 2017-S 2018-S 2019-S 2020-S 2021-S 2022-S
# 1 1. Fachsemester 1783 567 1600 557 613 693 810 611 405 394 403
# 2 1. Hochschulsemester 1110 199 1020 224 240 217 273 214 78 143 189
# 3 Ausländer/innen 1644 1510 1649 1501 1576 1613 1551 1527 1369 1433 1491
# 4 Deutsche 5001 4836 4843 4599 4682 4733 4821 4523 4040 3698 3360
# 5 männlich 2439 2347 2394 2255 2292 2388 2468 2388 2197 2054 2004
# 6 Studierende gesamt 6645 6346 6492 6100 6258 6346 6372 6051 5409 5131 4851
# 7 weiblich 4206 3999 4098 3845 3966 3958 3904 3663 3212 3037 2847
Here's a tidyverse approach. Note that I only use links 1:4 because there's something off with (some of) the others.
library(rvest)
library(tidyverse)
# gather terms and corresponding urls
data <- tibble(
term = all_terms,
url = paste0(
"https://www.europa-uni.de/de/struktur/verwaltung/dezernat_1/statistiken/",
all_terms, "/index.html"
)
)[1:4,]
# map over `data`, scrape table, rename variables and bind the results
map(1:nrow(data), ~ {
data$url[.x] %>%
read_html() %>%
html_table() %>%
`[[`(., 1) %>%
mutate(term = data$term[.x]) %>%
rename(Kategorie = X1, Anzahl = X2)
}) %>%
bind_rows()
Result:
# A tibble: 28 × 3
Kategorie Anzahl term
<chr> <int> <chr>
1 Studierende gesamt 6645 2013-Wintersemester
2 weiblich 4206 2013-Wintersemester
3 männlich 2439 2013-Wintersemester
4 Deutsche 5001 2013-Wintersemester
5 Ausländer/innen 1644 2013-Wintersemester
6 1. Fachsemester 1783 2013-Wintersemester
7 1. Hochschulsemester 1110 2013-Wintersemester
8 Studierende gesamt 6346 2014-Sommersemester
9 weiblich 3999 2014-Sommersemester
10 männlich 2347 2014-Sommersemester
...
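If one row per term is wanted, as asked in the question, the long result can probably be widened with tidyr's `pivot_wider()`. The sketch below uses a small hand-typed excerpt of the result (only two categories and two terms) instead of the scraped tibble.

```r
library(tidyr)

# Hand-typed excerpt of the long result shown above (not scraped here).
long <- data.frame(
  Kategorie = rep(c("Studierende gesamt", "weiblich"), times = 2),
  Anzahl    = c(6645L, 4206L, 6346L, 3999L),
  term      = rep(c("2013-Wintersemester", "2014-Sommersemester"), each = 2)
)

# One row per term, one column per category.
wide <- pivot_wider(long, names_from = Kategorie, values_from = Anzahl)
```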
I want to create a weighted average of the baseball statistic WAR from 2017 to 2019.
The weights would be as follows:
2019: 57.14%
2018: 28.57%
2017: 14.29%
However some players only played in 2018 and 2019, some having played in 2019 and 2017.
If they've only played in two years it would be 67/33, and only one year would be 100% obviously.
I was wondering if there was an easy way to do this.
My data set looks like this
Name Season G PA HR BB_pct K_pct ISO wOBA wRC_plus Def WAR
337 A.J. Pollock 2017 112 466 14 7.5 15.2 0.205 0.340 103 2.6 2.2
357 A.J. Pollock 2018 113 460 21 6.7 21.7 0.228 0.338 111 0.9 2.6
191 Aaron Altherr 2017 107 412 19 7.8 25.2 0.245 0.359 120 -7.9 1.4
162 Aaron Hicks 2017 88 361 15 14.1 18.6 0.209 0.363 128 6.4 3.4
186 Aaron Hicks 2018 137 581 27 15.5 19.1 0.219 0.360 129 2.3 5.0
464 Aaron Hicks 2019 59 255 12 12.2 28.2 0.208 0.325 102 1.3 1.1
The years vary from person to person, so I was wondering if anyone had a way to do this weighted average depending on the years they played. I also don't want any players who only played in 2017, if that makes sense.
I guess there is an easier way of doing your task; unfortunately my approach is a little more complex. I'm using dplyr and purrr.
First I put those weights into a list:
one_year <- 1
two_years <- c(2/3, 1/3)
three_years <- c(4/7, 2/7, 1/7) # 57.14%, 28.57%, 14.29%, most recent season first
weights <- list(one_year, two_years, three_years)
Next I split the dataset into a list by the number of seasons each player took part in, sorting each player's seasons from most recent to oldest so that the rows line up with the weights:
df %>%
group_by(Name) %>%
mutate(n=n()) %>%
arrange(n, Name, desc(Season)) %>%
ungroup() %>%
group_split(n) -> my_list
Now I define a function that calculates the average using the weights:
WAR_average <- function(i) {my_list[[i]] %>%
group_by(Name) %>%
mutate(WAR_average = sum(WAR * weights[[i]]))}
And finally I apply the function WAR_average on my_list and filter/select the data:
my_list %>%
seq_along() %>%
lapply(WAR_average) %>% # apply function
reduce(rbind) %>% # bind the dataframes into one df
filter(Season != 2017 | n != 1) %>% # drop players only active in 2017
select(Name, WAR_average) %>% # select player and war_average
distinct() # remove duplicates
This whole process returns
# A tibble: 2 x 2
# Groups: Name [2]
Name WAR_average
<chr> <dbl>
1 A.J. Pollock 2.47
2 Aaron Hicks 2.54
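A more compact sketch of the same idea in base R, under the assumption that the weights halve with each step back in time (4:2:1 for three seasons, 2:1 for two) and are renormalized over the seasons a player actually has. The data frame `war` is a hand-typed excerpt of the question's data, and `weighted_war()` is a hypothetical helper name.

```r
# Hand-typed excerpt of the question's data (Name, Season, WAR only).
war <- data.frame(
  Name   = c("A.J. Pollock", "A.J. Pollock", "Aaron Altherr",
             "Aaron Hicks", "Aaron Hicks", "Aaron Hicks"),
  Season = c(2017, 2018, 2017, 2017, 2018, 2019),
  WAR    = c(2.2, 2.6, 1.4, 3.4, 5.0, 1.1)
)

# Weighted mean for one player: the newest season gets weight 1,
# each older season half the weight of the one after it.
weighted_war <- function(d) {
  d <- d[order(d$Season, decreasing = TRUE), ]
  w <- 2^-(seq_len(nrow(d)) - 1)
  sum(d$WAR * w) / sum(w)
}

# Drop players whose latest season is 2017, then average per player.
keep <- ave(war$Season, war$Name, FUN = max) > 2017
res  <- sapply(split(war[keep, ], war$Name[keep]), weighted_war)
```

Because the weights are normalized inside the helper, no separate weight vector per number of seasons is needed.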
I have one table with five columns: Year, Revenue, Pensions, Income and Wages. With this table I made a calculation with the code below:
library(dplyr)
#DATA
TEST<-data.frame(
Year= c(2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020,2021),
Revenue =c(8634,5798,6022,6002,6266,6478,6732,7224,6956,6968,7098,7620,7642,8203,9856,20328,22364,22222,23250,25250,26250,27250),
Pensions =c(8734,5798,7011,7002,7177,7478,7731,7114,7957,7978,7098,7710,7742,8203,9857,10328,11374,12211,13150,15150,17150,17150),
Income =c(8834,5898,6033,6002,6366,6488,6833,8334,6956,6968,8098,8630,8642,8203,9856,30328,33364,32233,33350,35350,36350,38350),
Wages =c(8834,5598,8044,8002,8488,8458,8534,5444,8958,8988,5098,5840,5842,8203,9858,40328,44384,42244,43450,45450,48450,45450)
)
#FUNCTION
fun1 <- function(x){ ((x - lag(x))/lag(x))*100}
#CALCULATION
ESTIMATION_0<-mutate(TEST,
Nominal_growth_Revenue=fun1(Revenue),
Nominal_growth_Pensions=fun1(Pensions),
Nominal_growth_Income=fun1(Income),
Nominal_growth_Wages=fun1(Wages)
)
My intention is to optimize this code and do the calculation with an apply function (or something similar). For this calculation I wrote four lines of code, one per column, but I would like to do it in a single line. Can anybody help me with this problem?
Assuming you have a character vector with the relevant columns:
cols <- c("Revenue", "Pensions", "Income", "Wages")
Use apply():
TEST[paste0('Nominal_growth_', cols)] <- apply(TEST[cols], 2, fun1)
or data.table:
library(data.table)
setDT(TEST)
TEST[, (paste0('Nominal_growth_', cols)) := lapply(.SD, fun1), .SDcols = cols]
You could do this:
vars_names <- paste0("Nominal_growth_", names(select(TEST, -Year)))
TEST %>%
bind_cols( (TEST %>% mutate_at(vars(-Year), ~fun1(.x))) %>% select(-Year) %>% setNames(vars_names) )
Year Revenue Pensions Income Wages Nominal_growth_Revenue Nominal_growth_Pensions Nominal_growth_Income Nominal_growth_Wages
1 2000 8634 8734 8834 8834 NA NA NA NA
2 2001 5798 5798 5898 5598 -32.8468844 -33.6157545 -33.2352275 -36.63119765
3 2002 6022 7011 6033 8044 3.8634012 20.9210072 2.2889115 43.69417649
4 2003 6002 7002 6002 8002 -0.3321156 -0.1283697 -0.5138405 -0.52212829
5 2004 6266 7177 6366 8488 4.3985338 2.4992859 6.0646451 6.07348163
6 2005 6478 7478 6488 8458 3.3833387 4.1939529 1.9164310 -0.35344015
7 2006 6732 7731 6833 8534 3.9209633 3.3832576 5.3175092 0.89855758
8 2007 7224 7114 8334 5444 7.3083779 -7.9808563 21.9669252 -36.20810874
9 2008 6956 7957 6956 8958 -3.7098560 11.8498735 -16.5346772 64.54812638
10 2009 6968 7978 6968 8988 0.1725129 0.2639186 0.1725129 0.33489618
11 2010 7098 7098 8098 5098 1.8656716 -11.0303334 16.2169920 -43.27992879
12 2011 7620 7710 8630 5840 7.3541843 8.6221471 6.5695233 14.55472734
13 2012 7642 7742 8642 5842 0.2887139 0.4150454 0.1390498 0.03424658
14 2013 8203 8203 8203 8203 7.3410102 5.9545337 -5.0798426 40.41424170
15 2014 9856 9857 9856 9858 20.1511642 20.1633549 20.1511642 20.17554553
16 2015 20328 10328 30328 40328 106.2500000 4.7783301 207.7110390 309.08906472
17 2016 22364 11374 33364 44384 10.0157418 10.1278079 10.0105513 10.05752827
18 2017 22222 12211 32233 42244 -0.6349490 7.3588887 -3.3898813 -4.82155732
19 2018 23250 13150 33350 43450 4.6260463 7.6897879 3.4653926 2.85484329
20 2019 25250 15150 35350 45450 8.6021505 15.2091255 5.9970015 4.60299194
21 2020 26250 17150 36350 48450 3.9603960 13.2013201 2.8288543 6.60066007
22 2021 27250 17150 38350 45450 3.8095238 0.0000000 5.5020633 -6.19195046
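With a reasonably recent dplyr, `across()` should give the same columns in a single call. A minimal sketch, using only the first three years and two of the columns from the question's data:

```r
library(dplyr)

# First three rows of the question's data, two columns only.
TEST <- data.frame(
  Year    = 2000:2002,
  Revenue = c(8634, 5798, 6022),
  Wages   = c(8834, 5598, 8044)
)

# Percentage change against the previous row (lag() here is dplyr::lag).
fun1 <- function(x) (x - lag(x)) / lag(x) * 100

# Apply fun1 to every column except Year and name the results in one step.
ESTIMATION_0 <- mutate(TEST, across(-Year, fun1, .names = "Nominal_growth_{.col}"))
```

The `.names` template keeps the original columns and adds the new ones next to them, so no separate renaming step is required.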
I have a data frame similar to this:
df1 <- data.frame(c(31,3447,12,1966,39,3275),
c(20,3460,10,1968,30,3284),
c(334,3146,212,1766,338,2976),
c(36,3442,35,1943,47,3267),
c(81,3399,71,1907,112,3202),
c(22,3458,22,1956,42,3272))
colnames(df1) <- c("Site1.C1","Site1.C2","Site2.C1","Site2.C2","Site3.C1","Site3.C2")
df1
Site1.C1 Site1.C2 Site2.C1 Site2.C2 Site3.C1 Site3.C2
1 31 20 334 36 81 22
2 3447 3460 3146 3442 3399 3458
3 12 10 212 35 71 22
4 1966 1968 1766 1943 1907 1956
5 39 30 338 47 112 42
6 3275 3284 2976 3267 3202 3272
I am converting each row into a table and then performing a chisq test.
In order to get specific values from the chisq result (p value, parameter, statistic, expected, etc.), I have to repeat the chisq test several times over (in a very ugly and cumbersome way), using the following code:
df2 <- df1 %>% rowwise() %>% mutate(P=chisq.test(rbind(c(Site1.C1,Site1.C2),c(Site2.C1,Site2.C2),c(Site3.C1,Site3.C2)))$p.value,
df=chisq.test(rbind(c(Site1.C1,Site1.C2),c(Site2.C1,Site2.C2),c(Site3.C1,Site3.C2)))$parameter,
Site1.c1.exp=chisq.test(rbind(c(Site1.C1,Site1.C2),c(Site2.C1,Site2.C2),c(Site3.C1,Site3.C2)))$expected[1,1],
Site1.c2.exp=chisq.test(rbind(c(Site1.C1,Site1.C2),c(Site2.C1,Site2.C2),c(Site3.C1,Site3.C2)))$expected[1,2],
Site2.c1.exp=chisq.test(rbind(c(Site1.C1,Site1.C2),c(Site2.C1,Site2.C2),c(Site3.C1,Site3.C2)))$expected[2,1],
Site2.c2.exp=chisq.test(rbind(c(Site1.C1,Site1.C2),c(Site2.C1,Site2.C2),c(Site3.C1,Site3.C2)))$expected[2,2],
Site3.c1.exp=chisq.test(rbind(c(Site1.C1,Site1.C2),c(Site2.C1,Site2.C2),c(Site3.C1,Site3.C2)))$expected[3,1],
Site3.c2.exp=chisq.test(rbind(c(Site1.C1,Site1.C2),c(Site2.C1,Site2.C2),c(Site3.C1,Site3.C2)))$expected[3,2])
as.data.frame(df2)
Site1.C1 Site1.C2 Site2.C1 Site2.C2 Site3.C1 Site3.C2 P df Site1.c1.exp Site1.c2.exp Site2.c1.exp Site2.c2.exp Site3.c1.exp Site3.c2.exp
1 31 20 334 36 81 22 2.513166e-08 2 43.40840 7.591603 314.9237 55.07634 87.66794 15.33206
2 3447 3460 3146 3442 3399 3458 2.760225e-02 2 3391.05464 3515.945362 3234.4387 3353.56132 3366.50668 3490.49332
3 12 10 212 35 71 22 4.743725e-04 2 17.92818 4.071823 201.2845 45.71547 75.78729 17.21271
4 1966 1968 1766 1943 1907 1956 1.026376e-01 2 1928.02242 2005.977577 1817.7517 1891.24831 1893.22588 1969.77412
5 39 30 338 47 112 42 2.632225e-10 2 55.49507 13.504934 309.6464 75.35362 123.85855 30.14145
6 3275 3284 2976 3267 3202 3272 2.686389e-02 2 3216.55048 3342.449523 3061.5833 3181.41674 3174.86626 3299.13374
Is there a more elegant way to do chisq test just once and capture the result as a tibble in the same row and then extract values on a need-to basis into additional columns?
My data frame has over a million of rows and some additional variables not used with the Chisq test.
Thank you.
With input from @akrun, I was able to get the desired result using the following code:
df2 <- df1 %>% rowwise() %>% mutate(result=list(chisq.test(rbind(c(Site1.C1,Site1.C2),c(Site2.C1,Site2.C2),c(Site3.C1,Site3.C2)))))
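Once each test is stored a single time, the individual values can be pulled out afterwards. A base R sketch of that idea, using the first three rows of the example data; the names `tests`, `P`, `dof` and `expected` are chosen here for illustration.

```r
# First three rows of the example data.
df1 <- data.frame(
  Site1.C1 = c(31, 3447, 12),   Site1.C2 = c(20, 3460, 10),
  Site2.C1 = c(334, 3146, 212), Site2.C2 = c(36, 3442, 35),
  Site3.C1 = c(81, 3399, 71),   Site3.C2 = c(22, 3458, 22)
)

# Run chisq.test() once per row on the 3 x 2 table (sites in rows, C1/C2 in columns).
# R may warn when expected counts are small, as for row 3.
tests <- lapply(seq_len(nrow(df1)), function(i) {
  chisq.test(matrix(unlist(df1[i, ]), ncol = 2, byrow = TRUE))
})

# Extract whatever is needed from the stored results.
P        <- sapply(tests, function(x) x$p.value)
dof      <- sapply(tests, function(x) unname(x$parameter))
expected <- t(sapply(tests, function(x) as.vector(t(x$expected))))
colnames(expected) <- paste0(names(df1), ".exp")
```

For a data frame with over a million rows this is still one test per row, so the main saving is avoiding the eight repeated calls.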
I am looking for reshaping:
ID p2012 p2010 p2008 p2006 c2012 c2010 c2008 c2006
1 1 160 162 163 165 37.3 37.3 37.1 37.1
2 2 163 164 164 163 2.6 2.6 2.6 2.6
into:
ID year p c
1 1 2006 165 37.1
2 1 2008 163 37.1
3 1 2010 162 37.3
4 1 2012 160 37.3
5 2 2006 163 2.6
6 2 2008 164 2.6
7 2 2010 164 2.6
8 2 2012 163 2.6
I am new to R and have been trying around with the melt and dcast functions, but there are just too many twists for me at this stage. Help would be much appreciated!
A dput of my df:
structure(list(ID = 1:2, p2012 = c(160L, 163L), p2010 = c(162L, 164L), p2008 = 163:164, p2006 = c(165L, 163L), c2012 = c(37.3, 2.6), c2010 = c(37.3, 2.6), c2008 = c(37.1, 2.6), c2006 = c(37.1, 2.6)), .Names = c("ID", "p2012", "p2010", "p2008", "p2006", "c2012", "c2010", "c2008", "c2006"), class = "data.frame", row.names = c(NA, -2L))
An alternative to shadow's answer is to use the reshape function:
reshape(df, direction='long', varying=list(2:5, 6:9), v.names=c("p", "c"), idvar="ID", times=c(2012, 2010, 2008, 2006))
This assumes that you know the column indices of the p and c beforehand (or add additional code to figure them out). Furthermore, the times vector above could be found by using something similar to the gsub function of shadow.
Which way to use probably is a matter of taste.
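The column indices and the times vector could be derived from the column names rather than typed in. A sketch, assuming every column is named with a single letter followed by the year and that the p and c columns appear in the same year order:

```r
# The data from the question's dput.
df <- data.frame(
  ID = 1:2,
  p2012 = c(160L, 163L), p2010 = c(162L, 164L), p2008 = 163:164, p2006 = c(165L, 163L),
  c2012 = c(37.3, 2.6),  c2010 = c(37.3, 2.6),  c2008 = c(37.1, 2.6), c2006 = c(37.1, 2.6)
)

# Find the p and c columns by name and read the years off the p columns.
p_cols <- grep("^p[0-9]+$", names(df))
c_cols <- grep("^c[0-9]+$", names(df))
years  <- as.numeric(sub("^p", "", names(df)[p_cols]))

long <- reshape(df, direction = "long",
                varying = list(p_cols, c_cols), v.names = c("p", "c"),
                idvar = "ID", timevar = "year", times = years)
long <- long[order(long$ID, long$year), ]
```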
You probably have to melt the data first, then split the variable and the year and then dcast to your final data.frame.
require(reshape2)
# melt data.frame
dfmelt <- melt(df, id.vars="ID", variable.name="var.year")
# split "var.year" into new variables "var" and "year"
dfmelt[, "var"] <- gsub("[0-9]", "", as.character(dfmelt[, "var.year"]))
dfmelt[, "year"] <- as.numeric(gsub("[a-zA-Z]", "", as.character(dfmelt[, "var.year"])))
# cast to data with column for each var-name
dcast(dfmelt, ID+year~var, value.var="value")
You can also use the following solution from tidyr. You don't actually need to use regular expressions, if "p" or "c" is always the first letter of the column names:
library(tidyr)
library(dplyr) # only loaded for the %>% operator
df %>%
gather(key,value,p2012:c2006) %>%
separate(key,c("category","year"),1) %>%
spread(category,value)
ID year c p
1 1 2006 37.1 165
2 1 2008 37.1 163
3 1 2010 37.3 162
4 1 2012 37.3 160
5 2 2006 2.6 163
6 2 2008 2.6 164
7 2 2010 2.6 164
8 2 2012 2.6 163
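In newer versions of tidyr, `gather()` and `spread()` are superseded, and `pivot_longer()` with the special `".value"` name should do the same reshaping in one step. A sketch, assuming tidyr 1.1.0 or later (for `names_transform`) and the same one-letter-plus-year column naming:

```r
library(tidyr)

# The data from the question's dput.
df <- data.frame(
  ID = 1:2,
  p2012 = c(160L, 163L), p2010 = c(162L, 164L), p2008 = 163:164, p2006 = c(165L, 163L),
  c2012 = c(37.3, 2.6),  c2010 = c(37.3, 2.6),  c2008 = c(37.1, 2.6), c2006 = c(37.1, 2.6)
)

# ".value" turns the leading letter into the output column name (p or c);
# the digits become the year column.
long <- pivot_longer(
  df, cols = -ID,
  names_to = c(".value", "year"),
  names_pattern = "([a-z])([0-9]+)",
  names_transform = list(year = as.integer)
)
long <- long[order(long$ID, long$year), ]
```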