Histogram with density curves of other data in R

I want to recreate a histogram with density curves overlaid, but I have trouble fitting the density curves to the plot.
MWE (of what I achieved so far; the data in the tibble are just sample data):
library(tidyverse)
tibble(home = sample(1:10, 90, replace = TRUE), away = sample(1:10, 90, replace = TRUE)) %>%
  gather(key = Type, value = Value) %>%
  ggplot(aes(x = Value, fill = Type)) +
  geom_histogram(position = "dodge")
UPDATE after the answer by @Kota Mori
I adjusted the answer given by Kota Mori to get the following, which results in an error. Before I start, let's have a look at the datasets I want to use for the graph:
#Both Goals variables of this dataframe should be used for the histogram
actual
# A tibble: 90 x 7
season matchday club_name_home club_name_away goals_team_home goals_team_away sumgoals
<dbl> <dbl> <chr> <chr> <dbl> <dbl> <dbl>
1 1819 21 ETuS Haltern TuS 05 Sinsen II 2 2 4
2 1819 21 VfL Ramsdorf Westfalia Gemen II 2 0 2
3 1819 21 FC RW Dorsten SV Altendorf-Ulfkotte 8 4 12
4 1819 21 SuS Hervest-Dorsten 1. SC BW Wulfen 0 0 0
5 1819 21 SV Lembeck SC Reken II 1 1 2
6 1819 21 RC Borken-Hoxfeld TSV Raesfeld 3 1 4
7 1819 21 TuS Velen Fenerbahce I. Marl 5 2 7
8 1819 21 BVH Dorsten SC Marl-Hamm 2 0 2
9 1819 21 1. SC BW Wulfen FC RW Dorsten 3 0 3
10 1819 21 BVH Dorsten SV Altendorf-Ulfkotte 2 0 2
# ... with 80 more rows
#Both Goals variables of this dataframe should be used for the density lines
poisson
# A tibble: 90 x 6
season matchday club_name_home club_name_away Goals_team_home Goals_team_away
<dbl> <dbl> <chr> <chr> <dbl> <dbl>
1 1819 21 ETuS Haltern TuS 05 Sinsen II 2 2
2 1819 21 VfL Ramsdorf Westfalia Gemen II 3 0
3 1819 21 FC RW Dorsten SV Altendorf-Ulfkotte 2 0
4 1819 21 SuS Hervest-Dorsten 1. SC BW Wulfen 0 4
5 1819 21 SV Lembeck SC Reken II 2 1
6 1819 21 RC Borken-Hoxfeld TSV Raesfeld 2 1
7 1819 21 TuS Velen Fenerbahce I. Marl 2 1
8 1819 21 BVH Dorsten SC Marl-Hamm 3 1
9 1819 21 1. SC BW Wulfen FC RW Dorsten 2 0
10 1819 21 BVH Dorsten SV Altendorf-Ulfkotte 2 1
# ... with 80 more rows
So I adjusted the answer by Kota Mori and ended up with the following code:
sim_years <- 1819
actual <- read_rds(here::here(paste0("/data/database_match_results_", sim_years, ".rds"))) %>%
  filter(between(matchday, 21, max(database_season$matchday)))
poisson <- missinggames
data <- rbind(data.frame(type = "home", value = actual$goals_team_home, stringsAsFactors = FALSE),
              data.frame(type = "away", value = actual$goals_team_away, stringsAsFactors = FALSE))
estimate <- group_by(poisson %>% select(Goals_team_home, Goals_team_away), type) %>%
  summarize(mu = mean(value))
dens <- expand.grid(value = 0:max(data$value), type = c("away", "home"),
                    stringsAsFactors = FALSE) %>%
  inner_join(estimate) %>%
  mutate(density = dpois(value, mu))
prop <- group_by(data, type, value) %>% summarize(count = n()) %>%
  group_by(type) %>% mutate(prop = count / sum(count))
tmp_actual <- left_join(dens, prop) %>% replace_na(list(prop = 0, count = 0))
ggplot(tmp_actual, aes(x = value, weight = prop, fill = type)) +
  geom_bar(position = "dodge") +
  geom_line(aes(value, density, color = type, weight = NULL))
This results in the following error: 'Error: Mapping should be created with aes() or aes_().'

I think you need to calculate the Poisson parameters on your own, which turns out to be as easy as calculating the sample mean for each type.
The following code generates a graph similar to the example.
library(dplyr)
library(ggplot2)
data <- rbind(data.frame(type = "home", value = rpois(90, 2.5), stringsAsFactors = FALSE),
              data.frame(type = "away", value = rpois(90, 1.5), stringsAsFactors = FALSE))
estimate <- group_by(data, type) %>% summarize(mu = mean(value))
dens <- expand.grid(value = 0:max(data$value), type = c("away", "home"),
                    stringsAsFactors = FALSE) %>%
  inner_join(estimate) %>%
  mutate(density = dpois(value, mu))
prop <- group_by(data, type, value) %>% summarize(count = n()) %>%
  group_by(type) %>% mutate(prop = count / sum(count))
tmp <- left_join(dens, prop) %>% replace_na(list(prop = 0, count = 0))
ggplot(tmp, aes(x = value, weight = prop, fill = type)) +
  geom_bar(position = "dodge") +
  geom_line(aes(value, density, color = type, weight = NULL))
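For the asker's actual and poisson tibbles, here is a minimal sketch of the same approach, assuming only the column names visible in the printouts above. The key change is reshaping both data frames into the long type/value format first, so that group_by(type) has a type column to work with; the adjusted code above never creates that column in poisson.
library(dplyr)
library(tidyr)
library(ggplot2)
# Sketch: assumes `actual` and `poisson` as printed above
# Observed goals -> long format with a type column ("home"/"away")
data <- actual %>%
  select(home = goals_team_home, away = goals_team_away) %>%
  pivot_longer(everything(), names_to = "type", values_to = "value")
# Poisson means estimated from the simulated goals in `poisson`
estimate <- poisson %>%
  select(home = Goals_team_home, away = Goals_team_away) %>%
  pivot_longer(everything(), names_to = "type", values_to = "value") %>%
  group_by(type) %>%
  summarize(mu = mean(value))
dens <- expand.grid(value = 0:max(data$value), type = c("away", "home"),
                    stringsAsFactors = FALSE) %>%
  inner_join(estimate, by = "type") %>%
  mutate(density = dpois(value, mu))
prop <- group_by(data, type, value) %>% summarize(count = n()) %>%
  group_by(type) %>% mutate(prop = count / sum(count))
tmp <- left_join(dens, prop, by = c("type", "value")) %>%
  replace_na(list(prop = 0, count = 0))
ggplot(tmp, aes(x = value, weight = prop, fill = type)) +
  geom_bar(position = "dodge") +
  geom_line(aes(value, density, color = type, weight = NULL))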

Related

Keeping the max within a group constant using base::cumsum

Use the data below to make the cumsum_a column look like the should column.
Data to start with:
> demo
th seq group
1 20.1 1 10
2 24.1 2 10
3 26.1 3 10
4 1.1 1 20
5 2.1 2 20
6 4.1 3 20
The "should" column below is the goal.
demo <- data.frame(th = c(20.1, 24.1, 26.1, 1.1, 2.1, 4.1),
                   seq = c(1:3, 1:3),
                   group = c(rep(10, 3), rep(20, 3)))
library(magrittr)
library(dplyr)
demo %>%
  group_by(group) %>%
  mutate(cumsum_a = cumsum(group^seq * (th / cummax(th)))) %>%
  ungroup() %>%
  mutate(
    cumsum_m = c( # as an example only, this manually does exactly what cumsum_a is doing (which is wrong)
      10^1 * 20.1/20.1,                                        # good
      10^1 * 20.1/20.1 + 10^2 * 24.1/24.1,                     # different denominators, bad
      10^1 * 20.1/20.1 + 10^2 * 24.1/24.1 + 10^3 * 26.1/26.1,  # different denominators, bad
      20^1 * 1.1/1.1,                                          # good
      20^1 * 1.1/1.1 + 20^2 * 2.1/2.1,                         # different denominators, bad
      20^1 * 1.1/1.1 + 20^2 * 2.1/2.1 + 20^3 * 4.1/4.1         # different denominators, bad
    ),
    should = c( # this is exactly the kind of calculation I want
      10^1 * 20.1/20.1,                                        # good
      10^1 * 20.1/24.1 + 10^2 * 24.1/24.1,                     # good
      10^1 * 20.1/26.1 + 10^2 * 24.1/26.1 + 10^3 * 26.1/26.1,  # good
      20^1 * 1.1/1.1,                                          # good
      20^1 * 1.1/2.1 + 20^2 * 2.1/2.1,                         # good
      20^1 * 1.1/4.1 + 20^2 * 2.1/4.1 + 20^3 * 4.1/4.1         # good
    )
  )
Most simply put, the denominators need to be the same within each row: 24.1 and 24.1 instead of 20.1 and 24.1 on the second row of cumsum_m, and likewise in the underlying calculations for cumsum_a.
Here are the new columns, where should is what cumsum_a or cumsum_m should be.
th seq group cumsum_a cumsum_m should
<dbl> <int> <dbl> <dbl> <dbl> <dbl>
1 20.1 1 10 10 10 10
2 24.1 2 10 110 110 108.
3 26.1 3 10 1110 1110 1100.
4 1.1 1 20 20 20 20
5 2.1 2 20 420 420 410.
6 4.1 3 20 8420 8420 8210.
You can use the following solution:
purrr::accumulate takes a two-argument function: the first argument, represented by .x or ..1, is the value accumulated over the previous iterations, and .y represents the current element of our vector (2:n()). So the first accumulated value is the first element of group, since I supplied it as the .init argument.
Since you would like to change the denominators of the previous iterations/calculations, I multiply the previous result .x by the ratio of the previous value of cmax to the current value of cmax.
I think the rest is pretty clear, but if you have any more questions about it just let me know.
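As a tiny, self-contained illustration of that two-argument convention (the numbers here are made up, not from the question):
library(purrr)
# .x is the running value, .y the current element; .init seeds .x
accumulate(2:4, .init = 10, ~ .x * 2 + .y)
#> [1] 10 22 47 98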
library(dplyr)
library(purrr)
demo %>%
  group_by(group) %>%
  mutate(cmax = cummax(th),
         should = accumulate(2:n(), .init = group[1],
                             ~ (.x * cmax[.y - 1] / cmax[.y]) +
                               (group[.y]^seq[.y]) * (th[.y] / cmax[.y])))
# A tibble: 6 x 5
# Groups: group [2]
th seq group cmax should
<dbl> <int> <dbl> <dbl> <dbl>
1 20.1 1 10 20.1 10
2 24.1 2 10 24.1 108.
3 26.1 3 10 26.1 1100.
4 1.1 1 20 1.1 20
5 2.1 2 20 2.1 410.
6 4.1 3 20 4.1 8210.
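Note that cmax equals th here because th is increasing within each group; the ratio cmax[.y - 1] / cmax[.y] rescales every earlier term from the old running maximum to the new one, which is exactly the denominator change the should column calls for.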

Possible to avoid a FOR loop in this very simple R code?

The answers below are very helpful, but I oversimplified my original question. I figured I would learn more if I oversimplified and then adapted to my actual need, but now I am stuck. There are other factors that drive the amortization. I like the response using "amort$end_bal <- begin_bal * (1 - mpr)^amort$period" and "amort$pmt <- c(0, diff(amort$end_bal)) * -1", but in addition npr increases the ending balances and ch_off decreases them. Here's the more complete code:
n_periods <- 8
begin_bal <- 10000
yld <- .20
npr <- .09
mpr <- .10
co <- .10
period <- seq(0, n_periods, 1)
fin <- 0
pur <- 0
pmt <- 0
ch_off <- 0
end_bal <- begin_bal
for (i in 1:n_periods) {
  fin[i + 1] <- end_bal[i] * yld / 12
  pur[i + 1] <- end_bal[i] * npr
  pmt[i + 1] <- end_bal[i] * mpr
  ch_off[i + 1] <- end_bal[i] * co / 12
  end_bal[i + 1] <- end_bal[i] + pur[i + 1] - pmt[i + 1] - ch_off[i + 1]
}
amort <- data.frame(period, fin, pur, pmt, ch_off, end_bal)
Which gives the below correct output:
print(amort,row.names=FALSE)
period fin pur pmt ch_off end_bal
0 0.0000 0.0000 0.0000 0.00000 10000.000
1 166.6667 900.0000 1000.0000 83.33333 9816.667
2 163.6111 883.5000 981.6667 81.80556 9636.694
3 160.6116 867.3025 963.6694 80.30579 9460.022
4 157.6670 851.4020 946.0022 78.83351 9286.588
5 154.7765 835.7929 928.6588 77.38823 9116.334
6 151.9389 820.4700 911.6334 75.96945 8949.201
7 149.1534 805.4281 894.9201 74.57668 8785.132
8 146.4189 790.6619 878.5132 73.20944 8624.072
I'm new to R, and I understand one of its features is matrix/vector manipulation. In the example below I amortize an asset over 8 months, where each payment ("pmt") is 10% ("mpr") of the prior period's balance ("end_bal"). The code works fine using a FOR loop, but I understand FOR loops can be slow in large models and that a better approach is to use R's abundant vector/matrix functions. I didn't know how to do that here, since each monthly payment is calculated by referencing the prior period's ending balance.
So my questions are:
Is there a more efficient way to do the below?
How do I replace the 0 for pmt in period 0 with an empty space?
R code:
n_periods <- 8
begin_bal <- 100
mpr <- .10
# Example loan amortization
pmt <- 0
end_bal <- begin_bal
for (i in 1:n_periods) {
  pmt[i + 1] <- end_bal[i] * mpr
  end_bal[i + 1] <- end_bal[i] - pmt[i + 1]
}
amort <- data.frame(period = 0:n_periods, pmt, end_bal)
amort
Results, which are correct:
> amort
period pmt end_bal
1 0 0.000000 100.00000
2 1 10.000000 90.00000
3 2 9.000000 81.00000
4 3 8.100000 72.90000
5 4 7.290000 65.61000
6 5 6.561000 59.04900
7 6 5.904900 53.14410
8 7 5.314410 47.82969
9 8 4.782969 43.04672
Use R's vectorised calculations
n_periods <- 8
begin_bal <- 100
mpr <- .10
amort <- data.frame(period = seq(0, n_periods, 1))
amort$end_bal <- begin_bal * (1 - mpr)^amort$period
amort$pmt <- c(0, diff(amort$end_bal)) * -1
amort
#> period end_bal pmt
#> 1 0 100.00000 0.000000
#> 2 1 90.00000 10.000000
#> 3 2 81.00000 9.000000
#> 4 3 72.90000 8.100000
#> 5 4 65.61000 7.290000
#> 6 5 59.04900 6.561000
#> 7 6 53.14410 5.904900
#> 8 7 47.82969 5.314410
#> 9 8 43.04672 4.782969
Created on 2021-05-12 by the reprex package (v2.0.0)
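On the side question of showing the period-0 payment as an empty space: a numeric column can only display NA, not a blank, so one option (a sketch, for display only) is to convert pmt to character:
amort$pmt[1] <- NA                  # numeric columns print NA, not a blank
amort$pmt <- ifelse(is.na(amort$pmt), "", format(amort$pmt))
print(amort, row.names = FALSE)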
A dplyr way for a different case, say:
n_periods <- 15
begin_bal <- 1000
mpr <- .07
library(dplyr)
seq(0, n_periods, 1) %>%
  as.data.frame() %>%
  setNames('period') %>%
  mutate(end_bal = begin_bal * (1 - mpr)^period,
         pmt = -1 * c(0, diff(end_bal)))
#> period end_bal pmt
#> 1 0 1000.0000 0.00000
#> 2 1 930.0000 70.00000
#> 3 2 864.9000 65.10000
#> 4 3 804.3570 60.54300
#> 5 4 748.0520 56.30499
#> 6 5 695.6884 52.36364
#> 7 6 646.9902 48.69819
#> 8 7 601.7009 45.28931
#> 9 8 559.5818 42.11906
#> 10 9 520.4111 39.17073
#> 11 10 483.9823 36.42878
#> 12 11 450.1035 33.87876
#> 13 12 418.5963 31.50725
#> 14 13 389.2946 29.30174
#> 15 14 362.0439 27.25062
#> 16 15 336.7009 25.34308
Created on 2021-05-12 by the reprex package (v2.0.0)
Though the OP has posed another question in the edited scenario, here's the suggested approach (for future reference):
n_periods <- 8
begin_bal <- 10000
yld <- .20
npr <- .09
mpr <- .10
co <- .10
library(dplyr)
seq(0, n_periods, 1) %>%
  as.data.frame() %>%
  setNames('period') %>%
  mutate(end_bal = begin_bal * (1 - (mpr + co/12 - npr))^period,
         fin = c(0, (end_bal * yld/12)[-nrow(.)]),
         pur = c(0, (end_bal * npr)[-nrow(.)]),
         pmt = c(0, (end_bal * mpr)[-nrow(.)]),
         ch_off = c(0, (end_bal * co/12)[-nrow(.)]))
#> period end_bal fin pur pmt ch_off
#> 1 0 10000.000 0.0000 0.0000 0.0000 0.00000
#> 2 1 9816.667 166.6667 900.0000 1000.0000 83.33333
#> 3 2 9636.694 163.6111 883.5000 981.6667 81.80556
#> 4 3 9460.022 160.6116 867.3025 963.6694 80.30579
#> 5 4 9286.588 157.6670 851.4020 946.0022 78.83351
#> 6 5 9116.334 154.7765 835.7929 928.6588 77.38823
#> 7 6 8949.201 151.9389 820.4700 911.6334 75.96945
#> 8 7 8785.132 149.1534 805.4281 894.9201 74.57668
#> 9 8 8624.072 146.4189 790.6619 878.5132 73.20944
Created on 2021-05-13 by the reprex package (v2.0.0)
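This closed form works because pur, pmt, and ch_off are all proportional to the prior balance, so the balance follows a single geometric recursion with net factor 1 - (mpr + co/12 - npr); fin is computed from the balance but never feeds back into it, which is why it can be filled in afterwards from the lagged balances.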
If you are "lazy" (don't want to formulate the general expressions for pmt and end_bal), you can define a recursive function f like below
f <- function(k) {
  if (k == 1) {
    return(data.frame(pmt = 100 * mpr, end_bal = 100))
  }
  u <- f(k - 1)
  end_bal <- with(tail(u, 1), end_bal - pmt)
  pmt <- mpr * end_bal
  rbind(u, data.frame(pmt, end_bal))
}
n_periods <- 8
res <- transform(
  cbind(period = 0:n_periods, f(n_periods + 1)),
  pmt = c(0, head(pmt, -1))
)
and you will see
> res
period pmt end_bal
1 0 0.000000 100.00000
2 1 10.000000 90.00000
3 2 9.000000 81.00000
4 3 8.100000 72.90000
5 4 7.290000 65.61000
6 5 6.561000 59.04900
7 6 5.904900 53.14410
8 7 5.314410 47.82969
9 8 4.782969 43.04672
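For much deeper schedules, the same recurrence can also be run without recursion. A minimal sketch using Reduce in accumulate mode (same begin_bal and mpr as in the question):
n_periods <- 8
begin_bal <- 100
mpr <- .10
# carry the balance forward one period at a time, keeping all intermediates
end_bal <- Reduce(function(b, i) b * (1 - mpr), seq_len(n_periods),
                  init = begin_bal, accumulate = TRUE)
pmt <- c(0, -diff(end_bal))
data.frame(period = 0:n_periods, pmt, end_bal)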

Sum certain rows given 2 constraints in R

I am trying to write a conditional statement with the following constraints. Below is an example data frame showing the problem I am running into.
Row <- c(1,2,3,4,5,6,7)
La <- c(51.25,51.25,51.75,53.25,53.25,54.25,54.25)
Lo <- c(128.25,127.75,127.25,119.75,119.25,118.75,118.25)
Y <- c(5,10,2,4,5,7,9)
Cl <- c("EF","EF","EF","EF",NA,NA,"CE")
d <- data.frame(Row,La,Lo,Y,Cl)
Row La Lo Y Cl
1 1 51.25 128.25 5 EF
2 2 51.25 127.75 10 EF
3 3 51.75 127.25 2 EF
4 4 53.25 119.75 4 EF
5 5 53.25 119.25 5 NA
6 6 54.25 118.75 7 NA
7 7 54.25 118.25 9 CE
If "Cl" is NA, I would like to add that row's "Y" value to the row whose "La" and "Lo" values are closest (within 1.00 or less), and then drop the NA row. In effect, I want to remove the NA rows from the data frame without losing their "Y" values, which instead get added to their closest neighbour.
I would like the return data frame to look like this:
Row2 <- c(1,2,3,4,7)
La2 <- c(51.25,51.25,51.75,53.25,54.25)
Lo2 <- c(128.25,127.75,127.25,119.75,118.25)
Y2 <- c(5,10,2,9,16)
Cl2 <- c("EF","EF","EF","EF","CE")
d2 <- data.frame(Row2,La2,Lo2,Y2,Cl2)
Row2 La2 Lo2 Y2 Cl2
1 1 51.25 128.25 5 EF
2 2 51.25 127.75 10 EF
3 3 51.75 127.25 2 EF
4 4 53.25 119.75 9 EF
5 7 54.25 118.25 16 CE
Recent edit: if an NA row is closest to one row in terms of the Lo value and equally close to another row in the La value, join by the La value. If there are two rows equally close in both Lo and La values, join by the smaller La value.
Thank you for the help!
Here is a method you can use if you can build a distance matrix m between all the (La, Lo) rows in your data. I use the output of dist, which is Euclidean distance. The row with the lowest distance is selected, or the earliest such row if the lowest distance is shared by more than one row.
w <- which(is.na(d$Cl))                  # rows to be merged away
m <- as.matrix(dist(d[c('La', 'Lo')]))   # pairwise euclidean distances
m[row(m) %in% w] <- NA                   # NA rows cannot be merge targets
d$g <- replace(seq(nrow(d)), w, apply(m[, w], 2, which.min))  # nearest target per NA row
library(dplyr)
d %>%
  group_by(g) %>%
  summarise(La = La[!is.na(Cl)],
            Lo = Lo[!is.na(Cl)],
            Y = sum(Y),
            Cl = Cl[!is.na(Cl)]) %>%
  select(-g)
# # A tibble: 5 x 4
# La Lo Y Cl
# <dbl> <dbl> <dbl> <fct>
# 1 51.2 128. 5 EF
# 2 51.2 128. 10 EF
# 3 51.8 127. 2 EF
# 4 53.2 120. 9 EF
# 5 54.2 118. 16 CE

Developing a row extraction rule

I want to develop a rule to extract certain rows from a matrix. I set up the example as follows:
mat1 <- data.frame(matrix(nrow = 508, ncol = 5))
mat1[1:20, 1] <- rep(1, 20)
mat1[1:20, 2:5] <- rnorm(20 * 4, 0, 1)
mat2 <- data.frame(matrix(nrow = 508, ncol = 5))
seq1 <- seq(1, 3, 1)
mat2[1:27, 1] <- rep(seq1, 9)
mat2[1:27, 2:5] <- rnorm(27 * 4, 0, 1)
mat3 <- data.frame(matrix(nrow = 508, ncol = 5))
mat3[1:32, 1] <- rep(seq(1, 4, 1), 8)
mat3[1:32, 2:5] <- rnorm(32 * 4, 0, 1)
colnames(mat1) <- colnames(mat2) <- colnames(mat3) <- c("Cohort Number", "Alpha(t-1)", "date1", "date2", "date3")
mat.list <- list(mat1, mat2, mat3)
Example matrix
Cohort Number Alpha(t-1) date1 date2 date3
1 1 -1.76745451 -1.3227308 2.7099501 -0.13797329
2 1 -0.72651808 -0.8714317 1.3200554 0.76964663
3 1 -0.50325892 0.0742336 -0.6460628 0.30148135
4 1 0.79592650 0.1353875 -0.5694022 -0.59019913
5 1 1.94064961 0.2255595 0.3156252 -0.90996475
6 1 0.27134932 0.3966957 -1.9198976 0.23998928
7 1 -1.13272507 -0.8603225 -1.2042036 0.06609958
8 1 -2.12392748 1.0905405 -0.3788234 0.92850110
9 1 0.22038996 0.4500683 -1.4617004 0.58498275
10 1 0.26348734 -0.8340913 1.2631368 -1.48490518
11 1 0.26931077 -0.5230622 -0.6615288 1.45668453
12 1 -2.03067695 -0.6432484 0.4801026 0.01808834
13 1 1.25915656 -0.1116544 -0.3004298 -1.04072722
14 1 -2.27894271 -2.1058424 -0.3351053 -1.04132045
15 1 0.47742052 2.1564274 -0.4733351 -0.53152019
16 1 -1.57680089 -0.1340645 -0.3134633 0.53223567
17 1 0.25245813 -0.8243152 0.5998211 -1.01892301
18 1 0.18391447 -1.3500645 1.6059798 1.43359399
19 1 -0.09602031 1.4921338 -0.6455687 0.66385823
20 1 -0.13613759 2.2474816 0.7311762 -2.46849071
mat2[1:27,]
Cohort Number Alpha(t-1) date1 date2 date3
1 1 -0.76033920 1.317636591 -0.09684526 -0.08796725
2 2 0.05123185 -0.731591674 -0.37247406 0.04470346
3 3 -0.78460201 0.890336570 1.26737475 -0.39062992
4 1 -0.14111920 1.255008475 -0.32799815 -0.77277716
5 2 -0.46044451 1.175157970 0.82187906 0.54326905
6 3 -0.46804365 0.704203273 -2.04539007 -1.74782065
7 1 0.42009824 0.488807461 3.21093186 -0.13745029
8 2 1.27083389 -1.316989452 0.43565921 0.07870330
9 3 -0.16581119 1.872955624 -0.22399155 -0.79334562
10 1 -1.33436656 0.589311311 -1.03871415 -1.06221057
11 2 1.56584985 0.020699064 0.45691456 0.15858065
12 3 1.07756426 -0.045200151 0.05124461 -1.86633279
13 1 -1.01264994 -0.229406681 1.24954420 0.88846407
14 2 -0.09950713 -0.515798138 1.62560454 -0.20191909
15 3 -0.28319479 0.450854419 1.42963386 -1.11964154
16 1 0.51771608 -1.407248379 0.62626313 0.97775246
17 2 -0.43951262 -0.368739441 0.66564013 -0.79980882
18 3 -0.15865277 -0.231475146 0.37582330 0.93685867
19 1 -0.57758129 0.235550070 0.42480442 -0.14379249
20 2 -0.81726414 -1.207593079 -0.30000514 0.68967230
21 3 -0.72926703 -0.458849409 1.51162785 1.40921409
22 1 -0.32220454 0.334996561 1.26073381 -2.03405958
23 2 -0.51450039 -0.305634241 1.51021957 0.39775430
24 3 1.15476297 -1.040126709 -0.36192432 -0.37346894
25 1 -0.88053587 -0.006829769 -0.89855797 -0.39840858
26 2 -0.64435448 0.209561006 -0.13986834 -0.61308957
27 3 1.22492942 0.812693992 -1.32371617 -1.21852365
and
> mat3[1:32,]
Cohort Number Alpha(t-1) date1 date2 date3
1 1 -0.7657871 -0.35390862 -0.23539987 -1.8365309
2 2 -0.6631690 1.36450837 0.78403072 -0.8344993
3 3 -1.0134022 -0.28380021 0.72149463 -0.7890273
4 4 2.6419455 0.26998803 2.03606725 0.8099134
5 1 -0.1383910 0.90845134 1.09273919 0.4651443
6 2 -0.7549340 -0.23185551 2.21119705 -0.1386960
7 3 0.7296121 -1.09145187 -1.18092505 0.1510642
8 4 -0.5583415 0.71988405 0.09454476 -0.8661514
9 1 -0.2420894 -0.03215026 -2.51249946 1.1659027
10 2 -0.6434337 -0.13910557 -1.10373674 1.2377968
11 3 -0.6297123 2.09797419 0.87128407 -0.1351845
12 4 0.6674166 0.48707847 0.36373509 1.0680623
13 1 0.6254708 -0.61311671 0.82542494 1.7320687
14 2 -2.4704173 0.98460064 -1.10416042 2.9627952
15 3 -0.2544887 0.63177246 -0.39138717 1.6942072
16 4 -0.9807623 1.11882794 -0.47669974 1.2383798
17 1 -0.6900549 1.68086482 -0.01405476 -1.3099288
18 2 1.4510505 -0.04752782 1.49735258 0.2963673
19 3 -1.1355194 -1.76263532 -1.49318214 1.3524114
20 4 0.7168833 -0.76833639 0.60752304 -1.0647885
21 1 2.0004745 2.13931057 -1.35036048 -0.7694501
22 2 2.0985591 0.01569677 0.33975952 -1.4979973
23 3 0.1703261 -1.47625208 -1.13228671 0.5686501
24 4 0.2632233 -0.55672667 0.33428217 0.5341078
25 1 -0.2741324 -1.61301237 0.78861248 0.4982554
26 2 -0.8793897 -1.07266362 -0.78158128 0.9127354
27 3 0.3920579 -0.59869834 -0.76775259 1.8137107
28 4 -1.4088488 -0.54954542 0.32421016 0.7284813
29 1 -1.2421837 0.50599077 1.62464999 0.6801672
30 2 -2.8980422 0.42197236 0.45243582 1.4939070
31 3 0.3965108 -1.35877353 1.52230797 -1.6552039
32 4 0.8112229 0.51970084 0.30830797 -2.0563928
What I want to do:
For every matrix in mat.list I want to extract 6 rows of data, according to certain criteria, and place these rows as a data.frame in a list labelled Output1. I want to store all remaining rows as a data.frame in Output2.
The process:
1) Group data by cohort number.
2a. If there is 1 group (Cohort Number can only be 1), move to column 2 and extract the 6 rows with the highest value of "Alpha(t-1)". Store these rows as a data.frame in a list named "Output1". Store all remaining rows as a data.frame in a list named "Output2".
2b. If there are 2 groups (Cohort Number can be 1 or 2), move to column 2 and extract the 3 rows with the largest "Alpha(t-1)" for Cohort Number == 1 and the 3 rows with the largest "Alpha(t-1)" for Cohort Number == 2. Place the 6 extracted rows as a data.frame in "Output1" and all remaining rows as a data.frame in "Output2".
2c. If there are 3 groups (Cohort Number can be 1, 2, or 3), move to column 2 and extract the 2 rows with the largest "Alpha(t-1)" for each of Cohort Number == 1, 2, and 3. Store these 6 rows in "Output1" and the remainder in "Output2".
2d. If there are 4 groups (Cohort Number can be 1, 2, 3, or 4), move to column 2. Extract the 2 rows with the largest "Alpha(t-1)" for Cohort Number == 1, the 2 rows with the largest "Alpha(t-1)" for Cohort Number == 2, the 1 row with the largest "Alpha(t-1)" for Cohort Number == 3, and the 1 row with the largest "Alpha(t-1)" for Cohort Number == 4. Store the 6 key rows as a data.frame in "Output1" and all remaining rows as a data.frame in "Output2".
Desired Output:
Output1 <- list()
Output2 <- list()
Output1[[1]] <- mat1 %>% group_by(`Cohort Number`) %>% top_n(6, `Alpha(t-1)`)
Output1[[2]] <- mat2 %>% group_by(`Cohort Number`) %>% top_n(2, `Alpha(t-1)`)
> Output1[[1]]
# A tibble: 6 x 5
# Groups: Cohort Number [1]
`Cohort Number` `Alpha(t-1)` date1 date2 date3
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 0.796 0.135 -0.569 -0.590
2 1 1.94 0.226 0.316 -0.910
3 1 0.271 0.397 -1.92 0.240
4 1 0.269 -0.523 -0.662 1.46
5 1 1.26 -0.112 -0.300 -1.04
6 1 0.477 2.16 -0.473 -0.532
> Output1[[2]]
# A tibble: 6 x 5
# Groups: Cohort Number [3]
`Cohort Number` `Alpha(t-1)` date1 date2 date3
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 0.420 0.489 3.21 -0.137
2 2 1.27 -1.32 0.436 0.0787
3 2 1.57 0.0207 0.457 0.159
4 1 0.518 -1.41 0.626 0.978
5 3 1.15 -1.04 -0.362 -0.373
6 3 1.22 0.813 -1.32 -1.22
Overall I need a function to do this because I have over 1000 matrices in my actual application and can't do this manually.
We can count the number of distinct values in Cohort Number and, based on that, select the value of n in top_n. When there are more than 3 distinct values, we create a vector of the number of rows to select in top_n for each Cohort Number.
library(tidyverse)
output1 <- map(mat.list, function(x) {
  dist <- n_distinct(x$`Cohort Number`, na.rm = TRUE)
  if (dist <= 3)
    x %>%
      group_by(`Cohort Number`) %>%
      top_n(6 / dist, `Alpha(t-1)`)
  else
    map2_df(list(2, 2, 1, 1), x %>% na.omit() %>% group_split(`Cohort Number`),
            ~ .y %>% top_n(.x, `Alpha(t-1)`))
})
and for output2, we use map2 with anti_join:
output2 <- map2(mat.list, output1, anti_join)
Confirming the output
map_dbl(output1, nrow)
#[1] 6 6 6
map_dbl(output2, nrow)
#[1] 502 502 502
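A side note, not part of the original answer: top_n() is superseded in dplyr >= 1.0.0, and slice_max() is the modern equivalent. Inside the map() call above, the dist <= 3 branch could instead read (with_ties = FALSE guarantees exactly n rows per group):
# same x and dist as in the answer's function above
x %>%
  group_by(`Cohort Number`) %>%
  slice_max(`Alpha(t-1)`, n = 6 / dist, with_ties = FALSE)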

calc mean/std/ci from column

Is there a package to easily calculate, for each specific value of n, the mean/std/CI?
For example, starting with the data:
> n = c(0,0,0,0,0,0,0,2,2,2,2,5,5,5,5,8,8,8,8)
> s = c(43,23,65,43,12,54,43,12,2,43,62,25,55,75,95,28,48,68,18)
> df = data.frame(n, s)
> df
n s
1 0 43
2 0 23
3 0 65
4 0 43
5 0 12
6 0 54
7 0 43
8 2 12
9 2 2
10 2 43
11 2 62
12 5 25
13 5 55
14 5 75
15 5 95
16 8 28
17 8 48
18 8 68
19 8 18
resulting as:
data
n mean std ci
0 40 .. ..
2 30 .. ..
5 63 .. ..
8 41 .. ..
dplyr is nice, but not necessary. In base R:
## df() is built-in in R, so avoid df as a name ...
dd <- data.frame(n = rep(c(0, 2, 5, 8), c(7, 4, 4, 4)),
                 s = c(43, 23, 65, 43, 12, 54, 43, 12, 2, 43,
                       62, 25, 55, 75, 95, 28, 48, 68, 18))
sumfun <- function(x) {
  m <- mean(x)
  s <- sd(x)
  se <- s / sqrt(length(x))
  c(mean = m, sd = s, lwr = m - 1.96 * se, upr = m + 1.96 * se)
}
(or see smean.cl.normal(), smean.cl.boot(), etc. from the Hmisc package ...)
res <- do.call(rbind, tapply(dd$s, dd$n, sumfun))
res <- cbind(n = unique(dd$n), as.data.frame(res))
Or, as pointed out by @thelatemail:
res <- do.call(data.frame, aggregate(s ~ n, data = dd, FUN = sumfun))
You can easily package this into a function if you're going to use it on a regular basis.
For larger data sets/more complex transformations you can search SO for answers comparing solutions from the dplyr, plyr, data.table, doBy packages as well as base-R solutions using combinations of tapply(), ave(), aggregate(), by() ...
You can use the dplyr package.
Here's a code snippet. Note, I'm assuming you want to build the confidence interval using the standard normal approximation at the 95% level but you can make whatever choice you like.
n = c(0,0,0,0,0,0,0,2,2,2,2,5,5,5,5,8,8,8,8)
s = c(43,23,65,43,12,54,43,12,2,43,62,25,55,75,95,28,48,68,18)
df = data.frame(n, s)
library(dplyr)
df %>%
  group_by(n) %>%
  summarise(mean = mean(s),
            std = sqrt(var(s)),
            lower = mean(s) - qnorm(.975) * std / sqrt(n()),
            upper = mean(s) + qnorm(.975) * std / sqrt(n()))
Source: local data frame [4 x 5]
n mean std lower upper
1 0 40.42857 17.88721 27.177782 53.67936
2 2 29.75000 27.69326 2.611104 56.88890
3 5 62.50000 29.86079 33.236965 91.76303
4 8 40.50000 22.17356 18.770313 62.22969
Thanks for the advice everyone, I have taken a look into plyr and solved it:
n = c(0,0,0,0,0,0,0,2,2,2,2,5,5,5,5,8,8,8,8)
s = c(43,23,65,43,12,54,43,12,2,43,62,25,55,75,95,28,48,68,18)
dd = data.frame(n, s)
library(plyr)
data <- ddply(dd, .(n), function(dd) c(mean = mean(dd$s),
                                       std = sd(dd$s),
                                       se = sd(dd$s) / sqrt(length(dd$s)),
                                       lower = mean(dd$s) - qnorm(.975) * sd(dd$s) / sqrt(length(dd$s)),
                                       upper = mean(dd$s) + qnorm(.975) * sd(dd$s) / sqrt(length(dd$s))))
resulting as:
data
n mean std se lower upper
1 0 40.42857 17.88721 6.760731 27.177782 53.67936
2 2 29.75000 27.69326 13.846630 2.611104 56.88890
3 5 62.50000 29.86079 14.930394 33.236965 91.76303
4 8 40.50000 22.17356 11.086779 18.770313 62.22969
Will avoid df() in the future, thanks
Update: tidyr 1.0.0
Even though the solution by @user1357015 is totally fine, if you are a tidyverse fan like me, there is an elegant alternative:
tidyr 1.0.0 contains a function that didn't get much attention but is very useful: unnest_wider.
With that, you can simplify the code to the following:
library(dplyr)
library(purrr)
library(tidyr)
library(DescTools)  # MeanCI() comes from here
df %>%
  group_by(n) %>%
  nest(data = -"n") %>%
  mutate(ci = map(data, ~ MeanCI(.x$s))) %>%
  unnest_wider(ci)
which gives
# A tibble: 4 x 5
# Groups: n [4]
n data mean lwr.ci upr.ci
<dbl> <list> <dbl> <dbl> <dbl>
1 0 <tibble [7 × 1]> 40.4 23.9 57.0
2 2 <tibble [4 × 1]> 29.8 -14.3 73.8
3 5 <tibble [4 × 1]> 62.5 15.0 110.
4 8 <tibble [4 × 1]> 40.5 5.22 75.8
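Note that MeanCI() here is presumably DescTools::MeanCI(), whose default classic method is a t-based interval; that is why these intervals are wider than the normal-approximation intervals computed earlier, especially for the groups with only four observations.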
