Consider this dataset:
example = data.frame("person"= rep(c(1:3), each = 4),
"expected income" = c(seq(140,250,10)),
"income" = c(seq(110,220,10)))
print(example)
I need to calculate the ratio between column 2 ("expected.income") in year i and column 3 ("income") in year i+1.
Furthermore, the ratio should only be computed when the "person" (column 1) is the same.
Instead of a ratio between the "expected.income" and "income" of two different people, I need an NA.
It has to be done generically, since this is just a simplification of a dataset with more than 60,000 rows.
Are you looking for:
library(dplyr)

example %>%
  group_by(person) %>%
  mutate(ratio = lag(expected.income) / income) %>%
  ungroup()
# A tibble: 12 x 4
person expected.income income ratio
<int> <dbl> <dbl> <dbl>
1 1 140 110 NA
2 1 150 120 1.17
3 1 160 130 1.15
4 1 170 140 1.14
5 2 180 150 NA
6 2 190 160 1.12
7 2 200 170 1.12
8 2 210 180 1.11
9 3 220 190 NA
10 3 230 200 1.1
11 3 240 210 1.10
12 3 250 220 1.09
Alternative:
example %>%
  group_by(person) %>%
  mutate(ratio = expected.income / lead(income)) %>%
  ungroup()
# A tibble: 12 x 4
person expected.income income ratio
<int> <dbl> <dbl> <dbl>
1 1 140 110 1.17
2 1 150 120 1.15
3 1 160 130 1.14
4 1 170 140 NA
5 2 180 150 1.12
6 2 190 160 1.12
7 2 200 170 1.11
8 2 210 180 NA
9 3 220 190 1.1
10 3 230 200 1.10
11 3 240 210 1.09
12 3 250 220 NA
It carries the same information, just in different rows now.
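For completeness, the same lead-based ratio can be computed without dplyr. A minimal base R sketch (my addition, assuming the example data frame defined above):

# Within each person, divide expected.income by the next row's income;
# the last row of each person gets NA.
example$ratio <- with(example,
  ave(seq_len(nrow(example)), person,
      FUN = function(i) expected.income[i] / c(income[i[-1]], NA)))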
I made a ggplot and would like to use its computed results in a function.
How can I return the result of stat_summary_2d?
library(ggplot2)

ggplot(diamonds, aes(carat, depth, z = price)) +
  stat_summary_2d(fun = mean)
If you are asking how to re-use ggplot2's implementation of the statistics summary function, it is currently not easily feasible with the public API. You can have a look at ggplot2's internals and copy-paste the parts you need. But this should be seen as a non-robust hack which may break after any package upgrade.
The core function of stat_summary_2d is accessible in ggplot2::StatSummary2d$compute_group. But to call it, you will need some of the plot's components which are not easy to construct by hand.
In another answer, I listed all the steps performed when building a graph in a typical case. The function ggplot2:::ggplot_build.ggplot shows the details; the relevant line calling the statistics computation is
data <- by_layer(function(l, d) l$compute_statistic(d, layout))
You would need to execute all the previous steps before reaching that line. On your example, it finally gives:
# A tibble: 238 x 9
xbin ybin value x width y height PANEL group
<int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <int>
1 6 1 4206 1.04 0.160 42.6 1.20 1 -1
2 6 2 4032 1.04 0.160 43.8 1.20 1 -1
3 1 8 945 0.240 0.160 51.0 1.2 1 -1
4 8 8 6727 1.36 0.160 51.0 1.2 1 -1
5 2 9 1166 0.401 0.160 52.2 1.20 1 -1
6 3 9 1293 0.561 0.160 52.2 1.20 1 -1
7 4 9 1895 0.722 0.160 52.2 1.20 1 -1
8 2 10 1012. 0.401 0.160 53.4 1.2 1 -1
9 4 10 2164 0.722 0.160 53.4 1.2 1 -1
10 5 10 2815 0.882 0.160 53.4 1.2 1 -1
# ... with 228 more rows
Trick: internal functions of a package can be accessed with triple colon pkg:::fun.
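If the goal is just to obtain the binned means rather than to re-use ggplot2's exact pipeline, recomputing them directly is often simpler. A rough sketch (my own, not ggplot2's code; breaks = 30 mirrors stat_summary_2d's default bin count, though the bin edges will not exactly match ggplot2's):

library(ggplot2)  # only needed for the diamonds dataset

xbin <- cut(diamonds$carat, breaks = 30)
ybin <- cut(diamonds$depth, breaks = 30)
head(aggregate(list(value = diamonds$price), list(xbin = xbin, ybin = ybin), FUN = mean))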
Alternatively, build the plot object and extract the computed data with ggplot_build:

library(ggplot2)

p <- ggplot(diamonds, aes(carat, depth, z = price)) +
  stat_summary_2d(fun = mean)

ggplot_build(p)$data[[1]]
fill xbin ybin value x y PANEL group xmin xmax
#1F4364 6 1 4206.0000 1.0421667 42.6 1 -1 0.9620000 1.1223333
#1F4262 6 2 4032.0000 1.0421667 43.8 1 -1 0.9620000 1.1223333
#142E47 1 8 945.0000 0.2405000 51.0 1 -1 0.1603333 0.3206667
#28557B 8 8 6727.0000 1.3628333 51.0 1 -1 1.2826667 1.4430000
#152F49 2 9 1166.0000 0.4008333 52.2 1 -1 0.3206667 0.4810000
#15304A 3 9 1293.0000 0.5611667 52.2 1 -1 0.4810000 0.6413333
#17344F 4 9 1895.0000 0.7215000 52.2 1 -1 0.6413333 0.8016667
#152E47 2 10 1011.5000 0.4008333 53.4 1 -1 0.3206667 0.4810000
#183651 4 10 2164.0000 0.7215000 53.4 1 -1 0.6413333 0.8016667
#1B3A57 5 10 2815.0000 0.8818333 53.4 1 -1 0.8016667 0.9620000
I have the original dataframe below:
> head (DATA)
N_b N_l A sqr.x e_1 e_2 e_3 e_4 e_5
1 5 3 18 810 14.6 2.6 -9.4 0 0
2 5 3 18 810 14.6 2.6 -9.4 0 0
3 5 4 24 1440 21.8 9.8 -2.2 -14.2 0
4 5 4 24 1440 21.8 9.8 -2.2 -14.2 0
5 5 5 30 2250 29 17 5 -7 -19
6 5 5 30 2250 29 17 5 -7 -19
I want to calculate R as:

R = m * (N_l / N_b + A * (e_1 + e_2 + ... + e_{N_l}) / sqr.x)

where m depends on N_l as shown below:
> Multi.Presence
N_l m
1 1 1.20
2 2 1.00
3 3 0.85
4 4 0.65
5 5 0.65
For each row of the data in the dataframe DATA, I would like to calculate R for all possible values of N_l. This means for rows 1 and 2, the code should solve for N_l = 1, 2, and 3, choose the maximum value of R, and return which N_l was responsible for that maximum.
I wrote the following code to create a function and append the new column to the original dataframe:
compute.R <- function(N_l, e, N_b, A, x.sqr) {
  c(1.2, 1, 0.85, 0.65, 0.65)[N_l] * ((N_l / N_b) + (sum(e[1:N_l]) * A) / x.sqr)
}

R <- sapply(seq(nrow(DATA)), function(i) {
  with(DATA, compute.R(N_l[i], DATA[i, 5:9], N_b[i], A[i], sqr.x[i]))
})
DATA <- cbind(DATA, R=R)
However, this just outputs R evaluated at the row's own N_l (the last value in the sequence), not the maximum value, as shown below:
> head (DATA)
N_b N_l A sqr.x e_1 e_2 e_3 e_4 e_5 R
1 5 3 18 810 14.6 2.6 -9.4 0 0 0.6573333
2 5 3 18 810 14.6 2.6 -9.4 0 0 0.6573333
3 5 4 24 1440 21.8 9.8 -2.2 -14.2 0 0.6846667
4 5 4 24 1440 21.8 9.8 -2.2 -14.2 0 0.6846667
5 5 5 30 2250 29 17 5 -7 -19 0.8666667
6 5 5 30 2250 29 17 5 -7 -19 0.8666667
I carried out the calculations in Excel and this is what I should get instead: for the first four rows of DATA, N_l = 2 results in the maximum R, and in the last two rows N_l = 3 results in the maximum R value.
Also, I am not sure how to have the code indicate which value of N_l was responsible for the maximum calculated R.
Apply a function over the rows that evaluates R for every candidate N_l and returns both the maximum and the N_l attaining it:

R <- function(x) {
  N_b     <- x[1]
  N_l     <- x[2]
  N_l_seq <- seq(N_l)                    # candidate values N_l = 1, 2, ..., N_l
  A       <- x[3]
  sqr.x   <- x[4]
  e       <- x[5:(5 + N_l - 1)]
  m       <- Multi.Presence$m[N_l_seq]   # m looked up for each candidate N_l
  f <- m * (N_l_seq / N_b + A * cumsum(e) / sqr.x)
  c(val = max(f), pos = which.max(f))    # maximum R and the N_l that produced it
}

cbind(DATA, vars = t(apply(DATA, 1, R)))
N_b N_l A sqr.x e_1 e_2 e_3 e_4 e_5 vars.1 vars.2
1 5 3 18 810 14.6 2.6 -9.4 0.0 0 0.7822222 2
2 5 3 18 810 14.6 2.6 -9.4 0.0 0 0.7822222 2
3 5 4 24 1440 21.8 9.8 -2.2 -14.2 0 0.9266667 2
4 5 4 24 1440 21.8 9.8 -2.2 -14.2 0 0.9266667 2
5 5 5 30 2250 29.0 17.0 5.0 -7.0 -19 1.0880000 3
6 5 5 30 2250 29.0 17.0 5.0 -7.0 -19 1.0880000 3
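As a quick sanity check of the first row (my own arithmetic, not part of the original answer):

m <- c(1.20, 1.00, 0.85)                                   # Multi.Presence$m for N_l = 1..3
f <- m * (1:3 / 5 + 18 * cumsum(c(14.6, 2.6, -9.4)) / 810)
f             # 0.6293333 0.7822222 0.6573333
max(f)        # 0.7822222
which.max(f)  # 2, i.e. N_l = 2 gives the maximum

Note that f[3] = 0.6573333 is exactly what the question's code returned: it evaluated R only at the row's own N_l instead of maximizing over all candidates.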
I am doing a meta-analysis in R. For each study (variable StudyID) I have multiple effect sizes. For some studies I have the same effect size multiple times depending on the level of acquaintance (variable Familiarity) between the subjects.
head(dat)
studyID A.C.Extent Visibility Familiarity p_t_cov group.size same.sex N published
1 1 3.0 5.0 1 0.0462 4 0 44 1
2 1 5.0 2.5 1 0.1335 4 0 44 1
3 1 2.5 3.0 1 -0.1239 4 0 44 1
4 1 2.5 3.5 1 0.2062 4 0 44 1
5 1 2.5 3.0 1 -0.0370 4 0 44 1
6 1 3.0 5.0 1 -0.3850 4 0 44 1
Those are the first rows of the data set. In total there are over 50 studies. Most studies look like study 1 with the same value in "Familiarity" for all effect sizes. In some studies, there are effect sizes with multiple levels of familiarity. For example study 36 as seen below.
dat[142:160, ]
studyID A.C.Extent Visibility Familiarity p_t_cov group.size same.sex N published
142 36 1.0 4.5 0 0.1233 5.00 0 311 1
143 36 3.5 3.0 0 0.0428 5.00 0 311 1
144 36 1.0 4.5 0 0.0986 5.00 0 311 1
145 36 1.0 4.5 1 -0.0520 5.00 0 311 1
146 36 1.5 2.5 1 -0.0258 5.00 0 311 1
147 36 3.5 3.0 1 0.1104 5.00 0 311 1
148 36 1.0 4.5 1 0.0282 5.00 0 311 1
149 36 1.0 4.5 2 -0.1724 5.00 0 311 1
150 36 3.5 3.0 2 0.2646 5.00 0 311 1
151 36 1.0 4.5 2 -0.1426 5.00 0 311 1
152 37 3.0 4.0 1 0.0118 5.35 0 123 0
153 37 1.0 4.5 1 -0.3205 5.35 0 123 0
154 37 2.5 3.0 1 -0.2356 5.35 0 123 0
155 37 3.0 2.0 1 0.1372 5.35 0 123 0
156 37 2.5 2.5 1 -0.1401 5.35 0 123 0
157 37 3.0 3.5 1 -0.3334 5.35 0 123 0
158 37 2.5 2.5 1 0.0317 5.35 0 123 0
159 37 1.0 3.0 1 -0.3025 5.35 0 123 0
160 37 1.0 3.5 1 -0.3248 5.35 0 123 0
Now I want, for the studies that include multiple levels of familiarity, to keep only the rows with one level of familiarity (in two separate versions: one with the lower, one with the higher familiarity).
I think this should be possible with the dplyr package, but I have no real code so far.
In a second step I would like to give those rows unique studyIDs for each level of familiarity (so study 36 would become three "different" studies).
Thank you in advance!
If you want to use dplyr, you could create an alternate ID or case number using group_indices():

library(dplyr)

df <- df %>%
  mutate(case_num = group_indices(.dots = c("studyID", "Familiarity")))
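Note that group_indices() with a .dots argument is deprecated in recent dplyr releases; a sketch of the modern equivalent (same result, assuming the same df):

library(dplyr)

df <- df %>%
  group_by(studyID, Familiarity) %>%
  mutate(case_num = cur_group_id()) %>%
  ungroup()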
You could do:
library(dplyr)
df %>%
group_by(studyID) %>%
mutate(nDist = n_distinct(Familiarity) > 1) %>%
ungroup() %>%
mutate(
studyID = case_when(nDist ~ paste(studyID, Familiarity, sep = "_"), TRUE ~ studyID %>% as.character),
nDist = NULL
)
Output:
# A tibble: 19 x 9
studyID A.C.Extent Visibility Familiarity p_t_cov group.size same.sex N published
<chr> <dbl> <dbl> <int> <dbl> <dbl> <int> <int> <int>
1 36_0 1 4.5 0 0.123 5 0 311 1
2 36_0 3.5 3 0 0.0428 5 0 311 1
3 36_0 1 4.5 0 0.0986 5 0 311 1
4 36_1 1 4.5 1 -0.052 5 0 311 1
5 36_1 1.5 2.5 1 -0.0258 5 0 311 1
6 36_1 3.5 3 1 0.110 5 0 311 1
7 36_1 1 4.5 1 0.0282 5 0 311 1
8 36_2 1 4.5 2 -0.172 5 0 311 1
9 36_2 3.5 3 2 0.265 5 0 311 1
10 36_2 1 4.5 2 -0.143 5 0 311 1
11 37 3 4 1 0.0118 5.35 0 123 0
12 37 1 4.5 1 -0.320 5.35 0 123 0
13 37 2.5 3 1 -0.236 5.35 0 123 0
14 37 3 2 1 0.137 5.35 0 123 0
15 37 2.5 2.5 1 -0.140 5.35 0 123 0
16 37 3 3.5 1 -0.333 5.35 0 123 0
17 37 2.5 2.5 1 0.0317 5.35 0 123 0
18 37 1 3 1 -0.302 5.35 0 123 0
19 37 1 3.5 1 -0.325 5.35 0 123 0
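The first step from the question, keeping only one familiarity level per study, is not covered above; a dplyr sketch for the two requested versions (assuming the data frame is named dat):

library(dplyr)

# Version keeping the lowest familiarity level within each study:
dat_low <- dat %>%
  group_by(studyID) %>%
  filter(Familiarity == min(Familiarity)) %>%
  ungroup()

# Version keeping the highest familiarity level within each study:
dat_high <- dat %>%
  group_by(studyID) %>%
  filter(Familiarity == max(Familiarity)) %>%
  ungroup()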
Hi I have a dataset like this:
City = c(1,1,1,1,1,2,2,2,2,2,3,3,3,3)
Area=c("A","B","A","B","A","A","B","B","B","A","A","B","A","A")
Weights=c(2.4,1.9,0.51,0.7,2.2,1.5,1.86,1.66,1.09,2.57,2.4,0.9,3.4,3.7)
Tax=c(16,93,96,44,67,73,12,65,81,22,39,94,41,30)
z = data.frame(City,Area,Weights,Tax)
What I want to do is obtain the weighted tax for each City and Area combination.
For example, for row 1 (City 1, Area A) the computed value is:
2.4 * 16 / (2.40 + 0.51 + 2.20), and so on, i.e. Weights * Tax divided by the sum of Weights within the same City and Area.
I can do that using this function:
weighted_tax <- function(z) {
  cit_data <- NULL
  for (cit in unique(z$City)) {
    city_data <- z[z$City == cit, ]
    area_new <- NULL
    for (ar in unique(city_data$Area)) {
      area_data <- city_data[city_data$Area == ar, ]
      area_data$area_dat_n <- area_data$Weights * area_data$Tax / sum(area_data$Weights)
      area_new <- rbind(area_new, area_data)
    }
    cit_data <- rbind(cit_data, area_new)
  }
  return(cit_data)
}
tax <- weighted_tax(z)
Is there an easier/cleaner way to implement this? Thanks in advance.
Using dplyr:
library(dplyr)
z %>%
group_by(City, Area) %>%
mutate(Weighted_tax = Tax*Weights/sum(Weights))
Output:
# A tibble: 14 x 5
# Groups: City, Area [6]
City Area Weights Tax Weighted_tax
<dbl> <fct> <dbl> <dbl> <dbl>
1 1 A 2.4 16 7.51
2 1 B 1.9 93 68.0
3 1 A 0.51 96 9.58
4 1 B 0.7 44 11.8
5 1 A 2.2 67 28.8
6 2 A 1.5 73 26.9
7 2 B 1.86 12 4.84
8 2 B 1.66 65 23.4
9 2 B 1.09 81 19.2
10 2 A 2.57 22 13.9
11 3 A 2.4 39 9.85
12 3 B 0.9 94 94.
13 3 A 3.4 41 14.7
14 3 A 3.7 30 11.7
We could also do this in base R with by:

do.call(rbind, by(z, z[c("City", "Area")], function(x)
  cbind(x, area.dat.n = with(x, Weights * Tax / sum(Weights)))))
# City Area Weights Tax area.dat.n
# 1 1 A 2.40 16 7.514677
# 3 1 A 0.51 96 9.581213
# 5 1 A 2.20 67 28.845401
# 6 2 A 1.50 73 26.904177
# 10 2 A 2.57 22 13.891892
# 11 3 A 2.40 39 9.852632
# 13 3 A 3.40 41 14.673684
# 14 3 A 3.70 30 11.684211
# 2 1 B 1.90 93 67.961538
# 4 1 B 0.70 44 11.846154
# 7 2 B 1.86 12 4.841649
# 8 2 B 1.66 65 23.405640
# 9 2 B 1.09 81 19.151844
# 12 3 B 0.90 94 94.000000
or with ave, avoiding the row-wise loop entirely:

cbind(z, area.dat.n = with(z, Weights * Tax / ave(Weights, City, Area, FUN = sum)))
#    City Area Weights Tax area.dat.n
# 1     1    A    2.40  16   7.514677
# 2     1    B    1.90  93  67.961538
# 3     1    A    0.51  96   9.581213
# 4     1    B    0.70  44  11.846154
# 5     1    A    2.20  67  28.845401
# 6     2    A    1.50  73  26.904177
# 7     2    B    1.86  12   4.841649
# 8     2    B    1.66  65  23.405640
# 9     2    B    1.09  81  19.151844
# 10    2    A    2.57  22  13.891892
# 11    3    A    2.40  39   9.852632
# 12    3    B    0.90  94  94.000000
# 13    3    A    3.40  41  14.673684
# 14    3    A    3.70  30  11.684211
Data
z <- structure(list(City = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3,
3), Area = structure(c(1L, 2L, 1L, 2L, 1L, 1L, 2L, 2L, 2L, 1L,
1L, 2L, 1L, 1L), .Label = c("A", "B"), class = "factor"), Weights = c(2.4,
1.9, 0.51, 0.7, 2.2, 1.5, 1.86, 1.66, 1.09, 2.57, 2.4, 0.9, 3.4,
3.7), Tax = c(16, 93, 96, 44, 67, 73, 12, 65, 81, 22, 39, 94,
41, 30)), class = "data.frame", row.names = c(NA, -14L))
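For larger tables, the same grouped computation is a one-liner in data.table as well (an alternative not shown in the answers above):

library(data.table)

zt <- as.data.table(z)
zt[, area.dat.n := Weights * Tax / sum(Weights), by = .(City, Area)]  # adds the column by reference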