Merge data frames by time interval in R

I have two Data Frames. One is an Eye Tracking data frame with subject, condition, timestamp, xposition, and yposition. It has over 400,000 rows. Here's a toy data set for an example:
subid condition time xpos ypos
1 1 1 1.40 195 140
2 1 1 2.50 138 147
3 1 1 3.40 140 162
4 1 1 4.10 188 150
5 1 2 1.10 131 194
6 1 2 2.10 149 111
eyedata <- data.frame(subid = rep(1:2, each = 8),
                      condition = rep(rep(1:2, each = 4), 2),
                      time = c(1.4, 2.5, 3.4, 4.1,
                               1.1, 2.1, 3.23, 4.44,
                               1.33, 2.3, 3.11, 4.1,
                               0.49, 1.99, 3.01, 4.2),
                      xpos = round(runif(n = 16, min = 100, max = 200)),
                      ypos = round(runif(n = 16, min = 100, max = 200)))
Then I have a Data Frame with subject, condition, a trial number, and a trial begin and end time. It looks like this:
subid condition trial begin end
1 1 1 1 1.40 2.4
2 1 1 2 2.50 3.2
3 1 1 2 3.21 4.5
4 1 2 1 1.10 1.6
5 1 2 2 2.10 3.3
6 1 2 2 3.40 4.1
7 2 1 1 0.50 1.1
8 2 1 1 1.44 2.9
9 2 1 2 2.97 3.3
10 2 2 1 0.35 1.9
11 2 2 1 2.12 4.5
12 2 2 2 3.20 6.3
trials <- data.frame(subid = rep(1:2, each = 6),
                     condition = rep(rep(1:2, each = 3), 2),
                     trial = c(rep(c(1, rep(2, 2)), 2), rep(c(rep(1, 2), 2), 2)),
                     begin = c(1.4, 2.5, 3.21,
                               1.10, 2.10, 3.4,
                               0.50, 1.44, 2.97,
                               0.35, 2.12, 3.20),
                     end = c(2.4, 3.2, 4.5,
                             1.6, 3.3, 4.1,
                             1.1, 2.9, 3.3,
                             1.9, 4.5, 6.3))
The number of trials in a condition is variable, and I want to add a column to my eye-tracking data frame that specifies the correct trial based on whether the timestamp falls within a trial's time interval. The time intervals do not overlap, but there will be many eye-tracking rows that fall between trials. In the end I'd like a data frame like this:
subid condition trial time xpos ypos
1 1 1 1.40 198 106
1 1 2 2.50 166 139
1 1 2 3.40 162 120
1 1 2 4.10 113 164
1 2 1 1.10 162 120
1 2 2 2.10 162 120
I've seen data.table rolling joins, but would prefer a solution with dplyr or fuzzyjoin. Thanks in advance.

Here's a join-and-filter attempt; it may be incomplete, since I can't fully account for the discrepancies. Rows 12 and 13 of this result appear to come from overlapping time intervals. Also, when using random generation functions such as runif, please call set.seed() so the example is reproducible -- here xpos and ypos have no bearing on the result, so it isn't an issue.
library(dplyr)

eyedata %>%
  left_join(trials, by = c("subid", "condition")) %>%
  filter(time >= begin & time <= end)
# subid condition time xpos ypos trial begin end
# 1 1 1 1.40 143 101 1 1.40 2.4
# 2 1 1 2.50 152 173 2 2.50 3.2
# 3 1 1 3.40 185 172 2 3.21 4.5
# 4 1 1 4.10 106 119 2 3.21 4.5
# 5 1 2 1.10 155 165 1 1.10 1.6
# 6 1 2 2.10 169 154 2 2.10 3.3
# 7 1 2 3.23 166 134 2 2.10 3.3
# 8 2 1 2.30 197 171 1 1.44 2.9
# 9 2 1 3.11 140 135 2 2.97 3.3
# 10 2 2 0.49 176 139 1 0.35 1.9
# 11 2 2 3.01 187 180 1 2.12 4.5
# 12 2 2 4.20 147 176 1 2.12 4.5
# 13 2 2 4.20 147 176 2 3.20 6.3
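Since the question asks for a dplyr or fuzzyjoin solution, here is a sketch of two alternative interval joins (not part of the attempt above; it assumes the fuzzyjoin package is installed, and dplyr >= 1.1.0 for the join_by() variant):
library(dplyr)
library(fuzzyjoin)

# fuzzyjoin: exact match on subid/condition, interval match on time
eyedata %>%
  fuzzy_left_join(trials,
                  by = c("subid" = "subid",
                         "condition" = "condition",
                         "time" = "begin",
                         "time" = "end"),
                  match_fun = list(`==`, `==`, `>=`, `<=`)) %>%
  select(subid = subid.x, condition = condition.x, trial, time, xpos, ypos)

# dplyr >= 1.1.0: a non-equi join does the same without fuzzyjoin
eyedata %>%
  left_join(trials, by = join_by(subid, condition, between(time, begin, end))) %>%
  select(subid, condition, trial, time, xpos, ypos)
Both sketches keep every eyedata row (timestamps outside all trials get NA for trial) and, like the filter approach above, will still duplicate a row whose timestamp falls inside two overlapping intervals.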

Related

How to calculate the ratio between column(2) and column(3) while the value of column(1) is the same in R?

Consider this dataset:
example <- data.frame("person" = rep(1:3, each = 4),
                      "expected income" = seq(140, 250, 10),
                      "income" = seq(110, 220, 10))
print(example)
I need to calculate the ratio between column(2) in year (i) and column(3) in year (i+1).
Furthermore, the ratio should only be computed when the "person" (column 1) is the same.
Instead of a ratio between the "expected income" and the "income" of two different people, I need an NA.
It has to be done generically, since this is just a simplification of a dataset with more than 60,000 rows.
Are you looking for:
library(dplyr)

example %>%
  group_by(person) %>%
  mutate(ratio = lag(expected.income) / income) %>%
  ungroup()
# A tibble: 12 x 4
person expected.income income ratio
<int> <dbl> <dbl> <dbl>
1 1 140 110 NA
2 1 150 120 1.17
3 1 160 130 1.15
4 1 170 140 1.14
5 2 180 150 NA
6 2 190 160 1.12
7 2 200 170 1.12
8 2 210 180 1.11
9 3 220 190 NA
10 3 230 200 1.1
11 3 240 210 1.10
12 3 250 220 1.09
Alternative:
example %>%
  group_by(person) %>%
  mutate(ratio = expected.income / lead(income)) %>%
  ungroup()
# A tibble: 12 x 4
person expected.income income ratio
<int> <dbl> <dbl> <dbl>
1 1 140 110 1.17
2 1 150 120 1.15
3 1 160 130 1.14
4 1 170 140 NA
5 2 180 150 1.12
6 2 190 160 1.12
7 2 200 170 1.11
8 2 210 180 NA
9 3 220 190 1.1
10 3 230 200 1.10
11 3 240 210 1.09
12 3 250 220 NA
It carries the same information, just in different rows now.
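For reference, a base R sketch of the same "lead" logic (not in the original answers; it assumes the rows are already ordered by person, as in the example):
# within each person, divide expected.income by the next row's income; last row gets NA
example$ratio <- unlist(lapply(split(example, example$person), function(d)
  d$expected.income / c(d$income[-1], NA)))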

How to return the results from a summary function?

I made a ggplot, and I would like to use the results computed by one of its summary functions.
How can I return the result of stat_summary_2d?
library(ggplot2)

ggplot(diamonds, aes(carat, depth, z = price)) +
  stat_summary_2d(fun = function(x) mean(x))
If you are asking how to re-use ggplot2's implementation of the statistics summary function, it is currently not easily feasible with the public API. You can have a look at ggplot2's internals and copy-paste the parts you need. But this should be seen as a non-robust hack which may break after any package upgrade.
The core function of bin2d is accessible in ggplot2::StatBin2d$compute_group. But to call it, you will need some of the plot's components which are not easy to construct by hand.
In another answer, I had listed all the steps done when building a graph in a typical case. Function ggplot2:::ggplot_build.ggplot will show you the details. The relevant line calling the statistics computation is
data <- by_layer(function(l, d) l$compute_statistic(d, layout))
You'll need to execute all previous steps before starting it. On your example, it would finally give:
# A tibble: 238 x 9
xbin ybin value x width y height PANEL group
<int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <int>
1 6 1 4206 1.04 0.160 42.6 1.20 1 -1
2 6 2 4032 1.04 0.160 43.8 1.20 1 -1
3 1 8 945 0.240 0.160 51.0 1.2 1 -1
4 8 8 6727 1.36 0.160 51.0 1.2 1 -1
5 2 9 1166 0.401 0.160 52.2 1.20 1 -1
6 3 9 1293 0.561 0.160 52.2 1.20 1 -1
7 4 9 1895 0.722 0.160 52.2 1.20 1 -1
8 2 10 1012. 0.401 0.160 53.4 1.2 1 -1
9 4 10 2164 0.722 0.160 53.4 1.2 1 -1
10 5 10 2815 0.882 0.160 53.4 1.2 1 -1
# ... with 228 more rows
Trick: internal functions of a package can be accessed with triple colon pkg:::fun.
A simpler route is to build the plot object and extract the computed layer data:
p <- ggplot(diamonds, aes(carat, depth, z = price)) +
  stat_summary_2d(fun = function(x) mean(x))
ggplot_build(p)$data[[1]]
fill xbin ybin value x y PANEL group xmin xmax
#1F4364 6 1 4206.0000 1.0421667 42.6 1 -1 0.9620000 1.1223333
#1F4262 6 2 4032.0000 1.0421667 43.8 1 -1 0.9620000 1.1223333
#142E47 1 8 945.0000 0.2405000 51.0 1 -1 0.1603333 0.3206667
#28557B 8 8 6727.0000 1.3628333 51.0 1 -1 1.2826667 1.4430000
#152F49 2 9 1166.0000 0.4008333 52.2 1 -1 0.3206667 0.4810000
#15304A 3 9 1293.0000 0.5611667 52.2 1 -1 0.4810000 0.6413333
#17344F 4 9 1895.0000 0.7215000 52.2 1 -1 0.6413333 0.8016667
#152E47 2 10 1011.5000 0.4008333 53.4 1 -1 0.3206667 0.4810000
#183651 4 10 2164.0000 0.7215000 53.4 1 -1 0.6413333 0.8016667
#1B3A57 5 10 2815.0000 0.8818333 53.4 1 -1 0.8016667 0.9620000
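If the goal is only the binned means themselves (rather than reusing ggplot2's internals), they can also be computed directly; a rough sketch, assuming the default of 30 bins per axis (the bin edges will not match ggplot2's exactly):
library(ggplot2)  # for the diamonds data
library(dplyr)

diamonds %>%
  mutate(xbin = cut(carat, breaks = 30, labels = FALSE),   # integer bin index along x
         ybin = cut(depth, breaks = 30, labels = FALSE)) %>% # integer bin index along y
  group_by(xbin, ybin) %>%
  summarise(value = mean(price), .groups = "drop")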

Sapply a function to create a new column in dataframe with vectorized input

I have the original dataframe below:
> head (DATA)
N_b N_l A sqr.x e_1 e_2 e_3 e_4 e_5
1 5 3 18 810 14.6 2.6 -9.4 0 0
2 5 3 18 810 14.6 2.6 -9.4 0 0
3 5 4 24 1440 21.8 9.8 -2.2 -14.2 0
4 5 4 24 1440 21.8 9.8 -2.2 -14.2 0
5 5 5 30 2250 29 17 5 -7 -19
6 5 5 30 2250 29 17 5 -7 -19
I want to calculate R as R = m * (N_l / N_b + (sum(e_1, ..., e_N_l) * A) / sqr.x) (the formula used in the code below), where m depends on N_l as shown below:
> Multi.Presence
N_l m
1 1 1.20
2 2 1.00
3 3 0.85
4 4 0.65
5 5 0.65
For each row of the data frame DATA, I would like to calculate R for all possible values of N_l. That means for rows 1 and 2 the code should solve for N_l = 1, 2, and 3, choose the maximum value of R, and return which N_l was responsible for that maximum.
I wrote the following code to create a function and append the new column to the original dataframe:
compute.R <- function(N_l, e, N_b, A, x.sqr) {
  c(1.2, 1, 0.85, 0.65, 0.65)[N_l] * ((N_l / N_b) + (sum(e[1:N_l]) * A) / x.sqr)
}

R <- sapply(seq(nrow(DATA)), function(i) {
  # the column is called sqr.x in DATA
  with(DATA, compute.R(N_l[i], DATA[i, 5:9], N_b[i], A[i], sqr.x[i]))
})

DATA <- cbind(DATA, R = R)
However, this just outputs the last calculated value of R and not the maximum value as shown below:
> head (DATA)
N_b N_l A sqr.x e_1 e_2 e_3 e_4 e_5 R
1 5 3 18 810 14.6 2.6 -9.4 0 0 0.6573333
2 5 3 18 810 14.6 2.6 -9.4 0 0 0.6573333
3 5 4 24 1440 21.8 9.8 -2.2 -14.2 0 0.6846667
4 5 4 24 1440 21.8 9.8 -2.2 -14.2 0 0.6846667
5 5 5 30 2250 29 17 5 -7 -19 0.8666667
6 5 5 30 2250 29 17 5 -7 -19 0.8666667
I carried out the calculations by hand and the results should differ. Also, I am not sure how to have the code indicate which value of N_l was responsible for the maximum calculated R. The hand calculations show that for the first four rows of DATA, N_l = 2 results in the max R, and in the last two rows N_l = 3 results in the max R value.
R <- function(x) {
  N_b     <- x[1]
  N_l     <- x[2]
  N_l_seq <- seq(N_l)                   # candidate values 1, ..., N_l
  A       <- x[3]
  sqr.x   <- x[4]
  e       <- x[5:(5 + N_l - 1)]         # only the first N_l values of e
  m       <- Multi.Presence$m[N_l_seq]
  f       <- m * (N_l_seq / N_b + A * cumsum(e) / sqr.x)  # R for every candidate N_l
  c(val = max(f), pos = which.max(f))   # the maximum and the N_l that produced it
}

cbind(DATA, vars = t(apply(DATA, 1, R)))
N_b N_l A sqr.x e_1 e_2 e_3 e_4 e_5 vars.1 vars.2
1 5 3 18 810 14.6 2.6 -9.4 0.0 0 0.7822222 2
2 5 3 18 810 14.6 2.6 -9.4 0.0 0 0.7822222 2
3 5 4 24 1440 21.8 9.8 -2.2 -14.2 0 0.9266667 2
4 5 4 24 1440 21.8 9.8 -2.2 -14.2 0 0.9266667 2
5 5 5 30 2250 29.0 17.0 5.0 -7.0 -19 1.0880000 3
6 5 5 30 2250 29.0 17.0 5.0 -7.0 -19 1.0880000 3
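For comparison, a tidyverse sketch of the same computation (the column names R.max and N_l.max are made up here; it assumes DATA and Multi.Presence as defined above):
library(dplyr)
library(purrr)
library(tidyr)

DATA %>%
  mutate(res = pmap(list(N_b, N_l, A, sqr.x, e_1, e_2, e_3, e_4, e_5),
                    function(N_b, N_l, A, sqr.x, ...) {
                      e <- c(...)[seq_len(N_l)]                        # only the e columns in play
                      f <- Multi.Presence$m[seq_len(N_l)] *
                        (seq_len(N_l) / N_b + A * cumsum(e) / sqr.x)   # R for each candidate N_l
                      tibble(R.max = max(f), N_l.max = which.max(f))
                    })) %>%
  unnest(res)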

Use cases with higher value on one variable for each case of another variable in R

I am doing a meta-analysis in R. For each study (variable StudyID) I have multiple effect sizes. For some studies I have the same effect size multiple times depending on the level of acquaintance (variable Familiarity) between the subjects.
head(dat)
studyID A.C.Extent Visibility Familiarity p_t_cov group.size same.sex N published
1 1 3.0 5.0 1 0.0462 4 0 44 1
2 1 5.0 2.5 1 0.1335 4 0 44 1
3 1 2.5 3.0 1 -0.1239 4 0 44 1
4 1 2.5 3.5 1 0.2062 4 0 44 1
5 1 2.5 3.0 1 -0.0370 4 0 44 1
6 1 3.0 5.0 1 -0.3850 4 0 44 1
Those are the first rows of the data set. In total there are over 50 studies. Most studies look like study 1 with the same value in "Familiarity" for all effect sizes. In some studies, there are effect sizes with multiple levels of familiarity. For example study 36 as seen below.
dat[142:160, ]
studyID A.C.Extent Visibility Familiarity p_t_cov group.size same.sex N published
142 36 1.0 4.5 0 0.1233 5.00 0 311 1
143 36 3.5 3.0 0 0.0428 5.00 0 311 1
144 36 1.0 4.5 0 0.0986 5.00 0 311 1
145 36 1.0 4.5 1 -0.0520 5.00 0 311 1
146 36 1.5 2.5 1 -0.0258 5.00 0 311 1
147 36 3.5 3.0 1 0.1104 5.00 0 311 1
148 36 1.0 4.5 1 0.0282 5.00 0 311 1
149 36 1.0 4.5 2 -0.1724 5.00 0 311 1
150 36 3.5 3.0 2 0.2646 5.00 0 311 1
151 36 1.0 4.5 2 -0.1426 5.00 0 311 1
152 37 3.0 4.0 1 0.0118 5.35 0 123 0
153 37 1.0 4.5 1 -0.3205 5.35 0 123 0
154 37 2.5 3.0 1 -0.2356 5.35 0 123 0
155 37 3.0 2.0 1 0.1372 5.35 0 123 0
156 37 2.5 2.5 1 -0.1401 5.35 0 123 0
157 37 3.0 3.5 1 -0.3334 5.35 0 123 0
158 37 2.5 2.5 1 0.0317 5.35 0 123 0
159 37 1.0 3.0 1 -0.3025 5.35 0 123 0
160 37 1.0 3.5 1 -0.3248 5.35 0 123 0
Now, for those studies that include multiple levels of familiarity, I want to keep only the rows with a single level of familiarity (in two separate versions: one with the lower and one with the higher familiarity).
I think that it can be possible with the package dplyr, but I have no real code so far.
In a second step I would like to give those rows unique studyIDs for each level of familiarity (so create out of study 36 three "different" studies).
Thank you in advance!
If you want to use dplyr, you could create an alternate ID or casenum by using group_indices:
df <- df %>%
mutate(case_num = group_indices(.dots=c("studyID", "Familiarity")))
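Note that group_indices(.dots = ...) is deprecated in recent dplyr; a sketch of the same idea with current syntax (assuming dplyr >= 1.0) would be:
df <- df %>%
  group_by(studyID, Familiarity) %>%
  mutate(case_num = cur_group_id()) %>%  # one ID per studyID-Familiarity combination
  ungroup()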
You could do:
library(dplyr)

df %>%
  group_by(studyID) %>%
  mutate(nDist = n_distinct(Familiarity) > 1) %>%
  ungroup() %>%
  mutate(
    studyID = case_when(nDist ~ paste(studyID, Familiarity, sep = "_"),
                        TRUE  ~ as.character(studyID)),
    nDist = NULL
  )
Output:
# A tibble: 19 x 9
studyID A.C.Extent Visibility Familiarity p_t_cov group.size same.sex N published
<chr> <dbl> <dbl> <int> <dbl> <dbl> <int> <int> <int>
1 36_0 1 4.5 0 0.123 5 0 311 1
2 36_0 3.5 3 0 0.0428 5 0 311 1
3 36_0 1 4.5 0 0.0986 5 0 311 1
4 36_1 1 4.5 1 -0.052 5 0 311 1
5 36_1 1.5 2.5 1 -0.0258 5 0 311 1
6 36_1 3.5 3 1 0.110 5 0 311 1
7 36_1 1 4.5 1 0.0282 5 0 311 1
8 36_2 1 4.5 2 -0.172 5 0 311 1
9 36_2 3.5 3 2 0.265 5 0 311 1
10 36_2 1 4.5 2 -0.143 5 0 311 1
11 37 3 4 1 0.0118 5.35 0 123 0
12 37 1 4.5 1 -0.320 5.35 0 123 0
13 37 2.5 3 1 -0.236 5.35 0 123 0
14 37 3 2 1 0.137 5.35 0 123 0
15 37 2.5 2.5 1 -0.140 5.35 0 123 0
16 37 3 3.5 1 -0.333 5.35 0 123 0
17 37 2.5 2.5 1 0.0317 5.35 0 123 0
18 37 1 3 1 -0.302 5.35 0 123 0
19 37 1 3.5 1 -0.325 5.35 0 123 0
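To also produce the two separate versions the question asks for (one keeping only the lowest and one keeping only the highest familiarity level per study), a sketch along the same lines (assuming the data frame is called dat, as in the question):
library(dplyr)

# per study, keep only the rows at the lowest / highest Familiarity level
dat_low  <- dat %>% group_by(studyID) %>% filter(Familiarity == min(Familiarity)) %>% ungroup()
dat_high <- dat %>% group_by(studyID) %>% filter(Familiarity == max(Familiarity)) %>% ungroup()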

R apply weighting operation by multiple groups

Hi I have a dataset like this:
City = c(1,1,1,1,1,2,2,2,2,2,3,3,3,3)
Area=c("A","B","A","B","A","A","B","B","B","A","A","B","A","A")
Weights=c(2.4,1.9,0.51,0.7,2.2,1.5,1.86,1.66,1.09,2.57,2.4,0.9,3.4,3.7)
Tax=c(16,93,96,44,67,73,12,65,81,22,39,94,41,30)
z = data.frame(City,Area,Weights,Tax)
What I want to do is to obtain the weighted tax for each City and each area.
For example, for row 1 above the computed value is 2.4 * 16 / (2.40 + 0.51 + 2.20), and so on.
I can do that using this function:
weighted_tax <- function(z) {
  cit_data <- list()
  for (cit in unique(z$City)) {
    city_data <- z[z$City == cit, ]
    area_new <- list()
    for (ar in unique(city_data$Area)) {
      area_data <- city_data[city_data$Area == ar, ]
      area_data$area_dat_n <- (area_data$Weights * area_data$Tax) / sum(area_data$Weights)
      area_new <- rbind(area_new, area_data)
    }
    cit_data <- rbind(cit_data, area_new)
  }
  return(cit_data)
}
tax <- weighted_tax(z)
Is there an easier/cleaner way to implement this? Thanks in advance.
Using dplyr:
library(dplyr)

z %>%
  group_by(City, Area) %>%
  mutate(Weighted_tax = Tax * Weights / sum(Weights))
Output:
# A tibble: 14 x 5
# Groups: City, Area [6]
City Area Weights Tax Weighted_tax
<dbl> <fct> <dbl> <dbl> <dbl>
1 1 A 2.4 16 7.51
2 1 B 1.9 93 68.0
3 1 A 0.51 96 9.58
4 1 B 0.7 44 11.8
5 1 A 2.2 67 28.8
6 2 A 1.5 73 26.9
7 2 B 1.86 12 4.84
8 2 B 1.66 65 23.4
9 2 B 1.09 81 19.2
10 2 A 2.57 22 13.9
11 3 A 2.4 39 9.85
12 3 B 0.9 94 94.
13 3 A 3.4 41 14.7
14 3 A 3.7 30 11.7
We also could do this in base R with by,
do.call(rbind, by(z, z[c("City", "Area")], function(x)
cbind(x, area.dat.n=with(x, Weights * Tax / sum(Weights)))))
# City Area Weights Tax area.dat.n
# 1 1 A 2.40 16 7.514677
# 3 1 A 0.51 96 9.581213
# 5 1 A 2.20 67 28.845401
# 6 2 A 1.50 73 26.904177
# 10 2 A 2.57 22 13.891892
# 11 3 A 2.40 39 9.852632
# 13 3 A 3.40 41 14.673684
# 14 3 A 3.70 30 11.684211
# 2 1 B 1.90 93 67.961538
# 4 1 B 0.70 44 11.846154
# 7 2 B 1.86 12 4.841649
# 8 2 B 1.66 65 23.405640
# 9 2 B 1.09 81 19.151844
# 12 3 B 0.90 94 94.000000
or with ave.
cbind(z,
      area.dat.n = with(z, Weights * Tax / ave(Weights, City, Area, FUN = sum)))
# City Area Weights Tax area.dat.n
# 1 1 A 2.40 16 7.514677
# 2 1 B 1.90 93 67.961538
# 3 1 A 0.51 96 9.581213
# 4 1 B 0.70 44 11.846154
# 5 1 A 2.20 67 28.845401
# 6 2 A 1.50 73 26.904177
# 7 2 B 1.86 12 4.841649
# 8 2 B 1.66 65 23.405640
# 9 2 B 1.09 81 19.151844
# 10 2 A 2.57 22 13.891892
# 11 3 A 2.40 39 9.852632
# 12 3 B 0.90 94 94.000000
# 13 3 A 3.40 41 14.673684
# 14 3 A 3.70 30 11.684211
Data
z <- structure(list(City = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3,
3), Area = structure(c(1L, 2L, 1L, 2L, 1L, 1L, 2L, 2L, 2L, 1L,
1L, 2L, 1L, 1L), .Label = c("A", "B"), class = "factor"), Weights = c(2.4,
1.9, 0.51, 0.7, 2.2, 1.5, 1.86, 1.66, 1.09, 2.57, 2.4, 0.9, 3.4,
3.7), Tax = c(16, 93, 96, 44, 67, 73, 12, 65, 81, 22, 39, 94,
41, 30)), class = "data.frame", row.names = c(NA, -14L))
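For completeness, the same grouped weighting can also be written with data.table; a sketch, assuming that package is an acceptable alternative:
library(data.table)

# setDT converts z to a data.table by reference; := adds the column in place
setDT(z)[, Weighted_tax := Tax * Weights / sum(Weights), by = .(City, Area)][]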
