I would like to mutate a dataframe by applying a function which refers to another dataframe. I can achieve this in a few different ways, but would like to know how to do this 'properly'.
Here is an example of what I'm trying to do. I have a dataframe with some start times, and a second dataframe with some timed observations. I would like to return a dataframe with the start times and the number of observations that occur within some window after each start time, e.g.
set.seed(1337)
df1 <- data.frame(id=LETTERS[1:3], start_time=1:3*10)
df2 <- data.frame(time=runif(100)*100)
lapply(df1$start_time, function(s) sum(df2$time>s & df2$time<(s+15)))
The best I've got so far with dplyr is the following (but this loses the identity variables):
df1 %>%
  rowwise() %>%
  do(count = filter(df2, time > .$start_time, time < (.$start_time + 15))) %>%
  mutate(n = nrow(count))
output:
Source: local data frame [3 x 2]
Groups: <by row>
# A tibble: 3 × 2
count n
<list> <int>
1 <data.frame [17 × 1]> 17
2 <data.frame [18 × 1]> 18
3 <data.frame [10 × 1]> 10
I was expecting to be able to do this:
df1 <- data.frame(id=LETTERS[1:3], start_time=1:3*10)
df2 <- data.frame(time=runif(100)*100)
df1 %>%
  group_by(id) %>%
  mutate(count = nrow(filter(df2, time > start_time, time < (start_time + 15))))
but this returns the error:
Error: comparison (6) is possible only for atomic and list types
What is the dplyr way of doing this?
Here is one option with data.table, where we can use a non-equi join:
library(data.table) # 1.9.7+
setDT(df1)[, start_timeNew := start_time + 15]
setDT(df2)[df1, .(id, .N), on = .(time > start_time, time < start_timeNew),
           by = .EACHI][, c('id', 'N'), with = FALSE]
# id N
#1: A 17
#2: B 18
#3: C 10
which gives the same count as in the OP's base R method
sapply(df1$start_time, function(s) sum(df2$time>s & df2$time<(s+15)))
#[1] 17 18 10
If we also need the 'id' variable in the output with dplyr, we can modify the OP's code:
df1 %>%
  rowwise() %>%
  do(data.frame(., count = filter(df2, time > .$start_time,
                                  time < (.$start_time + 15)))) %>%
  group_by(id) %>%
  summarise(n = n())
# id n
# <fctr> <int>
#1 A 17
#2 B 18
#3 C 10
Or another option is map from purrr with dplyr
library(purrr)
df1 %>%
  split(.$id) %>%
  map_df(~ mutate(., N = sum(df2$time > start_time & df2$time < start_time + 15))) %>%
  select(-start_time)
# id N
#1 A 17
#2 B 18
#3 C 10
Another slightly different approach using dplyr:
result <- df1 %>%
  group_by(id) %>%
  summarise(count = length(which(df2$time > start_time &
                                 df2$time < (start_time + 15))))
print(result)
### A tibble: 3 x 2
## id count
## <fctr> <int>
##1 A 17
##2 B 18
##3 C 10
I believe you can use length and which to count the number of occurrences for which your condition is true for each id in df1. Then, group by id and use this to summarise.
If there is possibly more than one start_time per id, then you can use the same function, but rowwise and with mutate:
result <- df1 %>%
  rowwise() %>%
  mutate(count = length(which(df2$time > start_time &
                              df2$time < (start_time + 15))))
print(result)
##Source: local data frame [3 x 3]
##Groups: <by row>
##
### A tibble: 3 x 3
## id start_time count
## <fctr> <dbl> <int>
##1 A 10 17
##2 B 20 18
##3 C 30 10
First time posting here, so I'll try to explain my problem as clearly as possible.
I'm working on erosion data for farms stored as pixels (e.g. 1 farm = 10 pixels, so 10 rows in my df). I have 4 data frames in a list, and I would like to calculate the mean erosion for each farm. I thought about looping over the name of the erosion field, but my data frames don't share the same column name (it is either ERO13 or ERO17). I don't want to rely on the position of the field, because it could change between the data frames, only on the name, which varies.
Here's an example:
df1 <- data.frame(ID = c(1,1,2), ERO13 = c(2,4,6))
df2 <- data.frame(ID = c(4,4,6), ERO17 = c(4,5,12))
lst_df <- list(df1,df2)
for (df in lst_df){
  cur_df <- df
  cur_df <- cur_df %>%
    group_by(ID) %>%
    summarise(current_name_of_erosion_field = mean(current_name_of_erosion_field))
}
I tried with
for (df in lst_df){
  cur_df <- df
  cur_camp <- names(cur_df)[2]
  cur_df <- cur_df %>%
    group_by(ID) %>%
    summarise(cur_camp = mean(cur_camp))
}
but this doesn't work, because cur_camp is a character string rather than a variable referring to the column it names, and it still relies on the column position.
How can I build the current_name_of_erosion_field here?
We may convert the string to a symbol and evaluate it (!!), or pass the string to across. Also, as we are using a for loop, make sure to create a list to store the output. To assign a name stored in an object, use := together with !!:
out <- vector('list', length(lst_df))
for (i in seq_along(lst_df)) {
  cur_df <- lst_df[[i]]
  cur_camp <- names(cur_df)[2]
  cur_df <- cur_df %>%
    group_by(ID) %>%
    summarise(!!cur_camp := mean(!!sym(cur_camp)))
  out[[i]] <- cur_df
}
-output
> out
[[1]]
# A tibble: 2 × 2
ID ERO13
<dbl> <dbl>
1 1 3
2 2 6
[[2]]
# A tibble: 2 × 2
ID ERO17
<dbl> <dbl>
1 4 4.5
2 6 12
Or we may use across:
out <- vector('list', length(lst_df))
for (i in seq_along(lst_df)) {
  cur_df <- lst_df[[i]]
  cur_camp <- names(cur_df)[2]
  cur_df <- cur_df %>%
    group_by(ID) %>%
    summarise(across(all_of(cur_camp), mean))
  out[[i]] <- cur_df
}
-output
> out
[[1]]
# A tibble: 2 × 2
ID ERO13
<dbl> <dbl>
1 1 3
2 2 6
[[2]]
# A tibble: 2 × 2
ID ERO17
<dbl> <dbl>
1 4 4.5
2 6 12
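The same across() idea also works without an explicit for loop, e.g. with purrr::map. A small sketch, assuming (as in the question) that the erosion column is always the second column of each data frame:
library(dplyr)
library(purrr)

out <- map(lst_df, function(df) {
  cur_camp <- names(df)[2]  # the erosion column name differs per data frame
  df %>%
    group_by(ID) %>%
    summarise(across(all_of(cur_camp), mean))
})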
A slightly different approach would be to bind the dataframes and use pivot_longer to separate the erosion name from the erosion value. Then you can take the mean of the values without having to specify the name.
library(tidyverse)
df1 <- data.frame(ID = c(1,1,2), ERO13 = c(2,4,6))
df2 <- data.frame(ID = c(4,4,6), ERO17 = c(4,5,12))
bind_rows(df1, df2) %>%
  pivot_longer(starts_with('ERO'),
               names_to = 'ERO',
               values_drop_na = TRUE) %>%
  group_by(ID, ERO) %>%
  summarize(value = mean(value))
#> `summarise()` has grouped output by 'ID'. You can override using the `.groups` argument.
#> # A tibble: 4 x 3
#> # Groups: ID [4]
#> ID ERO value
#> <dbl> <chr> <dbl>
#> 1 1 ERO13 3
#> 2 2 ERO13 6
#> 3 4 ERO17 4.5
#> 4 6 ERO17 12
Created on 2022-01-14 by the reprex package (v2.0.0)
I'm trying to calculate the cumulative sum of Twitter followers for each gvkey separately. I use group_by, but the output is still summed over the entire column, so I suppose the problem is the for (i in 1:nrow(predmod_e)) loop:
predmod_e <- predmod_e %>%
  arrange(gvkey, date) %>% # arrange by gvkey and date
  group_by(gvkey)          # group_by for the per-gvkey calculation

for (i in 1:nrow(predmod_e)) {
  predmod_e[i + 1, ]$x <- predmod_e[i + 1, ]$x + predmod_e[i, ]$x
} # for loop to calculate
Perhaps just this:
predmod_e <- predmod_e %>%
  arrange(gvkey, date) %>%
  group_by(gvkey) %>%
  mutate(newx = cumsum(x))
If you want to do something with the groups yourself (i.e., not with a dplyr verb), then you should use the groups as they are "known" by the tidy verbs. Luckily, they are merely stored as an attribute:
mtcars %>%
group_by(cyl) %>%
attr(., "groups")
# # A tibble: 3 x 2
# cyl .rows
# <dbl> <list>
# 1 4 <int [11]>
# 2 6 <int [7]>
# 3 8 <int [14]>
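For example, a minimal sketch of using those stored row indices yourself, with the mtcars grouping above:
library(dplyr)

grouped <- mtcars %>% group_by(cyl)
grp <- attr(grouped, "groups")  # one row per group, with a list-column of row indices

# compute a per-group mean "by hand" from the stored indices
sapply(grp$.rows, function(rows) mean(grouped$mpg[rows]))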
I'm trying to compute the number of months between two dates within dplyr::mutate but run into the error
Error in mutate_impl(.data, dots) : 'from' must be of length 1
Is there something about seq that is incompatible with mutate?
library(dplyr)
dset <- data.frame(f = as.Date(c("2016-03-04", "2016-12-13", "2017-03-01")),
                   o = as.Date(c("2016-03-04", "2016-12-13", "2017-06-02")))
dset %>% mutate(y = length(seq(from = f, to = o, by = 'month')) - 1)
To work around it, you can either use sapply or mapply. Otherwise, you can extract the month from the date using functions in lubridate and then compute the difference.
library(dplyr)
library(lubridate)
# sapply
dset %>%
  mutate(y = sapply(1:length(f), function(i) length(seq(f[i], o[i], by = "month")) - 1))

# mapply
dset %>%
  mutate(y = mapply(function(x, y) length(seq(x, y, by = "month")) - 1, f, o))
# function in lubridate
dset %>% mutate(y=month(o) - month(f))
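Note that month(o) - month(f) only works when both dates fall in the same calendar year; a hedged sketch of a year-safe variant using lubridate's interval arithmetic:
library(dplyr)
library(lubridate)

# number of whole months between the two dates, robust across year boundaries
dset %>% mutate(y = interval(f, o) %/% months(1))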
You need to group, iterate, or adjust such that each from and to parameter is length 1 (seq(1, 5) is fine; seq(1:2, 5:6) is not), which means rowwise or maybe group_by_all:
library(dplyr)
dset <- data.frame(f = as.Date(c("2016-03-04", "2016-12-13", "2017-03-01")),
                   o = as.Date(c("2016-03-04", "2016-12-13", "2017-06-02")))
dset %>%
  rowwise() %>%
  mutate(y = length(seq(f, o, by = 'month')) - 1)
#> Source: local data frame [3 x 3]
#> Groups: <by row>
#>
#> # A tibble: 3 x 3
#> f o y
#> <date> <date> <int>
#> 1 2016-03-04 2016-03-04 0
#> 2 2016-12-13 2016-12-13 0
#> 3 2017-03-01 2017-06-02 3
You may want to also use dplyr for this:
dset <- data.frame(f = as.Date(c("2016-03-04", "2016-12-13", "2017-03-01")),
                   o = as.Date(c("2016-03-04", "2016-12-13", "2017-06-02")))
# difftime(o, f) so later minus earlier gives a positive difference
dset %>% mutate(y = as.numeric(difftime(o, f, units = "weeks")) / 4)
"alistaire" has done some typo mistake, so the answer is wrong
dset %>%
rowwise() %>%
mutate(y = length(seq(f, o, by = 'month')) - 1)
Source: local data frame [3 x 3]
Groups: <by row>
# A tibble: 3 x 3
f o y
<date> <date> <dbl>
1 2016-03-04 2016-03-04 0
2 2016-12-13 2016-12-13 0
3 2017-03-01 2017-06-02 3
Let's say I have the following data frame:
(dat = data_frame(v1 = c(rep("a", 3), rep("b", 3), rep("c", 4)), v2 = 1:10))
# A tibble: 10 × 2
# v1 v2
# <chr> <int>
# 1 a 1
# 2 a 2
# 3 a 3
# 4 b 4
# 5 b 5
# 6 b 6
# 7 c 7
# 8 c 8
# 9 c 9
# 10 c 10
What I want to be able to do is compute a sum for each group (i.e. "a", "b", and "c") that is equal to the sum of v2 where v1 is not equal to the grouping value. So it should look like this:
# A tibble: 3 × 2
# v1 sum
# <chr> <int>
# 1 a 49
# 2 b 40
# 3 c 21
Based on what I've been seeing online, this looks like a job for do, but I can't wrap my head around how to achieve this. I thought it would look something like this:
x %>%
  group_by(v1) %>%
  do(data.frame(sum = sum(.$v2[x$v1 != unique(.$v1)])))
But this just gives me a dataframe with sum equal to NA for all three groups. How would I go about doing this?
Maybe it is easier using an intermediate column:
dat %>% mutate(total = sum(v2)) %>% group_by(v1) %>% summarize(sum = max(total) - sum(v2))
You can nest and then index the list column negatively:
library(tidyverse)
dat %>% nest(v2) %>% mutate(sum = map_int(seq(n()), ~sum(unlist(data[-.x]))))
## # A tibble: 3 × 3
## v1 data sum
## <chr> <list> <int>
## 1 a <tibble [3 × 1]> 49
## 2 b <tibble [3 × 1]> 40
## 3 c <tibble [4 × 1]> 21
The advantage of this approach is that it's really easy to save the original data and align the computed values with them.
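For instance, a small sketch (using the newer nest(data = ...)/unnest() syntax rather than the nest(v2) call above) of carrying the group-wise sums back to the original rows:
library(tidyverse)

dat %>%
  nest(data = v2) %>%
  mutate(sum = map_int(seq(n()), ~ sum(unlist(data[-.x])))) %>%
  unnest(data)  # each original row now carries the sum of the other groups' v2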
A small function without using dplyr:
dat <- data_frame(v1 = c(rep("a", 3), rep("b", 3), rep("c", 4)), v2 = 1:10)

test_func <- function(df){
  a <- sum(df[df$v1 != "a", ][, 2])
  b <- sum(df[df$v1 != "b", ][, 2])
  c <- sum(df[df$v1 != "c", ][, 2])
  out <- rbind(a, b, c)
  return(out)
}
test_func(dat)
[,1]
a 49
b 40
c 21
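The same idea without hard-coding the group labels, still in base R (a small sketch):
groups <- unique(dat$v1)
sapply(groups, function(g) sum(dat$v2[dat$v1 != g]))
#  a  b  c
# 49 40 21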
#67342343's solution seems like the way to go here. If you have more complex overlapping/excluded groups, then maybe something like the following would be helpful:
library(tidyverse)
dat = data_frame(v1 = rep(letters[1:5], 3), v2 = 1:(5*3))
c(combn(unique(dat$v1), 2, simplify = FALSE),
  combn(unique(dat$v1), 3, simplify = FALSE)) %>%
  map_df(~ dat %>%
           group_by(v1) %>%
           summarise(v2 = sum(v2)) %>%
           filter(v1 %in% .x) %>%
           ungroup %>%
           summarise(groups = paste(.x, collapse = ","),
                     sum = sum(v2)))
groups sum
1 a,b 39
2 a,c 42
3 a,d 45
4 a,e 48
5 b,c 45
...
18 b,c,e 75
19 b,d,e 78
20 c,d,e 81
Keeping it simple:
dat %>% group_by(v1) %>% summarize(foo = sum(dat$v2) - sum(v2))
This is crass if you are in the middle of a long dplyr chain and have modified dat. (But then, why not relax and just store your data?)
It seems the number of resulting rows differs when using distinct vs unique. The data set I am working with is huge; I hope the code is clear enough.
dt2a <- select(dt, mutation.genome.position, mutation.cds,
               primary.site, sample.name, mutation.id) %>%
  group_by(mutation.genome.position, mutation.cds, primary.site) %>%
  mutate(occ = nrow(.)) %>%
  select(-sample.name) %>%
  distinct()
dim(dt2a)
[1] 2316382 5
## Using unique instead
dt2b <- select(dt, mutation.genome.position, mutation.cds,
               primary.site, sample.name, mutation.id) %>%
  group_by(mutation.genome.position, mutation.cds, primary.site) %>%
  mutate(occ = nrow(.)) %>%
  select(-sample.name) %>%
  unique()
dim(dt2b)
[1] 2837982 5
This is the file I am working with:
sftp://sftp-cancer.sanger.ac.uk/files/grch38/cosmic/v72/CosmicMutantExport.tsv.gz
dt = fread(fl)
This appears to be a result of the group_by. Consider this case:
dt <- data.frame(g = rep(c("a", "b"), each = 3),
                 v = c(2, 2, 5, 2, 7, 7))
dt %>% group_by(g) %>% unique()
# Source: local data frame [4 x 2]
# Groups: g
#
# g v
# 1 a 2
# 2 a 5
# 3 b 2
# 4 b 7
dt %>% group_by(g) %>% distinct()
# Source: local data frame [2 x 2]
# Groups: g
#
# g v
# 1 a 2
# 2 b 2
dt %>% group_by(g) %>% distinct(v)
# Source: local data frame [4 x 2]
# Groups: g
#
# g v
# 1 a 2
# 2 a 5
# 3 b 2
# 4 b 7
When you use distinct() without indicating which variables to make distinct, it appears to use the grouping variable.
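In any case, being explicit about which columns should define uniqueness avoids depending on this version-specific default. A small sketch with the same toy data:
library(dplyr)

dt <- data.frame(g = rep(c("a", "b"), each = 3),
                 v = c(2, 2, 5, 2, 7, 7))

# list the columns that should define a duplicate, with or without grouping
dt %>% distinct(g, v)
dt %>% group_by(g) %>% distinct(v)  # grouping variables are always kept in the result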