Using the seq() function within dplyr::mutate in R

I'm trying to compute the number of months between two dates within dplyr::mutate but run into the error
Error in mutate_impl(.data, dots) : 'from' must be of length 1
Is there something about seq that is incompatible with mutate?
library(dplyr)
dset <- data.frame(f = as.Date(c("2016-03-04", "2016-12-13", "2017-03-01")),
                   o = as.Date(c("2016-03-04", "2016-12-13", "2017-06-02")))
dset %>% mutate(y = length(seq(from = f, to = o, by = 'month')) - 1)

To work around it, you can use either sapply or mapply to apply seq one row at a time. Alternatively, you can extract the month from each date with lubridate and compute the difference.
library(dplyr)
library(lubridate)
# sapply
dset %>%
  mutate(y = sapply(seq_along(f), function(i) length(seq(f[i], o[i], by = "month")) - 1))
# mapply
dset %>%
  mutate(y = mapply(function(x, y) length(seq(x, y, by = "month")) - 1, f, o))
# lubridate (only valid when both dates fall in the same year)
dset %>% mutate(y = month(o) - month(f))

You need to group, iterate, or adjust such that each from and to parameter is length 1 (seq(1, 5) is fine; seq(1:2, 5:6) is not), which means rowwise or maybe group_by_all:
library(dplyr)
dset <- data.frame(f = as.Date(c("2016-03-04", "2016-12-13", "2017-03-01")),
                   o = as.Date(c("2016-03-04", "2016-12-13", "2017-06-02")))
dset %>%
  rowwise() %>%
  mutate(y = length(seq(f, o, by = 'month')) - 1)
#> Source: local data frame [3 x 3]
#> Groups: <by row>
#>
#> # A tibble: 3 x 3
#>   f          o              y
#>   <date>     <date>     <int>
#> 1 2016-03-04 2016-03-04     0
#> 2 2016-12-13 2016-12-13     0
#> 3 2017-03-01 2017-06-02     3
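If you prefer to avoid rowwise(), a purrr sketch (my addition, applying the same per-row logic pairwise) gives identical results:
library(purrr)
dset %>%
  mutate(y = map2_int(f, o, ~ length(seq(.x, .y, by = "month")) - 1L))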

You can also approximate this directly in dplyr with difftime (note the argument order; o - f keeps the result positive):
dset <- data.frame(f = as.Date(c("2016-03-04", "2016-12-13", "2017-03-01")),
                   o = as.Date(c("2016-03-04", "2016-12-13", "2017-06-02")))
dset %>% mutate(y = as.numeric(difftime(o, f, units = "weeks")) / 4)
Keep in mind that difftime has no "months" unit, so weeks divided by four is only a rough approximation of months.
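If lubridate is an option, time_length() gives exact fractional months instead of the weeks/4 approximation (a sketch of my own, not part of the original answer):
library(lubridate)
dset %>% mutate(y = time_length(interval(f, o), unit = "months"))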

"alistaire" has done some typo mistake, so the answer is wrong
dset %>%
rowwise() %>%
mutate(y = length(seq(f, o, by = 'month')) - 1)
Source: local data frame [3 x 3]
Groups: <by row>
# A tibble: 3 x 3
f o y
<date> <date> <dbl>
1 2016-03-04 2016-03-04 0
2 2016-12-13 2016-12-13 0
3 2017-03-01 2017-06-02 3

Related

R: Avoid error when counting sequential dates when all observations are NA

My task is to count the length (in days) of periods given by start/end dates extracted from a large dataset.
Here is some sample data.
library(tidyverse)
library(lubridate)  # for ymd()
data <- tibble(ID = rep.int(c(1, 2), times = c(3, 2)),
               start = ymd(c("2022-03-03", "2022-03-03", "2022-03-04", "2022-03-20", "2022-03-22")),
               end = ymd(c("2022-03-03", "2022-03-04", "2022-03-07", "2022-03-22", "2022-03-23")))
data
# A tibble: 5 × 3
     ID start      end       
  <dbl> <date>     <date>    
1     1 2022-03-03 2022-03-03
2     1 2022-03-03 2022-03-04
3     1 2022-03-04 2022-03-07
4     2 2022-03-20 2022-03-22
5     2 2022-03-22 2022-03-23
I've figured this out with the method introduced here.
data2 <- data %>%
  rowwise() %>%
  do(tibble(ID = .$ID,
            Date = seq(.$start, .$end, by = 1))) %>%
  distinct() %>%
  ungroup() %>%
  count(ID)
data2
# A tibble: 2 × 2
     ID     n
  <dbl> <int>
1     1     5
2     2     4
However, occasionally all the observations in the extracted start/end columns are NA. Then the method above fails at seq() because there is no data, like this:
na_data <- tibble(ID = rep.int(c(1, 2), times = c(3, 2)),
                  start = ymd(NA),
                  end = ymd(NA))
na_data
# A tibble: 5 × 3
     ID start  end   
  <dbl> <date> <date>
1     1 NA     NA    
2     1 NA     NA    
3     1 NA     NA    
4     2 NA     NA    
5     2 NA     NA    
na_data %>%
  rowwise() %>%
  do(tibble(ID = .$ID,
            Date = seq(.$start, .$end, by = 1))) %>%
  distinct() %>%
  ungroup() %>%
  count(ID)
Error in seq.int(0, to0 - from, by) : 'to' must be a finite number
It is difficult for me to check beforehand whether all the data in the selected columns are NA, because I have many processes of this kind to run simultaneously on data from the same dataset.
To run them, I usually select the entire script in RStudio with Ctrl+A and run it, but the error message interrupts my tasks partway through.
Does anyone have a solution that handles all-NA data, or a way to avoid the interruption and proceed to the next code?
Thank you.
This solution (1) creates lubridate Intervals for each row; (2) merges them by group using a modification of @AllanCameron's int_merge() function that handles NAs; and (3) sums the days per Interval within each group.
To fully test it, I made two additional example datasets (defined at the bottom): one including discontinuous date intervals, and one where only some values are NA.
library(lubridate)
library(dplyr)

int_merge <- function(x, na.rm = FALSE) {
  if (na.rm) {
    if (all(is.na(x))) return(interval(NA, NA))
    if (any(is.na(x))) x <- x[!is.na(x)]
  } else {
    if (any(is.na(x))) return(interval(NA, NA))
  }
  if (length(x) == 1) return(x)
  # sort by start date, then fold overlapping intervals together
  x <- x[order(int_start(x))]
  y <- x[1]
  for (i in 2:length(x)) {
    if (int_overlaps(y[length(y)], x[i]))
      y[length(y)] <- interval(start = min(int_start(c(y[length(y)], x[i]))),
                               end = max(int_end(c(y[length(y)], x[i]))))
    else
      y <- c(y, x[i])
  }
  return(y)
}
data %>%
  mutate(interval = interval(start, end)) %>%
  group_by(ID) %>%
  summarize(
    interval = sum(as.numeric(int_merge(interval), unit = "days") + 1)
  )
#> # A tibble: 2 × 2
#>      ID interval
#>   <dbl>    <dbl>
#> 1     1        5
#> 2     2        4
discontinuous_data %>%
  mutate(interval = interval(start, end)) %>%
  group_by(ID) %>%
  summarize(
    interval = sum(as.numeric(int_merge(interval), unit = "days") + 1)
  )
#> # A tibble: 2 × 2
#>      ID interval
#>   <dbl>    <dbl>
#> 1     1        8
#> 2     2        4
na_data %>%
  mutate(interval = interval(start, end)) %>%
  group_by(ID) %>%
  summarize(
    interval = sum(as.numeric(int_merge(interval), unit = "days") + 1)
  )
#> # A tibble: 2 × 2
#>      ID interval
#>   <dbl>    <dbl>
#> 1     1       NA
#> 2     2       NA
partial_na_data %>%
  mutate(interval = interval(start, end)) %>%
  group_by(ID) %>%
  summarize(
    interval = sum(as.numeric(int_merge(interval), unit = "days") + 1)
  )
#> # A tibble: 2 × 2
#>      ID interval
#>   <dbl>    <dbl>
#> 1     1       NA
#> 2     2        4
partial_na_data %>%  # with `na.rm = TRUE`
  mutate(interval = interval(start, end)) %>%
  group_by(ID) %>%
  summarize(
    interval = sum(as.numeric(int_merge(interval, na.rm = TRUE), unit = "days") + 1)
  )
#> # A tibble: 2 × 2
#>      ID interval
#>   <dbl>    <dbl>
#> 1     1        7
#> 2     2        4
Created on 2022-11-25 with reprex v2.0.2
Additional example data:
discontinuous_data <- tibble(ID = rep.int(c(1, 2), times = c(3, 2)),
                             start = ymd(c("2022-03-03", "2022-03-03", "2022-03-10", "2022-03-20", "2022-03-22")),
                             end = ymd(c("2022-03-03", "2022-03-04", "2022-03-15", "2022-03-22", "2022-03-23")))
partial_na_data <- tibble(ID = rep.int(c(1, 2), times = c(3, 2)),
                          start = ymd(c("2022-03-03", "2022-03-03", "2022-03-10", "2022-03-20", "2022-03-22")),
                          end = ymd(c("2022-03-03", NA, "2022-03-15", "2022-03-22", "2022-03-23")))

Unexpected results using eval() in R

I have a column called "equation" which stores formulas in terms of "t". Another column is "t". I want to calculate each equation's value (y) using the t in that row. Below is an example.
library(magrittr); library(dplyr)
dt <- data.frame(t = c(1, 2, 3),
                 equation = c("t+1", "5*t", "t^3"))
dt %<>%
  mutate(y = eval(parse(text = equation)))
However, the results are not as expected:
  t equation  y
1 1      t+1  1
2 2      5*t  8
3 3      t^3 27
The expected result for y is 2, 10, 27 (only the third value is correct). What should I do to fix it?
This is because eval(parse()) isn't vectorised. You can get around this using rowwise():
library(magrittr)
library(dplyr, warn.conflicts = FALSE)
dt <- data.frame(
  t = c(1, 2, 3),
  equation = c("t+1", "5*t", "t^3")
)
dt %<>%
  rowwise() %>%
  mutate(y = eval(parse(text = equation))) %>%
  ungroup()
dt
#> # A tibble: 3 × 3
#>       t equation     y
#>   <dbl> <chr>    <dbl>
#> 1     1 t+1          2
#> 2     2 5*t         10
#> 3     3 t^3         27
Created on 2022-10-14 with reprex v2.0.2
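The mechanism is worth seeing in isolation: parse() turns the whole column into a three-expression vector, and eval() evaluates each expression in turn but returns only the value of the last one (t^3), computed on the full t vector, which mutate() then recycles. A small demo (my addition):
exprs <- parse(text = c("t+1", "5*t", "t^3"))
t <- c(1, 2, 3)
eval(exprs)  # value of the last expression only
#> [1] 1 8 27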

Iterating name of a field with dplyr::summarise function

First time for me here, so I'll try to explain my problem as clearly as possible.
I'm working on erosion data for farms stored as pixels (e.g. 1 farm = 10 pixels, so 10 rows in my df). I have 4 dfs in a list, and I would like to calculate the mean erosion for each farm. I thought about looping over the name of the erosion field, but my dfs don't share the same column name (it is either ERO13 or ERO17). I don't want to rely on the position of the field, because it could change between the dfs; I only want to use the name, which varies.
Here's a example :
df1 <- data.frame(ID = c(1, 1, 2), ERO13 = c(2, 4, 6))
df2 <- data.frame(ID = c(4, 4, 6), ERO17 = c(4, 5, 12))
lst_df <- list(df1, df2)
for (df in lst_df) {
  cur_df <- df
  cur_df <- cur_df %>%
    group_by(ID) %>%
    summarise(current_name_of_erosion_field = mean(current_name_of_erosion_field))
}
I tried
for (df in lst_df) {
  cur_df <- df
  cur_camp <- names(cur_df)[2]
  cur_df <- cur_df %>%
    group_by(ID) %>%
    summarise(cur_camp = mean(cur_camp))
}
but this doesn't work, because cur_camp is a character string rather than the column it names (and it still relies on the column's position to find the name).
How can I build the current_name_of_erosion_field here?
We may convert the string to a symbol and evaluate it (!!), or pass the string to across(). As we are using a for loop, make sure to create a list to store the output. Also, to assign a name stored in an object, use := together with !!:
out <- vector('list', length(lst_df))
for (i in seq_along(lst_df)) {
  cur_df <- lst_df[[i]]
  cur_camp <- names(cur_df)[2]
  cur_df <- cur_df %>%
    group_by(ID) %>%
    summarise(!!cur_camp := mean(!!sym(cur_camp)))
  out[[i]] <- cur_df
}
-output
> out
[[1]]
# A tibble: 2 × 2
     ID ERO13
  <dbl> <dbl>
1     1     3
2     2     6

[[2]]
# A tibble: 2 × 2
     ID ERO17
  <dbl> <dbl>
1     4   4.5
2     6    12
Or we may use across:
out <- vector('list', length(lst_df))
for (i in seq_along(lst_df)) {
  cur_df <- lst_df[[i]]
  cur_camp <- names(cur_df)[2]
  cur_df <- cur_df %>%
    group_by(ID) %>%
    summarise(across(all_of(cur_camp), mean))
  out[[i]] <- cur_df
}
-output
> out
[[1]]
# A tibble: 2 × 2
     ID ERO13
  <dbl> <dbl>
1     1     3
2     2     6

[[2]]
# A tibble: 2 × 2
     ID ERO17
  <dbl> <dbl>
1     4   4.5
2     6    12
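dplyr (>= 1.0.0) also understands glue strings on the left-hand side of :=, which avoids the explicit sym()/!! step; a sketch of the same loop body (my addition):
out <- vector('list', length(lst_df))
for (i in seq_along(lst_df)) {
  cur_df <- lst_df[[i]]
  cur_camp <- names(cur_df)[2]
  out[[i]] <- cur_df %>%
    group_by(ID) %>%
    summarise("{cur_camp}" := mean(.data[[cur_camp]]))
}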
A slightly different approach would be to bind the dataframes and use pivot_longer to separate the erosion name from the erosion value. Then you can take the mean of the values without having to specify the name.
library(tidyverse)
df1 <- data.frame(ID = c(1, 1, 2), ERO13 = c(2, 4, 6))
df2 <- data.frame(ID = c(4, 4, 6), ERO17 = c(4, 5, 12))
bind_rows(df1, df2) %>%
  pivot_longer(starts_with('ERO'),
               names_to = 'ERO',
               values_drop_na = TRUE) %>%
  group_by(ID, ERO) %>%
  summarize(value = mean(value))
#> `summarise()` has grouped output by 'ID'. You can override using the `.groups` argument.
#> # A tibble: 4 x 3
#> # Groups:   ID [4]
#>      ID ERO   value
#>   <dbl> <chr> <dbl>
#> 1     1 ERO13   3
#> 2     2 ERO13   6
#> 3     4 ERO17   4.5
#> 4     6 ERO17  12
Created on 2022-01-14 by the reprex package (v2.0.0)

Is there some function to keep unique values in R dplyr with group_by?

I have a data.frame (or tibble or whatever) with an id variable. Often I perform some operation grouped by this id with dplyr::group_by, so
data %>%
  group_by(id) %>%
  summarise/mutate/...()
Often, I have other non-numeric variables that are unique for each id, such as the project or country to which the id belongs and other characteristics of the id (such as gender, etc.). When I use the summarise function above, these other variables are lost unless I specify either
data %>%
  group_by(id) %>%
  summarise(across(c(project, country, gender, ...), unique), ...)
or
data %>%
  group_by(id, project, country, gender, ...) %>%
  summarise()
Is there some function which detects these variables that are unique for each id, so that one does not have to specify them?
Thank you!
PS: I am asking mainly about dplyr and group_by related functions, but other environments like base R or data.table are welcome too.
I did not test it extensively, yet it should do the job:
library(dplyr)
myData <- tibble(X = c(1, 1, 2, 2, 2, 3),
                 Y = LETTERS[c(1, 1, 2, 2, 2, 3)],
                 R = rnorm(6))
myData
#> # A tibble: 6 x 3
#>       X Y          R
#>   <dbl> <chr>  <dbl>
#> 1     1 A      0.463
#> 2     1 A     -0.965
#> 3     2 B     -0.403
#> 4     2 B     -0.417
#> 5     2 B     -2.28
#> 6     3 C      0.423
group_by_id_vars <- function(.data, ...) {
  # group by the prespecified ID variables
  .data <- .data %>% group_by(...)
  # how many groups do these IDs determine
  ID_groups <- .data %>% n_groups()
  # get the number of groups when the initial grouping variables are combined
  # with each other variable
  groupVars <- sapply(substitute(list(...))[-1], deparse)  # specified grouping variables
  nms <- names(.data)  # all variables in .data
  res <- sapply(nms[!nms %in% groupVars],
                function(x) {
                  .data %>%
                    # important to specify .add = TRUE to combine the variable
                    # with the IDs
                    group_by(across(all_of(x)), .add = TRUE) %>%
                    n_groups()
                })
  # which combinations are identical, i.e. the variable does not increase the
  # number of groups in the data when combined with the ID vars
  v <- names(res)[which(res == ID_groups)]
  # group the data accordingly
  .data <- .data %>% ungroup() %>% group_by(across(all_of(c(groupVars, v))))
  return(.data)
}
myData %>%
  group_by_id_vars(X) %>%
  summarise(n = n())
#> `summarise()` regrouping output by 'X' (override with `.groups` argument)
#> # A tibble: 3 x 3
#> # Groups:   X [3]
#>       X Y         n
#>   <dbl> <chr> <int>
#> 1     1 A         2
#> 2     2 B         3
#> 3     3 C         1
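For comparison, a more pedestrian sketch (my addition) makes the same check explicit: count the distinct values of every column within each group, then add to the grouping any column that never exceeds one value per group:
library(dplyr)
n_per_group <- myData %>%
  group_by(X) %>%
  summarise(across(everything(), n_distinct))
constant <- sapply(select(n_per_group, -X), function(col) all(col == 1))
group_vars <- c("X", names(constant)[constant])
myData %>%
  group_by(across(all_of(group_vars))) %>%
  summarise(n = n(), .groups = "drop")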
This is a bit more advanced in application, but what you are looking for are linear combinations of your grouping variables. You can convert the non-numeric columns to factors and then use some linear algebra: findLinearCombos() from caret will locate these combinations. It takes a bit of work to get it all organized the way I think you want, though.
Something like this may do the trick. I also have not tested this extensively.
Packages
library(dplyr)
library(caret)
library(purrr)
Function
group_by_lc <- function(.data, ..., .add = FALSE, .drop = group_by_drop_default(.data)) {
  # capture the ... and convert to a character vector
  .groups <- rlang::ensyms(...)
  .groups_chr <- map_chr(.groups, rlang::as_name)
  # convert all character and factor variables to integer codes
  d <- .data %>%
    mutate(across(where(is.factor), as.character),
           across(where(is.character), as.factor),
           across(where(is.factor), as.integer))
  # find linear combinations of the columns
  lc <- caret::findLinearCombos(d)
  # see if any of your grouping variables have linear combinations
  find_group_match <- function(known_groups, lc_pair) {
    if (any(lc_pair %in% known_groups)) unique(c(lc_pair, known_groups)) else NULL
  }
  # convert column indices to names
  lc_pairs <- map(lc$linearCombos, ~ names(d)[.x])
  # iteratively look for linear combinations of known grouping variables
  lc_cols <- reduce(lc_pairs, find_group_match, .init = .groups_chr)
  # find new grouping variables
  added_groups <- rlang::syms(lc_cols[!(lc_cols %in% .groups_chr)])
  # apply the grouping to your groups and the linear combinations
  group_by(.data, !!!.groups, !!!added_groups, .add = .add, .drop = .drop)
}
Usage
data <- tibble(V = LETTERS[1:10], W = letters[1:10], X = paste0(V, W),
               Y = rep(LETTERS[1:5], each = 2), Z = runif(10))
group_by_lc(data, W)
Result
You can see how it added in all the other grouping variables. You can rework this in other ways; the key part is building that added_groups list to find them.
# A tibble: 10 x 5
# Groups:   W, X, V [10]
   V     W     X     Y          Z
   <chr> <chr> <chr> <chr>  <dbl>
 1 A     a     Aa    A     0.884
 2 B     b     Bb    A     0.133
 3 C     c     Cc    B     0.194
 4 D     d     Dd    B     0.407
 5 E     e     Ee    C     0.256
 6 F     f     Ff    C     0.0976
 7 G     g     Gg    D     0.635
 8 H     h     Hh    D     0.0542
 9 I     i     Ii    E     0.0104
10 J     j     Jj    E     0.464

dplyr mutate calling another dataframe

I would like to mutate a dataframe by applying a function which calls out to another dataframe. I can achieve this in a few different ways, but would like to know how to do this 'properly'.
Here is an example of what I'm trying to do. I have a dataframe with some start times, and a second with some timed observations. I would like to return a dataframe with the start times, and the number of observations that occur within some window after the start time. e.g.
set.seed(1337)
df1 <- data.frame(id = LETTERS[1:3], start_time = 1:3 * 10)
df2 <- data.frame(time = runif(100) * 100)
lapply(df1$start_time, function(s) sum(df2$time > s & df2$time < (s + 15)))
The best I've got so far with dplyr is the following (but this loses the identity variables):
df1 %>%
  rowwise() %>%
  do(count = filter(df2, time > .$start_time, time < (.$start_time + 15))) %>%
  mutate(n = nrow(count))
output:
Source: local data frame [3 x 2]
Groups: <by row>
# A tibble: 3 × 2
                  count     n
                 <list> <int>
1 <data.frame [17 × 1]>    17
2 <data.frame [18 × 1]>    18
3 <data.frame [10 × 1]>    10
I was expecting to be able to do this:
df1 <- data.frame(id = LETTERS[1:3], start_time = 1:3 * 10)
df2 <- data.frame(time = runif(100) * 100)
df1 %>%
  group_by(id) %>%
  mutate(count = nrow(filter(df2, time > start_time, time < (start_time + 15))))
but this returns the error:
Error: comparison (6) is possible only for atomic and list types
What is the dplyr way of doing this?
Here is one option with data.table, where we can use a non-equi join:
library(data.table)  # 1.9.7+
setDT(df1)[, start_timeNew := start_time + 15]
setDT(df2)[df1, .(id, .N), on = .(time > start_time, time < start_timeNew),
           by = .EACHI][, c('id', 'N'), with = FALSE]
#   id  N
#1:  A 17
#2:  B 18
#3:  C 10
which gives the same counts as the OP's base R method:
sapply(df1$start_time, function(s) sum(df2$time > s & df2$time < (s + 15)))
#[1] 17 18 10
If we need the 'id' variable in the output as well, we can modify the OP's code:
df1 %>%
  rowwise() %>%
  do(data.frame(., count = filter(df2, time > .$start_time,
                                  time < (.$start_time + 15)))) %>%
  group_by(id) %>%
  summarise(n = n())
#      id     n
#  <fctr> <int>
#1      A    17
#2      B    18
#3      C    10
Or another option is map from purrr with dplyr:
library(purrr)
df1 %>%
  split(.$id) %>%
  map_df(~ mutate(., N = sum(df2$time > start_time & df2$time < start_time + 15))) %>%
  select(-start_time)
#  id  N
#1  A 17
#2  B 18
#3  C 10
Another slightly different approach using dplyr:
result <- df1 %>%
  group_by(id) %>%
  summarise(count = length(which(df2$time > start_time &
                                 df2$time < (start_time + 15))))
print(result)
## # A tibble: 3 x 2
##       id count
##   <fctr> <int>
## 1      A    17
## 2      B    18
## 3      C    10
I believe you can use length and which to count the number of occurrences for which your condition is true for each id in df1. Then group by id and use this to summarise.
If there is possibly more than one start_time per id, you can use the same function but rowwise and with mutate:
result <- df1 %>%
  rowwise() %>%
  mutate(count = length(which(df2$time > start_time &
                              df2$time < (start_time + 15))))
print(result)
## Source: local data frame [3 x 3]
## Groups: <by row>
##
## # A tibble: 3 x 3
##       id start_time count
##   <fctr>      <dbl> <int>
## 1      A         10    17
## 2      B         20    18
## 3      C         30    10
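Since dplyr 1.1.0 there is also a native non-equi join, which handles this without any row-wise work (a sketch under that version assumption):
library(dplyr)  # >= 1.1.0 for inequality conditions in join_by()
df1 %>%
  mutate(window_end = start_time + 15) %>%
  left_join(df2, join_by(start_time < time, window_end > time)) %>%
  group_by(id, start_time) %>%
  summarise(count = sum(!is.na(time)), .groups = "drop")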
