I have a dataset that looks like this:
ID created_at
MUM-0001 2014-04-16
MUM-0002 2014-01-14
MUM-0003 2014-04-17
MUM-0004 2014-04-12
MUM-0005 2014-04-18
MUM-0006 2014-04-17
I am trying to add a new column containing all dates between each start date and a fixed end date (say, 12 July 2015). I used the seq function inside dplyr but I am getting an error.
data1 <- data1 %>%
arrange(ID) %>%
group_by(ID) %>%
mutate(date = seq(as.Date(created_at), as.Date('2015-07-12'), by= 1))
The error I get is:
Error: incompatible size (453), expecting 1 (the group size) or 1
Can you suggest a better way to do this in R?
You could use data.table to get the sequence of Dates from 'created_at' to '2015-07-12', grouped by the 'ID' column.
library(data.table)
setDT(df1)[, list(date=seq(created_at, as.Date('2015-07-12'), by='1 day')) , ID]
If you need an option with dplyr, use do
library(dplyr)
df1 %>%
group_by(ID) %>%
do( data.frame(., Date= seq(.$created_at,
as.Date('2015-07-12'), by = '1 day')))
If there are duplicate IDs, group by row_number() instead:
df1 %>%
group_by(rn=row_number()) %>%
do(data.frame(ID= .$ID, Date= seq(.$created_at,
as.Date('2015-07-12'), by = '1 day'), stringsAsFactors=FALSE))
Update
Based on @Frank's comment, the new idiom for tidyverse is
library(tidyverse)
df1 %>%
group_by(ID) %>%
mutate(d = list(seq(created_at, as.Date('2015-07-12'), by='1 day')), created_at = NULL) %>%
unnest(cols = d)
With data.table, you can group by row number instead:
setDT(df1)[, list(date=seq(created_at,
as.Date('2015-07-12'), by = '1 day')), by = 1:nrow(df1)]
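For comparison, here is a minimal base R sketch of the same row-wise expansion, with no extra packages. The helper name expand_row and the variable end_date are illustrative and do not come from the answers above; the data frame is the question's sample rebuilt with data.frame().

```r
# The question's sample data, rebuilt with data.frame() for brevity.
df1 <- data.frame(
  ID = c("MUM-0001", "MUM-0002", "MUM-0003",
         "MUM-0004", "MUM-0005", "MUM-0006"),
  created_at = as.Date(c("2014-04-16", "2014-01-14", "2014-04-17",
                         "2014-04-12", "2014-04-18", "2014-04-17")),
  stringsAsFactors = FALSE
)
end_date <- as.Date("2015-07-12")  # fixed end date taken from the question

# Hypothetical helper: build one daily sequence for row i.
# Iterating over rows (not IDs) also copes with duplicate IDs.
expand_row <- function(i) {
  data.frame(
    ID = df1$ID[i],
    date = seq(df1$created_at[i], end_date, by = "1 day"),
    stringsAsFactors = FALSE
  )
}

# Stack the per-row data frames into one long data frame.
expanded <- do.call(rbind, lapply(seq_len(nrow(df1)), expand_row))

# Number of generated rows per ID.
table(expanded$ID)
```

The first ID gets 453 rows, which is the size reported in the question's error message: mutate() expected one value per row, but seq() returned the whole sequence.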
data
df1 <- structure(list(ID = c("MUM-0001", "MUM-0002", "MUM-0003",
"MUM-0004",
"MUM-0005", "MUM-0006"), created_at = structure(c(16176, 16084,
16177, 16172, 16178, 16177), class = "Date")), .Names = c("ID",
"created_at"), row.names = c(NA, -6L), class = "data.frame")
I have data in 6-month intervals (ID, 6-month-start-date, outcome value), but for some IDs, there are half years where the outcome is missing. Simplified example:
id = c("aa", "aa", "ab", "ab", "ab")
date = as.Date(c("2021-07-01", "2022-07-01", "2021-07-01", "2022-01-01", "2022-07-01"))
col3 = c(1,2,1,2,1)
df <- data.frame(id, date, col3)
For similar datasets where the date is monthly, I used complete(date = seq.Date(start_date, end_date, by = "month")) to fill in the missing months and set the outcome in the third column to 0.
I could do the following and expand the data to monthly, then create a new 6-month-start-date column, group by it and ID, and sum col3.
df_complete <- df %>% group_by(id) %>%
complete(date = seq.Date(as.Date(min(date)), as.Date(max(date) %m+% months(5)), by="month")) %>%
mutate (col3 = replace_na(col3, 0))
df_complete_6mth <- df_complete %>% mutate(
halfyear = ifelse(as.integer(format(date, '%m')) <= 6,
paste0(format(date, '%Y'), '-01-01'),
paste0(format(date, '%Y'), '-07-01'))) %>%
group_by(id, halfyear) %>%
summarise(col3_halfyear = sum(col3))
However, is there a solution where the "by =" argument specifies 6 months? I tried
df_complete <- df %>% group_by(id) %>%
complete(date = seq.Date(as.Date(min(date)), as.Date(max(date) %m+% months(5)), by="months(6)")) %>%
mutate (col3 = replace_na(col3, 0))
but it didn't work.
From the help for seq.Date:
by can be specified in several ways.
A number, taken to be in days.
A object of class difftime
A character string, containing one of "day", "week", "month",
"quarter" or "year". This can optionally be preceded by a (positive or
negative) integer and a space, or followed by "s".
So I expect you want:
library(dplyr); library(tidyr)
df %>%
group_by(id) %>%
complete(date = seq.Date(min(date), max(date), by="6 month"),
fill = list(col3 = 0))
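As a quick check of the help text quoted above, this base R sketch compares a valid by string with the one tried in the question. The dates are taken from the question's example; the variable names half_years and bad are illustrative.

```r
# An integer, a space and a unit name is a valid `by` string for seq() on dates.
half_years <- seq(as.Date("2021-07-01"), as.Date("2022-07-01"), by = "6 months")
print(half_years)

# A function call written inside a string, such as "months(6)", is not one of
# the accepted forms. try() captures the error without stopping the script.
bad <- try(
  seq(as.Date("2021-07-01"), as.Date("2022-07-01"), by = "months(6)"),
  silent = TRUE
)
inherits(bad, "try-error")
```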
Could you do something like this? Make a monthly sequence of dates and then take every sixth one, starting from the first.
library(lubridate)
dates <- seq(mdy("01-01-2020"), mdy("01-01-2023"), by="month")
dates[seq(1, length(dates), by=6)]
#> [1] "2020-01-01" "2020-07-01" "2021-01-01" "2021-07-01" "2022-01-01"
#> [6] "2022-07-01" "2023-01-01"
Created on 2023-02-08 by the reprex package (v2.0.1)
I'm trying to calculate the difference between dates in a single column, based on another column's value.
This is the result I'm looking for
Try this
library('dplyr')
df <- data.frame(id = c(1, 2, 3, 1, 2, 3),
Date = c('1/1/2020', '1/3/2020','1/1/2020','1/7/2020','1/6/2020','1/5/2020'))
df %>% mutate(Date = as.Date(Date, format='%m/%d/%Y')) %>%
group_by(id) %>%
mutate(DIFF = Date - lag(Date))
Here is a way using dplyr and lubridate (needed to make the dates behave when subtracting). It looks like you want the calculation to determine the number of days between the dates in a group by ID and the earliest date for that ID.
library(dplyr)
library(lubridate)
df %>%
mutate(Date = dmy(Date)) %>%
group_by(ID) %>%
mutate(Diff = Date - min(Date))
If you want to have NA instead of 0, you can do the following:
df %>%
mutate(Date = dmy(Date)) %>%
group_by(ID) %>%
mutate(Diff = if_else(Date == min(Date), NA_integer_, as.integer(Date - min(Date))))
I have this code, recommended by a Stack Overflow user, that works very well. I have several datasets that I wish to apply it to.
Do I have to rerun the code for each dataset separately, or can I store it in some sort of function?
I have the datasets df1, df2, df3 and df4. I do not wish to rbind them.
Dput for each dataset:
structure(list(Date = structure(1:6, .Label = c("1/2/2020 5:00:00 PM",
"1/2/2020 5:30:01 PM", "1/2/2020 6:00:00 PM", "1/5/2020 7:00:01 AM",
"1/6/2020 8:00:00 AM", "1/6/2020 9:00:00 AM"), class = "factor"),
Duration = c(20L, 30L, 10L, 5L, 2L, 8L)), class = "data.frame", row.names = c(NA,
-6L))
CODE:
df %>%
group_by(Date = as.Date(dmy_hms(Date))) %>%
summarise(Total_Duration = sum(Duration), Count = n())
This is what I have been doing for each dataset (and so on):
df1 %>%
group_by(Date = as.Date(dmy_hms(Date))) %>%
summarise(Total_Duration = sum(Duration), Count = n())
df2 %>%
group_by(Date = as.Date(dmy_hms(Date))) %>%
summarise(Total_Duration = sum(Duration), Count = n())
df3 %>%
group_by(Date = as.Date(dmy_hms(Date))) %>%
summarise(Total_Duration = sum(Duration), Count = n())
Is there a way to:
Store_code<-
df %>%
group_by(Date = as.Date(dmy_hms(Date))) %>%
summarise(Total_Duration = sum(Duration), Count = n())
and then apply each dataset easily to this code?
df1(Store_code)
df2(Store_code)
Any suggestion is appreciated.
We can use mget to collect all the objects into a list, then use map to loop over the list and apply the function:
library(dplyr)
library(lubridate)
library(purrr)
f1 <- function(dat) {
dat %>%
group_by(Date = as.Date(dmy_hms(Date))) %>%
summarise(Total_Duration = sum(Duration), Count = n())
}
lst1 <- map(mget(ls(pattern = "^df\\d+$")), f1)
Here, we assume the column names are the same in all the datasets, i.e. 'Date' and 'Duration'. If they differ, the column names can be passed as additional arguments to the function:
f2 <- function(dat, datecol, durationcol) {
dat %>%
group_by(Date = as.Date(dmy_hms({{datecol}}))) %>%
summarise(Total_Duration = sum({{durationcol}}), Count = n())
}
and apply the function as
f2(df1, Date, Duration)
Or in the loop
lst1 <- map(mget(ls(pattern = "^df\\d+$")), f2,
datecol = Date, durationcol = Duration)
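The same "write a function once, apply it to a list" idea can also be sketched in base R, without dplyr, lubridate or purrr. This is only an illustration under assumptions: summarise_by_day is a hypothetical helper name, the two small data frames are made up in the shape of the question's dput(), and the dates are parsed day-first to match dmy_hms().

```r
# Two small data frames shaped like the dput() in the question (values are illustrative).
df1 <- data.frame(
  Date = c("1/2/2020 5:00:00 PM", "1/2/2020 5:30:01 PM", "1/5/2020 7:00:01 AM"),
  Duration = c(20L, 30L, 5L)
)
df2 <- data.frame(
  Date = c("1/6/2020 8:00:00 AM", "1/6/2020 9:00:00 AM"),
  Duration = c(2L, 8L)
)

# Hypothetical helper: total duration and row count per calendar day.
summarise_by_day <- function(dat) {
  # Day-first parsing; as.Date() ignores the trailing time of day.
  day <- as.Date(as.character(dat$Date), format = "%d/%m/%Y")
  tmp <- data.frame(Date = day, Total_Duration = dat$Duration, Count = 1L)
  # Summing the constant 1 per group gives the row count.
  aggregate(cbind(Total_Duration, Count) ~ Date, data = tmp, FUN = sum)
}

# An explicitly named list keeps control over which objects are processed.
results <- lapply(list(df1 = df1, df2 = df2), summarise_by_day)
results$df1
```

Listing the data frames explicitly, rather than collecting them with mget(ls(...)), avoids accidentally picking up unrelated objects whose names happen to match the pattern.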
I have the following dataset:
DT <- data.frame(v1 = c(0,0,0,1,0,0,1))
I want to create an ID that counts up cumulatively and restarts after each value of 1.
The ID should be
ID<-c(1,2,3,4,1,2,3)
If you are using dplyr, this will do the trick.
library(dplyr)  # provides %>%
DT = data.frame(v1 = c(0,0,0,1,0,0,1))
DT %>%
dplyr::mutate(rno = row_number()) %>%
dplyr::mutate(group = ifelse(v1 == 0, NA, rno)) %>%
tidyr::fill(group, .direction = "up") %>%
dplyr::group_by(group) %>%
dplyr::mutate(ID = row_number()) %>%
dplyr::ungroup() %>%
dplyr::select(v1, ID)
In base R, we can use ave:
with(DT, ave(v1, c(0, cumsum(v1)[-length(v1)]), FUN = seq_along))
#[1] 1 2 3 4 1 2 3
In dplyr, we can use lag to create groups and assign row number in each group.
library(dplyr)
DT %>% group_by(gr = lag(cumsum(v1), default = 0)) %>% mutate(ID = row_number())
and we can use the same logic in data.table:
library(data.table)
setDT(DT)[, ID := seq_len(.N), shift(cumsum(v1), fill = 0)]
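All three versions rely on the same grouping trick, which may be easier to follow step by step. The sketch below uses base R only; the intermediate names running and grp are illustrative.

```r
v1 <- c(0, 0, 0, 1, 0, 0, 1)

# Running count of 1s seen so far, including the current row.
running <- cumsum(v1)                    # 0 0 0 1 1 1 2

# Shift by one position so that each 1 still belongs to the group it closes.
grp <- c(0, running[-length(running)])   # 0 0 0 0 1 1 1

# Number the rows within each group.
ID <- ave(v1, grp, FUN = seq_along)      # 1 2 3 4 1 2 3
ID
```

Without the shift, each 1 would start a new group instead of ending the current one, and the counter would restart one row too early.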