library(dplyr)
library(lubridate)
I have some data like this:
f <- tribble(~a, ~date,
"BVH", 201801,
"HBYU", 202012,
"CYC", 202112,
"AC", 202109)
I need to transform the date column to the last day of each month. Currently I do this:
f %>% mutate(date = ym(date))
# A tibble: 4 x 2
a date
<chr> <date>
1 BVH 2018-01-01
2 HBYU 2020-12-01
3 CYC 2021-12-01
4 AC 2021-09-01
What I would like is this:
# A tibble: 4 x 2
a date
<chr> <date>
1 BVH 2018-01-31
2 HBYU 2020-12-31
3 CYC 2021-12-31
4 AC 2021-09-30
lubridate's rollforward does exactly what you want:
> f %>% mutate(date = rollforward(ym(date)))
# A tibble: 4 x 2
a date
<chr> <date>
1 BVH 2018-01-31
2 HBYU 2020-12-31
3 CYC 2021-12-31
4 AC 2021-09-30
Here are two other approaches -
library(dplyr)
library(lubridate)
f %>%
mutate(date = ym(date),
date1 = date + months(1) - 1,
date2 = ceiling_date(date, 'month') - 1)
# a date date1 date2
# <chr> <date> <date> <date>
#1 BVH 2018-01-01 2018-01-31 2018-01-31
#2 HBYU 2020-12-01 2020-12-31 2020-12-31
#3 CYC 2021-12-01 2021-12-31 2021-12-31
#4 AC 2021-09-01 2021-09-30 2021-09-30
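For reuse, the ceiling_date approach above can be wrapped in a small helper. This is only a sketch: the name last_day_of_month is made up for illustration, and it assumes the input is always a valid YYYYMM value, numeric or character. The leap-year February is included as a quick sanity check.

```r
library(lubridate)

# Hypothetical helper (not part of lubridate): YYYYMM -> last day of that month
last_day_of_month <- function(yyyymm) {
  first_day <- ym(as.character(yyyymm))       # parses to the 1st of the month
  ceiling_date(first_day, "month") - days(1)  # 1st of the next month, minus one day
}

last_day_of_month(c(201801, 202002, 202109))
# [1] "2018-01-31" "2020-02-29" "2021-09-30"
```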
Related
I am trying to expand the data below to a daily basis, based on the range given by the start_date and end_date columns,
to get this output (sum):
Please use dput() when posting data frames next time!
Example data
# A tibble: 4 × 4
id start end inventory
<int> <chr> <chr> <dbl>
1 1 01/05/2022 02/05/2022 100
2 2 10/05/2022 15/05/2022 50
3 3 11/05/2022 21/05/2022 80
4 4 14/05/2022 17/05/2022 10
Transform the data
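library(dplyr)
library(tidyr)
# Parse the dates, reshape to one row per boundary date, and keep the result,
# because the next step reads df$date
df <- df %>%
  mutate(across(2:3, ~ as.Date(.x, format = "%d/%m/%Y"))) %>%
  pivot_longer(cols = c(start, end), values_to = "date") %>%
  arrange(date) %>%
  select(date, inventory)
df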
# A tibble: 8 × 2
date inventory
<date> <dbl>
1 2022-05-01 100
2 2022-05-02 100
3 2022-05-10 50
4 2022-05-11 80
5 2022-05-14 10
6 2022-05-15 50
7 2022-05-17 10
8 2022-05-21 80
Expand the dates and left_join
left_join(tibble(date = seq(first(df$date),
last(df$date),
by = "day")), df)
# A tibble: 21 × 2
date inventory
<date> <dbl>
1 2022-05-01 100
2 2022-05-02 100
3 2022-05-03 NA
4 2022-05-04 NA
5 2022-05-05 NA
6 2022-05-06 NA
7 2022-05-07 NA
8 2022-05-08 NA
9 2022-05-09 NA
10 2022-05-10 50
# … with 11 more rows
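If the goal is instead one row for every day covered by a range, with the inventory of overlapping ranges summed, something like the following might be closer. This is a sketch under assumptions: the expected output is not shown in the question, so "sum" is read here as the per-day total of inventory, and days that fall in no range are simply left out.

```r
library(dplyr)
library(tidyr)

# Example data as printed above
df <- tibble(
  id = 1:4,
  start = c("01/05/2022", "10/05/2022", "11/05/2022", "14/05/2022"),
  end = c("02/05/2022", "15/05/2022", "21/05/2022", "17/05/2022"),
  inventory = c(100, 50, 80, 10)
)

daily <- df %>%
  mutate(across(c(start, end), ~ as.Date(.x, format = "%d/%m/%Y"))) %>%
  rowwise() %>%
  # one list element per row: every day from start to end, inclusive
  mutate(date = list(seq(start, end, by = "day"))) %>%
  ungroup() %>%
  unnest(date) %>%
  group_by(date) %>%
  # assumption: overlapping ranges add up on the days they share
  summarise(inventory = sum(inventory), .groups = "drop")

daily
```

On 2022-05-14, for example, ids 2, 3 and 4 overlap, so that day gets 50 + 80 + 10 = 140.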
I want to complete a df in R when a month is missing from it. For example, I have one year of information by month and day like this:
df = data.frame(Date = c("2020-01-01","2020-02-01",
"2020-03-02","2020-04-01","2020-09-01","2020-10-01",
"2020-11-01","2020-12-01"))
When I use the complete function, I call it like this:
df = df %>%
mutate(Date = as.Date(Date)) %>%
complete(Date = seq.Date(as.Date("2020-01-01"), as.Date("2020-12-31"), by = "month"))
The problem is that the final df fills in the missing months such as May, June and July, which is fine, but it also adds a row for March, because March has no entry on the first day and begins at 2020-03-02.
df = data.frame(Date = c("2020-01-01","2020-02-01",
"2020-03-01","2020-03-02","2020-04-01","2020-05-01",
"2020-06-01","2020-07-01","2020-08-01","2020-09-01",
"2020-10-01","2020-11-01","2020-12-01"))
Do you know how to complete the df only for months that have no date at all?
In my case I don't want to complete March, because March already has a date.
Thanks a lot.
You can extract year and month value from the Date and use complete on that.
library(dplyr)
library(lubridate)
library(tidyr)
df %>%
mutate(Date = as.Date(Date),
year = year(Date),
month = month(Date)) %>%
complete(year, month = 1:12) %>%
mutate(Date = if_else(is.na(Date),
as.Date(paste(year, month, 1, sep = '-')), Date)) %>%
select(Date)
# Date
# <date>
# 1 2020-01-01
# 2 2020-02-01
# 3 2020-03-02
# 4 2020-04-01
# 5 2020-05-01
# 6 2020-06-01
# 7 2020-07-01
# 8 2020-08-01
# 9 2020-09-01
#10 2020-10-01
#11 2020-11-01
#12 2020-12-01
A possible solution would be to complete only by yearmon from the zoo package, so that the actual day of the month is irrelevant.
library(dplyr)
library(zoo) # for as.yearmon
library(tidyr) # for complete
df <- data.frame(Date = c("2020-01-01","2020-02-01",
"2020-03-02","2020-04-01",
"2020-09-01","2020-10-01",
"2020-11-01","2020-12-01"),
id = 1:8)
df
#> Date id
#> 1 2020-01-01 1
#> 2 2020-02-01 2
#> 3 2020-03-02 3
#> 4 2020-04-01 4
#> 5 2020-09-01 5
#> 6 2020-10-01 6
#> 7 2020-11-01 7
#> 8 2020-12-01 8
df %>%
mutate(Date = as.Date(Date),
year_mon = as.yearmon(Date)) %>%
complete(
year_mon = seq.Date(as.Date("2020-01-01"),
as.Date("2020-12-31"),
by = "month") %>% as.yearmon()
)
#> # A tibble: 12 x 3
#> year_mon Date id
#> <yearmon> <date> <int>
#> 1 Jan 2020 2020-01-01 1
#> 2 Feb 2020 2020-02-01 2
#> 3 Mar 2020 2020-03-02 3
#> 4 Apr 2020 2020-04-01 4
#> 5 May 2020 NA NA
#> 6 Jun 2020 NA NA
#> 7 Jul 2020 NA NA
#> 8 Aug 2020 NA NA
#> 9 Sep 2020 2020-09-01 5
#> 10 Oct 2020 2020-10-01 6
#> 11 Nov 2020 2020-11-01 7
#> 12 Dec 2020 2020-12-01 8
Created on 2021-06-25 by the reprex package (v2.0.0)
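The same idea can be sketched without zoo by using the first day of each month as the key. This is only an illustration under assumptions: the 2020 range is hard-coded as in the question, and missing months are filled with their first day, as in the year/month answer above.

```r
library(dplyr)
library(lubridate)

df <- data.frame(Date = as.Date(c("2020-01-01", "2020-02-01", "2020-03-02",
                                  "2020-04-01", "2020-09-01", "2020-10-01",
                                  "2020-11-01", "2020-12-01")))

# Every month in the range, represented by its first day
all_months <- seq(as.Date("2020-01-01"), as.Date("2020-12-01"), by = "month")

# A month counts as present if any date falls inside it
missing_months <- all_months[!all_months %in% floor_date(df$Date, "month")]

completed <- bind_rows(df, data.frame(Date = missing_months)) %>%
  arrange(Date)

completed
```

March keeps its original 2020-03-02 row, and only May through August are added.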
I have two independent datasets. The first contains event dates, and each ID has only one "Eventdate", as follows:
data1 <- data.frame("ID" = c(1,2,3,4,5,6), "Eventdate" = c("2019-01-01", "2019-02-01", "2019-03-01", "2019-04-01", "2019-05-01", "2019-06-01"))
data1
ID Eventdate
1 1 2019-01-01
2 2 2019-02-01
3 3 2019-03-01
4 4 2019-04-01
5 5 2019-05-01
6 6 2019-06-01
In the other dataset, one ID can have multiple event codes (Eventcode), each with its own event date (Eventdate), as follows:
data2 <- data.frame("ID" = c(1,1,2,3,3,3,4,4,7), "Eventcode"=c(201,202,201,204,205,206,209,208,203),"Eventdate" = c("2019-01-01", "2019-01-01", "2019-02-11", "2019-02-15", "2019-03-01", "2019-03-15", "2019-03-10", "2019-03-20", "2019-06-02"))
data2
ID Eventcode Eventdate
1 1 201 2019-01-01
2 1 202 2019-01-01
3 2 201 2019-02-11
4 3 204 2019-02-15
5 3 205 2019-03-01
6 3 206 2019-03-15
7 4 209 2019-03-10
8 4 208 2019-03-20
9 7 203 2019-06-02
The two datasets are linked by ID, but the IDs in the two datasets are not all the same.
I would like to select cases in data2 with conditions:
Match by ID
Eventdate in data2 >= Eventdate in data1.
If one ID has multiple Eventdates in data2, select the earliest one.
If one ID has multiple Eventcodes at one Eventdate in data2, just randomly select one.
Then merge the selected data2 into data1.
Expected results as follows:
data1
ID Eventdate Eventdate.data2 Eventcode
1 1 2019-01-01 2019-01-01 201
2 2 2019-02-01 2019-02-11 201
3 3 2019-03-01 2019-03-01 205
4 4 2019-04-01
5 5 2019-05-01
6 6 2019-06-01
or
data1
ID Eventdate Eventdate.data2 Eventcode
1 1 2019-01-01 2019-01-01 202
2 2 2019-02-01 2019-02-11 201
3 3 2019-03-01 2019-03-01 205
4 4 2019-04-01
5 5 2019-05-01
6 6 2019-06-01
Thank you very very much!
You can try this approach :
library(dplyr)
left_join(data1, data2, by = 'ID') %>%
group_by(ID, Eventdate.x) %>%
summarise(Eventdate = Eventdate.y[Eventdate.y >= Eventdate.x][1],
Eventcode = {
inds <- Eventdate.y >= Eventdate.x
val <- sum(inds, na.rm = TRUE)
if(val == 1) Eventcode[inds]
else if(val > 1) sample(Eventcode[inds], 1)
else NA_real_
})
# ID Eventdate.x Eventdate Eventcode
# <dbl> <chr> <chr> <dbl>
#1 1 2019-01-01 2019-01-01 201
#2 2 2019-02-01 2019-02-11 201
#3 3 2019-03-01 2019-03-01 205
#4 4 2019-04-01 NA NA
#5 5 2019-05-01 NA NA
#6 6 2019-06-01 NA NA
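The more involved logic for Eventcode is there for randomness. Note that it samples among all rows with Eventdate.y >= Eventdate.x, not only those on the earliest date, so the sampled code can belong to a later date than the one reported in Eventdate. If you are OK with selecting the first value, as is done for Eventdate, you can simplify it to: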
left_join(data1, data2, by = 'ID') %>%
group_by(ID, Eventdate.x) %>%
summarise(Eventdate = Eventdate.y[Eventdate.y >= Eventdate.x][1],
Eventcode = Eventcode[Eventdate.y >= Eventdate.x][1])
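If the random pick should be restricted to the earliest qualifying date, a variant along these lines might work. It is a sketch, not a drop-in replacement: it assumes dplyr 1.0 or later for slice_min and slice_sample, converts the dates with as.Date first, and renames the data2 date column to Eventdate.data2 to match the expected output in the question.

```r
library(dplyr)

data1 <- data.frame(
  ID = c(1, 2, 3, 4, 5, 6),
  Eventdate = as.Date(c("2019-01-01", "2019-02-01", "2019-03-01",
                        "2019-04-01", "2019-05-01", "2019-06-01"))
)
data2 <- data.frame(
  ID = c(1, 1, 2, 3, 3, 3, 4, 4, 7),
  Eventcode = c(201, 202, 201, 204, 205, 206, 209, 208, 203),
  Eventdate = as.Date(c("2019-01-01", "2019-01-01", "2019-02-11", "2019-02-15",
                        "2019-03-01", "2019-03-15", "2019-03-10", "2019-03-20",
                        "2019-06-02"))
)

picked <- data2 %>%
  rename(Eventdate.data2 = Eventdate) %>%
  inner_join(data1, by = "ID") %>%
  filter(Eventdate.data2 >= Eventdate) %>%        # condition 2
  group_by(ID) %>%
  slice_min(Eventdate.data2, with_ties = TRUE) %>% # all rows on the earliest date
  slice_sample(n = 1) %>%                          # random pick among ties
  ungroup() %>%
  select(ID, Eventdate.data2, Eventcode)

result <- left_join(data1, picked, by = "ID")
result
```

For ID 1 this returns 201 or 202 at random; for ID 3 it always returns 205, because 206 falls on a later date.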
Does this work:
library(dplyr)
data1 %>% rename(Eventdate_dat1 = Eventdate) %>% left_join(data2, by = 'ID') %>%
group_by(ID) %>% filter(Eventdate >= Eventdate_dat1) %>%
mutate(Eventdate = case_when(length(unique(Eventdate)) > 1 ~ min(Eventdate), TRUE ~ Eventdate),
Eventcode = case_when(length(unique(Eventcode)) > 1 ~ min(Eventcode), TRUE ~ Eventcode)) %>%
distinct() %>% right_join(data1, by = 'ID') %>% select(ID, 'Eventdate' = Eventdate.y, 'Eventdate.data2' = Eventdate.x, Eventcode)
# A tibble: 6 x 4
# Groups: ID [6]
ID Eventdate Eventdate.data2 Eventcode
<dbl> <chr> <chr> <dbl>
1 1 2019-01-01 2019-01-01 201
2 2 2019-02-01 2019-02-11 201
3 3 2019-03-01 2019-03-01 205
4 4 2019-04-01 NA NA
5 5 2019-05-01 NA NA
6 6 2019-06-01 NA NA
I have a dataframe that looks like this:
And here is the output I'm hoping for.
This should work. The key is to use uncount from the tidyr package (loaded here via tidyverse). Then you need some date arithmetic. Calculating the difference in months is tricky; what I propose here may not be the best way to do it, but you get the idea.
library(tidyverse)
library(lubridate)
df = tibble(name = c('Alice', 'Bob', 'Caroline'),
start_date = as.Date(c('2019-01-01','2018-03-01','2019-06-01')),
end_date = as.Date(c('2019-07-01','2019-05-01','2019-09-01')))
# # A tibble: 3 x 3
# name start_date end_date
# <chr> <date> <date>
# 1 Alice 2019-01-01 2019-07-01
# 2 Bob 2018-03-01 2019-05-01
# 3 Caroline 2019-06-01 2019-09-01
df %>% mutate(tenure_in_month = as.integer(difftime(end_date, start_date, units = "days")/365*12+2))%>%
uncount(tenure_in_month)%>%
group_by(name)%>%
mutate(iteratedDate = start_date %m+% months(row_number()-1))%>%
select(name,iteratedDate)
# A tibble: 28 x 2
# Groups: name [3]
name iteratedDate
<chr> <date>
1 Alice 2019-01-01
2 Alice 2019-02-01
3 Alice 2019-03-01
4 Alice 2019-04-01
5 Alice 2019-05-01
6 Alice 2019-06-01
7 Alice 2019-07-01
8 Bob 2018-03-01
9 Bob 2018-04-01
10 Bob 2018-05-01
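One way to sidestep the month-difference arithmetic is to let seq step by month directly. This is a sketch under assumptions: the desired output is not shown in the question, so it assumes both start_date and end_date should be included, and that the dates fall on the first of the month as in the example. With the day-count approximation above, some series appear to run past end_date, which this variant avoids.

```r
library(dplyr)
library(tidyr)

df <- tibble(name = c("Alice", "Bob", "Caroline"),
             start_date = as.Date(c("2019-01-01", "2018-03-01", "2019-06-01")),
             end_date = as.Date(c("2019-07-01", "2019-05-01", "2019-09-01")))

monthly <- df %>%
  rowwise() %>%
  # one list element per person: every month from start_date to end_date
  mutate(iteratedDate = list(seq(start_date, end_date, by = "month"))) %>%
  ungroup() %>%
  unnest(iteratedDate) %>%
  select(name, iteratedDate)

monthly
```

This gives 7 rows for Alice, 15 for Bob and 4 for Caroline.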
I use seq function to fix this problem.
library(data.table)
library(lubridate)
# data
original_data <- data.table(
CustomerName = c('Ben','Julie','Angelo','Carlo'),
StartDate = c(ymd(20190101),ymd(20180103),ymd(20190106),ymd(20170108)),
EndDate = c(ymd(20190107),ymd(20190105),ymd(20190109),ymd(20180112))
)
# CustomerName StartDate EndDate
#1: Ben 2019-01-01 2019-01-07
#2: Julie 2018-01-03 2019-01-05
#3: Angelo 2019-01-06 2019-01-09
#4: Carlo 2017-01-08 2018-01-12
# For each customer, build the daily sequence from StartDate to EndDate
finish_data <- original_data[, .(IteratedDate = seq(from = StartDate,
to = EndDate, by = 'day')), by = .(CustomerName)]
# CustomerName IteratedDate
#1: Ben 2019-01-01
#2: Ben 2019-01-02
#3: Ben 2019-01-03
#4: Ben 2019-01-04
#5: Ben 2019-01-05
#6: Ben 2019-01-06
#7: Ben 2019-01-07
#8: Julie 2018-01-03
#9: Julie 2018-01-04
I have the following dataframe in R
ID Date1 Date2
1 21-03-16 8:36 22-03-16 12:36
1 23-03-16 9:36 24-03-16 01:36
1 22-03-16 10:36 25-03-16 11:46
1 23-03-16 11:36 28-03-16 10:16
My desired dataframe is
ID Date1 Date1_time Date2 Date2_time
1 2016-03-21 08:36:00 2016-03-22 12:36:00
1 2016-03-23 09:36:00 2016-03-24 01:36:00
1 2016-03-22 10:36:00 2016-03-25 11:46:00
1 2016-03-23 11:36:00 2016-03-28 10:16:00
I can do this for a single column using strptime, like this:
df$Date1 <- strptime(df$Date1, format='%d-%m-%y %H:%M')
df$Date1_time <- strftime(df$Date1 ,format="%H:%M:%S")
df$Date1 <- strptime(df$Date1, format='%Y-%m-%d')
But I have many date columns to convert like this. How can I write a function in R that will do it?
You can do this with dplyr::mutate_at to operate on multiple columns. See select helpers for more info on efficiently specifying which columns to operate on.
Then you can use lubridate and hms for date and time functions.
library(dplyr)
library(lubridate)
library(hms)
df <- readr::read_csv(
'ID,Date1,Date2
1,"21-03-16 8:36","22-03-16 12:36"
1,"23-03-16 9:36","24-03-16 01:36"
1,"22-03-16 10:36","25-03-16 11:46"
1,"23-03-16 11:36","28-03-16 10:16"'
)
df
#> # A tibble: 4 x 3
#> ID Date1 Date2
#> <int> <chr> <chr>
#> 1 1 21-03-16 8:36 22-03-16 12:36
#> 2 1 23-03-16 9:36 24-03-16 01:36
#> 3 1 22-03-16 10:36 25-03-16 11:46
#> 4 1 23-03-16 11:36 28-03-16 10:16
df %>%
mutate_at(vars(Date1, Date2), dmy_hm) %>%
mutate_at(vars(Date1, Date2), funs("date" = date(.), "time" = as.hms(.))) %>%
select(-Date1, -Date2)
#> # A tibble: 4 x 5
#> ID Date1_date Date2_date Date1_time Date2_time
#> <int> <date> <date> <time> <time>
#> 1 1 2016-03-21 2016-03-22 08:36:00 12:36:00
#> 2 1 2016-03-23 2016-03-24 09:36:00 01:36:00
#> 3 1 2016-03-22 2016-03-25 10:36:00 11:46:00
#> 4 1 2016-03-23 2016-03-28 11:36:00 10:16:00
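In more recent dplyr versions mutate_at is superseded and funs() is deprecated, and hms recommends as_hms over as.hms. A sketch of the same pipeline with across() might look like this, assuming dplyr 1.0 or later and hms 0.5 or later. Note that the new columns come out grouped per input column rather than per function.

```r
library(dplyr)
library(lubridate)
library(hms)

df <- tibble(
  ID = 1L,
  Date1 = c("21-03-16 8:36", "23-03-16 9:36", "22-03-16 10:36", "23-03-16 11:36"),
  Date2 = c("22-03-16 12:36", "24-03-16 01:36", "25-03-16 11:46", "28-03-16 10:16")
)

out <- df %>%
  mutate(across(c(Date1, Date2), dmy_hm)) %>%
  # a named list of functions yields columns named <column>_<function name>
  mutate(across(c(Date1, Date2), list(date = as_date, time = as_hms))) %>%
  select(-Date1, -Date2)

out
```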
Using dplyr for manipulation:
convertTime <- function(x)as.POSIXct(x, format='%d-%m-%y %H:%M')
df %>%
mutate_at(vars(Date1, Date2), convertTime) %>%
group_by(ID) %>%
mutate_all(funs("date"=as.Date(.), "time"=format(., "%H:%M:%S")))
# Source: local data frame [4 x 7]
# Groups: ID [1]
#
# ID Date1 Date2 Date1_date Date2_date Date1_time Date2_time
# <int> <dttm> <dttm> <date> <date> <chr> <chr>
# 1 1 2016-03-22 12:36:00 2016-03-22 12:36:00 2016-03-22 2016-03-22 12:36:00 12:36:00
# 2 1 2016-03-24 01:36:00 2016-03-24 01:36:00 2016-03-23 2016-03-23 01:36:00 01:36:00
# 3 1 2016-03-25 11:46:00 2016-03-25 11:46:00 2016-03-25 2016-03-25 11:46:00 11:46:00
# 4 1 2016-03-28 10:16:00 2016-03-28 10:16:00 2016-03-28 2016-03-28 10:16:00 10:16:00
I had the same problem. You can try this approach using strsplit; it may help:
x <- df$Date1
y = t(as.data.frame(strsplit(as.character(x),' ')))
row.names(y) = NULL
# store the split data in new columns
df$date <- y[,1] # date column
df$time <- y[,2] # time column
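If tidyr is available, separate does the same split in one call. This is a minimal sketch on a cut-down copy of the data; like the strsplit version, it leaves both pieces as character strings and does not parse them as dates or times.

```r
library(tidyr)

df <- data.frame(ID = 1,
                 Date1 = c("21-03-16 8:36", "23-03-16 9:36"))

# Split Date1 on the space into two character columns
out <- separate(df, Date1, into = c("date", "time"), sep = " ")
out
```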