complete() in R for 6-month dates

I have data in 6-month intervals (ID, 6-month-start-date, outcome value), but for some IDs, there are half years where the outcome is missing. Simplified example:
id = c("aa", "aa", "ab", "ab", "ab")
date = as.Date(c("2021-07-01", "2022-07-01", "2021-07-01", "2022-01-01", "2022-07-01"))
col3 = c(1,2,1,2,1)
df <- data.frame(id, date, col3)
For similar datasets where the date is monthly, I used complete(date = seq.Date(start_date, end_date, by = "month")) to fill in the missing months and set the outcome field in the third column to 0.
I could do the following: expand the data to monthly, create a new 6-month-start-date column, group by it and ID, and sum col3.
df_complete <- df %>% group_by(id) %>%
complete(date = seq.Date(as.Date(min(date)), as.Date(max(date) %m+% months(5)), by="month")) %>%
mutate (col3 = replace_na(col3, 0))
df_complete_6mth <- df_complete %>% mutate(
halfyear = ifelse(as.integer(format(date, '%m')) <= 6,
paste0(format(date, '%Y'), '-01-01'),
paste0(format(date, '%Y'), '-07-01'))) %>%
group_by(id, halfyear) %>%
summarise(col3_halfyear = sum(col3))
However, is there a solution where the "by =" argument specifies 6 months? I tried
df_complete <- df %>% group_by(id) %>%
complete(date = seq.Date(as.Date(min(date)), as.Date(max(date) %m+% months(5)), by="months(6)")) %>%
mutate (col3 = replace_na(col3, 0))
but it didn't work.

From the help for seq.Date:
by can be specified in several ways.
A number, taken to be in days.
A object of class difftime
A character string, containing one of "day", "week", "month",
"quarter" or "year". This can optionally be preceded by a (positive or
negative) integer and a space, or followed by "s".
So I expect you want:
library(dplyr); library(tidyr)
df %>%
group_by(id) %>%
complete(date = seq.Date(min(date), max(date), by="6 month"),
fill = list(col3 = 0))
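As a quick check of what the "6 month" step produces, here is a minimal sketch that rebuilds the example data from the question and runs the same pipeline. The names half_years and df_complete_6mth are used only for illustration.

```r
library(dplyr)
library(tidyr)

# Example data from the question
df <- data.frame(
  id   = c("aa", "aa", "ab", "ab", "ab"),
  date = as.Date(c("2021-07-01", "2022-07-01",
                   "2021-07-01", "2022-01-01", "2022-07-01")),
  col3 = c(1, 2, 1, 2, 1)
)

# "6 month" steps the sequence in half-year increments
half_years <- seq(as.Date("2021-07-01"), as.Date("2022-07-01"), by = "6 month")

# Fill in the missing half years per id, setting col3 to 0 for the new rows
df_complete_6mth <- df %>%
  group_by(id) %>%
  complete(date = seq.Date(min(date), max(date), by = "6 month"),
           fill = list(col3 = 0)) %>%
  ungroup()
```

With this data, id "aa" gains one row (2022-01-01 with col3 equal to 0), so the result has six rows in total.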

Could you do something like this? You make a sequence of dates by month and then take every sixth one, starting from the first.
library(lubridate)
dates <- seq(mdy("01-01-2020"), mdy("01-01-2023"), by="month")
dates[seq(1, length(dates), by=6)]
#> [1] "2020-01-01" "2020-07-01" "2021-01-01" "2021-07-01" "2022-01-01"
#> [6] "2022-07-01" "2023-01-01"
Created on 2023-02-08 by the reprex package (v2.0.1)

Related

How to fill in missing value of a data.frame in R?

I have multiple columns that have missing values. I want to fill the missing data in each column with the mean of the same day across all years. For example, DF is my fake data, where the two columns (A and X) contain missing values.
library(lubridate)
library(tidyverse)
library(naniar)
set.seed(123)
DF <- data.frame(Date = seq(as.Date("1985-01-01"), to = as.Date("1987-12-31"), by = "day"),
A = sample(1:10,1095, replace = T), X = sample(5:15,1095, replace = T)) %>%
replace_with_na(replace = list(A = 2, X = 5))
To fill in column A, I use the following code:
Fill_DF_A <- DF %>%
mutate(Year = year(Date), Month = month(Date), Day = day(Date)) %>%
group_by(Year, Day) %>%
mutate(A = ifelse(is.na(A), mean(A, na.rm=TRUE), A))
I have many columns in my data.frame. How can I generalize this to fill in the missing values for all of the columns?
We can use na.aggregate from zoo
library(dplyr)
library(lubridate)  # for year(), month(), day()
library(zoo)
DF %>%
mutate(Year = year(Date), Month = month(Date), Day = day(Date)) %>%
group_by(Year, Day) %>%
mutate(across(A:X, na.aggregate))
Or if we prefer to use conditional statements
DF %>%
mutate(Year = year(Date), Month = month(Date), Day = day(Date)) %>%
group_by(Year, Day) %>%
mutate(across(A:X, ~ case_when(is.na(.)
~ mean(., na.rm = TRUE), TRUE ~ as.numeric(.))))
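The question asks for the mean of the same day across all years, while the code above groups by Year and Day. If the intent is the same calendar day (month and day of month) pooled over years, a possible variant is sketched below. The toy data and the MonthDay key are assumptions made for illustration, not part of the original post.

```r
library(dplyr)

# Toy data: two calendar days observed in three different years
DF <- data.frame(
  Date = as.Date(c("1985-01-01", "1986-01-01", "1987-01-01",
                   "1985-01-02", "1986-01-02", "1987-01-02")),
  A = c(4, NA, 8, 1, 3, NA),
  X = c(NA, 10, 20, 5, 5, 5)
)

filled <- DF %>%
  # Key on month and day only, so the same calendar day is pooled across years
  group_by(MonthDay = format(Date, "%m-%d")) %>%
  # Replace each NA with the group mean of the non-missing values
  mutate(across(c(A, X), ~ ifelse(is.na(.x), mean(.x, na.rm = TRUE), .x))) %>%
  ungroup()
```

In this toy data the missing A on 1986-01-01 is filled with 6, the mean of 4 and 8 from the other two years.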

R count difference in days based on column value

I'm trying to count the difference in dates from a single column, based on another column's value.
This is the result I'm looking for
Try this
library('dplyr')
df <- data.frame(id = c(1, 2, 3, 1, 2, 3),
Date = c('1/1/2020', '1/3/2020','1/1/2020','1/7/2020','1/6/2020','1/5/2020'))
df %>% mutate(Date = as.Date(Date, format='%m/%d/%Y')) %>%
group_by(id) %>%
mutate(DIFF = Date - lag(Date))
Here is a way using dplyr and lubridate (needed to make the dates behave when subtracting). It looks like you want the calculation to determine the number of days between the dates in a group by ID and the earliest date for that ID.
library(dplyr)
library(lubridate)
df %>%
mutate(Date = dmy(Date)) %>%
group_by(ID) %>%
mutate(Diff = Date - min(Date))
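If you want NA instead of 0, you can do the following (the difference is converted to a plain number of days so that both branches of if_else have the same type):
df %>%
mutate(Date = dmy(Date)) %>%
group_by(ID) %>%
mutate(Diff = if_else(Date == min(Date), NA_real_, as.numeric(Date - min(Date))))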
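For a self-contained illustration of "days since the earliest date per group", here is a small sketch. It assumes the data frame from the first answer (lowercase id and month/day/year strings), since the original data is not shown. The column names days_since_first and days_or_na are made up for the example.

```r
library(dplyr)

df <- data.frame(
  id   = c(1, 2, 3, 1, 2, 3),
  Date = c("1/1/2020", "1/3/2020", "1/1/2020",
           "1/7/2020", "1/6/2020", "1/5/2020")
)

res <- df %>%
  mutate(Date = as.Date(Date, format = "%m/%d/%Y")) %>%
  group_by(id) %>%
  mutate(
    # Days elapsed since the earliest date within each id
    days_since_first = as.numeric(Date - min(Date)),
    # Same values, with NA on the earliest row(s) of each id instead of 0
    days_or_na = na_if(days_since_first, 0)
  ) %>%
  ungroup()
```

Converting the difftime to numeric first keeps the column a plain double, which avoids type clashes when mixing it with NA.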

How to modify the Dataframe as below in R

I am working on a data analysis in R. I have a data frame that stores the data for each month of a year. For certain months of a particular year the data is missing. The data frame I am currently using is shown below.
How can I reshape this data into another data frame laid out as shown below?
The column Year is of type yearmon and n is of type int.
Solution using tidyverse
library(tidyverse)
##Recreate data
df <- tibble(
Year = c("Dec-13", "Jan-14","Feb-14","Mar-14",
"Apr-14", "May-14","Jun-15","Jul-14",
"Aug-15","Sep-18"),
n = c(1,8,2,4,8,9,2,1,1,1)
)
##convert to character, spread, and fill
df_2 <- df %>%
mutate(Year = parse_character(Year)) %>%
separate(Year, into = c("Month", "Year")) %>%
mutate(Year = paste0("20",Year)) %>%
spread(Year,n, fill = "-") %>%
mutate(Month = factor(Month, levels = c("Dec","Jan","Feb", "Mar","Apr",
"May","Jun","Jul", "Aug",
"Sep"))) %>%
arrange(Month)
df_2
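In current tidyr, spread() is superseded by pivot_wider(). A rough equivalent of the reshaping step is sketched below, assuming the same recreated data. It leaves missing month/year combinations as NA rather than "-", which keeps the value columns numeric. The name wide is hypothetical.

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  Year = c("Dec-13", "Jan-14", "Feb-14", "Mar-14", "Apr-14",
           "May-14", "Jun-15", "Jul-14", "Aug-15", "Sep-18"),
  n = c(1, 8, 2, 4, 8, 9, 2, 1, 1, 1)
)

wide <- df %>%
  # Split "Dec-13" into Month = "Dec" and Year = "13"
  separate(Year, into = c("Month", "Year"), sep = "-") %>%
  mutate(Year = paste0("20", Year)) %>%
  # One column per year, one row per month
  pivot_wider(names_from = Year, values_from = n)
```

If the "-" placeholder is needed for display, the columns can be converted to character and the NA values replaced afterwards.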

How do I convert date to format "ddmmyyyy"

I have a list of date entries where I need to convert every date format to "DDMMYYYY"
Example:
a <- c("31 aug 1953", "1953", "aug 1953")
Desired output:
"31081953", "00001953", "00081953"
As there are different formats, one option would be to extract the pieces into 'day', 'month', and 'year', then paste them together after replacing the missing values with "00" and padding with str_pad.
library(dplyr)
library(tidyr)
library(stringr)
data.frame(a) %>%
extract(a, into = c('day', 'month', 'year'), "(\\d{2})*\\s*([a-z]*)\\s*(\\d{4})") %>%
mutate(month = match(toupper(month), toupper(month.abb))) %>%
mutate(across(everything(), ~ str_pad(replace(.x, is.na(.x), "00"), width = 2, pad = "0"))) %>%
unite(newcol, day, month, year, sep="") %>%
pull(newcol)
#[1] "31081953" "00001953" "00081953"

Expanding date to include all dates in range [duplicate]

I have a dataset that looks like this:
ID created_at
MUM-0001 2014-04-16
MUM-0002 2014-01-14
MUM-0003 2014-04-17
MUM-0004 2014-04-12
MUM-0005 2014-04-18
MUM-0006 2014-04-17
I am trying to add a new column containing all dates between the start date and a defined last day (say, 12 July 2015). I used the seq function inside a dplyr pipeline but got an error.
data1 <- data1 %>%
arrange(ID) %>%
group_by(ID) %>%
mutate(date = seq(as.Date(created_at), as.Date('2015-07-12'), by= 1))
The error I am getting is:
Error: incompatible size (453), expecting 1 (the group size) or 1
Can you please suggest a better way to perform this task in R?
You could use data.table to get the sequence of Dates from 'created_at' to '2015-07-12', grouped by the 'ID' column.
library(data.table)
setDT(df1)[, list(date=seq(created_at, as.Date('2015-07-12'), by='1 day')) , ID]
If you need an option with dplyr, use do
library(dplyr)
df1 %>%
group_by(ID) %>%
do( data.frame(., Date= seq(.$created_at,
as.Date('2015-07-12'), by = '1 day')))
If you have duplicate IDs, then we may need to group by row_number()
df1 %>%
group_by(rn=row_number()) %>%
do(data.frame(ID= .$ID, Date= seq(.$created_at,
as.Date('2015-07-12'), by = '1 day'), stringsAsFactors=FALSE))
Update
Based on @Frank's comment, the new idiom for tidyverse is
library(tidyverse)
df1 %>%
group_by(ID) %>%
mutate(d = list(seq(created_at, as.Date('2015-07-12'), by='1 day')), created_at = NULL) %>%
unnest(d)
In the case of data.table
setDT(df1)[, list(date=seq(created_at,
as.Date('2015-07-12'), by = '1 day')), by = 1:nrow(df1)]
data
df1 <- structure(list(ID = c("MUM-0001", "MUM-0002", "MUM-0003",
"MUM-0004",
"MUM-0005", "MUM-0006"), created_at = structure(c(16176, 16084,
16177, 16172, 16178, 16177), class = "Date")), .Names = c("ID",
"created_at"), row.names = c(NA, -6L), class = "data.frame")
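With recent dplyr (1.1.0 or later, an assumption about the installed version), reframe() can return several rows per group, which is another way around the "incompatible size" error from mutate(). The sketch below uses the first three rows of the data and assumes each ID appears only once. The names end_date and expanded are made up for the example.

```r
library(dplyr)

df1 <- data.frame(
  ID = c("MUM-0001", "MUM-0002", "MUM-0003"),
  created_at = as.Date(c("2014-04-16", "2014-01-14", "2014-04-17"))
)
end_date <- as.Date("2015-07-12")

expanded <- df1 %>%
  group_by(ID) %>%
  # One output row per day from created_at through end_date, for each ID
  reframe(date = seq(created_at, end_date, by = "1 day"))
```

For MUM-0001 this yields 453 rows, the same length reported in the error message from the question.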
