Here is my toy dataset:
df <- tibble::tribble(
~date, ~value,
"2007-01-31", 25,
"2007-05-31", 31,
"2007-12-31", 26
)
I am creating month-end date series using the following code.
df %>%
mutate(date = as.Date(date)) %>%
complete(date = seq(as.Date("2007-01-31"), as.Date("2019-12-31"), by="month"))
However, I am not getting the correct month-end dates.
date value
<date> <dbl>
1 2007-01-31 25
2 2007-03-03 NA
3 2007-03-31 NA
4 2007-05-01 NA
5 2007-05-31 31
6 2007-07-01 NA
7 2007-07-31 NA
8 2007-08-31 NA
9 2007-10-01 NA
10 2007-10-31 NA
11 2007-12-01 NA
12 2007-12-31 26
What am I missing here? I am okay using other functions from any other package.
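You don't need complete() to build the sequence; base R is enough. The odd dates appear because seq() adds one month to the 31st, and non-existent dates such as February 31 roll over into the next month (hence 2007-03-03).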
Since last day of the month is different for different months, we can create a sequence of monthly start dates and subtract 1 day from it.
seq(as.Date("2007-02-01"), as.Date("2008-01-01"), by="month") - 1
#[1] "2007-01-31" "2007-02-28" "2007-03-31" "2007-04-30" "2007-05-31" "2007-06-30"
# "2007-07-31" "2007-08-31" "2007-09-30" "2007-10-31" "2007-11-30" "2007-12-31"
Using the same logic on the updated data frame, we can do:
library(dplyr)
df %>%
mutate(date = as.Date(date)) %>%
tidyr::complete(date = seq(min(date) + 1, max(date) + 1, by="month") - 1)
# date value
# <date> <dbl>
# 1 2007-01-31 25
# 2 2007-02-28 NA
# 3 2007-03-31 NA
# 4 2007-04-30 NA
# 5 2007-05-31 31
# 6 2007-06-30 NA
# 7 2007-07-31 NA
# 8 2007-08-31 NA
# 9 2007-09-30 NA
#10 2007-10-31 NA
#11 2007-11-30 NA
#12 2007-12-31 26
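To see why the original attempt drifts, and to wrap the first-of-month trick in something reusable, here is a minimal base R sketch. The helper name month_ends is hypothetical (it does not come from any package), and it assumes you want the last day of every month touched by the range.

```r
# Stepping by "month" from the 31st lands on dates that do not exist
# (e.g. "February 31"), which roll over into the following month.
drifted <- seq(as.Date("2007-01-31"), by = "month", length.out = 3)

# Hypothetical helper: last day of every month between `from` and `to`.
month_ends <- function(from, to) {
  first <- as.Date(format(as.Date(from), "%Y-%m-01"))  # first day of start month
  last  <- as.Date(format(as.Date(to),   "%Y-%m-01"))  # first day of end month
  n <- length(seq(first, last, by = "month"))          # number of months covered
  # First days of the following months, minus one day, are the month ends.
  seq(first, by = "month", length.out = n + 1)[-1] - 1
}

ends <- month_ends("2007-01-31", "2007-12-31")
print(drifted)
print(ends)
```

The result can be passed straight to tidyr::complete(date = ...) in place of the inline seq() expression.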
I have a data.table whose age column contains missing values; rdate is in Date format. For each horsenum, I want to fill a missing age by finding the next non-missing age and its rdate, then subtracting the ceiling of the year difference between that non-missing rdate and the current record's rdate. I assume the next non-missing rdate is the birthday, which is why I use the ceiling of the year difference. I also want to keep rdate.fill in Date format. How can I write this in data.table?
My idea is to calculate age.fill this way, but I get an error:
library(lubridate)
data[, rdate.fill := ifelse(is.na(age), as.Date(rdate[na.lacf(age)]), NA), by=horsenum]
data[, age.fill := ifelse(is.na(age), ind4- ceiling(time_length(difftime(rdate.fill, rdate, "years"), age), by=horsenum]
input
index rdate horsenum age ind4
1: 14704 2009-03-01 K123 NA 10
2: 14767 2009-03-01 K212 NA 9
3: 39281 2011-10-09 K123 NA 10
4: 39561 2011-10-19 K212 NA 9
5: 74560 2015-04-07 K212 NA 9
6: 77972 2015-09-06 K123 10 NA
7: 79111 2015-10-10 K212 9 NA
8: 84233 2016-03-28 K212 10 NA
structure(list(index = c(14704L, 14767L, 39281L, 39561L, 74560L,
77972L, 79111L, 84233L), rdate = structure(c(14304, 14304, 15256,
15266, 16532, 16684, 16718, 16888), class = "Date"), horsenum = c("K123",
"K212", "K123", "K212", "K212", "K123", "K212", "K212"), age = c(NA,
NA, NA, NA, NA, 10, 9, 10), ind4 = c(10, 9, 10, 9, 9, NA, NA,
NA)), row.names = c(NA, -8L), class = c("data.table", "data.frame"))
output
index rdate horsenum age ind4 rdate.fill age.fill
1: 14704 2009-03-01 K123 NA 10 2015-09-06 3
2: 14767 2009-03-01 K212 NA 9 2015-10-10 2
3: 39281 2011-10-09 K123 NA 10 2015-09-06 6
4: 39561 2011-10-19 K212 NA 9 2015-10-10 5
5: 74560 2015-04-07 K212 NA 9 2015-10-10 8
6: 77972 2015-09-06 K123 10 NA 10
7: 79111 2015-10-10 K212 9 NA 9
8: 84233 2016-03-28 K212 10 NA 10
It is not clear to me how age.fill is calculated differently for rows 2 and 4 compared with rows 1 and 3, respectively. But this should get you closer to what you need:
library(data.table) #data.table_1.12.4
DT[, rdate.fill := nafill(fifelse(is.na(age), as.Date(NA), rdate), "nocb"), horsenum][,
age.fill := fifelse(is.na(age), ind4 - ceiling(lubridate::time_length(difftime(rdate.fill, rdate), "years")), age), horsenum]
output:
index rdate horsenum age ind4 rdate.fill age.fill
1: 14704 2009-03-01 K123 NA 10 2015-09-06 3
2: 14767 2009-03-01 K212 NA 9 2015-10-10 2
3: 39281 2011-10-09 K123 NA 10 2015-09-06 6
4: 39561 2011-10-19 K212 NA 9 2015-10-10 5
5: 74560 2015-04-07 K212 NA 9 2015-10-10 8
6: 77972 2015-09-06 K123 10 NA 2015-09-06 10
7: 79111 2015-10-10 K212 9 NA 2015-10-10 9
8: 84233 2016-03-28 K212 10 NA 2016-03-28 10
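For readers unfamiliar with the "nocb" fill type used above (next observation carried backward), here is a minimal base R sketch of the idea. The helper nocb is hypothetical and is not part of data.table; it is applied to one ungrouped vector, whereas the answer applies the fill per horsenum.

```r
# Hypothetical helper: fill each NA with the next non-missing value
# ("next observation carried backward"); trailing NAs stay NA.
nocb <- function(x) {
  # Position of each non-missing value; NAs point one past the end.
  pos <- ifelse(is.na(x), length(x) + 1L, seq_along(x))
  # Running minimum from the right = index of the next non-missing value.
  nxt <- rev(cummin(rev(pos)))
  c(x, NA)[nxt]
}

age   <- c(NA, NA, 10, NA, 9, NA)
rdate <- as.Date(c("2009-03-01", "2011-10-09", "2015-09-06",
                   "2015-04-07", "2015-10-10", "2016-03-28"))

age_filled   <- nocb(age)                              # 10 10 10 9 9 NA
rdate_filled <- nocb(replace(rdate, is.na(age), NA))   # date of the next known age
print(data.frame(rdate, age, rdate_filled, age_filled))
```

Blanking out rdate wherever age is missing before filling mirrors the fifelse(is.na(age), as.Date(NA), rdate) step in the answer.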
data[,age.fill := nafill(age,'nocb'),by=horsenum][,
rdate.fill:=ifelse(is.na(age),rdate[which.min(age.fill==age)],rdate),by=horsenum][,
age.fill:=unclass(age.fill - round((rdate.fill-rdate)/365))
]
index rdate horsenum age ind4 age.fill rdate.fill
1: 14704 2009-03-01 K123 NA 10 3 2015-09-06
2: 14767 2009-03-01 K212 NA 9 2 2015-10-10
3: 39281 2011-10-09 K123 NA 10 6 2015-09-06
4: 39561 2011-10-19 K212 NA 9 5 2015-10-10
5: 74560 2015-04-07 K212 NA 9 8 2015-10-10
6: 77972 2015-09-06 K123 10 NA 10 2015-09-06
7: 79111 2015-10-10 K212 9 NA 9 2015-10-10
8: 84233 2016-03-28 K212 10 NA 10 2016-03-28
Your algorithm systematically underestimates age. For example, horse K212's estimated age on 2015-04-07 (row 5) is 8. However, we know K212's age on 2016-03-28 is 10 (row 8), so K212 must be 9 on 2015-04-07, not 8. Here I address this problem by calculating an estimated birthdate from each non-NA rdate, then calculating the earliest estimated birthdate for each horse.
library(data.table)
data=data.table(index=c(14704L,14767L,39281L,39561L,74560L,77972L,79111L,84233L),rdate=structure(c(14304,14304,15256,15266,16532,16684,16718,16888),class="Date"),horsenum=c("K123","K212","K123","K212","K212","K123","K212","K212"),age=c(NA,NA,NA,NA,NA,10,9,10))
lt = data[!is.na(age),as.POSIXlt(rdate)]
lt$year = lt$year - data[!is.na(age),age]
data[!is.na(age),bday:=as.Date(lt)]
data[,bday:=min(bday,na.rm=T),horsenum]
data[,age.fill:=floor(as.numeric(rdate-bday)/365)]
data[order(index)]
Output:
index rdate horsenum age bday age.fill
1: 14704 2009-03-01 K123 NA 2005-09-06 3
2: 14767 2009-03-01 K212 NA 2006-03-28 2
3: 39281 2011-10-09 K123 NA 2005-09-06 6
4: 39561 2011-10-19 K212 NA 2006-03-28 5
5: 74560 2015-04-07 K212 NA 2006-03-28 9
6: 77972 2015-09-06 K123 10 2005-09-06 10
7: 79111 2015-10-10 K212 9 2006-03-28 9
8: 84233 2016-03-28 K212 10 2006-03-28 10
Note: this algorithm could be improved. Consider K212 is 9 on 2015-10-10 and 10 on 2016-03-28. This means that K212's actual birthday is after 10-10 and before 3-28. Instead of assuming 3-28, we could assume it is halfway between 10-10 and 3-28, or, more specifically, if there is more than one estimated birthdate, calculate both the max and the min possible birthdate for each horse, then find the date that's halfway between max - 1 year and min.
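The improvement described in the note can be sketched in base R as follows. This is only an illustration under assumptions: the helper shift_years is hypothetical, the data are K212's two known observations, and a horse aged n on a given date is assumed to have been born more than n and at most n + 1 years before it.

```r
# Hypothetical helper: move a Date back by a whole number of years.
shift_years <- function(d, n) {
  lt <- as.POSIXlt(d)
  lt$year <- lt$year - n
  as.Date(lt)
}

# K212's two observations with a known age
obs <- data.frame(
  rdate = as.Date(c("2015-10-10", "2016-03-28")),
  age   = c(9, 10)
)

# Each observation bounds the birthdate:
latest   <- shift_years(obs$rdate, obs$age)          # born on or before this
earliest <- shift_years(obs$rdate, obs$age + 1) + 1  # born on or after this

# Intersect the windows, then take the midpoint as the estimate.
lo  <- max(earliest)
hi  <- min(latest)
mid <- lo + floor(as.numeric(hi - lo) / 2)
print(c(from = lo, estimate = mid, to = hi))
```

With more observations per horse the window between lo and hi narrows, so the midpoint becomes a better estimate of the birthdate.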
The approach below is slightly different:
It calculates, based on the given ages, the possible range of the horse's birthday. It then uses this window to calculate the minimum and maximum age a horse can have on the given rdate.
So, the more information you have about the horse's age, the smaller the window of possible birthdays, and the bigger the chance that the minimum estimated age equals the maximum estimated age (if they are the same, you know the age of the horse for sure).
Here we go:
library( data.table )
library( lubridate ) #for the %m+% and %m-% operators
library( intervals ) #to calculate with intervals and find overlaps
library( eeptools ) #for age_calc function; calculating the age, given a date and a birthday (respects leap years, etc.)
#read sample data
DT <- fread("
index rdate horsenum age ind4
14704 2009-03-01 K123 NA 10
14767 2009-03-01 K212 NA 9
39281 2011-10-09 K123 NA 10
39561 2011-10-19 K212 NA 9
74560 2015-04-07 K212 NA 9
77972 2015-09-06 K123 10 NA
79111 2015-10-10 K212 9 NA
84233 2016-03-28 K212 10 NA")
#set dates as POSIXct
DT[, rdate := as.POSIXct(rdate) ]
#set keys
setkey( DT, horsenum, rdate, age )
#calculate bandwidth date of birth (dob) based on age and date
DT[!is.na( age ), dob_min := as.integer( rdate %m-% lubridate::years(age + 1) %m+% lubridate::days(1) ) ]
DT[!is.na( age ), dob_max := as.integer( rdate %m-% lubridate::years(age) ) ]
#function to get the overlap of birthday intervals
myfun <- function( y ) {
all_intervals <- intervals::Intervals( as.matrix( y ), check_valid = TRUE )
int_min <- all_intervals[1]
for (i in 1:nrow(all_intervals) ) int_min <- interval_intersection( int_min, all_intervals[i] ) #accumulate the overlap across all intervals
as.data.table( int_min )
}
#get range of possible date of birth for each horsenum
dob_range <- DT[ !is.na(age), myfun( .SD ), by = .(horsenum), .SDcols = c("dob_min", "dob_max") ]
dob_range <- dob_range[, .(horsenum, dob_from = as.POSIXct(V1, origin = "1970-01-01"),
dob_to = as.POSIXct(V2, origin = "1970-01-01"))]
#use found ranges of birthday to estimate ages
#first join dob-ranges by horsenum
DT[ dob_range, `:=`( dob_from = i.dob_from, dob_to = i.dob_to), on = .(horsenum)]
#now calculate ages (minimum and maximum)
DT[, age_min := floor( eeptools::age_calc( as.Date(dob_to), as.Date(rdate), units= "years" ) )]
DT[, age_max := floor( eeptools::age_calc( as.Date(dob_from), as.Date(rdate), units= "years" ) )]
#remove helper columns
DT[, `:=`( dob_min = NULL, dob_max = NULL, dob_from = NULL, dob_to = NULL)]
# index rdate horsenum age ind4 age_min age_max
# 1: 14704 2009-03-01 K123 NA 10 3 4
# 2: 39281 2011-10-09 K123 NA 10 6 7
# 3: 77972 2015-09-06 K123 10 NA 10 10
# 4: 14767 2009-03-01 K212 NA 9 2 3
# 5: 39561 2011-10-19 K212 NA 9 5 6
# 6: 74560 2015-04-07 K212 NA 9 9 9
# 7: 79111 2015-10-10 K212 9 NA 9 9
# 8: 84233 2016-03-28 K212 10 NA 10 10
I have this situation:
ID date Weight
1 2014-12-02 23
1 2014-10-02 25
2 2014-11-03 27
2 2014-09-03 45
3 2014-07-11 56
3 NA 34
4 2014-10-05 25
4 2014-08-09 14
5 NA NA
5 NA NA
And I would like to split the dataset in two, like this:
1-
ID date Weight
1 2014-12-02 23
1 2014-10-02 25
2 2014-11-03 27
2 2014-09-03 45
4 2014-10-05 25
4 2014-08-09 14
2- Lowest Date
ID date Weight
3 2014-07-11 56
3 NA 34
5 NA NA
5 NA NA
I tried this for second dataset:
dt <- dt[order(dt$ID, dt$date), ]
dt.2=dt[duplicated(dt$ID), ]
but it didn't work.
Get the IDs for which date is NA, then subset based on them:
NA_ids <- unique(df$ID[is.na(df$date)])
subset(df, !ID %in% NA_ids)
# ID date Weight
#1 1 2014-12-02 23
#2 1 2014-10-02 25
#3 2 2014-11-03 27
#4 2 2014-09-03 45
#7 4 2014-10-05 25
#8 4 2014-08-09 14
subset(df, ID %in% NA_ids)
# ID date Weight
#5 3 2014-07-11 56
#6 3 <NA> 34
#9 5 <NA> NA
#10 5 <NA> NA
Using dplyr, we can create a new column which has TRUE/FALSE for each ID based on presence of NA and then use group_split to split into list of two.
library(dplyr)
df %>%
group_by(ID) %>%
mutate(NA_ID = any(is.na(date))) %>%
ungroup() %>%
group_split(NA_ID, .keep = FALSE) # `.keep` replaces the deprecated `keep` argument in dplyr >= 1.0.0
The above dplyr logic can also be implemented in base R by using ave and split
df$NA_ID <- with(df, ave(is.na(date), ID, FUN = any))
split(df[-4], df$NA_ID)
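The two base R ideas above (collect the IDs with a missing date, then split) can also be combined without adding a helper column. A minimal sketch, with the example data rebuilt by hand and the names has_na and parts chosen only for illustration:

```r
# Example data from the question, rebuilt by hand
df <- data.frame(
  ID = rep(1:5, each = 2),
  date = as.Date(c("2014-12-02", "2014-10-02", "2014-11-03", "2014-09-03",
                   "2014-07-11", NA, "2014-10-05", "2014-08-09", NA, NA)),
  Weight = c(23, 25, 27, 45, 56, 34, 25, 14, NA, NA)
)

# TRUE for every row whose ID has at least one missing date
has_na <- df$ID %in% df$ID[is.na(df$date)]

# split() on a logical yields a list with elements named "FALSE" and "TRUE"
parts <- split(df, has_na)
print(parts[["FALSE"]])  # IDs with complete dates
print(parts[["TRUE"]])   # IDs with at least one missing date
```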
I have a list of 100+ time series dataframes my.list with daily observations for each product in its own data frame. Some values are NA without any record of the date. I would like to update each data frame in this list to show the date and NA if it does not have a record on this date.
Dates:
start = as.Date('2016/04/08')
full <- seq(start, by='1 days', length=10)
Sample Time Series Data:
d1 <- data.frame(Date = seq(start, by ='2 days',length=5), Sales = c(5,10,15,20,25))
d2 <- data.frame(Date = seq(start, by= '1 day', length=10),Sales = c(1, 2, 3,4,5,6,7,8,9,10))
my.list <- list(d1, d2)
I want to merge all full date values into each data frame, and if no match exists then sales is NA:
my.list
[[d1]]
Date Sales
2016-04-08 5
2016-04-09 NA
2016-04-10 10
2016-04-11 NA
2016-04-12 15
2016-04-13 NA
2016-04-14 20
2016-04-15 NA
2016-04-16 25
2016-04-17 NA
[[d2]]
Date Sales
2016-04-08 1
2016-04-09 2
2016-04-10 3
2016-04-11 4
2016-04-12 5
2016-04-13 6
2016-04-14 7
2016-04-15 8
2016-04-16 9
2016-04-17 10
If I understand correctly, the OP wants to update each of the dataframes in my.list to contain one row for each date given in the vector of dates full.
Base R
In base R, merge() can be used as already mentioned by Hack-R. However, the answer below expands this to work on all dataframes in the list:
# create dataframe from vector of full dates
full.df <- data.frame(Date = full)
# apply merge on each dataframe in the list
lapply(my.list, merge, y = full.df, all.y = TRUE)
[[1]]
Date Sales
1 2016-04-08 5
2 2016-04-09 NA
3 2016-04-10 10
4 2016-04-11 NA
5 2016-04-12 15
6 2016-04-13 NA
7 2016-04-14 20
8 2016-04-15 NA
9 2016-04-16 25
10 2016-04-17 NA
[[2]]
Date Sales
1 2016-04-08 1
2 2016-04-09 2
3 2016-04-10 3
4 2016-04-11 4
5 2016-04-12 5
6 2016-04-13 6
7 2016-04-14 7
8 2016-04-15 8
9 2016-04-16 9
10 2016-04-17 10
Caveat
The answer assumes that full covers the overall range of Date of all dataframes in the list.
In order to avoid any mishaps, the overall range of Date can be retrieved from the available data in my.list:
overall_date_range <- Reduce(range, lapply(my.list, function(x) range(x$Date)))
full <- seq(overall_date_range[1], overall_date_range[2], by = "1 days")
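Putting the caveat and the merge step together, a self-contained sketch might look like this. It rebuilds the sample data from the question and derives the date range from the data itself; the names all_dates and filled are illustrative only.

```r
# Sample data as in the question
start <- as.Date("2016-04-08")
d1 <- data.frame(Date = seq(start, by = "2 days", length.out = 5),
                 Sales = c(5, 10, 15, 20, 25))
d2 <- data.frame(Date = seq(start, by = "1 day", length.out = 10),
                 Sales = 1:10)
my.list <- list(d1, d2)

# Derive the overall date range from the data instead of hard-coding it
all_dates <- do.call(c, lapply(my.list, `[[`, "Date"))
full.df <- data.frame(Date = seq(min(all_dates), max(all_dates), by = "1 day"))

# Right join each data frame onto the full calendar; gaps become NA
filled <- lapply(my.list, merge, y = full.df, all.y = TRUE)
print(filled[[1]])
```

Here do.call(c, ...) is used instead of Reduce(range, ...) only to keep every observed date in one vector; both give the same minimum and maximum.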
Using rbindlist()
Alternatively, the list of dataframes, which are identical in structure, can be stored in one large dataframe. An additional column indicates which product each row belongs to. The homogeneous structure simplifies subsequent operations.
The code below uses the rbindlist() function from the data.table package to create a large data.table. CJ() (cross join) creates all combinations of dates and product id which is then merged / joined to fill in the missing dates:
library(data.table)
all_products <- rbindlist(my.list, idcol = "product.id")[
CJ(product.id = unique(product.id), Date = seq(min(Date), max(Date), by = "1 day")),
on = .(Date, product.id)]
all_products
product.id Date Sales
1: 1 2016-04-08 5
2: 1 2016-04-09 NA
3: 1 2016-04-10 10
4: 1 2016-04-11 NA
5: 1 2016-04-12 15
6: 1 2016-04-13 NA
7: 1 2016-04-14 20
8: 1 2016-04-15 NA
9: 1 2016-04-16 25
10: 1 2016-04-17 NA
11: 2 2016-04-08 1
12: 2 2016-04-09 2
13: 2 2016-04-10 3
14: 2 2016-04-11 4
15: 2 2016-04-12 5
16: 2 2016-04-13 6
17: 2 2016-04-14 7
18: 2 2016-04-15 8
19: 2 2016-04-16 9
20: 2 2016-04-17 10
Subsequent operations can be grouped by product.id, e.g., to determine the number of valid sales data for each product:
all_products[!is.na(Sales), .(valid.sales.data = .N), by = product.id]
product.id valid.sales.data
1: 1 5
2: 2 10
Or the total sales per product:
all_products[, .(total.sales = sum(Sales, na.rm = TRUE)), by = product.id]
product.id total.sales
1: 1 75
2: 2 55
If required, the result can be converted back to a list with
split(all_products, by = "product.id")
I have a data frame that contains several columns with dates
col1<-seq( as.Date("2011-07-01"), by=20, len=10)
col2<-seq( as.Date("2011-09-01"), by=7, len=10)
col3<-seq( as.Date("2011-08-01"), by=1, len=10)
data.frame(col1,col2,col3)
The data frame looks like this:
col1 col2 col3
1 2011-07-01 2011-09-01 2011-08-01
2 2011-07-21 2011-09-08 2011-08-02
3 2011-08-10 2011-09-15 2011-08-03
4 2011-08-30 2011-09-22 2011-08-04
5 2011-09-19 2011-09-29 2011-08-05
6 2011-10-09 2011-10-06 2011-08-06
7 2011-10-29 2011-10-13 2011-08-07
8 2011-11-18 2011-10-20 2011-08-08
9 2011-12-08 2011-10-27 2011-08-09
10 2011-12-28 2011-11-03 2011-08-10
I am trying to merge them into one column so that
A. Only the lowest (earliest) date remains per row and others get ignored
1 2011-07-01
2 2011-07-21
3 2011-08-03
4 2011-08-04
5 2011-08-05
6 2011-08-06
7 2011-08-07
8 2011-08-08
9 2011-08-09
10 2011-08-10
B. Only the highest (latest) date remains per row
1 2011-09-01
2 2011-09-08
3 2011-09-15
4 2011-09-22
5 2011-09-29
6 2011-10-09
7 2011-10-29
8 2011-11-18
9 2011-12-08
10 2011-12-28
The real dataset has NAs so if NA gets encountered it should be ignored unless all columns have a missing value for a particular row, in which case NA will be generated there as well.
Any thoughts?
pmin and pmax are helpful here:
do.call(pmin, dat)
# [1] "2011-07-01" "2011-07-21" "2011-08-03" "2011-08-04" "2011-08-05"
# [6] "2011-08-06" "2011-08-07" "2011-08-08" "2011-08-09" "2011-08-10"
do.call(pmax, dat)
# [1] "2011-09-01" "2011-09-08" "2011-09-15" "2011-09-22" "2011-09-29"
# [6] "2011-10-09" "2011-10-29" "2011-11-18" "2011-12-08" "2011-12-28"
This also works for NA values, like:
do.call(pmin, c(dat, na.rm=TRUE) )
You can also select the specific columns you want to analyse like:
do.call(pmin, c(dat[c("col1","col2","col3")], na.rm=TRUE) )
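To check that this matches the requirement about missing values (ignore NA unless the whole row is missing), here is a small sketch with made-up dates; the three-row dat below is hypothetical and only there to exercise the NA cases.

```r
# Hypothetical data: row 2 has one NA, row 3 is entirely NA
dat <- data.frame(
  col1 = as.Date(c("2011-07-01", NA, NA)),
  col2 = as.Date(c("2011-09-01", "2011-09-08", NA)),
  col3 = as.Date(c("2011-08-01", "2011-08-02", NA))
)

# With na.rm = TRUE, NA is returned only where every column is NA
earliest <- do.call(pmin, c(dat, na.rm = TRUE))
latest   <- do.call(pmax, c(dat, na.rm = TRUE))
print(data.frame(earliest, latest))
```

Both results keep the Date class, so they can be assigned directly as new columns.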
We can use max.col to find the index of the maximum values in each row, then cbind with row index and get the value per each row, convert to a data.frame
j1 <- sapply(df1, as.numeric)
df2 <- data.frame(Date = df1[cbind(1:nrow(df1),max.col(j1, 'first') )])
df3 <- data.frame(Date = df1[cbind(1:nrow(df1), max.col(-1*j1, "first"))])
df2
# Date
#1 2011-09-01
#2 2011-09-08
#3 2011-09-15
#4 2011-09-22
#5 2011-09-29
#6 2011-10-09
#7 2011-10-29
#8 2011-11-18
#9 2011-12-08
#10 2011-12-28
df3
# Date
#1 2011-07-01
#2 2011-07-21
#3 2011-08-03
#4 2011-08-04
#5 2011-08-05
#6 2011-08-06
#7 2011-08-07
#8 2011-08-08
#9 2011-08-09
#10 2011-08-10
Or another option is
as.Date(apply(df1, 1, min, na.rm = TRUE))
as.Date(apply(df1, 1, max, na.rm = TRUE))
Or with tidyverse
library(tidyverse)
df1 %>%
rowwise() %>%
transmute(colMax = max(col1, col2, col3), colMin = min(col1, col2, col3))