R - Aggregate Dates - r

When I aggregate an R data frame, the dates are converted to integers.
For instance, if I want to take the maximum date for every id in the following data frame:
> df1 <- data.frame(id = rep(c(1, 2), 2), b = as.Date(paste("01/01/", 2000:2003, sep=''), format = "%d/%m/%Y"))
> df1
id b
1 1 2000-01-01
2 2 2001-01-01
3 1 2002-01-01
4 2 2003-01-01
> aggregate(x = list(b = df1$b), by = list(id = df1$id), FUN = "max")
id b
1 1 11688
2 2 12053
Why does R behave this way? (And what's the best way to keep a Date class column in the returned data frame?)
Thanks for your help,

That works for me in R version 3, so perhaps the behavior changed in an update; I recommend updating R :)
If you have to stay on your current version, have you tried the as.Date() function after aggregating?
In your example, it should look like:
dtf2 <- aggregate(x = list(b = df1$b), by = list(id = df1$id), FUN = "max")
dtf2$b <- as.Date(dtf2$b)
If the column came back as plain numbers, add the 'origin' option to as.Date (versions of R before 4.3 require it for numeric input), like
as.Date(dtf2$b, origin = '1970-01-01')
UPD: When R looks at dates as integers, its origin is January 1, 1970.
Hope that helps.
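To make the point about the origin concrete, here is a minimal base R sketch. The name agg is only illustrative, and the guard assumes that only older R versions drop the Date class; on a current version the aggregate() call should already return dates.

```r
# Rebuild the example data frame from the question
df1 <- data.frame(
  id = rep(c(1, 2), 2),
  b  = as.Date(paste0("01/01/", 2000:2003), format = "%d/%m/%Y")
)

# The integers shown in the question are day counts since 1970-01-01
first_value <- as.Date(11688, origin = "1970-01-01")

# Same aggregation as in the question
agg <- aggregate(x = list(b = df1$b), by = list(id = df1$id), FUN = max)

# Assumption: if this R version returned numbers, restore the Date class
if (!inherits(agg$b, "Date")) {
  agg$b <- as.Date(agg$b, origin = "1970-01-01")
}
agg
```

Either way, agg$b ends up as a Date column, so later date arithmetic and formatting keep working.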

Related

Adding something to a list of dates in a column

Suppose a data.table is:
dt <- structure(list(type = c("A", "B", "C"), dates = c("21-07-2011",
"22-11-2011,01-12-2011", "07-08-2012,14-08-2012,18-08-2012,11-10-2012"
)), class = c("data.table", "data.frame"), row.names = c(NA, -3L))
Check it:
type dates
1: A 21-07-2011
2: B 22-11-2011,01-12-2011
3: C 07-08-2012,14-08-2012,18-08-2012,11-10-2012
I need to add, say, 5 days to each of the dates in the second column, i.e., I want the result to be as follows:
type dates
1: A 26-07-2011
2: B 27-11-2011,06-12-2011
3: C 12-08-2012,19-08-2012,23-08-2012,16-10-2012
Any help would be appreciated.
Using only basic R you can do:
dt$dates = sapply(dt$dates, function(x){
  dates = as.Date(strsplit(x,",")[[1]], format = "%d-%m-%Y")
  paste(format(dates+5, '%d-%m-%Y'), collapse = ",")
})
Result:
> dt
type dates
1: A 26-07-2011
2: B 27-11-2011,06-12-2011
3: C 12-08-2012,19-08-2012,23-08-2012,16-10-2012
This procedure is practically the same as the one given by akrun, but without the extra libraries.
Grouped by 'type', we split 'dates' on the comma (with strsplit), convert the pieces to a Date class object with dmy (from lubridate), add 5, format the result back to the original format, paste it into a single string, and assign (:=) to update the 'dates' column in the dataset.
library(lubridate)
library(data.table)
dt[, dates := paste(format(dmy(unlist(strsplit(dates, ","))) + 5,
'%d-%m-%Y'), collapse=','), by = type]
dt
# type dates
#1: A 26-07-2011
#2: B 27-11-2011,06-12-2011
#3: C 12-08-2012,19-08-2012,23-08-2012,16-10-2012
Another option that avoids splitting, converting to Date, and reformatting is a regex method with gsubfn:
library(gsubfn)
dt[, dates := gsubfn("^(\\d+)", ~ as.numeric(x) + 5,
gsubfn(",(\\d+)", ~sprintf(",%02d", as.numeric(x) + 5), dates))]
dt
# type dates
#1: A 26-07-2011
#2: B 27-11-2011,06-12-2011
#3: C 12-08-2012,19-08-2012,23-08-2012,16-10-2012
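NOTE: I would assume the second method to be faster, as we are not splitting, converting, pasting, etc. It only adds 5 to the day digits as text, though, so unlike the Date-based methods it does not roll over into the next month (a day of 28 would become 33).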
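For cases where a date can cross a month boundary, a Date-based helper is the safer route. The following is a small base R sketch; add_days and its parameters are hypothetical names used only for illustration, not part of any package.

```r
# Hypothetical helper: add n days to every date in comma-separated strings
add_days <- function(x, n = 5, fmt = "%d-%m-%Y") {
  vapply(strsplit(x, ",", fixed = TRUE), function(parts) {
    # Convert to Date so month and year boundaries are handled correctly
    shifted <- as.Date(parts, format = fmt) + n
    paste(format(shifted, fmt), collapse = ",")
  }, character(1))
}

add_days(c("28-07-2011", "22-11-2011,01-12-2011"))
```

Because the arithmetic happens on Date objects, 28-07-2011 rolls over to 02-08-2011 instead of producing an invalid day number.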

How can I use the pad function (from the padr package) for multiple data frames?

I have 24 files (one for each hour of the day, HR_NBR = Hour Number) and I have to pad the dates in each of the files.
AS-IS data:
CLNDR_DT HR_NBR QTY
01/07/2016 1 6
03/07/2016 1 10
TO-BE data:
CLNDR_DT HR_NBR QTY
01/07/2016 1 6
02/07/2016 NA NA
03/07/2016 1 10
I can use the pad function for each file, like this:
chil_bev1_1 = pad (chil_bev1_1, interval= "day") # Hour1
chil_bev1_2 = pad (chil_bev1_2, interval= "day") # Hour2
and so on.
And it works. But I want to use a loop or lapply.
I tried several variations of these two snippets, but none of them worked:
df1 = data.frame (chil_bev1_1)
df2 = data.frame (chil_bev1_2)
dflist = c("df1","df2")
CODE1:
x = function(df) {df %>% pad}
allpad = lapply(dflist,x)
CODE2:
x = function(df) {pad (df)}
allpad = lapply(dflist,x)
The error is
"x must be a data frame".
I'm new to R. Any help would be greatly appreciated.
Thank you.
I managed to figure it out. Here's the answer:
hour_list = list(chil_bev1_1, chil_bev1_2)
chil_bev1n = lapply (hour_list, function (x) {x %>% complete(CLNDR_DT = seq.Date(min(CLNDR_DT), max(CLNDR_DT), by="day"), fill = list(QTY=0))})
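Notes:
The fill = list() argument replaces the NAs with 0s (here only in QTY).
CLNDR_DT is the name of the column that contains dates.
complete() comes from the tidyr package, and %>% from dplyr (or magrittr), so those need to be loaded first.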
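The original attempts most likely failed because dflist held the names "df1" and "df2" as strings, so lapply passed character values rather than data frames to the function. Below is a minimal base R sketch of the list-based pattern. It swaps in merge() in place of pad or complete so it runs without extra packages; pad_days, h1 and h2 are hypothetical names and the sample values are made up for illustration.

```r
# Hypothetical helper: insert a row for every missing day in one data frame
pad_days <- function(df, date_col = "CLNDR_DT") {
  all_days <- data.frame(seq(min(df[[date_col]]), max(df[[date_col]]), by = "day"))
  names(all_days) <- date_col
  # Left join on the full calendar; days without data get NA in the other columns
  merge(all_days, df, by = date_col, all.x = TRUE)
}

# Two small stand-ins for the hourly files
h1 <- data.frame(CLNDR_DT = as.Date(c("2016-07-01", "2016-07-03")),
                 HR_NBR = 1, QTY = c(6, 10))
h2 <- data.frame(CLNDR_DT = as.Date(c("2016-07-01", "2016-07-04")),
                 HR_NBR = 2, QTY = c(3, 8))

# A list of the data frames themselves, not a character vector of their names
hour_list <- list(h1, h2)
padded <- lapply(hour_list, pad_days)
padded[[1]]
```

If the data frames already exist under known names, mget(c("h1", "h2")) would build the same list from the names.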

Fill a data frame with increasing date objects in R

I can't seem to figure the following out.
I have a data frame with 398 rows and 16 variables. I want to add a date variable. I know that for each row the date increases by a week and starts with 2010-01-01. I've tried the following:
date <- ymd("2010-01-01")
df <- as.data.frame(c(1:nrow(data), 1))
for (i in 1:nrow(data)){
date <- date + 7
df[i,] <- as.Date(date)
}
I then want to bind it to my data-frame. However, the values inside df are non-dates. If I perform the date +7 calculation it works (e.g. once it goes to 2010-01-08), but if I assign it to the df it turns into weird numerical values.
Appreciate any help.
Try the following:
library(lubridate)
date <- ymd("2010-01-01")
df <- data.frame(ind = 1:5)
df$dates <- seq.Date(from = date, length.out = nrow(df), by = 7)
# note that `by = "1 week"` would also work, if you prefer more readable code.
df
ind dates
1 1 2010-01-01
2 2 2010-01-08
3 3 2010-01-15
4 4 2010-01-22
5 5 2010-01-29
Try this:
df$date <- seq(as.Date("2010-01-01"), by = 7, length.out = 398)
Also, try to get in the habit of not giving your variables names that are already used by functions, such as data and date.
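As for why the loop produced numbers: judging from the code shown, df started out as a plain numeric column, and assigning a Date into a numeric column keeps only the underlying day count. If a loop is really wanted, one option is to pre-allocate a Date vector, as in this small sketch (n, d and dates are illustrative names, and n is kept small here).

```r
n <- 4                               # stand-in for nrow(data)
dates <- as.Date(rep(NA, n))         # pre-allocated vector that already has class Date
d <- as.Date("2010-01-01")

for (i in seq_len(n)) {
  dates[i] <- d                      # assignment into a Date vector keeps the class
  d <- d + 7                         # move forward one week
}
dates
```

The vectorised seq() call from the answers above gives the same result and is usually preferable.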

melt with chron

I'm trying to melt a data frame with chron class
library(chron)
x = data.frame(Index = as.chron(c(15657.00,15657.17)), Var1 = c(1,2), Var2 = c(9,8))
x
Index Var1 Var2
1 (11/13/12 00:00:00) 1 9
2 (11/13/12 04:04:48) 2 8
y = melt(x,id.vars="Index")
Error in data.frame(ids, variable, value, stringsAsFactors = FALSE) :
arguments imply differing number of rows: 2, 4
I can work around it with as.numeric() as follows:
x$Index= as.numeric(x$Index)
y = melt(x,id.vars="Index")
y$Index = as.chron(y$Index)
y
Index variable value
1 (11/13/12 00:00:00) Var1 1
2 (11/13/12 04:04:48) Var1 2
3 (11/13/12 00:00:00) Var2 9
4 (11/13/12 04:04:48) Var2 8
But can it be simpler? (I want to keep the chron class.)
(1) I assume you issued this command before running the code shown:
library(reshape2)
In that case you could use the reshape package instead. It doesn't result in this problem:
library(reshape)
Other solutions are to
(2) use R's reshape function:
reshape(direction = "long", data = x, varying = list(2:3), v.names = "Var")
(3) or convert the chron column to numeric, use melt from the reshape2 package and then convert back:
library(reshape2)
xt <- transform(x, Index = as.numeric(Index))
transform(melt(xt, id = 1), Index = chron(Index))
ADDED additional solutions.
I'm not sure but I think this might be an "oversight" in chron (or possibly data.frame, but that seems unlikely).
The issue occurs when constructing the data frame in melt.data.frame in reshape2, which typically uses recycling, but that portion of data.frame:
for (j in seq_along(xi)) {
    xi1 <- xi[[j]]
    if (is.vector(xi1) || is.factor(xi1))
        xi[[j]] <- rep(xi1, length.out = nr)
    else if (is.character(xi1) && class(xi1) == "AsIs")
        xi[[j]] <- structure(rep(xi1, length.out = nr), class = class(xi1))
    else if (inherits(xi1, "Date") || inherits(xi1, "POSIXct"))
        xi[[j]] <- rep(xi1, length.out = nr)
    else {
        fixed <- FALSE
        break
    }
}
seems to go wrong, as the chron variable doesn't inherit either Date or POSIXct. This removes the error but alters the date times:
x = data.frame(Index = as.chron(c(15657.00,15657.17)), Var1 = c(1,2), Var2 = c(9,8))
class(x$Index) <- c(class(x$Index),'POSIXct')
y = melt(x,id.vars="Index")
Like I said, this sorta smells like a bug somewhere. My money would be on the need for chron to add POSIXct to the class vector, but I could be wrong. The obvious alternative would be to use POSIXct date times instead.

Aggregating daily content

I've been attempting to aggregate (somewhat erratic) daily data. I'm actually working with CSV data, but if I recreate it, it would look something like this:
library(zoo)
dates <- c("20100505", "20100505", "20100506", "20100507")
val1 <- c("10", "11", "1", "6")
val2 <- c("5", "31", "2", "7")
x <- data.frame(dates = dates, val1=val1, val2=val2)
z <- read.zoo(x, format = "%Y%m%d")
Now I'd like to aggregate this on a daily basis (notice that sometimes there is more than one data point for a day, and sometimes there isn't).
I've tried lots and lots of variations, but I can't seem to aggregate. For instance, this fails:
aggregate(z, as.Date(time(z)), sum)
# Error in Summary.factor(2:3, na.rm = FALSE) : sum not meaningful for factors
There seems to be a lot of content regarding aggregate, and I've tried a number of versions, but I can't seem to sum this on a daily level. I'd also like to run cummax and cumulative averages in addition to the daily summing.
Any help would be greatly appreciated.
Update
The code I am actually using is as follows:
z <- read.zoo(file = "data.csv", sep = ",", header = TRUE, stringsAsFactors = FALSE, blank.lines.skip = T, na.strings="NA", format = "%Y%m%d");
It seems my (unintentional) quotation of the numbers above is similar to what is happening in practice, because when I do:
aggregate(z, index(z), sum)
#Error in Summary.factor(25L, na.rm = FALSE) : sum not meaningful for factors
There are a number of columns (100 or so). How can I specify that they should be converted with as.numeric automatically? (stringsAsFactors = FALSE doesn't appear to work?)
Or you aggregate before using zoo (val1 and val2 need to be numeric though).
x <- data.frame(dates = dates, val1=as.numeric(val1), val2=as.numeric(val2))
y <- aggregate(x[,2:3],by=list(x[,1]),FUN=sum)
and then feed y into zoo.
You avoid the warning:)
You started on the right path but made a couple of mistakes.
First, zoo only consumes matrices, not data.frames. Second, those need numeric inputs:
> z <- zoo(as.matrix(data.frame(val1=c(10,11,1,6), val2=c(5,31,2,7))),
+ order.by=as.Date(c("20100505","20100505","20100506","20100507"),
+ "%Y%m%d"))
Warning message:
In zoo(as.matrix(data.frame(val1 = c(10, 11, 1, 6), val2 = c(5, :
some methods for "zoo" objects do not work if the index entries in
'order.by' are not unique
This gets us a warning which is standard in zoo: it does not like identical time indices.
Always a good idea to show the data structure, maybe via str() as well, maybe run summary() on it:
> z
val1 val2
2010-05-05 10 5
2010-05-05 11 31
2010-05-06 1 2
2010-05-07 6 7
And then, once we have it, aggregation is easy:
> aggregate(z, index(z), sum)
val1 val2
2010-05-05 21 36
2010-05-06 1 2
2010-05-07 6 7
>
val1 and val2 are character strings. data.frame() converts them to factors (the default in R versions before 4.0, where stringsAsFactors was TRUE). Summing factors doesn't make sense. You probably intended:
x <- data.frame(dates = dates, val1=as.numeric(val1), val2=as.numeric(val2))
z <- read.zoo(x, format = "%Y%m%d")
aggregate(z, as.Date(time(z)), sum)
which yields:
val1 val2
2010-05-05 21 36
2010-05-06 1 2
2010-05-07 6 7
Convert the character columns to numeric and then use read.zoo making use of its aggregate argument:
> x[-1] <- lapply(x[-1], function(x) as.numeric(as.character(x)))
> read.zoo(x, format = "%Y%m%d", aggregate = sum)
val1 val2
2010-05-05 21 36
2010-05-06 1 2
2010-05-07 6 7
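The question also asked about cummax and cumulative averages, which the answers above do not cover. Here is a minimal base R sketch under the assumption that these should be computed on the daily sums; it uses aggregate() on a plain data frame instead of zoo, and daily, val1_cummax and val1_cummean are illustrative names.

```r
# Recreate the question's data, with factors as in older R versions
x <- data.frame(dates = c("20100505", "20100505", "20100506", "20100507"),
                val1 = c("10", "11", "1", "6"),
                val2 = c("5", "31", "2", "7"),
                stringsAsFactors = TRUE)

# Convert every value column at once; as.character() first avoids getting factor codes
x[-1] <- lapply(x[-1], function(col) as.numeric(as.character(col)))
x$dates <- as.Date(as.character(x$dates), format = "%Y%m%d")

# Daily sums for both value columns
daily <- aggregate(cbind(val1, val2) ~ dates, data = x, FUN = sum)

# Running maximum and running mean of the daily sums
daily$val1_cummax  <- cummax(daily$val1)
daily$val1_cummean <- cumsum(daily$val1) / seq_along(daily$val1)
daily
```

The same lapply() conversion scales to 100 or so columns, since it touches every column except the first.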
