Fill a data frame with increasing date objects in R - r

I can't seem to figure the following out.
I have a data frame with 398 rows and 16 variables. I want to add a date variable. I know that for each row the date increases by a week and starts with 2010-01-01. I've tried the following:
date <- ymd("2010-01-01")
df <- as.data.frame(c(1:nrow(data), 1))
for (i in 1:nrow(data)){
date <- date + 7
df[i,] <- as.Date(date)
}
I then want to bind it to my data-frame. However, the values inside df are non-dates. If I perform the date +7 calculation it works (e.g. once it goes to 2010-01-08), but if I assign it to the df it turns into weird numerical values.
Appreciate any help.

Try the following:
library(lubridate)
date <- ymd("2010-01-01")
df <- data.frame(ind = 1:5)
df$dates <- seq.Date(from = date, length.out = nrow(df), by = 7)
# note that `by = "1 week"` would also work, if you prefer more readable code.
df
ind dates
1 1 2010-01-01
2 2 2010-01-08
3 3 2010-01-15
4 4 2010-01-22
5 5 2010-01-29

Try this:
df$date <- seq(as.Date("2010-01-01"), by = 7, length.out = 398)
Also, try to get in the habit of not naming your variables after existing functions, such as data and date.
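As for why the loop in the question produced "weird numerical values": a Date is stored internally as the number of days since 1970-01-01, and assigning one into an existing numeric column keeps only that number. Here is a minimal sketch of the effect and the fix; `n`, `num_df` and `date_df` are made-up names for illustration, with `n` standing in for nrow(data).

```r
start <- as.Date("2010-01-01")
n <- 5  # hypothetical stand-in for nrow(data)

# Assigning a Date into a numeric column drops the Date class:
# only the underlying day count survives.
num_df <- data.frame(x = numeric(n))
num_df$x[1] <- start + 7
num_df$x[1]  # a plain number (days since 1970-01-01), not a Date

# Building the whole column as a Date vector keeps the class intact.
date_df <- data.frame(ind = seq_len(n))
date_df$date <- seq(start, by = "week", length.out = n)
class(date_df$date)
```

Creating the column in one vectorised step also avoids the loop entirely.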

Related

Converting dates in column of a data frame R

I am having problems converting a column of imported dates in a data frame, represented as characters in a different date format, into date objects in that same data frame. Here is a toy example:
xx <- data.frame(A = c(10, 15, 20), B = c("10/15/2010", "9/8/2015", "8/5/2013"))
If I print xx,
A B
1 10 10/15/2010
2 15 9/8/2015
3 20 8/5/2013
I apply:
xx[, "B"] <- sapply(xx[, "B"], function(x) {as.Date(x,
format = "%m/%d/%Y", origin = "1970-01-01")})
and I get:
A B
1 10 14897
2 15 16686
3 20 15922
If I look at the mode of column B, it is numeric, not date. No matter what I try I cannot seem to get a result that converts column B to a date type. I can always add:
xx[, "B"] <- as.Date(xx[, "B"])
but there must be a way to do this in one statement.
If you have only one column to convert, you can do
xx$B <- as.Date(xx$B, "%m/%d/%Y")
If you have multiple columns use lapply instead of sapply
cols <- 2
xx[cols] <- lapply(xx[cols], as.Date, "%m/%d/%Y")
Or using lubridate where you don't need to specify the format argument.
xx$B <- lubridate::mdy(xx$B)
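The reason sapply returns numbers is that it simplifies its result into a plain vector, which discards the Date class, while lapply returns a list whose elements keep their class. A small sketch using the toy data from the question (the names `via_sapply` and `via_lapply` are only for illustration):

```r
xx <- data.frame(A = c(10, 15, 20),
                 B = c("10/15/2010", "9/8/2015", "8/5/2013"),
                 stringsAsFactors = FALSE)

# sapply simplifies the per-element results into a numeric vector,
# so the Date class is lost.
via_sapply <- sapply(xx$B, as.Date, format = "%m/%d/%Y")
class(via_sapply)

# lapply over the column keeps the result as a list element of class Date.
via_lapply <- lapply(xx["B"], as.Date, format = "%m/%d/%Y")
class(via_lapply$B)

# as.Date is vectorised, so for one column no apply function is needed.
xx$B <- as.Date(xx$B, format = "%m/%d/%Y")
```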

How can I use the pad function (from the padr package) for multiple data frames?

I have 24 files (1 for each hour of the day, HR_NBR = Hour Number) and I've to pad the dates in each of the files.
AS-IS data:
CLNDR_DT HR_NBR QTY
01/07/2016 1 6
03/07/2016 1 10
TO-BE data:
CLNDR_DT HR_NBR QTY
01/07/2016 1 6
02/07/2016 NA NA
03/07/2016 1 10
I can use the pad function for each file, like this:
chil_bev1_1 = pad (chil_bev1_1, interval= "day") # Hour1
chil_bev1_2 = pad (chil_bev1_2, interval= "day") # Hour2
and so on.
And it works. But I want to use a loop or lapply.
I tried several variations of these 2 codes, but none of them worked:
df1 = data.frame (chil_bev1_1)
df2 = data.frame (chil_bev1_2)
dflist = c("df1","df2")
CODE1:
x = function(df) {df %>% pad}
allpad = lapply(dflist,x)
CODE2:
x = function(df) {pad (df)}
allpad = lapply(dflist,x)
The error is
"x must be a data frame".
I'm new to R. Any help would be greatly appreciated.
Thank you.
I managed to figure it out. Here's the answer:
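library(dplyr)  # provides %>%
library(tidyr)  # provides complete()
hour_list = list(chil_bev1_1, chil_bev1_2)  # a list of the data frames themselves, not of their names
chil_bev1n = lapply(hour_list, function(x) {x %>% complete(CLNDR_DT = seq.Date(min(CLNDR_DT), max(CLNDR_DT), by = "day"), fill = list(QTY = 0))})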
Notes:
The fill = list() argument replaces the NAs with 0s.
CLNDR_DT is the name of the column that contains dates; it must be of class Date for seq.Date to work.
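The original error ("x must be a data frame") came from passing a character vector of names, c("df1","df2"), to lapply instead of a list of the data frames. Below is a rough base R sketch of the same idea without padr or tidyr. The two small data frames and the helper `pad_days` are invented for illustration, and the padding is done with merge rather than pad, so missing days get NA as in the TO-BE table.

```r
# Hypothetical stand-ins for two of the 24 hourly data frames.
hour1 <- data.frame(CLNDR_DT = as.Date(c("2016-07-01", "2016-07-03")),
                    HR_NBR = 1, QTY = c(6, 10))
hour2 <- data.frame(CLNDR_DT = as.Date(c("2016-07-01", "2016-07-04")),
                    HR_NBR = 2, QTY = c(3, 8))

# Illustrative helper: left-join the data onto a complete daily sequence,
# so days without data appear with NA in the other columns.
pad_days <- function(df) {
  all_days <- data.frame(
    CLNDR_DT = seq(min(df$CLNDR_DT), max(df$CLNDR_DT), by = "day")
  )
  merge(all_days, df, by = "CLNDR_DT", all.x = TRUE)
}

# A list of data frames (not a character vector of their names).
hour_list <- list(hour1 = hour1, hour2 = hour2)
padded <- lapply(hour_list, pad_days)
padded$hour1
```

With all 24 data frames in one named list, the result can be accessed as padded[["hour1"]], padded[["hour2"]] and so on.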

Calculate time difference between two events (given date and time) in R

I'm currently struggling with a beginner's issue regarding the calculation of a time difference between two events.
I want to take a column consisting of date and time (both values in one column) into consideration and calculate a time difference with the value of the previous/next row with the same ID (A or B in this example).
ID = c("A", "A", "B", "B")
time = c("08.09.2014 10:34","12.09.2014 09:33","13.08.2014 15:52","11.09.2014 02:30")
d = data.frame(ID,time)
My desired output is in the format Hours:Minutes
time difference = c("94:59","94:59","682:38","682:38")
The format Days:Hours:Minutes or anything similar would also work, as long as it could be conveniently implemented. I am flexible regarding the format of the output, the above is just an idea that crossed my mind.
For each single ID, I always have two rows (in the example 2xA and 2xB). I don't have a convincing idea how to avoid the repetition of the difference.
I've tried some examples before, which I found on Stack Overflow. Most of them used POSIXt and strptime. However, I didn't manage to apply those ideas to my data set.
Here's my attempt using dplyr
library(dplyr)
d %>%
mutate(time = as.POSIXct(time, format = "%d.%m.%Y %H:%M")) %>%
group_by(ID) %>%
mutate(diff = paste0(gsub("[.].*", "", diff(time)*24), ":",
round(as.numeric(gsub(".*[.]", ".", diff(time)*24))*60)))
# Source: local data frame [4 x 3]
# Groups: ID
#
# ID time diff
# 1 A 2014-09-08 10:34:00 94:59
# 2 A 2014-09-12 09:33:00 94:59
# 3 B 2014-08-13 15:52:00 682:38
# 4 B 2014-09-11 02:30:00 682:38
A very (to me) hack-ish base solution:
ID <- c("A", "A", "B", "B")
time <- c("08.09.2014 10:34", "12.09.2014 09:33", "13.08.2014 15:52","11.09.2014 02:30")
d <- data.frame(ID, time)
d$time <- as.POSIXct(d$time, format="%d.%m.%Y %H:%M")
unlist(unname(lapply(split(d, d$ID), function(d) {
sapply(abs(diff(c(d$time[2], d$time))), function(x) {
sprintf("%s:%s", round(((x*24)%/%1)), round(((x*24)%%1 *60)))
})
})))
## [1] "94:59" "94:59" "682:38" "682:38"
I have to believe this function exists somewhere already, tho.
Similar to the attempts of David and hrmbrmstr, I found that this solution using difftime works.
I use a rowShift function I found on Stack Overflow:
rowShift <- function(x, shiftLen = 1L) {
r <- (1L + shiftLen):(length(x) + shiftLen)
r[r<1] <- NA
return(x[r])
}
d$time.c <- as.POSIXct(d$time, format = "%d.%m.%Y %H:%M")
d$time.prev <- rowShift(d$time.c,-1)
d$diff <- difftime(d$time.c,d$time.prev, units="hours")
The values in d$diff alternate in sign from row to row, because every other row compares times across two different IDs. I remove all the rows with negative values, which leaves the difference between the first and the last time for every ID.
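For comparison, here is a base R sketch that works in whole minutes and formats with sprintf, which avoids parsing the printed difftime with gsub. It assumes exactly the layout from the question (two rows per ID); tz = "UTC" is an assumption added here so that daylight saving changes cannot shift the result, and the names `mins` and `m` are only illustrative.

```r
d <- data.frame(ID = c("A", "A", "B", "B"),
                time = c("08.09.2014 10:34", "12.09.2014 09:33",
                         "13.08.2014 15:52", "11.09.2014 02:30"))

# Assumption: times are treated as UTC to sidestep DST effects.
d$time <- as.POSIXct(d$time, format = "%d.%m.%Y %H:%M", tz = "UTC")

# Minutes between the earliest and latest time within each ID.
# as.numeric() on POSIXct gives seconds, hence the division by 60.
mins <- tapply(as.numeric(d$time), d$ID,
               function(s) round(diff(range(s)) / 60))

# Look up each row's group total and format it as Hours:Minutes.
m <- as.vector(mins[as.character(d$ID)])
d$diff <- sprintf("%d:%02d", m %/% 60, m %% 60)
d
```

The %02d pads the minutes, so a difference of 5 hours and 3 minutes would print as 5:03 rather than 5:3.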

R to create a data frame a specific way

I am trying to create a data frame in which each date in the time column is repeated several times (four in the example below) before the next date starts. For example:
1983-01-01
1983-01-01
1983-01-01
1983-01-01
1983-01-02
1983-01-02
etc.
for 10 years.
I used this command, but I don't have the needed format.
data=data.frame(date=as.Date("1983-01-01") +seq(n))
head(data)
date
1 1983-01-02
2 1983-01-03
3 1983-01-04
4 1983-01-05
5 1983-01-06
6 1983-01-07
Here's one way to create the data frame:
library(zoo)
start_date <- as.Date("1983-01-01")
stop_date <- as.Date(as.yearmon(start_date) + 10) - 1
# [1] "1992-12-31"
dat <- data.frame(date = rep(seq(start_date, stop_date, by = 1), each = 4))
Update (based on comment):
dates <- lapply(seq(0, 9), function(x)
rep(as.Date((as.yearmon(start_date) + x) + (0:11)/12), each = 3) + c(0,10,20))
dat <- do.call(rbind, lapply(dates,
function(x) data.frame(date = rep(x, each = 4))))
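If you would rather not depend on zoo, the first version above (every day, each repeated four times) can also be sketched in base R. This assumes the ten-year span is meant to run from 1983-01-01 through 1992-12-31, as in the answer above.

```r
start_date <- as.Date("1983-01-01")

# Step ten years ahead with seq(), then back one day to get the last day.
stop_date <- seq(start_date, by = "10 years", length.out = 2)[2] - 1

# Every day in the range, each repeated four times.
dat <- data.frame(date = rep(seq(start_date, stop_date, by = "day"), each = 4))
head(dat, 6)
```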

Aggregating daily content

I've been attempting to aggregate (somewhat erratic) daily data. I'm actually working with csv data, but if I recreate it, it would look something like this:
library(zoo)
dates <- c("20100505", "20100505", "20100506", "20100507")
val1 <- c("10", "11", "1", "6")
val2 <- c("5", "31", "2", "7")
x <- data.frame(dates = dates, val1=val1, val2=val2)
z <- read.zoo(x, format = "%Y%m%d")
Now I'd like to aggregate this on a daily basis (notice that sometimes there is more than one data point for a day, and sometimes there isn't).
I've tried lots and lots of variations, but I can't seem to aggregate, so for instance this fails:
aggregate(z, as.Date(time(z)), sum)
# Error in Summary.factor(2:3, na.rm = FALSE) : sum not meaningful for factors
There seems to be a lot of content regarding aggregate, and I've tried a number of versions but can't seem to sum this on a daily level. I'd also like to run cummax and cumulative averages in addition to the daily summing.
Any help would be greatly appreciated.
Update
The code I am actually using is as follows:
z <- read.zoo(file = "data.csv", sep = ",", header = TRUE, stringsAsFactors = FALSE, blank.lines.skip = T, na.strings="NA", format = "%Y%m%d");
It seems my (unintentional) quotation of the numbers above is similar to what is happening in practice, because when I do:
aggregate(z, index(z), sum)
#Error in Summary.factor(25L, na.rm = FALSE) : sum not meaningful for factors
There are a number of columns (100 or so). How can I convert them all to numeric automatically? (stringsAsFactors = FALSE doesn't appear to work.)
Or you aggregate before using zoo (val1 and val2 need to be numeric though).
x <- data.frame(dates = dates, val1=as.numeric(val1), val2=as.numeric(val2))
y <- aggregate(x[,2:3],by=list(x[,1]),FUN=sum)
and then feed y into zoo.
You avoid the warning:)
You started on the right path but made a couple of mistakes.
First, zoo only consumes matrices, not data.frames. Second, those need numeric inputs:
> z <- zoo(as.matrix(data.frame(val1=c(10,11,1,6), val2=c(5,31,2,7))),
+ order.by=as.Date(c("20100505","20100505","20100506","20100507"),
+ "%Y%m%d"))
Warning message:
In zoo(as.matrix(data.frame(val1 = c(10, 11, 1, 6), val2 = c(5, :
some methods for "zoo" objects do not work if the index entries in
'order.by' are not unique
This gets us a warning which is standard in zoo: it does not like identical time indices.
Always a good idea to show the data structure, maybe via str() as well, maybe run summary() on it:
> z
val1 val2
2010-05-05 10 5
2010-05-05 11 31
2010-05-06 1 2
2010-05-07 6 7
And then, once we have it, aggregation is easy:
> aggregate(z, index(z), sum)
val1 val2
2010-05-05 21 36
2010-05-06 1 2
2010-05-07 6 7
>
val1 and val2 are character strings. data.frame() converts them to factors (this was the default before R 4.0.0). Summing factors doesn't make sense. You probably intended:
x <- data.frame(dates = dates, val1=as.numeric(val1), val2=as.numeric(val2))
z <- read.zoo(x, format = "%Y%m%d")
aggregate(z, as.Date(time(z)), sum)
which yields:
val1 val2
2010-05-05 21 36
2010-05-06 1 2
2010-05-07 6 7
Convert the character columns to numeric and then use read.zoo making use of its aggregate argument:
> x[-1] <- lapply(x[-1], function(x) as.numeric(as.character(x)))
> read.zoo(x, format = "%Y%m%d", aggregate = sum)
val1 val2
2010-05-05 21 36
2010-05-06 1 2
2010-05-07 6 7
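The question also asked about cummax and cumulative averages. A possible base R sketch, without zoo, is to aggregate per day first and then apply the cumulative functions to the daily sums. The column names `val1_cummax` and `val1_cummean` are made up for this example, and running the cumulative statistics over the daily sums (rather than the raw rows) is an assumption about what is wanted.

```r
x <- data.frame(dates = c("20100505", "20100505", "20100506", "20100507"),
                val1 = c("10", "11", "1", "6"),
                val2 = c("5", "31", "2", "7"),
                stringsAsFactors = FALSE)

# Convert every value column to numeric and the dates to class Date.
x[-1] <- lapply(x[-1], as.numeric)
x$dates <- as.Date(x$dates, format = "%Y%m%d")

# Daily sums for both value columns.
daily <- aggregate(cbind(val1, val2) ~ dates, data = x, FUN = sum)

# Running maximum and running mean of the daily sums of val1.
daily$val1_cummax <- cummax(daily$val1)
daily$val1_cummean <- cumsum(daily$val1) / seq_along(daily$val1)
daily
```

The x[-1] <- lapply(...) line scales to the 100 or so columns mentioned in the question, since it converts everything except the first (date) column.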
