Add POSIXlt as a new column to a dataframe - r

I am creating some random numbers:
data <- matrix(runif(10, 0, 1), ncol = 2)
dataframe <- data.frame(data)
> dataframe
X1 X2
1 0.7981783 0.13233858
2 0.9592338 0.05512942
3 0.1812384 0.74571334
4 0.1447498 0.96656930
5 0.1735390 0.37345575
and I want to create a corresponding timestamp column and bind that to the above data frame.
time <- as.POSIXlt(runif(10, 0, 60), origin = "2017-05-05 10:00:00")
This creates 10 values.
> time
[1] "2017-05-05 13:00:27 EEST" "2017-05-05 13:00:02 EEST" "2017-05-05 13:00:26 EEST" "2017-05-05 13:00:25 EEST" "2017-05-05 13:00:28 EEST"
[6] "2017-05-05 13:00:17 EEST" "2017-05-05 13:00:35 EEST" "2017-05-05 13:00:08 EEST" "2017-05-05 13:00:29 EEST" "2017-05-05 13:00:32 EEST"
Now I want to bind it to the data frame, so I thought I would first make it a matrix:
time <- matrix(time, nrow = 5, ncol = 2)
but this gives me:
Warning message:
In matrix(time, nrow = 5, ncol = 2) :
data length [11] is not a sub-multiple or multiple of the number of rows [5]

The reason is that POSIXlt stores the date-time as a list of components (sec, min, hour, mday, mon, year, wday, yday, isdst, zone, gmtoff — 11 in all, which is why the warning reports data length [11]), whereas POSIXct stores it as a single numeric vector. So it is better to use as.POSIXct:
time <- as.POSIXct(runif(10, 0, 60), origin = "2017-05-05 10:00:00")
If we need to store the values, they can be kept as columns of a data.frame
data.frame(date1 = time[1:5], date2 = time[6:10])
without converting to a matrix, because the datetime gets coerced to its underlying numeric storage when converted to a matrix.
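For example, a minimal sketch using the objects created above: the two POSIXct columns can be bound to the 5-row data frame from the question with cbind(), and they keep their datetime class ('result', 'date1' and 'date2' are illustrative names).
# bind the timestamps to the original 5-row data frame
result <- cbind(dataframe, data.frame(date1 = time[1:5], date2 = time[6:10]))
str(result)   # X1 and X2 stay numeric; date1 and date2 stay POSIXct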
Suppose we proceed with POSIXlt instead; then unclass() reveals its list of components:
time1 <- as.POSIXlt(runif(10, 0, 60), origin = "2017-05-05 10:00:00")
unclass(time1)
#$sec
# [1] 13.424695 40.860449 57.756890 59.072140 24.425521 39.429729 58.309546
# [8] 6.294982 46.613436 25.444415
#$min
# [1] 30 30 30 30 30 30 30 30 30 30
#$hour
# [1] 15 15 15 15 15 15 15 15 15 15
#$mday
# [1] 5 5 5 5 5 5 5 5 5 5
#$mon
# [1] 4 4 4 4 4 4 4 4 4 4
#$year
# [1] 117 117 117 117 117 117 117 117 117 117
#$wday
# [1] 5 5 5 5 5 5 5 5 5 5
#$yday
# [1] 124 124 124 124 124 124 124 124 124 124
#$isdst
# [1] 0 0 0 0 0 0 0 0 0 0
#$zone
# [1] "IST" "IST" "IST" "IST" "IST" "IST" "IST" "IST" "IST" "IST"
#$gmtoff
# [1] 19800 19800 19800 19800 19800 19800 19800 19800 19800 19800
#attr(,"tzone")
#[1] "" "IST" "IST"
With POSIXct, unclass() instead reveals the underlying numeric storage values (seconds since the epoch):
unclass(time)
#[1] 1493978445 1493978451 1493978432 1493978402 1493978447 1493978441
#[7] 1493978445 1493978450 1493978419 1493978425
#attr(,"tzone")
#[1] ""

Related

Creating different data frame sizes when grouping in dplyr and summarising with summarise_by_time

I have a dataframe that looks like this:
> head(subppm)
File ChunkEnd DPM Nall MinsOn area station deployment cpod
1 File1.CP3 11/4/2014 00:00 0 287 1 FB FB1 FB1Ha 917
2 File2.CP3 11/4/2014 00:01 0 48 1 FB FB1 FB1Ha 917
3 File3.CP3 11/4/2014 00:02 0 57 1 FB FB1 FB1Ha 917
4 File4.CP3 11/4/2014 00:03 0 44 1 FB FB1 FB1Ha 917
5 File5.CP3 11/4/2014 00:04 0 20 1 FB FB1 FB1Ha 917
6 File6.CP3 11/4/2014 00:05 0 9 1 FB FB1 FB1Ha 917
DateTime
1 2014-04-11 00:00:00
2 2014-04-11 00:00:01
3 2014-04-11 00:00:02
4 2014-04-11 00:00:03
5 2014-04-11 00:00:04
6 2014-04-11 00:00:05
> sapply(subppm,class)
$File
[1] "character"
$ChunkEnd
[1] "character"
$DPM
[1] "integer"
$Nall
[1] "integer"
$MinsOn
[1] "integer"
$area
[1] "character"
$station
[1] "character"
$deployment
[1] "character"
$cpod
[1] "character"
$DateTime
[1] "POSIXct" "POSIXt"
I am attempting to group these variables by the $area variable and sum the $DPM variable by month according to $DateTime. DPM is 0/1, so summing the 1s tells me how many minutes per month had data. To do this I am using dplyr and timetk.
histData=subppm %>%
group_by(area)+
summarise_by_time(.data = subppm,
.date_var = DateTime,
.by ='month',
value = sum(DPM, na.rm = TRUE)
)
Error in Ops.data.frame(subppm %>% group_by(area), summarise_by_time(.data = subppm, :
‘+’ only defined for equally-sized data frames
That produces the above error. The thing is, I can't see a way to create data frames that are the same size: I am grouping by area, but we collected data at different areas at different times. I've tried removing the NAs, but that doesn't help. I also can't find a way to solve this that takes both groupings, area and time, into consideration.
According to this example, this method should work. The output format in this example is exactly what I am looking for.
Thoughts?
Reproducible data:
dates1=seq(from = as.Date('2019-01-01 00:00'), to = as.Date('2019-07-10 00:00'), by = 1)
dates2=seq(from = as.Date('2019-05-01 00:00'), to = as.Date('2019-10-10 00:00'), by = 1)
dates3=seq(from = as.Date('2019-03-01 00:00'), to = as.Date('2019-07-31 00:00'), by = 1)
data1=data.frame(area='group1', dates=dates1)
data2=data.frame(area='group2', dates=dates2)
data3=data.frame(area='group3', dates=dates3)
data1$DPM=rbinom(n=nrow(data1), size=1, prob=0.05)
data2$DPM=rbinom(n=nrow(data2), size=1, prob=0.05)
data3$DPM=rbinom(n=nrow(data3), size=1, prob=0.05)
data=rbind(data1,data2,data3)
You are using a + at the end of the second line where there should be a dplyr pipe %>%. That produces the given error.
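A minimal sketch of the corrected pipe, using the reproducible data above (columns area, dates and DPM) and assuming dplyr and timetk are loaded. Note that .data should not be supplied again inside the pipe; otherwise the grouped data from group_by() is not what gets summarised.
library(dplyr)
library(timetk)

histData <- data %>%                    # 'data' from the reproducible example
  group_by(area) %>%                    # %>% here, not +
  summarise_by_time(.date_var = dates,
                    .by = "month",
                    value = sum(DPM, na.rm = TRUE))
This returns one row per area and month, which matches the output format the question is after.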

Why does dplyr convert POSIXct objects

I have a date-time object of class POSIXct. I need to adjust the values by adding several hours. I understand that I can do this using basic addition. For example, I can add 5 hours to a POSIXct object like so:
x <- as.POSIXct("2009-08-02 18:00:00", format="%Y-%m-%d %H:%M:%S")
x
[1] "2009-08-02 18:00:00 PDT"
x + (5*60*60)
[1] "2009-08-02 23:00:00 PDT"
Now I have a data frame in which some times are ok and some are bad.
> df
set_time duration up_time
1 2009-05-31 14:10:00 3 2009-05-31 11:10:00
2 2009-08-02 18:00:00 4 2009-08-02 23:00:00
3 2009-08-03 01:20:00 5 2009-08-03 06:20:00
4 2009-08-03 06:30:00 2 2009-08-03 11:30:00
Note that the first data frame entry has an 'up_time' less than the 'set_time'. So in this context a 'good' time is one where the set_time < up_time. And a 'bad' time is one in which set_time > up_time. I want to leave the good entries alone and fix the bad entries. The bad entries should be fixed by creating an 'up_time' that is equal to the 'set_time' + duration. I do this with the following dplyr pipe:
df1 <- tbl_df(df) %>% mutate(up_time = ifelse(set_time > up_time, set_time +
(duration*60*60), up_time))
df1
# A tibble: 4 x 3
set_time duration up_time
<dttm> <dbl> <dbl>
1 2009-05-31 14:10:00 3. 1243815000.
2 2009-08-02 18:00:00 4. 1249279200.
3 2009-08-03 01:20:00 5. 1249305600.
4 2009-08-03 06:30:00 2. 1249324200.
Up time has been coerced to numeric:
> str(df1)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 4 obs. of 3 variables:
$ set_time: POSIXct, format: "2009-05-31 14:10:00" "2009-08-02 18:00:00"
"2009-08-03 01:20:00" "2009-08-03 06:30:00"
$ duration: num 3 4 5 2
$ up_time : num 1.24e+09 1.25e+09 1.25e+09 1.25e+09
I can convert it back to the desired POSIXct format using:
> as.POSIXct(df1$up_time,origin="1970-01-01")
[1] "2009-05-31 17:10:00 PDT" "2009-08-02 23:00:00 PDT" "2009-08-03 06:20:00
PDT" "2009-08-03 11:30:00 PDT"
But I feel like this last step shouldn't be necessary. Can I/How can I avoid having dplyr change my variable formatting?
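One common workaround (a sketch, not from the original post, assuming set_time and up_time are both POSIXct as shown): base ifelse() gives its result the attributes of the test, so the POSIXct class is dropped, whereas dplyr's type-stable if_else() keeps it and the manual conversion back is not needed.
library(dplyr)

df1 <- df %>%
  mutate(up_time = if_else(set_time > up_time,
                           set_time + duration * 60 * 60,  # both branches are POSIXct
                           up_time))
str(df1$up_time)  # remains POSIXct, no as.POSIXct(origin = ...) step required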

summarize by time interval not working

I have the following data, a vector of POSIXct times spanning one month. Each of them represents a bike delivery. My aim is to find the average number of bike deliveries per ten-minute interval over a 24-hour period (producing a total of 144 rows). First all of the trips need to be summed and binned into intervals, then divided by the number of days. So far I've managed to write code that sums trips per 10-minute interval, but it produces incorrect values, and I am not sure where it went wrong.
The data looks like this:
head(start_times)
[1] "2014-10-21 16:58:13 EST" "2014-10-07 10:14:22 EST" "2014-10-20 01:45:11 EST"
[4] "2014-10-17 08:16:17 EST" "2014-10-07 17:46:36 EST" "2014-10-28 17:32:34 EST"
length(start_times)
[1] 1747
The code looks like this:
library(lubridate)
library(dplyr)
tripduration <- floor(runif(1747) * 1000)
time_bucket <- start_times - minutes(minute(start_times) %% 10) - seconds(second(start_times))
df <- data.frame(tripduration, start_times, time_bucket)
summarized <- df %>%
group_by(time_bucket) %>%
summarize(trip_count = n())
summarized <- as.data.frame(summarized)
out_buckets <- data.frame(out_buckets = seq(as.POSIXlt("2014-10-01 00:00:00"), as.POSIXct("2014-10-31 23:0:00"), by = 600))
out <- left_join(out_buckets, summarized, by = c("out_buckets" = "time_bucket"))
out$trip_count[is.na(out$trip_count)] <- 0
head(out)
out_buckets trip_count
1 2014-10-01 00:00:00 0
2 2014-10-01 00:10:00 0
3 2014-10-01 00:20:00 0
4 2014-10-01 00:30:00 0
5 2014-10-01 00:40:00 0
6 2014-10-01 00:50:00 0
dim(out)
[1] 4459 2
test <- format(out$out_buckets,"%H:%M:%S")
test2 <- out$trip_count
test <- cbind(test, test2)
colnames(test)[1] <- "interval"
colnames(test)[2] <- "count"
test <- as.data.frame(test)
test$count <- as.numeric(test$count)
test <- aggregate(count~interval, test, sum)
head(test, n = 20)
interval count
1 00:00:00 32
2 00:10:00 33
3 00:20:00 32
4 00:30:00 31
5 00:40:00 34
6 00:50:00 34
7 01:00:00 31
8 01:10:00 33
9 01:20:00 39
10 01:30:00 41
11 01:40:00 36
12 01:50:00 31
13 02:00:00 33
14 02:10:00 34
15 02:20:00 32
16 02:30:00 32
17 02:40:00 36
18 02:50:00 32
19 03:00:00 34
20 03:10:00 39
but this is impossible because when I sum the counts
sum(test$count)
[1] 7494
I get 7494 whereas the number should be 1747
I'm not sure where I went wrong and how to simplify this code to get the same result.
I've done what I can, but I can't reproduce your issue without your data.
library(dplyr)
I created the full sequence of 10 minute blocks:
blocks.of.10mins <- data.frame(out_buckets=seq(as.POSIXct("2014/10/01 00:00"), by="10 mins", length.out=30*24*6))
Then split the start_times into the same bins. Note: I created a baseline time of midnight to force the blocks to align to 10 minute intervals. Removing this later is an exercise for the reader. I also changed one of your data points so that there was at least one example of multiple records in the same bin.
start_times <- as.POSIXct(c("2014-10-01 00:00:00", ## added
"2014-10-21 16:58:13",
"2014-10-07 10:14:22",
"2014-10-20 01:45:11",
"2014-10-17 08:16:17",
"2014-10-07 10:16:36", ## modified
"2014-10-28 17:32:34"))
trip_times <- data.frame(start_times) %>%
mutate(out_buckets = as.POSIXct(cut(start_times, breaks="10 mins")))
The start_times and all the 10 minute intervals can then be merged
trips_merged <- merge(trip_times, blocks.of.10mins, by="out_buckets", all=TRUE)
These can then be grouped by 10 minute block and counted
trips_merged %>% filter(!is.na(start_times)) %>%
group_by(out_buckets) %>%
summarise(trip_count=n())
Source: local data frame [6 x 2]
out_buckets trip_count
(time) (int)
1 2014-10-01 00:00:00 1
2 2014-10-07 10:10:00 2
3 2014-10-17 08:10:00 1
4 2014-10-20 01:40:00 1
5 2014-10-21 16:50:00 1
6 2014-10-28 17:30:00 1
Instead, if we only consider time, not date
trips_merged2 <- trips_merged
trips_merged2$out_buckets <- format(trips_merged2$out_buckets, "%H:%M:%S")
trips_merged2 %>% filter(!is.na(start_times)) %>%
group_by(out_buckets) %>%
summarise(trip_count=n())
Source: local data frame [6 x 2]
out_buckets trip_count
(chr) (int)
1 00:00:00 1
2 01:40:00 1
3 08:10:00 1
4 10:10:00 2
5 16:50:00 1
6 17:30:00 1
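As a possible extension (a sketch, not part of the original answer; avg_trips_per_day is an illustrative name): to get the per-day average the question asks for, divide each clock-time bucket's total by the number of days covered by the bucket sequence.
n_days <- nrow(blocks.of.10mins) / (24 * 6)   # whole days in the bucket sequence (30 here)

trips_merged2 %>%
  filter(!is.na(start_times)) %>%
  group_by(out_buckets) %>%
  summarise(trip_count = n()) %>%
  mutate(avg_trips_per_day = trip_count / n_days)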

Using lubridate and subtracting entries in column from 1st entry

I have data that looks like
Dates another column
2015-05-13 23:53:00 some values
2015-05-13 23:53:00 ....
2015-05-13 23:33:00
2015-05-13 23:30:00
...
2003-01-06 00:01:00
2003-01-06 00:01:00
The code I then used is
trainDF<-read.csv("train.csv")
diff<-as.POSIXct(trainDF[1,1])-as.POSIXct(trainDF[,1])
head(diff)
Time differences in hours
[1] 23.88333 23.88333 23.88333 23.88333 23.88333 23.88333
However, this doesn't make sense, because subtracting the first two entries should give 0, since they are the exact same time. Subtracting the 3rd entry from the 1st should give a difference of 20 minutes, not 23.88333 hours. I get similarly nonsensical values when I try as.duration(diff) and as.numeric(diff). Why is this?
If you just have a series of dates in POSIXct, you can use the diff function to calculate the difference between each date. Here's an example:
> BD <- as.POSIXct("2015-01-01 12:00:00", tz = "UTC") # Making a begin date.
> ED <- as.POSIXct("2015-01-01 13:00:00", tz = "UTC") # Making an end date.
> timeSeq <- seq(BD, ED, "min") # Creating a time series in between the dates by minute.
>
> head(timeSeq) # To see what it looks like.
[1] "2015-01-01 12:00:00 UTC" "2015-01-01 12:01:00 UTC" "2015-01-01 12:02:00 UTC" "2015-01-01 12:03:00 UTC" "2015-01-01 12:04:00 UTC"
[6] "2015-01-01 12:05:00 UTC"
>
> diffTime <- diff(timeSeq) # Takes the difference between each adjacent time in the time series.
> print(diffTime) # Printing out the result.
Time differences in mins
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
>
> # For the sake of example, let's make a hole in the data.
>
> limBD <- as.POSIXct("2015-01-01 12:15:00", tz = "UTC") # Start of the hole we want to create.
> limED <- as.POSIXct("2015-01-01 12:45:00", tz = "UTC") # End of the hole we want to create.
>
> timeSeqLim <- timeSeq[timeSeq <= limBD | timeSeq >= limED] # Make a hole of 1/2 hour in the sequence.
>
> diffTimeLim <- diff(timeSeqLim) # Taking the diff.
> print(diffTimeLim) # There is now a large gap, which is reflected in the print out.
Time differences in mins
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 30 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
However, I read through your post again, and it seems you just want to subtract every entry after the first from the first entry. I used the same sample as above to do this:
> timeSeq[1] - timeSeq[2:length(timeSeq)]
Time differences in mins
[1] -1 -2 -3 -4 -5 -6 -7 -8 -9 -10 -11 -12 -13 -14 -15 -16 -17 -18 -19 -20 -21 -22 -23 -24 -25 -26 -27 -28 -29 -30 -31 -32 -33 -34 -35 -36
[37] -37 -38 -39 -40 -41 -42 -43 -44 -45 -46 -47 -48 -49 -50 -51 -52 -53 -54 -55 -56 -57 -58 -59 -60
Which gives me what I'd expect. Trying a data.frame method:
> timeDF <- data.frame(time = timeSeq)
> timeDF[1,1] - timeDF[, 1]
Time differences in secs
[1] 0 -60 -120 -180 -240 -300 -360 -420 -480 -540 -600 -660 -720 -780 -840 -900 -960 -1020 -1080 -1140 -1200 -1260 -1320 -1380
[25] -1440 -1500 -1560 -1620 -1680 -1740 -1800 -1860 -1920 -1980 -2040 -2100 -2160 -2220 -2280 -2340 -2400 -2460 -2520 -2580 -2640 -2700 -2760 -2820
[49] -2880 -2940 -3000 -3060 -3120 -3180 -3240 -3300 -3360 -3420 -3480 -3540 -3600
It seems I'm not encountering the same problem as you. Perhaps coerce everything to POSIXct first and then do your subtraction? Check the class of your data and make sure it really is POSIXct, and look at the actual values you are subtracting; that may give you some insight.
EDIT:
After downloading the file, here's what I ran. The file is trainDF:
trainDF$Dates <- as.POSIXct(trainDF$Dates, tz = "UTC") # Coercing to POSIXct.
datesDiff <- trainDF[1, 1] - trainDF[, 1] # Taking the difference of each date with the first date.
head(datesDiff) # Printing out the head.
With results:
Time differences in secs
[1] 0 0 1200 1380 1380 1380
The only thing I did differently was use the time zone UTC, which does not shift hours with daylight savings time, so there should be no effect there.
HOWEVER, I did the exact same method as you and got the same results:
> diff<-as.POSIXct(trainDF[1,1])-as.POSIXct(trainDF[,1])
> head(diff)
Time differences in hours
[1] 23.88333 23.88333 23.88333 23.88333 23.88333 23.88333
So there is something up with your method, but I can't say what. I do find that it is typically safer to coerce first and then do the mathematical operation, rather than doing everything in one line.
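As an aside (my own suggestion, not something the original answer states): subtraction of POSIXct values picks its display units automatically, so when comparing results it can help to pin the units explicitly with difftime().
# same computation as above, but with the units fixed to minutes
diff_min <- difftime(trainDF[1, 1], trainDF[, 1], units = "mins")
head(diff_min)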

As.XTS from Matrix - Error - Adds time and timezone info

For some reason I do not understand, when I run as.xts to convert a matrix with dates in the rownames, the result ends up with a date-time index. Since this differs from the original index, merge/cbind will not work.
Can someone point out what I am doing wrong?
> class(x)
[1] "xts" "zoo"
> head(x)
XLY.Adjusted XLP.Adjusted XLE.Adjusted AGG.Adjusted IVV.Adjusted
2005-07-31 0.042255791 0.017219585 0.17841600 0.010806168 0.04960026
2005-08-31 0.034117087 0.009951766 0.18476766 0.015245222 0.03825968
2005-09-30 -0.029594066 0.008697349 0.22851906 0.009769765 0.02944754
2005-10-31 -0.015653740 0.019966664 0.09314327 -0.012705172 0.01640395
2005-11-30 -0.005593003 0.005932542 0.05437377 -0.005209811 0.03173972
2005-12-31 0.005084193 0.021293537 0.05672958 0.002592639 0.04045477
> head(index(x))
[1] "2005-07-31" "2005-08-31" "2005-09-30" "2005-10-31" "2005-11-30" "2005-12-31"
> temp=t(apply(-x, 1, rank, na.last = "keep"))
> class(temp)
[1] "matrix"
> head(temp)
XLY.Adjusted XLP.Adjusted XLE.Adjusted AGG.Adjusted IVV.Adjusted
2005-07-31 3 4 1 5 2
2005-08-31 3 5 1 4 2
2005-09-30 5 4 1 3 2
2005-10-31 5 2 1 4 3
2005-11-30 5 3 1 4 2
2005-12-31 4 3 1 5 2
> head(rownames(temp))
[1] "2005-07-31" "2005-08-31" "2005-09-30" "2005-10-31" "2005-11-30" "2005-12-31"
> y=as.xts(temp)
> class(y)
[1] "xts" "zoo"
> head(y)
XLY.Adjusted XLP.Adjusted XLE.Adjusted AGG.Adjusted IVV.Adjusted
2005-07-31 3 4 1 5 2
2005-08-31 3 5 1 4 2
2005-09-30 5 4 1 3 2
2005-10-31 5 2 1 4 3
2005-11-30 5 3 1 4 2
2005-12-31 4 3 1 5 2
> head(index(y))
[1] "2005-07-31 BST" "2005-08-31 BST" "2005-09-30 BST" "2005-10-31 GMT" "2005-11-30 GMT" "2005-12-31 GMT"
as.xts.matrix has a dateFormat argument that defaults to "POSIXct", so it assumes the rownames of your matrix are datetimes. If you want them to simply be dates, specify dateFormat="Date" in your as.xts call.
y <- as.xts(temp, dateFormat="Date")
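As a quick check (a sketch continuing from the line above), the index should now be plain dates that line up with the original object:
class(index(y))    # "Date" instead of "POSIXct"
head(index(y))     # "2005-07-31" "2005-08-31" ... matches index(x)
head(merge(x, y))  # merge/cbind now align on the shared Date index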
