Can you specify which space to separate columns by in R?

I am working with a data set called sleep with the following columns:
head(sleep)
          Id              SleepDay TotalSleepRecords TotalMinutesAsleep TotalTimeInBed
1 1503960366 4/12/2016 12:00:00 AM                 1                327            346
2 1503960366 4/13/2016 12:00:00 AM                 2                384            407
3 1503960366 4/15/2016 12:00:00 AM                 1                412            442
4 1503960366 4/16/2016 12:00:00 AM                 2                340            367
I am trying to separate the SleepDay column into two columns named "Date" and "Time".
I used the separate() function and was able to create the two columns below:
separate(weight_log, Date, into = c("Date", "Time"), sep = ' ')
          Id      Date     Time WeightKg WeightPounds Fat   BMI IsManualReport        LogId
1 1503960366  5/2/2016 11:59:59     52.6     115.9631  22 22.65           True 1.462234e+12
2 1503960366  5/3/2016 11:59:59     52.6     115.9631  NA 22.65           True 1.462320e+12
3 1927972279 4/13/2016  1:08:52    133.5     294.3171  NA 47.54          False 1.460510e+12
I want to keep the AM and PM next to the times, but with the function I used they seem to disappear, I assume because I am separating based on a space. Is there any way to specify that I only want to split the column into two based on the first space?
Edit: the data set sleep shown at the top is different from the data set I used the separate() function on, which is weight_log, but the issue is the same.

library(tidyr)
data.frame(SleepDay = "4/12/2016 12:00:00 AM") %>%
  separate(SleepDay, into = c("Date", "Time"), sep = " ", extra = "merge")
#       Date        Time
#1 4/12/2016 12:00:00 AM
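Applied to the weight_log data from your question (column names as in your original call), the same fix keeps the AM/PM with the time:
library(tidyr)
#extra = "merge" splits on the first space only, so "11:59:59 PM" stays whole
weight_log <- separate(weight_log, Date, into = c("Date", "Time"),
                       sep = " ", extra = "merge")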
If you are doing further analysis or visualization, I recommend converting the text into a datetime.
library(dplyr)
library(lubridate)
data.frame(SleepDay = "4/12/2016 12:05:00 AM") %>%
  mutate(SleepDay = mdy_hms(SleepDay),
         SleepDay_base = as.POSIXct(SleepDay),
         date = as_date(SleepDay),
         time_12 = format(SleepDay, "%I:%M %p"),
         time_24 = format(SleepDay, "%H:%M"))
# SleepDay SleepDay_base date time_12 time_24
#1 2016-04-12 00:05:00 2016-04-12 00:05:00 2016-04-12 12:05 AM 00:05

Related

How to create a matrix with time in columns and date in rows for a long time-data frame?

I'm new here so...
I have a data frame with two variables (R is new for me; I used Matlab for a long time). One is a classic POSIXlt vector of timestamps with 30 minutes between each data point. The second is the data itself (for example, air temperature data), with the same dimensions as the time vector. I used this pair to get nice plots.
I want to reshape the data using time in this fashion: days in the row direction and time of day (up to 48 columns, using the 30-minute interval between 0:00 and 23:30) in the column direction, so I can use this data in another R package to fill missing data.
> head(data_f, 10)
time data
1 2013-08-01 00:30:00 8.001
2 2013-08-01 01:00:00 7.918
3 2013-08-01 01:30:00 7.621
4 2013-08-01 02:00:00 7.564
5 2013-08-01 02:30:00 7.718
6 2013-08-01 03:00:00 7.846
7 2013-08-01 03:30:00 7.481
8 2013-08-01 04:00:00 7.351
9 2013-08-01 04:30:00 7.275
10 2013-08-01 05:00:00 7.291
More data
48 2013-08-02 00:00:00 9.372
49 2013-08-02 00:30:00 9.485
50 2013-08-02 01:00:00 9.151
51 2013-08-02 01:30:00 8.870
52 2013-08-02 02:00:00 8.504
53 2013-08-02 02:30:00 8.404
54 2013-08-02 03:00:00 8.342
55 2013-08-02 03:30:00 8.278
56 2013-08-02 04:00:00 8.229
57 2013-08-02 04:30:00 8.163
58 2013-08-02 05:00:00 8.092
59 2013-08-02 05:30:00 8.038
I want an ideally rectangular output (could be a matrix instead of a data frame), putting NAs where no data is available for that time. Something like this:
(30-min span in this direction -->)
2013-08-01 NA 8.001 7.918 7.621 7.564 7.718 7.846 7.481 7.351 7.275 7.291 ...
2013-08-02 9.372 9.485 9.151 8.870 8.504 8.404 8.342 8.278 8.229 8.092 8.038 ...
2013-08-03 ... ... ... ... ... ... ... ... ... ... ... ...
2013-08-04 ... ... ... ... ... ... ... ... ... ... ... ...
...
...
I have tried porting a Matlab function (which I wrote myself) to accomplish this, but with no success, because of the way R interprets dates and times.
Update: how to generate the data (the original data come from a 7-year database at my work):
library(lubridate)
data_f = data.frame(time = seq(from = as_datetime("2013-08-01 00:30:00"),
                               to = as_datetime("2013-10-12 18:00:00"),
                               by = "30 min"),
                    data = runif(3491, 2, 14))
Thanks in advance.
One approach you could follow is separating date and time and then reshaping the data. Here is the code, using tidyverse functions:
#Data
df <- structure(list(time = structure(c(1375317000, 1375318800, 1375320600,
1375322400, 1375324200, 1375326000, 1375327800, 1375329600, 1375331400,
1375333200, 1375401600, 1375403400, 1375405200, 1375407000, 1375408800,
1375410600, 1375412400, 1375414200, 1375416000, 1375417800, 1375419600,
1375421400), class = c("POSIXct", "POSIXt"), tzone = "GMT"),
data = c(8.001, 7.918, 7.621, 7.564, 7.718, 7.846, 7.481,
7.351, 7.275, 7.291, 9.372, 9.485, 9.151, 8.87, 8.504, 8.404,
8.342, 8.278, 8.229, 8.163, 8.092, 8.038)), class = "data.frame", row.names = c(NA,
-22L))
Code:
#Split and reshape
library(tidyverse)
df %>%
  separate(time, into = c("V1", "V2"), sep = " ") %>%
  pivot_wider(names_from = V2, values_from = data)
Output:
# A tibble: 2 x 13
V1 `00:30:00` `00:59:59` `01:30:00` `02:00:00` `02:29:59` `03:00:00` `03:30:00` `03:59:59` `04:30:00`
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2013~ 8.00 7.92 7.62 7.56 7.72 7.85 7.48 7.35 7.28
2 2013~ 9.48 9.15 8.87 8.50 8.40 8.34 8.28 8.23 8.16
# ... with 3 more variables: `05:00:00` <dbl>, `00:00:00` <dbl>, `05:29:59` <dbl>
As the names of the new variables can change, you may want to rearrange them.
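For example, here is a rough sketch of one way to do that, reusing df from above (lexicographic sorting works here because the names are HH:MM:SS strings):
library(tidyverse)
#Reshape as before, then put the time columns in chronological order
df_wide <- df %>%
  separate(time, into = c("V1", "V2"), sep = " ") %>%
  pivot_wider(names_from = V2, values_from = data)
df_wide[, c("V1", sort(setdiff(names(df_wide), "V1")))]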

How to subset data by specific hours of interest?

I have a dataset of temperature values taken at specific datetimes across five locations. For whatever reason, sometimes the readings are every hour and sometimes every four hours. Another issue is that when the clocks changed for daylight saving, the readings are off by one hour. I am interested in the readings taken every four hours and would like to subset these by day and night to ultimately get daily and nightly mean temperatures.
To summarise, the readings I am interested in are either:
0800, 1200, 1600 =day
2000, 0000, 0400 =night
Recordings between 0800-1600 and 2000-0400 each day should be averaged.
During daylight savings, the equivalent times are:
0900, 1300, 1700 =day
2100, 0100, 0500 =night
Recordings between 0900-1700 and 2100-0500 each day should be averaged.
In the process, I am hoping to subset by site.
There are also some NA values or blank cells which should be ignored.
So far, I tried to subset by one hour of interest just to see if it worked, but haven't got any further than that. Any tips on how to subset by a series of times of interest? Thanks!
temperature <- read.csv("SeaTemperatureData.csv",
stringsAsFactors = FALSE)
temperature <- subset(temperature, select=-c(X)) #remove last column that contains comments, not needed
temperature$Date.Time <- as.POSIXct(temperature$Date.Time,
                                    format = "%d/%m/%Y %H:%M",
                                    tz = "Pacific/Auckland")
#subset data by time, we only want to include temperatures recorded at certain times
temperature.goat <- subset(temperature, Date.Time==c('01:00:00'), select=c("Goat.Island"))
Date.Time Goat.Island Tawharanui Kawau Tiritiri Noises
1 2019-06-10 16:00:00 16.820 16.892 16.749 16.677 15.819
2 2019-06-10 20:00:00 16.773 16.844 16.582 16.654 15.796
3 2019-06-11 00:00:00 16.749 16.820 16.749 16.606 15.819
4 2019-06-11 04:00:00 16.487 16.796 16.654 16.558 15.796
5 2019-06-11 08:00:00 16.582 16.749 16.487 16.463 15.867
6 2019-06-11 12:00:00 16.630 16.773 16.725 16.654 15.867
One possible solution is to extract the hour from your DateTime variable, then filter for the particular hours of interest.
Here is a fake example over 4 days:
library(lubridate)
df <- data.frame(DateTime = seq(ymd_hms("2020-02-01 00:00:00"), ymd_hms("2020-02-05 00:00:00"), by = "hour"),
                 Value = sample(1:100, 97, replace = TRUE))
DateTime Value
1 2020-02-01 00:00:00 99
2 2020-02-01 01:00:00 51
3 2020-02-01 02:00:00 44
4 2020-02-01 03:00:00 49
5 2020-02-01 04:00:00 60
6 2020-02-01 05:00:00 56
Now you can extract the hour with the hour() function from lubridate and subset for the desired hour:
library(lubridate)
subset(df, hour(DateTime) == 5)
DateTime Value
6 2020-02-01 05:00:00 56
30 2020-02-02 05:00:00 31
54 2020-02-03 05:00:00 65
78 2020-02-04 05:00:00 80
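For the specific hours in your question, the same idea extends to several hours at once with %in%; a sketch using the df from the example above:
library(lubridate)
#day readings at 0800, 1200, 1600; night readings at 2000, 0000, 0400
day_readings   <- subset(df, hour(DateTime) %in% c(8, 12, 16))
night_readings <- subset(df, hour(DateTime) %in% c(20, 0, 4))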
EDIT: Getting the mean for each site per subset of hours
Per OP's request in the comments, the goal is to calculate the mean of the values at each site over different periods of time.
Basically, you want two periods per day: one from 8:00 to 17:00 and the other from 18:00 to 7:00.
Here is a more elaborate example based on the previous one:
df <- data.frame(DateTime = seq(ymd_hms("2020-02-01 00:00:00"), ymd_hms("2020-02-05 00:00:00"), by = "hour"),
                 Site1 = sample(1:100, 97, replace = TRUE),
                 Site2 = sample(1:100, 97, replace = TRUE))
DateTime Site1 Site2
1 2020-02-01 00:00:00 100 6
2 2020-02-01 01:00:00 9 49
3 2020-02-01 02:00:00 86 12
4 2020-02-01 03:00:00 34 55
5 2020-02-01 04:00:00 76 29
6 2020-02-01 05:00:00 41 1
....
So now you can label each time point as day or night, group by this category within each day, and calculate the mean of each individual site using summarise_at():
library(lubridate)
library(dplyr)
df %>% mutate(Date = date(DateTime),
              Hour = hour(DateTime),
              Category = ifelse(between(hour(DateTime), 8, 17), "Daily", "Night")) %>%
  group_by(Date, Category) %>%
  summarise_at(vars(c(Site1, Site2)), ~ mean(., na.rm = TRUE))
# A tibble: 9 x 4
# Groups: Date [5]
Date Category Site1 Site2
<date> <chr> <dbl> <dbl>
1 2020-02-01 Daily 56.9 63.1
2 2020-02-01 Night 58.9 46.6
3 2020-02-02 Daily 54.5 47.6
4 2020-02-02 Night 36.9 41.7
5 2020-02-03 Daily 42.3 56.9
6 2020-02-03 Night 44.1 55.9
7 2020-02-04 Daily 54.3 50.4
8 2020-02-04 Night 54.8 34.3
9 2020-02-05 Night 75 16
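The question also mentions that readings shift by one hour during daylight saving time. One possible adjustment, only a sketch and untested against the real data, is lubridate's dst(), which reports whether a timestamp falls in daylight saving time, so those hours can be shifted back before classifying:
library(lubridate)
library(dplyr)
#subtract one hour for DST timestamps; %% 24 keeps the adjusted hour in 0..23
df %>% mutate(h = (hour(DateTime) - ifelse(dst(DateTime), 1, 0)) %% 24,
              Category = ifelse(between(h, 8, 17), "Daily", "Night"))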
Does this answer your question?

Group by with summarise in date difference in R

I am trying to use group_by() and then summarise() with a date-difference calculation. I am not sure if it is a runtime error or something wrong in what I am doing. Sometimes when I run the code I get the output in days and other times in seconds, without changing the dataset or the code, and I am not sure what causes the switch. The dataset I am using is huge (2,304,433 rows and 40 columns). Both times the numeric values are the same; only the unit label changes (days to secs). I would like to see the output in days.
This is the code that I am using:
data %>%
  group_by(PRODUCT, PERSON_ID) %>%
  summarise(Freq = n(),
            Revenue = max(TOTAL_AMT + 0.000001/QUANTITY),
            No_Days = (max(ORDER_DT) - min(ORDER_DT) + 1)/n())
This is the output.
Can anyone please help me on this?
Use difftime(); you might need to specify the units.
set.seed(314)
data <- data.frame(PRODUCT = sample(1:10, size = 10000, replace = TRUE),
                   PERSON_ID = sample(1:10, size = 10000, replace = TRUE),
                   ORDER_DT = as.POSIXct(as.Date('2019/01/01') + sample(-300:+300, size = 10000, replace = TRUE)))
require(dplyr)
data %>%
  group_by(PRODUCT, PERSON_ID) %>%
  summarise(Freq = n(),
            start = min(ORDER_DT),
            end = max(ORDER_DT)) %>%
  mutate(No_Days = (as.double(difftime(end, start, units = "days"), units = "days") + 1)/Freq)
gives:
PRODUCT PERSON_ID Freq start end No_Days
<int> <int> <int> <dttm> <dttm> <dbl>
1 1 1 109 2018-03-21 01:00:00 2019-10-27 02:00:00 5.38
2 1 2 117 2018-03-23 01:00:00 2019-10-26 02:00:00 4.98
3 1 3 106 2018-03-19 01:00:00 2019-10-28 01:00:00 5.56
4 1 4 109 2018-03-07 01:00:00 2019-10-26 02:00:00 5.50
5 1 5 95 2018-03-07 01:00:00 2019-10-16 02:00:00 6.2
6 1 6 79 2018-03-09 01:00:00 2019-10-04 02:00:00 7.28
7 1 7 83 2018-03-09 01:00:00 2019-10-28 01:00:00 7.22
8 1 8 114 2018-03-09 01:00:00 2019-10-16 02:00:00 5.15
9 1 9 100 2018-03-09 01:00:00 2019-10-13 02:00:00 5.84
10 1 10 91 2018-03-11 01:00:00 2019-10-26 02:00:00 6.54
# ... with 90 more rows
Why is the value divided by n()?
A simple as.integer(max(ORDER_DT) - min(ORDER_DT)) should work, but if it doesn't, please be more specific and update the question with more information.
Also, when working with datetime values it's good to know the lubridate library.
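For example, a sketch of the same calculation with lubridate's interval() and time_length(), which return a plain number in whatever unit you ask for, so the days/seconds ambiguity never arises:
library(dplyr)
library(lubridate)
data %>%
  group_by(PRODUCT, PERSON_ID) %>%
  summarise(Freq = n(),
            #time_length() always returns days here, never seconds
            No_Days = (time_length(interval(min(ORDER_DT), max(ORDER_DT)),
                                   unit = "days") + 1) / n())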

R : how to get the rolling mean of a variable over the last few days but only at a given hour?

Consider this
library(lubridate)
time <- seq(ymd_hms("2014-02-24 23:00:00"), ymd_hms("2014-06-25 08:32:00"), by = "hour")
group <- rep(LETTERS[1:20], each = length(time))
value <- sample(-10^3:10^3, length(time), replace = TRUE)
df2 <- data.frame(time, group, value)
str(df2)
> head(df2)
time group value
1 2014-02-24 23:00:00 A 246
2 2014-02-25 00:00:00 A -261
3 2014-02-25 01:00:00 A 628
4 2014-02-25 02:00:00 A 429
5 2014-02-25 03:00:00 A -49
6 2014-02-25 04:00:00 A -749
I would like to create a variable that contains, for each group, the rolling mean of value over the last 5 days (not including the current observation), considering only observations that fall at the exact same hour as the current observation.
In other words:
At time 2014-02-24 23:00:00, df2['rolling_mean_same_hour'] contains the mean of the values of value observed at 23:00:00 during the last 5 days in the data (not including 2014-02-24 of course).
I would like to do that in either dplyr or data.table. I confess having no ideas how to do that.
Any ideas?
Many thanks!
You can calculate the rolling mean with your data grouped by the group variable and the hour of the time variable. Normally rollmean() includes the current observation, but you can use the shift() function to exclude it from the rolling mean:
library(data.table); library(zoo)
setDT(df2)
df2[, .(rolling_mean_same_hour = shift(
          rollmean(value, 5, na.pad = TRUE, align = 'right'),
          n = 1,
          type = 'lag'),
        time),
    by = .(hour(time), group)]
# hour group rolling_mean_same_hour time
# 1: 23 A NA 2014-02-24 23:00:00
# 2: 23 A NA 2014-02-25 23:00:00
# 3: 23 A NA 2014-02-26 23:00:00
# 4: 23 A NA 2014-02-27 23:00:00
# 5: 23 A NA 2014-02-28 23:00:00
# ---
#57796: 22 T -267.0 2014-06-20 22:00:00
#57797: 22 T -389.6 2014-06-21 22:00:00
#57798: 22 T -311.6 2014-06-22 22:00:00
#57799: 22 T -260.0 2014-06-23 22:00:00
#57800: 22 T -26.8 2014-06-24 22:00:00
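Since the question asks for either dplyr or data.table, a rough dplyr/zoo equivalent of the same idea (same right-aligned rolling mean, lagged by one to drop the current observation) might look like this:
library(dplyr)
library(zoo)
library(lubridate)
df2 %>%
  group_by(group, hour = hour(time)) %>%     #one group per group/hour-of-day pair
  arrange(time, .by_group = TRUE) %>%
  mutate(rolling_mean_same_hour = lag(rollmeanr(value, k = 5, fill = NA))) %>%
  ungroup()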

Create a time interval of 15 minutes from minutely data in R?

I have some data which is formatted in the following way:
time count
00:00 17
00:01 62
00:02 41
So I have every minute from 00:00 to 23:59, with a count per minute. I'd like to group the data into intervals of 15 minutes such that:
time count
00:00-00:15 148
00:16-00:30 284
I have tried to do it manually, but this is exhausting, so I am sure there has to be a function or something to do it easily, but I haven't figured out how to do it yet.
I'd really appreciate some help!!
Thank you very much!
For data that's in POSIXct format, you can use the cut function to create 15-minute groupings, and then aggregate by those groups. The code below shows how to do this in base R and with the dplyr and data.table packages.
First, create some fake data:
set.seed(4984)
dat = data.frame(time = seq(as.POSIXct("2016-05-01"), as.POSIXct("2016-05-01") + 60*99, by = 60),
                 count = sample(1:50, 100, replace = TRUE))
Base R
cut the data into 15 minute groups:
dat$by15 = cut(dat$time, breaks="15 min")
time count by15
1 2016-05-01 00:00:00 22 2016-05-01 00:00:00
2 2016-05-01 00:01:00 11 2016-05-01 00:00:00
3 2016-05-01 00:02:00 31 2016-05-01 00:00:00
...
98 2016-05-01 01:37:00 20 2016-05-01 01:30:00
99 2016-05-01 01:38:00 29 2016-05-01 01:30:00
100 2016-05-01 01:39:00 37 2016-05-01 01:30:00
Now aggregate by the new grouping column, using sum as the aggregation function:
dat.summary = aggregate(count ~ by15, FUN=sum, data=dat)
by15 count
1 2016-05-01 00:00:00 312
2 2016-05-01 00:15:00 395
3 2016-05-01 00:30:00 341
4 2016-05-01 00:45:00 318
5 2016-05-01 01:00:00 349
6 2016-05-01 01:15:00 397
7 2016-05-01 01:30:00 341
dplyr
library(dplyr)
dat.summary = dat %>%
  group_by(by15 = cut(time, "15 min")) %>%
  summarise(count = sum(count))
data.table
library(data.table)
dat.summary = setDT(dat)[ , list(count=sum(count)), by=cut(time, "15 min")]
UPDATE: To answer the comment, for this case the end point of each grouping interval is as.POSIXct(as.character(dat$by15)) + 60*15 - 1. In other words, the end point of the grouping interval is 15 minutes minus one second after the start of the interval. We add 60*15 - 1 because POSIXct is denominated in seconds. The as.POSIXct(as.character(...)) is needed because cut returns a factor, and this converts it back to a date-time so that we can do math on it.
If you want the end point to be the nearest minute before the next interval (instead of the nearest second), you could do as.POSIXct(as.character(dat$by15)) + 60*14.
If you don't know the break interval, for example because you chose the number of breaks and let R pick the interval, you can find the number of seconds to add with max(unique(diff(as.POSIXct(as.character(dat$by15))))) - 1.
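In code, the end-point arithmetic described above looks like this:
#cut() returns a factor, so convert back to POSIXct before doing math
interval_start <- as.POSIXct(as.character(dat$by15))
interval_end <- interval_start + 60*15 - 1   #last second of each 15-minute bin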
The cut approach is handy but slow with large data frames. The following approach is approximately 1,000x faster than the cut approach (tested with 400k records.)
# Function: Truncate (floor) POSIXct to time interval (specified in seconds)
# Author: Stephen McDaniel # PowerTrip Analytics
# Date : 2017MAY
# Copyright: (C) 2017 by Freakalytics, LLC
# License: MIT
floor_datetime <- function(date_var, floor_seconds = 60,
                           origin = "1970-01-01") { # defaults to minute rounding
  if (!is(date_var, "POSIXct")) stop("Please pass in a POSIXct variable")
  if (is.na(date_var)) {
    return(as.POSIXct(NA))
  } else {
    return(as.POSIXct(floor(as.numeric(date_var) / floor_seconds) * floor_seconds,
                      origin = origin))
  }
}
Sample output:
test <- data.frame(good = as.POSIXct(Sys.time()),
                   bad1 = as.Date(Sys.time()),
                   bad2 = as.POSIXct(NA))
test$good_15 <- floor_datetime(test$good, 15 * 60)
test$bad1_15 <- floor_datetime(test$bad1, 15 * 60)
Error in floor_datetime(test$bad1, 15 * 60) :
  Please pass in a POSIXct variable
test$bad2_15 <- floor_datetime(test$bad2, 15 * 60)
test
good bad1 bad2 good_15 bad2_15
1 2017-05-06 13:55:34.48 2017-05-06 <NA> 2007-05-06 13:45:00 <NA>
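As a side note, if you already use lubridate, its floor_date() does this kind of truncation natively; a minimal sketch with the dat example from above:
library(lubridate)
library(dplyr)
dat.summary <- dat %>%
  group_by(by15 = floor_date(time, "15 minutes")) %>%
  summarise(count = sum(count))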
You can do it in one line by using the trs() function from the foqat package, like this:
df_15mins = trs(df, "15 mins")
Below is a reproducible example:
library(foqat)
head(aqi[,c(1,2)])
# Time NO
#1 2017-05-01 01:00:00 0.0376578
#2 2017-05-01 01:01:00 0.0341483
#3 2017-05-01 01:02:00 0.0310285
#4 2017-05-01 01:03:00 0.0357016
#5 2017-05-01 01:04:00 0.0337507
#6 2017-05-01 01:05:00 0.0238120
#mean
aqi_15mins=trs(aqi[,c(1,2)], "15 mins")
head(aqi_15mins)
# Time NO
#1 2017-05-01 01:00:00 0.02736549
#2 2017-05-01 01:15:00 0.03244958
#3 2017-05-01 01:30:00 0.03743626
#4 2017-05-01 01:45:00 0.02769419
#5 2017-05-01 02:00:00 0.02901817
#6 2017-05-01 02:15:00 0.03439455
