R multiple columns group by

I have a dataset x_output that looks like this:
timestamp city wait_time weekday
2015-07-14 09:00:00 Boston 1.4 Tuesday
2015-07-14 09:01:00 Boston 2.5 Tuesday
2015-07-14 09:02:00 Boston 2.8 Tuesday
2015-07-14 09:03:00 Boston 1.6 Tuesday
2015-07-14 09:04:00 Boston 1.5 Tuesday
2015-07-14 09:05:00 Boston 1.4 Wednesday
I would like to find the mean wait_time, grouped by city, weekday, and time. Basically, given your city, what is the average wait time for Monday, for example? Then Tuesday?
I'm having difficulty creating the time column given x_output$timestamp; I'm currently using:
x_output$time <- strsplit(as.character(x_output$timestamp), split = " ")[[1]][2]
However, that simply puts "09:00" in every row, not the correct time for each individual row.
Secondly, I need a 3-way grouping to find the mean wait_time given city, weekday and time. This is fairly straightforward to do in Python pandas, but I can find very little documentation on it in R (and unfortunately I need to do it in R, not Python).
I've looked into using data.table, but that hasn't seemed to work. Is there a simple function like there would be in Python pandas (e.g. df.groupby(['col1', 'col2', 'col3']).mean())?

Mean wait_time grouped by city, weekday, time:
library(plyr)
ddply(x_output, .(city, weekday, time), summarize, avg=mean(wait_time))
If you wanted data.table (this assumes x_output is already a data.table, e.g. via setDT(x_output)):
x_output[, list(avg = mean(wait_time)), by = .(city, weekday, time)]
I'm having difficulty creating the time column given x_output$timestamp
Well, what is the time column supposed to have in it? Just the time component of timestamp? Is timestamp a POSIXct or a string?
If it is a POSIXct, then you can just convert to character, specifying the time format:
x_output$time <- as.character(x_output$timestamp, '%H:%M')
# or as.factor(as.character(...)) if you need it to be a factor.
# in data.table: x[, time:=as.character(timestamp, '%H:%M')]
This will make the time column a string with the hour and minutes. See ?strptime for more options on converting that datetime to a string (e.g. if you want to include seconds).
If it is a string, you could strsplit and extract the second component:
vapply(strsplit(x_output$timestamp, ' '), '[', i=2, 'template')
which will give you "HH:MM:SS" as your time format. If you want a custom time format, it's probably best to convert your timestamp string into a POSIXct and back out to the specific format, as already mentioned.
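For completeness, a minimal dplyr sketch of the same three-way grouping (not part of the original answers; it assumes timestamp is already a POSIXct), since it reads much like the pandas groupby the question mentions:
library(dplyr)
x_output %>%
  mutate(time = format(timestamp, "%H:%M")) %>%   # time-of-day component
  group_by(city, weekday, time) %>%
  summarise(avg = mean(wait_time))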

Related

How to separate column containing dates and times by a space in R

I have two columns, one of start dates and times and one of end dates and times. I want to split those into four columns: Start_date, Start_time, End_date, End_time. They are separated by a space (each column is formatted such as "12/04/2017 05:25 PM"). Ultimately I need to find the difference between the start date and time and the end date and time. I am a beginner at R so I really appreciate your help.
For the purposes of this question I am assuming that you are in the United States and thus that the example date which you provided refers to December 4th, 2017.
The first step is for you to convert the two date columns into dates instead of strings. The pattern of the elements in the datetime object must be echoed in the formatting command. Based on the example you provided, I have created a toy dataframe.
df <- data.frame(Start = c("12/04/2017 05:25 PM","05/05/2017 06:25 PM"), End = c("12/09/2018 05:15 PM","05/05/2019 06:24 PM"))
df
Start End
1 12/04/2017 05:25 PM 12/09/2018 05:15 PM
2 05/05/2017 06:25 PM 05/05/2019 06:24 PM
Now to convert these strings into date objects:
library(lubridate)
df$Start <- strptime(df$Start,format="%m/%d/%Y %I:%M %p")
df$End <- strptime(df$End,format="%m/%d/%Y %I:%M %p")
df
Start End
1 2017-12-04 17:25:00 2018-12-09 17:15:00
2 2017-05-05 18:25:00 2019-05-05 18:24:00
You will note that the spaces you indicated are included in the format pattern, along with symbols that indicate which parts of the date appear where, and whether single digits are padded with zeros (as all of yours appear to be). For a reference on which symbols/patterns to use in which situation, I recommend this page: https://www.stat.berkeley.edu/~s133/dates.html
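As an aside (not part of the original answer): strptime() returns POSIXlt; if you would rather keep plain POSIXct columns in the data frame, the equivalent conversion would be:
df$Start <- as.POSIXct(df$Start, format = "%m/%d/%Y %I:%M %p")
df$End <- as.POSIXct(df$End, format = "%m/%d/%Y %I:%M %p")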
If you wish to determine the difference between the two datetimes, it is now a simple matter of subtracting one from the other.
df$diff <- df$End - df$Start
Start End diff
1 2017-12-04 17:25:00 2018-12-09 17:15:00 369.9931 days
2 2017-05-05 18:25:00 2019-05-05 18:24:00 729.9993 days
In your question you asked about splitting up into pieces. Just in case that is still something that you need to do, creating the date time is still going to help you out. Now that we have datetime objects instead of strings, we can easily split the column into pieces.
df$Start_Day <- day(df$Start)
df$Start_Month<- month(df$Start)
df$Start_Year <- year(df$Start)
and so on.
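If you literally need the four columns named in the question (Start_date, Start_time, End_date, End_time), a small sketch using format() on the converted columns:
df$Start_date <- format(df$Start, "%Y-%m-%d")
df$Start_time <- format(df$Start, "%H:%M")
df$End_date <- format(df$End, "%Y-%m-%d")
df$End_time <- format(df$End, "%H:%M")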

How to iterate based on a condition, and assign an aggregated value to a row in a new dataframe in R

I have a large dataset of stock prices with 203615 rows and 2 columns (price and Timestamp), in the format below:
price(USD) | Timestamp
3.5 | 2014-01-01 20:00:00
2 | 2014-01-01 20:15:00
5 | 2014-01-01 20:15:00
----
4 | 2014-01-31 23:00:00
5 | 2014-01-31 23:00:00
4.5 | 2014-01-31 23:00:00
203615 2.3 | 2014-01-31 23:00:00
The timestamp varies from "2014-01-01 20:00:00" to "2014-01-31 23:00:00" in intervals of 15 minutes (rounded to 15 min). I have several transactions at the same timestamp.
I have to group rows based on timestamp windows of one day, calculate the min, max and mean of the price plus the number of rows within those window limits, and assign them to a row in a new dataframe for every iteration, until it reaches the end timestamp ("2014-01-31 23:00:00") from the starting date ("2014-01-02 20:00:00").
Note: the iteration has to be done for every 15 minutes.
I have tried a while loop. Please help me with this and suggest any packages I could use.
This is my own code which I used as a way of creating a window of time (the prior 24 hours) to iterate over and create min and max values for a project I am working on...
inter is the interval I worked on in the loop
raw is the data frame name
i is the specific row from which the datetime value was selected from raw
I started my intervals at the 97th row (i in 97:nrow(raw)) because the stamps were taken at 15-minute intervals and I wanted a 24-hour backward window, so I needed to leave 96 intervals to pull from... I could not reach back into time I had no data for... so I started far enough into my data to leave room for those intervals.
for (i in 97:nrow(raw)){
  inter <- raw$datetime[i] - as.difftime(24, units = 'hours')
  # temp holds the rows of raw that fall inside the prior 24-hour window
  temp <- raw[raw$datetime >= inter & raw$datetime <= raw$datetime[i], ]
  raw$deltaAirTemp_24[i] <- max(temp$Air.Temperature) - min(temp$Air.Temperature)
}
The key is getting into a real date-time format. Run str() on the field with the dates; if they come back as anything but Factor, use:
as.POSIXct(yourdate$field, format = "%Y-%m-%d %H:%M:%S")
If they come back from str(yourdatecolumnhere) as Factor, then wrap it in as.POSIXct(as.character(yourdate$field), format = "%Y-%m-%d %H:%M:%S") to be sure it does not coerce the date into a level number rather than a time.
Get them into a consistent date format, then construct something like the above to extract the periods you need. difftime is in the base package and works well; you can use positive and negative intervals with it. I hope this helps!
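Adapting that idea to the price data in the question, here is a rough sketch (untested; the names prices, price and Timestamp are assumptions taken from the question) of a rolling 24-hour min/max/mean/count evaluated at every 15-minute stamp:
# Sketch: rolling 24-hour summary of price at every 15-minute timestamp that
# has a full day of history behind it. `prices`, `price` and `Timestamp` are
# assumed names from the question, not tested against the real data.
prices$Timestamp <- as.POSIXct(prices$Timestamp, format = "%Y-%m-%d %H:%M:%S")
stamps <- sort(unique(prices$Timestamp))
stamps <- stamps[stamps >= min(stamps) + as.difftime(24, units = "hours")]
out <- data.frame(Timestamp = stamps, min = NA, max = NA, mean = NA, n = NA)
for (k in seq_along(stamps)) {
  window <- prices[prices$Timestamp > stamps[k] - as.difftime(24, units = "hours") &
                   prices$Timestamp <= stamps[k], ]
  out[k, c("min", "max", "mean", "n")] <-
    c(min(window$price), max(window$price), mean(window$price), nrow(window))
}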

Get aggregate sum of data by day and hour

The below is an example of the data I have.
date time size filename day.of.week
1 2015-01-16 5:36:12 1577 01162015053400.xml Friday
2 2015-01-16 5:38:09 2900 01162015053600.xml Friday
3 2015-01-16 5:40:09 3130 01162015053800.xml Friday
What I would like to do is sum up the size of the files for each hour.
I would like a resulting data table that looks like:
date hour size
2015-01-16 5 7607
2015-01-16 6 10000
So forth and so on.
But I can't quite seem to get the output I need.
I've tried ddply and aggregate, but I'm summing up the entire day; I'm not sure how to break it down by the hour in the time column.
And I've got multiple days' worth of data, so it's not only for that one day. It runs from that day, almost every day, until yesterday.
Thanks!
The following should do the trick, assuming your example data are stored in a data frame called "test":
library(lubridate) # for hms and hour functions
test$time <- hms(test$time)
test$hour <- factor(hour(test$time))
library(dplyr)
test %>%
  select(-time) %>%   # dplyr doesn't like this column for some reason
  group_by(date, hour) %>%
  summarise(size = sum(size))
You can use data.table (assuming your data frame df has been converted with as.data.table):
library(data.table)
dt <- as.data.table(df)
# Define a timestamp column from the date and time columns.
dt[, timestamp := as.POSIXct(paste(date, time), format = "%Y-%m-%d %H:%M:%S")]
# Aggregate size by hour
dt[, .(size = sum(size)), by = .(hour = as.POSIXct(round(timestamp, "hours")))]
Benefit is that data.table is blazing fast!
Use a compound group_by(date, hour). That will do it.
If you combine your date and time columns into a single POSIXct column called when (similar to a previous answer, i.e. df$when <- as.POSIXct(strptime(paste(df$date, df$time), format = "%Y-%m-%d %H:%M:%S"))), you could use:
aggregate(df[c("size")], FUN=sum, by=list(d=as.POSIXct(trunc(df$when, "hour"))))

difftime for multiple dates in R

I have chemistry water data taken from a river. Normally, the sample dates were on a Wednesday every two weeks. The data record starts in 1987 and ends in 2013.
Now, I want to re-check whether there are any inconsistencies within the data, that is, whether the samples were really taken every 14 days. For that task I want to use the R function difftime. But I have no idea how to do that for multiple dates.
Here is some data:
Date Value
1987-04-16 12:00:00 1,5
1987-04-30 12:00:00 1,2
1987-06-25 12:00:00 1,7
1987-07-14 12:00:00 1,3
Can you tell me how to use the function difftime properly in that case, or any other function that does the job? The result should be the number of days between the samplings and/or a TRUE/FALSE for the 14 days.
Thanks to you guys in advance. My google-fu was to no avail!
Assuming your data.frame is named dd, you'll want to verify that the Date column is being treated as a date. Most times R will read dates in as character, which gets converted to a factor in a data.frame. If class(dd$Date) is "character" or "factor", run
dd$Date <- as.POSIXct(as.character(dd$Date), format="%Y-%m-%d %H:%M:%S")
Then you can do a simple diff() to get the time difference in days:
diff(dd$Date)
# Time differences in days
# [1] 14 56 19
# attr(,"tzone")
# [1] ""
so you can check which ones are over 14 days.
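To get the TRUE/FALSE check asked for in the question, one more line on top of that (a small addition, not part of the original answer):
# TRUE where the gap between consecutive samples is exactly 14 days
as.numeric(diff(dd$Date), units = "days") == 14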

How do I extract a specific, recurring time from 1-minute tick data in R?

For instance, let's say I want to extract the price at 09:04:00 every day from a time series that is formatted as:
DateTime | Price
2011-04-09 09:01:00 | 100.00
2011-04-09 09:02:00 | 100.10
2011-04-09 09:03:00 | 100.13
(NB: there is no | in the actual data, I've just included it here to illustrate that the DateTime is the index and Price is the coredata, and that the two are distinct within the xts object)
and put those extracted values into an xts vector... what is the most efficient way to do this?
Also, if I have a five-year time series of a cross-border spread, where - due to time differences - the spread opens at different times during the year (say 9am during winter, and 10am during summer), how can I get R to take account of those time differences and recognise either 9am-16:30 or 10am-16:30 as the same "day" interval?
In other words, I want to convert an intraday, 1-minute tick data file to daily OHLC data. Normally I would just use xts and to.period to do this, but - given the time difference noted above - it gives odd/strange day start/end times.
Any advice greatly appreciated!
You can use the "T" prefix with xts subsetting to specify a time interval for each day. You must specify an interval; a single time will not work.
library(xts)
set.seed(21)
x <- xts(cumprod(1 + rnorm(2*60*24)/100),
         as.POSIXct("2011-04-09 09:01:00") + 60*(1:(2*60*24)))
x["T09:01:59/T09:02:01"]
# [,1]
# 2011-04-09 09:02:00 0.9980737
# 2011-04-10 09:02:00 1.0778835
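For the second part of the question (daily OHLC despite the seasonal shift in the open), one untested sketch is to restrict each day to the session window before aggregating, so the daily bars always start and end at the session boundaries:
library(xts)
# Sketch: keep only the regular session, then build daily OHLC bars.
# The "T09:00/T16:30" window is an assumption based on the question; the summer
# session would use "T10:00/T16:30" instead (or shift the index to
# exchange-local time first so one window covers the whole year).
session <- x["T09:00/T16:30"]
daily_ohlc <- to.period(session, period = "days", OHLC = TRUE)
head(daily_ohlc)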
