How to iterate based on a condition and assign aggregated values to a row in a new data frame in R - r

I have a large dataset of stock prices with 203615 rows and 2 columns (price and Timestamp), in the format below:
price(USD) | Timestamp
3.5 | 2014-01-01 20:00:00
2 | 2014-01-01 20:15:00
5 | 2014-01-01 20:15:00
----
4 | 2014-01-31 23:00:00
5 | 2014-01-31 23:00:00
4.5 | 2014-01-31 23:00:00
203615 2.3 | 2014-01-31 23:00:00
The timestamps range from "2014-01-01 20:00:00" to "2014-01-31 23:00:00" in 15-minute intervals (rounded to 15 minutes). I have several transactions on the same timestamp.
I have to group rows whose timestamps fall within a one-day window, calculate the min, max and mean of the price and the number of rows within that window, and assign them to a row in a new data frame. This has to be repeated for every iteration, from the starting date ("2014-01-02 20:00:00") until it reaches the end timestamp ("2014-01-31 23:00:00").
Note: the iteration has to advance in 15-minute steps.
I have tried a while loop. Please help me with this, and suggest any packages I could use.

This is my own code which I used as a way of creating a window of time (the prior 24 hours) to iterate over and create min and max values for a project I am working on...
inter is the start of the interval I worked on in the loop
raw is the data frame name
i is the specific row of raw from which the datetime value was selected
I started my intervals at the 97th row (i in 97:nrow(raw)) because the stamps were taken at 15-minute intervals and I wanted a 24-hour backward window, so I needed to leave 96 intervals to pull from. I could not reach back into a time I had no data for, so I started far enough into my data to leave room for those intervals.
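for (i in 97:nrow(raw)) {
  # start of the 24-hour window that ends at row i
  inter <- raw$datetime[i] - as.difftime(24, units = "hours")
  # rows inside the window; this subsetting step was missing from the snippet as posted, so it is an assumption
  temp <- raw[raw$datetime >= inter & raw$datetime <= raw$datetime[i], ]
  raw$deltaAirTemp_24[i] <- max(temp$Air.Temperature) - min(temp$Air.Temperature)
}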
The key is getting the timestamps into a real date-time format. Run str() on the field with the dates. If it comes back as anything but Factor, use:
as.POSIXct(yourdate$field, format = "%Y-%m-%d %H:%M:%S")
If str() reports the column as a Factor, convert it to character first, as in as.POSIXct(as.character(yourdate$field), format = "%Y-%m-%d %H:%M:%S"), to be sure the factor's level numbers are not coerced into dates.
Get them into a consistent date format, then construct something like the loop above to extract the periods you need. difftime is in the base package and works well; you can use positive and negative intervals with it. I hope this helps!
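For the stock-price question itself, where several rows share one timestamp, a fixed row offset such as 97 no longer corresponds to 24 hours, so the window is better selected by comparing timestamps. The following is only a minimal base-R sketch under stated assumptions: the data frame trades, its sample values, and the helper summarise_window are hypothetical, and each window is taken as the 24 hours up to and including its end timestamp.

```r
# Hypothetical sample data: several trades per 15-minute timestamp over 3 days
set.seed(1)
stamps <- seq(as.POSIXct("2014-01-01 20:00:00", tz = "UTC"),
              as.POSIXct("2014-01-04 20:00:00", tz = "UTC"), by = "15 min")
trades <- data.frame(Timestamp = sample(stamps, 1000, replace = TRUE),
                     price = round(runif(1000, 1, 10), 2))

# Window ends: every 15 minutes, starting one day after the first timestamp
day <- as.difftime(24, units = "hours")
ends <- seq(min(stamps) + day, max(stamps), by = "15 min")

# Summarise the prices in the window (end - 24h, end]; assumed window definition
summarise_window <- function(end) {
  p <- trades$price[trades$Timestamp > end - day & trades$Timestamp <= end]
  has_rows <- length(p) > 0
  data.frame(window_end = end,
             n = length(p),
             min_price = if (has_rows) min(p) else NA_real_,
             max_price = if (has_rows) max(p) else NA_real_,
             mean_price = if (has_rows) mean(p) else NA_real_)
}

# One output row per window end, bound into a new data frame
result <- do.call(rbind, lapply(seq_along(ends), function(i) summarise_window(ends[i])))
head(result)
```

On 200,000 rows this row-by-row approach may be slow; it is meant to show the logic rather than to be the fastest option.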

Related

Issue merging dataframes in R using POSIXct

I have two dataframes (per_frame, values) - The first contains POSIXct values for a 24 hour period at 15 minute intervals.
periods = as.POSIXct(seq.POSIXt("2019-06-01 04:00:00 UTC","2019-06-02 03:45:00 UTC", by=900))
per_frame = data.frame(Period = periods)
The second contains a column for some of the time values above (but not all) and another for 'average value'.
Period | avg_value
2019-06-01 04:45:00 | 4
2019-06-01 05:00:00 | 7
2019-06-01 05:45:00 | 9
2019-06-01 08:45:00 | 2
2019-06-01 10:00:00 | 4
I want to create a new dataframe that adds the average values where available to the first dataframe, leaving 'missing values' where there aren't any. I thought this could be achieved easily using the below:
Combined= merge(per_frame, values, by = "Period", all.x = TRUE)
However, the new dataframe it creates has incorrect values for each Period. It adds values to some time periods that don't have a corresponding average value in the values dataframe. I'm not sure what I'm doing incorrectly here.
Apologies - I realised after some investigation that the timezones used in the two dataframes were different, hence the mismatch when merging. I'm not actually sure why this happened, as I'm using the same data import to generate both the start and end values for the first dataframe and the values for the second. I was able to override it, though, using the 'tz' argument of the as.POSIXct function.
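A minimal sketch of the fix described above, assuming both data frames should be in UTC (the question does not say which zone the real data uses). The values data frame is rebuilt here from the table in the question; setting tz explicitly on both sides keeps the underlying instants aligned, so merge matches only the intended rows.

```r
# Build both Period columns with the same explicit time zone (UTC is an assumption)
periods <- seq(as.POSIXct("2019-06-01 04:00:00", tz = "UTC"),
               as.POSIXct("2019-06-02 03:45:00", tz = "UTC"), by = 900)
per_frame <- data.frame(Period = periods)

values <- data.frame(
  Period = as.POSIXct(c("2019-06-01 04:45:00", "2019-06-01 05:00:00",
                        "2019-06-01 05:45:00", "2019-06-01 08:45:00",
                        "2019-06-01 10:00:00"), tz = "UTC"),
  avg_value = c(4, 7, 9, 2, 4)
)

# Left join: keep every period, fill avg_value with NA where there is no match
Combined <- merge(per_frame, values, by = "Period", all.x = TRUE)
head(Combined, 6)
```

If the two columns were parsed in different zones, the same clock time would map to different instants and the join would attach values to the wrong periods, which matches the symptom in the question.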

Fetch data at every 2 minutes in SQLite

Can anyone help me fetch data from my SQLite table at every 2 minutes between my start time and stop time?
I have two columns, Data and TimeStamp. I am filtering between two timestamps and that works fine, but what I am trying to do is return my data at 2-minute intervals. For example, if my start time is 2016-12-15 10:00:00 and my stop time is 2016-12-15 10:10:00, the result should be 2016-12-15 10:00:00, 2016-12-15 10:02:00, 2016-12-15 10:04:00 ....
Add to your WHERE clause an expression that looks for 2-minute boundaries:
strftime('%s', TimeStamp) % 120 = 0
This assumes you have data on exact 2-minute boundaries. It will ignore data between those points.
strftime('%s', TimeStamp) converts your timestamp string into a single number representing the number of seconds since Jan 1st, 1970. The % 120 does modulo arithmetic, resulting in 0 every 120 seconds. If you want minute boundaries, use 60. If you want hourly, use 3600.
What's more interesting -- and I've used this -- is to take all the data between boundaries and average them together:
SELECT CAST(strftime('%s', TimeStamp) / 120 AS INTEGER) * 120 AS stamp, AVG(Data)
FROM my_table -- placeholder name; "table" itself is a reserved word
WHERE TimeStamp >= '2016-12-15 10:00:00' AND
      TimeStamp < '2016-12-15 10:10:00'
GROUP BY stamp;
This averages all data with timestamps in the same 2-minute "bin". The second date comparison is < rather than <= because with <= the last bin would only average one sample, whereas the other bins would be averages of multiple values. You could also add MAX(Data) and MIN(Data) columns if you want to know how much the data changed within each bin.
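To make the bin arithmetic concrete, here is the same calculation sketched in R rather than SQLite, purely as an illustration. The readings data frame and its values are hypothetical; the modulo test and the integer-divide-then-multiply step mirror the two queries above.

```r
# Hypothetical readings; the column names mirror the question (Data, TimeStamp)
readings <- data.frame(
  TimeStamp = as.POSIXct(c("2016-12-15 10:00:00", "2016-12-15 10:00:40",
                           "2016-12-15 10:01:59", "2016-12-15 10:02:00",
                           "2016-12-15 10:03:30"), tz = "UTC"),
  Data = c(1, 2, 3, 10, 20)
)

# Seconds since 1970-01-01, the same quantity strftime('%s', ...) returns
secs <- as.numeric(readings$TimeStamp)

# Equivalent of "% 120 = 0": keep only rows on exact 2-minute boundaries
on_boundary <- readings[secs %% 120 == 0, ]

# Equivalent of "/ 120 ... * 120": label each row with the start of its bin
readings$stamp <- (secs %/% 120) * 120

# Equivalent of GROUP BY stamp with AVG(Data)
binned <- aggregate(Data ~ stamp, data = readings, FUN = mean)
binned$stamp <- as.POSIXct(binned$stamp, origin = "1970-01-01", tz = "UTC")
binned
```

Here the first three readings fall into the 10:00 bin and the last two into the 10:02 bin.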

Converting an interval to duration per hour per weekday in R using data.table

I have the following problem:
Suppose we have:
Idx ID StartTime EndTime
1: 1 2014-01-01 02:20:00 2014-01-01 03:42:00
2: 1 2014-01-01 14:51:00 2014-01-01 16:44:00
note: Idx is not given, but I'm simply adding it to the table view.
Now we see that person with ID=1 is using the computer from 2:20 to 3:42. Now what I would like to do is to convert this interval into a set of variables representing hour and weekday and the duration in those periods.
Idx ID Monday-0:00 Monday-1:00 ... Wednesday-2:00 Wednesday-3:00
1: 1 40 42
For the second row we would have
Idx ID Monday-0:00 Monday-1:00 ... Wednesday-14:00 Wednesday-15:00 Wednesday-16:00
2: 1 9 60 44
Now the problem is of course that it can span over multiple hours as you can see from the second row.
I would like to do this per row and I was wondering if this is possible without too much computational effort and using data.table?
PS: it is also possible that the interval spans over the day.
library(data.table)
library(lubridate)
# produce sample data; tz = "UTC" makes ymd() return a POSIXct (seconds) instead of a Date (days)
DT <- data.table(idx = 1:100, ID = rep(1:20, 5), StartTime = runif(100, 60*60, 60*60*365) + ymd('2014-01-01', tz = "UTC"))
DT[, EndTime := StartTime + runif(1, 60, 60*60*8)]
# make fake start and end dates with the same day of week and time, but all within a single calendar week
DT[, fakestart := as.numeric(difftime(StartTime, ymd('1970-01-01', tz = "UTC"), units = "days")) %% 7 * 60*60*24 + ymd('1970-01-01', tz = "UTC")]
DT[, fakeend := as.numeric(difftime(EndTime, ymd('1970-01-01', tz = "UTC"), units = "days")) %% 7 * 60*60*24 + ymd('1970-01-01', tz = "UTC")]
setkey(DT, fakestart, fakeend)
# check that weekdays line up
nrow(DT[weekdays(EndTime) == weekdays(fakeend)])
nrow(DT[weekdays(StartTime) == weekdays(fakestart)])
# both are 100 so we're good.
# check that fakeend > fakestart
DT[fakeend < fakestart]
# uh-oh, some ends may be earlier than starts; add 7 days to those ends
DT[fakeend < fakestart, fakeend := fakeend + days(7)]
# make a data.table with all possible hourly labels
DTin <- data.table(start = seq(from = ymd('1970-01-01', tz = "UTC"), to = DT[, floor_date(max(fakeend), "hour")], by = "hour"))
DTin[, end := start + hours(1)]
DTin[, label := paste0(format(start, format = "%A-%H:00"), ' ', format(end, format = "%A-%H:00"))]
# set keys and use the foverlaps feature of data.table, which merges by interval
setkey(DT, fakestart, fakeend)
setkey(DTin, start, end)
DTout <- foverlaps(DT, DTin, type = "any")
# compute the duration (in minutes) spent in each hourly interval
DTout[, dur := 60 - pmax(0, as.numeric(difftime(fakestart, start, units = "mins"))) - pmax(0, as.numeric(difftime(end, fakeend, units = "mins")))]
# cast all the rows up to columns for the final result
castout <- dcast(DTout, idx + ID ~ label, value.var = "dur", fill = 0)

difftime for multiple dates in r

I have chemistry water data taken from a river. Normally, the sample dates were on a Wednesday every two weeks. The data record starts in 1987 and ends in 2013.
Now I want to check whether there are any inconsistencies in the data, that is, whether the samples really were taken every 14 days. For that task I want to use the R function difftime, but I have no idea how to do that for multiple dates.
Here is some data:
Date Value
1987-04-16 12:00:00 1,5
1987-04-30 12:00:00 1,2
1987-06-25 12:00:00 1,7
1987-07-14 12:00:00 1,3
Can you tell me how to use difftime properly in this case, or suggest any other function that does the job? The result should be the number of days between the samplings and/or a TRUE/FALSE for the 14 days.
Thanks to you guys in advance. Any google-fu was to no avail!
Assuming your data.frame is named dd, you'll want to verify that the Date column is being treated as a date. Most of the time R will read it as character, which may get converted to a factor in a data.frame. If class(dd$Date) is "character" or "factor", run
dd$Date<-as.POSIXct(as.character(dd$Date), format="%Y-%m-%d %H:%M:%S")
Then you can do a simple diff() to get the time difference in days
diff(dd$Date)
# Time differences in days
# [1] 14 56 19
# attr(,"tzone")
# [1] ""
so you can check which ones are over 14 days.
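Since the question asked specifically about difftime and a TRUE/FALSE flag, the check can be sketched as follows. This is a minimal example assuming the four sample rows from the question, parsed in UTC so that daylight-saving shifts do not affect the day counts; the column names gap_days and is_14_days are made up for illustration.

```r
# Sample rows from the question (decimal commas written as decimal points)
dd <- data.frame(
  Date = as.POSIXct(c("1987-04-16 12:00:00", "1987-04-30 12:00:00",
                      "1987-06-25 12:00:00", "1987-07-14 12:00:00"), tz = "UTC"),
  Value = c(1.5, 1.2, 1.7, 1.3)
)

# Days between each sample and the previous one, forcing the unit to days
gap_days <- as.numeric(difftime(dd$Date[-1], dd$Date[-nrow(dd)], units = "days"))

# The first sample has no predecessor, so its gap is NA
dd$gap_days <- c(NA, gap_days)

# TRUE where the gap is 14 days (with a small tolerance for floating point)
dd$is_14_days <- abs(dd$gap_days - 14) < 1e-6

# Rows that break the 14-day schedule
dd[which(!dd$is_14_days), ]
```

Passing units = "days" explicitly avoids relying on the unit that diff() or difftime() would pick automatically.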

How do i extract a specific, recurring time from 1 minute tick data in R?

For instance, let's say I want to extract the price at 09:04:00 everyday from a timeseries that is formatted as:
DateTime | Price
2011-04-09 09:01:00 | 100.00
2011-04-09 09:02:00 | 100.10
2011-04-09 09:03:00 | 100.13
(NB: there is no | in the actual data; I've just included it here to illustrate that the DateTime is the index and Price is the coredata, and that the two are distinct within the xts object)
and put those extracted values into an xts vector. What is the most efficient way to do this?
Also, if I have a five-year time series of a cross-border spread where, due to time differences, the spread opens at different times during the year (say 9am during winter and 10am during summer), how can I get R to take account of those time differences and recognise either 9am-16:30 or 10am-16:30 as the same "day" interval?
In other words, I want to convert an intraday, 1m tick data file to daily OHLC data. Normally would just use xts and to.period to do this, but - given the time difference noted above - gives odd / strange day start/end times due
Any advice greatly appreciated!
You can use the "T" prefix with xts subsetting to specify a time interval for each day. You must specify an interval; a single time will not work.
library(xts)
set.seed(21)
# two days of simulated 1-minute prices
x <- xts(cumprod(1+rnorm(2*60*24)/100),
         as.POSIXct("2011-04-09 09:01:00")+60*(1:(2*60*24)))
x["T09:01:59/T09:02:01"]
# [,1]
# 2011-04-09 09:02:00 0.9980737
# 2011-04-10 09:02:00 1.0778835
