Say I have the following XTS object containing Bid and Ask data:
Time Bid Ask
00:01 10 11
00:02 10 11
00:03 11 12
00:04 12 13
00:05 10 11
00:06 10 11
00:07 10 11
00:08 9 12
00:09 2 10
00:10 4 5
I would like to get the following output:
Time Bid Ask
00:01 10 11
00:03 11 12
00:04 12 13
00:05 10 11
00:08 9 12
00:09 2 10
00:10 4 5
Rows should only be removed if they are the same as the previous entry. If either the Bid or the Ask changes, the row is not removed, so simply taking out duplicates will not work.
This should be fairly simple as I have done this before, but I just cannot remember how and cannot find it.
Update:
I added some extra entries in my initial data and expected output.
Joshua's answer is brilliant, but it depends on a function such as rowSums giving a different result for every distinct row, and that breaks with 9/12: rowSums returns 21 for both 10/11 and 9/12.
I used the rowProds function from the matrixStats package, which works there, but obviously fails for my last two lines (2 × 10 = 4 × 5 = 20). Also, in my sample the values in column one are smaller than those in column two. While that makes sense for bid/ask data, it is not guaranteed, so the function should still work if the Ask were less than the Bid, i.e. if the two values were swapped between columns, where both rowSums and rowProds would give the same result.
Is there a better row function that would always give a different result if anything is different, such as maybe a row hash?
You can do this by using rle() on the sum of each row.
x <- structure(
c(10L, 10L, 11L, 12L, 10L, 10L, 10L, 11L, 11L, 12L, 13L, 11L, 11L, 11L),
.Dim = c(7L, 2L), .Dimnames = list(NULL, c("Bid", "Ask")),
index = structure(1:7, tzone = "", tclass = c("POSIXct", "POSIXt")),
.indexCLASS = c("POSIXct", "POSIXt"), .indexTZ = "",
tclass = c("POSIXct", "POSIXt"), tzone = "", class = c("xts", "zoo"))
r <- rle(rowSums(x))
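For the sample data above, the run-length encoding should look like this (the values are the row sums):
R> r
Run Length Encoding
  lengths: int [1:4] 2 1 1 3
  values : num [1:4] 21 23 25 21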
If you want the last observation in each group, you can just use cumsum(r$lengths) as the row index when subsetting.
R> x[cumsum(r$lengths),]
Bid Ask
1969-12-31 18:00:02 10 11
1969-12-31 18:00:03 11 12
1969-12-31 18:00:04 12 13
1969-12-31 18:00:07 10 11
Since you want the first observation of each group, you need to prepend the r$lengths vector with a 1 (you always want the first observation) and then remove the last element of r$lengths. Then call cumsum() on the result.
R> x[cumsum(c(1, head(r$lengths, -1))),]
Bid Ask
1969-12-31 18:00:01 10 11
1969-12-31 18:00:03 11 12
1969-12-31 18:00:04 12 13
1969-12-31 18:00:05 10 11
Good catch on the limitation of rowSums(). A robust solution is to diff() the bids and asks and select the rows where either difference is non-zero.
d <- diff(x) != 0 # rows with price changes
d[1,] <- TRUE # always select first observation
g <- cumsum(d$Bid | d$Ask) # groups of repeats
r <- rle(as.numeric(g)) # run length encoding on groups
# now use the solution above
x[cumsum(c(1, head(r$lengths, -1))),]
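If you would rather have something hash-like that always differs when any column differs, whatever the values (per the update), one option is to paste the columns of each row into a single character key and keep the rows whose key differs from the previous one. A minimal sketch, assuming x is the xts object from above:
key <- do.call(paste, c(as.data.frame(coredata(x)), sep = "\r"))  # one string per row
x[c(TRUE, key[-1] != key[-length(key)]), ]  # keep rows that differ from the previous row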
Related
I have a data set containing vehicle movement, where vehicle time and position are logged with GPS. The issue is that duplicate (spoofed) values exist for several of the vehicles, and I am not able to identify the true vehicle, beyond the assumption that the vehicle is genuine the first time it appears. My intent is to create a for-loop that calculates the predicted movement from one position to the next; if the next value falls outside this range, the row will be deleted. This will also remove outliers, where a single position is extremely far off for some reason.
Dataset
Vehicle ID Time Lat Long Max Speed (kts)
1 01.01.2013 12:00:00 9.535 18.536 20
1 01.01.2013 12:10:00 9.539 18.539 20
1 01.01.2013 12:20:00 65.535 35.545 20
1 01.01.2013 12:30:00 65.835 35.545 20
1 01.01.2013 12:40:00 9.541 18.542 20
1 01.01.2013 12:50:00 66.135 35.536 20
1 01.01.2013 13:00:00 9.543 18.545 20
2 05.01.2013 17:00:00 13.535 15.536 30
The idea is to run a loop that takes the position in row n and, if Vehicle ID[n] = Vehicle ID[n+1], calculates the maximum distance the vehicle could have travelled by taking the time difference between row n and row n+1 and multiplying it by the maximum speed. From that, it computes the maximum and minimum latitude and longitude where the vehicle could theoretically be at time n+1; if the logged position falls outside these bounds, the row is deleted and the loop runs the same check on the next row.
Something like this:
if vehicle ID[n] = vehicle ID[n+1],
then (create latmax and latmin) latitude[n] +/- ( time[n+1] - time[n] ) * maximum speed
and (create lonmax and lonmin) longitude[n] +/- ( time[n+1] - time[n] ) * maximum speed
then if lat[n+1] > latmax | lat[n+1] < latmin | lon[n+1] > lonmax | lon[n+1] < lonmin, delete row n+1; if not, do the same on the next line
This should result in deleting rows 3, 4 and 6 in my sample. Row 8 is a new vehicle, so a large jump in position is allowed there.
This method does not take into account the fact that the earth is a sphere and that the distance covered by a degree of longitude shrinks as we approach the north or south pole. The best solution would account for this mathematically in the formula, or use distm or similar to calculate the true distance. Applying a cosine correction in the formula is probably the easiest method. However, the deviation between the outliers and the true positions is normally so large that the curvature of the earth does not matter in practice for this data set.
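In R terms, a minimal sketch of what I'm after (hypothetical column names id, datetime, lat, lon, maxspeed; maxspeed treated as already converted to degrees per hour, glossing over the knots-to-degrees conversion just as the pseudocode above does):
filter_spoof <- function(df) {
  keep <- rep(TRUE, nrow(df))
  last_good <- 1                         # index of the last accepted row
  for (i in 2:nrow(df)) {
    if (df$id[i] != df$id[last_good]) {  # new vehicle: accept and reset
      last_good <- i
      next
    }
    hrs <- as.numeric(difftime(df$datetime[i], df$datetime[last_good],
                               units = "hours"))
    reach <- hrs * df$maxspeed[i]        # furthest plausible move, in degrees
    if (abs(df$lat[i] - df$lat[last_good]) > reach ||
        abs(df$lon[i] - df$lon[last_good]) > reach) {
      keep[i] <- FALSE                   # outside the reachable box: drop row
    } else {
      last_good <- i
    }
  }
  df[keep, ]
}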
OK, you have two problems here. One is a research problem: defining an appropriate distance between the spoofing pairs as a function of coordinates. But foremost you need to identify the spoofing pairs and the id of their latest known non-spoofed coordinates. I won't go deep into the first problem, but an internet search on how to calculate distances from coordinates should help. A solution to the second, coding problem is proposed below, assuming you want to keep the point with the minimal distance to the latest known non-spoofed position.
First, you can recreate my example data (slightly modified from yours) by running this:
dput(df)
structure(list(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L), structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L), .Label = c("01.01.2013", "05.01.2013"
), class = "factor"), structure(c(1L, 2L, 3L, 4L, 5L, 6L, 7L,
8L, 3L), .Label = c("12:00:00", "12:10:00", "12:20:00", "12:30:00",
"12:40:00", "12:50:00", "13:00:00", "17:00:00"), class = "factor"),
c(9.535, 9.635, 65.535, 65.835, 9.935, 66.135, 10.235, 13.535,
40.535), c(18.536, 18.636, 35.536, 35.536, 18.936, 35.536,
19.236, 15.536, 40.545), c(20L, 20L, 20L, 20L, 20L, 20L,
20L, 30L, 20L)), .Names = c("Vehicle ID", "date", "Time",
"Lat", "Long", "Max Speed (kts)"), class = "data.frame", row.names = c(NA,
-9L))
My method uses a series of apply() functions rather than explicit loops, which I tend to avoid. I would also be interested if someone knows a more elegant way of doing it, perhaps in fewer steps.
library(lubridate)  # for dmy_hms() and as_datetime() below
spoofingtestdb <- df[,1:3]
df$spoofing <- duplicated(spoofingtestdb)|duplicated(spoofingtestdb, fromLast = T)
df$datetime <- dmy_hms(paste0(df$date,"-", df$Time))
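# for each row, find the indices of earlier, non-spoofed rows of the same vehicle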
df$candidatespreviousposition <- apply(df, 1, function(x) which(df$`Vehicle ID`== x["Vehicle ID"] & !df$spoofing & (as_datetime(df$datetime) < as_datetime(x["datetime"])) ) )
df$latestpreviousposition <- NA
for(i in 1:nrow(df)){
if(length(df$candidatespreviousposition[[i]]) > 0) df$latestpreviousposition[[i]] <- df$candidatespreviousposition[[i]][which.max(df$datetime[df$candidatespreviousposition[[i]]])]
}
df$spoofingkey <- paste0(df$`Vehicle ID`, df$datetime)
df$spoofingid <- ifelse(df$spoofing, apply(df, 1, function(x) which(df$spoofingkey==x["spoofingkey"])), NA)
df$lat1 <- apply(df, 1, function(x) df$Lat[x[["spoofingid"]][1]][which(!is.na(df$Lat[x[["spoofingid"]][1]]))] )
df$long1 <- apply(df, 1, function(x) df$Long[x[["spoofingid"]][1]][which(!is.na(df$Long[x[["spoofingid"]][1]]))] )
df$latinit <- ifelse(df$spoofing, df$Lat[df$latestpreviousposition], NA)
df$longinit <- ifelse(df$spoofing, df$Long[df$latestpreviousposition], NA)
getdistance <- function(latinit, longinit, lat, long) {
  abs(lat - latinit) + abs(long - longinit)  # simple Manhattan-style distance in degrees
}
df$distance <- ifelse(df$spoofing, getdistance(df$latinit, df$longinit, df$Lat, df$Long), NA )
df$spoofingnumber <- apply(df, 1, function(x) paste0(x["spoofingid"], collapse=""))
#apply(df, 1, function(x) which(df$spoofingnumber==x["spoofingnumber"]))
df$ismindistance <- apply(df, 1, function(x) x["distance"] == min(df$distance[which(df$spoofingnumber==x["spoofingnumber"])]))
df$tokeep <- ifelse(is.na(df$ismindistance)|df$ismindistance, T, F)
result <- df[df$tokeep,]
result
Here I am just using a basic distance function (the sum of absolute coordinate differences). The result is below; as you can see, one row of the spoofed pair has been deleted in my example, keeping only the row with the minimum distance to the latest known position.
Vehicle ID date Time Lat Long Max Speed (kts) spoofing datetime candidatespreviousposition
1 1 01.01.2013 12:00:00 9.535 18.536 20 FALSE 2013-01-01 12:00:00
2 1 01.01.2013 12:10:00 9.635 18.636 20 FALSE 2013-01-01 12:10:00 1
4 1 01.01.2013 12:30:00 65.835 35.536 20 FALSE 2013-01-01 12:30:00 1, 2
5 1 01.01.2013 12:40:00 9.935 18.936 20 FALSE 2013-01-01 12:40:00 1, 2, 4
6 1 01.01.2013 12:50:00 66.135 35.536 20 FALSE 2013-01-01 12:50:00 1, 2, 4, 5
7 1 01.01.2013 13:00:00 10.235 19.236 20 FALSE 2013-01-01 13:00:00 1, 2, 4, 5, 6
8 2 05.01.2013 17:00:00 13.535 15.536 30 FALSE 2013-01-05 17:00:00
9 1 01.01.2013 12:20:00 40.535 40.545 20 TRUE 2013-01-01 12:20:00 1, 2
latestpreviousposition spoofingkey spoofingid lat1 long1 latinit longinit distance spoofingnumber ismindistance tokeep
1 NA 12013-01-01 12:00:00 NA NA NA NA NA NA TRUE
2 1 12013-01-01 12:10:00 NA NA NA NA NA NA TRUE
4 2 12013-01-01 12:30:00 NA NA NA NA NA NA TRUE
5 4 12013-01-01 12:40:00 NA NA NA NA NA NA TRUE
6 5 12013-01-01 12:50:00 NA NA NA NA NA NA TRUE
7 6 12013-01-01 13:00:00 NA NA NA NA NA NA TRUE
8 NA 22013-01-05 17:00:00 NA NA NA NA NA NA TRUE
9 2 12013-01-01 12:20:00 3, 9 65.535 35.536 9.635 18.636 52.809 c(3, 9) TRUE TRUE
After you choose which distance function is appropriate for you, you can just replace the getdistance() function above.
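For example, a haversine great-circle distance (in kilometres), which accounts for the earth's curvature, could be dropped in as a sketch, or you could use a ready-made function such as geosphere::distHaversine():
getdistance <- function(latinit, longinit, lat, long) {
  rad  <- pi / 180                      # degrees to radians
  dlat <- (lat - latinit) * rad
  dlon <- (long - longinit) * rad
  a <- sin(dlat / 2)^2 + cos(latinit * rad) * cos(lat * rad) * sin(dlon / 2)^2
  6371 * 2 * asin(pmin(1, sqrt(a)))     # 6371 km is the mean earth radius
}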
I am currently trying to write a forecasting algorithm in R, but I'm having an issue extracting my time data from a txt file.
I currently have a test text file with the following data
x
1 2010-01-01
2 2010-07-02
3 2010-08-03
4 2011-02-04
5 2011-11-05
6 2011-12-06
7 2012-06-07
8 2012-08-30
9 2013-04-16
10 2013-03-18
11 2014-02-22
12 2014-01-27
13 2015-12-15
14 2015-09-28
15 2016-05-04
16 2017-11-07
17 2017-09-22
18 2017-04-04
When I extract it and try to plot it with the following code:
library(forecast)
library(ggplot2)
Quantity <- c(read.table("....Path..../Quantity.txt"))
Time <- c(read.table("....Path..../Time.txt"))
x <- ts(as.Date(unlist(Time)))
y <- unlist(Quantity)
plot(x,y)
The resulting graph displays all the points correctly, except for the time axis labels (which read 14500, 16000, and 17500). The labels should show the dates from the file; it seems R is treating the dates as plain numbers (days since 1970-01-01) and labelling the axis with those values instead.
I also have the issue that the time data is not plotted in chronological order, but in the order it appears in the file.
Here's the data from the other file just for reference:
x
1 5
2 3
3 8
4 4
5 0
6 5
7 2
8 7
9 4
10 2
11 6
12 8
13 4
14 7
15 8
16 9
17 4
18 6
How can I correct these 2 issues?
Thanks in advance.
Here is one of the many possible solutions.
I hope it can help you.
# A dataset with date and x values
# Important: the format of date is "character"
df <- structure(list(date = c("2010-01-01", "2010-07-02", "2010-08-03",
"2011-02-04", "2011-11-05", "2011-12-06", "2012-06-07", "2012-08-30",
"2013-04-16", "2013-03-18", "2014-02-22", "2014-01-27", "2015-12-15",
"2015-09-28", "2016-05-04", "2017-11-07", "2017-09-22", "2017-04-04"
), x = c(5L, 3L, 8L, 4L, 0L, 5L, 2L, 7L, 4L, 2L, 6L, 8L, 4L,
7L, 8L, 9L, 4L, 6L)), .Names = c("date", "x"), row.names = c(NA,
-18L), class = "data.frame")
str(df)
# Create a x vector with dates as rownames
x <- as.matrix(df$x)
rownames(x) <- df$date
# Convert in a xts object
library(xts)
x <- as.xts(x)
# Plot the xts object
plot(x, grid.col="white")
To answer your ggplot question, using the data frame that Marco provided above, you would simply use:
ggplot(df, aes(x = date, y = x)) + geom_line(group = 1)
Since you have only one group or one set of points, you must use the group = 1 arg in geom_line.
One thing I will point out is that your time series has irregular periods, and you will have to make sure you account for that in your time series object. Most time series packages have their own specialized functions for handling and plotting such data.
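As a further note on the two original issues (numeric axis labels and file ordering), here is a minimal base-R sketch, assuming the two files read in as shown above (the paths are placeholders):
Time     <- read.table("Time.txt")
Quantity <- read.table("Quantity.txt")
df2 <- data.frame(date = as.Date(Time$x), y = Quantity$x)
df2 <- df2[order(df2$date), ]        # sort chronologically
plot(df2$date, df2$y, type = "l")    # plot.Date labels the axis with real dates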
This question already has an answer here:
Aggregate by week in R
(1 answer)
Closed 7 years ago.
I have Date (column B) and Total (column A) variables - how can I create a new variable in R that sums the previous seven days' worth of Totals?
In Excel, I have the following formula:
=SUMIFS($A:$A,$B:$B, ">="&$B20-7,$B:$B,"<"&$B20)
and I just don't know how to convert this to work in R. Suggestions?
This will do it too; advanced but short, essentially a one-liner.
# Initialze some data
date <- seq(as.Date("2001-01-01"),as.Date("2001-01-31"),"days")
tot <- trunc(rnorm(31,100,20))
df <- data.frame(date,tot)
# Now compute week sum by summing a subsetted df for each date
df$wktot <- sapply(df$date, function(x) sum(df[difftime(df$date, x, units = "days") %in% 0:-6, ]$tot))
Changed the variable names to match the posed problem.
It also handles the data in any order and multiple entries per day.
Edited to add comments and make it fit in a window.
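For readability, the same computation can be unpacked into a helper function (same logic, just spread out). Note that the window 0:-6 includes the current day; to match the Excel formula exactly (the previous seven days, excluding today) you would use -1:-7 instead:
wksum <- function(d, dates, totals) {
  # sum the totals whose date falls in the 7-day window ending at d (inclusive)
  in_window <- difftime(dates, d, units = "days") %in% 0:-6
  sum(totals[in_window])
}
df$wktot <- sapply(df$date, wksum, dates = df$date, totals = df$tot)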
If there is one total per day, this function may help:
rollSums <- function(totals, roll) {
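# note: the loop stops at length(totals) - roll, one window short of the end;
# together with the leading-NA padding in grouped.roll below, each value then
# lines up as the sum of the roll previous days' totals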
res <- c()
for(i in 1:(length(totals)-roll)) {
res <- c(res, sum(totals[0:(roll-1)+i]))
}
res
}
df1
Total Date
1 3 2015-01-01
2 8 2015-01-01
3 4 2015-01-02
4 7 2015-01-03
5 6 2015-01-04
6 1 2015-01-04
7 10 2015-01-05
8 9 2015-01-06
9 2 2015-01-07
10 5 2015-01-08
rollSums(df1$Total, 3)
[1] 15 19 17 14 17 20 21
rollSums(df1$Total, 4)
[1] 22 25 18 24 26 22
It will take two arguments, the vector with the totals and how many days you'd like in each sum.
Data
dput(df1)
structure(list(Total = c(3L, 8L, 4L, 7L, 6L, 1L, 10L, 9L, 2L,
5L), Date = structure(c(16436, 16436, 16437, 16438, 16439, 16439,
16440, 16441, 16442, 16443), class = "Date")), .Names = c("Total",
"Date"), row.names = c(NA, -10L), class = "data.frame")
Update
In case you run into a situation with multiple values on the same day, here's a solution. Surprisingly, @MikeWise has a one-liner that can do all of this; see the other answer.
grouped.roll <- function(DF, Values, Group, roll) {
totals <- eval(substitute(with(DF, tapply(Values, Group, sum))))
newsums <- rollSums(totals, roll)
data.frame(Group=names(totals), Sums=c(rep(NA, roll), newsums))
}
It uses the rollSums function from earlier. It will produce NAs until the desired day grouping begins; that may be the only advantage over the other answer, though I'm sure that could easily be edited in there too. Just providing more options for reference.
grouped.roll(df1, Total, Date, 3)
Group Sums
1 2015-01-01 NA
2 2015-01-02 NA
3 2015-01-03 NA
4 2015-01-04 22
5 2015-01-05 18
6 2015-01-06 24
7 2015-01-07 26
8 2015-01-08 21
I have a data frame with two columns, time and flow. The time interval of the series is 15 minutes, and I want to thin this series so that the output has consistent one-hour intervals, with the flow value taken at each hourly time stamp of the original data. How do I extract the hourly data?
Input:
structure(list(t = structure(c(1104555600, 1104556500, 1104557400,
1104558300, 1104559200, 1104560100, 1104561000, 1104561900, 1104562800
), class = c("POSIXct", "POSIXt"), tzone = "EST"), flow = c(18,
18, 18, 18.125, 18.125, 18.125, 18.125, 18.125, 18.125)), .Names = c("t", "flow"), row.names = c(NA, 9L), class = "data.frame")
And for output I would want something like
time flow
2005-01-01 00:00:00 18.000
2005-01-01 01:00:00 18.125
2005-01-01 02:00:00 18.125
You can use cut to get the hour in which each t value falls, and then take the first element of every group. If df is your data frame:
aggregate(df, list(cut(df$t,breaks="hour")), FUN=head, 1)[,-2]
# Group.1 flow
# 2005-01-01 00:00:00 18.000
# 2005-01-01 01:00:00 18.125
# 2005-01-01 02:00:00 18.125
If your data frame is a:
library(dplyr)
filter(a, grepl(":00:00", t))  # keep rows whose time stamp is on the hour (minutes and seconds both zero)
You don't give any example, but from what I understand you simply want to keep every fourth row.
In a data set with
time<- c(10,11,12,13,14,15,16,17,18,19)
flow<- c(3,4,5,6,7,8,9,10,11,12)
d <- data.frame(time,flow)
   time flow
1    10    3
2    11    4
3    12    5
4    13    6
5    14    7
6    15    8
7    16    9
8    17   10
9    18   11
10   19   12
with
> d[seq(1, NROW(d), by = 4),]
you only keep every fourth row.
time flow
1 10 3
5 14 7
9 18 11
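Note that this assumes the series is strictly regular and starts exactly on the hour. Applied to the original data frame with its POSIXct column t, a sketch that keys off the clock instead would be:
df[format(df$t, "%M:%S") == "00:00", ]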
Hi, I'm looking to roll up minutely data in a data.table to a 5-minutely (or 10-minutely) horizon. I know this is easily done with xts and the to.minutes5 function, but I prefer not to use xts in this instance, as the data set is rather large. Is there an easy way to do this in data.table?
Data example: in this example, the period from 21:30 to 21:34 (both inclusive) would collapse to just one row with t = 21:30, open = 0.88703, high = 0.88799, low = 0.88702, close = 0.88798, volume = 43 (note that the data from 21:35 itself is ignored).
t open high low close volume
1: 2010-01-03 21:27:00 0.88685 0.88688 0.88685 0.88688 2
2: 2010-01-03 21:28:00 0.88688 0.88688 0.88686 0.88688 5
3: 2010-01-03 21:29:00 0.88688 0.88704 0.88687 0.88703 7
4: 2010-01-03 21:30:00 0.88703 0.88795 0.88702 0.88795 10
5: 2010-01-03 21:31:00 0.88795 0.88795 0.88774 0.88778 7
6: 2010-01-03 21:32:00 0.88778 0.88778 0.88753 0.88760 8
7: 2010-01-03 21:33:00 0.88760 0.88781 0.88760 0.88775 11
8: 2010-01-03 21:34:00 0.88775 0.88799 0.88775 0.88798 7
9: 2010-01-03 21:35:00 0.88798 0.88803 0.88743 0.88782 8
10: 2010-01-03 21:36:00 0.88782 0.88782 0.88770 0.88778 6
Output from dput(head(myData)), as requested by GSee. I want to use the data.table to store some more derived fields based on this original data, so even if I did use xts to roll up these price bars, I would have to put them into a data.table somehow. I'd appreciate any tips on the correct way to hold xts items in a data.table.
structure(list(t = structure(c(1241136000, 1241136060, 1241136120,
1241136180, 1241136240, 1241136300), class = c("POSIXct", "POSIXt"
), tzone = "Europe/London"), open = c(0.89467, 0.89467, 0.89472,
0.89473, 0.89504, 0.895), high = c(0.89481, 0.89475, 0.89473,
0.89506, 0.8951, 0.895), low = c(0.89457, 0.89465, 0.89462, 0.89473,
0.89486, 0.89486), close = c(0.89467, 0.89472, 0.89473, 0.89504,
0.895, 0.89488), volume = c(96L, 14L, 123L, 49L, 121L, 36L)), .Names = c("t",
"open", "high", "low", "close", "volume"), class = c("data.table",
"data.frame"), row.names = c(NA, -6L), .internal.selfref = <pointer: 0x0000000000100788>)
You can use the endpoints function (which is written in C) from xts on POSIXt vectors. endpoints finds the position of the last element of each time period. By convention, 1:05 would not be included in the same bar as 1:00. So the data you provided dput for (which is different from the printed data above it) will have 2 bars.
Assuming dt is your data.table:
library(data.table)
library(xts)
setkey(dt, t) # make sure the data.table is sorted by time.
ep <- endpoints(dt$t, "minutes", 5)[-1] # remove the first value, which is 0
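# for the six-row sample above, ep is c(5, 6): rows 1-5 form the first bar, row 6 the second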
dt[ep, grp:=seq_along(ep)] # create a column to group by
dt[, grp:=na.locf(grp, fromLast=TRUE)] # fill in NAs
dt[, list(t=last(t), open=open[1], high=max(high), low=min(low),
close=last(close), volume=sum(volume)), by=grp]
grp t open high low close volume
1: 1 2009-05-01 01:04:00 0.89467 0.8951 0.89457 0.89500 403
2: 2 2009-05-01 01:05:00 0.89500 0.8950 0.89486 0.89488 36
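As a side note, a pure data.table sketch (no xts dependency) can group on the time floored to five minutes; each bar is then labelled by its start, which matches the 21:30 labelling convention in the question:
dt[, grp5 := as.POSIXct(300 * (as.numeric(t) %/% 300),
                        origin = "1970-01-01", tz = "Europe/London")]
dt[, list(open = open[1], high = max(high), low = min(low),
          close = close[.N], volume = sum(volume)), by = grp5]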