R: aggregating data between events

I have trade data as follows:
TradeNumber OpenTime CloseTime Profit TradeHour Equity
1 01/01/2014 13:10 01/01/2014 14:40 10 13 520
2 01/01/2014 13:25 01/01/2014 13:28 20 13 520
3 01/01/2014 13:29 01/01/2014 15:40 -50 13 520
4 01/01/2014 13:30 01/01/2014 14:05 -5 13 520
5 01/01/2014 14:12 01/01/2014 14:40 12 14 560
6 01/01/2014 14:21 01/01/2014 14:45 -16 14 560
7 01/01/2014 14:50 01/01/2014 14:59 -14 14 560
8 01/01/2014 14:58 01/01/2014 15:05 56 14 560
I am looking to find, for each trade, the sum of the profits of all other trades that closed in the same hour, but before that particular trade, and add it to equity at the time of the trade. So, in the given example, the result would be:
TradeNumber OpenTime CloseTime Profit TradeHour Equity
1 01/01/2014 13:10 01/01/2014 14:40 10 13 520
2 01/01/2014 13:25 01/01/2014 13:28 20 13 520
3 01/01/2014 13:29 01/01/2014 15:40 -50 13 520 + 20
4 01/01/2014 13:30 01/01/2014 14:05 -5 13 520 + 20
5 01/01/2014 14:12 01/01/2014 14:40 12 14 560
6 01/01/2014 14:21 01/01/2014 14:45 -16 14 560 - 5
7 01/01/2014 14:50 01/01/2014 14:59 -14 14 560+10-5+12-16
8 01/01/2014 14:58 01/01/2014 15:05 56 14 560+10-5+12-16
Trade number 8, for example, opened at 14:58 on 01/01/2014. Prior to it opening, there were 4 other trades that closed in that hour (trades 1, 4, 5 and 6). I would, therefore, like to add the profits from those 4 trades to the equity at the beginning of the hour and place that number in the equity column of the trades data.
for (i in 1:nrow(tradeData))
{
tradeData$EquityUSD1 [i] = tradeData$Equity [i] + sum(tradeData$Profit[tradeData$CloseTime <= tradeData$OpenTime[i] & tradeData$CloseTime >= tradeData$tradeHour[i,1]])
}
This works, but is quite slow and I would like to speed it up, as there are tens of thousands of trades.
Any ideas? Please let me know if I have omitted any important data/info
Thanks

I haven't tested the speed on a larger data set yet.
dt
## TradeNumber OpenTime CloseTime Profit TradeHour Equity
## 1 1 01/01/2014 13:10 01/01/2014 14:40 10 13 520
## 2 2 01/01/2014 13:25 01/01/2014 13:28 20 13 520
## 3 3 01/01/2014 13:29 01/01/2014 15:40 -50 13 520
## 4 4 01/01/2014 13:30 01/01/2014 14:05 -5 13 520
## 5 5 01/01/2014 14:12 01/01/2014 14:40 12 14 560
## 6 6 01/01/2014 14:21 01/01/2014 14:45 -16 14 560
## 7 7 01/01/2014 14:50 01/01/2014 14:59 -14 14 560
## 8 8 01/01/2014 14:58 01/01/2014 15:05 56 14 560
require(data.table)
setDT(dt)
dt[,OpenTime:=as.POSIXct(OpenTime,format="%m/%d/%Y %H:%M")]
dt[,CloseTime:=as.POSIXct(CloseTime,format="%m/%d/%Y %H:%M")]
dt[,Equity.new:=Equity+sum(dt$Profit[hour(OpenTime)==hour(dt$CloseTime) & OpenTime > dt$CloseTime]), by="TradeNumber"]
dt
## TradeNumber OpenTime CloseTime Profit TradeHour Equity Equity.new
## 1: 1 2014-01-01 13:10:00 2014-01-01 14:40:00 10 13 520 520
## 2: 2 2014-01-01 13:25:00 2014-01-01 13:28:00 20 13 520 520
## 3: 3 2014-01-01 13:29:00 2014-01-01 15:40:00 -50 13 520 540
## 4: 4 2014-01-01 13:30:00 2014-01-01 14:05:00 -5 13 520 540
## 5: 5 2014-01-01 14:12:00 2014-01-01 14:40:00 12 14 560 555
## 6: 6 2014-01-01 14:21:00 2014-01-01 14:45:00 -16 14 560 555
## 7: 7 2014-01-01 14:50:00 2014-01-01 14:59:00 -14 14 560 561
## 8: 8 2014-01-01 14:58:00 2014-01-01 15:05:00 56 14 560 561

The following code seems to produce the output you want, assuming your data is in a data frame called tradedata:
library(plyr)
# within each TradeHour, add the running profit of the earlier rows
ddply(tradedata, .(TradeHour), mutate,
      Equity = Equity + cumsum(Profit) - Profit)
If you want to include the profit of the particular trade, remove the -Profit.
You could run this in parallel by giving ddply the .parallel=TRUE option. An answer with data.table may be quicker, however. It would be interesting to see which works best.
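If the row-by-row scan is the bottleneck, one possible vectorised alternative in base R is to sort the trades by close time, keep a running total of profit, and look that total up at each trade's open time and at the start of its hour with findInterval(). The sketch below is only an illustration under a few assumptions: the dates are read as day/month/year in UTC, "same hour" means the clock hour in which the trade opens, and the names trades, cum_profit and Equity.new are chosen here for the example. On the sample data it gives the same Equity.new values as the data.table answer.

```r
# Sample data from the question
trades <- data.frame(
  TradeNumber = 1:8,
  OpenTime  = c("01/01/2014 13:10", "01/01/2014 13:25", "01/01/2014 13:29",
                "01/01/2014 13:30", "01/01/2014 14:12", "01/01/2014 14:21",
                "01/01/2014 14:50", "01/01/2014 14:58"),
  CloseTime = c("01/01/2014 14:40", "01/01/2014 13:28", "01/01/2014 15:40",
                "01/01/2014 14:05", "01/01/2014 14:40", "01/01/2014 14:45",
                "01/01/2014 14:59", "01/01/2014 15:05"),
  Profit = c(10, 20, -50, -5, 12, -16, -14, 56),
  Equity = c(520, 520, 520, 520, 560, 560, 560, 560)
)

# Assumption: day/month/year, UTC
fmt <- "%d/%m/%Y %H:%M"
open_t  <- as.POSIXct(trades$OpenTime,  format = fmt, tz = "UTC")
close_t <- as.POSIXct(trades$CloseTime, format = fmt, tz = "UTC")

# Running profit total over trades sorted by close time (leading 0 = "nothing closed yet")
ord <- order(close_t)
close_sorted <- as.numeric(close_t[ord])
cum_profit <- c(0, cumsum(trades$Profit[ord]))

# Start of the clock hour in which each trade opens
hour_start <- as.numeric(as.POSIXct(trunc(open_t, "hours")))

# left.open = TRUE counts close times strictly before the lookup time
n_before_open <- findInterval(as.numeric(open_t), close_sorted, left.open = TRUE)
n_before_hour <- findInterval(hour_start, close_sorted, left.open = TRUE)

# Profit closed in [hour_start, open time) is the difference of the two running totals
trades$Equity.new <- trades$Equity +
  cum_profit[n_before_open + 1] - cum_profit[n_before_hour + 1]
```

Because the work is a sort, a cumulative sum and two binary-search lookups, it should scale better than comparing every trade against every other trade, though that has not been benchmarked here.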

Related

Is there a way of converting four-digit numbers to time values in r?

When I try using as.POSIXlt or strptime I keep getting a single value of 'NA' as a result.
What I need to do is transform 3 and 4 digit numbers e.g. 2300 or 115 to 23:00 or 01:15 respectively, but I simply cannot get any code to work.
Basically, this data frame of outputs:
Time
1 2345
2 2300
3 2130
4 2400
5 115
6 2330
7 100
8 2300
9 1530
10 130
11 100
12 215
13 2245
14 145
15 2330
16 2400
17 2300
18 2230
19 2130
20 30
should look like this:
Time
1 23:45
2 23:00
3 21:30
4 24:00
5 01:15
6 23:30
7 01:00
8 23:00
9 15:30
10 01:30
11 01:00
12 02:15
13 22:45
14 01:45
15 23:30
16 24:00
17 23:00
18 22:30
19 21:30
20 00:30
I think you can use the following solution. Note, however, that it produces a character vector rather than a date-time class:
gsub("(\\d{2})(\\d{2})", "\\1:\\2", sprintf("%04d", df$Time)) |>
as.data.frame() |>
setNames("Time") |>
head()
Time
1 23:45
2 23:00
3 21:30
4 24:00
5 01:15
6 23:30
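The same formatting can also be sketched with integer arithmetic instead of a regular expression: integer division by 100 gives the hours and the remainder gives the minutes. The vector name time_raw is made up for this illustration, and the result is still a character vector.

```r
# A few of the values from the question (time_raw is a hypothetical name)
time_raw <- c(2345, 2300, 2130, 2400, 115, 30)

# %/% is integer division (hours), %% is the remainder (minutes)
time_chr <- sprintf("%02d:%02d", time_raw %/% 100, time_raw %% 100)
time_chr
```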

Aggregate on a daily basis in R

I'm borrowing the reproducible example given here:
Aggregate daily level data to weekly level in R
since it's pretty close to what I want to do.
Interval value
1 2012-06-10 552
2 2012-06-11 4850
3 2012-06-12 4642
4 2012-06-13 4132
5 2012-06-14 4190
6 2012-06-15 4186
7 2012-06-16 1139
8 2012-06-17 490
9 2012-06-18 5156
10 2012-06-19 4430
11 2012-06-20 4447
12 2012-06-21 4256
13 2012-06-22 3856
14 2012-06-23 1163
15 2012-06-24 564
16 2012-06-25 4866
17 2012-06-26 4421
18 2012-06-27 4206
19 2012-06-28 4272
20 2012-06-29 3993
21 2012-06-30 1211
22 2012-07-01 698
23 2012-07-02 5770
24 2012-07-03 5103
25 2012-07-04 775
26 2012-07-05 5140
27 2012-07-06 4868
28 2012-07-07 1225
29 2012-07-08 671
30 2012-07-09 5726
31 2012-07-10 5176
In that question, the asker wants to aggregate on weekly intervals. What I'd like to do instead is aggregate on a "day of the week" basis.
So I'd like to have a table similar to that one, adding the values of all the same day of the week:
Day of the week value
1 "Sunday" 60000
2 "Monday" 50000
3 "Tuesday" 60000
4 "Wednesday" 50000
5 "Thursday" 60000
6 "Friday" 50000
7 "Saturday" 60000
You can try:
aggregate(d$value, list(weekdays(as.Date(d$Interval))), sum)
We can group them by day of the week using weekdays:
library(dplyr)
df %>%
group_by(Day_Of_The_Week = weekdays(as.Date(Interval))) %>%
summarise(value = sum(value))
# Day_Of_The_Week value
# <chr> <int>
#1 Friday 16903
#2 Monday 26368
#3 Saturday 4738
#4 Sunday 2975
#5 Thursday 17858
#6 Tuesday 23772
#7 Wednesday 13560
We can do this with data.table
library(data.table)
setDT(df1)[, .(value = sum(value)), .(Dayofweek = weekdays(as.Date(Interval)))]
# Dayofweek value
#1: Sunday 2975
#2: Monday 26368
#3: Tuesday 23772
#4: Wednesday 13560
#5: Thursday 17858
#6: Friday 16903
#7: Saturday 4738
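Using lubridate (https://cran.r-project.org/web/packages/lubridate/vignettes/lubridate.html):
library(lubridate)
# wday() needs the date column; Interval is the date column from the question
df1$Weekday <- wday(as.Date(df1$Interval), label = TRUE)
library(data.table)
df1 <- data.table(df1)
df1[, sum(value), by = Weekday]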
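Since weekdays() returns locale-dependent names and the grouped results come out in different orders in the answers above, here is a base-R sketch that fixes the order Sunday to Saturday. It relies on as.POSIXlt()$wday (0 = Sunday) and an ordered set of factor levels; the names d, day_names and totals are assumptions for the example, and the data are the 31 rows from the question.

```r
# The 31 daily values from the question, 2012-06-10 to 2012-07-10
d <- data.frame(
  Interval = seq(as.Date("2012-06-10"), as.Date("2012-07-10"), by = "day"),
  value = c(552, 4850, 4642, 4132, 4190, 4186, 1139,
            490, 5156, 4430, 4447, 4256, 3856, 1163,
            564, 4866, 4421, 4206, 4272, 3993, 1211,
            698, 5770, 5103, 775, 5140, 4868, 1225,
            671, 5726, 5176)
)

day_names <- c("Sunday", "Monday", "Tuesday", "Wednesday",
               "Thursday", "Friday", "Saturday")

# $wday is 0 for Sunday through 6 for Saturday, independent of locale
wday_index <- as.POSIXlt(d$Interval)$wday
d$day <- factor(day_names[wday_index + 1], levels = day_names)

# aggregate() returns one row per factor level, in level order
totals <- aggregate(value ~ day, data = d, FUN = sum)
totals
```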

Subtract successive rows in a dataframe grouped by id

I have the following data frame:
id day total_amount
1 2015-07-09 1000
1 2015-10-22 100
1 2015-11-12 200
1 2015-11-27 2392
1 2015-12-16 123
6 2015-07-09 200
7 2015-07-09 1000
7 2015-08-27 100018
7 2015-11-25 1000
8 2015-08-27 1000
8 2015-12-07 10000
8 2016-01-18 796
8 2016-03-31 10000
15 2015-09-10 1500
15 2015-09-30 1000
For each id, I need to subtract each pair of successive dates in the day column until the last row of that id is reached, and then start again with the dates of the next id. Output similar to the following lines is expected:
7 2015-07-09 1000 2015-08-27 - 2015-07-09
7 2015-08-27 100018 2015-07-09 - 2015-08-27
7 2015-07-09 1000 0
8 2015-08-27 1000 2015-12-07 - 2015-08-27
8 2015-12-07 10000 2016-01-18 - 2015-12-07
8 2016-01-18 796 2016-03-31 - 2016-01-18
8 2016-03-31 10000 0
15 2015-09-10 1000 2015-09-30 - 2015-09-10
15 2015-09-30 1000 2015-10-01 - 2015-09-30
15 2015-10-01 1000
To get the difference as number of days you could try:
library(dplyr)
group_by(df, id) %>% mutate(new = as.Date(lead(day)) - as.Date(day))
Source: local data frame [15 x 4]
Groups: id [5]
id day total_amount new
(int) (fctr) (int) (dfft)
1 1 2015-07-09 1000 105 days
2 1 2015-10-22 100 21 days
3 1 2015-11-12 200 15 days
4 1 2015-11-27 2392 19 days
5 1 2015-12-16 123 NA days
6 6 2015-07-09 200 NA days
7 7 2015-07-09 1000 49 days
8 7 2015-08-27 100018 90 days
9 7 2015-11-25 1000 NA days
10 8 2015-08-27 1000 102 days
11 8 2015-12-07 10000 42 days
12 8 2016-01-18 796 73 days
13 8 2016-03-31 10000 NA days
14 15 2015-09-10 1500 20 days
15 15 2015-09-30 1000 NA days
EDITED
To subtract the last date from the current date you can use:
# First save the above result as `df1`:
df1[is.na(df1["new"]), "new"] <- as.Date(unlist(df1[is.na(df1["new"]), "day"])) - Sys.Date()
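For readers who prefer not to load dplyr, the per-id difference to the next date can also be sketched in base R with ave(), which applies a function within each group and returns the results in the original row order. The column name gap_days is made up for this example, and only ids 1, 6 and 7 (first three rows of id 1) are included to keep it short.

```r
# A subset of the question's data
df <- data.frame(
  id  = c(1, 1, 1, 6, 7, 7, 7),
  day = as.Date(c("2015-07-09", "2015-10-22", "2015-11-12",
                  "2015-07-09",
                  "2015-07-09", "2015-08-27", "2015-11-25"))
)

# Within each id: days until the next row, NA on the last row of the id
df$gap_days <- ave(as.numeric(df$day), df$id,
                   FUN = function(x) c(diff(x), NA))
df
```

This assumes the rows are already sorted by day within each id, as they are in the question.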

calculating Net Squared Displacement and repeating at 0 when target is reached

Apologies if a similar query has been posted - couldn't find it.
I have GPS locations (UTM) for multiple individuals.
X Y AnimalID DATE
1 550466 4789843 10 1/25/2008
2 550820 4790544 10 1/26/2008
3 551071 4791230 10 1/26/2008
4 550462 4789292 10 1/26/2008
5 550390 4789934 10 1/27/2008
6 550543 4790085 10 1/27/2008
I am attempting to calculate Net Squared Displacement and once NSD has reached at least 800m, I'd like to repeat the formula starting at 0 at the next row.
Desired output is this:
XLOC YLOC ANIMALID DATETIME Xdist Ydist NSD GROUP
1 550466 4789843 10 1/25/2008 17:00 354 701 785 1
2 550820 4790544 10 1/26/2008 1:00 605 1387 1513 1
3 551071 4791230 10 1/26/2008 9:00 609 1938 2031 2
4 550462 4789292 10 1/26/2008 17:00 72 642 646 3
5 550390 4789934 10 1/27/2008 1:00 81 793 797 3
6 550543 4790085 10 1/27/2008 9:00 82 149 170 3
7 550380 4789441 10 1/27/2008 17:00 178 192 262 3
8 550284 4789484 10 1/28/2008 1:00 559 426 703 3
9 549903 4789718 10 1/28/2008 9:00 0 35 35 3
10 550462 4789327 10 1/28/2008 17:00 574 275 636 3
11 549888 4789567 10 1/29/2008 1:00 532 263 593 3
12 549930 4789555 10 1/29/2008 9:00 65 4 65 3
13 550397 4789288 10 1/29/2008 17:00 124 140 187 3
14 550338 4789432 10 1/30/2008 1:00 554 339 649 3
15 549908 4789631 10 1/30/2008 9:00 84 75 113 3
16 550378 4789367 10 1/30/2008 17:00 657 1876 1988 3
17 550414 4789354 10 1/31/2008 1:00 531 91 539 4
18 549883 4789445 10 1/31/2008 9:00 188 136 232 4
19 550226 4789490 10 1/31/2008 17:00 126 141 189 4
20 550288 4789495 10 2/1/2008 1:00 176 187 257 4
I added the 'Group' column to indicate when 800 NSD was attained.
I'm really struggling with how exactly to code for this particular approach mainly because the first UTM has to be identical until 800m has been reached.
In other words, I can't do this:
xdist <- abs(diff(X))
ydist <- abs(diff(Y))
nsd<-sqrt(xdist^2+ydist^2)
I need to do this until the target of 800m was reached:
xdist <- abs(X in row 2 - 550466)
ydist <- abs(Y in row 2 - 4789843)
Then the unique UTMs will need to be from rows 3, 4, 17 and so on.
I hope this makes sense and I'd appreciate any help!
I think this is what you are looking for:
data$GROUP[1] <- 1
data$Xdist[1] <- data$XLOC[2] - data$XLOC[1]
data$Ydist[1] <- data$YLOC[2] - data$YLOC[1]
data$NSD[1] <- as.integer(sqrt(data$Xdist[1]^2+data$Ydist[1]^2))
for ( i in 2:(nrow(data)-1)) {
if ( data$NSD[i-1] > 800) {
data$Xdist[i] <- data$XLOC[i+1] - data$XLOC[i]
data$Ydist[i] <- data$YLOC[i+1] - data$YLOC[i]
data$NSD[i] <- as.integer(sqrt(data$Xdist[i]^2+data$Ydist[i]^2))
data$GROUP[i] <- (data$GROUP[i-1] + 1)
} else {
data$Xdist[i] <- data$XLOC[i+1] - data$XLOC[i] + data$Xdist[i-1]
data$Ydist[i] <- data$YLOC[i+1] - data$YLOC[i] + data$Ydist[i-1]
data$NSD[i] <- as.integer(sqrt(data$Xdist[i]^2+data$Ydist[i]^2))
data$GROUP[i] <- (data$GROUP[i-1])
}
}
output:
> data
XLOC YLOC ANIMALID DATE TIME Xdist Ydist NSD GROUP
1 550466 4789843 10 1/25/2008 17:00 354 701 785 1
2 550820 4790544 10 1/26/2008 1:00 605 1387 1513 1
3 551071 4791230 10 1/26/2008 9:00 -609 -1938 2031 2
4 550462 4789292 10 1/26/2008 17:00 -72 642 646 3
5 550390 4789934 10 1/27/2008 1:00 81 793 797 3
6 550543 4790085 10 1/27/2008 9:00 -82 149 170 3
7 550380 4789441 10 1/27/2008 17:00 -178 192 261 3
8 550284 4789484 10 1/28/2008 1:00 -559 426 702 3
9 549903 4789718 10 1/28/2008 9:00 0 35 35 3
10 550462 4789327 10 1/28/2008 17:00 -574 275 636 3
11 549888 4789567 10 1/29/2008 1:00 -532 263 593 3
12 549930 4789555 10 1/29/2008 9:00 -65 -4 65 3
13 550397 4789288 10 1/29/2008 17:00 -124 140 187 3
14 550338 4789432 10 1/30/2008 1:00 -554 339 649 3
15 549908 4789631 10 1/30/2008 9:00 -84 75 112 3
16 550378 4789367 10 1/30/2008 17:00 -48 62 78 3
17 550414 4789354 10 1/31/2008 1:00 -579 153 598 3
18 549883 4789445 10 1/31/2008 9:00 -236 198 308 3
19 550226 4789490 10 1/31/2008 17:00 -174 203 267 3
20 550288 4789495 10 2/1/2008 1:00 NA NA NA NA
I also think there is a mistake in your expected output at Xdist in row 16: XLOC[17] - XLOC[16] + Xdist[15] = 550414 - 550378 + (-84) = -48, not 657 as you specified, unless I have misread your formula.
Hope this helps!
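The same idea can be written with an explicit anchor row instead of accumulating step differences, which may be easier to follow: each row's displacement is measured from the current anchor to the next location, and the anchor moves forward once the threshold is reached. This is only a sketch; the function name nsd_groups and its arguments are hypothetical, it uses >= because the question says "at least 800m" (the loop above uses >), and it is tried on the first six locations only.

```r
# Displacement from the current anchor to the next point, restarting
# from a new anchor once the threshold has been reached
nsd_groups <- function(x, y, threshold = 800) {
  n <- length(x)
  nsd <- rep(NA_real_, n)
  group <- rep(NA_integer_, n)
  anchor <- 1L
  g <- 1L
  for (i in seq_len(n - 1)) {
    nsd[i] <- sqrt((x[i + 1] - x[anchor])^2 + (y[i + 1] - y[anchor])^2)
    group[i] <- g
    if (nsd[i] >= threshold) {
      anchor <- i + 1L  # the next row becomes the new starting point
      g <- g + 1L
    }
  }
  data.frame(NSD = nsd, GROUP = group)
}

# First six locations from the question
x <- c(550466, 550820, 551071, 550462, 550390, 550543)
y <- c(4789843, 4790544, 4791230, 4789292, 4789934, 4790085)
res <- nsd_groups(x, y)
res
```

The last row has no following location, so its NSD and GROUP stay NA, as in the output above. For several animals the function would have to be applied per AnimalID.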

How to insert rows at variable positions in a dataframe

My original data frame shows changes on one variable (act, measured in seconds) over approx a 2-week period for several individuals (identified by Ring). My problem is that this variable stretches over the change of date (i.e. at midnight) and I wanted to split it in two: from time[i] till just at midnight, and from midnight until time[i+1]. I have added a few more variables that I need for computing these two operations:
modify the ith row (only when date changes) so I can get the portion of act[i] before midnight
insert one extra row (only when date changes) and assign it the other portion of act[i].
For example:
ith row: 01-01-2000 23:55:00 act= 360 seconds
i+1th row: 02-01-2000 00:01:00 act= 30 seconds
i+2th row: 02-01-2000 00:01:30 act= 50 seconds
.
.
.
My goal is to get:
ith row: 01-01-2000 23:55:00 act= 300 seconds # modified row
i+1th row: 02-01-2000 00:00:00 act= 60 seconds # inserted row
i+2th row: 02-01-2000 00:01:00 act= 30 seconds # previously row i+1th
i+3th row: 02-01-2000 00:01:30 act= 30 seconds #previously row i+2th
.
.
.
Data associated with each individual (Ring) stretch over a different period of time, thereby resulting in date changes between individuals that shouldn't be taken into account.
Below, a selection of my ~ 90000-row dataframe (xact) that shows date changes within and between individuals (Ring) and next my code:
Ring time act wd date clock timepos timemn actmn jul
156 6106933 09/06/11 21:37:45 267 dry 09/06/11 21:37:45 2011-06-09 21:37:45 2011-06-10 8535 15134
157 6106933 09/06/11 21:42:12 3417 wet 09/06/11 21:42:12 2011-06-09 21:42:12 2011-06-10 8268 15134
158 6106933 09/06/11 22:39:09 51 dry 09/06/11 22:39:09 2011-06-09 22:39:09 2011-06-10 4851 15134
159 6106933 09/06/11 22:40:00 7317 wet 09/06/11 22:40:00 2011-06-09 22:40:00 2011-06-10 4800 15134
160 6106933 10/06/11 00:41:57 24 dry 10/06/11 00:41:57 2011-06-10 00:41:57 2011-06-11 83883 15135
529 6106933 11/06/11 22:41:57 3177 wet 11/06/11 22:41:57 2011-06-11 22:41:57 2011-06-12 4683 15136
530 6106933 11/06/11 23:34:54 6 dry 11/06/11 23:34:54 2011-06-11 23:34:54 2011-06-12 1506 15136
531 6106933 11/06/11 23:35:00 1779 wet 11/06/11 23:35:00 2011-06-11 23:35:00 2011-06-12 1500 15136
532 6106933 12/06/11 00:04:39 594 dry 12/06/11 00:04:39 2011-06-12 00:04:39 2011-06-13 86121 15137
533 6106933 12/06/11 00:14:33 18840 wet 12/06/11 00:14:33 2011-06-12 00:14:33 2011-06-13 85527 15137
7024 6134701 24/07/11 15:24:14 6 dry 24/07/11 15:24:14 2011-07-24 15:24:14 2011-07-25 30946 15179
7025 6134701 24/07/11 15:24:20 6 wet 24/07/11 15:24:20 2011-07-24 15:24:20 2011-07-25 30940 15179
7026 6134701 24/07/11 15:24:26 810 dry 24/07/11 15:24:26 2011-07-24 15:24:26 2011-07-25 30934 15179
R = unique(xact$Ring)
for ( m in R ) {
for ( i in 1:nrow(xact) ) {
if( xact$jul[i] < xact$jul[i+1] ) {
# modify row i (jul= Julian date)
xact[i] <- c( xact$Ring[i], xact$time[i], xact$actmn[i], xact$wd[i], xact$date[i], xact$clock[i], xact$timepos[i], xact$timemn[i], xact$actmn[i], xact$jul[i] )
# add new row between row i and row i+1
r <- i
newrow <- c( xact$Ring[i], xact$timemn[i], as.numeric(xact$timepos[i+1] - xact$timemn[i]), xact$wd[i], xact$date[i+1], xact$clock[i+1], xact$timemn[i], xact$timemn[i], xact$actmn[i], xact$jul[i+1] )
insertRow <- function( xact, newrow, r ) {
xact[seq( r+1, nrow(xact) + 1), ] <- xact[seq( r, nrow(xact) ), ]
xact[r,] <- newrow
xact
}
}
}
}
I tried to adapt existing code from "Add new row to dataframe, at specific row-index, not appended?", but it produces this message:
I would appreciate any help.
Santi
Here is an example with made-up data:
#create data
DF <- data.frame(time=seq(from=strptime("2013-01-01 01:00","%Y-%m-%d %H:%M"),to=strptime("2013-01-03 01:00","%Y-%m-%d %H:%M"),by=3500))
DF$ring <- 1:2
DF <- DF[order(DF$ring),]
#apply per ring
library(plyr)
DF <- ddply(DF,.(ring),function(df){
#index of date changes
ind <- c(FALSE,diff(as.POSIXlt(df$time)$yday)==1)
add <- df[ind,]
#trunc() always goes back to midnight (round() could round up); keep POSIXct for rbind
add$time <- as.POSIXct(trunc(add$time,"days"))
#you can simply rbind and order, no need for inserting
df <- rbind(df,add)
df <- df[order(df$time),]
#it's easier to calculate act here
df$act <- c(diff(as.numeric(df$time)),NA)
df})
time ring act
1 2013-01-01 01:00:00 1 7000
2 2013-01-01 02:56:40 1 7000
3 2013-01-01 04:53:20 1 7000
4 2013-01-01 06:50:00 1 7000
5 2013-01-01 08:46:40 1 7000
6 2013-01-01 10:43:20 1 7000
7 2013-01-01 12:40:00 1 7000
8 2013-01-01 14:36:40 1 7000
9 2013-01-01 16:33:20 1 7000
10 2013-01-01 18:30:00 1 7000
11 2013-01-01 20:26:40 1 7000
12 2013-01-01 22:23:20 1 5800
13 2013-01-02 00:00:00 1 1200
14 2013-01-02 00:20:00 1 7000
15 2013-01-02 02:16:40 1 7000
16 2013-01-02 04:13:20 1 7000
17 2013-01-02 06:10:00 1 7000
18 2013-01-02 08:06:40 1 7000
19 2013-01-02 10:03:20 1 7000
20 2013-01-02 12:00:00 1 7000
21 2013-01-02 13:56:40 1 7000
22 2013-01-02 15:53:20 1 7000
23 2013-01-02 17:50:00 1 7000
24 2013-01-02 19:46:40 1 7000
25 2013-01-02 21:43:20 1 7000
26 2013-01-02 23:40:00 1 NA
27 2013-01-01 01:58:20 2 7000
28 2013-01-01 03:55:00 2 7000
29 2013-01-01 05:51:40 2 7000
30 2013-01-01 07:48:20 2 7000
31 2013-01-01 09:45:00 2 7000
32 2013-01-01 11:41:40 2 7000
33 2013-01-01 13:38:20 2 7000
34 2013-01-01 15:35:00 2 7000
35 2013-01-01 17:31:40 2 7000
36 2013-01-01 19:28:20 2 7000
37 2013-01-01 21:25:00 2 7000
38 2013-01-01 23:21:40 2 2300
39 2013-01-02 00:00:00 2 4700
40 2013-01-02 01:18:20 2 7000
41 2013-01-02 03:15:00 2 7000
42 2013-01-02 05:11:40 2 7000
43 2013-01-02 07:08:20 2 7000
44 2013-01-02 09:05:00 2 7000
45 2013-01-02 11:01:40 2 7000
46 2013-01-02 12:58:20 2 7000
47 2013-01-02 14:55:00 2 7000
48 2013-01-02 16:51:40 2 7000
49 2013-01-02 18:48:20 2 7000
50 2013-01-02 20:45:00 2 7000
51 2013-01-02 22:41:40 2 4700
52 2013-01-03 00:00:00 2 2300
53 2013-01-03 00:38:20 2 NA
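When act is already given as a duration, as in the question's 23:55:00 example, the split can also be sketched directly from the seconds left until midnight. The helper split_at_midnight below is hypothetical, works in UTC, and assumes that no single act value spans more than one midnight; for real data it would be applied per Ring, for instance inside the ddply call above.

```r
# Split any duration that runs past midnight into a "before" and an "after" row
split_at_midnight <- function(time, act) {
  next_midnight <- as.POSIXct(trunc(time, "days")) + 86400
  secs_left <- as.numeric(difftime(next_midnight, time, units = "secs"))
  crosses <- act > secs_left
  # original rows, shortened where the duration crosses midnight
  kept <- data.frame(time = time, act = ifelse(crosses, secs_left, act))
  # new rows starting at midnight with the remaining seconds
  added <- data.frame(time = next_midnight[crosses],
                      act = (act - secs_left)[crosses])
  out <- rbind(kept, added)
  out <- out[order(out$time), ]
  rownames(out) <- NULL
  out
}

# The first two rows of the question's example
x <- data.frame(
  time = as.POSIXct(c("2000-01-01 23:55:00", "2000-01-02 00:01:00"), tz = "UTC"),
  act = c(360, 30)
)
res <- split_at_midnight(x$time, x$act)
res
```

As in the answer above, the new rows are appended with rbind and the result is re-ordered by time, so no row has to be inserted at a specific index.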
