I have the following data frame:
id day total_amount
1 2015-07-09 1000
1 2015-10-22 100
1 2015-11-12 200
1 2015-11-27 2392
1 2015-12-16 123
6 2015-07-09 200
7 2015-07-09 1000
7 2015-08-27 100018
7 2015-11-25 1000
8 2015-08-27 1000
8 2015-12-07 10000
8 2016-01-18 796
8 2016-03-31 10000
15 2015-09-10 1500
15 2015-09-30 1000
I need to subtract each pair of successive dates in the day column as long as they share the same id. When the last row of an id is reached, the subtraction should start over for the next id. The expected output looks something like this:
7 2015-07-09 1000 2015-08-27 - 2015-07-09
7 2015-08-27 100018 2015-07-09 - 2015-08-27
7 2015-07-09 1000 0
8 2015-08-27 1000 2015-12-07 - 2015-08-27
8 2015-12-07 10000 2016-01-18 - 2015-12-07
8 2016-01-18 796 2016-03-31 - 2016-01-18
8 2016-03-31 10000 0
15 2015-09-10 1000 2015-09-30 - 2015-09-10
15 2015-09-30 1000 2015-10-01 - 2015-09-30
15 2015-10-01 1000
To get the difference as number of days you could try:
library(dplyr)
group_by(df, id) %>% mutate(new = as.Date(lead(day)) - as.Date(day))
Source: local data frame [15 x 4]
Groups: id [5]
id day total_amount new
(int) (fctr) (int) (dfft)
1 1 2015-07-09 1000 105 days
2 1 2015-10-22 100 21 days
3 1 2015-11-12 200 15 days
4 1 2015-11-27 2392 19 days
5 1 2015-12-16 123 NA days
6 6 2015-07-09 200 NA days
7 7 2015-07-09 1000 49 days
8 7 2015-08-27 100018 90 days
9 7 2015-11-25 1000 NA days
10 8 2015-08-27 1000 102 days
11 8 2015-12-07 10000 42 days
12 8 2016-01-18 796 73 days
13 8 2016-03-31 10000 NA days
14 15 2015-09-10 1500 20 days
15 15 2015-09-30 1000 NA days
EDITED
To subtract the last date from the current date you can use:
# First save the above result as `df1`:
df1[is.na(df1["new"]), "new"] <- as.Date(unlist(df1[is.na(df1["new"]), "day"])) - Sys.Date()
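As a rough base R alternative, here is a minimal sketch that computes the number of days to the next row within each id and puts 0 on the last row of each id, as the expected output in the question suggests. The small data frame is a hand-typed subset of the question's data, and the choice of 0 for the last row is an assumption taken from that expected output.

```r
# Subset of the question's data, typed in by hand for illustration
df <- data.frame(
  id = c(7, 7, 7, 8, 8),
  day = as.Date(c("2015-07-09", "2015-08-27", "2015-11-25",
                  "2015-08-27", "2015-12-07")),
  total_amount = c(1000, 100018, 1000, 1000, 10000)
)

# ave() applies the function within each id and keeps the row order.
# diff() on the numeric dates gives the gap in days to the next row;
# the last row of each id has no successor, so it gets 0.
df$new <- ave(as.numeric(df$day), df$id,
              FUN = function(d) c(diff(d), 0))

df
```

This assumes the rows are already sorted by day within each id; if they are not, sort first with df[order(df$id, df$day), ].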
Say for example I have the following data set:
timestamp open close ID
2000 1000 1100 5
2060 1100 1150 5
2120 1150 1200 5
2180 1200 1150 5
2240 1150 1100 8
2300 1100 1000 8
2360 1000 950 8
2420 950 900 8
2480 900 950 5
2540 950 1000 5
2600 1000 1050 5
2660 1050 1100 4
2720 1100 1150 4
2780 1150 1200 4
How can I add another column, Number_ID, that shows how many times a particular ID has appeared so far? And how can I add another column that gives the percentage change since the point where a new ID starts? The first open marks the start of the ID, and the closes are used to calculate %_change. The result would look something like this (the "because" calculation doesn't have to be included; I added it so you can see how each value is derived):
timestamp open close ID Number_ID %_change
2000 1000 1100 5 1 10 (because (1100-1000)*100/1000)
2060 1100 1150 5 2 15 (because (1150-1000)*100/1000)
2120 1150 1200 5 3 20 (because (1200-1000)*100/1000)
2180 1200 1150 5 4 15 (because (1150-1000)*100/1000)
2240 1150 1100 8 1 -4 (because (1100-1150)*100/1150)
2300 1100 1000 8 2 -13 (because (1000-1150)*100/1150)
2360 1000 950 8 3 -17 (because (950-1150)*100/1150)
2420 950 900 8 4 -21 (because (900-1150)*100/1150)
2480 900 950 5 1 5 (because (950-900)*100/900)
2540 950 1000 5 2 11 (because (1000-900)*100/900)
2600 1000 1050 5 3 16 (because (1050-900)*100/900)
2660 1050 1100 4 1 4 (because (1100-1050)*100/1050)
2720 1100 1150 4 2 9 (because (1150-1050)*100/1050)
2780 1150 1200 4 3 14 (because (1200-1050)*100/1050)
And once I have these two columns, how can I make a graph that plots the highest positive and negative % change per ID? I would first need to add a calculation of the percentage price difference between the open and the close of an ID. This would look something like this:
timestamp open close ID Number_ID %_change %_change_opencloseID
2000 1000 1100 5 1 10
2060 1100 1150 5 2 15
2120 1150 1200 5 3 20
2180 1200 1150 5 4 15 15 (because (1150-1000)*100/1000)
2240 1150 1100 8 1 -4
2300 1100 1000 8 2 -13
2360 1000 950 8 3 -17
2420 950 900 8 4 -21 -21 (because (900-1150)*100/1150)
2480 900 950 5 1 5
2540 950 1000 5 2 11
2600 1000 1050 5 3 16 16 (because (1050-900)*100/900)
2660 1050 1100 4 1 4
2720 1100 1150 4 2 9
2780 1150 1200 4 3 14 14 (because (1200-1050)*100/1050)
If I have this, how can I make a graph that plots the 16% change for ID 5 and not the 15% change for ID 5 automatically? With timestamp on the x-axis and %_change on the y-axis.
Thanks!
This is how you can do the first step:
library(dplyr)
df %>%
group_by(ID) %>%
mutate(Number_ID = row_number(),
perc_change = (close - first(open))/first(open) * 100)
# timestamp open close ID Number_ID perc_change
# <int> <int> <int> <int> <int> <dbl>
# 1 2000 1000 1100 5 1 10
# 2 2060 1100 1150 5 2 15
# 3 2120 1150 1200 5 3 20
# 4 2180 1200 1150 5 4 15
# 5 2240 1150 1100 8 1 -4.35
# 6 2300 1100 1000 8 2 -13.0
# 7 2360 1000 950 8 3 -17.4
# 8 2420 950 900 8 4 -21.7
# 9 2480 900 950 5 5 -5
#10 2540 950 1000 5 6 0
#11 2600 1000 1050 5 7 5
#12 2660 1050 1100 4 1 4.76
#13 2720 1100 1150 4 2 9.52
#14 2780 1150 1200 4 3 14.3
In data.table :
library(data.table)
setDT(df)[, c("Number_ID", "perc_change") := list(seq_len(.N),
(close - first(open))/first(open) * 100), ID]
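Note that grouping by ID alone treats both blocks of ID 5 as one group, which is why rows 9 to 11 in the output above get Number_ID 5, 6, 7 instead of 1, 2, 3. If each consecutive run of an ID should be handled separately, as in the question, one option is to group by a run index instead. Below is a minimal base R sketch; the column names run and perc_change and the helper objects r, first_open and final are my own choices, not part of the question. data.table::rleid(ID) would give the same run index.

```r
# Data from the question
df <- data.frame(
  timestamp = seq(2000, by = 60, length.out = 14),
  open  = c(1000, 1100, 1150, 1200, 1150, 1100, 1000,
            950, 900, 950, 1000, 1050, 1100, 1150),
  close = c(1100, 1150, 1200, 1150, 1100, 1000, 950,
            900, 950, 1000, 1050, 1100, 1150, 1200),
  ID    = c(5, 5, 5, 5, 8, 8, 8, 8, 5, 5, 5, 4, 4, 4)
)

# Run index: a new value every time ID changes from one row to the next
r <- rle(df$ID)
df$run <- rep(seq_along(r$lengths), r$lengths)

# Counter within each run, and percentage change from the run's first open
df$Number_ID <- ave(df$ID, df$run, FUN = seq_along)
first_open <- ave(df$open, df$run, FUN = function(x) x[1])
df$perc_change <- (df$close - first_open) / first_open * 100

# Last row of each run: the open-to-close change for that run
final <- df[cumsum(r$lengths), c("timestamp", "ID", "perc_change")]
final
```

The final data frame has one row per run, so plotting final$timestamp against final$perc_change would show the 16.7% value for the second block of ID 5 as its own point.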
I'm borrowing the reproducible example given here:
Aggregate daily level data to weekly level in R
since it's pretty much close to what I want to do.
Interval value
1 2012-06-10 552
2 2012-06-11 4850
3 2012-06-12 4642
4 2012-06-13 4132
5 2012-06-14 4190
6 2012-06-15 4186
7 2012-06-16 1139
8 2012-06-17 490
9 2012-06-18 5156
10 2012-06-19 4430
11 2012-06-20 4447
12 2012-06-21 4256
13 2012-06-22 3856
14 2012-06-23 1163
15 2012-06-24 564
16 2012-06-25 4866
17 2012-06-26 4421
18 2012-06-27 4206
19 2012-06-28 4272
20 2012-06-29 3993
21 2012-06-30 1211
22 2012-07-01 698
23 2012-07-02 5770
24 2012-07-03 5103
25 2012-07-04 775
26 2012-07-05 5140
27 2012-07-06 4868
28 2012-07-07 1225
29 2012-07-08 671
30 2012-07-09 5726
31 2012-07-10 5176
In that question, the asker wants to aggregate on weekly intervals. What I'd like to do instead is aggregate on a "day of the week" basis.
So I'd like to have a table similar to that one, adding the values of all the same day of the week:
Day of the week value
1 "Sunday" 60000
2 "Monday" 50000
3 "Tuesday" 60000
4 "Wednesday" 50000
5 "Thursday" 60000
6 "Friday" 50000
7 "Saturday" 60000
You can try:
aggregate(d$value, list(weekdays(as.Date(d$Interval))), sum)
We can group them by day of the week using weekdays:
library(dplyr)
df %>%
group_by(Day_Of_The_Week = weekdays(as.Date(Interval))) %>%
summarise(value = sum(value))
# Day_Of_The_Week value
# <chr> <int>
#1 Friday 16903
#2 Monday 26368
#3 Saturday 4738
#4 Sunday 2975
#5 Thursday 17858
#6 Tuesday 23772
#7 Wednesday 13560
We can do this with data.table
library(data.table)
setDT(df1)[, .(value = sum(value)), .(Dayofweek = weekdays(as.Date(Interval)))]
# Dayofweek value
#1: Sunday 2975
#2: Monday 26368
#3: Tuesday 23772
#4: Wednesday 13560
#5: Thursday 17858
#6: Friday 16903
#7: Saturday 4738
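Using lubridate (https://cran.r-project.org/web/packages/lubridate/vignettes/lubridate.html):
library(lubridate)
library(data.table)
# wday() with label=TRUE returns an ordered factor of weekday names
df1$Weekday <- wday(as.Date(df1$Interval), label=TRUE)
df1 <- data.table(df1)
df1[, sum(value), Weekday]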
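One thing to keep in mind is that weekdays() returns names in the current locale, and a plain character column sorts alphabetically (Friday, Monday, ...). If a Sunday-to-Saturday order is wanted, as in the table in the question, a possible sketch is to aggregate on the numeric weekday from format(x, "%w") and attach the labels afterwards. The data below is the first two weeks of the example, and the names wday, day_names and res are only for illustration.

```r
# First 14 days of the example data
d <- data.frame(
  Interval = as.Date("2012-06-10") + 0:13,
  value = c(552, 4850, 4642, 4132, 4190, 4186, 1139,
            490, 5156, 4430, 4447, 4256, 3856, 1163)
)

# "%w" gives the weekday as a number: 0 = Sunday, ..., 6 = Saturday
d$wday <- as.integer(format(d$Interval, "%w"))

# Sum per weekday; the result is sorted by the numeric weekday
res <- aggregate(value ~ wday, data = d, FUN = sum)

# Attach readable labels in a fixed, locale-independent order
day_names <- c("Sunday", "Monday", "Tuesday", "Wednesday",
               "Thursday", "Friday", "Saturday")
res$day <- day_names[res$wday + 1]
res
```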
I have a time-series data frame that looks like this:
TS.1
2015-09-01 361656.7
2015-09-02 370086.4
2015-09-03 346571.2
2015-09-04 316616.9
2015-09-05 342271.8
2015-09-06 361548.2
2015-09-07 342609.2
2015-09-08 281868.8
2015-09-09 297011.1
2015-09-10 295160.5
2015-09-11 287926.9
2015-09-12 323365.8
Now, what I want to do is add some new data points (rows) to the existing data frame, say,
320123.5
323521.7
How can I add the corresponding date to each new row? The dates simply continue sequentially from the last row.
Is there a package that can do this automatically, so that the only thing I have to do is insert the new data points?
Here's some play data:
df <- data.frame(date = seq(as.Date("2015-01-01"), as.Date("2015-01-31"), "days"), x = seq(31))
new.x <- c(32, 33)
This adds the extra observations along with the proper sequence of dates:
new.df <- data.frame(date=seq(max(df$date) + 1, max(df$date) + length(new.x), "days"), x=new.x)
Then just rbind them to get your expanded data frame:
rbind(df, new.df)
date x
1 2015-01-01 1
2 2015-01-02 2
3 2015-01-03 3
4 2015-01-04 4
5 2015-01-05 5
6 2015-01-06 6
7 2015-01-07 7
8 2015-01-08 8
9 2015-01-09 9
10 2015-01-10 10
11 2015-01-11 11
12 2015-01-12 12
13 2015-01-13 13
14 2015-01-14 14
15 2015-01-15 15
16 2015-01-16 16
17 2015-01-17 17
18 2015-01-18 18
19 2015-01-19 19
20 2015-01-20 20
21 2015-01-21 21
22 2015-01-22 22
23 2015-01-23 23
24 2015-01-24 24
25 2015-01-25 25
26 2015-01-26 26
27 2015-01-27 27
28 2015-01-28 28
29 2015-01-29 29
30 2015-01-30 30
31 2015-01-31 31
32 2015-02-01 32
33 2015-02-02 33
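If this needs to be done repeatedly, the two steps can be wrapped in a small helper. This is only a sketch: the function name append_daily is made up, and it assumes the data frame has exactly the columns date and x, with one observation per day.

```r
# Hypothetical helper: append new values with consecutive daily dates
append_daily <- function(df, new_x) {
  if (length(new_x) == 0) return(df)
  # Dates continue from the day after the latest date already in df
  next_dates <- seq(max(df$date) + 1, by = "day",
                    length.out = length(new_x))
  rbind(df, data.frame(date = next_dates, x = new_x))
}

# Last three rows of the question's series, plus the two new points
df <- data.frame(
  date = seq(as.Date("2015-09-10"), as.Date("2015-09-12"), by = "day"),
  x = c(295160.5, 287926.9, 323365.8)
)
out <- append_daily(df, c(320123.5, 323521.7))
out
```

Here the two new values end up on 2015-09-13 and 2015-09-14.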
I have trade data as follows:
TradeNumber OpenTime CloseTime Profit TradeHour Equity
1 01/01/2014 13:10 01/01/2014 14:40 10 13 520
2 01/01/2014 13:25 01/01/2014 13:28 20 13 520
3 01/01/2014 13:29 01/01/2014 15:40 -50 13 520
4 01/01/2014 13:30 01/01/2014 14:05 -5 13 520
5 01/01/2014 14:12 01/01/2014 14:40 12 14 560
6 01/01/2014 14:21 01/01/2014 14:45 -16 14 560
7 01/01/2014 14:50 01/01/2014 14:59 -14 14 560
8 01/01/2014 14:58 01/01/2014 15:05 56 14 560
I am looking to find, for each trade, the sum of the profits of all other trades that closed in the same hour, but before that particular trade, and add it to equity at the time of the trade. So, in the given example, the result would be:
TradeNumber OpenTime CloseTime Profit TradeHour Equity
1 01/01/2014 13:10 01/01/2014 14:40 10 13 520
2 01/01/2014 13:25 01/01/2014 13:28 20 13 520
3 01/01/2014 13:29 01/01/2014 15:40 -50 13 520 + 20
4 01/01/2014 13:30 01/01/2014 14:05 -5 13 520 + 20
5 01/01/2014 14:12 01/01/2014 14:40 12 14 560
6 01/01/2014 14:21 01/01/2014 14:45 -16 14 560 - 5
7 01/01/2014 14:50 01/01/2014 14:59 -14 14 560+10-5+12-16
8 01/01/2014 14:58 01/01/2014 15:05 56 14 560+10-5+12-16
Trade number 8, for example, opened at 14:58 on 01/01/2014. Prior to it opening, there were 4 other trades that closed in that hour (trades 1, 4, 5 and 6). I would, therefore, like to add the profits from those 4 trades to the equity at the beginning of the hour and place that number in the equity column of the trades data.
for (i in 1:nrow(tradeData))
{
tradeData$EquityUSD1 [i] = tradeData$Equity [i] + sum(tradeData$Profit[tradeData$CloseTime <= tradeData$OpenTime[i] & tradeData$CloseTime >= tradeData$tradeHour[i,1]])
}
This works, but is quite slow and I would like to speed it up, as there are tens of thousands of trades.
Any ideas? Please let me know if I have omitted any important data/info
Thanks
I haven't tested the speed on a larger data set yet.
dt
## TradeNumber OpenTime CloseTime Profit TradeHour Equity
## 1 1 01/01/2014 13:10 01/01/2014 14:40 10 13 520
## 2 2 01/01/2014 13:25 01/01/2014 13:28 20 13 520
## 3 3 01/01/2014 13:29 01/01/2014 15:40 -50 13 520
## 4 4 01/01/2014 13:30 01/01/2014 14:05 -5 13 520
## 5 5 01/01/2014 14:12 01/01/2014 14:40 12 14 560
## 6 6 01/01/2014 14:21 01/01/2014 14:45 -16 14 560
## 7 7 01/01/2014 14:50 01/01/2014 14:59 -14 14 560
## 8 8 01/01/2014 14:58 01/01/2014 15:05 56 14 560
require(data.table)
setDT(dt)
dt[,OpenTime:=as.POSIXct(OpenTime,format="%m/%d/%Y %H:%M")]
dt[,CloseTime:=as.POSIXct(CloseTime,format="%m/%d/%Y %H:%M")]
dt[,Equity.new:=Equity+sum(dt$Profit[hour(OpenTime)==hour(dt$CloseTime) & OpenTime > dt$CloseTime]), by="TradeNumber"]
dt
## TradeNumber OpenTime CloseTime Profit TradeHour Equity Equity.new
## 1: 1 2014-01-01 13:10:00 2014-01-01 14:40:00 10 13 520 520
## 2: 2 2014-01-01 13:25:00 2014-01-01 13:28:00 20 13 520 520
## 3: 3 2014-01-01 13:29:00 2014-01-01 15:40:00 -50 13 520 540
## 4: 4 2014-01-01 13:30:00 2014-01-01 14:05:00 -5 13 520 540
## 5: 5 2014-01-01 14:12:00 2014-01-01 14:40:00 12 14 560 555
## 6: 6 2014-01-01 14:21:00 2014-01-01 14:45:00 -16 14 560 555
## 7: 7 2014-01-01 14:50:00 2014-01-01 14:59:00 -14 14 560 561
## 8: 8 2014-01-01 14:58:00 2014-01-01 15:05:00 56 14 560 561
The following code seems to produce the output you want, assuming your data is in a data frame called tradedata:
library(plyr)
ddply( tradedata , .(TradeHour) ,
mutate,
Equity=Equity+ cumsum(Profit) - Profit )
If you want to include the profit of the particular trade, remove the -Profit.
You could run this in parallel by giving ddply the .parallel=TRUE option (this requires a registered foreach backend). An answer with data.table may be quicker, however. It would be interesting to see which works best.
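For tens of thousands of trades, another option might be to avoid the per-row scan entirely. The sketch below sorts the close times once, takes a running sum of profits, and looks up two positions per trade with findInterval(). It assumes the rule is "closed at or before this trade's open, and within the same clock hour as the open", and it assumes times are in UTC so that hour boundaries fall on multiples of 3600 seconds. All object names are my own.

```r
fmt <- "%Y-%m-%d %H:%M"
trades <- data.frame(
  OpenTime = as.POSIXct(paste("2014-01-01", c("13:10", "13:25", "13:29",
    "13:30", "14:12", "14:21", "14:50", "14:58")), format = fmt, tz = "UTC"),
  CloseTime = as.POSIXct(paste("2014-01-01", c("14:40", "13:28", "15:40",
    "14:05", "14:40", "14:45", "14:59", "15:05")), format = fmt, tz = "UTC"),
  Profit = c(10, 20, -50, -5, 12, -16, -14, 56),
  Equity = c(520, 520, 520, 520, 560, 560, 560, 560)
)

# Sort trades by close time and build a running total of profit;
# cum_profit[k + 1] is the profit of the first k trades to close
ord <- order(trades$CloseTime)
close_sorted <- as.numeric(trades$CloseTime[ord])
cum_profit <- c(0, cumsum(trades$Profit[ord]))

open_num <- as.numeric(trades$OpenTime)
hour_start <- open_num - open_num %% 3600  # start of the open's clock hour

# Profit closed at or before the open, minus profit closed strictly
# before the start of that hour (left.open = TRUE counts values < x)
upto_open <- cum_profit[findInterval(open_num, close_sorted) + 1]
before_hr <- cum_profit[findInterval(hour_start, close_sorted,
                                     left.open = TRUE) + 1]
trades$Equity.new <- trades$Equity + upto_open - before_hr
trades
```

On the example data this gives the same Equity.new values as the data.table answer above (520, 520, 540, 540, 555, 555, 561, 561), but I have not benchmarked it on a large data set.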
My original data frame shows changes on one variable (act, measured in seconds) over approx a 2-week period for several individuals (identified by Ring). My problem is that this variable stretches over the change of date (i.e. at midnight) and I wanted to split it in two: from time[i] till just at midnight, and from midnight until time[i+1]. I have added a few more variables that I need for computing these two operations:
modify the ith row (only when date changes) so I can get the portion of act[i] before midnight
insert one extra row (only when date changes) and assign it the other portion of act[i].
For example:
ith row: 01-01-2000 23:55:00 act= 360 seconds
i+1th row: 02-01-2000 00:01:00 act= 30 seconds
i+2th row: 02-01-2000 00:01:30 act= 50 seconds
.
.
.
My goal is to get:
ith row: 01-01-2000 23:55:00 act= 300 seconds # modified row
i+1th row: 02-01-2000 00:00:00 act= 60 seconds # inserted row
i+2th row: 02-01-2000 00:01:00 act= 30 seconds # previously row i+1th
i+3th row: 02-01-2000 00:01:30 act= 50 seconds # previously row i+2th
.
.
.
The data associated with each individual (Ring) stretch over a different period of time, so there are date changes between individuals that shouldn't be taken into account.
Below, a selection of my ~ 90000-row dataframe (xact) that shows date changes within and between individuals (Ring) and next my code:
Ring time act wd date clock timepos timemn actmn jul
156 6106933 09/06/11 21:37:45 267 dry 09/06/11 21:37:45 2011-06-09 21:37:45 2011-06-10 8535 15134
157 6106933 09/06/11 21:42:12 3417 wet 09/06/11 21:42:12 2011-06-09 21:42:12 2011-06-10 8268 15134
158 6106933 09/06/11 22:39:09 51 dry 09/06/11 22:39:09 2011-06-09 22:39:09 2011-06-10 4851 15134
159 6106933 09/06/11 22:40:00 7317 wet 09/06/11 22:40:00 2011-06-09 22:40:00 2011-06-10 4800 15134
160 6106933 10/06/11 00:41:57 24 dry 10/06/11 00:41:57 2011-06-10 00:41:57 2011-06-11 83883 15135
529 6106933 11/06/11 22:41:57 3177 wet 11/06/11 22:41:57 2011-06-11 22:41:57 2011-06-12 4683 15136
530 6106933 11/06/11 23:34:54 6 dry 11/06/11 23:34:54 2011-06-11 23:34:54 2011-06-12 1506 15136
531 6106933 11/06/11 23:35:00 1779 wet 11/06/11 23:35:00 2011-06-11 23:35:00 2011-06-12 1500 15136
532 6106933 12/06/11 00:04:39 594 dry 12/06/11 00:04:39 2011-06-12 00:04:39 2011-06-13 86121 15137
533 6106933 12/06/11 00:14:33 18840 wet 12/06/11 00:14:33 2011-06-12 00:14:33 2011-06-13 85527 15137
7024 6134701 24/07/11 15:24:14 6 dry 24/07/11 15:24:14 2011-07-24 15:24:14 2011-07-25 30946 15179
7025 6134701 24/07/11 15:24:20 6 wet 24/07/11 15:24:20 2011-07-24 15:24:20 2011-07-25 30940 15179
7026 6134701 24/07/11 15:24:26 810 dry 24/07/11 15:24:26 2011-07-24 15:24:26 2011-07-25 30934 15179
R = unique(xact$Ring)
for ( m in R ) {
for ( i in 1:nrow(xact) ) {
if( xact$jul[i] < xact$jul[i+1] ) {
# modify row i (jul= Julian date)
xact[i] <- c( xact$Ring[i], xact$time[i], xact$actmn[i], xact$wd[i], xact$date[i], xact$clock[i], xact$timepos[i], xact$timemn[i], xact$actmn[i], xact$jul[i] )
# add new row between row i and row i+1
r <- i
newrow <- c( xact$Ring[i], xact$timemn[i], as.numeric(xact$timepos[i+1] - xact$timemn[i]), xact$wd[i], xact$date[i+1], xact$clock[i+1], xact$timemn[i], xact$timemn[i], xact$actmn[i], xact$jul[i+1] )
insertRow <- function( xact, newrow, r ) {
xact[seq( r+1, nrow(xact) + 1), ] <- xact[seq( r, nrow(xact) ), ]
xact[r,] <- newrow
xact
}
}
}
}
I tried to adapt existing code from "Add new row to dataframe, at specific row-index, not appended?", but it produces this message:
I would appreciate any help.
Santi
Here is an example with made-up data:
#create data
DF <- data.frame(time=seq(from=strptime("2013-01-01 01:00","%Y-%m-%d %H:%M"),to=strptime("2013-01-03 01:00","%Y-%m-%d %H:%M"),by=3500))
DF$ring <- 1:2
DF <- DF[order(DF$ring),]
#apply per ring
library(plyr)
DF <- ddply(DF,.(ring),function(df){
#index of date changes
ind <- c(FALSE,diff(as.POSIXlt(df$time)$yday)==1)
add <- df[ind,]
#trunc always gives midnight of the same day (round could jump to the next day)
add$time <- as.POSIXct(trunc(add$time,"days"))
#you can simply rbind and order, no need for inserting
df <- rbind(df,add)
df <- df[order(df$time),]
#it's easier to calculate act here
df$act <- c(diff(as.numeric(df$time)),NA)
df})
time ring act
1 2013-01-01 01:00:00 1 7000
2 2013-01-01 02:56:40 1 7000
3 2013-01-01 04:53:20 1 7000
4 2013-01-01 06:50:00 1 7000
5 2013-01-01 08:46:40 1 7000
6 2013-01-01 10:43:20 1 7000
7 2013-01-01 12:40:00 1 7000
8 2013-01-01 14:36:40 1 7000
9 2013-01-01 16:33:20 1 7000
10 2013-01-01 18:30:00 1 7000
11 2013-01-01 20:26:40 1 7000
12 2013-01-01 22:23:20 1 5800
13 2013-01-02 00:00:00 1 1200
14 2013-01-02 00:20:00 1 7000
15 2013-01-02 02:16:40 1 7000
16 2013-01-02 04:13:20 1 7000
17 2013-01-02 06:10:00 1 7000
18 2013-01-02 08:06:40 1 7000
19 2013-01-02 10:03:20 1 7000
20 2013-01-02 12:00:00 1 7000
21 2013-01-02 13:56:40 1 7000
22 2013-01-02 15:53:20 1 7000
23 2013-01-02 17:50:00 1 7000
24 2013-01-02 19:46:40 1 7000
25 2013-01-02 21:43:20 1 7000
26 2013-01-02 23:40:00 1 NA
27 2013-01-01 01:58:20 2 7000
28 2013-01-01 03:55:00 2 7000
29 2013-01-01 05:51:40 2 7000
30 2013-01-01 07:48:20 2 7000
31 2013-01-01 09:45:00 2 7000
32 2013-01-01 11:41:40 2 7000
33 2013-01-01 13:38:20 2 7000
34 2013-01-01 15:35:00 2 7000
35 2013-01-01 17:31:40 2 7000
36 2013-01-01 19:28:20 2 7000
37 2013-01-01 21:25:00 2 7000
38 2013-01-01 23:21:40 2 2300
39 2013-01-02 00:00:00 2 4700
40 2013-01-02 01:18:20 2 7000
41 2013-01-02 03:15:00 2 7000
42 2013-01-02 05:11:40 2 7000
43 2013-01-02 07:08:20 2 7000
44 2013-01-02 09:05:00 2 7000
45 2013-01-02 11:01:40 2 7000
46 2013-01-02 12:58:20 2 7000
47 2013-01-02 14:55:00 2 7000
48 2013-01-02 16:51:40 2 7000
49 2013-01-02 18:48:20 2 7000
50 2013-01-02 20:45:00 2 7000
51 2013-01-02 22:41:40 2 4700
52 2013-01-03 00:00:00 2 2300
53 2013-01-03 00:38:20 2 NA
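If the act values already exist and should be split rather than recomputed from the time differences, a similar rbind-and-order idea can be applied directly. The sketch below uses the three example rows from the question for a single individual. It assumes times are in UTC and that no single activity spans more than one midnight, and the object names are made up. For several individuals it would have to be applied per Ring, for example inside the ddply call above.

```r
x <- data.frame(
  time = as.POSIXct(c("2000-01-01 23:55:00", "2000-01-02 00:01:00",
                      "2000-01-02 00:01:30"),
                    format = "%Y-%m-%d %H:%M:%S", tz = "UTC"),
  act = c(360, 30, 50)
)

t0 <- as.numeric(x$time)
next_mid <- (t0 %/% 86400 + 1) * 86400   # first midnight after each start
before <- next_mid - t0                  # seconds left until that midnight
crosses <- x$act > before                # activity runs past midnight

# New rows start at midnight and carry the part of act after midnight
extra <- data.frame(
  time = as.POSIXct(next_mid[crosses], origin = "1970-01-01", tz = "UTC"),
  act = x$act[crosses] - before[crosses]
)

# Existing rows keep only the part before midnight
x$act[crosses] <- before[crosses]

out <- rbind(x, extra)
out <- out[order(out$time), ]
rownames(out) <- NULL
out
```

For the example, the 360 seconds starting at 23:55:00 become 300 seconds before midnight plus a new 60-second row at 00:00:00, and the other two rows are unchanged.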