I'm trying to reshape a data frame, but I'm totally lost as to how to proceed:
> test
# Time Entry Order Size Price S / L T / P Profit Balance
1 1 2017-01-11 00:00:00 buy 1 0.16 1.05403 1.0449 1.07838 NA NA
2 3 2017-01-24 16:00:00 s/l 1 0.16 1.04490 1.0449 1.07838 -97.28 9902.72
As you can see, we have 2 (or more) rows for one order ID. What I want to do is combine those 2 rows into one by adding several new columns: Exit (where the "s/l" value from the second row should go), Exit Price (the Price from the second row), and by replacing the NAs in the Profit and Balance columns of the first row with the values from the second row.
By the way, the original name of the Entry column is "Type", but I already changed that, which is why it wouldn't make much sense to keep the exit reason of the trade in a column called "Entry". So far I've only thought of extracting the data into several vectors, mutating the first row, and dropping the second one, but I'm quite sure there's a better way. Besides, that stone-age approach would be useless when applied to the whole data frame.
If possible, I'd like to stick to the tidyverse packages for this, just for ease of replication. Thank you in advance for your suggestions!
I ended up sorting it out! My solution was to split the data frame in 2, reshape each half as needed, and then full joining them. Here's the initial data frame:
> head(backtest_table, n = 10)
# Time Type Order Size Price S / L T / P Profit Balance
1 1 2017.01.11 00:00 buy 1 0.16 1.05403 1.04490 1.07838 NA NA
2 2 2017.01.19 00:00 buy 2 0.16 1.05376 1.04480 1.07764 NA NA
3 3 2017.01.24 16:00 s/l 1 0.16 1.04490 1.04490 1.07838 -97.28 9902.72
4 4 2017.01.24 16:00 s/l 2 0.16 1.04480 1.04480 1.07764 -95.48 9807.24
5 5 2017.02.09 00:00 buy 3 0.15 1.05218 1.04265 1.07758 NA NA
6 6 2017.03.03 16:00 t/p 3 0.15 1.07758 1.04265 1.07758 251.75 10058.99
7 7 2017.03.29 00:00 buy 4 0.15 1.08826 1.07859 1.11405 NA NA
8 8 2017.04.04 00:00 close 4 0.15 1.08416 1.07859 1.11405 -41.24 10017.75
9 9 2017.04.04 00:00 sell 5 0.15 1.08416 1.09421 1.05737 NA NA
10 10 2017.04.07 00:00 sell 6 0.15 1.08250 1.09199 1.05718 NA NA
Here's the code I used to modify everything:
# Re-format data
library(dplyr)
library(lubridate)

# Separate entries and exits
entries <- backtest_table %>% filter(Type %in% c("buy", "sell"))
exits <- backtest_table %>% filter(!Type %in% c("buy", "sell"))

# Reshape entries and exits
# Entries: drop the row counter and the (always-NA) Profit and Balance columns
entries <- entries[-c(1, 9, 10)]
colnames(entries) <- c("Entry time", "Entry type", "Order", "Entry volume",
                       "Entry price", "Entry SL", "Entry TP")
entries$`Entry time` <- entries$`Entry time` %>% ymd_hm()
entries$`Entry type` <- as.factor(entries$`Entry type`)

# Exits: drop only the row counter
exits <- exits[-1]
colnames(exits) <- c("Exit time", "Exit type", "Order", "Exit volume",
                     "Exit price", "Exit SL", "Exit TP", "Profit", "Balance")
exits$`Exit time` <- exits$`Exit time` %>% ymd_hm()
exits$`Exit type` <- as.factor(exits$`Exit type`)

# Join re-shaped data
test <- full_join(entries, exits, by = "Order")
And here's the output of that:
> head(test, n = 10)
Entry time Entry type Order Entry volume Entry price Entry SL Entry TP Exit time
1 2017-01-11 buy 1 0.16 1.05403 1.04490 1.07838 2017-01-24 16:00:00
2 2017-01-19 buy 2 0.16 1.05376 1.04480 1.07764 2017-01-24 16:00:00
3 2017-02-09 buy 3 0.15 1.05218 1.04265 1.07758 2017-03-03 16:00:00
4 2017-03-29 buy 4 0.15 1.08826 1.07859 1.11405 2017-04-04 00:00:00
5 2017-04-04 sell 5 0.15 1.08416 1.09421 1.05737 2017-05-26 10:00:00
6 2017-04-07 sell 6 0.15 1.08250 1.09199 1.05718 2017-05-01 09:20:00
7 2017-04-19 sell 7 0.15 1.07334 1.08309 1.04733 2017-04-25 10:00:00
8 2017-05-05 sell 8 0.14 1.07769 1.08773 1.05093 2017-05-29 14:00:00
9 2017-05-24 sell 9 0.14 1.06673 1.07749 1.03803 2017-06-22 18:00:00
10 2017-06-14 sell 10 0.14 1.04362 1.05439 1.01489 2017-06-15 06:40:00
Exit type Exit volume Exit price Exit SL Exit TP Profit Balance
1 s/l 0.16 1.04490 1.04490 1.07838 -97.28 9902.72
2 s/l 0.16 1.04480 1.04480 1.07764 -95.48 9807.24
3 t/p 0.15 1.07758 1.04265 1.07758 251.75 10058.99
4 close 0.15 1.08416 1.07859 1.11405 -41.24 10017.75
5 t/p 0.15 1.05737 1.09421 1.05737 265.58 10091.18
6 s/l 0.15 1.09199 1.09199 1.05718 -94.79 9825.60
7 s/l 0.15 1.08309 1.08309 1.04733 -97.36 9920.39
8 t/p 0.14 1.05093 1.08773 1.05093 247.61 10338.79
9 t/p 0.14 1.03803 1.07749 1.03803 265.59 10504.05
10 s/l 0.14 1.05439 1.05439 1.01489 -100.33 10238.46
And that combined each observation where a trade was opened (showing NAs in the last columns) with the observation where that trade was closed, populating the last columns with the actual result and the new account balance!
If someone has suggestions on how to improve the system please let me know!
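For what it's worth, the split/rename/join steps can probably be collapsed into a single pipeline with tidyr::pivot_wider. The following is only a sketch: it assumes the columns are named Time, Type, Order, Size, Price, SL, TP, Profit and Balance (i.e. without the slashes and the counter column shown in the printout), so the names would need adjusting to the real data.
library(dplyr)
library(tidyr)
library(lubridate)

# Tag each row as an entry or an exit, then spread every Order onto a single row
test2 <- backtest_table %>%
  mutate(
    Time = ymd_hm(Time),
    Side = if_else(Type %in% c("buy", "sell"), "Entry", "Exit")
  ) %>%
  pivot_wider(
    id_cols     = Order,
    names_from  = Side,
    values_from = c(Time, Type, Size, Price, SL, TP, Profit, Balance),
    names_glue  = "{Side} {.value}"
  ) %>%
  select(-`Entry Profit`, -`Entry Balance`)  # entries never carry Profit/Balance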
I am conducting an event study with the market model: AR(i,t) = R(i,t) - (alpha(i) + beta(i) * R(m,t)). I struggle with calculating the alpha (intercept) and beta (slope) estimators because of the data format and filtering. This is what my data looks like at the moment:
Date ISIN R STOXX_Return Rating_Change Rating
9 2016-10-01 CH00 0.0175315633 -0.0003749766 0.00 A
10 2016-11-01 CH00 -0.0733760706 -0.0220566972 0.00 A
11 2016-12-01 CH00 -0.0107454123 0.0182991778 0.00 A
12 2017-01-01 CH00 0.0457420548 0.0641541456 1.90 A
...
21 2017-10-01 CH00 0.0250723834 0.0374169332 0.00 A
22 2017-11-01 CH00 -0.0780495570 0.0179348620 0.00 A
23 2017-12-01 CH00 0.0688209502 -0.0312226700 0.00 A
24 2018-01-01 CH00 -0.0064684781 0.0144049186 -0.90 A
..
74 2017-01-01 GB00 0.0409336446 0.0641541456 0.00 B+
75 2017-02-01 GB00 0.0056671717 0.0006470779 0.00 B+
76 2017-03-01 GB00 0.0028145957 0.0364348490 0.00 B+
77 2017-04-01 GB00 0.0366417787 0.0144673074 3.66 B+
...
There is an "event" if the Rating_Change is non-zero (rows 12, 24 and 77).
What I need is to run a regression with the lm() function for only the pre-event windows (for instance rows 9:11, 21:23, 74:77, which is the -3:1 window relative to each event).
However, there are several events per ISIN, meaning I have to group by ISIN and by event (non-zero rating change), then run the regression with lm() (R ~ STOXX_Return) for each pre-event window and each ISIN, and save the values in columns next to the event and the pre-event window.
I did not manage to do it with a conditional for loop or with magrittr/dplyr (or anything google-able :-) ). Nothing really worked out; I just do not know how to handle the "double filter" for ISIN and event, followed by the regression and the output of the coefficients.
Does anyone have an approach for solving this?
Thank you very much in advance for any support, it is very much appreciated!
Addition after response
I tried the following way:
PRE-WINDOW
filter_lmco_pre_tmp1 <- within(data, {
  event_pre_window <- if_else(lag(Rating_Change != 0), 1, 0)
  event_pre_window <- ave(event_pre_window, lag(ISIN), FUN = cumsum)
})
Date ISIN R STOXX_Return Rating_Change Rating event_pre_window
10 2016-11-01 CH00 -0.0733761 -0.0220567 0.00 A NA
11 2016-12-01 CH00 -0.0107454 0.0182992 0.00 A 0
12 2017-01-01 CH00 0.0457421 0.0641541 1.90 A 0
13 2017-02-01 CH00 0.0208479 0.0006471 0.00 A 1
14 2017-03-01 CH00 0.0351640 0.0364348 0.00 A 1
22 2017-11-01 CH00 -0.0780496 0.0179349 0.00 A 1
23 2017-12-01 CH00 0.0688210 -0.0312227 0.00 A 1
24 2018-01-01 CH00 -0.0064685 0.0144049 -0.90 A 1
POST-WINDOW
filter_lmco_post_tmp1 <- within(data, {
  event_post_window <- if_else(Rating_Change != 0, 1, 0)
  event_post_window <- ave(event_post_window, ISIN, FUN = cumsum)
})
Date ISIN R STOXX_Return Rating_Change Rating event_post_window
10 2016-11-01 CH00 -0.0733761 -0.0220567 0.00 A 0
11 2016-12-01 CH00 -0.0107454 0.0182992 0.00 A 0
12 2017-01-01 CH00 0.0457421 0.0641541 1.90 A 1
13 2017-02-01 CH00 0.0208479 0.0006471 0.00 A 1
14 2017-03-01 CH00 0.0351640 0.0364348 0.00 A 1
22 2017-11-01 CH00 -0.0780496 0.0179349 0.00 A 1
23 2017-12-01 CH00 0.0688210 -0.0312227 0.00 A 1
24 2018-01-01 CH00 -0.0064685 0.0144049 -0.90 A 2
25 2018-02-01 CH00 -0.0997418 0.0119439 0.00 A 2
You can see that if there is an event (row 12 or 24), the pre- and post-event IDs are not the same: the first event ID of the pre-window starts at 0 and that of the post-window at 1. This is because I lagged the pre-events. However, if I do not lag, then the actual event is not included in the pre-event window. So is there a way to get an "ID" for both the pre- and the post-window so that matching them afterwards is possible?
Consider assigning columns for the pre-windows with cumsum, ifelse and ave. Then call by (or split) to create a list of data frames, one for each ISIN and event pre-window. Finally, use tail to retrieve the last 3 rows of each and pass them into your modeling function. All handled in base R!
# CREATE EVENT PRE-WINDOW COLUMN
my_df <- within(my_df, {
  event_pre_window <- ifelse(Rating_Change != 0, 1, 0)
  event_pre_window <- ave(event_pre_window, ISIN, FUN = cumsum)
  event_pre_window <- ifelse(Rating_Change != 0, event_pre_window - 1, event_pre_window)
})
# DEFINE FUNCTION TO PROCESS SINGLE DATAFRAME
my_lm_model <- function(df) {
  # ... code to run lm and return results on each pre-window
}
# SPLIT DF BY ISIN AND PRE-WINDOWS
# CALL ABOVE FUNCTION ON EACH TAILED SUBSET
pre_windows_lm_results_list <- by(
my_df,
my_df[c("ISIN", "event_pre_window")],
function(sub) my_lm_model(tail(sub, 3))
)
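For completeness, here is one way my_lm_model could be filled in. It is only a sketch under a few assumptions: each pre-window subset has the columns R, STOXX_Return, ISIN and event_pre_window, and what you want back is the intercept (alpha) and slope (beta).
my_lm_model <- function(df) {
  if (nrow(df) < 2) return(NULL)              # not enough points to fit a line
  fit <- lm(R ~ STOXX_Return, data = df)
  data.frame(
    ISIN             = df$ISIN[1],
    event_pre_window = df$event_pre_window[1],
    alpha            = unname(coef(fit)[1]),  # intercept
    beta             = unname(coef(fit)[2])   # slope
  )
}

# Stack the per-window results into one data frame so they can be merged back
pre_window_estimates <- do.call(rbind, pre_windows_lm_results_list)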
I have a data frame df1 that summarizes activity or depth for several fish individuals (ID) over time (DateTime). Here is an example:
df1<- data.frame(ID=c(1,1,2,3,1,2,3,1,2,2,3,1,3,2,3),
DateTime=c("2017-05-08 10:15:23","2017-05-08 10:19:31","2017-05-08 10:11:12","2017-05-08 10:02:23","2017-05-08 10:21:32","2017-05-08 10:15:52","2017-05-08 10:13:23","2017-05-08 10:22:19","2017-05-08 10:19:42","2017-05-08 10:21:27","2017-05-08 10:16:07","2017-05-08 10:24:53","2017-05-08 10:28:39","2017-05-08 10:23:48","2017-05-08 10:33:01"),
DataType=c("Activity","Depth","Depth","Activity","Activity","Activity","Depth","Depth","Activity","Depth","Activity","Depth","Depth","Activity","Activity"),
Value=c(0.89,24,19,1.8,1.1,0.7,17,28,2.1,20,1.35,12,19,0.4,0.97))
df1
ID DateTime DataType Value
1 1 2017-05-08 10:15:23 Activity 0.89
2 1 2017-05-08 10:19:31 Depth 24.00
3 2 2017-05-08 10:11:12 Depth 19.00
4 3 2017-05-08 10:02:23 Activity 1.80
5 1 2017-05-08 10:21:32 Activity 1.10
6 2 2017-05-08 10:15:52 Activity 0.70
7 3 2017-05-08 10:13:23 Depth 17.00
8 1 2017-05-08 10:22:19 Depth 28.00
9 2 2017-05-08 10:19:42 Activity 2.10
10 2 2017-05-08 10:21:27 Depth 20.00
11 3 2017-05-08 10:16:07 Activity 1.35
12 1 2017-05-08 10:24:53 Depth 12.00
13 3 2017-05-08 10:28:39 Depth 19.00
14 2 2017-05-08 10:23:48 Activity 0.40
15 3 2017-05-08 10:33:01 Activity 0.97
For methodological reasons, I need to select the activity values that meet one condition: there is a previous depth record within 3 minutes for the same individual. That is, I need the activity data for which I have a depth record less than 3 minutes earlier. The resulting data frame should contain those activity values as well as the matching previous depth values.
I would expect something like this:
> df2
ID DateTime DataType Value
1 1 2017-05-08 10:19:31 Depth 24.00
2 1 2017-05-08 10:21:32 Activity 1.10 # Activity value within 3 minutes of a previous Depth record
3 2 2017-05-08 10:21:27 Depth 20.00
4 2 2017-05-08 10:23:48 Activity 0.40 # Activity value within 3 minutes of a previous Depth record
5 3 2017-05-08 10:13:23 Depth 17.00
6 3 2017-05-08 10:16:07 Activity 1.35 # Activity value within 3 minutes of a previous Depth record
Does anyone know how to do it?
We first convert DateTime to POSIXct, create a new column holding the most recent "Depth" time, subtract that "Depth" time from the current DateTime within each group (ID), and select the rows where DataType == 'Activity' and the time difference is less than 180 seconds, together with the Depth row immediately before each of them.
library(dplyr)
df1 %>%
mutate(DateTime = as.POSIXct(DateTime),
diffTime = replace(DateTime, DataType != "Depth", NA)) %>%
arrange(ID, DateTime) %>%
group_by(ID) %>%
tidyr::fill(diffTime) %>%
mutate(diffTime = difftime(DateTime, diffTime, units = "secs")) %>%
slice({i1 <- which(DataType == 'Activity' & diffTime < 180);c(i1-1, i1)}) %>%
select(-diffTime)
# ID DateTime DataType Value
# <dbl> <dttm> <fct> <dbl>
#1 1 2017-05-08 10:19:31 Depth 24
#2 1 2017-05-08 10:21:32 Activity 1.1
#3 2 2017-05-08 10:21:27 Depth 20
#4 2 2017-05-08 10:23:48 Activity 0.4
#5 3 2017-05-08 10:13:23 Depth 17
#6 3 2017-05-08 10:16:07 Activity 1.35
Here is an option using a non-equi join in data.table: for each row of Depth with a match, rbind the Depth row with the Activity row that is within 3 minutes:
library(data.table)
cols <- names(df1)
setDT(df1)[, DateTime := as.POSIXct(DateTime, format = "%Y-%m-%d %H:%M:%S")][,
  c("start", "end") := .(DateTime, DateTime + 3*60)]
ans <- df1[DataType == "Activity"][df1[DataType == "Depth"],
  on = .(ID, start >= start, start <= end), nomatch = 0L,
  by = .EACHI, rbindlist(use.names = FALSE,
    list(mget(paste0("i.", cols)), mget(cols)))
  ][, (1:3) := NULL] # remove unwanted columns
# set column names as desired
setnames(ans, gsub("i.", "", names(ans), fixed = TRUE))[]
output:
ID DateTime DataType Value
1: 1 2017-05-08 10:19:31 Depth 24.00
2: 1 2017-05-08 10:21:32 Activity 1.10
3: 3 2017-05-08 10:13:23 Depth 17.00
4: 3 2017-05-08 10:16:07 Activity 1.35
5: 2 2017-05-08 10:21:27 Depth 20.00
6: 2 2017-05-08 10:23:48 Activity 0.40
I am working with multiple csv files (2488 of them) stored in a folder. Each csv contains monthly data from a gauging station in a years x months layout. I want to store all these csv files in one big data frame in which the columns are the IDs of the different gauging stations and the rows are the dates (%Y-%m-%d).
For this purpose, I have listed all the files with list.files():
a <- list.files(pattern = "\\.csv$", full.names = TRUE)
And created an empty data frame with the final dimensions:
gst <- data.frame(matrix(NA, nrow = 1860, ncol = 2488)) # 1860 times - 2488 stations
Each csv file starts on a different date. The earliest date found is January 1863, so I created the data frame with 1860 rows, running from that starting date to June 2017.
I create a sequence of dates to name the rows of gst:
library(chron)
s <- paste0(1863, "-01-01")
e <- paste0(2017, "-12-31")
ss <- chron(s, format = 'y-m-d')
ee <- chron(e, format = 'y-m-d')
dates <- seq.dates(ss, ee, by = 'months')
In the following loop, I read each csv file. First I change the initial data frame format: years * months + total column.
# Initial format
Jan Feb Mar Apr
1993 NA 0.05 0.05 0.06
1994 0.18 0.15 0.1 0.19
1995 0.22 0.23 0.26 0.11
1996 0.14 0.11 0.1 0.08
1997 0.12 0.16 0.07 0.05
1998 0.12 0.07 0.12 0.18
1999 0.07 0.32 0.14 0.15
2000 0.13 0.22 0.15 0.1
2001 0.18 0.09 0.5 0.26
to a single column data frame (kk.df) with data stored as:
Date Value
93-01-01 NA
93-02-01 0.05
93-03-01 0.05
93-04-01 0.06
93-05-01 0.05
93-06-01 0.05
93-07-01 0.03
93-08-01 0.03
93-09-01 0.05
93-10-01 0.09
93-11-01 0.04
93-12-01 0.10
This is the loop I am working with:
for (i in 1:length(a)) {
  kk <- read.csv(a[i])
  colnames(kk) <- c(seq(1, 12, 1), 'total')  # 12 (months) columns and a total column
  kk.ts <- ts(as.vector(t(as.matrix(kk))),
              start = as.numeric(c(rownames(kk)[1], 1)),
              end = as.numeric(c(rownames(kk)[dim(kk)[1]], 12)),
              frequency = 12)
  kk.df <- as.data.frame(kk.ts)
  colnames(kk.df) <- a[[i]]
  # build this station's date sequence from its first and last year
  s.i <- paste0(rownames(kk)[1], "-01-01")
  e.i <- paste0(rownames(kk)[dim(kk)[1]], "-12-31")
  ac <- chron(s.i, format = 'y-m-d')
  bc <- chron(e.i, format = 'y-m-d')
  times <- seq.dates(ac, bc, by = "months")
  rownames(kk.df) <- times
  gst[i, ] <- kk.df  # this is the step I cannot get right: aligning each station with the correct rows of gst
}
My question is: since I am storing one column per gauging station (2488 stations) and each station starts at a different year-month, how can I specify, when storing station i, the row of gst at which its data must start?
If i = 1 and the first record is 1993-01-01, I want that column to start at the row of gst that corresponds to 1993-01-01, and so on for the rest of the stations.
Thank you so much.
Maybe you could incorporate a join in your for loop for that:
library(dplyr)

df <- data.frame(date = seq(Sys.Date(), Sys.Date() + 3, by = 1))
station1 <- data.frame(date = seq(Sys.Date() + 2, Sys.Date() + 3, by = 1), data = c(1, 2))
station2 <- data.frame(date = seq(Sys.Date() + 1, Sys.Date() + 2, by = 1), data = c(2, 3))

df <- df %>% left_join(station1, by = "date") %>% rename(station1 = data)
df <- df %>% left_join(station2, by = "date") %>% rename(station2 = data)
Input:
> df
date
1 2017-07-17
2 2017-07-18
3 2017-07-19
4 2017-07-20
> station1
date data
1 2017-07-19 1
2 2017-07-20 2
> station2
date data
1 2017-07-18 2
2 2017-07-19 3
Output:
> df
date station1 station2
1 2017-07-17 NA NA
2 2017-07-18 NA 2
3 2017-07-19 1 3
4 2017-07-20 2 NA
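Applied to the gauging-station files, the same idea could sit inside your loop roughly like this. It is only a sketch: reshape_station() stands in for whatever code turns one raw csv into a two-column date/value data frame, so its name and the exact date handling are assumptions rather than working code.
library(dplyr)

# master frame: one row per month over the full period you want to cover
gst <- data.frame(date = seq(as.Date("1863-01-01"), as.Date("2017-12-01"), by = "month"))

for (f in a) {
  station <- reshape_station(f)               # hypothetical helper: returns date + value
  names(station) <- c("date", basename(f))    # name the value column after the file
  gst <- left_join(gst, station, by = "date") # rows line up by date; missing months stay NA
}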
I am new to R; I have this HTML table here.
I need to find out if there is a gap of more than one minute in the "Time (DT)" column. I need to analyze the data and create a new table with just two columns, the first one with the time and the second one with the size of the gap.
Like this: output
So far I am able to download the data!!!
require(XML)
u <- "http://cronos.est.pr/test.html"
tables <- readHTMLTable(u)
datatest <- tables[[1]]
View(datatest)
What's next???
Convert the first column to "POSIXct" class, take differences and replace differences of one minute or less with NA. No packages are used.
with(datatest, {
  Time <- as.POSIXct(`Time (DT)`)
  Diff <- c(0, c(diff(Time, units = "minutes")))
  data.frame(Time, Diff = ifelse(Diff <= 1, NA, Diff))
})
giving:
Time Diff
1 2010-01-01 09:10:00 NA
2 2010-01-01 09:11:00 NA
3 2010-01-01 09:12:00 NA
4 2010-01-01 09:13:00 NA
5 2010-01-01 09:17:00 4
6 2010-01-01 09:18:00 NA
7 2010-01-01 09:19:00 NA
8 2010-01-01 09:20:00 NA
9 2010-01-01 09:22:00 2
10 2010-01-01 09:24:00 2
11 2010-01-01 09:25:00 NA
12 2010-01-01 09:26:00 NA
13 2010-01-01 09:38:00 12
14 2010-01-01 09:39:00 NA
15 2010-01-01 09:40:00 NA
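If you only want the rows where a gap actually occurs (the two-column table from the question), you can save the result above and drop the NA rows afterwards, for example:
result <- with(datatest, {
  Time <- as.POSIXct(`Time (DT)`)
  Diff <- c(0, c(diff(Time, units = "minutes")))
  data.frame(Time, Diff = ifelse(Diff <= 1, NA, Diff))
})

gaps_only <- subset(result, !is.na(Diff))  # only the rows with a gap of more than one minute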
Use the lubridate package.
library(lubridate)
minutes = minute(datatest[,"Time (DT)"])
gaps = c(0, diff(minutes))
output = data.frame("date_time" = datatest[,"Time (DT)"], gaps = gaps)
The output is like you requested except that every gap is recorded, not just the ones greater than 1 minute. To get just the big gaps, do
output[output$gaps > 1,]
I am quite new to R, and I am trying to find a way to average continuous data over specific time periods.
My data is a month-long recording of several parameters with 1 s time steps.
The table, read in via read.csv, has a date column and a time column plus several other columns with values.
TimeStamp UTC Pitch Roll Heave(m)
05-02-13 6:45 0 0 0
05-02-13 6:46 0.75 -0.34 0.01
05-02-13 6:47 0.81 -0.32 0
05-02-13 6:48 0.79 -0.37 0
05-02-13 6:49 0.73 -0.08 -0.02
So I want to average the data over specific intervals, 20 min for example, in such a way that the average for 7:00 takes all the points from 6:41 to 7:00 and returns the average over that interval, and so on for the entire dataset.
The result per time interval would look like this:
TimeStamp
05-02-13 19:00 462
05-02-13 19:20 332
05-02-13 19:40 15
05-02-13 20:00 10
05-02-13 20:20 42
Here is a reproducible dataset similar to your own.
meteorological <- data.frame(
TimeStamp = rep.int("05-02-13", 1440),
UTC = paste(
rep(formatC(0:23, width = 2, flag = "0"), each = 60),
rep(formatC(0:59, width = 2, flag = "0"), times = 24),
sep = ":"
),
Pitch = runif(1440),
Roll = rnorm(1440),
Heave = rnorm(1440)
)
The first thing that you need to do is to combine the first two columns to create a single (POSIXct) date-time column.
library(lubridate)
meteorological$DateTime <- with(
meteorological,
dmy_hm(paste(TimeStamp, UTC))
)
Then set up a sequence of break points for your different time groupings.
breaks <- seq(ymd_hms("2013-02-05 00:00:00"), ymd_hms("2013-02-06 00:00:00"), by = "20 mins")
Finally, you can calculate the summary statistics for each group. There are many ways to do this. ddply from the plyr package is a good choice.
library(plyr)
ddply(
meteorological,
.(cut(DateTime, breaks)),
summarise,
MeanPitch = mean(Pitch),
MeanRoll = mean(Roll),
MeanHeave = mean(Heave)
)
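If you would rather avoid plyr, the same grouping can be done with dplyr; here is a sketch using the breaks defined above:
library(dplyr)

meteorological %>%
  group_by(interval = cut(DateTime, breaks)) %>%
  summarise(
    MeanPitch = mean(Pitch),
    MeanRoll  = mean(Roll),
    MeanHeave = mean(Heave)
  )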
Please see if something simple like this works for you:
myseq <- data.frame(time=seq(ISOdate(2014,1,1,12,0,0), ISOdate(2014,1,1,13,0,0), "5 min"))
myseq$cltime <- cut(myseq$time, "20 min", labels = F)
> myseq
time cltime
1 2014-01-01 12:00:00 1
2 2014-01-01 12:05:00 1
3 2014-01-01 12:10:00 1
4 2014-01-01 12:15:00 1
5 2014-01-01 12:20:00 2
6 2014-01-01 12:25:00 2
7 2014-01-01 12:30:00 2
8 2014-01-01 12:35:00 2
9 2014-01-01 12:40:00 3
10 2014-01-01 12:45:00 3
11 2014-01-01 12:50:00 3
12 2014-01-01 12:55:00 3
13 2014-01-01 13:00:00 4
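From there, the averages per 20-minute bin can be computed by grouping on cltime (or on the cut() labels directly). A sketch with aggregate(), assuming your own data frame is called met and its combined date-time column is DateTime (both names are placeholders, not from the question):
met$interval <- cut(met$DateTime, "20 min")                              # label each row with its 20-min bin
aggregate(cbind(Pitch, Roll, Heave) ~ interval, data = met, FUN = mean)  # mean of each parameter per bin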