OK, so I am backtesting trading signals using R. Here is a snippet of my code showing the z-score creation, the close-to-close returns (using the TTR package), the long signal, and the Lag() call that places the signal on the next day.
require(quantmod)
require(TTR)
require(zoo)
# Calculate n period close price z-scores indicator using TTR package
new.df$roll.mean.n3 <- runMean(new.df$Close, n=3, cumulative = FALSE)
new.df$roll.sd.n3 <- runSD(new.df$Close, n=3, cumulative = FALSE)
new.df$roll.z.score.n3 <- apply(new.df[,c('Close','roll.mean.n3', 'roll.sd.n3')], 1, function(x) { (x[1]-x[2])/x[3] } )
# Calculate Close-to-Close returns
new.df$clret <- ROC(new.df$Close,1)
new.df$clret[1] <- 0
# Create the long (up) signal
new.df$sigup <- ifelse(new.df$roll.z.score.n3 < -1, 1, 0)
# Lag signals to align with days in market not days when signals were generated
new.df$sigup <- Lag(new.df$sigup,1) # Note k=1 implies a move *forward*
The current setup above produces output like this:
roll.z.score.n3 sigup
1 NA 0
2 NA 0
3 -1.135974424 0
4 0.193311168 1
5 0.714285714 0
6 -1.148753543 0
7 -0.942160394 1
8 -0.695763683 0
9 1.140646864 0
10 0.985196899 0
11 -0.768766574 0
12 -1.011293858 0
13 -0.516703612 1
14 -1.120897077 0
15 1.091089451 1
The entry signal is to go long when the z-score value is < -1, as shown in row 3. We have a +1 in row 4 because we used Lag() to step the entry signal forward to the next day. Each time the z-score value is below -1, there is a +1 the next day.
This setup is perfectly fine if I'm only trading a one-day holding period.
I can then multiply sigup (1 or 0) by the % daily returns to obtain an equity curve.
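For instance, a minimal sketch of that multiplication on toy data (the column names mirror the question, but the return values here are invented for illustration):

```r
# Toy data standing in for new.df (values invented for illustration)
clret <- c(0, 0.01, -0.02, 0.015, 0.005)  # close-to-close % returns
sigup <- c(0, 0, 1, 0, 1)                 # lagged 0/1 long signal
strat.ret <- sigup * clret                # earn the return only when in the market
equity <- cumprod(1 + strat.ret)          # compounded equity curve
equity
```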
I want to elaborate further on the entry/exit signals. I wish to go long (sig long) when the z-score is < -1 and exit when the z-score is > 1.
The output would look something like this:
roll.z.score.n3 sig long
1 NA 0
2 NA 0
3 -1.135974424 0
4 0.193311168 1
5 0.714285714 1
6 -1.148753543 1
7 -0.942160394 1
8 -0.695763683 1
9 1.140646864 1
10 0.985196899 0
11 -0.768766574 0
12 -1.011293858 0
13 -0.516703612 1
14 -1.120897077 1
15 1.091089451 1
16 0.968364052 0
17 0.872871561 0
18 1.099524999 0
19 0.918397948 0
Row 3 shows a z-score of < -1. Lagging to the next day makes it +1 (row 4), and it stays +1 all the way until row 9, when the z-score is > 1.0. Thus, the next day at row 10, the signal is 0.
I wanted to give some background on the current code; it's an attempt to build on a post at the FOSS Trading blog.
Thanks for taking a look at this.
See if the following works:
zz = '
roll.z.score.n3 sig_long
1 NA 0
2 NA 0
3 -1.135974424 0
4 0.193311168 1
5 0.714285714 1
6 -1.148753543 1
7 -0.942160394 1
8 -0.695763683 1
9 1.140646864 1
10 0.985196899 0
11 -0.768766574 0
12 -1.011293858 0
13 -0.516703612 1
14 -1.120897077 1
15 1.091089451 1
16 0.968364052 0
17 0.872871561 0
18 1.099524999 0
19 0.918397948 0
'
df <- read.table(text = zz, header = TRUE)
df = na.omit(df)
df$sig_long[1] <- ifelse(df$roll.z.score.n3[1] < -1, 1, 0)
for (i in 2:nrow(df)) {
  df$sig_long[i] <- ifelse(df$roll.z.score.n3[i] < -1, 1,
                           ifelse(df$roll.z.score.n3[i] > 1, 0,
                                  df$sig_long[i - 1]))
}
Not sure about this part:
df$sig_long <- Lag(df$sig_long, 1)
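If you would rather avoid the loop entirely, the same enter-below-−1 / exit-above-+1 state machine can be built vectorised: mark entries 1, exits 0, leave the in-between rows NA, and carry the last state forward with zoo::na.locf(). A sketch on the question's first ten z-scores (rounded):

```r
library(zoo)

z <- c(NA, NA, -1.136, 0.193, 0.714, -1.149, -0.942, -0.696, 1.141, 0.985)

state <- ifelse(z < -1, 1, ifelse(z > 1, 0, NA))  # 1 = enter, 0 = exit, NA = hold previous
sig_long <- na.locf(state, na.rm = FALSE)         # carry the last state forward
sig_long[is.na(sig_long)] <- 0                    # flat before the first signal
sig_long
# 0 0 1 1 1 1 1 1 0 0
```

As in the question, Lag() this by one day before multiplying by returns.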
Related
I am trying to reformat longitudinal data for a time-to-event analysis. In the example data below, I simply want to find the earliest week that the result was “0” for each ID.
The specific issue I am having is how to handle patients that don't convert to 0 and had either all 1's or 2's. In the example data, patient J has all 1's.
#Sample data
have<-data.frame(patient=rep(LETTERS[1:10], each=9),
week=rep(0:8,times=10),
result=c(1,0,2,rep(0,6),1,1,2,1,rep(0,5),1,1,rep(0,7),1,rep(0,8),
1,1,1,1,2,1,0,0,0,1,1,1,rep(0,6),1,2,1,rep(0,6),1,2,rep(0,7),
1,rep(0,8),rep(1,9)))
patient week result
A 0 1
A 1 0
A 2 2
A 3 0
A 4 0
A 5 0
A 6 0
A 7 0
A 8 0
B 0 1
B 1 0
... .....
J 6 1
J 7 1
J 8 1
I am able to do this relatively straightforward process with the following code:
want<-aggregate(have$week, by=list(have$patient,have$result), min)
want<-want[which(want[2]==0),]
but realize that if someone does not convert to 0, they are excluded (in this example, patient J is excluded). Instead, J should be present with a 1 in the second column and an 8 in the third column; as it stands, J is of course omitted.
print(want)
Group.1 Group.2 x
A 0 1
B 0 4
C 0 2
D 0 1
E 0 6
F 0 3
G 0 3
H 0 2
I 0 1
#But also need
J 1 8
Pursuant to the guidelines on posting here: I did work to solve this, and am able to get what I need, albeit very inelegantly:
mins<-aggregate(have$week, by=list(have$patient,have$result), min)
maxs<-aggregate(have$week, by=list(have$patient,have$result), max)
want<-rbind(mins[which(mins[2]==0),],maxs[which(maxs[2]==1&maxs[3]==8),])
This returns the correct desired dataset, but the code is terrible and not sustainable as I work with other datasets (i.e., datasets with different timeframes, since I have to manually put in maxs[3]==8, etc.).
Is there a more elegant or systematic way to approach this data manipulation issue?
We can write a function to select a row from the group.
select_row <- function(result, week) {
if(any(result == 0)) which.max(result == 0) else which.max(week)
}
This function returns the index of the first 0 value if one is present, or else the index of the maximum value of week. We then apply it to all groups:
library(dplyr)
have %>% group_by(patient) %>% slice(select_row(result, week))
# patient week result
# <fct> <int> <dbl>
# 1 A 1 0
# 2 B 4 0
# 3 C 2 0
# 4 D 1 0
# 5 E 6 0
# 6 F 3 0
# 7 G 3 0
# 8 H 2 0
# 9 I 1 0
#10 J 8 1
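The same select_row() idea also works in base R via by(), in case dplyr is not available. A sketch on a small two-patient subset (invented here for brevity; A converts to 0, J never does):

```r
select_row <- function(result, week) {
  if (any(result == 0)) which.max(result == 0) else which.max(week)
}

# Two-patient toy subset: A converts to 0 in week 1, J never converts
have <- data.frame(patient = rep(c("A", "J"), each = 3),
                   week    = rep(0:2, times = 2),
                   result  = c(1, 0, 0, 1, 1, 1))

want <- do.call(rbind, by(have, have$patient,
                          function(d) d[select_row(d$result, d$week), ]))
want
# A gets its first 0-week, J gets its last observed week
```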
As I am new to R, this question may seem to you like a piece of cake.
I have a data in txt format. The first column has Cluster Number and the second column has names of different organisms.
For example:
0 org4|gene759
1 org1|gene992
2 org1|gene1101
3 org4|gene757
4 org1|gene1702
5 org1|gene989
6 org1|gene990
7 org1|gene1699
9 org1|gene1102
10 org4|gene2439
10 org1|gene1374
I need to re-arrange/reshape the data in following format.
Cluster No. org1 org2 org3 org4
0           0    0    0    1
1           1    0    0    0
I could not figure out how to do it in R.
Thanks
We could use table
out <- cbind(ClusterNo = seq_len(nrow(df1)), as.data.frame.matrix(table(seq_len(nrow(df1)),
factor(sub("\\|.*", "", df1[[2]]), levels = paste0("org", 1:4)))))
head(out, 2)
# ClusterNo org1 org2 org3 org4
#1 1 0 0 0 1
#2 2 1 0 0 0
It is also possible that we need to use the first column to get the frequency
out1 <- as.data.frame.matrix(table(df1[[1]],
factor(sub("\\|.*", "", df1[[2]]), levels = paste0("org", 1:4))))
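To make the above reproducible, df1 can be built straight from the first few rows of the question's data (assuming the two-column structure shown there):

```r
# A small df1 built from the question's first three rows
df1 <- data.frame(V1 = c(0, 1, 2),
                  V2 = c("org4|gene759", "org1|gene992", "org1|gene1101"))

out1 <- as.data.frame.matrix(table(df1[[1]],
        factor(sub("\\|.*", "", df1[[2]]), levels = paste0("org", 1:4))))
out1
# cluster 0 falls in org4, clusters 1 and 2 in org1
```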
Reading the table into R can be done with
input <- read.table('filename.txt')
Then we can extract the relevant number from the org4|gene759 string using a regular expression, and set this as a (numeric) third column of our input:
input[, 3] <- as.numeric(gsub('^org(.+)\\|.*', '\\1', input[, 2]))
Our input data now looks like this:
> input
V1 V2 V3
1 0 org4|gene759 4
2 1 org1|gene992 1
3 2 org1|gene1101 1
4 3 org4|gene757 4
5 4 org1|gene1702 1
6 5 org1|gene989 1
7 6 org1|gene990 1
8 7 org1|gene1699 1
9 9 org1|gene1102 1
10 10 org4|gene2439 4
11 10 org1|gene1374 1
Then we need to list the possible values of org:
possibleOrgs <- seq_len(max(input[, 3])) # = c(1, 2, 3, 4)
Now for the tricky part. The following function takes each unique cluster number in turn (I notice that 10 appears twice in your example data), takes all the rows relating to that cluster, and looks at the org value for those rows.
result <- vapply(unique(input[, 1]), function (x)
    possibleOrgs %in% input[input[, 1] == x, 3], logical(4))
We can then format this result as we like, perhaps using t to transpose it, * 1 to convert from TRUEs and FALSEs to 1s and 0s, and colnames to title its columns:
result <- t(result) * 1
colnames (result) <- paste0('org', possibleOrgs)
rownames(result) <- unique(input[, 1])
I hope that this is what you were looking for -- it wasn't quite clear from your question!
Output:
> result
org1 org2 org3 org4
0 0 0 0 1
1 1 0 0 0
2 1 0 0 0
3 0 0 0 1
4 1 0 0 0
5 1 0 0 0
6 1 0 0 0
7 1 0 0 0
9 1 0 0 0
10 1 0 0 1
For every element in a vector, I want to calculate the number of previous elements until I reach an element that is equal to or exceeds the value of the current element.
This is my vector:
CPC= c(25382.6, 30044.9, 22664.4, 30752.3, 21896.9, 24173.1, 29777.9, 9021.1, 8175.1, 9005.8, 5314.2, 4274.1, 3759.1, 5043.1, 5080.9, 6444.4, 6097.6, 8004.2, 6004.7, 6468.9, 5104.7, 5985.5, 8343.7, 8582, 7494.3, 6088.9, 4372.7, 4298.6, 4553.2, 5742)
I have tried something like this, which is not working
ROC = NULL #Create vector to store values
for (i in seq(1:length(CPC))
while(CPC[i])<CPC[i-1]
ROC[i] <- ifelse((CPC[i] < CPC[i-1]),1,0)
The output I am looking for is a vector (ROC) of the same length as the original one (CPC), so that every element corresponds to the element in the same place in the original vector. ROC[i] would, for CPC[i], give the number of previous elements in the CPC vector until the value is equal to or bigger than CPC[i]. For some elements, there are no previous elements that have a higher or equal value, and this would still need to be saved in the output, for example as NA.
It would look like this:
ROC
NA,NA,1,NA,1,2,3,1,1,2... etc
prs <- unlist(lapply(1:(length(CPC)),
function(x) {
less_or_eq <- CPC[1 : x] <= CPC[x]
if(all(less_or_eq))
return(0)
inds <- which(less_or_eq == 0)
return(x - inds[length(inds)])
} ))
prs
# 0 0 1 0 1 2 3 1 1 2 1 1 1 3 4 6 1 8 1 2 1 2 13
# 14 1 1 1 1 3 4
Using for loop
res <- numeric(length(CPC))
for (i in 1 : length(CPC)) {
less_or_eq <- CPC[1 : i] <= CPC[i]
if (all(less_or_eq)) {
res[i] <- 0
} else {
inds <- which(less_or_eq == 0)
res[i] <- (i - inds[length(inds)])
}
}
res
#[1] 0 0 1 0 1 2 3 1 1 2 1 1 1 3 4 6 1 8 1 2 1 2 13
#[24] 14 1 1 1 1 3 4
Your output requirement is not very clear. The main part is to find where the difference (diff) changes from negative to non-negative (meaning that the previous value is less than or equal to the current one). We create the groups based on that and count the length of each group, i.e.
tapply(CPC, cumsum(c(TRUE, diff(CPC) >= 0)), length)
# 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
# 1 2 2 1 3 4 1 1 2 2 2 1 1 5 1 1
#or take the cumulative sum of the above,
cumsum(tapply(CPC, cumsum(c(TRUE, diff(CPC) >= 0)), length))
# 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
# 1 3 5 6 9 13 14 15 17 19 21 22 23 28 29 30
I have a loop that begins when a certain condition exists. It begins when the value is below -1.0. The loop exits when the condition is over 1.0.
#Loop for long entry and exit signal
entry <- -1.0 #Input value for entry
exit <- 1.0 #Input value for exit
mydf$sig_long[[1]] = ifelse(mydf$roll.z.score.n3[[1]] < (entry), 1, 0)
for (i in 2:nrow(mydf)){
mydf$sig_long[i] = ifelse(mydf$roll.z.score.n3[i] < (entry), 1,
ifelse(mydf$roll.z.score.n3[i] > (exit), 0,
mydf$sig_long[i-1]))
}
I wanted to learn how to use the same loop, but instead of exiting on values > 1.0, exit after a set number of lines. If, for example, I set the nline variable to 5, it would loop and print 1 for a total of 5 lines after the initial entry. An example data frame below:
roll.z.score.n3
1 0
2 0
3 0.651651537
4 -1.153593891
5 -0.926552368
6 -0.369016769
7 0.65405595
8 -1.139305279
9 0.358231351
10 1.135314685
11 0.944997472
12 -0.293105191
13 -1.146659778
14 -0.66246734
15 -1.131901741
16 -0.600480649
17 -1.152333435
18 1.1025176
19 -0.144684006
20 -0.678000883
21 -1.146875039
22 -1.132235788
23 0.115583229
24 0.645489447
25 1.148754398
26 0.988193418
27 -0.818892395
After the script has run I would use Lag() to shift the sig_long column forward one line.
new.df$sig_long <- Lag(new.df$sig_long,1)
Code Testing
OK, so I see the issue. As we are specifying the first part of the loop to enter and exit between -1.0 and 1.0, this prints 1's between those two criteria in the new.df$sig_long column. The issue arises if I set nlines to 5: the entry/exit might exit the trade after only 3 lines in the sig_long column. If that is the case, rollsum has no 1's left to count in that column; even if I want to hold for 5 lines, if there are only three 1's from entry, it's not going to be able to compute the 5-line hold time. Perhaps we can use the first part of the loop like this in order to print a 1 at the entry condition:
new.df$sig_long <- ifelse(new.df$roll.z.score.n3 < -1.0 , 1, 0) #Set 1 at entry criteria
That sets a 1 at each < -1.0 value. Next would be how to count +5 lines from that point forward: counting 1, 2, 3, 4, 5, else 0; 0 until the next +1; then count 1, 2, 3, 4, 5 again, else 0; 0 until the next +1, and so on.
I can see the logic: if new.df$sig_long == 1, count until nlines = 5, else 0, repeat.
Then the 2nd part of the loop would work, I think, for the new_sig_long column.
Made many changes. This should now work:
zz = '
roll.z.score.n3 sig_long
1 NA 0
2 NA 0
3 -1.135974424 0
4 0.193311168 1
5 0.714285714 1
6 -1.148753543 1
7 -0.942160394 1
8 -0.695763683 1
9 1.140646864 1
10 0.985196899 0
11 -0.768766574 0
12 -1.011293858 0
13 -0.516703612 1
14 -1.120897077 1
15 1.091089451 1
16 0.968364052 0
17 0.872871561 0
18 1.099524999 0
19 0.918397948 0
'
df <- read.table(text = zz, header = TRUE)
df = na.omit(df)
df$sig_long[[1]] = ifelse(df$roll.z.score.n3[[1]] < (-1), 1, 0)
for (i in 2:nrow(df)){
df$sig_long[i] = ifelse(df$roll.z.score.n3[i] < (-1), 1,
ifelse(df$roll.z.score.n3[i] > 1, 0,
df$sig_long[i-1]))
}
df$sig_long <- Lag(df$sig_long, 1)
colnames(df)[2] <- "sig_long"
df = na.omit(df)
nlines = 5
df$rollsum = rollsumr(df$sig_long, k = nlines, fill = 0) # rollsumr() is from zoo
colnames(df)[3] <- "rollsum"
df$new_sig_long[[1]] = ifelse(df$roll.z.score.n3[[1]] < (-1), 1, 0)
for (i in 2:nrow(df)){
df$new_sig_long[i] = ifelse(df$roll.z.score.n3[i] < (-1) & df$rollsum[i] < 5, 1,
ifelse(df$roll.z.score.n3[i] > 1 | df$rollsum[i] >= 5, 0,
df$sig_long[i-1]))
}
df
Output:
> df
roll.z.score.n3 sig_long rollsum new_sig_long
# 4 0.1933112 1 0 0
# 5 0.7142857 1 0 1
# 6 -1.1487535 1 0 1
# 7 -0.9421604 1 0 1
# 8 -0.6957637 1 5 0
# 9 1.1406469 1 5 0
# 10 0.9851969 0 4 1
# 11 -0.7687666 0 3 0
# 12 -1.0112939 0 2 1
# 13 -0.5167036 1 2 0
# 14 -1.1208971 1 2 1
# 15 1.0910895 1 3 0
# 16 0.9683641 0 3 1
# 17 0.8728716 0 3 0
# 18 1.0995250 0 2 0
# 19 0.9183979 0 1 0
Then you can take the lag of the new_sig_long if you wish.
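For reference, here is rollsumr() (from zoo) in isolation, since its right-alignment is what makes the hold-time count work: each value is the sum of the current signal and the previous k - 1 signals.

```r
library(zoo)

sig <- c(1, 1, 1, 0, 1, 1)
rollsumr(sig, k = 3, fill = 0)  # right-aligned rolling sum, padded with 0 at the start
# 0 0 3 2 2 2
```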
I have a time series (or simply a vector) that is binary, returning 0's or 1's depending on some condition (generated with ifelse). I would like to be able to return the counts (in this case the series is daily, so days) in between the 1's.
I can do this very easily in Excel, by simply referencing the column I am trying to calculate and adding the row above (if working with ascending data; or the row below if descending). See below.
I tried doing something similar in R but I am getting an error.
DaysBetweenCondition1 = as.numeric(ifelse((Condition1 ==0 ),0,lag(DaysBetweenCondition1)+1))
Is there an easier way to do this besides writing a function?
Row# Date Condition1 DaysBetweenCondition1
1 5/2/2007 NA NA
2 5/3/2007 NA NA
3 5/4/2007 NA NA
4 5/5/2007 NA NA
5 5/6/2007 0 NA
6 5/7/2007 0 NA
7 5/8/2007 0 NA
8 5/9/2007 0 NA
9 5/10/2007 0 NA
10 5/11/2007 0 NA
11 5/12/2007 0 NA
12 5/13/2007 0 NA
13 5/14/2007 1 0
14 5/15/2007 0 1
15 5/16/2007 0 2
16 5/17/2007 0 3
17 5/18/2007 0 4
18 5/19/2007 0 5
19 5/20/2007 0 6
20 5/21/2007 0 7
21 5/22/2007 1 0
22 5/23/2007 0 1
23 5/24/2007 0 2
24 5/25/2007 0 3
25 5/26/2007 0 4
26 5/27/2007 1 0
27 5/28/2007 0 1
28 5/29/2007 0 2
29 5/30/2007 1 0
(FWIW, the dates in this example are made up; in the real data I am using business days, so a bit different, and I don't want to reference them, they are just put in for clarity.)
This gets the counting done in one line. Borrowing PhiSeu's code and a line from "How to reset cumsum at end of consecutive string", and modifying it to count zeros:
# Example
df_date <- cbind.data.frame(c(1:20),
c(rep("18/08/2016",times=20)),
c(rep(NA,times=5),0,1,0,0,1,0,0,0,0,1,1,0,1,0,0)
,stringsAsFactors=FALSE)
colnames(df_date) <- c("Row#","Date","Condition1")
# add the new column with 0 as default value
DaysBetweenCondition1 <- c(rep(0,nrow(df_date)))
# bind column to dataframe
df_date <- cbind(df_date,DaysBetweenCondition1)
df_date$DaysBetweenCondition1<-sequence(rle(!df_date$Condition1)$lengths) * !df_date$Condition1
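To see why that one-liner works, here it is on a tiny toy vector: rle(!x) finds the runs of zeros, sequence() numbers each run from 1, and multiplying by !x zeroes out the positions where the condition is 1:

```r
x <- c(1, 0, 0, 0, 1, 0, 0, 1)
sequence(rle(!x)$lengths) * !x
# 0 1 2 3 0 1 2 0  -- the count restarts after every 1
```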
R is very good when working with rows that don't depend on each other, which is why so many of its functions are vectorized. Working with functions that depend on the values of other rows is not so easy.
At the moment I can only provide a solution using a loop; I assume there is a better solution without one.
# Example
df_date <- cbind.data.frame(c(1:20),
c(rep("18/08/2016",times=20)),
c(rep(NA,times=5),0,1,0,0,1,0,0,0,0,1,1,0,1,0,0)
,stringsAsFactors=FALSE)
colnames(df_date) <- c("Row#","Date","Condition1")
# add the new column with 0 as default value
DaysBetweenCondition1 <- c(rep(0,nrow(df_date)))
# bind column to dataframe
df_date <- cbind(df_date,DaysBetweenCondition1)
# loop over rows
for(i in 1:nrow(df_date)){
if(is.na(df_date$Condition1[i])) {
df_date$DaysBetweenCondition1[i] <- NA
} else if(df_date$Condition1[i]==0 & is.na(df_date$Condition1[i-1])) {
df_date$DaysBetweenCondition1[i] <- NA
} else if(df_date$Condition1[i]==0) {
df_date$DaysBetweenCondition1[i] <- df_date$DaysBetweenCondition1[i-1]+1
} else {
df_date$DaysBetweenCondition1[i] <- 0
}
}
Here's a solution that should be relatively fast
f0 = function(x) {
y = x # template for return value
isna = is.na(x) # used a couple of times
grp = cumsum(x[!isna]) # use '1' to mark start of each group
lag = lapply(tabulate(grp + 1), function(len) {
seq(0, length.out=len) # sequence from 0 to len-1
})
split(y[!isna], grp) <- lag # split y, set to lag element, unsplit
data.frame(x, y)
}
A faster version avoids the lapply() loop; it creates a vector along x (seq_along(x)) and an offset vector describing how that vector should be corrected, based on where each group starts in the original vector
f1 = function(x0) {
    y0 = x0
    x = x0[!is.na(x0)]
    y = seq_along(x)
    offset = rep(c(1, y[x == 1]), tabulate(cumsum(x) + 1))
    y0[!is.na(y0)] = y - offset
    data.frame(x0, y0)
}
Walking through the first solution, here's some data
> set.seed(123)
> x = c(rep(NA, 5), rbinom(30, 1, .15))
> x
[1] NA NA NA NA NA 0 0 0 1 1 0 0 1 0 0 1 0 0 0 0 1 0 0 0 1
[26] 1 0 0 1 0 0 0 0 0 0
use cumsum() to figure out the group the non-NA data belong to
> isna = is.na(x)
> grp = cumsum(x[!isna])
> grp
[1] 0 0 0 1 2 2 2 3 3 3 4 4 4 4 4 5 5 5 5 6 7 7 7 8 8 8 8 8 8 8
use tabulate() to figure out the number of elements in each group, lapply() to generate the relevant sequences
> lag = lapply(tabulate(grp + 1), function(len) seq(0, length.out=len))
finally, create a vector to hold the result, and use split<- to update with the lag
> y = x
> split(y[!isna], grp) <- lag
> data.frame(x, y)
x y
1 NA NA
2 NA NA
3 NA NA
4 NA NA
5 NA NA
6 0 0
7 0 1
8 0 2
9 1 0
10 1 0
11 0 1
12 0 2
13 1 0
14 0 1
15 0 2
16 1 0
17 0 1
...
The key to the second solution is the calculation of the offset. The goal is to be able to 'correct' y = seq_along(x) by the value of y at the most recent 1 in x, kind of like 'fill down' in Excel. The starting values are c(1, y[x==1]) and each needs to be replicated by the number of elements in the group tabulate(cumsum(x) + 1).
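Stepping through that offset calculation on a short NA-free vector (illustration only):

```r
x <- c(0, 0, 1, 0, 0, 1, 0)
y <- seq_along(x)                               # 1 2 3 4 5 6 7
starts <- c(1, y[x == 1])                       # 1 3 6: the anchor position of each group
offset <- rep(starts, tabulate(cumsum(x) + 1))  # 1 1 3 3 3 6 6
y - offset                                      # 0 1 0 1 2 0 1: resets at each 1
```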