Comparing changes across two matrices - r

I'm performing some biogeographic analyses in R and the result is encoded as a pair of matrices. Columns represent geographic regions, rows indicate nodes in a phylogenetic tree and values in the matrix are the probability that the branching event occurred in the geographic region indicated by the column. A very simple example would be:
> One_node<-matrix(c(0,0.8,0.2,0),
+ nrow=1, ncol=4,
+ dimnames = list(c("node_1"),
+ c("A","B","C","D")))
> One_node
A B C D
node_1 0 0.8 0.2 0
In this case, the most probable location for node_1 is region B. In reality, the output of the analysis is encoded as two separate 79x123 matrices. The first holds the probabilities of a node occupying a given region before an event, and the second holds the probabilities of a node occupying a given region after an event (each row sums to 1). Some slightly more complicated examples:
before<-matrix(c(0,0,0,0,0.9,
0.8,0.2,0.6,0.4,0.07,
0.2,0.8,0.4,0.6,0.03,
0,0,0,0,0),
nrow=5, ncol=4,
dimnames = list(c("node_1","node_2","node_3","node_4","node_5"),
c("A","B","C","D")))
after<-matrix(c(0,0,0,0,0.9,
0.2,0.8,0.4,0.6,0.03,
0.8,0.2,0.6,0.4,0.07,
0,0,0,0,0),
nrow=5, ncol=4,
dimnames = list(c("node_1","node_2","node_3","node_4","node_5"),
c("A","B","C","D")))
> before
A B C D
node_1 0.0 0.80 0.20 0
node_2 0.0 0.20 0.80 0
node_3 0.0 0.60 0.40 0
node_4 0.0 0.40 0.60 0
node_5 0.9 0.07 0.03 0
> after
A B C D
node_1 0.0 0.20 0.80 0
node_2 0.0 0.80 0.20 0
node_3 0.0 0.40 0.60 0
node_4 0.0 0.60 0.40 0
node_5 0.9 0.03 0.07 0
Specifically, I'm only interested in extracting the row numbers where column B is the highest value in before and column C is the highest in after, and vice versa, since I'm trying to extract node numbers in the tree where taxa have moved B->C or C->B.
So the output I'm looking for would be something like:
> BC
[1] 1 3
> CB
[1] 2 4
There will be rows where B>C or C>B but where neither is the highest value in the row (node_5), and I need to ignore these. The row numbers are then used to query a separate data frame that provides the data I want.
I hope this all makes sense. Thanks in advance for any advice!

You could do something like this...
maxBefore <- apply(before, 1, which.max) #find highest columns before (by row)
maxAfter <- apply(after, 1, which.max) #and highest columns after
BC <- which(maxBefore==2 & maxAfter==3) #rows with B highest before, C after
CB <- which(maxBefore==3 & maxAfter==2) #rows with C highest before, B after
BC
node_1 node_3
1 3
CB
node_2 node_4
2 4
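If you need this for other region pairs as well, the same idea can be wrapped up by matching on column names instead of hard-coded positions. This is just a sketch built on the before/after matrices above; region_change is a hypothetical helper name, not part of the original answer.
region_change <- function(from, to, before, after) {
  maxBefore <- colnames(before)[apply(before, 1, which.max)]  # most probable region per node, before
  maxAfter  <- colnames(after)[apply(after, 1, which.max)]    # most probable region per node, after
  which(maxBefore == from & maxAfter == to)                   # rows that switch from -> to
}
region_change("B", "C", before, after)  # 1 3
region_change("C", "B", before, after)  # 2 4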


Backtesting stock returns for Moving Average rule in r

I am trying to backtest stock returns given a 10 month moving average rule. The rule being, if the price is above the 10mnth average - buy, if it is below the 10mnth average - hold the value constant.
I know how to do this in excel very easily, but I am having trouble in R.
Below is my approach in R:
#Download financial data
library(Quandl)
library(xts)   # needed for rollapply() and lag.xts() below (attaches zoo)
SPY <- Quandl("YAHOO/INDEX_GSPC", type = "xts", collapse = "monthly")
head(SPY)
#Calculate log returns
SPY$log_ret <- diff(log(SPY$Close))
#Calculate moving average for Closing price
SPY$MA.10 <- rollapply(SPY$Close, width = 10, FUN = mean)
#Create binary rule to determine when to buy and when to hold
#1 = Buy
SPY$Action <- ifelse(SPY$MA.10 < SPY$Close, 1, 0)
#Create default value in a new column to backtest returns
SPY$Hit <- 100
#Calculate cumulative returns
SPY$Hit <-ifelse(SPY$Action == 1, SPY[2:n, "Hit"] *
(1 + SPY$log_ret), lag.xts(SPY$Hit, k=1))
Returns do get calculated correctly when Action is 1, but when Action is not 1, I find that SPY$Hit only lags one period and then defaults back to the 100 value, while I would like it to hold the value from the last time Action == 1.
This formula works very well in MS Excel and is very easy to implement, but it seems that the issue in R is that I cannot keep the value constant since the last Action == 1. How can I do this so that I can see how well this simple trading strategy would work?
Please let me know if I can clarify this further, thank you.
Sample of the desired output:
Action Return Answer
[1,] 0 0.00 100.00000
[2,] 1 0.09 109.00000
[3,] 1 0.08 117.72000
[4,] 1 -0.05 111.83400
[5,] 1 -0.03 108.47898
[6,] 0 -0.02 108.47898
[7,] 0 0.01 108.47898
[8,] 0 0.06 108.47898
[9,] 1 -0.03 105.22461
[10,] 0 0.10 105.22461
[11,] 1 -0.05 99.96338
Here's my guess, let me know what you think.
# Looping
Hit <- matrix(100, nrow = nrow(SPY))
for (row in 11:nrow(SPY)) {  # 11 since you have NA's from your moving average
  if (SPY$Action[row] == 1) {
    Hit[row] = Hit[row-1] * (1 + SPY$log_ret[row])  # here we needed row-1
  } else {
    Hit[row] = Hit[row-1]
  }
}
SPY$Hit <- Hit
cbind(SPY$Action, SPY$Hit)
For your sample:
x <- data.frame(Action = c(0,1,1,1,1,0,0,0,1,0,1))
x$Return <- c(0,0.09,0.08,-0.05,-0.03,-0.02,0.01,0.06,-0.03,0.10,-0.05)
x$Answer <- matrix(100, nrow = nrow(x))
for (row in 2:nrow(x)) {  # start at 2 so row 1 keeps the initial value of 100
  if (x$Action[row] == 1) {
    x$Answer[row] = x$Answer[row-1] * (1 + x$Return[row])
  } else {
    x$Answer[row] = x$Answer[row-1]
  }
}
x
Action Return Answer
1 0 0.00 100.00000
2 1 0.09 109.00000
3 1 0.08 117.72000
4 1 -0.05 111.83400
5 1 -0.03 108.47898
6 0 -0.02 108.47898
7 0 0.01 108.47898
8 0 0.06 108.47898
9 1 -0.03 105.22461
10 0 0.10 105.22461
11 1 -0.05 99.96338
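The same hold-or-compound logic can also be vectorized with cumprod; a sketch assuming plain numeric Action and Return vectors with no NAs, as in the sample data frame x above (for the xts version you would first replace the leading NAs in log_ret and Action with 0):
x$Answer2 <- 100 * cumprod(ifelse(x$Action == 1, 1 + x$Return, 1))  # compound on 1, hold on 0
all.equal(as.numeric(x$Answer), x$Answer2)  # should be TRUE for the sample data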
In Excel there are two ways to do this:
1. On the Data tab, open Data Analysis and choose Moving Average. In the dialog box enter the input data range, the interval (10 in your case) and the output cell address. Next to the result, write this formula:
=IF(A2>B2, "Buy", "Hold")
where A2 holds the price and B2 holds the 10-month moving average value.
2. Alternatively, anywhere in the sheet number cells 1 to 10 horizontally (the month number), put each month's value in the row below, calculate the 10-month average in the row below that, and finally use the formula above to decide Buy or Hold.

Merging dataframes with all.equal on numeric(float) keys?

I have two data frames I want to merge based on a numeric value, however I am having trouble with floating point accuracy. Example:
> df1 <- data.frame(number = 0.1 + seq(0.01,0.1,0.01), letters = letters[1:10])
> df2 <- data.frame(number = seq(0.11,0.2,0.01), LETTERS = LETTERS[1:10])
> (merged <- merge(df1, df2, by = "number", all = TRUE))
number letters LETTERS
1 0.11 a A
2 0.12 <NA> B
3 0.12 b <NA>
4 0.13 c C
5 0.14 d D
6 0.15 <NA> E
7 0.15 e <NA>
8 0.16 f F
9 0.17 g G
10 0.18 h H
11 0.19 i I
12 0.20 j J
Some of the values (0.12 and 0.15) don't match up due to floating point accuracy issues as discussed in this post. The solution for finding equality there was the use of the all.equal function to remove floating point artifacts, however I don't believe there is a way to do this within the merge function.
Currently I get around it by forcing one of the number columns to a character and then back to a number after the merge, but this is a little clunky; does anyone have a better solution for this problem?
> df1c <- df1
> df1c[["number"]] <- as.character(df1c[["number"]])
> merged2 <- merge(df1c, df2, by = "number", all = TRUE)
> merged2[["number"]] <- as.numeric(merged2[["number"]])
> merged2
number letters LETTERS
1 0.11 a A
2 0.12 b B
3 0.13 c C
4 0.14 d D
5 0.15 e E
6 0.16 f F
7 0.17 g G
8 0.18 h H
9 0.19 i I
10 0.20 j J
EDIT: A little more about the data
I wanted to keep my question general to make it more applicable to other people's problems, but it seems I may need to be more specific to get an answer.
It is likely that all of the issues with merging will be due to floating point inaccuracy, but it may be a little hard to be sure. The data comes in as a series of time series values, a start time, and a frequency. These are then turned into a time series (ts) object, and a number of functions are called to extract features from the time series (one of which is the time value), which are returned as a data frame. Meanwhile another set of functions is being called to get other features from the time series as targets. There are also potentially other series getting features generated to complement the original series. These values then have to be reunited using the time value.
Can't store as POSIXct: Each of these processes (feature extraction, target computation, merging) has to be able to occur independently and be stored in a CSV type format so it can be passed to other platforms. Storing as a POSIXct value would be difficult since the series aren't necessarily stored in calendar times.
Round to the level of precision that will allow the number to be equal.
> df1$number=round(df1$number,2)
> df2$number=round(df2$number,2)
>
> (merged <- merge(df1, df2, by = "number", all = TRUE))
number letters LETTERS
1 0.11 a A
2 0.12 b B
3 0.13 c C
4 0.14 d D
5 0.15 e E
6 0.16 f F
7 0.17 g G
8 0.18 h H
9 0.19 i I
10 0.20 j J
If you need to choose the level of precision programmatically then you should tell us more about the data and whether we can perhaps assume that it will always be due to floating point inaccuracy. If so, then rounding to 10 decimal places should be fine. The all.equal function uses a tolerance of sqrt(.Machine$double.eps), which in usual practice is about 1.5e-8, i.e. roughly comparable to round(..., 8).
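If you want to bake the rounding into the merge itself, one option is a small wrapper that rounds both key columns before merging; merge_rounded is a hypothetical helper written for this answer, not an existing function, and digits = 10 follows the suggestion above.
merge_rounded <- function(x, y, by, digits = 10, ...) {
  x[[by]] <- round(x[[by]], digits)  # strip floating point noise from both keys
  y[[by]] <- round(y[[by]], digits)
  merge(x, y, by = by, ...)
}
merge_rounded(df1, df2, by = "number", all = TRUE)  # matches the rounded merge shown above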

R inference from one matrix to a data frame

I think this may be a very simple and easy question, but since I'm new to R, I hope someone can give me an outline of how to solve it step by step. Thanks!
So the question is: I have an (n * 2) matrix (say m) where the first column represents the index of the data in another data frame (say d) and the second column represents some value (a p-value).
What I want to do is: if the p-value of some row r in m is less than 0.05, I will plot the data in d at the index indicated in the first column of row r of matrix m.
The data is somewhat like what I draw below:
m:
ind p_value
2 0.02
23 0.03
56 0.12
64 0.54
105 0.04
d:
gene_id s1 s2 s3 s4 ... sn
IDH1 0.23 3.01 0 0.54 ... 4.02
IDH2 0.67 0 8.02 10.54 ... 0.72
...
So IDH2 corresponds to the first row in m, whose index column is 2.
This works:
toplot <- d[ m[ m[, 'p_value'] < .05, 'ind'], ]
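Unpacked into steps, assuming m is a numeric matrix with columns 'ind' and 'p_value' and d is the data frame of gene values, the same thing reads:
keep   <- m[, "p_value"] < 0.05   # logical: which rows of m are significant (rows 1, 2 and 5 here)
idx    <- m[keep, "ind"]          # the row indices of d stored in those rows (2, 23, 105)
toplot <- d[idx, ]                # subset d to just those rows, ready for plotting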

Grouping consecutive integers in r and performing analysis on groups

I have a data frame in which I would like to group the integer values based on whether they are consecutive or not, and then find the difference between the maximum and minimum value of each group.
Example of data:
x Integers
0.1 14
0.05 15
2.7 17
0.07 19
3.4 20
0.05 21
So Group 1 would consist of 14 and 15 and Group 2 would consist of 19,20 and 21.
The difference of each group then being 1 and 2, respectively.
I have tried the following, to first group the consecutive values, with no luck.
Breaks <- c(0, which(diff(Data$Integer) != 1), length(Data$Integer))
sapply(seq(length(Breaks) - 1),
function(i) Data$Integer[(Breaks[i] + 1):Breaks[i+1]])
Here's a solution using by():
df <- data.frame(x = c(0.1,0.05,2.7,0.07,3.4,0.05), Integers = c(14,15,17,19,20,21))
do.call(rbind, by(df, cumsum(c(0, diff(df$Integers) != 1)), function(g)
  data.frame(imin = min(g$Integers), imax = max(g$Integers), irange = diff(range(g$Integers)),
             xmin = min(g$x), xmax = max(g$x), xrange = diff(range(g$x)))))
## imin imax irange xmin xmax xrange
## 0 14 15 1 0.05 0.1 0.05
## 1 17 17 0 2.70 2.7 0.00
## 2 19 21 2 0.05 3.4 3.35
I wasn't sure what data you wanted in the output, so I just included everything you might want.
You can filter out the middle group with subset(...,irange!=0).
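If all you need is the max-minus-min of the Integers within each run, a shorter sketch (not from the original answer, but using the same cumsum grouping trick on the df defined above) is:
grp <- cumsum(c(0, diff(df$Integers) != 1))          # run id: increments whenever the gap isn't 1
tapply(df$Integers, grp, function(v) diff(range(v)))
## 0 1 2
## 1 0 2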

Creating multiple dummies from an existing data frame or data table

I am looking for a quick extension to the following solution posted here. In it Frank shows that for an example data table
test <- data.table("index"=rep(letters[1:10],100),"var1"=rnorm(1000,0,1))
You can quickly make dummies by using the following code
inds <- unique(test$index) ; test[,(inds):=lapply(inds,function(x)index==x)]
Now I want to extend this solution for a data.table that has multiple rows of indices, e.g.
new <- data.table("id" = rep(c("Jan","James","Dirk","Harry","Cindy","Leslie","John","Frank"), 125),
                  "index1" = rep(letters[1:5], 200),
                  "index2" = rep(letters[6:15], 100),
                  "index3" = rep(letters[16:19], 250))
I need to do this for many dummies and ideally the solution would allow me to get 4 things:
The total count of every index
The mean times every index occurs
The count of every index per id
The mean of every index per id
In my real case, the indices are named differently so the solution would need to be able to loop through the column names I think.
Thanks
Simon
If you only need the four items in that list, you should just tabulate:
indcols <- paste0('index',1:3)
lapply(new[,indcols,with=FALSE],table) # counts
lapply(new[,indcols,with=FALSE],function(x)prop.table(table(x))) # means
# or...
lapply(
  new[, indcols, with=FALSE],
  function(x){
    z <- table(x)
    rbind(count = z, mean = prop.table(z))
  })
This gives
$index1
a b c d e
count 200.0 200.0 200.0 200.0 200.0
mean 0.2 0.2 0.2 0.2 0.2
$index2
f g h i j k l m n o
count 100.0 100.0 100.0 100.0 100.0 100.0 100.0 100.0 100.0 100.0
mean 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1
$index3
p q r s
count 250.00 250.00 250.00 250.00
mean 0.25 0.25 0.25 0.25
The previous approach would work on a data.frame or a data.table, but is rather complicated. With a data.table, one can use the melt syntax:
melt(new, id="id")[, .(
  N = .N,
  mean = .N/nrow(new)
), by = .(variable, value)]
which gives
variable value N mean
1: index1 a 200 0.20
2: index1 b 200 0.20
3: index1 c 200 0.20
4: index1 d 200 0.20
5: index1 e 200 0.20
6: index2 f 100 0.10
7: index2 g 100 0.10
8: index2 h 100 0.10
9: index2 i 100 0.10
10: index2 j 100 0.10
11: index2 k 100 0.10
12: index2 l 100 0.10
13: index2 m 100 0.10
14: index2 n 100 0.10
15: index2 o 100 0.10
16: index3 p 250 0.25
17: index3 q 250 0.25
18: index3 r 250 0.25
19: index3 s 250 0.25
This approach was mentioned by @Arun in a comment (and implemented by him also, I think?). To see how it works, first have a look at melt(new, id="id"), which transforms the original data.table.
As mentioned in the comments, melting a data.table requires installing and loading reshape2 for some versions of the data.table package.
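A quick sketch of that inspection step, assuming the example new data.table from the question (on older data.table versions you may also need library(reshape2) as noted):
library(data.table)
long <- melt(new, id = "id")          # columns: id, variable (index1/2/3) and value (the letter)
head(long)
long[, .N, by = .(variable, value)]   # same counts as in the result above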
If you also need the dummies, they can be made in a loop as in the linked question:
newcols <- list()
for (i in indcols){
  vals = unique(new[[i]])
  newcols[[i]] = paste(vals, i, sep='_')
  new[, (newcols[[i]]) := lapply(vals, function(x) get(i) == x)]
}
This stores the groups of columns associated with each variable in newcols for convenience. If you wanted to do the tabulation just with these dummies (instead of the underlying variables as in solution above), you could do
lapply(
  indcols,
  function(i) new[, lapply(.SD, function(x){
    z <- sum(x)
    list(z, z/.N)
  }), .SDcols = newcols[[i]] ])
which gives a similar result. I just wrote it this way to illustrate how data.table syntax can be used. You could again avoid square brackets and .SD here:
lapply(
  indcols,
  function(i) sapply(
    new[, newcols[[i]], with=FALSE],
    function(x){
      z <- sum(x)
      rbind(z, z/length(x))
    }))
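As a quick sanity check that the dummy columns line up with the underlying variables, here's a sketch assuming the example new data.table and the newcols list built above (column names follow the paste(vals, i, sep='_') pattern, e.g. a_index1):
new[1:3, c("id", newcols[["index1"]]), with = FALSE]     # id alongside the index1 dummies
sapply(new[, newcols[["index1"]], with = FALSE], sum)    # per-dummy counts, should match table(new$index1)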
But anyway: just use table if you can hold onto the underlying variables.
