Despite using two complete columns where every element is numeric and no numbers are missing for rows 2 thru 570, I find it impossible to get a result other than NA when using a loop to find a rolling 24-week correlation between the two columns.
rolling.correlation <- NULL
temp <- NULL
for(i in 2:547){
temp <- cor(training.set$return.SPY[i:i+23],training.set$return.TLT[i:i+23])
rolling.correlation <- c(rolling.correlation, temp)
} #end "for" loop
rolling.correlation
The cor() command works fine for [2:25], [i:25], or [2:i], but R doesn't understand when I say [i:i+23].
I want R to calculate a correlation for rows 2 thru 25, then 3 thru 26, ..., 547 thru 570. The result should be a vector of length 546 which has numeric values for each correlation. Instead I'm getting a vector of 546 NAs. How can I fix this? Thanks for your help.
Look what happens when you do
5:5+2
# [1] 7
Note that : has a higher operator precedence than + which means 5:5+2 is the same as (5:5)+2 when you really want 5:(5+2). Use
i:(i+23)
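To see the fix in context, here is a minimal sketch of the loop with the parenthesised index. The data frame below is a made-up stand-in for training.set (random numbers, 570 rows), so only the indexing pattern carries over to the real data. With i:i+23 each call to cor() receives a single element per column, which is why every result was NA.

```r
set.seed(1)
# Hypothetical stand-in for training.set: 570 rows of random "returns"
training.set <- data.frame(return.SPY = rnorm(570),
                           return.TLT = rnorm(570))

starts <- 2:547
rolling.correlation <- numeric(length(starts))  # preallocate the result
for (k in seq_along(starts)) {
  i <- starts[k]
  rows <- i:(i + 23)  # parentheses give the full 24-row window
  rolling.correlation[k] <- cor(training.set$return.SPY[rows],
                                training.set$return.TLT[rows])
}
length(rolling.correlation)  # one value per window
anyNA(rolling.correlation)   # no NAs for complete numeric input
```

Preallocating the vector instead of growing it with c() is optional here, but it avoids repeated copying.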
So I am working in R with a matrix as follows:
diff_0
             SubPop0-1,   SubPop1-1,   SubPop2-1,    SubPop3-1, SubPop4-1,
SubPop0-1,          NA           NA          NA            NA         NA
SubPop1-1, 0.003403100           NA          NA            NA         NA
SubPop2-1, 0.005481177 -0.002070277          NA            NA         NA
SubPop3-1, 0.002216444  0.005946314 0.001770977            NA         NA
SubPop4-1, 0.010344845  0.007151529 0.004237316 -0.0021275130         NA
... but bigger ;-).
This is a matrix of pairwise genetic differentiation between each SubPop from 0 to 4. I would like to obtain a mean differentiation value for each SubPop.
For instance, for SubPop-0, the mean would just correspond to the mean of the 4 values from column 1. However, for SubPop-2, this would be the mean of the 2 values in row 3 and the 2 values in column 3, since this is a half (lower-triangular) matrix.
I wanted to write a for loop to compute each mean value for each SubPop, taking this into account. I tried the following:
Mean <- for (r in 1:nrow(diff_0)) {
mean(apply(cbind(diff_0[r,], diff_0[,r]), 1, sum, na.rm=T))
}
First, this isolates the row and the column of index [r], whose values refer to the same SubPop r. 'sum' gathers these values and eliminates the 'NA's. Finally, I get the mean value for my SubPop r. I was hoping my for loop would give me a value for each index r, which would be a SubPop.
However, even though mean(apply(cbind(diff_0[r,], diff_0[,r]), 1, sum, na.rm=T)), if run alone with a fixed r value between 1 and 5, does give me what I want, the 'for loop' itself only returns an empty vector.
Something like for (r in 1:nrow(diff_0)) { print(diff_0[r,1]) } also works, so I do not understand what is going on.
This is a trivial question but I could not find an answer on the internet! Although I know I am probably missing the obvious :-)...
Thank you very much,
Cheers!
Okay, based on what you want to do (and if I understand everything correctly) there are several ways of doing this.
The one that comes to my mind now is just making your lower triangular matrix to an "entire matrix" (i.e. fill the upper triangle with the transpose of the lower triangle) and then do row- or column-wise means
My R is running right now on something else, so I can't check my code but this should work
diff = diff_0
diff[upper.tri(diff)] = t(diff_0)[upper.tri(diff_0)] #Mirror the lower triangle into the upper triangle
As I said, my R is running right now so I can't check the correctness of the last line - I might be confused with some transposes there, so I'd appreciate any feedback on whether it actually worked or not.
You can then either set the diagonal values to 0 or alternatively add na.rm = TRUE to the mean statement
mean_diffs = apply(diff,2,FUN = function(x)mean(x, na.rm = TRUE))
that should work
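For anyone who wants to check this approach, here is a small sketch that rebuilds the 5 x 5 example from the question, mirrors the lower triangle, and takes column means. The labels are shortened to SubPop0..SubPop4 and the name full is only illustrative; transposing first and then picking the upper triangle is one way to keep each value paired with its mirror position.

```r
labs <- paste0("SubPop", 0:4)
diff_0 <- matrix(NA_real_, 5, 5, dimnames = list(labs, labs))
# Fill the lower triangle column by column with the values from the question
diff_0[lower.tri(diff_0)] <- c(
  0.003403100, 0.005481177, 0.002216444, 0.010344845,  # column 1
  -0.002070277, 0.005946314, 0.007151529,              # column 2
  0.001770977, 0.004237316,                            # column 3
  -0.0021275130)                                       # column 4

full <- diff_0
# t(diff_0)[i, j] equals diff_0[j, i], so this mirrors the lower triangle
full[upper.tri(full)] <- t(diff_0)[upper.tri(diff_0)]

# The diagonal stays NA and is skipped by na.rm = TRUE
mean_diffs <- colMeans(full, na.rm = TRUE)
mean_diffs
```

colMeans() is used here in place of apply(..., 2, mean); the two should give the same values for this matrix.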
Also: Yes, your code does not work, because the assignment is not in the for loop. This should work:
means = rep(NA, nrow(diff_0))
for (r in 1:length(means)){
means[r] = mean(apply(cbind(diff_0[r,], diff_0[,r]), 1, sum, na.rm=T))
}
But in general for loops are not what you want to do in R
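As a tiny illustration of both points (with toy numbers, not the matrix from the question): a for loop itself evaluates to NULL, so assigning the loop to a name keeps nothing, whereas vapply() collects one value per iteration.

```r
# Assigning the loop itself captures nothing: `for` evaluates to NULL
res <- for (r in 1:3) r^2
is.null(res)

# vapply() returns one value per element; numeric(1) declares the
# expected type and length of each result
squares <- vapply(1:3, function(r) r^2, numeric(1))
squares
```

The same pattern should apply to the means: put the body of the loop into the function passed to vapply().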
This may be a solution...
for(i in 1:nrow(diff_0)) {
k<-mean(cbind(as.numeric(diff_0[,i]),as.numeric(diff_0[i,])),na.rm=T)
if(i==1) {
data_mean <- k
}else{
data_mean <- rbind(data_mean,k)
}
}
colnames(data_mean) <- "mean"
rownames(data_mean) <- c("SubPop0","SubPop1","SubPop2","SubPop3","SubPop4")
data_mean
mean
SubPop0 0.005361391
SubPop1 0.003607666
SubPop2 0.002354798
SubPop3 0.001951555
SubPop4 0.004901544
I've got a dataframe that I need to split based on the values in one of the columns - most of them are either 0 or 1, but a couple are NA, which I can't get to form a subset. This is what I've done:
all <- read.csv("XXX.csv")
splitted <- split(all, all$case_con)
dim(splitted[[1]]) #--> gives me 185
dim(splitted[[2]]) #--> gives me 180
but all contained 403 rows, which means that 38 NA values were left out and I don't know how to form a similar subset to the ones above with them. Any suggestions?
Try this:
splitted <- c(split(all, all$case_con), list(subset(all, is.na(case_con))))
This should tack on the data frame subset with the NAs as the last one in the list...
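A quick sketch of that idea on a made-up six-row data frame (the real XXX.csv is not available, so the id column and the values are assumptions). The element name "missing" is only illustrative.

```r
# Hypothetical stand-in for the CSV: two 0s, two 1s, two NAs
all <- data.frame(id = 1:6, case_con = c(0, 1, NA, 1, 0, NA))

# split() drops rows whose grouping value is NA, so append them by hand
splitted <- c(split(all, all$case_con),
              list(missing = all[is.na(all$case_con), ]))

names(splitted)
sapply(splitted, nrow)  # the three pieces together cover every row
```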
list(split(all, all$case_con), split(all, is.na(all$case_con)))
I think this would work. Thanks.
I'm looking for a way to exclude a number of answers from a length function.
This is a follow-on question from "Getting R Frequency counts for all possible answers". In SQL the syntax could be
select * from someTable
where variableName not in ( 0, null )
Given
Id <- c(1,2,3,4,5)
ClassA <- c(1,NA,3,1,1)
ClassB <- c(2,1,1,3,3)
R <- c(5,5,7,NA,9)
S <- c(3,7,NA,9,5)
df <- data.frame(Id,ClassA,ClassB,R,S)
ZeroTenNAScale <- c(0:10,NA);
R.freq = setNames(nm=c('R','freq'),data.frame(table(factor(df$R,levels=ZeroTenNAScale,exclude=NULL))));
S.freq = setNames(nm=c('S','freq'),data.frame(table(factor(df$S,levels=ZeroTenNAScale,exclude=NULL))));
length(S.freq$freq[S.freq$freq!=0])
# 5
How would I change
length(S.freq$freq[S.freq$freq!=0])
to get an answer of 4 by excluding 0 and NA?
We can use colSums,
colSums(!is.na(S.freq)[S.freq$freq!=0,])[[1]]
#[1] 4
You can use sum to add up the frequencies. If the NAs were in the column being summed, you could pass na.rm = TRUE; however, because the NA is located in a different column, you first need to remove the row containing it.
Our solution is as follows, we remove the rows containing NA by subsetting S.freq[!is.na(S.freq$S),], but we also need the second column freq:
sum(S.freq[!is.na(S.freq$S), "freq"])
# 4
You can try na.omit (to remove NAs) and subset (to get rid of all lines with freq equal to 0):
subset(na.omit(S.freq), freq != 0)
S freq
4 3 1
6 5 1
8 7 1
10 9 1
From here, that's straightforward:
length(subset(na.omit(S.freq), freq != 0)$freq)
[1] 4
Does it solve your problem?
Just add !is.na(S.freq$S) as a second filter:
length(S.freq$freq[S.freq$freq!=0 & !is.na(S.freq$S)])
If you want to extend it with other conditions, you could make an index vector first for readability:
idx <- S.freq$freq!=0 & !is.na(S.freq$S)
length(S.freq$freq[idx])
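Here is a self-contained sketch of the index-vector approach. To avoid depending on how table() treats an NA level, S.freq is typed in by hand with the counts that the question's data produce (values 3, 5, 7, 9 and NA once each), so S is numeric here rather than a factor.

```r
# Hand-built equivalent of S.freq from the question
S.freq <- data.frame(S    = c(0:10, NA),
                     freq = c(0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1))

# Keep rows that were observed at least once and whose label is not NA
idx <- S.freq$freq != 0 & !is.na(S.freq$S)

length(S.freq$freq[idx])  # 4
sum(idx)                  # same count, without subsetting first
```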
You're looking for values with frequency > 0, that means you're looking for unique values. You get this information directly from vector S:
length(unique(df$S))
and leaving NA aside you get answer 4 by:
length(unique(df$S[!is.na(df$S)]))
Regarding your question on how to exclude a number of items based on their value:
In R this is easily done with logical vectors, as you already used in your code:
length(S.freq$freq[S.freq$freq!=0])
you can combine different conditions to one logical vector and use it for subsetting e.g.
length(S.freq$freq[S.freq$freq!=0 & !is.na(S.freq$S)])
I have a data frame "accdata".
dim(accdata)
[1] 6496 188
One of the variables - "VAL" is of interest to me. I must calculate the number of instances where VAL is equal to 24.
I tried a few functions that returned error messages. After some research it seems I need to remove the NA values from VAL first.
I would try something like nonaaccdaa <- na.omit(accdata) except this removes instances of NA in any variable, not just VAL.
I tried nonaval <- na.omit(accdata[accdata$VAL]) but when I then checked the number of rows using nrow the result was null. I had expected a value between 1 and 6,496.
What's up here?
This should do the trick:
sum(accdata$VAL == 24, na.rm=TRUE)
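To see why this works without removing any rows, here is a small sketch on a made-up vector (the values are assumptions, not taken from accdata). Comparing NA with 24 yields NA, and na.rm = TRUE drops those entries before the TRUEs are counted.

```r
VAL <- c(24, NA, 10, 24, NA, 3)  # hypothetical values

VAL == 24  # TRUE NA FALSE TRUE NA FALSE

# TRUE counts as 1 and FALSE as 0; NAs are dropped by na.rm = TRUE
n24 <- sum(VAL == 24, na.rm = TRUE)
n24  # 2
```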
I've been trying for a while now to produce a code that brings me a new vector of the sum of the 25 previous rows for an original vector.
So if we say I have a variable Y with 500 rows and I would like a running sum, in a new vector, which contains the sum of rows [1:25] then [2:26] for the length of Y, such as this:
y<-1:500
runsum<-function(x){
cumsum(x)-cumsum(x[26:length(x)])
}
new<-runsum(y)
I've tried using some different functions here and then even using the apply functions on top but none seem to produce the right answers....
Would anyone be able to help? I realise it's probably very easy for many of the community here but any help would be appreciated
Thanks
This function calculates the sum of the 24 preceding values and the actual value:
movsum <- function(x,n=25){filter(x,rep(1,n), sides=1)}
It is easy to adapt to sum only preceding values, if this is what you really want.
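A short check of what movsum returns for the y from the question, written with stats::filter spelled out in case another package masks filter. The cumsum-based line is an alternative sketch added for comparison, not part of the answer above: it subtracts the running total from n positions earlier.

```r
movsum <- function(x, n = 25) stats::filter(x, rep(1, n), sides = 1)

y <- 1:500
n <- 25

by_filter <- as.numeric(movsum(y, n))  # first n - 1 entries are NA
by_filter[n]                           # sum(y[1:25]) = 325

# Alternative: difference of cumulative sums, one value per full window
cs <- cumsum(as.numeric(y))
by_cumsum <- cs[n:length(y)] - c(0, cs[seq_len(length(y) - n)])

all.equal(by_filter[n:length(y)], by_cumsum)
```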
In addition to Roland's answer, you could use the zoo package:
library ( zoo )
y <- 1:500
rollapply ( zoo ( y ), 25, sum )
HTH
I like Roland's answer better as it relies on a time series function and will probably be pretty quick. Since you mentioned you started going down the path of using apply() and friends, here's one approach to do that:
y<-1:500
#How many to sum at a time?
n <- 25
#Create a matrix of the appropriate start and end points
mat <- cbind(start = head(y, -(n-1)), end = tail(y, -(n-1)))
#Check output
rbind(head(mat,3), tail(mat,3))
#-----
start end
1 25
2 26
3 27
[474,] 474 498
[475,] 475 499
[476,] 476 500
#add together
apply(mat, 1, function(x) sum(y[x[1]]:y[x[2]]))
#Is it the same as Roland's answer after removing the NA values it returns?
all.equal(apply(mat, 1, function(x) sum(y[x[1]]:y[x[2]])),
movsum(y)[-c(1:n-1)])
#-----
[1] TRUE