I am working in R with a matrix like the following:
diff_0
              SubPop0-1     SubPop1-1     SubPop2-1     SubPop3-1     SubPop4-1
SubPop0-1            NA            NA            NA            NA            NA
SubPop1-1   0.003403100            NA            NA            NA            NA
SubPop2-1   0.005481177  -0.002070277            NA            NA            NA
SubPop3-1   0.002216444   0.005946314   0.001770977            NA            NA
SubPop4-1   0.010344845   0.007151529   0.004237316 -0.0021275130            NA
... but bigger ;-).
This is a matrix of pairwise genetic differentiation between each SubPop from 0 to 4. I would like to obtain a mean differentiation value for each SubPop.
For instance, for SubPop-0, the mean would just be the mean of the 4 values in column 1. For SubPop-2, however, it would be the mean of the 2 values in row 3 and the 2 values in column 3, since this is a half-matrix (only the lower triangle is filled).
I wanted to write a for loop to compute each mean value for each SubPop, taking this into account. I tried the following:
Mean <- for (r in 1:nrow(diff_0)) {
mean(apply(cbind(diff_0[r,], diff_0[,r]), 1, sum, na.rm=T))
}
First, this isolates the row and the column of index [r], whose values refer to the same SubPop r. 'sum' gathers these values and eliminates the 'NA's. Finally, I get the mean value for my SubPop r. I was hoping my for loop would give me one value for each index r, i.e. one per SubPop.
However, even though mean(apply(cbind(diff_0[r,], diff_0[,r]), 1, sum, na.rm=T)) gives me what I want when run alone with a fixed r value between 1 and 5, the for loop itself only returns an empty vector.
Something like for (r in 1:nrow(diff_0)) { print(diff_0[r,1]) } also works, so I do not understand what is going on.
This is a trivial question but I could not find an answer on the internet! Although I know I am probably missing the obvious :-)...
Thank you very much,
Cheers!
Okay, based on what you want to do (and if I understand everything correctly) there are several ways of doing this.
The one that comes to my mind now is to turn your lower triangular matrix into a full matrix (i.e. fill the upper triangle with the transpose of the lower triangle) and then compute row- or column-wise means.
My R is running right now on something else, so I can't check my code but this should work
diff = diff_0
diff[upper.tri(diff)] = t(diff_0)[upper.tri(diff_0)] # copy each lower-triangle value [j,i] to its mirrored position [i,j]
As I said, my R is running right now so I can't check the correctness of the last line - I might be confused with some transposes there, so I'd appreciate any feedback on whether it actually worked or not.
You can then either set the diagonal values to 0 or alternatively add na.rm = TRUE to the mean statement
mean_diffs = apply(diff,2,FUN = function(x)mean(x, na.rm = TRUE))
that should work
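For reference, here is a minimal, self-contained sketch of the symmetrize-then-average idea, using the five-population example values from the question. The names `pops`, `full` and `mean_diffs` are only illustrative, and the mirroring step indexes the transposed matrix with `upper.tri()` so that position [i,j] receives the value stored at [j,i].

```r
# Rebuild the example matrix from the question (lower triangle only).
pops <- paste0("SubPop", 0:4)
diff_0 <- matrix(NA_real_, 5, 5, dimnames = list(pops, pops))
diff_0[lower.tri(diff_0)] <- c(
  0.003403100, 0.005481177, 0.002216444, 0.010344845,  # column 1
  -0.002070277, 0.005946314, 0.007151529,              # column 2
  0.001770977, 0.004237316,                            # column 3
  -0.0021275130                                        # column 4
)

# Mirror the lower triangle into the upper triangle.
full <- diff_0
full[upper.tri(full)] <- t(diff_0)[upper.tri(diff_0)]

# The diagonal is still NA, so na.rm = TRUE averages the 4 pairwise values.
mean_diffs <- rowMeans(full, na.rm = TRUE)
print(round(mean_diffs, 9))
```

With these values the first entry (SubPop0) is the mean of the four numbers in column 1, roughly 0.00536.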
Also: Yes, your code does not work, because the assignment is not in the for loop. This should work:
means = rep(NA, nrow(diff_0))
for (r in 1:length(means)){
means[r] = mean(apply(cbind(diff_0[r,], diff_0[,r]), 1, sum, na.rm=T))
}
But in general for loops are not what you want to do in R
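To illustrate both points, here is a small sketch on a hypothetical 3 x 3 lower-triangular matrix (the values are made up for the example). It shows that a `for` loop evaluates to NULL, so assigning the loop itself to a variable cannot collect results, and that `vapply()` returns one value per index without an explicit loop.

```r
# Hypothetical lower-triangular matrix: m[2,1] = 1, m[3,1] = 2, m[3,2] = 3.
m <- matrix(c(NA, 1, 2,
              NA, NA, 3,
              NA, NA, NA), nrow = 3)

# A for loop always evaluates to NULL, whatever its body computes.
loop_value <- for (r in seq_len(nrow(m))) mean(c(m[r, ], m[, r]), na.rm = TRUE)
print(is.null(loop_value))

# vapply() collects one numeric value per row/column index.
means <- vapply(
  seq_len(nrow(m)),
  function(r) mean(c(m[r, ], m[, r]), na.rm = TRUE),
  numeric(1)
)
print(means)
```

Here each mean pools row r and column r and drops the NAs, so every population is averaged over its two pairwise values.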
This may be a solution...
for(i in 1:nrow(diff_0)) {
k<-mean(cbind(as.numeric(diff_0[,i]),as.numeric(diff_0[i,])),na.rm=T)
if(i==1) {
data_mean <- k
}else{
data_mean <- rbind(data_mean,k)
}
}
colnames(data_mean) <- "mean"
rownames(data_mean) <- c("SubPop0","SubPop1","SubPop2","SubPop3","SubPop4")
data_mean
mean
SubPop0 0.005361391
SubPop1 0.003607666
SubPop2 0.002354798
SubPop3 0.001951555
SubPop4 0.004901544
Consider the sample data below:
a=c(NA,1,NA)
b=c(1,2,4)
c=c(0,1,0)
d=c(1,2,4)
df=data.frame(a,b,c,d)
The objective is to find the correlation between 2 columns, where an NA should reduce the correlation. NA means that an event did not take place.
Is there a way to use NA in the correlation such that it pulls down the value of the correlation?
> cor(df$a, df$b)
[1] NA
Or should I be looking at some other mathematical function?
Is there a way to use NA in the correlation such that it pulls down the value of the correlation?
Here is a way to use NA values to decrease correlation. For demonstration, I am using different data with some good size.
a <- sort(runif(10))
b <- sort(runif(10))
## Sorting so that there is some good correlation between them.
## Now making some values NA deliberately
a[c(9,10)] <- NA
cor(a[1:8],b[1:8])
## [1] 0.890465 #correlation value is high
## Lets assign a to c and Fill NA values with something
c <- a
## using mean causes no change to numerator but increases denominator.
c[is.na(a)] <- mean(a, na.rm=T)
cor(c,b)
## [1] 0.6733387
Note that when you replace all NA terms with the mean, the numerator does not change, because each filled-in term has a deviation of zero from the mean and so contributes nothing. The denominator, however, now includes the extra values of b, so the correlation comes down. Also, the more NAs in your data, the more the correlation comes down.
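A reproducible sketch of this effect follows. The seed and the names `obs`, `a_filled`, `ss` and `shrink` are assumptions added for illustration. It checks that mean-filling shrinks the correlation by exactly the factor contributed by the extra values of b in the denominator.

```r
set.seed(123)  # arbitrary seed, only to make the sketch reproducible
a <- sort(runif(10))
b <- sort(runif(10))
a[9:10] <- NA

obs <- !is.na(a)
r_observed <- cor(a[obs], b[obs])          # complete pairs only

# Fill the missing entries of a with the mean of the observed entries.
a_filled <- replace(a, !obs, mean(a, na.rm = TRUE))
r_filled <- cor(a_filled, b)

# Sum of squared deviations: the b-part of the denominator.
ss <- function(x) sum((x - mean(x))^2)
shrink <- sqrt(ss(b[obs]) / ss(b))

print(c(observed = r_observed, filled = r_filled, predicted = r_observed * shrink))
```

The filled correlation equals the observed one multiplied by `shrink`, which is below 1 whenever the added b values increase the spread of b.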
The question doesn't make mathematical sense as there is no correlation between events that didn't happen. Correlation cannot be reduced by no event happening. There is no function to do this other than to transform the data.
You may replace the NA values with something, as @Ujjwal Kumar has suggested, but this is just data manipulation and not a predefined function.
Look at the help file for cor (?cor). With calls like cor(df$a,df$b,use="pairwise.complete.obs") you can see how NA values are usually treated: they are simply removed and have no impact on the correlation itself.
?cor output
If use is "everything", NAs will propagate conceptually, i.e., a resulting value will be NA whenever one of its contributing observations is NA.
If use is "all.obs", then the presence of missing observations will produce an error. If use is "complete.obs" then missing values are handled by casewise deletion (and if there are no complete cases, that gives an error).
"na.or.complete" is the same unless there are no complete cases, that gives NA. Finally, if use has the value
"pairwise.complete.obs" then the correlation or covariance between each pair of variables is computed using all complete pairs of observations on those variables. This can result in covariance or correlation matrices which are not positive semi-definite, as well as NA entries if there are no complete pairs for that pair of variables. For cov and var, "pairwise.complete.obs" only works with the "pearson" method. Note that (the equivalent of) var(double(0), use = *) gives NA for use = "everything" and "na.or.complete", and gives an error in the other cases.
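The behaviour of the `use` argument described above can be checked on a small example. The vectors `x` and `y` below are hypothetical and chosen so that the three complete pairs lie exactly on a line.

```r
# Hypothetical data: complete pairs are (1, 2), (2, 4) and (5, 10).
x <- c(1, 2, NA, 4, 5)
y <- c(2, 4, 6, NA, 10)

# Default use = "everything": the NA propagates to the result.
r_default <- cor(x, y)

# Casewise deletion: only the three complete pairs are used.
r_complete <- cor(x, y, use = "complete.obs")

# For two vectors, pairwise deletion uses the same pairs.
r_pairwise <- cor(x, y, use = "pairwise.complete.obs")

# use = "all.obs" raises an error when anything is missing.
r_all_obs <- try(cor(x, y, use = "all.obs"), silent = TRUE)

print(c(default = r_default, complete = r_complete, pairwise = r_pairwise))
print(inherits(r_all_obs, "try-error"))
```

The difference between "complete.obs" and "pairwise.complete.obs" only shows up when cor() is given a matrix or data frame with more than two columns.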
I guess there is no simple explanation. You have to remove the data with NA, and of course the corresponding data in columns b, c, d, and then compute the correlation. You can check whether there are corresponding NAs in each dataset (a, b, c, d).
In your example you can compute the correlation for all combinations of b, c, d, but if you want to compute cor(a,b) you have to pick only the rows without NA in a and b. Then, if you want the NAs to pull the value down, you could multiply cor(a,b) by (number of rows without NA in a and b) divided by the number of all rows in the dataset.
I have a question regarding the Reduce function.
For example, I have a list in which one of the elements contains an NA.
a<-list(c(1,2),c(2,2),c(1,NA))
I wish to use the Reduce function to compute an element-wise average over the list,
that is (1+2+1)/3 = 1.33 and (2+2+NA)/3 = NA. But in this last case, what I actually need is to ignore the NA, so the result should be (2+2)/2 = 2, and the final outcome is the vector 1.33, 2.
I am using Reduce("+", a)/length(a) but I get an NA because of the NA element.
Thanks in advance
I wouldn't use Reduce for this. It is just a hidden for loop anyway. Here is a better alternative:
rowMeans(do.call(cbind, a), na.rm = TRUE)
#[1] 1.333333 2.000000
This combines your vectors into a matrix and calculates the row means using the rowMeans function, which can remove NA values.
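If Reduce has to be used anyway, one possible sketch is to reduce two things separately: the sums with NA treated as 0, and the counts of non-missing values. The helper names `sums` and `counts` are illustrative, not part of the original answer.

```r
a <- list(c(1, 2), c(2, 2), c(1, NA))

# Element-wise sums, with each NA replaced by 0 so it does not propagate.
sums <- Reduce(`+`, lapply(a, function(v) replace(v, is.na(v), 0)))

# Element-wise counts of non-missing values (TRUE counts as 1).
counts <- Reduce(`+`, lapply(a, function(v) !is.na(v)))

na_aware_mean <- sums / counts
print(na_aware_mean)

# Same result as the rowMeans() approach.
print(all.equal(na_aware_mean, rowMeans(do.call(cbind, a), na.rm = TRUE)))
```

This gives the same numbers as rowMeans(), which remains the shorter option.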
Despite using two complete columns where every element is numeric and no numbers are missing for rows 2 thru 570, I find it impossible to get a result other than NA when using a loop to find a rolling 24-week correlation between the two columns.
rolling.correlation <- NULL
temp <- NULL
for(i in 2:547){
temp <- cor(training.set$return.SPY[i:i+23],training.set$return.TLT[i:i+23])
rolling.correlation <- c(rolling.correlation, temp)
} #end "for" loop
rolling.correlation
The cor() command works fine for [2:25], [i:25], or [2:i], but R doesn't seem to understand when I say [i:i+23].
I want R to calculate a correlation for rows 2 thru 25, then 3 thru 26, ..., 547 thru 570. The result should be a vector of length 546 which has numeric values for each correlation. Instead I'm getting a vector of 546 NAs. How can I fix this? Thanks for your help.
Look what happens when you do
5:5+2
# [1] 7
Note that : has a higher operator precedence than + which means 5:5+2 is the same as (5:5)+2 when you really want 5:(5+2). Use
i:(i+23)
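Put together, a corrected rolling correlation might look like the sketch below. The original `training.set` is not available, so random returns stand in for it (an assumption for illustration). With `i:i+23` each window collapses to the single row i+23, and the correlation of a single pair is NA, which explains the vector of NAs.

```r
# Stand-in data: the real training.set is not shown in the question.
set.seed(42)
training.set <- data.frame(return.SPY = rnorm(570), return.TLT = rnorm(570))

# Precedence check: ':' binds tighter than '+'.
print(5:5 + 2)     # same as (5:5) + 2
print(5:(5 + 2))   # the intended sequence

# One correlation per 24-row window: rows 2-25, 3-26, ..., 547-570.
rolling.correlation <- vapply(2:547, function(i) {
  rows <- i:(i + 23)
  cor(training.set$return.SPY[rows], training.set$return.TLT[rows])
}, numeric(1))

print(length(rolling.correlation))
print(head(rolling.correlation, 3))
```

Using vapply() also avoids growing the result vector with c() on every iteration.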
I have been searching Stackoverflow for hours hoping to find something I guessed was self-evident but nobody seemed to have asked (which might mean it is indeed self-evident).
I want to use tapply or by, to find the first time a specific event occurs in a dataframe (first non-zero value). The way I did this before was via
max.col(df, ties.method = c("first"))
But somehow this does not work when used in conjunction with either tapply or by. Here is some example data:
FIRM<-as.vector(sample(c("a","b","c","d"),100,replace=T))
MOMENT<-as.vector(sample((1990:1995),100,replace=T))
EVENT<-as.vector(sample(c("x12","x43","x35","y71","y81","xy1","xy67","yy123","xx901"),100,replace=T))
OCCURENCE<-as.vector(sample(c(0,1),100,replace=T))
m<-as.data.frame(cbind(FIRM,MOMENT,EVENT,OCCURENCE))
So here is what I tried and did not work
tapply(m[,4],m[,3],max.col) # This gives just 1s for every EVENT with the length of the resulting vector equal to number of EVENTs mentioned in the dataset
tapply(m[,4],m[,3],max.col(m, ties.method=c("first"))) # Error in match.fun(FUN) :
'max.col(m, ties.method = c("first"))' is not a function, character or symbol
In addition: Warning message: In max.col(m, ties.method = c("first")) : NAs introduced by coercion
Number 2 is really the crux of the problem. For reasons unclear to me, max.col is not recognised as a function once you change the default tie-breaking method (i.e. "random") to the one I need (i.e. "first").
Additionally, I'd want to be able to find the year in which the non-zero occurs.
I think a sensible alternative would be to multiply the MOMENT column with the OCCURENCE column (call that ID) and look for the first non-zero value in ID (for each factor EVENT) keep that ID value and turn the other values into zero
m$MOMENT<-as.numeric(as.character(m$MOMENT))
m$OCCURENCE<-as.numeric(as.character(m$OCCURENCE))
m[,"ID"]<-m$MOMENT * m$OCCURENCE
I have tried to code this with a function containing a while and an if statement, using break, but it does not work:
tapply(m$ID,m$EVENT, function(x) m$ID[i]<- while (m$ID[i] == 0) {m$ID[i]
if (m$ID[i]>0) {m$YEAR[i] && break }})
The idea here was to iterate the function over EVENT while m$ID == 0 and then to change the value and break once m$ID > 0. Didn't work...
Any ideas on how to fix this (or much simpler solutions)?
The FUN argument of tapply must be a function but the code in the question supplies an expression, not a function. Try this:
tapply(m[,4], m[,3], max.col, ties.method = "first")
This will give a logical indicator of the first row in each event which has 1 in the OCCURENCE column and the second line will select those rows:
o <- order(m$EVENT, m$MOMENT) # omit this and next line if already ordered
m <- m[o,]
is.first <- ave(m$OCCURENCE == 1, m$EVENT, FUN = function(x) x & !duplicated(x))
m[is.first, ]
REVISED
Ordered by event and year.
Note that if it is possible that there are events with only zeros, then such events will be omitted entirely from m[is.first, ].
I'm not quite sure what you are trying to achieve, so here is only some coding advice.
First of all, you need to read help("tapply") to learn how to pass arguments to the function that is passed to tapply:
tapply(m[,4],m[,3],max.col, ties.method="first")
However, I doubt this does what you need. Maybe something like this would be useful:
m<-data.frame(FIRM,MOMENT,EVENT,OCCURENCE)
#note how I create the data.frame in a different way
#in order to avoid coercing all columns to factors
tapply(m[,4],m[,3],which.max)
# x12 x35 x43 xx901 xy1 xy67 y71 y81 yy123
# 2 1 2 3 1 1 3 1 1
tapply(m[,4],m[,3],function(x) m[which.max(x), "MOMENT"])
# x12 x35 x43 xx901 xy1 xy67 y71 y81 yy123
# 1995 1995 1995 1991 1995 1995 1991 1995 1995
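If the goal is the first year with a non-zero occurrence per event, one possible sketch is to filter to the rows where the event occurred and take the minimum year per group. The small data frame below is hypothetical (the question uses random samples), and keeping all event levels in the factor is a choice made here so that events without any occurrence show up as NA.

```r
# Hypothetical data with the same column names as in the question.
m <- data.frame(
  EVENT     = c("x12", "x12", "x12", "y71", "y71", "xy1"),
  MOMENT    = c(1993, 1991, 1990, 1992, 1994, 1995),
  OCCURENCE = c(1, 1, 0, 0, 1, 0)
)

# Keep only the rows where the event actually occurred.
hits <- m[m$OCCURENCE == 1, ]

# Earliest year per event; empty groups (here "xy1") come back as NA.
event_levels <- unique(m$EVENT)
first_year <- tapply(hits$MOMENT, factor(hits$EVENT, levels = event_levels), min)
print(first_year)
```

Because the minimum is taken over years, the rows do not need to be sorted beforehand.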
I've been trying for a while now to produce a code that brings me a new vector of the sum of the 25 previous rows for an original vector.
So if we say I have a variable Y with 500 rows and I would like a running sum, in a new vector, which contains the sum of rows [1:25] then [2:26] for the length of Y, such as this:
y<-1:500
runsum<-function(x){
cumsum(x)-cumsum(x[26:length(x)])
}
new<-runsum(y)
I've tried using some different functions here and then even using the apply functions on top but none seem to produce the right answers....
Would anyone be able to help? I realise it's probably very easy for many of the community here but any help would be appreciated
Thanks
This function calculates the sum of the 24 preceding values and the actual value:
movsum <- function(x,n=25){stats::filter(x,rep(1,n), sides=1)} # stats:: avoids picking up dplyr::filter if dplyr is loaded
It is easy to adapt to sum only preceding values, if this is what you really want.
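The cumsum idea from the question can also be made to work. The sketch below is one way to do it, with a leading 0 added to the cumulative sums so that each window sum is a difference of two cumulative sums; the argument `n` and the comparison against stats::filter are additions for illustration.

```r
# Running sum over windows of n consecutive values, via cumulative sums.
runsum <- function(x, n = 25) {
  cs <- c(0, cumsum(x))                       # cs[k + 1] = sum of x[1..k]
  last  <- cs[(n + 1):length(cs)]             # cumulative sum at each window end
  first <- cs[seq_len(length(x) - n + 1)]     # cumulative sum just before each window start
  last - first
}

y <- 1:500
new <- runsum(y)
print(head(new, 3))   # sums of rows 1-25, 2-26, 3-27
print(length(new))    # one value per complete window

# Compare with the filter-based version after dropping its leading NAs.
via_filter <- stats::filter(y, rep(1, 25), sides = 1)
print(all.equal(new, as.numeric(via_filter)[-seq_len(24)]))
```

Unlike the filter version, this returns only the complete windows, so the result is 24 elements shorter than the input.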
In addition to Roland's answer you could use the zoo library
library ( zoo )
y <- 1:500
rollapply ( zoo ( y ), 25, sum )
HTH
I like Roland's answer better as it relies on a time series function and will probably be pretty quick. Since you mentioned you started going down the path of using apply() and friends, here's one approach to do that:
y<-1:500
#How many to sum at a time?
n <- 25
#Create a matrix of the start and end indices of each window
idx <- seq_along(y)
mat <- cbind(start = head(idx, -(n-1)), end = tail(idx, -(n-1)))
#Check output
rbind(head(mat,3), tail(mat,3))
#-----
       start end
           1  25
           2  26
           3  27
[474,]   474 498
[475,]   475 499
[476,]   476 500
#add together the values of y inside each window
apply(mat, 1, function(x) sum(y[x[1]:x[2]]))
#Is it the same as Roland's answer after removing the NA values it returns?
all.equal(apply(mat, 1, function(x) sum(y[x[1]:x[2]])),
movsum(y)[-seq_len(n-1)])
#-----
[1] TRUE