I am trying to get the variance of a variable (GD1) in my dataset (CETeen), but the output keeps returning NA when I use the basic variance function. I know there are some NAs in the data, but I am not sure if they are the culprit. I am new to R and still learning; is there a better way to get the variance for this variable, or a way to figure out the issue?
var(CETeen$GD1)
[1] NA
You cannot compute the variance of a set containing NA:
R> var(c(1,2,3,NA,5))
[1] NA
R> var(c(1,2,3,NA,5), na.rm=TRUE)
[1] 2.91667
So either treat / filter your data, or tell var() to skip the NA values.
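For the original question, a minimal check could look like this (assuming CETeen$GD1 is a numeric column):
sum(is.na(CETeen$GD1))          # how many values are missing
var(CETeen$GD1, na.rm = TRUE)   # variance of the non-missing values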
Related
I am trying to compute cumulative returns in R, using cumprod() for $1 invested
I seem to be getting NA values after using the cumprod() function, because the first return I'm trying to use is NA and therefore not successfully cumulating returns.
[1] NA -0.059898142 -0.267314770 -0.075349437 0.008658063 -0.008658063 0.000000000
The first row is NA and because of that, the cumprod(x+1) function turns into all NAs
How do I remove the first row/ignore the NA?
Any input would be appreciated
You can use na.omit to remove NA values in x before applying cumprod, e.g.,
cumprod(na.omit(x))
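Applied to the return series in the question, a sketch might look like this (x holding the simple returns shown above):
x <- c(NA, -0.059898142, -0.267314770, -0.075349437, 0.008658063)
cumprod(na.omit(x) + 1)   # growth of $1, with the leading NA dropped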
So I am working in R with a matrix like the following:
diff_0
             SubPop0-1    SubPop1-1   SubPop2-1     SubPop3-1 SubPop4-1
SubPop0-1           NA           NA          NA            NA        NA
SubPop1-1  0.003403100           NA          NA            NA        NA
SubPop2-1  0.005481177 -0.002070277          NA            NA        NA
SubPop3-1  0.002216444  0.005946314 0.001770977            NA        NA
SubPop4-1  0.010344845  0.007151529 0.004237316 -0.0021275130        NA
... but bigger ;-).
This is a matrix of pairwise genetic differentiation between each SubPop from 0 to 4. I would like to obtain a mean differentiation value for each SubPop.
For instance, for SubPop-0, the mean would just correspond to the mean of the 4 values in column 1. However, for SubPop-2, this would be the mean of the 2 values in row 3 and the 2 values in column 3, since this is a half-matrix (only the lower triangle is filled).
I wanted to write a for loop to compute each mean value for each SubPop, taking this into account. I tried the following:
Mean <- for (r in 1:nrow(diff_0)) {
mean(apply(cbind(diff_0[r,], diff_0[,r]), 1, sum, na.rm=T))
}
First this isolates the row and column of index [r], whose values all refer to the same SubPop r. 'sum' lets me gather these values and eliminate the NAs. Finally I get the mean value for my SubPop r. I was hoping my for loop would give me one value for each index r, i.e. one per SubPop.
However, even though mean(apply(cbind(diff_0[r,], diff_0[,r]), 1, sum, na.rm=T)), run alone with a fixed r value between 1 and 5, does give me what I want, the for loop itself only returns an empty vector.
Something like for (r in 1:nrow(diff_0)) { print(diff_0[r,1]) } also works, so I do not understand what is going on.
This is a trivial question but I could not find an answer on the internet! Although I know I am probably missing the obvious :-)...
Thank you very much,
Cheers!
Okay, based on what you want to do (and if I understand everything correctly) there are several ways of doing this.
The one that comes to mind now is just turning your lower-triangular matrix into a full matrix (i.e. filling the upper triangle with the transpose of the lower triangle) and then taking row- or column-wise means.
My R is running on something else right now, so I can't check my code, but this should work:
diff = diff_0
diff[upper.tri(diff)] = t(diff)[upper.tri(diff)]  # mirror the lower triangle into the upper one
As I said, my R is running right now so I can't check the correctness of the last line - I might be confused with some transposes there, so I'd appreciate any feedback on whether it actually worked or not.
You can then either set the diagonal values to 0 or alternatively add na.rm = TRUE to the mean statement
mean_diffs = apply(diff, 2, mean, na.rm = TRUE)
that should work
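If you want a self-contained check, here is the whole approach run on the five SubPops from the question (values transcribed from the matrix above):
vals <- c( 0.003403100, 0.005481177, 0.002216444, 0.010344845,
          -0.002070277, 0.005946314, 0.007151529,
           0.001770977, 0.004237316,
          -0.002127513)
nm <- paste0("SubPop", 0:4, "-1")
diff_0 <- matrix(NA, 5, 5, dimnames = list(nm, nm))
diff_0[lower.tri(diff_0)] <- vals                  # lower.tri() fills column-wise, matching the display
diff <- diff_0
diff[upper.tri(diff)] <- t(diff)[upper.tri(diff)]  # mirror the lower triangle
colMeans(diff, na.rm = TRUE)                       # one mean differentiation value per SubPop
The column means here reproduce the values printed in the answer further down (e.g. 0.005361391 for SubPop0).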
Also: yes, your code does not work because a for loop in R returns NULL, so Mean <- for (...) assigns nothing; the assignment has to happen inside the loop. This should work:
means = rep(NA, nrow(diff_0))
for (r in 1:length(means)) {
  means[r] = mean(apply(cbind(diff_0[r,], diff_0[,r]), 1, sum, na.rm=T))
}
But in general, for loops are not what you want to use in R.
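If you want to avoid the explicit loop entirely, here is a sketch that simply wraps your own expression in sapply:
means <- sapply(1:nrow(diff_0), function(r)
  mean(apply(cbind(diff_0[r,], diff_0[,r]), 1, sum, na.rm=T)))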
This may be a solution...
for(i in 1:nrow(diff_0)) {
  # mean of all values in column i and row i, ignoring the NAs
  k <- mean(cbind(as.numeric(diff_0[,i]), as.numeric(diff_0[i,])), na.rm=T)
  if(i==1) {
    data_mean <- k
  } else {
    data_mean <- rbind(data_mean, k)
  }
}
colnames(data_mean) <- "mean"
rownames(data_mean) <- c("SubPop0","SubPop1","SubPop2","SubPop3","SubPop4")
data_mean
mean
SubPop0 0.005361391
SubPop1 0.003607666
SubPop2 0.002354798
SubPop3 0.001951555
SubPop4 0.004901544
Consider the sample data below:
a=c(NA,1,NA)
b=c(1,2,4)
c=c(0,1,0)
d=c(1,2,4)
df=data.frame(a,b,c,d)
The objective is to find the correlation between 2 columns, where NA should reduce the correlation. NA means that an event did not take place.
Is there a way to use NA in the correlation such that it pulls down the value of the correlation?
> cor(df$a, df$b)
[1] NA
Or should I be looking at some other mathematical function?
Is there a way to use NA in the correlation such that it pulls down the value of the correlation?
Here is a way to use NA values to decrease correlation. For demonstration, I am using different data with some good size.
a <- sort(runif(10))
b <- sort(runif(10))
## Sorting so that there is some good correlation between them.
## Now making some values NA deliberately
a[c(9,10)] <- NA
cor(a[1:8],b[1:8])
## [1] 0.890465 #correlation value is high
## Let's assign a to c and fill the NA values with something
c <- a
## using mean causes no change to numerator but increases denominator.
c[is.na(a)] <- mean(a, na.rm=T)
cor(c,b)
## [1] 0.6733387
Note that when you replace the NA terms with the mean, the numerator does not change: the extra terms are multiplied by zero. The denominator, however, picks up some more values from b, so the correlation value comes down. Also, the more NAs in your data, the more the correlation comes down.
The question doesn't make mathematical sense as there is no correlation between events that didn't happen. Correlation cannot be reduced by no event happening. There is no function to do this other than to transform the data.
You may replace the NA values with something, as @Ujjwal Kumar has suggested, but this is just data manipulation and not a predefined function.
Look at the help file for cor (?cor). Using options like cor(df$a, df$b, use = "pairwise.complete.obs") you can see how NA values are usually treated: they are simply removed and have no impact on the correlation itself.
From the ?cor help page:
If use is "everything", NAs will propagate conceptually, i.e., a resulting value will be NA whenever one of its contributing observations is NA.
If use is "all.obs", then the presence of missing observations will produce an error. If use is "complete.obs" then missing values are handled by casewise deletion (and if there are no complete cases, that gives an error). "na.or.complete" is the same unless there are no complete cases, that gives NA.
Finally, if use has the value "pairwise.complete.obs" then the correlation or covariance between each pair of variables is computed using all complete pairs of observations on those variables. This can result in covariance or correlation matrices which are not positive semi-definite, as well as NA entries if there are no complete pairs for that pair of variables. For cov and var, "pairwise.complete.obs" only works with the "pearson" method. Note that (the equivalent of) var(double(0), use = *) gives NA for use = "everything" and "na.or.complete", and gives an error in the other cases.
I guess there is no simple solution. You have to remove the rows with NA (and of course the corresponding data in columns b, c, d) and then compute the correlation. You can also check whether there are corresponding NAs in each column (a, b, c, d).
In your example you can compute the correlation for all combinations of b, c, d, but if you want cor(a, b) you have to pick only the rows that have no NA in either a or b. And maybe, when you compute this cor(a, b), multiply it by the number of complete rows divided by the total number of rows in the dataset, so that missing events pull the value down.
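A minimal sketch of that heuristic (penalized_cor is a hypothetical helper, not a standard R function):
penalized_cor <- function(x, y) {
  ok <- complete.cases(x, y)               # rows where both variables were observed
  cor(x[ok], y[ok]) * sum(ok) / length(x)  # scale down by the share of complete rows
}
With the tiny sample above there is only one complete pair, so cor() still returns NA; the heuristic needs at least two complete rows to produce a number.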
I was wondering if there is a simple way to check whether the zero values in my data are excluded from my ANOVA.
I first changed all my zero values to NA with
BFL$logDecomposers[which(BFL$logDecomposers==0)] = NA
I'm not sure if 'na.action=na.exclude' makes sure my values are being ignored (like I want them to be)?
standard<-lm(logDecomposers~1, data=BFL) #null model
ANOVAlnDeco.lm<-lm(logDecomposers~Species_Number,data=BFL,na.action=na.exclude)
anova(standard,ANOVAlnDeco.lm)
P.S.: I've just been using R for a few weeks, and this website has been of tremendous help to me :)
You haven't given a reproducible example, but I'll make one up.
set.seed(101)
mydata <- data.frame(x=rnorm(100),y=rlnorm(100))
## add some zeros
mydata$y[1:5] <- 0
As pointed out by @Henrik you can use the subset argument to exclude these values:
nullmodel <- lm(y~1,data=mydata,subset=y>0)
fullmodel <- update(nullmodel,.~x)
It's a little confusing, but na.exclude and na.omit (the default) actually lead to the same fitted model -- the difference is in whether NA values are included when you ask for residual or predicted values. You can try it out:
mydata2 <- within(mydata,y[y==0] <- NA)
fullmodel2 <- update(fullmodel,subset=TRUE,data=mydata2)
(subset=TRUE turns off the previous subset argument, by specifying that all the data should be included).
You can compare the fits (coefficients etc.). One shortcut is to use the nobs method, which counts the number of observations used in the model:
nrow(mydata) ## 100
nobs(nullmodel) ## 95
nobs(fullmodel) ## 95
nobs(fullmodel2) ## 95
nobs(update(fullmodel,subset=TRUE)) ## 100
So I am having some issues with NA values in the residuals of an lm cross-sectional regression in R.
The issue isn't the NA values themselves, it's the way R presents them.
For example:
test$residuals
# 1 2 4 5
# 0.2757677 -0.5772193 -5.3061303 4.5102816
test$residuals[3]
# 4
# -5.30613
In this simple example an NA value makes one of the residuals go missing. When I extract the residuals I can clearly see the third index missing. So far so good, no complaints here. The problem is that the resulting numeric vector is now one item shorter, so the third index is actually the fourth. How can I make R return the residuals like this instead, i.e., explicitly showing NA rather than skipping an index?
test$residuals
# 1 2 3 4 5
# 0.2757677 -0.5772193 NA -5.3061303 4.5102816
I need to keep track of all individual residuals so it would make my life much easier if I could extract them this way instead.
I just found this googling around a bit deeper. The resid function on an lm fitted with na.action=na.exclude is the way to go.
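For example (a minimal sketch, assuming the model is refit from a hypothetical data frame dat whose third row contains an NA):
test <- lm(y ~ x, data = dat, na.action = na.exclude)
resid(test)   # full-length vector; position 3 comes back as NA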
Yet another idea is to take advantage of the row names associated with the data frame provided as input to lm. In that case, the residuals should retain the names from the source data. Accessing the residuals from your example would give a value of -5.3061303 for test$residuals["4"] and NA for test$residuals["3"].
However, this does not exactly answer your question. One approach to doing exactly what you asked for in terms of getting the NA values back into the residuals is illustrated below:
> D<-data.frame(x=c(NA,2,3,4,5,6),y=c(2.1,3.2,4.9,5,6,7),residual=NA)
> Z<-lm(y~x,data=D)
> D[names(Z$residuals),"residual"]<-Z$residuals
> D
x y residual
1 NA 2.1 NA
2 2 3.2 -0.28
3 3 4.9 0.55
4 4 5.0 -0.22
5 5 6.0 -0.09
6 6 7.0 0.04
If you are doing predictions based on the regression results, you may want to specify na.action=na.exclude in lm. See the help results for na.omit for a discussion. Note that simply specifying na.exclude does not actually put the NA values back into the residuals vector itself.
As noted in a prior answer, resid (synonym for residuals) provides a generic access function in which the residuals will contain the desired NA values if na.exclude was specified in lm. Using resid is probably more general and a cleaner approach. In that case, the code for the above example would be changed to:
> D<-data.frame(x=c(NA,2,3,4,5,6),y=c(2.1,3.2,4.9,5,6,7),residual=NA)
> Z<-lm(y~x,data=D,na.action=na.exclude)
> D$residual <- residuals(Z)
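This also demonstrates the point made above that na.exclude pads only through the accessor, not in the stored component:
length(Z$residuals)   # 5: the NA row is dropped from the stored component
length(resid(Z))      # 6: the accessor restores the NA in position 1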
Here is an illustrated strategy, using a slightly modified example from the lm help page. This is a direct application of the definition of a residual (observed minus fitted):
## Annette Dobson (1990) "An Introduction to Generalized Linear Models".
## Page 9: Plant Weight Data.
# Two NA's introduced
weight <- c(4.17,5.58,NA,6.11,4.50,4.61,5.17,4.53,5.33,5.14,
4.81,4.17,4.41,3.59,5.87,3.83,6.03,NA,4.32,4.69)
group <- gl(2,10,20, labels=c("Ctl","Trt"))
lm.D9 <- lm(weight ~ group, na.action = na.exclude)
rr2 <- weight - predict(lm.D9)  # with na.exclude, predict() pads to full length, so the vectors line up
> rr2
 [1] -0.8455556  0.5644444         NA  1.0944444 -0.5155556 -0.4055556  0.1544444
 [8] -0.4855556  0.3144444  0.1244444  0.1744444 -0.4655556 -0.2255556 -1.0455556
[15]  1.2344444 -0.8055556  1.3944444         NA -0.3155556  0.0544444
(Without na.action = na.exclude the prediction vector is shorter than weight, so the subtraction recycles with a warning and silently misaligns some residuals.)
I think it would be dangerous to directly modify an lm object so that lm.D9$residual would return that result.