I have a large dataframe where some of the columns have NA as a result of taking the log of 0.
I have been doing various tests on the data (ANOVA, Tukey, Kruskal Wallis, Mann Whitney) but I couldn't figure out what is happening to the NA values.
Is R excluding those values completely?
Yes. The behavior of R regarding missing observations is given by
options("na.action")
which, by default, is
> options("na.action")
$na.action
[1] "na.omit"
So for many functions like the ones you mentioned, R only considers complete observations, i.e., rows with no NA in the variables being used.
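A quick way to check this on your own data is sketched below. The data frame d and its columns are made up for illustration. One thing worth knowing: in R, log(0) returns -Inf rather than NA, and na.omit does not drop -Inf, so it is worth checking which of the two your columns actually contain.

```r
# Hypothetical toy data: one missing response value
d <- data.frame(y = c(1.2, NA, 3.4, 2.2, 5.1, 4.0),
                g = factor(c("a", "a", "a", "b", "b", "b")))

getOption("na.action")   # "na.omit" unless it has been changed

# Modelling functions build a model frame, which drops incomplete rows
fit <- aov(y ~ g, data = d)
nobs(fit)                # 5: the row with the NA was left out

# na.omit() does the same thing explicitly
nrow(na.omit(d))         # 5

# log(0) is -Inf, not NA, so na.omit() would keep it
log(0)
is.na(log(0))            # FALSE
```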
I have a dataset named bwght which contains the variable cigs (cigarettes smoked per day).
When I calculate the mean of cigs in the dataset bwght using:
mean(bwght$cigs), I get a number 2.08.
Only 212 of the 1388 women in the sample smoke (and 1176 do not smoke):
summary(bwght$cigs>0) gives the result:
Mode FALSE TRUE NA's
logical 1176 212 0
I'm asked to find the average of cigs among the women who smoke (the 212).
I'm having a hard time finding the right syntax for excluding the non-smokers (cigs = 0).
I have tried:
mean(bwght$cigs| bwght$cigs>0)
mean(bwght$cigs>0 | bwght$cigs=TRUE)
if (bwght$cigs > 0){
sum(bwght$cigs)
}
x <-as.numeric(bwght$cigs, rm="0");
mean(x)
But nothing seems to work! Can anyone please help me??
If you want to exclude the non-smokers, you have a few options. The easiest is probably this:
mean(bwght[bwght$cigs>0,"cigs"])
With a data frame, the first index is the row and the second is the column. So, you can subset using dataframe[1,2] to get the first row, second column. You can also use a logical condition in the row selection. By using bwght$cigs>0 as the first index, you are subsetting to only the rows where cigs is greater than zero.
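Here is a small sketch of that indexing on made-up data. The data frame bwght_toy is a hypothetical stand-in for bwght, not the real dataset.

```r
# Hypothetical stand-in for bwght: 3 non-smokers, 3 smokers
bwght_toy <- data.frame(cigs = c(0, 0, 10, 0, 20, 6))

bwght_toy[1, 1]                               # first row, first column
bwght_toy$cigs > 0                            # logical row selector

# Mean over the rows where cigs > 0: (10 + 20 + 6) / 3
smoker_mean <- mean(bwght_toy[bwght_toy$cigs > 0, "cigs"])
smoker_mean

# Same result when subsetting the column vector directly
mean(bwght_toy$cigs[bwght_toy$cigs > 0])
```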
Your other ones didn't work for the following reasons:
mean(bwght$cigs| bwght$cigs>0)
This is effectively a logical comparison. You're asking for the TRUE / FALSE result of bwght$cigs OR bwght$cigs>0, and then taking the mean of it. R can take the mean of a logical vector (TRUE counts as 1 and FALSE as 0), so what you get back is the proportion of rows that are TRUE, i.e. the share of smokers, not the average number of cigarettes.
mean(bwght$cigs>0 | bwght$cigs=TRUE)
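Same problem. You use the | sign, which returns a logical, so at best you would get the mean of logicals again. On top of that, = is not the equality test in R (that is ==), so this line raises an error instead of comparing anything.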
if(bwght$cigs > 0){sum(bwght$cigs)}
By any chance, were you a SAS programmer originally? This looks like how I used to type at first. Basically, if() doesn't work the same way in R as it does in SAS. In that example, you are using bwght$cigs > 0 as the if condition, which won't work because if() expects a single TRUE or FALSE, not a whole vector. Older versions of R looked only at the first element of bwght$cigs > 0 and gave a warning; R 4.2.0 and later stop with an error. R handles looping differently from SAS - check out functions like lapply, tapply, and so on.
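As a rough sketch of the vectorised style, here are three ways to express "do something with the positive values" without handing a whole vector to if(). The vector cigs is made-up data.

```r
cigs <- c(0, 0, 10, 0, 20, 6)   # hypothetical values

# if() needs one TRUE/FALSE, so reduce the vector first with any() or all()
if (any(cigs > 0)) {
  total_smoked <- sum(cigs[cigs > 0])
}
total_smoked

# Element-wise choice: ifelse() is vectorised
status <- ifelse(cigs > 0, "smoker", "non-smoker")

# Group-wise summary without an explicit loop
group_means <- tapply(cigs, status, mean)
group_means
```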
x <-as.numeric(bwght$cigs, rm="0")
mean(x)
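As far as I can tell this doesn't filter anything. as.numeric() has no rm argument, so the extra argument is simply ignored, x ends up as a copy of cigs with the zeros still in it, and mean(x) gives you the same overall mean as before. Removing the quotes around "0" wouldn't change that.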
mean(bwght[bwght$cigs>0,"cigs"])
When I ran this statement it failed, returning NA with the warning "argument is not numeric or logical: returning NA".
Converting to matrix solved this:
mean(data.matrix(bwght[bwght$cigs>0,"cigs"]))
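One possible cause (an assumption, since the post doesn't say) is that the subset was still a data frame rather than a plain vector, which is what happens when bwght is a tibble. The sketch below mimics that in base R with drop = FALSE on made-up data, and shows an alternative that pulls the column out as a vector first.

```r
bwght_toy <- data.frame(cigs = c(0, 0, 10, 0, 20, 6))  # hypothetical data

# Mimic a subset that stays a one-column data frame
# (tibbles do not drop to a vector by default)
sub_df <- bwght_toy[bwght_toy$cigs > 0, "cigs", drop = FALSE]
class(sub_df)

# mean() on a data frame returns NA with the warning quoted above
res_df <- suppressWarnings(mean(sub_df))
res_df

# Extracting the column as a vector first avoids the problem
smokers <- bwght_toy$cigs[bwght_toy$cigs > 0]
mean(smokers)
```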
I'm trying to do the compositional analysis of habitat use with the compana() function in the adehabitatHS package (I use adehabitat because I can't install adehabitatHS).
compana() needs two matrices: one of habitat use and one of available habitat.
When I try to run the function it doesn't work (it never stops), so I have to abort the RStudio session.
I read that one problem could be the 0-values in some habitat types for some animals in the 'available' matrix, whereas other animals have positive values for the same habitat. As other people have done, I replaced the 0-values with small values (0.001) and ran compana. It worked, BUT the lambda value it returned was NaN.
The problem is similar to the one found here
adehabitatHS compana test returns lambda = NaN?
They said they resolved it by using counts (integers) rather than proportions for the 'used' habitat matrix.
I tried this approach too, but nothing changed (it freezes when there are 0-values in the available matrix, or returns NaN for lambda if I replace the 0-values with small values).
I checked all the matrices and they are ok, so this is driving me crazy.
I have 6 animals and 21 habitat types.
Can you resolve this BIG problem?
PARTIALLY SOLVED: I asked some researchers, and they told me that the number of habitats shouldn't be higher than the number of animals.
So I merged some habitats in order to have six habitats for six animals, and now the function works when I replace the 0-values in the 'available' matrix with small values (e.g. 0.001).
Unfortunately this is not what I wanted, because I needed to find values (rankings, log-ratios, etc.) for each habitat type (originally there were 21).
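For anyone trying the same workaround, the zero replacement and the dimension check described above can be sketched in base R as follows. The matrix avail and its values are invented for illustration, and the sketch stops before calling compana() itself.

```r
# Hypothetical availability matrix: 3 animals (rows) x 4 habitat types (columns)
avail <- matrix(c(0.50, 0.00, 0.30, 0.20,
                  0.40, 0.10, 0.00, 0.50,
                  0.25, 0.25, 0.25, 0.25),
                nrow = 3, byrow = TRUE,
                dimnames = list(paste0("animal", 1:3), paste0("hab", 1:4)))

# Locate the zeros (row and column of each)
which(avail == 0, arr.ind = TRUE)

# Replace them with a small positive stand-in value
avail_fixed <- avail
avail_fixed[avail_fixed == 0] <- 0.001

# The situation reported as problematic: more habitat types than animals
more_habitats_than_animals <- ncol(avail_fixed) > nrow(avail_fixed)
more_habitats_than_animals
```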
I am trying to run a kprototype clustering algorithm on my data using the kproto function of the clustMixType package in R but it's not working
I have 1000 rows and 5 variables: only one is categorical, the others have been scaled (still not working when not scaled)
It keeps saying:
Estimated lambda: Inf
Equal prototyps merged. Cluster number reduced to: 3
Error in table(clusters) : all arguments must have the same length
In addition: Warning message:
In kproto.default(inputdata_test, 4) :
All categorical variables have zero variance.
Yet my categorical variable has at least 3 levels, the numeric variables all have at least 2 distinct values, and there are no NaN values in the dataframe.
There is a small extract of my data below
First, don't use the scale() function with kproto; use it when you run k-means.
Second, recode the character values of the categorical feature as numbers. For example, Other=1, Tablet=2, Mobile Phone=3 ...
Then, looking at your error message:
all arguments must have the same length
This means the variables in your data do not all have the same number of rows, so check your data set.
And:
All categorical variables have zero variance.
This means there is a problem with the variance of the categorical variable.
You can check it with this code:
lambdaest(df)  # df is the data frame you pass to kproto
Have a nice day.
kproto also throws Error in table(clusters) : all arguments must have the same length if there are NAs in the data.
I fixed this using na.omit on my data frame.
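A minimal sketch of that clean-up step, using a made-up data frame df with one categorical column. It only covers the checks before clustering and does not call kproto itself. As far as I know, kproto expects categorical columns to be factors, which is why the last lines check that.

```r
# Hypothetical mixed-type data with missing values
df <- data.frame(x1 = c(0.1, NA, 0.5, 0.9, 0.3),
                 x2 = c(1.0, 2.0, 3.0, NA, 5.0),
                 device = factor(c("Tablet", "Other", "Mobile Phone",
                                   "Tablet", "Other")))

colSums(is.na(df))          # number of NAs per column

df_clean <- na.omit(df)     # keep complete rows only
nrow(df_clean)

# Checks on the categorical column before clustering
is.factor(df_clean$device)
n_levels_left <- nlevels(droplevels(df_clean$device))
n_levels_left               # levels still present after dropping rows
```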
I want to do a PCA in R with monthly rainfall values. Since there is no rain during winter, quite a few values in my columns are 0.
When I run the PCA, the following message appears in the console: Error in cov.wt(z) : 'x' must contain finite values only
I think what R is telling me here is that it does not like my 0 values.
So, I tried to change my 0 values to 'real numbers' by multiplying everything with 1.0000000001. But even if I do that and run R again with the new values, it pops up with the same message.
I read that I would need to either get rid of the rows with any missing values in them (which I can't) or use a PCA code that can deal with missing values by somehow imputing them. But my 0's are actual values, not missing values.
I find a lot of information on the web on how to deal with missing values or NA values but nothing on how to deal with zero values. Does anyone have any suggestions how I can do this? Many thanks for your help!
My guess is that "Error in cov.wt(z) : 'x' must contain finite values only" is complaining that some covariances are non-finite, i.e. NA/NaN. This can happen if you have variables that have a standard deviation of 0.
Example code:
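set.seed(1)                                  # make the example reproducible
latent <- rnorm(10)
data <- data.frame(rep(0, 10),               # ten 0's: a zero-variance column
                   latent + rnorm(10),       # latent plus noise
                   latent + rnorm(10),       # latent plus noise
                   latent + 1.5 * rnorm(10)) # latent plus more noise
colnames(data) <- c("zeros", "var1", "var2", "var3")
library(psych)
principal(data)      # error!
principal(data[-1])  # no errors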
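If you'd rather check your own data before running the PCA, something along these lines should work in base R. The data frame dat is made up, and prcomp is used here only as a stand-in because it ships with R; it is not the psych function from the example above.

```r
set.seed(1)
latent <- rnorm(10)
dat <- data.frame(zeros = rep(0, 10),          # hypothetical constant column
                  var1  = latent + rnorm(10),
                  var2  = latent + rnorm(10))

# Columns with zero standard deviation carry no information for a PCA
sds <- sapply(dat, sd)
constant_cols <- names(sds)[sds == 0]
constant_cols

# Any NA, NaN or Inf in the data would also trip a "finite values" check
all_finite <- all(is.finite(as.matrix(dat)))
all_finite

# Run the PCA on the remaining columns
pca <- prcomp(dat[, sds > 0], scale. = TRUE)
summary(pca)
```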
I have two vectors, both of which have NA values in them. I am trying to find a Chi-Squared value for a table I created with the two vectors, but I get this error:
Error in chisq.test(data.table) :
all entries of 'x' must be nonnegative and finite
Is there a code to remove the NA values from the table?
I did find some codes to do this for vectors but I am not sure how this would work. If an NA value gets deleted from one vector, will the corresponding value from the other vector not go into the Chi-Squared calculation?
The vectors have over 8,000 values each and each row corresponds to one subject, so if that subject failed to answer a question, I wouldn't want to use his/her other answer either. I hope that makes sense.
One solution would be to pull out the NA values from your data before you even run the test.
A reproducible example would be helpful here, but I'm guessing your data look something like this:
control<-c(runif(5),NA,runif(4))
treatment<-c(runif(3),NA,runif(6))
In this case, by putting your data into a data frame, you can drop both values for every subject with an NA in either value:
df<-data.frame(control,treatment)
df<-df[complete.cases(df),]  # keep only rows with no NA in any column
Your data now only includes subjects without any missing data, and can be tested as you please.
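To tie this back to the chi-squared test in the question, here is a sketch with made-up categorical answers. The vectors q1 and q2 are hypothetical. The toy counts are far too small for a meaningful test, so the warning is suppressed.

```r
# Hypothetical answers from 10 subjects, with some non-responses
q1 <- c("yes", "no", NA, "yes", "no", "yes", "no", "yes", NA, "no")
q2 <- c("a", "b", "a", NA, "b", "a", "b", "a", "b", "a")
answers <- data.frame(q1, q2)

# Keep only subjects who answered both questions
complete <- answers[complete.cases(answers), ]
nrow(complete)

# Cross-tabulate and test
tab <- table(complete$q1, complete$q2)
tab
res <- suppressWarnings(chisq.test(tab))   # small toy counts trigger a warning
res$statistic
```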