Finding Chi-Squared with NA values - r

I have two vectors, both of which have NA values in them. I am trying to find a Chi-Squared value for a table I created with the two vectors, but I get this error:
Error in chisq.test(data.table) :
all entries of 'x' must be nonnegative and finite
Is there code to remove the NA values from the table?
I did find some code that does this for vectors, but I am not sure how it would work here. If an NA value gets deleted from one vector, will the corresponding value from the other vector also be left out of the Chi-Squared calculation?
The vectors have over 8,000 values each and each row corresponds to one subject, so if that subject failed to answer a question, I wouldn't want to use his/her other answer either. I hope that makes sense.

One solution would be to pull out the NA values from your data before you even run the test.
Reproducibility would be helpful here, but I'm guessing your data look something like this:
control<-c(runif(5),NA,runif(4))
treatment<-c(runif(3),NA,runif(6))
In this case, by putting your data into a data frame, you can remove both values for every subject with an NA in either value:
df<-data.frame(control,treatment)
# keep only rows with no NA in any column
# (df[-which(is.na(x)),] would drop every row when x contains no NA)
df<-df[complete.cases(df),]
Your data now only includes subjects without any missing data, and can be tested as you please.
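As a minimal sketch of the whole workflow, assuming the two vectors hold categorical answers (the names answer1, answer2 and responses and the simulated data are made up for illustration), the complete rows can be tabulated and tested like this:

```r
# Hypothetical survey data: two categorical answers per subject, some missing
set.seed(1)
n <- 200
answer1 <- sample(c("yes", "no"), n, replace = TRUE)
answer2 <- sample(c("agree", "disagree"), n, replace = TRUE)
answer1[sample(n, 15)] <- NA
answer2[sample(n, 15)] <- NA

responses <- data.frame(answer1, answer2)

# Drop every subject with an NA in either answer, so pairs stay aligned
complete <- responses[complete.cases(responses), ]

# Cross-tabulate the remaining pairs and run the test on the table
tab <- table(complete$answer1, complete$answer2)
result <- chisq.test(tab)
print(result)

# table() leaves out NA by default, so tabulating the raw columns
# gives the same counts as tabulating the complete rows
tab_default <- table(responses$answer1, responses$answer2)
```

Because each row is one subject, removing rows (rather than filtering each vector separately) is what keeps a subject's two answers together.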

Related

Hi, I am trying to create an object in R and then subset the data but am getting an error message regarding dimensions

I am very new at R so I know the fix is simple, I would appreciate if someone could explain to me though my mistake and how to fix it.
dat4<-c(10, 11)
subDat<-dat4[,c(10,11)]
The error that I am getting is "Error in subDat4<-dat4[,c(10,11)] incorrect number of dimensions"
Thank you in advance
Welcome to StackOverflow.
You are defining dat4 as a vector (a one-dimensional object), but trying to subset it as if it were a data.frame/tibble (two-dimensional objects).
To use dat4[a,b], with a indicating rows and b indicating columns, the object needs to have both rows and columns (data frame, matrix, ...).
Your data is not a matrix, so you cannot subset it as one. The two-index form of square brackets that you used only works on objects with two dimensions.
Try
dat4<-c(10, 11)
dat5<-c(12, 13)
mat1<-matrix(c(dat4,dat5),nrow=2)
mat1[1,2]
# [1] 12
You can see that my subset asks for row one, column two, which prints 12, the element in row one, column two.
If you want to subset the vector you provided, you can do it this way:
dat4[[1]]
# [1] 10
That shows the first element of the vector 'dat4', and
dat4[[2]]
# [1] 11
shows the second element of 'dat4'.
I hope this answer is of help to you.
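If the intent was to pick columns by position from a table, the object has to be a data frame or matrix first. A small sketch, assuming a made-up three-column data frame called dat (the question does not show the real data):

```r
# A plain vector has no dim attribute, so [row, column] indexing fails
dat4 <- c(10, 11)
is.null(dim(dat4))   # TRUE: one-dimensional
dat4[2]              # single-index subsetting works

# Hypothetical data frame standing in for the real data
dat <- data.frame(a = 1:3, b = 4:6, c = 7:9)
dim(dat)             # 3 rows, 3 columns

# Select columns 2 and 3 by position; rows are left unrestricted
sub_dat <- dat[, c(2, 3)]
print(sub_dat)
```

Checking dim() before subsetting is a quick way to see whether the two-index form applies.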

Why does mutate() command create NAs?

I am currently working on an amazon dataset with many rows, which makes it hard to spot issues in the data.
My goal is to look at the amazon data, and see whether certain products have a higher variance in star ratings than other ones. I have a variable indicating product ID (asin), a variable indicating the star rating (overall), and want to create a variance variable.
I have thus used dplyr's group_by function in combination with the mutate function. Even though all input variables don't have NAs/Missings, my output variable does. I have attempted to look for a solution, yet only found solutions on what to do if the input has NAs.
See my code attached:
any(is.na(data$asin))
#[1] FALSE
any(is.na(data$overall))
# [1] FALSE
#create variable that represents variance of rating, grouped by product type
data <- data %>%
group_by(asin) %>%
mutate(ProductVariance = var(overall))
any(is.na(data$ProductVariance))
# [1] TRUE
sum(is.na(data$ProductVariance))
# [1] 289
I would much appreciate your help! Even though the number of NAs is small relative to the number of reviews, I would still like to get accurate means (NAs hinder the use of tapply) and to be as precise as possible in follow-up analyses.
Thank you in advance!
var will return NA if the input is length one. So any ASINs that appear once in your data will have NA variance. Depending what you're doing with it, you may find it convenient to change those NAs to 0s:
var(1)
# [1] NA
...
mutate(ProductVariance = coalesce(var(overall), 0))
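To confirm that the NAs come from products with a single review, one can compare the per-group variance with the per-group count. The sketch below uses base R's ave() instead of dplyr so it runs without extra packages; the reviews data frame is invented for illustration.

```r
# Hypothetical reviews: product B has only one rating
reviews <- data.frame(
  asin    = c("A", "A", "A", "B", "C", "C"),
  overall = c(5, 4, 3, 5, 2, 2)
)

# Per-product variance and per-product review count, repeated on every row
group_var <- ave(reviews$overall, reviews$asin, FUN = var)
group_n   <- ave(reviews$overall, reviews$asin, FUN = length)

# The variance is NA exactly where the product has a single review
print(data.frame(reviews, group_var, group_n))
identical(is.na(group_var), group_n == 1)
```

If the same check holds on the real data, the 289 NA rows should all belong to products that appear once.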
Is it possible that what you're seeing is that "empty" groups are not showing up? You can change the default with .drop.
When .drop = TRUE, empty groups are dropped.

How does R treat NA's for significance test?

I have a large dataframe where some of the columns have NA as a result of taking the log of 0.
I have been doing various tests on the data (ANOVA, Tukey, Kruskal Wallis, Mann Whitney) but I couldn't figure out what is happening to the NA values.
Is R excluding those values completely?
Yes. The behavior of R regarding missing observations is given by
options("na.action")
which, by default, is
> options("na.action")
$na.action
[1] "na.omit"
So for many functions like the ones you mentioned, R only considers complete observations, i.e., rows with no NA.
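A quick way to see this is to count the observations a model actually used, or to compare a test on the full data with the same test on the complete rows. This is only a sketch with made-up data (dat, y and g are illustrative names), and it assumes the formula interface is used. It also checks one thing worth verifying in the question's setup: in R, log(0) evaluates to -Inf rather than NA, and na.omit does not remove -Inf.

```r
# Hypothetical data with two missing responses
dat <- data.frame(
  y = c(1.2, 3.4, NA, 2.2, 5.1, NA, 4.8, 3.9),
  g = factor(rep(c("a", "b"), each = 4))
)

getOption("na.action")          # "na.omit" unless it has been changed

# lm() silently drops the rows with NA: 6 of the 8 rows are used
fit <- lm(y ~ g, data = dat)
nobs(fit)

# The formula interface of kruskal.test() does the same, so the
# statistic matches the one computed on complete rows only
kw_all      <- kruskal.test(y ~ g, data = dat)
kw_complete <- kruskal.test(y ~ g, data = dat[complete.cases(dat), ])
all.equal(kw_all$statistic, kw_complete$statistic)

# log(0) is -Inf, which is not treated as missing
is.na(log(0))
```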

How to do a PCA with 0 (zero) values

I want to do a PCA in R with monthly rainfall values. Since there is no rain during winter, quite a few values in my columns are 0.
When I run the PCA, the following message appears in the console: Error in cov.wt(z) : 'x' must contain finite values only
I think what R is telling me here is that it does not like my 0 values.
So, I tried to change my 0 values to 'real numbers' by multiplying everything with 1.0000000001. But even if I do that and run R again with the new values, it pops up with the same message.
I read that I would need to either get rid of the rows with any missing values in them (which I can't) or use a PCA code that can deal with missing values by somehow imputing them. But my 0's are actual values, not missing values.
I find a lot of information on the web on how to deal with missing values or NA values but nothing on how to deal with zero values. Does anyone have any suggestions how I can do this? Many thanks for your help!
My guess is that "Error in cov.wt(z) : 'x' must contain finite values only" is complaining that some covariances are non-finite, i.e. NA/NaN. This can happen if you have variables that have a standard deviation of 0.
Example code:
latent = rnorm(10)
data = data.frame(rep(0,10), #10 0's
latent+rnorm(10), #latent and noise
latent+rnorm(10), #
latent+1.5*rnorm(10)) #
colnames(data) = c("zeros","var1","var2","var3")
library(psych)
principal(data) #error!
principal(data[-1]) #no errors
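Before running the PCA it can help to check each column for the two likely culprits: non-finite entries (NA, NaN, Inf) and zero variance. The following is a rough diagnostic sketch in base R, using prcomp rather than psych::principal, with a simulated data frame (rain_data is a made-up name) in place of the real rainfall table:

```r
set.seed(42)
latent <- rnorm(10)

# Hypothetical data: one all-zero column plus three noisy variables
rain_data <- data.frame(
  zeros = rep(0, 10),
  var1  = latent + rnorm(10),
  var2  = latent + rnorm(10),
  var3  = latent + 1.5 * rnorm(10)
)

# Count NA/NaN/Inf entries per column
n_nonfinite <- sapply(rain_data, function(col) sum(!is.finite(col)))

# Flag columns that never vary (standard deviation of exactly 0)
is_constant <- sapply(rain_data, function(col) sd(col) == 0)

print(n_nonfinite)
print(is_constant)

# Keep only columns that are fully finite and actually vary
usable <- rain_data[, !is_constant & n_nonfinite == 0, drop = FALSE]

pca <- prcomp(usable, scale. = TRUE)
summary(pca)
```

Zeros by themselves are ordinary finite values, so a column containing some zeros passes both checks; only a column that is constant, or one containing NA/Inf, gets flagged.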

subset indexing in r

I have a dataframe ma
It has a factor called type.
type has the following levels: I210, I210plus, I210plusc, KV2c, KV2cplus
I'd like to put some of these levels in a vector, say, selected_types
so, selected_types<-c("I210plusc","KV2c")
then, have this command subset the dataframe ma
ma1<-subset(ma, type==selected_types)
such that ma1 would be a subset of ma consisting of only the observations that had
type I210plusc and KV2c
however, when I do this, the number of observations in the resulting dataframe ma1 is less than the sum of the occurrences of the two types in selected_types from the original ma
Any ideas on what I'm doing incorrectly?
Thank you
I originally had this in a comment, but it's a bit lengthy, plus I wanted to add to it. Here some details on what's happening:
What you're doing with == is recycling your length-two vector, so that every odd row is compared to "I210plusc" and every even row to "KV2c". Your final result is therefore the data frame of odd rows that are "I210plusc" and even rows that are "KV2c"; matching rows in the "wrong" position are dropped.
An alternate solution that might make the issue clear is as follows:
subset(ma, type == selected_types[[1]] | type == selected_types[[2]])
Or, more gracefully:
subset(ma, type %in% selected_types)
The %in% operator returns a logical vector of same length as type with TRUE for every position in type that "is in" selected_types (hence the name of the operator).
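The difference is easy to see on a small example. This sketch uses an invented six-row version of ma so that the recycling happens without a length warning:

```r
# Hypothetical stand-in for the real data frame
ma <- data.frame(
  type = factor(c("KV2c", "I210plusc", "I210plusc", "KV2c", "I210", "KV2c")),
  id   = 1:6
)
selected_types <- c("I210plusc", "KV2c")

# == recycles selected_types down the rows:
# odd rows are compared to "I210plusc", even rows to "KV2c"
by_equals <- subset(ma, type == selected_types)

# %in% tests each row against the whole set
by_in <- subset(ma, type %in% selected_types)

nrow(by_equals)   # only rows that match in the "right" position
nrow(by_in)       # every row whose type is one of the selected ones
```

Here five rows have one of the selected types, but the == version keeps only the three whose position happens to line up with the recycled vector.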
