How to do a PCA with 0 (zero) values - r

I want to do a PCA in R with monthly rainfall values. Since there is no rain during winter, quite a few values in my columns are 0.
When I run the PCA, the following message appears in the console: Error in cov.wt(z) : 'x' must contain finite values only
I think what R is telling me here is that it does not like my 0 values.
So, I tried to change my 0 values to 'real numbers' by multiplying everything by 1.0000000001. But even after doing that and running the PCA again with the new values, the same message appears.
I read that I would need to either get rid of the rows with any missing values in them (which I can't) or use a PCA code that can deal with missing values by somehow imputing them. But my 0's are actual values, not missing values.
I find a lot of information on the web on how to deal with missing values or NA values but nothing on how to deal with zero values. Does anyone have any suggestions how I can do this? Many thanks for your help!

My guess is that "Error in cov.wt(z) : 'x' must contain finite values only" is complaining about non-finite values, i.e. NA/NaN/Inf, rather than about zeros as such. A related problem arises if you have variables that have a standard deviation of 0, because their correlations with the other variables are undefined.
Example code:
library(psych)

latent <- rnorm(10)
data <- data.frame(zeros = rep(0, 10),              # ten 0's, so zero variance
                   var1 = latent + rnorm(10),       # latent plus noise
                   var2 = latent + rnorm(10),
                   var3 = latent + 1.5 * rnorm(10))
principal(data)      # error!
principal(data[-1])  # no errors once the constant column is dropped
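To find out whether non-finite values rather than zeros are the cause, one option is to count them per column before running the PCA. The sketch below uses a made-up `rain` data frame (the column names and values are assumptions, not the asker's data) and base R's `princomp`; dropping the affected rows is only one possible way to handle them.

```r
set.seed(1)

# Hypothetical monthly rainfall: many genuine zeros, plus one NA and one Inf.
rain <- data.frame(jan = c(rep(0, 6), runif(6, 0, 50)),
                   feb = c(runif(6, 0, 50), rep(0, 6)),
                   mar = runif(12, 0, 80))
rain$jan[3] <- NA
rain$mar[9] <- Inf

# Count entries per column that are NA, NaN, Inf or -Inf. Zeros are finite.
non_finite <- sapply(rain, function(col) sum(!is.finite(col)))
print(non_finite)

# Keep only the rows in which every value is finite, then run the PCA.
all_finite <- Reduce(`&`, lapply(rain, is.finite))
rain_clean <- rain[all_finite, ]
pca <- princomp(rain_clean)
print(summary(pca))
```

If the counts are all zero, the data themselves are fine and the problem is more likely a constant column, as in the example above.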

Related

How to discretize a variable with only 2 distinct values?

I am trying to discretize the variable DEATH into two bins.
DEATH can only take the value 0 or 1.
The command I am using is as follows:
# convert DEATH to a factor variable using unsupervised discretization with equal frequency binning
burn$DEATH<-discretize(burn$DEATH, method="interval", breaks=2)
summary(burn$DEATH)
However, my output is the entire range of values. I would like to show the individual count for 0 and 1.
My current output:
summary(burn$DEATH)
[0,1]
1000
I think the user specified method would be the solution but when I tried this, I received an error stating that 'x must be numeric'
burn$FACILITY <- discretize(burn$FACILITY, method="fixed", breaks=c(-Inf,0, 1, Inf))
Additional note: This is for a class so I'm assuming they wouldn't want us to use a method that we haven't discussed yet. I'd prefer to use a discretization method if possible! Someone suggested I use the factor() command, but how do I see the summary statistics with the levels if I do this?
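On the last point, a factor already reports per-level counts through `summary()` or `table()`. The following is a minimal sketch with a made-up 0/1 vector standing in for `burn$DEATH`; the labels "Alive" and "Dead" are assumptions chosen for illustration.

```r
# Hypothetical stand-in for burn$DEATH: a 0/1 indicator.
death <- c(0, 1, 0, 0, 1, 0, 0, 0, 1, 0)

# Convert to a factor with one level per distinct value.
death_f <- factor(death, levels = c(0, 1), labels = c("Alive", "Dead"))

# summary() on a factor returns the count for each level.
counts <- summary(death_f)
print(counts)

# table() gives the same counts, and prop.table() turns them into proportions.
print(table(death_f))
print(prop.table(table(death_f)))
```

With the real data this would be `burn$DEATH <- factor(burn$DEATH)` followed by `summary(burn$DEATH)`.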

Avoid NaN and Inf when dividing in R (using within formula)

I'm trying to add a column to my data set in R with the within() function.
Data set name: Full_Stats
Objective: Add a Minutespergoal column using within()
Code
Full_Stats2<-within(Full_Stats,
{Minutespergoal<-Minutes_played/Goal })
The formula works fine, but I'd like to avoid having NaN and Inf in the result. How could I fix this?
Please let me know if you have any questions.
Thanks

NaN results from dividing zero by zero, and Inf results from dividing a non-zero number by zero. You can avoid these by making sure that your denominator Goal is never zero. Assuming you want to remove the rows where it is, you could try:
Full_Stats2 <- within(Full_Stats[Full_Stats$Goal != 0, ],
{Minutespergoal <- Minutes_played / Goal})
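If the rows with zero goals should stay in the data set, an alternative is to write NA instead of dividing. This is a minimal sketch with a made-up `Full_Stats` (the `Player` column and all values are assumptions):

```r
# Hypothetical stand-in for Full_Stats.
Full_Stats <- data.frame(Player = c("A", "B", "C"),
                         Minutes_played = c(900, 0, 450),
                         Goal = c(10, 0, 0))

# Divide only where Goal is non-zero; otherwise record NA.
Full_Stats2 <- within(Full_Stats, {
  Minutespergoal <- ifelse(Goal == 0, NA_real_, Minutes_played / Goal)
})
print(Full_Stats2)
```

All three rows are kept, and later summaries can skip the missing ratios with `na.rm = TRUE`.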

kproto function of clustMixType r package

I am trying to run a k-prototypes clustering algorithm on my data using the kproto function of the clustMixType package in R, but it's not working.
I have 1000 rows and 5 variables: only one is categorical, the others have been scaled (still not working when not scaled)
It keeps saying:
Estimated lambda: Inf
Equal prototyps merged. Cluster number reduced to: 3
Error in table(clusters) : all arguments must have the same length
In addition: Warning message:
In kproto.default(inputdata_test, 4) :
All categorical variables have zero variance.
However, my categorical variable has at least 3 levels, the numeric variables all have at least 2 distinct values, and there are no NaN values in the data frame.
There is a small extract of my data below

First, don't use the scale() function with kproto; use it when you run k-means.
Second, recode the character values of the categorical feature as numbers. For example, Other=1, Tablet=2, Mobile Phone=3 ...
As for your error message:
all arguments must have the same length
this means that the variables involved do not all have the same number of rows, so check your data set.
And the warning
All categorical variables have zero variance.
means there is a problem with the variance of the categorical variable.
You can check this with:
lambdaest(inputdata_test)
Have a nice day.

kproto also throws Error in table(clusters) : all arguments must have the same length if there are NAs in the data.
I fixed this using na.omit on my data frame.
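The clean-up steps can be sketched in base R as below. The data frame is made up (the column names `x1` to `x4` and `device` are assumptions). The factor conversion rests on my understanding that kproto expects a data frame of numeric and factor columns, so a categorical column stored as character, or recoded as plain numbers, may not be treated as categorical. The kproto call itself is left commented out so the block runs without the package.

```r
set.seed(42)

# Hypothetical stand-in for inputdata_test: four numeric columns, one categorical.
inputdata_test <- data.frame(
  x1 = rnorm(8), x2 = rnorm(8), x3 = rnorm(8), x4 = rnorm(8),
  device = c("Other", "Tablet", "Mobile Phone", "Tablet",
             "Other", NA, "Mobile Phone", "Tablet"),
  stringsAsFactors = FALSE
)

# Drop rows containing any NA, then store the categorical column as a factor.
clean <- na.omit(inputdata_test)
clean$device <- factor(clean$device)

# Sanity checks before clustering: column classes and level counts.
print(sapply(clean, class))
print(table(clean$device))

# With clustMixType installed, the clustering call would then be (not run here):
# library(clustMixType)
# fit <- kproto(clean, k = 4)
```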

How does R treat NA's for significance test?

I have a large dataframe where some of the columns have NA as a result of taking the log of 0.
I have been doing various tests on the data (ANOVA, Tukey, Kruskal Wallis, Mann Whitney) but I couldn't figure out what is happening to the NA values.
Is R excluding those values completely?

Yes. The behavior of R regarding missing observations is given by
options("na.action")
which, by default, is
> options("na.action")
$na.action
[1] "na.omit"
So for many functions like the ones you mentioned, R only considers complete observations, i.e., rows with no NA.
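A quick way to confirm this is to compare the number of rows in the data with the number of observations a fitted model actually used. The data below are made up for illustration. The sketch also shows that `log(0)` evaluates to `-Inf` rather than `NA`, so such values are not removed by `na.omit` and may need to be recoded as NA explicitly first.

```r
# Hypothetical data with two missing responses.
dat <- data.frame(group = rep(c("a", "b"), each = 4),
                  y = c(1.2, NA, 3.1, 2.4, 4.0, 3.3, NA, 5.1))

# Fit a one-way ANOVA through the formula interface.
fit <- aov(y ~ group, data = dat)

# Number of observations actually used, versus rows supplied.
n_used <- nobs(fit)
print(c(rows = nrow(dat), used = n_used))

# log(0) is -Inf, which is.na() does not flag.
print(log(0))
print(is.na(log(0)))
```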

Finding Chi-Squared with NA values

I have two vectors, both of which have NA values in them. I am trying to find a Chi-Squared value for a table I created with the two vectors, but I get this error:
Error in chisq.test(data.table) :
all entries of 'x' must be nonnegative and finite
Is there a code to remove the NA values from the table?
I did find some codes to do this for vectors but I am not sure how this would work. If an NA value gets deleted from one vector, will the corresponding value from the other vector not go into the Chi-Squared calculation?
The vectors have over 8,000 values each and each row corresponds to one subject, so if that subject failed to answer a question, I wouldn't want to use his/her other answer either. I hope that makes sense.

One solution would be to pull out the NA values from your data before you even run the test.
Reproducibility would be helpful here, but I'm guessing your data look something like this:
control <- c(runif(5), NA, runif(4))
treatment <- c(runif(3), NA, runif(6))
In this case, by putting your data into a data frame, you can drop both values for every subject with an NA in either value:
df <- data.frame(control, treatment)
df <- df[complete.cases(df), ]  # keep only subjects with both values present
Your data now only include subjects without any missing data, and can be tested as you please.
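For categorical answers, the same idea can be sketched end to end as follows. The vectors `q1` and `q2` are made-up survey responses, not the asker's data. Note that `table()` with its default `useNA = "no"` already leaves out any pair in which either value is NA, so the explicit filtering mainly makes that exclusion visible.

```r
# Hypothetical survey answers; NA marks a question the subject skipped.
q1 <- c("yes", "no", NA, "yes", "no", "yes", "no", NA, "yes", "no", "yes", "no")
q2 <- c("a", "b", "a", NA, "b", "a", "a", "b", "b", "b", "a", "a")

# Keep only subjects who answered both questions.
answers <- data.frame(q1, q2)
complete <- answers[complete.cases(answers), ]

# Cross-tabulate and test; the tiny counts here trigger an approximation
# warning, which is suppressed because this is only a demonstration.
tab <- table(complete$q1, complete$q2)
res <- suppressWarnings(chisq.test(tab))
print(tab)
print(res$p.value)

# table() on the raw vectors drops the incomplete pairs by default as well.
print(sum(table(q1, q2)))
```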
