kproto function of the clustMixType R package

I am trying to run a k-prototypes clustering algorithm on my data using the kproto function of the clustMixType package in R, but it's not working.
I have 1000 rows and 5 variables: only one is categorical, and the others have been scaled (it still fails when they are not scaled).
It keeps saying:
Estimated lambda: Inf
Equal prototypes merged. Cluster number reduced to: 3
Error in table(clusters) : all arguments must have the same length
In addition: Warning message:
In kproto.default(inputdata_test, 4) :
All categorical variables have zero variance.
Yet my categorical variable has at least 3 levels, the numeric variables all have at least 2 distinct values, and there are no NaN values in the data frame.
There is a small extract of my data below

First, don't apply scale() to the data you pass to kproto(); use scaling when you run k-means instead.
Second, encode the levels of the categorical feature as numbers, for example Other = 1, Tablet = 2, Mobile Phone = 3, and so on.
Also, looking at your error message:
all arguments must have the same length
this means your variables have different numbers of rows, so check your data set.
And:
All categorical variables have zero variance.
this means there is a problem with the variance of your categorical variable.
You can check it with this code:
lambdaest(x)   # where x is the data frame you pass to kproto()
Have a nice day.
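A minimal sketch of those checks, assuming the data frame is called inputdata_test (as in the error message) and the categorical column is device_type (a hypothetical name); note that kproto() treats a column as categorical when it is stored as a factor:

library(clustMixType)

# store the categorical column as a factor so kproto() recognises it as categorical
inputdata_test$device_type <- as.factor(inputdata_test$device_type)

lambdaest(inputdata_test)           # inspect the estimated lambda before clustering
kpres <- kproto(inputdata_test, 4)  # k-prototypes with k = 4, as in the question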

kproto also throws Error in table(clusters) : all arguments must have the same length if there are NAs in the data.
I fixed this using na.omit on my data frame.

Related

How to discretize a variable with only 2 distinct values?

I am trying to discretize the variable- DEATH, into two bins.
DEATH can only be a value of 0 or 1
The command I am using is as follows:
# convert DEATH to a factor variable using unsupervised discretization with equal frequency binning
burn$DEATH <- discretize(burn$DEATH, method = "interval", breaks = 2)
summary(burn$DEATH)
However, my output is the entire range of values. I would like to show the individual count for 0 and 1.
My current output:
summary(burn$DEATH)
[0,1]
1000
I think the user-specified ("fixed") method would be the solution, but when I tried it I received an error stating that 'x must be numeric':
burn$FACILITY <- discretize(burn$FACILITY, method = "fixed", breaks = c(-Inf, 0, 1, Inf))
Additional note: This is for a class so I'm assuming they wouldn't want us to use a method that we haven't discussed yet. I'd prefer to use a discretization method if possible! Someone suggested I use the factor() command, but how do I see the summary statistics with the levels if I do this?
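For reference, a minimal sketch of the factor() route mentioned above, assuming DEATH still holds the original 0/1 values (i.e., run this instead of, not after, the discretize() call); summary() on a factor reports the count at each level, which is the 0/1 breakdown being asked for:

burn$DEATH <- factor(burn$DEATH, levels = c(0, 1))  # on the original 0/1 column
summary(burn$DEATH)   # counts per level, i.e. how many 0s and how many 1s
table(burn$DEATH)     # equivalent count table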

adehabitat compana() doesn't work or returns lambda=NaN

I'm trying to do the compositional analysis of habitat use with the compana() function in the adehabitatHS package (I actually use adehabitat because I can't install adehabitatHS).
compana() needs two matrices: one of habitat use and one of available habitat.
When I try to run the function it doesn't work (it never stops), so I have to abort the RStudio session.
I read that one problem could be the 0-values in some habitat types for some animals in the 'available' matrix, whereas other animals have positive values for the same habitat. As other people have done, I replaced the 0-values with small values (0.001), ran compana and it worked, BUT the lambda values it returned were NaN.
The problem is similar to the one found here
adehabitatHS compana test returns lambda = NaN?
They said they resolved it by using the counts (integers), not the proportions, as the 'used' habitat matrix.
I tried this approach too, but nothing changed (it freezes when there are 0-values in the available matrix, or returns NaN for lambda if I replace the 0-values with small values).
I checked all the matrices and they are fine, so I'm going crazy.
I have 6 animals and 21 habitat types.
Can you resolve this BIG problem?
PARTIALLY SOLVED: After asking some researchers, I was told that the number of habitats should not be higher than the number of animals.
So I merged some habitats in order to have six habitats for six animals, and now the function works when I replace the 0-values in the 'available' matrix with small values (e.g. 0.001).
Unfortunately this is not what I wanted, because I needed to find values (rankings, log-ratios, etc.) for each habitat type (originally there were 21).
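For reference, a minimal sketch of such a call; the matrix contents below are placeholders rather than real data, and compana() also documents an rnv argument for replacing zero values (see ?compana), which may be worth checking before hand-editing the matrices:

library(adehabitatHS)  # the question uses the older adehabitat package; compana() exists in both

# one row per animal, one column per habitat type; rows of each matrix sum to 1
habs  <- paste0("hab", 1:6)
used  <- matrix(runif(36), nrow = 6, dimnames = list(NULL, habs))
avail <- matrix(runif(36), nrow = 6, dimnames = list(NULL, habs))
used  <- used  / rowSums(used)
avail <- avail / rowSums(avail)

res <- compana(used, avail, test = "randomisation", nrep = 500)
res   # prints lambda, its p-value and the habitat ranking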

How does R treat NAs in significance tests?

I have a large data frame where some of the columns contain NAs as a result of taking the log of 0.
I have been running various tests on the data (ANOVA, Tukey, Kruskal-Wallis, Mann-Whitney) but I couldn't figure out what happens to the NA values.
Is R excluding those values completely?
Yes. The behavior of R regarding missing observations is given by
options("na.action")
which, by default, is
> options("na.action")
$na.action
[1] "na.omit"
So for many functions like the ones you mentioned, R only considers complete observations, i.e., rows with no NA.
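A small illustration with made-up data: the observation containing the NA is silently dropped before each test is run.

x <- c(1.2, 2.3, NA, 4.1, 3.8, 2.9)
g <- factor(c("a", "a", "a", "b", "b", "b"))

summary(aov(x ~ g))   # fitted on the 5 complete observations only
kruskal.test(x ~ g)   # likewise drops the observation with the NA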

How to do a PCA with 0 (zero) values

I want to do a PCA in R with monthly rainfall values. Since there is no rain during winter, quite a few values in my columns are 0.
When I run the PCA, the following message appears in the console: Error in cov.wt(z) : 'x' must contain finite values only
I think what R is telling me here is that it does not like my 0 values.
So, I tried to change my 0 values to 'real numbers' by multiplying everything by 1.0000000001. But even if I do that and run the PCA again with the new values, the same message pops up.
I read that I would need to either get rid of the rows with any missing values in them (which I can't) or use a PCA implementation that can deal with missing values by somehow imputing them. But my 0's are actual values, not missing values.
I find a lot of information on the web on how to deal with missing values or NA values but nothing on how to deal with zero values. Does anyone have any suggestions how I can do this? Many thanks for your help!
My guess is that "Error in cov.wt(z) : 'x' must contain finite values only" is complaining that some covariances are non-finite, i.e. NA/NaN. This can happen if you have variables that have a standard deviation of 0.
Example code:
latent <- rnorm(10)
data <- data.frame(rep(0, 10),               # ten 0's: a zero-variance column
                   latent + rnorm(10),       # latent signal plus noise
                   latent + rnorm(10),
                   latent + 1.5 * rnorm(10))
colnames(data) <- c("zeros", "var1", "var2", "var3")

library(psych)
principal(data)      # error: the constant "zeros" column breaks the correlation matrix
principal(data[-1])  # no error once that column is dropped
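A quick way to spot and drop zero-variance columns before running a PCA (a base R sketch on the same example data):

sapply(data, sd)                      # the "zeros" column shows sd = 0
keep <- sapply(data, sd) > 0
pca  <- prcomp(data[, keep], scale. = TRUE)  # PCA on the remaining columns
summary(pca)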

Missing data and attribute selection

My data is 1,785,000 records with 271 features. I'm trying to reduce the number of features used to build the model.
Q1. While exploring the data I found that some features are almost entirely missing: for example, only 25 records have a value for one feature and all the other records are missing. I thought such a feature is not informative enough and is better eliminated; am I right? And if so, at what level can I do that? I mean, if 90%, 80%, etc. of a feature's values are missing, when can I decide to get rid of it? (Take into consideration that the dependent variable is Y/N and only 1.157% of the whole data belongs to Y.)
Q2. For each individual in the dataset there are 64 trait_type features, each of which can take the value 1, 3 or 5. My question is: if some trait_type takes only the value 5, or is missing for all records, does it carry any information, or can that feature also be eliminated?
Q3. If the choice is to delete these features, how do I delete a column from a data.frame in R?
Thank you
Update:
I'm trying to use the caret package to do the variable selection.
I applied this:
ctrl <- rfeControl(functions = lmFuncs, method = "cv", verbose = FALSE,
                   returnResamp = "final")
lmprofile <- rfe(x, y, sizes = subsets, rfeControl = ctrl)
where x is the data.frame holding the 270 predictor variables and y is the factor for the dependent variable, which takes the values Y/N. I got this error:
Error in { :
task 1 failed - "contrasts can be applied only to factors with 2 or more levels"
In addition: There were 11 warnings (use warnings() to see them)
Any help please?
Just because much of the data in a column is missing doesn't mean that column will not be predictive; it's much the same as having many identical values in that column.
Of course there is a cutoff: if a column can only help you distinguish between a few cases (out of many), it can be removed and will only affect overall model strength a little.
To help you decide whether to keep a column, you could build a univariate model with it, where the dataset includes just that column and the dependent variable, and look at the strength of that model. If it's not much better than random, then it's probably safe to drop the column.
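A rough sketch of that univariate check using logistic regression; the data frame and column names here (df, y, trait_17) are hypothetical:

# keep only the rows where the candidate feature is observed, so both models see the same data
sub <- df[!is.na(df$trait_17), c("y", "trait_17")]

m1 <- glm(y ~ trait_17, data = sub, family = binomial)  # feature-only model
m0 <- glm(y ~ 1,        data = sub, family = binomial)  # intercept-only (null) model

anova(m0, m1, test = "Chisq")  # likelihood-ratio test: does the feature add anything?
AIC(m0, m1)                    # or compare AIC; if m1 is barely better, drop the column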
