How to discretize a variable with only 2 distinct values? - r

I am trying to discretize the variable DEATH into two bins.
DEATH can only take the value 0 or 1.
The command I am using is as follows:
# Convert DEATH to a factor variable using unsupervised discretization
# (method="interval" gives equal-width bins; equal-frequency binning would be method="frequency")
burn$DEATH<-discretize(burn$DEATH, method="interval", breaks=2)
summary(burn$DEATH)
However, my output is the entire range of values as a single bin. I would like to show the individual counts for 0 and 1.
My current output:
summary(burn$DEATH)
[0,1]
1000
I think the user-specified method (method="fixed") would be the solution, but when I tried it, I received an error stating that 'x must be numeric':
burn$FACILITY <- discretize(burn$FACILITY, method="fixed", breaks=c(-Inf,0, 1, Inf))
Additional note: This is for a class so I'm assuming they wouldn't want us to use a method that we haven't discussed yet. I'd prefer to use a discretization method if possible! Someone suggested I use the factor() command, but how do I see the summary statistics with the levels if I do this?
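For the factor() suggestion mentioned above, a minimal sketch in base R might look like the following. The small `burn` data frame here is a hypothetical stand-in (only the 0/1 DEATH column is taken from the question), and this is not a discretization method in the sense of the class; it only shows how the per-level counts can be displayed.

```r
# Hypothetical stand-in for the burn data; only a 0/1 DEATH column is assumed.
burn <- data.frame(DEATH = c(0, 1, 0, 0, 1, 0))

# Convert the 0/1 column to a factor with explicit levels.
burn$DEATH <- factor(burn$DEATH, levels = c(0, 1))

# summary() on a factor reports one count per level.
death_counts <- summary(burn$DEATH)
print(death_counts)

# table() gives the same counts in table form.
print(table(burn$DEATH))
```

Because DEATH is a factor after this step, summary() lists the levels 0 and 1 with their counts instead of a single range.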

Related

Averaging different length vectors with same domain range in R

I have a dataset that looks like the one shown in the code below.
I am guaranteed that the x values (the domain) of each variable are always between 0 and 1. The y values (the co-domain) can vary; they are also bounded, but within a larger range.
I am trying to average the y values across the different variables at common x values.
I would like some kind of selective averaging, but I am not sure how to do this in R.
ax=c(0.11,0.22,0.33,0.44,0.55,0.68,0.89)
ay=c(0.2,0.4,0.5,0.42,0.5,0.43,0.6)
bx=c(0.14,0.23,0.46,0.51,0.78,0.91)
by=c(0.1,0.2,0.52,0.46,0.4,0.41)
qx=c(0.12,0.27,0.36,0.48,0.51,0.76,0.79,0.97)
qy=c(0.03,0.2,0.52,0.4,0.45,0.48,0.61,0.9)
a<-list(ax,ay)
b<-list(bx,by)
q<-list(qx,qy)
What I would like to have is something like
avgd_x = c(0.12,0.27,0.36,0.48,0.51,0.76,0.79,0.97)
and
avgd_y, whose contents would be computed as follows:
find the values of ay and by at x = 0.12 and take the mean of those two values and the corresponding qy value.
Do the same for every x value in the vector with the largest number of elements.
How can I do this in R ?
P.S.: This is a toy dataset. My real dataset is spread over several files, which I read with a custom function, but the raw data is available in the form shown in the code above.
Edit:
Some clarification:
avgd_y would have the length of the longest vector. In the case above, avgd_y would be (ay' + by' + qy)/3, where ay' and by' are the vectors c(ay(qx(i))) and c(by(qx(i))) for i from 1 to length(qx); that is, ay' and by' hold the values of ay and by interpolated at the data points of qx.
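One possible way to sketch the interpolation described in the clarification is with approx() from base R, which does linear interpolation. Linear interpolation is an assumption here, as is rule = 2, which reuses the nearest endpoint value when a qx point lies outside the x range of a or b (for example 0.97, or 0.12 for b); the question does not say how such points should be handled.

```r
ax <- c(0.11, 0.22, 0.33, 0.44, 0.55, 0.68, 0.89)
ay <- c(0.2, 0.4, 0.5, 0.42, 0.5, 0.43, 0.6)
bx <- c(0.14, 0.23, 0.46, 0.51, 0.78, 0.91)
by <- c(0.1, 0.2, 0.52, 0.46, 0.4, 0.41)
qx <- c(0.12, 0.27, 0.36, 0.48, 0.51, 0.76, 0.79, 0.97)
qy <- c(0.03, 0.2, 0.52, 0.4, 0.45, 0.48, 0.61, 0.9)

# Linearly interpolate ay and by at the x positions of the longest vector (qx).
# rule = 2 (an assumption): outside the observed x range, use the nearest endpoint value.
ay_at_qx <- approx(ax, ay, xout = qx, rule = 2)$y
by_at_qx <- approx(bx, by, xout = qx, rule = 2)$y

# Average the three series point by point on the common grid.
avgd_x <- qx
avgd_y <- (ay_at_qx + by_at_qx + qy) / 3
print(data.frame(avgd_x, avgd_y))
```

With the default rule = 1, approx() would return NA outside each vector's range instead, and those points would then need na.rm-style handling when averaging.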

Making a histogram

This sounds pretty basic, but every time I try to make a histogram, R says that x needs to be numeric. I've been looking everywhere but can't find an answer relating to my problem. I have data with 240 observations of 5 variables:
Nipper length
Number of Whiskers
Crab Carapace
Sex
Estuary location
There are 3 locations, and I'm trying to make a histogram of nipper length for each one.
I've tried making new factors and levels, with the 80 observations in each location, but it's not working.
Crabs.data <-read.table(pipe("pbpaste"),header = FALSE)##Mac
names(Crabs.data)<-c("Crab Identification","Estuary Location","Sex","Crab Carapace","Length of Nipper","Number of Whiskers")
Crabs.data<-Crabs.data[,-1]
attach(Crabs.data)
hist(`Length of Nipper`~`Estuary Location`)
Error in hist.default(`Length of Nipper` ~ `Estuary Location`) :
'x' must be numeric
Instead of correct result
hist() doesn't seem to like taking more than one variable.
I think you'd have the best luck subsetting the data, that is, making a vector of nipper lengths for all crabs in a given estuary.
crabs.data <- read.table("whatever you're calling it")
# set names(crabs.data) as you have it in the question
# replace "Location" with the name of one estuary
Estuary1 <- as.vector(unlist(subset(crabs.data, `Estuary Location` == "Location", select = `Length of Nipper`)))
hist(Estuary1)
Repeat the last two lines for your other two estuaries. You may not need the unlist() command, depending on your table. I've tended to need it for Excel files, but I don't know what format your table is in (that would've been helpful).
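The subsetting idea above can also be sketched with split(), which builds one numeric vector per estuary in a single step. The data frame below is made up for illustration (three hypothetical locations "A", "B", "C" with 80 random nipper lengths each); only the column names come from the question.

```r
set.seed(1)

# Made-up stand-in for the crab data: 3 locations with 80 observations each.
Crabs.data <- data.frame(
  location = rep(c("A", "B", "C"), each = 80),
  nipper   = rnorm(240, mean = 10, sd = 2)
)
names(Crabs.data) <- c("Estuary Location", "Length of Nipper")

# hist() needs a numeric vector, so check the column type first.
stopifnot(is.numeric(Crabs.data[["Length of Nipper"]]))

# One numeric vector of nipper lengths per estuary.
nipper_by_location <- split(Crabs.data[["Length of Nipper"]],
                            Crabs.data[["Estuary Location"]])

# Draw one histogram per estuary.
for (loc in names(nipper_by_location)) {
  hist(nipper_by_location[[loc]], main = loc, xlab = "Length of Nipper")
}
```

If is.numeric() returns FALSE on the real data, the column was probably read in as text or as a factor, which would also produce the "'x' must be numeric" error.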

kproto function of clustMixType r package

I am trying to run a kprototype clustering algorithm on my data using the kproto function of the clustMixType package in R but it's not working
I have 1000 rows and 5 variables: only one is categorical, the others have been scaled (still not working when not scaled)
It keeps saying:
Estimated lambda: Inf
Equal prototyps merged. Cluster number reduced to: 3
Error in table(clusters) : all arguments must have the same length
In addition: Warning message:
In kproto.default(inputdata_test, 4) :
All categorical variables have zero variance.
However, my categorical variable has at least 3 levels, the numeric variables all have at least 2 distinct values, and there are no NaN values in the data frame.
There is a small extract of my data below
First, don't apply the scale() function before kproto; use it when you run k-means.
Second, recode the categorical feature's character values as numbers. For example, Other=1, Tablet=2, Mobile Phone=3 ...
As for your error message,
all arguments must have the same length
this means the variables in your data have different numbers of rows, so check your data set.
And
All categorical variables have zero variance.
means there is a problem with the variance of the categorical variable.
You can check this with the following code:
lambdaest(inputdata_test)  # the data frame passed to kproto
Have a nice day.
kproto also throws "Error in table(clusters) : all arguments must have the same length" if there are NAs in the data.
I fixed this using na.omit on my data frame.
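A minimal sketch of the na.omit fix, using base R only. The data frame contents are invented for illustration; `inputdata_test` is the name that appears in the warning message, and the kproto call is left as a comment so the snippet runs without clustMixType installed.

```r
# Invented example data with missing values in both a numeric and a categorical column.
inputdata_test <- data.frame(
  num1   = c(1.2, NA, 0.5, 2.1),
  device = factor(c("Tablet", "Other", NA, "Mobile Phone"))
)

# Count the missing values per column before clustering.
print(colSums(is.na(inputdata_test)))

# Keep only the complete rows.
inputdata_clean <- na.omit(inputdata_test)
print(nrow(inputdata_clean))

# The clustering would then be run on the cleaned data, for example:
# kproto(inputdata_clean, 4)
```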

Is there anything like numerical variable with labels?

I have a numerical variable with discrete levels, that have a special meaning for me, e.g.
-1 'less than zero'
0 'zero'
1 'more than zero'
I know that I can convert the variable to a factor/ordinal and keep the labels, but then the numerical representation of the variable would be
1 'less than zero'
2 'zero'
3 'more than zero'
which is useless for me. I cannot afford having two copies of the variable, because of memory constraints (it is a very big data.table).
Is there any standard way of adding text labels to certain levels of a numerical (possibly integer) variable, so that I can get nice-looking frequency tables just as if it were a factor, while still being able to treat it as the source numerical variable with its values untouched?
I'm going to say the answer to your question is "no". There's no standard or built-in way of doing what you want, because, as you note, factors have positive non-zero integer codes, and integers can't be denoted by label strings in a vector. Not in a "standard" way, anyway.
So you will have to do the labelling yourself, in whatever outputs you want to present, manually.
Any tricks like keeping your data (once) as a factor and subtracting a number to get the negative values you need (presumably for your analysis) will make a copy of that data. Keep the numbers, do the analysis, then do replacement with the results (which I presume are tables and plots and so aren't as big as the data).
R also doesn't have an equivalent to the "enumerated type" of many languages, which is one way this can be done.
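The "keep the numbers, then relabel the results" approach could be sketched as follows. The vector `x` and the lookup vector `value_labels` are hypothetical names for illustration; the point is that only the small frequency table gets relabelled, while the large numeric column is left untouched.

```r
# Hypothetical numeric column; in practice this would be the big data.table column.
x <- c(-1L, 0L, 1L, 1L, 0L, 1L)

# Lookup from the numeric value (as text) to its label.
value_labels <- c("-1" = "less than zero",
                  "0"  = "zero",
                  "1"  = "more than zero")

# Tabulate the raw numbers, then rename only the table's entries.
freq <- table(x)
names(freq) <- value_labels[names(freq)]
print(freq)

# x itself is still an integer vector with its original values.
print(range(x))
```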
You could use a vector. Would that work?
var <- c(-1,0,1)
names(var) <- c("less than zero", "zero", "more than zero")
that would give you
> var
less than zero           zero more than zero
            -1              0              1
Hope that helps,
Umberto

How to do a PCA with 0 (zero) values

I want to do a PCA in R with monthly rainfall values. Since there is no rain during winter, quite a few values in my columns are 0.
When I run the PCA, the following message appears in the console: Error in cov.wt(z) : 'x' must contain finite values only
I think what R is telling me here is that it does not like my 0 values.
So, I tried to change my 0 values to 'real numbers' by multiplying everything with 1.0000000001. But even if I do that and run R again with the new values, it pops up with the same message.
I read that I would need to either get rid of the rows with any missing values in them (which I can't) or use a PCA code that can deal with missing values by somehow imputing them. But my 0's are actual values, not missing values.
I find a lot of information on the web on how to deal with missing values or NA values but nothing on how to deal with zero values. Does anyone have any suggestions how I can do this? Many thanks for your help!
My guess is that "Error in cov.wt(z) : 'x' must contain finite values only" is complaining that some covariances are non-finite, i.e. NA/NaN. This can happen if you have variables that have a standard deviation of 0.
Example code:
latent <- rnorm(10)
data <- data.frame(rep(0, 10),               # 10 zeros (standard deviation 0)
                   latent + rnorm(10),       # latent plus noise
                   latent + rnorm(10),       # latent plus noise
                   latent + 1.5 * rnorm(10)) # latent plus larger noise
colnames(data) <- c("zeros", "var1", "var2", "var3")
library(psych)
principal(data)     # error!
principal(data[-1]) # no errors
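To find out which columns are the problem before running the PCA, something like the following base R check may help. It is only a sketch: the data are simulated as in the example above, and prcomp() from base R is used here in place of principal() from psych so that no extra package is needed.

```r
set.seed(42)
latent <- rnorm(10)
data <- data.frame(zeros = rep(0, 10),          # constant column
                   var1  = latent + rnorm(10),
                   var2  = latent + rnorm(10))

# Columns whose standard deviation is 0 cannot be standardised.
col_sd <- sapply(data, sd)
constant_cols <- names(col_sd)[col_sd == 0]
print(constant_cols)

# Columns containing NA, NaN or Inf would also trigger "finite values only" errors.
nonfinite_cols <- names(data)[!sapply(data, function(col) all(is.finite(col)))]
print(nonfinite_cols)

# Run the PCA on the remaining columns (prcomp is a stand-in for principal here).
pca <- prcomp(data[setdiff(names(data), constant_cols)], scale. = TRUE)
print(summary(pca))
```

Note that ordinary zeros in a column that also contains non-zero values are not a problem; only a column that is constant (for example, all zeros) has a standard deviation of 0.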
