Cluster analysis on two columns that contain names of persons in R

I am a beginner in R. I have to do a cluster analysis on data that contains two columns with names of persons. I converted it to a data frame, but the columns are of character type, and the dist() function requires numeric data. Example of my data:
Interviewed.Type interviewed.Relation.Type
1. An1 Xuan
2. An2 The
3. An3 Ngoc
4. Bui Thi
5. ANT feed
7. Bach Thi
8. Gian1 Thi
9. Lan5 Thi
.
.
.
1100. Xung Van
I will be grateful for your help.

You can convert a character vector to a factor using factor(). A factor is basically a vector of integer codes together with an attribute giving the text associated with each code; these texts are called levels in R. You can use as.numeric() or unclass() to get at the raw codes, which can then be fed into algorithms that require numbers, such as dist().
Note that the order in which codes are assigned to texts is essentially arbitrary (by default, alphabetical), so the difference between two codes has no meaning in most applications. Calling dist() on this result is therefore technically possible, but not necessarily meaningful. For this reason, the author of this answer is not satisfied with it, even if the original poster seems to be happy about it. :-)
Also note that if you convert several vectors separately, the same number will represent different textual values and vice versa, unless all vectors are composed of exactly the same set of distinct values. Additional care has to be taken if you want the same levels for both factors. One way is to concatenate both vectors, turn the result into a factor, and then split it back into two factor vectors.
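The concatenate-and-split idea could be sketched roughly as follows. The four-row data frame is made up for illustration (only the column names come from the question), and the variable names combined, f1 and f2 are assumptions.

```r
# Hypothetical miniature of the question's data; the names are illustrative only.
people <- data.frame(
  Interviewed.Type          = c("An1", "An2", "Bui", "Thi"),
  interviewed.Relation.Type = c("Xuan", "Thi", "Thi", "An1"),
  stringsAsFactors = FALSE
)

n <- nrow(people)

# Concatenate both columns and build ONE factor, so both share the same levels.
combined <- factor(c(people$Interviewed.Type, people$interviewed.Relation.Type))

# Split the factor back into its two halves; subsetting keeps all levels.
f1 <- combined[seq_len(n)]
f2 <- combined[n + seq_len(n)]

# Integer codes: the same name now maps to the same code in both columns.
codes <- data.frame(interviewed = as.integer(f1), relation = as.integer(f2))
codes
```

The codes are consistent across columns, but they are still only labels, so distances computed from them remain hard to interpret.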

Related

Transform factor variable to numeric R

I have tried multiple things so I'll ask my question here.
I have a dataset consisting of 5 columns. The first lists countries (text), the second the year (integer), and columns 3 to 5 are my variables, which are currently factors.
I want to run a regression with my 3 variables, which is not possible right now because (I guess) my variables are not numeric/integer. I tried to transform them to numeric directly, but that only gave me ranks. I also tried to transform them first to character and then to integer/numeric (I tried both), but that also only turned my 3 variables into ranks. I used transform and as.integer, creating a new dataset:
x<-transform(GDPall, HardWork = as.integer(HardWork), FamilyImportance = as.integer(FamilyImportance), GDPWorker = as.integer(GDPWorker))
How can I transform my 3 variables into a class which allows me to run my regression?
Thank you in advance!
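The "ranks" described here are the integer codes that R stores inside a factor. A minimal sketch of the usual workaround, going through the level labels, is shown below; it assumes the factor labels are plain numeric text, which cannot be verified from the question, and the vector f is made up.

```r
# Hypothetical factor whose labels are numbers stored as text.
f <- factor(c("10.5", "3.2", "10.5", "7"))

# as.integer() returns the internal level codes, not the values.
as.integer(f)

# Converting the labels instead recovers the actual numbers.
values <- as.numeric(as.character(f))
values

# An equivalent form that converts each level only once.
values2 <- as.numeric(levels(f))[f]
```

If the labels contain anything that is not a valid number (for example a decimal comma), as.numeric() returns NA for those entries and issues a warning.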

Removing data frames from a list that contains a certain value under a variable in R

I currently have a list of 27 correlation matrices with 7 variables each; I am doing social science research.
Some correlations are NA due to missing data.
When I do the analysis, however, I do not analyse all variables in one go.
In a particular instance, I would like to keep a matrix only if one particular variable contains at least some real value. Since there are 7 variables, that means dropping anything in which that variable has 6 NAs plus the correlation with itself (1). This is the tricky part: 1 is a value, but it is meaningless to me in a correlation matrix.
I would appreciate it if anyone could enlighten me regarding the code.
I am rather new to R, and my only idea is to use an if statement to set the condition. I have been trying for hours to no avail, as this is my first real coding experience.
Thanks a lot.
Since you didn't provide sample data, I will first convert your matrix into a data frame, and then assume that you want to check whether a variable var in your data frame df has at least one value other than NA or 1.
df <- as.data.frame(as.table(matrix)) should convert your matrix into a data frame (here matrix stands for one of your correlation matrices).
table(df$var) will show you the distribution of values in that variable (add useNA = "ifany" to include the NA count, which table() omits by default). From there you can make a judgement call on whether to keep the variable or not.
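As an alternative to inspecting each matrix by hand, the condition from the question could be applied to the whole list with Filter(). This is only a sketch under assumptions: the three 3x3 matrices stand in for the 27 7x7 ones, and make_cor and has_info are hypothetical helpers, not existing functions.

```r
# Hypothetical stand-in for the list of correlation matrices.
vars <- c("a", "b", "c")
make_cor <- function(vals) {
  m <- diag(3)                           # 1s on the diagonal
  m[upper.tri(m)] <- vals                # order: (a,b), (a,c), (b,c)
  m[lower.tri(m)] <- t(m)[lower.tri(m)]  # mirror to keep it symmetric
  dimnames(m) <- list(vars, vars)
  m
}
cor_list <- list(
  s1 = make_cor(c(0.5, 0.2, 0.1)),
  s2 = make_cor(c(0.3, NA, NA)),   # "c" has only NA besides its diagonal 1
  s3 = make_cor(c(NA, 0.4, NA))
)

# TRUE if `var` has at least one non-NA correlation with another variable.
# The diagonal (correlation with itself) is excluded from the check.
has_info <- function(m, var) {
  off_diag <- m[var, colnames(m) != var]
  any(!is.na(off_diag))
}

# Keep only the matrices in which "c" carries some information.
kept <- Filter(function(m) has_info(m, "c"), cor_list)
names(kept)
```

Excluding the diagonal by name, rather than testing for the value 1, avoids discarding a genuine off-diagonal correlation that happens to equal 1.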

missing values for each participant in the study

I am working in R. What I want to do is make a table or a graph that shows, for each participant, their missing values. I have 4700+ participants, and for each question there are between 20 and 40 missing values. I would like to represent the missing values in such a way that I can see who did not answer which questions, and possibly check whether there is a pattern in the missing values. I have done the following:
Count of complete cases in a data frame named 'mydata'
sum(complete.cases(mydata))
Count of incomplete cases
sum(!complete.cases(mydata$Variable1))
Which cases (row numbers) are incomplete?
which(!complete.cases(mydata$Variable1))
I then got a list of numbers that I am not quite sure how to interpret. At first I thought these were the patient numbers, but then I noticed that this is not the case.
I also tried making subsets with only the missing values, but then I literally only see how many are missing, not who they belong to.
Could somebody help me? Thanks!
Zas
If there is a column that uniquely identifies each row in the data.frame mydata, say a patient number patient_no, then you can easily find the patient numbers of the people with missing values:
> mydata <- data.frame(patient_no = 1:5, variable1 = c(NA,NA,1,2,3))
> mydata[!complete.cases(mydata$variable1),'patient_no']
[1] 1 2
If you want to consider the pattern in which the users have missed a particular question, then this might be useful for you:
Assumption: Except Column 1, all other columns represent the columns related to questions.
> lapply(mydata[,-1],function(x){mydata[!complete.cases(x),'patient_no']})
Remember that R automatically attaches numbers to the observations in your data set. For example, if your data has 20 observations (20 rows), R attaches the numbers 1 to 20, which are not part of your original data. They are the row numbers. The results produced by which(!complete.cases(mydata$Variable1)) correspond to those numbers: they are the rows of your data set in which Variable1 is missing. (Called on the whole data frame, which(!complete.cases(mydata)) gives the rows with at least one missing value in any column.)
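To see the missing values per participant rather than per question, one possible sketch is to count the NAs in each row and to tabulate the missingness pattern. The toy data and the column names q1 and q2 are assumptions; patient_no follows the answer above.

```r
# Hypothetical data: one identifier column plus two question columns.
mydata <- data.frame(
  patient_no = 1:5,
  q1 = c(NA, NA, 1, 2, 3),
  q2 = c(1, NA, 2, NA, 3)
)
questions <- setdiff(names(mydata), "patient_no")

# Number of unanswered questions per participant.
n_missing <- rowSums(is.na(mydata[questions]))
missing_summary <- data.frame(patient_no = mydata$patient_no,
                              n_missing  = unname(n_missing))

# Only the participants with at least one missing answer.
missing_summary[missing_summary$n_missing > 0, ]

# Missingness pattern per row: 1 = missing, 0 = answered, one digit per question.
pattern <- apply(is.na(mydata[questions]), 1,
                 function(r) paste(as.integer(r), collapse = ""))
table(pattern)
```

With many questions, the pattern table shows quickly whether the same group of questions tends to be skipped together.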

In R's randomForest package, do factors have to be explicitly labeled as factors?

Or will the package realize that they are not continuous and treat them as factors? I know that, for classification, the feature being classified does need to be a factor. But what about predictive features? I've run it on a couple of toy datasets, and I get slightly different results depending on whether categorical features are numeric or factors, but the algorithm is random, so I do not know if the difference in my results are meaningful.
Thank you!
Yes there is a difference between the two. If you want to use a factor variable you should specify it as such and not leave it as a numeric.
For categorical data (this is actually a very good answer on CrossValidated):
A split on a factor with N levels is a selection of one of the (2^N)−2 non-empty proper subsets of levels (equivalently, 2^(N−1)−1 distinct binary splits, since a subset and its complement describe the same split). So, the algorithm will check all the possible combinations and choose the one that produces the best split.
For numerical data (as seen here):
Numerical predictors are sorted; then, for every value, Gini impurity or entropy is calculated, and the threshold that gives the best split is chosen.
So yeah it makes a difference whether you will add it as a factor or as a numeric variable. How much of a difference depends on the actual data.
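The difference in candidate splits can be illustrated in base R, without fitting a forest. This is a sketch of the counting argument only, not of randomForest's internals, and the vector codes is made up.

```r
# Hypothetical predictor: a category stored as the codes 1..4.
codes <- c(1, 2, 3, 4, 2, 1, 3, 4)

# Left as numeric, a tree can only cut between neighbouring sorted values,
# so categories 1 and 4 can never end up on the same side.
distinct_values <- sort(unique(codes))
numeric_thresholds <- (head(distinct_values, -1) + tail(distinct_values, -1)) / 2
numeric_thresholds

# Declared as a factor, any grouping of the levels into two non-empty
# sets is a candidate split.
as_factor <- factor(codes)
n_levels <- nlevels(as_factor)
factor_splits <- 2^(n_levels - 1) - 1
factor_splits
```

In practice this means converting categorical predictors explicitly, for example with factor(), before passing the data to the model.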

Change decimal digits for data frame column in R

Questions about displaying a certain number of digits have been posted before, but only for single values or vectors, so I hope someone can help me with this.
I have a data frame with several columns and want to display all values in one column with two decimal digits (this column only). I have tried round() and format() and options(digits) but none worked on a column (numerical). I wonder if there is a method to do this without going the extra way of converting the column to a vector and gluing all together again.
Thanks a lot!
Here's an example of how to do this with the cars data.frame that comes installed with R.
First I'll add some variability so that we have numbers with decimal places:
data=cars+runif(nrow(cars))
Then, to round just a single column (in this case the dist column) to 2 decimal places:
data[,'dist']=round(data[,'dist'],2)
If your data contain whole numbers, you can guarantee that all values are displayed with 2 decimal places by using format() with nsmall (note that this turns the column into character strings):
cars[,'dist']=format(round(cars[,'dist'],2),nsmall=2)
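The trade-off between the two approaches can be seen side by side in a small sketch. The data frame df is made up, and sprintf() is an additional base R option that the answer above does not mention.

```r
# Hypothetical data frame with a mix of whole and fractional numbers.
df <- data.frame(speed = c(4, 7, 8), dist = c(2, 10.456, 4.1))

# round() keeps the column numeric, but trailing zeros are not printed.
df$dist_rounded <- round(df$dist, 2)

# format() forces two decimals, pads to a common width, and returns text.
df$dist_text <- format(round(df$dist, 2), nsmall = 2)

# sprintf() also returns text, without the padding.
df$dist_sprintf <- sprintf("%.2f", df$dist)

str(df)
```

Because the formatted columns are character, it is usually safer to keep the numeric column for calculations and format only when printing or exporting.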
