Cluster analysis with daisy - r

I'm trying to perform a hierarchical cluster analysis in RStudio, using the daisy function from the cluster package. This is my dataset:
'data.frame': 341 obs. of 28 variables:
$ Impo_Env : Ord.factor w/ 3 levels "Low"<"Med"<"High": 3 2 3 2 3 2 3 3 2 3 ...
$ ComparativePriority_IAS: Ord.factor w/ 3 levels "Low"<"Med"<"High": 3 1 3 2 3 2 3 2 3 2 ...
$ Strategy_Eradication: Ord.factor w/ 3 levels "No intervention"<..: 3 2 3 2 3 2 3 2 2 3 ...
$ Knowl_BiodivLoss: Factor w/ 2 levels "0","1": 2 1 2 2 2 1 2 2 2 2 ...
$ Control_Trade: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
$ Engagement_Retail: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
$ Knowl_PastProj: Factor w/ 2 levels "0","1": 1 1 1 1 2 1 1 1 2 1 ...
$ Priority_IAS: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
$ Knowl_Eradic: Factor w/ 2 levels "0","1": 2 1 2 1 2 2 1 2 2 1 ...
$ Alert_CFS: Factor w/ 2 levels "0","1": 1 2 1 2 1 2 2 1 2 1 ...
$ Alert_Municipality: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ Alert_Park: Factor w/ 2 levels "0","1": 2 1 2 1 2 1 1 2 1 1 ...
$ Alert_Police: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ Alert_Firemen: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 2 ...
$ Supp_AuthorityIAS: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
$ Knowl_Env: Factor w/ 2 levels "0","1": 2 1 2 2 2 2 2 2 2 2 ...
$ Info_Tv: Factor w/ 2 levels "0","1": 2 2 1 2 2 2 2 1 2 1 ...
$ Info_Web: Factor w/ 2 levels "0","1": 2 1 2 2 2 1 2 1 2 2 ...
$ Info_Radio: Factor w/ 2 levels "0","1": 1 1 1 1 1 2 1 2 1 1 ...
$ Info_Magazines: Factor w/ 2 levels "0","1": 1 1 2 1 2 1 1 2 1 1 ...
$ Info_School: Factor w/ 2 levels "0","1": 1 1 2 1 1 1 1 1 1 2 ...
$ Blacklist: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
$ Workshop: Factor w/ 2 levels "0","1": 1 1 2 1 2 1 2 2 1 1 ...
$ SuppFin_FutProj: Factor w/ 2 levels "0","1": 2 1 2 1 2 2 2 2 2 2 ...
$ Tourist_dummy: Factor w/ 2 levels "0","1": 1 1 1 2 2 1 1 1 2 1 ...
$ Gender: Factor w/ 2 levels "Female","Male": 1 2 1 2 1 1 2 2 2 1 ...
$ logIASknown: num 2.89 2.94 2.89 2.56 3.14 ...
$ Age: int 20 41 14 10 26 33 19 59 23 16 ...
I would like to use the Euclidean distance with daisy, however when I run
daisy(fuu, metric = c("euclidean"), type=list(ordratio = c(1,2,3), asymm=c(4:24), symm=c(25,26)))
The result is not what I expect. Gower's distance is used instead of the Euclidean distance, and I get this warning:
Warning message:
In daisy(fuu, metric = c("euclidean"), type = list(ordratio = c(1, :
with mixed variables, metric "gower" is used automatically
How can I fix it?

As described in the "Details" section of the documentation for the daisy function in the cluster package:
The handling of nominal, ordinal, and (a)symmetric binary data is
achieved by using the general dissimilarity coefficient of Gower
(1971). If x contains any columns of these data-types, both arguments
metric and stand will be ignored and Gower's coefficient will be used
as the metric.
In other words, for the Euclidean metric (distance as the root sum-of-squares of differences) to be computed, every input column must be of mode numeric (as all columns are when x is a matrix), so that it is recognised as an interval-scaled variable. Columns of class factor are treated as nominal and columns of class ordered as ordinal. Specifying variable types in the type argument does not change this.
Under these premises, and supposing a Euclidean distance makes sense for all 28 of your variables even though most are qualitative or binary, you could convert them with as.numeric and then call daisy again. As long as the data frame contains mixed variable types, metric "gower" is used automatically and overrides your choice.
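A minimal sketch of that conversion, assuming a small made-up data frame (fuu_demo and its values are hypothetical stand-ins for the real fuu). Note that as.numeric on a factor returns the internal level codes (1, 2, ...), not the labels, and daisy may still warn that binary columns are treated as interval scaled:

```r
library(cluster)  # provides daisy()

set.seed(1)
# Hypothetical stand-in for a few columns of the original data
fuu_demo <- data.frame(
  Impo_Env  = factor(sample(c("Low", "Med", "High"), 10, replace = TRUE),
                     levels = c("Low", "Med", "High"), ordered = TRUE),
  Knowl_Env = factor(sample(0:1, 10, replace = TRUE)),
  Age       = sample(10:60, 10)
)

# Convert every column to numeric; factors become their level codes
fuu_num <- data.frame(lapply(fuu_demo, as.numeric))

# With all-numeric input, the requested metric is honoured
d <- daisy(fuu_num, metric = "euclidean")

# The dissimilarity object can be passed on to hierarchical clustering
hc <- hclust(as.dist(d), method = "average")
```

Because Age has a much larger range than the coded factors, it will dominate the distance unless the columns are standardised (for example with stand = TRUE).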

Related

Error in Logistic Regression for Factors in R

I am trying to do logistic regression by using the code:
model <- glm (Participation ~ Gender + Race + Ethnicity + Education + Comorbidities + WLProgram + LoseWeight + EverLoseWeight + PastYearLW + Age + BMI, data = LogisticData, family = binomial)
summary(model)
I keep getting the error:
Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) : contrasts can be applied only to factors with 2 or more levels
Upon checking the forums I checked to see which variables were factors:
str(LogisticData)
'data.frame': 994 obs. of 13 variables:
$ outcome : Factor w/ 2 levels "No","Yes": 1 1 2 2 1 2 2 1 2 2 ...
$ Gender : Factor w/ 3 levels "Male","Female",..: 1 2 2 1 2 1 1 1 1
$ Race : Factor w/ 3 levels "White","Black",..: 1 1 1 3 1 1 1 1 1 1
$ Ethnicity : Factor w/ 2 levels "Hispanic/Latino",..: 2 2 2 2 2 2 2 2 2
$ Education : Factor w/ 2 levels "Below Bachelors",..: 1 1 1 2 1 1 1 2 1
$ Comorbidities : Factor w/ 2 levels "No","Yes": 1 1 2 1 1 1 2 2 1 1 ...
$ WLProgram : Factor w/ 2 levels "No","Yes": NA 1 2 2 1 1 1 NA 1 1 ...
$ LoseWeight : Factor w/ 2 levels "Yes","No": 2 1 1 1 1 1 1 2 1 1 ...
$ PastYearLW : Factor w/ 2 levels "Yes","No": NA 2 1 1 1 2 1 NA 1 1 ...
$ EverLoseWeight: Factor w/ 2 levels "Yes","No": 2 1 1 1 1 1 1 2 1 1 ...
$ Age : int 29 35 69 32 21 45 40 62 59 58 ...
$ Participation : Factor w/ 2 levels "Yes","No": 2 2 1 1 1 1 1 2 1 2 ...
$ BMI : num 25.7 33.8 26.4 32.3 27.5 ...
All factors appear to have 2 or more levels.
I also tried to omit NA's which still gave me this error.
I want all variables in the regression, and can't figure out why it won't run.
When I run:
newdata <- droplevels(na.omit(LogisticData))
> str(newdata)
'data.frame': 840 obs. of 13 variables:
$ outcome : Factor w/ 2 levels "No","Yes": 1 2 2 1 2 2 2 2 2 2 ...
$ Gender : Factor w/ 3 levels "Male","Female",..: 2 2 1 2 1 1 1 2 1
$ Race : Factor w/ 3 levels "White","Black",..: 1 1 3 1 1 1 1 1 3
$ Ethnicity : Factor w/ 2 levels "Hispanic/Latino",..: 2 2 2 2 2 2 2 2
$ Education : Factor w/ 2 levels "Below Bachelors",..: 1 1 2 1 1 1 1 1
$ Comorbidities : Factor w/ 2 levels "No","Yes": 1 2 1 1 1 2 1 1 1 2 ...
$ WLProgram : Factor w/ 2 levels "No","Yes": 1 2 2 1 1 1 1 1 1 1 ...
$ LoseWeight : Factor w/ 1 level "Yes": 1 1 1 1 1 1 1 1 1 1 ...
$ PastYearLW : Factor w/ 2 levels "Yes","No": 2 1 1 1 2 1 1 1 1 2 ...
$ EverLoseWeight: Factor w/ 1 level "Yes": 1 1 1 1 1 1 1 1 1 1 ...
$ Age : int 35 69 32 21 45 40 59 58 23 32 ...
$ Participation : Factor w/ 2 levels "Yes","No": 2 1 1 1 1 1 1 2 2 1 ...
$ BMI : num 33.8 26.4 32.3 27.5 45.4 ...
- attr(*, "na.action")=Class 'omit' Named int [1:154] 1 8 13 14 21 24 25
46 55 58 ...
.. ..- attr(*, "names")= chr [1:154] "1" "8" "13" "14" ...
This doesn't make sense to me, because the first str(LogisticData) clearly shows two levels in EverLoseWeight: you can see both "Yes" and "No", and both codes 1 and 2. How do I fix this anomaly?
Given your update, it looks like you have at least two possibilities.
1: Remove the factors that are left with only a single level after removing the NAs (i.e. LoseWeight and EverLoseWeight).
2: Treat the NAs as an extra level. Something along the lines of
a = as.factor(c(1,1,NA,2))
b = as.factor(c(1,1,2,1))
# 0 is an unused factor level for a
x = data.frame(a, b)
levels(x$a) = c(levels(x$a), 0)
x$a[is.na(x$a)] = 0
But this might not deal with any singularity issues that also resulted in having single-level factors.
Try running summary() on your raw data and make sure that every factor level actually has observations.
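To see which factors collapse to a single level once incomplete rows are dropped, a check along these lines may help. It is only a sketch on a made-up data frame (demo and its values are hypothetical, loosely modelled on the columns above):

```r
# Hypothetical data: LoseWeight is "No" only in rows where WLProgram is NA
demo <- data.frame(
  WLProgram  = factor(c(NA, "No", "Yes", "Yes", "No", NA)),
  LoseWeight = factor(c("No", "Yes", "Yes", "Yes", "Yes", "No")),
  Age        = c(29, 35, 69, 32, 21, 62)
)

# glm() drops incomplete rows by default, so mimic that here
complete <- droplevels(na.omit(demo))

# Count the remaining levels of every factor column
n_levels <- sapply(Filter(is.factor, complete), nlevels)

# Factors with fewer than two levels trigger the contrasts error
single <- names(n_levels)[n_levels < 2]
print(single)
```

Any variable listed in single has to be dropped from the formula, or its NAs handled differently, before the model can be fitted.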

Error with bnlearn argument: iamb(bn) or gs(bn)

I tried to create a Bayesian network with the functions iamb(x) and gs(x),
but I got "Error in check.data(x) : variable MFYield must have at least two levels."
Here is my code:
bn[sapply(bn, is.character)] <- lapply(bn[sapply(bn, is.character)], as.factor)
attach(bn)
This is a sample of my data "bn":
str(bn)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 59 obs. of 42 variables:
$ MFYield : Factor w/ 2 levels "0","1": 1 1 1 1 2 1 1 1 1 1 ...
$ MWtA : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
$ MWtT : Factor w/ 2 levels "0","1": 2 2 2 2 1 2 2 2 2 2 ...
$ MClb : Factor w/ 2 levels "0","1": 2 2 2 2 2 1 1 1 1 1 ...
$ MPS : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
$ MTwU : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
$ MTwD : Factor w/ 2 levels "0","1": 1 1 1 1 1 2 2 2 2 2 ...
$ MTwDAgent: Factor w/ 2 levels "0","1": 1 1 1 1 1 2 2 2 2 2 ...
bn.iamb = iamb(bn)
The error begins here, when I try to create the network:
Error in check.data(x) : variable MFYield must have at least two levels.
I am not sure whether this is because my data is a tibble:
A tibble: 59 x 42
When I checked, it said:
nlevels("bn$MFYield")
[1] 0
class(bn)
[1] "tbl_df" "tbl" "data.frame"
is.factor(bn$MFYield)
[1] TRUE
So I think my data is already a factor, but R does not detect that it already has two levels, and I do not understand why.
How can I fix it? I am quite new to R, so please help me get through this.
Thank you.
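One detail in the check above is worth noting: nlevels("bn$MFYield") is given a character string rather than the column itself, and a string has no levels, so it returns 0 regardless of the data. The sketch below uses a made-up data frame (bn_demo is hypothetical) to show the difference. The final conversion to a plain data frame is an assumption about what iamb() may need, not something confirmed in this thread:

```r
# Hypothetical stand-in for two columns of the original data
bn_demo <- data.frame(
  MFYield = factor(c("0", "0", "1", "0")),
  MWtA    = factor(c("1", "1", "1", "0"))
)

# Quoted: this inspects a character string, which has no levels
nlevels("bn_demo$MFYield")

# Unquoted: this inspects the factor column itself
nlevels(bn_demo$MFYield)

# Assumption: passing a plain data.frame instead of a tibble may help;
# as.data.frame() drops the tbl_df/tbl classes from a tibble
bn_plain <- as.data.frame(bn_demo)
class(bn_plain)
```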

Merging in R: 1 row missing after merge

I have a dataframe movielens:
str(u.data)
'data.frame': 100000 obs. of 4 variables:
$ userID : int 196 186 22 244 166 298 115 253 305 6 ...
$ movieID : int 242 302 377 51 346 474 265 465 451 86 ...
$ rating : int 3 3 1 2 1 4 2 5 3 3 ...
$ timestamp: int 881250949 891717742 878887116 880606923 886397596 884182806 881171488 891628467 886324817 883603013 ...
and
str(u.item)
'data.frame': 1681 obs. of 20 variables:
$ unknown : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ Action : Factor w/ 2 levels "0","1": 1 2 1 2 1 1 1 1 1 1 ...
$ Adventure : Factor w/ 2 levels "0","1": 1 2 1 1 1 1 1 1 1 1 ...
$ Animation : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 1 1 1 1 ...
$ Childrens : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 1 2 1 1 ...
$ Comedy : Factor w/ 2 levels "0","1": 2 1 1 2 1 1 1 2 1 1 ...
$ Crime : Factor w/ 2 levels "0","1": 1 1 1 1 2 1 1 1 1 1 ...
$ Documentary: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ Drama : Factor w/ 2 levels "0","1": 1 1 1 2 2 2 2 2 2 2 ...
$ Fantasy : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ Film-Noir : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ Horror : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ Musical : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ Mystery : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ Romance : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ Sci-Fi : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 2 1 1 1 ...
$ Thriller : Factor w/ 2 levels "0","1": 1 2 2 1 2 1 1 1 1 1 ...
$ War : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 2 ...
$ Western : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ movieID : int 1 2 3 4 5 6 7 8 9 10 ...
The number of rows in u.data is 100,000:
nrow(u.data)
[1] 100000
And
nrow(u.item)
[1] 1681
Then, I want to merge them:
all_data = u.data
all_data = merge(all_data, u.item, by = "movieID")
But the merged data has only 99,999 rows:
nrow(all_data)
[1] 99999
Did I do something wrong while merging these two data frames?
This happens whenever a movieID in u.data has no matching movieID in u.item (for example, because it lies outside the range of u.item$movieID): by default, merge() keeps only the rows that match in both data frames. Example:
# max(u.data$movieID) = 10
u.data <- data.frame(movieID = 1:10, NAME = LETTERS[1:10])
dim(u.data)
# [1] 10 2
# max(u.item$movieID) = 11
u.item <- data.frame(movieID = c(1:9,11), name = letters[c(1:9,11)])
dim(u.item)
# [1] 10 2
out <- merge(u.data, u.item, by = "movieID")
dim(out)
# [1] 9 3
# check whether each element of u.data$movieID exists in u.item$movieID
is.element(u.data$movieID, u.item$movieID)
# [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE
Suggested by Batanichek:
out <- merge(u.data, u.item, by = "movieID", all.x = TRUE)
dim(out)
# [1] 10 3
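To find out which rows are lost before merging, the unmatched keys can be listed directly. This is a small sketch that reuses the toy data frames from the example above:

```r
# Same toy data as in the example above
u.data <- data.frame(movieID = 1:10, NAME = LETTERS[1:10])
u.item <- data.frame(movieID = c(1:9, 11), name = letters[c(1:9, 11)])

# Keys present in u.data but absent from u.item
missing_ids <- setdiff(u.data$movieID, u.item$movieID)
print(missing_ids)

# The rows of u.data that an inner merge would drop
dropped <- u.data[!u.data$movieID %in% u.item$movieID, ]
print(dropped)
```

Running the same two checks on the real u.data and u.item should show the movieID behind the missing row.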

Impute missing data

I have the following dataset:
> str(train)
'data.frame': 4619 obs. of 110 variables:
$ UserID : int 1 2 5 6 7 8 9 11 12 13 ...
$ YOB : int 1938 1985 1963 1997 1996 1991 1995 1983 1984 1997 ...
$ Gender : Factor w/ 3 levels "","Female","Male": 3 2 3 3 3 2 3 3 2 2 ...
$ Income : Factor w/ 7 levels "","$100,001 - $150,000",..: 1 3 6 5 4 7 5 2 4 6 ...
$ HouseholdStatus: Factor w/ 7 levels "","Domestic Partners (no kids)",..: 5 6 5 6 6 6 6 5 5 6 ...
$ EducationLevel : Factor w/ 8 levels "","Associate's Degree",..: 1 8 1 7 4 5 4 3 7 4 ...
$ Party : Factor w/ 6 levels "","Democrat",..: 3 2 1 6 1 1 6 3 6 2 ...
$ Happy : int 1 1 0 1 1 1 1 1 0 0 ...
$ Q124742 : Factor w/ 3 levels "","No","Yes": 2 1 2 1 2 3 1 2 2 1 ...
$ Q124122 : Factor w/ 3 levels "","No","Yes": 1 3 3 3 2 3 1 3 3 1 ...
$ Q123464 : Factor w/ 3 levels "","No","Yes": 2 2 2 3 2 2 1 2 2 1 ...
$ Q123621 : Factor w/ 3 levels "","No","Yes": 2 3 3 2 2 1 1 3 2 1 ...
$ Q122769 : Factor w/ 3 levels "","No","Yes": 2 2 2 1 3 1 1 2 2 2 ...
$ Q122770 : Factor w/ 3 levels "","No","Yes": 3 2 2 3 3 1 1 2 3 3 ...
$ Q122771 : Factor w/ 3 levels "","Private","Public": 3 3 2 2 3 3 1 3 3 3 ...
$ Q122120 : Factor w/ 3 levels "","No","Yes": 2 2 2 2 2 3 1 2 2 2 ...
$ Q121699 : Factor w/ 3 levels "","No","Yes": 3 3 3 2 2 3 2 3 3 2 ...
$ Q121700 : Factor w/ 3 levels "","No","Yes": 2 3 2 2 3 3 2 2 2 2 ...
$ Q120978 : Factor w/ 3 levels "","No","Yes": 1 3 2 3 3 2 2 3 3 3 ...
$ Q121011 : Factor w/ 3 levels "","No","Yes": 2 2 2 2 2 3 3 2 3 2 ...
$ Q120379 : Factor w/ 3 levels "","No","Yes": 2 3 3 2 3 3 2 2 2 3 ...
$ Q120650 : Factor w/ 3 levels "","No","Yes": 3 3 3 3 3 2 3 3 3 3 ...
$ Q120472 : Factor w/ 3 levels "","Art","Science": 1 3 3 3 3 2 3 3 2 3 ...
$ Q120194 : Factor w/ 3 levels "","Study first",..: 3 2 3 2 2 3 3 3 3 3 ...
$ Q120012 : Factor w/ 3 levels "","No","Yes": 2 3 3 1 2 3 2 2 3 3 ...
$ Q120014 : Factor w/ 3 levels "","No","Yes": 2 3 2 3 3 1 3 3 2 3 ...
$ Q119334 : Factor w/ 3 levels "","No","Yes": 1 3 2 2 2 3 2 3 2 2 ...
$ Q119851 : Factor w/ 3 levels "","No","Yes": 3 2 2 3 2 2 3 2 2 3 ...
$ Q119650 : Factor w/ 3 levels "","Giving","Receiving": 1 2 2 3 2 1 2 2 2 3 ...
$ Q118892 : Factor w/ 3 levels "","No","Yes": 3 3 3 2 3 2 1 3 2 2 ...
$ Q118117 : Factor w/ 3 levels "","No","Yes": 3 2 2 3 3 3 1 2 2 2 ...
$ Q118232 : Factor w/ 3 levels "","Idealist",..: 2 2 3 3 3 1 1 2 2 3 ...
$ Q118233 : Factor w/ 3 levels "","No","Yes": 2 2 2 2 2 2 1 2 3 2 ...
$ Q118237 : Factor w/ 3 levels "","No","Yes": 2 3 3 3 2 2 1 2 3 2 ...
$ Q117186 : Factor w/ 3 levels "","Cool headed",..: 1 2 2 2 1 3 1 2 3 1 ...
$ Q117193 : Factor w/ 3 levels "","Odd hours",..: 1 2 3 2 3 3 1 3 3 3 ...
$ Q116797 : Factor w/ 3 levels "","No","Yes": 3 3 2 2 2 1 1 2 2 1 ...
$ Q116881 : Factor w/ 3 levels "","Happy","Right": 2 2 3 3 2 2 1 2 2 1 ...
$ Q116953 : Factor w/ 3 levels "","No","Yes": 3 3 3 3 1 3 3 3 3 1 ...
$ Q116601 : Factor w/ 3 levels "","No","Yes": 3 3 3 2 3 3 1 3 3 1 ...
$ Q116441 : Factor w/ 3 levels "","No","Yes": 2 2 2 2 2 2 1 2 2 1 ...
$ Q116448 : Factor w/ 3 levels "","No","Yes": 2 3 3 3 2 2 1 2 3 1 ...
$ Q116197 : Factor w/ 3 levels "","A.M.","P.M.": 3 2 2 2 2 3 1 2 3 1 ...
$ Q115602 : Factor w/ 3 levels "","No","Yes": 3 3 3 3 3 2 1 3 2 1 ...
$ Q115777 : Factor w/ 3 levels "","End","Start": 3 2 3 3 3 3 1 3 2 1 ...
$ Q115610 : Factor w/ 3 levels "","No","Yes": 3 3 3 3 3 1 1 3 2 1 ...
$ Q115611 : Factor w/ 3 levels "","No","Yes": 2 2 3 3 2 2 1 2 2 1 ...
$ Q115899 : Factor w/ 3 levels "","Circumstances",..: 2 3 3 2 2 3 1 2 3 1 ...
$ Q115390 : Factor w/ 3 levels "","No","Yes": 3 2 2 2 1 2 3 3 2 1 ...
$ Q114961 : Factor w/ 3 levels "","No","Yes": 3 3 2 3 2 3 2 2 3 1 ...
$ Q114748 : Factor w/ 3 levels "","No","Yes": 3 2 2 2 3 3 3 2 3 1 ...
$ Q115195 : Factor w/ 3 levels "","No","Yes": 3 3 3 3 3 2 3 3 3 1 ...
$ Q114517 : Factor w/ 3 levels "","No","Yes": 2 3 2 3 2 2 2 2 3 1 ...
$ Q114386 : Factor w/ 3 levels "","Mysterious",..: 1 3 3 2 2 3 3 3 3 1 ...
$ Q113992 : Factor w/ 3 levels "","No","Yes": 3 1 3 2 2 2 2 2 3 1 ...
$ Q114152 : Factor w/ 3 levels "","No","Yes": 3 2 2 2 3 2 2 2 2 1 ...
$ Q113583 : Factor w/ 3 levels "","Talk","Tunes": 2 3 2 3 3 3 3 2 3 1 ...
$ Q113584 : Factor w/ 3 levels "","People","Technology": 3 2 2 3 2 1 3 2 2 1 ...
$ Q113181 : Factor w/ 3 levels "","No","Yes": 2 3 3 3 2 3 3 2 2 1 ...
[list output truncated]
As you can see, I have 110 variables. I am trying to build a predictive model of happiness using these variables. If I leave them in factor form, CART models, randomForest etc. struggle, so I'm trying to convert them to a numeric type (to make the algorithm's life a bit easier)...
Currently I am doing it one by one e.g.:
> table(train_new$Q117193)
Odd hours Standard hours
1410 1299 1910
> train_new$Q117193 = as.integer(train_new$Q117193)
> table(train_new$Q117193)
1 2 3
1410 1299 1910
You can notice that almost all the factor variables have missing values denoted by "".
I have converted this dataset to numeric using:
train_numeric$Gender = as.integer(train_numeric$Gender)
train_numeric[,grep(pattern="^Q1",colnames(train_numeric))] = lapply(train_numeric[,grep(pattern="^Q1",colnames(train_numeric))],as.integer)
I am using the mice package to impute this dataset... I am lost to be honest. Any ideas how I could fill these missing values please?
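It seems that you are converting factor variables (like Gender) to integers. That call does not fail, but as.integer on a factor returns the internal level codes, so the empty string "" simply becomes code 1 (as your table output shows) and is no longer recognisable as missing. The blanks need to be marked as NA before any conversion or imputation.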
To replace all missing values ("") with NAs in your data frame train you could do something like
train[train == ""] <- NA
You can also handle this while importing the file. Assuming you imported a CSV file, the code would be
dataset <- read.csv(file = "file location", sep = ",", header = TRUE, na.strings = c("", "NA"))
This reads blank entries in the categorical variables as NA.
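As a rough illustration of the blank-to-NA step on data that is already loaded, here is a sketch on a made-up data frame (train_demo and its values are hypothetical). After the replacement, the now-empty "" level is dropped so that it no longer counts as a category:

```r
# Hypothetical stand-in for a few columns of train
train_demo <- data.frame(
  Gender  = factor(c("Male", "", "Female", "Male")),
  Q124742 = factor(c("No", "Yes", "", "")),
  YOB     = c(1938L, 1985L, 1963L, 1997L)
)

# Mark blank strings as missing
train_demo[train_demo == ""] <- NA

# Remove the unused "" level from every factor
train_demo <- droplevels(train_demo)

# Number of missing values per column
colSums(is.na(train_demo))
```

Once the blanks are real NAs, an imputation package such as mice can recognise them as missing. Keeping the columns as factors at that stage, rather than converting them to integers first, lets the imputation treat them as categorical.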

How can I take multiple vectors and recode their datatypes in R?

I'm looking for an elegant way to change multiple vectors' datatypes in R.
I'm working with an educational dataset: 426 students' answers to eight multiple choice questions (1 = correct, 0 = incorrect), plus a column indicating which instructor (1, 2, or 3) taught their course.
As it stands, my data is sitting pretty in data.df, like this:
str(data.df)
'data.frame': 426 obs. of 9 variables:
$ ques01: int 1 1 1 1 1 1 0 0 0 1 ...
$ ques02: int 0 0 1 1 1 1 1 1 1 1 ...
$ ques03: int 0 0 1 1 0 0 1 1 0 1 ...
$ ques04: int 1 0 1 1 1 1 1 1 1 1 ...
$ ques05: int 0 0 0 0 1 0 0 0 0 0 ...
$ ques06: int 1 0 1 1 0 1 1 1 1 1 ...
$ ques07: int 0 0 1 1 0 1 1 0 0 1 ...
$ ques08: int 0 0 1 1 1 0 1 1 0 1 ...
$ inst : num 1 1 1 1 1 1 1 1 1 1 ...
But those ques0x values aren't really integers. Rather, I think it's better to have R treat them as experimental factors. Same goes for the "inst" values.
I'd love to turn all those ints and nums into factors.
Ideally, an elegant solution should produce a dataframe—I call it factorData.df—that looks like this:
str(factorData.df)
'data.frame': 426 obs. of 9 variables:
$ ques01: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 1 1 1 2 ...
$ ques02: Factor w/ 2 levels "0","1": 1 1 2 2 2 2 2 2 2 2 ...
$ ques03: Factor w/ 2 levels "0","1": 1 1 2 2 1 1 2 2 1 2 ...
$ ques04: Factor w/ 2 levels "0","1": 2 1 2 2 2 2 2 2 2 2 ...
$ ques05: Factor w/ 2 levels "0","1": 1 1 1 1 2 1 1 1 1 1 ...
$ ques06: Factor w/ 2 levels "0","1": 2 1 2 2 1 2 2 2 2 2 ...
$ ques07: Factor w/ 2 levels "0","1": 1 1 2 2 1 2 2 1 1 2 ...
$ ques08: Factor w/ 2 levels "0","1": 1 1 2 2 2 1 2 2 1 2 ...
$ inst : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
I'm fairly certain that whatever solution you folks come up with, it ought to be easy to generalize to any n number of variables that'd need to get reclassified, and would work across most common conversions (int -> factor and num -> int, for example).
No matter what solution you folks generate, it's bound to be more elegant than mine
Because my current clunky code is just 9 separate factor() statements, one for each variable, like this
factorData.df$ques01
I'm brand-new to R, programming, and stackoverflow. Please be gentle, and thanks in advance for your help!
This was also answered in R-Help.
I imagine that there's a better way to do it, but here are two options:
# use a sample data set
> str(cars)
'data.frame': 50 obs. of 2 variables:
$ speed: num 4 4 7 7 8 9 10 10 10 11 ...
$ dist : num 2 10 4 22 16 10 18 26 34 17 ...
> data.df <- cars
You can use lapply:
> data.df <- data.frame(lapply(data.df, factor))
Or a for statement:
> for(i in 1:ncol(data.df)) data.df[,i] <- as.factor(data.df[,i])
In either case, you end up with what you want:
> str(data.df)
'data.frame': 50 obs. of 2 variables:
$ speed: Factor w/ 19 levels "4","7","8","9",..: 1 1 2 2 3 4 5 5 5 6 ...
$ dist : Factor w/ 35 levels "2","4","10","14",..: 1 3 2 9 5 3 7 11 14 6 ...
I found an alternative solution in the plyr package:
# load the package and data
> library(plyr)
> data.df <- cars
Use the colwise function:
> data.df <- colwise(factor)(data.df)
> str(data.df)
'data.frame': 50 obs. of 2 variables:
$ speed: Factor w/ 19 levels "4","7","8","9",..: 1 1 2 2 3 4 5 5 5 6 ...
$ dist : Factor w/ 35 levels "2","4","10","14",..: 1 3 2 9 5 3 7 11 14 6 ...
Incidentally, if you look inside the colwise function, it just uses lapply:
df <- as.data.frame(lapply(filtered, .fun, ...))
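When only some columns should be converted, or different conversions are needed for different columns, the same lapply idea can be applied to a subset. This is a sketch on a made-up data frame (the score and count columns are hypothetical additions for illustration):

```r
# Hypothetical data in the spirit of the question
data.df <- data.frame(
  ques01 = c(1L, 1L, 0L),
  ques02 = c(0L, 1L, 1L),
  inst   = c(1, 2, 3),
  score  = c(0.5, 0.7, 0.9),
  count  = c(4, 8, 15)
)

# int/num -> factor for a chosen set of columns
to_factor <- c("ques01", "ques02", "inst")
data.df[to_factor] <- lapply(data.df[to_factor], factor)

# num -> int for another set of columns
to_int <- "count"
data.df[to_int] <- lapply(data.df[to_int], as.integer)

str(data.df)
```

Assigning into data.df[cols] keeps the data frame intact and leaves the untouched columns (score here) as they were.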
