clustering qualitative data in R - r

I have a data.frame (df) that looks like that:
ZN.N ZL.N
MMP2 (1.89,3.58] (2.13,4.1]
AEBP1 (1.89,3.58] (2.13,4.1]
A1AG1 (1.89,3.58] (2.13,4.1]
A1AT [0.364,1.89] [0.275,2.13]
A2MG [0.364,1.89] [0.275,2.13]
ENOA (1.89,3.58] (2.13,4.1]
And I would like to cluster the row.names (proteins) based on the two variables (ZN.N and ZL.N). Could I use a k.means approach or a hierarchical clustering for this kind of data?
I've tried
df.k2 <- k.means(df, 2)
but it doesn't work. I'm really new on clustering so apologise whether the question is really silly, thanks a lot
Here is the dput of my data.frame
structure(list(ZN.N = structure(c(2L, 2L, 2L, 1L, 1L, 2L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 2L, 2L,1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 2L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 2L, 1L, 1L, 2L, 1L, 1L, 2L, 2L, 2L, 1L, 2L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 2L, 1L, 2L, 2L, 1L, 2L, 1L, 1L, 1L, 1L, 2L, 1L), .Label = c("[0.364,1.89]", "(1.89,3.58]"), class = "factor"),
ZL.N = structure(c(2L, 2L, 2L, 1L, 1L, 2L, 1L, 1L, 1L, 2L,
2L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 3L, 1L,
2L, 2L, 2L, 2L, 1L, 2L, 1L, 2L, 2L, 1L, 2L, 1L, 1L, 1L, 2L,
1L, 1L, 2L, 2L, 1L, 2L, 2L, 3L, 3L, 1L, 1L, 1L, 2L, 2L, 1L,
3L, 3L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 2L, 2L, 1L, 2L, 1L, 2L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 3L, 1L, 1L, 1L, 1L, 2L, 2L,
2L, 1L, 2L, 2L, 1L, 3L, 1L, 1L, 1L, 2L, 2L, 3L, 3L, 1L, 1L,
1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 3L, 2L, 2L, 3L, 1L, 3L, 2L,
1L, 1L, 2L, 3L, 1L), .Label = c("[0.275,2.13]", "(2.13,4.1]",
"(4.1,6.78]"), class = "factor")), .Names = c("ZN.N", "ZL.N"), class = "data.frame", row.names = c("MMP2", "AEBP1", "A1AG1", "A1AT", "A2MG", "ENOA", "ANGI", "ANGL2", "ANT3", "APOA1", "APOA2", "APOD", "PGBM", "PGS1", "CAH3", "CRAC1", "CILP1", "CILP2", "COMP", "CH3L1", "CH3L2", "CSPG4", "CCD80", "CO1A1", "CO2A1", "CO3A1", "CO6A1", "COCA1", "COFA1", "COIA1", "CO1A2", "CO6A2", "COBA2", "CO6A3", "C1QB", "C1R", "C1S", "CO3", "CO4B", "CO8A", "CFAB", "CFAH", "CRP", "KCRM", "CLC3A", "ECM1", "FIBA", "FIBB", "FIBG", "FGFP2", "FMOD", "FINC", "FBLN1", "FSTL1", "G3P", "HPT", "HBA", "HBB", "H2B1L", "H32", "H4", "HPLN1", "IGHA1", "IGHG1", "IGKC", "LAC6", "IGHM", "INHBA", "IBP3", "ITIH1", "MMP1", "LDHA", "LYSC", "TIMP1", "TIMP2", "MIME", "MOES", "MYG", "NID2", "NUCB1", "OSTP", "PPIA", "PPIB", "POSTN", "PRDX2", "PGAM1", "PA2GA", "PLTP", "PEDF", "IPSP", "LMNA", "PCOC1", "PRELP", "AMBP", "PDIA3", "PDIA6", "S10AA", "S10A8", "PRG4", "KPYM", "RNAS1", "HTRA1", "TRFE", "ALBU", "SAMP", "SMOC2", "MMP3", "TARSH", "TENA", "TENX", "TETN", "TSP3", "TSP4", "BGH3", "TTHY", "TR11B", "RL40", "CSPG2", "VIME", "VTNC"))

The reason you are having trouble with clustering is that kmeans expects a numeric matrix, but you're providing the function a data frame with factor variables.
You could instead convert those factors to numbers and then run kmeans:
set.seed(144)
df$ZN.N <- as.numeric(df$ZN.N)
df$ZL.N <- as.numeric(df$ZL.N)
clusters <- kmeans(df, 2)$cluster
clusters1 <- names(clusters[clusters == 1])
clusters1
# [1] "MMP2" "AEBP1" "A1AG1" "ENOA" "APOA1" "PGS1" "CAH3" "CO1A1" "CO3A1"
# [10] "C1R" "CO8A" "CRP" "KCRM" "FIBB" "FIBG" "HPT" "HBA" "H32"
# [19] "H4" "IGHG1" "IGKC" "INHBA" "MYG" "NID2" "POSTN" "PLTP" "PEDF"
# [28] "LMNA" "PDIA3" "PDIA6" "S10AA" "S10A8" "TENA" "TETN" "TSP3" "BGH3"
# [37] "VIME"
clusters2 <- names(clusters[clusters == 2])
clusters2
# [1] "A1AT" "A2MG" "ANGI" "ANGL2" "ANT3" "APOA2" "APOD" "PGBM" "CRAC1"
# [10] "CILP1" "CILP2" "COMP" "CH3L1" "CH3L2" "CSPG4" "CCD80" "CO2A1" "CO6A1"
# [19] "COCA1" "COFA1" "COIA1" "CO1A2" "CO6A2" "COBA2" "CO6A3" "C1QB" "C1S"
# [28] "CO3" "CO4B" "CFAB" "CFAH" "CLC3A" "ECM1" "FIBA" "FGFP2" "FMOD"
# [37] "FINC" "FBLN1" "FSTL1" "G3P" "HBB" "H2B1L" "HPLN1" "IGHA1" "LAC6"
# [46] "IGHM" "IBP3" "ITIH1" "MMP1" "LDHA" "LYSC" "TIMP1" "TIMP2" "MIME"
# [55] "MOES" "NUCB1" "OSTP" "PPIA" "PPIB" "PRDX2" "PGAM1" "PA2GA" "IPSP"
# [64] "PCOC1" "PRELP" "AMBP" "PRG4" "KPYM" "RNAS1" "HTRA1" "TRFE" "ALBU"
# [73] "SAMP" "SMOC2" "MMP3" "TARSH" "TENX" "TSP4" "TTHY" "TR11B" "RL40"
# [82] "CSPG2" "VTNC"
In this code, ZN.N was converted into the numbers 1 and 2, and ZL.N was converted into the numbers 1, 2, and 3. kmeans then computes the euclidean distance between points for the clustering. You'll have to determine if this makes sense for your application.

Related

Why do similarly structured vectors not allow a prediction with evalm?

I am trying to use the evalm() function to assess model performance using caret, but I keep getting the following error. Any advice?
> evalm(data.frame(preds, test$surg))
***MLeval: Machine Learning Model Evaluation***
Input: data frame of probabilities of observed labels
Error in names(x) <- value :
'names' attribute [3] must be the same length as the vector [2]
> dput(test$surg)
structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 2L, 1L, 2L, 1L,
2L, 2L, 1L, 2L, 1L, 2L, 2L, 1L, 1L, 2L, 2L, 2L, 1L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L,
2L, 1L, 1L, 1L), .Label = c("0", "1"), class = "factor")
> dput(preds)
structure(c(1L, 2L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 2L, 1L, 1L,
1L, 2L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L,
1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
2L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 2L, 2L,
2L, 2L, 1L, 1L, 1L, 2L, 1L, 2L, 2L, 1L, 1L, 1L, 2L, 1L, 1L, 2L,
2L, 2L, 2L, 1L), .Label = c("0", "1"), class = "factor")

Getting specific combination of interaction as variable in logistic regression with R

I have this dataset and want to perform a regression analysis on it. I have to predictive variables urban_rural and religious. Now I want to have two specific interaction variables: 1.) Urban/not religious and 2.) Rural/religious. I know that interaction is possible through the sign *, but this does not give me the desired combination of interaction. I guess one has to set the reference variable manually?
structure(list(urban_rural = structure(c(1L, 1L, 2L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L,
1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
2L, 1L, 1L, 1L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L,
1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 2L, 1L,
1L, 1L, 2L, 2L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 1L,
2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 2L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 2L, 2L, 1L, 1L, 2L, 2L, 1L, 1L,
2L, 1L, 2L, 2L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L), .Label = c("Urban", "Rural", "Refugee camp"
), class = "factor"), religious = structure(c(2L, 1L, 2L, 2L,
3L, 2L, 2L, 3L, 1L, 3L, 3L, 1L, 3L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 2L, 3L, 2L, 2L, 2L, 3L, 3L, 3L,
2L, 2L, 2L, 2L, 2L, 2L, 3L, 2L, 1L, 2L, 2L, 2L, 2L, 1L, 3L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 2L, 2L, 2L, 2L,
2L, 2L, 3L, 2L, 2L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 1L, 3L, 1L, 2L, 2L, 2L,
1L, 1L, 2L, 2L, 2L, 1L, 2L, 1L, 1L, 2L, 2L, 1L, 2L, 1L, 2L, 2L,
2L, 1L, 2L, 2L, 3L, 2L, 2L, 2L, 2L, 2L, 3L, 2L, 3L, 2L, 2L, 3L,
2L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 3L, 2L, 2L, 3L, 2L, 1L, 3L, 1L, 2L, 3L, 2L,
2L, 1L, 2L, 3L, 3L, 3L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 2L, 3L, 2L,
3L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 2L, 3L, 2L, 2L, 3L, 2L, 2L, 2L,
2L, 2L, 2L, 1L, 2L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 2L, 2L,
1L, 2L, 2L, 2L, 2L, 3L, 2L, 3L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 3L,
3L, 2L, 2L, 2L, 1L, 2L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L,
2L, 2L, 3L, 2L, 3L, 1L), .Label = c("Religious", "Somewhat religious",
"Not religious"), class = "factor"), family_role_recoded = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 2L,
1L, 1L, 2L, 1L, 1L, 2L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 2L, 1L,
2L, 2L, 2L, 2L, 1L, 2L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 2L,
1L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L,
1L, 2L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 2L, 2L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 2L,
2L, 2L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L,
2L, 1L, 1L, 2L, 1L, 1L, 1L, 2L, 1L, 1L, 2L, 2L, 1L, 2L, 1L, 2L,
1L, 2L, 2L, 1L, 1L, 2L, 1L, 2L, 2L, 1L, 2L, 1L, 1L, 2L, 1L, 1L,
1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 2L, 1L, 1L, 2L, 1L,
1L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 2L, 1L, 2L, 1L, 1L, 1L, 2L, 1L,
2L, 2L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 2L,
1L, 2L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 2L, 1L, 1L, 2L, 2L,
2L, 1L, 2L, 2L, 1L, 2L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 2L, 1L, 2L,
1L, 1L, 2L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 2L,
1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L), .Label = c("Agree/strongly agree",
"Disagree/strongly disagree", "Don't know"), class = "factor")), row.names = c(NA,
250L), class = "data.frame")
I used these regression models:
model1 <- glm(family_role_recoded ~ urban_rural,
family=binomial(link='logit'),
subset = (family_role_recoded != "Don't know" & urban_rural != "Refugee camp"),
data=dataset)
model2 <- glm(family_role_recoded ~ religious,
family=binomial(link='logit'),
subset = (family_role_recoded != "Don't know" & urban_rural != "Refugee camp"),
data=dataset)
model3 <- glm(family_role_recoded ~ urban_rural + religious,
family=binomial(link='logit'),
subset = (family_role_recoded != "Don't know" & urban_rural != "Refugee camp"),
data=dataset)
Does anyone have an idea how to solve this problem?
If you set the reference for religious to be "Somewhat religious" first. We can look at the results first :
library(broom)
dataset$religious = relevel(dataset$religious,ref="Somewhat religious")
fit0 = glm(family_role_recoded ~ urban_rural*religious,data=dataset,family=binomial())
# A tibble: 6 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -0.902 0.181 -4.99 6.03e-7
2 urban_ruralRural -0.484 0.532 -0.910 3.63e-1
3 religiousReligious -0.0141 0.456 -0.0308 9.75e-1
4 religiousNot religious 1.47 0.391 3.76 1.67e-4
5 urban_ruralRural:religiousReligious 0.995 1.14 0.876 3.81e-1
6 urban_ruralRural:religiousNot religio… 0.201 0.993 0.203 8.39e-1
You have one of the terms rural/religious. Intuitively, the Urban/Not religious term would be the flip of urban_ruralRural:religiousNot religio. We can also manually define the interaction terms we need:
dataset$Rural_religious = with(dataset,as.numeric(urban_rural=="Rural" & religious=="Religious"))
dataset$Urban_not_religious = with(dataset,as.numeric(urban_rural=="Urban" & religious=="Not religious"))
fit = glm(family_role_recoded ~ 0+urban_rural+religious+Urban_not_religious+Rural_religious,data=dataset,family=binomial())
tidy(fit)
# A tibble: 6 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 urban_ruralUrban -0.902 0.181 -4.99 0.000000603
2 urban_ruralRural -1.39 0.500 -2.77 0.00556
3 religiousReligious -0.0141 0.456 -0.0308 0.975
4 religiousNot religious 1.67 0.913 1.83 0.0667
5 Urban_not_religious -0.201 0.993 -0.203 0.839
6 Rural_religious 0.995 1.14 0.876 0.381
You need to do a post hoc test. For that you can use the R package "emmeans"

unable to remove NA values in ggplot

I am very new to the r programing. I am trying to build ggplot without NA values. I am unable to remove NA values. Below is the syntax I am using. Please I need help with this.
ggplot(data = phq_projekt, aes(x = Sex, fill = Gruppe, na.rm = TRUE )) +
geom_bar(position = 'stack', na.rm = TRUE)
Not sure if I am missing some syntax.
Below is the data:
c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L,
1L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L,
1L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 1L,
2L, 2L, 1L, 2L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 1L,
1L, 2L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 2L,
1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 1L,
1L, 1L, 2L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 2L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 2L, 2L, 1L, 2L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 2L, 1L, 2L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L,
1L, 1L, 1L, 1L, NA, 2L, 1L, 2L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 2L,
2L, 1L, 1L, 1L, 2L, 1L, 1L, 2L, 1L, 1L, 2L, 1L, 1L, 1L, 2L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 2L, 1L, 1L, 1L,
1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L,
1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 2L, 1L, 1L, 1L, 2L, 1L,
1L, 2L, 1L, 1L, 1L, 2L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L,
1L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 2L,
1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 2L, 1L, 1L, 2L, 1L, 1L, 1L,
1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 2L, 1L, 2L, 1L,
2L, 2L, 1L, 2L, 1L, 2L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 1L, 1L,
1L, 1L, 2L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L,
1L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 1L, 2L, 1L, 2L, 2L, 2L, 1L, 1L,
1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L,
1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 2L, 1L, 1L, 2L, 1L, 1L, 1L, 1L,
1L, 2L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 2L,
2L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 2L, 1L, 2L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L,
1L, 1L, 1L, 2L, 1L, 1L, 2L, 2L, 2L, 1L, 2L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 2L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 1L,
1L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 2L, 1L, 1L, 1L,
2L, 1L, 1L, 2L, 1L, 1L, 2L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L,
2L, 2L, 2L, 2L, 2L, 1L, 1L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 1L, 2L, 2L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 2L, 2L, 2L,
2L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L,
1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 1L,
1L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
1L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 2L,
2L, 2L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L,
1L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L)

Standard evaluation with mutate_ to calculate percentages by group

I am trying to use standard evaluation with dplyr to calculate percents as a function of two grouping variables. The problem is in my mutate_ statement.
Here is a dataset:
structure(list(
var1 = structure(c(2L, 1L, 1L, 2L, 1L, 2L, 1L,
2L, 2L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 2L, 2L, 1L, 1L,
2L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 2L, 2L, 1L, 2L,
2L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 2L, 2L, 1L, 2L, 2L, 1L, 2L, 1L,
2L, 2L, 1L, 1L, 2L, 1L, 1L, 2L, 1L, 1L, 1L, 2L, 1L, 1L, 2L, 1L,
1L, 2L, 2L, 1L, 2L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 2L, 1L, 1L,
2L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 1L, 1L
),
.Label = c("No", "Yes"), class = "factor"),
var2 = structure(c(2L, 2L, 1L, 2L,
2L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 1L, 1L, 2L, 1L, 2L,
1L, 2L, 2L, 1L, 2L, 2L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 2L, 1L,
1L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 2L, 1L, 2L, 1L, 2L, 2L, 1L, 2L,
2L, 1L, 1L, 2L, 1L, 2L, 2L, 1L, 2L, 2L, 1L, 2L, 2L, 1L, 1L, 1L,
2L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 2L, 1L, 2L, 1L,
1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 2L, 2L, 2L
),
.Label = c("Female", "Male"), class = "factor")),
.Names = c("var1", "var2"), row.names = c(NA, -100L), class = "data.frame")
Here is the code I am working with:
for_plots = function(data, var1, var2){
grouped_data = data %>% group_by_(var1, var2) %>%
summarise_(n_in_group = ~n()) %>%
mutate_(.dots = setNames(list(
interp(quote(n_in_group / sum(n_in_group, na.rm = TRUE) * 100),
n_in_group = as.name(n_in_group)))
))
return(grouped_data)
}
When I run the code, I receive an error:
Error in setNames(list(interp(quote(n_in_group/sum(n_in_group, na.rm = TRUE) * :
argument "nm" is missing, with no default
Any thoughts?
Here is some code based on #Frank's response:
for_plots = function(data, var1, var2) {
grouped_data = data %>% group_by_(var1, var2) %>%
summarise_(n_in_group = ~n()) %>%
mutate(percent = (n_in_group / sum(n_in_group, na.rm = TRUE)) * 100)
return(grouped_data)
}

What is the meaning of the warning message about log(P) when calculating a polychoric correlation with 'hetcor'?

When calculating a polychoric correlation in R (library(polycor), function hetcor) I get the warning message In log(P) : NaNs produced. I wasn't able to figure out what this warning message might constitute. I suppose it has to do with the calculation of the p-values for testing bivariate normality.
Thus my questions are:
What characteristics of this dataset result in this warning?
What's the meaning of this warning?
Is this warning problematic in terms of using the polychoric correlation matrix for further analyses?
Data subset:
foo <- structure(list(item1 = structure(c(4L, 4L, 4L, 2L, 2L, 2L,
2L, 2L, 4L, 2L, 2L, 3L, 2L, 3L, 2L, 2L, 2L, 3L, 2L, 2L, 3L, 1L,
2L, 2L, 3L, 3L, 3L, 2L, 2L, 1L, 1L, 2L, 3L, 2L, 2L, 3L, 2L, 3L,
2L, 2L, 2L, 2L, 3L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 3L, 3L, 2L, 3L, 3L, 3L, 2L, 2L, 2L, 1L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 1L, 3L, 2L, 2L, 1L, 2L, 2L, 2L, 1L, 2L,
1L, 2L, 2L, 4L, 2L, 4L, 2L, 2L, 3L, 1L, 2L, 1L, 2L, 2L, 2L, 1L,
2L, 2L, 3L, 2L, 2L, 2L, 3L, 1L, 2L, 2L, 2L, 2L, 4L, 2L, 2L, 2L,
2L, 2L, 2L, 4L, 2L, 2L, 1L, 2L, 2L, 2L, 1L, 2L, 1L, 2L, 3L, 3L,
1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 2L, 3L, 3L, 3L
), .Label = c("0", "1", "2", "3"), class = c("ordered", "factor"
)), item2 = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L,
1L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 3L, 2L, 1L, 3L, 2L, 1L, 1L, 3L,
1L, 3L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 3L, 2L, 2L, 1L,
3L, 2L, 1L, 2L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 3L, 1L, 1L,
2L, 3L, 2L, 1L, 2L, 2L, 3L, 1L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 1L,
1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 2L, 1L, 1L, 1L,
2L, 2L, 3L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 2L,
2L, 1L, 2L, 1L, 2L, 1L, 3L, 2L, 1L, 3L, 1L, 1L, 1L, 2L, 2L, 1L,
2L, 1L, 3L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 4L, 1L, 1L, 1L,
1L, 2L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 4L, 1L, 1L, 3L), .Label = c("0",
"1", "2", "3"), class = c("ordered", "factor")), item3 = structure(c(4L,
4L, 4L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 4L, 1L, 2L, 1L, 1L, 1L,
1L, 2L, 1L, 4L, 2L, 2L, 1L, 3L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 2L, 3L, 1L, 1L, 1L, 2L, 1L, 1L,
2L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 2L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 2L, 1L, 2L, 1L, 1L, 2L, 2L, 1L, 2L, 1L, 1L, 1L, 1L, 1L,
1L, 3L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 2L, 1L, 2L, 2L, 1L, 2L, 3L,
1L, 3L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 2L, 3L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
2L, 1L, 3L, 2L, 1L), .Label = c("0", "1", "2", "3"), class = c("ordered",
"factor")), item4 = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
2L, 1L, 1L, 1L, 3L, 1L, 2L, 1L, 1L, 1L, 2L, 2L, 1L, 3L, 2L, 1L,
1L, 3L, 1L, 3L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 2L, 2L,
2L, 1L, 2L, 2L, 1L, 2L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 2L,
1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 3L, 1L, 2L, 3L, 2L, 1L, 1L, 1L,
1L, 1L, 2L, 1L, 1L, 1L, 2L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 2L, 1L,
1L, 2L, 1L, 2L, 3L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L,
1L, 2L, 2L, 2L, 3L, 1L, 1L, 2L, 2L, 2L, 1L, 3L, 1L, 1L, 1L, 2L,
2L, 1L, 1L, 1L, 2L, 1L, 3L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 4L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 4L, 1L, 2L, 3L), .Label = c("0",
"1", "2", "3"), class = c("ordered", "factor")), item5 = structure(c(4L,
4L, 4L, 1L, 1L, 1L, 1L, 2L, 3L, 2L, 2L, 4L, 2L, 3L, 2L, 1L, 1L,
3L, 3L, 3L, 4L, 3L, 2L, 1L, 3L, 3L, 4L, 1L, 2L, 1L, 1L, 1L, 2L,
2L, 2L, 3L, 3L, 3L, 3L, 1L, 1L, 3L, 4L, 2L, 1L, 2L, 2L, 2L, 2L,
3L, 1L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 4L, 3L, 3L, 1L,
2L, 1L, 1L, 3L, 1L, 2L, 2L, 1L, 3L, 2L, 1L, 2L, 2L, 1L, 1L, 2L,
1L, 2L, 4L, 2L, 2L, 1L, 2L, 2L, 4L, 2L, 4L, 1L, 1L, 2L, 1L, 1L,
1L, 2L, 2L, 2L, 2L, 3L, 2L, 3L, 2L, 1L, 3L, 2L, 1L, 1L, 3L, 3L,
1L, 4L, 1L, 1L, 1L, 1L, 2L, 3L, 3L, 3L, 2L, 1L, 3L, 2L, 1L, 1L,
1L, 1L, 2L, 3L, 4L, 1L, 1L, 1L, 1L, 2L, 1L, 2L, 1L, 1L, 3L, 1L,
3L, 3L, 4L, 3L, 3L), .Label = c("0", "1", "2", "3"), class = c("ordered",
"factor"))), .Names = c("item1", "item2", "item3", "item4",
"item5"))
Computation of correlation matrix:
hetcor(foo)
Comment: the real dataset contains about 2500 rows (and more variables), but when evaluating the contingency tables a sparse matrix doesn't seem to be an issue.
A short (and belated) answer to a very old question. The warning is because some of the cells in the cross tabulation of the variables (for example, variables 1 and 2) have 0 values in the cells. This can lead to problems in estimation.
The polychoric (and tetrachoric) correlations are normal theory approximations of what would happen if bivariate normal (and continuous) data were converted into categorical (dichotomous for tetrachorics, polytomous for polychorics) data. The normal theory approximation assumes that all cells have some value. However, the correlations can be found with 0 cell values, but with a warning. The resulting correlations are correct, but unstable, in that if we add a small correction for continuity (i.e., add .1 or .5 to the 0 cells), the values change a great deal. This problem is discussed by Gunther and Hofler for the case of tetrachoric correlations where they compare solutions with and with the correction for continuity.
(See the article by A. Gunther and M. Hofler. Different results on tetrachorical correlations in mplus and stata-stata announces modified procedure. Int J Methods Psychiatr Res, 15(3):157-66, 2006. for a discussion of this problem with tetrachoric correlations.)
Using the polychoric function in the psych package, we find the same answer as the hetcor function from polycor if we do not apply the correction for continuity, but somewhat different values if we do correct for continuity. I recommend the correction.
See the help function for polychoric in psych for a longer discussion of this problem.

Resources