I'm trying to perform a pretty straightforward clustering analysis but can't get the results right. My question for a large dataset is "Which diseases are frequently reported together?". The simplified data sample below should result in 2 clusters: 1) headache / dizziness 2) nausea / abd pain. However, I can't get the code right. I'm using the pam and daisy functions. For this example I manually assign 2 clusters (k=2) because I know the desired result, but in reality I explore several values for k.
Does anyone know what I'm doing wrong here?
library(cluster)
library(dplyr)
dat <- data.frame(ID = c("id1","id1","id2","id2","id3","id3","id4","id4","id5","id5"),
PTName = c("headache","dizziness","nausea","abd pain","dizziness","headache","abd pain","nausea","headache","dizziness"))
gower_dist <- daisy(dat, metric = "gower")
k <- 2
pam_fit <- pam(gower_dist, diss = TRUE, k) # performs cluster analysis
pam_results <- dat %>%
mutate(cluster = pam_fit$clustering) %>%
group_by(cluster) %>%
do(the_summary = summary(.))
head(pam_results$the_summary)
The format in which you pass the dataset to the clustering algorithm does not match your objective. If you want to group diseases that are reported together but you also include the IDs in your dissimilarity matrix, the IDs will contribute to the matrix construction, which you do not want, since your objective concerns only the diseases.
Hence, we need to build a dataset in which each row is a patient with all the diseases he/she reported, and then construct the dissimilarity matrix only on the numeric features. For this task, I add a column presence with value 1 if the disease is reported by the patient and 0 otherwise; the zeros are filled in automatically by the function pivot_wider (link).
Here is the code I used; I think it achieves what you wanted, but please tell me if that is not the case.
library(cluster)
library(dplyr)
library(tidyr)
dat <- data.frame(ID = c("id1","id1","id2","id2","id3","id3","id4","id4","id5","id5"),
PTName = c("headache","dizziness","nausea","abd pain","dizziness","headache","abd pain","nausea","headache","dizziness"),
presence = 1)
# build the wider dataset: each row is a patient
dat_wider <- pivot_wider(
dat,
id_cols = ID,
names_from = PTName,
values_from = presence,
values_fill = list(presence = 0)
)
# in the dissimilarity matrix construction, we leave out the ID column
gower_dist <- daisy(dat_wider %>% select(-ID), metric = "gower")
k <- 2
set.seed(123)
pam_fit <- pam(gower_dist, diss = TRUE, k)
pam_results <- dat_wider %>%
mutate(cluster = pam_fit$clustering) %>%
group_by(cluster) %>%
do(the_summary = summary(.))
head(pam_results$the_summary)
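As a quick sanity check (a sketch; the 1/2 cluster labels are arbitrary), you can list which patients fall into each cluster. With this toy data, the headache/dizziness patients (id1, id3, id5) should separate from the nausea/abd pain patients (id2, id4):

```r
# Inspect the patient IDs per cluster (cluster labels are arbitrary)
split(dat_wider$ID, pam_fit$clustering)
```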
Furthermore, since you are working only with binary data, instead of Gower's distance you can consider the Simple Matching or Jaccard distance if they suit your data better. In R you can compute them with
sm_dist <- dist(dat_wider %>% select(-ID), method = "manhattan")/p
j_dist <- dist(dat_wider %>% select(-ID), method = "binary")
respectively, where p is the number of binary variables you want to consider.
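For completeness, here is a sketch of how p could be obtained from the dat_wider built above: with purely 0/1 columns, the Manhattan distance divided by the number of variables is exactly the proportion of mismatching variables, i.e. Simple Matching.

```r
X <- dat_wider %>% select(-ID)  # keep only the binary disease columns
p <- ncol(X)                    # number of binary variables

sm_dist <- dist(X, method = "manhattan") / p  # Simple Matching distance
j_dist  <- dist(X, method = "binary")         # Jaccard distance
```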
I am really new to the field of setting up survey weights, and I need help. I have the following example data frame, representing a multi-stage survey (5 clusters at stage 1 and 10 clusters at stage 2):
set.seed(111)
mood <- sample(c("happy","neutral","grumpy"),
size = 1000,
replace=TRUE,
c(0.3,0.3,0.4))
set.seed(222)
sex <- sample(c("female","male"),
size=1000,
replace=TRUE,
c(0.6,0.4))
set.seed(333)
age_group <- sample(c("young","middle","senior"),
size=1000,
replace=TRUE,
c(0.2,0.6,0.2))
status <- data.frame(mood=mood,
sex=sex,
age_group=age_group,
income = trunc(runif(1000,1000,2000)),
dnum = rep(c(441,512,39,99,61),each = 200),
snum = rep(c(1,2,3,4,5,6,7,8,9,10),each=100),
fpc1 = rep(c(100,200,300,400,500),each=200),
fpc2 = rep(c(10,9,8,10,7,6,13,9,5,12),each=100) )
# To take the two cluster populations (fpc1 and fpc2) into account,
# I calculated the probability proportional to size of each unit as follows
# (using a method mentioned in a previous question;
# the link to that question is at the end of this post):
status1 <- status %>%
group_by(fpc1,fpc2) %>%
summarise(n = n(), .groups = 'drop') %>%
mutate(fpc = n/sum(n)) %>%
right_join(status, by = c("fpc1", "fpc2"))
That way, we take the clusters into account when setting up the PPS for each unit.
So my question is (assuming there are no missing values): we create the design weights as the inverse of the new fpc column, right?
And if we then want to adjust for other variables (mood, sex, age_group) so that my sample becomes representative of the target population, we adjust the design weights using some calibration method such as raking, propensity scores, etc. Is this the correct way, or did I misunderstand something about using RStudio to assign weights to my sample?
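Not a definitive answer, but a minimal sketch of what "design weights as the inverse of the new fpc column" could look like with the survey package, assuming dnum is the stage-1 cluster and snum the stage-2 cluster. Whether the derived fpc column is a valid selection probability for your design is exactly the question to settle.

```r
library(survey)

# Hypothetical sketch: inverse-probability weights from the derived fpc column
status1$wt <- 1 / status1$fpc

# Two-stage design: dnum = stage-1 clusters, snum = stage-2 clusters
des <- svydesign(id = ~dnum + snum, weights = ~wt, data = status1)
svymean(~income, des)
```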
The link of the referred question :
survey package in R: How to set fpc argument (finite population correction)
Thanks.
I am trying to produce estimates of the 25th percentile of a continuous variable for a series of sub-groups, where the data is taken from a survey that uses sampling weights. I am doing this in R using the survey and srvyr packages.
The issue I face is that in a small minority of cases a sub-group has only one observation, so a 25th percentile is meaningless. That would be fine, except it raises an error which prevents the percentiles from being calculated for the subgroups that do have sufficient observations.
Error in approxfun(cum.w, xx[oo], method = method, f = f, yleft = min(xx), :
need at least two non-NA values to interpolate
The code runs when the offending groups are removed, but I have had to identify them manually, which is far from ideal.
Is there a way to achieve the same outcome where, for single-observation groups, an NA (or just the value of that observation) is output rather than an error? Alternatively, is there a neat way of automatically excluding such groups from the calculation?
Below is a reproducible example to illustrate my issue using the apistrat dataset from the survey package.
library(dplyr)
library(survey)
library(srvyr)
data(api)
#25th percentile of api00 by school type and whether school is year round or not
apistrat %>%
as_survey(strata = stype, weights = pw) %>%
group_by(yr.rnd, stype, .drop=TRUE) %>%
summarise(survey_quantile(api00, 0.25, na.rm=T))
#Error in approxfun(cum.w, xx[oo], method = method, f = f, yleft = min(xx), :
#need at least two non-NA values to interpolate
apistrat %>% group_by(yr.rnd, stype) %>% tally() %>% filter(n==1)
#one group out of 6 has only a single api00 observation and therefore a quantile can't be interpolated
#Removing that one group means the code can now run as intended
apistrat %>%
as_survey(strata = stype, weights = pw) %>%
filter(!(yr.rnd=="Yes"&stype=="H")) %>%
group_by(yr.rnd, stype, .drop=TRUE) %>%
summarise(survey_quantile(api00, 0.25, na.rm=T))
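One way to make the exclusion automatic (a sketch): identify the single-observation groups with a plain dplyr count first, then filter them out inside the survey pipeline instead of hard-coding them.

```r
# Find the groups with a single observation
lonely <- apistrat %>%
  count(yr.rnd, stype) %>%
  filter(n == 1)

# Exclude those groups before summarising
apistrat %>%
  as_survey(strata = stype, weights = pw) %>%
  filter(!(paste(yr.rnd, stype) %in% paste(lonely$yr.rnd, lonely$stype))) %>%
  group_by(yr.rnd, stype, .drop = TRUE) %>%
  summarise(survey_quantile(api00, 0.25, na.rm = TRUE))
```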
#Get the same error if you do it the 'survey' package way
dstrat <- svydesign(id=~1,strata=~stype,data=apistrat, fpc=~fpc)
svyby(~api99, ~stype+yr.rnd, dstrat, svyquantile, quantiles=0.25)
One work-around is to wrap the call to svyquantile() using tryCatch()
svyq <- function(...){ tryCatch(svyquantile(...), error = function(e) matrix(NA, 1, 1)) }
svyby(~api99, ~stype+yr.rnd, dstrat, svyq, quantiles = 0.25, keep.var = FALSE, na.rm = TRUE)
stype yr.rnd statistic
E.No E No 560.50
H.No H No 532.75
M.No M No 509.00
E.Yes E Yes 456.00
H.Yes H Yes NA
M.Yes M Yes 436.00
With quantiles and svyby you need to be explicit about whether you want standard errors; the code above doesn't request them (keep.var = FALSE). If you want standard errors, you'd need the error= branch of tryCatch to return an actual svyquantile object with NAs in it.
This might be a very simple question. Suppose I have multiple populations of categorical values as well as a group of 'target' categories.
e.g.
set.seed(500)
pops <- list(
val1 = c('20','20','10','90','100','30','10','20'),
val2 = c('20','110','1400','50','40'),
val3 = c('100','50','30')
)
target <- c('20','100','40')
What would be the average probability of drawing at least one of the target categories from all populations?
I can calculate the frequency distribution of each value and therefore the chance of getting a specific result.
# Frequency table
p <- table(pops$val1) / length(pops$val1)
# The probability of getting at least one of the target values
sum(p[which(names(p) %in% target)])
# 0.5
The problem is that this calculation is not independent of sampling size, since increasing N obviously increases the probability that at least one of the categories is present.
Does anyone have an idea how to assess this in a way that is unbiased by sample size?
We can use
sapply(pops, function(x) {
p <- table(x)/length(x)
sum(p[which(names(p) %in% target)])
})
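If the goal is the average probability across all populations, you can take the mean of the per-population probabilities the sapply() call returns (here 0.5, 0.4 and 1/3):

```r
probs <- sapply(pops, function(x) {
  p <- table(x) / length(x)
  sum(p[names(p) %in% target])
})
mean(probs)  # (0.5 + 0.4 + 1/3) / 3, about 0.411
```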
Or using tidyverse
library(tidyverse)
stack(pops) %>%
group_by(ind) %>%
mutate(n1 = n()) %>%
group_by(values, .add = TRUE) %>% # '.add' replaced 'add' in dplyr >= 1.0
summarise(perc = n()/n1[1]) %>%
filter(values %in% target) %>%
summarise(perc = sum(perc))
I need to run a clustering or similarity analysis on some biological data, and I am looking for an output like the one SIMPROF gives, i.e. a dendrogram or hierarchical cluster.
However, I have 3200 observations/rows per group. SIMPROF (see the example here)
library(clustsig)
usarrests<-USArrests[,c(1,2,4)]
rownames(usarrests)<-state.abb
# Run simprof on the data
res <- simprof(data= usarrests,
method.distance="braycurtis")
# Graph the result
pl.color <- simprof.plot(res)
seems to expect only one observation per group (one US state in this example).
Now, again, my biological data (140k rows total) has about 3200 obs per group.
I am trying to cluster the groups together that have a similar representation in the variables provided.
It is as if, in the example above, AK were represented by more than one observation.
What's my best bet for a function/package/analysis?
Cheers,
Mo
Example from a paper: [dendrogram image omitted]
The solution became obvious upon further reflection.
Instead of using all observations (200k) in the long format, I combined longitude and depth of sampling into one variable, used like sampling units along a transect. This yields 3800 columns of longitude-depth combinations and 61 rows for the taxa, with the value variable being the abundance of each taxon (if you want to cluster the sampling units instead, transpose the data frame). This is feasible for hclust or SIMPROF, since the quadratic complexity now applies to only 61 rows (as opposed to the ~200k I tried at the beginning).
Cheers
Here is some code:
library(reshape2)
library(dplyr)
d4<-d4 %>% na.omit() %>% arrange(desc(LONGITUDE_DEC))
# make 1 variable of longitude and depth that can be used for all taxa measured, like
# community ecology sampling units
d4$sampling_units<-paste(d4$LONGITUDE_DEC,d4$BIN_MIDDEPTH_M)
d5<-d4 %>% select(PREDICTED_GROUP,CONCENTRATION_IND_M3,sampling_units)
d5<-d5%>%na.omit()
# dcast the data frame so that you get taxa as rows, sampling units as columns, with
# concentration/abundance as values.
d6<-dcast(d5,PREDICTED_GROUP ~ sampling_units, value.var = "CONCENTRATION_IND_M3")
d7<-d6 %>% na.omit()
d7$PREDICTED_GROUP<-as.factor(d7$PREDICTED_GROUP)
# give the rownames the taxa names
rownames(d7)<-paste(d7$PREDICTED_GROUP)
#delete that variable that is no longer needed
d7$PREDICTED_GROUP<-NULL
library(vegan)
# calculate the dissimilarity matrix with vegdist so you can use the Sorensen/Bray
# method
distBray <- vegdist(d7, method = "bray")
# calculate the clusters with ward.D2
clust1 <- hclust(distBray, method = "ward.D2")
clust1
#plot the cluster dendrogram with dendextend
library(dendextend)
library(ggdendro)
library(ggplot2)
dend <- clust1 %>% as.dendrogram %>%
set("branches_k_color", k = 5) %>% set("branches_lwd", 0.5) %>% set("clear_leaves") %>% set("labels_colors", k = 5) %>% set("leaves_cex", 0.5) %>%
set("labels_cex", 0.5)
ggd1 <- as.ggdend(dend)
ggplot(ggd1, horiz = TRUE)
In the R package caret, can we create stratified training and test sets based on several variables using the function createDataPartition() (or createFolds() for cross-validation)?
Here is an example for one variable:
#2/3rds for training
library(caret)
inTrain = createDataPartition(df$yourFactor, p = 2/3, list = FALSE)
dfTrain=df[inTrain,]
dfTest=df[-inTrain,]
In the code above the training and test sets are stratified by 'df$yourFactor'. But is it possible to stratify using several variables (e.g. 'df$yourFactor' and 'df$yourFactor2')? The following code seems to work but I don't know if it is correct:
inTrain = createDataPartition(df$yourFactor, df$yourFactor2, p = 2/3, list = FALSE)
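It probably is not doing what you expect: the second positional argument of createDataPartition() is times (the number of partitions to create), not a second stratification variable. A common workaround (a sketch, using the same hypothetical df and column names as above) is to stratify on the interaction of the two factors:

```r
# Combine the two factors into a single stratification factor
df$strat <- interaction(df$yourFactor, df$yourFactor2, drop = TRUE)

inTrain <- createDataPartition(df$strat, p = 2/3, list = FALSE)
dfTrain <- df[inTrain, ]
dfTest  <- df[-inTrain, ]
```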
This is fairly simple if you use the tidyverse.
For example:
df <- df %>%
mutate(n = row_number()) %>% # create a row number if you don't have one
select(n, everything()) # put 'n' at the front of the dataset
train <- df %>%
group_by(var1, var2) %>% #any number of variables you wish to partition by proportionally
sample_frac(.7) # '.7' is the proportion of the original df you wish to sample
test <- anti_join(df, train) # creates the test data frame with the observations not in 'train'
There is a better way to do this.
set.seed(1)
n <- 1e4
d <- data.frame(yourFactor = sample(1:5,n,TRUE),
yourFactor2 = rbinom(n,1,.5),
yourFactor3 = rbinom(n,1,.7))
# stratum indicator
d$group <- interaction(d[, c('yourFactor', 'yourFactor2')])
# sample selection
indices <- tapply(1:nrow(d), d$group, sample, 30)
# obtain subsample
subsampd <- d[unlist(indices, use.names = FALSE), ]
What this does is draw a size-30 random stratified sample from every combination of yourFactor and yourFactor2.
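A quick check on the result (a sketch; note that sample() will error on any stratum with fewer than 30 rows, so this assumes all combinations are large enough):

```r
# Each stratum should contribute exactly 30 rows
table(subsampd$group)
```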