K means clustering of variable with multiple values - r

I have sample data below, taken from a large data set, where each participant is scored under multiple conditions.
Participant<-c("p1","p1","p2","p2","p3","p3")
Condition<-c( "c1","c2","c1","c2","c1","c2")
Score<-c(4,5, 5,7,8,2)
T<-data.frame(Participant, Condition, Score)
I am trying to use k-means clustering to split participants into different groups. Is there a good way to do it, considering the condition is not numeric?
thanks!

@Anony has the right idea. You actually do have numeric data: there is (evidently) a c1-score and a c2-score for each participant. So you need to convert your data from "long" format (data in a single column, Score, with a second column, Condition, differentiating the scores) to "wide" format (scores under different conditions in separate columns). Then you can run k-means clustering on the scores to group the participants.
Here is how you would do that in R, using a slightly larger example to demonstrate the clusters.
# example with 100 Participants in 3 clusters
set.seed(1) # for reproducible example
T <- data.frame(Participant=rep(paste0("p",sprintf("%03i",1:100)),each=2),
                Condition  =paste0("c",1:2),
                Score      =c(rpois(70,c(10,25)),rpois(70,c(25,10)),rpois(60,c(15,10))))
head(T)
#   Participant Condition Score
# 1        p001        c1     8
# 2        p001        c2    25
# 3        p002        c1     7
# 4        p002        c2    27
# 5        p003        c1    14
# 6        p003        c2    28
library(reshape2) # for dcast(...)
# convert from long to wide format
result <- dcast(T,Participant~Condition,value.var="Score")
# k-means on the columns containing scores - look for 3 clusters
result$clust <- kmeans(result[,2:ncol(result)],centers=3)$cluster
result[sample(1:100,6),] # just a random sample of 6 rows
#    Participant c1 c2 clust
# 12        p012 13 21     1
# 24        p024  7 32     1
# 85        p085 10  6     2
# 43        p043 27  5     3
# 48        p048 29 11     3
# 66        p066 24 17     3
Now we can plot the scores, showing how the participants cluster.
# plot the scores for each Participant, color coded by cluster.
plot(c2~c1,result,col=result$clust, pch=20)
EDIT: Response to OP's comment.
OP wants to know what to do if there is more than one score for a participant/condition. The answer depends on why there are multiple scores. If the replicates are random and have a central tendency, then probably taking the mean is justified, although in theory participants with more replicates should be more heavily weighted.
On the other hand, suppose these are test scores. Then generally (but not always), the scores go up with multiple sittings. So these scores would not be random: there is a trend. In that case it might be more meaningful to take the most recent score.
As a third example, if the scores are used to make a decision based on some policy (such as with the SAT, where most colleges use the highest score), then the most appropriate aggregating function might be max, not mean.
Finally, it might be the case that the number of replicates is in fact an important distinguishing characteristic. In that case you would include not just the scores but also the number of replicates for each participant/condition when clustering. This is relevant in certain kinds of standardized testing under NCLB, where students take the test over and over again until they pass.
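The aggregation choices described above can be sketched in base R. This is only an illustration under assumptions: the data frame `scores` is hypothetical (p1 has two replicate scores under c1, listed in chronological order), and `tapply()` is used here instead of `dcast()` simply to keep the sketch free of package dependencies.

```r
# Hypothetical long-format data: p1 was scored twice under condition c1
scores <- data.frame(
  Participant = c("p1", "p1", "p1", "p2", "p2", "p3", "p3"),
  Condition   = c("c1", "c1", "c2", "c1", "c2", "c1", "c2"),
  Score       = c(6, 4, 5, 5, 7, 8, 2)
)

# Each call returns a participant-by-condition matrix (wide format),
# collapsing replicates with a different aggregating function
groups    <- list(scores$Participant, scores$Condition)
wide_mean <- tapply(scores$Score, groups, mean)    # random replicates
wide_max  <- tapply(scores$Score, groups, max)     # "highest score" policy
wide_last <- tapply(scores$Score, groups,
                    function(s) s[length(s)])      # most recent (assumes row order = time order)
n_reps    <- tapply(scores$Score, groups, length)  # number of replicates

wide_mean
```

Any of these matrices could be passed to `kmeans()`. If the number of replicates matters, the columns of `n_reps` could be bound onto the score matrix with `cbind()` before clustering.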
BTW: This type of question (the one in your comment) definitely belongs on https://stats.stackexchange.com/.

You should pivot your data, so that
each participant is a row
each condition is a column
the scores are your data
Try the reshape2 package.
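As a rough sketch of that pivot, here is one way to do it with base R's `reshape()` rather than reshape2 (a substitution made here only to avoid a package dependency). The data frame is named `scores` instead of `T`, and the choice of 2 clusters is arbitrary.

```r
# Sample data from the question
scores <- data.frame(
  Participant = c("p1", "p1", "p2", "p2", "p3", "p3"),
  Condition   = c("c1", "c2", "c1", "c2", "c1", "c2"),
  Score       = c(4, 5, 5, 7, 8, 2)
)

# One row per participant, one Score.<condition> column per condition
wide <- reshape(scores, idvar = "Participant", timevar = "Condition",
                direction = "wide")
wide

# Cluster on the score columns only (drop the Participant column)
set.seed(1)
km <- kmeans(wide[, -1], centers = 2)
wide$cluster <- km$cluster
```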

You have 3 variables which will be used to split your data into groups. Two of them are categorical, which might cause a problem. You can use k-means to split your data into groups, but you will need to make dummies for your categorical data (condition and participant) and scale your continuous variable Score.
Using categorical data in k-means is not optimal because k-means cannot handle it well. The dummies will be highly correlated, which might cause the algorithm to put too much weight on them and produce suboptimal results.
For this reason, you can use different techniques such as hierarchical clustering, or run a PCA on your data (in order to have continuous uncorrelated data) and then fit an ordinary k-means model on the PC scores.
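A minimal sketch of the PCA-then-k-means idea, under assumptions: the data frame `wide` is simulated here and stands in for scores already arranged with one column per condition, and the choice of 2 clusters is arbitrary. It uses only `prcomp()` and `kmeans()` from base R.

```r
set.seed(1)  # for a reproducible example

# Hypothetical wide-format scores: 70 participants, two conditions
wide <- data.frame(
  c1 = c(rpois(35, 10), rpois(35, 25)),
  c2 = c(rpois(35, 25), rpois(35, 10))
)

# PCA on centered, scaled scores; pca$x holds the uncorrelated PC scores
pca <- prcomp(wide, center = TRUE, scale. = TRUE)

# k-means on the PC scores; nstart reruns the algorithm from several random starts
km <- kmeans(pca$x, centers = 2, nstart = 10)
table(km$cluster)
```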
These links give good answers:
link1
link2
Hope that helps!

Related

What is the difference between these two data in R?

I have two data sets. I am using R to draw a forest plot.
Data one:
                               coef       lower       upper
Males vs Female          0.04088551  0.03483956  0.04693145
85 vs 65 years          -0.05515741 -0.06508088 -0.04523394
Charlsons Medium vs Low -0.03833060 -0.04727946 -0.02938173
Charlsons High vs Low   -0.09247572 -0.12020001 -0.06475144
Data two:
  ..1   mean  lower  upper
1   A 1.4194 0.8560 2.3536
2   B 0.6574 0.2333 1.8523
3   C 0.7751 0.4012 1.4973
4   D 1.0831 0.6587 1.7811
5   E 1.3362 0.6559 2.7221
1. I need data two to have the same structure as data one (not the same values). Data two is a data frame; what type of object is data one?
2. Data one has no row numbers, but data two does. I need the row numbers to be gone.
3. I need data two as a matrix. But if I convert it to a matrix, the row labels are treated as a column when I draw the forest plot.
Can you please suggest anything?

Multinomial logit model in R on grouped data, data conversion and mlogit set-up

I want to estimate the parameters of a multinomial logit model in R and wondered how to correctly structure my data. I’m using the “mlogit” package.
The purpose is to model people's choice of transportation mode. However, the dataset is a time series on aggregated level, e.g.:
This data must be reshaped from grouped count data to ungrouped data. My approach is to make three new rows for every individual, so I end up with a dataset looking like this:
For every individual's choice in the grouped data I make three new rows and use chid to tie these three rows together. I now want to run:
mlogit.data(MyData, choice = "choice", chid.var = "chid", alt.var = "mode")
Is this the correct approach? Or have I misunderstood the purpose of the chid function?
It's too bad this was migrated from stats.stackexchange.com, because you probably would have gotten a better answer there.
The mlogit package expects data on individuals, and can accept either "wide" or "long" data. In the former there is one row per individual indicating the mode chosen, with a separate column for every combination of mode and mode-specific variable (time and price in your example). In the long format there are n rows for every individual, where n is the number of modes, a second column containing TRUE or FALSE indicating which mode was chosen for each individual, and one additional column for each mode-specific variable. Internally, mlogit uses long format datasets, but you can provide wide format and have mlogit transform it for you. In this case, with just two variables, that might be the better option.
Since mlogit expects individuals, and you have counts of individuals, one way to deal with this is to expand your data to have the appropriate number of rows for each mode, filling out the resulting data.frame with the variable combinations. The code below does that:
df.agg <- data.frame(month=1:4,car=c(3465,3674,3543,4334),bus=c(1543,2561,2432,1266),bicycle=c(453,234,123,524))
df.lvl <- data.frame(mode=c("car","bus","bicycle"), price=c(120,60,0), time=c(5,10,30))
get.mnth <- function(mnth) data.frame(mode=rep(names(df.agg[2:4]),df.agg[mnth,2:4]),month=mnth)
df <- do.call(rbind,lapply(df.agg$month,get.mnth))
cols <- unlist(lapply(df.lvl$mode,function(x)paste(names(df.lvl)[2:3],x,sep=".")))
cols <- with(df.lvl,setNames(as.vector(apply(df.lvl[2:3],1,c)),cols))
df <- data.frame(df, as.list(cols))
head(df)
#   mode month price.car time.car price.bus time.bus price.bicycle time.bicycle
# 1  car     1       120        5        60       10             0           30
# 2  car     1       120        5        60       10             0           30
# 3  car     1       120        5        60       10             0           30
# 4  car     1       120        5        60       10             0           30
# 5  car     1       120        5        60       10             0           30
# 6  car     1       120        5        60       10             0           30
Now we can use mlogit(...)
library(mlogit)
fit <- mlogit(mode ~ price+time|0 , df, shape = "wide", varying = 3:8)
summary(fit)
#...
# Frequencies of alternatives:
#  bicycle      bus      car
# 0.055234 0.323037 0.621729
#
# Coefficients :
#         Estimate Std. Error t-value  Pr(>|t|)
# price  0.0047375  0.0003936  12.036 < 2.2e-16 ***
# time  -0.0740975  0.0024303 -30.489 < 2.2e-16 ***
# ...
coef(fit)["time"]/coef(fit)["price"]
# time
# -15.64069
So this suggests that reducing travel time by 1 (minute?) is worth about 15 (dollars)?
This analysis ignores the month variable. It's not clear to me how you would incorporate that, as month is neither mode-specific nor individual specific. You could "pretend" that month is individual-specific, and use a model formula like : mode ~ price+time|month, but with your dataset the system is computationally singular.
To reproduce the result from the other answer, you can use mode ~ 1|month with reflevel="car". This ignores the mode-specific variables and just estimates the effect of month (relative to mode = car).
There's a nice tutorial on mlogit here.
Are price and time real variables that you're trying to make a part of the model?
If not, then you don't need to "unaggregate" that data. It's perfectly fine to work with counts of the outcomes directly (even with covariates). I don't know the particulars of doing that in mlogit but with multinom, it's simple, and I imagine it's possible with mlogit:
# Assuming your original data frame is saved in "df" below
library(nnet)
response <- as.matrix(df[,c('Car', 'Bus', 'Bicycle')])
predictor <- df$Month
# Determine how the multinomial distribution parameter estimates
# are changing as a function of time
fit <- multinom(response ~ predictor)
In the above case the counts of the outcomes are used directly with one covariate, "Month". If you don't care about covariates, you could also just use multinom(response ~ 1) but it's hard to say what you're really trying to do.
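To make the intercept-only case concrete, here is a small sketch. It assumes the monthly counts from the `df.agg` example in the other answer (with its lowercase column names), not the asker's actual data. With no covariates, the fitted probabilities should simply match the pooled share of each mode.

```r
library(nnet)  # multinom()

# Monthly counts by mode (assumed: copied from the df.agg example)
df.agg <- data.frame(month   = 1:4,
                     car     = c(3465, 3674, 3543, 4334),
                     bus     = c(1543, 2561, 2432, 1266),
                     bicycle = c(453, 234, 123, 524))
response <- as.matrix(df.agg[, c("car", "bus", "bicycle")])

# Intercept-only model: one set of mode probabilities shared by all months
fit0 <- multinom(response ~ 1, trace = FALSE)

# Pooled shares computed directly from the counts, for comparison
shares <- colSums(response) / sum(response)

round(fitted(fit0)[1, ], 3)
round(shares, 3)
```

The two printed vectors should agree up to the optimizer's tolerance, which is a quick check that the count matrix is being interpreted as intended.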
Glancing at the "TravelMode" data in the mlogit package and some examples for it though, I do believe the options you've chosen are correct if you really want to go with individual records per person.

Finding aggregate correlation of multiple columns against one column in r

I have a data frame with 11 columns out of which 9 are numeric. I am trying to find out the correlation of 8 columns together against the remaining column i.e., correlation of 8 variables with 1 variable which should generate one value of correlation instead of generating 9 different values in a matrix.
Is it possible, or do I need to calculate the average correlation after calculating the individual correlations? E.g., I am trying to find the correlation of X, Y, Z to A. Using the mentioned methods I get a matrix which gives me an individual score of association for each of X, Y, Z with A, whereas I need one score which takes into account all three of X, Y and Z.
A simulated df is presented below for illustration purposes
             x            y           z   a
1   1.72480753  0.007053053  0.32435032  10
2   0.97227885 -0.844118498 -0.75534119  20
3  -0.53844294 -0.036178789  0.89396765  30
4   1.34695331  0.870119744  0.99400826  40
5   0.02336335  0.514481676  0.95894286  50
6  -0.15239307  0.386061290  0.73541287  60
7  -0.29878116  1.615012645 -0.04416341  70
8  -1.10907706 -1.581093487 -0.93293702  80
9   2.73021114 -0.130141775  1.85304372  90
10  0.22417487  1.170900385 -0.68312974 100
I can do correlation of each row and variable with a but what I want is correlation of x,y,z combined with a
corr.test(df[,1:3],df[,4])
I will appreciate any help towards this problem.
Regards,
Pearson Correlation is defined to be a number relating one sequence (or vector) of values to another (look it up). As far as I know there is no roughly equivalent definition for a group of vectors to another, but you could do something like take the average vector (of the 3 vectors) and correlate a to that.
To me at least that has a more immediate geometric meaning than taking the average of the 3 correlation values.
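The average-vector suggestion could be sketched as follows, using the simulated data from the question. The names `avg_xyz` and `r_avg` are made up for this illustration. Note that a plain average only makes sense if x, y and z are on comparable scales.

```r
# Simulated data frame from the question
df <- data.frame(
  x = c(1.72480753, 0.97227885, -0.53844294, 1.34695331, 0.02336335,
        -0.15239307, -0.29878116, -1.10907706, 2.73021114, 0.22417487),
  y = c(0.007053053, -0.844118498, -0.036178789, 0.870119744, 0.514481676,
        0.386061290, 1.615012645, -1.581093487, -0.130141775, 1.170900385),
  z = c(0.32435032, -0.75534119, 0.89396765, 0.99400826, 0.95894286,
        0.73541287, -0.04416341, -0.93293702, 1.85304372, -0.68312974),
  a = seq(10, 100, by = 10)
)

# Row-wise average of the three predictors: one "combined" vector
avg_xyz <- rowMeans(df[, c("x", "y", "z")])

# A single Pearson correlation between the combined vector and a
r_avg <- cor(avg_xyz, df$a)
r_avg
```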
If you want to compute the correlation of each variable with a, you could do something like:
head(cor(df)[,"a"], -1)
#           x           y           z
# -0.14301569  0.19188340 -0.06561505
You said you wanted to combine these values by averaging, so I suppose you could just take the mean of that:
mean(head(cor(df)[,"a"], -1))
# [1] -0.005582445

Get ordered kmeans cluster labels

Say I have a data set x and do the following kmeans cluster:
fit <- kmeans(x,2)
My question is in regards to the output of fit$cluster: I know that it will give me a vector of integers (from 1:k) indicating the cluster to which each point is allocated. Instead, is there a way to have the clusters be labeled 1,2, etc... in order of decreasing numerical value of their center?
For example: If x=c(1.5,1.4,1.45,.2,.3,.3) , then fit$cluster should result in (1,1,1,2,2,2) but not result in (2,2,2,1,1,1)
Similarly, if x=c(1.5,.2,1.45,1.4,.3,.3) then fit$cluster should return (1,2,1,1,2,2), instead of (2,1,2,2,1,1)
Right now, fit$cluster seems to label the cluster numbers randomly. I've looked into documentation but haven't been able to find anything. Please let me know if you can help!
I had a similar problem. I had a vector of ages that I wanted to separate into 5 factor groups based on a logical ordinal set. I did the following:
I ran the k-means function:
k5 <- kmeans(all_data$age, centers = 5, nstart = 25)
I built a data frame of the k-means indexes and centres; then arranged it by centre value.
library(dplyr) # for tibble(), %>%, arrange() and mutate()
kmeans_index <- as.numeric(rownames(k5$centers))
k_means_centres <- as.numeric(k5$centers)
k_means_df <- tibble(index=kmeans_index, centres=k_means_centres)
k_means_df <- k_means_df %>%
  arrange(centres)
Now that the centres are in the df in ascending order, I created my 5 element factor list and bound it to the data frame:
factors <- c("very_young", "young", "middle_age", "old", "very_old")
k_means_df <- cbind(k_means_df, factors)
Looks like this:
> k_means_df
  index  centres    factors
1     2 23.33770 very_young
2     5 39.15239      young
3     1 55.31727 middle_age
4     4 67.49422        old
5     3 79.38353   very_old
I saved my cluster values in a data frame and created a dummy factor column:
cluster_vals <- tibble(cluster=k5$cluster, factor=NA)
Finally, I iterated through the factor options in k_means_df and replaced the cluster value with my factor/character value within the cluster_vals data frame:
for (i in 1:nrow(k_means_df))
{
  index_val <- k_means_df$index[i]
  factor_val <- as.character(k_means_df$factors[i])
  cluster_vals <- cluster_vals %>%
    mutate(factor=replace(factor, cluster==index_val, factor_val))
}
Voila; I now have a vector of factors/characters that were applied based on their ordinal logic to the randomly created cluster vector.
# A tibble: 3,163 x 2
   cluster factor
     <int> <chr>
 1       4 old
 2       2 very_young
 3       2 very_young
 4       2 very_young
 5       3 very_old
 6       3 very_old
 7       4 old
 8       4 old
 9       2 very_young
10       5 young
# ... with 3,153 more rows
Hope this helps.
K-means is a randomized algorithm, so it is expected (and correct) that the labels are neither consistent across runs nor sorted in "ascending" order.
But you can of course remap the labels as you like.
You seem to be using 1-dimensional data. Then k-means is actually not the best choice for you.
In contrast to 2- and higher-dimensional data, 1-dimensional data can efficiently be sorted. If your data is 1-dimensional, use an algorithm that exploits this for efficiency. There are much better algorithms for 1-dimensional data than for multivariate data.
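The remapping mentioned above can be sketched in a few lines for the 1-dimensional case in the question. This is one possible approach, not the only one: rank the centers in decreasing order and use that ranking as a lookup table. The names `new_label` and `ordered_cluster` are made up for this example.

```r
x <- c(1.5, 0.2, 1.45, 1.4, 0.3, 0.3)

set.seed(1)
fit <- kmeans(x, centers = 2, nstart = 10)

# Rank the centers in decreasing order: the largest center gets label 1
new_label <- rank(-fit$centers[, 1])

# Look up the new label for each point's original cluster number
ordered_cluster <- unname(new_label[fit$cluster])
ordered_cluster
# 1 2 1 1 2 2
```

Because the lookup is built from the fitted centers, the result does not depend on which arbitrary numbers `kmeans()` happened to assign in a given run.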

To draw life table quartiles through boxplot for right censored data in R

I have data from a retrospective survey. The individuals who have not experienced a particular event of interest up to the survey time are put into the censored category, and the rest are uncensored. How can I draw a boxplot for this right-censored data that shows the life table quartiles, taking into account both censored and uncensored observations?
(My variable of interest, 'fbi', is a duration. For uncensored observations the duration is available; for censored observations I have replaced 'fbi' with the time interval between the origin and the survey date. Another dichotomous variable, "cens", distinguishes censored from uncensored cases.)
The data can be emulated with:
fbi <- rpois(100,12)
cens <- sample(0:1,100,replace=TRUE)
test <- data.frame(fbi,cens)
> head(test)
  fbi cens
1  18    0
2  14    0
3  17    1
4  11    1
5   9    0
6  10    1
Using the dummy data you suggested, which I added to the answer, the line below plots 2 boxplots summarising the fbi variable: one including all cases, and one using just the non-censored cases.
boxplot(test$fbi,test$fbi[test$cens==0],names=c("all cases","w/out censored"))
If you'd rather compare the censored to uncensored cases, you could do:
boxplot(fbi ~ cens,data=test,names=c("not censored","censored"))
edit
In response to the comment below, is the following piece of code using the NADA library what you are seeking?
library(NADA)
cenboxplot(test$fbi, as.logical(test$cens))
There is documentation on the cenboxplot function online here: http://rss.acs.unt.edu/Rdoc/library/NADA/html/cenboxplot.html
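If what is wanted is the quartiles themselves with right censoring taken into account, one option not used in the answer above is a Kaplan-Meier fit from the survival package. The sketch below assumes that cens == 1 marks a censored case (so cens == 0 is an observed event), which matches the boxplot labels above but is still an assumption about the real data.

```r
library(survival)  # Surv(), survfit()

set.seed(1)  # for a reproducible example
fbi  <- rpois(100, 12)
cens <- sample(0:1, 100, replace = TRUE)
test <- data.frame(fbi, cens)

# Assumption: cens == 1 means censored, so the event indicator is cens == 0
km <- survfit(Surv(fbi, cens == 0) ~ 1, data = test)

# Kaplan-Meier estimates of the quartiles of the duration;
# a quartile is NA if the survival curve never drops far enough
q <- quantile(km, probs = c(0.25, 0.5, 0.75))
q$quantile
```

These quartiles could then be drawn by hand as a box, for example with `graphics::bxp()`, instead of relying on `boxplot()`, which treats censored durations as if they were complete.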

Resources