K-Means Clustering Method - r

Description of the data
I am trying to produce in R a suitable graphical display of the cluster means.
How can I place the attributes on the x-axis and treat the means for each cluster as trajectories over the items?
All the data is continuous.

What about the following approach: since your variables are on a similar measurement scale (e.g. a Likert scale), you could show the distribution of each variable within each cluster (e.g. with boxplots) and compare the distributions visually by using the same axis limits for every cluster.
This can be accomplished by putting your data in a suitable format and using the ggplot2 package to generate the plot, as shown below.
Step 1: Generate simulated data to mimic the numeric data you have
The generated data contains four non-negative integer variables and a cluster variable with 3 clusters.
library(ggplot2)
set.seed(1717) # make the simulated data repeatable
N = 100
nclusters = 3
cluster = as.numeric( cut(rnorm(N), breaks=nclusters, labels=seq_len(nclusters)) )
df = data.frame(cluster=cluster,
x1=floor(cluster + runif(N)*5),
x2=floor(runif(N)*5),
x3=floor((nclusters-cluster) + runif(N)*5),
x4=floor(cluster + runif(N)*5))
df$cluster = factor(df$cluster) # define cluster as factor to ease plotting code below
tail(df)
table(df$cluster)
whose output is:
    cluster x1 x2 x3 x4
95        2  5  2  5  2
96        3  5  4  0  3
97        3  3  3  1  7
98        2  5  4  3  3
99        3  6  1  1  7
100       3  5  1  2  5

 1  2  3
15 64 21
i.e., out of the 100 simulated cases, the data contains 15 cases in cluster 1, 64 cases in cluster 2 and 21 cases in cluster 3.
Step 2: Prepare the data for plotting
Here we use reshape() from the stats package to transpose the dataset from wide to long so that the four numeric variables (x1, x2, x3, x4) are placed into one single column, suitable for generating a boxplot for each of the four variables which are then grouped by the cluster variable.
vars2transpose = c("x1", "x2","x3", "x4")
df.long = reshape(df, direction="long", idvar="id",
varying=list(vars2transpose),
timevar="var", times=vars2transpose, v.names="value")
head(df.long)
table(df.long$cluster)
whose output is:
     cluster var value id
1.x1       1  x1     5  1
2.x1       1  x1     3  2
3.x1       3  x1     5  3
4.x1       1  x1     1  4
5.x1       2  x1     3  5
6.x1       1  x1     2  6

  1   2   3
 60 256  84
Note that the number of cases in each cluster has increased 4-fold (i.e. the number of numeric variables) since the data is now in transposed long format.
Step 3: Create the variables' boxplots by cluster with line-connected means
We plot horizontal boxplots for each variable x1, x2, x3, x4 that show their distribution in each cluster, and mark the mean values with connected red crosses (the trajectories you are after).
gg <- ggplot(df.long, aes(x=var, y=value))
gg + facet_grid(cluster ~ ., labeller=label_both) +
  geom_boxplot(aes(fill=cluster)) +
  # 'fun' replaces the 'fun.y' argument, which is deprecated since ggplot2 3.3.0
  stat_summary(fun="mean", geom="point", color="red", pch="x", size=3) +
  stat_summary(fun="mean", geom="line", color="red", aes(group=1)) +
  coord_flip() # swap the x and y axis to make boxplots horizontal instead of vertical
which generates the following graph.
The graph might get packed with the many variables you have, so you may want to:
either show vertical boxplots by removing the last coord_flip() line
or remove the boxplots altogether and just show the connected red crosses by eliminating the geom_boxplot() line.
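The second option (means only) can also be drawn with all clusters in a single panel, which is close to the "trajectories over the items" described in the question. The following is a minimal sketch under stated assumptions: the data frame toy, its columns and the names attrs and means.long are hypothetical stand-ins, not the original data.

```r
library(ggplot2)

set.seed(1)  # assumed seed, only to make the sketch repeatable

# Hypothetical stand-in data: a cluster label plus four numeric attributes
toy <- data.frame(cluster = factor(rep(1:3, each = 20)),
                  x1 = rnorm(60, mean = rep(c(2, 4, 6), each = 20)),
                  x2 = rnorm(60, mean = 3),
                  x3 = rnorm(60, mean = rep(c(6, 4, 2), each = 20)),
                  x4 = rnorm(60, mean = rep(c(2, 4, 6), each = 20)))

# Mean of every attribute within each cluster (one row per cluster)
means <- aggregate(. ~ cluster, data = toy, FUN = mean)

# Wide to long: one row per (cluster, attribute) pair
attrs <- c("x1", "x2", "x3", "x4")
means.long <- reshape(means, direction = "long", idvar = "cluster",
                      varying = list(attrs), timevar = "var",
                      times = attrs, v.names = "mean")

# Attributes on the x-axis, one line (trajectory) per cluster
p <- ggplot(means.long, aes(x = var, y = mean, group = cluster, color = cluster)) +
  geom_line() +
  geom_point()
print(p)
```

If the clusters come from kmeans(), the centers component of the fitted object already holds these means and could be reshaped in the same way.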
And if you want to compare each variable side by side among the different clusters, you can swap the grouping and x-axis variables as follows:
gg <- ggplot(df.long, aes(x=cluster, y=value))
gg + facet_grid(var ~ ., labeller=label_both) +
  geom_boxplot(aes(group=cluster, fill=cluster)) +
  # 'fun' replaces the 'fun.y' argument, which is deprecated since ggplot2 3.3.0
  stat_summary(fun="mean", geom="point", color="red", pch="x", size=3) +
  stat_summary(fun="mean", geom="line", color="red", aes(group=1)) +
  coord_flip() # swap the x and y axis to make boxplots horizontal instead of vertical

Related

R: a plot to visualise correlation of two groups with one measurements?

I have two groups with one measurement variable.
I would like to plot them on one graph to see if they show a correlation or they overlap.
The measurement for both groups is on the same scale.
I thought of doing a scatter plot, but in this case, I thought it would just give me a straight line as I only have one measurement.
Could I get some ideas and suggestions please?
You can unstack the data.
set.seed(1234)
df <- data.frame(var = rnorm(200, 50, 10), gp = gl(2,100))
head(df)
var gp
1 37.92934 1
2 52.77429 1
3 60.84441 1
4 26.54302 1
5 54.29125 1
6 55.06056 1
unstack(df)
X1 X2
1 37.92934 54.14524
2 52.77429 45.25282
3 60.84441 50.65993
4 26.54302 44.97522
5 54.29125 41.74001
6 55.06056 51.66989
And then plot this.
library(ggplot2)
library(dplyr)
unstack(df) %>% ggplot(aes(x=X1, y=X2)) +
geom_point() +
geom_smooth(method="lm")
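One caveat about the scatter plot above: unstack() pairs the two groups by row position, so the plot is only meaningful if row i of group 1 really belongs with row i of group 2. If the observations are not paired and the aim is to see how much the groups overlap, overlaid densities are one possible alternative. A small sketch, reusing the simulated df from this answer:

```r
library(ggplot2)

set.seed(1234)
df <- data.frame(var = rnorm(200, 50, 10), gp = gl(2, 100))

# One semi-transparent density per group on a shared axis,
# so the overlap between the groups is directly visible
p <- ggplot(df, aes(x = var, fill = gp)) +
  geom_density(alpha = 0.3)
print(p)
```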

Overlaying two ggplot facet_wrap histograms

So I have two histogram plots I can do one at a time. The result using the following code gives a 2 row x 3 col facet plot for six different histograms:
ggplot(data) +
aes(x=values) +
geom_histogram(binwidth=2, fill='blue', alpha=0.3, color="black", aes(y=(..count..)*100/(sum(..count..)/6))) +
facet_wrap(~ model_f, ncol = 3)
Here the aes(y...) just gives the percentage instead of counts.
As stated, I have two of these 6-panel facet_wrap plots, which I now wish to combine to show that one is more shifted than the other.
In addition, the data size is not the same, so for one I have:
# A tibble: 5,988 x 5
values ID structure model model_f
<dbl> <chr> <chr> <chr> <fctr>
1 6 1 bone qua Model I
2 7 1 bone liu Model II
3 20 1 bone dav Model III
4 3 1 bone ema Model IV
5 3 1 bone tho Model V
6 4 1 bone ranc Model VI
7 3 2 bone qua Model I
8 5 2 bone liu Model II
9 18 2 bone dav Model III
10 2 2 bone ema Model IV
# ... with 5,978 more rows
And the other:
# A tibble: 954 x 5
values ID structure model model_f
<dbl> <chr> <chr> <chr> <fctr>
1 9 01 bone qua Model I
2 8 01 bone liu Model II
3 22 01 bone dav Model III
4 6 01 bone ema Model IV
5 5 01 bone tho Model V
6 9 01 bone ran Model VI
7 12 02 bone qua Model I
8 11 02 bone liu Model II
9 24 02 bone dav Model III
10 9 02 bone ema Model IV
# ... with 944 more rows
So they are not the same size, the ID's are not the same (data not related), but still, I wish to merge the histograms in order to see the difference between the data.
I thought this might do the trick:
ggplot() +
geom_histogram(data=data1, aes(x=values), binwidth=1, fill='blue', alpha=0.3, color="black", aes(y=(..count..)*100/(sum(..count..)/6))) +
geom_histogram(data=data2, aes(x=values), binwidth=1, fill='blue', alpha=0.3, color="black", aes(y=(..count..)*100/(sum(..count..)/6))) +
facet_wrap(~ model_f, ncol = 3)
However, that didn't do much.
So now I'm stuck. Is this possible to do, or...?
Here is my crack at this, based on the builtin dataset iris (since you did not provide reproducible data). To create the smaller, shifted dataset, I am using dplyr to keep the first 20 rows from each species and add 1 to the Sepal length for each observation:
library(dplyr) # provides %>%, group_by(), slice(), mutate() and bind_rows()
smallIris <-
  iris %>%
  group_by(Species) %>%
  slice(1:20) %>%
  ungroup() %>%
  mutate(Sepal.Length = Sepal.Length + 1)
Your code at the end gets you close, but you did not specify different colors for the two histograms. If you set the fill differently for each, you will get them to show up differently. You could either set this directly (e.g., change "blue" to "red" in one of them) or by setting a name within aes. Setting it in aes has the advantage of creating (and labeling) a legend:
library(ggplot2)
# after_stat(count) replaces the deprecated ..count.. notation (ggplot2 >= 3.3.0)
ggplot() +
  geom_histogram(data=iris
                 , aes(x=Sepal.Length
                       , fill = "Big"
                       , y=after_stat(count*100/sum(count)))
                 , alpha=0.3) +
  geom_histogram(data=smallIris
                 , aes(x=Sepal.Length
                       , fill = "Small"
                       , y=after_stat(count*100/sum(count)))
                 , alpha=0.3) +
  facet_wrap(~Species)
Creates this:
However, I tend to dislike the look of overlapping histograms, so I would prefer to use a density plot. You can do it just like the above (just change the geom_histogram), but I think you get a bit more control (and the ability to expand this to more than two groups) by stacking the data. Again, this uses dplyr to stitch the two datasets together:
bigIris <-
bind_rows(
small = smallIris
, big = iris
, .id = "Source"
)
Then, you can create the plot relatively easily:
bigIris %>%
ggplot(aes(x = Sepal.Length, col = Source)) +
geom_line(stat = "density") +
facet_wrap(~Species)
which creates one density curve per source in each species panel.
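If histograms with percentages are still preferred, the stacked layout also makes it possible to normalise within each panel and source, instead of dividing by a hard-coded number of panels as in the question. The sketch below is one possible way to do it, not the only one. It assumes ggplot2 >= 3.3.0 for after_stat() and relies on the PANEL and group columns that ggplot2 adds to the computed layer data. The data frames small and both and the bin width are illustrative choices.

```r
library(ggplot2)

# Smaller, shifted copy of iris: first 20 rows per species, Sepal.Length + 1
small <- do.call(rbind, lapply(split(iris, iris$Species), head, 20))
small$Sepal.Length <- small$Sepal.Length + 1

# Stack both datasets with a label column identifying the source
both <- rbind(data.frame(Source = "big", iris),
              data.frame(Source = "small", small))

# Percentages are computed within each facet (PANEL) and each source (group),
# so the two histograms are comparable despite different sample sizes
p <- ggplot(both, aes(x = Sepal.Length, fill = Source)) +
  geom_histogram(aes(y = after_stat(100 * count / ave(count, PANEL, group, FUN = sum))),
                 binwidth = 0.25, alpha = 0.3, position = "identity") +
  facet_wrap(~ Species) +
  ylab("percent")
print(p)
```

Using position = "identity" overlays the two histograms rather than stacking their bars on top of each other.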

replicate multiple regression plot from excel in R

I am struggling with reproducing a multiple linear regression plot in R which is rather easily obtainable in Excel.
I make an example. Say I have the following data frame (called test) in R:
y x1 x2 x3
2 5 5 9
6 4 2 9
4 2 6 15
7 5 10 6
7 5 10 6
5 4 3 12
To generate a linear regression, I simply write:
reg=lm(y ~ x1 + x2 + x3, data = test)
Now I would like to create a plot of the actual value of the y variable, the predicted y and on a secondary axis the standardised residuals. I add a screenshot from Excel so you see what I mean.
The Excel plot I would like to obtain is in Italian: "y" means the observed y values, "Y prevista" means the predicted y values and "Residui standard" means the standardized residuals, which are plotted on a secondary axis.
If anyone could show me how I can achieve the above in R, it would be much appreciated.
Use something like
matplot(seq(nrow(test)), cbind(test$y, predict(reg), rstudent(reg)), type="l")
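but you would have to set up the axes yourself to make sure everything is displayed properly, since the residuals are on a different scale from y. Note that rstudent() returns studentized residuals; rstandard() returns the standardized ones.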
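Setting up the axes could look roughly like the base-graphics sketch below, which overlays a second plot and draws its scale on the right-hand side. It assumes rstandard() is an acceptable counterpart of Excel's "Residui standard" (Excel may scale its residuals differently). The object names idx, fitted.y and std.res, as well as the colours and labels, are illustrative.

```r
test <- data.frame(y  = c(2, 6, 4, 7, 7, 5),
                   x1 = c(5, 4, 2, 5, 5, 4),
                   x2 = c(5, 2, 6, 10, 10, 3),
                   x3 = c(9, 9, 15, 6, 6, 12))
reg <- lm(y ~ x1 + x2 + x3, data = test)

idx      <- seq_len(nrow(test))  # observation number for the x-axis
fitted.y <- predict(reg)         # predicted y
std.res  <- rstandard(reg)       # standardized residuals (assumed match for Excel)

op <- par(mar = c(5, 4, 4, 4) + 0.1)  # leave room for the right-hand axis
# Observed and predicted y on the primary (left) axis
matplot(idx, cbind(test$y, fitted.y), type = "b", pch = 1:2, lty = 1,
        col = c("black", "blue"), xlab = "Observation", ylab = "y")
# Overlay the residuals on their own scale, without drawing new axes
par(new = TRUE)
plot(idx, std.res, type = "b", pch = 3, col = "red",
     axes = FALSE, xlab = "", ylab = "")
axis(side = 4)  # secondary axis on the right
mtext("Standardized residuals", side = 4, line = 3)
legend("topleft", legend = c("y", "predicted y", "std. residuals"),
       col = c("black", "blue", "red"), pch = 1:3, lty = 1)
par(op)  # restore the previous graphics settings
```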
You could try something like this. Easier to debug your code.
test <- data.frame(y=c(2,6,4,7,7,5), x1=c(5,4,2,5,5,4), x2=c(5,2,6,10,10,3),
x3=c(9,9,15,6,6,12))
reg=lm(y ~ x1 + x2 + x3, data = test)
# Add new columns to dataframe from regression
test$yhat <- reg$fitted.values
test$resid <- reg$residuals
# Create your x-variable column
test$X <-seq(nrow(test))
library(ggplot2)
library(reshape2)
# Columns to keep
keep = c("y", "yhat", "resid", "X")
# Drop columns not needed
test <-test[ , keep, drop=FALSE]
# Reshape for easy plotting
test <- melt(test, id.vars="X")
# Everything on the same plot
ggplot(test, aes(X,value, col=variable)) +
geom_point() +
geom_line()
For a different look, you could also replace geom_line() with geom_smooth().

Merge data.frames for grouped boxplot r

I have two data frames z (1 million observations) and b (500k observations).
z= Tracer time treatment
15 0 S
20 0 S
25 0 X
04 0 X
55 15 S
16 15 S
15 15 X
20 15 X
b= Tracer time treatment
2 0 S
35 0 S
10 0 X
04 0 X
20 15 S
11 15 S
12 15 X
25 15 X
I'd like to create grouped boxplots using time as a factor and treatment as colour. Essentially I need to bind them together and then differentiate between them but not sure how. One way I tried was using:
zz<-factor(rep("Z", nrow(z)))
bb<-factor(rep("B", nrow(b)))
dumZ<-merge(z,zz) #this won't work because it says it's too big
dumB<-merge(b,bb)
total<-rbind(dumB,dumZ)
But the merge of z and zz won't work because it says the result is 10G in size (which can't be right).
The end plot might be similar to this example: Boxplot with two levels and multiple data.frames
Any thoughts?
Cheers,
EDIT: Added boxplot
I would approach it as follows:
# create a list of your data.frames
l <- list(z,b)
# assign names to the dataframes in the list
names(l) <- c("z","b")
# bind the dataframes together with rbindlist from data.table
# the idcol parameter will create a variable with the names of the dataframes
# you could also use 'bind_rows(l, .id="id")' from 'dplyr' for this
library(data.table)
library(ggplot2)
zb <- rbindlist(l, idcol="id")
# create the plot
ggplot(zb, aes(x=factor(time), y=Tracer, color=treatment)) +
geom_boxplot() +
facet_wrap(~id) +
theme_bw()
which gives:
Other alternatives for creating your plot:
# facet by 'time'
ggplot(zb, aes(x=id, y=Tracer, color=treatment)) +
geom_boxplot() +
facet_wrap(~time) +
theme_bw()
# facet by 'time' & color by 'id' instead of 'treatment'
ggplot(zb, aes(x=treatment, y=Tracer, color=id)) +
geom_boxplot() +
facet_wrap(~time) +
theme_bw()
In response to your last comment: to get everything in one plot, you can use interaction to distinguish between the different groupings as follows:
ggplot(zb, aes(x=treatment, y=Tracer, color=interaction(id, time))) +
geom_boxplot(width = 0.7, position = position_dodge(width = 0.7)) +
theme_bw()
which gives:
The key is that you do not need to perform a merge, which is computationally expensive on large tables. Instead, add a new variable to each data frame that identifies where it came from (source, with the values "b" and "z", in my code below) and then rbind them. After that it becomes straightforward; my solution is very similar to @Jaap's, just with different faceting.
library(ggplot2)
#Create some mock data
t<-seq(1,55,by=2)
z<-data.frame(tracer=sample(t,size = 10,replace = T), time=c(0,15), treatment=c("S","X"))
b<-data.frame(tracer=sample(t,size = 10,replace = T), time=c(0,15), treatment=c("S","X"))
#Add a variable to each table to id itself
b$source<-"b"
z$source<-"z"
#concatenate the tables together
all<-rbind(b,z)
ggplot(all, aes(source, tracer, group=interaction(treatment,source), fill=treatment)) +
geom_boxplot() + facet_grid(~time)
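The size problem in the question has a simple explanation: when the two arguments of merge() share no column names, it returns the Cartesian product, so merging a table of 1 million rows with a vector of 1 million labels asks for 10^12 rows. That is likely where the 10G estimate came from. A tiny sketch with hypothetical three-row stand-ins for z and b shows the difference in row counts:

```r
# Tiny hypothetical stand-ins for the question's z and b
z <- data.frame(Tracer = c(15, 20, 25), time = c(0, 0, 15),
                treatment = c("S", "X", "S"))
b <- data.frame(Tracer = c(2, 35, 10), time = c(0, 0, 15),
                treatment = c("S", "X", "S"))

zz <- factor(rep("Z", nrow(z)))

# z and zz share no column names, so merge() pairs every row of z
# with every element of zz (a Cartesian product)
crossed <- merge(z, zz)
nrow(crossed)  # nrow(z) * length(zz)

# Adding a label column instead keeps the row count unchanged
z$source <- "z"
b$source <- "b"
total <- rbind(z, b)
nrow(total)    # nrow(z) + nrow(b)
```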

How to change from row to column major order with facet_wrap?

I want to make a 2x4 array of plots that show distributions changing over time. The default ggplot arrangement with facet_wrap is that the top row has series 1&2, the second row has series 3&4, etc. I would like to change this so that the first column has series in order (1->2->3->4) and then the second column has the next 4 series. This way your eye can compare immediately adjacent distributions in time vertically (as I think they should be).
Use the dir (direction) argument of facet_wrap(). The default is horizontal ("h"), and this can be switched to vertical ("v"):
# Horizontally
ggplot(mtcars, aes(x=hp, y=mpg)) + geom_point() + facet_wrap(~ cyl, ncol=2)
# Vertically
ggplot(mtcars, aes(x=hp, y=mpg)) + geom_point() + facet_wrap(~ cyl, ncol=2, dir="v")
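For the eight-series case described in the question, the same argument could be applied as in the sketch below. The data frame toy and its columns series and value are hypothetical stand-ins. The last lines inspect the panel layout through ggplot_build(), which relies on ggplot2's built-plot structure and is only meant as a quick check.

```r
library(ggplot2)

set.seed(42)  # assumed seed, only to make the sketch repeatable

# Hypothetical data: eight series whose distribution drifts over time
toy <- data.frame(series = factor(rep(1:8, each = 50)),
                  value  = rnorm(400, mean = rep(1:8, each = 50)))

# Two columns, filled top to bottom: series 1-4 in the first column,
# series 5-8 in the second
p <- ggplot(toy, aes(x = value)) +
  geom_histogram(bins = 20) +
  facet_wrap(~ series, ncol = 2, dir = "v")
print(p)

# Quick check of where each panel ended up
lay <- ggplot_build(p)$layout$layout
lay[order(lay$series), c("series", "ROW", "COL")]
```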
It looks like you need to do this by ordering the factor levels prior to the facet_wrap call:
fac <- factor( fac, levels=as.character(c(1, 10, 2, 20, 3, 30, 4, 40) ) )
The default for as.table in facet_wrap is TRUE, which puts the lowest value ("1" in this case) in the upper left corner and the highest value ("40" in the example above) in the lower right corner. So:
pl + facet_wrap(~fac, ncol=2, nrow=4)
Your comments suggest you are working with numeric class variables. (Your comments still do not provide a working example and you seem to think this is our responsibility and not yours. Where does one acquire such notions of entitlement?) This should create a factor that is "column major" ordered with either numeric or factor input:
> ss <- 1:8; factor(ss, levels=ss[matrix(ss, ncol=2, byrow=TRUE)])
[1] 1 2 3 4 5 6 7 8
Levels: 1 3 5 7 2 4 6 8
On the other hand I can think of situations where this might be the effective approach:
> ss <- 1:8; factor(ss, levels=ss[matrix(ss, nrow=2, byrow=TRUE)])
[1] 1 2 3 4 5 6 7 8
Levels: 1 5 2 6 3 7 4 8
