I have 4 clusters that I would like to visualize with ggplot.
I tried to plot them with ggplot, but I couldn't work out how to make the result look like the figure below. All I got was a scatterplot of points that are not grouped by similarity around the centroids.
top50combos_freq: has two columns [freq, freq1]
top50combos_freq.ckmeans1: the result of kmeans with 4 clusters.
plot(top50combos_freq[top50combos_freq.ckmeans1$cluster==1,],
col = "red",
xlim = c(min(top50combos_freq[,1]), max(top50combos_freq[,1])),
ylim = c(min(top50combos_freq[,2]), max(top50combos_freq[,2]))
)
points(top50combos_freq[top50combos_freq.ckmeans1$cluster==2,],
col="blue")
points(top50combos_freq[top50combos_freq.ckmeans1$cluster==3,],
col="seagreen")
points(top50combos_freq.ckmeans1$centers, pch=2, col="green")
Any help making this plot with ggplot would be appreciated. Thanks.
One way to do that would be to create 2 data frames:
one for the actual data points, with a factor variable specifying the cluster,
the other with only the centroids (as many rows as there are clusters).
Then plot the first data frame as usual and add a second geom in which you specify the centroid data frame.
Example using iris data:
library(ggplot2)
# Data frame with actual data points
plotDf <- iris
# Data frame with centroids, one entry per centroid
centroidsDf <- data.frame(
Sepal.Length = tapply(iris$Sepal.Length, iris$Species, mean),
Sepal.Width = tapply(iris$Sepal.Width, iris$Species, mean)
)
# First plot data, colouring by cluster (in this case Species variable)
ggplot(
data = plotDf,
aes(x = Sepal.Length, y = Sepal.Width, col = Species)
) +
geom_point() +
# Then add centroids
geom_point(
data = centroidsDf, # separate data.frame
aes(x = Sepal.Length, y = Sepal.Width),
col = "green", # notice "col" and "shape" are
shape = 2) # outside aes()
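The same two-data-frame pattern should carry over to a kmeans() result. The following is only a sketch on iris, since the asker's top50combos_freq is not available; the names pts and km and the nstart value are assumptions made for illustration.

```r
library(ggplot2)

set.seed(42)  # kmeans starts from random centres, so fix the seed
pts <- iris[, c("Sepal.Length", "Sepal.Width")]  # stand-in for a two-column data set
km  <- kmeans(pts, centers = 4, nstart = 10)

# Data frame 1: the points, with the assigned cluster as a factor
plotDf <- data.frame(pts, cluster = factor(km$cluster))
# Data frame 2: one row per centroid (km$centers keeps the column names)
centroidsDf <- as.data.frame(km$centers)

p <- ggplot(plotDf, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point(aes(colour = cluster)) +
  # centroids come from the second data frame; x and y are inherited
  geom_point(data = centroidsDf, colour = "black", shape = 17, size = 4)
print(p)
```

Mapping colour only inside the first geom_point() keeps the centroid layer from needing a cluster column.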
I aim to create a ggplot with Date along the x axis, and jump height along the y axis. Simplistically, for 1 athlete in a large group of athletes, this will allow the reader to see improvements in jump height over time.
Additionally, I would like to add a ggMarginal(type = "density") to this plot. Here, I aim to plot the distribution of all athlete jump heights. As a result, the reader can interpret the performance of the primary athlete in relationship to the group distribution.
For the sake of a reproducible example, the iris data frame will work.
'''
library(dplyr)
library(ggplot2)
library(ggExtra)
df1 <- iris %>%
filter(Species == "setosa")
df2 <- iris
#I have tried the following, but a variety of errors have occurred:
ggplot(NULL, aes(x=Sepal.Length, y=Sepal.Width))+
geom_point(data=df1, size=2)+
ggMarginal(data = df2, aes(x=Sepal.Length, y=Sepal.Width), type="density", margins = "y", size = 6)
'''
This data frame is significantly different from mine, but in terms of the iris data set, I aim to plot x = Sepal.Length, y = Sepal.Width for the setosa species (df1), and then use ggMarginal to show the distribution of Sepal.Width on the y axis for all species (df2).
I hope this makes sense!
Thank you for your time and expertise
As far as I can tell from the docs, you can't specify a separate data frame for ggMarginal. Either you specify a plot to which you want to add a marginal plot, or you provide the data directly to ggMarginal.
But one option to achieve your desired result would be to create your density plot as a separate plot and glue it to your main plot via patchwork:
library(ggplot2)
library(patchwork)
df1 <- subset(iris, Species == "setosa")
df2 <- iris
p1 <- ggplot(df1, aes(x = Sepal.Length, y = Sepal.Width)) +
geom_point(size = 2)
p2 <- ggplot(df2, aes(y = Sepal.Width)) +
geom_density() +
theme_void()
p1 + p2 +
plot_layout(widths = c(6, 1))
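One thing to watch with this approach: each plot trains its own y scale, so the density panel (all species) and the scatter panel (setosa only) need not line up. A possible refinement, offered as a sketch rather than part of the original answer, is to give both plots the same y limits; y_lim and combined are names introduced here.

```r
library(ggplot2)
library(patchwork)

df1 <- subset(iris, Species == "setosa")
df2 <- iris

# Shared limits taken from the full data, so both panels cover the same range
y_lim <- range(df2$Sepal.Width)

p1 <- ggplot(df1, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point(size = 2) +
  scale_y_continuous(limits = y_lim)

p2 <- ggplot(df2, aes(y = Sepal.Width)) +
  geom_density() +
  scale_y_continuous(limits = y_lim) +
  theme_void()

combined <- p1 + p2 + plot_layout(widths = c(6, 1))
print(combined)
```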
I have a question about using geom_segment in R ggplot2.
For example, I have three facets and two clusters of points(points which have the same y values) in each facets, how do I draw multiple vertical line segments for each clustering with geom_segment?
Like if my data is
x <- (1:24)
y <- (rep(1,2),2,rep(2,2),1,rep(3,2),4, rep(4,1),5,6, ..rep(8,2),7)
facets <-(1,2,3)
factors <-(1,2,3,4,5,6)
xmean <- ( (1+2+3)/3, (4+5+6)/3, ..., (22+23+24)/3)
Note: (1+2+3)/3 is the mean of the first cluster in the first facet, (4+5+6)/3 is the mean of the second cluster in the first facet, and (7+8+9)/3 is the mean of the first cluster in the second facet.
My Code:
ggplot(,aes(x=as.numeric(x),y=as.numeric(y),color=factors)+geom_point(alpha=0.85,size=1.85)+facet_grid(~facets)
+geom_segment(what should I put here to draw this line in different factors?)
Desired result:
Please see the picture!
Please see the updated picture!
Thank you so much! Have a nice day :).
Maybe this is what you are looking for. Instead of working with vectors, put your data in a data frame. That way you can easily build an aggregated data frame with the mean values per facet and cluster, which makes it easy to draw the segments:
Note: Wasn't sure about the setup of your data. You talk about two clusters per facet but your data has 8. So I slightly changed the example data.
library(ggplot2)
library(dplyr)
df <- data.frame(
x = 1:24,
y = rep(1:6, each = 4),
facets = rep(1:3, each = 8)
)
df_sum <- df %>%
group_by(facets, y) %>%
summarise(x = mean(x))
#> `summarise()` has grouped output by 'facets'. You can override using the `.groups` argument.
ggplot(df, aes(x, y, color = factor(y))) +
geom_point(alpha = 0.85, size = 1.85) +
geom_segment(data = df_sum, aes(x = x, xend = x, y = y - .25, yend = y + .25), color = "black") +
facet_wrap(~facets)
I have a list of model-output in R that I want to plot using ggplot. I want to produce a scatter plot within which every column of data is a different colour. In the example here, I have three model outputs which I want to plot against 'measured'. What I want in the end is a scatter with three different 'clouds' of points, each of which is a different colour. Here is a reproducible example of what I have so far:
library(ggplot2)
library(tidyverse)
#data for three different models as well as a column for 'observations' (measured)
output <- list(model1 = 1:10, model2 = 22:31, model3=74:83)
#create the dataframe
df <- data.frame(
predicted = output,
measured = 1:length(output[[1]]),
#year = as.factor(data$year),
#site = data$site
#model = as.factor(names(output)),
#stringsAsFactors=TRUE)
fix.empty.names = TRUE)
#fix the column names
colnames(df)<-names(output)
#plot the data with a different colour for each column of data
p <- ggplot(df) +
geom_point(
aes(
measured,
predicted,
colour =colnames(df)
)
) +
ylim(-5, 90)+
theme_minimal()
p + geom_hline(yintercept=0)
print(p)
I am getting the error: Error in FUN(X[[i]], ...) : object 'measured' not found
Why is 'measured' not being found? I can see it in the df.
Perhaps I need to collapse all the model outputs into one column and create a 'factor' column to assign each data point to a particular model?
The first issue is that your output list only has as many elements as you have models, so it has no name for the last "measured" column and that gets overwritten with NA.
Compare:
colnames(df) <- names(output)                 # NA in last col
colnames(df) <- c(names(output), "measured")  # fixed
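To see the difference without plotting anything, the two assignments can be run side by side on copies of the asker's data frame. This is a small base-R sketch; broken and fixed are names introduced here for illustration.

```r
output <- list(model1 = 1:10, model2 = 22:31, model3 = 74:83)
df <- data.frame(predicted = output, measured = seq_along(output[[1]]))

# Three names for four columns: the last column name is left as NA
broken <- df
colnames(broken) <- names(output)
print(colnames(broken))

# Four names for four columns: "measured" survives
fixed <- df
colnames(fixed) <- c(names(output), "measured")
print(colnames(fixed))
```

With the NA name in place, aes(measured, ...) has no column to resolve, which matches the "object 'measured' not found" error.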
Then, to plot your data in ggplot2 it's almost always better to convert to longer, "tidy" format, with one row per observation. pivot_longer from tidyr is great for that.
df %>%
pivot_longer(-measured, # don't pivot "measured" -- keep in every row
names_to = "model",
values_to = "predicted") %>%
ggplot() +
geom_point(
aes(
measured,
predicted,
colour = model
)
) +
ylim(-5, 90)+
theme_minimal() +
geom_hline(yintercept=0)
You changed the column names of your data frame:
colnames(df)<-names(output)
So the columns you refer to in aes() are no longer found.
I reorganized your object into a data frame that can be easily understood by ggplot2. Do not hesitate to look at your objects.
Here is one option :
library(ggplot2)
library(tidyverse)
#data for three different models as well as a column for 'observations' (measured)
output <- list(model1 = 1:10, model2 = 22:31, model3=74:83)
#create the dataframe
df <- data.frame(
predicted = unlist(output),
measured = 1:length(unlist(output)),
model = rep(names(output), times = lengths(output)) # one label per value, in the order of unlist(output)
)
#plot the data with a different colour for each column of data
p <- ggplot(df) +
geom_point(aes(measured, predicted,colour = model)) +
ylim(-5, 90)+
theme_minimal()
p + geom_hline(yintercept=0)
print(p)
[image: plotwithgroups]
If you add this line :
facet_grid(~model) +
You get this, which sounds like what you were asking for:
[image: plotwithfacet]
I can create a loess line in R, based on a column of dates and a second column of values. Having loaded the dataset, I visualise one column of data below:
scatter.smooth(x=1:length(goals$Value), y=goals$Value)
However, how do I add multiple loess lines for additional columns? What would be the code to plot all the loess lines in one graphic? Say, each additional column is named Value2, Value3, Value4 etc.
If you haven't yet considered it, the package ggplot2 makes such graphing problems significantly easier to handle, and gives nicer graphs:
library(ggplot2)
library(tidyr)
set.seed(123)
df <- data.frame("days"=1:25, "v1"=rnorm(25), "v2"=(rnorm(25)+0.1))
#Reshape data from wide to long
df2 <- gather(df,var,val,c(v1,v2))
ggplot(df2,aes(x = days, y = val)) +
geom_point() +
geom_smooth(aes(colour = var),se = F)
If you don't want to reshape the data, you could add individual lines like this:
ggplot(df,aes(x = days, y = v1)) +
geom_point() + #Add scatter plot
geom_smooth(aes(colour = 'v1'),se = F) + #Add loess 1
geom_smooth(aes(y = v2,colour = 'v2'),se = F) + #Add loess 2... and so on
scale_colour_discrete(name = 'Line',
breaks = c('v1','v2'),
labels = c('variable 1','variable 2')) #Define legend
You would use the lines() function:
# create test data
set.seed(123)
df <- data.frame("days"=1:25, "v1"=rnorm(25), "v2"=(rnorm(25)+0.1))
# first plot
scatter.smooth(x=df$days, y=df$v1)
# add the second loess line
lines(loess.smooth(x=df$days, y=df$v2))
To add color to the lines:
scatter.smooth(x=df$days, y=df$v1, lpars=list(col="red"))
lines(loess.smooth(x=df$days, y=df$v2), col="green")
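For more than two value columns, the same lines() call can be repeated in a loop. The sketch below assumes a made-up data frame with three value columns (v1, v2, v3); value_cols, cols and fits are illustrative names, not part of the original question.

```r
# create test data with three value columns (assumed for illustration)
set.seed(123)
df <- data.frame(days = 1:25,
                 v1 = rnorm(25),
                 v2 = rnorm(25) + 0.1,
                 v3 = rnorm(25) - 0.1)

value_cols <- setdiff(names(df), "days")
cols <- c("red", "green", "blue")

# one loess fit per value column; each result is a list with x and y
fits <- lapply(value_cols, function(v) loess.smooth(x = df$days, y = df[[v]]))

# empty plot sized to hold all columns, then one smoothed line per column
plot(df$days, df$v1, type = "n", ylim = range(df[value_cols]),
     xlab = "days", ylab = "value")
for (i in seq_along(fits)) {
  lines(fits[[i]], col = cols[i])
}
legend("topright", legend = value_cols, col = cols, lty = 1)
```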
Using the goats data set from the ResourceSelection package I can look at the relationship between ELEVATION and a binary response (STATUS) using glm.
library(ResourceSelection)
library(ggplot2)
mod <- glm(STATUS ~ ELEVATION, family=binomial, data = goats)
summary(mod)
I then want to predict over a larger range of ELEVATION and do so with the following code.
df <- data.frame(ELEVATION = seq(0,5000,1))
df$Preds <- predict(mod, newdata = df, type="response")
ggplot(df, aes(x=ELEVATION, y = Preds)) + geom_point()
Now, with the resulting ggplot, how can I add a rug to the bottom of the figure that shows the observed values of ELEVATION from the goats data set when STATUS == 1? That is, I want a rug showing goats$ELEVATION[goats$STATUS == 1].
I have tried adding geom_rug(), but am not sure how to include the values from the goats data frame rather than the df that I used in the ggplot code. In other words, how can I include a rug of the observed values (subset as indicated above) from the original data in the plot with the new predicted data from the df data frame?
Thanks in advance!
geom_rug has a data argument (all geoms do), so you should just give it the data you want to be plotted.
ggplot(df, aes(x=ELEVATION, y = Preds)) + geom_point() +
geom_rug(data = subset(goats, STATUS == 1),
aes(x = ELEVATION), inherit.aes = F)
In this case, you map y = Preds, which is a column not present in the goats data, so we need to set inherit.aes = F for the rug layer using the goats data to prevent ggplot from looking for the nonexistent column.
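An alternative to inherit.aes = F is to keep y out of the global mapping so the rug layer has nothing to inherit that it cannot find. The sketch below uses the built-in mtcars data as a stand-in for goats (am plays the role of STATUS, wt the role of ELEVATION); that substitution is an assumption made so the example runs without ResourceSelection.

```r
library(ggplot2)

# Stand-in model: binary response am against wt
mod <- glm(am ~ wt, family = binomial, data = mtcars)

df <- data.frame(wt = seq(1, 6, by = 0.01))
df$Preds <- predict(mod, newdata = df, type = "response")

# Only x is mapped globally; y = Preds is mapped in the line layer alone
p <- ggplot(df, aes(x = wt)) +
  geom_line(aes(y = Preds)) +
  # the rug uses the observed data and inherits just x = wt
  geom_rug(data = subset(mtcars, am == 1), sides = "b")
print(p)
```

Either form works; setting inherit.aes in the rug layer is the smaller change when the global aes() is already written.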