kmeans through time retain consistent cluster ID - r

There are times when we would like to know how the clustering of points might change through time. For example, say you have cities with demographic attributes by decade and you are interested in what cities are the "most similar" based on their attributes for each decade. Here is a toy dataset that illustrates the point:
set.seed(1)
centers <- data.frame(cluster=factor(1:3),
size=c(100, 150, 50),
x1=c(5, 0, -3),
x2=c(-1, 1, -2))
year1 <- centers %>%
group_by(cluster) %>%
do(data.frame(x1=rnorm(.$size[1], .$x1[1]),
x2=rnorm(.$size[1], .$x2[1]),
year="year 1",
stringsAsFactors = F)) %>%
data.frame()
year2 <- centers %>%
group_by(cluster) %>%
do(data.frame(x1=rnorm(.$size[1], .$x1[1]),
x2=rnorm(.$size[1], .$x2[1]),
year="year 2",
stringsAsFactors = F)) %>%
data.frame()
points <- rbind(year1,year2)
We can calculate kmeans per year using something like below:
kclusters <- points %>%
select(-cluster) %>%
group_by(year) %>%
do(data.frame(., kclust = kmeans(as.matrix(.[,-3]),centers=3)$cluster)) %>%
mutate(kclust = as.character(kclust))
And here is the resulting plot:
ggplot(kclusters) +
geom_point(aes(x1,x2,color=kclust)) +
facet_wrap(~year) +
theme_bw() +
scale_color_viridis_d()
The code works as expected but notice that the cluster IDs have changed. It doesn't make much difference here because I am plotting the clusters using the original x1 and x2, but my real example is making a map and the points are plotted in space using coordinates and colored according to clusters (i.e., the location of the point never changes). Imagine this same plot for several years--it becomes hard to track the changing cluster membership of individual points each year. Is there a way to keep the IDs consistent?

Related

How to make the points in a geom_point scatter plot depend on the count?

I'm using ggplot to create a scatterplot of a dataframe. The x and y axis are two columns in the frame, and the following code gives me a scatter plot:
ggplot(df,aes(x=Season,y=type))+
geom_point(fill="blue")
But the points are all the same size. I want each point to depend on the count of how many rows matched for the combination of x and y. Anyone know how?
You haven't provided any sample data, so I'm generating some:
library('tidyverse')
df <- data.frame(Season = sample(c('W', 'S'), size=20, replace=T),
type=sample(c('A', 'B'), size=20, replace=T))
df %>%
group_by(Season, type) %>%
summarise(count = n()) %>%
ggplot(aes(x=Season, y=type, size=count)) +
geom_point(col="blue")
The idea is to count all the occurences of your Season–type data, and then use that new count field to adjust the size in your ggplot.

Visualise in R with ggplot, a k-means clustered developmental gene expression dataset

I can see many posts on this topic, but none addresses this question. Apologies if I missed a relevant answer. I have a large protein expression dataset, with samples like so as the columns:
rep1_0hr, rep1_16hr, rep1_24hr, rep1_48hr, rep1_72hr .....
and 2000+ proteins in the rows. In other words each sample is a different developmental timepoint.
If it is of any interest, the original dataset is 'mulvey2015' from the pRolocdata package in R, which I converted to a SummarizedExperiment object in RStudio.
I first ran k-means clustering on the data (an assay() of a SummarizedExperiment dataset, to get 12 clusters:
k_mul <- kmeans(scale(assay(mul)), centers = 12, nstart = 10)
Then:
summary(k_mul)
produced the expected output.
I would like the visualisation to look like this, with samples on the x-axis and expression on the y-axis. The plots look like they have been generated using facet_wrap() in ggplot:
For ggplot the data need to be provided as a dataframe with a column for the cluster identity of an individual protein. Also the data need to be in long format. I tried pivoting (pivot_longer) the original dataset, but of course there are a very large number of data points. Moreover, the image I posted shows that for any one plot, the number of coloured lines is smaller than the total number of proteins, suggesting that there might have been dimension reduction on the dataset first, but I am unsure. Up till now I have been running the kmeans algorithm without dimension reduction. Can I get guidance please for how to produce this plot?
Here is my attempt at reverse engeneering the plot:
library(pRolocdata)
library(dplyr)
library(tidyverse)
library(magrittr)
library(ggplot2)
mulvey2015 %>%
Biobase::assayData() %>%
magrittr::extract2("exprs") %>%
data.frame(check.names = FALSE) %>%
tibble::rownames_to_column("prot_id") %>%
mutate(.,
cl = kmeans(select(., -prot_id),
centers = 12,
nstart = 10) %>%
magrittr::extract2("cluster") %>%
as.factor()) %>%
pivot_longer(cols = !c(prot_id, cl),
names_to = "Timepoint",
values_to = "Expression") %>%
ggplot(aes(x = Timepoint, y = Expression, color = cl)) +
geom_line(aes(group = prot_id)) +
facet_wrap(~ cl, ncol = 4)
As for you questions, pivot_longer is usually quite performant unless it fails to find unique combinations in keys or problems related with data type conversion. The plot can be improved by:
tweaking the alpha parameter of geom_lines (e.g. alpha = 0.5), in order to provide an idea of density of lines
finding a good abbreviation and order for Timepoint
changing axis.text.x orientation
Here is my own, very similar solution to the above.
dfsa_mul <- data.frame(scale(assay(mul)))
dfsa_mul2 <- rownames_to_column(dfsa_mul, "protID")
add the kmeans $cluster column to the dfsa_mul2 dataframe. Only change clus to a factor after executing pivot_longer
dfsa_mul2$clus <- ksa_mul$cluster
dfsa_mul2 %>%
pivot_longer(cols = -c("protID", "clus"),
names_to = "samples",
values_to = "expression") %>%
ggplot(aes(x = samples, y = expression, colour = factor(clus))) +
geom_line(aes(group = protID)) +
facet_wrap(~ factor(clus))
This generates a series of plots identical to the graphs posted by #sbarbit.

How to make "interactive" time series plots for exploratory data analysis

I have a time series data frame similar to data created below. Measurements of 5 variables are taken on each individual. Individuals have unique ID numbers. Note that in this data set each individual is of the same length (each has 1000 observations), but in my real data set each individual is of has different lengths (teach individual has a different number of observations). For each individual, I want to plot all 5 variables on top of one another (i.e. all on the y axis) and plot them against time (x axis). I want to print each of these plots to an external document of some kind (pdf, or whatever is recommended for this application) with one plot per page, meaning each individual will have its own page with a single plot. I want these time series plots to be "interactive", in that I can move my mouse over a point, and it will tell me what time individual data points are at. My goal in doing this is exploring the association between peaks, valleys, and other regions between the 5 variables. I am not sure if ggplot2 is still the best application for this, but I would still like for the plots to be aesthetically appealing so that it will be easier to see patterns in the data. Also, is pasting these plots to a pdf the most sensible route? Or would I be better off using R notebook or some other application?
ID <- rep(c("A","B","C"), each=1000)
time <- rep(c(1:1000), times = 3)
one <- rnorm(1000)
two <- rnorm(1000)
three <- rnorm(1000)
four <- rnorm(1000)
five<-rnorm(1000)
data<- data.frame(cbind(ID,time,one,two,three,four,five))
Try using the plotly package. And since you want it to be interactive, you'll want to export as something like html rather than pdf.
To produce a single faceted plot (note I added stringAsFactors = FALSE to your sample data):
library(tidyverse)
library(plotly)
ID <- rep(c("A","B","C"), each=1000)
time <- rep(c(1:1000), times = 3)
one <- rnorm(1000)
two <- rnorm(1000)
three <- rnorm(1000)
four <- rnorm(1000)
five<-rnorm(1000)
data<- data.frame(cbind(ID,time,one,two,three,four,five),
stringsAsFactors = FALSE)
data_long <- data %>%
gather(variable,
value,
one:five) %>%
mutate(time = as.numeric(time),
value = as.numeric(value))
plot <- data_long %>%
ggplot(aes(x = time,
y = value,
color = variable)) +
geom_point() +
facet_wrap(~ID)
interactive_plot <- ggplotly(plot)
htmlwidgets::saveWidget(interactive_plot, "example.html")
If you want to produce and export an interactive plot for every ID programmatically:
walk(unique(data_long$ID),
~ htmlwidgets::saveWidget(ggplotly(data_long %>%
filter(ID == .x) %>%
ggplot(aes(x = time,
y = value,
color = variable)) +
geom_point() +
labs(title = paste(.x))),
paste("plot_for_ID_", .x, ".html", sep = "")))
Edit: I changed map() to walk() so that the plots are produced without console output (previously just a list with 3 empty elements).

same y axis variable , scatter-plot and long format

Let's say two different raters are evaluating behavioral problems. They use the same scale (from 0 to 50) and the child being evaluated is the same for both raters. In social sciences, this method is common and there are some useful statistics, such as correlation coefficient and Intra-Class Correlation.
In addition, one graph that comes to my mind is the scatter-plot, and in the x-axys I'll plot the behavioral problems scores considering the first rater and in the y-axis, I'll do the same for the second rater.
gplot2 creates amazing plots, however, some simple routines and action become really difficult to do.
Please see the code below and the r base plot. I would like to know if ggplot can create this plot as well.
Thanks much
set.seed(123)
ds <- data.frame(behavior_problems = rnorm(100,30,2), evaluator=sample(1:2))
plot(ds$behavior_problems[ds$evaluator == '1'] ,
y = ds$behavior_problems[ds$evaluator == '2'])
== I had to edit to make clear why a scatter-plot would be informative==
I think the key problem here is the way in which you have set up the data frame. It is not clear that each individual gets a pair of scores, one from each evaluator. So the first thing to do is add an ID for each individual: 50 IDs in your example data.
library(tidyverse)
ds %>%
mutate(id = rep(1:50, each = 2)
Now we can use tidyr::spread to create a column for each evaluator. But numbers for column names are not a great idea, so we'll rename them to e1 and e2.
ds %>%
mutate(id = rep(1:50, each = 2)) %>%
spread(evaluator, behavior_problems) %>%
rename(e1 = `1`, e2 = `2`)
Now we have column names that can be supplied to ggplot:
ds %>%
mutate(id = rep(1:50, each = 2)) %>%
spread(evaluator, behavior_problems) %>%
rename(e1 = `1`, e2 = `2`) %>%
ggplot(aes(e1, e2)) +
geom_point()
If this seems like a "long way around", it's because ggplot2 works better with "long" data (before the spread) than "wide" (after the spread). To illustrate, here's another way to visualize the difference in scores by individual, which I think works quite well:
ds %>%
mutate(id = rep(1:50, each = 2),
evaluator = factor(evaluator)) %>%
ggplot(aes(id, behavior_problems)) +
geom_point(aes(color = evaluator)) +
geom_line(aes(group = id))

R scatterplot y-axis grouped

My df is a database of individuals (rows) and amount they spent (column) in one activity. I want to draw a scatterplot in R that has the following characteristics:
x-axis: log(amount spent)
y-axis: log(number of people that spent this amount)
This is how far I got:
plot(log(df$Amount), log(df$???))
How can I do that? Thanks!
My df looks something like this:
df
Name Surname Amount
John Smith 223
Mary Osborne 127
Mark Bloke 45
This is what I have in mind (taken from a paper by Chen (2012))
Try this:
library(dplyr)
library(scales) # To let you make plotted points transparent
# Make some toy data that matches your df's structure
set.seed(1)
df <- data.frame(Name = rep(letters, 4), Surname = rep(LETTERS, 4), Amount = rnorm(4 * length(LETTERS), 200, 50))
# Use dplyr to get counts of loans in each 5-dollar bin, then merge those counts back
# into the original data frame to use as y values in plot to come.
dfsum <- df %>%
mutate(Bins=cut(Amount, breaks=seq(round(min(Amount), -1) - 5, round(max(Amount) + 5, -1), by=5))) # Per AkhilNair's comment
group_by(Bins) %>%
tally() %>%
merge(df, ., all=TRUE)
# Make the plot with the new df with the x-axis on a log scale
with(dfsum, plot(x = log(Amount), y = n, ylab="Number around this amount", pch=20, col = alpha("black", 0.5)))
Here's what that produced:

Resources