Same y-axis variable, scatter-plot and long format - r

Let's say two different raters are evaluating behavioral problems. They use the same scale (from 0 to 50) and the child being evaluated is the same for both raters. In social sciences, this method is common and there are some useful statistics, such as correlation coefficient and Intra-Class Correlation.
In addition, one graph that comes to mind is the scatter-plot: on the x-axis I'll plot the behavioral problems scores from the first rater, and on the y-axis the scores from the second rater.
ggplot2 creates amazing plots; however, some simple routines and actions become really difficult to do.
Please see the code below and the r base plot. I would like to know if ggplot can create this plot as well.
Thanks much
set.seed(123)
ds <- data.frame(behavior_problems = rnorm(100,30,2), evaluator=sample(1:2))
plot(ds$behavior_problems[ds$evaluator == '1'] ,
y = ds$behavior_problems[ds$evaluator == '2'])
Edit: I had to edit the question to make clear why a scatter-plot would be informative.

I think the key problem here is the way in which you have set up the data frame. It is not clear that each individual gets a pair of scores, one from each evaluator. So the first thing to do is add an ID for each individual: 50 IDs in your example data.
library(tidyverse)
ds %>%
mutate(id = rep(1:50, each = 2))
Now we can use tidyr::spread to create a column for each evaluator. But numbers for column names are not a great idea, so we'll rename them to e1 and e2.
ds %>%
mutate(id = rep(1:50, each = 2)) %>%
spread(evaluator, behavior_problems) %>%
rename(e1 = `1`, e2 = `2`)
Now we have column names that can be supplied to ggplot:
ds %>%
mutate(id = rep(1:50, each = 2)) %>%
spread(evaluator, behavior_problems) %>%
rename(e1 = `1`, e2 = `2`) %>%
ggplot(aes(e1, e2)) +
geom_point()
If this seems like a "long way around", it's because ggplot2 works better with "long" data (before the spread) than "wide" (after the spread). To illustrate, here's another way to visualize the difference in scores by individual, which I think works quite well:
ds %>%
mutate(id = rep(1:50, each = 2),
evaluator = factor(evaluator)) %>%
ggplot(aes(id, behavior_problems)) +
geom_point(aes(color = evaluator)) +
geom_line(aes(group = id))
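As a side note, tidyr's documentation marks spread as superseded by pivot_wider. Below is a minimal sketch of the same reshaping with pivot_wider, followed by the correlation between the two raters mentioned in the question. The name wide is only for illustration, and the id column assumes, as above, that consecutive rows belong to the same child.

```r
library(dplyr)
library(tidyr)

set.seed(123)
ds <- data.frame(behavior_problems = rnorm(100, 30, 2), evaluator = sample(1:2))

# One row per child, one column per evaluator (e1, e2)
wide <- ds %>%
  mutate(id = rep(1:50, each = 2)) %>%   # assumption: rows 1-2 are child 1, rows 3-4 child 2, ...
  pivot_wider(names_from = evaluator,
              values_from = behavior_problems,
              names_prefix = "e")

# Pearson correlation between the two raters' scores
cor(wide$e1, wide$e2)
```

The wide data frame can be passed to ggplot(aes(e1, e2)) + geom_point() exactly as in the spread version.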

Related

e_facet using grouped data in echarts4r question

I really like the possibilities this package offers and would like to use it in a Shiny app. However, I am struggling to recreate a ggplot plot in echarts4r.
library(tidyverse)
library(echarts4r)
data = tibble(time = factor(sort(rep(c(4,8,24), 30)), levels = c(4,8,24)),
dose = factor(rep(c(1,2,3), 30), levels = c(1,2,3)),
id = rep(sort(rep(LETTERS[1:10], 3)),3),
y = rnorm(n = 90, mean = 5, sd = 3))
This is the plot I am aiming to recreate:
ggplot(data = data, mapping = aes(x = time, y = y, group = id)) +
geom_point() +
geom_line() +
facet_wrap(~dose)
The problem I am having is reproducing ggplot's group = id in echarts4r. I want to use e_facet on data grouped with group_by(), but I cannot (or don't know how to) add a second grouping that connects the dots the way geom_line() does.
data %>%
group_by(dose) %>%
e_charts(time) %>%
e_line(y) %>%
e_facet(rows = 1, cols = 3)
You can do this with echarts4r.
There are two methods that I know of that work, one uses e_list. I think that method would make this more complicated than it needs to be, though.
It might be useful to know that e_facet, e_arrange, and e_grid all fall under echarts grid functionality—you know, sort of like everything that ggplot2 does falls under base R's grid.
I used group_split from dplyr and imap from purrr to create the faceted graph. You'll notice that I didn't use e_facet due to its constraints.
group_split is interchangeable with base R's split and either could have been used.
I used imap so I could map over the groups and have the benefit of using an index. If you're familiar with the use of enumerate in a Python for statement or a forEach in Javascript, this sort of works the same way. In the map call, j is a data frame; k is an index value. I appended the additional arguments needed for e_arrange, then made the plot.
library(tidyverse) # has both dplyr and purrrrrr (how many r's?)
library(echarts4r)
data %>% group_split(dose) %>%
imap(function(j, k) {
j %>% group_by(id) %>%
e_charts(time, name = paste0("chart_", k)) %>%
e_line(y, name = paste0("Dose ", k)) %>%
e_color(color = "black")
}) %>% append(c(rows = 1, cols = 3)) %>%
do.call(e_arrange, .)
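To see what imap hands to the function in the answer above, here is a small sketch that leaves echarts4r out entirely. The toy tibble and the labels are made up for illustration; only group_split and imap are taken from the answer.

```r
library(dplyr)
library(purrr)

# Hypothetical toy data: three dose groups with two rows each
toy <- tibble(dose = rep(1:3, each = 2), y = 1:6)

# group_split() returns an unnamed list, so imap() passes the
# list position (1, 2, 3) as the second argument k
labels <- toy %>%
  group_split(dose) %>%
  imap(function(j, k) paste0("chart_", k, ": ", nrow(j), " rows"))

unlist(labels)
```

In the real pipeline, j is the data frame for one dose and k is what ends up in the chart and series names.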

Visualise in R with ggplot, a k-means clustered developmental gene expression dataset

I can see many posts on this topic, but none addresses this question. Apologies if I missed a relevant answer. I have a large protein expression dataset, with samples like so as the columns:
rep1_0hr, rep1_16hr, rep1_24hr, rep1_48hr, rep1_72hr .....
and 2000+ proteins in the rows. In other words each sample is a different developmental timepoint.
If it is of any interest, the original dataset is 'mulvey2015' from the pRolocdata package in R, which I converted to a SummarizedExperiment object in RStudio.
I first ran k-means clustering on the data (an assay() of a SummarizedExperiment dataset) to get 12 clusters:
k_mul <- kmeans(scale(assay(mul)), centers = 12, nstart = 10)
Then:
summary(k_mul)
produced the expected output.
I would like the visualisation to look like this, with samples on the x-axis and expression on the y-axis. The plots look like they have been generated using facet_wrap() in ggplot:
For ggplot the data need to be provided as a dataframe with a column for the cluster identity of an individual protein. Also the data need to be in long format. I tried pivoting (pivot_longer) the original dataset, but of course there are a very large number of data points. Moreover, the image I posted shows that for any one plot, the number of coloured lines is smaller than the total number of proteins, suggesting that there might have been dimension reduction on the dataset first, but I am unsure. Up till now I have been running the kmeans algorithm without dimension reduction. Can I get guidance please for how to produce this plot?
Here is my attempt at reverse engineering the plot:
library(pRolocdata)
library(dplyr)
library(tidyverse)
library(magrittr)
library(ggplot2)
mulvey2015 %>%
Biobase::assayData() %>%
magrittr::extract2("exprs") %>%
data.frame(check.names = FALSE) %>%
tibble::rownames_to_column("prot_id") %>%
mutate(.,
cl = kmeans(select(., -prot_id),
centers = 12,
nstart = 10) %>%
magrittr::extract2("cluster") %>%
as.factor()) %>%
pivot_longer(cols = !c(prot_id, cl),
names_to = "Timepoint",
values_to = "Expression") %>%
ggplot(aes(x = Timepoint, y = Expression, color = cl)) +
geom_line(aes(group = prot_id)) +
facet_wrap(~ cl, ncol = 4)
As for your questions, pivot_longer is usually quite performant unless it fails to find unique combinations in keys or runs into data type conversion problems. The plot can be improved by:
tweaking the alpha parameter of geom_line (e.g. alpha = 0.5), in order to give an idea of the density of lines
finding a good abbreviation and order for Timepoint
changing axis.text.x orientation
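The three suggestions can be sketched on made-up data as follows. The toy data frame, the two clusters and the 45-degree angle are assumptions for illustration, not values from the mulvey2015 data.

```r
library(ggplot2)

set.seed(1)
timepoints <- c("0hr", "16hr", "24hr", "48hr", "72hr")

# Hypothetical long-format data: 20 proteins x 5 timepoints, 2 clusters
toy <- expand.grid(prot_id = paste0("p", 1:20), Timepoint = timepoints)
toy$cl <- factor(rep(rep(1:2, each = 10), times = 5))
toy$Expression <- rnorm(nrow(toy))

# Fix the order of the x axis explicitly instead of relying on alphabetical order
toy$Timepoint <- factor(toy$Timepoint, levels = timepoints)

p <- ggplot(toy, aes(x = Timepoint, y = Expression, colour = cl)) +
  geom_line(aes(group = prot_id), alpha = 0.5) +               # transparency hints at line density
  facet_wrap(~ cl) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))     # rotated tick labels
p
```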
Here is my own, very similar solution to the above.
dfsa_mul <- data.frame(scale(assay(mul)))
dfsa_mul2 <- rownames_to_column(dfsa_mul, "protID")
Add the kmeans $cluster column to the dfsa_mul2 data frame. Only change clus to a factor after executing pivot_longer:
dfsa_mul2$clus <- ksa_mul$cluster
dfsa_mul2 %>%
pivot_longer(cols = -c("protID", "clus"),
names_to = "samples",
values_to = "expression") %>%
ggplot(aes(x = samples, y = expression, colour = factor(clus))) +
geom_line(aes(group = protID)) +
facet_wrap(~ factor(clus))
This generates a series of plots identical to the graphs posted by @sbarbit.

Fill geom_tile by mode of a factor variable or other ways to create a heat map in R

I am trying to create a heat map in R using three factors. I would like to be able to fill the colour using the modal category of one of the factors but I have not been able to find out how to do this.
When I try ggplot with geom_tile, it does produce the heatmap, however, I am not sure how it chooses the value of the fill variable. It certainly isn't the mode because I've checked this.
For instance, using the inbuilt dataset ChickWeight, I would like the fill to be based on the modal (most frequent) category of a variable "weight_group" I created.
data(ChickWeight)
glimpse(ChickWeight)
ChickWeight$Time <- ifelse(ChickWeight$Time >= 10,1,0)
ChickWeight <- ChickWeight %>% mutate(weight_group = ntile(weight, 3))
ChickWeight$Diet <- as.factor(ChickWeight$Diet)
ChickWeight$Time <- as.factor(ChickWeight$Time)
ChickWeight$weight_group <- as.factor(ChickWeight$weight_group)
table(ChickWeight$Diet, ChickWeight$Time, ChickWeight$weight_group)
ggplot(data = ChickWeight, aes(x=Time, y=Diet, fill=weight_group)) +
geom_tile()
Based on the three-way table, the bottom right block should be pink (corresponding to weight_group==1) rather than green as the modal category of weight_group when Diet==1 & Time==1 is weight_group==1 (11 counts).
Any help on this would be greatly appreciated.
Thank you!
You can define a function getMode that calculates the mode of a vector using plyr's count function to create a data frame of the counts for each class. Then sort the data frame and get the top value.
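# plyr is not attached here: plyr::count() is called by namespace below, and attaching plyr after dplyr would mask dplyr::summarize()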
getMode <- function(vec){
df <- plyr::count(vec) %>%
arrange(-freq)
return(df[1,"x"])
}
From here group by time and diet so you can find the mode for each combination of these groups and then use this as the fill for ggplot.
ChickWeight %>%
group_by(Time, Diet) %>%
summarize(modeWeightGroup = getMode(weight_group)) %>%
ggplot(aes(x=Time, y=Diet, fill= modeWeightGroup)) +
geom_tile()
I also don't think that the bottom right square should be weight_group 1. It looks like the three-way table is already sorted by weight_group, so that cell is saying that, of the chicks in weight_group 1, the modal (Time, Diet) combination is (1, 1).
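If you would rather not depend on plyr at all, the mode can also be computed with base R's table. This is a sketch rather than a drop-in replacement for getMode: the name get_mode is made up, it returns the modal category as a character string, and on ties it returns the first category in table order.

```r
# Most frequent value of a vector, using only base R
get_mode <- function(vec) {
  counts <- table(vec)                # frequency of each category
  names(counts)[which.max(counts)]    # label of the first maximum
}

get_mode(c("a", "b", "b", "c"))
```

Inside summarize you may want to wrap the result in factor() so that ggplot treats it as a discrete fill.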
Using dplyr to count the most frequent category of weight_group for each combination of Time and Diet :
ChickWeight %>%
group_by(Time, Diet) %>%
count(weight_group) %>%
filter(n == max(n)) %>%
ggplot(
aes(x = Time,
y = Diet,
fill = weight_group)
) +
geom_tile()
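One caveat with filter(n == max(n)): if two categories are tied within a Time and Diet cell, both rows are kept and the tiles are drawn on top of each other. A sketch of one way around this, assuming that keeping an arbitrary one of the tied categories is acceptable, uses slice_max with with_ties = FALSE. The toy data is made up.

```r
library(dplyr)

# Hypothetical data: Time 0 has a tie between groups 1 and 2, Time 1 does not
toy <- tibble(
  Time = factor(c(0, 0, 0, 0, 1, 1, 1)),
  Diet = factor(1),
  weight_group = factor(c(1, 1, 2, 2, 3, 3, 1))
)

modal <- toy %>%
  count(Time, Diet, weight_group) %>%
  group_by(Time, Diet) %>%
  slice_max(n, n = 1, with_ties = FALSE) %>%   # exactly one row per Time/Diet cell
  ungroup()

modal
```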
By the way, since you already know dplyr::mutate, you should know you can do all the pre-processing you are doing here inside a single mutate.
That means instead of :
ChickWeight$Time <- ifelse(ChickWeight$Time >= 10,1,0)
ChickWeight <- ChickWeight %>% mutate(weight_group = ntile(weight, 3))
ChickWeight$Diet <- as.factor(ChickWeight$Diet)
ChickWeight$Time <- as.factor(ChickWeight$Time)
ChickWeight$weight_group <- as.factor(ChickWeight$weight_group)
you can simply type :
ChickWeight <-
ChickWeight %>%
mutate(
Time = as.factor(ifelse(Time>=10, 1 ,0)),
Diet = as.factor(Diet),
weight_group = as.factor(ntile(weight, 3))
)

Show outliers in an efficient manner using ggplot

The actual data (and aim) I have is different, but for reproducibility I used the Titanic dataset. My aim is to create a plot of the age outliers (more than 1 SD from the mean) per class and sex.
Therefore the first thing I did was calculate the sd values and ranges:
library(dplyr)
library(ggplot2)
#Load titanic set
titanic <- read.csv("titanic_total.csv")
group <- group_by(titanic, Pclass, Sex)
#Create outlier ranges
summarise <- summarise(group, mean=mean(Age), sd=sd(Age))
summarise <- as.data.frame(summarise)
summarise$outlier_max <- summarise$mean + summarise$sd
summarise$outlier_min <- summarise$mean - summarise$sd
#Create a key
summarise$key <- paste0(summarise$Pclass, summarise$Sex)
#Create a key for the base set
titanic$key <- paste0(titanic$Pclass, titanic$Sex)
total_data <- left_join(titanic, summarise, by = "key")
total_data$outlier <- 0
Next, using a loop I determine whether the age is inside or outside the range
for (row in 1:nrow(total_data)){
if((total_data$Age[row]) > (total_data$outlier_max[row])){
total_data$outlier[row] <- 1
} else if ((total_data$Age[row]) < (total_data$outlier_min[row])){
total_data$outlier[row] <- 1
} else {
total_data$outlier[row] <- 0
}
}
Do some data cleaning ...
total_data$Pclass.x <- as.factor(total_data$Pclass.x)
total_data$outlier <- as.factor(total_data$outlier)
Now this code gives me the plot I am looking for.
ggplot(total_data, aes(x = Age, y = Pclass.x, colour = outlier)) + geom_point() +
facet_grid(. ~Sex.x)
However, this does not really seem like the easiest way to crack this problem. Any thoughts on how I can apply best practices to make this more efficient?
One way to reduce your code and make it less repetitive is to get it all into one procedure thanks to the pipe. Instead of creating a summary with the values, re-join this with the data, you could basically do this within one mutate step:
titanic %>%
mutate(Pclass = as.factor(Pclass)) %>%
group_by(Pclass, Sex) %>%
mutate(Age.mean = mean(Age),
Age.sd = sd(Age),
outlier.max = Age.mean + Age.sd,
outlier.min = Age.mean - Age.sd,
outlier = as.factor(ifelse(Age > outlier.max, 1,
ifelse(Age < outlier.min, 1, 0)))) %>%
ggplot() +
geom_point(aes(Age, Pclass, colour = outlier)) +
facet_grid(.~Sex)
Pclass is mutated to a factor in advance, as it is a grouping factor. Then, the steps are done within the original dataframe, instead of creating two new ones. No changes are made to the original dataset however! If you would want this, just reassign the results to titanic or another data frame, and execute the ggplot-part as next step. Else you would assign the result of the figure to your data.
For the identification of outliers, one way is to work with ifelse. Alternatively, dplyr offers the nice between function; however, for this you would need to add rowwise, i.e. after creating the min and max thresholds for outliers:
...
rowwise() %>%
mutate(outlier = as.factor(as.numeric(!between(Age, outlier.min, outlier.max)))) %>% ... # between() is TRUE inside the range, so negate it to flag outliers
Plus:
Additionally, you could reduce your code even further, depending on which variables you want to keep:
titanic %>%
group_by(Pclass, Sex) %>%
mutate(outlier = as.factor(ifelse(Age > (mean(Age) + sd(Age)), 1,
ifelse(Age < (mean(Age) - sd(Age)), 1, 0)))) %>%
ggplot() +
geom_point(aes(Age, as.factor(Pclass), colour = outlier)) +
facet_grid(.~Sex)
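The nested ifelse can also be collapsed into a single comparison of the absolute deviation against the group sd. The sketch below uses a made-up toy tibble in place of the Titanic file, and the na.rm = TRUE arguments are an assumption for the case where Age contains missing values (without them, one NA makes the mean and sd of the whole group NA).

```r
library(dplyr)

# Hypothetical stand-in for the Titanic data
toy <- tibble(
  Pclass = c(1, 1, 1, 1, 2, 2, 2, 2),
  Sex    = "male",
  Age    = c(20, 21, 22, 60, 30, 31, 32, NA)
)

flagged <- toy %>%
  group_by(Pclass, Sex) %>%
  # TRUE when Age is more than one group sd away from the group mean
  mutate(outlier = abs(Age - mean(Age, na.rm = TRUE)) > sd(Age, na.rm = TRUE)) %>%
  ungroup()

flagged
```

Rows with a missing Age get NA in outlier rather than being counted as inside or outside the range.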

How to melt a dataframe into multiple factors

I have been trying to plot a line plot with ggplot.
My data looks something like this:
I04 F04 I05 F05 I06 F06
CAT 3 12 2 6 6 20
DOG 0 0 0 0 0 0
BIEBER 1 0 0 1 0 0
and can be found here.
Basically, we have a certain number of CATs (or other creatures) initially in a year (this is I04), and a certain number of CATs at the end of the year (this is F04). This goes on for some time.
I can plot something like this fairly simply using the code below, and get this:
This is fantastic, but doesn't work very well for me. After all, I have these starting and ending inventories for each year. So I am interested in seeing how the initial values (I04, I05, I06) change over time. So, for each animal, I would like to create two different lines, one for initial quantity and one for final quantity (F04, F05, F06). This seems to me like now I have to consider two factors.
This is really difficult given the way my data is set up. I'm not sure how to tell ggplot that all the I prefixed years are one factor, and all the F prefixed years are another factor. When the dataframe gets melted, it's too late. I'm not sure how to control this situation.
Any advice on how I can separate these values or perhaps another, better way to tackle this situation?
Here is the code I have:
library(ggplot2)
library(reshape2)
DF <- read.csv("mydata.csv", stringsAsFactors=FALSE)
## cleaning up, converting factors to numeric, etc
text_names <- data.frame(as.character(DF$animals))
names(text_names) <- c("animals")
numeric_cols <- DF[, -c(1)]
numeric_cols <- sapply(numeric_cols, as.numeric)
plot_me <- data.frame(cbind(text_names, numeric_cols))
plot_me$animals <- as.factor(plot_me$animals)
meltedDF <- melt(plot_me)
p <- ggplot()
p <- p + geom_line(aes(seq(1:36), meltedDF$value, group=meltedDF$animals, color=meltedDF$animals))
p
Using your original data from the link:
nd <- reshape(mydata, idvar = "animals", direction = "long", varying = names(mydata)[-1], sep = "")
ggplot(nd, aes(x = time, y = I, group = animals, colour = animals)) + geom_line() + ggtitle("Development of initial inventories")
ggplot(nd, aes(x = time, y = F, group = animals, colour = animals)) + geom_line() + ggtitle("Development of final inventories")
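To make the result of reshape easier to picture, here is the same call on the three rows shown in the question. The data frame is typed in by hand from that excerpt, so it is only a stand-in for the linked file.

```r
# Hand-typed copy of the excerpt in the question (not the full linked data)
mydata <- data.frame(
  animals = c("CAT", "DOG", "BIEBER"),
  I04 = c(3, 0, 1), F04 = c(12, 0, 0),
  I05 = c(2, 0, 0), F05 = c(6, 0, 1),
  I06 = c(6, 0, 0), F06 = c(20, 0, 0)
)

# sep = "" lets reshape() split names such as "I04" into the
# measure ("I") and the time ("04", converted to the number 4)
nd <- reshape(mydata, idvar = "animals", direction = "long",
              varying = names(mydata)[-1], sep = "")

nd
```

The result has one row per animal and year, with the initial and final quantities in the columns I and F, which is why the two ggplot calls above can map y to I and to F directly.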
I think from a data analyst perspective the following approach might provide better insight.
For each animal we visualize the initial and the final quantity in a separate panel. Moreover, each subplot has its own y scale because the values of the different animal types are radically different. Like this, differences within and across animal types are easier to spot.
Given the current structure of your data, we do not need two different factors. After the gather call the indicator column includes data like I04, F04, etc. We just need to separate the first character from the rest resulting in two columns type and time. We can use type as the argument for color in the ggplot call. time provides a unified x-axis across all animal types.
library(tidyr)
library(dplyr)
library(ggplot2)
data %>% gather(indicator, value, -animals) %>%
separate(indicator, c('type', 'time'), sep = 1) %>%
mutate(
time = as.numeric(time)
) %>% ggplot(aes(time, value, color = type)) +
geom_line() +
facet_grid(animals ~ ., scales = "free_y")
Of course, you might also do it the other way round, namely using a subplot for the initial and the final quantities like this:
data %>% gather(indicator, value, -animals) %>%
separate(indicator, c('type', 'time'), sep=1) %>%
mutate(
time = as.numeric(time)
) %>% ggplot(aes(time, value, color = animals)) +
geom_line() +
facet_grid(type ~ ., scales = "free_y")
But as described above, I would not recommend that because the y scale varies too much across animal types.
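With a current version of tidyr, the gather and separate steps can be combined into a single pivot_longer call. This is a sketch on a hand-typed copy of the excerpt from the question. The name inventory and the regular expression are assumptions; the pattern expects every column name to be a single I or F followed by digits.

```r
library(dplyr)
library(tidyr)

# Hand-typed copy of the excerpt in the question (not the full linked data)
inventory <- data.frame(
  animals = c("CAT", "DOG", "BIEBER"),
  I04 = c(3, 0, 1), F04 = c(12, 0, 0),
  I05 = c(2, 0, 0), F05 = c(6, 0, 1),
  I06 = c(6, 0, 0), F06 = c(20, 0, 0)
)

long <- inventory %>%
  pivot_longer(-animals,
               names_to = c("type", "time"),       # "I04" -> type "I", time "04"
               names_pattern = "([IF])(\\d+)") %>%
  mutate(time = as.numeric(time))

long
```

The long data frame has the same animals, type, time and value columns as the gather version, so either ggplot call above can be applied to it unchanged.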

Resources