visualizing statistical test results with ggplot2 - r

I would like to integrate my statistical test results into my plot. Here is an example of my script with dummy variables (the dummy data below was generated after the first post):
library(dplyr)
library(ggplot2)
cases <- rep(1:5, times = 10)
var1 <- rep(11:15, times = 10)
outcome <- rep(c(1, 1, 1, 2, 2), times = 10)
maindata <- data.frame(cases, var1, outcome)
# one row per case: maximum of var1 and mean outcome
df1 <- maindata %>%
  group_by(cases) %>%
  select(cases, var1, outcome) %>%
  summarise(var1 = max(var1, na.rm = TRUE), outcome = mean(outcome, na.rm = TRUE))
wilcox.test(df1$var1[df1$outcome <= 1], df1$var1[df1$outcome > 1])
ggplot(df1, aes(x = as.factor(outcome), y = as.numeric(var1), fill = outcome)) + geom_boxplot()
With these everything works just fine, but I can't find a way to integrate my wilcox.test results into my plot automatically (of course I can use annotate() and write the results manually, but that's not what I'm after).
My script produces two boxplots with the max value of var1 on the y-axis, grouped by outcome on the x-axis (outcome has only two different values). I would like to add my wilcox.test results to that boxplot; all other relevant data is present. I searched forums and help files but couldn't find a way to do this (at least with ggplot2).
I'm new to R and am trying to learn by using ggplot2 and dplyr, which I see as the most intuitive packages for manipulation and visualization. I don't know if they are optimal for the solution I'm after, so feel free to suggest solutions from alternative packages as well.

I think this figure shows what you want. I also added some parts to the code because you're new to ggplot2. Take or leave them, but they are things I do to make publication-quality figures:
wtOut <- wilcox.test(df1$var1[df1$outcome <= 1], df1$var1[df1$outcome > 1])
exampleOut <- ggplot(df1,
                     aes(x = as.factor(outcome), y = as.numeric(var1), fill = outcome)) +
  geom_boxplot() +
  scale_fill_gradient(name = paste0("P-value: ",
                                    signif(wtOut$p.value, 3), "\nOutcome")) +
  ylab("Variable 1") + xlab("Outcome") + theme_bw()
ggsave('exampleOut.jpg', exampleOut, width = 6, height = 4)
If you want to include the p-value as its own legend, it looks like it is some work, but doable.
Or, if you want, just throw signif(wtOut$p.value, 3) into annotate(...). You'll just need to come up with rules for where to place it.
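The annotate() route mentioned above could look roughly like the sketch below. It rebuilds the dummy data from the question so it runs on its own. The placement rule (halfway between the two boxes, at the top of the data range) and the names p_label and p_annotated are only assumptions for illustration, not the one right way to position the label.

```r
library(dplyr)
library(ggplot2)

# Dummy data from the question
maindata <- data.frame(cases = rep(1:5, times = 10),
                       var1 = rep(11:15, times = 10),
                       outcome = rep(c(1, 1, 1, 2, 2), times = 10))
df1 <- maindata %>%
  group_by(cases) %>%
  summarise(var1 = max(var1, na.rm = TRUE),
            outcome = mean(outcome, na.rm = TRUE))

wtOut <- wilcox.test(df1$var1[df1$outcome <= 1], df1$var1[df1$outcome > 1])

# Build the label from the test object so it updates with the data
p_label <- paste0("Wilcoxon p = ", signif(wtOut$p.value, 3))

p_annotated <- ggplot(df1, aes(x = as.factor(outcome), y = var1, fill = outcome)) +
  geom_boxplot() +
  # Assumed placement rule: x = 1.5 sits between the two discrete groups,
  # y is the largest observed value
  annotate("text", x = 1.5, y = max(df1$var1), label = p_label) +
  xlab("Outcome") + ylab("Variable 1") + theme_bw()
```

Printing p_annotated draws the boxplots with the p-value text between them. With more than two groups the x position would need a different rule.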

Related

Plotting multiple continuous variables by frequencies together with same scale margin in r

I am trying to visualize my data. All I need is a plot to compare the distributions of the different variables.
I already tried multi.hist, and that would actually be enough for me. The problem is that I cannot keep the axis limits the same across the histograms, which I need in order to compare the distributions, because multi.hist fits the scale to each variable separately.
I also have a categorical variable in my data (topic 1-5). Maybe there is a good way to visualize this too, but it is not essential if it is not easy to do.
I tried a lot with ggplot as well, but I am rather new to R and have not managed to produce anything good yet.
Below you see an example for my data.
Thank you very much in advance :)
My data:
Data
Try first converting your data to long format (pivot_longer comes from the tidyr package):
df2 <- df %>% pivot_longer(cols = 1:5, names_to = 'set', values_to = 'sub_means')
Then you can do a density plot, either colouring by set and faceting by topic:
df2 %>% ggplot(aes(x = sub_means, fill = set)) + geom_density() + facet_wrap(~topic)
Or vice versa:
df2 %>% ggplot(aes(x = sub_means, fill = factor(topic))) + geom_density() + facet_wrap(~set)
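Since the original data was not posted, here is a minimal self-contained sketch of the long-format approach. The data frame, its column names (v1 to v5) and all values are invented for illustration; only the topic column and the names set and sub_means follow the thread. facet_wrap uses scales = "fixed" by default, so every panel shares the same axis limits, which is what the question asks for.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

set.seed(42)
# Synthetic stand-in for the asker's data: five numeric columns plus topic
df <- data.frame(v1 = rnorm(100, 3, 1),
                 v2 = rnorm(100, 4, 1),
                 v3 = rnorm(100, 5, 2),
                 v4 = rnorm(100, 2, 0.5),
                 v5 = rnorm(100, 3, 1.5),
                 topic = sample(1:5, 100, replace = TRUE))

# One row per (observation, variable) pair
df2 <- df %>%
  pivot_longer(cols = 1:5, names_to = "set", values_to = "sub_means")

# One panel per variable, all panels on the same x and y scale
p_dens <- ggplot(df2, aes(x = sub_means, fill = factor(topic))) +
  geom_density(alpha = 0.4) +
  facet_wrap(~set, scales = "fixed")
```

Swapping geom_density for geom_histogram gives histograms closer to the multi.hist output, still with shared axes.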

ggplot par new=TRUE option

I am trying to plot 400 ECDF graphs in one image using ggplot.
As far as I know, ggplot does not support the par(new=T) option.
So the first solution I thought of was to use the grid.arrange function in the gridExtra package.
However, I am generating the ECDFs inside a for loop.
Below is my code; you can ignore the data processing steps.
i=1
for(i in 1:400)
{
  test<-subset(df,code==temp[i,])
  test<-test[c(order(test$Distance)),]
  test$AI_ij<-normalize(test$AI_ij)
  AI = test$AI_ij
  ggplot(test, aes(AI)) +
    stat_ecdf(geom = "step") +
    scale_y_continuous(labels = scales::percent) +
    theme_bw() +
    new_theme +
    xlab("Calculated Accessibility Value") +
    ylab("Percent")
}
So I have values stored in "AI" in the for loop.
In this case how should I plot 400 graphs in the same chart?
This is not the way to put multiple lines on a ggplot. To do this, it is far easier to pass all of your data together and map code to the "group" aesthetic to give you one ecdf line for each code.
By far the hardest part of answering this question was attempting to reverse-engineer your data set. The following data set should be close enough in structure and naming to allow the code to be run on your own data.
library(dplyr)
library(BBmisc)
library(ggplot2)
set.seed(1)
all_codes <- apply(expand.grid(1:16, LETTERS), 1, paste0, collapse = "")
temp <- data.frame(sample(all_codes, 400), stringsAsFactors = FALSE)
df <- data.frame(code = rep(all_codes, 100),
                 Distance = sqrt(rnorm(41600)^2 + rnorm(41600)^2),
                 AI_ij = rnorm(41600),
                 stringsAsFactors = FALSE)
Since you only want the first 400 codes from temp that appear in df to be shown on the plot, you can use dplyr::filter to keep the rows where code %in% temp[[1]] rather than iterating through the whole thing one element at a time.
You can then group_by code, and arrange by Distance within each group before normalizing AI_ij, so there is no need to split your data frame into a new subset for every line: the data is processed all at once and the data frame is kept together.
Finally, you plot this using the group aesthetic. Note that because you have 400 lines on one plot, you need to make each line faint in order to see the overall pattern more clearly. We do this by setting the alpha value to 0.05 inside stat_ecdf.
Note also that there are multiple packages with a function called normalize and I don't know which one you are using. I have guessed you are using BBmisc.
So you can get rid of the loop and do:
df %>%
  filter(code %in% temp[[1]]) %>%
  group_by(code) %>%
  arrange(Distance, .by_group = TRUE) %>%
  mutate(AI = normalize(AI_ij)) %>%
  ggplot(aes(AI, group = code)) +
  stat_ecdf(geom = "step", alpha = 0.05) +
  scale_y_continuous(labels = scales::percent) +
  theme_bw() +
  xlab("Calculated Accessibility Value") +
  ylab("Percent")

Plotting graph from Text file using R

I am using an NS3 based simulator called NDNsim. I can generate certain trace files that can be used to analyze performance, etc. However I need to visualize the data generated.
I am a complete novice with R and would like a way to visualize it. Here is what the output I would like to plot looks like. Any help is appreciated.
It's pretty difficult to know what you're looking for, since you have almost 50,000 measurements across 9 variables. Here's one way of getting a lot of that information on the screen:
df <- read.table(paste0("https://gist.githubusercontent.com/wuodland/",
                        "9b2c76650ea37459f869c59d5f5f76ea/raw/",
                        "6131919c105c95f8ba6967457663b9c37779756a/rate.txt"),
                 header = TRUE)
library(ggplot2)
ggplot(df, aes(x = Time, y = Kilobytes, color = Type)) +
  geom_line() +
  facet_wrap(~FaceDescr)
You could look into building smaller data frames from your input file and then graphing those by node, instead of trying to invoke the plotter on everything in just the right way.
library(ggplot2)
df <- read.table(paste0("https://gist.githubusercontent.com/wuodland/",
                        "9b2c76650ea37459f869c59d5f5f76ea/raw/",
                        "6131919c105c95f8ba6967457663b9c37779756a/rate.txt"),
                 header = TRUE)
# keep only the "InData" rows and the columns of interest
smaller_df <- df[which(df$Type == 'InData'),
                 names(df) %in% c("Time", "Node", "FaceId", "FaceDescr", "Type",
                                  "Packets", "Kilobytes", "PacketRaw", "KilobyteRaw")]
# each line must end with + so that R continues the expression
ggplot(smaller_df, aes(x = Time, y = Kilobytes, color = Type)) +
  geom_line() +
  facet_wrap(~ Node)
The above snippet makes a smaller data frame from your original text data using only the "InData" Type, and then plots that by nodes.

How to draw a violin plot with the color showing the expression of gene value?

I am trying to plot the expression of "gene A" across several groups.
I am using ggplot2, but I cannot get the result I want:
p <- ggplot(MAPK_plot, aes(x = group, y = gene_A)) + geom_violin(trim = FALSE , aes( colour = gene_A)) + theme_classic()
And I want to get the figure like this from https://www.researchgate.net/publication/313728883_Neuropilin-1_Is_Expressed_on_Lymphoid_Tissue_Residing_LTi-like_Group_3_Innate_Lymphoid_Cells_and_Associated_with_Ectopic_Lymphoid_Aggregates
You would have to provide data to get a more specific answer, tailored to your problem. But I do not want you to be demotivated by the down-votes you have received so far and, based on your link, maybe this example can give you some food for thought.
Nice job on figuring out that you have to use geom_violin. Further, you will need some form of faceting / multi-panels. Finally, to do the full annotation like in the given link, you need to make use of the grid package functionality (which I do not use here).
I am not familiar with gene-expression data sets, so I use an IMDb movie rating data set for this example (stored in the package ggplot2movies).
library(ggplot2)
library(ggplot2movies)
library(data.table)
mv <- copy(movies)
setDT(mv)
# make some variables for our plotting example
mv[, year_10 := cut_width(year, 10)]
mv[, rating_10yr_avg := mean(rating), by = year_10]
mv[, length_3gr := cut_number(length, 3)]
ggplot(mv,
       aes(x = year_10,
           y = rating)) +
  geom_violin(aes(fill = rating_10yr_avg),
              scale = "width") +
  facet_grid(rows = vars(length_3gr))
Please do not take this answer as encouragement not to post data relevant to your problem.
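Translated to the column names from the question (group and gene_A), the same idea might look like the sketch below. The data frame here is a synthetic stand-in for MAPK_plot with invented values, and mean_expr is a hypothetical helper column. It fills each violin by the group's mean expression, which is one possible reading of "color showing the expression", not necessarily what the linked figure does.

```r
library(ggplot2)

set.seed(1)
# Synthetic stand-in for MAPK_plot; group labels and values are invented
MAPK_plot <- data.frame(group = rep(c("A", "B", "C"), each = 50),
                        gene_A = c(rnorm(50, 2), rnorm(50, 4), rnorm(50, 6)))

# Mean expression per group, repeated on every row of that group
# (ave() applies mean() within each level of group by default)
MAPK_plot$mean_expr <- ave(MAPK_plot$gene_A, MAPK_plot$group)

p_violin <- ggplot(MAPK_plot, aes(x = group, y = gene_A)) +
  geom_violin(aes(fill = mean_expr), trim = FALSE) +
  scale_fill_gradient(low = "grey90", high = "red", name = "Mean expression") +
  theme_classic()
```

The fill has to be constant within a group, because a violin is drawn as a single polygon; mapping the raw gene_A values to colour, as in the question, gives ggplot2 no single value to use per violin.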

Manually added legend not working in ggplot2?

Here's a facsimile of my data:
d1 <- data.frame(
  e = rnorm(3000, 10, 10)
)
d2 <- data.frame(
  e = rnorm(2000, 30, 30)
)
So, I got around the problem of plotting two different density distributions from two very different datasets on the same graph by doing this:
ggplot() +
  geom_density(aes(x = e), fill = "red", data = d1) +
  geom_density(aes(x = e), fill = "blue", data = d2)
But when I try to manually add a legend, like so:
ggplot() +
  geom_density(aes(x = e), fill = "red", data = d1) +
  geom_density(aes(x = e), fill = "blue", data = d2) +
  scale_fill_manual(name = "Data", values = c("XXXXX" = "red", "YYYYY" = "blue"))
Nothing happens. Does anybody know what's going wrong? I thought I could actually manually add legends if need be.
Generally ggplot works best when your data is in a single data.frame and in long format. In your case we therefore want to combine the data from both data.frames. For this simple example, we just concatenate the data into a long variable called d and use an additional column id to indicate to which dataset that value belongs.
d.f <- data.frame(id = rep(c("XXXXX", "YYYYY"), c(3000, 2000)),
                  d = c(d1$e, d2$e))
More complex data manipulations can be done using packages such as reshape2 and tidyr. I often find this cheat sheet useful. Then when we plot we map fill to id, and ggplot will take care of the legend automatically.
ggplot(d.f, aes(x = d, fill = id)) +
  geom_density()
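If you would rather keep the two data frames separate, a common alternative is to move fill inside aes() as a constant string in each layer. A fill set outside aes() is a fixed colour, not a mapping, so no scale or legend is built for it, which is why scale_fill_manual appeared to do nothing. The sketch below reuses d1 and d2 from the question; alpha = 0.5 and the name p_legend are additions for illustration.

```r
library(ggplot2)

set.seed(1)
d1 <- data.frame(e = rnorm(3000, 10, 10))
d2 <- data.frame(e = rnorm(2000, 30, 30))

p_legend <- ggplot() +
  # The strings inside aes() are legend keys, not colours
  geom_density(aes(x = e, fill = "XXXXX"), data = d1, alpha = 0.5) +
  geom_density(aes(x = e, fill = "YYYYY"), data = d2, alpha = 0.5) +
  # scale_fill_manual now has a mapped aesthetic to act on
  scale_fill_manual(name = "Data",
                    values = c("XXXXX" = "red", "YYYYY" = "blue"))
```

The long-format approach is still the more idiomatic one; this variant is mainly useful when the data sets are awkward to combine.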

Resources