I have a CSV file in which the column parameter has 3 levels (C, C+Fe, Fe). I want to group the boxplots by parameter using geom_boxplot() with fill = parameter, but for only two of the levels, not all 3.
The current code is: geom_boxplot(aes(x = gene, y = RA, fill = parameter)), which yields:
However, I want to eliminate the blue boxplot, which corresponds to one of the parameter levels.
You can simply filter the observations before plotting. Assuming your data frame is called df:
df2 <- subset(df, parameter != "Fe")
ggplot(df2, aes(x = gene, y=RA, fill=parameter)) +
geom_boxplot()
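A self-contained sketch of the same idea, in case you want to try it without the CSV. The data here are made up; only the column names gene, RA and parameter come from the question. It keeps the two wanted levels with %in% rather than excluding one, and droplevels() is an optional extra that removes the now-empty level from the factor itself.

```r
library(ggplot2)

# Hypothetical data standing in for the CSV (values are made up)
set.seed(1)
df <- data.frame(
  gene      = rep(c("geneA", "geneB"), each = 30),
  parameter = factor(rep(c("C", "C+Fe", "Fe"), times = 20)),
  RA        = runif(60)
)

# Keep only the two levels to be plotted; droplevels() removes the
# now-empty "Fe" level from the factor
df2 <- droplevels(subset(df, parameter %in% c("C", "C+Fe")))

p <- ggplot(df2, aes(x = gene, y = RA, fill = parameter)) +
  geom_boxplot()
print(p)
```

Listing the levels you want to keep is a bit safer than naming the one to drop if more levels show up in the data later.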
I have a data frame containing 5 probes, which are my variables: cg02823866, cg13474877, cg14305799, cg15837913 and cg19724470. I want to create a boxplot that will group cg02823866 and cg14305799 into a group called 'GeneBody' and then cg13474877, cg14305799 and cg19724470 into a group called 'Promoter'. I then want to colour code the boxplots to represent the probe names. I can't figure out how to assign those variables to groups to plot the graph.
I created an ungrouped boxplot of the five probes and it looked like this.
I want the titles 'Promoter' and 'GeneBody' on the x axis. Above the 'GeneBody' title there should be the 2 boxplots for the cg02823866 and cg14305799 probes, and above the 'Promoter' label the boxplots for cg13474877, cg14305799 and cg19724470. I then want each boxplot colour coded to represent its probe.
My data frame that I imported into RStudio looks like this: https://i.stack.imgur.com/r4gEC.png
Assuming you have some data with variable names Beta (your y axis), Probe (your current x axis), and group (either "GeneBody" or "Promoter"), you can do something like the following:
library(ggplot2)
ggplot(data, aes(x = group, y = Beta, fill = Probe)) +
geom_boxplot()
If you provide a reproducible set of data, I can probably do better.
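In case the data are still in wide format (one column per probe, as the screenshot suggests), here is a rough sketch of how the group column could be built. The Beta values are made up, and the probe-to-region assignment in the lookup vector is an assumption for illustration; replace it with your real grouping.

```r
library(ggplot2)

# Hypothetical wide data: one column of Beta values per probe (values made up)
set.seed(42)
probes <- c("cg02823866", "cg13474877", "cg14305799", "cg15837913", "cg19724470")
wide <- as.data.frame(matrix(runif(20 * 5), ncol = 5,
                             dimnames = list(NULL, probes)))

# Reshape to long format with base R: one row per (Beta, Probe) pair
long <- stack(wide)
names(long) <- c("Beta", "Probe")

# Lookup vector mapping each probe to a region (assumed assignment, adjust as needed)
region <- c(cg02823866 = "GeneBody", cg14305799 = "GeneBody",
            cg13474877 = "Promoter", cg15837913 = "Promoter",
            cg19724470 = "Promoter")
long$group <- unname(region[as.character(long$Probe)])

p <- ggplot(long, aes(x = group, y = Beta, fill = Probe)) +
  geom_boxplot()
print(p)
```

A named lookup vector keeps the grouping in one place, so changing which probe belongs to which region is a one-line edit.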
Adding to Ben's answer, here is the traditional example with the iris data frame, which you can easily load with data(iris):
ggplot(iris) +
aes(x = "", y = Sepal.Length, group = Species) +
geom_boxplot(shape = "circle", fill = "#112446") +
theme_minimal()
So you just need a column that indicates group membership.
Of course, it gets more difficult with unclean data, where you might need to reshape (e.g. transpose) the data first. But those are follow-up questions, I guess.
Also, if you want to make your life easier, use the esquisse RStudio add-in.
I have a dataframe in R consisting of 104 columns, appearing as so:
id vcr1 vcr2 vcr3 sim_vcr1 sim_vcr2 sim_vcr3 sim_vcr4 sim_vcr5 sim_vcr6 sim_vcr7
1 2913 -4.782992840 1.7631999 0.003768704 1.376937 -2.096857 6.903021 7.018855 6.135139 3.188382 6.905323
2 1260 0.003768704 3.1577108 -0.758378208 1.376937 -2.096857 6.903021 7.018855 6.135139 3.188382 6.905323
3 2912 -4.782992840 1.7631999 0.003768704 1.376937 -2.096857 6.903021 7.018855 6.135139 3.188382 6.905323
4 2914 -1.311132669 0.8220594 2.372950077 -4.194246 -1.460474 -9.101704 -6.663676 -5.364724 -2.717272 -3.682574
5 2915 -1.311132669 0.8220594 2.372950077 -4.194246 -1.460474 -9.101704 -6.663676 -5.364724 -2.717272 -3.682574
6 1261 2.372950077 -0.7022792 -4.951318264 -4.194246 -1.460474 -9.101704 -6.663676 -5.364724 -2.717272 -3.682574
The "sim_vcr*" variables go all the way through sim_vcr100
I need two overlapping density curves contained within one plot, looking something like this (except here you see 5 instead of 2):
I need one of the density curves to consist of all values contained in columns vcr1, vcr2, and vcr3, and I need another density curve containing all values in all of the sim_vcr* columns (so 100 columns, sim_vcr1-sim_vcr100)
Because the two curves overlap, they need to be transparent, like in the attached image. I know that there is a pretty straightforward way to do this using the ggplot command, but I am having trouble with the syntax, as well as getting my data frame oriented correctly so that each histogram pulls from the proper columns.
Any help is much appreciated.
With df being the data you mentioned in your post, you can try this:
Split the data into separate data frames with the following code, then plot:
library(tidyverse)
#Index (base R startsWith(); no extra package needed)
i1 <- which(startsWith(names(df), 'vcr'))
i2 <- which(startsWith(names(df), 'sim'))
#Isolate
df1 <- df[,c(1,i1)]
df2 <- df[,c(1,i2)]
#Melt
M1 <- pivot_longer(df1,cols = names(df1)[-1])
M2 <- pivot_longer(df2,cols = names(df2)[-1])
#Plot 1
ggplot(M1) + geom_density(aes(x=value,fill=name), alpha=.5)
#Plot 2
ggplot(M2) + geom_density(aes(x=value,fill=name), alpha=.5)
Update
Use the following code to get a single plot:
#Unique plot
#Melt
M <- pivot_longer(df,cols = names(df)[-1])
#Mutate
M$var <- ifelse(startsWith(M$name,'vcr'),'vcr','sim_vcr')
#Plot 3
ggplot(M) + geom_density(aes(x=value,fill=var), alpha=.5)
Using the tidyr package, you can first convert your data to long format with the function pivot_longer as follows:
library(tidyverse) # tidyr, dplyr, stringr and ggplot2
library(magrittr) # for the %<>% assignment pipe
df %<>% pivot_longer(cols = c(starts_with('vcr'), starts_with('sim_vcr')),
names_to = c('type'),
values_to = c('values'))
After that, you can use the filter function to create a separate plot for each type of column.
For vcr columns:
df %>%
filter(str_detect(type, '^vcr')) %>%
ggplot(.) +
geom_density(aes(x = values, fill = type), alpha = 0.5)
The above produces the following plot:
For sim_vcr columns:
df %>%
filter(str_detect(type, '^sim_vcr')) %>%
ggplot(.) +
geom_density(aes(x = values, fill = type), alpha = 0.5)
The above code produces the following plot:
Another simple way to subset and prepare your data for ggplot is with gather() from tidyr. Here's how I do it, with df being the data frame you provided.
# Load tidyr to use gather() and ggplot2 for plotting
library(tidyr)
library(ggplot2)
# Select the vcr columns (columns 2 to 4) on their own and gather them
df_vcr <- gather(data = df[,2:4])
# Gather the remaining columns (the sim_vcr* columns) in the data frame
df_sim <- gather(data = df[,-c(1:4)])
#Plot the first
ggplot() +
geom_density(data = df_vcr,
mapping = aes(value, group = key, color = key, fill = key),
alpha = 0.5)
#Plot the second
ggplot() +
geom_density(data = df_sim,
mapping = aes(value, group = key, color = key, fill = key),
alpha = 0.5)
However I am a little unclear on what you mean by "all values in all of the sim_vcr* columns". Perhaps you want all of those values in one density curve? To do this, simply do not give ggplot any grouping info in the second case.
ggplot() + geom_density(data = df_sim,
mapping = aes(value),
fill = "grey50",
alpha = 0.5)
Notice that I can still specify fill for the curve outside of the aes() function; it then applies to all curves instead of giving each group specified in key a different color.
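If the goal is what the question describes, i.e. exactly two pooled curves in one plot, a rough sketch could look like the following. The data frame is simulated here (the column names follow the question, the values and the sd are made up), and pooled and source are names I introduce for illustration.

```r
library(ggplot2)

# Simulated stand-in for the real data: 3 observed and 100 simulated columns
set.seed(1)
n <- 6
obs <- matrix(rnorm(n * 3), nrow = n,
              dimnames = list(NULL, paste0("vcr", 1:3)))
sim <- matrix(rnorm(n * 100, sd = 4), nrow = n,
              dimnames = list(NULL, paste0("sim_vcr", 1:100)))
df <- data.frame(id = seq_len(n), obs, sim)

# Pick the two sets of columns by name prefix
is_obs <- startsWith(names(df), "vcr")
is_sim <- startsWith(names(df), "sim_vcr")

# Pool all values of each set into one long data frame with a source label
pooled <- rbind(
  data.frame(source = "vcr",     value = unlist(df[is_obs], use.names = FALSE)),
  data.frame(source = "sim_vcr", value = unlist(df[is_sim], use.names = FALSE))
)

# One semi-transparent density curve per source
p <- ggplot(pooled, aes(x = value, fill = source)) +
  geom_density(alpha = 0.5)
print(p)
```

Mapping fill to the two-valued source column, rather than to the original column names, is what collapses the 103 columns into two curves.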
How can I create a line graph with ggplot2 where the x variable is categorical (a character or factor), the y variable is numeric and the group variable is categorical? I have tried just + geom_point() with the variables as stated above and it works, but + geom_line() does not.
I have already reviewed posts such as:
Creating line graph using categorical data,
ggplot2 bar plot with two categorical variables, and No line in plot chart despite + geom_line(), but none of them answer my question.
Before I go into code and examples, (1) Yes I absolutely must have the x-variable and group variable as a character or factor, (2) No, I do not want a bar graph or just geom_point().
The example below provides the coefficients of multiple independent variables from three different example regressions run using different variations on the dependent variable. The code below shows a workaround that I figured out (i.e. creating an int variable named 'test' to use in place of the chr variable containing the names of the independent variables from the regression), but I need to preserve the chr names of the independent variables instead.
Here is what I have:
library(dplyr)
library(ggplot2)
library(plotly)
library(tidyr)
var_names <- c("ST1", "ST2", "ST3",
"EFI1", "EFI2", "EFI3", "EFI4",
"EFI5", "EFI6")
####Dataset1####
reg <- c(26441.84, 20516.03, 12936.79, 17793.22, 18837.48, 15704.31, 17611.14, 17360.59, 14836.34)
r_adj <- c(30473.17, 35221.43, 29875.98, 30267.31, 29765.9, 30322.86, 31535.66, 30955.29, 29828.3)
a_adj <- c(19588.63, 31163.79, 22498.53, 27713.72, 25703.89, 28565.34, 29853.22, 29088.25, 25213.02)
df1 <- data.frame(var_names, reg, r_adj, a_adj, stringsAsFactors = FALSE)
df1$test <- c(1:9)
df2 <- gather(df1, key = "series_type", value = "value", c(2:4))
fig7 <- ggplot(df2, aes(x = test, y = value, color = series_type)) + geom_line() + geom_point()
fig7
Ultimately I want something that looks like the plot below, but with the independent variable names in place of the 'test' variable.
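You can convert var_names into a factor and set the levels in the order of appearance (otherwise the levels are sorted alphabetically and the x axis will be out of order). Then map series_type to the group aesthetic in the plot. With a discrete x variable, the default grouping is formed from all discrete variables in the plot, including x, so each group contains a single point and geom_line() draws nothing.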
df2 <- gather(df1, key = "series_type", value = "value", c(2:4)) %>%
mutate(var_names = factor(var_names, levels = unique(var_names)))
ggplot(df2, aes(x = var_names, y = value, color = series_type, group = series_type)) + geom_line() + geom_point()
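To see why the explicit levels matter, here is a tiny base R check using a shortened version of the variable names from the question (the object names default_levels and ordered_levels are just for illustration):

```r
var_names <- c("ST1", "ST2", "ST3", "EFI1", "EFI2")

# Default: levels are sorted, so the EFI variables would come first on the x axis
default_levels <- levels(factor(var_names))

# levels = unique(...) keeps the order in which the names first appear
ordered_levels <- levels(factor(var_names, levels = unique(var_names)))

print(default_levels)
print(ordered_levels)
```

ggplot2 lays out a discrete axis in level order, so the second version is the one that reproduces the original ordering of the regression variables.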
I would like to iterate over a data frame and plot each column against a particular column such as price.
What I have done so far is:
for(i in ncol(dat.train)) {
ggplot(dat.train, aes(dat.train[[,i]],price)) + geom_point()
}
What I want is a first look at my data (approximately 300 columns) by plotting each column against the decision variable (i.e., price).
I know that there is a similar question, though I cannot really understand why the above is not really working.
You can do this by reshaping the data. I have used the mtcars data to plot the other continuous variables against mpg. You have to melt the data into long form (using gather) and then use ggplot to plot these continuous variables (disp, drat, qsec, etc.) against mpg. In your case you would use price instead of mpg and melt all the other continuous variables (here disp, drat, qsec, etc.); the remaining categorical variables can optionally be mapped to shape, colour, and so on.
library(tidyverse)
mtcars %>%
gather(-mpg, -hp, -cyl, key = "var", value = "value") %>%
ggplot(aes(x = value, y = mpg, color = hp, shape = factor(cyl))) +
geom_point() +
facet_wrap(~ var, scales = "free") +
theme_bw()
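As for why the original loop shows nothing, as far as I can tell there are three separate problems: for(i in ncol(dat.train)) runs only once (for the last column index) instead of over seq_len(ncol(dat.train)), dat.train[[,i]] is not valid indexing, and ggplot objects are not drawn inside a loop unless you print() them. A minimal corrected sketch, using mtcars and mpg as stand-ins for dat.train and price (scatter_against is a hypothetical helper name):

```r
library(ggplot2)

dat <- mtcars            # stand-in for dat.train
target <- "mpg"          # stand-in for price
predictors <- setdiff(names(dat), target)

# Hypothetical helper: scatter plot of one column against the target.
# Building the plot inside a function keeps each plot tied to its own column name.
scatter_against <- function(data, xvar, yvar) {
  ggplot(data, aes(x = .data[[xvar]], y = .data[[yvar]])) +
    geom_point() +
    labs(x = xvar, y = yvar)
}

plots <- lapply(predictors, scatter_against, data = dat, yvar = target)
names(plots) <- predictors

# ggplot objects are only drawn when printed explicitly inside a loop
for (p in plots) print(p)
```

With about 300 columns this produces about 300 separate plots, so the faceted version above is probably the easier first look.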
EDIT:
This is another solution in case we need separate graphs for each of the variables.
Create a list of variables like this: lyst <- list("disp","hp"); you can use the colnames function to get all the variable names. Use lapply to loop over the elements of lyst and build one plot per variable.
setwd("path") ### set the working directory here; this is where the PDF file will be saved
lyst <- list("disp","hp")
pdf(file="one.pdf")
# .data[[i]] replaces the deprecated aes_string(); print() draws each plot on its own page
invisible(lapply(lyst, function(i) print(ggplot(mtcars, aes(x=.data[[i]], y=mpg)) + geom_point() + labs(x=i))))
dev.off()
A PDF file containing all the graphs, one per page, will be generated in the working directory you set.
Output from the first solution:
I have this dataframe and this plot :
df <- data.frame(Groupe = rep(c("A","B"),4),
Period = gl(4,2,8,c("t0","t1","t2","t3","t4")),
rate = c(0.83,0.96,0.75,0.93,0.67,0.82,0.65,0.73))
ggplot(data = df, mapping = aes(y = rate, x = Period ,group = Groupe, colour=Groupe, shape=Groupe)) +
geom_line(size=1.2) +
geom_point(size=5)
How could I organize my data so that the points between t1 and t2 are not connected with a line? I'd like t0 and t1 to be connected (blue or red according to the group), and t2 and t3 connected in the same way, but no line between t1 and t2. I tried several things from similar questions, but they always mess up my grouping colors.
Creating a new grouping variable manually is usually not the best way. So, here is a slightly different approach which requires less hardcoding:
# create new grouping variable
df$grp <- c(1,2)[df$Period %in% c("t2","t3","t4") + 1L]
# create the plot and use the interaction between 'Groupe' and 'grp' as group
ggplot(df, aes(x = Period, y = rate,
group = interaction(Groupe,grp),
colour = Groupe,
shape = Groupe)) +
geom_line(size=1.2) +
geom_point(size=5)
This gives the same plot as in the other answer:
The best way to handle a problem like this in ggplot is often to create an additional column in your data frame that indicates the grouping you want to work with in your data. For example, here I've added an extra column gp to your data frame:
df$gp <- c(1,2,1,2,3,4,3,4)
ggplot(data = df, aes(y = rate, x = Period, group = gp, colour=Groupe, shape=Groupe)) +
geom_line(size=1.2) +
geom_point(size=5)
The result is, I believe, what you are looking for:
If you make Period a numerical column rather than a character vector or factor, you can more easily generate a column like gp automatically rather than manually specifying it (perhaps using ifelse or cases to create it) - this would be useful if you wanted to do the same thing many times or with a large data frame.
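A rough sketch of that automatic version, under the assumption that the break should fall between t1 and t2. The names period_num and block are introduced here for illustration, and I use only the four period labels that actually occur in the data.

```r
library(ggplot2)

df <- data.frame(Groupe = rep(c("A", "B"), 4),
                 Period = gl(4, 2, 8, c("t0", "t1", "t2", "t3")),
                 rate   = c(0.83, 0.96, 0.75, 0.93, 0.67, 0.82, 0.65, 0.73))

# Numeric version of Period: t0 -> 0, t1 -> 1, t2 -> 2, t3 -> 3
period_num <- as.integer(df$Period) - 1L

# Label the two blocks of periods that should each be connected
df$block <- ifelse(period_num <= 1, "t0-t1", "t2-t3")

# One line per combination of Groupe and block
df$gp <- interaction(df$Groupe, df$block)

p <- ggplot(df, aes(x = Period, y = rate, group = gp,
                    colour = Groupe, shape = Groupe)) +
  geom_line() +
  geom_point(size = 5)
print(p)
```

Because the colour and shape are still mapped to Groupe, the extra grouping only controls which points are joined and leaves the legend unchanged.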