So I have two sets of data (of different length) that I am trying to group up and display the density plots for:
dat <- data.frame(dens = c(nEXP,nCNT),lines = rep(c("Exp","Cont")))
ggplot(dat, aes(x = dens, group=lines, fill = lines)) + geom_density(alpha = .5)
When I run the code, it throws an error about the different lengths, i.e.
"arguments imply different num of rows: x, y"
I then augment the code to:
dat <- data.frame(dens = c(nEXP,nCNT),lines = rep(c("Exp","Cont"),X))
Where X is the length of the longer argument so the lengths of "lines" will match that of dens.
Now the issue is that when I plot the data, I only get ONE density plot. I know there should be two, since plotting the densities with plot/lines clearly shows two unequal, overlapping distributions, so I assume the error is in the grouping.
Hope that makes sense.
I am not sure why, but I simply had to do what rep() does manually:
A<-data.frame(ExpN, key = "exp")
B<-data.frame(ConN,key = "con")
colnames(A) <- c("a","key")
colnames(B) <- c("a","key")
dat <- rbind(A,B)
ggplot(dat, aes(x = a, fill = key)) + geom_density(alpha = .5)
You need to tell rep how many times to repeat each element so that the labels line up with the values:
dat <- data.frame(dens = c(nEXP,nCNT),
lines = rep(c("Exp","Cont"), c(length(nEXP),length(nCNT))))
That should give you a dat you can use with your ggplot call.
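To see why the per-element counts matter, here is a minimal sketch with made-up vectors standing in for nEXP and nCNT (the values and lengths below are assumptions for illustration, not the asker's data):

```r
# Hypothetical stand-ins for the two samples of unequal length
set.seed(1)
nEXP <- rnorm(5, mean = 0)
nCNT <- rnorm(8, mean = 2)

# rep(c("Exp", "Cont"), 8) would alternate the labels (Exp, Cont, Exp, ...)
# and produce 16 of them, so they would not line up with the 13 values.
# Passing a vector of counts repeats each label its own number of times.
lines <- rep(c("Exp", "Cont"), times = c(length(nEXP), length(nCNT)))

dat <- data.frame(dens = c(nEXP, nCNT), lines = lines)

# Each group now has exactly as many rows as its source vector
table(dat$lines)
```

With the labels aligned this way, the grouping in the ggplot call has two distinct, correctly sized groups to work with.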
I am, in R and using ggplot2, plotting the development over time of several variables for several groups in my sample (days of the week, to be precise). An artificial sample (using long data suitable for plotting) is this:
library(tidyverse)
groups1 <- rep(1:2, each = 7 * 100)
groups2 <- rep(rep(1:7, times = 2), each = 100)
x <- rep(1:100, times = 14)
values <- c(rnorm(n = 700), rgamma(n = 700, shape = 2))
data <- tibble(x, groups1, groups2, values)
data %>% ggplot(mapping = aes(x = x, y = values)) + geom_line() + facet_grid(groups2 ~ groups1)
which gives
In this example, the first variable -- shown in the left column -- has unlimited range, while the second variable -- shown in the right column -- is weakly positive.
I would like to reflect this in my plot by allowing the Y axes to differ across the columns in this plot, i.e. set Y axis limits separately for the two variables plotted. However, in order to allow for easy visual comparison of the different groups for each of the two variables, I would also like to have the identical Y axes within each column.
I've looked at the scales option to facet_grid(), but it does not seem to be able to do what I want. Specifically,
passing scales = "free_y" allows the Y axes to vary across rows, while
passing scales = "free_x" allows the X axes to vary across columns, but
there is no option to allow the Y axes to vary across columns (nor, presumably, the X axes across rows).
As usual, my attempts to find a solution have yielded nothing. Thank you very much for your help!
I think the easiest would be to create a plot per facet column and bind them with something like {patchwork}. To get the facet look, you can still add a faceting layer.
library(tidyverse)
library(patchwork)
groups1 <- rep(1:2, each = 7 * 100)
groups2 <- rep(rep(1:7, times = 2), each = 100)
x <- rep(1:100, times = 14)
set.seed(42) ## always better to set a seed before using random functions
values <- c(rnorm(n = 700), rgamma(n = 700, shape = 2))
data <- tibble(x, groups1, groups2, values)
data %>%
group_split(groups1) %>%
map({
~ggplot(.x, aes(x = x, y = values)) +
geom_line() +
facet_grid(groups2 ~ groups1)
}) %>%
wrap_plots()
Created on 2023-01-11 with reprex v2.0.2
I have been working with R recently and I have encountered this issue when trying to apply a moving average function onto a data set.
library(dplyr)
library(ggplot2)
library(tidyverse)
library(zoo)
library(bit)
#grabs tab delimited file
Mouse_mm9_rDNA_file <- read.delim("mm9_rDNA_mapping_HOXA9-ER-CEBPA-degron_POLR1A_06122021-Dm3-Q25-norm_to_Input.txt")
#Averages two specific columns from the original file with 8 different columns
Column_position <- Mouse_mm9_rDNA_file[,c("position")]
Columns_5_and_6_mean_0_hrs <- rowMeans(Mouse_mm9_rDNA_file[,c("X5", "X6")])
Columns_7_and_8_mean_4_hrs <- rowMeans(Mouse_mm9_rDNA_file[,c("X7", "X8")])
Columns_9_and_10_mean_8_hrs <- rowMeans(Mouse_mm9_rDNA_file[,c("X9", "X10")])
Columns_11_and_12_means_10_hrs <- rowMeans(Mouse_mm9_rDNA_file[,c("X11", "X12")])
#Puts those averaged columns into rows and then flips the columns and rows
all_Columns_averaged <- rbind(Column_position,
Columns_5_and_6_mean_0_hrs,
Columns_7_and_8_mean_4_hrs,
Columns_9_and_10_mean_8_hrs,
Columns_11_and_12_means_10_hrs)
all_Columns <- t(all_Columns_averaged)
#Turns my dataset into a data frame
all_Columns_dataframe <- setattr(all_Columns, "class", c("tbl", "tbl_df", "data.frame"))
#runs the moving average function on my data set
all_Columns_dataframe <- all_Columns_dataframe %>%
mutate(Averages_03 = rollmean(Columns_5_and_6_mean_0_hrs, k = 5, ))
#Creates a line plot with multiple y-values
p <- all_Columns_dataframe %>%
ggplot(aes(x = Column_position)) +
labs(x = "position", y = "hours", color = "legend") + xlab("position") + ylab("hours")
p +
geom_line(data = all_Columns, aes(x = Column_position, y = Columns_5_and_6_mean_0_hrs), color = "black") +
geom_line(data = all_Columns, aes(x = Column_position, y = Columns_7_and_8_mean_4_hrs), color = "green") +
geom_line(data = all_Columns, aes(x = Column_position, y = Columns_9_and_10_mean_8_hrs), color = "darkslategray1") +
geom_line(data = all_Columns, aes(x = Column_position, y = Columns_11_and_12_means_10_hrs), color = "maroon1")
I am trying to visualize the all_Columns_dataframe data after it has been smoothed by the rollmean function, but when I run this code I get the error:
Error in UseMethod("mutate") :
no applicable method for 'mutate' applied to an object of class "NULL"
At first I thought it may have been because I had NULL values in my data so I added 1 to all values in all_Columns, but the same error persisted. If I take away the
all_Columns_dataframe <- all_Columns_dataframe %>%
mutate(Averages_03 = rollmean(Columns_5_and_6_mean_0_hrs, k = 5, ))
section of my code then everything runs smoothly and I get a nice looking graph with the correct values and everything. I guess my question would be how can I get rollmean to work or what would be the most effective way to run a moving average on my data so I can smooth it out?
It's hard to know without seeing a sample of your data (you can share one with dput(head(df))). But I would first just try specifying the package for mutate:
dplyr::mutate()
This issue sometimes happens when R picks up mutate from another package.
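The error text also says that the object being piped into mutate is NULL, so it may be worth checking what all_Columns_dataframe holds right after the setattr() line. Below is a minimal sketch of one way around this, using made-up numbers in place of the mapping file: it builds the data frame with data.frame() directly and computes a centered 5-point moving average with base R's stats::filter(), which is swapped in here for rollmean so the sketch needs no extra packages. The object and column names are hypothetical.

```r
# Hypothetical stand-in for the averaged columns (not the real mapping data)
smooth_demo <- data.frame(
  position   = 1:10,
  mean_0_hrs = c(2, 4, 6, 8, 10, 12, 14, 16, 18, 20)
)

# Centered moving average over a window of k = 5 points.
# stats::filter() returns a series of the same length as the input,
# padded with NA where the window does not fit, so it can be stored
# as a new column without a length mismatch.
k <- 5
smooth_demo$Averages_03 <- as.numeric(
  stats::filter(smooth_demo$mean_0_hrs, rep(1 / k, k), sides = 2)
)

print(smooth_demo)
```

If you prefer to stay with zoo, rollmean(x, k = 5, fill = NA) pads the result in a similar way; without fill, the result is shorter than the input and will not fit as a column of the same data frame.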
I wanted to plot multiple lines in one graph, but I couldn't figure out which code to use. Also, is there a way I could assign colors to each of the lines? I'm new to RStudio and was assigned to pick up someone's work, so I've been doing a lot of trial and error, but I haven't had any luck for the past few days. I hope someone can help me with this! Thank you so much.
ecdf.shift <- function(OUR_threshold, des_cap = 40, nint = 10000){
#create some empty vectors for later use in the loop
ecdf_med = c()
ecdf_obs = c()
for (i in 1:length(OUR_threshold)){
# filter out the OUR threshold data, then select only the capture column and create a ecdf function
ecdf_fun <- HRP_rESS_no %>%
filter(ESS > OUR_threshold[i]) %>%
.$TSS_con %>%
ecdf()
# extract the ecdf data and put in tibble dataframe, then create a linear interpolation of the curve.
ecdf_data <- tibble(TSS_con = environment(ecdf_fun)$x, prob = environment(ecdf_fun)$y)
ecdf_interpol <- approx(x = ecdf_data$TSS_con, y = ecdf_data$prob, n = nint)
# find the indices in x which correspond to the desired capture, then look up the matching probabilities in the y vector. Take the median in case of multiple hits, and store it at position i as dictated by the loop counter.
ecdf_med[i] <- median(ecdf_interpol$y[(round(ecdf_interpol$x,1) == des_cap)])
# calculate the number of observations when the filtering takes place.
ecdf_obs[i] <- HRP_rESS_no %>%
filter(ESS > OUR_threshold[i]) %>%
.$TSS_con %>%
length()
# Flush the ecdf data. The ecdf is encoded as a function with global parameters, so you want to reset them every time the loop is done to avoid pesky bugs.
rm(ecdf_data)
}
#create a tibble dataframe with all the loop data.
ecdf_out <- tibble(OUR_ratio_cutoff = OUR_threshold, prob = (ecdf_med)*100, nobs = ecdf_obs)
return(ecdf_out)
}
ratio_threshold <- seq(0,115, by = 5)
t = ecdf_MLSS_target <- 400 %>%
ecdf.shift(ratio_threshold, .) %>%
filter(nobs > 2) %>%
ggplot(aes( x = OUR_ratio_cutoff, y = prob)) +
geom_line() +
geom_point() +
theme_bw(base_size = 12) +
theme(panel.grid = element_blank()) +
scale_y_continuous(limits = c(0,100),
breaks = seq(0,300, by = 5),
expand = c(0,0)) +
scale_x_continuous(limits = c(0,120),
breaks = seq(0,110, by = 10),
expand = c(0,0)) +
labs(x = "ESS mg TSS/L",
y = "Probability of contactor MLSS > 400 mg TSS/L ")
plot(t)
Easiest would be to loop over your different t values first and bring the resulting data frames into one big data frame, and use this for your plot. Your code is not fully reproducible (it requires data that we do not have, i.e. HRP_rESS_no). So I have stripped down the function to the core - creating a data frame which makes different "lines" depending on your t value. I just used it as slope.
I hope the idea is clear.
library(tidyverse)
ecdf.shift <- function(OUR_threshold, t) {
data.frame(x = OUR_threshold, y = t * OUR_threshold)
}
ratio_threshold <- seq(0, 115, by = 5)
t_df <-
map(1:5, function(t) ecdf.shift(ratio_threshold, t)) %>%
bind_rows(.id = "t")
ggplot(t_df, aes(x, y, color = t)) +
geom_line() +
geom_point()
Created on 2020-05-07 by the reprex package (v0.3.0)
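On the second part of the question (assigning a color to each line), one option is scale_color_manual(). The sketch below uses the same stripped-down ecdf.shift() stand-in; the particular t values and color names are arbitrary choices for illustration, and the data are artificial slopes rather than the asker's HRP_rESS_no data.

```r
library(ggplot2)

# Same stripped-down stand-in as in the answer: one straight line per t
ecdf.shift <- function(OUR_threshold, t) {
  data.frame(x = OUR_threshold, y = t * OUR_threshold)
}

ratio_threshold <- seq(0, 115, by = 5)

# Build one long data frame, tagging each row with the t it came from
t_values <- c(1, 2, 3)
t_df <- do.call(rbind, lapply(t_values, function(t) {
  cbind(ecdf.shift(ratio_threshold, t), t = as.character(t))
}))

# Named vector: the names must match the values in the t column
line_colors <- c("1" = "black", "2" = "firebrick", "3" = "steelblue")

p <- ggplot(t_df, aes(x, y, color = t)) +
  geom_line() +
  geom_point() +
  scale_color_manual(values = line_colors)

print(p)
```

Because color is mapped to a column inside aes(), the legend is generated automatically, and the named vector makes it explicit which line gets which color.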
Before asking, I have read this post, but mine is more specific.
library(ggplot2)
library(scales)
set.seed(1)
dat <- data.frame(x = rnorm(1000), y = rnorm(1000))
I replace my real data with dat. With this random seed, x and y fall within [-4, 4], and I partition the area into 256 (16*16) cells, each 0.5 wide. For each cell, I want to get the count.
Yeah, it's quite easy, geom_bin2d can solve it.
# plot
p <- ggplot(dat, aes(x = x, y = y)) + geom_bin2d()
# Get data - this includes counts and x,y coordinates
newdat <- ggplot_build(p)$data[[1]]
# add in text labels
p + geom_text(data=newdat, aes((xmin + xmax)/2, (ymin + ymax)/2,
label=count), col="white")
So far so good, but I only want to keep the top 100 counts and plot those, as in the picture below.
According to ?geom_bin2d, drop = TRUE only removes cells with 0 counts, whereas I care about the top 100 counts. What should I do? This is question 1.
Please take another look at the legend of the second picture: the counts there are small and close together. What if they were 10,000, 20,000, 30,000?
The usual method is the trans argument of scale_fill_gradient; the built-in transformations are exp, log, sqrt, and so on, but I want to divide by 1,000. I then found trans_new() in the scales package and tried it, without success.
sci_trans <- function(){ trans_new('sci', function(x) x/1000, function(x) x*1000)}
p + scale_fill_gradient(trans='sci')
And this is question 2. I have googled a lot but cannot find a way to solve it. Thanks a lot to anyone who can help!
Apparently you can't get the output bins or counts from stat_bin2d or stat_summary_2d; according to a related question, "How to use stat_bin2d() to compute counts labels in ggplot2?", where @MrFlick's comment quotes Hadley from 2010: "he basically says you can't use stat_bin2d, you'll have to do the summarization yourself".
So, the workaround: create the coordinate bins manually yourself, get the 2D counts, then take top-n. For example, using dplyr:
dat %>% mutate(x_binned=some_fn(x), y_binned=some_fn(y)) %>% # some_fn is a placeholder for your binning function
group_by(x_binned,y_binned) %>%
summarize(count = n()) %>% # NOTE: no need to sort() or order()
ungroup() %>% # otherwise top_n() works within each x_binned group
top_n(100, count)
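A minimal base-R sketch of that workaround, assuming the 0.5-wide cells from the question and using floor() as the binning rule (an assumption: it will not necessarily reproduce the exact breaks geom_bin2d picks). The names bin_width, cells and top_cells are made up for illustration.

```r
set.seed(1)
dat <- data.frame(x = rnorm(1000), y = rnorm(1000))

bin_width <- 0.5

# Assign each point to the lower-left corner of its 0.5 x 0.5 cell
dat$x_binned <- floor(dat$x / bin_width) * bin_width
dat$y_binned <- floor(dat$y / bin_width) * bin_width

# Count the points in each occupied cell
cells <- aggregate(count ~ x_binned + y_binned,
                   data = transform(dat, count = 1),
                   FUN = sum)

# Keep the 100 cells with the highest counts (ties at the cut-off are
# broken by row order here, unlike top_n(), which keeps all ties)
top_cells <- head(cells[order(-cells$count), ], 100)

head(top_cells)
```

top_cells can then be passed as the data argument of a layer such as geom_tile() or geom_text(), with x_binned + bin_width / 2 and y_binned + bin_width / 2 as the cell centres.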
You might have to poke into stat_bin2d in order to copy (or call) their exact coordinate-binning code. UPDATE: here's the source for stat-bin2d.r
StatBin2d <- ggproto("StatBin2d", Stat,
default_aes = aes(fill = ..count..),
required_aes = c("x", "y"),
compute_group = function(data, scales, binwidth = NULL, bins = 30,
breaks = NULL, origin = NULL, drop = TRUE) {
origin <- dual_param(origin, list(NULL, NULL))
binwidth <- dual_param(binwidth, list(NULL, NULL))
breaks <- dual_param(breaks, list(NULL, NULL))
bins <- dual_param(bins, list(x = 30, y = 30))
xbreaks <- bin2d_breaks(scales$x, breaks$x, origin$x, binwidth$x, bins$x)
ybreaks <- bin2d_breaks(scales$y, breaks$y, origin$y, binwidth$y, bins$y)
xbin <- cut(data$x, xbreaks, include.lowest = TRUE, labels = FALSE)
ybin <- cut(data$y, ybreaks, include.lowest = TRUE, labels = FALSE)
...
}
bin2d_breaks <- function(scale, breaks = NULL, origin = NULL, binwidth = NULL,
bins = 30, right = TRUE) {
...
(But this seems like a worthwhile enhancement request for ggplot2, if it hasn't already been filed.)
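For question 2 (showing large counts in thousands on the legend), one option that avoids a custom transformation is to leave the scale untransformed and only rescale the labels through the labels argument of scale_fill_gradient(). This is a sketch of an alternative, not a fix for the trans_new() attempt; the helper name per_thousand and the legend title are made up, and a larger artificial sample is used so that the cell counts actually reach the thousands.

```r
library(ggplot2)

set.seed(1)
# Larger artificial sample so that cell counts reach the thousands
big <- data.frame(x = rnorm(5e5), y = rnorm(5e5))

# Hypothetical helper: show raw counts as "thousands" on the legend only
per_thousand <- function(x) x / 1000

p <- ggplot(big, aes(x = x, y = y)) +
  geom_bin2d(binwidth = 0.5) +
  scale_fill_gradient(name = "count (thousands)", labels = per_thousand)

print(p)
```

The fill mapping itself is unchanged; only the text printed next to the legend breaks is divided by 1,000.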
I've been able to successfully create a dot plot in ggplot for percentages across gender. But I want to highlight the significant differences. I thought I could do this with a combination of subsetting and the use of last_plot().
Here’s my data:
require(ggplot2)
require(reshape2)
prog <- c("Honors", "Academic", "Social", "Media")
m <- c(30,35,40,23)
f <- c(25,40,45,15)
s <- c(0.7, 0.4, 0.1, 0.03)
temp <- as.data.frame(cbind(prog, m, f, s), stringsAsFactors=FALSE)
first <- temp[,1:3]
first.melt <- melt(first, id.vars = 'prog', variable.name = 'Gender', value.name = 'Percent')
first.melt <- as.data.frame(cbind(first.melt,temp[,4]), , stringsAsFactors=FALSE)
names(first.melt) <- c("program", "Gender", "Percent", "sig")
first.melt$program <- as.factor(first.melt$program)
Here's where I reverse the order of my Program variable, so that when graphed it will be alphabetical from top to bottom.
first.melt[,1] = with(first.melt, factor(first.melt[,1], levels = rev(levels(first.melt[,1]))))
first.melt$sig <- as.numeric(as.character(first.melt$sig))
first.melt$Percent <- as.numeric(as.character(first.melt$Percent))
Now, I subset...
first.melt.ns <- subset(first.melt,sig > 0.05)
first.melt.sig <- subset(first.melt,sig <= 0.05)
ggplot(first.melt.ns, aes(program, y=Percent, shape=Gender)) +
geom_point(size=3) +
coord_flip() +
scale_shape_manual(values=c("m"=1, "f"=5))
The first run at ggplot gets me my non-significant Program pairs, and it's in the right order, so I add the two new points for male and female (making them solid, to draw attention to them as a significant pair):
last_plot() +
geom_point(data=first.melt.sig, aes(program[Gender=="m"], y=Percent[Gender=="m"]), size=3, shape=19) +
geom_point(data=first.melt.sig, aes(program[Gender=="f"], y=Percent[Gender=="f"]),size=4, shape=18)
The points get added just fine – ggplot works. But notice my Program axis – it’s correct, but reversed now.
First, you really should avoid as.data.frame(cbind(...)). It is dramatically increasing the amount of work necessary to prepare your data. The function for creating data frames is (naturally) data.frame. Use it!
What you're doing here is basically trying to get around the limitation of only having one shape scale. It's probably easiest to just do this:
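temp <- data.frame(prog,m,f,s, stringsAsFactors = TRUE) # prog must be a factor for the levels() call below (not the default since R 4.0)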
first <- temp[,1:3]
first.melt <- melt(first, id.vars = 'prog', variable.name = 'Gender', value.name = 'Percent')
first.melt$sig <- rep(temp$s,times = 2)
first.melt[,1] = with(first.melt, factor(first.melt[,1], levels = rev(levels(first.melt[,1]))))
first.melt.sig <- subset(first.melt,sig < 0.05)
first.melt$Percent[first.melt$sig < 0.05] <- NA
ggplot() +
geom_point(data = first.melt,aes(x = prog,y = Percent,shape = Gender),size = 3) +
geom_point(data = first.melt.sig[1,],aes(x = prog,y = Percent),shape = 19) +
geom_point(data = first.melt.sig[2,],aes(x = prog,y = Percent),shape = 18) +
coord_flip() +
scale_shape_manual(values=c("m"=1, "f"=5))
In general, work to structure your ggplot code so that you're subsetting data frames, not variables inside of aes. That gets both tricky and dangerous, because ggplot is assuming certain things about what you pass inside of aes in order for the evaluation to work properly.
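As a small illustration of working at the data-frame level, the sketch below splits a toy version of the data with subset() before any plotting. One useful side effect, checked at the end, is that subset() keeps the full set of factor levels on each piece. The values mirror the question's example; the object names are made up.

```r
# Toy long-format data mirroring the question's example
scores <- data.frame(
  prog    = factor(rep(c("Honors", "Academic", "Social", "Media"), times = 2)),
  Gender  = rep(c("m", "f"), each = 4),
  Percent = c(30, 35, 40, 23, 25, 40, 45, 15),
  sig     = rep(c(0.7, 0.4, 0.1, 0.03), times = 2)
)

# Reverse the level order so the flipped axis reads alphabetically top-down
scores$prog <- factor(scores$prog, levels = rev(levels(scores$prog)))

# Subset whole data frames; each piece can be handed to its own layer
# through the data argument, with plain column names inside aes()
scores_ns  <- subset(scores, sig > 0.05)
scores_sig <- subset(scores, sig <= 0.05)

# Both pieces still carry all four levels in the same (reversed) order
levels(scores_sig$prog)
```

Each layer can then take its own subset via data = ..., so nothing inside aes() needs to be indexed.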