I'm working with a really big data set containing one dummy variable and a factor variable with 14 levels, a sample of which I have posted here. I'm trying to make a stacked proportional bar graph using the following code:
ggplot(data,aes(factor(data$factor),fill=data$dummy))+
geom_bar(position="fill")+
ylab("Proportion")+
theme(axis.title.y=element_text(angle=0))
It works great and it's almost the plot I need. I just want to add small text labels reporting the number of observations of each factor level. My intuition tells me that something like this should work:
Labels<-c("n=1853" , "n=392", "n=181" , "n=80", "n=69", "n=32" , "n=10", "n=6", "n=4", "n=5", "n=3", "n=3", "n=2", "n=1" )
ggplot(data,aes(factor(data$factor),fill=data$dummy))+
geom_bar(position="fill")+
geom_text(aes(label=Labels,y=.5))+
ylab("Proportion")+
theme(axis.title.y=element_text(angle=0))
But it spits out a blank graph and the error:
Aesthetics must either be length one, or the same length as the dataProblems:Labels
This really doesn't make sense to me, because I know for a fact that the number of factor levels is the same as the number of labels I muscled in. I've been trying to figure out how I can get it to just print what I need without creating a vector of values for the number of observations like this example, but no matter what I try I always get the same Aesthetics error.
How about this:
library(dplyr)
# Create a separate data frame of counts for the count labels
counts = data %>% group_by(factor) %>%
summarise(n=n()) %>%
mutate(dummy=NA)
counts$factor = factor(counts$factor, levels=0:10)
ggplot(data, aes(factor(factor), fill=factor(dummy))) +
geom_bar(position="fill") +
geom_text(data=counts, aes(label=n, x=factor, y=-0.03), size=4) +
ylab("Proportion")+
theme(axis.title.y=element_text(angle=0))
Your method is the right idea, but Labels needs to be a column in a data frame rather than a free-standing vector. Pass that data frame to geom_text through the data argument; the label argument inside aes then tells geom_text which column to use for the labels. Also, even though geom_text doesn't use the dummy column, it has to be in the data frame or you'll get an error, because the layer inherits the fill=factor(dummy) mapping from the top-level aes().
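As a variation on the same idea, here is a minimal, self-contained sketch. The data frame toy and its columns grp and flag are made up for illustration and are not the asker's data. It sets inherit.aes = FALSE on the label layer, which is an alternative to adding a placeholder dummy column: the layer then uses only the mappings given in its own aes().

```r
library(ggplot2)
library(dplyr)

# Hypothetical toy data standing in for the real data set
set.seed(1)
toy <- data.frame(
  grp  = factor(sample(0:3, 200, replace = TRUE)),
  flag = factor(sample(0:1, 200, replace = TRUE))
)

# One row per factor level, with the label text already built
counts <- toy %>%
  count(grp) %>%
  mutate(label = paste0("n=", n))

p <- ggplot(toy, aes(x = grp, fill = flag)) +
  geom_bar(position = "fill") +
  # inherit.aes = FALSE: this layer ignores the top-level fill mapping,
  # so `counts` does not need a `flag` column
  geom_text(data = counts, aes(x = grp, y = -0.03, label = label),
            inherit.aes = FALSE, size = 4) +
  ylab("Proportion") +
  theme(axis.title.y = element_text(angle = 0))

print(p)
```

Either approach should work; the placeholder column keeps the call shorter, while inherit.aes = FALSE makes it explicit that the label layer has its own mappings.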
Related
I'm trying to plot the (first) difference of a time series with ggplot.
As the difference (by definition) contains one less element than the data, I (predictably) get the error message: "Error: Aesthetics must be either length 1 or the same as the data".
I solved this by defining my y aesthetic as c(NA, diff(data)) instead of just diff(data), which works.
However, this feels like a clumsy workaround, and it only works up to a point: it gets me into trouble when, for instance, I try to facet several plots. (You also need to keep adding NAs if a higher order of difference, or a longer lag, is needed.)
Does anyone know of a more robust solution?
The ultimate problem is this:
What I want (this was made using patchwork::)
What I get using faceting (NB: if I put the NA at the end, it's the third chart which becomes correct)
Adding NA to the differenced vector is not clumsy in itself, but doing it inside the ggplot aesthetic is. Compare the following two snippets:
ggplot(data = data.long, aes(x = date, y = c(NA, count %>% diff()))) +
geom_point()
data.long %>%
mutate(diff_count = c(NA, diff(count))) %>%
ggplot(aes(x = date, y = diff_count)) +
geom_point()
Both give the same graph, but the second is the preferred method: the data used for plotting (the differenced count) is calculated before being sent to ggplot, which makes the code easier to read and modify. In other words, do the data management first, then visualise the data. As you say, the first method can get you into trouble later, for example with more complicated graphing such as faceting.
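For the faceting case, one possible extension of the second snippet is to difference within each group before plotting. This is only a sketch under assumptions: the grouping column series and the sample values are hypothetical, and it assumes each facet shows one series. It uses dplyr's lag(), which pads the leading NA itself.

```r
library(ggplot2)
library(dplyr)

# Hypothetical long-format data: three series observed on the same dates
data.long <- data.frame(
  date   = rep(seq(as.Date("2020-01-01"), by = "day", length.out = 5), times = 3),
  series = rep(c("a", "b", "c"), each = 5),
  count  = c(1, 3, 6, 10, 15,  2, 4, 8, 16, 32,  5, 5, 4, 2, 0)
)

diffed <- data.long %>%
  group_by(series) %>%
  arrange(date, .by_group = TRUE) %>%
  # count - lag(count) is the first difference; the first row of each
  # group becomes NA, so every group keeps its original length
  mutate(diff_count = count - lag(count)) %>%
  ungroup()

p <- ggplot(diffed, aes(x = date, y = diff_count)) +
  geom_point(na.rm = TRUE) +   # na.rm = TRUE drops the NA rows without a warning
  facet_wrap(~ series, ncol = 1)

print(p)
```

For a difference at lag k, count - lag(count, k) pads k leading NAs per group, so no manual c(NA, ...) bookkeeping is needed.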
I am trying to plot three data sets that share an x axis. Some of the data sets, however, have missing data and are thus of different lengths. I can plot them fine individually, but when I try to facet them all together I get an error saying that the data sets contain different numbers of rows. This error only occurs when I facet the plot (which is necessary).
Any suggestions for how I could get the facet plot to accept data sets with different numbers of rows?
The code I've been using is:
ggplot()+
geom_line(data=x,aes(x=x$BIN_START,y=x$TajimaD),size=0.6,alpha=0.65,colour="skyblue1")+
geom_line(data=y,aes(x=y$BIN_START,y=y$TajimaD),size=0.3,alpha=0.85,colour="greenyellow")+
geom_line(data=z,aes(x=z$BIN_START,y=z$TajimaD),size=0.25,alpha=0.95,colour="black")+
scale_x_continuous()+
facet_grid(rows=vars(x$CHROM))+
theme_classic()+
ylab("TajimaD") +
xlab("Location (bp)")
As was suggested in a comment I have now moved all the data into a single file and added a column to indicate the population the data is from. I am still getting a similar error message: "replacement has 22588 rows, data has 7537"
ggplot()+
geom_line(data=x,aes(x=a$BIN_START,y=a$TajimaD,color=a$Population),size=0.6,alpha=0.65)+
scale_x_continuous()+
facet_grid(rows=vars(a$CHROM))+
theme_classic()+
ylab("TajimaD") +
xlab("Location (bp)")
On your second attempt you're using x as data but then refer to a$BIN_START, etc. It's very likely that x and a have a different number of rows, hence the error. I suggest removing the <dataset_name>$ prefix altogether in all your aes() calls (and inside vars()) when you use ggplot2. When you say data = a, you only need to write aes(x=BIN_START,y=TajimaD,color=Population) (i.e. no need for a$).
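Put together, the corrected call could look like the sketch below. The data frame is a hypothetical stand-in that reuses the column names from the question; one row is dropped on purpose so that the populations have different lengths.

```r
library(ggplot2)

# Hypothetical combined data frame; column names follow the question
a <- data.frame(
  CHROM      = rep(c("chr1", "chr2"), each = 6),
  BIN_START  = rep(c(1, 10001, 20001), times = 4),
  TajimaD    = c(0.1, -0.4, 0.3, 0.2, 0.0, -0.1, 0.5, 0.6, -0.2, -0.3, 0.1, 0.4),
  Population = rep(rep(c("pop1", "pop2"), each = 3), times = 2)
)

# Simulate missing data: drop one row so the groups differ in length
a <- a[-2, ]

# Bare column names everywhere; ggplot2 looks them up in `a`
p <- ggplot(a, aes(x = BIN_START, y = TajimaD, colour = Population)) +
  # linewidth replaces size for lines in ggplot2 3.4.0 and later
  geom_line(linewidth = 0.6, alpha = 0.65) +
  facet_grid(rows = vars(CHROM)) +
  theme_classic() +
  ylab("TajimaD") +
  xlab("Location (bp)")

print(p)
```

Because every mapping is resolved against the same data frame, groups with different numbers of rows are not a problem.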
I'm having some trouble with qplot in R. I am trying to plot data from a data frame. When I execute the command below the plot gets bunched up on the left side (see the image below). The data frame only has 963 rows so I don't think size is the issue, but I can use the same command on a smaller data frame and it looks fine. Any ideas?
library(ggplot2)
qplot(x=variable,
y=value,
data=data,
color=Classification,
main="Average MapQ Scores")
Or similarly:
ggplot(data = data, aes(x = variable, y = value, color = Classification)) +
geom_point()
Your column value is likely a factor, when it should be a numeric. This causes each categorical value of value to be given its own entry on the y-axis, thus producing the effect you've noticed.
You should coerce it to numeric:
data$value <- as.numeric(as.character(data$value))
Note that there is probably a good reason it has been interpreted as a factor and not a numeric, possibly because it has some entries that are not pure numeric values (maybe 1,000 or 1000 m or some other character entry among the numbers). The consequence of the coercion may be a loss of information, so be warned or cleanse the data thoroughly.
Also, you appear to have the same problem on the x-axis.
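Before coercing, it can help to see which entries would be lost. The following is a minimal sketch on a made-up factor; the vector value and its contents are hypothetical, not taken from the asker's data.

```r
# Hypothetical factor column with two entries that are not plain numbers
value <- factor(c("12.5", "7", "1,000", "3.2", "1000 m"))

# Convert via character: as.numeric() on a factor directly would return
# the internal level codes rather than the printed values
converted <- suppressWarnings(as.numeric(as.character(value)))

# Entries that were present but could not be parsed as numbers
bad <- is.na(converted) & !is.na(value)

# Shows the two offending entries, "1,000" and "1000 m"
print(as.character(value[bad]))
```

Once the offending entries are known, they can be cleaned (for example by stripping thousands separators or units) before the conversion is applied to the whole column.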
I have an R dataframe that contains a string variable and a numerical variable, and I would like to plot the top 10 strings, based on the value of the numerical variable.
I can of course get the top 10 entries pretty simply:
top10_rank <- rank[order(rank$numerical_var_name),]
My first approach to visualizing this was simply to plot it like this:
ggplot(data=top10_rank, aes(x = top10_rank$numerical_var_name, y = top10_rank$string_name)) + geom_point(size=3)
And to a first approximation this "works" - the problem is that the strings on the y axis are sorted alphabetically rather than by the numerical value.
My preference would be to find a way to plot the top 10 strings without having to bother showing the numerical variable at all - just basically as a list (even better would be if I could enumerate the list). I am attempting to plot this so it looks more pleasing than simply dumping the text to the screen.
Any ideas greatly appreciated!
The y-axis tick marks may be sorted alphabetically, but the points are drawn in the order (from left to right) of the top10_rank data frame. What you need to do is change the order of the y-axis. Add scale_y_discrete(limits=top10_rank$String) to your ggplot call and it should work:
ggplot(data=top10_rank, aes(x = Number, y = String)) +
geom_point(size=3) +
scale_y_discrete(limits=top10_rank$String)
Here is a link to a great resource on R graphics: R Graphics Cookbook
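An alternative to setting the scale limits is to fix the order in the data itself. The sketch below uses made-up data and the column names String and Number from the answer above; rank_df is a hypothetical name. Note that order() alone sorts ascending and keeps every row, so the sketch sorts descending and keeps the first ten rows with head().

```r
library(ggplot2)

# Hypothetical data; `String` and `Number` follow the answer's column names
set.seed(42)
rank_df <- data.frame(
  String = paste("item", LETTERS[1:15]),
  Number = round(runif(15, 0, 100), 1),
  stringsAsFactors = FALSE
)

# Sort descending and keep the ten largest values
top10_rank <- head(rank_df[order(rank_df$Number, decreasing = TRUE), ], 10)

# Make String a factor whose levels follow the sort order;
# rev() puts the largest value at the top of the y axis
top10_rank$String <- factor(top10_rank$String, levels = rev(top10_rank$String))

p <- ggplot(top10_rank, aes(x = Number, y = String)) +
  geom_point(size = 3)

print(p)
```

Storing the order in the factor levels means any later plot of top10_rank picks it up without repeating the scale call.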