I am trying to plot a set of time-series data where some data are observed quarterly and others are observed monthly. When I plot my data, however, the facet containing the data that has fewer observations doesn't come out correctly.
In my actual data set I'm getting a slightly different outcome, but I think the problem in the example below has the same underlying cause. In the example, the last facet doesn't display a line at all. In my actual data set, the last facet displays all of the points, but they are compressed together at the start of the graph as if the observations were monthly. So the line in the last facet is a quarter of the length of the frame, which spans the whole time period.
library(ggplot2)   # ggplot(), economics data set
library(reshape2)  # melt()
a <- economics[1:100,]
a[seq(1,100,2), "unemploy"] <- NA
b <- melt(a,id.vars="date")
smPlot <- ggplot(b, aes_string(x="date", y="value")) +
geom_line() +
facet_wrap(~ variable, ncol=5, scales="free")
smPlot
Which creates the following plot:
The issue in your example code is that a line needs at least two points. If every other point is NA (as in the last facet), each line segment contains only one point, so nothing is drawn.
To expand on this a little, consider how a line trace is drawn. It is composed of many line segments, each with a start and an end. In continuous data without any NAs, each segment starts where the previous one ended. But an NA in the data breaks the line, and a new line starts from the next non-NA data point. So when there is only a single data point between each pair of NAs, each segment has only a single point and cannot be drawn as a line (though it could be shown as a point using geom_point).
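The segment-breaking behaviour described above can be checked without plotting anything. The following is a minimal sketch in base R, assuming a made-up vector y in which every other value is NA; the names y, runs and run_lengths are illustrative only and do not come from the question.

```r
# Hypothetical series where every other observation is missing
y <- c(NA, 2, NA, 4, NA, 6)

# rle() finds runs of consecutive non-NA values; each run is one drawable piece
runs <- rle(!is.na(y))
run_lengths <- runs$lengths[runs$values]

# Every run contains a single point, so no segment has both a start and an end
all(run_lengths == 1)

# After dropping the NAs, the remaining points form one uninterrupted run
y_clean <- y[!is.na(y)]
clean_runs <- rle(!is.na(y_clean))
clean_runs$lengths
```

A run needs a length of at least 2 before a line can be drawn through it, which is why removing the NA rows restores the line.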
If you just subset out the rows with NA, it should look fine:
smPlot <- ggplot(b[!is.na(b$value), ], aes_string(x="date", y="value")) +
geom_line() +
facet_wrap(~ variable, ncol=5, scales="free")
smPlot
Related
I have neuroscientific data where we count synapses/cells in the cochlea and quantify these per frequency. We do this for animals of different ages. What I thus ideally want is the frequencies (5,10,20,30,40) in the x-axis and the amount of synapses/cells plotted on the y-axis (usually a numerical value from 10 - 20). The graph then will contain 5 lines of the different ages (6 weeks, 17 weeks, 43 weeks, 69 weeks and 96 weeks).
I try this with ggplot and first just want to plot one age. When I use the following command:
ggplot(mydata, aes(x=Frequency, y=puncta6)) + geom_line()
I get a graph, but no line and the following error: 'geom_path: Each group consists of only one observation. Do you need to adjust the group aesthetic?'
So I found I have to adjust the code to:
ggplot(mydata, aes(x=Frequency, y=puncta6, group = 1)) + geom_line()
This works, except that my first data point (5 kHz) is now plotted after my last data point (40 kHz). (This also happens without the 'group = 1' addition.) How do I solve this, or is there an easier way to plot this kind of data?
I couldn't add a file, so I added a photo of my code and graph, with the 5 kHz data point oddly located, and a photo of my data in Excel.
example data
example code and graph
Is there a parameter that can anchor start and end points for a loess geom_smooth regression? If I increase the span (so that the regression isn't too wiggly), the starting and ending points seem to be drastically different (I have multiple lines on a graph, using as.factor) when in reality they are not (quite close together). I can't share my data as it is for confidential academic research, and I'm not sure how to reproduce an example for this... just wondering if this is possible with ggplot.
Here are some pictures that illustrate the problem, though...
Low span (span = 0.1), just the first 10 out of the 750 points to be graphed --> with this you can see the true starting points:
And then with the high span (span = 1.0), and all 750 points, the starting value and ending values are completely different. I'm not sure why this happens, but it is very misleading:
Basically, I want the smoothness of the second picture, but the specific and accurate starting points of the first when I graph all of the data (i.e., all 750 points). Let me know if there's any way to do this. Thanks for all your help.
Without seeing your code, I can already tell that you're setting your axis limits for the "span = 1.0" version using xlim(0,10) or scale_x_continuous(limits=c(0,10)) - is that correct? Change it to the following:
coord_cartesian(xlim = c(0, 10))
This is because xlim() (which is just a wrapper for scale_x_continuous(limits=...)) does not just zoom in on your data, but in fact discards any of the data outside of those limits before performing any calculations. Check the documentation on xlim() and the documentation on coord_cartesian() for more info.
It's easy to see how this is working using the following example:
# create dataset
set.seed(8675309)
df <- data.frame(x=1:1000, y=rnorm(1000))
# basic plot
p <- ggplot(df, aes(x,y)) + theme_bw() +
geom_point(color='gray75', size=1) + geom_smooth()
p
We get a basic plot, and as we expect, the result of geom_smooth() on this dataset is a straight line parallel to the x axis at y=0.
If we use xlim() or scale_x_continuous(limits=...) to see the first 10 points, you see that the geom_smooth() line is not the same:
p + xlim(0,10)
# or this one... results in the same plot
p + scale_x_continuous(limits=c(0,10))
The resulting line has a much wider confidence band and sits a bit above y=0, since the first 10 points happen to be just a bit above the average of the remaining 990 points. If you use coord_cartesian(xlim=...), the zooming happens after the calculations are made and no points are discarded, giving you the same points plotted, but with a geom_smooth() line that matches that of the full dataset:
p + coord_cartesian(xlim=c(0,10))
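The difference comes down to which rows reach the smoother. The following is a rough sketch of that idea, assuming stats::loess() called directly as a stand-in for whatever method geom_smooth() picks (it may choose a different smoother depending on the data size); the names fit_all, fit_sub, pred_all and pred_sub are illustrative only.

```r
# Same data as in the example above
set.seed(8675309)
df <- data.frame(x = 1:1000, y = rnorm(1000))

# Fit on all rows: analogous to zooming with coord_cartesian()
fit_all <- loess(y ~ x, data = df)

# Fit only on rows inside the limits: analogous to xlim(), which
# discards out-of-range rows before any calculation is done
fit_sub <- loess(y ~ x, data = df[df$x <= 10, ])

# Compare the two fitted curves over the visible window
window <- data.frame(x = 1:10)
pred_all <- predict(fit_all, newdata = window)
pred_sub <- predict(fit_sub, newdata = window)
data.frame(x = window$x, full_data = pred_all, subset_only = pred_sub)
```

The two columns differ because the second fit never sees the other 990 points, which is the same reason the xlim() plot shows a different smooth line.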
I am trying to plot three data sets that share an x axis. Some of the data sets, however, have missing data and are thus of different lengths. I can plot them fine individually, but when I try to facet them all together I get an error that the data sets contain different numbers of rows. This error only occurs when I facet the plot (which is necessary).
Any suggestions for how I could get the facet plot to accept data sets with different numbers of rows?
The code I've been using is:
ggplot()+
geom_line(data=x,aes(x=x$BIN_START,y=x$TajimaD),size=0.6,alpha=0.65,colour="skyblue1")+
geom_line(data=y,aes(x=y$BIN_START,y=y$TajimaD),size=0.3,alpha=0.85,colour="greenyellow")+
geom_line(data=z,aes(x=z$BIN_START,y=z$TajimaD),size=0.25,alpha=0.95,colour="black")+
scale_x_continuous()+
facet_grid(rows=vars(x$CHROM))+
theme_classic()+
ylab("TajimaD") +
xlab("Location (bp)")
As was suggested in a comment I have now moved all the data into a single file and added a column to indicate the population the data is from. I am still getting a similar error message: "replacement has 22588 rows, data has 7537"
ggplot()+
geom_line(data=x,aes(x=a$BIN_START,y=a$TajimaD,color=a$Population),size=0.6,alpha=0.65)+
scale_x_continuous()+
facet_grid(rows=vars(a$CHROM))+
theme_classic()+
ylab("TajimaD") +
xlab("Location (bp)")
In your second attempt you're using x as data but then referring to a$BIN_START, etc. It's very likely that x and a have a different number of rows, hence the error. I suggest removing the <dataset_name>$ prefix altogether in all your aes() calls when you use ggplot2. When you say data = x, you only need to write aes(x=BIN_START,y=TajimaD,color=Population) (i.e. no need for x$).
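Put together, the advice above might look like the sketch below. It assumes a single combined data frame named a with the columns CHROM, Population, BIN_START and TajimaD mentioned in the question; the values here are made up purely for illustration and the line sizes from the original code are left out for brevity.

```r
library(ggplot2)

# Hypothetical stand-in for the combined data frame `a` from the question;
# the numbers are invented for illustration only.
a <- data.frame(
  CHROM      = rep(c("chr1", "chr2"), each = 6),
  Population = rep(rep(c("pop1", "pop2"), each = 3), times = 2),
  BIN_START  = rep(c(0, 10000, 20000), times = 4),
  TajimaD    = c(0.5, -0.2, 0.1, 1.1, 0.8, -0.4,
                 -1.0, 0.3, 0.6, 0.2, -0.7, 0.9)
)

# Column names are given bare inside aes() and vars(); ggplot2 looks them
# up in the data frame passed as `data`, so every layer and the facets
# refer to the same rows.
p <- ggplot(a, aes(x = BIN_START, y = TajimaD, colour = Population)) +
  geom_line(alpha = 0.65) +
  facet_grid(rows = vars(CHROM)) +
  theme_classic() +
  ylab("TajimaD") +
  xlab("Location (bp)")
p
```

Because each population only contributes the rows it actually has, the groups can have different numbers of rows without any length mismatch.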
I'm having some trouble with qplot in R. I am trying to plot data from a data frame. When I execute the command below the plot gets bunched up on the left side (see the image below). The data frame only has 963 rows so I don't think size is the issue, but I can use the same command on a smaller data frame and it looks fine. Any ideas?
library(ggplot2)
qplot(x=variable,
y=value,
data=data,
color=Classification,
main="Average MapQ Scores")
Or similarly:
ggplot(data = data, aes(x = variable, y = value, color = Classification)) +
geom_point()
Your column value is likely a factor, when it should be a numeric. This causes each categorical value of value to be given its own entry on the y-axis, thus producing the effect you've noticed.
You should coerce it to a numeric:
data$value <- as.numeric(as.character(data$value))
Note that there is probably a good reason it has been interpreted as a factor and not a numeric, possibly because it has some entries that are not pure numeric values (maybe 1,000 or 1000 m or some other character entry among the numbers). The consequence of the coercion may be a loss of information, so be warned or cleanse the data thoroughly.
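Before coercing, it can help to see which entries would be lost. The following is a small sketch, assuming a made-up factor called value with one entry ("1,000") that is not a plain number; the names value, converted and bad are illustrative only.

```r
# Hypothetical factor column where one entry is not a plain number
value <- factor(c("10", "2", "1,000", "35"))

# as.numeric() on a factor returns the internal level codes, not the values
as.numeric(value)

# Going through as.character() first converts the labels themselves;
# entries that cannot be parsed become NA (with a warning, suppressed here)
converted <- suppressWarnings(as.numeric(as.character(value)))
converted

# The entries that would be lost by the coercion
bad <- as.character(value)[is.na(converted)]
bad
```

Inspecting bad before overwriting the column shows exactly which values need cleaning, for example stripping thousands separators or units.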
Also, you appear to have the same problem on the x-axis.
My data frame (df) consists of 5 columns with 2,000 numerical values for each one.
Using reshape I reformatted my data frame into two columns: the first containing the values (df$Values), 10,000 in total, and the second containing the name of the column (df$Labels) that each value in the first column came from.
I will use the 2nd column as a group factor.
I generated a mycolor and myshapes for coloring and setting the shape of lines.
With ggplot I tried to generate a density plot containing the density plot for the five factors.
The problem is that the x-axis shows the counts, with a maximum of 10,000. This value does not make any sense because the maximum possible count for each plot must be 2,000. Does anyone know what is going on? What code do I need to correct the x-axis?
ggplot2, geom_density() plot:
Here is the code:
ggplot(df, aes(x=Values, colour=Labels, linetype=Labels))+
geom_density(aes(y=..count..))+
theme_classic()+
scale_colour_manual(values = mycolor)+
scale_linetype_manual(values = myshapes)+
ggtitle("Title")+
scale_x_continuous(limits = c(0.5,1.5))