How to plot many data frames on the same graph with ggplot - r

I have 36 different data frames that contain dX and dY variables. I have stored them in a list and want to display them all on the same graph with x = dX and y = dY.
The 36 data frames do not share the same dX values. They roughly cover the same range but don't have the exact same values, so using a merge creates a ton of NA values. The number of rows is, however, identical.
I tried something ugly that almost works:
g <- ggplot()
for (i in 1:36) {
g <- g + geom_line(data = df.list[[i]], aes(dX, dY, colour = i))
}
print(g)
This displays the curves correctly, but the colours are not applied (and I don't have an appropriate legend). OK, 36 lines in the legend might not be practical. In that case I would reduce the number of lines to draw.
Second approach: I tried melting the data frames as follows.
df <- melt(df.list, id.vars = "dX")
ggplot(df, aes(x = dX, y = value, colour = L1)) + geom_line()
But this creates a 4-variable data frame with columns: dX, variable (always equal to dY), value (here are the dY values) and L1, which contains the index of the data frame in the list.
Here are the first lines of the melted data frame:
dX variable value L1
1 4.952296 dY 6.211485e-05 1
2 6.766889 dY 7.661041e-05 1
3 8.581481 dY 9.550221e-05 1
4 10.396074 dY 1.192053e-04 1
5 12.210666 dY 1.498834e-04 1
6 14.025259 dY 1.883612e-04 1
7 15.839851 dY 2.365646e-04 1
8 17.654444 dY 2.956796e-04 1
9 19.469036 dY 3.662252e-04 1
10 21.283629 dY 4.470143e-04 1
There are several problems here:
- "variable" is always equal to dY. What I was expecting was the index of the data frame in the list (which is stored in L1), or even better, the result of a function name(i).
- The curve uses a continuous colour scale, ranging from 1 to 36, while I wanted a discrete scale.
- Finally, geom_line() does not seem to draw the curves of the individual data frames, but links the points of different data sets together.
Any idea how to solve my problem?

I would combine the data.frame into one large data.frame, add an id column, and then plot with ggplot. Lots of ways to do this, here is one:
library(ggplot2)
# stack all data frames, then record which list element each row came from
newDF <- do.call(rbind, df.list)
newDF$id <- factor(rep(seq_along(df.list), times = sapply(df.list, nrow)))
g <- ggplot(newDF, aes(x = dX, y = dY, colour = id))
g <- g + geom_line()
print(g)

It seems like the most straightforward option would be to create a single data frame (as suggested by one of the commenters) and use the index of the source data frame for the colour aesthetic:
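library(ggplot2)
library(dplyr) # For bind_rows() function
ggplot(bind_rows(df.list, .id="id"), aes(dX, dY, colour=id)) +
  geom_line()
In the code above, .id="id" makes bind_rows() add a column called id that holds the name of the list element each row came from (or its position in the list, as a character string, if the list is unnamed). Because id is a character column, ggplot treats it as discrete, so each data frame gets its own line and its own legend entry.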
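To get a legend that shows meaningful names rather than indices, one option is to name the list elements before combining them. Here is a minimal base R sketch of that idea. The toy data and the names run_a, run_b and run_c are made up for illustration and are not part of the original question; the plot is only drawn if ggplot2 is installed.

```r
# Toy stand-in for df.list: three data frames whose dX values differ
set.seed(42)
df.list <- lapply(1:3, function(i) {
  dX <- sort(runif(10, 0, 100))
  data.frame(dX = dX, dY = i * dX^2)
})
names(df.list) <- c("run_a", "run_b", "run_c")  # hypothetical names

# Stack the data frames and record the source name of every row
combined <- do.call(rbind, df.list)
combined$id <- factor(rep(names(df.list), times = sapply(df.list, nrow)),
                      levels = names(df.list))
rownames(combined) <- NULL

# A factor is mapped to a discrete colour scale, and ggplot draws one line per level
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(combined, aes(dX, dY, colour = id)) + geom_line()
  print(p)
}
```

Using times = rather than each = also keeps the id column correct if the data frames ever have different numbers of rows.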

Related

Plotting a facet grid in R using ggplot2 with only one variable

I have a data frame, called mouse.data, with 3 columns: Eigenvalues, DualEigenvalues and Experiment. This question does not concern the DualEigenvalues data, so that can be forgotten.
We ran 5 experiments and used the data from each experiment to calculate 14 eigenvalues. So the first 14 rows of this data frame are the 14 eigenvalues of the first experiment, with the experiment entry having value 1, the second 14 rows are the 14 eigenvalues of the second experiment with the experiment entry having value 2 etc.
I am then plotting the eigenvalues of each pairwise experiment against each other, here is an example of this code:
eigen.1 <- mouse.data$Eigenvalues[mouse.data$Experiment == 1]
eigen.2 <- mouse.data$Eigenvalues[mouse.data$Experiment == 2]
p.data <- data.frame(x = eigen.1, y = eigen.2)
ggplot(p.data, aes(x,y)) + geom_abline(slope = 1, colour = "red") + geom_point()
This gives me a graph like this one:
This is precisely what I want this graph to look like.
What I would like to do, but can't work out, is to plot a facet_grid so that the plot in the ith row and jth column plots the eigenvalues from the ith experiment on the y-axis and the eigenvalues from the jth experiment on the x-axis.
This is the closest I have got so far. I hope this makes it clearer what I mean.
This is tricky without a reproducible example of your data, but it sounds like we can roughly approximate the structure of your data frame like this:
library(ggplot2)
set.seed(1)
Eigen <- as.vector(sapply(runif(5, .5, 1.5),
function(x) sort(rgamma(14, 2, 0.02*x))))
mouse.data <- data.frame(Experiment = rep(seq(5), each = 14), Eigenvalue = Eigen)
head(mouse.data)
#> Experiment Eigenvalue
#> 1 1 39.61451
#> 2 1 44.48163
#> 3 1 54.57964
#> 4 1 75.06725
#> 5 1 75.50014
#> 6 1 94.41255
The key to getting the plot to work is to reshape your data into a long-format data frame that contains each combination of experiments. One way to do this is to split the data frame by Experiment, then use simple indexing of the resulting list (using rep) to get all 25 ordered pairs of data frames. Each pair is stuck together column-wise, then the resulting 25 data frames are all joined row-wise into the plotting data frame.
experiments <- split(mouse.data, mouse.data$Experiment)
experiments <- mapply(cbind,
experiments[rep(1:5, 5)],
experiments[rep(1:5, each = 5)],
SIMPLIFY = FALSE)
p.data <- do.call(rbind, lapply(experiments, setNames,
nm = c("Experiment1", "x",
"Experiment2", "y")))
Once we have done this, we can use your plot code, with the addition of a facet_grid call:
ggplot(p.data, aes(x,y)) +
  geom_abline(slope = 1, colour = "red") +
  geom_point() +
  # rows: experiment on the y-axis; columns: experiment on the x-axis
  facet_grid(Experiment2~Experiment1)
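A different way to build the same kind of pairwise table, offered only as a sketch and not as the method used above, is a self-join in base R: number the eigenvalues within each experiment, then merge the data frame with itself on that position. The helper column rank and the .x/.y suffixes are assumptions of this example, and the simulated data only imitates the structure described in the question.

```r
# Simulated data with the same shape: 5 experiments, 14 sorted eigenvalues each
set.seed(1)
mouse.data <- data.frame(
  Experiment = rep(1:5, each = 14),
  Eigenvalue = as.vector(replicate(5, sort(rgamma(14, 2, 0.02))))
)

# Position (1..14) of each eigenvalue within its own experiment
mouse.data$rank <- ave(mouse.data$Eigenvalue, mouse.data$Experiment,
                       FUN = seq_along)

# Self-join on rank: each x-experiment is paired with each y-experiment
p.data <- merge(mouse.data, mouse.data, by = "rank", suffixes = c(".x", ".y"))

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(p.data, aes(Eigenvalue.x, Eigenvalue.y)) +
    geom_abline(slope = 1, colour = "red") +
    geom_point() +
    facet_grid(Experiment.y ~ Experiment.x)  # rows = y-axis experiment
  print(p)
}
```

On the diagonal panels every point pairs an eigenvalue with itself, so the points should sit exactly on the red line, which is a quick sanity check for the reshaping.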

ggplot: Plotting timeseries data with missing values

I have been trying to plot one column of a data frame against another. The first column, "Time", holds daily dates (format YYYY-MM-DD); the second, "data1", holds the precipitation magnitude as a numeric value.
This data is taken from an Excel file, "St Lucia3", which has a total of 11598 data points and stores daily precipitation data from 1981 to 2018 in two columns:
YearMonthDay (format- "YYYYMMDD", example "19810501")
Rainfall (mm)
The code for importing data into R:
library(readxl)
StLucia <- read_excel("C:/Users/hp/Desktop/St Lucia3.xlsx")
The code for time data "Time" :
Time <- as.Date(as.character(StLucia$YearMonthDay), format= "%Y%m%d")
The code for precipitation data "data1" :
library("imputeTS")
data1 <- na_ma(StLucia$`Rainfall (mm)`, k = 4, weighting = "exponential")
The code for data frame "Precip1" :
Precip1 <- data.frame(Time, data1, check.rows=TRUE)
The code for ggplot is:
ggplot(data = Precip1, mapping= aes(x= Time, y= data1)) + geom_line()
Plotting "Time" against "data1" with ggplot gives:
Can someone please explain why there is an unusual kink at the right end of the graph, even though there are no such values in the column "data1"?
The plot of "data1" data against its index is as shown:
The code for this plot is:
plot(data1, type = "l")
Any help would be highly appreciated. Thanks!
By using pad() we can restore the missing dates and assign them an NA value, so that nothing is plotted in the region of missing data.
library(padr)
library(zoo)
YearMonthDay<-c(19810501,19810502,19810504,19810505)
Data<-c(1,2,3,4)
StLucia<-data.frame(YearMonthDay,Data)
StLucia$YearMonthDay <- as.Date(as.character(StLucia$YearMonthDay), format = "%Y%m%d")
> StLucia
YearMonthDay Data
1 1981-05-01 1
2 1981-05-02 2
3 1981-05-04 3
4 1981-05-05 4
Note: you can see we are missing a date, but there is still no gap between positions 2 and 3, so when plotting against the index you would not see a gap.
So let's add the missing date:
StLucia<-pad(StLucia,interval="day")
> StLucia
YearMonthDay Data
1 1981-05-01 1
2 1981-05-02 2
3 1981-05-03 NA
4 1981-05-04 3
5 1981-05-05 4
plot(StLucia, type = "l")
If you want to fill in those NA values, use na.locf() from the zoo package.
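Before padding, it can help to confirm where the dates actually jump. The following base R sketch lists every gap in a vector of dates; the example dates and the names step, gap_idx and gaps are made up for illustration.

```r
# Example dates with two gaps (one missing day, then a longer break)
dates <- as.Date(c("1981-05-01", "1981-05-02", "1981-05-04",
                   "1981-05-05", "1981-06-10"))

# Number of days between consecutive observations
step <- as.integer(diff(dates))

# Positions where more than one day passes before the next observation
gap_idx <- which(step > 1)

gaps <- data.frame(
  last_seen    = dates[gap_idx],
  next_seen    = dates[gap_idx + 1],
  missing_days = step[gap_idx] - 1L
)
print(gaps)
```

A long run of missing days near the end of the series would explain a straight segment in a line plot against dates that does not show up in a plot against the index.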
Here is a reproducible example - change the names to match your data.
# create sample data
library(ggplot2)
set.seed(47)
dd = data.frame(t = Sys.Date() + c(0:5, 30:32), y = runif(9))
# demonstrate problem
ggplot(dd, aes(t, y)) +
geom_point() +
geom_line()
The easiest solution, as Tung points out, is to use a more appropriate geom, like geom_col:
ggplot(dd, aes(t, y)) +
geom_col()
If you really want to use lines, you should fill in the missing dates with NA for rainfall.
# calculate all days
all_days = data.frame(t = seq.Date(from = min(dd$t), to = max(dd$t), by = "day"))
# join to original data
library(dplyr)
dd_complete = left_join(all_days, dd, by = "t")
# ggplot won't connect lines across missing values
ggplot(dd_complete, aes(t, y)) +
geom_point() +
geom_line()
Alternately, you could replace the missing values with 0s to have the line just go along the axis, but I think it's nicer to not plot the line, which implies no data/missing data, rather than plot 0s which implies no rainfall.
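For comparison, here is a base R sketch of both options: padding the missing dates with NA, and then replacing those NAs with 0. It uses merge() in place of left_join(), and the fixed start date and y values are made up so the example is reproducible.

```r
# Sample data: six consecutive days, a gap, then three more days
dd <- data.frame(t = as.Date("2020-01-01") + c(0:5, 30:32),
                 y = seq_len(9) / 10)

# One row per calendar day in the observed range
all_days <- data.frame(t = seq(min(dd$t), max(dd$t), by = "day"))

# Keep every day; days without an observation get y = NA
dd_complete <- merge(all_days, dd, by = "t", all.x = TRUE)

# Zero-filled variant: the line would run along the axis through the gap
dd_zero <- dd_complete
dd_zero$y[is.na(dd_zero$y)] <- 0
```

Plotting dd_complete leaves the gap empty, while plotting dd_zero draws a flat segment at 0 that a reader may mistake for measured dry days.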

Trying to vertically scale the graph of a data set with R, ggplot2

I'm working with a data frame of size 2 x 400. I need to graph this (let's call it data set A) on the same graph as the main data set for my project.
All I need is the general shape of data set A's graph, i.e. I only need to see the trend.
The scale of data set A happens to be much smaller than that of the main graph, so data set A just looks like a horizontal line.
I decided to scale data set A by multiplying it by a factor of... I tried various values to get the optimum vertical scaling, which leads me to the problem I'm having.
When trying to find the ideal multiplicative factor by trial and error, I expected data set A's graph to keep its general shape and only change vertically, i.e. the horizontal coordinates of all maxima and minima shouldn't move; only the vertical positions should. But this wasn't happening, and I'd like to know why.
Here's the data set A (yellow), when multiplied by factor of 3:
factor of 5:
The yellow dots are the geom_point and the yellow curve is the corresponding geom_smooth.
EDIT:
Here is my original code:
I haven't had much formal training with code. I apologize for any messiness!
library("ggplot2")
library("dplyr")
# READ IN DATA
temp_data <-read.table(col.names = "y",
"C:/Users/Ben/Documents/Visual Studio 2013/Projects/Home/Home/steamdata2.txt")
boilpoint <- which(temp_data$y == "boil") # JUST A MARKER..
temp_data <- filter(temp_data, y != "boil") # GETTING RID OF THE MARKER ENTRY
# DON'T KNOW WHY BUT I HAD TO DO THIS INTERMEDIATE STEP
# BEFORE I COULD CONVERT FROM FACTOR -> NUMERIC
temp_data$y <- as.character(temp_data$y)
# CONVERTING TO NUMERIC
temp_data$y <- as.numeric(temp_data$y)
# GETTING RID OF BASICALLY THE LAST ENTRY WHICH HAS THE LARGEST VALUE
temp_data <- filter(temp_data, y<max(temp_data$y))
# ADD ANOTHER COLUMN WITH THE ROW NUMBER,
# BECAUSE I DON'T KNOW HOW TO ACCESS THIS FOR GGPLOT
temp_data <- transform(temp_data, x = 1:nrow(temp_data))
n <- nrow(temp_data) # Num of readings
period <- temp_data[n,1] # (sec)
RpS <- n / period # Avg Readings per Second
MIN <- min(temp_data$y)
MAX <- max(temp_data$y)
# DERIVATIVE OF ORIGINAL
deriv <- data.frame(matrix(ncol=2, nrow=n))
# ADD ANOTHER COLUMN TO ACCESS ROW NUMBERS FOR GGPLOT LATER
colnames(deriv) <- c("y","x")
deriv <- transform(deriv, x = c(1:n))
# FILL DERIVATIVE DATAFRAME
deriv[1, 1] <- 0
for(i in 2:n){
deriv[i - 1, 1] <- temp_data[i, 1] - temp_data[i - 1, 1]
}
deriv <- filter(deriv, y != 0)
# DID THE SAME FOR SECOND DERIVATIVE
dderiv <- data.frame(matrix(ncol = 2, nrow = nrow(deriv)))
colnames(dderiv) <- c("y", "x")
dderiv <- transform(dderiv, x=rep(0, nrow(deriv)))
dderiv[1, 1] <- 0
for(i in 2:nrow(deriv)) {
dderiv$y[i - 1] <- (deriv$y[i] - deriv$y[i - 1]) /
(deriv$x[i] - deriv$x[i - 1])
dderiv$x[i - 1] <- deriv$x[i] + (deriv$x[i] - deriv$x[i - 1]) / 2
}
dderiv <- filter(dderiv, y!=0)
# HERE'S WHERE I FACTOR BY VARIOUS MULTIPLES
deriv <- MIN + deriv * 3
dderiv <- MIN + dderiv * 3
graph <- ggplot(temp_data, aes(x, y)) + geom_smooth()
graph <- graph + geom_point(data = deriv, color = "yellow")
graph <- graph + geom_smooth(data = deriv, color = "yellow")
graph <- graph + geom_point(data = dderiv, color = "green")
graph <- graph + geom_smooth(data = dderiv, color = "green")
graph <- graph + geom_vline(xintercept = boilpoint, color = "red")
graph <- graph + xlab("Readings (n)") +
ylab(expression(paste("Temperature (",degree,"C)")))
graph <- graph + xlim(c(0,n)) + ylim(c(MIN, MAX))
It's hard to check without your raw data, but I'm 99% sure that your main problem is that you're hard-coding the y limits with ylim(c(MIN, MAX)). This is exacerbated by accidentally scaling both variables in your deriv and dderiv data frame, not just y.
I was able to debug the problem when I noticed that your top "scale by 3" graph has a lot more yellow points than your bottom "scale by 5" graph.
The quick fix is don't scale the row numbers, only scale the y values, which is to say, replace this
# scales entire data frame: bad!
deriv <- MIN + deriv * 3
dderiv <- MIN + dderiv * 3
with this:
# only scale y
deriv$y <- MIN + deriv$y * 3
dderiv$y <- MIN + dderiv$y * 3
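A tiny example makes the difference visible. The numbers below are made up; only the two scaling statements mirror the code under discussion.

```r
# Toy derivative data: y values plus their row numbers in x
deriv <- data.frame(y = c(0.2, -0.1, 0.4), x = 1:3)
MIN <- 20

# Arithmetic on the whole data frame touches every column, including x
bad <- MIN + deriv * 3
print(bad$x)   # the row numbers have been shifted and stretched

# Scaling only the y column leaves the horizontal positions alone
good <- deriv
good$y <- MIN + good$y * 3
print(good$x)
```

With the whole-frame version, a larger factor pushes more x values past the xlim of the plot, which would be consistent with fewer yellow points appearing at a factor of 5 than at a factor of 3.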
I think there is another problem too: even with my correction above, negative values of your derivatives will be excluded. If deriv$y or dderiv$y is ever negative, then MIN + deriv$y * 3 will be less than MIN, and since your y axis begins at MIN it won't be plotted.
So I think the whole fix would be to instead do something like
# keep the original y values around so we can experiment with scaling
# without running *all* the code again
deriv$y_orig <- deriv$y
# multiplicative scale
# set `prop` to the proportion of the vertical plot area
# that you want taken up by the derivative
prop <- 0.5  # example value only
deriv$y <- deriv$y_orig * diff(c(MIN, MAX)) / diff(range(deriv$y_orig)) * prop
# shift into plot range
# set `intercept` to the y value of the lowest point of this line
intercept <- MIN + 1  # example value only, same as a "+ 1" offset above MIN
deriv$y <- deriv$y - min(deriv$y) + intercept
I normally don't answer questions that aren't reproducible with data because I hate lack of clarity and I hate the inability to test. However, your question was very clear and I'm pretty sure this will work even without testing. Fingers crossed!
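If the goal is only to see the shape of the derivative inside the plot window, a plain min-max rescale is another option. This is a sketch of a different technique from the prop/intercept approach above; the helper name rescale_to and the sample values are assumptions made for illustration.

```r
# Map y linearly onto [lo, hi]; the order and the positions of peaks are preserved
rescale_to <- function(y, lo, hi) {
  rng <- range(y, na.rm = TRUE)
  lo + (y - rng[1]) / (rng[2] - rng[1]) * (hi - lo)
}

y <- c(0.2, -0.1, 0.4, 0.1)        # toy derivative values, one of them negative
scaled <- rescale_to(y, 20, 60)    # squeeze into the band from 20 to 60
print(scaled)
```

Because the mapping is linear and increasing, negative derivative values end up inside the band instead of falling below the lower y limit. Note that the helper divides by the range of y, so it is not suitable for a constant series.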
A few other, more general comments:
It's good you know that to convert factor to numeric you need to go via character. It's an annoyance, but if you want to understand more here's the r-faq on it.
I'm not sure why you bother with (deriv$x[i] - deriv$x[i - 1]) in your for loop. Since you define x to be 1, 2, 3, ... the difference is always 1. I'm more confused by why you divide by 2 in the second derivative.
Your for loop can probably be replaced by the diff() function. (See below.)
You seem to have just gotten your foot in the dplyr door, so I used base functions in my recommendation. Keep working with dplyr, I think you'll like it. The big dplyr function you're not using is mutate. It works like base::transform for adding new columns.
I dislike that you've created all these different data frames, it clutters things up. I think your code could be simplified to something like this
all_data = filter(temp_data, y != "boil") %>%
mutate(y = as.numeric(as.character(y))) %>%
filter(y < max(y)) %>%
mutate(
x = 1:n(),
deriv = c(NA, diff(y)) / c(NA, diff(x)),
dderiv = c(NA, diff(deriv)) / 2
)
Rather than having separate data frames for the original data, first derivative and second derivative, this puts them all in the same data frame.
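To illustrate the point about diff(), here is a short base R check that an explicit loop over consecutive differences gives the same result as diff(). The temperature values are made up.

```r
# Toy temperature readings
y <- c(20, 21.5, 23, 26, 30)
n <- length(y)

# Loop version: difference between each reading and the previous one
loop_deriv <- numeric(n - 1)
for (i in 2:n) {
  loop_deriv[i - 1] <- y[i] - y[i - 1]
}

# Vectorised version
first_deriv <- diff(y)

# Second differences in one call
second_deriv <- diff(y, differences = 2)

print(all.equal(loop_deriv, first_deriv))
```

diff() returns a vector one element shorter than its input, which is why the pipeline above pads it with a leading NA to keep the columns the same length.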
The big benefit of having things in one data frame is that you could then "gather" it into a nice, long (rather than wide) tidy format and simplify your plotting call:
library(tidyr)
# `function` is a reserved word in R, so the column name has to be quoted here
long_data = gather(all_data, key = "function", value = "y", y, deriv, dderiv)
Then your ggplot call would look more like this:
# ... and wrapped in backticks when used inside aes()
graph <- ggplot(long_data, aes(x, y, color = `function`)) +
  geom_smooth() +
  geom_point() +
  geom_vline(xintercept = boilpoint, color = "red") +
  scale_color_manual(values = c("green", "yellow", "blue")) +
  xlab("Readings (n)") +
  ylab(expression(paste("Temperature (",degree,"C)"))) +
  xlim(c(0,n)) + ylim(c(MIN, MAX))
With data in long format, you'd have a column of your data (I've named it "function") that maps to color, so you don't have to add all the layers one at a time, and you get a nicely generated legend!
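For readers who want to see what the long format looks like without installing tidyr, here is a base R sketch that uses stack(). The small data frame and the column name series are assumptions of this example; series is used instead of "function" to avoid the reserved word.

```r
# Small wide data frame: readings plus their first differences
wide <- data.frame(x = 1:4, y = c(20, 22, 25, 29))
wide$deriv <- c(NA, diff(wide$y))

# stack() puts the chosen columns on top of each other and
# adds a factor saying which column each value came from
long <- cbind(x = wide$x, stack(wide[c("y", "deriv")]))
names(long) <- c("x", "value", "series")
print(long)
```

Each x value now appears once per series, so a single aes(x, value, colour = series) mapping covers every curve.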

Merge data.frames for grouped boxplot r

I have two data frames z (1 million observations) and b (500k observations).
z= Tracer time treatment
15 0 S
20 0 S
25 0 X
04 0 X
55 15 S
16 15 S
15 15 X
20 15 X
b= Tracer time treatment
2 0 S
35 0 S
10 0 X
04 0 X
20 15 S
11 15 S
12 15 X
25 15 X
I'd like to create grouped boxplots using time as a factor and treatment as colour. Essentially I need to bind them together and then differentiate between them but not sure how. One way I tried was using:
zz<-factor(rep("Z", nrow(z)))
bb<-factor(rep("B", nrow(b)))
dumZ<-merge(z,zz) #this won't work because it says it's too big
dumB<-merge(b,bb)
total<-rbind(dumB,dumZ)
But z and zz merge won't work because it says it's 10G in size (which can't be right)
The end plot might be similar to this example: Boxplot with two levels and multiple data.frames
Any thoughts?
Cheers,
EDIT: Added boxplot
I would approach it as follows:
# create a list of your data.frames
l <- list(z,b)
# assign names to the dataframes in the list
names(l) <- c("z","b")
# bind the data frames together with rbindlist from data.table
# the idcol parameter creates a variable holding the names of the data frames
# you could also use 'bind_rows(l, .id="id")' from 'dplyr' for this
library(data.table)
zb <- rbindlist(l, idcol="id")
# create the plot
library(ggplot2)
ggplot(zb, aes(x=factor(time), y=Tracer, color=treatment)) +
  geom_boxplot() +
  facet_wrap(~id) +
  theme_bw()
which gives:
Other alternatives for creating your plot:
# facet by 'time'
ggplot(zb, aes(x=id, y=Tracer, color=treatment)) +
geom_boxplot() +
facet_wrap(~time) +
theme_bw()
# facet by 'time' & color by 'id' instead of 'treatment'
ggplot(zb, aes(x=treatment, y=Tracer, color=id)) +
geom_boxplot() +
facet_wrap(~time) +
theme_bw()
In response to your last comment: to get everything in one plot, you can use interaction() to distinguish between the different groupings as follows:
ggplot(zb, aes(x=treatment, y=Tracer, color=interaction(id, time))) +
geom_boxplot(width = 0.7, position = position_dodge(width = 0.7)) +
theme_bw()
which gives:
The key is that you do not need a merge, which is computationally expensive on large tables. Instead, add a new variable to each data frame that identifies it (source, with values "b" and "z", in my code below) and then rbind them. After that it is straightforward; my solution is very similar to @Jaap's, just with different faceting.
library(ggplot2)
#Create some mock data
t<-seq(1,55,by=2)
z<-data.frame(tracer=sample(t,size = 10,replace = T), time=c(0,15), treatment=c("S","X"))
b<-data.frame(tracer=sample(t,size = 10,replace = T), time=c(0,15), treatment=c("S","X"))
#Add a variable to each table to id itself
b$source<-"b"
z$source<-"z"
#concatenate the tables together
all<-rbind(b,z)
ggplot(all, aes(source, tracer, group=interaction(treatment,source), fill=treatment)) +
geom_boxplot() + facet_grid(~time)
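As a side note on why the merge in the question ran out of memory: when two data frames share no column names, merge() returns every combination of their rows. The sketch below shows this on tiny made-up data; the names label, crossed and both are assumptions of this example.

```r
z <- data.frame(Tracer = c(15, 20, 25), time = c(0, 0, 15),
                treatment = c("S", "S", "X"))
b <- data.frame(Tracer = c(2, 35), time = c(0, 15),
                treatment = c("S", "X"))

# A label column with one entry per row of z, as in the question's attempt
label <- data.frame(src = rep("Z", nrow(z)))

# No common columns, so merge() builds all row combinations
crossed <- merge(z, label)
print(nrow(crossed))  # grows with the square of nrow(z)

# Tagging each data frame directly keeps one row per observation
z$source <- "z"
b$source <- "b"
both <- rbind(z, b)
print(nrow(both))
```

With a million rows, the cross join would need on the order of a million times a million rows, so a very large size estimate is plausible rather than a bug.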

working with 3 columns of data in ggplot2: x, y1, and y2 into a stacked bar plot

I have data with 3 columns. The first column, Depth, should be on the x axis. The other two columns are nr and r. I need to plot the data in a stacked barplot with A on the bottom and B on the top of nr. The data is very large (i.e. the read depth goes from 0 to 1022), so I can't type everything out in R or here. Here's an example of what the data would look like:
Depth r nr
6 2395 2904
8 0 3095
9 2689 0
12 3894 3578
15 5 4739
The r and nr values have to be on the y axis, and Depth has to be on the x axis. I've tried everything I can think of and am unable to get a 'height' to use or to just get the basic equation.
Work in long format
# using reshape2::melt
library(reshape2)
library(ggplot2)
# assuming your original data.frame is called `D`
longD <- melt(D, id.vars = "Depth")
ggplot(longD, aes(x = Depth, y = value, colour = variable, fill = variable)) +
  geom_bar(stat = 'identity')
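To make the long format concrete, here is a base R sketch that builds the same three columns by hand from the sample rows in the question and checks the total bar heights. It is only an illustration of the shape of longD, not a replacement for melt().

```r
# Sample rows from the question
D <- data.frame(Depth = c(6, 8, 9, 12, 15),
                r     = c(2395, 0, 2689, 3894, 5),
                nr    = c(2904, 3095, 0, 3578, 4739))

# One row per (Depth, variable) pair instead of one row per Depth
longD <- data.frame(
  Depth    = rep(D$Depth, times = 2),
  variable = factor(rep(c("r", "nr"), each = nrow(D)), levels = c("r", "nr")),
  value    = c(D$r, D$nr)
)

# Height of each stacked bar: the sum of its segments
totals <- tapply(longD$value, longD$Depth, sum)
print(totals)
```

Each stacked bar is the sum of the r and nr values at that depth, which is why no separate 'height' column is needed.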
Using barchart from lattice, you can work with the wide format directly:
library(lattice)
barchart(r+nr~factor(Depth),data=dt,stack=TRUE,auto.key=TRUE)
This is equivalent to the following, using the long format from @mnel's answer:
barchart(value~factor(Depth),data=longD,
groups=variable,stack=TRUE,auto.key=TRUE)
Just to show base R graphics can match it as well, and assuming your data.frame is called dat:
barplot(
t(dat)[2:3,],
names.arg=t(dat)[1,],
space=c(0,diff(t(dat)[1,])),
axis.lty=1
)
