In R, both the ggplot2 and lattice packages make it possible to visualize data not only by x and y position but also by an additional factor, either by changing the color, size, or shape of the representation (points, smooth lines, etc.) or by splitting the visualization into separate panels along this factor.
Example for ggplot:
require(ggplot2)
ggplot(diamonds, aes(x = carat, y = price, col=clarity)) +
geom_point(alpha = .3)
Example for lattice:
require(lattice)
require(mlmRev); data(Chem97, package = "mlmRev")
densityplot(~ gcsescore | factor(score), Chem97, groups = gender,
plot.points = FALSE, auto.key = TRUE)
Obviously, these easy ways of differentiating data by another factor are designed for a single data frame containing all the observations to be shown. However, I more often have separate data inputs in the form of separate data frames, each with different columns to be used as x and y. The third factor for separating the plot would then be the data frame, i.e. the data source, itself. The only solution I have found so far is to add a column to each source data frame that contains only the name of the data source (the same string in every row) and then merge everything into one data frame. ggplot2 and lattice can then split the data again by this third factor and visualize the sources separately, as desired.
Now to the final problem: This seems to be a really poor workflow and is not very efficient for bigger amounts of data. Is there perhaps an alternative way to achieve the same result or at least a way to efficiently automate the last described workflow?
It's usually a good idea to merge multiple data sources into one when working with ggplot. There are of course exceptions, and ggplot provides the tools to deal with those cases.
That said, it's possible to pass a data argument to each geom_*.
A general rule I use: if different data sources are used in the same geom_*, they have to be combined; if they are used in different geom_* layers, they can (and maybe should) stay separate.
Bind data sources to use in the same geom_*
df1 <- data.frame(group = LETTERS[1:3],
obs = runif(3))
df2 <- data.frame(group = LETTERS[1:3],
obs = runif(3))
library(purrr)
dfT <- list(df1 = df1, df2 = df2) %>%
map_df(~rbind(.x), .id = 'src')
library(ggplot2)
ggplot(dfT, aes(x = group, y = obs)) +
geom_line(aes(group = src, color = src), size = 1)
Use different data sources
df1 <- data.frame(group = LETTERS[1:3],
hValue = runif(3))
df2 <- data.frame(group = rep(LETTERS[1:3], each = 3),
pValue = runif(9))
library(ggplot2)
ggplot() +
geom_line(data = df1, aes(x = group, y = hValue, group = 1), size = 1) +
geom_point(data = df2, aes(x = group, y = pValue, color = group))
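For the case in the question, where each source has differently named x and y columns, the tagging and merging can probably be automated in one step. A minimal sketch, assuming dplyr is available; sensor_a, sensor_b and their column names are made up for illustration:

```r
library(dplyr)
library(ggplot2)

# Hypothetical inputs: two sources whose x/y columns have different names
sensor_a <- data.frame(time = 1:5, temp = c(20.1, 20.4, 21.0, 21.3, 21.9))
sensor_b <- data.frame(t = 1:5, reading = c(18.2, 18.9, 19.5, 19.4, 20.0))

# Rename to a common x/y schema, then stack the sources.
# The list names end up in the `src` column, so no manual tagging is needed.
combined <- bind_rows(
  list(
    sensor_a = rename(sensor_a, x = time, y = temp),
    sensor_b = rename(sensor_b, x = t, y = reading)
  ),
  .id = "src"
)

# The data source is now an ordinary variable that can drive color or facets
p <- ggplot(combined, aes(x = x, y = y, color = src)) +
  geom_line()
```

Printing p draws one line per source. The same combined data frame should also work with facet_wrap(~ src) if separate panels are preferred.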
Related
I have been working on plotting several lines according to different probability levels and am stuck adding labels to each line to represent the probability level.
Since each curve plotted has varying x and y coordinates, I cannot simply have a large data-frame on which to perform usual ggplot2 functions.
The end goal is to have each line with a label next to it according to the p-level.
What I have tried:
To access the data comfortably, I have created a list df with for example 5 elements, each element containing a nx2 data frame with column 1 the x-coordinates and column 2 the y-coordinates. To plot each curve, I create a for loop where at each iteration (i in 1:5) I extract the x and y coordinates from the list and add the p-level line to the plot by:
plot = plot +
geom_line(data=df[[i]],aes(x=x.coor, y=y.coor),color = vector_of_colors[i])
where vector_of_colors contains varying colors.
I have looked at using ggrepel and its geom_label_repel() or geom_text_repel() functions, but being unfamiliar with ggplot2 I could not get it to work. Below is a simplification of my code so that it may be reproducible. I could not include an image of the actual curves I am trying to add labels to since I do not have 10 reputation.
# CREATION OF DATA
plevel0.5 = cbind(c(0,1),c(0,1))
colnames(plevel0.5) = c("x","y")
plevel0.8 = cbind(c(0.5,3),c(0.5,1.5))
colnames(plevel0.8) = c("x","y")
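# geom_line() needs data frames, so convert the matrices defined above
data = list(data1 = as.data.frame(plevel0.5), data2 = as.data.frame(plevel0.8))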
# CREATION OF PLOT
plot = ggplot()
for (i in 1:2) {
plot = plot + geom_line(data=data[[i]],mapping=aes(x=x,y=y))
}
Thank you in advance and let me know what needs to be clarified.
EDIT :
I have now attempted the following :
Using bind_rows(), I have created a single dataframe with columns x.coor and y.coor as well as a column called "groups" detailing the p-level of each coordinate.
This is what I have tried:
plot = ggplot(data) +
geom_line(aes(coors.x,coors.y,group=groups,color=groups)) +
geom_text_repel(aes(label=groups))
But it gives me the following error:
geom_text_repel requires the following missing aesthetics: x and y
I do not know how to specify x and y in the correct way since I thought it did this automatically. Any tips?
Your approach is probably a bit too complicated. As far as I understand it, you could use a single dataset and the group aesthetic to get the same result you are trying to achieve with your for loop and multiple geom_line calls. To this end I use dplyr::bind_rows to bind your datasets together. Whether ggrepel is needed depends on your real dataset. In my code below I simply use geom_text to add a label at the rightmost point of each line:
plevel0.5 <- data.frame(x = c(0, 1), y = c(0, 1))
plevel0.8 <- data.frame(x = c(0.5, 3), y = c(0.5, 1.5))
library(dplyr)
library(ggplot2)
data <- list(data1 = plevel0.5, data2 = plevel0.8) |>
bind_rows(.id = "id")
ggplot(data, aes(x = x, y = y, group = id)) +
geom_line(aes(color = id)) +
geom_text(data = ~ group_by(.x, id) |> filter(x %in% max(x)), aes(label = id), vjust = -.5, hjust = .5)
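Regarding the error in the EDIT: a layer only inherits aesthetics declared in ggplot(), not those set inside another layer, so the text layer never received x and y. A sketch of one way around that, assuming the column names x.coor, y.coor and groups from the question (the toy values are made up). geom_text is used here; ggrepel::geom_text_repel should be a drop-in replacement if the labels overlap:

```r
library(dplyr)
library(ggplot2)

# Toy stand-in for the bound data frame described in the question
curves <- bind_rows(
  list(
    `0.5` = data.frame(x.coor = c(0, 1),   y.coor = c(0, 1)),
    `0.8` = data.frame(x.coor = c(0.5, 3), y.coor = c(0.5, 1.5))
  ),
  .id = "groups"
)

# One label row per curve: its rightmost point
labels <- curves %>%
  group_by(groups) %>%
  slice_max(x.coor, n = 1, with_ties = FALSE) %>%
  ungroup()

# x and y are declared in ggplot(), so every layer inherits them
p <- ggplot(curves, aes(x = x.coor, y = y.coor, color = groups)) +
  geom_line(aes(group = groups)) +
  geom_text(data = labels, aes(label = groups),
            vjust = -0.5, show.legend = FALSE)
```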
Let's say I have a data frame:
df <- data.frame(x = c("A","B","C"), y = c(10,20,30))
and I wish to plot it with ggplot2 so that it looks like a histogram, where instead of plotting counts I plot the y column values from the data frame. (I don't mind whether the x column is a factor or a character column.)
I will add that I know how to reorder a bar chart in descending/ascending order, but ordering like a histogram (highest values in the middle, around the mean, and decreasing to both sides) is still beyond me.
I thought of transforming the data so that it fits a histogram, e.g. creating a vector with 10 "A" entries, 20 "B" and 30 "C" and then running a histogram on that. But that is not practical for what I'm trying to do, as it seems like a lazy and highly inefficient approach. Also, the df data frame is already huge, so multiplying it by millions is not going to be kind to my system.
This seems like a strange thing to want to do, since if the ordering is not already implicit in your x variables, then ordering as a bell curve is at best artificial. However, it's fairly trivial to implement if you really want to...
library(ggplot2)
df <- data.frame(yvals = floor(abs(rnorm(26)) * 100),
xvals = LETTERS,
stringsAsFactors = FALSE)
ggplot(data = df, aes(x = xvals, y = yvals)) + geom_bar(stat = "identity")
ordered <- order(df$yvals)
left_half <- ordered[seq(1, length(ordered), 2)]
right_half <- rev(ordered[seq(2, length(ordered), 2)])
new_order <- c(left_half, right_half)
df2 <- df[new_order,]
df2$xvals <- factor(df2$xvals, levels = df2$xvals)
ggplot(data = df2, aes(x = xvals, y = yvals)) + geom_bar(stat = "identity")
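The reordering steps above could be wrapped in a small helper and applied to the three-row data frame from the question. This is only a sketch: bell_order is a hypothetical name, not a function from any package, and geom_col() is used as the shorthand for geom_bar(stat = "identity").

```r
library(ggplot2)

# Hypothetical helper: returns row indices so that small values sit at the
# edges and large values in the middle (same alternating idea as above)
bell_order <- function(values) {
  ascending <- order(values)
  odd_pos   <- seq_along(ascending) %% 2 == 1
  # odd ranks fill the left side upwards, even ranks fill the right side downwards
  c(ascending[odd_pos], rev(ascending[!odd_pos]))
}

df <- data.frame(x = c("A", "B", "C"), y = c(10, 20, 30))

# Fix the axis order by setting the factor levels explicitly
df$x <- factor(df$x, levels = df$x[bell_order(df$y)])

p <- ggplot(df, aes(x = x, y = y)) +
  geom_col()
```

No rows are duplicated, so this should stay cheap even for a large data frame, unlike expanding the data to feed a real histogram.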
I have a dataset like this:
I want to show the trend, with the x-axis containing the year values and the y-axis the corresponding values from the columns, so maybe
ggplot(data,aes(year,bm))
I want to plot not just one column but maybe more of them. As a single plot seems too detailed, I wanted to use facet_grid to arrange the plots nicely next to each other. However, it did not work for my data, as I think I have no 'real' objects to compare.
Does anyone have an idea how I can use facet_grid so that it looks something like this (in my case p1=BM and p2=BMSW):
The problem is your data format. Here is an example with some fake data
library(tidyverse)
##Create some fake data
set.seed(3)
data <- tibble(
year = 1991:2020,
bm = rnorm(30),
bmsw = rnorm(30),
bmi = rnorm(30),
bmandinno = rnorm(30),
bmproc = rnorm(30),
bmart = rnorm(30)
)
##Gather the variables to create a long dataset
new_data <- data %>%
gather(model, value, -year)
##plot the data
ggplot(new_data, aes(x = year, y = value)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
facet_grid(~model)
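In current tidyr, gather() is superseded by pivot_longer(). A sketch of the same reshaping step with the newer function, assuming tidyr 1.0.0 or later; the data frame wide and its values are made up and much smaller than the example above:

```r
library(tidyr)
library(ggplot2)

# Hypothetical wide data: one row per year, one column per model
wide <- data.frame(
  year = 2001:2005,
  bm   = c(1.0, 1.2, 1.1, 1.5, 1.7),
  bmsw = c(0.4, 0.3, 0.6, 0.5, 0.8)
)

# Every column except `year` becomes a (model, value) pair
long <- pivot_longer(wide, cols = -year,
                     names_to = "model", values_to = "value")

p <- ggplot(long, aes(x = year, y = value)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  facet_grid(~ model)
```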
age <- rnorm(100, 0:100)
freq <- rnorm(100, 0:1)
char1<-stringi::stri_rand_strings(100, length = 1, pattern = "[abc]")
char2<-stringi::stri_rand_strings(100, length = 1, pattern = "[def]")
char3<-stringi::stri_rand_strings(100, length = 1, pattern = "[ghi]")
dftest <- data.frame(age, freq, char1, char2, char3)
dflist <- list(dftest, dftest, dftest, dftest, dftest)
This creates a sample data frame that demonstrates the problem I am having.
I would like to create scatterplots for age vs freq for each of the data frames in this list, but I want a different color for the points based on the value in columns "char#". I also need a separate trend line for values in each of these separate characteristics.
I also want to be able to do this based on combinations of different characteristics from different char columns. An example of this is 3*3=9 separate colors for each of the combinations, each with a different trend line.
How would this be done?
I hope this was reproducible and clear enough. I have only posted a few times, so I am still getting used to the format.
Thanks!
Let's start by creating a data frame that will allow us to show points with different colors:
df2 <- data.frame(age=rnorm(200,0:100),
freq=rnorm(200,0:1),id=rep(1:2,each=100))
Then we can plot like so:
plot(df2$age, df2$freq, col = df2$id, pch = 16)
We set col (color) equal to id (this would represent each data frame). pch is the point type (solid dots).
You can try dplyr for data preparing and ggplot for plotting. All functions are loaded via the tidyverse package:
library(tidyverse)
# age vs freq plus trendline for char1
as_tibble(dftest) %>%
ggplot(aes(age, freq, color=char1)) +
geom_point() +
geom_smooth(method = "lm")
# age vs freq plus trendline for combinations of char columns
as_tibble(dftest) %>%
unite(combi, char1, char2, char3, sep="-") %>%
ggplot(aes(age, freq, color=combi)) +
geom_point() +
geom_smooth(method = "lm")
# (no plot shown: too many combinations make the plot too busy)
# for the list of data frames: bind them and draw one facet per source
dflist %>%
bind_rows( .id = "df_source") %>%
ggplot(aes(age, freq, color=char1)) +
geom_point() +
geom_smooth(method = "lm", se=FALSE) +
facet_wrap(~df_source)
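If only some of the char columns should be combined, base R's interaction() may be a lighter alternative to unite(). A sketch with made-up, deterministic data (toy and its values are not from the question), combining two columns into 3*3=9 groups:

```r
library(ggplot2)

set.seed(1)
toy <- data.frame(
  age   = rnorm(90, mean = 50, sd = 15),
  freq  = rnorm(90),
  char1 = rep(c("a", "b", "c"), length.out = 90),
  char2 = rep(c("d", "e", "f"), each = 30)
)

# One factor level per combination of char1 and char2, e.g. "a-d"
toy$combo <- interaction(toy$char1, toy$char2, sep = "-")

# color = combo gives each combination its own color and its own trend line
p <- ggplot(toy, aes(age, freq, color = combo)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
```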
I am fairly new to R and am attempting to plot two time series lines simultaneously (using different colors, of course) making use of ggplot2.
I have 2 data frames. The first one has 'Percent change for X' and 'Date' columns. The second one has 'Percent change for Y' and 'Date' columns, i.e., both have a 'Date' column with the same values, whereas the 'Percent Change' columns have different values.
I would like to plot the 'Percent Change' columns against 'Date' (common to both) using ggplot2 on a single plot.
The examples that I found online made use of the same data frame with different variables to achieve this, I have not been able to find anything that makes use of 2 data frames to get to the plot. I do not want to bind the two data frames together, I want to keep them separate. Here is the code that I am using:
ggplot(jobsAFAM, aes(x=jobsAFAM$data_date, y=jobsAFAM$Percent.Change)) + geom_line() +
xlab("") + ylab("")
But this code produces only one line and I would like to add another line on top of it.
Any help would be much appreciated.
TIA.
ggplot allows you to have multiple layers, and that is what you should take advantage of here.
In the plot created below, you can see that there are two geom_line statements hitting each of your datasets and plotting them together on one plot. You can extend that logic if you wish to add any other dataset, plot, or even features of the chart such as the axis labels.
library(ggplot2)
jobsAFAM1 <- data.frame(
data_date = runif(5,1,100),
Percent.Change = runif(5,1,100)
)
jobsAFAM2 <- data.frame(
data_date = runif(5,1,100),
Percent.Change = runif(5,1,100)
)
ggplot() +
geom_line(data = jobsAFAM1, aes(x = data_date, y = Percent.Change), color = "red") +
geom_line(data = jobsAFAM2, aes(x = data_date, y = Percent.Change), color = "blue") +
xlab('data_date') +
ylab('percent.change')
If both data frames have the same column names, pass one data frame to the ggplot() call and name the x and y variables inside aes() of that call. Then add a first geom_line() for the first line and a second geom_line() with data=df2 (where df2 is your second data frame). If you need the lines in different colors, set color= to a name for each line inside aes() of each geom_line().
df1<-data.frame(x=1:10,y=rnorm(10))
df2<-data.frame(x=1:10,y=rnorm(10))
ggplot(df1,aes(x,y))+geom_line(aes(color="First line"))+
geom_line(data=df2,aes(color="Second line"))+
labs(color="Legend text")
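With the color names mapped inside aes(), ggplot picks the actual colors itself. If specific colors are wanted, scale_color_manual() can map each legend name to a color. A sketch reusing the same setup; the seed and the choice of red and blue are arbitrary:

```r
library(ggplot2)

set.seed(42)
df1 <- data.frame(x = 1:10, y = rnorm(10))
df2 <- data.frame(x = 1:10, y = rnorm(10))

p <- ggplot(df1, aes(x, y)) +
  geom_line(aes(color = "First line")) +
  geom_line(data = df2, aes(color = "Second line")) +
  # the names in `values` must match the strings used inside aes()
  scale_color_manual(values = c("First line" = "red",
                                "Second line" = "blue")) +
  labs(color = "Legend text")
```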
I prefer using the ggfortify library. It is a ggplot2 wrapper that recognizes the type of object inside the autoplot function and chooses the best ggplot methods to plot. At least I don't have to remember the syntax of ggplot2.
library(ggfortify)
ts1 <- 1:100
ts2 <- 1:100*0.8
autoplot(ts( cbind(ts1, ts2) , start = c(2010,5), frequency = 12 ),
facets = FALSE)
I know this is old but it is still relevant. You can take advantage of reshape2::melt to change the dataframe into a more friendly structure for ggplot2.
Advantages:
allows you to plot any number of lines
each line with a different color
adds a legend for each line
with only one call to ggplot/geom_line
Disadvantage:
an extra package(reshape2) required
melting is not so intuitive at first
For example:
jobsAFAM1 <- data.frame(
data_date = seq.Date(from = as.Date('2017-01-01'),by = 'day', length.out = 100),
Percent.Change = runif(100,1,100)
)
jobsAFAM2 <- data.frame(
data_date = seq.Date(from = as.Date('2017-01-01'),by = 'day', length.out = 100),
Percent.Change = runif(100,1,100)
)
jobsAFAM <- merge(jobsAFAM1, jobsAFAM2, by="data_date")
jobsAFAMMelted <- reshape2::melt(jobsAFAM, id.var='data_date')
ggplot(jobsAFAMMelted, aes(x=data_date, y=value, col=variable)) + geom_line()
This is old, but here is an update with the newer tidyverse workflow, which is not mentioned above.
library(tidyverse)
jobsAFAM1 <- tibble(
date = seq.Date(from = as.Date('2017-01-01'),by = 'day', length.out = 5),
Percent.Change = runif(5, 0,1)
) %>%
mutate(serial='jobsAFAM1')
jobsAFAM2 <- tibble(
date = seq.Date(from = as.Date('2017-01-01'),by = 'day', length.out = 5),
Percent.Change = runif(5, 0,1)
) %>%
mutate(serial='jobsAFAM2')
jobsAFAM <- bind_rows(jobsAFAM1, jobsAFAM2)
ggplot(jobsAFAM, aes(x=date, y=Percent.Change, col=serial)) + geom_line()
@Chris Njuguna
tidyr::gather() is the tidyverse tool for turning a wide data frame into a long, tidy layout (it has since been superseded by tidyr::pivot_longer()); ggplot can then plot multiple series from it.
An alternative is to bind the data frames and add a column giving the type of variable each one represents. This lets you use the full dataset in a tidier way:
library(ggplot2)
library(dplyr)
df1 <- data.frame(dates = 1:10, Variable = rnorm(10, mean = 0.5))
df2 <- data.frame(dates = 1:10, Variable = rnorm(10, mean = -0.5))
df3 <- df1 %>%
mutate(Type = 'a') %>%
bind_rows(df2 %>%
mutate(Type = 'b'))
ggplot(df3,aes(y = Variable,x = dates,color = Type)) +
geom_line()