I have a dataset like this:
I want to show the trend, with the year values on the x-axis and the corresponding column values on the y-axis, so maybe something like:
ggplot(data,aes(year,bm))
I want to plot not just one column but several of them. Since a single plot would be too crowded, I wanted to use facet_grid to arrange the plots next to each other. However, it did not work for my data, I think because I have no 'real' grouping variable to facet on.
Does anyone have an idea how I can use facet_grid so that it looks something like this (in my case p1=BM and p2=BMSW):
The problem is your data format: facet_grid needs a single column that identifies the panel, so the data has to be in long format. Here is an example with some fake data:
library(tidyverse)
##Create some fake data
set.seed(3)
data <- tibble(
year = 1991:2020,
bm = rnorm(30),
bmsw = rnorm(30),
bmi = rnorm(30),
bmandinno = rnorm(30),
bmproc = rnorm(30),
bmart = rnorm(30)
)
##Gather the variables to create a long dataset
new_data <- data %>%
gather(model, value, -year)
##plot the data
ggplot(new_data, aes(x = year, y = value)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
facet_grid(~model)
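In current tidyr releases gather() is superseded by pivot_longer(). The sketch below shows the same reshaping step with pivot_longer(), assuming only two of the columns (bm and bmsw) for brevity; the resulting long layout is the same as with gather(), although the row order differs.

```r
library(tidyr)
library(ggplot2)

set.seed(3)
# Fake data with two of the columns from the example above
data <- data.frame(
  year = 1991:2020,
  bm   = rnorm(30),
  bmsw = rnorm(30)
)

# One row per (year, model) pair; every column except year is stacked
new_data <- pivot_longer(data, cols = -year,
                         names_to = "model", values_to = "value")

# One panel per model, as with the gather() version
p <- ggplot(new_data, aes(x = year, y = value)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  facet_grid(~model)
```

Printing p draws the plot; the column named in names_to is the one facet_grid splits on.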
I have a longitudinal data where I would like to make the expected value curve. In the x-axis I have time and in the y-axis I have a continuous variable.
Without data it is hard to reproduce your problem, so first I generated some random data:
df <- data.frame(Age = sample(1:50),
variable = runif(50, 0, 1))
I am not sure if this is what you want, but you can use geom_smooth to create an expected value curve using this code:
library(tidyverse)
df %>%
ggplot(aes(x = Age, y = variable)) +
geom_point() +
geom_smooth()
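If "expected value curve" means the average of the response at each time point across subjects, an alternative to smoothing is to plot the per-time mean directly. This is only a sketch under that assumption; the data frame long_df and its columns id, time and value are made up for illustration.

```r
library(ggplot2)

set.seed(1)
# Hypothetical longitudinal data: 20 subjects, each measured at 6 time points
long_df <- data.frame(
  id   = rep(1:20, each = 6),
  time = rep(0:5, times = 20)
)
long_df$value <- 2 + 0.5 * long_df$time + rnorm(nrow(long_df))

# Mean per time point, computed explicitly so it can be inspected
means <- aggregate(value ~ time, data = long_df, FUN = mean)

# Individual trajectories in the background, mean curve on top
p <- ggplot(long_df, aes(x = time, y = value)) +
  geom_line(aes(group = id), alpha = 0.3) +
  stat_summary(fun = mean, geom = "line", colour = "red")
```

stat_summary() computes the same per-time means as the aggregate() call, so the red line passes through the values stored in means.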
Output:
I have a list of model-output in R that I want to plot using ggplot. I want to produce a scatter plot within which every column of data is a different colour. In the example here, I have three model outputs which I want to plot against 'measured'. What I want in the end is a scatter with three different 'clouds' of points, each of which is a different colour. Here is a reproducible example of what I have so far:
library(ggplot2)
library(tidyverse)
#data for three different models as well as a column for 'observations' (measured)
output <- list(model1 = 1:10, model2 = 22:31, model3=74:83)
#create the dataframe
df <- data.frame(
predicted = output,
measured = 1:length(output[[1]]),
#year = as.factor(data$year),
#site = data$site
#model = as.factor(names(output)),
#stringsAsFactors=TRUE)
fix.empty.names = TRUE)
#fix the column names
colnames(df)<-names(output)
#plot the data with a different colour for each column of data
p <- ggplot(df) +
geom_point(
aes(
measured,
predicted,
colour =colnames(df)
)
) +
ylim(-5, 90)+
theme_minimal()
p + geom_hline(yintercept=0)
print(p)
I am getting the error: Error in FUN(X[[i]], ...) : object 'measured' not found
Why is 'measured' not being found? I can see it in the df.
Perhaps I need to collapse all the model outputs into one column and create a 'factor' column to assign each data point to a particular model?
The first issue is that your output list only has as many elements as you have models, so it has no name for the last "measured" column and that gets overwritten with NA.
Compare:
colnames(df)<-names(output) # NA in last col
colnames(df)<-c(names(output), "measured") # fixed
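The difference between the two assignments can be checked directly. A minimal sketch, using the output list from the question; the names broken and fixed are introduced here only for illustration.

```r
output <- list(model1 = 1:10, model2 = 22:31, model3 = 74:83)

# Four columns: three predictions plus 'measured'
df <- data.frame(predicted = output, measured = seq_along(output[[1]]))

# Only three names for four columns: the last name becomes NA
broken <- df
colnames(broken) <- names(output)

# One name per column: 'measured' survives
fixed <- df
colnames(fixed) <- c(names(output), "measured")

print(colnames(broken))
print(colnames(fixed))
```

With the NA name in place, aes(measured, ...) has no column called measured to look up, which is where the "object 'measured' not found" error comes from.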
Then, to plot your data in ggplot2 it's almost always better to convert to longer, "tidy" format, with one row per observation. pivot_longer from tidyr is great for that.
df %>%
pivot_longer(-measured, # don't pivot "measured" -- keep in every row
names_to = "model",
values_to = "predicted") %>%
ggplot() +
geom_point(
aes(
measured,
predicted,
colour = model
)
) +
ylim(-5, 90)+
theme_minimal() +
geom_hline(yintercept=0)
You renamed the columns of your data frame:
colnames(df)<-names(output)
So the column names used in aes() are no longer found.
I reorganized your object into a data frame that can be easily understood by ggplot2. Do not hesitate to look at your objects.
Here is one option :
library(ggplot2)
library(tidyverse)
#data for three different models as well as a column for 'observations' (measured)
output <- list(model1 = 1:10, model2 = 22:31, model3=74:83)
#create the dataframe
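df <- data.frame(
predicted = unlist(output),
measured = 1:length(unlist(output)),
# repeat each model name once per value so the labels line up with unlist(output)
model = rep(names(output), lengths(output))
)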
#plot the data with a different colour for each column of data
p <- ggplot(df) +
geom_point(aes(measured, predicted,colour = model)) +
ylim(-5, 90)+
theme_minimal()
# store the reference line in p so that print(p) includes it
p <- p + geom_hline(yintercept=0)
print(p)
If you add this line:
facet_grid(~model) +
you get one panel per model, which sounds like what you were asking for.
Let's say I have a data frame:
df <- data.frame(x = c("A","B","C"), y = c(10,20,30))
and I wish to plot it with ggplot2 so that I get a plot like a histogram, where instead of plotting counts I plot the y column values from the data frame. (I don't mind whether the x column is a factor or a character column.)
I will add that I know how to reorder a bar chart in descending or ascending order, but ordering it like a histogram (highest values in the middle, around the mean, and decreasing to both sides) is still beyond me.
I thought of transforming the data so that it fits a histogram, for example by creating a vector with 10 "A" objects, 20 "B" and 30 "C" and then running a histogram on that. But that is not practical for what I'm trying to do, as it seems like a lazy and highly inefficient way to do it. Also, the df data frame is huge as it is, so multiplying it by millions is not going to be kind to my system.
This seems like a strange thing to want to do, since if the ordering is not already implicit in your x variables, then ordering as a bell curve is at best artificial. However, it's fairly trivial to implement if you really want to...
library(ggplot2)
df <- data.frame(yvals = floor(abs(rnorm(26)) * 100),
xvals = LETTERS,
stringsAsFactors = FALSE)
ggplot(data = df, aes(x = xvals, y = yvals)) + geom_bar(stat = "identity")
ordered <- order(df$yvals)
left_half <- ordered[seq(1, length(ordered), 2)]
right_half <- rev(ordered[seq(2, length(ordered), 2)])
new_order <- c(left_half, right_half)
df2 <- df[new_order,]
df2$xvals <- factor(df2$xvals, levels = df2$xvals)
ggplot(data = df2, aes(x = xvals, y = yvals)) + geom_bar(stat = "identity")
In R, both ggplot2 and lattice package provide possibilities to visualize data not only by their x and y position but also considering an additional factor, changing the color, size or shape of the observation representation (point, smooth line, etc.) or splitting the visualization into separate diagrams along this factor.
Example for ggplot:
require(ggplot2)
ggplot(diamonds, aes(x = carat, y = price, col=clarity)) +
geom_point(alpha = .3)
Example for lattice:
require(lattice)
require(mlmRev); data(Chem97, package = "mlmRev")
densityplot(~ gcsescore | factor(score), Chem97, groups = gender,
plot.points = FALSE, auto.key = TRUE)
Obviously, these really easy ways of differentiating data by another factor are designed for use with one single dataframe containing all observations to be shown. However, I more often have separate data inputs, in the form of separate dataframes, containing different columns to be represented as x and y. The third factor for a separation in the plot would then be the dataframe, i.e. the data source itself. The only solution I have been able to find so far is to add another column to each source dataframe containing only the third factor, i.e. the data source (so every cell of this column holds the same string), and then merge all data into one dataframe. ggplot2 and lattice are then able to separate the data again by this third factor and visualize them separately, as desired.
Now to the final problem: This seems to be a really poor workflow and is not very efficient for bigger amounts of data. Is there perhaps an alternative way to achieve the same result or at least a way to efficiently automate the last described workflow?
It's usually a good idea to merge several data sources into one when working with ggplot. There are of course exceptions to this, and ggplot gives you the tools to deal with these cases.
That said, it's possible to pass the data argument to each geom_*
A general rule I use is that if different data sources are used in the same geom_* then they have to be combined; if they will be used in different geom_* calls, they can (and maybe should) stay separate.
Bind data sources to use in the same geom_*
df1 <- data.frame(group = LETTERS[1:3],
obs = runif(3))
df2 <- data.frame(group = LETTERS[1:3],
obs = runif(3))
library(purrr)
dfT <- list(df1 = df1, df2 = df2) %>%
map_df(~rbind(.x), .id = 'src')
library(ggplot2)
ggplot(dfT, aes(x = group, y = obs)) +
geom_line(aes(group = src, color = src), size = 1)
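The binding step can also be written with dplyr::bind_rows(), which accepts a named list and an .id argument, so the source column does not have to be added to each data frame by hand. A minimal sketch, reusing the df1/df2 layout from above:

```r
library(dplyr)

set.seed(42)
df1 <- data.frame(group = LETTERS[1:3], obs = runif(3))
df2 <- data.frame(group = LETTERS[1:3], obs = runif(3))

# The list names become the values of the new 'src' column
dfT <- bind_rows(list(df1 = df1, df2 = df2), .id = "src")
```

The resulting dfT has the same src, group and obs columns as the map_df() version and can be passed to the same ggplot() call.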
Use different data sources
df1 <- data.frame(group = LETTERS[1:3],
hValue = runif(3))
df2 <- data.frame(group = rep(LETTERS[1:3], each = 3),
pValue = runif(9))
library(ggplot2)
ggplot() +
geom_line(data = df1, aes(x = group, y = hValue, group = 1), size = 1) +
geom_point(data = df2, aes(x = group, y = pValue, color = group))
I am fairly new to R and am attempting to plot two time series lines simultaneously (using different colors, of course) making use of ggplot2.
I have 2 data frames. The first one has 'Percent change for X' and 'Date' columns. The second one has 'Percent change for Y' and 'Date' columns as well, i.e., both have a 'Date' column with the same values whereas the 'Percent Change' columns have different values.
I would like to plot the 'Percent Change' columns against 'Date' (common to both) using ggplot2 on a single plot.
The examples that I found online made use of the same data frame with different variables to achieve this, I have not been able to find anything that makes use of 2 data frames to get to the plot. I do not want to bind the two data frames together, I want to keep them separate. Here is the code that I am using:
ggplot(jobsAFAM, aes(x=jobsAFAM$data_date, y=jobsAFAM$Percent.Change)) + geom_line() +
xlab("") + ylab("")
But this code produces only one line and I would like to add another line on top of it.
Any help would be much appreciated.
TIA.
ggplot allows you to have multiple layers, and that is what you should take advantage of here.
In the plot created below, you can see that there are two geom_line statements hitting each of your datasets and plotting them together on one plot. You can extend that logic if you wish to add any other dataset, plot, or even features of the chart such as the axis labels.
library(ggplot2)
jobsAFAM1 <- data.frame(
data_date = runif(5,1,100),
Percent.Change = runif(5,1,100)
)
jobsAFAM2 <- data.frame(
data_date = runif(5,1,100),
Percent.Change = runif(5,1,100)
)
ggplot() +
geom_line(data = jobsAFAM1, aes(x = data_date, y = Percent.Change), color = "red") +
geom_line(data = jobsAFAM2, aes(x = data_date, y = Percent.Change), color = "blue") +
xlab('data_date') +
ylab('percent.change')
If both data frames have the same column names, then pass one data frame to the ggplot() call and name the x and y values inside aes() of that ggplot() call. Then add a first geom_line() for the first line and a second geom_line() call with data=df2 (where df2 is your second data frame). If you need the lines in different colors, add color= with a name for each line inside aes() of each geom_line().
df1<-data.frame(x=1:10,y=rnorm(10))
df2<-data.frame(x=1:10,y=rnorm(10))
ggplot(df1,aes(x,y))+geom_line(aes(color="First line"))+
geom_line(data=df2,aes(color="Second line"))+
labs(color="Legend text")
I prefer using the ggfortify library. It is a ggplot2 wrapper that recognizes the type of object inside the autoplot function and chooses the best ggplot methods to plot. At least I don't have to remember the syntax of ggplot2.
library(ggfortify)
ts1 <- 1:100
ts2 <- 1:100*0.8
autoplot(ts( cbind(ts1, ts2) , start = c(2010,5), frequency = 12 ),
facets = FALSE)
I know this is old but it is still relevant. You can take advantage of reshape2::melt to change the dataframe into a more friendly structure for ggplot2.
Advantages:
allows you plot any number of lines
each line with a different color
adds a legend for each line
with only one call to ggplot/geom_line
Disadvantage:
an extra package(reshape2) required
melting is not so intuitive at first
For example:
jobsAFAM1 <- data.frame(
data_date = seq.Date(from = as.Date('2017-01-01'),by = 'day', length.out = 100),
Percent.Change = runif(100,1,100)
)
jobsAFAM2 <- data.frame(
data_date = seq.Date(from = as.Date('2017-01-01'),by = 'day', length.out = 100),
Percent.Change = runif(100,1,100)
)
jobsAFAM <- merge(jobsAFAM1, jobsAFAM2, by="data_date")
jobsAFAMMelted <- reshape2::melt(jobsAFAM, id.var='data_date')
ggplot(jobsAFAMMelted, aes(x=data_date, y=value, col=variable)) + geom_line()
This is old, but here is an update with the newer tidyverse workflow, which is not mentioned above.
library(tidyverse)
jobsAFAM1 <- tibble(
date = seq.Date(from = as.Date('2017-01-01'),by = 'day', length.out = 5),
Percent.Change = runif(5, 0,1)
) %>%
mutate(serial='jobsAFAM1')
jobsAFAM2 <- tibble(
date = seq.Date(from = as.Date('2017-01-01'),by = 'day', length.out = 5),
Percent.Change = runif(5, 0,1)
) %>%
mutate(serial='jobsAFAM2')
jobsAFAM <- bind_rows(jobsAFAM1, jobsAFAM2)
ggplot(jobsAFAM, aes(x=date, y=Percent.Change, col=serial)) + geom_line()
@Chris Njuguna
tidyr::gather() is the tidyverse function for turning a wide dataframe into a long, tidy layout, after which ggplot can plot multiple series.
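For data that starts out wide (one column per series), the reshaping step could look like the sketch below. It uses pivot_longer(), which supersedes gather() in current tidyr; the data frame jobs_wide and its values are made up for illustration.

```r
library(tidyr)
library(ggplot2)

set.seed(1)
dates <- seq.Date(from = as.Date("2017-01-01"), by = "day", length.out = 5)

# Hypothetical wide layout: one column per series
jobs_wide <- data.frame(
  date      = dates,
  jobsAFAM1 = runif(5),
  jobsAFAM2 = runif(5)
)

# Long layout: one row per (date, series) pair
jobs_long <- pivot_longer(jobs_wide, cols = -date,
                          names_to = "serial", values_to = "Percent.Change")

p <- ggplot(jobs_long, aes(x = date, y = Percent.Change, col = serial)) +
  geom_line()
```

The column named in names_to plays the same role as the serial column built with mutate() in the answer above.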
An alternative is to bind the dataframes and add a column that labels which variable each one represents. This lets you use the full dataset in a tidier way:
library(ggplot2)
library(dplyr)
df1 <- data.frame(dates = 1:10,Variable = rnorm(mean = 0.5,10))
df2 <- data.frame(dates = 1:10,Variable = rnorm(mean = -0.5,10))
df3 <- df1 %>%
mutate(Type = 'a') %>%
bind_rows(df2 %>%
mutate(Type = 'b'))
ggplot(df3,aes(y = Variable,x = dates,color = Type)) +
geom_line()