GGplot2: plotting values per year of specific columns in a dataframe

GGplot2: plotting values per year of specific columns in a dataframe - r

I am trying to plot values on the y axis against years on the x axis with ggplot2.
This is the dataset: https://drive.google.com/file/d/1nJYtXPrxD0xvq6rBz2NXlm4Epi52rceM/view?usp=sharing
I want to plot the values of specific countries.
It won't work by just specifying year as the x axis and a country's values on the y axis. I'm reading I need to melt the data frame, so I did that, but it's now in a format that doesn't seem convenient to get the job done.
I'm assuming I haven't correctly melted, but I have a hard time finding what I need to specifically do.
What I did beforehand is manually transpose the data and make the years a column, as well as all the countries.
This is the dataset transposed:
https://drive.google.com/file/d/131wNlubMqVEG9tID7qp-Wr8TLli9KO2Q/view?usp=sharing
Here's how I melted:
inv_melt.data <- melt(investments_t.data, id.vars="Year")
ggplot() +
geom_line(aes(x=Year, y=value), data = inv_melt.data)
The plot shows the aggregated values of all countries per year, but I want them per country in such a manner that I can also select to plot certain countries only.
How do I utilize melt in such a manner? Could someone walk me through this?

There are no columns named "Year" in the linked to data set, there are columns per year. So it need to be melted by "country" and then the "variable" edited with sub.
inv_melt.data <- reshape2::melt(investments_t.data, id.vars="country")
inv_melt.data$variable <- as.integer(sub("^X", "", inv_melt.data$variable))
ggplot(inv_melt.data, aes(variable, value, color = country)) +
geom_line(show.legend = FALSE)
Edit.
The following code keeps only some countries, filtering out the ones with more missing values.
i <- sapply(investments_t.data[-1], function(x) sum(is.na(x)) == 0)
i <- c(1, which(i))
inv_melt.data <- reshape2::melt(investments_t.data[i], id.vars = "Year")
ggplot(inv_melt.data, aes(Year, value, color = variable)) +
geom_line(show.legend = FALSE)

Related

Summarising data before plotting with geom_tile() renders different results

I have noticed that when plotting with ggplot2's geom_tile(), summarising the data before plotting renders a completely different result than when it is not pre-summarised. I don't understand why.
For a dataframe with three columns, year (character), state (character) and profit (numeric), consider the following examples:
# Plot straight away
data %>%
ggplot(aes(x=year, y=state)) + geom_tile(aes(fill=profit))
# Summarise before plotting
data %>% group_by(year, state) %>% summarize(profit_mean = mean(profit)) %>%
ungroup() %>%
ggplot(aes(x=year, y=state)) + geom_tile(aes(fill=profit_mean))
These two examples render two different tile plots - the values are quite different. I thought that these two methods of plotting would be analogous and that ggplot2 would take a mean automatically - is that not so?
I tried reproducing this error on a smaller subset of data, but it didn't appear. What could be going on here?

OP, this was a very interesting question.
First, let's get this out of the way. It is clear what plotting the summary of your data is plotting just that: the summary. You are summarizing via mean, so what is plotted equals the mean of the values for each tile.
The actual question here is: If you have a dataset containing more than one value per tile, what is the result of plotting the "non-summarized" dataset?
User #akrun is correct: the default stat used for geom_tile is stat="identity", but it might not be clear what that exactly means. It says it "leaves the data unchanged"... but that's not clear what that means here.
Illustrative Example Dataset
For purposes of demonstration, I'll create an illustrative dataset, which will answer the question very clearly. I'm creating two individual datasets df1 and df2, which each contain 4 "tiles" of data. The difference between these is that the values themselves for the tiles are different. I've include text labels on each tile for more clarity.
library(ggplot2)
library(cowplot)
df1 <- data.frame(
x=rep(paste("Test",1:2), 2),
y=rep(c("A", "B"), each=2),
value=c(5,15,20,25)
)
df2 <- data.frame(
x=rep(paste("Test",1:2), 2),
y=rep(c("A", "B"), each=2),
value=c(10,5,25,15)
)
tile1 <- ggplot(df1, aes(x,y, fill=value, label=value)) +
geom_tile() + geom_text() + labs(title="df1")
tile2 <- ggplot(df2, aes(x,y, fill=value, label=value)) +
geom_tile() + geom_text() + labs(title="df2")
plot_grid(tile1, tile2)
Plotting the Combined Data Frame
Each of the data frames df1 and df2 contain only one value per tile, so in order to see how that changes when we have more than one value per tile, we need to combine them into one so that each tile will contain 2 values. In this example, we are going to combine them in two ways: first df1 then df2, and the other way is df2 first, then df1.
df12 <- rbind(df1, df2)
df21 <- rbind(df2, df1)
Now, if we plot each of those as before and compare, the reason for the discrepancy the OP posted should be quite obvious. I'm including the value for each tile for each originating dataset to make things super-clear.
tile12 <- ggplot(df12, aes(x,y, fill=value, label=value)) +
geom_tile() + labs(title="df1, then df2") +
geom_text(data=df1, aes(label=paste("df1:",value)), nudge_y=0.1) +
geom_text(data=df2, aes(label=paste("df2:",value)), nudge_y=-0.1)
tile21 <- ggplot(df21, aes(x,y, fill=value, label=value)) +
geom_tile() + labs(title="df2, then df1") +
geom_text(data=df1, aes(label=paste("df1:",value)), nudge_y=0.1) +
geom_text(data=df2, aes(label=paste("df2:",value)), nudge_y=-0.1)
plot_grid(tile12, tile21)
Note that the legend colorbar value does not change, so it's not doing an addition. Plus, since we know it's stat="identity", we know this should not be the case. When we use the dataset that contains first observations from df1, then observations from df2, the value plotted is the one from df2. When we use the dataset that contains observations first from df2, then from df1, the value plotted is the one from df1.
Given this piece of information, it can be clear that the value shown in geom_tile() when using stat="identity" (default argument) corresponds to the last observation for that particular tile represented in the data frame.
So, that's the reason why your plot looks odd OP. You can either summarize beforehand as you have done, or use stat_summary(geom="tile"... to do the transformation in one go within ggplot.

Graphing different variables in the same graph R- ggplot2

I have several datasets and my end goal is to do a graph out of them, with each line representing the yearly variation for the given information. I finally joined and combined my data (as it was in a per month structure) into a table that just contains the yearly means for each item I want to graph (column depicting year and subsequent rows depicting yearly variation for 4 different elements)
I have one factor that is the year and 4 different variables that read yearly variations, thus I would like to graph them on the same space. I had the idea to joint the 4 columns into one by factor (collapse into one observation per row and the year or factor in the subsequent row) but seem unable to do that. My thought is that this would give a structure to my y axis. Would like some advise, and to know if my approach to the problem is effective. I am trying ggplot2 but does not seem to work without a defined (or a pre defined range) y axis. Thanks

I would suggest next approach. You have to reshape your data from wide to long as next example. In that way is possible to see all variables. As no data is provided, this solution is sketched using dummy data. Also, you can change lines to other geom you want like points:
library(tidyverse)
set.seed(123)
#Data
df <- data.frame(year=1990:2000,
v1=rnorm(11,2,1),
v2=rnorm(11,3,2),
v3=rnorm(11,4,1),
v4=rnorm(11,5,2))
#Plot
df %>% pivot_longer(-year) %>%
ggplot(aes(x=factor(year),y=value,group=name,color=name))+
geom_line()+
theme_bw()
Output:

We could use melt from reshape2 without loading multiple other packages
library(reshape2)
library(ggplot2)
ggplot(melt(df, id.var = 'year'), aes(x = factor(year), y = value,
group = variable, color = variable)) +
geom_line()
-output plot
Or with matplot from base R
matplot(as.matrix(df[-1]), type = 'l', xaxt = 'n')
data
set.seed(123)
df <- data.frame(year=1990:2000,
v1=rnorm(11,2,1),
v2=rnorm(11,3,2),
v3=rnorm(11,4,1),
v4=rnorm(11,5,2))

How to barplot select rows of data from a dataframe in R?

This is my first time submitting a question, so apologies in advance if my formatting is not optimal.
I have a dataframe with roughly 6,000 rows of data in 2 columns, and I want to be able to pull out individual rows (and multiple rows together) to barplot.
I read my file in as a dataframe, here is a very small subset:
gene log2
1 SMa0002 0.457418
2 SMa0005 1.116950
3 SMa0007 0.686749
4 SMa0009 0.169450
5 SMa0011 0.393365
6 SMa0013 0.601940
So what I would want to be able to do is have a barplot where the x axis is a number of genes (SMaXXX, SMaXXX, SMaXXX, etc.), and the y-axis is the log2 column. It only has (+) values displayed, but there are (-) values as well. I have no real preference about whether I use barplot or geom_bar in ggplot2, or another plotter.
I know how to just plot the dataframe;
ggplot(df, aes(x = gene, y = log2)) + geom_bar(stat = "identity")
I've tried playing around with using 'match' but I haven't been able to figure out how to make that work. Ideally the code is versatile so I can just punch in different SMaXXXX codes to generate many different plots.
Thanks for reading!

It seems that you just need a way to subset your data.frame when plotting, right?
Let's assume you've got a vector subset.genes of the genes you need to plot:
df=data.frame(gene=c("SMa0002","SMa0005","SMa0006","SMa0007","SMa0011","SMa0013"),
"log2"=runif(6), stringsAsFactors=F)
subset.genes=sample(unique(df$gene), 4, replace=F)
A couples of ways:
1°) Inside ggplot2
ggplot(df, aes(x = gene, y = log2)) + geom_bar(stat = "identity") +
scale_x_discrete(limits=subset.genes)
2°) before:
df2 <- subset(df, gene %in% subset.genes)
ggplot(df2, aes(x = gene, y = log2)) + geom_bar(stat = "identity")

plotting multiple plot in R for different calendar date

I have about 20 years of daily data in a time series. It has columns Date, rainfall and other data.
I am trying plot rainfall vs Time. I want to get 20 line plots with different colours and legend is generated that show the years in one graph. I tried the following codes but it is not giving me the desired results. Any suggestion to fix my issue would be most welcome
library(ggplot2)
library(seas)
data(mscdata)
p<-ggplot(data=mscdata,aes(x=date,y=precip,group=year,color=year))
p+geom_line()+scale_x_date(labels=date_format("%m"),breaks=date_breaks("1 months"))

It doesnt look great but here's a method. We first coerce the data into dates in the same year:
mscdata$dayofyear <- as.Date(format(mscdata$date, "%j"), format = "%j")
Then we plot:
library(ggplot2)
library(scales)
p <- ggplot(data = mscdata, aes(x = dayofyear, y = precip, group = year, color = year))
p + geom_line() +
scale_x_date(labels = date_format("%m"), breaks = date_breaks("1 months"))

While I agree with #Jaap that this may not be the best way to depict these data, try to following:
mscdata$doy <- as.numeric(strftime(mscdata$date, format="%j"))
ggplot(data=mscdata,aes(x=doy,y=precip,group=year)) +
geom_line(aes(color=year))

Although the given answers are good answers to your questions as it stands, i don't think it will solve your problem. I think you should be looking at a different way to present the data. #Jaap already suggested using facets. Take for example this approach:
#first add a month column to your dataframe
mscdata$month <- format(mscdata$date, "%m")
#then plot it using boxplot with year on the X-axis and month as facet.
p1 <- ggplot(data = mscdata, aes(x = year, y = precip, group=year))
p1 + geom_boxplot(outlier.shape = 3) + facet_wrap(~month)
This will give you a graph per month, showing the rainfall per year next to one each other. Because i use boxplot, the peaks in rainfall show up as dots ('normal' rain events are inside box).
Another possible approach would be to use stat_summary.

Dynamically Set X limits on time plot

I am wondering how to dynamically set the x axis limits of a time series plot containing two time series with different dates. I have developed the following code to provide a reproducible example of my problem.
#Dummy Data
Data1 <- data.frame(Date = c("4/24/1995","6/23/1995","2/12/1996","4/14/1997","9/13/1998"), Area_2D = c(20,11,5,25,50))
Data2 <- data.frame(Date = c("6/23/1995","4/14/1996","11/3/1997","11/6/1997","4/15/1998"), Area_2D = c(13,15,18,25,19))
Data3 <- data.frame(Date = c("4/24/1995","6/23/1995","2/12/1996","4/14/1996","9/13/1998"), Area_2D = c(20,25,28,30,35))
Data4 <- data.frame(Date = c("6/23/1995","4/14/1996","11/3/1997","11/6/1997","4/15/1998"), Area_2D = c(13,15,18,25,19))
#Convert date column as date
Data1$Date <- as.Date(Data1$Date,"%m/%d/%Y")
Data2$Date <- as.Date(Data2$Date,"%m/%d/%Y")
Data3$Date <- as.Date(Data3$Date,"%m/%d/%Y")
Data4$Date <- as.Date(Data4$Date,"%m/%d/%Y")
#PLOT THE DATA
max_y1 <- max(Data1$Area_2D)
# Define colors to be used for cars, trucks, suvs
plot_colors <- c("blue","red")
plot(Data1$Date,Data1$Area_2D, col=plot_colors[1],
ylim=c(0,max_y1), xlim=c(min_x1,max_x1),pch=16, xlab="Date",ylab="Area", type="o")
par(new=T)
plot(Data2$Date,Data2$Area_2D, col=plot_colors[2],
ylim=c(0,max_y1), xlim=c(min_x1,max_x1),pch=16, xlab="Date",ylab="Area", type="o")
The main problem I see with the code above is there are two different x axis on the plot, one for Data1 and another for Data2. I want to have a single x axis spanning the date range determined by the dates in Data1 and Data2.
My questions is:
How do i dynamically create an x axis for both series? (i.e select the minimum and maximum date from the data frames 'Data1' and 'Data2')

The solution is to combine the data into one data.frame, and base the x-axis on that. This approach works very well with the ggplot2 plotting package. First we merge the data and add an ID column, which specifies to which dataset it belongs. I use letters here:
Data1$ID = 'A'
Data2$ID = 'B'
merged_data = rbind(Data1, Data2)
And then create the plot using ggplot2, where the color denotes which dataset it belongs to (can easily be changed to different colors):
library(ggplot2)
ggplot(merged_data, aes(x = Date, y = Area_2D, color = ID)) +
geom_point() + geom_line()
Note that you get one uniform x-axis here. In this case this is fine, but if the timeseries do not overlap, this might be problematic. In that case we can use multiple sub-plots, known as facets in ggplot2:
ggplot(merged_data, aes(x = Date, y = Area_2D)) +
geom_point() + geom_line() + facet_wrap(~ ID, scales = 'free_x')
Now each facet has it's own x-axis, i.e. one for each sub-dataset. What approach is most valid depends on the specific situation.

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

GGplot2: plotting values per year of specific columns in a dataframe - r

Related

Summarising data before plotting with geom_tile() renders different results

Graphing different variables in the same graph R- ggplot2

How to barplot select rows of data from a dataframe in R?

plotting multiple plot in R for different calendar date

Dynamically Set X limits on time plot

Categories

Resources