Avoid repetitive, similar analysis and plots - r

I have a table with many variables. One of the variables contains year information: from 1999 till 2010.
I need to do for each year the same analysis, for instance, to plot a graph, a histogram, etc.
Currently, I subset the data so that each year goes into a data frame(table) and I do the analysis in turn for each year. This is very inefficient:
library(data.table)
library(ggplot2)
dates <- sample(seq(as.Date('1999/01/01'), as.Date('2010/01/01'), by="day"), 50, replace = TRUE)
dt<-data.table( YEAR = format.Date(dates,"%Y"),
Var1=sample(0:100, 50, rep=TRUE),
Var2 =sample(0:500, 50, rep=TRUE)
)
year_1999<-dt[YEAR=="1999"]
plot_1999<- ggplot(year_1999, aes (x=Var1))+
geom_line(aes(y=Var2), size=1, color="blue") +
labs(y="V2", x="V1", title="Year 1999")
plot_1999
How can I better write this in a compact way? I suppose I need a function but I have no idea how to.

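Instead of repeating the code several times, we can specify 'YEAR' in facet_wrap to get one panel per year:
library(ggplot2)
ggplot(dt, aes(x = Var1, y = Var2)) +
# constant size and color go outside aes(); inside aes() they are treated as data
geom_line(size = 1, color = "blue") +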
labs(y = "V2", x = "V1") +
facet_wrap(~ YEAR)

Try this if you want to create a separate plot object for each unique year in dt$YEAR:
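for (i in unique(dt$YEAR)) {
  year_i <- dt[YEAR == i]
  p <- ggplot(year_i, aes(x = Var1)) +
    geom_line(aes(y = Var2), size = 1, color = "blue") +
    # build the title from the loop variable instead of hard-coding 1999
    labs(y = "V2", x = "V1", title = paste("Year", i))
  # creates plot1999, plot2000, ... in the current environment
  assign(paste("plot", i, sep = ""), p)
}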
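Since the question asks for a function, here is a minimal sketch of that route. It is not from either answer: plot_year is a hypothetical helper name, the example table is made up to mirror the question's columns, and a plain data.frame is used instead of a data.table. The plots are collected in a named list rather than created with assign().

```r
library(ggplot2)

# Hypothetical helper: builds the plot for one year's subset.
plot_year <- function(d, yr) {
  ggplot(d, aes(x = Var1, y = Var2)) +
    geom_line(color = "blue") +
    labs(x = "V1", y = "V2", title = paste("Year", yr))
}

# Made-up data with the same columns as the question's table.
set.seed(1)
dt <- data.frame(
  YEAR = sample(as.character(1999:2010), 50, replace = TRUE),
  Var1 = sample(0:100, 50, replace = TRUE),
  Var2 = sample(0:500, 50, replace = TRUE)
)

# One data frame per year, then one plot per data frame.
by_year <- split(dt, dt$YEAR)
plots <- Map(plot_year, by_year, names(by_year))
```

An individual plot is then available as, for example, plots[[1]] or by year name, and the same pattern should work for any other per-year analysis by swapping the helper.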

Related

Represent dataset in column bar in R using ggplot [duplicate]

I have a csv file which looks like the following:
Name,Count1,Count2,Count3
application_name1,x1,x2,x3
application_name2,x4,x5,x6
The x variables represent numbers and the application_name variables represent names of different applications.
Now I would like to make a barplot for each row by using ggplot2. The barplot should have the application_name as title. The x axis should show Count1, Count2, Count3 and the y axis should show the corresponding values (x1, x2, x3).
I would like to have a single barplot for each row, because I have to store the different plots in different files. So I guess I cannot use "melt".
I would like to have something like:
for each row in rows {
print barplot in file
}
Thanks for your help.
You can use melt to rearrange your data and then use either facet_wrap or facet_grid to get a separate plot for each application name:
library(ggplot2)
library(reshape2)
# example data
mydf <- data.frame(name = paste0("name",1:4), replicate(5,rpois(4,30)))
names(mydf)[2:6] <- paste0("count",1:5)
# rearrange data into long format (one row per name/count pair)
m <- melt(mydf, id.vars = "name")
# if you want to export each plot separately
# I used facet_wrap as a quick way to add the application name as a plot title
# unique() works whether name is a factor or a character column
for(i in unique(m$name)) {
p <- ggplot(subset(m, name==i), aes(variable, value, fill = variable)) +
facet_wrap(~ name) +
geom_bar(stat="identity", show.legend=FALSE)
ggsave(paste0("figure_",i,".pdf"), p)
}
# or all plots in one window
ggplot(m, aes(variable, value, fill = variable)) +
facet_wrap(~ name) +
geom_bar(stat="identity", show.legend=FALSE)
I didn't see @user20650's nice answer before preparing this. It's almost identical, except that I use plyr::d_ply to save things instead of a loop. I believe dplyr::do() is another good option (you'd group_by(Name) first).
yourData <- data.frame(Name = sample(letters, 10),
Count1 = rpois(10, 20),
Count2 = rpois(10, 10),
Count3 = rpois(10, 8))
library(reshape2)
yourMelt <- melt(yourData, id.vars = "Name")
library(ggplot2)
# Test the plot on one piece to develop the graph
# (take the first name present, since sample() may not have drawn "a")
one <- subset(yourMelt, Name == yourMelt$Name[1])
ggplot(one, aes(x = variable, y = value)) +
geom_bar(stat = "identity") +
labs(title = one$Name[1])
# Wrap it up, with saving to file
bp <- function(dat) {
myPlot <- ggplot(dat, aes(x = variable, y = value)) +
geom_bar(stat = "identity") +
labs(title = dat$Name[1])
# "path/to/save/" is a placeholder; the directory must already exist
ggsave(filename = paste0("path/to/save/", dat$Name[1], "_plot.pdf"),
myPlot)
}
library(plyr)
d_ply(yourMelt, .variables = "Name", .fun = bp)

ggplot2 comparison of time periods

I need to visualize and compare two equally long sales periods, 2018/2019 and 2019/2020. Both periods begin at week 44 and end at week 36 of the following year. If I plot them against the year-week, the two periods follow one another along the x axis instead of overlapping. If I use only the week number, the values are sorted as one continuum from week 1 upwards and the graph does not make sense. Can you think of a solution?
Thank you.
Data:
library(ggplot2)
library(tsibble)  # for yearweek()
set.seed(1)
df1 <- data.frame(sells = runif(44),
week = c(44:52,1:35),
YW = yearweek(seq(as.Date("2018-11-01"), as.Date("2019-08-31"), by = "1 week")),
period = "18/19")
df2 <- data.frame(sells = runif(44),
week = c(44:52,1:35),
YW = yearweek(seq(as.Date("2019-11-01"), as.Date("2020-08-31"), by = "1 week")),
period = "19/20")
# Year-week on x axis: the two periods are drawn one after the other
ggplot(df1, aes(YW, sells)) +
geom_line(aes(color="Period 18/19")) +
geom_line(data=df2, aes(color="Period 19/20")) +
labs(color="Legend text")
# Week number on x axis: weeks are treated as one continuum and not split by year
ggplot(df1, aes(week, sells)) +
geom_line(aes(color="Period 18/19")) +
geom_line(data=df2, aes(color="Period 19/20")) +
labs(color="Legend text")
Another alternative is to facet it. This'll require combining the two sets into one, preserving the data source. (This is commonly a better way of dealing with it in general, anyway.)
(I don't have tsibble, so my YW just has seq(...), no yearweek. It should translate.)
ggplot(dplyr::bind_rows(tibble::lst(df1, df2), .id = "id"), aes(YW, sells)) +
geom_line(aes(color = id)) +
facet_wrap(id ~ ., scales = "free_x", ncol = 1)
In place of dplyr::bind_rows, one might also use data.table::rbindlist(..., idcol="id"), or do.call(rbind, ...), though with the latter you will need to assign id externally.
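As a rough sketch of the do.call(rbind, ...) route, the id column can be attached to each piece before binding. The df1 and df2 below are stand-ins with random values and plain Date columns instead of yearweek, and the names dfs, tagged and combined are only for illustration.

```r
# Stand-ins for the question's df1 and df2 (random values, plain Dates).
set.seed(1)
df1 <- data.frame(sells = runif(44),
                  YW = seq(as.Date("2018-11-01"), by = "1 week", length.out = 44))
df2 <- data.frame(sells = runif(44),
                  YW = seq(as.Date("2019-11-01"), by = "1 week", length.out = 44))

dfs <- list(df1 = df1, df2 = df2)

# rbind does not record where each row came from, so tag each piece first.
tagged <- Map(function(d, nm) cbind(id = nm, d), dfs, names(dfs))
combined <- do.call(rbind, tagged)
rownames(combined) <- NULL
```

The combined data frame should then be usable in the ggplot call above in place of the dplyr::bind_rows(...) expression.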
One more note: the default formatting of the x-axis is obscuring the "year" of the data. If this is relevant/important (and not apparent elsewhere), then use ggplot2's normal mechanism for forcing labels, e.g.,
... +
scale_x_date(labels = function(z) format(z, "%Y-%m"))
In the unlikely case that you don't have tibble::lst available, you can replace it with list(df1=df1, df2=df2) or similar.
If you want to keep the x axis as a numeric scale, you can do:
ggplot(df1, aes((week + 9) %% 52, sells)) +
geom_line(aes(color="Period 18/19")) +
geom_line(data=df2, aes(color="Period 19/20")) +
scale_x_continuous(breaks = 1:52,
labels = function(x) ifelse(x == 9, 52, (x - 9) %% 52),
name = "week") +
labs(color="Legend text")
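To see why the shift by 9 works, the arithmetic can be checked on its own. This is only a sanity check of the expressions used in the plot above; pos and lab are names introduced here for illustration.

```r
# Week numbers in season order, as in the question's data.
week <- c(44:52, 1:35)

# Shifting by 9 and wrapping at 52 moves week 44 to position 1,
# week 52 to position 9, week 1 to position 10, and so on.
pos <- (week + 9) %% 52

# The label function from the plot maps positions back to week numbers.
lab <- function(x) ifelse(x == 9, 52, (x - 9) %% 52)

all(pos == 1:44)        # positions run consecutively through the season
all(lab(pos) == week)   # labels recover the original week numbers
```

The special case x == 9 is needed because (9 - 9) %% 52 would give 0 rather than 52.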
Try this. You can convert your week variable to a factor whose levels keep the desired order. Here is the code:
library(ggplot2)
library(tsibble)
#Data
df1$week <- factor(df1$week,levels = unique(df1$week),ordered = TRUE)
df2$week <- factor(df2$week,levels = unique(df2$week),ordered = TRUE)
#Plot
ggplot(df1, aes(week, sells)) +
geom_line(aes(color="Period 18/19",group=1)) +
geom_line(data=df2, aes(color="Period 19/20",group=1)) +
labs(color="Legend text")

Highlight positions without data in facet_wrap ggplot

When facetting barplots in ggplot the x-axis includes all factor levels. However, not all levels may be present in each group. In addition, zero values may be present, so from the barplot alone it is not possible to distinguish between x-axis values with no data and those with zero y-values. Consider the following example:
library(tidyverse)
set.seed(43)
site <- c("A","B","C","D","E") %>% sample(20, replace=T) %>% sort()
year <- c("2010","2011","2012","2013","2014","2010","2011","2012","2013","2014","2010","2012","2013","2014","2010","2011","2012","2014","2012","2014")
isZero = rbinom(n = 20, size = 1, prob = 0.40)
value <- ifelse(isZero==1, 0, rnorm(20,10,3)) %>% round(0)
df <- data.frame(site,year,value)
ggplot(df, aes(x=year, y=value)) +
geom_bar(stat="identity") +
facet_wrap(~site)
This is fish census data, where not all sites were fished in all years, and sometimes no fish were caught. Hence the need to differentiate between the two situations. For example, there was no catch at site C in 2010 and it was not fished in 2011, and the reader cannot tell the difference. I would like to add something like "no data" to the plot for 2011. Maybe it is possible to fill in the rows where data is missing, generate another column with the desired text, and then add it via geom_text?
So here is an example of your proposed method:
# Tabulate sites vs year, take zero entries
tab <- table(df$site, df$year)
idx <- which(tab == 0, arr.ind = T)
# Build new data.frame
missing <- data.frame(site = rownames(tab)[idx[, "row"]],
year = colnames(tab)[idx[, "col"]],
value = 1,
label = "N.D.") # For 'no data'
ggplot(df, aes(year, value)) +
geom_col() +
geom_text(data = missing, aes(label = label)) +
facet_wrap(~site)
Alternatively, you could also let the facets omit unused x-axis values:
ggplot(df, aes(x=year, y=value)) +
geom_bar(stat="identity") +
facet_wrap(~site, scales = "free_x")
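The "fill the missing rows" idea from the question can also be sketched in base R with expand.grid and merge. The tiny df below is made up for illustration (it is not the question's data), and grid and full are names chosen here.

```r
# Made-up census: site A caught nothing in 2011, site B was not fished in 2011.
df <- data.frame(site  = c("A", "A", "B"),
                 year  = c("2010", "2011", "2010"),
                 value = c(7, 0, 4))

# Every site/year combination, whether it was surveyed or not.
grid <- expand.grid(site = unique(df$site), year = unique(df$year),
                    stringsAsFactors = FALSE)

# Left join: combinations without a survey get value = NA.
full <- merge(grid, df, by = c("site", "year"), all.x = TRUE)

# Text to print where no survey took place; true zeros keep an empty label.
full$label <- ifelse(is.na(full$value), "no data", "")
```

The label column could then be drawn with something like geom_text(aes(y = 0, label = label), vjust = -0.5) on top of geom_col(), so that a zero catch and a missing survey look different.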

superpose densities, non exclusive subsets

I need to draw several density functions on a single plot. Each density corresponds to a subset of my overall dataset. The subsets are defined by the value taken by one of the variables in the dataset.
Concretely, I would like to draw a density function for the 1, 3, and 10 year horizons. Of course, the 10 year horizon includes the shorter ones. Likewise, the 3 year horizon density should include the data from the last year.
The subsets need to correspond to data[period == 1,], data[period <= 3, ], and the full data (i.e. data[period <= 10,]).
I have managed to do so by adding geom_density layers on top of each other, i.e., by redefining the data each time.
ggplot() +
geom_density(data = data[period <=3,], aes(x=BEST_CUR_EV_TO_EBITDA), alpha=.2, fill="red") +
geom_density(data = data[period ==1,], aes(x=BEST_CUR_EV_TO_EBITDA), alpha=.2, fill="grey") +
geom_density(data = data, aes(x=BEST_CUR_EV_TO_EBITDA), alpha=.2, fill="green")
It works fine but I feel like this is not the right way to do it (and indeed, it makes e.g., the creation of a legend cumbersome).
On the other hand, doing it like this:
ggplot(data, aes(x=BEST_CUR_EV_TO_EBITDA, color=period)) +
geom_density(alpha=.2, fill="blue")
won't do because then the periods are taken to be mutually exclusive.
Is there a way to specify aes(color) based on the value taken by period where subsets overlap?
Running code:
library(data.table)
library(lubridate)
library(ggplot2)
YEARS <- 10
today <- Sys.Date()
lastYr <- Sys.Date()-years(1)
last3Yr <- Sys.Date()-years(3)
start.date = Sys.Date()-years(YEARS)
date = seq(start.date, Sys.Date(), by=1)
BEST_CUR_EV_TO_EBITDA <- rnorm(length(date), 3,1)
data <- cbind.data.frame(date, BEST_CUR_EV_TO_EBITDA)
data <- cbind.data.frame(data, period = rep(10, nrow(data)))
subPeriods <- function(aDf, from, to, value){
aDf[aDf$date >= from & aDf$date <= to, "period"] = value
return(aDf)
}
data <- subPeriods(data, last3Yr, today, 3)
data <- subPeriods(data, lastYr, today, 1)
data <- data.table(data)
colScale <- scale_colour_manual(
name = "horizon"
, values = c("1 Y" = "grey", "3 Y" = "red", "10 Y" = "green"))
ggplot() +
geom_density(data = data[period <=3,], aes(x=BEST_CUR_EV_TO_EBITDA), alpha=.2, fill="red") +
geom_density(data = data[period ==1,], aes(x=BEST_CUR_EV_TO_EBITDA), alpha=.2, fill="grey") +
geom_density(data = data, aes(x=BEST_CUR_EV_TO_EBITDA), alpha=.2, fill="green") +
colScale
One way to deal with dependent (overlapping) groups is to create independent groups from the existing ones. Below I do this by creating three new columns (period_one, period_three and period_ten) with the mutate function, where
period_one = BEST_CUR_EV_TO_EBITDA values for period == 1
period_three = BEST_CUR_EV_TO_EBITDA values for period <= 3
period_ten = BEST_CUR_EV_TO_EBITDA values for all periods
These columns are then converted into long format using the gather function, where the column names (period_one, period_three and period_ten) are stacked in the "period" variable and the corresponding values go into the column "val".
library(dplyr)  # mutate, select, %>%
library(tidyr)  # gather
df2 <- data %>%
mutate(period_one=ifelse(period==1, BEST_CUR_EV_TO_EBITDA, NA),
period_three=ifelse(period<=3, BEST_CUR_EV_TO_EBITDA, NA),
period_ten=BEST_CUR_EV_TO_EBITDA) %>%
select(date, starts_with("period_")) %>%
gather(period, val, period_one, period_three, period_ten)
The ggplot is straightforward with long format consisting of independent grouping:
ggplot(df2, aes(val, fill=period)) + geom_density(alpha=.2)
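The same long format can be built without dplyr/tidyr by stacking the overlapping subsets directly. This is a sketch under assumptions: data below is a small made-up stand-in with the question's column names, and horizons and long are names introduced here.

```r
# Made-up stand-in for the question's data; period is 1, 3 or 10.
set.seed(1)
data <- data.frame(period = rep(c(1, 3, 10), times = c(10, 20, 70)),
                   BEST_CUR_EV_TO_EBITDA = rnorm(100, 3, 1))

# Each horizon keeps every row with period <= its cutoff,
# so rows from shorter horizons are reused in the longer ones.
horizons <- c("1 Y" = 1, "3 Y" = 3, "10 Y" = 10)
long <- do.call(rbind, lapply(names(horizons), function(h) {
  sub <- data[data$period <= horizons[[h]], ]
  sub$horizon <- h
  sub
}))

# Fix the legend order to 1 Y, 3 Y, 10 Y.
long$horizon <- factor(long$horizon, levels = names(horizons))
```

Plotting would then follow the same pattern as above, mapping fill to horizon instead of period.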
