I have seen different variations of this question, but none that give a straightforward answer to a problem I run into all the time. I often have large datasets like the one described in this link:
Make multiple separate plots from single data frame in R
Example provided:
head(data)
Park_name Zone Year Height_mm
1 Park1 Zone1 2011 380
2 Park1 Zone1 2011 510
3 Park1 Zone1 2011 270
4 Park1 Zone2 2011 270
5 Park1 Zone2 2011 230
6 Park1 Zone2 2011 330
# load packages
require(ggplot2)
require(plyr)
# read data
Y <- read.table("C:/data.csv", sep=",", header=TRUE)
# define the theme
th <- theme_bw() +
theme(axis.text.x=element_text(),
axis.line=element_line(colour="black"),
panel.grid.minor = element_blank(),
panel.grid.major = element_blank(),
panel.background=element_blank(),
legend.justification=c(10,10), legend.position=c(10,10),
legend.title = element_text(),
legend.key = element_blank()
)
# determine park levels
parks <- levels(Y[,"Park_name"])
# apply separately for each park
p <- lapply(parks, function(park) {
ggplot(Y[Y[, "Park_name"]==park,], aes(x=as.factor(Year), y=Height_mm)) +
facet_grid(Zone~.) + # show each zone in a separate facet
geom_point() + # plot the actual heights (if desired)
# plot the mean and confidence interval
stat_summary(fun.data="mean_cl_boot", color="red")
})
# finally print your plots
lapply(p, function(x) print(x+th))
I want to create a single plot for each Park's Zone, plotting year against height, to put in a report appendix. Sometimes this totals over 100 plots. I do not want to use facet wrap. I want each plot to be a separate figure, and it would be great to save JPEGs automatically to a designated folder. I also want each plot to have:
1. A unique y-axis title (let's say the height column had values in both feet and meters and you needed the figures to identify which one).
2. A unique main-title based off the Park Name and Zone.
This is a huge challenge for me but may be an easy coding problem for someone who uses code so often. I would be eternally grateful for help, since I need this type of loop all of the time. Thank you!
I think the main problem with the example you provided is that the loop is made over the "parks" vector, which only contains the levels of "Park_name". I think a better approach would be to loop over the data, subsetting by each "Park_name" entry.
I am also assuming that you have a column with the "units" variable (I added it in the plot as "Units"); however, if that is not the case, you may be able to create it using tidyr::separate. I hope you find this code useful!
# determine park levels
parks <- unique(data[,"Park_name"])
# lapply for each park entry
p <- lapply(parks, function(park) {
#Subset the data by each entry in the parks vector
subdata <- subset(data,data$Park_name == park)
#Collapse the zone vector as a string
zones <- paste(unique(subdata[,"Zone"]),
collapse = " ")
##ggplot: keep the plot in a variable so it can be saved and returned
plt <- ggplot(subdata, aes(x=as.factor(Year), y=Height_mm)) +
facet_grid(Zone~.) +
geom_point() +
#Add the title and y lab as variables defined by park name, zones and a column with unit information
labs(title = paste(park, zones, sep = " "),
y = paste0("Height (", unique(subdata$Units)[1], ")"),
x = "Year") +
stat_summary(fun.data="mean_cl_boot", color="red")
#Save the plot; define your folder location, e.g. "C:/myplots/"
folder <- "C:/myplots/"
ggsave(filename = paste0(folder, park,".jpeg"),
plot = plt,
device = "jpeg",
width = 15,
height = 10,
units = "cm",
dpi = 200)
#Return the plot so that p ends up holding one plot per park
plt
})
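Since the question asks for one unfaceted figure per Park and Zone, here is a minimal sketch of that variant. It assumes a Units column exists (as above), uses made-up toy data, writes to a temporary folder (out_dir is a placeholder name), and swaps mean_cl_boot for a plain mean so that the Hmisc package is not needed.

```r
library(ggplot2)

# Toy data shaped like the question's example; the Units column is an assumption
data <- data.frame(
  Park_name = rep(c("Park1", "Park2"), each = 6),
  Zone      = rep(rep(c("Zone1", "Zone2"), each = 3), times = 2),
  Year      = rep(2011:2013, times = 4),
  Height_mm = c(380, 510, 270, 270, 230, 330, 410, 390, 300, 250, 260, 310),
  Units     = "mm"
)

# Placeholder output folder; replace with your own path
out_dir <- file.path(tempdir(), "myplots")
dir.create(out_dir, showWarnings = FALSE)

# One data frame per Park/Zone combination; drop = TRUE skips empty combinations
groups <- split(data, list(data$Park_name, data$Zone), drop = TRUE)

plots <- lapply(groups, function(sub) {
  park <- sub$Park_name[1]
  zone <- sub$Zone[1]
  plt <- ggplot(sub, aes(x = factor(Year), y = Height_mm)) +
    geom_point() +
    # plain mean per year in red (stand-in for mean_cl_boot)
    stat_summary(fun = mean, geom = "point", colour = "red", size = 3) +
    labs(title = paste(park, zone),
         x = "Year",
         y = paste0("Height (", sub$Units[1], ")"))
  ggsave(file.path(out_dir, paste0(park, "_", zone, ".jpeg")),
         plot = plt, width = 15, height = 10, units = "cm", dpi = 200)
  plt
})
```

Passing plot = plt to ggsave explicitly matters here, because inside lapply nothing is printed and ggsave would otherwise fall back to the last plot that was displayed.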
I am trying to show different growing season lengths by displaying crop planting and harvest dates at multiple regions.
My final goal is a graph that looks like this:
which was taken from an answer to this question. Note that the dates are in julian days (day of year).
My first attempt to reproduce a similar plot is:
library(data.table)
library(ggplot2)
mydat <- "Region\tCrop\tPlanting.Begin\tPlanting.End\tHarvest.Begin\tHarvest.End\nCenter-West\tSoybean\t245\t275\t1\t92\nCenter-West\tCorn\t245\t336\t32\t153\nSouth\tSoybean\t245\t1\t1\t122\nSouth\tCorn\t183\t336\t1\t153\nSoutheast\tSoybean\t275\t336\t1\t122\nSoutheast\tCorn\t214\t336\t32\t122"
# read data as data table
mydat <- setDT(read.table(textConnection(mydat), sep = "\t", header=T))
# melt data table
m <- melt(mydat, id.vars=c("Region","Crop"), variable.name="Period", value.name="value")
# plot stacked bars
ggplot(m, aes(x=Crop, y=value, fill=Period, colour=Period)) +
geom_bar(stat="identity") +
facet_wrap(~Region, nrow=3) +
coord_flip() +
theme_bw(base_size=18) +
scale_colour_manual(values = c("Planting.Begin" = "black", "Planting.End" = "black",
"Harvest.Begin" = "black", "Harvest.End" = "black"), guide = "none")
However, there are a few issues with this plot:
Because the bars are stacked, the values on the x-axis are aggregated and end up too high - out of the 1-365 scale that represents day of year.
I need to combine Planting.Begin and Planting.End in the same color, and do the same to Harvest.Begin and Harvest.End.
Also, a "void" (or a completely uncolored bar) needs to be created between Planting.Begin and Harvest.End.
Perhaps the graph could be achieved with geom_rect or geom_segment, but I really want to stick to geom_bar since it's more customizable (for example, it accepts scale_colour_manual in order to add black borders to the bars).
Any hints on how to create such graph?
I don't think this is something you can do with geom_bar or geom_col. A more general approach is to use geom_rect to draw rectangles. To do this, we need to reshape the data a bit:
library(magrittr) # provides the %>% pipe used below
plotdata <- mydat %>%
dplyr::mutate(Crop = factor(Crop)) %>%
tidyr::pivot_longer(Planting.Begin:Harvest.End, names_to="period") %>%
tidyr::separate(period, c("Type","Event")) %>%
tidyr::pivot_wider(names_from=Event, values_from=value)
# Region Crop Type Begin End
# <chr> <fct> <chr> <int> <int>
# 1 Center-West Soybean Planting 245 275
# 2 Center-West Soybean Harvest 1 92
# 3 Center-West Corn Planting 245 336
# 4 Center-West Corn Harvest 32 153
# 5 South Soybean Planting 245 1
# ...
We've used tidyr to reshape the data so we have one row per rectangle that we want to draw, and we've also made Crop a factor. We can then plot it like this:
ggplot(plotdata) +
aes(ymin=as.numeric(Crop)-.45, ymax=as.numeric(Crop)+.45, xmin=Begin, xmax=End, fill=Type) +
geom_rect(color="black") +
facet_wrap(~Region, nrow=3) +
theme_bw(base_size=18) +
scale_y_continuous(breaks=seq_along(levels(plotdata$Crop)), labels=levels(plotdata$Crop))
The part that's a bit messy here is that we want a discrete scale for y but geom_rect prefers numeric values. Since Crop is a factor now, we use the numeric codes of the factor to create the ymin and ymax positions. Then we need to relabel the y axis with the names of the factor levels.
If you also wanted to get the month names on the x axis you could do something like
dateticks <- seq.Date(as.Date("2020-01-01"), as.Date("2020-12-01"),by="month")
# then add this to your plot
... +
scale_x_continuous(breaks=lubridate::yday(dateticks),
labels=lubridate::month(dateticks, label=TRUE, abbr=TRUE))
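One thing the reshaped data leaves open is rows where End is smaller than Begin, such as the South Soybean planting period (245 to 1). A rough sketch of one way to handle this, assuming such rows mean the period runs past the end of the year: split each into two rectangles. split_wrapped is a hypothetical helper written for this example, not a function from any package.

```r
# Two rows copied from the reshaped data shown above
plotdata <- data.frame(
  Region = c("Center-West", "South"),
  Crop   = c("Soybean", "Soybean"),
  Type   = c("Planting", "Planting"),
  Begin  = c(245, 245),
  End    = c(275, 1)
)

# Hypothetical helper: split periods that wrap past the end of the year
split_wrapped <- function(d, year_end = 365) {
  wraps  <- d$End < d$Begin       # assumption: End < Begin means "continues next year"
  first  <- d[wraps, ]
  first$End <- year_end           # piece from Begin to the end of the year
  second <- d[wraps, ]
  second$Begin <- 1               # piece from day 1 to the original End
  rbind(d[!wraps, ], first, second)
}

fixed <- split_wrapped(plotdata)
```

The result can be passed to the same geom_rect call, since every row then has Begin no greater than End.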
I constructed a tSNE plot from Rtsne package. Below is the code
library(Rtsne)
library(ggplot2)
dtm<-read.table(args[1],sep=",",header=T,row.names=1)
dtm_t<-t(dtm)
chr_tsne_model<-Rtsne(dtm_t, perplexity=8, theta=0.1, num_threads=30)
d_tsne_1<-as.data.frame(chr_tsne_model$Y)
ggplot(d_tsne_1, aes(x=V1, y=V2)) + geom_point(size=2) + xlab("") + ylab("") + ggtitle("t-SNE") + theme_grey(base_size=21)
The file has the sample names in the first row and the gene names in the first column. I wanted to see the association between samples in reduced dimensions, hence the dots in the plot are samples.
Now I would like to see how 12 of the 25 samples are associated, so I want those in red and the rest in yellow and blue.
ggplot(d_tsne_1, aes(x=V1, y=V2, fill=rownames(d_tsne_1))) + geom_point(size=2) + xlab("dim1") + ylab("dim2") + ggtitle("t-SNE") + theme_grey(base_size=21) + scale_fill_manual(values=c('1'="red",'2'="red",'3'="red",'4'="red",'5'="red",'6'="red",'7'="red",'8'="red",'9'="red",'10'="red",'11'="yellow",'12'="yellow",'13'="yellow",'14'="yellow",'15'="yellow",'16'="yellow",'17'="yellow",'18'="yellow",'19'="blue",'20'="blue",'21'="blue",'22'="blue",'23'="blue",'24'="blue"))
I tried the above modification but to no avail.
The data looks like this:
Genes Sam_1 Sam_2_PD Sam_3_PD Sam_4 Sam_5
ENSG001 1.005 1.325 2.005 3.562 0.004 4.005
ENSG002 1.023 2.355 2.005 3.666 1.004 4.005
.
.
2500 rows, 25 col
The dots in the Rtsne plot represent columns (Sam_1, Sam_2_PD, etc.)
PS : I noticed that chr_tsne_model$Y has lost the sample names from dtm_t so do they retain the sample order ??
Please help.
Thanks in advance
tl;dr: Yes, Rtsne() drops the names in $Y but the order of your samples is retained.
Longer answer:
I don't have your data, so I'll illustrate some things with the iris dataset.
Let's suppose I have some experiment:
experiment <- iris[!duplicated(iris[,1:4]),]
I probably want to separate my actual data from the metadata (at least that's common in my work):
metadata <- data.frame(sample_id = rownames(experiment),
colour = experiment$Species)
data <- as.matrix(experiment[,1:4])
For the ggplot part, I would advise you to store your color-coding in a column of a dataframe instead of manually specifying every datapoint inside scale_colour_manual(), like I did with colour = experiment$Species.
Now for the tSNE part, indeed your sample names are dropped but the order is retained. I would also recommend setting a seed before performing tSNEs to make them more reproducible, as the outcome of a tSNE depends on a random initialisation of the data.
set.seed(1)
tsne <- Rtsne(data)
Before plotting, we gather all relevant data in a single dataframe:
df <- data.frame(x = tsne$Y[,1],
y = tsne$Y[,2],
colour = metadata$colour)
And plot
ggplot(df, aes(x, y, colour = colour)) +
geom_point()
Now personally, I would take away the axis ticks, axis texts and panel grids since the numbers coming from a tSNE mean nothing, they are just embedding coordinates. Furthermore, you might find it easy to store the tSNE coordinates in the metadata so you could easily find them back later.
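To make the advice about storing the colour-coding in a column concrete for the 25-sample case, here is a small sketch. The coordinates are random stand-ins for chr_tsne_model$Y, and the group labels and sizes (12/7/6) are assumptions for illustration only. Note that it maps colour rather than fill, since the default point shape has no fill.

```r
library(ggplot2)

# Stand-in coordinates for 25 samples; in practice use chr_tsne_model$Y
set.seed(1)
coords <- matrix(rnorm(50), ncol = 2)

# Group labels in sample order; the sizes 12/7/6 are an assumption
df <- data.frame(
  x = coords[, 1],
  y = coords[, 2],
  group = rep(c("A", "B", "C"), times = c(12, 7, 6))
)

# One colour per group instead of one entry per sample
group_colours <- c(A = "red", B = "yellow", C = "blue")

p <- ggplot(df, aes(x, y, colour = group)) +
  geom_point(size = 2) +
  scale_colour_manual(values = group_colours)
```

Because Rtsne keeps the sample order, a group vector built in the same order as the columns of the input lines up with the rows of $Y.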
EDIT: More detailed data has been posted, so I'll show how it would work with the provided variable names
# Attempt to load in data
z <- "Genes Sam_1 Sam_2_PD Sam_3_PD Sam_4 Sam_5
ENSG001 1.005 1.325 2.005 3.562 0.004 4.005
ENSG002 1.023 2.355 2.005 3.666 1.004 4.005"
# Removing column 1 because number of column names doesn't
# match number of data points
dtm <- read.table(text = z, header = T)[,-1]
Next we capture some metadata of interest, in this case: does the sample have a PD suffix or not?
metadata <- data.frame(
sample_name = colnames(dtm),
treatment = paste0(ifelse(grepl("PD", colnames(dtm)), "", "Not "), "PD")
)
We run the tSNE and format a data.frame for plotting
# Had to set perplexity to 1 because only 2 genes to work with
chr_tsne_model <- Rtsne(t(dtm), perplexity=1, theta=0.1, num_threads=30)
df <- chr_tsne_model$Y
colnames(df) <- c("x", "y")
df <- cbind(df, metadata)
And plot the thing:
ggplot(df, aes(x, y, colour = treatment)) +
geom_point() +
scale_colour_manual(values = c("blue","yellow")) +
# Optional
theme(axis.ticks = element_blank(),
axis.text = element_blank(),
axis.line = element_line(),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank())
Which with the two rows of data looks like the following:
I'm trying to generate a series of bar charts, one for each of 7 provinces, based off a master data table. However, the software only plots data from one of the provinces -- province 4. When I export to PDF I get 7 of the same bar chart (with different titles).
The data is in this format (abbreviated for clarity)
province travelcat pc_pop
60 1 0 to 4 hours 0.6807
21 1 4 to 8 hours 0.1093
28 2 4 to 8 hours 0.0969
44 2 36 to 48 hours 0.0014
31 3 48 to 72 hours 0.0016
49 3 > 72 hours 0.0007
Weirdly, when I generate a filtered table prov_filter and print that, it shows the data exactly as I'd expect it, specific to each province. Similarly the province title province_number is assigned correctly in the resulting PDF printouts. So the filtering is happening...but the data isn't going into the plot.
province_list=list()
for (i in unique(slim_prov_TCR$province)) {
province_number <- paste("Province",i)
prov_filter <- filter(slim_prov_TCR, province == i)
print(prov_filter)
plot <- ggplot(prov_filter, aes(x = prov_filter$travelcat, y = prov_filter$pc_pop))
+ theme(axis.text.x = element_text(angle=45, hjust=1))
+ scale_y_continuous(limits=c(0,1),labels = scales::percent)
+ ylab("% of provincial population") + xlab("Travel time to nearest medical facility")
+ ggtitle(province_number)
+ stat_summary(fun.y="identity",geom="bar")
filename=paste(province_number,".pdf",sep="")
province_list[[i]] = plot
print(plot)
}
I've done this before using similar code with no problems, but this time I've had serial problems, despite revising the filter code using multiple methods. I'm relatively new to R and statistics land in general so I'm probably mucking up something on the grammar side. Any and all help appreciated.
For reference purposes the final printout code is below
for (i in unique(slim_prov_TCR$province)) { # Another for loop, this time to save out the bar charts in province_list as PDFs
province_number <- paste("Province",i)
filename=paste(province_number,".pdf",sep="") # Make the file name for each PDF. The paste makes the name depend on the province, so each chart is named by province
pdf(filename,width=3.5,height=3.5) # PDF basic specifications. Modify the width and height here.
print(province_list[[i]])
dev.off()
}
As highlighted by alistaire and Gregor, using $ inside aes() and having the + at the beginning of lines was confusing R. Fixing these two points did the trick. See the corrected code below.
province_list=list()
for (i in unique(slim_prov_TCR$province)) {
province_number <- paste("Province",i)
prov_filter <- filter(slim_prov_TCR, province == i)
print(prov_filter)
plot <- ggplot(prov_filter, aes(x = travelcat, y = pc_pop)) +
theme(axis.text.x = element_text(angle=45, hjust=1)) +
scale_y_continuous(limits=c(0,1),labels = scales::percent) +
ylab("% of provincial population") + xlab("Travel time to nearest medical facility") +
ggtitle(province_number) +
stat_summary(fun.y="identity",geom="bar")
filename=paste(province_number,".pdf",sep="")
province_list[[i]] = plot
print(plot)
}
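For anyone wondering why the position of the + matters, here is a small base R illustration with numbers instead of plots (the function names are made up for the example). R ends a statement at a line break whenever the line is already a complete expression, so a leading + starts a new statement.

```r
# Leading "+": the first line is already a complete expression,
# so R treats "+ 2" as a separate statement (unary plus) and discards it
leading <- function() {
  total <- 1
  + 2
  total
}

# Trailing "+": the expression is incomplete at the line break,
# so R keeps reading and adds the next line
trailing <- function() {
  total <- 1 +
    2
  total
}

leading()   # 1
trailing()  # 3
```

The same applies to ggplot layers: with the + at the start of a line, the layers after the first line are never added to the object assigned to plot.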
I want to display two dimensions in my data: (1) the reporting entity in different facets and (2) the country associated with the data point on the x-axis. The problem is that the country dimension includes a "total", which is a lot higher than all of the individual values, so it would need its own axis limit.
My solution was to try another facetting dimension, but I could not get it working and looking nicely at the same time. Consider the following dummy data:
id <- c(1,1,1,1,1,1,2,2,2,2,2,2)
country <- c("US","US","UK","World","World","World","US","US","UK","World","World","World")
value <- c(150,40,100,1000,1100,1500,5,10,20,150,200,120)
# + some other dimensions
mydat <- data.frame(id,country,value)
id country value
1 1 US 150
2 1 US 40
3 1 UK 100
4 1 World 1000
5 1 World 1100
6 1 World 1500
7 2 US 5
8 2 US 10
9 2 UK 20
10 2 World 150
11 2 World 200
12 2 World 120
If I use a facet grid to display a world total, the axis limit is forced for the other countries as well:
mydat$breakdown <- mydat$country == "World"
ggplot(mydat) + aes(x=country,y=value) + geom_point() +
facet_grid(id ~ breakdown,scales = "free",space = "free_x") +
theme(strip.text.x = element_blank() , strip.background = element_blank(),
plot.margin = unit( c(0,0,0,0) , units = "lines" ) )
(the last part of the plot is just to remove the additional strip).
If I use a facet wrap, it does give me different axis limits for each plot, but then I cannot pass the space = "free_x" argument, meaning that the single column for the total will consume the same space as the entire country overview, which looks ugly for data sets with many countries:
ggplot(mydat) + aes(x=country,y=value) + geom_point() +
facet_wrap(id ~ breakdown,scales = "free")
There are several threads here which ask similar questions, but none of the answers helped me to achieve this yet.
Different axis limits per facet in ggplot2
Is it yet possible to have different axis breaks / limits for individual facets in ggplot with free scale?
Setting individual axis limits with facet_wrap and scales = "free" in ggplot2
Maybe try gridExtra::grid.arrange or cowplot::plot_grid:
lst <- split(mydat, list(mydat$breakdown, mydat$id))
plots <- lapply(seq(lst), function(x) {ggplot(lst[[x]]) +
aes(x=country,y=value) +
geom_point() +
ggtitle(names(lst)[x]) + labs(x=NULL, y=NULL)
})
do.call(gridExtra::grid.arrange,
c(plots, list(ncol=2, widths=c(2/3, 1/3)),
left="Value", bottom="country"))
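The widths above are hard-coded for two countries plus the total. As a sketch under the assumption that each column should be sized in proportion to the number of distinct countries it shows, the ratios could be computed from the data instead:

```r
id <- c(1,1,1,1,1,1,2,2,2,2,2,2)
country <- c("US","US","UK","World","World","World","US","US","UK","World","World","World")
value <- c(150,40,100,1000,1100,1500,5,10,20,150,200,120)
mydat <- data.frame(id, country, value)
mydat$breakdown <- mydat$country == "World"

# Number of distinct x-axis categories in each column (FALSE = countries, TRUE = total)
n_levels <- tapply(mydat$country, mydat$breakdown, function(x) length(unique(x)))

# Relative widths, in the same FALSE/TRUE order that split() uses for the plots
widths <- as.numeric(n_levels / sum(n_levels))
```

widths can then replace c(2/3, 1/3) in the grid.arrange call; with many countries the total column shrinks accordingly.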
I've had a good look around this site and others on how to set the hjust and vjust according to a value in a particular column. The following shows how the data is structured (but is a simplified subset of many entries for many years):
YearStart <- c(2001,2002,2003,2001,2002,2003)
Team <- c("MU","MU","MU","MC","MC","MC")
Attendance <- c(67586,67601,67640,33058,34564,46834)
Position <- c(3,1,3,1,9,16)
offset <-c()
df <- data.frame(YearStart,Team,Attendance,Position)
so
> head(df)
YearStart Team Attendance Position
1 2001 MU 67586 3
2 2002 MU 67601 1
3 2003 MU 67640 3
4 2001 MC 33058 1
5 2002 MC 34564 9
6 2003 MC 46834 16
What I would like to achieve is a vjust value based on the Team. In the following, MU would be vjust=1 and MC would be vjust=-1, so I can control where the data label is located relative to the data group with which it is associated.
I've tried to hack around a couple of examples that use a function containing a lookup table (it's not a straight ifelse as I have many values for Team) but I can't seem to pass a string to the function through the aes method along these lines:
lut <- list(MU=1,MC=-1)
vj <-function(x){lut[[x]]}
p=ggplot(df, aes(YearStart, Attendance, label=Position, group=Team))+
geom_point()+
geom_text(aes(vjust = vj(Team) ) )
print(p)
The following is pseudo(ish)code which applies the labels twice to each group in each location above and below the points.
p=ggplot(df, aes(YearStart, Attendance, label=Position, group=Team))+
geom_point()+
geom_text(aes(Team="MU"), vjust=1)+
geom_text(aes(Team="MC"), vjust=-1)
print(p)
I've tried several other strategies for this and I can't tell whether I'm trying this from the wrong direction or I'm just missing a very trivial piece of ggplot syntax. I've accomplished a stop-gap solution by labelling them manually in Excel but that's not sustainable :-)
To specify an aesthetic, that aesthetic should be a column in your data.frame.
(Notice also that your lookup function should have single brackets, not double.)
And a final thought: vjust and hjust are strictly only defined between [0, 1] for left/bottom and right/top justification. In practise, however, it is usually possible to extend this. I find that settings of (-0.2, 1.2) work quite well, in most cases.
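# A named numeric vector gives a plain numeric column (a list would give a list column)
lut <- c(MU=-0.2, MC=1.2)
# as.character() makes the lookup go by name even if Team is a factor
vj <- function(x) unname(lut[as.character(x)])
df$offset <- vj(df$Team)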
library(ggplot2)
ggplot(df, aes(YearStart, Attendance, label=Position, group=Team)) +
geom_point(aes(colour=Team)) +
geom_text(aes(vjust = offset))