Not sure why this subset is not working in ggplot - r

I have sub-setted my data set so that only three sites are included, as I only want to plot three sites and the following code does not seem to work with ggplot. Anyone have any idea why?
rm(list=ls())
require(ggplot2)
require(reshape2)
require(magrittr)
require(dplyr)
require(tidyr)
setwd("~/Documents/Results")
mydata <- read.csv("Metals sheet R.csv")
L <- subset(mydata, Site =="B1"| Site == "B2"| Site == "B3", select = c(Site,Date,Al))
L$Date <- as.Date(L$Date, "%d/%m/%Y")
ggplot(data=L, aes(x=Date, y=Al, xaxt="n", colour=Site)) +
geom_point() +
labs(title = "Total Al in the Barlwyd and Bowydd
19/03/2015.", x = "Site",
y = "Total concentration (mg/L)") +
scale_x_date(date_breaks = "1 month", labels = date_format("%m"))
It seems to falter after the ggplot line. Thanks in advance. I have double checked it but can't see anything wrong? I might possibly need a way to only plot three of my 21 sites.
The head of my subsetted L data set looks something like this (x58 reps)
Date Site Al
12/08/2015 B1 22.3
12/08/2015 B2 23.4
12/08/2015 B3 203
Thank you in advance.

I think xaxt = "n" is wrong: xaxt is a base graphics parameter, and the ggplot aes function is only for mapping variables in your data to plot elements. To remove the x-axis text in ggplot, use the theme function, e.g. as described in the thread "ggplot2 plot without axes, legends, etc."
On a separate note, the %in% operator provides a quicker way of selecting a subset of values from a column:
subset(mydata, Site %in% c("B1", "B2", "B3"))
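Putting both points together, a minimal sketch of the corrected plot might look like the following. The data frame is a made-up stand-in for "Metals sheet R.csv" (the values and the extra site B4 are assumptions). Note also that date_format() comes from the scales package, which the question's code never loads, so loading it is probably needed as well.

```r
library(ggplot2)
library(scales)  # date_format() is defined in scales, not in ggplot2

# Hypothetical stand-in for the CSV file; values are invented for illustration
mydata <- data.frame(
  Site = rep(c("B1", "B2", "B3", "B4"), times = 3),
  Date = rep(c("12/08/2015", "12/09/2015", "12/10/2015"), each = 4),
  Al   = c(22.3, 23.4, 203, 5.1, 20.1, 25.0, 180, 4.8, 19.5, 24.2, 190, 5.0)
)

# Keep only the three sites of interest
L <- subset(mydata, Site %in% c("B1", "B2", "B3"), select = c(Site, Date, Al))
L$Date <- as.Date(L$Date, "%d/%m/%Y")

# No xaxt inside aes(): aes() only maps columns to aesthetics
p <- ggplot(L, aes(x = Date, y = Al, colour = Site)) +
  geom_point() +
  labs(title = "Total Al in the Barlwyd and Bowydd",
       x = "Date", y = "Total concentration (mg/L)") +
  scale_x_date(date_breaks = "1 month", labels = date_format("%m"))

# Only if the x-axis text really should be hidden (the ggplot2 counterpart of xaxt = "n"):
# p <- p + theme(axis.text.x = element_blank(), axis.ticks.x = element_blank())
```

Printing p draws the plot; the commented theme() line is left optional because hiding the axis text would defeat the date labels set just above it.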

Related

Create barplot to represent time series in ggplot2

I have a basic dataframe with 3 columns: (i) a date (when a sample was taken); (ii) a site location and (iii) a binary variable indicating what the condition was when sampling (e.g. wet versus dry).
Some reproducible data:
df <- data.frame(Date = rep(seq(as.Date("2010-01-01"), as.Date("2010-12-01"), by="months"),times=2))
df$Site <- c(rep("Site.A",times = 12),rep("Site.B",times = 12))
df$Condition<- as.factor(c(0,0,0,0,1,1,1,1,0,0,0,0,
0,0,0,0,0,1,1,0,0,0,0,0))
What I would like to do is use ggplot to create a bar chart indicating the condition of each site (y axis) over time (x axis) - the condition indicated by a different colour. I am guessing some kind of flipped barplot would be the way to do this, but I cannot figure out how to tell ggplot2 to recognise the values chronologically, rather than summed for each condition. This is my attempt so far which clearly doesn't do what I need it to.
ggplot(df) +
geom_bar(aes(x=Site,y=Date,fill=Condition),stat='identity')+coord_flip()
So I have 2 questions. Firstly, how do I tell ggplot to recognise changes in condition over time and not just group each condition in a traditional stacked bar chart?
Secondly, it seems ggplot converts the date to a numerical value, how would I reformat the x-axis to show a time period, e.g. in a month-year format? I have tried doing this via the scale_x_date function, but get an error message.
labDates <- seq(from = (head(df$Date, 1)),
to = (tail(df$Date, 1)), by = "1 months")
Datelabels <-format(labDates,"%b %y")
ggplot(df) +
geom_bar(aes(x=Site,y=Date,fill=Condition),stat='identity')+coord_flip()+
scale_x_date(labels = Datelabels, breaks=labDates)
I have also tried converting sampling times to factors and displaying these instead. Below I have done this by changing each sampling period to a letter (in my own code, the factor levels are in a month-year format - I put letters here for simplicity). But I cannot format the axis to place each level of the factor as a tick mark. Either a date or factor solution for this second question would be great!
df$Factor <- as.factor(unique(df$Date))
levels(df$Factor) <- list(A = "2010-01-01", B = "2010-02-01",
C = "2010-03-01", D = "2010-04-01", E = "2010-05-01",
`F` = "2010-06-01", G = "2010-07-01", H = "2010-08-01",
I = "2010-09-01", J = "2010-10-01", K= "2010-11-01", L = "2010-12-01")
ggplot(df) +
geom_bar(aes(x=Site,y=Date,fill=Condition),stat='identity')+coord_flip()+
scale_y_discrete(breaks=as.numeric(unique(df$Date)),
labels=levels(df$Factor))
Thank you in advance!
It doesn't really make sense to use geom_bar(), considering that you do not want to summarise the data and need the visualisation over "time".
I would rather use geom_line() and increase the line thickness if you want it to look like a bar chart.
library(tidyr)
library(dplyr)
library(ggplot2)
library(scales)
library(lubridate)
df <- data.frame(Date = rep(seq.Date(as.Date("2010-01-01"), as.Date("2010-12-01"), by="months"),times=2))
df$Site <- c(rep("Site.A",times = 12),rep("Site.B",times = 12))
df$Condition<- as.factor(c(0,0,0,0,1,1,1,1,0,0,0,0,
0,0,0,0,0,1,1,0,0,0,0,0))
df$Date <- ymd(df$Date)
ggplot(df) +
geom_line(aes(y=Site,x=Date,color=Condition),size=10)+
scale_x_date(labels = date_format("%b-%y"))
Note that using coord_flip() also does not work; I think it is what causes the Date issue. See the threads below:
how to use coord_carteisan and coord_flip together in ggplot2
In ggplot2, coord_flip and free scales don't work together
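As an alternative that is not taken from the answer above, the same picture can be sketched with geom_tile(), which draws one cell per site and month without needing coord_flip(). The date_breaks and date_labels arguments of scale_x_date() give the month-year labels asked for. Treat this as one possible sketch rather than the recommended solution.

```r
library(ggplot2)

# Same reproducible data as in the question
df <- data.frame(Date = rep(seq(as.Date("2010-01-01"), as.Date("2010-12-01"),
                                by = "months"), times = 2))
df$Site <- rep(c("Site.A", "Site.B"), each = 12)
df$Condition <- as.factor(c(0,0,0,0,1,1,1,1,0,0,0,0,
                            0,0,0,0,0,1,1,0,0,0,0,0))

# One tile per (Date, Site) pair, coloured by condition; dates stay on the x axis
p <- ggplot(df, aes(x = Date, y = Site, fill = Condition)) +
  geom_tile(colour = "white") +
  scale_x_date(date_breaks = "1 month", date_labels = "%b %y")
```

Because months differ in length, the tiles may show small gaps; that is a cosmetic limitation of this sketch.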

Ordering bars in ggplot2 stacked barplot via levels() but output looks different

I'm struggling with my ggplot2 stacked barplot. I want to define the order of the bars manually. So I do that normally by transforming the variable into a factor and defining the levels in my desired order.
data <- transform(data, variable = factor(variable, levels = c("A4 Da/De/Du", "A2 London", "A3 Berlin", "A1 Paris", "A5 Rome")))
When I check my variable levels I can see that the levels are in my desired order to plot
head(data$variable)
When I plot my data everything looks as desired, but somehow, and I have no clue why, one variable (for example "A4 Da/De/Du") is not in my defined variable order...
Has someone an idea what the problem could be?
-It's the only variable with special characters (e.g "/") in it
-It's the only variable which has zero levels in it (e.g. c(20,40,0,0,40))
-My ggplot code is quite complex, and I use the "reorder()" function, and I use the "forcats" package in my ggplot2 code. Could that be a problem?
Thanks very much for any help or ideas!
EDIT (some example data)
library(reshape2)
library(ggplot2)
library(dplyr)
df <- data.frame(cbind(a=c(20,40,20,10,10),b=c(10,30,50,5,5), c=c(60,10,10,15,5), d=c(80,20,0,0,0), e=c(50,10,10,15,15)))
colnames(df) <- c("D1 Paris", "D2 London", "D3 Berlin", "D4 Da/De/Du", "D5 Rome")
df$category <- c("C1", "C2", "C3", "C4", "C5")
data <- melt(df, id.vars = "category")  # melt first: 'data' must exist before it is piped
data <- data %>% group_by(variable) %>% arrange(variable)
data$percent <- data$value/100
data <- transform(data, variable = factor(variable,
levels = c("D4 Da/De/Du", "D2 London", "D3 Berlin", "D1 Paris", "D5 Rome")))
And the short version of the ggplot2 code:
ggplot(data, aes(x=reorder((variable), percent), y=percent, fill=category)) +
coord_flip()+
geom_bar(stat="identity", width = .4, colour="black", lwd=0.1)
SOLUTION
I finally solved my problem :)
Gregor was right: after setting the levels of the variable in the desired order, the reorder() call in the ggplot2 code is no longer necessary. Worse, it overrides the levels defined earlier, which is what produced my error in the end...
Thanks Gregor!
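The effect can be reproduced with a small base R sketch. The vectors below are toy values invented for illustration, not the data from the question. reorder() builds a new level order from the mean of the second argument, so any manual order is lost.

```r
# Toy data (hypothetical): three categories with a manually chosen level order
variable <- factor(c("A", "B", "C", "A", "B", "C"), levels = c("C", "A", "B"))
percent  <- c(0.5, 0.1, 0.3, 0.5, 0.1, 0.3)

manual_levels <- levels(variable)       # "C" "A" "B", as set by hand

# reorder() sorts the levels by mean(percent) per level: B = 0.1, C = 0.3, A = 0.5
reordered <- reorder(variable, percent)
reordered_levels <- levels(reordered)   # "B" "C" "A"
```

So in the plot, mapping x = variable directly keeps the manual order, while x = reorder(variable, percent) replaces it.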

Adding a trend line to a scatterplot using R

I have a data set with number of people at a certain age (ranging from 0-105+), recorded in the period 1846-2014, and I am making a scatterplot of the summed amount of people by year; there's one data set for males and one for females. After that, I am going to add a trend line, but I am having problems figuring out how.
This is what I've got so far:
B <- as.matrix(read.table("clipboard"))
head(B)
age <- 0:105
y <- 1846:2014
plot(c(1846:2014), c(colSums(B)), col=3, xlab="Year", ylab="Summed age", main="Summed people")
This gives me the plot, but I am not sure how to add the trend line. Please help.
Plot looks like this: https://www.dropbox.com/s/5dono5bjrmqylcp/Plot.png?dl=0
Data available here:
https://www.ssb.no/statistikkbanken/SelectVarVal/Define.asp?subjectcode=01&ProductId=01&MainTable=FolkemEttAarig&SubTable=1&PLanguage=1&nvl=True&Qid=0&gruppe1=Hele&gruppe2=Hele&gruppe3=Hele&VS1=AlleAldre00B&VS2=Kjonn3&VS3=&mt=0&KortNavnWeb=folkemengde&CMSSubjectArea=befolkning&StatVariant=&checked=true
I downloaded your data file and posted it somewhere accessible.
urlsrc <- "http://www.math.mcmaster.ca/bolker/misc"
urlfn <- "201512516853914205393FolkemEttAarig.tsv"
d <- read.delim(url(paste(urlsrc,urlfn,sep="/")),header=TRUE,
check.names=FALSE)
dm <- d[,3:171]
y <- as.numeric(names(dm))
Now make the plot:
plot(y, colSums(dm),
col=3, xlab="Year", ylab="Summed age", main="Summed people")
abline(lm(colSums(dm) ~ y))
You can also do it like this:
library("tidyr")
library("ggplot2"); theme_set(theme_bw())
library("dplyr")
d2 <- gather(dm,year,pop,convert=TRUE)
d3 <- d2 %>% group_by(year) %>% summarise(pop=mean(pop))
ggplot(d3,aes(year,pop)) + geom_point() +
geom_smooth(method="lm")
There is a confidence interval around this trend line, but it's so narrow that it's hard to see.
update: I accidentally used the mean instead of the sum in the second plot, but of course it should be easy to change that.
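For completeness, a sketch of the same pipeline with sum() in place of mean(). The small data frame dm_toy is a hypothetical stand-in for dm (one column per year, one row per age), so the snippet runs without downloading the file.

```r
library(tidyr)
library(dplyr)
library(ggplot2)

# Hypothetical stand-in for 'dm': columns are years, rows are ages
dm_toy <- data.frame(`2000` = c(10, 20, 30),
                     `2001` = c(11, 21, 31),
                     `2002` = c(12, 22, 32),
                     check.names = FALSE)

# Long format: one row per (year, age) cell; convert = TRUE turns the year key into an integer
d2 <- gather(dm_toy, year, pop, convert = TRUE)

# Sum over ages within each year (the earlier version used mean() by mistake)
d3 <- d2 %>% group_by(year) %>% summarise(pop = sum(pop))

p <- ggplot(d3, aes(year, pop)) + geom_point() +
  geom_smooth(method = "lm")
```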

Plotting Several Grouped Bar Plots in Loop [R]

My challenge is to plot several bar plots at once, one plot for each variable in each of several subsets. My goal is to compare regional differences for each variable. I would like to print all the resulting plots to an HTML file via R Markdown.
My main difficulty in making automatic grouped bar charts is that you need to tabulate the groups using table(data$Var[i], data$Region), but I don't know how to do this automatically. I would highly appreciate a hint on this.
Here is an example of what one of my subsets looks like:
# To Create this example of data:
b <- rep(matrix(c(1,2,3,2,1,3,1,1,1,1)), times=10)
data <- matrix(b, ncol=10)
colnames(data) <- paste("Var", 1:10, sep = "")
data <- as.data.frame(data)
reg_name <- c("North", "South")
Region <- rep(reg_name, 5)
data <- cbind(data,Region)
Using beside = TRUE, I was able to create one grouped bar plot (grouped by Region for Var1 from data):
tb <- table(data$Var1,data$Region)
barplot(tb, main="Var1", xlab="Values", legend=rownames(tb), beside=TRUE,
col=c("green", "darkblue", "red"))
I would like to loop this process to generate for example 10 plots for Var1 to Var10:
for(i in 1:10){
tb <- table(data[i], data$Region)
barplot(tb, main = i, xlab = "Values", legend = rownames(tb), beside = TRUE,
col=c("green", "darkblue", "red"))
}
R prefers the apply family of functions, so I tried to create a function to be applied:
fct <- function(i) {
tb <- table(data[i], data$Region)
barplot(tb, main=i, xlab="Values", legend = rownames(tb), beside = TRUE,
col=c("green", "darkblue", "red"))
}
sapply(data, fct)
I have tried other ways, but I was never successful. Maybe lattice or ggplot2 would offer an easier way to do this. I am just starting in R, so I will gladly accept any tips and suggestions. Thank you!
(I run on Windows, with the most recent R v3.1.2 "Pumpkin Helmet")
Given that you say "My goal is to compare regional differences for each variable", I'm not sure you've chosen the optimal plotting strategy. But yes, it is possible to do what you are asking.
Here's the default plot you get with your code above, for reference:
If you want a list with 10 plots for each variable, you can do the following (with ggplot)
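library(ggplot2)
# 'dat' is the question's data frame (named 'data' there)
many_plots <-
  # for each column name in dat (except the last one, Region)...
  lapply(names(dat)[-ncol(dat)], function(x) {
    this_dat <- dat[, c(x, 'Region')]
    names(this_dat)[1] <- 'Var'
    # geom_bar() counts the rows per value of Var, so no binwidth is needed
    ggplot(this_dat, aes(x=Var, fill=factor(Var))) +
      geom_bar() + facet_grid(~Region) +
      theme_classic()
  })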
Sample output, for many_plots[[1]]:
If you wanted all the plots in one image, you can do this (using reshape2 and data.table)
library(data.table)
library(reshape2)
dat2 <-
data.table(melt(dat, id.var='Region'))[, .N, by=list(value, variable, Region)]
ggplot(dat2, aes(y=N, x=value, fill=factor(value))) +
geom_bar(stat='identity') + facet_grid(variable~Region) +
theme_classic()
...but that's not a great plot.
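If base graphics are preferred, the loop from the question probably only needs double brackets. data[i] returns a one-column data frame, whereas data[[i]] returns the column vector that table() expects. A minimal sketch on the question's example data (the vars and tables names are introduced here for illustration):

```r
# Example data from the question
b <- rep(matrix(c(1,2,3,2,1,3,1,1,1,1)), times = 10)
data <- as.data.frame(matrix(b, ncol = 10))
colnames(data) <- paste("Var", 1:10, sep = "")
data$Region <- rep(c("North", "South"), 5)

# All columns except Region
vars <- setdiff(names(data), "Region")

# One contingency table per variable; [[v]] extracts the column as a vector
tables <- lapply(vars, function(v) table(data[[v]], data$Region))
names(tables) <- vars

# One grouped bar plot per variable, titled with the variable name
for (v in vars) {
  tb <- tables[[v]]
  barplot(tb, main = v, xlab = "Values", legend = rownames(tb), beside = TRUE,
          col = c("green", "darkblue", "red")[seq_len(nrow(tb))])
}
```

In an R Markdown chunk, each barplot() call in the loop should appear as its own figure in the rendered HTML.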

Multiple frequency lines on same graph where y is a character value

I'm trying to create a frequency plot of number of appearances of a graph type by year.
I have played around with ggplot2 for a while, but I think this is over my head (I'm just getting started with R)
I attached a schematic of what I would like the result to look like. One of the other issues I'm running into is that there are many years that the graph types don't appear. Is there a way to exclude the graph type if it does not appear that year?
E.g. in 1940 there is no "Sociogram", and I don't want to have a bunch of lines at 0...
year <- c("1940","1940","1940","1940","1940","1940","1940","1940","1940","1940","1940","1941","1941","1941","1941","1941","1941","1941","1941","1941","1941","1941","1941","1941","1941")
type <- c("Line","Column", "Stacked Column", "Scatter with line", "Scatter with line", "Scatter with line", "Scatter with line", "Map with distribution","Line","Line","Line","Bar","Bar","Stacked bar","Column","Column","Sociogram","Sociogram","Column","Column","Column","Line","Line","Line","Line")
ytmatrix <- data.frame(year = as.Date(as.character(year), "%Y"), type = type)
Please let me know if something doesn't make sense. StackOverflow is quickly becoming one of my favorite sites!
Thanks,
Jon
Here's what I have so far...
Thank you again for all your help!
And here's how I did it. (I can't share the data file yet, since we're hoping to use it for a publication, but the ggplot part is probably the more interesting bit, though I didn't really do anything new or anything that wasn't discussed in the post.)
AJS = read.csv(data) #read in file
Type = AJS[,17] #select and name "Type" column from csv
Year = AJS[,13] #select and name "Year" column from csv
Year = substr(Year,9,12) #get rid of junk from year column
Year = as.Date(Year, "%Y") #convert the year character to a date
Year = format(Year, "%Y") #get rid of the dummy month and day
Type = as.data.frame(Type) #create data frame
yt <- cbind(Year,Type) #bind the year and type together
library(ggplot2)
trial <- ggplot(yt, aes(Year,..count.., group= Type)) + #plot the data followed by aes(x- axis, y-axis, group the lines)
geom_density(alpha = 0.25, aes(fill=Type)) +
theme(axis.text.x = element_text(angle = 90, hjust = 0)) + #rotate the x axis tick labels to vertical (theme()/element_text() replace the old opts()/theme_text())
ggtitle("Trends in the Use of Visualizations in The American Journal of Sociology") + #Add title
scale_y_continuous('Appearances (10 or more)') #change Y-axis label
trial
This might be a more interesting dataframe to experiment with:
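# origin is given explicitly because older R versions require it for numeric input
df1 <- data.frame(date = as.Date(10*365*rbeta(100, .5, .1), origin = "1970-01-01"),group="a")
df2 <- data.frame(date = as.Date(10*365*rbeta(50, .1, .5), origin = "1970-01-01"),group="b")
df3 <- data.frame(date = as.Date(10*365*rbeta(25, 3,3), origin = "1970-01-01"),group="c")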
dfrm <- rbind(df1,df2,df3)
I thought working with an example in the help(stat_density) page would work, but it does not:
m <- ggplot(dfrm, aes(x=date), group=group)
m+ geom_histogram(aes(y=..density..)) + geom_density(fill=NA, colour="black")
However, an example I found in a search of the archives, from a posting by Hadley Wickham, does work:
m+ geom_density(aes(fill=group), colour="black")
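For the year and type vectors in the original question, one way to avoid lines sitting at 0 is to tabulate the counts and drop the empty combinations before plotting. This is a sketch of that idea, not the approach used in the posts above; the counts name is introduced here for illustration.

```r
library(ggplot2)

# Vectors from the question
year <- c(rep("1940", 11), rep("1941", 14))
type <- c("Line","Column", "Stacked Column", "Scatter with line", "Scatter with line",
          "Scatter with line", "Scatter with line", "Map with distribution","Line","Line","Line",
          "Bar","Bar","Stacked bar","Column","Column","Sociogram","Sociogram",
          "Column","Column","Column","Line","Line","Line","Line")

# Count appearances per (year, type) pair
counts <- as.data.frame(table(year = year, type = type), stringsAsFactors = FALSE)

# Drop combinations that never occur, e.g. "Sociogram" in 1940
counts <- counts[counts$Freq > 0, ]

# One line per graph type; types seen in a single year show up as a lone point
p <- ggplot(counts, aes(x = year, y = Freq, colour = type, group = type)) +
  geom_line() +
  geom_point()
```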
