I'm quite new to R and statistics in general. I am trying to plot in a line graph 2 categorical variables (part of speech "pos", condition "trcond") and a numerical one (score "totacc") in ggplot2.
> df1<-df[, c("trcond", "subtitle", "pos", "totacc")]
> head(df1)
trcond subtitle pos totacc
7 L New Scene_16 lex 0.250
29 N New Scene_16 lex 0.500
8 L New Scene_25 lex 0.875
30 N New Scene_25 lex 0.666
9 L New Scene_29 lex 1.000
31 N New Scene_29 lex 0.833
I have used this ggplot2 command:
>ggplot(data=summdfo, aes(x=pos, y=totacc, group=trcond, colour=trcond))
+ geom_line() + geom_point()
But it is not working: the graph has coloured (blue and red) dots all over the place and more than two lines linking them. I would like to post the graph I get, as I lack the words to explain it, but this is my first post and I don't seem to be able to upload pictures.
I would like to get a standard simple 2-line graph such as the blue and red ones in this page (where y=total bill, by x=time (lunch,dinner) grouped by gender): http://www.cookbook-r.com/Graphs/Bar_and_line_graphs_%28ggplot2%29/
Is this possible with my data set at all? If so, what am I doing wrong with the code?
Here I tried to create a data frame based on a limited sample of your data.
df1 <- data.frame(trcond=rep(c('L', 'N'), 3),
subtitle=rep('New Scene_29', 6), # Not in use, just a dummy
pos=c('lex', 'lex', 'lex', 'noLex', 'noLex', 'noLex'),
totacc=c(0.250, 0.5, 0.875, 0.666, 1.000, 0.833))
Because a trcond by pos combination can contain more than one observation in this data frame, geom_line() connects every raw point and the plot is going to be jumbled up like this:
ggplot(data=df1, aes(x=pos, y=totacc, group=trcond, color=trcond))+
geom_line() +
geom_point()
However, if you apply a summary function which will compute means for each condition, a correct plot will appear:
# ggplot2 >= 3.3.0 calls this argument fun; older versions used fun.y
ggplot(data=df1, aes(x=pos, y=totacc, group=trcond, color=trcond))+
geom_line(stat='summary', fun=mean) +
geom_point(stat='summary', fun=mean)
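To see what the summary step computes, the same per-group means can be calculated outside of ggplot2. This is a minimal sketch using base R's aggregate() on the sample data frame from above; the object name means_df is just an illustrative choice.

```r
# Sample data frame from the answer above
df1 <- data.frame(trcond = rep(c('L', 'N'), 3),
                  subtitle = rep('New Scene_29', 6),  # not used, just a dummy
                  pos = c('lex', 'lex', 'lex', 'noLex', 'noLex', 'noLex'),
                  totacc = c(0.250, 0.5, 0.875, 0.666, 1.000, 0.833))

# One mean per trcond x pos combination: this is what the summary stat plots
means_df <- aggregate(totacc ~ trcond + pos, data = df1, FUN = mean)
print(means_df)
```

With exactly one row per combination, plotting means_df with plain geom_line() and geom_point() would give two clean lines, one per trcond.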
Again, this is only my attempt to figure out what's in your data. It would be best if you provided a sample of your data here using dput(head(df1, 50)) so that we can give you a better answer.
I have a remote sensing data set consisting of 106 columns and 28 rows. The rows relate to individual observations, or individual plots in my instance. The first column stores the unique ID by which each plot may be identified. The next 100 columns store the average measured reflectance values for each plot in consecutive spectral bands (band_x, band_x2, band_x3, etc.). The remaining 5 columns store the values of various plant parameters (e.g. chlorophyll, nitrogen, biomass, etc.) that were measured in the field for each plot. The data set looks more or less as follows:
PlotID b1 b2 .... b99 b100 biomass nitrogen
1 0.11 0.16 0.40 0.41 10 52
2 0.09 0.11 0.41 0.40 19 35
3 0.10 0.19 0.43 0.49 18 72
4 0.13 0.10 0.44 0.39 16 46
...
I'm looking to create contour plots that depict R2 (R-squared) values for the correlations between a single plant parameter (e.g. biomass) and every possible combination of two bands. For example, the contour plots need to present the R2 values for the correlation between all possible simple ratio combinations (band_x1/band_x2) and a single trait. I am also looking to replicate this for two other types of indices: a normalized difference index ((band_x2-band_x1)/(band_x2+band_x1)) and a simple difference index (band_x2-band_x1).
I have been looking at the contour.plot syntax in R and various practical examples; however, none relates in any way to what I am after. I have seen these graphs before, so there must be a way of generating them. Who can help me out?
Thanks in advance!
Edit: to clarify some things, here is an example of a graph that I am looking to recreate:
http://image.slidesharecdn.com/2269e63a-1825-41b1-8d58-6901fd5b56ba-150102021118-conversion-gate01/95/thenkabailuavgermanyfinal1b-46-638.jpg?cb=1420186425
Using the help of Heroka, I have by now managed to recreate most of the plot, based on the following code (the majority of the code, however, is mostly related to graphics):
n_band=101
dat <- read.table("C:\\data.txt", header=TRUE)
res <- expand.grid(paste0("b", seq(from = 450, to = 950, by = 5)),paste0("b",seq(from = 450, to = 950, by = 5)),outcome=c("nitrogen"))
res$R2 <- apply(res, MARGIN=1,FUN=function(x){
return(cor(dat[,x[1]]/dat[,x[2]],dat[,x[3]])^2)
})
library(scales)
library(ggplot2)
p1 <- ggplot(res, aes(x=Var1, y=Var2, fill=R2)) +
geom_tile() +
facet_grid(~outcome)
p1 +
theme(axis.text.x=element_text(angle=+90)) +
geom_vline(xintercept=c(seq(from = 1, to = 101, by = 5)),color="#8C8C8C") +
geom_hline(yintercept=c(seq(from = 1, to = 101, by = 5)),color="#8C8C8C") +
labs(title = "Contour plot of R^2 values for all possible correlations between Simple Ratio indices & Nitrogen Content", x = "Wavelength 1 (nm)", y = "Wavelength 2 (nm)") +
scale_x_discrete(breaks = c("b450","b475","b500","b525","b550","b575","b600","b625","b650","b675","b700","b725","b750","b775","b800","b825","b850","b875","b900","b925","b950")) +
scale_y_discrete(breaks = c("b450","b475","b500","b525","b550","b575","b600","b625","b650","b675","b700","b725","b750","b775","b800","b825","b850","b875","b900","b925","b950")) +
scale_fill_continuous(low = "black", high = "green")
I am getting quite close to my ultimate goal, but a few things remain that I would like to change:
- Have a scale bar in discrete colors, preferably relying on a diverse but gradual color scheme, to make it easier to identify the band combinations with the highest R2 values. Ideally, I would like to use a standard number of classes (8) for all plots, each comprising the same number of observations, and let the software itself determine the break values based on the min and max R2 values for each parameter being correlated.
- I would also like to identify the highest values in each plot, or more specifically their (x,y) coordinates, so I can tell which bands produce the highest correlations. I have used which.min and which.max, but they yield neither sensible results nor (x,y) coordinates.
Here is an example of how you might solve this kind of problem. I've made an assumption about how to calculate R2, but that's easily fixable if it's wrong.
First, we simulate some data
set.seed(123)
n_band=100
dat <- data.frame(matrix(runif(28*n_band),ncol=n_band))
colnames(dat) <- paste0("b",1:n_band)
dat$biomass <- rpois(28,10)
dat$nitrogen <- rpois(28,10)
dat$ID <- 1:28
Then, we observe that for each combination of band1, band2 and outcome we only need to store one number (R2). So, first we generate a dataframe containing all combinations of column names as string:
res <- expand.grid(paste0("b",1:n_band),paste0("b",1:n_band),outcome=c("biomass","nitrogen"))
Then we use apply to get the R2 for each row of res (thus each combination). As each row of res contains three column names, we can use those to access the original data.
#ignore warnings; when both bands are the same, the ratio is constant and the correlation is NA
res$R2 <- apply(res, MARGIN=1,FUN=function(x){
return(cor(dat[,x[1]]/dat[,x[2]],dat[,x[3]])^2)
})
Then plotting is simple:
library(ggplot2)
p1 <- ggplot(res, aes(x=Var1, y=Var2, fill=R2))+
geom_tile() +
facet_grid(~outcome)
p1
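The two remaining wishes in the question's edit (eight colour classes holding equal numbers of observations, and the coordinates of the highest R2) could be approached along these lines. This is only a sketch on a smaller simulated data set (10 bands instead of 100); the column name R2_class, the object best and the choice of the "Spectral" palette are illustrative assumptions, not part of the original answer.

```r
set.seed(123)
n_band <- 10                                   # small, to keep the sketch fast
dat <- data.frame(matrix(runif(28 * n_band), ncol = n_band))
colnames(dat) <- paste0("b", 1:n_band)
dat$nitrogen <- rpois(28, 10)

res <- expand.grid(Var1 = paste0("b", 1:n_band),
                   Var2 = paste0("b", 1:n_band),
                   outcome = "nitrogen")

# R2 of the simple ratio band1/band2 against the outcome; NA on the diagonal
res$R2 <- mapply(function(b1, b2, out) {
  if (b1 == b2) return(NA_real_)
  cor(dat[[b1]] / dat[[b2]], dat[[out]])^2
}, as.character(res$Var1), as.character(res$Var2), as.character(res$outcome),
USE.NAMES = FALSE)

# Eight classes with (roughly) equal numbers of observations: octile breaks.
# cut() would fail if two breaks coincided, which is unlikely for continuous R2.
breaks <- quantile(res$R2, probs = seq(0, 1, length.out = 9), na.rm = TRUE)
res$R2_class <- cut(res$R2, breaks = breaks, include.lowest = TRUE)

# Band combination with the highest R2 (which.max skips NA values)
best <- res[which.max(res$R2), c("Var1", "Var2", "R2")]
print(best)

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p2 <- ggplot(res, aes(x = Var1, y = Var2, fill = R2_class)) +
    geom_tile() +
    scale_fill_brewer(palette = "Spectral")
  print(p2)
}
```

Because the breaks come from quantile(), they are recomputed from the data for every outcome, so each plot gets its own class limits while the number of classes stays fixed at eight.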
The data has 4 columns and roughly 600 rows. It is Twitter data collected using the twitteR package and then summarized into a data frame. Each tweet is given two scores based on how many words from these libraries it contains, and the summary counts the number of tweets that received each combination of scores. So the columns are the two types of scores, the date, and the number of tweets with those scores.
Score1 Score2 Date Number
0 0 01/10/2015 50
0 1 01/10/2015 34
1 0 01/10/2015 10
...and so on
The dates extend over a month, and the scores can go to +/- 10 or so either way.
I'm trying to plot that kind of data using a bubble plot, with score1 on the x axis, score2 on the y axis, and the size of the bubble dependent on the number (how many tweets with those scores there were per day).
My problem is that I only know how to use ggplot.
g <- ggplot(
twitterdata,
aes(x=score1, y=score2, size=number, label=""), guide=FALSE) +
geom_point(colour="black", fill="red", shape=21) +
scale_size_area(max_size = 30) +
scale_x_continuous(name="score1", limits=c(0, 10)) +
scale_y_continuous(name="score2", limits=c(-10, 10)) +
geom_text(size=4) +
theme_bw()
and that just gives me the plot for all dates, and what I need is a good way to see how that data changes over time. I've looked into using sliders and selectors but I really have no idea what would be the best tool to use. I've tried subsetting the data based on date, which works nicely but ideally I could make some kind of interactive graph.
I really need some way to select certain days out of that data to plot so it doesn't all pile up on itself, but to do it interactively so it can be presented.
Any help would be greatly appreciated, thank you.
It sounds like this won't completely satisfy your use case, but an extremely low-overhead way to add some interactivity to your plot would be to install.packages('plotly') and add the following line to your code:
library(ggplot2)
library(plotly) # provides ggplotly()
# your original code
g <- ggplot(
twitterdata,
aes(x=score1, y=score2, size=number, label=""),
guide=FALSE)+
geom_point(colour="black", fill="red", shape=21) +
scale_size_area(max_size = 30) +
scale_x_continuous(name="score1", limits=c(0,10)) +
scale_y_continuous(name="score2", limits=c(-10,10)) +
geom_text(size=4) +
theme_bw()
# add this line
gg <- ggplotly(g)
Details and demos: https://plot.ly/ggplot2/
As Eric suggested, if you want sliders and such you should check out shiny. Here's a demo combining shiny with plotly: https://plot.ly/r/shiny-tutorial/
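If a full shiny app is more than you need, ggplotly() also understands a frame aesthetic, which turns each distinct value into an animation frame with a slider. The sketch below assumes a small made-up twitterdata with a date column stored as text; the values are invented purely to show the shape of the data, and the axis limits from the original code are left out for brevity.

```r
# Hypothetical stand-in for twitterdata: three days, three score pairs per day
twitterdata <- data.frame(
  score1 = c(0, 0, 1, 0, 1, 2, 0, 1, 2),
  score2 = c(0, 1, 0, 0, -1, 1, 1, 0, -2),
  date   = rep(c("2015-10-01", "2015-10-02", "2015-10-03"), each = 3),
  number = c(50, 34, 10, 42, 18, 7, 61, 25, 12)
)

if (requireNamespace("ggplot2", quietly = TRUE) &&
    requireNamespace("plotly", quietly = TRUE)) {
  library(ggplot2)
  library(plotly)
  # ggplot2 warns about the unknown 'frame' aesthetic; ggplotly() picks it up
  g <- ggplot(twitterdata, aes(x = score1, y = score2, size = number)) +
    geom_point(aes(frame = date), colour = "black", fill = "red", shape = 21) +
    scale_size_area(max_size = 30) +
    theme_bw()
  gg <- ggplotly(g)  # one frame per date, with a slider to move between days
}
```

This keeps a single plot object while showing one day at a time, which avoids the bubbles from different dates piling up on each other.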
I am looking to scale the x axis on my barplot to time, so as to accurately represent when measurements were taken.
I have these data frames:
> Botcv
Date Average SE
1 2014-09-01 4.0 1.711307
2 2014-10-02 5.5 1.500000
> Botc1
Date Average SE
1 2014-10-15 2.125 0.7180703
2 2014-11-12 1.000 0.4629100
3 2014-12-11 0.500 0.2672612
> Botc2
Date Average SE
1 2014-10-15 3.375 1.3354708
2 2014-11-12 1.750 0.4531635
3 2014-12-11 0.625 0.1829813
I use this code to produce a grouped barplot:
covaverage <- c(Botcv$Average,NA,NA,NA)
c1average <- c(NA,NA, Botc1$Average)
c2average <- c(NA,NA, Botc2$Average)
date <- c(Botcv$Date, Botc1$Date)
averagematrix <- matrix(c(covaverage,c1average, c2average), nrow=3, ncol=5, byrow=TRUE)
barplot(averagematrix,date, xlab="Date", ylab="Average", axis.lty=1, space=NULL,width=3,beside=T, ylim=c(0.00,6.00))
R plots the bars equal distances apart by default and I have been trying to find a workaround for this. I have seen several other solutions that utilise ggplot2, but I am producing plots for my master's thesis and would like to keep the appearance of my barplots in line with other graphs that I have created using base R graphics. I also want to add error bars to the plot. If anyone could provide a solution then I would be very grateful!! Thanks!
Perhaps you can use this as a start. It is probably easier to use boxplots, as they can be put at a given x position by using the at argument. For base barplots this cannot be done, but you can use rect() instead to replicate the barplot look. Error bars can be added using arrows() or segments().
bar_w = 1 # half-width of each bar, in days
offset = c(-1,1) # horizontal shift per type so the bars sit side by side
cols = grey.colors(2) # colors for different types
# combine into a single data frame
d = data.frame(rbind(Botc1, Botc2), 'type' = c(1,1,1,2,2,2))
# set up empty plot with x and y lims that leave room for the error bars
plot(as.Date(d$Date), d$Average, type='n', ylim=c(0,5), xlab='Date', ylab='Average')
# draw data of data frame 1 and 2
for (i in unique(d$type)){
  dd = d[d$type==i, ]
  x = as.Date(dd$Date)
  y = dd$Average
  # rectangles
  rect(xleft=x-bar_w+offset[i], ybottom=0, xright=x+bar_w+offset[i], ytop=y, col=cols[i])
  # error bars: mean +/- 1 SE
  arrows(x0=x+offset[i], y0=y-dd$SE, x1=x+offset[i], y1=y+dd$SE, col=1, angle=90, code=3, length = 0.1)
}
If what you want is simply a theme that matches the look of base graphics, adding + theme_bw() in ggplot2 will get you close:
data(mtcars)
require(ggplot2)
ggplot(mtcars, aes(factor(cyl), mpg)) +
geom_boxplot() +
theme_bw()
Alternative
boxplot(mpg~cyl,data=mtcars)
If, as you said, the only thing you want to achieve is a similar look, and you already have a working plot in ggplot2, using theme_bw() should produce plots that look close to what the standard plotting mechanism produces. If you feel so inclined, you may tweak details such as font sizes, thickness of graph borders or visualisation of outliers.
I'm at the very beginning with R programming. I'm using RStudio for an exam, and I have to represent graphically the results of some calculations on a dataset.
I have a structure like that:
and what I was thinking of doing was to make some histograms with the 3 values of the mean for each row, and the same for the median and the trimmed mean.
First question: Is this a correct way to represent this kind of data graphically? Or is there a better kind of plot?
Second question: Could someone give me the code to draw a graph with the 3 strings ("Lobby", "R & D", "ROE") on the x axis and, on the y axis, a scale of values that includes the results, so that the histograms represent the differences in investment in lobbying, R & D and the ROE obtained?
I hope I've been clear enough; if I haven't specified something relevant, please ask me.
It sounds like you want to do the following. With your data in a CSV file called bar.csv having this format:
Dept Mean Median Trimmed_Mean
Lobby 0.008 0.0018 0.0058
R & D 6.25 3.2 4.78
ROE 19.08 16.66 16.276
You can use library(ggplot2) and library(reshape) and the commands listed here:
library(ggplot2)
library(reshape)
dat.m <- read.csv("bar.csv")
dat.m <- melt(dat.m, id.vars = "Dept")
# one facet per Dept, each with its own y scale
ggplot(dat.m, aes(x = Dept, y = value, fill = variable)) +
  geom_bar(stat = 'identity') +
  facet_wrap(~ Dept, ncol = 3, scales = "free_y")
# stacked bars on a common y scale
ggplot(dat.m, aes(x = Dept, y = value, fill = variable)) +
  geom_bar(stat = 'identity')
to display the graphs below:
As zhaoy says, a histogram (usually) works with raw data, and what you have is summary data. If you do have the raw data, you could use library(ggplot2) to produce a boxplot summary graph like this (using the InsectSprays data set that ships with R):
library(ggplot2)
p <- ggplot(InsectSprays, aes(x = spray, y = count)) + geom_boxplot()
# fun replaces the deprecated fun.y argument (ggplot2 >= 3.3.0)
p <- p + stat_summary(fun = mean, shape = 1, colour = 'red', geom = 'point')
print(p)
Or simply using the standard boxplot command, with the same data, with added functionality to display the means:
boxplot(count ~ spray, data = InsectSprays, col = "lightgray")
means <- tapply(InsectSprays$count,InsectSprays$spray,mean)
points(means,col="red",pch=18)
In response to question 1: The purpose of histograms is to display the density or frequency of continuous data. If you're trying to compare the mean / median / trimmed mean across the 3 categories in the row.name column, I suggest bar graphs. I'm not sure comparing mean / median / trimmed mean in a single graph is coherent to viewers, so it may be ideal to generate 3 bar graphs.
In response to question 2: If you aim to compare the 3 categories in the row.name column using multiple columns of data, I suggest a box-plot. I realize that the box-plot does not traditionally include the mean, but it is one of the best visualizations for comparing data across categories. Please see r-bloggers.com/box-plot-with-r-tutorial for an example.
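The suggestion of three separate bar graphs can be sketched in base R as follows. The data frame simply retypes the summary table shown earlier in this thread, and the panel layout (three plots side by side) is an assumption about what reads best, since the three departments differ by orders of magnitude.

```r
# Summary values retyped from the table in the earlier answer
dat <- data.frame(
  Dept         = c("Lobby", "R & D", "ROE"),
  Mean         = c(0.008, 6.25, 19.08),
  Median       = c(0.0018, 3.2, 16.66),
  Trimmed_Mean = c(0.0058, 4.78, 16.276)
)

stats <- c("Mean", "Median", "Trimmed_Mean")

# One bar graph per statistic, side by side
old_par <- par(mfrow = c(1, 3))
for (s in stats) {
  barplot(dat[[s]], names.arg = dat$Dept, main = s, ylab = "Value")
}
par(old_par)  # restore the previous layout
```

Each panel compares the three categories on one statistic only, so the viewer is never asked to compare a mean against a median within the same bar group.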
I am trying to plot populations of predators and of prey over time, with confidence intervals. I can plot these two separately, but how do I plot them on the same graph?
#take mean, number, and create se of prey(d)
d.means=tapply(mydata$prey,mydata$week, mean)
d.n=tapply(mydata$prey,mydata$week, length)
d.se=tapply(mydata$prey,mydata$week, sd)/sqrt(d.n)
#plot with se using plotrix
plotCI(as.numeric(row.names(d.means)),d.means,d.se,ylim=c(0,400),pch=19,gap=0,xlab="Week",ylab="d, w population")
#take mean, number, and create se of predator(w)
w.means=tapply(mydata$pred,mydata$week, mean)
w.n=tapply(mydata$pred,mydata$week, length)
w.se=tapply(mydata$pred,mydata$week, sd)/sqrt(w.n)
#plot with se using plotrix
plotCI(as.numeric(row.names(w.means)),w.means,w.se,ylim=c(0,400),pch=19,gap=0,xlab="Week",ylab="d, w population")
After the first plot, use the code below before plotting the next plot:
par(new=T)
Make sure that you set the same xlim and ylim in both calls so that they accommodate both plots. You will also need to use the options axes=F and ann=F in the second call so that the axes and labels are not drawn twice.
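The par(new=TRUE) approach can be sketched with base graphics only, drawing the error bars with arrows() instead of plotCI() so that no extra package is needed. The mydata below is simulated purely for illustration, and the helper se() is a hypothetical convenience function, not something from the question.

```r
set.seed(1)
# Simulated stand-in for mydata: 5 replicates per week over 6 weeks
mydata <- data.frame(week = rep(1:6, each = 5),
                     prey = rpois(30, 250),
                     pred = rpois(30, 80))

se <- function(v) sd(v) / sqrt(length(v))  # standard error of the mean

d.means <- tapply(mydata$prey, mydata$week, mean)
d.se    <- tapply(mydata$prey, mydata$week, se)
w.means <- tapply(mydata$pred, mydata$week, mean)
w.se    <- tapply(mydata$pred, mydata$week, se)
weeks   <- as.numeric(names(d.means))

xlim <- range(weeks)   # shared limits so both layers line up
ylim <- c(0, 400)

# First plot: prey, with axes and labels
plot(weeks, d.means, pch = 19, xlim = xlim, ylim = ylim,
     xlab = "Week", ylab = "d, w population")
arrows(weeks, d.means - d.se, weeks, d.means + d.se,
       angle = 90, code = 3, length = 0.05)

# Second plot on top: predator, without redrawing axes or labels
par(new = TRUE)
plot(weeks, w.means, pch = 1, col = "red", xlim = xlim, ylim = ylim,
     axes = FALSE, ann = FALSE)
arrows(weeks, w.means - w.se, weeks, w.means + w.se,
       angle = 90, code = 3, length = 0.05, col = "red")
legend("topright", legend = c("prey (d)", "predator (w)"),
       pch = c(19, 1), col = c("black", "red"))
```

Storing xlim and ylim once and passing them to both plot() calls is what keeps the two layers on the same coordinate system; if the limits differed, the second series would be silently drawn on a different scale.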
These graphical features are discussed in detail in the ebook "R Fundamentals & Graphics". You might want to use it as a desk reference.
#take mean, number, and create se of prey(d)
d.means=tapply(mydata$prey,mydata$week, mean)
d.n=tapply(mydata$prey,mydata$week, length)
d.se=tapply(mydata$prey,mydata$week, sd)/sqrt(d.n)
#take mean, number, and create se of predator(w)
w.means=tapply(mydata$pred,mydata$week, mean)
w.n=tapply(mydata$pred,mydata$week, length)
w.se=tapply(mydata$pred,mydata$week, sd)/sqrt(w.n)
Here you have created all the variables you need, but to plot them using ggplot you need them to be in a tall dataset with a variable indicating whether they are predator or prey. I also added a time variable; I think yours would be week.
x=data.frame(means=c(w.means,d.means),
n=c(w.n,d.n),
se=c(w.se,d.se),
role=c(rep("pred",length(w.n)),rep("prey",length(d.n))),
time=c(1:length(w.n),1:length(d.n))
)
I don't know exactly what your data look like so here is a fake one I cooked up just to illustrate the format.
means n se role time
1 0.9874234 10 0.16200575 pred 1
2 1.4120207 12 0.08895026 pred 2
3 2.7352516 8 0.07991036 pred 3
4 1.1301248 11 0.05481813 prey 1
5 2.4810040 13 0.28682585 prey 2
6 3.1546947 9 0.22126054 prey 3
Once the data are in this nice format using ggplot is really pretty easy.
ggplot(x, aes(x=time, y=means, colour=role)) +
geom_errorbar(aes(ymin=means-se, ymax=means+se), width=.1) +
geom_line()
That gives this: