Drawing a 100x100 contour plot depicting R2 (R-squared) values using R

I have a remote sensing data set consisting of 106 columns and 28 rows. The rows correspond to individual observations, i.e. individual plots in my case. The first column stores the unique ID by which each plot can be identified. The next 100 columns store the average measured reflectance values for each plot in consecutive spectral bands (band_x1, band_x2, band_x3, etc.). The remaining 5 columns store the values of various plant parameters (e.g. chlorophyll, nitrogen, biomass) that were measured in the field for each plot. The data set looks roughly as follows:
PlotID b1 b2 .... b99 b100 biomass nitrogen
1 0.11 0.16 0.40 0.41 10 52
2 0.09 0.11 0.41 0.40 19 35
3 0.10 0.19 0.43 0.49 18 72
4 0.13 0.10 0.44 0.39 16 46
...
I'm looking to create contour plots that depict the R2 (R-squared) values for all possible correlations between every combination of two bands and a single plant parameter (e.g. biomass). For example, a contour plot should present the R2 values for the correlation between every possible simple ratio (band_x1/band_x2) and a single trait. In addition, I would like to replicate this for two other types of indices: a normalized difference index ((band_x2-band_x1)/(band_x2+band_x1)) and a simple difference index (band_x2-band_x1).
I have been looking at the contour plot functions in R and at various practical examples, but none of them relate to what I am after. I have seen these graphs before, so there must be a way of generating them. Can anyone help me out?
Thanks in advance!
Edit: to clarify, here is an example of the kind of graph that I am looking to recreate:
http://image.slidesharecdn.com/2269e63a-1825-41b1-8d58-6901fd5b56ba-150102021118-conversion-gate01/95/thenkabailuavgermanyfinal1b-46-638.jpg?cb=1420186425
With the help of Heroka's answer, I have managed to recreate most of the plot with the following code (the majority of which is related to graphics):
n_band <- 101  # 101 bands: 450-950 nm in 5 nm steps
dat <- read.table("C:\\data.txt", header = TRUE)
res <- expand.grid(paste0("b", seq(from = 450, to = 950, by = 5)),
                   paste0("b", seq(from = 450, to = 950, by = 5)),
                   outcome = c("nitrogen"))
res$R2 <- apply(res, MARGIN = 1, FUN = function(x) {
  return(cor(dat[, x[1]] / dat[, x[2]], dat[, x[3]])^2)
})
library(scales)
library(ggplot2)
p1 <- ggplot(res, aes(x = Var1, y = Var2, fill = R2)) +
  geom_tile() +
  facet_grid(~outcome)
p1 +
  theme(axis.text.x = element_text(angle = 90)) +
  geom_vline(xintercept = seq(from = 1, to = 101, by = 5), color = "#8C8C8C") +
  geom_hline(yintercept = seq(from = 1, to = 101, by = 5), color = "#8C8C8C") +
  labs(title = "Contour plot of R^2 values for all possible correlations between Simple Ratio indices & Nitrogen Content",
       x = "Wavelength 1 (nm)", y = "Wavelength 2 (nm)") +
  scale_x_discrete(breaks = paste0("b", seq(450, 950, by = 25))) +
  scale_y_discrete(breaks = paste0("b", seq(450, 950, by = 25))) +
  scale_fill_continuous(low = "black", high = "green")
[Resulting contour plot]
I am getting quite near to my ultimate goal, but a few things remain that I would like to change:
- Have a scale bar with discrete colors, preferably using a strongly varied yet gradual color scheme, to make it easier to identify the band combinations with the highest R2 values. Ideally I would use a standard number of classes (8) for all plots, each containing the same number of observations, and let the software determine the break values from the min and max R2 values of the parameter being correlated.
- In addition, I would like to identify the highest values in each plot, or more specifically their (x,y) coordinates, so I can tell which bands produce the highest correlations. I have used which.min and which.max, but they yield neither sensible results nor (x,y) coordinates. (A sketch addressing both points follows below.)
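For what it's worth, a minimal sketch of how both points might be approached, using the res data frame built above (the class count, quantile breaks and palette are placeholders, not a definitive solution):
# Bin R2 into 8 classes with (roughly) equal numbers of observations per class
n_classes <- 8
res$R2class <- cut(res$R2,
                   breaks = quantile(res$R2, probs = seq(0, 1, length.out = n_classes + 1),
                                     na.rm = TRUE),
                   include.lowest = TRUE)
# ...then map the fill to the class, e.g. aes(fill = R2class) + scale_fill_brewer(palette = "Spectral")

# Report the band pair with the highest R2
res[which.max(res$R2), c("Var1", "Var2", "R2")]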

Here is an example of how you might solve this kind of problem. I've made an assumption about how to calculate R2, but that's easily fixable if it's wrong.
First, we simulate some data
set.seed(123)
n_band <- 100
dat <- data.frame(matrix(runif(28 * n_band), ncol = n_band))
colnames(dat) <- paste0("b", 1:n_band)
dat$biomass <- rpois(28, 10)
dat$nitrogen <- rpois(28, 10)
dat$ID <- 1:28
Then, we observe that for each combination of band1, band2 and outcome we only need to store one number (R2). So, first we generate a data frame containing all combinations of column names as strings:
res <- expand.grid(paste0("b",1:n_band),paste0("b",1:n_band),outcome=c("biomass","nitrogen"))
Then we use apply to get the R2 for each row of res (thus each combination). As each row of res contains three column names, we can use those to access the original data.
# ignore the warnings: when the two bands are identical the ratio is constant,
# so the correlation (and hence R2) is NA
res$R2 <- apply(res, MARGIN = 1, FUN = function(x) {
  return(cor(dat[, x[1]] / dat[, x[2]], dat[, x[3]])^2)
})
Then plotting is simple:
library(ggplot2)
p1 <- ggplot(res, aes(x=Var1, y=Var2, fill=R2))+
geom_tile() +
facet_grid(~outcome)
p1
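The same apply pattern should extend to the other two indices mentioned in the question; a sketch, simply swapping the formula inside the function (an assumption on my part, untested beyond that):
# Normalized difference index: (b2 - b1) / (b2 + b1)
res$R2_nd <- apply(res, MARGIN = 1, FUN = function(x) {
  cor((dat[, x[2]] - dat[, x[1]]) / (dat[, x[2]] + dat[, x[1]]), dat[, x[3]])^2
})

# Simple difference index: b2 - b1
res$R2_sd <- apply(res, MARGIN = 1, FUN = function(x) {
  cor(dat[, x[2]] - dat[, x[1]], dat[, x[3]])^2
})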

Related

Plot two datasets with different scales on the same graph, same axis in R

I have two datasets, that I'd like to see on a single scatterplot with a single axis. One dataset has Y values ranging from 0 to 0.0006, the other between 0 and 1.
Each dataset has 50 entries.
In R, is there a way of changing the scale of the y axis at the 0.0006 mark to show detail in both halves of the graph, e.g. so that the ranges 0-0.0006 and 0.0006-1 each take up the same amount of space on the graph?
I did this using a log scale; here is a sample dataset, which doesn't go all the way to 1 but tops out around 0.07.
I'm still open to other techniques as this one gives too much emphasis to the 0.0006-0 range.
You can scale your data for plotting, then call axis twice:
y1<-runif(50,0,0.0006)
y2<-runif(50,0.0006,1)
x<-runif(50)
y1.scaled<-y1*(0.5/0.0006)
y2.scaled<-(y2-0.0006)*(1-0.5)/(1-0.0006) + 0.5
plot(c(0,1),c(0,1),col=NA,yaxt='n',ylab="",xlab="")
points(x,y1.scaled,pch=20,col="red")
points(x,y2.scaled,pch=21,col="black")
axis(2,at=seq(0,0.5,length.out = 3), labels = c(0,0.0003,0.0006), col="red")
axis(2,at=seq(0.5,1,length.out = 3), labels = seq(0.0006,1,length.out=3))
See this post for how to re-scale a set of numbers with a known min and max to any other min and max:
How to scale down a range of numbers with a known min and max value
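For reference, the rescaling used above is just a linear map between two ranges, which can be wrapped in a small helper (a sketch; the function name is made up):
# Map values from the range [old_min, old_max] onto [new_min, new_max]
rescale_range <- function(x, old_min, old_max, new_min, new_max) {
  (x - old_min) / (old_max - old_min) * (new_max - new_min) + new_min
}

# e.g. the second series above:
# y2.scaled <- rescale_range(y2, 0.0006, 1, 0.5, 1)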
Assuming you have two different datasources (and that values from either source can be <0.0006) we could combine them, create an indicator for whether or not the value is <0.0006, and then use a facet_wrap with free scales. Something like this:
library(ggplot2)
set.seed(1)
y1<-runif(50,0,0.0006)
y2<-runif(50,0,1)
x<-1:50
df<-as.data.frame(rbind(cbind(y1,x),cbind(y2,x))) #Combine data
df$y1 <- as.numeric(as.character(df$y1))
df$x <- as.numeric(as.character(df$x))
df$group <- (df$y1 <= 0.0006) #Create group
#ggplot with facet
ggplot(data=df) + geom_point(aes(y=y1,x=x)) + facet_wrap(~group,scales="free")

R - ggplot2 - Get histogram of difference between two groups

Let's say I have a histogram with two overlapping groups. Here's a possible command from ggplot2 and a pretend output graph.
ggplot(data, aes(x=Variable1, fill=BinaryVariable)) + geom_histogram(position="identity")
So what I have is the frequency or count of each event. What I'd like to do instead is to get the difference between the two events in each bin. Is this possible? How?
For example, if we do RED minus BLUE:
Value at x=2 would be ~ -10
Value at x=4 would be ~ 40 - 200 = -160
Value at x=6 would be ~ 190 - 25 = 155
Value at x=8 would be ~ 10
I'd prefer to do this using ggplot2, but another way would be fine. My data frame is set up like this toy example (the real dimensions are 25000 rows x 30 columns). EDITED: here is example data to work with: GIST Example
ID Variable1 BinaryVariable
1 50 T
2 55 T
3 51 N
.. .. ..
1000 1001 T
1001 1944 T
1002 1042 N
As you can see from my example, I'm interested in a histogram to plot Variable1 (a continuous variable) separately for each BinaryVariable (T or N). But what I really want is the difference between their frequencies.
So, in order to do this we need to make sure that the "bins" we use for the histograms are the same for both levels of your indicator variable. Here's a somewhat naive solution (in base R):
df = data.frame(y = c(rnorm(50), rnorm(50, mean = 1)),
x = rep(c(0,1), each = 50))
#full hist
fullhist = hist(df$y, breaks = 20) #specify more breaks than probably necessary
#create histograms for 0 & 1 using breaks from full histogram
zerohist = with(subset(df, x == 0), hist(y, breaks = fullhist$breaks))
oneshist = with(subset(df, x == 1), hist(y, breaks = fullhist$breaks))
#combine the hists
combhist = fullhist
combhist$counts = zerohist$counts - oneshist$counts
plot(combhist)
So we specify how many breaks should be used (based on values from the histogram on the full data), and then we compute the differences in the counts at each of those breaks.
PS It might be helpful to examine what the non-graphical output of hist() is.
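For the curious, a quick sketch of what that non-graphical output contains (describing the standard return value of hist(); the annotations are mine):
str(fullhist)
# A list of class "histogram" with components including:
#   $breaks : the bin boundaries (shared by all three histograms above)
#   $counts : the number of observations in each bin (what we subtract)
#   $mids   : the bin midpoints (handy for axis labels)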
Here's a solution that uses ggplot as requested.
The key idea is to use ggplot_build to get the rectangles computed by stat_histogram. From that you can compute the differences in each bin and then create a new plot using geom_rect.
Set up and create a mock dataset with lognormal data:
library(ggplot2)
library(data.table)
theme_set(theme_bw())
n1<-500
n2<-500
k1 <- exp(rnorm(n1,8,0.7))
k2 <- exp(rnorm(n2,10,1))
df <- data.table(k=c(k1,k2),label=c(rep('k1',n1),rep('k2',n2)))
Create the first plot
p <- ggplot(df, aes(x=k,group=label,color=label)) + geom_histogram(bins=40) + scale_x_log10()
Get the rectangles using ggplot_build
p_data <- as.data.table(ggplot_build(p)$data[[1]])[,.(count,xmin,xmax,group)]
p1_data <- p_data[group==1]
p2_data <- p_data[group==2]
Join on the x-coordinates to compute the differences. Note that we work with the count column rather than the y-coordinates: with geom_histogram's default stacked position, the y values of the second group are cumulative.
newplot_data <- merge(p1_data, p2_data, by=c('xmin','xmax'), suffixes = c('.p1','.p2'))
newplot_data <- newplot_data[,diff:=count.p1 - count.p2]
setnames(newplot_data, old=c('count.p1','count.p2'), new=c('k1','k2'))
df2 <- melt(newplot_data,id.vars =c('xmin','xmax'),measure.vars=c('k1','diff','k2'))
Make the final plot:
ggplot(df2, aes(xmin=xmin,xmax=xmax,ymax=value,ymin=0,group=variable,color=variable)) + geom_rect()
Of course the scales and legends still need to be fixed, but that's a different topic.
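If it helps, a possible starting point for that cleanup (a sketch; the labels and colours are arbitrary, and note that the xmin/xmax values returned by ggplot_build are already on the log10 scale):
ggplot(df2, aes(xmin=xmin, xmax=xmax, ymax=value, ymin=0,
                group=variable, colour=variable)) +
  geom_rect() +
  scale_colour_manual(values=c(k1="steelblue", diff="black", k2="firebrick")) +
  labs(x="log10(k)", y="count", colour=NULL)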

"Heatbars" for visualizing consecutive missing data days?

I am trying to visualize large chunks of consecutive missing data side-by-side on ranges of 3, 5 and 10 years sampled daily. Hopefully using ggplot2 since I already have some aesthetics functions done.
I imagined this would come from a barplot or maybe some heatmap variation, but I am not too sure how to use them with time-series data.
I chose a black/white list of bars because I think it makes it easier to observe (1) where large chunks of missing data lie and (2) whether they occur at different moments in time (which would be important for choosing which stations to use, etc.), while (3) remaining relatively easy to read even with many bars, which would not be true of more conventional line plots for time series.
This is a draft of what I had in mind.
Here is some example data for 5 stations (in practice this could be up to over 80):
#Data from 5 different stations sampled daily.
df <- cbind(seq(as.Date("2010/01/01"), by="day", length.out=365*5),
            data.frame(matrix(rnorm(365*5*5), 365*5, 5)))
colnames(df) <- c("timestamp","st1","st2","st3","st4","st5")
#Add varying ranges of missing consecutive amount of days to observe result on visualization.
df[1:50,"st1"] <- NA # 50
df[51:200,"st2"] <- NA # 150
df[1:400,"st3"] <- NA # 400
df[501:1300,"st5"] <- NA # 800
Here's a rough stab at it...Alter the scales and theme elements to your liking...
library(ggplot2)
library(scales)
library(reshape2)
melt(df, id.vars = "timestamp") -> k
k$value <- ifelse(is.na(k$value), "NA", "Not NA")
ggplot(data = k) +
geom_point(aes(x = timestamp, y = variable, fill = value, colour = value), shape =22) +
scale_x_date() +
theme_bw()
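If the square points end up too small or uneven with ~80 stations, geom_tile is a near drop-in alternative (a sketch using the same melted data frame k; the colours are placeholders):
ggplot(data = k) +
  geom_tile(aes(x = timestamp, y = variable, fill = value)) +
  scale_fill_manual(values = c("NA" = "black", "Not NA" = "white")) +
  scale_x_date() +
  theme_bw()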

How to measure area between 2 distribution curves in R / ggplot2

As a specific example: imagine x is some continuous variable between 0 and 10, the red line is the distribution of "goods" and the blue is the distribution of "bads". I'd like to see if there is value in incorporating this variable into checking for 'goodness', but first I'd like to quantify the amount of stuff in the areas where blue > red.
Because this is a distribution chart the scales look the same, but in reality there is 98 times more good than bad in my sample, which complicates things: it's not just a matter of measuring the area under the curve, but of measuring the bad sample in the regions where its distribution lies above the red one.
I've been working to learn R, but am not even sure how to approach this one. Any help appreciated.
EDIT
sample data:
http://pastebin.com/7L3Xc2KU <- a few million rows of that, essentially.
the graph is created with
graph <- qplot(sample_x, bad_is_1, data=sample_data, geom="density", color=bad_is_1)
The only way I can think of to do this is to calculate the area between the curves using simple trapezoids. First we manually compute the densities:
d0 <- density(sample$sample_x[sample$bad_is_1==0])
d1 <- density(sample$sample_x[sample$bad_is_1==1])
Now we create functions that will interpolate between our observed density points
f0 <- approxfun(d0$x, d0$y)
f1 <- approxfun(d1$x, d1$y)
Next we find the x range of the overlap of the densities
ovrng <- c(max(min(d0$x), min(d1$x)), min(max(d0$x), max(d1$x)))
and divide that into 500 sections
i <- seq(min(ovrng), max(ovrng), length.out=500)
Now we calculate the distance between the density curves
h <- f0(i)-f1(i)
and using the formula for the area of a trapezoid we add up the area for the regions where d1>d0
area <- sum((h[-1] + h[-length(h)]) / 2 * diff(i) * (h[-1] >= 0))
# [1] 0.1957627
We can plot the region using
plot(d0, main="d0=black, d1=green")
lines(d1, col="green")
jj<-which(h>0 & seq_along(h) %% 5==0); j<-i[jj];
segments(j, f1(j), j, f1(j)+h[jj])
Here's a way to shade the area between two density plots and calculate the magnitude of that area.
library(ggplot2)
# Create some fake data
set.seed(10)
dat = data.frame(x=c(rnorm(1000, 0, 5), rnorm(2000, 0, 1)),
group=c(rep("Bad", 1000), rep("Good", 2000)))
# Plot densities
# Use y=..count.. to get counts on the vertical axis
p1 = ggplot(dat) +
geom_density(aes(x=x, y=..count.., colour=group), lwd=1)
Some extra calculations to shade the area between the two density plots
(adapted from this SO question):
pp1 = ggplot_build(p1)
# Create a new data frame with densities for the two groups ("Bad" and "Good")
dat2 = data.frame(x = pp1$data[[1]]$x[pp1$data[[1]]$group==1],
ymin=pp1$data[[1]]$y[pp1$data[[1]]$group==1],
ymax=pp1$data[[1]]$y[pp1$data[[1]]$group==2])
# We want ymax and ymin to differ only when the density of "Good"
# is greater than the density of "Bad"
dat2$ymax[dat2$ymax < dat2$ymin] = dat2$ymin[dat2$ymax < dat2$ymin]
# Shade the area between "Good" and "Bad"
p1a = p1 +
geom_ribbon(data=dat2, aes(x=x, ymin=ymin, ymax=ymax), fill='yellow', alpha=0.5)
Here are the two plots:
To get the area (number of values) in specific ranges of Good and Bad, use the density function on each group (or you can continue to work with the data pulled from ggplot as above, but this way you get more direct control over how the density distribution is generated):
## Calculate densities for Bad and Good.
# Use same number of points and same x-range for each group, so that the density
# values will line up. Use a higher value for n to get a finer x-grid for the density
# values. Use a power of 2 for n, because the density function rounds up to the nearest
# power of 2 anyway.
bad = density(dat$x[dat$group=="Bad"],
n=1024, from=min(dat$x), to=max(dat$x))
good = density(dat$x[dat$group=="Good"],
n=1024, from=min(dat$x), to=max(dat$x))
## Normalize so that densities sum to number of rows in each group
# Number of rows in each group
counts = tapply(dat$x, dat$group, length)
bad$y = counts[1]/sum(bad$y) * bad$y
good$y = counts[2]/sum(good$y) * good$y
## Results
# Number of "Good" in region where "Good" exceeds "Bad"
sum(good$y[good$y > bad$y])
# [1] 1931.495  (out of 2000 total in the data frame)
# Number of "Bad" in region where "Good" exceeds "Bad"
sum(bad$y[good$y > bad$y])
# [1] 317.7315  (out of 1000 total in the data frame)

Plot probability heatmap/hexbin with different sized bins

This is related to another question: Plot weighted frequency matrix.
I have this graphic (produced by the code below in R):
#Set the number of bets and number of trials and % lines
numbet <- 36
numtri <- 1000
#Fill a matrix where the rows are the cumulative bets and the columns are the trials
xcum <- matrix(NA, nrow=numbet, ncol=numtri)
for (i in 1:numtri) {
x <- sample(c(0,1), numbet, prob=c(5/6,1/6), replace = TRUE)
xcum[,i] <- cumsum(x)/(1:numbet)
}
#Plot the trials as transparent lines so you can see the build up
matplot(xcum, type="l", xlab="Number of Trials", ylab="Relative Frequency", main="", col=rgb(0.01, 0.01, 0.01, 0.02), las=1)
I very much like the way that this plot is built up and shows the more frequent paths as darker than the rarer paths (but it is not clear enough for a print presentation). What I would like to do is to produce some kind of hexbin or heatmap for the numbers. On thinking about it, it seems that the plot will have to incorporate different sized bins (see my back of the envelope sketch):
My question then: If I simulate a million runs using the code above, how can I present it as a heatmap or hexbin, with the different sized bins as shown in the sketch?
To clarify: I do not want to rely on transparency to show the rarity of a trial passing through a part of the plot. Instead I would like to denote rarity with heat and show a common pathway as hot (red) and a rare pathway as cold (blue). Also, I do not think the bins should be the same size because the first trial has only two places where the path can be, but the last has many more. Hence the fact I chose a changing bin scale, based on that fact. Essentially I am counting the number of times a path passes through the cell (2 in col 1, 3 in col 2 etc) and then colouring the cell based on how many times it has been passed through.
UPDATE: I already had a plot similar to #Andrie's, but I am not sure it is much clearer than the top plot. It is the discontinuous nature of this graph that I do not like (and why I want some kind of heatmap). I think that because the first column has only two possible values, there should not be a huge visual gap between them, etc. Hence why I envisaged the different sized bins. I still feel that the binning version would show a large number of samples better.
Update: This website outlines a procedure to plot a heatmap:
To create a density (heatmap) plot version of this we have to effectively enumerate the occurrence of these points at each discrete location in the image. This is done by setting a up a grid and counting the number of times a point coordinate "falls" into each of the individual pixel "bins" at every location in that grid.
Perhaps some of the information on that website can be combined with what we have already?
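For what it's worth, a rough sketch of that grid-and-count idea applied to the xcum matrix from the first code block (rows are bets, columns are trials; the 0.05 grid resolution is just a placeholder):
breaks <- seq(0, 1, by = 0.05)
# For each bet (row), count how many trials fall into each relative-frequency cell
cell_counts <- apply(xcum, 1, function(row)
  table(cut(row, breaks = breaks, include.lowest = TRUE)))
# cell_counts is a (cells x numbet) matrix that could be fed to image() or melted for geom_tile()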
Update: I took some of what Andrie wrote with some of this question, to arrive at this, which is quite close to what I was conceiving:
numbet <- 20
numtri <- 100
prob=1/6
#Fill a matrix
xcum <- matrix(NA, nrow=numtri, ncol=numbet+1)
for (i in 1:numtri) {
x <- sample(c(0,1), numbet, prob=c(prob, 1-prob), replace = TRUE)
xcum[i, ] <- c(i, cumsum(x)/cumsum(1:numbet))
}
colnames(xcum) <- c("trial", paste("bet", 1:numbet, sep=""))
mxcum <- reshape(data.frame(xcum), varying=1+1:numbet,
idvar="trial", v.names="outcome", direction="long", timevar="bet")
#from the other question
require(MASS)
dens <- kde2d(mxcum$bet, mxcum$outcome)
filled.contour(dens)
I don't quite understand what's going on, but this seems to be more like what I wanted to produce (obviously without the different sized bins).
Update: This is similar to the other plots here. It is not quite right:
plot(hexbin(x=mxcum$bet, y=mxcum$outcome))
Last try. As above:
image(mxcum$bet, mxcum$outcome)
This is pretty good. I would just like it to look like my hand-drawn sketch.
Edit
I think the following solution does what you ask for.
(Note that this is slow, especially the reshape step)
numbet <- 32
numtri <- 1e5
prob=5/6
#Fill a matrix
xcum <- matrix(NA, nrow=numtri, ncol=numbet+1)
for (i in 1:numtri) {
x <- sample(c(0,1), numbet, prob=c(prob, 1-prob), replace = TRUE)
xcum[i, ] <- c(i, cumsum(x)/cumsum(1:numbet))
}
colnames(xcum) <- c("trial", paste("bet", 1:numbet, sep=""))
mxcum <- reshape(data.frame(xcum), varying=1+1:numbet,
idvar="trial", v.names="outcome", direction="long", timevar="bet")
library(plyr)
mxcum2 <- ddply(mxcum, .(bet, outcome), nrow)
mxcum3 <- ddply(mxcum2, .(bet), summarize,
ymin=c(0, head(seq_along(V1)/length(V1), -1)),
ymax=seq_along(V1)/length(V1),
fill=(V1/sum(V1)))
head(mxcum3)
library(ggplot2)
library(scales)
p <- ggplot(mxcum3, aes(xmin=bet-0.5, xmax=bet+0.5, ymin=ymin, ymax=ymax)) +
  geom_rect(aes(fill=fill), colour="grey80") +
  scale_fill_gradient("Outcome", labels=percent, low="red", high="blue") +
  scale_y_continuous(labels=percent) +
  xlab("Bet")
print(p)
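If the reshape step becomes a bottleneck at 1e5 trials, the per-(bet, outcome) counts in mxcum2 could also be obtained directly from the xcum matrix; a sketch under that assumption (not benchmarked):
# Count identical outcome values within each bet column of xcum
mxcum2_alt <- do.call(rbind, lapply(seq_len(numbet), function(b) {
  tab <- table(xcum[, b + 1])          # column 1 of xcum is the trial id
  data.frame(bet = b,
             outcome = as.numeric(names(tab)),
             V1 = as.integer(tab))
}))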
FYI: This is more of an extended comment than an answer.
To me, this new plot looks like a stacked bar where each bar's height is equal to the intersection points of the upper and lower line at the next trial.
The way that I would approach this is to treat "Trials" as a categorical variable. Then we can search each row of xcum for elements that are equal. If they are, we can consider this a point of intersection whose minimum also represents the multiple defining the height of our bars.
x <- t(xcum)
x <- x[duplicated(x),]
x[x==0] <- NA
Now that we have the multiples of the actual points, we need to figure out how to take this to the next step and find a way of binning the information. That means deciding how many points will represent each grouping. Let's write some points out for posterity.
Trial 1 (2) = 1, 0.5 # multiple = 0.5
Trial 2 (3) = 1, 0.66, 0.33 # multiple = 0.33
Trial 3 (4) = 1, 0.75, 0.5, 0.25 # multiple = 0.25
Trial 4 (5) = 1, 0.8, 0.6, 0.4, 0.2 # multiple = 0.2
Trial 5 (6) = 1, 0.8333335, 0.6666668, 0.5000001, 0.3333334, 0.1666667
...
Trial 36 (35) = 1, 0.9722223, ..., 0.02777778 # multiple = 0.05555556 / 2
In other words, for each Trial there are n-1 points to plot. In your drawing you have 7 bins. So we need to figure out the multiples for each bin.
Let's cheat and divide the last two columns by two; we know from visual inspection that the minimum is lower than 0.05:
x[,35:36] <- x[,35:36] / 2
Then find the minimum of each column:
x <- apply(x, 2, function(x) min(x, na.rm=T))[-1] # Drop the 1
x <- x[c(1,2,3,4,8,17,35)] # I'm just guessing here by the "look" of your drawing.
The clearest way to do this is to create each bin separately (obviously, this could be automated later), repeating each bin's frequency once for every point it represents:
bin1 <- data.frame(bin = rep("bin1",2), Frequency = rep(x[1],2))
bin2 <- data.frame(bin = rep("bin2",3), Frequency = rep(x[2],3))
bin3 <- data.frame(bin = rep("bin3",4), Frequency = rep(x[3],4))
bin4 <- data.frame(bin = rep("bin4",5), Frequency = rep(x[4],5))
bin5 <- data.frame(bin = rep("bin5",9), Frequency = rep(x[5],9))
bin6 <- data.frame(bin = rep("bin6",18), Frequency = rep(x[6],18))
bin7 <- data.frame(bin = rep("bin7",36), Frequency = rep(x[7],36))
df <- rbind(bin1,bin2,bin3,bin4,bin5,bin6,bin7)
ggplot(df, aes(bin, Frequency, color=Frequency)) + geom_bar(stat="identity", position="stack")
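As noted above, the bin construction could be automated rather than written out by hand; a minimal sketch using the same bin sizes:
# Number of points represented by each bin (taken from the manual version above)
sizes <- c(2, 3, 4, 5, 9, 18, 36)
df <- data.frame(bin = rep(paste0("bin", seq_along(sizes)), sizes),
                 Frequency = rep(x, sizes))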
