Scatter plot in ggplot, one numeric variable across two groups - r

I would like to create a scatter plot in ggplot2 which displays male test_scores on the x-axis and female test_scores on the y-axis using the dataset below. I can easily create a geom_line plot splitting male and female and putting the date ("dts") on the x-axis.
library(tidyverse)
#create data
dts <- c("2011-01-02","2011-01-02","2011-01-03","2011-01-04","2011-01-05",
"2011-01-02","2011-01-02","2011-01-03","2011-01-04","2011-01-05")
sex <- c("M","F","M","F","M","F","M","F","M","F")
test <- round(runif(10,.5,1),2)
semester <- data.frame("dts" = as.Date(dts), "sex" = sex, "test_scores" =
test)
#show the geom_line plot
ggplot(semester, aes(x = dts, y = test_scores, color = sex)) + geom_line()
It seems with only one time series, ggplot2 does better with the data in wide format than long format. For instance, I could easily create two columns, "male_scores" and "female_scores" and plot those against each other, but I would like to keep my data tidy and in long format.
Cheers and thank you.

You've over-tidied. Tidying data isn't just the mechanism of making it as long as possible; it's making it as wide as necessary.
For example, if you had location as X and Y for animal sightings you wouldn't have two rows, one with a "label" column containing "X" and the X coordinate in a "value" column and another with "Y" in the "label" column and the Y coordinate in the "value" column - unless you really were storing the data in a key-value store, but that's another story...
Widen your data and put the test scores for male and female into test_score_male and test_score_female; they are then the x and y aesthetics for your scatter plot.

The problem with keeping the data long is that you will not have a corresponding X value for a given Y value. The reason for this is the structure of the dataset:
dts sex test_scores
1 2011-01-02 M 0.67
2 2011-01-02 F 0.78
3 2011-01-03 M 0.58
4 2011-01-04 F 0.58
5 2011-01-05 M 0.51
If you were to use the code
ggplot(semester, aes(x = semester$test_scores[semester$sex=='M'],
y = semester$test_scores[semester$sex=='F'],
color = sex)) + geom_point()
ggplot2 will throw an error. The main reason is that, once you subset the male scores, there are no corresponding female scores for that subset. You first need to collapse the data down to the date level. As you correctly point out, the data is no longer in long format at that point.
For this one-off plot I would recommend creating a wide dataset. There are multiple ways of doing that, but that is a different topic.
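As a minimal sketch of one such way, tidyr's pivot_wider() can spread the scores into one column per sex while the long data stays untouched. The data below is hypothetical and assumes exactly one male and one female score per date (the sample data in the question repeats some date/sex pairs, which would need aggregating first). The test_score_ prefix is only an illustrative naming choice.

```r
library(tidyr)
library(ggplot2)

# Hypothetical long data: exactly one M and one F score per date
semester_long <- data.frame(
  dts = as.Date(rep(c("2011-01-02", "2011-01-03", "2011-01-04", "2011-01-05"),
                    each = 2)),
  sex = rep(c("M", "F"), times = 4),
  test_scores = c(0.67, 0.78, 0.58, 0.61, 0.72, 0.58, 0.51, 0.93)
)

# One row per date, one score column per sex
semester_wide <- pivot_wider(
  semester_long,
  names_from = sex,
  values_from = test_scores,
  names_prefix = "test_score_"
)

# Male scores on the x-axis, female scores on the y-axis
p <- ggplot(semester_wide, aes(x = test_score_M, y = test_score_F)) +
  geom_point()
print(p)
```

The long data frame can stay as the "tidy" source of truth; the wide copy exists only for this plot.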

Related

ggplot: Plotting timeseries data with missing values

I have been trying to plot a graph of two columns from a data frame which I created. The first column, named "Time", holds daily dates (format YYYY-MM-DD), and the second column, named "data1", holds the precipitation magnitude as a numeric value.
This data is taken from an Excel file, "St Lucia3", which has a total of 11598 data points and stores daily precipitation data from 1981 to 2018 in two columns:
YearMonthDay (format- "YYYYMMDD", example "19810501")
Rainfall (mm)
The code for importing data into R:
library(readxl)
StLucia <- read_excel("C:/Users/hp/Desktop/St Lucia3.xlsx")
The code for time data "Time" :
Time <- as.Date(as.character(StLucia$YearMonthDay), format= "%Y%m%d")
The code for precipitation data "data1" :
library("imputeTS")
data1 <- na_ma(StLucia$`Rainfall (mm)`, k = 4, weighting = "exponential")
The code for data frame "Precip1" :
Precip1 <- data.frame(Time, data1, check.rows=TRUE)
The code for ggplot is:
ggplot(data = Precip1, mapping= aes(x= Time, y= data1)) + geom_line()
Using ggplot for plotting the graph between "Time" and "data1" results as:
Can someone please explain to me why there is an unusual kink at the right end of the graph, even though there are no such values in the column "data1"?
The plot of "data1" data against its index is as shown:
The code for this plot is:
plot(data1, type = "l")
Any help would be highly appreciated. Thanks!
By using pad we can restore the missing dates and assign them an NA value, so as to avoid plotting in the region of missing data.
library(padr)
library(zoo)
YearMonthDay<-c(19810501,19810502,19810504,19810505)
Data<-c(1,2,3,4)
StLucia<-data.frame(YearMonthDay,Data)
StLucia$YearMonthDay <- as.Date(as.character(StLucia$YearMonthDay), format=
"%Y%m%d")
> StLucia
YearMonthDay Data
1 1981-05-01 1
2 1981-05-02 2
3 1981-05-04 3
4 1981-05-05 4
Note: you can see we are missing a date, but still there is no gap between position 2 and 3, thus plotting versus indexing you would not see a gap.
So let's add the missing date:
StLucia<-pad(StLucia,interval="day")
> StLucia
YearMonthDay Data
1 1981-05-01 1
2 1981-05-02 2
3 1981-05-03 NA
4 1981-05-04 3
5 1981-05-05 4
plot(StLucia, type = "l")
If you want to fill in those NA values, use na.locf() from the zoo package.
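A short sketch of that fill step, using the padded example from above rebuilt by hand so that padr is not needed here. The Data_filled column name is made up for illustration, and na.rm = FALSE is passed so the result keeps the same length as the input.

```r
library(zoo)

# The padded example from above, built by hand
StLucia <- data.frame(
  YearMonthDay = seq(as.Date("1981-05-01"), as.Date("1981-05-05"), by = "day"),
  Data = c(1, 2, NA, 3, 4)
)

# Carry the last observed value forward into the gap
StLucia$Data_filled <- na.locf(StLucia$Data, na.rm = FALSE)
StLucia$Data_filled
# 1 2 2 3 4
```

Whether carrying the previous day's value forward makes sense for rainfall is a separate modelling question.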
Here is a reproducible example - change the names to match your data.
# create sample data
set.seed(47)
dd = data.frame(t = Sys.Date() + c(0:5, 30:32), y = runif(9))
# demonstrate problem
library(ggplot2)
ggplot(dd, aes(t, y)) +
  geom_point() +
  geom_line()
The easiest solution, as Tung points out, is to use a more appropriate geom, like geom_col:
ggplot(dd, aes(t, y)) +
geom_col()
If you really want to use lines, you should fill in the missing dates with NA for rainfall.
# calculate all days
all_days = data.frame(t = seq.Date(from = min(dd$t), to = max(dd$t), by = "day"))
# join to original data
library(dplyr)
dd_complete = left_join(all_days, dd, by = "t")
# ggplot won't connect lines across missing values
ggplot(dd_complete, aes(t, y)) +
geom_point() +
geom_line()
Alternately, you could replace the missing values with 0s to have the line just go along the axis, but I think it's nicer to not plot the line, which implies no data/missing data, rather than plot 0s which implies no rainfall.
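For completeness, here is a sketch of that zero-filling alternative in base R, with merge() standing in for left_join(). The dd_zero name is made up for this example, and treating missing days as zero rainfall is an assumption about the data, not something the data itself tells you.

```r
# Same kind of sample data as above (dates are relative to today)
set.seed(47)
dd <- data.frame(t = Sys.Date() + c(0:5, 30:32), y = runif(9))

# Complete daily sequence, then a base-R left join onto it
all_days <- data.frame(t = seq(min(dd$t), max(dd$t), by = "day"))
dd_zero <- merge(all_days, dd, by = "t", all.x = TRUE)

# Assumption: a missing day means no rainfall
dd_zero$y[is.na(dd_zero$y)] <- 0
```

Plotting dd_zero with geom_line() then draws the line along the axis through the gap instead of breaking it.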

Drawing a 100x100 contour plot depicting R2 (Rsquared) values using R

I have a remote sensing data set consisting of 106 columns and 28 rows. The rows relate to individual observations, or individual plots in my instance. The first column stores the uniqueID by which each plot may be identified. The next 100 columns store the average measured reflectance values for each plot in consecutive spectral bands (band_x, band_x2, band_x3, etc.). The remaining 5 columns store the values of various plant parameters (e.g. chlorophyll, nitrogen, biomass, etc.) that were measured in the field for each plot. The data set looks more or less as follows:
PlotID b1 b2 .... b99 b100 biomass nitrogen
1 0.11 0.16 0.40 0.41 10 52
2 0.09 0.11 0.41 0.40 19 35
3 0.10 0.19 0.43 0.49 18 72
4 0.13 0.10 0.44 0.39 16 46
...
I'm looking to create contour plots that depict R2 (R-squared) values for the correlation between a single plant parameter (e.g. biomass) and every possible combination of two bands. For example, the contour plots need to present the R2 values for the correlation between all possible simple ratio combinations (band_x1/band_x2) and a single trait. In addition, I am looking to replicate this for two other types of indices: a normalized difference index ((band_x2-band_x1)/(band_x2+band_x1)) and a simple difference index (band_x2-band_x1).
I have been looking at the contour.plot syntax in R and various practical examples, but none of them relates in any way to what I am after. I have seen these graphs before, so there must be a way of generating them. Who can help me out?
Thanks in advance!
Edit: to clarify some things, here is an example of a graph that I am looking to recreate:
http://image.slidesharecdn.com/2269e63a-1825-41b1-8d58-6901fd5b56ba-150102021118-conversion-gate01/95/thenkabailuavgermanyfinal1b-46-638.jpg?cb=1420186425
Using the help of Heroka, I have by now managed to recreate most of the plot, based on the following code (the majority of the code, however, is mostly related to graphics):
n_band=101
dat <- read.table("C:\\data.txt", header=TRUE)
res <- expand.grid(paste0("b", seq(from = 450, to = 950, by = 5)),paste0("b",seq(from = 450, to = 950, by = 5)),outcome=c("nitrogen"))
res$R2 <- apply(res, MARGIN=1,FUN=function(x){
return(cor(dat[,x[1]]/dat[,x[2]],dat[,x[3]])^2)
})
library(scales)
library(ggplot2)
p1 <- ggplot(res, aes(x=Var1, y=Var2, fill=R2)) +
geom_tile() +
facet_grid(~outcome)
p1 +
theme(axis.text.x=element_text(angle=+90)) +
geom_vline(xintercept=c(seq(from = 1, to = 101, by = 5)),color="#8C8C8C") +
geom_hline(yintercept=c(seq(from = 1, to = 101, by = 5)),color="#8C8C8C") +
labs(title = "Contour plot of R^2 values for all possible correlations between Simple Ratio indices & Nitrogen Content", x = "Wavelength 1 (nm)", y = "Wavelength 2 (nm)") +
scale_x_discrete(breaks = c("b450","b475","b500","b525","b550","b575","b600","b625","b650","b675","b700","b725","b750","b775","b800","b825","b850","b875","b900","b925","b950")) +
scale_y_discrete(breaks = c("b450","b475","b500","b525","b550","b575","b600","b625","b650","b675","b700","b725","b750","b775","b800","b825","b850","b875","b900","b925","b950")) +
scale_fill_continuous(low = "black", high = "green")
I am getting quite close to my ultimate goal, but a few things remain that I would like to change:
- Have a scale bar in discrete colors, preferably relying on a vastly diverse but gradual color scheme to better allow identification of the band combinations with the highest R2 values. I would ideally like to use a standard number of classes (8), each comprising the same number of observations, for all plots, letting the software itself determine the break values based on the min and max R2 values for each parameter being correlated.
- I would also like to be able to identify the highest values in each plot, or more specifically their (x,y) coordinates, so I can tell which bands produce the highest correlations. I have used which.min and which.max, but they yield neither sensible results nor (x,y) coordinates.
Here is an example how you might solve this kind of problem. I've made an assumption on how to calculate R2, but that's easily fixable if it's wrong.
First, we simulate some data
set.seed(123)
n_band=100
dat <- data.frame(matrix(runif(28*n_band),ncol=n_band))
colnames(dat) <- paste0("b",1:n_band)
dat$biomass <- rpois(28,10)
dat$nitrogen <- rpois(28,10)
dat$ID <- 1:28
Then, we observe that for each combination of band1, band2 and outcome we only need to store one number (R2). So, first we generate a dataframe containing all combinations of column names as string:
res <- expand.grid(paste0("b",1:n_band),paste0("b",1:n_band),outcome=c("biomass","nitrogen"))
Then we use apply to get the R2 for each row of res (thus each combination). As each row of res contains three column names, we can use those to access the original data.
#ignore warnings; the correlation for a band divided by itself is missing (NA)
res$R2 <- apply(res, MARGIN=1,FUN=function(x){
return(cor(dat[,x[1]]/dat[,x[2]],dat[,x[3]])^2)
})
Then plotting is simple:
library(ggplot2)
p1 <- ggplot(res, aes(x=Var1, y=Var2, fill=R2))+
geom_tile() +
facet_grid(~outcome)
p1
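The two remaining wishes in the question (eight equally populated color classes, and the coordinates of the best band pair) could be approached roughly as follows. This is only a sketch on simulated data with 10 bands to keep it fast. The names band1, band2, best and R2_class are made up here, and the quantile-based breaks are one possible reading of "same number of observations per class".

```r
set.seed(123)
n_band <- 10
dat <- data.frame(matrix(runif(28 * n_band), ncol = n_band))
colnames(dat) <- paste0("b", 1:n_band)
dat$biomass <- rpois(28, 10)

bands <- paste0("b", 1:n_band)
res <- expand.grid(band1 = bands, band2 = bands, stringsAsFactors = FALSE)

# R2 of the simple ratio band1/band2 against biomass; NA on the diagonal,
# where the ratio is constant and no correlation exists
res$R2 <- mapply(function(b1, b2) {
  if (b1 == b2) return(NA_real_)
  cor(dat[[b1]] / dat[[b2]], dat$biomass)^2
}, res$band1, res$band2, USE.NAMES = FALSE)

# Row (and therefore band pair) with the highest R2; which.max skips NAs
best <- res[which.max(res$R2), ]
best

# Eight classes with roughly equal numbers of observations (octiles)
breaks <- quantile(res$R2, probs = seq(0, 1, length.out = 9), na.rm = TRUE)
res$R2_class <- cut(res$R2, breaks = breaks, include.lowest = TRUE)
table(res$R2_class)
```

Mapping fill = R2_class in geom_tile() together with a discrete scale such as scale_fill_brewer() would then give a stepped legend. Because which.max is applied to a column of the data frame, the returned row carries both band names along with it.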

"Heatbars" for visualizing consecutive missing data days?

I am trying to visualize large chunks of consecutive missing data side-by-side on ranges of 3, 5 and 10 years sampled daily. Hopefully using ggplot2 since I already have some aesthetics functions done.
I imagined this would come from a barplot or maybe some heatmap variation, but I am not too sure how to use them with time-series data.
I chose a black/white list of bars because I think it makes it easier to observe (1) where large chunks of missing data lie and (2) whether they occur at different moments in time (which would be important when choosing which stations to use, etc.), while (3) remaining relatively easy to read with many bars, which would not be true of the more conventional line plots for time series.
This is a draft of what I had in mind.
Here is some example data for 5 stations (in practice this could be more than 80):
#Data from 5 different stations sampled daily.
df <- cbind(seq(as.Date(("2010/01/01")),by="day",length.out=365*5),data.frame(matrix(rnorm(365*5*5),365*5,5)))
colnames(df) <- c("timestamp","st1","st2","st3","st4","st5")
#Add varying ranges of missing consecutive amount of days to observe result on visualization.
df[1:50,"st1"] <- NA # 50
df[51:200,"st2"] <- NA # 150
df[1:400,"st3"] <- NA # 400
df[501:1300,"st5"] <- NA # 800
Here's a rough stab at it...Alter the scales and theme elements to your liking...
library(ggplot2)
library(scales)
library(reshape2)
melt(df, id.vars = "timestamp") -> k
k$value <- ifelse(is.na(k$value), "NA", "Not NA")
ggplot(data = k) +
geom_point(aes(x = timestamp, y = variable, fill = value, colour = value), shape =22) +
scale_x_date() +
theme_bw()
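To check the picture against numbers, the longest run of consecutive missing days per station can be computed with base R's rle(). This is a sketch on the example data from the question; longest_na_run is a hypothetical helper name.

```r
set.seed(1)
df <- data.frame(
  timestamp = seq(as.Date("2010-01-01"), by = "day", length.out = 365 * 5),
  matrix(rnorm(365 * 5 * 5), 365 * 5, 5)
)
colnames(df) <- c("timestamp", paste0("st", 1:5))
df[1:50, "st1"] <- NA
df[51:200, "st2"] <- NA
df[1:400, "st3"] <- NA
df[501:1300, "st5"] <- NA

# Length of the longest stretch of consecutive NAs in a vector (0 if none)
longest_na_run <- function(x) {
  runs <- rle(is.na(x))
  if (any(runs$values)) max(runs$lengths[runs$values]) else 0L
}

na_runs <- sapply(df[-1], longest_na_run)
na_runs
# st1 st2 st3 st4 st5
#  50 150 400   0 800
```

With 80 or more stations, sorting by na_runs is a quick way to decide which stations are worth plotting at all.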

Color Dependent Bar Graph in R

I'm a bit out of my depth with this one here. I have the following code that generates two equally sized matrices:
MAX<-100
m<-5
n<-40
success<-matrix(runif(m*n,0,1),m,n)
samples<-floor(MAX*matrix(runif(m*n),m))+1
The success matrix is the probability of success and the samples matrix is the corresponding number of samples that was observed in each case. I'd like to make a bar graph that groups each column together with the height being determined by the success matrix. The color of each bar needs to be a color (scaled from 1 to MAX) that corresponds to the number of observations (i.e., small samples would be more red, for instance, whereas high samples would be green perhaps).
Any ideas?
Here is an example with ggplot. First, get data into long format with melt:
library(reshape2)
data.long <- cbind(melt(success), melt(samples)[3])
names(data.long) <- c("group", "x", "success", "count")
head(data.long)
# group x success count
# 1 1 1 0.48513473 8
# 2 2 1 0.56583802 58
# 3 3 1 0.34541582 40
# 4 4 1 0.55829073 64
# 5 5 1 0.06455401 37
# 6 1 2 0.88928606 78
Note melt will iterate through the row/column combinations of both matrices the same way, so we can just cbind the resulting molten data frames. The [3] after the second melt is so we don't end up with repeated group and x values (we only need the counts from the second melt). Now let ggplot do its thing:
library(ggplot2)
ggplot(data.long, aes(x=x, y=success, group=group, fill=count)) +
geom_bar(position="stack", stat="identity") +
scale_fill_gradient2(
low="red", mid="yellow", high="green",
midpoint=mean(data.long$count)
)
Using @BrodieG's data.long, this plot might be a little easier to interpret.
library(ggplot2)
library(RColorBrewer) # for brewer.pal(...)
ggplot(data.long) +
geom_bar(aes(x=x, y=success, fill=count),colour="grey70",stat="identity")+
scale_fill_gradientn(colours=brewer.pal(9,"RdYlGn")) +
facet_grid(group~.)
Note that actual values are probably different because you use random numbers in your sample. In future, consider using set.seed(n) to generate reproducible random samples.
Edit [Response to OP's comment]
You get numbers for x-axis and facet labels because you start with matrices instead of data.frames. So convert success and samples to data.frames, set the column names to whatever your test names are, and prepend a group column with the "list of factors". Converting to long format is a little different now because the first column has the group names.
library(reshape2)
set.seed(1)
success <- data.frame(matrix(runif(m*n,0,1),m,n))
success <- cbind(group=rep(paste("Factor",1:nrow(success),sep=".")),success)
samples <- data.frame(floor(MAX*matrix(runif(m*n),m))+1)
samples <- cbind(group=success$group,samples)
data.long <- cbind(melt(success,id=1), melt(samples, id=1)[3])
names(data.long) <- c("group", "x", "success", "count")
One way to set a threshold color is to add a column to data.long and use that for fill:
threshold <- 25
data.long$fill <- with(data.long,ifelse(count>threshold,max(count),count))
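To see what that ifelse() does, here is a tiny worked example using the six counts printed by head(data.long) earlier: every count above the threshold is replaced by the overall maximum, so all of those bars share the color at the top of the scale, while counts at or below the threshold keep their own value.

```r
count <- c(8, 58, 40, 64, 37, 78)  # counts from head(data.long) above
threshold <- 25

# Counts above the threshold collapse onto the maximum count
fill <- ifelse(count > threshold, max(count), count)
fill
# 8 78 78 78 78 78
```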
Putting it all together:
library(ggplot2)
library(RColorBrewer)
ggplot(data.long) +
geom_bar(aes(x=x, y=success, fill=fill),colour="grey70",stat="identity")+
scale_fill_gradientn(colours=brewer.pal(9,"RdYlGn")) +
facet_grid(group~.)+
theme(axis.text.x=element_text(angle=-90,hjust=0,vjust=0.4))
Finally, when you have names for the x-axis labels they tend to get jammed together, so I rotated the names -90°.

Plotting a filled line chart with 4 variables against a 5th variable ggplot2

I am trying to create a position="fill" plot which represents an allocation on the y axis (always summing to 100) against another variable on the x axis. Variables 1-4 are integers; variable 5 is a continuous numeric. All five variables are on the same row.
Y axis: variable 1 + variable 2 + variable 3 + variable 4 = 100
X axis: variable 5
Is there a way to do this without melting my data table?
Sample code, caution: runs a bit slow due to how I set up variables 1-4...
library(combinat)
combinations <- combn(100, 4)
permutations <- combinations[, colSums(combinations) == 100]
rm(combinations)
data <- t(rbind(permutations,
replicate(ncol(permutations), cumprod(1+rnorm(20, 0.05, 0.30))[20])
))
One way to generate a reproducible example would be
set.seed(1)
data_ex <- data.frame(t(rmultinom(1000,prob=rep(0.25,4),size=100)),
v5=runif(1000,0.8,1))
and then
library(ggplot2)
library(reshape2)
ggplot(melt(data_ex,id.var="v5")) +
geom_area(aes(x=v5,y=value,fill=variable))
draws the plot.
If you really want to do things the hard way you can avoid using melt, but melt is much (much much) easier!
cumvals <- t(apply(data_ex[,1:4],1,cumsum))
data2 <- data.frame(cumvals,v5=data_ex$v5)
ggplot(data2,aes(x=v5)) +
## these must go in reverse order
geom_area(aes(y=X4),fill="green")+
geom_area(aes(y=X3),fill="purple")+
geom_area(aes(y=X2),fill="red")+
geom_area(aes(y=X1),fill="blue")
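The reason the layers must go in reverse order can be checked directly. After the row-wise cumsum, each column is at least as large as the one before it and the last column is always 100, so the largest area has to be drawn first and the smaller ones painted on top of it. A small sketch of that check, where top_is_100 and nested are names made up for this example:

```r
set.seed(1)
data_ex <- data.frame(t(rmultinom(1000, size = 100, prob = rep(0.25, 4))),
                      v5 = runif(1000, 0.8, 1))

# Row-wise cumulative sums of the four allocation columns
cumvals <- t(apply(data_ex[, 1:4], 1, cumsum))

# The top layer (X4) always reaches the full allocation of 100
top_is_100 <- all(cumvals[, 4] == 100)

# Each layer sits at or above the previous one
nested <- all(cumvals[, 1] <= cumvals[, 2] &
              cumvals[, 2] <= cumvals[, 3] &
              cumvals[, 3] <= cumvals[, 4])

c(top_is_100 = top_is_100, nested = nested)
```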

Resources