I feel like this question has been asked many times before, however from the questions I've looked at, none of the solutions so far have worked for me.
I wish to plot the values of two correlation matrices as scatter plots, next to each other in one plot with the same y range (from 0 - 1).
My original data are time series spanning several years of 112 companies, which I've split into two subsets, period A & period B. The original data is a zoo object.
I have then created the correlation matrices for both periods:
corr_A <- cor(series_A)
corr_B <- cor(series_B)
For further data analysis, I have removed the double entries:
corr_A[lower.tri(corr_A, diag=TRUE)] <- NA
corr_A <- as.vector(corr_A)
corr_A <- corr_A[!is.na(corr_A)]
corr_B[lower.tri(corr_B, diag=TRUE)] <- NA
corr_B <- as.vector(corr_B)
corr_B <- corr_B[!is.na(corr_B)]
As a result, I have two vectors, each with a length of 6216 (111 + 110 + 109 + .. + 1 = 6216).
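The three-step NA approach above can be shortened by indexing with upper.tri() directly. The following is only a sketch on simulated data (the series_A matrix and the corr_A_mat name here are made up for illustration, not the original zoo object):

```r
# Simulated stand-in for the real series: 50 observations of 112 companies
set.seed(42)
series_A <- matrix(rnorm(50 * 112), ncol = 112)

corr_A_mat <- cor(series_A)

# upper.tri() selects every off-diagonal pair exactly once,
# in the same column-major order as the NA-based approach
corr_A <- corr_A_mat[upper.tri(corr_A_mat)]

length(corr_A)   # equals choose(112, 2)
```

The same line applied to the period B matrix gives the second vector.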
I have then combined these vectors into a matrix:
correlation <- matrix(c(corr_A, corr_B), nrow=6216)
colnames(correlation) <- c("period_A", "period_B")
I would now like to plot this matrix so the result looks similar to this picture:
I've tried to plot using xyplot from lattice:
xyplot(period_A + period_B ~ X, correlation)
However, in the resulting plot, the two scatter plots are stacked over each other:
I have also tried changing the matrix itself: instead of using 6216 rows, I used 12432 rows and then indexed the first 6216 rows as "period_A" and the last 6216 rows as "period_B". The resulting plot looks quite similar to my desired plot:
Is there any way I can create my desired plot using xyplot? Or are there any other (ggplot, car) methods of generating the plot?
Edit (added sample data for reproducible example):
head(correlation) #data frame with 6216 rows, 3 columns
X period_A period_B
1 0.5 0.4
2 0.3 0.6
3 0.2 0.4
4 0.6 0.6
I've finally found a solution:
https://stats.stackexchange.com/questions/63203/boxplot-equivalent-for-heavy-tailed-distributions
First of all, we stack the data.
correlation <- stack(correlation)
Then, we use stripplot (from lattice) combined with jitter.data=TRUE to create the desired plot.
stripplot(correlation$values ~ correlation$ind, jitter.data=TRUE)
The resulting plot looks exactly like my desired plot and can be manipulated using the standard lattice/plot commands (ylab, xlab, xlim, ylim,scales, etc.).
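For reference, a self-contained sketch of the same idea, assuming the two periods are held as columns of a data frame (the values below are random placeholders, and the long name is introduced here only for illustration):

```r
library(lattice)

set.seed(1)
# Hypothetical stand-in for the real data (values are made up)
correlation <- data.frame(period_A = runif(200), period_B = runif(200))

# stack() returns a data frame with columns `values` and `ind`
long <- stack(correlation)

# One jittered strip per period, sharing a y range of 0 to 1
p <- stripplot(values ~ ind, data = long, jitter.data = TRUE,
               ylim = c(0, 1), xlab = "Period", ylab = "Correlation")
print(p)
```

If the data frame also contains an index column such as X, it should be dropped before stacking, otherwise it ends up as a third strip.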
I am new to R. I have a CSV file with 92 rows and 6 columns. I need to plot 92 graphs of an equation (water retention), taking four values (theta_r, theta_s, alpha, n, with m = 1 - 1/n) from each row of the file in turn.
I have already plotted a graph of this equation for some given values with the following code.
library(ggplot2)
x <- seq(0, 6, 0.1)
log <- 10^x           # the 61 values at which the curve is evaluated
tht_s <- 0.4
tht_r <- 0
alpha <- 0.1
n <- 1.3
m <- 1 - 1/n          # approximately 0.2308
theta <- tht_r + (tht_s - tht_r) * (1 + (alpha * log)^n)^-m
dat <- data.frame(x, theta)
ggplot() +
  geom_line(data = dat, aes(x = x, y = theta)) +
  ylab("water content (cm3 cm-3)")
This time, however, R should take the values from the rows of the CSV file one by one and generate 92 graphs. I tried a loop (for (i in 1:92)), but I get an error, maybe because the variable "log" has around 60 values (the points at which the graph is plotted):
Error in data.frame(x, theta) : arguments imply differing number of rows: 61, 92
In alpha * log :longer object length is not a multiple of shorter object length
I can't figure out how to make R take the values row by row in a loop. I have tried many approaches and watched many YouTube videos, but couldn't work it out.
I need to save these graphs in a 10x2 layout on a single sheet.
Thank you in advance
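The error above suggests that a whole column of 92 parameter values was multiplied by the 61 evaluation points at once, instead of one row's values at a time. A minimal sketch of a row-by-row approach follows. The params data frame, its column names, the vg_theta helper and the output file are all assumptions for illustration (in practice params would come from read.csv on the real file). lattice is used here instead of ggplot2 because its layout argument splits panels across pages automatically.

```r
library(lattice)

# Water retention curve for one parameter set; m is derived from n
vg_theta <- function(h, theta_r, theta_s, alpha, n) {
  m <- 1 - 1 / n
  theta_r + (theta_s - theta_r) * (1 + (alpha * h)^n)^(-m)
}

# Hypothetical parameters; replace with read.csv() on the real file
params <- data.frame(
  id      = 1:4,
  theta_r = c(0.00, 0.05, 0.02, 0.01),
  theta_s = c(0.40, 0.45, 0.38, 0.42),
  alpha   = c(0.10, 0.05, 0.20, 0.08),
  n       = c(1.3, 1.5, 1.2, 1.8)
)

x <- seq(0, 6, 0.1)
h <- 10^x

# Evaluate the curve once per row, then bind the pieces into one long table
curves <- do.call(rbind, lapply(seq_len(nrow(params)), function(i) {
  p <- params[i, ]
  data.frame(id = p$id, x = x,
             theta = vg_theta(h, p$theta_r, p$theta_s, p$alpha, p$n))
}))

# layout = c(columns, rows): 2 x 10 panels per page, extra pages as needed
out_file <- file.path(tempdir(), "retention_curves.pdf")
pdf(out_file, width = 8, height = 11)
print(xyplot(theta ~ x | factor(id), data = curves, type = "l",
             layout = c(2, 10), as.table = TRUE,
             xlab = "x", ylab = "water content (cm3 cm-3)"))
dev.off()
```

With 92 rows this would produce five pages, the last one partly empty.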
I don't have much experience with R and since I am trying to create a fairly specific graph in R, I hope some of you can help me out.
I have the results of four classifiers applied to five different datasets. To get an accurate result, each classifier was run on the same dataset 10 times. So now I have a table of results as follows:
             DataSet1  DataSet1  DataSet1  ...  DataSet2  DataSet2  ...
Classifier1  0.6       0.5       0.7            0.3       0.2
Classifier2  0.4       0.5       0.6            0.6       0.7
And so on.
What I am trying to get for my graph is four separate graphs in different colors representing the four classifiers. The y-axis would just represent the results of the classifications and the x-axis should show the five different datasets.
Each "mark" on the x-axis should be one dataset and the point on the y-axis for each graph would be the mean value of the 10 results for that classifier on that specific dataset.
I have tried using ggplot2 to achieve this by creating a data frame out of the data and melting it with the dataset names as variables. I might not truly understand what melting really does.
I am not very familiar with creating graphs and plots and apologize if my description is clumsy and lacking.
I would appreciate any help greatly.
TL;DR: Reformat input data + ggplot with facets
Input
Since test data wasn't provided, I created dummy data
library(dplyr)
library(tidyr)
library(ggplot2)
set.seed(1)
dummydata <-
  matrix(
    data = sample(do.call("c", select(iris, -Species)), 10*5*4, replace = TRUE),
    nrow = 4, ncol = 10*5
  )
rownames(dummydata) <- paste0("Classifier", 1:4)
colnames(dummydata) <- rep(paste0("DataSet", 1:5), each=10)
dummydata
Here, dummydata is a matrix that looks like
# DataSet1 DataSet1 ... (x10 total columns of each Dataset) ... DataSet5
# Classifier1 3.1 5.6 ...
# Classifier2 2.8 1.3 ...
# ...
# Classifier4 1.3 ...
Reformat input to a workable state
Make sure column names are unique
Make sure object is a data frame
Make sure there is a column for the row name
Make the data frame long (to be used by ggplot)
We do so by:
## Make col names unique
colnames(dummydata) <- paste(colnames(dummydata), 1:10, sep="_")
dummydata_reformat <-
  dummydata %>%
  ## make sure it is data frame
  ## with a column for the classifier type i.e. rowname
  as.data.frame() %>%
  tibble::rownames_to_column("classifier") %>%
  ## reformat data
  gather(dataset, value, -classifier) %>%
  separate(dataset, into = c("dset", "x"), sep = "_")
The data now looks like
#> dummydata_reformat
# classifier dset x value
#1 Classifier1 DataSet1 1 3.1
#2 Classifier2 DataSet1 1 2.8
#3 Classifier3 DataSet1 1 1.6
# ...
Plot
## Plot
dummydata_reformat %>%
  ggplot(aes(x = dset,
             y = value
             ## can add: "color = classifier" to color by classifier
             ## but since you are splitting the plots by classifier,
             ## this does not make sense
  )) +
  geom_point() +
  xlab("") +
  facet_wrap(~classifier) +
  theme(
    ## rotate the x-axis to fit text
    axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)
  )
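The question also asks for the mean of the 10 runs per classifier and dataset, drawn as one colored line per classifier. A base-R sketch of that variant, using made-up results in long format (the results and means names are introduced here for illustration only):

```r
set.seed(1)
# Made-up results: 4 classifiers x 5 datasets x 10 runs, in long format
results <- expand.grid(run = 1:10,
                       dset = paste0("DataSet", 1:5),
                       classifier = paste0("Classifier", 1:4))
results$value <- runif(nrow(results))

# Mean of the 10 runs for each classifier/dataset pair
# (rows: classifiers, columns: datasets)
means <- tapply(results$value, list(results$classifier, results$dset), mean)

# One line per classifier, datasets along the x-axis
matplot(t(means), type = "b", pch = 1, lty = 1, col = 1:4,
        xaxt = "n", xlab = "", ylab = "Mean result")
axis(1, at = seq_len(ncol(means)), labels = colnames(means))
legend("topright", legend = rownames(means), col = 1:4, lty = 1, pch = 1)
```

This uses matplot rather than ggplot2 only to keep the sketch dependency-free; the same means table could be plotted with either.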
I have a remote sensing data set consisting of 106 columns and 28 rows. The rows relate to individual observations (individual plots, in my case). The first column stores the unique ID by which each plot can be identified. The next 100 columns store the average measured reflectance values for each plot in consecutive spectral bands (band_x, band_x2, band_x3, etc.). The remaining 5 columns store the values of various plant parameters (e.g. chlorophyll, nitrogen, biomass) that were measured in the field for each plot. The data set looks roughly as follows:
PlotID  b1    b2    ....  b99   b100  biomass  nitrogen
1       0.11  0.16        0.40  0.41  10       52
2       0.09  0.11        0.41  0.40  19       35
3       0.10  0.19        0.43  0.49  18       72
4       0.13  0.10        0.44  0.39  16       46
...
I'm looking to create contour plots that depict R2 (R-squared) values for the correlation between a single plant parameter (e.g. biomass) and every possible combination of two bands. For example, a contour plot should present the R2 values for the correlation between all possible simple ratio combinations (band_x1/band_x2) and a single trait. I would also like to replicate this for two other types of index: a normalized difference index ((band_x2-band_x1)/(band_x2+band_x1)) and a simple difference index (band_x2-band_x1).
I have been looking at the contour.plot syntax in R and various practical examples, but none of them relates in any way to what I am after. I have seen these graphs before, so there must be a way of generating them. Who can help me out?
Thanks in advance!
Edit: to clarify, here is an example of the kind of graph I am looking to recreate:
http://image.slidesharecdn.com/2269e63a-1825-41b1-8d58-6901fd5b56ba-150102021118-conversion-gate01/95/thenkabailuavgermanyfinal1b-46-638.jpg?cb=1420186425
Using the help of Heroka, I have by now managed to recreate most of the plot, based on the following code (the majority of the code, however, is mostly related to graphics):
n_band=101
dat <- read.table("C:\\data.txt", header=TRUE)
res <- expand.grid(paste0("b", seq(from = 450, to = 950, by = 5)),paste0("b",seq(from = 450, to = 950, by = 5)),outcome=c("nitrogen"))
res$R2 <- apply(res, MARGIN=1, FUN=function(x){
  return(cor(dat[,x[1]]/dat[,x[2]], dat[,x[3]])^2)
})
library(scales)
library(ggplot2)
p1 <- ggplot(res, aes(x=Var1, y=Var2, fill=R2)) +
  geom_tile() +
  facet_grid(~outcome)
p1 +
  theme(axis.text.x=element_text(angle=90)) +
  geom_vline(xintercept=seq(from = 1, to = 101, by = 5), color="#8C8C8C") +
  geom_hline(yintercept=seq(from = 1, to = 101, by = 5), color="#8C8C8C") +
  labs(title = "Contour plot of R^2 values for all possible correlations between Simple Ratio indices & Nitrogen Content", x = "Wavelength 1 (nm)", y = "Wavelength 2 (nm)") +
  scale_x_discrete(breaks = paste0("b", seq(from = 450, to = 950, by = 25))) +
  scale_y_discrete(breaks = paste0("b", seq(from = 450, to = 950, by = 25))) +
  scale_fill_continuous(low = "black", high = "green")
I am getting quite close to my ultimate goal, but a few things remain that I would like to change:
- Have a scale bar in discrete colors, preferably using a highly varied but gradual color scheme, to make the band combinations with the highest R2 values easier to identify. Ideally I would use a fixed number of classes (8), each containing the same number of observations, for all plots, and let the software determine the break values from the min and max R2 values of each parameter being correlated.
- I would also like to identify the highest values in each plot, or more specifically their (x,y) coordinates, so I can tell which bands produce the highest correlations. I have used which.min and which.max, but they yield neither sensible results nor (x,y) coordinates.
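Both points can be approached on the res data frame itself rather than on the plot object. The sketch below uses a small made-up res (random R2 values, fewer bands) and is only one possible way to do it; the R2_class and best names are introduced here for illustration.

```r
set.seed(123)
# Made-up stand-in for `res`: band pairs with one R2 value each
bands <- paste0("b", seq(450, 950, by = 50))
res <- expand.grid(Var1 = bands, Var2 = bands)
res$R2 <- runif(nrow(res))
res$R2[res$Var1 == res$Var2] <- NA   # a band divided by itself is constant

# Eight classes holding (roughly) equal numbers of observations:
# the breaks are the octiles of the observed R2 values
breaks <- quantile(res$R2, probs = seq(0, 1, length.out = 9), na.rm = TRUE)
res$R2_class <- cut(res$R2, breaks = breaks, include.lowest = TRUE)
table(res$R2_class)

# Band pair with the highest R2: index the data frame, not the plot
best <- res[which.max(res$R2), ]
best
```

The resulting factor R2_class could then be mapped to fill in place of R2, together with a discrete fill scale. The Var1 and Var2 columns of best give the (x,y) band names directly.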
Here is an example how you might solve this kind of problem. I've made an assumption on how to calculate R2, but that's easily fixable if it's wrong.
First, we simulate some data
set.seed(123)
n_band=100
dat <- data.frame(matrix(runif(28*n_band),ncol=n_band))
colnames(dat) <- paste0("b",1:n_band)
dat$biomass <- rpois(28,10)
dat$nitrogen <- rpois(28,10)
dat$ID <- 1:28
Then, we observe that for each combination of band1, band2 and outcome we only need to store one number (R2). So, first we generate a dataframe containing all combinations of column names as string:
res <- expand.grid(paste0("b",1:n_band),paste0("b",1:n_band),outcome=c("biomass","nitrogen"))
Then we use apply to get the R2 for each row of res (thus each combination). As each row of res contains three column names, we can use those to access the original data.
#ignore warnings; a band divided by itself is constant, so its correlation is NA
res$R2 <- apply(res, MARGIN=1, FUN=function(x){
  return(cor(dat[,x[1]]/dat[,x[2]], dat[,x[3]])^2)
})
Then plotting is simple:
library(ggplot2)
p1 <- ggplot(res, aes(x=Var1, y=Var2, fill=R2))+
geom_tile() +
facet_grid(~outcome)
p1
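The same pattern extends to the other index types mentioned in the question by swapping the function applied to the two bands. A sketch under the assumption that the indices are defined as below; the indices list and the r2_for helper are illustrative names, not part of the original answer, and the data are simulated with fewer bands so the example runs quickly.

```r
set.seed(123)
n_band <- 10   # kept small for the sketch
dat <- data.frame(matrix(runif(28 * n_band), ncol = n_band))
colnames(dat) <- paste0("b", 1:n_band)
dat$biomass <- rpois(28, 10)

# Index definitions (assumed forms)
indices <- list(
  SR  = function(b1, b2) b1 / b2,                # simple ratio
  NDI = function(b1, b2) (b2 - b1) / (b2 + b1),  # normalized difference
  SD  = function(b1, b2) b2 - b1                 # simple difference
)

# R2 between one index and one outcome for every pair of bands
r2_for <- function(index_fun, outcome) {
  bands <- paste0("b", 1:n_band)
  res <- expand.grid(Var1 = bands, Var2 = bands, stringsAsFactors = FALSE)
  res$R2 <- mapply(function(v1, v2) {
    idx <- index_fun(dat[[v1]], dat[[v2]])
    if (sd(idx) == 0) return(NA_real_)   # constant index, e.g. b1 / b1
    cor(idx, dat[[outcome]])^2
  }, res$Var1, res$Var2, USE.NAMES = FALSE)
  res
}

# One result table per index type
res_all <- lapply(indices, r2_for, outcome = "biomass")
```

Each element of res_all has the same Var1, Var2 and R2 columns as res above, so it can be passed to the same geom_tile plot.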
I have some biological data that looks like this, with 2 different types of clusters (A and B):
Cluster_ID A1 A2 A3 B1 B2 B3
5 chr5:100947454..100947489,+ 3.31322 7.52365 3.67255 21.15730 8.732710 17.42640
12 chr5:101227760..101227782,+ 1.48223 3.76182 5.11534 15.71680 4.426170 13.43560
29 chr5:102236093..102236457,+ 15.60700 10.38260 12.46040 6.85094 15.551400 7.18341
I clean up the data:
CAGE<-read.table("CAGE_expression_matrix.txt", header=T)
CAGE_data <- as.data.frame(CAGE)
#Remove clusters with 0 expression for all 6 samples
CAGE_filter <- CAGE[rowSums(abs(CAGE[,2:7]))>0,]
#Filter whole file to keep only clusters with at least 5 TPM in at least 3 files
CAGE_filter_more <- CAGE_filter[apply(CAGE_filter[,2:7] >= 5,1,sum) >= 3,]
CAGE_data <- as.data.frame(CAGE_filter_more)
The data size is reduced from 6981 clusters to 599 after this.
I then go on to apply PCA:
#Get data dimensions
dim(CAGE_data)
PCA.CAGE<-prcomp(CAGE_data[,2:7], scale.=TRUE)
summary(PCA.CAGE)
I want to create a PCA plot of the data, marking each sample and coloring the samples depending on their type (A or B.) So it should be two colors for the plot with text labels for each sample.
This is what I have tried, to erroneous results:
qplot(PC1, PC2, colour = CAGE_data, geom=c("point"), label=CAGE_data, data=as.data.frame(PCA.CAGE$x))
ggplot(data=PCA.CAGE, aes(x=PCA1, y=PCA2, colour=CAGE_filter_more, label=CAGE_filter_more)) + geom_point() + geom_text()
qplot(PCA.CAGE[1:3], PCA.CAGE[4:6], label=colnames(PC1, PC2, PC3), geom=c("point", "text"))
The errors appear as such:
> qplot(PCA.CAGE$x[,1:3],PCA.CAGE$x[4:6,], xlab="Data 1", ylab="Data 2")
Error: Aesthetics must either be length one, or the same length as the dataProblems:PCA.CAGE$x[4:6, ]
> qplot(PC1, PC2, colour = CAGE_data, geom=c("point"), label=CAGE_data, data=as.data.frame(PCA.CAGE$x))
Don't know how to automatically pick scale for object of type data.frame. Defaulting to continuous
Don't know how to automatically pick scale for object of type data.frame. Defaulting to continuous
Error: Aesthetics must either be length one, or the same length as the dataProblems:CAGE_data, CAGE_data
> ggplot(data=PCA.CAGE, aes(x=PCA1, y=PCA2, colour=CAGE_filter_more, label=CAGE_filter_more)) + geom_point() + geom_text()
Error: ggplot2 doesn't know how to deal with data of class
Your question doesn't make sense (to me at least). You seem to have two groups of 3 variables (the A group and the B group). When you run PCA on these 6 variables, you'll get 6 principal components, each of which is a (different) linear combination of all 6 variables. Clustering is based on the cases (rows). If you want to cluster the data based on the first two PCs (a common approach), then you need to do that explicitly. Here's an example using the built-in iris data set.
pca <- prcomp(iris[,1:4], scale.=TRUE)
clust <- kmeans(pca$x[,1:2], centers=3)$cluster
library(ggbiplot)
ggbiplot(pca, groups=factor(clust)) + xlim(-3,3)
So here we run PCA on the first 4 columns of iris. Then, pca$x is a matrix containing the principal components in the columns. We then run k-means clustering based on the first 2 PCs, and extract the cluster numbers into clust. Finally, we use ggbiplot(...) to make the plot.
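If the goal is instead one point per sample (A1 to B3), colored by type, one possible reading is that the matrix needs to be transposed before PCA so that samples become the observations. A base-R sketch on made-up expression values, assuming the sample type is the first letter of the column name:

```r
set.seed(1)
# Made-up expression matrix: 50 clusters (rows) x 6 samples (columns)
expr <- matrix(rexp(50 * 6, rate = 0.1), ncol = 6,
               dimnames = list(NULL, c("A1", "A2", "A3", "B1", "B2", "B3")))

# Transpose so that each sample becomes one observation
pca <- prcomp(t(expr), scale. = TRUE)

# Sample type taken from the first letter of the sample name
sample_type <- substr(rownames(pca$x), 1, 1)
cols <- c(A = "red", B = "blue")[sample_type]

# One labelled point per sample, two colors for the two types
plot(pca$x[, 1], pca$x[, 2], col = cols, pch = 19,
     xlab = "PC1", ylab = "PC2")
text(pca$x[, 1], pca$x[, 2], labels = rownames(pca$x), pos = 3, col = cols)
```

With only six samples there are at most six components, and the first two may or may not separate the A and B groups.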
I try to make a cumulative plot for a particular (for instance the first) column of my data (example):
1 3
2 5
4 9
8 11
12 17
14 20
16 34
20 40
Then I want to overlay this plot with a cumulative plot of other data (for example the second column) and save it as a PNG or JPG image.
I don't want to build the vectors "by hand" as in Cumulative Plot with Given X-Axis, because with a very large dataset I won't be able to do that.
I tried the following simple commands:
A <- read.table("cumul.dat", header=TRUE)
This reads the file, but now I want the cumulative plot to be done with a particular column of this file.
The command is:
cdat1<-cumsum(dat1)
but this is for a particular vector dat1 that I need to take from the data array (cumul.dat).
Thanks
I couldn't follow your question so this is a shot in the dark answer based on key words I did get:
m <- read.table(text=" 1 3
2 5
4 9
8 11
12 17
14 20
16 34
20 40")
library(ggplot2)
m2 <- stack(m)
qplot(rep(1:nrow(m), 2), values, colour=ind, data=m2, geom="step")
EDIT: I decided I like this approach better:
library(ggplot2)
library(reshape2)
m$x <- seq_len(nrow(m))
m2 <- melt(m, id='x')
qplot(x, value, colour=variable, data=m2, geom="step")
I wasn't quite sure when the events were happening and what the observations were. I'm assuming the events are just at 1,2,3,4 and the columns represent counts for the different groups. If that's the case, using lattice I would do
require(lattice)
A<-data.frame(dat1=c(1,2,4,8,12,14,16,20), dat2=c(3,5,9,11,17,20,34,40))
dd<-do.call(make.groups, lapply(A, function(x) {data.frame(x=seq_along(x), y=cumsum(x))}))
xyplot(y~x,dd, groups=which, type="s", auto.key=T)
Which produces
With base graphics, this can be done by specifying type='s' in the plot call:
matplot(apply(A, 2, cumsum), type='s', xlab='x', ylab='y', las=1)
Note I've used matplot here, but you could also plot the series one at a time, the first with plot and the second with points or lines.
We could also add a legend with, for example:
legend('topleft', c('Series 1', 'Series 2'), bty='n', lty=1:2, col=1:2)
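To save the result as an image, as the question asks, the plotting calls can be wrapped between a graphics device call and dev.off(). A sketch, where the output path is a placeholder chosen for illustration:

```r
A <- data.frame(dat1 = c(1, 2, 4, 8, 12, 14, 16, 20),
                dat2 = c(3, 5, 9, 11, 17, 20, 34, 40))

# Cumulative sum of each column
cum <- apply(A, 2, cumsum)

# Hypothetical output path; any writable location works
out_file <- file.path(tempdir(), "cumulative.png")

png(out_file, width = 800, height = 600)
matplot(cum, type = "s", lty = 1:2, col = 1:2,
        xlab = "x", ylab = "Cumulative sum", las = 1)
legend("topleft", colnames(cum), bty = "n", lty = 1:2, col = 1:2)
dev.off()
```

Replacing png() with jpeg() gives a JPG file instead.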