differences in class giving rise to unexpected axis formatting - r

I have a column from one data frame (containing a set of estimated proportions of cell counts) for which class() returns "factor", and a column from another data frame (containing the actual cell counts) for which class() returns "numeric". I have to plot these against one another to see if there's a relationship between them, so I have to convert the factor entries to numeric:
> class(proportions$Neutrophils)
[1] "factor"
> head(proportions$Neutrophils)
[1] 2.3 14.9
178 Levels: #VALUE! 0.0 0.4 0.6 0.7 0.8 0.9 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 ... abs neutrophils
> head(as.numeric(proportions$Neutrophils)) #notice that the numbers are completely trans
[1] 1 1 82 57 1 1
> head(as.numeric(proportions$Neutrophils))
[1] 1 1 82 57 1 1
> max(as.numeric(proportions$Neutrophils)) #factors converted to numeric
[1] 176
I have arranged the rows of the two data frames so that the corresponding values align:
ptr<-match(sample.details$barcode1, proportions$barcode2)
proportions<-proportions[ptr,]
I convert the factor column to numeric and plot:
plot(proportions$Neutrophils, as.numeric(SPVs[,7]), pch=19, ylab= "proportion estimates", xlab="counts", main="Neutrophils Proportions Validation")
When I don't convert it to numeric though:
plot(proportions1$Neutrophils, SPVs[,7], pch=19, ylab= "proportion estimates", xlab="counts", main="Neutrophils Proportions Validation")
What worries me is that the graphs are analogous, and yet the x-axis on the second graph is not arranged in ascending order...
All I want is an estimate of whether the two columns are related, but if the order is mixed up there is no way of telling...
How do I ensure that the x-axis values are in ascending order?
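The conversion pitfall described above can be sketched on a small made-up factor (the vector f below is a hypothetical stand-in for proportions$Neutrophils, not the real data). as.numeric() on a factor returns the integer level codes, whereas going through as.character() first recovers the printed values; entries such as "#VALUE!" cannot be parsed and become NA. When x is a factor, plot() also lays the axis out in level order, which is sorted as text rather than numerically.

```r
# Hypothetical stand-in for a numeric column that was read in as a factor
f <- factor(c("2.3", "14.9", "#VALUE!", "0.4"))

codes <- as.numeric(f)   # integer level codes, NOT the original numbers

# Convert via character to recover the values; non-numeric entries become NA
vals <- suppressWarnings(as.numeric(as.character(f)))

print(codes)
print(vals)
```

Plotting the recovered numeric vector against the other numeric column should then give an ordinary scatter plot with an ascending x-axis.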


R: Draw a 95% confidence ellipse and exclude all observations out of the ellipse [duplicate]

I have a data set that needs to be cleaned from mistakes. For that, I have a sub-data set that contains only observations that I know are correct ("Match"). I would like to draw a 95% confidence ellipse around those correct observations on a plot and exclude all observations out of the ellipse from my main data set.
I figured out how to draw it but now I would like to be able to take out data based on that.
I'm a beginner with R so all of that is pretty new to me so I might not understand complicated coding. :)
Thanks !
To add more details, my data are measurements of collembolas (a type of insect). It has this basic structure:
  replicate node day MajorAxisLength MinorAxisLength Data.type
1         1    1  50             2.1             0.4     Match
2         2    1  50             2.3             0.2   Unknown
Therefore, I want to validate measurements by excluding unrealistic aspect ratios (length/width). Using the subset that I know is correct (match observations), I want to determine a reasonable range of aspect ratios for collembola, and use it to remove any unrealistic observation. I was advised to use a 95% confidence ellipse for good observations and take out observations that don't fit in the ellipse.
The SIBER package has some functions to help you here.
library(SIBER)
Let's use the iris dataset, plotting sepal width vs length.
dat <- iris[,1:2]
plot(dat)
mu <- colMeans(dat)
Sigma <- cov(dat)
addEllipse(mu, Sigma, p.interval = 0.95, col = "blue", lty = 3)
Z <- pointsToEllipsoid(dat, Sigma, mu) # converts the data to ellipsoid coordinates
out <- !ellipseInOut(Z, p = 0.95) # logical vector
(outliers <- dat[out,]) # finds the points outside the ellipse
# Sepal.Length Sepal.Width
#16 5.7 4.4
#34 5.5 4.2
#42 4.5 2.3
#61 5.0 2.0
#118 7.7 3.8
#132 7.9 3.8
points(outliers, col="red", pch=19)
You can then use the out vector to remove unwanted rows.
dat.in <- dat[!out,]
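If installing SIBER is not an option, a similar filter can be sketched in base R, and it can be fitted on a trusted subset only, as the question asks. This is a minimal sketch under assumptions: one iris species stands in for the "Match" rows (a hypothetical choice), the data are treated as roughly bivariate normal, and a point is counted as inside the 95% ellipse when its squared Mahalanobis distance is at most the 0.95 quantile of a chi-squared distribution with 2 degrees of freedom.

```r
# Stand-in data: pretend the versicolor rows are the trusted "Match" subset
dat     <- as.matrix(iris[, c("Sepal.Length", "Sepal.Width")])
trusted <- dat[iris$Species == "versicolor", ]

# Ellipse parameters are estimated from the trusted rows only
mu    <- colMeans(trusted)
Sigma <- cov(trusted)

# Squared Mahalanobis distance of every row to the trusted centre
d2 <- mahalanobis(dat, center = mu, cov = Sigma)

# TRUE for rows inside the 95% ellipse of the trusted subset
inside    <- d2 <= qchisq(0.95, df = 2)
dat.clean <- dat[inside, , drop = FALSE]
```

With the real data, the two columns would be MajorAxisLength and MinorAxisLength, and the trusted rows would be those where Data.type is "Match".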

How to make a X-Y plot

I am not sure how to make an X-Y plot in R.
I have three datasets: A, B and C.
A dataset
ID Result
1.1 2
1.2 4
1.3 2.5
1.4 9
B dataset
ID Result
1.1 1
1.2 7
1.3 6
1.4 9
C dataset
ID Result
1.1 0.5
1.2 8
1.3 9
1.4 9
I want to make one plot with x = result A and y = result B, and another with x = result A and y = result C....
Then A should be represented by red spots, B by black and C by blue, for example. So the spot for ID 1.1 should be at x=2 and y=1, in red (A) and black (B). The spot at (4, 7) is ID 1.2 in red and black.... The spot at (9, 9) is ID 1.4 in red and black.....
I tried qqplots but I don't know how to set the X and Y correctly.
Thanks
ggplot2 is an excellent library for producing plots and there are many reference manuals online. Below is an answer to your question using the ggplot approach. The A,B,C data frames are unified into a single frame and the geom_point() for an x-y plot is used. The aes() sets the x and y coordinates (here you seem to seek to plot 'result' as both the x and y, if I understood the question?). The points are scaled by color, which is defined in the data frame as attributes A,B,C. Importantly, this variable must be a factor. The colors are defined by the manual color scale.
library(ggplot2)
dataA <- data.frame(ID=c(1.1,1.2,1.3),result=c(2,4,2.5),index=c(1,2,3),color="A")
dataB <- data.frame(ID=c(1.1,1.2,1.3),result=c(1,7,6),index=c(1,2,3),color="B")
dataC <- data.frame(ID=c(1.1,1.2,1.3),result=c(0.5,8,9),index=c(1,2,3),color="C")
data <- rbind(dataA,dataB,dataC)
data$color <- as.factor(data$color)
ggplot(data) +
  # size is a fixed setting, so it goes outside aes() (no size legend)
  geom_point(aes(x=result,y=result,color=color), size=10) +
  scale_color_manual(values=c("red", "black", "blue")) +
  theme_bw()
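Another reading of the question is that each ID should become one point, with the A result on the x-axis and the B (or C) result on the y-axis. A minimal base-R sketch under that assumption pairs the datasets by ID with merge(); the object names and the colour assignment (black for A vs B, blue for A vs C) are illustrative choices, not something the question fixes.

```r
# The three datasets from the question
A <- data.frame(ID = c(1.1, 1.2, 1.3, 1.4), Result = c(2, 4, 2.5, 9))
B <- data.frame(ID = c(1.1, 1.2, 1.3, 1.4), Result = c(1, 7, 6, 9))
C <- data.frame(ID = c(1.1, 1.2, 1.3, 1.4), Result = c(0.5, 8, 9, 9))

# Pair the results by ID so each row holds one (x, y) point
ab <- merge(A, B, by = "ID", suffixes = c(".A", ".B"))
ac <- merge(A, C, by = "ID", suffixes = c(".A", ".C"))

# A on the x-axis; B and C results on the y-axis in different colours
plot(ab$Result.A, ab$Result.B, col = "black", pch = 19,
     xlab = "Result A", ylab = "Result B or C",
     xlim = range(A$Result), ylim = range(c(B$Result, C$Result)))
points(ac$Result.A, ac$Result.C, col = "blue", pch = 19)
legend("topleft", legend = c("A vs B", "A vs C"),
       col = c("black", "blue"), pch = 19)
```

Here ID 1.2 lands at (4, 7) for A vs B, matching the example in the question, and the (9, 9) points for ID 1.4 overlap.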

How to create two barplots of unequal height (different max values) in R but with the same units on the Y axis?

Is it possible to make barplots (two) of unequal size (different max values on Y axis) but equal units (count data)?
The data is count data of the number of nesting attempts per season. Each species has 7 seasons of data. My objective is to present the data as clearly as possible for the reader to show the increase in the number of each of the two species nesting season on season. Although the initial pattern of increase is similar for both species, the number of species 1 nesting rises more rapidly. Plotting both sets of data on the same barplot is not a good option because the 7 seasons of data are not concurrent for the two species - rather it is the first 7 years of colonisation for each species (eg the labels on the x axis are different for the two species)
I have tried par(fig) and layout but not yet achieved what I need and I am not sure which function is better suited to what I need. Any advice welcome
Two barplots, one above the other, each taking up half the window. The Y units are the same for both graphs but the maximum for one is 300 whilst the other is 900. When they are plotted a count of 100 looks very different on the two graphs
SPECIES1 <- c(2,12,44,153,451,857)
SPECIES2 <- c(4,15,35,54,63,243)
windows(11,12)
par(oma=c(3,0.1,1,0.1),mfrow=c(2,1),mar=c(2,6,2,2.1))
barplot(SPECIES2,space=c(0.1,0),ylim=c(0,300),col="black",axes=FALSE)
axis(2,at=seq(0,300,100),las=2, cex.axis=0.9)
barplot(SPECIES1,space=c(0.1,0),ylim=c(0,900), col="black",border=NA,axes=FALSE)
axis(2,at=seq(0,900,100),las=2,cex.axis=0.9)
Here is how to do it using the ggplot2 package:
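library(ggplot2)
# Example data, rebuilt from the printout below
df2 <- data.frame(supp = rep(c("VC", "OJ"), each = 3),
                  dose = rep(c("D0.5", "D1", "D2"), times = 2),
                  len  = c(6.8, 15.0, 33.0, 4.2, 10.0, 29.5))
df2
##   supp dose  len
## 1   VC D0.5  6.8
## 2   VC   D1 15.0
## 3   VC   D2 33.0
## 4   OJ D0.5  4.2
## 5   OJ   D1 10.0
## 6   OJ   D2 29.5
# Dodged bars: one bar per supp value within each dose
ggplot(data=df2, aes(x=dose, y=len, fill=supp)) +
  geom_bar(stat="identity", position=position_dodge())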
But you need a third variable (supp in the case above). Please provide the sample data you want to plot for a clearer answer.
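For the layout asked about in the question (two stacked barplots in which a count of 100 takes up the same height), one possible base-R sketch is to give layout() panel heights proportional to the two y-ranges and to remove the top and bottom figure margins, so each plot region fills its panel. The heights 300 and 900 come from the question's ylim values; the margin settings and the scale1/scale2 check variables are assumptions added for illustration.

```r
SPECIES1 <- c(2, 12, 44, 153, 451, 857)
SPECIES2 <- c(4, 15, 35, 54, 63, 243)

# No top/bottom figure margins; outer margins give space around the pair
par(oma = c(3, 0, 1, 0), mar = c(0, 6, 0, 2))
# Panel heights proportional to the y-ranges of the two plots
layout(matrix(1:2, ncol = 1), heights = c(300, 900))

barplot(SPECIES2, ylim = c(0, 300), col = "black", axes = FALSE)
axis(2, at = seq(0, 300, 100), las = 2, cex.axis = 0.9)
scale2 <- diff(par("usr")[3:4]) / par("pin")[2]  # counts per inch, upper panel

barplot(SPECIES1, ylim = c(0, 900), col = "black", axes = FALSE)
axis(2, at = seq(0, 900, 100), las = 2, cex.axis = 0.9)
scale1 <- diff(par("usr")[3:4]) / par("pin")[2]  # counts per inch, lower panel
```

Because scale1 and scale2 come out equal, a bar of a given count should be drawn at the same physical height in both panels. Season labels for the x-axis would need room in the margins, which this sketch leaves out.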

heatmap.2 specify row order OR prevent reorder?

I'm trying to generate some plots of log-transformed fold-change data using heatmap.2 (code below).
I'd like to order the rows in the heatmap by the values in the last column (largest to smallest). The rows are being ordered automatically (I'm unsure of the precise calculation used 'under the hood') and, as shown in the image, some clustering is being performed.
sample_data
gid 2hrs 4hrs 6hrs 8hrs
1234 0.5 0.75 0.9 2
2234 0 0 1.5 2
3234 -0.5 0.1 1 3
4234 -0.2 -0.2 0.4 2
5234 -0.5 1.2 1 -0.5
6234 -0.5 1.3 2 -0.3
7234 1 1.2 0.5 2
8234 -1.3 -0.2 2 1.2
9234 0.2 0.2 0.2 1
0123 0.2 0.2 3 0.5
code
library(gplots) # provides heatmap.2
data <- read.csv(infile, sep='\t',comment.char="#")
rnames <- data[,1] # assign labels in column 1 to "rnames"
mat_data <- data.matrix(data[,2:ncol(data)]) # transform columns into a matrix
rownames(mat_data) <- rnames # assign row names
# custom palette
my_palette <- colorRampPalette(c("turquoise", "yellow", "red"))(n = 299)
# (optional) defines the color breaks manually for a "skewed" color transition
col_breaks = c(seq(-4,-1,length=100), # for turquoise (low values)
               seq(-1,1,length=100), # for yellow
               seq(1,4,length=100)) # for red (high values)
# plot data
heatmap.2(mat_data,
          density.info="none", # turns off density plot inside color legend
          trace="none", # turns off trace lines inside the heat map
          margins =c(12,9), # widens margins around plot
          col=my_palette, # use the color palette defined earlier
          breaks=col_breaks, # enable color transition at specified limits
          dendrogram='none', # draw no dendrograms
          Colv=FALSE) # turn off column clustering
I'm wondering if anyone can suggest either how to turn off reordering so I can reorder my matrix by the last column and force this order to be used, or alternatively hack the heatmap.2 function to do this.
You are not specifying Rowv=FALSE, and by default the rows are reordered. From the heatmap.2 help, for the parameter Rowv:
determines if and how the row dendrogram should be reordered. By
default, it is TRUE, which implies dendrogram is computed and
reordered based on row means. If NULL or FALSE, then no dendrogram is
computed and no reordering is done.
So if you want to have the rows ordered according to the last column, you can do:
mat_data<-mat_data[order(mat_data[,ncol(mat_data)],decreasing=TRUE),]
and then
heatmap.2(mat_data,
density.info="none",
trace="none",
margins =c(12,9),
col=my_palette,
breaks=col_breaks,
dendrogram='none',
Rowv=FALSE,
Colv=FALSE)
You will get the following image :
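Independently of the plot, the reordering step can be checked on its own. The sketch below rebuilds the sample data from the question as a matrix (typed in by hand, so infile is not needed) and sorts the rows by the last column; the names mat and mat_sorted are illustrative.

```r
# Sample data from the question, entered row by row
mat <- matrix(c( 0.5,  0.75, 0.9,  2,
                 0,    0,    1.5,  2,
                -0.5,  0.1,  1,    3,
                -0.2, -0.2,  0.4,  2,
                -0.5,  1.2,  1,   -0.5,
                -0.5,  1.3,  2,   -0.3,
                 1,    1.2,  0.5,  2,
                -1.3, -0.2,  2,    1.2,
                 0.2,  0.2,  0.2,  1,
                 0.2,  0.2,  3,    0.5),
              ncol = 4, byrow = TRUE,
              dimnames = list(c("1234", "2234", "3234", "4234", "5234",
                                "6234", "7234", "8234", "9234", "0123"),
                              c("2hrs", "4hrs", "6hrs", "8hrs")))

# Sort rows by the last column, largest first
ord        <- order(mat[, ncol(mat)], decreasing = TRUE)
mat_sorted <- mat[ord, ]
print(mat_sorted)
```

Row 3234 (8hrs = 3) comes first and row 5234 (8hrs = -0.5) comes last; passing mat_sorted to heatmap.2 with Rowv=FALSE then keeps the rows from being reclustered.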

Plot a character vector against a numeric vector in R

I have the following data frame in R:
>AcceptData
Mean.Rank Sentence.Type
1 2.5 An+Sp+a
2 2.6 An+Nsp+a
3 2.1 An+Sp-a
4 3.1 An+Nsp-a
5 2.4 In+Sp+a
6 1.7 In+Nsp+a
7 3.1 In+Sp-a
8 3.0 In+Nsp-a
I want to plot this with the Sentence.Type column on the x axis, using the actual name in each cell as a point on the x axis. I want the y axis to go from 1 to 4 in steps of 0.5.
So far I haven't been able to plot this, neither with plot() nor with hist(). I keep getting different types of errors, mainly because of the nature of the character column in the data.frame.
I know this should be easy for most, but I'm sort of noob with R still and after hours I can't get the plot right. Any help is much appreciated.
Edit:
Some of the errors I've gotten:
> hist(AcceptData$Sentence.Type,AcceptData$Mean.Rank)
Error in hist.default(AcceptData$Sentence.Type, AcceptData$Mean.Rank) :
'x' must be numeric
Or: (this doesn't give an error, but definitely not the graph I want. It has all the x values cramped to the left of the x axis)
plot(AcceptData$Sentence.Type,AcceptData$Mean.Rank,lty=5,lwd=2,xlim=c(1,16),ylim=c(1,4),xlab="Sentence Type",ylab="Mean Ranking",main="Mean Acceptability Ranking per Sentence")
The default plot function has a method that allows you to plot factors on the x-axis, but to use this, you have to convert your text data to a factor:
Here is an example:
x <- letters[1:5]
y <- runif(5, 0, 5)
plot(factor(x), y)
And with your sample data:
AcceptData <- read.table(text="
Mean.Rank Sentence.Type
1 2.5 An+Sp+a
2 2.6 An+Nsp+a
3 2.1 An+Sp-a
4 3.1 An+Nsp-a
5 2.4 In+Sp+a
6 1.7 In+Nsp+a
7 3.1 In+Sp-a
8 3.0 In+Nsp-a", stringsAsFactors=FALSE)
plot(Mean.Rank~factor(Sentence.Type), AcceptData, las=2,
xlab="", main="Mean Acceptability Ranking per Sentence")
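The question also asks for a y axis running from 1 to 4 in steps of 0.5. One way to get that, sketched here as an alternative to the factor method above, is to plot against integer positions, suppress the default axes and draw both axes by hand. The names pos and y_ticks are illustrative, and the data frame is retyped from the question.

```r
AcceptData <- data.frame(
  Mean.Rank     = c(2.5, 2.6, 2.1, 3.1, 2.4, 1.7, 3.1, 3.0),
  Sentence.Type = c("An+Sp+a", "An+Nsp+a", "An+Sp-a", "An+Nsp-a",
                    "In+Sp+a", "In+Nsp+a", "In+Sp-a", "In+Nsp-a"),
  stringsAsFactors = FALSE)

pos     <- seq_len(nrow(AcceptData))  # one x position per sentence type
y_ticks <- seq(1, 4, by = 0.5)        # requested y-axis ticks

# Draw the points without the default axes, then add custom ones
plot(pos, AcceptData$Mean.Rank, pch = 19, xaxt = "n", yaxt = "n",
     ylim = c(1, 4), xlab = "", ylab = "Mean Ranking",
     main = "Mean Acceptability Ranking per Sentence")
axis(1, at = pos, labels = AcceptData$Sentence.Type, las = 2)
axis(2, at = y_ticks, las = 1)
```

This keeps the sentence types in data-frame order rather than the alphabetical level order that factor() would impose.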
