I have a matrix of ChIP-seq results like this for 26,000 genes:
LncRNA_ID LncRNA_Name Control_Raw_TagCount ICLIP_EZH2_Raw_TagCount
1 AK092525 47908 194887
2 ENST00000423879 RP11-12M5.1 10794 90146
3 AF318349 5514 61617
4 ENST00000506392 CTC-313D10.1 288 40880
5 ENST00000438080 RP11-177A2.4 25005 37380
6 AK123756 800 35469
I want to plot the count densities of both samples, control and EZH2 (columns 3 and 4), in order to compare them. I am using R and I am very confused, mainly because I can't plot them as histograms: I get a figure with only one bar instead of all the bars I am expecting, and the same happens when I try a boxplot. This is probably a very silly question, but I am a bit desperate.
ezh2<-data$ICLIP_EZH2_Raw_TagCount
control<-data$Control_Raw_TagCount
hist(ezh2) # not working, I can't see the distribution at all
Do you have any idea how to do it?
Thanks in advance
Box plot, where the two columns are stuck together and then split along the groups:
N <- length(d$Control_Raw_TagCount)
x <- c(d$Control_Raw_TagCount, d$ICLIP_EZH2_Raw_TagCount)
group <- rep(c("Control_Raw_TagCount", "ICLIP_EZH2_Raw_TagCount"), c(N, N))
boxplot(x ~ group)
Here I've assumed the data name is d, so adjust that to your data frame's name. If you want something like hollow histograms (see pg26 of OpenIntro Statistics), the histPlot function in the openintro package will do the trick using the arguments probability=TRUE, hollow=TRUE:
# install.packages("openintro")
library(openintro)
histPlot(d$Control_Raw_TagCount, probability=TRUE, hollow=TRUE)
histPlot(d$ICLIP_EZH2_Raw_TagCount, probability=TRUE, hollow=TRUE,
add=TRUE, lty=3, border='red') # add=TRUE overlays on the first histogram
If the vertical scale isn't right, add a ylim argument to the first call to histPlot (e.g. ylim=c(0,0.05)).
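One possible reason for the single-bar histogram is that raw tag counts are heavily right-skewed, so a few very large values stretch the axis. The sketch below is not part of the answer above: it uses simulated counts in a data frame d with the question's column names, and the log10(x + 1) transformation is an assumption chosen so that zero counts stay finite. It overlays the two kernel density estimates with base graphics.

```r
# Simulated stand-in for the real data (column names taken from the question)
set.seed(1)
d <- data.frame(
  Control_Raw_TagCount    = round(rlnorm(1000, meanlog = 5, sdlog = 2)),
  ICLIP_EZH2_Raw_TagCount = round(rlnorm(1000, meanlog = 6, sdlog = 2))
)

# log10(x + 1) keeps zero counts finite; the +1 offset is an assumption
log_control <- log10(d$Control_Raw_TagCount + 1)
log_ezh2    <- log10(d$ICLIP_EZH2_Raw_TagCount + 1)

# Kernel density estimates on the log scale
dens_control <- density(log_control)
dens_ezh2    <- density(log_ezh2)

# Draw the first density, sizing the y-axis so both curves fit
plot(dens_control, main = "Tag count densities",
     xlab = "log10(raw tag count + 1)",
     ylim = range(0, dens_control$y, dens_ezh2$y))
# Overlay the second density on the same axes
lines(dens_ezh2, lty = 2, col = "red")
legend("topright", legend = c("Control", "EZH2"),
       lty = c(1, 2), col = c("black", "red"), bty = "n")
```

The same transformed vectors can be passed to hist() or boxplot() if bars or boxes are preferred over smooth curves.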
Related
I have a data frame data_2 and wish to create a Bland-Altman plot to compare the differences between the data in the columns alog1 vs. dig1.
Please help with the function for this and how to execute this. Would the function be barplot()?
Thanks for your time.
Another name for a Bland-Altman plot is a Tukey mean-difference plot. (I have nothing against Bland and Altman, but I think 'mean-difference' is more descriptive.) Note that this is different from a boxplot (compare the pictures on the two Wikipedia pages). The mean-difference plot is simply a regular scatterplot, except that instead of plotting x versus y, you plot the difference x-y against the mean of x and y (in your case, alog1 and dig1). Probably the easiest way to make this is to form these two new variables first, and then plot them as you would any other scatterplot. Here is some sample code:
mn <- (data_2$alog1 + data_2$dig1)/2
dif <- data_2$alog1 - data_2$dig1
plot(mn, dif)
If you wanted to add arguments to customize your plot, you could do that just as you normally would, for example:
plot(mn, dif, main="Bland-Altman plot", xlab="mean of alog1 & dig1",
ylab="difference between alog1 & dig1")
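Bland-Altman plots are usually drawn with a horizontal line at the mean difference (the bias) and two more at the limits of agreement, conventionally the bias plus or minus 1.96 standard deviations of the differences. The sketch below adds those lines to the scatterplot; data_2 here is simulated, and only its column names come from the question.

```r
# Simulated stand-in for data_2 (the real values are not shown in the question)
set.seed(42)
data_2 <- data.frame(alog1 = rnorm(50, mean = 100, sd = 10))
data_2$dig1 <- data_2$alog1 + rnorm(50, mean = 1, sd = 3)

# Mean and difference of the two measurements, as in the answer above
mn  <- (data_2$alog1 + data_2$dig1) / 2
dif <- data_2$alog1 - data_2$dig1

# Bias and conventional 95% limits of agreement
bias <- mean(dif)
loa  <- bias + c(-1.96, 1.96) * sd(dif)

# Make sure the y-axis is wide enough to show the limits
plot(mn, dif, ylim = range(dif, loa),
     main = "Bland-Altman plot",
     xlab = "mean of alog1 & dig1",
     ylab = "difference between alog1 & dig1")
abline(h = bias, lty = 1)  # mean difference
abline(h = loa,  lty = 2)  # limits of agreement
```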
I am doing a project on Functional Data Analysis, where I am trying to draw spaghetti plots for height. I am using xyplot from the lattice library. Why is the y-axis wrapped in xyplot?
Here I am plotting data for only one individual. If I plot the whole data set, it looks like a block of thick lines.
My code in R is:
xyplot(height ~ age|sex, p_data, type="l", group=id)
Resulting in:
Without seeing p_data it's hard to say, but based upon the axis labelling I would guess that height is being treated as a factor variable.
Run is.factor(p_data$height), and if the answer is TRUE then try
p_data$height <- as.numeric(levels(p_data$height))[p_data$height]
and repeat your plot. If this doesn't work then at least give us some idea of what the p_data dataframe looks like.
@Joe has put you on the right path. The issue is almost certainly that the height variable is being treated as a factor (categorical variable) rather than a continuous, numeric variable.
E.g. - I can replicate a similar problem via:
p_data <- data.frame(height=c(96,72,100,45),age=1:4,sex=c("m","f","f","m"),id=1)
p_data$height <- factor(p_data$height,levels=p_data$height)
# it's all out of order cap'n!
p_data$height
#[1] 96 72 100 45
#Levels: 96 72 100 45
# same plot call as you are using
xyplot(height ~ age|sex, p_data, type="l", group=id)
If you fix it up like so:
p_data$height <- as.numeric(as.character(p_data$height))
....then the same call gives an appropriate result:
xyplot(height ~ age|sex, p_data, type="l", group=id)
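To see why the conversion has to go through the level labels, here is a small sketch (the vector name height and its values are made up for illustration). Calling as.numeric() directly on a factor returns the internal level codes, not the original numbers, while both conversions suggested above recover the values.

```r
# A factor whose labels look like numbers (levels are sorted as text)
height <- factor(c("96", "72", "100", "45"))
levels(height)
# [1] "100" "45"  "72"  "96"

# Direct conversion returns the internal level codes, not the heights
codes <- as.numeric(height)
codes
# [1] 4 3 1 2

# Converting via the labels recovers the original values
values_a <- as.numeric(as.character(height))
values_b <- as.numeric(levels(height))[height]
values_a
# [1]  96  72 100  45
```

The two correct forms give the same result; the levels() version converts each distinct label only once, which can matter for long vectors.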
Let's say I have the following dataset
bodysize=rnorm(20,30,2)
bodysize=sort(bodysize)
survive=c(0,0,0,0,0,1,0,1,0,0,1,1,0,1,1,1,0,1,1,1)
dat=as.data.frame(cbind(bodysize,survive))
I'm aware that the glm plot function has several nice plots to show you the fit,
but I'd nevertheless like to create an initial plot with:
1) raw data points
2) the logistic curve, and both
3) predicted points
4) and aggregate points for a number of predictor levels
library(Hmisc)
plot(bodysize,survive,xlab="Body size",ylab="Probability of survival")
g=glm(survive~bodysize,family=binomial,dat)
curve(predict(g,data.frame(bodysize=x),type="resp"),add=TRUE)
points(bodysize,fitted(g),pch=20)
All fine up to here.
Now I want to plot the observed survival rates for given levels of the predictor (bodysize):
dat$bd<-cut2(dat$bodysize,g=5,levels.mean=T)
AggBd<-aggregate(dat$survive,by=list(dat$bd),data=dat,FUN=mean)
plot(AggBd,add=TRUE)
#Doesn't work
I've tried to match AggBd to the dataset used for the model and all sort of other things but I simply can't plot the two together. Is there a way around this?
I basically want to superimpose the last plot on the same axes.
Besides this specific task, I often wonder how to superimpose plots of different variables that have a similar scale/range on a two-dimensional plot. I would really appreciate your help.
The first column of AggBd is a factor; you need to convert its levels to numeric before you can add the points to the plot:
AggBd$size <- as.numeric (levels (AggBd$Group.1))[AggBd$Group.1]
To add the points to the existing plot, use points:
points (AggBd$size, AggBd$x, pch = 3)
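For reference, here is a self-contained sketch of the whole figure using base R only. It swaps in quantile-based bins from cut() as a stand-in for Hmisc's cut2(..., g=5), so the bin boundaries may differ slightly from the original; the names bin, bin_size and bin_rate are made up for illustration.

```r
# Data as in the question (set.seed added so the sketch is reproducible)
set.seed(7)
bodysize <- sort(rnorm(20, 30, 2))
survive  <- c(0,0,0,0,0,1,0,1,0,0,1,1,0,1,1,1,0,1,1,1)
dat <- data.frame(bodysize, survive)

g <- glm(survive ~ bodysize, family = binomial, data = dat)

# Five bins with roughly equal counts, defined by quintiles of bodysize
breaks  <- quantile(dat$bodysize, probs = seq(0, 1, by = 0.2))
dat$bin <- cut(dat$bodysize, breaks = breaks, include.lowest = TRUE)

# Mean body size and observed survival rate within each bin
bin_size <- tapply(dat$bodysize, dat$bin, mean)
bin_rate <- tapply(dat$survive,  dat$bin, mean)

# Raw points, fitted curve, fitted values, then the aggregated rates
plot(dat$bodysize, dat$survive,
     xlab = "Body size", ylab = "Probability of survival")
curve(predict(g, data.frame(bodysize = x), type = "response"), add = TRUE)
points(dat$bodysize, fitted(g), pch = 20)
points(bin_size, bin_rate, pch = 3, col = "red")
```

Because points() draws in the coordinate system of the existing plot, no axis matching is needed as long as the new values fall within the plotted range.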
You are best off specifying your y-axis explicitly, and perhaps using par(new=TRUE):
plot(bodysize,survive,xlab="Body size",ylab="Probability of survival")
g=glm(survive~bodysize,family=binomial,dat)
curve(predict(g,data.frame(bodysize=x),type="resp"),add=TRUE)
points(bodysize,fitted(g),pch=20)
#then
par(new=TRUE)
#
plot(AggBd$Group.1,AggBd$x,pch=30)
Obviously, remove or change the axis ticks to prevent overlap, e.g.
plot(AggBd$Group.1,AggBd$x,pch=30,xaxt="n",yaxt="n",xlab="",ylab="")
giving:
Suppose I ran a factor analysis and got 5 relevant factors. Now I want to graphically represent the loadings of these factors on the variables. Can anybody please tell me how to do it? I can do it with 2 factors, but not when the number of factors is more than 2.
The 2 factor plotting is given in "Modern Applied Statistics with S", Fig 11.13. I want to create similar graph but with more than 2 factors. Please find the snap of the Fig mentioned above:
X & y axes are the 2 factors.
Regards,
Ari
Beware: not the answer you are looking for and might be incorrect also, this is my subjective thought.
I think you run into the problem of sketching several dimensions on a two dimension screen/paper.
I would say there is little sense in plotting the loadings of more factors or PCs, but if you really insist: display the first two (based on eigenvalues) or extract only 2 factors. You could also reduce the dimension by other methods (e.g. MDS).
Displaying the loadings of 3 factors in a 3-dimensional graph would be hard to read, let alone more factors.
UPDATE: I had a dream about trying to be more ontopic :)
You could easily show projections of each pair of factors, as @joran pointed out (I am not dealing with rotation here):
f <- factanal(mtcars, factors=3)
pairs(f$loadings)
This way you could show even more factors and be able to tweak the plot also, e.g.:
f <- factanal(mtcars, factors=5)
pairs(f$loadings, col=1:ncol(mtcars), upper.panel=NULL, main="Factor loadings")
par(xpd=TRUE)
legend('topright', bty='n', pch='o', col=1:ncol(mtcars), attr(f$loadings, 'dimnames')[[1]], title="Variables")
Of course, you could also add rotation vectors by customizing the lower triangle, or show them in the upper one and attach the legend on the right or below, etc.
Or just point the variables on a 3D scatterplot if you have no more than 3 factors:
library(scatterplot3d)
f <- factanal(mtcars, factors=3)
scatterplot3d(as.data.frame(unclass(f$loadings)), main="3D factor loadings", color=1:ncol(mtcars), pch=20)
Note: in my humble opinion, variable names should not be put on the plots as labels but rather in a separate legend, especially with 3D plots.
It looks like there's a package for this:
http://factominer.free.fr/advanced-methods/multiple-factor-analysis.html
Comes with sample code, and multiple factors. Load the FactoMineR package and play around.
Good overview here:
http://factominer.free.fr/docs/article_FactoMineR.pdf
Graph from their webpage:
You can also look at the factor analysis object and see if you can't extract the values and plot them manually using ggplot2 or base graphics.
As daroczig mentions, each set of factor loadings gets its own dimension. So plotting in five dimensions is not only difficult, but often inadvisable.
You can, though, use a scatterplot matrix to display each pair of factor loadings. Using the example you cite from Venables & Ripley:
#Reproducing factor analysis from Venables & Ripley
#Note I'm only doing three factors, not five
library(MASS) # provides eqscplot()
data(ability.cov)
ability.FA <- factanal(covmat = ability.cov,factors = 3, rotation = "promax")
load <- loadings(ability.FA)
rot <- ability.FA$rotmat # rotation matrix
#Pairs of factor loadings to plot
ind <- combn(1:3,2)
par(mfrow = c(2,2))
nms <- row.names(load)
#Loop over pairs of factors and draw each plot
for (i in 1:3){
eqscplot(load[,ind[1,i]],load[,ind[2,i]],xlim = c(-1,1),
ylim = c(-0.5,1.5),type = "n",
xlab = paste("Factor",as.character(ind[1,i])),
ylab = paste("Factor",as.character(ind[2,i])))
text(load[,ind[1,i]],load[,ind[2,i]],labels = nms)
arrows(c(0,0),c(0,0),rot[ind[,i],ind[,i]][,1],
rot[ind[,i],ind[,i]][,2],length = 0.1)
}
which for me results in the following plot:
Note that I had to play a little with the x and y limits, as well as various other fiddly bits. Your data will be different and will require different adjustments. Also, plotting each pair of factor loadings with five factors will make for a rather busy collection of scatterplots.
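To get a feel for how busy that collection would be, the number of panels is the number of distinct factor pairs. A quick sketch with combn(), the same function used for the index matrix above (n_factors and pairs_idx are illustrative names):

```r
# All distinct pairs of factors for a five-factor solution
n_factors <- 5
pairs_idx <- combn(n_factors, 2)

# One column per pair, so the column count is the number of panels
ncol(pairs_idx)
# [1] 10

# Each column holds the two factor indices for one panel
pairs_idx[, 1]
# [1] 1 2
```

With ten panels, the loop bound and the par(mfrow = ...) layout in the code above would both need to be enlarged accordingly.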
I am trying to plot some pairs of scatterplots using "pairs".
My dataframe looks like this:
>e
X Y Z
0 0 0
2 3 4
0 3 4
3 3 3
A completely standard dataframe here.
I use this to plot my scatter plots, again nothing fancy:
pairs(~X+Y+Z, data=e, log="xy")
It works great, but it doesn't plot the labels. However if I remove the log="xy" in the command, then the labels are plotted nicely. So I guess it has to do with the fact that I want my scatterplots to be in log scale.
So my question is: what should I do?
Should I remove all rows containing zeros beforehand (and how do you do that)?
Is there a magic trick that will let me have log="xy" and my scatterplots labeled?
Please let me know if it is not clear.
You ignored this (where I called your data frame DF):
R> pairs(~X+Y+Z, data=DF, log="xy")
There were 30 warnings (use warnings() to see them)
and if you look at these thirty warnings, you will see that
you cannot plot data containing zeros on a log scale (and I guess you know why)
log is not a recognised parameter for pairs()
So if you want a pairs plot in logs, you may have to take logs yourself (and either add a small epsilon or use a transformation like log(1 + x)) and call pairs() on that data.
Edit The easiest is probably pairs(~X+Y+Z, data=log(1+DF))
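The question also asks how to drop rows containing zeros. A minimal sketch of both options, using the four rows shown in the question (has_zero and e_pos are illustrative names):

```r
# The data frame from the question
e <- data.frame(X = c(0, 2, 0, 3),
                Y = c(0, 3, 3, 3),
                Z = c(0, 4, 4, 3))

# Option 1: keep only rows in which no column is zero
has_zero <- rowSums(e == 0) > 0
e_pos <- e[!has_zero, ]
e_pos
#   X Y Z
# 2 2 3 4
# 4 3 3 3

# Option 2: keep every row and plot log(1 + x) instead;
# log1p(x) computes log(1 + x) and maps zero to zero
pairs(log1p(e))
```

Option 1 discards data (here, half of the rows), so the log(1 + x) route is often preferable when zeros are genuine observations.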