qqnorm plotting for multiple subsets - r

I am very new to R. I have figured out how to make qqnorm plots on a subset of my dataframe. However, I would like to make qqnorm plots on subsets that are defined by two factors (one factor has 48 categories (brain_region) and each of those categories can be further subdivided by another factor, which has three levels (GroupID)). I have tried the following:
by(t, t[,"GroupID"], function(x) tapply(t$FA,t$brain_region,qqnorm))
but it does not seem to be working. I'm also not sure if this is the best approach, as I'm new to this program.
I would also like to save each of the separately generated qqnorm plot with the x axis as labeled as "FA" and the title with the specific level of each of the two factors (brain region/GroupID). Thank you very much for any help.

Plotting is one of the few things where apply isn't the optimal solution. ggplot offers you enough possibilities to get this done, as shown in this answer.
Plotting all levels in one go
If you use base graphics, a for loop is the better tool for this. If you also want several plots on the same graphics device, you can use e.g. par(mfrow=) or layout() (see the help page ?layout).
Let's take the built-in data set iris as an example:
data(iris)
op <- par(mfrow=c(1,3))
for(i in levels(iris$Species)){
tmp <- with(iris, Petal.Width[Species==i])
qqnorm(tmp,xlab="Petal.Width",main=i)
qqline(tmp)
}
par(op)
rm(i,tmp)
This draws three normal Q-Q plots side by side, one per species.
Don't forget to clean up your workspace after using a for loop. Not really obligatory, but it can prevent serious confusion later on.
Combine two factors
In order to get this done for two factors at the same time, you can either construct a nested for loop, or combine both factors into a single factor. Take the dataset mtcars:
data(mtcars)
mtcars$cyl <- factor(mtcars$cyl)
mtcars$am <- factor(mtcars$am,
labels=c('automatic','manual'))
To combine both factors, you can use this simple construct:
mtcars$combined <- factor(paste(mtcars$cyl,mtcars$am,sep='/'))
And then run the same loop as above over the levels of the combined factor. With two for loops, your code would look like the code below. Be warned though that this only works if you have data for every combination of the factors and you don't have too many levels. If you have a lot of levels, you are better off saving the plots to files using e.g. png() (see ?png for info) instead of plotting them all on the same graphics device.
lcyl <- levels(mtcars$cyl)
lam <- levels(mtcars$am)
# one row per transmission type, one column per cylinder count
op <- par(mfrow=c(length(lam),length(lcyl)))
for(i in lam){
  for(j in lcyl){
    tmp <- with(mtcars,mpg[am==i & cyl==j])
    qqnorm(tmp,xlab="mpg",
           main=paste(i,j,sep="/"))
    qqline(tmp)
  }
}
par(op)
rm(i,j,tmp)
This draws a grid of six Q-Q plots: one row per transmission type and one column per cylinder count.
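For the many-levels case mentioned above, here is a rough sketch of writing one PNG file per combination instead of drawing everything on one device. The output directory and the file-name pattern are assumptions made for illustration, and empty combinations are skipped rather than plotted.

```r
# Sketch: one PNG per am/cyl combination of mtcars.
# 'out_dir' and the "qq_<group>.png" naming are illustrative assumptions.
dat <- mtcars
dat$cyl <- factor(dat$cyl)
dat$am  <- factor(dat$am, labels = c("automatic", "manual"))

out_dir <- file.path(tempdir(), "qqplots")
dir.create(out_dir, showWarnings = FALSE)

# split mpg by both factors at once; names look like "automatic_4"
groups <- split(dat$mpg, list(dat$am, dat$cyl), sep = "_")

saved <- character(0)
for (g in names(groups)) {
  vals <- groups[[g]]
  if (length(vals) == 0) next   # no data for this combination
  f <- file.path(out_dir, paste0("qq_", g, ".png"))
  png(f)
  qqnorm(vals, xlab = "mpg", main = g)
  qqline(vals)
  dev.off()
  saved <- c(saved, f)
}
```

For the data in the question, the same pattern would split FA by brain_region and GroupID and use xlab = "FA".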

Related

How can I stop sapply dropping my barplot titles?

I'm wanting to make a barplot for the factor variables in my data set. To do this I've been running sapply(data[sapply(data, class)=='factor'],function(x) barplot(table(x))). To my annoyance, the plots remember their factor labels, but none of them have retained a title. How can I fix this without titling each graph by hand?
Currently, I'm getting humorously vague untitled graphs like this:
How about
## extract names
fvars <- names(data)[which(sapply(data,inherits,"factor"))]
## apply barplot() with main=
lapply(fvars, function(x) barplot(table(data[[x]]), main=x))
?
Example data:
data <- mtcars
for (i in c("vs","am","gear","carb")) data[[i]] <- factor(data[[i]])
Note that this creates all the plots at once. If you're working in a GUI with a plot history (RStudio or RGui) you can page back through the graphs. Otherwise, you might want to use par(mfrow=c(nr,nc)) (fill in number of rows and columns) to set up subplots before you start.
The numbers that are returned are the bar midpoints (see ?barplot): you could wrap the barplot() call in invisible() if you don't want to see them.
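A minimal sketch combining the last two suggestions, using the example data above. The 2 x 2 grid is an assumption that happens to fit the four factor columns created there; invisible() suppresses the printed bar midpoints.

```r
# Example data as above: turn four mtcars columns into factors
data <- mtcars
for (i in c("vs", "am", "gear", "carb")) data[[i]] <- factor(data[[i]])

# names of all factor columns
fvars <- names(data)[sapply(data, is.factor)]

op <- par(mfrow = c(2, 2))   # 2 x 2 grid: one panel per factor column
invisible(lapply(fvars, function(v) barplot(table(data[[v]]), main = v)))
par(op)                      # restore the previous graphics settings
```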

Generate correct legend with auto.key in xyplot

I realize this is perhaps more of a data frame issue than an xyplot question, but here goes.
I have a data frame dat that has 108 rows and 5 columns. dat$Treatment is a factor with 5 levels. I want to create an xy plot with ONLY the data where dat$Treatment=="Control". Since I didn't know any better way to do it, I created tmp as shown below. xyplot plots the correct graph, with only the data in the rows where dat$Treatment=="Control". However the legend displays all the data, for example those where dat$Treatment=="High dose"
Where is auto.key getting that from? I thought my tmp data frame didn't even have it. Can someone please help me understand?
tmp <- dat[dat$Treatment=="Control",]
xyplot(tmp[,5] ~ Day, groups=tmp$Animal, data=tmp,
type="b", ylab="Tumor volume",
par.settings=simpleTheme(col=1:8,
pch=20,
cex=1.3,
lwd=2,
lty="dotted"),
auto.key=list(title="Animal", x=.05, y=.95,
corner=c(0,1), border=T, lines=T, points=F, type="b"))
I'm not too familiar with the lattice package, so others with more experience will have to weigh in. My guess is that you're seeing this behavior because of how R is handling dat$Treatment. I'm guessing this variable is stored as a factor, with levels you don't want to include in the plot. As a rough first step, I'd try saving the new data frame (as you have), but additionally run the following command:
tmp$Treatment = as.factor(as.character(tmp$Treatment))
This should save the Treatment variable as a factor with only one level. My guess is that the xyplot function looks up the levels of that factor when it plots. As a related example, consider the following:
data(iris)
iris.2 = iris[iris$Species == "setosa",]
table(iris.2$Species)
iris.2$Species = factor(as.character(iris.2$Species))
table(iris.2$Species)
Here, the two tables are reported differently because we've resaved the Species variable as a new factor. Hope this helps --
auto.key gets its values from the levels of the factor variables. When you subset a factor variable, all the levels are maintained (so in the future, you can know which levels are missing from a particular subset). If you want to remove levels that aren't used in your subset you can use
tmp <- droplevels(dat[dat$Treatment=="Control",])
This way auto.key will never see the other factor levels.
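As an illustration, here is a small sketch with a made-up stand-in for dat (the column values are hypothetical; only the column names follow the question). It shows that the grouping factor Animal keeps its unused levels after subsetting, and that droplevels() on the data frame removes them from every factor column at once.

```r
# Hypothetical stand-in for 'dat': animals are nested within treatments
dat <- data.frame(
  Treatment = factor(rep(c("Control", "High dose"), each = 4)),
  Animal    = factor(rep(c("A1", "A2", "B1", "B2"), each = 2)),
  Day       = rep(1:2, times = 4)
)

tmp <- dat[dat$Treatment == "Control", ]
levels(tmp$Animal)         # still lists B1 and B2, which have no rows left

tmp_clean <- droplevels(tmp)
levels(tmp_clean$Animal)   # only the animals present in the subset
```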

Is there a better way to plot multicolor lines in R than splitting the data?

Sequential portions of my time series are under different treatments, and I'd like to separately color a line connecting observations in each portion.
For example, in the series under treatment A I'd have a red line, and in the succeeding series under treatment B I'd have a blue line.
plot(response, type="l",col="treatment") failed - all observations were connected with a line the same color.
This listhost posting proposed just splitting the data by treatment and then separately plotting each subset on the same plot. (http://r.789695.n4.nabble.com/Can-R-plot-multicolor-lines-td791081.html).
Is there a more elegant way?
An alternative using Map that avoids manually plotting segments:
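# treatment is made a factor explicitly so its codes can be used as colour numbers
dat <- data.frame(treatment=factor(rep(LETTERS[1:2],3:4)),
                  response=c(6,5,2,1,5,6,7),time=1:7)
plot(response ~ time, data=dat, type="n")
invisible(Map(
  function(x) lines(response ~ time, data=x, col=as.integer(x$treatment[1])),
  split(dat, dat$treatment)
))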
There are two popular more elegant ways. One is to use the ggplot2 package. Without more information it's hard to advise you other than look at help or examples in various places. The other is to check out the function matplot. That will require you to first restructure your data as a matrix but it can easily do what you want. Keep in mind that while it says in the help, "Plot the columns of one matrix against the columns of another", the x-axis matrix can be a vector the same length as one column of a matrix containing your line information. The function will just recycle the x vector.
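A rough sketch of the matplot route, reusing the small example data from the Map answer. The reshaping step (one column per treatment, NA where the treatment does not apply) and the colours are assumptions about how the data might be laid out, not the only way to do it.

```r
dat <- data.frame(treatment = factor(rep(LETTERS[1:2], 3:4)),
                  response  = c(6, 5, 2, 1, 5, 6, 7),
                  time      = 1:7)

# one column per treatment; rows outside that treatment become NA
m <- sapply(levels(dat$treatment),
            function(lv) ifelse(dat$treatment == lv, dat$response, NA))

# the single time vector is reused as x for every column of m
matplot(dat$time, m, type = "l", lty = 1, col = c("red", "blue"),
        xlab = "time", ylab = "response")
```

Because each column is NA outside its own treatment, the two coloured lines are not joined across the treatment change.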

PCA Biplot : A way to hide vectors to see all data points clearly

I am trying to do PCA with R.
My Data has 10,000 columns and 90 rows
I used the prcomp function to do PCA.
Trying to prepare a biplot with the prcomp results, I ran into the problem that the 10,000 plotted vectors cover my datapoints. Is there any option for the biplot to hide the vectors' representation?
OR
I can use plot to get the PCA results. But I am not sure how to label these points according to my datapoints, which are numbered 1 to 90.
Sample<-read.table(file.choose(),header=FALSE,sep="\t")
Sample.scaled<-data.frame(apply(Sample,2,scale))
Sample.scaled.2<-data.frame(t(na.omit(t(Sample.scaled))))
pca.Sample<-prcomp(Sample.scaled.2,retx=TRUE)
pdf("Sample_plot.pdf")
plot(pca.Sample$x)
dev.off()
If you do a help(prcomp) or ?prcomp, the help file tells us all the things contained in the prcomp() object returned by the function. We just need to pick which things we want to plot and do it with some function that gives us more control than biplot().
A more general trick for cases when the help file doesn't clarify things is to do a str() on the prcomp object (in your case pca.Sample) to see all its parts and find what we want (str() compactly displays the internal structure of an R object).
Here is an example with some of R's sample data:
# do a pca of arrests in different states
p<-prcomp(USArrests, scale = TRUE)
str(p) gives me something ugly and too long to include, but I can see that p$x has the states as rownames and their locations on the principal components as columns. Armed with this, we can plot it any way we want, such as with plot() and text() (for labels):
# plot and add labels
plot(p$x[,1],p$x[,2])
text(p$x[,1],p$x[,2],labels=rownames(p$x))
If we are making a scatterplot with many observations, the labels may not be readable. We therefore might want to only label more extreme values, which we can identify with quantile():
#make a new dataframe with the info from p we want to plot
df <- data.frame(PC1=p$x[,1],PC2=p$x[,2],labels=rownames(p$x))
#make sure labels are not factors, so we can easily reassign them
df$labels <- as.character(df$labels)
# use quantile() to identify which ones are within 25-75 percentile on both
# PC and blank their labels out
df[ df$PC1 > quantile(df$PC1)["25%"] &
df$PC1 < quantile(df$PC1)["75%"] &
df$PC2 > quantile(df$PC2)["25%"] &
df$PC2 < quantile(df$PC2)["75%"],]$labels <- ""
# plot
plot(df$PC1,df$PC2)
text(df$PC1,df$PC2,labels=df$labels)
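For the original question, where the points are simply numbered, a sketch along the same lines is to draw an empty plot and place the row numbers where the points would be. It uses USArrests again as a stand-in, so the labels here run from 1 to 50 rather than 1 to 90.

```r
p <- prcomp(USArrests, scale. = TRUE)
scores <- p$x[, 1:2]                 # positions on the first two components

plot(scores, type = "n")             # set up the axes without drawing points
text(scores, labels = seq_len(nrow(scores)))   # row numbers as labels
```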

Producing statistics over levels

I've generated a set of levels from my dataset, and now I want to find a way to sum the rest of the data columns in order to plot it while plotting my first column. Something like:
levelSet <- cut(frame$x1, "cutting")
boxplot(frame$x1~levelSet)
for (l in levelSet)
{
x2Sum<-sum(frame$x2[levelSet==l])
}
or maybe the inside of the loop should look like:
lines(sum(frame$x2[levelSet==l]))
Any thoughts? I am new to R, but I can't seem to get a hang of the indexing and ~ notation thus far.
I know R doesn't work this way, but I'd like functionality that 'looks' like
hist(frame$x2~levelSet)
## Or
hist(frame$x2, breaks = levelSet)
To plot a histogram, boxplot, etc. over a level set:
Try the lattice package:
library(lattice)
histogram(~x2|equal.count(x1),data=frame)
Substitute shingle for equal.count to set your own break points.
ggplot2 would also work nicely for this.
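For the per-level sums the question asks about, a base-R sketch with tapply() may be enough. The data frame below is simulated purely for illustration, and the choice of four equal-width bins is an assumption.

```r
set.seed(1)
# simulated stand-in for 'frame'
frame <- data.frame(x1 = runif(100), x2 = rnorm(100))

levelSet <- cut(frame$x1, breaks = 4)       # four equal-width bins of x1
sums <- tapply(frame$x2, levelSet, sum)     # sum of x2 within each bin

barplot(sums, las = 2, ylab = "sum of x2")
```

tapply() replaces the explicit loop over levels: it applies sum to the x2 values falling in each level of levelSet and returns one value per level.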
To put a histogram over a boxplot:
op <- par(mfrow=c(2,1))
hist(frame$x2)
boxplot(frame$x2)
par(op)
You can also use the layout() command to fine-tune the arrangement.
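A possible layout() version, sketched with mtcars$mpg as a stand-in for frame$x2. The 3:1 height ratio and the shared axis limits are illustrative choices.

```r
x2 <- mtcars$mpg                      # stand-in for frame$x2
rng <- range(pretty(x2))              # common axis limits for both panels

layout(matrix(1:2, nrow = 2), heights = c(3, 1))   # tall panel over short one
hist(x2, xlim = rng, main = "")
boxplot(x2, horizontal = TRUE, ylim = rng)         # ylim sets the data axis
layout(1)                             # back to a single panel
```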
