Producing statistics over levels

Producing statistics over levels - r

I've generated a set of levels from my dataset, and now I want to find a way to sum the rest of the data columns in order to plot it while plotting my first column. Something like:
levelSet <- cut(frame$x1, "cutting")
boxplot(frame$x1~levelSet)
for (l in levelSet)
{
x2Sum<-sum(frame$x2[levelSet==l])
}
or maybe the inside of the loop should look like:
lines(sum(frame$x2[levelSet==l]))
Any thoughts? I am new to R, but I can't seem to get a hang of the indexing and ~ notation thus far.
I know r doesn't work this way, but I'd like functionality that 'looks' like
hist(frame$x2~levelSet)
## Or
hist(frame$x2, breaks = levelSet)

To plot a histograph, boxplot, etc. over a level set:
Try the lattice package:
library(lattice)
histogram(~x2|equal.count(x1),data=frame)
Substitute shingle for equal.count to set your own break points.
ggplot2 would also work nicely for this.
To put a histogram over a boxplot:
par(mfrow=c(2,1))
hist(x2)
boxplot(x2)
You can also use the layout() command to fine-tune the arrangement.

Related

How can I stop sapply dropping my barplot titles?

I'm wanting to make a barplot for the factor variables in my data set. To do this I've been running sapply(data[sapply(data, class)=='factor'],function(x) barplot(table(x))). To my annoyance, the plots remember their factor labels, but none of them have retained a title. How can I fix this without titling each graph by hand?
Currently, I'm getting humorously vague untitled graphs like this:

How about
## extract names
fvars <- names(data)[which(sapply(data,inherits,"factor"))]
## apply barplot() with main=
lapply(fvars, function(x) barplot(table(data[[x]]), main=x))
?
Example data:
data <- mtcars
for (i in c("vs","am","gear","carb")) data[[i]] <- factor(data[[i]])
Note that this creates all the plots at once. If you're working in a GUI with a plot history (RStudio or RGui) you can page back through the graphs. Otherwise, you might want to use par(mfrow=c(nr,nc)) (fill in number of rows and columns) to set up subplots before you start.
The numbers that are returned are the bar midpoints (see ?barplot): you could wrap the barplot() call in invisible() if you don't want to see them.

R: how to make multiple plots from one CSV, grouping by a column

I'd like to put multiple plots onto a single visual output in R, based on data that I have in a CSV that looks something like this:
user,size,time
fred,123,0.915022
fred,321,0.938769
fred,1285,1.185608
wilma,5146,2.196687
fred,7506,1.181990
barney,5146,1.860287
wilma,1172,1.158015
barney,5146,1.219313
wilma,13185,1.455904
wilma,8754,1.381372
wilma,878,1.216908
barney,2974,1.223852
I can read this just fine, using, e.g.:
data = read.csv('data.csv')
For the moment, a fairly simple plot is fine, so I'm just trying plot(), without much to it (setting type='o' to get lines and points), and' from solving a past problem, I know that I can do, e.g., the following, to get data for just fred:
plot(data$time[which(data$user == 'fred')], data$size[which(data$user == 'fred')], type='o')
What I'd like, though, is to have the data for each user all showing up on one set of axes, with color coding (and a legend to match users to colors) to identify different user data.
And if another user shows up, I'd like another line to show up, with another color (perhaps recycling if I have too many users at once).
However, just this doesn't do it:
plot(data$size, data$time, type='o',col=c("red", "blue", "green"))
Because it doesn't seem to group by the user.
And just this:
plot(data, type='o')
gives me an error:
Error in plot.default(...) :
formal argument "type" matched by multiple actual arguments
This:
plot(data)
does do something, but not what I want.
I've poked around, but I'm new enough to R that I'm not quite sure how best to search for this, nor where to look for examples that would hit a use-case like this.
I even got somewhat closer with this:
plot(data$size[which(data$user == 'wilma')], data$time[which(data$user == 'wilma')], type='o', col=c('red'))
lines(data$size[which(data$user == 'fred')], data$time[which(data$user == 'fred')], type='o', col=c('green'))
lines(data$size[which(data$user == 'barney')], data$time[which(data$user == 'barney')], type='o', col=c('blue'))
This gives me a plot (which I'd post inline, but as a new user, I'm not allowed to yet):
not-quite-right plot
which is kind of close to what I want, except that it:
doesn't have a legend
has ugly axis labels, instead of just time and size
is scaled to the first plot, and thus is missing data from some of the others
isn't sorted by x-axis, which I could do externally, though I'm guessing I could do it fairly easily in R.
So, the question, ultimately, is this:
What's an easy way to plot data like this which:
has multiple lines based on the labels in the first column of the CSV
uses the same set of axes for the data in columns 2 and 3, regardless of the label
has a legend and color-coding for which label is being used for a particular line (or set of points)
will adapt to adding new labels to the data file, hopefully without change to the R code.
Thanks in advance for any help or pointers on this.
P.S. I looked around for similar questions, and found one that's sort of close, but it's not quite the same, and I failed to figure out how to adapt it to what I'm trying to do.

Good question. This is doable in base plot, but it's even easier and more intuitive using ggplot2. Below is an example of how to do this with random data in ggplot2
First download and install the package
install.packages("ggplot2",repos='http://cran.us.r-project.org')
require(ggplot2)
Next generate the data
a <- c(rep('a',3),rep('b',3),rep('c',3))
b <- rnorm(9,50,30)
c <- rep(seq(1,3),3)
dat <- data.frame(a,b,c)
Finally, make the plot
ggplot(data=dat, aes(x=c, y=b , group=a, colour=a)) + geom_line() + geom_point()
Basically, you are telling ggplot that your x axis corresponds to the c column (dat$c), your y axis corresponds to the b column (y$b) and to group (draw separate lines) by the a column (dat$a). Colour specifies that you want to group colour by the a column as well.
The resulting graph looks like this:

qqnorm plotting for multiple subsets

I am very new to R. I have figured out how to make qqnorm plots on a subset of my dataframe. However, I would like to make qqnorm plots on subsets that are defined by two factors (one factor has 48 categories (brain_region) and each of those categories can be further subdivided by another factor, which has three levels (GroupID)). I have tried the following:
by(t, t[,"GroupID"], function(x) tapply(t$FA,t$brain_region,qqnorm))
but it does not seem to be working. I'm also not sure if this is the best approach, as I'm new to this program.
I would also like to save each of the separately generated qqnorm plot with the x axis as labeled as "FA" and the title with the specific level of each of the two factors (brain region/GroupID). Thank you very much for any help.

Plotting is one of the few things where apply isn't the optimal solution. ggplot offers you enough possibilities to get this done, as shown in this answer.
Plotting all levels in one go
If you use the base plots, you can better use a for loop for this. Plus, if you want to plot different plots on the same graphics device, you can use eg par(mfrow=) or layout (see the help page ?layout)
Let's take the built-in data set iris as an example:
data(iris)
op <- par(mfrow=c(1,3))
for(i in levels(iris$Species)){
tmp <- with(iris, Petal.Width[Species==i])
qqnorm(tmp,xlab="Petal.Width",main=i)
qqline(tmp)
}
par(op)
rm(i,tmp)
gives :
Don't forget to clean up your workspace after using a for loop. Not really obligatory, but it can prevent serious confusion later on.
Combine two factors
In order to get this done for 2 factor levels at the same time, you can either construct a nested for-loop, or combine both factors into a single factor. Take the dataset mtcars:
data(mtcars)
mtcars$cyl <- factor(mtcars$cyl)
mtcars$am <- factor(mtcars$am,
labels=c('automatic','manual'))
To combine both levels, you can use this simple construct :
mtcars$combined <- factor(paste(mtcars$cyl,mtcars$am,sep='/'))
And then do the same again. With two for loops, your code would like like the code below. Be warned though that this only works if you have data for every combination of the factors, and you don't have too many levels. If you have a lot of levels, you better save the plots by using eg png() (see ?png for info) instead of plotting them all on the same graphics device.
lcyl <- levels(mtcars$cyl)
lam <- levels(mtcars$am)
par(mfrow=c(length(lam),length(lcyl)))
for(i in lam){
for(j in lcyl){
tmp <- with(mtcars,mpg[am==i & cyl==j])
qqnorm(tmp,xlab="Petal.Width",
main=paste(i,j,sep="/"))
qqline(tmp)
}
}
gives :

Kernel density plots on a single figure

I have been trying to plot simple density plots using R as:
plot(density(Data$X1),col="red")
plot(density(Data$X2),col="green")
Since I want to compare, I'd like to plot both in one figure. But 'matplot' doesn't work!! I also tried with ggplot2 as:
library(ggplot2)
qplot(Data$X1, geom="density")
qplot(Data$X2, add = TRUE, geom="density")
Also in this case, plots appear separately (though I wrote add=TRUE)!! Can anyone come up with an easy solution to the problem, please?

In ggplot2 or lattice you need to reshape the data to seupose them.
For example :
dat <- data.frame(X1= rnorm(100),X2=rbeta(100,1,1))
library(reshape2)
dat.m <- melt(dat)
Using ``lattice`
densityplot(~value , groups = variable, data=dat.m,auto.key = T)
Using ``ggplot2`
ggplot(data=dat.m)+geom_density(aes(x=value, color=variable))
EDIT add X1+X2
Using lattice and the extended formua interface, it is extremely easy to do this:
densityplot(~X1+X2+I(X1+X2) , data=dat) ## no need to reshape data!!

You can try:
plot(density(Data$X1),col="red")
points(density(Data$X2),col="green")
I must add that the xlim and ylim values should ideally be set to include ranges of both X1 and X2, which could be done as follows:
foo <- density(Data$X1)
bar <- density(Data$X2)
plot(foo,col="red", xlim=c(min(foo$x,bar$x),max(foo$x,bar$x)) ylim=c(min(foo$y,bar$y),max(foo$y,bar$y))
points(bar,col="green")

In base graphics you can overlay density plots if you keep the ranges identical and use par(new=TRUE) between them. I think add=TRUE is a base graphics strategy that some functions but not all will honor.

If you specify n, from, and to in the calls to density and make sure that they match between the 2 calls then you should be able to use matplot to plot both in one step (you will need to bind the 2 sets of y values into a single matrix).

In R, how to plot bars only in a certain interval of data?

My problem is very simple.
I have to plot a data series in R, using bars. Data are contained in a vector vet.
I've used barplot, that plots my data from the first to the last:
barplot(vet), and everything was fine.
Now, on the contrary, I would like to plot not all my data, but just a part of them: from 10% to the end.
How could I do this with barplot()?
How could I do this with plot()?
Thanx

You need to subset your data before plotting:
##Work out the 10% quantile and subset
v = vet[vet > quantile(vet, 0.1)]

It is not clear exactly what you want to do.
If you want to plot only a subset of the bars (but the whole bars) then you could just subset the data before passing it to barplot.
If you want to plot all the bars, but only that part beyond 10% (not include 0) then you can do this by setting the ylim argument. But it is very discouraged to do a barplot that does not include 0. You may be better off using a dotplot instead of a barplot if 0 is not meaningful.
If you want the regular plot, but want to exclude plotting outside of a given window within the plot then the clip function may be what you want.
The gap.barplot function from the plotrix package may also be what you want.

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

Producing statistics over levels - r

Related

How can I stop sapply dropping my barplot titles?

R: how to make multiple plots from one CSV, grouping by a column

qqnorm plotting for multiple subsets

Kernel density plots on a single figure

In R, how to plot bars only in a certain interval of data?

Categories

Resources