how to script in R over a factor's levels - r

I have a data frame with a quantitative variable, x, and several different factors, f1, f2, ...,fn. The number of levels is not constant across factors.
I want to create a (single) plot of densities of x by factor level fi.
I know how to hand code this for a specific factor. For example, here is the plot for a factor with two levels.
# set up the background plot
plot(density(frame$x[frame$f1=="level1"]))
# add curves
lines(density(frame$x[frame$f1=="level2"]))
I could also do this like so:
# set up the background plot
plot(NA)
# add curves
lines(density(frame$x[frame$f1=="level1"]))
lines(density(frame$x[frame$f1=="level2"]))
What I'd like to know is how can I do this if I only specify the factor as input. I don't even know how to write a for loop that would do what I need, and I have the feeling that the 'R way' would avoid for loops.
Bonus: For the plots, I would like to specify limiting values for the axes. Right now I do this in this way:
xmin=min(frame$x[frame$f1=="level1"],frame$x[frame$f1=="level2"])
How can I include this type of calculation in my script?

I'm assuming your data is in the format (data frame called df)
f1 f2 f3 fn value
A........................... value 1
A............................value 2
.............................
B............................value n-1
B............................value n
In that cause, lattice (or ggplot2) will be very useful.
library(lattice)
densityplot(~value, groups = f1, data = df, plot.points = FALSE)
This should get you close to what you are looking for, I think.
Greg

You could also do:
# create an empty plot. You may want to add xlab, ylab etc
# EDIT: also add some appropriate axis limits with xlim and ylim
plot(0, 0, "n", xlim=c(0, 10), ylim=c(0, 2))
levels <- unique(frame$f1)
for (l in levels)
{
lines(density(frame$x[frame$f1==l]))
}

ggplot2 code
library(ggplot2)
ggplot(data, aes(value, colour = f1)) +
stat_density(position = "identity")

Related

How to Plot Bar Charts for a Categorical Variable Against an Analytical Variable in R

I'm struggling with how to do something with R that comes very easily to me in Excel: so I'm sure this is something quite basic but I'm just not aware of the equivalent method in R.
In essence, I have a two variables in my dataset: a categorical variable which has a list of names, and an analytical variable that has the frequency corresponding to that particular observation.
Something like this:
Name Freq
==== =========
X 100
Y 200
and so on.
I would like to plot a bar chart with the names listed on the X-Axis (X, Y and so on) and bars of height corresponding to the relevant value of the Freq. variable for that observation.
This is something very trivial with Excel; I can just select the relevant cells and create a bar chart.
However, in R I just can't seem to figure out how to do this! The bar charts in R seems to be univariate only and doesn't behave the way I want it to. Trying to plot the two variables results in a scatter plot which is not what I'm going for.
Is there something very basic I'm missing here, or is R just not capable of performing this task?
Any pointers will be much helpful.
Edited to Add:
I was primarily trying to use base R's plot function to get the job done.
Using, plot(dataset1$Name, dataset1$Freq) does not lead to a bar graph but a scatter-plot instead.
First the data.
dat <- data.frame(Name = c("X", "Y"), Freq = c(100, 200))
With base R.
barplot(dat$Freq, names.arg = dat$Name)
If you want to display a long list of names.arg, maybe the best way is to customize your horizontal axis with function staxlab from package plotrix. Here are two example plots.
One, with the axis labels rotated 45 degrees.
set.seed(3)
Name <- paste0("Name_", LETTERS[1:10])
dat2 <- data.frame(Name = Name, Freq = sample(100:200, 10))
bp <- barplot(dat2$Freq)
plotrix::staxlab(1, at = bp, labels = dat2$Name, srt = 45)
Another, with the labels spread over 3 lines.
bp <- barplot(dat2$Freq)
plotrix::staxlab(1, at = bp, labels = dat2$Name, nlines = 3)
Add colors with argument col. See help("par").
With ggplot2.
library(ggplot2)
ggplot(dat, aes(Name, Freq)) +
geom_bar(stat = "identity")
To add colors you have the aesthetics colour (for the contour of the bars) and fill (for the interior of the bars).

Plot 'ranges' of variable in data

I have observations in the form of ranges
For eg: A 13-20, B 15-30, C 23-40, D 2-11
I want to plot them in R in form of the starting value and the end value for eg. 13 and 20 for A(upper and lower limits if you may say) in order to visualize and find out what ranges are common to certain combinations of observations. Is there a quick way to do this in R ? I think this is a very trivial problem I am having but I cant think of anyway to do it right now.
Here is a solution using ggplot. It's not clear at all what format your data is in, so this assumes a data frame with columns id (A-D), min, and max.
df <- data.frame(id=LETTERS[1:4], min=c(13,15,23,2), max=c(20,30,40,11))
library(ggplot2)
ggplot(df, aes(x=id))+
geom_linerange(aes(ymin=min,ymax=max),linetype=2,color="blue")+
geom_point(aes(y=min),size=3,color="red")+
geom_point(aes(y=max),size=3,color="red")+
theme_bw()
I've added a lot of customization just to give you an idea of how it's done. You use the aes(...) function to tell ggplot which columns in df map to various aesthetics of the graph. So for instance aes(x=id) tells ggplot that the values for the x-axis are to be found in the id column of df, etc.
EDIT: Response to OP's comment.
To change the size of axis text, use the theme(...) function, as in:
ggplot(df, aes(x=id))+
geom_linerange(aes(ymin=min,ymax=max),linetype=2,color="blue")+
geom_point(aes(y=min),size=3,color="red")+
geom_point(aes(y=max),size=3,color="red")+
theme_bw()+
theme(axis.text.x=element_text(size=15))
Here I made the x-axis text bigger. Play around with size=... to get it the way you want. Also read the documentation (?theme) for a list of other formatting options.
It is not clear whether the dataset has range column as string or not i.e. '13-20', '15-30' etc. or if it is two numeric columns as showed in the created example.
matplot(m1, xaxt='n', pch=1, ylab='range')
axis(1, at=seq_len(nrow(m1)), labels=row.names(m1))
s1 <- seq_len(nrow(m1))
arrows(s1, m1[,1], s1, m1[,2], angle=90, length=0.1)
If the data has string column (d1)
library(splitstackshape)
d2 <- setDF(cSplit(d1, 'range', '-'))
matplot(d2[,-1], xaxt='n', pch=1, ylab='range')
axis(1, at=seq_len(nrow(d2)), labels=d2$Col1)
arrows(s1, d2[,2], s1, d2[,3], angle=90, length=0.1)
data
m1 <- matrix(c(13,20, 15,30, 23,40, 2,11),
byrow=TRUE,dimnames=list(LETTERS[1:4],NULL), ncol=2)
d1 <- data.frame(Col1=LETTERS[1:4],
range=c('13-20', '15-30', '23-40', '2-11'), stringsAsFactors=FALSE)

Plot one numeric variable against n numeric variables in n plots

I have a huge data frame and I would like to make some plots to get an idea of the associations among different variables. I cannot use
pairs(data)
, because that would give me 400+ plots. However, there's one response variable y I'm particularly interested in. Thus, I'd like to plot y against all variables, which would reduce the number of plots from n^2 to n. How can I do it?
EDIT: I add an example for the sake of clarity. Let's say I have the dataframe
foo=data.frame(x1=1:10,x2=seq(0.1,1,0.1),x3=-7:2,x4=runif(10,0,1))
and my response variable is x3. Then I'd like to generate four plots arranged in a row, respectively x1 vs x3, x2 vs x3, an histogram of x3 and finally x4 vs x3. I know how to make each plot
plot(foo$x1,foo$x3)
plot(foo$x2,foo$x3)
hist(foo$x3)
plot(foo$x4,foo$x3)
However I have no idea how to arrange them in a row. Also, it would be great if there was a way to automatically make all the n plots, without having to call the command plot (or hist) each time. When n=4, it's not that big of an issue, but I usually deal with n=20+ variables, so it can be a drag.
Could do reshape2/ggplot2/gridExtra packages combination. This way you don't need to specify the number of plots. This code will work on any number of explaining variables without any modifications
foo <- data.frame(x1=1:10,x2=seq(0.1,1,0.1),x3=-7:2,x4=runif(10,0,1))
library(reshape2)
foo2 <- melt(foo, "x3")
library(ggplot2)
p1 <- ggplot(foo2, aes(value, x3)) + geom_point() + facet_grid(.~variable)
p2 <- ggplot(foo, aes(x = x3)) + geom_histogram()
library(gridExtra)
grid.arrange(p1, p2, ncol=2)
The package tidyr helps doing this efficiently. please refer here for more options
data %>%
gather(-y_value, key = "some_var_name", value = "some_value_name") %>%
ggplot(aes(x = some_value_name, y = y_value)) +
geom_point() +
facet_wrap(~ some_var_name, scales = "free")
you would get something like this
If your goal is only to get an idea of the associations among different variables, you can also use:
plot(y~., data = foo)
It is not as nice as using ggplot and it doesn't automatically put all the graphs in one window (although you can change that using par(mfrow = c(a, b)), but it is a quick way to get what you want.
I faced the same problem, and I don't have any experience of ggplot2, so I created a function using plot which takes the data frame, and the variables to be plotted as arguments and generate graphs.
dfplot <- function(data.frame, xvar, yvars=NULL)
{
df <- data.frame
if (is.null(yvars)) {
yvars = names(data.frame[which(names(data.frame)!=xvar)])
}
if (length(yvars) > 25) {
print("Warning: number of variables to be plotted exceeds 25, only first 25 will be plotted")
yvars = yvars[1:25]
}
#choose a format to display charts
ncharts <- length(yvars)
nrows = ceiling(sqrt(ncharts))
ncols = ceiling(ncharts/nrows)
par(mfrow = c(nrows,ncols))
for(i in 1:ncharts){
plot(df[,xvar],df[,yvars[i]],main=yvars[i], xlab = xvar, ylab = "")
}
}
Notes:
You can provide the list of variables to be plotted as yvars,
otherwise it will plot all (or first 25, whichever is less) the variables in the data frame against xvar.
Margins were going out of bounds if the number of plots exceeds 25,
so I kept a limit to plot 25 charts only. Any suggestions to nicely
handle this are welcome.
Also the y axis labels are removed as titles of the graphs take care
of it. x axis label is set to xvar.

Plot frequency of a value of 2 factors in the same plot in R

I'd like to plot the frequency of a variable color coded for 2 factor levels for example blue bars should be the hist of level A and green the hist of level B both n the same graph? Is this possible with the hist command? The help of hist does not allow for a factor. Is there another way around?
I managed to do this by barplots manually but i want to ask if there is a more automatic method
Many thanks
EC
PS. I dont need density plots
Just in case the others haven't answered this is a way that satisfies. I had to deal with stacking histograms recently, and here's what I did:
data_sub <- subset(data, data$V1 == "Yes") #only samples that have V1 as "yes" in my dataset #are added to the subset
hist(data$HL)
hist(data_sub$HL, col="red", add=T)
Hopefully, this is what you meant?
It's rather unclear what you have as a data layout. A histogram requires that you have a variable that is ordinal or continuous so that breaks can be created. If you also have a separate grouping factor you can plot histograms conditional on that factor. A nice worked example of such a grouping and overlaying a density curve is offered in the second example on the help page for the histogram function in the lattice package.
A nice resource for learning relative merits of lattice and ggplot2 plotting is the Learning R blog. This is from the first of a multipart series on side-by=side comparison of the two plotting systems:
library(lattice)
library(ggplot2)
data(Chem97, package = "mlmRev")
#The lattice method:
pl <- histogram(~gcsescore | factor(score), data = Chem97)
print(pl)
# The ggplot method:
pg <- ggplot(Chem97, aes(gcsescore)) + geom_histogram(binwidth = 0.5) +
facet_wrap(~score)
print(pg)
I don't think you can do that easily with a bar histogram, as you would have to "interlace" the bars from both factor levels... It would need some kind of "discretization" of the now continuous x axis (i.e. it would have to be split in "categories" and in each category you would have 2 bars, for each factor level...
But it is quite easy and without problems if you are just fine with plotting the density line function:
y <- rnorm(1000, 0, 1)
x <- rnorm(1000, 0.5, 2)
dx <- density(x)
dy <- density(y)
plot(dx, xlim = range(dx$x, dy$x), ylim = range(dx$y, dy$y),
type = "l", col = "red")
lines(dy, col = "blue")
It's very possible.
I didn't have data to work with but here's an example of a histogram with different colored bars. From here you'd need to use my code and figure out how to make it work for factors instead of tails.
BASIC SETUP
histogram <- hist(scale(vector)), breaks= , plot=FALSE)
plot(histogram, col=ifelse(abs(histogram$breaks) < #of SD, Color 1, Color 2))
#EXAMPLE
x<-rnorm(1000)
histogram <- hist(scale(x), breaks=20 , plot=FALSE)
plot(histogram, col=ifelse(abs(histogram$breaks) < 2, "red", "green"))
I agree with the others that a density plot is more useful than merging colored bars of a histogram, particularly if the group's values are intermixed. This would be very difficult and wouldn't really tell you much. You've got some great suggestions from others on density plots, here's my 2 cents for density plots that I sometimes use:
y <- rnorm(1000, 0, 1)
x <- rnorm(1000, 0.5, 2)
DF <- data.frame("Group"=c(rep(c("y","x"), each=1000)), "Value"=c(y,x))
library(sm)
with(DF, sm.density.compare(Value, Group, xlab="Grouping"))
title(main="Comparative Density Graph")
legend(-9, .4, levels(DF$Group), fill=c("red", "darkgreen"))

Get a histogram plot of factor frequencies (summary)

I've got a factor with many different values. If you execute summary(factor) the output is a list of the different values and their frequency. Like so:
A B C D
3 3 1 5
I'd like to make a histogram of the frequency values, i.e. X-axis contains the different frequencies that occur, Y-axis the number of factors that have this particular frequency. What's the best way to accomplish something like that?
edit: thanks to the answer below I figured out that what I can do is get the factor of the frequencies out of the table, get that in a table and then graph that as well, which would look like (if f is the factor):
plot(factor(table(f)))
Update in light of clarified Q
set.seed(1)
dat2 <- data.frame(fac = factor(sample(LETTERS, 100, replace = TRUE)))
hist(table(dat2), xlab = "Frequency of Level Occurrence", main = "")
gives:
Here we just apply hist() directly to the result of table(dat). table(dat) provides the frequencies per level of the factor and hist() produces the histogram of these data.
Original
There are several possibilities. Your data:
dat <- data.frame(fac = rep(LETTERS[1:4], times = c(3,3,1,5)))
Here are three, from column one, top to bottom:
The default plot methods for class "table", plots the data and histogram-like bars
A bar plot - which is probably what you meant by histogram. Notice the low ink-to-information ratio here
A dot plot or dot chart; shows the same info as the other plots but uses far less ink per unit information. Preferred.
Code to produce them:
layout(matrix(1:4, ncol = 2))
plot(table(dat), main = "plot method for class \"table\"")
barplot(table(dat), main = "barplot")
tab <- as.numeric(table(dat))
names(tab) <- names(table(dat))
dotchart(tab, main = "dotchart or dotplot")
## or just this
## dotchart(table(dat))
## and ignore the warning
layout(1)
this produces:
If you just have your data in variable factor (bad name choice by the way) then table(factor) can be used rather than table(dat) or table(dat$fac) in my code examples.
For completeness, package lattice is more flexible when it comes to producing the dot plot as we can get the orientation you want:
require(lattice)
with(dat, dotplot(fac, horizontal = FALSE))
giving:
And a ggplot2 version:
require(ggplot2)
p <- ggplot(data.frame(Freq = tab, fac = names(tab)), aes(fac, Freq)) +
geom_point()
p
giving:

Resources