Get a histogram plot of factor frequencies (summary) - r

I've got a factor with many different values. If you execute summary(factor) the output is a list of the different values and their frequency. Like so:
A B C D
3 3 1 5
I'd like to make a histogram of the frequency values, i.e. X-axis contains the different frequencies that occur, Y-axis the number of factors that have this particular frequency. What's the best way to accomplish something like that?
edit: thanks to the answer below I figured out that what I can do is get the factor of the frequencies out of the table, get that in a table and then graph that as well, which would look like (if f is the factor):
plot(factor(table(f)))
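A minimal base R sketch of the same idea, assuming a small made-up factor f (the values are illustrative only): tabulating the table gives the number of levels that share each frequency, which can then be drawn as a bar plot.

```r
# Hypothetical example factor: A and B occur 3 times, C once, D 5 times
f <- factor(rep(LETTERS[1:4], times = c(3, 3, 1, 5)))

level_freq   <- table(f)           # frequency of each level
freq_of_freq <- table(level_freq)  # how many levels share each frequency

# x-axis: the frequencies that occur; bar height: number of levels with that frequency
barplot(freq_of_freq,
        xlab = "Frequency of level occurrence",
        ylab = "Number of levels")
```

Unlike hist(), this keeps one bar per distinct frequency and skips frequencies that never occur.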

Update in light of clarified Q
set.seed(1)
dat2 <- data.frame(fac = factor(sample(LETTERS, 100, replace = TRUE)))
hist(table(dat2), xlab = "Frequency of Level Occurrence", main = "")
gives:
Here we just apply hist() directly to the result of table(dat2). table(dat2) provides the frequencies per level of the factor and hist() produces the histogram of these data.
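Because the frequencies are integers, the default bin edges chosen by hist() can lump neighbouring counts together. A possible refinement, sketched here with the same kind of simulated data (the names fac, counts and breaks are just for illustration), is to centre one bin on each integer:

```r
set.seed(1)
fac <- factor(sample(LETTERS, 100, replace = TRUE))

counts <- as.vector(table(fac))  # one frequency per level that occurs

# Bin edges at half-integers, so every integer frequency gets its own bar
breaks <- seq(min(counts) - 0.5, max(counts) + 0.5, by = 1)

h <- hist(counts, breaks = breaks,
          xlab = "Frequency of Level Occurrence", main = "")
```

hist() returns the binning it used, so h$counts holds the number of levels per frequency.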
Original
There are several possibilities. Your data:
dat <- data.frame(fac = rep(LETTERS[1:4], times = c(3,3,1,5)))
Here are three, shown in the figure column by column, top to bottom:
The default plot method for class "table", which plots the counts as thin, histogram-like bars
A bar plot, which is probably what you meant by histogram. Notice the low information-to-ink ratio here
A dot plot or dot chart, which shows the same info as the other plots but uses far less ink per unit of information. Preferred.
Code to produce them:
layout(matrix(1:4, ncol = 2))
plot(table(dat), main = "plot method for class \"table\"")
barplot(table(dat), main = "barplot")
tab <- as.numeric(table(dat))
names(tab) <- names(table(dat))
dotchart(tab, main = "dotchart or dotplot")
## or just this
## dotchart(table(dat))
## and ignore the warning
layout(1)
this produces:
If you just have your data in variable factor (bad name choice by the way) then table(factor) can be used rather than table(dat) or table(dat$fac) in my code examples.
For completeness, package lattice is more flexible when it comes to producing the dot plot as we can get the orientation you want:
require(lattice)
with(dat, dotplot(fac, horizontal = FALSE))
giving:
And a ggplot2 version:
require(ggplot2)
p <- ggplot(data.frame(Freq = tab, fac = names(tab)), aes(fac, Freq)) +
geom_point()
p
giving:

Related

How to assign the same colors to duplicate values using plot() function in R?

I have a dataframe with 4 columns. I am plotting total car crashes vs. total losses, colored by the column "distance". I am able to generate a colored plot, but when I inspect it I see that the points are not colored the way they are supposed to be. Since there are many duplicate distance values, I created a palette from the unique values of the distance column and passed it as the color to both legend() and plot(). However, the colors in the plot are not right. For example, distance 600 is colored yellow in the legend but the corresponding dot is colored red. I think the problem is that I need one color per row of the dataframe, which has 58 rows, but I have duplicate values and they are not originally sorted. Basically, I need colors running from the lowest distance to the highest distance in increasing shades of a color, matched properly to the crashes and losses data.
Below is the minimum reproducible dataframe and my code that works but is flawed.
# creating dataframe
year <- data.frame(year = seq(1946,2003,1))
crashes <- data.frame(crashes = c(386,317,294,287,266,245,268,296,226,265,243,239,183,212,195,224,170,169,140, 147,111,119,100,115,128,111,80,77,68,69,84,72,90,82,59,67,45,59,50,64,55,63,56,56,57,68,34,32,26,21,20,30,35,28 ,22,27,34,NA))
losses <- data.frame(losses = c(432,423,341,291,282,288,387,323,229,305,244,333,200,215,211,245,197,177,153,152, 115,189,124,129,133,120,91,90,69,78,88,77,95,98,62,70,45,62,70,68,65,73,90,65,61,74,39,33,31,22,21,39,35,58,25,36 ,40,NA))
distance <- data.frame(distance = c(600,571,589,613,618,605,605,610,608,584,605,615,605,597,603,600,578,560,541,500,478,459,449,447,452,444,431,433,452,436,426,425,430,426,430,417,372,401,389,418,414,397,443,436,431,439,430,425,415,423,437,463,487,505,503,508,516,529))
df <- cbind(year,crashes,losses,distance)
palette <- heat.colors(length(unique(df[order(df$distance),]$distance)))
plot(df$crashes,df$losses, main = "Crashes,Losses and Distance",xlab = "Crashes", ylab = "Losses", col = palette)
#legend
legend(x = 401,y = 450, legend = unique(df[order(df$distance),]$distance) , cex=.3, fill = palette, xpd=TRUE)
Hi, from what I understand you are trying to give each distance its own colour. This is easy, but I would strongly recommend first adding a grouping factor for the distances instead of using the 47 unique distance values as levels, which would give 47 different colours (you can try this by using 'distance' instead of 'distance_new' in the code below).
Here is my code for this:
# same df as you posted:
df <- cbind(year,crashes,losses,distance)
# this is necessary if you would want to use 'distance' for coloring
df$distance<- as.factor(df$distance)
## imo better add a grouping factor distance_new
## (this is automatically stored as class factor)
## go through as.character() so the actual distances are binned, not the factor level codes
df$distance_new<-cut(as.numeric(as.character(df$distance)),4, labels = c("very low","low","medium","high"))
# Now the plot
plot(df$crashes,df$losses, main = "Crashes,Losses and Distance",
xlab = "Crashes", ylab = "Losses",
col = df$distance_new, pch =19)
# add a legend, listing the levels in order so labels and colours line up
legend("bottomright",legend = levels(df$distance_new),
fill=seq_along(levels(df$distance_new)), cex= 0.5, title= "Distance")
View(df)
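If one colour per distinct distance is really wanted, as in the question, a different approach from the grouping above is to index the palette by each row's position among the sorted unique distances. This is only a sketch on a few made-up rows (lev, pal and point_col are illustrative names):

```r
# Small illustrative data set with duplicate distances (not the full data)
distance <- c(600, 571, 589, 600, 430, 430, 417)
crashes  <- c(386, 317, 294, 224,  90,  59,  67)
losses   <- c(432, 423, 341, 245,  95,  62,  70)

lev <- sort(unique(distance))      # distinct distances, lowest to highest
pal <- heat.colors(length(lev))    # one colour per distinct distance

# Look up each row's colour by its distance, so duplicates share a colour
point_col <- pal[match(distance, lev)]

plot(crashes, losses, col = point_col, pch = 19,
     main = "Crashes, Losses and Distance",
     xlab = "Crashes", ylab = "Losses")
legend("topleft", legend = lev, fill = pal, cex = 0.6, title = "Distance")
```

Passing a palette of unique colours straight to col, as in the question, recycles it row by row, which is why the legend and the points disagree there.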

How to Plot Bar Charts for a Categorical Variable Against an Analytical Variable in R

I'm struggling with how to do something with R that comes very easily to me in Excel: so I'm sure this is something quite basic but I'm just not aware of the equivalent method in R.
In essence, I have two variables in my dataset: a categorical variable which has a list of names, and an analytical variable that has the frequency corresponding to that particular observation.
Something like this:
Name Freq
==== =========
X 100
Y 200
and so on.
I would like to plot a bar chart with the names listed on the X-Axis (X, Y and so on) and bars of height corresponding to the relevant value of the Freq. variable for that observation.
This is something very trivial with Excel; I can just select the relevant cells and create a bar chart.
However, in R I just can't seem to figure out how to do this! The bar charts in R seem to be univariate only and don't behave the way I want them to. Trying to plot the two variables results in a scatter plot, which is not what I'm going for.
Is there something very basic I'm missing here, or is R just not capable of performing this task?
Any pointers would be very helpful.
Edited to Add:
I was primarily trying to use base R's plot function to get the job done.
Using, plot(dataset1$Name, dataset1$Freq) does not lead to a bar graph but a scatter-plot instead.
First the data.
dat <- data.frame(Name = c("X", "Y"), Freq = c(100, 200))
With base R.
barplot(dat$Freq, names.arg = dat$Name)
If you want to display a long list of names.arg, maybe the best way is to customize your horizontal axis with function staxlab from package plotrix. Here are two example plots.
One, with the axis labels rotated 45 degrees.
set.seed(3)
Name <- paste0("Name_", LETTERS[1:10])
dat2 <- data.frame(Name = Name, Freq = sample(100:200, 10))
bp <- barplot(dat2$Freq)
plotrix::staxlab(1, at = bp, labels = dat2$Name, srt = 45)
Another, with the labels spread over 3 lines.
bp <- barplot(dat2$Freq)
plotrix::staxlab(1, at = bp, labels = dat2$Name, nlines = 3)
Add colors with argument col. See help("par").
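For instance, a small sketch with one colour per bar (the third row Z and the colour choices are made up for illustration). barplot() returns the bar midpoints, which can be reused to place value labels:

```r
# Illustrative data; row Z is an assumption, not from the question
dat <- data.frame(Name = c("X", "Y", "Z"), Freq = c(100, 200, 150))

bar_cols <- c("steelblue", "tomato", "darkseagreen")

# One colour per bar; bp holds the x positions of the bar centres
bp <- barplot(dat$Freq, names.arg = dat$Name, col = bar_cols, ylab = "Freq")

# Write each value just below the top of its bar
text(x = bp, y = dat$Freq, labels = dat$Freq, pos = 1)
```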
With ggplot2.
library(ggplot2)
ggplot(dat, aes(Name, Freq)) +
geom_bar(stat = "identity")
To add colors you have the aesthetics colour (for the contour of the bars) and fill (for the interior of the bars).

Convert absolute values to ranges for charting in R

Warning: still new to R.
I'm trying to construct some charts (specifically, a bubble chart) in R that shows political donations to a campaign. The idea is that the x-axis will show the amount of contributions, the y-axis the number of contributions, and the area of the circles the total amount contributed at this level.
The data looks like this:
CTRIB_NAML CTRIB_NAMF CTRIB_AMT FILER_ID
John Smith $49 123456789
The FILER_ID field is used to filter the data for a particular candidate.
I've used the following functions to convert this data frame into a bubble chart (thanks to help here and here).
vals<-sort(unique(dfr$CTRIB_AMT))
sums<-tapply( dfr$CTRIB_AMT, dfr$CTRIB_AMT, sum)
counts<-tapply( dfr$CTRIB_AMT, dfr$CTRIB_AMT, length)
symbols(vals,counts, circles=sums, fg="white", bg="red", xlab="Amount of Contribution", ylab="Number of Contributions")
text(vals, counts, sums, cex=0.75)
However, this results in way too many intervals on the x-axis. There are several million records all told, and divided up for some candidates could still result in an overwhelming amount of data. How can I convert the absolute contributions into ranges? For instance, how can I group the vals into ranges, e.g., 0-10, 11-20, 21-30, etc.?
----EDIT----
Following comments, I can convert vals to numeric and then slice into intervals, but I'm not sure then how I combine that back into the bubble chart syntax.
new_vals <- as.numeric(as.character(sub("\\$","",vals)))
new_vals <- cut(new_vals,100)
But regraphing:
symbols(new_vals,counts, circles=sums)
Is nonsensical -- all the values line up at zero on the x-axis.
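Bin the full CTRIB_AMT column with cut (not just the unique values in vals), so that the grouping factor has one entry per record. Then you can just use tapply again to find the counts and the sums using these new breaks. For example:
amt <- as.numeric(sub("\\$", "", dfr$CTRIB_AMT))
bins <- cut(amt, 100)
counts = tapply(amt, bins, length)
sums = tapply(amt, bins, sum)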
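To feed the binned summaries back into symbols(), each bin needs a numeric x position. A rough base R sketch on simulated amounts (the 0 to 100 range, the width-10 bins and all variable names are assumptions for illustration) uses explicit breaks and plots each bubble at its bin midpoint:

```r
set.seed(42)
amt <- round(runif(500, 0, 100), 2)   # simulated contribution amounts

breaks <- seq(0, 100, by = 10)        # ranges 0-10, 10-20, ..., 90-100
bins   <- cut(amt, breaks = breaks, include.lowest = TRUE)

counts <- as.vector(table(bins))      # number of contributions per range
sums   <- tapply(amt, bins, sum)      # total contributed per range
sums[is.na(sums)] <- 0                # empty ranges contribute nothing

mids <- head(breaks, -1) + diff(breaks) / 2   # x position of each bubble

# 'circles' are radii, so take sqrt() to make the AREA proportional to the total
symbols(mids, counts, circles = sqrt(sums / pi), inches = 0.3,
        fg = "white", bg = "red",
        xlab = "Amount of Contribution", ylab = "Number of Contributions")
text(mids, counts, round(sums), cex = 0.75)
```

With explicit breaks the ranges stay the same from candidate to candidate, which cut(x, 100) does not guarantee.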
For this type of thing, though, you might find the plyr and ggplot2 packages helpful. Here is a complete reproducible example:
require(ggplot2)
require(plyr)  # ddply() comes from plyr
# Options
n = 1000
breaks = 10
# Generate data
set.seed(12345)
CTRIB_NAML = replicate(n, paste(letters[sample(10)], collapse=''))
CTRIB_NAMF = replicate(n, paste(letters[sample(10)], collapse=''))
CTRIB_AMT = paste('$', round(runif(n, 0, 100), 2), sep='')
FILER_ID = replicate(10, paste(as.character((0:9)[sample(9)]), collapse=''))[sample(10, n, replace=T)]
dfr = data.frame(CTRIB_NAML, CTRIB_NAMF, CTRIB_AMT, FILER_ID)
# Format data
dfr$CTRIB_AMT = as.numeric(sub('\\$', '', dfr$CTRIB_AMT))
dfr$CTRIB_AMT_cut = cut(dfr$CTRIB_AMT, breaks)
# Summarize data for plotting
plot_data = ddply(dfr, 'CTRIB_AMT_cut', function(x) data.frame(count=nrow(x), total=sum(x$CTRIB_AMT)))
# Make plot
dev.new(width=4, height=4)
# opts() and theme_text() were removed from ggplot2; theme() and element_text() replace them
qplot(CTRIB_AMT_cut, count, data=plot_data, geom='point', size=total) + theme(axis.text.x=element_text(angle=90, hjust=1))

Subset of data included in more than one ggplot facet

I have a population and a sample of that population. I've made a few plots comparing them using ggplot2 and its faceting option, but it occurred to me that having the sample in its own facet will distort the population plots (however slightly). Is there a way to facet the plots so that all records are in the population plot, and just the sampled records in the second plot?
Matt,
If I understood your question properly - you want to have a faceted plot where one panel contains all of your data, and the subsequent facets contain only a subset of that first plot?
There's probably a cleaner way to do this, but you can create a new data.frame object with the appropriate faceting variable that corresponds to each subset. Consider:
library(ggplot2)
df <- data.frame(x = rnorm(100), y = rnorm(100), sub = sample(letters[1:5], 100, TRUE))
df2 <- rbind(
cbind(df, faceter = "Whole Sample")
, cbind(df[df$sub == "a" ,], faceter = "Subset A")
#other subsets go here...
)
qplot(x,y, data = df2) + facet_wrap(~ faceter)
Let me know if I've misunderstood your question.
-Chase
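The same stacking trick can be generalised so that every subset gets its own panel next to the full data. A sketch, assuming the same kind of simulated data frame as above (pieces and the panel labels are illustrative; the plot is only drawn if ggplot2 is installed):

```r
set.seed(1)
df <- data.frame(x = rnorm(100), y = rnorm(100),
                 sub = sample(letters[1:5], 100, replace = TRUE))

# One copy of every row for the "whole" panel ...
whole <- cbind(df, faceter = "Whole Sample")

# ... plus one labelled copy of each subset
pieces <- lapply(split(df, df$sub), function(d) {
  cbind(d, faceter = paste("Subset", toupper(d$sub[1])))
})

df2 <- do.call(rbind, c(list(whole), unname(pieces)))

# Keep the panels in the order they were stacked
df2$faceter <- factor(df2$faceter, levels = unique(df2$faceter))

if (requireNamespace("ggplot2", quietly = TRUE)) {
  print(ggplot2::ggplot(df2, ggplot2::aes(x, y)) +
          ggplot2::geom_point() +
          ggplot2::facet_wrap(~ faceter))
}
```

Every record appears twice in df2, once in the whole-sample panel and once in its subset panel, so the first panel is not distorted by the faceting.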

how to script in R over a factor's levels

I have a data frame with a quantitative variable, x, and several different factors, f1, f2, ...,fn. The number of levels is not constant across factors.
I want to create a (single) plot of densities of x by factor level fi.
I know how to hand code this for a specific factor. For example, here is the plot for a factor with two levels.
# set up the background plot
plot(density(frame$x[frame$f1=="level1"]))
# add curves
lines(density(frame$x[frame$f1=="level2"]))
I could also do this like so:
# set up the background plot
plot(NA)
# add curves
lines(density(frame$x[frame$f1=="level1"]))
lines(density(frame$x[frame$f1=="level2"]))
What I'd like to know is how can I do this if I only specify the factor as input. I don't even know how to write a for loop that would do what I need, and I have the feeling that the 'R way' would avoid for loops.
Bonus: For the plots, I would like to specify limiting values for the axes. Right now I do this in this way:
xmin=min(frame$x[frame$f1=="level1"],frame$x[frame$f1=="level2"])
How can I include this type of calculation in my script?
I'm assuming your data is in the following format (data frame called df):
f1   f2   f3   ...   fn   value
A    ...                  value 1
A    ...                  value 2
...
B    ...                  value n-1
B    ...                  value n
In that case, lattice (or ggplot2) will be very useful.
library(lattice)
densityplot(~value, groups = f1, data = df, plot.points = FALSE)
This should get you close to what you are looking for, I think.
Greg
You could also do:
# create an empty plot. You may want to add xlab, ylab etc
# EDIT: also add some appropriate axis limits with xlim and ylim
plot(0, 0, "n", xlim=c(0, 10), ylim=c(0, 2))
levels <- unique(frame$f1)
for (l in levels)
{
lines(density(frame$x[frame$f1==l]))
}
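For the bonus question, the axis limits can be derived from the density estimates themselves rather than hard-coded. A sketch that wraps the loop in a function taking the factor's column name as input (plot_density_by and its arguments are hypothetical names, and the data below are simulated):

```r
plot_density_by <- function(data, value, by) {
  # One density estimate per level of the chosen factor
  groups <- split(data[[value]], data[[by]], drop = TRUE)
  dens   <- lapply(groups, density)

  # Limits wide enough to contain every curve
  xlim <- range(unlist(lapply(dens, function(d) d$x)))
  ylim <- c(0, max(unlist(lapply(dens, function(d) d$y))))

  # Empty plot, then one coloured curve per level
  plot(0, 0, type = "n", xlim = xlim, ylim = ylim,
       xlab = value, ylab = "Density",
       main = paste("Density of", value, "by", by))
  cols <- seq_along(dens)
  for (i in seq_along(dens)) lines(dens[[i]], col = cols[i])
  legend("topright", legend = names(dens), col = cols, lty = 1)

  invisible(list(xlim = xlim, ylim = ylim))
}

# Simulated example data with a two-level factor
set.seed(1)
frame <- data.frame(x  = c(rnorm(50), rnorm(50, mean = 3)),
                    f1 = rep(c("level1", "level2"), each = 50))
lims <- plot_density_by(frame, "x", "f1")
```

The function works for a factor with any number of levels, since split() creates one group per level.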
ggplot2 code
library(ggplot2)
ggplot(data, aes(value, colour = f1)) +
stat_density(position = "identity")
