I have a dataframe with 4 columns. I am plotting total car crashes vs. total losses, colored by the column "distance". I am able to generate a colored plot, but when I inspect it the points are not colored the way they are supposed to be. Since there are many duplicate distance values, I created a palette from the unique values of the distance column and passed it as the color to both legend() and plot(). However, the plot colors are not right. For example, distance 600 is yellow in the legend but the corresponding dot is red. I think the problem is that I need as many colors as there are rows in the dataframe, which is 58, but I have duplicate values and they are not originally sorted. Basically, I need colors running from the lowest distance to the highest in increasing shades of one color, matched correctly to the crashes and losses data.
Below is a minimal reproducible dataframe and my code, which runs but is flawed.
# creating dataframe
year <- data.frame(year = seq(1946,2003,1))
crashes <- data.frame(crashes = c(386,317,294,287,266,245,268,296,226,265,243,239,183,212,195,224,170,169,140, 147,111,119,100,115,128,111,80,77,68,69,84,72,90,82,59,67,45,59,50,64,55,63,56,56,57,68,34,32,26,21,20,30,35,28 ,22,27,34,NA))
losses <- data.frame(losses = c(432,423,341,291,282,288,387,323,229,305,244,333,200,215,211,245,197,177,153,152, 115,189,124,129,133,120,91,90,69,78,88,77,95,98,62,70,45,62,70,68,65,73,90,65,61,74,39,33,31,22,21,39,35,58,25,36 ,40,NA))
distance <- data.frame(distance = c(600,571,589,613,618,605,605,610,608,584,605,615,605,597,603,600,578,560,541,500,478,459,449,447,452,444,431,433,452,436,426,425,430,426,430,417,372,401,389,418,414,397,443,436,431,439,430,425,415,423,437,463,487,505,503,508,516,529))
df <- cbind(year,crashes,losses,distance)
palette <- heat.colors(length(unique(df[order(df$distance),]$distance)))
plot(df$crashes,df$losses, main = "Crashes,Losses and Distance",xlab = "Crashes", ylab = "Losses", col = palette)
#legend
legend(x = 401,y = 450, legend = unique(df[order(df$distance),]$distance) , cex=.3, fill = palette, xpd=TRUE)
Hi, from what I understand you are trying to give each distance a unique colour. This is easy, but I would highly recommend first adding a grouping factor for the distances instead of using the 47 unique distance values directly (that would result in 47 different colours; you can try it by using 'distance' instead of 'distance_new' in the code below).
Here is my code for this:
# same df as you posted:
df <- cbind(year,crashes,losses,distance)
## imo better: add a grouping factor distance_new
## (cut() returns a factor). Do this while 'distance' is still numeric:
## as.numeric() on a factor returns the level codes, not the distances
df$distance_new <- cut(df$distance, 4, labels = c("very low","low","medium","high"))
# this is only necessary if you want to use 'distance' itself for coloring
df$distance <- as.factor(df$distance)
# Now the plot
plot(df$crashes,df$losses, main = "Crashes,Losses and Distance",
xlab = "Crashes", ylab = "Losses",
col = df$distance_new, pch =19)
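# add a legend (levels() keeps the groups in order from "very low" to "high";
# the fill indices match the factor codes used by col = df$distance_new)
legend("bottomright", legend = levels(df$distance_new),
fill = seq_along(levels(df$distance_new)), cex = 0.5, title = "Distance")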
View(df)
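If you would rather keep the continuous shading from the question, here is a minimal sketch of one way to do it. The original plot goes wrong because plot() recycles col by row position, so a palette with one entry per unique distance does not line up with the rows. Looking up each row's colour by its distance value avoids that. The small data frame and the names u, pal and point_col are made up for illustration; swap in your own df.

```r
# Toy data standing in for the question's data frame (values are made up)
df <- data.frame(
  crashes  = c(386, 317, 294, 287, 266, 245),
  losses   = c(432, 423, 341, 291, 282, 288),
  distance = c(600, 571, 589, 613, 600, 571)
)

u   <- sort(unique(df$distance))        # distinct distances, low to high
pal <- rev(heat.colors(length(u)))      # one colour per distinct distance
point_col <- pal[match(df$distance, u)] # one colour per row, looked up by value

plot(df$crashes, df$losses, col = point_col, pch = 19,
     main = "Crashes, Losses and Distance", xlab = "Crashes", ylab = "Losses")
legend("topleft", legend = u, fill = pal, cex = 0.7, title = "Distance")
```

Rows with the same distance now share a colour, and the legend uses the same u/pal pairing as the points.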
I'm struggling with how to do something with R that comes very easily to me in Excel: so I'm sure this is something quite basic but I'm just not aware of the equivalent method in R.
In essence, I have two variables in my dataset: a categorical variable which holds a list of names, and a numeric variable which holds the frequency corresponding to each observation.
Something like this:
Name Freq
==== =========
X 100
Y 200
and so on.
I would like to plot a bar chart with the names listed on the X-Axis (X, Y and so on) and bars of height corresponding to the relevant value of the Freq. variable for that observation.
This is something very trivial with Excel; I can just select the relevant cells and create a bar chart.
However, in R I just can't seem to figure out how to do this! The bar charts in R seem to be univariate only and don't behave the way I want them to. Trying to plot the two variables results in a scatter plot, which is not what I'm going for.
Is there something very basic I'm missing here, or is R just not capable of performing this task?
Any pointers would be very helpful.
Edited to Add:
I was primarily trying to use base R's plot function to get the job done.
Using plot(dataset1$Name, dataset1$Freq) does not produce a bar graph but a scatter plot instead.
First the data.
dat <- data.frame(Name = c("X", "Y"), Freq = c(100, 200))
With base R.
barplot(dat$Freq, names.arg = dat$Name)
If you want to display a long list of names.arg, maybe the best way is to customize your horizontal axis with function staxlab from package plotrix. Here are two example plots.
One, with the axis labels rotated 45 degrees.
set.seed(3)
Name <- paste0("Name_", LETTERS[1:10])
dat2 <- data.frame(Name = Name, Freq = sample(100:200, 10))
bp <- barplot(dat2$Freq)
plotrix::staxlab(1, at = bp, labels = dat2$Name, srt = 45)
Another, with the labels spread over 3 lines.
bp <- barplot(dat2$Freq)
plotrix::staxlab(1, at = bp, labels = dat2$Name, nlines = 3)
Add colors with argument col. See help("par").
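As a small sketch of the col argument (the colour names here are arbitrary choices, not anything the question asked for), you can pass one colour per bar:

```r
dat <- data.frame(Name = c("X", "Y"), Freq = c(100, 200))

# One fill colour per bar; 'border' sets the outline colour
bar_cols <- c("steelblue", "tomato")
bp <- barplot(dat$Freq, names.arg = dat$Name, col = bar_cols,
              border = "grey30", ylab = "Freq")
```

barplot() returns the bar midpoints (stored in bp above), which is what staxlab and text() need if you want to add labels afterwards.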
With ggplot2.
library(ggplot2)
ggplot(dat, aes(Name, Freq)) +
geom_bar(stat = "identity")
To add colors you have the aesthetics colour (for the contour of the bars) and fill (for the interior of the bars).
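For example, a sketch that maps fill to Name (so each bar gets its own colour and a legend) and sets a fixed black outline. Mapping fill to Name is just one illustrative choice; a constant such as fill = "steelblue" outside aes() works too.

```r
library(ggplot2)

dat <- data.frame(Name = c("X", "Y"), Freq = c(100, 200))

# fill inside aes() varies by Name; colour outside aes() is a fixed outline
p <- ggplot(dat, aes(Name, Freq, fill = Name)) +
  geom_bar(stat = "identity", colour = "black")
print(p)
```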
Warning: still new to R.
I'm trying to construct some charts (specifically, a bubble chart) in R that shows political donations to a campaign. The idea is that the x-axis will show the amount of contributions, the y-axis the number of contributions, and the area of the circles the total amount contributed at this level.
The data looks like this:
CTRIB_NAML CTRIB_NAMF CTRIB_AMT FILER_ID
John Smith $49 123456789
The FILER_ID field is used to filter the data for a particular candidate.
I've used the following functions to convert this data frame into a bubble chart (thanks to help here and here).
vals<-sort(unique(dfr$CTRIB_AMT))
sums<-tapply( dfr$CTRIB_AMT, dfr$CTRIB_AMT, sum)
counts<-tapply( dfr$CTRIB_AMT, dfr$CTRIB_AMT, length)
symbols(vals,counts, circles=sums, fg="white", bg="red", xlab="Amount of Contribution", ylab="Number of Contributions")
text(vals, counts, sums, cex=0.75)
However, this results in way too many intervals on the x-axis. There are several million records all told, and divided up for some candidates could still result in an overwhelming amount of data. How can I convert the absolute contributions into ranges? For instance, how can I group the vals into ranges, e.g., 0-10, 11-20, 21-30, etc.?
----EDIT----
Following comments, I can convert vals to numeric and then slice into intervals, but I'm not sure then how I combine that back into the bubble chart syntax.
new_vals <- as.numeric(as.character(sub("\\$","",vals)))
new_vals <- cut(new_vals,100)
But regraphing:
symbols(new_vals,counts, circles=sums)
Is nonsensical -- all the values line up at zero on the x-axis.
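Once the amounts are binned into a factor with cut, you can use tapply again to find the counts and the sums for the new breaks. Note that the grouping factor needs one entry per record, so cut the full numeric column rather than the unique values. For example:
amt = as.numeric(sub("\\$", "", dfr$CTRIB_AMT))
amt_cut = cut(amt, 100)
counts = tapply(amt, amt_cut, length)
sums = tapply(amt, amt_cut, sum)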
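To get back to the base R bubble chart from the question, one option is to draw one bubble per bin at positions 1, 2, 3, ... and label the axis with the bin names. This is only a sketch on made-up amounts: the fixed breaks of width 10 and the variable names are my assumptions. Because symbols() treats circles as radii, it uses sqrt(sums) so that the area, not the radius, tracks the total.

```r
set.seed(1)
# Made-up contribution amounts standing in for dfr$CTRIB_AMT
amt_chr <- paste0("$", round(runif(500, 0, 100), 2))
amt     <- as.numeric(sub("\\$", "", amt_chr))

# Bin every record, then summarise per bin
bins   <- cut(amt, breaks = seq(0, 100, by = 10), include.lowest = TRUE)
counts <- tapply(amt, bins, length)
sums   <- tapply(amt, bins, sum)

keep <- !is.na(counts)          # tapply gives NA for empty bins; skip them
x    <- seq_along(counts)[keep] # one x position per bin

# Empty frame first, so the x axis can be labelled with the bin names
plot(0, 0, type = "n", xaxt = "n",
     xlim = c(0.5, length(counts) + 0.5), ylim = c(0, max(counts[keep]) * 1.2),
     xlab = "Amount of Contribution", ylab = "Number of Contributions")
axis(1, at = x, labels = names(counts)[keep], las = 2, cex.axis = 0.7)
symbols(x, counts[keep], circles = sqrt(sums[keep]), inches = 0.35,
        fg = "white", bg = "red", add = TRUE)
text(x, counts[keep], round(sums[keep]), cex = 0.75)
```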
For this type of thing, though, you might find the plyr and ggplot2 packages helpful. Here is a complete reproducible example:
library(plyr)    # provides ddply
library(ggplot2)
# Options
n = 1000
breaks = 10
# Generate data
set.seed(12345)
CTRIB_NAML = replicate(n, paste(letters[sample(10)], collapse=''))
CTRIB_NAMF = replicate(n, paste(letters[sample(10)], collapse=''))
CTRIB_AMT = paste('$', round(runif(n, 0, 100), 2), sep='')
FILER_ID = replicate(10, paste(as.character((0:9)[sample(9)]), collapse=''))[sample(10, n, replace=T)]
dfr = data.frame(CTRIB_NAML, CTRIB_NAMF, CTRIB_AMT, FILER_ID)
# Format data
dfr$CTRIB_AMT = as.numeric(sub('\\$', '', dfr$CTRIB_AMT))
dfr$CTRIB_AMT_cut = cut(dfr$CTRIB_AMT, breaks)
# Summarize data for plotting
plot_data = ddply(dfr, 'CTRIB_AMT_cut', function(x) data.frame(count=nrow(x), total=sum(x$CTRIB_AMT)))
# Make plot
dev.new(width=4, height=4)
# opts() and theme_text() were removed from ggplot2; theme() and element_text() replace them
qplot(CTRIB_AMT_cut, count, data=plot_data, geom='point', size=total) + theme(axis.text.x=element_text(angle=90, hjust=1))
I have a population and a sample of that population. I've made a few plots comparing them using ggplot2 and its faceting option, but it occurred to me that having the sample in its own facet will distort the population plots (however slightly). Is there a way to facet the plots so that all records are in the population plot, and just the sampled records in the second plot?
Matt,
If I understood your question properly - you want to have a faceted plot where one panel contains all of your data, and the subsequent facets contain only a subset of that first plot?
There's probably a cleaner way to do this, but you can create a new data.frame object with the appropriate faceting variable that corresponds to each subset. Consider:
library(ggplot2)
df <- data.frame(x = rnorm(100), y = rnorm(100), sub = sample(letters[1:5], 100, TRUE))
df2 <- rbind(
cbind(df, faceter = "Whole Sample")
, cbind(df[df$sub == "a" ,], faceter = "Subset A")
#other subsets go here...
)
qplot(x,y, data = df2) + facet_wrap(~ faceter)
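For the population/sample case specifically, the same trick could look like the sketch below. It assumes you can flag the sampled rows with a logical column; the names pop, sampled, both and panel are made up for illustration.

```r
library(ggplot2)

set.seed(42)
# Made-up population; 'sampled' marks the rows that were drawn into the sample
pop <- data.frame(x = rnorm(200), y = rnorm(200))
pop$sampled <- seq_len(nrow(pop)) %in% sample(nrow(pop), 40)

# Every row goes into the "Population" panel; sampled rows are repeated in "Sample"
both <- rbind(
  cbind(pop, panel = "Population"),
  cbind(pop[pop$sampled, ], panel = "Sample")
)

p <- ggplot(both, aes(x, y)) + geom_point() + facet_wrap(~ panel)
print(p)
```

The sampled records then appear in both panels, so the population panel is not missing them.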
Let me know if I've misunderstood your question.
-Chase
I have a data frame with a quantitative variable, x, and several different factors, f1, f2, ...,fn. The number of levels is not constant across factors.
I want to create a (single) plot of densities of x by factor level fi.
I know how to hand code this for a specific factor. For example, here is the plot for a factor with two levels.
# set up the background plot
plot(density(frame$x[frame$f1=="level1"]))
# add curves
lines(density(frame$x[frame$f1=="level2"]))
I could also do this like so:
# set up the background plot
plot(NA)
# add curves
lines(density(frame$x[frame$f1=="level1"]))
lines(density(frame$x[frame$f1=="level2"]))
What I'd like to know is how can I do this if I only specify the factor as input. I don't even know how to write a for loop that would do what I need, and I have the feeling that the 'R way' would avoid for loops.
Bonus: For the plots, I would like to specify limiting values for the axes. Right now I do this in this way:
xmin=min(frame$x[frame$f1=="level1"],frame$x[frame$f1=="level2"])
How can I include this type of calculation in my script?
I'm assuming your data is in the format (data frame called df)
f1 f2 f3 fn value
A........................... value 1
A............................value 2
.............................
B............................value n-1
B............................value n
In that case, lattice (or ggplot2) will be very useful.
library(lattice)
densityplot(~value, groups = f1, data = df, plot.points = FALSE)
This should get you close to what you are looking for, I think.
Greg
You could also do:
# create an empty plot. You may want to add xlab, ylab etc
# EDIT: also add some appropriate axis limits with xlim and ylim
plot(0, 0, type = "n", xlim = c(0, 10), ylim = c(0, 2))
# 'lvls' rather than 'levels', to avoid masking the base function levels()
lvls <- unique(frame$f1)
for (l in lvls) {
  lines(density(frame$x[frame$f1 == l]))
}
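For the bonus question, one way to avoid hard-coding xlim and ylim is to compute all the densities first and take the limits from them. A rough sketch follows; plot_densities is a made-up helper name and the data frame is simulated to stand in for frame.

```r
set.seed(7)
# Simulated data standing in for 'frame' in the question
frame <- data.frame(
  x  = c(rnorm(50, 0), rnorm(50, 2), rnorm(50, 4)),
  f1 = rep(c("level1", "level2", "level3"), each = 50)
)

# Hypothetical helper: one density curve per level of f, axes sized to fit all
plot_densities <- function(x, f) {
  dens <- lapply(split(x, f), density)   # one density object per level
  xlim <- range(sapply(dens, function(d) range(d$x)))
  ylim <- c(0, max(sapply(dens, function(d) max(d$y))))
  plot(0, 0, type = "n", xlim = xlim, ylim = ylim,
       xlab = "x", ylab = "Density")
  for (i in seq_along(dens)) lines(dens[[i]], col = i)
  legend("topright", legend = names(dens), col = seq_along(dens), lty = 1)
  invisible(dens)
}

dens <- plot_densities(frame$x, frame$f1)
```

Calling plot_densities(frame$x, frame$f2) would then work for any other factor without changing the code.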
ggplot2 code
library(ggplot2)
ggplot(data, aes(value, colour = f1)) +
stat_density(position = "identity")