I'm working in Rstudio.
With ggplot2, I'm trying to form a plot where I have frequencies of a categorical variable (number of shares purchased), per category (there are 5 categories). For example, members of category A might buy 1 share more frequently than members of category D.
I now have a count plot. However, because one category is much bigger than the others, you don't get a good idea about the n shares in the other categories.
The code of the count plot is as follows:
#ABS. DISTRIBUTION SHARES/CATEGORY
ggplot(dat, aes(x=Number_share, fill=category)) +
geom_histogram(binwidth=.5, alpha=.5, position="dodge")
This results in this graph: https://imgur.com/a/e4k94
Therefore, I am planning to make a plot where, instead of an absolute count, you have a distribution relative to their category.
I calculated the relative frequencies of each category:
library(MASS)
categories = dat$category
categories.freq = table(categories)
categories.relfreq = categories.freq / nrow(dat)
cbind(categories.relfreq)
categories.relfreq
Beauvent 1 0.002708692
Beauvent 2 0.015020931
E&B 0.037182960
Ecopower 1 0.042107855
Ecopower 2 0.029549372
Ecopower 3 0.873183945
I don't know how to make a plot where the frequency of a share number acquisition is relative to the category, instead of absolute. Can anybody help me with this?
I think what you are looking for is this
ggplot(dat, aes(x=Number_share, fill=category)) +
geom_bar(position="fill")
This will stack the categories on top of each other and the position="fill" argument will give the relative counts
I found that this problem is very similar: Histogram with weights in R
basically it's because the default of a histogram is to use counts on the y-axis, while I want to use a hist(freq=TRUE), or in the case of ggplot: ggplot_histogram(y= ..density..).
Related
I am trying to make a plot of proportions of a binomial distribution (yes/no) depending on one ordinal and one continuous variable. Somehow when including the continuous one as color of the dots the appearance of the plot radically changes. Can someone help me with how to include the third variable without having the plot turn into below table-looking result?
Code as follows:
#making table with proportions of people who switch (1),
## after arsenic level and education.
educ_switch <- prop.table(table(welldata$educ[welldata$switch==1],
welldata$arsenic[welldata$switch==1],
welldata$switch[welldata$switch==1]))
educ_switch <- as_data_frame(educ_switch, make.names=TRUE)
#remove observations where the proportion is 0
educ_switch1 <- educ_switch[which (educ_switch$proportion>0),]
p <- ggplot(educ_switch1, aes(x = educ, y=proportion))
If I do p + geom_point()
I get the following picture:
But when I try to distinguish the third variable by coloring it with p + geom_point(aes(colour = arsenic))
I get this weird looking thing instead:
Context: when you have "many" categories it can become hard to distinguish them in a bar plot. I found the plot below dealing with this situation quite nicely by linking the legend with categories in the plot.
Question: is it possible to do something similar with ggplot2?
With ggplot2 it is straighforward to get this:
But I really do not know were to start to acheive the result shown in the 1st plot.
Here is some code to sort it out:
library(ggplot2)
ggplot(data = mtcars, aes(x = vs, y = disp, fill = factor(carb))) +
geom_bar(stat = "identity")
Expected output (not as nice as the one presented above but it shows the idea)
There is no proper legend on the axes in any of the plots, but my guess is that the desired chart is based on relative frequencies, while your plot seems to show absolute frequencies, though I'm not sure about that.
Assuming that you want to produce a stacked bar chart giving the (relative) number of observations of a categorial variable in two groups, there are two ways to get the two stacked bars to be of the same height:
There need to be the exact same amount of observations in both of
them. Then you can use absolute frequencies.
The absolute frequencies need to be transformed to relative frequencies (or percent) by dividing them by the total number of observations in each group.
You can calculate the relative frequencies yourself and use them as the y-values.
Or refer to this post, as it seems to describe exactly what you want using ggplot2.
I am having trouble deciding how to graph the data I have.
It consists of overlapping quantities that represent a population, hence my decision to use a stacked bar.
These represent six population divisions ("groups") wherein group 1 and group 2 are the main division. Groups 4 to 6 are subgroups of two, and these are subgroups of each other. Its simple diagram is below:
Note: groups 1 and 2 complete the entire population or group 1 + group 2 = 100%.
I want all of these information in one chart which I do not know what and how to implement.
So far I have the one below, which is wrong because Group 1 is included in the main bar.
require(ggplot2)
require(reshape)
tab <- data.frame(
set=c("XXX","XXX","XXX","XXX","XXX","XXX"),
group=c("1","6","5","4","3","2"),
rate=as.numeric(c(10000,20000,50000,55000,75000,100000))
)
dat <- melt(tab)
dat$time <- factor(dat$group,levels=dat$group)
ggplot(dat,aes(x=set)) +
geom_bar(aes(weight=value,fill=group),position="fill",color="#7F7F7F") +
scale_fill_brewer("Groups", palette="OrRd")
What do you guys suggest to visualize it? I want to use R and ggplot for consistency and uniformity with the other graphs I have made already.
Using facets you can divide your plot into two:
# changed value of set for group 1
tab <- data.frame(
set=c("UUU","XXX","XXX","XXX","XXX","XXX"),
group=c("1","6","5","4","3","2"),
rate=as.numeric(c(10000,20000,50000,55000,75000,100000))
)
# explicitly defined id.vars
dat <- melt(tab, id.vars=c('set','group'))
dat$time <- factor(dat$group,levels=dat$group)
# added facet_wrap, in geom_bar aes changed weight to y,
# added stat="identity", changed position="stack"
ggplot(dat,aes(x=set)) +
geom_bar(aes(y=value,fill=group),position="stack", stat="identity", color="#7F7F7F") +
scale_fill_brewer("Groups", palette="OrRd") +
facet_wrap(~set, scale="free_x")
My guess is what you need is a treemap. Please correct me if I misunderstood your question.
here a link on Treemapping]1
If tree map is what you need you can use either portfolio package or googleVis.
I am working with a Danish dataset on immigrants by country of origin and age group. I transformed the data so I can see the top countries of origin for each age group.
I am plotting it using facet_wrap. What I would like to do is, since different age groups come from quite different areas, to show a different set of values for one axis in each facet. For example, those that are between 0 and 10 years old come from countries x,y and z, while those 10-20 years of age come from countries q, r, z and so on.
In my current version, it shows the entire set of values, including countries that are not in the top 10. I would like to show just the top ten countries of origin for each facet, in effect having different axis labels for each. (And, if it is possible, sorting by high to low for each facet).
Here is what I have so far:
library(ggplot2)
library(reshape)
###load and inspect data
load(url('http://dl.dropbox.com/u/7446674/dk_census.rda'))
head(dk_census)
###reshape for plotting--keep just a few age groups
dk_census.m <- melt(dk_census[dk_census$Age %in% c('0-9 år', '10-19 år','20-29 år','30-39 år'),c(1,2,4)])
###get top 10 observations for each age group, store in data frame
top10 <- by(dk_census.m[order(dk_census.m$Age,-dk_census.m$value),], dk_census.m$Age, head, n=10)
top10.df<-do.call("rbind", as.list(top10))
top10.df
###plot
ggplot(data=top10.df, aes(x=as.factor(Country), y=value)) +
geom_bar(stat="identity")+
coord_flip() +
facet_wrap(~Age)+
labs(title="Immigrants By Country by Age",x="Country of Origin",y="Population")
One option (that I actually strongly suspect you won't be happy with) is this:
p <- ggplot(data=top10.df, aes(x=Country, y=value)) +
geom_bar(stat="identity")+
coord_flip() +
facet_wrap(~Age)+
labs(title="Immigrants By Country by Age",x="Country of Origin",y="Population")
pp <- dlply(.data=top10.df,.(Age),function(x) {x$Country <- reorder(x$Country,x$value); p %+% x})
library(gridExtra)
do.call(grid.arrange,pp)
(Edited to sort each graph.)
Keep in mind that the only reason faceting exists is to plot multiple panels that share a common scale. So when you start asking to facet on some variable, but have the scales be different (oh, and also sort them separately on each panel as well) what you're doing is really no longer faceting. It's just making four different plots and arranging them together.
using lattice (Here I use ``latticeExtrafor ggplot2 theme), you can set torelation=freebetween panels. Here I am using abbreviate = TRUE` to short long labels.
library(latticeExtra)
barchart(value~ Country|Age,data=top10.df,layout=c(2,2),
horizontal=T,
par.strip.text =list(cex=2),
scales=list(y=list(relation='free',cex=1.5,abbreviate=T,
labels=levels(factor(top10.df$Country)))),
# ,cex=1.5,abbreviate=F),
par.settings = ggplot2like(),axis=axis.grid,
main="Immigrants By Country by Age",
ylab="Country of Origin",
xlab="Population")
My dataset:
I have data in the following format (here, imported from a CSV file). You can find an example dataset as CSV here.
PAIR PREFERENCE
1 5
1 3
1 2
2 4
2 1
2 3
… and so on. In total, there are 19 pairs, and the PREFERENCE ranges from 1 to 5, as discrete values.
What I'm trying to achieve:
What I need is a stacked histogram, e.g. a 100% high column, for each pair, indicating the distribution of the PREFERENCE values.
Something similar to the "100% stacked columns" in Excel, or (although not quite the same, a so-called "mosaic plot"):
What I tried:
I figured it'd be easiest using ggplot2, but I don't even know where to start. I know I can create a simple bar chart with something like:
ggplot(d, aes(x=factor(PAIR), y=factor(PREFERENCE))) + geom_bar(position="fill")
… that however doesn't get me very far. So I tried this, and it gets me somewhat closer to what I'm trying to achieve, but it still uses the count of PREFERENCE, I suppose? Note the ylab being "count" here, and the values ranging to 19.
qplot(factor(PAIR), data=d, geom="bar", fill=factor(PREFERENCE_FIXED))
Results in:
So, what do I have to do to get the stacked bars to represent a histogram?
Or do they actually do this already?
If so, what do I have to change to get the labels right (e.g. have percentages instead of the "count")?
By the way, this is not really related to this question, and only marginally related to this (i.e. probably same idea, but not continuous values, instead grouped into bars).
Maybe you want something like this:
ggplot() +
geom_bar(data = dat,
aes(x = factor(PAIR),fill = factor(PREFERENCE)),
position = "fill")
where I've read your data into dat. This outputs something like this:
The y label is still "count", but you can change that manually by adding:
+ scale_x_discrete("Pairs") + scale_y_continuous("Votes")