I hope I can explain this well. Suppose you have a fictitious data set that has 3 columns,
Car
Color
Yes/No
Each row is an observation that indicates whether the user likes their car's model and color. I'd like to create a chart that shows each car model on the x axis, with one line per color, where the y value is the percentage of "Yes" responses out of the total for that car/color combination.
What is the best approach to do this in R? I'm thinking this could be useful in general where the response is Yes/No and you want to show an interaction between two categorical features.
Thanks!
Ok, this is what I ended up doing. It seems to do what I need it to do ;0
PS - I'm not sure if it's appropriate to answer my own question. Thanks for the comments!
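library(ggplot2)
# Proportions within each Factor1/Factor2 combination (margins 2 and 3),
# so that Yes + No sum to 1 for every combination
prop <- data.frame(prop.table(table(data$outcome, data$Factor1, data$Factor2), c(2, 3)))
names(prop) <- c("Outcome","Factor1","Factor2","Percentage")
# Remove No Percent
prop <- prop[prop$Outcome == "Yes", ]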
# Bar plot
ggplot(data=prop, aes(x=Factor1, y=Percentage, fill=Factor2)) +
geom_bar(stat="identity", position=position_dodge())+
scale_fill_brewer(palette="Paired")+
theme_minimal()
# Line Plot
ggplot(data=prop, aes(x=Factor1, y=Percentage, group=Factor2)) +
geom_line(aes(color=Factor2))+
geom_point(aes(color=Factor2))
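For anyone who wants to check the numbers by hand, here is a minimal sketch on made-up data. The data frame `toy` and its columns `Car`, `Color` and `Liked` are my own assumptions for illustration, not the original data. It computes the share of "Yes" answers within each car/color combination with base `aggregate()` and then draws one line per color.

```r
library(ggplot2)

# Hypothetical toy data: one row per respondent
toy <- data.frame(
  Car   = c("Sedan", "Sedan", "Sedan", "Sedan", "SUV", "SUV", "SUV", "SUV"),
  Color = c("Red", "Red", "Blue", "Blue", "Red", "Red", "Blue", "Blue"),
  Liked = c("Yes", "No", "Yes", "Yes", "No", "No", "Yes", "No")
)

# Code "Yes" as 1 and "No" as 0, so the group mean is the share of "Yes"
toy$LikedYes <- as.numeric(toy$Liked == "Yes")
liked <- aggregate(LikedYes ~ Car + Color, data = toy, FUN = mean)
names(liked)[3] <- "PctYes"

# One line per color, car model on the x axis
p <- ggplot(liked, aes(x = Car, y = PctYes, group = Color, color = Color)) +
  geom_line() +
  geom_point() +
  scale_color_manual(values = c(Red = "red", Blue = "blue")) +
  scale_y_continuous(limits = c(0, 1)) +
  labs(y = "Share answering Yes")
p
```

Because every car/color cell is averaged on its own, the value for a cell does not depend on how many respondents fall into the other cells.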
I'm hoping to get some help on making the following histogram look as nice and understandable as possible. I am plotting the salaries of immigrant versus US-born workers. I am wondering:
1. How would you modify colors, axis intervals, etc. to make the graph more clear/appealing?
2. How could I add a key to indicate purple is for US born workers, and pink is for foreign born?
3. How can I add two different lines to indicate the median of each group? And a corresponding label for each?
My current code is set up as this:
ggplot(NHIS1,aes(x=adj_SALARY, y=..density..)) +
geom_histogram(data=subset(NHIS1,IMMIGRANT=='0'), alpha=.5,binwidth=800, fill="purple",position="identity") + xlim(4430.4,50000) +
geom_vline(xintercept=median(NHIS1$adj_SALARY), col="black", linetype="dashed") +
geom_histogram(data=subset(NHIS1,IMMIGRANT=='1'), alpha=.5,binwidth=800,fill="red") + xlim(4430.4,50000)
geom_vline(xintercept=median(NHIS1$adj_SALARY), col="black", linetype="dashed")
And my final histogram at the moment appears as this:
If you have two variables, one for income and one for immigrant status, you do not need to plot two histograms; one will suffice if you specify the grouping. I'd also suggest adding density lines, which help smooth over the histogram's bumps:
Assuming this is roughly like your data:
df <- data.frame(income = sample(1000:5000, 1000),
born = sample(c("US", "Foreign"), 1000, replace = TRUE))
Then a crude way to plot one histogram as well as density lines for the two groups would be this:
# after_stat(density) replaces the deprecated ..density.. notation (ggplot2 >= 3.3.0)
ggplot(df, aes(x=income, color=born, fill=born)) +
geom_histogram(aes(y=after_stat(density)), alpha=0.5, binwidth=100,
position="identity") +
geom_density(alpha=.2)
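The question also asked for a key and for a labelled median line per group. Here is a rough sketch of one way to do that, assuming the same kind of fake data as above. The helper data frame `med` and the colour choices are my own assumptions. Mapping `fill` and `color` to `born` is what produces the legend, and a small data frame with one median per group feeds `geom_vline()` and `geom_text()`.

```r
library(ggplot2)

set.seed(1)  # reproducible fake data
df <- data.frame(income = sample(1000:5000, 1000),
                 born = sample(c("US", "Foreign"), 1000, replace = TRUE))

# One row per group holding that group's median income
med <- aggregate(income ~ born, data = df, FUN = median)

p <- ggplot(df, aes(x = income, fill = born)) +
  geom_histogram(aes(y = after_stat(density)), alpha = 0.5, binwidth = 100,
                 position = "identity") +
  # Dashed vertical line at each group's median
  geom_vline(data = med, aes(xintercept = income, color = born),
             linetype = "dashed") +
  # Label each line near the top of the panel, staggered to reduce overlap
  geom_text(data = med,
            aes(x = income, y = Inf, label = paste("median:", income),
                color = born),
            vjust = c(1.5, 3), hjust = -0.05, size = 3, show.legend = FALSE) +
  scale_fill_manual(values = c(US = "purple", Foreign = "pink"), name = "Born") +
  scale_color_manual(values = c(US = "purple4", Foreign = "deeppink3"),
                     name = "Born")
p
```

With real salary data the two medians will usually sit further apart than in this uniform fake data, so the labels may not need staggering.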
This question has been asked before: overlaying-histograms-with-ggplot2-in-r discusses several options with many examples. You should definitely take a look at it.
Another option to compare the distributions could be violin plots using geom_violin(). I see violin plots as the better option when you need to compare distributions because they give you more flexibility and are still clearer. But that may be just me. Refer to the examples in the manual.
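A minimal violin sketch, again on fake data of the same shape (the data frame and the name `p_violin` are assumptions for illustration). The point added by `stat_summary()` marks each group's median.

```r
library(ggplot2)

set.seed(1)  # reproducible fake data
df <- data.frame(income = sample(1000:5000, 1000),
                 born = sample(c("US", "Foreign"), 1000, replace = TRUE))

# One violin per group, with the group median drawn as a point
p_violin <- ggplot(df, aes(x = born, y = income, fill = born)) +
  geom_violin(alpha = 0.5) +
  stat_summary(fun = median, geom = "point", size = 2) +
  labs(x = "Born", y = "Income")
p_violin
```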
I would like to visualize a data frame much like the following in a plot:
grade number
A 2
B 6
C 1
D 0
E 1
The idea is to have the grades on the x-axis as categories and the number of pupils who received the respective grade on the y-axis.
My task is to display them not as points like in a line chart, but as thickness above the category like in a violin plot. This is really about the pure visuals of it.
I tried ggplot2's violin, but it always takes the values of the number column for the y-axis. But the y-axis is supposed to have just one single dimension: the level around which the density plot is rotated.
I'd be very happy if someone had a hint at how I should maybe restructure my data, or whether I am completely mistaken with my approach.
Ah, yes: on top I'd like to display the grade-point-average as a small bar.
Thank you very much in advance for taking your time. I'm sure the solution is very obvious, but I just don't see it.
As @Gregor mentioned, a smoothed density estimate (which is what a violin plot is) with just five ordinal values isn't really appropriate here. Even if you had plus/minus grades, you'd still probably be better off with bars or lines. See below for a few options:
library(ggplot2)
# Fake data
dat = data.frame(grades=LETTERS[c(1:4,6)],
count=c(5,12,11,5,3), stringsAsFactors=FALSE)
# Reusable plot elements
thm = list(theme_bw(),
scale_y_continuous(limits=c(0,max(dat$count)), breaks=seq(0,20,2)),
labs(x="Grade", y="Count"))
ggplot(dat, aes(grades, count)) +
geom_bar(stat="identity", fill=hcl(240,100,50)) +
geom_text(aes(y=0.5*count, label=paste0(count, " (", sprintf("%1.1f", count/sum(count)*100),"%)")),
colour="white", size=3) +
thm
ggplot(dat, aes(grades, count)) +
geom_line(aes(group=1),alpha=0.4) +
geom_point() +
thm
ggplot(dat, aes(x=as.numeric(factor(grades)))) +
geom_ribbon(aes(ymin=0, ymax=count), fill="grey80") +
geom_text(aes(y=count, label=paste0(sprintf("%1.1f", count/sum(count)*100),"%")), size=3) +
scale_x_continuous(breaks=1:5, labels=LETTERS[c(1:4,6)]) +
thm
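The question also asked for the grade point average to be marked. Here is one possible sketch, under the assumption (mine, not the asker's) that grades are scored by their position, so A = 1 up to F = 5. The names `pos` and `avg_pos` are made up for this example. The average is the count-weighted mean of those positions, drawn as a vertical line on a numeric x axis.

```r
library(ggplot2)

# Same fake data as in the answer above
dat <- data.frame(grades = LETTERS[c(1:4, 6)],
                  count = c(5, 12, 11, 5, 3), stringsAsFactors = FALSE)

# Assumption: score each grade by its position (A = 1, ..., F = 5)
dat$pos <- as.numeric(factor(dat$grades))

# Count-weighted mean position, i.e. the "average grade"
avg_pos <- weighted.mean(dat$pos, dat$count)

p <- ggplot(dat, aes(x = pos, y = count)) +
  geom_col(fill = "grey70") +
  geom_vline(xintercept = avg_pos, colour = "red") +
  scale_x_continuous(breaks = dat$pos, labels = dat$grades) +
  labs(x = "Grade", y = "Count") +
  theme_bw()
p
```

If your school uses a different numeric scale for grades, replace `pos` with those values before taking the weighted mean.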
The data has 4 columns and roughly 600 rows. It is Twitter data collected using the twitteR package and then summarized into a data frame. Each tweet is given scores based on how many words from these libraries it contains, and the summary is the number of tweets that get specific scores. So the columns are the two types of scores, the dates, and then the number of tweets with those scores.
Score1 Score2 Date Number
0 0 01/10/2015 50
0 1 01/10/2015 34
1 0 01/10/2015 10
...and so on
The dates extend over a month, and the scores can go to +/- 10 or so either way.
I'm trying to plot this data as a bubble plot, with score1 on the x axis, score2 on the y axis, and the size of each bubble dependent on the number (how many tweets with those scores there were per day).
My problem is that I only know how to use ggplot.
g <- ggplot(
twitterdata,
aes(x=score1, y=score2, size=number, label=""), guide=FALSE) +
geom_point(colour="black", fill="red", shape=21) +
scale_size_area(max_size = 30) +
scale_x_continuous(name="score1", limits=c(0, 10)) +
scale_y_continuous(name="score2", limits=c(-10, 10)) +
geom_text(size=4) +
theme_bw()
and that just gives me the plot for all dates, and what I need is a good way to see how that data changes over time. I've looked into using sliders and selectors but I really have no idea what would be the best tool to use. I've tried subsetting the data based on date, which works nicely but ideally I could make some kind of interactive graph.
I really need some way to select certain days out of that data to plot so it doesn't all pile up on itself, but do it interactively so it can be presented.
Any help would be greatly appreciated, thank you.
It sounds like this won't completely satisfy your use case, but an extremely low-overhead way to add some interactivity to your plot would be to install.packages('plotly') and add the following to the end of your code:
# your original code
g <- ggplot(
twitterdata,
aes(x=score1, y=score2, size=number, label=""),
guide=FALSE)+
geom_point(colour="black", fill="red", shape=21) +
scale_size_area(max_size = 30) +
scale_x_continuous(name="score1", limits=c(0,10)) +
scale_y_continuous(name="score2", limits=c(-10,10)) +
geom_text(size=4) +
theme_bw()
# add these lines
library(plotly)
gg <- ggplotly(g)
gg # print it to display the interactive plot
Details and demos: https://plot.ly/ggplot2/
As Eric suggested, if you want sliders and such you should check out shiny. Here's a demo combining shiny with plotly: https://plot.ly/r/shiny-tutorial/
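To give an idea of what the shiny route might look like, here is a minimal sketch with a date selector driving the bubble plot. Everything in it is an assumption for illustration: the stand-in `twitterdata` values, the lower-case column names, the helper `tweets_on()`, and the axis limits. It is not a finished app.

```r
library(shiny)
library(ggplot2)

# Hypothetical stand-in for the summarized tweet data
twitterdata <- data.frame(
  score1 = c(0, 0, 1, 0, 1, 2),
  score2 = c(0, 1, 0, 0, 1, -1),
  date   = as.Date(c("2015-10-01", "2015-10-01", "2015-10-01",
                     "2015-10-02", "2015-10-02", "2015-10-02")),
  number = c(50, 34, 10, 41, 22, 7)
)

# Keep only the rows for one day
tweets_on <- function(data, day) data[data$date == day, ]

ui <- fluidPage(
  selectInput("day", "Date",
              choices = as.character(sort(unique(twitterdata$date)))),
  plotOutput("bubbles")
)

server <- function(input, output) {
  output$bubbles <- renderPlot({
    ggplot(tweets_on(twitterdata, as.Date(input$day)),
           aes(x = score1, y = score2, size = number)) +
      geom_point(colour = "black", fill = "red", shape = 21) +
      # Fixed size limits so bubbles are comparable between days
      scale_size_area(max_size = 30,
                      limits = c(0, max(twitterdata$number))) +
      scale_x_continuous(limits = c(-10, 10)) +
      scale_y_continuous(limits = c(-10, 10)) +
      theme_bw()
  })
}

app <- shinyApp(ui, server)
# To launch it in an interactive session: runApp(app)
```

Swapping `selectInput()` for a `sliderInput()` over the date range would give a slider instead of a drop-down.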
I am having trouble deciding how to graph the data I have.
It consists of overlapping quantities that represent a population, hence my decision to use a stacked bar.
These represent six population divisions ("groups") wherein group 1 and group 2 are the main division. Groups 4 to 6 are subgroups of two, and these are subgroups of each other. Its simple diagram is below:
Note: groups 1 and 2 complete the entire population or group 1 + group 2 = 100%.
I want all of this information in one chart, but I do not know which kind of chart to use or how to implement it.
So far I have the one below, which is wrong because Group 1 is included in the main bar.
require(ggplot2)
require(reshape)
tab <- data.frame(
set=c("XXX","XXX","XXX","XXX","XXX","XXX"),
group=c("1","6","5","4","3","2"),
rate=as.numeric(c(10000,20000,50000,55000,75000,100000))
)
dat <- melt(tab)
dat$time <- factor(dat$group,levels=dat$group)
ggplot(dat,aes(x=set)) +
geom_bar(aes(weight=value,fill=group),position="fill",color="#7F7F7F") +
scale_fill_brewer("Groups", palette="OrRd")
What do you guys suggest to visualize it? I want to use R and ggplot for consistency and uniformity with the other graphs I have made already.
Using facets you can divide your plot into two:
# changed value of set for group 1
tab <- data.frame(
set=c("UUU","XXX","XXX","XXX","XXX","XXX"),
group=c("1","6","5","4","3","2"),
rate=as.numeric(c(10000,20000,50000,55000,75000,100000))
)
# explicitly defined id.vars
dat <- melt(tab, id.vars=c('set','group'))
dat$time <- factor(dat$group,levels=dat$group)
# added facet_wrap, in geom_bar aes changed weight to y,
# added stat="identity", changed position="stack"
ggplot(dat,aes(x=set)) +
geom_bar(aes(y=value,fill=group),position="stack", stat="identity", color="#7F7F7F") +
scale_fill_brewer("Groups", palette="OrRd") +
facet_wrap(~set, scales="free_x")
My guess is that what you need is a treemap. Please correct me if I misunderstood your question.
Here is a link on treemapping.
If tree map is what you need you can use either portfolio package or googleVis.
I am fairly new to R and ggplot2 and am having some trouble plotting multiple variables in the same histogram plot.
My data is already grouped and just needs to be plotted. The data is by week and I need to plot the number for each category (A, B, C and D).
Date A B C D
01-01-2011 11 0 11 1
08-01-2011 12 0 3 3
15-01-2011 9 0 2 6
I want the Dates as the x axis and the counts plotted as different colors according to a generic y axis.
I am able to plot just one of the categories at a time, but am not able to find an example like mine.
This is what I use to plot one category. I am pretty sure I need to use position="dodge" to plot multiple as I don't want it to be stacked.
ggplot(df, aes(x=Date, y=A)) + geom_histogram(stat="identity") +
labs(title = "Number in Category A") +
ylab("Number") +
xlab("Date") +
theme(axis.text.x = element_text(angle = 90))
Also, this gives me a histogram with spaces in between the bars. Is there any way to remove this? I tried spaces=0 as you would do when plotting bar graphs, but it didn't seem to work.
I read some previous questions similar to mine, but the data was in a different format and I couldn't adapt it to fit my data.
This is some of the help I looked at:
Creating a histogram with multiple data series using multhist in R
http://www.cookbook-r.com/Graphs/Plotting_distributions_%28ggplot2%29/
I'm also not quite sure what the bin width is. I think it is how the data should be spaced or grouped, which doesn't apply to my question since it is already grouped. Please advise me if I am wrong about this.
Any help would be appreciated.
Thanks in advance!
You're not really plotting histograms, you're just plotting a bar chart that looks kind of like a histogram. I personally think this is a good case for faceting:
library(ggplot2)
library(reshape2) # for melt()
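melt_df <- melt(df, id.vars = "Date") # one row per Date/category pair
head(melt_df) # so you can see it
ggplot(melt_df, aes(Date,value,fill=Date)) +
# the heights are already counted, so use stat="identity"
geom_bar(stat="identity") +
facet_wrap(~ variable)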
However, I think that, in general, changes over time are much better represented by a line chart:
ggplot(melt_df,aes(Date,value,group=variable,color=variable)) + geom_line()
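If you do want the side-by-side bars the question mentions, a sketch along these lines might work. It rebuilds the three example weeks from the question and reshapes them with base R only (the name `long_df` is mine). I am assuming the dates are day-month-year, since "15-01-2011" cannot be month 15.

```r
library(ggplot2)

# The three example weeks from the question
df <- data.frame(
  Date = c("01-01-2011", "08-01-2011", "15-01-2011"),
  A = c(11, 12, 9), B = c(0, 0, 0), C = c(11, 3, 2), D = c(1, 3, 6)
)

# Wide to long with base R: one row per Date/category pair
long_df <- data.frame(
  Date     = rep(as.Date(df$Date, format = "%d-%m-%Y"), times = 4),
  variable = rep(c("A", "B", "C", "D"), each = nrow(df)),
  value    = c(df$A, df$B, df$C, df$D)
)

# Dodged bars: one bar per category within each week
p <- ggplot(long_df, aes(x = factor(Date), y = value, fill = variable)) +
  geom_col(position = "dodge") +
  labs(x = "Date", y = "Number", fill = "Category") +
  theme(axis.text.x = element_text(angle = 90))
p
```

Converting `Date` to a real date first keeps the weeks in chronological order. Wrapping it in `factor()` for the x axis keeps the bars evenly spaced.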