I have data that are in several groups, and I want to display them in a faceted stacked bar chart. The data show responses to a survey question. When I look at them in the dataframe, they make sense, and when I plot them (without faceting), they make sense.
However, the data appear to change when they are faceted. I have never had this problem before. I was able to re-create a change (not the exact same change) with some dummy data.
myDF <- data.frame(rep(c('aa','ab','ac'), each = 9),
rep(c('x','y','z'),times = 9),
rep(c("yes", "no", "maybe"), each=3, times=3),
sample(50:200, 27, replace=FALSE))
colnames(myDF) <- c('place','program','response','number')
library(dplyr)
myDF2 <- myDF %>%
group_by(place,program) %>%
mutate(pct=(100*number)/sum(number))
The data in myDF are basically a count of responses to a question. myDF2 only adds the percentage of respondents giving each response within each place and program.
library(ggplot2)
my.plot <-ggplot(myDF2,
aes(x=place, y=pct)) +
geom_bar(aes(fill=myDF$response),stat="identity")
my.plot.facet <-ggplot(myDF2,
aes(x=place, y=pct)) +
geom_bar(aes(fill=myDF$response),stat="identity")+
facet_wrap(~program)
I am hoping to see a plot that shows the proper "pct" for each "response" within each "program" and "place". However, my.plot.facet shows only one "response" per place.
The data are not like that. For example, head(myDF2) shows that place 'aa' in program 'x' has both yes and no.
> head(myDF2)
Source: local data frame [6 x 5]
Groups: place, program
place program response number pct
1 aa x yes 69 18.35106
2 aa y yes 95 25.81522
3 aa z yes 192 41.64859
4 aa x no 129 34.30851
5 aa y no 188 51.08696
6 aa z no 162 35.14100
It turns out that the ORDER of the grouping variables matters here. myDF2 is no longer a plain data frame; it is a grouped dplyr object, and ggplot2 seems to struggle with it.
If the data need to be faceted by program, 'program' needs to be the first variable in the group_by() call.
You can see this by looking at the inverse faceting, with program on the x-axis and facets by place:
my.plot.facet2 <-ggplot(myDF2,
aes(x=program, y=pct)) +
geom_bar(aes(fill=myDF2$response),stat="identity")+
facet_wrap(~place)
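One likely cause worth checking, independent of the grouping order: the plots above map fill to myDF$response (or myDF2$response), a vector taken from outside the data that ggplot2 splits into facets. ggplot2 cannot subset such a vector per panel, so the colours can end up misaligned once facet_wrap() is added. A minimal sketch, assuming the dummy data from the question (the seed and the ungroup() call are additions made here), maps the bare column name instead:

```r
library(dplyr)
library(ggplot2)

set.seed(1)  # seed is an addition, only to make the sketch reproducible
myDF <- data.frame(place    = rep(c('aa', 'ab', 'ac'), each = 9),
                   program  = rep(c('x', 'y', 'z'), times = 9),
                   response = rep(c("yes", "no", "maybe"), each = 3, times = 3),
                   number   = sample(50:200, 27, replace = FALSE))

myDF2 <- myDF %>%
  group_by(place, program) %>%
  mutate(pct = 100 * number / sum(number)) %>%
  ungroup()  # drop the grouping before plotting

# fill refers to the column inside myDF2, so it is split along with the facets
my.plot.facet <- ggplot(myDF2, aes(x = place, y = pct, fill = response)) +
  geom_bar(stat = "identity") +
  facet_wrap(~program)
```

With this mapping, each panel should contain all three responses for every place, and the bars in each place should stack to 100.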
This seems so simple. I can easily do it in Excel but I want to automate the process through R. I have installed ggplot2. Using RStudio I have read in my CSV file.
The resulting data frame has over 200 rows, each a town in New Hampshire. The first column is titled "Town" and each row below that has the text name of the town, (e.g., "Concord" or "Lancaster"). Column 2 contains a number for each town (spending per elementary school pupil) and the title of that column in the CSV file is "01/02 Elem PPE" - but it shows as "X01.02.Elem.PPE" when using View(). Column 3 has similar numbers for each town and its title in View() is "X02.03.Elem.PPE". Columns 4 through 11 are similar.
I just want to plot a line graph of the numbers in columns 2-11 for one row (one town). It will show how the spending per pupil has changed in that town over time. There must be a simple way to do this, but I can't find it.
Please help. I am a 77-year-old with some programming experience from 3-5 decades ago, but I only started with R and RStudio yesterday.
First, I'll make some new data that mimics yours. It should have more or less the same properties.
library(glue)
library(tidyverse)
set.seed(4314)
mat <- matrix(rpois(40, 5000), ncol=10)
colnames(mat) <- glue("X{sprintf('%2.0f', 1:10)}.{sprintf('%2.0f', 2:11)}.Elem.PPE", sep="") %>%
gsub(". ", ".0", ., fixed=TRUE) %>%
gsub("X ", "X0", ., fixed=TRUE)
df <- tibble(town = c("Concord", "Lancaster", "Manchester", "Nashua"))
df <- bind_cols(df, as_tibble(mat))
Now, this is where you would start. I'm going to assume that you read your CSV into an object called df. The first thing you should do to make plotting easier is to pivot the data from wide form (one row and 10 columns per town) to long form (10 rows per town, with one column holding the values). I'm going to save this in an object called df2. The pivot_longer() function is in the tidyr package. After the data, its first argument is the set of columns that you want to change from wide to long form; in this case, that is everything except town. Then you give it a name for the new column that will hold the old column names (names_to) and a name for the column that will hold the values (values_to). Finally, I'm using a couple of regular expressions to go from X01.02.Elem.PPE to 01/02 for plotting purposes.
df2 <- df %>%
pivot_longer(-town, names_to="time", values_to="val") %>%
mutate(time = gsub("X(.*)\\.Elem\\.PPE", "\\1", time),
time = gsub("\\.", "/", time))
The resulting data frame looks like this:
# # A tibble: 40 x 3
# town time val
# <chr> <chr> <int>
# 1 Concord 01/02 4965
# 2 Concord 02/03 4953
# 3 Concord 03/04 5066
# 4 Concord 04/05 5100
# 5 Concord 05/06 4979
# 6 Concord 06/07 5090
# 7 Concord 07/08 5136
# 8 Concord 08/09 5076
# 9 Concord 09/10 5079
# 10 Concord 10/11 4945
Next, we can make a plot for a single place (before we think about automation). Let's try Concord. First, we'll save the values that we want to put on the x-axis:
xlabs <- unique(df2$time)
Next, we can use ggplot() to make the plot. In the code below, we're first piping the data frame to a filter that will pull out the values for a single town. The filtered data frame is piped into the ggplot() function. Since time in the data frame is a character vector, we need to turn it into a factor and then into a numeric to make the line plot. We add the line geometry to plot the line. Then we change the x-axis labels with scale_x_continuous(). The labs() function changes the axis labels for the x- and y-axes. Finally, ggtitle() puts the title at the top of the plot. I also like theme_bw() rather than the gray background, but that's entirely a matter of personal preference. The resulting plot looks like this:
df2 %>% filter(town == "Concord") %>%
ggplot(aes(x=as.numeric(as.factor(time)), y=val)) +
geom_line() +
scale_x_continuous(breaks=1:10, labels = xlabs) +
labs(x="Time", y="Spending per Pupil") +
ggtitle("Concord") +
theme_bw()
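An alternative sketch, not part of the approach above: ggplot2 can also draw the line with time left as a character (discrete) axis, provided the line geometry is told that all points belong to one group via group = 1. The tiny data frame below is made up for illustration and only mimics the layout of df2.

```r
library(ggplot2)

# Made-up values in the same long layout as df2 (town, time, val)
toy <- data.frame(town = "Concord",
                  time = c("01/02", "02/03", "03/04", "04/05"),
                  val  = c(4965, 4953, 5066, 5100))

# group = 1 joins the points across the discrete x values
p <- ggplot(toy, aes(x = time, y = val, group = 1)) +
  geom_line() +
  labs(x = "Time", y = "Spending per Pupil") +
  ggtitle("Concord") +
  theme_bw()
```

The discrete axis orders its labels alphabetically, which happens to be chronological for zero-padded labels like these. With this variant the scale_x_continuous() relabelling is not needed.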
Now, the next part you mentioned was automation: you want to do this for every row of the original data frame. We could do that as follows. First, untown grabs the unique values of town from the data. The for() loop then runs over each value in untown. Wherever "Concord" appeared in the previous plot, we now have untown[i]. We also use ggsave() at the end, building the file name from the town name and .png. This will save a different plot for each town in R's working directory.
untown <- unique(df2$town)
for(i in seq_along(untown)){
p <- df2 %>% filter(town == untown[i]) %>%
ggplot(aes(x=as.numeric(as.factor(time)), y=val)) +
geom_line() +
scale_x_continuous(breaks=1:10, labels = xlabs) +
labs(x="Time", y="Spending per Pupil") +
ggtitle(untown[i]) +
theme_bw()
# pass the plot explicitly rather than relying on the last plot created
ggsave(glue("{untown[i]}.png"), plot = p, width=9, height=6)
}
I have a large data set with protein IDs and corresponding abundance profiles across a number of gel fractions. I want to plot these profiles of abundances across the fractions.
The data looks like this
IDs<- c("prot1", "prot2", "prot3", "prot4")
fraction1 <- c(3,4,2,4)
fraction2<- c(1,2,4,1)
fraction3<- c(6,4,6,2)
plotdata<-data.frame(IDs, fraction1, fraction2, fraction3)
> plotdata
IDs fraction1 fraction2 fraction3
1 prot1 3 1 6
2 prot2 4 2 4
3 prot3 2 4 6
4 prot4 4 1 2
I want it to look like this:
Every protein has a profile. Every fraction has a corresponding abundance value per protein. I want to have multiple proteins per plot.
I tried figuring out ggplot2 using the cheat sheet and failed. I don't know what the input df should look like and what method I should use to get these profiles.
I would use excel, but a bug draws the wrong profile of my data depending on order of data, so I can't trust it to do what I want.
First, you'll have to reorganize your data.frame for ggplot2. You can do it in one step with reshape2::melt. If you like, you can also change the default 'variable' and 'value' column names in this call.
library(reshape2)
library(dplyr)
library(ggplot2)
data2 <- melt(plotdata, id.vars = "IDs")
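As a small sketch of what the long layout looks like and how the column names can be changed: melt() for data frames accepts variable.name and value.name arguments. The names "fraction" and "abundance" below are chosen here purely for illustration; if you use them, the aes() call in the plot would need those names instead of variable and value.

```r
library(reshape2)

plotdata <- data.frame(IDs = c("prot1", "prot2", "prot3", "prot4"),
                       fraction1 = c(3, 4, 2, 4),
                       fraction2 = c(1, 2, 4, 1),
                       fraction3 = c(6, 4, 6, 2))

# variable.name / value.name rename the two columns that melt() creates;
# "fraction" and "abundance" are illustrative names, not required ones
long <- melt(plotdata, id.vars = "IDs",
             variable.name = "fraction", value.name = "abundance")

# one row per protein/fraction combination: 4 proteins x 3 fractions
head(long, 4)
```

The result has one row for every protein and fraction pair, which is the shape ggplot2 expects for drawing one line per protein.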
Then, we'll group the data by protein:
data2 <- group_by(data2, IDs)
Finally, you can plot it quite simply:
ggplot(data2) +
geom_line(aes(variable, value, group = IDs,
color = IDs))
I have a data set with 3 attributes (organization hierarchy region-area-territory, territory is the lowest grain) plus two numeric fields (sales qty and headcount).
How do I generate correlation between sales qty and territory headcount, and display the correlation by region, area and territory?
I used the dplyr package: g = group_by(mydataset, region, area, territory), and then summarize(g, cor(sales_qty, headcount)). The display looks right, but every correlation is NA. If I omit territory (grouping by region and area only), the result looks right. Even though territory is the lowest level, can I still use group_by? Why is it showing NA?
Thank you for helping!
Without seeing your code and data, it is hard to tell what is going wrong, so I can't comment on that. Here is what I tried to get correlations by group. It works well.
set.seed(1234)
df <- data.frame(group = rep(1:5, 100), x = rnorm(500) , y = rnorm(500) )
library(dplyr)
df %>%
group_by(group) %>%
do(data.frame(x=cor(.$x,.$y)))
Output:
group x
<int> <dbl>
1 1 0.1293551648
2 2 0.0006703073
3 3 0.2021294935
4 4 -0.0162522307
5 5 0.0995898089
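Here is a sketch of the same calculation with summarise(), plus one possible explanation for the NA values in the question. It assumes (this is not confirmed by the question) that each territory has only one row, in which case a correlation cannot be computed because it needs at least two observations. The one_row data frame is hypothetical.

```r
library(dplyr)

set.seed(1234)
df <- data.frame(group = rep(1:5, 100), x = rnorm(500), y = rnorm(500))

# Same per-group correlation as the do() call, written with summarise()
by_group <- df %>%
  group_by(group) %>%
  summarise(r = cor(x, y), n = n())

# Hypothetical case: one row per group, as might happen at territory level
one_row <- data.frame(territory = 1:4,
                      sales_qty = c(10, 12, 9, 15),
                      headcount = c(3, 4, 2, 5))

by_territory <- one_row %>%
  group_by(territory) %>%
  summarise(r = suppressWarnings(cor(sales_qty, headcount)), n = n())
# r is missing in every group because each group holds a single observation
```

Including n = n() in the summary makes it easy to spot groups that are too small to yield a correlation.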
I have a list of words coming straight from a file, one per line, that I import with read.csv, which produces a data.frame. What I need to do is to compute and plot the number of occurrences of each of these words. That, I can do easily, but the problem is that I have several hundred words, most of which occur just once or twice in the list, so I'm not interested in them.
EDIT https://gist.github.com/anonymous/404a321840936bf15dd2#file-wordlist-csv here is a sample wordlist that you can use to try. It isn't the same I used, I can't share that as it's actual data from actual experiments and I'm not allowed to share it. For all intents and purposes, this list is comparable.
A "simple"
df <- data.frame(table(words$word))
df[df$Freq > 2, ]
does the trick, I now have a list of the words that occur more than twice, as well as a hard headache as to why I have to go from a data.frame to an array and back to a data.frame just to do that, let alone the fact that I have to repeat the name of the data.frame in the actual selection string. Beats me completely.
The problem is that now the filtered data.frame is useless for charting. Suppose this is what I get after filtering
Var1 Freq
6 aspect 3
24 colour 7
41 differ 18
55 featur 7
58 function 19
81 look 4
82 make 3
85 mean 7
95 opposit 14
108 properti 3
109 purpos 6
112 relat 3
116 rhythm 4
118 shape 6
120 similar 5
123 sound 3
obviously if I just do a
plot(df[df$Freq > 2, ])
I get this
which obviously (obviously?) has all the original terms on the x axis, while the y axis only shows the filtered values. So the next logical step is to try and force R's hand
plot(x=df[df$Freq > 2, ]$Var1, y=df[df$Freq > 2, ]$Freq)
But clearly R knows best and already did that, because I get the exact same result. Using ggplot2 things get a little better
qplot(x=df[df$Freq > 2, ]$Var1, y=df[df$Freq > 2, ]$Freq)
(yay for consistency) but I'd like that to show an actual histogram, y'know, with bars, like the ones they teach in sixth grade, so if I ask that
qplot(x=df[df$Freq > 2, ]$Var1, y=df[df$Freq > 2, ]$Freq) + geom_bar()
I get
Error : Mapping a variable to y and also using stat="bin".
With stat="bin", it will attempt to set the y value to the count of cases in each group.
This can result in unexpected behavior and will not be allowed in a future version of ggplot2.
If you want y to represent counts of cases, use stat="bin" and don't map a variable to y.
If you want y to represent values in the data, use stat="identity".
See ?geom_bar for examples. (Defunct; last used in version 0.9.2)
so let us try the last suggestion, shall we?
qplot(df[df$Freq > 2, ]$Var1, stat='identity') + geom_bar()
fair enough, but where are my bars? So, back to basics
qplot(words$word) + geom_bar() # even if geom_bar() is probably unnecessary this time
gives me this
Am I crazy or [substitute a long list of ramblings and complaints about R]?
I generate some random data
set.seed(1)
df <- data.frame(Var1 = letters, Freq = sample(1: 8, 26, T))
Then I use dplyr::filter because it is very fast and easy.
library(ggplot2); library(dplyr)
qplot(data = filter(df, Freq > 2), Var1, Freq, geom= "bar", stat = "identity")
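In more recent ggplot2 releases qplot() is deprecated and its stat argument may no longer be accepted, so the call above might not run as written. A hedged equivalent with ggplot() and geom_col(), using the same random data, could look like this:

```r
library(ggplot2)
library(dplyr)

set.seed(1)
df <- data.frame(Var1 = letters, Freq = sample(1:8, 26, TRUE))

# keep only the entries that occur more than twice
kept <- dplyr::filter(df, Freq > 2)

# geom_col() draws bars whose heights are the values in Freq
p <- ggplot(kept, aes(x = Var1, y = Freq)) +
  geom_col()
```

geom_col() plays the role of geom_bar(stat = "identity"), so no stat needs to be specified.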
First of all, at least with plot(), there's no reason to force a data.frame. plot() understands table objects. You can do
plot(table(words$word))
# or
plot(table(words$word), type="p")
# or
barplot(table(words$word))
We can use Filter() to keep only the entries we want; unfortunately, that drops the table class, but we can add it back with as.table(). This looks like
plot(as.table(Filter(function(x) x>2, table(words$word))), type="p")
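As for why the filtered data.frame in the question still shows every word on the x axis: data.frame(table(...)) stores the words as a factor, and subsetting rows keeps all of the factor's original levels. A minimal sketch with a made-up word list (standing in for the real words$word) that drops the unused levels before plotting:

```r
# Made-up word list standing in for the real data
words <- data.frame(word = c(rep("colour", 7), rep("differ", 5),
                             rep("shape", 3), "look", "mean", "mean"),
                    stringsAsFactors = FALSE)

df <- data.frame(table(words$word))  # columns: Var1 (a factor) and Freq
kept <- df[df$Freq > 2, ]

nlevels(kept$Var1)        # still counts the filtered-out words
kept <- droplevels(kept)  # discard factor levels with no rows left

barplot(kept$Freq, names.arg = as.character(kept$Var1))
```

After droplevels(), plot(kept) should only show the words that survived the filter.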
This seems simple, but I've tried multiple variations of matplot, ggplot2, regular old plot...I can't get any to do what I need.
I have a gigantic dataframe of years, months, and observations. I simplified it down to number of observations per month, per year, see below. I'm not sure why it read in with the "X" in front of each column heading, but if it's not going to affect the code, right now I don't care.
head(storms)
X Month X1992 X1993 X1994
1 1 1 2 1
2 2 2 4 1
3 3 3 26 10
4 4 4 47 26
5 5 5 969 615
The full (simplified) set is 10 columns of years (1992-2001), each with 12 months/rows of totals (1 storm in Jan 1992, 26 storms in March 1993...). I need simply to plot these all on an x-axis 120 months long, # of observations per month on the y-axis. It could be a line or bars or vertical lines. I've seen many ways to plot 20 lines with 12 months on the x-axis; that is not what I'm going for. I also need to label the years every 12 months, but I think I can figure that out after I get this block out of the way.
In other words (I hope this is more clear if the previous is not):
y axis: # of storms, ylim=c(0, 1000)
x axis: 10 sets of months (Jan-Dec, 1992-2001, 120 months total). The only labels will be the years, every 12 months of course.
I know I'm just thinking about it wrong, could someone please set my head straight?
(first post; please also tell me if I'm not formatting or inquiring properly!)
Is this something like what you are looking for? If I am not mistaken, you will want to rearrange your data frame, making it longer rather than wider. Then you can draw a figure. Keep in mind that you have 120 months, so you may need to think about plot space. At least this example should let you move forward. I hope this helps.
library(tidyr)
library(ggplot2)
# Create sample data
month <- rep(c(1:12), each = 1, times = 2)
nintytwo <- runif(24, 0, 20)
nintythree <- runif(24, 0, 20)
# Create a data frame
ana <- data.frame(month, nintytwo, nintythree)
# Make the data longer rather than wider.
bob <- gather(ana, year, value, -month)
bob$month <- as.factor(bob$month)
# Draw a figure
cathy <- ggplot(bob, aes(x= year,y = value, fill = month)) + geom_bar(stat="identity", position="dodge")
cathy
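To get the single 120-month x-axis the question asks for with the same tidyr/ggplot2 tools, one option is to sort the long data by year and month and plot against a running month index. This is only a sketch: the storm counts are made up, the extra X column from the question is left out, and pivot_longer() is used here in place of gather().

```r
library(tidyr)
library(dplyr)
library(ggplot2)

set.seed(42)
# Made-up storm counts: 12 months x 10 years, laid out like `storms`
storms <- data.frame(Month = 1:12)
for (y in 1992:2001) storms[[paste0("X", y)]] <- rpois(12, 20)

long <- storms %>%
  pivot_longer(-Month, names_to = "Year", values_to = "Storms") %>%
  mutate(Year = as.integer(sub("X", "", Year))) %>%  # "X1992" -> 1992
  arrange(Year, Month) %>%
  mutate(idx = row_number())                         # 1..120, in time order

# one tick per year, placed at each January
p <- ggplot(long, aes(x = idx, y = Storms)) +
  geom_line() +
  scale_x_continuous(breaks = seq(1, 120, by = 12), labels = 1992:2001) +
  labs(x = "Year", y = "Storms per month")
```

Swapping geom_line() for geom_col() would give bars instead of a line.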
Here's an example using base R :
# create example data
set.seed(123)
df <- data.frame(Month=1:12)
for(y in 1992:2001){
tmp <- data.frame(X=as.integer(abs(rnorm(12,mean=2,sd=10))))
colnames(tmp) <- paste("X",y,sep="")
df <- cbind(df,tmp)
}
# reshape to long format (one column with n.of storms, and period columns)
long <- reshape(df[,-1], idvar="Month", ids=df$Month,
times=names(df[,-1]), timevar="Year",
varying = list(names(df[,-1])),
direction = "long",v.names="Storms")
# remove the "X" from the year
long$Year <- substr(long$Year,2,nchar(long$Year))
nYears <- length(unique(long$Year))
# plot the line
plot(x=1:nrow(long),y=long$Storms,type="l",
xaxt="n",main="Monthly Storms",
xlab="Period",ylab="Storms",col="RoyalBlue")
# add custom labels
axis(1,at=((1:nYears)*12)-6,labels=unique(long$Year))
# add vertical lines
abline(v=c(0.5,((1:nYears)*12)+0.5),col="Gray80",lty=2)