Grouping/stacking factor levels in ggplot bar chart - r

I'm relatively new to R and a complete beginner with ggplot, but I haven't managed to find an answer to the seemingly simple problem I have. Using ggplot, I would like to make a bar chart in which two of three or more graphed factor levels are stacked.
Essentially, this is the type of data I am looking at:
df <- data.frame(Answer=c("good","good","kinda good","kinda good",
"kinda good","good","bad","good","bad"))
This provides me with a factor with three levels, two of which are very similar:
Answer
1 good
2 good
3 kinda good
4 kinda good
5 kinda good
6 good
7 bad
8 good
9 bad
If I let ggplot go over these data for me now,
p <- ggplot(df, aes(Answer))
p + geom_bar()
I will get a bar chart with three columns. However, I would like to end up with two columns, one of which should be a stack of the two factor levels "good" and "kinda good", still visibly separated.
I am working with 100 columns of input (study on orthography), which I will need to go through manually, so I would like to make the code as easily adjustable as possible. Some of them have more than ten levels, and I would need to sort them into three columns. Therefore, in most cases my data would more likely look like this:
df <- data.frame(Answer=c("good","goood","goo0d","good",
"I don't know","Bad","bad","baaad","really bad"))
I would consequently group this into three categories. In approximately half of the cases, I could probably still filter using pattern matching because I will be looking at the use of spaces. The other half, however, is looking at capitalization, which would get a little messy, or at least very tedious.
I have thought of two different approaches to solve this issue more efficiently:
1. Simply rewriting the factor levels, but this would result in a loss of information (and I would like to keep the two levels separate). I would like to keep the original level names because I think I need them to graph the ratio within that stacked column and to label the column properly.
2. I could split the respective column/factor into two separate columns/factors and graph them next to each other, and thus create a "fake" third dimension. This is looking to be the most promising approach, but before I work through 100 columns of data with this - is there a more elegant approach, maybe within the ggplot2 package, where I could just point/group the level names instead of changing/reordering the data frame behind it?
Thanks!

You can try the following for a more automated approach to grouping the answers.
We select some keywords based on your data and loop over them to see which answers contain each keyword. Note that the indexing step below assumes every answer matches exactly one keyword.
groups <- c('good','bad','ugly','know')
df <- data.frame(Answer=c("good","medium good","kinda good","still good",
"I don't know","good","bad","good","really bad"))
idx <- sapply(groups, function(x) grepl(x, df$Answer, ignore.case = TRUE))
df$group <- rep(colnames(idx), nrow(idx))[t(idx)]
df
#         Answer group
# 1         good  good
# 2  medium good  good
# 3   kinda good  good
# 4   still good  good
# 5 I don't know  know
# 6         good  good
# 7          bad   bad
# 8         good  good
# 9   really bad   bad
library('ggplot2')
ggplot(df, aes(group, fill = Answer)) + geom_bar()
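For the columns where pattern matching gets messy (the capitalization cases, for example), an explicit lookup table might be easier to adjust by hand. This is only a sketch: the group names "good", "bad" and "unsure" and the `lookup` vector are my own assumptions, not something from the question. The original answers stay in `Answer`, so they still drive the fill within each stacked bar.

```r
library(ggplot2)

df <- data.frame(Answer = c("good", "goood", "goo0d", "good",
                            "I don't know", "Bad", "bad", "baaad", "really bad"))

# Hypothetical lookup: original answer -> plotting group (names are assumptions).
lookup <- c("good" = "good", "goood" = "good", "goo0d" = "good",
            "I don't know" = "unsure",
            "Bad" = "bad", "bad" = "bad", "baaad" = "bad", "really bad" = "bad")

# Index the lookup by name; as.character() guards against Answer being a factor.
df$group <- unname(lookup[as.character(df$Answer)])

# Any answer missing from the lookup would show up as NA, so fail early.
stopifnot(!anyNA(df$group))

# One bar per group, stacked and colored by the original answer.
p <- ggplot(df, aes(group, fill = Answer)) + geom_bar()
print(p)
```

With 100 columns, you would only need to edit the `lookup` vector per column; the plotting line stays the same.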

Related

Scatterplot for comparing species abundance

I have a homework question that states the following:
The file “channel_islands_counts_edit.csv” contains survey data on temperate rocky reef fishes from the Channel Islands, collected at many sites over many years. The data has columns for Year, Date, Site, count, and SpeciesName (broken into adults and juveniles). The version of the data that I’ve given you looks at 16 sites over 27 years, with count data for 27 categories of fish. Imagine we’re interested in whether the abundance of different species are correlated across sites (to get a sense for whether species have similar habitat preferences and/or interact with each other), and whether the across-site correlations are consistent over time. To visualize this, make some code that does the following:
For each year, draw a scatterplot that compares the abundance of Hypsypops rubicundus (adults) and the abundance of Paralabrax clathratus (adults) across sites. Feel free to transform the data for plotting purposes, if you think that helps you see any patterns.
I imported my data set, and ran the following code which is giving me 27 plots, with Site as x and Count as y, but there is no data shown in the plots.
head(channel_islands)
sapply(channel_islands, class)
levels(channel_islands$SpeciesName)
par(mfrow= c(6,5)) # set the plotting area into a 6 row*5 column array
for (i in 1:27) {
HR11<-subset(channel_islands,SpeciesName=="Hypsypops rubicundus,adult"[i] & Site==11)
PC15<-subset(channel_islands,SpeciesName=="Paralabrax clathratus,adult"[i] & Site==15)
with(HR11,plot(count~Site,type='b',pch=19,ylim=c(0,10),xlim=c(0,16),col='green',main=i))
with(PC15,plot(count~Site,type='b',pch=19,ylim=c(0,10),xlim=c(0,16),col='blue',main=i))
}
If anyone could help me figure out how to compare species abundance across sites, over 27 years, I would really appreciate it.
The expression "Hypsypops rubicundus,adult"[i] doesn't really make sense. It indexes a length-one character vector, so it works when i == 1 but returns NA for any larger i. The comparison SpeciesName == NA is never TRUE (it evaluates to NA), and subset() drops those rows, so you get an empty subset.
Consider looking into using ggplot2 with facet_grid to quickly make multiple plots without the loop. The R Graphics Cookbook has good examples on using facets.
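A rough sketch of the faceting idea, using made-up data since I don't have the CSV. The `toy` data frame, its 3 years and 4 sites, and the random counts are all assumptions; only the column names Year, Site, SpeciesName and count follow the question. It also assumes one count per year, site and species; if the real file has several survey dates per year you would want to aggregate first.

```r
library(ggplot2)

# Hypothetical stand-in for the survey data: 3 years x 4 sites x 2 species.
set.seed(1)
species <- c("Hypsypops rubicundus,adult", "Paralabrax clathratus,adult")
toy <- expand.grid(Year = 2001:2003, Site = 1:4, SpeciesName = species,
                   stringsAsFactors = FALSE)
toy$count <- rpois(nrow(toy), lambda = 5)

# Put the two species side by side: one row per Year/Site.
hr <- toy[toy$SpeciesName == species[1], c("Year", "Site", "count")]
pc <- toy[toy$SpeciesName == species[2], c("Year", "Site", "count")]
wide <- merge(hr, pc, by = c("Year", "Site"), suffixes = c("_hr", "_pc"))

# One scatterplot panel per year; each point is a site.
p <- ggplot(wide, aes(count_hr, count_pc)) +
  geom_point() +
  facet_wrap(~ Year) +
  labs(x = "H. rubicundus (adults)", y = "P. clathratus (adults)")
print(p)
```

I used facet_wrap() rather than facet_grid() here because there is only one faceting variable (Year), and it wraps the panels onto several rows.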

Simple barplot displaying voting of a county

I'm fairly new to R and making plots, so sorry about that. I have a dataset of the voting for counties and I want to make a barplot showing how many mandates each county voted for.
What I've done so far is to extract one row, which includes the name of the county and the number of mandates it voted for the different parties (which are headers).
Fylker AP FRP H KrF SP
Ostlandet 3 2 2 0 1
Sorry for the bad display of code, whenever I paste the code, it looks really weird, despite indenting.
The data is called "Ostlandet" and is only 1 row. So as I tried to explain above, I want to make some sort of barplot out of this. The idea is to have the different parties on the x-axis and number of votes on y. I've tried this so far
ggplot(Ostfold, aes(x = Ostfold[1,])) +
geom_histogram(binwidth = 20)
Which just gave me tons of errors.
I've also tried using barplot, but I just can't seem to figure this out.
Sorry, this is probably super easy, but I'm just getting into coding.
You have a few issues. First, there's no need for extracting rows. Second, the data are in "wide" format (mandates in columns) instead of "long format" (a column named "mandate" with values). Third, you want to plot counts so geom_col() is better than geom_histogram().
The gather() function from the tidyr package will get your data from wide into long:
library(tidyr)
library(ggplot2)
Ostfold %>%
gather(Mandate, Votes, -Fylker)
That should generate something like this:
Fylker Mandate Votes
1 Ostlandet AP 3
2 Ostlandet FRP 2
3 Ostlandet H 2
4 Ostlandet KrF 0
5 Ostlandet SP 1
You can pass that to ggplot:
Ostfold %>%
gather(Mandate, Votes, -Fylker) %>%
ggplot(aes(Mandate, Votes)) + geom_col()
Result for your one row: [figure: bar chart of Votes by Mandate for Ostlandet]
For a dataset with multiple counties, you might want to add + facet_wrap(~Fylker) to facet the plot by county, depending on how many there are.
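In newer versions of tidyr, gather() is superseded by pivot_longer(), so the same reshaping could be sketched as below. The `votes` data frame is an assumption for illustration: the Ostlandet row is from the question, but the Vestlandet row is made up so that the faceting has something to show.

```r
library(tidyr)
library(ggplot2)

# Ostlandet row is from the question; the Vestlandet row is invented for illustration.
votes <- data.frame(Fylker = c("Ostlandet", "Vestlandet"),
                    AP = c(3, 2), FRP = c(2, 1), H = c(2, 3),
                    KrF = c(0, 1), SP = c(1, 1))

# Wide -> long: every party column becomes a (Mandate, Votes) pair.
long <- pivot_longer(votes, cols = -Fylker,
                     names_to = "Mandate", values_to = "Votes")

# One bar per party, one panel per county.
p <- ggplot(long, aes(Mandate, Votes)) +
  geom_col() +
  facet_wrap(~ Fylker)
print(p)
```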

R Question: How can I create a histogram with 2 variables against each other?

Okay, let me be as clear as I can in my problem. I'm new to R, so your patience is appreciated.
I want to create a histogram using two different vectors. The first vector contains a list of models (products). These models are listed as either integers, strings, or NA. I'm not exactly sure how R is storing them (I assume they're kept as strings), or if that is a relevant issue. I also have a vector containing a list of incidents pertaining to that model. So for example, one row in the dataframe might be:
Model Incidents
XXX1991 7
How can I create a histogram where the number of incidents for each model is shown? So the histogram will look like
             | =
             | =
Frequency of | =
Incidents    | = =
             | = = =
             | = = = = =
               - - - - - -
               Each different Model
Just to give a general idea.
I also need to be able to map everything out with standard deviation lines, so that it's easy to see which models are the least reliable. But that's not the main question here. I just don't want to do anything that will make me unable to use standard deviation in the future.
So far, all I really understand is how to make a histogram with the frequency marked, but for some reason, the x-axis is marked with numbers, not the models' names.
I don't really care if I have to download new packages to make this work, but I suspect that this already exists in basic R or ggplot2 and I'm just too dumb to figure it out.
Feel free to ask clarifying questions. Thanks.
EDIT: I forgot to mention, there are multiple rows of incidents listed under each model. So to add to my example earlier:
Model Incidents
XXX1991 7
XXX1991 1
XXX1991 19
3
5
XXX1002 9
XXX1002 4
etc . . .
I want to add up all the incidents for a model under one label.
I am assuming that you did not mean to leave the model blank in your example, so I filled in some values.
You can add up the number of incidents by model using aggregate then make the relevant plot using barplot.
## Example Data
data = read.table(text="Model Incidents
XXX1991 7
XXX1991 1
XXX1991 19
XXX1992 3
XXX1992 5
XXX1002 9
XXX1002 4",
header=TRUE)
TAB = aggregate(data$Incidents, list(data$Model), sum)
TAB
#   Group.1  x
# 1 XXX1002 13
# 2 XXX1991 27
# 3 XXX1992  8
barplot(TAB$x, names.arg=TAB$Group.1 )
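Since the question also mentions standard deviation, here is a sketch of how the per-model spread could be computed alongside the totals using the formula interface of aggregate(). The names `incidents`, `totals`, `spread` and `summary_tab` are my own, and how you then draw the "standard deviation lines" is left open; this only gets the numbers into one table.

```r
# Same example data as above (the XXX1992 rows were filled in by assumption).
incidents <- read.table(text = "Model Incidents
XXX1991 7
XXX1991 1
XXX1991 19
XXX1992 3
XXX1992 5
XXX1002 9
XXX1002 4", header = TRUE)

# Total incidents and standard deviation of incidents, per model.
totals <- aggregate(Incidents ~ Model, data = incidents, FUN = sum)
spread <- aggregate(Incidents ~ Model, data = incidents, FUN = sd)
names(spread)[2] <- "SD"

# One row per model with both summaries.
summary_tab <- merge(totals, spread, by = "Model")
summary_tab

barplot(summary_tab$Incidents, names.arg = as.character(summary_tab$Model))
```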

r ggplot x axis tick marks order

I've seen several questions about the order of x axis marks but still none of them could solve my problem.
I'm trying to do a density plot which shows the distribution of people by percentile within each score given like this
library(dplyr); library(ggplot2); library(ggthemes)
ggplot(KA,aes(x=percentile,group=kscore,color=kscore))+
xlab('Percentil')+ ylab('Frecuencia')+ theme_tufte()+ ggtitle("Prospectos")+
scale_color_brewer(palette = "Greens")+geom_density(size=3)
But the x axis marks get ordered like 1,10,100,11,12,..,2,20,21,..,99 instead of just 1,2,3,..,100, which is my desired output.
I fear this affects the whole plot, not just the labels.
I'll turn my comment to an answer so this can be marked resolved:
Your x variable is (almost certainly) a factor. You probably want it to be numeric.
KA$percentile = as.numeric(as.character(KA$percentile))
When you're seeing weird stuff, it's good to check on your data. Running str(KA) is a good way to see what's there. If you just want to see classes, sapply(KA, class) is a nice summary.
A common R quirk: when converting a factor to numeric, go by way of character, or you risk ending up with the internal level codes instead of the values:
year_fac = factor(1998:2002)
as.numeric(year_fac) # not so good
# [1] 1 2 3 4 5
as.numeric(as.character(year_fac)) # what you want
# [1] 1998 1999 2000 2001 2002
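To see where the 1, 10, 100, 11, ... ordering comes from, here is a small sketch with a made-up vector (`pct` is not from the question's data). Factor levels are sorted as text, and ggplot lays out a discrete axis in level order.

```r
# Hypothetical percentile values that were read in as text.
pct <- factor(c("1", "2", "10", "11", "100"))

# Levels are sorted as strings, which is the order the axis follows.
levels(pct)

# After converting via character, the values sort numerically.
pct_num <- as.numeric(as.character(pct))
sort(pct_num)
```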

R storing different columns in different vectors to compute conditional probabilities

I am completely new to R. I tried reading the reference and a couple of good introductions, but I am still quite confused.
I am hoping to do the following:
I have produced a .txt file that looks like the following:
area,energy
1.41155882174e-05,1.0914586287e-11
1.46893363946e-05,5.25011714434e-11
1.39244046855e-05,1.57904991488e-10
1.64155121046e-05,9.0815757601e-12
1.85202830392e-05,8.3207522281e-11
1.5256036289e-05,4.24756620609e-10
1.82107587343e-05,0.0
I have the following command to read the file in R:
tbl <- read.csv("foo.txt",header=TRUE)
producing:
> tbl
area energy
1 1.411559e-05 1.091459e-11
2 1.468934e-05 5.250117e-11
3 1.392440e-05 1.579050e-10
4 1.641551e-05 9.081576e-12
5 1.852028e-05 8.320752e-11
6 1.525604e-05 4.247566e-10
7 1.821076e-05 0.000000e+00
Now I want to store each column in two different vectors, respectively area and energy.
I tried:
area <- c(tbl$first)
energy <- c(tbl$second)
but it does not seem to work.
I need two different vectors (which must include only the numerical data of each column) in order to compute:
> prob(energy, given = area), i.e. the conditional probability P(energy|area).
And then plot it. Can you help me please?
As @Ananda Mahto alluded to, the problem is in the way you are referring to columns.
To 'get' a column of a data frame in R, you have several options:
DataFrameName$ColumnName
DataFrameName[,ColumnNumber]
DataFrameName[["ColumnName"]]
So to get area, you would do:
tbl$area #or
tbl[,1] #or
tbl[["area"]]
With the first option generally being preferred (from what I've seen).
Incidentally, for your 'end goal', you don't need to do any of this:
with(tbl, prob(energy, given = area))
does the trick.
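A quick self-contained check of the column access options, reading the first three rows of the question's data from an inline string instead of foo.txt (that substitution is mine). It also shows why tbl$first did not work: there is no column with that name, so `$` quietly returns NULL.

```r
# First three rows of the question's data, read inline rather than from a file.
tbl <- read.csv(text = "area,energy
1.41155882174e-05,1.0914586287e-11
1.46893363946e-05,5.25011714434e-11
1.39244046855e-05,1.57904991488e-10", header = TRUE)

# Columns are addressed by their header names, not by position words.
area   <- tbl$area
energy <- tbl[["energy"]]

is.numeric(area)          # plain numeric vector, no extra wrapping needed
identical(area, tbl[, 1]) # same column, accessed by number

is.null(tbl$first)        # no column called "first", so this is NULL
```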

Resources