Stacked bart chart 4 variables with ggplot - r

I am very new to this and I wanted to add that the various ways in which I tried to reshape/melt the data. My data in three different variations:
Version 1:
year,type,total,action,perc
2015,v,"1,199,310",crime,42.16
2015,p,"8,024,115",crime,18.24
2015,v,"505,681",arrest,42.16
2015,p,"1,463,213",arrest,18.24
2016,v,"1,250,162",crime,32.85
2016,p,"7,928,530",crime,17.07
2016,v,"410,717",arrest,32.85
2016,p,"1,353,283",arrest,17.07
2017,v,"1,247,321",crime,41.58
2017,p,"7,694,086",crime,16.24
2017,v,"518,617",arrest,41.58
2017,p,"1,249,757",arrest,16.24
Version 2:
year,type,crime,arrest,perc
2015,1,"1,199,310","505,681",42.16
2015,2,"8,024,115","1,463,213",18.24
2016,1,"1,250,162","410,717",32.85
2016,2,"7,928,530","1,353,283",17.07
2017,1,"1,247,321","518,617",41.58
2017,2,"7,694,086","1,249,757",16.24
Version 3:
df <- vpcrimetotal
year,vcrime,varrest,varrestperc,pcrime,parrest,parrestperc
2017,"1,247,321","518,617",0.4158,"7,694,086","1,249,757",0.1624
2016,"1,250,162","410,717",0.3285,"7,928,530","1,353,283",0.1707
2015,"1,199,310","505,681",0.4216,"8,024,115","1,463,213",0.1824
The idea is to show the total number of violent crime versus property crime from 1990-2017 with the number of arrests (labeled as a percent) inside each bar based on crime type (property or violent). The preference is to stack all four into one bar per year with different colors for each.
I found these that helped but was still confused in figuring out how to fit my data into them. how to create stacked bar charts for multiple variables with percentages, but to maybe look like this Count and Percent Together using Stack Bar in R
I have used these sets of data to the code but is probably confusing if I post all the different ones I tried that don't work.

Related

How to prevent combining 2 boxplots into 1 on ggplot

I am trying to create a plot with 6 panels showing two boxplots (representing the effect of injury (1 = presence of injury, 0 = absence of injury). The closest I have gotten is this:
ggplot(cell_measurement, aes(Injury, Acceleration_mean)) +
geom_boxplot(alpha=.3, aes(fill = Injury)) +
geom_jitter(size=0.1)+
facet_grid(vars(Animal))
Which has produced:
enter image description here
The data looks like this:
enter image description here
What it has instead done is merged the Injury to make 0.5 but my issue is that the 0 and 1 group are two distinct groups I am trying to represent on each panel. I have tried changing the Injury groups into 'Presence' and 'Absence' but everything I have tried has not worked.
Any suggestions on how to go about this, please?
Thank you!
(sorry I have not been on here long enough to post actual images)

Scatterplot for comparing species abundance

I have a homework question that states the following:
The file “channel_islands_counts_edit.csv” contains survey data on temperate rocky reef fishes from the Channel Islands, collected at many sites over many years. The data has columns for Year, Date, Site, count, and SpeciesName (broken into adults and juveniles). The version of the data that I’ve given you looks at 16 sites over 27 years, with count data for 27 categories of fish. Imagine we’re interested in whether the abundance of different species are correlated across sites (to get a sense for whether species have similar habitat preferences and/or interact with each other), and whether the across-site correlations are consistent over time. To visualize this, make some code that does the following:
For each year, draw a scatterplot that compares the abundance of Hypsypops rubicundus (adults) and the abundance of Paralabrax clathratus (adults) across sites. Feel free to transform the data for plotting purposes, if you think that helps you see any patterns.
I imported my data set, and ran the following code which is giving me 27 plots, with Site as x and Count as y, but there is no data shown in the plots.
head(channel_islands)
sapply(channel_islands, class)
levels(channel_islands$SpeciesName)
par(mfrow= c(6,5)) # set the plotting area into a 6 row*5 column array
for (i in 1:27) {
HR11<-subset(channel_islands,SpeciesName=="Hypsypops rubicundus,adult"[i] & Site==11)
PC15<-subset(channel_islands,SpeciesName=="Paralabrax clathratus,adult"[i] & Site==15)
with(HR11,plot(count~Site,type='b',pch=19,ylim=c(0,10),xlim=c(0,16),col='green',main=i))
with(PC15,plot(count~Site,type='b',pch=19,ylim=c(0,10),xlim=c(0,16),col='blue',main=i))
}
If anyone could help me figure out how to compare species abundance across sites, over 27 years, I would really appreciate it.
The code "Hypsypops rubicundus,adult"[i] doesn't really make sense. Technically, it should work for when i == 1 but beyond that it would just return NA. I'm assuming SpeciesName == NA will never be true so you will get an empty subset.
Consider looking into using ggplot2 with facet_grid to quickly make multiple plots without the loop. The R Graphics Cookbook has good examples on using facets.

Grouping/stacking factor levels in ggplot bar chart

I'm relatively new to R and a complete beginner with ggplot, but I haven't managed to find an answer to the seemingly simple problem I have. Using ggplot, I would like to make a bar chart in which two of three or more graphed factor levels are stacked.
Essentially, this is the type of data I am looking at:
df <- data.frame(Answer=c("good","good","kinda good","kinda good",
"kinda good","good","bad","good","bad"))
This provides me with a factor with three levels, two of which are very similar:
Answer
1 good
2 good
3 kinda good
4 kinda good
5 kinda good
6 good
7 bad
8 good
9 bad
If I let ggplot go over these data for me now,
c <- ggplot(df, aes(df$Answer))
c + geom_bar()
I will get a bar chart with three columns. However, I would like to end up with two columns, one of which should be a stack of the two factor levels "good" and "kinda good", still visibly separated.
I am working with 100 columns of input (study on orthography), which I will need to go through manually, so I would like to make the code as easily adjustable as possible. Some of them have more than ten levels, and I would need to sort them into three columns. Therefore, in most cases my data would more likely look like this:
df <- data.frame(Answer=c("good","goood","goo0d","good",
"I don't know","Bad","bad","baaad","really bad"))
I would consequently group this into three categories. In approximately half of the cases, I could probably still filter using pattern matching because I will be looking at the use of spaces. The other half, however, is looking at capitalization, which would get a little messy, or at least very tedious.
I have thought of two different approaches to solve this issue more efficiently:
Simply rewriting the factor levels, but this would result in a loss of information (and I would like to keep the two levels separate). I would like to keep the original levels names because I think I need them to graph the ratio within that stacked column and to label the column properly.
I could split the respective column/factor into two separate columns/factors and graph them next to each other, and thus create a "fake" third dimension. This is looking to be the most promising approach, but before I work through 100 columns of data with this - is there a more elegant approach, maybe within the ggplot2 package, where I could just point/group the level names instead of changing/reordering the data frame behind it?
Thanks!
You can try the following for a more automated approach in grouping the answers.
We select some keywords based on your data and loop over them to see which answers may contain each keyword
groups <- c('good','bad','ugly','know')
df <- data.frame(Answer=c("good","medium good","kinda good","still good",
"I don't know","good","bad","good","really bad"))
idx <- sapply(groups, function(x) grepl(x, df$Answer, ignore.case = TRUE))
df$group <- rep(colnames(idx), nrow(idx))[t(idx)]
df
# Answer group
# 1 good good
# 2 medium good good
# 3 kinda good good
# 4 still good good
# 5 I don't know know
# 6 good good
# 7 bad bad
# 8 good good
# 9 really bad bad
library('ggplot2')
ggplot(df, aes(group, fill = Answer)) + geom_bar()

barplot: selecting data in R

I have a problem for building barplot.
I am working on air traffic in different countries. I would like to get barplots for each countries with the different airport names in the X axis. The Y axis will show the quantity of airlines using the airport.
My plan is to make the script for 1 country and to replicate it manually for the others.
in my data, I have in the different columns:
Country / aiport / destination.
So each rows is actually one airline that is using the airport.
Do you have an idea about how to do this?
For now I have this idea:
UK<-traffic[traffic$Country=="UK",]
UK$airport <- as.factor(UK$airport)
countUK<-table(UK$airport)
barplot(countUK)
This is not working, I have a bunch of airports that are not in UK in the X axis...
Thanks for your help
Answer found:
You could try to drop unused factor levels, i.e.
UK <- droplevels(UK) after the line UK$airport <- as.factor(UK$airport).

cummeRbund csHeatmap column user-defined order

I am using the R package cummeRbund (from Bioconductor) to visualize RNA-seq data, I created a cuffGeneSet instance called "DEG_genes" that contains 662 genes that are significantly differentially expressed between males and females. My goal is to create a heatmap using csHeatmap() in which the male and female samples (replicates) are separated but with a specific user-defined order within the sex category.
I used:
> DEG<-diffData(genes(cuff)) # take differentially expressed genes
> DEG_significant<-subset(DEG,significant=='yes') # retain only significant changes
> DEG_sign_IDs <- DEG_significant$gene_id # retrieve IDs
> DEG_genes<-getGenes(cuff,DEG_sign_IDs) # get CuffGeneSet instance
> hmap<-csHeatmap(DEG_genes,clustering='none',labRow=F,replicates=T)
This gives me ALMOST what I want: the heatmap shows Females on the left and Males on the right but they are alphabetically ordered (Female_0,Female_1,Female_10,Female_11,Female_12...Female_19,Female_2,Female_20,Female_21..,Female_29 on the left and similarly for males Male_0,Male_1,Male_10...Male_19,Male_2,Male_20...etc on the right) and I want them to be in a specific order (clusterReps). I created a test vector with replicate names on a specific order (Males on the left with 0 and 6 echanged and females on the right) as follow:
clusterReps<-c("Male_6","Male_1","Male_2","Male_3","Male_4","Male_5","Male_0","Male_7","Male_8","Male_9","Male_10","Male_11","Male_12","Male_13","Male_14","Male_15","Male_16","Male_17","Male_18","Male_19","Male_20","Male_21","Male_22","Male_23","Male_24","Male_25","Male_26","Male_27","Male_28","Male_29","Male_30","Male_31","Male_32","Male_33","Female_0","Female_1","Female_2","Female_3","Female_4","Female_5","Female_6","Female_7","Female_8","Female_9","Female_10","Female_11","Female_12","Female_13","Female_14","Female_15","Female_16","Female_17","Female_18","Female_19","Female_20","Female_21","Female_22","Female_23","Female_24","Female_25","Female_26","Female_27","Female_28")
I would like the data to be exactly the same except the order of the columns that must follow the order of the "clusterReps" vector. Knowing that the heatmap is a ggplot, I looked everywhere for a solution the last 2 days but with no success (despite a closely ressembling problem with heatmap.2() instead of csHeatmap() on stackoverflow, I tried to get a replicate fpkm matrix and use heatmap.2 but could only use heatmap_2 and some options were not accepted).
Using:
> hmap<-hmap+scale_x_discrete(limits=clusterReps)
Scale for 'x' is already present. Adding another scale for 'x', which will replace the existing scale.
only changes the x-axis labels but not the actual data (the heatmap remains identical).
Is there a similar function that rearranges the columns and not just labels?
Thanks in advance for your help, I'm not familiar with handling ggplot objects, and in particular heatmaps from cummeRbund.
EDIT:
Here is what I can give as further information:
> DEG_genes
CuffGeneSet instance for 662 genes
Slots:
annotation
fpkm
repFpkm
diff
count
isoforms CuffFeatureSet instance of size 930
TSS CuffFeatureSet instance of size 785
CDS CuffFeatureSet instance of size 230
promoters CuffFeatureSet instance of size 662
splicing CuffFeatureSet instance of size 785
relCDS CuffFeatureSet instance of size 662
> summary(DEG_genes)
Length Class Mode
662 CuffGeneSet S4
I am afraid I can't give more information for the moment, please let me know if you want me to execute a command and report the output if it can help.
I am not very fluent in R, but I was having the same problem. To solve it I made a script that renames all my sample names in all the files inside the cuffdiff folder to something that will give the right order when sorted alphabetically, and then rebuild the database.

Resources