Make a Barplot from the means of two variables in R - r

I'm having real trouble creating a simple bar plot. I am looking at the parasite load in mackerel, and want to create a bar plot that shows the difference in mean parasite load in male and female fish (with error bars).
At the moment, my data includes a column titled as SEX, under which fish are classed as M or F. Another column shows the number of parasites in each individual.
Any help is appreciated

You do not provide any data so we cannot respond precisely, but I will illustrate with other data and then say what I think you need. Using the built-in iris data:
Tab = aggregate(iris$Sepal.Length, list(iris$Species), mean)
barplot(Tab$x, names.arg=Tab$Group.1)
So you probably need something like:
Tab = aggregate(DAT$parasites, list(DAT$SEX), mean)
barplot(Tab$x, names.arg=Tab$Group.1)
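The question also asks for error bars, which the snippet above does not draw. Here is a minimal base-R sketch, assuming the error bars should show one standard error of the mean either side of each bar (the question does not say which measure is wanted). DAT, SEX and parasites are the same assumed names as above, and the data are simulated purely for illustration:

```r
# Simulated stand-in data; replace with the real data frame
set.seed(1)
DAT <- data.frame(
  SEX = rep(c("F", "M"), each = 20),
  parasites = rpois(40, lambda = 5)
)

# Mean and standard error of the mean for each sex
means <- tapply(DAT$parasites, DAT$SEX, mean)
ses <- tapply(DAT$parasites, DAT$SEX, function(v) sd(v) / sqrt(length(v)))

# barplot() returns the x positions of the bar midpoints
mids <- barplot(means,
                ylim = c(0, max(means + ses) * 1.1),
                ylab = "Mean parasite load")

# Flat-headed arrows drawn in both directions act as error bars
arrows(mids, means - ses, mids, means + ses,
       angle = 90, code = 3, length = 0.1)
```

Swapping the ses calculation for a standard deviation or a confidence interval half-width changes what the bars represent without touching the plotting code.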

Related

Multiple boxplots in one graph, R

I'm working with a dataset where I have one continuous variable (V1) and want to see how that variable differs depending on demographics such as sex, age group etc.
I would like to do one graph that contains multiple boxplots - so that V1 is on the y-axis and all my demographic variables (sex, age groups etc.) are on the x-axis with their corresponding p-values. Does anyone know how to do this in R?
I've added two photos to illustrate my dataset and the output I want.
Thanks!
Output example
Data example
It would be nice to have actual data and the code you already have so we can replicate what you have and work what you want. That being said, this link might be what you are looking for:
https://statisticsglobe.com/draw-multiple-boxplots-in-one-graph-in-r#example-2-drawing-multiple-boxplots-using-ggplot2-package
Scroll down about half way to Example 4: Drawing Multiple Boxplots for Each Group Side-by-Side

How would you create categorical "bins" for a boxplot over time in R?

Been working on this and haven't been able to find a decent answer.
Basically, I've got a dataset of NBA player height vs draft year, and I am trying to create a boxplot to show how player height has changed over time (this is for a hw assignment, so a boxplot is necessary). My dataset (nba_data) looks like the table below, but I have 10k rows ranging from players drafted in the 60s all the way to the 2000s.
player_name  draft_year  height_in
player_a     1998        76
player_b     1972        81
player_c     2012        79
So far the closest I've gotten is
ggplot(data = nba_data, aes(x = draft_year,
                            y = height_in,
                            group = cut(x = draft_year, breaks = 5))) +
  geom_boxplot()
And this is the result I get. As far as I understand, breaks being set to 5 should separate my years into 5 year buckets, right?
I created the same graph in Excel to get an idea of what it should look like:
I also attempted to create categories with cut, but was unable to apply it to my boxgraph. I mostly code in Python, but have to learn R for a class at school - any help is greatly appreciated.
Thanks!
Edit: Another question I guess would be how the "Undrafted" players would fit into this, since R seems to want to coerce the draft_year column as numerical to fit into a box plot.
From the ?cut help page, the breaks argument is:
breaks
either a numeric vector of two or more unique cut points or a single number (greater than or equal to 2) giving the number of intervals into which x is to be cut.
You gave it a single number, so that's interpreted as the number of intervals.
Instead, you should give it a vector of exact breakpoints, something like breaks = seq(1960, 2020, by = 5).
I'm surprised you think your axis is being numericized--it's definitely a continuous axis, but I've never heard of ggplot doing that to a string or factor input--check your data frame to make sure the "Undrafted" rows are really there, they might have gotten dropped or converted to NA at some point. But that's a good thing for cut, because cut will only work on numerics. I'd suggest cutting the column as numeric to create a bin column, and then replace NAs in the bin column with "Undrafted".
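The suggestion above (cut on explicit breakpoints, then relabel the NAs) can be sketched as follows. This assumes draft_year is stored as text because of the "Undrafted" entries; the four sample rows and the draft_bin column name are made up for illustration:

```r
library(ggplot2)

# Hypothetical sample mirroring the question's columns
nba_data <- data.frame(
  player_name = c("player_a", "player_b", "player_c", "player_d"),
  draft_year = c("1998", "1972", "2012", "Undrafted"),
  height_in = c(76, 81, 79, 78)
)

# Non-numeric entries such as "Undrafted" become NA here on purpose
year_num <- suppressWarnings(as.numeric(nba_data$draft_year))

# Explicit breakpoints give 5-year bins; right = FALSE makes them [1960,1965), ...
bins <- cut(year_num, breaks = seq(1960, 2020, by = 5),
            right = FALSE, dig.lab = 4)

# Add an "Undrafted" level and assign it to the NA rows
levels(bins) <- c(levels(bins), "Undrafted")
bins[is.na(bins)] <- "Undrafted"
nba_data$draft_bin <- bins

# One box per bin, with "Undrafted" as its own category
p <- ggplot(nba_data, aes(x = draft_bin, y = height_in)) +
  geom_boxplot()
print(p)
```

Putting the bin on the x-axis as a factor, rather than only in group, keeps "Undrafted" on the same axis as the year bins.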
If you don't mind using a package, you can get the effect you want with:
library(santoku)
ggplot(..., aes(..., group = chop_width(draft_year, 5)))

How to create a violin plot (or boxplot) for ordinal values in R?

I am trying to create a violin plot for a huge dataset. I have an ordinal scaled variable for age groups (10-19, 20-29, 30-39, 40-49, and so on) and nominal variables for, let's say, drug use (YES/NO), medication intake (YES/NO), alcohol intake (YES/NO), and gaming > 6 hours per day (YES/NO).
My goal is to create a boxplot (with R) with Age Groups on the x-axis and the different items on the y-axis. The x-axis would look like: 10-19, 20-29, 30-39, 40-49 years old, while the y-axis is just a stacking of different items that are independent of each other.
Mainly, I am interested in how to connect the ordinal scaled variable "age groups" and one single dichotomized item "drug use". It is worth mentioning that I only want to focus on people that answered YES. Therefore I created a new subset.
My code looks like this:
drug_age <- (drug_data$age) #drug data = people that answered YES
data$age <- as.factor(data$age)
ggplot(drug_data, aes(x = age, y = drugs, fill = drugs)) +
  geom_violin()
As you can tell, I am lost at this point. My variables are factorized but at this point I don't really care about code, but more about logic.
My goal is to create a graphic that shows that drug use is more common among 20-29 year olds than 10-19 year olds, for instance.
You don't have to write code for me or anything, it would just be super helpful to give me some hints.
Thank you so much for your help!
Gertie

Changing plotting order of points in R / ggplot2

I have the following code to plot a large dataset (450k) in ggplot2
x<-ggplot()+
geom_point(data=data_Male,aes(x=a,y=b),color="Turquoise",position=position_jitter(w=0.2,h=1),alpha=0.1,size=.5,show.legend=TRUE)+
geom_point(data=data_Female,aes(x=a,y=b),color="#FF9999",position=position_jitter(w=0.2,h=1),alpha=0.1,size=.5,show.legend=TRUE)+
theme_bw()
x<-x+geom_smooth(data=data_Male,aes(x=a,y=b,alpha="Male"),method="lm",colour="Blue",linetype=1,se=T)+
geom_smooth(data=data_Female,aes(x=a,y=b,alpha="Female"),method="lm",colour="Dark Red",linetype=5,se=T)+
geom_smooth(data=data_All,aes(x=a,y=b,alpha="All"),method="lm",colour="Black",linetype=3,se=T)+
scale_fill_discrete(name="Key",labels=c("Female","Male","All"))+
scale_colour_discrete(name="Plot Colour",labels=c("Female","Male","All"))+
scale_alpha_manual(name="Key",
values=c(1,1,1),
breaks=c("Female","Male","All"),
guide=guide_legend(override.aes=list(linetype=c(5,1,3),name="Key",
shape=c(16,16,NA),
color=c("Dark Red","Blue","Black"),
fill=c("#FF9999","Turquoise",NA))))
How can I change the order in which points are plotted? I have seen answered questions here dealing with a single dataframe but I am working with several dataframes so I cannot re-order the rows or ask ggplot to plot by certain criteria from within the dataframe. You can see an example of the kind of problem that this causes in the attached picture: the Female points are plotted on top of the Male points. Ideally I would like to be able to plot all the points in a random order, so that one "cloud" of points is not plotted on top of the other, obscuring it (N.B. the image shown doesn't include the "All" line).
Any help would be appreciated. Thank you.
I believe this is not possible while the points are in separate data frames. The following should work, though:
You'd have to combine the two data frames into a single one, df (for example with rbind). The new data frame will be sorted by male and female.
You can then shuffle the new data frame:
set.seed(42)
rows <- sample(nrow(df))
male_female_mixed <- df[rows, ]
Then you can plot male_female_mixed
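Put together, the approach might look like the sketch below. It assumes both data frames share the columns a and b from the question; the sex column is a hypothetical addition so that colour can still be mapped after combining, and the data are simulated:

```r
library(ggplot2)

# Simulated stand-ins for data_Male and data_Female
set.seed(42)
data_Male <- data.frame(a = rnorm(100), b = rnorm(100))
data_Female <- data.frame(a = rnorm(100), b = rnorm(100))

# Tag each data frame with its group before stacking them
data_Male$sex <- "Male"
data_Female$sex <- "Female"
df <- rbind(data_Male, data_Female)

# Shuffle the rows so neither group is systematically drawn last
male_female_mixed <- df[sample(nrow(df)), ]

# A single point layer draws rows in data order, so the groups interleave
p <- ggplot(male_female_mixed, aes(x = a, y = b, colour = sex)) +
  geom_point(alpha = 0.1, size = 0.5,
             position = position_jitter(width = 0.2, height = 1)) +
  scale_colour_manual(values = c(Male = "turquoise", Female = "#FF9999")) +
  theme_bw()
print(p)
```

The geom_smooth layers from the question can still be added on top using the original separate data frames.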

What is the simplest way for making a grouped bar chart that calculates the mean of a category in R?

So I have imported call center data from a csv file into R.
flows = read.csv("data.csv")
There are two important columns to me:
name
duration
I am trying to create a bar chart that calculates the average duration of the call for a group, which is divided up by the variable name. Essentially, the chart displays which types of calls have the highest average duration.
There are also about 50 different names, so if I could limit the chart to the top 5/10 that would be ideal. Sorry if this is a simple problem, appreciate any help in advance.
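This should work (it needs the dplyr package):
library(dplyr)
flows_mean <- flows %>%
  group_by(name) %>%
  dplyr::summarize(Mean = mean(duration, na.rm = TRUE))
After this, you probably want to sort it by mean duration, highest first, and keep the first 5 rows.
flows_mean <- flows_mean[order(flows_mean$Mean, decreasing = TRUE), ]
flows_mean <- head(flows_mean, 5)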
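For a version without dplyr that also draws the chart, the same steps (mean per name, sort, keep the top few, plot) might look like this in base R. The sample rows are invented; only the name and duration columns come from the question:

```r
# Hypothetical stand-in for read.csv("data.csv")
flows <- data.frame(
  name = c("billing", "billing", "sales", "sales", "support", "support"),
  duration = c(120, 180, 60, 90, 300, 240)
)

# Mean duration per call type
avg <- aggregate(duration ~ name, data = flows, FUN = mean)

# Longest average first, then keep at most the top 5
avg <- avg[order(avg$duration, decreasing = TRUE), ]
top <- head(avg, 5)

# las = 2 turns the labels so longer names stay readable
barplot(top$duration, names.arg = top$name, las = 2,
        ylab = "Mean call duration")
```

Changing the 5 in head() to 10 gives the top ten instead.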
