Boxplot with many variables and categories - r

Is it possible to create something like this in R?
I have 7 different variables that I want to include for product A, and the same 7 for the rest of the products (B, C, ...).
However, I also want to include the summary values (min, mean and max).
How can I create this?
I already have all the different variables as a "Value".
I was trying something like
protein~product
but I want this for all variables within product AAA, and if possible the same for all products (I don't know if that will be possible given the number of variables).
This is part of the data:
product protein fat moisture ash fiber starch sugar
AAA 49 1.0 NA NA 10 7.4 6.1
BBB 35 1.6 NA NA 10.6 8.5 10.0
AVF 40 1.2 NA NA 6 7.8 6.3
Thank you!

You can start your adventure with this example.
EDIT: I added how to get from your data format to the long format required for the plot.
Also find more info at similar questions:
Plot multiple boxplot in one graph
# load the packages first: %>%, spread() and gather() come from dplyr/tidyr
library(ggplot2)
library(dplyr)
library(tidyr)
# simulate the data
set.seed(314)
id <- rep(1:100, each = 3)
prod <- paste("product", rep(letters[1:3], each = 300))
ing <- rep(c('protein', 'fat', 'starch'), 300)
mg <- rnorm(900, 5, 2)
df <- data.frame(prod, ing, mg, id)
# reconstruct your data format (wide: one column per ingredient)
yourdata <- df %>% group_by(id, prod) %>% spread(ing, mg)
# get your format in long format
pd <- yourdata %>% gather(ing, mg, -id, -prod)
# use the long format for the plot
ggplot(pd, aes(x = ing, y = mg, fill = ing)) + geom_boxplot() +
  facet_grid(~prod)
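The question also asks for the summary values (min, mean and max). A minimal sketch of one way to get them, assuming the three sample rows from the question and using pivot_longer (the newer tidyr replacement for gather). The names nutrients, long and summary_tbl are made up for this illustration, and variables that are entirely NA (moisture, ash) are simply dropped:

```r
library(dplyr)
library(tidyr)

# The three sample rows from the question (the real data has more rows).
nutrients <- data.frame(
  product  = c("AAA", "BBB", "AVF"),
  protein  = c(49, 35, 40),
  fat      = c(1.0, 1.6, 1.2),
  moisture = NA_real_,
  ash      = NA_real_,
  fiber    = c(10, 10.6, 6),
  starch   = c(7.4, 8.5, 7.8),
  sugar    = c(6.1, 10.0, 6.3)
)

# Wide -> long: one row per product/variable pair, dropping missing values.
long <- nutrients %>%
  pivot_longer(cols = -product, names_to = "variable",
               values_to = "value", values_drop_na = TRUE)

# Min, mean and max of every variable across products.
summary_tbl <- long %>%
  group_by(variable) %>%
  summarise(min_value  = min(value),
            mean_value = mean(value),
            max_value  = max(value),
            .groups = "drop")

summary_tbl
```

With more than one row per product you could group by product and variable instead, and pass the long table to the ggplot call above.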

Related

using a loop to plot multiple boxplots in ggplot2, long dataframe

Here's my data:
head(df)
FY Analyte Value
<fct> <fct> <dbl>
1 2007-08 CONF(G) 634
2 2007-08 PH(G) 7.8
3 2007-08 TEMP(G) 24.8
4 2007-08 UHS(G) 2.5
5 2007-08 FC(G) 0.5
6 2007-08 CBOD(C) 1
My dataset is a long df, spanning 10 years. I want to create multiple ggplots (of each Analyte) where the x axis is FY (financial year) and the y axis is Value. Ideally the Y axis title would also change based on the variable being plotted.
I've seen a few reproducible chunks of code in my search to do this but none of them seem to apply to a long dataframe (where I want to loop through each level of the Analyte variable). I also want it to save to my working directory (possibly using the png and dev.off() functions).
Anyone know a solution?
Thanks!
Split the data by Analyte and use map to save each plot as a separate image.
library(tidyverse)
df %>%
  group_split(Analyte) %>%
  map(~{
    # Analyte is a factor, so convert it to text for the title and file name
    analyte_name <- as.character(.$Analyte[1])
    tmp <- ggplot(., aes(FY, Value)) + geom_boxplot() + ggtitle(analyte_name)
    # writes e.g. "PH(G).png" to the working directory
    ggsave(paste0(analyte_name, '.png'), tmp)
  })
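The question also wanted the y-axis title to follow the analyte. A small sketch of that, assuming made-up data in the same shape as the question (FY, Analyte, Value). The names water, plots and save_analyte_plots are hypothetical, not from the original answer:

```r
library(ggplot2)

# Made-up long data in the same shape as the question.
set.seed(42)
water <- data.frame(
  FY      = factor(rep(c("2007-08", "2008-09"), each = 10, times = 2)),
  Analyte = factor(rep(c("PH(G)", "TEMP(G)"), each = 20)),
  Value   = c(rnorm(20, mean = 7.8, sd = 0.2), rnorm(20, mean = 24.8, sd = 1.5))
)

# One boxplot per analyte; the analyte name doubles as the y-axis title.
by_analyte <- split(water, water$Analyte, drop = TRUE)
plots <- lapply(names(by_analyte), function(analyte_name) {
  ggplot(by_analyte[[analyte_name]], aes(x = FY, y = Value)) +
    geom_boxplot() +
    labs(title = analyte_name, x = "Financial year", y = analyte_name)
})
names(plots) <- names(by_analyte)

# Hypothetical helper: write each plot to "<analyte>.png" inside `dir`.
save_analyte_plots <- function(plots, dir = ".") {
  for (analyte_name in names(plots)) {
    ggsave(file.path(dir, paste0(analyte_name, ".png")),
           plot = plots[[analyte_name]], width = 6, height = 4)
  }
  invisible(names(plots))
}
```

Calling save_analyte_plots(plots) would then write one PNG per analyte to the working directory. Keeping the plots in a named list first makes it easy to inspect one (for example plots[["PH(G)"]]) before saving anything.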

Column titles contain sample info how do i split them into two columns in R

I am trying to clean up some data frames into a more useful format. I am running RStudio 1.3.1093 and R 3.5.3.
My data frame looks like this:
Peptide  5C_T6m  5C_T12m
PEP      0.5     1.1
TIDE     0.6     1.2
and I am trying to convert it to:
Peptide  Temp  Timepoint  abundance
PEP      5     6          0.5
TIDE     5     6          0.6
PEP      5     12         1.1
TIDE     5     12         1.2
I can't visualize in my head how to move between the two in a stepwise approach.
I'm new to R and have done some data reshaping using the tidyverse, but this seems to require multiple steps, and it is hard for me to visualise the individual steps.
Any help, either with just the steps I would need to take or with some code suggestions, would be great.
Thanks!
The function pivot_longer is very useful in cases like this:
library(tidyr)
# each capture group in names_pattern fills one of the names_to columns
df %>% pivot_longer(cols = !Peptide,
                    names_to = c("Temp", "Timepoint"),
                    names_pattern = "(.*)C_T(.*)m",
                    values_to = "abundance")
Here is a possible solution, as I understand what you wanted.
library(tidyverse)
df <- data.frame(Peptide = c("PEP","TIDE"), C5_T6m = c(0.5,0.6), C5_T12m = c(1.1,1.2))
dt <- df %>%
gather( Timepoint, abundance, 2:3) %>%
mutate(Temp = str_extract(Timepoint,"5")) %>%
mutate(Timepoint = str_extract(Timepoint,"6|12")) %>%
select(Peptide,Temp,Timepoint,abundance)
Result
> dt
Peptide Temp Timepoint abundance
1 PEP 5 6 0.5
2 TIDE 5 6 0.6
3 PEP 5 12 1.1
4 TIDE 5 12 1.2
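For comparison, here is a sketch of the pivot_longer approach run on the example from the question, assuming the column names really start with a digit (hence check.names = FALSE). The names wide and long are made up here, and names_transform is only added so that Temp and Timepoint come out as integers rather than text:

```r
library(tidyr)

# The example table from the question; check.names = FALSE keeps "5C_T6m" as-is.
wide <- data.frame(
  Peptide   = c("PEP", "TIDE"),
  `5C_T6m`  = c(0.5, 0.6),
  `5C_T12m` = c(1.1, 1.2),
  check.names = FALSE
)

# Each capture group in names_pattern fills one of the names_to columns.
long <- pivot_longer(
  wide,
  cols            = !Peptide,
  names_to        = c("Temp", "Timepoint"),
  names_pattern   = "(\\d+)C_T(\\d+)m",
  names_transform = list(Temp = as.integer, Timepoint = as.integer),
  values_to       = "abundance"
)

long
```

Note that pivot_longer returns the rows peptide by peptide (PEP 6, PEP 12, TIDE 6, TIDE 12), so sort by Timepoint afterwards if you need the exact row order shown in the question.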

R time series multiple lines plot

I have a very big dataset that I'd like to illustrate using plotly in R.
A sample of my dataset is shown below:
> new_data_2
# Groups: newdatum [8]
date activity totaal
<date> <fct> <int>
1 2019-11-21 N11 144
2 2019-09-22 N11 129
3 2019-05-15 N22 117
4 2019-01-23 N22 12
5 2019-07-04 N22 12
6 2019-07-18 N22 12
...
For every activity I want to display the amount (totaal) per date (date) in a time series plot.
Somehow I don't get it right in R. Somehow I need to group my activity to display, but I can't figure it out.
new_data_2 %>%
group_by(activity) %>%
plot_ly(x=new_data_2$newdatum) %>%
add_lines(y=~new_data_2$totaal, color = ~factor(newdatum))
This displays an empty plot, without the 'activity' on the left side.
What I want to achieve is:
You're on the right track, but after the group_by() you need to tell R to do something to the groups.
new_data_2 %>%
group_by(activity, date) %>% # use two groupings since you want by activity & date
summarise(totaal_2 = sum(totaal))
That should get to the dataframe you're looking for. You can use ggplot & plotly on it from there.
I would recommend reshaping the data first (as above), saving it as a new object, and then graphing it. Doing it this way helps you see each step along the way. Pipes %>% are great, but can make each step difficult to see.
This might not be obvious at first, but the structure of your data is ideal for plotting multiple time series. You don't even need the group_by function. Your dataset seems to have a long format, where the dates in the date column and the names in the activity column are not unique, but there is only one value per activity and date.
Given the correct specifications, plot_ly() will group your data using color=~activity like this: p <- plot_ly(new_data_2, x = ~date, y = ~totaal, color = ~activity) %>% add_lines(). Since you haven't provided a data sample that is large enough, I'll use the economics_long dataset that ships with ggplot2 to show you how you can do this. First of all, notice how the structure of my sampled dataset matches yours:
date variable value
1 1967-07-01 psavert 12.5
2 1967-08-01 psavert 12.5
3 1967-09-01 psavert 11.7
4 1967-10-01 psavert 12.5
5 1967-11-01 psavert 12.5
6 1967-12-01 psavert 12.1
...
Plot:
Code:
library(plotly)
library(dplyr)
# data
data("economics_long")
df <- data.frame(economics_long)
# keep only some variables that have values on a comparable level
df <- df %>% filter(!(variable %in% c('pop', 'pce', 'unemploy')))
# plotly time series
p <- plot_ly(df, x = ~date, y = ~value, color = ~variable) %>%
add_lines()
# show plot
p
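Applied to the asker's own columns, the same idea might look like the sketch below. It assumes only the six sample rows shown in the question (rebuilt by hand as a plain data frame), so the real plot would of course have more points per line:

```r
library(plotly)

# The six sample rows from the question, rebuilt as a plain data frame.
new_data_2 <- data.frame(
  date     = as.Date(c("2019-11-21", "2019-09-22", "2019-05-15",
                       "2019-01-23", "2019-07-04", "2019-07-18")),
  activity = factor(c("N11", "N11", "N22", "N22", "N22", "N22")),
  totaal   = c(144L, 129L, 117L, 12L, 12L, 12L)
)

# color = ~activity splits the data into one line per activity.
p <- plot_ly(new_data_2, x = ~date, y = ~totaal, color = ~activity) %>%
  add_lines()

p
```

The formula syntax (~date, ~totaal) is what lets plot_ly look the columns up in the data frame. Passing new_data_2$totaal directly, as in the question, bypasses that lookup.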

Pivoted data with a column describing some function of other columns

I am trying to pivot some data such that I retrieve (1) the total of some measurement for two+ groups, and then (2) that measurement divided by the # of observations in that group. I have achieved (1) but not (2). Below is an example output I desire:
grouping measurement_total group_size mean
1 1 301 60 5.0
2 2 215 40 5.4
Let some data be:
> grouping <- c(1,2,1,1,2)
> measurement <- sample(rnorm(1,10),100, replace=TRUE)
> dataframe <- cbind(grouping, measurement)
To create the pivot, I used aggregate. I then used a cbind to get the # of observations per group:
> aggregate(cbind(measurement,1) ~ grouping, data=dataframe, FUN=sum)
grouping measurement V2
1 1 301 60
2 2 215 40
I now need to create "V3" which would be { measurement / V2 } such that I achieve the result. NB I can get the mean only by using FUN=mean, but this means I cannot also get the group size.
> aggregate(cbind(measurement,1) ~ grouping, data=dataframe, FUN=mean)
grouping V2(# obs.) mean
1 1 1 5.0
2 2 1 5.4
What are some options for achieving this simply, ideally with a single line? I.e. I could obtain the two tables separately and merge the two, but it's a little long-winded.
Thanks
John
You can use dplyr to do this fairly easily
library(dplyr)
dataframe <- data.frame(dataframe) # Convert to dataframe
dataframe %>%
group_by(grouping) %>%
mutate(measurement_total = sum(measurement)) %>%
mutate(group_size = length(measurement)) %>%
mutate(mean = mean(measurement)) %>%
filter(row_number()==1) %>%
select(-measurement)
Of course, the easy way to do it in base R would be:
df <- aggregate(cbind(measurement,1) ~ grouping, data=dataframe, FUN=sum)
df$mean <- df$measurement/df$V2
But if you're going to be doing dataframe manipulation, it might be a good idea to get into dplyr
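If you do go the dplyr route, summarise() collapses each group to a single row, so the mutate/filter steps are not needed. A minimal sketch, assuming made-up data with the same 60/40 group sizes as in the question (the names measurements and result are invented for this example):

```r
library(dplyr)

# Made-up data: 100 measurements, 60 in group 1 and 40 in group 2.
set.seed(1)
measurements <- data.frame(
  grouping    = rep(c(1, 2, 1, 1, 2), times = 20),
  measurement = rpois(100, lambda = 5)
)

# One row per group: total, number of observations, and their ratio (the mean).
result <- measurements %>%
  group_by(grouping) %>%
  summarise(
    measurement_total = sum(measurement),
    group_size        = n(),
    mean              = mean(measurement)
  )

result
```

Here n() counts the rows in each group, which replaces the cbind(measurement, 1) trick used with aggregate.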

R: graph multiple columns on one line

This seems simple, but I've tried multiple variations of matplot, ggplot2, regular old plot...I can't get any to do what I need.
I have a gigantic dataframe of years, months, and observations. I simplified it down to number of observations per month, per year, see below. I'm not sure why it read in with the "X" in front of each column heading, but if it's not going to affect the code, right now I don't care.
head(storms)
X Month X1992 X1993 X1994
1 1 1 2 1
2 2 2 4 1
3 3 3 26 10
4 4 4 47 26
5 5 5 969 615
The full (simplified) set is 10 columns of years (1992-2001), each with 12 months/rows of totals (1 storm in Jan 1992, 26 storms in March 1993...). I need simply to plot these all on an x-axis 120 months long, # of observations per month on the y-axis. It could be a line or bars or vertical lines. I've seen many ways to plot 20 lines with 12 months on the x-axis; that is not what I'm going for. I also need to label the years every 12 months, but I think I can figure that out after I get this block out of the way.
In other words (I hope this is more clear if the previous is not):
y axis: # of storms, ylim=c(0-1000)
x axis: 10 sets of months (Jan-Dec, 1992-2001, 120 months total). The only labels will be the years, every 12 months of course.
I know I'm just thinking about it wrong, could someone please set my head straight?
(first post; please also tell me if I'm not formatting or inquiring properly!)
Is this something you are looking for? If I am not mistaken, you may want to rearrange your data frame: you want to make it longer rather than wider. Then you can draw a figure. Since you have 120 months, you may need to think about plot space, but at least this example lets you move forward. I hope this helps.
library(tidyr)
library(ggplot2)
# Create sample data
month <- rep(c(1:12), each = 1, times = 2)
nintytwo <- runif(24, 0, 20)
nintythree <- runif(24, 0, 20)
# Create a data frame
ana <- data.frame(month, nintytwo, nintythree)
# Make the data longer rather than wider.
bob <- gather(ana, year, value, -month)
bob$month <- as.factor(bob$month)
# Draw a figure
cathy <- ggplot(bob, aes(x = year, y = value, fill = month)) + geom_bar(stat = "identity", position = "dodge")
cathy
Here's an example using base R:
# create example data
set.seed(123)
df <- data.frame(Month=1:12)
for(y in 1992:2001){
tmp <- data.frame(X=as.integer(abs(rnorm(12,mean=2,sd=10))))
colnames(tmp) <- paste("X",y,sep="")
df <- cbind(df,tmp)
}
# reshape to long format (one column with n.of storms, and period columns)
long <- reshape(df[,-1], idvar="Month", ids=df$Month,
times=names(df[,-1]), timevar="Year",
varying = list(names(df[,-1])),
direction = "long",v.names="Storms")
# remove the "X" from the year
long$Year <- substr(long$Year,2,nchar(long$Year))
nYears <- length(unique(long$Year))
# plot the line
plot(x=1:nrow(long),y=long$Storms,type="l",
xaxt="n",main="Monthly Storms",
xlab="Period",ylab="Storms",col="RoyalBlue")
# add custom labels
axis(1,at=((1:nYears)*12)-6,labels=unique(long$Year))
# add vertical lines
abline(v=c(0.5,((1:nYears)*12)+0.5),col="Gray80",lty=2)
Result :
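The same 120-month axis can also be sketched with tidyr and ggplot2 by turning each year/month pair into a real date. This assumes simulated counts in the layout from the question (a Month column plus X1992 to X2001); the names storms_wide, long and p are made up for this example:

```r
library(tidyr)
library(ggplot2)

# Simulated counts in the same wide layout as the question.
set.seed(123)
storms_wide <- data.frame(Month = 1:12)
for (y in 1992:2001) {
  storms_wide[[paste0("X", y)]] <- rpois(12, lambda = 20)
}

# Wide -> long, stripping the "X" prefix from the year columns.
long <- pivot_longer(storms_wide, cols = -Month, names_to = "Year",
                     names_prefix = "X", values_to = "Storms")

# Build a date for the first day of each month and sort chronologically.
long$Date <- as.Date(sprintf("%s-%02d-01", long$Year, long$Month))
long <- long[order(long$Date), ]

# One continuous line over 120 months, labelled once per year.
p <- ggplot(long, aes(x = Date, y = Storms)) +
  geom_line(colour = "royalblue") +
  scale_x_date(date_breaks = "1 year", date_labels = "%Y") +
  labs(title = "Monthly Storms", x = "Period", y = "Storms")

p
```

Using a Date column lets scale_x_date place the year labels, so there is no need to compute the tick positions by hand.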

Resources