Looping through variables to make many boxplots - r

I am using the npx_data1 dataset from the OlinkAnalyze package and I am trying to make box plots.
install.packages("OlinkAnalyze")
library(OlinkAnalyze)
df = npx_data1
the code for the boxplot is:
plot <- df %>%
na.omit() %>% # removing missing values which exists for Site
olink_boxplot(variable = "Site",
olinkid_list = c("OID01216", "OID01217"),
number_of_proteins_per_plot = 2)
plot[[1]]
It takes values from the OlinkID column. What I would like is to loop through the column, taking the next two OlinkIDs at a time, to make box plots, naming each plot as I go (e.g. plot 1 with OID01216 and OID01217, and plot 2 with OID01218 and OID01219).

I used a while loop.
install.packages("OlinkAnalyze")
library(OlinkAnalyze)
df = npx_data1
i <- 1
ids <- as.data.frame(unique(df$OlinkID))
while(i <= nrow(ids)){
print(i)
x <- i+1
temp <- ids[i:x,]
plotx <- df %>%
na.omit() %>% #
olink_boxplot(variable = "Site",
olinkid_list = c(paste(c(ids[i,],ids[x,]))),
number_of_proteins_per_plot = 2)
plottemp <- assign(paste0("plot_",ids[i,],"_",ids[x,]),plotx)
i <- i+2
}

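If you want a loop, you could write it like this:
for (i in seq(from = 1, to = length(data$OlinkID), by = 2)) {
# the plot code goes here
}
This way you can access the two IDs you want with data$OlinkID[i] and data$OlinkID[i+1]. This assumes data$OlinkID holds each ID only once; if the IDs repeat across rows, loop over unique(data$OlinkID) instead.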
So the boxplot code should be
plot <- data %>%
na.omit() %>%
olink_boxplot(variable = "Oxwatchtime",
olinkid_list = c(data$OlinkID[i],data$OlinkID[i+1]),
number_of_proteins_per_plot = 2)
If you want to save the plots, add a ggsave() or a png()/pdf() in the loop to save them externally or create a list with them using ggarrange() function from ggpubr package. Let me know if it works as you intended.
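As a rough sketch of the pairing step only: the IDs below are typed in by hand for illustration, and the olink_boxplot() call is left commented out so the snippet runs without the package. The IDs can be split into groups of two, and each group given a name to store its plot under:

```r
# Hand-typed example IDs; in practice these would be unique(df$OlinkID)
ids <- c("OID01216", "OID01217", "OID01218", "OID01219", "OID01220")

# Group consecutive IDs in pairs: positions 1-2 -> group 1, 3-4 -> group 2, ...
pairs <- split(ids, ceiling(seq_along(ids) / 2))

# Name each group after the IDs it holds, e.g. "plot_OID01216_OID01217"
names(pairs) <- paste0("plot_", vapply(pairs, paste, character(1), collapse = "_"))

print(names(pairs))

# Assumed usage, mirroring the call from the question (not run here):
# plots <- lapply(pairs, function(p) {
#   olink_boxplot(df, variable = "Site", olinkid_list = p,
#                 number_of_proteins_per_plot = 2)[[1]]
# })
```

With an odd number of IDs the last group simply holds a single ID, so nothing indexes past the end of the vector.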

OlinkAnalyze::olink_boxplot() plots several plots until all proteins specified under the olinkid_list argument are plotted. The number_of_proteins_per_plot argument determines the number of IDs plotted on one plot.
Try this:
library(OlinkAnalyze)
data("npx_data1")
ids <- unique(npx_data1$OlinkID)
olink_boxplot(npx_data1,
variable = "Site",
olinkid_list = ids,
verbose = TRUE,
number_of_proteins_per_plot = 2)
The code runs for a while as each plot takes time to be generated. When it completes you can use the arrow buttons in RStudio to look at all the plots.

Related

Rename column within function in r using dplyr

In this case, I have a loop that triggers a function, which in turn triggers a function that collects the data.
One weird thing is that I cannot rename the columns in the dataset it creates, d. Basically I need standardised names so that I can pass in different variables, which means I need to rename the columns during the dplyr transformation. The problem is here: %>% rename(Con = 1, DV = 2). In the dataset I have selected, I want to label the first column Con and the second column DV, so that I can pass it into the CollectDEffect function to run the cohensD analysis. All of this works when I run it line by line, but I want to run the function for all the DVs and create a table, which is why I need to get this working within the loop.
# Function to run analyses and create the dataframe with output
CD_EE_DF <- data.frame("Test" = character())
CollectDEffect = function(cd, d){
excess <- data.frame("Test" = cd,
"Sample Size" = nrow(d),
"Original Cohen's d" = cohensD(d$DV ~ d$Con))
CD_EE_DF <- rbind(CD_EE_DF, excess)
return(CD_EE_DF)
}
# Data transformation, where the error is
CollectDEffect_Trigger = function(DVTest){
# Problem occurs here with the rename
d <- df %>% filter(Gender == "Female", Target_Gender != "") %>% select(Target_Gender, DVTest) %>% rename(Con = 1, DV = 2) %>% na.omit()
CD_EE_DF <- CollectDEffect(paste0("A_",DVTest),d)
}
# Loop that triggers all of the analyses
vec_dv <- c("Status", "warmth")
for (DVTest in vec_dv) {
CD_EE_DF <- CollectDEffect_Trigger(DVTest)
}
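One possible way around the rename, sketched on made-up data: the toy data frame and the helper name select_and_rename are hypothetical, and this is not a confirmed diagnosis of the original error. select() can rename columns while it selects them, and all_of() tells it that dv_name is a character string holding a column name:

```r
library(dplyr)

# Hypothetical toy data standing in for the asker's `df`
toy <- data.frame(
  Gender        = c("Female", "Female", "Male", "Female"),
  Target_Gender = c("M", "F", "M", ""),
  Status        = c(1.5, 2.0, 3.1, 4.2),
  warmth        = c(5, NA, 7, 8)
)

# Hypothetical helper: filter, then select and rename in one step
select_and_rename <- function(data, dv_name) {
  data %>%
    filter(Gender == "Female", Target_Gender != "") %>%
    select(Con = Target_Gender, DV = all_of(dv_name)) %>%
    na.omit()
}

out <- select_and_rename(toy, "Status")
print(out)
```

Because the names Con and DV are assigned inside select(), the separate rename() by position is no longer needed.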

r.squared matrix of predictions vs actual values in R

I want to create a matrix that displays the r.squared coefficient of determination of some predictions made over the years and the actual values.
My goal is to display a matrix that looks something like this.
The only way I found is to make multiple lists, calculate each row/ column individually using map2_dbl(l.predicted_line1, l.actual, ~ summary(lm(.x ~ .y))$r.squared), and then add the resulting vectors in a matrix with some code. This would create 9 lists, which I want to avoid.
Is there any way of doing this more efficiently?
#sample data
l.actual <- list(
overall_15 = c(59,65,73,73,64,69,64,69,63,NA,82,60,NA,73,NA,73,73,NA,69,
69,71,66,65,70,72,72,NA,64,69,67,64,71,NA,62,62,71,67,63,64,76,72),
overall_16 = c(60,68,75,74,68,71,NA,72,64,69,82,66,64,77,NA,71,72,NA,69,
69,75,67,71,73,73,73,NA,66,NA,69,65,70,76,NA,67,71,72,64,65,76,73),
overall_17 = c(63,68,NA,74,72,72,NA,73,66,69,83,67,64,76,NA,71,73,NA,70,
70,79,NA,73,72,NA,NA,NA,NA,NA,70,NA,70,77,NA,68,74,74,66,64,75,69),
overall_18 = c(NA,68,NA,78,73,72,NA,72,68,67,86,NA,62,75,65,71,71,67,71,
71,76,NA,71,71,NA,NA,74,NA,71,NA,NA,68,74,NA,67,75,74,65,NA,72,NA),
overall_19 = c(NA,NA,NA,77,73,72,NA,71,69,66,87,63,62,73,65,NA,NA,NA,NA,
NA,75,NA,NA,67,NA,NA,73,NA,NA,NA,NA,NA,74,NA,NA,74,74,65,NA,68,NA),
overall_20 = c(NA,NA,NA,77,NA,NA,NA,72,71,66,87,NA,NA,NA,65,NA,NA,NA,70,
70,75,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,74,NA,66,71,73,NA,NA,69,NA),
overall_21 = c(NA,67,NA,76,NA,69,NA,73,69,65,85,NA,NA,NA,NA,NA,NA,NA,NA,
NA,75,NA,NA,NA,NA,NA,69,NA,NA,NA,NA,NA,73,NA,67,68,72,NA,NA,68,NA),
overall_22 = c(NA,NA,NA,75,NA,NA,NA,75,67,65,84,NA,NA,NA,NA,NA,NA,NA,68,
68,73,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,67,69,71,NA,NA,68,NA)
)
l.predicted <- list(
potential_15 = c(59,68,74,76,65,75,64,72,66,NA,85,60,NA,76,NA,73,75,NA,71,
71,71,67,65,70,72,72,NA,68,74,67,64,71,NA,62,62,71,71,63,67,78,72),
potential_16 = c(60,71,75,75,68,73,NA,74,66,69,83,66,64,77,NA,71,74,NA,70,
70,76,67,71,73,73,73,NA,66,NA,69,65,70,76,NA,67,71,72,64,66,76,73),
potential_17 = c(63,69,NA,75,72,72,NA,73,69,69,83,67,64,76,NA,71,73,NA,70,
70,79,NA,73,72,NA,NA,NA,NA,NA,70,NA,70,77,NA,68,74,74,66,64,75,69),
potential_18 = c(NA,68,NA,78,73,72,NA,72,69,67,86,NA,62,75,65,71,71,67,71,
71,76,NA,71,71,NA,NA,74,NA,71,NA,NA,68,74,NA,67,75,74,65,NA,72,NA),
potential_19 = c(NA,NA,NA,77,73,72,NA,71,70,66,87,63,62,73,65,NA,NA,NA,NA,
NA,75,NA,NA,67,NA,NA,73,NA,NA,NA,NA,NA,74,NA,NA,74,74,65,NA,68,NA),
potential_20 = c(NA,NA,NA,77,NA,NA,NA,72,71,66,87,NA,NA,NA,65,NA,NA,NA,70,
70,75,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,74,NA,66,71,73,NA,NA,69,NA),
potential_21 = c(NA,67,NA,76,NA,69,NA,73,69,65,85,NA,NA,NA,NA,NA,NA,NA,NA,
NA,75,NA,NA,NA,NA,NA,69,NA,NA,NA,NA,NA,73,NA,67,68,72,NA,NA,68,NA),
potential_22 = c(NA,NA,NA,75,NA,NA,NA,75,67,65,84,NA,NA,NA,NA,NA,NA,NA,68,
68,73,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,67,69,71,NA,NA,68,NA)
)
Here is a solution using some tidyverse packages. The key thing is to use the function expand_grid() to get all combinations of the elements of each list. This results in a tibble with two named list columns. Next we can use mutate() to pull out the names of the list and assign them to new columns, and extract the numeric IDs. Use filter() to retain only the rows where potential is less than or equal to overall. Finally get the R-squared for each row using your suggested code, and plot. (Note I did not try too hard to get the plot to look just like yours.)
library(purrr)
library(dplyr)
library(ggplot2)
library(tidyr)
r_squared_combinations <- expand_grid(l.actual, l.predicted) %>%
mutate(overall = names(l.actual),
potential = names(l.predicted),
overall_n = as.numeric(gsub('overall_', '', overall)),
potential_n = as.numeric(gsub('potential_', '', potential))) %>%
filter(potential_n <= overall_n) %>%
mutate(r_squared = map2_dbl(l.predicted, l.actual, ~ summary(lm(.x ~ .y))$r.squared))
ggplot(r_squared_combinations, aes(x = overall, y = potential, fill = r_squared, label = round(r_squared, 3))) +
geom_tile() +
geom_text(color = 'white')
Side note: incidentally the base function expand.grid() would work about as well as tidyr::expand_grid() but expand_grid() returns a tibble by default which may be more convenient if you are using tidyverse functions otherwise.
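For comparison, here is a minimal base R sketch of the same idea on small made-up lists (the numbers and the helper name r_squared are invented for illustration). Two nested sapply() calls fill the whole matrix at once, without the filter() step used above:

```r
# Small made-up lists with the same shape as l.actual / l.predicted
actual <- list(
  overall_15 = c(1, 2, 3, 4, NA, 6),
  overall_16 = c(2, 4, 5, 9, 10, 12)
)
predicted <- list(
  potential_15 = c(1.1, 2.2, 2.9, 4.5, 5, 6.3),
  potential_16 = c(2, 3, 6, 8, NA, 11)
)

# R-squared of a simple linear fit; lm() drops incomplete pairs by default
r_squared <- function(pred, act) summary(lm(pred ~ act))$r.squared

# Outer sapply() walks the columns (actual), inner one the rows (predicted)
r2_matrix <- sapply(actual, function(act) {
  sapply(predicted, function(pred) r_squared(pred, act))
})

print(round(r2_matrix, 3))
```

The result is an ordinary numeric matrix with the predicted names as row names and the actual names as column names.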

Venn Diagrams with `venneuler` in `R`: delete the name of the set from the plot and add elements name

For the past few days I have been trying to figure out how to draw a Venn diagram from an arbitrary number of sets and came across the R package venneuler. This code:
genes <- paste("gene",1:20,sep="")
df = data.frame(term=character(), class=character(), stringsAsFactors = FALSE)
for (k in 1:15) {
df2=data.frame(term=sample(genes,10), class = rep(paste("class ",k,sep=""),10), stringsAsFactors =
FALSE)
df<-rbind(df,df2)
}
library(rJava)
library(UpSetR)
library(tidyverse)
library(venneuler)
library(grid)
v <- venneuler(df)
par(cex = 0.5)
plot(v)
produces a figure like this:
This figure was just what I was looking for. Anyway, I would like to remove the name of the set (class 1, class 2 etc.) from the plot, and instead add the elements (e.g. gene1, gene2) contained within each set.
You could directly modify v$labels:
library(venneuler)
library(dplyr)
library(stringr)
v <- venneuler(df)
par(cex = 0.5)
# Create the list of genes per class
classgenes <- df %>% group_by(class) %>%
summarize(labels = paste(stringr::str_sort(term,numeric=T),collapse = '\n')) %>%
ungroup
# Order the class new label according to v$labels order
newlabels <- left_join(data.frame(class = v$labels), classgenes)
#> Joining, by = "class"
# Modify the labels
v$labels <- newlabels$labels
plot(v)
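The label-building step can also be sketched in base R on toy data, without venneuler installed. Here set_order is a hypothetical stand-in for v$labels, and plain sort() is used, which, unlike str_sort(numeric = TRUE), puts gene10 before gene2:

```r
# Toy term/class table with the same columns as df
df_toy <- data.frame(
  term  = c("gene2", "gene10", "gene1", "gene3"),
  class = c("class 1", "class 1", "class 2", "class 2")
)

# Hypothetical stand-in for v$labels: the order in which the sets are drawn
set_order <- c("class 2", "class 1")

# One newline-separated string of terms per class
labels_by_class <- tapply(df_toy$term, df_toy$class,
                          function(x) paste(sort(x), collapse = "\n"))

# Reorder to match the drawing order and drop the array attributes
new_labels <- as.character(labels_by_class[set_order])

cat(new_labels, sep = "\n---\n")
```

Indexing by class name does the same job as the left_join() above: it guarantees that each label lands on the circle for its own set.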

R highcharter - Two barchart in same plot with different X-axis

I am trying to do the following:
I have two datasets about my company. The first one has, say, the top 20 growing sellers. The second one has the bottom 20 losing sellers. So, it's something like this:
growing_seller <- c("a","b","c","d","e","f","g","h","i","h")
sales_yoy_growing <- c(100000,90000,75000,50000,37500,21000,15000,12000,10000,8000)
top_growing <- data.frame(growing_seller,sales_yoy_growing)
losing_seller <- c("i","j","k","l","m","n","o","p","q","r")
sales_yoy_losing <- c(-90000,-75000,-50000,-37500,-21000,-15000,-12000,-10000,-8000,-5000)
bottom_losing <- data.frame(losing_seller,sales_yoy_losing)
I am trying to plot both charts in the same plot using DIFFERENT categories, corresponding to the sellers' name. So what I have so far is this:
library(highcharter)
growing_seller <- c("a","b","c","d","e","f","g","h","i","h")
sales_yoy_growing <- c(100000,90000,75000,50000,37500,21000,15000,12000,10000,8000)
top_growing <- data.frame(growing_seller,sales_yoy_growing)
losing_seller <- c("i","j","k","l","m","n","o","p","q","r")
sales_yoy_losing <- c(-90000,-75000,-50000,-37500,-21000,-15000,-12000,-10000,-8000,-5000)
bottom_losing <- data.frame(losing_seller,sales_yoy_losing)
highchart() %>%
hc_add_series(
data = top_growing$sales_yoy_growing,
type = "column",
grouping = FALSE
) %>%
hc_add_series(
data = bottom_losing$sales_yoy_losing,
type = "column"
)
This is what I want to achieve graphically: Chart example
Now, I would like to have a different category array for each independent x-axis: something like having two hc_xAxis controls, where I could specify for each plotted series its own categories.
My final aim is then to have the seller's name shown as I hover over each of the different columns.
Hope I was clear enough :)
Thanks
Highcharts displays the point's name in the tooltip by default. You just need to provide a name value for each point in your data.
You can do it this way:
top_growing <- data.frame(name = growing_seller, y = sales_yoy_growing)
This is the whole code:
library(highcharter)
growing_seller <- c("a","b","c","d","e","f","g","h","i","h")
sales_yoy_growing <- c(100000,90000,75000,50000,37500,21000,15000,12000,10000,8000)
top_growing <- data.frame(name = growing_seller, y = sales_yoy_growing)
losing_seller <- c("i","j","k","l","m","n","o","p","q","r")
sales_yoy_losing <- c(-90000,-75000,-50000,-37500,-21000,-15000,-12000,-10000,-8000,-5000)
bottom_losing <- data.frame(name = losing_seller, y = sales_yoy_losing)
highchart() %>%
hc_add_series(
data = top_growing,
type = "column",
grouping = FALSE
) %>%
hc_add_series(
data = bottom_losing,
type = "column"
)

How can I create a running median of diel cycle from multiyear data?

I think this problem may be of interest to others who deal with data smoothing of long-term environmental variables.
I have a dataset structured as below:
Columns:
Date Hour_Min Y(response variable)
These data are hourly, and I need to create a moving average of the diel cycle, but categorized by the Hour_Min. In other words, if I were to use a 31 day window, for a given day the running average data point for Hour_Min 00:00 would take the average of the day in question with the data points from Hour_Min 00:00 for the previous and the following 15 days. This would then repeat for that day's hour 1:00, etc. through the dataframe.
Unfortunately the data also have many NAs, which is problematic for moving window averages, although I think that can be solved using rollapply from the zoo package.
One approach I tried was to use tidyr's spread function to switch from long to wide format, to create a dataframe like this:
Date Y_Hour_Min_0000 Y_Hour_Min_0100 Y_Hour_Min_0200 etc...
If I could change the format in this way, I could then create new columns of running averages of each Y_Hour_Min_.... column. I would then need to gather everything together back to long format (another task I'm not sure how to approach).
However, I wasn't able to get the spread function to work so that it kept Date as a grouping variable associated with each Y_Hour_Min_.... column.
Another, possibly more elegant solution would be if there is a way to create a single new column in one step, using some combination of rollapply and custom function.
Any thoughts on how to implement code for this task will be greatly appreciated. Below I have a simple code to simulate my dataset:
Simulated data:
### Create vector of hours/dates:
date <- seq(as.POSIXct("2016-01-01 00:00"), as.POSIXct("2016-12-30 23:00"), by="hour")
### Create vector of noisy sine function:
d <- 365
n <- 24*d # number of data points
t <- seq(from = 0, to = 2*d*pi, length.out=24*d)
a <- 6
b <- 1
c.norm <- rnorm(n)
amp <- 3
y <- a*sin(b*t)+c.norm*amp+15
### Randomly insert NAs into data:
ind <- which(y %in% sample(y, 1000))
y[ind]<-NA
### Create test dataframe:
library(dplyr) # for %>% and mutate()
library(tidyr) # for separate()
df <- data.frame(dt = date, y = y) %>%
separate(dt, c("date", "hour_min"), sep=" ") %>%
mutate(date = as.Date(date))
I think this could work:
EDIT: Simplified code by adding fill = NA parameter to rollapply() function as suggested in the comments.
# rollapply() comes from the zoo package
library(zoo)
# add a complete date + time stamp
df$date_time <- paste(df$date, df$hour_min)
# make new column to store median data
df$median_y <- NA
# set rolling median width
width_roll <- 31
# do a rolling median for each hour, one at a time
# add NAs where no median can be calculated
for (i in levels(factor(df$hour_min))) {
df[df$hour_min == i, "median_y"] <- rollapply(df[df$hour_min == i, "y"],
width = width_roll,
median,
na.rm = TRUE,
fill = NA)
}
The approach is just to use the rollapply() function as you suggested, but only on one particular hour at a time. Then each of these is placed back into a new column in turn.
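The same per-hour idea can be sketched without an explicit loop using base R's ave(), which applies a function within each group and puts the results back in their original positions. This is only an illustration on toy data: roll_median is a hand-written stand-in for rollapply(), and the window is 3 days rather than 31 so the numbers are easy to check.

```r
# Hand-written centred rolling median; the ends are left as NA
roll_median <- function(x, width = 3) {
  half <- (width - 1) %/% 2
  n <- length(x)
  out <- rep(NA_real_, n)
  if (n >= width) {
    for (j in (half + 1):(n - half)) {
      out[j] <- median(x[(j - half):(j + half)], na.rm = TRUE)
    }
  }
  out
}

# Toy data: 5 days x 2 hours, in long format, with one missing value
toy <- data.frame(
  date     = rep(as.Date("2016-01-01") + 0:4, each = 2),
  hour_min = rep(c("00:00", "01:00"), times = 5),
  y        = c(1, 10, 2, 20, NA, 30, 4, 40, 5, 50)
)

# ave() runs roll_median separately on the rows of each hour_min value
toy$median_y <- ave(toy$y, toy$hour_min, FUN = roll_median)

print(toy)
```

This relies on the rows already being sorted by date within each hour, as they are in the simulated data.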
Here's an example for just one hour over the whole year, which makes it easier to visualize the median smoothing.
# Examples:
# plot one hour plus rolling median over time
# here i = "23:00:00"
plot(x = as.POSIXct(df[df$hour_min == i, "date_time"]),
y = df[df$hour_min == i, "y"],
type = "l",
col = "blue",
ylab = "y values",
xlab = i)
lines(x = as.POSIXct(df[df$hour_min == i, "date_time"]),
y = df[df$hour_min == i, "median_y"],
lwd = 3)
legend("topleft",
legend = c("raw", "median"),
col = c("blue", "black"),
lwd = 3)
Plot for a single hour
This is for everything (lots of data so not so easy to see but looks like it worked).
# plot all the data
plot(x = as.POSIXct(df$date_time),
y = df$y,
type = "l",
col = "blue",
ylab = "y values",
xlab = "Date")
lines(x = as.POSIXct(df$date_time),
y = df$median_y,
lwd = 3)
legend("topleft",
legend = c("raw", "median"),
col = c("blue", "black"),
lwd = 3)
Plot for all data
I'll take a crack at it, but it's not perfect. Hoping someone can come in and improve on it.
TL;DR:
df2 <- df %>% slice(-7441) %>% spread(hour_min, y)
mov_avg <- function(x) {c(rep(NA, 15), rollapply(x, width = list(-15:15), FUN = mean, align="center", na.rm=T), rep(NA, 15))}
avgs <- as.data.frame(matrix(unlist(lapply(df2[,2:ncol(df2)], mov_avg)), nrow = nrow(df2), byrow = FALSE))
colnames(avgs) <- paste0("avg_", colnames(df2[,2:ncol(df2)]))
final_df <- cbind(df2, avgs) %>%
gather(2:ncol(.), key = "hour_min", value = "value") %>%
arrange(date, hour_min)
In Depth:
Starting at your starting point.. I added set.seed(1) so we can all follow along in tandem.
Your Initial Starting Point:
### Create vector of hours/dates:
set.seed(1)
date <- seq(as.POSIXct("2016-01-01 00:00"), as.POSIXct("2016-12-30 23:00"), by="hour")
### Create vector of noisy sine function:
d <- 365
n <- 24*d # number of data points
t <- seq(from = 0, to = 2*d*pi, length.out=24*d)
a <- 6
b <- 1
c.norm <- rnorm(n)
amp <- 3
y <- a*sin(b*t)+c.norm*amp+15
### Randomly insert NAs into data:
ind <- which(y %in% sample(y, 1000))
y[ind]<-NA
### Create test dataframe:
df <- data.frame(dt = date, y = y) %>%
separate(dt, c("date", "hour_min"), sep=" ") %>%
mutate(date = as.Date(date))
The first thing I did was to do what you suggested and try the wide format. Normally I think this problem would be best solved by using dplyr's group_by on the hour_min column and doing the rolling average there, but I'm not sure how to do that.
The first thing I noticed is that there is a duplicate value for one hour on a given day: there are two observations for 1am, which breaks our spread, so I removed that observation using slice(-7441).
So let's spread your df.
df2 <- df %>% slice(-7441) %>% spread(hour_min, y)
As we can see, the dataframe is now 365 observations long (dates) and 25 columns wide (date + 24 hours):
dim(df2)
[1] 365 25
The next thing I did, which is where this isn't perfect, is use rollapply. When using rollapply we can give it width = list(-15:15). This looks 15 days into the past and 15 days into the future and averages all 31 days together. The problem is that the first 15 days don't have a past 15, and the last 15 days don't have a future 15, so I padded these with NAs. I'm hoping someone can fix this part of my answer.
I created a custom function to do this:
mov_avg <- function(x) {c(rep(NA, 15), rollapply(x, width = list(-15:15), FUN = mean, align="center", na.rm=T), rep(NA, 15))}
If we just do the rollapply we will get a vector of length 335. I padded 15 in front and back to get us to our needed 365.
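The length arithmetic can be checked in base R with embed(), which lays out every complete window as one row of a matrix. This is just a sketch of the bookkeeping on the values 1 to 365, not a replacement for rollapply():

```r
x <- 1:365

# One row per complete 31-value window: 365 - 31 + 1 = 335 rows
windows <- embed(x, 31)
print(nrow(windows))

# Mean of each window, padded with 15 NAs on each side to get back to 365
padded <- c(rep(NA, 15), rowMeans(windows), rep(NA, 15))
print(length(padded))

# The first complete window is centred on day 16
print(padded[14:18])
```

The fill = NA argument of rollapply(), used in the other answer in this thread, does the same padding without the manual rep() calls.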
Next we want to lapply that function across our entire dataframe. That will give us a list of 24 vectors of length 365. We then want to turn that into a dataframe and bind it to our current dataframe.
Lastly we gather all of the columns back into the long format and arrange
avgs <- as.data.frame(matrix(unlist(lapply(df2[,2:ncol(df2)], mov_avg)), nrow = nrow(df2), byrow = FALSE))
colnames(avgs) <- paste0("avg_", colnames(df2[,2:ncol(df2)]))
final_df <- cbind(df2, avgs) %>%
gather(2:ncol(.), key = "hour_min", value = "value") %>%
arrange(date, hour_min)
I hope this helps.

Resources