r.squared matrix of predictions vs actual values in R - r

I want to create a matrix that displays the r.squared coefficient of determination of some predictions made over the years and the actual values.
My goal is to display a matrix that looks something like this.
The only way I found is to make multiple lists, calculate each row/ column individually using map2_dbl(l.predicted_line1, l.actual, ~ summary(lm(.x ~ .y))$r.squared), and then add the resulting vectors in a matrix with some code. This would create 9 lists, which I want to avoid.
Is there a more efficient way of doing this?
#sample data
l.actual <- list(
overall_15 = c(59,65,73,73,64,69,64,69,63,NA,82,60,NA,73,NA,73,73,NA,69,
69,71,66,65,70,72,72,NA,64,69,67,64,71,NA,62,62,71,67,63,64,76,72),
overall_16 = c(60,68,75,74,68,71,NA,72,64,69,82,66,64,77,NA,71,72,NA,69,
69,75,67,71,73,73,73,NA,66,NA,69,65,70,76,NA,67,71,72,64,65,76,73),
overall_17 = c(63,68,NA,74,72,72,NA,73,66,69,83,67,64,76,NA,71,73,NA,70,
70,79,NA,73,72,NA,NA,NA,NA,NA,70,NA,70,77,NA,68,74,74,66,64,75,69),
overall_18 = c(NA,68,NA,78,73,72,NA,72,68,67,86,NA,62,75,65,71,71,67,71,
71,76,NA,71,71,NA,NA,74,NA,71,NA,NA,68,74,NA,67,75,74,65,NA,72,NA),
overall_19 = c(NA,NA,NA,77,73,72,NA,71,69,66,87,63,62,73,65,NA,NA,NA,NA,
NA,75,NA,NA,67,NA,NA,73,NA,NA,NA,NA,NA,74,NA,NA,74,74,65,NA,68,NA),
overall_20 = c(NA,NA,NA,77,NA,NA,NA,72,71,66,87,NA,NA,NA,65,NA,NA,NA,70,
70,75,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,74,NA,66,71,73,NA,NA,69,NA),
overall_21 = c(NA,67,NA,76,NA,69,NA,73,69,65,85,NA,NA,NA,NA,NA,NA,NA,NA,
NA,75,NA,NA,NA,NA,NA,69,NA,NA,NA,NA,NA,73,NA,67,68,72,NA,NA,68,NA),
overall_22 = c(NA,NA,NA,75,NA,NA,NA,75,67,65,84,NA,NA,NA,NA,NA,NA,NA,68,
68,73,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,67,69,71,NA,NA,68,NA)
)
l.predicted <- list(
potential_15 = c(59,68,74,76,65,75,64,72,66,NA,85,60,NA,76,NA,73,75,NA,71,
71,71,67,65,70,72,72,NA,68,74,67,64,71,NA,62,62,71,71,63,67,78,72),
potential_16 = c(60,71,75,75,68,73,NA,74,66,69,83,66,64,77,NA,71,74,NA,70,
70,76,67,71,73,73,73,NA,66,NA,69,65,70,76,NA,67,71,72,64,66,76,73),
potential_17 = c(63,69,NA,75,72,72,NA,73,69,69,83,67,64,76,NA,71,73,NA,70,
70,79,NA,73,72,NA,NA,NA,NA,NA,70,NA,70,77,NA,68,74,74,66,64,75,69),
potential_18 = c(NA,68,NA,78,73,72,NA,72,69,67,86,NA,62,75,65,71,71,67,71,
71,76,NA,71,71,NA,NA,74,NA,71,NA,NA,68,74,NA,67,75,74,65,NA,72,NA),
potential_19 = c(NA,NA,NA,77,73,72,NA,71,70,66,87,63,62,73,65,NA,NA,NA,NA,
NA,75,NA,NA,67,NA,NA,73,NA,NA,NA,NA,NA,74,NA,NA,74,74,65,NA,68,NA),
potential_20 = c(NA,NA,NA,77,NA,NA,NA,72,71,66,87,NA,NA,NA,65,NA,NA,NA,70,
70,75,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,74,NA,66,71,73,NA,NA,69,NA),
potential_21 = c(NA,67,NA,76,NA,69,NA,73,69,65,85,NA,NA,NA,NA,NA,NA,NA,NA,
NA,75,NA,NA,NA,NA,NA,69,NA,NA,NA,NA,NA,73,NA,67,68,72,NA,NA,68,NA),
potential_22 = c(NA,NA,NA,75,NA,NA,NA,75,67,65,84,NA,NA,NA,NA,NA,NA,NA,68,
68,73,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,67,69,71,NA,NA,68,NA)
)

Here is a solution using some tidyverse packages. The key is to use expand_grid() to get all combinations of the elements of each list. This results in a tibble with two named list columns. Next, use mutate() to pull out the names of the lists, assign them to new columns, and extract the numeric year IDs. Use filter() to retain only the rows where the potential year is less than or equal to the overall year. Finally, get the R-squared for each row using your suggested code, and plot. (Note that I did not try too hard to make the plot look just like yours.)
library(purrr)
library(dplyr)
library(ggplot2)
library(tidyr)
r_squared_combinations <- expand_grid(l.actual, l.predicted) %>%
mutate(overall = names(l.actual),
potential = names(l.predicted),
overall_n = as.numeric(gsub('overall_', '', overall)),
potential_n = as.numeric(gsub('potential_', '', potential))) %>%
filter(potential_n <= overall_n) %>%
mutate(r_squared = map2_dbl(l.predicted, l.actual, ~ summary(lm(.x ~ .y))$r.squared))
ggplot(r_squared_combinations, aes(x = overall, y = potential, fill = r_squared, label = round(r_squared, 3))) +
geom_tile() +
geom_text(color = 'white')
Side note: the base function expand.grid() would work about as well as tidyr::expand_grid(), but expand_grid() returns a tibble by default, which may be more convenient if you are otherwise using tidyverse functions.
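One possible shortcut, sketched under the assumption that every R-squared comes from a simple one-predictor lm() as in the question: in that case R-squared equals the squared Pearson correlation, so calling cor() on two matrices returns the whole grid at once. The small matrices below are made-up stand-ins for l.actual and l.predicted, not the question's data.

```r
# Made-up stand-ins for the question's lists, already bound into matrices
actual <- cbind(
  overall_15 = c(59, 65, 73, NA, 64, 69),
  overall_16 = c(60, 68, 75, 74, NA, 71)
)
predicted <- cbind(
  potential_15 = c(59, 68, 74, 76, 65, 75),
  potential_16 = c(60, 71, 75, 75, 68, NA)
)

# Rows = predictions, columns = actual values.
# "pairwise.complete.obs" drops NA rows per pair, as lm() does for each fit.
r2 <- cor(predicted, actual, use = "pairwise.complete.obs")^2

# Cross-check a single cell against the lm() approach from the question
r2_lm <- summary(lm(predicted[, "potential_15"] ~ actual[, "overall_16"]))$r.squared
all.equal(r2["potential_15", "overall_16"], r2_lm)

round(r2, 3)
```

With the lists from the question, do.call(cbind, l.actual) and do.call(cbind, l.predicted) should give matrices of the right shape, assuming all vectors have the same length. Cells for year combinations you do not want can then be set to NA before plotting.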

Related

Looping through variables to make many boxplots

I am using the OlinkAnalyze package and I am trying to make box plots.
install.packages("OlinkAnalyze")
library(OlinkAnalyze)
df = npx_data1
the code for the boxplot is:
plot <- df %>%
na.omit() %>% # removing missing values which exists for Site
olink_boxplot(variable = "Site",
olinkid_list = c("OID01216", "OID01217"),
number_of_proteins_per_plot = 2)
plot[[1]]
It takes values from the OlinkID column. What I would like is to loop through the column, choosing the next two OlinkIDs at a time, to make boxplots, renaming the plot each time (e.g. plot 1 with OID01216 and OID01217, and plot 2 with OID01218 and OID01219).
I used a while loop.
install.packages("OlinkAnalyze")
library(OlinkAnalyze)
df = npx_data1
i <- 1
ids <- as.data.frame(unique(df$OlinkID))
while(i <= nrow(ids)){
print(i)
x <- i+1
temp <- ids[i:x,]
plotx <- df %>%
na.omit() %>% #
olink_boxplot(variable = "Site",
olinkid_list = c(paste(c(ids[i,],ids[x,]))),
number_of_proteins_per_plot = 2)
plottemp <- assign(paste0("plot_",ids[i,],"_",ids[x,]),plotx)
i <- i+2
}
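If you want the loop, you could write it like this:
for (i in seq(from = 1, to = length(data$OlinkID), by = 2)) {
# the plot code
}
This way you can access the two IDs you want with data$OlinkID[i] and data$OlinkID[i+1]. Note that this assumes each ID appears only once in data$OlinkID; if IDs repeat across samples, index into unique(data$OlinkID) instead.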
So the boxplot code should be
plot <- data %>%
na.omit() %>%
olink_boxplot(variable = "Oxwatchtime",
olinkid_list = c(data$OlinkID[i],data$OlinkID[i+1]),
number_of_proteins_per_plot = 2)
If you want to save the plots, add a ggsave() or a png()/pdf() call in the loop to save them externally, or collect them in a list and arrange them with the ggarrange() function from the ggpubr package. Let me know if it works as you intended.
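As a rough sketch of the bookkeeping only (no plotting), the IDs can be split into pairs up front and the results kept in a named list instead of separate variables created with assign(). The ID values and the plot_ naming scheme here are illustrative assumptions; the olink_boxplot() call is left as a comment because it needs the real data.

```r
# Illustrative IDs; in practice this would be unique(df$OlinkID)
ids <- c("OID01216", "OID01217", "OID01218", "OID01219", "OID01220")

# Split into consecutive pairs; an odd ID count leaves a final group of one
pairs <- split(ids, ceiling(seq_along(ids) / 2))

# Name each group after the IDs it contains, e.g. "plot_OID01216_OID01217"
names(pairs) <- paste0("plot_", vapply(pairs, paste, character(1), collapse = "_"))

# One list entry per pair; replace the body with the real plotting call, e.g.
#   olink_boxplot(df, variable = "Site", olinkid_list = p,
#                 number_of_proteins_per_plot = 2)
plots <- lapply(pairs, function(p) p)

names(plots)
```

Each plot is then available by name, for example plots[["plot_OID01216_OID01217"]], and the list can be passed on for saving or arranging.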
OlinkAnalyze::olink_boxplot() plots several plots until all proteins specified under the olinkid_list argument are plotted. The number_of_proteins_per_plot argument determines the number of IDs plotted on one plot.
Try this:
library(OlinkAnalyze)
data("npx_data1")
ids <- unique(npx_data1$OlinkID)
olink_boxplot(npx_data1,
variable = "Site",
olinkid_list = ids,
verbose = TRUE,
number_of_proteins_per_plot = 2)
The code runs for a while as each plot takes time to be generated. When it completes you can use the arrow buttons in RStudio to look at all the plots.

Mutate ifelse on a vector

Let's say I have this data frame:
library(dplyr)
library(tidyr)
set.seed(2)
df <- iris[c(1:5,51:55,101:105),]
df_long <- gather(df, key = "flower_att", value = "measurement",
Sepal.Length, Sepal.Width, Petal.Length, Petal.Width)
df_long$setosa_sub <-sample(5,size = 60, replace = TRUE)
df_long$versicolor_sub <-sample(5,size = 60, replace = TRUE)
df_long$virginica_sub <-sample(5,size = 60, replace = TRUE)
df_long$sub_q<-0
Now I want to copy a value to sub_q variable based on Species variable and sub values.
I know how to do it one by one:
df_long2 <- df_long %>%
mutate(sub_q =ifelse(Species =="setosa", setosa_sub,sub_q)) %>%
mutate(sub_q =ifelse(Species =="versicolor", versicolor_sub,sub_q)) %>%
mutate(sub_q =ifelse(Species =="virginica", virginica_sub,sub_q))
But I can't figure out the right way to apply this over a vector of the Species values instead.
species_vector <- c("setosa","versicolor","virginica")
I'm actually not sure if I need to write a new function or just loop over it somehow. I hope that makes sense.
I don't see anything wrong with the way you are doing it. Another way, using an apply function (sapply in this case) would work like this:
# a helper function to find the right value for the xth row
get_correct_sub <- function(x){
col_name = paste0(df_long$Species[x],'_sub')
df_long[[ col_name ]][x] }
# apply each row index to the helper function
df_long2 = df_long
df_long2$sub_q = sapply(1:nrow(df_long), get_correct_sub)
The helper function adds "_sub" to the species name, treats that as a column name, and then gets the value for that column.
Here is a datastep() solution. I created a vector lookup to map the Species to the desired column, then step through the data row by row and assign the value using the lookup. data is the input dataset and n. is the current row number:
library(libr)
# Create vector lookup
species_vector <- c("setosa" = "setosa_sub", "versicolor" = "versicolor_sub", "virginica" = "virginica_sub")
# Step through data row by row, and assign value using lookup
df_long2 <- df_long %>%
datastep({
sub_q <- data[n., species_vector[Species]]
})
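A vectorised alternative, offered as a sketch and not taken from the answers above, is base R matrix indexing: build a two-column (row, column) index and pull all values in one step. The small data frame is a made-up stand-in for df_long, and it assumes every species has a matching <species>_sub column.

```r
set.seed(2)
# Made-up stand-in for df_long with the same column layout
df <- data.frame(
  Species        = rep(c("setosa", "versicolor", "virginica"), each = 4),
  setosa_sub     = sample(5, 12, replace = TRUE),
  versicolor_sub = sample(5, 12, replace = TRUE),
  virginica_sub  = sample(5, 12, replace = TRUE)
)

species_vector <- c("setosa", "versicolor", "virginica")
sub_cols <- paste0(species_vector, "_sub")

# For each row, the position of its species in species_vector,
# which is also the position of the matching column in sub_cols
col_idx <- match(as.character(df$Species), species_vector)

# Index the matrix with (row, column) pairs to pick one value per row
sub_mat <- as.matrix(df[sub_cols])
df$sub_q <- sub_mat[cbind(seq_len(nrow(df)), col_idx)]

head(df)
```

This avoids both the chain of ifelse() calls and the row-by-row loop, and adding a species only means extending species_vector.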

Log Transform many variables in R with loop

I have a data frame that has a binary variable for diagnosis (column 1) and 165 nutrient variables (columns 2-166) for n=237. Let’s call this dataset nutr_all. I need to create 165 new variables that take the natural log of each of the nutrient variables. So, I want to end up with a data frame that has 331 columns - column 1 = diagnosis, cols 2-166 = nutrient variables, cols 167-331 = log transformed nutrient variables. I would like these variables to take the name of the old variables but with "_log" at the end
I have tried using a for loop and the mutate command, but I'm not very well versed in R, so I am struggling quite a bit.
for (nutr in (nutr_all_nomiss[,2:166])){
nutr_all_log <- mutate(nutr_all, nutr_log = log(nutr) )
}
When I do this, it just creates a single new variable called nutr_log. I know I need to let R know that the "nutr" in "nutr_log" is the variable name in the for loop, but I'm not sure how.
For anyone encountering this page more recently, dplyr::across() was introduced in dplyr 1.0.0 (2020) and is built for exactly this task: applying the same transformation to many columns at once.
A simple solution is below.
If you need to be selective about which columns you want to transform, check out the tidyselect helper functions by running ?tidyr_tidy_select in the R console.
library(tidyverse)
# create vector of column names
variable_names <- paste0("nutrient_variable_", 1:165)
# create random data for example
data_values <- purrr::rerun(.n = 165,
sample(x=100,
size=237,
replace = T))
# set names of the columns, coerce to a tibble,
# and add the diagnosis column
nutr_all <- data_values %>%
set_names(variable_names) %>%
as_tibble() %>%
mutate(diagnosis = 1:237) %>%
relocate(diagnosis, .before = everything())
# use across to perform same transformation on all columns
# whose names contain the phrase 'nutrient_variable'
nutr_all_with_logs <- nutr_all %>%
mutate(across(
.cols = contains('nutrient_variable'),
.fns = list(log10 = log10),
.names = "{.col}_{.fn}"))
# print out a small sample of data to validate
nutr_all_with_logs[1:5, c(1, 2:3, 166:168)]
Personally, instead of adding all the columns to the data frame,
I would prefer to make a new data frame that contains only the
transformed values, and change the column names:
logs_only <- nutr_all %>%
mutate(across(
.cols = contains('nutrient_variable'),
.fns = log10)) %>%
rename_with(.cols = contains('nutrient_variable'),
.fn = ~paste0(., '_log10'))
logs_only[1:5, 1:3]
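The examples above use log10, while the question asks for the natural log with a "_log" suffix. A minimal sketch of that variant follows; the three-row data frame and its column names are made up for illustration.

```r
library(dplyr)

# Tiny made-up stand-in for nutr_all
nutr_small <- data.frame(
  diagnosis  = c(0, 1, 1),
  nutrient_1 = c(1, 10, 100),
  nutrient_2 = c(2, 4, 8)
)

# log() is the natural log; .names appends "_log" to each original name
nutr_small_log <- nutr_small %>%
  mutate(across(
    .cols  = -diagnosis,
    .fns   = log,
    .names = "{.col}_log"
  ))

names(nutr_small_log)
```

The original columns are kept and the transformed copies are appended after them, which matches the layout described in the question.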
We can use mutate_at
library(dplyr)
nutr_all_log <- nutr_all_nomiss %>%
mutate_at(2:166, list(nutr_log = ~ log(.)))
In base R, we can do this directly on the data.frame
nm1 <- paste0(names(nutr_all_nomiss)[2:166], "_nutr_log")
nutr_all_nomiss[nm1] <- log(nutr_all_nomiss[2:166])
In base R, we can use lapply :
nutr_all_nomiss[paste0(names(nutr_all_nomiss)[2:166], "_log")] <- lapply(nutr_all_nomiss[2:166], log)
Here is a solution using only base R:
First I will create a dataset equivalent to yours:
nutr_all <- data.frame(
diagnosis = sample(c(0, 1), size = 237, replace = TRUE)
)
for(i in 2:166){
nutr_all[i] <- runif(n = 237, 1, 10)
names(nutr_all)[i] <- paste0("nutrient_", i-1)
}
Now let's create the new variables and append them to the data frame:
nutr_all_log <- cbind(nutr_all, log(nutr_all[, -1]))
And this takes care of the names:
names(nutr_all_log)[167:331] <- paste0(names(nutr_all[-1]), "_log")
The following function, which uses dplyr, log-transforms every numeric variable in a dataset. It first checks each column for non-positive values; as written, it skips those columns instead of taking their log.
logTransformation <- function(ds)
{
# this function creates log-transformed copies of the variables in a data frame,
# but only for variables whose values are all positive
# args:
# ds : Dataset
require(dplyr)
if (!is.data.frame(ds)) { stop("ds must be a data frame") }
ds <- ds %>%
dplyr::select_if(is.numeric)
# keep only the strictly positive variables
varList <- names(ds)[sapply(ds, function(x) min(x, na.rm = TRUE)) > 0]
ds <- ds %>%
dplyr::select(all_of(varList)) %>%
dplyr::mutate_at(
setNames(varList, paste0(varList, "_log")), log)
return(ds)
}
You can use it for your case as:
# assuming your binary variable is named binaryVar
nutr_allTransformed <- nutr_all %>% dplyr::select(-binaryVar) %>% logTransformation()
If you want to include variables with non-positive values too, replace varList as below:
varList <- names(ds)

Inconsistent results with normalisation in dplyr

I am trying to normalise a column ROE with this normalisation function
2*((PATORBIS$ROE-min(PATORBIS$ROE,na.rm = T))/
((max(PATORBIS$ROE,na.rm = T)-min(PATORBIS$ROE,na.rm = T))))-1
When I run the expression above on its own, it gives me the correct normalisation, whereas the exact same expression inside mutate from dplyr gives incorrect results.
Sample data:
PATORBIS <- data.frame(Company=c("ACHAOGEN","ACHAOGEN","ACHAOGEN","ACHAOGEN"),year=as.numeric(c("2013","2014","2015","2016")),ROE=as.numeric(c("-170","-31.2","-62.8",NA)))
plot2 <- PATORBIS %>%
select("Company","year","ROE") %>%
filter(!is.na(ROE)) %>%
mutate(ROE=2*(ROE-min(ROE,na.rm = T))/(max(ROE,na.rm = T)-min(ROE,na.rm = T))-1)
Does anyone have a similar issue of inconsistent results with mutate in dplyr?
I get exactly the same results from both approaches, even after removing the filter() step (and the select() is not necessary):
library(dplyr)
PATORBIS <- data.frame(Company=c("ACHAOGEN","ACHAOGEN","ACHAOGEN","ACHAOGEN"),year=as.numeric(c("2013","2014","2015","2016")),ROE=as.numeric(c("-170","-31.2","-62.8",NA)))
plot2 <- PATORBIS %>%
# select("Company","year","ROE") %>%
# filter(!is.na(ROE)) %>%
mutate(ROE=2*(ROE-min(ROE,na.rm = T))/(max(ROE,na.rm = T)-min(ROE,na.rm = T))-1)
plot2
2*((PATORBIS$ROE-min(PATORBIS$ROE,na.rm = T))/
((max(PATORBIS$ROE,na.rm = T)-min(PATORBIS$ROE,na.rm = T))))-1
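One way to make the comparison explicit, sketched here in base R with a hypothetical helper named rescale_pm1(), is to wrap the formula in a function and check the two results with all.equal(). The ROE values are the ones from the sample data.

```r
# Hypothetical helper: rescale a numeric vector to the range [-1, 1]
rescale_pm1 <- function(x) {
  rng <- range(x, na.rm = TRUE)  # c(min, max), ignoring NA
  2 * (x - rng[1]) / (rng[2] - rng[1]) - 1
}

ROE <- c(-170, -31.2, -62.8, NA)

# The standalone expression from the question
standalone <- 2 * ((ROE - min(ROE, na.rm = TRUE)) /
  (max(ROE, na.rm = TRUE) - min(ROE, na.rm = TRUE))) - 1

# The same formula through the helper, which could also be used inside mutate()
via_helper <- rescale_pm1(ROE)

all.equal(standalone, via_helper)
```

If the two still disagree on the real data, one thing worth checking is whether the data frame is grouped, since mutate() evaluates min() and max() within each group.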

Subtotals in columns using reshape2

I have spent some time now learning reshape2 and plyr but I still do not get it. This time I have a problem with (a) subtotals and (b) passing different aggregate functions. Here is an example using data from a tutorial on the blog of mrdwab:
# libraries
library(plyr)
library(reshape2)
# get data and add few more variables
book.sales = read.csv("http://news.mrdwab.com/data-booksales")
book.sales$Stock = book.sales$Quantity + 10
book.sales$SubjCat[(book.sales$Subject == 'Economics') |
(book.sales$Subject == 'Management')] <- '1_EconSciences'
book.sales$SubjCat[book.sales$Subject %in%
c('Anthropology', 'Politics', 'Sociology')] <- '2_SocSciences'
book.sales$SubjCat[book.sales$Subject %in% c('Communication', 'Fiction',
'History', 'Research', 'Statistics')] <- '3_other'
# to get to my starting dataframe (close to the project I am working on)
book.sales1 <- ddply(book.sales, c('Region', 'Representative', 'SubjCat',
'Subject', 'Publisher'), summarize,
Stock = sum(Stock), Sold = sum(Quantity),
Ratio = round((100 * sum(Quantity)/sum(Stock)), digits = 1))
#melt it
m.book.sales = melt(data = book.sales1, id.vars = c('Region', 'Representative',
'SubjCat', 'Subject', 'Publisher'),
measure.vars = c('Stock', 'Sold', 'Ratio'))
# cast it --- # Please ignore this cast this was a mistake
# Tab1 <- dcast(data = m.book.sales,
# formula = Region + Representative ~ Publisher + variable,
# fun.aggregate = sum, margins = c('Region', 'Representative'))
Tab1 <- dcast(data = m.book.sales, formula = Region + Representative ~
SubjCat + Subject + variable, fun.aggregate = sum,
margins = c('Region', 'Representative', 'SubjCat', 'Subject'))
Now my questions:
1. I have been able to add the subtotals in rows. Is it also possible to add margins in the columns, for example the total sold for all publishers? (I originally wrote "totals of Stock for one Publisher", which is not what I meant.)
2. There is a problem with the columns with “ratio”. How can I get “mean” instead of “sum” for this variable?
Please note: Question number one (about subtotals in margins) could be solved.
P.S.: I have seen some examples using reshape. Would you recommend using it instead of reshape2 (which seems not to include the functionality of two functions)?
Not sure exactly what you want for question 1, but if you want the total stock per Publisher, would you not just do this?
totalofstock <- ddply(book.sales, 'Publisher', function(x)
data.frame(subtotals = sum(x$Stock)))
and if you want to add it to Tab1 you just do this:
Tab1$bloomsburytotalofstock<-totalofstock[1,][[2]]
head(Tab1)
As for question 2, to get a mean instead of a sum you would change the function from sum to mean, e.g.
ratiomeans <- ddply(book.sales1, 'Publisher', function(x)
data.frame(ratioMEAN = mean(x$Ratio)))
Also I would suggest sticking with reshape2. reshape2 is basically the new version of reshape. As far as I know reshape is no longer being worked on but still exists so that people with old code using reshape do not have to rewrite everything.
EDIT
justratio<-(m.book.sales[m.book.sales$variable=="Ratio",])
Tab2 <- dcast(data = justratio,
formula = Region + Representative ~ SubjCat + Subject + variable,
fun.aggregate = mean,
margins = c('Region', 'Representative', 'SubjCat', 'Subject'))
final<-merge(Tab1,Tab2,by=c("Region","Representative"))

Resources