Rename column within function in r using dplyr - r

In this case, I have a loop that calls a function, which in turn calls another function that collects the data.
One odd thing is that I cannot rename the columns in the dataset it creates, d. Basically, I need standardised names so that I can pass in different variables, which means I have to rename the columns during the dplyr transformation. The problem is here: %>% rename(Con = 1, DV = 2). In the dataset I have selected, I want to label the first column Con and the second column DV, so that I can pass it to the CollectDEffect function to run the cohensD analysis. All of this works when I run it line by line, but I want to run the function for every DV and build a table, which is why I need it to work inside the loop.
# Function to run analyses and create the dataframe with output
CD_EE_DF <- data.frame("Test" = character())
CollectDEffect = function(cd, d){
excess <- data.frame("Test" = cd,
"Sample Size" = nrow(d),
"Original Cohen's d" = cohensD(d$DV ~ d$Con))
CD_EE_DF <- rbind(CD_EE_DF, excess)
return(CD_EE_DF)
}
# Data transformation, where the error is
CollectDEffect_Trigger = function(DVTest){
# Problem occurs here with the rename
d <- df %>% filter(Gender == "Female", Target_Gender != "") %>% select(Target_Gender, DVTest) %>% rename(Con = 1, DV = 2) %>% na.omit()
CD_EE_DF <- CollectDEffect(paste0("A_",DVTest),d)
}
# Loop that triggers all of the analyses
vec_dv <- c("Status", "warmth")
for (DVTest in vec_dv) {
CD_EE_DF <- CollectDEffect_Trigger(DVTest)
}
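One way to sidestep the column-name problem is sketched below in base R, under stated assumptions: the toy `df` and its values are made up, `prepare_dv` is a hypothetical helper name, and the `cohensD` call is left out so the example runs without extra packages. It picks the DV column by its name held in a string, renames by position, and builds the result table from one row per DV. In dplyr, `select(Con = Target_Gender, DV = all_of(DVTest))` is a common way to select and rename in one step when the column name is stored in a variable.

```r
# Toy data standing in for the asker's `df` (values are made up)
df <- data.frame(
  Gender        = c("Female", "Female", "Male", "Female", "Female"),
  Target_Gender = c("M", "F", "M", "", "F"),
  Status        = c(3, 4, 5, 2, NA),
  warmth        = c(1, 2, 3, 4, 5),
  stringsAsFactors = FALSE
)

# Hypothetical helper: filter rows, keep two columns, give them standard names
prepare_dv <- function(data, dv_name) {
  keep <- which(data$Gender == "Female" & data$Target_Gender != "")
  d <- data[keep, c("Target_Gender", dv_name)]  # dv_name is a character string
  names(d) <- c("Con", "DV")                    # rename by position
  na.omit(d)
}

# Build one summary row per DV, then bind the rows once at the end
vec_dv <- c("Status", "warmth")
rows <- lapply(vec_dv, function(dv_name) {
  d <- prepare_dv(df, dv_name)
  data.frame(Test = paste0("A_", dv_name), Sample.Size = nrow(d))
})
CD_EE_DF <- do.call(rbind, rows)
print(CD_EE_DF)
```

Collecting the rows in a list and binding them once avoids relying on a global data frame that each function call has to modify.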

Related

How can I add a column in R whose values reference a column in a different data frame?

So I have an R script that ranks college football teams. It outputs a rating and I want to take that rating from a different data frame and add it as a new column to a different data frame containing info from the upcoming week of games. Here's what I'm currently trying to do:
random_numbers <- rnorm(130, mean = mean_value, sd = sd_value)
sample_1 <- as.vector(sample(random_numbers, 1, replace = TRUE))
upcoming_games_df <- upcoming_games_df %>%
mutate(home_rating = case_when(home_team %in% Ratings$team ~ Ratings$Rating[Ratings$team == home_team]),
TRUE ~ sample_1)
sample_2 <- as.vector(sample(random_numbers, 1, replace = TRUE))
upcoming_games_df <- upcoming_games_df %>%
mutate(away_rating = case_when(away_team %in% PrevWeek_VoA$team ~ Ratings$Rating[Ratings$team == away_team]),
TRUE ~ sample_2)
I originally had the sample(random_numbers) inside of the mutate() function but I got error "must be a vector, not a formula object." So I moved it outside the mutate() function and added the as.vector(), but it still gave me the same error. I also got a warning about "longer object length is not a multiple of shorter object length". I don't know what to do now. The code above is the last thing I tried before coming here for help.
case_when requires all of its right-hand sides to have the same length, or length 1, in which case the value is recycled. sample_1 and sample_2 have length 1, so they are recycled (as.vector is not needed, because rnorm already returns a vector). The "must be a vector, not a formula object" error arises because the closing parenthesis of case_when comes before TRUE ~ sample_1, so that formula is passed to mutate() rather than to case_when.
In addition, == is an elementwise comparison, and it works as intended only when both sides have the same length or one of them has length 1 (i.e. it gets recycled). Thus Ratings$team == home_team is the cause of the "longer object length" warning.
Instead of case_when, this may be done with a join (assuming the 'team' column in 'Ratings' is not duplicated):
library(dplyr)
upcoming_games_df2 <- upcoming_games_df %>%
left_join(Ratings, by = c("home_team" = "team")) %>%
# drop Rating here so the second join does not produce Rating.x / Rating.y
mutate(home_rating = coalesce(Rating, sample_1), Rating = NULL) %>%
left_join(PrevWeek_VoA, by = c("away_team" = "team")) %>%
mutate(away_rating = coalesce(Rating, sample_2))
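The lookup-with-fallback idea can also be sketched in base R with match(), which avoids the recycling problem described above. The data below are made up, and `fallback` is a hypothetical stand-in for sample_1.

```r
# Toy ratings table and a column of home teams (made-up values)
Ratings <- data.frame(team = c("A", "B", "C"),
                      Rating = c(10, 20, 30),
                      stringsAsFactors = FALSE)
home_team <- c("B", "Z", "A", "C", "B")
fallback <- 15  # stands in for sample_1

# Ratings$team == home_team would compare 3 values against 5 elementwise,
# recycle the shorter vector, and warn. match() instead returns, for each
# home team, its row position in Ratings (NA when the team is absent).
idx <- match(home_team, Ratings$team)

# Use the matched rating where one exists, otherwise the fallback value
home_rating <- ifelse(is.na(idx), fallback, Ratings$Rating[idx])
print(home_rating)
```

Like the join, this assumes each team appears only once in Ratings; match() returns the first match if there are duplicates.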

Looping through variables to make many boxplots

I am using the OlinkAnalyze package and I am trying to make box plots.
install.packages("OlinkAnalyze")
library(OlinkAnalyze)
df = npx_data1
The code for the boxplot is:
plot <- df %>%
na.omit() %>% # removing missing values which exists for Site
olink_boxplot(variable = "Site",
olinkid_list = c("OID01216", "OID01217"),
number_of_proteins_per_plot = 2)
plot[[1]]
It takes values from the OlinkID column. What I would like is to loop through the column, taking the next two OlinkIDs each time, to make boxplots, naming each plot as I go (e.g. plot 1 with OID01216 and OID01217, plot 2 with OID01218 and OID01219).
I used a while loop.
install.packages("OlinkAnalyze")
library(OlinkAnalyze)
df = npx_data1
i <- 1
ids <- as.data.frame(unique(df$OlinkID))
while(i <= nrow(ids)){
print(i)
x <- i+1
temp <- ids[i:x,]
plotx <- df %>%
na.omit() %>% #
olink_boxplot(variable = "Site",
olinkid_list = c(paste(c(ids[i,],ids[x,]))),
number_of_proteins_per_plot = 2)
plottemp <- assign(paste0("plot_",ids[i,],"_",ids[x,]),plotx) # name the plot after both IDs
i <- i+2
}
If you want a loop, you could write it like this:
for (i in seq(from = 1, to = length(data$OlinkID), by = 2)) {
# the plot code goes here
}
This way you can access the two observations you want with data$OlinkID[i] and data$OlinkID[i+1]. This assumes data$OlinkID holds each ID only once; otherwise loop over unique(data$OlinkID) instead.
So the boxplot code should be:
plot <- data %>%
na.omit() %>%
olink_boxplot(variable = "Oxwatchtime",
olinkid_list = c(data$OlinkID[i],data$OlinkID[i+1]),
number_of_proteins_per_plot = 2)
If you want to save the plots, add a ggsave() or a png()/pdf() call inside the loop to save them externally, or collect them in a list and arrange them with the ggarrange() function from the ggpubr package. Let me know if it works as you intended.
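As a rough sketch of the "two IDs at a time" idea, the unique IDs can be split into pairs up front, which also handles an odd number of IDs. The ID vector below is a made-up stand-in for unique(df$OlinkID), and the plotting call is shown only as a comment because it needs the OlinkAnalyze package.

```r
# Made-up stand-in for unique(df$OlinkID)
ids <- c("OID01216", "OID01217", "OID01218", "OID01219", "OID01220")

# Group consecutive IDs in chunks of two; a leftover ID forms its own chunk
id_pairs <- split(ids, ceiling(seq_along(ids) / 2))

# Name each chunk after the IDs it contains, e.g. plot_OID01216_OID01217
names(id_pairs) <- unname(vapply(
  id_pairs,
  function(p) paste0("plot_", paste(p, collapse = "_")),
  character(1)
))
print(names(id_pairs))

# With OlinkAnalyze loaded, each chunk could then be plotted, for example:
# plots <- lapply(id_pairs, function(p)
#   olink_boxplot(na.omit(df), variable = "Site",
#                 olinkid_list = p, number_of_proteins_per_plot = 2))
```

Keeping the plots in a named list avoids creating many separate variables with assign().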
OlinkAnalyze::olink_boxplot() plots several plots until all proteins specified under the olinkid_list argument are plotted. The number_of_proteins_per_plot argument determines the number of IDs plotted on one plot.
Try this:
library(OlinkAnalyze)
data("npx_data1")
ids <- unique(npx_data1$OlinkID)
olink_boxplot(npx_data1,
variable = "Site",
olinkid_list = ids,
verbose = TRUE,
number_of_proteins_per_plot = 2)
The code runs for a while as each plot takes time to be generated. When it completes you can use the arrow buttons in RStudio to look at all the plots.

Mutate ifelse on a vector

Let's say I have this data frame:
library(tidyr) # for gather()
library(dplyr) # for %>% and mutate()
set.seed(2)
df <- iris[c(1:5,51:55,101:105),]
df_long <- gather(df, key = "flower_att", value = "measurement",
Sepal.Length, Sepal.Width, Petal.Length, Petal.Width)
df_long$setosa_sub <-sample(5,size = 60, replace = TRUE)
df_long$versicolor_sub <-sample(5,size = 60, replace = TRUE)
df_long$virginica_sub <-sample(5,size = 60, replace = TRUE)
df_long$sub_q<-0
Now I want to copy a value into the sub_q variable based on the Species variable and the sub values.
I know how to do it one by one:
df_long2 <- df_long %>%
mutate(sub_q =ifelse(Species =="setosa", setosa_sub,sub_q)) %>%
mutate(sub_q =ifelse(Species =="versicolor", versicolor_sub,sub_q)) %>%
mutate(sub_q =ifelse(Species =="virginica", virginica_sub,sub_q))
But I can't figure out the right way to apply this over a vector of the Species values instead.
species_vector <- c("setosa","versicolor","virginica")
I'm actually not sure whether I need to write a new function or just loop over it somehow. I hope that makes sense...
I don't see anything wrong with the way you are doing it. Another way, using an apply function (sapply in this case) would work like this:
# a helper function to find the right value for the xth row
get_correct_sub <- function(x){
col_name = paste0(df_long$Species[x],'_sub')
df_long[[ col_name ]][x] }
# apply each row index to the helper function
df_long2 = df_long
df_long2$sub_q = sapply(1:nrow(df_long), get_correct_sub)
The helper function adds "_sub" to the species name, treats that as a column name, and then gets the value for that column.
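A vectorised variant of the same lookup, sketched in base R with matrix indexing instead of a row-by-row helper. The toy data frame is made up and much smaller than df_long; only the column naming pattern (species name plus "_sub") is taken from the question.

```r
species_vector <- c("setosa", "versicolor", "virginica")

# Toy data with the same column layout as df_long (made-up values)
toy <- data.frame(
  Species        = c("setosa", "virginica", "versicolor", "setosa"),
  setosa_sub     = c(1, 2, 3, 4),
  versicolor_sub = c(5, 1, 2, 3),
  virginica_sub  = c(4, 5, 1, 2),
  stringsAsFactors = FALSE
)

# Matrix holding only the *_sub columns, in the order of species_vector
sub_mat <- as.matrix(toy[paste0(species_vector, "_sub")])

# For each row, the column to read is the position of its species
pick <- cbind(seq_len(nrow(toy)), match(toy$Species, species_vector))

# Indexing a matrix with a two-column (row, column) matrix picks one cell per row
toy$sub_q <- sub_mat[pick]
print(toy$sub_q)
```

This reads one cell per row in a single step, so it does not need a separate mutate() call per species.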
Here is a datastep() solution. I created a vector lookup to map the Species to the desired column, then step through the data row by row and assign the value using the lookup. data is the input dataset and n. is the current row number:
library(libr)
# Create vector lookup
species_vector <- c("setosa" = "setosa_sub", "versicolor" = "versicolor_sub", "virginica" = "virginica_sub")
# Step through data row by row, and assign value using lookup
df_long2 <- df_long %>%
datastep({
sub_q <- data[n., species_vector[Species]]
})

Log Transform many variables in R with loop

I have a data frame that has a binary variable for diagnosis (column 1) and 165 nutrient variables (columns 2-166) for n=237. Let’s call this dataset nutr_all. I need to create 165 new variables that take the natural log of each of the nutrient variables. So, I want to end up with a data frame that has 331 columns: column 1 = diagnosis, cols 2-166 = nutrient variables, cols 167-331 = log-transformed nutrient variables. I would like these variables to take the names of the old variables but with "_log" at the end.
I have tried using a for loop and the mutate command, but I'm not very well versed in R, so I am struggling quite a bit.
for (nutr in (nutr_all_nomiss[,2:166])){
nutr_all_log <- mutate(nutr_all, nutr_log = log(nutr) )
}
When I do this, it just creates a single new variable called nutr_log. I know I need to let R know that the "nutr" in "nutr_log" is the variable name in the for loop, but I'm not sure how.
For anyone encountering this page more recently: dplyr::across() was introduced in dplyr 1.0.0 (released in mid-2020), and it is built for exactly this task, applying the same transformation to many columns at once.
A simple solution is below. The example uses log10; swap in log for the natural logarithm asked about in the question.
If you need to be selective about which columns you want to transform, check out the tidyselect helper functions by running ?tidyr_tidy_select in the R console.
library(tidyverse)
# create vector of column names
variable_names <- paste0("nutrient_variable_", 1:165)
# create random data for example
# (purrr::rerun() is deprecated as of purrr 1.0.0, so replicate() is used here)
data_values <- replicate(165,
sample(x = 100, size = 237, replace = TRUE),
simplify = FALSE)
# set names of the columns, coerce to a tibble,
# and add the diagnosis column
nutr_all <- data_values %>%
set_names(variable_names) %>%
as_tibble() %>%
mutate(diagnosis = 1:237) %>%
relocate(diagnosis, .before = everything())
# use across to perform same transformation on all columns
# whose names contain the phrase 'nutrient_variable'
nutr_all_with_logs <- nutr_all %>%
mutate(across(
.cols = contains('nutrient_variable'),
.fns = list(log10 = log10),
.names = "{.col}_{.fn}"))
# print out a small sample of data to validate
nutr_all_with_logs[1:5, c(1, 2:3, 166:168)]
Personally, instead of adding all the columns to the data frame,
I would prefer to make a new data frame that contains only the
transformed values, and change the column names:
logs_only <- nutr_all %>%
mutate(across(
.cols = contains('nutrient_variable'),
.fns = log10)) %>%
rename_with(.cols = contains('nutrient_variable'),
.fn = ~paste0(., '_log10'))
logs_only[1:5, 1:3]
We can use mutate_at
library(dplyr)
nutr_all_log <- nutr_all_nomiss %>%
mutate_at(2:166, list(nutr_log = ~ log(.)))
In base R, we can do this directly on the data.frame:
nm1 <- paste0(names(nutr_all_nomiss)[2:166], "_nutr_log")
nutr_all_nomiss[nm1] <- log(nutr_all_nomiss[2:166])
In base R, we can use lapply :
nutr_all_nomiss[paste0(names(nutr_all_nomiss)[2:166], "_log")] <- lapply(nutr_all_nomiss[2:166], log)
Here is a solution using only base R:
First I will create a dataset equivalent to yours:
nutr_all <- data.frame(
diagnosis = sample(c(0, 1), size = 237, replace = TRUE)
)
for(i in 2:166){
nutr_all[i] <- runif(n = 237, 1, 10)
names(nutr_all)[i] <- paste0("nutrient_", i-1)
}
Now let's create the new variables and append them to the data frame:
nutr_all_log <- cbind(nutr_all, log(nutr_all[, -1]))
And this takes care of the names:
names(nutr_all_log)[167:331] <- paste0(names(nutr_all[-1]), "_log")
The following function, which uses dplyr, will do the task. It log-transforms all numeric variables in the dataset and also checks whether a column contains non-positive values; as currently written, it does not calculate the log for those columns.
logTransformation <- function(ds)
{
# this function creates log transformations of the data frame, only for variables that are strictly positive
# args:
# ds : Dataset
require(dplyr)
# is.data.frame() also accepts tibbles, whose class() has more than one element
if (!is.data.frame(ds)) { stop("ds must be a data frame") }
ds <- ds %>%
dplyr::select_if(is.numeric)
# to get only positive variables
varList <- names(ds)[sapply(ds, function(x) min(x, na.rm = TRUE)) > 0]
ds <- ds %>%
dplyr::select(all_of(varList)) %>%
dplyr::mutate_at(
setNames(varList, paste0(varList, "_log")), log)
return(ds)
}
You can use it for your case as:
# assuming your binary variable is named binaryVar
nutr_allTransformed<- nutr_all %>% dplyr::select(-binaryVar) %>% logTransformation()
If you want to include the non-positive variables too, replace varList as below:
varList<- names(ds)
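The same idea (log only the columns that are strictly positive) can be sketched in base R. This is a variant under stated assumptions, not the function above: `log_positive_columns` is a hypothetical name, the toy data are made up, and unlike the dplyr version it keeps every original column and appends the "_log" columns.

```r
# Hypothetical helper: append <name>_log for each strictly positive numeric column
log_positive_columns <- function(ds) {
  stopifnot(is.data.frame(ds))
  # TRUE for numeric columns whose non-missing values are all greater than zero
  is_pos <- vapply(
    ds,
    function(x) is.numeric(x) && all(x > 0, na.rm = TRUE),
    logical(1)
  )
  pos_names <- names(ds)[is_pos]
  if (length(pos_names) == 0) return(ds)  # nothing to transform
  ds[paste0(pos_names, "_log")] <- lapply(ds[pos_names], log)
  ds
}

# Toy data: diagnosis contains 0 and zinc contains a negative value,
# so only iron is log-transformed
toy <- data.frame(
  diagnosis = c(0, 1, 1),
  iron      = c(1, 10, 100),
  zinc      = c(2, -1, 4),
  note      = c("a", "b", "c"),
  stringsAsFactors = FALSE
)
out <- log_positive_columns(toy)
print(names(out))
```

Skipping non-positive columns avoids the NaN and -Inf values that log() returns for negative numbers and zero.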

How to work around error while reshape data frame with spread()

I am trying to transform a long data frame into a wide one with flagged cases. I pivot it and use a temporary vector that serves as a flag. It works perfectly on small data sets (see the example below; copy and paste it into RStudio), but when I try it on real data it reports an error:
churnTrain3 <- spread(churnTrain, key = "state", value = "temporary", fill = 0)
Error: Duplicate identifiers for rows (169, 249), (57, 109), (11, 226)
The wide structure of the data set is needed for further processing.
Is there any workaround for this problem? I bet a lot of people cleaning data run into the same problem.
Please help me.
Here is the code:
The first chunk, "example", makes a small data set to show how the result is supposed to look.
The second chunk, "real data", is a sliced portion of a data set from the churn library.
library(caret)
library(tidyr)
#example
#============
df <- data.frame(var1 = (1:6),
var2 = (7:12),
factors = c("facto1", "facto2", "facto3", "facto3","facto5", "facto1") ,
flags = c(1, 1, 1, 1, 1, 1))
df
df2 <- spread(data = df, key = "factors" , value = flags, fill = " ")
df2
#=============
# real data
#============
data(churn)
str(churnTrain)
churnTrain <- churnTrain[1:250,1:4]
churnTrain$temporary <-1
churnTrain3 <- spread(churnTrain, key = "state", value = "temporary", fill = 0)
str(churnTrain)
head(churnTrain3)
str(churnTrain3)
#============
Spread can only put one unique value in the 'cell' that intersects the spread 'key' and the rest of the data (in the churn example, account_length, area_code and international_plan). So the real question is how to manage these duplicate entries. The answer to that depends on what you are trying to do. I provide one possible solution below. Instead of making a dummy 'temporary' variable, I instead count the number of episodes and use that as the dummy variable. This can be done very easily with dplyr:
library(tidyr)
library(dplyr)
library(C50) # this is one source for the churn data
data(churn)
churnTrain <- churnTrain[1:250,1:4]
churnTrain2 <- churnTrain %>%
group_by(state, account_length, area_code, international_plan) %>%
tally %>%
dplyr::rename(temporary = n)
churnTrain3 <- spread(churnTrain2, key = "state", value = "temporary", fill = 0)
Spread now works.
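To see which rows trigger the "Duplicate identifiers" error before reshaping, a base R check along these lines may help. The toy data are made up and only mimic the churn column names; the counting step mirrors the tally idea above.

```r
# Toy data (made up): rows 1 and 3 share the same identifier columns and key
toy <- data.frame(
  account_length = c(10, 20, 10, 30),
  area_code      = c("a", "b", "a", "a"),
  state          = c("KS", "OH", "KS", "KS"),
  stringsAsFactors = FALSE
)
id_cols <- c("account_length", "area_code", "state")

# Flag every row that shares its identifier combination with another row
is_dup <- duplicated(toy[id_cols]) | duplicated(toy[id_cols], fromLast = TRUE)
print(toy[is_dup, ])

# Collapse the duplicates into one row per combination with a count,
# which gives a single value per cell once the data are spread
counts <- aggregate(n ~ account_length + area_code + state,
                    data = transform(toy, n = 1L), FUN = sum)
print(counts)
```

If every original row must survive as its own row in the wide data, adding a unique row identifier before spreading is another option.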
As others point out, you need to give spread a unique set of identifiers. My solution uses base R:
library(C50)
f<- function(df, key){
if (sum(names(df)==key)==0) stop("No such key");
u <- unique(df[[key]])
id <- matrix(0,dim(df)[1],length(u))
uu <- lapply(df[[key]],function(x)which(u==x)) ## check 43697442 for details
for(i in 1:dim(df)[1]) id[i,uu[[i]]] <- 1
colnames(id) = as.character(u)
return(cbind(df,id));
}
df <- data.frame(var1 = (1:6),
var2 = (7:12),
factors = c("facto1", "facto2", "facto3", "facto3","facto5", "facto1"))
f(df, key='fact') # stops with "No such key", because no column has that name
f(df, key='factors')
data(churn)
churnTrain <- churnTrain[1:250,1:4]
f(churnTrain, key='state')
Although there is a for-loop and some temporary variables inside the f function, it is not actually slow.
