Add a Column created Within a Function to a dataframe in R

I have searched and tried multiple previously asked questions that might be similar to my question, but none worked.
I have a dataframe in R called df2, a column called df2$col. I created a function to take the df, the df$col, and two parameters that are names for two new columns I want created and worked on within the function. After the function finishes running, I want a return df with the two new columns included. I get the two columns back indeed, but they are named after the placeholders in the function shell. See below:
df2 = data.frame(col = c(1, 3, 4, 5),
col1 = c(9, 6, 8, 3),
col2 = c(8, 2, 8, 4))
The function I created takes col, transforms it, and should return the transformed col as well as the two newly created columns:
no_way <- function(df, df_col_name, df_col_flagH, df_col_flagL) {
lo_perc <- 2
hi_perc <- 6
df$df_col_flagH <- as.factor(ifelse(df_col_name<lo_perc, 1, 0))
df$df_col_flagL <- as.factor(ifelse(df_col_name>hi_perc, 1, 0))
df_col_name <- df_col_name + 1.4
df_col_name <- df_col_name * .12
return(df)
}
When I call the function, no_way(df2, col, df$new_col, df$new_col2), instead of getting a df with col, col1, col2, new_col1, new_col2, I get the first three right but get the parametric names for the last two. So something like df, col, col1, col2, df_col_flagH, df_col_flagL. I essentially want the function to return the df with the new columns' names I give it when I am calling it. Please help.

I don't see what your function is trying to do, but this might point you in the right direction:
no_way <- function(df = df2, df_col_name = "col", df_col_flagH = "col1", df_col_flagL = "col2") {
lo_perc <- 2
hi_perc <- 6
df[[df_col_flagH]] <- as.factor(ifelse(df[[df_col_name]] < lo_perc, 1, 0)) # as.factor?
df[[df_col_flagL]] <- as.factor(ifelse(df[[df_col_name]] > hi_perc, 1, 0))
df[[df_col_name]] <- (df[[df_col_name]] + 1.4) * 0.12 # Do in one step
return(df)
}

I needed to call the function with the new column names as strings instead:
no_way(mball, 'TEAM_BATTING_H', 'hi_TBH', 'lo_TBH')
Additionally, I had to use brackets around the target column in my function.
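To illustrate why the string-plus-brackets approach works, here is a minimal sketch on a toy data frame. The names df, new_name, by_dollar and by_bracket are made up for this example and are not from the original post. The point is that $ treats what follows it as a literal column name, while [[ ]] evaluates its argument first.

```r
# Toy data frame; all names here are illustrative only.
df <- data.frame(col = c(1, 3, 4, 5))
new_name <- "flag"

# `$` takes the name literally: this creates a column called "new_name".
by_dollar <- df
by_dollar$new_name <- by_dollar$col > 2

# `[[` evaluates its argument: this creates a column called "flag".
by_bracket <- df
by_bracket[[new_name]] <- by_bracket$col > 2

names(by_dollar)   # "col" "new_name"
names(by_bracket)  # "col" "flag"
```

This is the same reason the original function produced columns named df_col_flagH and df_col_flagL: inside the function, df$df_col_flagH never looks at the value of the argument.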

Related

Loop over several dataframes to do several actions in R

I have several dataframes (dataframe_1, dataframe_2...) that I want to loop in order to execute the same functions over all the dataframes. These functions are:
Select specific columns:
dataframe_1 <- dataframe_1[, c("Column_1", "Column_2")]
Rename the columns:
dataframe_1 <- rename(dataframe_1, New_Name_for_Column_1 = Column_1)
Create new columns. For example, by using the ifelse() function:
dataframe_1$Column_3 <- ifelse(dataframe_1$Column_1 == 5, 1, 0)
I have tested this code on individual dataframes without errors.
However, if I execute the following loop:
list_dataframes = list(dataframe_1, dataframe_2)
for (dataframe in 1:length(list_dataframes)){
dataframe <- dataframe[, c("Column_1", "Column_2")]
dataframe <- rename(dataframe, New_Name_for_Column_1 = Column_1)
dataframe$Column_3 <- ifelse(dataframe$Column_1 == 5, 1, 0)
}
The following error arises:
Error in dataframe[, c("Column_1", "Column_2", :
incorrect number of dimensions
(All dataframes have the same column names.)
Any idea?
Thanks!
You are not iterating over the list of dataframes, but rather over a sequence 1:length(list_dataframes). Consider the following for illustration:
a = list("a", "b")
for (i in a){print(i)}
for (i in 1:length(a)){print(i)}
In your code, you need to explicitly access the list elements like this:
list_dataframes = list(dataframe_1, dataframe_2)
for (df_number in 1:length(list_dataframes)){
list_dataframes[[df_number]] <- list_dataframes[[df_number]][, c("Column_1", "Column_2")]
list_dataframes[[df_number]] <- rename(list_dataframes[[df_number]], New_Name_for_Column_1 = Column_1)
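# Column_1 was renamed on the previous line, so test the new name, and compare with == (not =)
list_dataframes[[df_number]]$Column_3 <- ifelse(list_dataframes[[df_number]]$New_Name_for_Column_1 == 5, 1, 0)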
}
The code for (dataframe in 1:length(list_dataframes)) iterates over the integer vector c(1, 2), storing one number at a time in the variable dataframe. That iteration variable is a single number, not a data frame: it has length 1 and no dim attribute, so it cannot be subset with two indices as in dataframe[, c("Column_1", "Column_2")]. Use the number to index the list instead: list_dataframes[[dataframe]][, c("Column_1", "Column_2")]
You could try to iterate over dataframes using purrr::map_dfr(), e.g.
list_dataframes = list(dataframe_1, dataframe_2)
library(dplyr)
library(purrr)
list_dataframes %>%
map_dfr(~.x %>%
select(Column_1, Column_2) %>%
rename(New_Name_for_Column_1 = Column_1) %>%
mutate(Column_3 = ifelse(New_Name_for_Column_1 == 5, 1, 0))) # Column_1 no longer exists after rename()
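Note that map_dfr() row-binds everything into a single data frame. If the data frames should stay separate, a base R sketch along the following lines might be closer to the original loop. The helper clean_one, the extra column Column_9 and the sample values are assumptions made up for this illustration.

```r
# Two small stand-in data frames (hypothetical data).
dataframe_1 <- data.frame(Column_1 = c(5, 2), Column_2 = c("a", "b"), Column_9 = 0)
dataframe_2 <- data.frame(Column_1 = c(1, 5), Column_2 = c("c", "d"), Column_9 = 0)
list_dataframes <- list(dataframe_1, dataframe_2)

# Hypothetical helper: the three steps from the question, applied to one data frame.
clean_one <- function(d) {
  d <- d[, c("Column_1", "Column_2")]                            # select columns
  names(d)[names(d) == "Column_1"] <- "New_Name_for_Column_1"    # rename
  d$Column_3 <- ifelse(d$New_Name_for_Column_1 == 5, 1, 0)       # new column
  d
}

# lapply() passes each data frame itself (not its index) to the helper
# and returns a list of cleaned data frames.
cleaned <- lapply(list_dataframes, clean_one)
```

Wrapping the steps in a function means each one can still be tested on a single data frame before it is applied to the whole list.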

pass a list of variable names as an argument to an R function

I am trying to achieve the following: I have a dataset, and a function that subsets this dataset and then performs a series of operations on the subset. Subsetting happens based on row names. I am able to do it step by step (i.e. running this function for each subset separately), but I have a list of desired subsets, and I would like to loop over this list. It sounds complicated, so please check the example below.
This is what I can do:
#dataframe with rownames
whole_dataset <- data.frame(wt1 = c(1, 2, 3, 6, 6),
wt2 = c(2, 3, 4, 4, 2))
row.names(whole_dataset) = c("HTA1", "HTA2", "HTB2", "CSE1", "CSE2")
# two different non-overlapping subsets
his <- c("HTA1", "HTA2", "HTB2")
cse <- c("CSE1", "CSE2")
#this is the function I have
fav_complex <- function (data, complex) {
small_data<- data[complex,] #subset only the rows that you need
sum.all<-colSums(small_data) #calculate sum of columns
return(sum.all)
}
#I generate two separate named vectors
his_data <- fav_complex(data = whole_dataset, complex = his)
cse_data <- fav_complex(data = whole_dataset, complex = cse)
#and merge them
merged_data<- rbind(his_data,cse_data)
it looks like this
> merged_data
wt1 wt2
his_data 6 9
cse_data 12 6
I would like to somehow generate the merged_data dataframe without having to call the 'fav_complex' function multiple times. In real life I have about 20 subsets, and it is a lot of code. This is my solution that doesn't work
#I first have a character vector listing all the variable names
subset_list <- c("his", "cse")
#then create a loop that goes over this list
#make an empty dataframe
merged_data2 <- data.frame()
#fill it with a for loop output
for (element in subset_list) {
result <- fav_complex(data = whole_dataset, element)
merged_data2 <-rbind(merged_data2, result)
}
I know this is wrong. In this loop, 'element' is just a string, rather than a variable with stuff in it. But I don't know how to make it a variable. noquote(element) didn't work. I tried reading about non standard evaluation and eval(), substitute(), but it is too abstract for me - I think I am not there yet with my R expertise.
Consider using by() to run the needed operation across all subsets. First, create a grouping column:
# ANY FUNCTION TO APPLY ON SUBSETS (REMOVE GROUP COL)
fav_complex_new <- function (sub) {
sum.all <- colSums(transform(sub, group=NULL))
return(sum.all)
}
# ASSIGN GROUPING
whole_dataset$group <- ifelse(row.names(whole_dataset) %in% his, "his",
ifelse(row.names(whole_dataset) %in% cse, "cse", NA))
# BY CALL
df_list <- by(whole_dataset, whole_dataset$group, FUN=fav_complex_new)
# COMBINE ALL DFs IN LIST
merged_data <- do.call(rbind, df_list)
Rextester demo (includes OP's original and above solution)
Following @Gregor's suggestion of a modified workflow, would you consider this solution, including some bonus data wrangling?
Put the data that's currently in row names in its own column.
Add a column for complex. We can do this programmatically in case the data are large.
Use dplyr to create split-apply-combine summaries of the data grouped by complex.
It could work like this
library(dplyr)
whole_dataset <- tibble(wt1 = c(1, 2, 3, 6, 6),
wt2 = c(2, 3, 4, 4, 2),
id = factor(c("HTA1", "HTA2", "HTB2", "CSE1", "CSE2")))
whole_dataset <- mutate(whole_dataset,
complex = case_when(
grepl("^HT", id) ~ "his",
grepl("^CSE", id) ~ "cse")
) %>%
group_by(factor(complex))
whole_dataset %>% summarize(sum_wt1 = sum(wt1),
sum_wt2 = sum(wt2))
# # A tibble: 2 x 3
# `factor(complex)` sum_wt1 sum_wt2
# <fct> <dbl> <dbl>
# 1 cse 12 6
# 2 his 6 9
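If the existing fav_complex function and the row names are to be kept as they are, another possibility is to store the subsets in a named list rather than in separate variables, so nothing has to be looked up by its name as a string. The following is only a sketch under that assumption; the list name subsets is made up here.

```r
whole_dataset <- data.frame(wt1 = c(1, 2, 3, 6, 6),
                            wt2 = c(2, 3, 4, 4, 2))
row.names(whole_dataset) <- c("HTA1", "HTA2", "HTB2", "CSE1", "CSE2")

# Same logic as the function in the question.
fav_complex <- function(data, complex) {
  colSums(data[complex, ])
}

# Named list of subsets (hypothetical name) instead of separate variables.
subsets <- list(his = c("HTA1", "HTA2", "HTB2"),
                cse = c("CSE1", "CSE2"))

# lapply() hands each character vector to `complex`; the list names
# become the row names of the combined result.
merged_data2 <- do.call(rbind, lapply(subsets, fav_complex, data = whole_dataset))
merged_data2
```

With about 20 subsets, only the subsets list grows; the call that builds merged_data2 stays the same.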

Trying to compare two dataframes, and writing a logical result to a new dataframe in R

I have an R dataframe that contains 18 columns. I would like to write a function that compares column 1 to column 2 and writes a logical result (T if both columns contain the same value, F otherwise) to a new column. This part is not too hard for me. However, I would like to repeat the process for each following pair of columns (3 and 4, 5 and 6, and so on), writing T/F to a new column each time.
values col 1 = values col 2, write T/F to new column, values col 3 = values col 4, write T/F to a new column (or write results to a new dataframe)
I have been trying to do this with the purrr package, and use the pmap/map function, but I know I am making a mistake and missing some important part.
This function should work if I understand your problem correctly.
df <-
data.frame(a = c(18, 6, 2 ,0),
b = c(0, 6, 2, 18),
c = c(1, 5, 6, 8),
d = c(3, 5, 9, 2))
compare_columns <-
function(x){
n_columns <- ncol(x)
odd_columns <- 2*1:(n_columns/2) - 1
even_columns <- 2*1:(n_columns/2)
comparisons_list <-
lapply(seq_len(n_columns/2),
function(y){
x[, odd_columns[y]] == x[, even_columns[y]] # use the argument x, not the global df
})
comparisons_df <-
as.data.frame(comparisons_list,
col.names = paste0("column", odd_columns, "_column", even_columns))
return(cbind(x, comparisons_df))
}
compare_columns(df)
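The same pairing idea can also be sketched more compactly with Map(), which walks over the odd-numbered and even-numbered columns in parallel. This assumes an even number of columns; the names odd, even, same and result, and the "_eq_" naming scheme, are choices made for this illustration only.

```r
df <- data.frame(a = c(18, 6, 2, 0),
                 b = c(0, 6, 2, 18),
                 c = c(1, 5, 6, 8),
                 d = c(3, 5, 9, 2))

# Positions of the left and right column of each pair (assumes an even column count).
odd  <- seq(1, ncol(df), by = 2)
even <- seq(2, ncol(df), by = 2)

# Compare each left column with its right neighbour, element by element.
same <- Map(function(left, right) left == right, df[odd], df[even])
names(same) <- paste0(names(df)[odd], "_eq_", names(df)[even])

result <- cbind(df, as.data.frame(same))
result
```

For the sample data this adds two logical columns, a_eq_b and c_eq_d, next to the original four.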

Using an element from a table in selecting columns/rows in R

I've been working on a process to create all possible combinations of unique integers for lengths 1:n. I found the nCr function (the combn function in the combinat package) to be useful here.
Once all unique occurrences are iterated, they are appended to a consolidation table that contains any possible length+combination of the digits 1:n. A subset of the final table's relevant column (one record) looks like this (column is named String and the subset table f1):
c(1,3,4,5,9,10)
I need to select these columns from a secondary data source (df) one at a time (I am going to loop through this table), so my logic was to use this code:
df[,f1$String]
However, I get a message that says that undefined columns are selected, but if I copy and paste the contents of the cell such as:
df[,c(1, 3, 4, 5, 9, 10)]
it works fine ... I've tried all I can think of at this point; if anyone has some insight it would be greatly appreciated.
Code to reproduce is:
library(combinat)
library(data.table)
library(plyr)
rm(list=ls())
NCols=10
NRows=10
myMat<-matrix(runif(NCols*NRows), ncol=NCols)
XVars <- as.data.frame(myMat)
colnames(XVars) <- c("a","b","c","d","e","f","g","h","i","j")
x1 <- as.data.frame(colnames(XVars[1:ncol(XVars)]))
colnames(x1) <- "Independent.Variable"
setDT(x1)[, Index := .GRP, by = "Independent.Variable"]
colClasses = c("character", "numeric", "numeric")
col.names = c("String", "r!", "n!")
Combination <- read.table(text = "", colClasses = colClasses, col.names = col.names)
for(i in 1:nrow(x1)){
x2<- as.data.frame(combn(nrow(x1),i))
for (i in 1:ncol(x2)){
x3 <- paste("c(",paste(x2[1:nrow(x2),i], collapse = ", "), ")", sep="")
x3 <- as.data.frame(x3)
colnames(x3) <- "String"
x3 <- mutate(x3, "r!" = nrow(x2))
x3 <- mutate(x3, "n!" = nrow(x1))
Combination <- rbind(Combination, x3)
}
}
setDT(Combination)[, Index := .GRP, by = c("String", "r!", "n!")]
f1 <- Combination[717,]
f1$String <- as.character(f1$String)
## reference to data frame
myMat[,(f1$String)]
## pasted element
myMat[, c(1, 3, 4, 5, 9, 10)]
f1$String is the string "c(1, 3, 4, 5, 9, 10)". When you use myMat[,(f1$String)], R will look for the column with name "c(1, 3, 4, 5, 9, 10)". To get column numbers 1,3,4,5,9,10, you have to parse the string to an R expression and evaluate it first:
myMat[,eval(parse(text=f1$String))]
As @user3794498 noticed, you set f1$String with as.character(), so you cannot use it directly to get the columns you want.
You can change the way you define f1, or extract the column numbers from f1$String. Something like this should also work (load stringr first): myMat[, f1$String %>% str_match_all("[0-9]+") %>% unlist %>% as.numeric].
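For readers who would rather avoid both eval(parse()) and an extra package, the digits can also be pulled out of the string with base R. This is a sketch only: the variable names string and idx are made up, and the matrix is a stand-in for the one built in the question.

```r
set.seed(1)                                   # reproducible stand-in data
myMat <- matrix(runif(100), ncol = 10)

# The stored text, as it appears in f1$String in the question.
string <- "c(1, 3, 4, 5, 9, 10)"

# Extract every run of digits and convert to integer column indices.
idx <- as.integer(regmatches(string, gregexpr("[0-9]+", string))[[1]])

idx              # 1 3 4 5 9 10
myMat[, idx]     # same columns as myMat[, c(1, 3, 4, 5, 9, 10)]
```

Storing the combinations as integer vectors in the first place (rather than pasting them into text) would make this parsing step unnecessary.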

Referring to a data frame by a variable name when creating a new column in R

I have a series of ten data frames containing two columns, x and y. I want to add a new column to each data frame containing the name of the data frame. The problem I am running into is how to refer to the data frame using a variable so I can perform this task iteratively. In addition to just referring to it by the variable name, I have also tried get() as follows:
for(i in 1:10){
name <- paste(substr(fileList, 3, 7),i, sep = "")
assign(newName, as.data.frame(get(name)))
get(newName)$Species = c(paste(substr(fileList, 3, 7),i, sep = ""))
}
However, I get the following error when I do so:
Error in get(newName)$Species = c(paste(substr(fileList[a], 3, 7), i, :
could not find function "get<-"
Is there another way to phrase the column assignment command so that I can get around this error, or is the solution more complex?
Here are three different options if you put all your data frames into a named list:
df_list <- list(a = data.frame(x = 1:5),
b = data.frame(x = 1:5))
#Option 1
for (i in seq_along(df_list)){
df_list[[i]][,'Species'] <- names(df_list)[i]
}
#Option 2
tmp <- do.call(rbind,df_list)
tmp$Species <- rep(names(df_list),times = sapply(df_list,nrow))
split(tmp,tmp$Species)
#Option 3
mapply(function(x,y) {x$Species <- y; x},df_list,names(df_list),SIMPLIFY = FALSE)
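All three options assume the data frames are already in a named list. If they currently exist as separate variables, mget() can collect them by name. The sketch below uses two hypothetical data frames, sp1 and sp2, in place of the ten from the question.

```r
# Hypothetical stand-ins for the existing data frames.
sp1 <- data.frame(x = 1:2, y = 3:4)
sp2 <- data.frame(x = 5:6, y = 7:8)

# Build the variable names, then fetch all objects into a named list.
df_names <- paste0("sp", 1:2)
df_list <- mget(df_names)

# Add each data frame's own name as a Species column.
df_list <- Map(function(d, nm) { d$Species <- nm; d }, df_list, names(df_list))

df_list$sp2
```

mget() looks the names up in the calling environment by default, so this is meant to be run where the data frames are defined.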
