Updating A Certain Column Values Within A User Defined Function In R

I have the following piece of code to update a column in a dataframe in R with the median value. This works fine, but I would like to be able to call this as a function from other parts of the program, passing over other dataframes and columns.
medianVal <- median(df$column, na.rm = T)
df$column[is.na(df$column)] <- medianVal
The logic I am attempting is: pass in the data frame and column, get the median value, update the NA values, and return the data frame.
updateWithMedian <- function(DataFrame, Column)
{
medianValue <- median(Column, na.rm = T)
Column[is.na(DataFrame$Column)] <- medianValue
return(DataFrame)
}
DataFrame[[Column]] in the function helps me identify the column, but I am still struggling to update the NA values to the median.
For example, the code
DataFrame[[Column]][is.na(DataFrame$Column)] <- medianValue
doesn't feel like the correct syntax.

You're mixing notations here. If you pass the column name as a quoted string, you cannot use the dataframe$variable notation, because $ looks for a column literally named by the symbol that follows it. Try this (untested) solution:
updateWithMedian <- function(df, colname)
{
medianValue <- median(df[,colname], na.rm = T)
df[is.na(df[,colname]), colname] <- medianValue
return(df)
}
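
As a minimal sketch of the same idea using the [[ ]] notation mentioned in the question, the column can be pulled out by name, filled, and written back. The helper name update_with_median and the sample data frame scores are made up for illustration and are not part of the original code.

```r
# Hypothetical helper: selects the column by its quoted name with [[ ]].
update_with_median <- function(df, colname) {
  column_values <- df[[colname]]
  median_value <- median(column_values, na.rm = TRUE)
  # Replace only the missing entries with the column median.
  column_values[is.na(column_values)] <- median_value
  df[[colname]] <- column_values
  df
}

# Made-up sample data for illustration.
scores <- data.frame(id = 1:5, score = c(10, NA, 30, NA, 50))
filled <- update_with_median(scores, "score")
print(filled$score)
```

Because R copies the data frame inside the function, the caller has to assign the returned value; scores itself is left unchanged.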

We can also do
library(dplyr)
library(zoo)
updateWithMedian <- function(df, colname) {
df %>%
mutate_at(vars(colname), na.aggregate, FUN = median)
}
updateWithMedian(df, "column")

Thanks, that has worked perfectly. Just a follow-on question: if I wanted to update the column with the value of a different column in the same dataframe, what would be the correct step?
I have tried the code below, but it replaces the NAs in Col1 with the name of Col2, whereas it's the value I need.
updateWithMedian <- function(df, colname1, colname2)
{
df[is.na(df[,colname1]), colname1] <- colname2
return(df)
}
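
One possible fix, sketched under the assumption that both arguments are quoted column names: index the second column with the same logical vector, so that values rather than the name are assigned. The function name fill_from_column and the sample data are hypothetical.

```r
# Hypothetical helper: fills NAs in colname1 with the values of colname2
# taken from the same rows.
fill_from_column <- function(df, colname1, colname2) {
  missing_rows <- is.na(df[[colname1]])
  df[[colname1]][missing_rows] <- df[[colname2]][missing_rows]
  df
}

# Made-up sample data for illustration.
sample_df <- data.frame(Col1 = c(1, NA, 3, NA), Col2 = c(10, 20, 30, 40))
patched <- fill_from_column(sample_df, "Col1", "Col2")
print(patched$Col1)
```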

Related

Pass a function input as column name to data.frame function

I have a function taking a character input. Within the function, I want to use the data.frame() function. Within the data.frame() function, one column name should be the function's character input.
I tried it like this and it didn't work:
frame_create <- function(data, character_input){
...
some_vector <- c(1:50)
temp_frame <- data.frame(character_input = some_vector, ...)
return(temp_frame)
}

Either assign the name afterwards with names(), or use setNames(), because = does not evaluate its left-hand side. In tidyverse functions such as tibble() or lst(), the column can be created with := and !!.
frame_create <- function(data, character_input){
some_vector <- 1:50
temp_frame <- data.frame(some_vector)
names(temp_frame) <- character_input
return(temp_frame)
}
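
The setNames() route mentioned above could look roughly like this. It is only a sketch: frame_create_named is a hypothetical variant that takes just the column name, and "my_column" is an arbitrary example value.

```r
# Hypothetical variant of frame_create(); only the column name is passed in.
frame_create_named <- function(character_input) {
  some_vector <- 1:50
  # setNames() returns the data frame with its names replaced.
  setNames(data.frame(some_vector), character_input)
}

temp_frame <- frame_create_named("my_column")
print(names(temp_frame))
```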

Can you explain your requirement for using a function to create a new dataframe column? If you have a dataframe df and you want to make a copy with a new column appended then the trivial solution is:
df2 <- df
df2$new_col <- 1:50
Example of stacking multiple data frames row-wise in R:
cars1 <- mtcars
cars2 <- cars1
cars3 <- cars2
list1 <- list(cars1, cars2, cars3)
all_cars <- Reduce(rbind, list1)

Create variable based on other variable outside function

I'm trying to make my code general: I'd like to change only the YEAR variable without having to change everything else in the code.
YEAR = 1970
y <- data.frame(col1 = c(1:5))
function (y){
summarize(column_YEAR = sum(col1))
}
#Right now this gives
column_YEAR
1 15
#I would like this function to output this (so col1 is changed to column_1970)
column_1970
1 15
or for example this
df <- list("a_YEAR" = anotherdf)
#I would like to have a list with a df with the name a_1970
I tried things like
df <- list(assign(paste0(a_, YEAR), anotherdf))
But it does not work, does somebody have any advice? Thanks in advance :)

rlang provides a flexible way to defuse R expressions. You can use that functionality to create dynamic column names within a dplyr pipeline. In this example, the dynamic column name is built from the suffix argument passed to a wrapper function around dplyr's summarise.
library("tidyverse")
YEAR = 1970
y <- data.frame(col1 = c(1:5))
function (y) {
summarize(column_YEAR = sum(col1))
}
my_summarise <- function(.data, suffix, sum_col) {
var_name <- paste0("column_", suffix)
summarise(.data,
{{var_name}} := sum({{sum_col}}))
}
my_summarise(.data = y, suffix = YEAR, sum_col = col1)
Results
my_summarise(.data = y, suffix = YEAR, sum_col = col1)
# column_1970
# 1 15
You can also read the value directly from the global environment, but this is a poorer solution for readability, as it is not immediately clear how the function creates the suffix.
my_summarise_two <- function(.data, sum_col) {
var_name <- paste0("column_", YEAR)
summarise(.data,
{{var_name}} := sum({{sum_col}}))
}
my_summarise_two(.data = y, sum_col = col1)
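
For the list part of the question, a base R sketch that needs no extra packages is to build the name with paste0() and attach it with setNames(). Here anotherdf is placeholder data, since the question does not show it, and totals is a made-up name.

```r
YEAR <- 1970
anotherdf <- data.frame(col1 = 1:5)  # placeholder data, not from the question

# A list with one element whose name is built from YEAR.
df <- setNames(list(anotherdf), paste0("a_", YEAR))
print(names(df))

# The same pattern renames a column after summing in base R.
totals <- data.frame(sum(anotherdf$col1))
names(totals) <- paste0("column_", YEAR)
print(totals)
```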

Argument is not numeric or logical: returning NA with one string column

Hello, I would like to calculate the mean of every numeric column in my data. For now I have:
for(i in names(MyData)){
avg <- mean(MyData[[i]], na.rm = TRUE)
print(avg)
}
but I get the warning from the title because the last column of MyData is the decision column and contains strings. Is there a way to ignore columns that contain strings? I know that I could convert it to numbers, but I don't want to do that.

We can do this more easily if we use summarise_if from dplyr
library(dplyr)
MyData %>%
summarise_if(is.numeric, mean, na.rm = TRUE)
In the OP's code, the loop goes through every column and only prints the result without storing it, and some of the columns are not numeric. In the code below, we pre-allocate a numeric vector ('v1') to store the output, then use if/else to store the mean when the column is numeric and NA otherwise.
v1 <- numeric(length(MyData))
for(i in seq_along(MyData)) {
if(is.numeric(MyData[[i]])) {
v1[i] <- mean(MyData[[i]], na.rm = TRUE)
} else {
v1[i] <- NA_real_
}
}
In base R, this can also be done with sapply
i1 <- sapply(MyData, is.numeric)
sapply(MyData[i1], mean, na.rm = TRUE)
Or with colMeans
colMeans(MyData[i1], na.rm = TRUE)
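
A small worked example of the base R approach, using made-up data with one character column. The data frame contents and the names is_num and column_means are assumptions for illustration only.

```r
# Made-up data: two numeric columns and one character column.
MyData <- data.frame(
  height = c(1.5, 1.75, NA),
  weight = c(60, 70, 80),
  decision = c("yes", "no", "yes"),
  stringsAsFactors = FALSE
)

# Logical vector marking the numeric columns.
is_num <- vapply(MyData, is.numeric, logical(1))

# Means of the numeric columns only; the character column is skipped.
column_means <- colMeans(MyData[is_num], na.rm = TRUE)
print(column_means)
```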

I give three arguments: the input df, the column I want to clean, and the new column to be added with the cleansed names. Where am I going wrong?

library(dplyr)
clean_name <- function(df,col_name,new_col_name){
#remove whitespace and common titles.
df$new_col_name <- mutate_all(df,
trimws(gsub("MR.?|MRS.?|MS.?|MISS.?|MASTER.?","",df$col_name)))
#remove any chunks of text where a number is present
df$new_col_name<- transmute_all(df,
gsub("[^\\s]*[\\d]+[^\\s]*","",df$col_name,perl = TRUE))
}
I get the following error:
"Error: Column new_col_name must be a 1d atomic vector or a list"

What you want is to make sure that the output of the functions you're using is either a vector or a one-dimensional list, so that it can be added as a new column to the data frame. You can check the class of an object with the class() function, which is part of the base package.
The mutate function by itself should do what you want, it returns the same data frame but with the new column:
library(dplyr)
clean_name <- function(df, col_name, new_col_name) {
# first_cleaning_to_colname = The first change you want to make to the col_name column. This should be a vector.
# second_cleaning_to_colname = The change you're going to make to the col_name column after the first one. This should be a vector too.
first_change <- mutate(df, col_name = first_cleaning_to_colname)
second_change <- mutate(first_change, new_col_name = second_cleaning_to_colname)
return(second_change)
}
You can make both of these changes at the same time, but I thought this way was easier to read.

If we are passing unquoted column names, then use
library(tidyverse)
clean_name <- function(df,col_name, new_col_name){
col_name <- enquo(col_name)
new_col_name <- enquo(new_col_name)
df %>%
mutate(!! new_col_name :=
trimws(str_replace_all(!!col_name, "MR.?|MRS.?|MS.?|MISS.?|MASTER.?","")) ) %>%
transmute(!! new_col_name := trimws(str_replace_all(!! new_col_name,
"[^\\s]*[\\d]+[^\\s]*","")))
}
clean_name(dat1, col1, colN)
# colN
#1 one
#2 two
data
dat1 <- data.frame(col1 = c("MR. one", "MS. two 24"), stringsAsFactors = FALSE)
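
If the column names are passed as quoted strings instead, a base R sketch along these lines may be enough. The helper name clean_name_base is hypothetical, and the title pattern is an assumption about the intent: the dots are escaped and the longer titles are listed first so that "MRS." is not cut short by the "MR" alternative.

```r
# Hypothetical base R counterpart that takes quoted column names.
clean_name_base <- function(df, col_name, new_col_name) {
  # Remove common titles with an optional trailing dot.
  cleaned <- gsub("\\b(MASTER|MISS|MRS|MR|MS)\\.?", "", df[[col_name]], perl = TRUE)
  # Remove any chunk of non-space text that contains a digit.
  cleaned <- gsub("\\S*\\d+\\S*", "", cleaned, perl = TRUE)
  df[[new_col_name]] <- trimws(cleaned)
  df
}

# Made-up sample data for illustration.
titles_df <- data.frame(col1 = c("MR. one", "MS. two 24", "MRS. three"),
                        stringsAsFactors = FALSE)
cleaned_df <- clean_name_base(titles_df, "col1", "colN")
print(cleaned_df$colN)
```

Unlike the transmute() version above, this sketch keeps the original column alongside the new one.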

Compute median per column in loop

I have this loop to compute the mean per column, which works.
for (i in 1:length(DF1)) {
tempA <- DF1[i] # save column of DF1 onto temp variable
names(tempA) <- 'word' # label temp variable for inner_join function
DF2 <- inner_join(tempA, DF0, by='word') # match words with numeric value from look-up DF0
tempB <- as.data.frame(t(colMeans(DF2[-1]))) # compute mean of column
DF3<- rbind(tempB, DF3) # save results together
}
The script uses the dplyr package for inner_join.
DF0 is the look-up database with a word column and three value columns (word, value1, value2, value3).
DF1 is the text data with one word per cell.
DF3 is the output.
Now I want to compute the median instead of the mean. It seemed easy enough with the colMedians function from 'robustbase', but I can't get the below to work.
library(robustbase)
for (i in 1:length(DF1)) {
tempA <- DF1[i]
names(tempA) <- 'word'
DF2 <- inner_join(tempA, DF0, by='word')
tempB <- as.data.frame(t(colMedians(DF2[-1])))
DF3<- rbind(tempB, DF3)
}
The error message reads:
Error in colMedians(tog[-1]) : Argument 'x' must be a matrix.
I've tried to format DF2 as a matrix prior to the colMedians function, but still get the error message:
Error in colMedians(tog[-1]) : Argument 'x' must be a matrix.
I don't understand what is going on here. Thanks for the help!
Happy to provide sample data and error traceback, but trying to keep it as crisp and simple as possible.

According to the comment by the OP, the following solved the problem.
I have added a call to library(dplyr).
My contribution was colMedians(data.matrix(DF2[-1]), na.rm = TRUE).
library(robustbase)
library(dplyr)
for (i in 1:length(DF1)) {
tempA <- DF1[i]
names(tempA) <- 'word'
DF2 <- inner_join(tempA, DF0, by='word')
tempB <- colMedians(data.matrix(DF2[-1]), na.rm = TRUE)
DF3 <- rbind(tempB, DF3)
}
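
The error message says that colMedians wants a matrix, while DF2[-1] is still a data frame. As a sketch of a base R alternative that avoids the conversion, median() can be applied column by column. The data below is a made-up stand-in for DF2, and column_medians is a hypothetical name.

```r
# Made-up stand-in for DF2: a word column followed by three numeric columns.
DF2 <- data.frame(
  word = c("a", "b", "c"),
  value1 = c(1, 2, 9),
  value2 = c(4, NA, 6),
  value3 = c(7, 8, 3)
)

# Dropping the first column still leaves a data frame, not a matrix.
print(is.matrix(DF2[-1]))

# Apply median() to each remaining column; the result is a named numeric vector.
column_medians <- vapply(DF2[-1], median, numeric(1), na.rm = TRUE)
print(column_medians)
```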

I stumbled on this answer, which helped me fix the loop as follows:
DF3Mean <- data.frame() # instantiate dataframe
DF4Median <- data.frame() # instantiate dataframe
for (i in 1:length(DF1)) {
tempA <- DF1[i] # save column of DF1 onto temp variable
names(tempA) <- 'word' # label temp variable for inner_join function
DF2 <- inner_join(tempA, DF0, by='word') # match words with numeric value from look-up DF0
tempMean <- as.data.frame(t(colMeans(DF2[-1]))) # compute mean of column
DF3Mean <- rbind(tempMean, DF3Mean) # save results together
tempMedian <- apply(DF2[ ,2:4], 2, median) # compute median for columns 2, 3, and 4
DF4Median <- rbind(tempMedian, DF4Median) # save results together
}
I guess I was too fixated on the colMedians function.