Loop to fill NA's with means not working properly - r

I'm trying to fill all the NA's in my fields with the mean of each column.
The code I've been using is:
var1<-colnames(DF)
for (i in 1:length(var1)) {
v<-paste0("`",var1[i],"`")
DF<-DF %>%
mutate(v=ifelse(is.na(v),mean(v,na.rm=TRUE),v))
}
After running this piece of code, nothing happens with the DF.
I already tried running for an individual column, and the code works:
DF<-DF%>%
mutate(col1=ifelse(is.na(col1),mean(col1,na.rm=TRUE),col1))
I wrap the names in backticks in the paste0 call because some column names contain spaces and I cannot change them. I have the feeling that this is where the mistake lies.

For multiple columns use mutate_at (for all columns, mutate_all). Pass a formula rather than funs(), which is deprecated in current dplyr
DF %>%
mutate_all(~ ifelse(is.na(.), mean(., na.rm = TRUE), .))
It can be made more compact with na.aggregate from zoo, which replaces the NAs in each column with that column's mean (by default FUN = mean)
library(zoo)
na.aggregate(DF)
If we are using a for loop, then there is no need for a package. Just update the column NA elements with the mean of that column
for(nm in var1) DF[[nm]][is.na(DF[[nm]])] <- mean(DF[[nm]], na.rm = TRUE)
Or with lapply
DF[] <- lapply(DF, function(x) replace(x, is.na(x), mean(x, na.rm = TRUE)))
Or using colMeans
DF[is.na(DF)] <- colMeans(DF, na.rm = TRUE)[col(DF)][is.na(DF)]
data
set.seed(24)
DF <- as.data.frame(matrix(sample(c(NA, 0:5), 20 *5, replace = TRUE), 20, 5))
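On the backtick concern in the question: inside the original loop, v is only a character string, and mutate(v = ...) targets a column literally named v, so the real columns are never touched. A minimal base R sketch, assuming a hypothetical data frame DF_demo with a space in one column name, shows that indexing with [[ ]] needs no backticks at all:

```r
# DF_demo and its values are made up for illustration
DF_demo <- data.frame(`col one` = c(1, NA, 3),
                      col2 = c(NA, 4, 6),
                      check.names = FALSE)  # keep the space in "col one"

for (nm in colnames(DF_demo)) {
  # DF_demo[[nm]] looks the column up by its exact name, spaces included
  DF_demo[[nm]][is.na(DF_demo[[nm]])] <- mean(DF_demo[[nm]], na.rm = TRUE)
}

DF_demo
```

Because the name is passed as a plain string, the same loop works for any column name that the data frame accepts.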

Related

NA values are not getting replaced in a function but works when called outside a function in R [duplicate]

I have a tibble main_df with columns such as "Rainfall_(mm)", "Speed_of_maximum_wind_gust_(km/h)", "9am_Temperature", "9am_relative_humidity_(%)", "9am_cloud_amount_(oktas)"... etc. I tried to identify the numeric columns with col_type_vector <- sapply(main_df, typeof), and for every numeric column I want to replace the NA values with the median of that column. Note that I start from column 3 because I don't want to touch the first 2 columns.
The function and the loop are given below:
set_na_to_median <- function(data_frame, column_name) {
median_value <- median(data_frame[[column_name]], na.rm = TRUE)
na_indices <- which(is.na(data_frame[column_name]))
data_frame[na_indices, column_name] <- median_value
}
col_type_vector <- sapply(main_df, typeof)
for (item in names(col_type_vector)[3:length(names(col_type_vector))]) {
if (col_type_vector[item] == "integer" | col_type_vector[item] == "double" | col_type_vector[item] == "numeric") {
set_na_to_median(main_df, item)
}
}
But when I run it, the NA values do not get replaced. If I run the same code manually, outside the function and the loop, it works perfectly. I have basically wasted my whole day on this. What am I doing wrong?
Thanks in advance.
You need to assign your NA-replaced tibble somewhere: the function modifies its own copy of the data, so main_df itself is left unchanged. Try replacing
set_na_to_median(main_df, item)
with
main_df <- set_na_to_median(main_df, item)
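For this to work, the function must also return the modified tibble: add data_frame as its last line, because as written it returns only the value of its final assignment (median_value).
To check the type of a column, use class instead of typeof: col_type_vector <- sapply(main_df, class). With class, a double column is reported as "numeric".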
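A minimal sketch of the corrected function and loop, run on a small hypothetical data frame (demo_df and its values are made up for illustration; a tibble should behave the same way here). The key change is the final data_frame line, which makes the function return the modified copy:

```r
set_na_to_median <- function(data_frame, column_name) {
  median_value <- median(data_frame[[column_name]], na.rm = TRUE)
  na_indices <- which(is.na(data_frame[[column_name]]))
  data_frame[na_indices, column_name] <- median_value
  data_frame  # return the modified copy to the caller
}

# Hypothetical stand-in for main_df: two leading columns that are skipped
demo_df <- data.frame(Date = c("d1", "d2", "d3", "d4"),
                      Location = c("x", "x", "y", "y"),
                      `Rainfall_(mm)` = c(1, NA, 3, 8),
                      `9am_Temperature` = c(NA, 10, 20, 30),
                      check.names = FALSE)

for (item in names(demo_df)[3:ncol(demo_df)]) {
  if (is.numeric(demo_df[[item]])) {
    demo_df <- set_na_to_median(demo_df, item)  # assign the result back
  }
}

demo_df
```

Using is.numeric() on the column itself sidesteps the typeof/class comparison entirely, since it is TRUE for both integer and double columns.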
However, I think there is an easier process to do this.
Using dplyr -
library(dplyr)
main_df <- main_df %>%
mutate(across(where(is.numeric),
~replace(., is.na(.), median(., na.rm = TRUE))))
main_df
You may also use base R to do this -
main_df[] <- lapply(main_df, function(x)
if(is.numeric(x)) replace(x, is.na(x), median(x, na.rm = TRUE)) else x)
We may use na.aggregate from zoo
library(dplyr)
library(zoo)
main_df <- main_df %>%
mutate(across(where(is.numeric), ~ na.aggregate(.x, FUN = median)))

For loop to replace NA values in R

I want to write a for loop in R that replaces the NA values in one column of my dataframe with the mean of the values in the same column, when 2 conditions are true.
When the conditions are met, I want to assign to the NAs the mean of the observations from the same year and the same group.
I wrote the following code, but I am struggling to write the conditions.
missing <- which(is.na(df$price))
for (i in 1:36){
x <- df[missing,]group
y <- df[missing,]year
selection <- df[conditions??,]$price
df[missing,]$price <- mean(selection, na.rm = TRUE)
}
You don't need a for loop. You can replace all the NAs directly with mean(price, na.rm = TRUE), the mean of that column computed without the NAs. This is the general case, ignoring group and year:
df[is.na(df$price),]$price <- mean(df$price, na.rm = TRUE)
Using tidyverse you can achieve what you want:
library(tidyverse)
df %>% group_by(group, year) %>% mutate(price=ifelse(is.na(price), mean(price, na.rm=T), price))
Using data.table
library(data.table)
dt <- data.table(df)
dt[,price:=fifelse(is.na(price), mean(price, na.rm=T), price), by=.(group,year)][]
A base R solution using by, which splits a data frame by the groups in the list in the second argument, and applies a function defined in the third:
result <- by(df,
list(df[["group"]], df[["year"]]),
function(x) {
x[is.na(x$price), "price"] <- mean(x[["price"]], na.rm = TRUE)
x
},
simplify = TRUE)
do.call(rbind, result)
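Another base R route, not used in the answers above, is ave(), which applies a function within each combination of grouping values and returns the result in the original row order. A minimal sketch, assuming a hypothetical data frame df_demo with the same three column names as in the question:

```r
# df_demo and its values are made up for illustration
df_demo <- data.frame(group = c("a", "a", "a", "b", "b"),
                      year  = c(2020, 2020, 2020, 2021, 2021),
                      price = c(10, NA, 20, 5, NA))

# Within each group/year combination, replace NA by that combination's mean
df_demo$price <- ave(df_demo$price, df_demo$group, df_demo$year,
                     FUN = function(x) replace(x, is.na(x), mean(x, na.rm = TRUE)))

df_demo
```

Unlike the by() approach, this keeps the rows in place, so no rbind step is needed. A group whose prices are all NA would be filled with NaN, since there is nothing to average.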

calculate z-score across multiple dataframes in R

I have ten dataframes with equal number of rows and columns. They look like this:
df1 <- data.frame(geneID=c("AKT1","AKT2","AKT3","ALK",
"APC"),
CDKN2A=c(3490,9447,4368,908,204),
INPP4B=c(NA,9459,4395,1030,NA),
BCL2=c(NA,9480,4441,1209,NA),
IRS2=c(NA,NA,4639,1807,NA),
HRAS=c(3887,9600,4691,1936,1723))
df2 <- data.frame(geneID=c("AKT1","AKT2","AKT3","ALK",
"APC"),
CDKN2A=c(10892,17829,7156,1325,387),
INPP4B=c(NA,17840,7185,1474,NA),
BCL2=c(NA,17845,7196,1526,NA),
IRS2=c(NA,NA,12426,10244,NA),
HRAS=c(11152,17988,7545,2734,2423))
df3 <- data.frame(geneID=c("AKT1","AKT2","AKT3","ALK",
"APC"),
CDKN2A=c(11376,17103,8580,780,178),
INPP4B=c(NA,17318,9001,2829,NA),
BCL2=c(NA,17124,8621,1141,NA),
IRS2=c(NA,NA,8658,1397,NA),
HRAS=c(11454,17155,8683,1545,1345))
I would like to calculate the z-score for each data frame, based on the mean and variance across multiple dataframes. The z-score should be calculated as follows: z-score = (x - mean(x)) / sd(x).
I found that ddply function of plyr can do this job, but the solution was for single dataframe, while I have multiple dataframes as separate files with 18214 rows and 269 columns.
I would appreciate any suggestions.
Thank you very much for your help!
Olha
Here is one option where we bind the datasets together with bind_rows (from dplyr), then group by the grouping column and return the zscore transformed numeric columns
library(dplyr)
bind_rows(df1, df2, df3, .id = 'grp') %>%
group_by(geneID) %>%
mutate(across(where(is.numeric),
~(.- mean(., na.rm = TRUE))/sd(., na.rm = TRUE), .names = '{col}_zscore'))
NOTE: if we don't need new columns, then remove the .names part
If we need to do this in a loop, without binding into a single data.frame, we can loop over the list
library(purrr)
list(df1, df2, df3) %>% # or build the list automatically: mget(ls(pattern = '^df\\d+$'))
map(~ .x %>%
mutate(across(where(is.numeric),
~(.- mean(., na.rm = TRUE))/sd(., na.rm = TRUE), .names = '{col}_zscore')))
Here is a base R solution with function scale.
df_list <- list(df1, df2, df3)
df_list2 <- lapply(df_list, function(DF){
i <- sapply(DF, is.numeric)
DF[i] <- lapply(DF[i], scale)
DF
})
S3 methods
Considering that scale is generic and that methods can be written for it, here is a data.frame method, then applied to the same list df_list.
scale.data.frame <- function(x, center = TRUE, scale = TRUE){
i <- sapply(x, is.numeric)
x[i] <- lapply(x[i], scale, center = center, scale = scale)
x
}
df_list3 <- lapply(df_list, scale)
identical(df_list2, df_list3)
#[1] TRUE
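If "across multiple dataframes" is read as a separate mean and sd for every cell, taken over the corresponding cells of all data frames, one possible base R sketch stacks the numeric parts into a 3-D array. This is only one interpretation of the question; d1, d2, d3 and their values are hypothetical, and the frames are assumed to have identical row and column order:

```r
# Hypothetical small stand-ins for the real data frames
d1 <- data.frame(geneID = c("g1", "g2"), A = c(1, 2), B = c(3, NA))
d2 <- data.frame(geneID = c("g1", "g2"), A = c(3, 4), B = c(5, NA))
d3 <- data.frame(geneID = c("g1", "g2"), A = c(5, 6), B = c(7, NA))
dfs <- list(d1, d2, d3)

# Numeric columns of each frame as a matrix, then stacked: rows x cols x frames
mats <- lapply(dfs, function(d) as.matrix(d[sapply(d, is.numeric)]))
arr <- array(unlist(mats), dim = c(dim(mats[[1]]), length(mats)))

# Per-cell mean and sd across the frames (third dimension)
mu    <- apply(arr, c(1, 2), mean, na.rm = TRUE)
sigma <- apply(arr, c(1, 2), sd, na.rm = TRUE)

# One z-score matrix per input data frame
z_list <- lapply(mats, function(m) (m - mu) / sigma)
z_list[[1]]
```

Cells that are NA in every frame stay NA in the result.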

How to conditionally replace values with NA across multiple columns

I would like to replace outliers in each column of a dataframe with NA.
If for example we define outliers as being any value greater than 3 standard deviations from the mean I can achieve this per variable with the code below.
Rather than specify each column individually I'd like to perform the same operation on all columns of df in one call. Any pointers on how to do this?!
Thanks!
library(dplyr)
data("iris")
df <- iris %>%
select(Sepal.Length, Sepal.Width, Petal.Length)%>%
head(10)
# add a clear outlier to each variable
df[1, 1:3] = 99
# replace values above 3 SD's with NA
df_cleaned <- df %>%
mutate(Sepal.Length = replace(Sepal.Length, Sepal.Length > (abs(3 * sd(df$Sepal.Length, na.rm = TRUE))), NA))
You need to use mutate_all() with a formula (funs() is deprecated in current dplyr), i.e.
library(dplyr)
df %>%
mutate_all(~ replace(., . > (abs(3 * sd(., na.rm = TRUE))), NA))
Another option is base R
df[] <- lapply(df, function(x) replace(x, x > (abs(3 * sd(x, na.rm = TRUE))), NA))
or with colSds from matrixStats
library(matrixStats)
# index the SDs with col(df) so that each value is compared with its own column's SD
df[df > abs(3 * colSds(as.matrix(df), na.rm = TRUE))[col(df)]] <- NA
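The snippets above compare each value with 3 * sd directly, as in the question's own code. If the stated definition is taken literally (more than 3 standard deviations away from the mean), the distance from the column mean has to be used instead. A minimal base R sketch, assuming a hypothetical helper flag_outliers and made-up data:

```r
# flag_outliers is a hypothetical helper, not part of any package
flag_outliers <- function(x, k = 3) {
  # NA out values further than k standard deviations from the column mean
  replace(x, abs(x - mean(x, na.rm = TRUE)) > k * sd(x, na.rm = TRUE), NA)
}

# Made-up data: 19 ordinary values and one clear outlier per column
demo <- data.frame(a = c(rep(1, 19), 99),
                   b = c(rep(2, 19), 50))

demo[] <- lapply(demo, flag_outliers)
demo
```

Note that sample size matters for this rule: with n observations, no value can lie more than (n - 1) / sqrt(n) sample standard deviations from the mean, which is about 2.85 for n = 10, so on a 10-row sample this definition can never flag anything.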

Replace NA values from different columns

R replace problem
I can't replace the NA values in several columns of my dataset with the median of the column each NA belongs to.
titanic.new is the dataset.
I have tried:
fun3<-function(x)
{
column.numeric<-x[,sapply(x,is.numeric)]
column.numeric[which(is.na(column.numeric))]<-median(column.numeric,na.rm = TRUE)
return(column.numeric)
}
fun3(titanic.new)
I'm getting an error:
Error in median.default(column.numeric, na.rm = TRUE) :
need numeric data
What am I doing wrong?
The error occurs because median() expects a vector, while column.numeric is a whole data frame. We can modify the function as follows: loop through the columns of the dataset and check whether each one is numeric, which returns a logical vector ('i1'). Subset the data using that vector, loop through the selected columns with lapply, and replace the NAs in each column with the median of that column
fun3<-function(x){
i1 <- sapply(x,is.numeric)
x[i1] <- lapply(x[i1], function(y) replace(y, is.na(y), median(y, na.rm = TRUE)))
x
}
fun3(titanic.new)
Or it can be done with tidyverse
library(tidyverse)
titanic.new %>%
mutate_if(is.numeric, list(~ replace(., is.na(.), median(., na.rm = TRUE))))
which can be wrapped in a function as well
fun4 <- function(x) {
x %>%
mutate_if(is.numeric,
list(~ replace(., is.na(.), median(., na.rm = TRUE))))
}
Also, this can be done more compactly with na.aggregate
library(zoo)
i1 <- sapply(titanic.new, is.numeric)
titanic.new[i1] <- na.aggregate(titanic.new[i1], FUN = median)
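To see both the error and the fix in one place, here is a minimal sketch on a small hypothetical data frame (demo and its values are made up, standing in for titanic.new). Calling median() on the numeric sub-data-frame reproduces the error message, while applying it column by column works:

```r
# demo is a made-up stand-in for titanic.new
demo <- data.frame(name = c("a", "b", "c", "d"),
                   age  = c(20, NA, 40, 30),
                   fare = c(NA, 7, 9, 8))

num <- demo[sapply(demo, is.numeric)]  # still a data frame, not a vector

# median() on a data frame fails; capture the message instead of stopping
err <- tryCatch(median(num, na.rm = TRUE),
                error = function(e) conditionMessage(e))
err

# Column-by-column replacement, leaving the non-numeric column untouched
demo[names(num)] <- lapply(num, function(y) replace(y, is.na(y), median(y, na.rm = TRUE)))
demo
```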

Resources