Replace NA values from different columns - r

R replace problem
I can't replace the NA values in different columns of the dataset with the median of the corresponding column.
titanic.new is the dataset.
I have tried:
fun3 <- function(x) {
  column.numeric <- x[, sapply(x, is.numeric)]
  column.numeric[which(is.na(column.numeric))] <- median(column.numeric, na.rm = TRUE)
  return(column.numeric)
}
fun3(titanic.new)
I'm getting an error:
Error in median.default(column.numeric, na.rm = TRUE) :
need numeric data
What am I doing wrong?

We can make a small modification to the function. The error occurs because x[, sapply(x, is.numeric)] returns a data.frame with several columns, and median() expects a numeric vector. Instead, loop through the columns of the dataset, check whether each one is numeric ('i1', a logical vector), subset the data with that vector, loop over the selected columns with lapply, and replace the NAs in each column with the median of that column:
fun3 <- function(x) {
  i1 <- sapply(x, is.numeric)
  x[i1] <- lapply(x[i1], function(y) replace(y, is.na(y), median(y, na.rm = TRUE)))
  x
}
fun3(titanic.new)
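As a quick sanity check, the corrected function can be run on a small made-up data frame (toy data, since titanic.new itself isn't shown here); non-numeric columns are passed through untouched:
toy <- data.frame(Age  = c(22, NA, 30, NA, 41),
                  Fare = c(7.25, 71.28, NA, 8.05, NA),
                  Name = c("a", "b", "c", "d", "e"))
fun3(toy)
# the NAs in Age become 30 and the NAs in Fare become 8.05 (the column medians);
# the character column Name is left as is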
Or it can be done with the tidyverse:
library(tidyverse)
titanic.new %>%
  mutate_if(is.numeric, list(~ replace(., is.na(.), median(., na.rm = TRUE))))
which can be wrapped in a function as well
fun4 <- function(x) {
  x %>%
    mutate_if(is.numeric,
              list(~ replace(., is.na(.), median(., na.rm = TRUE))))
}
Also, this can be done more compactly with na.aggregate
library(zoo)
i1 <- sapply(titanic.new, is.numeric)
titanic.new[i1] <- na.aggregate(titanic.new[i1], FUN = median)

For loop to replace NA values in R

I want to write a for loop in R that replaces the NA values in one column of my dataframe with the mean of the values of that same column when two conditions are true.
When conditions are met, I want to assign the mean to NAs using observations from the same year and from the same group.
I wrote the following code, but I am struggling to write the conditions.
missing <- which(is.na(df$price))
for (i in 1:36) {
  x <- df[missing, ]$group
  y <- df[missing, ]$year
  selection <- df[conditions??, ]$price
  df[missing, ]$price <- mean(selection, na.rm = TRUE)
}
You don't need a for loop: you can replace all the NAs directly with mean(, na.rm = TRUE), which computes the mean of the column while ignoring the NAs. This is the general (ungrouped) case:
df[is.na(df$price),]$price <- mean(df$price, na.rm = TRUE)
Using the tidyverse you can achieve the grouped replacement you want:
library(tidyverse)
df %>%
  group_by(group, year) %>%
  mutate(price = ifelse(is.na(price), mean(price, na.rm = TRUE), price))
Using data.table:
library(data.table)
dt <- data.table(df)
dt[, price := fifelse(is.na(price), mean(price, na.rm = TRUE), price), by = .(group, year)][]
A base R solution using by, which splits a data frame by the groups in the list in the second argument, and applies a function defined in the third:
result <- by(df,
             list(df[["group"]], df[["year"]]),
             function(x) {
               x[is.na(x$price), "price"] <- mean(x[["price"]], na.rm = TRUE)
               x
             },
             simplify = TRUE)
do.call(rbind, result)
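As a quick sanity check on the grouped approaches, here is a small made-up df with the assumed group, year and price columns (toy data, since the original df isn't shown), run through the dplyr version:
library(dplyr)
df <- data.frame(group = c("a", "a", "a", "b", "b"),
                 year  = c(2020, 2020, 2020, 2021, 2021),
                 price = c(10, NA, 20, NA, 8))
df %>%
  group_by(group, year) %>%
  mutate(price = ifelse(is.na(price), mean(price, na.rm = TRUE), price))
# the NA in group a / 2020 becomes 15 (the mean of 10 and 20),
# the NA in group b / 2021 becomes 8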

Not able to get the value, getting the position

I am trying to get the first non-zero value in each row, but with my code I am getting its position instead. I know this happens because I am using which in my code, but I need the value. Please help; I am sharing my sample data and the R query I used.
Cnt<- c(9940000126,9940000188,9940000406,9940000992,9940001017,9940001288,9940001833,9940002276,9940002629)
FY12_April <- c(0,0,0,0,0,0,0,0,0)
FY12_August <- c(0,0,.343545,0,0,0,0,0,0)
FY12_December <- c(0,0,0,0,0,0,0,0,0)
FY12_February <- c(0,0,0,0,0,0,0,0,0)
FY12_January <- c(0,0,0.98557,0,0,0,0,0,0.41949703)
FY12_July <- c(0,0,0,0,0,-1.211583915,0,0,0)
FY12_June <- c(-1.47268885,0,0,0,-0.80164469,0,0,0,0)
SamData <- data.frame(Cnt,FY12_April,FY12_August,FY12_December,FY12_February,FY12_January,FY12_July,FY12_June)
library(dplyr)
ProcessData <- SamData %>%
  mutate(Count = apply(select(., FY12_April:FY12_June), 1, function(x) sum(x != 0, na.rm = TRUE))) %>%
  mutate(FirstInst = colnames(select(., FY12_April:FY12_June))[apply(select(., FY12_April:FY12_June), 1, function(x) which(x != 0)[1])]) %>%
  mutate(FirstInstAmt = apply(select(., FY12_April:FY12_June), 1, function(x) which(x != 0, arr.ind = TRUE, useNames = TRUE)[1]))
We can use max.col to get the column of the first non-zero value in each row and combine it with the row index to extract the value itself. It is more efficient than apply.
SamData$FirstInstAmt <- SamData[-1][cbind(seq_len(nrow(SamData)),
                                          max.col(SamData[-1] != 0, 'first'))]
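To see the two pieces that make up the matrix indexing (illustrative only):
max.col(SamData[-1] != 0, 'first')          # column index of the first TRUE per row
                                            # (1 if a row is all zero)
cbind(seq_len(nrow(SamData)),               # row numbers paired with those columns:
      max.col(SamData[-1] != 0, 'first'))   # a two-column matrix that picks one cell per row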
Or if we want to use apply
SamData$FirstInstAmt <- apply(SamData[-1], 1, function(x) x[x !=0][1])
SamData$FirstInstAmt
#[1] -1.4726888 NA 0.3435450 NA -0.8016447
#[6] -1.2115839 NA NA 0.4194970
Or using pmap with dplyr
library(dplyr)
library(purrr)
SamData %>%
  mutate(FirstInstAmt = pmap_dbl(.[-1], ~ {x <- c(...); x[x != 0][1]}))
Or use c_across with rowwise
SamData %>%
  rowwise %>%
  mutate(FirstInstAmt = {tmp <- c_across(FY12_April:FY12_June)
                         tmp[tmp != 0][1]})
Or convert the 0 values to NA and use coalesce to return the first non-NA in each row; the x * NA^(!x) trick works because NA^1 is NA (so zeros become 0 * NA = NA) while NA^0 is 1 (so non-zero values are unchanged):
SamData %>%
  mutate(FirstInstAmt = coalesce(!!! .[-1] * NA^(!.[-1])))
NOTE: rowwise/c_across, pmap and apply are all row-wise loops and can be slower. The most efficient of these options is the one based on matrix indexing (max.col), followed to some extent by the coalesce/replace approach.
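If you want to verify the relative timings on your own data, here is a minimal benchmark sketch (assuming the microbenchmark package is installed):
library(microbenchmark)
microbenchmark(
  max.col = SamData[-1][cbind(seq_len(nrow(SamData)),
                              max.col(SamData[-1] != 0, 'first'))],
  apply   = apply(SamData[-1], 1, function(x) x[x != 0][1]),
  times = 100L
)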

Calculate z-score across multiple dataframes in R

I have ten dataframes with equal number of rows and columns. They look like this:
df1 <- data.frame(geneID = c("AKT1","AKT2","AKT3","ALK","APC"),
                  CDKN2A = c(3490,9447,4368,908,204),
                  INPP4B = c(NA,9459,4395,1030,NA),
                  BCL2   = c(NA,9480,4441,1209,NA),
                  IRS2   = c(NA,NA,4639,1807,NA),
                  HRAS   = c(3887,9600,4691,1936,1723))
df2 <- data.frame(geneID = c("AKT1","AKT2","AKT3","ALK","APC"),
                  CDKN2A = c(10892,17829,7156,1325,387),
                  INPP4B = c(NA,17840,7185,1474,NA),
                  BCL2   = c(NA,17845,7196,1526,NA),
                  IRS2   = c(NA,NA,12426,10244,NA),
                  HRAS   = c(11152,17988,7545,2734,2423))
df3 <- data.frame(geneID = c("AKT1","AKT2","AKT3","ALK","APC"),
                  CDKN2A = c(11376,17103,8580,780,178),
                  INPP4B = c(NA,17318,9001,2829,NA),
                  BCL2   = c(NA,17124,8621,1141,NA),
                  IRS2   = c(NA,NA,8658,1397,NA),
                  HRAS   = c(11454,17155,8683,1545,1345))
I would like to calculate a z-score for each data frame, based on the mean and variance across the multiple dataframes. The z-score should be calculated as z = (x - mean(x)) / sd(x).
I found that the ddply function of plyr can do this job, but the solution was for a single dataframe, while I have multiple dataframes stored as separate files, each with 18214 rows and 269 columns.
I would appreciate any suggestions.
Thank you very much for your help!
Olha
Here is one option where we bind the datasets together with bind_rows (from dplyr), then group by the grouping column and return the z-score transformed numeric columns:
library(dplyr)
bind_rows(df1, df2, df3, .id = 'grp') %>%
  group_by(geneID) %>%
  mutate(across(where(is.numeric),
                ~ (. - mean(., na.rm = TRUE))/sd(., na.rm = TRUE), .names = '{col}_zscore'))
NOTE: if we don't need new columns, then remove the .names part (sketched below).
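That is, a sketch of the same call with the numeric columns overwritten in place:
bind_rows(df1, df2, df3, .id = 'grp') %>%
  group_by(geneID) %>%
  mutate(across(where(is.numeric),
                ~ (. - mean(., na.rm = TRUE))/sd(., na.rm = TRUE)))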
If we need to do this in a loop, without binding into a single data.frame, we can loop over the list:
library(purrr)
list(df1, df2, df3) %>%   # or, to collect them automatically: mget(ls(pattern = '^df\\d+$'))
  map(~ .x %>%
        mutate(across(where(is.numeric),
                      ~ (. - mean(., na.rm = TRUE))/sd(., na.rm = TRUE), .names = '{col}_zscore')))
Here is a base R solution with function scale.
df_list <- list(df1, df2, df3)
df_list2 <- lapply(df_list, function(DF) {
  i <- sapply(DF, is.numeric)
  DF[i] <- lapply(DF[i], scale)
  DF
})
S3 methods
Considering that scale is generic and that methods can be written for it, here is a data.frame method, then applied to the same list df_list.
scale.data.frame <- function(x, center = TRUE, scale = TRUE) {
  i <- sapply(x, is.numeric)
  x[i] <- lapply(x[i], scale, center = center, scale = scale)
  x
}
df_list3 <- lapply(df_list, scale)
identical(df_list2, df_list3)
#[1] TRUE
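As a quick check on the first scaled data frame, each numeric column should now have (approximately) mean 0 and standard deviation 1, ignoring NAs:
sapply(df_list2[[1]][-1], function(x) c(mean = mean(x, na.rm = TRUE),
                                        sd   = sd(x, na.rm = TRUE)))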

How to conditionally replace values with NA across multiple columns

I would like to replace outliers in each column of a dataframe with NA.
If, for example, we define outliers as any value greater than 3 standard deviations from the mean, I can achieve this per variable with the code below.
Rather than specify each column individually I'd like to perform the same operation on all columns of df in one call. Any pointers on how to do this?!
Thanks!
library(dplyr)
data("iris")
df <- iris %>%
  select(Sepal.Length, Sepal.Width, Petal.Length) %>%
  head(10)
# add a clear outlier to each variable
df[1, 1:3] = 99
# replace values above 3 SD's with NA
df_cleaned <- df %>%
  mutate(Sepal.Length = replace(Sepal.Length,
                                Sepal.Length > abs(3 * sd(df$Sepal.Length, na.rm = TRUE)),
                                NA))
You need to use mutate_all(), i.e.
library(dplyr)
df %>%
  mutate_all(funs(replace(., . > abs(3 * sd(., na.rm = TRUE)), NA)))
Another option is base R
df[] <- lapply(df, function(x) replace(x, x > abs(3 * sd(x, na.rm = TRUE)), NA))
or with colSds from matrixStats
library(matrixStats)
df[df > abs(3 * colSds(as.matrix(df), na.rm = TRUE))] <- NA
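As a side note, funs() and mutate_all() are superseded in dplyr 1.0+; assuming a recent dplyr, the same idea can be sketched with across():
library(dplyr)
df %>%
  mutate(across(everything(), ~ replace(., . > abs(3 * sd(., na.rm = TRUE)), NA)))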

Loop to fill NA's with means not working properly

I'm trying to fill all the NA's in my fields with the mean of each column.
The code I've been using is:
var1 <- colnames(DF)
for (i in 1:length(var1)) {
  v <- paste0("`", var1[i], "`")
  DF <- DF %>%
    mutate(v = ifelse(is.na(v), mean(v, na.rm = TRUE), v))
}
After running this piece of code, nothing happens to DF.
I already tried running for an individual column, and the code works:
DF <- DF %>%
  mutate(col1 = ifelse(is.na(col1), mean(col1, na.rm = TRUE), col1))
I'm using the backticks in the paste part because some of the column names can have spaces between words and I cannot change this. I have the feeling that this part is where the mistake resides.
For multiple columns use mutate_at (for all columns, mutate_all). The loop doesn't work because mutate(v = ...) creates a column literally named v, and on the right-hand side v is just the character string holding the column name, so the original columns are never touched.
DF %>%
  mutate_all(funs(ifelse(is.na(.), mean(., na.rm = TRUE), .)))
It can be made compact with na.aggregate from zoo (it replaces the NAs with the mean of each column; by default FUN = mean):
library(zoo)
na.aggregate(DF)
If we are using a for loop, there is no need for a package. Just update the NA elements of each column with the mean of that column:
for(nm in var1) DF[[nm]][is.na(DF[[nm]])] <- mean(DF[[nm]], na.rm = TRUE)
Or with lapply
DF[] <- lapply(DF, function(x) replace(x, is.na(x), mean(x, na.rm = TRUE)))
Or using colMeans, where col(DF) gives the column index of every cell, so indexing the vector of column means with it expands them to the shape of DF:
DF[is.na(DF)] <- colMeans(DF, na.rm = TRUE)[col(DF)][is.na(DF)]
data
set.seed(24)
DF <- as.data.frame(matrix(sample(c(NA, 0:5), 20 *5, replace = TRUE), 20, 5))
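A quick check with the simulated DF above (and zoo already loaded for na.aggregate): no NAs should remain after the fill, and mean imputation leaves the column means unchanged:
DF2 <- na.aggregate(DF)
anyNA(DF2)                                             # should be FALSE
all.equal(colMeans(DF2), colMeans(DF, na.rm = TRUE))   # should be TRUE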
