I would like to replace outliers in each column of a dataframe with NA.
If, for example, we define outliers as any value more than 3 standard deviations from the mean, I can achieve this per variable with the code below.
Rather than specify each column individually I'd like to perform the same operation on all columns of df in one call. Any pointers on how to do this?!
Thanks!
library(dplyr)
data("iris")
df <- iris %>%
select(Sepal.Length, Sepal.Width, Petal.Length) %>%
head(10)
# add a clear outlier to each variable
df[1, 1:3] = 99
# replace values above 3 SD's with NA
df_cleaned <- df %>%
mutate(Sepal.Length = replace(Sepal.Length, Sepal.Length > (abs(3 * sd(df$Sepal.Length, na.rm = TRUE))), NA))
You need to use mutate_all(), i.e.
library(dplyr)
df %>%
mutate_all(funs(replace(., . > (abs(3 * sd(., na.rm = TRUE))), NA)))
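On current dplyr versions (1.0+), mutate_all()/funs() are superseded; with dplyr loaded as above, a rough sketch of the same operation with across(), assuming all selected columns are numeric:
df %>%
  mutate(across(everything(),
                ~ replace(.x, .x > abs(3 * sd(.x, na.rm = TRUE)), NA)))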
Another option is base R
df[] <- lapply(df, function(x) replace(x, x > abs(3 * sd(x, na.rm = TRUE)), NA))
or with colSds from matrixStats, indexing the vector of column SDs with col(df) so that each column is compared against its own standard deviation
library(matrixStats)
df[df > abs(3 * colSds(as.matrix(df), na.rm = TRUE))[col(df)]] <- NA
I want to write a for loop in R that replaces the NA values in one column of my dataframe with the mean of the values in the same column when two conditions are met.
When those conditions are met, I want to assign to each NA the mean computed from observations of the same year and the same group.
I wrote the following code, but I am struggling to write the conditions.
missing <- which(is.na(df$price))
for (i in 1:36){
  x <- df[missing, ]$group
  y <- df[missing, ]$year
  selection <- df[conditions??, ]$price
  df[missing, ]$price <- mean(selection, na.rm = TRUE)
}
You don't need a for loop: you can replace all the NAs at once, using mean(..., na.rm = TRUE) to compute the mean of the column while ignoring the NAs. This is for the general case (no grouping):
df[is.na(df$price),]$price <- mean(df$price, na.rm = TRUE)
Using tidyverse you can achieve what you want:
library(tidyverse)
df %>%
  group_by(group, year) %>%
  mutate(price = ifelse(is.na(price), mean(price, na.rm = TRUE), price))
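For illustration only, here is a small hypothetical df (the question does not include data); the grouped mean fills each NA from the observations in the same group and year:
df <- data.frame(group = c("A", "A", "A", "B", "B"),
                 year  = c(2020, 2020, 2020, 2021, 2021),
                 price = c(10, NA, 14, 20, 22))
df %>%
  group_by(group, year) %>%
  mutate(price = ifelse(is.na(price), mean(price, na.rm = TRUE), price))
# the NA in group A / year 2020 becomes 12, the mean of 10 and 14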
Using data.table
library(data.table)
dt <- data.table(df)
dt[, price := fifelse(is.na(price), mean(price, na.rm = TRUE), price), by = .(group, year)][]
A base R solution using by, which splits a data frame by the groups in the list in the second argument, and applies a function defined in the third:
result <- by(df,
             list(df[["group"]], df[["year"]]),
             function(x) {
               x[is.na(x$price), "price"] <- mean(x[["price"]], na.rm = TRUE)
               x
             },
             simplify = TRUE)
do.call(rbind, result)
I am trying to get the first non-zero value in each row, but my code returns its position instead of the value. I know this happens because I use which() in my code, but I need the value itself. My sample data and the R code I used are below.
Cnt<- c(9940000126,9940000188,9940000406,9940000992,9940001017,9940001288,9940001833,9940002276,9940002629)
FY12_April <- c(0,0,0,0,0,0,0,0,0)
FY12_August <- c(0,0,.343545,0,0,0,0,0,0)
FY12_December <- c(0,0,0,0,0,0,0,0,0)
FY12_February <- c(0,0,0,0,0,0,0,0,0)
FY12_January <- c(0,0,0.98557,0,0,0,0,0,0.41949703)
FY12_July <- c(0,0,0,0,0,-1.211583915,0,0,0)
FY12_June <- c(-1.47268885,0,0,0,-0.80164469,0,0,0,0)
SamData <- data.frame(Cnt,FY12_April,FY12_August,FY12_December,FY12_February,FY12_January,FY12_July,FY12_June)
library(dplyr)
ProcessData <- SamData %>%
  mutate(Count = apply(select(., FY12_April:FY12_June), 1,
                       function(x) sum(x != 0, na.rm = TRUE))) %>%
  mutate(FirstInst = colnames(select(., FY12_April:FY12_June))[
    apply(select(., FY12_April:FY12_June), 1, function(x) which(x != 0)[1])]) %>%
  mutate(FirstInstAmt = apply(select(., FY12_April:FY12_June), 1,
                              function(x) which(x != 0, arr.ind = TRUE, useNames = TRUE)[1]))
We can use max.col with a row index to get the value. It is more efficient than apply.
# index matrix: one (row, first non-zero column) pair per row of SamData[-1]
SamData$FirstInstAmt <- SamData[-1][cbind(seq_len(nrow(SamData)),
                                          max.col(SamData[-1] != 0, 'first'))]
Or if we want to use apply
SamData$FirstInstAmt <- apply(SamData[-1], 1, function(x) x[x !=0][1])
SamData$FirstInstAmt
#[1] -1.4726888 NA 0.3435450 NA -0.8016447
#[6] -1.2115839 NA NA 0.4194970
Or using pmap with dplyr
library(dplyr)
library(purrr)
SamData %>%
mutate(FirstInstAmt = pmap_dbl(.[-1], ~ {x <- c(...); x[x != 0][1]}))
Or use c_across with rowwise
SamData %>%
rowwise %>%
mutate(FirstInstAmt = {tmp <- c_across(FY12_April:FY12_June)
tmp[tmp!= 0][1]})
Or convert the 0 values to NA and use coalesce to return the first non-NA per row: NA^(!x) is NA where x is 0 and 1 elsewhere, so multiplying each column by it turns its zeros into NA
SamData %>%
mutate(FirstInstAmt = coalesce(!!! .[-1] * NA^(!.[-1])))
NOTE: Using rowwise/c_across, pmap, or apply can be slower, as these are all row-wise loops. The most efficient of these approaches is the one based on matrix indexing (max.col), followed to some extent by the coalesce/replace one.
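If you want to check the timings on your own data, here is a possible sketch with the microbenchmark package (the ranking will depend on the number of rows and columns):
library(microbenchmark)
microbenchmark(
  maxcol = SamData[-1][cbind(seq_len(nrow(SamData)),
                             max.col(SamData[-1] != 0, 'first'))],
  apply  = apply(SamData[-1], 1, function(x) x[x != 0][1]),
  times = 100
)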
R replace problem
I can't replace the NA values in different columns of the dataset with the median of the corresponding column.
titanic.new is the dataset.
I have tried:
fun3 <- function(x)
{
  column.numeric <- x[, sapply(x, is.numeric)]
  column.numeric[which(is.na(column.numeric))] <- median(column.numeric, na.rm = TRUE)
  return(column.numeric)
}
fun3(titanic.new)
I'm getting an error:
Error in median.default(column.numeric, na.rm = TRUE) :
need numeric data
What am I doing wrong?
We can make some modifications to the function. Loop through the columns of the dataset and check whether each one is numeric ('i1' is the resulting logical vector). Subset the data with that vector, loop through those columns with lapply, and replace the NAs in each column with the median of that column.
fun3 <- function(x){
  i1 <- sapply(x, is.numeric)
  x[i1] <- lapply(x[i1], function(y) replace(y, is.na(y), median(y, na.rm = TRUE)))
  x
}
fun3(titanic.new)
Or it can be done with tidyverse
library(tidyverse)
titanic.new %>%
mutate_if(is.numeric, list(~ replace(., is.na(.), median(., na.rm = TRUE))))
which can be wrapped in a function as well
fun4 <- function(x) {
  x %>%
    mutate_if(is.numeric,
              list(~ replace(., is.na(.), median(., na.rm = TRUE))))
}
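On dplyr 1.0+, where mutate_if() is superseded, a roughly equivalent sketch uses across() with where():
library(dplyr)
titanic.new %>%
  mutate(across(where(is.numeric),
                ~ replace(.x, is.na(.x), median(.x, na.rm = TRUE))))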
Also, this can be done more compactly with na.aggregate
library(zoo)
i1 <- sapply(titanic.new, is.numeric)
titanic.new[i1] <- na.aggregate(titanic.new[i1], FUN = median)
I would like to efficiently impute missing values with a slightly different value in each cell.
for example:
df <- data_frame(x = rnorm(100), y = rnorm(100))
df[1:5,1] <- NA
df[1:5, 2] <- NA
df %<>% mutate_all(funs(ifelse(is.na(.), jitter(median(., na.rm = TRUE)), .)))
However, this imputes with the same number in all cells.
How can I add a different noise to each cell?
Of course, I could do this with a loop, but my data frame is huge and I would like to do this efficiently.
We can use rep with n() so that jitter() receives a full-length vector of the median and therefore adds independent noise to each element
library(dplyr)
library(magrittr)
df %<>%
  mutate_all(list(~ case_when(is.na(.) ~ jitter(rep(median(., na.rm = TRUE), n())),
                              TRUE ~ .)))
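A base R sketch of the same idea, assuming all columns are numeric: jitter a full-length vector of the median and assign it into the NA positions of each column:
df[] <- lapply(df, function(x) {
  i <- is.na(x)
  # jitter() adds independent noise to each element of the repeated median
  if (any(i)) x[i] <- jitter(rep(median(x, na.rm = TRUE), sum(i)))
  x
})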
I'm trying to fill all the NA's in my fields with the mean of each column.
The code I've been using is:
var1 <- colnames(DF)
for (i in 1:length(var1)) {
  v <- paste0("`", var1[i], "`")
  DF <- DF %>%
    mutate(v = ifelse(is.na(v), mean(v, na.rm = TRUE), v))
}
After running this piece of code, nothing happens to DF.
I already tried running for an individual column, and the code works:
DF<-DF%>%
mutate(col1=ifelse(is.na(col1),mean(col1,na.rm=TRUE),col1))
I'm using the backticks in the paste0() part because some of the column names can contain spaces and I cannot change this. I have a feeling that this is where the mistake resides.
For multiple columns use mutate_at() (or, for all columns, mutate_all())
DF %>%
mutate_all(funs(ifelse(is.na(.), mean(., na.rm = TRUE), .)))
It can be made more compact with na.aggregate from zoo, which replaces the NAs in each column with an aggregate of that column (by default FUN = mean)
library(zoo)
na.aggregate(DF)
If we are using a for loop, then there is no need for a package. Just update the column NA elements with the mean of that column
for(nm in var1) DF[[nm]][is.na(DF[[nm]])] <- mean(DF[[nm]], na.rm = TRUE)
Or with lapply
DF[] <- lapply(DF, function(x) replace(x, is.na(x), mean(x, na.rm = TRUE)))
Or using colMeans: col(DF) expands the vector of column means to the dimensions of DF, so each NA is replaced with the mean of its own column
DF[is.na(DF)] <- colMeans(DF, na.rm = TRUE)[col(DF)][is.na(DF)]
data
set.seed(24)
DF <- as.data.frame(matrix(sample(c(NA, 0:5), 20 * 5, replace = TRUE), 20, 5))