Change outliers to NAs while keeping existing NAs - R

I am new to R. I have a dataframe with 8 trials per participant (in rows) across 4 different tasks/measures (in columns). I would like to remove outliers* (per participant per task) by converting them to NAs, while keeping the pre-existing NAs.
The code I am using is below. It is throwing out the pre-existing NAs (i.e., the NAs that exist within the raw dataframe), with the additional result that I cannot get a dataframe back (it won't accept as.data.frame), I think because of the unequal sizes. I presume the problem is the remove_outliers function, but:
1. I thought that specifying how to handle NAs inside a function only stated how to deal with NAs during that function's application, and
2. I have tried changing the function with variations on na.rm = FALSE throughout, but that won't run.
Any help much appreciated.
fname = "VSA perceptual controls_right.csv"
ctrl_vsa_trials = read.csv(fname, header = TRUE, stringsAsFactors = FALSE, na.strings = c(""))
remove_outliers = function(x, na.rm = TRUE, ...){
qnt = quantile(x, probs = c(.25, .75), na.rm = na.rm, ...)
H = 1.5 * IQR(x, na.rm = na.rm)
y = x
y[x < (qnt[1] - H)] = NA
y[x > (qnt[2] + H)] = NA
y
}
ctrl_vsa_trials_clean = aggregate(cbind(Pre_first,Post_first,Pre_adj,Post_adj) ~ Ppt, ctrl_vsa_trials, remove_outliers, na.action = NULL)
*The outlier removal is due to issues I had with the measuring device, so I feel it is justified!

I am not sure that I understand exactly what you need, but if what you are trying to do is replace the columns with cleaned columns, you can try this:
ctrl_vsa_trials_clean <- ctrl_vsa_trials
cols <- c("Pre_first", "Post_first", "Pre_adj", "Post_adj")
ctrl_vsa_trials_clean[, cols] <- apply(ctrl_vsa_trials_clean[, cols], 2, remove_outliers)
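If the outlier rule should instead be applied per participant, as the question describes, here is a minimal sketch using base R's ave() (column names assumed from the question). ave() returns a vector the same length as its input, so row order and pre-existing NAs are preserved:

# Apply remove_outliers within each participant's rows, one measure at a time
for (col in cols) {
  ctrl_vsa_trials_clean[[col]] <- ave(ctrl_vsa_trials[[col]],
                                      ctrl_vsa_trials$Ppt,
                                      FUN = remove_outliers)
}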

Related

How can I incorporate NA removal into aggregate based on a custom function?

This is my first time using any custom functions, so bear with me. I made a function for standard error that I'd like to use with aggregate. It worked until I tried to exclude NAs.
Dummy data frame to work with:
se <- function(x) sd(x)/sqrt(length(x))

df <- data.frame(site  = c('N','N','N','S','S','S'),
                 birds = c(NA,4,2,9,3,1),
                 worms = c(2,1,2,4,0,5))

means <- aggregate(df[,2:3], na.rm = T, list(site = df$site), FUN = mean)
error <- aggregate(df[,2:3], na.rm = T, list(site = df$site), FUN = se)
So aggregate worked before I excluded NAs (e.g. error <- aggregate(df[,2:3], list(site = df$site), FUN = se)), and it works when finding the mean (using the rest of the values to take the mean and ignoring the missing value). How can I exclude NAs in that same manner when using my custom se function?
The problem is that you do not have an explicit argument for na.rm in your se function. If you add that to your function, it should work:
se <- function(x, na.rm = TRUE) {
  sd(x, na.rm = na.rm) / sqrt(sum(!is.na(x)))
}
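With na.rm now a formal argument, the original call works as written, since aggregate() forwards na.rm = T to se through its ... argument:

error <- aggregate(df[,2:3], na.rm = T, list(site = df$site), FUN = se)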

Why is na.rm not working in my code?

I have a problem with removing NAs.
This is my code:
dataset <- read.csv2("my_file.csv", header = T)
year_order <- ret12[order(dataset$year, na.rm = T, decreasing = T), ]
# Returns:
# Error in order(ret12$year, na.rm = T, decreasing = T) :
#   argument lengths differ
Why?
order() has no na.rm argument, so na.rm = T lands in order()'s ... and is treated as another (length-1) vector to sort by, which is why the argument lengths differ. If you want to order and remove the NAs in the ret12$year column (which would normally come last in the order()-ing), then you need to order first, then omit:
year.order <- ret12[order(ret12$year, decreasing = TRUE), ][1:sum(!is.na(ret12$year)), ]
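Alternatively, order() can drop the NA positions itself via its na.last argument (na.last = NA removes them from the returned index), so no manual row count is needed:

year.order <- ret12[order(ret12$year, decreasing = TRUE, na.last = NA), ]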

R - Removing all outliers from a data set

I'd like to make a function that removes all outliers from a data set. I've read a lot of Stack Overflow articles about this, so I am aware of the dangers of removing outliers. Here's what I have so far:
# Remove outliers from a column
remove_outliers <- function(x, na.rm = TRUE, ...) {
  qnt <- quantile(x, probs = c(.25, .75), na.rm = na.rm, ...)
  H <- 1.5 * IQR(x, na.rm = na.rm)
  y <- x
  y[x < (qnt[1] - H)] <- NA
  y[x > (qnt[2] + H)] <- NA
  y
}
# Removes all outliers from a data set
remove_all_outliers <- function(df){
  # We only want the numeric columns
  a <- df[,sapply(df, is.numeric)]
  b <- df[,sapply(df, !is.numeric)]
  a <- lapply(a, function(x) remove_outliers(x))
  d <- merge(a, b)
  d
}
There are a few things wrong with this that I know of, but please correct me if anything could be handled better.
1. !is.numeric() is not a thing. How should I accomplish this? I have also tried is.numeric == FALSE.
2. is.numeric() converts factors into ints. How do I prevent this?
3. Did I do lapply right?
4. Is there a better/easier way to perform the remove_outliers function than separating the data set, applying it, then merging it back with the non-numeric set?
Factors are ints under the hood, just not numeric ints, so is.numeric() does not pick them up and they won't be converted.
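A quick check in base R illustrates this:

f <- factor(c("a", "b", "a"))
typeof(f)      # "integer" -- factors are stored as integer codes
is.numeric(f)  # FALSE     -- but they are not treated as numeric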
# Remove outliers from a column
remove_outliers <- function(x, na.rm = TRUE, ...) {
  qnt <- quantile(x, probs = c(.25, .75), na.rm = na.rm, ...)
  H <- 1.5 * IQR(x, na.rm = na.rm)
  y <- x
  y[x < (qnt[1] - H)] <- NA
  y[x > (qnt[2] + H)] <- NA
  y
}
You can replace the columns by index, so you don't need to create separate data sets. Just make sure you pass the same data to lapply; e.g., you don't want to do data[, 1:3] <- lapply(data, FUN), which I have done many times.
# Removes all outliers from a data set
remove_all_outliers1 <- function(df){
  # We only want the numeric columns
  df[, sapply(df, is.numeric)] <- lapply(df[, sapply(df, is.numeric)], remove_outliers)
  df
}
Similar to above (and slightly easier, I think), you can pass the entire data set to lapply. Just be careful to keep the data frame structure: assigning the bare result, as in
data <- lapply(data, function(x) if (is.numeric(x)) something(x) else x)
returns a plain list, whereas
data[] <- lapply(data, function(x) if (is.numeric(x)) something(x) else x)
keeps data a data.frame. Mixing these up is also a mistake I have made many times.
remove_all_outliers2 <- function(df){
  df[] <- lapply(df, function(x) if (is.numeric(x)) remove_outliers(x) else x)
  df
}
## test
mt <- within(mtcars, {
  mpg  <- factor(mpg)
  gear <- letters[1:2]
})
head(mt)
identical(remove_all_outliers1(mt), remove_all_outliers2(mt))
# [1] TRUE
Your ideas can work with a few minor adjustments. !is.numeric can be written as either Negate(is.numeric), the more verbose function(x) !is.numeric(x), or !sapply(df, is.numeric). Generally, applying ! directly to a function doesn't work in R out of the box.
# Removes all outliers from a data set
remove_all_outliers <- function(df){
  # We only want the numeric columns
  ## drop = FALSE in case only one column for either
  a <- df[, sapply(df, is.numeric), drop = FALSE]
  b <- df[, sapply(df, Negate(is.numeric)), drop = FALSE]
  ## note brackets
  a[] <- lapply(a, function(x) remove_outliers(x))
  ## stack them back together, not merge
  ## you could merge if you had a unique id, one id per row
  ## then make sure the columns are returned in the original order
  d <- cbind(a, b)
  d[, names(df)]
}
identical(remove_all_outliers2(mt), remove_all_outliers(mt))
# [1] TRUE

R: speed up a for loop using apply(), lapply(), etc.

I wrote a special "impute" function that replaces the values in columns that have missing (NA) values with either mean() or Mode(), based on the specific column name.
The input dataframe has 400,000+ rows and it's very slow. How can I speed up the imputation part using lapply() or apply()?
Here is the function; the section I want optimized is marked with START OPTIMIZE and END OPTIMIZE:
specialImpute <- function(inputDF)
{
  discoveredDf <- data.frame(STUDYID_SUBJID = character(), stringsAsFactors = FALSE)
  dfList <- list()
  counter <- 1
  Whilecounter <- nrow(inputDF)
  # for testing just do 10 iterations: i = 10
  while (Whilecounter > 0)
  {
    studyid_subjid <- inputDF[Whilecounter, "STUDYID_SUBJID"]
    vect <- which(discoveredDf$STUDYID_SUBJID == studyid_subjid)
    # was discovered and subset before
    if (!is.null(vect))
    {
      # not subset before
      if (length(vect) < 1)
      {
        # subset the dataframe based on regex inputDF$STUDYID_SUBJID
        df <- subset(inputDF, regexpr(studyid_subjid, inputDF$STUDYID_SUBJID) > 0)
        # START OPTIMIZE
        for (i in 1:nrow(df))
        {
          # impute, add column mean & add to list
          # apply(df[,c("y1","y2","y3","etc..")], 2, function(x){x[is.na(x)] = mean(x, na.rm=TRUE)})
          if (is.na(df[i, "y1"])) {df[i, "y1"] <- mean(df[, "y1"], na.rm = TRUE)}
          if (is.na(df[i, "y2"])) {df[i, "y2"] <- mean(df[, "y2"], na.rm = TRUE)}
          if (is.na(df[i, "y3"])) {df[i, "y3"] <- mean(df[, "y3"], na.rm = TRUE)}
          # impute using mean for CONTINUOUS variables
          if (is.na(df[i, "COVAR_CONTINUOUS_2"])) {df[i, "COVAR_CONTINUOUS_2"] <- mean(df[, "COVAR_CONTINUOUS_2"], na.rm = TRUE)}
          if (is.na(df[i, "COVAR_CONTINUOUS_3"])) {df[i, "COVAR_CONTINUOUS_3"] <- mean(df[, "COVAR_CONTINUOUS_3"], na.rm = TRUE)}
          if (is.na(df[i, "COVAR_CONTINUOUS_4"])) {df[i, "COVAR_CONTINUOUS_4"] <- mean(df[, "COVAR_CONTINUOUS_4"], na.rm = TRUE)}
          if (is.na(df[i, "COVAR_CONTINUOUS_5"])) {df[i, "COVAR_CONTINUOUS_5"] <- mean(df[, "COVAR_CONTINUOUS_5"], na.rm = TRUE)}
          if (is.na(df[i, "COVAR_CONTINUOUS_6"])) {df[i, "COVAR_CONTINUOUS_6"] <- mean(df[, "COVAR_CONTINUOUS_6"], na.rm = TRUE)}
          if (is.na(df[i, "COVAR_CONTINUOUS_7"])) {df[i, "COVAR_CONTINUOUS_7"] <- mean(df[, "COVAR_CONTINUOUS_7"], na.rm = TRUE)}
          if (is.na(df[i, "COVAR_CONTINUOUS_10"])) {df[i, "COVAR_CONTINUOUS_10"] <- mean(df[, "COVAR_CONTINUOUS_10"], na.rm = TRUE)}
          if (is.na(df[i, "COVAR_CONTINUOUS_14"])) {df[i, "COVAR_CONTINUOUS_14"] <- mean(df[, "COVAR_CONTINUOUS_14"], na.rm = TRUE)}
          if (is.na(df[i, "COVAR_CONTINUOUS_30"])) {df[i, "COVAR_CONTINUOUS_30"] <- mean(df[, "COVAR_CONTINUOUS_30"], na.rm = TRUE)}
          # impute using mode for ordinal & nominal values
          if (is.na(df[i, "COVAR_ORDINAL_1"])) {df[i, "COVAR_ORDINAL_1"] <- Mode(df[, "COVAR_ORDINAL_1"])}
          if (is.na(df[i, "COVAR_ORDINAL_2"])) {df[i, "COVAR_ORDINAL_2"] <- Mode(df[, "COVAR_ORDINAL_2"])}
          if (is.na(df[i, "COVAR_ORDINAL_3"])) {df[i, "COVAR_ORDINAL_3"] <- Mode(df[, "COVAR_ORDINAL_3"])}
          if (is.na(df[i, "COVAR_ORDINAL_4"])) {df[i, "COVAR_ORDINAL_4"] <- Mode(df[, "COVAR_ORDINAL_4"])}
          # nominal
          if (is.na(df[i, "COVAR_NOMINAL_1"])) {df[i, "COVAR_NOMINAL_1"] <- Mode(df[, "COVAR_NOMINAL_1"])}
          if (is.na(df[i, "COVAR_NOMINAL_2"])) {df[i, "COVAR_NOMINAL_2"] <- Mode(df[, "COVAR_NOMINAL_2"])}
          if (is.na(df[i, "COVAR_NOMINAL_3"])) {df[i, "COVAR_NOMINAL_3"] <- Mode(df[, "COVAR_NOMINAL_3"])}
          if (is.na(df[i, "COVAR_NOMINAL_4"])) {df[i, "COVAR_NOMINAL_4"] <- Mode(df[, "COVAR_NOMINAL_4"])}
          if (is.na(df[i, "COVAR_NOMINAL_5"])) {df[i, "COVAR_NOMINAL_5"] <- Mode(df[, "COVAR_NOMINAL_5"])}
          if (is.na(df[i, "COVAR_NOMINAL_6"])) {df[i, "COVAR_NOMINAL_6"] <- Mode(df[, "COVAR_NOMINAL_6"])}
          if (is.na(df[i, "COVAR_NOMINAL_7"])) {df[i, "COVAR_NOMINAL_7"] <- Mode(df[, "COVAR_NOMINAL_7"])}
          if (is.na(df[i, "COVAR_NOMINAL_8"])) {df[i, "COVAR_NOMINAL_8"] <- Mode(df[, "COVAR_NOMINAL_8"])}
        } # for
        # END OPTIMIZE
        dfList[[counter]] <- df
        # add to discoveredDf since already subset
        discoveredDf[nrow(discoveredDf) + 1, ] <- c(studyid_subjid)
        counter <- counter + 1
        # for debugging, to check progress
        if (counter %% 100 == 0)
        {
          print(counter)
        }
      }
    }
    Whilecounter <- Whilecounter - 1
  } # end while
  return(dfList)
}
Thanks
It's likely that performance can be improved in many ways, so long as you use a vectorized function on each column. Currently, you're iterating through each row and then handling each column separately, which really slows you down. Another improvement is to generalize the code so you don't have to keep typing a new line for each variable. In the examples I'll give below, this is handled because continuous variables are numeric and categorical ones are factors.
To get straight to an answer, you can replace the code you want optimized with the following (after fixing the variable names), provided that your numeric variables are numeric and your ordinal/categorical ones are not (e.g., factors):
impute <- function(x) {
  if (is.numeric(x)) { # If numeric, impute with mean
    x[is.na(x)] <- mean(x, na.rm = TRUE)
  } else {             # mode otherwise
    x[is.na(x)] <- names(which.max(table(x)))
  }
  x
}

# Correct cols_to_impute with names of your variables to be imputed
# e.g., c("COVAR_CONTINUOUS_2", "COVAR_NOMINAL_3", ...)
cols_to_impute <- names(df) %in% c("names", "of", "columns")

library(purrr)
df[, cols_to_impute] <- dmap(df[, cols_to_impute], impute)
Below is a detailed comparison of five approaches:
1. Your original approach, using for to iterate over rows and handling each column separately.
2. Using a for loop over columns.
3. Using lapply().
4. Using sapply().
5. Using dmap() from the purrr package.
The new approaches all iterate on the data frame by column and make use of a vectorized function called impute, which imputes missing values in a vector with the mean (if numeric) or the mode (otherwise). Beyond that, their differences are relatively minor (except for sapply(), as you'll see), but interesting to check.
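For instance, applied to single vectors:

impute(c(1, 2, NA, 3))        # NA replaced by the mean: 1 2 2 3
impute(c("a", "a", NA, "b"))  # NA replaced by the mode: "a" "a" "a" "b"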
Here are the utility functions we'll use:
# Function to simulate a data frame of numeric and factor variables with
# missing values and `n` rows
create_dat <- function(n) {
  set.seed(13)
  data.frame(
    con_1 = sample(c(10:20, NA), n, replace = TRUE),   # continuous w/ missing
    con_2 = sample(c(20:30, NA), n, replace = TRUE),   # continuous w/ missing
    ord_1 = sample(c(letters, NA), n, replace = TRUE), # ordinal w/ missing
    ord_2 = sample(c(letters, NA), n, replace = TRUE)  # ordinal w/ missing
  )
}
# Function that imputes missing values in a vector with mean (if numeric) or
# mode (otherwise)
impute <- function(x) {
  if (is.numeric(x)) { # If numeric, impute with mean
    x[is.na(x)] <- mean(x, na.rm = TRUE)
  } else {             # mode otherwise
    x[is.na(x)] <- names(which.max(table(x)))
  }
  x
}
Now, wrapper functions for each approach:
# Original approach
func0 <- function(d) {
  for (i in 1:nrow(d)) {
    if (is.na(d[i, "con_1"])) d[i, "con_1"] <- mean(d[, "con_1"], na.rm = TRUE)
    if (is.na(d[i, "con_2"])) d[i, "con_2"] <- mean(d[, "con_2"], na.rm = TRUE)
    if (is.na(d[i, "ord_1"])) d[i, "ord_1"] <- names(which.max(table(d[, "ord_1"])))
    if (is.na(d[i, "ord_2"])) d[i, "ord_2"] <- names(which.max(table(d[, "ord_2"])))
  }
  return(d)
}

# for loop operates directly on d
func1 <- function(d) {
  for (i in seq_along(d)) {
    d[[i]] <- impute(d[[i]])
  }
  return(d)
}

# Use lapply()
func2 <- function(d) {
  lapply(d, function(col) {
    impute(col)
  })
}

# Use sapply()
func3 <- function(d) {
  sapply(d, function(col) {
    impute(col)
  })
}

# Use purrr::dmap()
func4 <- function(d) {
  purrr::dmap(d, impute)
}
Now, we'll compare the performance of these approaches with n ranging from 10 to 100 (VERY small):
library(microbenchmark)
ns <- seq(10, 100, by = 10)
times <- sapply(ns, function(n) {
  dat <- create_dat(n)
  op <- microbenchmark(
    ORIGINAL = func0(dat),
    FOR_LOOP = func1(dat),
    LAPPLY   = func2(dat),
    SAPPLY   = func3(dat),
    DMAP     = func4(dat)
  )
  by(op$time, op$expr, function(t) mean(t) / 1000)
})
times <- t(times)
times <- as.data.frame(cbind(times, n = ns))
# Plot the results
library(tidyr)
library(ggplot2)
times <- gather(times, -n, key = "fun", value = "time")
pd <- position_dodge(width = 0.2)
ggplot(times, aes(x = n, y = time, group = fun, color = fun)) +
  geom_point(position = pd) +
  geom_line(position = pd) +
  theme_bw()
It's pretty clear that the original approach is much slower than the new approaches that use the vectorized function impute on each column. What about differences between the new ones? Let's bump up our sample size to check:
ns <- seq(5000, 50000, by = 5000)
times <- sapply(ns, function(n) {
  dat <- create_dat(n)
  op <- microbenchmark(
    FOR_LOOP = func1(dat),
    LAPPLY   = func2(dat),
    SAPPLY   = func3(dat),
    DMAP     = func4(dat)
  )
  by(op$time, op$expr, function(t) mean(t) / 1000)
})
times <- t(times)
times <- as.data.frame(cbind(times, n = ns))
times <- gather(times, -n, key = "fun", value = "time")
pd <- position_dodge(width = 0.2)
ggplot(times, aes(x = n, y = time, group = fun, color = fun)) +
  geom_point(position = pd) +
  geom_line(position = pd) +
  theme_bw()
Looks like sapply() is not great (as @Martin pointed out). This is because sapply() does extra work to get our data into a matrix shape, which we don't need. If you rerun this yourself without sapply(), you'll see that the remaining approaches are all pretty comparable.
So the major performance improvement is to use a vectorized function on each column. I suggested dmap at the beginning because I'm a fan of the functional style and the purrr package generally, but you can comfortably substitute whichever approach you prefer.
Aside: many thanks to @Martin for the very useful comment that got me to improve this answer!
If you are going to be working with what looks like a matrix, then use a matrix instead of a dataframe: indexing into a dataframe as if it were a matrix is very costly. You might want to extract the numerical values to a matrix for parts of your calculations; this can provide a significant increase in speed.
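A minimal sketch of that difference (the objects here are illustrative, not from the question):

m <- matrix(rnorm(1e4), ncol = 10)
d <- as.data.frame(m)

library(microbenchmark)
microbenchmark(
  data_frame = for (i in 1:100) d[i, 1],  # element access through a data frame
  matrix     = for (i in 1:100) m[i, 1]   # the same access through a matrix
)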
Here is a really simple and fast solution using data.table.
library(data.table)
# names of the columns to impute
cols <- c("a", "c")

# impute data
setDT(dt)[, (cols) := lapply(.SD, function(x)
    ifelse(is.na(x) & is.numeric(x), mean(x, na.rm = TRUE),
           ifelse(is.na(x) & is.character(x), names(which.max(table(x))), x))),
  .SDcols = cols]
I haven't compared the performance of this solution to the one provided by @Simon Jackson, but it should be pretty fast.
Data from the reproducible example:

set.seed(25)
dt <- data.table(a = c(1:5, NA, NA, 1, 1),
                 b = sample(1:15, 9, replace = TRUE),
                 c = LETTERS[c(1:6, NA, NA, 1)])
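The column-wise impute() helper from the previous answer also drops straight into this data.table idiom; a sketch, reusing the same dt and cols:

setDT(dt)[, (cols) := lapply(.SD, impute), .SDcols = cols]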

rdata: Some method to iterate through column names of a data frame?

I have about 30 lines of code that do just this (getting Z scores):
data$z_col1 <- (data$col1 - mean(data$col1, na.rm = TRUE)) / sd(data$col1, na.rm = TRUE)
data$z_col2 <- (data$col2 - mean(data$col2, na.rm = TRUE)) / sd(data$col2, na.rm = TRUE)
data$z_col3 <- (data$col3 - mean(data$col3, na.rm = TRUE)) / sd(data$col3, na.rm = TRUE)
data$z_col4 <- (data$col4 - mean(data$col4, na.rm = TRUE)) / sd(data$col4, na.rm = TRUE)
data$z_col5 <- (data$col5 - mean(data$col5, na.rm = TRUE)) / sd(data$col5, na.rm = TRUE)
Is there some way, maybe using apply() or something, that I can essentially just do (Python):

for col in ['col1', 'col2', 'col3']:
    data[col] = ...  # z score code here
Thanks R friends.
A data.frame is a list, thus you can use lapply. Don't use apply on a data.frame as this will coerce to a matrix.
lapply(data, function(x) (x - mean(x,na.rm = TRUE))/sd(x, na.rm = TRUE))
Or you could use scale which performs this calculation on a vector.
lapply(data, scale)
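If you want the results written back as new z_-prefixed columns, as in your original code, here is a minimal sketch (column names assumed from the question):

cols <- c("col1", "col2", "col3", "col4", "col5")
data[paste0("z_", cols)] <- lapply(data[cols], function(x)
  (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE))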
You can translate the Python-style approach directly:

for (col in names(data)) {
  data[[col]] <- scale(data[[col]])
}
Note that this approach is not memory efficient in R, as [[<-.data.frame copies the entire data.frame each time.
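A single bulk assignment avoids the repeated copying while keeping the data.frame structure:

data[] <- lapply(data, scale)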
I think you're right, apply() may be the way to go here.
For example:
data <- array(1:20, dim = c(4, 5))
data.zscores <- apply(data, 2, function(x)
  (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE))
The function apply() takes a matrix or array as its first argument. The "2" refers to the dimension the function is iterated over, which in our case is columns; if we wanted to go by row, we'd use "1". Lastly, we pass the function we want to apply to each column. See ?apply for more details.
Check this out. I iterate through the data frame to count the rows with NAs in each column:
for (i in names(houseDF)) {
  print(i)
  print(nrow(houseDF[is.na(houseDF[i]), ]))
  print("---------------------")
}
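A vectorized one-liner gives the same per-column NA counts:

colSums(is.na(houseDF))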
