I have a long (one row per pt) dataset with columns for numerous variables. I've created a for loop that runs over each row and prints the id of every participant that is an outlier based on their results for a specific column/variable. In the example below, looking at column x, this correctly identifies Pt6 as an outlier on variable x.
dat <- data.frame(id=c("Pt1","Pt2", "Pt3","Pt4", "Pt5", "Pt6"),
x=c(1,3,3,3,5,31),
y=c(2,9,10,10.5,10.5,11),
z=c(34,34,34,35,68,36))
for (row in 1:nrow(dat)) {
  variable <- dat[row, "x"]
  id <- dat[row, "id"]
  if ((variable > (mean(dat$x, na.rm = TRUE) + (2 * sd(dat$x, na.rm = TRUE)))) |
      (variable < (mean(dat$x, na.rm = TRUE) - (2 * sd(dat$x, na.rm = TRUE))))) {
    print(id)
  }
}
However, I'd like to identify all participants that are outliers based on each column individually - in the example data, it should identify Pt6 because of their x value AND Pt1 because of their y value AND Pt5 because of their z value.
I know I'll need to nest another for loop to go over the columns, something like the code below, but it only identifies Pt5, so I think it is not looking at the columns individually?
for (row in 1:nrow(dat)) {
  for (col in 1:ncol(dat))
    value <- dat[row, col]
    id <- dat[row, "id"]
    if ((value > (mean(dat[[col]], na.rm = TRUE) + (2 * sd(dat[[col]], na.rm = TRUE)))) |
        (value < (mean(dat[[col]], na.rm = TRUE) - (2 * sd(dat[[col]], na.rm = TRUE))))) {
      print(id)
    }
}
I'm new to for loops (obviously) - trying to get out of the bad habit of copy-pasting. I've tried looking at other answers, but I can't see how to apply them here / they're not in R. Any help appreciated! Open to different approaches altogether (e.g. apply-based ones), but I would quite like to plug my gap in for-loop understanding if possible. Thanks!
Let's start by looking at your for loops. You can optimize these quite easily by storing intermediate results (the mean and so on) in variables, so they do not have to be recomputed at every iteration. This is by far the slowest part of your loop, so the speed-up will be significant. In your first code example this would look like this:
dat <- data.frame(id=c("Pt1","Pt2", "Pt3","Pt4", "Pt5", "Pt6"),
x=c(1,3,3,3,5,31),
y=c(2,9,10,10.5,10.5,11),
z=c(34,34,34,35,68,36))
# Pre-define variables
mu <- mean(dat$x, na.rm = TRUE)
sd2 <- 2 * sd(dat$x, na.rm = TRUE)
upper <- mu + sd2
lower <- mu - sd2
# Create storage
rows <- logical(n <- nrow(dat))
for (row in 1:n) {
  variable <- dat[row, "x"]
  if (variable > upper || variable < lower) {
    # Set index to TRUE for rows that are "outliers"
    rows[row] <- TRUE
  }
}
# Print outlier rows
dat[rows,]
For your next loop, it would make sense to either store a matrix of "outlier indicators" or just the row/column pairs, for example as a list. You are getting most of the way there already. It makes sense to loop over columns in the outer loop, so you once again avoid recomputing the mean and standard deviation at every row:
# Specify columns to iterate over
cols <- names(dat)[-1]
# Storage for outliers
outliers <- list()
for (j in cols) {
  # Pre-define variables
  mu <- mean(dat[, j], na.rm = TRUE)
  sd2 <- 2 * sd(dat[, j], na.rm = TRUE)
  upper <- mu + sd2
  lower <- mu - sd2
  # Create storage
  rows <- logical(n <- nrow(dat))
  for (row in 1:n) {
    variable <- dat[row, j]
    if (variable > upper || variable < lower) {
      # Set index to TRUE for rows that are "outliers"
      rows[row] <- TRUE
    }
  }
  outliers[[j]] <- rows
}
# Print outliers
dat[outliers[['x']], ]
dat[outliers[['y']], ]
dat[outliers[['z']], ]
Now this is one method for doing it, but many functions in R are vectorized, so we could simplify this massively. Vectorization basically allows us to evaluate functions over vector inputs, and this also works for logical comparisons such as <, <= and ==. That lets us remove the row iteration in this case and simplifies the code drastically. For example, the first code would be reduced to
# Only 1 column
mu <- mean(dat$x)
sd2 <- sd(dat$x) * 2
upper <- mu + sd2
lower <- mu - sd2
rows <- dat$x > upper | dat$x < lower
# Alternative, cheeky one-liner:
rows <- abs(dat$x - mean(dat$x)) > 2 * sd(dat$x)
while the per-column version could even be done as
outliers <- lapply(dat[, c('x', 'y', 'z')],
                   function(x) abs(x - mean(x)) > 2 * sd(x))
dat[outliers[['x']], ]
dat[outliers[['y']], ]
dat[outliers[['z']], ]
where I replace the for loop with a call to lapply, which iterates over the columns in dat, applies the specified function, and returns a list with one logical vector per column. There is no real performance gain from replacing the for loop, but it is easier to read for smaller calls like this.
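If the goal is a single overview of flagged participants rather than three separate subsets, the per-column logical vectors can be combined; this small extension is my own addition rather than part of the answer above:
# Participants flagged on any variable (element-wise OR across the logical vectors)
dat$id[Reduce(`|`, outliers)]
# Or the flagged ids per variable:
lapply(outliers, function(flag) dat$id[flag])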
The following code computes the column means and SDs first, then the limits mu +/- 2*sd. It then uses a sapply loop to see which column elements fall outside those limits. Finally, it subsets the id column based on the results of sapply.
means <- colMeans(dat[-1], na.rm = TRUE)
sds <- apply(dat[-1], 2, sd, na.rm = TRUE)
ci95 <- means + cbind(-2*sds, 2*sds)
out <- sapply(seq_along(dat[-1]), function(i) {
  v <- dat[-1][[i]]
  v < ci95[i, 1] | v > ci95[i, 2]
})
out
# [,1] [,2] [,3]
#[1,] FALSE TRUE FALSE
#[2,] FALSE FALSE FALSE
#[3,] FALSE FALSE FALSE
#[4,] FALSE FALSE FALSE
#[5,] FALSE FALSE TRUE
#[6,] TRUE FALSE FALSE
dat[[1]][rowSums(out) > 0]
#[1] "Pt1" "Pt5" "Pt6"
I have four variables (length1, length2, length3 and length4) and would like to add a fifth column to my dataframe which contains the maximum value of the four lengths in each row.
When I run
length <- data.frame(length1, length2, length3, length4)
length$maximum <- apply(length, 1, max, na.rm = TRUE)
I obtain some -Inf values. I guess this happens in those rows where all the variables have NA values. What could I do to replace the -Inf values in the length$maximum variable with NAs?
I have tried:
my.max <- function(x) ifelse( !all(is.na(x)), max(x, na.rm=T), NA)
length$maximum <- apply(length, 1, my.max, na.rm = TRUE)
But it does not seem to work.
Your function definition, my.max, works well. I would suggest adjusting the name of the function, e.g. to my_max.
However, when you call your function, you've passed an extra na.rm argument that it doesn't accept, which is what causes the problem. The code below should work:
# create test data
length1 <- c(1:9, NA_integer_)
length2 <- c(2:10, NA_integer_)
length3 <- c(3:11, NA_integer_)
length4 <- c(4:12, NA_integer_)
length <- data.frame(length1, length2, length3, length4)
# define function
my_max <- function(x) ifelse( !all(is.na(x)), max(x, na.rm = TRUE), NA)
length$maximum <- apply(length, 1, my_max)
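As an aside (my own suggestion, not part of the answer above), the row maximum can also be computed without apply() via pmax(), which returns NA for rows where every length is missing when na.rm = TRUE is supplied:
# Row-wise maximum across the four columns; all-NA rows stay NA
length$maximum <- with(length, pmax(length1, length2, length3, length4, na.rm = TRUE))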
I am new to r. I have a dataframe showing 8 trials per participant (in rows) per 4 different tasks/measures (in columns). I would like to remove outliers* (per participant per task) and convert them to NAs while keeping pre-existing NAs.
The code I am using is below. It is throwing out the pre-existing NAs (i.e., the NAs that exist within the raw dataframe), with the additional result that I cannot get a dataframe back (it won't accept as.data.frame), I think because of unequal sizes. I presume the problem is the remove_outliers function, but:
I thought that when the handling of NAs was specified within a function, it only stated how to deal with NAs during that function's application, and
I have tried changing the function with variations on na.rm = FALSE throughout, but that won't run. Any help much appreciated.
fname = "VSA perceptual controls_right.csv"
ctrl_vsa_trials = read.csv(fname, header = TRUE, stringsAsFactors = FALSE, na.strings = c(""))
remove_outliers = function(x, na.rm = TRUE, ...) {
  qnt = quantile(x, probs = c(.25, .75), na.rm = na.rm, ...)
  H = 1.5 * IQR(x, na.rm = na.rm)
  y = x
  y[x < (qnt[1] - H)] = NA
  y[x > (qnt[2] + H)] = NA
  y
}
ctrl_vsa_trials_clean = aggregate(cbind(Pre_first, Post_first, Pre_adj, Post_adj) ~ Ppt,
                                  ctrl_vsa_trials, remove_outliers, na.action = NULL)
*This is due to issues I had with the measuring device; I feel it is justified!
I am not sure that I understand exactly what you need, but if you are trying to replace the columns with cleaned columns, you can try this:
ctrl_vsa_trials_clean <- ctrl_vsa_trials
cols <- c("Pre_first", "Post_first", "Pre_adj", "Post_adj")
ctrl_vsa_trials_clean[, cols] <- apply(ctrl_vsa_trials_clean[, cols], 2,
remove_outliers)
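The question asks for outlier removal per participant per task, while the snippet above cleans each column across all participants at once. A sketch of a within-participant version (assuming Ppt is the participant column, as in the question's aggregate() call) could apply remove_outliers inside each participant via ave():
cols <- c("Pre_first", "Post_first", "Pre_adj", "Post_adj")
ctrl_vsa_trials_clean <- ctrl_vsa_trials
# ave() applies remove_outliers separately within each participant's rows
ctrl_vsa_trials_clean[cols] <- lapply(cols, function(j)
  ave(ctrl_vsa_trials[[j]], ctrl_vsa_trials$Ppt, FUN = remove_outliers))
This should also keep pre-existing NAs, since an NA value is never selected for replacement inside remove_outliers.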
I am trying to write some R code which will take the iris dataset and do a log transform of the numeric columns according to some criterion, say if skewness > 0.2. I have tried to use ldply, but it doesn't quite give me the output I want: it is giving me a transposed data frame, the variable names are missing, and the non-numeric column entries are messed up.
Before posting this question I searched and found the following related topics, but they didn't quite cover what I was looking for:
Selecting only numeric columns from a data frame
extract only numeric columns from data frame
Below is the code. Appreciate the help!
data(iris)
df <- iris
df <- ldply(names(df), function(x) {
  if (class(df[[x]]) == "numeric") {
    tmp <- df[[x]][!is.na(df[[x]])]
    if (abs(skewness(tmp)) > 0.2) {
      df[[x]] <- log10(1 + df[[x]])
    }
    else df[[x]] <- df[[x]]
  }
  else df[[x]] <- df[[x]]
  #df[[x]] <- data.frame(df[[x]])
  #df2 <- cbind(df2, df[[x]])
  #return(NULL)
})
Try with lapply:
# skewness() comes from the e1071 package
library(e1071)
lapply(iris, function(x) {
  if (is.numeric(x)) {
    if (abs(skewness(x, na.rm = TRUE)) > 0.2) {
      log10(1 + x)
    } else x
  }
  else x
})
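Note that lapply() returns a list; getting a data frame back is an extra step not shown in the answer above, for example:
iris_new <- iris
iris_new[] <- lapply(iris, function(x) {
  if (is.numeric(x) && abs(skewness(x, na.rm = TRUE)) > 0.2) log10(1 + x) else x
})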
We can use lapply
library(e1071)
lapply(iris, function(x) if (is.numeric(x) && abs(skewness(x, na.rm = TRUE)) > 0.2)
  log10(1 + x) else x)
We can also loop over the columns of interest after creating a logical index:
i1 <- sapply(iris, is.numeric)
i2 <- sapply(iris[i1], function(x) abs(skewness(x, na.rm = TRUE)) > 0.2)
iris[i1][i2] <- lapply(iris[i1][i2], function(x) log10(1+x))
I wrote a special 'impute' function that replaces the values in columns that have missing (NA) values with either mean() or mode(), based on the specific column name.
The input dataframe is 400,000+ rows and it's very slow. How can I speed up the imputation part using lapply() or apply()?
Here is the function; the section I want optimized is marked with START OPTIMIZE and END OPTIMIZE:
specialImpute <- function(inputDF)
{
  discoveredDf <- data.frame(STUDYID_SUBJID = character(), stringsAsFactors = FALSE)
  dfList <- list()
  counter = 1
  Whilecounter = nrow(inputDF)
  #for testing just do 10 iterations, i = 10;
  while (Whilecounter > 0)
  {
    studyid_subjid = inputDF[Whilecounter, "STUDYID_SUBJID"]
    vect = which(discoveredDf$STUDYID_SUBJID == studyid_subjid)
    #was discovered and subset before
    if (!is.null(vect))
    {
      #not subset before
      if (length(vect) < 1)
      {
        #subset the dataframe based on regex inputDF$STUDYID_SUBJID
        df <- subset(inputDF, regexpr(studyid_subjid, inputDF$STUDYID_SUBJID) > 0)
        #START OPTIMIZE
        for (i in nrow(df))
        {
          #impute, add column mean & add to list
          #apply(df[,c("y1","y2","y3","etc..")], 2, function(x){x[is.na(x)] = mean(x, na.rm=TRUE)})
          if (is.na(df[i, "y1"])) {df[i, "y1"] = mean(df[, "y1"], na.rm = TRUE)}
          if (is.na(df[i, "y2"])) {df[i, "y2"] = mean(df[, "y2"], na.rm = TRUE)}
          if (is.na(df[i, "y3"])) {df[i, "y3"] = mean(df[, "y3"], na.rm = TRUE)}
          #impute using mean for CONTINUOUS variables
          if (is.na(df[i, "COVAR_CONTINUOUS_2"])) {df[i, "COVAR_CONTINUOUS_2"] = mean(df[, "COVAR_CONTINUOUS_2"], na.rm = TRUE)}
          if (is.na(df[i, "COVAR_CONTINUOUS_3"])) {df[i, "COVAR_CONTINUOUS_3"] = mean(df[, "COVAR_CONTINUOUS_3"], na.rm = TRUE)}
          if (is.na(df[i, "COVAR_CONTINUOUS_4"])) {df[i, "COVAR_CONTINUOUS_4"] = mean(df[, "COVAR_CONTINUOUS_4"], na.rm = TRUE)}
          if (is.na(df[i, "COVAR_CONTINUOUS_5"])) {df[i, "COVAR_CONTINUOUS_5"] = mean(df[, "COVAR_CONTINUOUS_5"], na.rm = TRUE)}
          if (is.na(df[i, "COVAR_CONTINUOUS_6"])) {df[i, "COVAR_CONTINUOUS_6"] = mean(df[, "COVAR_CONTINUOUS_6"], na.rm = TRUE)}
          if (is.na(df[i, "COVAR_CONTINUOUS_7"])) {df[i, "COVAR_CONTINUOUS_7"] = mean(df[, "COVAR_CONTINUOUS_7"], na.rm = TRUE)}
          if (is.na(df[i, "COVAR_CONTINUOUS_10"])) {df[i, "COVAR_CONTINUOUS_10"] = mean(df[, "COVAR_CONTINUOUS_10"], na.rm = TRUE)}
          if (is.na(df[i, "COVAR_CONTINUOUS_14"])) {df[i, "COVAR_CONTINUOUS_14"] = mean(df[, "COVAR_CONTINUOUS_14"], na.rm = TRUE)}
          if (is.na(df[i, "COVAR_CONTINUOUS_30"])) {df[i, "COVAR_CONTINUOUS_30"] = mean(df[, "COVAR_CONTINUOUS_30"], na.rm = TRUE)}
          #impute using mode for ordinal & nominal values
          if (is.na(df[i, "COVAR_ORDINAL_1"])) {df[i, "COVAR_ORDINAL_1"] = Mode(df[, "COVAR_ORDINAL_1"])}
          if (is.na(df[i, "COVAR_ORDINAL_2"])) {df[i, "COVAR_ORDINAL_2"] = Mode(df[, "COVAR_ORDINAL_2"])}
          if (is.na(df[i, "COVAR_ORDINAL_3"])) {df[i, "COVAR_ORDINAL_3"] = Mode(df[, "COVAR_ORDINAL_3"])}
          if (is.na(df[i, "COVAR_ORDINAL_4"])) {df[i, "COVAR_ORDINAL_4"] = Mode(df[, "COVAR_ORDINAL_4"])}
          #nominal
          if (is.na(df[i, "COVAR_NOMINAL_1"])) {df[i, "COVAR_NOMINAL_1"] = Mode(df[, "COVAR_NOMINAL_1"])}
          if (is.na(df[i, "COVAR_NOMINAL_2"])) {df[i, "COVAR_NOMINAL_2"] = Mode(df[, "COVAR_NOMINAL_2"])}
          if (is.na(df[i, "COVAR_NOMINAL_3"])) {df[i, "COVAR_NOMINAL_3"] = Mode(df[, "COVAR_NOMINAL_3"])}
          if (is.na(df[i, "COVAR_NOMINAL_4"])) {df[i, "COVAR_NOMINAL_4"] = Mode(df[, "COVAR_NOMINAL_4"])}
          if (is.na(df[i, "COVAR_NOMINAL_5"])) {df[i, "COVAR_NOMINAL_5"] = Mode(df[, "COVAR_NOMINAL_5"])}
          if (is.na(df[i, "COVAR_NOMINAL_6"])) {df[i, "COVAR_NOMINAL_6"] = Mode(df[, "COVAR_NOMINAL_6"])}
          if (is.na(df[i, "COVAR_NOMINAL_7"])) {df[i, "COVAR_NOMINAL_7"] = Mode(df[, "COVAR_NOMINAL_7"])}
          if (is.na(df[i, "COVAR_NOMINAL_8"])) {df[i, "COVAR_NOMINAL_8"] = Mode(df[, "COVAR_NOMINAL_8"])}
        }#for
        #END OPTIMIZE
        dfList[[counter]] <- df
        #add to discoveredDf since already subsetted
        discoveredDf[nrow(discoveredDf) + 1, ] <- c(studyid_subjid)
        counter = counter + 1
        #for debugging to check progress
        if (counter %% 100 == 0)
        {
          print(counter)
        }
      }
    }
    Whilecounter = Whilecounter - 1
  }#end while
  return(dfList)
}
Thanks
It's likely that performance can be improved in many ways, so long as you use a vectorized function on each column. Currently, you're iterating through each row, and then handling each column separately, which really slows you down. Another improvement is to generalize the code so you don't have to keep typing a new line for each variable. In the examples I'll give below, this is handled because continuous variables are numeric, and categorical are factors.
To get straight to an answer: you can replace the code you want optimized with the following (after fixing the variable names), provided that your continuous variables are numeric and your ordinal/categorical ones are not (e.g., they are factors):
impute <- function(x) {
  if (is.numeric(x)) { # If numeric, impute with mean
    x[is.na(x)] <- mean(x, na.rm = TRUE)
  } else { # mode otherwise
    x[is.na(x)] <- names(which.max(table(x)))
  }
  x
}
# Correct cols_to_impute with names of your variables to be imputed
# e.g., c("COVAR_CONTINUOUS_2", "COVAR_NOMINAL_3", ...)
cols_to_impute <- names(df) %in% c("names", "of", "columns")
library(purrr)
df[, cols_to_impute] <- dmap(df[, cols_to_impute], impute)
Below is a detailed comparison of five approaches:
1. Your original approach using for to iterate on rows; each column then handled separately.
2. Using a for loop.
3. Using lapply().
4. Using sapply().
5. Using dmap() from the purrr package.
The new approaches all iterate on the data frame by column and make use of a vectorized function called impute, which imputes missing values in a vector with the mean (if numeric) or the mode (otherwise). Otherwise, their differences are relatively minor (except sapply() as you'll see), but interesting to check.
Here are the utility functions we'll use:
# Function to simulate a data frame of numeric and factor variables with
# missing values and `n` rows
create_dat <- function(n) {
  set.seed(13)
  data.frame(
    con_1 = sample(c(10:20, NA), n, replace = TRUE),   # continuous w/ missing
    con_2 = sample(c(20:30, NA), n, replace = TRUE),   # continuous w/ missing
    ord_1 = sample(c(letters, NA), n, replace = TRUE), # ordinal w/ missing
    ord_2 = sample(c(letters, NA), n, replace = TRUE)  # ordinal w/ missing
  )
}
# Function that imputes missing values in a vector with mean (if numeric) or
# mode (otherwise)
impute <- function(x) {
  if (is.numeric(x)) { # If numeric, impute with mean
    x[is.na(x)] <- mean(x, na.rm = TRUE)
  } else { # mode otherwise
    x[is.na(x)] <- names(which.max(table(x)))
  }
  x
}
Now, wrapper functions for each approach:
# Original approach
func0 <- function(d) {
  for (i in 1:nrow(d)) {
    if (is.na(d[i, "con_1"])) d[i, "con_1"] <- mean(d[, "con_1"], na.rm = TRUE)
    if (is.na(d[i, "con_2"])) d[i, "con_2"] <- mean(d[, "con_2"], na.rm = TRUE)
    if (is.na(d[i, "ord_1"])) d[i, "ord_1"] <- names(which.max(table(d[, "ord_1"])))
    if (is.na(d[i, "ord_2"])) d[i, "ord_2"] <- names(which.max(table(d[, "ord_2"])))
  }
  return(d)
}
# for loop operates directly on d
func1 <- function(d) {
  for (i in seq_along(d)) {
    d[[i]] <- impute(d[[i]])
  }
  return(d)
}
# Use lapply()
func2 <- function(d) {
  lapply(d, function(col) {
    impute(col)
  })
}
# Use sapply()
func3 <- function(d) {
  sapply(d, function(col) {
    impute(col)
  })
}
# Use purrr::dmap()
func4 <- function(d) {
  purrr::dmap(d, impute)
}
Now, we'll compare the performance of these approaches with n ranging from 10 to 100 (VERY small):
library(microbenchmark)
ns <- seq(10, 100, by = 10)
times <- sapply(ns, function(n) {
  dat <- create_dat(n)
  op <- microbenchmark(
    ORIGINAL = func0(dat),
    FOR_LOOP = func1(dat),
    LAPPLY = func2(dat),
    SAPPLY = func3(dat),
    DMAP = func4(dat)
  )
  by(op$time, op$expr, function(t) mean(t) / 1000)
})
times <- t(times)
times <- as.data.frame(cbind(times, n = ns))
# Plot the results
library(tidyr)
library(ggplot2)
times <- gather(times, -n, key = "fun", value = "time")
pd <- position_dodge(width = 0.2)
ggplot(times, aes(x = n, y = time, group = fun, color = fun)) +
  geom_point(position = pd) +
  geom_line(position = pd) +
  theme_bw()
It's pretty clear that the original approach is much slower than the new approaches that use the vectorized function impute on each column. What about differences between the new ones? Let's bump up our sample size to check:
ns <- seq(5000, 50000, by = 5000)
times <- sapply(ns, function(n) {
  dat <- create_dat(n)
  op <- microbenchmark(
    FOR_LOOP = func1(dat),
    LAPPLY = func2(dat),
    SAPPLY = func3(dat),
    DMAP = func4(dat)
  )
  by(op$time, op$expr, function(t) mean(t) / 1000)
})
times <- t(times)
times <- as.data.frame(cbind(times, n = ns))
times <- gather(times, -n, key = "fun", value = "time")
pd <- position_dodge(width = 0.2)
ggplot(times, aes(x = n, y = time, group = fun, color = fun)) +
  geom_point(position = pd) +
  geom_line(position = pd) +
  theme_bw()
Looks like sapply() is not great (as #Martin pointed out). This is because sapply() is doing extra work to get our data into a matrix shape (which we don't need). If you run this yourself without sapply(), you'll see that the remaining approaches are all pretty comparable.
So the major performance improvement is to use a vectorized function on each column. I suggested using dmap at the beginning because I'm a fan of the function style and the purrr package generally, but you can comfortably substitute for whichever approach you prefer.
Aside, many thanks to #Martin for the very useful comment that got me to improve this answer!
If you are going to be working with what looks like a matrix, then use a matrix instead of a data frame, since indexing into a data frame as if it were a matrix is very costly. You might want to extract the numerical values to a matrix for part of your calculations; this can provide a significant increase in speed.
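A quick illustration of that cost (my own sketch, not part of the original answer), comparing element access in a matrix and in an equivalent data frame with microbenchmark:
library(microbenchmark)
m <- matrix(rnorm(1e4), nrow = 100)  # 100 x 100 numeric matrix
d <- as.data.frame(m)                # the same values as a data frame
microbenchmark(
  matrix_index = m[50, 50],     # direct element access
  dataframe_index = d[50, 50]   # the same access through the data frame method
)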
Here is a really simple and fast solution using data.table.
library(data.table)
# names of the columns to impute
cols <- c("a", "c")
# impute data
setDT(dt)[, (cols) := lapply(.SD, function(x)
  ifelse(is.na(x) & is.numeric(x), mean(x, na.rm = TRUE),
         ifelse(is.na(x) & is.character(x), names(which.max(table(x))), x))),
  .SDcols = cols]
I haven't compared the performance of this solution to the one provided by #Simon Jackson, but this should be pretty fast.
data from reproducible example
set.seed(25)
dt <- data.table(a = c(1:5, NA, NA, 1, 1),
                 b = sample(1:15, 9, replace = TRUE),
                 c = LETTERS[c(1:6, NA, NA, 1)])
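As a variation (my own suggestion, not from this answer), the same data.table update can reuse the impute() helper defined in the earlier answer, which avoids calling mean() on character columns inside ifelse():
# Assumes impute() from the earlier answer is already defined
setDT(dt)[, (cols) := lapply(.SD, impute), .SDcols = cols]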