Replacing Defined Outliers Using Apply/Tapply in R

Good Afternoon R wizards,
I searched through a few posts on replacing outliers in a data set. Two that came closest to answering my question were Changing outliers for NA in all columns in a dataset in R and Replace outliers by quantiles in R.
The code in the 2nd reference works great if you want to update a column or two, but I have 40+ and would like to be able to use an apply function to hit all the columns at once.
I want to set a threshold "max" of quantile(probs = .75) for each column, and replace any x > "max" with "max".
set.seed(1)
x = matrix(rnorm(20), ncol = 2)
x[2, 1] = 100
x[4, 2] = 200
colnames(x) <- c("a","b")
#apply(x,2,quantile,probs = .75)
Winsor75 <- function(x) {
Max <- quantile(x, probs = .75)
return(Max)
}
y <- as.data.frame(x)
y$a[y$a > Winsor75(x)] <- Winsor75(x)
The last line of code effectively replaces any defined outliers (in my case, values above the 75th percentile), but it uses the 75th percentile of the entire matrix "x", whereas I would like (a) the quantile to be computed per column, and (b) to be able to use the function with apply/tapply etc. so I can perform the operation on all columns efficiently.
Any suggestions?
Thanks!

Use lapply() together with pmin() so that each column is capped at its own 75th percentile:
as.data.frame(lapply(y, function(x) pmin(x, quantile(x, 0.75, na.rm = TRUE))))
As a function:
df_winsor <- function(df, p) {
as.data.frame(lapply(df,
function(x) pmin(x, quantile(x, probs = p, na.rm = TRUE))))
}
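A quick usage sketch on the sample data from the question (y_capped is just an illustrative name):
# Cap each column of y at its own 75th percentile
y_capped <- df_winsor(y, 0.75)
apply(y_capped, 2, max)  # each column's maximum is now that column's 75th percentile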
Statistician's Disclaimer: I've solved the programming problem you asked. This should not be taken as an endorsement of the idea of automatically checking for, or doing anything with, so-called "outliers".

One option is to use mutate_all with a custom function to apply the rule to all columns.
Approach:
I have created a replaceOutlier function (based on the OP's function) which calculates Max and then replaces any item greater than Max before returning the vector. replaceOutlier is applied over all columns using dplyr::mutate_all().
library(tidyverse)
replaceOutlier <- function(x) {
Max <- quantile(x, probs = .75)
x[x>Max] <- Max
return(x)
}
x %>% as_tibble() %>% mutate_all(funs(replaceOutlier))
#Results
# # A tibble: 10 x 2
# a b
# <dbl> <dbl>
# 1 -0.626 1.08
# 2 0.698 0.390
# 3 -0.836 -0.621
# 4 0.698 1.08
# 5 0.330 1.08
# 6 -0.820 -0.0449
# 7 0.487 -0.0162
# 8 0.698 0.944
# 9 0.576 0.821
# 10 -0.305 0.594
#
Data
set.seed(1)
x = matrix(rnorm(20), ncol = 2)
x[2, 1] = 100
x[4, 2] = 200
colnames(x) <- c("a","b")
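Note that funs() is deprecated in newer dplyr. On dplyr >= 1.0 the same result can be written with across() (a small sketch, assuming the tidyverse loaded above):
x %>% as_tibble() %>% mutate(across(everything(), replaceOutlier))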

Combination of approx and map2 is surprisingly slow

I have a dataset that looks like the below:
> head(mydata)
id value1 value2
1: 1 200001 300001
2: 2 200002 300002
3: 3 200003 300003
4: 4 200004 300004
5: 5 200005 300005
6: 6 200006 300006
value1 and value2 represent amounts at the beginning and the end of a given year. I would like to linearly interpolate the value for a given month, for each id (i.e. rowwise).
After trying different options that were slower, I am currently using map2 from the purrr package in combination with approx from base R. I create the new variable using assignment by reference from the data.table package. This is still surprisingly slow, as it takes approximately 2.2 min for my code to run on my data (1.7 million rows).
Note that I also use get() to access the variables for the interpolation, as their names need to be dynamic. This is slowing down my code, but it doesn't seem to be the bottleneck. Also, I have tried to use the furrr package to speed up map2 by making the code parallel, but the speed gains were not material.
Below is a reproducible example with 1000 rows of data. Any help to speed up the code is greatly appreciated!
mydata <- data.table(id = 1:1000, value1= 2001:3000, value2= 3001:4000)
floor_value <- "value1"
ceiling_value <- "value2"
m <- 7
monthly_sum_assured <- function(a, b, m) {
monthly_value <- approx(x = c(0, 12), c(a, b), xout = m)$y
}
mydata[, interpolated_value := map2(get(floor_value), get(ceiling_value),
~ monthly_sum_assured(.x, .y, m))]
Just use the formula for linear interpolation to vectorize over the whole data.table.
mydata <- data.table(id = 0:1e6, value1= 2e6:3e6, value2= 3e6:4e6)
floor_value <- "value1"
ceiling_value <- "value2"
m <- 7
monthly_sum_assured <- function(a, b, m) {
monthly_value <- approx(x = c(0, 12), c(a, b), xout = m)$y
}
system.time({
mydata[, interpolated_value := map2(get(floor_value), get(ceiling_value),
~ monthly_sum_assured(.x, .y, m))]
})
#> user system elapsed
#> 41.50 0.53 42.05
system.time({
mydata[, interpolated_value2 := get(floor_value) + m*(get(ceiling_value) - get(floor_value))/12]
})
#> user system elapsed
#> 0 0 0
identical(unlist(mydata$interpolated_value), mydata$interpolated_value2)
#> [1] TRUE
It also works just as fast when m is a vector.
m <- sample(12, 1e6 + 1, 1)
system.time({
mydata[, interpolated_value2 := get(floor_value) + m*(get(ceiling_value) - get(floor_value))/12]
})
#> user system elapsed
#> 0.01 0.00 0.02
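Since approx() with knots at months 0 and 12 reduces to exactly value1 + m*(value2 - value1)/12, the vectorized expression is an exact replacement for the row-wise call. Wrapping it in a small helper keeps the dynamic column names readable (a sketch; interpolate_month is a hypothetical name):
# Two-point linear interpolation between month 0 and month 12, fully vectorized
interpolate_month <- function(a, b, m) a + m * (b - a) / 12
mydata[, interpolated_value3 := interpolate_month(get(floor_value), get(ceiling_value), m)]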

Error: calling arguments in user functions using quosures

I am trying to create cross tables using weights::wtd.chi.sq.
The data:
data_in <- read_table2("Q50_1 Q50_2 Q38 Q90 pov gender wgt1 wgt2
never always Yes 2 High M 1.3 0.4
sometimes always No 2 Med F 0.4 0.2
always sometimes Yes 4 Low F 1.2 0.7
never never No 2 High M 0.5 0.7
always always No 4 High M 0.7 0.8
sometimes never Yes 3 Low F 0.56 0.3
sometimes never Yes 2 Med F 0.9 0.1
")
The xtab function, which feeds into another function:
xtab_func <- function(dat, col, target, wgt){
col <- rlang::as_string(ensym(col))
target <- rlang::as_string(ensym(target))
wgt <- rlang::as_string(ensym(wgt))
wtd.chi.sq(dat[[target]],dat[[col]], weight = dat[[wgt]])
}
Running it gives:
xtab_func(data_in, 'Q50_1','pov','wgt1')
Chisq df p.value
7.3395092 4.0000000 0.1189981
Now I am looping through a vector of columns to repeat this tabulation for each column. The error happens when I try to call target and wgt within the xtab function above. I've tried 3 different ways, but none of them work.
crosstab <- function(dat, target, columns, wgt,target_name, school_type){
# browser()
target <- rlang::as_string(ensym(target))
print(target)
wgt <- rlang::as_string(ensym(wgt))
target_name <- enquo(target_name)
school_type <- enquo(school_type)
d <- list()
for (i in columns){
# OPTION 1
# x <- xtab_func(dat, i, !!target, !!wgt)
# OPTION 2
x <- xtab_func(dat, i, target, wgt)
# OPTION 3
# x <- xtab_func(dat, i, dat[[target]],dat[[wgt]])
x$i <- i
d[[i]] <- x
df <- do.call(rbind, d)
}
return(df)
}
When I run this I expect to see the chi-square values for each of the columns by pov...
cols <- data_in %>% select(starts_with("Q"))
cols <- names(cols)
crosstab(data_in,'pov',cols, 'wgt1', 'pov','trad')
But I get these errors:
Error: Only strings can be converted to symbols
OR
Error in model.frame.default(formula = weight ~ var1 + var2) :
invalid type (NULL) for variable 'var1'
Any idea how I call those variables? Thank you!
It's not clear to me at all why you are trying to use all the rlang stuff when you are just passing character values to your functions. This could be greatly simplified to
xtab_func <- function(dat, col, target, wgt){
weights::wtd.chi.sq(dat[[target]],dat[[col]], weight = dat[[wgt]])
}
and
crosstab <- function(dat, target, columns, wgt,target_name, school_type){
d <- list()
for (i in columns){
x <- as.data.frame(as.list(xtab_func(dat, i, target, wgt)))
x$i <- i
d[[i]] <- x
}
df <- do.call(rbind, d)
return(df)
}
Just use [[]] with character values to index into your data.
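For example (illustrative names), both of these pull the same column, but only the first works when the name is stored in a variable:
col <- "pov"
data_in[[col]]   # character-based indexing; works inside functions
data_in$pov      # literal column name only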
With the OP's default function xtab_func, we can modify the crosstab to
library(purrr)
library(dplyr)
crosstab <- function(dat, target, columns, wgt,target_name, school_type){
purrr::map_dfr(columns, ~ {
xtab_func(dat, !!.x, !!target, !!wgt)
})
}
Testing:
crosstab(data_in,'pov', cols, 'wgt1', 'pov','trad')
# A tibble: 4 x 3
Chisq df p.value
<dbl> <dbl> <dbl>
1 7.34 4 0.119
2 6.02 4 0.198
3 1.47 2 0.480
4 4.83 4 0.306
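The rows come back in the order of cols but without a label; a small sketch (using the same injection pattern as above) to attach the column names afterwards:
res <- purrr::map_dfr(cols, ~ xtab_func(data_in, !!.x, 'pov', 'wgt1'))
res$column <- cols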

Trying to make a script calculate a value (using a function) for every 24 rows

I have not been able to find a solution to a problem similar to this on StackOverflow. I hope someone can help!
I am using the R environment.
I have data from turtle nests. There are two types of hourly data in each nest. The first is hourly Temperature, and it has an associated hourly Development (the amount of "anatomical" embryonic development).
I am calculating a weighted median. In this case, the median is temperature and it is weighted by development.
I have a script here that I am using to calculate the weighted median:
weighted.median <- function(x, w, probs=0.5, na.rm=TRUE) {
x <- as.numeric(as.vector(x))
w <- as.numeric(as.vector(w))
if(anyNA(x) || anyNA(w)) {
ok <- !(is.na(x) | is.na(w))
x <- x[ok]
w <- w[ok]
}
stopifnot(all(w >= 0))
if(all(w == 0)) stop("All weights are zero", call.=FALSE)
#'
oo <- order(x)
x <- x[oo]
w <- w[oo]
Fx <- cumsum(w)/sum(w)
#'
result <- numeric(length(probs))
for(i in seq_along(result)) {
p <- probs[i]
lefties <- which(Fx <= p)
if(length(lefties) == 0) {
result[i] <- x[1]
} else {
left <- max(lefties)
result[i] <- x[left]
if(Fx[left] < p && left < length(x)) {
right <- left+1
y <- x[left] + (x[right]-x[left]) * (p-Fx[left])/(Fx[right]- Fx[left])
if(is.finite(y)) result[i] <- y
}
}
}
names(result) <- paste0(format(100 * probs, trim = TRUE), "%")
return(result)
}
So from the function you can see that I need two input vectors, x and w (which will be temperature and development, respectively).
The problem I'm having is that I have hourly temperature traces that last anywhere from 5 days to 53 days (i.e., 120 hours to 1272 hours).
I would like to calculate the daily weighted median for all days within a nest (i.e., take the 24 rows of x and w, and calculate the weighted median, then move onto rows 25-48, and so forth.) The output vector would therefore be a list of daily weighted medians with length n/24 (where n is the total number of rows in x).
In other words, I would like to analyse my data automatically, in a fashion equivalent to manually doing this (nest1 is the datasheet for Nest 1, which contains two vectors, temp and devo, where devo is the weight):
weighted.median(nest1$temp[1:24], nest1$devo[1:24], na.rm=TRUE)
followed by
weighted.median(nest1$temp[25:48], nest1$devo[25:48], na.rm=TRUE)
followed by
weighted.median(nest1$temp[49:72], nest1$devo[49:72], na.rm=TRUE)
all the way to
weighted.median(nest1$temp[(n-23):n], nest1$devo[(n-23):n], na.rm=TRUE)
I'm afraid I don't even know where to start. Any help or clues would be very much appreciated.
The main idea is to create a new column for day 1, day 2, ..., day n/24, split the dataframe into subsets by day, and apply your function to each subset.
First I create some sample data:
set.seed(123)
n <- 120 # number of rows
nest1 <- data.frame(temp = rnorm(n), devo = rpois(n, 5))
Create the splitting variable:
nest1$day <- rep(1:(nrow(nest1)/24), each = 24)
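(If the number of rows were not an exact multiple of 24, the rep() call above would fail with a length mismatch; an integer-division index is a more general sketch:)
nest1$day <- (seq_len(nrow(nest1)) - 1) %/% 24 + 1  # rows 1-24 -> day 1, rows 25-48 -> day 2, ...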
Then, use the by() function to split nest1 by nest1$day and apply the function to each subset:
out <- by(nest1, nest1$day, function(d) {
weighted.median(d$temp, d$devo, na.rm = TRUE)
})
data.frame(day = dimnames(out)[[1]], x = as.vector(out))
# day x
# 1 1 -0.45244433
# 2 2 0.15337312
# 3 3 0.07071673
# 4 4 0.23873174
# 5 5 -0.27694709
Instead of using by, you can also use the group_by + summarise functions from the dplyr package:
library(dplyr)
nest1 %>%
group_by(day) %>%
summarise(x = weighted.median(temp, devo, na.rm = TRUE))
# # A tibble: 5 x 2
# day x
# <int> <dbl>
# 1 1 -0.452
# 2 2 0.153
# 3 3 0.0707
# 4 4 0.239
# 5 5 -0.277

Min-max scaling/normalization in R for train and test data

I am looking to create a function that takes the training set and the testing set as its arguments, min-max scales/normalizes and returns the training set, and then uses the training set's minimum and range to min-max scale/normalize and return the test set.
So far this is the function I have come up with:
min_max_scaling <- function(train, test){
min_vals <- sapply(train, min)
range1 <- sapply(train, function(x) diff(range(x)))
# scale the training data
train_scaled <- data.frame(matrix(nrow = nrow(train), ncol = ncol(train)))
for(i in seq_len(ncol(train))){
column <- (train[,i] - min_vals[i])/range1[i]
train_scaled[i] <- column
}
colnames(train_scaled) <- colnames(train)
# scale the testing data using the min and range of the train data
test_scaled <- data.frame(matrix(nrow = nrow(test), ncol = ncol(test)))
for(i in seq_len(ncol(test))){
column <- (test[,i] - min_vals[i])/range1[i]
test_scaled[i] <- column
}
colnames(test_scaled) <- colnames(test)
return(list(train = train_scaled, test = test_scaled))
}
The definition of min max scaling is similar to this question asked earlier on SO - Normalisation of a two column data using min and max values
My questions are:
1. Is there a way to vectorize the two for loops in the function? e.g. using sapply()
2. Are there any packages that allow us to do what we are looking to do here?
Here is the code for the min-max normalization. See this Wikipedia page for the formulae, and also other ways of performing feature scaling.
normalize <- function(x) {
return((x - min(x)) / (max(x) - min(x)))
}
To normalize a single column, call normalize(df$name); to normalize every column of a data frame, use lapply:
as.data.frame(lapply(df, normalize))
Update to address Holger's suggestion.
If you want to pass additional arguments to min() and max(), e.g., na.rm, then you can use:
normalize <- function(x, ...) {
return((x - min(x, ...)) /(max(x, ...) - min(x, ...)))
}
x <- c(1, NA, 2, 3)
normalize(x)
# [1] NA NA NA NA
normalize(x, na.rm = TRUE)
# [1] 0.0 NA 0.5 1.0
Just keep in mind, that whatever you pass to min() via the ellipsis ... you also implicitly pass to max(). In this case, this shouldn't be a big problem since both min() and max() share the same function signature.
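Coming back to the first question (vectorizing the two loops): one possible sketch uses Map so the training set's minima and ranges are paired column-by-column with both sets (min_max_scaling_vec is a hypothetical name, not a tested replacement for your function):
min_max_scaling_vec <- function(train, test) {
  min_vals <- sapply(train, min)
  range1 <- sapply(train, function(x) diff(range(x)))
  # Rescale each column using the training set's minimum and range
  rescale <- function(df) as.data.frame(Map(function(col, mn, rg) (col - mn) / rg, df, min_vals, range1))
  list(train = rescale(train), test = rescale(test))
}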
Regarding your 2nd question, you can use the caret package:
library(caret)
train = data.frame(a = 1:3, b = 10:12)
test = data.frame(a = 1:6, b = 7:12)
pp = preProcess(train, method = "range")
predict(pp, train)
# a b
# 1 0.0 0.0
# 2 0.5 0.5
# 3 1.0 1.0
predict(pp, test)
# a b
# 1 0.0 -1.5
# 2 0.5 -1.0
# 3 1.0 -0.5
# 4 1.5 0.0
# 5 2.0 0.5
# 6 2.5 1.0
This packages also defines other transformation methods, see: http://machinelearningmastery.com/pre-process-your-dataset-in-r/
set.seed(1984)
### simulating a data set
df <- data.frame(var1 = rnorm(100,5,3),
var2 = rpois(100,15),
var3 = runif(50,90,100))
df_train <- df[1:60,]
df_test <- df[61:100,]
## the function
normalize_data <- function(train_set, test_set) ## the args are the two sets
{
ranges <- sapply(train_set, function(x) max(x)-min(x)) ## range calculation
normalized_train <- train_set/ranges # the normalization
normalized_test <- test_set/ranges
return(list(ranges = ranges, # returning a list
normalized_train= normalized_train,
normalized_test =normalized_test ))
}
z <- normalize_data(df_train, df_test) ## applying the function
## the results
z$ranges
var1 var2 var3
13.051448 22.000000 9.945934
> head(z$normalized_train)
var1 var2 var3
1 0.47715854 1.1492978 7.289028
2 0.18322387 0.4545455 4.280883
3 0.69451066 1.3070668 9.703761
4 -0.04125108 1.6090169 7.277882
5 0.35731555 0.7272727 4.133561
6 0.86120315 0.6032616 9.246209

How to vectorize this replacement operation?

I have a dataset that needs to be capped/trimmed etc. based on values from another dataset. Both datasets have same structure (column names etc.).
What is a quick way to apply the transformations stored in another dataset to the current dataset?
Sample data:
#generate sample data & set some values to NA
#this is the dataset that has variables that need to be trimmed
x1 <- data.frame(a=rep(11:20), b=rep(41:50))
x1[2,1] <- NA
x1
#a vector containing values to trim to (in this case, say 75th percentile)
y1 <- apply(x1, 2, function(x) quantile(x, 0.75, na.rm=T))
y1
#I am doing this inside a loop
for (i in 1:ncol(x1)){
x1[is.na(x1[[i]]),] <- y1[i] #if missing, set to some value
x1[x1[[i]] > y1[i], i] <- y1[i] #if larger than 75th pctl, set to some value
}
x1
I am pretty sure there is a faster vectorized way to do this. I'd greatly appreciate any inputs.
One option: write your logic as a function that takes a vector and a value:
myfun <- function(x, y) {
x[is.na(x)] <- y
x[x > y] <- y
return (x)
}
Then use mapply which will treat x1 as a list of columns (which it sort of is):
mapply(myfun, x1, y1)
And you can coerce it back to a data.frame by wrapping it:
data.frame(mapply(myfun, x1, y1))
You could also add SIMPLIFY=FALSE if you want...
As per the comments, Map is a better choice here since it avoids some typing and probably some overhead:
as.data.frame(Map(myfun, x1, y1))
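If you are already in a dplyr workflow, the same per-column pairing can be written with across() and cur_column() (a sketch, assuming dplyr >= 1.0):
library(dplyr)
x1 %>% mutate(across(everything(), ~ myfun(.x, y1[cur_column()])))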
Here is another option using the data.table package. data.table is very fast and has a powerful, concise syntax; the disadvantage is the need to learn that syntax.
library(data.table)
x1 <- data.frame(a=rep(11:20), b=rep(41:50))
x1[2,1] <- NA
# Convert data.frame to data.table.
DT <- data.table(x1)
# Put your desired operations into a function, for clarity/tidiness.
update_vals <- function(x, prob=0.75) {
xcut <- quantile(x, probs=prob, na.rm=TRUE)
x[is.na(x) | x > xcut] <- xcut
return(x)
}
# Use lapply and data.table syntax to 'loop' over columns.
DT2 = DT[, lapply(.SD, update_vals)]
DT2
# a b
# 1: 11 41.00
# 2: 18 42.00
# 3: 13 43.00
# 4: 14 44.00
# 5: 15 45.00
# 6: 16 46.00
# 7: 17 47.00
# 8: 18 47.75
# 9: 18 47.75
# 10: 18 47.75
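If you would rather update DT by reference instead of creating a copy DT2, a sketch:
# Overwrite every column of DT in place (column names on the left-hand side of :=)
DT[, (names(DT)) := lapply(.SD, update_vals)]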
