Removing outliers from groups using data.table in R

I have a data.table object that contains a group column. I am trying to remove outliers from each group, but I cannot come up with a clean solution. My data.table can be built with this simple script:
library(data.table)
col1 <- rnorm(30, mean = 5, sd = 2)
col2 <- rnorm(30, mean = 5, sd = 2)
id <- seq(1, 30)
group <- sample(4, 30, replace = TRUE)
dt <- data.table(id, group, col1, col2)
I've tried splitting the data.frame by the group variable, but that approach is too messy. How can I "easily" remove the top n% of outliers from each group in a data.table without too many data transformations?

Assuming that you want to remove outliers according to both col1 and col2, based on the 95% quantile:
dt_filt <- dt[,
  .SD[
    (col1 < quantile(col1, probs = 0.95)) &
    (col2 < quantile(col2, probs = 0.95))
  ], by = group
]
This splits the data by the group column, calculates the thresholds within each group, and keeps only the rows where both col1 and col2 are below their thresholds.
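To make the "top n%" part configurable, the same idea can be wrapped in a small function. This is only a sketch: drop_top_pct, its arguments (cols, by_cols, p) and the seeded sample data are illustrative names that are not part of the original answer. It collects row indices with .I instead of subsetting .SD, and keeps the strict < comparison used above.

```r
library(data.table)

set.seed(1)
dt <- data.table(
  id    = 1:30,
  group = sample(4, 30, replace = TRUE),
  col1  = rnorm(30, mean = 5, sd = 2),
  col2  = rnorm(30, mean = 5, sd = 2)
)

# Hypothetical helper: within each group, drop rows whose value in ANY of
# `cols` is at or above that group's (1 - p) quantile.
drop_top_pct <- function(dt, cols, by_cols, p = 0.05) {
  keep <- dt[, {
    ok <- rep(TRUE, .N)
    for (cl in cols) {
      x <- .SD[[cl]]
      ok <- ok & (x < quantile(x, probs = 1 - p))
    }
    .I[ok]                      # row numbers in the full table
  }, by = by_cols, .SDcols = cols]$V1
  dt[sort(keep)]                # restore the original row order
}

dt_filt <- drop_top_pct(dt, cols = c("col1", "col2"), by_cols = "group", p = 0.05)
```

Because the comparison is strict, a group with a single row is dropped entirely; switch to <= if that is not what you want.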

Related

How to identify and remove outliers in a data.frame using R?

I have a dataframe that has multiple outliers. I suspect that these outliers have produced different results than expected.
I tried to use this tip but it didn't work as I still have very different values: https://www.r-bloggers.com/2020/01/how-to-remove-outliers-in-r/
I tried the solution with the rstatix package, but I can't remove the outliers from my data.frame
library(rstatix)
library(dplyr)
df <- data.frame(
sample = 1:20,
score = c(rnorm(19, mean = 5, sd = 2), 50))
View(df)
out_df<-identify_outliers(df$score)#identify outliers
df2<-df#copy df
df2<- df2[-which(df2$score %in% out_df),]#remove outliers from df2
View(df2)
identify_outliers expects a data.frame as input, i.e. the usage is
identify_outliers(data, ..., variable = NULL)
where
... - One unquoted expressions (or variable name). Used to select a variable of interest. Alternative to the argument variable.
df2 <- subset(df, !score %in% identify_outliers(df, "score")$score)
A rule of thumb is that data points above Q3 + 1.5xIQR or below Q1 - 1.5xIQR are considered outliers.
Therefore you just have to identify them and remove them. I don't know how to do it with rstatix, but in base R it can be achieved as in the example below:
# Generate a demo data
set.seed(123)
demo.data <- data.frame(
sample = 1:20,
score = c(rnorm(19, mean = 5, sd = 2), 50),
gender = rep(c("Male", "Female"), each = 10)
)
#identify outliers
outliers <- which(demo.data$score > quantile(demo.data$score)[4] + 1.5*IQR(demo.data$score) | demo.data$score < quantile(demo.data$score)[2] - 1.5*IQR(demo.data$score))
# remove them from your dataframe
df2 = demo.data[-outliers,]
You can also wrap this in a function that returns the indices of the outliers:
get_outliers = function(x){
which(x > quantile(x)[4] + 1.5*IQR(x) | x < quantile(x)[2] - 1.5*IQR(x))
}
outliers <- get_outliers(demo.data$score)
df2 = demo.data[-outliers,]
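One caveat with the index-based version: if no outliers are found, which() returns integer(0) and demo.data[-outliers, ] drops every row. A logical mask avoids that. The sketch below assumes the same 1.5 x IQR rule; is_iqr_outlier, its k argument and the no_outliers data frame are illustrative names, not from the answers above.

```r
# Hypothetical helper: TRUE where x falls outside the Q1 - k*IQR / Q3 + k*IQR fences
is_iqr_outlier <- function(x, k = 1.5) {
  q <- quantile(x, probs = c(0.25, 0.75), names = FALSE)
  fence <- k * (q[2] - q[1])
  x < q[1] - fence | x > q[2] + fence
}

set.seed(123)
demo.data <- data.frame(
  sample = 1:20,
  score  = c(rnorm(19, mean = 5, sd = 2), 50)
)
clean <- demo.data[!is_iqr_outlier(demo.data$score), ]

# A column without outliers: the mask keeps all rows, negative indexing keeps none
no_outliers <- data.frame(score = c(1, 2, 3, 4, 5))
kept_mask  <- no_outliers[!is_iqr_outlier(no_outliers$score), , drop = FALSE]
kept_index <- no_outliers[-which(is_iqr_outlier(no_outliers$score)), , drop = FALSE]
```

Here kept_mask still has all five rows, while kept_index is empty.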

In R: How to subset a large dataframe by top 5 longest runs of frequent values in 1 column?

I have a dataframe with 1 column. The values in this column can ONLY be "good" or "bad". I would like to find the top 5 largest runs of "bad".
I am able to use the rle(df) function to get the running length of all the "good" and "bad".
How do i find the 5 largest runs that attribute to ONLY "bad"?
How do i get the starting and ending indices of the top 5 largest runs for ONLY "bad"?
Your assistance is much appreciated!
One option is rleid. Convert the 'data.frame' to a 'data.table' (setDT(df1)) and create a grouping column with rleid, which generates a unique id for each run of adjacent matching elements. Add the number of elements per group ('n') and the row number ('rn') as columns, subset the rows where 'goodbad' is "bad", order by 'n' in decreasing order, and, grouped by 'grp', summarise the first and last row numbers along with the entry for 'goodbad'.
library(data.table)
# n[1] returns one row per run; a bare n would repeat each run's row n times
setDT(df1)[, grp := rleid(goodbad)][, n := .N, grp][,
  rn := .I][goodbad == 'bad'][order(-n), .(goodbad = first(goodbad),
  n = n[1], start = rn[1], last = rn[.N]), .(grp)
  ][n %in% head(unique(n), 5)][, grp := NULL][]
Or we can use rle and other base R methods
rl <- rle(df1$goodbad)
grp <- with(rl, rep(seq_along(values), lengths))
df2 <- transform(df1, grp = grp, n = rep(rl$lengths, rl$lengths),
rn = seq_len(nrow(df1)))
df3 <- subset(df2, goodbad == 'bad')
do.call(data.frame, aggregate(rn ~ grp, subset(df3[order(-df3$n),],
n %in% head(unique(n), 5)), range))
data
set.seed(24)
df1 <- data.frame(goodbad = sample(c("good", "bad"), 100,
replace = TRUE), stringsAsFactors = FALSE)
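For reference, the start and end of every run can also be derived directly from the rle() output with cumsum(). The following is a minimal base R sketch on the same seeded data; the names runs, bad_runs and top5 are illustrative. Unlike the code above, it returns exactly five runs (ties broken by position) rather than every run whose length is among the five largest distinct lengths.

```r
set.seed(24)
goodbad <- sample(c("good", "bad"), 100, replace = TRUE)

rl     <- rle(goodbad)
ends   <- cumsum(rl$lengths)          # last index of each run
starts <- ends - rl$lengths + 1L      # first index of each run

runs <- data.frame(value = rl$values, length = rl$lengths,
                   start = starts, end = ends)

# Keep only the "bad" runs, longest first, and take the first five
bad_runs <- runs[runs$value == "bad", ]
top5     <- head(bad_runs[order(-bad_runs$length), ], 5)
top5
```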
The sort(...) function arranges values in increasing or decreasing order. The default is increasing, but you can set decreasing = TRUE. Use ?sort for more info.
The which(...) function returns the indices of values that meet a logical criterion. The code below sorts the times column for rows where the goodbad value is "good".
sort(your.df$times[which(your.df$goodbad == "good")])
If you wanted to get the top 5 you could do this:
top5_good <- sort(your.df$times[which(your.df$goodbad == "good")], decreasing = TRUE)[1:5]
top5_bad <- sort(your.df$times[which(your.df$goodbad == "bad")], decreasing = TRUE)[1:5]

conditional rolling average

library(data.table)
set.seed(123)
d <- data.frame(ID = rep(1:5, each = 17), yearRef = rep(1998:2014, times = 5), y = sample(1:100, 17 * 5))
For each ID, I want to compute a 7-year rolling average of y from 1998 onwards. However, in each rolling window
I want to average only the 5 highest values of y. For example,
the rolling windows would be
1998-2004 - only do the average of top 5 highest 'y' values
1999-2005 - only do the average of top 5 highest 'y' values
.
.
2007-2013 - only do the average of top 5 highest 'y' values
2008-2014 - only do the average of top 5 highest 'y' values
I am interested in using data.table to achieve this. However also open to other suggestions. Here's what I tried
d = setDT(d)
d[, avg.Y := frollmean(y, 7), by = ID]
How do I enter another argument where for each rolling 7-years window I only select the top 5 highest y value to calculate the mean?
EDIT
I could also have a case where some IDs do not have the minimum 7 years of data needed for a moving average, in which case the above function will give me NAs. For those IDs, is it possible to simply take an arithmetic mean? E.g. if an ID only has data from 1998-2002, can I simply take the average of y from 1998-2002?
We can use rollapplyr from zoo and apply a custom function to calculate mean of top 5 values in each rolling window.
library(data.table)
library(zoo)
setDT(d)
d[, avg.Y:= rollapplyr(y, 7,function(x) mean(tail(sort(x), 5)), fill = NA), by = ID]
For cases where there could be fewer observations than the window size, we can do
d[, avg.Y:= if (.N > 6)
rollapplyr(y, 7,function(x) mean(tail(sort(x), 5)), fill = NA)
else mean(y), by = ID]
First time using frollapply() but this seems to work:
get_mean_top5 <- function(x) mean(-sort(-x, partial = 1:5)[1:5])
d[, test := frollapply(y, 7, FUN = get_mean_top5), by = ID]
The function get_mean_top5() picks out the 5 highest values and then takes their mean. A more readable form would be:
get_mean_top5 <- function(x) mean(x[order(x, decreasing = TRUE)[1:5]])
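To see what these rolling calls compute in each window, here is a plain base R sketch with an explicit loop. top_k_roll_mean and its width and k arguments are illustrative names rather than part of any package; the function also applies the fallback from the EDIT (a plain mean when a series is shorter than the window).

```r
# Hypothetical helper: right-aligned rolling mean of the k largest values
top_k_roll_mean <- function(y, width = 7, k = 5) {
  n <- length(y)
  if (n < width) return(rep(mean(y), n))   # too short: plain mean everywhere
  out <- rep(NA_real_, n)
  for (i in width:n) {
    window <- y[(i - width + 1):i]
    out[i] <- mean(sort(window, decreasing = TRUE)[seq_len(k)])
  }
  out
}

# Worked example: both windows have 70, 60, 50, 40, 30 as their top 5
y <- c(10, 40, 20, 50, 30, 70, 60, 5)
top_k_roll_mean(y)
# NA NA NA NA NA NA 50 50

# Applied per ID with ave()
set.seed(123)
d <- data.frame(ID = rep(1:5, each = 17),
                yearRef = rep(1998:2014, times = 5),
                y = sample(1:100, 17 * 5))
d$avg.Y <- ave(d$y, d$ID, FUN = top_k_roll_mean)
```

The loop is slower than rollapplyr or frollapply on large data, so it is mainly useful for checking their results.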
A base R solution with a few more steps and a little repetition:
df$seven_year_group <- paste0(ave(as.integer(as.factor(df$yearRef)) %% 7,
as.integer(as.factor(df$yearRef)) %% 7,
FUN = seq.int),
"_",
df$ID)
seven_year_averages <- data.frame(avg_y = do.call("rbind", lapply(split(df, df$seven_year_group),
function(x){mean(tail(x[order(x$y), "y"], 5))})))
seven_year_averages$seven_year_group <- row.names(seven_year_averages)
df <- merge(df, seven_year_averages, by = "seven_year_group", all.x = TRUE)
Data:
set.seed(2019)
df <- data.frame(ID = rep(1:5, each = 17), yearRef = rep(1998:2014, times = 5), y = sample(1:100, 17 * 5))

Sample from specific rows in a dataframe column [duplicate]

I'm looking for an efficient way to select rows from a data table such that I have one representative row for each unique value in a particular column.
Let me propose a simple example:
require(data.table)
y = c('a','b','c','d','e','f','g','h')
x = sample(2:10,8,replace = TRUE)
z = rep(y,x)
dt = as.data.table( z )
My objective is to subset the data table dt by sampling one row for each letter a-h in column z.
OP provided only a single column in the example. Assuming that there are multiple columns in the original dataset, we group by 'z', sample 1 row from the sequence of rows per group, get the row index (.I), extract the column with the row index ($V1) and use that to subset the rows of 'dt'.
dt[dt[ , .I[sample(.N,1)] , by = z]$V1]
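The same "pick one row index per group" idea can be sketched in base R, which may help in reading the .I expression. The names df, pick and picked are illustrative. Note the use of sample.int(length(i), 1): calling sample(i, 1) on a group with a single row index i > 1 would sample from 1:i instead, a pitfall that sample(.N, 1) in the data.table version also avoids.

```r
set.seed(1)
z  <- rep(c("a", "b", "c"), times = c(3, 1, 4))
df <- data.frame(z = z, v = seq_along(z))

# For each value of z, draw one position within the group and map it back
# to the row number in df
pick <- vapply(split(seq_len(nrow(df)), df$z),
               function(i) i[sample.int(length(i), 1)],
               integer(1))

picked <- df[pick, ]
picked
```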
You can use dplyr
library(dplyr)
dt %>%
group_by(z) %>%
sample_n(1)
I think that shuffling the data.table row-wise and then applying unique(...,by) could also work. Groups are formed with by and the previous shuffling trickles down inside each group:
# shuffle the data.table row-wise
dt <- dt[sample(dim(dt)[1])]
# uniqueness by given column(s)
unique(dt, by = "z")
Below is an example on a bigger data.table with grouping by 3 columns. Comparing with @akrun's solution shows that both give the same grouping:
set.seed(2017)
dt <- data.table(c1 = sample(52*10^6),
c2 = sample(LETTERS, replace = TRUE),
c3 = sample(10^5, replace = TRUE),
c4 = sample(10^3, replace = TRUE))
# the shuffling & uniqueness
system.time( test1 <- unique(dt[sample(dim(dt)[1])], by = c("c2","c3","c4")) )
# user system elapsed
# 13.87 0.49 14.33
# @akrun's solution
system.time( test2 <- dt[dt[ , .I[sample(.N,1)] , by = c("c2","c3","c4")]$V1] )
# user system elapsed
# 11.89 0.10 12.01
# Grouping is identical (so, all groups are being sampled in both cases)
identical(x=test1[,.(c2,c3)][order(c2,c3)],
y=test2[,.(c2,c3)][order(c2,c3)])
# [1] TRUE
For sampling more than one row per group check here
Updated workflow for dplyr. I added a second column v that can be grouped by z.
require(data.table)
y = c('a','b','c','d','e','f','g','h')
x = sample(2:10,8,replace = TRUE)
z = rep(y,x)
v <- 1:length(z)
dt = data.table(z,v)
library(dplyr)
dt %>%
group_by(z) %>%
slice_sample(n = 1)

Calculating a weighted mean in data.table in R varying weights

My question relates to this previously asked question:
Calculating a weighted mean using data.table in R with weights in one of the table columns
In my case, I have different weight columns for the columns I want to aggregate. Let's say I have four columns col_a, col_b, col_c and col_d, and I want to aggregate col_a and col_b with weights w_1, and col_c and col_d with weights w_2. Example:
require(data.table)
id <- c(1,1,1,2,2,2)
col_a <- c(123,56,87,987,1003,10)
col_b <- c(17,234,20,88,765,69)
col_c <- c(45,90,543,30,1,543)
col_d <- c(60,43,700,3,88,46)
w_1 <- c(1,1,1,1,1,1)
w_2 <- c(1.5,1,1.2,0.8,1,1)
dt <- data.table(id, col_a, col_b, col_c, col_d, w_1, w_2);dt
Now the desired result would look like this:
data.table(id = c(1, 2),
  col_a = c(weighted.mean(col_a[1:3], w_1[1:3]), weighted.mean(col_a[4:6], w_1[4:6])),
  col_b = c(weighted.mean(col_b[1:3], w_1[1:3]), weighted.mean(col_b[4:6], w_1[4:6])),
  col_c = c(weighted.mean(col_c[1:3], w_2[1:3]), weighted.mean(col_c[4:6], w_2[4:6])),
  col_d = c(weighted.mean(col_d[1:3], w_2[1:3]), weighted.mean(col_d[4:6], w_2[4:6])))
This, I thought, could be accomplished similarly to @akrun's answer to this post:
R collapse multiple rows into 1 row using specific function to each column
where I would have the two functions weighted.mean(x, w_1) and weighted.mean(x, w_2) instead of min or median.
Here is how far I got:
colsToKeep <- c("col_a","col_b","col_c","col_d")
dt[, Map(function(x,y) get(x)(y, na.rm = TRUE),
setNames(rep(c('weighted.mean', 'weighted.mean'),2),names(.SD)), .SD),.SDcols=colsToKeep, by = id]
My question: how can I get the arguments w = w_1 and w = w_2 into the setNames function? Is that even possible?
Could be something like this too:
colsToKeep <- c("col_a", "col_b", "col_c", "col_d")
colsToW <- c("w_1", "w_1", "w_2", "w_2")
eval(parse(text = paste0("dt[, .(", paste0("w_", colsToKeep, " = weighted.mean(", colsToKeep, ",", colsToW, ")", collapse = ", "), "), by = id]")))
or this one:
dt[, Map(function(x,y,w) get(x)(y, w, na.rm = TRUE),
setNames(rep('weighted.mean',length(colsToKeep)), paste0("W_", colsToKeep)),
.SD[, ..colsToKeep], .SD[, ..colsToW]),
by = id]
As mentioned by Roland, you can cast into a long format. The benefit is that, in the long run, you do not have to change the code every time a new column is added. Explanations are inline. You can print mdt to take a look.
#cast into a long format with col values in 1 column and rows in another columns
mdt <- melt(dt, id.var=c("id",grep("^w", names(dt), value=TRUE)),
variable.name="col", value.name="colVal")
mdt <- melt(mdt, id.var=c("id","col","colVal"),
variable.name="w", value.name="wVal")
#prob need to think of a programmatic way rather than typing columns
myPairs <- data.table(rbind(
c(col="col_a", w="w_1"),
c(col="col_b", w="w_1"),
c(col="col_c", w="w_2"),
c(col="col_d", w="w_2")))
#calculate weighted mean according to myPairs and then pivot the table
dcast(mdt[myPairs, on=.(col, w),
weighted.mean(colVal, wVal),
by=.(id, col)],
id ~ col,
value.var="V1")
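If the column-to-weight pairing is small and known up front, it can also be written as a named character vector and looped over with Map inside j. This is a sketch under that assumption; weight_of and res are illustrative names, and the data is the example from the question.

```r
library(data.table)

dt <- data.table(
  id    = c(1, 1, 1, 2, 2, 2),
  col_a = c(123, 56, 87, 987, 1003, 10),
  col_b = c(17, 234, 20, 88, 765, 69),
  col_c = c(45, 90, 543, 30, 1, 543),
  col_d = c(60, 43, 700, 3, 88, 46),
  w_1   = c(1, 1, 1, 1, 1, 1),
  w_2   = c(1.5, 1, 1.2, 0.8, 1, 1)
)

# Hypothetical lookup: names are value columns, values are their weight columns
weight_of <- c(col_a = "w_1", col_b = "w_1", col_c = "w_2", col_d = "w_2")

# Map() names its result after the first argument, so the output columns
# keep the value-column names
res <- dt[, Map(function(v, w) weighted.mean(.SD[[v]], .SD[[w]]),
                names(weight_of), weight_of),
          by = id]
res
```

Adding a column then only requires a new entry in weight_of, which keeps the pairing in one place without going through a long format.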
