How to identify and remove outliers in a data.frame using R?

I have a data.frame that contains multiple outliers. I suspect that these outliers are skewing the results I get.
I tried the tip from https://www.r-bloggers.com/2020/01/how-to-remove-outliers-in-r/, but it didn't work: I still end up with very extreme values.
I also tried the solution with the rstatix package, but I can't manage to remove the outliers from my data.frame:
library(rstatix)
library(dplyr)

df <- data.frame(
  sample = 1:20,
  score  = c(rnorm(19, mean = 5, sd = 2), 50)
)
View(df)

out_df <- identify_outliers(df$score)         # identify outliers
df2 <- df                                     # copy df
df2 <- df2[-which(df2$score %in% out_df), ]   # remove outliers from df2
View(df2)

identify_outliers() expects a data.frame as its first argument, i.e. the usage is
identify_outliers(data, ..., variable = NULL)
where
... is one unquoted expression (or variable name), used to select a variable of interest, as an alternative to the variable argument. So pass the whole data.frame, then drop the rows whose score appears among the identified outliers:
df2 <- subset(df, !score %in% identify_outliers(df, "score")$score)
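For reference, a minimal sketch of the corrected call; treat the exact flag column names as an assumption about current rstatix versions:
library(rstatix)

out_df <- identify_outliers(df, score)   # data.frame first, then the unquoted column
out_df   # the outlier rows, flagged with is.outlier / is.extreme columns

df2 <- subset(df, !score %in% out_df$score)   # same idea as the one-liner above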

A rule of thumb is that data points above Q3 + 1.5xIQR or below Q1 - 1.5xIQR are considered outliers.
Therefore you just have to identify them and remove them. I don't know how to do it with the rstatix dependency, but with base R it can be achieved following the example below:
# generate demo data
set.seed(123)
demo.data <- data.frame(
  sample = 1:20,
  score  = c(rnorm(19, mean = 5, sd = 2), 50),
  gender = rep(c("Male", "Female"), each = 10)
)

# identify outliers: quantile(x)[4] is Q3, quantile(x)[2] is Q1
outliers <- which(demo.data$score > quantile(demo.data$score)[4] + 1.5 * IQR(demo.data$score) |
                  demo.data$score < quantile(demo.data$score)[2] - 1.5 * IQR(demo.data$score))

# remove them from your dataframe
df2 <- demo.data[-outliers, ]
Or, nicer, wrap it in a function that returns the indices of the outliers:
get_outliers <- function(x){
  which(x > quantile(x)[4] + 1.5 * IQR(x) | x < quantile(x)[2] - 1.5 * IQR(x))
}

outliers <- get_outliers(demo.data$score)
df2 <- demo.data[-outliers, ]
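Note that base R already ships a variant of the same rule: boxplot.stats() returns the points beyond the 1.5 * IQR whiskers in its $out element (it uses hinges, which can differ slightly from quantile() on small samples). A short sketch:
# flag and drop the values beyond the boxplot whiskers
out_vals <- boxplot.stats(demo.data$score)$out
df2 <- demo.data[!demo.data$score %in% out_vals, ]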

Related

Is there a way of using the pmap function in purrr to iteratively adjust a dataset?

I've created a function which I am trying to apply to a dataset using pmap. The function amends some columns in the dataset, and I want the amendments applied to the two columns to carry over to the second and subsequent iterations of pmap.
Reproducible example below:
library(tidyr)
library(dplyr)
library(purrr)   # needed for pmap_dfr() below

set.seed(1982)

# create example dataset
dataset <- tibble(groupvar = sample(c(1:3), 20, replace = TRUE),
                  a = sample(c(1:10), 20, replace = TRUE),
                  b = sample(c(1:10), 20, replace = TRUE),
                  c = sample(c(1:10), 20, replace = TRUE),
                  d = sample(c(1:10), 20, replace = TRUE)) %>%
arrange(groupvar)
# function to sum two columns (col1 and col2), then adjust those columns so that the
# cumulative sum of the two columns within the group doesn't exceed the specified limit
shared_limits <- function(col1, col2, group, limit){
  dataset <- dataset
  dataset$group <- dataset[[group]]
  dataset$newcol <- dataset[[col1]] + dataset[[col2]]
  dataset <- dataset %>% group_by(groupvar) %>% mutate(cumulative_sum = cumsum(newcol))
  dataset$limited_cumulative_sum <- ifelse(dataset$cumulative_sum > limit, limit, dataset$cumulative_sum)
  dataset <- dataset %>% group_by(groupvar) %>% mutate(limited_cumulative_sum_lag = lag(limited_cumulative_sum))
  dataset$limited_cumulative_sum_lag <- ifelse(is.na(dataset$limited_cumulative_sum_lag), 0, dataset$limited_cumulative_sum_lag)
  dataset$adjusted_sum <- dataset$limited_cumulative_sum - dataset$limited_cumulative_sum_lag
  dataset[[col1]] <- ifelse(dataset$adjusted_sum == dataset$newcol, dataset[[col1]],
                            pmin(dataset[[col1]], dataset$adjusted_sum))
  dataset[[col2]] <- dataset$adjusted_sum - dataset[[col1]]
  dataset <- dataset %>% ungroup() %>%
    dplyr::select(-group, -newcol, -cumulative_sum, -limited_cumulative_sum,
                  -limited_cumulative_sum_lag, -adjusted_sum)
  dataset
}
# apply the function directly
new_dataset <- shared_limits("a", "b", "groupvar", 25)

# apply the function using a separate parameters table and pmap_dfr
shared_limits_table <- tibble(col1  = c("a", "b"),
                              col2  = c("c", "d"),
                              group = "groupvar",
                              limit = c(25, 30))
dataset <- pmap_dfr(shared_limits_table, shared_limits)
In the example above the pmap function applies the shared limit to columns "a" and "c" and returns an adjusted dataset as the first element in the list. It then applies the shared limit to columns "b" and "d" and returns this as the second element in the list. However the adjustments that have been made to "a" and "c" are now lost.
Is there any way of storing the adjustments that are made to each column as we progress through each iteration of pmap?
You can iteratively apply a function to your dataset with purrr's reduce.
First, I'd fix your function, since dataset is currently picked up from the global environment rather than passed in:
shared_limits <- function(df, col1, col2, group, limit){
  dataset <- df
  dataset$group <- dataset[[group]]
  dataset$newcol <- dataset[[col1]] + dataset[[col2]]
  dataset <- dataset %>% group_by(groupvar) %>% mutate(cumulative_sum = cumsum(newcol))
  dataset$limited_cumulative_sum <- ifelse(dataset$cumulative_sum > limit, limit, dataset$cumulative_sum)
  dataset <- dataset %>% group_by(groupvar) %>% mutate(limited_cumulative_sum_lag = lag(limited_cumulative_sum))
  dataset$limited_cumulative_sum_lag <- ifelse(is.na(dataset$limited_cumulative_sum_lag), 0, dataset$limited_cumulative_sum_lag)
  dataset$adjusted_sum <- dataset$limited_cumulative_sum - dataset$limited_cumulative_sum_lag
  dataset[[col1]] <- ifelse(dataset$adjusted_sum == dataset$newcol, dataset[[col1]],
                            pmin(dataset[[col1]], dataset$adjusted_sum))
  dataset[[col2]] <- dataset$adjusted_sum - dataset[[col1]]
  dataset <- dataset %>% ungroup() %>%
    dplyr::select(-group, -newcol, -cumulative_sum, -limited_cumulative_sum,
                  -limited_cumulative_sum_lag, -adjusted_sum)
  dataset
}
Then make a list of the arguments you want to pass to the function at each step:
shared_limits_args_list <- list(
  list("a", "c", "groupvar", 25),
  list("b", "d", "groupvar", 30))
Then call reduce, setting the dataset as your initial x with the .init parameter. At each iteration a sublist of arguments from shared_limits_args_list will be passed to the function as y. [[ is used to select the list elements for each position. The output dataframe from the function will become the new x for the next iteration, and the next sublist of shared_limits_args_list will be the next set of arguments. When all of the sublists of shared_limits_args_list have been used, the final dataframe is output.
dataset_combined <-
  reduce(shared_limits_args_list,
         function(x, y) shared_limits(df = x, y[[1]], y[[2]], y[[3]], y[[4]]),
         .init = dataset)
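To see what reduce() is doing here, this is the unrolled equivalent (a sketch; step1 and step2 are illustrative names, and dataset is assumed to still hold the original tibble):
# first iteration: limit columns a and c
step1 <- shared_limits(df = dataset, "a", "c", "groupvar", 25)
# second iteration: limit columns b and d, starting from the adjusted data
step2 <- shared_limits(df = step1, "b", "d", "groupvar", 30)
# step2 should match dataset_combined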

Delete rows after a negative value in multiple data frames

I have multiple data frames, each an individual sequence consisting of the same columns. I need to delete all the rows after a negative value is encountered in the column "OnsetTime": not the row with the negative value itself, but every row after it. All sequences have 16 rows in total.
I think this must be doable with a loop, but I have no experience with loops in R, and with 499 data frames I am currently deleting the rows of each sequence one by one, like this:
sequence_6 <- sequence_6[-c(11:16), ]
sequence_7 <- sequence_7[-c(11:16), ]
sequence_9 <- sequence_9[-c(6:16), ]
Is there a faster way of doing this? An example of a sequence can be seen here: example sequence
Regarding this example, I want to delete rows 7 to 16.
Data
Since the odd web configuration at work prevents me from accessing your data, I created three data frames based on random numbers:
set.seed(123); data_1 <- data.frame( value = runif(25, min = -0.1) )
set.seed(234); data_2 <- data.frame( value = runif(20, min = -0.1) )
set.seed(345); data_3 <- data.frame( value = runif(30, min = -0.1) )
First, you could create a list containing all your dataframes:
list_df <- list(data_1, data_2, data_3)
Now you can go through this list with a for loop. Since there are several steps, I find it convenient to use the package dplyr because it allows for a more readable notation:
library(dplyr)
for (i in seq_along(list_df)) {
  min_row <-
    list_df[[i]] %>%
    mutate(id = row_number()) %>%  # add a column with the row number
    filter(value < 0) %>%          # get the rows with negative values
    summarise(min(id)) %>%         # get the first such row number
    as.numeric()                   # turn the 1x1 data frame into a scalar
  list_df[[i]] <- list_df[[i]] %>% slice(1:min_row)  # keep rows 1 to min_row
}
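If you prefer base R, the same per-frame step can be written without dplyr; a minimal sketch (first_neg and data_1_trimmed are illustrative names):
# index of the first negative value, NA if there is none
first_neg <- which(data_1$value < 0)[1]
# keep rows up to and including it; keep everything when there is no negative
data_1_trimmed <- data_1[seq_len(min(first_neg, nrow(data_1), na.rm = TRUE)), , drop = FALSE]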
Hope it helps!
We can get the datasets into a list, assuming that the object names start with 'sequence' followed by an underscore and one or more digits. Then use lapply to loop over the list and subset the rows based on the condition:
lst1 <- lapply(mget(ls(pattern = "^sequence_\\d+$")), function(x) {
  i1 <- Reduce(`|`, lapply(x, `<`, 0))
  # or use rowSums:
  # i1 <- rowSums(x < 0) > 0
  i2 <- which(i1)[1]
  x[seq(i2), ]
})
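Since lst1 is a named list (the names come from the original object names via mget), you can optionally write the trimmed data frames back as individual objects, although keeping them in the list is usually cleaner:
# overwrite sequence_6, sequence_7, ... with their trimmed versions
list2env(lst1, envir = .GlobalEnv)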
data
set.seed(42)
sequence_6 <- as.data.frame(matrix(sample(-1:10, 16 * 5, replace = TRUE), nrow = 16))
sequence_7 <- as.data.frame(matrix(sample(-2:10, 16 * 5, replace = TRUE), nrow = 16))
sequence_9 <- as.data.frame(matrix(sample(-2:10, 16 * 5, replace = TRUE), nrow = 16))

Subsetting dataset on dynamic columns

I have a question about subsetting data based on dynamic column classes. For example:
# coming from another source, so I don't know their exact names or numbers of classes;
# the following two demographies should help in picturing the problem
gender <- c(1,2)
agegroup <- c(1,2,3,4,5,6,7,8)
# more demographies ...
# reproducible data
set.seed(1)
col1 <- as.data.frame(rep(gender, 100))
col2 <- as.data.frame(rep(agegroup, 25))
col3 <- runif(200)
datafile <- cbind(col1, col2, col3)
names(datafile)[1] <- "gender"
names(datafile)[2] <- "agegroup"
datafile <- as.data.frame(datafile)
# subset for gender = 1 and agegroup = 3; I need such a subset for every combination
# of classes in each demography, without hardcoded names, because the demography
# names will not be known
dat_gender_1_agegroup_3 <- datafile[datafile$gender == 1 & datafile$agegroup == 3, ]
But there can be more demographies, not just gender and agegroup: income, education, race and so on, each with a varying number of classes. Kindly help me in subsetting the dataset datafile on a varying number of columns. Thanks in advance.
Using expand.grid for the combinations, then apply to subset:
# dummy data
set.seed(123)
mydata <- data.frame(gender = sample(1:2, 100, replace = TRUE),
                     agegroup = sample(1:10, 100, replace = TRUE))

# groups
gender <- c(1,2)
agegroup <- c(1,2,3,4,5,6,7,8)

# get all combinations
myCombo <- expand.grid(gender, agegroup)

# the result is a list object, one data frame per combination
apply(myCombo, 1, function(i){
  mydata[mydata$gender == i[1] &
         mydata$agegroup == i[2], ]
})
Edit: based on the update, I think you just need the split command:
split(datafile, datafile[, 1:2])
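If the demography columns are not known in advance, you can pick them out dynamically before splitting; a sketch, assuming everything except the value column col3 is a demography:
# split on every column except the numeric value column
# (assumption: col3 holds the values, the rest are demographies)
demo_cols <- setdiff(names(datafile), "col3")
subsets <- split(datafile, datafile[, demo_cols], drop = TRUE)  # drop empty combinations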
What about the following (assuming the column names are "gender" and "agegroup")? Note that the vectors of wanted values need names that differ from the column names; otherwise gender %in% gender inside subset() just compares the column with itself and keeps every row:
wanted_gender <- c(1,2)
wanted_agegroup <- c(1,2,3,4,5,6,7,8)
data_subset <- subset(full_data, gender %in% wanted_gender | agegroup %in% wanted_agegroup | [AND SO ON])
You can add as many [column_name] %in% [values] conditions as you want.
HTH a little!
EDIT: you can very well use & instead of |, obviously, depending on whether rows should match any or all of the conditions.

Removing outliers from groups using data.table in R

I have a data.table object that contains a group column. I am trying to remove outliers from each of the groups, but I cannot come up with a clean solution for that. My data.table can be built with this simple script:
library(data.table)

col1 <- rnorm(30, mean = 5, sd = 2)
col2 <- rnorm(30, mean = 5, sd = 2)
id <- seq(1, 30)
group <- sample(4, 30, replace = TRUE)
dt <- data.table(id, group, col1, col2)
I've tried splitting the data by the group variable, but that approach is too messy. How would I "easily" remove the top n% of outliers from each group in a data.table without too many data transformations?
Assuming that you want to remove outliers according to both col1 and col2, based on the 95% quantile:
dt_filt <- dt[, .SD[
  (col1 < quantile(col1, probs = 0.95)) &
  (col2 < quantile(col2, probs = 0.95))
], by = group]
which basically splits the data based on the group column, calculates the thresholds, and then subsets the data to keep only rows where col1 and col2 are lower than the thresholds.
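If you want to parameterize the threshold or screen more than two columns without repeating yourself, something along these lines should work (a sketch; p and cols are illustrative choices, not from the question):
library(data.table)

p    <- 0.95                # quantile threshold (assumed)
cols <- c("col1", "col2")   # columns to screen (assumed)

dt_filt <- dt[, .SD[
  Reduce(`&`, lapply(.SD[, cols, with = FALSE],
                     function(v) v < quantile(v, probs = p)))
], by = group]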

Serial Subsetting in R

I am working with large datasets. I have to extract values from one dataset; the identifiers for those values are stored in another dataset. So I am basically subsetting twice for each value of one category, and for multiple categories I have to combine the double-subsetted values. I am doing something like the example below, but I think there must be a better way:
example datasets
set.seed(1)
df <- data.frame(number= seq(5020, 5035, 1), value =rnorm(16, 20, 5),
type = rep(c("food", "bar", "sleep", "gym"), each = 4))
df2 <- data.frame(number= seq(5020, 5035, 1), type = rep(LETTERS[1:4], 4))
Extract the values for grade A:
asub_df2 <- subset(df2, type == "A")
asub_df <- subset(df, number %in% asub_df2$number)   # %in%, not ==: several numbers match
new_a <- cbind(asub_df, grade = rep("A", nrow(asub_df)))
Then similarly extract the values for grade B into new_b, and so on, and combine everything for the analysis. Can we use a loop or an apply-style function instead?
You can split df2 by its type and use lapply:
Filter(Negate(is.null),
       lapply(split(df2, df2$type), function(x) {
         x1 <- subset(df, number %in% x$number)   # %in% handles multiple numbers per type
         if (nrow(x1) > 0) {
           transform(x1, grade = x$type[1])
         }
       }))
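An alternative worth considering: since both data frames share the number key, base R's merge() attaches the grade column in one step, after which ordinary subsetting works. A sketch (merged and new_a are illustrative names):
# join on "number"; df2's type column becomes the grade
merged <- merge(df, df2, by = "number", suffixes = c("", ".grade"))
names(merged)[names(merged) == "type.grade"] <- "grade"
new_a <- subset(merged, grade == "A")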
