Identifying outliers in different groups in R

I have a data set where participants were assigned to different groups and completed the same tests. I know I can use the aggregate function to get the mean and SD for each group, but I cannot figure out how to find the outliers within these groups.
df <- read.table(header = TRUE, sep = ",", text = "id, group, test1, test2
1, 0, 57, 82
2, 0, 77, 80
3, 0, 67, 90
4, 0, 15, 70
5, 0, 58, 72
6, 1, 18, 44
7, 1, 44, 44
8, 1, 18, 46
9, 1, 20, 44
10, 1, 14, 38")
I like the format of this code but do not know how to change it so that it identifies outliers for each group on each test.
Also, I want outliers to be defined as anything more than 2 standard deviations from the mean rather than 3. Can I set that within this code as well?
##to get outliers on test1 if groups were combined
badexample <- boxplot(df$test1, plot=F)$out
which(df$test1 %in% badexample)
This would work if I wanted the outliers of both groups together on test1, but I want to separate them by group.
The output should contain:
Outliers for group 0 on test1
Outliers for group 0 on test2
Outliers for group 1 on test1
Outliers for group 1 on test2

You can write a function to compute the outliers and then call it with ave.
outlier <- function(x, SD = 2) {
  mu <- mean(x)
  sigma <- sd(x)
  # flag values more than SD standard deviations from the group mean
  out <- x < mu - SD * sigma | x > mu + SD * sigma
  out
}
with(df, ave(test1, group, FUN = outlier))
# [1] 0 0 0 0 0 0 0 0 0 0
with(df, ave(test2, group, FUN = outlier))
# [1] 0 0 0 0 0 0 0 0 0 0
To have new columns in df with these results, assign in the usual way.
df$out1 <- with(df, ave(test1, group, FUN = outlier))
df$out2 <- with(df, ave(test2, group, FUN = outlier))
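If you also want the four lists described above (the outlier values for each group on each test), one possible follow-up using the out1/out2 columns is to split the flagged scores by group:
# a minimal sketch reusing the out1/out2 flags from above;
# each call returns one vector of outlying scores per group (both are empty for this data)
with(df, split(test1[out1 == 1], group[out1 == 1]))
with(df, split(test2[out2 == 1], group[out2 == 1]))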

An option, using data.table:
library(data.table)
df <- read.table(header=T, sep=",", text="id, group, test1, test2
1, 0, 57, 82
2, 0, 77, 80
3, 0, 67, 90
4, 0, 15, 70
5, 0, 58, 72
6, 1, 18, 44
7, 1, 44, 44
8, 1, 18, 46
9, 1, 20, 44
10, 1, 14, 38")
DT <- as.data.table(df)
# per-group means and SDs for each test
DT[, `:=`(mean1 = mean(test1), sd1 = sd(test1), mean2 = mean(test2), sd2 = sd(test2)), by = "group"]
# flag values more than 2 SDs from their group mean
DT[, `:=`(outlier1 = abs(test1 - mean1) > 2*sd1, outlier2 = abs(test2 - mean2) > 2*sd2)]
DT
# id group test1 test2 mean1 sd1 mean2 sd2 outlier1 outlier2
# 1: 1 0 57 82 54.8 23.66854 78.8 8.074652 FALSE FALSE
# 2: 2 0 77 80 54.8 23.66854 78.8 8.074652 FALSE FALSE
# 3: 3 0 67 90 54.8 23.66854 78.8 8.074652 FALSE FALSE
# 4: 4 0 15 70 54.8 23.66854 78.8 8.074652 FALSE FALSE
# 5: 5 0 58 72 54.8 23.66854 78.8 8.074652 FALSE FALSE
# 6: 6 1 18 44 22.8 12.04990 43.2 3.033150 FALSE FALSE
# 7: 7 1 44 44 22.8 12.04990 43.2 3.033150 FALSE FALSE
# 8: 8 1 18 46 22.8 12.04990 43.2 3.033150 FALSE FALSE
# 9: 9 1 20 44 22.8 12.04990 43.2 3.033150 FALSE FALSE
# 10: 10 1 14 38 22.8 12.04990 43.2 3.033150 FALSE FALSE
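To list just the flagged rows per test, a small follow-up sketch reusing the columns computed above (both return zero rows for this data):
DT[outlier1 == TRUE, .(id, group, test1)]
DT[outlier2 == TRUE, .(id, group, test2)]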

Here's a way with dplyr (a grouped across() variant is sketched after the output):
df %>%
  mutate_at(
    vars(starts_with("test")),
    list(outlier = ~ (abs(. - mean(.)) > 2 * sd(.)))
  )
id group test1 test2 test1_outlier test2_outlier
1 1 0 57 82 FALSE FALSE
2 2 0 77 80 FALSE FALSE
3 3 0 67 90 FALSE FALSE
4 4 0 15 70 FALSE FALSE
5 5 0 58 72 FALSE FALSE
6 6 1 18 44 FALSE FALSE
7 7 1 44 44 FALSE FALSE
8 8 1 18 46 FALSE FALSE
9 9 1 20 44 FALSE FALSE
10 10 1 14 38 FALSE FALSE
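Note that mutate_at() is superseded in current dplyr; a rough across() equivalent (assuming dplyr >= 1.0.0), grouped so the flags are computed within each group as the question asks:
df %>%
  group_by(group) %>%
  mutate(across(starts_with("test"),
                ~ abs(.x - mean(.x)) > 2 * sd(.x),
                .names = "{.col}_outlier")) %>%
  ungroup()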

R Identify Max and Col Source [duplicate]

HAVE = data.frame("STUDENT"=c(1, 1, 1, 2, 2, 2, 3, 3, 3),
"CLASS"=c('A','A','A','B','B','B','C','C','C'),
"SEMESTER"=c(1, 2, 3, 1, 2, 3, 1, 2, 3),
"SCORE"=c(50, 74, 78, 79, 100, 65, 61, 70, 87),
"TEST"=c(80, 59, 63, 96, 57, 53, 93, 89, 92))
WANT = HAVE %>%
rowwise() %>%
mutate(MAX = max(c(SCORE, TEST)))
WANT$WHICHCOL = c("TEST", "SCORE", "SCORE", "TEST", "SCORE", "SCORE", "TEST", "TEST", "TEST")
I can get the maximum of SCORE and TEST, but I also want to create the column WHICHCOL, which should equal 'TEST' if TEST > SCORE and 'SCORE' if SCORE > TEST.
pmax is a built-in function that will be much more efficient than a rowwise max:
HAVE %>%
  mutate(
    MAX = pmax(SCORE, TEST),
    WHICHCOL = ifelse(SCORE > TEST, "SCORE", "TEST")
  )
# STUDENT CLASS SEMESTER SCORE TEST MAX WHICHCOL
# 1 1 A 1 50 80 80 TEST
# 2 1 A 2 74 59 74 SCORE
# 3 1 A 3 78 63 78 SCORE
# 4 2 B 1 79 96 96 TEST
# 5 2 B 2 100 57 100 SCORE
# 6 2 B 3 65 53 65 SCORE
# 7 3 C 1 61 93 93 TEST
# 8 3 C 2 70 89 89 TEST
# 9 3 C 3 87 92 92 TEST
Note that, since I use > not >=, TEST will win ties.
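If you would rather have SCORE win ties, a small variant of the same idea simply flips the comparison:
HAVE %>%
  mutate(
    MAX = pmax(SCORE, TEST),
    # >= makes SCORE win ties instead of TEST
    WHICHCOL = ifelse(SCORE >= TEST, "SCORE", "TEST")
  )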
A base R solution:
df1 <- HAVE[c("SCORE", "TEST")]
x <- max.col(df1, "first")          # column index of each row's maximum (ties go to the first column)
MAX <- df1[cbind(1:nrow(df1), x)]   # matrix indexing pulls out each row's maximum value
WHICHCOL <- names(df1)[x]
HAVE <- cbind(HAVE, MAX, WHICHCOL)
HAVE
#> STUDENT CLASS SEMESTER SCORE TEST MAX WHICHCOL
#> 1 1 A 1 50 80 80 TEST
#> 2 1 A 2 74 59 74 SCORE
#> 3 1 A 3 78 63 78 SCORE
#> 4 2 B 1 79 96 96 TEST
#> 5 2 B 2 100 57 100 SCORE
#> 6 2 B 3 65 53 65 SCORE
#> 7 3 C 1 61 93 93 TEST
#> 8 3 C 2 70 89 89 TEST
#> 9 3 C 3 87 92 92 TEST

Create a new variable based on other factors using R

I have this data frame and I aim to add a new variable based on the others:
Qi Age c_gen
 1  56    13
 2  43    15
 5  31     6
 3  67     8
I want to create a variable called c_sep such that:
if Qi == 1 or Qi == 2, c_sep takes a random number between (c_gen + 6) and Age;
if Qi == 3 or Qi == 4, c_sep takes a random number between (Age - 15) and Age;
and 0 otherwise,
so my data would look something like this:
Qi Age c_gen c_sep
 1  56    13    24
 2  43    15    13
 5  31     6     0
 3  67     8    40
Any ideas, please?
In base R, you can do something along the lines of:
dat <- read.table(text = "Qi Age c_gen
1 56 13
2 43 15
5 31 6
3 67 8", header = T)
set.seed(100)
dat$c_sep <- 0
dat$c_sep[dat$Qi %in% c(1, 2)] <- apply(dat[dat$Qi %in% c(1, 2), ], 1,
                                        \(row) sample((row["c_gen"] + 6):row["Age"], 1))
dat$c_sep[dat$Qi %in% c(3, 4)] <- apply(dat[dat$Qi %in% c(3, 4), ], 1,
                                        \(row) sample((row["Age"] - 15):row["Age"], 1))
dat
# Qi Age c_gen c_sep
# 1 1 56 13 28
# 2 2 43 15 43
# 3 5 31 6 0
# 4 3 67 8 57
If you are doing this more than twice you might want to put it in a function, depending on your requirements; a mapply() variant is sketched below.
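For instance, a rough mapply() sketch (not part of the original answer) that works on the columns directly, so the data frame is never coerced to a matrix as it is inside apply():
# hypothetical variant: apply the same rules element-wise over the three columns
set.seed(100)
dat$c_sep <- mapply(function(qi, age, cg) {
  if (qi %in% c(1, 2)) sample((cg + 6):age, 1)
  else if (qi %in% c(3, 4)) sample((age - 15):age, 1)
  else 0
}, dat$Qi, dat$Age, dat$c_gen)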
Try this
df$c_sep <- ifelse(df$Qi == 1 | df$Qi == 2,
                   sapply(1:nrow(df), \(x) sample(seq(df$c_gen[x] + 6, df$Age[x]), 1)),
                   ifelse(df$Qi == 3 | df$Qi == 4,
                          sapply(1:nrow(df), \(x) sample(seq(df$Age[x] - 15, df$Age[x]), 1)),
                          0))
output
Qi Age c_gen c_sep
1 1 56 13 41
2 2 43 15 42
3 5 31 6 0
4 3 67 8 58
A tidyverse option:
library(tidyverse)
df <- tribble(
  ~Qi, ~Age, ~c_gen,
  1, 56, 13,
  2, 43, 15,
  5, 31, 6,
  3, 67, 8
)
df |>
  rowwise() |>
  mutate(c_sep = case_when(
    Qi <= 2 ~ sample(seq(c_gen + 6, Age, 1), 1),
    between(Qi, 3, 4) ~ sample(seq(Age - 15, Age, 1), 1),
    TRUE ~ 0
  )) |>
  ungroup()
#> # A tibble: 4 × 4
#> Qi Age c_gen c_sep
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1 56 13 39
#> 2 2 43 15 41
#> 3 5 31 6 0
#> 4 3 67 8 54
Created on 2022-06-29 by the reprex package (v2.0.1)

Omit rows with 0 in many specific columns

I have a very wide dataset with multiple psychometric scales and I would like to remove rows if any of a handful of columns contains zero (i.e., a missing response).
I know how to do it when the data frame is small, but my method is not scalable. For example,
dftry <- data.frame(x = c(1, 2, 5, 3, 0), y = c(0, 10, 5, 3, 37), z=c(12, 0, 33, 22, 23))
x y z
1 1 0 12
2 2 10 0
3 5 5 33
4 3 3 22
5 0 37 23
# Remove row if it has 0 in y or z columns
# is there a difference between & and , ?
dftry %>% filter(dftry$y > 0 & dftry$z > 0)
x y z
1 5 5 33
2 3 3 22
3 0 37 23
In my actual data, I want to remove rows if there are zeroes in any of these columns:
# this is the most succinct way of selecting the columns in question
select(c(1:42, contains("BMIS"), "hamD", "GAD"))
You can use rowSums:
cols <- c('y', 'z')
# keep rows where none of the selected columns equals 0
dftry[rowSums(dftry[cols] == 0, na.rm = TRUE) == 0, ]
# x y z
#1 5 5 33
#2 3 3 22
#3 0 37 23
We can integrate this into dplyr for your real use-case.
library(dplyr)
dftry %>%
  filter(rowSums(select(., c(1:42, contains("BMIS"), "hamD", "GAD")) == 0,
                 na.rm = TRUE) == 0)
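With dplyr 1.0.0 or later, if_any() can express the same idea a bit more directly; a rough sketch on the small example (swap in c(1:42, contains("BMIS"), hamD, GAD) for the real columns):
# keep rows where none of the selected columns is 0
# (note: rows with NA in those columns are also dropped by this form)
dftry %>%
  filter(!if_any(c(y, z), ~ .x == 0))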
Does this work using dplyr:
> library(dplyr)
> dftry
x y z a b c BMIS_1 BMIS_3 hamD GAD m n
1 1 0 12 1 0 12 1 0 12 12 12 12
2 2 10 0 2 10 0 2 10 0 0 0 0
3 5 5 33 5 5 33 5 5 33 33 33 33
4 3 3 22 3 3 22 3 3 22 22 22 22
5 0 37 23 0 37 23 0 37 23 23 23 23
> dftry %>% select(c(1:3,contains('BMIS'), hamD, GAD)) %>% filter_all(all_vars(. != 0))
x y z BMIS_1 BMIS_3 hamD GAD
1 5 5 33 5 5 33 33
2 3 3 22 3 3 22 22
Data used:
> dput(dftry)
structure(list(x = c(1, 2, 5, 3, 0), y = c(0, 10, 5, 3, 37),
z = c(12, 0, 33, 22, 23), a = c(1, 2, 5, 3, 0), b = c(0,
10, 5, 3, 37), c = c(12, 0, 33, 22, 23), BMIS_1 = c(1, 2,
5, 3, 0), BMIS_3 = c(0, 10, 5, 3, 37), hamD = c(12, 0, 33,
22, 23), GAD = c(12, 0, 33, 22, 23), m = c(12, 0, 33, 22,
23), n = c(12, 0, 33, 22, 23)), class = "data.frame", row.names = c(NA,
-5L))

Check for perfect square and replace with another value in R

I have attempted multiple ways to check for perfect squares in an R object and then replace them with 0s. Below are the single lines of code I have tried; the solution must be a single line:
> y
[1] 9 72 49 70 16 3 3 4 81 6 43 7 12 9 3
> is.integer(sqrt(y))
[1] FALSE
> ifelse(is.integer(sqrt(y)), 0, y)
[1] 9
> ifelse(sqrt(y)==is.integer(y), 0, y)
[1] 9 72 49 70 16 3 3 4 81 6 43 7 12 9 3
You can take the square root, get the remainder after dividing by 1 using %%, and compare it with 0.
sqrt(y)
#[1] 3.00 8.49 7.00 8.37 4.00 1.73 1.73 2.00 9.00 2.45 6.56 2.65 3.46 3.00 1.73
sqrt(y) %% 1 == 0
#[1] TRUE FALSE TRUE FALSE TRUE FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE TRUE FALSE
Now set these values to 0:
y[sqrt(y) %% 1 == 0] <- 0
#[1] 0 72 0 70 0 3 3 0 0 6 43 7 12 0 3
Or another way :
y * +(sqrt(y) %% 1 != 0)
#[1] 0 72 0 70 0 3 3 0 0 6 43 7 12 0 3
We could create the condition with round, ceiling, or as.integer, which convert the square root to a whole number; given the precision involved, only values whose square roots match exactly will return TRUE:
y[sqrt(y) == round(sqrt(y))] <- 0
y[sqrt(y) == as.integer(sqrt(y))] <- 0
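Because sqrt() returns a double, a slightly more defensive single line compares against the rounded root with an explicit tolerance (the 1e-8 here is an arbitrary choice):
# treat y as a perfect square if its root is within 1e-8 of a whole number
y[abs(sqrt(y) - round(sqrt(y))) < 1e-8] <- 0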
data
y <- c(9, 72, 49, 70, 16, 3, 3, 4, 81, 6, 43, 7, 12, 9, 3)

Enumerate quantiles in reverse order

I'm trying to get the quantile number of a column in a data frame, but in reverse order. I want the highest values to be in quantile number 1.
Here is what I have so far:
> x<-c(10, 12, 75, 89, 25, 100, 67, 89, 4, 67, 120.2, 140.5, 170.5, 78.1)
> x <- data.frame(x)
> within(x, Q <- as.integer(cut(x, quantile(x, probs=0:5/5, na.rm=TRUE),
include.lowest=TRUE)))
x Q
1 10.0 1
2 12.0 1
3 75.0 3
4 89.0 4
5 25.0 2
6 100.0 4
7 67.0 2
8 89.0 4
9 4.0 1
10 67.0 2
11 120.2 5
12 140.5 5
13 170.5 5
14 78.1 3
And what I want to get is:
x Q
1 10.0 5
2 12.0 5
3 75.0 3
4 89.0 2
5 25.0 4
6 100.0 2
7 67.0 4
8 89.0 2
9 4.0 5
10 67.0 4
11 120.2 1
12 140.5 1
13 170.5 1
14 78.1 3
One way to do this is to specify the reversed labels in the cut() function. If you want Q to be an integer, you need to coerce the factor labels to character first and then to integer.
result <- within(x, Q <- as.integer(as.character(cut(x,
  quantile(x, probs = 0:5/5, na.rm = TRUE),
  labels = c(5, 4, 3, 2, 1),
  include.lowest = TRUE))))
head(result)
x Q
1 10 5
2 12 5
3 75 3
4 89 2
5 25 4
6 100 2
Your data:
x <- c(10, 12, 75, 89, 25, 100, 67, 89, 4, 67, 120.2, 140.5, 170.5, 78.1)
x <- data.frame(x)
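Since the integer codes already run from 1 to 5, another option is to keep the original cut() call and simply reverse the codes; a small sketch of the same idea:
# 6L minus the original quantile code turns 1..5 into 5..1
result2 <- within(x, Q <- 6L - as.integer(cut(x,
  quantile(x, probs = 0:5/5, na.rm = TRUE),
  include.lowest = TRUE)))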
