How to count all the outlier values in a row?

In R (I'm so new) I'm trying to create an outlier_count variable, where an integer indicates the number of outlier values per row.
So, let's say my dataset looks like this, and assuming "10" is an outlier:
var1 var2 var3 var4 var5 var6 var7
a 1 1 10 10 1 1 1
b 10 1 1 1 1 1 1
c 1 1 1 1 1 1 1
d 1 1 1 1 1 1 1
e 1 1 1 1 1 1 1
f 1 1 1 1 1 1 1
I want to end up with something like:
var1 var2 var3 var4 var5 var6 var7 outlier_count
a 1 1 10 10 1 1 1 2
b 10 1 1 1 1 1 1 1
c 1 1 1 1 1 1 1 0
d 1 1 1 1 1 1 1 0
e 1 1 1 1 1 1 1 0
f 1 1 1 1 1 1 1 0
So, in every row, I know how many values were outliers.
I tried a couple of functions but the variable ends up being NA when a single column is NA.
Is there an easy, error-proof way of doing this?

After your explanations in the comments and the edit with the expected output, it becomes very simple.
First read in the data.
df <- read.table(text = "
var1 var2 var3 var4 var5 var6 var7
a 1 1 10 10 1 1 1
b 10 1 1 1 1 1 1
c 1 1 1 1 1 1 1
d 1 1 1 1 1 1 1
e 1 1 1 1 1 1 1
f 1 1 1 1 1 1 1
", header = TRUE)
Now the code. I will treat as an outlier everything below the 0.05 quantile or above the 0.95 quantile; change these thresholds if you like. Since you mentioned NA values, na.rm = TRUE is added in both steps so that a single NA doesn't turn the whole result into NA.
out <- sapply(df, function(x) x < quantile(x, 0.05, na.rm = TRUE) | x > quantile(x, 0.95, na.rm = TRUE))
df$outlier_count <- rowSums(out, na.rm = TRUE)
df
Note that you could skip the intermediate variable out and write this as a one-liner; that's up to you. I prefer readable code.
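Since the NA issue was the original stumbling block, here is a minimal sketch (my own toy data, not from the question) showing why na.rm = TRUE is needed at both steps:

```r
# Toy vector with an NA planted in it
x <- c(1, 1, 10, NA)

# quantile() stops with an error on NA input unless na.rm = TRUE
q95 <- quantile(x, 0.95, na.rm = TRUE)

# The comparison still yields NA for the NA element ...
flags <- x > q95

# ... so the sum/rowSums step also needs na.rm = TRUE,
# which treats NA as "not an outlier"
sum(flags, na.rm = TRUE)
# [1] 1
```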

Related

Count occurrences of distinct values across multiple columns and groups

I've got a dataframe like the one below (the actual dataset has a few thousand rows and more than 300 variables):
df <- data.frame (Gr = c("A","A","A","B","B","B","B","B","B"),
Var1 = c("a","b","c","e","a","a","c","e","b"),
Var2 = c("a","a","a","d","b","b","c","a","e"),
Var3 = c("e","a","b",NA,"a","b","c","d","a"),
Var4 = c("e",NA,"a","e","a","b","d","c",NA))
which returns:
Gr Var1 Var2 Var3 Var4
1 A a a e e
2 A b a a <NA>
3 A c a b a
4 B e d <NA> e
5 B a b a a
6 B a b b b
7 B c c c d
8 B e a d c
9 B b e a <NA>
and would like to get number of occurrences of each value (a,b,c,d,e, and NA) in each variable and in each group. Hence, the output should look something like the following:
df1 <- data.frame(Vars = c("Var1","Var2","Var3","Var4"),
a = c(1,3,1,1),
b = c(1,0,1,0),
c = c(1,0,0,0),
d = c(0,0,0,0),
e = c(0,0,1,1),
na = c(0,0,0,1))
df2 <- data.frame(Vars = c("Var1","Var2","Var3","Var4"),
a = c(2,1,2,1),
b = c(0,2,1,1),
c = c(1,1,1,1),
d = c(0,1,1,1),
e = c(2,1,0,1),
na = c(0,0,1,1))
output <- list(df1,df2)
names(output) <- c("A","B")
which looks like:
$A
Vars a b c d e na
1 Var1 1 1 1 0 0 0
2 Var2 3 0 0 0 0 0
3 Var3 1 1 0 0 1 0
4 Var4 1 0 0 0 1 1
$B
Vars a b c d e na
1 Var1 2 0 1 0 2 0
2 Var2 1 2 1 1 1 0
3 Var3 2 1 1 1 0 1
4 Var4 1 1 1 1 1 1
I haven't been able to make any considerable progress so far, and a tidyverse solution is preferred.
We may use mtabulate after splitting:
library(qdapTools)
lapply(split(df[-1], df$Gr), mtabulate)
If we need the na count as well:
lapply(split(replace(df[-1], is.na(df[-1]), "na"), df$Gr), mtabulate)
Output:
$A
a b c e na
Var1 1 1 1 0 0
Var2 3 0 0 0 0
Var3 1 1 0 1 0
Var4 1 0 0 1 1
$B
a b c d e na
Var1 2 1 1 0 2 0
Var2 1 2 1 1 1 0
Var3 2 1 1 1 0 1
Var4 1 1 1 1 1 1
Or using tidyverse
library(dplyr)
library(tidyr)
df %>%
  pivot_longer(cols = -Gr, names_to = "Vars") %>%
  pivot_wider(names_from = value, values_from = value,
              values_fn = length, values_fill = 0) %>%
  {split(.[-1], .$Gr)}
Output:
$A
# A tibble: 4 × 7
Vars a e b `NA` c d
<chr> <int> <int> <int> <int> <int> <int>
1 Var1 1 0 1 0 1 0
2 Var2 3 0 0 0 0 0
3 Var3 1 1 1 0 0 0
4 Var4 1 1 0 1 0 0
$B
# A tibble: 4 × 7
Vars a e b `NA` c d
<chr> <int> <int> <int> <int> <int> <int>
1 Var1 2 2 1 0 1 0
2 Var2 1 1 2 0 1 1
3 Var3 2 0 1 1 1 1
4 Var4 1 1 1 1 1 1
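Another tidyverse variant (a sketch of mine, not from the answers above) sidesteps the values_fn trick by counting the long data and then widening:

```r
library(dplyr)
library(tidyr)

df <- data.frame(Gr = c("A","A","A","B","B","B","B","B","B"),
                 Var1 = c("a","b","c","e","a","a","c","e","b"),
                 Var2 = c("a","a","a","d","b","b","c","a","e"),
                 Var3 = c("e","a","b",NA,"a","b","c","d","a"),
                 Var4 = c("e",NA,"a","e","a","b","d","c",NA))

res <- df %>%
  pivot_longer(-Gr, names_to = "Vars", values_to = "val") %>%
  mutate(val = replace_na(val, "na")) %>%   # make NA an explicit level
  count(Gr, Vars, val) %>%                  # counts per group/variable/value
  pivot_wider(names_from = val, values_from = n, values_fill = 0)
res
```

The result is one wide tibble keyed by Gr and Vars; split(res[-1], res$Gr) would turn it into the requested list.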
An NA-safe base R approach using colSums:
val <- sort(unique(unlist(df[-1])), na.last = TRUE)
as.list(lapply(split(df[-1], df$Gr), function(dlist)
  data.frame(sapply(val, function(x)
    colSums(dlist == x | (is.na(dlist) & is.na(x)), na.rm = TRUE)), check.names = FALSE)))
$A
a b c d e NA
Var1 1 1 1 0 0 0
Var2 3 0 0 0 0 0
Var3 1 1 0 0 1 0
Var4 1 0 0 0 1 1
$B
a b c d e NA
Var1 2 1 1 0 2 0
Var2 1 2 1 1 1 0
Var3 2 1 1 1 0 1
Var4 1 1 1 1 1 1
reshape2::recast(df, Gr + variable ~ value, length, id.var = 'Gr')
 Gr variable a b c d e NA
1 A Var1 1 1 1 0 0 0
2 A Var2 3 0 0 0 0 0
3 A Var3 1 1 0 0 1 0
4 A Var4 1 0 0 0 1 1
5 B Var1 2 1 1 0 2 0
6 B Var2 1 2 1 1 1 0
7 B Var3 2 1 1 1 0 1
8 B Var4 1 1 1 1 1 1
If you must split them:
split(reshape2::recast(df,Gr+variable~value,length,id.var = 'Gr'), ~Gr)
$A
Gr variable a b c d e NA
1 A Var1 1 1 1 0 0 0
2 A Var2 3 0 0 0 0 0
3 A Var3 1 1 0 0 1 0
4 A Var4 1 0 0 0 1 1
$B
Gr variable a b c d e NA
5 B Var1 2 1 1 0 2 0
6 B Var2 1 2 1 1 1 0
7 B Var3 2 1 1 1 0 1
8 B Var4 1 1 1 1 1 1
In base R:
ftable(cbind(df[1], stack(replace(df, is.na(df),'na'), -1)),col.vars = 2)
values a b c d e na
Gr ind
A Var1 1 1 1 0 0 0
Var2 3 0 0 0 0 0
Var3 1 1 0 0 1 0
Var4 1 0 0 0 1 1
B Var1 2 1 1 0 2 0
Var2 1 2 1 1 1 0
Var3 2 1 1 1 0 1
Var4 1 1 1 1 1 1

R - count values in a data.frame

df <- data.frame(row.names = c('ID1','ID2','ID3','ID4'),
                 var1 = c(0,1,2,3),
                 var2 = c(0,0,0,0),
                 var3 = c(1,2,3,0),
                 var4 = c('1','1','2','2'))
> df
var1 var2 var3 var4
ID1 0 0 1 1
ID2 1 0 2 1
ID3 2 0 3 2
ID4 3 0 0 2
I want df to look like this
var1 var2 var3 var4
0 1 4 1 0
1 1 0 1 2
2 1 0 1 2
3 1 0 1 0
So I want the values of df to be counted. The problem is that not every value occurs in every column.
I tried lapply(df, table), but that returns a list which I cannot convert into a data.frame (for the reason above).
I could do it more or less by hand with table(df$var1) and bind everything together after doing that for every var, but that is tedious. Can you find a better way?
Thanks ;)
Call the table function with factor levels taken from the entire dataset:
sapply(df,function(x) table(factor(x, levels = 0:3)))
# var1 var2 var3 var4
#0 1 4 1 0
#1 1 0 1 2
#2 1 0 1 2
#3 1 0 1 0
If you don't know beforehand what levels your data can take, we can find it from data itself.
vec <- unique(unlist(df))
sapply(df, function(x) table(factor(x, levels = vec)))
We could do this without any loop:
table(c(col(df)), unlist(df))
# 0 1 2 3
# 1 1 1 1 1
# 2 4 0 0 0
# 3 1 1 1 1
# 4 0 2 2 0
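Building on the col() idea, a small sketch of my own that labels the dimensions with the actual column names instead of column indices:

```r
df <- data.frame(row.names = c('ID1','ID2','ID3','ID4'),
                 var1 = c(0,1,2,3), var2 = c(0,0,0,0),
                 var3 = c(1,2,3,0), var4 = c('1','1','2','2'))

# col(df) gives each cell's column index; indexing names(df) with it
# tags every cell with its column name before tabulating
tab <- table(value = unlist(df), column = names(df)[c(col(df))])
tab
```

The counts match the expected output above, with values as rows and var1..var4 as columns.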

How to create new variables using loop and ifelse statement

I have a number of variables with the same name but different suffixes. For example, var1, var2, var3, var4, var5, var6, ... and so on. Each of these variables has a random sequence of 0, 1, and 2. Using these variables, I am trying to create a new variable called testvariable. If any of the already existing variables has 1, I will set testvariable to 1. If they have 0 or 2, I will assign 0.
Is there a simple loop and/or ifelse statement I can use to create this variable? My real data is a lot more complex than this, so I don't want to copy and paste each individual variable and values.
Edit: This is for R.
If I understand you correctly: if any of the variables var1, var2, etc. has the value 1, then testvariable must be 1, otherwise 0. If so, do:
Sample df:
var1 var2 var3 var4 var5
1 1 1 1 1 1
2 2 2 2 2 2
3 0 1 0 1 0
4 0 0 0 0 1
5 1 1 2 1 2
6 2 2 2 2 2
7 2 2 1 2 2
8 1 1 2 1 1
9 0 0 0 0 0
Code:
df$testvariable <- ifelse(rowSums(df[, grepl("var", names(df))] == 1) > 0, 1, 0)
Output:
var1 var2 var3 var4 var5 testvariable
1 1 1 1 1 1 1
2 2 2 2 2 2 0
3 0 1 0 1 0 1
4 0 0 0 0 1 1
5 1 1 2 1 2 1
6 2 2 2 2 2 0
7 2 2 1 2 2 1
8 1 1 2 1 1 1
9 0 0 0 0 0 0
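A dplyr variant of the same rowwise check (a sketch of mine, not from the answer above; if_any requires dplyr >= 1.0.4):

```r
library(dplyr)

df <- data.frame(var1 = c(1, 2, 0), var2 = c(2, 2, 1))

# if_any() is TRUE for a row if the condition holds in any selected column
df <- df %>%
  mutate(testvariable = as.integer(if_any(starts_with("var"), ~ .x == 1)))
df
#   var1 var2 testvariable
# 1    1    2            1
# 2    2    2            0
# 3    0    1            1
```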
I would use mget for this. mget returns the variables as a list; you can then check each element of the list with sapply and combine the results using any. At the end I take advantage of your 0/1 encoding, but you could also use an if statement.
var1 <- c(0,0,1,2)
var2 <- c(2,2,2,2)
var3 <- c(0,2,0,2)
var4 <- c(0,2,2,2)
any(sapply(mget(paste0("var", 1:4)), function(x) 1 %in% x)) * 1
#> [1] 1
any(sapply(mget(paste0("var", 2:4)), function(x) 1 %in% x)) * 1
#> [1] 0

Conditional running count (cumulative sum) with reset in R (dplyr)

I'm trying to calculate a running count (i.e., cumulative sum) that is conditional on other variables and that can reset for particular values on another variable. I'm working in R and would prefer a dplyr-based solution, if possible.
I'd like to create a variable for the running count, cumulative, based on the following algorithm:
Calculate the running count (cumulative) within combinations of id and age
Increment running count (cumulative) by 1 for every subsequent trial where accuracy = 0, block = 2, and condition = 1
Reset running count (cumulative) to 0 for each trial where accuracy = 1, block = 2, and condition = 1, and the next increment resumes at 1 (not the previous number)
For each trial where block != 2, or condition != 1, leave the running count (cumulative) as NA
Here's a minimal working example:
mydata <- data.frame(id = c(1,1,1,1,1,1,1,1,1,1,1),
age = c(1,1,1,1,1,1,1,1,1,1,2),
block = c(1,1,2,2,2,2,2,2,2,2,2),
trial = c(1,2,1,2,3,4,5,6,7,8,1),
condition = c(1,1,1,1,1,2,1,1,1,1,1),
accuracy = c(0,0,0,0,0,0,0,1,0,0,0)
)
id age block trial condition accuracy
1 1 1 1 1 0
1 1 1 2 1 0
1 1 2 1 1 0
1 1 2 2 1 0
1 1 2 3 1 0
1 1 2 4 2 0
1 1 2 5 1 0
1 1 2 6 1 1
1 1 2 7 1 0
1 1 2 8 1 0
1 2 2 1 1 0
The expected output is:
id age block trial condition accuracy cumulative
1 1 1 1 1 0 NA
1 1 1 2 1 0 NA
1 1 2 1 1 0 1
1 1 2 2 1 0 2
1 1 2 3 1 0 3
1 1 2 4 2 0 NA
1 1 2 5 1 0 4
1 1 2 6 1 1 0
1 1 2 7 1 0 1
1 1 2 8 1 0 2
1 2 2 1 1 0 1
Here is an option using data.table. Create a binary column ('ind') by matching the pasted values of 'accuracy', 'block' and 'condition' against the custom values "121" and "021". Then, grouped by the run-length-id of 'ind' together with 'id' and 'age', take the cumulative sum of 'ind' and assign (:=) it to a new column ('Cumulative').
library(data.table)
setDT(mydata)[, ind := match(do.call(paste0, .SD), c("121", "021")) - 1,
.SDcols = c("accuracy", "block", "condition")
][, Cumulative := cumsum(ind), .(rleid(ind), id, age)
][, ind := NULL][]
# id age block trial condition accuracy Cumulative
# 1: 1 1 1 1 1 0 NA
# 2: 1 1 1 2 1 0 NA
# 3: 1 1 2 1 1 0 1
# 4: 1 1 2 2 1 0 2
# 5: 1 1 2 3 1 0 3
# 6: 1 1 2 4 2 0 NA
# 7: 1 1 2 5 1 1 0
# 8: 1 1 2 6 1 0 1
# 9: 1 1 2 7 1 0 2
#10: 1 2 2 1 1 0 1
We can use case_when to assign the value we need based on our conditions. We then add an additional group_by condition using cumsum, to switch groups whenever the temp column is 0. In the final mutate step we temporarily replace NA values in temp with 0, take the cumsum over it, and then put the NA values back in place to get the final output.
library(dplyr)
mydata %>%
  group_by(id, age) %>%
  mutate(temp = case_when(accuracy == 0 & block == 2 & condition == 1 ~ 1,
                          accuracy == 1 & block == 2 & condition == 1 ~ 0,
                          TRUE ~ NA_real_)) %>%
  ungroup() %>%
  group_by(id, age, group = cumsum(replace(temp == 0, is.na(temp), 0))) %>%
  mutate(cumulative = replace(cumsum(replace(temp, is.na(temp), 0)),
                              is.na(temp), NA)) %>%
  select(-temp, -group)
# group id age block trial condition accuracy cumulative
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 0 1 1 1 1 1 0 NA
# 2 0 1 1 1 2 1 0 NA
# 3 0 1 1 2 1 1 0 1
# 4 0 1 1 2 2 1 0 2
# 5 0 1 1 2 3 1 0 3
# 6 0 1 1 2 4 2 0 NA
# 7 0 1 1 2 5 1 0 4
# 8 1 1 1 2 6 1 1 0
# 9 1 1 1 2 7 1 0 1
#10 1 1 1 2 8 1 0 2
#11 1 1 2 2 1 1 0 1
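For completeness, a plain base R sketch of mine (not from the answers above) of the same running count, using ave() with an explicit loop so that NA rows neither count nor reset:

```r
mydata <- data.frame(id = rep(1, 11),
                     age = c(1,1,1,1,1,1,1,1,1,1,2),
                     block = c(1,1,2,2,2,2,2,2,2,2,2),
                     trial = c(1,2,1,2,3,4,5,6,7,8,1),
                     condition = c(1,1,1,1,1,2,1,1,1,1,1),
                     accuracy = c(0,0,0,0,0,0,0,1,0,0,0))

# 1 = increment, 0 = reset, NA = leave untouched
temp <- with(mydata, ifelse(block == 2 & condition == 1, 1 - accuracy, NA))

running <- function(v) {
  out <- v
  run <- 0
  for (i in seq_along(v)) {
    if (is.na(v[i])) next            # NA rows neither count nor reset
    run <- if (v[i] == 0) 0 else run + 1
    out[i] <- run
  }
  out
}

# ave() applies running() within each id/age combination
mydata$cumulative <- ave(temp, mydata$id, mydata$age, FUN = running)
mydata$cumulative
# NA NA 1 2 3 NA 4 0 1 2 1
```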

Counting existing permutations in R

I have a large dataset with columns IDNum, Var1, Var2, Var3, Var4, Var5, Var6. The variables are boolean with value either 0 or 1. Each row could be one of 64 different possible permutations. I would like to count the number of rows corresponding to each permutation present. Is there an efficient way to write this in R?
aggregate can do this. Here's a shorter example:
r <- function() rbinom(10, 1, .5)
d <- data.frame(IDNum=1:10, Var1=r(), Var2=r())
d
IDNum Var1 Var2
1 1 0 1
2 2 0 1
3 3 0 0
4 4 1 0
5 5 1 1
6 6 0 0
7 7 1 1
8 8 1 0
9 9 0 1
10 10 0 1
Now to count the number of each combination:
> aggregate(d$IDNum, d[-1], FUN=length)
Var1 Var2 x
1 0 0 2
2 1 0 2
3 0 1 4
4 1 1 2
The values in d$IDNum aren't actually used here; something simply has to be passed to the length function, which then returns the number of rows for each combination.
This will give a slightly different result and will list out all the possibilities regardless of whether they are present or not. Example data:
nam <- c("IDNum",paste0("Var",1:6))
n <- 5
set.seed(23)
dat <- setNames(data.frame(1:n,replicate(6,sample(0:1,n,replace=TRUE))),nam)
# IDNum Var1 Var2 Var3 Var4 Var5 Var6
#1 1 1 0 1 0 1 1
#2 2 0 1 1 1 0 1
#3 3 0 1 0 1 0 1
#4 4 1 1 0 1 1 0
#5 5 1 1 1 1 0 1
Count em up:
data.frame(table(dat[-1]))
# Var1 Var2 Var3 Var4 Var5 Var6 Freq
#1 0 0 0 0 0 0 0
#...
#28 1 1 0 1 1 0 1
#...
#43 0 1 0 1 0 1 1
#...
#47 0 1 1 1 0 1 1
#48 1 1 1 1 0 1 1
#...
#54 1 0 1 0 1 1 1
#...
#64 1 1 1 1 1 1 0
You can also use the count function in dplyr:
library(dplyr)
r <- function() rbinom(10, 1, .5)
d <- data.frame(IDNum=1:10, Var1=r(), Var2=r())
d
d %>% count(Var1, Var2)
Output:
Var1 Var2 n
1 0 0 3
2 0 1 3
3 1 0 1
4 1 1 3
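With the full six variables from the question, listing every column name in count() gets tedious; a sketch of mine (needs dplyr >= 1.0 for across) selects them all at once:

```r
library(dplyr)

# Hypothetical data in the shape the question describes
set.seed(42)
d <- data.frame(IDNum = 1:20, replicate(6, rbinom(20, 1, 0.5)))
names(d)[-1] <- paste0("Var", 1:6)

# across() lets count() group by all Var columns without naming each one
res <- d %>% count(across(starts_with("Var")))
res
```

Only permutations that actually occur appear in the result, and the n column sums to the number of rows.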
