Find rows where multiple columns have the same value - r

library(tidyverse)
d = data.frame(x=c('A','B','C'), y=c('A','B','D'), z=c('X','B','C'), a=1:3)
print(d)
x y z a
1 A A X 1
2 B B B 2
3 C D C 3
d %>% filter(x==y) # Returns rows 1 and 2
d %>% filter(x==z) # Returns rows 2 and 3
d %>% filter(x==y & x==z) # Returns row 2
How can I do what the very last line is doing with more concise syntax for some arbitrary set of columns? For example, filter(all.equal(x,y,z)) which doesn't work but expresses the idea.

When comparing multiple columns, an easier option is to take one column out (x) and loop over the remaining columns with if_all, applying == inside it, so that a row is kept only when all of its comparisons are TRUE:
library(dplyr)
d %>%
filter(if_all(y:z, ~ x == .x))
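Using the sample d from the question, this returns the same single row as filter(x == y & x == z):
d %>%
  filter(if_all(y:z, ~ x == .x))
#   x y z a
# 1 B B B 2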

Same idea, with across instead of if_all:
d %>%
filter(across(y:z, ~`==`(.x, x)))
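For completeness, the same check can be written in base R for an arbitrary set of columns; a minimal sketch using the example data (the cols vector is just an illustrative name):
cols <- c("y", "z")
d[rowSums(sapply(d[cols], `==`, d$x)) == length(cols), ]
#   x y z a
# 2 B B B 2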

Related

Print rows that contain values in vector/column in R

I have some data looking like this:
1. matrix
a b c d e
f 4 5 6 7 8
g 1 2 3 4 5
h 3 2 1 6 7
2. column/vector
v <- c(5,4,8,6,0)
How can I print all the rows that contain any of the values in the vector?
I've seen there's a function called filter that could work, or maybe lapply/grep?
Use filter with if_any:
library(dplyr)
df %>%
filter(if_any(everything(), ~ . %in% v))
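Note that filter() works on data frames; if the data is a matrix as in the question, it would need converting first (a one-line sketch, assuming the matrix object is called m1 as in the base R line below):
df <- as.data.frame(m1)   # convert the matrix so dplyr verbs can be used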
Or using base R (assuming it is a matrix)
m1[rowSums(`dim<-`(m1 %in% v, dim(m1))) > 0,]
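Here `dim<-` just reshapes the logical vector produced by m1 %in% v back into a matrix with m1's dimensions, so that rowSums can count matches per row. A more explicit base R sketch of the same test:
# keep the rows of m1 that contain at least one value from v
m1[apply(m1, 1, function(r) any(r %in% v)), ]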

Subsetting whole clusters from a dataframe

In my data.frame below, how can I subset each whole 'study' cluster that has any outcome larger than 1 in it?
My desired output is shown below. I tried subset(h, outcome > 1), but that doesn't give my desired output.
h = "
study outcome
a 1
a 2
a 1
b 1
b 1
c 3
c 3"
h = read.table(text = h, header = TRUE)
DESIRED OUTPUT:
"
study outcome
a 1
a 2
a 1
c 3
c 3"
Modify the subset call:
Subset the 'study' values with the first logical expression, outcome > 1.
Then use %in% on 'study' to create the final logical expression inside subset.
subset(h, study %in% study[outcome > 1])
-output
study outcome
1 a 1
2 a 2
3 a 1
6 c 3
7 c 3
If we want to limit the number of 'study' elements having an 'outcome' greater than 1, i.e. keep only the first 'n' such studies, then get the unique 'study' values from the first expression of subset, use head to take the first 'n' of them, and use %in% to create the logical expression:
n <- 3
subset(h, study %in% head(unique(study[outcome > 1]), n))
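For instance, with the sample h above, n <- 1 keeps only the first qualifying study (a quick check of the idea):
subset(h, study %in% head(unique(study[outcome > 1]), 1))
#   study outcome
# 1     a       1
# 2     a       2
# 3     a       1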
Or it can be done with a group-by approach using any:
library(dplyr)
h %>%
group_by(study) %>%
filter(any(outcome > 1)) %>%
ungroup

Is there a way to count values by presence per row in R?

I want a way to count values in a dataframe based on their presence by row.
a = data.frame(c('a','b','c','d','f'),
c('a','b','a','b','d'))
colnames(a) = c('let', 'let2')
In this reproducible example, the letter "a" appears in the first row and the third row, for a total of two appearances. I've made this code to count a value when its presence in a row is TRUE, but I want it to assign the count automatically for every value present in the dataframe:
# for counting the value 'a' and attributing the count to the b dataframe
b = data.frame(unique(unique(unlist(a))))
b$count = 0
for(i in 1:nrow(a)){
  if(TRUE %in% apply(a[i,], 2, function(x) x %in% 'a')){
    b$count[1] = b$count[1] + 1
  }
}
b$count[1]
[1] 2
The problem is that I have to do this manually for every value, and I want a way to do it automatically. Is there a way? The expected output is:
1 a 2
2 b 2
3 c 1
4 d 2
5 f 1
It can be done in base R by taking the unique values separately from each column, unlisting them to a vector and getting the frequency count with table. If needed, convert the table object to a two-column data.frame with stack:
stack(table(unlist(lapply(a, unique))))[2:1]
-output
# ind values
#1 a 2
#2 b 2
#3 c 1
#4 d 2
#5 f 1
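To see the intermediate step with the sample a from the question, unlist(lapply(a, unique)) stacks the per-column unique values into one vector before table counts them:
unname(unlist(lapply(a, unique)))
# [1] "a" "b" "c" "d" "f" "a" "b" "d"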
If it is based on row, use apply with MARGIN = 1
table(unlist(apply(a, 1, unique)))
Or do a group-by on the rows to get the unique values and count with table:
table(unlist(tapply(unlist(a), list(row(a)), unique)))
Or a faster approach with dapply from collapse
library(collapse)
table(unlist(dapply(a, funique, MARGIN = 1)))
Does this work:
library(dplyr)
library(tidyr)
a %>% pivot_longer(cols = everything()) %>% distinct() %>% count(value)
# A tibble: 5 x 2
value n
<chr> <int>
1 a 2
2 b 2
3 c 1
4 d 2
5 f 1
Data used:
a
let let2
1 a a
2 b b
3 c a
4 d b
5 f d

Check if names are found in different columns and which one

I have a data frame with 4 columns, each column representing a different treatment. Each column is filled with protein identifiers, and the columns have different numbers of rows. Is there a way to compare all 4 columns and get, as a result, a fifth column saying in which of the columns each value is found? I know some values will occur in two or maybe even three of the columns, and I was wondering if there is a way to get this as the end result in a new column.
I tried Data$A %in% Data$B, but this just gives me TRUE or FALSE between two columns. I was looking for some option like match or even contains, but all the options seem to give only a true or false answer.
What I need is something like this.
A B C
1 DSFG DSFG DSGG
2 DDEG DDED DDEE
3 HUGO HUGI HUGO
So if this is my table, I want the result like this
D(?) E
1 DSFG A,B
2 DSGG C
4 DDEG A
5 DDED B
6 DDEE C
7 HUGO A,C
8 HUGI B
Solution
An idea via base R is to use stack to convert to long format, and aggregate to get the required output:
aggregate(ind ~ values, stack(df), toString)
# values ind
#1 DDED B
#2 DDEE C
#3 DDEG A
#4 DSFG A, B
#5 DSGG C
#6 HUGI B
#7 HUGO A, C
NOTE: Your columns need to be as.character for this to work. (df[] <- lapply(df, as.character))
Explanations
Stacking turns data into "long format":
stack(df)
values ind
1 DSFG A
2 DDEG A
3 HUGO A
4 DSFG B
5 DDED B
6 HUGI B
7 DSGG C
8 DDEE C
9 HUGO C
toString() simply joins the elements of a vector with commas:
toString(c("A", "B", "C"))
[1] "A, B, C"
Aggregating returns a vector of "ind"s for each value, and these are then turned into a string using the function above:
aggregate(ind ~ values, stack(df), FUN=toString)
Doing it the tidy way:
Input
df <- data.frame(A = c("DSFG", "DDEG", "HUGO"), B = c("DSFG", "DDED", "HUGI"), C = c("DSGG", "DDEE", "HUGO"))
Summarizing data
library(tidyverse)
df %>%
gather("Column", "Value", 1:3) %>%
group_by(Value) %>%
summarise(Cols = paste(Column, collapse = ","))
Output
Value Cols
DDED B
DDEE C
DDEG A
DSFG A,B
DSGG C
HUGI B
HUGO A,C
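gather() has since been superseded by pivot_longer(); an equivalent sketch of the same summary:
library(tidyverse)

df %>%
  pivot_longer(everything(), names_to = "Column", values_to = "Value") %>%
  group_by(Value) %>%
  summarise(Cols = paste(Column, collapse = ","))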

Find whether a number sequence falls within ONE adjacent number (previous and next) by group

Let T = {t | t = 1, 2, 3, ..., T} be the time (sequence order number). For each group, at each time t where a sequence value occurs, we need to make sure that value (a number, let's call it X) is within the set {K-1, K, K+1}, where K is the previous sequence value at t-1. For example, if the previous value is K = 4, the next value X meets the requirement if it falls within [3, 4, 5]. If every X in the sequence meets the requirement, the group meets the requirement and is labeled TRUE.
I know a for loop can do the trick, but I have a large number of observations and doing it in a loop is very slow. I know cummax can quickly find non-decreasing sequences, and I was wondering whether there is any similarly quick solution.
seq <- c(1,2,1,2,3,1,2,3,1,2,1,2,2,3,4)
group <- rep(letters[1:3],each=5)
dt <- data.frame(group,seq)
> dt
group seq
1 a 1
2 a 2
3 a 1
4 a 2
5 a 3
6 b 1
7 b 2
8 b 3
9 b 1
10 b 2
11 c 1
12 c 2
13 c 2
14 c 3
15 c 4
The desired output:
group label
a     TRUE
b     FALSE
c     TRUE
You can use the diff function to check if the adjacent sequence satisfies the condition:
library(dplyr)
dt %>% group_by(group) %>% summarize(label = all(abs(diff(seq)) <= 1))
# A tibble: 3 x 2
# group label
# <fctr> <lgl>
#1 a TRUE
#2 b FALSE
#3 c TRUE
Here is the corresponding data.table version:
library(data.table)
setDT(dt)[, .(label = all(abs(diff(seq)) <= 1)), .(group)]
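To see why group b fails the check, look at the absolute differences within that group; the jump from 3 back to 1 is larger than 1:
abs(diff(c(1, 2, 3, 1, 2)))   # seq values for group "b"
# [1] 1 1 2 1                 # the 2 makes all(... <= 1) FALSE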
You can do:
is.sequence <- function(x)
  all(apply(head(cbind(x - 1, x, x + 1), -1) - x[-1] == 0, 1, any))
tapply(dt$seq, dt$group, is.sequence)
# a b c
# TRUE FALSE TRUE
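To unpack the idea: cbind(x - 1, x, x + 1) builds, for each position, the window of allowed next values, and each following value in x[-1] is checked against its window. A quick sketch with the sample values of group "b":
x <- c(1, 2, 3, 1, 2)             # seq values for group "b"
head(cbind(x - 1, x, x + 1), -1)  # row i is the allowed window for x[i + 1]
#      [,1] [,2] [,3]
# [1,]    0    1    2
# [2,]    1    2    3
# [3,]    2    3    4
# [4,]    0    1    2
# x[-1] is 2 3 1 2; the value 1 is not in its window (2, 3, 4), so the group is FALSE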
Here is a base R example with aggregate and diff:
aggregate(c(1, abs(diff(dt$seq)) * (tail(dt$group, -1) == head(dt$group, -1))),
          dt["group"], function(i) max(i) < 2)
group x
1 a TRUE
2 b FALSE
3 c TRUE
The first argument to aggregate is a vector built with diff, where each difference is switched off (set to zero) whenever the adjacent elements belong to different groups.
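With the sample dt above, that first argument evaluates to the following vector, so the grouped maxima are 1, 2 and 1 respectively:
c(1, abs(diff(dt$seq)) * (tail(dt$group, -1) == head(dt$group, -1)))
# [1] 1 1 1 1 1 0 1 1 2 1 0 1 0 1 1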
We can also use aggregate from base R:
aggregate(seq ~ group, dt, FUN = function(x) all(c(TRUE,
          abs(x[-1] - x[-length(x)]) <= 1)))
# group seq
#1 a TRUE
#2 b FALSE
#3 c TRUE
