Subsetting whole clusters from a dataframe - r

In my data.frame below, I wonder how to subset a whole cluster of study that has any outcome larger than 1 in it?
My desired output is shown below. I tried subset(h, outcome > 1) but that doesn't give my desired output.
h = "
study outcome
a 1
a 2
a 1
b 1
b 1
c 3
c 3"
h = read.table(text = h, header = TRUE)
DESIRED OUTPUT:
"
study outcome
a 1
a 2
a 1
c 3
c 3"

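Modify the subset call in two steps -
First, index 'study' with the logical expression outcome > 1 to get the studies that contain at least one such outcome
Then use %in% on 'study' to create the final logical expression in subset, which is TRUE for every row of those studies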
subset(h, study %in% study[outcome > 1])
-output
study outcome
1 a 1
2 a 2
3 a 1
6 c 3
7 c 3
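To see why this works, the expression can be evaluated in pieces. The following is a minimal sketch using the question's data; the names flagged and keep are introduced here for illustration only.

```r
h <- read.table(text = "study outcome
a 1
a 2
a 1
b 1
b 1
c 3
c 3", header = TRUE, stringsAsFactors = FALSE)

# Step 1: studies that own at least one outcome > 1
# (a study may appear more than once here; that is harmless for %in%)
flagged <- h$study[h$outcome > 1]

# Step 2: TRUE for every row whose study was flagged
keep <- h$study %in% flagged

h[keep, ]
```

Inside subset() the columns are referenced without the h$ prefix, which is why the one-liner above is shorter.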
If we want to limit the result to the first 'n' studies that have an 'outcome' value greater than 1, take the unique 'study' values from the first expression of subset, use head to keep the first 'n' of them, and use %in% to create the logical expression
n <- 3
subset(h, study %in% head(unique(study[outcome > 1]), n))
Or can be done with a group by approach with any
library(dplyr)
h %>%
group_by(study) %>%
filter(any(outcome > 1)) %>%
ungroup()
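A grouped version is also possible without extra packages, since ave can broadcast a per-study any() back to every row. This is a sketch of an alternative, not part of the answer above; keep is an illustrative name.

```r
h <- read.table(text = "study outcome
a 1
a 2
a 1
b 1
b 1
c 3
c 3", header = TRUE, stringsAsFactors = FALSE)

# For each study, compute any(outcome > 1) and repeat it on all rows of that study
keep <- ave(h$outcome > 1, h$study, FUN = any)

h[keep, ]
```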

Related

Find rows where multiple columns have the same value

library(tidyverse)
d = data.frame(x=c('A','B','C'), y=c('A','B','D'), z=c('X','B','C'), a=1:3)
print(d)
x y z a
1 A A X 1
2 B B B 2
3 C D C 3
d %>% filter(x==y) # Returns rows 1 and 2
d %>% filter(x==z) # Returns rows 2 and 3
d %>% filter(x==y & x==z) # Returns row 2
How can I do what the very last line is doing with more concise syntax for some arbitrary set of columns? For example, filter(all.equal(x,y,z)) which doesn't work but expresses the idea.
When comparing multiple columns, an easier option is to hold one column out (x) and loop over the rest with if_all, comparing each to x with ==. The filter then keeps a row only when every comparison for that row is TRUE
library(dplyr)
d %>%
filter(if_all(y:z, ~ x == .x))
Same idea with across instead of if_all (using across inside filter is deprecated in newer dplyr releases, so prefer if_all):
d %>%
filter(across(y:z, ~ .x == x))
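For comparison, the same row-wise "all equal to x" test can be sketched in base R by combining one logical vector per column with Reduce. The names others and keep are assumptions made for this example.

```r
d <- data.frame(x = c("A", "B", "C"), y = c("A", "B", "D"),
                z = c("X", "B", "C"), a = 1:3,
                stringsAsFactors = FALSE)

# Columns to compare against x (any set of column names could be used here)
others <- c("y", "z")

# One logical vector per column, then AND them together element-wise
keep <- Reduce(`&`, lapply(d[others], function(col) col == d$x))

d[keep, ]
```

Only the second row is kept, matching filter(x==y & x==z) from the question.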

Is there a way to count values by presence per rows in R?

I want a way to count values on a dataframe based on its presence by row
a = data.frame(c('a','b','c','d','f'),
c('a','b','a','b','d'))
colnames(a) = c('let', 'let2')
In this reproducible example, the letter "a" appears in the first and third rows, for a total of two appearances. I wrote the code below to count a value when it is present in a row, but I want it to do this automatically for every value present in the dataframe:
# count the value 'a' and assign the count to the b dataframe
b = data.frame(unique(unlist(a)))
b$count = 0
for(i in 1:nrow(a)){
if(TRUE %in% apply(a[i,], 2, function(x) x %in% 'a') == TRUE){
b$count[1] = b$count[1] + 1
}
}
b$count[1]
[1] 2
The problem is that I have to do this manually for every value, and I want a way to do it automatically. Is there a way? The expected output is:
1 a 2
2 b 2
3 c 1
4 d 2
5 f 1
This can be done in base R: take the unique values of each column separately, unlist them to a vector, and get the frequency count with table. If needed, convert the table object to a two-column data.frame with stack
stack(table(unlist(lapply(a, unique))))[2:1]
-output
# ind values
#1 a 2
#2 b 2
#3 c 1
#4 d 2
#5 f 1
If it is based on row, use apply with MARGIN = 1
table(unlist(apply(a, 1, unique)))
Or do a group by row to get the unique and count with table
table(unlist(tapply(unlist(a), list(row(a)), unique)))
Or a faster approach with dapply from collapse
library(collapse)
table(unlist(dapply(a, funique, MARGIN = 1)))
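The asker's loop can also be generalized directly: for each distinct value, count the rows in which it appears at least once. This is a minimal base R sketch of that idea; vals and counts are illustrative names.

```r
a <- data.frame(let  = c("a", "b", "c", "d", "f"),
                let2 = c("a", "b", "a", "b", "d"),
                stringsAsFactors = FALSE)

# Every distinct value found anywhere in the data frame
vals <- unique(unlist(a, use.names = FALSE))

# a == v is a logical matrix; rowSums(...) > 0 marks rows containing v
counts <- vapply(vals, function(v) sum(rowSums(a == v) > 0), integer(1))

data.frame(value = vals, count = counts, row.names = NULL)
```

A value that occurs twice in the same row, such as "a" in row 1, is counted once for that row.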
Does this work:
library(dplyr)
library(tidyr)
a %>% pivot_longer(cols = everything()) %>% distinct() %>% count(value)
# A tibble: 5 x 2
value n
<chr> <int>
1 a 2
2 b 2
3 c 1
4 d 2
5 f 1
Data used:
a
let let2
1 a a
2 b b
3 c a
4 d b
5 f d

Remove Duplicates from Col X based on condition in Col Y

I have a data frame in R that has duplicates in one of the columns; however, I only want to remove a duplicate based on a condition in another column.
For Example:
DF:
X J Y
1 2 3
2 3 1
1 3 2
I want to remove rows where X is a duplicate and Y = 3.
DF:
X J Y
2 3 1
1 3 2
I have tried reading up on dplyr, but have so far been unable to get the desired result.
We can build the condition with duplicated and the equality operator
subset(df1, !((duplicated(X)|duplicated(X, fromLast = TRUE)) & Y == 3))
# X J Y
#2 2 3 1
#3 1 3 2
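The condition can be broken into steps. The thread never shows how df1 was built, so the sketch below assumes it matches the table in the question; dup_any and to_drop are names introduced for illustration.

```r
# Assumed reconstruction of the question's data
df1 <- data.frame(X = c(1, 2, 1), J = c(2, 3, 3), Y = c(3, 1, 2))

# duplicated() alone flags only the later copies, so combine both directions
# to flag every row whose X value occurs more than once
dup_any <- duplicated(df1$X) | duplicated(df1$X, fromLast = TRUE)

# Drop a row only when its X is duplicated AND its Y equals 3
to_drop <- dup_any & df1$Y == 3

df1[!to_drop, ]
```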
If we need to remove the whole group of rows for an 'X' when any of its 'Y' values is 3, then
library(dplyr)
df1 %>%
group_by(X) %>%
filter(! 3 %in% Y) #or
# filter(all(Y != 3))
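A base R counterpart of the grouped filter might look like the following sketch. It again assumes df1 matches the question's table; bad_groups and res are illustrative names.

```r
# Assumed reconstruction of the question's data
df1 <- data.frame(X = c(1, 2, 1), J = c(2, 3, 3), Y = c(3, 1, 2))

# X values that have at least one row with Y == 3
bad_groups <- df1$X[df1$Y == 3]

# Remove every row belonging to those X values
res <- df1[!df1$X %in% bad_groups, ]
res
```

With this data only the row with X equal to 2 remains, because both rows with X equal to 1 belong to a group that contains Y == 3.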

filter rows when all columns greater than a value

I have a data frame and I would like to subset the rows where all columns values meet my cutoff.
here is the data frame:
A B C
1 1 3 5
2 4 3 5
3 2 1 2
What I would like to select is rows where all columns are greater than 2.
Second row is what I want to get.
[1] 4 3 5
here is my code:
subset_data <- df[which(df[,c(1:ncol(df))] > 2),]
But my code is not applied on all columns.
Do you have any idea how can I fix this.
We can create a logical matrix by comparing the entire data frame with 2, then take rowSums over it and select only those rows whose sum equals the number of columns in df
df[rowSums(df > 2) == ncol(df), ]
# A B C
#2 4 3 5
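The intermediate values make the rowSums trick easier to follow. A short sketch with the question's data; above and hits are names made up for this example.

```r
df <- data.frame(A = c(1, 4, 2), B = c(3, 3, 1), C = c(5, 5, 2))

# Logical matrix: TRUE where a cell exceeds the cutoff
above <- df > 2

# TRUE counts as 1, so this is the number of columns passing in each row
hits <- rowSums(above)

# Keep rows where every column passed
df[hits == ncol(df), ]
```

Here hits is 2, 3 and 0 for the three rows, so only the second row equals ncol(df).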
A dplyr approach using filter_all and all_vars
library(dplyr)
df %>% filter_all(all_vars(. > 2))
# A B C
#1 4 3 5
dplyr > 1.0.0
#1. if_all
df %>% filter(if_all(everything(), ~ . > 2))
#2. across (deprecated inside filter in later dplyr releases; prefer if_all)
df %>% filter(across(everything(), ~ . > 2))
An apply approach
#Using apply
df[apply(df > 2, 1, all), ]
#Using lapply as shared by @thelatemail
df[Reduce(`&`, lapply(df, `>`, 2)),]

Filtering a R DataFrame with repeated values in columns

I have an R DataFrame and I want to make another DF from it, keeping only the values that appear more than X times in a particular column.
>DataFrame
Value Column
1 a
4 a
2 b
6 c
3 c
4 c
9 a
1 d
For example, I want a new DataFrame with only the values in Column that appear more than 2 times, to get something like this:
>NewDataFrame
Value Column
1 a
4 a
6 c
3 c
4 c
9 a
Thank you very much for your time.
We can use table to get the count of values in 'Column' and subset the dataset ('DataFrame') based on the names in 'tbl' that have a count greater than 'n'
n <- 2
tbl <- table(DataFrame$Column) > n
NewDataFrame <- subset(DataFrame, Column %in% names(tbl)[tbl])
# Value Column
#1 1 a
#2 4 a
#4 6 c
#5 3 c
#6 4 c
#7 9 a
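Spelled out step by step, the table approach looks roughly like this. The data frame is rebuilt from the question's listing, and counts and frequent are illustrative names.

```r
DataFrame <- data.frame(Value  = c(1, 4, 2, 6, 3, 4, 9, 1),
                        Column = c("a", "a", "b", "c", "c", "c", "a", "d"),
                        stringsAsFactors = FALSE)
n <- 2

# Occurrences of each value in Column: a = 3, b = 1, c = 3, d = 1
counts <- table(DataFrame$Column)

# Names of the values that occur more than n times
frequent <- names(counts)[counts > n]

# Keep only the rows whose Column value is frequent
NewDataFrame <- DataFrame[DataFrame$Column %in% frequent, ]
NewDataFrame
```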
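Or using ave from base R (seq_along gives ave a numeric first argument, so the group sizes come back as numbers rather than strings)
NewDataFrame <- DataFrame[with(DataFrame, ave(seq_along(Column), Column, FUN=length)>n),]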
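One caveat with ave: it returns its result in the type of its first argument, so a character column yields group sizes as strings, and comparing strings with a number falls back to text ordering. The small sketch below uses a made-up three-element vector to show the difference.

```r
# Hypothetical toy vector, not the question's data
Column <- c("a", "a", "b")

# Character first argument: counts are coerced to character
as_text <- ave(Column, Column, FUN = length)

# Integer first argument: counts stay integer
as_num <- ave(seq_along(Column), Column, FUN = length)

# Text comparison: 2 is coerced to "2", and "10" sorts before "2"
"10" > 2
```

With small counts the two forms happen to agree, but a group of 10 or more rows could be dropped by the character version, so a numeric first argument such as seq_along(Column) is the safer choice.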
Or using data.table
library(data.table)
NewDataFrame <- setDT(DataFrame)[, .SD[.N>n] , by = Column]
Or
NewDataFrame <- setDT(DataFrame)[, if(.N > n) .SD, by = Column]
Or dplyr
NewDataFrame <- DataFrame %>%
group_by(Column) %>%
filter(n()>2)

Resources