R data frame subsetting based on a column value frequency threshold [duplicate]

This question already has answers here: Getting the top values by group (6 answers). Closed 6 years ago.
I am a new R user and this is my first question submission (hopefully in compliance with the protocol).
I have a data frame with two columns, a value v1 and its frequency n, built as follows:
library(dplyr)
df <- data.frame(v1 = c("A", "A", "B", "B", "B", "B", "C", "D", "D", "E"))
dfc <- df %>% count(v1)                  # frequency of each value in v1
df$n <- with(dfc, n[match(df$v1, v1)])   # map each row's value to its count
df
v1 n
1 A 2
2 A 2
3 B 4
4 B 4
5 B 4
6 B 4
7 C 1
8 D 2
9 D 2
10 E 1
I want to delete rows that exceed a threshold of 3 occurrences for a value in v1; rows up to the threshold are retained. In this example I want to delete row 6 and keep all remaining rows in a subset data frame.
The result would include the following values for v1:
v1
1 A
2 A
3 B
4 B
5 B
6 C
7 D
8 D
9 E
Row 6 would have been deleted because it was the 4th occurrence of "B", but the 3 previous rows for "B" have been retained.
I have read multiple posts that demonstrate how to remove ALL rows for a value whose total count is below or above some cutoff, such as 4. For example, I have tried:
df1 <- df %>%
  group_by(v1) %>%
  filter(n() < 4)
This keeps only the rows for values of v1 that occur fewer than 4 times; 6 rows are returned.
df2 <- df %>%
  group_by(v1) %>%
  filter(n() > 3)
This keeps only the rows for values of v1 that occur more than 3 times; 4 rows are returned.
df4 <- subset(df, v1 %in% names(table(df$v1))[table(df$v1) < 4])
This approach has the same result as the first approach.
None of these methods produce the result I need.
As previously stated, I need to retain the first three rows where v1="B" and only delete rows if there are > 3 occurrences of that value.
Because I am new to R, it's possible I am overlooking a very simple solution. Any suggestions would be greatly appreciated.
Thanks.

Using dplyr (note that top_n(3) ranks by the last column and keeps ties, so it would retain all four tied "B" rows here; filtering on row_number() keeps exactly the first three rows per group):
df %>% group_by(v1) %>% filter(row_number() <= 3)
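A quick sanity check on the example df (assuming R >= 4.0, where v1 stays character rather than factor):
df %>%
  group_by(v1) %>%
  filter(row_number() <= 3) %>%
  ungroup() %>%
  pull(v1)
# [1] "A" "A" "B" "B" "B" "C" "D" "D" "E"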

This seems to do it:
index <- vector("numeric", nrow(df))
for (i in 1:nrow(df)) {
  # count how many times this row's v1 value has occurred up to row i
  if (sum(df[1:i, 1] == as.character(df[i, 1])) <= 3) {
    index[i] <- i
  } else {
    cat(i)  # report the dropped row number
  }
}
df[index, ]  # the zero entries left in index are ignored when subsetting
v1 n
1 A 2
2 A 2
3 B 4
4 B 4
5 B 4
7 C 1
8 D 2
9 D 2
10 E 1

We can use data.table
library(data.table)
setDT(df)[, if (.N > 3) head(.SD, 3) else .SD, v1]
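Since head(.SD, 3) already returns every row of a group that has three or fewer rows, the if/else is only an optimization that avoids subsetting groups which need no trimming; a simpler equivalent sketch:
setDT(df)[, head(.SD, 3), by = v1]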

Related

Is there a way to count values by presence per row in R?

I want a way to count values in a data frame based on their presence per row:
a = data.frame(c('a','b','c','d','f'),
               c('a','b','a','b','d'))
colnames(a) = c('let', 'let2')
In this reproducible example, the letter "a" appears in the first row and the third row, for a total of two appearances. I've written the code below to count a value whenever its presence in a row is TRUE, but I want it to do this automatically for every value present in the data frame:
# count the value 'a' and attribute the count to the b data frame
b = data.frame(unique(unlist(a)))
b$count = 0
for (i in 1:nrow(a)) {
  if (TRUE %in% apply(a[i, ], 2, function(x) x %in% 'a')) {
    b$count[1] = b$count[1] + 1
  }
}
b$count[1]
[1] 2
The problem is that I would have to do this manually for every value, and I want a way to do it automatically. Is there a way? The expected output is:
1 a 2
2 b 2
3 c 1
4 d 2
5 f 1
This can be done in base R by taking the unique values of each column separately, unlisting to a vector, and getting the frequency count with table. If needed, convert the table object to a two-column data.frame with stack:
stack(table(unlist(lapply(a, unique))))[2:1]
Output:
# ind values
#1 a 2
#2 b 2
#3 c 1
#4 d 2
#5 f 1
If the count should be based on rows (counting the rows that contain each value, which is what the expected output describes), use apply with MARGIN = 1:
table(unlist(apply(a, 1, unique)))
Or do a group by row to get the unique and count with table
table(unlist(tapply(unlist(a), list(row(a)), unique)))
Or a faster approach with dapply from collapse
library(collapse)
table(unlist(dapply(a, funique, MARGIN = 1)))
Does this work:
library(dplyr)
library(tidyr)
a %>% pivot_longer(cols = everything()) %>% distinct() %>% count(value)
# A tibble: 5 x 2
value n
<chr> <int>
1 a 2
2 b 2
3 c 1
4 d 2
5 f 1
Data used:
a
let let2
1 a a
2 b b
3 c a
4 d b
5 f d

Count of number of elements between distinct elements in vector

Suppose I have a vector of values, such as:
A C A B A C C B B C C A A A B B B B C A
I would like to create a new vector that, for each element, contains the number of elements since that element was last seen. So, for the vector above,
NA NA 2 NA 2 4 1 4 1 3 1 7 1 1 6 1 1 1 8 6
(where NA indicates that this is the first time the element has been seen).
For example, the first and second A are in positions 1 and 3 respectively, a difference of 2; the third and fourth A are in positions 5 and 12, a difference of 7; and so on.
Is there a pre-built pipe-compatible function that does this?
I hacked together this function to demonstrate:
# For reproducibility
set.seed(1)
# Example vector
x = sample(LETTERS[1:3], size = 20, replace = TRUE)
compute_lag_counts = function(x, first_time = NA){
  # return vector to fill
  lag_counts = rep(-1, length(x))
  # values to match
  vals = unique(x)
  # find all positions of all elements in the target vector
  match_list = grr::matches(vals, x, list = TRUE)
  # compute the lags, then put them in the appropriate place in the return vector
  for (i in seq_along(match_list))
    lag_counts[x == vals[i]] = c(first_time, diff(sort(match_list[[i]])))
  # return vector
  return(lag_counts)
}
compute_lag_counts(x)
Although it seems to do what it is supposed to do, I'd rather use someone else's efficient, well-tested solution! My searching has turned up empty, which is surprising to me given that it seems like a common task.
Or, with ave from base R:
ave(seq.int(x), x, FUN = function(x) c(NA, diff(x)))
# [1] NA NA 2 NA 2 4 1 4 1 3 1 7 1 1 6 1 1 1 8 6
We calculate the first difference of the indices for each group of x.
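For instance, for the letter "A" in the example vector (v1 as defined in the data block at the end of these answers), a minimal trace of the idea:
idx <- which(v1 == "A")
idx
# [1]  1  3  5 12 13 14 20
c(NA, diff(idx))
# [1] NA  2  2  7  1  1  6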
A data.table option thanks to #Henrik
library(data.table)
dt = data.table(x)
dt[, d := .I - shift(.I), x]  # .I is the row index; shift() lags it within each x group
dt
Here's a function that would work
compute_lag_counts <- function(x) {
  seqs <- split(seq_along(x), x)
  unsplit(Map(function(i) c(NA, diff(i)), seqs), x)
}
compute_lag_counts(x)
# [1] NA NA 2 NA 2 4 1 4 1 3 1 7 1 1 6 1 1 1 8 6
Basically you use split() to separate the indexes at which each unique value appears. Then we take the difference between successive indexes to get the distance to the previous occurrence. Finally, unsplit() puts those values back in the original order.
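To see the intermediate structure, here is what split() produces for the same v1 (defined in the data block below):
split(seq_along(v1), v1)
# $A
# [1]  1  3  5 12 13 14 20
# $B
# [1]  4  8  9 15 16 17 18
# $C
# [1]  2  6  7 10 11 19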
An option with dplyr, taking the difference of adjacent sequence indices after grouping by the original vector:
library(dplyr)
tibble(v1) %>%
  mutate(ind = row_number()) %>%
  group_by(v1) %>%
  mutate(new = ind - lag(ind)) %>%
  pull(new)
#[1] NA NA 2 NA 2 4 1 4 1 3 1 7 1 1 6 1 1 1 8 6
data
v1 <- c("A", "C", "A", "B", "A", "C", "C", "B", "B", "C", "C", "A",
"A", "A", "B", "B", "B", "B", "C", "A")

Add list of columns above a certain threshold

Say I have a dataframe:
df <- data.frame(rbind(c(10,1,5,4), c(6,0,3,10), c(7,1,10,10)))
colnames(df) <- c("a", "b", "c", "d")
df
   a b  c  d
1 10 1  5  4
2  6 0  3 10
3  7 1 10 10
And a vector of numbers (which correspond to the four column names a,b,c,d)
threshold <- c(7,1,5,8)
I need to compare each row in the data frame to the vector. When the value in the data frame meets or exceeds that in the vector, I need to return the column name. The output would be:
 a  b  c  d  cols
10  1  5  4  a,b,c    # 10>=7, 1>=1, 5>=5
 6  0  3 10  d        # 10>=8
 7  1 10 10  a,b,c,d  # 7>=7, 1>=1, 10>=5, 10>=8
The column cols can be a string that simply lists the columns where the value is exceeded.
Is there any clever way to do this? I'm migrating an old Excel function and I can write a loop or something, but I thought there almost had to be a better way.
You do not need which, and toString produces the desired comma-separated values:
df$cols <- apply(df, 1, function(x) toString(names(df)[x >= threshold]))
df
   a b  c  d       cols
1 10 1  5  4    a, b, c
2  6 0  3 10          d
3  7 1 10 10 a, b, c, d
We can also try
i1 <- which(df >= threshold[col(df)], arr.ind = TRUE)
df$cols <- unname(tapply(names(df)[i1[, 2]], i1[, 1], toString))
df$cols
#[1] "a, b, c" "d" "a, b, c, d"
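For reference, i1 is a two-column matrix of (row, column) positions in column-major order; for the example df (before cols is added) it looks like this:
i1
#      row col
# [1,]   1   1
# [2,]   3   1
# [3,]   1   2
# [4,]   3   2
# [5,]   1   3
# [6,]   3   3
# [7,]   2   4
# [8,]   3   4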
You can try this, using which:
df$cols <- apply(df, 1, function(x) toString(names(df)[which(x >= threshold)]))

Filtering an R DataFrame with repeated values in a column

I have an R DataFrame and I want to make another DF from it, containing only the values that appear more than X times in a particular column.
>DataFrame
Value Column
1 a
4 a
2 b
6 c
3 c
4 c
9 a
1 d
For example, I want a new DataFrame containing only the values in Column that appear more than 2 times, to get something like this:
>NewDataFrame
Value Column
1 a
4 a
6 c
3 c
4 c
9 a
Thank you very much for your time.
We can use table to get the count of each value in 'Column', then subset the dataset based on the names in 'tbl' that have a count greater than 'n':
n <- 2
tbl <- table(DataFrame$Column) > n
NewDataFrame <- subset(DataFrame, Column %in% names(tbl)[tbl])
# Value Column
#1 1 a
#2 4 a
#4 6 c
#5 3 c
#6 4 c
#7 9 a
Or using ave from base R
NewDataFrame <- DataFrame[with(DataFrame, ave(seq_along(Column), Column, FUN = length) > n), ]  # seq_along keeps the counts numeric
Or using data.table
library(data.table)
NewDataFrame <- setDT(DataFrame)[, .SD[.N > n], by = Column]
Or
NewDataFrame <- setDT(DataFrame)[, if(.N > n) .SD, by = Column]
Or dplyr
NewDataFrame <- DataFrame %>%
  group_by(Column) %>%
  filter(n() > 2)
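Note that n() here is dplyr's group-size helper, unrelated to the threshold variable n defined earlier; the literal 2 hard-codes the threshold. To keep it parameterized, one option (a sketch, assuming dplyr >= 1.0 for the .env pronoun) is:
library(dplyr)
NewDataFrame <- DataFrame %>%
  group_by(Column) %>%
  filter(n() > .env$n) %>%   # .env$n reads n from the calling environment, not the data
  ungroup()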

Find the indices of the rows in a data frame that contain one element of a string vector

If I have a data.frame like this
df <- data.frame(col1 = c(letters[1:4],"a"),col2 = 1:5,col3 = letters[10:14])
df
col1 col2 col3
1 a 1 j
2 b 2 k
3 c 3 l
4 d 4 m
5 a 5 n
I want to get the row indices that contains one of the element in c("a", "k", "n"); in this example, the result should be 1, 2, 5.
If you have a large data frame and you wish to check all columns, try this
x <- c("a", "k", "n")
Reduce(union, lapply(x, function(a) which(rowSums(df == a) > 0)))
# [1] 1 5 2
and of course you can sort the end result.
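For instance, wrapping the whole expression in sort():
sort(Reduce(union, lapply(x, function(a) which(rowSums(df == a) > 0))))
# [1] 1 2 5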
s <- c('a','k','n');
which(df$col1 %in% s | df$col3 %in% s);
## [1] 1 2 5
Here's another solution. This one works on the entire data.frame, and happens to capture the search strings as element names (you can get rid of those via unname()):
sapply(s, function(s) which(apply(df == s, 1, any))[1]);
## a k n
## 1 2 5
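And with unname(), as suggested:
unname(sapply(s, function(s) which(apply(df == s, 1, any))[1]));
## [1] 1 2 5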
Original second solution:
sort(unique(rep(1:nrow(df), ncol(df))[as.matrix(df) %in% s]));
## [1] 1 2 5
