Removing Rows Based on Column Value in R [duplicate]

This question already has answers here:
Filtering a data frame
(3 answers)
Closed 6 years ago.
I have a simple table in R named Tag_Count:
Tag1     freq
Cookies     1
Cakes       2
Burritos    5
I want to remove all rows where the freq value is less than 3. I tried:
Tag_Count_2 <- Tag_Count[Tag_Count$freq <= 3,]
Tag_Count_2 <- Tag_Count[freq < 4]
But neither worked.

We can try
Tag_Count[!(Tag_Count$freq < 3), ]
If this is a matrix rather than a data.frame, then
Tag_Count[!(Tag_Count[, "freq"] < 3), ]
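For a runnable check, a minimal sketch assuming the table above is stored as a data.frame named Tag_Count:
Tag_Count <- data.frame(Tag1 = c("Cookies", "Cakes", "Burritos"),
                        freq = c(1, 2, 5))
# keep only the rows where freq is at least 3
Tag_Count_2 <- Tag_Count[!(Tag_Count$freq < 3), ]
Tag_Count_2
#       Tag1 freq
# 3 Burritos    5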

You can try this
library(dplyr)
df1 <- df %>%
  filter(freq >= 3)
print(df1)
#       Tag1 freq
# 1 Burritos    5
data
df <- data.frame(Tag1 = c("Cookies", "Cakes", "Burritos"), freq = c(1, 2, 5), stringsAsFactors = FALSE)

Related

How to subset a large dataframe using many conditions for many variables in a simple way? [duplicate]

This question already has answers here:
How to combine multiple conditions to subset a data-frame using "OR"?
(5 answers)
Closed 4 months ago.
I have this dataframe (but let's imagine it with many columns/variables)
df = data.frame(x = c(0, 0, 0, 1, 0),
                y = c(1, 1, 1, 0, 0),
                z = c(1, 1, 0, 0, 1))
I want to subset this dataset based on the condition that (x = 1) and (y = 0 or z = 0 or etc.).
I am already familiar with the basic approach that works for small datasets, but I want a function that scales to datasets with many more columns. Thanks
You can make use of Reduce(). The + function basically works as an OR operator here, since its result is > 0 if any of the summed logical values is TRUE.
Correspondingly, * would work as an AND, since the product is > 0 only if all values are TRUE.
df = data.frame(x = c(0, 0, 0, 1, 0),
                y = c(1, 1, 1, 0, 0),
                z = c(1, 1, 0, 0, 1))
nms <- names(df)
# take all variables except for `x`
nms_rel <- setdiff(nms, "x")
nms_rel
#> [1] "y" "z"
# filter all rows in which `x` is 1 AND any other variable is 0
df[df$x == 1 & Reduce(`+`, lapply(df[nms_rel], `==`, 0)) > 0, ]
#> x y z
#> 4 1 0 0
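As a complement, a minimal sketch of the AND variant mentioned above (x is 1 and every other variable is 0), swapping + for *:
# filter all rows in which x is 1 AND all other variables are 0
df[df$x == 1 & Reduce(`*`, lapply(df[nms_rel], `==`, 0)) > 0, ]
#>   x y z
#> 4 1 0 0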
In base R you can filter a dataframe like this:
subset(df, x == 1 & (y == 0 | z == 0))
Another option is to use filter from the dplyr package.
library(dplyr)
filter(df, x == 1, y == 0 | z == 0)
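Since the question is about many variables, a hedged sketch using if_any() (available in newer dplyr versions, >= 1.0.4) to express "any column other than x is 0" without listing every column by hand:
library(dplyr)
df %>%
  filter(x == 1, if_any(-x, ~ .x == 0))
#   x y z
# 1 1 0 0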

filtering or subsetting a dataframe does not include all values [duplicate]

This question already has answers here:
Select rows from a data frame based on values in a vector
(3 answers)
Closed 1 year ago.
I have a dataframe and I'm trying to subset it based on the column ID, but because the ID values are repeated, not all values are included in the output.
Example:
values <- sample(1:100, 2520, replace=TRUE)
ID <- rep(c(1:21), times = 120) #21 unique IDs, each repeated 120 times
df <- data.frame(values, ID)
df_sub <- df %>% dplyr::filter(ID == c(1,2,5,7,9))
It's subsetting by ID correctly, but I am only getting 24 rows for each ID and not the 120 I am expecting.
length(df_sub$ID) = 120 and should be 600.
We can use %in% instead of ==, because == is an elementwise operator: it only works when the right-hand side is a single element or when the lengths on the lhs and rhs of == are the same; otherwise the shorter vector is recycled, which is what silently drops rows here.
library(dplyr)
df %>%
  dplyr::filter(ID %in% c(1, 2, 5, 7, 9))
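To double-check the result, a quick sketch that tallies the subset with count() (assuming the df generated above):
df %>%
  dplyr::filter(ID %in% c(1, 2, 5, 7, 9)) %>%
  dplyr::count(ID)
# each of the five IDs should now appear with n = 120, i.e. 600 rows in total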

how to filter data by the number of unique values in R [duplicate]

This question already has answers here:
drop columns that take less than n values?
(2 answers)
Closed 3 years ago.
I have some data that I would like to investigate and would like to pull out
all features which have a certain number of unique values, whether that's 2,
5, 10, etc.
I'm not sure how to go about doing this though.
For example :
tst = data.frame(
  a = c(1, 1, 1, 0, 0),
  b = c(1, 2, 3, 3, 3),
  c = c(1, 2, 3, 4, 4),
  d = c(1, 2, 3, 4, 5)
)
tst
tst %>%
  filter(<variables with x unique values>)
Where x = 2 would just filter to column a, x = 3 to column b, etc.
You can use select_if with the n_distinct function.
tst %>%
  select_if(~ n_distinct(.) == 2)
#   a
# 1 1
# 2 1
# 3 1
# 4 0
# 5 0
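As a side note, in newer dplyr versions (>= 1.0.0) select_if() is superseded, and the same selection can be written with select() plus where():
tst %>%
  select(where(~ n_distinct(.) == 2))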
Here is one way in base R:
x <- 2
# apply() with MARGIN = 2 goes over columns, so the function argument is a column, not a row
tst[, apply(tst, 2, function(col) length(unique(col))) == x, drop = FALSE]
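An equivalent sketch with sapply(), which iterates over the columns directly and avoids apply() first coercing the data.frame to a matrix:
tst[, sapply(tst, function(col) length(unique(col))) == x, drop = FALSE]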
This example code creates a new variable that is the combination of a, b, c, and d, then identifies which combinations are duplicated, and finally returns only the combinations that are not duplicates. I hope this is what you were asking for...
library(dplyr)
library(tidyr)
tst = data.frame(
  a = c(1, 1, 1, 0, 0),
  b = c(1, 2, 3, 3, 3),
  c = c(1, 2, 3, 4, 4),
  d = c(1, 2, 3, 4, 5)
)
tst %>%
  unite(new, a, b, c, d, sep = "") %>%
  mutate(duplicate = duplicated(new)) %>%
  filter(!duplicate)

How to count with condition how many zeros in a data frame using just one function() in R? [duplicate]

This question already has answers here:
Grouping functions (tapply, by, aggregate) and the *apply family
(10 answers)
Closed 5 years ago.
Consider the following replicable data frame:
col1 <- c(rep("a", times = 5), rep("b", times = 5), rep("c", times = 5))
col2 <- c(0,0,1,1,0,0,1,1,1,0,0,0,0,0,1)
data <- as.data.frame(cbind(col1, col2))
Now data is a 15x2 data frame. I want to count how many zeros there are, but only for the rows where col1 is "a". I use table():
table <- table(data$col2[data$col1=="a"])
table[names(table)==0]
This works just fine and the result is 3.
But my real data has 100,000 observations with 12 different values of col1, so I want to make a function so that I don't have to type the above lines of code 12 times.
countzero <- function(row){
  table <- table(data$col2[data$col1=="row"])
  result <- table[names(table)==0]
  return(result)
}
I expected that when I run countzero(row = a) it would return 3 as well, but instead it returns 0, and also 0 for b and c.
For my real data, it returns
numeric(0)
which I have no idea why.
Anyone could help me out please?
EDIT: To all the answers showing me how to count the total number of zeros for each value of col1: they all work fine, but my purpose is to build a function that returns only the count for one specific col1 value, e.g. just the a's, because that count will be used later to compute other things (the percentage of 0's among all a's, for example).
1) aggregate Try aggregate:
aggregate(col2 == 0 ~ col1, data, sum)
giving:
  col1 col2 == 0
1    a         3
2    b         2
3    c         4
2) table Or try table (omit the [, 1] if you want the counts of 1's too):
table(data)[, 1]
giving:
a b c
3 2 4
We can use data.table, which would be efficient:
library(data.table)
setDT(data)[col2==0, .N, col1]
# col1 N
#1: a 3
#2: b 2
#3: c 4
Or with dplyr
library(dplyr)
data %>%
  filter(col2 == 0) %>%
  count(col1)
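Addressing the EDIT (a single count for one specific col1 value), here is a minimal sketch of the question's function with its quoting bug fixed: the argument has to be used as a variable, not hard-coded as the string "row". The argument name value is an assumption for illustration.
countzero <- function(value) {
  # col2 == 0 is TRUE for the zeros; sum() counts them for the chosen col1 value
  sum(data$col2[data$col1 == value] == 0)
}
countzero("a")
# [1] 3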

R: show ALL rows with duplicated elements in a column [duplicate]

This question already has answers here:
Fastest way to remove all duplicates in R
(3 answers)
Closed 6 years ago.
Does a function like this exist in any package?
isdup <- function(x) duplicated(x) | duplicated(x, fromLast = TRUE)
My intention is to use it with dplyr to display all rows with duplicated values in a given column. I need the first occurrence of the duplicated element to be shown as well.
In this data.frame for instance
dat <- as.data.frame(list(l = c("A", "A", "B", "C"), n = 1:4))
dat
#   l n
# 1 A 1
# 2 A 2
# 3 B 3
# 4 C 4
I would like to display the rows where column l is duplicated, i.e. those with an A value, by doing:
library(dplyr)
dat %>% filter(isdup(l))
returns
  l n
1 A 1
2 A 2
dat %>% group_by(l) %>% filter(n() > 1)
I don't know if it exists in any package, but since you can implement it easily, I'd say just go ahead and implement it yourself.
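For completeness, a quick sketch running the isdup() helper from the question against the example data, which gives the output asked for above:
isdup <- function(x) duplicated(x) | duplicated(x, fromLast = TRUE)
dat %>% filter(isdup(l))
#   l n
# 1 A 1
# 2 A 2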
