I want to do something similar to this thread: Subset multiple columns in R - more elegant code?
I have data that looks like this:
df=data.frame(x=1:4,Col1=c("A","A","C","B"),Col2=c("A","B","B","A"),Col3=c("A","C","C","A"))
criteria="A"
What I want to do is subset the data to the rows where the criteria is met in at least two columns, that is, where the string in at least two of the three columns is A. In the case above, the subset would be the first and last rows of the data frame df.
You can use rowSums:
df[rowSums(df[-1] == criteria) >= 2, ]
# x Col1 Col2 Col3
#1 1 A A A
#4 4 B A A
If criteria has length > 1, you cannot use == directly, because it recycles criteria element-wise instead of testing membership. In that case, use sapply with %in%:
df[rowSums(sapply(df[-1], `%in%`, criteria)) >= 2, ]
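To make the multi-value case concrete, here is a minimal sketch using the data from the question. The two-value criteria vector is a hypothetical example and is not part of the original question.

```r
# Data from the question
df <- data.frame(x = 1:4,
                 Col1 = c("A", "A", "C", "B"),
                 Col2 = c("A", "B", "B", "A"),
                 Col3 = c("A", "C", "C", "A"))

# Hypothetical two-value criteria (an assumption for illustration)
criteria <- c("A", "B")

# One logical column per data column: TRUE where the cell is in criteria
hits <- sapply(df[-1], `%in%`, criteria)

# Keep the rows with at least two matching cells
res <- df[rowSums(hits) >= 2, ]
res
```

With this data, rows 1, 2 and 4 should be kept, since row 3 (C, B, C) has only one matching cell.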
In dplyr you can use filter with rowwise:
library(dplyr)
df %>%
rowwise() %>%
filter(sum(c_across(starts_with('col')) %in% criteria) >= 2)
We can use subset with apply
subset(df, apply(df[-1] == criteria, 1, sum) >1)
# x Col1 Col2 Col3
#1 1 A A A
#4 4 B A A
This is a follow-up question to this one. I would like to check whether any column in a data frame contains the same value (numerical or string) in all rows. For example,
sample <- data.frame(col1=c(1, 1, 1), col2=c("a", "a", "a"), col3=c(12, 15, 22))
The purpose is to inspect each column in a data frame to see which columns do not have identical entries in all rows. How can I do this? Note that the columns contain both numbers and strings.
My expected output would be a vector containing the numbers of the columns that have non-identical entries.
We can use apply column-wise (MARGIN = 2) to count the unique values in each column and select the columns whose number of unique values is not equal to 1.
which(apply(sample, 2, function(x) length(unique(x))) != 1)
#col3
# 3
The same can be done with an sapply or lapply call:
which(sapply(sample, function(x) length(unique(x))) != 1)
#col3
# 3
A dplyr version could be
library(dplyr)
sample %>%
# funs() is deprecated since dplyr 0.8.0; across() and where() need dplyr >= 1.0.0
summarise(across(everything(), n_distinct)) %>%
select(where(~ .x != 1))
# col3
#1 3
We can use Filter
names(Filter(function(x) length(unique(x)) != 1, sample))
#[1] "col3"
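As a small sketch of the counting approach in base R: apply() first converts the data frame to a matrix, so with mixed column types it may be safer to iterate over the columns directly. The name sample_df is a hypothetical rename used here to avoid masking base::sample.

```r
# Data from the question (renamed to avoid masking base::sample)
sample_df <- data.frame(col1 = c(1, 1, 1),
                        col2 = c("a", "a", "a"),
                        col3 = c(12, 15, 22))

# Number of distinct values per column; vapply guarantees an integer vector
n_unique <- vapply(sample_df, function(x) length(unique(x)), integer(1))

# Named positions of the columns that are not constant
non_constant <- which(n_unique != 1)

# Drop the names if only the column numbers are needed
unname(non_constant)
```

Because the data frame is never coerced to a matrix, each column is compared in its own type.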
I'm looking to subset rows by the value of the next row for one column.
df <- data.frame(t = c(1,2,3,4,5,6,7,8),
b = c(1,2,1,0,1,0,1,2))
So I want to subset df and get the rows where b == 1 that are immediately followed by a row where b == 2. The subset should therefore return 2 rows (where t = 1 and t = 7).
I tried using which and lag from dplyr, as mentioned in other answers, but I couldn't get that to work.
We can get the next value with lead, then keep the rows where the current value is 1 and the next value is 2 by using both conditions in filter:
library(dplyr)
df %>%
filter(b == 1, lead(b)==2)
# t b
#1 1 1
#2 7 1
Or use subset from base R
subset(df, c(b[-1] == 2, FALSE) & b == 1)
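To show what the "next value" comparison does step by step, here is a minimal base R sketch. The helper name next_b is hypothetical; padding the shifted vector with NA is an assumption about how the last row should be treated, since it has no following row.

```r
# Data from the question
df <- data.frame(t = c(1, 2, 3, 4, 5, 6, 7, 8),
                 b = c(1, 2, 1, 0, 1, 0, 1, 2))

# Shift b up by one position; the last row has no successor, so pad with NA
next_b <- c(df$b[-1], NA)

# which() drops any NA comparison, so the last row can never be selected
res <- df[which(df$b == 1 & next_b == 2), ]
res
```

With this data the rows with t = 1 and t = 7 are kept, matching the dplyr result above.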
Consider the following reproducible data frame:
col1 <- c(rep("a", times = 5), rep("b", times = 5), rep("c", times = 5))
col2 <- c(0,0,1,1,0,0,1,1,1,0,0,0,0,0,1)
data <- as.data.frame(cbind(col1, col2))
Now data is a data frame with 15 rows and 2 columns. I want to count how many zeros there are in col2, but only for the rows where col1 is a. I use table():
table <- table(data$col2[data$col1=="a"])
table[names(table)==0]
This works just fine and the result is 3.
But my real data has 100,000 observations with 12 different values in col1, so I want to write a function so that I don't have to type the above lines of code 12 times.
countzero <- function(row){
table <- table(data$col2[data$col1=="row"])
result <- table[names(table)==0]
return(result)
}
I expected countzero(row = a) to return 3 as well, but instead it returns 0, and also 0 for b and c.
For my real data, it returns
numeric(0)
and I have no idea why.
Could anyone help me out, please?
EDIT: To all the answers showing me how to count in total how many zeros for each value of col1, it works all fine, but my purpose is to build a function that returns only the count of one specific col1 value, e.g. just the a's, because that count will be used later to compute other stuff (the percent of 0's in all a's, e.g.)
1) aggregate Try aggregate:
aggregate(col2 == 0 ~ col1, data, sum)
giving:
col1 col2 == 0
1 a 3
2 b 2
3 c 4
2) table or try table (omit the [,1] if you want the counts of 1's too):
table(data)[, 1]
giving:
a b c
3 2 4
We can use data.table which would be efficient
library(data.table)
setDT(data)[col2==0, .N, col1]
# col1 N
#1: a 3
#2: b 2
#3: c 4
Or with dplyr
library(dplyr)
data %>%
filter(col2==0) %>%
count(col1)
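For the single-value function requested in the EDIT, here is a minimal sketch. Note that the function in the question compares col1 against the literal string "row" rather than the value of its argument. The names count_zero and value are hypothetical, and the data frame is built directly with data.frame() so that col2 stays numeric.

```r
# Data from the question, built without cbind() so col2 stays numeric
col1 <- c(rep("a", times = 5), rep("b", times = 5), rep("c", times = 5))
col2 <- c(0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1)
data <- data.frame(col1 = col1, col2 = col2)

# Count the zeros in col2 for one value of col1.
# `value` is compared as a variable, not as the quoted string "value".
count_zero <- function(data, value) {
  sum(data$col2[data$col1 == value] == 0)
}

count_zero(data, "a")

# Share of zeros among the a rows, as mentioned in the EDIT
share_zero_a <- mean(data$col2[data$col1 == "a"] == 0)
```

The argument has to be passed as a string, for example count_zero(data, "a"), because a bare a would be looked up as a variable.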
I have a data frame, df2, containing observations grouped by an ID factor, that I would like to subset. I have used another function to identify which rows within each factor group I want to select. This is shown below in df:
df <- data.frame(ID = c("A","B","C"),
pos = c(1,3,2))
df2 <- data.frame(ID = c(rep("A",5), rep("B",5), rep("C",5)),
obs = c(1:15))
In df, pos corresponds to the index of the row that I want to select within the factor level given in ID, not within the whole data frame df2. I'm looking for a way to select the rows for each ID according to the right index (that is, their row number within each factor level of df2).
So, in this example, I want to select the first value in df2 with ID == 'A', the third value in df2 with ID == 'B' and the second value in df2 with ID == 'C'.
This would then give me:
df3 <- data.frame(ID = c("A", "B", "C"),
obs = c(1, 8, 12))
dplyr
library(dplyr)
merge(df,df2) %>%
group_by(ID) %>%
filter(row_number() == pos) %>%
select(-pos)
# ID obs
# 1 A 1
# 2 B 8
# 3 C 12
base R
df2m <- merge(df,df2)
do.call(rbind,
by(df2m, df2m$ID, function(SD) SD[SD$pos[1], setdiff(names(SD),"pos")])
)
by splits the merged data frame df2m by df2m$ID and operates on each part; it returns the results in a list, so they must be combined with rbind at the end. Each subset of the data (one per value of ID) is filtered by pos, and the "pos" column is dropped using normal data.frame syntax.
data.table suggested by @DavidArenburg in a comment
library(data.table)
setkey(setDT(df2),"ID")[df][,
.SD[pos[1L], !"pos", with=FALSE]
, by = ID]
The first part -- setkey(setDT(df2),"ID")[df] -- is the merge. After that, the resulting table is split by = ID, and each Subset of Data, .SD is operated on. pos[1L] is subsetting in the normal way, while !"pos", with=FALSE corresponds to dropping the pos column.
See @eddi's answer for a better data.table approach.
Here's the base R solution:
df2$pos <- ave(df2$obs, df2$ID, FUN=seq_along)
merge(df, df2)
ID pos obs
1 A 1 1
2 B 3 8
3 C 2 12
If df2 is sorted by ID, you can just do df2$pos <- sequence(table(df2$ID)) for the first line.
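As a sketch of what the ave step produces, the following runs the two lines above on the data from the question and inspects the intermediate within-group index. The name picked is hypothetical.

```r
# Data from the question
df  <- data.frame(ID = c("A", "B", "C"), pos = c(1, 3, 2))
df2 <- data.frame(ID = c(rep("A", 5), rep("B", 5), rep("C", 5)),
                  obs = c(1:15))

# Row number of each observation within its ID group (1..5 for every ID here)
df2$pos <- ave(df2$obs, df2$ID, FUN = seq_along)

# merge() joins on the shared columns ID and pos, keeping only the wanted rows
picked <- merge(df, df2)
picked
```

Since both data frames share the ID and pos columns, merge uses both as join keys, which is what restricts the result to one row per ID.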
Using data.table version 1.9.5+:
setDT(df2)[df, .SD[pos], by = .EACHI, on = 'ID']
which merges on ID column, then selects the pos row for each of the rows of df.
I have a data frame like this:
df <- data.frame(col1 = c(letters[1:4],"a"),col2 = 1:5,col3 = letters[10:14])
df
col1 col2 col3
1 a 1 j
2 b 2 k
3 c 3 l
4 d 4 m
5 a 5 n
I want to find the index of the column of df that contains values matching the string "a".
That is, it should give me 1 as the result.
I tried using which in sapply, but it's not working.
Does anybody know how to do it without a loop?
Something like this?
which(apply(df, 2, function(x) any(grepl("a", x))))
The steps are:
With apply go over each column
Search if a is in this column with grepl
Since we get a vector back, use any to get TRUE if any element has been matched to a
Finally check which elements (columns) are TRUE (i.e. contain the searched letter a).
Since you mention you were trying to use sapply() but were unsuccessful, here's how you can do it:
> sapply(df, function(x) any(x == "a"))
col1 col2 col3
TRUE FALSE FALSE
> which(sapply(df, function(x) any(x == "a")))
col1
1
Of course, you can also use the grep()/grepl() approach if you prefer string matching. You can also wrap your which() function with unname() if you want just the column number.
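The two approaches can give different answers, because == tests for an exact match while grepl("a", x) matches any value containing the letter a. A minimal sketch follows, using a hypothetical variant of the data in which col3 contains "ja" (an assumption made only to show the difference):

```r
# Hypothetical variant of the question's data: col3 now contains "ja"
df <- data.frame(col1 = c(letters[1:4], "a"),
                 col2 = 1:5,
                 col3 = c("ja", "k", "l", "m", "n"))

# Exact match: only columns with a value equal to "a"
exact <- which(sapply(df, function(x) any(x == "a")))

# Pattern match: columns with any value that contains "a"
partial <- which(sapply(df, function(x) any(grepl("a", x))))

# unname() drops the column names and leaves plain column numbers
unname(exact)
unname(partial)
```

Here the exact comparison finds only col1, while the pattern match also picks up col3.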