R find value in multiple data frame columns


Given a data set where a value could be in any of a set of columns from the dataframe:
df <- data.frame(h1=c('a', 'b', 'c', 'a', 'a', 'b', 'c'), h2=c('b', 'c', 'd', 'b', 'c', 'd', 'b'), h3=c('c', 'd', 'e', 'e', 'e', 'd', 'c'))
How can I get a logical vector that specifies which rows contain the target value? In this case, searching for 'b', I'd want a logical vector with rows (1,2,4,6,7) as TRUE.
The real data set is much larger and more complicated, so I'm trying to avoid a for loop.
Thanks.
EDIT:
This seems to work.
>apply(df, 1, function(x) {'b' %in% as.vector(t(x))}) -> i
> i
[1] TRUE TRUE FALSE TRUE FALSE TRUE TRUE

If speed is a concern I would go with:
rowSums(df == "b") > 0

apply(df, 1, function(r) any(r == "b"))
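As a quick sanity check, here is a small sketch that runs both suggestions on the question's data and confirms they agree. The names by_rowsums and by_apply are mine, not from the answers, and the NA handling mentioned afterwards is an assumption about how you would want missing cells treated.

```r
# Data from the question
df <- data.frame(h1 = c('a', 'b', 'c', 'a', 'a', 'b', 'c'),
                 h2 = c('b', 'c', 'd', 'b', 'c', 'd', 'b'),
                 h3 = c('c', 'd', 'e', 'e', 'e', 'd', 'c'))

# Vectorised: compare every cell once, then count hits per row
by_rowsums <- rowSums(df == "b") > 0

# Row-wise: call a function once per row (slower on large data)
by_apply <- apply(df, 1, function(r) any(r == "b"))

# Row numbers that contain 'b'
unname(which(by_rowsums))

# Both approaches give the same logical vector
all(unname(by_rowsums) == unname(by_apply))
```

If the columns can contain NA, rowSums(df == "b", na.rm = TRUE) > 0 treats a missing cell as a non-match instead of letting it turn the row's result into NA.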

I'd rather wrap it into a small helper function that returns the matching rows themselves and performs a case-insensitive search across all columns:
library(dplyr)
library(stringr)
search_df = function(df, search_term){
apply(df, 1, function(r){
any(str_detect(as.character(r), fixed(search_term, ignore_case=TRUE)))
}) %>% subset(df, .)
}
search_df(iris, "Setosa")
To keep it more generic this can also be rewritten to expose the matching expression/rule as a function argument:
match_df = function(df, search_expr){
filter_fun = eval(substitute(function(x){search_expr}))
apply(df, 1, function(r) any(filter_fun(r))) %>% subset(df, .)
}
match_df(iris, str_detect(x, "setosa"))
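If you would rather not depend on dplyr and stringr, roughly the same case-insensitive search can be sketched in base R by working column by column instead of row by row. The function name search_df_base is hypothetical, and lowercasing both sides is my stand-in for stringr's ignore_case option, not the original author's method.

```r
# Hypothetical base-R variant of the helper above
search_df_base <- function(df, search_term) {
  needle <- tolower(search_term)
  # One logical vector per column: does the cell contain the term?
  hit_per_col <- lapply(df, function(col) {
    grepl(needle, tolower(as.character(col)), fixed = TRUE)
  })
  # A row is kept if any of its columns matched
  keep <- Reduce(`|`, hit_per_col)
  df[keep, , drop = FALSE]
}

res <- search_df_base(iris, "Setosa")
nrow(res)
unique(as.character(res$Species))
```

Working per column keeps each grepl call vectorised over all rows, which tends to scale better than apply(df, 1, ...) on wide or long data.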

Related

How do I loop over a list of strings in R?

I want to create a list that contains vectors that display either a 1 or a 0 depending on whether a certain character is present.
The data has the following form:
list <- list('a', 'b', 'c', 'd', 'e')
col1 <- c('ta', 'ta', 'tb', 'tb', 'tb', 'tc', 'td', 'te')
What I want is a list for all elements of 'list' that contains a vector displaying a 1 when the element of list is present in the element of col1 and a 0 when it is not.
For 'a' this vector will look like (1,1,0,0,0,0,0,0) for example.
My question is why the following loop does not work:
test <- list()
for (i in list){
test[[i]] <- ifelse(grepl(list[i], col1), 1, 0)
}
This loop returns a list with only zeroes.
However, when I run part of the loop individually it does give the correct result:
ifelse(grepl(list[1], col1), 1, 0)
This does in fact return the vector I want: (1,1,0,0,0,0,0,0).
How do I loop over a list of strings in R correctly?

You can do this without a loop, in a single line:
List <- list('a', 'b', 'c', 'd', 'e')
col1 <- c('ta', 'ta', 'tb', 'tb', 'tb', 'tc', 'td', 'te')
lapply(List, FUN = function(x) as.numeric(grepl(x, col1)))

List <- list('a', 'b', 'c', 'd', 'e')
col1 <- c('ta', 'ta', 'tb', 'tb', 'tb', 'tc', 'td', 'te')
test <- list()
for (i in seq_along(List)){  # loop over positions, not over the elements themselves
test[[i]] <- ifelse(grepl(List[[i]], col1), 1, 0)  # [[i]] extracts the string itself
}
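Neither answer spells out why the original loop returns only zeroes. As far as I can tell, for (i in list) hands you the elements ('a', 'b', ...) rather than positions, so list[i] becomes a lookup by name on an unnamed list and yields an empty entry, which grepl then coerces to the pattern "NULL". A sketch of that diagnosis and a fix, with lst and s as my own names:

```r
lst  <- list('a', 'b', 'c', 'd', 'e')
col1 <- c('ta', 'ta', 'tb', 'tb', 'tb', 'tc', 'td', 'te')

# Indexing an unnamed list by the string 'a' finds nothing;
# coercing the result to character gives the literal text "NULL"
as.character(lst['a'])

# Looping over the elements works if the element itself is the pattern
test <- list()
for (s in lst) {
  test[[s]] <- as.numeric(grepl(s, col1, fixed = TRUE))
}
test$a
```

Here the element doubles as the list name, so the result can be read back as test$a, test$b and so on.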

Frequent Sequential Patterns

What would be the best way to get the sequential pattern for such data in R?
The idea is to get the frequency of letters in processes 1, 2, and 3. Is there a GSP function that can do that? Any insight or tutorial is appreciated.

You can use an apply and table combo (provided you read your data into R):
dat <- data.frame(process1 = c('A', 'B', 'A', 'A', 'C'), process2 = c('B', 'C', 'B', 'B', 'A'), process3 = c('C', 'C', 'A', 'B', 'B'))
apply(dat, 2, table)
# process1 process2 process3
#A 3 1 1
#B 1 3 2
#C 1 1 2
apply iterates through the columns of dat (this is what argument 2 refers to) and applies table to each, which counts each unique element. See the help pages for the *apply family of functions for more info.
d.b's solution above, lapply(dat, table), does the same thing but returns a list rather than a matrix.
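One caveat worth sketching: apply(dat, 2, table) only comes back as a tidy matrix when every column contains the same set of letters; otherwise the per-column tables have different lengths and you get a list. Assuming you want a matrix regardless, you can fix the levels up front. The names lv and counts are mine.

```r
dat <- data.frame(process1 = c('A', 'B', 'A', 'A', 'C'),
                  process2 = c('B', 'C', 'B', 'B', 'A'),
                  process3 = c('C', 'C', 'A', 'B', 'B'))

# All letters that occur anywhere in the data
lv <- sort(unique(unlist(lapply(dat, as.character))))

# Tabulate each column against the same levels, so absent letters count as 0
counts <- sapply(dat, function(col) table(factor(col, levels = lv)))
counts
```

This only counts how often each letter appears per process; it does not mine ordered patterns across processes the way a GSP-style algorithm would.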

Detect discrepancies between two sequences

I have two time series vectors: complete_data and incomplete_data. The data in each vector consists of 6 possible events, which occur randomly throughout the vector. In principle the two should be the same, because every event in complete_data was then also added to incomplete_data. In reality, however, there were some anomalies in the system and not all of the events in complete_data were sent to incomplete_data, so complete_data is longer than incomplete_data. I need to find the differences in the pattern between the two and mark them. I made an attempt, but it assumes that the discrepancy between the two vectors occurs in a single chunk, whereas in reality there are various "missing events" scattered throughout incomplete_data.
Here is my attempt:
complete_data <- c('a', 'b', 'c', 'a', 'b', 'c', 'a', 'b', 'c', 'a', 'b', 'c')
dfcomplete <- as.data.frame(complete_data)
incomplete_data <- c('a', 'b', 'c', 'a','c', 'a', 'b', 'a', 'b', 'c')
dfincomplete <- as.data.frame(incomplete_data)
findMatch <- function(complete_data, incomplete_data){
matching_inorder <- NULL
matching_reverseorder <- NULL
for (i in 1:length(complete_data)){
matching_inorder[i] <- complete_data[i] == incomplete_data[i]
matching_reverseorder[i] <- rev(complete_data)[i] == rev(incomplete_data)[i]
}
is_match <- ifelse(matching_inorder == FALSE &
rev(matching_reverseorder) == FALSE, 'non_match', 'match')
is_match
}
dfcomplete$is_match_incorrect <- findMatch(dfcomplete$complete_data,
dfincomplete$incomplete_data)
And here is what I would like to get:
dfcomplete$expected_output <- c('match', 'match', 'match', 'match', 'non_match', 'match',
'match', 'match', 'non_match', 'match', 'match', 'match')
In reality my data is much larger than these examples, with many different discrepancies scattered throughout the vector. The discrepancies are not so numerous that the task becomes meaningless, though: in one case, for example, the complete vector has 320 data points whilst the incomplete vector has 309.
Any help that can be offered would be much appreciated.

There are various ways to do this, but here's a recursive one, where x is assumed to be a complete sequence and y incomplete.
compare <- function(x, y) {
if (length(x) > 0) {
if (x[1] == y[1]) {
x[1] <- "match"
c(x[1], compare(x[-1], y[-1]))
} else {
x[1] <- "no match"
c(x[1], compare(x[-1], y))
}
}
}
compare(complete_data, incomplete_data)
# [1] "match" "match" "match" "match" "no match" "match"
# [7] "match" "match" "no match" "match" "match" "match"
Another one that is perhaps more readable and uses a simple loop would be:
out <- rep(NA, length(complete_data))  # one result per element of the complete sequence
gap <- 0
for(i in seq_along(complete_data)) {
if (complete_data[i] == incomplete_data[i - gap]) {
out[i] <- "match"
} else {
out[i] <- "no match"
gap <- gap + 1
}
}
out
# [1] "match" "match" "match" "match" "no match" "match"
# [7] "match" "match" "no match" "match" "match" "match"
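Both versions above compare against an element of the incomplete vector even once it has been used up, so they can fail when the missing events sit at the very end. A minimal sketch of the same greedy idea with an explicit bounds check follows; mark_missing and the index j are my own names, and it assumes the incomplete sequence is the complete one with some events dropped.

```r
# Hypothetical helper: walk the complete sequence, advancing a pointer
# into the incomplete one only when the current events agree
mark_missing <- function(complete, incomplete) {
  out <- character(length(complete))
  j <- 1L  # position in the incomplete sequence
  for (i in seq_along(complete)) {
    if (j <= length(incomplete) && complete[i] == incomplete[j]) {
      out[i] <- "match"
      j <- j + 1L
    } else {
      out[i] <- "no match"
    }
  }
  out
}

complete_data   <- c('a', 'b', 'c', 'a', 'b', 'c', 'a', 'b', 'c', 'a', 'b', 'c')
incomplete_data <- c('a', 'b', 'c', 'a', 'c', 'a', 'b', 'a', 'b', 'c')

res <- mark_missing(complete_data, incomplete_data)
which(res == "no match")   # positions 5 and 9

# Events missing at the end are handled too
mark_missing(c('a', 'b', 'c'), c('a', 'b'))
```

Because the matching is greedy, a gap is attributed to the first place the sequences disagree; with repeated events there can be other, equally valid alignments.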

If you can afford having event names only one letter long, here is a solution using string matching. The trick is to transform the incomplete data to a pattern including places to insert new characters.
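library(dplyr)    # mutate, filter, bind_cols
library(tidyr)    # uncount
library(stringr)  # str_match, str_length, str_detect
complete_data <- c('a', 'b', 'c', 'a', 'B', 'c', 'a', 'b', 'C', 'a', 'b', 'c')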
dfcomplete <- as.data.frame(complete_data,stringsAsFactors=FALSE)
incomplete_data <- c('a', 'b', 'c', 'a','c', 'a', 'b', 'a', 'b', 'c')
y <- paste0('^(.*)',paste(incomplete_data,collapse='(.*)'),'(.*)$')
x <- paste(complete_data,collapse="")
z <- str_length(str_match(x,y)[-1])
data.frame(incomplete_data=c("",incomplete_data),stringsAsFactors=FALSE) %>%
mutate(n=ifelse(incomplete_data=="",z,z+1)) %>%
filter(n>0) %>%
uncount(n) %>%
mutate(incomplete_data=ifelse(str_detect(rownames(.),"\\."),"",incomplete_data)) %>%
bind_cols(dfcomplete) %>%
mutate(match=complete_data==incomplete_data)
#   incomplete_data complete_data match
#1                a             a  TRUE
#2                b             b  TRUE
#3                c             c  TRUE
#4                a             a  TRUE
#5                              B FALSE
#6                c             c  TRUE
#7                a             a  TRUE
#8                b             b  TRUE
#9                              C FALSE
#10               a             a  TRUE
#11               b             b  TRUE
#12               c             c  TRUE

Filter Data Frame by Matching Multiple String in Multiple Columns

I have been trying, unsuccessfully, to filter my data frame with dplyr and grep, using a list of strings across multiple columns of my data frame. I would assume this is a simple task, but either nobody has asked my specific question or it's not as easy as I originally thought.
For the following data frame...
foo <- data.frame(var.1 = c('a', 'b', 'c'),
var.2 = c('b', 'd', 'e'),
var.3 = c('c', 'f', 'g'),
var.4 = c('z', 'a', 'b'))
... I would like to be able to filter row-wise to find rows that contain all three values a, b, and c. My sought-after answer would return only row 1, as it contains a, b, and c. Rows 2 and 3 should not be returned: they contain two of the three sought-after values, but not all three in the same row.
I'm running into issues where grep only allows specifying vectors or one column at a time, when I really just care about finding strings across many columns in the same row.
I've also used dplyr to filter using %in%, but it just returns when any of the variables are present:
foo %>%
filter(var.1 %in% c('a', 'b', 'c') |
var.2 %in% c('a', 'b', 'c') |
var.3 %in% c('a', 'b', 'c'))
Thanks for any and all help and please, let me know if you need any clarification!

Here's an approach in base R where we check, for each of "a", "b", and "c" in turn, whether any element of a row of foo equals it, add up the resulting Booleans, and check whether the sum for each row is greater than or equal to 3:
Reduce("+", lapply(c("a", "b", "c"), function(x) rowSums(foo == x) > 0)) >=3
#[1] TRUE FALSE FALSE
Timings
foo = matrix(sample(letters[1:26], 1e7, replace = TRUE), ncol = 5)
system.time(Reduce("+", lapply(letters[1:20], function(x) rowSums(foo == x) > 0)) >=20)
# user system elapsed
# 3.26 0.48 3.79
system.time(apply(foo, 1, function(x) all(letters[1:20] %in% x)))
# user system elapsed
# 18.86 0.00 19.19
identical(Reduce("+", lapply(letters[1:20], function(x) rowSums(foo == x) > 0)) >=20,
apply(foo, 1, function(x) all(letters[1:20] %in% x)))
#[1] TRUE
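The hard-coded threshold (3 or 20) has to be kept in step with the number of targets by hand. A sketch of a small wrapper that avoids this by combining the per-target checks with a logical AND instead of a sum; rows_with_all is a hypothetical name, not part of the answer above.

```r
# Hypothetical helper: TRUE for rows that contain every value in `targets`
rows_with_all <- function(df, targets) {
  # One logical vector per target: does the row contain it anywhere?
  present <- lapply(targets, function(x) rowSums(df == x) > 0)
  # A row qualifies only if every target was found
  unname(Reduce(`&`, present))
}

foo <- data.frame(var.1 = c('a', 'b', 'c'),
                  var.2 = c('b', 'd', 'e'),
                  var.3 = c('c', 'f', 'g'),
                  var.4 = c('z', 'a', 'b'))

keep <- rows_with_all(foo, c("a", "b", "c"))
keep
foo[keep, ]
```

The work per target is still one vectorised comparison over the whole data frame, so the speed characteristics should be close to the Reduce("+", ...) version.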

Your problem arises from trying to apply "tidyverse" solutions to data that isn't tidy. Here's the tidy solution, which uses melt to make your data tidy. See how much tidier this solution is?
> library(dplyr)
> library(reshape2)
> rows = foo %>%
mutate(id=1:nrow(foo)) %>%
melt(id="id") %>%
filter(value=="a" | value=="b" | value=="c") %>%
group_by(id) %>%
summarize(N=n_distinct(value)) %>%  # distinct values, so a repeated letter is not counted twice
filter(N==3) %>%
select(id) %>%
unlist
Warning message:
attributes are not identical across measure variables; they will be dropped
That gives you a vector of matching row indexes, which you can then subset your original data frame with:
> foo[rows,]
var.1 var.2 var.3 var.4
1 a b c z

Concatenate expressions to subset a dataframe

I am attempting to create a function that will calculate the mean of a column in a subsetted dataframe. The trick here is that I always want to have a couple of subsetting conditions and then have the option to pass more conditions to the function to further subset the dataframe.
Suppose my data look like this:
dat <- data.frame(var1 = rep(letters, 26), var2 = rep(letters, each = 26), var3 = runif(26^2))
head(dat)
var1 var2 var3
1 a a 0.7506109
2 b a 0.7763748
3 c a 0.6014976
4 d a 0.6229010
5 e a 0.5648263
6 f a 0.5184999
I want to be able to do the subset shown below, using the first condition in all function calls, and the second be something that can change with each function call. Additionally, the second subsetting condition could be on other variables (I'm using a single variable, var2, for parsimony, but the condition could involve multiple variables).
subset(dat, var1 %in% c('a', 'b', 'c') & var2 %in% c('a', 'b'))
var1 var2 var3
1 a a 0.7506109
2 b a 0.7763748
3 c a 0.6014976
27 a b 0.7322357
28 b b 0.4593551
29 c b 0.2951004
My example function and function call would look something like:
getMean <- function(expr) {
return(with(subset(dat, var1 %in% c('a', 'b', 'c') eval(expr)), mean(var3)))
}
getMean(expression(& var2 %in% c('a', 'b')))
An alternative call could look like:
getMean(expression(& var4 < 6 & var5 > 10))
Any help is much appreciated.
EDIT: With Wojciech Sobala's help, I came up with the following function, which gives me the option of passing in 0 or more conditions.
getMean <- function(expr = NULL) {
sub <- if(is.null(expr)) { expression(var1 %in% c('a', 'b', 'c'))
} else expression(var1 %in% c('a', 'b', 'c') & eval(expr))
return(with(subset(dat, eval(sub)), mean(var3)))
}
getMean()
getMean(expression(var2 %in% c('a', 'b')))

It can be simplified with the default expr=TRUE.
getMean <- function(expr = TRUE) {
return(with(subset(dat, var1 %in% c('a', 'b', 'c') & eval(expr)), mean(var3)))
}
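A variant of the same default-TRUE idea that lets the caller write the extra condition without wrapping it in expression() can be sketched with substitute. This is my own take rather than anything from the answers; get_mean_nse and its data argument are hypothetical names, and the fixed var1 filter is copied from the question.

```r
set.seed(1)
dat <- data.frame(var1 = rep(letters, 26),
                  var2 = rep(letters, each = 26),
                  var3 = runif(26^2))

# Hypothetical helper: the fixed condition always applies,
# `extra` is captured unevaluated and evaluated inside the data frame
get_mean_nse <- function(data, extra = TRUE) {
  cond <- substitute(extra)
  keep <- data$var1 %in% c('a', 'b', 'c') & eval(cond, data, parent.frame())
  mean(data$var3[keep])
}

get_mean_nse(dat)                            # fixed condition only
get_mean_nse(dat, var2 %in% c('a', 'b'))     # fixed condition plus an extra one
```

Passing data explicitly also removes the dependence on a global dat, which makes the function easier to reuse and test.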

This is how I would approach it. The function getMean makes use of R's handy default parameter settings:
getMean <- function(x, subset_var1, subset_var2=unique(x$var2)){
xs <- subset(x, x$var1 %in% subset_var1 & x$var2 %in% subset_var2)
mean(xs$var3)
}
getMean(dat, c('a', 'b', 'c'))
[1] 0.4762141
getMean(dat, c('a', 'b', 'c'), c('a', 'b'))
[1] 0.3814149
