Detect discrepancies between two sequences - r

I have two time series vectors: complete_data and incomplete_data. The data in each vector consists of 6 possible events which occur randomly throughout the vector. In principle the two should be the same, because every event in complete_data was then also appended to incomplete_data. However, in reality there were some anomalies in the system and not all of the events in complete_data were sent to incomplete_data. Thus complete_data is longer than incomplete_data. I need to find the differences in the pattern between the two and mark them. I made an attempt, but it assumes that the discrepancy between the two vectors occurs in a single chunk, whereas in reality there are various "missing events" scattered throughout incomplete_data.
Here is my attempt:
complete_data <- c('a', 'b', 'c', 'a', 'b', 'c', 'a', 'b', 'c', 'a', 'b', 'c')
dfcomplete <- as.data.frame(complete_data)
incomplete_data <- c('a', 'b', 'c', 'a','c', 'a', 'b', 'a', 'b', 'c')
dfincomplete <- as.data.frame(incomplete_data)
findMatch <- function(complete_data, incomplete_data){
matching_inorder <- NULL
matching_reverseorder <- NULL
for (i in 1:length(complete_data)){
matching_inorder[i] <- complete_data[i] == incomplete_data[i]
matching_reverseorder[i] <- rev(complete_data)[i] == rev(incomplete_data)[i]
}
is_match <- ifelse(matching_inorder == FALSE &
rev(matching_reverseorder) == FALSE, 'non_match', 'match')
is_match
}
dfcomplete$is_match_incorrect <- findMatch(dfcomplete$complete_data,
dfincomplete$incomplete_data)
And here is what I would like to get:
dfcomplete$expected_output <- c('match', 'match', 'match', 'match', 'non_match', 'match',
'match', 'match', 'non_match', 'match', 'match', 'match')
In reality my data is much larger than these examples, with many discrepancies scattered throughout the vector. There aren't so many discrepancies as to make the task meaningless, though: in one case, for example, the complete vector has 320 data points whilst the incomplete vector has 309.
Any help that can be offered would be much appreciated.

There are various ways to do this, but here's a recursive one, where x is assumed to be a complete sequence and y incomplete.
compare <- function(x, y) {
if (length(x) > 0) {
if (length(y) > 0 && x[1] == y[1]) {  # guard: y may run out before x does
x[1] <- "match"
c(x[1], compare(x[-1], y[-1]))
} else {
x[1] <- "no match"
c(x[1], compare(x[-1], y))
}
}
}
compare(complete_data, incomplete_data)
# [1] "match" "match" "match" "match" "no match" "match"
# [7] "match" "match" "no match" "match" "match" "match"
Another one that perhaps is more readable and uses a simple loop would be
out <- rep(NA, length(complete_data))  # one label per event in the complete vector
gap <- 0
for(i in seq_along(complete_data)) {
if (complete_data[i] == incomplete_data[i - gap]) {
out[i] <- "match"
} else {
out[i] <- "no match"
gap <- gap + 1
}
}
out
# [1] "match" "match" "match" "match" "no match" "match"
# [7] "match" "match" "no match" "match" "match" "match"
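The loop above indexes incomplete_data without checking its length, so events missing at the very end would produce an NA comparison. Here is a minimal sketch of the same greedy scan with an explicit index and a bounds check. The function name mark_missing and the index j are made up for illustration, and the sketch still assumes that incomplete_data is simply complete_data with some elements dropped.

```r
# Greedy left-to-right scan: j points at the next unmatched element of `incomplete`.
mark_missing <- function(complete, incomplete) {
  out <- character(length(complete))
  j <- 1L
  for (i in seq_along(complete)) {
    if (j <= length(incomplete) && complete[i] == incomplete[j]) {
      out[i] <- "match"
      j <- j + 1L          # consume the matched element
    } else {
      out[i] <- "no match" # event never arrived; keep j where it is
    }
  }
  out
}

complete_data <- c('a', 'b', 'c', 'a', 'b', 'c', 'a', 'b', 'c', 'a', 'b', 'c')
incomplete_data <- c('a', 'b', 'c', 'a', 'c', 'a', 'b', 'a', 'b', 'c')

res <- mark_missing(complete_data, incomplete_data)
which(res == "no match")  # positions 5 and 9 for the example data
```

Because the scan is greedy, a run of identical events (say 'a', 'a' in the complete data against a single 'a') always marks the later copy as missing; the data alone cannot say which one was actually dropped.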

If you can afford having event names only one letter long, here is a solution using string matching. The trick is to transform the incomplete data to a pattern including places to insert new characters.
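library(stringr)  # str_match, str_length, str_detect
library(dplyr)    # mutate, filter, bind_cols, %>%
library(tidyr)    # uncount
complete_data <- c('a', 'b', 'c', 'a', 'B', 'c', 'a', 'b', 'C', 'a', 'b', 'c')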
dfcomplete <- as.data.frame(complete_data,stringsAsFactors=FALSE)
incomplete_data <- c('a', 'b', 'c', 'a','c', 'a', 'b', 'a', 'b', 'c')
y <- paste0('^(.*)',paste(incomplete_data,collapse='(.*)'),'(.*)$')
x <- paste(complete_data,collapse="")
z <- str_length(str_match(x,y)[-1])
data.frame(incomplete_data=c("",incomplete_data),stringsAsFactors=FALSE) %>%
mutate(n=ifelse(incomplete_data=="",z,z+1)) %>%
filter(n>0) %>%
uncount(n) %>%
mutate(incomplete_data=ifelse(str_detect(rownames(.),"\\."),"",incomplete_data)) %>%
bind_cols(dfcomplete) %>%
mutate(match=complete_data==incomplete_data)
#   incomplete_data complete_data match
#1                a             a  TRUE
#2                b             b  TRUE
#3                c             c  TRUE
#4                a             a  TRUE
#5                              B FALSE
#6                c             c  TRUE
#7                a             a  TRUE
#8                b             b  TRUE
#9                              C FALSE
#10               a             a  TRUE
#11               b             b  TRUE
#12               c             c  TRUE

Related

Analysis of many attendance lists

I have 8 attendance lists from 8 different conferences. I need to know which people attended at least 7 of the 8 conferences. I don't want to check name by name in each list, so I'm planning to do it in R, but I have no clue how. Any suggestions?
There might be a simpler way (my R is getting a bit rusty), but this works:
library(dplyr)
unique_attendees <- c('a', 'b', 'c', 'd', 'e')
conf1_attendees <- c('a','b')
conf2_attendees <- c('a','b','c')
conf3_attendees <- c('a','b','c','e')
conf4_attendees <- c('b', 'e')
conf5_attendees <- c('a','d', 'e')
conf6_attendees <- c('a','d', 'e')
conf7_attendees <- c('a','b', 'e')
conf8_attendees <- c('a','b', 'c')
conferences <- list(conf1_attendees, conf2_attendees, conf3_attendees, conf4_attendees, conf5_attendees, conf6_attendees, conf7_attendees,conf8_attendees)
attendance_record <- dplyr::bind_rows(lapply(unique_attendees, function(x){
cat(c('Working with: ', x, '\n'))
attendance <- lapply(conferences, function(y){
attended <- x %in% y  # exact match; grepl(x, y) would also count substring hits
return(attended)
})
number_attended = length(which(unlist(attendance) == TRUE))
result <- data.frame(person=x, number_attended=number_attended)
}))
result <- attendance_record %>%
mutate(attended_at_least_7 = number_attended >= 7)
print(result)
Output:
person number_attended attended_at_least_7
1 a 7 TRUE
2 b 6 FALSE
3 c 3 FALSE
4 d 2 FALSE
5 e 5 FALSE
Obviously you'll need to adapt it to your problem since we don't know how your records are stored.
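For what it's worth, the counting step can probably be done in base R alone by pooling all lists and tabulating the names. This is only a sketch on the toy lists from the answer above; the variable names counts and regulars are made up, and it assumes each list is a character vector of names spelled consistently across conferences.

```r
conferences <- list(
  c('a', 'b'),
  c('a', 'b', 'c'),
  c('a', 'b', 'c', 'e'),
  c('b', 'e'),
  c('a', 'd', 'e'),
  c('a', 'd', 'e'),
  c('a', 'b', 'e'),
  c('a', 'b', 'c')
)

# unique() per list guards against a name entered twice at one conference;
# unlist() pools all lists so table() counts conferences per person.
counts <- table(unlist(lapply(conferences, unique)))

# Keep the people who show up in at least 7 of the 8 lists.
regulars <- names(counts)[counts >= 7]
regulars
```

On the toy data only 'a' qualifies, which agrees with the output shown above.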

Extracting a single unique character from a pattern in R

I have a data frame of unique character vectors that are all very similar to a distinct pattern, but with small deviations in each. I'm hoping to find a way to identify what the deviation is in each string. Here is what I have tried:
library(stringr)
#The strings are concatenated in my code, I separated them for easier use
KeyPattern <- c('abcd')
uniqchars <- function(x) unique(strsplit(x, "")[[1]])
KeyPattern <- uniqchars(KeyPattern)
> KeyPattern
[1] "a" "b" "c" "d"
SampleString <- c('a', 'b', 'z', 'c', 'd')
str_detect(SampleString, KeyPattern)
[1] TRUE TRUE FALSE FALSE FALSE
As you can see, it recognizes the 'z' character, and correctly returns FALSE, and from there the pattern is completely off. I also considered trying:
word(string, start = 1L, end = start, sep = fixed(" "))
but this requires a pre-existing knowledge of where the deviations are (start = ..., end = ...) and it will be different in every row of the data frame.
Ultimately I want a data frame with one column of unique strings, a column of the distinct deviations (mismatches with the pattern), and a column with each deviation's location in the string.
Goal Sample Table:
String                 Deviation from Key   Deviation start location
'a' 'b' 'c' 'z' 'd'    z                    4
'a' 'b' 'a' 'c' 'd'    a                    3
Current concatenated data frame:
1 ASGGGGSAASHLIALQLRLIGDAFDGGGGSGGGGSG
2 ASLTVDVGNVTYHFNNPITVLVFAILVALELGGTVHVHGNRIHVEG
3 ASLTVHVGDLTYHFENPQLVKLVAEIWARALNLTIEIRGNEIHVEG
4 ASNELVELVVEILYRMCVDPDQIKKILKRRGVSDEEVKRAIDKAIG
5 ASNMNMLEALQQRLQFYFGVVSRAALENNSGKARRFGRIVKQYEDAIKLYKAGKPVPYDELPVPPGFGG
6 ASNTIMLEALQQRLQFYFGVVSRAALENNSGKARRFGRIVKQYEDAIKLYKAGKPVPYDELPVPPGFGG
#CurrentKey
[1] "ASSTNMLEALQQRLQFYFGVVSRALENNSGKARRFGRIVKQYEDAIKLYKAGKPVPYDELPVPPGFGG"
Any suggestions?
See if this is what you want:
df <- structure(list(STRINGS = list(c("a", "b", "c", "z", "d"), c("a",
"b", "a", "c", "d"))), class = "data.frame", row.names = c(NA,
-2L))
df
#> STRINGS
#> 1 a, b, c, z, d
#> 2 a, b, a, c, d
pattern <- c('a', 'b', 'c', 'd')
library(tidyverse)
df %>%
mutate(deviation = map_chr(STRINGS, ~ {x <- cumsum(.x[seq_along(pattern)] != pattern); .x[which(x >0)[1]]}),
deviation_start_loc = map_int(STRINGS, ~ {x <- cumsum(.x[seq_along(pattern)] != pattern); which(x > 0)[1]}))
#> STRINGS deviation deviation_start_loc
#> 1 a, b, c, z, d z 4
#> 2 a, b, a, c, d a 3
Created on 2021-06-21 by the reprex package (v2.0.0)
Here is my approach:
First, define a recursive function:
find_deviation <- function(string, key, position = 1) {
stopifnot(is.character(string), is.character(key))
if (min(length(key), length(string)) == 0)
return(c(deviation = NA, position = NA))
if (string[1] != key[1])
return(c(deviation = string[1], position = position))
find_deviation(string[-1], key[-1], position + 1)
}
Then, use it to generate the desired result:
dplyr::bind_cols(
purrr::map_dfr(SampleString, ~ c(String = paste(.x, collapse = ","))),
purrr::map_dfr(SampleString, ~ find_deviation(.x, KeyPattern))
)
Result:
# A tibble: 2 x 3
String deviation position
<chr> <chr> <chr>
1 a,b,z,c,d z 3
2 a,b,a,c,d a 3
Data used:
KeyPattern <- c('a', 'b', 'c', 'd')
SampleString <- list(c('a', 'b', 'z', 'c', 'd'), c('a', 'b', 'a', 'c', 'd'))
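Note that the recursive version returns a character vector, which is why position comes out as <chr> in the tibble above. A loop-free sketch that keeps the position as an integer might look like this; first_deviation and result are hypothetical names, and like the answers above it only reports the first mismatch against the key.

```r
# Compare position by position over the shared length and take the first mismatch.
first_deviation <- function(string, key) {
  n <- min(length(string), length(key))
  pos <- which(string[seq_len(n)] != key[seq_len(n)])[1]  # NA if there is no mismatch
  data.frame(
    string = paste(string, collapse = ","),
    deviation = string[pos],   # NA when pos is NA
    position = pos,
    stringsAsFactors = FALSE
  )
}

KeyPattern <- c('a', 'b', 'c', 'd')
SampleString <- list(c('a', 'b', 'z', 'c', 'd'), c('a', 'b', 'a', 'c', 'd'))

# One row per sample string, stacked into a single data frame.
result <- do.call(rbind, lapply(SampleString, first_deviation, key = KeyPattern))
result
```

For the two sample strings this reports 'z' and 'a', both at position 3, matching the result above.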
Using the aphid library for sequence alignment: the character vectors are combined into a list, with the key pattern vector as the first element, and each sample is then aligned against the key.
library(aphid)
KeyPattern <- c('a', 'b', 'c', 'd')
SampleString1 <- c('a', 'b', 'z', 'c', 'd')
SampleString2 <- c('a', 'b', 'c', 'z', 'd')
SampleString3 <- c('a', 'b', 'a', 'c', 'd')
sequences=list(KeyPattern,SampleString1,SampleString2,SampleString3)
do.call(rbind,
sapply(2:length(sequences),function(x){
glo=align(sequences[c(1,x)],type="global",k=1)
tmp=glo[1,]!=glo[2,]
data.frame(
"String"=paste0(sequences[[x]],collapse=" "),
"Deviation from Key"=glo[2,tmp],
"Deviation start location"=which(tmp)
)
},simplify=F)
)
     String Deviation.from.Key Deviation.start.location
1 a b z c d                  z                        3
2 a b c z d                  z                        4
3 a b a c d                  a                        3

R find value in multiple data frame columns

Given a data set where a value could be in any of a set of columns from the dataframe:
df <- data.frame(h1=c('a', 'b', 'c', 'a', 'a', 'b', 'c'), h2=c('b', 'c', 'd', 'b', 'c', 'd', 'b'), h3=c('c', 'd', 'e', 'e', 'e', 'd', 'c'))
How can I get a logical vector that specifies which rows contain the target value? In this case, searching for 'b', I'd want a logical vector with rows (1,2,4,6,7) as TRUE.
The real data set is much larger and more complicated so I'm trying to avoid a for loop.
thanks
EDIT:
This seems to work.
>apply(df, 1, function(x) {'b' %in% as.vector(t(x))}) -> i
> i
[1] TRUE TRUE FALSE TRUE FALSE TRUE TRUE
If speed is a concern I would go with:
rowSums(df == "b") > 0                   # vectorised over the whole data frame
apply(df, 1, function(r) any(r == "b"))  # row-by-row alternative
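To make the vectorised idea concrete, here is a small sketch on the data frame from the question. The names hits_rowsums and hits_reduce are made up for illustration; the Reduce() variant is just another loop-free option I'd expect to behave the same way, not something taken from the answers.

```r
df <- data.frame(
  h1 = c('a', 'b', 'c', 'a', 'a', 'b', 'c'),
  h2 = c('b', 'c', 'd', 'b', 'c', 'd', 'b'),
  h3 = c('c', 'd', 'e', 'e', 'e', 'd', 'c'),
  stringsAsFactors = FALSE
)

# df == "b" gives a logical matrix; a row is a hit if it contains at least one TRUE.
hits_rowsums <- unname(rowSums(df == "b") > 0)

# Same result built column by column: OR together one logical vector per column.
hits_reduce <- Reduce(`|`, lapply(df, function(col) col == "b"))

which(hits_rowsums)  # rows 1, 2, 4, 6 and 7, as asked for in the question
df[hits_rowsums, ]   # the matching rows themselves
```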
I'd rather wrap it into a small helper function that also returns the matching rows and performs a case-insensitive search across all columns
require(dplyr)
require(stringr)
search_df = function(df, search_term){
apply(df, 1, function(r){
any(str_detect(as.character(r), fixed(search_term, ignore_case=T)))
}) %>% subset(df, .)
}
search_df(iris, "Setosa")
To keep it more generic this can also be rewritten to expose the matching expression/rule as a function argument:
match_df = function(df, search_expr){
filter_fun = eval(substitute(function(x){search_expr}))
apply(df, 1, function(r) any(filter_fun(r))) %>% subset(df, .)
}
match_df(iris, str_detect(x, "setosa"))

Combine vector and data.frame matching column values and vector values

I have
vetor <- c(1,2,3)
data <- data.frame(id=c('a', 'b', 'a', 'c', 'a'))
I need a data.frame output that matches each vector value to a specific id, resulting in:
id vector1
1 a 1
2 b 2
3 a 1
4 c 3
5 a 1
Here are two approaches I often use for similar situations:
vetor <- c(1,2,3)
key <- data.frame(vetor=vetor, mat=c('a', 'b', 'c'))
data <- data.frame(id=c('a', 'b', 'a', 'c', 'a'))
data$vector1 <- key[match(data$id, key$mat), 'vetor']
#or with merge
merge(data, key, by.x = "id", by.y = "mat")
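A third option along the same lines is a named vector used as a lookup table. This is just a sketch: the name lookup is made up, and it assumes the same pairing as the key above ('a' with 1, 'b' with 2, 'c' with 3), which the question does not state explicitly.

```r
vetor <- c(1, 2, 3)

# Name each value after the id it belongs to (assumed pairing: a = 1, b = 2, c = 3).
lookup <- setNames(vetor, c('a', 'b', 'c'))

data <- data.frame(id = c('a', 'b', 'a', 'c', 'a'), stringsAsFactors = FALSE)

# Indexing by name pulls the value for every id; unname() drops the id labels again.
data$vector1 <- unname(lookup[data$id])
data
```

Unlike merge(), this keeps the original row order, and an id that is missing from the lookup simply becomes NA.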
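So you want one unique integer for each different value in the id column?
This is called a factor in R. In R versions before 4.0.0, data.frame() converted strings to factors by default, so your id column already was one; since 4.0.0 you have to convert it explicitly.
To convert to a numeric representation, use as.numeric on the factor:
data <- data.frame(id=c('a', 'b', 'a', 'c', 'a'))
data$vector1 <- as.numeric(factor(data$id))
This works because a factor stores each distinct value as an integer code; calling as.numeric directly on a column of strings would give NA instead.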
Here's an answer I found that follows the "mathematical.coffee" tip:
vector1 <- c('b','a','a','c','a','a') # 3 elements to be labeled: a, b and c
labels <- factor(vector1, labels= c('char a', 'char b', 'char c') )
data.frame(vector1, labels)
The only thing to watch is that factor(vector1, ...) sorts the levels of vector1, so the labels must be given in that sorted order.

Concatenate expressions to subset a dataframe

I am attempting to create a function that will calculate the mean of a column in a subsetted dataframe. The trick here is that I always want to apply a couple of subsetting conditions and then have the option of passing more conditions to the function to further subset the dataframe.
Suppose my data look like this:
dat <- data.frame(var1 = rep(letters, 26), var2 = rep(letters, each = 26), var3 = runif(26^2))
head(dat)
var1 var2 var3
1 a a 0.7506109
2 b a 0.7763748
3 c a 0.6014976
4 d a 0.6229010
5 e a 0.5648263
6 f a 0.5184999
I want to be able to do the subset shown below, using the first condition in all function calls, and the second be something that can change with each function call. Additionally, the second subsetting condition could be on other variables (I'm using a single variable, var2, for parsimony, but the condition could involve multiple variables).
subset(dat, var1 %in% c('a', 'b', 'c') & var2 %in% c('a', 'b'))
var1 var2 var3
1 a a 0.7506109
2 b a 0.7763748
3 c a 0.6014976
27 a b 0.7322357
28 b b 0.4593551
29 c b 0.2951004
My example function and function call would look something like:
getMean <- function(expr) {
return(with(subset(dat, var1 %in% c('a', 'b', 'c') eval(expr)), mean(var3)))
}
getMean(expression(& var2 %in% c('a', 'b')))
An alternative call could look like:
getMean(expression(& var4 < 6 & var5 > 10))
Any help is much appreciated.
EDIT: With Wojciech Sobala's help, I came up with the following function, which gives me the option of passing in 0 or more conditions.
getMean <- function(expr = NULL) {
sub <- if(is.null(expr)) { expression(var1 %in% c('a', 'b', 'c'))
} else expression(var1 %in% c('a', 'b', 'c') & eval(expr))
return(with(subset(dat, eval(sub)), mean(var3)))
}
getMean()
getMean(expression(var2 %in% c('a', 'b')))
It can be simplified with the default expr = TRUE.
getMean <- function(expr = TRUE) {
return(with(subset(dat, var1 %in% c('a', 'b', 'c') & eval(expr)), mean(var3)))
}
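If you'd rather not wrap the extra condition in expression() at the call site, one option is to capture it unevaluated with substitute() and evaluate it inside the data frame, which is the same trick subset() uses internally. A minimal sketch, assuming the column layout of dat above; get_mean_cond and its arguments are hypothetical names.

```r
get_mean_cond <- function(data, cond) {
  # Fixed condition that applies to every call.
  base <- data$var1 %in% c('a', 'b', 'c')
  # Optional extra condition: captured unevaluated, then evaluated with the
  # columns of `data` in scope (and the caller's environment as fallback).
  extra <- if (missing(cond)) TRUE else eval(substitute(cond), data, parent.frame())
  mean(data$var3[base & extra])
}

set.seed(42)  # only so the example is reproducible
dat <- data.frame(var1 = rep(letters, 26),
                  var2 = rep(letters, each = 26),
                  var3 = runif(26^2))

get_mean_cond(dat)                                        # fixed condition only
get_mean_cond(dat, var2 %in% c('a', 'b'))                 # one extra condition
get_mean_cond(dat, var2 %in% c('a', 'b') & var3 > 0.5)    # several, on different columns
```

As with subset(), this kind of non-standard evaluation is convenient interactively but gets awkward when the condition has to be built programmatically, in which case passing an expression() as above is the easier route.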
This is how I would approach it. The function getMean makes use of R's handy default parameter values:
getMean <- function(x, subset_var1, subset_var2=unique(x$var2)){
xs <- subset(x, x$var1 %in% subset_var1 & x$var2 %in% subset_var2)
mean(xs$var3)
}
getMean(dat, c('a', 'b', 'c'))
[1] 0.4762141
getMean(dat, c('a', 'b', 'c'), c('a', 'b'))
[1] 0.3814149
