Index vector by value in R

Say I have two character vectors
vec <- c('A', 'B', 'C', 'D', 'E')
pat <- c('D', 'B', 'A')
How do I get the indexes of the occurrences in vec of the values in pat, in the order they appear in pat?
I can try
which(vec %in% pat)
but this returns the indexes in the order of vec: 1 2 4. I want them as 4 2 1.

I have tried several ways to solve this problem, and the easiest one I found is the solution from @DavidArenburg's comment:
match(pat, vec)
# [1] 4 2 1
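A small sketch of two edge cases worth knowing about, using made-up data that is not in the question: match() returns NA for a value that does not occur in vec, and it only reports the first occurrence of a repeated value. The names idx and all_idx are just for illustration.

```r
vec <- c('A', 'B', 'C', 'D', 'E', 'B')   # note the second 'B'
pat <- c('D', 'B', 'Z')                  # 'Z' does not occur in vec

# First occurrence of each pattern value, NA when it is absent
idx <- match(pat, vec)
idx
# [1]  4  2 NA

# Drop the patterns that were not found
idx[!is.na(idx)]
# [1] 4 2

# Every occurrence of every pattern value, still in the order of pat
all_idx <- unlist(lapply(pat, function(p) which(vec == p)))
all_idx
# [1] 4 2 6
```

If vec has no duplicates and every value of pat is present, plain match(pat, vec) is all you need.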

Related

How to remove an element by name from a named vector?

How to remove an element from a named vector by name? e.g.
v <- c(1, 2, 3)
names(v) <- c('a', 'b', 'c')
# how to remove b?
v['b'] <- NULL # doesn't work
Error in v["b"] <- NULL : replacement has length zero
You could use
v[names(v) != "b"]
#a c
#1 3
Or with setdiff
v[setdiff(names(v), "b")]
Or we can use an index with match
v[-match("b", names(v))]
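One difference between the three variants shows up when the name is not in the vector at all. A quick sketch, assuming a name "z" that does not exist in v (the res_* names are only for illustration):

```r
v <- c(a = 1, b = 2, c = 3)

# match() returns NA for a missing name, and -NA is still NA,
# so this yields a single NA instead of leaving v untouched
res_match <- v[-match("z", names(v))]
length(res_match)
# [1] 1

# The logical and setdiff variants simply return v unchanged
res_logical <- v[names(v) != "z"]
res_setdiff <- v[setdiff(names(v), "z")]
```

So the match version is only safe when you know the name is present; otherwise the logical comparison is probably the more forgiving choice.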

Frequent Sequential Patterns

What would be the best way to get the sequential patterns for such data in R?
The idea is to get the frequency of letters in processes 1, 2, and 3. Is there a GSP function that can do that? Any insight or tutorial is appreciated.
You can use an apply and table combination (provided you have read your data into R):
dat <- data.frame(process1 = c('A', 'B', 'A', 'A', 'C'),
                  process2 = c('B', 'C', 'B', 'B', 'A'),
                  process3 = c('C', 'C', 'A', 'B', 'B'))
apply(dat, 2, table)
#   process1 process2 process3
# A        3        1        1
# B        1        3        2
# C        1        1        2
apply iterates over the columns of dat (this is what the argument 2 refers to) and applies table to each, which counts the occurrences of each unique element. See the help pages for the *apply family of functions for more information.
d.b's solution, lapply(dat, table), does the same thing but returns a list rather than a matrix.
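Note that the matrix result relies on every column containing the same set of letters. A sketch of one way to keep a matrix when that is not the case, using modified example data (process2 has no 'A' here) and a shared set of factor levels; lv and counts are illustrative names:

```r
dat <- data.frame(process1 = c('A', 'B', 'A', 'A', 'C'),
                  process2 = c('B', 'C', 'B', 'B', 'B'),   # no 'A' in this column
                  process3 = c('C', 'C', 'A', 'B', 'B'))

# Collect every letter that appears anywhere in the data
lv <- sort(unique(unlist(lapply(dat, as.character))))

# Tabulate each column against the shared levels so absent letters count as 0
counts <- sapply(dat, function(col) table(factor(col, levels = lv)))
counts
#   process1 process2 process3
# A        3        0        1
# B        1        4        2
# C        1        1        2
```

Without the shared levels, the tables have different lengths and apply falls back to returning a list.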

Detect discrepancies between two sequences

I have two time series vectors: complete_data and incomplete_data. The data in each vector consists of 6 possible events which occur randomly throughout the vector. In principle the two should be the same, because every event in complete_data was then added to incomplete_data. In reality, however, there were some anomalies in the system and not all of the events in complete_data were sent to incomplete_data, so complete_data is longer than incomplete_data. I need to find the differences in the pattern between the two and mark them. I made an attempt, but it assumes that the discrepancy between the two vectors occurs in a single chunk, whereas in reality there are various "missing events" scattered through incomplete_data.
Here is my attempt:
complete_data <- c('a', 'b', 'c', 'a', 'b', 'c', 'a', 'b', 'c', 'a', 'b', 'c')
dfcomplete <- as.data.frame(complete_data)
incomplete_data <- c('a', 'b', 'c', 'a','c', 'a', 'b', 'a', 'b', 'c')
dfincomplete <- as.data.frame(incomplete_data)
findMatch <- function(complete_data, incomplete_data){
  matching_inorder <- NULL
  matching_reverseorder <- NULL
  for (i in 1:length(complete_data)){
    matching_inorder[i] <- complete_data[i] == incomplete_data[i]
    matching_reverseorder[i] <- rev(complete_data)[i] == rev(incomplete_data)[i]
  }
  is_match <- ifelse(matching_inorder == FALSE &
                     rev(matching_reverseorder) == FALSE, 'non_match', 'match')
  is_match
}
dfcomplete$is_match_incorrect <- findMatch(dfcomplete$complete_data,
                                           dfincomplete$incomplete_data)
And here is what I would like to get:
dfcomplete$expected_output <- c('match', 'match', 'match', 'match', 'non_match', 'match',
                                'match', 'match', 'non_match', 'match', 'match', 'match')
In reality my data is much larger than these examples, with many different discrepancies scattered throughout the vector. The discrepancies are not so numerous as to make the task meaningless, though: in one case the complete vector has 320 data points while the incomplete vector has 309.
Any help that can be offered would be much appreciated.
There are various ways to do this, but here's a recursive one, where x is assumed to be a complete sequence and y incomplete.
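compare <- function(x, y) {
  if (length(x) > 0) {
    # The length(y) check covers events missing at the very end of y
    if (length(y) > 0 && x[1] == y[1]) {
      # Same event: consume one element from both vectors
      c("match", compare(x[-1], y[-1]))
    } else {
      # Missing event: consume from x only
      c("no match", compare(x[-1], y))
    }
  }
}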
compare(complete_data, incomplete_data)
# [1] "match" "match" "match" "match" "no match" "match"
# [7] "match" "match" "no match" "match" "match" "match"
Another one that perhaps is more readable and uses a simple loop would be
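out <- rep(NA_character_, length(complete_data))
gap <- 0
for (i in seq_along(complete_data)) {
  # isTRUE() turns the NA from indexing past the end of incomplete_data into a non-match
  if (isTRUE(complete_data[i] == incomplete_data[i - gap])) {
    out[i] <- "match"
  } else {
    out[i] <- "no match"
    gap <- gap + 1
  }
}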
out
# [1] "match" "match" "match" "match" "no match" "match"
# [7] "match" "match" "no match" "match" "match" "match"
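For reuse on many pairs of vectors, the loop idea can be wrapped in a function. This is only a sketch: mark_missing is a made-up name, and it tracks the next unmatched position in the incomplete vector (j) instead of a gap counter. It is a greedy left-to-right alignment, so when the same event repeats back to back, the position it flags is one of several equally valid answers.

```r
mark_missing <- function(complete, incomplete) {
  out <- character(length(complete))
  j <- 1  # next unmatched position in `incomplete`
  for (i in seq_along(complete)) {
    if (j <= length(incomplete) && complete[i] == incomplete[j]) {
      out[i] <- "match"
      j <- j + 1            # event found, advance in both vectors
    } else {
      out[i] <- "no match"  # event missing, advance in `complete` only
    }
  }
  out
}

complete_data   <- c('a', 'b', 'c', 'a', 'b', 'c', 'a', 'b', 'c', 'a', 'b', 'c')
incomplete_data <- c('a', 'b', 'c', 'a', 'c', 'a', 'b', 'a', 'b', 'c')

res <- mark_missing(complete_data, incomplete_data)
which(res == "no match")
# [1] 5 9
```

The j <= length(incomplete) check means events missing at the very end are flagged as "no match" instead of raising an error.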
If your event names can all be a single character long, here is a solution using string matching. The trick is to turn the incomplete data into a regular expression with a capture group at every position where extra characters could be inserted.
library(stringr)  # str_length, str_match, str_detect
library(dplyr)    # %>%, mutate, filter, bind_cols
library(tidyr)    # uncount
complete_data <- c('a', 'b', 'c', 'a', 'B', 'c', 'a', 'b', 'C', 'a', 'b', 'c')
dfcomplete <- as.data.frame(complete_data,stringsAsFactors=FALSE)
incomplete_data <- c('a', 'b', 'c', 'a','c', 'a', 'b', 'a', 'b', 'c')
y <- paste0('^(.*)',paste(incomplete_data,collapse='(.*)'),'(.*)$')
x <- paste(complete_data,collapse="")
z <- str_length(str_match(x,y)[-1])
data.frame(incomplete_data=c("",incomplete_data),stringsAsFactors=FALSE) %>%
  mutate(n=ifelse(incomplete_data=="",z,z+1)) %>%
  filter(n>0) %>%
  uncount(n) %>%
  mutate(incomplete_data=ifelse(str_detect(rownames(.),"\\."),"",incomplete_data)) %>%
  bind_cols(dfcomplete) %>%
  mutate(match=complete_data==incomplete_data)
#    incomplete_data complete_data match
# 1                a             a  TRUE
# 2                b             b  TRUE
# 3                c             c  TRUE
# 4                a             a  TRUE
# 5                              B FALSE
# 6                c             c  TRUE
# 7                a             a  TRUE
# 8                b             b  TRUE
# 9                              C FALSE
# 10               a             a  TRUE
# 11               b             b  TRUE
# 12               c             c  TRUE

Calculate length of each object in R

I would like to calculate the length of many objects in R and return those objects with the name-prefix 'length_'. However, when I type this code:
A <- c('A', 'B', '3')
B <- c('A', '2')
files <- ls()
for (i in 1:length(files)) assign(paste("length_",files[i], sep = ""), length(unlist(files[i])))
This returns the vectors length_A and length_B, but each with the value 1 and not 3 and 2.
Thank you for any help,
Paul
P.S. I would actually like to apply a different function instead of length (GC.content from package ape, to calculate the GC content of DNA sequences), but I have the same problem with that function as with the example above.
In R 3.2.0, the lengths function was introduced, which calculates the length of each element of a list. Using this function, as @docendo-discimus notes in the comments, a very compact (and R-like) solution is
lengths(mget(ls()))
which returns a named vector
A B
3 2
mget returns a named list of the objects whose names it is given, so it works like a "multiple get".
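The same pattern extends to the 'length_' prefix and to functions other than length. A sketch under a few assumptions: the objects live in their own environment (seqs, a made-up name) so that ls() does not pick up unrelated objects, and count_letter_A is a hypothetical stand-in for whatever per-object function you really want to apply.

```r
# Keep the objects of interest in a dedicated environment
seqs <- new.env()
assign("A", c('A', 'B', '3'), envir = seqs)
assign("B", c('A', '2'), envir = seqs)

# Fetch them all as a named list, then take the length of each
objs <- mget(ls(seqs), envir = seqs)
lens <- lengths(objs)

# Add the requested prefix to the names
prefixed <- setNames(lens, paste0("length_", names(lens)))
prefixed
# length_A length_B
#        3        2

# Any other function can be applied the same way (hypothetical example function)
count_letter_A <- function(x) sum(x == 'A')
res <- sapply(objs, count_letter_A)
```

Keeping the results in one named vector is usually easier to work with than creating a separate length_* variable per object with assign.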
A <- c('A', 'B', '3')
B <- c('A', '2')
files <- ls()
for (i in 1:length(files)) assign(paste("length_",files[i], sep = ""), length(get(files[i])))
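This creates length_A with value 3 and length_B with value 2. The original code failed because files[i] is only the object's name as a string, which has length 1; get() retrieves the object itself.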
A <- c('A', 'B', '3')
B <- c('A', '2')
files <- list(A,B)
sapply(files,length)
This will give you the lengths, but I don't know if it's what you want.

Combine vector and data.frame matching column values and vector values

I have
vetor <- c(1,2,3)
data <- data.frame(id=c('a', 'b', 'a', 'c', 'a'))
I need a data.frame output that matches each vector value to a specific id, resulting in:
  id vector1
1  a       1
2  b       2
3  a       1
4  c       3
5  a       1
Here are two approaches I often use for similar situations:
vetor <- c(1,2,3)
key <- data.frame(vetor=vetor, mat=c('a', 'b', 'c'))
data <- data.frame(id=c('a', 'b', 'a', 'c', 'a'))
data$vector1 <- key[match(data$id, key$mat), 'vetor']
#or with merge
merge(data, key, by.x = "id", by.y = "mat")
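One practical difference between the two approaches: match keeps the rows of data in their original order, while merge sorts the result by the key column by default. A short sketch with the same data (by_match and merged are illustrative names):

```r
key  <- data.frame(vetor = c(1, 2, 3), mat = c('a', 'b', 'c'))
data <- data.frame(id = c('a', 'b', 'a', 'c', 'a'))

# match() looks up each id in the key, preserving the row order of data
by_match <- key$vetor[match(data$id, key$mat)]
by_match
# [1] 1 2 1 3 1

# merge() returns the same pairs, but sorted by id
merged <- merge(data, key, by.x = "id", by.y = "mat")
as.character(merged$id)
# [1] "a" "a" "a" "b" "c"
```

If the original row order matters, the match version is the simpler choice.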
So you want one unique integer for each distinct value in the id column?
This is what a factor is in R. Since R 4.0.0, data.frame() no longer converts strings to factors by default, so convert the column explicitly.
To get the numeric representation, use as.numeric on the factor:
data <- data.frame(id=c('a', 'b', 'a', 'c', 'a'))
data$vector1 <- as.numeric(factor(data$id))
This works because a factor stores each value as an integer code that indexes its levels.
Here's an answer I found that follows the "mathematical.coffee" tip:
vector1 <- c('b','a','a','c','a','a') # 3 elements to be labeled: a, b and c
labels <- factor(vector1, labels= c('char a', 'char b', 'char c') )
data.frame(vector1, labels)
The only thing to watch is that factor(vector1, ...) sorts the unique values of vector1 to build its levels, so the labels must be given in that sorted order.
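To avoid depending on the default sort order, the levels can be spelled out next to the labels. A minimal sketch with the same data (f is just an illustrative name):

```r
vector1 <- c('b', 'a', 'a', 'c', 'a', 'a')

# Pair each level with its label explicitly
f <- factor(vector1,
            levels = c('a', 'b', 'c'),
            labels = c('char a', 'char b', 'char c'))

as.character(f)
# [1] "char b" "char a" "char a" "char c" "char a" "char a"

# The underlying integer codes follow the order of the levels
as.integer(f)
# [1] 2 1 1 3 1 1
```

With levels given explicitly, reordering them (say c('c', 'b', 'a')) changes the integer codes in a predictable way.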
