Let's say I have the following data.table:
DT <- setDT(data.frame(id = 1:10, LETTERS = LETTERS[1:10],
                       letters = letters[1:10]))
## > DT
##     id LETTERS letters
##  1:  1       A       a
##  2:  2       B       b
##  3:  3       C       c
##  4:  4       D       d
##  5:  5       E       e
##  6:  6       F       f
##  7:  7       G       g
##  8:  8       H       h
##  9:  9       I       i
## 10: 10       J       j
and I want to find the row and column numbers of the letter 'h' (which are 8 and 3). How would I do that?
DT[, which(.SD == "h", arr.ind = TRUE)]
#      row col
# [1,]   8   3
EDIT:
Trying to take into account Michael's points:
str_idx = which(sapply(DT, function(x) is.character(x) || is.factor(x)))
idx <- DT[, which(as.matrix(.SD) == "h", arr.ind = TRUE), .SDcols = str_idx]
idx[, "col"] <- chmatch(names(str_idx)[idx[, "col"]], names(DT))
idx
#      row col
# [1,]   8   3
Depends on the exact format of your desired output.
# applying to non-string columns is inefficient
str_idx = which(sapply(DT, is.character))
# returns a list as long as str_idx with two elements appropriately named
lapply(str_idx, function(jj) list(row = which(DT[[jj]] == 'h'), col = jj))
It should also be possible to melt the string columns of your table to avoid looping; a rough sketch of that idea follows.
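For instance, assuming data.table's melt() is used, something along these lines should work (the row_id helper column and the column/value names here are purely illustrative):
library(data.table)
# melt only the string/factor columns to long form, then filter for the value;
# chmatch() maps the melted column names back to positions in the full table
str_cols <- names(DT)[sapply(DT, function(x) is.character(x) || is.factor(x))]
long <- melt(copy(DT)[, row_id := .I], id.vars = "row_id",
             measure.vars = str_cols,
             variable.name = "column", value.name = "value")
long[value == "h",
     .(row = row_id, col = chmatch(as.character(column), names(DT)))]
#    row col
# 1:   8   3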
I got some excellent advice here on how to look up neighbors for a list of network nodes. See: lapply function to look up neighbors in igraph (when not all nodes are found)
Now I need to do the same thing with second-degree neighbors. However, substituting either the ego or the neighborhood function into this loop produces an error.
edgelist <- read.table(text = "
A B
B C
C D
D E
C F
F G")
testlist <- read.table(text = "
A
H
C
D
J")
testlist2 <- read.table(text = "
A
C
B
D
E")
library(igraph)
graph <- graph.data.frame(edgelist)
str(graph)
get_neighbors <- function(graph, n) {
  do.call(rbind, lapply(n, function(x) {
    if (x %in% V(graph)$name) {
      nb <- neighborhood(graph, 2, x) ##HERE##
      if (length(nb) > 0) {
        data.frame(lookupnode = x,
                   neighbor = nb$name,  # h/t #MrFlick for this shortcut
                   stringsAsFactors = FALSE)
      } else {
        data.frame(lookupnode = x, neighbor = NA, stringsAsFactors = FALSE)
      }
    } else {
      data.frame(lookupnode = x, neighbor = NA, stringsAsFactors = FALSE)
    }
  }))
}
A=get_neighbors(graph, as.character(testlist$V1))
Error in data.frame(lookupnode = x, neighbor = nb$name, stringsAsFactors = FALSE) : arguments imply differing number of rows: 1, 0
I gather the issue is that ego and neighborhood can't be directly coerced into a data frame. I can unlist the result and put it in a data frame, but the values I want end up as row names rather than values I can put into my output.
How can I create an output of second-degree neighbors?
Changed
neighbor=nb$name, # h/t #MrFlick for this shortcut
to
neighbor=names(unlist(nb)), # h/t #MrFlick for this shortcut
and it is working for me now.
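For reference, the complete function with that single change applied (everything else exactly as in the question):
get_neighbors <- function(graph, n) {
  do.call(rbind, lapply(n, function(x) {
    if (x %in% V(graph)$name) {
      nb <- neighborhood(graph, 2, x)  # order-2 (second-degree) neighborhood of x
      if (length(nb) > 0) {
        data.frame(lookupnode = x,
                   neighbor = names(unlist(nb)),  # vertex names rather than indices
                   stringsAsFactors = FALSE)
      } else {
        data.frame(lookupnode = x, neighbor = NA, stringsAsFactors = FALSE)
      }
    } else {
      data.frame(lookupnode = x, neighbor = NA, stringsAsFactors = FALSE)
    }
  }))
}
A <- get_neighbors(graph, as.character(testlist$V1))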
> A
   lookupnode neighbor
1           A        A
2           A        B
3           A        C
4           H     <NA>
5           C        C
6           C        B
7           C        D
8           C        F
9           C        A
10          C        E
11          C        G
12          D        D
13          D        C
14          D        E
15          D        B
16          D        F
17          J     <NA>
Suppose I have a data frame like this:
df<- read.table(text="groups names
1 a
1 b
1 c
1 d
2 e
2 f
2 g
2 h
", header=T)
I divided this data frame into two groups using
split_groups <- split(df, df$groups)
Then I used a for loop to obtain the overlapping lists of split_groups[[1]] and split_groups[[2]] as follows:
library(zoo)  # rollapply() comes from zoo

slide <- list()
for (i in 1:2) {
  slide[[i]] <- rollapply(split_groups[[i]][, 2], width = 2, by = 1,
                          matrix, align = "right")
}
and obtained this:
slide[[1]]:
a
b
**b**
c
**c**
d
slide[[2]] :
e
f
**f**
g
**g**
h
Then I divided slide[[1]] and slide[[2]] into lists of equal rows:
divide <- split(slide[[1]], cumsum(seq_len(nrow(slide[[1]])) %% 2 == 1))
and obtained divide[[1]] = a,b; divide[[2]] = b,c, and so on. Similarly, from slide[[2]], divide[[1]] = e,f, and so on.
I want to rbind divide[[1]] from slide[[1]] and slide[[2]], i.e. set1 = a,b,e,f, as a list or data frame. Similarly, divide[[2]] from slide[[1]] and slide[[2]] gives set2 = b,c,f,g. That is:
set1:
a
b
e
f
set2:
b
c
f
g
How can I do this?
Maybe you want this (note that the slide output is different from what was shown in the post):
divide1 <- split(slide[[1]], cumsum(seq_len(nrow(slide[[1]])) %%2 == 1))
divide2 <- split(slide[[2]], cumsum(seq_len(nrow(slide[[2]])) %%2 == 1))
nm1 <- paste0("set", 1:2)
Map(function(x, y, z) setNames(data.frame(c(x, y)), z), divide1, divide2, nm1)
#$`1`
# set1
#1 a
#2 b
#3 e
#4 f
#$`2`
# set2
#1 b
#2 f
Or if you have more list elements in slide, you could do:
divide <- lapply(slide, function(x) split(x, cumsum(!!seq_len(nrow(x)) %% 2)))
divN <- unlist(divide)
lstN <- split(unname(divN), substr(names(divN), 1, 1))
nm1 <- paste0("set", seq_along(lstN))
Map(function(x,y) setNames(data.frame(x),y), lstN, nm1)
#$`1`
# set1
#1 a
#2 b
#3 e
#4 f
#$`2`
# set2
#1 b
#2 f
Given:
df <- data.frame(rep = letters[sample(4, 30, replace = TRUE)],
                 loc = LETTERS[sample(5:8, 30, replace = TRUE)],
                 y = rnorm(30))
lookup <- data.frame(rep = letters[1:4], loc = LETTERS[5:8])
This will give me the rows in df that have rep,loc combinations that occur in lookup:
library(plyr)  # mdply() comes from plyr

mdply(lookup, function(rep, loc) {
  r <- rep
  l <- loc
  subset(df, rep == r & loc == l)
})
But I've read that using subset() inside a function is poor practice due to scoping issues. So how do I get the desired result using index notation?
In this particular case, merge seems to make the most sense to me:
merge(df, lookup)
# rep loc y
# 1 a E 1.6612394
# 2 a E 1.1050825
# 3 a E -0.7016759
# 4 b F 0.4364568
# 5 d H 1.3246636
# 6 d H -2.2573545
# 7 d H 0.5061980
# 8 d H 0.1397326
A simple alternative might be to paste together the "rep" and "loc" columns from df and from lookup and subset based on that:
df[do.call(paste, df[c("rep", "loc")]) %in% do.call(paste, lookup), ]
# rep loc y
# 4 d H 1.3246636
# 10 b F 0.4364568
# 14 a E -0.7016759
# 15 a E 1.6612394
# 19 d H 0.5061980
# 20 a E 1.1050825
# 22 d H -2.2573545
# 28 d H 0.1397326
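If you would rather not build strings, interaction() gives an essentially equivalent index-notation subset (a sketch of the same idea, not taken from either answer above):
# build one grouping factor per row from the key columns, then keep the rows
# of df whose rep/loc key also occurs among lookup's keys
keys_df     <- interaction(df$rep, df$loc, drop = TRUE)
keys_lookup <- interaction(lookup$rep, lookup$loc, drop = TRUE)
df[keys_df %in% keys_lookup, ]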
I would like to do the following: combine into a data frame two vectors that
- have different lengths,
- contain sequences also found in the other vector,
- contain sequences not found in the other vector,
- have sequences not found in the other vector that are never longer than 3 elements, and
- always have the same first element.
The data frame should show the matching sequences in the two vectors aligned, with NA in a column wherever that vector lacks a sequence present in the other vector.
For example:
vector 1   vector 2       vector 1   vector 2
1          1              a          a
2          2              g          g
3          3              b          b
4          1          or  h          a
1          2              a          g
2          3              g          b
5          4              c          h
           5                         c
should be combined into this data frame:
1          1              a          a
2          2              g          g
3          3              b          b
4          NA             h          NA
1          1          or  a          a
2          2              g          g
NA         3              NA         b
NA         4              NA         h
5          5              c          c
What I did was to search for merge, combine, cbind, and plyr examples, but I was not able to find a solution. I am afraid I will need to write a function with nested for loops to solve this problem.
Note - this was proposed as an answer to the first version of the OP. The question has been modified since then but the problem is still not well-defined in my opinion.
Here is a solution that works with your integer example and would also work with numeric vectors. I am also assuming that:
- both vectors contain the same number of sequences, and
- a new sequence starts where value[i+1] <= value[i].
If your vectors are non-numeric or if one of my assumptions does not fit your problem, you'll have to clarify.
v1 <- c(1,2,3,4,1,2,5)
v2 <- c(1,2,3,1,2,3,4,5)
v1.sequences <- split(v1, cumsum(c(TRUE, diff(v1) <= 0)))
v2.sequences <- split(v2, cumsum(c(TRUE, diff(v2) <= 0)))
align.fun <- function(s1, s2) {  # aligns two sequences
  s12 <- sort(unique(c(s1, s2)))
  cbind(ifelse(s12 %in% s1, s12, NA),
        ifelse(s12 %in% s2, s12, NA))
}
do.call(rbind, mapply(align.fun, v1.sequences, v2.sequences))
# [,1] [,2]
# [1,] 1 1
# [2,] 2 2
# [3,] 3 3
# [4,] 4 NA
# [5,] 1 1
# [6,] 2 2
# [7,] NA 3
# [8,] NA 4
# [9,] 5 5
I maintain that your problem might be solved in terms of the shortest common supersequence. It assumes that your two vectors each represent one sequence. Please give the code below a try.
If it still does not solve your problem, you'll have to explain exactly what you mean by "my vector contains not one but many sequences": define what you mean by a sequence and tell us how sequences can be identified by scanning through your two vectors.
Part I: given two sequences, find the longest common subsequence
LongestCommonSubsequence <- function(X, Y) {
  m <- length(X)
  n <- length(Y)
  C <- matrix(0, 1 + m, 1 + n)
  for (i in seq_len(m)) {
    for (j in seq_len(n)) {
      if (X[i] == Y[j]) {
        C[i + 1, j + 1] = C[i, j] + 1
      } else {
        C[i + 1, j + 1] = max(C[i + 1, j], C[i, j + 1])
      }
    }
  }
  backtrack <- function(C, X, Y, i, j) {
    if (i == 1 | j == 1) {
      return(data.frame(I = c(), J = c(), LCS = c()))
    } else if (X[i - 1] == Y[j - 1]) {
      return(rbind(backtrack(C, X, Y, i - 1, j - 1),
                   data.frame(LCS = X[i - 1], I = i - 1, J = j - 1)))
    } else if (C[i, j - 1] > C[i - 1, j]) {
      return(backtrack(C, X, Y, i, j - 1))
    } else {
      return(backtrack(C, X, Y, i - 1, j))
    }
  }
  return(backtrack(C, X, Y, m + 1, n + 1))
}
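As a quick check of Part I on a tiny pair of inputs (chosen arbitrarily here, not from the question), it should return something like this:
# "a" and "b" form the longest common subsequence of these two vectors;
# I and J give their positions in X and Y respectively
LongestCommonSubsequence(X = c("a", "g", "b"), Y = c("a", "b"))
#   LCS I J
# 1   a 1 1
# 2   b 3 2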
Part II: given two sequences, find the shortest common supersequence
ShortestCommonSupersequence <- function(X, Y) {
  LCS <- LongestCommonSubsequence(X, Y)[c("I", "J")]
  X.df <- data.frame(X = X, I = seq_along(X), stringsAsFactors = FALSE)
  Y.df <- data.frame(Y = Y, J = seq_along(Y), stringsAsFactors = FALSE)
  ALL <- merge(LCS, X.df, by = "I", all = TRUE)
  ALL <- merge(ALL, Y.df, by = "J", all = TRUE)
  ALL <- ALL[order(pmax(ifelse(is.na(ALL$I), 0, ALL$I),
                        ifelse(is.na(ALL$J), 0, ALL$J))), ]
  ALL$SCS <- ifelse(is.na(ALL$X), ALL$Y, ALL$X)
  ALL
}
Your Example:
ShortestCommonSupersequence(X = c("a","g","b","h","a","g","c"),
                            Y = c("a","g","b","a","g","b","h","c"))
#    J  I    X    Y SCS
# 1  1  1    a    a   a
# 2  2  2    g    g   g
# 3  3  3    b    b   b
# 9 NA  4    h <NA>   h
# 4  4  5    a    a   a
# 5  5  6    g    g   g
# 6  6 NA <NA>    b   b
# 7  7 NA <NA>    h   h
# 8  8  7    c    c   c
(where the two updated vectors are in columns X and Y.)