I have 3 vectors, each containing more than 1500 character strings, which I want to compare pairwise to get the number of matching elements between them.
A reduced sample of my data:
va <- c("6a460daf68eb0410b51d79e495fbccc7", "e1b32017108e17e41bdabc44bac4df3c", "6ac1327da92d8584008db04b4eaf62d0", "b01a2322e2ca99315646d79cf157cb20", "12dadc27059ea5d3c8cc54e9a28cc4f6", "be73c9685b743a646f2eb0480eee2f8d")
vb <- c("6a460daf68eb0410b51d79e495fbccc7", "e1b32017108e17e41bdabc44bac4df3c","JQ183785.1.1345", "DQ794886.1.1390", "HQ791014.1.1450", "EU764755.1.1328")
vc <- c("6a460daf68eb0410b51d79e495fbccc7", "JQ183785.1.1345", "DQ794886.1.1390", "HQ791014.1.1450", "b01a2322e2ca99315646d79cf157cb20", "EF532786.1.1364")
I have made a function for outputting the number of coincident elements between two vectors:
sharing <- function(v1, v2, share=TRUE){
if(isTRUE(share)){sh <- length(v1[ v1 %in% v2])}
else if (isFALSE(share)){sh <- length(v1[ ! v1 %in% v2])}
return(sh)
}
So, applying this function 9 times (one for each pairwise comparison including self-comparison), I would be able to get 9 numbers with shared elements:
> sharing(va,va); sharing(va,vb); sharing(va,vc)
[1] 6
[1] 2
[1] 2
> sharing(vb,va); sharing(vb,vb); sharing(vb,vc)
[1] 2
[1] 6
[1] 4
> sharing(vc,va); sharing(vc,vb); sharing(vc,vc)
[1] 2
[1] 4
[1] 6
But I would like to get this as a matrix:
   va vb vc
va  6  2  2
vb  2  6  4
vc  2  4  6
Is there any premade function or code which can make this?
Thanks for the help!
One option is to use outer to apply the sharing function to each pairwise combination of the vectors, stored in a list ('lst1'):
lst1 <- mget(paste0("v", letters[1:3])) # placed the vectors in a list
out <- outer(lst1, lst1, FUN = Vectorize(sharing)) #apply the sharing
dimnames(out) <- list(names(lst1), names(lst1)) # set the dim names
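For comparison, the same matrix can be built with two nested sapply calls and no Vectorize. This is only a sketch under assumptions: the short strings below are stand-ins for the real hashes, and sum(x %in% y) plays the role of sharing() with share = TRUE.

```r
# Short stand-in vectors (assumed data, not the original hashes)
va <- c("a", "b", "c", "d")
vb <- c("a", "b", "x", "y")
vc <- c("a", "x", "y", "d")
vecs <- list(va = va, vb = vb, vc = vc)

# For every pair of vectors, count the elements of one found in the other.
# The inner sapply returns a named vector, so the outer sapply simplifies
# the result to a matrix with row and column names taken from the list.
shared <- sapply(vecs, function(x) sapply(vecs, function(y) sum(x %in% y)))
shared
```

Because the list is named, the dimnames come for free. Note that the counts are only symmetric when no vector contains duplicated elements.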
There are many examples of how to concatenate columns element by element, but I can't find an example where columns are concatenated sequentially. I can write an example with a loop:
tst <- cbind.data.frame(c(1,2,3),c(4,5,6))
names(tst) <- c("A","B")
A B
1 1 4
2 2 5
3 3 6
vec <- c()
for (i in names(tst)){
vec <- c(vec,tst[,i])
}
vec
[1] 1 2 3 4 5 6
In other words, I want to create a vector with all the columns of the data frame appended one after the other.
The solution above works, but my question is: is there a way to do this without a loop?
Here, we can use unlist to convert to a vector
vec1 <- unlist(tst, use.names = FALSE)
identical(vec, vec1)
#[1] TRUE
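A small sketch of why use.names = FALSE is used here: a data frame is a list of columns, so unlist walks the columns in order, and by default it also builds element names from the column names.

```r
tst <- data.frame(A = c(1, 2, 3), B = c(4, 5, 6))

# With names: each element is labelled by column name plus position
named <- unlist(tst)
names(named)

# Without names: a plain numeric vector, columns appended one after the other
plain <- unlist(tst, use.names = FALSE)
plain
```

Dropping the names avoids the cost of constructing them and gives a vector that is identical to the one built by the loop.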
I have a two column dataframe of number pairs:
ODD <- c(1,1,1,3,3,3,5,7,7,9,9)
EVEN <- c(10,8,2,2,6,4,2,6,8,4,8)
dfPairs <- data.frame(ODD, EVEN)
> dfPairs
ODD EVEN
1 1 10
2 1 8
3 1 2
4 3 2
5 3 6
6 3 4
7 5 2
8 7 6
9 7 8
10 9 4
11 9 8
Each row of this dataframe is a pair of numbers, and I would like to find the longest possible numerically increasing combination of pairs. Conceptually, this is analogous to making a chain link of number pairs, with the added conditions that 1) links can only be formed using the same number and 2) the final chain must increase numerically. Visually, the program I am looking for will accomplish this:
For instance, row three is pair (1,2), which increases left to right. The next link in the chain would need to have a 2 in the EVEN column and increase right to left, such as row four (3,2). Then the pattern repeats, so the next link would need to have a 3 in the ODD column, and increase left to right, such as rows 5 or 6. The chain doesn't have to start at 1, or end at 9 - this was simply a convenient example.
If you try to make all possible linked pairs, you will find that many unique chains of various lengths are possible. I would like to find the longest possible chain. In my real data, I will likely encounter a situation in which more than one chain tie for the longest, in which case I would like all of these returned.
The final result should return the longest possible chain that meets these requirements as a dataframe, or a list of dataframes if more than one solution is possible, containing only the rows in the chain.
Thanks in advance. This one has been perplexing me all morning.
Edited to deal with df that does not start at 1 and returns maximum chains rather than chain lengths
Take advantage of graph data structure using igraph
Your data, dfPairs
ODD <- c(1,1,1,3,3,3,5,7,7,9,9)
EVEN <- c(10,8,2,2,6,4,2,6,8,4,8)
dfPairs <- data.frame(ODD, EVEN)
New data, dfTest
ODD <- c(3,3,3,5,7,7,9,9)
EVEN <- c(2,6,4,2,6,8,4,8)
dfTest <- data.frame(ODD, EVEN)
Make graph of your data. A key to my solution is to rbind the reverse (rev(dfPairs)) of the data frame to the original data frame. This will allow for building directional edges from odd numbers to even numbers. Graphs can be used to construct directional paths fairly easily.
library(igraph)
library(dplyr)
GPairs <- graph_from_data_frame(dplyr::arrange(rbind(setNames(dfPairs, c("X1", "X2")), setNames(rev(dfPairs), c("X1", "X2"))), X1))
GTest <- graph_from_data_frame(dplyr::arrange(rbind(setNames(dfTest, c("X1", "X2")), setNames(rev(dfTest), c("X1", "X2"))), X1))
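To see what the rbind/rev step produces before it is handed to igraph, here is a base-R sketch on a three-row subset of the data (the subset and the use of order() in place of dplyr::arrange are assumptions made to keep the example small and dependency-free).

```r
# Three of the original pairs: (1,10), (1,2), (3,2)
dfPairs <- data.frame(ODD = c(1, 1, 3), EVEN = c(10, 2, 2))

# Each pair as an ODD -> EVEN edge
forward <- setNames(dfPairs, c("X1", "X2"))
# rev() on a data frame reverses the column order, giving EVEN -> ODD edges
backward <- setNames(rev(dfPairs), c("X1", "X2"))

# Stack both directions and sort by the starting value
edges <- rbind(forward, backward)
edges <- edges[order(edges$X1), ]
edges
```

Every pair now appears once in each direction, so a path can step from 1 to 2 and then from 2 to 3 even though 2 only ever occurs in the EVEN column.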
Here are the first three elements of all_simple_paths(GPairs, 1) (starting at 1)
[[1]]
+ 2/10 vertices, named, from f8e4f01:
[1] 1 2
[[2]]
+ 3/10 vertices, named, from f8e4f01:
[1] 1 2 3
[[3]]
+ 4/10 vertices, named, from f8e4f01:
[1] 1 2 3 4
I create a function to 1) convert all simple paths to list of numeric vectors, 2) filter each numeric vector for only elements that satisfy left->right increasing, and 3) return the maximum chain of left->right increasing numeric vector
max_chain_only_increasing <- function(gpath) {
list_vec <- lapply(gpath, function(v) as.numeric(names(unclass(v)))) # convert to list of numeric vector
only_increasing <- lapply(list_vec, function(v) v[1:min(which(v >= dplyr::lead(v, default=tail(v, 1))))]) # subset vector for only elements that are left->right increasing
return(unique(only_increasing[lengths(only_increasing) == max(lengths(only_increasing))])) # return the longest chain(s), not just their length
}
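The truncation step inside the function is the least obvious part, so here is a sketch of it on a single made-up path. It uses a base-R shift as a stand-in for dplyr::lead(v, default = tail(v, 1)); the path values are an assumption chosen for illustration.

```r
# A hypothetical path that stops increasing after 6
v <- c(1, 2, 3, 6, 4, 9)

# Shift left by one, padding with the last value (same as the lead() call)
shifted <- c(v[-1], v[length(v)])

# First position whose successor is not larger; the padded last element
# guarantees that at least one position qualifies
cut <- min(which(v >= shifted))

increasing_prefix <- v[1:cut]
increasing_prefix
```

Only the strictly increasing prefix of each path survives, which is why a path that wanders back down contributes a shorter chain.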
This is the output of the above function using all paths that start from 1
max_chain_only_increasing(all_simple_paths(GPairs, 1))
# [[1]]
# [1] 1 2 3 6 7 8 9
Now I'll output the first few max chains, one for each unique starting element in dfPairs, your original data
start_vals <- sort(unique(unlist(dfPairs)))
# [1] 1 2 3 4 5 6 7 8 9 10
max_chains <- sapply(seq_len(length(start_vals)), function(i) max_chain_only_increasing(all_simple_paths(GPairs, i)))
names(max_chains) <- start_vals
# $`1`
# [1] 1 2 3 6 7 8 9
# $`2`
# [1] 2 3 6 7 8 9
# $`3`
# [1] 3 6 7 8 9
# $`4`
# [1] 4 9
# $`5`
# [1] 5
# etc
And finally with dfTest, the newer data
start_vals <- sort(unique(unlist(dfTest)))
max_chains <- sapply(seq_len(length(start_vals)), function(i) max_chain_only_increasing(all_simple_paths(GTest, i)))
names(max_chains) <- start_vals
# $`2`
# [1] 2 3 6 7 8 9
# $`3`
# [1] 3 6 7 8 9
# $`4`
# [1] 4 9
# $`5`
# [1] 5
# $`6`
# [1] 6 7 8 9
In spite of Cpak's efforts I ended up writing my own function to solve this. In essence I realized I could turn the right-to-left chain links into left-to-right ones by using this section of code from Cpak's answer:
output <- arrange(rbind(setNames(dfPairs, c("X1", "X2")), setNames(rev(dfPairs), c("X1", "X2"))), X1)
To ensure the resulting chains were sequential, I deleted all decreasing links:
output$increase <- with(output, ifelse(X2>X1, "Greater", "Less"))
output <- filter(output, increase == "Greater")
output <- select(output, -increase)
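The three lines above amount to keeping the rows where X2 is larger than X1. A minimal base-R sketch of the same filter, on a small made-up edge table (assumed data), is:

```r
# Hypothetical edge table containing both directions of two pairs
output <- data.frame(X1 = c(1, 2, 2, 3), X2 = c(2, 1, 3, 2))

# Keep only the links that increase from X1 to X2
increasing <- output[output$X2 > output$X1, ]
increasing
```

This skips the temporary "Greater"/"Less" column; with dplyr loaded, filter(output, X2 > X1) would express the same thing.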
I realized that if I split the dataframe output by unique values in X1, I could join each of these dataframes sequentially by joining the last column of the first dataframe to the first column of the next dataframe, which would create rows of sequentially increasing chains. The only problem I needed to resolve was the NAs in the last column of the merged dataframe. So I ended up splitting the joined dataframe after each merge, shifting the dataframe to remove the NAs, and rbinding the result back together.
This is the actual code:
out_split <- split(output, output$X1)
df_final <- Reduce(join_shift, out_split)
The function, join_shift, is this:
join_shift <- function(dtf1,dtf2){
abcd <- full_join(dtf1, dtf2, setNames(colnames(dtf2)[1], colnames(dtf1)[ncol(dtf1)]))
abcd[is.na(abcd)]<-0
colnames(abcd)[ncol(abcd)] <- "end"
# print(abcd)
abcd_na <- filter(abcd, end==0)
# print(abcd_na)
abcd <- filter(abcd, end != 0)
abcd_na <- abcd_na[moveme(names(abcd_na), "end first")]
# print(abcd_na)
names(abcd_na) <- names(abcd)
abcd<- rbind(abcd, abcd_na)
z <- length(colnames(abcd))
colnames(abcd)<- c(paste0("X", 1:z))
# print(abcd)
return(abcd)
}
Finally, I found there were a lot of columns that contained only zeros, so I wrote this to delete them and trim the final dataframe:
df_final_trim = df_final[,colSums(df_final) > 0]
Overall I'm happy with this. I imagine it could be a little more elegant, but it works on anything, including some rather huge and complicated data. This will produce ~ 241,700 solutions from a dataset of 700 pairs.
I also used a moveme function that I found on stackoverflow (see below). I employed it to move NA values around to achieve the shift aspect of the join_shift function.
moveme <- function (invec, movecommand) {
movecommand <- lapply(strsplit(strsplit(movecommand, ";")[[1]],
",|\\s+"), function(x) x[x != ""])
movelist <- lapply(movecommand, function(x) {
Where <- x[which(x %in% c("before", "after", "first",
"last")):length(x)]
ToMove <- setdiff(x, Where)
list(ToMove, Where)
})
myVec <- invec
for (i in seq_along(movelist)) {
temp <- setdiff(myVec, movelist[[i]][[1]])
A <- movelist[[i]][[2]][1]
if (A %in% c("before", "after")) {
ba <- movelist[[i]][[2]][2]
if (A == "before") {
after <- match(ba, temp) - 1
}
else if (A == "after") {
after <- match(ba, temp)
}
}
else if (A == "first") {
after <- 0
}
else if (A == "last") {
after <- length(myVec)
}
myVec <- append(temp, values = movelist[[i]][[1]], after = after)
}
myVec
}
I'm getting myself all tied in knots trying to understand what's going on with the code below. I'm trying to create a vector for each row in a data.frame and then append it to the original. I expected the code below to return a list of arrays. It appears to return a list of lists, where the inner list contains the array? How can I get what I want: a new column appended, with each element being an array?
df <- mtcars
library(foreach)
library(iterators)
df$x = foreach (row = iter(df, by='row')) %do% {
profile <- as.numeric(row[,c('mpg', 'cyl', 'disp')])
return(profile)
}
I'm expecting the result:
df[1,]$x == as.numeric(df[1,c('mpg', 'cyl', 'disp')])
instead I get
df[1,]$x[1] == as.numeric(df[1,c('mpg', 'cyl', 'disp')])
(where I'm using == to represent both collections are the same, I realize R probably doesn't implement a list equality operator this way)
By default, foreach returns a list with one element for each iteration. This is why you end up with the 'wrong' output. You can change this by using the .combine option in the foreach loop. If I understand you correctly, you wish to append row by row. This can be achieved by specifying .combine = 'rbind', which uses the familiar rbind function to combine the outputs of each loop iteration. If the order is irrelevant, you can also specify .inorder = FALSE, which may speed up the code when the loop runs in parallel. (TRUE is the default, so if the order is relevant, you don't need to bother.)
So try using foreach (row = iter(df, by='row'), .combine='rbind') %do% ... instead and see if it does the job.
This problem is not caused by foreach. You want to assign a vector to a single cell (element) of a data frame rather than to a whole column, so each vector has to be wrapped in a list.
For example.
df1 <- data.frame(x1=1:4, x2=letters[1:4], stringsAsFactors = FALSE)
df1$x1[1] <- 5:8
# Warning message:
# In df1$x1[1] <- 5:8 :
# number of items to replace is not a multiple of replacement length
df1
# x1 x2
# 1 5 a
# 2 2 b
# 3 3 c
# 4 4 d
df1$x1[1] <- list(5:8)
df1
# x1 x2
# 1 5, 6, 7, 8 a
# 2 2 b
# 3 3 c
# 4 4 d
df1$x1[1]
# [[1]]
# [1] 5 6 7 8
df1$x1[[1]]
# [1] 5 6 7 8
Actually, you should use [[ instead of [.
df[1, ]$x[[1]] == as.numeric(df[1,c('mpg', 'cyl', 'disp')])
# [1] TRUE TRUE TRUE
This is because list[1] is still a list, while list[[1]] extracts the first element of the list. See the example below.
lst1 <- list(x1=1:4, x2=letters[1:5])
lst1[1]
# $x1
# [1] 1 2 3 4
lst1[[1]]
# [1] 1 2 3 4
In addition, you can use:
df$x[[1]]
# [1] 21 6 160
instead of:
df[1, ]$x[[1]]
# [1] 21 6 160
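For reference, a list column like this can also be built without foreach. The following is only a sketch of one base-R way to do it, using lapply over row indices and unlist in place of as.numeric:

```r
df <- mtcars

# One numeric vector per row, collected in a list with nrow(df) elements;
# assigning that list with $<- creates a list column
df$x <- lapply(seq_len(nrow(df)), function(i) {
  unlist(df[i, c("mpg", "cyl", "disp")], use.names = FALSE)
})

# [[ extracts the vector itself; [ would return a one-element list
first_profile <- df$x[[1]]
first_profile
```

Each cell of df$x holds a whole vector, which is why double brackets are needed to get at the numbers.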
If I split my data matrix into rows according to class labels in another vector y like this, the result is something with 'names' like this:
> X <- matrix(c(1,2,3,4,5,6,7,8),nrow=4,ncol=2)
> y <- c(1,3,1,3)
> X_split <- split(as.data.frame(X),y)
$`1`
V1 V2
1 1 5
3 3 7
$`3`
V1 V2
2 2 6
4 4 8
I want to loop through the results and do some operations on each matrix, for example sum the elements or sum the columns. How do I access each matrix in a loop so I can do that?
labels = names(X_split)
for (k in labels) {
# How do I get X_split[k] as a matrix?
sum_class = sum(X_split[k]) # Doesn't work
}
In fact, I don't really want to deal with dataframes and named arrays at all. Is there a way I can call split without as.data.frame and get a list of matrices or something similar?
To split without converting to a data frame
X_split <- list(X[c(1, 3), ], X[c(2, 4), ])
More generally, to write it in terms of a vector y of length nrow(X), indicating the group to which each row belongs, you can write this as
X_split <- lapply(unique(y), function(i) X[y == i, ])
To sum the results
X_sum <- lapply(X_split, sum)
# [[1]]
# [1] 16
# [[2]]
# [1] 20
(or use sapply if you want the result as a vector)
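A variant of the same idea, sketched here as an assumption rather than part of the answer above, splits the row indices instead of the rows. That keeps the group labels as list names and, with drop = FALSE, guarantees every piece stays a matrix even when a group has a single row.

```r
X <- matrix(1:8, nrow = 4, ncol = 2)
y <- c(1, 3, 1, 3)

# split() the row numbers by label, then pull those rows out of X
X_split <- lapply(split(seq_len(nrow(X)), y),
                  function(rows) X[rows, , drop = FALSE])

# Each element can be reached by its label inside a loop
sums <- sapply(X_split, sum)
sums
```

Inside a for loop over names(X_split), X_split[[k]] (double brackets) returns the matrix for label k.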
Another option is not to split in the first place and just sum per y. Here's a possible data.table approach
library(data.table)
as.data.table(X)[, sum(sapply(.SD, sum)), by = y]
# y V1
# 1: 1 16
# 2: 3 20
Pretty sure operating directly on the matrix is most efficient:
tapply(rowSums(X),y,sum)
# 1 3
# 16 20
Is there any easy way to get the averages of items in a list based on their names? Example dataset:
sampleList <- list("a.1"=c(1,2,3,4,5), "b.1"=c(3,4,1,4,5), "a.2"=c(5,7,2,8,9), "b.2"=c(6,8,9,0,6))
sampleList
$a.1
[1] 1 2 3 4 5
$b.1
[1] 3 4 1 4 5
$a.2
[1] 5 7 2 8 9
$b.2
[1] 6 8 9 0 6
What I am trying to do is get column averages between similarly but not identically named rows, outputting a list with the column averages for the a's and b's. Currently I can do the following:
y <- names(sampleList)
y <- gsub("\\.1", "", y)
y <- gsub("\\.2", "", y)
y <- sort(unique(y))
sampleList <- t(as.matrix(as.data.frame(sampleList)))
t <- list()
for (i in 1:length(y)){
temp <- sampleList[grep(y[i], rownames(sampleList)),]
t[[i]] <- apply(temp, 2, mean)
}
t
[[1]]
[1] 3.0 4.5 2.5 6.0 7.0
[[2]]
[1] 4.5 6.0 5.0 2.0 5.5
As I have a large dataset with a large number of sets of similar names, is there an easier way to go about this?
EDIT: I've broken out the name issue into a separate question. It can be found here
Well, this is shorter. You didn't say exactly how big your actual data is, so I'm not going to make any promises, but the performance of this shouldn't be terrible:
dat <- do.call(rbind,sampleList)
grp <- substr(rownames(dat),1,1)
aggregate(dat,by = list(group = grp),FUN = mean)
(Edited to remove the unnecessary conversion to a data frame, which will incur a significant performance hit, probably.)
If your data is crazy big, or even just medium-big but the number of groups is fairly large so there are a small number of vectors in each group, the standard recommendation would be to investigate data.table once you've rbinded the data into a matrix.
I might do something like this:
# A *named* vector of patterns you want to group by
patterns <- c(start.a="^a",start.b="^b",start.c="^c")
# Find the locations of those patterns in your list
inds <- lapply(patterns, grep, x=names(sampleList))
# Calculate the mean of each list element that matches the pattern
out <- lapply(inds, function(i)
if(l <- length(i)) Reduce("+",sampleList[i])/l else NULL)
# Set the names of the output
names(out) <- names(patterns)
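If the groups should be derived from the names instead of listed by hand, one possible sketch strips the numeric suffix and splits the list on what remains. The assumption here is that every name has the form prefix.number, as in the example data.

```r
sampleList <- list("a.1" = c(1, 2, 3, 4, 5), "b.1" = c(3, 4, 1, 4, 5),
                   "a.2" = c(5, 7, 2, 8, 9), "b.2" = c(6, 8, 9, 0, 6))

# Group label = name with the trailing ".<digits>" removed
grp <- sub("\\.[0-9]+$", "", names(sampleList))

# Split the list by label, then average the vectors in each group elementwise
means <- lapply(split(sampleList, grp),
                function(vs) Reduce(`+`, vs) / length(vs))
means
```

This reuses the Reduce("+", ...) / length idea from the code above, but lets split handle the grouping, so new prefixes are picked up without editing a pattern vector.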