rbinding list of matrices while keeping their names using Reduce - r

I have a list
myList <- list(matrix(letters[1:4], nrow=2), matrix(letters[5:8], nrow=2))
names(myList) <- c("xx", "yy")
I want to rbind this list of matrices, along with the names xx and yy, using Reduce. The problem I have is that Reduce goes directly to myList[[i]], so it loses the names if I pass myList directly. I'm guessing the solution is some combination of creating more 'layers' and clever use of [, but I can't seem to figure it out.
The desired output is
"xx"
"a" "c"
"b" "d"
"yy"
"e" "g"
"f" "h"

library(MASS)
for( nm in names(myList)){ cat(nm,"\n"); write.matrix(myList[[nm]]) }
xx
a c
b d
yy
e g
f h
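Since the question asks specifically about Reduce, here is a minimal sketch that reduces over the names of the list rather than over the list itself, so each step can look up both the name and the matrix. It assumes every matrix has the same number of columns. Padding the name row with empty strings is an assumption made here so that rbind gets rows of equal width, and add_block is a hypothetical helper name, not something from the original answer.

```r
myList <- list(xx = matrix(letters[1:4], nrow = 2),
               yy = matrix(letters[5:8], nrow = 2))

# Hypothetical helper: append a name row, then the matrix stored under that name.
add_block <- function(acc, nm) {
  m <- myList[[nm]]
  name_row <- c(nm, rep("", ncol(m) - 1))  # pad so the row is as wide as m
  rbind(acc, name_row, m, deparse.level = 0)
}

# Start from NULL as the initial value; rbind() ignores NULL arguments.
out <- Reduce(add_block, names(myList), NULL)
out
```

The result is a single 6 x 2 character matrix, which may be easier to pass on to other code than printed output.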

Related

name character vectors with same name of list

I have a list that looks like this.
my_list <- list(Y = c("p", "q"), K = c("s", "t", "u"))
I want to name each list element (the character vectors) with the name of the list they are in. All elements of the same vector must have the same name.
I was able to write this function that works on a single list element
name_vector <- function(x){
names(x[[1]]) <- rep(names(x[1]), length(x[[1]]))
return(x)
}
> name_vector(my_list[1])
$Y
Y Y
"p" "q"
But I can't find a way to vectorize it. If I run it with an apply function, it just returns the list unchanged:
> lapply(my_list, name_vector)
$Y
[1] "p" "q"
$K
[1] "s" "t" "u"
My desired output for my_list is a named vector
Y Y K K K
"p" "q" "s" "t" "u"
We can unlist the list and set the names by replicating each list name according to the length of its element:
setNames(unlist(my_list), rep(names(my_list), lengths(my_list)))
Or stack into a two column data.frame, extract the 'values' column and name it with 'ind'
with(stack(my_list), setNames(values, ind))
If your list names don't end with digits, you can unlist and then strip the numeric suffixes that unlist adds to the names:
vec <- unlist(my_list)
names(vec) <- sub("\\d+$","",names(vec))
vec
# Y Y K K K
# "p" "q" "s" "t" "u"
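To see why the sub() approach needs names that don't end with digits, the sketch below prints the names that unlist() generates and checks that stripping the trailing digits gives the same result as replicating the names explicitly. It reuses my_list from the question; nothing else is assumed.

```r
my_list <- list(Y = c("p", "q"), K = c("s", "t", "u"))

# unlist() builds each name from the list name plus a position suffix
names(unlist(my_list))

# Stripping the trailing digits recovers the list names
vec <- unlist(my_list)
names(vec) <- sub("\\d+$", "", names(vec))

# Same result as replicating the names explicitly
identical(vec, setNames(unlist(my_list), rep(names(my_list), lengths(my_list))))
```

A list name such as "K2" would lose its own trailing digit under sub(), which is why the setNames() version is the safer of the two.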

Extract distinct characters that differ between two strings

I have two strings, a <- "AERRRTX"; b <- "TRRA" .
I want to extract the characters in a not used in b, i.e. "ERX"
I tried the answer in "Extract characters that differ between two strings", which uses setdiff. It returns "EX", because b does have "R" and setdiff will eliminate all three "R"s in a. My aim is to treat each character as distinct, so only two of the three R's in a should be eliminated.
Any suggestions on what I can use instead of setdiff, or some other approach to achieve my output?
A different approach using pmatch,
a1 <- unlist(strsplit(a, ""))
b1 <- unlist(strsplit(b, ""))
a1[!1:length(a1) %in% pmatch(b1, a1)]
#[1] "E" "R" "X"
Another example,
a <- "Ronak";b<-"Shah"
a1 <- unlist(strsplit(a, ""))
b1 <- unlist(strsplit(b, ""))
a1[!1:length(a1) %in% pmatch(b1, a1)]
# [1] "R" "o" "n" "k"
You can use the function vsetdiff from vecsets package
install.packages("vecsets")
library(vecsets)
a <- "AERRRTX"
b <- "TRRA"
Reduce(vsetdiff, strsplit(c(a, b), split = ""))
## [1] "E" "R" "X"
We can use Reduce() to successively eliminate from a each character found in b:
a <- 'AERRRTX'; b <- 'TRRA';
paste(collapse='',Reduce(function(as,bc) as[-match(bc,as,nomatch=length(as)+1L)],strsplit(b,'')[[1L]],strsplit(a,'')[[1L]]));
## [1] "ERX"
This will preserve the order of the surviving characters in a.
Another approach is to mark each character with its occurrence index in a, do the same for b, and then we can use setdiff():
a <- 'AERRRTX'; b <- 'TRRA';
pasteOccurrence <- function(x) ave(x,x,FUN=function(x) paste0(x,seq_along(x)));
paste(collapse='',substr(setdiff(pasteOccurrence(strsplit(a,'')[[1L]]),pasteOccurrence(strsplit(b,'')[[1L]])),1L,1L));
## [1] "ERX"
An alternative using the data.table package:
library(data.table)
x = data.table(table(strsplit(a, '')[[1]]))
y = data.table(table(strsplit(b, '')[[1]]))
dt = y[x, on='V1'][,N:=ifelse(is.na(N),0,N)][N!=i.N,res:=i.N-N][res>0]
rep(dt$V1, dt$res)
#[1] "E" "R" "X"
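For comparison, here is a base R sketch that wraps the occurrence-index idea in a small function and contrasts it with plain setdiff(). The name char_diff is hypothetical, and the function assumes both inputs are single non-empty strings.

```r
# Hypothetical helper: characters of a left over after removing, one for one,
# the characters of b (a multiset difference).
char_diff <- function(a, b) {
  a1 <- strsplit(a, "")[[1]]
  b1 <- strsplit(b, "")[[1]]
  # occurrence number of each character within its own string
  occ <- function(x) ave(seq_along(x), x, FUN = seq_along)
  keep <- !(paste(a1, occ(a1)) %in% paste(b1, occ(b1)))
  paste(a1[keep], collapse = "")
}

# setdiff() treats the inputs as sets, so every R disappears
setdiff(strsplit("AERRRTX", "")[[1]], strsplit("TRRA", "")[[1]])

char_diff("AERRRTX", "TRRA")  # "ERX"
char_diff("Ronak", "Shah")    # "Ronk"
```

Like the Reduce() version, this keeps the surviving characters in their original order.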

Split string in each column for several columns

I have this table (data1) with four columns
SNP rs6576700 rs17054099 rs7730126
sample1 G-G T-T G-G
I need to separate columns 2-4 into two columns each, so that the new output has 7 columns, like this:
SNP rs6576700 rs6576700 rs17054099 rs17054099 rs7730126 rs7730126
sample1 G G T T G G
With the following function I could split all columns at the same time, but the output is not what I need.
split <- function(x){
x <- as.character(x)
strsplit(as.character(x), split="-")
}
data2=apply(data1[,-1], 2, split)
data2
$rs17054099
$rs17054099[[1]]
[1] "T" "T"
$rs7730126
$rs7730126[[1]]
[1] "G" "G"
$rs6576700
$rs6576700[[1]]
[1] "C" "C"
In Stack Overflow I found a method to convert the output of strsplit to a dataframe but the rs numbers are in rows not in columns (I got a similar output with other methods in this thread strsplit by row and distribute results by column in data.frame)
> n <- max(sapply(data2, length))
> l <- lapply(data2, function(X) c(X, rep(NA, n - length(X))))
> data.frame(t(do.call(cbind, l)))
t.do.call.cbind..l..
rs17054099 T, T
rs7730126 G, G
rs2061700 C, C
If I do not use the function transpose (...(t(do.call...), the output is a list that I cannot write to a file.
I would like to have the solution in R to make it part of a pipeline.
I forgot to say that I need to apply this to a million columns.
This is straightforward using the splitstackshape::cSplit function. Just specify the column indices in the splitCols parameter and the separator in the sep parameter, and you're done. It will even number your new column names so you can distinguish between them. I've specified type.convert = FALSE so T values won't become TRUE. The default direction is wide, so you don't need to specify it.
library(splitstackshape)
cSplit(data1, 2:4, sep = "-", type.convert = FALSE)
# SNP rs6576700_1 rs6576700_2 rs17054099_1 rs17054099_2 rs7730126_1 rs7730126_2
# 1: sample1 G G T T G G
Here's a solution, as per the provided link, using the tstrsplit function from the devel version of data.table on GH. Here, we define the index by subsetting the column names first, and then we number the new columns using paste0. This is a somewhat more cumbersome approach, but its advantage is that it updates your original data in place instead of creating a copy of the whole data.
library(data.table) ## V1.9.5+
indx <- names(data1)[2:4]
setDT(data1)[, paste0(rep(indx, each = 2), 1:2) := sapply(.SD, tstrsplit, "-"), .SDcols = indx]
data1
# SNP rs6576700 rs17054099 rs7730126 rs65767001 rs65767002 rs170540991 rs170540992 rs77301261 rs77301262
# 1: sample1 G-G T-T G-G G G T T G G
Here you want to use apply over the rows instead of columns:
df <- rbind(c("SNP", "rs6576700", "rs17054099", "rs7730126"),
c("sample1", "G-G", "T-T", "G-G"),
c("sample2", "C-C", "T-T", "G-C"))
t(apply(df[-1,], 1, function(col) unlist(strsplit(col, "-"))))
# [,1] [,2] [,3] [,4] [,5] [,6] [,7]
#[1,] "sample1" "G" "G" "T" "T" "G" "G"
#[2,] "sample2" "C" "C" "T" "T" "G" "C"
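If you would rather stay in base R and keep the rs identifiers as column names, something along these lines may work. It is only a sketch: it assumes every cell holds exactly two alleles separated by "-", rebuilds data1 from the values shown in the question, and has not been tuned for a million columns.

```r
data1 <- data.frame(SNP = "sample1", rs6576700 = "G-G",
                    rs17054099 = "T-T", rs7730126 = "G-G",
                    stringsAsFactors = FALSE)

# For each genotype column, build a two-column character matrix of alleles
parts <- lapply(data1[-1], function(col) {
  do.call(rbind, strsplit(as.character(col), "-", fixed = TRUE))
})

# Bind the pieces side by side and put the sample column back in front
out <- data.frame(SNP = data1$SNP, do.call(cbind, parts),
                  stringsAsFactors = FALSE)
names(out)[-1] <- paste0(rep(names(parts), each = 2), "_", 1:2)
out
```

Because nothing is type-converted, a "T" allele stays a character value rather than becoming TRUE.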

How to split a R data frame into vectors (unbind)

I'm relatively new to R and have been trying to find a solution to this problem for a while. I am trying to take a data frame and essentially do the reverse of rbind, so that I can split an entire data frame (and hopefully preserve the original data frame) into separate vectors, and use the row.names as the source of the new variable names.
So if I have a data.frame like this:
Col1 Col2 Col3
Row1 A B C
Row2 D E F
Row3 G H I
I would like the end result to be separate vectors:
Row1 = A B C
Row2 = D E F
Row3 = G H I
I know I could subset specific rows out of the data.frame, but I want all of them separated. In terms of methodology, I could use a for loop to move each row into a vector, but I wasn't sure how to assign the row names as variable names.
You can split the dataset by row after converting to matrix, set the names (setNames) of the list elements as 'Row1:Row3' and use list2env to assign the objects in the global environment.
list2env(setNames(split(as.matrix(df),
row(df)), paste0("Row",1:3)), envir=.GlobalEnv)
Row1
#[1] "A" "B" "C"
Row2
#[1] "D" "E" "F"
A slightly different approach than @akrun's:
Df <- data.frame(matrix(LETTERS[1:9],nrow=3))
##
R> ls()
[1] "Df"
##
sapply(1:nrow(Df), function(x){
assign(paste0("Row",row.names(Df)[x]),
value=Reduce(function(x,y){c(x,y)},Df[x,]),
envir=.GlobalEnv)
})
##
R> ls()
[1] "Df" "Row1" "Row2" "Row3"
R> Row1
[1] "A" "D" "G"
R> Row2
[1] "B" "E" "H"
R> Row3
[1] "C" "F" "I"
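If separate global variables are not strictly required, a named list is often easier to work with. The sketch below rebuilds the data frame from the question (the construction of df is an assumption, since the question only shows its printed form) and stores each row as a plain vector under its row name.

```r
df <- data.frame(Col1 = c("A", "D", "G"),
                 Col2 = c("B", "E", "H"),
                 Col3 = c("C", "F", "I"),
                 row.names = c("Row1", "Row2", "Row3"),
                 stringsAsFactors = FALSE)

m <- as.matrix(df)

# One list element per row, named after the row
rows <- lapply(setNames(rownames(m), rownames(m)),
               function(r) unname(m[r, ]))

rows$Row1
rows[["Row3"]]
```

Passing rows to list2env(), as in the first answer, would still turn the elements into separate variables; the original df is left untouched either way.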

How do I group by a variable and list by a random order in data.table?

I have a variable that I want to group by. That is easy. However, I want the resultant table to list its rows by random order. What I actually want to do is a little more complicated. But allow me to show you a simplified version.
library(data.table)
mydf = data.table(
x = rep(1:4, each = 5),
y = rep(c('A', 'B','c','D', 'E'), times = 4), # 20 values, one per row
v = rpois(20, 30)
)
mydf[,list(sum(x),sum(v)), by=y]
mydf[,list(sum(x),sum(v)), by=list(y=sample(y))]
#to list all the raw data in order of y
mydf[,list(x,v), by=y]
mydf[,list(x,v), by=list(y=sample(y))]
If you look at the resultant outputs you will notice that the y is indeed in random order but it has become unhinged from the data that was in the rows with it.
What can I do?
I would do the operation and then order randomly:
mydf[,list(x,v),by=y][sample(seq_len(nrow(mydf)),replace=FALSE)]
EDIT: Random reordering, after grouping:
mydf[,list(sum(x),sum(v)), by=y][sample(seq_len(length(y)),replace=FALSE)]
You can do something like this to randomize the order of the groups before grouping, and it looks like it does preserve the changed order:
mydf[order(setNames(sample(unique(y)),unique(y))[y])]
mydf[order(setNames(sample(unique(y)),unique(y))[y]),list(sum(x),sum(v)),by=y]
#perhaps more readable:
mydf[{z <- unique(y); order(setNames(sample(z),z)[y])}]
mydf[{z <- unique(y); order(setNames(sample(z),z)[y])},list(sum(x),sum(v)),by=y]
This is more transparent if you add a column first and then order by it.
mydf[,new.y := setNames(sample(unique(y)),unique(y))[y]][order(new.y)]
Breaking it down:
##a random ordering of the elements of y
##(set.seed is used here to get consistent results)
set.seed(1); mydf[,{z <- unique(y);sample(z)}]
# [1] "B" "E" "D" "c" "A"
##assigning names to the elements of y
##creating a 1-1 bijective function between the elements of y
set.seed(1); mydf[,{z <- unique(y);setNames(sample(z),z)}]
# A B c D E
#"B" "E" "D" "c" "A"
##subsetting by y puts y through the map
##in effect every element of y is posing as an element of y, picked at random
##notice that the names (top row) are the original y
##the values (bottom row) are the mapped-to values
set.seed(1); mydf[,{z <- unique(y);setNames(sample(z),z)[y]}]
# A B c D E A B c D E A B c D E A B c D E
#"B" "E" "D" "c" "A" "B" "E" "D" "c" "A" "B" "E" "D" "c" "A" "B" "E" "D" "c" "A"
##ordering by this now orders by the mapped-to values
set.seed(1); mydf[{z <- unique(y);order(setNames(sample(z),z)[y])}]
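The same mapping trick can be tried outside data.table. The following base R sketch uses a plain vector y built like the column in the question and checks that ordering by the mapped-to labels keeps each group's rows together; the variable names map and ord are only illustrative.

```r
set.seed(1)  # only to make the shuffle reproducible
y <- rep(c("A", "B", "c", "D", "E"), times = 4)

z <- unique(y)
map <- setNames(sample(z), z)  # each label is sent to a randomly chosen label
ord <- order(map[y])           # sort rows by the label they were mapped to

y[ord]                         # groups stay contiguous, in shuffled order
rle(y[ord])$lengths            # every group still has 4 rows
```

Using ord to index the rows of a table reorders whole rows, so x and v stay attached to their y.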
EDIT: Incorporating Arun's suggestion in the comments to use setattr to set the names:
mydf[{z <- unique(y); order(setattr(sample(z),'names',z)[y])}]
mydf[{z <- unique(y); order(setattr(sample(z),'names',z)[y])},list(sum(x),sum(v)),by=y]
I think this is what you're looking for...?
mydf[,.SD[sample(.N)],by=y]
Inspired by @BlueMagister's second solution, here's the randomize-first way:
mydf[sample(nrow(mydf)),.SD,by=y]
Here, use keyby instead of by if you want the groups to appear in alphabetical order.
