How to split an R data frame into vectors (unbind)

I'm relatively new to R and have been trying to find a solution to this problem for a while. Essentially I want to do the reverse of rbind: split an entire data frame (while preserving the original) into separate vectors, using the row.names as the source of the new variable names.
So if I have a data.frame like this:
Col1 Col2 Col3
Row1 A B C
Row2 D E F
Row3 G H I
I would like the end result to be separate vectors:
Row1 = A B C
Row2 = D E F
Row3 = G H I
I know I could subset specific rows out of the data.frame, but I want all of them separated. In terms of methodology, I figured I could use a for loop to move each row into a vector, but I wasn't sure how to use the row names as the variable names.

You can split the dataset by row after converting to matrix, set the names (setNames) of the list elements as 'Row1:Row3' and use list2env to assign the objects in the global environment.
list2env(setNames(split(as.matrix(df), row(df)),
                  paste0("Row", 1:3)), envir = .GlobalEnv)
Row1
#[1] "A" "B" "C"
Row2
#[1] "D" "E" "F"
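If you would rather not write into the global environment, the same idea can keep everything in a named list instead (a small variant, using the data frame's own row names rather than hard-coding them):
rowList <- setNames(split(as.matrix(df), row(df)), row.names(df))
rowList$Row3
#[1] "G" "H" "I"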

A slightly different approach from @akrun's (note that matrix() fills by column, so the rows of Df below differ from the question's example):
Df <- data.frame(matrix(LETTERS[1:9],nrow=3))
##
R> ls()
[1] "Df"
##
sapply(1:nrow(Df), function(x) {
  assign(paste0("Row", row.names(Df)[x]),
         value = Reduce(function(x, y) c(x, y), Df[x, ]),
         envir = .GlobalEnv)
})
##
R> ls()
[1] "Df" "Row1" "Row2" "Row3"
R> Row1
[1] "A" "D" "G"
R> Row2
[1] "B" "E" "H"
R> Row3
[1] "C" "F" "I"
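And, to answer the for-loop idea from the question directly, a minimal sketch (assuming df is the question's original data frame, with character columns and row names Row1:Row3):
for (i in seq_len(nrow(df))) {
  ## assign() takes the variable name as a character string,
  ## so the row name can be used directly
  assign(row.names(df)[i], unname(unlist(df[i, ])))
}
Row3
#[1] "G" "H" "I"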

Related

What's the R function used to find unique and distinct values in a column? [duplicate]

I have multiple observations of one species with different observers / groups of observers and want to create a list of all unique observers. My data look like this:
data <- read.table(text="species observer
1 A,B
1 A,B
1 B,E
1 B,E
1 D,E,A,C,C
1 F" , header = TRUE, stringsAsFactors = FALSE)
My output should return a list of all unique observers - so:
A,B,C,D,E,F
I tried splitting the values in the observer column using the following command, but that only returns the unique combinations of observers.
all_observers <- unique(strsplit(as.character(data$observer), ","))
all_observers
[[1]]
[1] "A" "B"
[[2]]
[1] "B" "E"
[[3]]
[1] "D" "E" "A" "C" "C"
[[4]]
[1] "F"
You're almost there, you just need to unlist before you do the unique:
all_observers <- unique(unlist(strsplit(as.character(data$observer), ",")))
We can use separate_rows on 'observer', get the distinct rows, group by 'species', and paste the 'observer' values together:
library(tidyverse)
data %>%
  separate_rows(observer) %>%
  distinct %>%
  group_by(species) %>%
  summarise(observer = toString(observer))
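Run on the example data, this should give a one-row result along the lines of:
#  species observer
#1       1 A, B, E, D, C, F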
You could also use scan()
unique(scan(text=data$observer, what="", sep=","))
# Read 14 items
# [1] "A" "B" "E" "D" "C" "F"

Split string in each column for several columns

I have this table (data1) with four columns
SNP rs6576700 rs17054099 rs7730126
sample1 G-G T-T G-G
I need to separate columns 2-4 into two columns each, so the new output has 7 columns, like this:
SNP rs6576700 rs6576700 rs17054099 rs17054099 rs7730126 rs7730126
sample1 G G T T G G
With the following function I can split all the columns at once, but the output is not what I need:
split <- function(x) {
  x <- as.character(x)
  strsplit(as.character(x), split = "-")
}
data2=apply(data1[,-1], 2, split)
data2
$rs6576700
$rs6576700[[1]]
[1] "G" "G"
$rs17054099
$rs17054099[[1]]
[1] "T" "T"
$rs7730126
$rs7730126[[1]]
[1] "G" "G"
On Stack Overflow I found a method to convert the output of strsplit to a data frame, but the rs numbers end up in rows, not in columns (I got similar output with other methods in this thread: strsplit by row and distribute results by column in data.frame).
> n <- max(sapply(data2, length))
> l <- lapply(data2, function(X) c(X, rep(NA, n - length(X))))
> data.frame(t(do.call(cbind, l)))
t.do.call.cbind..l..
rs6576700 G, G
rs17054099 T, T
rs7730126 G, G
If I do not use the transpose (i.e. drop the t() in ...t(do.call...), the output is a list that I cannot write to a file.
I would like to have the solution in R to make it part of a pipeline.
I forgot to say that I need to apply this to a million columns.
This is straightforward using the splitstackshape::cSplit function. Just specify the column indices in the splitCols argument and the separator in the sep argument, and you're done. It will even number your new column names so you can distinguish between them. I've specified type.convert = FALSE so that T values won't become TRUE. The default direction is wide, so you don't need to specify it.
library(splitstackshape)
cSplit(data1, 2:4, sep = "-", type.convert = FALSE)
# SNP rs6576700_1 rs6576700_2 rs17054099_1 rs17054099_2 rs7730126_1 rs7730126_2
# 1: sample1 G G T T G G
Here's a solution, as per the provided link, using the tstrsplit function from the development version of data.table on GitHub. Here we define the index by subsetting the column names first, and then number the new columns using paste0. This is a bit more cumbersome, but its advantage is that it updates your original data in place instead of creating a copy of the whole data set.
library(data.table) ## V1.9.5+
indx <- names(data1)[2:4]
setDT(data1)[, paste0(rep(indx, each = 2), 1:2) := sapply(.SD, tstrsplit, "-"), .SDcols = indx]
data1
# SNP rs6576700 rs17054099 rs7730126 rs65767001 rs65767002 rs170540991 rs170540992 rs77301261 rs77301262
# 1: sample1 G-G T-T G-G G G T T G G
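If you also want to drop the original combined columns afterwards, one way (a sketch, reusing the indx defined above) is:
data1[, (indx) := NULL]
data1
# SNP rs65767001 rs65767002 rs170540991 rs170540992 rs77301261 rs77301262
# 1: sample1 G G T T G G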
Here you want to use apply over the rows instead of columns:
df <- rbind(c("SNP", "rs6576700", "rs17054099", "rs7730126"),
            c("sample1", "G-G", "T-T", "G-G"),
            c("sample2", "C-C", "T-T", "G-C"))
t(apply(df[-1, ], 1, function(row) unlist(strsplit(row, "-"))))
# [,1] [,2] [,3] [,4] [,5] [,6] [,7]
#[1,] "sample1" "G" "G" "T" "T" "G" "G"
#[2,] "sample2" "C" "C" "T" "T" "G" "C"
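If you also want the header row back as (repeated) column names, a small follow-up sketch on the df defined above:
out <- t(apply(df[-1, ], 1, function(row) unlist(strsplit(row, "-"))))
colnames(out) <- c(df[1, 1], rep(df[1, -1], each = 2))
out
#     SNP       rs6576700 rs6576700 rs17054099 rs17054099 rs7730126 rs7730126
#[1,] "sample1" "G"       "G"       "T"        "T"        "G"       "G"
#[2,] "sample2" "C"       "C"       "T"        "T"        "G"       "C"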

rbinding list of matrices while keeping their names using Reduce

I have a list
myList <- list(matrix(letters[1:4], nrow=2), matrix(letters[5:8], nrow=2))
names(myList) <- c("xx", "yy")
I want to rbind this list of matrix, along with the names xx and yy, using Reduce. The problem I have is that Reduce goes directly to myList[[i]] so it loses the names if I pass myList directly. I'm guessing the solution is some combination of creating more 'layers' and clever use of [, but I can't seem to figure it out.
The desired output is
"xx"
"a" "c"
"b" "d"
"yy"
"e" "g"
"f" "h"
library(MASS)
for (nm in names(myList)) { cat(nm, "\n"); write.matrix(myList[[nm]]) }
xx
a c
b d
yy
e g
f h
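If you specifically want a single stacked object built with Reduce, as the question asks, one sketch is to interleave each name as an extra row before stacking (the empty string just pads the name row out to the matrix width):
Reduce(rbind, Map(function(nm, m) rbind(c(nm, ""), m), names(myList), myList))
#     [,1] [,2]
#[1,] "xx" ""
#[2,] "a"  "c"
#[3,] "b"  "d"
#[4,] "yy" ""
#[5,] "e"  "g"
#[6,] "f"  "h"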

How do I group by a variable and list by a random order in data.table?

I have a variable that I want to group by. That is easy. However, I want the resultant table to list its rows in random order. What I actually want to do is a little more complicated, but allow me to show you a simplified version.
library(data.table)
mydf = data.table(
  x = rep(1:4, each = 5),
  y = rep(c('A', 'B', 'c', 'D', 'E'), times = 2),
  v = rpois(20, 30)
)
mydf[,list(sum(x),sum(v)), by=y]
mydf[,list(sum(x),sum(v)), by=list(y=sample(y))]
#to list all the raw data in order of y
mydf[,list(x,v), by=y]
mydf[,list(x,v), by=list(y=sample(y))]
If you look at the resultant outputs you will notice that the y is indeed in random order but it has become unhinged from the data that was in the rows with it.
What can I do?
I would do the operation and then order randomly:
mydf[,list(x,v),by=y][sample(seq_len(nrow(mydf)),replace=FALSE)]
EDIT: Random reordering, after grouping:
mydf[,list(sum(x),sum(v)), by=y][sample(seq_len(length(y)),replace=FALSE)]
You can do something like this to randomize the order before grouping, and it looks like it does preserve the changed order:
mydf[order(setNames(sample(unique(y)),unique(y))[y])]
mydf[order(setNames(sample(unique(y)),unique(y))[y]),list(sum(x),sum(v)),by=y]
#perhaps more readable:
mydf[{z <- unique(y); order(setNames(sample(z),z)[y])}]
mydf[{z <- unique(y); order(setNames(sample(z),z)[y])},list(sum(x),sum(v)),by=y]
This is more transparent by adding a column first before ordering.
mydf[,new.y := setNames(sample(unique(y)),unique(y))[y]][order(new.y)]
Breaking it down:
##a random ordering of the elements of y
##(set.seed is used here to get consistent results)
set.seed(1); mydf[,{z <- unique(y);sample(z)}]
# [1] "B" "E" "D" "c" "A"
##assigning names to the elements of y
##creating a one-to-one mapping between the elements of y
set.seed(1); mydf[,{z <- unique(y);setNames(sample(z),z)}]
# A B c D E
#"B" "E" "D" "c" "A"
##subsetting by y puts y through the map:
##in effect every element of y is replaced by an element of y picked at random
##notice that the names (top row) are the original y values
##and the values (bottom row) are the mapped-to values
set.seed(1); mydf[,{z <- unique(y);setNames(sample(z),z)[y]}]
# A B c D E A B c D E A B c D E A B c D E
#"B" "E" "D" "c" "A" "B" "E" "D" "c" "A" "B" "E" "D" "c" "A" "B" "E" "D" "c" "A"
##ordering by this now orders by the mapped-to values
set.seed(1); mydf[{z <- unique(y);order(setNames(sample(z),z)[y])}]
EDIT: Incorporating Arun's suggestion in the comments to use setattr to set the names:
mydf[{z <- unique(y); order(setattr(sample(z),'names',z)[y])}]
mydf[{z <- unique(y); order(setattr(sample(z),'names',z)[y])},list(sum(x),sum(v)),by=y]
I think this is what you're looking for...?
mydf[,.SD[sample(.N)],by=y]
Inspired by @BlueMagister's second solution, here's the randomize-first way:
mydf[sample(nrow(mydf)),.SD,by=y]
Here, use keyby instead of by if you want the groups to appear in alphabetical order.
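Applied to the grouped sums from the question, the same randomize-first idea would be something along the lines of:
mydf[sample(nrow(mydf)), list(sum(x), sum(v)), by = y]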
