splitting vector every two indices - r

Given vector of N elements:
LETTERS[1:10]
[1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J"
How can one get a data.table/frame (df) as follows?
>df
one two
A B
C D
E F
G H
I J
EDIT
More generally, I would like to know how to split a given vector into groups of unequal length, as follows:
[A B C],[D E],[F G H I J]
and obtaining:
V1 V2 V3 V4 V5
A B C NA NA
D E NA NA NA
F G H I J

One option is the matrix way
as.data.frame(matrix(LETTERS[1:10], ncol=2,byrow=TRUE,
dimnames = list(NULL, c('one', 'two'))), stringsAsFactors=FALSE)
# one two
#1 A B
#2 C D
#3 E F
#4 G H
#5 I J
If we need to create an index, we can use gl to build the grouping, split the vector on it, and rbind the pieces
do.call(rbind, split(v1, as.integer(gl(length(v1), 2, length(v1)))))
where
v1 <- LETTERS[1:10]
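The matrix approach above assumes the vector length is a multiple of the number of columns. A minimal sketch of a helper that pads with NA first (the name vec_to_rows and its width argument are hypothetical, not from the answer):

```r
# Hypothetical helper: reshape a vector into rows of `width` elements,
# padding the last row with NA when the length is not a multiple of `width`.
vec_to_rows <- function(x, width = 2) {
  n_rows <- ceiling(length(x) / width)
  length(x) <- n_rows * width          # extends x with NA if needed
  matrix(x, ncol = width, byrow = TRUE)
}

m <- vec_to_rows(LETTERS[1:5])
m
```

With five letters this gives three rows, and the last cell is NA.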
Update
Based on the update in OP's post
lst <- split(v1, rep(1:3, c(3, 2, 5)))
do.call(rbind, lapply(lst, `length<-`, max(lengths(lst))))
# [,1] [,2] [,3] [,4] [,5]
#1 "A" "B" "C" NA NA
#2 "D" "E" NA NA NA
#3 "F" "G" "H" "I" "J"
Or otherwise
library(stringi)
stri_list2matrix(lst, byrow = TRUE)
Update2
If we are using a 'splitVec', i.e. a vector of the positions at which a new group starts
lst <- split(v1, cumsum(seq_along(v1) %in% splitVec))
and then proceed as above
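A small worked sketch of the 'splitVec' variant, assuming splitVec holds the positions where a new group begins (the values 4 and 6 are chosen here to reproduce the [A B C],[D E],[F G H I J] grouping from the question):

```r
v1 <- LETTERS[1:10]
# Assumption: splitVec holds the positions where a new group begins
splitVec <- c(4, 6)
# The running count of group starts gives one group id per element
grp <- cumsum(seq_along(v1) %in% splitVec)   # 0 0 0 1 1 2 2 2 2 2
lst <- split(v1, grp)
# Pad every group with NA up to the longest one, then bind as rows
res <- do.call(rbind, lapply(lst, `length<-`, max(lengths(lst))))
res
```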

Related

Within a list of vectors, convert each vector to a string then convert to dataframe in R

I have a list of vectors, j, that looks like this:
>j
[[1]]
[1] "a" "b" "c"
[[2]]
[1] "c" "c"
[[3]]
[1] "d" "d" "d" "a" "a"
.
.
.
I would like to transform this into a dataframe that has one column with each vector's contents concatenated together. So the column would look like:
Column_Name
1 a b c
2 c c
3 d d d a a
I have tried using the Replace() function as well as a loop, after which I would convert the result to a data frame:
for (x in 1:length(j)){
j[x] = paste(j[x], collapse = " ")
}
j <- data.frame(matrix(unlist(j), nrow=length(j), byrow=T))
Any guidance would be greatly appreciated.
Thank you.
As you have tried yourself, the collapse argument of paste is the key. Combine it with sapply and wrap the result in a data.frame:
# Toy data
set.seed(1)
j <- replicate(5, rep(sample(letters, 1), sample(1:10,1)))
print(j)
#[[1]]
#[1] "g" "g" "g" "g"
#
#[[2]]
# [1] "o" "o" "o" "o" "o" "o" "o" "o" "o" "o"
#
#[[3]]
#[1] "f" "f" "f" "f" "f" "f" "f" "f" "f"
#
#[[4]]
#[1] "y" "y" "y" "y" "y" "y" "y"
#
#[[5]]
#[1] "q"
# Collapse each element and wrap into a data.frame
res <- data.frame("Column_name" = sapply(j, paste, collapse = " "))
print(res)
# Column_name
#1 g g g g
#2 o o o o o o o o o o
#3 f f f f f f f f f
#4 y y y y y y y
#5 q
sapply applies paste to each element of the list, producing a character vector of the concatenated list elements. The data.frame constructor then wraps that vector into the desired one-column data frame.
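As a variation, here is a minimal sketch using vapply instead of sapply, which additionally checks that every list element collapses to exactly one string. The list below is the example from the question, not the toy data above:

```r
j <- list(c("a", "b", "c"), c("c", "c"), c("d", "d", "d", "a", "a"))
# vapply enforces that each result is a single character string
collapsed <- vapply(j, paste, character(1), collapse = " ")
res <- data.frame(Column_Name = collapsed, stringsAsFactors = FALSE)
res
```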
First give the list names, then use stack to convert the list into a data.frame. Finally, the dplyr package is used to collapse the values that belong to the same list element into one string, separated by a space.
The sample data is taken from @AndersEllernBilgrau's answer.
set.seed(1)
j <- replicate(5, rep(sample(letters, 1), sample(1:10,1)))
names(j) <- seq_along(j)
library(dplyr)
stack(j) %>% group_by(ind) %>%
summarise(Column_Name = paste0(values, collapse = " ")) %>%
ungroup() %>% select(-ind)
# # A tibble: 5 x 1
# Column_Name
# <chr>
# 1 g g g g
# 2 o o o o o o o o o o
# 3 f f f f f f f f f
# 4 y y y y y y y
# 5 q
#

How to randomize the order of all sublists simultaneously

I am looking to randomize the order of the sublists, but retaining the structure. To illustrate, I can do this with a data frame:
df1 <- data.frame("X1" = LETTERS[1:5], "X2" = letters[1:5])
df1
df1R <- df1[sample(df1[,1]),]
df1R
> df1
X1 X2
1 A a
2 B b
3 C c
4 D d
5 E e
>
> df1R <- df1[sample(df1[,1]),]
> df1R
X1 X2
2 B b
5 E e
1 A a
3 C c
4 D d
You can see here that the overall order is randomised, but rows remain together. This is what I mean by retaining the structure: A stays with a, B stays with b...
I'd like to implement this for a list:
m1 <- list(LETTERS[1:5], letters[1:5])
But I'm stuck on the how. I've had a good look round but have not found a solution. Any advice?
The result would look like:
> m1R
[[1]]
[1] "B" "C" "E" "A" "D"
[[2]]
[1] "b" "c" "e" "a" "d"
You could do this to reorder all elements:
neworder <- sample.int(5)
lapply(m1, function(x) x[neworder])
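The same idea can be wrapped in a small function so that the permutation length is not hard-coded. This is only a sketch; the name shuffle_together is hypothetical, and it assumes all sublists have the same length:

```r
m1 <- list(LETTERS[1:5], letters[1:5])

# Hypothetical helper: apply one shared permutation to every sublist
shuffle_together <- function(lst) {
  n <- unique(lengths(lst))
  stopifnot(length(n) == 1)            # all sublists must have equal length
  neworder <- sample.int(n)            # one random order, reused for each sublist
  lapply(lst, function(x) x[neworder])
}

set.seed(42)
m1R <- shuffle_together(m1)
m1R
```

Because every sublist is indexed with the same neworder, the pairing between positions (A with a, B with b, and so on) is preserved.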

"lapply" in R does not work for each element

test.data <- data.frame(a=seq(10),b=rep(seq(5),times=2),c=rep(seq(5),each=2))
test.data <- data.frame(lapply(test.data, as.character), stringsAsFactors = F)
test.ref <- data.frame(original=seq(10),name=letters[1:10])
test.ref <- data.frame(lapply(test.ref, as.character), stringsAsFactors = F)
test.match <- function (x) {
result = test.ref$name[which(test.ref$original == x)]
return(result)
}
> data.frame(lapply(test.data, test.match))
a b c
1 a a a
2 b b a
3 c c a
4 d d a
5 e e a
6 f a a
7 g b a
8 h c a
9 i d a
10 j e a
> lapply(test.data, test.match)
$a
[1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j"
$b
[1] "a" "b" "c" "d" "e"
$c
[1] "a"
Hi all,
I am learning to use the apply family in R. However, I am stuck on a rather simple exercise. Above is my code. I am trying to use the "test.match" function to replace every element in "test.data" according to the reference rule in "test.ref". However, the last column comes out wrong if I turn the final result into a data frame. It is even worse if I keep the result as a list.
Many thanks for your help,
Kevin
As mentioned in the comments, you probably want match:
do.test.match.df <- function(df, ref_df = test.ref){
res <- df
res[] <- lapply(df, function(x) ref_df$name[ match(x, ref_df$original) ])
return(res)
}
do.test.match.df(test.data)
which gives
a b c
1 a a a
2 b b a
3 c c b
4 d d b
5 e e c
6 f a c
7 g b d
8 h c d
9 i d e
10 j e e
This is the idiomatic way. lapply will always return a vanilla list. A data.frame is a special kind of list (a list of column vectors). With res[] <- lapply(df, myfun), we're assigning to columns of res.
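A minimal sketch of the difference between keeping the lapply result as a list and assigning it into the columns with res[]. The small ref and df objects and the lookup function are made up for illustration:

```r
ref <- data.frame(original = c("1", "2", "3"), name = c("a", "b", "c"),
                  stringsAsFactors = FALSE)
df <- data.frame(x = c("1", "3"), y = c("2", "2"), stringsAsFactors = FALSE)

# Hypothetical lookup: translate a whole column through the reference table
lookup <- function(col) ref$name[match(col, ref$original)]

as_list <- lapply(df, lookup)      # a plain list
kept <- df
kept[] <- lapply(df, lookup)       # columns replaced, still a data.frame
class(as_list)
class(kept)
```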
Since all your columns are the same class, I'd suggest using a matrix instead of a data.frame.
test.mat <- as.matrix(test.data)
do.test.match <- function(mat, ref_df=test.ref){
res <- matrix(, nrow(mat), ncol(mat))
res[] <- ref_df$name[ match( c(mat), ref_df$original ) ]
return(res)
}
do.test.match(test.mat)
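To see why the original test.match loses values: lapply hands it a whole column, and == then compares that column with test.ref$original position by position rather than looking each value up. A small sketch with made-up vectors (ref_original, ref_name and x are illustrative names):

```r
ref_original <- c("1", "2", "3")
ref_name     <- c("a", "b", "c")
x <- c("2", "2", "1")

# `==` compares position by position, so only position 2 matches here
by_equal <- ref_name[which(ref_original == x)]
# match() looks up every element of x in the reference
by_match <- ref_name[match(x, ref_original)]

by_equal
by_match
```

The == version returns a single value, while match returns one value per element of x, which is what a column replacement needs.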

Gather connected IDs across different rows of data frame

Given an R data frame like this:
DF.a <- data.frame(ID1 = c("A","B","C","D","E","F","G","H"),
ID2 = c("D",NA,"G",NA,NA,NA,"H",NA),
ID3 = c("F",NA,NA,NA,NA,NA,NA,NA))
> DF.a
ID1 ID2 ID3
1 A D F
2 B <NA> <NA>
3 C G <NA>
4 D <NA> <NA>
5 E <NA> <NA>
6 F <NA> <NA>
7 G H <NA>
8 H <NA> <NA>
I would like to simplify/reshape it into the following:
DF.b <- data.frame(ID1 = c("A","B","C","E"),
ID2 = c("D",NA,"G",NA),
ID3 = c("F",NA,"H",NA))
> DF.b
ID1 ID2 ID3
1 A D F
2 B <NA> <NA>
3 C G H
4 E <NA> <NA>
It does not seem like a straightforward reshape. The goal is to get all "connected" ID values together on a single row. Note how the connection between "C" and "H" is indirect, as both are connected to "G", but they don't appear together on the same row of DF.a. The order of the ID values in rows of DF.b does not matter.
Really you could think of this as trying to get all the connected components of a graph. The first step I would take would be to convert your data into a more natural structure -- a vector of nodes and matrix of edges:
(nodes <- as.character(sort(unique(unlist(DF.a)))))
# [1] "A" "B" "C" "D" "E" "F" "G" "H"
(edges <- do.call(rbind, apply(DF.a, 1, function(x) {
x <- x[!is.na(x)]
cbind(head(x, -1), tail(x, -1))
})))
# [,1] [,2]
# ID1 "A" "D"
# ID2 "D" "F"
# ID1 "C" "G"
# ID1 "G" "H"
Now you are ready to build a graph and compute its components:
library(igraph)
g <- graph_from_data_frame(edges, directed = FALSE, vertices = nodes)
(comp <- split(nodes, components(g)$membership))
# $`1`
# [1] "A" "D" "F"
#
# $`2`
# [1] "B"
#
# $`3`
# [1] "C" "G" "H"
#
# $`4`
# [1] "E"
The output of the split function is a list, where each list element is all the nodes in one of the components of the graph. Personally I think this is the most useful representation of the output data, but if you really wanted the NA-padded structure you describe you could try something like:
max.len <- max(sapply(comp, length))
do.call(rbind, lapply(comp, function(x) { length(x) <- max.len ; x }))
# [,1] [,2] [,3]
# 1 "A" "D" "F"
# 2 "B" NA NA
# 3 "C" "G" "H"
# 4 "E" NA NA
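If igraph is not available, the components can also be found in base R. The following is only a sketch of a different technique (simple label propagation, not what igraph does internally): every node starts with its own label, and each edge repeatedly pulls both endpoints to the smaller label until nothing changes. The nodes and edges are retyped here from the output above.

```r
nodes <- c("A", "B", "C", "D", "E", "F", "G", "H")
edges <- rbind(c("A", "D"), c("D", "F"), c("C", "G"), c("G", "H"))

# Start with one distinct label per node
label <- seq_along(nodes)
names(label) <- nodes
repeat {
  old <- label
  for (i in seq_len(nrow(edges))) {
    ends <- edges[i, ]
    label[ends] <- min(label[ends])    # both endpoints share the smaller label
  }
  if (identical(old, label)) break     # stop once a full pass changes nothing
}
# Nodes with the same final label form one component
comp <- unname(split(nodes, label))
comp
```

This is fine for small inputs like the one here; for large graphs a dedicated graph library is likely the better choice.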

Column Split without repeat

I have a dataframe with one column that I would like to split into several columns, but the number of splits is dynamic throughout the rows.
Var1
====
A/B
A/B/C
C/B
A/C/D/E
I have tried using colsplit(df$Var1,split="/",names=c("Var1","Var2","Var3","Var4")), but rows with fewer than 4 values get their values repeated to fill the remaining columns.
From Hansi, the desired output would be:
Var1 Var2 Var3 Var4
[1,] "A" "B" NA NA
[2,] "A" "B" "C" NA
[3,] "C" "B" NA NA
[4,] "A" "C" "D" "E"
> read.table(text=as.character(df$Var1), sep="/", fill=TRUE)
V1 V2 V3 V4
1 A B
2 A B C
3 C B
4 A C D E
Leading zeros in digit only fields can be preserved with colClasses="character"
a <- data.frame(Var1=c("01/B","04/B/C","0098/B","8708/C/D/E"))
read.table(text=as.character(a$Var1), sep="/", fill=TRUE, colClasses="character")
V1 V2 V3 V4
1 01 B
2 04 B C
3 0098 B
4 8708 C D E
If I understood your objective correctly, here is one possible solution. I'm sure there is a better way of doing it, but this was the first that came to mind:
a <- data.frame(Var1=c("A/B","A/B/C","C/B","A/C/D/E"))
splitNames <- c("Var1","Var2","Var3","Var4")
# R> a
# Var1
# 1 A/B
# 2 A/B/C
# 3 C/B
# 4 A/C/D/E
b <- t(apply(a,1,function(x){
temp <- unlist(strsplit(x,"/"));
return(c(temp,rep(NA,max(0,length(splitNames)-length(temp)))))
}))
colnames(b) <- splitNames
# R> b
# Var1 Var2 Var3 Var4
# [1,] "A" "B" NA NA
# [2,] "A" "B" "C" NA
# [3,] "C" "B" NA NA
# [4,] "A" "C" "D" "E"
I do not know a function that solves your problem directly, but you can achieve it easily with standard R commands:
# Here are your data
df <- data.frame(Var1=c("A/B", "A/B/C", "C/B", "A/C/D/E"), stringsAsFactors=FALSE)
# Split
rows <- strsplit(df$Var1, split="/")
# Maximum amount of columns
columnCount <- max(sapply(rows, length))
# Fill with NA
rows <- lapply(rows, `length<-`, columnCount)
# Coerce to data.frame
out <- as.data.frame(rows)
# Transpose
out <- t(out)
As it relies on strsplit, every value comes back as character, so you may need to do some type conversion. See type.con
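The split, pad, and bind steps above can also be collected into one function. This is a sketch only; the name split_to_columns and the generated Var1, Var2, ... column names are assumptions, and it uses do.call(rbind, ...) in place of as.data.frame followed by t:

```r
# Hypothetical helper: split each string on `sep` and pad short rows with NA.
# `x` must be a character vector (convert factors with as.character first).
split_to_columns <- function(x, sep = "/") {
  parts <- strsplit(x, split = sep, fixed = TRUE)
  width <- max(lengths(parts))                        # widest row decides column count
  out <- do.call(rbind, lapply(parts, `length<-`, width))
  colnames(out) <- paste0("Var", seq_len(width))
  out
}

res <- split_to_columns(c("A/B", "A/B/C", "C/B", "A/C/D/E"))
res
```

The result is a character matrix with NA in the unused cells, matching the desired output shown in the question.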

Resources