Given an R data frame like this:
DF.a <- data.frame(ID1 = c("A","B","C","D","E","F","G","H"),
ID2 = c("D",NA,"G",NA,NA,NA,"H",NA),
ID3 = c("F",NA,NA,NA,NA,NA,NA,NA))
> DF.a
ID1 ID2 ID3
1 A D F
2 B <NA> <NA>
3 C G <NA>
4 D <NA> <NA>
5 E <NA> <NA>
6 F <NA> <NA>
7 G H <NA>
8 H <NA> <NA>
I would like to simplify/reshape it into the following:
DF.b <- data.frame(ID1 = c("A","B","C","E"),
ID2 = c("D",NA,"G",NA),
ID3 = c("F",NA,"H",NA))
> DF.b
ID1 ID2 ID3
1 A D F
2 B <NA> <NA>
3 C G H
4 E <NA> <NA>
It does not seem like a straightforward reshape. The goal is to get all "connected" ID values together on a single row. Note how the connection between "C" and "H" is indirect, as both are connected to "G", but they don't appear together on the same row of DF.a. The order of the ID values in rows of DF.b does not matter.
This is really the problem of finding the connected components of a graph. The first step I would take is to convert your data into a more natural structure -- a vector of nodes and a matrix of edges:
(nodes <- as.character(sort(unique(unlist(DF.a)))))
# [1] "A" "B" "C" "D" "E" "F" "G" "H"
(edges <- do.call(rbind, apply(DF.a, 1, function(x) {
  x <- x[!is.na(x)]
  cbind(head(x, -1), tail(x, -1))
})))
# [,1] [,2]
# ID1 "A" "D"
# ID2 "D" "F"
# ID1 "C" "G"
# ID1 "G" "H"
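To see what the anonymous function does to a single row, here is a small sketch (the named vector x is a hand-made stand-in for one row of DF.a, not something from the original post). Pairing head(x, -1) with tail(x, -1) links each ID to the next one on the row, and a row holding a single ID contributes no edges at all, which is why the nodes vector has to be passed to the graph separately.

```r
# One row with three IDs: consecutive values become edges A-D and D-F
x <- c(ID1 = "A", ID2 = "D", ID3 = "F")
e <- cbind(head(x, -1), tail(x, -1))
e

# A row with a single ID yields a 0-row matrix, so isolated nodes add no edges
lone <- cbind(head("B", -1), tail("B", -1))
dim(lone)
```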
Now you are ready to build a graph and compute its components:
library(igraph)
# graph_from_data_frame() is the current name of graph.data.frame()
g <- graph_from_data_frame(edges, directed = FALSE, vertices = nodes)
(comp <- split(nodes, components(g)$membership))
# $`1`
# [1] "A" "D" "F"
#
# $`2`
# [1] "B"
#
# $`3`
# [1] "C" "G" "H"
#
# $`4`
# [1] "E"
The output of the split function is a list in which each element holds all the nodes of one component of the graph. Personally I think this is the most useful representation of the output data, but if you really want the NA-padded structure you describe, you could try something like:
max.len <- max(sapply(comp, length))
do.call(rbind, lapply(comp, function(x) { length(x) <- max.len ; x }))
# [,1] [,2] [,3]
# 1 "A" "D" "F"
# 2 "B" NA NA
# 3 "C" "G" "H"
# 4 "E" NA NA
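If installing igraph is not an option, the grouping can also be sketched in base R by merging labels row by row. This is a different technique from the igraph call above, not a drop-in equivalent of it: connected_groups is a hypothetical helper name, and the loop is a simple label merge that should be fine for small tables but has not been tuned for large ones.

```r
DF.a <- data.frame(ID1 = c("A","B","C","D","E","F","G","H"),
                   ID2 = c("D",NA,"G",NA,NA,NA,"H",NA),
                   ID3 = c("F",NA,NA,NA,NA,NA,NA,NA))

# Hypothetical helper: group IDs that are linked directly or indirectly
connected_groups <- function(df) {
  # Non-missing IDs of each row, as character vectors
  rows <- lapply(seq_len(nrow(df)), function(i) {
    x <- unlist(df[i, ], use.names = FALSE)
    as.character(x[!is.na(x)])
  })
  nodes <- sort(unique(unlist(rows)))
  # Every node starts in its own group, labelled by its own name
  label <- setNames(nodes, nodes)
  for (ids in rows) {
    # Merge all groups touched by this row under the smallest label
    touched <- unique(label[ids])
    label[label %in% touched] <- min(touched)
  }
  unname(split(nodes, label))
}

comp2 <- connected_groups(DF.a)
comp2
```

The result is a plain list of character vectors, so the same length<- and rbind padding step shown above can be applied to it.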
Given a vector of N elements:
LETTERS[1:10]
[1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J"
How can one get a data.table/frame (df) as follows?
>df
one two
A B
C D
E F
G H
I J
EDIT
More generally, I would like to know how to split a vector into groups of unequal length, as follows:
[A B C],[D E],[F G H I J]
and obtaining:
V1 V2 V3 V4 V5
A B C NA NA
D E NA NA NA
F G H I J
One option is the matrix way
as.data.frame(matrix(LETTERS[1:10], ncol = 2, byrow = TRUE,
                     dimnames = list(NULL, c('one', 'two'))),
              stringsAsFactors = FALSE)
# one two
#1 A B
#2 C D
#3 E F
#4 G H
#5 I J
If we need to create an index, we can use gl to build the grouping, split the vector by it, and rbind the pieces
do.call(rbind, split(v1, as.integer(gl(length(v1), 2, length(v1)))))
where
v1 <- LETTERS[1:10]
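For anyone unsure what the gl call produces, here is a short sketch of the intermediate index (idx, idx2 and pairs are names I made up for illustration). The index simply repeats each group number twice, so an arithmetic version gives the same grouping.

```r
v1 <- LETTERS[1:10]

# gl() builds a factor of group numbers, each repeated twice: 1 1 2 2 3 3 4 4 5 5
idx <- as.integer(gl(length(v1), 2, length(v1)))
idx

# The same index computed arithmetically, without gl()
idx2 <- as.integer(ceiling(seq_along(v1) / 2))

# Split by the index and stack the pairs into a 5 x 2 matrix
pairs <- do.call(rbind, split(v1, idx))
pairs
```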
Update
Based on the update in OP's post
lst <- split(v1, rep(1:3, c(3, 2, 5)))
do.call(rbind, lapply(lst, `length<-`, max(lengths(lst))))
# [,1] [,2] [,3] [,4] [,5]
#1 "A" "B" "C" NA NA
#2 "D" "E" NA NA NA
#3 "F" "G" "H" "I" "J"
Or otherwise
library(stringi)
stri_list2matrix(lst, byrow = TRUE)
Update2
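If we are using a 'splitVec' that holds the positions at which each new group starts, the cumulative sum turns those positions into a grouping index
lst <- split(v1, cumsum(seq_along(v1) %in% splitVec))
and then we proceed as above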
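A worked sketch of the 'splitVec' variant may help. The values c(1, 4, 6) are my assumption about what such a vector would contain (the start position of each group); the post itself does not define it.

```r
v1 <- LETTERS[1:10]

# Assumed meaning of splitVec: positions where a new group begins
splitVec <- c(1, 4, 6)

# TRUE at each start position; cumsum turns that into 1 1 1 2 2 3 3 3 3 3
grp <- cumsum(seq_along(v1) %in% splitVec)
lst <- split(v1, grp)

# Pad every group with NA to the longest length and stack into a matrix
m <- do.call(rbind, lapply(lst, `length<-`, max(lengths(lst))))
m
```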
Say I create a data frame, foo:
foo <- data.frame(A=rep(NA,10),B=rep(NA,10))
foo$A[1:3] <- "A"
foo$B[6:10] <- "B"
which looks like,
A B
1 A <NA>
2 A <NA>
3 A <NA>
4 <NA> <NA>
5 <NA> <NA>
6 <NA> B
7 <NA> B
8 <NA> B
9 <NA> B
10 <NA> B
I can coalesce this into a single column, like this:
data.frame(AB = coalesce(foo$A, foo$B))
giving,
AB
1 A
2 A
3 A
4 <NA>
5 <NA>
6 B
7 B
8 B
9 B
10 B
which is nice. Now, say my data frame is huge with lots of columns. How do I coalesce that without naming each column individually? As far as I understand, coalesce is expecting vectors, so I don't see a neat and tidy dplyr solution where I can just pluck out the required columns and pass them en masse. Any ideas?
EDIT
As requested, a "harder" example.
foo <- data.frame(A=rep(NA,10),B=rep(NA,10),C=rep(NA,10),D=rep(NA,10),E=rep(NA,10),F=rep(NA,10),G=rep(NA,10),H=rep(NA,10),I=rep(NA,10),J=rep(NA,10))
foo$A[1] <- "A"
foo$B[2] <- "B"
foo$C[3] <- "C"
foo$D[4] <- "D"
foo$E[5] <- "E"
foo$F[6] <- "F"
foo$G[7] <- "G"
foo$H[8] <- "H"
foo$I[9] <- "I"
foo$J[10] <- "J"
How do I coalesce this without having to write:
data.frame(ALL= coalesce(foo$A, foo$B, foo$C, foo$D, foo$E, foo$F, foo$G, foo$H, foo$I, foo$J))
You can use do.call(coalesce, ...), which is a simpler way to write a function call with a lot of arguments:
library(dplyr)
do.call(coalesce, foo)
# [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J"
Alternatively, since coalesce() accepts dynamic dots, you can splice all the columns in with the !!! operator (documented in rlang):
coalesce(!!!foo)
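If you would rather not depend on dplyr at all, the same idea can be sketched in base R by folding over the columns with Reduce(). This is a swapped-in technique rather than what coalesce() does internally, and coalesce_cols is a hypothetical helper name. It assumes all columns share a compatible type.

```r
# Rebuild the "harder" example: one non-missing value per row, on the diagonal
foo <- as.data.frame(matrix(NA_character_, 10, 10,
                            dimnames = list(NULL, LETTERS[1:10])),
                     stringsAsFactors = FALSE)
for (i in 1:10) foo[i, i] <- LETTERS[i]

# Hypothetical helper: keep the first non-missing value across columns, row by row
coalesce_cols <- function(df) {
  Reduce(function(x, y) ifelse(is.na(x), y, x), df)
}

all_cols <- coalesce_cols(foo)
all_cols

# Only some columns: subset the data frame first, then fold
first_two <- coalesce_cols(foo[c("A", "B")])
```

Subsetting before the call works the same way with do.call(coalesce, foo[c("A", "B")]).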
I have 2 tables as below:
a = read.table(text=' a b
1 c
1 d
2 c
2 a
2 b
3 a
', head=T)
b = read.table(text=' a c
1 x i
2 y j
3 z k
', head=T)
And I want result to be like this:
1 x i c d
2 y j c a b
3 z k a
Originally I thought to use tapply to transform them to lists (eg. aa = tapply(a[,2], a[,1], function(x) paste(x,collapse=","))), then append it back to table b, but I got stuck...
Any suggestion to do this?
Thanks a million.
One way to do it:
mapply(FUN = c,
       lapply(split(b, row.names(b)), function(x) as.character(unlist(x, use.names = FALSE))),
       split(as.character(a$b), a$a),
       SIMPLIFY = FALSE)
# $`1`
# [1] "x" "i" "c" "d"
#
# $`2`
# [1] "y" "j" "c" "a" "b"
#
# $`3`
# [1] "z" "k" "a"
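The mapply call pairs the two splits by position, so it relies on both coming out in the same order. As a hedged alternative, the sketch below looks each key up by name and then pads the result into a table closer to the layout asked for. The data frames are rebuilt by hand here, and extra, combined and padded are names introduced only for this example.

```r
a <- data.frame(a = c(1, 1, 2, 2, 2, 3),
                b = c("c", "d", "c", "a", "b", "a"),
                stringsAsFactors = FALSE)
b <- data.frame(a = c("x", "y", "z"),
                c = c("i", "j", "k"),
                row.names = c("1", "2", "3"),
                stringsAsFactors = FALSE)

# Values of a$b grouped by key; list names are "1", "2", "3"
extra <- split(a$b, a$a)

# For each row of b, append the matching group, looked up by row name
combined <- lapply(rownames(b), function(k) {
  c(unlist(b[k, ], use.names = FALSE), extra[[k]])
})
names(combined) <- rownames(b)

# Pad with NA to a common length and arrange one key per row
padded <- t(sapply(combined, `length<-`, max(lengths(combined))))
padded
```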
I have the data.frame
df<-data.frame("Site.1" = c("A", "B", "C"),
"Site.2" = c("D", "B", "B"),
"Tsim" = c(2, 4, 7),
"Jaccard" = c(5, 7, 1))
# Site.1 Site.2 Tsim Jaccard
# 1 A D 2 5
# 2 B B 4 7
# 3 C B 7 1
I can get the unique levels for each column using
top.x<-unique(df[1:2,c("Site.1")])
top.x
# [1] A B
# Levels: A B C
top.y<-unique(df[1:2,c("Site.2")])
top.y
# [1] D B
# Levels: B D
How do I get the unique levels for both columns and turn them into a vector i.e:
v <- c("A", "B", "D")
v
# [1] "A" "B" "D"
top.xy <- unique(unlist(df[1:2, 1:2]))
top.xy
# [1] A B D
# Levels: A B C D
Try union:
union(top.x, top.y)
# [1] "A" "B" "D"
union(unique(df[1:2, c("Site.1")]),
unique(df[1:2, c("Site.2")]))
# [1] "A" "B" "D"
You can get the unique levels for the first two columns:
de<- apply(df[,1:2],2,unique)
de
# $Site.1
# [1] "A" "B" "C"
# $Site.2
# [1] "D" "B"
Then you can take the symmetric difference of the two sets:
union(setdiff(de$Site.1,de$Site.2), setdiff(de$Site.2,de$Site.1))
# [1] "A" "C" "D"
If you're interested in just the first two rows (as in your example):
de<- apply(df[1:2,1:2],2,unique)
de
# Site.1 Site.2
# [1,] "A" "D"
# [2,] "B" "B"
union(de[,1],de[,2])
# [1] "A" "B" "D"
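The outputs above depend on whether the site columns are factors or character vectors, which changed with the stringsAsFactors default in R 4.0. A small sketch that should behave the same either way converts each column to character before combining them (site_cols and v are names chosen for this example):

```r
df <- data.frame("Site.1" = c("A", "B", "C"),
                 "Site.2" = c("D", "B", "B"),
                 "Tsim" = c(2, 4, 7),
                 "Jaccard" = c(5, 7, 1))

# Columns to pool; the numeric columns are deliberately left out
site_cols <- c("Site.1", "Site.2")

# Convert each column to character, flatten, and drop duplicates
v <- unique(unlist(lapply(df[1:2, site_cols], as.character), use.names = FALSE))
v
```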
I have a dataframe with one column that I would like to split into several columns, but the number of splits is dynamic throughout the rows.
Var1
====
A/B
A/B/C
C/B
A/C/D/E
I have tried using colsplit(df$Var1,split="/",names=c("Var1","Var2","Var3","Var4")), but in rows with fewer than 4 parts the values get repeated to fill the remaining columns.
From Hansi, the desired output would be:
Var1 Var2 Var3 Var4
[1,] "A" "B" NA NA
[2,] "A" "B" "C" NA
[3,] "C" "B" NA NA
[4,] "A" "C" "D" "E"
> read.table(text=as.character(df$Var1), sep="/", fill=TRUE)
V1 V2 V3 V4
1 A B
2 A B C
3 C B
4 A C D E
Leading zeros in digit-only fields can be preserved with colClasses="character":
a <- data.frame(Var1=c("01/B","04/B/C","0098/B","8708/C/D/E"))
read.table(text=as.character(a$Var1), sep="/", fill=TRUE, colClasses="character")
V1 V2 V3 V4
1 01 B
2 04 B C
3 0098 B
4 8708 C D E
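Note that read.table pads the short rows with empty strings rather than NA when the columns are read as character. If you need real NA values as in the desired output, one way to sketch it is to replace the empty strings afterwards (out is a name introduced for this example):

```r
a <- data.frame(Var1 = c("01/B", "04/B/C", "0098/B", "8708/C/D/E"))

# Read every field as character so leading zeros survive
out <- read.table(text = as.character(a$Var1), sep = "/",
                  fill = TRUE, colClasses = "character")

# Short rows were padded with "", so turn those cells into NA
out[out == ""] <- NA
out
```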
If I understood your objective correctly, here is one possible solution. I'm sure there is a better way of doing it, but this was the first that came to mind:
a <- data.frame(Var1=c("A/B","A/B/C","C/B","A/C/D/E"))
splitNames <- c("Var1","Var2","Var3","Var4")
# R> a
# Var1
# 1 A/B
# 2 A/B/C
# 3 C/B
# 4 A/C/D/E
b <- t(apply(a, 1, function(x) {
  temp <- unlist(strsplit(x, "/"))
  return(c(temp, rep(NA, max(0, length(splitNames) - length(temp)))))
}))
colnames(b) <- splitNames
# R> b
# Var1 Var2 Var3 Var4
# [1,] "A" "B" NA NA
# [2,] "A" "B" "C" NA
# [3,] "C" "B" NA NA
# [4,] "A" "C" "D" "E"
I do not know of a function that solves your problem, but you can achieve it easily with standard R commands:
# Here are your data
df <- data.frame(Var1=c("A/B", "A/B/C", "C/B", "A/C/D/E"), stringsAsFactors=FALSE)
# Split
rows <- strsplit(df$Var1, split="/")
# Maximum amount of columns
columnCount <- max(sapply(rows, length))
# Fill with NA
rows <- lapply(rows, `length<-`, columnCount)
# Coerce to data.frame
out <- as.data.frame(rows)
# Transpose
out <- t(out)
As it relies on strsplit, you may need to make some type conversion. See type.con