I have a three-dimensional Excel table that I would like to convert into a two-dimensional data frame for use in R. I think the best approach is to read it into R and transform it there, but I cannot work out how. Here is an example: I have a data frame like df1 that I want to transform into df2:
a1 <- paste("a","b","c",sep = ";")
a2 <- paste("e","f","g",sep = ";")
df1 <- data.frame(v1=a1, v2=a2, row.names = "w1")
df2 <- data.frame(w1=c(rep("v1",3),rep("v2",3)), "value"=letters[1:6])
You can achieve this with strsplit and melt from the reshape2 package:
library(reshape2)
sub_df1 <- apply(df1,2,FUN= strsplit,split = ";")
# $v1
# $v1$w1
# [1] "a" "b" "c"
# $v2
# $v2$w1
# [1] "e" "f" "g"
sub_df2 <- sapply(apply(df1,2,FUN= strsplit,split = ";"), FUN = unlist,use.names = TRUE, recursive = FALSE)
# v1 v2
# w11 "a" "e"
# w12 "b" "f"
# w13 "c" "g"
melt(sub_df2)[-1]
# Var2 value
# 1 v1 a
# 2 v1 b
# 3 v1 c
# 4 v2 e
# 5 v2 f
# 6 v2 g
The trailing [-1] drops the first column of melt's output (Var1, which only holds the row names w11, w12, w13).
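For readers who would rather not load reshape2, the same long format can be sketched in base R with strsplit and stack. This is only an illustrative alternative, not the approach used in the answer; the names parts and long are made up for the example, and stringsAsFactors = FALSE is an assumption added so the code behaves the same on R versions before 4.0.

```r
# Rebuild the example input from the question.
a1 <- paste("a", "b", "c", sep = ";")
a2 <- paste("e", "f", "g", sep = ";")
df1 <- data.frame(v1 = a1, v2 = a2, row.names = "w1",
                  stringsAsFactors = FALSE)

# Split every column on ";" to get one character vector per column.
parts <- lapply(df1, function(col) unlist(strsplit(col, ";", fixed = TRUE)))

# stack() turns a named list of vectors into a two-column data frame
# (values, ind); reorder and rename to match the layout of df2.
long <- stack(parts)
long <- long[, c("ind", "values")]
names(long) <- c("w1", "value")
long$w1 <- as.character(long$w1)
long
```

Each original column name ends up repeated once per split value, which is the shape of df2 in the question.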
I have the following data.frame:
df <- data.frame(V1 = c("A","X","A","Z","B","Y"),
V2 = c("B","Y","C","Y","C","W"),
stringsAsFactors=FALSE)
df
# V1 V2
# 1 A B
# 2 X Y
# 3 A C
# 4 Z Y
# 5 B C
# 6 Y W
I want to group all the values that occur together at some point and get the following:
list(c("A","B","C"), c("X","Y","Z","W"))
# [[1]]
# [1] "A" "B" "C"
#
# [[2]]
# [1] "X" "Y" "Z" "W"
Network analyses can help.
library(igraph)
df <- data.frame(V1 = c("A","X","A","Z","B","Y"),
V2 = c("B","Y","C","Y","C","W"),
stringsAsFactors=FALSE)
g <- graph_from_data_frame(df, directed = FALSE)
clust <- components(g)  # clusters() is the older, deprecated name for the same function
clusters <- data.frame(name = names(clust$membership),
cluster = clust$membership,
row.names = NULL,
stringsAsFactors = FALSE)
split(clusters$name, clusters$cluster)
$`1`
[1] "A" "B" "C"
$`2`
[1] "X" "Z" "Y" "W"
You can of course keep everything in the clusters data.frame for further analyses.
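If igraph is not available, the grouping can also be sketched in base R with a small union-find. The helper connected_groups below is hypothetical (it is not part of igraph or base R) and is meant only to show the idea: every value starts in its own group, and each row of the data frame merges the groups of its two values.

```r
df <- data.frame(V1 = c("A", "X", "A", "Z", "B", "Y"),
                 V2 = c("B", "Y", "C", "Y", "C", "W"),
                 stringsAsFactors = FALSE)

# Hypothetical helper: group values that are linked, directly or
# indirectly, by the pairs (from[i], to[i]).
connected_groups <- function(from, to) {
  nodes <- unique(c(from, to))
  parent <- setNames(nodes, nodes)  # each node starts as its own root

  # Follow parent links until a node that is its own parent is reached.
  find_root <- function(x) {
    while (parent[[x]] != x) x <- parent[[x]]
    x
  }

  # Merge the two groups joined by each pair.
  for (i in seq_along(from)) {
    ra <- find_root(from[i])
    rb <- find_root(to[i])
    if (ra != rb) parent[[rb]] <- ra
  }

  roots <- vapply(nodes, find_root, character(1))
  unname(split(nodes, factor(roots, levels = unique(roots))))
}

groups <- connected_groups(df$V1, df$V2)
groups
```

The sketch does no path compression or NA handling, so it is only suitable for small inputs like this one.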
Given an R data frame like this:
DF.a <- data.frame(ID1 = c("A","B","C","D","E","F","G","H"),
ID2 = c("D",NA,"G",NA,NA,NA,"H",NA),
ID3 = c("F",NA,NA,NA,NA,NA,NA,NA))
> DF.a
ID1 ID2 ID3
1 A D F
2 B <NA> <NA>
3 C G <NA>
4 D <NA> <NA>
5 E <NA> <NA>
6 F <NA> <NA>
7 G H <NA>
8 H <NA> <NA>
I would like to simplify/reshape it into the following:
DF.b <- data.frame(ID1 = c("A","B","C","E"),
ID2 = c("D",NA,"G",NA),
ID3 = c("F",NA,"H",NA))
> DF.b
ID1 ID2 ID3
1 A D F
2 B <NA> <NA>
3 C G H
4 E <NA> <NA>
It does not seem like a straightforward reshape. The goal is to get all "connected" ID values together on a single row. Note how the connection between "C" and "H" is indirect, as both are connected to "G", but they don't appear together on the same row of DF.a. The order of the ID values in rows of DF.b does not matter.
Really you could think of this as trying to get all the connected components of a graph. The first step I would take would be to convert your data into a more natural structure -- a vector of nodes and matrix of edges:
(nodes <- as.character(sort(unique(unlist(DF.a)))))
# [1] "A" "B" "C" "D" "E" "F" "G" "H"
(edges <- do.call(rbind, apply(DF.a, 1, function(x) {
x <- x[!is.na(x)]
cbind(head(x, -1), tail(x, -1))
})))
# [,1] [,2]
# ID1 "A" "D"
# ID2 "D" "F"
# ID1 "C" "G"
# ID1 "G" "H"
Now you are ready to build a graph and compute its components:
library(igraph)
g <- graph_from_data_frame(edges, directed = FALSE, vertices = nodes)  # graph.data.frame() is the older name
(comp <- split(nodes, components(g)$membership))
# $`1`
# [1] "A" "D" "F"
#
# $`2`
# [1] "B"
#
# $`3`
# [1] "C" "G" "H"
#
# $`4`
# [1] "E"
The output of the split function is a list, where each list element is all the nodes in one of the components of the graph. Personally I think this is the most useful representation of the output data, but if you really wanted the NA-padded structure you describe you could try something like:
max.len <- max(sapply(comp, length))
do.call(rbind, lapply(comp, function(x) { length(x) <- max.len ; x }))
# [,1] [,2] [,3]
# 1 "A" "D" "F"
# 2 "B" NA NA
# 3 "C" "G" "H"
# 4 "E" NA NA
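To get from the padded matrix to a data frame shaped like DF.b, something along these lines should work. It is a sketch that assumes comp is the list of components printed earlier; here it is typed in by hand so the example runs without igraph, and the column names ID1 to ID3 are simply taken from the question.

```r
# Components as printed by split(nodes, components(g)$membership),
# entered by hand for the sake of a self-contained example.
comp <- list(`1` = c("A", "D", "F"),
             `2` = "B",
             `3` = c("C", "G", "H"),
             `4` = "E")

max.len <- max(lengths(comp))

# Indexing past the end of a vector yields NA, which pads short components.
padded <- t(vapply(comp, function(x) x[seq_len(max.len)], character(max.len)))

DF.b <- as.data.frame(padded, stringsAsFactors = FALSE)
names(DF.b) <- paste0("ID", seq_len(max.len))
rownames(DF.b) <- NULL
DF.b
```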
I have 2 tables as below:
a = read.table(text=' a b
1 c
1 d
2 c
2 a
2 b
3 a
', head=T)
b = read.table(text=' a c
1 x i
2 y j
3 z k
', head=T)
And I want the result to look like this:
1 x i c d
2 y j c a b
3 z k a
Originally I thought of using tapply to turn them into lists (e.g. aa = tapply(a[,2], a[,1], function(x) paste(x,collapse=","))) and then appending the result back to table b, but I got stuck.
Any suggestions on how to do this?
Thanks a million.
One way to do it:
mapply(FUN = c,
lapply(split(b, row.names(b)), function(x) as.character(unlist(x, use.names = FALSE))),
split(as.character(a$b), a$a),
SIMPLIFY = FALSE)
# $`1`
# [1] "x" "i" "c" "d"
#
# $`2`
# [1] "y" "j" "c" "a" "b"
#
# $`3`
# [1] "z" "k" "a"
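The tapply idea from the question can also be made to work. The sketch below assumes the row names of b match the values of a$a; it collapses each group of a$b into a single string and attaches it to b as a new column. The column name extra is made up for illustration.

```r
a <- read.table(text = "a b
1 c
1 d
2 c
2 a
2 b
3 a", header = TRUE, stringsAsFactors = FALSE)

# The header has one field fewer than the data rows, so read.table
# uses the first column (1, 2, 3) as row names.
b <- read.table(text = "a c
1 x i
2 y j
3 z k", header = TRUE, stringsAsFactors = FALSE)

# One space-separated string per value of a$a, named "1", "2", "3".
collapsed <- tapply(a$b, a$a, paste, collapse = " ")

# Look the strings up by the row names of b and store them as a column.
b$extra <- as.vector(collapsed[rownames(b)])
b
```

Keeping the extra values in one string column keeps b rectangular; the list returned by mapply is the better fit if the individual values are needed later.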
I have a list of gene IDs along with their sequences in R.
$2435
[1] "ATGCGGGCGGGGGTCGTCGA"
$2435
[1] "ATGCGGCGCGCGCGCTATATACGC"
$2435
[1] "ATGCGGCGCCTCTCATCGCGGGGG"
I want to combine the sequences with the same gene IDs in that list in R.
$2435
[1] "ATGCGGGCGGGGGTCGTCGAATGCGGCGCGCGCGCTATATACGCATGCGGCGCCTCTCATCGCGGGGG"
Use lapply after matching the names with unique. Here's some sample data:
A <- list("12" = "AAAABBBBCCCCDDDD",
"34" = "GGGG",
"12" = "XXXXXXXXXXXXXXXXXXXXXXX",
"10" = "FFFFGGGG",
"10" = "HHHHIIII")
A
# $`12`
# [1] "AAAABBBBCCCCDDDD"
#
# $`34`
# [1] "GGGG"
#
# $`12`
# [1] "XXXXXXXXXXXXXXXXXXXXXXX"
#
# $`10`
# [1] "FFFFGGGG"
#
# $`10`
# [1] "HHHHIIII"
Subset the related names and paste them together.
lapply(unique(names(A)), function(x) paste(A[names(A) %in% x], collapse = ""))
# [[1]]
# [1] "AAAABBBBCCCCDDDDXXXXXXXXXXXXXXXXXXXXXXX"
#
# [[2]]
# [1] "GGGG"
#
# [[3]]
# [1] "FFFFGGGGHHHHIIII"
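Note that the result of lapply above is unnamed, so the gene IDs are lost. One way to keep them, sketched here with the made-up variable names ids and combined, is to wrap the call in setNames:

```r
A <- list("12" = "AAAABBBBCCCCDDDD",
          "34" = "GGGG",
          "12" = "XXXXXXXXXXXXXXXXXXXXXXX",
          "10" = "FFFFGGGG",
          "10" = "HHHHIIII")

# Unique IDs in order of first appearance.
ids <- unique(names(A))

# Paste together all entries that share an ID, then label each result.
combined <- setNames(
  lapply(ids, function(id) paste(unlist(A[names(A) == id]), collapse = "")),
  ids
)
combined
```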
l <- list("A" = "ABC", "B" = "XYX", "A" = "DEF", "C" = "YZY", "A" = "GHI")
tapply(l, names(l), paste, collapse = "", simplify = FALSE)
# $A
# [1] "ABCDEFGHI"
#
# $B
# [1] "XYX"
#
# $C
# [1] "YZY"
Bonus:
For a dataframe output, use this:
aggregate(unlist(A), by=list(id=names(A)), paste, collapse="")
Where A is your list.
Using @Ananda's A, I get this:
id x
1 10 FFFFGGGGHHHHIIII
2 12 AAAABBBBCCCCDDDDXXXXXXXXXXXXXXXXXXXXXXX
3 34 GGGG
I have a data frame with one column that I would like to split into several columns, but the number of pieces varies from row to row.
Var1
====
A/B
A/B/C
C/B
A/C/D/E
I have tried using colsplit(df$Var1,split="/",names=c("Var1","Var2","Var3","Var4")), but in rows with fewer than 4 values the values are repeated to fill the remaining columns.
From Hansi, the desired output would be:
Var1 Var2 Var3 Var4
[1,] "A" "B" NA NA
[2,] "A" "B" "C" NA
[3,] "C" "B" NA NA
[4,] "A" "C" "D" "E"
> read.table(text=as.character(df$Var1), sep="/", fill=TRUE)
V1 V2 V3 V4
1 A B
2 A B C
3 C B
4 A C D E
Leading zeros in digit-only fields can be preserved with colClasses="character":
a <- data.frame(Var1=c("01/B","04/B/C","0098/B","8708/C/D/E"))
read.table(text=as.character(a$Var1), sep="/", fill=TRUE, colClasses="character")
V1 V2 V3 V4
1 01 B
2 04 B C
3 0098 B
4 8708 C D E
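One caveat, as far as I understand the read.table documentation: with fill=TRUE the number of columns is taken from the first five lines of input unless col.names is longer. If the widest row might appear later, a possible workaround is to count the fields first and pass explicit column names. The data and the Var1, Var2, ... names below are invented for the example.

```r
# The widest row (6 fields) is the seventh line, past the first five.
x <- c("A/B", "A/B/C", "C/B", "A/C/D/E", "A", "B", "A/B/C/D/E/G")

# Count the fields on every line; the text connection must be closed by hand.
tc <- textConnection(x)
n <- max(count.fields(tc, sep = "/"))
close(tc)

# Supplying n column names makes read.table allocate n columns up front.
out <- read.table(text = x, sep = "/", fill = TRUE,
                  col.names = paste0("Var", seq_len(n)),
                  colClasses = "character")
out
```

Reading everything as character also avoids surprises such as a lone "F" or "T" being converted to a logical value.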
If I understood your objective correctly, here is one possible solution. There is probably a better way of doing it, but this was the first that came to mind:
a <- data.frame(Var1=c("A/B","A/B/C","C/B","A/C/D/E"))
splitNames <- c("Var1","Var2","Var3","Var4")
# R> a
# Var1
# 1 A/B
# 2 A/B/C
# 3 C/B
# 4 A/C/D/E
b <- t(apply(a,1,function(x){
temp <- unlist(strsplit(x,"/"));
return(c(temp,rep(NA,max(0,length(splitNames)-length(temp)))))
}))
colnames(b) <- splitNames
# R> b
# Var1 Var2 Var3 Var4
# [1,] "A" "B" NA NA
# [2,] "A" "B" "C" NA
# [3,] "C" "B" NA NA
# [4,] "A" "C" "D" "E"
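The same padding idea can be written without hard-coding the number of columns. This is a sketch rather than a drop-in replacement: it works on a plain character vector, derives the width from the data, and the names parts, width and out are made up for the example.

```r
x <- c("A/B", "A/B/C", "C/B", "A/C/D/E")

# Split each string on "/" and find the longest result.
parts <- strsplit(x, "/", fixed = TRUE)
width <- max(lengths(parts))

# Indexing past the end of a vector yields NA, so short rows are padded.
out <- t(vapply(parts, function(p) p[seq_len(width)], character(width)))
colnames(out) <- paste0("Var", seq_len(width))
out
```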
I do not know of a single function that solves your problem, but you can achieve it easily with standard R commands:
# Here are your data
df <- data.frame(Var1=c("A/B", "A/B/C", "C/B", "A/C/D/E"), stringsAsFactors=FALSE)
# Split
rows <- strsplit(df$Var1, split="/")
# Maximum amount of columns
columnCount <- max(sapply(rows, length))
# Fill with NA
rows <- lapply(rows, `length<-`, columnCount)
# Coerce to data.frame
out <- as.data.frame(rows)
# Transpose
out <- t(out)
As it relies on strsplit, you may need to make some type conversion. See type.con