Unlist data frame column preserving information from other column - r

I have a data frame that contains, among others, two columns: a character vector, col1, and a list column, col2.
myVector <- c("A","B","C","D")
myList <- list()
myList[[1]] <- c(1, 4, 6, 7)
myList[[2]] <- c(2, 7, 3)
myList[[3]] <- c(5, 5, 3, 9, 6)
myList[[4]] <- c(7, 9)
myDataFrame <- data.frame(row = c(1,2,3,4))
myDataFrame$col1 <- myVector
myDataFrame$col2 <- myList
myDataFrame
# row col1 col2
# 1 1 A 1, 4, 6, 7
# 2 2 B 2, 7, 3
# 3 3 C 5, 5, 3, 9, 6
# 4 4 D 7, 9
I want to unlist col2 while keeping, for each element of the vectors in the list, the information stored in col1. To phrase it in common data frame reshaping terminology: the "wide" list column should be converted to a "long" format.
In the end I want two vectors whose length equals length(unlist(myDataFrame$col2)). In code:
# unlist myList
unlist.col2 <- unlist(myDataFrame$col2)
unlist.col2
# [1] 1 4 6 7 2 7 3 5 5 3 9 6 7 9
# unlist myVector to obtain
# unlist.col1 <- ???
# unlist.col1
# [1] A A A A B B B C C C C C D D
I can't think of any straightforward way to get it.

You may also use unnest from package tidyr (from tidyr 1.0.0 onwards, the column to unnest is named through the cols argument):
library(tidyr)
unnest(myDataFrame, cols = col2)
# row col1 col2
# (dbl) (chr) (dbl)
# 1 1 A 1
# 2 1 A 4
# 3 1 A 6
# 4 1 A 7
# 5 2 B 2
# 6 2 B 7
# 7 2 B 3
# 8 3 C 5
# 9 3 C 5
# 10 3 C 3
# 11 3 C 9
# 12 3 C 6
# 13 4 D 7
# 14 4 D 9

You can use the "data.table" package to expand the whole data.frame and then extract the column of interest.
library(data.table)
## expand the entire data.frame (uncomment to see)
# as.data.table(myDataFrame)[, unlist(col2), by = list(row, col1)]
## expand and select the column of interest:
as.data.table(myDataFrame)[, unlist(col2), by = list(row, col1)]$col1
# [1] "A" "A" "A" "A" "B" "B" "B" "C" "C" "C" "C" "C" "D" "D"
In R 3.2.0 and later, you can use the lengths function instead of the sapply(list, length) approach. The lengths function is considerably faster.
with(myDataFrame, rep(col1, lengths(col2)))
# [1] "A" "A" "A" "A" "B" "B" "B" "C" "C" "C" "C" "C" "D" "D"

Here, the idea is to first get the length of each list element using sapply and then use rep to replicate each value of col1 that many times:
l1 <- sapply(myDataFrame$col2, length)
unlist.col1 <- rep(myDataFrame$col1, l1)
unlist.col1
#[1] "A" "A" "A" "A" "B" "B" "B" "C" "C" "C" "C" "C" "D" "D"
Or, as suggested by @Ananda Mahto, the above could also be done with vapply:
with(myDataFrame, rep(col1, vapply(col2, length, 1L)))
#[1] "A" "A" "A" "A" "B" "B" "B" "C" "C" "C" "C" "C" "D" "D"
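Putting the two pieces together, the whole long-format data frame can be built in base R. This is only a minimal sketch: it assumes R 3.2.0 or later (for lengths), rebuilds the example data from the question, and the names n and long are purely illustrative.

```r
# Rebuild the example data from the question
myDataFrame <- data.frame(row = 1:4)
myDataFrame$col1 <- c("A", "B", "C", "D")
myDataFrame$col2 <- list(c(1, 4, 6, 7), c(2, 7, 3), c(5, 5, 3, 9, 6), c(7, 9))

# Number of elements in each entry of the list column
n <- lengths(myDataFrame$col2)

# Repeat each ordinary column n times per row and flatten the list column
long <- data.frame(
  row  = rep(myDataFrame$row,  n),
  col1 = rep(myDataFrame$col1, n),
  col2 = unlist(myDataFrame$col2),
  stringsAsFactors = FALSE
)
long
```

Because rep and unlist both walk the rows in order, the repeated values of col1 line up with the flattened values of col2.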

Related

R - change values in labelled column of dataframe without losing all labels in column

I'm trying to learn how to work with the sjlabelled package in R, and labelled data more generally. I'm trying to change values in a labelled column without losing all of the labels in that column.
Here's the output for what I tried for a very simple example:
> library(dplyr)
> library(sjlabelled)
>
> df <- data.frame(col1 = c(1:3, 1:3),
+ col2 = c(seq(11, 33, 11), 11, 12, 13))
>
> df <- df %>%
+ set_labels(col1, labels = c("a" = 1, "b" = 2, "c" = 3)) %>%
+ set_labels(col2, labels = c("A" = 11, "B" = 12, "C" = 13, "D" = 22, "E" = 33))
>
> df
col1 col2
1 1 11
2 2 22
3 3 33
4 1 11
5 2 12
6 3 13
>
> get_labels(df)
$col1
[1] "a" "b" "c"
$col2
[1] "A" "B" "C" "D" "E"
>
> df <- df %>%
+ mutate(col1 = ifelse(col1 > 1 & col2 < 33, 2, as_labelled(col1)))
>
> df
col1 col2
1 1 11
2 2 22
3 3 33
4 1 11
5 2 12
6 2 13
>
> get_labels(df)
$col1
NULL
$col2
[1] "A" "B" "C" "D" "E"
I have used as_labelled to preserve labels in other situations, such as when using rbind for data frames with labelled data, but it isn't working here.
Are either of the following possible using sjlabelled or a similar approach:
a) overwriting values with other values which had already been assigned a label so that labels are preserved and the overwritten value has its label overwritten with the corresponding label (e.g. any values that are overwritten as '2' in the example will now be labelled as "b")?
b) overwriting values with values which weren't in the column (e.g. NA) without losing the labels from the values which were already labelled (e.g. with the example, overwriting '1' and '2', with NA but keeping '3' labelled as "c")?
Thank you.
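We could assign to the subset directly. Unlike ifelse, whose result takes its attributes from the test argument, subassignment keeps the attributes of col1, so the labels survive:
df$col1[with(df, col1 > 1 & col2 < 33)] <- 2
Checking: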
> get_labels(df)
$col1
[1] "a" "b" "c"
$col2
[1] "A" "B" "C" "D" "E"
> df
col1 col2
1 1 11
2 2 22
3 3 33
4 1 11
5 2 12
6 2 13
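The same idea seems to cover case (b) as well. The sketch below avoids sjlabelled entirely and imitates a labelled column by hand; it assumes the labels are stored in a "labels" attribute (which is how sjlabelled and haven store them, as far as I know), so treat it as an illustration of the attribute behaviour rather than of the package itself.

```r
# Hand-made labelled vector; the "labels" attribute layout is an assumption
col1 <- c(1, 2, 3, 1, 2, 3)
attr(col1, "labels") <- c(a = 1, b = 2, c = 3)
col2 <- c(11, 22, 33, 11, 12, 13)

# (a) overwrite with a value that already has a label: attributes are kept
col1[col1 > 1 & col2 < 33] <- 2

# (b) overwrite with a value that has no label (NA): attributes are kept too
col1[col1 == 1] <- NA

col1
attr(col1, "labels")
```

The values change, but the label table attached to the vector is untouched, so 3 is still labelled "c" and the overwritten 2s pick up the label "b".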

Filtering list in R

I have a list
A <- c(1,2,3,4,5,6,7,8,9,10)
B <- c("a" ,"b", "c" ,"d","b" ,"f" ,"g" ,"a" ,"b" ,"a")
C <- c(25, 26, 27, 28, 29, 30, 31, 32, 10, 15)
mylist <- list(A,B,C)
mylist
[[1]]
[1] 1 2 3 4 5 6 7 8 9 10
[[2]]
[1] "a" "b" "c" "d" "b" "f" "g" "a" "b" "a"
[[3]]
[1] 25 26 27 28 29 30 31 32 10 15
I would like to select all components A,B,C of the list where second component B has value "a" or "b" .
Sample output
mylist
[[1]]
[1] 1 2 5 8 9 10
[[2]]
[1] "a" "b" "b" "a" "b" "a"
[[3]]
[1] 25 26 29 32 10 15
How can I do that? Note that each component has the same length.
To stay with a list, why not simply:
lapply(mylist, `[`, is.element(B, letters[1:2]))
#[[1]]
#[1] 1 2 5 8 9 10
#[[2]]
#[1] "a" "b" "b" "a" "b" "a"
#[[3]]
#[1] 25 26 29 32 10 15
I would go with a data.frame or data.table for this use case:
Using your original list (with a 10 added to A to have the same number of entries as B and C):
> df <- data.frame(A = mylist[[1]], B = mylist[[2]], C = mylist[[3]], stringsAsFactors = FALSE)
> df[df$B %in% c("a","b"),]
A B C
1 1 a 25
2 2 b 26
5 5 b 29
8 8 a 32
9 9 b 10
10 10 a 15
This subsets the data.frame to the rows where B is "a" or "b". If you are building the list yourself, you can skip the list step and build the data.frame directly.
If you really want a list at the end:
> as.list(df[df$B %in% c("a","b"),])
$A
[1] 1 2 5 8 9 10
$B
[1] "a" "b" "b" "a" "b" "a"
$C
[1] 25 26 29 32 10 15
If you wish to avoid the named entries, use unname: as.list(unname(df[..]))
Here is a simple solution. Note that it keeps or drops whole components depending on their second element; it does not filter the elements inside each component, which is what the question asks for.
First, I create mylist:
mylist <- list(1:10, letters[1:10], 25:16)
Then I create a function which returns TRUE if the second element of a component is "a" or "b", and FALSE otherwise:
> filt <- function(x) {
+ x[2] %in% c("a", "b")
+ }
>
Then I use sapply to apply the function to mylist and select only the components for which it returns TRUE:
> mylist[sapply(mylist, filt)]
[[1]]
[1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j"
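To filter the elements inside every component instead, the logical index can be computed once from the second component and then applied to all of them. A minimal sketch, using the data from the question; filter_by_component is a hypothetical helper name, not an existing function:

```r
mylist <- list(
  c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
  c("a", "b", "c", "d", "b", "f", "g", "a", "b", "a"),
  c(25, 26, 27, 28, 29, 30, 31, 32, 10, 15)
)

# Hypothetical helper: keep the positions where component `by` is in `values`
filter_by_component <- function(lst, by, values) {
  keep <- lst[[by]] %in% values      # one logical index for all components
  lapply(lst, function(x) x[keep])   # apply it to every component
}

filtered <- filter_by_component(mylist, by = 2, values = c("a", "b"))
filtered
```

Taking the index from lst[[by]] rather than from a separate global vector such as B keeps the function usable when only the list is available.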

Count the occurence of specific combinations of characters in a list

My question is very simple, but I can't manage to work it out.
I have run a variable selection method in R on 2000 genes using 1000 iterations and in each iteration I got a combination of genes. I would like to count the number of times each combination of genes occurs in R.
For example I have
# for iteration 1
genes[1] "a" "b" "c"
# for iteration 2
genes[2] "a" "b"
# for iteration 3
genes[3] "a" "c"
# for iteration 4
genes [4] "a" "b"
and this would give me
"a" "b" "c" 1
"a" "b" 2
"a" "c" 1
I have unlisted the list and counted how often each gene occurs, but what I am interested in is the combinations. I tried to create a table, but the gene vectors have unequal lengths. Thanks in advance.
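The way I could immediately think of is to paste them and then use table as follows (my_genes is assumed here to be the list of gene vectors, one per iteration, e.g. list(c("a","b","c"), c("a","b"), "c", c("a","b"))):
genes_p <- sapply(my_genes, paste, collapse=";")
freq <- as.data.frame(table(genes_p))
# genes_p Freq
# 1 a;b 2
# 2 a;b;c 1
# 3 c 1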
The above solution assumes that the genes are sorted by names and the same gene id doesn't occur more than once within an element of the list. If you want to account for both, then:
# sort genes before pasting
genes_p <- sapply(my_genes, function(x) paste(sort(x), collapse=";"))
# sort + unique
genes_p <- sapply(my_genes, function(x) paste(sort(unique(x)), collapse=";"))
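Applied to the four iterations shown in the question, the sort-then-paste idea looks roughly like this. It is only a sketch: genes is written out by hand from the question's example, and the column name combination is my own choice.

```r
# The four iterations from the question, one character vector per iteration
genes <- list(c("a", "b", "c"), c("a", "b"), c("a", "c"), c("a", "b"))

# One key per iteration; sort() makes c("b", "a") and c("a", "b") the same set
key <- vapply(genes, function(x) paste(sort(x), collapse = ";"), character(1))

# Count identical keys
freq <- as.data.frame(table(combination = key), stringsAsFactors = FALSE)
freq
```

vapply is used instead of sapply so that the result is guaranteed to be a character vector with one key per iteration.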
Edit: Following OP's question in the comments, the idea is to get all combinations of 2'ers (so to say), wherever possible, and then take the table. First I'll break down the code and write the steps separately for understanding. Then I'll group them together to get a one-liner.
# you first want all possible combinations of length 2 here
# that is, if vector is:
v <- c("a", "b", "c")
combn(v, 2)
# [,1] [,2] [,3]
# [1,] "a" "a" "b"
# [2,] "b" "c" "c"
This gives all the combinations taken 2 at a time. Now, you can just paste them in the same way. combn also accepts a function argument that is applied to each combination.
combn(v, 2, function(y) paste(y, collapse=";"))
# [1] "a;b" "a;c" "b;c"
So, for each set of genes in your list, you can do the same by wrapping this around a sapply as follows:
sapply(my_genes, function(x) combn(x, min(length(x), 2), function(y)
paste(y, collapse=";")))
The min(length(x), 2) is required because some of your gene list can be just 1 gene.
# [[1]]
# [1] "a;b" "a;c" "b;c"
# [[2]]
# [1] "a;b"
# [[3]]
# [1] "c"
# [[4]]
# [1] "a;b"
Now, you can unlist this to get a vector and then use table to get frequency:
table(unlist(sapply(my_genes, function(x) combn(x, min(length(x), 2), function(y)
paste(y, collapse=";")))))
# a;b a;c b;c c
# 3 1 1 1
You can wrap this in turn with as.data.frame(.) to get a data.frame:
as.data.frame(table(unlist(sapply(my_genes, function(x) combn(x, min(length(x), 2),
function(y) paste(y, collapse=";"))))))
# Var1 Freq
# 1 a;b 3
# 2 a;c 1
# 3 b;c 1
# 4 c 1

How to extract unique levels from 2 columns in a data frame in r

I have the data.frame
df<-data.frame("Site.1" = c("A", "B", "C"),
"Site.2" = c("D", "B", "B"),
"Tsim" = c(2, 4, 7),
"Jaccard" = c(5, 7, 1))
# Site.1 Site.2 Tsim Jaccard
# 1 A D 2 5
# 2 B B 4 7
# 3 C B 7 1
I can get the unique levels for each column using
top.x<-unique(df[1:2,c("Site.1")])
top.x
# [1] A B
# Levels: A B C
top.y<-unique(df[1:2,c("Site.2")])
top.y
# [1] D B
# Levels: B D
How do I get the unique levels for both columns and turn them into a vector i.e:
v <- c("A", "B", "D")
v
# [1] "A" "B" "D"
top.xy <- unique(unlist(df[1:2, 1:2]))
top.xy
# [1] A B D
# Levels: A B C D
Try union:
union(top.x, top.y)
# [1] "A" "B" "D"
union(unique(df[1:2, c("Site.1")]),
unique(df[1:2, c("Site.2")]))
# [1] "A" "B" "D"
You can get the unique levels for the first two columns:
de<- apply(df[,1:2],2,unique)
de
# $Site.1
# [1] "A" "B" "C"
# $Site.2
# [1] "D" "B"
Then you can take the symmetric difference of the two sets:
union(setdiff(de$Site.1,de$Site.2), setdiff(de$Site.2,de$Site.1))
# [1] "A" "C" "D"
If you're interested in just the first two rows (as in your example):
de<- apply(df[1:2,1:2],2,unique)
de
# Site.1 Site.2
# [1,] "A" "D"
# [2,] "B" "B"
union(de[,1],de[,2])
# [1] "A" "B" "D"
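If the result should be a plain character vector regardless of whether the site columns are factors (the default before R 4.0.0) or character vectors, converting each column first avoids any dependence on factor levels. A small sketch along those lines, with sites and v as illustrative names:

```r
df <- data.frame(Site.1 = c("A", "B", "C"),
                 Site.2 = c("D", "B", "B"),
                 Tsim = c(2, 4, 7),
                 Jaccard = c(5, 7, 1))

# Convert each site column to character so factor levels play no role
sites <- lapply(df[1:2, c("Site.1", "Site.2")], as.character)

# Flatten both columns into one vector and drop duplicates
v <- unique(unlist(sites, use.names = FALSE))
v
```

Selecting the two columns by name also keeps the numeric columns Tsim and Jaccard out of the result.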

Column Split without repeat

I have a dataframe with one column that I would like to split into several columns, but the number of splits is dynamic throughout the rows.
Var1
====
A/B
A/B/C
C/B
A/C/D/E
I have tried using colsplit(df$Var1,split="/",names=c("Var1","Var2","Var3","Var4")), but in rows with fewer than 4 values the values are repeated to fill the remaining columns.
From Hansi, the desired output would be:
Var1 Var2 Var3 Var4
[1,] "A" "B" NA NA
[2,] "A" "B" "C" NA
[3,] "C" "B" NA NA
[4,] "A" "C" "D" "E"
> read.table(text=as.character(df$Var1), sep="/", fill=TRUE)
V1 V2 V3 V4
1 A B
2 A B C
3 C B
4 A C D E
Leading zeros in digit-only fields can be preserved with colClasses="character":
a <- data.frame(Var1=c("01/B","04/B/C","0098/B","8708/C/D/E"))
read.table(text=as.character(a$Var1), sep="/", fill=TRUE, colClasses="character")
V1 V2 V3 V4
1 01 B
2 04 B C
3 0098 B
4 8708 C D E
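One caveat with this approach: read.table decides on the number of columns from the first five lines of input unless col.names says otherwise, so a longer row further down may not be handled as expected. A hedged sketch of one way around that, counting the fields up front; the six-row vector x and the Var1..Var4 names are made up for illustration:

```r
# Six rows; the longest one is the sixth, beyond the first five lines
x <- c("A/B", "A/B/C", "C/B", "B/C", "A/C", "A/C/D/E")

# Count the fields per row ourselves instead of relying on the first five lines
n <- max(lengths(strsplit(x, "/", fixed = TRUE)))

# Supplying n column names tells read.table how many columns to create
out <- read.table(text = x, sep = "/", fill = TRUE,
                  col.names = paste0("Var", seq_len(n)),
                  colClasses = "character")
dim(out)
```

colClasses = "character" is kept here so that fields such as "T" or "F" are not converted to logical values.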
If I understood your objective correctly here is one possible solution, I'm sure there is a better way of doing it but this was the first that came to mind:
a <- data.frame(Var1=c("A/B","A/B/C","C/B","A/C/D/E"))
splitNames <- c("Var1","Var2","Var3","Var4")
# R> a
# Var1
# 1 A/B
# 2 A/B/C
# 3 C/B
# 4 A/C/D/E
b <- t(apply(a,1,function(x){
temp <- unlist(strsplit(x,"/"));
return(c(temp,rep(NA,max(0,length(splitNames)-length(temp)))))
}))
colnames(b) <- splitNames
# R> b
# Var1 Var2 Var3 Var4
# [1,] "A" "B" NA NA
# [2,] "A" "B" "C" NA
# [3,] "C" "B" NA NA
# [4,] "A" "C" "D" "E"
I do not know of a function that solves your problem, but you can achieve it easily with standard R commands:
# Here are your data
df <- data.frame(Var1=c("A/B", "A/B/C", "C/B", "A/C/D/E"), stringsAsFactors=FALSE)
# Split
rows <- strsplit(df$Var1, split="/")
# Maximum amount of columns
columnCount <- max(sapply(rows, length))
# Fill with NA
rows <- lapply(rows, `length<-`, columnCount)
# Bind the padded rows into a character matrix (one row per input string)
out <- do.call(rbind, rows)
As it relies on strsplit, you may need to make some type conversion. See type.con
