Creating data.table from a list of unequal vector lengths - r

I am looking to create a data.table from a list of vectors of unequal length, but instead of recycling the values of the "shorter" vector, I want the missing entries to be filled with NAs. I have one possible solution, but it recycles values instead of padding with NA as needed.
Example:
library(data.table)
my_list <- list(A = 1:4, B = letters[1:5])
as.data.table(do.call(cbind, my_list))
A B
1: 1 a
2: 2 b
3: 3 c
4: 4 d
5: 1 e
But I want it to look like:
as.data.table(do.call(cbind, my_list))
A B
1: 1 a
2: 2 b
3: 3 c
4: 4 d
5: NA e
Thank you!

We need to make the lengths the same by appending NA to the end of the list elements whose length is less than the maximum length.
mx <- max(lengths(my_list))
as.data.table(do.call(cbind, lapply(my_list, `length<-`, mx)))
-output
A B
1: 1 a
2: 2 b
3: 3 c
4: 4 d
5: <NA> e
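This works because assigning a longer length to a vector pads it with NA; a quick illustration:
v <- 1:3
length(v) <- 5
v
# [1]  1  2  3 NA NA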
Instead of cbind/as.data.table, setDT is more compact
setDT(lapply(my_list, `length<-`, mx))[]
A B
1: 1 a
2: 2 b
3: 3 c
4: 4 d
5: NA e

You may use stringi::stri_list2matrix to make all the list elements the same length. It returns a character matrix, which is why type.convert() is used afterwards to restore the numeric column and setNames() to restore the column names.
my_list |>
stringi::stri_list2matrix() |>
data.table::as.data.table() |>
type.convert(as.is = TRUE) |>
setNames(names(my_list))
# A B
#1: 1 a
#2: 2 b
#3: 3 c
#4: 4 d
#5: NA e

Related

Count occurrences of value in multiple columns with duplicates

My problem is very similar to:
R: Count occurrences of value in multiple columns
However, the solution proposed there doesn't work for me because the same value may appear twice in one row, and I want to count only the rows in which it appears. I have worked out a solution, but it seems too long:
> toy_data = data.table(from=c("A","A","A","C","E","E"), to=c("B","C","A","D","F","E"))
> toy_data
from to
1: A B
2: A C
3: A A
4: C D
5: E F
6: E E
> #get a table with intra-link count
> A = data.table(table(unlist(toy_data[from==to,from ])))
> A
V1 N
1: A 1
2: E 1
> #get a table with total count
> B = data.table(table(unlist(toy_data[,c(from,to)])))
> B
V1 N
1: A 4
2: B 1
3: C 2
4: D 1
5: E 3
6: F 1
>
> # concatenate changing sign
> table = rbind(B,A[,.(V1,-N)],use.names=FALSE)
> # groupby and subtract
> table[,sum(N),by=V1]
V1 V1
1: A 3
2: B 1
3: C 2
4: D 1
5: E 2
6: F 1
Is there some function that would do the job in fewer lines? My idea was to concatenate from and to and then use match(), but I cannot find the right syntax.
EDIT: I know this would work: A = length(toy_data[from=="A"|to=="A",from]), but I would like to avoid looping over the various "A", "B", ... (and I don't know how to format the output that way).
You can try the code below
> toy_data[, to := replace(to, from == to, NA)][, data.frame(table(unlist(.SD)))]
Var1 Freq
1 A 3
2 B 1
3 C 2
4 D 1
5 E 2
6 F 1
or
library(dplyr)
toy_data %>%
  mutate(to = replace(to, from == to, NA)) %>%
  unlist() %>%
  table() %>%
  as.data.frame()
which gives
. Freq
1 A 3
2 B 1
3 C 2
4 D 1
5 E 2
6 F 1
Using data.table
library(data.table)
toy_data[from == to, to := NA][, .(to = na.omit(c(from, to)))][, .N, to]
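For readability, the same chain can be spelled out step by step (a sketch; tmp, long and value are just illustrative names):
tmp <- copy(toy_data)                             # work on a copy, since := modifies by reference
tmp[from == to, to := NA]                         # blank self-links so a row like (A, A) is counted once
long <- tmp[, .(value = na.omit(c(from, to)))]    # stack both columns, dropping the NAs
long[, .N, by = value]                            # count rows per value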
You could just subset the to vector:
data.table(table(unlist(toy_data[,c(from,to[to!=from])])))
V1 N
1: A 3
2: B 1
3: C 2
4: D 1
5: E 2
6: F 1
Using to:=NA as suggested by akrun, one can wrap the result in table(unlist()) and convert to data.table
data.table(table(unlist(toy_data[from==to, to:=NA, from])))

merge columns that have the same name r

I am working in R with a dataset that is created from mongodb with the use of mongolite.
I am getting a list that looks like so:
_id A B A B A B NA NA
1 a 1 b 2 e 5 NA NA
2 k 4 l 3 c 3 d 4
I would like to merge the dataset to look like this:
_id A B
1 a 1
2 k 4
1 b 2
2 l 3
1 e 5
2 c 3
1 NA NA
2 d 4
The NAs in the last column names are there because the columns are named from the first entry, so if a later entry has more columns than that, the extra columns don't get names assigned to them (if I get help for this as well it would be awesome, but it's not the reason I am here).
Also the number of columns might differ for different subsets of the dataset.
I have tried melt(), but since it is a list and not a data.frame it doesn't work as expected. I have tried stack(), but it didn't work because the columns have the same name and some of them don't even have a name.
I know this is a very weird situation and appreciate any help.
Thank you.
Using library(magrittr) (data.table is also needed for fread/setDF).
data:
library(magrittr)
library(data.table)
df <- fread("
_id A B A B A B NA NA
1 a 1 b 2 e 5 NA NA
2 k 4 l 3 c 3 d 4 ", header = TRUE)
setDF(df)
Code:
df2 <- df[, -1]                                        # drop the _id column
odds <- df2 %>% ncol %>% {(1:.) %% 2} %>% as.logical   # TRUE for the 1st, 3rd, ... columns (the A-type ones)
even <- df2 %>% ncol %>% {!(1:.) %% 2}                 # TRUE for the 2nd, 4th, ... columns (the B-type ones)
cbind(df[, 1, drop = FALSE],
      A = unlist(df2[, odds]),
      B = unlist(df2[, even]),
      row.names = NULL)
result:
# _id A B
# 1 1 a 1
# 2 2 k 4
# 3 1 b 2
# 4 2 l 3
# 5 1 e 5
# 6 2 c 3
# 7 1 <NA> NA
# 8 2 d 4
We can use data.table, assuming the A and B columns always follow each other. I created an example with two extra sets of NAs in the header. With grep we can find the columns that fread has named V8 etc., and using R's recycling of vectors you can rename multiple headers in one go (see the small illustration after the output below). If in your case these columns are named differently, change the pattern in the grep command. Then we reshape the data via melt().
library(data.table)
df <- fread("
_id A B A B A B NA NA NA NA
1 a 1 b 2 e 5 NA NA NA NA
2 k 4 l 3 c 3 d 4 e 5",
header = TRUE)
df
_id A B A B A B A B A B
1: 1 a 1 b 2 e 5 <NA> NA <NA> NA
2: 2 k 4 l 3 c 3 d 4 e 5
# assuming A B are always following each other. Can be done in 1 statement.
cols <- names(df)
cols[grep(pattern = "^V", x = cols)] <- c("A", "B")
names(df) <- cols
# melt data (if df is a data.frame, replace df with setDT(df))
df_melted <- melt(df, id.vars = 1,
measure.vars = patterns(c('A', 'B')),
value.name=c('A', 'B'))
df_melted
_id variable A B
1: 1 1 a 1
2: 2 1 k 4
3: 1 2 b 2
4: 2 2 l 3
5: 1 3 e 5
6: 2 3 c 3
7: 1 4 <NA> NA
8: 2 4 d 4
9: 1 5 <NA> NA
10: 2 5 e 5
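The header renaming step above relies on R recycling the right-hand side of a subset assignment; a tiny standalone illustration with made-up names:
x <- c("V8", "V9", "V10", "V11")
x[1:4] <- c("A", "B")   # "A", "B" are recycled across the four positions
x
# [1] "A" "B" "A" "B"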
Thank you for your help, both answers were great inspirations.
@Andre Elrico gave a solution that worked better on the reproducible example, while @phiver's worked better on my overall problem.
By combining both I came up with the following.
library(data.table)
#The data were in a list of lists called list for this example
temp <- as.data.table(matrix(t(sapply(list, '[', seq(max(sapply(list, length))))),
nrow = m))
# m here is the number of lists in list
cols <- names(temp)
cols[grep(pattern = "^V", x = cols)] <- c("B", "A")
#They need to be the opposite way because the first column is going to be substituted with id, and this way they fall on the correct column after that
cols[1] <- "id"
names(temp) <- cols
l <- melt.data.table(temp, id.vars = 1,
measure.vars = patterns(c("A", "B")),
value.name = c("A", "B"))
That way I can use this also if I have more than 2 columns that I need to manipulate like that.

rbindlist data.tables different dimensions

I run a function multiple times, producing different outputs, as in the example below.
require(data.table)
myfunction<-function(x){
DT1<-data.table(a=c(1,2,3),b=c("a","b","c"))
DT2<-data.table(d=c(4,5,6), e=c("d","e","f"))
return(list(DT1=DT1, DT2=DT2))
}
result<-lapply(1:2, myfunction)
I want to bind the results. The desired output is the one shown below; my real example uses hundreds of tables.
l1<-rbindlist(list(result[[1]]$DT1, result[[2]]$DT1), idcol = TRUE)
l2<-rbindlist(list(result[[1]]$DT2, result[[2]]$DT2), idcol = TRUE)
DESIRED_OUTPUT<-list(l1, l2)
I tried the approach from this question, but it is not working for me:
rbindlist data.tables with different number of columns
Update
The option that @nicola proposed doesn't work when the number of elements in the list is different from 2 (it only works for the first example, with DT1 and DT2). As a solution I create a variable "l" that holds the number of elements inside the list returned by the function.
New example with solution.
require(data.table)
myfunction<-function(x){
DT1<-data.table(a=c(1,2,3),b=c("a","b","c"))
DT2<-data.table(d=c(4,5), e=c("d","e"))
DT3<-data.table(f=c(7,8,NA,9), g=c("g","h","i","j"))
return(list(DT1=DT1, DT2=DT2, DT3=DT3))
}
result<-lapply(1:5, myfunction)
l<-unique(sapply(result, length))
apply(matrix(unlist(result,recursive=FALSE),nrow=l),1,rbindlist,idcol=TRUE)
Here's an option:
do.call(function(...) Map(function(...) rbind(..., idcol = T), ...), result)
#$DT1
# .id a b
#1: 1 1 a
#2: 1 2 b
#3: 1 3 c
#4: 2 1 a
#5: 2 2 b
#6: 2 3 c
#
#$DT2
# .id d e
#1: 1 4 d
#2: 1 5 e
#3: 1 6 f
#4: 2 4 d
#5: 2 5 e
#6: 2 6 f
Here's another:
lapply(purrr::transpose(result), rbindlist, idcol = T)
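For intuition: purrr::transpose() flips the nesting of result, turning a list of runs (each holding DT1 and DT2) into one list per table name (each holding that table from every run), which rbindlist can then stack. You can inspect the regrouped structure with, for example:
str(purrr::transpose(result), max.level = 2)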
An attempt that should match on the names of the list components:
l <- unlist(result, recursive = FALSE)
Map(
function(LL, n) rbindlist(unname(LL[names(LL) %in% n]), idcol = TRUE),
list(l),
unique(names(l))
)
#[[1]]
# .id a b
#1: 1 1 a
#2: 1 2 b
#3: 1 3 c
#4: 2 1 a
#5: 2 2 b
#6: 2 3 c
#
#[[2]]
# .id d e
#1: 1 4 d
#2: 1 5 e
#3: 1 6 f
#4: 2 4 d
#5: 2 5 e
#6: 2 6 f

How to sort a data.table using a target vector

So, I have the following data.table
DT = data.table(x=rep(c("b","a","c"),each=3), y=c(1,2,3))
> DT
x y
1: b 1
2: b 2
3: b 3
4: a 1
5: a 2
6: a 3
7: c 1
8: c 2
9: c 3
And I have the following vector
k <- c("2","3","1")
I want to use k as a target vector to sort DT using y and get something like this.
> DT
x y
1: b 2
2: a 2
3: c 2
4: b 3
5: a 3
6: c 3
7: b 1
8: a 1
9: c 1
Any ideas? If I use DT[order(k)] I get a subset of the original data, and that isn't what I am looking for.
Throw a call to match() in there.
DT[order(match(y, as.numeric(k)))]
# x y
# 1: b 2
# 2: a 2
# 3: c 2
# 4: b 3
# 5: a 3
# 6: c 3
# 7: b 1
# 8: a 1
# 9: c 1
Actually DT[order(match(y, k))] would work as well, but it is probably safest to make the arguments to match() of the same class just in case.
Note: match() is known to be sub-optimal in some cases. If you have a large number of rows, you may want to switch to fastmatch::fmatch for faster matching.
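A minimal sketch of that variant (assuming the fastmatch package is installed):
library(fastmatch)
DT[order(fmatch(y, as.numeric(k)))]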
You can do this:
DT = data.table(x=rep(c("b","a","c"),each=3), y=c(1,2,3))
k <- c("2","3","1")
setkey(DT,y)
DT[data.table(as.numeric(k))]
or (from the comment of Richard)
DT = data.table(x=rep(c("b","a","c"),each=3), y=c(1,2,3))
k <- c("2","3","1")
DT[data.table(y = as.numeric(k)), on = "y"]

data preparation part II

There's another problem I encountered which I think is quite interesting:
dt <- data.table(K=c("A","A","A","B","B","B"),A=c(2,3,4,1,3,4),B=c(3,3,3,1,1,1))
dt
K A B
1: A 2 3
2: A 3 3
3: A 4 3
4: B 1 1
5: B 3 1
6: B 4 1
Now I want a somewhat "higher-level" view of the data. For each letter in K there should be only one line, and "A_sum" should contain the number of A values for which B has the same value. So there are three values for B=3 and three values for B=1.
Resulting data.table:
dt_new
K A_sum B
1: A 3 3
2: B 3 1
It's not clear how you want to treat K, but here's one option:
dt_new <- dt[, list(A_sum = length(A)), by = list(K, B)]
# K B A_sum
# 1: A 3 3
# 2: B 1 3
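Since A_sum here is just a row count per (K, B) group, the same aggregation can also be written with data.table's .N counter (a sketch that should give the same result as above):
dt_new <- dt[, .(A_sum = .N), by = .(K, B)]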

Resources