Challenge: Recode data from Alpha to Non sequential Numeric [duplicate]

Challenge: Recode data from Alpha to Non sequential Numeric [duplicate] - r

This question already has answers here:
Convert letters to numbers
(5 answers)
Closed 5 years ago.
We are looking for a blazing fast solution to the following problem, in R (Rcpp is allowed).
I have a character vector:
set.seed(42)
x <- sample(LETTERS[1:4], 1e6, replace = TRUE)
And I want to change it to a non sequential numeric vector, where:
A = 5
B = 4
C = 3
D = 1
For example:
c("A", "B", "C", "D")
Would be:
c(5,4,3,1)
The interns and I have what we think is a ridiculously fast solution already but we think the Internet can beat us. We'll add our fastest solution as an answer after we get some responses.
Let's see!
Timings so far:
library(microbenchmark)
set.seed(42)
x <- sample(LETTERS[1:4], 1e6, replace = TRUE)
richscriven <- function(x) {
as.vector(c(A=5, B=4, C=3, D=2, E=1)[x])
}
richscriven_unname <- function(x) {
unname(c(A=5, B=4, C=3, D=2, E=1)[x])
}
richscriven_op <- function(x) {
(5:1)[c(factor(x))]
}
op_and_interns_fun <- function(x) {
c(5,4,3,1)[as.numeric(as.factor(x))]
}
ronakshah <- function(x) {
vec = c("A" = 5, "B" = 4, "C" = 3, "D" = 1)
unname(vec[match(x, names(vec))])
}
microbenchmark(
richscriven_unname(x),
richscriven(x),
richscriven_op(x),
op_and_interns_fun(x),
ronakshah(x),
times = 15
)
Unit: milliseconds
expr min lq mean median uq max neval
richscriven_unname(x) 36.06018 38.01026 62.80854 38.87179 41.86411 337.65773 15
richscriven(x) 37.90615 41.61194 43.50555 44.14130 45.17277 47.47804 15
richscriven_op(x) 31.70345 37.43262 44.10522 41.34828 45.22127 88.79605 15
op_and_interns_fun(x) 40.18935 44.20475 49.48811 45.77867 48.15706 99.85034 15
ronakshah(x) 29.36408 32.52615 42.40753 35.09052 38.55763 95.78571 15

We can put the vector in a named numeric vector
vec <- c("A" = 5, "B" = 4, "C" = 3, "D" = 1)
We can then write a function,
get_recoded_data <- function(num_vec, recode_data) {
unname(recode_data[match(num_vec, names(recode_data))])
}
and call the function using
get_recoded_data(x, vec)
On my system it takes,
system.time(get_recoded_data(x, vec))
#user system elapsed
#0.028 0.004 0.032
I am using MacOS Sierra 10.12.6, 16GB RAM i7 RStudio 1.1.383
From #zacdav's comment using fmatch function from fastmatch package gives a good performance enhancement
get_recoded_data <- function(num_vec, recode_data) {
unname(recode_data[fmatch(num_vec, names(recode_data))])
}
Checking it on the same data, I get
system.time(get_recoded_data(x, vec))
#user system elapsed
#0.017 0.004 0.021

Our answer relies on a somewhat uncommon method of subsetting by position:
op_and_interns_fun <- function(x) {
c(5,4,3,1)[as.numeric(as.factor(x))]
}

Related

Convert named list to a data.frame efficiently [duplicate]

This question already has answers here:
Named List To/From Data.Frame
(4 answers)
Closed 4 years ago.
I'm looking for a efficient way for the following transformation:
Example input:
ob <- list(a = 2, b = 3)
Expected output:
key value
1 a 2
2 b 3
Current (verbose) solution:
data.frame(key = names(ob), value = unlist(ob, use.names = FALSE))

Try stack
stack(ob)
# values ind
#1 2 a
#2 3 b
You would need to change the names though. Use setNames
setNames(stack(ob), nm = c("value", "key"))
benchmark
Addressing #Roland's comment, stack seems to be more efficient indeed. please don't use stack but OP's solution for efficiency reasons.
n <- 1e5
lst <- as.list(seq_len(n))
names(lst) <- paste0("a", seq_len(n))
library(microbenchmark)
benchmark <- microbenchmark(
snoram = snoram(lst),
markus = markus(lst), times = 50
)
benchmark
#Unit: milliseconds
# expr min lq mean median uq max neval
# snoram 2.475258 2.594479 2.739639 2.652843 2.715575 5.92216 50
# markus 114.387692 119.028200 134.745626 137.524606 144.045112 162.11510 50
Functions used (so far)
snoram <- function(l) {
data.frame(key = names(l), value = unlist(l, use.names = FALSE),
stringsAsFactors = FALSE) # this gives a hugh performance gain
# thanks to #Roland
}
markus <- function(l) {
setNames(stack(l), nm = c("value", "key"))
}

Assign specified numerical value to a vector of strings [duplicate]

This question already has answers here:
Map array of strings to an array of integers
(4 answers)
Closed 5 years ago.
Is there a way of creating new vector of numerical values based on my vector of strings?
For example I have this :
a<-c("A", "B", "A", "A")
and from this a I want to make new vector b with values replacing "A" with 1 and "B" with -1 so b(1, -1, 1, 1)
I tried using something like factor(a, levels = c("A", "B"), labels = c(1, -1))
but this doesn't produce numerical vector.

No need to that, just use:
a[a=="A"] = 1
a[a=="B"] = -1
a = as.numeric(a)
if you want keep a unchanged use:
b = a
b[a=="A"] = 1
b[a=="B"] = -1
b = as.numeric(b)
Or better solution as #joran said:
b = ifelse(a == "A",1,-1)

# Packages
library(stringi)
library(microbenchmark)
# 1. Vector
# a <- c("A", "B", "A", "A")
a <- stri_rand_strings(1e5, 1, pattern = "[A-B]")
# 2. The 'factor' solution
f1 <- function(){ as.numeric(as.character(factor(a, levels = c("A", "B"), labels = c(1, -1)))) }
# 3. The faster solution
f2 <- function(){ (-1)^(a != "A") }
# 3. Ifelse solution
f3 <- function(){ ifelse(a == "A", 1, -1) }
# 4. Ignore case of letters or my solution
f4 <- function(){ ifelse(as.numeric(grepl("a", a, ignore.case = TRUE)) == 1, 1, -1) }
# 5. Code map solution from "Nathan Werth"
f5 <- function(){ c(A = 1, B = -1)[a] }
# 6. Test
microbenchmark(
f1(), f2(), f3(), f4(), f5())
Unit: milliseconds
expr min lq mean median uq max neval cld
f1() 23.331763 23.648421 28.253174 24.235554 26.582799 123.49315 100 b
f2() 5.808460 6.025908 6.421053 6.067174 6.200166 12.94342 100 a
f3() 13.817060 14.926539 25.900652 16.388596 18.122837 129.67193 100 b
f4() 28.772036 31.363670 39.185333 32.352557 34.388918 134.35915 100 c
f5() 4.577321 5.186689 8.727417 7.375286 7.895280 106.31922 100 a

code_map <- c(A = 1, B = -1)
b <- code_map[a]

Faster version of combn

Is there a way to speed up the combn command to get all unique combinations of 2 elements taken from a vector?
Usually this would be set up like this:
# Get latest version of data.table
library(devtools)
install_github("Rdatatable/data.table", build_vignettes = FALSE)
library(data.table)
# Toy data
d <- data.table(id=as.character(paste0("A", 10001:15000)))
# Transform data
system.time({
d.1 <- as.data.table(t(combn(d$id, 2)))
})
However, combn is 10 times slower (23sec versus 3 sec on my computer) than calculating all possible combinations using data.table.
system.time({
d.2 <- d[, list(neighbor=d$id[-which(d$id==id)]), by=c("id")]
})
Dealing with very large vectors, I am searching for a way to save memory by only calculating the unique combinations (like combn), but with the speed of data.table (see second code snippet).
I appreciate any help.

Here's a way using data.table function foverlaps(), that also turns out to be fast!
require(data.table) ## 1.9.4+
d[, `:=`(id1 = 1L, id2 = .I)] ## add interval columns for overlaps
setkey(d, id1, id2)
system.time(olaps <- foverlaps(d, d, type="within", which=TRUE)[xid != yid])
# 0.603 0.062 0.717
Note that foverlaps() does not calculate all permutations. The subset xid != yid is needed to remove self overlaps. The subset could be internally handled more efficiently by implementing ignoreSelf argument - similar to IRanges::findOverlaps.
Now it's just a matter of performing a subset using the ids obtained:
system.time(ans <- setDT(list(d$id[olaps$xid], d$id[olaps$yid])))
# 0.576 0.047 0.662
So totally, ~1.4 seconds.
The advantage is that you can do the same way even if your data.table d has more than 1 column on which you've to get the combinations for, and using the same amount of memory (since we return the indices). In that case, you'd just do:
cbind(d[olaps$xid, ..your_cols], d[olaps$yid, ..your_cols])
But it's limited to replacing just combn(., 2L). Not more than 2L.

You could use combnPrim from gRbase
source("http://bioconductor.org/biocLite.R")
biocLite("gRbase") # will install dependent packages automatically.
system.time({
d.1 <- as.data.table(t(combn(d$id, 2)))
})
# user system elapsed
# 27.322 0.585 27.674
system.time({
d.2 <- as.data.table(t(combnPrim(d$id,2)))
})
# user system elapsed
# 2.317 0.110 2.425
identical(d.1[order(V1, V2),], d.2[order(V1,V2),])
#[1] TRUE

A post with any variation of the word Fast in the title is incomplete without benchmarks. Before we post any benchmarks, I would just like to mention that since this question was posted, two highly optimized packages, arrangements and RcppAlgos (I am the author) for generating combinations have been released for R. Note that since version 2.3.0 for RcppAlgos we can take advantage of multiple threads for even greater efficiency.
To give you an idea of their speed over combn and gRbase::combnPrim, here is a basic benchmark:
## We test generating just over 3 million combinations
choose(25, 10)
[1] 3268760
microbenchmark(arrngmnt = arrangements::combinations(25, 10),
combn = combn(25, 10),
gRBase = gRbase::combnPrim(25, 10),
serAlgos = RcppAlgos::comboGeneral(25, 10),
parAlgos = RcppAlgos::comboGeneral(25, 10, nThreads = 4),
unit = "relative", times = 20)
Unit: relative
expr min lq mean median uq max neval
arrngmnt 2.979378 3.072319 1.898390 3.756307 2.139258 0.4842967 20
combn 226.470755 230.410716 118.157110 232.905393 125.718512 17.7778585 20
gRBase 34.219914 34.209820 18.789954 34.218320 19.934485 3.6455493 20
serAlgos 2.836651 3.078791 2.458645 3.703929 2.231475 1.1652445 20
parAlgos 1.000000 1.000000 1.000000 1.000000 1.000000 1.0000000 20
Now, we benchmark the other functions posted for the very specific case of producing combinations choose 2 and producing a data.table object.
The functions are as follows:
funAkraf <- function(d) {
a <- comb2.int(length(d$id)) ## comb2.int from the answer given by #akraf
setDT(list(V1 = d$id[a[,1]], V2 = d$id[a[,2]]))
}
funAnirban <- function(d) {
indices <- combi2inds(d$id)
ans2 <- setDT(list(d$id[indices$xid], d$id[indices$yid]))
ans2
}
funArun <- function(d) {
d[, `:=`(id1 = 1L, id2 = .I)] ## add interval columns for overlaps
setkey(d, id1, id2)
olaps <- foverlaps(d, d, type="within", which=TRUE)[xid != yid]
ans <- setDT(list(d$id[olaps$xid], d$id[olaps$yid]))
ans
}
funArrangements <- function(d) {
a <- arrangements::combinations(x = d$id, k = 2)
setDT(list(a[, 1], a[, 2]))
}
funGRbase <- function(d) {
a <- gRbase::combnPrim(d$id,2)
setDT(list(a[1, ], a[2, ]))
}
funOPCombn <- function(d) {
a <- combn(d$id, 2)
setDT(list(a[1, ], a[2, ]))
}
funRcppAlgos <- function(d) {
a <- RcppAlgos::comboGeneral(d$id, 2, nThreads = 4)
setDT(list(a[, 1], a[, 2]))
}
Benchmark with OP Data
And here are the benchmarks on the example given by the OP:
d <- data.table(id=as.character(paste0("A", 10001:15000)))
microbenchmark(funAkraf(d),
funAnirban(d),
funArrangements(d),
funArun(d),
funGRbase(d),
funOPCombn(d),
funRcppAlgos(d),
times = 10, unit = "relative")
Unit: relative
expr min lq mean median uq max neval
funAkraf(d) 3.220550 2.971264 2.815023 2.665616 2.344018 3.383673 10
funAnirban(d) 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 10
funArrangements(d) 1.464730 1.689231 1.834650 1.960233 1.932361 1.693305 10
funArun(d) 3.256889 2.908075 2.634831 2.729180 2.432277 2.193849 10
funGRbase(d) 3.513847 3.340637 3.327845 3.196399 3.291480 3.129362 10
funOPCombn(d) 30.310469 26.255374 21.656376 22.386270 18.527904 15.626261 10
funRcppAlgos(d) 1.676808 1.956696 1.943773 2.085968 1.949133 1.804180 10
We see that the function provided by #AnirbanMukherjee is the fastest for this task, followed by RcppAlgos/arrangements. For this task, nThreads has no effect as the vector passed is a character, which is not thread safe. What if we instead converted id to a factor?
Benchmarks with Factors (i.e. Categorical Variables)
dFac <- d
dFac$id <- as.factor(dFac$id)
library(microbenchmark)
microbenchmark(funAkraf(dFac),
funAnirban(dFac),
funArrangements(dFac),
funArun(dFac),
funGRbase(dFac),
funOPCombn(dFac),
funRcppAlgos(dFac),
times = 10, unit = "relative")
Unit: relative
expr min lq mean median uq max neval
funAkraf(dFac) 10.898202 10.949896 7.589814 10.01369 8.050005 5.557014 10
funAnirban(dFac) 3.104212 3.337344 2.317024 3.00254 2.471887 1.530978 10
funArrangements(dFac) 2.054116 2.058768 1.858268 1.94507 2.797956 1.691875 10
funArun(dFac) 10.646680 12.905119 7.703085 11.50311 8.410893 3.802155 10
funGRbase(dFac) 16.523356 21.609917 12.991400 19.73776 13.599870 6.498135 10
funOPCombn(dFac) 108.301876 108.753085 64.338478 95.56197 65.494335 28.183104 10
funRcppAlgos(dFac) 1.000000 1.000000 1.000000 1.00000 1.000000 1.000000 10
Now, we see that RcppAlgos is around 2x faster than any other solution. In particular, the RcppAlgos solution is about 3x than the formerly fastest solution given by Anirban. It should be noted that this increase in efficiency was possible because factor variables are really integers underneath the hood with some additional attributes.
Confirm Equality
They all give the same result as well. The only caveat is that the gRbase solution doesn't support factors. That is, if you pass a factor, it will be converted to character. Thus all of the solutions will give the same result if you were to pass dFac except for the gRbase solution:
identical(funAkraf(d), funOPCombn(d))
#[1] TRUE
identical(funAkraf(d), funArrangements(d))
#[1] TRUE
identical(funRcppAlgos(d), funArrangements(d))
#[1] TRUE
identical(funRcppAlgos(d), funAnirban(d))
#[1] TRUE
identical(funRcppAlgos(d), funArun(d))
#[1] TRUE
## different order... we must sort
identical(funRcppAlgos(d), funGRbase(d))
[1] FALSE
d1 <- funGRbase(d)
d2 <- funRcppAlgos(d)
## now it's the same
identical(d1[order(V1, V2),], d2[order(V1,V2),])
#[1] TRUE
Thanks to #Frank for pointing out how to compare two data.tables without going through the pains of creating new data.tables and then arranging them:
fsetequal(funRcppAlgos(d), funGRbase(d))
[1] TRUE

Here is a solution using Rcpp.
library(Rcpp)
library(data.table)
cppFunction('
Rcpp::DataFrame combi2(Rcpp::CharacterVector inputVector){
int len = inputVector.size();
int retLen = len * (len-1) / 2;
Rcpp::CharacterVector outputVector1(retLen);
Rcpp::CharacterVector outputVector2(retLen);
int start = 0;
for (int i = 0; i < len; ++i){
for (int j = i+1; j < len; ++j){
outputVector1(start) = inputVector(i);
outputVector2(start) = inputVector(j);
++start;
}
}
return(Rcpp::DataFrame::create(Rcpp::Named("id") = outputVector1,
Rcpp::Named("neighbor") = outputVector2));
};
')
# Toy data
d <- data.table(id=as.character(paste0("A", 10001:15000)))
system.time({
d.2 <- d[, list(neighbor=d$id[-which(d$id==id)]), by=c("id")]
})
# 1.908 0.397 2.389
system.time({
d[, `:=`(id1 = 1L, id2 = .I)] ## add interval columns for overlaps
setkey(d, id1, id2)
olaps <- foverlaps(d, d, type="within", which=TRUE)[xid != yid]
ans <- setDT(list(d$id[olaps$xid], d$id[olaps$yid]))
})
# 0.653 0.038 0.705
system.time(ans2 <- combi2(d$id))
# 1.377 0.108 1.495
Using the Rcpp function to get the indices and then form the data.table, works better.
cppFunction('
Rcpp::DataFrame combi2inds(const Rcpp::CharacterVector inputVector){
const int len = inputVector.size();
const int retLen = len * (len-1) / 2;
Rcpp::IntegerVector outputVector1(retLen);
Rcpp::IntegerVector outputVector2(retLen);
int indexSkip;
for (int i = 0; i < len; ++i){
indexSkip = len * i - ((i+1) * i)/2;
for (int j = 0; j < len-1-i; ++j){
outputVector1(indexSkip+j) = i+1;
outputVector2(indexSkip+j) = i+j+1+1;
}
}
return(Rcpp::DataFrame::create(Rcpp::Named("xid") = outputVector1,
Rcpp::Named("yid") = outputVector2));
};
')
system.time({
indices <- combi2inds(d$id)
ans2 <- setDT(list(d$id[indices$xid], d$id[indices$yid]))
})
# 0.389 0.027 0.425

Here are two base-R solutions if you don't want to use additional dependencies:
comb2.int uses rep and other sequence generating functions to generate the desired output.
comb2.mat creates a matrix, uses upper.tri() to get the upper triangle and which(..., arr.ind = TRUE) to obtain the column and row indices => all combinations.
Possibility 1: comb2.int
comb2.int <- function(n, rep = FALSE){
if(!rep){
# e.g. n=3 => (1,2), (1,3), (2,3)
x <- rep(1:n,(n:1)-1)
i <- seq_along(x)+1
o <- c(0,cumsum((n-2):1))
y <- i-o[x]
}else{
# e.g. n=3 => (1,1), (1,2), (1,3), (2,2), (2,3), (3,3)
x <- rep(1:n,n:1)
i <- seq_along(x)
o <- c(0,cumsum(n:2))
y <- i-o[x]+x-1
}
return(cbind(x,y))
}
Possibility 2: comb2.mat
comb2.mat <- function(n, rep = FALSE){
# Use which(..., arr.ind = TRUE) to get coordinates.
m <- matrix(FALSE, nrow = n, ncol = n)
idxs <- which(upper.tri(m, diag = rep), arr.ind = TRUE)
return(idxs)
}
The functions give the same result as combn(.):
for(i in 2:8){
# --- comb2.int ------------------
stopifnot(comb2.int(i) == t(combn(i,2)))
# => Equal
# --- comb2.mat ------------------
m <- comb2.mat(i)
colnames(m) <- NULL # difference 1: colnames
m <- m[order(m[,1]),] # difference 2: output order
stopifnot(m == t(combn(i,2)))
# => Equal up to above differences
}
But I have other elements in my vector than sequencial integers!
Use the return values as indices:
v <- LETTERS[1:5]
c <- comb2.int(length(v))
cbind(v[c[,1]], v[c[,2]])
#> [,1] [,2]
#> [1,] "A" "B"
#> [2,] "A" "C"
#> [3,] "A" "D"
#> [4,] "A" "E"
#> [5,] "B" "C"
#> [6,] "B" "D"
#> [7,] "B" "E"
#> [8,] "C" "D"
#> [9,] "C" "E"
#> [10,] "D" "E"
Benchmark:
time(combn) = ~5x time(comb2.mat) = ~80x time(comb2.int):
library(microbenchmark)
n <- 800
microbenchmark({
comb2.int(n)
},{
comb2.mat(n)
},{
t(combn(n, 2))
})
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> { comb2.int(n) } 4.394051 4.731737 6.350406 5.334463 7.22677 14.68808 100
#> { comb2.mat(n) } 20.131455 22.901534 31.648521 24.411782 26.95821 297.70684 100
#> { t(combn(n, 2)) } 363.687284 374.826268 391.038755 380.012274 389.59960 532.30305 100

how to remove unique values from a vector

I have a large numeric vector - how can I remove the unique values from it efficiently?
To give a simplified example, how can I get from vector a to vector b?
> a = c(1, 2, 3, 3, 2, 4) # 1 and 4 are the unique values
> b = c(2, 3, 3, 2)

To add to the options already available:
a[duplicated(a) | duplicated(a, fromLast=TRUE)]
# [1] 2 3 3 2
Update: More benchmarks!
Comparing Prasanna's answer with mine, and comparing it against asieira's functions, we get the following:
fun1 <- function(x) x[x %in% x[duplicated(x)]]
fun2 <- function(x) x[duplicated(x) | duplicated(x, fromLast=TRUE)]
set.seed(1)
a <- ceiling(runif(1000000, min=0, max=100))
library(microbenchmark)
microbenchmark(remove.uniques1(a), remove.uniques2(a),
fun1(a), fun2(a), times = 20)
# Unit: milliseconds
# expr min lq median uq max neval
# remove.uniques1(a) 1957.9565 1971.3125 2002.7045 2057.0911 2151.1178 20
# remove.uniques2(a) 2049.9714 2065.6566 2095.4877 2146.3000 2210.6742 20
# fun1(a) 213.6129 216.6337 219.2829 297.3085 303.9394 20
# fun2(a) 154.0829 155.5459 155.9748 158.9121 246.2436 20
I suspect that the number of unique values would also make a difference in terms of the efficiency of these approaches.

a[a %in% a[duplicated(a)]]
[1] 2 3 3 2

This should give the right answer.
a = c(1, 2, 3, 3, 2, 4)
dups <- duplicated(a)
dup.val <- a[dups]
a[a %in% dup.val]

One vectorized way to do this is to use the built-in table function to find which values only appear once, and then remove them from the vector:
> a = c(1, 2, 3, 3, 2, 4)
> tb.a = table(a)
> appears.once = as.numeric(names(tb.a[tb.a==1]))
> appears.once
[1] 1 4
> b = a[!a %in% appears.once]
> b
[1] 2 3 3 2
Notice the table function converts the values from the original vector to the names, which is character. So we need to convert it back to numeric in your example.
Another way of doing that with data.table:
> dt.a = data.table(a=a)
> dt.a[,count:=.N,by=a]
> b = dt.a[count>1]$a
> b
[1] 2 3 3 2
Now let's time them:
remove.uniques1 <- function(x) {
tb.x = table(x)
appears.once = as.numeric(names(tb.x[tb.x==1]))
return(x[!x %in% appears.once])
}
remove.uniques2 <- function(x) {
dt.x = data.table(data=x)
dt.x[,count:=.N,by=data]
return(dt.x[count>1]$data)
}
> a = ceiling(runif(1000000, min=0, max=100))
> system.time( remove.uniques1(a) )
user system elapsed
1.598 0.033 1.658
> system.time( remove.uniques2(a) )
user system elapsed
0.845 0.007 0.855
So both are pretty fast, but the data.table version is clearly faster. Not to mention remove.uniques2 preserves whatever type the input vector is. In the case of remove.uniques1, however, you have to replace the call to as.numeric to whatever fits the type of your original vector.

Revert list structure

GOAL
Given a list of lists my goal is to reverse its structure (R language).
So, I want to bring the elements of the nested lists to be elements of the tier one list.
Probably an example better specifies my purpose. Given:
z <- list(z1 = list(a = 1, b = 2, c = 3), z2 = list(b = 4, a = 1, c = 0))
I want an output equivalent to the subsequent R object:
o <- list(a = list(z1 = 1, z2 = 1), b = list(z1 = 2, z2 = 4), c = list(z1 = 3, z2 = 0))
SOLUTIONS
MY SOLUTION
I created my own solution, which I am attaching below, but let me know if there is some better.
revert_list_str_1 <- function(ls) {
res <- lapply(names(ls[[1]]), function(n, env) {
name <- paste(n, 'elements', sep = '_')
assign(name, vector('list', 0))
inner <- sapply(ls, function(x) {
assign(name, c(get(name), x[which(names(x) == n)]))
})
names(inner) <- names(ls)
inner
})
names(res) <- names(ls[[1]])
res
}
Executing str(revert_list_str_1(z)) I obtain the subsequent output, corresponding to what I wanted.
List of 3
$ a:List of 2
..$ z1: num 1
..$ z2: num 1
$ b:List of 2
..$ z1: num 2
..$ z2: num 4
$ c:List of 2
..$ z1: num 3
..$ z2: num 0
But as I said I'd like to investigate (and learn) the existence of a more elegant and dynamic solution.
In fact my solution works fully only if all the nested lists have the same names (also in different order). This because of names(ls[[1]]). I would also point out that it acts only on lists of 2 levels, like the one reported.
So, do you know other solutions that are more dynamic? Can rapply and/or Filter functions be useful for this task?
end edit 1.
ANALYSIS OF PROPOSED SOLUTIONS
I've done a little analysis of the proposes solutions, thans you all !.
The analysis consists of verifying the following points for all functions:
accepted classes (nested list elements)
type preserved also if there are elements with different types (if they are atomic)
object contained in the elements preserved (e.g. a matrix)
columns considered (for columns I mean the names of the nested lists)
not common columns ignored (the classification 'not' is understood positively in this case)
not common columns preserved
it works also when columns do not match (based only on the names of the first nested list)
In all this cases the classification 'yes' is understood positively execept for point 2.1.
This are all the functions I've considered (the comments relate to the analysis items mentioned above):
# yes 1.1
# yes 1.2
# yes 2.1, not 2.2, not 2.3
revert_list_str_1 <- function(ls) { # #leodido
# see above
}
# not 1.1
# not 1.2
# not 2.1, not 2.2, not 2.3
revert_list_str_2 <- function(ls) { # #mnel
# convert each component of list to a data.frame
# so rbind.data.frame so named elements are matched
x <- data.frame((do.call(rbind, lapply(ls, data.frame))))
# convert each column into an appropriately named list
o <- lapply(as.list(x), function(i, nam) as.list(`names<-`(i, nam)), nam = rownames(x))
o
}
# yes 1.1
# yes 1.2
# yes 2.1, not 2.2, yes 2.3
revert_list_str_3 <- function(ls) { # #mnel
# unique names
nn <- Reduce(unique, lapply(ls, names))
# convert from matrix to list `[` used to ensure correct ordering
as.list(data.frame(do.call(rbind,lapply(ls, `[`, nn))))
}
# yes 1.1
# yes 1.2
# yes 2.1, not 2.2, yes 2.3
revert_list_str_4 <- function(ls) { # #Josh O'Brien
# get sub-elements in same order
x <- lapply(ls, `[`, names(ls[[1]]))
# stack and reslice
apply(do.call(rbind, x), 2, as.list)
}
# not 1.1
# not 1.2
# not 2.1, not 2.2, not 2.3
revert_list_str_5 <- function(ls) { # #mnel
apply(data.frame((do.call(rbind, lapply(ls, data.frame)))), 2, as.list)
}
# not 1.1
# not 1.2
# not 2.1, yes 2.2, yes 2.3
revert_list_str_6 <- function(ls) { # #baptiste + #Josh O'Brien
b <- recast(z, L2 ~ L1)
apply(b, 1, as.list)
}
# yes 1.1
# yes 1.2
# not 2.1, yes 2.2, yes 2.3
revert_list_str_7 <- function(ll) { # #Josh O'Brien
nms <- unique(unlist(lapply(ll, function(X) names(X))))
ll <- lapply(ll, function(X) setNames(X[nms], nms))
ll <- apply(do.call(rbind, ll), 2, as.list)
lapply(ll, function(X) X[!sapply(X, is.null)])
}
CONSIDERATIONS
From this analysis emerges that:
functions revert_list_str_7 and revert_list_str_6 are the most flexible regarding the names of the nested list
functions revert_list_str_4, revert_list_str_3 followed by my own function are complete enough, good trade-offs.
the most complete in absolute function is revert_list_str_7.
BENCHMARKS
To complete the work I've done some little benchmarks (with microbenchmark R package) on this 4 functions (times = 1000 for each benchmark).
BENCHMARK 1
Input:
list(z1 = list(a = 1, b = 2, c = 3), z2 = list(a = 0, b = 3, d = 22, f = 9)).
Results:
Unit: microseconds
expr min lq median uq max
1 func_1 250.069 467.5645 503.6420 527.5615 2028.780
2 func_3 204.386 393.7340 414.5485 429.6010 3517.438
3 func_4 89.922 173.7030 189.0545 194.8590 1669.178
4 func_6 11295.463 20985.7525 21433.8680 21934.5105 72476.316
5 func_7 348.585 387.0265 656.7270 691.2060 2393.988
Winner: revert_list_str_4.
BENCHMARK 2
Input:
list(z1 = list(a = 1, b = 2, c = 'ciao'), z2 = list(a = 0, b = 3, c = 5)).
revert_list_str_6 excluded because it does not support different type of nested child elements.
Results:
Unit: microseconds
expr min lq median uq max
1 func_1 249.558 483.2120 502.0915 550.7215 2096.978
2 func_3 210.899 387.6835 400.7055 447.3785 1980.912
3 func_4 92.420 170.9970 182.0335 192.8645 1857.582
4 func_7 257.772 469.9280 477.8795 487.3705 2035.101
Winner: revert_list_str_4.
BENCHMARK 3
Input:
list(z1 = list(a = 1, b = m, c = 'ciao'), z2 = list(a = 0, b = 3, c = m)).
m is a matrix 3x3 of integers and revert_list_str_6 has been excluded again.
Results:
Unit: microseconds
expr min lq median uq max
1 func_1 261.173 484.6345 503.4085 551.6600 2300.750
2 func_3 209.322 393.7235 406.6895 449.7870 2118.252
3 func_4 91.556 174.2685 184.5595 196.2155 1602.983
4 func_7 252.883 474.1735 482.0985 491.9485 2058.306
Winner: revert_list_str_4. Again!
end edit 2.
CONCLUSION
First of all: thanks to all, wonderful solutions.
In my opinion if you know in advance that you list will have nested list with the same names reverse_str_4 is the winner as best compromise between performances and support for different types.
The most complete solution is revert_list_str_7 although the full flexibility induces an average of about 2.5 times a worsening of performances compared to reverse_str_4 (useful if your nested list have different names).

Edit:
Here's a more flexible version that will work on lists whose elements don't necessarily contain the same set of sub-elements.
fun <- function(ll) {
nms <- unique(unlist(lapply(ll, function(X) names(X))))
ll <- lapply(ll, function(X) setNames(X[nms], nms))
ll <- apply(do.call(rbind, ll), 2, as.list)
lapply(ll, function(X) X[!sapply(X, is.null)])
}
## An example of an 'unbalanced' list
z <- list(z1 = list(a = 1, b = 2),
z2 = list(b = 4, a = 1, c = 0))
## Try it out
fun(z)
Original answer
z <- list(z1 = list(a = 1, b = 2, c = 3), z2 = list(b = 4, a = 1, c = 0))
zz <- lapply(z, `[`, names(z[[1]])) ## Get sub-elements in same order
apply(do.call(rbind, zz), 2, as.list) ## Stack and reslice

EDIT -- working from #Josh O'Briens suggestion and my own improvemes
The problem was that do.call rbind was not calling rbind.data.frame which does some matching of names. rbind.data.frame should work, because data.frames are lists and each sublist is a list, so we could just call it directly.
apply(do.call(rbind.data.frame, z), 1, as.list)
However, while this may be succicint, it is slow because do.call(rbind.data.frame, ...) is inherently slow.
Something like (in two steps)
# convert each component of z to a data.frame
# so rbind.data.frame so named elements are matched
x <- data.frame((do.call(rbind, lapply(z, data.frame))))
# convert each column into an appropriately named list
o <- lapply(as.list(x), function(i,nam) as.list(`names<-`(i, nam)), nam = rownames(x))
o
$a
$a$z1
[1] 1
$a$z2
[1] 1
$b
$b$z1
[1] 2
$b$z2
[1] 4
$c
$c$z1
[1] 3
$c$z2
[1] 0
And an alternative
# unique names
nn <- Reduce(unique,lapply(z, names))
# convert from matrix to list `[` used to ensure correct ordering
as.list(data.frame(do.call(rbind,lapply(z, `[`, nn))))

reshape can get you close,
library(reshape)
b = recast(z, L2~L1)
split(b[,-1], b$L2)

The recently released purrr contains a function, transpose, whose's purpose is to 'revert' a list structure. There is a major caveat to the transpose function, it matches on position and not name, https://cran.r-project.org/web/packages/purrr/purrr.pdf. These means that it is not the correct tool for the benchmark 1 above. I therefore only consider benchmark 2 and 3 below.
Benchmark 2
B2 <- list(z1 = list(a = 1, b = 2, c = 'ciao'), z2 = list(a = 0, b = 3, c = 5))
revert_list_str_8 <- function(ll) { # #z109620
transpose(ll)
}
microbenchmark(revert_list_str_1(B2), revert_list_str_3(B2), revert_list_str_4(B2), revert_list_str_7(B2), revert_list_str_8(B2), times = 1e3)
Unit: microseconds
expr min lq mean median uq max neval
revert_list_str_1(B2) 228.752 254.1695 297.066646 268.8325 293.5165 4501.231 1000
revert_list_str_3(B2) 211.645 232.9070 277.149579 250.9925 278.6090 2512.361 1000
revert_list_str_4(B2) 79.673 92.3810 112.889130 100.2020 111.4430 2522.625 1000
revert_list_str_7(B2) 237.062 252.7030 293.978956 264.9230 289.1175 4838.982 1000
revert_list_str_8(B2) 2.445 6.8440 9.503552 9.2880 12.2200 148.591 1000
Clearly function transpose is the winner! It also utilizes much less code.
Benchmark 3
B3 <- list(z1 = list(a = 1, b = m, c = 'ciao'), z2 = list(a = 0, b = 3, c = m))
microbenchmark(revert_list_str_1(B3), revert_list_str_3(B3), revert_list_str_4(B3), revert_list_str_7(B3), revert_list_str_8(B3), times = 1e3)
Unit: microseconds
expr min lq mean median uq max neval
revert_list_str_1(B3) 229.242 253.4360 280.081313 266.877 281.052 2818.341 1000
revert_list_str_3(B3) 213.600 232.9070 271.793957 248.304 272.743 2739.646 1000
revert_list_str_4(B3) 80.161 91.8925 109.713969 98.980 108.022 2403.362 1000
revert_list_str_7(B3) 236.084 254.6580 287.274293 264.922 280.319 2718.628 1000
revert_list_str_8(B3) 2.933 7.3320 9.140367 9.287 11.243 55.233 1000
Again, transpose outperforms all others.
The problem with these above benchmarks test is that they focus on very small lists. For this reason, the numerous loops nested within functions 1-7 do not pose too much of a problem. As the size of the list and therefore the iteration increase, the speed gains of transpose will likely increase.
The purrr package is awesome! It does a lot more than revert lists. In combination with the dplyr package, the purrr package makes it possible to do most of your coding using the poweriful and beautiful functional programming paradigm. Thank the lord for Hadley!

How about this simple solution, which is completely general, and almost as fast as Josh O'Brien's original answer that assumed common internal names (#4).
zv <- unlist(unname(z), recursive=FALSE)
ans <- split(setNames(zv, rep(names(z), lengths(z))), names(zv))
And here is a general version that is robust to not having names:
invertList <- function(z) {
zv <- unlist(unname(z), recursive=FALSE)
zind <- if (is.null(names(zv))) sequence(lengths(z)) else names(zv)
if (!is.null(names(z)))
zv <- setNames(zv, rep(names(z), lengths(z)))
ans <- split(zv, zind)
if (is.null(names(zv)))
ans <- unname(ans)
ans
}

I'd like to add a further solution to this valuable collection (to which I have turned many times):
revert_list_str_9 <- function(x) do.call(Map, c(c, x))
If this were code golf, we'd have a clear winner! Of course, this requires the individual list entries to be in the same order. This can be extended, using various ideas from above, such as
revert_list_str_10 <- function(x) {
nme <- names(x[[1]]) # from revert_list_str_4
do.call(Map, c(c, lapply(x, `[`, nme)))
}
revert_list_str_11 <- function(x) {
nme <- Reduce(unique, lapply(x, names)) # from revert_list_str_3
do.call(Map, c(c, lapply(x, `[`, nme)))
}
Performance-wise it's also not too shabby. If stuff is properly sorted, we have a new base R solution to beat. If not, timings still are very competitive.
z <- list(z1 = list(a = 1, b = 2, c = 3), z2 = list(b = 4, a = 1, c = 0))
microbenchmark::microbenchmark(
revert_list_str_1(z), revert_list_str_2(z), revert_list_str_3(z),
revert_list_str_4(z), revert_list_str_5(z), revert_list_str_7(z),
revert_list_str_9(z), revert_list_str_10(z), revert_list_str_11(z),
times = 1e3
)
#> Unit: microseconds
#> expr min lq mean median uq max
#> revert_list_str_1(z) 51.946 60.9845 67.72623 67.2540 69.8215 1293.660
#> revert_list_str_2(z) 461.287 482.8720 513.21260 490.5495 498.8110 1961.542
#> revert_list_str_3(z) 80.180 89.4905 99.37570 92.5800 95.3185 1424.012
#> revert_list_str_4(z) 19.383 24.2765 29.50865 26.9845 29.5385 1262.080
#> revert_list_str_5(z) 499.433 525.8305 583.67299 533.1135 543.4220 25025.568
#> revert_list_str_7(z) 56.647 66.1485 74.53956 70.8535 74.2445 1309.346
#> revert_list_str_9(z) 6.128 7.9100 11.50801 10.2960 11.5240 1591.422
#> revert_list_str_10(z) 8.455 10.9500 16.06621 13.2945 14.8430 1745.134
#> revert_list_str_11(z) 14.953 19.8655 26.79825 22.1805 24.2885 2084.615
Unfortunately, this is not my creation, but exists courtesy of #thelatemail.

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

Challenge: Recode data from Alpha to Non sequential Numeric [duplicate] - r

Our answer relies on a somewhat uncommon method of subsetting by position: op_and_interns_fun <- function(x) { c(5,4,3,1)[as.numeric(as.factor(x))] }

Related

Convert named list to a data.frame efficiently [duplicate]

Assign specified numerical value to a vector of strings [duplicate]

Faster version of combn

how to remove unique values from a vector

Revert list structure

Categories

Resources