Convert a data.table to a vector row-wise - R

I have a data.table such as:
example <- data.table(fir =c("A", "B", "C", "A","A", "B", "C"), las=c( "B", "C","B", "C", "B", "C","C"))
   fir las
1:   A   B
2:   B   C
3:   C   B
4:   A   C
5:   A   B
6:   B   C
7:   C   C
(I guess the problem is the same with a data.frame.)
I would like to get a vector like this:
A, B, B, C, C, B, A, C, A, B, B, C, C, C
That is, I want to stack the rows one after another, reading each row from left to right.
I've tried unlist(example), but it extracts the data column-wise instead.
How can I get the row-wise result?
I've also tried with apply, transposing and other strange things.
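For reference, this is the column-wise order that unlist() walks here (names dropped for readability):
unlist(example, use.names = FALSE)
# [1] "A" "B" "C" "A" "A" "B" "C" "B" "C" "B" "C" "B" "C" "C"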

Since data in a matrix, as well as in a data.frame/data.table (though stored differently from a matrix), is organized column-wise, you can transpose it first:
as.vector(t(example))
# [1] "A" "B" "B" "C" "C" "B" "A" "C" "A" "B" "B" "C" "C" "C"
A benchmark test including the options provided by @Sotos, @Frank and @Wen, using a dummy data set:
example <- as.data.table(matrix(sample(LETTERS, 10^7, replace = T), ncol = 1000))
dim(example)
#[1] 10000 1000
library(microbenchmark)
psidom <- function() as.vector(t(example))
sotos <- function() c(t(example))
frank <- function() unlist(transpose(example), use.names = FALSE)
wen <- function() unname(unlist(data.frame(t(example))))
# data.table 1.10.4
microbenchmark(psidom(), sotos(), frank(), wen(), times = 10)
#Unit: milliseconds
# expr min lq mean median uq max neval
# psidom() 163.5993 178.9236 393.4838 198.6753 632.1086 1352.012 10
# sotos() 186.8764 188.3734 467.2117 343.1514 618.3121 1221.721 10
# frank() 3065.0988 3493.3691 5315.4451 4649.4643 5742.2399 9560.642 10
# wen() 7316.6743 8497.1409 9200.4397 9038.2834 9631.5313 11931.075 10
Another test in data.table dev version 1.10.5:
# data.table 1.10.5
psidom <- function() as.vector(t(example))
sotos <- function() c(t(example))
frank <- function() unlist(transpose(example), use.names = FALSE)
fast <- function() `attributes<-`(t(example), NULL)
microbenchmark(psidom(), sotos(), frank(), fast(), times = 10)
#Unit: milliseconds
# expr min lq mean median uq max neval
# psidom() 228.1248 246.4666 271.6772 256.9131 287.5072 354.2053 10
# sotos() 254.3512 280.2504 315.3487 322.5726 344.7125 390.3482 10
# frank() 290.5476 310.7076 374.6267 349.8021 431.8451 491.9301 10
# fast() 159.6006 167.6316 209.8363 196.8821 272.4758 281.3146 10
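A note on the fast() variant (my reading, not spelled out above): a matrix is just a vector with dim/dimnames attributes, so stripping the attributes from the transposed matrix leaves the plain row-wise vector directly:
tm <- t(example)    # character matrix; its columns are the original rows
attributes(tm)      # $dim and $dimnames
identical(`attributes<-`(tm, NULL), as.vector(t(example)))
# [1] TRUE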

Related

Given a list, how do I access the last value of each element when the elements have different sizes?

I have a list of character vectors, and I would like to access the last value of each element.
mylist<-list(A=c("a"),
B=c("a","b"),
C=c("a","b","c"),
D=c("a","b","c","d"))
At first (from looking at some related threads for Python), I thought I could do something like:
for(i in 1:length(mylist)){
  print(mylist[[i]][-1])
}
# character(0)
# [1] "b"
# [1] "b" "c"
# [1] "b" "c" "d"
I guess this doesn't work (in R, negative indices drop elements rather than counting from the end). Basically, as a result, I would like
myfunction <- function(mylist){
  output <- as.character()
  for(i in 1:length(mylist)){
    output <- c(output, mylist[[i]][length(mylist[[i]])])
  }
  return(output)
}
myfunction(mylist)
# [1] "a" "b" "c" "d"
Is there a more efficient way?
As Rich Scriven pointed out in the (deleted) comments, there are many ways to accomplish this task, one of which is to use sapply and tail with the argument n = 1:
sapply(mylist, tail, n = 1)
# A B C D
#"a" "b" "c" "d"
Another, safer and potentially faster variant of the same idea is to use vapply:
vapply(mylist, tail, FUN.VALUE = character(1), n = 1)
# or a little shorter
# vapply(mylist, tail, "", 1)
Another benchmark:
set.seed(1)
mylist <- replicate(1e5, list(sample(letters, size = runif(1, 1, length(letters)))))
benchmark <- microbenchmark(
f1 = {myfunction(mylist)},
f2 = {sapply(mylist, function(l) l[length(l)])},
f3 = {vapply(mylist, function(l) l[length(l)], "")},
f4 = {sapply(mylist, tail, 1)},
f5 = {vapply(mylist, tail, "", 1)},
f6 = {mapply("[", mylist, lengths(mylist))},
f7 = {mapply("[[", mylist, lengths(mylist))}, # added this out of curiosity
f8 = {unlist(mylist)[cumsum(lengths(mylist))]},
times = 100L
)
autoplot(benchmark)  # the autoplot() generic needs library(ggplot2)
Same result here: Rich's unlist(mylist)[cumsum(lengths(mylist))] is the fastest by far. There is no real difference between sapply and vapply, it seems. myfunction() is as defined in the OP's question.
#benchmark
#Unit: milliseconds
# expr min lq mean median uq max neval
# f1 28797.26121 30462.16785 31836.26875 31191.7762 32950.92537 36586.5477 100
# f2 106.34213 117.75074 127.97763 124.9191 134.82047 176.2058 100
# f3 99.72042 106.87308 119.59811 113.9663 123.63619 465.5335 100
# f4 1242.11950 1291.38411 1409.35750 1350.3460 1505.76089 1880.6537 100
# f5 1189.22615 1274.48390 1366.07234 1333.8885 1418.75394 1942.2803 100
# f6 112.27316 123.73429 132.39888 129.8220 138.33851 191.2509 100
# f7 107.27392 118.19201 128.06681 123.1317 133.29827 208.8425 100
# f8 28.03948 28.84125 31.19637 30.3115 32.94077 40.9624 100
Benchmarking the solutions proposed in the comments, we find that Rich's proposal using unlist is the fastest.
By inspecting the code and tweaking the parameters we can make it even faster.
The slowness of tail is discussed here: https://stackoverflow.com/a/37238415/2270475
On the OP's sample data:
library(microbenchmark)
microbenchmark(
r2evans = sapply(mylist, function(l) l[length(l)]),
markus = sapply(mylist, tail, 1),
Rich1 = mapply("[", mylist, lengths(mylist)),
Rich2 = unlist(mylist)[cumsum(lengths(mylist))],
markus2 = vapply(mylist, tail, character(1), 1),
mm = .Internal(unlist(mylist,FALSE,FALSE))[cumsum(lengths(mylist,FALSE))],
unit = "relative"
)
# Unit: relative
# expr min lq mean median uq max neval
# r2evans 16.083333 12.764706 25.545957 12.368421 13.133333 122.1428571 100
# markus 82.333333 59.294118 50.937673 60.342105 60.644444 10.2253968 100
# Rich1 19.583333 15.294118 13.368047 15.394737 15.622222 2.7492063 100
# Rich2 4.166667 3.705882 3.211045 3.789474 3.911111 0.7650794 100
# markus2 73.166667 53.176471 44.669822 50.263158 54.155556 10.4857143 100
# mm 1.000000 1.000000 1.000000 1.000000 1.000000 1.0000000 100
On a 1000 times longer list:
mylist_long <- do.call(c,replicate(1000,mylist,simplify = F))
length(mylist_long) # [1] 4000
microbenchmark(
r2evans = sapply(mylist_long, function(l) l[length(l)]),
markus = sapply(mylist_long, tail, 1),
Rich1 = mapply("[", mylist_long, lengths(mylist_long)),
Rich2 = unlist(mylist_long)[cumsum(lengths(mylist_long))],
markus2 = vapply(mylist_long, tail, character(1), 1),
mm = .Internal(unlist(mylist_long,FALSE,FALSE))[cumsum(lengths(mylist_long,FALSE))],
unit = "relative"
)
# Unit: relative
# expr min lq mean median uq max neval
# r2evans 26.14882 27.20436 27.07436 28.13731 28.54701 27.23846 100
# markus 679.57251 698.84828 668.00160 715.30180 674.71067 443.42502 100
# Rich1 27.53607 28.80581 29.82736 29.00353 31.02343 38.79978 100
# Rich2 22.39863 21.79129 20.41467 21.53371 20.70750 13.03032 100
# markus2 667.97494 702.14882 676.91881 718.41899 696.11934 633.17181 100
# mm 1.00000 1.00000 1.00000 1.00000 1.00000 1.00000 100
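If you want roughly the same speed without the .Internal() call (which is not allowed in package code), the regular arguments already skip the name handling; a close, hedged equivalent:
last_vals <- unlist(mylist_long, use.names = FALSE)[cumsum(lengths(mylist_long, use.names = FALSE))]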

Assign specified numerical value to a vector of strings [duplicate]

This question already has answers here:
Map array of strings to an array of integers
(4 answers)
Closed 5 years ago.
Is there a way of creating a new vector of numerical values based on my vector of strings?
For example, I have this:
a <- c("A", "B", "A", "A")
and from this a I want to make a new vector b, replacing "A" with 1 and "B" with -1, so that b is c(1, -1, 1, 1).
I tried using something like factor(a, levels = c("A", "B"), labels = c(1, -1)),
but this doesn't produce a numerical vector.
No need for that, just use:
a[a=="A"] = 1
a[a=="B"] = -1
a = as.numeric(a)
If you want to keep a unchanged, use:
b = a
b[a=="A"] = 1
b[a=="B"] = -1
b = as.numeric(b)
Or, better, as @joran said:
b = ifelse(a == "A", 1, -1)
# Packages
library(stringi)
library(microbenchmark)
# 1. Vector
# a <- c("A", "B", "A", "A")
a <- stri_rand_strings(1e5, 1, pattern = "[A-B]")
# 2. The 'factor' solution
f1 <- function(){ as.numeric(as.character(factor(a, levels = c("A", "B"), labels = c(1, -1)))) }
# 3. The faster solution
f2 <- function(){ (-1)^(a != "A") }
# 4. Ifelse solution
f3 <- function(){ ifelse(a == "A", 1, -1) }
# 5. Ignore case of letters (my solution)
f4 <- function(){ ifelse(as.numeric(grepl("a", a, ignore.case = TRUE)) == 1, 1, -1) }
# 6. Code map solution from Nathan Werth
f5 <- function(){ c(A = 1, B = -1)[a] }
# 7. Test
microbenchmark(
  f1(), f2(), f3(), f4(), f5())
Unit: milliseconds
expr min lq mean median uq max neval cld
f1() 23.331763 23.648421 28.253174 24.235554 26.582799 123.49315 100 b
f2() 5.808460 6.025908 6.421053 6.067174 6.200166 12.94342 100 a
f3() 13.817060 14.926539 25.900652 16.388596 18.122837 129.67193 100 b
f4() 28.772036 31.363670 39.185333 32.352557 34.388918 134.35915 100 c
f5() 4.577321 5.186689 8.727417 7.375286 7.895280 106.31922 100 a
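For reference, f2() relies on logicals being coerced to 0/1 in arithmetic: "A" gives (-1)^0 = 1 and anything else gives (-1)^1 = -1. A quick check on the original small vector:
a <- c("A", "B", "A", "A")
(-1)^(a != "A")
# [1]  1 -1  1  1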
code_map <- c(A = 1, B = -1)
b <- code_map[a]
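One small caveat (my addition): the code-map lookup keeps the names of the map on the result, so you may want to drop them with unname():
b <- unname(code_map[a])
b
# [1]  1 -1  1  1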

Extract the first TRUE per row in logical matrix efficiently in R [duplicate]

This question already has answers here:
How to vectorize this operation on every row of a matrix
(3 answers)
Closed 6 years ago.
Given the following matrix:
A B C
[1,] TRUE FALSE TRUE
[2,] FALSE TRUE TRUE
[3,] FALSE FALSE TRUE
[4,] FALSE TRUE TRUE
[5,] FALSE TRUE TRUE
[6,] TRUE TRUE TRUE
m <- structure(c(TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, TRUE,
FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE), .Dim = c(6L,
3L), .Dimnames = list(NULL, c("A", "B", "C")))
How can we extract the first column with a TRUE value per row efficiently? Of course, we could use apply per row and then take min(which(...)).
Here is the desired output:
[1] A B C B B A
This thread might seem a duplicate of my question, but it's not:
Here we are talking about a logical matrix, NOT a numeric data frame.
Here we seek the position of the first TRUE, not of the highest value.
We can use max.col:
colnames(m)[max.col(m, "first")]
#[1] "A" "B" "C" "B" "B" "A"
If there are no TRUE values in a row, then we can change the result to NA (if needed):
colnames(m)[max.col(m, "first")*NA^!rowSums(m)]
Or with ifelse
colnames(m)[ifelse(rowSums(m)==0, NA, max.col(m, "first"))]
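To unpack the NA^!rowSums(m) trick (my reading of it): for a row containing a TRUE, !rowSums(m) is FALSE, NA^0 is 1 and the index survives; for an all-FALSE row it is NA^1 = NA, so the index, and hence the looked-up name, becomes NA. For example, with an extra all-FALSE row:
m2 <- rbind(m, c(FALSE, FALSE, FALSE))
colnames(m2)[max.col(m2, "first") * NA^!rowSums(m2)]
# [1] "A" "B" "C" "B" "B" "A" NA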
Another approach, using which to work with the logical matrix directly:
colnames(m)[aggregate(col~row, data=which(m, arr.ind = TRUE), FUN=min)$col]
#[1] "A" "B" "C" "B" "B" "A"
We get the indices of the TRUE values and then find the minimum (index of) column in which they occur, by row.
Benchmark:
library(microbenchmark)
n <- matrix(FALSE, nrow=1000, ncol=500) # couldn't afford a bigger one...
n <- t(apply(n, 1, function(rg) {rg[sample(1:500, 1, replace=TRUE)] <- TRUE ; rg}))
colnames(n) <- paste0("name", 1:500)
akrun <- function(n){colnames(n)[max.col(n, "first")]}
cath <- function(n){colnames(n)[aggregate(col~row, data=which(n, arr.ind = TRUE), FUN=min)$col]}
all(akrun(n)==cath(n))
#[1] TRUE
microbenchmark(akrun(n), cath(n))
# expr min lq mean median uq max neval cld
#akrun(n) 6.985716 7.233116 8.231404 7.525513 8.842927 31.23469 100 a
# cath(n) 18.416079 18.811473 19.586298 19.272398 20.262169 22.42786 100 b
Here is my attempt. It isn't a one-liner, but it is lightning fast.
joe <- function(x) {
  y <- which(x)                       # linear (column-major) indices of the TRUE entries
  nR <- nrow(x)
  myR <- y %% nR                      # row number of each TRUE entry ...
  myR[myR == 0] <- nR                 # ... fixing the last row, where the modulus is 0
  myNames <- colnames(x)[ceiling(y/nR)]   # column name of each TRUE entry
  myCols <- which(!(duplicated(myR))) # keep only the first (leftmost) TRUE per row
  myNames[myCols][order(myR[myCols])] # restore row order
}
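A quick sanity check on the OP's matrix m:
joe(m)
# [1] "A" "B" "C" "B" "B" "A"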
Here are the benchmarks using the data provided by @Cath:
microbenchmark(akrun(n), cath(n), joe(n))
Unit: microseconds
expr min lq mean median uq max neval
akrun(n) 4248.760 5588.8640 6148.1816 5926.7130 6378.887 12502.437 100
cath(n) 12641.189 13733.1415 14808.6524 14532.8115 15559.287 20628.037 100
joe(n) 555.418 642.2405 758.5293 713.2585 800.697 4849.334 100
all.equal(akrun(n), cath(n), joe(n))
[1] TRUE
Here is another way that has better performance than @Cath's solution:
a <- which(m, arr.ind = T)
colnames(m)[aggregate(col~row,a[order(a[,1]),],min)$col]
# [1] "A" "B" "C" "B" "B" "A"
Benchmarking with the matrix used by @Cath:
m0h3n <- function(m){
  a <- which(m, arr.ind = T)
  colnames(m)[aggregate(col~row, a[order(a[,1]),], min)$col]
}
all.equal(akrun(n), cath(n), joe(n), m0h3n(n))
# [1] TRUE
microbenchmark(akrun(n), cath(n), joe(n), m0h3n(n))
# Unit: microseconds
# expr min lq mean median uq max neval
# akrun(n) 2291.981 2395.793 2871.7156 2482.7790 3561.9150 4205.370 100
# cath(n) 8263.210 8554.665 9695.9375 8782.8710 9947.9415 58239.983 100
# joe(n) 274.029 298.517 526.6722 312.0375 342.5355 2366.798 100
# m0h3n(n) 3890.178 3974.309 4280.6677 4073.1635 4227.7550 6337.501 100
Therefore, here are the solutions ranked by efficiency:
1. joe
2. akrun
3. m0h3n
4. Cath

Apply a list of character expressions stored in a data.table column to the other columns per row

I have a data.table with variables as columns and one column storing a different function for each row as a character string. I would like to apply each function to its own row and store the output in a new column, without using a for loop, as I need it to be fast and have 25,000 rows.
If I take a simplified example, let's say I have a data.table dt:
dt <- data.table(a=c(1,2,3),b=c(4,5,6),c=c(7,8,9),d=c("a+b+c","a*b*c","c/a*b"))
dt
a b c d
1: 1 4 7 a+b+c
2: 2 5 8 a*b*c
3: 3 6 9 c/a*b
I would like to have this as a result:
a b c d e
1: 1 4 7 a+b+c 12
2: 2 5 8 a*b*c 80
3: 3 6 9 c/a*b 18
So far the only solution I have found is a for loop, but it's slow for my 25000 rows and 32 variables:
for (i in 1:nrow(dt)){
  dt[i, e := eval(parse(text = dt[i, d]))]
}
I have been searching for quite a while (I tried eval, sapply, ...) but so far with no success. I would really appreciate any suggestions.
interpret <- function(expr, .SD) eval(parse(text = expr[1]), envir = .SD)
dt[, e := interpret(d,.SD), by = d, .SDcols = c("a", "b", "c")]
dt
returns:
> dt
a b c d e
1: 1 4 7 a+b+c 12
2: 2 5 8 a*b*c 80
3: 3 6 9 c/a*b 18
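The reason this is fast (my reading of the answer): grouping by d means each distinct expression is parsed only once and then evaluated vectorised over all rows of that group, rather than once per row. Inside one group it boils down to something like:
# sketch of what effectively happens for the "a+b+c" group
eval(parse(text = "a+b+c"), envir = list(a = c(1, 10), b = c(4, 40), c = c(7, 70)))
# [1]  12 120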
A dummy benchmark:
l <- lapply(1:1e5, function(i) dt)
bigdt <- rbindlist(l)
bigdt[, e:=interpret(d,.SD), by=d, .SDcols=c("a", "b", "c")]
bigdt
microbenchmark(for (i in 1:nrow(dt)){
bigdt[i,e:=eval(parse(text=bigdt[i,d]))]
}, bigdt[, e:=interpret(d,.SD), by=d, .SDcols=c("a", "b", "c")])
gave me:
Unit: milliseconds
                                                                         expr      min       lq     mean   median       uq       max neval cld
 for (i in 1:nrow(dt)) { bigdt[i, `:=`(e, eval(parse(text = bigdt[i, d])))] } 2.693427 2.833544 3.240561 3.043713 3.150880  6.212202   100   a
      bigdt[, `:=`(e, interpret(d, .SD)), by = d, .SDcols = c("a", "b", "c")] 6.891739 7.280915 9.988198 8.496646 8.721075 69.666926   100   b
invoke_map() from package purrr is designed to iterate over a list of functions and a list of parameters to each function.
Here is an alternative, slightly long-winded idea for how to solve this problem.
dt <- data.frame(a=rep(c(1,2,3, 5), 10),b= rep(c(4,5,6, 5),10),c=rep(c(7,8,9, 5), 10),d=rep(c("a+b+c","a*b*c","c/a*b", "a+b+c"), 10), stringsAsFactors = FALSE)
Create functions in the environment based on column d:
funs_map <- data.frame()
for(i in 1:length(unique(dt$d))){
  eval(parse(text = paste('f', i, '<- function(', 'a, b, c', ') { return(', unique(dt$d)[i], ')}', sep='')))
  funs_map[i, 1] <- unique(dt$d)[i]
  funs_map[i, 2] <- paste('f', i, sep="")
}
Create a list of functions to iterate over - this will be the .f argument to invoke_map:
funs_list <- as.list(funs_map$V2[match(dt$d , funs_map$V1)])
The last column will not be necessary anymore:
dt <- dt[-4]
Create a list of parameters for each function - this appears to be the most time-consuming step:
params <- vector(mode = "list", length = nrow(dt))
for(i in 1:nrow(dt)){
  params[[i]] <- as.list(dt[i,])
}
Iterate over the functions (invoke_map() needs library(purrr) attached):
result <- invoke_map(funs_list, params)
Putting this code into a function and benchmarking:
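The answer does not show apply_funs() itself; here is a hedged sketch of such a wrapper, bundling the steps above but keeping the generated functions in a list rather than creating f1, f2, ... in the global environment:
library(purrr)
apply_funs <- function(dt) {
  exprs <- unique(dt$d)
  # one function object per distinct expression in d
  funs <- lapply(exprs, function(e) eval(parse(text = paste0("function(a, b, c) ", e))))
  funs_list <- funs[match(dt$d, exprs)]          # one function per row
  params <- lapply(seq_len(nrow(dt)),            # one argument list per row
                   function(i) as.list(dt[i, c("a", "b", "c")]))
  invoke_map(funs_list, params)
}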
microbenchmark(apply_funs(dt))
Unit: milliseconds
expr min lq mean median uq max neval
apply_funs(dt) 19.27345 20.34213 21.09592 20.66714 21.63639 26.83376 100
Original code:
Unit: milliseconds
                                                                  expr      min       lq     mean   median       uq      max neval
 for (i in 1:nrow(dt)) { dt[i, `:=`(e, eval(parse(text = dt[i, d])))] } 353.7435 358.0244 362.6764 360.3644 362.9175 439.9213   100
And tokiloutok's solution (the fastest):
Unit: milliseconds
                                                                 expr      min        lq      mean   median        uq      max neval
 dt[, `:=`(e, interpret(d, .SD)), by = d, .SDcols = c("a", "b", "c")] 0.780877 0.8148745 0.8432403 0.822787 0.8480175 1.203817   100

How come factor vectors are less efficient than integer or even character vectors?

I just noticed the following:
set.seed(42)
vec <- sample(c("a", "b", "c"), 1e4, replace=T)
vec_fac <- factor(vec)
vec_int <- as.integer(factor(vec))
library(microbenchmark)
microbenchmark(vec=="b", vec_fac=="b", vec_int==2, vec_fac==2)
This gives to my great surprise:
Unit: microseconds
expr min lq mean median uq max neval
vec == "b" 2397.150 2406.5925 2499.5715 2470.637 2532.628 2881.588 100
vec_fac == "b" 5706.932 5765.4340 6137.5441 6032.696 6401.567 8889.446 100
vec_int == 2 510.714 541.0935 623.8341 580.506 743.695 845.305 100
vec_fac == 2 5703.237 5772.6185 6339.6577 5975.015 6378.577 31502.869 100
I would have thought that factors were much more efficient than a simple character vector, but that is not the case. (Of course, vec_fac and vec_int take about half the memory of vec.)
Why are factors not as efficient as integer vectors?
The comparison requires some conversion: == on a factor dispatches to Ops.factor, which essentially converts the factor back to character values (looking up each code in the levels) before comparing, and that conversion dominates the cost. Have a look at the profiling below. Note that (levels(vec_fac) == "b")[vec_fac] is faster.
set.seed(42)
vec <- sample(c("a", "b", "c"), 1e4, replace=T)
vec_fac <- factor(vec)
vec_int <- as.integer(factor(vec))
library(microbenchmark)
microbenchmark(
(levels(vec_fac) == "b")[vec_fac],
vec_int == 2,
vec == "b",
vec_fac == 2,
vec_fac == "b"
)
Unit: microseconds
expr min lq mean median uq max neval cld
(levels(vec_fac) == "b")[vec_fac] 62.861 69.7030 74.20981 71.8410 73.552 131.280 100 a
vec_int == 2 73.124 85.0970 89.96756 86.8070 87.877 125.721 100 b
vec == "b" 129.569 133.8450 138.57510 134.7005 135.129 170.621 100 c
vec_fac == 2 303.611 331.8340 348.90436 334.6135 337.820 482.783 100 d
vec_fac == "b" 347.656 376.7335 393.01326 379.2990 381.224 577.715 100 e
Profiling:
set.seed(42)
vec <- sample(c("a", "b", "c"), 1e8, replace=T)
vec_fac <- factor(vec)
vec_int <- as.integer(vec_fac)
Rprof()
junk <- vec_int == 2
Rprof(NULL)
summaryRprof()
Rprof()
junk <- vec == "b"
Rprof(NULL)
summaryRprof()
Rprof()
junk <- vec_fac == "b"
Rprof(NULL)
summaryRprof()
Rprof()
junk <- vec_fac == 2
Rprof(NULL)
summaryRprof()
