Is there a way of creating a new vector of numerical values based on my vector of strings?
For example, I have this:
a <- c("A", "B", "A", "A")
and from this a I want to make a new vector b, replacing "A" with 1 and "B" with -1, so that b is c(1, -1, 1, 1).
I tried using something like factor(a, levels = c("A", "B"), labels = c(1, -1)),
but this doesn't produce a numeric vector.
No need for that; just use:
a[a=="A"] = 1
a[a=="B"] = -1
a = as.numeric(a)
If you want to keep a unchanged, use:
b = a
b[a=="A"] = 1
b[a=="B"] = -1
b = as.numeric(b)
Or, a better solution, as @joran said:
b = ifelse(a == "A",1,-1)
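For the example vector, a quick check of the ifelse version (re-creating a, since the code above overwrote it):
a <- c("A", "B", "A", "A")
b <- ifelse(a == "A", 1, -1)
b
#[1]  1 -1  1  1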
# Packages
library(stringi)
library(microbenchmark)
# 1. Vector
# a <- c("A", "B", "A", "A")
a <- stri_rand_strings(1e5, 1, pattern = "[A-B]")
# 2. The 'factor' solution
f1 <- function(){ as.numeric(as.character(factor(a, levels = c("A", "B"), labels = c(1, -1)))) }
# 3. A faster solution
f2 <- function(){ (-1)^(a != "A") }
# 4. The ifelse solution
f3 <- function(){ ifelse(a == "A", 1, -1) }
# 5. Ignore case of letters (my solution)
f4 <- function(){ ifelse(as.numeric(grepl("a", a, ignore.case = TRUE)) == 1, 1, -1) }
# 6. Code map solution from Nathan Werth
f5 <- function(){ c(A = 1, B = -1)[a] }
# 7. Benchmark
microbenchmark(
f1(), f2(), f3(), f4(), f5())
Unit: milliseconds
expr min lq mean median uq max neval cld
f1() 23.331763 23.648421 28.253174 24.235554 26.582799 123.49315 100 b
f2() 5.808460 6.025908 6.421053 6.067174 6.200166 12.94342 100 a
f3() 13.817060 14.926539 25.900652 16.388596 18.122837 129.67193 100 b
f4() 28.772036 31.363670 39.185333 32.352557 34.388918 134.35915 100 c
f5() 4.577321 5.186689 8.727417 7.375286 7.895280 106.31922 100 a
code_map <- c(A = 1, B = -1)
b <- code_map[a]
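Note that subsetting a named vector keeps the names, so code_map[a] carries "A", "B", ... along as names. If you want a bare numeric vector (a small addition to the answer above), drop them with unname:
b <- unname(code_map[a])
b
#[1]  1 -1  1  1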
We are looking for a blazing fast solution to the following problem, in R (Rcpp is allowed).
I have a character vector:
set.seed(42)
x <- sample(LETTERS[1:4], 1e6, replace = TRUE)
And I want to change it to a non-sequential numeric vector, where:
A = 5
B = 4
C = 3
D = 1
For example:
c("A", "B", "C", "D")
Would be:
c(5,4,3,1)
The interns and I have what we think is a ridiculously fast solution already, but we think the Internet can beat us. We'll add our fastest solution as an answer after we get some responses.
Let's see!
Timings so far:
library(microbenchmark)
set.seed(42)
x <- sample(LETTERS[1:4], 1e6, replace = TRUE)
richscriven <- function(x) {
as.vector(c(A=5, B=4, C=3, D=2, E=1)[x])
}
richscriven_unname <- function(x) {
unname(c(A=5, B=4, C=3, D=2, E=1)[x])
}
richscriven_op <- function(x) {
(5:1)[c(factor(x))]
}
op_and_interns_fun <- function(x) {
c(5,4,3,1)[as.numeric(as.factor(x))]
}
ronakshah <- function(x) {
vec = c("A" = 5, "B" = 4, "C" = 3, "D" = 1)
unname(vec[match(x, names(vec))])
}
microbenchmark(
richscriven_unname(x),
richscriven(x),
richscriven_op(x),
op_and_interns_fun(x),
ronakshah(x),
times = 15
)
Unit: milliseconds
expr min lq mean median uq max neval
richscriven_unname(x) 36.06018 38.01026 62.80854 38.87179 41.86411 337.65773 15
richscriven(x) 37.90615 41.61194 43.50555 44.14130 45.17277 47.47804 15
richscriven_op(x) 31.70345 37.43262 44.10522 41.34828 45.22127 88.79605 15
op_and_interns_fun(x) 40.18935 44.20475 49.48811 45.77867 48.15706 99.85034 15
ronakshah(x) 29.36408 32.52615 42.40753 35.09052 38.55763 95.78571 15
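Before reading too much into the timings, it is worth checking that the candidates agree; note that the richscriven variants map D to 2 (and include an E level), while the question asks for D = 1, so they do not reproduce the requested output (a check not in the original post):
identical(op_and_interns_fun(x), ronakshah(x))
#[1] TRUE
identical(richscriven(x), op_and_interns_fun(x))
#[1] FALSE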
We can put the mapping in a named numeric vector:
vec <- c("A" = 5, "B" = 4, "C" = 3, "D" = 1)
We can then write a function,
get_recoded_data <- function(num_vec, recode_data) {
unname(recode_data[match(num_vec, names(recode_data))])
}
and call the function using
get_recoded_data(x, vec)
On my system, it takes:
system.time(get_recoded_data(x, vec))
#user system elapsed
#0.028 0.004 0.032
(I am using macOS Sierra 10.12.6, 16 GB RAM, i7, RStudio 1.1.383.)
Following @zacdav's comment, using the fmatch function from the fastmatch package gives a good performance enhancement:
library(fastmatch)
get_recoded_data <- function(num_vec, recode_data) {
unname(recode_data[fmatch(num_vec, names(recode_data))])
}
Checking it on the same data, I get
system.time(get_recoded_data(x, vec))
#user system elapsed
#0.017 0.004 0.021
Our answer relies on a somewhat uncommon method of subsetting by position:
op_and_interns_fun <- function(x) {
c(5,4,3,1)[as.numeric(as.factor(x))]
}
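To see why this works: as.factor(x) encodes each letter as an integer index into its sorted levels, and that index subsets the lookup vector directly. A minimal illustration:
as.numeric(as.factor(c("A", "B", "C", "D")))
#[1] 1 2 3 4
c(5, 4, 3, 1)[as.numeric(as.factor(c("A", "B", "C", "D")))]
#[1] 5 4 3 1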
I'd like to apply a function over the rows of a data.frame to concatenate column names depending on the values in each row.
df
A B
1 TRUE TRUE
2 FALSE TRUE
3 FALSE FALSE
The desired result is:
A B Result
1 TRUE TRUE A / B
2 FALSE TRUE B
3 FALSE FALSE NA
I read about dplyr using mutate() and rowwise(), but I don't know how to apply them since the column names aren't constant.
For a single row i, I would do something like:
paste(names(df)[as.logical(df[i,])], collapse = ' / ')
Any help would be welcome.
Thank you.
I would recommend against using apply on a data.frame (it converts the data to a matrix) and especially with a margin of 1 (row operations are slow in R). Instead, you can vectorize this over the columns, again with no matrix conversion. Here's an example:
res <- rep(NA_character_, nrow(df))
for(j in names(df)) res[df[[j]]] <- paste(res[df[[j]]], j, sep = " / ")
sub("NA / ", "", res, fixed = TRUE)
# [1] "A / B" "B" NA
Below is a benchmark that shows about a 16x improvement:
set.seed(123)
N <- 1e5
df <- as.data.frame(matrix(sample(c(TRUE, FALSE), N*2, replace = TRUE), ncol = 2))
Rowwise <- function(df) apply(df, 1, FUN = function(x) paste(names(x)[x], collapse=" / "))
Colwise <- function(df) {
res <- rep(NA_character_, nrow(df))
for(j in names(df)) res[df[[j]]] <- paste(res[df[[j]]], j, sep = " / ")
sub("NA / ", "", res, fixed = TRUE)
}
microbenchmark::microbenchmark(Rowwise(df), Colwise(df))
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# Rowwise(df) 458.54526 502.43496 545.47028 548.42042 584.18000 669.6161 100 b
# Colwise(df) 27.11235 27.83873 34.65596 29.05341 32.83664 137.7905 100 a
If the dataset is not really big (i.e. not millions or billions of rows), we can use apply with MARGIN = 1 to loop over the rows, subset the names using the logical vector as index, and paste them together. It fits in a single line:
df$Result <- apply(df, 1, FUN = function(x) paste(names(x)[x], collapse=" / "))
However, if we have a big dataset, another option is to create a key/value map and replace the values by matching; it is faster than the above solution.
v1 <- do.call(paste, df)
unname(setNames(c("A / B", "B", "A", NA), do.call(paste,
expand.grid(rep(list(c(TRUE, FALSE)), 2))))[v1])
#[1] "A / B" "B" NA
Or we can use an arithmetic operation to do this:
c(NA, "A", "B", "A / B")[1 + df[,1] + 2 * df[,2]]
#[1] "A / B" "B" NA
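The same trick generalizes to more than two logical columns by treating each row as a binary number. Here is a sketch (arith_k is a hypothetical helper, assuming every column of df is logical and column order defines the labels):
# Sketch: arithmetic indexing generalized to k logical columns
arith_k <- function(df) {
  k <- ncol(df)
  # each row's bit pattern gives an index into 1..2^k
  idx <- 1 + Reduce(`+`, Map(function(col, w) col * w, df, 2^(seq_len(k) - 1)))
  # all TRUE/FALSE combinations, in the same order as idx
  combos <- expand.grid(rep(list(c(FALSE, TRUE)), k))
  labels <- apply(combos, 1, function(r)
    if (any(r)) paste(names(df)[r], collapse = " / ") else NA_character_)
  labels[idx]
}
arith_k(df)
#[1] "A / B" "B" NA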
Benchmarks
Using @DavidArenburg's dataset and including the two solutions posted here (the column names of 'df' were changed to 'A' and 'B'):
newPaste <- function(df) {
v1 <- do.call(paste, df)
unname(setNames(c("A / B", "B", "A", NA), do.call(paste,
expand.grid(rep(list(c(TRUE, FALSE)), 2))))[v1])
}
arith <- function(df){
c(NA, "A", "B", "A / B")[1 + df[,1] + 2 * df[,2]]
}
microbenchmark::microbenchmark(Rowwise(df), Colwise(df), newPaste(df),arith(df))
#Unit: milliseconds
# expr min lq mean median uq max neval
# Rowwise(df) 398.024791 453.68129 488.07312 481.051431 523.466771 688.36084 100
# Colwise(df) 25.361609 28.10300 34.20972 30.952365 35.885061 95.92575 100
# newPaste(df) 65.777304 69.07432 82.08602 71.606890 82.232980 176.66516 100
# arith(df) 1.790622 1.88339 4.74913 2.027674 4.753279 58.50942 100
I have a list of vectors of variable length, for example:
q <- list(c(1,3,5), c(2,4), c(1,3,5), c(2,5), c(7), c(2,5))
I need to count the number of occurrences for each of the vectors in the list, for example (any other suitable data structure is acceptable):
list(list(c(1,3,5), 2), list(c(2,4), 1), list(c(2,5), 2), list(c(7), 1))
Is there an efficient way to do this? The actual list has tens of thousands of items, so quadratic behaviour is not feasible.
match and unique accept and handle "list"s too (?match warns that it can be slow on "list"s). So, with:
match(q, unique(q))
#[1] 1 2 1 3 4 3
each element is mapped to a single integer. Then:
tabulate(match(q, unique(q)))
#[1] 2 1 2 1
And find a structure to present the results:
as.data.frame(cbind(vec = unique(q), n = tabulate(match(q, unique(q)))))
# vec n
#1 1, 3, 5 2
#2 2, 4 1
#3 2, 5 2
#4 7 1
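Note that vec ends up as a list column (cbind on a list keeps each vector intact), so an individual vector can be pulled back out with, e.g.:
res <- as.data.frame(cbind(vec = unique(q), n = tabulate(match(q, unique(q)))))
res$vec[[1]]
#[1] 1 3 5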
As an alternative to the match(x, unique(x)) approach, we could map each element to a single value by deparseing it:
table(sapply(q, deparse))
#
# 7 c(1, 3, 5) c(2, 4) c(2, 5)
# 1 2 1 2
Also, since each vector contains unique integers, and assuming they fall in a small range, we could map each element to a single integer via a binary representation:
n = max(unlist(q))
pow2 = 2 ^ (0:(n - 1))
sapply(q, function(x) tabulate(x, nbins = n)) # 'binary' form
sapply(q, function(x) sum(tabulate(x, nbins = n) * pow2))
#[1] 21 10 21 18 64 18
and then tabulate as before.
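For completeness, tabulating these mapped values the same way as before:
v = sapply(q, function(x) sum(tabulate(x, nbins = n) * pow2))
tabulate(match(v, unique(v)))
#[1] 2 1 2 1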
And just to compare the above alternatives:
f1 = function(x)
{
ux = unique(x)
i = match(x, ux)
cbind(vec = ux, n = tabulate(i))
}
f2 = function(x)
{
xc = sapply(x, deparse)
i = match(xc, unique(xc))
cbind(vec = x[!duplicated(i)], n = tabulate(i))
}
f3 = function(x)
{
n = max(unlist(x))
pow2 = 2 ^ (0:(n - 1))
v = sapply(x, function(X) sum(tabulate(X, nbins = n) * pow2))
i = match(v, unique(v))
cbind(vec = x[!duplicated(v)], n = tabulate(i))
}
q2 = rep_len(q, 1e3)
all.equal(f1(q2), f2(q2))
#[1] TRUE
all.equal(f2(q2), f3(q2))
#[1] TRUE
microbenchmark::microbenchmark(f1(q2), f2(q2), f3(q2))
#Unit: milliseconds
# expr min lq mean median uq max neval cld
# f1(q2) 7.980041 8.161524 10.525946 8.291678 8.848133 178.96333 100 b
# f2(q2) 24.407143 24.964991 27.311056 25.514834 27.538643 45.25388 100 c
# f3(q2) 3.951567 4.127482 4.688778 4.261985 4.518463 10.25980 100 a
Another interesting alternative is based on ordering. R (>= 3.3.0) has a grouping function, built off data.table's ordering routines, which, along with the ordering itself, provides some attributes for further manipulation.
Make all elements of equal length and "transpose" (probably the slowest operation in this case, though I'm not sure how else to feed grouping):
n = max(lengths(q))
qq = .mapply(c, lapply(q, "[", seq_len(n)), NULL)
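For the example list, qq holds one vector per position, padded with NA:
str(qq)
# List of 3
#  $ : num [1:6] 1 2 1 2 7 2
#  $ : num [1:6] 3 4 3 5 NA 5
#  $ : num [1:6] 5 NA 5 NA NA NA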
Use ordering to group similar elements mapped to integers:
gr = do.call(grouping, qq)
e = attr(gr, "ends")
i = rep(seq_along(e), c(e[1], diff(e)))[order(gr)]
i
#[1] 1 2 1 3 4 3
then, tabulate as before.
To continue the comparisons:
f4 = function(x)
{
n = max(lengths(x))
x2 = .mapply(c, lapply(x, "[", seq_len(n)), NULL)
gr = do.call(grouping, x2)
e = attr(gr, "ends")
i = rep(seq_along(e), c(e[1], diff(e)))[order(gr)]
cbind(vec = x[!duplicated(i)], n = tabulate(i))
}
all.equal(f3(q2), f4(q2))
#[1] TRUE
microbenchmark::microbenchmark(f1(q2), f2(q2), f3(q2), f4(q2))
#Unit: milliseconds
# expr min lq mean median uq max neval cld
# f1(q2) 7.956377 8.048250 8.792181 8.131771 8.270101 21.944331 100 b
# f2(q2) 24.228966 24.618728 28.043548 25.031807 26.188219 195.456203 100 c
# f3(q2) 3.963746 4.103295 4.801138 4.179508 4.360991 35.105431 100 a
# f4(q2) 2.874151 2.985512 3.219568 3.066248 3.186657 7.763236 100 a
In this comparison, q's elements are of small length to accommodate f3; both f3 (because of the large exponentiation) and f4 (because of .mapply) will suffer in performance if lists with larger elements are used.
One way is to paste each vector, unlist, and table the result, i.e.
table(unlist(lapply(q, paste, collapse = ',')))
#1,3,5 2,4 2,5 7
# 2 1 2 1
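If the original vectors are wanted alongside the counts, the pasted keys can serve as names (a small extension of the one-liner above):
tab <- table(unlist(lapply(q, paste, collapse = ',')))
data.frame(vec = names(tab), n = as.vector(tab))
#    vec n
#1 1,3,5 2
#2   2,4 1
#3   2,5 2
#4     7 1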
I have a large sparse matrix, 1M x 10 (1 million rows and 10 columns). I want to look through every row of the matrix for a value and create a new vector based on it. Below is my code; I am wondering if there is any way to optimize it.
CreatenewVector <- function(TestMatrix){
newColumn = c()
for(i in 1:nrow(TestMatrix)){ ## Loop begins
Value = ifelse(1 %in% TestMatrix[i,],1,0)
newColumn = c(newColumn,Value)
} ##Loop ends
return(newColumn)
}
## SampleInput: TestMatrix = matrix(c(1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0), byrow = T, nrow = 4)
## Sample output: c(1, 1, 1, 0)
## In the input TestMatrix, each vector represents a row. For instance, (1, 0, 0) is the first row, and so on.
Assuming you are using a normal matrix object, not a special sparse matrix class, you should use rowSums.
rowSums(x == 1) > 0
if x is the name of your matrix. This will return a logical vector, you can easily coerce to numeric with as.numeric() if you prefer 1/0 to true/false.
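Using the sample input from the question as a quick sanity check:
TestMatrix <- matrix(c(1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0), byrow = TRUE, nrow = 4)
as.numeric(rowSums(TestMatrix == 1) > 0)
#[1] 1 1 1 0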
To give some sense of timing I benchmarked first using a thousand row matrix, then a million row matrix:
gregor = function(x) {as.numeric(rowSums(x == 1L) > 0L)}
# original method in question
op1 = function(x){
newColumn = c()
for(i in 1:nrow(x)){ ## Loop begins
Value = ifelse(1 %in% x[i,],1,0)
newColumn = c(newColumn,Value)
} ##Loop ends
return(newColumn)
}
# modified original:
# eliminated unnecessary ifelse
# pre-allocated result vector (no growing in a loop!)
# saved numeric conversion to the end
op2 = function(x){
newColumn = logical(nrow(x))
for(i in 1:nrow(x)){ ## Loop begins
newColumn[i] = 1L %in% x[i,]
} ##Loop ends
return(as.numeric(newColumn))
}
bouncy = function(x) {
as.numeric(apply(x, 1, function(y) any(y == 1L)))
}
Here are the results for a thousand row matrix:
n = 1e3
x = matrix(sample(c(0L, 1L), size = n, replace = T), ncol = 4)
microbenchmark(gregor(x), op1(x), op2(x), bouncy(x), times = 20)
# Unit: microseconds
# expr min lq mean median uq max neval cld
# gregor(x) 12.164 15.7750 20.14625 20.1465 24.8980 30.410 20 a
# op1(x) 1224.736 1258.9465 1345.46110 1275.6715 1338.0105 2002.075 20 d
# op2(x) 846.140 864.7655 935.46740 886.2425 951.4325 1287.075 20 c
# bouncy(x) 439.795 453.8595 496.96475 486.5495 508.0260 711.199 20 b
Using rowSums is the clear winner. I eliminated OP1 from the next test on a million row matrix:
n = 1e6
x = matrix(sample(c(0L, 1L), size = n, replace = T), ncol = 4)
microbenchmark(gregor(x), op2(x), bouncy(x), times = 30)
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# gregor(x) 9.371777 10.02862 12.55963 10.61343 14.13236 27.70671 30 a
# op2(x) 822.171523 856.68916 937.23602 881.39219 1028.26738 1183.68569 30 c
# bouncy(x) 391.604590 412.51063 502.61117 502.02431 588.78785 656.18824 30 b
Where the relative margin is even more in favor of rowSums.
I need an efficient way of finding the indexes (not the logical vector) of matches between two vectors. I can do this with:
which(c("a", "q", "f", "c", "z") %in% letters[1:10])
In the same way, it is better to find the position of the maximum number with which.max:
which(c(1:8, 10, 9) %in% max(c(1:8, 10, 9)))
which.max(c(1:8, 10, 9))
I am wondering if I have the most efficient way of finding the positions of matching terms in the two vectors.
EDIT:
Per the questions/comments below: I am operating on a list of vectors. The problem involves sentences that have been broken into bags of words, as seen below. The list may contain 10,000-20,000 or more character vectors. Then, based on each matched index, I will grab the 4 words before and the 4 words after it and calculate a score.
x <- list(c('I', 'like', 'chocolate', 'cake'), c('chocolate', 'cake', 'is', 'good'))
y <- rep(x, 5000)
lapply(y, function(x) {
which(x %in% c("chocolate", "good"))
})
Here's a relatively faster way using data.table:
require(data.table)
vv <- vapply(y, length, 0L)
DT <- data.table(y = unlist(y), id = rep(seq_along(y), vv), pos = sequence(vv))
setkey(DT, y)
# OLD CODE which will not take care of no-match entries (commented)
# DT[J(c("chocolate", "good")), list(list(pos)), by=id]$V1
setkey(DT[J(c("chocolate", "good"))], id)[J(seq_along(vv)), list(list(pos))]$V1
The idea:
First we unlist your list into a column of DT named y. In addition, we create two other columns: id, the index in the original list, and pos, the position within that id. Then, by setting the key on y, we can do fast subsetting on the query words, which gives the corresponding pos values for each id. Before collecting all pos for each id into a list and outputting the list column (V1), we take care of entries with no match for our query: after the first subset, we set the key to id and subset on all possible values of id, which yields NA for the non-matching entries.
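Before the key is set, the first rows of DT for the example data look like this (shown for illustration):
#           y id pos
#1:         I  1   1
#2:      like  1   2
#3: chocolate  1   3
#4:      cake  1   4
#5: chocolate  2   1
#6:      cake  2   2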
Benchmarking with the lapply code on your post:
x <- list(c('I', 'like', 'chocolate', 'cake'), c('chocolate', 'cake', 'is', 'good'))
y <- rep(x, 5000)
require(data.table)
arun <- function() {
vv <- vapply(y, length, 0L)
DT <- data.table(y = unlist(y), id = rep(seq_along(y), vv), pos = sequence(vv))
setkey(DT, y)
setkey(DT[J(c("chocolate", "good"))], id)[J(seq_along(vv)), list(list(pos))]$V1
}
tyler <- function() {
lapply(y, function(x) {
which(x %in% c("chocolate", "good"))
})
}
require(microbenchmark)
microbenchmark(a1 <- arun(), a2 <- tyler(), times=50)
Unit: milliseconds
expr min lq median uq max neval
a1 <- arun() 30.71514 31.92836 33.19569 39.31539 88.56282 50
a2 <- tyler() 626.67841 669.71151 726.78236 785.86444 955.55803 50
> identical(a1, a2)
# [1] TRUE
The C++ answer was faster when comparing single characters, but I think using a vector of strings introduced enough overhead that now it's slower:
char1 <- c("a", "q", "f", "c", "z")
char2 <- letters[1:10]
library(inline)
cpp_whichin_src <- '
Rcpp::CharacterVector xa(a);
Rcpp::CharacterVector xb(b);
int n_xa = xa.size();
int n_xb = xb.size();
NumericVector res(n_xa);
std::vector<std::string> sa = Rcpp::as< std::vector<std::string> >(xa);
std::vector<std::string> sb = Rcpp::as< std::vector<std::string> >(xb);
for(int i=0; i < n_xa; i++) {
for(int j=0; j < n_xb; j++) {
// store the 1-based position in a when a match is found (0 means no match)
if( sa[i] == sb[j] ) res[i] = i+1;
}
}
return res;
'
cpp_whichin <- cxxfunction(signature(a="character",b="character"), cpp_whichin_src, plugin="Rcpp")
which.in_cpp <- function(char1, char2) {
idx <- cpp_whichin(char1,char2)
idx[idx!=0]
}
which.in_naive <- function(char1, char2) {
which(char1 %in% char2)
}
which.in_CW <- function(char1, char2) {
unlist(sapply(char2,function(x) which(x==char1)))
}
which.in_cpp(char1,char2)
which.in_naive(char1,char2)
which.in_CW(char1,char2)
Benchmarks
library(microbenchmark)
microbenchmark(
which.in_cpp(char1,char2),
which.in_naive(char1,char2),
which.in_CW(char1,char2)
)
# Larger test data: two-letter combos, 100 sampled for each vector
set.seed(1)
cmb <- apply(combn(letters, 2), 2, paste, collapse = "")
char1 <- sample(cmb, 100)
char2 <- sample(cmb, 100)
# (rerun the microbenchmark call above with this data to get the timings below)
Unit: microseconds
expr min lq median uq max
1 which.in_cpp(char1, char2) 114.890 120.023 126.6930 135.5630 537.011
2 which.in_CW(char1, char2) 697.505 725.826 766.4385 813.8615 8032.168
3 which.in_naive(char1, char2) 17.391 20.289 22.4545 25.4230 76.826
# Same as above, but with 3 letter combos and 1000 sampled
Unit: microseconds
expr min lq median uq max
1 which.in_cpp(char1, char2) 8505.830 8715.598 8863.3130 8997.478 9796.288
2 which.in_CW(char1, char2) 23430.493 27987.393 28871.2340 30032.450 31926.546
3 which.in_naive(char1, char2) 129.904 135.736 158.1905 180.260 3821.785