Convert integers to decimal values - r

I have a set of integer data in the range 1:10000. I need to bring the values into the range 0:1.
For example, converting
12 --> 0.12
123 --> 0.123
1234 --> 0.1234
etc. (note that I don't want to scale the values).
Any suggestions how to do this on all the data at once?

I would simply do
x <- c(2, 14, 128, 1940, 140, 20000)
x/10^nchar(x)
## [1] 0.200 0.140 0.128 0.194 0.140 0.200
A much faster approach (which avoids the character conversion) was offered by @Frank:
x/10^ceiling(log10(x))
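Note that the two formulas are not interchangeable when x is an exact power of 10; a quick check (my addition, not part of the original answer):
x <- c(1, 10, 100, 123)
x/10^nchar(x)                 # 0.100 0.100 0.100 0.123
x/10^ceiling(log10(x))        # 1.000 1.000 1.000 0.123
x/10^(floor(log10(x)) + 1)    # counts digits arithmetically and matches the nchar() version here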
Benchmark
library(microbenchmark)
set.seed(123)
x <- sample(1e8, 1e6)
microbenchmark(
  david = x/10^nchar(x),
  davidfrank = x/10^ceiling(log10(x)),
  richard1 = as.numeric(paste0(".", x)),
  richard2 = as.numeric(sprintf(".%d", x))
)
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# david 691.0513 822.6482 1052.2473 956.5541 1153.4779 2391.7856 100 b
# davidfrank 130.0522 164.3227 255.8397 197.3158 339.3224 576.2255 100 a
# richard1 1130.5160 1429.8314 1972.2624 1689.8454 2473.6409 4791.0558 100 c
# richard2 712.8357 926.8013 1181.5349 1103.1661 1315.4459 2753.6795 100 b

The non-mathy way would be to add the decimal with paste() then coerce back to numeric.
x <- c(2, 14, 128, 1940, 140, 20000)
as.numeric(paste0(".", x))
# [1] 0.200 0.140 0.128 0.194 0.140 0.200
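One caveat with the string-based route (my note, not from the original answer): with double input, large values can be formatted in scientific notation before pasting, which breaks the result; integer input is safe.
as.numeric(paste0(".", 100000))    # as.character(1e5) is "1e+05", so this gives 10000, not 0.1
as.numeric(paste0(".", 100000L))   # integer input keeps "100000" and gives 0.1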
Update 1: There was some interest in the timings of the two originally posted methods. According to the following benchmarks, they seem to be about the same.
library(microbenchmark)
x <- 1:1e5
microbenchmark(
  david = { david <- x/10^nchar(x) },
  richard = { richard <- as.numeric(paste0(".", x)) }
)
# Unit: milliseconds
# expr min lq mean median uq max neval
# david 88.94391 89.18379 89.70962 89.40736 89.71012 99.68126 100
# richard 87.89776 88.17234 89.38383 88.44439 88.77052 105.06066 100
identical(richard, david)
# [1] TRUE
Update 2: I remembered that sprintf() is often faster than paste0(), so we can also use the following.
as.numeric(sprintf(".%d", x))
Using the same x from above and comparing only these two choices, sprintf() gives a good improvement over paste0(), as shown below.
microbenchmark(
  paste0 = as.numeric(paste0(".", x)),
  sprintf = as.numeric(sprintf(".%d", x))
)
# Unit: milliseconds
# expr min lq mean median uq max neval
# paste0 87.89413 88.41606 90.25795 88.82484 89.65674 107.8080 100
# sprintf 61.16524 61.23328 62.26202 61.29192 61.48316 79.1202 100
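One limitation of the sprintf() version (my note, not part of the original answer): "%d" accepts only whole numbers, so it errors on non-integer doubles, while paste0() does not care.
as.numeric(sprintf(".%d", c(2, 14, 128)))   # fine: whole-valued doubles are coerced to integer
# sprintf(".%d", 1.5)                       # error: invalid format '%d'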

Related

Rcpp rowMaxs vs. matrixStats rowMaxs

I am trying to efficiently compute rowMaxs in Rcpp. A very simple implementation is
// [[Rcpp::depends(RcppArmadillo)]]
#include <RcppArmadillo.h>

// [[Rcpp::export]]
arma::mat RcppRowmaxs(arma::mat x){
  int N = x.n_rows;
  arma::mat rm(N, 1);
  // take the maximum of each row in turn
  for(int nn = 0; nn < N; nn++){
    rm(nn) = max(x.row(nn));
  }
  return(rm);
}
which works perfectly fine. However, comparing this function to other packages, it turned out that other implementations are by far more efficient. Specifically, Rfast::rowMaxs is more than 6 times faster than the simple Rcpp implementation!
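For completeness, a quick sanity check (my addition; it assumes RcppRowmaxs() has been compiled with Rcpp::sourceCpp()) that the simple implementation agrees with the packaged ones before comparing speed:
m <- matrix(rnorm(100 * 100), 100, 100)
all.equal(drop(RcppRowmaxs(m)), matrixStats::rowMaxs(m))         # drop() turns the N x 1 matrix into a vector
all.equal(drop(RcppRowmaxs(m)), Rfast::rowMaxs(m, value = TRUE))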
Naturally, I tried to mimic the behavior of Rfast.
However, as a beginner in Rcpp, I only tried to load Rfast::rowMaxs directly in Rcpp, as described e.g. here. Unfortunately, using an Rcpp function to call an R function that in turn calls compiled code seems pretty slow, according to my benchmark (see the "RfastinRcpp" row):
m = matrix(rnorm(1000*1000), 1000, 1000)
microbenchmark::microbenchmark(
  matrixStats = matrixStats::rowMaxs(m),
  Rfast = Rfast::rowMaxs(m, value = T),
  Rcpp = RcppRowmaxs(m),
  RfastinRcpp = RfastRcpp(m),
  apply = apply(m, 1, max)
)
Unit: microseconds
expr min lq mean median uq max neval cld
matrixStats 1929.570 2042.8975 2232.1980 2086.5180 2175.470 4025.923 100 a
Rfast 666.711 727.2245 842.5578 795.2215 891.443 1477.969 100 a
Rcpp 5552.216 5825.4855 6186.9850 5997.8295 6373.737 8568.878 100 b
RfastinRcpp 7495.042 7931.2480 9471.8453 8382.6350 10659.672 19968.817 100 b
apply 12281.758 15145.7495 22015.2798 17202.9730 20310.939 136844.591 100 c
Any tips on how to improve performance in the function I've provided above? I've looked at the source code from Rfast and believe that this is the correct file. However, so far I have not managed to locate the important parts of the code.
Edit: Changed the post to focus on Rfast now, following the answer of Michail.
I just did some experiments on my laptop, a 5-year-old HP with 2 Intel i5 cores at 2.3 GHz. My results are shown below. Rfast's implementation is always faster than matrixStats', and as the matrix gets bigger the time difference increases.
library(Rfast)
library(microbenchmark)
library(matrixStats)
x <- matrnorm(100,100)
microbenchmark(Rfast::rowMaxs(x,value=TRUE), matrixStats::rowMaxs(x),times=10)
Unit: microseconds
expr min lq mean median uq max neval
Rfast::rowMaxs(x, value = TRUE) 20.5 20.9 242.64 21.50 23.2 2223.8 10
matrixStats::rowMaxs(x) 43.7 44.7 327.43 46.95 88.2 2776.8 10
x <- matrnorm(1000,1000)
microbenchmark(Rfast::rowMaxs(x,value=TRUE), matrixStats::rowMaxs(x),times=10)
Unit: microseconds
expr min lq mean median uq max neval
Rfast::rowMaxs(x, value = TRUE) 799.5 844.0 875.08 858.5 900.3 960.3 10
matrixStats::rowMaxs(x) 2229.8 2260.8 2303.04 2269.4 2293.3 2607.8 10
x <- matrnorm(10000,10000)
microbenchmark(Rfast::rowMaxs(x,value=TRUE), matrixStats::rowMaxs(x),times=10)
Unit: milliseconds
expr min lq mean median uq max neval
Rfast::rowMaxs(x, value = TRUE) 82.1157 83.4288 85.81769 84.57885 86.2742 93.0031 10
matrixStats::rowMaxs(x) 216.0003 218.5324 222.46670 221.00330 225.3302 237.7666 10

How to efficiently operate on a column of vectors in data.table

I want to do an operation on a column that consists of numeric vectors and I'm wondering what is the best way to do it.
So far I have tried the following, and the set way seems to be the best, but maybe I'm missing out on some superior way to approach this? How big of a speed boost could be expected from doing this in C++?
testVector <- data.table::data.table(A = lapply(1:10^5, function(x) runif(100)))
microbenchmark::microbenchmark(
  lapply = testVector[, B := lapply(A, diff)],
  map = testVector[, C := Map(diff, A)],
  set = set(testVector, NULL, "D", lapply(testVector[["A"]], diff)),
  forset = {for(i in seq(nrow(testVector))) set(testVector, i, "E", list(list(diff(testVector[[i, "A"]]))))},
  times = 10L)
The results are following:
Unit: milliseconds
expr min lq mean median uq max neval
set 789.7967 924.8178 1031.923 1082.325 1146.306 1174.671 10
lapply 1122.2454 1468.9556 1563.002 1619.668 1692.217 1919.405 10
map 1297.5236 1320.7022 1571.344 1592.176 1695.673 2012.051 10
forset 1887.0003 2023.7357 2139.202 2174.912 2245.943 2396.844 10
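Before reading too much into the timings, it is worth confirming (a quick check I added, not part of the original post) that the four approaches produce the same column:
identical(testVector$B, testVector$C)
identical(testVector$B, testVector$D)
identical(testVector$B, testVector$E)
# all three comparisons should return TRUE if the methods agree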
Update
I have checked how Rcpp fares with the task. While my C++ skills are very poor, the speed increase is >10x.
The C++ code:
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
List cppDiff(List column){
  int cSize = column.size();
  List outputColumn(cSize, NumericVector());
  for(int i = 0; i < cSize; ++i){
    NumericVector vectorElement = column[i];
    outputColumn[i] = Rcpp::diff(vectorElement);
  }
  return(outputColumn);
}
Testing code:
library(Rcpp);library(data.table);library(microbenchmark)
sourceCpp("diffColumn.cpp")
vLen <- 100L
cNum <- 1e4L
test <- data.table(A = lapply(1L:cNum, function(x) runif(vLen)))
throughMatrix <- function(column){
  difmat <- diff(matrix(unlist(column), nrow = vLen, ncol = cNum))
  lapply(seq(cNum), function(i) difmat[, i])
}
microbenchmark::microbenchmark(
  DT = set(test, NULL, "B", lapply(test[["A"]], diff)),
  mat = set(test, NULL, "C", throughMatrix(test[["A"]])),
  cpp = set(test, NULL, "D", cppDiff(test[["A"]])),
  times = 5)
> all.equal(test$B, test$C)
[1] TRUE
> all.equal(test$B, test$D)
[1] TRUE
Unit: milliseconds
expr min lq mean median uq max neval
DT 845.04418 912.60961 1024.79183 1011.59417 1107.14306 1323.9963 10
mat 643.02187 663.92700 778.91145 816.95972 844.37206 864.1173 10
cpp 45.28504 49.35746 84.27799 78.32085 84.87942 226.1347 10
And another benchmark for a 10000 x 10000 column:
Unit: milliseconds
expr min lq mean median uq max neval
DT 7851.4352 8504.3501 21632.018 25246.7860 29133.358 37424.163 5
mat 8679.9386 8724.1497 22852.724 18235.7693 39199.966 39423.794 5
cpp 244.8572 247.7443 1439.011 303.2556 2715.643 3683.552 5
Have you considered using matrices? The syntax and data structure are different enough that the code below isn't a drop-in replacement, but depending on the analysis pipeline before and after this operation, I suspect matrix inputs/outputs might be a more fitting way to handle the data than list-columns anyway.
library(data.table)
VectorLength <- 1e5L
testVector <- data.table::data.table(A = lapply(1:VectorLength, function(x) runif(100)))
A <- matrix(data = runif(100L*VectorLength),nrow = 100L,ncol = VectorLength)
microbenchmark::microbenchmark(
  set = testVector[, B := lapply(A, diff)],
  Matrix = B <- diff(A),
  times = 10L)
Yields the following on a Windows PC:
Unit: milliseconds
expr min lq mean median uq max neval
set 1143.933 1251.064 1316.944 1331.4672 1376.8016 1431.8988 10
Matrix 307.945 315.689 363.255 335.4382 390.1124 499.5492 10
And the following on a Linux server running Ubuntu 14.04
Unit: milliseconds
expr min lq mean median uq max neval
set 1342.6969 1410.3132 1519.6830 1551.2051 1594.3431 1699.7480 10
Matrix 285.0472 297.3283 375.0613 302.4198 488.3482 503.0959 10
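If the data already lives in the list-column, it can be converted to the matrix layout used above before calling diff(); a sketch along the lines of the question's own throughMatrix() (the object names here are my own):
A_from_list <- matrix(unlist(testVector$A), nrow = 100L, ncol = VectorLength)
B_from_list <- diff(A_from_list)    # 99 x VectorLength matrix of differences, same layout as B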
Just for reference, this is what the output looks like when coerced to a data.table:
str(as.data.table(t(B)))
returns
Classes ‘data.table’ and 'data.frame': 99 obs. of 100000 variables:
$ V1 : num 0.23 0.24 -0.731 0.724 0.074 ...
$ V2 : num -0.628 0.585 -0.164 0.269 -0.16 ...
$ V3 : num 0.1735 0.1128 -0.3069 0.0341 -0.2664 ...
$ V4 : num -0.392 0.593 -0.345 -0.327 0.747 ...
$ V5 : num 0.1084 0.2915 0.3858 -0.1574 -0.0929 ...
$ V6 : num -0.2053 -0.2669 -0.2 0.0214 0.1111 ...
$ V7 : num 0.0582 -0.2141 0.7282 -0.6877 0.4981 ...
$ V8 : num -0.439 -0.114 0.275 0.4 -0.184 ...
$ V9 : num 0.13673 0.55244 -0.43132 0.21692 -0.00308 ...
$ V10 : num 0.701 -0.0486 -0.1464 -0.5595 -0.046 ...
$ V11 : num 0.3583 -0.2588 -0.0742 -0.2113 0.9434 ...
$ V12 : num -0.1146 0.5346 -0.0594 -0.6534 0.6112 ...
$ V13 : num 0.473 0.307 -0.544 0.718 -0.315 ...
Update: It depends.
So I was curious how the performance improvement would look at a larger scale, and this one turns out to be a somewhat interesting problem where the most efficient method is highly dependent on the size/shape of the data.
Using the following format:
VectorLength <- 1e5L
ItemLength <- 1e2L
testVector <- data.table::data.table(A = lapply(1:VectorLength, function(x) runif(ItemLength)))
A <- matrix(data = runif(ItemLength*VectorLength),nrow = ItemLength,ncol = VectorLength)
microbenchmark::microbenchmark(
  set = set(testVector, NULL, "D", lapply(testVector[["A"]], diff)),
  Matrix = B <- diff(A),
  times = 5L)
I ran through a range of VectorLength and ItemLength values, referred to from here on as (Vector x Item), where (10,000 x 100) signifies 10,000 vectors (data.table rows) with 100 elements each. Since the matrix form was transposed to fit the base R diff function, this translates to a matrix with 100 rows and 10,000 columns.
(10,000 x 10)
Unit: milliseconds
expr min lq mean median uq max neval
set 83.947769 88.420871 102.822626 90.91088 104.737002 146.096606 5
Matrix 2.368524 2.437371 2.661553 2.45122 2.476745 3.573904 5
(10,000 x 100)
Unit: milliseconds
expr min lq mean median uq max neval
set 119.33550 140.35294 174.17641 198.14286 199.56239 213.48837 5
Matrix 20.75578 23.00535 60.10874 79.47677 88.33331 88.97251 5
(10,000 x 1,000)
Unit: milliseconds
expr min lq mean median uq max neval
set 337.0859 382.6305 407.9396 429.0512 440.6331 450.2971 5
Matrix 300.3360 316.5533 411.4678 352.0477 534.4063 553.9957 5
(10,000 x 10,000)
Unit: milliseconds
expr min lq mean median uq max neval
set 1428.319 1483.324 1518.096 1508.114 1578.929 1591.792 5
Matrix 3059.825 3119.654 4366.107 3224.755 6164.489 6261.815 5
The take-away
Depending on the dimensions of the data you will actually be using, the relative performance of the methods will change drastically. If your actual data is similar to what you originally proposed for benchmarking purposes, then the matrix operation should work well, but if the dimensions vary one way or another, I'd re-benchmark with data of a representative "shape", as sketched below.
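A small helper for that re-benchmarking (my own sketch, not part of the original answer), which re-runs the comparison above for whatever (Vector x Item) shapes match your real data:
library(data.table)
library(microbenchmark)
shape_benchmark <- function(VectorLength, ItemLength, times = 5L) {
  testVector <- data.table(A = lapply(seq_len(VectorLength), function(x) runif(ItemLength)))
  A <- matrix(runif(ItemLength * VectorLength), nrow = ItemLength, ncol = VectorLength)
  microbenchmark(
    set    = set(testVector, NULL, "D", lapply(testVector[["A"]], diff)),
    Matrix = diff(A),
    times  = times
  )
}
shape_benchmark(1e4L, 10L)     # wide and short
shape_benchmark(1e3L, 1e3L)    # closer to square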
Hope this is as helpful for you as it was interesting to me.

How to initialize a vector of TRUE in R?

Let's say I want to initialize a vector containing only TRUE, of a length n, where n is a positive integer.
Obviously, I can do !logical(n), as well as rep(TRUE,n).
However, I would like to know which of them is faster, and whether there are other (faster) alternatives.
I was wondering whether the internal functions rep_len() or rep.int() are maybe faster.
That is not the case: rep() is the simplest option and just as fast as rep_len().
suppressMessages(require(microbenchmark))
n <- 1E6L
microbenchmark(
  rep(TRUE, n),
  rep_len(TRUE, n),
  rep.int(TRUE, n),
  !logical(n),
  !vector(length=n),
  times = 1E4L, units = "us"
)
#> Warning in microbenchmark(rep(TRUE, n), rep_len(TRUE, n), rep.int(TRUE, : Could
#> not measure a positive execution time for 1533 evaluations.
#> Unit: nanoseconds
#> expr min lq mean median uq max neval
#> rep(TRUE, n) 1203400 1494900 1943086.70 1546600 1627950 25343300 10000
#> rep_len(TRUE, n) 1224500 1497100 1978824.31 1549950 1630400 28793600 10000
#> rep.int(TRUE, n) 1281700 1558100 2026079.65 1609650 1692350 26135600 10000
#> !logical(n) 2184300 2726350 3659910.07 2876200 3329900 28890700 10000
#> !vector(length = n) 2158900 2729200 3626122.66 2872750 3296150 28172100 10000
#> units 0 0 76.12 100 100 1400 10000
The reason why logical() and vector() are slower is obvious: they require two operations instead of one, i.e., create the vector and then change all values to TRUE afterwards. Hence, be aware that the speed advantage does not carry over to FALSE values: in that case logical() is even faster than rep().
microbenchmark(
  logical(n),
  vector(length=n),
  times = 1E4L, units = "us"
)
#> Warning in microbenchmark(logical(n), vector(length = n), times = 10000L, :
#> Could not measure a positive execution time for 4059 evaluations.
#> Unit: nanoseconds
#> expr min lq mean median uq max neval
#> logical(n) 461100 923400 1268676.20 985300 1110350 24575300 10000
#> vector(length = n) 462400 923800 1268691.87 984850 1103850 25761700 10000
#> units 0 0 19.45 0 0 900 10000
Created on 2021-01-23 by the reprex package (v0.3.0)
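To check the FALSE case mentioned above, one could add rep(FALSE, n) to the comparison (a sketch I did not run as part of the reprex above):
n <- 1E6L
microbenchmark(
  rep(FALSE, n),
  logical(n),
  vector(length = n),
  times = 1E4L
)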

How to measure the execution time of a code without actually running the code in R?

Can I use microbenchmark to calculate the approximate time it would take to execute my code in R? I am running some code and I can see that it takes many hours to execute. I don't want to run my code for all that time; I want to see the approximate execution time without actually running the code in R.
Try running your code on smaller problems to see how it scales:
> fun0 = function(n) { x = integer(); for (i in seq_len(n)) x = c(x, i); x }
> p = microbenchmark(fun0(1000), fun0(2000), fun0(4000), fun0(8000), fun0(16000),
+ times=20)
> p
Unit: milliseconds
        expr        min         lq       mean     median         uq        max neval
  fun0(1000)   1.627601   1.697958   1.995438   1.723522   2.289424   2.935609    20
  fun0(2000)   5.691456   6.333478   6.745057   6.928060   7.056893   8.040366    20
  fun0(4000)  23.343611  24.487355  24.987870  24.854968  25.554553  26.088183    20
  fun0(8000)  92.517691  95.827525 104.900161  97.305930 112.924961 136.434998    20
 fun0(16000) 365.495320 369.697953 380.981034 374.456565 390.829214 411.203191    20
Doubling the problem size makes execution roughly four times slower, i.e., run time grows quadratically with problem size; visualize this as
library(ggplot2)
ggplot(p, aes(x=expr, y=log(time))) + geom_point() +
geom_smooth(method="lm", aes(x=as.integer(expr)))
This is terrible news for big problems!
Investigate alternative implementations that scale better while returning the same answer, both as the problem increases in size and at a given problem size. First, make sure your algorithms / implementations get the same answer:
> ## linear, ok
> fun1 = function(n) { x = integer(n); for (i in seq_len(n)) x[[i]] = i; x }
> identical(fun0(100), fun1(100))
[1] TRUE
then see how the new algorithm scales with problem size
> microbenchmark(fun1(100), fun1(1000), fun1(10000))
Unit: microseconds
expr min lq mean median uq max neval
fun1(100) 86.260 97.558 121.5591 102.6715 107.6995 1058.321 100
fun1(1000) 845.160 902.221 932.7760 922.8610 945.6305 1915.264 100
fun1(10000) 8776.673 9100.087 9699.7925 9385.8560 10310.6240 13423.718 100
Explore more algorithms, especially those that replace iteration with vectorization
> ## linear, faster -- *nano*seconds
> fun2 = seq_len
> identical(fun1(100), fun2(100))
[1] TRUE
> microbenchmark(fun2(100), fun2(1000), fun2(10000))
Unit: nanoseconds
expr min lq mean median uq max neval
fun2(100) 417 505.0 587.53 553 618 2247 100
fun2(1000) 2126 2228.5 2774.90 2894 2986 5511 100
fun2(10000) 19426 19741.0 25390.93 27177 28209 43418 100
Comparing algorithms at specific sizes
> n = 1000; microbenchmark(fun0(n), fun1(n), fun2(n), times=10)
Unit: microseconds
expr min lq mean median uq max neval
fun0(n) 1625.797 1637.949 2018.6295 1657.1195 2800.272 2857.400 10
fun1(n) 819.448 843.988 874.9445 853.9290 910.871 1006.582 10
fun2(n) 2.158 2.386 2.5990 2.6565 2.716 3.055 10
> n = 10000; microbenchmark(fun0(n), fun1(n), fun2(n), times=10)
Unit: microseconds
    expr        min         lq        mean      median         uq        max neval
 fun0(n) 157010.750 157276.699 169905.4745 159944.5715 192185.973 197389.965    10
 fun1(n)   8613.977   8630.599   9212.2207   9165.9300   9394.605  10299.821    10
 fun2(n)     19.296     19.384     20.7852     20.8595     21.868     22.435    10
shows the increasing importance of a sensible implementation as problem size increases.
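To answer the original question more directly: one rough approach (my own sketch, which assumes the power-law trend seen above continues) is to fit log(time) against log(problem size) on the small runs and extrapolate to the full size:
library(microbenchmark)
sizes <- c(1000, 2000, 4000, 8000, 16000)
med_sec <- sapply(sizes, function(n)
  median(microbenchmark(fun0(n), times = 5L)$time) / 1e9)   # median run time in seconds
fit <- lm(log(med_sec) ~ log(sizes))
exp(predict(fit, newdata = data.frame(sizes = 1e6)))         # predicted seconds at n = 1e6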

Is there a shift operation in R?

In C++ there are shift operations:
>> right shift
<< left shift
These are considered to be very fast.
I tried to apply the same in R, but it seems to be wrong.
Is there a similar operation in R that is as fast as this?
Thanks in advance.
You can use bitwShiftL and bitwShiftR:
bitwShiftL(16, 2)
#[1] 64
bitwShiftR(16, 2)
#[1] 4
Here is the source code. Judging by the amount of additional code in these functions, and the fact that * and / are primitives, it is unlikely that these will be faster than multiplying or dividing by the equivalent power of two. On one of my VMs,
microbenchmark::microbenchmark(
  bitwShiftL(16, 2),
  16 * 4,
  times = 1000L
)
#Unit: nanoseconds
# expr min lq mean median uq max neval cld
# bitwShiftL(16, 2) 1167 1353.5 2336.779 1604 2067 117880 1000 b
# 16 * 4 210 251.0 564.528 347 470 51885 1000 a
microbenchmark::microbenchmark(
  bitwShiftR(16, 2),
  16 / 4,
  times = 1000L
)
# Unit: nanoseconds
# expr min lq mean median uq max neval cld
# bitwShiftR(16, 2) 1161 1238.5 1635.131 1388.5 1688.5 39225 1000 b
# 16/4 210 240.0 323.787 280.0 334.0 14284 1000 a
I should also point out that trying to micro-optimize an interpreted language is probably a waste of time. If performance is such a big concern that you are willing to split hairs over a couple of clock cycles, just write your program in C or C++ in the first place.
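For reference, a quick check (my addition) that the shifts line up with the equivalent arithmetic on integer vectors:
x <- c(1L, 16L, 1024L)
all(bitwShiftL(x, 2) == x * 4L)      # left shift by 2 is multiplication by 2^2
all(bitwShiftR(x, 2) == x %/% 4L)    # right shift by 2 is integer division by 2^2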
