Weighted sum of digits in R

Weighted sum of digits in R - r

I am trying to figure out the most efficient way to calculate the weighted sum of digits for a numeric string (where the weight is equal to the position of the digit in the numeric string).
Example: For the number 1059, the weighted sum of digits is calculated as 1 * 1 + 0 * 2 + 5 * 3 + 9 * 4 = 52
I would like to allow for the input to be of any length, but if there are more efficient ways when there is a limit to the string length (e.g. knowing that the number is no more 10 digits allows for a more efficient program) I am open to that too. Also, if it is preferred that the input is a of type numeric rather than character that is acceptable too.
What I have right now is an old fashioned for loop:
wsod <- function(str) {
output <- 0
for (pos in 1:nchar(str)) {
digit <- as.numeric(substr(str, pos , pos))
output <- output + pos * digit
}
output
}
A few solutions have been proposed for Python (using a numeric input) but I don't think they apply to R directly.

> number <- 1059
> x <- strsplit(as.character(number), "")[[1]]
> y <- seq_len(nchar(number))
> as.numeric(as.numeric(x) %*% y)
[1] 52

weighted.digit <- function(str) {
splitted.nums <- as.numeric(strsplit(str, '')[[1]])
return(sum(splitted.nums * 1:length(splitted.nums)))
}
weighted.digit('1059')
[1] 52
One could modify this to accept a numeric input, and then simply convert that to character as a first step.

Related

Can I further vectorize this function

I am relatively new to R, and matrix-based scripting languages in general. I have written this function to return the index's of each row which has a content similar to any another row's content. It is a primitive form of spam reduction that I am developing.
if (!require("RecordLinkage")) install.packages("RecordLinkage")
library("RecordLinkage")
# Takes a column of strings, returns a list of index's
check_similarity <- function(x) {
threshold <- 0.8
values <- NULL
for(i in 1:length(x)) {
values <- c(values, which(jarowinkler(x[i], x[-i]) > threshold))
}
return(values)
}
is there a way that I could write this to avoid the for loop entirely?

We can simplify the code somewhat using sapply.
# some test data #
x = c('hello', 'hollow', 'cat', 'turtle', 'bottle', 'xxx')
# create an x by x matrix specifying which strings are alike
m = sapply(x, jarowinkler, x) > threshold
# set diagonal to FALSE: we're not interested in strings being identical to themselves
diag(m) = FALSE
# And find index positions of all strings that are similar to at least one other string
which(rowSums(m) > 0)
# [1] 1 2 4 5
I.e. this returns the index positions of 'hello', 'hollow', 'turtle', and 'bottle' as being similar to another string
If you prefer, you can use colSums instead of rowSums to get a named vector, but this could be messy if the strings are long:
which(colSums(m) > 0)
# hello hollow turtle bottle
# 1 2 4 5

How to exclude 1 when running the function?

The question is:
There is a package with a function that enables you to check if a number is prime. install.packages("schoolmath") library(schoolmath) is.prim(3)
Create a function that takes in two integers (set default values of 1 to both). The function should calculate the number of prime numbers between the two values.
My code is:
install.packages("schoolmath")
library(schoolmath)
is.prim(3)
prime <- function(x)
{
p <- 0
p1 <- ifelse(is.prim(x) == "TRUE", p + 1, p)
return(sum(p1 == 1))
}
prime(seq(1,10,1))
When I ran the function, it counts 1 as a prime number as well, which is not true. How to efficiently exclude that from the function?

You can simplify your function a little because is.prim works with vectors and looking at the documentation for sum function:
Logical true values are regarded as one, false values as zero.
Here is a function that counts the primes in a vector
count.primes <- function(x) {
sum(x > 1 & is.prim(x))
}
Example:
count.primes(1:10)
# [1] 4
count.primes(1:20)
# [1] 8

R - How to get row & column subscripts of matched elements from a distance matrix

I have an integer vector vec1 and I am generating a distant matrix using dist function. I want to get the coordinates (row and column) of element of certain value in the distance matrix. Essentially I would like to get the pair of elements that are d-distant apart. For example:
vec1 <- c(2,3,6,12,17)
distMatrix <- dist(vec1)
# 1 2 3 4
#2 1
#3 4 3
#4 10 9 6
#5 15 14 11 5
Say, I am interested in pair of elements in the vector that are 5 unit apart. I wanted to get the coordinate1 which are the rows and coordinate2 which are the columns of the distance matrix. In this toy example, I would expect
coord1
# [1] 5
coord2
# [1] 4
I am wondering if there is an efficient way to get these values that doesn't involve converting the dist object to a matrix or looping through the matrix?

A distance matrix is a lower triangular matrix in packed format, where the lower triangular is stored as a 1D vector by column. You can check this via
str(distMatrix)
# Class 'dist' atomic [1:10] 1 4 10 15 3 9 14 6 11 5
# ...
Even if we call dist(vec1, diag = TRUE, upper = TRUE), the vector is still the same; only the printing styles changes. That is, no matter how you call dist, you always get a vector.
This answer focus on how to transform between 1D and 2D index, so that you can work with a "dist" object without first making it a complete matrix using as.matrix. If you do want to make it a matrix, use the dist2mat function defined in as.matrix on a distance object is extremely slow; how to make it faster?.
R functions
It is easy to write vectorized R functions for those index transforms. We only need some care dealing with "out-of-bound" index, for which NA should be returned.
## 2D index to 1D index
f <- function (i, j, dist_obj) {
if (!inherits(dist_obj, "dist")) stop("please provide a 'dist' object")
n <- attr(dist_obj, "Size")
valid <- (i >= 1) & (j >= 1) & (i > j) & (i <= n) & (j <= n)
k <- (2 * n - j) * (j - 1) / 2 + (i - j)
k[!valid] <- NA_real_
k
}
## 1D index to 2D index
finv <- function (k, dist_obj) {
if (!inherits(dist_obj, "dist")) stop("please provide a 'dist' object")
n <- attr(dist_obj, "Size")
valid <- (k >= 1) & (k <= n * (n - 1) / 2)
k_valid <- k[valid]
j <- rep.int(NA_real_, length(k))
j[valid] <- floor(((2 * n + 1) - sqrt((2 * n - 1) ^ 2 - 8 * (k_valid - 1))) / 2)
i <- j + k - (2 * n - j) * (j - 1) / 2
cbind(i, j)
}
These functions are extremely cheap in memory usage, as they work with index instead of matrices.
Applying finv to your question
You can use
vec1 <- c(2,3,6,12,17)
distMatrix <- dist(vec1)
finv(which(distMatrix == 5), distMatrix)
# i j
#[1,] 5 4
Generally speaking, a distance matrix contains floating point numbers. It is risky to use == to judge whether two floating point numbers are equal. Read Why are these numbers not equal? for more and possible strategies.
Alternative with dist2mat
Using the dist2mat function given in as.matrix on a distance object is extremely slow; how to make it faster?, we may use which(, arr.ind = TRUE).
library(Rcpp)
sourceCpp("dist2mat.cpp")
mat <- dist2mat(distMatrix, 128)
which(mat == 5, arr.ind = TRUE)
# row col
#5 5 4
#4 4 5
Appendix: Markdown (needs MathJax support) for the picture
## 2D index to 1D index
The lower triangular looks like this: $$\begin{pmatrix} 0 & 0 & \cdots & 0\\ \times & 0 & \cdots & 0\\ \times & \times & \cdots & 0\\ \vdots & \vdots & \ddots & 0\\ \times & \times & \cdots & 0\end{pmatrix}$$ If the matrix is $n \times n$, then there are $(n - 1)$ elements ("$\times$") in the 1st column, and $(n - j)$ elements in the j<sup>th</sup> column. Thus, for element $(i,\ j)$ (with $i > j$, $j < n$) in the lower triangular, there are $$(n - 1) + \cdots (n - (j - 1)) = \frac{(2n - j)(j - 1)}{2}$$ "$\times$" in the previous $(j - 1)$ columns, and it is the $(i - j)$<sup>th</sup> "$\times$" in the $j$<sup>th</sup> column. So it is the $$\left\{\frac{(2n - j)(j - 1)}{2} + (i - j)\right\}^{\textit{th}}$$ "$\times$" in the lower triangular.
----
## 1D index to 2D index
Now for the $k$<sup>th</sup> "$\times$" in the lower triangular, how can we find its matrix index $(i,\ j)$? We take two steps: 1> find $j$; 2> obtain $i$ from $k$ and $j$.
The first "$\times$" of the $j$<sup>th</sup> column, i.e., $(j + 1,\ j)$, is the $\left\{\frac{(2n - j)(j - 1)}{2} + 1\right\}^{\textit{th}}$ "$\times$" of the lower triangular, thus $j$ is the maximum value such that $\frac{(2n - j)(j - 1)}{2} + 1 \leq k$. This is equivalent to finding the max $j$ so that $$j^2 - (2n + 1)j + 2(k + n - 1) \geq 0.$$ The LHS is a quadratic polynomial, and it is easy to see that the solution is the integer no larger than its first root (i.e., the root on the left side): $$j = \left\lfloor\frac{(2n + 1) - \sqrt{(2n-1)^2 - 8(k-1)}}{2}\right\rfloor.$$ Then $i$ can be obtained from $$i = j + k - \left\{\frac{(2n - j)(j - 1)}{2}\right\}.$$

If the vector is not too large, the best way is probably to wrap the output of dist into as.matrix and to use which with the option arr.ind=TRUE. The only disadvantage of this standard method to retrieve the index numbers within a dist matrix is an increase of memory usage, which may become important in the case of very large vectors passed to dist. This is because the conversion of the lower triangular matrix returned by dist into a regular, dense matrix effectively doubles the amount of stored data.
An alternative consists in converting the dist object into a list, such that each column in the lower triangular matrix of dist represents one member of the list. The index number of the list members and the position of the elements within the list members can then be mapped to the column and row number of the dense N x N matrix, without generating the matrix.
Here is one possible implementation of this list-based approach:
distToList <- function(x) {
idx <- sum(seq(length(x) - 1)) - rev(cumsum(seq(length(x) - 1))) + 1
listDist <- unname(split(dist(x), cumsum(seq_along(dist(x)) %in% idx)))
# http://stackoverflow.com/a/16358095/4770166
}
findDistPairs <- function(vec, theDist) {
listDist <- distToList(vec)
inList <- lapply(listDist, is.element, theDist)
matchedCols <- which(sapply(inList, sum) > 0)
if (length(matchedCols) > 0) found <- TRUE else found <- FALSE
if (found) {
matchedRows <- sapply(matchedCols, function(x) which(inList[[x]]) + x )
} else {matchedRows <- integer(length = 0)}
matches <- cbind(col=rep(matchedCols, sapply(matchedRows,length)),
row=unlist(matchedRows))
return(matches)
}
vec1 <- c(2, 3, 6, 12, 17)
findDistPairs(vec1, 5)
# col row
#[1,] 4 5
The parts of the code that might be somewhat unclear concern the mapping of the position of an entry within the list to a column / row value of the N x N matrix. While not trivial, these transformations are straightforward.
In a comment within the code I have pointed out an answer on StackOverflow which has been used here to split a vector into a list. The loops (sapply, lapply) should be unproblematic in terms of performance since their range is of order O(N). The memory usage of this code is largely determined by the storage of the list. This amount of memory should be similar to that of the dist object since both objects contain the same data.
The dist object is calculated and transformed into a list in the function distToList(). Because of the dist calculation, which is required in any case, this function could be time-consuming in the case of large vectors. If the goal is to find several pairs with different distance values, then it may be better to calculate listDist only once for a given vector and to store the resulting list, e.g., in the global environment.
Long story short
The usual way to treat such problems is simple and fast:
distMatrix <- as.matrix(dist(vec1)) * lower.tri(diag(vec1))
which(distMatrix == 5, arr.ind = TRUE)
# row col
#5 5 4
I suggest using this method by default. More complicated solutions may become necessary in situations where memory limits are reached, i.e., in the case of very large vectors vec1. The list-based approach described above could then provide a remedy.

fill up a matrix one random cell at a time

I am filling a 10x10 martix (mat) randomly until sum(mat) == 100
I wrote the following.... (i = 2 for another reason not specified here but i kept it at 2 to be consistent with my actual code)
mat <- matrix(rep(0, 100), nrow = 10)
mat[1,] <- c(0,0,0,0,0,0,0,0,0,1)
mat[2,] <- c(0,0,0,0,0,0,0,0,1,0)
mat[3,] <- c(0,0,0,0,0,0,0,1,0,0)
mat[4,] <- c(0,0,0,0,0,0,1,0,0,0)
mat[5,] <- c(0,0,0,0,0,1,0,0,0,0)
mat[6,] <- c(0,0,0,0,1,0,0,0,0,0)
mat[7,] <- c(0,0,0,1,0,0,0,0,0,0)
mat[8,] <- c(0,0,1,0,0,0,0,0,0,0)
mat[9,] <- c(0,1,0,0,0,0,0,0,0,0)
mat[10,] <- c(1,0,0,0,0,0,0,0,0,0)
i <- 2
set.seed(129)
while( sum(mat) < 100 ) {
# pick random cell
rnum <- sample( which(mat < 1), 1 )
mat[rnum] <- 1
##
print(paste0("i =", i))
print(paste0("rnum =", rnum))
print(sum(mat))
i = i + 1
}
For some reason when sum(mat) == 99 there are several steps extra...I would assume that once i = 91 the while would stop but it continues past this. Can somone explain what I have done wrong...
If I change the while condition to
while( sum(mat) < 100 & length(which(mat < 1)) > 0 )
the issue remains..

Your problem is equivalent to randomly ordering the indices of a matrix that are equal to 0. You can do this in one line with sample(which(mat < 1)). I suppose if you wanted to get exactly the same sort of output, you might try something like:
set.seed(144)
idx <- sample(which(mat < 1))
for (i in seq_along(idx)) {
print(paste0("i =", i))
print(paste0("rnum =", idx[i]))
print(sum(mat)+i)
}
# [1] "i =1"
# [1] "rnum =5"
# [1] 11
# [1] "i =2"
# [1] "rnum =70"
# [1] 12
# ...

See ?sample
Arguments:
x: Either a vector of one or more elements from which to choose,
or a positive integer. See ‘Details.’
...
If ‘x’ has length 1, is numeric (in the sense of ‘is.numeric’) and
‘x >= 1’, sampling _via_ ‘sample’ takes place from ‘1:x’. _Note_
that this convenience feature may lead to undesired behaviour when
‘x’ is of varying length in calls such as ‘sample(x)’. See the
examples.
In other words, if x in sample(x) is of length 1, sample returns a random number from 1:x. This happens towards the end of your loop, where there is just one 0 left in your matrix and one index is returned by which(mat < 1).

The iteration repeats on level 99 because sample() behaves very differently when the first parameter is a vector of length 1 and when it is greater than 1. When it is length 1, it assumes you a random number from 1 to that number. When it has length >1, then you get a random number from that vector.
Compare
sample(c(99,100),1)
and
sample(c(100),1)
Of course, this is an inefficient way of filling your matrix. As #josilber pointed out, a single call to sample could do everything you need.

The issue comes from how sample and which do the sampling when you have only a single '0' value left.
For example, do this:
mat <- matrix(rep(1, 100), nrow = 10)
Now you have a matrix of all 1's. Now lets make two numbers 0:
mat[15]<-0
mat[18]<-0
and then sample
sample(which(mat<1))
[1] 18 15
by adding a size=1 argument you get one or the other
now lets try this:
mat[18]<-1
sample(which(mat<1))
[1] 3 13 8 2 4 14 11 9 10 5 15 7 1 12 6
Oops, you did not get [1] 15 . Instead what happens in only a single integer (15 in this case) is passed tosample. When you do sample(x) and x is an integer, it gives you a sample from 1:x with the integers in random order.

Splitting a number in R

In R I have a number, say 1293828893, called x.
I wish to split this number so as to remove the middle 4 digits 3828 and return them, pseudocode is as follows:
splitnum <- function(number){
#check number is 10 digits
if(nchar(number) != 10){
stop("number not of right size");
}
middlebits <- middle 4 digits of number
return(middlebits);
}
This is a pretty simple question but the only solutions I have found apply to character strings, rather than numeric ones.
If of interest, I am trying to create an implementation in R of the Middle-square method, but this step is particularly tricky.

You can use substr(). See its help page ?substr. In your function I would do:
splitnum <- function(number){
#check number is 10 digits
stopifnot(nchar(number) == 10)
as.numeric(substr(number, start = 4, stop = 7))
}
which gives:
> splitnum(1293828893)
[1] 3828
Remove the as.numeric(....) wrapping on the last line you want the digits as a string.

Just use integer division:
> x <- 1293828893
> (x %/% 1e3) %% 1e4
[1] 3828

Here's a function that completely avoids converting the number to a character
splitnum <- function(number){
#check number is 10 digits
if(trunc(log10(X))!=9) {
stop("number not of right size")
}
(number %/% 1e3) %% 1e4
}
splitnum(1293828893)
# [1] 3828