Represent a numeric vector as a set of characters using :

In R, you can define an arbitrary integer sequence using :, e.g.
a = c(1:3, 12:14)
print(a)
## 1 2 3 12 13 14
I'm looking for a way to do the inverse operation, e.g. given a vector of integers I want to produce a character (or character vector) that collapses the integer sequence(s) to the equivalent expressions using :, e.g.
some_function (a)
## "1:3" "12:14"
Bonus if the stride can be detected, e.g. c(1, 3, 5) becomes "1:2:5" or something like that.
Motivation: generate an integer sequence in R based on some data manipulation to identify database row selection, and pass the most concise representation of that sequence to an external program in the proper format.

We can take the rle of the differences and paste each range together, taking the stride (the distance between consecutive values) into account.
fun=function(s){
m=c(0,diff(s))
b=rle(m)
b$values[b$lengths==1&b$values!=1]=0
l=cumsum(!inverse.rle(b))
d=function(x)paste0(range(x[,1]),
collapse = paste0(":",unique(x[-1,-1]),":"))
f=c(by(cbind(s,m),l,d))
sub("::.*","",sub(":1:",":",f))
}
fun(c(1,1:3,12:14,c(1,3,5)))
1 2 3 4
"1" "1:3" "12:14" "1:2:5"
fun(c(1, 3, 5, 8:10, 14, 17, 20))
1 2 3
"1:2:5" "8:10" "14:3:20"
fun(1)
1
"1"

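For readers who find the rle-based one-liner style hard to follow, the same idea can be sketched as an explicit greedy scan. This is only an illustration, not the answer's method: the name collapse_runs is made up here, and the rule that a run needs at least three values (or two consecutive integers) before it is collapsed is an assumption chosen to reproduce the outputs shown above.

```r
# Hypothetical helper: collapse runs with a constant stride into
# "start:end" (stride 1) or "start:stride:end" strings.
collapse_runs <- function(x) {
  out <- character(0)
  n <- length(x)
  i <- 1
  while (i <= n) {
    j <- i
    if (i < n) {
      step <- x[i + 1] - x[i]
      j <- i + 1
      # extend the run while the difference stays the same
      while (j < n && x[j + 1] - x[j] == step) j <- j + 1
    }
    len <- j - i + 1
    if (len >= 3 || (len == 2 && step == 1)) {
      piece <- if (step == 1) paste(x[i], x[j], sep = ":")
               else paste(x[i], step, x[j], sep = ":")
      out <- c(out, piece)
      i <- j + 1
    } else {
      # isolated value: emit it on its own and move one step forward
      out <- c(out, as.character(x[i]))
      i <- i + 1
    }
  }
  out
}

collapse_runs(c(1, 3, 5, 8:10, 14, 17, 20))
# "1:2:5" "8:10" "14:3:20"
```

The loop is slower than the vectorized version for long inputs, but each decision (extend the run, emit a range, emit a single value) is visible on its own line.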
Ah, nerd heaven. Here's a first shot. You could even use this for encoding within R.
Needs testing; code always prints the stride out.
encode_ranges <- function (x) {
rle_diff <- list(
start = x[1],
rled = rle(diff(x))
)
class(rle_diff) <- "rle_diff"
rle_diff
}
decode_ranges <- function (x) {
stopifnot(inherits(x, "rle_diff"))
cumsum(c(x$start, inverse.rle(x$rled)))
}
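format.rle_diff <- function (x, ...) {
  stopifnot(inherits(x, "rle_diff"))
  output <- character(length(x$rled$values))
  start <- x$start
  # initialise end so that a leading run of length 1 does not fail
  end <- x$start
  for (j in seq_along(x$rled$values)) {
    stride <- x$rled$values[j]
    len <- x$rled$lengths[j]
    if (len == 1L) {
      # a single difference is a jump to the next range, not a range itself
      start <- end + stride
      next
    }
    end <- start + stride * len
    # printed as start:end:stride
    output[j] <- paste(start, end, stride, sep = ":")
    # a directly following run continues from this end point
    start <- end
  }
  output <- output[nchar(output) > 0]
  paste(output, collapse = ", ")
}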
print.rle_diff <- function (x, ...) cat(format(x, ...))
encode_ranges(c(1:3, 12:14))
encode_ranges(c(1, 3, 5, 8:10, 14, 17, 20))
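To see why the first value plus the run-length-encoded differences is enough to rebuild the vector, here is a minimal round trip in base R. The variable names are only for illustration.

```r
x <- c(1, 3, 5, 8:10, 14, 17, 20)

# differences between neighbours, compressed into runs
runs <- rle(diff(x))
runs$values   # 2 3 1 4 3  (strides and gaps)
runs$lengths  # 2 1 2 1 2  (how often each one repeats)

# undo the encoding: expand the runs and accumulate from the first value
restored <- cumsum(c(x[1], inverse.rle(runs)))
stopifnot(identical(restored, x))
```

Runs of length 1 are the jumps between ranges, which is why the formatter treats them differently from longer runs.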

We create a grouping variable with diff and cumsum, then use one of the group-by functions (here tapply) to paste the range of values in each group.
f1 <- function(vec) {
unname(tapply(vec, cumsum(c(TRUE, diff(vec) != 1)),
FUN = function(x) paste(range(x), collapse=":")))
}
f1(a)
#[1] "1:3" "12:14"
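It may help to look at the grouping variable on its own. The sketch below just prints the intermediate steps for the example vector; the names starts_new_run and grp are made up for illustration.

```r
a <- c(1:3, 12:14)

# TRUE wherever a value does not directly follow its predecessor
starts_new_run <- c(TRUE, diff(a) != 1)  # TRUE FALSE FALSE TRUE FALSE FALSE

# the running count of TRUEs labels each consecutive block
grp <- cumsum(starts_new_run)            # 1 1 1 2 2 2

# each group holds one range; range() of a group gives its end points
split(a, grp)
```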
For the second case
b <- c(1, 3, 5)
un1 <- unique(diff(b))
paste(b[1], un1, b[length(b)], sep=":")
#[1] "1:2:5"
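If the strings ever need to be turned back into numbers inside R, a small parser is enough. This is a sketch that assumes the "start:end" and "start:stride:end" layout used in the question; expand_range and expand_ranges are hypothetical names, not existing functions.

```r
# Hypothetical helper: expand one "a", "a:b" or "a:stride:b" string
expand_range <- function(s) {
  p <- as.numeric(strsplit(s, ":", fixed = TRUE)[[1]])
  switch(length(p),
         p,                            # single value
         seq(p[1], p[2]),              # start:end
         seq(p[1], p[3], by = p[2]))   # start:stride:end
}

expand_ranges <- function(strings) {
  as.numeric(unlist(lapply(strings, expand_range)))
}

expand_ranges(c("1:3", "12:14", "1:2:5"))
# 1 2 3 12 13 14 1 3 5
```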

Related

R Sort or order with custom compare function

Can I pass a custom compare function to order that, given two items, indicates which one is ranked higher?
In my specific case I have the following list.
scores <- list(
'a' = c(1, 1, 2, 3, 4, 4),
'b' = c(1, 2, 2, 2, 3, 4),
'c' = c(1, 1, 2, 2, 3, 4),
'd' = c(1, 2, 3, 3, 3, 4)
)
If we take two vectors a and b, the index of the first element i at which a[i] > b[i] or a[i] < b[i] should determine what vector comes first. In this example, scores[['d']] > scores[['a']] because scores[['d']][2] > scores[['a']][2] (note that it doesn't matter that scores[['d']][5] < scores[['a']][5]).
Comparing two of those vectors could look something like this.
compare <- function(a, b) {
# get first element index at which vectors differ
i <- which.max(a != b)
if(a[i] > b[i])
1
else if(a[i] < b[i])
-1
else
0
}
The sorted keys of scores by using this comparison function should then be d, b, a, c.
From other solutions I've found, they mess with the data before ordering or introduce S3 classes and apply comparison attributes. With the former I fail to see how to mess with my data (maybe turn it into strings? But then what about numbers above 9?), with the latter I feel uncomfortable introducing a new class into my R package only for comparing vectors. And there doesn't seem to be a sort of comparator parameter I'd want to pass to order.
Here's an attempt. I've explained every step in the comments.
compare <- function(a, b) {
  # subtract vector b from vector a
  comparison <- a - b
  # get the first non-zero result (NA if the vectors are equal)
  result <- comparison[comparison != 0][1]
  # return 1 if a wins (result > 0), 2 if b wins (result < 0), 0 if equal
  if(is.na(result)) {return(0)} else if(result > 0) {return(1)} else {return(2)}
}
compare_list <- function(list_) {
# get combinations of all possible comparison
comparisons <- combn(length(list_), 2)
# compare all possibilities
results <- apply(comparisons, 2, function(x) {
# get the "winner"
x[compare(list_[[x[1]]], list_[[x[2]]])]
})
# get frequency table (how often a vector "won" -> this is the result you want)
fr_tab <- table(results)
# vector that is last in comparison
last_vector <- which(!(1:length(list_) %in% as.numeric(names(fr_tab))))
# return the sorted results and add the last vectors name
c(as.numeric(names(sort(fr_tab, decreasing = T))), last_vector)
}
If you run the function on your example, the result is
> compare_list(scores)
[1] 4 2 1 3
I haven't dealt with the case where two vectors are identical, since you haven't explained how to handle it.
The native R way to do this is to introduce an S3 class.
There are two things you can do with the class. You can define a method for xtfrm that converts your list entries to numbers. That could be vectorized, and conceivably could be really fast.
But you were asking for a user defined compare function. This is going to be slow because R function calls are slow, and it's a little clumsy because nobody does it. But following the instructions in the xtfrm help page, here's how to do it:
scores <- list(
'a' = c(1, 1, 2, 3, 4, 4),
'b' = c(1, 2, 2, 2, 3, 4),
'c' = c(1, 1, 2, 2, 3, 4),
'd' = c(1, 2, 3, 3, 3, 4)
)
# Add a class to the list
scores <- structure(scores, class = "lexico")
# Need to keep the class when subsetting
`[.lexico` <- function(x, i, ...) structure(unclass(x)[i], class = "lexico")
# Careful here: identical() might be too strict
`==.lexico` <- function(a, b) {identical(a, b)}
`>.lexico` <- function(a, b) {
a <- a[[1]]
b <- b[[1]]
i <- which(a != b)
length(i) > 0 && a[i[1]] > b[i[1]]
}
is.na.lexico <- function(a) FALSE
sort(scores)
#> $c
#> [1] 1 1 2 2 3 4
#>
#> $a
#> [1] 1 1 2 3 4 4
#>
#> $b
#> [1] 1 2 2 2 3 4
#>
#> $d
#> [1] 1 2 3 3 3 4
#>
#> attr(,"class")
#> [1] "lexico"
Created on 2021-11-27 by the reprex package (v2.0.1)
This is the opposite of the order you asked for, because by default sort() sorts in increasing order. If you really want d, b, a, c, use sort(scores, decreasing = TRUE).
Here's another, very simple solution:
sort(sapply(scores, function(x) as.numeric(paste(x, collapse = ""))), decreasing = T)
What it does is take each of the vectors, "compress" it into a single number, and then sort those numbers in decreasing order. Note that this only works if every element is a single digit and all vectors have the same length.
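Another option that avoids both a new class and string tricks is to let order() do the lexicographic comparison column by column. This is a sketch that assumes all vectors have the same length; mat and ord are illustrative names.

```r
scores <- list(
  'a' = c(1, 1, 2, 3, 4, 4),
  'b' = c(1, 2, 2, 2, 3, 4),
  'c' = c(1, 1, 2, 2, 3, 4),
  'd' = c(1, 2, 3, 3, 3, 4)
)

# one row per vector, one column per position
mat <- do.call(rbind, scores)

# order() breaks ties in the first column with the second, and so on,
# which is exactly a lexicographic comparison
ord <- do.call(order, c(as.data.frame(mat), decreasing = TRUE))

names(scores)[ord]
# "d" "b" "a" "c"
```

Because the comparison is done on the numeric columns, values above 9 are not a problem here.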

print a parameter's value, including its type

I have a variable, let's call it artichoke, that can be either a sequence or a vector, for example
artichoke <- seq(-5,5)
or
artichoke <- c(-5,5)
I want to print the value and type of this variable so that I would get:
seq(-5,5)
in the first case and
c(-5,5)
in the second.
How can I do this?
There is a function called dput() in R that turns out to be cleverer than I gave it credit for ...
artichoke <- c(-5,5)
dput(artichoke, file=textConnection("art1","w"))
artichoke <- seq(-5,5)
dput(artichoke, file=textConnection("art2","w"))
Results:
art1
## [1] "c(-5, 5)"
art2
## [1] "-5:5"
R chooses to use the : operator rather than seq(). You could use something like
sub("(-?[0-9]+):(-?[0-9]+)","seq(\\1,\\2)",art2)
to convert the result from -5:5 to seq(-5,5) if you wanted.
If the sequence is more complex dput() is not as clever/doesn't recognize it as a sequence:
artichoke <- seq(-5,10,by=2)
dput(artichoke)
## c(-5, -3, -1, 1, 3, 5, 7, 9)
If you wanted to recognize this or other more complex cases you'd probably have to write your own rules ...
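As a rough idea of what such a rule could look like, the sketch below checks for a constant difference and falls back to deparse() otherwise. describe_seq is a hypothetical name, and the exact comparison of differences is an assumption that suits integer-valued steps; fractional steps would need a tolerance.

```r
# Hypothetical helper: describe x as a seq() call when it has a constant step
describe_seq <- function(x) {
  d <- diff(x)
  if (length(x) >= 3 && length(unique(d)) == 1) {
    if (d[1] == 1) {
      sprintf("seq(%s, %s)", x[1], x[length(x)])
    } else {
      sprintf("seq(%s, %s, by = %s)", x[1], x[length(x)], d[1])
    }
  } else {
    # no constant step detected: fall back to R's own representation
    paste(deparse(x), collapse = "")
  }
}

describe_seq(seq(-5, 10, by = 2))  # "seq(-5, 9, by = 2)"
describe_seq(c(-5, 5))             # "c(-5, 5)"
```

Note that the recovered end point is the last value actually present (9), not the upper limit originally passed to seq() (10), because that information is not stored in the vector.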
If we want to delay the execution, the call can be wrapped in quote()
artichoke <- quote(seq(-5, 5))
and then eval to return the evaluated output
eval(artichoke)
#[1] -5 -4 -3 -2 -1 0 1 2 3 4 5
artichoke
#seq(-5, 5)
Another option is to create a function, extract the arguments with match.call(), print it while returning the evaluated expression, and assign it to an object
f1 <- function(expr) {
print(as.list(match.call()[-1])$expr)
expr
}
artichoke <- f1(seq(-5, 5))
#seq(-5, 5)
artichoke <- f1(c(-5, 5))
#c(-5, 5)
If we need it quoted, then wrap with deparse
f1 <- function(expr) {
print(deparse(as.list(match.call()[-1])$expr))
expr
}
artichoke <- f1(seq(-5, 5))
#[1] "seq(-5, 5)"
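A shorter way to capture the unevaluated argument is substitute(), which is what match.call() is doing for us above. The sketch below returns both the code and the value instead of printing; show_call is a made-up name.

```r
# Hypothetical helper: keep the expression text next to its value
show_call <- function(expr) {
  code <- deparse(substitute(expr))  # the argument as typed by the caller
  list(code = code, value = expr)    # expr is evaluated only here
}

res <- show_call(seq(-5, 5))
res$code   # "seq(-5, 5)"
res$value  # the evaluated sequence from -5 to 5
```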

R: How to create a df with columns of random strings

I have a df, and I would like one of the columns to contain a string of 5 random values between 1 and 100:
expected_df <- data.frame("x" = stri_rand_strings(4, 8), "y" = round(runif(4, 13, 272)), z =(c('2 3 50 17 9', '10 3 5 100 22', '86 30 74 10 27', '6 33 4 19 66')))
I have tried to create a function that repeat '1-100' 5 times, however it repeats the same 5 numbers for each row in the df
rand_str<- function() {
x = c(sample(1:100, 5, replace = FALSE))
return(paste0(x,collapse = " "))
}
df <- data.frame("x" = stri_rand_strings(4, 8), "y" = round(runif(4, 13, 272)), z =rep(rand_str(),4))
I have tried to add rep(rand_str(),4), however it doesn't solve the problem.
How can I create 4 rows with 5 different digits in each?
Thanks in advance!
The function you're looking for is replicate. With replicate, you can use your original rand_str() function like this:
replicate(4, rand_str())
Alternatively, you can rewrite your rand_str() function like this:
rand_str <- function(n) replicate(n, paste(sample(100, 5, FALSE), collapse=" "))
Demo:
set.seed(1) # So you can replicate these results
rand_str(4)
# [1] "27 37 57 89 20" "90 94 65 62 6" "21 18 68 38 74" "50 72 98 37 75"
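The reason rep() did not work is that its argument is evaluated once and the resulting string is then copied, while replicate() re-evaluates the expression each time. A small sketch of the difference (the seed and the names once and fresh are arbitrary):

```r
rand_str <- function() paste(sample(100, 5), collapse = " ")

set.seed(42)  # arbitrary seed, only for reproducibility

# rand_str() runs a single time; rep() copies that one string
once <- rep(rand_str(), 4)
length(unique(once))  # 1

# rand_str() runs four times, once per element
fresh <- replicate(4, rand_str())
length(fresh)         # 4
```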
For reference, if you are going to use a for loop, either of the following approaches would perform more efficiently than Steffen's answer, which grows a vector with each iteration of the loop. In R, you should pre-allocate space to store the results of your loops. When possible, specifying the storage mode (for example, specifying when a character or integer is expected in the results) will help improve the function's efficiency.
This option creates an empty character vector of the required length before the loop, and each iteration of the loop replaces the empty vector at the given position with the pasted result of the sample.
rand_str <- function(n) {
returnvalue <- character(n)
for (i in 1:n) {
returnvalue[i] <- paste0(sample(1:100, 5, replace = FALSE), collapse = " ")
}
returnvalue
}
This option creates an empty matrix where each row stores the results of the samples. Once the matrix has been filled, it gets pasted together using the do.call(paste, ...) idiom commonly used to paste together rows of a data.frame.
rand_str <- function(n) {
m <- matrix(NA_integer_, ncol = 5, nrow = n)
for (i in seq.int(n)) {
m[i, ] <- sample(100, 5, FALSE)
}
do.call(paste, data.frame(m))
}
How about this?
rand_str <- function(n) {
returnvalue <- c()
for (i in 1:n) {
x = c(sample(1:100, 5, replace = FALSE))
returnvalue <- c(returnvalue, paste0(x, collapse = " "))
}
returnvalue
}
df <- data.frame("x" = stri_rand_strings(4, 8), "y" = round(runif(4, 13, 272)), z =rand_str(4))

How to make a generalized function update the value of a vector?

I have been trying to write a generalized function that multiplies each value in each row of a matrix by the corresponding value of a vector in terms of their position (i.e. matrix[1,1]*vector[1], matrix[1,2]*vector[2], etc) and then sum them together. It is important to note that the lengths of the vector and the rows of the matrix are always the same, which means that in each row the first value of the vector is multiplied with the first value of the matrix row. Also important to note, I think, is that the rows and columns of the matrix are of equal length. The end sum for each row should be assigned to different existing vector, the length of which is equal to the number of rows.
This is the matrix and vector:
a <- c(4, -9, 2, -1)
b <- c(-1, 3, -8, 2)
c <- c(5, 2, 6, 3)
d <- c(7, 9, -2, 5)
matrix <- cbind(a,b,c,d)
a b c d
[1,] 4 -1 5 7
[2,] -9 3 2 9
[3,] 2 -8 6 -2
[4,] -1 2 3 5
vector <- c(1, 2, 3, 4)
These are the basic functions that I have to generalize for the rows and columns of a matrix and a vector of length "n":
f.1 <- function() {
(matrix[1,1]*vector[1]
+ matrix[1,2]*vector[2]
+ matrix[1,3]*vector[3]
+ matrix[1,4]*vector[4])
}
f.2 <- function() {
(matrix[2,1]*vector[1]
+ matrix[2,2]*vector[2]
+ matrix[2,3]*vector[3]
+ matrix[2,4]*vector[4])
}
and so on...
This is the function I have written:
ncells = 4
f = function(x) {
i = x
result = 0
for(j in 1:ncells) {
result = result + vector[j] * matrix[i][j]
}
return(result)
}
Calling the function:
result.cell = function() {
for(i in 1:ncells) {
new.vector[i] = f(i)
}
}
The vector to which this result should be assigned (i.e. new.vector) has been defined beforehand:
new.vector <- c()
I expected that the end sum for each row would be assigned to the vector in a corresponding manner (e.g. if the sums for all rows were 1, 2, 3, 4, etc., then new.vector would be (1, 2, 3, 4, etc.)), but it did not happen.
(Edit) When I do this with the basic functions, the assignment works:
new.vector[1] <- f.1()
new.vector[2] <- f.2()
This does not however work with the generalized function:
new.vector[1:ncells] <- result cell[1:ncells]
(End Edit)
I have also tried setting the length of new.vector to be equal to ncells, but I don't think it did any good:
length(new.vector) = ncells
My question is how can I make the new vector take the resulting sums of the multiplied elements of a row of a matrix by the corresponding value of a vector.
I hope I have been clear and thanks in advance!
There is no need for a loop here. We can use R's vectorized arithmetic to multiply each row by the vector and then sum the rows with rowSums. Note that m and v are used as names for the matrix and vector to avoid conflict with those function names.
nr <- nrow(m)
rowSums(m * matrix(rep(v, nr), nr, byrow = TRUE))
# [1] 45 39 -4 32
However, if the vector v is always going to be the column number, we can simply use the col function as our multiplier.
rowSums(m * col(m))
# [1] 45 39 -4 32
Data:
a <- c(4, -9, 2, -1)
b <- c(-1, 3, -8, 2)
c <- c(5, 2, 6, 3)
d <- c(7, 9, -2, 5)
m <- cbind(a, b, c, d)
v <- 1:4
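The same row sums are also what a matrix-vector product computes, so %*% is another option. The sketch below shows that, plus a corrected version of the loop idea from the question: the element must be indexed as m[i, j] (or the whole row as m[i, ]), not m[i][j], and the function should return its result rather than assign to an outer vector. row_dot is an illustrative name.

```r
a <- c(4, -9, 2, -1)
b <- c(-1, 3, -8, 2)
c <- c(5, 2, 6, 3)
d <- c(7, 9, -2, 5)
m <- cbind(a, b, c, d)
v <- c(1, 2, 3, 4)

# matrix product: each row of m dotted with v
res <- drop(m %*% v)
res  # 45 39 -4 32

# loop-style alternative: one dot product per row, collected by vapply
row_dot <- function(i) sum(m[i, ] * v)
new.vector <- vapply(seq_len(nrow(m)), row_dot, numeric(1))
```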

Find all positions of all matches of one vector of values in second vector

I need to find all positions in my vector corresponding to any of values of another vector:
needles <- c(4, 3, 9)
hay <- c(2, 3, 4, 5, 3, 7)
mymatches(needles, hay) # should give vector: 2 3 5
Is there any predefined function allowing to do this?
This should work:
which(hay %in% needles) # 2 3 5
R already has the match() function and the %in% operator (which is built on match()), and both are vectorized. Your solution:
which(!is.na(match(hay, needles)))
[1] 2 3 5
or the shorter syntax which(hay %in% needles) as @jalapic showed.
With match(), if you wanted to, you could see which specific value was matched at each position...
match(hay, needles)
[1] NA 2 1 NA 2 NA
or just a logical vector of where the matches occurred:
!is.na(match(hay, needles))
[1] FALSE TRUE TRUE FALSE TRUE FALSE
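If you need the positions separately for each needle rather than one combined vector, a plain lapply over the needles does it. This is a sketch in base R; pos is an illustrative name.

```r
needles <- c(4, 3, 9)
hay <- c(2, 3, 4, 5, 3, 7)

# one element per needle, holding every position where it occurs in hay
pos <- setNames(lapply(needles, function(n) which(hay == n)), needles)

pos[["3"]]    # 2 5
pos[["4"]]    # 3
lengths(pos)  # a needle that never occurs (9) gets a count of 0

# flattening and sorting gives the combined answer again
sort(unlist(pos, use.names = FALSE))  # 2 3 5
```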
If you want to match a single integer against an integer vector, the following code is about twice as fast for longer integer vectors:
library(microbenchmark)
library(parallel)
library(Rcpp)
library(RcppArmadillo)
cppFunction(depends = "RcppArmadillo",
'std::vector<double> findMatches(const int &x, const arma::ivec &y) {
arma::uvec temp = arma::find(y == x) + 1;
return as<std::vector<double>>(wrap(temp));
}')
x <- 1L
y <- as.integer(1:1e8)
microbenchmark(findMatches(x, y))
microbenchmark(which(y %in% x))
To match all elements of a vector we can do:
needles <- c(4, 3, 9)
hay <- c(2, 3, 4, 5, 3, 7)
unlist(lapply(FUN = findMatches, X = needles, y=hay))
# The same thing in parallel
unlist(mclapply(FUN = findMatches, X = needles, y=hay))
Benchmarking:
# on a 8 core server
hay <- as.integer(1:1e7)
needles <- sample(hay, 10)
microbenchmark(which(hay %in% needles)) # 74 milliseconds
microbenchmark(unlist(lapply(FUN = findMatches, X = needles, y=hay))) # 44 milliseconds
microbenchmark(unlist(mclapply(FUN = findMatches, X = needles, y=hay))) # 46 milliseconds
Running this in parallel will only be faster if each embarrassingly parallel task is long enough to make it worth the overhead. In this example that does not seem to be the case.
