I want to build a data frame with the following columns:
value holds a label for a number n,
factorial holds factorial(n), computed by a recursive function,
previous.sum holds the sum of the factorials computed so far.
I wrote a recursive function that successfully generates the value and factorial columns, but I am still struggling with the sum column.
Thanks
R code below:
fact <- function(n, x){
if (n<=1){
return (n)
} else {
n*fact(n-1)
}
}
recurDf <- function(n, df){
if (n<=1){
df <- rbind (df, data.frame("value" = paste('Value', n) , "factorial" = n, "previous.sum" = 1) )
return (df)
} else {
if(is.null(df)) {
#df <- data.frame(matrix(ncol = 3, nrow = 0))
#colnames(df) <- c("value", "factorial", "previous.sum")
df <- data.frame("value"= 'va', "factorial" =0, "previous.sum" = 0)
}
rbind (recurDf(n-1,df), data.frame("value" = paste('Value', n) , "factorial" = fact(n), "previous.sum" = sum(recurDf(n-1,df)$factorial) ))
}
}
recurDf(4, NULL)
The following returns the factorial of n in its first component and the cumulative sum of all factorials up to n in its second component.
fact2 <- function(n) {
if (n <= 1) c(1,1)
else {
prev <- Recall(n-1)
n * prev[1] + c(0, prev[2])
}
}
fact2(1)
## [1] 1 1
fact2(2)
## [1] 2 3
fact2(3)
## [1] 6 9
fact2(4)
## [1] 24 33
cbind(1:4, t(sapply(1:4, fact2)))
## [,1] [,2] [,3]
## [1,] 1 1 1
## [2,] 2 2 3
## [3,] 3 6 9
## [4,] 4 24 33
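To get from that matrix to the data frame asked for in the question, the result could be wrapped as in the sketch below. The helper name fact_df is hypothetical; the column names are taken from the question's code.

```r
# fact2() as defined above: c(factorial(n), sum of factorials up to n)
fact2 <- function(n) {
  if (n <= 1) c(1, 1)
  else {
    prev <- Recall(n - 1)
    n * prev[1] + c(0, prev[2])
  }
}

# Hypothetical wrapper: one row per number 1..n
fact_df <- function(n) {
  m <- t(sapply(seq_len(n), fact2))   # n x 2 matrix
  data.frame(value = paste("Value", seq_len(n)),
             factorial = m[, 1],
             previous.sum = m[, 2])
}

fact_df(4)
```

Note that sapply() calls fact2() afresh for every row, so this recomputes the smaller factorials several times; for small n that hardly matters.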
Is there a reason you need to do this recursively?
There are much simpler ways to get to your answer.
recurDf <- function(n){
df <- data.frame("value" = c(paste('Value',1:n)) , "factorial" = c(1:n))
df$factorial <- factorial(df$factorial)
df$previous.sum <- cumsum(df$factorial)
return (df)
}
recurDf(4)
This returns
value factorial previous.sum
1 Value 1 1 1
2 Value 2 2 3
3 Value 3 6 9
4 Value 4 24 33
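If recursion is a requirement, a minimal sketch that recurses only once per level and reuses the last row of the partial result might look like this. The name recur_df is hypothetical, and the sketch assumes n >= 1.

```r
recur_df <- function(n) {
  if (n <= 1) {
    return(data.frame(value = "Value 1", factorial = 1, previous.sum = 1))
  }
  prev <- recur_df(n - 1)            # recurse once and keep the result
  last <- prev[nrow(prev), ]         # last row holds (n-1)! and the running sum
  f <- n * last$factorial
  rbind(prev,
        data.frame(value = paste("Value", n),
                   factorial = f,
                   previous.sum = last$previous.sum + f))
}

recur_df(4)
```

The attempt in the question calls recurDf(n - 1, df) twice per level, which both doubles the work at every level and sums over the placeholder row; keeping the partial result in a variable avoids both.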
I want to write a function that selects i and j from the column numbers of ret_Nifty such that i and j sum to 5.
optimize.portfolio(R = ret_Nifty[, i:j], portfolio = ObjSpec,
xxxxxx = "xxx",
search_size = 2000, trace = TRUE,
traceDE = 5, itermax = 50)
But I'm not able to write a function that selects both variables.
Maybe something like this. It calculates which pairs of column numbers add to 5, out of every combination:
get_ij <- function(df, adds_to){
max_col <- ncol(df)
x <- t(combn(1:max_col, 2))
x[rowSums(x) == adds_to, , drop = FALSE] # drop = FALSE keeps a matrix even if only one pair matches
}
get_ij(mtcars, 5)
#> [,1] [,2]
#> [1,] 1 4
#> [2,] 2 3
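Each row of that result can then be used for the i:j column selection. The sketch below only does the subsetting on mtcars; the call to optimize.portfolio() from the question is left out, and the names pairs and subsets are hypothetical.

```r
get_ij <- function(df, adds_to) {
  x <- t(combn(seq_len(ncol(df)), 2))
  x[rowSums(x) == adds_to, , drop = FALSE]
}

pairs <- get_ij(mtcars, 5)

# One column subset per (i, j) pair
subsets <- lapply(seq_len(nrow(pairs)), function(k) {
  i <- pairs[k, 1]
  j <- pairs[k, 2]
  mtcars[, i:j]
})

sapply(subsets, ncol)
```

Inside the lapply() body, mtcars[, i:j] would be replaced by whatever should be computed on each subset.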
I know how to generate 100 random numbers in R (without replacement):
random_numbers = sample.int(100, 100, replace = FALSE)
I was now curious about learning how to generate 100 "non random" numbers (without replacement). The first thing that comes to mind is to generate a random number, and then let the next number be the old number + 1 with a probability of 0.5 or an actual random number with probability 0.5. Thus, these numbers are not "fully random".
This was my attempt to write this code for numbers in a range of 0 to 100 (suppose I want to repeat this procedure 100 times):
library(dplyr)
all_games <- vector("list", 100)
for (i in 1:100){
index_i = i
guess_sets <- 1:100
prob_i = runif(n=1, min=1e-12, max=.9999999999)
guess_i = ifelse(prob_i> 0.5, sample.int(1, 100, replace = FALSE), guess_i + 1)
guess_sets_i <- setdiff(guess_sets_i, guess_i)
all_games_i = as.list(index_i, guess_i, all_games_i)
all_games[[i]] <- all_games_i
}
all_games <- do.call("rbind", all_games)
I tried to make a list that stores all guesses such that the range for the next guess automatically excludes numbers that have already been guessed, but I get this error:
Error in sample.int(1, 100, replace = FALSE) :
cannot take a sample larger than the population when 'replace = FALSE'
Ideally, I am trying to get the following results (format doesn't matter):
index_1 : 5,6,51,4,3,88,87,9 ...
index_2 77,78,79,2,65,3,1,99,100,4...
etc.
Can someone please show me how to do this? Are there easier ways in R to generate "non-random numbers"?
Thank you!
Note: I think an extra line of logic needs to be added - Suppose I guess the number 100, after guessing the number 100 I must guess a new random number since 100+1 is not included in the original range. Also, if I guess the number 5, 17 then 4 - and after guessing 4, the loop tells me to guess 4+1, this is impossible because 5 has already been guessed. In such a case, I would also have to guess a new random number?
It would be tricky to make your algorithm very efficient in R... it doesn't lend itself nicely to vectorization. Here's how I'd write it directly as a for loop:
semirandom = function(n) {
safe_sample = function(x, ...) {
if(length(x) == 1) return(x)
sample(x, ...)
}
result = numeric(n)
result[1] = sample.int(n, size = 1)
for(i in 2:length(result)) {
if(runif(1) < .5 &&
result[i - 1] < n &&
!((result[i - 1] + 1) %in% result)) {
result[i] = result[i - 1] + 1
} else {
result[i] = safe_sample(x = setdiff(1:n, result), size = 1)
}
}
result
}
# generate 10 semirandom numbers 5 times
replicate(semirandom(10), n = 5)
# [,1] [,2] [,3] [,4] [,5]
# [1,] 6 4 4 2 6
# [2,] 3 5 5 3 7
# [3,] 4 3 6 4 5
# [4,] 5 1 2 5 2
# [5,] 7 9 3 6 3
# [6,] 9 10 10 1 1
# [7,] 10 2 8 9 4
# [8,] 2 8 1 8 10
# [9,] 1 7 9 10 9
# [10,] 8 6 7 7 8
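The safe_sample() helper is there because sample(x, ...) treats a single number x as the range 1:x, which matters once only one candidate is left. A small sketch of that behaviour (the value 7 is arbitrary):

```r
set.seed(1)
x <- 7   # pretend 7 is the only number left to draw

# sample() interprets a single number as 1:x
draws <- replicate(200, sample(x, 1))
range(draws)

# the helper returns the single remaining value unchanged
safe_sample <- function(x, ...) {
  if (length(x) == 1) return(x)
  sample(x, ...)
}
safe_draws <- replicate(200, safe_sample(x, size = 1))
all(safe_draws == 7)
```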
You get the error cannot take a sample larger than the population when 'replace = FALSE' because you attempt to extract 100 values from a vector of length one without replacement.
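The argument order is what matters here: the first argument of sample.int() is the population size and the second is the sample size. A short sketch that captures the error message instead of stopping:

```r
# population of 1, sample of 100: fails without replacement
msg <- tryCatch(
  sample.int(1, 100, replace = FALSE),
  error = function(e) conditionMessage(e)
)
msg

# population of 100, sample of 1: a single number between 1 and 100
one_draw <- sample.int(100, 1)
one_draw
```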
The following draws each number between 1 and 100 exactly once. If the previous number + 1 has not been drawn yet (and the previous number is not 100), there is a 50 percent chance of drawing the previous number + 1 and a 50 percent chance of drawing another random number; otherwise another random number is always drawn.
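i <- sample.int(100, 1)
j <- i
for(x in 1:99) {
  rest <- (1:100)[-j] # numbers not drawn yet
  # index into rest: sample(rest, 1L) would draw from 1:rest once only one number is left
  if((i + 1L) %in% j) {
    i <- rest[sample.int(length(rest), 1L)]
  } else {
    if(runif(1L) > 0.5 || i == 100L) {
      i <- rest[sample.int(length(rest), 1L)]
    } else {
      i <- i + 1L
    }
  }
  j <- c(j, i)
}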
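To check the result, the same logic can be wrapped in a function and the output compared against 1:100. This is a sketch: the name draw_semirandom is hypothetical, and the random number is picked by index because sample(x, 1) on a single remaining number x would draw from 1:x.

```r
draw_semirandom <- function(n = 100L) {
  i <- sample.int(n, 1L)
  j <- i
  for (x in seq_len(n - 1L)) {
    rest <- seq_len(n)[-j]                       # numbers not drawn yet
    if ((i + 1L) %in% j || i == n || runif(1L) > 0.5) {
      i <- rest[sample.int(length(rest), 1L)]    # fresh random number
    } else {
      i <- i + 1L                                # previous number + 1
    }
    j <- c(j, i)
  }
  j
}

set.seed(42)
j <- draw_semirandom(100L)
identical(sort(j), 1:100)   # every number exactly once
mean(diff(j) == 1L)         # share of "previous + 1" steps
```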
When using reverse.code in R, the values in my ID column (which are not meant to be reversed) turn into NA once the ID value exceeds 999 (I have 10,110 observations).
Does anyone know if there is anything I can do to fix this?
Is there another function I can use to reverse these items without losing data?
Here is my code:
library(psych)
keys <- c(1,-1,-1,-1) #Where column 1 = ID and the rest are my variables to be reversed
rev_dat2 <- reverse.code(keys, rev_dat)
Thanks!
Here is the relevant line of the source code of reverse.code(), where new is the object holding the reverse-coded data:
new[abs(new) > 999] <- NA
As you can see, setting values whose absolute value exceeds 999 to missing is hard-coded into the routine. You could write a new version of the function that doesn't do that. For example, in the function below, we just use a much larger threshold:
my.reverse.code <- function (keys, items, mini = NULL, maxi = NULL)
{
if (is.vector(items)) {
nvar <- 1
}
else {
nvar <- dim(items)[2]
}
items <- as.matrix(items)
if (is.null(maxi)) {
colMax <- apply(items, 2, max, na.rm = TRUE)
}
else {
colMax <- maxi
}
if (is.null(mini)) {
colMin <- apply(items, 2, min, na.rm = TRUE)
}
else {
colMin <- mini
}
colAdj <- colMax + colMin
if (length(keys) < nvar) {
temp <- keys
if (is.character(temp))
temp <- match(temp, colnames(items))
keys <- rep(1, nvar)
keys[temp] <- -1
}
if (is.list(keys) | is.character(keys)) {
keys <- make.keys(items, keys)
keys <- diag(keys)
}
keys.d <- diag(keys, nvar, nvar)
items[is.na(items)] <- -99999999999
reversed <- items %*% keys.d
adj <- abs(keys * colAdj)
adj[keys > 0] <- 0
new <- t(adj + t(reversed))
new[abs(new) >= 99999999999] <- NA ## >= also restores NAs in non-reversed columns, where the sentinel is unchanged
colnames(new) <- colnames(items)
colnames(new)[keys < 0] <- paste(colnames(new)[keys < 0],
"-", sep = "")
return(new)
}
The reason they used a numeric threshold is that the recoding only works if all values are numeric. So, they set missing values to -999 and later turn them back into missing values. The same is done above, but with a much larger number.
keys <- c(1,-1,-1,-1) #Where column 1 = ID and the rest are my variables to be reversed
rev_dat <- data.frame(
id = 9998:10002,
x = 1:5,
y = 5:1,
z = 1:5
)
library(psych)
reverse.code(keys, rev_dat)
# id x- y- z-
# [1,] NA 5 1 5
# [2,] NA 4 2 4
# [3,] NA 3 3 3
# [4,] NA 2 4 2
# [5,] NA 1 5 1
my.reverse.code(keys, rev_dat)
# id x- y- z-
# [1,] 9998 5 1 5
# [2,] 9999 4 2 4
# [3,] 10000 3 3 3
# [4,] 10001 2 4 2
# [5,] 10002 1 5 1
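If the psych machinery is not needed, the same reversal can be sketched in base R by reversing only the chosen columns as (max + min) - x, which mirrors colAdj in the function above. The names rev_cols and out are hypothetical; no sentinel value is involved, so the id column is never touched and NAs simply stay NA.

```r
rev_dat <- data.frame(
  id = 9998:10002,
  x = 1:5,
  y = 5:1,
  z = 1:5
)

rev_cols <- c("x", "y", "z")   # columns to reverse; id is left alone

out <- rev_dat
out[rev_cols] <- lapply(rev_dat[rev_cols], function(col) {
  max(col, na.rm = TRUE) + min(col, na.rm = TRUE) - col
})
out
```

Unlike reverse.code(), this keeps a data frame and does not append "-" to the names of the reversed columns.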
If I have a vector such as x <-c(1,2,3,4,5,6,7,8,9), I want a function f such that
f(vector, index, num) takes the vector and gives me the num elements "closest" (by position) to the one at index
Examples:
f(x,3,4) = c(1,2,4,5)
f(x,1,5) = c(2,3,4,5,6)
f(x,8,3) = c(6,7,9)
Since there is also the issue where if we have an odd num, we will need to choose whether to pick left or right side by symmetry, let's go with choosing the left side (but right side is ok too)
i.e f(x,4,5) = c(1,2,3,5,6) and f(x,7,3) = c(5,6,8)
I hope my question is clear, thank you for any help/responses!
edit: The original vector of c(1:9) is arbitrary, the vector could be a vector of strings, or a vector of length 1000 with shuffled numbers with repeats etc.
i.e c(1,7,4,2,3,7,2,6,234,56,8)
num_closest_by_indices <- function(v, idx, num) {
# Try the base case, where idx is not within (num/2) of the edge
i <- abs(seq_along(v) - idx) # use the argument v, not a global x
i[idx] <- +Inf # sentinel
# If there are not enough elements in the base case, incrementally add more
for (cutoff_idx in seq(floor(num/2), num)) {
if (sum(i <= cutoff_idx) >= num) {
# This will add two extra indices every iteration. Strictly if we have an even length, we should add the leftmost one first and `continue`, to break ties towards the left.
return(v[i <= cutoff_idx])
}
}
}
Here's an illustration of this algorithm: we rank the indices in order of desirability, then pick the lowest num legal ones:
> seq_along(x)
1 2 3 4 5 6 7 8 9
> seq_along(x) - idx
-2 -1 0 1 2 3 4 5 6
> i <- abs(seq_along(x) - idx)
2 1 0 1 2 3 4 5 6
> i[idx] <- +Inf # sentinel to prevent us returning the element itself
2 1 Inf 1 2 3 4 5 6
Now we can just find num elements with smallest values (break ties arbitrarily, unless you have a preference (left)).
Our first guess is all indices <= (num/2) ; this might not be enough if index is within (num/2) of the start/end.
> i <= 2
TRUE TRUE FALSE TRUE TRUE FALSE FALSE FALSE FALSE
> v[i <= 2]
1 2 4 5
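The ranking idea can also be written with order(), which leaves ties in their original order and therefore prefers the left neighbour when two positions are equally far away. This is a sketch under that reading of the question; the name closest_by_order is hypothetical.

```r
closest_by_order <- function(v, idx, num) {
  d <- abs(seq_along(v) - idx)   # distance of every position from idx
  d[idx] <- Inf                  # sentinel: never return the element itself
  keep <- sort(order(d)[seq_len(num)])
  v[keep]
}

closest_by_order(1:9, 3, 4)
closest_by_order(1:9, 4, 5)
closest_by_order(1:9, 8, 3)
```

Because exactly num positions are taken from the ranking, this never returns extra elements when num is odd.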
So, adapting @dash2's code to handle the corner cases where some indices are illegal (nonpositive, or > length(x)), i.e. not %in% 1:L, where L is the length of the vector. At the left edge, 1 - min(elems) is the number of illegal indices which we cannot pick, hence we must pick that many more on the other side (and likewise at the right edge).
Notes:
in the end the code is simpler and faster to handle it by three piecewise cases. Aww.
it actually seems to simplify things if we pick (num+1) indices, then remove idx before returning the answer. Using result[-idx] to remove it.
Like so:
f <- function (vec, elem, n) {
elems <- seq(elem - ceiling(n/2), elem + floor(n/2))
if (max(elems) > length(vec)) elems <- elems - (max(elems) - length(vec))
if (elems[1] < 1) elems <- elems + (1 - elems[1])
elems <- setdiff(elems, elem)
vec[elems]
}
Giving results:
> f(1:9, 1, 5)
[1] 2 3 4 5 6
> f(1:9, 9, 5)
[1] 4 5 6 7 8
> f(1:9, 2, 5)
[1] 1 3 4 5 6
> f(1:9, 4, 5)
[1] 1 2 3 5 6
> f(1:9, 4, 4)
[1] 2 3 5 6
> f(1:9, 2, 4)
[1] 1 3 4 5
> f(1:9, 1, 4)
[1] 2 3 4 5
> f(1:9, 9, 4)
[1] 5 6 7 8
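Since f() only works with positions, it should behave the same for strings or shuffled values, as the question's edit asks. A quick sketch, assuming n is smaller than the length of the vector (the name shuffled is hypothetical):

```r
f <- function(vec, elem, n) {
  elems <- seq(elem - ceiling(n/2), elem + floor(n/2))
  if (max(elems) > length(vec)) elems <- elems - (max(elems) - length(vec))
  if (elems[1] < 1) elems <- elems + (1 - elems[1])
  elems <- setdiff(elems, elem)
  vec[elems]
}

# character vector
f(letters[1:9], 8, 3)

# shuffled numbers with repeats, taken from the question
shuffled <- c(1, 7, 4, 2, 3, 7, 2, 6, 234, 56, 8)
f(shuffled, 1, 5)
```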
Start a function with the variable argument x first, and the reference table and n after
.nearest_n <- function(x, table, n) {
The algorithm assumes that table is numeric, without any duplicates, and all values finite; n has to be less than or equal to the length of the table
## assert & setup
stopifnot(
is.numeric(table), !anyDuplicated(table), all(is.finite(table)),
n <= length(table)
)
Sort the table and then 'clamp' maximum and minimum values
## sort and clamp
table <- c(-Inf, sort(table), Inf)
len <- length(table)
Find the interval in table where x occurs; findInterval() uses an efficient search. Use the interval index as the initial lower index, and add 1 for the upper index, making sure to stay in-bounds.
## where to start?
lower <- findInterval(x, table)
upper <- min(lower + 1L, len)
Find the nearest n neighbors by comparing the lower and upper index distance to x, record the nearest value, and increment the lower or upper index as appropriate and making sure to stay in-bounds
## find
nearest <- numeric(n)
for (i in seq_len(n)) {
if (abs(x - table[lower]) < abs(x - table[upper])) {
nearest[i] = table[lower]
lower = max(1L, lower - 1L)
} else {
nearest[i] = table[upper]
upper = min(len, upper + 1L)
}
}
Then return the solution and finish the function
nearest
}
The code might seem verbose, but is actually relatively efficient because the only operations on the entire vector (sort(), findInterval()) are implemented efficiently in R.
A particular advantage of this approach is that it can be vectorized in its first argument, calculating the test for using lower (use_lower = ...) as a vector and using pmin() / pmax() as clamps.
.nearest_n <- function(x, table, n) {
## assert & setup
stopifnot(
is.numeric(table), !anyDuplicated(table), all(is.finite(table)),
n <= length(table)
)
## sort and clamp
table <- c(-Inf, sort(table), Inf)
len <- length(table)
## where to start?
lower <- findInterval(x, table)
upper <- pmin(lower + 1L, len)
## find
nearest <- matrix(0, nrow = length(x), ncol = n)
for (i in seq_len(n)) {
use_lower <- abs(x - table[lower]) < abs(x - table[upper])
nearest[,i] <- ifelse(use_lower, table[lower], table[upper])
lower[use_lower] <- pmax(1L, lower[use_lower] - 1L)
upper[!use_lower] <- pmin(len, upper[!use_lower] + 1L)
}
# return
nearest
}
For instance
> set.seed(123)
> table <- sample(100, 10)
> sort(table)
[1] 5 29 41 42 50 51 79 83 86 91
> .nearest_n(c(30, 20), table, 4)
[,1] [,2] [,3] [,4]
[1,] 29 41 42 50
[2,] 29 5 41 42
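A brute-force cross-check for a single x is to sort the table by distance and take the first n values. The sketch below uses the sorted table printed above rather than the sample() call, because the values drawn for a given seed depend on the R version; the names nearest_brute and tbl are hypothetical.

```r
nearest_brute <- function(x, table, n) {
  table <- sort(table)
  table[order(abs(table - x))[seq_len(n)]]
}

tbl <- c(5, 29, 41, 42, 50, 51, 79, 83, 86, 91)
nearest_brute(30, tbl, 4)
nearest_brute(20, tbl, 4)
```

This touches the whole table for every x, so it is only meant as a check. When two values are exactly equally distant it returns the smaller one first, whereas the loop above takes the upper one first.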
Generalize this by taking any argument and coercing it to the required form using a reference look-up table table0 and the indexes into it table1
nearest_n <- function(x, table, n) {
## coerce to common form
table0 <- sort(unique(c(x, table)))
x <- match(x, table0)
table1 <- match(table, table0)
## find nearest
m <- .nearest_n(x, table1, n)
## result in original form
matrix(table0[m], nrow = nrow(m))
}
As an example...
> set.seed(123)
> table <- sample(c(letters, LETTERS), 30)
> nearest_n(c("M", "Z"), table, 5)
[,1] [,2] [,3] [,4] [,5]
[1,] "o" "L" "O" "l" "P"
[2,] "Z" "z" "Y" "y" "w"
I have two data frames of same number of columns (but not rows) df1 and df2. For each row in df2, I was able to find the best (and second best) matching rows from df1 in terms of hamming distance, in my previous post. In that post, we have been using the following example data:
set.seed(0)
df1 <- as.data.frame(matrix(sample(1:10), ncol = 2)) ## 5 rows 2 cols
df2 <- as.data.frame(matrix(sample(1:6), ncol = 2)) ## 3 rows 2 cols
I now need to compute the number of bits equal to 1 for:
each row in df2
the best matching rows in df1
the second matching rows in df1
The number of bits equal to 1 of an integer a may be computed as
sum(as.integer(intToBits(a)))
And I have applied this to @ZheyuanLi's original function, so I have got item 1>. However, I'm unable to apply the same logic to get items 2> and 3> by a simple modification of @ZheyuanLi's function.
Below are the functions from @ZheyuanLi, with my modification:
hmd <- function(x,y) {
rawx <- intToBits(x)
rawy <- intToBits(y)
nx <- length(rawx)
ny <- length(rawy)
if (nx == ny) {
## quick return
return (sum(as.logical(xor(rawx,rawy))))
} else if (nx < ny) {
## pivoting
tmp <- rawx; rawx <- rawy; rawy <- tmp
tmp <- nx; nx <- ny; ny <- tmp
}
if (nx %% ny) stop("unconformable length!") else {
nc <- nx / ny ## number of cycles
return(unname(tapply(as.logical(xor(rawx,rawy)), rep(1:nc, each=ny), sum)))
}
}
foo <- function(df1, df2, p = 2) {
## check p
if (p > nrow(df1)) p <- nrow(df1) ## the p best matches are rows of df1
## transpose for CPU cache friendly code
xt <- t(as.matrix(df1))
yt <- t(as.matrix(df2))
## after transpose, we compute hamming distance column by column
## a for loop is decent; no performance gain from apply family
n <- ncol(yt)
id <- integer(n * p)
d <- numeric(n * p)
sb <- integer(n)
k <- 1:p
for (i in 1:n) {
set.bits <- sum(as.integer(intToBits(yt[,i])))
distance <- hmd(xt, yt[,i])
minp <- order(distance)[1:p]
id[k] <- minp
d[k] <- distance[minp]
sb[i] <- set.bits
k <- k + p
}
## recode "id", "d" and "sb" into data frame and return
id <- as.data.frame(matrix(id, ncol = p, byrow = TRUE))
colnames(id) <- paste0("min.", 1:p)
d <- as.data.frame(matrix(d, ncol = p, byrow = TRUE))
colnames(d) <- paste0("mindist.", 1:p)
sb <- as.data.frame(matrix(sb, ncol = 1)) ## no need for byrow as you have only 1 column
colnames(sb) <- "set.bits.1"
list(id = id, d = d, sb = sb)
}
Running these gives:
> foo(df1, df2)
$id
min.1 min.2 ## row id for best/second best match in df1
1 1 4
2 2 3
3 5 2
$d
mindist.1 mindist.2 ## minimum 2 hamming distance
1 2 2
2 1 3
3 1 3
$sb
set.bits.1 ## number of bits equal to 1 for each row of df2
1 3
2 2
3 4
OK, after reading through and re-editing your question (many times!), I think I know what you want. Essentially, we do not need to change anything in hmd(). Your required items 1>, 2>, 3> can all be computed after the for loop in foo().
To get item 1>, which you called sb, we can use a tapply(). However, your computation of sb along the for loop is fine, so I will not change it. In the following, I will demonstrate the basic procedure to get item 2> and item 3>.
The id vector inside foo() stores all matching rows in df1:
id <- c(1, 4, 2, 3, 5, 2)
so we can simply extract those rows of df1 (actually, columns of xt) to compute the number of bits equal to 1. As you can see, there are many duplicates in id, so we only need to compute on unique(id):
id0 <- sort(unique(id))
## [1] 1 2 3 4 5
We now extract those subset columns of xt:
sub_xt <- xt[, id0]
## [,1] [,2] [,3] [,4] [,5]
## V1 9 3 10 5 6
## V2 2 4 8 7 1
To compute the number of bits equal to 1 for each column of sub_xt, we again use tapply() and a vectorized approach.
rawbits <- as.integer(intToBits(as.numeric(sub_xt))) ## convert sub_xt to binary
sbxt0 <- unname(tapply(X = rawbits,
INDEX = rep(1:length(id0), each = length(rawbits) / length(id0)),
FUN = sum))
## [1] 3 3 3 5 3
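The grouping done by tapply() can be cross-checked with a plain matrix reshape, since intToBits() produces 32 bits per integer. The sketch below rebuilds sub_xt by hand from the values printed above; the names bits, per_value and per_column are hypothetical.

```r
# sub_xt as printed above (2 rows, 5 columns), filled column by column
sub_xt <- matrix(c(9, 2, 3, 4, 10, 8, 5, 7, 6, 1), nrow = 2)

# one column of 32 bits per integer
bits <- matrix(as.integer(intToBits(as.integer(sub_xt))), nrow = 32)
per_value <- colSums(bits)                 # set bits of each single integer

# add up the integers belonging to the same column of sub_xt
per_column <- colSums(matrix(per_value, nrow = nrow(sub_xt)))
per_column
```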
Now we need to map sbxt0 to sbxt:
sbxt <- sbxt0[match(id, id0)]
## [1] 3 5 3 3 3 3
Then we can convert sbxt to a data frame sb1:
sb1 <- as.data.frame(matrix(sbxt, ncol = p, byrow = TRUE))
colnames(sb1) <- paste(paste0("min.", 1:p), "set.bits.1", sep = ".")
## min.1.set.bits.1 min.2.set.bits.1
## 1 3 5
## 2 3 3
## 3 3 3
Finally, we can put everything together:
foo <- function(df1, df2, p = 2) {
## check p
if (p > nrow(df1)) p <- nrow(df1) ## the p best matches are rows of df1
## transpose for CPU cache friendly code
xt <- t(as.matrix(df1))
yt <- t(as.matrix(df2))
## after transpose, we compute hamming distance column by column
## a for loop is decent; no performance gain from apply family
n <- ncol(yt)
id <- integer(n * p)
d <- numeric(n * p)
sb2 <- integer(n)
k <- 1:p
for (i in 1:n) {
set.bits <- sum(as.integer(intToBits(yt[,i])))
distance <- hmd(xt, yt[,i])
minp <- order(distance)[1:p]
id[k] <- minp
d[k] <- distance[minp]
sb2[i] <- set.bits
k <- k + p
}
## compute "sb1"
id0 <- sort(unique(id))
sub_xt <- xt[, id0]
rawbits <- as.integer(intToBits(as.numeric(sub_xt))) ## convert sub_xt to binary
sbxt0 <- unname(tapply(X = rawbits,
INDEX = rep(1:length(id0), each = length(rawbits) / length(id0)),
FUN = sum))
sbxt <- sbxt0[match(id, id0)]
sb1 <- as.data.frame(matrix(sbxt, ncol = p, byrow = TRUE))
colnames(sb1) <- paste(paste0("min.", 1:p), "set.bits.1", sep = ".")
## recode "id", "d" and "sb2" into data frame and return
id <- as.data.frame(matrix(id, ncol = p, byrow = TRUE))
colnames(id) <- paste0("min.", 1:p)
d <- as.data.frame(matrix(d, ncol = p, byrow = TRUE))
colnames(d) <- paste0("mindist.", 1:p)
sb2 <- as.data.frame(matrix(sb2, ncol = 1)) ## no need for byrow as you have only 1 column
colnames(sb2) <- "set.bits.1"
list(id = id, d = d, sb1 = sb1, sb2 = sb2)
}
Now, running foo(df1, df2) gives:
> foo(df1,df2)
$id
min.1 min.2
1 1 4
2 2 3
3 5 2
$d
mindist.1 mindist.2
1 2 2
2 1 3
3 1 3
$sb1
min.1.set.bits.1 min.2.set.bits.1
1 3 5
2 3 3
3 3 3
$sb2
set.bits.1
1 3
2 2
3 4
Note that I have renamed the sb you used to sb2.