List Appending with outputs from machine learning function - r

Please excuse the title; I couldn't find a better phrase to describe my question.
I'm running the cluster stability analysis function bootFlexclust() from the 'flexclust' package. It runs bootstrap sampling on a dataset and calculates the Rand index for each value of k in a range that I specify.
The function supports multiple distance metrics and clustering methods. I want to run it for every distance/method combination and find the best k for each one, scoring every k by the mean plus the median of its Rand indices.
I've written nested for loops that initialize a vector for each column (name, distance metric, method, and best k) and then call data.frame() to stitch them together.
###############################################################################################
df = data.frame(matrix(rbinom(10*100, 1, .5), ncol=4)) #random df for testing purpose
cl_stability <- function(df, df.name, k_low, k_high)
{
cluster.distance = c("euclidean","manhattan")
cluster.method = c("kmeans","hardcl","neuralgas")
for (dist in cluster.distance)
{
for (method in cluster.method)
{
j = 1
while (j <= length(cluster.distance)*length(cluster.method))
{
df.names = rep(c(df.name),length(cluster.distance)*length(cluster.method))
distances = c()
methods = c()
best.k.s = c()
ip = as.data.frame((bootFlexclust(df, k = k_low:k_high, multicore = TRUE,
FUN = "cclust", dist = d, method = m))@rand)
best_k = names(which.max(apply(ip, 2, mean) + apply(ip, 2, median))) #this part runs fine when I run them outside of the function
distances[j] = d
methods[j] = m
best.k.s[j] = best_k
j = j + 1
final = data.frame(df.names,distances,methods,best.k.s)
}
}
}
return(final)
}
The expected result is a data frame with 7 columns: name, distance metric, method, best k, 2nd best k, 3rd best k, and worst k, all ranked by the mean + median criterion.
https://imgur.com/a/KpFM04m
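One way to get a table like this without index bookkeeping is to enumerate the distance/method combinations with expand.grid() and rank the k values in a small helper. The following is only a sketch under stated assumptions: rank_k(), the score_fun argument, and fake_scores() are hypothetical names introduced here, and fake_scores() returns random numbers in place of the Rand indices that the bootFlexclust() call in the question would produce.

```r
# Rank the candidate k values by mean + median of their bootstrap scores.
# `scores` has one column per k, with the k values as column names.
rank_k <- function(scores) {
  crit <- apply(scores, 2, mean) + apply(scores, 2, median)
  ranked <- names(sort(crit, decreasing = TRUE))
  c(best = ranked[1], second = ranked[2], third = ranked[3],
    worst = ranked[length(ranked)])
}

# score_fun(d, m) must return a score matrix for one distance/method pair.
cl_stability <- function(df.name, score_fun,
                         distances = c("euclidean", "manhattan"),
                         methods = c("kmeans", "hardcl", "neuralgas")) {
  combos <- expand.grid(distance = distances, method = methods,
                        stringsAsFactors = FALSE)
  # One call per combination; each call yields a named vector of 4 ranks
  ranks <- t(mapply(function(d, m) rank_k(score_fun(d, m)),
                    combos$distance, combos$method, USE.NAMES = FALSE))
  data.frame(name = df.name, combos, ranks,
             row.names = NULL, stringsAsFactors = FALSE)
}

# Hypothetical stand-in for the real scores: random values for k = 2..6.
# With flexclust, this would wrap the bootFlexclust() call instead.
fake_scores <- function(d, m) {
  matrix(runif(50), ncol = 5, dimnames = list(NULL, as.character(2:6)))
}

set.seed(1)
result <- cl_stability("test_df", fake_scores)
```

The result has one row per combination (6 here) and the 7 columns described above. Because each combination is handled by a single function call, the loop variables cannot drift out of sync with the values that are stored.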

Related

Row-wise integration of hundreds (or more) functions

Assume we have a DF with hundreds of linear model parameters, including slope m and y-intercept b, as well as upper-limits for integration up_lim.
tmp_df <- tibble(m = rnorm(1:1000, mean = 1, sd = 1),
b = rnorm(1:1000, mean = 3, sd = 0.5),
up_lim = rnorm(1:1000, mean = 11, sd = 4))
My goal is to row-wise integrate over x, from 0 to up_lim using a simple linear model:
integrand <- function(x) { m * x + b }
The result should be stored in a new column in tmp_df. I searched online and I am aware that integrate() is not vectorized, but I could not translate any of the discussions or solutions I found to my case. My best solution was a loop, which works for a few hundred integrations but crashes my 12-core MacBook (even with multi-core support) when I feed it my full data set (> 1 million rows):
lapply(c("foreach", "doParallel"),
library, character.only = TRUE)
n <- nrow(tmp_df)
registerDoParallel(numCores)
tmp_df$Fs_linear <-
foreach (i = 1:n, .combine = rbind) %dopar% {
integrate(
function(x) { tmp_df$m[i] * x + tmp_df$b[i] },
lower = 0,
upper = tmp_df$up_lim[i])$value
}
stopImplicitCluster()
Is there an elegant/resource-efficient way to accomplish this? I would be incredibly thankful for any pointers.
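Because the integrand is linear, the integral has a closed form: the integral of m * x + b from 0 to up_lim is m * up_lim^2 / 2 + b * up_lim. That means the whole column can be computed with vectorized arithmetic and no call to integrate() at all. A minimal sketch, using a plain data.frame in place of the tibble (an assumption made only to keep the example self-contained):

```r
set.seed(42)
n <- 1000
tmp_df <- data.frame(m = rnorm(n, mean = 1, sd = 1),
                     b = rnorm(n, mean = 3, sd = 0.5),
                     up_lim = rnorm(n, mean = 11, sd = 4))

# Closed form of the integral of m * x + b from 0 to up_lim, for all rows at once
tmp_df$Fs_linear <- with(tmp_df, m * up_lim^2 / 2 + b * up_lim)

# Spot check against integrate() on the first few rows
check <- sapply(1:5, function(i) {
  integrate(function(x) tmp_df$m[i] * x + tmp_df$b[i],
            lower = 0, upper = tmp_df$up_lim[i])$value
})
all.equal(check, tmp_df$Fs_linear[1:5])
```

This only applies when the integrand has a known antiderivative, as the linear model here does. For integrands without one, a row-wise call to integrate() would still be needed.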

In R, How do I properly pass a function and a set of parameters to said function so that it executes properly?

I am working on an R project, and I have many different functions (I'm calculating RMSEs on various data sets with various requirements).
I am currently using do.call() to invoke the function whose name I pass in, but this causes my whole system to stall and nothing works. This has happened many times over, and I've had to restart RStudio (using version 4.0.2).
I would like to pass in a function as an argument into my parent function (which is recursive but only to 2 passes), and I would like to be able to pass in the parameters from the parent function to the child functions, as well as the recursive function call.
I'm not sure of the correct execution of this.
Any help on where I'm going wrong is greatly appreciated.
Currently, my code is as follows:
#find_generic_lambda is the parent function that is called, and the FUN argument is the named function I would like to pass in to execute inside
find_generic_lambda <- function(seq_start, seq_end, seq_increment, FUN, detailed_flag = FALSE, training_set, testing_set)
{
lambdas <- seq(seq_start, seq_end, seq_increment)
params = c(lambdas, train_set, test_set)
#invoking the passed-in function here with the parameters I'm setting
#this is where the code stumbles
RMSE <- sapply(lambdas, do.call(FUN, params))
#find the smallest lambda
qplot(lambdas, RMSE)
#saving the first round lambda
min_lambda_first_try <- lambdas[which.min(RMSE)]
min_lambda_first_try
if (detailed_flag)
{
#if this is the first iteration of the function, continue with a 10% lower and a 10% higher lambda value to iterate through new lambdas that are much more granular, with increments at 10% of what they were previously.
new_lambda_range = (seq_end + seq_start)/10
new_lambda_range
min_lambda_first_try <- find_generic_lambda(seq_start = min_lambda_first_try - new_lambda_range, seq_end = min_lambda_first_try + new_lambda_range,
seq_increment = seq_increment/10, FUN, detailed_flag = FALSE, training_set = training_set, testing_set = testing_set)
}
return (min_lambda_first_try)
}
#this is one of the functions that will be passed in as a parameter
regularized_rmse_3 <- function(l, train_set, test_set)
{
mu <- mean(train_set$rating)
just_the_sum <- train_set %>%
group_by(movieId) %>%
summarize(s = sum(rating - mu), n_i = n())
predicted_ratings <- test_set %>%
left_join(just_the_sum, by='movieId') %>%
mutate(b_i = s/(n_i+l)) %>%
mutate(pred = mu + b_i) %>%
pull(pred)
return(RMSE(predicted_ratings, test_set$rating))
}
rmse3_lambda <- find_generic_lambda(seq_start=0, seq_end=10, seq_increment=0.5,
FUN="regularized_rmse_3",
detailed_flag = TRUE, training_set=training_set, testing_set=testing_set)
Expanding on my comments:
Here's a simplified version of your functions (so I can make an example dataset):
f <- function (l_candidate, FUN) {
RMSE <- sapply(l_candidate, FUN)
l_min_RMSE <- l_candidate[which.min(RMSE)]
return(l_min_RMSE)
}
g <- function (l, trainset, testset) {
p <- mean(trainset + l)
sqrt(mean((testset - p)^2))
}
trainset <- c(1, 1, 2, 1)
testset <- c(3, 4)
Then:
f(1:5, FUN = function (x) g(x, trainset, testset))
# [1] 2
So you pass the function g via a wrapper function into f and it will do the job for you.
Alternative
R allows you to create a function out of another function:
g <- function (trainset, testset) function (l) {
p <- mean(trainset + l)
sqrt(mean((testset - p)^2))
}
g1 <- g(trainset, testset)
g1(1)
# [1] 1.346291
In this situation, g() takes two arguments and returns a function that takes one argument, l. So you can create a new function g1() out of g().
Then you can pass it to your parent function giving you the same results in this example:
f(1:5, FUN = g1)
# [1] 2
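A third option, shown here only as a sketch, is to let the parent function forward extra arguments through `...`, since sapply() passes any additional arguments on to the function it calls. The names find_best() and toy_rmse() are hypothetical; toy_rmse() has the same body as the first g() in this answer. match.fun() is used so that the function can be given either as an object or by name as a string, which is how the question calls it.

```r
# Parent function: extra arguments in ... are forwarded to FUN on every call
find_best <- function(l_candidate, FUN, ...) {
  FUN <- match.fun(FUN)  # accepts a function or its name as a string
  rmse <- sapply(l_candidate, FUN, ...)
  l_candidate[which.min(rmse)]
}

# Toy child function: the first argument is the value being tuned
toy_rmse <- function(l, trainset, testset) {
  p <- mean(trainset + l)
  sqrt(mean((testset - p)^2))
}

find_best(1:5, toy_rmse, trainset = c(1, 1, 2, 1), testset = c(3, 4))
# [1] 2
find_best(1:5, "toy_rmse", trainset = c(1, 1, 2, 1), testset = c(3, 4))
# [1] 2
```

The child function is called once per candidate value, with the candidate as its first argument and the data sets supplied by name. No do.call() is needed.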

How to define a recursive for loop in R?

I have a number of variables that is not known in advance, and for each variable I need to define a for loop and perform a series of operations. For each subsequent variable, I need to define a nested loop inside the previous one, performing the same operations. I guess there must be a way of doing this recursively, but I am struggling with it.
Consider for instance the following easy example:
results = c()
index = 0
for(i in 1:5)
{
a = i*2
for(j in 1:5)
{
b = a*2 + j
for(k in 1:5)
{
index = index + 1
c = b*2 + k
results[index] = c
}
}
}
In this example, I would have 3 variables. The loop on j requires information from the loop on i, and the loop on k requires information from the loop on j. This is a simplified example of my problem and the operations here are pretty simple. I am not interested in another way of getting the "results" vector; what I would like to know is whether there is a way to do these operations recursively for an unknown number of variables, let's say 10 variables, so that I do not need to nest 10 loops manually.
Here is one approach that you might be able to modify for your situation...
results <- 0 #initialise
for(level in 1:3){ #3 nested loops - change as required
results <- c( #converts output to a vector
outer(results, #results so far
1:5, #as in your loops
FUN = function(x,y) {x*2+y} #as in your loops
)
)
}
The two problems with this are
a) that your formula is different in the first (outer) loop, and
b) the order of results is different from yours
However, you might be able to find workarounds for these depending on your actual problem.
I have tried to change the code into a function that lets you define how many iterations each loop runs.
library(tidyverse)
fc <- function(i_end, j_end, k_end){
i <- 1:i_end
j <- 1:j_end
k <- 1:k_end
df <- crossing(i, j, k) %>%
mutate(
a = i*2,
b = a*2 + j,
c = b*2 + k,
index = row_number())
df
}
fc(5,5,5)
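For an arbitrary nesting depth, the loops in the question can also be written as a function that calls itself once per level. This is a minimal sketch, not a general solution: nested_loop() and its arguments are hypothetical names, and it hard-codes the question's formulas (i * 2 at the outermost level, previous value * 2 + index at every inner level).

```r
# depth: how many loops to nest; n: each loop runs over 1:n
# prev: value handed down from the enclosing loop (NULL at the outermost level)
nested_loop <- function(depth, n = 5, prev = NULL) {
  out <- c()
  for (i in seq_len(n)) {
    # The outermost loop uses i * 2; every inner loop uses prev * 2 + i
    val <- if (is.null(prev)) i * 2 else prev * 2 + i
    if (depth == 1) {
      out <- c(out, val)  # innermost level: store the value
    } else {
      out <- c(out, nested_loop(depth - 1, n, prev = val))  # go one level deeper
    }
  }
  out
}

results <- nested_loop(3)
length(results)
# [1] 125
head(results, 5)
# [1] 11 12 13 14 15
```

With depth = 3 this gives the same values, in the same order, as the three nested loops in the question. Calling nested_loop(10) would nest ten levels, although the result then has 5^10 elements, so growing the vector with c() becomes slow.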

R apply function to data based on index column value

Example:
require(data.table)
example = matrix(c(rnorm(15, 5, 1), rep(1:3, each=5)), ncol = 2, nrow = 15)
example = data.table(example)
setnames(example, old=c("V1","V2"), new=c("target", "index"))
example
threshold = 100
accumulating_cost = function(x,y) { x-cumsum(y) }
whats_left = accumulating_cost(threshold, example$target)
whats_left
I want whats_left to consist of the difference between threshold and the cumulative sum of values in example$target for which example$index = 1, and 2, and 3. So I used the following for loop:
rm(whats_left)
whats_left = vector("list")
for(i in 1:max(example$index)) {
whats_left[[i]] = accumulating_cost(threshold, example$target[example$index==i])
}
whats_left = unlist(whats_left)
whats_left
plot(whats_left~c(1:15))
I know for loops aren't the devil in R, but I'm habituating myself to use vectorization when possible (including getting away from apply, being a for loop wrapper). I'm pretty sure it's possible here, but I can't figure out how to do it. Any help would be much appreciated.
All you are trying to do is accumulate the cost by index. Thus, you might want to use the by argument, as in
example[, accumulating_cost(threshold, target), by = index]
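The same grouped computation can be sketched in base R with ave(), which applies a function within each group and returns a vector aligned with the original rows. This is an alternative to the data.table call above, not a replacement for it; a plain data.frame is assumed here so the example runs without extra packages.

```r
set.seed(1)
example <- data.frame(target = rnorm(15, 5, 1),
                      index = rep(1:3, each = 5))
threshold <- 100

# ave() runs cumsum separately within each index group
example$whats_left <- threshold - ave(example$target, example$index, FUN = cumsum)

# Same numbers as the per-group computation from the question
loop_result <- unlist(lapply(1:max(example$index), function(i) {
  threshold - cumsum(example$target[example$index == i])
}))
all.equal(example$whats_left, loop_result)
```

Because ave() keeps the original row order, the result can be stored directly as a new column, even when the rows are not sorted by index.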

Create a function that takes in a vector and returns a matrix in R

I am trying to create a function that will take in a vector k and return to me a matrix with dimensions length(distMat[1,]) by length(k). distMat is a huge matrix and indSpam is a long vector. In particular to my situation, length(distMat[1,]) is 2412. When I enter in k as a vector of length one, I get a vector of length 2412. I want to be able to enter in k as a vector of length two and get a matrix of 2412x2. I am trying to use a while loop to let it go through the length of k, but it only returns to me a vector of length 2412. What am I doing wrong?
predNeighbor = function(k, distMat, indSpam){
counter = 1
while (counter<(length(k)+1))
{
preMatrix = apply(distMat, 1, order)
orderedMatrix = t(preMatrix)
truncate = orderedMatrix[,1:k[counter]]
checking = indSpam[truncate]
checking2 = matrix(checking, ncol = k[counter])
number = apply(checking2, 1, sum)
return(number[1:length(distMat[1,])] > (k[counter]/2))
counter = counter + 1
}
}
"I am trying to create a function that will take in a vector k and return to me a matrix with dimensions length(distMat[1,]) by length(k)"
Here's a function that does this.
foo <- function(k, distMat) {
return(matrix(0, nrow = length(distMat[1, ]), ncol = length(k)))
}
If you have other requirements, please describe them in words.
Based on your comment, I think I understand your goal better. You have a function that returns a vector of length k and you want to save its output as rows in a matrix. This is a pretty common task. Let's do a simple example where k starts out as 1:10, and say we want to add some noise to it with a function noise_and_rank() and see how the rank changes.
In the case where the input to the function is always the same, replicate() works very well. It will automatically put everything in a matrix
k <- 1:10
noise_and_rank <- function(k) {
rank(k + runif(length(k), min = -2, max = 2))
}
results <- replicate(n = 8, expr = {noise_and_rank(k)})
In the case where you want to iterate, i.e., the output from one step is the input for the next, a for loop is good, and we just pre-allocate a matrix of 0's to fill in one column/row at a time
k <- 1:10
n.sim <- 8
results <- matrix(0, nrow = length(k), ncol = n.sim)
results[, 1] <- k
for(i in 2:n.sim) {
results[, i] <- noise_and_rank(results[, i - 1])
}
What your original question seems to be about is how to do the pre-allocation. If the input is always the same, using replicate() means you don't have to worry about it. If the input is different each time, then pre-allocate using matrix(); you don't need to write any special function.
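Applied to the function in the question, the while loop stops on its first pass because return() sits inside it, so only the first element of k is ever used. A possible rewrite, given only as a sketch, orders the neighbours once and then uses sapply() over k, which builds the matrix column by column. The name pred_neighbor() and the small test inputs are made up for illustration.

```r
pred_neighbor <- function(k, distMat, indSpam) {
  # Order each row's neighbours once, outside the loop over k
  ordered <- t(apply(distMat, 1, order))
  # One column of the result per element of k
  sapply(k, function(kk) {
    nearest <- ordered[, seq_len(kk), drop = FALSE]
    votes <- matrix(indSpam[nearest], ncol = kk)
    rowSums(votes) > kk / 2  # majority vote among the kk nearest
  })
}

# Tiny made-up example: 4 points on a line, the first two flagged as spam
distMat <- matrix(c(0, 1, 2, 3,
                    1, 0, 1, 2,
                    2, 1, 0, 1,
                    3, 2, 1, 0), nrow = 4, byrow = TRUE)
indSpam <- c(TRUE, TRUE, FALSE, FALSE)

pred <- pred_neighbor(c(1, 3), distMat, indSpam)
dim(pred)
# [1] 4 2
```

Each column holds the predictions for one value of k, so with the 2412-row distMat from the question and a k of length two the result would be 2412 x 2.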

Resources