R apply function to data based on index column value

Example:
require(data.table)
example = matrix(c(rnorm(15, 5, 1), rep(1:3, each=5)), ncol = 2, nrow = 15)
example = data.table(example)
setnames(example, old=c("V1","V2"), new=c("target", "index"))
example
threshold = 100
accumulating_cost = function(x,y) { x-cumsum(y) }
whats_left = accumulating_cost(threshold, example$target)
whats_left
I want whats_left to hold the difference between threshold and the cumulative sum of example$target, computed separately for the rows where example$index is 1, 2, and 3. So I used the following for loop:
rm(whats_left)
whats_left = vector("list")
for(i in 1:max(example$index)) {
whats_left[[i]] = accumulating_cost(threshold, example$target[example$index==i])
}
whats_left = unlist(whats_left)
whats_left
plot(whats_left~c(1:15))
I know for loops aren't the devil in R, but I'm habituating myself to use vectorization when possible (including getting away from apply, being a for loop wrapper). I'm pretty sure it's possible here, but I can't figure out how to do it. Any help would be much appreciated.

All you are trying to do is accumulate the cost by index. Thus, you might want to use the by argument of data.table, as in
example[, accumulating_cost(threshold, target), by = index]
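As a rough check, the grouped call can be compared against the loop from the question. This is only a sketch: it assumes the data.table package is installed, builds the table directly instead of going through a matrix, and the names grouped and looped are introduced here for illustration.

```r
library(data.table)

set.seed(42)
example <- data.table(target = rnorm(15, 5, 1), index = rep(1:3, each = 5))
threshold <- 100
accumulating_cost <- function(x, y) x - cumsum(y)

# Grouped version: the function is evaluated once per distinct value of index
grouped <- example[, .(whats_left = accumulating_cost(threshold, target)), by = index]

# Loop-style version from the question, kept only for comparison
looped <- unlist(lapply(1:3, function(i) {
  accumulating_cost(threshold, example$target[example$index == i])
}))

# Both approaches should give the same 15 values, in the same order
all.equal(grouped$whats_left, looped)
```

Naming the result column with .(whats_left = ...) avoids the default V1 column name.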

Related

List Appending with outputs from machine learning function

Please excuse the title; I couldn't find a better phrase to describe my question.
I'm running the cluster stability analysis function from the 'flexclust' package, which runs bootstrap sampling on your dataset and calculates the Rand index for each value of k (over a range that I get to specify).
The function lets you try multiple distance metrics and clustering methods. I want to run it for every combination of distance and method, and find the best k for each combination based on the mean + median of each k's scores.
I've basically written nested for loops that initialize a vector for each column (name, distance metric, method, and best k) and then call data.frame() to stitch them all together.
###############################################################################################
df = data.frame(matrix(rbinom(10*100, 1, .5), ncol=4)) #random df for testing purpose
cl_stability <- function(df, df.name, k_low, k_high)
{
cluster.distance = c("euclidean","manhattan")
cluster.method = c("kmeans","hardcl","neuralgas")
for (dist in cluster.distance)
{
for (method in cluster.method)
{
j = 1
while (j <= length(cluster.distance)*length(cluster.method))
{
df.names = rep(c(df.name),length(cluster.distance)*length(cluster.method))
distances = c()
methods = c()
best.k.s = c()
ip = as.data.frame((bootFlexclust(df, k = k_low:k_high, multicore = TRUE,
FUN = "cclust", dist = d, method = m))@rand)
best_k = names(which.max(apply(ip, 2, mean) + apply(ip, 2, median))) #this part runs fine when I run them outside of the function
distances[j] = d
methods[j] = m
best.k.s[j] = best_k
j = j + 1
final = data.frame(df.names,distances,methods,best.k.s)
}
}
}
return(final)
}
The expected result would be a data frame with 7 columns: name, distance metric, method, best k, 2nd best, 3rd best, and worst k, ranked by the median + mean criterion.
https://imgur.com/a/KpFM04m
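One possible way to collect one row per distance/method combination is sketched below. It is not the asker's code and makes several assumptions: fake_rand is a hypothetical stand-in for the bootFlexclust call (it just returns random scores, one column per k), and cl_stability_sketch and the output column names are invented for illustration. The point is only the pattern of building one small data frame per combination and binding them at the end.

```r
# Hypothetical stand-in for as.data.frame(bootFlexclust(...)@rand):
# one column of scores per k, one row per bootstrap replicate.
fake_rand <- function(k_low, k_high, nboot = 20) {
  ks <- k_low:k_high
  m <- matrix(runif(nboot * length(ks)), nrow = nboot)
  colnames(m) <- ks
  as.data.frame(m)
}

cl_stability_sketch <- function(df.name, k_low, k_high) {
  # Every distance/method pair, one per row
  combos <- expand.grid(distance = c("euclidean", "manhattan"),
                        method = c("kmeans", "hardcl", "neuralgas"),
                        stringsAsFactors = FALSE)
  rows <- lapply(seq_len(nrow(combos)), function(i) {
    ip <- fake_rand(k_low, k_high)
    # Mean + median criterion from the question, ranked from best to worst
    score <- colMeans(ip) + apply(ip, 2, median)
    ranked <- names(sort(score, decreasing = TRUE))
    data.frame(name = df.name,
               distance = combos$distance[i],
               method = combos$method[i],
               best = ranked[1], second = ranked[2], third = ranked[3],
               worst = ranked[length(ranked)],
               stringsAsFactors = FALSE)
  })
  do.call(rbind, rows)
}

set.seed(1)
res <- cl_stability_sketch("df", 2, 6)
res
```

In a real run, the fake_rand call would be replaced by the bootFlexclust call, passing combos$distance[i] and combos$method[i] as the dist and method arguments.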

Store values from a loop

I am simulating dice throws, and would like to save the output in a single object, but cannot find a way to do so. I tried looking here, here, and here, but they do not seem to answer my question.
Here is my attempt to assign the result of a 20 x 3 trial to an object:
set.seed(1)
Twenty = for(i in 1:20){
trials = sample.int(6, 3, replace = TRUE)
print(trials)
i = i+1
}
print(Twenty)
What I do not understand is why I cannot recall the function after it is run?
I also tried using return instead of print in the function:
Twenty = for(i in 1:20){
trials = sample.int(6, 3, replace = TRUE)
return(trials)
i = i+1
}
print(Twenty)
or creating an empty matrix first:
mat = matrix(0, nrow = 20, ncol = 3)
mat
for(i in 1:20){
mat[i] = sample.int(6, 3, replace = TRUE)
print(mat)
i = i+1
}
but they seem to be worse (as I do not even get to see the trials).
Thanks for any hints.

There are several things wrong with your attempts:
1) A loop is neither a function nor an object in R, so it doesn't make sense to assign a loop to a variable.
2) A loop written as for(i in 1:20) already advances i on every iteration, so adding i = i + 1 is redundant.
Your last attempt implemented correctly would look like this:
mat <- matrix(0, nrow = 20, ncol = 3)
for(i in 1:20){
mat[i, ] = sample.int(6, 3, replace = TRUE)
}
print(mat)
I personally would simply do
matrix(sample.int(6, 20 * 3, replace = TRUE), nrow = 20)
(since all draws are independent and with replacement, it doesn't matter if you make 3 draws 20 times or simply 60 draws)
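The two points above can be checked in a short session. This is a minimal sketch in base R; the name one_call is introduced here for illustration.

```r
set.seed(1)

# A for loop evaluates to NULL, so assigning it stores nothing useful
Twenty <- for (i in 1:20) sample.int(6, 3, replace = TRUE)
is.null(Twenty)

# Filling a preallocated matrix row by row keeps the results
mat <- matrix(0L, nrow = 20, ncol = 3)
for (i in 1:20) {
  mat[i, ] <- sample.int(6, 3, replace = TRUE)
}

# The single-call version produces a matrix of the same shape
one_call <- matrix(sample.int(6, 20 * 3, replace = TRUE), nrow = 20)
dim(mat)
dim(one_call)
```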

In most programming languages you do not assign a for loop to a variable, because a loop is a control-flow construct rather than a function that returns a value. Loops are used to work iteratively on existing objects. R, however, provides the apply family, which collects the result of each iteration into an object with one element per input.
Consider lapply (list apply) for list output or sapply (simplified apply) for matrix output:
# LIST OUTPUT
Twenty <- lapply(1:20, function(x) sample.int(6, 3, replace = TRUE))
# MATRIX OUTPUT
Twenty <- sapply(1:20, function(x) sample.int(6, 3, replace = TRUE))
And to see your trials, simply print out the object
print(Twenty)
But since you never use the iterator variable x, consider replicate (a wrapper around sapply whose simplify argument switches between matrix and list output). It takes a number of repetitions and an expression, with no input sequence or function needed:
# MATRIX OUTPUT (DEFAULT)
Twenty <- replicate(20, sample.int(6, 3, replace = TRUE))
# LIST OUTPUT
Twenty <- replicate(20, sample.int(6, 3, replace = TRUE), simplify = FALSE)
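One detail worth checking is the orientation of the matrix output: sapply and replicate put each trial in a column, so the result has 3 rows and 20 columns. A small sketch, where Twenty_rows is a name introduced here:

```r
set.seed(1)
Twenty <- replicate(20, sample.int(6, 3, replace = TRUE))
dim(Twenty)               # one column per trial

# Transpose if you prefer one row per trial, as in the 20 x 3 matrix attempt
Twenty_rows <- t(Twenty)
dim(Twenty_rows)
```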

You can use list:
Twenty=list()
for(i in 1:20){
Twenty[[i]] = sample.int(6, 3, replace = TRUE)
}
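If a matrix is needed afterwards, the list can be bound row-wise. The sketch below is a variation on the loop above, assuming you want one row per trial; preallocating with vector("list", 20) is optional but avoids growing the list on each iteration, and trial_matrix is a name introduced here.

```r
set.seed(1)
Twenty <- vector("list", 20)   # preallocated list with 20 empty slots
for (i in seq_along(Twenty)) {
  Twenty[[i]] <- sample.int(6, 3, replace = TRUE)
}

# Stack the 20 vectors of length 3 into a 20 x 3 matrix
trial_matrix <- do.call(rbind, Twenty)
dim(trial_matrix)
```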

Loop for generating lottery numbers

I am working with R and using the expression sort(sample(1:60,6,replace=FALSE)) to generate 6 numbers between 1 and 60, without replacement.
I would like to write a for loop that generates n different samples using the logic above.
Any suggestions on how to build this loop?

Use replicate:
replicate(sort(sample(1:60, 6, replace = FALSE)), n = 1000)
The result is a matrix of size 6x1000, so each column is one sample.
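Since the question asked specifically for a for loop, here is a sketch of both forms side by side with a small n. The names draws and draws_loop are introduced here for illustration.

```r
set.seed(123)
n <- 5

# replicate version: one sorted draw per column
draws <- replicate(n, sort(sample(1:60, 6, replace = FALSE)))
dim(draws)

# Equivalent explicit loop, filling a preallocated 6 x n matrix column by column
draws_loop <- matrix(NA_integer_, nrow = 6, ncol = n)
for (j in seq_len(n)) {
  draws_loop[, j] <- sort(sample(1:60, 6, replace = FALSE))
}
dim(draws_loop)
```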
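I assume you want independent random draws, which means the same combination can occur more than once. In case you do want unique samples, I gave it a shot:
lottery <- function(n) {
  # start with n sorted draws, one per column
  S <- replicate(sort.int(sample(1:60, 6, replace = FALSE)), n = n)
  # anyDuplicated() returns the index of the first duplicated column, or 0 if there is none
  while (d <- anyDuplicated(S, MARGIN = 2)) {
    # drop the duplicated column and append a fresh draw
    S <- cbind(S[, -d], sort.int(sample(1:60, 6, replace = FALSE)))
  }
  S
}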

You can use the rerun function from purrr, which returns a list of the results you need (note that rerun is deprecated as of purrr 1.0.0):
library(purrr)
rerun(.n = 1000, sort(sample(1:60, 6, replace = FALSE))) %>%
unique()

In R, how can I vectorize a column assignment that filters for another condition?

I have a data table to which I am assigning a new column, and I am trying to vectorize the following loop so that my code runs more efficiently. I have looked into ifelse, but I don't think it works on data tables; please correct me if I'm wrong! I've seen answers for Matlab and C++ but would appreciate help on how to do it in R. Below is my assignment loop:
for (i in 1:20610) {
if (table$Delta[i] > 0) {
table$CashFlow[i] = -1*table$Buy[i]*table$Delta[i]
}
else {
table$CashFlow[i] = -1*table$Sell[i]*table$Delta[i]
}
}
Thank you!
Example of data and expected output

We can use ifelse
table$CashFlow <- with(table, ifelse(Delta > 0, -1*Buy*Delta, -1*Sell*Delta))
Another option is row/column matrix indexing (this relies on Buy and Sell being the second and third columns, as in the data below):
table[-1][with(table, cbind(1:nrow(table), (Delta <= 0)+1))]*-1 * table$Delta
data
set.seed(24)
table <- data.frame(Delta = rnorm(5), Buy = sample(1:10, 5,
replace = TRUE), Sell = sample(1:7, 5, replace = TRUE))
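To confirm that the ifelse version matches the original loop, the two can be compared on the sample data. This is a sketch in base R; the data frame is called trades here (an illustrative name) to avoid masking the base function table, and vec and loop are likewise introduced for the comparison.

```r
set.seed(24)
trades <- data.frame(Delta = rnorm(5),
                     Buy = sample(1:10, 5, replace = TRUE),
                     Sell = sample(1:7, 5, replace = TRUE))

# Vectorized version: the condition is evaluated for all rows at once
vec <- with(trades, ifelse(Delta > 0, -1 * Buy * Delta, -1 * Sell * Delta))

# Row-by-row version from the question, kept only for comparison
loop <- numeric(nrow(trades))
for (i in seq_len(nrow(trades))) {
  if (trades$Delta[i] > 0) {
    loop[i] <- -1 * trades$Buy[i] * trades$Delta[i]
  } else {
    loop[i] <- -1 * trades$Sell[i] * trades$Delta[i]
  }
}

all.equal(vec, loop)
```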

How to apply a distribution function for each row in data frame

I know similar questions have been asked on this site here, here, and here, but none of them tackles my problem.
I have a data frame and I want to apply the rdirichlet function (from gtools) to each row, so that each row is used as the alpha parameter.
data = NULL
data <- data.frame(rbind(
oct = c(60, 32, 8),
sep = c(53, 35, 12),
ago = c(54, 40, 6)
))
data <- data/100*1000
library(gtools) # contains the function
sim <- 10000 # simulation
My first attempt was to use apply. It does work, but the output is awkward for further analysis, because each row's computation is flattened into a single vector:
p = apply(data, 1, function(x) rdirichlet(sim, alpha = x + 1))
I also tried a loop, without success:
p = NULL
for(i in 1:length(data)) {
p[i] <- rdirichlet(sim, alpha = data[i] + 1)
}
Any tips on how I can solve this?

Well, firstly you might want to change data in the anonymous function inside apply to x, to match the x in function(x):
apply(data, 1, function(x) rdirichlet(sim, alpha = x + 1))
This works for me: it returns an output with three columns and 30000 rows.

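Two things matter here. First, use apply rather than an explicit loop:
ans <- apply(data, 1, function(x) rdirichlet(sim, alpha = x + 1))
Second, each row's computation comes back flattened into one column of ans: the sim draws for the first category come first, then the sim draws for the second category, and so on.
You therefore need to slice out the blocks you want, for example the margin between the first two categories of the first row:
margin <- ans[1:sim, 1] - ans[(sim + 1):(2 * sim), 1]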
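An alternative that avoids index arithmetic is to keep one sim x 3 matrix per row in a named list. The sketch below is an assumption-laden illustration: to stay self-contained it uses rdirichlet_sketch, a hypothetical base R stand-in built from normalized gamma draws that is meant to mimic the layout of gtools::rdirichlet (one draw per row, one category per column), and it uses a smaller sim than the question.

```r
# Hypothetical stand-in for gtools::rdirichlet: normalized gamma draws,
# returning an n x length(alpha) matrix whose rows sum to 1
rdirichlet_sketch <- function(n, alpha) {
  k <- length(alpha)
  g <- matrix(rgamma(n * k, shape = rep(alpha, each = n)), nrow = n, ncol = k)
  g / rowSums(g)
}

data <- data.frame(rbind(
  oct = c(60, 32, 8),
  sep = c(53, 35, 12),
  ago = c(54, 40, 6)
))
data <- data / 100 * 1000
sim <- 1000   # smaller than in the question, to keep the sketch fast

set.seed(1)
# One sim x 3 matrix per row of data, kept in a list named after the rows
p <- lapply(seq_len(nrow(data)), function(i) {
  rdirichlet_sketch(sim, alpha = unlist(data[i, ]) + 1)
})
names(p) <- rownames(data)

# Margin between the first two categories for the "oct" row
margin <- p$oct[, 1] - p$oct[, 2]
length(margin)
```

With gtools installed, rdirichlet_sketch can be swapped for rdirichlet without changing the rest of the code.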
