I have a previously unknown number of variables, and for each variable I need to define a for loop and perform a series of operations. For each subsequent variable, I need to define a nested loop inside the previous one, performing the same operations. I guess there must be a way of doing this recursively, but I am struggling with it.
Consider for instance the following easy example:
results <- c()
index <- 0
for (i in 1:5) {
  a <- i*2
  for (j in 1:5) {
    b <- a*2 + j
    for (k in 1:5) {
      index <- index + 1
      c <- b*2 + k
      results[index] <- c
    }
  }
}
In this example, I would have 3 variables. The loop on j requires information from the loop on i, and the loop on k requires information from the loop on j. This is a simplified example of my problem, and the operations here are pretty simple. I am not interested in another way of getting the "results" vector; what I would like to know is whether there is a way to do these operations recursively for an unknown number of variables, let's say 10 variables, so that I do not need to nest 10 loops manually.
Here is one approach that you might be able to modify for your situation...
results <- 0 # initialise
for (level in 1:3) { # 3 nested loops - change as required
  results <- c( # converts output to a vector
    outer(results, # results so far
          1:5, # as in your loops
          FUN = function(x, y) x*2 + y # as in your loops
    )
  )
}
The two problems with this are
a) that your formula is different in the first (outer) loop, and
b) the order of results is different from yours
However, you might be able to find workarounds for these depending on your actual problem.
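If you do want literal recursion, here is a minimal sketch (the function name nest_loops and its prev argument are my own, not from the question). Like the outer() approach above, it applies one rule at every level, so the outermost level differs slightly from the original a = i*2, but the depth-first order matches the manual nesting (k varies fastest):
nest_loops <- function(depth, prev = 0, range = 1:5) {
  # each call expands one loop level; recursion replaces manual nesting
  if (depth == 0) return(prev)
  unlist(lapply(range, function(i) nest_loops(depth - 1, prev*2 + i, range)))
}
results <- nest_loops(3) # for 10 variables, simply call nest_loops(10)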
I have tried to change the code so that it is a function that allows you to define how many iterations need to happen.
library(tidyverse)
fc <- function(i_end, j_end, k_end) {
  i <- 1:i_end
  j <- 1:j_end
  k <- 1:k_end
  df <- crossing(i, j, k) %>%
    mutate(
      a = i*2,
      b = a*2 + j,
      c = b*2 + k,
      index = row_number()
    )
  df
}
fc(5,5,5)
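If the number of variables is genuinely unknown, the same idea generalizes by building the grid programmatically. A hedged sketch (fc_n is my own name; it assumes every level after the first follows the prev*2 + index pattern from the example):
library(tidyr)
fc_n <- function(ends) {
  # one range per variable, e.g. ends = c(5, 5, 5) reproduces fc(5, 5, 5)
  ranges <- setNames(lapply(ends, seq_len), paste0("v", seq_along(ends)))
  df <- do.call(crossing, ranges) # all combinations, last column varies fastest
  acc <- df$v1 * 2 # first level: a = i*2
  for (col in names(df)[-1]) {
    acc <- acc*2 + df[[col]] # later levels: previous*2 + loop index
  }
  df$result <- acc
  df$index <- seq_len(nrow(df))
  df
}
fc_n(c(5, 5, 5))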
require(quantmod)
require(TTR)
iris2 <- iris[1:4]
b <- NULL
for (i in 1:ncol(iris2)) {
  for (j in 1:ncol(iris2)) {
    a <- runCor(iris2[,i], iris2[,j], n = 21)
    b <- cbind(b, a)
  }
}
I want to calculate a rolling correlation of different columns within a dataframe and store the data separately by column. Although the code above stores the data in variable b, it is not very useful because it just dumps all the results together. What I would like is to be able to create a different dataframe for each i.
In this case, as I have 4 columns, what I would ultimately want are 4 dataframes, each containing 4 columns showing rolling correlations (i.e. df1 = corr of col 1 vs cols 1,2,3,4; df2 = corr of col 2 vs cols 1,2,3,4; etc.).
I thought of using lapply or rollapply, but ran into the same problem.
d <- NULL
for (i in 1:ncol(iris2)) {
  for (j in 1:ncol(iris2)) {
    c <- rollapply(iris2, 21, function(x) cor(x[,i], x[,j]), by.column = FALSE)
    d <- cbind(d, c)
  }
}
Would really appreciate any inputs.
If you want to keep the expanded loop, how about a list of dataframes?
e <- vector("list", ncol(iris2)) # preallocate one list element per column
for (i in 1:ncol(iris2)) {
  d <- matrix(0, nrow = nrow(iris2), ncol = ncol(iris2))
  for (j in 1:ncol(iris2)) {
    d[,j] <- runCor(iris2[,i], iris2[,j], n = 21)
  }
  e[[i]] <- d
}
It's also a good idea to allocate the amount of space you want with placeholders and put items into that space rather than use rbind or cbind.
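For instance, to see the rolling correlations of column 1 against all four columns with readable names (a usage sketch only):
df1 <- as.data.frame(e[[1]]) # rolling correlations of column 1 vs all columns
names(df1) <- names(iris2)
head(df1, 25) # the first 20 rows are NA until the 21-row window fills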
Although it is not good practice to create dataframes on the fly in R (you should prefer putting them in a list, as in the other answer), the way to do so is to use the assign and get functions.
for (i in 1:ncol(iris2)) {
  d <- NULL
  for (j in 1:ncol(iris2)) {
    d <- cbind(d, runCor(iris2[,i], iris2[,j], n = 21))
  }
  # Assign the accumulated matrix to the name df1, df2...
  assign(paste0("df", i), d)
}
# to have access to the dataframe:
get("df1")
# or inside a loop
get(paste0("df", i))
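As an aside, if you later decide you want all of those dataframes in a single list after all, base R's mget can collect them by name (this assumes the loop above has already run):
all_dfs <- mget(paste0("df", 1:ncol(iris2)))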
Since you stated your computation was slow, I wanted to provide you with a parallel solution. If you have a modern computer, it probably has 2 cores, if not 4 (or more!). You can easily check this via:
require(parallel) # for parallelization
detectCores()
Now the code:
require(quantmod)
require(TTR)
iris2 <- iris[,1:4]
Parallelization requires that the functions and variables be placed into a special environment that is created and destroyed with each process. That means a wrapper function must be created to define the variables and functions.
wrapper <- function(data, n) {
  # variables placed into environment
  force(data)
  force(n)
  # functions placed into environment
  # same inner loop written in the earlier answer
  runcor <- function(data, n, i) {
    d <- matrix(0, nrow = nrow(data), ncol = ncol(data))
    for (j in 1:ncol(data)) {
      d[,j] <- TTR::runCor(data[,i], data[,j], n = n)
    }
    return(d)
  }
  # call function to loop over iterator i
  worker <- function(i) {
    runcor(data, n, i)
  }
  return(worker)
}
Now create a cluster on your local computer. This allows the multiple cores to run separately.
parallelcluster <- makeCluster(parallel::detectCores())
models <- parallel::parLapply(parallelcluster, 1:ncol(iris2),
wrapper(data = iris2, n = 21))
Stop and close the cluster when finished:
stopCluster(parallelcluster)
I have a set of vectors of length n; say, for example, n=3:
vec1<-c(1,2,3)
vec2<-c(2,2,2)
And a multidimensional array of size n^n:
threeDarray<-array(0,dim=c(3,3,3))
I want to create a loop that goes through my set of vectors and adds 1 to the corresponding index in the array. After analysing the two vectors above, the array should look like:
threeDarray[1,2,3]=1
threeDarray[2,2,2]=1
I'm trying to use the multidimensional array to store the number of occurrences of each vector (my vectors are patterns in a time series).
The community is right (and the noob is wrong). Multidimensional arrays are not the way to go about this.
An example of code working with lists:
freqPatterns <- function(timeSeries, dimension) {
  temp <- character()
  for (i in 1:(length(timeSeries)-dimension+1)) {
    pattern <- paste(as.character(rank(timeSeries[i:(i+dimension-1)])-1), collapse=", ")
    #print(pattern)
    temp[[length(temp)+1]] <- pattern
  }
  freqTable <- sort(table(temp), decreasing=T)
  return(freqTable)
}
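A quick usage sketch on a toy series (the input values are illustrative only):
set.seed(1)
freqPatterns(rnorm(100), dimension = 3) # table of rank patterns, most frequent first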
Thank you guys!
As you found out yourself, I wouldn't use a multidimensional array either.
Here is a solution using a dataframe:
n <- 3 # dimension (matches the length-3 example vectors)
ll <- rep(list(1:n), n) # build list of n vectors 1:n
df_occurs <- expand.grid(ll, KEEP.OUT.ATTRS=F) # get all combinations
df_occurs$occurences <- 0
# for-loop for counting the occurences
for (v in list(vec1, vec2)) {
  v_match <- apply(df_occurs[,1:n], 1, function(x) all(x==v))
  df_occurs$occurences[v_match] <- df_occurs$occurences[v_match] + 1
}
Maybe performance is an issue with large n. If it's possible to build a character key out of your vector, e.g.
paste(vec1, collapse="")
the lookup in the dataframe would be easier:
df_occurs <- data.frame(
  key = apply(expand.grid(ll, KEEP.OUT.ATTRS=F), 1, paste, collapse=""),
  occurences = 0
)
for (v in list(vec1, vec2)) {
  k <- paste(v, collapse="")
  df_occurs$occurences[df_occurs$key==k] <- df_occurs$occurences[df_occurs$key==k] + 1
}
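Looking up the count for any pattern is then a single comparison (usage sketch):
df_occurs$occurences[df_occurs$key == paste(c(2, 2, 2), collapse="")]
#[1] 1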
I have a for loop in R in which I want to store the result of each calculation (for all the values looped through). In the for loop a function is called and the output is stored in a variable r at the moment. However, this is overwritten on each successive pass through the loop. How can I store the result of each pass through the function and access it afterwards?
Thanks! Example:
for (par1 in 1:n) {
  var <- myfunction(par1, par2) # placeholder for the function being called
  var2 <- c(var, par1)
  print(var2)
}
So print shows every instance of var2, but var2 itself only keeps the value from the last iteration. Is there any way to get an array of the data or something?
Initialise an empty object and then assign the value by indexing:
a <- 0
for (i in 1:10) {
  a[i] <- mean(rnorm(50))
}
print(a)
EDIT:
To include an example with two output variables: in the most basic case, create an empty matrix with the number of columns corresponding to your output parameters and the number of rows matching the number of iterations. Then save the output in the matrix by indexing the row position in your for loop:
n <- 10
mat <- matrix(ncol=2, nrow=n)
for (i in 1:n) {
  var1 <- function_one(i,par1)
  var2 <- function_two(i,par2)
  mat[i,] <- c(var1,var2)
}
print(mat)
The iteration number i corresponds to the row number in the mat object. So there is no need to explicitly keep track of it.
However, this is just to illustrate the basics. Once you understand the above, it is more efficient to use the elegant solution given by @eddi, especially if you are handling many output variables.
To get a list of results:
n = 3
lapply(1:n, function(par1) {
  # your function and whatnot, e.g.
  par1*par1
})
Or sapply if you want a vector instead.
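For example:
sapply(1:n, function(par1) par1*par1)
#[1] 1 4 9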
A bit more complicated example:
n = 3
some_fn = function(x, y) { x + y }
par2 = 4
lapply(1:n, function(par1) {
  var = some_fn(par1, par2)
  return(c(var, par1)) # don't have to type return, but I chose to make it explicit here
})
#[[1]]
#[1] 5 1
#
#[[2]]
#[1] 6 2
#
#[[3]]
#[1] 7 3
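If you would rather end up with a matrix than a list, you can bind the rows afterwards:
do.call(rbind, lapply(1:n, function(par1) c(some_fn(par1, par2), par1)))
#     [,1] [,2]
#[1,]    5    1
#[2,]    6    2
#[3,]    7    3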
I'm working on subsets of data from multiple time periods and I'd like to do column and level reduction on my training set and then apply the same actions to other datasets of the same structure.
dataframeReduce in the Hmisc package is what I've been using, but applying the function to different datasets results in slightly different actions.
trainPredictors<-dataframeReduce(trainPredictors,
fracmiss=0.2, maxlevels=20, minprev=0.075)
testPredictors<-dataframeReduce(testPredictors,
fracmiss=0.2, maxlevels=20, minprev=0.075)
testPredictors<-testPredictors[,names(trainPredictors)]
The final line errors because testPredictors has had a column removed that trainPredictors retains. All other sets should have the same transformations applied to them as were applied to trainPredictors.
Does anyone know how to apply the same cleanup actions to multiple datasets either using dataframeReduce or another function/block of code?
An example
Using the function NAins from http://trinkerrstuff.wordpress.com/2012/05/02/function-to-generate-a-random-data-set/
NAins <- NAinsert <- function(df, prop = .1){
  n <- nrow(df)
  m <- ncol(df)
  num.to.na <- ceiling(prop*n*m)
  id <- sample(0:(m*n-1), num.to.na, replace = FALSE)
  rows <- id %/% m + 1
  cols <- id %% m + 1
  sapply(seq(num.to.na), function(x){
    df[rows[x], cols[x]] <<- NA
  })
  return(df)
}
library("Hmisc")
trainPredictors<-NAins(mtcars, .1)
testPredictors<-NAins(mtcars, .3)
trainPredictors<-dataframeReduce(trainPredictors,
fracmiss=0.2, maxlevels=20, minprev=0.075)
testPredictors<-dataframeReduce(testPredictors,
fracmiss=0.2, maxlevels=20, minprev=0.075)
testPredictors<-testPredictors[,names(trainPredictors)]
If your goal is to have the same variables with the same levels, then you need to avoid using dataframeReduce a second time. Instead, keep the columns produced by the dataframeReduce operation on the train set, and apply the factor-reduction logic to the test set in a way that gives whatever degree of homology is needed for subsequent comparison operations. If a predict operation is planned, then you need to get the levels to be the same, and you need to modify the code in dataframeReduce that works on the levels:
if (is.category(x) || length(unique(x)) == 2) {
  tab <- table(x)
  if ((min(tab)/n) < minprev) {
    if (is.category(x)) {
      x <- combine.levels(x, minlev = minprev)
      s <- "grouped categories"
      if (length(levels(x)) < 2)
        s <- paste("prevalence<", minprev, sep = "")
    }
    else s <- paste("prevalence<", minprev, sep = "")
  }
}
So a better problem statement is likely to produce a better strategy. This will probably require knowing both what levels are in the entire set and what levels are in the train and test sets, as well as what testing or predictions are anticipated (but not yet stated).
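A minimal sketch of that strategy, assuming prediction is the goal (the re-levelling loop below is my own illustration, not part of Hmisc): reduce only the training set, subset the test set to the surviving columns, and force the test factors onto the training levels:
trainPredictors <- dataframeReduce(trainPredictors,
                                   fracmiss=0.2, maxlevels=20, minprev=0.075)
keep <- names(trainPredictors)
testPredictors <- testPredictors[, keep, drop = FALSE] # same columns, no second reduce
for (nm in keep) {
  if (is.factor(trainPredictors[[nm]])) {
    # identical level sets; unseen test levels become NA
    testPredictors[[nm]] <- factor(testPredictors[[nm]],
                                   levels = levels(trainPredictors[[nm]]))
  }
}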
I followed the discussion over HERE and am curious why using <<- is frowned upon in R. What kind of confusion will it cause?
I also would like some tips on how I can avoid <<-. I use the following quite often. For example:
### Create dummy data frame of 10 x 10 integer matrix.
### Each cell contains a number that is between 1 to 6.
df <- do.call("rbind", lapply(1:10, function(i) sample(1:6, 10, replace = TRUE)))
What I want to achieve is to shift every number down by 1, i.e. all the 2s will become 1s, all the 3s will become 2s, etc. Therefore, every n would become n-1. I achieve this with the following:
df.rescaled <- df
sapply(2:6, function(i) df.rescaled[df.rescaled == i] <<- i-1)
In this instance, how can I avoid <<-? Ideally I would want to be able to pipe the sapply results into another variable along the lines of:
df.rescaled <- sapply(...)
First point
<<- is NOT the operator to assign to a global variable. It tries to assign to the variable in the nearest enclosing (parent) environment in which it is defined. So, say, this will cause confusion:
f <- function() {
  a <- 2
  g <- function() {
    a <<- 3
  }
  g()
}
then,
> a <- 1
> f()
> a # the global `a` is not affected
[1] 1
Second point
You can do that by using Reduce:
Reduce(function(a, b) {a[a==b] <- a[a==b]-1; a}, 2:6, df)
or apply
apply(df, c(1, 2), function(i) if(i >= 2) {i-1} else {i})
But
simply, this is sufficient:
ifelse(df >= 2, df-1, df)
You can think of <<- as global assignment (approximately, because as kohske points out it assigns to the top environment unless the variable name exists in a more proximal environment). Examples of why this is bad are here:
Examples of the perils of globals in R and Stata