How to simplify several for loops into a single loop or function in R

I am trying to combine several for loops into a single loop or function. Each loop evaluates whether an individual is present at a protected site and, based on that, assigns a number (numbers represent sites) at each time step. The results for each time step are then stored in a matrix and later used in other analyses. The problem I am having is that I am repeating the same loop several times to evaluate the different scenarios (10%, 50%, 100% of sites protected). Since I need to store the results for each scenario, I can't think of a better way to simplify this into a single loop or function. Any ideas or suggestions will be appreciated. This is a very small and simplified version of the problem. I would like to keep the structure of the loop, since my original loop uses several if statements; the only thing that changes is the proportion of sites that are protected.
N<-10 # number of sites
sites<-factor(seq(from=1,to=N))
sites10<-as.factor(sample(sites,N*1))   # 100% of sites protected
sites5<-as.factor(sample(sites,N*0.5))  # 50% of sites protected
sites1<-as.factor(sample(sites,N*0.1))  # 10% of sites protected
steps<-10
P.stay<-0.9
# storing results
result<-matrix(0,nrow=steps)
time.step<-seq(1,steps)
time.step<-data.frame(time.step)
time.step$event<-0
j<-numeric(steps)
j[1]<-sample(1:N,1)
time.step$event[1]<-j[1]
for(i in 1:(steps-1)){
  if(j[i] %in% sites1){
    if(rbinom(1,1,P.stay)==1){time.step$event[i+1]<-j[i+1]<-j[i]} else
      time.step$event[i+1]<-0
  }
  time.step$event[i+1]<-j[i+1]<-sample(1:N,1)
}
results.sites1<-as.factor(result)
###
result<-matrix(0,nrow=steps)
time.step<-seq(1,steps)
time.step<-data.frame(time.step)
time.step$event<-0
j<-numeric(steps)
j[1]<-sample(1:N,1)
time.step$event[1]<-j[1]
for(i in 1:(steps-1)){
  if(j[i] %in% sites5){
    if(rbinom(1,1,P.stay)==1){time.step$event[i+1]<-j[i+1]<-j[i]} else
      time.step$event[i+1]<-0
  }
  time.step$event[i+1]<-j[i+1]<-sample(1:N,1)
}
results.sites5<-as.factor(result)
###
result<-matrix(0,nrow=steps)
time.step<-seq(1,steps)
time.step<-data.frame(time.step)
time.step$event<-0
j<-numeric(steps)
j[1]<-sample(1:N,1)
time.step$event[1]<-j[1]
for(i in 1:(steps-1)){
  if(j[i] %in% sites10){
    if(rbinom(1,1,P.stay)==1){time.step$event[i+1]<-j[i+1]<-j[i]} else
      time.step$event[i+1]<-0
  }
  time.step$event[i+1]<-j[i+1]<-sample(1:N,1)
}
results.sites10<-as.factor(result)
#
results.sites1
results.sites5
results.sites10

Instead of doing this:
sites10<-as.factor(sample(sites,N*1))
sites5<-as.factor(sample(sites,N*0.5))
sites1<-as.factor(sample(sites,N*0.1))
and running distinct loops for each of the three variables, you can make a general loop and put it in a function, then use one of the -apply functions to call it with specific parameters. For example:
N<-10 # number of sites
sites<-factor(seq(from=1,to=N))
steps<-10
P.stay<-0.9
simulate.n.sites <- function(n) {
  n.sites <- sample(sites, n)
  result <- matrix(0, nrow = steps)
  time.step <- seq(1, steps)
  time.step <- data.frame(time.step)
  time.step$event <- 0
  j <- numeric(steps)
  j[1] <- sample(1:N, 1)
  time.step$event[1] <- j[1]
  for(i in 1:(steps-1)){
    if(j[i] %in% n.sites){
      # ...same body as in your loop, but testing against n.sites...
      if(rbinom(1, 1, P.stay) == 1){time.step$event[i+1] <- j[i+1] <- j[i]} else
        time.step$event[i+1] <- 0
    }
    time.step$event[i+1] <- j[i+1] <- sample(1:N, 1)
  }
  return(result)
}
results <- lapply(c(1, 5, 10), simulate.n.sites)
Now results will be a list, with three matrix elements.
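If you want to keep track of which scenario each element corresponds to, one option (a small sketch reusing the same c(1, 5, 10) vector) is to name the list:
n.protected <- c(1, 5, 10)
results <- lapply(n.protected, simulate.n.sites)
names(results) <- paste0("sites", n.protected)  # "sites1", "sites5", "sites10"
results[["sites5"]]  # the run with 5 protected sites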
The key is to identify places where you repeat yourself, and then refactor those areas into functions. Not only is this more concise, but it's easy to extend in the future. Want to sample 2 sites? Put a 2 in the vector you pass to lapply.
If you're unfamiliar with the -apply family of functions, definitely look into those.
I also suspect that much of the rest of your code could be simplified, but I think you've gutted it too much for me to make sense of it. For example, you define an element of time.step$event based on a condition, but then you overwrite that element. Surely this isn't what the actual code does?

Related

For loop to create multiple empty data frames gives error

I wrote a for loop to create multiple empty data frames, using a vector of names, but even though it seemed really easy at the start, I got an error message: Error in ID_names[i] <- data.frame() : replacement has length zero
To be more specific, I'll provide a reproducible example:
ID_names <- c("Athens","Rome","Barcelona","London","Paris","Madrid")
for(i in 1:length(ID_names)){
  ID_names[i] <- data.frame()
}
Do you have any idea why this is wrong? I would like to ask you not only to provide a solution, but also to explain why this for loop is wrong, so I can avoid this kind of mistake in the future.
You are trying to store a data frame in one element of a character vector (ID_names[i]), which is not possible. You might want to create a list of empty data frames and assign names to it, which can be done using replicate.
ID_names <- c("Athens","Rome","Barcelona","London","Paris","Madrid")
list_data <- setNames(replicate(length(ID_names), data.frame()), ID_names)
However, such initialisation of empty data frames is very rarely useful; it tends to create more confusion down the road. Depending on your actual use case, there might be better ways to handle this.
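As a rough illustration of the more usual pattern, you could build each data frame directly and collect the results in a named list (make_city_data() below is a hypothetical helper standing in for whatever actually creates each data frame):
ID_names <- c("Athens","Rome","Barcelona","London","Paris","Madrid")
list_data <- lapply(ID_names, function(city) {
  make_city_data(city)  # hypothetical: returns the data frame for one city
})
names(list_data) <- ID_names
# if a single combined data frame is what you actually need:
all_data <- do.call(rbind, list_data)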

How to loop through columns of a data.frame and use a function

This has probably been answered already, and in that case I am sorry to repeat the question, but unfortunately I couldn't find an answer to my problem. I am currently trying to improve the readability of my code and to use functions more frequently, yet I am not that familiar with them.
I have a data.frame where some columns contain NAs that I want to interpolate, in this case with a simple Kalman filter.
require(imputeTS)
#some test data
col <- c("Temp","Prec")
df_a <- data.frame(c(10, 13, NA, 14, 17),
                   c(20, NA, 30, NA, NA))
names(df_a) <- col
#this is my function I'd like to use
gapfilling <- function(df, col){
  print(sum(is.na(df[, col])))
  df[, col] <- na_kalman(df[, col])
}
#this is my for-loop to loop through the columns
for (i in col) {
  gapfilling(df_a, i)
}
I have two problems:
My for loop works, yet it doesn't overwrite the data.frame. Why?
How can I achieve this without a for-loop? As far as I am aware you should avoid for-loops if possible and I am sure it's possible in my case, I just don't know how.
How can I achieve this without a for-loop? As far as I am aware you should avoid for-loops if possible and I am sure it's possible in my case, I just don't know how.
You most definitely do not have to avoid for loops. What you should avoid is using a loop to perform actions that could be vectorized. Loops in general are just fine; however, they are (much) slower than loops in compiled languages such as C++, and roughly on par with loops in languages such as Python.
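For example, a toy illustration of the difference (not your Kalman case, just the general idea):
x <- rnorm(1e6)
# looped version: correct, but slow in R
y <- numeric(length(x))
for (i in seq_along(x)) y[i] <- x[i] * 2 + 1
# vectorized version: one call, much faster
y <- x * 2 + 1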
My for loop works, yet it doesn't overwrite the data.frame. Why?
This is a problem with overwriting values within a function, or what is referred to as scope. Basically any assignment is restricted to its current environment (or scope). Take the example below:
f <- function(x){
  a <- x
  cat("a is equal to ", a, "\n")
  return(3)
}
x <- 4
f(x)
a is equal to 4
[1] 3
print(a)
Error in print(a) : object 'a' not found
As you can see, "a" definitely exists inside the function, but it stops existing once the function call has finished. It is restricted to the environment (or scope) of the function, and that environment only lives while the function is running.
To alleviate this, you have to overwrite the value in the global environment:
for (i in col) {
  df_a[, i] <- gapfilling(df_a, i)
}
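This works because in R the value of an assignment is the assigned value, so the last expression in gapfilling already evaluates to the imputed column. Making the return explicit is clearer, though; a minimal rewrite:
gapfilling <- function(df, col){
  print(sum(is.na(df[, col])))
  df[, col] <- na_kalman(df[, col])
  return(df[, col])  # explicitly return the imputed column
}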
Now, for readability (not speed), one could change this to an lapply:
df_a[, col] <- lapply(df_a[, col], na_kalman)
To be clear, this is not faster than using a loop: lapply still iterates over each column, just as the loop does. A real speed-up would only come if, say, na_kalman were written to take multiple columns at once and handled them in optimized C or C++ code.

How to modify elements of a vector based on other elements in parallel?

I'm trying to parallelize part of my code, but I can't figure out how, since everywhere I read about parallelization the object is completely split and then processed using only the "information" in each chunk. My problem cannot be split into independent chunks, but the updates can be done independently.
Here's a simplified version
a <- runif(10000)
indexes <- c(seq(1,9999,2), seq(2,10000,2))
for(i in indexes){
  .prev <- ifelse(i > 1, a[i-1], 0)
  .next <- ifelse(i < 10000, a[i+1], 1)
  a[i] <- runif(1, min(.prev, .next), max(.prev, .next))
}
While each iteration of the for loop depends on the current values in a, the order defined in indexes makes the problem parallelizable within the even and within the odd indexes; e.g., if indexes = seq(1,9999,2), the dependence is not a problem, since the values read in each iteration are never modified in that pass. On the other hand, it is impossible to execute this on the subset a[indexes] alone, so the "splitting" strategy described in the guides I read cannot perform this operation.
How can I parallelize a problem like this, where I need the object to be modified and to "look" at the whole vector every time? Instead of each worker returning some output, can each worker modify a "shared" object in memory?
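To make the even/odd structure concrete, here is a rough, untested sketch of the two-pass update I have in mind, using mclapply from the base parallel package (which does not fork on Windows); I don't know whether this is actually a good approach:
library(parallel)
update_index <- function(i, a) {
  .prev <- if (i > 1) a[i - 1] else 0
  .next <- if (i < length(a)) a[i + 1] else 1
  runif(1, min(.prev, .next), max(.prev, .next))
}
# first pass: odd indexes only read their even neighbours, so they are independent
odd <- seq(1, 9999, 2)
a[odd] <- unlist(mclapply(odd, update_index, a = a, mc.cores = 2))
# second pass: even indexes, using the odd values just written
even <- seq(2, 10000, 2)
a[even] <- unlist(mclapply(even, update_index, a = a, mc.cores = 2))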

Need to loop through columns to perform t tests?

So I have a customer survey, and I need to determine if there are significant differences between the four areas. I obviously want to do a t-test on these, and here is my current R solution.
for (i in colnames(c_survey))
  assign(i, subset(c_survey, select = i))
elements <- list(quality, ease.of.use, price, service)
elements_alt <- list(service, price, ease.of.use, quality)
for(i in elements){
  print(names(elements)[i])
  for (j in elements_alt) {print(t.test(i, j)$p.value)}
}
(Edit) I figured out the nested loop, but I still think there's a faster way to do what I want than this whole two-list, nested-loop nonsense. Also, my output has no names on it, so I have no idea what is being compared to what, and it includes all the duplicate inverse comparisons. I also can't save the results. I think my solution certainly lies elsewhere.
Also, would producing so many t-test p values even be the best statistical way to accomplish what I want? It seems like there should be something easier than this...
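For reference, I imagine something along these lines would at least give named, non-duplicated comparisons (a sketch using combn over the four column names; I am not sure it is idiomatic, and it ignores the multiple-comparisons issue):
vars <- c("quality", "ease.of.use", "price", "service")
pairs <- combn(vars, 2)  # each unique pair of columns, once
p.values <- apply(pairs, 2, function(v) t.test(c_survey[[v[1]]], c_survey[[v[2]]])$p.value)
names(p.values) <- paste(pairs[1, ], pairs[2, ], sep = " vs ")
p.values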

How to use a value that is specified in a function call as a "variable"

I am wondering if it is possible in R to use a value that is declared in a function call as a "variable" part of the function itself, similar to the functionality that is available in SAS IML.
Given something like this:
put.together <- function(suffix, numbers) {
  new.suffix <<- as.data.frame(numbers)
  return(new.suffix)
}
x <- c(seq(1000,1012, 1))
put.together(part.a, x)
new.part.a ##### does not exist!!
new.suffix ##### does exist
As it is written, the function returns a dataframe called new.suffix, as it should because that is what I'm asking it to do.
I would like to get a dataframe returned that is called new.part.a.
EDIT: Additional information was requested regarding the purpose of the analysis
The purpose of the question is to produce dataframes that will be sent to another function for analysis.
There exists a data bank where elements are organized into groups by number, and other people organize the groups into a meaningful set.
Each group has an id number. I use the information supplied by others to put the groups together as they are specified.
For example, I would be given a set of id numbers like: part-1 = 102263, 102338, 202236, 302342, 902273, 102337, 402233.
So, part-1 has seven groups, each group having several elements.
I use the id numbers in a merge so that only the groups of interest are extracted from the large data bank.
The following is what I have for one set:
### all.possible.elements.bank <- .csv file from large database ###
id.part.1 <- as.data.frame(c(102263, 102338, 202236, 302342, 902273, 102337, 402233))
bank.names <- c("bank.id")
colnames(id.part.1) <- bank.names
part.sort <- matrix(seq(1,nrow(id.part.1),1))
sort.part.1 <- cbind(id.part.1, part.sort)
final.part.1 <- as.data.frame(merge(sort.part.1, all.possible.elements.bank,
by="bank.id", all.x=TRUE))
The process above is repeated many, many times.
I know that I could do this for all of the collections that I would pull together, but I thought I would be able to wrap the selection process into a function. The only things that would change would be the part numbers (part-1, part-2, etc..) and the groups that are selected out.
It is possible using the assign function (and possibly deparse and substitute), but doing things like this is strongly discouraged. Why can't you just return the data frame and call the function like:
new.part.a <- put.together(x)
This is generally the better approach.
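In code, that would look something like this (a sketch of the same function with the global assignment and the unused suffix argument removed):
put.together <- function(numbers) {
  as.data.frame(numbers)
}
x <- seq(1000, 1012, 1)
new.part.a <- put.together(x)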
If you really want to change things in the global environment, then you may want a macro; see the defmacro function in the gtools package and, most importantly, read the document in the references section of its help page.
This is rarely something you should want to do... assigning to things outside the function environment can get you into all sorts of trouble.
However, you can do it using assign:
put.together <- function(suffix, numbers) {
  # build the name "new.<suffix>" and assign the data frame
  # in the function's enclosing environment (here, the global environment)
  assign(paste('new',
               deparse(substitute(suffix)),
               sep = '.'),
         as.data.frame(numbers),
         envir = parent.env(environment()))
}
put.together(part.a, 1:20)
But like Greg said, it's usually not necessary, and always dangerous if used incorrectly.
