R stats - generate incremental variables - r

I am quite new in R programming and facing a simple problem but can't find any solution.
In other programming languages, it is possible to generate incrementally named variables into a loop. Is this possible in R? How?
For example, I would like to save the output of an operation into a new variable each time a loop is done:
for(i in 1:5) {
var_[i] <- i + pi
}
In this way, I would end up with var_1, var_2,..., var_5.
Thank you in advance for any help.

The literal version of what you're attempting is generally considered bad practice in R.
We generally avoid creating large collections of isolated data structures. It is much cleaner to put them all in a list and then set their names attribute:
x <- vector("list",5)
for (i in seq_len(5)){
x[[i]] <- i + pi
}
names(x) <- paste0("var_",1:5)
And then you'd refer to them via things like:
x[["var_1"]]

While this is possible in R, it is highly discouraged. It is better to work with named lists or vector or accumulate results. For example here you can store them as a vector.
myvar<-1:5+pi
# myvar[1] == 4.141593
# myvar[5] == 8.141593
or if you wanted to create a list you could use
myvar <- lapply(1:5, function(x) {x+pi})
names(myvar)<-paste("var", 1:5, sep="_")
# myvar[["var_1"]] == myvar[[1]] == 4.141593
But if you really need to create a bunch of variables (and you don't) you can use the assign() function

Related

r- call function inside for loop alternative

my_function <- function(n){}
result = list()
for(i in 0:59){
result[i] = my_function(i)
}
write.csv(result, "result.csv")
New to R, read that for-loops are bad in R, so is there an alternative to what I'm doing? I'm basically trying to call my_function with a parameter that's increasing, and then write the results to a file.
Edit
Sorry I didn't specify that I wanted to use some function of i as a parameter for my_function, 12 + (22*i) for example. Should I create a list of values and then called lapply with that list of values?
for loops are fine in R, but they're syntactically inefficient in a lot of use cases, especially simple ones. The apply family of functions usually makes a good substitute.
result <- lapply(0:59, my_function)
write.csv(result, "result.csv")
Depending on what your function's output is, you might want sapply rather than lapply.
Edit:
Per your update, you could do it as you say, creating the vector of values first, or you could just do something like:
lapply(12+(22*0:59), my_function)

Recall different data names inside loop

here is how I created number of data sets with names data_1,data_2,data_3 .....and so on
for initial
dim(data)<- 500(rows) 17(column) matrix
for ( i in 1:length(unique( data$cluster ))) {
assign(paste("data", i, sep = "_"),subset(data[data$cluster == i,]))
}
upto this point everything is fine
now I am trying to use these inside the other loop one by one like
for (i in 1:5) {
data<- paste(data, i, sep = "_")
}
however this is not giving me the data with required format
any help will be really appreciated.
Thank you in advance
Let me give you a tip here: Don't just assign everything in the global environment but use lists for this. That way you avoid all the things that can go wrong when meddling with the global environment. The code you have in your question, will overwrite the original dataset data, so you'll be in trouble if you want to rerun that code when something went wrong. You'll have to reconstruct the original dataframe.
Second: If you need to split a data frame based on a factor and carry out some code on each part, you should take a look at split, by and tapply, or at the plyr and dplyr packages.
Using Base R
With base R, it depends on what you want to do. In the most general case you can use a combination of split() and lapply or even a for loop:
mylist <- split( data, f = data$cluster)
for(mydata in mylist){
head(mydata)
...
}
Or
mylist <- split( data, f = data$cluster)
result <- lapply(mylist, function(mydata){
doSomething(mydata)
})
Which one you use, depends largely on what the result should be. If you need some kind of a summary for every subset, using lapply will give you a list with the results per subset. If you need this for a simulation or plotting or so, you better use the for loop.
If you want to add some variables based on other variables, then the plyr or dplyr packages come in handy
Using plyr and dplyr
These packages come especially handy if the result of your code is going to be an array or data frame of some kind. This would be similar to using split and lapply but then in a way Hadley approves of :-)
For example:
library(plyr)
result <- ddply(data, .(cluster),
function(mydata){
doSomething(mydata)
})
Use dlply if the result should be a list.

update a vector using assign in R

I am implementing k-means in R.
In a loop, I am initiating several vectors that will be used to store values that belong to a particular cluster, as seen here:
for(i in 1:k){
assign(paste("cluster",i,sep=""),vector())
}
I then want to add to a particular "cluster" vector, depending on the value I get for the variable getIndex. So if getIndex is equal to 2 I want to add the variable minimumDistance to the vector called cluster2. This is what I am attempting to do:
minimumDistance <- min(distanceList)
getIndex <- match(minimumDistance,distanceList)
clusterName <- paste("cluster",getIndex,sep="")
name <- c(name, minimumDistance)
But obviously the above code does not work because in order to append to a vector that I'm naming I need to use assign as I do when I instantiate the vectors. But I do not know how to use assign, when using paste, when also appending to a vector.
I cannot use the index such as vector[i] because I don't know what index of that particular vector I want to add to.
I need to use the vector <- c(vector,newItem) format but I do not know how to do this in R. Or if there is any other option I would greatly, greatly appreciate it. If I were using Python I would simply use paste and then use append but I can't do that in R. Thank you in advance for your help!
You can do something like this:
out <- list()
for (i in 1:nclust) {
# assign some data (in this case a list) to a cluster
assign(paste0("N_", i), list(...))
# here I put all the clusters data in a list
# but you could use a similar statement to do further data manipulation
# ie if you've used a common syntax (here "N_" <index>) to refer to your elements
# you can use get to retrieve them using the same syntax
out[[i]] <- get(paste0("N_", i))
}
If you want a more complete code example, this link sounds like a similar problem emclustr::em_clust_mvn

Double "for loops" in a dataframe in R

I need to do a quality control in a dataset with more than 3000 variables (columns). However, I only want to apply some conditions in a couple of them. A first step would be to replace outliers by NA. I want to replace the observations that are greater or smaller than 3 standard deviations from the mean by NA. I got it, doing column by column:
height = ifelse(abs(height-mean(height,na.rm=TRUE)) <
3*sd(height,na.rm=TRUE),height,NA)
And I also want to create other variables based on different columns. For example:
data$CGmark = ifelse(!is.na(data$mark) & !is.na(data$height) ,
paste(data$age, data$mark,sep=""),NA)
An example of my dataset would be:
name = factor(c("A","B","C","D","E","F","G","H","H"))
height = c(120,NA,150,170,NA,146,132,210,NA)
age = c(10,20,0,30,40,50,60,NA,130)
mark = c(100,0.5,100,50,90,100,NA,50,210)
data = data.frame(name=name,mark=mark,age=age,height=height)
data
I have tried this (for one condition):
d1=names(data)
list = c("age","height","mark")
ntraits=length(list)
nrows=dim(data)[1]
for(i in 1:ntraits){
a=list[i]
b=which(d1==a)
d2=data[,b]
for (j in 1:nrows){
d2[j] = ifelse(abs(d2[j]-mean(d2,na.rm=TRUE)) < 3*sd(d2,na.rm=TRUE),d2[j],NA)
}
}
Someone told me that I am not storing d2. How can I create for loops to apply the conditions I want? I know that there are similar questions but i didnt get it yet. Thanks in advance.
You pretty much wrote the answer in your first line. You're overthinking this one.
First, it's good practice to encapsulate this kind of operation in a function. Yes, function dispatch is a tiny bit slower than otherwise, but the code is often easier to read and debug. Same goes for assigning "helper" variables like mean_x: the cost of assigning the variable is very, very small and absolutely not worth worrying about.
NA_outside_3s <- function(x) {
mean_x <- mean(x)
sd_x <- sd(x,na.rm=TRUE)
x_outside_3s <- abs(x - mean(x)) < 3 * sd_x
x[x_outside_3s] <- NA # no need for ifelse here
x
}
of course, you can choose any function name you want. More descriptive is better.
Then if you want to apply the function to very column, just loop over the columns. That function NA_outside_3s is already vectorized, i.e. it takes a logical vector as an argument and returns a vector of the same length.
cols_to_loop_over <- 1:ncol(my_data) # or, some subset of columns.
for (j in cols_to_loop_over) {
my_data[, j] <- NA_if_3_sd(my_data[, j])
}
I'm not sure why you wrote your code the way you did (and it took me a minute to even understand what you were trying to do), but looping over columns is usually straightforward.
In my comment I said not to worry about efficiency, but once you understand how the loop works, you should rewrite it using lapply:
my_data[cols_to_loop_over] <- lapply(my_data[cols_to_loop_over], NA_outside_3s)
Once you know how the apply family of functions works, they are very easy to read if written properly. And yes, they are somewhat faster than looping, but not as much as they used to be. It's more a matter of style and readability.
Also: do NOT name a variable list! This masks the function list, which is an R built-in function and a fairly important one at that. You also shouldn't generally name variables data because there is also a data function for loading built-in data sets.

How to rewrite this Stata code in R?

One of the things Stata does well is the way it constructs new variables (see example below). How to do this in R?
foreach i in A B C D {
forval n=1990/2000 {
local m = 'n'-1
# create new columns from existing ones on-the-fly
generate pop'i''n' = pop'i''m' * (1 + trend'n')
}
}
DONT do it in R. The reason its messy is because its UGLY code. Constructing lots of variables with programmatic names is a BAD THING. Names are names. They have no structure, so do not try to impose one on them. Decent programming languages have structures for this - rubbishy programming languages have tacked-on 'Macro' features and end up with this awful pattern of constructing variable names by pasting strings together. This is a practice from the 1970s that should have died out by now. Don't be a programming dinosaur.
For example, how do you know how many popXXXX variables you have? How do you know if you have a complete sequence of pop1990 to pop2000? What if you want to save the variables to a file to give to someone. Yuck, yuck yuck.
Use a data structure that the language gives you. In this case probably a list.
Both Spacedman and Joshua have very valid points. As Stata has only one dataset in memory at any given time, I'd suggest to add the variables to a dataframe (which is also a kind of list) instead of to the global environment (see below).
But honestly, the more R-ish way to do so, is to keep your factors factors instead of variable names.
I make some data as I believe it is in your R version now (at least, I hope so...)
Data <- data.frame(
popA1989 = 1:10,
popB1989 = 10:1,
popC1989 = 11:20,
popD1989 = 20:11
)
Trend <- replicate(11,runif(10,-0.1,0.1))
You can then use the stack() function to obtain a dataframe where you have a factor pop and a numeric variable year
newData <- stack(Data)
newData$pop <- substr(newData$ind,4,4)
newData$year <- as.numeric(substr(newData$ind,5,8))
newData$ind <- NULL
Filling up the dataframe is then quite easy :
for(i in 1:11){
tmp <- newData[newData$year==(1988+i),]
newData <- rbind(newData,
data.frame( values = tmp$values*Trend[,i],
pop = tmp$pop,
year = tmp$year+1
)
)
}
In this format, you'll find most R commands (selections of some years, of a single population, modelling effects of either or both, ...) a whole lot easier to perform later on.
And if you insist, you can still create a wide format with unstack()
unstack(newData,values~paste("pop",pop,year,sep=""))
Adaptation of Joshua's answer to add the columns to the dataframe :
for(L in LETTERS[1:4]) {
for(i in 1990:2000) {
new <- paste("pop",L,i,sep="") # create name for new variable
old <- get(paste("pop",L,i-1,sep=""),Data) # get old variable
trend <- Trend[,i-1989] # get trend variable
Data <- within(Data,assign(new, old*(1+trend)))
}
}
Assuming popA1989, popB1989, popC1989, popD1989 already exist in your global environment, the code below should work. There are certainly more "R-like" ways to do this, but I wanted to give you something similar to your Stata code.
for(L in LETTERS[1:4]) {
for(i in 1990:2000) {
new <- paste("pop",L,i,sep="") # create name for new variable
old <- get(paste("pop",L,i-1,sep="")) # get old variable
trend <- get(paste("trend",i,sep="")) # get trend variable
assign(new, old*(1+trend))
}
}
Assuming you have population data in vector pop1989
and data for trend in trend.
require(stringr)# because str_c has better default for sep parameter
dta <- kronecker(pop1989,cumprod(1+trend))
names(dta) <- kronecker(str_c("pop",LETTERS[1:4]),1990:2000,str_c)

Resources