I'm still new to writing my own functions. As an exercise, and because I use it a lot, I want to write a flexible function to easily reverse survey response scales. This is what I came up with:
rev_scale = function(var, new_var, scale){
for (i in 1:length(abs(var))){
new_var[i] = scale-abs(var[i])+1
}
}
Info on code
var = variable I want to reverse.
new_var = new column with the reversed variable
scale = how many points in the scale (eg. 5 for a 5-point scale)
The reason why I use 'abs' instead of just 'var' is that some dataframes also return value-labels, and I only want the values in this function.
Question
When applying this new function on a variable, R returns "NULL". However, if I run the for-loop separately, with the arguments 'imputed', my new variable is properly reversed.
Any ideas on what is happening here?
Thanks in advance!
### Example of the (working) for-loop with arguments 'imputed' ###
df <- data.frame(matrix(ncol = 1, nrow = 4))
df$var = c(1,2,3,4)
for (i in 1:length(abs(df$var))){
df$var_rev[i] = 4-abs(df$var[i])+1
}
df$var_rev
OUTPUT:
[1] 4 3 2 1
R does not use reference variables (think pointers)*. So the new_var outside your function does not get updated when it is referred to inside the function. Instead, R creates a new copy of new_var inside the function and updates that.
You should instead return the new value from your function. I.e.
rev_scale = function(var, scale){
res <- vector('numeric', length(var))
for (i in 1:length(abs(var))){
res[i] = scale-abs(var[i])+1
}
return(res)
}
Also note that I have removed new_var from the function's arguments. In other words, I have completely separated the function's input arguments from its output.
The reason you get a NULL from the function is that in R, every function returns something. If nothing is specified, the function returns the value of the last evaluated expression, and here that expression is the for-loop. A for-loop (like while and repeat) always evaluates to NULL, returned invisibly.
* There are a couple of exceptions and work-arounds, but I will not go into that here.
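A minimal sketch of both points, using made-up helper names (modify_copy and loop_only are illustrative and not part of the question): a function works on its own copy of an argument, and a function whose last expression is a for-loop returns NULL.

```r
x <- c(1, 2, 3)

# Hypothetical helper: changes only its local copy of v and returns it
modify_copy <- function(v) {
  v[1] <- 99
  v
}

changed <- modify_copy(x)  # the modified copy is returned ...
# ... while x itself still holds 1 2 3

# Hypothetical helper: the last expression is a for-loop, so nothing useful is returned
loop_only <- function(v) {
  for (i in seq_along(v)) {
    v[i] <- v[i] * 2
  }
}

result <- loop_only(x)
is.null(result)  # TRUE
```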
Edit:
As benimwolfspelz noted, you do not need to explicitly iterate over each element in var, as R does this implicitly. Your entire function could be reduced to:
rev_scale = function(var, scale) {
scale-abs(var)+1
}
Secondly, in your for-loop, you can simplify length(abs(var)) to length(var) as abs(var) does not change the length of the vector.
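As a usage sketch, assuming the small data frame from the question, the vectorised version can be assigned straight to a new column:

```r
# Vectorised reversal: works on the whole vector at once
rev_scale <- function(var, scale) {
  scale - abs(var) + 1
}

df <- data.frame(var = c(1, 2, 3, 4))
df$var_rev <- rev_scale(df$var, scale = 4)
df$var_rev
# [1] 4 3 2 1
```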
I am having a little problem doing a Levene test in R. I do not get any output value, only NaN. Does anyone know what the problem might be?
I used this code:
with(Test,levene.test(Sample1,Sample2,location="median"))
The problem
Best regards
The levene.test function assumes the data are in a single vector. The second argument is a grouping variable.
Concatenate your data using the c() function: data = c(Sample1, Sample2). Construct a vector of group names like gp = rep(c('Gp1', 'Gp2'), each=240), which assumes that each sample has 240 observations. Then call the function as follows: levene.test(data, gp, location='median').
This can also be done directly:
levene.test(c(Sample1, Sample2), rep(c('Gp1', 'Gp2'), each=240), location='median')
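A small sketch of the data layout, with simulated samples standing in for Sample1 and Sample2 (the values are made up). It assumes levene.test is the function from the lawstat package, so that call is only run if the package is installed. Using times with the actual lengths instead of each=240 is my own variation, so that it also works for samples of unequal size.

```r
set.seed(1)
Sample1 <- rnorm(240)          # simulated stand-in data
Sample2 <- rnorm(240, sd = 2)  # simulated stand-in data

# One vector with all observations, one grouping vector of the same length
values <- c(Sample1, Sample2)
groups <- factor(rep(c("Gp1", "Gp2"),
                     times = c(length(Sample1), length(Sample2))))

# Assumption: levene.test comes from the lawstat package
if (requireNamespace("lawstat", quietly = TRUE)) {
  print(lawstat::levene.test(values, groups, location = "median"))
}
```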
When using the post command, I get the following error:
post command requires expressions be bound in parenthesis
My program generates a matrix which stores regression coefficients for each simulation, and then uses the post command to declare as float and place the output of the matrix in parenthesis (betas).
A sample of the code:
*Priors
set more off
global nmc=10
global l = 4 /* number of lags */
global cnt=150 /* number of countries */
set seed 10101
* Gen empty beta matrix
matrix betas = J(153,$nmc+1,.)
*** THIS IS WHERE MONTECARLO STARTS***
program bootStrapCH5, rclass
tempname sim
postfile `sim' betas using results, replace /* As trial I'll create only the betas matrix for now. */
*postfile `sim' betas alpha_mean b1_mean b2_mean b3_mean b4_mean se_alpha se1 se2 se3 se4 using results, replace
quietly {
forvalues i = 1/$nmc {
* Fixed effects regression.
reg gdp_growth_wb L(1/4).gdp_growth_wb i.id
matrix B1= e(b)
mat li B1
predict g_hat,xb
gen e_hat= gdp_growth_wb - g_hat
*gen flag=e(sample)
* Generate the "wild" errors for the forecasts
gen eta=rnormal()
gen e_star=e_hat*eta
**RECURSION
levelsof id, local(codes)
capture noisily replace y_star= _b[_cons] + _b[L.gdp_growth_wb]*L.y_star + ///
_b[L2.gdp_growth_wb]*L2.y_star + _b[L3.gdp_growth_wb]*L3.y_star + ///
_b[L4.gdp_growth_wb]*L4.y_star + e_star if (id==1 & Dini4forward==1)
forvalues cc= 2(1)150 {
capture noisily replace y_star= _b[_cons] + _b[`cc'.id] + _b[L.gdp_growth_wb]*L.y_star + ///
_b[L2.gdp_growth_wb]*L2.y_star + _b[L3.gdp_growth_wb]*L3.y_star + ///
_b[L4.gdp_growth_wb]*L4.y_star + e_star if (id==`cc' & Dini4forward==1)
}
*Regression with new sample: y_star
reg y_star L(1/4).y_star i.id
matrix b= e(b)'
matrix betas= (betas , b)
matrix list betas
post `sim' float (betas)
}
}
postclose `sim'
end
*Execute program
bootStrapCH5
use results, clear
summarize
I also tried an alternative:
post `sim' (betas)
And got the error:
> type mismatch
post: above message corresponds to expression 1, variable betas
Any ideas on how to fix this are very much appreciated.
I'm not very familiar with postfile, but I think one issue could be that you are trying to insert a kx2 matrix into a single variable inside of your loop with post.
When you initiate postfile using:
postfile `sim' betas using results
you have declared a Stata dataset with a single variable, betas.
So, instead of using
post `sim' float (betas)
you might try:
tempname sim
postfile `sim' float (betas1 betas2) using results, replace
forvalues i = 1/$nmc {
    * Some code. . .
    local rows = rowsof(betas)
    * use a separate index (j) so the inner loop does not clash with the outer i
    forvalues j = 1/`rows' {
        post `sim' (betas[`j',1]) (betas[`j',2])
    }
    * some other code. . .
}
or something similar to declare a file with the proper number of variables which you intend on posting to the dataset.
Further, I'm not sure that you can post a matrix directly anyway (I could be wrong about this). If you can't, then you could nest a forvalues loop inside of the loop you currently have to iterate through the elements of betas and post them individually - as I have done in the example above.
Finally, you are trying to cast the values of betas as data type float in your post command. I believe the storage types need to be declared in the postfile command (but again, I could be wrong about this). The first error you cite (expressions bound in parenthesis) is a direct result of including float in the post command.
Bottom line - I suspect the first error is due to declaring the data type when you try to post the data, and the second error (type mismatch) is a result of trying to insert a kx2 matrix into a variable. See below for an example of type mismatch when trying to (incorrectly) create data from a matrix:
clear *
mat a = (1\2)
set obs 2
gen x = a
Although I admittedly would have expected the error to be more analogous to this:
mat a = (1\2)
set obs 2
gen x = a*2
matrix operators that return matrices not allowed in this context
Also look at svmat for creating data from matrices.
I am trying to create (vector) objects in R without specifying their names a priori. For example, if I have a list of length 3, I want to create the objects p1 to p3, and if I have a list of length 10, the objects p1 to p10 have to be created. The length should be arbitrary and not determined in advance.
Thanks for your help!
I guess the proper way of doing that is to consider a list p = list() and then you can use p[[i]] with i as big as you wish without having specified any length.
Then once your list is filled up, you can rename it: names(p) = paste0("p",c(1:length(p)))
Finally, if you want to get all the pi variables directly accessible, you add attach(p)
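A minimal sketch of this list-based approach; the input values are made up for illustration:

```r
# Made-up input whose length is not known in advance
values <- list(rnorm(10), rnorm(20), 1:3)

# Start with an empty list and fill it element by element
p <- list()
for (i in seq_along(values)) {
  p[[i]] <- values[[i]]
}

# Name the elements p1, p2, ... after the list is filled
names(p) <- paste0("p", seq_along(p))

p$p3
# [1] 1 2 3
```

The elements stay grouped in one object and can be reached as p$p1 or p[["p1"]], so attach(p) is only needed if the bare names p1, p2, ... are really required.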
This is kind of a hack but you can do the following
short_list <- list(rnorm(10),rnorm(20),1:3)
long_list <- c(short_list,short_list )
paste0("p",seq_along(short_list))
mapply(assign, paste0("p",seq_along(short_list)), short_list, MoreArgs = list(envir = .GlobalEnv))
result:
> p3
[1] 1 2 3
You can do the same with long_list.
I don't see a statistical model for which you would need this. It is better to start working with lists like short_list or with data.frames directly.
PS: If you just want to use it for glm, you probably want to learn about formulas in R.
glm(y~., data=your_data) takes all columns in your data frame that are not named y as regressors. Maybe this helps.
assign (and maybe also attach) are often a sign that you have not yet arrived at an "Rish" version of the code.
Considering that you need this for modeling: if your $p_1, \dots, p_n$ are of the same type, you can put them into a matrix (inside a column of a data.frame; for modeling they need to be of the same length anyway):
df$matrix <- p.matrix
If you directly create the data.frame, you need to make sure the matrix is not expanded to data.frame columns:
df <- data.frame (matrix = I (matrix), ...)
Then glm (y ~ matrix, ...) will work.
For examples of this technique see e.g. packages pls or hyperSpec or the pls paper in the Journal of Statistical Software.
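A short sketch of the matrix-column technique with simulated data (p_matrix, y and the sizes are made up for illustration):

```r
set.seed(42)
n <- 50

# Simulated predictors p1..p3 collected in one matrix
p_matrix <- matrix(rnorm(n * 3), ncol = 3,
                   dimnames = list(NULL, c("p1", "p2", "p3")))
y <- rnorm(n)

# I() keeps the matrix as a single column instead of expanding it
df <- data.frame(y = y, matrix = I(p_matrix))
ncol(df)  # 2: y and the matrix column

# The whole matrix enters the model as one term
fit <- glm(y ~ matrix, data = df)
coef(fit)  # intercept plus one coefficient per matrix column
```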
I'm running various modeling algorithms on a data set. I've had best results by modeling my input variables to my responses one at a time, e.g.:
model <- train(y ~ x1 + x2 + ... + xn, ...)
Once I train my models, I'd like to not re-run them each time, so I've been trying to save them as .rda files. Here's an example loop for a random forest model (feel free to suggest a better way than a loop!):
# data_resp contains my measured responses, one per column
# data_pred contains my predictors, one per column
for (i in 1:ncol(data_resp)) {
  model <- train(data_pred_scale[!is.na(data_resp[, i]), ],
                 data_resp[!is.na(data_resp[, i]), i],
                 method = "rf",
                 tuneGrid = data.frame(.mtry = c(3:6)),
                 nodesize = 3,
                 ntree = 500)
  save(model, file = paste("./models/model_rf_", names(data_resp)[i], ".rda", sep = ""))
}
When I load the model, however, it's going to be called model.
I haven't found a good way to save the model with its corresponding name so that I can refer back to it later. I found that one can assign an object to a string like so:
assign(paste("./models/model_rf_", names(data_resp)[i], ".rda", sep = ""), train(...))
But I'm still left with how to refer to the object when I save it:
save(???, file = ...)
I don't know how to call the object by its custom name.
Lastly, even loading has presented a problem. I've tried assign("model_name", load("./model.rda")), but the resultant object, called model_name, ends up just holding a string of the object name, "model".
In looking around, I found THIS question, which seems relevant, but I'm trying to figure out how to apply it to my situation.
I could create a list with the names of each column name in data_resp (my measured responses) and then use lapply to use train(), but I'm still a bit stuck on how to dynamically refer to the new object name to keep the resultant model in.
When you save the model, save another object called 'name' which is a character string of the thing you want to name it as:
> d=data.frame(x=1:10,y=rnorm(10))
> model=lm(y~x,data=d)
> name="m1"
> save(model,name,file="save1.rda")
> d=data.frame(x=1:10,y=rnorm(10))
> model=lm(y~x,data=d)
> name="m2"
> save(model,name,file="save2.rda")
Now each file knows what it wants its resulting object to be called. How do you get that back on load? Load into a new environment, and assign:
> e=new.env()
> load("save1.rda",env=e)
> assign(e$name,e$model)
> summary(m1)
Call:
lm(formula = y ~ x, data = d)
You can now safely rm or re-use the 'e' object. You can of course wrap this in a function:
> blargh=function(f){e=new.env();load(f,env=e);assign(e$name,e$model,.GlobalEnv)}
> blargh("save2.rda")
> m2
Call:
lm(formula = y ~ x, data = d)
Note this is a double bad thing to do - firstly, you should probably store all the models in one file as a list with names. Secondly, this function has side effects, and if you had an object called m2 already it would get stomped on.
Using assign like this is nearly always a sign (dyswidt?) that you should use a list instead.
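A minimal sketch of the list-based alternative mentioned above. It uses saveRDS/readRDS, which is my own substitution for save/load here, because readRDS returns the object so you choose its name when reading it back. The models and the temporary file are made up for illustration.

```r
set.seed(1)
d <- data.frame(x = 1:10, y = rnorm(10))

# Keep all models in one named list instead of separate objects
models <- list(
  m1 = lm(y ~ x, data = d),
  m2 = lm(y ~ poly(x, 2), data = d)
)

# Write the whole list to a single file (a temporary file in this sketch)
path <- tempfile(fileext = ".rds")
saveRDS(models, path)

# readRDS returns the object, so nothing in the workspace gets overwritten
restored <- readRDS(path)
names(restored)
coef(restored[["m1"]])
```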
There is a fair amount of guesswork involved in this answer but I think this could help:
# get a vector with the column names in data_resp
modNames <- colnames( data_resp )
# create empty list
models <- as.list( NULL )
# iterate through your columns and assign the result as list members
for( n in modNames )
{
models[[n]] <- train(data_pred_scale[!is.na(data_resp[, n]), ], ### this may need modification, can't test without data
data_resp[!is.na(data_resp[, n]), n],
method = "rf",
tuneGrid = data.frame(.mtry = c(3:6)),
nodesize = 3,
ntree = 500)
}
# save the whole bunch
save( models, file = "models.rda" )
You can now retrieve, just with load( "models.rda" ), this one object, the list with all your models, and address them with list notation, either as models[[1]] or with the column name, e.g. models[["first"]].
I think the other answers about doing this with a loop are great. I used this as a chance to finally try and understand lapply better, as many of the StackOverflow questions about how to do this ended up suggesting the use of lists and lapply instead of loops.
I really like the idea of combining all results of train() into a list (which @vaettchen did in his loop), and in thinking about how to do this with a list, this is what I came up with. First, I needed my data.frame in list form, one entry per column. Since I don't really work with lists, I hunted around until just trying as.list(df), which worked like a charm.
Next, I want to apply my train function to each element of my list of measured response variables, so I defined the function like this:
# predictors are stored in data_pred
# responses are in data_resp (one per column)
# rows in data_pred/data_resp (perhaps obviously) match, one per observation
train_func <- function(y) { train(x = data_pred, y = y,
method = "rf", tuneGrid = data.frame(.mtry = 3:6),
ntree = 500) }
Now I just need to use lapply to apply the train() call on each element of data_resp. I didn't know how to create an empty, placeholder list, so thanks to @vaettchen for that (I was trying list_name <- list() without success):
models <- lapply(as.list(data_resp), train_func)
Awesomely, I found that models has its elements automatically named to my column names in data_resp, which is just fantastic. I'm using this in conjunction with the shiny package, so this will make it incredibly easy for the user to select a response variable from a drop down (which can store the response variable name) and do:
predict(models[["resp_name"]], new_data)
I think this is much better than the loop based approach and everything just happened to fall in place nicely. I realize the question explicitly asked for naming variables programmatically, so apologies if that pushed others to answer in that fashion vs. a "bigger picture" answer. The ease of lapply suggests I was trying to force a particular solution when a (at least to my eyes) much better one existed.
Bonus: I didn't realize lists could be nested (lists inside lists), but in trying it, it appears they can be! This is even better, as I'm using numerous algorithms and I can store everything in one big list object.
func_rf <- function(y) { train(x = data_pred, y = y,
                               method = "rf", tuneGrid = data.frame(.mtry = 3),
                               ntree = 100) }
# svmRadial method requires formula syntax to work with factors,
# so the train function has to be a bit different
# add `scale = FALSE` since I had to preProcess the numeric vars ahead of time
# and cbind to the factors. Without it, caret will try to scale the data
# for you, which fails for factors
func_svm <- function(y) { train(y ~ ., cbind(data_pred, y),
                                method = "svmRadial", tuneGrid = data.frame(.C = 1, .sigma = .2),
                                scale = FALSE) }
# start from an empty list (list(NULL) would add an unwanted NULL first element)
model_list <- list()
model_list$rf <- lapply(as.list(data_resp), func_rf)
model_list$svm <- lapply(as.list(data_resp), func_svm)
Now I can refer to the desired model and response variable with list syntax!
predict(model_list[["svm"]][["response_variable"]], new_data)
Super happy with this and hopefully it makes the code more efficient, faster, and I really love the "meta-object" I end up with vs. a ton of files, one per model/response variable combination, that I have to load in one at a time later on.
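For anyone who wants to try the nested-list layout without caret, here is a self-contained sketch that swaps in plain lm() for train(). The data, the helper names fit_full and fit_null, and the list entries "full" and "null" are all made up for illustration.

```r
set.seed(123)
# Simulated predictors and responses, one response per column
data_pred <- data.frame(x1 = rnorm(30), x2 = rnorm(30))
data_resp <- data.frame(resp_a = rnorm(30), resp_b = rnorm(30))

# Hypothetical model functions standing in for the train() wrappers
fit_full <- function(y) lm(y ~ ., data = cbind(data_pred, y = y))
fit_null <- function(y) lm(y ~ 1, data = data.frame(y = y))

# Outer level: model type; inner level: response variable (named by lapply)
model_list <- list()
model_list$full <- lapply(as.list(data_resp), fit_full)
model_list$null <- lapply(as.list(data_resp), fit_null)

# Pick a model by type and response name, then predict
new_data <- data_pred[1:3, ]
pred <- predict(model_list[["full"]][["resp_a"]], new_data)
```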
A bit of an old question but still without an accepted answer.
As I understand, you need to programmatically rename a variable and save it so that when reloaded it keeps the new name.
Try this:
saveWithName = function(var.name, obj){
# var.name is a string with the name of the variable you want to assign
# obj is any kind of R object (data.frame, list, etc.) you want to rename and save
assign(var.name, obj)
save(list=var.name, file=sprintf("model_%s.RData", var.name))
}
saveWithName("lab1", c(1,2))
saveWithName("lab2", c(3,4))
load("model_lab1.RData")
load("model_lab2.RData")
print(lab1)
#> [1] 1 2
print(lab2)
#> [1] 3 4