Write an R function that repeats data-manipulation routines

Here is what I would expect the function to do:
datalist <- c("var1","var2",...)
my.function <- function(dlist){
n <- length(dlist)
varnames <- paste("data", dlist, sep = ".")
for (...) { # for each var in 'varnames'
... # grab each variable from some specific online dataset;
... # do some basic data manipulation for each variable
}
... # return all the results
}
The main difficulty for me is:
(1) how to write the loop so that the grabbed data are properly stored temporarily, and
(2) how to return multiple variables after the loop has finished.
EDIT:
The loop can create the variables I want, say VAR1 and VAR2, whose names were passed in the 'dlist' argument. However, I cannot manipulate VAR1 or VAR2 inside the function: dlist[1] or dlist[2] only gives me a string, not the variable itself.
Thanks in advance.

I think I have solved the problem and made the function work as I expected.
As I described in the question, the main problem is how to manipulate the variables when VAR1 and VAR2 are only available as strings inside the function.
eval combined with as.name should work:
eval(as.name(dlist[i]))
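A minimal sketch of how the pieces could fit together, assuming the variables already exist under the names given in dlist. VAR1 and VAR2 below are made-up stand-ins for the downloaded data, and doubling the values is only a placeholder for the real manipulation. Collecting each result in a named list addresses both points of the question: the list holds the intermediate results during the loop, and returning the list returns all of them at once. get(dlist[i]) should behave the same as eval(as.name(dlist[i])) here and is the more common spelling.

```r
# Hypothetical stand-ins for the variables grabbed from the online dataset
VAR1 <- c(1, 2, 3)
VAR2 <- c(10, 20, 30)

my.function <- function(dlist) {
  # Pre-allocate a named list to hold one result per variable
  results <- vector("list", length(dlist))
  names(results) <- dlist
  for (i in seq_along(dlist)) {
    # Turn the string into the object it names
    x <- eval(as.name(dlist[i]))
    # Placeholder manipulation: double every value
    results[[i]] <- x * 2
  }
  # Return all results together as one list
  results
}

out <- my.function(c("VAR1", "VAR2"))
out$VAR1
out[["VAR2"]]
```

Each element can then be pulled out by name, as in the last two lines.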

Related

R function to reverse a survey item produces NULL

I'm still new to writing my own functions. As an exercise, and because I use it a lot, I want to write a flexible function to easily reverse survey response scales. This is what I came up with:
rev_scale = function(var, new_var, scale){
for (i in 1:length(abs(var))){
new_var[i] = scale-abs(var[i])+1
}
}
Info on code
var = variable I want to reverse.
new_var = new column with the reversed variable
scale = how many points in the scale (eg. 5 for a 5-point scale)
The reason why I use 'abs' instead of just 'var' is that some dataframes also return value-labels, and I only want the values in this function.
Question
When applying this new function on a variable, R returns "NULL". However, if I run the for-loop separately, with the arguments 'imputed', my new variable is properly reversed.
Any ideas on what is happening here?
Thanks in advance!
### Example of the (working) for-loop with arguments 'imputed' ###
df <- data.frame(matrix(ncol = 1, nrow = 4))
df$var = c(1,2,3,4)
for (i in 1:length(abs(df$var))){
df$var_rev[i] = 4-abs(df$var[i])+1
}
df$var_rev
OUTPUT:
[1] 4 3 2 1
R does not use reference variables (think pointers)*. So the new_var outside your function does not get updated when it is referred to inside the function. Instead, R creates a new copy of new_var and updates that.
You should instead return the new value from your function. I.e.
rev_scale = function(var, scale){
res <- vector('numeric', length(var))
for (i in 1:length(abs(var))){
res[i] = scale-abs(var[i])+1
}
return(res)
}
Also note that I have removed new_var from the function's arguments. In other words, I have completely separated the function's input arguments from its output.
The reason you get a NULL from the function is that in R every function returns something. If nothing is returned explicitly, the function returns the value of its last evaluated expression. A loop (for, while, repeat) evaluates to NULL, as does an if whose condition is false and which has no else branch, so a function that ends with one of these returns NULL.
* There are a couple of exceptions and work-arounds, but I will not go into that here.
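To make the return-value rule concrete, here is a small sketch with two made-up functions (the names and the doubling are only for illustration). Both modify a local copy of x in a loop; only the second one states the result as its last expression.

```r
# Ends with a for loop: the loop evaluates to NULL, so the function returns NULL
ends_with_loop <- function(x) {
  for (i in seq_along(x)) {
    x[i] <- x[i] * 2
  }
}

# Same loop, but the modified vector is the last expression and is returned
ends_with_value <- function(x) {
  for (i in seq_along(x)) {
    x[i] <- x[i] * 2
  }
  x
}

r1 <- ends_with_loop(c(1, 2, 3))
r2 <- ends_with_value(c(1, 2, 3))
is.null(r1)  # TRUE
r2           # 2 4 6
```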
Edit:
As benimwolfspelz noted, you do not need to explicitly iterate over each element in var, as R does this implicitly. Your entire function could be reduced to:
rev_scale = function(var, scale) {
scale-abs(var)+1
}
Secondly, in your for-loop, you can simplify length(abs(var)) to length(var), as abs(var) does not change the length of the vector.
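For completeness, a short usage sketch of the vectorised version applied to the example data from the question. The function returns the reversed values and the caller assigns them to the new column.

```r
rev_scale <- function(var, scale) {
  # Vectorised: operates on every element of var at once
  scale - abs(var) + 1
}

df <- data.frame(var = c(1, 2, 3, 4))
# Assign the returned vector to a new column in the caller
df$var_rev <- rev_scale(df$var, scale = 4)
df$var_rev
# [1] 4 3 2 1
```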

How to reference variables from a list when looping over variables using "for"

I am a beginner at R coming from Stata, and my first headache is figuring out how to loop over a list of names, applying the same operation to each of them. The names are variables from a data frame. I tried defining a list in this way: mylist<- c("df$name1", "df$name2") and then I tried: for (i in mylist) { i } which I hoped would be equivalent to writing df$name1 and then df$name2, so that R would print the content of the variables name1 and name2 from the data frame df. I tried other commands, like deleting a variable with i=NULL inside the for loop, but that didn't work either. I would greatly appreciate it if someone could tell me what I am doing wrong. I wonder if it has something to do with the way I write the i; maybe R does not interpret it to mean the elements of my character vector.
For more clarification, I will write out the code I would use in Stata in this instance. Instead of asking Stata to print the content of a variable, I am asking it to give summary statistics of a variable, i.e. the number of observations, mean, standard deviation, min and max, using the summarize command. In Stata I don't need to refer to the data frame, as I usually have only one dataset in memory, and I need only write:
foreach i in name1 name2 { #name1 and name2 being the names of the variables
summarize `i'
}
So far, I haven't managed to do the same thing using the for function in R, which I naively thought would be:
mylist<-c("df$name1", "df$name2")
for (i in mylist) {
summary(i)
}
You probably just need to print the name to see it. For example, if we have a data frame like this:
df <- data.frame("A" = "a", "B" = "b", "C" = "c")
df
# > A B C
# > 1 a b c
names(df)
# "A" "B" "C"
We can operate on the names using a for loop on the names(df) vector (no need to define a special list).
for (name in names(df)){
print(name)
# your code here
}
R is a little more reticent to let you use strings/locals as code than Stata is. You can do it with functions like eval but in general that's not the ideal way to do it.
In the case of variable names, though, you're in luck, as you can use a string to pull out a variable from a data.frame with [[]]. For example:
df <- data.frame(a = 1:10,
b = 11:20,
c = 21:30)
for (i in c('a','b')) {
print(i)
print(summary(df[[i]]))
}
Notes:
if you want an object printed from inside a for loop you need to use print().
I'm assuming that you're using the summary() function just as an example and so need the loop. But if you really just want a summary of each variable, summary(df) will do them all, or summary(df[,c('a','b')]) to just do a and b. Or check out the stargazer() function in the stargazer package, which has defaults that will feel pretty comfortable for a Stata user.
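If the goal is to collect one result per variable rather than print it, the same [[ ]] lookup can be used inside sapply. This is only a sketch: the data frame repeats the example above, and vars and means are names chosen here for illustration.

```r
df <- data.frame(a = 1:10,
                 b = 11:20,
                 c = 21:30)

# Column names to process, given as plain strings (no "df$" prefix)
vars <- c("a", "b")

# Look up each column by name and compute its mean;
# sapply returns a numeric vector named after the elements of vars
means <- sapply(vars, function(v) mean(df[[v]]))
means
```

The result is a named vector, so means["a"] gives the mean of column a.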

get() not working for column in a data frame in a list in R (phew)

I have a list of data frames. I want to use lapply on a specific column for each of those data frames, but I keep throwing errors when I tried methods from similar answers:
The setup is something like this:
a <- list(*a series of data frames that each have a column named DIM*)
dim_loc <- lapply(1:length(a), function(x){paste0("a[[", x, "]]$DIM")})
Eventually, I'll want to write something like results <- lapply(dim_loc, *some function on the DIMs*)
However, when I try get(dim_loc[[1]]), say, I get an error: Error in get(dim_loc[[1]]) : object 'a[[1]]$DIM' not found
But I can return values from function(a[[1]]$DIM) all day long. It's there.
I've tried working around this by using as.name() in the dim_loc assignment, but that doesn't seem to do the trick either.
I'm curious 1. what's up with get(), and 2. if there's a better solution. I'm constraining myself to the apply family of functions because I want to try to get out of the for-loop habit, and this name-as-list method seems to be preferred based on something like R- how to dynamically name data frames?, but I'd be interested in other, more elegant solutions, too.
I'd say that if you want to modify an object in place you are better off using a for loop, since lapply would require the <<- assignment operator (<- doesn't work inside lapply). Like so:
set.seed(1)
aList <- list(cars = mtcars, iris = iris)
for(i in seq_along(aList)){
aList[[i]][["newcol"]] <- runif(nrow(aList[[i]]))
}
As opposed to...
invisible(
lapply(seq_along(aList), function(x){
aList[[x]][["newcol"]] <<- runif(nrow(aList[[x]]))
})
)
You have to use invisible(), otherwise lapply would print the output on the console. The <<- assigns the vector runif(...) to the newly created column.
If you want to produce another set of data.frames using lapply then you do:
lapply(seq_along(aList), function(x){
aList[[x]][["newcol"]] <- runif(nrow(aList[[x]]))
return(aList[[x]])
})
Also, may I suggest the use of seq_along(list) in lapply and for loops as opposed to 1:length(list) since it avoids unexpected behavior such as:
# no length list
seq_along(list()) # prints integer(0)
1:length(list()) # prints 1 0.
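On the get() part of the question: get() looks up an object by its name, and "a[[1]]$DIM" is not a name but an expression, so no object with that name exists. A sketch of one way around it, assuming every data frame in the list has a DIM column (the toy data and the use of mean() are placeholders for the real data and function): pass the list itself to lapply and extract the column inside the function, so no strings are needed.

```r
# Toy list of data frames, each with a DIM column (made-up values)
a <- list(
  data.frame(DIM = c(1, 2, 3), other = 4:6),
  data.frame(DIM = c(10, 20), other = 7:8)
)

# Iterate over the data frames directly and pull out the column by name
dims <- lapply(a, function(d) d[["DIM"]])

# Apply some function to each DIM vector; mean() is just an example
results <- sapply(dims, mean)
results
```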

Assign dataframe name to a variable in a function

I have created a function and am passing a data frame to it as a parameter. Now I would like to take that data frame's name and store it in a string variable.
Code used:
RFun <- function(a){
args=(commandArgs(TRUE))
l<<-80
h<<-85
fname<<-paste(a,"_Temp.csv")
a_R<-filter(a_RW,cs==2|cs==3)
a_R<-a_Rinse[-c(2,3)]
write.csv(a_R,file=fname,row.names=FALSE)
a_Rinse_Temperature_Deviations <- read.csv(paste("~/",fname))
}
RFun(df)
When I run the function, it creates the numeric variables l and h with the values I specified, but fname is built from the contents of the whole data frame (all rows and columns) rather than from its name, which is not what I need.
It also takes a long time to execute.
The expected fname is df_Temp.csv, where df is the data frame.
Looks like assign(String varName , obj Value) might get you where you need to be.
RFun<-function(a){
args=(commandArgs(TRUE))
l<<-80
h<<-85
fname <<- "File_Name_Text"
assign (fname,paste(a,"_Temp.csv"))
a_R<-filter(a_RW,cs==2|cs==3)
a_R<-a_Rinse[-c(2,3)]
write.csv(a_R,file=fname,row.names=FALSE)
a_Rinse_Temperature_Deviations <- read.csv(paste("~/",fname))
}
It's hard to follow without a working example. But try to assign only the "name" of your df instead of the complete df. Try this:
fname <<- paste(deparse(substitute(a)),"_Temp.csv",sep="")
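A stripped-down sketch of that idea, leaving out the filtering and file writing. make_fname is a hypothetical helper name used only for this illustration. substitute(a) captures the expression the caller passed, and deparse() turns it into a string.

```r
make_fname <- function(a) {
  # substitute(a) is the unevaluated argument (here the symbol df);
  # deparse() converts it to the string "df"
  paste0(deparse(substitute(a)), "_Temp.csv")
}

df <- data.frame(x = 1:3)
fname <- make_fname(df)
fname
# [1] "df_Temp.csv"
```

Returning the name from the function, instead of assigning it with <<-, keeps the result local to the caller.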

The way R handles subsetting

I'm having some trouble understanding how R handles subsetting internally and this is causing me some issues while trying to build some functions. Take the following code:
f <- function(directory, variable, number_seq) {
## Create an empty data frame
new_frame <- data.frame()
## Add every data frame in the directory whose name is in the number_seq to new_frame
## the file variable specify the path to the file
for (i in number_seq){
file <- paste("~/", directory, "/",sprintf("%03d", i), ".csv", sep = "")
x <- read.csv(file)
new_frame <- rbind.data.frame(new_frame, x)
}
## calculate and return the mean
mean(new_frame[, variable], na.rm = TRUE)  # see note (*) below
}
*While calculating the mean, I first tried to subset with the $ sign (new_frame$variable) and with the subset function (subset(new_frame, select = variable)), but both only returned a None value. It only worked when I used new_frame[, variable].
Can anyone explain why the other ways of subsetting didn't work? It took me a really long time to figure it out, and even though I managed to make it work, I still don't know why the other ways failed. I really want to look inside the black box so I won't have the same issues in the future.
Thanks for the help.
This behavior has to do with the fact that you are subsetting inside a function.
Both new_frame$variable and subset(new_frame, select = variable) look for a column in the data frame with the name variable.
On the other hand, new_frame[, variable] uses the value of the variable argument in f(directory, variable, number_seq) to select the column.
The dollar sign ($) can only be used with literal column names. That avoids confusion with
dd<-data.frame(
id=1:4,
var=rnorm(4),
value=runif(4)
)
var <- "value"
dd$var
If $ accepted variables as well as column names, which would you expect here: the dd$var column, or the dd$value column (because var == "value")? dd[, var] is different: it evaluates var and uses the resulting character vector to select the column, rather than treating var as a literal column name. So dd[, var] gives you the same column as dd$value.
I'm not quite sure why you got None with subset(); I was unable to replicate that problem.
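The difference can be checked directly with fixed values in place of the random ones above. The sketch also shows how a function might take the column name as a string; col_mean is a hypothetical helper, not part of the original code.

```r
dd <- data.frame(
  id    = 1:4,
  var   = c(10, 20, 30, 40),
  value = c(0.1, 0.2, 0.3, 0.4)
)
var <- "value"

by_dollar  <- dd$var     # literal column name: the "var" column
by_bracket <- dd[, var]  # var is evaluated first: the "value" column
by_double  <- dd[[var]]  # same idea, always returns a single column

# Inside a function, select the column with [[ ]] and the string argument
col_mean <- function(data, variable) {
  mean(data[[variable]], na.rm = TRUE)
}
m <- col_mean(dd, "value")
```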
