Passing arguments for variables in a function - r

I am brand-new to R and trying to understand the basic syntax of functions.
In both of the functions f(x1) and g(x,1) below, I would like to generate y=2. Only the former works.
I'm familiar with the str_interp() and paste() functions, but those seem to work only in the context of strings, not variables. E.g., prefixnum <- str_interp("${prefix}${num}") doesn't solve the issue.
My motivation is that I'd like to call a function by specifying components of variable names. My background is in Stata, where placeholders are designated with a backtick and a tick (e.g., `prefix'`num'). I've consulted a few relevant resources, to no avail.
As an aside, I've read varying thoughts about whether variables should be prefixed with their data frame (e.g., df$var). What is the logic behind whether or not to follow this convention? Why does f(df$x1) work, but writing f(x1) and modifying the function to be y <- df$var*2 does not?
df <- data.frame(x1=1)
f <- function(var) {
y <- var*2
y
}
f(df$x1)
g <- function(prefix,num) {
y <- df$prefixnum*2 #where "prefixnum" is a placeholder of some sort
y
}
g(x,1)

You are probably trying to pass a column name as an argument to the function. You can paste prefix and num together to get the column name and use that to subset the data frame.
g <- function(data, prefix, num) {
y <- data[[paste0(prefix, num)]] *2
y
}
g(df,'x', 1)
#[1] 2
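As a minimal sketch of why the answer above uses [[ rather than $: the $ operator takes the name after it literally and never evaluates it, whereas [[ evaluates its argument first. This is also why y <- df$var*2 inside a function does not pick up the value of var. The data frame and the helper name nm below are made up for illustration.

```r
df <- data.frame(x1 = 1, x2 = 5)

nm <- paste0("x", 1)   # nm holds the string "x1"

df$nm      # NULL: `$` looks for a column literally called "nm"
df[[nm]]   # `[[` evaluates nm first, then extracts column "x1"

# Same idea wrapped in a function; the data frame is passed in explicitly
g <- function(data, prefix, num) data[[paste0(prefix, num)]] * 2
g(df, "x", 2)
```

Passing the data frame as an argument, rather than relying on a global df, keeps the function usable on any data frame.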

Related

Expand grid in R with paste

I am trying to analyse a dataframe using hierarchical clustering hclust function in R.
I would like to pass in a vector of p values I'll write beforehand (maybe something like c(5/4, 3/2, 7/4, 9/4)) and be able to have these specified as the different p value options with Minkowski distance when I use expand.grid. Ideally, when hyperparams is viewed, it would also be clear which value of p has been used for each minkowski, i.e. they should be labelled. So for example, where (if you run my code for hyperparams) there would currently just be one minkowski under Dists, for each of the methods in Meths, there would be, if I supplied the p vector as c(5/4, 3/2, 7/4, 9/4), now instead 4 rows for Minkowski distance: minkowski, p=5/4, minkowski, p=3/2, minkowski, p=7/4, minkowski, p=9/4 (or looking something like that, making the p values clear). Any ideas?
(Note: no packages please, only base R!)
Edit: I worded it poorly before, now rewritten. Let's take the following example instead:
acc <- function(x){
first = sum(x)
second = sum(x^2)
return(list(First=first,Second=second))
}
iris0 <- iris
iris1 <- cbind(log(iris[,1:4]),iris[5])
iris2 <- cbind(sqrt(iris[,1:4]),iris[5])
Now the important bit:
tests <- expand.grid(Dists=c("euclidean","maximum","manhattan","canberra","binary"),
DS=c("iris0","iris1","iris2"))
Table <- Map(function(x, ds){acc(table(ds$Species, cutree(hclust(dist(get(ds)[,1:4], method=x)),3)))},tests[[1]], tests[[2]])
This will work. But now if I want to include a term like "minkowski",p=3 in expand.grid, how would I do it?
tests <- expand.grid(Dists=c("euclidean","maximum","manhattan","canberra","binary","minkowski,p=3"),
DS=c("iris0","iris1","iris2"))
Table <- Map(function(x, ds){acc(table(ds$Species, cutree(hclust(dist(get(ds)[,1:4], method=x)),3)))},tests[[1]], tests[[2]])
This gives an error.
In reality there should be no p argument unless method="minkowski". I have tried to use strsplit to get the first part of the expression into ds, and a switch with strsplit to get the second part and then use parse (it would return NULL if the length of the strsplit was not 2 -- this should pass no argument, I think). The issue seems to be that strsplit(x, ",") fails: rather than evaluating the vectorized x, it tries to evaluate the character x, which is not a string. Can anyone suggest a workaround/fix or another method for including the minkowski,p=1.6 terms and the like?
We can create a 'p' value column
tests <- expand.grid(Dists=c("euclidean","maximum","manhattan","canberra","binary",
"minkowski3", "minkowski4", "minkowski5"),
DS=c("iris0","iris1","iris2"))
Suppose, we have another column of 'p' values in 'tests', the above solution can be changed to
tests$p <- as.list(args(dist))$p # default value
i1 <- grepl("minkowski", tests$Dists)
tests$Dists <- sub("[0-9.]+$", "", tests$Dists)
tests$p[i1] <- rep(3:5, length.out = sum(i1))
Map(function(x, ds, p){
dist1 <- dist(get(ds)[, 1:4], method = x, p = p)
ct <- cutree(hclust(dist1), 3)
acc(table(get(ds)$Species, ct))},
as.character(tests[[1]]), as.character(tests[[2]]), tests$p )
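A possible variation on the idea above, as a base-R sketch only: keep the method name and its p value in separate columns from the start, add a Label column so each Minkowski row shows its p, and cross that table with the data set names. The names p_vals, methods, ds_grid and Label are assumptions for illustration, and only two of the distance methods and two data sets are used to keep the example short.

```r
# Assumed inputs: a vector of p values and two example data sets
p_vals <- c(5/4, 3/2, 7/4, 9/4)
iris0 <- iris
iris2 <- cbind(sqrt(iris[, 1:4]), iris[5])

# One row per distance method; p = 2 is the default of dist() and is
# only used when method = "minkowski"
methods <- data.frame(
  Dists = c("euclidean", "manhattan", rep("minkowski", length(p_vals))),
  p     = c(2, 2, p_vals),
  stringsAsFactors = FALSE
)
methods$Label <- ifelse(methods$Dists == "minkowski",
                        paste0("minkowski, p=", methods$p),
                        methods$Dists)

# merge() on data frames with no shared column names returns the
# Cartesian product, i.e. every method paired with every data set
ds_grid <- data.frame(DS = c("iris0", "iris2"), stringsAsFactors = FALSE)
tests <- merge(methods, ds_grid)

res <- Map(function(m, p, ds) {
  d <- dist(get(ds)[, 1:4], method = m, p = p)
  table(get(ds)$Species, cutree(hclust(d), 3))
}, tests$Dists, tests$p, tests$DS)
names(res) <- paste(tests$Label, tests$DS, sep = " / ")
```

Because the columns are created with stringsAsFactors = FALSE, no as.character() conversion is needed before calling get().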

How do I modify arguments inside a function?

I have a series of lines of code that replace the contents of an existing column based on the contents of another column (i.e. I am creating a categorical variable where the 'cut' function is not applicable). I am new to R and want to write a function that will perform this task on all data.frames without having to insert and customize 50 lines of code each time.
X is the data frame, Y is the categorical variable, and Z is the other (string) variable. This code works:
X$Y <- ""
X <- transform(X, Y=ifelse(Z=="Alameda",20,""))
... (many more lines)
For example I do:
d.f$loc <- ""
d.f <- transform(d.f, loc=ifelse(county=="Alameda",20,""))
# ... and so on
Now I want to do this for several dataframes and different columns instead of loc and county.
However, neither of these functions produces the desired results:
ab<-function(Y,Z,env=X) {
env$Y<-transform(env,Y=ifelse(Z=="Alameda",20,""))
...
}
abc<-function(X,Y,Z) {
X<-transform(X,Y=ifelse(Z=="Alameda",20,""))
...
}
Both of these functions run without error but do not alter the data frame X in any way. Am I doing something wrong in calling the environment or using a function within another function? It seems like a simple question and I would not post if I had not already spent 5+ hours trying to learn this. Thanks in advance!
R uses "call by value" for all objects. Only the return value goes back to the calling environment (see "parameter passing mechanism in R").
You can do 
ab <- function(X, Y, Z) {
X <- transform(X, Y=ifelse(Z=="Alameda",20,""))
...
return(X)
}
If your data frames are in a list L, you can do lapply(L, ab) or, if needed, lapply(L, ab, Y=..., Z=...). As a result you will get a list of the modified data frames. BTW: also have a look at with() and within(), e.g. X$Y <- with(X, ifelse(Z=="Alameda",20,""))
Implicitly returning the value
There is no need for an explicit call of return(...); you can return implicitly, i.e. by using the fact that a function returns the value of its last evaluated expression:
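To make the call-by-value point concrete, here is a small sketch with made-up data frames d1 and d2 and a hypothetical helper add_flag. It passes the column names as strings (so they can differ between calls) and uses NA instead of "" for non-matches, which is an assumption that keeps the new column numeric.

```r
add_flag <- function(X, Y, Z) {
  X[[Y]] <- ifelse(X[[Z]] == "Alameda", 20, NA)
  X   # the modified copy is the return value
}

d1 <- data.frame(county = c("Alameda", "Marin"), stringsAsFactors = FALSE)
d2 <- data.frame(county = c("Napa", "Alameda"), stringsAsFactors = FALSE)

copy <- add_flag(d1, "loc", "county")
names(d1)     # d1 itself is unchanged: only "county"
names(copy)   # the returned copy has the new column

# Apply to several data frames at once and keep the results
L <- list(a = d1, b = d2)
L <- lapply(L, add_flag, Y = "loc", Z = "county")
```

The original data frame changes only if the result is assigned back, e.g. d1 <- add_flag(d1, "loc", "county").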
ab <- function(X, Y, Z) {
X <- transform(X, Y=ifelse(Z=="Alameda",20,""))
...
X ### <<<<< last expression
}
Here is an example of how you can do it for your situation:
ab <- function(X, Y, Z) {
X[, Y] <- ifelse(X[,Z]>12,20,99)
# ...
X ### <<<<< last expression
}
B <- BOD # BOD is one of the dataframes which come with R
ab(B, "loc", "demand")
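If the 50 lines are mostly ifelse() calls mapping one value to one code, a named vector used as a lookup table may be a shorter alternative. This is only a sketch: recode_col and codes are hypothetical names, and apart from Alameda = 20 (taken from the question) the codes are invented. Unmatched values become NA here rather than "".

```r
# Hypothetical lookup table; only Alameda = 20 comes from the question
codes <- c(Alameda = 20, Alpine = 21)

recode_col <- function(X, Y, Z, lookup) {
  # Index the named vector by the values of column Z; unknown names give NA
  X[[Y]] <- unname(lookup[as.character(X[[Z]])])
  X
}

d.f <- data.frame(county = c("Alameda", "Alpine", "Other"),
                  stringsAsFactors = FALSE)
d.f <- recode_col(d.f, "loc", "county", codes)
d.f
```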

R assign a list of values to a list of objects

Thank you for trying to help. I am happy to be corrected on all R misdemeanors.
I am not sure that I was entirely clear with my earlier post as below, so I will hope to clarify:
In the R console, I use source() (etc.) to call a .R file.
Code within the .R file uses variables (e.g. for 'extracted info') ex1, ex2, ex3. These may hold strings or (a string of) numbers pulled from text.
In line with your guidance I've renamed my function to 'reset' (and ?reset indicates no other occurrences are in scope). I'm passing both x and y from outside the function:
#send variables ex1, ex2, ex3 together with location, loc and parse, prs to be reset with 0
reset(x<-c(loc,prs,ex1,ex2,ex3),y<-rep(c(0),length(x))) #repeats 0 in y variable as many times as there are entries for x
reset<-function(x,y){
print(c("resetting ",x," with ", y))
if (length(x) == length(y)) {x <- y
print(paste(x,"=",y),sep="") #both x and y should now be equal (to y)
} else {
paste("list lengths differ: x=",length(x)," y=",length(y),sep="")
}
}
Now both x and y are 0 but ex1, ex2 and ex3 still contain the previous values
I would like ex1, ex2 and ex3 all to be 0 before they are used in a subsequent section of code, so they don't contaminate extracted data with previous values such as:
loc<-str_locate(data[i],"=")
prs<-str_locate(data[i],",")
#extract data from the end of loc to before the occurrence of prs
ex1<-str_sub(data[i],loc[2]+1,prs[1]-1)
#cleanup
#below is simplified for example;
#in reality I wish to send ex1:ex(n) to be reset with values val1:val(n)
The desired outcome would be that back in the Rconsole >ex1 should now return 0.
Hope you can understand my dilemma and possibly help.
Say my code uses some variables to hold data extracted from a string using Stringr str_sub. The variables are temporary in that I use the values to construct other strings then they should be freed up to be used in an upcoming test: i.e. if (test==true){extract<-str_sub(string, start, end)}
For a later test, I would like extract==0; simple enough, but I have a few of these and would like to do it in one fell swoop.
I've used a for loop, but if there is a simpler way, please identify this.
My attempt is using a function:
#For variables loc, prs, ex1 and ex2, set all values to 0
x<-assign(x<-c(loc, prs, ex1, ex2),y<-rep(c(0),length(x)))
#Function
assign <- function(x, y) {
if(length(x)==length(y)){
for (i in 1:length(x)){x[i]<-y[i]}
print(c("Assigned",x[i]))
return (x)
} else { print (c("list lengths differ: x=",length(x)," y=",length(y)))
}
}
The problem being that this returns x as 0, but the list of variables retain their values.
I'm a bit of a noob to both r and SO, so although I've benefitted from SO's bountiful advice on numerous occasions, this is my first question, so please be gentle. I have searched this issue, but have not found what I need in a few hours now. Hope you can help.
Beware of naming a function assign. There is already one in base-r and you will create confusion.
There are a couple of problems with your function besides its name. First, you do not need the for-loop to replace x by y, as this is a basic vectorized operation. Just use x <- y. Second, you should wrap your message in paste.
asgn <- function(x, y) {
if(length(x)==length(y)){
## This step is not needed, return(y) is better as @Rick proposed in their now deleted answer
## I am leaving it to show you how the for-loop is not needed
x<-y
return (x)
} else {
print (paste("list lengths differ: x=",length(x)," y=",length(y)))
return(x)
}
}
Then, there are a couple of problems with your function call. You use <- instead of = to specify the arguments. They are only somewhat synonymous for assigning variables, but a function argument is another matter. Finally, you are trying to use x in the definition of y in the arguments (length(x)), but this is not possible: x is not yet defined inside the call, so R looks for x in the parent environment. Build the vector first and take its length from that instead (note that length(3) would not help, since it evaluates to 1).
vals <- c(loc, prs, ex1, ex2)
x <- asgn(x = vals, y = rep(0, length(vals)))
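Note that asgn() only returns a new vector; it cannot change ex1, ex2, ... themselves, because the function receives copies of their values. Two possible ways to reset the originals are sketched below. reset_vars is a hypothetical helper built on base R's assign(), and the example values are made up.

```r
ex1 <- "abc"; ex2 <- "42"; ex3 <- "x"

# Option 1: pass the variable *names* as strings and assign into the
# caller's environment
reset_vars <- function(var_names, value = 0, envir = parent.frame()) {
  for (nm in var_names) assign(nm, value, envir = envir)
  invisible(var_names)
}
reset_vars(c("ex1", "ex2", "ex3"))

# Option 2: keep the temporary values in one list and reset it in one step
ex <- list(ex1 = "abc", ex2 = "42")
ex[] <- list(0)   # every element becomes 0, names are kept
```

The list version avoids modifying the calling environment from inside a function, which tends to be easier to follow.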

Applying multiple function via sapply

I'm trying to replicate the solution for applying multiple functions in sapply posted on R-Bloggers, but I can't get it to work in the desired manner. I'm working with a simple data set, similar to the one generated below:
require(datasets)
crs_mat <- cor(mtcars)
# Triangle function
get_upper_tri <- function(cormat){
cormat[lower.tri(cormat)] <- NA
return(cormat)
}
require(reshape2)
crs_mat <- melt(get_upper_tri(crs_mat))
I would like to replace some text values across columns Var1 and Var2. The erroneous syntax below illustrates what I am trying to achieve:
crs_mat[,1:2] <- sapply(crs_mat[,1:2], function(x) {
# Replace first phrase
gsub("mpg","MPG",x),
# Replace second phrase
gsub("gear", "GeArr",x)
# Ideally, perform other changes
})
Naturally, the code is not syntactically correct and fails. To summarise, I would like to do the following:
Go through all the values in first two columns (Var1 and Var2) and perform simple replacements via gsub.
Ideally, I would like to avoid defining a separate function, as discussed in the linked post and keep everything within the sapply syntax
I don't want a nested loop
I had a look at the broadly similar subject discussed here and here but, if possible, I would like to avoid making use of plyr. I'm also interested in replacing the column values not in creating new columns and I would like to avoid specifying any column names. While working with my existing data frame it is more convenient for me to use column numbers.
Edit
Following very useful comments, what I'm trying to achieve can be summarised in the solution below:
fun.clean.columns <- function(x, str_width = 15) {
# Make character
x <- as.character(x)
# Replace various phrases
x <- gsub("perc85","something else", x)
x <- gsub("again", x)
x <- gsub("more","even more", x)
x <- gsub("abc","ohmg", x)
# Clean spaces
x <- trimws(x)
# Wrap strings
x <- str_wrap(x, width = str_width)
# Return object
return(x)
}
mean_data[,1:2] <- sapply(mean_data[,1:2], fun.clean.columns)
I don't need this function in my global.env, so I can run rm after this, but an even nicer solution would involve squeezing this within the apply syntax.
We can use mgsub from library(qdap) to replace multiple patterns. Here, I am looping the first and second column using lapply and assign the results back to the crs_mat[,1:2]. Note that I am using lapply instead of sapply as lapply keeps the structure intact
library(qdap)
crs_mat[,1:2] <- lapply(crs_mat[,1:2], mgsub,
pattern=c('mpg', 'gear'), replacement=c('MPG', 'GeArr'))
Here is the start of a solution for you; I think you're capable of extending it yourself. There are probably more elegant approaches available, but I don't see them atm.
crs_mat[,1:2] <- sapply(crs_mat[,1:2], function(x) {
# Replace first phrase
step1 <- gsub("mpg","MPG",x)
# Replace second phrase. Note that this operates on the already-modified vector (step1).
step2 <- gsub("gear", "GeArr",step1)
# Ideally, perform other changes
return(step2)
#or one nested line, not practical if more needs to be done
#return(gsub("gear", "GeArr",gsub("mpg","MPG",x)))
})
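A base-R sketch that combines the two answers above, assuming the replacements can be written as a named vector (pattern = replacement). The small data frame crs stands in for crs_mat, and repl is a hypothetical name. Everything stays inside one anonymous function, and lapply() is used because sapply() would simplify the result to a matrix.

```r
crs <- data.frame(Var1 = c("mpg", "gear", "cyl"),
                  Var2 = c("gear", "mpg", "hp"),
                  value = c(1, 0.5, 0.2),
                  stringsAsFactors = FALSE)

# Named vector: names are the patterns, values are the replacements
repl <- c(mpg = "MPG", gear = "GeArr")

is.matrix(sapply(crs[, 1:2], toupper))   # sapply() simplifies to a matrix

crs[, 1:2] <- lapply(crs[, 1:2], function(x) {
  x <- as.character(x)
  # Apply each replacement in turn to the already-modified vector
  for (pat in names(repl)) x <- gsub(pat, repl[[pat]], x, fixed = TRUE)
  x
})
crs
```

fixed = TRUE treats the patterns as literal strings; drop it if regular expressions are needed.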

Assignment to a data.frame with `with`

Here's an example that assigns in two different ways, one which works and one which doesn't:
library(datasets)
dat <- as.data.frame(ChickWeight)
dat$test1 <- with(dat, Time + weight)
with(dat, test2 <- Time + weight)
> colnames(dat)
[1] "weight" "Time" "Chick" "Diet" "test1"
I've grown accustomed to this behavior. Perhaps more surprising is that test2 just disappears (instead of winding up in the base environment, as I'd expect):
> ls(pattern="test")
character(0)
Note that with is a fairly simple^H^H^H^H^H^H short function:
function (data, expr, ...)
eval(substitute(expr), data, enclos = parent.frame())
First let's replicate with's functionality:
eval( substitute(Time+weight), envir=dat, enclos=parent.frame() )
Now test with a different enclosure:
testEnv <- new.env()
eval( substitute(test3 <- Time+weight), envir=dat, enclos=testEnv )
ls( envir=testEnv )
Which still doesn't assign anywhere. This disproves my hunch that it was related to the enclosing environment being discarded, and rather points to something more fundamental: the enclos argument not doing what I think it does.
I'm curious about the mechanics of why this is going on and if there's an alternative which allows assignment.
Change with to within. with is only for making variables available, not changing them.
Edit: To elaborate, I believe that both with and within create a new environment and populate it with the given list-like object (such as a data frame), and then evaluate the given expression within that environment. The difference is that with returns the result of the expression and discards the environment, while within returns the environment (converted back to whatever class it originally was, e.g. data.frame). Either way, any assignments made within the expression are presumably performed inside the created environment, which is discarded by with. This explains why test2 is nowhere to be found after doing with(dat, test2 <- Time + weight).
Note that since within returns the modified environment instead of editing it in place (i.e. call-by-value semantics), you need to do dat <- within(dat, test2 <- Time + weight).
If you want a function to do assignment to the current environment (or any specified environment), look at assign.
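The behaviour described above can be checked on a small made-up data frame. The last part is an illustration of the mechanics rather than a recommendation: when eval() is given a real environment (built here with list2env()) instead of a data frame, the assignment has somewhere persistent to go.

```r
dat <- data.frame(Time = c(0, 2), weight = c(42, 51))

# with(): the assignment happens in a temporary environment that is discarded
with(dat, test2 <- Time + weight)
"test2" %in% names(dat)   # FALSE

# within(): returns a modified copy, which must be assigned
dat2 <- within(dat, test2 <- Time + weight)
names(dat2)

# eval() with an actual environment keeps the assigned variable
e <- list2env(dat)
eval(quote(test3 <- Time + weight), e)
exists("test3", envir = e, inherits = FALSE)   # TRUE
```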
Edit 2: The modern answer is to embrace the tidyverse and use magrittr & dplyr:
library(datasets)
library(dplyr)
library(magrittr)
dat <- as.data.frame(ChickWeight)
dat %<>% mutate(test1 = Time + weight)
The last line is equivalent to
dat <- dat %>% mutate(test1 = Time + weight)
which is in turn equivalent to
dat <- mutate(dat, test1 = Time + weight)
Use whichever of the last 3 lines makes the most sense to you.
Inspired by the fact that the following works from the command line ...
eval(substitute(test <- Time + weight, dat))
... I put together the following, which seems to work.
myWith <- function(DAT, expr) {
X <- call("eval",
call("substitute", substitute(expr), DAT))
eval(X, parent.frame())
}
## Trying it out
dat <- as.data.frame(ChickWeight)
myWith(dat, test <- Time + weight)
head(test)
# [1] 42 53 63 70 84 103
(The complicated aspect of this problem is that we need substitute() to search for symbols in one environment (the current frame) while the "outer" eval() assigns into a different environment (the parent frame).)
I get the sense that this is being made way too complex. Both with and within return values calculated by operations on named columns of data frames. If you don't assign them to anything, the value will get garbage collected. The usual way to store them is assignment to a named object, or possibly a component of an object, with the <- operator. within returns the entire data frame, whereas with returns only the vector that was calculated from whatever operations were performed on the column names. You could, of course, use assign instead of <-, but I think overuse of that function may obfuscate rather than clarify the code. The difference in use is just assignment to an entire data frame or just a column:
dat <- within(dat, newcol <- oldcol1*oldcol2)
dat$newcol <- with(dat, oldcol1*oldcol2)
