R character variables in function - r

I’ve worked with SAS and SQL previously, and I’m trying to get into R via a course. I’ve been set the following task by my tutor:
“Using the Iris dataset, write an R function that takes as its arguments an Iris species and attribute name and returns the minimum and maximum values of the attribute for that species.”
That sounded straightforward at first, but I’ve come unstuck trying to write the function. Below is as far as I've gotten:
#write the function
question_2 <- function(x, y, data){
new_table <- subset(data, Species==x)
themin <-min(new_table$y)
themax <-max(new_table$y)
return(themin)
return(themax)}
#test the function - Species , Attribute, Data
question_2("setosa",Sepal.Width, iris)
I assumed I needed quotes around the species when running the function, but I get an error that there were "no non-missing arguments to min/max", which I'm guessing means my attempt at making 'new_table' has brought back zero observations.
Can anyone see where I'm going wrong?
edit: thanks all for the swift and insightful responses. i'll take that reading on board. thanks again!

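Indeed, your teacher didn't give you the easiest thing to do in R. You were almost right. You can't return twice in a function: execution stops at the first return(), so the second one is never reached. Return both values in a single object instead, and use [[y]] with the column name passed as a string: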
question_2 <- function(x, y, data){
new_table <- subset(data, Species==x)
themin <-min(new_table[[y]])
themax <-max(new_table[[y]])
return(list(themin, themax))}
question_2("setosa","Sepal.Width", iris)

df$colname cannot be used with a variable to the right of $, because $ searches for a column literally named "colname" ("y" in your case) rather than for the column named by the character string stored in the variable colname (if it even exists).
The syntax df[["colname"]] is useful in this case because it allows for character input (which may also be a variable representing a character). This holds for both object types list and data.frame. In fact, a data.frame can be seen as a list of vectors.
Example
df <- data.frame(col1 = 5:7, col2 = letters[1:3])
a <- "col1"
# $ vs [[
df$col1 # works because "col1" is a column of df
df$a # does not work (returns NULL) because "a" is not a column of df
df[["col1"]] # works because "col1" is a column of df
df[[a]] # works because "col1" is a column of df
# dataframes can be seen as list of vectors
ls <- list(col1 = 5:7, col2 = letters[1:3])
ls$col1 # works
ls[[a]] # works

One problem is that you passed Sepal.Width unquoted. R did not yell at you with Object "Sepal.Width" not found only because the function never actually evaluates y: new_table$y looks for a column literally named y, finds none and returns NULL, which is why min and max complain about having no non-missing arguments. Even if Sepal.Width were a character string with the value "Sepal.Width", R would not know how to use the $ operator to get that named column from new_table, not without some needlessly advanced programming. @Flo.P's suggestion of using [[ is a good one.
You must pass y as "Sepal.Width".
Another approach: you can take advantage of subset by writing this:
question_2 <- function(x, y, data){
newy <- subset(data, subset=Species==x, select=y)
themin <-min(newy)
themax <-max(newy)
return(c(themin, themax))
}
question_2("setosa","Sepal.Width", iris)
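Putting the answers above together, a minimal sketch of the whole function might look like this. The name species_range, the input check and the named return vector are illustrative choices rather than part of the original task, and the sketch assumes the data has a Species column, as iris does.

```r
# Hypothetical helper: min and max of one attribute for one species.
species_range <- function(species, attribute, data = iris) {
  # attribute must be a character string naming an existing column
  stopifnot(is.character(attribute), attribute %in% names(data))
  # [[ ]] accepts a character variable, unlike $
  values <- data[[attribute]][data$Species == species]
  # a named vector carries both results in a single return value
  c(min = min(values), max = max(values))
}

species_range("setosa", "Sepal.Width")
```

Returning a named vector rather than an unnamed list makes it harder to mix up which value is the minimum and which is the maximum.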

Related

Retrieving variable names for numeric variables

I need to run through a large data frame and extract a vector with the name of the variables that are numeric type.
I've got stuck in my code, perhaps someone could point me to a solution.
This is how far I have got:
numericVarNames <- function(df) {
numeric_vars<-c()
for (i in colnames(df)) {
if (is.numeric(df[i])) {
numeric_vars <- c(numeric_vars, colnames(df)[i])
message(numeric_vars[i])
}
}
return(numeric_vars)
}
To run it:
teste <-numericVarNames(semWellComb)
The is.numeric assertion is not working. There is something wrong with my syntax for catching the type of each column. What is wrong?
Rather than a looping function, how about
df <- data.frame(a = c(1,2,3),
b = c("a","b","c"),
c = c(4,5,6))
## names(df)[sapply(df, class) == "numeric"]
## updated to be 'safer', as variables can have multiple classes
names(df)[sapply(df, is.numeric)]
## [1] "a" "c"
This question is worth a read
Without test data it is hard to be sure, but it looks like there is just a "grammar" issue in your code.
You wrote:
numeric_vars <- c(numeric_vars, colnames(df)[i])
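Because the loop runs over colnames(df), i is a column name rather than a position, so colnames(df)[i] indexes an unnamed character vector by name and returns NA. The way to get the column name into the concatenated vector is to include the whole referred-to subset in the parentheses:
numeric_vars <- c(numeric_vars, colnames(df[i]))
Try running it with that change and see what you get. Note that the is.numeric(df[i]) test needs attention too: df[i] is a one-column data frame, which is never numeric, so test df[[i]] instead.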
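For comparison with the one-liner, here is a sketch of the asker's loop with both problems addressed. It keeps the original function name but is only one possible rewrite, and it uses base R alone.

```r
# Hypothetical rewrite of the asker's loop; base R only.
numericVarNames <- function(df) {
  numeric_vars <- character(0)
  for (i in colnames(df)) {
    # df[[i]] extracts the column itself; df[i] would be a data frame
    if (is.numeric(df[[i]])) {
      # i is already the column name, so append it directly
      numeric_vars <- c(numeric_vars, i)
    }
  }
  numeric_vars
}

df <- data.frame(a = c(1, 2, 3), b = c("a", "b", "c"), c = c(4, 5, 6))
numericVarNames(df)
```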

Applying multiple function via sapply

I'm trying to replicate a solution for applying multiple functions in sapply posted on R-Bloggers, but I can't get it to work in the desired manner. I'm working with a simple data set, similar to the one generated below:
require(datasets)
crs_mat <- cor(mtcars)
# Triangle function
get_upper_tri <- function(cormat){
cormat[lower.tri(cormat)] <- NA
return(cormat)
}
require(reshape2)
crs_mat <- melt(get_upper_tri(crs_mat))
I would like to replace some text values across columns Var1 and Var2. The erroneous syntax below illustrates what I am trying to achieve:
crs_mat[,1:2] <- sapply(crs_mat[,1:2], function(x) {
# Replace first phrase
gsub("mpg","MPG",x),
# Replace second phrase
gsub("gear", "GeArr",x)
# Ideally, perform other changes
})
Naturally, the code is not syntactically correct and fails. To summarise, I would like to do the following:
Go through all the values in first two columns (Var1 and Var2) and perform simple replacements via gsub.
Ideally, I would like to avoid defining a separate function, as discussed in the linked post and keep everything within the sapply syntax
I don't want a nested loop
I had a look at the broadly similar subject discussed here and here but, if possible, I would like to avoid making use of plyr. I'm also interested in replacing the column values not in creating new columns and I would like to avoid specifying any column names. While working with my existing data frame it is more convenient for me to use column numbers.
Edit
Following very useful comments, what I'm trying to achieve can be summarised in the solution below:
fun.clean.columns <- function(x, str_width = 15) {
# Make character
x <- as.character(x)
# Replace various phrases
x <- gsub("perc85","something else", x)
x <- gsub("again", "once more", x) # replacement string was missing in the original; "once more" is a placeholder
x <- gsub("more","even more", x)
x <- gsub("abc","ohmg", x)
# Clean spaces
x <- trimws(x)
# Wrap strings
x <- stringr::str_wrap(x, width = str_width) # str_wrap comes from the stringr package
# Return object
return(x)
}
mean_data[,1:2] <- sapply(mean_data[,1:2], fun.clean.columns)
I don't need this function in my global environment, so I can run rm after this, but an even nicer solution would involve squeezing this within the apply syntax.
We can use mgsub from library(qdap) to replace multiple patterns. Here, I am looping over the first and second columns using lapply and assigning the results back to crs_mat[,1:2]. Note that I am using lapply instead of sapply, as lapply keeps the structure intact (sapply would simplify the result to a matrix).
library(qdap)
crs_mat[,1:2] <- lapply(crs_mat[,1:2], mgsub,
pattern=c('mpg', 'gear'), replacement=c('MPG', 'GeArr'))
Here is a start of a solution for you, I think you're capable of extending it yourself. There's probably more elegant approaches available, but I don't see them atm.
crs_mat[,1:2] <- sapply(crs_mat[,1:2], function(x) {
# Replace first phrase
step1 <- gsub("mpg","MPG",x)
# Replace second phrase. Note that this operates on the result of the first replacement.
step2 <- gsub("gear", "GeArr",step1)
# Ideally, perform other changes
return(step2)
#or one nested line, not practical if more needs to be done
#return(gsub("gear", "GeArr",gsub("mpg","MPG",x)))
})
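If the list of replacements grows, one base R option that stays inside the apply call is to fold the gsub steps with Reduce. This is only a sketch: the demo data frame is a hypothetical stand-in for the first two columns of crs_mat, and the patterns and replacements vectors are assumed to be kept at the same length.

```r
# Pattern/replacement pairs; extend both vectors together.
patterns     <- c("mpg", "gear")
replacements <- c("MPG", "GeArr")

# Small stand-in for crs_mat (hypothetical data).
demo <- data.frame(Var1 = c("mpg", "cyl", "gear"),
                   Var2 = c("gear", "mpg", "hp"),
                   value = c(1, 0.5, 0.2),
                   stringsAsFactors = FALSE)

demo[, 1:2] <- lapply(demo[, 1:2], function(x) {
  # Apply each gsub in turn, feeding the result of one into the next.
  Reduce(function(acc, k) gsub(patterns[k], replacements[k], acc),
         seq_along(patterns), as.character(x))
})
demo
```

The as.character(x) call is passed as the initial value of Reduce, so factor columns are converted before the first replacement.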

The way R handles subsetting

I'm having some trouble understanding how R handles subsetting internally and this is causing me some issues while trying to build some functions. Take the following code:
f <- function(directory, variable, number_seq) {
##Create a empty data frame
new_frame <- data.frame()
## Add every data frame in the directory whose name is in the number_seq to new_frame
## the file variable specify the path to the file
for (i in number_seq){
file <- paste("~/", directory, "/",sprintf("%03d", i), ".csv", sep = "")
x <- read.csv(file)
new_frame <- rbind.data.frame(new_frame, x)
}
## calculate and return the mean
mean(new_frame[, variable], na.rm = TRUE) ## (*) see the note below the code
}
*While calculating the mean I tried to subset first using the $ sign (new_frame$variable) and the subset function (subset(new_frame, select = variable)), but it would only return a None value. It only worked when I used new_frame[, variable].
Can anyone explain why the other subsetting approaches didn't work? It took me a really long time to figure it out, and even though I managed to make it work I still don't know why it didn't work the other ways. I really want to look inside the black box so I won't have the same issues in the future.
Thanks for the help.
This behavior has to do with the fact that you are subsetting inside a function.
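new_frame$variable looks for a column in the data frame literally named variable. subset(new_frame, select = variable) also looks for a column with the name variable first, and only falls back to the variable in the calling environment when no such column exists.
On the other hand, new_frame[, variable] evaluates variable, so it uses the value of the variable argument of f(directory, variable, number_seq) to select the column.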
The dollar sign ($) can only be used with literal column names. That avoids confusion with
dd<-data.frame(
id=1:4,
var=rnorm(4),
value=runif(4)
)
var <- "value"
dd$var
In this case, if $ accepted both variables and column names, which would you expect: the dd$var column, or the dd$value column (because var == "value")? That's why dd[, var] is different: it evaluates its argument, so it takes the character vector stored in var rather than a literal column name. You will get dd$value with dd[, var].
I'm not quite sure why you got None with subset(); I was unable to replicate that problem.
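The differences between these forms can be checked on a small data frame. The sketch below reuses the dd example with fixed numbers instead of random ones; the variable names by_dollar, by_bracket, by_double, by_subset, by_subset2 and wanted are made up for illustration.

```r
dd <- data.frame(id = 1:4,
                 var = c(10, 20, 30, 40),
                 value = c(0.1, 0.2, 0.3, 0.4))
var <- "value"

by_dollar  <- dd$var                     # literal name: the var column
by_bracket <- dd[, var]                  # evaluates var: the value column
by_double  <- dd[[var]]                  # same result as dd[, var]
by_subset  <- subset(dd, select = var)   # a column named var exists, so it wins

# subset() falls back to the variable only when no column has that name
wanted <- "value"
by_subset2 <- subset(dd, select = wanted)  # no column called wanted: the value column
```

This is why subset() can appear to work inside one function and misbehave in another: the outcome depends on whether the argument name happens to collide with a column name.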

Is there a more efficient/clean approach to an eval(parse(paste0( set up?

Sometimes I have code which references a specific dataset based on some variable ID. I have then been creating lines of code using paste0, and then eval(parse(...)) that line to execute the code. This seems to be getting sloppy as the length of the code increases. Are there any cleaner ways to have dynamic data reference?
Example:
dataset <- "dataRef"
execute <- paste0("data.frame(", dataset, "$column1, ", dataset, "$column2)")
eval(parse(text = execute))
But now imagine a scenario where dataRef would be called for 1000 lines of code, and sometimes needs to be changed to dataRef2 or dataRefX.
Combining the comments of Jack Maney and G.Grothendieck:
It is better to store your data frames that you want to access by a variable in a list. The list can be created from a vector of names using get:
mynames <- c('dataRef','dataRef2','dataRefX')
# or mynames <- paste0( 'dataRef', 1:10 )
mydfs <- setNames( lapply( mynames, get ), mynames ) # name the list so it can be indexed by dataset name
Then your example becomes:
dataset <- 'dataRef'
mydfs[[dataset]][,c('column1','column2')]
Or you can process them all at once using lapply, sapply, or a loop:
mydfs2 <- lapply( mydfs, function(x) x[,c('column1','column2')] )
@G.Grothendieck has shown you how to use get and [ to take a character value, return the value of the object it names, and then reference named elements within that object. I don't know what your code was intended to accomplish, since the result of executing that code would be to deliver values to the console, but they would not have been assigned to a name and would have been garbage collected. If you wanted to use three character values (objname, colname1 and colname2) and assign those columns to an object named after a fourth character value, you could write:
newname <- "newdf"
assign( newname, get(dataset)[ c(colname1, colname2) ] )
The lesson to learn is that assign and get are capable of taking character values and accessing or creating named objects, which can be either data objects or functions. Carl_Witthoft mentions do.call, which can construct function calls from character values.
do.call("data.frame", setNames(list( dfrm$x, dfrm$y), c('x2','y2') ) )
do.call("mean", dfrm[1])
# second argument must be a list of arguments to `mean`
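As a compact illustration of the list-based approach, base R's mget fetches several objects by name and returns a list that is already named. The two data frames below are hypothetical stand-ins for dataRef and dataRef2, and picked is a made-up name.

```r
# Stand-in data frames (hypothetical contents).
dataRef  <- data.frame(column1 = 1:3,   column2 = 4:6,   column3 = 7:9)
dataRef2 <- data.frame(column1 = 11:13, column2 = 14:16, column3 = 17:19)

# mget() returns a list named after the objects it fetched.
mydfs <- mget(c("dataRef", "dataRef2"))

# Switching datasets now means changing one character value.
dataset <- "dataRef2"
picked <- mydfs[[dataset]][, c("column1", "column2")]
picked
```

With the data frames held in a named list, the rest of the code can refer to mydfs[[dataset]] and no longer needs eval(parse(...)).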

Assignment to a data.frame with `with`

Here's an example that assigns in two different ways, one which works and one which doesn't:
library(datasets)
dat <- as.data.frame(ChickWeight)
dat$test1 <- with(dat, Time + weight)
with(dat, test2 <- Time + weight)
> colnames(dat)
[1] "weight" "Time" "Chick" "Diet" "test1"
I've grown accustomed to this behavior. Perhaps more surprising is that test2 just disappears (instead of winding up in the base environment, as I'd expect):
> ls(pattern="test")
character(0)
Note that with is a fairly simple^H^H^H^H^H^H short function:
function (data, expr, ...)
eval(substitute(expr), data, enclos = parent.frame())
First let's replicate with's functionality:
eval( substitute(Time+weight), envir=dat, enclos=parent.frame() )
Now test with a different enclosure:
testEnv <- new.env()
eval( substitute(test3 <- Time+weight), envir=dat, enclos=testEnv )
ls( envir=testEnv )
Which still doesn't assign anywhere. This disproves my hunch that it was related to the enclosing environment being discarded, and rather points to something more fundamental: the enclos argument not doing what I think it does.
I'm curious about the mechanics of why this is going on and if there's an alternative which allows assignment.
Change with to within. with is only for making variables available, not changing them.
Edit: To elaborate, I believe that both with and within create a new environment and populate it with the given list-like object (such as a data frame), and then evaluate the given expression within that environment. The difference is that with returns the result of the expression and discards the environment, while within returns the environment (converted back to whatever class it originally was, e.g. data.frame). Either way, any assignments made within the expression are presumably performed inside the created environment, which is discarded by with. This explains why test2 is nowhere to be found after doing with(dat, test2 <- Time + weight).
Note that since within returns the modified environment instead of editing it in place (i.e. call-by-value semantics), you need to do dat <- within(dat, test2 <- Time + weight).
If you want a function to do assignment to the current environment (or any specified environment), look at assign.
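The difference can be seen on a small data frame. This is a sketch with made-up numbers rather than the ChickWeight data; only base R is used.

```r
dat <- data.frame(Time = c(0, 2, 4), weight = c(42, 51, 59))

# with() returns the value of the expression; assign it yourself
dat$test1 <- with(dat, Time + weight)

# an assignment inside with() is lost with the temporary environment
with(dat, test2 <- Time + weight)
"test2" %in% names(dat)   # FALSE

# within() returns a modified copy, which must be assigned back
dat <- within(dat, test3 <- Time + weight)
names(dat)
```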
Edit 2: The modern answer is to embrace the tidyverse and use magrittr & dplyr:
library(datasets)
library(dplyr)
library(magrittr)
dat <- as.data.frame(ChickWeight)
dat %<>% mutate(test1 = Time + weight)
The last line is equivalent to
dat <- dat %>% mutate(test1 = Time + weight)
which is in turn equivalent to
dat <- mutate(dat, test1 = Time + weight)
Use whichever of the last 3 lines makes the most sense to you.
Inspired by the fact that the following works from the command line ...
eval(substitute(test <- Time + weight, dat))
... I put together the following, which seems to work.
myWith <- function(DAT, expr) {
X <- call("eval",
call("substitute", substitute(expr), DAT))
eval(X, parent.frame())
}
## Trying it out
dat <- as.data.frame(ChickWeight)
myWith(dat, test <- Time + weight)
head(test)
# [1] 42 53 63 70 84 103
(The complicated aspect of this problem is that we need substitute() to search for symbols in one environment (the current frame) while the "outer" eval() assigns into a different environment (the parent frame).)
I get the sense that this is being made way too complex. Both with and within return values calculated by operations on named columns of dataframes. If you don't assign them to anything, the value will get garbage collected. The usual way to store them is assignment to a named object, or possibly a component of an object, with the <- operator. within returns the entire dataframe, whereas with returns only the vector that was calculated from whatever operations were performed on the column names. You could, of course, use assign instead of <-, but I think overuse of that function may obfuscate rather than clarify the code. The difference in use is just assignment to an entire dataframe or just a column:
dat <- within(dat, newcol <- oldcol1*oldcol2)
dat$newcol <- with(dat, oldcol1*oldcol2)

Resources