Applying a Function to a Data Frame : lapply vs traditional way - r

I have this data frame in R:
x <- seq(1, 10,0.1)
y <- seq(1, 10,0.1)
data_frame <- expand.grid(x,y)
I also have this function:
some_function <- function(x,y) { return(x+y) }
Basically, I want to create a new column in the data frame based on "some_function". I thought I could do this with the "lapply" function in R:
data_frame$new_column <-lapply(c(data_frame$x, data_frame$y),some_function)
This does not work:
Error in `$<-.data.frame`(`*tmp*`, f, value = list()) :
replacement has 0 rows, data has 8281
I know how to do this in a more "clunky and traditional" way:
data_frame$new_column = x + y
But I would like to know how to do this using "lapply" - in the future, I will have much more complicated and longer functions that will be a pain to write out like I did above. Can someone show me how to do this using "lapply"?
Thank you!

When working within a data.frame you could use apply instead of lapply:
x <- seq(1, 10,0.1)
y <- seq(1, 10,0.1)
data_frame <- expand.grid(x,y)
head(data_frame)
some_function <- function(x,y) { return(x+y) }
data_frame$new_column <- apply(data_frame, 1, \(x) some_function(x["Var1"], x["Var2"]))
head(data_frame)
To apply a function to rows set MAR = 1, to apply a function to columns set MAR = 2.
lapply, as the name suggests, is a list-apply. As a data.frame is a list of columns you can use it to compute over columns but within rectangular data, apply is often the easiest.
If some_function is written for that specific purpose, it can be written to accept a single row of the data.frame as in
x <- seq(1, 10,0.1)
y <- seq(1, 10,0.1)
data_frame <- expand.grid(x,y)
head(data_frame)
some_function <- function(row) { return(row[1]+row[2]) }
data_frame$yet_another <- apply(data_frame, 1, some_function)
head(data_frame)
Final comment: Often functions written for only a pair of values come out as perfectly vectorized. Probably the best way to call some_function is without any function of the apply-familiy as in
some_function <- function(x,y) { return(x + y) }
data_frame$last_one <- some_function(data_frame$Var1, data_frame$Var2)

Related

How to write a function with an unspecified number of arguments where the arguments are column names

I am trying to write a function with an unspecified number of arguments using ... but I am running into issues where those arguments are column names. As a simple example, if I want a function that takes a data frame and uses within() to make a new column that is several other columns pasted together, I would intuitively write it as
example.fun <- function(input,...){
res <- within(input,pasted <- paste(...))
res}
where input is a data frame and ... specifies column names. This gives an error saying that the column names cannot be found (they are treated as objects). e.g.
df <- data.frame(x = c(1,2),y=c("a","b"))
example.fun(df,x,y)
This returns "Error in paste(...) : object 'x' not found "
I can use attach() and detach() within the function as a work around,
example.fun2 <- function(input,...){
attach(input)
res <- within(input,pasted <- paste(...))
detach(input)
res}
This works, but it's clunky and runs into issues if there happens to be an object in the global environment that is called the same thing as a column name, so it's not my preference.
What is the correct way to do this?
Thanks
1) Wrap the code in eval(substitute(...code...)) like this:
example.fun <- function(data, ...) {
eval(substitute(within(data, pasted <- paste(...))))
}
# test
df <- data.frame(x = c(1, 2), y = c("a", "b"))
example.fun(df, x, y)
## x y pasted
## 1 1 a 1 a
## 2 2 b 2 b
1a) A variation of that would be:
example.fun.2 <- function(data, ...) {
data.frame(data, pasted = eval(substitute(paste(...)), data))
}
example.fun.2(df, x, y)
2) Another possibility is to convert each argument to a character string and then use indexing.
example.fun.3 <- function(data, ...) {
vnames <- sapply(substitute(list(...))[-1], deparse)
data.frame(data, pasted = do.call("paste", data[vnames]))
}
example.fun.3(df, x, y)
3) Other possibilities are to change the design of the function and pass the variable names as a formula or character vector.
example.fun.4 <- function(data, formula) {
data.frame(data, pasted = do.call("paste", get_all_vars(formula, data)))
}
example.fun.4(df, ~ x + y)
example.fun.5 <- function(data, vnames) {
data.frame(data, pasted = do.call("paste", data[vnames]))
}
example.fun.5(df, c("x", "y"))

R: I'm trying to apply rollmean(zoo) to a particular column in several dataframes which are in a list

reprod:
df1 <- data.frame(X = c(0:9), Y = c(10:19))
df2 <- data.frame(X = c(0:9), Y = c(10:19))
df3 <- data.frame(X = c(0:9), Y = c(10:19))
list_of_df <- list(A = df1, B = df2, C = df3)
list_of_df
I'm trying to apply the rollmean function from zoo to every 'Y' column in this list of dataframes.
I've tried lapply with no success, It seems no matter which way i spin it, there is no way to get around specifying the dataframe you want to apply to at some point.
This does one of the dataframes
roll_mean <- rollmean(list_of_df$A, 2)
roll_mean
obviously this doesn't work:
roll_mean1 <- rollmean(list_of_df, 2)
roll_mean1
I also tried this:
subset(may not be necessary)
Sub1 <- lapply(list_of_df, "[", 2)
roll_mean1 <- rollmean(Sub1, 2)
roll_mean1
there doesn't seem to be a way to do it without having to
specify the particular dataframe in the rollmean function
lapply(list_of_df), function(x) rollmean(list_of_df, 2))
for loop? also no success
For (i in list_of_df) {roll_mean1 <- rollmean(Sub1, 2)
Exp
}
Stating the obvious but I'm very new to coding in general and would appreciate some pointers.
It has occurred to me that even if it did work, the column that has been averaged would be one value longer than the rest of the dataframe; how would I get around that?
The question at one point says that it wants to perform the rollmean only on Y and at another point says that this works roll_mean <- rollmean(list_of_df$A, 2) but that does all columns.
1) Assuming that you want to apply rollmean to all columns:
Use lapply like this:
lapply(list_of_df, rollmean, 2)
This also works:
for(i in seq_along(list_of_df)) list_of_df[[i]] <- rollmean(list_of_df[[i]], 2)
2) If you only want to apply it to the Y column:
lapply(list_of_df, transform, Y = rollmean(Y, 2, fill = NA))
or
for(i in seq_along(list_of_df)) {
list_of_df[[i]]$Y <- rollmean(list_of_df[[i]]$Y, 2, fill = NA)
}

Using for() over variables that need to be changed

I'd like to be able use for() loop to automate the same operation that runs over many variables modifying them.
Here's simplest example to could design:
varToChange = list( 1:10, iris$Species[1:10], letters[1:10]) # assume that it has many more than just 3 elements
varToChange
for (i in varToChange ) {
if (is.character(y)) i <- as.integer(as.ordered(i))
if (is.factor(y)) i <- as.integer(i)
}
varToChange # <-- Here I want to see my elements as integers now
Here's actual example that led me to this question - taken from: Best way to plot automatically all data.table columns using ggplot2
In the following function
f <- function(dt, x,y,k) {
if (is.numeric(x)) x <- names(dt)[x]
if (is.numeric(y)) y <- names(dt)[y]
if (is.numeric(k)) k <- names(dt)[k]
ggplot(dt, aes_string(x,y, col=k)) + geom_jitter(alpha=0.1)
}
f(diamonds, 1,7,2)
instead of brutally repeating the same line many times, as a programmer, I would rather have a loop to repeat this line for me.
Something like this one:
for (i in c(x,y,k)) {
if (is.numeric(i)) i <- names(dt)[i]
}
In C/C++ this would have been done using pointers. In R - is it all possible?
UPDATE: Very nice idea to use Map below. However it does not work for this example
getColName <- function(dt, x) {
if (is.numeric(x)) {
x <- names(dt)[x]
}
x
}
f<- function(dt, x,y,k) {
list(x,y,k) <- Map(getColName, list(x,y,k), dt)
# if (is.numeric(x)) x <- names(dt)[x]
# if (is.numeric(y)) y <- names(dt)[y]
# if (is.numeric(k)) k <- names(dt)[k]
ggplot(dt, aes_string(x,y, col=k)) + geom_jitter(alpha=0.1)
}
f(diamonds, 1,7,2) # Brrr..
No need for for loop, just Map a function over each of your list items
varToChange = list( 1:10, iris$Species[1:10], letters[1:10])
myfun <- function(y) {
if (is.character(y)) y <- as.integer(as.ordered(y))
if (is.factor(y)) y <- as.integer(y)
y
}
varToChange <- Map(myfun, varToChange)
UPDATE: Map never modifies variables in place, This is simply not done in R. Use the new values returned by Map
f<- function(dt, x, y, k) {
args <- Map(function(x) getColName(dt, x), list(x=x,y=y,k=k))
ggplot(dt, aes_string(args$x,args$y, col=args$k)) + geom_jitter(alpha=0.1)
}
f(diamonds, 1,7,2)
You have two choices for iteration in R, iterate over variables themselves, or over their indices. I generally recommend iterating over indices. This case illustrates a strong advantage of that because your question is a non-issue if you are using indices.
varToChange = list( 1:10, iris$Species[1:10], letters[1:10])
for (i in seq_along(varToChange)) {
if (is.character(varToChange[[i]])) varToChange[[i]] <- as.integer(as.factor(varToChange[[i]]))
if (is.factor(varToChange[[i]])) varToChange[[i]] <- as.integer(varToChange[[i]])
}
I also replaced as.ordered() with as.factor() - the only difference between an ordered factor and a regular factor are the default contrasts used in modeling. As you are just coercing to integer, it doesn't matter.

function fundamentals

I am attempting to write a function to add a suffix to variable names in a data frame.
The code I want to turn into a function: colnames(dataframe) <- paste0(colnames(dataframe), "_suffix")
My function:
rename <- function(x, y) {colnames(x) <- paste0(colnames(x), y)}
When I call the function on the data frame I would like to append with a suffix, the data frame does not change. I am sure I am missing some fundamental understanding of how the function should work to append the data frame column names and preserve the data frame name.
Try this solution:
df <- data.frame(x = c(1, 2, 3), y = c(4, 5, 6))
rename <- function(x, y) {
colnames(x) <- paste0(colnames(x), y)
return(x)
}
As suggested above, you would overwrite the original data frame like so:
df <- rename(df, "_suffix")

set matrix element using apply in R

I am trying to assign the values from the dataframe into a matrix. The columns 2 and 3 are mapped to rows and columns respectively in the matrix. This is not working since the sim.mat is not storing the values.
score <- function(x, sim.mat) {
r <- as.numeric(x[2])
c <- as.numeric(x[3])
sim.mat[r,c] <- as.numeric(x[4])
}
mat <- apply(sim.data, 1, score, sim.mat)
Is this the right approach? If yes how can I get it to work.
No need for apply, try this:
score <- function(x, sim.mat) {
r <- as.numeric(x[[2]])
c <- as.numeric(x[[3]])
sim.mat[cbind(r,c)] <- as.numeric(x[[4]])
sim.mat
}
mat <- score(sim.data, sim.mat)
Check the "Matrices and arrays" section of ?"[" for documentation.
If you really wanted to use apply like you did, you would need your function to modify sim.data in the calling environment, do:
score <- function(x, sim.mat) {
r <- as.numeric(x[2])
c <- as.numeric(x[3])
sim.mat[r,c] <<- as.numeric(x[4])
}
apply(sim.data, 1, score, sim.mat)
sim.mat
This type of programming where functions have side-effects is really not recommended.

Resources