Apply function with multiple parameters and output variable - r

I am trying to apply a function with two arguments. The first argument is a dataframe, the second is an integer that defines a row of the df.
col_1 <- c("A", "B", "C")
col_2 <- c("red", "blue", "black")
df <- data.frame(col_1, col_2)
f <- function(x, arg1) {
x[arg1, 1]
x[arg1, 2]
}
apply(df, 1, f)
It looks like the second argument is not passed to the function. Here is the error:
Error in x[arg1, 1] : incorrect number of dimensions
When I put arg1=1 like this:
apply(df, arg1=1, f)
it gives me a FUN error:
Error in match.fun(FUN) : argument "FUN" is missing, with no default
The desired output is "A" and "red", i.e. in my real code I need to operate on the values of each row.
I also want to add an output variable to be able to save a plot that I am making in my real analysis in a file. Can I just add an "output" variable in function(x, arg1) and then do apply(df, arg1=1, f, output="output_file")?

As @Greg mentions, the purpose of this code isn't clear. However, the question seems to relate to how apply() works, so here goes:
Basically, when any of the apply family of functions is used, the user-supplied function (f(), in this case) is applied to each subset of the data that apply produces. Here, you've asked apply to go through the rows (MARGIN = 1) and call f() on each one, so the first argument to f() is a vector holding one row rather than the data frame your function expects. Indexing that vector with two subscripts, as in x[arg1, 1], is what raises "incorrect number of dimensions".
Here's some functioning code:
col_1 <- c("A", "B", "C")
col_2 <- c("red", "blue", "black")
df <- data.frame(col_1, col_2)
f <- function(x) {
  x[1]  # evaluated, but the value is discarded
  x[2]  # last expression, so this is what f() returns
}
apply(df, 1, f)
This returns all of the values of the second column as a vector: a function returns its last expression, so x[1] is discarded and x[2], the value in the second column of the current row, is what comes back for each row.
If you want the arg1 row of results, you could simply use the following:
find_row <- function(df, row) {
df[row, ]
}
find_row(df, 1)
apply() isn't required. Using a single function makes the code simpler to read and should be faster too.
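On the second part of the question (passing an extra argument such as an output name): apply() forwards any arguments given after FUN to the function, so they have to follow it. In apply(df, arg1=1, f) the function f lands in the MARGIN slot, which is why FUN is reported as missing. A minimal sketch, assuming a hypothetical row function row_label() and an illustrative output prefix (neither is from the original code):

```r
df <- data.frame(col_1 = c("A", "B", "C"),
                 col_2 = c("red", "blue", "black"))

# Hypothetical row function: `x` is one row as a character vector,
# `output` is an extra argument supplied once and reused for every row.
row_label <- function(x, output) {
  paste0(output, "_", x[1], "_", x[2], ".png")
}

# Extra arguments go after FUN and are passed on through `...`
file_names <- apply(df, 1, row_label, output = "output_file")
file_names
```

Inside the function, x holds a single row, so the same pattern could be used to build a per-row file name for a saved plot.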

Related

Apply function to dataset when function calls from two sources

I have a function that I want to apply to a dataset, but the function also uses global variables as arguments as these variables are needed elsewhere.
With this reduced example I want to apply 'pterotest' to the rows of 'data'. This test case works when the function is given V as a vector, and M and g as a single value.
df<- data.frame(matrix(ncol = 1, nrow = 3))
row.names(df) <- c("Apsaravis_ukhaana", "Jeholornis_prima", "Changchengornis_hengdaoziensis")
colnames(df) <- "M"
mass_var <- c(0.1840000, 1.6910946, 0.0858997)
df$M <- mass_var
V <- seq(0.25,30, by = 0.05)
g <- 9.81
pterotest <- function(V, M, g) {
out1 <- M*g
out2 <- V*M
return(list(V, out1, out2))
}
apply(df,1,pterotest, M = "M", g = g, V = V)
However, all I get is an error of the form:
Error in match.fun(FUN) : '1' is not a function, character or symbol
EDIT: Turning this on its head, what I could do would be to run a loop over each row, using the multiple columns as different arguments to the function, but with a 4.2M line dataset I feel vectorising might be quicker...
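One likely cause of that error: apply() has the formals X, MARGIN and FUN, and R partially matches the named argument M to MARGIN. The unnamed arguments then shift, so 1 ends up in the FUN slot. Since only the mass varies per row, a sketch that sidesteps apply() by looping over the M column with lapply() could look like this (the anonymous wrapper and the names results and momentum are illustrative assumptions, not part of the original code):

```r
df <- data.frame(M = c(0.1840000, 1.6910946, 0.0858997),
                 row.names = c("Apsaravis_ukhaana", "Jeholornis_prima",
                               "Changchengornis_hengdaoziensis"))
V <- seq(0.25, 30, by = 0.05)
g <- 9.81

pterotest <- function(V, M, g) {
  out1 <- M * g
  out2 <- V * M
  list(V, out1, out2)
}

# One call per mass; V and g are shared by every call.
results <- lapply(df$M, function(m) pterotest(V = V, M = m, g = g))
names(results) <- row.names(df)

# Fully vectorised alternative for the V * M part:
# one row per species, one column per value of V.
momentum <- outer(df$M, V)
```

For a very large dataset the outer() form avoids the per-row function call, at the cost of holding the whole matrix in memory.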

How to write a function with an unspecified number of arguments where the arguments are column names

I am trying to write a function with an unspecified number of arguments using ... but I am running into issues where those arguments are column names. As a simple example, if I want a function that takes a data frame and uses within() to make a new column that is several other columns pasted together, I would intuitively write it as
example.fun <- function(input,...){
res <- within(input,pasted <- paste(...))
res}
where input is a data frame and ... specifies column names. This gives an error saying that the column names cannot be found (they are treated as objects). e.g.
df <- data.frame(x = c(1,2),y=c("a","b"))
example.fun(df,x,y)
This returns "Error in paste(...) : object 'x' not found "
I can use attach() and detach() within the function as a work around,
example.fun2 <- function(input,...){
attach(input)
res <- within(input,pasted <- paste(...))
detach(input)
res}
This works, but it's clunky and runs into issues if there happens to be an object in the global environment that is called the same thing as a column name, so it's not my preference.
What is the correct way to do this?
Thanks
1) Wrap the code in eval(substitute(...code...)) like this:
example.fun <- function(data, ...) {
eval(substitute(within(data, pasted <- paste(...))))
}
# test
df <- data.frame(x = c(1, 2), y = c("a", "b"))
example.fun(df, x, y)
##   x y pasted
## 1 1 a    1 a
## 2 2 b    2 b
1a) A variation of that would be:
example.fun.2 <- function(data, ...) {
data.frame(data, pasted = eval(substitute(paste(...)), data))
}
example.fun.2(df, x, y)
2) Another possibility is to convert each argument to a character string and then use indexing.
example.fun.3 <- function(data, ...) {
vnames <- sapply(substitute(list(...))[-1], deparse)
data.frame(data, pasted = do.call("paste", data[vnames]))
}
example.fun.3(df, x, y)
3) Other possibilities are to change the design of the function and pass the variable names as a formula or character vector.
example.fun.4 <- function(data, formula) {
data.frame(data, pasted = do.call("paste", get_all_vars(formula, data)))
}
example.fun.4(df, ~ x + y)
example.fun.5 <- function(data, vnames) {
data.frame(data, pasted = do.call("paste", data[vnames]))
}
example.fun.5(df, c("x", "y"))

using mapply with ggplot

Continuing on my quest to work with functions and ggplot:
I sorted out basic ways on how to use lapply and ggplot to cycle through a list of y_columns to make some individual plots:
require(ggplot2)
# using lapply with ggplot
df <- data.frame(x=c("a", "b", "c"), col1=c(1, 2, 3), col2=c(3, 2, 1), col3=c(4, 2, 3))
cols <- colnames(df[2:4])
myplots <- vector('list', 3)
plot_function <- function(y_column, data) {
ggplot(data, aes_string(x="x", y=y_column, fill = "x")) +
geom_col() +
labs(title=paste("lapply:", y_column))
}
myplots <- lapply(cols, plot_function, df)
myplots[[3]]
I now want to bring in a second variable that I will use to select rows. In my minimal example I am skipping the selection and just reusing the same plots and dfs as before; I simply add 3 iterations. So I would like to generate the same three plots as above, but now labelled as iteration A, B, and C.
It took me a while to sort out the syntax, but I now get that mapply needs two vectors of identical length that get passed on to the function as matched pairs. So I am using expand.grid to generate all pairs of variable 1 and variable 2 to create a dataframe and then pass the first and second column on via mapply. The next problem to sort out was that I need to pass on the dataframe as a list via MoreArgs =. So it seems like everything should be good to go. I am using the same syntax for aes_string() as above in my lapply example.
However, for some reason it is now not evaluating the y_column properly, but simply taking it as a value to plot, not as an indicator to plot the values contained in df$col1.
HELP!
require(ggplot2)
# using mapply with ggplot
df <- data.frame(x=c("a", "b", "c"), col1=c(1, 2, 3), col2=c(3, 2, 1), col3=c(4, 2, 3))
cols <- colnames(df[2:4])
iteration <- c("Iteration A", "Iteration B", "Iteration C")
multi_plot_function <- function(y_column, iteration, data) {
plot <- ggplot(data, aes_string(x="x", y=y_column, fill = "x")) +
geom_col() +
labs(title=paste("mapply:", y_column, "___", iteration))
}
# mapply call
combo <- expand.grid(cols=cols, iteration=iteration)
myplots <- mapply(multi_plot_function, combo[[1]], combo[[2]], MoreArgs = list(df), SIMPLIFY = F)
myplots[[3]]
We may need to use rowwise here
out <- lapply(asplit(combo, 1), function(x)
multi_plot_function(x[1], x[2], df))
In the OP's code, the only issue is that the columns of 'combo' are factors, so the y_column value is not parsed as a column name. If we change them to character, it works:
out2 <- mapply(multi_plot_function, as.character(combo[[1]]),
as.character(combo[[2]]), MoreArgs = list(df), SIMPLIFY = FALSE)
Testing:
out2[[1]]
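The factor issue can be seen without drawing any plots. expand.grid() converts character vectors to factors by default, and its stringsAsFactors argument turns that off, which avoids the as.character() calls. A minimal sketch (the label-building function stands in for multi_plot_function and is only illustrative):

```r
cols <- c("col1", "col2", "col3")
iteration <- c("Iteration A", "Iteration B", "Iteration C")

# Default behaviour: character inputs become factors
combo_factor <- expand.grid(cols = cols, iteration = iteration)
class(combo_factor$cols)

# Keep the inputs as character vectors instead
combo_char <- expand.grid(cols = cols, iteration = iteration,
                          stringsAsFactors = FALSE)
class(combo_char$cols)

# mapply() walks both columns in parallel, one matched pair per call
plot_titles <- mapply(
  function(y_column, iteration) paste("mapply:", y_column, "___", iteration),
  combo_char$cols, combo_char$iteration,
  USE.NAMES = FALSE
)
```

With character columns, each y_column reaching the function is a plain string, which is the form aes_string() expects.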

Calculate top & lowest ten percent values in multiple columns in R

Load library and sample data:
library(MASS)
View(Cars93)
Cars93$ID=1:93
Now I want to subset Cars93 so that the new data frames (sub0l and sub0h) keep all IDs and all columns, but in columns 17:25 retain only the top 10% of values (sub0h) or the lowest 10% of values (sub0l). The remaining values (the 10th to 100th percentile range for sub0l, the 0 to 90th for sub0h) can be changed to NA.
Here is my attempt to create two dfs with top ten% or lowest ten% values from columns 17:25:
sub0l <- do.call(rbind,by (Cars93,Cars93$ID,FUN= function(x)
subset(Cars93, (Cars93[,17:25] <= quantile(Cars93[,17:25], probs= .10)))))
sub0h <- do.call(rbind,by (Cars93,Cars93$ID,FUN= function(x)
subset(Cars93, (Cars93[,17:25] >= quantile(Cars93[,17:25], probs= .91)))))
I get an error while subsetting the top and lowest ten percent of the columns:
Error in `[.data.frame`(Cars93, ,17:25) : undefined columns selected
Called from: `[.data.frame`(Cars93, ,17:25)
Any better alternative?
I think the following returns what you are looking for
sub0l <- cbind(Cars93[,1:16], sapply(Cars93[,17:25],
function(i) ifelse(i > quantile(i, probs=0.1, na.rm=T) | is.na(i), NA, i)))
sub0h <- cbind(Cars93[,1:16], sapply(Cars93[,17:25],
function(i) ifelse(i < quantile(i, probs=0.91, na.rm=T) | is.na(i), NA, i)))
The sapply function loops through each variable in the data.frame and calls the anonymous function on it. Within each pass, the anonymous function receives the variable as a vector through the "i" argument and computes its quantile. The vector is then passed to the ifelse function, which looks at each element of i and checks whether it passes the test. If the element passes the test, it is replaced by NA; if it fails, its original value is returned. This process works well for variables that are numeric.
If some of the variables are not numeric, then you can add an additional check in the sapply functions as below:
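sub0l <- cbind(Cars93[,1:16],
  sapply(Cars93[,17:25],
    function(i) {
      if(is.numeric(i)) {
        # numeric column: blank out everything above the 10th percentile
        ifelse(i > quantile(i, probs=0.1, na.rm=TRUE) | is.na(i), NA, i)
      }
      else i  # non-numeric column: return unchanged
    }))
sub0h <- cbind(Cars93[,1:16],
  sapply(Cars93[,17:25],
    function(i) {
      if(is.numeric(i)) {
        # numeric column: blank out everything below the upper cutoff
        ifelse(i < quantile(i, probs=0.91, na.rm=TRUE) | is.na(i), NA, i)
      }
      else i  # non-numeric column: return unchanged
    }))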
Before beginning the operation described above, the anonymous function checks whether the vector i is numeric (in R, this means type double or integer; see ?typeof for a discussion of the core element types in R). If this test fails, the vector is returned unchanged by else i. If the test passes, the process described above begins.
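The masking step is easier to check on a small data frame where the cutoffs can be worked out by hand. The sketch below is an illustration under stated assumptions: the toy data, the helper names keep_lowest() and keep_highest(), and the cutoffs p = 0.10 and p = 0.90 are all made up for the example. It assigns the result of lapply() back into the selected columns, so every other column (including an ID) is kept as is.

```r
toy <- data.frame(id = 1:10,
                  a  = 1:10,
                  b  = seq(10, 100, by = 10))

# Keep values at or below the p-th quantile, set the rest to NA
keep_lowest <- function(v, p = 0.10) {
  cutoff <- quantile(v, probs = p, na.rm = TRUE)
  ifelse(v > cutoff | is.na(v), NA, v)
}

# Keep values at or above the p-th quantile, set the rest to NA
keep_highest <- function(v, p = 0.90) {
  cutoff <- quantile(v, probs = p, na.rm = TRUE)
  ifelse(v < cutoff | is.na(v), NA, v)
}

value_cols <- c("a", "b")

low <- toy
low[value_cols] <- lapply(toy[value_cols], keep_lowest)

high <- toy
high[value_cols] <- lapply(toy[value_cols], keep_highest)
```

Here only the smallest value of each column survives in low and only the largest in high, while id is untouched in both.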

R df process each column by a different function provided in a list of functions

I guess my problem is very simple, but I could not find the solution on the web yet.
I would like to modify a data frame with a set of functions.
The functions are defined in a list. They may have more than one argument, but one argument is always the value found in the corresponding column of the df.
I used the built-in BOD data set just for convenience. The list could be this:
funs <- list(
fn1 = function(x) x+1,
fn2 = function(x) x-1
)
The function call could look like this:
searchedFunc(BOD, funs)
So after the modification, the Time column values are increased by 1 and the demand column values are decreased by 1.
You can use sapply to be more flexible:
funs <- list(
fn1 = function(x) x+1,
fn2 = function(x) x-1
)
searchedFunc <- function(df, fns) {
sapply(seq(along.with=fns), function(i) fns[[i]](df[, i]))
}
searchedFunc(BOD, funs)
Hope it helps,
alex
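Note that sapply() returns a plain matrix here, so the column names are lost. If the result should stay a data frame with its original column names, one possible variant is a simple loop over the function list. This is a sketch, and apply_by_column() is a hypothetical name; it assumes the i-th function belongs to the i-th column.

```r
funs <- list(
  fn1 = function(x) x + 1,
  fn2 = function(x) x - 1
)

# Hypothetical helper: apply fns[[i]] to the i-th column of df
apply_by_column <- function(df, fns) {
  stopifnot(length(fns) == ncol(df))
  for (i in seq_along(fns)) {
    df[[i]] <- fns[[i]](df[[i]])
  }
  df
}

res <- apply_by_column(BOD, funs)
res
```

Functions that need further arguments could be wrapped in the list as closures, for example function(x) round(x, 1).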
