Apply function to dataset when function calls from two sources - r

I have a function that I want to apply to a dataset, but the function also uses global variables as arguments as these variables are needed elsewhere.
With this reduced example I want to apply 'pterotest' to the rows of 'data'. This test case works when the function is given V as a vector, and M and g as a single value.
df<- data.frame(matrix(ncol = 1, nrow = 3))
row.names(df) <- c("Apsaravis_ukhaana", "Jeholornis_prima", "Changchengornis_hengdaoziensis")
colnames(df) <- "M"
mass_var <- c(0.1840000, 1.6910946, 0.0858997)
df$M <- mass_var
V <- seq(0.25,30, by = 0.05)
g <- 9.81
pterotest <- function(V, M, g) {
out1 <- M*g
out2 <- V*M
return(list(V, out1, out2))
}
apply(df,1,pterotest, M = "M", g = g, V = V)
However, all I get is an error of the form:
Error in match.fun(FUN) : '1' is not a function, character or symbol
EDIT: Turning this on it's head, what I could do would be to run a loop over each row, using the multiple columns as different arguments to the function, but with a 4.2M line dataset I feel vectorising might be quicker...

Related

How to write a function with an unspecified number of arguments where the arguments are column names

I am trying to write a function with an unspecified number of arguments using ... but I am running into issues where those arguments are column names. As a simple example, if I want a function that takes a data frame and uses within() to make a new column that is several other columns pasted together, I would intuitively write it as
example.fun <- function(input,...){
res <- within(input,pasted <- paste(...))
res}
where input is a data frame and ... specifies column names. This gives an error saying that the column names cannot be found (they are treated as objects). e.g.
df <- data.frame(x = c(1,2),y=c("a","b"))
example.fun(df,x,y)
This returns "Error in paste(...) : object 'x' not found "
I can use attach() and detach() within the function as a work around,
example.fun2 <- function(input,...){
attach(input)
res <- within(input,pasted <- paste(...))
detach(input)
res}
This works, but it's clunky and runs into issues if there happens to be an object in the global environment that is called the same thing as a column name, so it's not my preference.
What is the correct way to do this?
Thanks
1) Wrap the code in eval(substitute(...code...)) like this:
example.fun <- function(data, ...) {
eval(substitute(within(data, pasted <- paste(...))))
}
# test
df <- data.frame(x = c(1, 2), y = c("a", "b"))
example.fun(df, x, y)
## x y pasted
## 1 1 a 1 a
## 2 2 b 2 b
1a) A variation of that would be:
example.fun.2 <- function(data, ...) {
data.frame(data, pasted = eval(substitute(paste(...)), data))
}
example.fun.2(df, x, y)
2) Another possibility is to convert each argument to a character string and then use indexing.
example.fun.3 <- function(data, ...) {
vnames <- sapply(substitute(list(...))[-1], deparse)
data.frame(data, pasted = do.call("paste", data[vnames]))
}
example.fun.3(df, x, y)
3) Other possibilities are to change the design of the function and pass the variable names as a formula or character vector.
example.fun.4 <- function(data, formula) {
data.frame(data, pasted = do.call("paste", get_all_vars(formula, data)))
}
example.fun.4(df, ~ x + y)
example.fun.5 <- function(data, vnames) {
data.frame(data, pasted = do.call("paste", data[vnames]))
}
example.fun.5(df, c("x", "y"))

Applying a Function to a Data Frame : lapply vs traditional way

I have this data frame in R:
x <- seq(1, 10,0.1)
y <- seq(1, 10,0.1)
data_frame <- expand.grid(x,y)
I also have this function:
some_function <- function(x,y) { return(x+y) }
Basically, I want to create a new column in the data frame based on "some_function". I thought I could do this with the "lapply" function in R:
data_frame$new_column <-lapply(c(data_frame$x, data_frame$y),some_function)
This does not work:
Error in `$<-.data.frame`(`*tmp*`, f, value = list()) :
replacement has 0 rows, data has 8281
I know how to do this in a more "clunky and traditional" way:
data_frame$new_column = x + y
But I would like to know how to do this using "lapply" - in the future, I will have much more complicated and longer functions that will be a pain to write out like I did above. Can someone show me how to do this using "lapply"?
Thank you!
When working within a data.frame you could use apply instead of lapply:
x <- seq(1, 10,0.1)
y <- seq(1, 10,0.1)
data_frame <- expand.grid(x,y)
head(data_frame)
some_function <- function(x,y) { return(x+y) }
data_frame$new_column <- apply(data_frame, 1, \(x) some_function(x["Var1"], x["Var2"]))
head(data_frame)
To apply a function to rows set MAR = 1, to apply a function to columns set MAR = 2.
lapply, as the name suggests, is a list-apply. As a data.frame is a list of columns you can use it to compute over columns but within rectangular data, apply is often the easiest.
If some_function is written for that specific purpose, it can be written to accept a single row of the data.frame as in
x <- seq(1, 10,0.1)
y <- seq(1, 10,0.1)
data_frame <- expand.grid(x,y)
head(data_frame)
some_function <- function(row) { return(row[1]+row[2]) }
data_frame$yet_another <- apply(data_frame, 1, some_function)
head(data_frame)
Final comment: Often functions written for only a pair of values come out as perfectly vectorized. Probably the best way to call some_function is without any function of the apply-familiy as in
some_function <- function(x,y) { return(x + y) }
data_frame$last_one <- some_function(data_frame$Var1, data_frame$Var2)

Apply concordance dataframe to zoo objects

I have a zoo object made of several time series, like this:
indices <- seq.Date(as.Date('2000-01-01'),as.Date('2005-01-30'),by="year")
a <- zoo(rnorm(5), order.by=indices)
b <- zoo(rnorm(5), order.by=indices)
c <- zoo(rnorm(5), order.by=indices)
ts_origin <- merge(a,b,c)
I would like to multiply each zoo series from ts_origin by a ratio contained in a dataframe, an put
the results in another zoo object (ts_final) that contains the time seris d,e,f. In other words,
the dataframe is a concordance file between a,b,c and d,e,f , and the ratio would be applied this way:
ts_final$d = ts_origin$a * 10 ; ts_final$e = ts_origin$b * 100 ; ts_final$f = ts_origin$c * 1000.
df <- data.frame(original = c("a","b","c"),
final = c("d","e","f"),
ratio = c(10,100,1000))
indices <- seq.Date(as.Date('2000-01-01'),as.Date('2005-01-30'),by="year")
d <- zoo(, order.by=indices)
e <- zoo(, order.by=indices)
f <- zoo(, order.by=indices)
ts_final <- merge(d,e,f)
Not too sure what the best approach for this. I was trying with the apply function, but couldn't make
it work... any help would be greatly appreciated!
1) Map/merge
Use Map to iterate over final, original and ratio executing the products required producing a list of zoo objects L. Note that Map takes the names from the first argument after fun. Then merge the list components forming zoo object ts_final.
fun <- function(f, o, r) ts_origin[, o] * r
L <- with(df, Map(fun, final, original, ratio))
ts_final <- do.call("merge", L)
The result using the inputs shown in the Note at the end is this zoo object:
> ts_final
d e f
2000-01-01 -5.6047565 46.09162 400.7715
2001-01-01 -2.3017749 -126.50612 110.6827
2002-01-01 15.5870831 -68.68529 -555.8411
2003-01-01 0.7050839 -44.56620 1786.9131
2004-01-01 1.2928774 122.40818 497.8505
2005-01-01 17.1506499 35.98138 -1966.6172
2) sweep
Another approach is to sweep out the ratios setting the names appropriately giving the same result as in (1).
with(df, sweep(setNames(ts_origin[, original], final), 2, ratio, "*"))
3) rep
Set the names and multiply by ratio repeated appropriately giving the same result as in (1).
nr <- nrow(df)
with(df, setNames(ts_origin[, original], final) * rep(ratio, each = nr))
Note
We can define the input reproducibly like this:
set.seed(123)
tt <- as.Date(ISOdate(2000:2005, 1, 1))
m <- matrix(rnorm(6*3), 6, dimnames = list(NULL, c("a", "b", "c")))
ts_origin <- zoo(m, tt)
df <- data.frame(original = c("a","b","c"),
final = c("d","e","f"),
ratio = c(10,100,1000))
Here is a one-liner, with wrong final names.
ts_final <- t(df$ratio * t(ts_origin))
ts_final
# a b c
#2000-01-01 -5.382213 -12.64773 -513.6408
#2001-01-01 -9.218280 -98.55123 -1826.6430
#2002-01-01 2.114663 -28.58910 290.8008
#2003-01-01 -3.576460 -23.47314 -166.5473
#2004-01-01 6.490508 -36.29317 -398.0389
#2005-01-01 -5.382213 -12.64773 -513.6408
Now assign final names.
colnames(ts_final) <- df$final

R apply() function issue with custom function included

I have coded following function:
one_way_anova <- function(m, n, sample_means, sample_vars) {
keskiarvo = 1/m*sum(sample_means)
otosv = (sum((sample_means-keskiarvo)^2))/(m-1)
TS = (n*otosv)/(sum(sample_vars)/m)
parvo = 1-pf(TS, m-1, m*(n-1))
return(parvo)
}
And using following data:
set.seed(1)
dat <- matrix(rnorm(300*20), nrow=300)
sample_means <- matrix(rowMeans(dat), nrow=100)
sample_vars <- matrix(apply(dat, 1, var), nrow=100)
m <- nrow(sample_means)
n <- ncol(sample_means)
Now I try to use apply -function to calculate "parvo" with my function one_way_anova for dataset sample_means by individual rows with three samples (matrix is 100x3).
apply(sample_means, 1, one_way_anova)
Which gives following error
Error in FUN(newX[, i], ...) : argument "sample_means" is missing, with no default
Since your function one_way_anova needs multiple arguments, you need to pass all other arguments besides sample_means if you used apply.
If you want to run it over rows in sample_means and sample_vars, maybe you can try sapply like below
sapply(1:m,function(k) one_way_anova(m,n,sample_means[k,],sample_vars[k,]))

Spearman correlation between two matrices of same dimensions

I have two matrices of equal dimensions (p and e) and I would like to make a spearman correlation between columns of the same name. I want to have the output of pair correlations in a matrix (M)
I used the corr.test() function from library Psych and here is what I did:
library(psych)
M <- data.frame(matrix(ncol=3,nrow=ncol(p)))
M[,1] <- as.character()
G <- colnames(p)
for(rs in 1:ncol(p){
M[rs,1] <- G[rs]
cor <- corr.test(p[,rs],e[,rs],method="spearman",adjust="none")
M[rs,2] <- cor$r
M[rs,3] <- cor$p
}
But I get an error message:
Error in 1:ncol(y) : argument of length 0
Could you please show me what is wrong? or suggest another method?
No need for all this looping and indexing etc:
# test data
p <- matrix(data = rnorm(100),nrow = 10)
e <- matrix(data = rnorm(100),nrow = 10)
cor <- corr.test(p, e, method="spearman", adjust="none")
data.frame(name=colnames(p), r=diag(cor$r), p=diag(cor$p))
# name r p
#a a 0.36969697 0.2930501
#b b 0.16363636 0.6514773
#c c -0.15151515 0.6760652
# etc etc
If the names of the matrices don't already match, then match them:
cor <- corr.test(p, e[,match(colnames(p),colnames(e))], method="spearman", adjust="none")
Since the two matrices are huge, it would take very long system.time to execute the function corr.test() on all possible pairs but the loop that finally worked is as follow:
library(psych)
M <- data.frame(matrix(ncol=3,nrow=ncol(p)))
M[,1] <- as.character()
G <- colnames(p)
for(rs in 1:ncol(p){
M[rs,1] <- G[rs]
cor <- corr.test(as.data.frame(p[,rs]),as.data.frame(e[,rs]),
method="spearman",adjust="none")
M[rs,2] <- cor$r
M[rs,3] <- cor$p
}

Resources