Function within function not activating as expected - r

I have a function that I use to get a "quick look" at a data.frame... I deal with a lot of survey data and this acts as a quick tool to see what's what.
library(reshape)  # melt() and cast() used below come from the reshape package
f.table <- function(x) {
  if (is.factor(x[[1]])) {
    frequency <- function(x) {
      x <- round(length(x)/n, digits=2)
    }
    x <- na.omit(melt(x, c()))
    x <- cast(x, variable ~ value, frequency)
    x <- cbind(x, top2=x[,ncol(x)]+x[,ncol(x)-1], bottom=x[,2])
  }
  if (is.numeric(x[[1]])) {
    frequency <- function(x) {
      x[x > 1] <- 1
      x[is.na(x)] <- 0
      x <- round(sum(x)/n, digits=2)
    }
    x <- na.omit(melt(x))
    x <- cast(x, variable ~ ., c(frequency, mean, sd, min, max))
    x <- transform(x, variable=reorder(variable, frequency))
  }
  return(x)
}
What I find happens is that if I don't define "frequency" outside of the function, it returns wonky results for data frames with continuous variables. It doesn't seem to matter which definition I use outside of the function, so long as I do.
try:
n <- 100
x <- data.frame(a=c(1:25),b=rnorm(100),c=rnorm(100))
x[x > 20] <- NA
Now, select either one of the frequency functions and paste them in and try it again:
frequency <- function(x) {
x <- round(length(x)/n, digits=2)
}
f.table(x)
Why is that?

Crucially, I think this is where your problem is: cast() is evaluating those functions without reference to the function it was called from. Inside cast() it evaluates fun.aggregate via funstofun and, although I don't really follow what it is doing, it is getting stats:::frequency and not your local one.
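A quick way to see the name clash for yourself (this is only a sketch of the masking problem, not of how cast() resolves fun.aggregate):
environment(stats::frequency)  # <environment: namespace:stats> -- a frequency() generic already exists
getAnywhere("frequency")       # lists every frequency() visible on the search path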
Hence my comment to your Q: what do you want the function to do? At the moment it would seem necessary to define a "frequency" function in the global environment so that cast() or funstofun() finds it. Give it a unique name so it is unlikely to clash with anything and should be the only thing found, say .Frequency(). Without knowing what you want to do with the function (rather than what you thought the function [f.table] should do) it is a bit difficult to provide further guidance, but why not have .FrequencyNum() and .FrequencyFac() defined in the global workspace and rewrite your f.table() wrapper's calls to cast() to use the relevant one?
.FrequencyFac <- function(X, N) {
  round(length(X)/N, digits=2)
}

.FrequencyNum <- function(X, N) {
  X[X > 1] <- 1
  X[is.na(X)] <- 0
  round(sum(X)/N, digits=2)
}

f.table <- function(x, N) {
  if (is.factor(x[[1]])) {
    x <- na.omit(melt(x, c()))
    x <- dcast(x, variable ~ value, .FrequencyFac, N = N)
    x <- cbind(x, top2=x[,ncol(x)]+x[,ncol(x)-1], bottom=x[,2])
  }
  if (is.numeric(x[[1]])) {
    x <- na.omit(melt(x))
    x <- cast(x, variable ~ ., c(.FrequencyNum, mean, sd, min, max), N = N)
    ## x <- transform(x, variable=reorder(variable, frequency))
    ## left this out as I wanted to see what cast returned
  }
  return(x)
}
Which I thought would work, but it is not finding N, and it should be. So perhaps I am missing something here?
By the way, it is probably not a good idea to rely on a function finding n (as in your version) outside of itself. Always pass in the variables you need as arguments.
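A minimal sketch of that point, using hypothetical prop_global()/prop_arg() helpers (not from the original code):
n <- 100
prop_global <- function(x) round(length(x) / n, 2)     # silently depends on a global n
prop_arg    <- function(x, n) round(length(x) / n, 2)  # n is passed in explicitly
prop_global(1:25)    # 0.25, but changes if the global n changes
prop_arg(1:25, 100)  # 0.25, and self-contained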

I don't have the package that contains melt, but there are a couple of potential issues I can see:
Your frequency functions do not explicitly return anything (ending a function with x <- ... only returns that value invisibly).
It's generally bad practice to alter function inputs (here x is both the input and the output).
There is already a generic frequency function in the stats package in base R, which may cause issues with method dispatch (I'm not sure).
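A quick illustration of the first point, using a throwaway function (not from the question): an assignment as the last expression returns its value invisibly, which is easy to mistake for returning nothing.
f <- function(x) { x <- x + 1 }
f(1)    # prints nothing at the console
(f(1))  # [1] 2 -- the value is there, just returned invisibly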

Applying a Function to a Data Frame : lapply vs traditional way

I have this data frame in R:
x <- seq(1, 10,0.1)
y <- seq(1, 10,0.1)
data_frame <- expand.grid(x,y)
I also have this function:
some_function <- function(x,y) { return(x+y) }
Basically, I want to create a new column in the data frame based on "some_function". I thought I could do this with the "lapply" function in R:
data_frame$new_column <- lapply(c(data_frame$x, data_frame$y), some_function)
This does not work:
Error in `$<-.data.frame`(`*tmp*`, f, value = list()) :
replacement has 0 rows, data has 8281
I know how to do this in a more "clunky and traditional" way:
data_frame$new_column = x + y
But I would like to know how to do this using "lapply" - in the future, I will have much more complicated and longer functions that will be a pain to write out like I did above. Can someone show me how to do this using "lapply"?
Thank you!
When working within a data.frame you could use apply instead of lapply:
x <- seq(1, 10,0.1)
y <- seq(1, 10,0.1)
data_frame <- expand.grid(x,y)
head(data_frame)
some_function <- function(x,y) { return(x+y) }
data_frame$new_column <- apply(data_frame, 1, \(x) some_function(x["Var1"], x["Var2"]))
head(data_frame)
To apply a function over rows set MARGIN = 1; to apply it over columns set MARGIN = 2.
lapply, as the name suggests, is a list-apply. Since a data.frame is a list of its columns, you can use lapply to compute over columns, but within rectangular data apply is often the easiest.
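For completeness, a small sketch (not part of the original answer) of lapply over the columns of that same data_frame; each column is handed to the function in turn:
lapply(data_frame[c("Var1", "Var2")], mean)  # a named list of column means
sapply(data_frame[c("Var1", "Var2")], mean)  # the same, simplified to a named vector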
If some_function is written for that specific purpose, it can be written to accept a single row of the data.frame as in
x <- seq(1, 10,0.1)
y <- seq(1, 10,0.1)
data_frame <- expand.grid(x,y)
head(data_frame)
some_function <- function(row) { return(row[1]+row[2]) }
data_frame$yet_another <- apply(data_frame, 1, some_function)
head(data_frame)
Final comment: often functions written for just a pair of values turn out to be perfectly vectorized already. Probably the best way to call some_function is without any function of the apply family at all, as in
some_function <- function(x,y) { return(x + y) }
data_frame$last_one <- some_function(data_frame$Var1, data_frame$Var2)

Rolling over function with 2 vector arguments

I want to apply a rolling window to a function that requires 2 vector arguments. Here is the example (that doesn't work) using data.table:
library(data.table)
df <- as.data.table(cbind.data.frame(x=1:100, y=101:200))
my_sum <- function(x, y) {
  x <- log(x)
  y <- x * y
  return(x + y)
}

roll_df <- frollapply(df, 10, function(x, y) {
  my_sum(x, y)
})
It doesn't recognize the y column. Of course, the solution could use xts or some other package.
EDIT:
This is the real function I want to apply:
library(dpseg)
dpseg_roll <- function(time, price) {
  p <- estimateP(x=time, y=price, plot=FALSE)
  segs <- dpseg(time, price, jumps=jumps, P=p, type=type, store.matrix=TRUE)
  slope_last <- segs$segments$slope[length(segs$segments$slope)]
  return(slope_last)
}
With runner you can apply any function in a rolling window. The running window can also be created over the rows of a data.frame passed to the x argument. Let's focus on the simpler function my_sum. The argument f in runner accepts only one object (the data in this case). I encourage you to put browser() inside the function to debug it row by row before you apply some fancy model to the subset (some algorithms require a minimal number of observations).
my_sum <- function(data) {
  # browser()
  x <- log(data$x)
  y <- x * data$y
  tail(x + y, 1) # return only one value
}
my_sum should return only one value, because runner computes a result for each row - if my_sum returned a vector, you would get a list.
Because runner is an independent function, you need to pass the data.table object to x. The best way to do this is to use x = .SD (see here why).
df[,
   new_col := runner(
     x = .SD,
     f = my_sum,
     k = 10
   )]
I have no idea what you are going to do with frollapply (mean or sum or something else?).
Assuming you are about to use rolling sum, here might be one example. I rewrote your function my_sum such that it applies to df directly.
my_sum <- function(...) {
  v <- c(...)
  x <- log(v[[1]])
  y <- Reduce(`*`, v)
  return(x + y)
}

roll_df <- frollapply(
  my_sum(df),
  10,
  FUN = sum)
rollapply in zoo passes a zoo object to the function to be applied if coredata=FALSE is used. The zoo object is made up of a time part and a value part, so we can use the following if the x value represents ascending values (which I gather it does). Note that my_sum in the question returns a 10-element result if the two arguments are of length 10, so out shown below is a 100 x 10 zoo object with the first 9 rows filled with NAs.
If you don't want the NAs, omit fill=NA; if you instead want to apply the function to partial inputs at the beginning, use partial=TRUE in place of fill=NA. If you only want one of the 10 elements, such as the last one, then use function(x) my_sum(time(x), coredata(x))[10] in place of the function shown, or just use out[, 10].
fortify.zoo(out) can be used to turn the zoo object out into a data frame if you need the result in that form, or use as.data.frame(out) if you want to drop the times. as.data.table(out) also works in a similar manner.
library(zoo)
z <- read.zoo(df) # df$x becomes the time part and df$y the value part
out <- rollapplyr(z, 10, function(u) my_sum(time(u), coredata(u)),
                  coredata = FALSE, fill = NA)
dim(out)
## [1] 100 10
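For example, building on the note above about out[, 10] and fortify.zoo, to keep just the last of the 10 elements per window and get a data frame back:
last_only <- out[, 10]        # last element of my_sum() for each window
head(fortify.zoo(last_only))  # Index plus value, as a data frame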
Note that in dpseg_roll, jumps and type are not defined.

Call a function many times without duplicating it (with %>%)

Let's assume that we have 3 functions with this minimal functionality:
func1 <- function (x) {
  x + 1
}
func2 <- function (x, plus = T) {
  if (plus == TRUE) {
    x + 2
  } else {
    x - 5
  }
}
func3 <- function (x) {
  x + 3
}
I would like to nest these functions in each other like this with the pipe (%>%) operator:
library(magrittr)
func1(0) %>% func2(plus = T) %>% func2(plus = F) %>% func3
# result: 1
Which is the equivalent version of it:
func3(func2(func2(func1(0), plus = T), plus = F))
# result: 1
I am trying to find a method that doesn't require duplicating the func2() call (because I have to run it many times, and I would also like to change the number of function calls and the parameter dynamically). Currently I am not a big expert on the apply functions or the purrr map functions, but I guess at least one of them can do this job.
It is of course just a dummy example; my real code is much more complicated. I am just trying to simplify my problem to find a solution.
I have to use the pipe operator, so I am only interested in solutions that also work with pipes.
Write a function that takes the initial x and the outcomes to feed to func2 and loops through those outcomes:
func2_iterate = function(x, outcomes){
  for (i in 1:length(outcomes)){
    x = func2(x, outcomes[i])
  }
  return(x)
}
Then run (with func1, func2, func3 as above):
func1(0) %>% func2_iterate(c(T, F)) %>% func3
#result: 1
I'd also like to point out that in this particular case the output of func2_iterate is just its input, plus 2 times the number of T in outcomes, minus 5 times the number of F in outcomes. But I assume you actually have functions that do something more complicated.
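A quick check of that arithmetic with the toy functions above:
func2_iterate(1, c(T, F))        # 1 + 2 - 5     = -2
func2_iterate(1, c(T, T, F, F))  # 1 + 2*2 - 5*2 = -5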
Using a partial / compose / invoke combo:
library(tidyverse)
f2b <- invoke(compose, map(c(F,T), ~substitute(partial(func2, plus =.), lst(.))))
func1(0) %>% f2b %>% func3
# [1] 1
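For comparison, a sketch of the same chain using purrr::reduce instead (not from the original answer); reduce() passes each element of the logical vector to func2 as the plus argument while carrying the running value along:
library(purrr)
func1(0) %>% reduce(c(TRUE, FALSE), func2, .init = .) %>% func3
# [1] 1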

Using for() over variables that need to be changed

I'd like to be able to use a for() loop to automate the same operation over many variables, modifying them.
Here's the simplest example I could design:
varToChange = list( 1:10, iris$Species[1:10], letters[1:10]) # assume that it has many more than just 3 elements
varToChange
for (i in varToChange) {
  if (is.character(i)) i <- as.integer(as.ordered(i))
  if (is.factor(i)) i <- as.integer(i)
}
varToChange # <-- Here I want to see my elements as integers now
Here's the actual example that led me to this question, taken from: Best way to plot automatically all data.table columns using ggplot2
In the following function
f <- function(dt, x, y, k) {
  if (is.numeric(x)) x <- names(dt)[x]
  if (is.numeric(y)) y <- names(dt)[y]
  if (is.numeric(k)) k <- names(dt)[k]
  ggplot(dt, aes_string(x, y, col=k)) + geom_jitter(alpha=0.1)
}
f(diamonds, 1,7,2)
instead of brutally repeating the same line many times, as a programmer, I would rather have a loop to repeat this line for me.
Something like this one:
for (i in c(x,y,k)) {
  if (is.numeric(i)) i <- names(dt)[i]
}
In C/C++ this would have been done using pointers. In R, is it at all possible?
UPDATE: Very nice idea to use Map below. However, it does not work for this example:
getColName <- function(dt, x) {
  if (is.numeric(x)) {
    x <- names(dt)[x]
  }
  x
}

f <- function(dt, x, y, k) {
  list(x,y,k) <- Map(getColName, list(x,y,k), dt)
  # if (is.numeric(x)) x <- names(dt)[x]
  # if (is.numeric(y)) y <- names(dt)[y]
  # if (is.numeric(k)) k <- names(dt)[k]
  ggplot(dt, aes_string(x,y, col=k)) + geom_jitter(alpha=0.1)
}
f(diamonds, 1,7,2) # Brrr..
No need for a for loop; just Map a function over each of your list items:
varToChange = list( 1:10, iris$Species[1:10], letters[1:10])
myfun <- function(y) {
  if (is.character(y)) y <- as.integer(as.ordered(y))
  if (is.factor(y)) y <- as.integer(y)
  y
}
varToChange <- Map(myfun, varToChange)
UPDATE: Map never modifies variables in place; this is simply not done in R. Use the new values returned by Map:
f <- function(dt, x, y, k) {
  args <- Map(function(x) getColName(dt, x), list(x=x, y=y, k=k))
  ggplot(dt, aes_string(args$x, args$y, col=args$k)) + geom_jitter(alpha=0.1)
}
f(diamonds, 1,7,2)
You have two choices for iteration in R: iterate over the variables themselves, or over their indices. I generally recommend iterating over indices. This case illustrates a strong advantage of that approach, because your question is a non-issue if you are using indices.
varToChange = list( 1:10, iris$Species[1:10], letters[1:10])
for (i in seq_along(varToChange)) {
  if (is.character(varToChange[[i]])) varToChange[[i]] <- as.integer(as.factor(varToChange[[i]]))
  if (is.factor(varToChange[[i]])) varToChange[[i]] <- as.integer(varToChange[[i]])
}
I also replaced as.ordered() with as.factor() - the only difference between an ordered factor and a regular factor is the default contrasts used in modeling. As you are just coercing to integer, it doesn't matter.
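A quick check of that claim on a throwaway vector:
identical(as.integer(as.ordered(letters[1:3])), as.integer(as.factor(letters[1:3])))
# [1] TRUE -- the integer codes are the same either way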

Write a function that returns a vector or list of three statistics

This is a question for school, but I have been working on it for hours and just need a point in the right direction. I am not asking for the full answer.
I was given a data frame with student grades for various assessments. I have to write a function that will result in a vector or list that will give the min, max, and average of one particular assessment.
I was provided with the following framework:
checkAssessment <- function(df, assessmentName)
{
}
I need to be able to write the code to get the exact results below when the following line of code is executed:
checkAssessment(df,"hw1")
# $min
# [1] 0
#
# $max
# [1] 14
#
# $avg
# [1] 12.58824
So, I have tried many ways to go about this, none of which have worked. The two that came closest were
checkAssessment <- function(df, assessmentName)
{
  my_min <- df$assessmentName == min(assessmentName)
  my_max <- df$assessmentName == max(assessmentName)
  my_avg <- df$assessmentName == mean(assessmentName)
  return(df[my_min, ])
  return(df[my_max, ])
  return(df[my_avg, ])
}
and
checkAssessment <- function(df, assessmentName)
{
  my_min <- sapply(df$assessmentName, min)
  my_max <- sapply(df$assessmentName, max)
  my_avg <- sapply(mean.default(df$assessmentName, trim = 0, na.rm = FALSE,
                                ...))
  funs = c(min, max, mean)
  return(df[my_min, ])
  return(df[my_max, ])
  return(df[my_avg, ])
}
I'm not even sure if I'm close with either of these. I'm in an introductory R course so the code should be fairly simple, but I've developed a mental block with this question.
Any help would be very much appreciated. Thank you.
Because you were given the function framework, we have to use it.
checkAssessment <- function(df, assessmentName)
{
  x <- df[[assessmentName]]  ## extract the column vector
  return(list(min = min(x), max = max(x), avg = mean(x)))  ## use a list for multiple return values
}
Note:
to extract a column from a data frame by matching the column name exactly, use [[ ]]. It is OK to use $, but it does partial matching; maybe this answer can help you understand this concept;
be aware of the base R functions min, max and mean, so that you don't need to struggle with x[x == min(x)], etc. Even if you want this logic, you can try x[which.min(x)]. Read ?which.min for more;
if you want multiple returned values, use a "list" to collect all of them. The basic way to set up a list is list(1, 2), but a list can have names, so compare with list(a = 1, b = 2). A short sketch of these points follows.
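A small sketch of the [[ ]] vs $ and which.min points, on a tiny hypothetical data frame d:
d <- data.frame(Height = c(63, 87, 76), Girth = c(8.3, 8.6, 8.8))
d[["Height"]]        # exact name matching: 63 87 76
d$Heig               # partial match of "Height": works, but warns on data frames and is risky
d[["Heig"]]          # NULL: [[ ]] requires the exact name
which.min(d$Height)  # 1, the row index of the minimum, cf. x[which.min(x)]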
Test
We use R's built-in dataset trees for a test.
checkAssessment(trees, "Height")
#$min
#[1] 63
#$max
#[1] 87
#$avg
#[1] 76
It might also be worth pointing out where your code is problematic:
checkAssessment <- function(df, assessmentName)
{
  my_min <- df$assessmentName == min(assessmentName)
  my_max <- df$assessmentName == max(assessmentName)
  my_avg <- df$assessmentName == mean(assessmentName)
  return(df[my_min, ])
  return(df[my_max, ])
  return(df[my_avg, ])
}
First, min(assessmentName) does not make sense. Maybe you want
df$assessmentName == min(df$assessmentName)
Then, return(df[my_min, ]) is returning a data frame, a single row but multiple columns. Maybe you want:
return(df[my_min, assessmentName])
Finally, after the above return, the following won't have any effect:
return(df[my_max, assessmentName])
return(df[my_avg, assessmentName])
because the function terminates after seeing the first return. This is why you should use a "list" to get multiple returned values.
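A tiny demonstration of that last point, with a throwaway function (not from the question): execution stops at the first return(), so later return() calls are never reached.
f <- function() {
  return("first")
  return("second")  # never reached
}
f()
# [1] "first"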
