I'm attempting to write a function that uses the prob package to compute conditional probabilities. When using the function I continue to encounter the same error, which states an object within the function cannot be found.
Below is a reproducible example in which I compute a conditional probability without the function and then attempt to use the function to produce the same result. I'm not sure if the error is due to limitations with the prob package or an error on my part.
# Load prob package
library(prob)
# Set seed for reproducibility
set.seed(30)
# Sample data frame
sampledata <- data.frame(
X <- sample(1:10),
Y <- sample(c(-1, 0, 1), 10, replace=TRUE))
# Set probability space
S <- probspace(sampledata)
# Subset Y between -1 and 0
A <- subset(S, Y>=-1 & Y<=0)
# Subset X greater than 6
B <- subset(S, X>6)
# Compute conditional probability
P <- prob(A, given=B)
The above code produces the following probability:
> P
[1] 0.25
Attempting to write a function to calculate the same probability:
# Create function with data frame, variables, and conditional inputs
prob.function <- function(df, variable1, variable2, state1, state2, cond1){
s <- probspace(df)
a <- subset(s, variable1>=state1 & variable1<=state2)
b <- subset(s, variable2>cond1)
p <- prob(a, given=b)
return(p)
}
# Demonstrate the function
test <- prob.function(sampledata, Y, X, -1, 0, 6)
This function gives the following error:
Error in eval(expr, envir, enclos) : object 'b' not found
Any help you can provide would be great.
Thanks!
This looks like a bug in prob.
When I run this in vanilla R, I get the same error. But when I create an object b in my workspace, the error disappears:
> print(b)
Error in print(b) : object 'b' not found
> test <- prob.function(sampledata, Y, X, -1, 0, 6)
Error in eval(expr, envir, enclos) : object 'b' not found
>
> b <- "dummy variable"
> print(b)
[1] "dummy variable"
> test <- prob.function(sampledata, Y, X, -1, 0, 6)
> test
[1] 0.25
>
As a temporary workaround, just create a dummy b in your current environment.
As for the bug, if you look at the source for prob.default (which in the example above is what prob(a, given=b) is eventually calling), you'll see the following section:
if (missing(given)) {
< cropped >
}
else {
f <- substitute(given)
g <- eval(f, x) <~~~~
if (!is.logical(g)) { <~~~~
if (!is.data.frame(given)) <~~~~
stop("'given' must be data.frame or evaluate to logical")
B <- given
}
...
< cropped >
}
It jumps from g to given, perhaps inadvertently? I would reach out to the package maintainer, as this may be an oversight.
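A possible explanation for why a global b silences the error, assuming the snippet above is the code path that runs: eval(f, x) looks names up first in x, then in the frame of the function that called eval, then in that function's enclosing environments. That chain eventually reaches the global workspace but never the frame of the user's own function. The sketch below reproduces that lookup with base R only; lookup_in_own_frame, lookup_in_caller, run_demo, and local_flag are hypothetical names, not part of prob.

```r
# Mimics the pattern in the snippet: capture the expression, evaluate it in x.
# With no enclos argument, eval() falls back to this function's own frame.
lookup_in_own_frame <- function(x, given) {
  f <- substitute(given)
  eval(f, x)
}

# Same idea, but names not found in x are looked up in the caller's frame.
lookup_in_caller <- function(x, given) {
  f <- substitute(given)
  eval(f, x, parent.frame())
}

# A user function whose local variable is passed as 'given'
run_demo <- function(lookup) {
  local_flag <- TRUE
  tryCatch(lookup(data.frame(v = 1), local_flag), error = function(e) e)
}

failed <- run_demo(lookup_in_own_frame)  # an error condition: local_flag is not visible
worked <- run_demo(lookup_in_caller)     # TRUE
```

Under this reading, creating b in the workspace only helps because the global environment happens to be on the lookup path.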
I don't think this is a bug in package prob.
First, you should create your sampledata as
sampledata <- data.frame(
X = sample(1:10),
Y = sample(c(-1, 0, 1), 10, replace=TRUE))
Your original code creates not only this dataframe but also variables X and Y in the global environment which are actually being used later when you call your function.
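A small base R sketch of the difference, using hypothetical column names (x_col, y_col, u_col, v_col) so that nothing clashes with the example above:

```r
# '<-' inside the call performs an assignment in the calling environment
# and leaves the data frame column without a proper name
bad <- data.frame(x_col <- 1:3, y_col <- c(-1, 0, 1))
bad_names <- names(bad)                        # built from the deparsed expressions
leaked <- exists("x_col", inherits = FALSE)    # TRUE: x_col now exists outside the data frame

# '=' inside the call only names the argument, i.e. the column
good <- data.frame(u_col = 1:3, v_col = c(-1, 0, 1))
good_names <- names(good)                          # "u_col" "v_col"
not_leaked <- !exists("u_col", inherits = FALSE)   # TRUE: nothing was created outside
```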
Second, you shouldn't call subset() inside a function. Use bracket subsetting instead:
prob.function <- function(df, variable1, variable2, state1, state2, cond1){
s <- probspace(df)
a <- s[s[[variable1]]>=state1 & s[[variable1]]<=state2, ]
b <- s[s[[variable2]]>cond1, ]
p <- prob(a, given=b)
return(p)
}
And pass variable1 and variable2 as strings:
test <- prob.function(sampledata, "Y", "X", -1, 0, 6)
Now you have test==0.25, and no error.
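For comparison, here is a sketch of the same bracket-style lookup without the prob package. It assumes every row is equally likely (which, as far as I can tell, is what probspace() assigns when no probabilities are given), so P(A | B) reduces to a ratio of row counts. cond_prob and the toy data frame are hypothetical, for illustration only.

```r
# Conditional probability P(state1 <= variable1 <= state2 | variable2 > cond1),
# assuming all rows of df are equally likely
cond_prob <- function(df, variable1, variable2, state1, state2, cond1) {
  in_a <- df[[variable1]] >= state1 & df[[variable1]] <= state2
  in_b <- df[[variable2]] > cond1
  sum(in_a & in_b) / sum(in_b)
}

# Deterministic toy data so the result can be checked by hand
toy <- data.frame(X = 1:10, Y = rep(c(-1, 0, 1), length.out = 10))

# Rows with X > 6 have Y = -1, 0, 1, -1, so three of the four fall in [-1, 0]
p_toy <- cond_prob(toy, "Y", "X", -1, 0, 6)   # 0.75 for this toy data
```

Because the column names arrive as strings and are used with [[ ]], nothing here depends on non-standard evaluation.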
References for what is going on:
http://adv-r.had.co.nz/Computing-on-the-language.html#non-standard-evaluation-in-subset
Assignment operators in R: '=' and '<-'
Why is `[` better than `subset`?
Related
I have a function that I want to apply to a dataset, but the function also uses global variables as arguments as these variables are needed elsewhere.
With this reduced example I want to apply 'pterotest' to the rows of 'data'. This test case works when the function is given V as a vector, and M and g as a single value.
df<- data.frame(matrix(ncol = 1, nrow = 3))
row.names(df) <- c("Apsaravis_ukhaana", "Jeholornis_prima", "Changchengornis_hengdaoziensis")
colnames(df) <- "M"
mass_var <- c(0.1840000, 1.6910946, 0.0858997)
df$M <- mass_var
V <- seq(0.25,30, by = 0.05)
g <- 9.81
pterotest <- function(V, M, g) {
out1 <- M*g
out2 <- V*M
return(list(V, out1, out2))
}
apply(df,1,pterotest, M = "M", g = g, V = V)
However, all I get is an error of the form:
Error in match.fun(FUN) : '1' is not a function, character or symbol
EDIT: Turning this on its head, I could run a loop over each row, using the multiple columns as different arguments to the function, but with a 4.2M-line dataset I feel vectorising might be quicker...
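One possible reading of the error, offered as a sketch rather than a definitive answer: apply() has a formal argument called MARGIN, and R's partial argument matching lets the named argument M = "M" bind to it. The positional 1 then lands in FUN, which would explain the message. The alternative below iterates over the mass column directly; per_species and original_call are hypothetical names.

```r
# Data and function from the question
df <- data.frame(M = c(0.1840000, 1.6910946, 0.0858997),
                 row.names = c("Apsaravis_ukhaana", "Jeholornis_prima",
                               "Changchengornis_hengdaoziensis"))
V <- seq(0.25, 30, by = 0.05)
g <- 9.81

pterotest <- function(V, M, g) {
  out1 <- M * g
  out2 <- V * M
  list(V, out1, out2)
}

# M = "M" partially matches MARGIN, so apply() tries to use 1 as the function
original_call <- tryCatch(
  apply(df, 1, pterotest, M = "M", g = g, V = V),
  error = function(e) e
)

# Alternative sketch: one call per mass value, with V and g passed through
per_species <- lapply(df$M, pterotest, V = V, g = g)
names(per_species) <- row.names(df)
```

Since M * g and V * M are already vectorised, df$M * g and outer(df$M, V) would give the same numbers without any loop, which may matter for a dataset with millions of rows.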
This question is a follow up on two questions I had answered before:
Create the function
Calculate mean
I have a couple of variables (var1, var2 and var3), which have different distribution functions:
var1_distr1 <- pdqr::as_d(function(x)dnorm(x, mean = 3, sd = 1))
var1_distr2 <- pdqr::as_d(function(x)dnorm(x, mean = 6, sd = 1))
var1_distr3 <- pdqr::as_d(function(x)dnorm(x, mean = 2, sd = 2))
var2_distr1 <- pdqr::as_d(function(x)dnorm(x, mean = 5, sd = 3))
var2_distr2 <- pdqr::as_d(function(x)dnorm(x, mean = 3, sd = 1))
var2_distr3 <- pdqr::as_d(function(x)dnorm(x, mean = 4, sd = 2))
var3_distr1 <- pdqr::as_d(function(x)dnorm(x, mean = 4, sd = 1))
var3_distr2 <- pdqr::as_d(function(x)dnorm(x, mean = 5, sd = 1))
var3_distr3 <- pdqr::as_d(function(x)dnorm(x, mean = 7, sd = 2))
To create a proportional distribution function that matches a combination of two or three variables with their appropriate probability functions, I wrote the following function, which I learned from the first question:
foo <- function(...){
#set x values
x <- seq(1, 10, by = 1)
#create y values
y <- 1L
for (fun in list(...)) y <- y * fun(x)
#create new PDF
p <- data.frame(x,y)
pdqr::new_d(p, type = "continuous")
}
So, if I want to create a proportional distribution function var2_distr1__var3_distr3 of var2_distr1 and var3_distr3, I can just do this: var2_distr1__var3_distr3 <- foo(var2_distr1, var3_distr3), which works like a charm.
Now, for each variable and each case, I have selected the appropriate distribution using a simple if_else, which returns the distribution names in a data frame like this:
df <- data.frame(var1 = c("var1_distr1", "var1_distr3", "var1_distr1", "var1_distr2", "var1_distr2", "var1_distr1", "var1_distr3"),
var2 = c("var2_distr2", "var2_distr1", "var2_distr2", "var2_distr1", "var2_distr3", "var2_distr3", "var2_distr1"),
var3 = c("var3_distr2", "var3_distr3", "var3_distr1", "var3_distr1", "var3_distr2", "var3_distr3", "var3_distr1"))
If I want the mean of the relevant individual distributions per case for a single variable, I can use this
df$var2_distr1_mean <- sapply(mget(df$var2_distr1), pdqr::summ_mean)
df$var3_distr3_mean <- sapply(mget(df$var3_distr3), pdqr::summ_mean)
which I learned in the second question.
However, if I want to get the mean of the proportional distributions given in var1 and var2 I get into trouble.
> df$var1_2_mean <- mapply(pdqr::summ_mean, foo(df$var1, df$var2))
Error in fun(x) : could not find function "fun"
While if I individually pass the distribution functions, this happens:
> df$var1_2_mean <- mapply(summ_mean, foo(var1_distr1, var2_distr2))
Error in dots[[1L]][[1L]] : object of type 'closure' is not subsettable
As suggested by #Limey, I put the PDFs in a list:
PDFS <- list(var1_distr1 = var1_distr1, var1_distr2 = var1_distr2, var1_distr3 = var1_distr3,
var2_distr1 = var2_distr1, var2_distr2 = var2_distr2, var2_distr3 = var2_distr3,
var3_distr1 = var3_distr1, var3_distr2 = var3_distr2, var3_distr3 = var3_distr3)
However, when calling that (using this approach apply-list-of-functions-to-list-of-values) I get this:
> df$var1_2_mean <- foo(sapply(PDFS, mapply, df$var1, df$var2))
Error in (function (x) : unused argument (dots[[2]][[1]])
> sapply(PDFS, mapply, df$var1, df$var2)
Error in (function (x) : unused argument (dots[[2]][[1]])
> sapply(PDFS, mapply, df$var1)
Error: `x` must be 'numeric', not 'character'.
> df$var1_2_mean <- foo(sapply(PDFS, mapply, paste(df$var1, df$var2, sep = ", ")))
Error: `x` must be 'numeric', not 'character'.
> df$var1_2_mean <- summ_mean(foo(sapply(PDFS, mapply, paste(df$var1, df$var2, sep = ", "))))
Error: `x` must be 'numeric', not 'character'.
> df$var1_2_mean <- sapply(foo(mget(mapply(PDFS, sapply, df$var1, df$var2))), pdqr::summ_mean)
Error in get(as.character(FUN), mode = "function", envir = envir) :
object 'PDFS' of mode 'function' was not found
> lapply(PDFS, function(x) x())
Error in x() : argument "x" is missing, with no default
I'm still missing something, and I believe it's on vectorisation. Might invoke_map work?
I don't have the pdqr package, so I can't solve your exact problem, but here's a proof-of-concept example that may be helpful. As I mention in comments, you haven't specified your exact use case, but I do feel you are imposing constraints that make your life more difficult than it need be. For example passing function names rather than functions to your summary function, using a data frame rather than a list, etc.
Anyway, start by defining some functions and store them in a list.
foo1 <- function() {"Foo 1"}
foo2 <- function() {"Foo 2"}
foo3 <- function() {"Foo 3"}
funcList <- list(foo1, foo2, foo3)
Now use utils::combn() to generate all combinations of two of these three functions and call each member of each pair in turn.
combn(
funcList,
m=2,
FUN=function(combination) {
lapply(combination, function(x) x())
}
)
Giving
[,1] [,2] [,3]
[1,] "Foo 1" "Foo 1" "Foo 2"
[2,] "Foo 2" "Foo 3" "Foo 3"
combn() takes the list of functions as input. m=2 requests all combinations of 2 elements from the list. FUN= specifies a function to apply to each combination; the anonymous function supplied here simply calls each element of the combination in turn.
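To connect this back to the per-row case in the question, here is a sketch that avoids pdqr entirely (so it is not the asker's exact setup). It keeps plain density functions in a named list, looks them up by the names stored in a data frame, multiplies them on a grid, and takes the mean of the normalised product. product_mean, pdfs, choices, and the grid limits are all assumptions made for illustration.

```r
# Plain density functions stored by name (a subset of those in the question)
pdfs <- list(
  var1_distr1 = function(x) dnorm(x, mean = 3, sd = 1),
  var1_distr3 = function(x) dnorm(x, mean = 2, sd = 2),
  var2_distr1 = function(x) dnorm(x, mean = 5, sd = 3),
  var2_distr2 = function(x) dnorm(x, mean = 3, sd = 1)
)

# One row per case, holding the names of the selected distributions
choices <- data.frame(var1 = c("var1_distr1", "var1_distr3"),
                      var2 = c("var2_distr2", "var2_distr1"),
                      stringsAsFactors = FALSE)

# Mean of the distribution proportional to the product of two densities,
# approximated on a regular grid
product_mean <- function(name1, name2, pdf_list,
                         grid = seq(-20, 30, by = 0.01)) {
  y <- pdf_list[[name1]](grid) * pdf_list[[name2]](grid)
  sum(grid * y) / sum(y)
}

# mapply walks down both name columns in parallel
choices$var1_2_mean <- mapply(product_mean, choices$var1, choices$var2,
                              MoreArgs = list(pdf_list = pdfs),
                              USE.NAMES = FALSE)
```

The key difference from the failing attempts is that the names are turned into functions (via the list lookup) before anything is called, and mapply iterates over the names rather than over the functions.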
Here is some example data:
set.seed(1234) # Make the results reproducible
count <- 100
cs1 <- round(rchisq(count, 1), 2)
cs2 <- round(rchisq(count, 2), 2)
c(rep("Present", 30), rep("Absent", 30), rep("NA", 40)) -> temp
temp[temp == "NA"] <- NA
as.factor(temp) -> temp
temp1 <- round(rnorm(count, 3), 2)
temp1[7] <- NA
temp2 <- round(rnorm(count, 7), 2)
temp2[54] <- NA
c(rep("Yes", 30), rep("No", 30), rep("Maybe", 30), rep("NA", 10)) -> temp3
temp3[temp3 == "NA"] <- NA
as.factor(temp3) -> temp3
c(rep("Group A", 55), rep("Group B", 45)) -> temp4
as.factor(temp4) -> temp4
mydata <- data.frame(cs1, cs2, temp, temp1, temp2, temp3, temp4)
mydata$cs2[56:100] <- NA ; mydata
I know I can compute summary statistics for each variable stratified by temp4 like so:
by(mydata, mydata$temp4, summary)
However, I would also like to compute either a t.test or a chisq.test for each variable stratified by temp4. I've tried simply modifying the above code to do that but it always gives me an error. It seems the error stems from the fact that some of the variables in the data frame are numeric (and thus, would need a t.test) while others are factors (and thus, would need a chisq.test).
Is there a simple way to tell R to check the variable to see what kind it is, and then run the appropriate test, all at once? And to still print out all of the results even if it encounters an error?
I am not worried about the appropriateness of doing this (e.g., I am aware of the risks of multiple testing, etc) but rather just need to know how to do it. Thanks!
You can use lapply to loop through the variables and decide inside the anonymous function which test to conduct.
When an error occurs, it's caught by tryCatch and instead of a test result the final list will have the error message as a member.
tests_list <- lapply(mydata[-ncol(mydata)], function(x){
  tryCatch({
    if(is.numeric(x)){
      if(length(levels(mydata$temp4)) == 2){
        t.test(x ~ temp4, data = mydata)
      }else{
        aov(x ~ temp4, data = mydata)
      }
    }else{
      tbl <- table(x, mydata$temp4)
      chisq.test(tbl)
    }
  }, error = function(e) e)
})
# Logical vector flagging the columns whose test raised an error
err <- sapply(tests_list, inherits, "error")
tests_list$cs1
tests_list$temp3
# Single brackets here: `[[` cannot subset with a logical vector of this length
tests_list[err]
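Yes, you can loop through designated columns, keeping temp4 as the grouping factor, and check the type of each column (named x within the anonymous function). Use lapply (or sapply with simplify = FALSE) rather than apply(X, MARGIN = 2, FUN ...), because apply() first converts the data frame to a matrix, and a mix of numeric and factor columns would all become character. Note that I'm selecting the columns of mydata by name because I find it more explicit and readable. The tryCatch wrapper returns the error object instead of stopping when a test fails.
lapply(mydata[, c("cs1", "cs2", "temp", "temp1", "temp2", "temp3")], FUN = function(x, group) {
  # keep going when a test fails (e.g. cs2 has no values in Group B)
  tryCatch({
    if (is.numeric(x)) {
      # numeric column: t-test across the two groups
      t.test(x ~ group)
    } else if (is.factor(x)) {
      # factor column: chi-square test on the cross-tabulation
      chisq.test(table(x, group))
    }
  }, error = function(e) e)
}, group = mydata$temp4)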
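Once each column's test result (or caught error) sits in a list, the p-values can be collected into one named vector. The sketch below uses a small made-up data frame (toy) and a hypothetical helper run_test; a column with no data in one group is included on purpose so that one test fails.

```r
set.seed(1)
toy <- data.frame(
  score      = rnorm(40),
  empty_in_b = c(rnorm(20), rep(NA, 20)),     # no observations in group B
  answer     = factor(rep(c("Yes", "No"), 20)),
  group      = factor(rep(c("A", "B"), each = 20))
)

# Pick the test by column type; return the error object if the test fails
run_test <- function(x, group) {
  tryCatch(
    if (is.numeric(x)) t.test(x ~ group) else chisq.test(table(x, group)),
    error = function(e) e
  )
}

results <- lapply(toy[setdiff(names(toy), "group")], run_test, group = toy$group)

# NA marks the columns whose test could not be run
p_values <- sapply(results, function(r) {
  if (inherits(r, "error")) NA_real_ else r$p.value
})
```

The error objects stay available in results, so conditionMessage(results$empty_in_b) still reports why that test failed.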
I ran a repeated measures ANOVA with the code below. It's a simple operation that I have always been able to do quickly.
Link to mydata in .csv format
library(car)
vivo4 <- read.csv("vivo1.csv",sep=";",dec=",")
ageLevels <- c(1, 2,3,4,5,6,7,8,9,10,12)
ageFactor <- as.factor(ageLevels)
ageFrame <- data.frame(ageFactor)
measures <- function(data = vivo4, n = 4) { #n=4 is 4 variables
## Editor comment:
## correct way to initialize a list, don't use "list(n)"
## you can compare what you get from "list(4)" and "vector ("list", length = 4)"
## lmo's comment: don't use "list" for your variable name (may mask R function "list")
## I have corrected it as "Mylist"
Mylist <- vector("list", length = n)
for(i in 0:3) {Mylist[[i+1]] <- as.matrix(cbind(data[, 12*i + 1:12])) # 12 visits
}
Mylist
}
measures_list <- measures()
models <- lapply(
measures_list, function(x) {
ageModel <- lm(x ~ 1)
Anova.mlm (ageModel, idata = ageFrame, idesign = ~ageFactor)
} )
models #View the result
but I got the error
Error in `rownames<-`(`*tmp*`, value = colnames(B)) :
length of 'dimnames' [1] not equal to array extent
I have read many answers and can't understand what's wrong; I need some guidance.
You have 12 levels in your dataset, but ageLevels lists only 11.
That is, you forgot to include 14:
ageLevels <- c(1, 2,3,4,5,6,7,8,9,10,12,14)
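A quick sanity check along these lines can catch the mismatch before the model is fitted. This is only a sketch with a made-up response matrix (visits); the point is that the within-subject design needs one level per response column.

```r
# Made-up response matrix: 5 subjects, 12 visits
visits <- matrix(rnorm(5 * 12), nrow = 5, ncol = 12)

# Original levels: only 11 entries, one fewer than the number of visit columns
age_levels_before <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12)
matches_before <- length(age_levels_before) == ncol(visits)   # FALSE

# Corrected levels: 12 entries, one per visit column
age_levels_after <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 14)
matches_after <- length(age_levels_after) == ncol(visits)     # TRUE
```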
I am well aware there are much better solutions for the particular problem described below (e.g., cor and rcorr in Hmisc, as discussed here). This is just an illustration for a more general R issue I just can't figure out: passing multiple variable names from a character vector to a formula statement within a function.
Assume there is a dataset consisting of numeric variables.
vect.a <- rnorm(n = 20, mean = 0, sd = 1)
vect.b <- rnorm(n = 20, mean = 0, sd = 1)
vect.c <- rnorm(n = 20, mean = 0, sd = 1)
vect.d <- rnorm(n = 20, mean = 0, sd = 1)
dataset <- data.frame(vect.a, vect.b, vect.c, vect.d)
names(dataset) <- c("var1", "var2", "var3", "var4")
A correlation test has to be performed for each possible pair of variables within this data set, using a formula statement of the type ~ VarA + VarB within the function cor.test:
for (i in 1:(length(names(dataset))-1)){
for (j in (i+1):length(names(dataset))) {
cor.test(~ names(dataset)[i] + names(dataset)[j], data = "dataset")
}
}
which returns an error: invalid 'envir' argument of type 'character'
I assume a character string is incompatible with the formula statement but which class would be compatible with it? If the entire approach is wrong, please explain why and provide or point to an alternative solution. If the approach is somehow "ugly" or "non-R", please explain why.
You get that formula by using as.formula with a string argument.
> x <- c('x1','x2','x3')
> f <- as.formula(paste('~ ', x[1], ' + ', x[2]))
> f
~x1 + x2
> class(f)
[1] "formula"
There is another issue here: data="dataset" should be data=dataset, because data expects the data frame object itself, not a character string containing its name.
> dataset <- data.frame(a=1:5, b=sample(1:5))
> cor.test(~ a + b, data="dataset")
Error in eval(predvars, data, env) :
invalid 'envir' argument of type 'character'
> cor.test(~ a + b, data=dataset)
Pearson's product-moment correlation
...
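Putting both fixes together, the loop from the question could be sketched as follows. combn() is used here instead of the nested for loops, which is a substitution on my part; pairs and results are hypothetical names, and the data are regenerated with a seed for reproducibility.

```r
set.seed(42)
dataset <- data.frame(var1 = rnorm(20), var2 = rnorm(20),
                      var3 = rnorm(20), var4 = rnorm(20))

# All unordered pairs of column names, as a list of character vectors
pairs <- combn(names(dataset), 2, simplify = FALSE)

results <- lapply(pairs, function(p) {
  # Build the one-sided formula ~ varA + varB from the two names
  f <- as.formula(paste("~", p[1], "+", p[2]))
  # Pass the data frame itself, not its name as a string
  cor.test(f, data = dataset)
})
names(results) <- sapply(pairs, paste, collapse = "_")

# Example: estimated correlation for the first pair
results[["var1_var2"]]$estimate
```

Storing the results in a named list also fixes a quieter problem with the original loop, which computed each test but never kept or printed it.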