I'm writing a function that uses kmeans to determine bin widths to convert a continuous measurement (a predicted probability) into an integer (one of 3 bins). I've stumbled upon an edge case in which it's possible for my algorithm to (correctly) predict the same probability for a whole set, and I want to handle that situation. I'm using the rattle package's binning() function in the following way:
btsKmeansBin <- function(x, k = 3, default = c(0, 0.3, 0.5, 1)) {
result <- binning(x, bins = k, method = "kmeans", ordered = T)
bins <- attr(result, "breaks")
attr(bins, "names") <- NULL
bins <- bins[order(bins)]
bins[1] <- 0
bins[length(bins)] <- 1
return(bins)
}
Run this function on x <- c(.5,.5,.5,.5,.5,.5), and you'll get an error at the order(bins) step, because bins will be NULL and therefore not a vector.
Obviously, if x has only one distinct value, kmeans shouldn't work. In this case, I'd like to return the default bin divisions. When this happens, binning issues "Warning: the variable is not considered." So I'd like to use tryCatch to handle this warning, but surrounding the result <- ... line with the following code doesn't work the way I expect:
...
tryCatch({
result <- binning(x, bins = k, method = "kmeans", ordered = T)
}, warning = function(w) {
warning(sprintf("%s. Using default values", conditionMessage(w)))
return(default)
}, error = function(e) {
stop(e)
})
...
The warning gets printed as though I hadn't used tryCatch, and the code progresses past the return statement and throws the error from order again. I have tried a bunch of variations to no avail. What am I missing, here??
If you look in binning I think you'll find that the "warning" you see is not generated via warning() but with cat(), which is why tryCatch isn't picking it up. The author of binning probably deserves a few lashings with a wet noodle for this oversight. ;) (Or it could be on purpose due to the particular way that rattle works, I'm not sure.)
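A small base-R sketch of the difference may help. The functions `noisy()` and `real()` below are made up for illustration (they are not part of rattle): one prints a warning-looking message with cat(), the other signals a genuine warning.

```r
# Hypothetical stand-ins, not rattle code
noisy <- function() {
  cat("Warning: the variable is not considered.\n")  # plain text output, not a condition
  NULL
}
real <- function() {
  warning("a real warning")  # signals a condition that tryCatch can intercept
  NULL
}

# The warning handler never fires for cat() output
caught_cat  <- tryCatch({ noisy(); "not caught" }, warning = function(w) "caught")
# It does fire for a genuine warning()
caught_real <- tryCatch({ real(); "not caught" }, warning = function(w) "caught")
```

Here `caught_cat` ends up as "not caught" and `caught_real` as "caught", which matches what you are seeing with binning.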
It appears to return NULL when this happens, so you could simply handle it manually. Not ideal, but possibly the only way to go.
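A minimal sketch of the manual handling, assuming (as described above) that no "breaks" attribute is available when binning gives up. The helper name `finish_bins` is hypothetical; it takes whatever binning() returned and does the rest of the work from the question's function.

```r
# Hypothetical helper: turn a binning() result into break points,
# falling back to `default` when no breaks are available.
finish_bins <- function(result, default = c(0, 0.3, 0.5, 1)) {
  bins <- attr(result, "breaks")  # attr() on NULL is simply NULL
  if (is.null(bins)) {
    warning("binning returned no breaks. Using default values")
    return(default)
  }
  bins <- sort(unname(bins))      # drop names, order ascending
  bins[1] <- 0
  bins[length(bins)] <- 1
  bins
}
```

Inside btsKmeansBin you would then call `finish_bins(result, default)` right after the binning() line. Checking `length(unique(x)) < k` before calling binning() at all would be another way to catch the same case.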
I'm trying to write a function that wraps a few lines of code and lets me pass in a single variable. The code below creates an object using the Surv function (survival package). The second line takes the variable in question, in this case a column named Variable_X, and produces a fit that can then be visualized with ggsurvplot. The output is a Kaplan-Meier survival curve. What I'd like is a function such that I can type f(Variable_X) and get the KM curve for whichever column I choose from the data. I want f(y) to output the KM curve as if I had put y where ~Variable_X currently is. I'm new to R and very new to how functions work. I've tried the code below, but it obviously doesn't work. I'm working through DataCamp and reading posts, but I'm having a hard time with it. I'd appreciate any help.
surv_object <- Surv(time = KMeier_DF$Followup_Duration, event = KMeier_DF$Death_Indicator)
fitX <- survfit(surv_object ~ Variable_X, data = KMeier_DF)
ggsurvplot(fitX, data = KMeier_DF, pval = TRUE)
f<- function(x) {
dat<-read.csv("T:/datafile.csv")
KMeier_DF < - dat
surv_object <- Surv(time = KMeier_DF$Followup_Duration, event =
KMeier_DF$Death_Indicator)
fitX<-survfit(surv_object ~ x, data = KMeier_DF)
PlotX<- ggsurvplot(fitX, data = KMeier_DF, pval = TRUE)
return(PlotX)
}
The crux of your problem is a stumbling block that is tough to figure out at first: how to pass variable or data frame column names into a function. I created some example data. In the example below I give the function four arguments, one of which is your data. You can see two ways I refer to the columns, [[ ]] and [ , ], which you can think of as equivalent to $. With a literal column name they are, but inside a function, where the column name is stored in an argument, only [[ ]] and [ , ] work, because $ does not evaluate its argument. The print calls are only there to show you the data along the way. If those objects already exist in your global environment, remove them one by one, e.g. rm(surv_object), or clear them all with rm(list = ls()).
library(survival) # provides Surv() and survfit()
duration <- c(1, 3, 4, 3, 3, 4, 2)
di <- c(1, 1, 0, 0, 0, 0, 1)
color <- c(1, 1, 2, 2, 3, 3, 4)
KMdf <- data.frame(duration, di, color)
testfun <- function(df, varb1, varb2, varb3) {
surv_object <- Surv(time = df[[varb1]], event = df[ , varb2])
print(surv_object)
fitX <- survfit(surv_object ~ df[[varb3]], data = df)
print(fitX)
# plotx <- ggsurvplot(fitX, data = df, pval = TRUE) # this gives an error that surv_object is not found
# return(plotx)
}
testfun(KMdf, "duration", "di", "color") # notice the use of quotes here, if not you'll get an error about object not found.
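The difference between $ and [[ ]] inside a function can be seen with a tiny base-R sketch. The function names `pick_dollar` and `pick_bracket` are made up for illustration.

```r
df <- data.frame(color = c(1, 2, 3))

# `$` takes its right-hand side literally: this looks for a column called "name"
pick_dollar <- function(data, name) data$name

# `[[` evaluates its argument: this looks up the column whose name is stored in `name`
pick_bracket <- function(data, name) data[[name]]

res_dollar  <- pick_dollar(df, "color")   # NULL, there is no column called "name"
res_bracket <- pick_bracket(df, "color")  # the color column
```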
And even better, you have an even tougher stumbling block: how R handles variables and where it looks for them. From what I can tell, you're running into it because of a possible bug in ggsurvplot, which looks for variables in the global environment rather than inside the function. They closed the issue, but as far as I can tell, it's still there. When you try to run the ggsurvplot line, you'll get the same error you would get if the variable didn't exist:
Error in eval(inp, data, env) : object 'surv_object' not found.
Hopefully that helps. I'd submit a bug report if I were you.
edit
I was hoping this solution would help, but it doesn't.
testfun <- function(df, varb1, varb2, varb3) {
surv_object <- Surv(time = df[[varb1]], event = df[,varb2])
print(surv_object)
fitX <- survfit(surv_object ~ df[[varb3]], data = df)
print(fitX)
attr(fitX[['strata']], "names") <- c("color = 1", "color = 2", "color = 3", "color = 4")
plotx <- ggsurvplot(fitX, data = df, pval = TRUE) # this gives an error that surv_object is not found
return(plotx)
}
Error in eval(inp, data, env) : object 'surv_object' not found
This is homework, right?
First, you need to try to run the code before you provide it as an example. Your example has several fatal errors. ggsurvplot() needs either a library call to survminer or to be summoned as follows: survminer::ggsurvplot().
You have defined a function f, but you never used it. In the function definition, you have a wayward space < -. It never would have worked.
I suggest you start by defining a function that calculates the sum of two numbers, or concatenates two strings. Start here or here. Then, you can return to the Kaplan-Meier stuff.
Second, in another class or two, you will need to know the three parts of a function. You will need to understand the scope of a function. You might as well dig into the basics before you start copy-and-pasting.
Third, before you post another question, please read How to make a great R reproducible example?.
Best of luck.
I am trying to write a program for the bisection method that also plots the midpoint found at each iteration.
Here is the code I have tried.
Bisec = function(f,a =1, b =2, max=1e10, tol = 1e-100){
midVals = c()
for (i in 1:max){
c = (a+b)/2
midVals = append(midVals,c)
if(abs(f(c)) < tol){
return(list(c,plot(f),points(midVals)))
}else if(f(a)*f(c) > 0){
a = c
}else{
b =c
}
}
print("Maximum iterations reached")
}
x = var('x')
f = function(x){x*x-2}
Bisec(f,1, 3, max=1e5, tol = 1e-10)
But I am getting the graphs like this.
What do I need?
the function f has to be plotted.
the midpoints found in each iteration should be plotted in x axis.
How to achieve this?
Any hint would be helpful. I don't know where I am going wrong.
R notation can be a little different if you learned to program in another language. Part of R's power is its integration between an interpreted interface and (fast) compiled functions. Generally (although this may be an exception; I'm not focusing on that), for loops are avoided: many functions are vectorized, which means they do the looping within the compiled portion of the code. We also avoid growing a vector from empty, because the whole vector has to be copied EVERY time you append something to it.
For your specific problem, plot is plotting f - it just doesn't know anything about the points command because it evaluates plot before it ever sees points. You might find ggplot2 gives a more dynamic solution, but I'll start with a base R approach to your function:
Bisec = function(f, a = 1, b = 2, max_iter = 100, tol = 1e-10){
  # Preallocate once; a modest max_iter keeps this small (1e10 entries would not fit in memory)
  midVals = rep(NA_real_, max_iter) # I avoid using `max` since that's a function to find the maximum
  for (i in 1:max_iter){
    x <- mean(c(a,b)) # I also avoid using `c` since that's a function to concatenate stuff
    midVals[i] <- x
    if(abs(f(x)) < tol){
      plot(f, xlim = range(midVals, na.rm = TRUE))
      points(midVals, rep(0, length(midVals))) # midpoints on the x axis (y = 0); unused NA slots are skipped
      return(x)
    } else if(f(a)*f(x) > 0){
      a = x
    } else {
      b = x
    }
  }
  print("Maximum iterations reached")
}
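If you would rather keep the computation and the plotting apart, one option is to return the midpoints and plot them afterwards. This is only a sketch of that idea; the name `bisect_mids` and its defaults are my own choices, not anything from the question.

```r
# Hypothetical variant: return the root and the visited midpoints, plot separately
bisect_mids <- function(f, a, b, max_iter = 100, tol = 1e-10) {
  mids <- rep(NA_real_, max_iter)      # preallocate instead of appending
  for (i in seq_len(max_iter)) {
    x <- (a + b) / 2
    mids[i] <- x
    if (abs(f(x)) < tol) break         # close enough to a root
    if (f(a) * f(x) > 0) a <- x else b <- x
  }
  list(root = x, mids = mids[seq_len(i)])  # keep only the slots actually used
}

f <- function(x) x * x - 2
res <- bisect_mids(f, 1, 3)

curve(f, from = 1, to = 3)                  # draw f first ...
points(res$mids, rep(0, length(res$mids)))  # ... then add the midpoints at y = 0
```

Because points() runs after the plot exists, the midpoints land on the same axes as f.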
I'm wondering why when I run: iris[complete.cases(iris), ] it works perfectly fine. But when I do the same thing from the function below, it gives me the error: colMeans(x, na.rm = TRUE) : 'x' must be numeric?
p.s. scale() works well with data.frames ==> scale(mtcars).
Can this be fixed?
Here is the function:
standard <- function(data, scale = TRUE, center = TRUE, na.rm = TRUE){
data <- if(na.rm) data[complete.cases(data), ]
data[paste0(names(data), ".s")] <- scale(data, center = center, scale = scale)
return(data)
}
# EXAMPLE:
standard(iris)
EDIT:
Yes, the error is thrown by scale(), and not earlier. If you want to scale all the numeric columns and leave the other columns as is, you'll need to add a step that extracts the numeric columns, scales them, and then puts them back in. Incidentally, scale can handle NA values, so you can put the complete.cases() call after the scale.
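A rough sketch of that extract, scale, and put back step, using base R only. The name `standard_numeric` is hypothetical, and this version keeps the complete.cases() step first, as in your original function.

```r
# Hypothetical variant of standard(): scale only the numeric columns
standard_numeric <- function(data, scale = TRUE, center = TRUE, na.rm = TRUE) {
  if (na.rm) data <- data[complete.cases(data), , drop = FALSE]
  num <- vapply(data, is.numeric, logical(1))  # which columns are numeric
  scaled <- base::scale(data[num], center = center, scale = scale)
  # append the scaled copies as new ".s" columns, leaving the originals as they are
  data[paste0(names(data)[num], ".s")] <- as.data.frame(scaled)
  data
}

out <- standard_numeric(iris)
```

With iris this adds four ".s" columns and leaves Species untouched.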
Original Answer:
You can step through this by adding a call to browser() inside your function, but I suspect you'll find the error is thrown here:
scale(data, center = center, scale = scale)
Note from the documentation on scale()
Arguments
x a numeric matrix(like object).
Here's how you'd debug this. Change your function to this:
standard <- function(data, scale = TRUE, center = TRUE, na.rm = TRUE){
browser()
data <- if(na.rm) data[complete.cases(data), ]
data[paste0(names(data), ".s")] <- scale(data, center = center, scale = scale)
return(data)
}
Then try to call it with standard(immer)
It will open a browser for you to step through each statement in the function. If you do this in RStudio you can see the environment changes in the Environment tab in the upper right window. Use the command help to see how to navigate the browser, but in general, you'll use n and/or s to step through each statement. Q gets you out of the browser, and removing the browser() call from your function lets you run it as you would usually.
Update 2.0: Now with data such that the errors should be reproducible:
Data for the different functions:
z <- seq(0,2,length=1000)
t <- grid <- c(0.1,0.55,0.9)
parA <- c(0.21,-0.93)
parB <- c(0.21,1.008)
p <- c(1,2,1,2)
## for plotting ##
f_func <- function(x) exp(-x^3+x)
envARS1 <- function(x){ exp(parA[1]*x+parB[1])}
envARS2 <- function(x){ exp(parA[2]*x+parB[2])}
plot(x=z,y=envARS1(z), type = "l", col = "blue", ylim = c(0,2), xlim = c(0,2))
lines(x=z,y=envARS2(z), type = "l", col = "red")
lines(x = z,(f_func(z)), type = "l", col = "black")
I'm trying to implement an adaptive rejection sampler using a derivative-free approach. As part of this, I have to implement a dynamic envelope function that adjusts depending on the values and number of some Zt's.
I have managed to write a dynamic envelope function that seems to work fine, but when I try to integrate the envelope, with the final aim of drawing from it, I get errors.
DynamicEnv <- function(x){
exp(parA[p[max(which(x>=grid))]]*x+
parB[p[max(which(x>=grid))]])
}
The envelope function is the exponential of a line, and the parameters a and b depend on where its input x is located relative to the Zt's.
The variable 'grid' contains the Zt's and is therefore a vector; p is a dynamic position variable, which essentially tells the function which parameters to use.
So the first problem I had was that when I give my dynamic envelope a vector as input, I run into trouble with the 'which' function, which as far as I understand can only handle a single numeric value here.
Updated with the error I receive from 'which'
I get the below error with which:
Error in which(x > grid) :
dims [product 3] do not match the length of object [1000]
I believe this occurs because the comparison inside 'which' pairs the two vectors element by element, rather than comparing the n'th element of x with the entire grid vector.
Then I tried to incorporate a loop over all the values in the x vector, returning a vector of output values, but then I got the error message 'non-finite function values' when I tried to integrate my dynamic envelope.
The dynamic envelope with a loop inside is:
DynamicEnv1 <- function(x){
Draws <- matrix(0,length(x),1)
for (i in 1:length(x)) Draws[i,1] <-
exp(parA[p[max(which(x[i]>=grid))]]*x[i] + parB[p[max(which(x[i]>=grid))]])
return(Draws)
}
I have written this 'static' envelope function, which works fine for making draws from it (and therefore for integrating it).
envARSup <- function(x){ (ifelse((x <= t[1] | t[2] < x & x <= t[3]),
exp(parA[1]*x+parB[1]),exp(parA[2]*x+parB[2])))*1*(x>0)}
Here the t's are the Zt's mentioned above. The idea of the dynamic envelope should be clear from this function, since ideally the two should return the same values for the same grid (Zt's/t's).
The above function checks which interval the value of x belongs to and, based on that interval, uses a specific exponential of a line.
I would really appreciate it if someone could suggest an alternative to the 'which' function for locating a position in a vector, or help me understand why I get the error message with the loop-based dynamic envelope.
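One possible alternative to 'which' is base R's findInterval(), which is vectorized over x. The sketch below assumes the intended behaviour is that of the static envARSup above, i.e. intervals of the form (t_i, t_{i+1}] and p holding one parameter index per interval; the name `DynamicEnv2` is made up.

```r
grid <- c(0.1, 0.55, 0.9)
parA <- c(0.21, -0.93)
parB <- c(0.21, 1.008)
p    <- c(1, 2, 1, 2)   # one entry per interval: length(grid) + 1 of them

DynamicEnv2 <- function(x) {
  # findInterval() returns 0 for x <= grid[1], 1 for (grid[1], grid[2]], and so on;
  # adding 1 turns that into a valid index into p for every x
  idx <- findInterval(x, grid, left.open = TRUE) + 1
  exp(parA[p[idx]] * x + parB[p[idx]])
}

z <- seq(0, 2, length = 1000)
env_values <- DynamicEnv2(z)   # works on the whole vector at once
```

Since this never produces an empty index, every x gets a finite value, and the function is vectorized as integrate() requires. In the loop version, max(which(x[i] >= grid)) has nothing to pick from when x[i] is below grid[1] (max() of an empty vector is -Inf, with a warning), which may be where the trouble with integrate() comes from. Note that, unlike envARSup, this sketch does not multiply by (x > 0).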
How does the following code work? I found the example while reading the R help page ?curve, but I have not understood it.
for(ll in c("", "x", "y", "xy"))
curve(log(1+x), 1, 100, log = ll,
sub = paste("log= '", ll, "'", sep = ""))
In particular, I am accustomed to numeric values in a for loop, as in
for(ll in 1:10)
But what is the following command saying:
for(ll in c("","x","y","xy"))
c("","x","y","xy") looks like a string vector? How does c("","x","y","xy") work inside the curve function, both in log(1+x) [what is x here? the string "x" in c("","x","y","xy")?] and in log = ll?
Apparently, there are no answers on stack overflow about how the curve function in R works and especially about the log argument so this might be a good chance to delve into it a bit more (I liked the question btw):
First of all the easy part:
c("","x","y","xy") is a string vector or more formally a character vector.
for(ll in c("","x","y","xy")) will start a loop of 4 iterations and each time ll will be '','x','y','xy' respectively. Unfortunately, the way this example is built you will only see the last one plotted which is for ll = 'xy'.
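To see all four variants side by side instead of only the last one, one option is to split the plotting device into panels before running the loop. This is a small sketch using par(mfrow = ...); it is not part of the ?curve example.

```r
logs <- c("", "x", "y", "xy")

old_par <- par(mfrow = c(2, 2))   # 2 x 2 grid, so each iteration gets its own panel
for (ll in logs)
  curve(log(1 + x), 1, 100, log = ll,
        sub = paste("log= '", ll, "'", sep = ""))
par(old_par)                      # restore the previous layout
```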
Let's dive into the source code of the curve function to answer the rest:
First of all, what does the x represent in log(1+x)?
log(1+x) is an expression written as a function of x. x represents a vector of numbers that gets created inside the curve function in the following part (from source code):
x <- exp(seq.int(log(from), log(to), length.out = n)) # if the log argument contains 'x' ('x' or 'xy'), or
x <- seq.int(from, to, length.out = n) # if the log argument does not contain 'x'
# in our case from and to are 1 and 100 respectively
As long as the n argument is the default the x vector will contain 101 elements. Obviously the x in log(1+x) is totally different to the 'x' in the log argument.
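The two ways of building x can be reproduced outside of curve to see the difference. This is just a sketch that copies the two expressions quoted above, with from, to and n set to the values used in the example.

```r
from <- 1; to <- 100; n <- 101   # n = 101 is curve's default

x_linear <- seq.int(from, to, length.out = n)                 # evenly spaced in x
x_log    <- exp(seq.int(log(from), log(to), length.out = n))  # evenly spaced in log(x)

length(x_linear)          # 101 points either way
steps <- diff(log(x_log)) # constant step size on the log scale
```

Both vectors run from 1 to 100; they only differ in how the points in between are spread out.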
As for y, it is always created as (from source code):
y <- eval(expr, envir = ll, enclos = parent.frame()) # expr is log(1+x) here; this ll is a list local to curve that holds the x vector, not the loop variable ll
# i.e. you get a y value for each x value in the x vector that was calculated just previously
Second, what is the purpose of the log argument?
The log argument decides which of the x and y axes will be on a log scale: the x-axis if log is 'x', the y-axis if it is 'y', both axes if it is 'xy', and neither if it is ''.
It should be mentioned that the log scaling of either axis is done by the plot function that curve calls internally; in that sense, curve is only a wrapper around plot.
This is also why, when the log argument contains 'x' (see above), curve computes x as the exponential of an evenly spaced sequence of log values: once plot draws the x-axis on a log scale, those points end up evenly spaced again.
P.S. the source code for the curve function can be seen by typing graphics::curve on the console.
I hope this makes a bit of sense now!