I am working with some R code that I'm sure could be written using one of the apply family of functions, but I can't work out how. I have a data frame with multiple columns, and I want to call a function whose inputs come from several of those columns. Let's say I have this data and a function f:
data<- data.frame(T=c(1,2,3,4), S=c(3,7,8,4), K=c(5,6,11,9))
data
V<-c(0.1,0.2,0.3,0.4,0.5,0.6)
f<-function(para_h,S,T,a,t,b){
r<- V
steps<-T
# Recursive form: Terminal condition for the A and B at time T
A_T=0
B_T=0
A=c()
B=c()
# A and B a time T-1
A[1]= r[steps]*a
B[1]= a*para_h[5]+ ((para_h[4])^(-2))
# Recursion back to time t
for (i in 2:steps){
A[i]= A[i-1]+ r[steps-i+1]*a + para_h[1]*B[i-1]
B[i]= para_h[2]*B[i-1]+a*para_h[5]+ (para_h[4]^(-2))
}
f = exp(log(S)*a + A[t] + B[t]*b )
return(f)
}
This function works well for some specific values :
> para_h<-c(0.1,0.2,0.3,0.4,0.5,0.7)
> f(para_h,S=3,T=2,a=0.4,t=1,b=0.1)
[1] 3.204144
I want to apply a function to each column S and T in a data frame. So, my code looks like:
mapply(function(para_h,S,T,a,t,b) f(para_h,S,T,a,t,b) ,para_h,S=data$S,T=data$T,a=0.4,t=1,b=0.1)
This gives an error:
> mapply(function(para_h,S,T,a,t,b) f(para_h,S,T,a,t,b) ,para_h,S=data$S,T=data$T,a=0.4,t=1,b=0.1)
Error in A[i] = A[i - 1] + r[steps - i + 1] * a + para_h[1] * B[i - 1] :
replacement has length zero
I'm pretty sure the problem is that "steps" is a vector. I would really appreciate an elegant solution.
I hope this has made some sort of sense, any advice would be greatly appreciated.
Couple of things:
1) Each call of your function expects the full para_h vector, but in your mapply code it will receive only one value at a time, so you probably want something like this:
mapply(function(S,T) f(para_h,S,T,a=0.4,t=1,b=0.1), data$S, data$T)
or this:
apply(data,1,function(d) f(para_h,d['S'],d['T'],a=0.4,t=1,b=0.1))
2) Your function throws an error when T==1 (which is the case in the first row of data): 2:steps then counts down as c(2, 1), so r[steps-i+1] becomes r[0], which is empty. You might need to modify your sample data set to be able to run this code.
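To illustrate both points, here is a minimal sketch, assuming the recursion should simply be skipped when T is 1. f_safe is a hypothetical name, not part of the original code; the only substantive change from f is looping over seq_len(steps)[-1], which is empty when steps is 1.

```r
V <- c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6)
para_h <- c(0.1, 0.2, 0.3, 0.4, 0.5, 0.7)
data <- data.frame(T = c(1, 2, 3, 4), S = c(3, 7, 8, 4), K = c(5, 6, 11, 9))

# f_safe: hypothetical variant of f that tolerates T == 1
f_safe <- function(para_h, S, T, a, t, b) {
  r <- V
  steps <- T
  A <- numeric(steps)
  B <- numeric(steps)
  # A and B at time T-1
  A[1] <- r[steps] * a
  B[1] <- a * para_h[5] + para_h[4]^(-2)
  # seq_len(steps)[-1] is 2, 3, ..., steps, and empty when steps == 1
  for (i in seq_len(steps)[-1]) {
    A[i] <- A[i - 1] + r[steps - i + 1] * a + para_h[1] * B[i - 1]
    B[i] <- para_h[2] * B[i - 1] + a * para_h[5] + para_h[4]^(-2)
  }
  exp(log(S) * a + A[t] + B[t] * b)
}

# para_h is taken from the enclosing environment; only S and T vary per row
res <- mapply(function(S, T) f_safe(para_h, S, T, a = 0.4, t = 1, b = 0.1),
              data$S, data$T)
res
```

Whether skipping the recursion is the right behaviour for T == 1 depends on the model, so treat that part as an assumption.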
Related
I would like to know how can I list the outputs of my function (it prints out vectors) so that I am able to know how many steps did it require until finding the optimal solution.
I have the following code and am just wondering what should I do at the end so that when printing out the vectors, it enumerates them one at a time as well. I am new to Rstudio and do see that some operations that have to do with matrices are not common in other programming languages.
I should say that I have already defined another function such as "gradient", but my concern is about the enumeration of the outputs for this particular function.
Sd=function(b0,epsilon=1e-5){
while (norm(gradient(b0))>epsilon) {
num1=(t(b0)%*%Q%*%gradient(b0)-t(y)%*%X%*%gradient(b0))/(t(gradient(b0))%*%Q%*%gradient(b0))
num2=norm(num1)
step=num2*gradient(b0)
b0=b0-step
print(t(b0))
}
}
Thank you for any help I can get.
Here's a generic answer that will show you how to approach this. Without access to your custom functions I can't give a more direct answer. It's generally helpful to give a minimal reproducible example.
That said, my basic suggestion is to use a counter variable, increment it once each loop, and include that in your printed output.
Here's a simplified example that's based on your code, but the only operation we're doing is taking repeated square roots. Note that the arrow operator <- is the best practice for assigning values. (I promise you get used to it!)
# set up a generic function for this minimal example
get_value <- function(x){
return (sqrt(x))
}
my_function <- function(b0, epsilon = 1.1){
# set up a counter variable
i <- 0
# our main loop
while (get_value(b0) > epsilon) {
# increment the counter
i <- i + 1
# do calculations
num1 <- get_value(b0)
# update our current solution
b0 <- num1
# print a message to the console with the counter and the value
message(paste0("Iteration: ",i,"\n",
"b0: ", b0))
}
# print a final message to the console when we stop
message(paste0("Final Iteration: ",i,"\n",
"Final b0: ", b0))
}
my_function(2)
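If you would rather inspect the steps afterwards than read them off the console, a variation on the same toy example is to collect each iterate and return it together with the counter. This is only a sketch; my_function_trace and the names in the returned list are made up for illustration.

```r
# hypothetical variant of the toy example that returns its history
my_function_trace <- function(b0, epsilon = 1.1) {
  i <- 0
  trace <- list()
  while (sqrt(b0) > epsilon) {
    i <- i + 1
    b0 <- sqrt(b0)
    trace[[i]] <- b0   # store the i-th iterate
  }
  list(iterations = i, final = b0, trace = trace)
}

out <- my_function_trace(2)
out$iterations      # number of steps taken
out$trace[[1]]      # value after the first step
```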
I'm attempting to create sigma/summation function with the variables in my dataset that looks like this:
paste0("(choose(",zipdistrib$Leads[1],",",zipdistrib$Starts[1],")*beta(a+",zipdistrib$Starts[1],",b+",zipdistrib$Leads[1],"-",zipdistrib$Starts[1],")/beta(a,b))")
When I enter that code, I get
[1] "(choose(9,6)*beta(a+6,b+9-6)/beta(a,b))"
I want to create a sigma/summation function where a and b are unknown free-floating variables and the values of Leads[i] and Starts[i] are determined by the values for Leads and Starts for observation i in my dataset. I have tried using a sum function in conjunction with mapply and sapply to no avail. Currently, I am taking the tack of creating the function as a string using a for loop in conjunction with a paste0 command so that the only things that change are the values of the variables Leads and Starts. Then, I try coercing the result into a function. To my surprise, I can actually enter this code without creating a syntax error, but when I try to optimize the function for variables a and b, I'm not having success.
Here's my attempt to create the function out of a string.
betafcn <- function (a,b) {
abfcnstring <-
for (i in 1:length(zipdistrib$Zip5))
toString(
paste0(" (choose(",zipdistrib$Leads[i],",",zipdistrib$Starts[i],")*beta(a+",zipdistrib$Starts[i],",b+",zipdistrib$Leads[i],"-",zipdistrib$Starts[i],")/beta(a,b))+")
)
as.function(
as.list(
substr(abfcnstring, 1, nchar(abfcnstring)-1)
)
)
}
Then when I try to optimize the function for a and b, I get the following:
optim(c(a=.03, b=100), betafcn(a,b))
## Error in as.function.default(x, envir) :
argument must have length at least 1
Is there a better way for me to compile a sigma from i=1 to length of dataset with mapply or lapply or some other *apply function? Or am I stuck using a dreaded for loop? And then once I create the function, how do I make sure that I can optimize for a and b?
Update
This is what my dataset would look like:
leads <-c(7,4,2)
sales <-c(3,1,0)
zipcodes <-factor(c("11111", "22222", "33333"))
zipleads <-data.frame(ZipCode=zipcodes, Leads=leads, Sales=sales)
zipleads
## ZipCode Leads Sales
# 1 11111 7 3
# 2 22222 4 1
# 3 33333 2 0
My goal is to create a function that would look something like this:
betafcn <-function (a,b) {
(choose(7,3)*beta(a+3,b+7-3)/beta(a,b))+
(choose(4,1)*beta(a+1,b+4-1)/beta(a,b))+
(choose(2,0)*beta(a+0,b+2-0)/beta(a,b))
}
The difference is that I would ideally like to replace the dataset values with any other possible vectors for Leads and Sales.
Since R vectorizes most of its operations by default, you can write an expression in terms of single values of a and b (which will automatically be recycled to the length of the data) and vectors of x and y (i.e., Leads and Sales); if you compute on the log scale, then you can use sum() (rather than prod()) to combine the results. Thus I think you're looking for something like:
betafcn <- function(a,b,x,y,log=FALSE) {
r <- lchoose(x,y)+lbeta(a+y,b+x-y)-lbeta(a,b)
if (log) r else exp(r)
}
Note that (1) optim() minimizes by default (2) if you're trying to optimize a likelihood you're better off optimizing the log-likelihood instead ...
Since all of the internal functions (+, lchoose, lbeta) are vectorized, you should be able to apply this across the whole data set via:
zipleads <- data.frame(Leads=c(7,4,2),Sales=c(3,1,0))
objfun <- function(p) { ## negative log-likelihood
-sum(betafcn(p[1],p[2],zipleads$Leads,zipleads$Sales,
log=TRUE))
}
objfun(c(1,1))
optim(fn=objfun,par=c(1,1))
I got crazy answers for this example (extremely large values of both shape parameters), but I think that's because it's awfully hard to fit a two-parameter model to three data points!
Since the shape parameters of the beta-binomial (which is what this appears to be) have to be positive, you might run into trouble with unconstrained optimization. You can use method="L-BFGS-B", lower=c(0,0) or optimize the parameters on the log scale ...
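As a rough sketch of the log-scale option: optimize over log(a) and log(b) and exponentiate inside the objective, so both shape parameters stay positive without any box constraints. log_bb and objfun_log are hypothetical names, and the density is written in the usual beta-binomial form with x as the number of trials (Leads) and y as the number of successes (Sales), which is an assumption about what the question intends.

```r
# log beta-binomial probability; x = trials (Leads), y = successes (Sales)
log_bb <- function(a, b, x, y) {
  lchoose(x, y) + lbeta(a + y, b + x - y) - lbeta(a, b)
}

zipleads <- data.frame(Leads = c(7, 4, 2), Sales = c(3, 1, 0))

# negative log-likelihood as a function of lp = c(log(a), log(b))
objfun_log <- function(lp) {
  -sum(log_bb(exp(lp[1]), exp(lp[2]), zipleads$Leads, zipleads$Sales))
}

fit <- optim(par = c(0, 0), fn = objfun_log)
exp(fit$par)   # back-transform to the shape parameters a and b
```

With only three observations the estimates are still poorly determined; the transformation just keeps the search inside the valid parameter region.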
I thought your example was hopelessly complex. If you are going to attempt making a function by pasting character values, you first need to understand how to make a function body with an unevaluated expression, and after that basic task is understood, then you can elaborate ... if in fact it is necessary, noting BenBolker's suggestions.
choosefcn <- function (a,b) {}
txtxpr <- paste0("choose(",9,",",6,")" )
body(choosefcn) <- parse(text= txtxpr)
#----------
> choosefcn
function (a, b)
choose(9, 6)
val1 <- "a"
val2 <- "b"
txtxpr <- paste0("choose(", val1, ",", val2, ")" )
body(choosefcn) <- parse(text= txtxpr)
#
choosefcn
#function (a, b)
#choose(a, b)
It is also possible to configure the formal arguments separately with the formals<- function. See each of these help pages:
?formals
?body
?'function' # needs to be quoted
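For instance, a small sketch of setting the formals and the body separately (g is just a placeholder name):

```r
g <- function() NULL

# alist() allows an argument with no default (a) next to one with a default (b)
formals(g) <- alist(a = , b = 2)

# quote() gives an unevaluated expression to use as the body
body(g) <- quote(choose(a, b))

g
g(5)   # uses the default b = 2
```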
I need to do a quality control in a dataset with more than 3000 variables (columns). However, I only want to apply some conditions in a couple of them. A first step would be to replace outliers by NA. I want to replace the observations that are greater or smaller than 3 standard deviations from the mean by NA. I got it, doing column by column:
height = ifelse(abs(height-mean(height,na.rm=TRUE)) <
3*sd(height,na.rm=TRUE),height,NA)
And I also want to create other variables based on different columns. For example:
data$CGmark = ifelse(!is.na(data$mark) & !is.na(data$height) ,
paste(data$age, data$mark,sep=""),NA)
An example of my dataset would be:
name = factor(c("A","B","C","D","E","F","G","H","H"))
height = c(120,NA,150,170,NA,146,132,210,NA)
age = c(10,20,0,30,40,50,60,NA,130)
mark = c(100,0.5,100,50,90,100,NA,50,210)
data = data.frame(name=name,mark=mark,age=age,height=height)
data
I have tried this (for one condition):
d1=names(data)
list = c("age","height","mark")
ntraits=length(list)
nrows=dim(data)[1]
for(i in 1:ntraits){
a=list[i]
b=which(d1==a)
d2=data[,b]
for (j in 1:nrows){
d2[j] = ifelse(abs(d2[j]-mean(d2,na.rm=TRUE)) < 3*sd(d2,na.rm=TRUE),d2[j],NA)
}
}
Someone told me that I am not storing d2. How can I create for loops to apply the conditions I want? I know that there are similar questions, but I haven't understood them yet. Thanks in advance.
You pretty much wrote the answer in your first line. You're overthinking this one.
First, it's good practice to encapsulate this kind of operation in a function. Yes, function dispatch is a tiny bit slower than otherwise, but the code is often easier to read and debug. Same goes for assigning "helper" variables like mean_x: the cost of assigning the variable is very, very small and absolutely not worth worrying about.
NA_outside_3s <- function(x) {
  mean_x <- mean(x, na.rm=TRUE)   # ignore existing NAs when computing the mean
  sd_x <- sd(x, na.rm=TRUE)
  x_outside_3s <- abs(x - mean_x) > 3 * sd_x   # TRUE for the outliers
  x[x_outside_3s] <- NA # no need for ifelse here
  x
}
Of course, you can choose any function name you want. More descriptive is better.
Then if you want to apply the function to every column, just loop over the columns. That function NA_outside_3s is already vectorized, i.e. it takes a numeric vector as an argument and returns a vector of the same length.
cols_to_loop_over <- 1:ncol(my_data) # or, some subset of columns.
for (j in cols_to_loop_over) {
my_data[, j] <- NA_outside_3s(my_data[, j])
}
I'm not sure why you wrote your code the way you did (and it took me a minute to even understand what you were trying to do), but looping over columns is usually straightforward.
In my comment I said not to worry about efficiency, but once you understand how the loop works, you should rewrite it using lapply:
my_data[cols_to_loop_over] <- lapply(my_data[cols_to_loop_over], NA_outside_3s)
Once you know how the apply family of functions works, they are very easy to read if written properly. And yes, they are somewhat faster than looping, but not as much as they used to be. It's more a matter of style and readability.
Also: do NOT name a variable list! This masks the function list, which is an R built-in function and a fairly important one at that. You also shouldn't generally name variables data because there is also a data function for loading built-in data sets.
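Putting this together on the sample data from the question, a sketch might look like the following. The names my_data, traits and na_beyond_3sd are placeholders chosen for this example, and the derived CGmark column simply repeats the question's ifelse logic.

```r
my_data <- data.frame(
  name   = factor(c("A", "B", "C", "D", "E", "F", "G", "H", "H")),
  mark   = c(100, 0.5, 100, 50, 90, 100, NA, 50, 210),
  age    = c(10, 20, 0, 30, 40, 50, 60, NA, 130),
  height = c(120, NA, 150, 170, NA, 146, 132, 210, NA)
)

# hypothetical helper: keep values strictly within 3 sd of the mean
na_beyond_3sd <- function(x) {
  keep <- abs(x - mean(x, na.rm = TRUE)) < 3 * sd(x, na.rm = TRUE)
  x[which(!keep)] <- NA   # which() drops positions where keep is NA
  x
}

# only the columns that need the check, selected by name
traits <- c("age", "height", "mark")
my_data[traits] <- lapply(my_data[traits], na_beyond_3sd)

# derived column built from several other columns
my_data$CGmark <- ifelse(!is.na(my_data$mark) & !is.na(my_data$height),
                         paste0(my_data$age, my_data$mark), NA)
my_data
```

On a sample this small no value can lie more than 3 standard deviations from the mean, so the three columns come back unchanged; the check only starts to bite on larger data.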
I'm an inexperienced R programmer, trying to make a piece of code I have written work. This is probably an elemental problem. I want this code to check one value against its predecessor in a vector, and if it is greater than a certain threshold value, to return which element on that vector satisfies this criterion. Once it has found one case, I'd like it to stop.
At present my code half-functions as I'd like it to, but it goes through the whole vector and once it reaches the end it checks a[i+1] which is NA and gives me an error message.
testdata<-c(0,0.1,0.2,0.3,0.45,0.5,0.6,0.7,0.8,0.9,1.0)
MLD<-function(a,...){
x<-NULL
y<-NULL
for(i in seq(along=a)){
if(a[i+1]>=a[i]+0.125)
{x=c(x,a[i+1]); y=which(a==x); print(y)}
}
}
try(MLD(testdata),silent=TRUE) # code finds right element
MLD(testdata) # but continues looking until it runs out of data
I know I need a break() or a stop() somewhere but I can't seem to work it out, I hope you can help me.
You can simplify your code to:
which(diff(testdata) > 0.125) + 1
Which you could put in a function:
MLD = function(a) which(diff(a) > 0.125) + 1
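Since the question asks to stop at the first hit and uses >= rather than >, a slightly more defensive sketch could look like this (MLD_first and the threshold argument are illustrative names, not from the original code):

```r
testdata <- c(0, 0.1, 0.2, 0.3, 0.45, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0)

# position of the first element that exceeds its predecessor by >= threshold
MLD_first <- function(a, threshold = 0.125) {
  hits <- which(diff(a) >= threshold) + 1
  if (length(hits) == 0) NA_integer_ else hits[1]
}

MLD_first(testdata)
```

Returning NA when nothing qualifies is a design choice; an empty integer vector would work just as well if the caller checks its length.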
I have got a column with different numbers (from 1 to tt) and would like to use looping to perform a count on the occurrence of these numbers in R.
count = matrix(ncol=1,nrow=tt) #creating an empty matrix
for (j in 1:tt)
{count[j] = 0} #initiate count at 0
for (j in 1:tt)
{
for (i in 1:N) #for each observation (1 to N)
{
if (column[i] == j)
{count[j] = count[j] + 1 }
}
}
Unfortunately I keep getting this error.
Error in if (column[i] == j) { :
missing value where TRUE/FALSE needed
So I tried:
for (i in 1:N) #from obs 1 to obs N
if (column[i] = 1) print("Test")
I basically got the same error.
I tried to do a bit of research on this kind of error, and a lot has been said about "debugging", which I'm not familiar with.
Hopefully someone can tell me what's happening here. Thanks!
As you progress with your learning of R, one feature you should be aware of is vectorisation. Many operations that (in C say) would have to be done in a loop can be done all at once in R. This is particularly true when you have a vector/matrix/array and a scalar, and want to perform an operation between them.
Say you want to add 2 to the vector myvector. The C/C++ way to do it in R would be to use a loop:
for ( i in 1:length(myvector) )
myvector[i] = myvector[i] + 2
Since R has vectorisation, you can do the addition without a loop at all, that is, add a scalar to a vector:
myvector = myvector + 2
Vectorisation means the loop is done internally. This is much more efficient than writing the loop within R itself! (If you've ever done any Matlab or python/numpy it's much the same in this sense).
I know you're new to R so this is a bit confusing but just keep in mind that often loops can be eliminated in R.
With that in mind, let's look at your code:
The initialisation of count to 0 can be done at creation, so the first loop is unnecessary.
count = matrix(0,ncol=1,nrow=tt)
Secondly, because of vectorisation, you can compare a vector to a scalar.
So for your inner loop in i, instead of looping through column and doing if column[i]==j, you can do idx = (column==j). This returns a vector that is TRUE where column[i]==j and FALSE otherwise.
To find how many elements of column are equal to j, we just count how many TRUEs there are in idx. That is, we do sum(idx).
So your double-loop can be rewritten like so:
for ( j in 1:tt ) {
idx = (column == j)
count[j] = sum(idx) # no need to add
}
Now it's even possible to remove the outer loop in j by using the function sapply:
sapply( 1:tt, function(j) sum(column==j) )
The above line of code means: "for each j in 1:tt, return function(j)", and it returns a vector where the j'th element is the result of the function.
So in summary, you can reduce your entire code to:
count = sapply( 1:tt, function(j) sum(column==j) )
(Although this doesn't explain your error, which I suspect is to do with the construction or class of your column).
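One common way to get that message, and this is only a guess since the data isn't shown, is an NA reaching the if() condition, either because column contains missing values or because N is larger than length(column). A small sketch with made-up data:

```r
column <- c(1, 2, NA, 2)   # made-up data with one missing value

cond <- column[3] == 1     # comparing NA gives NA, not TRUE or FALSE
cond

# if() cannot branch on NA; capture the error message instead of stopping
res <- tryCatch(if (cond) "yes" else "no",
                error = function(e) conditionMessage(e))
res

# indexing past the end of a vector also yields NA
column[5]

# the vectorised count can skip missing values explicitly
sum(column == 2, na.rm = TRUE)
```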
I suggest to not use for loops, but use the count function from the plyr package. This function does exactly what you want in one line of code.
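If you would rather not load a package, base R can do the same kind of count. A short sketch, with column and tt filled in with made-up values:

```r
column <- c(1, 3, 3, 2, 3, 1)   # made-up data
tt <- 4

# counts for the values 1..tt, including zeros for values that never occur
counts <- tabulate(column, nbins = tt)
counts

# the same counts as a labelled table
tab <- table(factor(column, levels = 1:tt))
tab
```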