New to R and need help coding function and understanding errors - r

So I'm taking a class for R, and I'm having a really hard time coding basic formulas.
Basically what I'm trying to do is find 3 variables but I keep getting errors. (I've attached a picture for easier presentation)
Note:
d is the number of DOF, d=1,...,20
and this is my code :
set.seed(29)
library(ISLR)
library(splines)
#### ETAPE 1
x <- runif(1000,min=0,max=10)
lambda=(2*x)+(0.2*x*sin(x))
y <- rpois(1000,lambda)
J <- data.frame(x=x, y=y)
plot(x,y,cex=0.4)
### ETAPE 2
ajust <- matrix(NA,20,1000)
for(i in (1:20)) {
smoothing=lm(y~ns(x=x,df=i),data=J)
ajust[i,]=predict(smoothing)
}
fd=function(d) {return(smoothing[d])}
for(i in (1:20)) {
lines(x,ajust[i,],col=i)
}
lines(x,lambda,col='black')
for(i in (1:20)) {
d1<- (1/1000)*sum((y-ajust[i,])**2)
}
### Calcul de D2
Mean=lambda
for (d in (1:20)){
W=(Mean-fd(x))**2
d2=sum(W)/1000
}
It works up until "calcul de D2" where I get "Non-numeric argument to binary operator " error. And I don't understand how to make it work. I know my question might seem a little bit vague so don't hesitate to let me know if something isn't clear.

The bug in the code is that your fd(x) call returns a list: fd indexes the fitted lm object smoothing, and subsetting a list with [ ] gives back a list. A list is, as the error says, not numeric.
We don't have information on what fd should be (it's not defined in the picture or question), but it seems that the solution would be to extract whichever component of fd(x) you meant to subtract from Mean.
For example:
for (d in (1:20)){
W=(Mean-fd(x)$fitted.values)**2
d2=sum(W)/1000
}
Update
I saw your followup comment/question regarding "D3" from the equations in the picture. I'm a little unsure because I don't have the textbook/context to be sure of the notation (X isn't formally defined and I also had to take a leap of faith that Y in the picture = Mean in the code based on how you used it). This is my best guess, based on that context:
# The equation for d3 is the expected value of (Y-fd(X))^2.
#
# I don't know the context of this, but I see the definition of d1 and d2.
#
# D1 = for(i in (1:20)) {
# d1<- (1/1000)*sum((y-ajust[i,])**2)
# }
d1 # [1] 10.04203
#
# D2 = for (d in (1:20)){
# W=(Mean-fd(x)$fitted.values)**2
# d2=sum(W)/1000
# }
#
d2 # [1] 0.2024568
#
# Based on that, Y = Mean, y = y, x=x, i=i, N=1000
# W = (Y - fd(xi))^2
# I presume X = vectorized xi
#
# So, D3 =
D3 = (Mean - fd(x)$fitted.values)^2
#Since it's an expected value, I presume we take the mean
D3 = mean(D3)
The part I am most likely guessing wrong is X. In the pictured equation, X looks like the vector of all x[i]. But each element of x is an x[i], so x is already that vector.
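As an alternative to indexing the lm object, here is a minimal sketch that computes d1 and d2 for every d from the matrix of fitted values. It assumes (this is a guess, since the picture is not available) that fd(x_i) means the fitted value of the spline with d degrees of freedom at x_i, and that d2 compares those fitted values with the true mean lambda. Note that the loops in the question overwrite d1 and d2 on every iteration, so only the value for d = 20 survives; the sketch keeps all 20.

```r
library(splines)

set.seed(29)
n <- 1000
x <- runif(n, min = 0, max = 10)
lambda <- 2 * x + 0.2 * x * sin(x)   # true mean, as in the question
y <- rpois(n, lambda)
J <- data.frame(x = x, y = y)

# Row i holds the fitted values of the natural spline with i degrees of freedom
ajust <- matrix(NA_real_, nrow = 20, ncol = n)
for (i in 1:20) {
  fit <- lm(y ~ ns(x, df = i), data = J)
  ajust[i, ] <- fitted(fit)
}

# Assumed definitions: mean squared distance of the fit to y (d1)
# and to the true mean lambda (d2), one value per d
d1 <- rowMeans(sweep(ajust, 2, y)^2)
d2 <- rowMeans(sweep(ajust, 2, lambda)^2)
```

d1[20] and d2[20] correspond to the single values left over after the original loops finish; which.min(d2) then gives the d whose fit is closest to the true mean under these assumptions.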

Related

How does variable assignment work in function calls in the R language?

I am trying to exercise a simulation of Sierpinski triangle in R with affine transformation and Iterated Function System (IFS). And hopefully, I can further exercise how the simulation of Barnsley's fern can also be done. For those who know Chinese, this video is my starting point of this exercise.
Here is a short introduction of the simulation process:
1. Create an equilateral triangle, name the vertices A, B, C
2. Create a random initial point lying inside the triangle ABC
3. Sample A, B, C with equal chances
4. If the outcome is A, then move the initial point to the midpoint of A and itself
5. Repeat step 3, and move the last point to the midpoint of the outcome point and itself.
By doing this repeatedly, we should see the path of the points looks like a Sierpinski triangle.
I wonder how variable assignment works inside a self-defined function. I would like to create an object (a matrix or a data frame) to store the path of simulated points and keep updating it to track how the points move.
The following is my current code:
# create the triangle
triangle <- matrix(c(A = c(-1,0),
B = c(1, 0),
C = c(0, sqrt(3))),
byrow = TRUE, nrow = 3, ncol = 2)
colnames(triangle) <- c("X", "Y") # axis name
rownames(triangle) <- c("A", "B", "C")
# sample an initial point inside the triangle ABC
sampleInit <- function(){
X <- runif(1, min = -1, max = 1)
Y <- runif(1, min = 0, max = sqrt(3))
if( (Y >= 0) && (Y <= (sqrt(3)*X + sqrt(3))) && (Y <= -sqrt(3)*X+sqrt(3)) ){
return(cbind(X, Y))
} else {
sampleInit()
}
}
### graph: plot the triangle and the initial point together
graphics.off()
plot(triangle, xlim = c(-1, 1), ylim = c(0, sqrt(3)))
par(new = TRUE)
plot(sampleInit(), xlim = c(-1, 1), ylim = c(0, sqrt(3)), col = "red")
### a three-sided dice: determine the direction to move along
diceRoll <- function(){
return(sample(c("A", "B", "C"), size = 1, prob = c(1/3, 1/3, 1/3)))
}
## path
stepTrace <- as.data.frame(sampleInit())
move <- function(diceOutCome, stepTrace){
lastStep <- tail(stepTrace, 1)
if(diceOutCome == "A"){
X <- (-1 + lastStep[,1])/2
Y <- (0 + lastStep[,2])/2
} else if(diceOutCome == "B"){
X <- (1 + lastStep[,1])/2
Y <- (0 + lastStep[,2])/2
} else if(diceOutCome == "C"){
X <- (0 + lastStep[,1])/2
Y <- (sqrt(3) + lastStep[,2])/2
}
lastStep <- cbind(X, Y)
stepTrace <- rbind(stepTrace, lastStep)
}
move(diceRoll(), stepTrace)
View(stepTrace)
Sorry for the long story and not jumping to the key question directly. My question is that stepTrace (the object in which I would like to store the path) didn't get updated when I executed the last two lines.
What I imagined was that the assignment in move() would update the data frame stepTrace, but it turns out it doesn't. I checked my code in the debugger and found that stepTrace did get updated inside the function call, but the newly assigned value was not passed outside the function call. That's why I would like to ask how assignment works in R. What is the difference between this and other general-purpose languages such as Java? (I imagine that doing this exercise in Java would not run into this kind of assignment issue. Correct me if I am wrong, since I am still new to Java.)
Similar problems bother me when I try to assign variables inside a loop. I know there is a base function assign that helps to resolve this issue, but I don't know the mechanism behind it.
I tried to google my question, but I am not sure which keywords I should use, and I didn't find direct answers. Any comment, keyword or pointer to the documentation is appreciated!
In short, your move function does what you want, but it is not advisable to write it like that. In its current form, stepTrace is updated in the function's local environment, but not in the global environment, where your stepTrace lives. They are not the same stepTrace. To fix it, you can run stepTrace <- move(diceRoll(), stepTrace), but beware of the second circle of The R Inferno (growing objects): calling rbind in a loop gets slow as the object grows. For a cleaner approach, remove the last stepTrace assignment from move, so that the function ends with rbind(stepTrace, lastStep).
From ?return: If the end of a function is reached without calling return, the value of the last evaluated expression is returned.
Consider the following examples:
x <- 5
a <- b <- c <- d <- 1
f1 <- function(x) x + 1
f2 <- function(x) return(x + 1)
f3 <- function(x) x <- x + 1
f4 <- function(x) x <<- x + 1
f1(1)
f2(1)
f3(1) # your problem
f4(1) # <<- assigns in the enclosing environment: the global x is replaced with 2
a <- b <- c <- d <- 1
a <- f1(1)
b <- f2(1)
c <- f3(1)
d <- f4(1)
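f3 and f4 are generally considered bad practice: f4 has a side effect, i.e. it modifies a non-local variable, and f3 ends in an assignment, so its result is returned invisibly. f2 might trigger a discussion (explicit return() versus relying on the last evaluated expression). For f3, see the result of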
c(f3(1))
#> [1] 2
Calling f3(1) by itself printed nothing, so we might expect a vector of length 0 (?). In fact an assignment returns its value invisibly, which is why c(f3(1)) shows 2. Consider removing any assignment as the last operation within your functions, and avoid naming your function arguments the same as the objects you intend to change.
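Applied to the question's simulation, a minimal sketch of that advice could look as follows. The names n_steps and path, the fixed starting point and the simplified move() are my own choices for illustration, not the asker's code. The function only returns the new point, and the caller stores it in a preallocated matrix instead of growing a data frame with rbind.

```r
# Vertices of the triangle, one per row
triangle <- matrix(c(-1, 0,
                      1, 0,
                      0, sqrt(3)),
                   ncol = 2, byrow = TRUE,
                   dimnames = list(c("A", "B", "C"), c("X", "Y")))

# move() has no side effects: it returns the midpoint of the
# current point and the chosen vertex
move <- function(vertex, point) {
  (triangle[vertex, ] + point) / 2
}

set.seed(1)
n_steps <- 1000                      # hypothetical number of iterations
path <- matrix(NA_real_, nrow = n_steps + 1, ncol = 2,
               dimnames = list(NULL, c("X", "Y")))
path[1, ] <- c(0, 0.5)               # assumed starting point inside ABC

for (s in seq_len(n_steps)) {
  vertex <- sample(rownames(triangle), size = 1)
  path[s + 1, ] <- move(vertex, path[s, ])   # the caller does the assignment
}
```

plot(path, pch = ".") should then show the triangle pattern. The point is that the update happens through the return value in the calling environment, so no <<- or assign is needed.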
@DonaldSeinen explained how to fix your code in his answer. I'll try to point you to documentation for more details.
First, you don't need to go to external documentation. An Introduction to R and The R Language Definition manuals are included in R distributions. The Introduction describes what's going on in lots of detail in section 10.7, "Scope". There's a different description in the Language Definition in section 3.5, "Scope of Variables".
Some people find the language in those manuals to be too technical. An easier to read external reference that gets it right is Wickham's Advanced R, readable online at https://adv-r.hadley.nz/. Scoping is discussed in chapters 6 and 7, especially sections 6.4 and 7.2.

Writing a function in R to solve Plank's equation

I am trying to write my first function in R to calculate emittance using Plank's function for different temperatures. I can do it manually as below for temperatures from 200 to 310 K.
pi <- 3.141593
h <- 6.626068963e-34
c <- 2.99792458e+8
lambda <- 4 * 1e-6
k <- 1.38e-23
t <- c (200:310)
a <- (2*pi*(c^2)*h)/(lambda^5)
b <- exp((h*c)/(lambda*k*t))
B <- a * (1/(b-1))
Where B is the vector of values I want.
Now here is an effort to write a function in R:
P_function <- function(t, pi = 3.141593, h = 6.626068963e-34, c = 2.99792458e+8,
lambda = 4 * 1e-6, k = 1.38e-2) {
((2*pi*(c^2)*h)/(lambda^5)) *((1/(exp((h*c)/(lambda*k*t))-1)))
}
Now for different values of t (200-300K), how do I implement this function?
Couple of problems. First, pi is already a defined constant at better precision than you are using.
> rm(pi) # remove your copy
> pi
[1] 3.141593 # default for console printing is only 7 significant digits
> print(pi, digits=18)
[1] 3.14159265358979312 # but there is more "depth" to be had
Second, it makes no sense to put scientific constants in the parameter list. Since they're constant they can be defined in the body. Parameter lists are for items that might vary from situation to situation.
newPfun <- function(t) {
  h <- 6.626068963e-34
  c <- 2.99792458e+8
  lambda <- 4 * 1e-6
  k <- 1.38e-23
  a <- (2*pi*(c^2)*h)/(lambda^5) # pi is already defined
  b <- exp((h*c)/(lambda*k*t))
  B <- a * (1/(b-1))
  return(B)
}
This is just your original code "packaged" to accept a vector of temperatures. (And I'm pretty sure that's not the right spelling of the scientist's name; it is Planck.)
Not sure where your second function is flawed. Perhaps a mismatched parenthesis. After trying to duplicate the results with a single expression and failing multiple times, I'm now wondering if it's really a problem with numerical overflow (or underflow).
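One difference worth checking: the default for k in the question's P_function is written as 1.38e-2, while the manual calculation uses 1.38e-23. If that is a typo (an assumption on my part), the exponent h*c/(lambda*k*t) becomes so small that exp(...) - 1 evaluates to 0 in double precision, and the function returns Inf. Below is a sketch with the constant corrected; the function name planck and the variable names c0 and temps are mine, and expm1() is used only as a numerically safer way to write exp(x) - 1.

```r
planck <- function(t, lambda = 4e-6) {
  h  <- 6.626068963e-34   # Planck constant, value as given in the question
  c0 <- 2.99792458e8      # speed of light; named c0 to avoid masking c()
  k  <- 1.38e-23          # Boltzmann constant, value as given in the question
  a  <- 2 * pi * c0^2 * h / lambda^5
  a / expm1(h * c0 / (lambda * k * t))   # expm1(x) computes exp(x) - 1
}

temps <- 200:310          # temperatures in K
B <- planck(temps)        # vectorised: one value per temperature
```

Because every operation in the body is vectorised, passing the whole temperature vector is all that is needed; no loop or sapply is required.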

Converting Mahalanobis p1 probabilities to p2 probabilities - is vectorization possible in this context?

I'm trying to write a function that takes in p1 probabilities for Mahalanobis distances and returns p2 probabilities. The formula for p2, along with a worked example, is given on the IBM website. I have written a function (below) that solves the problem and allows me to reproduce the p2 values given in the worked example on that webpage.
p1_to_p2 <- function(p1,N) {
p2 <- numeric(length(p1))
for (i in 1:length(p1))
{
k <- i;
p1_value <- p1[i];
start_value <- 1;
while (k >= 1)
{
start_value = start_value - choose(N,N-k+1) * (1-p1_value)^(N-k+1) * (p1_value)^(k-1)
k <- k-1;
}
p2[i] <- start_value;
}
return(p2)
}
p1 <- c(.0046132,.0085718,.0390278,.0437704,.0475222)
N <- 73
p1_to_p2(p1,N)
Although the function works, it's been suggested to me by a colleague that it's inefficient/poorly written as it's not vectorized. This is indeed potentially relevant since in general we will be converting a lot more than just 5 p1 values to p2 values.
I have some limited experience vectorizing code, but I am wondering if a vectorized solution is possible in this context since within the loop the variable start_value constantly needs to update itself. If vectorization is not possible, is there some other way I should improve the code so that it works better?
Here is one way to do it. Breaking it into steps can help (please read the comments):
#Input:
N <- 73
p1 <- c(.0046132,.0085718,.0390278,.0437704,.0475222)
n <- N:(N-length(p1)+1)
# code:
mahalanobis_dist = function(x=x,n){
m = max(n)
max_min = Reduce(`*`,c(1, n[-length(n)]), accumulate = TRUE)
acc = c(1, Reduce(`*`, seq_along(n), accumulate = TRUE)[-length(n)])
comns = max_min/acc
exp <- comns*((1 - x)**n)*(x**(m - n))
return(1- sum(exp))
} ## computes one p2 value (despite its name, the function does not compute a Mahalanobis distance)
## This is just an iterator for each of the sequences we have to run the above function
ls <- lapply(n, function(x)(max(n):x))
## creating a list of iterators
## applying mapply, mapply or Map can iterate multiple inputs of the function,
## here the input p1 and ls , p1 is your input points, ls is the iterator created above
mapply(mahalanobis_dist,p1, ls)
## Applying the function on each iterators
#Output:
#> mapply(mahalanobis_dist,p1, ls)
#[1] 0.2864785 0.1299047 0.5461263 0.3973690
#[5] 0.2662369
Note:
Also, the last two steps can be combined into a single call, as below:
mapply(mahalanobis_dist,p1, lapply(n, function(x)(max(n):x)))
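A fully vectorized alternative, as a sketch: each term in the question's loop, choose(N, N-k+1) * (1-p1)^(N-k+1) * p1^(k-1), is the binomial probability of exactly k-1 successes in N trials with success probability p1. The running subtraction for the i-th value is therefore one minus a binomial CDF evaluated at i-1, which pbinom computes for all elements at once. The function name p1_to_p2_vec is mine, and it assumes, as the original loop does, that p1 is sorted in increasing order so that position i is the rank.

```r
p1_to_p2_vec <- function(p1, N) {
  # P(X >= i) for X ~ Binomial(N, p1[i]), i.e. 1 - P(X <= i - 1),
  # evaluated for every position i in one call
  pbinom(seq_along(p1) - 1, size = N, prob = p1, lower.tail = FALSE)
}

p1 <- c(.0046132, .0085718, .0390278, .0437704, .0475222)
N <- 73
p2 <- p1_to_p2_vec(p1, N)
```

This removes both loops and avoids forming large binomial coefficients explicitly, which may matter when N or the number of p1 values is large.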

Vectors with sigma notation (R)

I'm now learning R and have some difficulties while computing sigma notation. I know how to do the basic stuff like this:
summ <- 10:100
sum(summ^3 + 4 * summ^2)
But I don't know how to do the same when the expression contains values other than i (for example x and y), or when there are two sigma notations in a row.
At first I thought I just needed to do the same as in the simple sigma notation with only i's:
summ <- 1:10
sum((x^summ) / (y^summ))
But it shows an error that it is not a numeric argument.
Thank you in advance for your help.
For your second formula, you can define a function like below
f <- function(x,y,n) sum((x/y)**(1:n))
For your last formula, you can rewrite the expression as a product of two terms (you need a math transformation as the first step if you want to simplify the procedure), since i and j are independent
> sum((1:20)**2)*sum(1/(5+(1:10)**3))
[1] 886.0118
Otherwise, a straightforward translation from the formula could be using nested sapply
> sum(sapply(1:20,function(i) sapply(1:10, function(j) i**2/(5+j**3))))
[1] 886.0118
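Another way to write the double sum, sketched here with outer(): it builds the full 20 x 10 table of terms, one row per i and one column per j, and sum() then adds them all up. The variable names terms and total are only for illustration.

```r
i <- 1:20
j <- 1:10

# outer() evaluates the function on every (i, j) pair and returns a matrix
terms <- outer(i, j, function(i, j) i^2 / (5 + j^3))
total <- sum(terms)
```

This is handy when the summand does not factor into a part depending only on i and a part depending only on j, so the product trick above is not available.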
That's, basically, the answer to the first question with undefined variables x and y:
x <- readline(prompt = "Enter x: ")
y <- readline(prompt = "Enter y: ")
x <- as.integer(x)
y <- as.integer(y)
i = 1:10
answer <- sum((x^i) / (y^i))
answer

Goal seek in R with 3 parameters

#========
#DATABASE
#========
database <- matrix(c(51,43,-22,-92,28,-21,68,22,9,-20,-30,-1,-10,10,-10,-5,10,-2,30,-3,-5),ncol=3,byrow=TRUE)
colnames(database ) <- c("A","B","C")
database <- as.data.frame(database )
x<-1
y<-1
z<-1
database$RES<-c(1,0,0,0,1,0,1)
database$SCORE<- database$A*x+database$B*y+database$C*z
database$PREV<- ifelse(database$SCORE>1,1,0)
#========
#TARGET
#========
t<-table(database$RES, database$PREV)
P<-(t[1]+t[4])/nrow(database)
This is an example of my database (60k rows). I want to find the values of x, y and z that maximize P (in the code I set them to 1 just so the script runs, but I want to find them!). The target P must be 1 or close to 1.
I didn't find what I'm looking for in threads with similar titles.
In Excel this is pretty simple, but I can't find how to do it for more than 1 parameter.
Thanks in advance.
I'm not satisfied with this answer, but maybe it can at least get you started.
The optim() function searches for the set of parameters that optimizes the problem you're trying to solve, but it looks to me, at least with the toy data, that it gets stuck in a local maximum. You'd have to run it several times to find the best parameters; for me the best result is P = 0.8571429, and even then the x, y, z values can vary quite significantly, which would indicate that there are several equally good solutions for this particular data.
database <- matrix(c(51,43,-22,-92,28,-21,68,22,9,-20,-30,-1,-10,10,-10,-5,10,-2,30,-3,-5),ncol=3,byrow=TRUE)
colnames(database ) <- c("A","B","C")
database <- as.data.frame(database )
database$RES <- c(1,0,0,0,1,0,1)
find_best <- function(data, x) {
SCORE <- data$A*x[1]+data$B*x[2]+data$C*x[3]
PREV <- ifelse(SCORE>1,1,0)
t <- table(data$RES, PREV)
P <- (t[1]+t[4])/nrow(data)
P
}
result <- optim(c(1, 1, 1), find_best, data = database, method = "SANN", control = list(fnscale = -1))
result$value
[1] 0.8571429 # The P value
result$par
[1] 2.396844 -4.460343 -7.137460 # These are your sought after x, y, z parameters.
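The "run it several times" idea can be sketched as a loop over random starting points. The helper name accuracy, the number of restarts, the start range and maxit are my own arbitrary choices. The sketch also computes P as the share of rows where the prediction equals RES, which gives the same number as (t[1]+t[4])/nrow(data) when the table is 2 x 2, but does not break when every row is predicted as the same class.

```r
database <- as.data.frame(matrix(
  c(51, 43, -22, -92, 28, -21, 68, 22, 9, -20, -30, -1,
    -10, 10, -10, -5, 10, -2, 30, -3, -5),
  ncol = 3, byrow = TRUE, dimnames = list(NULL, c("A", "B", "C"))))
database$RES <- c(1, 0, 0, 0, 1, 0, 1)

# Share of rows where the thresholded score agrees with RES
accuracy <- function(par, data) {
  score <- data$A * par[1] + data$B * par[2] + data$C * par[3]
  mean((score > 1) == (data$RES == 1))
}

set.seed(1)
# 20 random starting points (assumed range -10..10 for each parameter)
starts <- replicate(20, runif(3, min = -10, max = 10), simplify = FALSE)
fits <- lapply(starts, function(s) {
  optim(s, accuracy, data = database, method = "SANN",
        control = list(fnscale = -1, maxit = 2000))
})

# Keep the run with the highest P
values <- vapply(fits, function(f) f$value, numeric(1))
best <- fits[[which.max(values)]]
```

best$par then holds the x, y, z of the best run and best$value its P. Since P is a step function of the parameters, gradient-based methods are of little use here, which is one reason to stay with a stochastic method such as SANN.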

Resources