#========
#DATABASE
#========
database <- matrix(c(51,43,-22,-92,28,-21,68,22,9,-20,-30,-1,-10,10,-10,-5,10,-2,30,-3,-5),ncol=3,byrow=TRUE)
colnames(database ) <- c("A","B","C")
database <- as.data.frame(database )
x<-1
y<-1
z<-1
database$RES<-c(1,0,0,0,1,0,1)
database$SCORE<- database$A*x+database$B*y+database$C*z
database$PREV<- ifelse(database$SCORE>1,1,0)
#========
#TARGET
#========
t<-table(database$RES, database$PREV)
P<-(t[1]+t[4])/nrow(database)
This is an example of my database (the real one has 60k rows). I want to find values for x, y and z that maximise P (in the code I set them to 1 just so the script runs, but they are what I want to find). The target P must be 1 or close to 1.
I didn't find what I'm looking for in threads with similar titles.
In Excel this is pretty simple, but I can't optimise more than one parameter there.
Thanks in advance.
I'm not satisfied with this answer, but maybe it can at least get you started.
The optim() function searches for the set of parameters that optimises the function you give it, but it looks to me, at least with the toy data, as if it gets stuck in a local maximum. You'd have to run it several times to find the best parameters. For me the best result is P = 0.8571429, and even then the x, y, z values can vary quite significantly, which would indicate that there are several equally good solutions for this particular data.
database <- matrix(c(51,43,-22,-92,28,-21,68,22,9,-20,-30,-1,-10,10,-10,-5,10,-2,30,-3,-5),ncol=3,byrow=TRUE)
colnames(database ) <- c("A","B","C")
database <- as.data.frame(database )
database$RES <- c(1,0,0,0,1,0,1)
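find_best <- function(data, x) {
  SCORE <- data$A*x[1]+data$B*x[2]+data$C*x[3]
  PREV <- ifelse(SCORE>1,1,0)
  # share of rows where the prediction matches RES; unlike t[1]+t[4] on a
  # table, this also works when PREV contains only one distinct value
  mean(PREV == data$RES)
}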
result <- optim(c(1, 1, 1), find_best, data = database, method = "SANN", control = list(fnscale = -1))
result$value
[1] 0.8571429 # The P value
result$par
[1] 2.396844 -4.460343 -7.137460 # These are your sought after x, y, z parameters.
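Since a single SANN run can stop at a local maximum, one way to act on the "run it several times" advice is to restart optim() from several random starting points and keep the best run. The following is only a sketch under assumptions: the helper name accuracy, the number of restarts (10), the rnorm(3) starting values and maxit = 2000 are illustrative choices, not part of the original answer.

```r
# toy data from the question
database <- data.frame(
  A   = c(51, -92, 68, -20, -10, -5, 30),
  B   = c(43, 28, 22, -30, 10, 10, -3),
  C   = c(-22, -21, 9, -1, -10, -2, -5),
  RES = c(1, 0, 0, 0, 1, 0, 1)
)

# share of rows where (score > 1) agrees with RES
accuracy <- function(par, data) {
  score <- data$A * par[1] + data$B * par[2] + data$C * par[3]
  mean(as.integer(score > 1) == data$RES)
}

set.seed(1)  # only to make the sketch repeatable
runs <- lapply(1:10, function(i) {
  optim(rnorm(3), accuracy, data = database, method = "SANN",
        control = list(fnscale = -1, maxit = 2000))  # fnscale = -1 maximises
})

# keep the run with the highest P
best <- runs[[which.max(sapply(runs, `[[`, "value"))]]
best$value
best$par
```

Because P only takes a handful of distinct values, many parameter vectors tie for the best score, so the values in best$par should be read as one solution among several.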
I am trying to exercise a simulation of Sierpinski triangle in R with affine transformation and Iterated Function System (IFS). And hopefully, I can further exercise how the simulation of Barnsley's fern can also be done. For those who know Chinese, this video is my starting point of this exercise.
Here is a short introduction of the simulation process:
1. Create an equilateral triangle and name the vertices A, B, C.
2. Create a random initial point lying inside the triangle ABC.
3. Sample A, B, C with equal chances.
4. If the outcome is A, move the initial point to the midpoint of A and itself (and likewise for B and C).
5. Repeat step 3, and move the last point to the midpoint of the outcome point and itself.
By doing this repeatedly, we should see the path of the points looks like a Sierpinski triangle.
I wonder how the assignment of variable works inside a self-defined function. I would like to create an object (a matrix or a dataframe) to store the path of simulated points and keep updating the object to keep track of how the points move.
The following is my current code:
# create the triangle
triangle <- matrix(c(A = c(-1,0),
B = c(1, 0),
C = c(0, sqrt(3))),
byrow = TRUE, nrow = 3, ncol = 2)
colnames(triangle) <- c("X", "Y") # axis name
rownames(triangle) <- c("A", "B", "C")
# sample an initial point inside the triangle ABC
sampleInit <- function(){
X <- runif(1, min = -1, max = 1)
Y <- runif(1, min = 0, max = sqrt(3))
if( (Y >= 0) && (Y <= (sqrt(3)*X + sqrt(3))) && (Y <= -sqrt(3)*X+sqrt(3)) ){
return(cbind(X, Y))
} else {
sampleInit()
}
}
### graph: plot the triangle and the initial point together
graphics.off()
plot(triangle, xlim = c(-1, 1), ylim = c(0, sqrt(3)))
par(new = TRUE)
plot(sampleInit(), xlim = c(-1, 1), ylim = c(0, sqrt(3)), col = "red")
### a three-sided dice: determine the direction to move along
diceRoll <- function(){
return(sample(c("A", "B", "C"), size = 1, prob = c(1/3, 1/3, 1/3)))
}
## path
stepTrace <- as.data.frame(sampleInit())
move <- function(diceOutCome, stepTrace){
lastStep <- tail(stepTrace, 1)
if(diceOutCome == "A"){
X <- (-1 + lastStep[,1])/2
Y <- (0 + lastStep[,2])/2
} else if(diceOutCome == "B"){
X <- (1 + lastStep[,1])/2
Y <- (0 + lastStep[,2])/2
} else if(diceOutCome == "C"){
X <- (0 + lastStep[,1])/2
Y <- (sqrt(3) + lastStep[,2])/2
}
lastStep <- cbind(X, Y)
stepTrace <- rbind(stepTrace, lastStep)
}
move(diceRoll(), stepTrace)
View(stepTrace)
Sorry for the long story and not jumping to the key question directly. My question is that stepTrace (the object I would like to store the path) didn't get updated as I execute the last two lines.
What I imagined was that the assignment in move() would update the data frame stepTrace, but it turns out it doesn't. I checked my code in the debugger and found that stepTrace did get updated inside the function call, but the newly assigned value was not passed outside the function call. That's why I would like to ask how assignment works in R. What is the difference between this kind of process and that of other general-purpose languages such as Java? (I imagine this exercise would not run into this kind of assignment issue in Java. Correct me if I am wrong, since I am still new to Java.)
Similar problems bother me when I try to assign variables inside a loop. I know there is a base function, assign, that helps to resolve this issue, but I don't know the mechanism behind it.
I tried to google my question, but I am not sure which keyword I should use, and I didn't find direct answers to my question. Any comment, keyword or external resource to the documentation is appreciated!
In short, your move function does what you want, but it is not advisable to write it like that. In its current form, stepTrace is updated in the function's local environment, but not in the global environment, where your stepTrace lives. They are not the same stepTrace. To fix it, you can run stepTrace <- move(diceRoll(), stepTrace), but beware of the second circle of The R Inferno (growing objects). For a cleaner approach, remove the last stepTrace assignment from move, so that the function ends with the bare rbind(stepTrace, lastStep) call.
From ?return: If the end of a function is reached without calling return, the value of the last evaluated expression is returned.
Consider the following examples:
x <- 5
a <- b <- c <- d <- 1
f1 <- function(x) x + 1
f2 <- function(x) return(x + 1)
f3 <- function(x) x <- x + 1
f4 <- function(x) x <<- x + 1
f1(1)
f2(1)
f3(1) # your problem
f4(1) # <<- assigns 2 to x in the global environment, replacing the 5
a <- b <- c <- d <- 1
a <- f1(1)
b <- f2(1)
c <- f3(1)
d <- f4(1)
f3 and f4 are generally considered bad practice: f4 has a side effect, i.e. it modifies a non-local variable, and f3 ends with an assignment, so its result is returned only invisibly. f2 might trigger a discussion. For f3, see the result of
c(f3(1))
#> [1] 2
Given our experiment of calling f3(1) by itself, which printed nothing, we might expect a vector of length 0 (?). The value is there, though: an assignment returns its value invisibly. Consider removing any assignment as the last operation within your functions, and avoid naming your function arguments the same as the objects you intend to change.
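Applied to the question's chaos game, the "return the value and reassign it in the caller" pattern might look like the sketch below. It is an illustration under assumptions, not the asker's exact code: the vertex lookup through the triangle matrix, the fixed starting point (0, 0.5) and the five iterations are choices made here for brevity.

```r
# vertices of the triangle, one row per vertex
triangle <- rbind(A = c(-1, 0), B = c(1, 0), C = c(0, sqrt(3)))
colnames(triangle) <- c("X", "Y")

# returns a NEW data frame with one extra row; nothing outside is modified
move <- function(vertex, stepTrace) {
  lastStep <- unlist(tail(stepTrace, 1))          # named vector c(X = ..., Y = ...)
  newStep  <- (triangle[vertex, ] + lastStep) / 2 # midpoint of vertex and last point
  rbind(stepTrace, data.frame(X = newStep[["X"]], Y = newStep[["Y"]]))
}

stepTrace <- data.frame(X = 0, Y = 0.5)           # a point inside the triangle
for (i in 1:5) {
  # the caller decides what happens to the result: here it replaces stepTrace
  stepTrace <- move(sample(rownames(triangle), 1), stepTrace)
}
nrow(stepTrace)  # 6: the initial point plus five moves
```

For long simulations, growing a data frame row by row gets slow; preallocating a matrix with one row per step and filling it by index avoids that.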
@DonaldSeinen explained how to fix your code in his answer. I'll try to point you to documentation for more details.
First, you don't need to go to external documentation. An Introduction to R and The R Language Definition manuals are included in R distributions. The Introduction describes what's going on in lots of detail in section 10.7, "Scope". There's a different description in the Language Definition in section 3.5, "Scope of Variables".
Some people find the language in those manuals to be too technical. An easier to read external reference that gets it right is Wickham's Advanced R, readable online at https://adv-r.hadley.nz/. Scoping is discussed in chapters 6 and 7, especially sections 6.4 and 7.2.
Here is something I came across today that I don't quite understand. I want to generate some uniformly distributed points in the plane and then assign each point an arrival rate. I want to be able to reproduce the same points but assign different arrival rates. I figured I could use the set.seed function for this.
library(dplyr)
library(ggplot2)
seed = NULL
no_of_points = 50
interval = c("min" = -10, "max" = 10)
arv = c("min" = 1/80, "max" = 1)
plot_data <- function() {
id <- 1:no_of_points
# setting the seed here to be able to reproduce if desired
set.seed(seed)
x <- runif(no_of_points, min = interval["min"], max = interval["max"])
y <- runif(no_of_points, min = interval["min"], max = interval["max"])
# resetting the seed to give "random" arrival rates regardless of the seed
set.seed(NULL)
arrival_rate <- runif(no_of_points, min = arv["min"], max = arv["max"])
data <- tibble(
"Demand point id" = as.character(id),
"x" = x,
"y" = y,
"Arrival rate" = arrival_rate
)
}
ggplot(plot_data()) +
geom_point(aes(x, y, size = `Arrival rate`))
This works fine when I set a seed, and I get a plot like this, which is what I would expect.
However, when I have seed = NULL, as in the example code, I get a plot like this, where the arrival rates seem to be correlated with the x-axis.
How can this be explained? Additionally, I tried to run the same code outside a function, and then I get the expected behavior. So I suspect it has something to do with the seed being set inside a function.
I don't think set.seed(NULL) is doing what you expect. In this case I think NULL is initializing the exact same random seed both times you call it. As a result, the first random draw after the first set.seed(NULL) (x) is correlated with the first draw after the second set.seed(NULL) (Arrival rate), but not with the second draw of the first instance (y). The simple example below shows that the nth draw after setting a particular seed is correlated with the nth draw after setting that same seed again, and that using NULL and NULL is basically the same as using 1 and 1.
f <- function(s1 = NULL, s2 = NULL) {
set.seed(s1)
a <- runif(50)
b <- runif(50)
c <- runif(50)
set.seed(s2)
d <- runif(50)
e <- runif(50)
f <- runif(50)
x <- data.frame(a, b, c, d, e, f)
plot(x)
}
f(NULL, NULL)
f(1, 1)
f(1, 2)
Created on 2022-01-04 by the reprex package (v2.0.1)
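One way to get reproducible points together with arrival rates that still vary, without calling set.seed(NULL) at all, is to save the RNG state, set the seed only for the coordinates, and then restore the saved state. This is a sketch under assumptions: the function name make_points and its arguments are made up for illustration, and a plain data.frame stands in for the tibble of the question.

```r
make_points <- function(n, seed = NULL, lim = c(-10, 10), arv = c(1/80, 1)) {
  if (is.null(seed)) {
    x <- runif(n, lim[1], lim[2])
    y <- runif(n, lim[1], lim[2])
  } else {
    runif(1)  # makes sure .Random.seed exists and advances the stream
    old_state <- get(".Random.seed", envir = globalenv())
    set.seed(seed)                      # reproducible coordinates
    x <- runif(n, lim[1], lim[2])
    y <- runif(n, lim[1], lim[2])
    # put the RNG back where it was, so later draws do not depend on `seed`
    assign(".Random.seed", old_state, envir = globalenv())
  }
  arrival_rate <- runif(n, arv[1], arv[2])
  data.frame(id = as.character(seq_len(n)), x = x, y = y,
             arrival_rate = arrival_rate)
}

a <- make_points(50, seed = 1)
b <- make_points(50, seed = 1)
identical(a$x, b$x)                        # TRUE: same points
identical(a$arrival_rate, b$arrival_rate)  # FALSE: different arrival rates
```

The arrival rates come from the ordinary, uninterrupted random stream, so they are not tied to the draws that produced x.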
So I'm taking a class for R, and I'm having a really hard time coding basic formulas.
Basically what I'm trying to do is find 3 variables but I keep getting errors. (I've attached a picture for easier presentation)
Note:
d is the number of DOF, d=1,...,20
and this is my code :
set.seed(29)
library(ISLR)
library(splines)
#### ETAPE 1
x <- runif(1000,min=0,max=10)
lambda=(2*x)+(0.2*x*sin(x))
y <- rpois(1000,lambda)
J <- data.frame(x=x, y=y)
plot(x,y,cex=0.4)
### ETAPE 2
ajust <- matrix(NA,20,1000)
for(i in (1:20)) {
smoothing=lm(y~ns(x=x,df=i),data=J)
ajust[i,]=predict(smoothing)
}
fd=function(d) {return(smoothing[d])}
for(i in (1:20)) {
lines(x,ajust[i,],col=i)
}
lines(x,lambda,col='black')
for(i in (1:20)) {
d1<- (1/1000)*sum((y-ajust[i,])**2)
}
### Calcul de D2
Mean=lambda
for (d in (1:20)){
W=(Mean-fd(x))**2
d2=sum(W)/1000
}
It works up until "Calcul de D2", where I get a "non-numeric argument to binary operator" error, and I don't understand how to make it work. I know my question might seem a little vague, so don't hesitate to let me know if something isn't clear.
The bug in the code is that your fd(x) function call returns a list. This is, as the error says, not a numeric.
We don't have information on what f(d) should be (it's not defined in the picture or question), but it seems that the solution would be to extract whatever component from fd(x) you meant to have subtracted from Mean.
For example:
for (d in (1:20)){
W=(Mean-fd(x)$fitted.values)**2
d2=sum(W)/1000
}
Update
I saw your followup comment/question regarding "D3" from the equations in the picture. I'm a little unsure because I don't have the textbook/context to be sure of the notation (X isn't formally defined and I also had to take a leap of faith that Y in the picture = Mean in the code based on how you used it). This is my best guess, based on that context:
# The equation for d3 is the expected value of (Y-fd(X))^2.
#
# I don't know the context of this, but I see the definition of d1 and d2.
#
# D1 = for(i in (1:20)) {
# d1<- (1/1000)*sum((y-ajust[i,])**2)
# }
d1 # [1] 10.04203
#
# D2 = for (d in (1:20)){
# W=(Mean-fd(x)$fitted.values)**2
# d2=sum(W)/1000
# }
#
d2 # [1] 0.2024568
#
# Based on that, Y = Mean, y = y, x=x, i=i, N=1000
# W = (Y - fd(xi))^2
# I presume X = vectorized xi
#
# So, D3 =
D3 = (Mean - fd(x)$fitted.values)^2
#Since it's an expected value, I presume we take the mean
D3 = mean(D3)
Where I may be guessing wrong there is probably X. X in the pictured equation looks like the vector of all x[i]. But each element of x is an x[i] so x is already the vector representation thereof.
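Note that in the loops above d1 and d2 are overwritten on every pass, and fd always refers to the last model fitted (df = 20). If one value per number of degrees of freedom is wanted, a sketch along the following lines keeps all 20. It assumes, as above, that D1 compares the fit with y and D2 compares it with the true mean lambda; the names dfs and fits are introduced here for illustration.

```r
library(splines)

set.seed(29)
n <- 1000
x <- runif(n, min = 0, max = 10)
lambda <- 2 * x + 0.2 * x * sin(x)   # true mean
y <- rpois(n, lambda)
J <- data.frame(x = x, y = y)

dfs <- 1:20
# n x 20 matrix: column d holds the fitted values of the spline with df = d
fits <- sapply(dfs, function(d) fitted(lm(y ~ ns(x, df = d), data = J)))

d1 <- colMeans((y - fits)^2)        # mean squared residual, one value per df
d2 <- colMeans((lambda - fits)^2)   # mean squared distance to the true mean

plot(dfs, d1, type = "b", ylim = range(c(d1, d2)), xlab = "df", ylab = "error")
lines(dfs, d2, type = "b", col = "red")
```

Subtracting the matrix fits from the vectors y and lambda works column by column because each column has the same length as those vectors.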
I want to use the ICC::ICCbare function within a loop. However, the ICCbare uses the concrete variable names as input, e.g.:
ICCbare(x = group, y = variable1, data = dat)
where both "group" and "variable1" are columns of the data.frame "dat" (i.e., dat$variable1); ICCbare cannot be used with y = dat[, i].
In order to program a loop I therefore need to evaluate some R code within the function call of ICCbare. My idea was the following:
for(i in 1:10){
ICCbare(group, names(dat)[i], data = dat)
}
However, this does not work. The following error is printed:
Error in `[.data.frame`(data, yc) : undefined columns selected
Is there a way to evaluate the expression names(dat)[i] before it is passed to the function call?
Here is a minimum working example for my problem:
# Create data set
dat <- data.frame(group=c(rep("A",5),
rep("B",5)),
variable1=1:10,
variable2=rnorm(10))
# Loop
for (i in names(dat)[2:3]){
ICCbare("group", i, data = dat)
}
I agree with @agstudy. This is a bad example of non-standard evaluation. You can use this as a workaround:
v <- "variable1"
ICCbare("group", v, data = dat)
#Error in `[.data.frame`(data, yc) : undefined columns selected
eval(bquote(ICCbare("group", .(v), data = dat)))
#$ICC
#[1] 0.8275862
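To see why the bquote() workaround helps, it can be reproduced without the ICC package. The function col_name below is a hypothetical stand-in that captures its argument through match.call(), in the same way as the ICCbare source quoted in this thread; it is not part of any package.

```r
# hypothetical stand-in: returns whatever was literally typed for `y`
col_name <- function(y) as.character(match.call()[[2L]])

v <- "variable1"

col_name(v)                    # "v": the function sees the symbol, not its value

# bquote() substitutes the value of v into the call before it is evaluated,
# so the function is effectively called as col_name("variable1")
eval(bquote(col_name(.(v))))   # "variable1"

# the same pattern inside a loop
for (nm in c("variable1", "variable2")) {
  print(eval(bquote(col_name(.(nm)))))
}
```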
It is a bug in ICCbare, which tries to handle its arguments as names in a bad manner.
function (x, y, data)
{
ICCcall <- Call <- match.call()
xc <- as.character(ICCcall[[2L]]) ## this is ugly!
yc <- as.character(ICCcall[[3L]])
inds <- unique(data[xc])[[1]]
tdata <- data.frame(data[yc], data[xc])
Personally I would remove the first lines and just assume that the arguments are column names.
ICCbare_simple <-
function (xc, yc, data)
{
## remove lines before this one
inds <- unique(data[xc])[[1]]
## the rest of the code
.....
}
I'm the maintainer of ICC and I want to thank you for the excellent discussion. I know this is a very late reply, but I just updated the package and the new version (v2.3.0) should fix the "ugly" code and the problem encountered by the OP. See examples in this gist.
I just wanted to post this here in case anyone was searching with a similar problem. Thanks again, sorry for the delay.
Here is the content of the gist:
ICC non-standard evaluation examples
The ICC package for R calculates the intraclass correlation coefficient (ICC) from a one-way analysis of variance. Recently, the package was updated to better execute R's non-standard evaluation within each function (version 2.3.0 and higher). The package functions should now be able to handle a range of possible scenarios for calling the functions in, what I hope, is a less grotesque and more standard way of writing R functions. To demonstrate, below are some of those scenarios. Note, the examples use the ICCbare function, but the way in which the function arguments are supplied will apply to all of the functions in ICC.
First, load the package (and make sure the version is >= 2.3.0)
library(ICC)
packageVersion("ICC")
Columns of a data.frame
Here we supply the column names and the data.frame that contains the data to calculate the ICC. We will use the ChickWeight data frame.
data(ChickWeight)
ICCbare(x = Chick, y = weight, data = ChickWeight)
#$ICC
#[1] 0.1077609
Iterating through columns of a data.frame
In this case, we might have a data.frame in which we want to estimate the ICC for a number of different types of measurements that each has the same grouping or factor variable (e.g., x). The extreme of this might be in a simulation or bootstrapping scenario or even with some fancy high-throughput phenotyping/data collection. The point being, we want to automate the calculation of the ICC for each column.
First, we will simulate our own dataset with 3 traits to use in the example:
set.seed(101)
n <- 15 # number of individuals/groups/categories/factors
k <- 3 # number of measures per 'n'
va <- 1 # variance among
icc <- 0.6 # expected ICC
vw <- (va * (1 - icc)) / icc # solve for variance within
simdf <- data.frame(ind = rep(LETTERS[1:n], each = k),
t1 = rep(rnorm(n, 10, sqrt(va)), each = k) + rnorm(n*k, 0, sqrt(vw)),
t2 = rep(rnorm(n, 10, sqrt(va)), each = k) + rnorm(n*k, 0, sqrt(vw)),
t3 = rep(rnorm(n, 10, sqrt(va)), each = k) + rnorm(n*k, 0, sqrt(vw)))
Two ways to run through the columns come to mind: iteratively pass the name of each column or iteratively pass the column index. I will demonstrate both below. I do these in for loops so it is easier to see, but an easy extension would be to vectorise this by using something from the apply family of functions. First, passing the name:
for(i in names(simdf)[-1]){
cat(i, ":")
tmp.icc <- ICCbare(x = ind, y = i, data = simdf)
cat(tmp.icc, "\n")
}
#t1 : 0.60446
#t2 : 0.6381197
#t3 : 0.591065
or even like this:
for(i in 1:3){
cat(paste0("t", i), ": ")
tmp.icc <- ICCbare(x = ind, y = paste0("t", i), data = simdf)
cat(tmp.icc, "\n")
}
#t1 : 0.60446
#t2 : 0.6381197
#t3 : 0.591065
Alternatively, pass the column index:
for(i in 2:ncol(simdf)){
cat(names(simdf)[i], ": ")
tmp.icc <- ICCbare(x = ind, y = simdf[, i], data = simdf)
cat(tmp.icc, "\n")
}
#t1 : 0.60446
#t2 : 0.6381197
#t3 : 0.591065
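The apply-family extension mentioned earlier could look like the sketch below. So that it does not depend on which ICC version is installed, it does not call ICCbare; instead it computes the usual one-way ANOVA estimator by hand, assuming a balanced design with k measures per group. The helper icc_oneway is a hypothetical name introduced here, not a function of the ICC package.

```r
# same simulated data as above
set.seed(101)
n <- 15; k <- 3; va <- 1; icc <- 0.6
vw <- (va * (1 - icc)) / icc
simdf <- data.frame(ind = rep(LETTERS[1:n], each = k),
                    t1 = rep(rnorm(n, 10, sqrt(va)), each = k) + rnorm(n*k, 0, sqrt(vw)),
                    t2 = rep(rnorm(n, 10, sqrt(va)), each = k) + rnorm(n*k, 0, sqrt(vw)),
                    t3 = rep(rnorm(n, 10, sqrt(va)), each = k) + rnorm(n*k, 0, sqrt(vw)))

# ICC from the among-group and within-group mean squares (balanced design)
icc_oneway <- function(y, g, k) {
  ms <- anova(lm(y ~ factor(g)))[["Mean Sq"]]   # c(among, within)
  (ms[1] - ms[2]) / (ms[1] + (k - 1) * ms[2])
}

# one ICC per trait column, returned as a named vector
iccs <- sapply(simdf[-1], icc_oneway, g = simdf$ind, k = k)
iccs
```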
Passing a character as an argument is deprecated
Note that the function will still work if a character is passed directly (e.g., "t1"), albeit with a warning. The warning just means that this may no longer work in future versions of the package. For example:
ICCbare(x = ind, y = "t1", data = simdf)
#[1] 0.60446
#Warning message:
#In ICCbare(x = ind, y = "t1", data = simdf) :
# passing a character string to 'y' is deprecated since ICC version
# 2.3.0 and will not be supported in future versions. The argument
# to 'y' should either be an unquoted column name of 'data' or an object
Note, however, that an expression evaluating to a character (e.g., paste0("t", 1)) doesn't throw the warning, which is nice!
I am trying to split my data set using two parameters, the fraction of missing values and "maf", and store the sub-data sets in a list. Here is what I have done (it's not working). Any help will be appreciated,
Thanks.
library(BLR)
library(missForest)
data(wheat)
X2<- prodNA(X, 0.4) ### creating missing values
dim(X2)
fd<-t(X2)
MAF<-function(geno){ ## markers are in the rows
geno[(geno!=0) & (geno!=1) & (geno!=-1)] <- NA
geno <- as.matrix(geno)
## calc_Freq for alleles
n0 <- apply(geno==0,1,sum,na.rm=T)
n1 <- apply(geno==1,1,sum,na.rm=T)
n2 <- apply(geno==-1,1,sum,na.rm=T)
n <- n0 + n1 + n2
## calculate allele frequencies
p <- ((2*n0)+n1)/(2*n)
q <- 1 - p
maf <- pmin(p, q)
maf}
frac.missing <- apply(fd,1,function(z){length(which(is.na(z)))/length(z)})
maf<-MAF(fd)
lst<-matrix()
for (i in seq(0.2,0.7,by =0.2)){
for (j in seq(0,0.2,by =0.005)){
lst=fd[(maf>j)|(frac.missing < i),]
}}
It sounds like you want the results that the split function provides.
If you have a vector "frac.missing", and "maf" is defined on the basis of the values in "fd" (and has the same length as the number of rows in "fd"), then this would provide the split you are looking for:
spl.fd <- split(fd, list(maf, frac.missing) )
If you want to "group" the fd values based on maf and frac.missing within the bands specified by your for-loop, then the same split-construct may do what your current code is failing to accomplish:
# maf is the vector computed above with MAF(fd); it is not a function
lst <- split( fd, list(cut(maf, breaks = seq(0,0.2,by =0.005) ,
                           include.lowest=TRUE),
                       cut(frac.missing, breaks = seq(0.2,0.7,by =0.2),
                           right=TRUE,include.lowest=TRUE)
                       )
            )
The right argument controls which end of each interval is closed. With right=TRUE, as here (it is also the default), the intervals are (a, b]; right=FALSE gives [a, b), which is what a strict "<" comparison against the upper break corresponds to. The other function that provides a similar facility is by.
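A small sketch of how cut() treats values that fall exactly on a break, and of splitting a matrix by rows rather than by elements. The toy vector x, the matrix m and the variable names are made up for illustration.

```r
breaks <- c(0.2, 0.4, 0.6)
x <- c(0.2, 0.3, 0.4, 0.6)

right_closed <- cut(x, breaks)                 # default right = TRUE: (a, b]
left_closed  <- cut(x, breaks, right = FALSE)  # [a, b)

is.na(right_closed)  # 0.2 falls outside (0.2, 0.4], so the first entry is NA
is.na(left_closed)   # 0.6 falls outside [0.4, 0.6), so the last entry is NA

# split() on a matrix works element-wise, so split the row indices instead
m <- matrix(1:8, nrow = 4)
rows_by_bin <- split(seq_len(nrow(m)), left_closed)
m_by_bin <- lapply(rows_by_bin, function(idx) m[idx, , drop = FALSE])
m_by_bin
```

Rows whose value falls outside every interval get NA as their group and are dropped by split(), which is worth checking when the breaks do not cover the whole range.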
The code below gives me exactly what I need:
Y<-t(GBS.binary)
nn<-colnames(Y)
fd<-Y
maf<-as.matrix(MAF(Y))
dff<-cbind(frac.missing,maf,Y)
colnames(dff)<-c("fm","maf",nn)
dff<-as.data.frame(dff)
for (i in seq(0.1,0.6,by=0.1)) {
for (j in seq(0,0.2,by=0.005)){
assign(paste("fm_",i,"maf_",j,sep=""),
(subset(dff, maf>j & fm <i))[,-c(1,2)])
} }
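Instead of creating several hundred separate objects with assign(), the subsets can be kept in one named list, which is easier to loop over afterwards. The sketch below uses a small simulated genotype matrix in place of the wheat data, and the names geno, grid and subsets are assumptions made for illustration; the allele-frequency formula follows the MAF function from the question.

```r
set.seed(1)
# toy genotypes: 20 markers (rows) x 10 samples (columns), with missing values
geno <- matrix(sample(c(-1, 0, 1, NA), 200, replace = TRUE), nrow = 20,
               dimnames = list(paste0("m", 1:20), paste0("s", 1:10)))

frac_missing <- rowMeans(is.na(geno))
n0 <- rowSums(geno == 0,  na.rm = TRUE)
n1 <- rowSums(geno == 1,  na.rm = TRUE)
n2 <- rowSums(geno == -1, na.rm = TRUE)
p  <- (2 * n0 + n1) / (2 * (n0 + n1 + n2))
maf <- pmin(p, 1 - p)

# every combination of the two thresholds
grid <- expand.grid(fm = seq(0.1, 0.6, by = 0.1),
                    maf = seq(0, 0.2, by = 0.005))

# one matrix per combination: markers with maf > j and missing fraction < i
subsets <- Map(function(i, j) geno[maf > j & frac_missing < i, , drop = FALSE],
               grid$fm, grid$maf)
names(subsets) <- paste0("fm_", grid$fm, "maf_", grid$maf)

length(subsets)                  # 6 x 41 = 246 subsets
dim(subsets[["fm_0.3maf_0.1"]])  # look one up by name
```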