R: How to (efficiently) parameterize and draw (conditional) functions?

I have a data.frame where each row contains the parameter values for a function (i.e. one function per row). I would like to draw these functions. The functions are conditional: they should only be drawn for certain values, and they have kinks (due to min/max levels). See example of what I am trying to achieve:
I originally considered using curve() or stat_function (ggplot2-library). But I could not see how it would be possible to draw the curves only for certain values (see a, b, c), without generating a data.frame.
So I created a function that generates the plot data:
N = 10000;
PrisstrukturToPlotdata = function(s){
# Create empty data.frame:
A <- data.frame(Site=rep(s$Site, N), bid=1:N, Saelger=rep(NA, N), Koeber=rep(NA, N), stringsAsFactors=FALSE)
# Fill out the data.frame:
for (i in 1:N) {
# Don't draw below:
if(i > s$Mindste_bud*s$Kurs) {
# First parenthesis is the condition for being at or below the min, second parenthesis is for in-between min and max, and third parenthesis is the condition for being at or above the max:
A[i, ]$Saelger = s$Saelger_Fast_salaer*s$Kurs + i*s$Saelger_Andet_pct +
(i*s$Saelger_Variable_salaer <= s$Saelger_Min_variable_salaer*s$Kurs) *
s$Saelger_Min_variable_salaer*s$Kurs +
(i*s$Saelger_Variable_salaer > s$Saelger_Min_variable_salaer | (s$Saelger_Max_variable_salaer != 0 & i*s$Saelger_Variable_salaer < s$Saelger_Max_variable_salaer*s$Kurs)) *
i*s$Saelger_Variable_salaer +
(s$Saelger_Max_variable_salaer != 0 & i*s$Saelger_Variable_salaer >= s$Saelger_Max_variable_salaer*s$Kurs) *
s$Saelger_Max_variable_salaer*s$Kurs;
A[i, ]$Koeber = s$Koeber_Fast_salaer*s$Kurs + i*s$Koeber_Variable_salaer;
}
}
return(A)
}
library(plyr)
Plotdata = adply(Prisstruktur, 1, PrisstrukturToPlotdata, .expand = FALSE)
Conditionality explained: There is a minimum value, below which the curve should not be drawn at all; if(i > s$Mindste_bud*s$Kurs).
Then there is a percentage i*s$Saelger_Variable_salaer with a minimum and a maximum level (to complicate things, not all functions have a max; for those without one, the max value is just 0). If the percentage is below the minimum, the minimum level should be used. If the percentage is above the max, the maximum level should be used. In between, the percentage itself should be used.
The script above works okay for N=100 or even N=1000, but when I go to N=10000 or above it takes ages to run. I am guessing this is due to all the conditional statements, but I am not sure how to do this in a more efficient manner?
Dummy data:
Site = c('A', 'B', 'C')
Mindste_bud = c(300, 0 , 0)
Saelger_Fast_salaer = c(0, 250, 2)
Saelger_Variable_salaer = c(0.12, 0.16, 0.10)
Saelger_Min_variable_salaer = c(250, 0, 0)
Saelger_Max_variable_salaer = c(0, 0, 250)
Saelger_Andet_pct = c(0, 0, 0)
Koeber_Fast_salaer = c(95, 0, 0)
Koeber_Variable_salaer = c(0.2, 0.25, 0)
Kurs = c(1, 1, 5.430)
Prisstruktur = data.frame(Site, Mindste_bud, Saelger_Fast_salaer, Saelger_Variable_salaer, Saelger_Min_variable_salaer, Saelger_Max_variable_salaer, Saelger_Andet_pct, Koeber_Fast_salaer, Koeber_Variable_salaer, Kurs, stringsAsFactors = FALSE)

You don't need a loop in your function. I doubt that you need all N = 10000 data points to get a nice plot. I've added structure to your code by using more whitespace and some ifelse functions for clarity.
PrisstrukturToPlotdata <- function(s, N = 10000, Length = 101) {
  # Evaluate the curves on a coarse grid of bids, starting just above the minimum bid
  n <- seq(s$Mindste_bud * s$Kurs + 1, N, length.out = Length)
  data.frame(
    Bid = n,
    Saelger =
      s$Saelger_Fast_salaer * s$Kurs +
      n * s$Saelger_Andet_pct +
      # At or below the minimum: use the minimum level
      ifelse(
        n * s$Saelger_Variable_salaer <= s$Saelger_Min_variable_salaer * s$Kurs,
        s$Saelger_Min_variable_salaer * s$Kurs,
        0
      ) +
      # Between minimum and maximum: use the percentage
      ifelse(
        n * s$Saelger_Variable_salaer > s$Saelger_Min_variable_salaer |
          (s$Saelger_Max_variable_salaer != 0 &
            n * s$Saelger_Variable_salaer < s$Saelger_Max_variable_salaer * s$Kurs),
        n * s$Saelger_Variable_salaer,
        0
      ) +
      # At or above the maximum (0 means no maximum): use the maximum level
      ifelse(
        s$Saelger_Max_variable_salaer != 0 & n * s$Saelger_Variable_salaer >= s$Saelger_Max_variable_salaer * s$Kurs,
        s$Saelger_Max_variable_salaer * s$Kurs,
        0
      ),
    Koeber = s$Koeber_Fast_salaer * s$Kurs + n * s$Koeber_Variable_salaer
  )
}
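If the intended rule is simply "the percentage fee, but never below the minimum and never above the maximum", the three ifelse terms can be collapsed into a clamp built from pmax and pmin. The following is a minimal sketch under that assumption, not the original poster's method; seller_fee is a hypothetical helper name, and a maximum of 0 is read as "no cap", as in the question's dummy data.

```r
# Hypothetical helper: seller fee for a vector of bids n, given one parameter row s
# (s can be a one-row data.frame or a list with the question's column names).
seller_fee <- function(n, s) {
  variable  <- n * s$Saelger_Variable_salaer
  floor_fee <- s$Saelger_Min_variable_salaer * s$Kurs
  # Assumption: a maximum of 0 means "no maximum", so the cap becomes infinite
  cap_fee <- if (s$Saelger_Max_variable_salaer == 0) Inf else s$Saelger_Max_variable_salaer * s$Kurs
  s$Saelger_Fast_salaer * s$Kurs +
    n * s$Saelger_Andet_pct +
    pmin(pmax(variable, floor_fee), cap_fee)  # clamp the percentage fee
}

# Site A from the dummy data: 12 % fee, minimum 250, no maximum
site_a <- list(Saelger_Fast_salaer = 0, Saelger_Variable_salaer = 0.12,
               Saelger_Min_variable_salaer = 250, Saelger_Max_variable_salaer = 0,
               Saelger_Andet_pct = 0, Kurs = 1)
fees_a <- seller_fee(c(1000, 5000), site_a)  # 250 (minimum applies) and 600
```

Because each bid falls into exactly one regime, the clamp cannot count the percentage and a min/max level at the same time, which the overlapping ifelse conditions could do for other parameter values.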

Related

Solving "Error in if (obs <= ei) 2 * pv else 2 * (1 - pv) : missing value where TRUE/FALSE needed" for ape package Moran's I function in R

developers!
I have encountered an error message
Error in if (obs <= ei) 2 * pv else 2 * (1 - pv) : missing value where
TRUE/FALSE needed
that stops me from getting a value from the Moran.I function in the ape package. Here is what I did:
library(ape)
nrstp <- data.frame(
X = c(300226.9, 300224.6, 300226.4, 300226.1, 300224.0, 300226.4, 300225.7, 300226.4, 300226.1, 300226.4, 300226.3, 300226.3, 300227.1),
Y = c(5057949, 5057952, 5057950, 5057950, 5057956, 5057950, 5057950, 5057950, 5057950, 5057950, 5057950, 5057950, 5057949),
V3 = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0))
nrstp = data.frame(nrstp)
dist = as.matrix(dist(cbind(nrstp$X, nrstp$Y)))
invdist = 1/dist
invdist[is.infinite(invdist)] <- 0
moranI = Moran.I(nrstp$V3, invdist)
The intention of this code is to calculate Moran's I for a series of points to check for spatial autocorrelation. So far, this seems to be the only function for Moran's I that works for me in R. After some testing (I have thousands of groups of points), the error seems to happen only when the input vector contains a single repeated value (I tried numbers other than 0, and it still raises this error).
Could someone help me improve this code? Or are there better suggestions for calculating Moran's I or testing spatial autocorrelation from linestrings (each point group consists of the origin point of one linestring and the closest points from other linestrings within a 10 meter buffer of that origin point)?
Thank you ahead for any help!
The control-flow construct if (condition) requires that condition evaluates to TRUE or FALSE, not NA.
In your case, obs <= ei results in NA. That's why the error message missing value where TRUE/FALSE needed is generated.
To understand how obs <= ei results in NA, you can check the details inside the Moran.I function:
Moran.I
function (x, weight, scaled = FALSE, na.rm = FALSE, alternative = "two.sided")
{
if (dim(weight)[1] != dim(weight)[2])
stop("'weight' must be a square matrix")
n <- length(x)
if (dim(weight)[1] != n)
stop("'weight' must have as many rows as observations in 'x'")
ei <- -1/(n - 1)
nas <- is.na(x)
if (any(nas)) {
if (na.rm) {
x <- x[!nas]
n <- length(x)
weight <- weight[!nas, !nas]
}
else {
warning("'x' has missing values: maybe you wanted to set na.rm = TRUE?")
return(list(observed = NA, expected = ei, sd = NA,
p.value = NA))
}
}
ROWSUM <- rowSums(weight)
ROWSUM[ROWSUM == 0] <- 1
weight <- weight/ROWSUM
s <- sum(weight)
m <- mean(x)
y <- x - m
cv <- sum(weight * y %o% y)
v <- sum(y^2)
obs <- (n/s) * (cv/v)
if (scaled) {
i.max <- (n/s) * (sd(rowSums(weight) * y)/sqrt(v/(n -
1)))
obs <- obs/i.max
}
S1 <- 0.5 * sum((weight + t(weight))^2)
S2 <- sum((apply(weight, 1, sum) + apply(weight, 2, sum))^2)
s.sq <- s^2
k <- (sum(y^4)/n)/(v/n)^2
sdi <- sqrt((n * ((n^2 - 3 * n + 3) * S1 - n * S2 + 3 * s.sq) -
k * (n * (n - 1) * S1 - 2 * n * S2 + 6 * s.sq))/((n -
1) * (n - 2) * (n - 3) * s.sq) - 1/((n - 1)^2))
alternative <- match.arg(alternative, c("two.sided",
"less", "greater"))
pv <- pnorm(obs, mean = ei, sd = sdi)
if (alternative == "two.sided")
pv <- if (obs <= ei)
2 * pv
else 2 * (1 - pv)
if (alternative == "greater")
pv <- 1 - pv
list(observed = obs, expected = ei, sd = sdi, p.value = pv)
}
<bytecode: 0x000001cd5e0715d0>
<environment: namespace:ape>
By assigning x = nrstp$V3 and weight = invdist, you will get mean(x) = 0. This results in y=0, cv = 0, v=0, and finally obs = NaN. Consequently,
obs <= ei
[1] NA
To overcome the problem, you need to ensure that neither obs nor ei is NA. In your case, obs is NaN because v = sum(y^2) is 0, which happens whenever all values of x are identical (not only when they are all 0). However, because I know nothing about this particular topic, I'm not sure what the right remedy is when x is constant.
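The chain from a constant x to the NA condition can be reproduced in base R without ape. This is a minimal sketch that copies only the relevant lines of the function body shown above; the all-ones weight matrix is an assumption made for illustration and does not affect the 0/0 result.

```r
x <- rep(7, 13)                 # any constant vector, not just zeros
weight <- matrix(1, 13, 13)     # assumed weights, for illustration only
n <- length(x)
ei <- -1 / (n - 1)

y   <- x - mean(x)              # all zeros
cv  <- sum(weight * y %o% y)    # 0
v   <- sum(y^2)                 # 0
obs <- (n / sum(weight)) * (cv / v)   # 0/0 gives NaN

cmp <- obs <= ei                # comparing NaN yields NA
# if (cmp) ... would now stop with "missing value where TRUE/FALSE needed"
failed <- inherits(try(if (cmp) 1 else 2, silent = TRUE), "try-error")
```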
The problem is that all your x values are the same. In the code shown by Abdur Rohman, the relevant part of the calculation is
m <- mean(x)
y <- x - m
cv <- sum(weight * y %o% y)
v <- sum(y^2)
obs <- (n/s) * (cv/v)
If all x are the same, then m <- mean(x) equals that common value, so y, cv and v are all 0.
obs then divides cv by v, i.e. 0/0, which is NaN.
So at least one value of x must differ from the others.
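When thousands of point groups are processed in a loop, one option is to check for a constant vector before calling the function and return NA for those groups. The sketch below is one possible guard, not part of ape; is_constant and safe_moran are hypothetical names, and moran_fun stands for whatever function computes the statistic (for example ape::Moran.I, which returns a list with an observed element).

```r
# Hypothetical helper: TRUE when x has fewer than two distinct non-missing values
is_constant <- function(x) length(unique(x[!is.na(x)])) < 2

# Hypothetical wrapper: skip the computation for constant input instead of failing.
# moran_fun is passed in so that this sketch does not depend on any package.
safe_moran <- function(x, weight, moran_fun) {
  if (is_constant(x)) {
    return(NA_real_)   # Moran's I is undefined when x has zero variance
  }
  moran_fun(x, weight)$observed
}

# With ape loaded, usage would look like:
#   safe_moran(nrstp$V3, invdist, Moran.I)
```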

Trying to run GA in R, getting Error in if (any(x < 0)) { : missing value where TRUE/FALSE needed

I am relatively new to R and need to set up a genetic algorithm to find an equation that would produce a certain number of prime numbers.
install.packages("GA")
install.packages("matlab")
library(GA)
library(matlab)
f <- function(x)
{
#initialize fitness score
score <- 0
#set test values for k
k <- seq(from = 1, to = 100,by = 1)
#test if the result of the formula (k^2 + ak + b) is a prime number using test k values
for (i in k) {
if (isprime(i ^ 2 + x[1] * i + x[2]) == 2) {
score = score + 1
}
}
#return fitness score
return(score)
}
lbound <- 2
ubound <- 1000
GA <- ga(type="real-valued",fitness=f,popSize = 10,pcrossover = 0.8,pmutation = 0.1, maxiter=30, run=20, lower = lbound, upper = ubound)
When I try to run the GA part, I get the following error:
> GA <- ga(type="real-valued",fitness=f,popSize = 10,pcrossover = 0.8,pmutation = 0.1, maxiter=30, run=20, lower = lbound, upper = ubound)
Error in if (any(x < 0)) { : missing value where TRUE/FALSE needed
Any suggestions for what I might be doing wrong?
Thank you
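The error in your code happens because the fitness function reads x[2], which doesn't exist: lower and upper have length 1, so ga() optimizes a single variable and x[2] is NA.
If you read the Rastrigin example in the GA vignette, for 2 variables you need to (1) specify a function with 2 inputs and (2) use a wrapper around this function, with lower and upper each of length 2: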
f <- function(x1,x2)
# two variables
{
#initialize fitness score
score <- 0
#set test values for k
k <- seq(from = 1, to = 100,by = 1)
#test if the result of the formula (k^2 + ak + b) is a prime number using test k values
for (i in k) {
if (isprime(i ^ 2 + x1 * i + x2) == 2) {
score = score + 1
}
}
#return fitness score
return(score)
}
lbound <- 2
ubound <- 1000
GA <- ga(type="real-valued",
#the wrapper is here
fitness=function(x)f(round(x[1]),round(x[2])),
popSize = 10,
pcrossover = 0.8,pmutation = 0.1, maxiter=30,
run=20, lower = rep(lbound,2), upper = rep(ubound,2))
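The root cause can be seen without GA or matlab: indexing past the end of a length-1 vector gives NA. The sketch below also shows a loop-free way to score a pair (a, b); is_prime and count_primes are hypothetical base-R helpers written for this illustration, not functions from either package.

```r
x <- c(5)          # what the fitness function receives when lower/upper have length 1
second <- x[2]     # NA, which later breaks the if() inside the primality test

# Hypothetical helper: trial-division primality test for a single number
is_prime <- function(n) {
  if (is.na(n) || n < 2 || n != floor(n)) return(FALSE)
  if (n < 4) return(TRUE)            # 2 and 3
  if (n %% 2 == 0) return(FALSE)
  d <- 3
  while (d * d <= n) {
    if (n %% d == 0) return(FALSE)
    d <- d + 2
  }
  TRUE
}

# Hypothetical fitness: number of primes produced by k^2 + a*k + b over the test values k
count_primes <- function(a, b, k = 1:100) {
  sum(vapply(k^2 + a * k + b, is_prime, logical(1)))
}

# Euler's polynomial k^2 + k + 41 is prime for k = 0..39 and composite at k = 40
euler_score <- count_primes(1, 41, k = 0:40)   # 40
```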

Draw random binom dependent on individual variable and predefined group risk

I'm trying to simulate exposure data in a group of people, and then to have a boolean conditional on the data. So say this is my simulated exposure data:
x <- rlnorm(2000)
I then want to generate a 1 or 0 for each person dependent on the value of x. I can simply assign max(x) a probability of 1 and scale lesser values proportionally:
prob <- x / max(x)
y <- rbinom( n=length(x), 1, prob=prob)
> table(y)
y
0 1
1900 100
However, this is not really what I want. I would also like to be able to set an overall population risk for the group, say 30% (so risk = 0.3), such that individual risk depends on x but the total group risk is 0.3. In the end I want 30% of the population to have y==1, but with an individual probability dependent on the value of x. I'm at a loss as to how to achieve this - any help appreciated.
Update:
Taking a hint from B Williams's answer below, I've written a short optimiser function:
df1 <- data.frame(x = rlnorm(2000))
df1$prob <- df1$x / max(df1$x)
risk = 0.3
optimize_prob <- function(prob, risk, delta = 0.01, tol = 0.02, max_iter = 400, mult=1){
prob1 <- prob
for( i in 1: max_iter){
y <- rbinom( n=length(prob1), 1, prob=prob1)
meas_risk <- sum(y==1) / length(y)
if( abs(risk - meas_risk) > tol) {
sign <- as.numeric((risk - meas_risk) >= 0)
prob1 <- prob1 + (sign * delta) + (prob1 *delta * mult)
# prob1's must lie between 0 & 1
prob1 <- ifelse(prob1 > 1, 1, prob1)
prob1 <- ifelse(prob1 < 0, 0, prob1)
} else {
break
}
}
msg <- paste0("Iterations: ", i)
print(msg)
out <- cbind(prob1, y)
return(out)
}
df1 <- data.frame(df1, optimize_prob( df1$prob, risk, mult=3))
df1$y <- as.factor(df1$y)
table(df1$y)
This more or less achieves the result I want. However, if anyone knows a neater way of doing this I'd much appreciate suggestions. Any efficiency improvements to the above would also be appreciated, as I will be running it a lot if all goes to plan.
I may not understand correctly what you are trying to do, but here is my guess.
library(dplyr)
df <- data.frame(x = rlnorm(2000))
Pull out the top 600 (30% of 2000) values and get the minimum value
df %>%
  mutate(prob = x/max(x)) %>%
  top_n(600, prob) %>%
  summarise(min.value = min(prob)) -> out
Set the global probability based upon the minimum value
df %>%
  mutate(prob = x/max(x),
         global = ifelse(prob >= out$min.value, 1, 0)) %>%
  summarise(one = sum(global))
Alternatively you could write a function and optimize it to get the "cutoff" value.
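One way to read the "optimize a function" suggestion, while keeping individual probabilities proportional to x, is to solve for a single scaling constant so that the average probability equals the target risk. This is a sketch of that idea using base R's uniroot, not the approach taken in either post above; the names gap and k and the cap at 1 are assumptions made for illustration.

```r
set.seed(1)
x <- rlnorm(2000)
risk <- 0.3

# Mean individual probability after scaling x by k and capping at 1, minus the target
gap <- function(k) mean(pmin(1, k * x)) - risk

# gap(0) is negative and gap(1 / min(x)) is positive, so a root lies in between
k <- uniroot(gap, lower = 0, upper = 1 / min(x), tol = 1e-8)$root

prob <- pmin(1, k * x)              # individual risk, increasing in x
y <- rbinom(length(x), 1, prob)     # simulated outcomes

mean(prob)   # close to 0.3 by construction
mean(y)      # close to 0.3 up to sampling noise
```

The expected group risk then matches the target without repeated resampling; the realised share of y == 1 still varies from draw to draw.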

Adjusting figure margins using split.screen

I am trying to produce multiple plots using the split.screen option and I need to have 7 plots on the page. One of them should be plotted on its own and the other 6 plotted repeatedly using a for loop.
This is my code for some simulation I am carrying out. It runs well, but I have two potential problems:
I am not sure which of the plots actually gets plotted because I couldn't get the assigned label to show up on the bigger plot.
The plot showing on screen 1 is not the actual data because I have plotted it separately and know what it should look like.
Simulating the data:
numpop = 2
N = 1250
nSNP = 5000
Fst = 0.001
omega = c(0.5, 0.5)
propnExtreme = 0.1
nsim = 10
Fst.obs = vector(length = nSNP)
pdiffs = vector(length = nSNP)
genomat = matrix(nrow = N, ncol = nSNP)
for (i in 1:nSNP){
p = runif(1, 0.1, 0.9)
alpha = p * (1 - Fst) / Fst
beta = (1 - p) * (1 - Fst) / Fst
ps = rbeta(numpop, shape1 = alpha, shape2 = beta)
vars = var(ps)
pdiffs[i] = max(ps) - min(ps)
Fst.obs[i] = vars / (p * (1 - p))
for (j in 1:numpop){
ind1 = (j-1) * N * omega[j] + 1
ind2 = j * N * omega[j]
freqs = c(ps[j]^2, 2 * ps[j] * (1 - ps[j]), (1 - ps[j])^2)
genomat[ind1:ind2, i] = sample(c(0, 1, 2), size = N*omega[j], replace = TRUE, prob = freqs)
}
}
snpmeans = colMeans(genomat)
pi = (1 + colSums(genomat)) / (2 + 2*nrow(genomat))
stdmat = scale(genomat, center=snpmeans, scale=sqrt(pi*(1-pi)))
pr = prcomp(stdmat, center=F, scale=F)
Plotting:
get( getOption("device" ) )()
png(file="myplot.png", width=2000, height = 1200)
par(oma = c(0,0,3,0))
split.screen(c(1,2)) # split display into two screens
plot(pr$x,
col = c(rep("red", N*omega[1]), rep("blue", N*omega[2])),
main = "Whole genotype data")
split.screen(c(2, 3), screen = 2) # now split the second into 2x3
for(i in 1:8) ## 8=#of screens
{
screen(i) # prepare screen i for output
fA=0.5
fa = 1-fA
combined_SNP <- sample(c(0:2), N, prob=c(fA^2, 2*fA*fa, fa^2), replace=T)
pheno_indep <-c()
##Phenotypes
for (i in 1:length(combined_SNP)){
if (combined_SNP[i] == '0') {
pheno_indep<- c(pheno_indep, rnorm(1, mean = 0.07, sd = 1))
} else if (combined_SNP[i ]== '1') {
pheno_indep <- c(pheno_indep, rnorm(1, mean = 0, sd = 1))
} else {
pheno_indep <- c(pheno_indep, rnorm(1, mean = -0.07, sd = 1))
}
}
l <- 1:N
combined_indep <- cbind(combined_SNP, pheno_indep, l)
sorted_combined <- combined_indep[order(combined_indep[, 2]), ]
##eps data
f = 0.1
Nums = nrow(sorted_combined)
keep <- c(1:(f*Nums), (Nums-(f*Nums)+1):Nums)
epsdat<- c(rep("0", f*Nums), rep("1", f*Nums))
EPS_dat <- as.factor(cbind(sorted_combined[keep, ], epsdat))
dim(EPS_dat) <- c(length(keep), 4)
#colnames(EPS_dat) <- c("Genotypes", "Phenotypes", "ID", "position")
PC_EPS <- prcomp((genomat[EPS_dat[, 3], ]))
plot(PC_EPS$x,
col=c(rep("red", f*Nums), rep("blue", f*Nums)))
}
close.screen(all=TRUE)
dev.off()
Result:
I have spent a lot of time trying to figure this out, even with other functions like layout.show. Thanks!
Is the following what you expect to be plotted? (I added screen titles to the small plots for illustration.)
When you split the screens, you should have gotten the following on your console:
> split.screen(c(1, 2))
[1] 1 2
# (code used to plot first chart on the left)
> split.screen(c(2, 3), screen = 2)
[1] 3 4 5 6 7 8
As described in the help file ?split.screen, the return value is a vector of screen numbers for the newly-created screens. So your valid screen numbers are 1 (already plotted), and 3-8 (6 small screens).
As such, the next line doesn't work as expected, since you're now looping through screens 1-8 rather than screens 3-8.
# instead of
for(i in 1:8) ## 8=#of screens
# use this
for(i in 3:8) ## 8=#of screens
As a side note, you should also use different loop counters for nested loops. Your outer loop (for the 6 small plots) used i as the loop counter. Within this loop, you have another loop for phenotypes, which used i as well. Since the screen selection was done at the start of each outer loop iteration, the code still worked in this case, but in general, best to keep the loop counters separate.
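Rather than hard-coding 3:8, the screen numbers returned by split.screen can be stored and looped over directly. The following is a stripped-down sketch with placeholder data, not the simulation from the question; the null PDF device, the device size, and the par(mar = ...) values are assumptions chosen so the example runs without opening a window or writing a file.

```r
pdf(NULL, width = 12, height = 6)      # null device: nothing is written to disk

top <- split.screen(c(1, 2))           # two side-by-side screens
screen(top[1])
plot(1:10, main = "Whole data")        # big plot on the left

small <- split.screen(c(2, 3), screen = top[2])   # 2 x 3 grid inside the right half
for (s in small) {                     # loop over the screens that actually exist
  screen(s)
  par(mar = c(2, 2, 2, 1))             # smaller margins so the small panels fit
  plot(rnorm(20), main = paste("Screen", s))
}

close.screen(all.screens = TRUE)
dev.off()
```

Using a loop variable such as s for the screens also keeps it distinct from any inner loop counter.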

summing the positive values of k variables in R

For each of k variables in a data frame data, I want to take the maximum of 0 and the variable's value, and then sum these maxima across the variables. In Stata, I did as follows:
gen sum=max(0,x)+max(0,y)+max(0,z)+...+max(0,k)
In R I used following approach:
data$sum<-ifelse(data$x<0,0,data$x*1)+ifelse(data$y<0,0,data$y*1)+ifelse(data$z<0,0,data$z*1)+...+ifelse(data$k<0,0,data$k*1)
I was wondering whether there is an alternative and efficient approach in R to do the same thing.
Try this:
mycols <- c("x", "y", "z", "k")
data$sum <- rowSums(data[mycols] * (data[mycols] > 0))
Check with some sample data:
data <- data.frame(x = runif(10) - 0.5,
y = runif(10) - 0.5,
z = runif(10) - 0.5,
k = runif(10) - 0.5)
identical(rowSums(data[mycols] * (data[mycols] > 0)), # mine
ifelse(data$x < 0, 0, data$x * 1) + # yours
ifelse(data$y < 0, 0, data$y * 1) +
ifelse(data$z < 0, 0, data$z * 1) +
ifelse(data$k < 0, 0, data$k * 1))
# [1] TRUE
Alternatives to flodel's excellent solution, noting the first looks quite a bit like your Stata code.
with( data, # terrible name for an R object, BTW
pmax(x, 0) + pmax(y, 0) + pmax(z, 0) +pmax(k,0) )
rowSums( apply(data[-5], 2, pmax, 0) )
The second one is probably slower, but it is in the running for this R-golf competition. Also a matrix math solution:
as.matrix( (data[,1:4] > 0 )* data[, 1:4]) %*% rep(1, 4 )
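The pmax idea also extends to an arbitrary list of columns without typing each one, by combining lapply with Reduce. A small sketch with made-up data; dat and cols are names chosen for this illustration.

```r
set.seed(42)
dat <- data.frame(x = runif(10) - 0.5,
                  y = runif(10) - 0.5,
                  z = runif(10) - 0.5,
                  k = runif(10) - 0.5)
cols <- c("x", "y", "z", "k")

# pmax(col, 0) replaces negatives by 0 in each column; Reduce adds the columns up
pos_sum <- Reduce(`+`, lapply(dat[cols], pmax, 0))

# Same result as the rowSums approach
check <- rowSums(dat[cols] * (dat[cols] > 0))
```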
Not the question, but writing out every variable in Stata is likely to be tedious and error-prone. There is likely to be scope for a loop here:
gen sum = 0
quietly foreach v of var varlist {
replace sum = sum + `v' if inrange(`v', 0, .)
}
where you must work out what the varlist should be.
