I'm trying to train a neural network with the neuralnet package to solve a regression problem: approximating the function
f(x1, x2) = sqrt(x1) + sin(x2) + x1*x2.
Here is my code:
library(neuralnet)
library(scatterplot3d)
X1 <- as.data.frame(runif(1000, min = 0, max = 100))
X2 <- as.data.frame(runif(1000, min = 0, max = 100))
input <- cbind(X1, X2)
sortie <- sqrt(X1) + sin(X2) + X1 * X2
donnee <- cbind(sortie, input)
colnames(donnee) <- c("sortie", "entree1", "entree2")
f <- as.formula(sortie ~ entree1 + entree2)
net.f <- neuralnet(f, donnee, hidden = c(10, 10, 10), linear.output = FALSE)
Here is the code I use to look at a scatterplot of the network's outputs:
abscisse1 <- 0:100
abscisse2 <- 0:100
net.abscisseformule <- compute(net.f , cbind(abscisse1,abscisse2))
neuralsortie <- c(net.abscisseformule$net.result)
scatterplot3d(abscisse1,abscisse2,neuralsortie)
I'm pretty sure that the result is wrong because the scatterplot doesn't look like the scatterplot of the function f. I think that the problem comes from the line
f <- as.formula(sortie ~ entree1 + entree2)
Here is the code to look at a scatterplot of the function itself:
x <- seq(0, 100, 1)
y <- seq(0, 100, 1)
z <- sqrt(x) + sin(y) + x*y
scatterplot3d(x,y,z)
This is the graph of f:
https://i.stack.imgur.com/HkpbG.png
This is the graph of the outputs of the neural network:
https://i.stack.imgur.com/N38dd.png
Can somebody give me a piece of advice? Many thanks!
I found the answer to my question. According to the book The Elements of Statistical Learning (by Hastie, Tibshirani and Friedman), when solving a regression problem the identity function should be used in the last layer of the neural network, which means the output is a linear combination of the previous layer. To do this in R, "linear.output" must be set to TRUE, not FALSE.
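For reference, a minimal sketch of the corrected call, keeping the same data and formula as above and changing only the last argument:
# identity (linear) activation on the output layer, as required for regression
net.f <- neuralnet(f, donnee, hidden = c(10, 10, 10), linear.output = TRUE)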
Related
I was trying to do some manual calculations of kNN regression and came across an unusual discrepancy: the predicted values done by hand do not match the ones I got from the knnreg function in the caret package. So I used another package (FNN) as a second check and discovered that my manual calculations do agree with the ones from the FNN package. So I'm really confused now. Here is an example code:
# caret vs. FNN packages
# issue in predictions
library(caret)
library(FNN)
library(dbscan)
n <- 100
x <- rnorm(n)
y <- 2 + 3*x + rnorm(n, sd = 0.5)
x <- as.matrix(x)
# using caret
knn_caret <- knnreg(x, y, k = 5)
yhat_caret <- predict(knn_caret, newdata = x)
# using FNN
knn_FNN <- knn.reg(train = x, y = y, k = 5)
yhat_FNN <- knn_FNN$pred
# manual calculation using the neighbors.
# choose a point
i <- 3
nn <- kNN(x, k = 5) # kNN() is from the dbscan package
neighbors <- nn$id[i, ]
mean(y[neighbors]) # manual calculation
yhat_FNN[i] # FNN package
yhat_caret[i] # caret package
If you can point to any mistake I may have made in my code, or share any thoughts on this issue, that would be greatly appreciated.
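One thing worth checking (an assumption to verify, not a confirmed diagnosis): when predicting on the training data, a point can be its own nearest neighbor. FNN::knn.reg with no test set is documented to return leave-one-out predictions, and dbscan::kNN likewise excludes self-matches, which would explain why those two agree. If caret's knnreg counts the query point among its own neighbors, a quick check is:
# does averaging point i itself plus its 4 nearest neighbors reproduce caret's value?
nn4 <- kNN(x, k = 4)
mean(y[c(i, nn4$id[i, ])]) # compare with yhat_caret[i]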
I have simulated data created like this:
library(MASS) # for mvrnorm
average_vector = c(0,0,25)
sigma_matrix = matrix(c(4,1,0,1,8,0,0,0,9), nrow=3, ncol=3)
set.seed(12345)
data0 = as.data.frame(mvrnorm(n = 20000, mu = average_vector, Sigma = sigma_matrix))
names(data0)=c("hard","smartness","age")
set.seed(13579)
data0$final=0.5*data0$hard+0.2*data0$smartness+(-0.1)*data0$age+rnorm(n=dim(data0)[1],mean=90,sd=6)
Now I want to randomly sample 50 students 1,000 times (1,000 sets of 50 people). I used this code:
datsub<-(replicate(1000, sample(1:nrow(data0),50)))
After that step I ran into an issue: I want to run a regression model on each set of 50 selected people (1,000 times) and record/store the point estimates of "hard" from model 4, which is given like this:
model4 = lm(formula = final ~ hard + smartness + age, data = data0)
and then plot the variation around the line at 0.5 (the true value). Is there any way I can achieve that? Thanks a lot!
I would highly suggest looking into either caret or the newer (and still maintained) TidyModels if you're just getting into R modelling. Either of these will make your life easier, once you get used to the dplyr-like syntax.
What you're trying to do is bootstrapping. Here is the manual approach using only base functions.
n <- nrow(data0)
k <- 1000
ns <- 50
samples <- replicate(k, sample(seq_len(n), ns))
params <- vector('list', k)
for(i in seq_len(k)){ # one regression per sample: k iterations, not n
  params[[i]] <- coef( lm(formula = final ~ hard + smartness + age, data = data0[samples[, i],]) )
}
# merge params into rows of a k x 4 coefficient matrix
params <- do.call(rbind, params)
# Create plot from here.
plot(x = seq_len(k), y = params[, "hard"])
abline(h = 0.5)
Note the above may still have a few rough edges, as I could not fully test it against your example.
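If a summary view is preferred over the index plot, a small sketch using the params matrix built above shows the sampling distribution of the "hard" estimates with the true value marked:
# distribution of the 1,000 point estimates around the true value 0.5
hist(params[, "hard"], breaks = 30)
abline(v = 0.5, col = "red", lwd = 2)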
I am new to R and looking to estimate the likelihood of having an outcome >= 100 using a probability density function (the outcome in my example is the size of an outbreak). I believe I have the correct coding, but something doesn't feel right about the answer when I look at the plot.
This is my code (it's based on the output of a stochastic model of an outbreak). I'd very much appreciate pointers. I think the error is in the likelihood calculation...
Thank you!
total_cases.dist <- dlnorm(sample.range, mean = total_cases.mean, sd = total_cases.sd)
total_cases.df <- data.frame("total_cases" = sample.range, "Density" = total_cases.dist)
library(ggplot2)
ggplot(total_cases.df, aes(x = total_cases, y = Density)) + geom_point()
pp <- function(x) {
print(paste0(round(x * 100, 3), "%"))
}
# likelihood of n_cases >= 100
pp(sum(total_cases.df$Density[total_cases.df$total_cases >= 100]))
You are using dlnorm, the log-normal density, which means the mean and sd parameters are the mean of log(values) and the sd of log(values), not of the values themselves. For example:
# we call the standard rlnorm
X = rlnorm(1000, 0, 1)
# on the original scale, the mean and sd are NOT 0 and 1
c(mean(X), sd(X))
# the log scale recovers what we simulated (close to 0 and 1)
c(mean(log(X)), sd(log(X)))
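For reference, the moments on the original scale follow the standard log-normal identities, which you can check against the simulation above:
mu <- 0; sigma <- 1
exp(mu + sigma^2 / 2) # theoretical mean, ~1.65
sqrt((exp(sigma^2) - 1) * exp(2 * mu + sigma^2)) # theoretical sd, ~2.16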
Now we simulate some data from a known Poisson distribution (where mean = variance), and we can model it using the log-normal:
set.seed(100)
X <- rpois(500,lambda=1310)
# we need to log values first
total_cases.mean <- mean(log(X))
total_cases.sd <- sd(log(X))
and you can see it works well
sample.range <- 1200:1400
hist(X,br=50,freq=FALSE)
lines(sample.range,
dlnorm(sample.range,mean=total_cases.mean,sd=total_cases.sd),
col="navyblue")
For your example, you can get the probability of values > 1200 (see the histogram):
plnorm(1200,total_cases.mean,total_cases.sd,lower.tail=FALSE)
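Applied to the threshold from your original question, and assuming total_cases.mean and total_cases.sd are computed from logged values as above, the same call gives P(total_cases >= 100):
plnorm(100, total_cases.mean, total_cases.sd, lower.tail = FALSE)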
Now for your data: if it is true that mean = 1310.198 and total_cases.sd = 31615.26, that makes the variance roughly 760,000 times your mean! I am not sure the log-normal distribution is appropriate for modeling this kind of data.
I would like to draw 1,000 samples from a custom distribution in R. I have the following custom distribution:
library(gamlss)
mu <- 1
sigma <- 2
tau <- 3
kappa <- 3
rate <- 1
Rmax <- 20
x <- seq(1, 2e1, 0.01)
points <- Rmax * dexGAUS(x, mu = mu, sigma = sigma, nu = tau) * pgamma(x, shape = kappa, rate = rate)
plot(points ~ x)
How can I randomly sample via Monte Carlo simulation from this distribution?
My first attempt was the following code, which produced a histogram shape I did not expect:
hist(sample(points, 1000), breaks = 51)
This is not what I was looking for, as it does not follow the same distribution as the pdf.
If you want a Monte Carlo simulation, you'll need to sample from the distribution a large number of times, not take a large sample one time.
Your object, points, has values that increase with the index up to a threshold around 400, level off, and then decrease; that's what plot(points ~ x) shows. It may describe a distribution, but the actual distribution of values in points is different: that distribution is about how often values fall within a certain range. You'll notice that your x axis for the histogram is similar to the y axis of the plot(points ~ x) plot. The actual distribution of values in the points object is easy enough to see, and it is similar to what you're seeing when sampling 1000 values at random, without replacement, from an object with about 1900 values in it. Here's the distribution of values in points (no simulation required):
hist(points, 100)
I used 100 breaks on purpose so you could see some of the fine details.
Notice the little bump in the tail at the top, which you may not expect if you want the histogram to look like the plot of the values vs. the index (or some increasing x). It means there are more values in points around 2 than around 1. See how the curve of plot(points ~ x) flattens when the value is around 2, and how it's very steep between 0.5 and 1.5. Notice also the large hump at the low end of the histogram, and look at the plot(points ~ x) curve again: do you see how most of the values (whether they're at the low end or the high end of that curve) are close to 0, or at least less than 0.25? If you look at those details, you may be able to convince yourself that the histogram is, in fact, exactly what you should expect :)
If you want a Monte Carlo simulation of a sample from this object, you might try something like:
samples <- replicate(1000, sample(points, 100, replace = TRUE))
If you want to generate data using points as a probability density function, that question has been asked and answered here
Let's define your (unnormalized) probability density function as an R function:
library(gamlss)
fun <- function(x, mu = 1, sigma = 2, tau = 3, kappa = 3, rate = 1, Rmax = 20)
Rmax * dexGAUS(x, mu = mu, sigma = sigma, nu = tau) *
pgamma(x, shape = kappa, rate = rate)
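A quick visual sanity check that fun reproduces the curve from the question:
curve(fun, from = 1, to = 20)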
Now one approach is to use some MCMC (Markov chain Monte Carlo) method. For instance,
simMCMC <- function(N, init, fun, ...) {
  out <- numeric(N)
  out[1] <- init
  for(i in 2:N) {
    # random-walk proposal around the current state
    pr <- out[i - 1] + rnorm(1, ...)
    # Metropolis ratio: the unknown normalizing constant cancels
    r <- fun(pr) / fun(out[i - 1])
    # accept the proposal with probability min(r, 1)
    out[i] <- ifelse(runif(1) < r, pr, out[i - 1])
  }
  out
}
It starts from the point init and gives N draws. The approach can be improved in many ways, but I'm simply going to start from init = 5, include a burn-in period of 20,000 draws, and select every second draw to reduce the number of repeated values:
d <- tail(simMCMC(20000 + 2000, init = 5, fun = fun), 2000)[c(TRUE, FALSE)]
plot(density(d))
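To judge the fit, you can overlay the normalized target density on that plot (a sketch: the normalizing constant is computed numerically with integrate, which is assumed to converge for this fun):
Z <- integrate(fun, 0, Inf)$value # normalizing constant
curve(fun(x) / Z, from = 0, to = 20, add = TRUE, col = "red")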
You invert the ECDF of the distribution:
ecd.points <- ecdf(points)
invecdfpts <- with( environment(ecd.points), approxfun(y,x) )
samp.inv.ecd <- function(n=100) invecdfpts( runif(n) )
plot(density(samp.inv.ecd(100)))
plot(density(points))
png(); layout(matrix(1:2, 1))
plot(density(samp.inv.ecd(100)), main = "The Sample")
plot(density(points), main = "The Original"); dev.off()
Here's another way to do it that draws from R: Generate data from a probability density distribution and How to create a distribution function in R?:
x <- seq(1, 2e1, 0.01)
points <- 20*dexGAUS(x,mu=1,sigma=2,nu=3)*pgamma(x,shape=3,rate=1)
f <- function (x) (20*dexGAUS(x,mu=1,sigma=2,nu=3)*pgamma(x,shape=3,rate=1))
C <- integrate(f, -Inf, Inf)
C$value
# [1] 11.50361
# normalize by C$value
f <- function (x)
(20*dexGAUS(x,mu=1,sigma=2,nu=3)*pgamma(x,shape=3,rate=1)/11.50361)
# tabulate the normalized density on the x grid ("pdf" was undefined in the original)
pdf <- data.frame(x = x, y = f(x))
random.points <- approx(cumsum(pdf$y)/sum(pdf$y), pdf$x, runif(10000))$y
hist(random.points,1000)
hist(random.points * 40, 1000) will recover the scaling of your original function.
I am interested in replicating an experiment from a paper [1] I came across. The idea is that I need to simulate a Cox proportional hazards model that depends on the first two covariates in the data frame. I am trying to make a plot similar to this:
But I am trying to make a "hex" version of it. The problem is that I can't seem to get the "z-axis" right.
set.seed(42) # this makes the example exactly reproducible
#50,000 random uniforms
obs <- runif(50000,min = -1, max = .999)
#make uniforms a matrix
obs <- matrix(data = obs, nrow = 5000, ncol = 10)
#make is_censored
is_censored <- sample(0:1,5000,TRUE,prob=c(0.40,0.60))
#hazard function
const <- 1
time <- rexp(n = 5000, const*exp(-(obs[,1]+2*obs[,2])))
#dataset
df <- cbind(obs, is_censored, time)
#names for covariates
names = letters[1:10]
colnames(df)[1:10] <- names
#truth data
x <- df[,1]; y <- df[,2]
library(tibble) # for tibble()
true <- tibble(x, y, time)
install.packages("hexbin") # run once if needed
library(hexbin)
library(ggplot2)
ggplot(true, aes(x, y)) +
  geom_hex(bins = 30)
I thought that if I added time for the z-axis I would get the correct gradient, but instead I got:
ggplot(true, aes(x, y, fill = time)) +
  geom_hex(bins = 30)
How can I get the proper gradient?
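One option worth trying (a hedged sketch, not taken from the paper): map time to the z aesthetic and let stat_summary_hex() average it within each hex, since geom_hex's default fill is the bin count rather than your covariate:
# mean value of time per hex; fun = median would also work
ggplot(true, aes(x, y, z = time)) +
  stat_summary_hex(fun = mean, bins = 30)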
[1] Deep Survival: A Deep Cox Proportional Hazards Network