How to predict using a locally smoothed mean? - r

(Statistics beginner here).
I have some training data (x,y), and wish to make predictions for new data x_new.
Now let's assume I have the data for the plot below, but I do not know how y is computed. So I would like to use the data I have to calculate, for any given x, the local mean of the y data, as this seems like the best guess I can make.
install.packages("gplots")
library("gplots")
x <- abs(rnorm(500))
y <- rnorm(500, mean=2*x, sd=2+2*x)
bandplot(x,y)
Is there an R function to predict y for a given x, using the locally smoothed mean (shown here in red thanks to the bandplot function), or something similar?

wapply from gplots returns the locally smoothed mean as a table for x and y.
x <- 1:1000
y <- rnorm(1000, mean=1, sd=1 + x/1000 )
wapply(x,y,mean)
To predict, one would need, I guess, to find the closest x in the table returned by wapply and then read off the corresponding local mean of y.
For a value a, the closest x will be given by the index:
index = which(abs(wapply(x,y,mean)$x-a)==min(abs(wapply(x,y,mean)$x-a)))
then the prediction should be:
pred = wapply(x,y,mean)$y[index]
So in one line:
locally_smoothed_mean_prediction = function(a) wapply(x,y,mean)$y[which(abs(wapply(x,y,mean)$x-a)==min(abs(wapply(x,y,mean)$x-a)))]
> locally_smoothed_mean_prediction(600)
[1] 1.055642
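Since that one-liner recomputes the wapply table three times, a slightly tidier variant (just a sketch, not part of the original answer) computes the table once and vectorizes the nearest-x lookup:
# compute the locally smoothed means once
sm <- wapply(x, y, mean)
# for each query point, return the smoothed mean at the nearest x in the table
predict_local_mean <- function(a) sapply(a, function(ai) sm$y[which.min(abs(sm$x - ai))])
predict_local_mean(c(0.5, 1, 2))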

Related

R: How to plot custom range of polynomial produced by lm poly fit

I'm confused by the coefficients in the output of lm.
Here's a copy of the data I'm working with
(postprocessed.csv)
"","time","value"
"1",1,2.61066016308988
"2",2,3.41246054742996
"3",3,3.8608767964033
"4",4,4.28686048552237
"5",5,4.4923132964825
"6",6,4.50557049744317
"7",7,4.50944447661246
"8",8,4.51097373134893
"9",9,4.48788748823809
"10",10,4.34603985656981
"11",11,4.28677073671406
"12",12,4.20065901625172
"13",13,4.02514194962519
"14",14,3.91360194972916
"15",15,3.85865748409081
"16",16,3.81318053258601
"17",17,3.70380706527433
"18",18,3.61552922363713
"19",19,3.61405310598722
"20",20,3.64591327503384
"21",21,3.70234435835577
"22",22,3.73503970503372
"23",23,3.81003078640584
"24",24,3.88201196162666
"25",25,3.89872518158949
"26",26,3.97432743542362
"27",27,4.2523675144599
"28",28,4.34654855854847
"29",29,4.49276038902684
"30",30,4.67830892029687
"31",31,4.91896819673664
"32",32,5.04350767355202
"33",33,5.09073406942046
"34",34,5.18510849382162
"35",35,5.18353176529036
"36",36,5.2210776270173
"37",37,5.22643491929207
"38",38,5.11137006553725
"39",39,5.01052467981257
"40",40,5.0361056705898
"41",41,5.18149486951409
"42",42,5.36334869132276
"43",43,5.43053620818444
"44",44,5.60001072279525
I have fitted a 4th order polynomial to this data using the following script:
library(ggplot2)
library(matrixStats)
library(forecast)
df_input <- read.csv("postprocessed.csv")
x <- df_input$time
y <- df_input$value
df <- data.frame(x, y)
poly4model <- lm(y~poly(x, degree=4), data=df)
v <- seq(30, 40)
vv <- poly4model$coefficients[1] +
poly4model$coefficients[2] * v +
poly4model$coefficients[3] * (v ^ 2) +
poly4model$coefficients[4] * (v ^ 3) +
poly4model$coefficients[5] * (v ^ 4)
pdf("postprocessed.pdf")
plot(df)
lines(v, vv, col="red", pch=20, lwd=3)
dev.off()
I initially tried using the predict function to do this, but couldn't get that to work, so I resorted to implementing this "workaround" using some new vectors v and vv to store the data for the line in the region I am trying to plot.
Ultimately, I am trying to do this:
Fit a 4th order polynomial to the data
Plot the 4th order polynomial over the range of data in one color
Plot the 4th order polynomial over the range from the last value to the last value + 10 (prediction) in a different color
At the moment I am fairly sure that using v and vv to do this is not "the best way", but I would have thought it should work. What is happening is that I get very large values.
Here is a screenshot from Desmos. I copied and pasted the same coefficients as shown by typing poly4model$coefficients into the console. However, something must have gone wrong because this function is nothing like the data.
I think I've provided enough info to be able to run this short script. However, I will add the pdf as well.
It is easiest to use the predict function to create your line: pass the model and a data frame containing the desired independent variable values to predict.
x <- df_input$time
y <- df_input$value
df <- data.frame(x, y)
poly4model <- lm(y~poly(x, degree=4), data=df)
v <- seq(30, 40)
#Notice the column in the dataframe is the same variable name
# as the variable in the model!
predict(poly4model, data.frame(x=v))
plot(df)
lines(v, predict(poly4model, data.frame(x=v)), col="red", lwd=3)
NOTE
The function poly "Returns or evaluates orthogonal polynomials of degree 1 to degree over the specified set of points x: these are all orthogonal to the constant polynomial of degree 0." To return the "normal" polynomial coefficients one needs to use the "raw=TRUE" option in the function.
poly4model <- lm(y~poly(x, degree=4, raw=TRUE), data=df)
Now your equation above will work.
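For the two-colour plot the question describes (the fit over the observed range in one colour, the next 10 time units in another), a minimal sketch building on predict and the poly4model fitted above could look like this; the colours and grid resolutions are arbitrary choices:
x_fit  <- seq(min(df$x), max(df$x), length.out = 200)      # observed range
x_pred <- seq(max(df$x), max(df$x) + 10, length.out = 50)  # extrapolation range
plot(df)
lines(x_fit,  predict(poly4model, data.frame(x = x_fit)),  col = "blue", lwd = 3)
lines(x_pred, predict(poly4model, data.frame(x = x_pred)), col = "red",  lwd = 3)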

Simulating correlated data between a count variable and a continuous variable

Does anybody know how I could possibly simulate data with a correlation between a count variable and a continuous variable? Right now the best idea that I have is to just transform the count variable to make it approximately normal, and then to simulate the data using this R code:
set.seed(2018)
x = rnorm(n = 1000, mean = 0, sd = 1)
y = rnorm(n = 1000, mean = .29*x, sqrt(1-.3^2))
cor(x,y)
However, I really think it would be preferable if I could actually make Y a count variable (because counts tend to be right-skewed). Also, I want to be able to specify particular correlations between x and y, e.g., simulate data with a 0.5 correlation between x and y.
Edit: I'm still looking for help!
You can use runif to simulate the continuous variable, then feed the result as the lambda (rate) parameter of rpois:
set.seed(1)
continuous <- runif(100, 0, 10)
counts <- rpois(100, continuous)
plot(continuous, counts)
cor(counts, continuous)
#> [1] 0.7852701
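If you need a right-skewed count outcome and want to dial in an approximate correlation, one option (a sketch, not from the original answer) is a Gaussian-copula style construction: create a latent normal variable with the target correlation, then map it to Poisson counts through its CDF. The lambda value and the target rho below are illustrative, and the realized Pearson correlation with the counts will be somewhat attenuated relative to rho.
set.seed(2018)
n   <- 1000
rho <- 0.5                                   # target correlation on the latent scale
x   <- rnorm(n)                              # continuous variable
z   <- rho * x + sqrt(1 - rho^2) * rnorm(n)  # latent normal correlated with x
y   <- qpois(pnorm(z), lambda = 3)           # right-skewed count variable
cor(x, y)                                    # close to, but a bit below, rho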

Convert uniform draws to normal distributions with known mean and std in R

I apply the sensitivity package in R. In particular, I want to use sobolroalhs, as it uses a sampling procedure for inputs that allows evaluation of models with a large number of parameters. The function samples all inputs uniformly on [0,1]. The documentation states that the desired distributions need to be obtained as follows:
####################
# Test case: dealing with non-uniform distributions
x <- sobolroalhs(model = NULL, factors = 3, N = 1000, order =1, nboot=0)
# X1 follows a log-normal distribution:
x$X[,1] <- qlnorm(x$X[,1])
# X2 follows a standard normal distribution:
x$X[,2] <- qnorm(x$X[,2])
# X3 follows a gamma distribution:
x$X[,3] <- qgamma(x$X[,3],shape=0.5)
# toy example
toy <- function(x){rowSums(x)}
y <- toy(x$X)
tell(x, y)
print(x)
plot(x)
I have non-zero means and standard deviations for some input parameters that I want to sample from a normal distribution. For others, I want to sample uniformly within a defined range (e.g. [0.03, 0.07] instead of [0, 1]). I tried using built-in R functions such as
SA$X[,1] <- rnorm(1000, mean = 579, sd = 21)
but I am afraid this procedure messes up the sampling design of the package, and it produced odd results for the sensitivity indices. Hence, I think I need to keep the uniform draws of the sobolroalhs function and use the sampled values between [0, 1] when drawing from the desired distributions (as a quantile transform, I think?). Does this make sense to anyone, and/or does anyone know how I could sample from the right distributions following the syntax from the package description?
You can specify mean and sd in qnorm. So modify lines like this:
x$X[,2] <- qnorm(x$X[,2])
to something like this:
x$X[,2] <- qnorm(x$X[,2], mean = 579, sd = 21)
Similarly, you could use the min and max parameters of qunif to get values in a given range.
Of course, it's also possible to transform standard normals or uniforms to the ones you want using things like X <- 579 + 21*Z or Y <- 0.03 + 0.04*U, where Z is a standard normal and U is standard uniform, but for some distributions those transformations aren't so simple and using the q* functions can be easier.
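Putting that together with the sobolroalhs design from the question, a minimal sketch might look like the following; the factor count, the N(579, 21) input, and the [0.03, 0.07] range are taken from the question and are otherwise illustrative:
library(sensitivity)
x <- sobolroalhs(model = NULL, factors = 2, N = 1000, order = 1, nboot = 0)
# X1 follows a normal distribution with the desired mean and sd:
x$X[,1] <- qnorm(x$X[,1], mean = 579, sd = 21)
# X2 follows a uniform distribution on [0.03, 0.07]:
x$X[,2] <- qunif(x$X[,2], min = 0.03, max = 0.07)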

How to find a percentile that can maximize the correlation coefficient between two vectors?

Suppose I have two continuous vectors such as:
set.seed(123)
df <- data.frame(x = rnorm(100),
                 y = rnorm(100, 3, 5))
with(df, cor(x,y))
My question is how to find a percentile of x so as to maximize the absolute correlation between x and y, as in:
perc <- quantile(df$x, 0.3)
df1 <- subset(df, x > perc)
with(df1, cor(x,y))
Namely how to find perc?
This problem is ill-defined. Take your example data set and the function you want to find the maximum of (copied from #coffeinjunky):
set.seed(123)
df <- data.frame(x = rnorm(100),
                 y = rnorm(100, 3, 5))
findperc <- function(prop, dat) {
  perc <- quantile(dat$x, prop)
  with(subset(dat, dat$x > perc), abs(cor(x, y)))
}
Now plot the result of findperc for percentiles between 0 and 1.
x <- seq(0,1,0.01)
plot(x,sapply(x,findperc,df),type="l")
The point found by optimize, as in #coffeinjunky's answer, is clearly only a local maximum. The applicability of the warning from #Thierry, "You need to rethink the question. As soon a x and y contain only 2 element the correlation will be either 1 or -1", should be apparent on the right hand side of the plot.
In general, the fact that you are getting moderate to high correlations when starting with independently generated random variables should warn you that your results are spurious and method suspect.
Well, why not take your question literally, and just search for it? For instance, try:
findperc <- function(prop, dat) {
  perc <- quantile(dat$x, prop)
  with(subset(dat, dat$x > perc), abs(cor(x, y)))
}
optimize(findperc, lower=0, upper=1, maximum=T, dat=df)
This defines a function that computes the absolute correlation between your vectors based on the corresponding percentile (which here is a single value), just as in your example code. I then feed this function to a one-dimensional optimizer, which searches for the input that produces the maximum value for the output.
Edit: Thanks to #A. Webb's answer I learned that optimize performs a local search rather than a grid search. I thought that this was the main difference between optim and optimize, a clearly wrong assumption I should have checked myself. However, just to provide a solution using a grid search that will get you closer to the global maximum, one could use the following:
x <- seq(0,0.97,0.01)
x[which.max(sapply(x, findperc, dat=df))]
Note that I have cut x here at 97%. This ensures that at least 3 observations are left in the subset (given a sample size of 100).
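Combining the two ideas, here is a small sketch of a grid search that reports both the best percentile and the correlation it achieves, while keeping at least min_n observations in the subset (min_n = 10 is an arbitrary choice):
min_n <- 10
grid  <- seq(0, 1 - min_n / nrow(df), by = 0.01)
vals  <- sapply(grid, findperc, dat = df)
c(percentile = grid[which.max(vals)], abs_cor = max(vals))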

R: Finding solutions for new x values with nlmrt

Good day,
I have tried to figure this out, but I really can't!! I'll supply an example of my data in R:
x <- c(36,71,106,142,175,210,246,288,357)
y <- c(19.6,20.9,19.8,21.2,17.6,23.6,20.4,18.9,17.2)
table <- data.frame(x,y)
library(nlmrt)
curve <- "y~ a + b*exp(-0.01*x) + (c*x)"
ones <- list(a=1, b=1, c=1)
Then I use wrapnls to fit the curve and to find a solution:
solve <- wrapnls(curve, data=table, start=ones, trace=FALSE)
This is all fine and works for me. Then, using the following, I obtain a prediction of y for each of the x values:
predict(solve)
But how do I find the prediction of y for new x values? For instance:
new_x <- c(10, 30, 50, 70)
I have tried:
predict(solve, new_x)
predict(solve, 10)
It just gives the same output as:
predict(solve)
I really hope someone can help! I know that if I use the values of 'solve' for parameters a, b, and c and substitute them into the curve formula with the desired x value, I would be able to do this, but I'm wondering if there is a simpler option, ideally without plotting the data first.
predict requires the new data to be a data.frame with column names that match the variable names used in your model (whether your model has one or many variables). All you need to do is use
predict(solve, data.frame(x=new_x))
# [1] 18.30066 19.21600 19.88409 20.34973
And that will give you a prediction for just those 4 values. It's somewhat unfortunate that any mistake in specifying the new data results in the fitted values for the original model being returned; an error message would probably have been more useful, but oh well.
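As a cross-check (a sketch, not part of the original answer), the manual substitution the question mentions should give the same numbers, using the fitted coefficients from wrapnls; the names a, b, and c come from the parameterization used when the model was fitted above:
# evaluate the fitted curve by hand at the new x values
p <- coef(solve)
p[["a"]] + p[["b"]] * exp(-0.01 * new_x) + p[["c"]] * new_x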
