Fitting a curve through points - r

This is my data:
y<-c(1.8, 2, 2.8, 2.9, 2.46, 1.8,0.3,1.1,0.664,0.86,1,1.9)
x<- c(1:12)
data<-as.data.frame(cbind(y,x))
plot(data$y ~ data$x)
I want to fit a curve through these points so that I can generate the intermediate predicted values. I need a curve that goes through the points. I don't care what function it fits.
I consulted this link.
Fitting a curve to specific data
install.packages("rgp")
library(rgp)
result <- symbolicRegression(y ~ x, data = data, functionSet = mathFunctionSet,
                             stopCondition = makeStepsStopCondition(2000))
# inspect results, they'll be different every time...
(symbreg <- result$population[[which.min(sapply(result$population,
                                                result$fitnessFunction))]])
function (x)
exp(sin(sqrt(x)))
# inspect visual fit
library(ggplot2)
ggplot() + geom_point(data = data, aes(x, y), size = 3) +
  geom_line(data = data.frame(symbx = data$x, symby = sapply(data$x, symbreg)),
            aes(symbx, symby), colour = "red")
If I repeat this analysis, the function above produces a different curve every time. Does anyone know why this is happening, and whether this is the right way to fit a curve through these points? Also, this function does not go through each point, so I cannot obtain the intermediate points.

A standard approach is to fit a spline; this gives a nice curve that goes through all the points. See ?spline. Concretely, you would use a call like:
spline(x = myX, y = myY, xout=whereToInterpolate)
or, just calculating 100 points for your example:
ss <- spline(x,y, n=100)
plot(x,y)
lines(ss)
Note that there is also a smoothing spline (smooth.spline()), which may help with noisy data.
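For example, a minimal sketch using smooth.spline() with the x and y from the question (the smoothing spline follows the trend rather than passing exactly through every point):
sm <- smooth.spline(x, y)
plot(x, y)
# evaluate the fitted spline on a fine grid and overlay it
lines(predict(sm, seq(1, 12, by = 0.1)), col = "blue")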
If the curve doesn't need to be smooth, there is the simpler approx(), which does linear interpolation:
approx(x = myX, y = myY, xout=whereToInterpolate)
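For instance, a short sketch that linearly interpolates the example data at steps of 0.1:
ai <- approx(x, y, xout = seq(1, 12, by = 0.1))
plot(x, y)
lines(ai)  # piecewise-linear curve passing through every point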

Related

How to create grid of kernel density plots in R

I have some samples from a high-dimensional density that I would like to plot. I would like to create a grid of panels in which the bivariate density of each pair of dimensions is plotted where they cross. For example, in Bayes and Big Data - The Consensus Monte Carlo Algorithm, Scott et al. (2016) show the following plot:
In this plot, above the diagonal the distributions are shown on a scale just large enough to fit the plots. Below the diagonal, the bivariate densities are plotted on a common scale.
Does anyone know how I can achieve such a plot?
For instance, suppose I have just generated samples from a 5-dimensional Gaussian distribution using:
library(MASS)
data <- MASS::mvrnorm(n=10000, mu=c(1,2,3,4,5), Sigma = diag(5))
This is relatively easy using facet_matrix() from the ggforce package. You just have to specify which layer goes on which part of the plot (e.g. layer.upper = 1 says that the first layer, geom_density2d(), should go in the upper triangular part of the matrix). geom_autodensity() makes sure that the KDE touches the bottom part of the plot.
library(MASS)
library(ggforce)
data <- MASS::mvrnorm(n=10000, mu=c(1,2,3,4,5), Sigma = diag(5))
df <- as.data.frame(data)
ggplot(df) +
  geom_density2d(aes(x = .panel_x, y = .panel_y)) +
  geom_autodensity() +
  geom_point(aes(x = .panel_x, y = .panel_y)) +
  facet_matrix(vars(V1:V5), layer.upper = 1, layer.diag = 2)
More details about facet_matrix() are posted here.

Smoothing using kernel and loess in R

I am trying to smooth my data set using a kernel or loess smoothing method, but the results are not clear or not what I want. My questions are the following.
My x data is "conc" and my y data is "depth" (in, e.g., cm).
1) Kernel smooth
k <- kernel("daniell", 150)
plot(k)
K <- kernapply(conc, k)
plot(conc~depth)
lines(K, col = "red")
Here, my data is smoothed with frequency = 150. Does this mean that every data point is averaged over the neighbouring 150 data points to the left and right? What does "daniell" mean? I could not find an explanation online.
2) Loess smooth
p<-qplot(depth, conc, data=total)
p1 <- p + geom_smooth(method = "loess", size = 1, level=0.95)
Here, what are the defaults of the loess smoothing function? If I want to smooth my data with frequency = 150 as in the case above (a moving average over every 150 data points), how can I modify this code?
3) To show the y-axis on a log scale, I put "log10(conc)" instead of "conc", and it worked. But I cannot change the y-axis tick labels. I tried "scale_y_log10(limits = c(1,1e3))" in my code to show tick labels like 10^0, 10^1, 10^2, ..., but it did not work.
Please answer my questions. Thanks a lot for your help.
Sum

How to plot exponential function on barplot R?

So I have a barplot in which the y axis is the log of the frequencies. From just eyeing it, it appears that the bars decrease exponentially, but I would like to know this for sure. What I want to do is also plot an exponential on this same graph. Thus, if my bars fall below the exponential, I would know that they decrease exponentially or faster than exponentially, and if the bars lie on top of the exponential, I would know that they don't decrease exponentially. How do I plot an exponential on a bar graph?
Here is my graph if that helps:
If you're trying to fit the density of an exponential distribution, you should probably plot a density histogram (not frequencies). See this question on how to plot distributions in R.
This is how I would do it.
x.gen <- rexp(1000, rate = 3)
hist(x.gen, prob = TRUE)
library(MASS)
x.est <- fitdistr(x.gen, "exponential")$estimate
curve(dexp(x, rate = x.est), add = TRUE, col = "red", lwd = 2)
One way of visually inspecting whether two distributions are the same is with a Quantile-Quantile plot, or Q-Q plot for short. Typically this is done when checking whether a distribution follows the standard normal.
The basic idea is to plot your data against some theoretical quantiles; if it matches that distribution, you will see a straight line. For example:
x <- qnorm(seq(0,1,l=1002)) # Theoretical normal quantiles
x <- x[-c(1, length(x))] # Drop ends because they are -Inf and Inf
y <- rnorm(1000) # Actual data. 1000 points drawn from a normal distribution
l.1 <- lm(sort(y)~sort(x))
qqplot(x, y, xlab="Theoretical Quantiles", ylab="Actual Quantiles")
abline(coef(l.1)[1], coef(l.1)[2])
Under perfect conditions you should see a straight line when plotting the theoretical quantiles against your data. So you can do the same, plotting your data against quantiles of the exponential distribution you think it follows.
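A rough sketch of that idea against exponential quantiles, using simulated data in place of your frequencies (the rate is estimated from the data itself):
y <- rexp(1000, rate = 2)              # stand-in for your observed values
p <- ppoints(length(y))                # evenly spaced probabilities
x.theo <- qexp(p, rate = 1 / mean(y))  # theoretical exponential quantiles
qqplot(x.theo, y, xlab = "Theoretical Quantiles (exponential)",
       ylab = "Actual Quantiles")
abline(0, 1)  # points near this line suggest an exponential decay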

best fitting curve from plot in R

I have a probability density function in a plot called ph, which I derived from two samples of data (with the help of a user on Stack Overflow) in this way:
few <-read.table('outcome.dat',head=TRUE)
many<-read.table('alldata.dat',head=TRUE)
mh <- hist(many$G,breaks=seq(0,1.,by=0.03), plot=FALSE)
fh <- hist(few$G, breaks=mh$breaks, plot=FALSE)
ph <- fh
ph$density <- fh$counts/(mh$counts+0.001)
plot(ph,freq=FALSE,col="blue")
I would like to fit the best curve to the plot of ph, but I can't find a working method.
How can I do this? Do I have to extract the values from ph and then work on them, or is there some function that works on
plot(ph,freq=FALSE,col="blue")
directly?
Assuming you mean that you want to perform a curve fit to the data in ph, then something along the lines of
nls(FUN, cbind(ph$counts, ph$mids),...) may work. You need to know what sort of function 'FUN' you think the histogram data should fit, e.g. normal distribution. Read the help file on nls() to learn how to set up starting "guess" values for the coefficients in FUN.
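A minimal sketch of that idea, assuming a Gaussian-shaped curve is a reasonable model for the histogram (the model and the starting values are only illustrative guesses and would need tuning for real data):
# assumes ph comes from the hist() calls above
df <- data.frame(mids = ph$mids, dens = ph$density)
fit <- nls(dens ~ a * exp(-(mids - b)^2 / (2 * c^2)),
           data = df,
           start = list(a = max(df$dens),
                        b = df$mids[which.max(df$dens)],
                        c = diff(range(df$mids)) / 4))
plot(ph, freq = FALSE, col = "blue")
lines(df$mids, predict(fit), col = "red")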
If you simply want to overlay a curve onto the histogram, then
smoo <- spline(ph$mids, ph$counts)
lines(smoo$x, smoo$y)
will come close to doing that. You may have to adjust the x and/or y scaling.
Do you want a density function?
x = rnorm(1000)
hist(x, breaks = 30, freq = FALSE)
lines(density(x), col = "red")

Calculating an area under a continuous density plot

I have two density curves plotted using this:
Network <- Mydf$Networks
quartiles <- quantile(Mydf$Avg.Position, probs=c(25,50,75)/100)
density <- ggplot(Mydf, aes(x = Avg.Position, fill = Network))
d <- density + geom_density(alpha = 0.2) + xlim(1, 11) + ggtitle("September 2010") + geom_vline(xintercept = quartiles, colour = "red")
print(d)
I'd like to compute the area under each curve for a given Avg.Position range. Sort of like pnorm for the normal curve. Any ideas?
Calculate the density separately and plot that to start with. Then you can use basic arithmetic to get the estimate. An integral is approximated by adding together the areas of a set of small rectangles: the width is the difference between two x-values, and the height is the mean of the y-values at the beginning and end of the interval. I use the rollmean function in the zoo package, but this can be done with the base package too.
require(zoo)
X <- rnorm(100)
# calculate the density and check the plot
Y <- density(X) # see ?density for parameters
plot(Y$x,Y$y, type="l") #can use ggplot for this too
# set an Avg.position value
Avg.pos <- 1
# construct lengths and heights
xt <- diff(Y$x[Y$x<Avg.pos])
yt <- rollmean(Y$y[Y$x<Avg.pos],2)
# This gives you the area
sum(xt*yt)
This gives you a good approximation to about 3 decimal places. If you know the density function, take a look at ?integrate.
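For example, when the density is known in closed form the same area can be computed directly; a quick check against the standard normal (so it can be compared with pnorm):
# area under the standard normal density up to Avg.pos = 1
integrate(dnorm, lower = -Inf, upper = 1)$value
pnorm(1)  # should agree closely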
Three possibilities (a short sketch combining them follows the list):
1. The logspline package provides a different method of estimating density curves, and it does include pnorm-style functions for the result.
2. You could also approximate the area by feeding the x and y variables returned by the density function to approxfun and using the result with integrate. Unless you are interested in precise estimates of small tail areas (or very small intervals), this will probably give a reasonable approximation.
3. Density estimates are just sums of kernels centred at the data points; one such kernel is the normal distribution. You could average the areas from pnorm (or another kernel's CDF), with the sd given by the bandwidth and centred at your data.
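A rough sketch of these three options, reusing X, Y <- density(X) and Avg.pos from the answer above, and assuming the default Gaussian kernel:
# 1. logspline: a density estimate with a built-in CDF (needs the logspline package)
# library(logspline)
# fit <- logspline(X)
# plogspline(Avg.pos, fit)

# 2. interpolate the kernel density estimate and integrate it
f <- approxfun(Y$x, Y$y)
integrate(f, lower = min(Y$x), upper = Avg.pos)$value

# 3. average the Gaussian kernel CDFs centred at the data points
mean(pnorm(Avg.pos, mean = X, sd = Y$bw))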
