Control the degree of smoothing in my simple kernel smoothing code - r

I need help customizing the smoothing of my time series data. The code below smooths the data using the sm.regression and approx functions, but the degree of smoothing is not user controlled. By changing function parameters, I want the ability to control whether the smoothed curve follows the underlying data more or less closely.
library(sm)        # sm.regression, h.select
library(quantmod)  # is.xts, Cl

find.extrema <- function(x)
{
  if(is.xts(x)) {
    y = as.vector( Cl(x) )
  } else {
    y = x
  }
  n = length(y)
  t = 1:n
  h = h.select(t, y, method = 'cv')
  temp = sm.regression(t, y, h = h, display = 'none')
  mhat = approx(temp$eval.points, temp$estimate, t, method = 'linear')$y
  #mhat = y  # to exactly match the underlying data
  return(mhat)
}
Any help would be appreciated.
Thanks.

There are not many questions regarding the sm package. Perhaps it is not widely used nowadays, but I still remember playing with it a lot when doing my MRes degree.
You can't control the smoothness because you are using cross-validation to select the smoothing parameter automatically. Just get rid of the h.select line and pass h as an argument of your function.
find.extrema <- function(x, h = NULL)
{
  if(is.xts(x)) {
    y = as.vector( Cl(x) )
  } else {
    y = x
  }
  n = length(y)
  t = 1:n
  ## if not given, do auto-selection from data
  if (is.null(h)) h = h.select(t, y, method = 'cv')
  temp = sm.regression(t, y, h = h, display = 'none')
  mhat = approx(temp$eval.points, temp$estimate, t, method = 'linear')$y
  #mhat = y  # to exactly match the underlying data
  return(mhat)
}
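A quick way to see the effect (a sketch using a made-up numeric series, not your data): a smaller h makes the curve follow the data more closely, a larger h smooths more aggressively.

set.seed(1)
y <- cumsum(rnorm(300))               # hypothetical example series
m_cv     <- find.extrema(y)           # cross-validated bandwidth
m_rough  <- find.extrema(y, h = 2)    # small h: hugs the data
m_smooth <- find.extrema(y, h = 30)   # large h: much smoother curve
plot(y, type = 'l', col = 'grey')
lines(m_rough, col = 'blue')
lines(m_smooth, col = 'red')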
The whole point of the sm package for kernel smoothing and/or kernel density estimation is the cross-validation part. If you don't use it, you can just use ksmooth from base R for a Nadaraya-Watson kernel estimator. You can read more about it in "Scatter plot kernel smoothing: ksmooth() does not smooth my data at all", where I made a comparison with sm.regression.
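For completeness, a minimal ksmooth sketch (assuming an evenly spaced numeric series; bandwidth plays the same role as h above):

set.seed(1)
y <- cumsum(rnorm(200))
t <- seq_along(y)
fit <- ksmooth(t, y, kernel = 'normal', bandwidth = 10, x.points = t)
plot(t, y, type = 'l', col = 'grey')
lines(fit$x, fit$y, col = 'red', lwd = 2)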

Related

How can I return a plot from for loop in R?

I am trying to write a program for the bisection method and also plot the midpoints at each iteration.
Here is the code I have tried.
Bisec = function(f, a = 1, b = 2, max = 1e10, tol = 1e-100){
  midVals = c()
  for (i in 1:max){
    c = (a+b)/2
    midVals = append(midVals, c)
    if(abs(f(c)) < tol){
      return(list(c, plot(f), points(midVals)))
    } else if(f(a)*f(c) > 0){
      a = c
    } else {
      b = c
    }
  }
  print("Maximum iterations reached")
}
x = var('x')
f = function(x){x*x-2}
Bisec(f, 1, 3, max = 1e5, tol = 1e-10)
But I am getting graphs like this.
What do I need?
The function f has to be plotted.
The midpoints found at each iteration should be plotted along the x-axis.
How can I achieve this?
Any hint would be helpful. I don't know where I am going wrong.
R notation can be a little different if you learned to program in another language. Part of R's power is its integration between an interpreted interface and (fast) compiled functions. Generally (although this may be an exception, I'm not focusing on that), for loops are avoided: many functions are vectorized, which means they do the looping within the compiled portion of the code. We also avoid growing objects from empty, because the whole object has to be copied every time you append to it.
For your specific problem, plot is plotting f; it just doesn't know anything about the points command because plot is evaluated before points is ever seen. You might find ggplot2 gives a more dynamic solution, but I'll start with a base R approach to your function:
Bisec = function(f, a = 1, b = 2, max_iter = 1e10, tol = 1e-100){
  midVals = rep(NA, max_iter) # I avoid using `max` since that's a function to find the maximum
  for (i in 1:max_iter){
    x <- mean(c(a, b)) # I also avoid using `c` since that's a function to concatenate stuff
    midVals[i] <- x
    if(abs(f(x)) < tol){
      plot(f, xlim = range(midVals, na.rm = TRUE))
      points(midVals, rep(0, length(midVals)))
      return(x)
    } else if(f(a)*f(x) > 0){
      a = x
    } else {
      b = x
    }
  }
  print("Maximum iterations reached")
}
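A quick usage sketch with the example function from the question (the returned midpoint should come out close to sqrt(2), roughly 1.414214):

f <- function(x){ x*x - 2 }
Bisec(f, 1, 3, max_iter = 1e5, tol = 1e-10)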

How to set different kind of bar plot in terminal nodes?

I am running a MOB tree on a dataset and I want to modify the plots in the terminal nodes. I want to use a bar chart of the coefficients of the model fitted by MOB in each node as my terminal node plot.
For example, I run the MOB tree on the "PimaIndiansDiabetes" dataset in the "mlbench" package. Here is the code:
library("partykit")
library("mlbench")
data("PimaIndiansDiabetes", package = "mlbench")

pid_formula <- diabetes ~ glucose | pregnant + pressure + triceps +
  insulin + mass + pedigree + age
logit <- function(y, x, start = NULL, weights = NULL, offset = NULL, ...) {
  glm(y ~ 0 + x, family = binomial, start = start, ...)
}
pid_tree <- mob(pid_formula, data = PimaIndiansDiabetes, fit = logit)
Then I have a model for each node; for example, "mass = -9.95 + 0.058 * glucose" for node number 2. I want to make bar charts from these coefficients (e.g. -9.95 and 0.058 for node 2) and use these bar charts as the terminal nodes in the final tree plot. Any idea how to do that? Thanks in advance.
To implement such a graphic in partykit you would have to write a new panel function for the plot() method (or rather a panel-generating function). The starting point could be partykit::node_barplot, which first extracts the fitted probabilities of a classification tree and then draws them using the grid package. Instead, you could extract the estimated parameters with coef() and then draw these using grid. It's a bit technical but not extremely complicated.
However, I wouldn't recommend implementing such a function. The reason is that it would be best suited to comparing the different coefficients within the same node. But as slope and intercept are on completely different scales, this is not easy to interpret. Instead, one should put more emphasis on differences in the same coefficient across nodes. The basis for this would also be:
coef(pid_tree)
## x(Intercept) xglucose
## 2 -9.951510 0.05870786
## 4 -6.705586 0.04683748
## 5 -2.770954 0.02353582
Additionally one could consider the corresponding standard errors for confidence intervals. (Keep in mind that these have to be taken with a grain of salt, though: They do not adjust for estimating the tree but pretend the terminal groups were given exogenously. Still they are useful as rough yardsticks.) I include a small convenience function to do this:
confintplot <- function(object, ylim = NULL,
  xlab = "Parameter per node", ylab = "Estimate",
  main = "", index = NULL, ...)
{
  ## point estimates and intervals
  cf <- coef(object)
  node <- nodeids(object, terminal = TRUE)
  ci <- nodeapply(object, ids = node, FUN = function(n)
    confint(info_node(n)$object, ...))
  if (!is.null(index)) {
    cf <- cf[, index, drop = FALSE]
    ci <- lapply(ci, "[", index, , drop = FALSE)
  }
  cfnm <- rownames(ci[[1L]])
  nodenm <- rownames(cf)

  ## set up dimensions
  n <- length(ci)
  k <- nrow(ci[[1L]])
  at <- t(outer(1:k, seq(-0.15, 0.15, length.out = n), "+"))

  ## empty plot
  if(is.null(ylim)) ylim <- range(unlist(ci))
  plot(0, 0, type = "n", xlim = range(at), ylim = ylim,
    xlab = xlab, ylab = ylab, main = main, axes = FALSE)

  ## draw every parameter
  for(i in 1L:k) {
    arrows(at[, i], sapply(ci, "[", i, 1L), at[, i], sapply(ci, "[", i, 2L),
      code = 3, angle = 90, length = 0.05)
    points(at[, i], cf[, cfnm[i]], pch = 19, col = "white", cex = 1.15)
    points(at[, i], cf[, cfnm[i]], pch = nodenm, cex = 0.65)
  }
  axis(1, at = 1:k, labels = cfnm)
  axis(2)
  box()
}
Using this we can create one plot for each parameter (intercept vs. slope) separately. This shows that the intercept is increasing across nodes while the slope is decreasing.
par(mfrow = c(1, 2))
confintplot(pid_tree, index = 1)
confintplot(pid_tree, index = 2)
It is also possible to show these on a common y-axis. However, this completely obscures the changes in the slope because of the different scales:
confintplot(pid_tree)
Final comment: I would recommend using glmtree() for this particular kind of model instead of mob() "by hand". The former is faster and provides some extra features, especially easy forecasting.
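For reference, a minimal sketch of what the glmtree() version might look like (same formula and data as above, with partykit already loaded; this is my sketch of the equivalent call, not code from the question):

pid_tree2 <- glmtree(diabetes ~ glucose | pregnant + pressure + triceps +
                       insulin + mass + pedigree + age,
                     data = PimaIndiansDiabetes, family = binomial)
plot(pid_tree2)
coef(pid_tree2)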

Using R and Sensor Accelerometer Data to Detect a Jump

I'm fascinated by sensor data. I used my iPhone and an app called SensorLog to capture
accelerometer data while I stand and push my legs to jump.
My goal is to use R to create a model which can identify jumps and how long I'm in the air.
I'm unsure how to proceed in such a challenge. I have a timeseries with accelerometer data.
https://drive.google.com/file/d/0ByWxsCBUWbqRcGlLVTVnTnZIVVk/view?usp=sharing
Some questions:
How can a jump be detected in timeseries data?
How to identify the air time part?
How to train such a model?
Below is the R code used to create the graphs; the data is me standing and doing a simple jump.
Thanks!
# Training set
sample <- read.csv("sample-data.csv")
# Sum gravity
sample$total_gravity <- sqrt(sample$accelerometerAccelerationX^2+sample$accelerometerAccelerationY^2+sample$accelerometerAccelerationZ^2)
# Smooth our total gravity to remove noise
f <- rep(1/4,4)
sample$total_gravity_smooth <- filter(sample$total_gravity, f, sides=2)
# Removes rows with NA from smoothing
sample<-sample[!is.na(sample$total_gravity_smooth),]
#sample$test<-rollmaxr(sample$total_gravity_smooth, 10, fill = NA, align = "right")
# Plot gravity
plot(sample$total_gravity, type="l", col=grey(.2), xlab="Series", ylab="Gravity", main="Accelerometer Gravitational Force")
lines(sample$total_gravity_smooth, col="red")
stdevs <- mean(sample$total_gravity_smooth)+c(-2,-1,+1,+2)*sd(sample$total_gravity_smooth)
abline(h=stdevs)
This is probably a less-than-perfect solution, but it might be enough to get you started. The first part relies on a small modification of the find_peaks function from the gazetools package.
find_maxima <- function(x, threshold)
{
  ranges <- find_peak_ranges(x, threshold)
  peaks <- NULL
  if (!is.null(ranges)) {
    for (i in 1:nrow(ranges)) {
      rnge <- ranges[i, 1]:ranges[i, 2]
      r <- x[rnge]
      peaks <- c(peaks, rnge[which(r == max(r))])
    }
  }
  peaks
}
find_minima <- function(x, threshold)
{
  ranges <- find_peak_ranges(x, threshold)
  peaks <- NULL
  if (!is.null(ranges)) {
    for (i in 1:nrow(ranges)) {
      rnge <- ranges[i, 1]:ranges[i, 2]
      r <- x[rnge]
      peaks <- c(peaks, rnge[which(r == min(r))])
    }
  }
  peaks
}
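Note that find_peak_ranges itself is the modified gazetools helper and is not shown here. As a stand-in, here is a hypothetical sketch of such a helper (my assumption of its behaviour, not the gazetools implementation): it returns the start and end indices of contiguous runs where the series crosses the threshold (above it for a positive threshold, below it for a negative one).

## Hypothetical stand-in for find_peak_ranges (assumption, not from gazetools):
## one row per contiguous run beyond the threshold, columns = start and end index.
find_peak_ranges <- function(x, threshold) {
  beyond <- if (threshold >= 0) x > threshold else x < threshold
  r <- rle(as.vector(beyond))
  ends <- cumsum(r$lengths)
  starts <- ends - r$lengths + 1
  keep <- which(r$values)
  if (length(keep) == 0) return(NULL)
  cbind(starts[keep], ends[keep])
}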
In order to get the find_maxima and find_minima functions to give us what we're looking for, we need to smooth the total_gravity data even further:
spline <- smooth.spline(sample$loggingSample, y = sample$total_gravity, df = 30)
Note: I 'zeroed out' total gravity (sample$total_gravity <- sample$total_gravity - 1)
Next, pull out the smoothed x and y values:
out <- data.frame(x = spline$x, y = spline$y)  # named columns so out$x / out$y work below
Then find our local maxima and minima
max <- find_maxima(out$y, threshold = 0.4)
min <- find_minima(out$y, threshold = -0.4)
And then plot the data to make sure everything looks legit:
plot(out$y, type="l", col=grey(.2), xlab="Series", ylab="Gravity", main="Accelerometer Gravitational Force")
lines(out$y, col="red")
stdevs <- mean(out$y)+c(-2,-1,+1,+2)*sd(out$y)
abline(h=stdevs)
abline(v=max[1], col = 'green')
abline(v=max[2], col = 'green')
abline(v=min[1], col = 'blue')
And finally, we can see how long you were off the ground.
print(hangtime <- min[1] - max[1])
[1] 20
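This value is in units of the smoothed series' index (roughly sample indices, assuming smooth.spline kept one x value per logged sample). A small sketch of converting it to seconds, where sample_rate is a hypothetical placeholder for whatever logging frequency SensorLog was set to:

sample_rate <- 100                       # hypothetical Hz; check your SensorLog settings
hangtime_sec <- (min[1] - max[1]) / sample_rate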
You can reduce your thresholds to get additional datapoints (changes in acceleration).
Hope this helps!
I would consider a few things:
Smooth the data by collecting median values every 100 ms; accelerometer data on iPhones is not perfectly accurate, so this approach will help.
Identify turning points, as #scribbles suggests.
There is code available in my github repository that could be modified to help with both of these issues. A PDF with some explanation is here: https://github.com/MonteShaffer/mPowerEI/blob/master/mPowerEI/example/challenge-1a.pdf
Specifically, take a look at:
library(devtools);
install_github("MonteShaffer/mPowerEI", subdir="mPowerEI");
library(mPowerEI);
# data smoothing
?scaleToTimeIncrement
# turning points
?pastecs::turnpoints
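For instance, a minimal sketch of applying turnpoints to the smoothed series from the answer above (assuming the out data frame defined there):

library(pastecs)
tp <- turnpoints(out$y)    # turning-point analysis of the smoothed series
peaks <- which(tp$peaks)   # indices of local maxima
pits  <- which(tp$pits)    # indices of local minima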

Different data in upper and lower panel of scatterplot matrix

I want to plot two different data sets in a scatterplot matrix.
I know that I can use upper.panel and lower.panel to differentiate the plot function. However, I haven't succeeded in putting my data into a suitable format to make use of this.
Assume I have two tissues (“brain” and “heart”) and four conditions (1–4). Now I can use e.g. pairs(data$heart) to get a scatterplot matrix for one of the data sets. Assume I have the following data:
conditions <- 1:4
noise <- rnorm(100)
data <- list(brain = sapply(conditions, function (x) noise + 0.1 * rnorm(100)),
             heart = sapply(conditions, function (x) noise + 0.3 * rnorm(100)))
How do I get this into a format so that pairs(data, …) plots one data set above and one below the diagonal, as shown here (green = brain, violet = heart):
Just using
pairs(data, upper.panel = something, lower.panel = somethingElse)
doesn't work, because that will plot all conditions versus all conditions without regard for the different tissues; it essentially ignores the list. The same happens when reordering the hierarchy (i.e. having data = list(A = list(brain = …, heart = …), B = list(brain = …, heart = …), …)).
This is the best I seem to be able to do via passing arguments:
foo.upper <- function(x, y, ind.upper, col.upper, ind.lower, col.lower, ...){
  points(x[ind.upper], y[ind.upper], col = col.upper, ...)
}
foo.lower <- function(x, y, ind.lower, col.lower, ind.upper, col.upper, ...){
  points(x[ind.lower], y[ind.lower], col = col.lower, ...)
}
pairs(dat[, -5],
      lower.panel = foo.lower,
      upper.panel = foo.upper,
      ind.upper = dat$type == 'brain',
      ind.lower = dat$type == 'heart',
      col.upper = 'blue',
      col.lower = 'red')
Note that each panel function needs all of the arguments; ... is a cruel mistress. If you include only the panel-specific arguments in each function, it appears to work, but you get lots and lots of warnings from R trying to pass these arguments on to regular plotting functions where they obviously don't exist.
This was my quick first attempt, but it seems ugly:
dat <- as.data.frame(do.call(rbind, data))
dat$type <- rep(c('brain', 'heart'), each = 100)

foo.upper <- function(x, y, ...){
  points(x[dat$type == 'brain'], y[dat$type == 'brain'], col = 'red', ...)
}
foo.lower <- function(x, y, ...){
  points(x[dat$type == 'heart'], y[dat$type == 'heart'], col = 'blue', ...)
}
pairs(dat[, -5], lower.panel = foo.lower, upper.panel = foo.upper)
I'm abusing R's scoping in this second version in a somewhat ugly way. (Of course, you could probably do this more cleanly in lattice, but you probably knew that.)
The only other option I can think of is to design your own scatter plot matrix using layout, but that's probably quite a bit of work; a rough sketch of the idea follows.
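Very roughly (this sketch is my own assumption of what that would look like, using par(mfrow) rather than layout() for brevity, and the brain/heart list from the question):

op <- par(mfrow = c(4, 4), mar = c(2, 2, 1, 1))
for (i in 1:4) {
  for (j in 1:4) {
    if (i == j) {
      plot.new(); text(0.5, 0.5, paste("Condition", i))           # diagonal labels
    } else if (i < j) {
      plot(data$brain[, j], data$brain[, i], col = "darkgreen")   # upper triangle: brain
    } else {
      plot(data$heart[, j], data$heart[, i], col = "purple")      # lower triangle: heart
    }
  }
}
par(op)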
Lattice Edit
Here's at least a start on a lattice solution. It should handle varying x,y axis ranges better, but I haven't tested that.
library(lattice)

dat <- do.call(rbind, data)
dat <- as.data.frame(dat)
dat$grp <- rep(letters[1:2], each = 100)

plower <- function(x, y, grp, ...){
  panel.xyplot(x[grp == 'a'], y[grp == 'a'], col = 'red', ...)
}
pupper <- function(x, y, grp, ...){
  panel.xyplot(x[grp == 'b'], y[grp == 'b'], ...)
}
splom(~dat[, 1:4],
      data = dat,
      lower.panel = plower,
      upper.panel = pupper,
      grp = dat$grp)

Nested for loops in R - Issue with final result

I am in the midst of solving a problem of reconstructing (or recovering) a probability distribution function when only the moments of the distribution are known. I have written R code for it, and although the logic seems right to me, I am not getting the output that I want.
The equation I am trying to use as the approximated (or reconstructed, or recovered) CDF is equation (2) in the paper linked below. I am writing code for its right-hand side and assigning the result to a vector that I call F in my code.
The paper that contains the original equation can be found here:
http://www.sciencedirect.com/science/article/pii/S0167715208000187
It is marked as equation (2) in the paper.
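For reference, here is my reading of that right-hand side, reconstructed from the corrected code in the answer further down (μ_j denotes the j-th moment of the target distribution, here μ_j = 9/(j+3)^2 as used in the code, and T is the upper endpoint of the support):

$$F_\alpha(x) \;\approx\; \sum_{k=1}^{\lfloor \alpha x / T \rfloor} \; \sum_{j=k}^{\alpha} \frac{\alpha!}{(\alpha-j)!\,(j-k)!\,k!} \; \frac{(-1)^{j-k}}{T^{j}} \; \mu_j$$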
Here is the code I wrote:
# R code:
alpha <- 50
T <- 1
x <- seq(0, T, by = 0.1)

# Original CDF equation
Ft <- (1 - log(x^3)) * (x^3)
plot(x, Ft, type = "l", ylab = "", xlab = "")

# Approximated CDF equation using moment-type reconstruction
k <- floor(alpha*y/T)
for(i in 1:length(k))
{
  for(j in k[i]:alpha)
  {
    F[x+1] <- (factorial(alpha)/(factorial(alpha-j)*factorial(j-k)*factorial(k)))*(((-1)^(j-k))/(T^j))*((9/(j+3))^2)
  }
}
plot(x[1:7], F, type = "l", ylab = "", xlab = "")
Any help will be appreciated, because the approximation and the graph obtained using my code are grossly different from the original curve.
It seems clear that your problem is here:
F[x+1] <- (factorial(alpha)/(factorial(alpha-j)*factorial(j-k)*factorial(k)))*(((-1)^(j-k))/(T^j))*((9/(j+3))^2)
You are trying to get something varying in x, yes? So how can you get that if the right-hand side of this equation has nothing varying in x, while the left-hand side has an assignment using non-integer indices?
# In the example you are trying to reproduce, they use an alpha of 30 if I understood correctly
# (I'm a paleontologist, not a mathematician, so this paper is way beyond my area of expertise :) )
alpha <- 30
tau <- 1   # tau is your T (changed to avoid confusion with TRUE)
x <- seq(0, tau, by = 0.001)

# This is your F (renamed for the same reason as above).
# It has to be created as a vector of 0s before the loop, since the whole idea
# of the loop is to proceed by incrementation.
f <- rep(0, length(x))

# You want a value of f for each element of x, so here is your first loop:
for(i in 1:length(x)){
  # Then you want the sum for all k going from 1 to alpha*x[i]/tau
  # (seq_len() rather than 1:floor(...) so the sum is empty when floor(...) is 0):
  for(k in seq_len(floor(alpha*x[i]/tau))){
    # And inside that sum, the sum for all j going from k to alpha:
    for(j in k:alpha){
      # This sum needs to be incremented (hence f[i] on both sides)
      f[i] <- f[i] + (factorial(alpha)/(factorial(alpha-j)*factorial(j-k)*factorial(k)))*(((-1)^(j-k))/(tau^j))*(9/(j+3)^2)
    }
  }
}
plot(x, f, type = "l", ylab = "", xlab = "")
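As a quick sanity check (a sketch; note that log(0) is -Inf, so x = 0 is skipped), you can overlay the original CDF from the question on the same grid:

Ft <- (1 - log(x^3)) * x^3    # original CDF from the question
lines(x[-1], Ft[-1], col = "red", lty = 2)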
