R: Function "lowess" with NAs

I am trying to use the function "lowess" from base R to smooth a vector. The problem is that the vector has only one NA value, but after smoothing with "lowess", four more NAs appear. I searched, and someone suggested using "lowess" from the package "gplots". I tried it but got the same results.
x1 <- c(NA, 19.93621, 17.64789, 17.78488, 17.11141, 18.4648, 19.62629, 17.5737, 17.48582, 18.76946, 19.57138, 19.62812, 2.982385, -0.1320513)
x2 <- lowess(x1)
x2
$x
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14
$y
[1] NA NA NA NA NA 18.988279 18.563642 18.407768 18.496699 17.922510 14.419999 10.861535 7.179754 3.145612
One way is to delete the NA in x1 so there is no NA after smoothing in x2$y.
x2 <- lowess(x1[-1])
x2
$x
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13
$y
[1] 18.7079309 18.4481273 18.2491632 18.0847709 18.0946245 18.1282192 18.1497617 17.9298278 14.6441882 10.9465210 7.2635244 3.5247529 -0.3021372
But I wonder why more NAs appear, and whether there is an easier way. Thank you!

lowess doesn't have any built-in NA handling.
Probably the easiest thing to do long term is to learn to use loess (whose options and settings are slightly different), which does deal with NA values: it has an na.action argument, like lm does, which by default should work much the way you want. In terms of syntax, however, loess works much more like lm than like lowess.
Alternatively, you can use na.omit directly on the argument to lowess; that way you don't need to specifically identify which observations are omitted.
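As a minimal sketch of these suggestions, using the x1 vector from the question: the first call drops the NA before smoothing, the second writes the smoothed values back to their original positions, and the third fits loess through its formula interface, assuming the default na.action (na.omit) is in effect. The names ok, aligned, and d are illustrative only.

```r
x1 <- c(NA, 19.93621, 17.64789, 17.78488, 17.11141, 18.4648, 19.62629,
        17.5737, 17.48582, 18.76946, 19.57138, 19.62812, 2.982385, -0.1320513)

# 1. Drop the NA before smoothing; the result is one element shorter than x1.
fit_omit <- lowess(na.omit(x1))

# 2. Smooth only the observed points, then write the result back so that
#    it stays aligned with the original positions (NA where x1 was NA).
ok <- !is.na(x1)
aligned <- rep(NA_real_, length(x1))
aligned[ok] <- lowess(which(ok), x1[ok])$y

# 3. loess uses a formula interface like lm; rows with NA are dropped
#    by the default na.action before fitting.
d <- data.frame(idx = seq_along(x1), val = x1)
fit_loess <- loess(val ~ idx, data = d)
fitted_loess <- predict(fit_loess)
```

The loess fit will not reproduce the lowess values exactly, because the two functions use different defaults (for example, span = 0.75 and degree = 2 in loess versus f = 2/3 and local linear fits in lowess).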

Related

How to prevent extrapolation using na.spline()

I'm having trouble with the na.spline() function in the zoo package. Although the documentation explicitly states that this is an interpolation function, the behaviour I'm getting includes extrapolation.
The following code reproduces the problem:
require(zoo)
vector <- c(NA,NA,NA,NA,NA,NA,5,NA,7,8,NA,NA)
na.spline(vector)
The output of this should be:
NA NA NA NA NA NA 5 6 7 8 NA NA
This would be interpolation of the internal NA, leaving the trailing NAs in place. But, instead I get:
-1 0 1 2 3 4 5 6 7 8 9 10
According to the documentation, this shouldn't happen. Is there some way to avoid extrapolation?
I recognise that in my example, I could use linear interpolation, but this is a MWE. Although I'm not necessarily wed to the na.spline() function, I need some way to interpolate using cubic splines.
This behavior appears to be coming from the stats::spline function, e.g.,
spline(seq_along(vector), vector, xout=seq_along(vector))$y
# [1] -1 0 1 2 3 4 5 6 7 8 9 10
Here is a workaround, using the fact that na.approx strictly interpolates.
replace(na.spline(vector), is.na(na.approx(vector, na.rm=FALSE)), NA)
# [1] NA NA NA NA NA NA 5 6 7 8 NA NA
Edit
As @G.Grothendieck suggests in the comments, another, no doubt more performant, way is:
na.spline(vector) + 0*na.approx(vector, na.rm = FALSE)
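The same masking idea can be sketched in base R, without zoo, by calling stats::spline directly and blanking every position outside the range of the observed points. This is only an illustration of the technique, not what na.spline does internally; the names v, idx, known, and inside are made up for the example.

```r
v <- c(NA, NA, NA, NA, NA, NA, 5, NA, 7, 8, NA, NA)
idx <- seq_along(v)
known <- !is.na(v)

# Cubic spline through the observed points, evaluated at every index.
# On its own this extrapolates beyond the first and last observation.
filled <- spline(idx[known], v[known], xout = idx)$y

# Keep only values that lie between the first and last observed index.
inside <- idx >= min(idx[known]) & idx <= max(idx[known])
filled[!inside] <- NA
filled
```

For this vector the leading and trailing positions stay NA and only the gap between 5 and 7 is filled.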

Why is my use of predict in r not working - should be very simple case

So I'm having some trouble getting R to actually predict given a very, very simple linear model. Using the following,
> x=1:10
> y=1:10*2
> lm(y~x)
we get the correct answer, namely, y is twice x. But when I do,
predict(lm(y~x),newdata=2.5)
I just get
1 2 3 4 5 6 7 8 9 10
1 2 3 4 5 6 7 8 9 10
What is going on?
The newdata parameter should be a data.frame with column names matching the names used as covariates. So the correct call is
predict(lm(x~y),newdata=data.frame(y=2.5))
or
predict(lm(y~x),newdata=data.frame(x=2.5))
depending on which way you wanted to do the regression.
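A short, self-contained sketch of the y ~ x case: the model is fitted once and stored, and newdata is a data.frame whose column is named x, matching the covariate in the formula. The names fit, new_obs, and pred are illustrative.

```r
x <- 1:10
y <- 2 * x

fit <- lm(y ~ x)

# newdata must be a data.frame with a column named like the covariate ("x").
new_obs <- data.frame(x = c(2.5, 11))
pred <- predict(fit, newdata = new_obs)
pred  # 5 and 22, up to floating-point error
```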

R Pooled DataFrame analysis

I'm trying to perform several analyses on subsets of data in a data frame in R, and I was wondering if there is a generic way of doing this.
Say, I have a dataframe like:
one two three four
[1,] 1 6 11 16
[2,] 2 7 12 17
[3,] 3 8 11 18
[4,] 4 9 11 19
[5,] 5 10 15 20
How could I apply some computation (e.g. cumulative counting) to the values in column "one", conditional on (grouped by) the value in column "three"?
That is, I want to do stuff to one column, based on the grouping in another column. I can do this with loops, but I feel there might be a standard way to do this all at once.
Thank you in advance!
ddply(data, .(coln), Stat) from the plyr package does the trick exactly.
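For readers without plyr installed, the same split-apply-combine pattern can be sketched in base R with ave(), which applies a function within each group and returns the results in the original row order. This is an alternative technique, not the ddply call itself; the data frame df and the new column names are invented for the illustration.

```r
df <- data.frame(
  one   = 1:5,
  two   = 6:10,
  three = c(11, 12, 11, 11, 15),
  four  = 16:20
)

# Running count of rows within each value of "three".
df$count_in_group <- ave(df$one, df$three, FUN = seq_along)

# Running sum of "one" within each value of "three".
df$cumsum_one <- ave(df$one, df$three, FUN = cumsum)

df
```

Rows 1, 3, and 4 share three == 11, so their running count is 1, 2, 3 and their running sum is 1, 4, 8.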

Populate a column with forecasts of panel data using data.table in R

I have panel data with "entity" and "year". I have a column "x" whose values I treat as a time series. I want to create a new column "xp" where, for each "entity" and each "year", I store the value forecast from the previous 5 years. If fewer than 5 previous values are available, xp=NA.
For the sake of generality, the forecast is the output of an R function built from a couple of predefined functions found in packages like "forecast". If it is easier with a specific function, let's use forecast(auto.arima(x.L5:x.L1),h=1).
For now, I use data.table in R because it is so much faster for all the other manipulations I make on my dataset.
However, what I want to do is not data.table 101 and I struggle with it.
I would so much appreciate a bit of your time to help me on that.
Thanks.
Here is an extract of what I would like to do:
entity year x xp
1 1980 21 NA
1 1981 23 NA
1 1982 32 NA
1 1983 36 NA
1 1984 38 NA
1 1985 45 42.3 =f((21,23,32,36,38))
1 1986 50 48.6 =f((23,32,36,38,45))
2 1991 2 NA
2 1992 4 NA
2 1993 6 NA
2 1994 8 NA
2 1995 10 NA
2 1996 12 12.4 =f((2,4,6,8,10))
2 1997 14 13.9 =f((4,6,8,10,12))
...
As suggested by Eddi, I found a way using rollapply:
DT <- data.table(mydata)
DT <- DT[order(entity,year)]
DT[,xp:=rollapply(.SD$x,5,timeseries,align="right",fill=NA),by="entity"]
with:
timeseries <- function(x){
  fit <- auto.arima(x)
  value <- as.data.frame(forecast(fit,h=1))[1,1]
  return(value)
}
For a sample of mydata, it works perfectly. However, when I use the whole dataset (150k lines), after some computing time, I get the following error message:
Error in seq.default(start.at,NROW(data),by = by) : wrong sign in 'by' argument
Where does it come from?
Can it come from the "5" parameter in rollapply combined with peculiarities of certain entities in the dataset (not enough data...)?
Thanks again for your time and help.
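One way to sidestep groups that are shorter than the window is to write the rolling step by hand and return NA whenever fewer than 5 earlier values exist. The sketch below does this in base R. Note the assumptions: one_step is a hypothetical stand-in (a linear-trend extrapolation) for forecast(auto.arima(x), h = 1) so that the example runs without the forecast package, roll_prev and panel are illustrative names, and rows are assumed to be sorted by year within each entity.

```r
# Hypothetical stand-in for forecast(auto.arima(x), h = 1):
# extrapolate a fitted linear trend one step ahead.
one_step <- function(x) {
  tt <- seq_along(x)
  fit <- lm(x ~ tt)
  unname(predict(fit, newdata = data.frame(tt = length(x) + 1)))
}

# For each position i, apply f to the `width` values BEFORE i.
# Positions with fewer than `width` earlier values stay NA, and a
# group shorter than the window simply yields all NA (no error).
roll_prev <- function(x, width = 5, f = one_step) {
  out <- rep(NA_real_, length(x))
  if (length(x) > width) {
    for (i in (width + 1):length(x)) {
      out[i] <- f(x[(i - width):(i - 1)])
    }
  }
  out
}

panel <- data.frame(
  entity = rep(1:3, times = c(7, 7, 3)),
  year   = c(1980:1986, 1991:1997, 2000:2002),
  x      = c(21, 23, 32, 36, 38, 45, 50,
             2, 4, 6, 8, 10, 12, 14,
             1, 2, 3)
)

# Apply the rolling forecast within each entity.
panel$xp <- ave(panel$x, panel$entity, FUN = roll_prev)
panel
```

With data.table, the same helper could be used as DT[, xp := roll_prev(x), by = entity], and the timeseries function from the question could be passed as f in place of one_step. Because the window ends at the previous row, the forecast for a year never uses that year's own value.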

looping over the name of the columns in R for creating new columns

I am trying to loop over the column names of an existing data frame and create new columns, each based on one of the old columns. Here is my sample data:
sample<-list(c(10,12,17,7,9,10),c(NA,NA,NA,10,12,13),c(1,1,1,0,0,0))
sample<-as.data.frame(sample)
colnames(sample)<-c("x1","x2","D")
>sample
x1 x2 D
10 NA 1
12 NA 1
17 NA 1
7 10 0
9 12 0
10 13 0
Now, I am trying to use a for loop to generate two variables, x1.imp and x2.imp, which take the values from the D=0 rows when D=1 and the values from the D=1 rows when D=0. (Here I don't actually need a for loop, but my original dataset has a large number of columns (variables), so I really need the loop.) I tried the following:
for (i in names(sample[,1:2])){
sample$i.imp<-with (sample, ifelse (D==1, i[D==0],i[D==1]))
i=i+1
return(sample)
}
Error in i + 1 : non-numeric argument to binary operator
However, the following works, but it doesn't name the new columns x1.imp and x2.imp:
for(i in sample[,1:2]){
impt.i<-with(sample,ifelse(D==1,i[D==0],i[D==1]))
i=i+1
print(as.data.frame(impt.i))
}
impt.i
1 7
2 9
3 10
4 10
5 12
6 17
impt.i
1 10
2 12
3 13
4 NA
5 NA
6 NA
Note that I already know a solution without a loop. I want a solution that uses a loop.
Expected output:
x1 x2 D x1.imp x2.imp
10 NA 1 7 10
12 NA 1 9 12
17 NA 1 10 13
7 10 0 10 NA
9 12 0 12 NA
10 13 0 17 NA
I would greatly appreciate your valuable input in this regard.
This is nuts, but since you are asking for it... Your code with minimum changes would be:
for (i in colnames(sample)[1:2]){
  sample[[paste0(i, '.impt')]] <- with(sample, ifelse(D==1, get(i)[D==0], get(i)[D==1]))
}
A few comments:
I replaced names(sample[,1:2]) with the more elegant colnames(sample)[1:2].
The $ operator is for interactive use. When programming, i.e. when the column name is held in a variable and has to be evaluated, you need to use [ or [[, hence I replaced sample$i.imp with sample[[paste0(i, '.impt')]].
Inside with, i[D==0] will not give you x1[D==0] when i is "x1", hence the need to dereference it using get.
You should not name your data.frame sample, as it is also the name of a pretty common function.
This should work,
test <- sample[,"D"] == 1
for (.name in names(sample)[1:2]){
  newvar <- paste(.name, "impt", sep=".")
  sample[[newvar]] <- ifelse(test, sample[!test, .name],
                             sample[test, .name])
}
sample
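For reference, here is a self-contained version of the loop that can be pasted into a fresh session. It follows the same approach as the answers above, but calls the data frame df (rather than sample, to avoid masking the base function) and uses the illustrative names treated and nm.

```r
df <- data.frame(
  x1 = c(10, 12, 17, 7, 9, 10),
  x2 = c(NA, NA, NA, 10, 12, 13),
  D  = c(1, 1, 1, 0, 0, 0)
)

treated <- df$D == 1

for (nm in c("x1", "x2")) {
  # [[ ]] lets the column name come from a variable, for reading and writing.
  # ifelse() recycles the two shorter vectors to the length of `treated`.
  df[[paste0(nm, ".impt")]] <- ifelse(treated,
                                      df[[nm]][!treated],
                                      df[[nm]][treated])
}

df
```

Because ifelse() recycles its yes and no arguments, this pairing by position assumes the two groups have the same size and are laid out in contiguous blocks, as in the sample data; with unequal or interleaved groups the values would not line up the same way.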
