How to prevent extrapolation using na.spline() - r

I'm having trouble with the na.spline() function in the zoo package. Although the documentation explicitly states that this is an interpolation function, the behaviour I'm getting includes extrapolation.
The following code reproduces the problem:
require(zoo)
vector <- c(NA,NA,NA,NA,NA,NA,5,NA,7,8,NA,NA)
na.spline(vector)
The output of this should be:
NA NA NA NA NA NA 5 6 7 8 NA NA
This would be interpolation of the internal NA, leaving the trailing NAs in place. But, instead I get:
-1 0 1 2 3 4 5 6 7 8 9 10
According to the documentation, this shouldn't happen. Is there some way to avoid extrapolation?
I recognise that in my example, I could use linear interpolation, but this is a MWE. Although I'm not necessarily wed to the na.spline() function, I need some way to interpolate using cubic splines.

This behavior appears to be coming from the stats::spline function, e.g.,
spline(seq_along(vector), vector, xout=seq_along(vector))$y
# [1] -1 0 1 2 3 4 5 6 7 8 9 10
Here is a work around, using the fact that na.approx strictly interpolates.
replace(na.spline(vector), is.na(na.approx(vector, na.rm=FALSE)), NA)
# [1] NA NA NA NA NA NA 5 6 7 8 NA NA
Edit
As #G.Grothendieck suggests in the comments below, another, no doubt more performant, way is:
na.spline(vector) + 0*na.approx(vector, na.rm = FALSE)

Related

Extract string between prefix and suffix

I have these columns:
text.NANA text.22 text.32
1 Female RNDM_MXN95.tif No NA
12 Male RNDM_QOS38.tif No NA
13 Female RNDM_WQW90.tif No NA
14 Male RNDM_BKD94.tif No NA
15 Male RNDM_LGD67.tif No NA
16 Female RNDM_AFP45.tif No NA
I want to create a column that only has the barcode that starts with RNDM_ and ends with .tif, but not including .tif. The tricky part is to get rid of the gender information that is also in the same column. There are a random amount of spaces between the gender information and the RNDM_:
text.NANA text.22 text.32 BARCODE
1 Female RNDM_MXN95.tif No NA RNDM_MXN95
12 Male RNDM_QOS38.tif No NA RNDM_QOS38
13 Female RNDM_WQW90.tif No NA RNDM_WQW90
14 Male RNDM_BKD94.tif No NA RNDM_BKD94
15 Male RNDM_LGD67.tif No NA RNDM_LGD67
16 Female RNDM_AFP45.tif No NA RNDM_AFP45
I made a very poor attempt with this, but it didn't work:
dfrm$BARCODE <- regexpr("RNDM_", dfrm$text.NANA)
# [1] 8 6 9 7 7 8 9 9 8 8 9 9 6 6 7 8 9 8
# attr(,"match.length")
# [1] 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5
# attr(,"useBytes")
# [1] TRUE
Please help. Thanks!
So you just want to remove the file extension? Use file_path_sans_ext:
dfrm$BARCODE = file_path_sans_ext(dfrm$text.NANA)
If there’s more stuff in front, you can use the following regular expression to extract just the suffix:
dfrm$BARCODE = stringr::str_match(dfrm$text.NANA, '(RNDM_.*)\\.tif')[, 2]
Note that I’m using the {stringr} package here because the base R functions for extracting regex matches are terrible. Nobody uses them.
I strongly recommend against using strsplit here because it’s underspecified: from reading the code it’s absolutely not clear what the purpose of that code is. Write code that is self-explanatory, not code that requires explanation in a comment.
You can use sapply() and strsplit to do it easy, let me show you:
sapply(strsplit(dfrm$text.NANA, "_"),"[", 1)
That should work.
Edit:
sapply(strsplit(x, "[ .]+"),"[", 2)

R - enter basic formula

I am new to R and struggling to understand its quirks. I'm trying to do something which should be really simple, but is turning out to be apparently very complicated.
I am used to Excel, SQL and Minitab, where you can enter a value in one column which includes references to other columns and parameters. However, R doesn't seem to be allowing me to do this.
I have a table with (currently) four columns:
Date Pallets Lt Tt
1 28/12/2011 491 NA NA
2 29/12/2011 385 NA 0.787890411
3 30/12/2011 662 NA NA
4 31/12/2011 28 NA NA
5 01/01/2012 46 NA NA
6 02/01/2012 403 NA NA
7 03/01/2012 282 NA NA
8 04/01/2012 315 NA NA
9 05/01/2012 327 NA NA
10 06/01/2012 458 NA NA
and have a parameter "beta", with a value which I have assigned as 0.0002.
All I want to do is assign a formula to rows 3:10 which is:
beta*(Pallets t - Pallets t-1)+(1-beta)*Tt t-1.
I thought that the appropriate code might be:
Table[3:10,4]<-beta*(Table[3:10,"Pallets"]-Table[2:9,"Pallets"])+(1-beta)*Table[2:9,"Tt"]
However, this doesn't work. The first time I enter this formula, it generates:
Date Pallets Lt Tt
1 28/12/2011 491 NA NA
2 29/12/2011 385 NA 0.7878904
3 30/12/2011 662 NA 0.8431328
4 31/12/2011 28 NA NA
5 01/01/2012 46 NA NA
6 02/01/2012 403 NA NA
7 03/01/2012 282 NA NA
8 04/01/2012 315 NA NA
9 05/01/2012 327 NA NA
10 06/01/2012 458 NA NA
So it's generated the correct answer for the second item in the series, but not for any of the subsequent values.
It seems as though R doesn't automatically update each row, and the relationship to each other row, when you enter a formula, as Excel does. Having said that, Excel actually would require me to enter the formula in cell [4,Tt], and then drag this down to all of the other cells. Perhaps R is the same, and there is an equivalent to "dragging down" which I need to do?
Finally, I also noticed that when I change the value of the beta parameter, through, e.g. beta<-0.5, and then print the Table values again, they are unchanged - so the table hasn't updated even though I have changed the value of the parameter.
Appreciate that these are basic questions, but I am very new to R.
In R, the computations are not made "cell by cell", but are vectorised - in your example, R takes the vectors Table[3:10,"Pallets"], Table[2:9,"Pallets"] and Table[2:9,"Tt"] as they are at the moment, computes the resulting vector, and finally assigns it to Table[3:10,4].
If you want to make some computations "cell by cell", you have to use the for loop:
beta <- 0.5
df <- data.frame(v1 = 1:12, v2 = 0)
for (i in 3:10) {
df[i, "v2"] <- beta * (df[i, "v1"] - df[i-1, "v1"]) + (1 - beta) * df[i-1, "v2"]
}
df
v1 v2
1 1 0.0000000
2 2 0.0000000
3 3 0.5000000
4 4 0.7500000
5 5 0.8750000
6 6 0.9375000
7 7 0.9687500
8 8 0.9843750
9 9 0.9921875
10 10 0.9960938
11 11 0.0000000
12 12 0.0000000
As it comes to your second question, R will never update any values on its own (imagine having set manual calculation in Excel). So you need to repeat the computations after changing beta.
Although it's generally a bad design, but you can iterate over rows in a loop:
Table$temp <- c(0,diff(Table$Palletes,1))
prevTt = 0
for (i in 1:10)
{
Table$Tt[i] = Table$temp * beta + (1-beta)*prevTt
prevTt = Table$Tt[i]
}
Table$temp <- NULL

R: Function "lowess" with NAs

I am trying to use function "lowess" in base package to smooth a vector. The question is that the vector only has one NA value, but after smoothing by "lowess", 4 more NAs appear. I searched and some one suggest using "lowess" in package "gplots". I tried but got same results.
x1 <- c(NA, 19.93621, 17.64789, 17.78488, 17.11141, 18.4648, 19.62629, 17.5737, 17.48582, 18.76946, 19.57138, 19.62812, 2.982385, -0.1320513)
x2 <- lowess(x1)
x2
$x
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14
$y
[1] NA NA NA NA NA 18.988279 18.563642 18.407768 18.496699 17.922510 14.419999 10.861535 7.179754 3.145612
One way is to delete the NA in x1 so there is no NA after smoothing in x2$y.
x2 <- lowess(x1[-1])
x2
$x
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13
$y
[1] 18.7079309 18.4481273 18.2491632 18.0847709 18.0946245 18.1282192 18.1497617 17.9298278 14.6441882 10.9465210 7.2635244 3.5247529 -0.3021372
But I just wonder why more NAs appear and is there an easier way? Thank you!
lowess doesn't have any built-in NA handling
Probably the easiest thing to do long term is learn to use loess (whose options and settings are slightly different), which does deal with NA values (it has an na.action argument, like lm does, which by default should work much the way you want - however, in terms of syntax, loess works much more like lm than it does like lowess).
you can use na.omit directly on the arguments to lowess; that way you don't need to specifically identify which observations are omitted.

Populate a column with forecasts of panel data using data.table in R

I have a panel data with "entity" and "year". I have a column "x" with values that i consider like time series. I want to create a new column "xp" where for each "entity" I give, for each "year", the value obtained from the forecast of the previous 5 years. If there are less than 5 previous values available, xp=NA.
For the sake of generality, the forecast is the output of a function built in R from a couple of predefinite functions found in some packages like "forecast". If it is easier with a specific function, let's use forecast(auto.arima(x.L5:x.L1),h=1).
For now, I use data.table in R because it is so much faster for all the other manipulations I make on my dataset.
However, what I want to do is not data.table 101 and I struggle with it.
I would so much appreciate a bit of your time to help me on that.
Thanks.
Here is an extract of what i would like to do:
entity year x xp
1 1980 21 NA
1 1981 23 NA
1 1982 32 NA
1 1983 36 NA
1 1984 38 NA
1 1985 45 42.3 =f((21,23,32,36,38))
1 1986 50 48.6 =f((23,32,36,38,45))
2 1991 2 NA
2 1992 4 NA
2 1993 6 NA
2 1994 8 NA
2 1995 10 NA
2 1996 12 12.4 =f((2,4,6,8,10))
2 1997 14 13.9 =f((4,6,8,10,12))
...
As suggested by Eddi, I found a way using rollapply:
DT <- data.table(mydata)
DT <- DT[order(entity,year)]
DT[,xp:=rollapply(.SD$x,5,timeseries,align="right",fill=NA,by="entity"]
with:
timeseries <- function(x){
fit <- auto.arima(x)
value <- as.data.frame(forecast(fit,h=1))[1,1]
return(value)
}
For a sample of mydata, it works perfectly. However, when I use the whole dataset (150k lines), after some computing time, i have the following error message:
Error in seq.default(start.at,NROW(data),by = by) : wrong sign in 'by' argument
Where does it come from?
Can it come from the "5" parameter in rollapply and from some specifities of certain entities in the dataset (not enough data...)?
Thanks again for your time and help.

looping over the name of the columns in R for creating new columns

I am trying to use the loop over the column names of the existing dataframe and then create new columns based on one of the old column.Here is my sample data:
sample<-list(c(10,12,17,7,9,10),c(NA,NA,NA,10,12,13),c(1,1,1,0,0,0))
sample<-as.data.frame(sample)
colnames(sample)<-c("x1","x2","D")
>sample
x1 x2 D
10 NA 1
12 NA 1
17 NA 1
7 10 0
9 20 0
10 13 0
Now, I am trying to use for loop to generate two variables x1.imp and x2.imp that have values related to D=0 when D=1 and values related to D=1 when D=0(Here I actually don't need for loop but for my original dataset with large cols (variables), I really need the loop) based on the following condition:
for (i in names(sample[,1:2])){
sample$i.imp<-with (sample, ifelse (D==1, i[D==0],i[D==1]))
i=i+1
return(sample)
}
Error in i + 1 : non-numeric argument to binary operator
However, the following works, but it doesn't give the names of new cols as imp.x2 and imp.x3
for(i in sample[,1:2]){
impt.i<-with(sample,ifelse(D==1,i[D==0],i[D==1]))
i=i+1
print(as.data.frame(impt.i))
}
impt.i
1 7
2 9
3 10
4 10
5 12
6 17
impt.i
1 10
2 12
3 13
4 NA
5 NA
6 NA
Note that I already know the solution without loop [here]. I want with loop.
Expected output:
x1 x2 D x1.impt x2.imp
10 NA 1 7 10
12 NA 1 9 20
17 NA 1 10 13
7 10 0 10 NA
9 20 0 12 NA
10 13 0 17 NA
I would greatly appreciate your valuable input in this regard.
This is nuts, but since you are asking for it... Your code with minimum changes would be:
for (i in colnames(sample)[1:2]){
sample[[paste0(i, '.impt')]] <- with(sample, ifelse(D==1, get(i)[D==0],get(i)[D==1]))
}
A few comments:
replaced names(sample[,1:2]) with the more elegant colnames(sample)[1:2]
the $ is for interactive usage. Instead, when programming, i.e. when the column name is to be interpreted, you need to use [ or [[, hence I replaced sample$i.imp with sample[[paste0(i, '.impt')]]
inside with, i[D==0] will not give you x1[D==0] when i is "x1", hence the need to dereference it using get.
you should not name your data.frame sample as it is also the name of a pretty common function
This should work,
test <- sample[,"D"] == 1
for (.name in names(sample)[1:2]){
newvar <- paste(.name, "impt", sep=".")
sample[[newvar]] <- ifelse(test, sample[!test, .name],
sample[test, .name])
}
sample

Resources