interpolation for limited number of NA - r

I have a data frame df with a column of values (meter readings). Some values are sporadically missing (NA).
df excerpt:
row time meter_reading
1 03:10:00 26400
2 03:15:00 NA
3 03:20:00 27200
4 03:25:00 28000
5 03:30:00 NA
6 03:35:00 NA
7 03:40:00 30000
What I'm trying to do:
If there is only a single NA between two valid values, I want to interpolate (e.g. na.interpolation for row 2).
But if there are two or more consecutive NAs, I don't want R to interpolate; the values should stay NA (e.g. rows 5 and 6).
What I tried so far is loop (for...) with an if-condition. My approach:
library("imputeTS")
for(i in 1:(nrow(df))) {
if(!is.na(df$meter_reading[i]) & is.na(df$meter_reading[i-1]) & !is.na(df$meter_reading[i-2])) {
na_interpolation(df$meter_reading)
}
}
Giving me :
Error in if (!is.na(df$meter_reading[i]) & is.na(df$meter_reading[i - :
argument is of length zero
Any ideas how to do it? Am I completely wrong here?
Thanks!

I don't know what your na.interpolation does, but taking the mean of the previous and next rows, for example, you could do that with dplyr:
library(dplyr)
df %>% mutate(x = ifelse(is.na(meter_reading),
                         (lag(meter_reading) + lead(meter_reading)) / 2,
                         meter_reading))
# row time meter_reading x
#1 1 03:10:00 26400 26400
#2 2 03:15:00 NA 26800
#3 3 03:20:00 27200 27200
#4 4 03:25:00 28000 28000
#5 5 03:30:00 NA NA
#6 6 03:35:00 NA NA
#7 7 03:40:00 30000 30000

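A quick look shows that your counter i starts at 1 and then you try to get the elements at index i-1 and i-2. For i = 1 that means df$meter_reading[0], which is a zero-length vector, so the if condition has length zero and R stops with the error you see.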
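The zero-length condition can be reproduced in isolation. The sketch below uses the readings from the question as a plain vector x (a name chosen here for illustration) and starts the loop at the second element so that every index it touches exists. It fills a gap with the mean of its neighbours rather than calling imputeTS, and it assumes the vector has at least three elements.

```r
x <- c(26400, NA, 27200, 28000, NA, NA, 30000)

# Index 0 selects nothing, so a condition built from it has length zero
length(is.na(x[0]))  # 0

filled <- x
# Start at 2 and stop at n - 1 so that i - 1 and i + 1 are always valid
for (i in 2:(length(x) - 1)) {
  # && short-circuits and works on single TRUE/FALSE values
  if (is.na(x[i]) && !is.na(x[i - 1]) && !is.na(x[i + 1])) {
    filled[i] <- (x[i - 1] + x[i + 1]) / 2
  }
}
filled
```

Row 2 is filled with 26800, while rows 5 and 6 stay NA because each of them has an NA neighbour.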

Just an addition here: the current version of the imputeTS package has a maxgap option for each imputation algorithm, which solves this problem easily. It probably wasn't available yet when you asked this question.
Your code would look like this:
library("imputeTS")
na_interpolation(df, maxgap = 1)
This means gaps of 1 NA get imputed, while longer gaps of consecutive NAs remain NA.
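For readers who want to see the rule spelled out, here is a minimal base R sketch of the same idea. The helper interpolate_short_gaps is a made-up name for this illustration, not an imputeTS function; it assumes linear interpolation and leaves leading or trailing NAs untouched.

```r
# Hypothetical helper: linearly interpolate only NA runs of length <= maxgap
interpolate_short_gaps <- function(x, maxgap = 1) {
  runs <- rle(is.na(x))
  # Flag every element that belongs to an NA run longer than maxgap
  too_long <- rep(runs$values & runs$lengths > maxgap, runs$lengths)
  # Interpolate all interior NAs linearly, then restore the long gaps
  filled <- approx(seq_along(x), x, xout = seq_along(x))$y
  filled[too_long] <- NA
  filled
}

meter_reading <- c(26400, NA, 27200, 28000, NA, NA, 30000)
result <- interpolate_short_gaps(meter_reading, maxgap = 1)
result
```

With the readings from the question, the single NA becomes 26800 and the run of two NAs is left as it is.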

Compute the variance of a moving window in a dataframe

Hey, I want to compute the variance of a column. My data frame is sorted by date (as.Date() format). Here you can see a snippet of it:
Date USA ARG BRA CHL COL MEX PER
2012-04-01 1 0.2271531 0.4970299 0.001956865 0.0005341452 0.07341428 NA
2012-05-01 1 0.2218906 0.4675895 0.001911405 0.0005273186 0.07026524 NA
2012-06-01 1 0.2054076 0.4531661 0.001891352 0.0005292575 0.06897811 NA
2012-07-01 1 0.2033470 0.4596730 0.001950686 0.0005312600 0.07269619 NA
2012-08-01 1 0.1993882 0.4596039 0.001980537 0.0005271514 0.07268987 NA
2012-09-01 1 0.1967152 0.4593390 0.002011212 0.0005305549 0.07418838 NA
2012-10-01 1 0.1972730 0.4597584 0.002002203 0.0005284380 0.07428555 NA
2012-11-01 1 0.1937618 0.4519187 0.001979805 0.0005238670 0.07329656 NA
2012-12-01 1 0.1854037 0.4500448 0.001993309 0.0005323795 0.07453949 NA
2013-01-01 1 0.1866007 0.4607501 0.002013112 0.0005412329 0.07551040 NA
2013-02-01 1 0.1855950 0.4712956 0.002011067 0.0005359562 0.07554661 NA
The data frame ranges from January 2004 to December 2018. But I do not want to compute the variance of the whole columns.
I want to compute the variance over one year (i.e. 12 values), with the window moving month by month.
I do not really know how to start. I can imagine using the zoo package and rollapply. But the problem there is (I think) that R uses the values around each point and not the ones before it?
I also found this question: R: create a data frame out of a rolling window, so my idea was to get rid of the date column. It is easy to build the matrix, but now I do not understand how to apply the variance function to my data...
Is there a smart way to compute it all in one and also using the information of the date? If not, I also appreciate any other solution from you!
We can use rollapplyr to perform the rolling computations; the r suffix means the window is right-aligned, so each result uses the current value and the ones before it. Since there are only 11 rows in the data in the question we can't compute 12-month variances, but we can illustrate the approach with 3-month variances instead. Remove fill = NA if you want to omit the NA rows, or replace it with partial = TRUE if you want variances computed from fewer values than the window width near the beginning. If you want a data frame result, use fortify.zoo(zv).
library(zoo)
z <- read.zoo(DF)
zv <- rollapplyr(z, 3, var, fill = NA)
zv
giving this zoo object:
USA ARG BRA CHL COL MEX PER
2012-04-01 NA NA NA NA NA NA NA
2012-05-01 NA NA NA NA NA NA NA
2012-06-01 0 1.287083e-04 4.998008e-04 1.126781e-09 1.237524e-11 5.208793e-06 NA
2012-07-01 0 1.033001e-04 5.217420e-05 9.109406e-10 3.883996e-12 3.565057e-06 NA
2012-08-01 0 9.358558e-06 1.396497e-05 2.060928e-09 4.221043e-12 4.600220e-06 NA
2012-09-01 0 1.113297e-05 3.108380e-08 9.159058e-10 4.826929e-12 7.453672e-07 NA
2012-10-01 0 1.988357e-06 4.498977e-08 2.485889e-10 2.953403e-12 8.001948e-07 NA
2012-11-01 0 3.560373e-06 1.944961e-05 2.615387e-10 1.168389e-11 2.971477e-07 NA
2012-12-01 0 3.717777e-05 2.655440e-05 1.271886e-10 1.814869e-11 4.312436e-07 NA
2013-01-01 0 2.042867e-05 3.268476e-05 2.806455e-10 7.540331e-11 1.231438e-06 NA
2013-02-01 0 4.134729e-07 1.129013e-04 1.186146e-10 1.983651e-11 3.263780e-07 NA
We can plot the log of the variances like this:
library(ggplot2)
autoplot(log(zv), facet = NULL) + geom_point() + ylab("log(var(.))")
Note
We assume that the starting point is the data frame generated reproducibly below:
Lines <- "Date USA ARG BRA CHL COL MEX PER
2012-04-01 1 0.2271531 0.4970299 0.001956865 0.0005341452 0.07341428 NA
2012-05-01 1 0.2218906 0.4675895 0.001911405 0.0005273186 0.07026524 NA
2012-06-01 1 0.2054076 0.4531661 0.001891352 0.0005292575 0.06897811 NA
2012-07-01 1 0.2033470 0.4596730 0.001950686 0.0005312600 0.07269619 NA
2012-08-01 1 0.1993882 0.4596039 0.001980537 0.0005271514 0.07268987 NA
2012-09-01 1 0.1967152 0.4593390 0.002011212 0.0005305549 0.07418838 NA
2012-10-01 1 0.1972730 0.4597584 0.002002203 0.0005284380 0.07428555 NA
2012-11-01 1 0.1937618 0.4519187 0.001979805 0.0005238670 0.07329656 NA
2012-12-01 1 0.1854037 0.4500448 0.001993309 0.0005323795 0.07453949 NA
2013-01-01 1 0.1866007 0.4607501 0.002013112 0.0005412329 0.07551040 NA
2013-02-01 1 0.1855950 0.4712956 0.002011067 0.0005359562 0.07554661 NA"
DF <- read.table(text = Lines, header = TRUE)
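To make the right alignment concrete, here is a minimal base R sketch of a trailing rolling variance. The helper name roll_var_right is made up for this illustration and is not part of zoo; it is applied to the first four ARG values from the data above with a window of 3.

```r
# Hypothetical helper: variance over the current value and the width - 1 before it
roll_var_right <- function(x, width) {
  n <- length(x)
  out <- rep(NA_real_, n)
  if (n >= width) {
    for (i in width:n) {
      out[i] <- var(x[(i - width + 1):i])
    }
  }
  out
}

arg <- c(0.2271531, 0.2218906, 0.2054076, 0.2033470)
rv <- roll_var_right(arg, 3)
rv
```

The first two results are NA because no full window is available yet; the third and fourth should agree with the ARG column of the zoo output shown earlier.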

R: for a new variable in row [i] compare row-comprehensive in column x and y

I am looking for a code like a for-loop, which looks at all rows of certain columns to create a new variable for row [i].
I have a data frame with essentially three columns: interval start, interval end and measuring date. In all three columns the values have the format YearMonthDayHourMinute. The measuring date advances continuously in 10-minute steps. The intervals are just short periods, which leaves a lot of NAs in the rows whose measuring date does not coincide with an interval.
The data frame looks like this:
interval_start interval_end measuring_date
1 NA NA 201805021210
2 201805021220 201805021250 201805021220
3 NA NA 201805021230
4 NA NA 201805021240
5 NA NA 201805021250
6 NA NA 201805021300
Now I want R to create a new column that gives a "Yes" where the measuring date lies within the interval, and a "No" where it doesn't.
Like this:
interval_start interval_end measuring_date within_interval
1 NA NA 201805021210 No
2 201805021220 201805021250 201805021220 Yes
3 NA NA 201805021230 Yes
4 NA NA 201805021240 Yes
5 NA NA 201805021250 Yes
6 NA NA 201805021300 No
So I want R to take the measuring_date of row 1 and compare it to the interval_start and interval_end of row 1,2,3,4,5 and 6. Same for measuring_date of row 2 and so on.
The problem I have now is that I've tried for-loops with if/else and nested for-loops (see below), but R does not seem to take the measuring_date of row 1 and compare it with all rows of interval_start and interval_end. It compares only within the same row. So all I can get is:
interval_start interval_end measuring_date within_interval
1 NA NA 201805021210 No
2 201805021220 201805021250 201805021220 Yes
3 NA NA 201805021230 No
4 NA NA 201805021240 No
5 NA NA 201805021250 No
6 NA NA 201805021300 No
Does anyone know a solution to this problem? Maybe there are solutions outside of a for-loop, that I didn't come across.
I've been searching the whole internet but didn't find any solution which left me quite frustrated. Even my supervisor is helpless..
I hope my question is clear enough, sorry, I am using stackoverflow for the first time..
for (i in 1:nrow(masterX)){
masterX$Within_Searching_Period[i] <- NA
for (j in 1:nrow(masterX)){
if (masterX$MESS_DATUM[i] >= masterX$time_date_start_min[j] &
masterX$MESS_DATUM[i] <= masterX$time_date_end_min[j]) {
masterX$Within_Searching_Period[i] <- "YES"
} else {masterX$Within_Searching_Period[i] <- "NO"
}
}
}
Using data.table package, you can use non-equi join to find if measuring_date is within any interval:
DT[, within_interval :=
DT[DT, .N > 0 ,on=.(interval_start <= measuring_date, interval_end >= measuring_date), by=.EACHI]$V1
]
output:
interval_start interval_end measuring_date within_interval
1: <NA> <NA> 2018-05-02 12:10:00 FALSE
2: 2018-05-02 12:20:00 2018-05-02 12:50:00 2018-05-02 12:20:00 TRUE
3: <NA> <NA> 2018-05-02 12:30:00 TRUE
4: <NA> <NA> 2018-05-02 12:40:00 TRUE
5: <NA> <NA> 2018-05-02 12:50:00 TRUE
6: <NA> <NA> 2018-05-02 13:00:00 FALSE
data:
library(data.table)
DT <- fread("interval_start,interval_end,measuring_date
NA,NA,201805021210
201805021220,201805021250,201805021220
NA,NA,201805021230
NA,NA,201805021240
NA,NA,201805021250
NA,NA,201805021300", colClasses="character")
DT[, (names(DT)) := lapply(.SD, as.POSIXct, format="%Y%m%d%H%M")]
Instead of focusing on the loop, it's better practice to try to write vectorized code. I've used some lubridate and tidyverse syntax. If you deal with dates a lot, it's worth looking at lubridate's documentation.
library(tidyverse)
library(lubridate)
data %>%
mutate_at(vars(time_date_start_min, time_date_end_min, MESS_DATUM), ymd_hm) %>%
mutate(within_interval =
ifelse(MESS_DATUM %within% interval(time_date_start_min, time_date_end_min) %in% TRUE, "Yes", "No"))
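If the goal is to test each measuring date against every interval rather than only the one in its own row, a base R sketch could look like the following. It rebuilds the example data from the question as plain numbers (which keep their chronological order in the YearMonthDayHourMinute format); the names df, iv and inside are chosen here for illustration.

```r
df <- data.frame(
  interval_start = c(NA, 201805021220, NA, NA, NA, NA),
  interval_end   = c(NA, 201805021250, NA, NA, NA, NA),
  measuring_date = c(201805021210, 201805021220, 201805021230,
                     201805021240, 201805021250, 201805021300)
)

# Keep only the rows that define a complete interval
iv <- df[!is.na(df$interval_start) & !is.na(df$interval_end), ]

# For each measuring date, test whether it falls inside any interval
inside <- vapply(df$measuring_date, function(d) {
  any(d >= iv$interval_start & d <= iv$interval_end)
}, logical(1))

df$within_interval <- ifelse(inside, "Yes", "No")
df
```

Dropping the NA rows first keeps the comparison free of NA results, so any() returns a plain TRUE or FALSE for every measuring date.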

Create multiple lagged variables using a zoo object

I need to create 'n' number of variables with lags of the original variable from 1 to 'n' on the fly. Something like so :-
OrigVar
DatePeriod, value
2/01/2018,6
3/01/2018,4
4/01/2018,0
5/01/2018,2
6/01/2018,4
7/01/2018,1
8/01/2018,6
9/01/2018,2
10/01/2018,7
Lagged 1 variable
2/01/2018,NA
3/01/2018,6
4/01/2018,4
5/01/2018,0
6/01/2018,2
7/01/2018,4
8/01/2018,1
9/01/2018,6
10/01/2018,2
11/01/2018,7
Lagged 2 variable
2/01/2018,NA
3/01/2018,NA
4/01/2018,6
5/01/2018,4
6/01/2018,0
7/01/2018,2
8/01/2018,4
9/01/2018,1
10/01/2018,6
11/01/2018,2
12/01/2018,7
Lagged 3 variable
2/01/2018,NA
3/01/2018,NA
4/01/2018,NA
5/01/2018,6
6/01/2018,4
7/01/2018,0
8/01/2018,2
9/01/2018,4
10/01/2018,1
11/01/2018,6
12/01/2018,2
13/01/2018,7
and so on
I tried using the shift function and various other functions. With most of the ones that worked for me, the lagged variables ended at the last date of the original variable. In other words, the length of the lagged variable is the same as that of the original variable.
What I am looking for is for the new lagged variable to be shifted down by the k-th lag and for the data series to be extended by k elements, including the index.
The reason I need this is to be able to compute the value of the dependent variable beyond the in-sample period, using the regression coefficients and the corresponding lagged variable value.
y1 <- Lag(ciresL1_usage_1601_1612, shift = 1)
head(y1)
2016-01-02 2016-01-03 2016-01-04 2016-01-05 2016-01-06 2016-01-07
NA -5171.051 -6079.887 -3687.227 -3229.453 -2110.368
y2 <- Lag(ciresL1_usage_1601_1612, shift = 2)
head(y2)
2016-01-02 2016-01-03 2016-01-04 2016-01-05 2016-01-06 2016-01-07
NA NA -5171.051 -6079.887 -3687.227 -3229.453
tail(y2)
2016-12-26 2016-12-27 2016-12-28 2016-12-29 2016-12-30 2016-12-31
-2316.039 -2671.185 -4100.793 -2043.020 -1147.798 1111.674
tail(ciresL1_usage_1601_1612)
2016-12-26 2016-12-27 2016-12-28 2016-12-29 2016-12-30 2016-12-31
-4100.793 -2043.020 -1147.798 1111.674 3498.729 2438.739
Is there a way to do it relatively easily? I know I can do it by looping, adding 'k' rows to a new vector and reloading the data into this new vector with the values shifted appropriately, but I don't want to use that method unless I have to. I am quietly confident that there has to be a better way to do it than this!
By the way, the object is a zoo object with daily dates as the index.
Best regards
Deepak
Convert the input zoo object to zooreg and then use lag.zooreg like this:
library(zoo)
# test input
z <- zoo(1:10, as.Date("2008-01-01") + 0:9)
zr <- as.zooreg(z)
lag(zr, -(0:3))
giving:
lag0 lag-1 lag-2 lag-3
2008-01-01 1 NA NA NA
2008-01-02 2 1 NA NA
2008-01-03 3 2 1 NA
2008-01-04 4 3 2 1
2008-01-05 5 4 3 2
2008-01-06 6 5 4 3
2008-01-07 7 6 5 4
2008-01-08 8 7 6 5
2008-01-09 9 8 7 6
2008-01-10 10 9 8 7
2008-01-11 NA 10 9 8
2008-01-12 NA NA 10 9
2008-01-13 NA NA NA 10
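The padded layout can also be sketched in base R, which may help to see what the extended index looks like. This is only an illustration on the same test input; the names x, n_lags, dates and lagged are made up here, and the result is a plain matrix rather than a zoo object.

```r
x <- 1:10
n_lags <- 3
# Extend the daily index by n_lags days beyond the original series
dates <- as.Date("2008-01-01") + 0:(length(x) + n_lags - 1)

# Column k is x shifted down by k rows, padded with NA above and below
lagged <- sapply(0:n_lags, function(k) {
  c(rep(NA, k), x, rep(NA, n_lags - k))
})
dimnames(lagged) <- list(as.character(dates), paste0("lag", 0:n_lags))
lagged
```

Each column has length(x) + n_lags rows, so the lag-3 column ends three days after the original series does.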

How to analyze data from the Internet with R to find discrepancies?

I am new to "R"; I have this html table here
I need to find out if there is a gap in the "time (DT)" column of more than one minute. I need to analyze the data and create a new table with just two columns, the first one with the time and the second one with the number of the gap.
Like this: output
So far I am able to download the data!!!
require(XML)
u='http://cronos.est.pr/test.html'
tables = readHTMLTable(u)
datatest=tables[[1]]
View(datatest)
What's next???
Convert the first column to "POSIXct" class, take differences and replace differences of one minute or less with NA. No packages are used.
with(datatest, {
  Time <- as.POSIXct(`Time (DT)`)
  # differences in minutes; "mins" is the unit name used by difftime
  Diff <- c(0, c(diff(Time, units = "mins")))
  data.frame(Time, Diff = ifelse(Diff <= 1, NA, Diff))
})
giving:
Time Diff
1 2010-01-01 09:10:00 NA
2 2010-01-01 09:11:00 NA
3 2010-01-01 09:12:00 NA
4 2010-01-01 09:13:00 NA
5 2010-01-01 09:17:00 4
6 2010-01-01 09:18:00 NA
7 2010-01-01 09:19:00 NA
8 2010-01-01 09:20:00 NA
9 2010-01-01 09:22:00 2
10 2010-01-01 09:24:00 2
11 2010-01-01 09:25:00 NA
12 2010-01-01 09:26:00 NA
13 2010-01-01 09:38:00 12
14 2010-01-01 09:39:00 NA
15 2010-01-01 09:40:00 NA
Use the lubridate package.
library(lubridate)
minutes = minute(datatest[,"Time (DT)"])
gaps = c(0, diff(minutes))
output = data.frame("date_time" = datatest[,"Time (DT)"], gaps = gaps)
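The output is like you requested except that every gap is recorded, not just the ones greater than 1 minute. Note that minute() returns the minute of the hour, so the differences are only meaningful while the data stay within a single hour. To get just the big gaps, do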
output[output$gaps > 1,]
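The difference between minute-of-hour values and full timestamps shows up as soon as the data cross an hour boundary. The sketch below uses three hypothetical timestamps (they are not taken from the table in the question) and base R only.

```r
# Hypothetical timestamps that cross an hour boundary
times <- as.POSIXct(c("2010-01-01 09:58:00",
                      "2010-01-01 09:59:00",
                      "2010-01-01 10:03:00"), tz = "UTC")

# Minute-of-hour differences break at the hour change
minute_of_hour <- as.POSIXlt(times)$min
diff(minute_of_hour)

# Differences of the full timestamps, expressed in minutes
gap_mins <- c(0, as.numeric(diff(times), units = "mins"))
data.frame(time = times, gap = ifelse(gap_mins > 1, gap_mins, NA))
```

The minute-of-hour version reports a gap of -56 at the hour change, whereas the timestamp version reports the actual gap of 4 minutes.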

How do I copy a date from one variable to another in R data.table without losing the date format?

I have a data.table containing two date variables. The data set was read into R from a .csv file (was originally an .xlsx file) as a data.frame and the two variables then converted to date format using as.Date() so that they display as below:
df
id specdate recdate
1 1 2014-08-12 2014-08-17
2 2 2014-08-15 2014-08-20
3 3 2014-08-21 2014-08-26
4 4 <NA> 2014-08-28
5 5 2014-08-25 2014-08-30
6 6 <NA> <NA>
I then converted the data.frame to a data.table:
df <- data.table(df)
I then wanted to create a third variable, that would include "specdate" if present, but replace it with "recdate" if "specdate" was missing (NA). This is where I'm having some difficulty, as it seems that no matter how I approach this, data.table displays dates in date format only if a complete variable that is already in date format is copied. Otherwise, individual values are displayed as a number (even when using as.IDate) and I gather that an origin date is needed to correct this. Is there any way to avoid supplying an origin date but display the dates as dates in data.table?
Below is my attempt to fill the NAs of specdate with the recdate dates:
# Function to fill NAs:
fillnas <- function(dataref, lookupref, nacol, replacecol, replacelist=NULL) {
nacol <- as.character(nacol)
if(!is.null(replacelist)) nacol <- factor(ifelse(dataref==lookupref & (is.na(nacol) | nacol %in% replacelist), replacecol, nacol))
else nacol <- factor(ifelse(dataref==lookupref & is.na(nacol), replacecol, nacol))
nacol
}
# Fill the NAs in specdate with the function:
df[, finaldate := fillnas(dataref=id, lookupref=id, nacol=specdate, replacecol=as.IDate(recdate, format="%Y-%m-%d"))]
Here is what happens:
> df
id specdate recdate finaldate
1: 1 2014-08-12 2014-08-17 2014-08-12
2: 2 2014-08-15 2014-08-20 2014-08-15
3: 3 2014-08-21 2014-08-26 2014-08-21
4: 4 <NA> 2014-08-28 16310
5: 5 2014-08-25 2014-08-30 2014-08-25
6: 6 <NA> <NA> NA
The display problem is compounded if I create the new variable from scratch by using ifelse:
df[, finaldate := ifelse(!is.na(specdate), specdate, recdate)]
This gives:
> df
id specdate recdate finaldate
1: 1 2014-08-12 2014-08-17 16294
2: 2 2014-08-15 2014-08-20 16297
3: 3 2014-08-21 2014-08-26 16303
4: 4 <NA> 2014-08-28 16310
5: 5 2014-08-25 2014-08-30 16307
6: 6 <NA> <NA> NA
Alternatively, if I try a find-and-replace type approach, I get a warning that the number of items to replace does not match the replacement length (I'm guessing this is because that approach is not vectorised?); the values from recdate are recycled and end up in the wrong place:
> df$finaldate <- df$specdate
> df$finaldate[is.na(df$specdate)] <- df$recdate
Warning message:
In NextMethod(.Generic) :
number of items to replace is not a multiple of replacement length
> df
id specdate recdate finaldate
1: 1 2014-08-12 2014-08-17 2014-08-12
2: 2 2014-08-15 2014-08-20 2014-08-15
3: 3 2014-08-21 2014-08-26 2014-08-21
4: 4 <NA> 2014-08-28 2014-08-17
5: 5 2014-08-25 2014-08-30 2014-08-25
6: 6 <NA> <NA> 2014-08-20
So in conclusion - the function I applied gets me closest to what I want, except that where NAs have been replaced, the replacement value is displayed as a number and not in date format. Once displayed as a number, the origin is required to again display it as a date (and I would like to avoid supplying the origin since I usually don't know it and it seems unnecessarily repetitive to have to supply it when the date was originally in the correct format).
Any insights as to where I'm going wrong would be much appreciated.
I'd approach it like this, maybe:
DT <- data.table(df)
DT[, finaldate := specdate]
DT[is.na(specdate), finaldate := recdate]
It seems you want to add a new column so you can retain the original columns as well. I do that a lot as well. Sometimes, I just update in place:
DT <- data.table(df)
DT[is.na(specdate), specdate := recdate]
setnames(DT, "specdate", "finaldate")
Using i like that avoids creating a whole new column's worth of data, which might be very large. It depends on how important retaining the original columns is to you, how many of them there are, and your data size. (Note that a whole column's worth of data is still created by the is.na() call, and a negated condition such as !is.na() creates another, but at least there isn't a third column's worth for the new finaldate. It would be great to optimize i=!is.na() in future (#1386), and if you use data.table this way now you won't need to change your code in future to benefit.)
It seems that you might have various "NA" strings that you're replacing. Note that fread in v1.9.6 on CRAN has a fix for that. From README :
correctly handles na.strings argument for all types of columns - it detect possible NA values without coercion to character, like in base read.table. fixes #504. Thanks to @dselivanov for the PR. Also closes #1314, which closes this issue completely, i.e., na.strings = c("-999", "FALSE") etc. also work.
Btw, you've made one of the top 3 mistakes mentioned here : https://github.com/Rdatatable/data.table/wiki/Support
Works for me. You may want to test to be sure that your NA values are not strings or factors "<NA>"; they will look like real NA values:
dt[, finaldate := ifelse(is.na(specdate), recdate, specdate)][
,finaldate := as.POSIXct(finaldate*86400, origin="1970-01-01", tz="UTC")]
# id specdate recdate finaldate
# 1: 1 2014-08-12 2014-08-17 2014-08-12
# 2: 2 2014-08-15 2014-08-20 2014-08-15
# 3: 3 2014-08-21 2014-08-26 2014-08-21
# 4: 4 NA 2014-08-28 2014-08-28
# 5: 5 2014-08-25 2014-08-30 2014-08-25
# 6: 6 NA NA NA
Data
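library(data.table)
df <- read.table(text=" id specdate recdate
1 1 2014-08-12 2014-08-17
2 2 2014-08-15 2014-08-20
3 3 2014-08-21 2014-08-26
4 4 NA 2014-08-28
5 5 2014-08-25 2014-08-30
6 6 NA NA", header=T, stringsAsFactors=F)
# convert the character columns to Date so the numeric conversion above applies
df$specdate <- as.Date(df$specdate)
df$recdate <- as.Date(df$recdate)
dt <- as.data.table(df)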
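For comparison, the find-and-replace attempt from the question can be made to work in base R by indexing both sides of the assignment with the same logical vector. This is a minimal sketch on a plain data frame rebuilt from the example; missing_spec is a name chosen here for illustration. Subset assignment keeps the Date class, so no origin is needed.

```r
df <- data.frame(
  id = 1:6,
  specdate = as.Date(c("2014-08-12", "2014-08-15", "2014-08-21",
                       NA, "2014-08-25", NA)),
  recdate = as.Date(c("2014-08-17", "2014-08-20", "2014-08-26",
                      "2014-08-28", "2014-08-30", NA))
)

# Rows where specdate is missing
missing_spec <- is.na(df$specdate)

# Copy the whole Date column, then replace only the missing entries,
# taking the replacement values from the same rows of recdate
df$finaldate <- df$specdate
df$finaldate[missing_spec] <- df$recdate[missing_spec]

class(df$finaldate)
df
```

Because the right-hand side is subset with the same index, its length matches the number of elements being replaced and nothing is recycled into the wrong rows.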
