Compute the variance of a moving window in a dataframe in R

Hey, I want to compute the variance of a column. My dataframe is sorted by date (in as.Date() format). Here you can see a snippet of it:
Date USA ARG BRA CHL COL MEX PER
2012-04-01 1 0.2271531 0.4970299 0.001956865 0.0005341452 0.07341428 NA
2012-05-01 1 0.2218906 0.4675895 0.001911405 0.0005273186 0.07026524 NA
2012-06-01 1 0.2054076 0.4531661 0.001891352 0.0005292575 0.06897811 NA
2012-07-01 1 0.2033470 0.4596730 0.001950686 0.0005312600 0.07269619 NA
2012-08-01 1 0.1993882 0.4596039 0.001980537 0.0005271514 0.07268987 NA
2012-09-01 1 0.1967152 0.4593390 0.002011212 0.0005305549 0.07418838 NA
2012-10-01 1 0.1972730 0.4597584 0.002002203 0.0005284380 0.07428555 NA
2012-11-01 1 0.1937618 0.4519187 0.001979805 0.0005238670 0.07329656 NA
2012-12-01 1 0.1854037 0.4500448 0.001993309 0.0005323795 0.07453949 NA
2013-01-01 1 0.1866007 0.4607501 0.002013112 0.0005412329 0.07551040 NA
2013-02-01 1 0.1855950 0.4712956 0.002011067 0.0005359562 0.07554661 NA
The dataframe ranges from January 2004 up to December 2018, but I do not want to compute the variance of the whole columns.
I want to compute the variance over one year (12 values), in a window that moves month by month.
I do not really know how to start. I can imagine using the zoo package and rollapply, but the problem there is (I think) that the window is centered around each value rather than using only past values?
I also found this question: R: create a data frame out of a rolling window, so my idea was to get rid of the date column. Building the matrix is easy, but now I do not understand how to apply the variance function to my data...
Is there a smart way to compute it all in one and also using the information of the date? If not, I also appreciate any other solution from you!

We can use rollapplyr to perform the rolling computations. Since there are only 11 rows of data in the question we can't take 12-month windows, but we can illustrate the idea using 3-month windows instead. Remove fill = NA if you want to omit the leading NA rows, or replace it with partial = TRUE if you want variances computed over fewer values near the beginning. If you want a data frame result, use fortify.zoo(zv).
library(zoo)
z <- read.zoo(DF)
zv <- rollapplyr(z, 3, var, fill = NA)
zv
giving this zoo object:
USA ARG BRA CHL COL MEX PER
2012-04-01 NA NA NA NA NA NA NA
2012-05-01 NA NA NA NA NA NA NA
2012-06-01 0 1.287083e-04 4.998008e-04 1.126781e-09 1.237524e-11 5.208793e-06 NA
2012-07-01 0 1.033001e-04 5.217420e-05 9.109406e-10 3.883996e-12 3.565057e-06 NA
2012-08-01 0 9.358558e-06 1.396497e-05 2.060928e-09 4.221043e-12 4.600220e-06 NA
2012-09-01 0 1.113297e-05 3.108380e-08 9.159058e-10 4.826929e-12 7.453672e-07 NA
2012-10-01 0 1.988357e-06 4.498977e-08 2.485889e-10 2.953403e-12 8.001948e-07 NA
2012-11-01 0 3.560373e-06 1.944961e-05 2.615387e-10 1.168389e-11 2.971477e-07 NA
2012-12-01 0 3.717777e-05 2.655440e-05 1.271886e-10 1.814869e-11 4.312436e-07 NA
2013-01-01 0 2.042867e-05 3.268476e-05 2.806455e-10 7.540331e-11 1.231438e-06 NA
2013-02-01 0 4.134729e-07 1.129013e-04 1.186146e-10 1.983651e-11 3.263780e-07 NA
We can plot the log of the variances like this:
library(ggplot2)
autoplot(log(zv), facet = NULL) + geom_point() + ylab("log(var(.))")
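On the full dataset (January 2004 through December 2018) the same call works with a 12-month window; a minimal sketch, assuming the complete data frame is again called DF:
library(zoo)
z <- read.zoo(DF)                          # first column (Date) becomes the index
zv12 <- rollapplyr(z, 12, var, fill = NA)  # trailing 12-month rolling variance
head(fortify.zoo(zv12), 15)                # as a data frame, if preferred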
Note
We assume that the starting point is the data frame generated reproducibly below:
Lines <- "Date USA ARG BRA CHL COL MEX PER
2012-04-01 1 0.2271531 0.4970299 0.001956865 0.0005341452 0.07341428 NA
2012-05-01 1 0.2218906 0.4675895 0.001911405 0.0005273186 0.07026524 NA
2012-06-01 1 0.2054076 0.4531661 0.001891352 0.0005292575 0.06897811 NA
2012-07-01 1 0.2033470 0.4596730 0.001950686 0.0005312600 0.07269619 NA
2012-08-01 1 0.1993882 0.4596039 0.001980537 0.0005271514 0.07268987 NA
2012-09-01 1 0.1967152 0.4593390 0.002011212 0.0005305549 0.07418838 NA
2012-10-01 1 0.1972730 0.4597584 0.002002203 0.0005284380 0.07428555 NA
2012-11-01 1 0.1937618 0.4519187 0.001979805 0.0005238670 0.07329656 NA
2012-12-01 1 0.1854037 0.4500448 0.001993309 0.0005323795 0.07453949 NA
2013-01-01 1 0.1866007 0.4607501 0.002013112 0.0005412329 0.07551040 NA
2013-02-01 1 0.1855950 0.4712956 0.002011067 0.0005359562 0.07554661 NA"
DF <- read.table(text = Lines, header = TRUE)

Related

Efficient way for insertion of multiple rows at given indices & with repetitions

I have a data frame (DATA) with > 2 million rows (observations at different time points) and another data frame (INSERTION) which gives info about missing observations. The latter object contains 2 columns: 1st column with row indices after which empty (NA) rows should be inserted into DATA, and 2nd column with the number of empty rows that should be inserted at that position.
Below is a minimal working example:
DATA <- data.frame(
  datetime = strptime(as.character(c(
    201301011700, 201301011701, 201301011703, 201301011704, 201301011705,
    201301011708, 201301011710, 201301011711, 201301011715, 201301011716,
    201301011718, 201301011719, 201301011721, 201301011722, 201301011723,
    201301011724, 201301011725, 201301011726, 201301011727, 201301011729,
    201301011730, 201301011731, 201301011732, 201301011733, 201301011734,
    201301011735, 201301011736, 201301011737, 201301011738, 201301011739)),
    format = "%Y%m%d%H%M"),
  var1 = rnorm(30), var2 = rnorm(30), var3 = rnorm(30))
INSERTION <- data.frame(index=c(2, 5, 6, 8, 10, 12, 19), repetition=c(1, 2, 1, 3, 1, 1, 1))
Now I'm looking for an efficient (and thus fast) way to insert the n empty rows at the given row indices of the original data frame. How can I additionally fill in the correct datetimes for these empty rows (adding 1 minute for every new row; note, however, that on weekends and bank holidays there are some regular gaps which are not contained in INSERTION)?
Any help is appreciated!
Looking at the pattern in INSERTION and matching it with DATA, you are most probably trying to fill the missing minutes in the datetime column of DATA. You can create a dataframe with a sequence of every minute from the min to the max value of datetime in DATA and then merge:
merge(data.frame(datetime = seq(min(DATA$datetime), max(DATA$datetime),
                                by = "1 min")),
      DATA, all.x = TRUE)
# datetime var1 var2 var3
#1 2013-01-01 17:00:00 -1.063326 0.11925 -0.788622
#2 2013-01-01 17:01:00 1.263185 0.24369 -0.502199
#3 2013-01-01 17:02:00 NA NA NA
#4 2013-01-01 17:03:00 -0.349650 1.23248 1.496061
#5 2013-01-01 17:04:00 -0.865513 -0.51606 -1.137304
#6 2013-01-01 17:05:00 -0.236280 -0.99251 -0.179052
#7 2013-01-01 17:06:00 NA NA NA
#8 2013-01-01 17:07:00 NA NA NA
#9 2013-01-01 17:08:00 -0.197176 1.67570 1.902362
#10 2013-01-01 17:09:00 NA NA NA
#...
#...
Or, using similar logic, with tidyr::complete:
tidyr::complete(DATA, datetime = seq(min(datetime), max(datetime), by = "1 min"))
If performance is a factor on a large data frame, this approach avoids joins:
# Generate new data.frame containing missing datetimes
tmp <- data.frame(datetime = DATA$datetime[with(INSERTION, rep(index, repetition))] +
                    sequence(INSERTION$repetition) * 60)
# Create variables filled with NA to match main data.frame
tmp[setdiff(names(DATA), names(tmp))] <- NA
# Bind and sort
new_df <- rbind(DATA, tmp)
new_df <- new_df[order(new_df$datetime),]
head(new_df, 15)
datetime var1 var2 var3
1 2013-01-01 17:00:00 0.98789253 0.68364933 0.70526985
2 2013-01-01 17:01:00 -0.68307496 0.02947599 0.90731512
31 2013-01-01 17:02:00 NA NA NA
3 2013-01-01 17:03:00 -0.60189915 -1.00153188 0.06165694
4 2013-01-01 17:04:00 -0.87329313 -1.81532302 -2.04930719
5 2013-01-01 17:05:00 -0.58713154 -0.42313098 0.37402224
32 2013-01-01 17:06:00 NA NA NA
33 2013-01-01 17:07:00 NA NA NA
6 2013-01-01 17:08:00 2.41350911 -0.13691754 1.57618578
34 2013-01-01 17:09:00 NA NA NA
7 2013-01-01 17:10:00 -0.38961552 0.83838954 1.18283382
8 2013-01-01 17:11:00 0.02290672 -2.10825367 0.87441448
35 2013-01-01 17:12:00 NA NA NA
36 2013-01-01 17:13:00 NA NA NA
37 2013-01-01 17:14:00 NA NA NA
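If the mixed row names left over from rbind bother you, they can be reset after sorting (an optional cleanup):
rownames(new_df) <- NULL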

Create multiple lagged variables using a zoo object

I need to create 'n' variables with lags of the original variable from 1 to 'n' on the fly. Something like this:
OrigVar
DatePeriod, value
2/01/2018,6
3/01/2018,4
4/01/2018,0
5/01/2018,2
6/01/2018,4
7/01/2018,1
8/01/2018,6
9/01/2018,2
10/01/2018,7
Lagged 1 variable
2/01/2018,NA
3/01/2018,6
4/01/2018,4
5/01/2018,0
6/01/2018,2
7/01/2018,4
8/01/2018,1
9/01/2018,6
10/01/2018,2
11/01/2018,7
Lagged 2 variable
2/01/2018,NA
3/01/2018,NA
4/01/2018,6
5/01/2018,4
6/01/2018,0
7/01/2018,2
8/01/2018,4
9/01/2018,1
10/01/2018,6
11/01/2018,2
12/01/2018,7
Lagged 3 variable
2/01/2018,NA
3/01/2018,NA
4/01/2018,NA
5/01/2018,6
6/01/2018,4
7/01/2018,0
8/01/2018,2
9/01/2018,4
10/01/2018,1
11/01/2018,6
12/01/2018,2
13/01/2018,7
and so on
I tried using the shift function and various other functions. With most of the ones that worked for me, the lagged variables finished at the last date of the original variable. In other words, the length of the lagged variable is the same as that of the original variable.
What I am looking for is the new lagged variable shifted down by the 'k'th lag and the data series extended by 'k' elements, including the index.
The reason I need this is to be able to compute the value of the dependent variable using the regression coefficients and the corresponding lagged variable values beyond the in-sample period.
y1 <- Lag(ciresL1_usage_1601_1612, shift = 1)
head(y1)
2016-01-02 2016-01-03 2016-01-04 2016-01-05 2016-01-06 2016-01-07
NA -5171.051 -6079.887 -3687.227 -3229.453 -2110.368
y2 <- Lag(ciresL1_usage_1601_1612, shift = 2)
head(y2)
2016-01-02 2016-01-03 2016-01-04 2016-01-05 2016-01-06 2016-01-07
NA NA -5171.051 -6079.887 -3687.227 -3229.453
tail(y2)
2016-12-26 2016-12-27 2016-12-28 2016-12-29 2016-12-30 2016-12-31
-2316.039 -2671.185 -4100.793 -2043.020 -1147.798 1111.674
tail(ciresL1_usage_1601_1612)
2016-12-26 2016-12-27 2016-12-28 2016-12-29 2016-12-30 2016-12-31
-4100.793 -2043.020 -1147.798 1111.674 3498.729 2438.739
Is there a way to do it relatively easily? I know I can do it by looping, adding 'k' rows to a new vector and reloading the data into this new vector while appropriately shifting the data values, but I don't want to use that method unless I have to. I am quietly confident that there has to be a better way to do it than this!
By the way, the object is a zoo object with daily dates as the index.
Best regards
Deepak
Convert the input zoo object to zooreg and then use lag.zooreg like this:
library(zoo)
# test input
z <- zoo(1:10, as.Date("2008-01-01") + 0:9)
zr <- as.zooreg(z)
lag(zr, -(0:3))
giving:
lag0 lag-1 lag-2 lag-3
2008-01-01 1 NA NA NA
2008-01-02 2 1 NA NA
2008-01-03 3 2 1 NA
2008-01-04 4 3 2 1
2008-01-05 5 4 3 2
2008-01-06 6 5 4 3
2008-01-07 7 6 5 4
2008-01-08 8 7 6 5
2008-01-09 9 8 7 6
2008-01-10 10 9 8 7
2008-01-11 NA 10 9 8
2008-01-12 NA NA 10 9
2008-01-13 NA NA NA 10
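For a single extended lag rather than the whole lag matrix, the same idea applies; a short sketch using the test input above:
y2 <- lag(zr, -2)   # values 1:10 with the index shifted two days past the original end
merge(zr, y2)       # union of the dates, padded with NAs at head and tail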

How to analyze data from the Internet with R to find discrepancies?

I am new to "R"; I have this HTML table here.
I need to find out if there is a gap in the "time (DT)" column of more than one minute. I need to analyze the data and create a new table with just two columns, the first one with the time and the second one with the size of the gap (in minutes).
Like this: output
So far I am able to download the data:
require(XML)
u='http://cronos.est.pr/test.html'
tables = readHTMLTable(u)
datatest=tables[[1]]
View(datatest)
What's next?
Convert the first column to "POSIXct" class, take differences and replace differences of one minute or less with NA. No packages are used.
with(datatest, {
  Time <- as.POSIXct(`Time (DT)`)
  Diff <- c(0, c(diff(Time, units = "minutes")))
  data.frame(Time, Diff = ifelse(Diff <= 1, NA, Diff))
})
giving:
Time Diff
1 2010-01-01 09:10:00 NA
2 2010-01-01 09:11:00 NA
3 2010-01-01 09:12:00 NA
4 2010-01-01 09:13:00 NA
5 2010-01-01 09:17:00 4
6 2010-01-01 09:18:00 NA
7 2010-01-01 09:19:00 NA
8 2010-01-01 09:20:00 NA
9 2010-01-01 09:22:00 2
10 2010-01-01 09:24:00 2
11 2010-01-01 09:25:00 NA
12 2010-01-01 09:26:00 NA
13 2010-01-01 09:38:00 12
14 2010-01-01 09:39:00 NA
15 2010-01-01 09:40:00 NA
Use the lubridate package.
library(lubridate)
minutes = minute(datatest[,"Time (DT)"])
gaps = c(0, diff(minutes))
output = data.frame("date_time" = datatest[,"Time (DT)"], gaps = gaps)
The output is like you requested, except that every gap is recorded, not just the ones greater than 1 minute. To get just the big gaps, do:
output[output$gaps > 1,]
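One caveat: differencing the minute component alone wraps around at hour boundaries (09:59 to 10:01 would give -58), so for data spanning more than an hour it may be safer to difference the full date-times; a sketch, assuming the same datatest as above:
tm   <- as.POSIXct(datatest[, "Time (DT)"])
gaps <- c(0, as.numeric(diff(tm), units = "mins"))  # gaps in minutes, hour-safe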

Condition for function and loop

I have a data frame simplified as follow:
head(dendro)
X DateTime ID diameter dendro ring DOY month mday year Rain_mm_Tot Through_Tot temp
1 1 2012-06-21 13:45:00 r1_1 5482 1 1 173 6 22 113 NA NA NA
2 2 2012-06-21 13:45:00 r2_3 NA 3 2 173 6 22 113 NA NA NA
3 3 2012-06-21 13:45:00 r1_2 5534 2 1 173 6 22 113 NA NA NA
4 4 2012-06-21 13:45:00 r2_4 NA 4 2 173 6 22 113 NA NA NA
5 5 2012-06-21 13:45:00 r1_3 5606 3 1 173 6 22 113 NA NA NA
6 6 2012-06-21 13:45:00 r2_5 NA 5 2 173 6 22 113 NA NA NA
The dataframe is first split by "ID", so it becomes a list of IDs.
After that I apply a function that includes a loop, producing a new column "diameter2" with the result I want from the function. That works OK:
dendro_sp <- split(dendro, dendro$ID)
library(changepoint)
dendro_sp <- lapply(dendro_sp, function(x){
  x <- subset(x, !is.na(diameter))
  cpfit <- cpt.mean(x$diameter, method="BinSeg")
  x$diameter2 <- x$diameter
  cpts <- cpfit@cpts
  means <- param.est(cpfit)$mean
  meanZero <- means[1]
  for(i in 1:(length(cpts)-1)){
    x$diameter2[(cpts[i]+1):cpts[i+1]] <- x$diameter2[(cpts[i]+1):cpts[i+1]] + (meanZero - means[i+1])
  }
  return(x)
})
dendro2 <- do.call(rbind, dendro_sp)
rownames(dendro2) <- NULL
My problem is that I want to apply it conditionally, for example only to r1_1 and r1_3, and for the rest of the IDs simply copy the "diameter" value into the new column "diameter2" instead of applying the function:
ifelse(diameter$ID==c("r1_1","r1_3"), apply_the_function_to_r11_and_r13_to_calculate_diameter2, otherwise_write_diameter_value_in_diameter2_column)
Remember that the dataframe "dendro" is split by ID; I don't know if that is important for defining the condition for several IDs.
Thanks
I am not sure if I understand the problem correctly, but I'll try to answer.
I assume you want to apply a function to the "diameter" field of the "diameter" data.frame, conditioning on the "ID" field, and return the result in the corresponding diameter2 field. I don't know how the function works, so forgive me if this will not work.
Selected fields
diameter$diameter2[diameter$ID=="r1_1" | diameter$ID=="r1_3"] <- yourfun(diameter$diameter[diameter$ID=="r1_1" | diameter$ID=="r1_3"])
Unselected fields
diameter$diameter2[diameter$ID!="r1_1" & diameter$ID!="r1_3"] <- diameter$diameter[diameter$ID!="r1_1" & diameter$ID!="r1_3"]
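Since dendro is already split by ID, another option is to put the condition inside the lapply itself; a sketch reusing the changepoint code from the question, with ids_to_adjust as an assumed helper name:
library(changepoint)
ids_to_adjust <- c("r1_1", "r1_3")
dendro_sp <- lapply(dendro_sp, function(x){
  x <- subset(x, !is.na(diameter))
  if (x$ID[1] %in% ids_to_adjust) {
    # apply the changepoint-based correction, as in the question
    cpfit <- cpt.mean(x$diameter, method="BinSeg")
    x$diameter2 <- x$diameter
    cpts <- cpfit@cpts
    means <- param.est(cpfit)$mean
    for(i in 1:(length(cpts)-1)){
      x$diameter2[(cpts[i]+1):cpts[i+1]] <- x$diameter2[(cpts[i]+1):cpts[i+1]] + (means[1] - means[i+1])
    }
  } else {
    x$diameter2 <- x$diameter   # just copy the raw diameter for the other IDs
  }
  x
})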

Using a date field in a ts?

I wonder how I can make use of an already existing date field when creating a ts in R.
Sometimes you simply have a date before you have a ts object, e.g.
x <- as.Date("2008-01-01") + c(30,60,90,120,150)
# add some data to it
df = data.frame(datefield=x,test=1:length(x))
Now, is there a way to use the datefield of the df as an index when creating a ts object? Because:
ts(df$test,start=c(2008,1,2),frequency=12)
(obviously) completely ignores the date information I already have. Making use of ts methods like acf is the reason why I'd like to make it a ts object. I typically use monthly and quarterly time series...
You don't necessarily need to create new types of objects from scratch; you can always coerce to other classes, including ts, as you need to. zoo and xts are arguably the most useful and intuitive, but there are others. Here is your example, cast as a zoo object, which we then coerce to class ts for use in acf().
## create the data
x <- as.Date("2008-01-01") + c(30,60,90,120,150)
df = data.frame(datefield=x,test=1:length(x))
## load zoo
require(zoo)
## convert to a zoo object, with order given by the `datefield`
df.zoo <- with(df, zoo(test, order.by = x))
## or to a regular zoo object
df.zoo2 <- with(df, zooreg(test, order.by = x))
Now we can easily go to a ts object using the as.ts() method:
> as.ts(df.zoo)
Time Series:
Start = 13920
End = 14040
Frequency = 0.0333333333333333
[1] 1 2 3 4 5
> ## zooreg object:
> as.ts(df.zoo2)
Time Series:
Start = 13909
End = 14029
Frequency = 1
[1] 1 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[21] NA NA NA NA NA NA NA NA NA NA 2 NA NA NA NA NA NA NA NA NA
[41] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[61] 3 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[81] NA NA NA NA NA NA NA NA NA NA 4 NA NA NA NA NA NA NA NA NA
[101] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[121] 5
Notice the two ways in which the objects are represented (although we could have made the zooreg version the same as the standard zoo object by setting the frequency argument to 0.03333333):
> as.ts(with(df, zooreg(test, order.by = datefield,
+ frequency = 0.033333333333333)))
Time Series:
Start = 13920.0000000001
End = 14040.0000000001
Frequency = 0.033333333333333
[1] 1 2 3 4 5
We can use the zoo/zooreg object in acf() and it will get the correct lags (a daily date index, but observations every 30 days):
acf(df.zoo)
Whether this is intuitive to you or not depends on how you view the time series. We can do the same thing in terms of a 30-day interval via:
acf(coredata(df.zoo))
where we use coredata() to extract the time series itself, ignoring the date information.
I don't know exactly what you're trying to do, but acf also works on simple vectors, given of course that they represent a regular (i.e. evenly spaced) time series. Otherwise the result is just bollocks.
>acf(df$test)
Regarding the ts object:
The "dates" you see are just from the print.ts function, so they're not inherent to the ts object. The ts object has no date information in it. You can set the option calendar=FALSE to get the standard print-out of the ts object.
> ts(df$test,start=2008,frequency=12)
Jan Feb Mar Apr May
2008 1 2 3 4 5
> print(ts(df$test,start=2008,frequency=12),calendar=F)
Time Series:
Start = c(2008, 1)
End = c(2008, 5)
Frequency = 12
[1] 1 2 3 4 5
Now, the vector you construct looks like:
> x
[1] "2008-01-31" "2008-03-01" "2008-03-31" "2008-04-30" "2008-05-30"
which is or isn't regular, depending on how you see it. If you extract the months, then you have 1 observation for January, 2 for March, 1 for April, ...: not regular. You have an observation every 30 days: regular. If you have an observation every 30 days, you shouldn't bother with the dates, as 365 is not divisible by 30. Hence, one year you'll have 12 observations and another you'll have 13. So you can't set the frequency in ts in a consistently correct way.
So I'd refrain from using a ts altogether, as James already indicated in the comments.
If you want:
Use the date information you already have
Easily set the frequency to a desired value
End up with a ts object
You can start with an xts object, add a frequency attribute, and then convert to ts:
library("xts")
my_xts <- xts(df$test, df$datefield)
attr(my_xts, 'frequency') <- 12 # Set the frequency
my_ts <- as.ts(my_xts)
The resulting ts object will have the specified period and will have the correct dates assigned to each data point.
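As a quick check of the round trip, assuming the conversion preserves the attribute as described above:
frequency(my_ts)   # 12, as set above
acf(my_ts)         # lags are now expressed in units of the chosen frequency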
