endpoints (bug?) behavior in xts - r

If I try to get endpoints for every year, for example, and do the following:
xts.data <- xts(1:10000, order.by=seq(from=as.Date("1970-01-01"), by=1, len=10000))
z <- endpoints(xts.data, on="months", k=12)
The return value for this is:
> z
[1] 0 10000
The same happens with any number higher than 12. Why doesn't xts simply return the indices of every 12th month (i.e., every year) or every 13th month from the beginning? Is it limited by the number of periods in a year? By contrast, if I do:
z <- endpoints(xts.data, on="weeks", k=54)
This seems to work fine.

I agree that this is a bug, or at least a limitation that needs to be documented: k for on="months" only works for k = 1, 2, 3, 4, and 6 (the divisors of 12 other than 12 itself).
Looking for workarounds, my first idea was:
library(xts)
x <- xts(1:10000, order.by=seq(from=as.Date("1970-01-01"), by=1, len=10000))
index(x)[endpoints(x, on="months", k=6)[c(T,F)]]
giving:
"1970-12-31" "1971-12-31" "1972-12-31"... "1995-12-31" "1996-12-31"
But it breaks when I use a different set of data:
x <- xts(1:10000, order.by=seq(from=as.Date("1970-07-01"), by=1, len=10000))
which gives:
"1971-06-30" "1972-06-30" "1973-06-30" ...
A stable answer is:
dates <- index(x)[endpoints(x, on="months", k=6)]
dates[ as.POSIXlt(dates)$mon==11 ]
(In English: get the last day of each half-year, and only keep the ones in December.)
An alternative is to just use endpoints(x, on="years"), and then remove the last date if you don't like it.
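For instance, a quick sketch of that alternative (head(d, -1) drops the final endpoint, which is just the last day in the sample rather than a year-end):
d <- index(x)[endpoints(x, on="years")]
head(d, -1)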
I'm guessing you want "last trading day" or "last sample day" of the year, so it won't actually be Dec 31st in each year. But if you did want a specific date each year:
index(x)[.indexmon(x)==11 & .indexmday(x)==31]

Yes. This looks like something of an implementation gotcha. I'll add a fix to R-forge once I can test the version I am working on more completely.
UPDATE
There is now a patch for this in rev 742 on R-forge. It likely isn't final, but this is the intended direction (z below is recomputed with endpoints(xts.data, on="months", k=12) under the patched version):
> head(xts.data[z])
[,1]
1970-12-31 365
1971-12-31 730
1972-12-31 1096
1973-12-31 1461
1974-12-31 1826
1975-12-31 2191
> head(xts.data[endpoints(xts.data, on="months", k=1)])
[,1]
1970-01-31 31
1970-02-28 59
1970-03-31 90
1970-04-30 120
1970-05-31 151
1970-06-30 181
> head(xts.data[endpoints(xts.data, on="months", k=2)])
[,1]
1970-02-28 59
1970-04-30 120
1970-06-30 181
1970-08-31 243
1970-10-31 304
1970-12-31 365
> head(xts.data[endpoints(xts.data, on="months", k=3)])
[,1]
1970-03-31 90
1970-06-30 181
1970-09-30 273
1970-12-31 365
1971-03-31 455
1971-06-30 546
> head(xts.data[endpoints(xts.data, on="months", k=4)])
[,1]
1970-04-30 120
1970-08-31 243
1970-12-31 365
1971-04-30 485
1971-08-31 608
1971-12-31 730
> head(xts.data[endpoints(xts.data, on="months", k=6)])
[,1]
1970-06-30 181
1970-12-31 365
1971-06-30 546
1971-12-31 730
1972-06-30 912
1972-12-31 1096
> head(xts.data[endpoints(xts.data, on="months", k=7)])
[,1]
1970-07-31 212
1971-02-28 424
1971-09-30 638
1972-04-30 851
1972-11-30 1065
1973-06-30 1277
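For xts versions without that patch, one workaround in the same spirit is to thin the monthly endpoints yourself. A sketch, with ep_every as a hypothetical helper (not part of xts):
# keep the leading 0, every k-th month-end after it, and the final endpoint
ep_every <- function(x, k = 12) {
  ep <- endpoints(x, on = "months")
  ep[unique(c(seq(1, length(ep), by = k), length(ep)))]
}
head(xts.data[ep_every(xts.data, k = 12)])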

Adding values of two columns on the same row to get a new value

Sorry for asking a very basic question but I am new to R and really stuck on a rather simple matter; I have the data frame below (2 rows and 7 columns):
Sub sup_b hdt sup_2 lbnp sup_3 hut sup_4
6 175 434 596 585 601 593 211
7 130 592 592 593 600 384 166
These values correspond to time durations (secs) for seven test conditions
(columns sup_b, hdt, sup_2, lbnp, sup_3, hut, sup_4)
and 17 rows (each row is for one study subject; I have only included the first two rows).
I am trying to add the value from row 1, column sup_b (175) to row 1, column hdt (434) to get the combined duration of the first two conditions, i.e. 609 secs. I then add that running total (609) to the next column, sup_2, to get 609 + 596, and so on until the last condition, sup_4.
I have tried the method below for subject 6 (row 1), which works, but I want to tidy it up and make it easier since I have 17 subjects (rows), and I have been advised there is an easier way:
sup_b <- 175
hdt <- (sup_b + 434)
sup_2 <- (hdt + 596)
lbnp <- (sup_2 + 585)
sup_3 <- (lbnp + 601)
hut <- (sup_3 + 593)
sup_4 <- (hut + 211)
I want to be able to just change the row number and have the data pulled from the data frame, rather than entering each individual time period; for instance:
line <- 1 ### the row I want which corresponds to the subject
sup_b <- df[line, 2]
hdt <- df[line, 2] + df[line, 3]
but I keep getting this warning message:
In Ops.factor(df[line, 2], df[line, 3]) : ‘+’ not meaningful for factor
I have even tried: colSums(df[,c(2:3)]), but get the following error:
Error in colSums(df[, c(2:3)]) : 'x' must be numeric.
I also tried: st$sum <- apply(df[,c(2:3)], 1, sum), which doesn't work either.
A row-wise cumulative sum over all columns except Sub does this in one line:
df1[-1] <- t(apply(df1[-1], 1, cumsum))
# Sub sup_b hdt sup_2 lbnp sup_3 hut sup_4
# 1 6 175 609 1205 1790 2391 2984 3195
# 2 7 130 722 1314 1907 2507 2891 3057
data
df1 <- read.table(text="Sub sup_b hdt sup_2 lbnp sup_3 hut sup_4
6 175 434 596 585 601 593 211
7 130 592 592 593 600 384 166", header=TRUE, stringsAsFactors=FALSE)
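Once the cumulative sums are in place, pulling one subject's running totals is plain row indexing, using line as in the question:
line <- 1          # the row corresponding to the subject
df1[line, -1]      # cumulative durations for that subject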

Convert time values to numeric while keeping time characteristics

I have a data set which contains interval times of different events occurring. What I want to do is convert the data into a numeric vector so it's easier to manipulate, run summaries, make graphs, etc., while keeping its time characteristics. Here is a snippet of my data:
data <- c( "03:31", "12:17", "16:29", "09:52", "04:01", "09:00", "06:29",
"04:17", "04:42")
class(data)
[1] "character"
The obvious answer is:
as.numeric(data)
But I get this warning:
Warning message:
NAs introduced by coercion
I thought of maybe taking the ':' out, but then it loses its time characteristics. By that I mean that if I sum values such as 347 and 543, it would give me 890 as opposed to 930 (3:47 + 5:43 = 9:30). Here is the code that I would use to take the colon out, which works fine for its purpose:
Nocolon <- gsub("[:]", "", data, perl=TRUE)
"0331" "1217" "1629" "0952" "0401" "0900" "0629" "0417" "0442"
So essentially, what I want is for my time values to be in a form which is easy to manipulate and analyse. My idea is for it to be a numeric vector, but that comes from my minimal understanding of R. My actual data has thousands of time values, and I want to create a plot that will allow me to view the values and determine whether they follow a statistical distribution.
Thanks in advance!
Here are some approaches. All convert to minutes. For example, the first component is "03:31" which is 3 * 60 + 31 = 211 minutes. (1) to (5) do not use any packages.
1) %*% It works by reading data into a 2 column data frame with hours and minutes. That is converted to a matrix so that it can be matrix multiplied by c(60, 1). Finally, unravel it with c.
c(as.matrix(read.table(text = data, sep = ":")) %*% c(60, 1))
[1] 211 737 989 592 241 540 389 257 282
2) with This variation is even shorter. It creates the same data frame and then simply multiplies the first column (V1) by 60 and adds it to the second column (V2).
with(read.table(text = data, sep = ":"), 60*V1+V2)
[1] 211 737 989 592 241 540 389 257 282
3) complex This converts each component to a complex number and then performs the required arithmetic on the real and imaginary parts:
data_c <- as.complex(sub(":(\\d+)", "+\\1i", data))
60 * Re(data_c) + Im(data_c)
## [1] 211 737 989 592 241 540 389 257 282
3a) This variation of (3) also works and avoids regular expressions:
data_c <- as.complex(paste0(chartr(":", "+", data), "i"))
60 * Re(data_c) + Im(data_c)
## [1] 211 737 989 592 241 540 389 257 282
4) eval This converts each component into an arithmetic expression which evaluates to the number of minutes and then performs the evaluation. Using eval is not really recommended when you can avoid it, so this one is less desirable:
sapply(parse(text = sub("(\\d+):", "60*\\1+", data)), eval)
## [1] 211 737 989 592 241 540 389 257 282
5) POSIXlt We can convert to "POSIXlt" class and then use the hour and min components:
with(unclass(as.POSIXlt(data, format = "%H:%M")), 60 * hour + min)
## [1] 211 737 989 592 241 540 389 257 282
6) chron Using the chron package we can paste on the seconds, convert to "times" class and then convert to minutes:
library(chron)
24 * 60 * as.numeric(times(paste0(data, ":00")))
## [1] 211 737 989 592 241 540 389 257 282
7) lubridate Using the lubridate package we can convert it using hm and then to numeric giving seconds and finally dividing by 60 to give minutes:
as.numeric(hm(data)) / 60
## [1] 211 737 989 592 241 540 389 257 282
Use the as.difftime function designed for this:
as.difftime(data, format="%H:%M", units="mins")
#Time differences in mins
#[1] 211 737 989 592 241 540 389 257 282
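For the plotting goal in the question, a minimal sketch once the values are numeric minutes (the number of breaks is an arbitrary choice):
mins <- as.numeric(as.difftime(data, format="%H:%M", units="mins"))
hist(mins, breaks = 20, main = "Interval times", xlab = "minutes")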

R Conditional summing

I've just started my adventure with programming in R. I need to create a program summing numbers divisible by 3 and 5 in the range of 1 to 1000, using the '%%' operator. I came up with an idea to create two matrices with the numbers from 1 to 1000 in one column and their remainders in the second one. However, I don't know how to sum the proper elements (kind of "sum if" function in Excel). I attach all I've done below. Thanks in advance for your help!
s1<-1:1000
in1 <- s1 %% 3   # note: "in" is a reserved word in R, so it can't be used as a variable name
m1 <- matrix(c(s1, in1), 1000, 2, byrow=FALSE)
s2<-1:1000
in2<-s2%%5
m2<-matrix(c(s2,in2),1000,2,byrow=FALSE)
Mathematically, the best way is probably to find the least common multiple of the two numbers and check the remainder vs that:
# borrowed from Roland Rau
# http://r.789695.n4.nabble.com/Greatest-common-divisor-of-two-numbers-td823047.html
gcd <- function(a,b) if (b==0) a else gcd(b, a %% b)
lcm <- function(a,b) abs(a*b)/gcd(a,b)
s <- seq(1000)
s[ (s %% lcm(3,5)) == 0 ]
# [1] 15 30 45 60 75 90 105 120 135 150 165 180 195 210
# [15] 225 240 255 270 285 300 315 330 345 360 375 390 405 420
# [29] 435 450 465 480 495 510 525 540 555 570 585 600 615 630
# [43] 645 660 675 690 705 720 735 750 765 780 795 810 825 840
# [57] 855 870 885 900 915 930 945 960 975 990
Since your s is every number from 1 to 1000, you could instead do
seq(lcm(3,5), 1000, by=lcm(3,5))
Just use sum on either result if that's what you want to do.
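For instance, a quick check that either route gives the same total:
sum(s[(s %% lcm(3,5)) == 0])
# [1] 33165
sum(seq(lcm(3,5), 1000, by=lcm(3,5)))
# [1] 33165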
Props to @HoneyDippedBadger for figuring out what the OP was after.
See if this helps
x = 1:1000 ## store numbers 1 to 1000 in variable x
x ## print x
Div = x[x%%3==0 & x%%5==0] ## extract numbers divisible by both 3 and 5 between 1 and 1000
Div ## the numbers stored in Div, which are divisible by both 3 and 5
length(Div)
table(x%%3==0 & x%%5==0) ## see how many are TRUE for the given condition
sum(Div) ## sum the numbers divisible by both 3 and 5 between 1 and 1000

Creating data continuously using rnorm until an outlier occurs in R

Sorry for the confusing title, but I wasn't sure how to describe what I am trying to do. My objective is to create a dataset of 1000 observations, each of which would be the length of a run. I have created a phase1 dataset, from which a set of control limits is produced.
What I am trying to do now is create a phase2 dataset, most likely using rnorm, via a repeat loop that continuously generates values until one of them falls outside the control limits produced from the phase1 dataset. For example, if the control limits were 3.0 and -3.0, the phase2 dataset would keep generating observations until, say, observation 398 happened to be 3.45, stopping the creation of data. My objective is then to record the number 398.
Furthermore, I am trying to loop back to the phase1 dataset / control limits portion, create a new set of control limits, and run another phase2, until I have 1000 run lengths recorded. The code I have for the phase1 / control limits works fine and looks like this:
nphase1=50
nphase2=1000
varcount=1
meanshift= 0
sigmashift= 1
##### phase1 dataset/ control limits #####
phase1 <- matrix(rnorm(nphase1*varcount, 0, 1), nrow = nphase1, ncol=varcount)
mean_var <- apply(phase1, 2, mean)
std_var <- apply(phase1, 2, sd)
df_var <- data.frame(mean_var, std_var)
Upper_SPC_Limit_Method1 <- with(df_var, mean_var + 3 * std_var)
Lower_SPC_Limit_Method1 <- with(df_var, mean_var - 3 * std_var)
df_control_limits<- data.frame(Upper_SPC_Limit_Method1, Lower_SPC_Limit_Method1)
I have previously created this in SAS; the code below might be a better reference for what I am trying to achieve than my explanation:
%macro phase2_dataset (n=,varcount=, meanshift=, sigmashift=, nphase1=,simID=,);
%do z=1 %to &n;
%phase1_dataset (n=&nphase1, varcount=&varcount);
data phase2; set control_limits n=lastobs;
call streaminit(0);
do until (phase2_var1<Lower_SPC_limit_method1_var1 or
phase2_var1>Upper_SPC_limit_method1_var1);
phase2_var1 = rand("normal", &meanshift, &sigmashift);
output;
end;
run;
ods exclude all;
proc means data=phase2;
var phase2_var1;
ods output summary=x;
run;
ods select all;
data run_length; set x;
keep Phase2_var1_n;
run;
proc append base= QA.Phase2_dataset&simID data=Run_length force; run;
%end;
%mend;
I have also been researching using a while loop in place of the repeat loop.
I'm new to R, so any ideas you are able to throw my way are greatly appreciated. Thanks!
Using a while loop indeed seems to be the way to go. Here's what I think you're looking for:
set.seed(10) #Making results reproducible
replicate(100, { #100 is easier to display here
phase1 <- matrix(rnorm(nphase1*varcount, 0, 1), nrow = nphase1, ncol=varcount)
mean_var <- colMeans(phase1) #Slightly better than apply
std_var <- apply(phase1, 2, sd)
df_var <- data.frame(mean_var, std_var)
Upper_SPC_Limit_Method1 <- with(df_var, mean_var + 3 * std_var)
Lower_SPC_Limit_Method1 <- with(df_var, mean_var - 3 * std_var)
df_control_limits<- data.frame(Upper_SPC_Limit_Method1, Lower_SPC_Limit_Method1)
#Phase 2
x <- 0
count <- 0
while(x > Lower_SPC_Limit_Method1 && x < Upper_SPC_Limit_Method1) {
x <- rnorm(1)
count <- count + 1
}
count
})
The result is:
[1] 225 91 97 118 304 275 550 58 115 6 218 63 176 100 308 844 90 2758
[19] 161 311 1462 717 2446 74 175 91 331 210 118 1517 420 32 39 201 350 89
[37] 64 385 212 4 72 730 151 7 1159 65 36 333 97 306 531 1502 26 18
[55] 67 329 75 532 64 427 39 352 283 483 19 9 2 1018 137 160 223 98
[73] 15 182 98 41 25 1136 405 474 1025 1331 159 70 84 129 233 2 41 66
[91] 1 23 8 325 10 455 363 351 108 3
If performance becomes a problem, perhaps it would be interesting to explore some improvements, like generating rnorm() values in blocks and counting how many are needed before one falls outside the limits, drawing another block if none do; a sketch of that idea follows.
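A sketch of that batched idea, with run_length as a hypothetical helper and block as an arbitrary size:
run_length <- function(lower, upper, block = 1000) {
  count <- 0
  repeat {
    x <- rnorm(block)                    # draw a whole block at once
    hit <- which(x < lower | x > upper)  # positions outside the limits
    if (length(hit) > 0) return(count + hit[1])
    count <- count + block               # no hit yet; draw another block
  }
}
# e.g. run_length(Lower_SPC_Limit_Method1, Upper_SPC_Limit_Method1)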

Find the non zero values and frequency of those values in R

I have a dataset which has two variables: date/time and flow. The flow is intermittent: at times there is zero flow, then suddenly the flow starts, there are non-zero values for some time, and then the flow is zero again. I want to understand when the non-zero values occur and how long each non-zero flow lasts. I have attached the sample dataset at this location https://www.dropbox.com/s/ef1411dq4gyg0cm/sampledataflow.csv
The data is 1 minute data.
I was able to import the data into R as follows:
flow <- read.csv("sampledataflow.csv")
summary(flow)
names(flow) <- c("Date","discharge")
flow$Date <- strptime(flow$Date, format="%m/%d/%Y %H:%M")
sapply(flow,class)
plot(flow$Date, flow$discharge,type="l")
I made a plot to see the distribution but couldn't get a clue where to start to get the frequency of the non-zero values. I would like to see an output table as follows:
Date Duration in Minutes
Please let me know if I am not clear here. Thanks.
Additional Info:
I think we need to check for a non-zero value first and then find how many non-zero values occur continuously before the flow reaches zero again. What I want to understand is the flow release durations. For example, in one day there might be multiple releases, and I want to note at what time each release started and how long it continued before returning to zero. I hope this explains the problem a little better.
The first point is that you have a lot of NA values in your data, in case you want to look into that.
If I understand correctly, you require the lengths of runs of continuous 0's followed by continuous non-zeros, zeros, non-zeros, etc., for each date.
This can be achieved with rle of course, as also mentioned by @mnel in the comments. But there are quite a few catches.
First, I'll set up the data with non-NA entries:
flow <- read.csv("~/Downloads/sampledataflow.csv")
names(flow) <- c("Date","discharge")
flow <- flow[1:33119, ] # remove NA entries
# format Date to POSIXct to play nice with data.table
flow$Date <- as.POSIXct(flow$Date, format="%m/%d/%Y %H:%M")
Next, I'll create a Date column:
flow$g1 <- as.Date(flow$Date)
Finally, I prefer using data.table. So here's a solution using it.
# load package, get data as data.table and set key
require(data.table)
flow.dt <- data.table(flow)
# set key to both "Date" and "g1" (even though we'll use just g1)
# to make sure the order of rows is not changed during the sort
setkey(flow.dt, "Date", "g1")
# group by g1, convert discharge to TRUE/FALSE by comparing with 0, and take rle lengths
# (val ends up 1 for non-zero runs and 0 for zero runs)
out <- flow.dt[, list(duration = rle(discharge == 0)$lengths,
                      val = rle(discharge == 0)$values + 1), by=g1][val == 2, val := 0]
> out # showing the first and last few entries
# g1 duration val
# 1: 2010-05-31 120 0
# 2: 2010-06-01 722 0
# 3: 2010-06-01 138 1
# 4: 2010-06-01 32 0
# 5: 2010-06-01 79 1
# ---
# 98: 2010-06-22 291 1
# 99: 2010-06-22 423 0
# 100: 2010-06-23 664 0
# 101: 2010-06-23 278 1
# 102: 2010-06-23 379 0
So, for example, for 2010-06-01, there are 722 0's followed by 138 non-zeros, followed by 32 0's followed by 79 non-zeros and so on...
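To get just the non-zero runs in the Date / duration form the question asks for, one could subset that result (val == 1 marks the non-zero runs):
out[val == 1]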
I looked at a small sample of the first two days:
> do.call( cbind, tapply(flow$discharge, as.Date(flow$Date), function(x) table(x > 0) ) )
2010-06-01 2010-06-02
FALSE 1223 911
TRUE 217 529 # these are the cumulative daily durations of positive flow.
You may want this transposed, in which case the t() function should succeed. Or you could use rbind.
If you just wanted the number of flow-positive minutes, this would also work:
tapply(flow$discharge, as.Date(flow$Date), function(x) sum(x > 0, na.rm=TRUE) )
#--------
2010-06-01 2010-06-02 2010-06-03 2010-06-04 2010-06-05 2010-06-06 2010-06-07 2010-06-08
217 529 417 463 0 0 263 220
2010-06-09 2010-06-10 2010-06-11 2010-06-12 2010-06-13 2010-06-14 2010-06-15 2010-06-16
244 219 287 234 31 245 311 324
2010-06-17 2010-06-18 2010-06-19 2010-06-20 2010-06-21 2010-06-22 2010-06-23 2010-06-24
299 305 124 129 295 296 278 0
To get the lengths of intervals with discharge values greater than zero:
tapply(flow$discharge, as.Date(flow$Date), function(x) rle(x>0)$lengths[rle(x>0)$values] )
#--------
$`2010-06-01`
[1] 138 79
$`2010-06-02`
[1] 95 195 239
$`2010-06-03`
[1] 57 360
$`2010-06-04`
[1] 6 457
$`2010-06-05`
integer(0)
$`2010-06-06`
integer(0)
... Snipped output
If you want to look at the distribution of these durations you will need to unlist that result. (And remember that the durations which were split at midnight may have influenced the counts and durations.) If you just wanted durations without dates, then use this:
flowrle <- rle(flow$discharge>0)
flowrle$lengths[!is.na(flowrle$values) & flowrle$values]
#----------
[1] 138 79 95 195 296 360 6 457 263 17 203 79 80 85 30 189 17 270 127 107 31 1
[23] 2 1 241 311 229 13 82 299 305 3 121 129 295 3 2 291 278
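To look at the distribution of those durations, a quick sketch reusing flowrle from above (the number of breaks is an arbitrary choice):
durations <- flowrle$lengths[!is.na(flowrle$values) & flowrle$values]
hist(durations, breaks = 30, xlab = "duration (minutes)", main = "Flow-positive run lengths")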
