Subset first time of two hour time intervals - r

I have a problem subsetting times.
1) I would like to filter my data by two time intervals, one around midnight and another around midday.
2) I need only the first time that occurs in each interval.
The data frame looks like this:
DATE v
1 2007-07-28 00:41:00 1
2 2007-07-28 02:00:12 5
3 2007-07-28 02:01:19 3
4 2007-07-28 02:44:08 2
5 2007-07-28 04:02:18 3
6 2007-07-28 09:59:16 4
7 2007-07-28 11:21:32 8
8 2007-07-28 11:58:40 5
9 2007-07-28 13:20:52 4
10 2007-07-28 13:21:52 9
11 2007-07-28 14:41:32 3
12 2007-07-28 15:19:00 9
13 2007-07-29 01:01:48 2
14 2007-07-29 01:41:08 5
The result should look like this:
DATE v
2 2007-07-28 02:00:12 5
9 2007-07-28 13:20:52 4
13 2007-07-29 01:01:48 2
Reproducible code:
DATE<-c("2007-07-28 00:41:00", "2007-07-28 02:00:12","2007-07-28 02:01:19", "2007-07-28 02:44:08", "2007-07-28 04:02:18","2007-07-28 09:59:16", "2007-07-28 11:21:32", "2007-07-28 11:58:40","2007-07-28 13:20:52", "2007-07-28 13:21:52", "2007-07-28 14:41:32","2007-07-28 15:19:00", "2007-07-29 01:01:48", "2007-07-29 01:41:08")
v<-c(1,5,3,2,3,4,8,5,4,9,3,9,2,5)
hyljes<-data.frame(cbind(DATE,v))
df <-subset(hyljes, format(as.POSIXct(DATE),"%H") %in% c ("01":"02","13":"14"))
There's a problem with making the intervals: the subset works for hours "13":"14" but not for "01":"02". Is there a reasonable way to fix that?
I also haven't found a way to get only the first element from each interval.
Any help is appreciated!

Try
hyljes[c(1, head(cumsum(rle(as.POSIXlt(hyljes$DATE)$hour < 13)$lengths) + 1, -1)), ]
## DATE v
## 1 2007-07-28 00:41:00 1
## 9 2007-07-28 13:20:52 4
## 13 2007-07-29 01:01:48 2
as.POSIXlt(hyljes$DATE)$hour < 13 tells you whether each time falls before 13:00, which here separates the night records from the midday ones
rle(...)$lengths gives you the lengths of the runs of TRUEs and FALSEs
cumsum of the above + 1 gives you the index of the first record in each following run
head(..., -1) trims off the last element, which points past the end of the data
c(1, ...) adds back the first index, which should always be included by definition
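The steps above can be followed one at a time. This is only a sketch: the intermediate names are made up for illustration, and tz = "UTC" is an assumption added here so that the parsed hours do not depend on the local time zone.

```r
DATE <- c("2007-07-28 00:41:00", "2007-07-28 02:00:12", "2007-07-28 02:01:19",
          "2007-07-28 02:44:08", "2007-07-28 04:02:18", "2007-07-28 09:59:16",
          "2007-07-28 11:21:32", "2007-07-28 11:58:40", "2007-07-28 13:20:52",
          "2007-07-28 13:21:52", "2007-07-28 14:41:32", "2007-07-28 15:19:00",
          "2007-07-29 01:01:48", "2007-07-29 01:41:08")

# Hour of day for each record (UTC is an assumption, see above)
hour <- as.POSIXlt(DATE, tz = "UTC")$hour

# TRUE for records before 13:00, FALSE otherwise
before_13 <- hour < 13

# Lengths of the consecutive TRUE/FALSE runs: 8 4 2
run_lengths <- rle(before_13)$lengths

# Index just after the end of each run: 9 13 15
after_run <- cumsum(run_lengths) + 1

# Drop the last value (it points past the data) and add the first row back
first_idx <- c(1, head(after_run, -1))

first_idx        # 1 9 13
DATE[first_idx]
```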

There are lots of little manipulations in here, but the end result gets you where you need to be:
library(plyr)  # provides ddply()
hyljes <- [YOUR DATA]
hyljes$DATE <- as.POSIXct(hyljes$DATE, format = "%Y-%m-%d %H:%M:%S")
hyljes$hour <- as.integer(strftime(hyljes$DATE, '%H'))  # numeric hour, so the comparison below is not a string comparison
hyljes$date <- strftime(hyljes$DATE, '%Y-%m-%d')
hyljes$am_pm <- ifelse(hyljes$hour < 12, 'am', 'pm')
# earliest timestamp within each date / half-day group
mins <- ddply(hyljes, .(date, am_pm), summarise, min = min(DATE))$min
hyljes[hyljes[, 1] %in% mins, 1:2]
DATE v
1 2007-07-28 00:41:00 1
9 2007-07-28 13:20:52 4
13 2007-07-29 01:01:48 2
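Both answers above return row 1, whereas the question asked for the first record inside the 01-02 and 13-14 hour windows. The original subset() attempt fails because "01":"02" is coerced to the integers 1:2, and %in% then compares the string "01" with "1", which never matches, while "13" does match 13. A possible base R sketch that compares integer hours instead; the names hr, keep, grp and first_rows are illustrative, and tz = "UTC" is an assumption:

```r
DATE <- c("2007-07-28 00:41:00", "2007-07-28 02:00:12", "2007-07-28 02:01:19",
          "2007-07-28 02:44:08", "2007-07-28 04:02:18", "2007-07-28 09:59:16",
          "2007-07-28 11:21:32", "2007-07-28 11:58:40", "2007-07-28 13:20:52",
          "2007-07-28 13:21:52", "2007-07-28 14:41:32", "2007-07-28 15:19:00",
          "2007-07-29 01:01:48", "2007-07-29 01:41:08")
v <- c(1, 5, 3, 2, 3, 4, 8, 5, 4, 9, 3, 9, 2, 5)

# Build the data frame directly so DATE is a date-time and v stays numeric
hyljes <- data.frame(DATE = as.POSIXct(DATE, tz = "UTC"), v = v)

# Integer hour of day, compared against integer hours
hr <- as.POSIXlt(hyljes$DATE)$hour
keep <- hr %in% c(1:2, 13:14)
sel <- hyljes[keep, ]

# One group per calendar date and window (night vs. midday)
grp <- paste(format(sel$DATE, "%Y-%m-%d"), ifelse(hr[keep] < 12, "night", "midday"))

# Data are already in time order, so the first row of each group is the earliest
first_rows <- sel[!duplicated(grp), ]
first_rows
```

With the data from the question this keeps rows 2, 9 and 13, which is the result the question asked for.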

Related

R chron %in% comparison only recognizes every second date

I am using the zoo and chron packages in R to read and transform data. At one point I need to select the part of a chron-indexed zoo object that corresponds to another chron object. Unfortunately, using the %in% operator I only get part of the corresponding dates. Here is a MWE that reproduces the error:
library(chron)
library(zoo)
chron1 <- seq(chron("2013-01-01","00:00:00", format=c(dates="y-m-d",times="h:m:s")),
chron("2013-01-01","03:10:00", format=c(dates="y-m-d",times="h:m:s")),by=1./1440.)
x1 <- runif(200)
z1 <- zoo(x1,chron1)
chron10 <- trunc(chron1, "00:10:00")
x10 <- aggregate(z1,chron10,FUN=sum)
which(index(x10) %in% chron1)
The (unexpected) output is:
[1] 1 3 5 7 9 10 12 14 16 18 19
chron objects are stored as floating point numbers, so there can be slight differences between what appear to be the same datetimes, depending on how they were calculated. Format them and compare the resulting strings instead:
which(format(index(x10)) %in% format(chron1))
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
This also works as trunc uses an eps value to ensure that inputs slightly less than one minute are not truncated down a further minute. See ?trunc.times
which(trunc(index(x10), "minutes") %in% trunc(chron1, "minutes"))
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
Also see R FAQ 7.31
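The underlying effect can be reproduced without chron. The sketch below uses plain doubles in base R (so it illustrates the FAQ 7.31 issue, not chron itself): two values that print the same can still fail an exact comparison, and %in% relies on exact matching.

```r
a <- 0.1 * 3   # computed value
b <- 0.3       # literal value

exact_equal  <- a == b                      # FALSE: the doubles differ in the last bits
exact_in     <- a %in% b                    # FALSE for the same reason
near_equal   <- isTRUE(all.equal(a, b))     # TRUE: comparison with a tolerance
format_match <- format(a) %in% format(b)    # TRUE: both format to "0.3"

c(exact_equal, exact_in, near_equal, format_match)
```

Comparing formatted strings, as in the answer above, works because formatting rounds away the tiny difference.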

R is not ordering data correctly - skips E values

I am trying to order data by the column weightFisher. However, it is almost as if R does not process e values as low, because all the e values are skipped when I try to order from smallest to greatest.
Code:
resultTable_bon <- GenTable(GOdata_bon,
weightFisher = resultFisher_bon,
weightKS = resultKS_bon,
topNodes = 15136,
ranksOf = 'weightFisher'
)
head(resultTable_bon)
#create Fisher ordered df
indF <- order(resultTable_bon$weightFisher)
resultTable_bonF <- resultTable_bon[indF, ]
what resultTable_bon looks like:
GO.ID Term Annotated Significant Expected Rank in weightFisher
1 GO:0019373 epoxygenase P450 pathway 19 13 1.12 1
2 GO:0097267 omega-hydroxylase P450 pathway 9 7 0.53 2
3 GO:0042738 exogenous drug catabolic process 10 7 0.59 3
weightFisher weightKS
1 1.9e-12 0.79744
2 7.9e-08 0.96752
3 2.5e-07 0.96336
what "ordered" resultTable_bonF looks like:
GO.ID Term Annotated Significant Expected Rank in weightFisher
17 GO:0014075 response to amine 33 7 1.95 17
18 GO:0034372 very-low-density lipoprotein particle re... 11 5 0.65 18
19 GO:0060710 chorio-allantoic fusion 6 4 0.35 19
weightFisher weightKS
17 0.00014 0.96387
18 0.00016 0.83624
19 0.00016 0.92286
As @bhas says, it appears to be working precisely as you want it to. Maybe it's the use of head() that's confusing you?
To put your mind at ease, try it with something simpler
dtf <- data.frame(a=c(1, 8, 6, 2)^-10, b=c(7, 2, 1, 6))
dtf
# a b
# 1 1.000000e+00 7
# 2 9.313226e-10 2
# 3 1.653817e-08 1
# 4 9.765625e-04 6
dtf[order(dtf$a), ]
# a b
# 2 9.313226e-10 2
# 3 1.653817e-08 1
# 4 9.765625e-04 6
# 1 1.000000e+00 7
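Try the following, which converts the column to numeric in case it is stored as text:
resultTable_bon$weightFisher <- as.numeric(resultTable_bon$weightFisher)
Then order by the converted column of the same data frame:
resultTable_bonF <- resultTable_bon[order(resultTable_bon$weightFisher), ]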
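The output in the question is what you would see if weightFisher were stored as character rather than numeric (an assumption here, since the column type is not shown). A small sketch with made-up vector names shows the difference between ordering strings and ordering numbers:

```r
# p-values as text, taken from the tables in the question
p_chr <- c("1.9e-12", "7.9e-08", "0.00014", "0.00016")

# Strings are ordered character by character, so "0.00014" sorts before "1.9e-12"
ord_chr <- order(p_chr)               # 3 4 1 2

# After conversion the values are ordered by magnitude
ord_num <- order(as.numeric(p_chr))   # 1 2 3 4

p_chr[ord_chr]
p_chr[ord_num]
```

Note that as.numeric() returns NA (with a warning) for any entry that is not a plain number, so it is worth checking for NAs after the conversion.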

Determine the length of a season with conditions (ex. winter snow season)

I want to determine the length of the snow season in the following data frame:
DATE SNOW
1998-11-01 0
1998-11-02 0
1998-11-03 0.9
1998-11-04 1
1998-11-05 0
1998-11-06 1
1998-11-07 0.6
1998-11-08 1
1998-11-09 2
1998-11-10 2
1998-11-11 2.5
1998-11-12 3
1998-11-13 6.5
1999-01-01 15
1999-01-02 15
1999-01-03 19
1999-01-04 18
1999-01-05 17
1999-01-06 17
1999-01-07 17
1999-01-08 17
1999-01-09 16
1999-03-01 6
1999-03-02 5
1999-03-03 5
1999-03-04 5
1999-03-05 5
1999-03-06 2
1999-03-07 2
1999-03-08 1.6
1999-03-09 1.2
1999-03-10 1
1999-03-11 0.6
1999-03-12 0
1999-03-13 1
The snow season is defined by a snow depth (SNOW) of more than 1 cm for at least 10 consecutive days (so if there is snow on one day in November but it then melts and the depth falls below 1 cm, we consider the season not to have started).
My idea would be to determine:
1) the date of snowpack establishment (in my example 1998-11-08)
2) the date of "disappearing" (here 1999-03-11)
3) the length of the period (number of days between 1998-11-08 and 1999-03-11)
For the 3rd step I can easily get the number between 2 dates using this method.
But how to define the dates with conditions?
This is one way:
# copy data from clipboard (readClipboard() is only available on Windows)
d <- read.table(text=readClipboard(), header=TRUE)
# coerce DATE to Date type, add event grouping variable that numbers the groups
# sequentially and has NA for values not in events.
d <- transform(d, DATE=as.Date(DATE),
event=with(rle(d$SNOW >= 1), rep(replace(ave(values, values, FUN=seq), !values, NA), lengths)))
# aggregate event lengths in days
event.days <- aggregate(DATE ~ event, data=d, function(x) as.numeric(max(x) - min(x), units='days'))
# get those events greater than 10 days
subset(event.days, DATE > 10)
# event DATE
# 3 3 122
You can also use the event grouping variable to find the start dates:
starts <- aggregate(DATE ~ event, data=d, FUN=head, 1)
# 1 1 1998-11-04
# 2 2 1998-11-06
# 3 3 1998-11-08
# 4 4 1999-03-13
And then merge this with event.days:
merge(event.days, starts, by='event')
# event DATE.x DATE.y
# 1 1 0 1998-11-04
# 2 2 0 1998-11-06
# 3 3 122 1998-11-08
# 4 4 0 1999-03-13
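The same idea can be written without the clipboard step. The following is a sketch under two assumptions that mirror the answer above: a "snow day" is SNOW >= 1, and consecutive rows are treated as consecutive days even though the sample data skip from November to January to March. The variable names are illustrative.

```r
# Data from the question, typed in directly
dates <- c(as.Date("1998-11-01") + 0:12,
           as.Date("1999-01-01") + 0:8,
           as.Date("1999-03-01") + 0:12)
snow <- c(0, 0, 0.9, 1, 0, 1, 0.6, 1, 2, 2, 2.5, 3, 6.5,
          15, 15, 19, 18, 17, 17, 17, 17, 16,
          6, 5, 5, 5, 5, 2, 2, 1.6, 1.2, 1, 0.6, 0, 1)

# Runs of consecutive rows with / without snow
runs <- rle(snow >= 1)
run_end   <- cumsum(runs$lengths)
run_start <- run_end - runs$lengths + 1

# Keep snowy runs that last at least 10 rows
season <- which(runs$values & runs$lengths >= 10)

season_start <- dates[run_start[season]]   # first day of the snowpack
season_end   <- dates[run_end[season]]     # last day with SNOW >= 1

data.frame(start = season_start,
           end   = season_end,
           days  = as.numeric(season_end - season_start))
```

For this data the single qualifying run starts on 1998-11-08 and its last snowy day is 1999-03-10, 122 days later, which agrees with the event.days result above.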

Date handling and splitting

I have a set of data (in csv format) that looks something like:
Date Auto_Index Realty_Index
29-Dec-02 1742.2 1000
2-Jan-03 1748.85 1009.67
3-Jan-03 1758.66 1041.45
4-Jan-03 1802.9 1062.11
5-Jan-03 1797.45 1047.56
...
...
...
26-Nov-12 1665.5 248.75
27-Nov-12 1676.3 257.6
29-Nov-12 1696.7 266.9
30-Nov-12 1682.8 266.55
3-Dec-12 1702.6 270.4
I want to analyse this data over different periods in R. Is there a way I can break this data into different periods say 2002-2005, 2006-2009 and 2009-2012?
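If you want to operate on the periods as numbers (rather than text), then this might help:
br <- c(2002, 2006, 2010, 2013)  # first year of each period, plus an upper bound
yr <- as.numeric(format(as.Date(df$Date, format='%d-%b-%y'), "%Y"))
df$Int <- findInterval(yr, br)  # 1 = 2002-2005, 2 = 2006-2009, 3 = 2010-2012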
As @user1317221_G proposed, you should use the function cut.POSIXt. Here's how:
d
Date Auto_Index Realty_Index
1 29-Dec-02 1742.20 1000.00
2 2-Jan-03 1748.85 1009.67
3 3-Jan-03 1758.66 1041.45
4 4-Jan-03 1802.90 1062.11
5 5-Jan-03 1797.45 1047.56
6 26-Nov-12 1665.50 248.75
7 27-Nov-12 1676.30 257.60
8 29-Nov-12 1696.70 266.90
9 30-Nov-12 1682.80 266.55
10 3-Dec-12 1702.60 270.40
# First step, convert your date column to POSIXct (strptime alone returns POSIXlt)
d$Date <- as.POSIXct(strptime(d$Date, format = "%d-%b-%y"))
# Then define your break points for your periods:
breaks <- as.POSIXct(c("2002-01-01","2006-01-01","2010-01-01","2013-01-01"))
# Then cut
d$Period <- cut(d$Date, breaks=breaks,
labels=c("2002-2005","2006-2009","2010-2012"))
d
Date Auto_Index Realty_Index Period
1 2002-12-29 1742.20 1000.00 2002-2005
2 2003-01-02 1748.85 1009.67 2002-2005
3 2003-01-03 1758.66 1041.45 2002-2005
4 2003-01-04 1802.90 1062.11 2002-2005
5 2003-01-05 1797.45 1047.56 2002-2005
6 2012-11-26 1665.50 248.75 2010-2012
7 2012-11-27 1676.30 257.60 2010-2012
8 2012-11-29 1696.70 266.90 2010-2012
9 2012-11-30 1682.80 266.55 2010-2012
10 2012-12-03 1702.60 270.40 2010-2012
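Once the period column exists, each period can be analysed separately. The sketch below uses the Date class with cut() rather than POSIXct, and ISO-formatted dates so that parsing does not depend on the locale's month abbreviations. The 2007 row is a hypothetical value added only so that every period is populated; it is not from the question.

```r
d <- data.frame(
  Date = as.Date(c("2002-12-29", "2003-01-02", "2007-06-15",
                   "2012-11-26", "2012-12-03")),
  Auto_Index = c(1742.20, 1748.85, 1500.00, 1665.50, 1702.60)  # 1500.00 is hypothetical
)

# Break points: each period runs from one break up to (not including) the next
breaks <- as.Date(c("2002-01-01", "2006-01-01", "2010-01-01", "2013-01-01"))
d$Period <- cut(d$Date, breaks = breaks,
                labels = c("2002-2005", "2006-2009", "2010-2012"))

# One data frame per period, and a per-period summary
by_period <- split(d, d$Period)
period_means <- tapply(d$Auto_Index, d$Period, mean)

names(by_period)
period_means
```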

How to put information obtained by cast function of reshape package back in my original data frame in R

I have a data.frame in panel format (country-year) and I need to calculate the mean of a variable by country for each five-year period. I used the 'cast' function from the 'reshape' package and it worked. Now I need to put this information (the mean by quinquennium) back into the old data.frame, so I can run some regressions. How can I do that? Below I provide an example to illustrate what I want:
library(reshape)  # provides melt.data.frame() and cast()
set.seed(2)
fake <- data.frame(y=rnorm(20), x=rnorm(20), country=rep(letters[1:2], each=10), year=rep(1:10,2), quinquenio=rep(rep(1:2, each=5),2))
fake.m <- melt.data.frame(fake, id.vars=c("country", "year", "quinquenio"))
cast(fake.m, country ~ quinquenio, mean, subset=variable=="x", na.rm=TRUE)
Everything works and I get what I wanted: the mean of x and y, by country and by quinquennium. Next, I would like to put them back in the data.frame fake, like this:
y x country year quinquenio mean.x
1 -0.89691455 2.090819205 a 1 1 0.8880242
2 0.18484918 -1.199925820 a 2 1 0.8880242
3 1.58784533 1.589638200 a 3 1 0.8880242
4 -1.13037567 1.954651642 a 4 1 0.8880242
5 -0.08025176 0.004937777 a 5 1 0.8880242
6 0.13242028 -2.451706388 a 6 2 -0.2978375
7 0.70795473 0.477237303 a 7 2 -0.2978375
8 -0.23969802 -0.596558169 a 8 2 -0.2978375
9 1.98447394 0.792203270 a 9 2 -0.2978375
10 -0.13878701 0.289636710 a 10 2 -0.2978375
11 0.41765075 0.738938604 b 1 1 0.2146461
12 0.98175278 0.318960401 b 2 1 0.2146461
13 -0.39269536 1.076164354 b 3 1 0.2146461
14 -1.03966898 -0.284157720 b 4 1 0.2146461
15 1.78222896 -0.776675274 b 5 1 0.2146461
16 -2.31106908 -0.595660499 b 6 2 -0.8059598
17 0.87860458 -1.725979779 b 7 2 -0.8059598
18 0.03580672 -0.902584480 b 8 2 -0.8059598
19 1.01282869 -0.559061915 b 9 2 -0.8059598
20 0.43226515 -0.246512567 b 10 2 -0.8059598
I appreciate any tip in the right direction. Thanks in advance.
ps.: the reason I need this is that I'll run a regression with quinquennial data, and for some variables (like per capita income) I have information for all years, so I decided to average them by 5 years.
I'm sure there's an easy way to do this with reshape, but my brain defaults to plyr first:
require(plyr)
ddply(fake, c("country", "quinquenio"), transform, mean.x = mean(x))
This is quite hackish, but one way to use reshape building off your earlier work:
zz <- cast(fake.m, country ~ quinquenio, mean, subset=variable=="x", na.rm=T)
merge(fake, melt(zz), by = c("country", "quinquenio"))
though I'm positive there has to be a better solution.
Here's a more old school approach using ave and with (ave computes the group mean by default and returns it aligned with the original rows):
fake$mean.x <- with(fake, ave(x, country, quinquenio))
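For comparison, the cast-then-merge idea can also be done in base R alone. This is a sketch, not the reshape method from the question: aggregate() builds the table of group means and merge() joins it back onto every row. The names means and out are illustrative.

```r
set.seed(2)
fake <- data.frame(y = rnorm(20), x = rnorm(20),
                   country = rep(letters[1:2], each = 10),
                   year = rep(1:10, 2),
                   quinquenio = rep(rep(1:2, each = 5), 2))

# One row per country / quinquennium with the mean of x
means <- aggregate(x ~ country + quinquenio, data = fake, FUN = mean)
names(means)[names(means) == "x"] <- "mean.x"

# Join the group means back onto the yearly rows
out <- merge(fake, means, by = c("country", "quinquenio"))

# merge() may reorder rows, so restore the country / year order
out <- out[order(out$country, out$year), ]
head(out)
```

The merged data frame keeps all 20 rows, and every row within a country and quinquennium carries the same mean.x.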
