Date handling and splitting - r

I have a set of data (in csv format) that looks something like:
Date Auto_Index Realty_Index
29-Dec-02 1742.2 1000
2-Jan-03 1748.85 1009.67
3-Jan-03 1758.66 1041.45
4-Jan-03 1802.9 1062.11
5-Jan-03 1797.45 1047.56
...
...
...
26-Nov-12 1665.5 248.75
27-Nov-12 1676.3 257.6
29-Nov-12 1696.7 266.9
30-Nov-12 1682.8 266.55
3-Dec-12 1702.6 270.4
I want to analyse this data over different periods in R. Is there a way I can break this data into different periods say 2002-2005, 2006-2009 and 2009-2012?

If you want to operate on the periods as numbers (rather than text), then this might help:
# Numeric break points: periods start in 2002, 2006 and 2010
br <- c(2002, 2006, 2010, 2013)
df$Int <- findInterval(as.numeric(format(as.Date(df$Date, format='%d-%b-%y'), "%Y")), br)

As @user1317221_G proposed, you can use the function cut.POSIXt. Here's how:
d
Date Auto_Index Realty_Index
1 29-Dec-02 1742.20 1000.00
2 2-Jan-03 1748.85 1009.67
3 3-Jan-03 1758.66 1041.45
4 4-Jan-03 1802.90 1062.11
5 5-Jan-03 1797.45 1047.56
6 26-Nov-12 1665.50 248.75
7 27-Nov-12 1676.30 257.60
8 29-Nov-12 1696.70 266.90
9 30-Nov-12 1682.80 266.55
10 3-Dec-12 1702.60 270.40
# First step: convert your date column to POSIXct
# (strptime returns POSIXlt, so wrap it in as.POSIXct before storing it in the data frame)
d$Date <- as.POSIXct(strptime(d$Date, format = "%d-%b-%y"))
# Then define the break points for your periods:
breaks <- as.POSIXct(c("2002-01-01","2006-01-01","2010-01-01","2013-01-01"))
# Then cut
d$Period <- cut(d$Date, breaks=breaks,
                labels=c("2002-2005","2006-2009","2010-2012"))
d
Date Auto_Index Realty_Index Period
1 2002-12-29 1742.20 1000.00 2002-2005
2 2003-01-02 1748.85 1009.67 2002-2005
3 2003-01-03 1758.66 1041.45 2002-2005
4 2003-01-04 1802.90 1062.11 2002-2005
5 2003-01-05 1797.45 1047.56 2002-2005
6 2012-11-26 1665.50 248.75 2010-2012
7 2012-11-27 1676.30 257.60 2010-2012
8 2012-11-29 1696.70 266.90 2010-2012
9 2012-11-30 1682.80 266.55 2010-2012
10 2012-12-03 1702.60 270.40 2010-2012
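Once a Period column exists, each period can be analysed separately by splitting the data frame on it. The following is only a sketch: it rebuilds the sample rows with ISO dates (so it does not depend on locale-specific month names), derives the period from the year with plain cut() rather than cut.POSIXt, and the names by_period and period_means are made up for illustration.

```r
# Sample rows from the question, with the dates already written in ISO format
d <- data.frame(
  Date = as.Date(c("2002-12-29", "2003-01-02", "2003-01-03", "2003-01-04",
                   "2003-01-05", "2012-11-26", "2012-11-27", "2012-11-29",
                   "2012-11-30", "2012-12-03")),
  Auto_Index = c(1742.2, 1748.85, 1758.66, 1802.9, 1797.45,
                 1665.5, 1676.3, 1696.7, 1682.8, 1702.6),
  Realty_Index = c(1000, 1009.67, 1041.45, 1062.11, 1047.56,
                   248.75, 257.6, 266.9, 266.55, 270.4)
)

# Assign each row to a period based on its year (left-closed intervals)
year <- as.integer(format(d$Date, "%Y"))
d$Period <- cut(year, breaks = c(2002, 2006, 2010, 2013), right = FALSE,
                labels = c("2002-2005", "2006-2009", "2010-2012"))

# One data frame per period; periods without rows are kept as empty data frames
by_period <- split(d, d$Period)

# Example analysis: mean Auto_Index within each period (NaN where a period is empty)
period_means <- sapply(by_period, function(p) mean(p$Auto_Index))
```

Each element of by_period is an ordinary data frame, so any per-period analysis can be run with lapply() or sapply() over the list.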

Related

Subsetting tidy data from a vector

I'm using R to analyse data about antibiotic use from a number of hospitals.
I've imported this data into a frame, according to the tidy data principles.
>head(data)
date antibiotic usage hospital
1 2006-01-01 amikacin 0.000000 hospital1
2 2006-02-01 amikacin 0.000000 hospital1
3 2006-03-01 amikacin 0.000000 hospital1
4 2006-04-01 amikacin 0.000000 hospital1
5 2006-05-01 amikacin 0.937119 hospital1
6 2006-06-01 amikacin 1.002961 hospital1
(the data set is monthly data x 5 hospitals x 40 antibiotics)
The first thing I would like to do is aggregate the antibiotics into classes.
> head(distinct(select(data, antibiotic)))
antibiotic
1 amikacin
2 amoxicillin-clavulanate
3 amoxycillin
4 ampicillin
5 azithromycin
6 benzylpenicillin
7 cefalotin
8 cefazolin
> penicillins <- c("amoxicillin-clavulanate", "amoxycillin", "ampicillin", "benzylpenicillin")
> ceph1 <- c("cefalotin", "cefazolin")
What I would like to do is then subset the data based on these antibiotic class vectors:
filter(data, antibiotic =(any one of the values in the vector "penicillins")
Thanks to thelatemail for pointing out the way to do this is:
d <- filter(data, antibiotic %in% penicillins)
I would like to analyse the data in a number of ways.
The key analysis (and ggplot output) is:
x = date
y = usage of antibiotic(s) stratified by (drug | class), filtered by hospital
What I'm not clear on now is how to aggregate the data for this sort of thing.
Example:
I want to analyse the use of class "ceph1" across all the hospitals in the district, resulting in (apologies - I know this is not proper code)
x y
Jan-2006 for all in hospitals(usage of cephazolin + usage of cephalotin)
Feb-2006 for all in hospitals(usage of cephazolin + usage of cephalotin)
etc
In the long run, I would also like to be able to pass arguments to a function that lets me select which hospitals and which antibiotic or class of antibiotics to include.
Thanks again - I know this is an order of magnitude more complicated than the original question!
So after lots of trial and error and heaps of reading, I've managed to sort it out.
>str(data)
'data.frame': 23360 obs. of 4 variables:
$ date : Date, format: "2007-09-01" "2012-06-01" ...
$ antibiotic: Factor w/ 41 levels "amikacin","amoxicillin-clavulanate",..: 17 3 19 30 38 20 20 20 7 25 ...
$ usage : num 21.368 36.458 7.226 3.671 0.917 ...
$ hospital : Factor w/ 5 levels "hospital1","hospital2",..: 1 3 2 1 4 1 4 3 5 1 ...
So I can subset the data first:
>library(dplyr)
>penicillins <- c("amoxicillin-clavulanate", "amoxycillin", "ampicillin", "benzylpenicillin")
>d <- filter(data, antibiotic %in% penicillins)
And then make the summary using more of dplyr (thanks, Hadley!)
>d1 <- summarise(group_by(d, date), total = sum(usage))
>d1
Source: local data frame [122 x 2]
date total
(date) (dbl)
1 2006-01-01 1669.177
2 2006-02-01 1901.749
3 2006-03-01 2311.008
4 2006-04-01 1921.436
5 2006-05-01 1594.781
6 2006-06-01 2150.997
7 2006-07-01 2052.517
8 2006-08-01 2132.501
9 2006-09-01 1959.916
10 2006-10-01 1751.667
.. ... ...
>
> qplot(date, total, data = d1) + geom_smooth()
> [scatterplot as desired!]
Next step will be to try and build it all into a function and/or to try and do the subsetting in-line, building on what I've worked out here.
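One possible shape for such a function is sketched here in base R, so that it runs without extra packages. The function name usage_by_date, its arguments and the toy data frame are all hypothetical; only the column names (date, antibiotic, usage, hospital) come from the data shown above.

```r
# Hypothetical helper: total usage per date for the chosen antibiotics and hospitals.
# By default every hospital in the data is included.
usage_by_date <- function(data, antibiotics, hospitals = unique(data$hospital)) {
  keep <- data$antibiotic %in% antibiotics & data$hospital %in% hospitals
  aggregate(usage ~ date, data = data[keep, ], FUN = sum)
}

# Toy data (made-up usage values) with the same columns as the real data set
toy <- data.frame(
  date = rep(as.Date(c("2006-01-01", "2006-02-01")), times = 6),
  antibiotic = rep(rep(c("amikacin", "cefalotin", "cefazolin"), each = 2), times = 2),
  hospital = rep(c("hospital1", "hospital2"), each = 6),
  usage = 1:12,
  stringsAsFactors = FALSE
)

ceph1 <- c("cefalotin", "cefazolin")

# Class total across all hospitals, and for a single hospital
all_hospitals <- usage_by_date(toy, ceph1)
one_hospital <- usage_by_date(toy, ceph1, hospitals = "hospital1")
```

The result has one row per date, so it can be passed straight to qplot() or ggplot() in the same way as d1 above.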

R + ggplot: Order irregular Time Strings for Plot

I have a data frame with two columns. The first is a numerical value, the other is a string describing a time. The time format looks like yyyy-mm-dd--hh-mm-ss-?????? (e.g. 2015-03-04--12-11-35-669696), I don't know what the last 6 digits mean. E.g.
y time
1 4.548 2014-08-11--09-07-44-202586
2 4.548 2014-08-11--09-07-54-442586
3 4.548 2014-08-11--09-08-04-522586
4 4.478 2014-08-11--09-08-14-762586
5 4.431 2014-08-11--09-08-24-522586
6 4.446 2014-08-11--09-08-34-922586
7 4.492 2014-08-11--09-08-44-522586
8 4.508 2014-08-11--09-08-54-442586
9 4.486 2014-08-11--09-09-04-202586
10 4.497 2014-08-11--09-09-14-442586
11 4.461 2014-08-11--09-09-24-202586
I want to plot them with
ggplot(df, aes(x=time, y=y)) + geom_line()
But I have the problem that ggplot doesn't know how to deal with data of class character, and in particular with my time format.
I tried to use AsciiToInt from the package {sfsmisc} to convert the strings to numerical values, but it returns a list of integers for each string (one number for each character, of course).
I can also sort my time strings with mixedsort from the package {gtools}, but I don't know how to apply that to the plot (also keeping in mind the distance between time points).
Another problem is that I don't want every time string to appear as a tick on the x-axis, because I have around 20k rows. Maybe I can solve that problem like in this question, but I cannot check that as long as the first problem occurs.
Can you help me plot such data with the time as a numeric-like value on the x-axis?
I loaded your data from a CSV file called Timedat.csv. First I convert the time column to POSIXct. To make a cleaner graph for test purposes I omit the seconds field; if you want to include the seconds, use the commented-out line instead.
library(ggplot2)
timedat<-read.csv("~/Work/Timedat.csv")
timedat
str(timedat)
> str(timedat)
'data.frame': 11 obs. of 3 variables:
$ X : int 1 2 3 4 5 6 7 8 9 10 ...
$ y : num 4.55 4.55 4.55 4.48 4.43 ...
$ time: Factor w/ 11 levels "2014-08-11--09-07-44-202586",..: 1 2 3 4 5 6 7 8 9 10 ...
#timedat$time<-as.POSIXct(as.character(timedat$time),format = "%Y-%m-%d--%H-%M-%S")
timedat$time<-as.POSIXct(as.character(timedat$time),format = "%Y-%m-%d--%H-%M")
qplot(data=timedat,y=y,x=time)+theme_bw()
> timedat
X y time
1 1 4.548 2014-08-11--09-07-44-202586
2 2 4.548 2014-08-11--09-07-54-442586
3 3 4.548 2014-08-11--09-08-04-522586
4 4 4.478 2014-08-11--09-08-14-762586
5 5 4.431 2014-08-11--09-08-24-522586
6 6 4.446 2014-08-11--09-08-34-922586
7 7 4.492 2014-08-11--09-08-44-522586
8 8 4.508 2014-08-11--09-08-54-442586
9 9 4.486 2014-08-11--09-09-04-202586
10 10 4.497 2014-08-11--09-09-14-442586
11 11 4.461 2014-08-11--09-09-24-202586
This produces the following plot with the dates nicely ordered.
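If the sub-second part matters for the spacing of the points, it can be kept as well. The sketch below assumes that the trailing six digits are microseconds, which the question does not confirm; the variable names are illustrative only.

```r
# Example strings in the format from the question
raw <- c("2014-08-11--09-07-44-202586",
         "2014-08-11--09-07-54-442586",
         "2014-08-11--09-08-04-522586")

# Assumption: the last six digits are microseconds, so turn the final "-" into "."
with_fraction <- sub("-([0-9]{6})$", ".\\1", raw)

# %OS reads seconds including a fractional part; tz is fixed to keep the result reproducible
parsed <- as.POSIXct(with_fraction, format = "%Y-%m-%d--%H-%M-%OS", tz = "UTC")

# Gaps between consecutive readings, in seconds
gaps <- as.numeric(diff(parsed), units = "secs")
```

The parsed column can be used as the x aesthetic directly, and ggplot2 then chooses the axis ticks itself instead of drawing one per row.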

Subset first time of two hour time intervals

I have a problem with subsetting times.
1) I would like to filter my data by time intervals, where one interval is around midnight and the other around midday.
2) I need only the first time that occurs in each interval.
Data frame looks like this
DATE v
1 2007-07-28 00:41:00 1
2 2007-07-28 02:00:12 5
3 2007-07-28 02:01:19 3
4 2007-07-28 02:44:08 2
5 2007-07-28 04:02:18 3
6 2007-07-28 09:59:16 4
7 2007-07-28 11:21:32 8
8 2007-07-28 11:58:40 5
9 2007-07-28 13:20:52 4
10 2007-07-28 13:21:52 9
11 2007-07-28 14:41:32 3
12 2007-07-28 15:19:00 9
13 2007-07-29 01:01:48 2
14 2007-07-29 01:41:08 5
Result should look like this
DATE v
2 2007-07-28 02:00:12 5
9 2007-07-28 13:20:52 4
13 2007-07-29 01:01:48 2
Reproducible code
DATE<-c("2007-07-28 00:41:00", "2007-07-28 02:00:12","2007-07-28 02:01:19", "2007-07-28 02:44:08", "2007-07-28 04:02:18","2007-07-28 09:59:16", "2007-07-28 11:21:32", "2007-07-28 11:58:40","2007-07-28 13:20:52", "2007-07-28 13:21:52", "2007-07-28 14:41:32","2007-07-28 15:19:00", "2007-07-29 01:01:48", "2007-07-29 01:41:08")
v<-c(1,5,3,2,3,4,8,5,4,9,3,9,2,5)
hyljes<-data.frame(cbind(DATE,v))
df <-subset(hyljes, format(as.POSIXct(DATE),"%H") %in% c ("01":"02","13":"14"))
There's a problem with making the intervals. It allows me to subset hours "13":"14" but not "01":"02". Is there a reasonable solution for that?
I also haven't found a way to get only the first element from each interval.
Any help is appreciated!
Try
hyljes[c(1, head(cumsum(rle(as.POSIXlt(hyljes$DATE)$hour < 13)$lengths) + 1, -1)), ]
## DATE v
## 1 2007-07-28 00:41:00 1
## 9 2007-07-28 13:20:52 4
## 13 2007-07-29 01:01:48 2
as.POSIXlt(hyljes$DATE)$hour < 13 tells you whether each time falls before 13:00, which separates the midnight interval from the midday one
rle(...)$lengths gives you the lengths of the runs of TRUEs and FALSEs
cumsum of the above + 1 gives you the index of the first record in each following run
head(..., -1) trims off the last element, which points past the end of the data
c(1, ...) adds back the first index, which should always be included by definition
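The steps above can be traced one at a time on the timestamps from the question. This is just the same one-liner unpacked into intermediate variables (the variable names are mine), with the time zone fixed so the result is reproducible.

```r
DATE <- c("2007-07-28 00:41:00", "2007-07-28 02:00:12", "2007-07-28 02:01:19",
          "2007-07-28 02:44:08", "2007-07-28 04:02:18", "2007-07-28 09:59:16",
          "2007-07-28 11:21:32", "2007-07-28 11:58:40", "2007-07-28 13:20:52",
          "2007-07-28 13:21:52", "2007-07-28 14:41:32", "2007-07-28 15:19:00",
          "2007-07-29 01:01:48", "2007-07-29 01:41:08")

# Hour of day for each record
hour <- as.POSIXlt(DATE, tz = "UTC")$hour

# TRUE for records before 13:00: 8 TRUEs, then 4 FALSEs, then 2 TRUEs
before_13 <- hour < 13

# Lengths of the consecutive runs: 8, 4, 2
runs <- rle(before_13)

# Index of the last record in each run: 8, 12, 14
run_ends <- cumsum(runs$lengths)

# The first record of each run is one past the end of the previous run;
# drop the final value (15, beyond the data) and add index 1 for the first run
first_idx <- c(1, head(run_ends + 1, -1))   # 1, 9, 13
```

Indexing the data frame with first_idx returns rows 1, 9 and 13, as in the output shown above.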
There are lots of little manipulations in here, but the end result gets you where you need to be:
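library(plyr)  # provides ddply
# hyljes is the data frame from the question
hyljes$DATE <- as.POSIXct(hyljes$DATE, format = "%Y-%m-%d %H:%M:%S")
# store the hour as a number so that the comparison with 12 is numeric
hyljes$hour <- as.numeric(strftime(hyljes$DATE, '%H'))
hyljes$date <- strftime(hyljes$DATE, '%Y-%m-%d')
hyljes$am_pm <- ifelse(hyljes$hour < 12, 'am', 'pm')
# earliest timestamp within each date / half-day combination
mins <- ddply(hyljes, .(date, am_pm), summarise, min = min(DATE))$min
hyljes[hyljes$DATE %in% mins, 1:2]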
DATE v
1 2007-07-28 00:41:00 1
9 2007-07-28 13:20:52 4
13 2007-07-29 01:01:48 2

How to prepare my data for a factorial repeated measures analysis?

Currently, my dataframe is in wide format and I want to do a factorial repeated measures analysis with two between-subject factors (sex & org) and a within-subject factor (tasktype). Below I've illustrated how my data looks with a sample (the actual dataset has a lot more variables). The variables starting with '1_' and '2_' belong to measurements during task 1 and task 2 respectively. This means that 1_FD_H_org and 2_FD_H_org are the same measurement, taken during task 1 and task 2 respectively.
id sex org task1 task2 1_FD_H_org 1_FD_H_text 2_FD_H_org 2_FD_H_text 1_apv 2_apv
2 F T Correct 2 69.97 68.9 116.12 296.02 10 27
6 M T Correct 2 53.08 107.91 73.73 333.15 16 21
7 M T Correct 2 13.82 30.9 31.8 78.07 4 9
8 M T Correct 2 42.96 50.01 88.81 302.07 4 24
9 F H Correct 3 60.35 102.9 39.81 96.6 15 10
10 F T Incorrect 3 78.61 80.42 55.16 117.57 20 17
I want to analyze whether there is a difference between the two tasks on e.g. FD_H_org for the different groups/conditions (sex & org).
How do I reshape my data so I can analyze it with a model like this?
ezANOVA(data=df, dv=.(FD_H_org), wid=.(id), between=.(sex, org), within=.(task))
I think that the correct format of my data should like this:
id sex org task outcome FD_H_org FD_H_text apv
2 F T 1 Correct 69.97 68.9 10
2 F T 2 2 116.12 296.02 27
6 M T 1 Correct 53.08 107.91 16
6 M T 2 2 73.73 333.15 21
But I'm not sure. I tried to achieve this with the reshape2 package but couldn't figure out how to do it. Can anybody help?
I think you probably need to rebuild it by binding the two subsets of columns together with rbind(). The only issue here was that your outcomes have different data types, so I forced them both to text:
require(plyr)
dt<-read.table(file="dt.txt",header=TRUE,sep=" ") # this was to bring in your data
newtab=rbind(
ddply(dt,.(id,sex,org),summarize, task=1, outcome=as.character(task1), FD_H_org=X1_FD_H_org, FD_H_text=X1_FD_H_text, apv=X1_apv),
ddply(dt,.(id,sex,org),summarize, task=2, outcome=as.character(task2), FD_H_org=X2_FD_H_org, FD_H_text=X2_FD_H_text, apv=X2_apv)
)
newtab[order(newtab$id),]
id sex org task outcome FD_H_org FD_H_text apv
1 2 F T 1 Correct 69.97 68.90 10
7 2 F T 2 2 116.12 296.02 27
2 6 M T 1 Correct 53.08 107.91 16
8 6 M T 2 2 73.73 333.15 21
3 7 M T 1 Correct 13.82 30.90 4
9 7 M T 2 2 31.80 78.07 9
4 8 M T 1 Correct 42.96 50.01 4
10 8 M T 2 2 88.81 302.07 24
5 9 F H 1 Correct 60.35 102.90 15
11 9 F H 2 3 39.81 96.60 10
6 10 F T 1 Incorrect 78.61 80.42 20
12 10 F T 2 3 55.16 117.57 17
EDIT - obviously you don't need plyr for this (and it may slow it down) unless you're doing further transformations. This is the code with no non-standard dependencies:
newcolnames<-c("id","sex","org","task","outcome","FD_H_org","FD_H_text","apv")
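# Column 3 is selected twice; the duplicate (named org.1) is overwritten with the task number
t1<-dt[,c(1,2,3,3,4,6,7,10)]   # task 1: task1, X1_FD_H_org, X1_FD_H_text, X1_apv
t1$org.1<-1
t1$task1<-as.character(t1$task1)
colnames(t1)<-newcolnames
t2<-dt[,c(1,2,3,3,5,8,9,11)]   # task 2: task2, X2_FD_H_org, X2_FD_H_text, X2_apv
t2$org.1<-2
t2$task2<-as.character(t2$task2)
colnames(t2)<-newcolnames
newt<-rbind(t1,t2)
newt[order(newt$id),]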
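The same wide-to-long conversion can also be sketched with base R's reshape(), which pairs up the task 1 and task 2 columns explicitly. This is an alternative to the rbind() approach, not what the answer above used; the data frame below retypes the sample rows from the question, with the X prefix that read.table adds to names starting with a digit.

```r
# Sample rows from the question
dt <- data.frame(
  id = c(2, 6, 7, 8, 9, 10),
  sex = c("F", "M", "M", "M", "F", "F"),
  org = c("T", "T", "T", "T", "H", "T"),
  task1 = c("Correct", "Correct", "Correct", "Correct", "Correct", "Incorrect"),
  task2 = c(2, 2, 2, 2, 3, 3),
  X1_FD_H_org = c(69.97, 53.08, 13.82, 42.96, 60.35, 78.61),
  X1_FD_H_text = c(68.9, 107.91, 30.9, 50.01, 102.9, 80.42),
  X2_FD_H_org = c(116.12, 73.73, 31.8, 88.81, 39.81, 55.16),
  X2_FD_H_text = c(296.02, 333.15, 78.07, 302.07, 96.6, 117.57),
  X1_apv = c(10, 16, 4, 4, 15, 20),
  X2_apv = c(27, 21, 9, 24, 10, 17),
  stringsAsFactors = FALSE
)

# The two outcome columns have different types, so force both to text
dt$task2 <- as.character(dt$task2)

# Each element of 'varying' lists the task 1 and task 2 column of one measure;
# 'v.names' gives the name of the combined column, in the same order
long <- reshape(
  dt, direction = "long", idvar = "id",
  varying = list(c("task1", "task2"),
                 c("X1_FD_H_org", "X2_FD_H_org"),
                 c("X1_FD_H_text", "X2_FD_H_text"),
                 c("X1_apv", "X2_apv")),
  v.names = c("outcome", "FD_H_org", "FD_H_text", "apv"),
  timevar = "task", times = 1:2
)
long <- long[order(long$id, long$task), ]
```

Before passing the result to a repeated measures routine, id and task will probably need to be converted to factors with factor().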

Insert incremental time between two rows determined by column entry

Sorry, me again. I will keep on trying, but I would like help in case I can't figure it out within the next hour.
My data looks like this:
B<-data.frame(ID=c(1,1,1,1,1,1,1,1,2,2,2,2,2,2,2),EVID=c(1,1,1,0,1,2,2,1,1,1,2,2,1,1,1),VALUE=seq(15))
B$TIME<-c(Sys.time()+6*3600*(seq_len(nrow(B))-1))
Actually the time is more variable, and each ID may have multiple EVID of 2.
I wanted to add one hour increments between the times for EVID=2 for as many hours as they are apart, i.e., for each pair of EVID=2, I add one hour until the time is within one hour to the second EVID=2 in the pair, so I can get something like this:
(value and ID are just duplicate previous rows)
ID EVID VALUE TIME
1 1 1 1 2013-05-31 07:51:09
2 1 1 2 2013-05-31 13:51:09
3 1 1 3 2013-05-31 19:51:09
4 1 0 4 2013-06-01 01:51:09
5 1 1 5 2013-06-01 07:51:09
6 1 2 6 2013-06-01 13:51:09
6 1 2 6 2013-06-01 14:51:09
6 1 2 6 2013-06-01 15:51:09
6 1 2 6 2013-06-01 16:51:09
6 1 2 6 2013-06-01 17:51:09
6 1 2 6 2013-06-01 18:51:09
7 1 2 7 2013-06-01 19:51:09
8 1 1 8 2013-06-02 01:51:09
9 2 1 9 2013-06-02 07:51:09
10 2 1 10 2013-06-02 13:51:09
11 2 2 11 2013-06-02 19:51:09
11 2 2 11 2013-06-02 20:51:09
11 2 2 11 2013-06-02 21:51:09
11 2 2 11 2013-06-02 22:51:09
11 2 2 11 2013-06-02 23:51:09
11 2 2 11 2013-06-03 00:51:09
12 2 2 12 2013-06-03 01:51:09
13 2 1 13 2013-06-03 07:51:09
14 2 1 14 2013-06-03 13:51:09
15 2 1 15 2013-06-03 19:51:09
Below is my brainstorm/attempt:
library(data.table)
BDT <- data.table(row=1:nrow(B), B, key="ID")
BDT[,list(row,EVID,c(EVID)==2)]
attach(B)
newB<-BDT[c(EVID)==2,list(row=row+1,ID=ID,EVID=EVID,VALUE=VALUE,TIME=head(TIME+3600,-1))]
finalB<-rbind(BDT,newB)[order(EVID,decreasing=TRUE)][order(row)][,-1,with=FALSE]
However, this adds one row of Time+1 hour to each EVID=2 which is not what I desired.
The next thing I tried duplicates every row after the first, which is not what I wanted, but has the advantage of sparing me from typing out all the names of the columns (I have about 32)
newB<-B[c(1,rep(2:nrow(B),each=2)),]
## My wild guess -- as.numeric(head(TIME))-as.numeric(tail(TIME)))/3600 doesn't work. I know it says that from row 2 to last row, repeat each row twice
newB[c(FALSE,TRUE),"EVID"]<-2
newB[c(FALSE,TRUE),"TIME"]<-newB[c(FALSE,TRUE),"TIME"]+3600
Thank you for any feedback.
=================================================================
eddie's code works well with my example, which I thought was a good representation but my actual data keep getting
error in seq.int(...) wrong sign in 'by' argument
(...) varies depending on what I was trying
I have a relatively large data set, and the column that I use as the ID in the example is in the middle of the data table. I see even from my small sample data that if I place the ID along with the other names in the list, R will treat item 2 as having one more column than item 1 in the rbind. But if I don't include it in the list so that I can use by=ID, R complains that the names are in a different order. If I do not list one of the unimportant columns at the beginning of the data, R says item 2 has one column fewer than item 1!
I thought that perhaps my error comes from my time being not exactly hours apart, but by test runs I see that small differences are tolerated, and rounding, either to hour or doing integers, doesn't help.
I tried using length.out, ignoring the warning
Warning message: In .rbind.data.table(...) : Argument 2 has names in
a different order. Columns will be bound by name for consistency with
base. Alternatively, you can drop names (by using an unnamed list) and
the columns will then be joined by position. Or, set use.names=FALSE.
But then the code does not add rows between the 2's except at the end, where it adds too many!
What am I doing wrong? I've been pulling an all-nighter for this :(
OK so when I rearrange the original data I can get rid of the warnings. However, the insertions are still happening at the end of the data only and they were too many.
This should work:
library(data.table)
dt = data.table(B)
dt[, TIME := as.POSIXct(TIME)]
rbind(dt, dt[EVID == 2,
             list(EVID=EVID[1],
                  VALUE=VALUE[1],
                  TIME=seq.POSIXt(TIME[1], TIME[2], "hour")),
             by = ID])[!duplicated(paste(ID,EVID,TIME))][order(ID, TIME)]
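The code above uses TIME[1] and TIME[2], so it covers one pair of EVID == 2 rows per ID, and seq.POSIXt will complain about the sign of 'by' if the second time is earlier than the first. For data with several pairs per ID, a base R sketch could look like the following. It assumes that the EVID == 2 rows of each ID pair up in time order (1st with 2nd, 3rd with 4th, and so on), which the question does not spell out; expand_pairs is a made-up name, and the start time is fixed so the example is reproducible.

```r
# Hypothetical helper: for each pair of EVID == 2 rows within an ID, insert copies
# of the first row at one-hour steps strictly between the two timestamps
expand_pairs <- function(df) {
  df <- df[order(df$ID, df$TIME), ]           # sorting avoids a negative step in seq()
  pieces <- lapply(split(df, df$ID), function(g) {
    idx <- which(g$EVID == 2)
    n_pairs <- length(idx) %/% 2              # an unpaired trailing row is left alone
    extra <- lapply(seq_len(n_pairs), function(k) {
      first <- idx[2 * k - 1]
      second <- idx[2 * k]
      times <- seq(g$TIME[first], g$TIME[second], by = "hour")
      times <- times[times > g$TIME[first] & times < g$TIME[second]]
      if (length(times) == 0) return(NULL)
      rows <- g[rep(first, length(times)), ]  # copies all columns without naming them
      rows$TIME <- times
      rows
    })
    out <- do.call(rbind, c(list(g), extra))
    out[order(out$TIME), ]
  })
  res <- do.call(rbind, pieces)
  rownames(res) <- NULL
  res
}

# Example data as in the question, but with a fixed start time instead of Sys.time()
B <- data.frame(ID = c(1,1,1,1,1,1,1,1,2,2,2,2,2,2,2),
                EVID = c(1,1,1,0,1,2,2,1,1,1,2,2,1,1,1),
                VALUE = seq(15))
B$TIME <- as.POSIXct("2013-05-31 07:51:09", tz = "UTC") + 6 * 3600 * (seq_len(nrow(B)) - 1)

newB <- expand_pairs(B)
```

Because the inserted rows are copies of an existing row, none of the roughly 32 column names have to be typed out.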
