R - aggregate daily to weekly with start date as Saturday

I have daily data and I want to convert to weekly with Week starting on Saturday.
date value
1 11/5/2016 30
2 11/6/2016 20
3 11/7/2016 12
4 11/8/2016 22
5 11/9/2016 48
6 11/10/2016 50
7 11/11/2016 47
8 11/12/2016 12
9 11/13/2016 19
10 11/14/2016 31
11 11/15/2016 43
12 11/16/2016 26
13 11/17/2016 33
14 11/18/2016 36
15 11/19/2016 14
16 11/20/2016 15
17 11/21/2016 36
18 11/22/2016 38
19 11/23/2016 28
20 11/24/2016 21
21 11/25/2016 13
I tried the following, but it assumes the week starts on Monday:
data <- as.xts(df$value, order.by = as.Date(df$date))
weekly <- apply.weekly(data, sum)
I want the output aggregated with Saturday as the start of the week.

The order.by argument in your xts call does not supply the format needed to parse the dates into Date class:
data <- xts(df$value, order.by = as.Date(df$date, '%m/%d/%Y'))
tapply(data[,1], cumsum(format(index(data), '%w')==6), sum)
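Here is the same idea as a self-contained sketch with a short inline sample: "%w" formats the weekday as 0-6 (Sunday = 0), so "6" marks a Saturday, and cumsum() of those flags gives a running week number for tapply() to group on.

```r
library(xts)

# Short inline sample (first four rows of the question's data)
df <- data.frame(date  = c("11/5/2016", "11/6/2016", "11/7/2016", "11/8/2016"),
                 value = c(30, 20, 12, 22))

data <- xts(df$value, order.by = as.Date(df$date, "%m/%d/%Y"))

# "%w" == "6" flags Saturdays; cumsum() turns the flags into week numbers
week_id <- cumsum(format(index(data), "%w") == "6")
tapply(coredata(data), week_id, sum)
```

All four sample days fall in the week starting Saturday 2016-11-05, so this returns a single sum of 84; on the full series each Saturday opens a new group.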

How do I convert this into a zoo object in R?

I have a very big dataset which is a data frame containing Date/Time in 1 column and closing price in the next.
I am using the following code:
read.zoo(df,tz="GMT",format = "%d.%m.%Y %H:%M")
but this is shown:
Error in read.zoo(df, tz = "GMT", format = "%d.%m.%Y %H:%M") :
index has 28290 bad entries at data rows: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42...
What should I do?
The image shows part of my data frame.
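Without seeing the data frame itself the cause can't be confirmed, but that error usually means the index strings don't match the supplied format (or the index column is a factor rather than character). As a hedged sketch, with a toy frame mimicking the described layout (the column names time/close are made up), the call succeeds once the strings really are in %d.%m.%Y %H:%M form:

```r
library(zoo)

# Toy frame: character timestamps in one column, closing price in the next.
# stringsAsFactors = FALSE matters -- a factor index column is a common
# source of the "bad entries" error.
df <- data.frame(time  = c("01.02.2015 10:00", "01.02.2015 10:01"),
                 close = c(1.13, 1.14),
                 stringsAsFactors = FALSE)

z <- read.zoo(df, index.column = 1, tz = "GMT", format = "%d.%m.%Y %H:%M")
```

If head(df[[1]]) shows a different separator or day/month order, adjust the format string to match before calling read.zoo.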

How to update and replace part of old data

I want to merge the data frames OldData and NewData.
In this case, Nov-2015 and Dec-2015 are present in both data frames.
Since NewData is the most accurate update available, I want to update the values for Nov-2015 and Dec-2015 using NewData, and of course add the Jan-2016 and Feb-2016 records as well.
Can anyone help?
OldData
Month Value
1 Jan-2015 3
2 Feb-2015 76
3 Mar-2015 31
4 Apr-2015 45
5 May-2015 99
6 Jun-2015 95
7 Jul-2015 18
8 Aug-2015 97
9 Sep-2015 61
10 Oct-2015 7
11 Nov-2015 42
12 Dec-2015 32
NewData
Month Value
1 Nov-2015 88
2 Dec-2015 45
3 Jan-2016 32
4 Feb-2016 11
Here is the output I want:
JoinData
Month Value
1 Jan-2015 3
2 Feb-2015 76
3 Mar-2015 31
4 Apr-2015 45
5 May-2015 99
6 Jun-2015 95
7 Jul-2015 18
8 Aug-2015 97
9 Sep-2015 61
10 Oct-2015 7
11 Nov-2015 88
12 Dec-2015 45
13 Jan-2016 32
14 Feb-2016 11
Thanks to @akrun, the problem is solved; the following code works smoothly:
rbindlist(list(OldData, NewData))[!duplicated(Month, fromLast=TRUE)]
Update: Now let's extend the problem a little.
Suppose our OldData and NewData have another column called "Type".
How do we merge/update this time?
> OldData
Month Type Value
1 2015-01 A 3
2 2015-02 A 76
3 2015-03 A 31
4 2015-04 A 45
5 2015-05 A 99
6 2015-06 A 95
7 2015-07 A 18
8 2015-08 A 97
9 2015-09 A 61
10 2015-10 A 7
11 2015-11 B 42
12 2015-12 C 32
13 2015-12 D 77
> NewData
Month Type Value
1 2015-11 A 88
2 2015-12 C 45
3 2015-12 D 22
4 2016-01 A 32
5 2016-02 A 11
JoinData should update all values from NewData, as follows:
> JoinData
Month Type Value
1 2015-01 A 3
2 2015-02 A 76
3 2015-03 A 31
4 2015-04 A 45
5 2015-05 A 99
6 2015-06 A 95
7 2015-07 A 18
8 2015-08 A 97
9 2015-09 A 61
10 2015-10 A 7
11 2015-11 B 42
12 2015-11 A 88 (originally not included, added from the NewData)
12 2015-12 C 45 (Updated the value by NewData)
13 2015-12 D 22 (Updated the value by NewData)
14 2016-01 A 32 (newly added from NewData)
15 2016-02 A 11 (newly added from NewData)
Thanks to @akrun, I got the solution for the second question as well.
Thanks to everyone here for the help!
Here is the answer:
d1 <- merge(OldData, NewData, by = c("Month", "Type"), all = TRUE)
d2 <- transform(d1, Value.x = ifelse(!is.na(Value.y), Value.y, Value.x))[-4]
d2[!duplicated(d2[1:2], fromLast = TRUE), ]
Here is an option using data.table (similar to the approach @thelatemail mentioned in the comments):
library(data.table)
rbindlist(list(OldData, NewData))[!duplicated(Month, fromLast=TRUE)]
Or
rbindlist(list(OldData, NewData))[,if(.N >1) .SD[.N] else .SD, Month]
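For the two-key (Month + Type) variant, here is a hedged data.table sketch on a toy subset of the tables, using unique() with fromLast = TRUE (the same idea as the duplicated() call above, but deduplicating on both key columns):

```r
library(data.table)

# Toy subset of the two tables, keyed by Month + Type
OldData <- data.table(Month = c("2015-11", "2015-12"),
                      Type  = c("B", "C"),
                      Value = c(42, 32))
NewData <- data.table(Month = c("2015-12", "2016-01"),
                      Type  = c("C", "A"),
                      Value = c(45, 32))

# Stack both tables, then keep the LAST row of each Month/Type pair,
# so a NewData row overwrites any matching OldData row
JoinData <- unique(rbindlist(list(OldData, NewData)),
                   by = c("Month", "Type"), fromLast = TRUE)
setorder(JoinData, Month, Type)
```

On this subset, the 2015-12/C row keeps NewData's value 45, 2015-11/B survives from OldData, and 2016-01/A is appended.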

insert new rows to the time series data, with date added automatically

I have a time-series data frame that looks like:
TS.1
2015-09-01 361656.7
2015-09-02 370086.4
2015-09-03 346571.2
2015-09-04 316616.9
2015-09-05 342271.8
2015-09-06 361548.2
2015-09-07 342609.2
2015-09-08 281868.8
2015-09-09 297011.1
2015-09-10 295160.5
2015-09-11 287926.9
2015-09-12 323365.8
Now, what I want to do is add some new data points (rows) to the existing data frame, say,
320123.5
323521.7
How can I add the corresponding date to each row? The dates simply continue sequentially from the last row.
Is there any package that can do this automatically, so that the only thing I do is insert the new data points?
Here's some play data:
df <- data.frame(date = seq(as.Date("2015-01-01"), as.Date("2015-01-31"), "days"), x = seq(31))
new.x <- c(32, 33)
This adds the extra observations along with the proper sequence of dates:
new.df <- data.frame(date=seq(max(df$date) + 1, max(df$date) + length(new.x), "days"), x=new.x)
Then just rbind them to get your expanded data frame:
rbind(df, new.df)
date x
1 2015-01-01 1
2 2015-01-02 2
3 2015-01-03 3
4 2015-01-04 4
5 2015-01-05 5
6 2015-01-06 6
7 2015-01-07 7
8 2015-01-08 8
9 2015-01-09 9
10 2015-01-10 10
11 2015-01-11 11
12 2015-01-12 12
13 2015-01-13 13
14 2015-01-14 14
15 2015-01-15 15
16 2015-01-16 16
17 2015-01-17 17
18 2015-01-18 18
19 2015-01-19 19
20 2015-01-20 20
21 2015-01-21 21
22 2015-01-22 22
23 2015-01-23 23
24 2015-01-24 24
25 2015-01-25 25
26 2015-01-26 26
27 2015-01-27 27
28 2015-01-28 28
29 2015-01-29 29
30 2015-01-30 30
31 2015-01-31 31
32 2015-02-01 32
33 2015-02-02 33
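The two steps can be wrapped in a small helper; append_obs is a hypothetical name, and the sketch assumes daily frequency and the date/x columns from the play data above:

```r
# Hypothetical helper: extend a daily data frame with new values,
# generating the dates automatically from the last existing row
append_obs <- function(df, new.x) {
  new.dates <- seq(max(df$date) + 1, by = "day", length.out = length(new.x))
  rbind(df, data.frame(date = new.dates, x = new.x))
}

df <- data.frame(date = seq(as.Date("2015-01-01"), as.Date("2015-01-31"), "days"),
                 x = seq(31))
out <- append_obs(df, c(32, 33))
tail(out, 2)  # 2015-02-01 / 32 and 2015-02-02 / 33
```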

Transforming long format data to short format by segmenting dates that include redundant observations

I have a data set that is long format and includes exact date/time measurements of 3 scores on a single test administered between 3 and 5 times per year.
ID Date Fl Er Cmp
1 9/24/2010 11:38 15 2 17
1 1/11/2011 11:53 39 11 25
1 1/15/2011 11:36 39 11 39
1 3/7/2011 11:28 95 58 2
2 10/4/2010 14:35 35 9 6
2 1/7/2011 13:11 32 7 8
2 3/7/2011 13:11 79 42 30
3 10/12/2011 13:22 17 3 18
3 1/19/2012 14:14 45 15 36
3 5/8/2012 11:55 29 6 11
3 6/8/2012 11:55 74 37 7
4 9/14/2012 9:15 62 28 18
4 1/24/2013 9:51 82 45 9
4 5/21/2013 14:04 135 87 17
5 9/12/2011 11:30 98 61 18
5 9/15/2011 13:23 55 22 9
5 11/15/2011 11:34 98 61 17
5 1/9/2012 11:32 55 22 17
5 4/20/2012 11:30 23 4 17
I need to transform this data to short format with time bands based on month (i.e. Fall=August-October; Winter=January-February; Spring=March-May). Some bands will include more than one observation per participant, and as such, will need a "spill over" band. An example transformation for the Fl scores below.
ID Fall1Fl Fall2Fl Winter1Fl Winter2Fl Spring1Fl Spring2Fl
1 15 NA 39 39 95 NA
2 35 NA 32 NA 79 NA
3 17 NA 45 NA 28 74
4 62 NA 82 NA 135 NA
5 98 55 55 NA 23 NA
Notice that dates which are "redundant" (i.e. more than one Aug-Oct observation) spill over into the Fall2Fl column. Dates that occur outside of the desired bands (i.e. November, December, June, July) should be deleted. The final data set should have additional columns that include Fl, Er, and Cmp.
Any help would be appreciated!
(Link to .csv file with long data http://mentor.coe.uh.edu/Data_Example_Long.csv )
This seems to do what you are looking for, but doesn't exactly match your desired output. I haven't checked whether the discrepancy lies in your sample desired output or in my transformations, but you should be able to follow the code to see how each transformation was made.
## month() and year() below come from the "lubridate" package
library(lubridate)

## Convert dates to actual date formats
mydf$Date <- strptime(gsub("/", "-", mydf$Date), format="%m-%d-%Y %H:%M")
## Factor the months so we can get the "seasons" that you want
Months <- factor(month(mydf$Date), levels=1:12)
levels(Months) <- list(Fall = c(8:10),
Winter = c(1:2),
Spring = c(3:5),
Other = c(6, 7, 11, 12))
mydf$Seasons <- Months
## Drop the "Other" seasons
mydf <- mydf[!mydf$Seasons == "Other", ]
## Add a "Year" column
mydf$Year <- year(mydf$Date)
## Add a "Times" column
mydf$Times <- as.numeric(ave(as.character(mydf$Seasons),
mydf$ID, mydf$Year, FUN = seq_along))
## Load "reshape2" and use `dcast` on just one variable.
## Repeat for other variables by changing the "value.var"
library(reshape2)
dcast(mydf, ID ~ Seasons + Times, value.var="Fluency")
# ID Fall_1 Fall_2 Winter_1 Winter_2 Spring_2 Spring_3
# 1 1 15 NA 39 39 NA 95
# 2 2 35 NA 32 NA 79 NA
# 3 3 17 NA 45 NA 29 NA
# 4 4 62 NA 82 NA 135 NA
# 5 5 98 55 55 NA 23 NA
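The season-banding trick at the core of the answer can be checked in isolation; this sketch uses three inline dates instead of the linked CSV:

```r
library(lubridate)

dates  <- as.Date(c("2010-09-24", "2011-01-11", "2011-06-08"))
Months <- factor(month(dates), levels = 1:12)

# Relabelling with a named list collapses the twelve months into four bands
levels(Months) <- list(Fall   = 8:10,
                       Winter = 1:2,
                       Spring = 3:5,
                       Other  = c(6, 7, 11, 12))
as.character(Months)  # "Fall" "Winter" "Other"
```

The June date lands in "Other" and would be dropped by the subsetting step that follows in the answer.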

How to identify the records that belong to a certain time interval when I know the start and end records of that interval? (R)

So, here is my problem. I have a dataset of locations of radiotagged hummingbirds I’ve been following as part of my thesis. As you might imagine, they fly fast so there were intervals when I lost track of where they were until I eventually found them again.
Now I am trying to identify the segments where the bird was followed continuously (i.e., the intervals between “Lost” periods).
ID Type TimeStart TimeEnd Limiter Starter Ender
1 Observed 6:45:00 6:45:00 NO Start End
2 Lost 6:45:00 5:31:00 YES NO NO
3 Observed 5:31:00 5:31:00 NO Start NO
4 Observed 9:48:00 9:48:00 NO NO NO
5 Observed 10:02:00 10:02:00 NO NO NO
6 Observed 10:18:00 10:18:00 NO NO NO
7 Observed 11:00:00 11:00:00 NO NO NO
8 Observed 13:15:00 13:15:00 NO NO NO
9 Observed 13:34:00 13:34:00 NO NO NO
10 Observed 13:43:00 13:43:00 NO NO NO
11 Observed 13:52:00 13:52:00 NO NO NO
12 Observed 14:25:00 14:25:00 NO NO NO
13 Observed 14:46:00 14:46:00 NO NO End
14 Lost 14:46:00 10:47:00 YES NO NO
15 Observed 10:47:00 10:47:00 NO Start NO
16 Observed 10:57:00 11:00:00 NO NO NO
17 Observed 11:10:00 11:10:00 NO NO NO
18 Observed 11:19:00 11:27:55 NO NO NO
19 Observed 11:28:05 11:32:00 NO NO NO
20 Observed 11:45:00 12:09:00 NO NO NO
21 Observed 11:51:00 11:51:00 NO NO NO
22 Observed 12:11:00 12:11:00 NO NO NO
23 Observed 13:15:00 13:15:00 NO NO End
24 Lost 13:15:00 7:53:00 YES NO NO
25 Observed 7:53:00 7:53:00 NO Start NO
26 Observed 8:48:00 8:48:00 NO NO NO
27 Observed 9:25:00 9:25:00 NO NO NO
28 Observed 9:26:00 9:26:00 NO NO NO
29 Observed 9:32:00 9:33:25 NO NO NO
30 Observed 9:33:35 9:33:35 NO NO NO
31 Observed 9:42:00 9:42:00 NO NO NO
32 Observed 9:44:00 9:44:00 NO NO NO
33 Observed 9:48:00 9:48:00 NO NO NO
34 Observed 9:48:30 9:48:30 NO NO NO
35 Observed 9:51:00 9:51:00 NO NO NO
36 Observed 9:54:00 9:54:00 NO NO NO
37 Observed 9:55:00 9:55:00 NO NO NO
38 Observed 9:57:00 10:01:00 NO NO NO
39 Observed 10:02:00 10:02:00 NO NO NO
40 Observed 10:04:00 10:04:00 NO NO NO
41 Observed 10:06:00 10:06:00 NO NO NO
42 Observed 10:20:00 10:33:00 NO NO NO
43 Observed 10:34:00 10:34:00 NO NO NO
44 Observed 10:39:00 10:39:00 NO NO End
Note: When there is a “Start” and an “End” in the same row it’s because the non-lost period consists only of that record.
I was able to identify the records that start or end these “non-lost” periods (under the columns “Starter” and “Ender”), but now I want to be able to identify those periods by giving them unique identifiers (period A,B,C or 1,2,3, etc).
Ideally, the name of the identifier would be the ID of the start point for that period (i.e., ID[Starter == "Start"]).
I'm looking for something like this:
ID Type TimeStart TimeEnd Limiter Starter Ender Period
1 Observed 6:45:00 6:45:00 NO Start End 1
2 Lost 6:45:00 5:31:00 YES NO NO Lost
3 Observed 5:31:00 5:31:00 NO Start NO 3
4 Observed 9:48:00 9:48:00 NO NO NO 3
5 Observed 10:02:00 10:02:00 NO NO NO 3
6 Observed 10:18:00 10:18:00 NO NO NO 3
7 Observed 11:00:00 11:00:00 NO NO NO 3
8 Observed 13:15:00 13:15:00 NO NO NO 3
9 Observed 13:34:00 13:34:00 NO NO NO 3
10 Observed 13:43:00 13:43:00 NO NO NO 3
11 Observed 13:52:00 13:52:00 NO NO NO 3
12 Observed 14:25:00 14:25:00 NO NO NO 3
13 Observed 14:46:00 14:46:00 NO NO End 3
14 Lost 14:46:00 10:47:00 YES NO NO Lost
15 Observed 10:47:00 10:47:00 NO Start NO 15
16 Observed 10:57:00 11:00:00 NO NO NO 15
17 Observed 11:10:00 11:10:00 NO NO NO 15
18 Observed 11:19:00 11:27:55 NO NO NO 15
19 Observed 11:28:05 11:32:00 NO NO NO 15
20 Observed 11:45:00 12:09:00 NO NO NO 15
21 Observed 11:51:00 11:51:00 NO NO NO 15
22 Observed 12:11:00 12:11:00 NO NO NO 15
23 Observed 13:15:00 13:15:00 NO NO End 15
24 Lost 13:15:00 7:53:00 YES NO NO Lost
Would this be too hard to do in R?
Thanks!
> d <- data.frame(Limiter = rep("NO", 44), Starter = rep("NO", 44), Ender = rep("NO", 44), stringsAsFactors = FALSE)
> d$Starter[c(1, 3, 15, 25)] <- "Start"
> d$Ender[c(1, 13, 23, 44)] <- "End"
> d$Limiter[c(2, 14, 24)] <- "Yes"
> d$Period <- ifelse(d$Limiter == "Yes", "Lost", which(d$Starter == "Start")[cumsum(d$Starter == "Start")])
> d
Limiter Starter Ender Period
1 NO Start End 1
2 Yes NO NO Lost
3 NO Start NO 3
4 NO NO NO 3
5 NO NO NO 3
6 NO NO NO 3
7 NO NO NO 3
8 NO NO NO 3
9 NO NO NO 3
10 NO NO NO 3
11 NO NO NO 3
12 NO NO NO 3
13 NO NO End 3
14 Yes NO NO Lost
15 NO Start NO 15
16 NO NO NO 15
17 NO NO NO 15
18 NO NO NO 15
19 NO NO NO 15
20 NO NO NO 15
21 NO NO NO 15
22 NO NO NO 15
23 NO NO End 15
24 Yes NO NO Lost
25 NO Start NO 25
26 NO NO NO 25
27 NO NO NO 25
28 NO NO NO 25
29 NO NO NO 25
30 NO NO NO 25
31 NO NO NO 25
32 NO NO NO 25
33 NO NO NO 25
34 NO NO NO 25
35 NO NO NO 25
36 NO NO NO 25
37 NO NO NO 25
38 NO NO NO 25
39 NO NO NO 25
40 NO NO NO 25
41 NO NO NO 25
42 NO NO NO 25
43 NO NO NO 25
44 NO NO End 25
