Related
I suspect something similar has been asked before, but I could only find answers for Python and SQL. Please let me know in the comments if this has also been asked for R!
Data
Let's say we have a dataframe like this:
set.seed(1); df <- data.frame(position = 1:20, value = sample(seq(1, 100), 20))
# In case you do not get the same dataframe, see the comment by @Ian Campbell - thanks!
position value
1 1 27
2 2 37
3 3 57
4 4 89
5 5 20
6 6 86
7 7 97
8 8 62
9 9 58
10 10 6
11 11 19
12 12 16
13 13 61
14 14 34
15 15 67
16 16 43
17 17 88
18 18 83
19 19 32
20 20 63
Goal
I'm interested in calculating the average value of n positions and subtracting it from the average value of the next n positions; let's say n = 5 for now.
What I tried
I currently use the method below, but it takes a huge amount of time when applied to a bigger dataframe, so I wonder whether there is a faster approach.
library(dplyr)

calc <- function(pos) {
  this.five <- df %>% slice(pos:(pos + 4))
  next.five <- df %>% slice((pos + 5):(pos + 9))
  differ <- mean(this.five$value) - mean(next.five$value)
  data.frame(dif = differ)
}
df %>%
  group_by(position) %>%
  do(calc(.$position))
That produces the following table:
position dif
<int> <dbl>
1 1 -15.8
2 2 9.40
3 3 37.6
4 4 38.8
5 5 37.4
6 6 22.4
7 7 4.20
8 8 -26.4
9 9 -31
10 10 -35.4
11 11 -22.4
12 12 -22.3
13 13 -0.733
14 14 15.5
15 15 -0.400
16 16 NaN
17 17 NaN
18 18 NaN
19 19 NaN
20 20 NaN
I suspect a data.table approach may be faster.
library(data.table)
setDT(df)
df[,c("roll.position","rollmean") := lapply(.SD,frollmean,n=5,fill=NA, align = "left")]
df[, result := rollmean[.I] - rollmean[.I + 5]]
df[,.(position,value,rollmean,result)]
# position value rollmean result
# 1: 1 27 46.0 -15.8
# 2: 2 37 57.8 9.4
# 3: 3 57 69.8 37.6
# 4: 4 89 70.8 38.8
# 5: 5 20 64.6 37.4
# 6: 6 86 61.8 22.4
# 7: 7 97 48.4 4.2
# 8: 8 62 32.2 -26.4
# 9: 9 58 32.0 -31.0
#10: 10 6 27.2 -35.4
#11: 11 19 39.4 -22.4
#12: 12 16 44.2 NA
#13: 13 61 58.6 NA
#14: 14 34 63.0 NA
#15: 15 67 62.6 NA
#16: 16 43 61.8 NA
#17: 17 88 NA NA
#18: 18 83 NA NA
#19: 19 32 NA NA
#20: 20 63 NA NA
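To back up the speed claim, here is a rough timing sketch (an assumption on my part that the microbenchmark package is installed; absolute numbers will vary with data size and hardware):
library(microbenchmark)
microbenchmark(
  dplyr_do  = df %>% group_by(position) %>% do(calc(.$position)),
  frollmean = {
    # same logic as above, applied to the value column only
    df[, rollmean := frollmean(value, n = 5, fill = NA, align = "left")]
    df[, result := rollmean[.I] - rollmean[.I + 5]]
  },
  times = 10L
)
The gap should widen on larger tables, since the do() version re-slices the data frame twice for every row.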
Data
RNGkind(sample.kind = "Rounding")
set.seed(1); df <- data.frame( position = 1:20,value = sample(seq(1,100), 20))
RNGkind(sample.kind = "default")
I'm trying to take something like this
df <- data.frame(times = c("0915", "0930", "0945", "1000", "1015", "1030", "1045", "1100",
                           "1130", "1145", "1200", "1215", "1245", "1300", "1330", "1345"),
                 values = c(1, 2, 3, 4, 1, 2, 3, 4, 1, 3, 4, 1, 3, 4, 2, 4))
> df
times values
1 0915 1
2 0930 2
3 0945 3
4 1000 4
5 1015 1
6 1030 2
7 1045 3
8 1100 4
9 1130 1
10 1145 3
11 1200 4
12 1215 1
13 1245 3
14 1300 4
15 1330 2
16 1345 4
And turn it into something like this
> df2
times values
1 0930 3
2 1000 7
3 1030 3
4 1100 7
5 1130 NA
6 1200 7
7 1230 NA
8 1300 7
9 1330 NA
10 1400 NA
Essentially, I want to take values measured in 15-minute intervals and convert them into values measured across 30-minute intervals (summing is sufficient for this).
I can think of an acceptable solution if I can be certain I have two 15-minute readings for each half-hourly reading: I could just add elements pairwise and get what I want. But I can't be certain of that in my data set, and as my demo shows, multiple consecutive values can be missing.
So I thought some kind of time recognition was necessary, e.g. recognising that a time falls between 9:15 and 9:30 and summing just those two. I already have a function called hr2dec, which I created to convert these times to decimal, so it looks like this:
> hr2dec(df$times)
[1] 9.25 9.50 9.75 10.00 10.25 10.50 10.75 11.00 11.50 11.75 12.00
I mention this in case it's easier to solve this problem with decimals instead of 4-digit times.
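The original hr2dec isn't shown; here is a minimal sketch of such a converter, assuming times are always 4-character "HHMM" strings (the body is my reconstruction, not the OP's code):
# hypothetical reconstruction of hr2dec: "0915" -> 9.25
hr2dec <- function(x) {
  as.numeric(substr(x, 1, 2)) + as.numeric(substr(x, 3, 4)) / 60
}
hr2dec(c("0915", "1000"))
# [1]  9.25 10.00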
I also have this data for 24 hours and for multiple days. So a solution that loops would need to reset to 0015 after 2400, as these are the first and last measurements of each day. A full set of data with dates included can be generated like so (with decimals for times; as I said, either format is fine for me):
set.seed(42)
full_df <- data.frame(date = rep(as.Date(c("2010-02-02", "2010-02-03")), each = 96),
                      dec_times = seq(0.25, 24, 0.25),
                      values = rnorm(96))
full_df <- full_df[-c(2,13,15,19,95,131,192),]
The best solution I have come up with so far is a pairwise comparative loop, but even that is not perfect.
Is there some elegant way to do what I'm after? I.e. check the first and last values (in terms of date and time) and sum each half-hourly interval? I'm not satisfied with my loop, which...
Checks the first and last date-time values to work out the range of half hours
Checks items in order, a pair at a time, to decide whether I have two values belonging to that half-hourly period
Sums if I do, places NA if I do not.
You should check out the tibbletime package -- specifically, you'll want to look at collapse_by() which collapses a tbl_time object by a time period.
library(tibbletime)
library(dplyr)
# create a series of 7 days
# 2018-01-01 to 2018-01-07 by 15 minute intervals
df <- create_series('2018-01-01' ~ '2018-01-07', period = "15 minute")
df$values <- rnorm(nrow(df))
df
#> # A time tibble: 672 x 2
#> # Index: date
#> date values
#> <dttm> <dbl>
#> 1 2018-01-01 00:00:00 -0.365
#> 2 2018-01-01 00:15:00 -0.275
#> 3 2018-01-01 00:30:00 -1.50
#> 4 2018-01-01 00:45:00 -1.64
#> 5 2018-01-01 01:00:00 -0.341
#> 6 2018-01-01 01:15:00 -1.05
#> 7 2018-01-01 01:30:00 -0.544
#> 8 2018-01-01 01:45:00 -1.10
#> 9 2018-01-01 02:00:00 0.0824
#> 10 2018-01-01 02:15:00 0.477
#> # ... with 662 more rows
# Collapse into 30 minute intervals, group, and sum
df %>%
collapse_by("30 minute") %>%
group_by(date) %>%
summarise(sum_values = sum(values))
#> # A time tibble: 336 x 2
#> # Index: date
#> date sum_values
#> <dttm> <dbl>
#> 1 2018-01-01 00:15:00 -0.640
#> 2 2018-01-01 00:45:00 -3.14
#> 3 2018-01-01 01:15:00 -1.39
#> 4 2018-01-01 01:45:00 -1.64
#> 5 2018-01-01 02:15:00 0.559
#> 6 2018-01-01 02:45:00 0.581
#> 7 2018-01-01 03:15:00 -1.50
#> 8 2018-01-01 03:45:00 1.36
#> 9 2018-01-01 04:15:00 0.872
#> 10 2018-01-01 04:45:00 -0.835
#> # ... with 326 more rows
# Alternatively, clean = TRUE rounds the collapsed index up to clean half-hour boundaries
df %>%
collapse_by("30 minute", clean = TRUE) %>%
group_by(date) %>%
summarise(sum_values = sum(values))
#> # A time tibble: 336 x 2
#> # Index: date
#> date sum_values
#> <dttm> <dbl>
#> 1 2018-01-01 00:30:00 -0.640
#> 2 2018-01-01 01:00:00 -3.14
#> 3 2018-01-01 01:30:00 -1.39
#> 4 2018-01-01 02:00:00 -1.64
#> 5 2018-01-01 02:30:00 0.559
#> 6 2018-01-01 03:00:00 0.581
#> 7 2018-01-01 03:30:00 -1.50
#> 8 2018-01-01 04:00:00 1.36
#> 9 2018-01-01 04:30:00 0.872
#> 10 2018-01-01 05:00:00 -0.835
#> # ... with 326 more rows
If you're more into videos (< 20 minutes), check out The Future of Time Series and Financial Analysis in the Tidyverse by Davis Vaughan.
I'm the OP. After a bit of playing, I arrived at something I think is more elegant than the loop I originally had, and decided to post it as an answer for discussion. I still wouldn't mind something more elegant.
Using full_df I create an index, which is just all the 15-minute periods I would expect given the days I've been supplied.
index <- data.frame(date = rep(seq(full_df$date[1], full_df$date[nrow(full_df)], by = "+1 day"), each = 96),
                    dec_times = rep(seq(0.25, 24, 0.25), length(unique(full_df$date))))
Then I merge this with full_df by the two matching columns; all.y = TRUE keeps every row of the index, so the measurements missing from full_df show up as NA values:
index <- merge(full_df, index, by = c("date", "dec_times"), all.y = TRUE)
Then I create a column listing which half hour each 15-minute interval belongs to, using plyr's round_any function:
index$half_hour <- plyr::round_any(index$dec_times, 0.5, ceiling)
Then I use plyr's ddply function to sum the values within each date and half_hour group (taking advantage of the fact that anything + NA is NA):
df2 <- plyr::ddply(index, c("date", "half_hour"), plyr::summarise, values = sum(values))
I believe the resulting data frame is exactly what I was after.
> df2
date half_hour values
1 2010-02-02 0.5 NA
2 2010-02-02 1.0 0.99599102
3 2010-02-02 1.5 0.29814381
4 2010-02-02 2.0 1.41686296
5 2010-02-02 2.5 1.95570961
6 2010-02-02 3.0 3.59151505
7 2010-02-02 3.5 NA
8 2010-02-02 4.0 NA
9 2010-02-02 4.5 -2.94070834
10 2010-02-02 5.0 NA
11 2010-02-02 5.5 -2.08794703
12 2010-02-02 6.0 1.04275734
13 2010-02-02 6.5 1.46472433
14 2010-02-02 7.0 -2.02043247
15 2010-02-02 7.5 -0.17989752
16 2010-02-02 8.0 1.16028746
17 2010-02-02 8.5 0.42617715
18 2010-02-02 9.0 -1.21205356
19 2010-02-02 9.5 -1.63536660
20 2010-02-02 10.0 -2.37808504
21 2010-02-02 10.5 -0.15505870
22 2010-02-02 11.0 0.03145841
23 2010-02-02 11.5 -0.93546302
24 2010-02-02 12.0 0.63270809
25 2010-02-02 12.5 0.22420168
26 2010-02-02 13.0 -0.46191368
27 2010-02-02 13.5 2.21862683
28 2010-02-02 14.0 0.36631139
29 2010-02-02 14.5 0.76912170
30 2010-02-02 15.0 -2.70820713
31 2010-02-02 15.5 -0.18200408
32 2010-02-02 16.0 1.98156055
33 2010-02-02 16.5 0.57525057
34 2010-02-02 17.0 1.37435422
35 2010-02-02 17.5 1.64160673
36 2010-02-02 18.0 -1.13330533
37 2010-02-02 18.5 -0.33000520
38 2010-02-02 19.0 0.03816768
39 2010-02-02 19.5 1.23194633
40 2010-02-02 20.0 -1.98555720
41 2010-02-02 20.5 1.77062845
42 2010-02-02 21.0 -0.03245631
43 2010-02-02 21.5 -0.58233200
44 2010-02-02 22.0 -0.39989655
45 2010-02-02 22.5 1.75511944
46 2010-02-02 23.0 0.91594245
47 2010-02-02 23.5 2.04145902
48 2010-02-02 24.0 NA
49 2010-02-03 0.5 0.80626028
50 2010-02-03 1.0 0.99599102
51 2010-02-03 1.5 0.29814381
52 2010-02-03 2.0 1.41686296
53 2010-02-03 2.5 1.95570961
54 2010-02-03 3.0 3.59151505
55 2010-02-03 3.5 -1.66764947
56 2010-02-03 4.0 0.50262906
57 2010-02-03 4.5 -2.94070834
58 2010-02-03 5.0 -1.12035358
59 2010-02-03 5.5 -2.08794703
60 2010-02-03 6.0 1.04275734
61 2010-02-03 6.5 1.46472433
62 2010-02-03 7.0 -2.02043247
63 2010-02-03 7.5 -0.17989752
64 2010-02-03 8.0 1.16028746
65 2010-02-03 8.5 0.42617715
66 2010-02-03 9.0 NA
67 2010-02-03 9.5 -1.63536660
68 2010-02-03 10.0 -2.37808504
69 2010-02-03 10.5 -0.15505870
70 2010-02-03 11.0 0.03145841
71 2010-02-03 11.5 -0.93546302
72 2010-02-03 12.0 0.63270809
73 2010-02-03 12.5 0.22420168
74 2010-02-03 13.0 -0.46191368
75 2010-02-03 13.5 2.21862683
76 2010-02-03 14.0 0.36631139
77 2010-02-03 14.5 0.76912170
78 2010-02-03 15.0 -2.70820713
79 2010-02-03 15.5 -0.18200408
80 2010-02-03 16.0 1.98156055
81 2010-02-03 16.5 0.57525057
82 2010-02-03 17.0 1.37435422
83 2010-02-03 17.5 1.64160673
84 2010-02-03 18.0 -1.13330533
85 2010-02-03 18.5 -0.33000520
86 2010-02-03 19.0 0.03816768
87 2010-02-03 19.5 1.23194633
88 2010-02-03 20.0 -1.98555720
89 2010-02-03 20.5 1.77062845
90 2010-02-03 21.0 -0.03245631
91 2010-02-03 21.5 -0.58233200
92 2010-02-03 22.0 -0.39989655
93 2010-02-03 22.5 1.75511944
94 2010-02-03 23.0 0.91594245
95 2010-02-03 23.5 2.04145902
96 2010-02-03 24.0 NA
What I like about this solution
No loops
Works within the data frame
What I don't like about this solution
Chunkiness in creating the index
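For comparison, here is a sketch of the same index-then-sum idea using dplyr and tidyr (my assumption that these packages are acceptable, and that dplyr is recent enough for .groups; complete() builds the index implicitly and ceiling-rounding stands in for round_any):
library(dplyr)
library(tidyr)

df2 <- full_df %>%
  # fill in every expected 15-minute slot per date (missing values become NA)
  complete(date, dec_times = seq(0.25, 24, 0.25)) %>%
  # round each slot up to its half-hour bucket
  mutate(half_hour = ceiling(dec_times / 0.5) * 0.5) %>%
  group_by(date, half_hour) %>%
  # sum() without na.rm = TRUE keeps the NA-propagation behaviour above
  summarise(values = sum(values), .groups = "drop")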
This question already has answers here:
How to fill with different colors between two lines? (originally: fill geom_polygon with different colors above and below y = 0 (or any other value)?)
(4 answers)
Closed 5 years ago.
I have this df
x acc
1 1902-01-01 0.782887804
2 1903-01-01 -0.003144199
3 1904-01-01 0.100006276
4 1905-01-01 0.326173392
5 1906-01-01 1.285114692
6 1907-01-01 2.844399973
7 1920-01-01 -0.300232190
8 1921-01-01 1.464389342
9 1922-01-01 0.142638653
10 1923-01-01 -0.020162385
11 1924-01-01 0.361928571
12 1925-01-01 0.616325588
13 1926-01-01 -0.108206003
14 1927-01-01 -0.318441954
15 1928-01-01 -0.267884586
16 1929-01-01 -0.022473777
17 1930-01-01 -0.294452983
18 1931-01-01 -0.654927109
19 1932-01-01 -0.263508341
20 1933-01-01 0.622530992
21 1934-01-01 1.009666043
22 1935-01-01 0.675484421
23 1936-01-01 1.209162008
24 1937-01-01 1.655280986
25 1948-01-01 2.080021785
26 1949-01-01 0.854572563
27 1950-01-01 0.997540963
28 1951-01-01 1.000244163
29 1952-01-01 0.958322941
30 1953-01-01 0.816259474
31 1954-01-01 0.814488644
32 1955-01-01 1.233694537
33 1958-01-01 0.460120970
34 1959-01-01 0.344201474
35 1960-01-01 1.601430139
36 1961-01-01 0.387850967
37 1962-01-01 -0.385954401
38 1963-01-01 0.699355708
39 1964-01-01 0.084519926
40 1965-01-01 0.708964572
41 1966-01-01 1.456280443
42 1967-01-01 1.479412638
43 1968-01-01 1.199000726
44 1969-01-01 0.282942042
45 1970-01-01 -0.181724504
46 1971-01-01 0.012170186
47 1972-01-01 -0.095891043
48 1973-01-01 -0.075384446
49 1974-01-01 -0.156668145
50 1975-01-01 -0.303023258
51 1976-01-01 -0.516027310
52 1977-01-01 -0.826791524
53 1980-01-01 -0.947112221
54 1981-01-01 -1.634878300
55 1982-01-01 -1.955298323
56 1987-01-01 -1.854447550
57 1988-01-01 -1.458955443
58 1989-01-01 -1.256102245
59 1990-01-01 -0.864108585
60 1991-01-01 -1.293373024
61 1992-01-01 -1.049530431
62 1993-01-01 -1.002526230
63 1994-01-01 -0.868783614
64 1995-01-01 -1.081858981
65 1996-01-01 -1.302103374
66 1997-01-01 -1.288048194
67 1998-01-01 -1.455750340
68 1999-01-01 -1.015467069
69 2000-01-01 -0.682789640
70 2001-01-01 -0.811058004
71 2002-01-01 -0.972374057
72 2003-01-01 -0.536505225
73 2004-01-01 -0.518686263
74 2005-01-01 -0.976298621
75 2006-01-01 -0.946429713
I would like to plot the data like this:
where column x of df is on the x axis and column acc is on the y axis.
Is it possible to plot this with ggplot?
I tried with this code:
ggplot(df, aes(x = x, y = acc)) +
  geom_linerange(data = df, aes(colour = ifelse(acc < 0, "blue", "red")), ymin = min(df$acc), ymax = max(df$acc))
but the result is this:
How can I do it?
Is this what you want? I'm not sure.
library(lubridate)  # for year()

df$x <- year(as.Date(df$x))
df$ystart <- 0
df$col <- ifelse(df$acc >= 0, "blue", "red")

ggplot(data = df, mapping = aes(x, acc)) +
  geom_segment(mapping = aes(x = x, y = ystart, xend = x, yend = acc, color = col)) +
  scale_color_identity()  # use the colour names in col literally
(Note that the column preparation has to run before the ggplot call.)
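If vertical bars are acceptable, a geom_col sketch achieves the same without precomputing ystart (this assumes df$x has already been converted to numeric years as above):
ggplot(df, aes(x, acc, fill = acc >= 0)) +
  geom_col() +
  # map TRUE/FALSE directly to the two colours and hide the legend
  scale_fill_manual(values = c("TRUE" = "blue", "FALSE" = "red"), guide = "none")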
I want to merge the data frames OldData and NewData.
In this case, Nov-2015 and Dec-2015 are present in both.
Since NewData is the most accurate update available, I want to update the values for Nov-2015 and Dec-2015 using NewData, and of course add the records for Jan-2016 and Feb-2016 as well.
Can anyone help?
OldData
Month Value
1 Jan-2015 3
2 Feb-2015 76
3 Mar-2015 31
4 Apr-2015 45
5 May-2015 99
6 Jun-2015 95
7 Jul-2015 18
8 Aug-2015 97
9 Sep-2015 61
10 Oct-2015 7
11 Nov-2015 42
12 Dec-2015 32
NewData
Month Value
1 Nov-2015 88
2 Dec-2015 45
3 Jan-2016 32
4 Feb-2016 11
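For anyone who wants to reproduce this, both frames can be rebuilt from the tables above:
OldData <- data.frame(
  Month = c("Jan-2015", "Feb-2015", "Mar-2015", "Apr-2015", "May-2015", "Jun-2015",
            "Jul-2015", "Aug-2015", "Sep-2015", "Oct-2015", "Nov-2015", "Dec-2015"),
  Value = c(3, 76, 31, 45, 99, 95, 18, 97, 61, 7, 42, 32)
)
NewData <- data.frame(
  Month = c("Nov-2015", "Dec-2015", "Jan-2016", "Feb-2016"),
  Value = c(88, 45, 32, 11)
)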
Here is the output I want:
JoinData
Month Value
1 Jan-2015 3
2 Feb-2015 76
3 Mar-2015 31
4 Apr-2015 45
5 May-2015 99
6 Jun-2015 95
7 Jul-2015 18
8 Aug-2015 97
9 Sep-2015 61
10 Oct-2015 7
11 Nov-2015 88
12 Dec-2015 45
13 Jan-2016 32
14 Feb-2016 11
Thanks to @akrun, the problem is solved; the following code works smoothly (fromLast = TRUE keeps the last occurrence of each Month, i.e. the NewData row whenever a month appears in both):
rbindlist(list(OldData, NewData))[!duplicated(Month, fromLast = TRUE)]
Update: now let's upgrade the problem a little bit.
Suppose OldData and NewData have another column called "Type".
How do we merge/update this time?
> OldData
Month Type Value
1 2015-01 A 3
2 2015-02 A 76
3 2015-03 A 31
4 2015-04 A 45
5 2015-05 A 99
6 2015-06 A 95
7 2015-07 A 18
8 2015-08 A 97
9 2015-09 A 61
10 2015-10 A 7
11 2015-11 B 42
12 2015-12 C 32
13 2015-12 D 77
> NewData
Month Type Value
1 2015-11 A 88
2 2015-12 C 45
3 2015-12 D 22
4 2016-01 A 32
5 2016-02 A 11
JoinData is supposed to take all updated values from NewData, as follows:
> JoinData
Month Type Value
1 2015-01 A 3
2 2015-02 A 76
3 2015-03 A 31
4 2015-04 A 45
5 2015-05 A 99
6 2015-06 A 95
7 2015-07 A 18
8 2015-08 A 97
9 2015-09 A 61
10 2015-10 A 7
11 2015-11 B 42
12 2015-11 A 88 (originally not included, added from NewData)
13 2015-12 C 45 (value updated from NewData)
14 2015-12 D 22 (value updated from NewData)
15 2016-01 A 32 (newly added from NewData)
16 2016-02 A 11 (newly added from NewData)
Thanks to @akrun, I got the solution for the second question as well. Thanks to everyone here for the help!
Here is the answer:
d1 <- merge(OldData, NewData, by = c("Month", "Type"), all = TRUE)
d2 <- transform(d1, Value.x = ifelse(!is.na(Value.y), Value.y, Value.x))[-4]
d2[!duplicated(d2[1:2], fromLast = TRUE), ]
Here is an option using data.table (a similar approach to the one @thelatemail mentioned in the comments):
library(data.table)
rbindlist(list(OldData, NewData))[!duplicated(Month, fromLast = TRUE)]
Or
rbindlist(list(OldData, NewData))[, if (.N > 1) .SD[.N] else .SD, Month]
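A dplyr sketch of the same "new rows win" idea, assuming dplyr is acceptable here (use by = "Month" alone for the first, Type-less problem):
library(dplyr)
JoinData <- OldData %>%
  # drop old rows whose key reappears in NewData ...
  anti_join(NewData, by = c("Month", "Type")) %>%
  # ... then append the new and updated rows
  bind_rows(NewData) %>%
  arrange(Month, Type)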
I have the following column in my data frame:
DateTime
1 2011-10-03 08:00:04
2 2011-10-03 08:00:05
3 2011-10-03 08:00:06
4 2011-10-03 08:00:09
5 2011-10-03 08:00:15
6 2011-10-03 08:00:24
7 2011-10-03 08:00:30
8 2011-10-03 08:00:42
9 2011-10-03 08:01:01
10 2011-10-03 08:01:24
11 2011-10-03 08:01:58
12 2011-10-03 08:02:34
13 2011-10-03 08:03:25
14 2011-10-03 08:04:26
15 2011-10-03 08:06:00
With dput:
> dput(smallDF)
structure(list(DateTime = structure(c(1317621604, 1317621605,
1317621606, 1317621609, 1317621615, 1317621624, 1317621630, 1317621642,
1317621661, 1317621684, 1317621718, 1317621754, 1317621805, 1317621866,
1317621960, 1317622103, 1317622197, 1317622356, 1317622387, 1317622463,
1317622681, 1317622851, 1317623061, 1317623285, 1317623404, 1317623498,
1317623612, 1317623849, 1317623916, 1317623994, 1317624174, 1317624414,
1317624484, 1317624607, 1317624848, 1317625023, 1317625103, 1317625179,
1317625200, 1317625209, 1317625229, 1317625238, 1317625249, 1317625264,
1317625282, 1317625300, 1317625315, 1317625339, 1317625353, 1317625365,
1317625371, 1317625381, 1317625395, 1317625415, 1317625423, 1317625438,
1317625458, 1317625469, 1317625487, 1317625500, 1317625513, 1317625533,
1317625548, 1317625565, 1317625581, 1317625598, 1317625613, 1317625640,
1317625661, 1317625674, 1317625702, 1317625715, 1317625737, 1317625758,
1317625784, 1317625811, 1317625826, 1317625841, 1317625862, 1317625895,
1317625909, 1317625935, 1317625956, 1317625973, 1317626001, 1317626043,
1317626062, 1317626100, 1317626113, 1317626132, 1317626153, 1317626179,
1317626212, 1317626239, 1317626271, 1317626296, 1317626323, 1317626361,
1317626384, 1317626407), class = c("POSIXct", "POSIXt"), tzone = "")), .Names = "DateTime", row.names = c(NA,
-100L), class = "data.frame")
My goal: I want to calculate the time difference, in seconds, between each measurement.
Edit:
I'm looking to get the following result, where the time difference (in seconds) between consecutive data points is calculated, except for the first value of the day (line 3), whose difference is calculated relative to 8 am:
DateTime Seconds
1 2011-09-30 21:59:02 6
2 2011-09-30 21:59:04 2
3 2011-10-03 08:00:04 4
4 2011-10-03 08:00:05 1
5 2011-10-03 08:00:06 1
6 2011-10-03 08:00:09 3
7 2011-10-03 08:00:15 5
8 2011-10-03 08:00:24 9
9 2011-10-03 08:00:30 6
10 2011-10-03 08:00:42 12
11 2011-10-03 08:01:01 19
12 2011-10-03 08:01:24 23
13 2011-10-03 08:01:58 34
14 2011-10-03 08:02:34 36
15 2011-10-03 08:03:25 51
16 2011-10-03 08:04:26 61
17 2011-10-03 08:06:00 94
However, the measurements start at 8:00 am, so if a value is the first of the day, the number of seconds relative to 8:00 am needs to be calculated. In the example above the first measurement is at 8:00:04, so using the $sec attribute of POSIXlt could work here, but on other days the first value may occur several minutes after 8 o'clock.
I've tried to achieve that goal with the following function:
SecondsInBar <- function(x, startTime) {
  # First data point or first of day
  if (x == 1 || x > 1 && x$wkday != x[-1]$wkday) {
    seconds <- as.numeric(difftime(x,
                                   as.POSIXlt(startTime, format = "%H:%M:%S"),
                                   units = "secs"))
  # else calculate time difference
  } else {
    seconds <- as.numeric(difftime(x, x[-1], units = "secs"))
  }
  return(seconds)
}
This could then be called as SecondsInBar(smallDF$DateTime, "08:00:00").
There are at least two problems with this function, but I don't know how to solve them:
The code segment x$wkday != x[-1]$wkday returns a "$ operator is invalid for atomic vectors" error,
And the as.POSIXlt(startTime, format = "%H:%M:%S") call uses the current date, which makes the difftime calculation erroneous.
My question:
Where am I going wrong with this function?
And: is this a viable approach, or should I tackle it from a different angle?
How about something along these lines:
smallDF$DateTime - as.POSIXct(paste(strftime(smallDF$DateTime,"%Y-%m-%d"),"07:00:00"))
Time differences in secs
[1] 4 5 6 9 15 24 30 42 61 84 118 154 205 266 360
[16] 503 597 756 787 863 1081 1251 1461 1685 1804 1898 2012 2249 2316 2394
[31] 2574 2814 2884 3007 3248 3423 3503 3579 3600 3609 3629 3638 3649 3664 3682
[46] 3700 3715 3739 3753 3765 3771 3781 3795 3815 3823 3838 3858 3869 3887 3900
[61] 3913 3933 3948 3965 3981 3998 4013 4040 4061 4074 4102 4115 4137 4158 4184
[76] 4211 4226 4241 4262 4295 4309 4335 4356 4373 4401 4443 4462 4500 4513 4532
[91] 4553 4579 4612 4639 4671 4696 4723 4761 4784 4807
attr(,"tzone")
[1] ""
Note that I used 7am because when I copied your data, my R session decided to interpret it as BST.
As for your errors: you can't use $ to extract components of a POSIXct value (which is how smallDF$DateTime is stored), only of a POSIXlt one, and the component is called wday, not wkday. For the second error, if you don't supply a date, R has to assume the current date, as there is no other information to draw upon.
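A tiny illustration of that first point (the timestamps are just examples):
lt <- as.POSIXlt("2011-10-03 08:00:04")
lt$wday   # 1 (Monday) -- POSIXlt exposes components via $
ct <- as.POSIXct("2011-10-03 08:00:04")
# ct$wday would fail with "$ operator is invalid for atomic vectors"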
Edit
Now it's been clarified, I would propose a different approach: split your data frame by day, then prepend the reference time to each day's times and call diff, using lapply to loop over days:
# modify the dataframe to add an extra day to the second half
smallDF[51:100, 1] <- smallDF[51:100, 1] + 86400

smallDF2 <- split(smallDF, strftime(smallDF$DateTime, "%Y-%m-%d"))
lapply(smallDF2, function(x) {
  ref <- as.POSIXct(paste(strftime(x$DateTime[1], "%Y-%m-%d"), "07:00:00"))
  diff(c(ref, x$DateTime))
})
$`2011-10-03`
Time differences in secs
[1] 4 1 1 3 6 9 6 12 19 23 34 36 51 61 94 143 94 159 31
[20] 76 218 170 210 224 119 94 114 237 67 78 180 240 70 123 241 175 80 76
[39] 21 9 20 9 11 15 18 18 15 24 14 12
$`2011-10-04`
Time differences in secs
[1] 3771 10 14 20 8 15 20 11 18 13 13 20 15 17 16
[16] 17 15 27 21 13 28 13 22 21 26 27 15 15 21 33
[31] 14 26 21 17 28 42 19 38 13 19 21 26 33 27 32
[46] 25 27 38 23 23