I have a dataframe temp which looks like the following:
Time Count Colour
01:02:30 11.000000 Red
17:05:49 52.000000 White
04:06:07 4.000000 Blue
01:07:03 30.000000 Red
20:08:30 4.000000 Yellow
The Time column was initially an ISODate, so I stripped it down to just the time, which is what I wanted, using the code below.
library(parsedate)
temp$Time <- parse_iso_8601(temp$Time)
temp$Time <- as.POSIXlt(temp$Time)
library(chron)
temp$Time <- times(format(temp$Time, format = "%H:%M:%S"))
Now, I wish to find rows with time between 02:00:00 and 05:00:00. Can you please suggest how this can be done? Thanks!
Here is my try:
temp <- read.table(text = "Time Count Colour
01:02:30 11.000000 Red
17:05:49 52.000000 White
04:06:07 4.000000 Blue
01:07:03 30.000000 Red
20:08:30 4.000000 Yellow", header = TRUE)
library(chron)
temp$Time <- times(format(temp$Time, format="%H:%M:%S"))
temp[temp$Time >= 2/24 & temp$Time <= 5/24, ]
Output:
> temp[temp$Time >= 2/24 & temp$Time <= 5/24, ]
Time Count Colour
3 04:06:07 4 Blue
The logic:
The code below shows that the function times maps times in [00:00:00, 24:00:00) to fractions of a day in [0, 1):
> as.numeric(times(paste(0:23, ":00:00", sep = "")))
[1] 0.00000000 0.04166667 0.08333333 0.12500000 0.16666667 0.20833333
[7] 0.25000000 0.29166667 0.33333333 0.37500000 0.41666667 0.45833333
[13] 0.50000000 0.54166667 0.58333333 0.62500000 0.66666667 0.70833333
[19] 0.75000000 0.79166667 0.83333333 0.87500000 0.91666667 0.95833333
Thus, to find whether the Time is between 02:00:00 and 05:00:00, you can check whether it is greater than or equal to 2/24 and less than or equal to 5/24.
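As an aside, the same filter can also be written without chron while Time is still a character column, because zero-padded "HH:MM:SS" strings sort lexicographically in time order. A minimal sketch, assuming the raw data are re-read as plain strings (this variant is my addition, not part of the approach above):
raw <- read.table(text = "Time Count Colour
01:02:30 11 Red
17:05:49 52 White
04:06:07 4 Blue
01:07:03 30 Red
20:08:30 4 Yellow", header = TRUE, stringsAsFactors = FALSE)
# zero-padded "HH:MM:SS" strings compare correctly as plain characters
raw[raw$Time >= "02:00:00" & raw$Time <= "05:00:00", ]
#       Time Count Colour
# 3 04:06:07     4   Blue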
The gap problem:
I am not sure if this is what you want.
Assume temp is ordered by date and time, like the one below
library(chron)
temp <- data.frame(
  Record = 1:8,
  Day = c(1, 1, 1, 1, 1, 2, 2, 2),
  Time = c("01:02:30", "01:07:03", "04:06:07", "17:05:49", "20:08:30", "02:00:00", "02:15:00", "04:07:00")
)
temp$Time <- times(format(temp$Time, format="%H:%M:%S"))
> temp
Record Day Time
1 1 1 01:02:30
2 2 1 01:07:03
3 3 1 04:06:07
4 4 1 17:05:49
5 5 1 20:08:30
6 6 2 02:00:00
7 7 2 02:15:00
8 8 2 04:07:00
R code to do your task:
temp$Gap <- 24 * (c(NA, diff(temp$Time)) + c(NA, diff(temp$Day) > 0))
temp$Gap3hr <- temp$Gap >= 3 # 3 hour gap
temp$HourFromFirst <- 24 * (temp$Time - temp$Time[1]) + 24 * (temp$Day - temp$Day[1])
tempSelected <- lapply(which(temp$Gap3hr == TRUE), function(i) {
  BeforeGap1hr <- (temp$HourFromFirst[i - 1] - temp$HourFromFirst) <= 1 &
    (temp$HourFromFirst[i - 1] - temp$HourFromFirst) >= 0
  AfterGap1hr <- (temp$HourFromFirst - temp$HourFromFirst[i]) <= 1 &
    (temp$HourFromFirst - temp$HourFromFirst[i]) >= 0
  temp[BeforeGap1hr | AfterGap1hr, ]
})
Output:
> tempSelected
[[1]]
Record Day Time Gap Gap3hr HourFromFirst
3 3 1 04:06:07 2.984444 FALSE 3.060278
4 4 1 17:05:49 12.995000 TRUE 16.055278
[[2]]
Record Day Time Gap Gap3hr HourFromFirst
4 4 1 17:05:49 12.995000 TRUE 16.05528
5 5 1 20:08:30 3.044722 TRUE 19.10000
[[3]]
Record Day Time Gap Gap3hr HourFromFirst
5 5 1 20:08:30 3.044722 TRUE 19.10000
6 6 2 02:00:00 5.858333 TRUE 24.95833
7 7 2 02:15:00 0.250000 FALSE 25.20833
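If a single data frame of the selected rows is preferred, the list elements can be combined and de-duplicated (a small addition on top of the answer above; overlapping windows would otherwise produce repeated rows):
# stack the per-gap selections and drop rows selected by more than one window
selected <- unique(do.call(rbind, tempSelected))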
I'm looking for instances in my dataset where 4 of the last 6 samples show progressively increasing concentrations. I've worked out the logic of a series of if statements but I'm having trouble applying it to my data. I was planning on using cbind to attach the list to my dataset but the list ends up with 28 values and my data only has 24 rows.
I can't figure out what's happening with the for loop, and I have read that it's not a great way to do things anyway, so I'm looking for alternatives.
EDIT: I've added some photos of two specific examples where suggested answers fail. The trouble seems to be that the "simple" solutions look for increases between consecutive points only. I'm looking for four points over each set of six that increase.
[Plots of the failing cases at Row 16 and Row 20 omitted]
Here is some of the data:
SAMPLE_DATE <- c("2013-08-02", "2014-06-13", "2015-09-03", "2016-06-12", "2016-09-27", "2017-05-30", "2017-05-30", "2017-09-14", "2017-09-14", "2017-12-02", "2018-03-29", "2018-06-05", "2018-10-19", "2019-02-27", "2019-06-04", "2019-08-28", "2019-10-22", "2020-02-04", "2020-06-06", "2020-08-26", "2020-10-23", "2021-02-01", "2021-06-15", "2021-08-03")
REPORT_RESULT_VALUE <- c(0.1470, 0.0623, 1.4600, 0.1810, 0.0509, 0.0801, 0.0801, 0.0999, 0.0980, 0.0820, 0.0698, 0.0884, 0.1060, 0.1010, 0.0984, 0.1050, 0.1100, 0.0980, 0.1000, 0.1090, 0.1050, 0.0662, 0.0944, 0.1220)
GWSubsetData <- data.frame(SAMPLE_DATE, REPORT_RESULT_VALUE)
And here is what I've attempted:
Groundwater_ST1 <- vector("list")
for (i in seq_along(GWSubsetData$REPORT_RESULT_VALUE)) {
  if (i >= 6) {
    a <- i - 5
    b <- i - 4
    c <- i - 3
    d <- i - 2
    e <- i - 1
    # If i > 3 of first 5 samples
    if (sum(GWSubsetData$REPORT_RESULT_VALUE[[i]] > GWSubsetData$REPORT_RESULT_VALUE[a:e]) >= 3) {
      # If i > E and E > 2 of first 4 samples
      if ((GWSubsetData$REPORT_RESULT_VALUE[[i]] > GWSubsetData$REPORT_RESULT_VALUE[[e]]) &
          (sum(GWSubsetData$REPORT_RESULT_VALUE[[e]] > GWSubsetData$REPORT_RESULT_VALUE[a:d]) > 2)) {
        # If E > D and D > 1 of first 3 samples
        if ((GWSubsetData$REPORT_RESULT_VALUE[[e]] > GWSubsetData$REPORT_RESULT_VALUE[[d]]) &
            (sum(GWSubsetData$REPORT_RESULT_VALUE[[d]] > GWSubsetData$REPORT_RESULT_VALUE[a:c]) >= 1)) {
          Groundwater_ST1[i] <- TRUE
        # If E > C and C > 1 of first 2 samples
        } else if ((GWSubsetData$REPORT_RESULT_VALUE[[e]] > GWSubsetData$REPORT_RESULT_VALUE[[c]]) &
                   (sum(GWSubsetData$REPORT_RESULT_VALUE[[c]] > GWSubsetData$REPORT_RESULT_VALUE[a:b]) >= 1)) {
          Groundwater_ST1[i] <- TRUE
        # If E > B and B > A
        } else if ((GWSubsetData$REPORT_RESULT_VALUE[[e]] > GWSubsetData$REPORT_RESULT_VALUE[[b]]) &
                   (GWSubsetData$REPORT_RESULT_VALUE[[b]] > GWSubsetData$REPORT_RESULT_VALUE[[a]])) {
          Groundwater_ST1[i] <- TRUE
        } else {
          Groundwater_ST1[i] <- FALSE
        }
      # If i > D and D > 2 of first 3 samples
      } else if ((GWSubsetData$REPORT_RESULT_VALUE[[i]] > GWSubsetData$REPORT_RESULT_VALUE[[d]]) &
                 (sum(GWSubsetData$REPORT_RESULT_VALUE[[d]] > GWSubsetData$REPORT_RESULT_VALUE[a:c]) >= 2)) {
        # If D > C and C > 1 of first 2 samples
        if ((GWSubsetData$REPORT_RESULT_VALUE[[d]] > GWSubsetData$REPORT_RESULT_VALUE[[c]]) &
            (sum(GWSubsetData$REPORT_RESULT_VALUE[[c]] > GWSubsetData$REPORT_RESULT_VALUE[a:b]) >= 1)) {
          Groundwater_ST1[i] <- TRUE
        # If D > B and B > A
        } else if ((GWSubsetData$REPORT_RESULT_VALUE[[d]] > GWSubsetData$REPORT_RESULT_VALUE[[b]]) &
                   (GWSubsetData$REPORT_RESULT_VALUE[[b]] > GWSubsetData$REPORT_RESULT_VALUE[[a]])) {
          Groundwater_ST1[i] <- TRUE
        } else {
          Groundwater_ST1[i] <- FALSE
        }
      # If i > C > B > A
      } else if ((GWSubsetData$REPORT_RESULT_VALUE[[i]] > GWSubsetData$REPORT_RESULT_VALUE[[c]]) &
                 (GWSubsetData$REPORT_RESULT_VALUE[[c]] > GWSubsetData$REPORT_RESULT_VALUE[[b]]) &
                 (GWSubsetData$REPORT_RESULT_VALUE[[b]] > GWSubsetData$REPORT_RESULT_VALUE[[a]])) {
        Groundwater_ST1[i] <- TRUE
      } else {
        Groundwater_ST1[i] <- FALSE
      }
    } else {
      Groundwater_ST1[i] <- FALSE
    }
  } else {
    Groundwater_ST1[i] <- FALSE
  }
}
EDIT: Tricky! Here's a brute-force tidyverse answer. Unless the data is many millions of rows long, this should be very performant, since it's vectorized. Additional gains could be had by porting it to data.table or collapse for better performance with large numbers of groups.
My approach was to identify every possible ascending subset of four points within a window of 6. There are 15 of these patterns:
library(tidyverse)
subsets_to_try <- combinat::combn(1:6, 4) %>%
  t() %>%
  as.data.frame() %>%
  mutate(combination = row_number(),
         combo_pattern = paste0(V1, V2, V3, V4), .before = 1)
# combination combo_pattern V1 V2 V3 V4
#1 1 1234 1 2 3 4
#2 2 1235 1 2 3 5
#3 3 1236 1 2 3 6
#4 4 1245 1 2 4 5
#5 5 1246 1 2 4 6
#6 6 1256 1 2 5 6
#7 7 1345 1 3 4 5
#8 8 1346 1 3 4 6
#9 9 1356 1 3 5 6
#10 10 1456 1 4 5 6
#11 11 2345 2 3 4 5
#12 12 2346 2 3 4 6
#13 13 2356 2 3 5 6
#14 14 2456 2 4 5 6
#15 15 3456 3 4 5 6
For use below, we can save a version that is reshaped into long form. This produces a table that is 60 rows long (15 combinations x 4 positions).
subsets_long <- subsets_to_try %>%
  pivot_longer(-c(combination, combo_pattern),
               names_to = "trend_num", names_prefix = "V",
               names_transform = as.integer,
               values_to = "pos_in_window")
Now for the brute forcing. We can make 60 copies of each row, corresponding to each of the four positions an observation might have within the 15 sequences. We can then join the data to subsets_long so that each observation will now be situated at a particular position in a particular sequence.
We can rearrange the data so that each observation is in order within its sequence. Grouping within each possible sequence end date and combination, we can filter for just the situations where there is a 4-element increasing trend.
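Note that the code below refers to GW, DATE_NUM, and val, which come from the answerer's own session rather than the question. A minimal sketch of the assumed setup, mapping the question's GWSubsetData onto those names (the mapping is an assumption on my part):
# assumed setup: row index as DATE_NUM, concentration as val
GW <- GWSubsetData %>%
  mutate(DATE_NUM = row_number(),
         val = REPORT_RESULT_VALUE)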
incr_sequences <- GW %>%
  uncount(nrow(subsets_to_try), .id = "combination") %>%
  left_join(subsets_long) %>%
  mutate(SEQ_END = DATE_NUM - pos_in_window + 6) %>%
  arrange(SEQ_END, combination, pos_in_window) %>%
  group_by(SEQ_END, combination) %>%
  filter(sum(val > lag(val, default = -Inf)) == 4) %>%
  ungroup()
That's basically it. We might look at the identified sequences like this as a visual confirmation (I've filtered out the extreme point to clarify).
In case it's of any use downstream, I've preserved the observation that there are actually two ways to get a four-element ascending sequence ending at row 16 or row 17, depending on whether they include row 14 or row 15. If you just need to know which windows have any working trend at all, you could look at incr_sequences %>% distinct(SEQ_END) to see that it's just rows 16 and 17 that work.
ggplot(incr_sequences, aes(DATE_NUM, val)) +
  geom_point(data = GW %>% filter(val < 0.5), color = "gray70") +
  geom_point() +
  facet_wrap(~interaction(SEQ_END, combo_pattern))
Original answer:
I think we could do this by counting the number of cumulative increases, then looking to see how the increases have changed over a window of 6 values.
library(dplyr)
GWSubsetData %>%
  mutate(increases = cumsum(REPORT_RESULT_VALUE > lag(REPORT_RESULT_VALUE, default = 0)),
         n_incr_last_6 = increases - lag(increases, 6, default = 0),
         flag = n_incr_last_6 >= 4)
SAMPLE_DATE REPORT_RESULT_VALUE increases n_incr_last_6 flag
1 2013-08-02 0.1470 1 1 FALSE
2 2014-06-13 0.0623 1 1 FALSE
3 2015-09-03 1.4600 2 2 FALSE
4 2016-06-12 0.1810 2 2 FALSE
5 2016-09-27 0.0509 2 2 FALSE
6 2017-05-30 0.0801 3 3 FALSE
7 2017-05-30 0.0801 3 2 FALSE
8 2017-09-14 0.0999 4 3 FALSE
9 2017-09-14 0.0980 4 2 FALSE
10 2017-12-02 0.0820 4 2 FALSE
11 2018-03-29 0.0698 4 2 FALSE
12 2018-06-05 0.0884 5 2 FALSE
13 2018-10-19 0.1060 6 3 FALSE
14 2019-02-27 0.1010 6 2 FALSE
15 2019-06-04 0.0984 6 2 FALSE
16 2019-08-28 0.1050 7 3 FALSE
17 2019-10-22 0.1100 8 4 TRUE
18 2020-02-04 0.0980 8 3 FALSE
19 2020-06-06 0.1000 9 3 FALSE
20 2020-08-26 0.1090 10 4 TRUE
21 2020-10-23 0.1050 10 4 TRUE
22 2021-02-01 0.0662 10 3 FALSE
23 2021-06-15 0.0944 11 3 FALSE
24 2021-08-03 0.1220 12 4 TRUE
Base R equivalent:
GWSubsetData$incr <- cumsum(c(1, diff(GWSubsetData$REPORT_RESULT_VALUE) > 0))
# base R has no dplyr-style lag(), so build the 6-step lag by hand
lag6 <- c(rep(0, 6), head(GWSubsetData$incr, -6))
GWSubsetData$flag <- (GWSubsetData$incr - lag6) >= 4
Edit:
I acknowledge that this approach doesn't lead to your expected result, but I wanted to post it in case it spurs you/others to come up with alternative approaches to brute forcing (if there are any...). Here, I'm defining "progressive increase" based on the direction of the slope of a linear relationship between date and val:
dat$date2 <- as.numeric(as.Date(dat$date, "%Y-%m-%d"))
w <- 6 # window width
## With a loop
dat$slope <- NA
for (i in w:nrow(dat)) {
  # (i - w + 1):i spans exactly w rows
  dat[i, "slope"] <- coef(lm(val ~ date2, data = dat[(i - w + 1):i, ]))[2]
}
## Without a loop
dat$slope2 <- NA
dat[w:nrow(dat), "slope2"] <- sapply(w:nrow(dat), function(i) coef(lm(val ~ date2, data = dat[(i - w + 1):i, ]))[2])
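A possible follow-up, assuming one would flag a window as "progressively increasing" when its fitted slope is positive (this flagging step is my assumption, not part of the answer above):
# hypothetical flagging step: treat a positive fitted slope as an increasing trend
dat$trend_up <- !is.na(dat$slope) & dat$slope > 0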
Previous answer:
I'm assuming that if the change between a value and its previous value is greater than 0, there was an increase, and that if this happens 4 or more times within six consecutive values, your condition Groundwater_ST1 is TRUE:
library(dplyr)
library(RcppRoll)
dat %>%
  mutate(change = val - lag(val),
         incr = change > 0,
         roll_sum = roll_sum(incr, 6, align = "right", fill = NA),
         Groundwater_ST1 = roll_sum >= 4)
result:
date val change incr roll_sum Groundwater_ST1
1 2013-08-02 0.1470 NA NA NA NA
2 2014-06-13 0.0623 -0.0847 FALSE NA NA
3 2015-09-03 1.4600 1.3977 TRUE NA NA
4 2016-06-12 0.1810 -1.2790 FALSE NA NA
5 2016-09-27 0.0509 -0.1301 FALSE NA NA
6 2017-05-30 0.0801 0.0292 TRUE NA NA
7 2017-05-30 0.0801 0.0000 FALSE 2 FALSE
8 2017-09-14 0.0999 0.0198 TRUE 3 FALSE
9 2017-09-14 0.0980 -0.0019 FALSE 2 FALSE
10 2017-12-02 0.0820 -0.0160 FALSE 2 FALSE
11 2018-03-29 0.0698 -0.0122 FALSE 2 FALSE
12 2018-06-05 0.0884 0.0186 TRUE 2 FALSE
13 2018-10-19 0.1060 0.0176 TRUE 3 FALSE
14 2019-02-27 0.1010 -0.0050 FALSE 2 FALSE
15 2019-06-04 0.0984 -0.0026 FALSE 2 FALSE
16 2019-08-28 0.1050 0.0066 TRUE 3 FALSE
17 2019-10-22 0.1100 0.0050 TRUE 4 TRUE
18 2020-02-04 0.0980 -0.0120 FALSE 3 FALSE
19 2020-06-06 0.1000 0.0020 TRUE 3 FALSE
20 2020-08-26 0.1090 0.0090 TRUE 4 TRUE
21 2020-10-23 0.1050 -0.0040 FALSE 4 TRUE
22 2021-02-01 0.0662 -0.0388 FALSE 3 FALSE
23 2021-06-15 0.0944 0.0282 TRUE 3 FALSE
24 2021-08-03 0.1220 0.0276 TRUE 4 TRUE
data:
dat <- data.frame(
date =c("2013-08-02", "2014-06-13", "2015-09-03", "2016-06-12", "2016-09-27", "2017-05-30", "2017-05-30", "2017-09-14", "2017-09-14", "2017-12-02", "2018-03-29", "2018-06-05", "2018-10-19", "2019-02-27", "2019-06-04", "2019-08-28", "2019-10-22", "2020-02-04", "2020-06-06", "2020-08-26", "2020-10-23", "2021-02-01", "2021-06-15", "2021-08-03"),
val = c(0.1470, 0.0623, 1.4600, 0.1810, 0.0509, 0.0801, 0.0801, 0.0999, 0.0980, 0.0820, 0.0698, 0.0884, 0.1060, 0.1010, 0.0984, 0.1050, 0.1100, 0.0980, 0.1000, 0.1090, 0.1050, 0.0662, 0.0944, 0.1220))
The answers below are very helpful, but I oversimplified my original question. I figured I would learn more if I oversimplified and then adapted to my actual need, but now I am stuck. There are other factors that drive the amortization. I like the response using amort$end_bal <- begin_bal * (1 - mpr)^amort$period and amort$pmt <- c(0, diff(amort$end_bal)) * -1, but in addition npr increases the ending balances and ch_off decreases them. Here's the more complete code:
n_periods <- 8
begin_bal <- 10000
yld <- .20
npr <- .09
mpr <- .10
co <- .10
period <- seq(0, n_periods, 1)
fin <- 0
pur <- 0
pmt <- 0
ch_off <- 0
end_bal <- begin_bal
for (i in 1:n_periods) {
  fin[i + 1] <- end_bal[i] * yld / 12
  pur[i + 1] <- end_bal[i] * npr
  pmt[i + 1] <- end_bal[i] * mpr
  ch_off[i + 1] <- end_bal[i] * co / 12
  end_bal[i + 1] <- end_bal[i] + pur[i + 1] - pmt[i + 1] - ch_off[i + 1]
}
amort <- data.frame(period, fin, pur, pmt, ch_off, end_bal)
Which gives the below correct output:
print(amort,row.names=FALSE)
period fin pur pmt ch_off end_bal
0 0.0000 0.0000 0.0000 0.00000 10000.000
1 166.6667 900.0000 1000.0000 83.33333 9816.667
2 163.6111 883.5000 981.6667 81.80556 9636.694
3 160.6116 867.3025 963.6694 80.30579 9460.022
4 157.6670 851.4020 946.0022 78.83351 9286.588
5 154.7765 835.7929 928.6588 77.38823 9116.334
6 151.9389 820.4700 911.6334 75.96945 8949.201
7 149.1534 805.4281 894.9201 74.57668 8785.132
8 146.4189 790.6619 878.5132 73.20944 8624.072
I'm new to R, and I understand one of its features is matrix/vector manipulation. In the example below I amortize an asset over 8 months, where each payment ("pmt") is 10% ("mpr") of the prior period's balance ("end_bal"). The code below works fine; I used a for loop. I understand for loops can be slow in large models and that a better approach is to use R's abundant vector/matrix functions, but I didn't know how to do that in my example, since each monthly payment is calculated by referencing the prior period's ending balance.
So my questions are:
Is there a more efficient way to do the below?
How do I replace the 0 for pmt in period 0 with an empty space?
R code:
n_periods <- 8
begin_bal <- 100
mpr <- .10
# Example loan amortization
pmt <- 0
end_bal <- begin_bal
for (i in 1:n_periods) {
  pmt[i + 1] <- end_bal[i] * mpr
  end_bal[i + 1] <- end_bal[i] - pmt[i + 1]
}
amort <- data.frame(period = 0:n_periods, pmt, end_bal)
amort
Results, which are correct:
> amort
period pmt end_bal
1 0 0.000000 100.00000
2 1 10.000000 90.00000
3 2 9.000000 81.00000
4 3 8.100000 72.90000
5 4 7.290000 65.61000
6 5 6.561000 59.04900
7 6 5.904900 53.14410
8 7 5.314410 47.82969
9 8 4.782969 43.04672
Use R's vectorised calculations
n_periods <- 8
begin_bal <- 100
mpr <- .10
amort <- data.frame(period = seq(0, n_periods, 1))
amort$end_bal <- begin_bal * (1 - mpr)^amort$period
amort$pmt <- c(0, diff(amort$end_bal))* -1
amort
#> period end_bal pmt
#> 1 0 100.00000 0.000000
#> 2 1 90.00000 10.000000
#> 3 2 81.00000 9.000000
#> 4 3 72.90000 8.100000
#> 5 4 65.61000 7.290000
#> 6 5 59.04900 6.561000
#> 7 6 53.14410 5.904900
#> 8 7 47.82969 5.314410
#> 9 8 43.04672 4.782969
Created on 2021-05-12 by the reprex package (v2.0.0)
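Why the closed form works: the loop's recurrence is end_bal[t] = end_bal[t-1] * (1 - mpr), a geometric sequence, so end_bal[t] = begin_bal * (1 - mpr)^t and pmt[t] = end_bal[t-1] - end_bal[t] = begin_bal * mpr * (1 - mpr)^(t-1), which is exactly what the two vectorized lines above compute.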
A dplyr way for a different case (say):
n_periods <- 15
begin_bal <- 1000
mpr <- .07
library(dplyr)
seq(0, n_periods, 1) %>% as.data.frame() %>%
  setNames('period') %>%
  mutate(end_bal = begin_bal * (1 - mpr)^period,
         pmt = -1 * c(0, diff(end_bal)))
#> period end_bal pmt
#> 1 0 1000.0000 0.00000
#> 2 1 930.0000 70.00000
#> 3 2 864.9000 65.10000
#> 4 3 804.3570 60.54300
#> 5 4 748.0520 56.30499
#> 6 5 695.6884 52.36364
#> 7 6 646.9902 48.69819
#> 8 7 601.7009 45.28931
#> 9 8 559.5818 42.11906
#> 10 9 520.4111 39.17073
#> 11 10 483.9823 36.42878
#> 12 11 450.1035 33.87876
#> 13 12 418.5963 31.50725
#> 14 13 389.2946 29.30174
#> 15 14 362.0439 27.25062
#> 16 15 336.7009 25.34308
Created on 2021-05-12 by the reprex package (v2.0.0)
Though the OP has posed another question in the edited scenario, here's the suggested approach (for future reference):
n_periods <- 8
begin_bal <- 10000
yld <- .20
npr <- .09
mpr <- .10
co <- .10
library(dplyr)
seq(0, n_periods, 1) %>% as.data.frame() %>%
  setNames('period') %>%
  mutate(end_bal = begin_bal * (1 - (mpr + co/12 - npr))^period,
         fin = c(0, (end_bal * yld/12)[-nrow(.)]),
         pur = c(0, (end_bal * npr)[-nrow(.)]),
         pmt = c(0, (end_bal * mpr)[-nrow(.)]),
         ch_off = c(0, (end_bal * co/12)[-nrow(.)]))
#> period end_bal fin pur pmt ch_off
#> 1 0 10000.000 0.0000 0.0000 0.0000 0.00000
#> 2 1 9816.667 166.6667 900.0000 1000.0000 83.33333
#> 3 2 9636.694 163.6111 883.5000 981.6667 81.80556
#> 4 3 9460.022 160.6116 867.3025 963.6694 80.30579
#> 5 4 9286.588 157.6670 851.4020 946.0022 78.83351
#> 6 5 9116.334 154.7765 835.7929 928.6588 77.38823
#> 7 6 8949.201 151.9389 820.4700 911.6334 75.96945
#> 8 7 8785.132 149.1534 805.4281 894.9201 74.57668
#> 9 8 8624.072 146.4189 790.6619 878.5132 73.20944
Created on 2021-05-13 by the reprex package (v2.0.0)
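As a quick sanity check (a sketch assuming the question's loop output amort is still in the workspace), the closed form can be compared against the loop result:
# recompute the closed-form balances and compare to the loop's amort$end_bal;
# all.equal should return TRUE
end_bal_closed <- begin_bal * (1 - (mpr + co/12 - npr))^seq(0, n_periods, 1)
all.equal(amort$end_bal, end_bal_closed)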
If you are "lazy" (don't want to formulate the general expression of pmt and end_bal), you can define a recursive function f like blow
f <- function(k) {
  if (k == 1) {
    return(data.frame(pmt = 100 * mpr, end_bal = 100))
  }
  u <- f(k - 1)
  end_bal <- with(tail(u, 1), end_bal - pmt)
  pmt <- mpr * end_bal
  rbind(u, data.frame(pmt, end_bal))
}
n_periods <- 8
res <- transform(
cbind(period = 0:n_periods, f(n_periods + 1)),
pmt = c(0, head(pmt, -1))
)
and you will see
> res
period pmt end_bal
1 0 0.000000 100.00000
2 1 10.000000 90.00000
3 2 9.000000 81.00000
4 3 8.100000 72.90000
5 4 7.290000 65.61000
6 5 6.561000 59.04900
7 6 5.904900 53.14410
8 7 5.314410 47.82969
9 8 4.782969 43.04672
I am analysing some data and need help.
Basically, I have a dataset that looks like this:
date <- seq(as.Date("2017-04-01"),as.Date("2017-05-09"),length.out=40)
switch <- c(rep(1:2,each=10),rep(1:2,each=10))
O2 <- runif(40,min=21.02,max=21.06)
CO2 <- runif(40,min=0.076,max=0.080)
test.data <- data.frame(date,switch,O2,CO2)
As can be seen, there's a switch column that alternates between 1 and 2 every 10 data points. I want to write code that does the following: whenever the switch column changes value (from 1 to 2, or 2 to 1), delete the first 5 rows of data after the switch (i.e., keep only the last 5 data points for all 4 variables), average the remaining data points for O2 and CO2, and put the results in 2 new columns (avg.O2 and avg.CO2) before the next switch, repeating this process until the end.
It's quite easy to do manually on paper or in Excel, but my real dataset comprises thousands of data points, and I would like R to do it automatically for me. Does anyone have any ideas that could help?
Please find my edits, which should work for both regular and irregular switching intervals.
date <- seq(as.Date("2017-04-01"),as.Date("2017-05-09"),length.out=40)
switch <- c(rep(1:2,each=10),rep(1:2,each=10))
O2 <- runif(40,min=21.02,max=21.06)
CO2 <- runif(40,min=0.076,max=0.080)
test.data <- data.frame(date,switch,O2,CO2)
CleanMachineData <- function(Data, SwitchData, UnreliableRows = 5) {
  # First, properly turn your switch column into a grouping column: (1,2,1,2) -> (1,2,3,4)
  grouplength <- rle(Data[, "switch"])$lengths
  # mapply lets us feed vector arguments to functions that normally take
  # one/first-element-only arguments. Here we create a sequence for each run's length.
  grouping <- mapply(seq, grouplength)
  # Build a single vector that labels the groups
  groups <- mapply(rep, 1:length(grouplength), each = grouplength)
  # if the switching frequency was irregular, groups is a list; if regular, a matrix.
  # Convert either into a plain vector:
  if (is.list(groups)) {
    groups <- unlist(groups)
  } else {
    groups <- as.vector(groups)
  }
  Data$group <- groups
  #
  # vector of the first row of each new switch (except the starting 0)
  switchRow <- c(0, which(abs(diff(SwitchData)) == 1)) + 1
  # "as.vector" turns the matrix output of mapply into a plain sequence of numbers.
  # "ToRemove" holds all the row numbers to drop from your original data.
  ToRemove <- c(1:UnreliableRows, as.vector(mapply(seq, switchRow, switchRow + UnreliableRows - 1)))
  # Concatenate the initial unreliable rows (1:UnreliableRows) with ToRemove via c() and
  # drop them from the full row sequence with "-"; Keep holds the surviving row numbers
  # (computed for reference, not used below)
  Keep <- seq(nrow(Data))[-c(1:UnreliableRows, ToRemove)]
  # Create the new data (reminder: data[<ROW>, <COLUMN>])
  newdat <- Data[-ToRemove, ]
  # return the result
  newdat
}
dat <- CleanMachineData(test.data, test.data$switch, 5)
dat
date switch O2 CO2 group
6 2017-04-05 1 21.03922 0.07648886 1
7 2017-04-06 1 21.04071 0.07747368 1
8 2017-04-07 1 21.05742 0.07946615 1
9 2017-04-08 1 21.04673 0.07782362 1
10 2017-04-09 1 21.04966 0.07936446 1
16 2017-04-15 2 21.02526 0.07833825 2
17 2017-04-16 2 21.04511 0.07747774 2
18 2017-04-17 2 21.03165 0.07662803 2
19 2017-04-18 2 21.03252 0.07960098 2
20 2017-04-19 2 21.04032 0.07892145 2
26 2017-04-25 1 21.03691 0.07691438 3
27 2017-04-26 1 21.05846 0.07857017 3
28 2017-04-27 1 21.04128 0.07891908 3
29 2017-04-28 1 21.03837 0.07817021 3
30 2017-04-29 1 21.02334 0.07917546 3
36 2017-05-05 2 21.02890 0.07723042 4
37 2017-05-06 2 21.04606 0.07979641 4
38 2017-05-07 2 21.03822 0.07985775 4
39 2017-05-08 2 21.04136 0.07781525 4
40 2017-05-09 2 21.05375 0.07941123 4
aggregate(cbind(O2,CO2) ~ group, dat, mean)
group O2 CO2
1 1 21.04675 0.07812336
2 2 21.03497 0.07819329
3 3 21.03967 0.07834986
4 4 21.04166 0.07882221
# crazier, irregular switching
test.data2 <- test.data
test.data2$switch <- unlist(mapply(rep, 1:2, times = 1, each = c(10,8,10,5,3,10)))[1:20]
dat2 <- CleanMachineData(test.data2, test.data2$switch, 5)
dat2
date switch O2 CO2 group
6 2017-04-05 1 21.03922 0.07648886 1
7 2017-04-06 1 21.04071 0.07747368 1
8 2017-04-07 1 21.05742 0.07946615 1
9 2017-04-08 1 21.04673 0.07782362 1
10 2017-04-09 1 21.04966 0.07936446 1
16 2017-04-15 2 21.02526 0.07833825 2
17 2017-04-16 2 21.04511 0.07747774 2
18 2017-04-17 2 21.03165 0.07662803 2
24 2017-04-23 1 21.05658 0.07669662 3
25 2017-04-24 1 21.04452 0.07983165 3
26 2017-04-25 1 21.03691 0.07691438 3
27 2017-04-26 1 21.05846 0.07857017 3
28 2017-04-27 1 21.04128 0.07891908 3
29 2017-04-28 1 21.03837 0.07817021 3
30 2017-04-29 1 21.02334 0.07917546 3
36 2017-05-05 2 21.02890 0.07723042 4
37 2017-05-06 2 21.04606 0.07979641 4
38 2017-05-07 2 21.03822 0.07985775 4
# You can try different numbers of unreliable rows with the following
lapply(5:7, function(x) {
  dat <- CleanMachineData(test.data2, test.data2$switch, x)
  list(data = dat, means = aggregate(cbind(O2, CO2) ~ group, dat, mean))
})
Use
test.data[rep(c(FALSE, TRUE), each=5),]
to always select the last five rows from each group of 10 rows.
Then you can use aggregate:
d2 <- test.data[rep(c(FALSE, TRUE), each=5),]
aggregate(cbind(O2, CO2) ~ 1, data=d2, FUN=mean)
If you want the average for every 5-rows-group:
aggregate(cbind(O2, CO2) ~ gl(k=5, n=nrow(d2)/5L), data=d2, FUN=mean)
Here is a generalization for the situation of arbitrary number of rows in test.data:
stay <- rep(c(FALSE, TRUE), each=5, length.out=nrow(test.data))
d2 <- test.data[stay,]
group <- gl(k=5, n=nrow(d2)/5L+1L, length=nrow(d2))
aggregate(cbind(O2, CO2) ~ group, data=d2, FUN=mean)
Here is a variant for mixing the data with the averages:
group <- gl(k=10, n=nrow(test.data)/10L+1L, length=nrow(test.data))
L <- split(test.data, group)
mySummary <- function(x) {
  if (nrow(x) <= 5) return(NULL)
  x <- x[-(1:5), ]
  d.avg <- aggregate(cbind(O2, CO2) ~ 1, data = x, FUN = mean)
  rbind(x, cbind(date = NA, switch = -1, d.avg))
}
lapply(L, mySummary) # as list of dataframes
do.call(rbind, lapply(L, mySummary)) # as one dataframe
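For comparison, a compact dplyr sketch of the same idea (my own variant, not one of the answers above): derive the group from changes in switch, drop the first 5 rows of each run, then average.
library(dplyr)
test.data %>%
  mutate(group = cumsum(switch != lag(switch, default = first(switch)))) %>%
  group_by(group) %>%
  slice(-(1:5)) %>%  # drop the unreliable rows right after each switch
  summarise(avg.O2 = mean(O2), avg.CO2 = mean(CO2))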
I have a problem subsetting times.
1) I would like to filter my data by two time intervals, one after midnight and one after midday.
2) And I need only the first time that occurs in each interval.
Data frame looks like this
DATE v
1 2007-07-28 00:41:00 1
2 2007-07-28 02:00:12 5
3 2007-07-28 02:01:19 3
4 2007-07-28 02:44:08 2
5 2007-07-28 04:02:18 3
6 2007-07-28 09:59:16 4
7 2007-07-28 11:21:32 8
8 2007-07-28 11:58:40 5
9 2007-07-28 13:20:52 4
10 2007-07-28 13:21:52 9
11 2007-07-28 14:41:32 3
12 2007-07-28 15:19:00 9
13 2007-07-29 01:01:48 2
14 2007-07-29 01:41:08 5
Result should look like this
DATE v
2 2007-07-28 02:00:12 5
9 2007-07-28 13:20:52 4
13 2007-07-29 01:01:48 2
Reproducible code
DATE<-c("2007-07-28 00:41:00", "2007-07-28 02:00:12","2007-07-28 02:01:19", "2007-07-28 02:44:08", "2007-07-28 04:02:18","2007-07-28 09:59:16", "2007-07-28 11:21:32", "2007-07-28 11:58:40","2007-07-28 13:20:52", "2007-07-28 13:21:52", "2007-07-28 14:41:32","2007-07-28 15:19:00", "2007-07-29 01:01:48", "2007-07-29 01:41:08")
v<-c(1,5,3,2,3,4,8,5,4,9,3,9,2,5)
hyljes <- data.frame(DATE, v)  # note: data.frame(cbind(...)) would coerce v to character
df <- subset(hyljes, format(as.POSIXct(DATE), "%H") %in% c("01":"02", "13":"14"))
There's a problem with making the intervals: it lets me subset hours "13":"14" but not "01":"02". Is there a reasonable explanation for that?
And I haven't found a way to get only the first element from each interval.
Any help is appreciated!
Try
hyljes[c(1, head(cumsum(rle(as.POSIXlt(hyljes$DATE)$hour < 13)$lengths) + 1, -1)), ]
## DATE v
## 1 2007-07-28 00:41:00 1
## 9 2007-07-28 13:20:52 4
## 13 2007-07-29 01:01:48 2
as.POSIXlt(hyljes$DATE)$hour < 13 tells you whether each time is before or after noon
rle(...)$lengths gives you the lengths of the runs of TRUEs and FALSEs
cumsum of the above + 1 gives you the indices of the first record in each run
head(..., -1) trims off the last element
c(1, ...) adds back the first index, which should always be included by definition
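As an aside, this also explains why the original subset missed the 01-02 hours: "01":"02" is coerced by ":" to the numeric vector 1:2, and %in% then matches the zero-padded strings from format() against their unpadded counterparts, so "13" and "14" happen to match but "01" and "02" never do.
c("01":"02", "13":"14")    # 1 2 13 14 (numeric)
"01" %in% c(1, 2, 13, 14)  # FALSE: "01" is compared against "1"
"13" %in% c(1, 2, 13, 14)  # TRUE:  "13" matches "13"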
There are lots of little manipulations in here, but the end result gets you where you need to be:
hyljes <- [YOUR DATA]
hyljes$DATE <- as.POSIXct(hyljes$DATE, format = "%Y-%m-%d %H:%M:%S")
hyljes$hour <- strftime(hyljes$DATE, '%H')
hyljes$date <- strftime(hyljes$DATE, '%Y-%m-%d')
hyljes$am_pm <- ifelse(hyljes$hour < 12, 'am', 'pm')
library(plyr)
mins <- ddply(hyljes, .(date, am_pm), summarise, min = min(DATE))$min
hyljes[hyljes[, 1] %in% mins, 1:2]
DATE v
1 2007-07-28 00:41:00 1
9 2007-07-28 13:20:52 4
13 2007-07-29 01:01:48 2
I want to determine the length of the snow season in the following data frame:
DATE SNOW
1998-11-01 0
1998-11-02 0
1998-11-03 0.9
1998-11-04 1
1998-11-05 0
1998-11-06 1
1998-11-07 0.6
1998-11-08 1
1998-11-09 2
1998-11-10 2
1998-11-11 2.5
1998-11-12 3
1998-11-13 6.5
1999-01-01 15
1999-01-02 15
1999-01-03 19
1999-01-04 18
1999-01-05 17
1999-01-06 17
1999-01-07 17
1999-01-08 17
1999-01-09 16
1999-03-01 6
1999-03-02 5
1999-03-03 5
1999-03-04 5
1999-03-05 5
1999-03-06 2
1999-03-07 2
1999-03-08 1.6
1999-03-09 1.2
1999-03-10 1
1999-03-11 0.6
1999-03-12 0
1999-03-13 1
Snow season is defined by a snow depth (SNOW) of more than 1 cm for at least 10 consecutive days (so if there is snow one day in November but it then melts and the depth drops below 1 cm, we consider the season not yet started).
My idea would be to determine:
1) the date of snowpack establishment (in my example 1998-11-08)
2) the date of "disappearing" (here 1999-03-11)
3) calculate the length of the period (number of days between 1998-11-08 and 1999-03-11)
For the 3rd step I can easily get the number of days between 2 dates using this method.
But how do I find the dates that satisfy these conditions?
This is one way:
# copy data from clipboard
d <- read.table(text=readClipboard(), header=TRUE)
# coerce DATE to Date type, add event grouping variable that numbers the groups
# sequentially and has NA for values not in events.
d <- transform(d, DATE = as.Date(DATE),
               event = with(rle(d$SNOW >= 1),
                            rep(replace(ave(values, values, FUN = seq), !values, NA), lengths)))
# aggregate event lengths in days
event.days <- aggregate(DATE ~ event, data=d, function(x) as.numeric(max(x) - min(x), units='days'))
# get those events greater than 10 days
subset(event.days, DATE > 10)
# event DATE
# 3 3 122
You can also use the event grouping variable to find the start dates:
starts <- aggregate(DATE ~ event, data=d, FUN=head, 1)
# event DATE
# 1 1 1998-11-04
# 2 2 1998-11-06
# 3 3 1998-11-08
# 4 4 1999-03-13
And then merge this with event.days:
merge(event.days, starts, by='event')
# event DATE.x DATE.y
# 1 1 0 1998-11-04
# 2 2 0 1998-11-06
# 3 3 122 1998-11-08
# 4 4 0 1999-03-13
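In the same spirit, the "disappearing" date asked for in step 2 can be pulled with FUN=tail (a sketch, assuming the same d and event.days as above):
# last day of each event; the long event (3) gives the season's end date
ends <- aggregate(DATE ~ event, data=d, FUN=tail, 1)
merge(event.days, ends, by='event')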