I have sales data for a product and I would like to calculate its growth rate, where N_win and N_lose are the wins and losses over the period 1-18 March 2018. I would also like to forecast the growth rate and the wins and losses.
Date N_win N_lose tot1 tot2
1 2018-03-01 0 0 0 0
2 2018-03-02 1 0 1 1
3 2018-03-03 0 0 1 1
4 2018-03-04 1 0 2 2
5 2018-03-05 3 0 5 5
6 2018-03-06 0 0 5 5
7 2018-03-07 2 0 7 7
8 2018-03-08 4 0 11 11
9 2018-03-09 4 0 15 15
10 2018-03-10 5 0 20 20
11 2018-03-11 1 1 21 20
12 2018-03-12 24 1 45 44
13 2018-03-13 41 1 86 85
14 2018-03-14 17 2 103 101
15 2018-03-15 15 3 118 115
16 2018-03-16 15 6 133 127
17 2018-03-17 38 6 171 165
18 2018-03-18 67 6 238 232
I tried to apply this function, but it does not seem to work:
Growthrate = function(x1, x2, n) {
  gr = (x2 / x1)^(1 / n) - 1  # compound growth rate over n periods
  return(gr)
}
GR = NULL
for (i in 1:length(DF[, 1])) {
  GR[i] = Growthrate(DF[i, 2], DF[i + 1, 2], sum(i))
}
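A few things go wrong in this loop: on the last iteration DF[i + 1, 2] reads past the end of the data frame and returns NA; sum(i) is just i, so the averaging period grows with the row number; and whenever x1 is 0 (the early days) the ratio x2/x1 is infinite. A minimal sketch of a repaired version, assuming the quantity of interest is the day-over-day growth of the cumulative total tot1:

Growthrate <- function(x1, x2, n) {
  (x2 / x1)^(1 / n) - 1             # compound growth rate over n periods
}

k  <- nrow(DF)
GR <- rep(NA_real_, k - 1)
for (i in seq_len(k - 1)) {         # stop at k - 1 so row i + 1 always exists
  if (DF$tot1[i] > 0) {             # skip days where the total is still zero
    GR[i] <- Growthrate(DF$tot1[i], DF$tot1[i + 1], 1)  # n = 1: consecutive days
  }
}

Predicting future wins and losses is a separate modelling question; fitting a simple trend to the cumulative series (for example, lm on a log scale) would be one place to start.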
I'd like to get a summary of time-series data where the grouping variable is Flare and the value of interest for each group is the maximum FlareLength.
If I have a data frame like this:
Date Flare FlareLength
1 2015-12-01 0 1
2 2015-12-02 0 2
3 2015-12-03 0 3
4 2015-12-04 0 4
5 2015-12-05 0 5
6 2015-12-06 0 6
7 2015-12-07 1 1
8 2015-12-08 1 2
9 2015-12-09 1 3
10 2015-12-10 1 4
11 2015-12-11 0 1
12 2015-12-12 0 2
13 2015-12-13 0 3
14 2015-12-14 0 4
15 2015-12-15 0 5
16 2015-12-16 0 6
17 2015-12-17 0 7
18 2015-12-18 0 8
19 2015-12-19 0 9
20 2015-12-20 0 10
21 2015-12-21 0 11
22 2016-01-11 1 1
23 2016-01-12 1 2
24 2016-01-13 1 3
25 2016-01-14 1 4
26 2016-01-15 1 5
27 2016-01-16 1 6
28 2016-01-17 1 7
29 2016-01-18 1 8
I'd like output like:
Date Flare FlareLength
1 2015-12-06 0 6
2 2015-12-10 1 4
3 2015-12-21 0 11
4 2016-01-18 1 8
I have tried various aggregate forms but I'm not very familiar with the time series wrinkle.
Using dplyr, we can create a grouping variable by comparing each FlareLength with the previous value and then select the row with the maximum FlareLength in each group.
library(dplyr)
df %>%
  group_by(gr = cumsum(FlareLength < lag(FlareLength,
                                         default = first(FlareLength)))) %>%
  slice(which.max(FlareLength)) %>%
  ungroup() %>%
  select(-gr)
# A tibble: 4 x 3
# Date Flare FlareLength
# <fct> <int> <int>
#1 2015-12-06 0 6
#2 2015-12-10 1 4
#3 2015-12-21 0 11
#4 2016-01-18 1 8
In base R, we can do the same with ave:
subset(df, FlareLength == ave(FlareLength,
                              cumsum(c(TRUE, diff(FlareLength) < 0)),
                              FUN = max))
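For completeness, a data.table sketch along the same lines, assuming (as in the example) that each consecutive run of identical Flare values corresponds to one reset cycle of FlareLength:

library(data.table)
setDT(df)
# one group per consecutive run of Flare; keep the row with the longest run
df[, .SD[which.max(FlareLength)], by = rleid(Flare)][, rleid := NULL][]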
Suppose I have this data frame:
> df1
date count
1 2012-07-01 2.867133
2 2012-08-01 2.018745
3 2012-09-01 5.237515
4 2012-10-01 8.320493
5 2012-11-01 4.119850
6 2012-12-01 3.648649
7 2013-01-01 3.172867
8 2013-02-01 4.065041
9 2013-03-01 2.914798
10 2013-04-01 4.735683
11 2013-05-01 3.775411
12 2013-06-01 3.825717
13 2013-07-01 3.273427
14 2013-08-01 2.716469
15 2013-09-01 2.687296
16 2013-10-01 3.674121
17 2013-11-01 3.325942
18 2013-12-01 2.524038
I now want to split df1$count into groups/ranges where the information is highest. My thoughts go towards information gain, but as far as I know IG applies to attributes, not to a single column.
If you plot the data, you can distinguish sharp rises and falls, so my goal is to find these significant increases/decreases, which carry a high information gain.
Any ideas on how I could do this?
Something like this?
df1 %>%
  mutate(dif = ifelse((lag(count) - count) > 0, 0, 1)) %>%
  mutate(group = rle(dif) %>% magrittr::extract2("lengths") %>% rep(seq_along(.), .))
date count dif group
1 2012-07-01 2.867133 NA 1
2 2012-08-01 2.018745 0 2
3 2012-09-01 5.237515 1 3
4 2012-10-01 8.320493 1 3
5 2012-11-01 4.119850 0 4
6 2012-12-01 3.648649 0 4
7 2013-01-01 3.172867 0 4
8 2013-02-01 4.065041 1 5
9 2013-03-01 2.914798 0 6
10 2013-04-01 4.735683 1 7
11 2013-05-01 3.775411 0 8
12 2013-06-01 3.825717 1 9
13 2013-07-01 3.273427 0 10
14 2013-08-01 2.716469 0 10
15 2013-09-01 2.687296 0 10
16 2013-10-01 3.674121 1 11
17 2013-11-01 3.325942 0 12
18 2013-12-01 2.524038 0 12
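As an aside, the rle(dif) %>% extract2("lengths") %>% rep(seq_along(.), .) idiom just numbers consecutive runs of dif; data.table::rleid does the same in one call, if the extra dependency is acceptable:

library(dplyr)
df1 %>%
  mutate(dif   = ifelse((lag(count) - count) > 0, 0, 1),
         group = data.table::rleid(dif))   # same run ids as above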
UPDATE
df1 %>%
  mutate(nxt = lag(count),
         dif = ifelse(abs(count - lag(count)) > 2 |
                        count / lag(count) > 3 |
                        lag(count) / count > 3, 1, 0)) %>%
  mutate(group = rle(dif) %>% magrittr::extract2("lengths") %>% rep(seq_along(.), .))
date count nxt dif group
1 2012-07-01 2.867133 NA NA 1
2 2012-08-01 2.018745 2.867133 0 2
3 2012-09-01 5.237515 2.018745 1 3
4 2012-10-01 8.320493 5.237515 1 3
5 2012-11-01 4.119850 8.320493 1 3
6 2012-12-01 3.648649 4.119850 0 4
7 2013-01-01 3.172867 3.648649 0 4
8 2013-02-01 4.065041 3.172867 0 4
9 2013-03-01 2.914798 4.065041 0 4
10 2013-04-01 4.735683 2.914798 0 4
11 2013-05-01 3.775411 4.735683 0 4
12 2013-06-01 3.825717 3.775411 0 4
13 2013-07-01 3.273427 3.825717 0 4
14 2013-08-01 2.716469 3.273427 0 4
15 2013-09-01 2.687296 2.716469 0 4
16 2013-10-01 3.674121 2.687296 0 4
17 2013-11-01 3.325942 3.674121 0 4
18 2013-12-01 2.524038 3.325942 0 4
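With these thresholds, only the 2012-09 to 2012-11 spike is flagged (dif = 1 in rows 3 through 5), so every later row falls into group 4; the constants 2 and 3 are the knobs to tune for what counts as a significant jump.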
I have the following data frame:
x y count
1 1 2018-02-24 4.031540
2 2 2018-02-25 5.244303
3 3 2018-02-26 5.441465
4 NA 2018-02-27 4.164104
5 5 2018-02-28 5.172919
6 6 2018-03-01 5.591410
7 7 2018-03-02 4.691716
8 8 2018-03-03 5.465360
9 9 2018-03-04 3.269378
10 NA 2018-03-05 5.300679
11 11 2018-03-06 5.489664
12 12 2018-03-07 4.423334
13 13 2018-03-08 3.808764
14 14 2018-03-09 6.450136
15 15 2018-03-10 5.541785
16 16 2018-03-11 4.762889
17 17 2018-03-12 5.511649
18 18 2018-03-13 6.795386
19 19 2018-03-14 6.615762
20 20 2018-03-15 4.749151
I want to take the cumsum of the count column, but I want the cumsum to restart when the x value is NA. I've tried the following:
df$cum_sum <- ifelse(is.na(df$x) == FALSE, cumsum(df$count), 0)
x y count cum_sum
1 1 2018-02-24 4.031540 4.031540
2 2 2018-02-25 5.244303 9.275843
3 3 2018-02-26 5.441465 14.717308
4 NA 2018-02-27 4.164104 0.000000
5 5 2018-02-28 5.172919 24.054331
6 6 2018-03-01 5.591410 29.645741
7 7 2018-03-02 4.691716 34.337458
8 8 2018-03-03 5.465360 39.802817
9 9 2018-03-04 3.269378 43.072195
10 NA 2018-03-05 5.300679 0.000000
11 11 2018-03-06 5.489664 53.862538
12 12 2018-03-07 4.423334 58.285871
13 13 2018-03-08 3.808764 62.094635
14 14 2018-03-09 6.450136 68.544771
15 15 2018-03-10 5.541785 74.086556
16 16 2018-03-11 4.762889 78.849445
17 17 2018-03-12 5.511649 84.361094
18 18 2018-03-13 6.795386 91.156480
19 19 2018-03-14 6.615762 97.772242
20 20 2018-03-15 4.749151 102.521394
The result is that the cum_sum column is 0 at the NA values, but the cumsum doesn't reset. How can I fix this?
The ifelse approach cannot work here: cumsum(df$count) is evaluated once over the whole column, so the running total never resets. The cumulative sum has to be computed within groups delimited by the NA rows. A possible solution:
dat$cum_sum <- ave(dat$count, cumsum(is.na(dat$x)), FUN = cumsum)
which gives (cumsum(is.na(dat$x)) starts a new group id at each NA, and ave applies cumsum within each group):
> dat
x y count cum_sum
1 1 2018-02-24 4.031540 4.031540
2 2 2018-02-25 5.244303 9.275843
3 3 2018-02-26 5.441465 14.717308
4 NA 2018-02-27 4.164104 4.164104
5 5 2018-02-28 5.172919 9.337023
6 6 2018-03-01 5.591410 14.928433
7 7 2018-03-02 4.691716 19.620149
8 8 2018-03-03 5.465360 25.085509
9 9 2018-03-04 3.269378 28.354887
10 NA 2018-03-05 5.300679 5.300679
11 11 2018-03-06 5.489664 10.790343
12 12 2018-03-07 4.423334 15.213677
13 13 2018-03-08 3.808764 19.022441
14 14 2018-03-09 6.450136 25.472577
15 15 2018-03-10 5.541785 31.014362
16 16 2018-03-11 4.762889 35.777251
17 17 2018-03-12 5.511649 41.288900
18 18 2018-03-13 6.795386 48.084286
19 19 2018-03-14 6.615762 54.700048
20 20 2018-03-15 4.749151 59.449199
Or with dplyr:
library(dplyr)
dat %>%
  group_by(grp = cumsum(is.na(x))) %>%
  mutate(cum_sum = cumsum(count)) %>%
  ungroup() %>%
  select(-grp)
Here is the data.table version:
library(data.table)
plouf <- setDT(df)
plouf[, group := cumsum(is.na(x))]                      # a new group starts at each NA
plouf[!is.na(x), cum_sum := cumsum(count), by = group]  # cumulative sum within groups
x y count group cum_sum
1: 1 2018-02-24 4.031540 0 4.031540
2: 2 2018-02-25 5.244303 0 9.275843
3: 3 2018-02-26 5.441465 0 14.717308
4: NA 2018-02-27 4.164104 1 NA
5: 5 2018-02-28 5.172919 1 5.172919
6: 6 2018-03-01 5.591410 1 10.764329
7: 7 2018-03-02 4.691716 1 15.456045
8: 8 2018-03-03 5.465360 1 20.921405
9: 9 2018-03-04 3.269378 1 24.190783
10: NA 2018-03-05 5.300679 2 NA
11: 11 2018-03-06 5.489664 2 5.489664
12: 12 2018-03-07 4.423334 2 9.912998
13: 13 2018-03-08 3.808764 2 13.721762
14: 14 2018-03-09 6.450136 2 20.171898
15: 15 2018-03-10 5.541785 2 25.713683
16: 16 2018-03-11 4.762889 2 30.476572
17: 17 2018-03-12 5.511649 2 35.988221
18: 18 2018-03-13 6.795386 2 42.783607
19: 19 2018-03-14 6.615762 2 49.399369
20: 20 2018-03-15 4.749151 2 54.148520
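Note that this version leaves cum_sum as NA on the rows where x is NA, since those rows are excluded by !is.na(x). If you want them to carry their own count, as in the ave answer above, one extra line does it (a sketch):

plouf[is.na(x), cum_sum := count]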
I have a data frame in R, called Ident, as follows:
Date coredata.Ident.
1 2017-09-01 <NA>
2 2017-09-03 <NA>
3 2017-09-04 <NA>
4 2017-09-05 0
5 2017-09-06 0
6 2017-09-07 0
7 2017-09-08 0
8 2017-09-10 0
9 2017-09-11 Doji
10 2017-09-12 <NA>
11 2017-09-13 0
12 2017-09-14 Bull.Engulfing
13 2017-09-15 0
14 2017-09-17 0
15 2017-09-18 Bear.Engulfing
16 2017-09-19 Doji
17 2017-09-20 Bear.Engulfing
18 2017-09-21 Bull.Engulfing
19 2017-09-22 0
20 2017-09-24 0
21 2017-09-25 Bear.Engulfing
22 2017-09-26 0
23 2017-09-27 0
24 2017-09-28 0
25 2017-09-29 0
I would like to assign the date of the row immediately after each Bull.Engulfing to a variable: the first one to DateSelect1, the second to DateSelect2, and so on, so that every Bull.Engulfing has a date assigned to it.
So in this example, since there is a Bull.Engulfing on 2017-09-14 (line 12), DateSelect1 should be 2017-09-15, as that is the next row. Hope this makes sense.
TIA
Assuming the input in the Note at the end, subset the data frame and assign each date in that subset to a variable:
dates <- as.Date(subset(DF, coredata.Ident. == "Bull.Engulfing")$Date)
for(i in seq_along(dates)) assign(paste0("DateSelect", i), dates[i])
DateSelect1
## [1] "2017-09-14"
DateSelect2
## [1] "2017-09-21"
Note: The input in reproducible form is:
Lines <- "
Date coredata.Ident.
1 2017-09-01 <NA>
2 2017-09-03 <NA>
3 2017-09-04 <NA>
4 2017-09-05 0
5 2017-09-06 0
6 2017-09-07 0
7 2017-09-08 0
8 2017-09-10 0
9 2017-09-11 Doji
10 2017-09-12 <NA>
11 2017-09-13 0
12 2017-09-14 Bull.Engulfing
13 2017-09-15 0
14 2017-09-17 0
15 2017-09-18 Bear.Engulfing
16 2017-09-19 Doji
17 2017-09-20 Bear.Engulfing
18 2017-09-21 Bull.Engulfing
19 2017-09-22 0
20 2017-09-24 0
21 2017-09-25 Bear.Engulfing
22 2017-09-26 0
23 2017-09-27 0
24 2017-09-28 0
25 2017-09-29 0"
DF <- read.table(text = Lines)
This is the Fips data set:
State Fips State.Abbreviation ANSI.Code GU.Name
1 1 67 AL 2403054 Abbeville
2 1 73 AL 2403063 Adamsville
3 1 117 AL 2403069 Alabaster
4 1 95 AL 2403074 Albertville
5 1 123 AL 2403077 Alexander City
6 1 107 AL 2403080 Aliceville
7 1 39 AL 2403097 Andalusia
8 1 15 AL 2403101 Anniston
:
:
:
41774 51 720 VA 1498434 Norton
41775 51 730 VA 1498435 Petersburg
41776 51 735 VA 1498436 Poquoson
41777 51 740 VA 1498556 Portsmouth
41778 51 750 VA 1498438 Radford
41779 51 760 VA 1789073 Richmond
41780 51 770 VA 1498439 Roanoke
41781 51 775 VA 1789074 Salem
41782 51 790 VA 1789075 Staunton
41783 51 800 VA 1498560 Suffolk
41784 51 810 VA 1498559 Virginia Beach
41785 51 820 VA 1498443 Waynesboro
41786 51 830 VA 1789076 Williamsburg
41787 51 840 VA 1789077 Winchester
dim(fips)
[1] 2937 5
This is the head.cancer data:
PUBCSNUM REG MAR_STAT RACE1V NHIADE SEX FIPS Fips State State.Abbreviation
1 93261752 1544 2 15 0 1 3 3 34 NY
2 93264865 1544 2 1 0 1 15 15 34 NY
3 93268186 1544 2 1 0 1 5 5 34 NY
4 93272027 1544 2 1 0 2 17 17 34 NY
5 93274555 1544 1 1 0 1 13 13 34 NY
6 93275343 1544 5 1 0 2 25 25 34 NY
7 93279759 1544 5 1 0 2 9 9 34 NY
8 93280754 1544 2 1 0 2 35 35 34 NY
9 93281166 1544 2 1 0 2 31 31 34 NY
10 93282602 1544 5 1 0 1 33 33 34 NY
11 93287646 1544 1 1 0 1 11 11 34 NY
12 93288255 1544 4 1 4 1 39 39 34 NY
13 93290660 1544 9 1 0 2 25 25 34 NY
14 93291461 1544 1 1 6 1 39 39 34 NY
15 93291778 1544 2 1 0 1 3 3 34 NY
dim(headcancer)
[1] 75313 10
When I merge them together, I expect to get the same number of rows as head.cancer (75313), but I got 951423 rows.
Here is my code and output:
n = merge(head.cancer,fips, by=c('State','Fips','State.Abbreviation'), all.x= TRUE)
State Fips State.Abbreviation PUBCSNUM REG MAR_STAT RACE1V NHIADE SEX FIPS ANSI.Code GU.Name
1 6 5 CA 70128269 1541 4 1 0 2 5 2409693 Amador City
2 6 5 CA 70128269 1541 4 1 0 2 5 2411446 Plymouth
3 6 5 CA 70128269 1541 4 1 0 2 5 226085 Jackson
4 6 5 CA 70128269 1541 4 1 0 2 5 1675841 Amador
5 6 5 CA 70128269 1541 4 1 0 2 5 2418631 Ione Band of Miwok
6 6 5 CA 70128269 1541 4 1 0 2 5 2412019 Sutter Creek
7 6 5 CA 70128269 1541 4 1 0 2 5 2410110 Ione
8 6 5 CA 70128269 1541 4 1 0 2 5 2410128 Jackson
9 6 5 CA 67476209 1541 2 1 1 2 5 2409693 Amador City
10 6 5 CA 67476209 1541 2 1 1 2 5 2411446 Plymouth
11 6 5 CA 67476209 1541 2 1 1 2 5 226085 Jackson
12 6 5 CA 67476209 1541 2 1 1 2 5 1675841 Amador
13 6 5 CA 67476209 1541 2 1 1 2 5 2418631 Ione Band of Miwok
14 6 5 CA 67476209 1541 2 1 1 2 5 2412019 Sutter Creek
15 6 5 CA 67476209 1541 2 1 1 2 5 2410110 Ione
16 6 5 CA 67476209 1541 2 1 1 2 5 2410128 Jackson
17 6 5 CA 56544761 1541 4 1 0 2 5 2409693 Amador City
18 6 5 CA 56544761 1541 4 1 0 2 5 2411446 Plymouth
19 6 5 CA 56544761 1541 4 1 0 2 5 226085 Jackson
20 6 5 CA 56544761 1541 4 1 0 2 5 1675841 Amador
dim(n)
[1] 951423 12
In rows 1 through 8, the same PUBCSNUM is repeated 8 times. PUBCSNUM is an ID, so it should be unique, and each row should have a single ANSI.Code, but now there are many values. I don't know why it is duplicated like that.
Please help me; I have been stuck for a couple of hours and couldn't figure it out. Thanks.
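The duplication happens because fips is not unique on the merge keys: a single (State, Fips, State.Abbreviation) combination matches several GU.Name rows (Amador City, Plymouth, Jackson, ...), and merge returns every pairing. A sketch of how to confirm this and avoid it, assuming one fips row per key is acceptable:

# count fips rows per key; values > 1 explain the row explosion
key_size <- aggregate(ANSI.Code ~ State + Fips + State.Abbreviation,
                      data = fips, FUN = length)
head(key_size[key_size$ANSI.Code > 1, ])

# keep one row per key before merging (here: the first; adjust as needed)
fips_unique <- fips[!duplicated(fips[c("State", "Fips", "State.Abbreviation")]), ]
n <- merge(head.cancer, fips_unique,
           by = c("State", "Fips", "State.Abbreviation"), all.x = TRUE)
nrow(n)   # now equals nrow(head.cancer)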