taking lag and capping the values with mean in dplyr

I have the following data frame in R:
name date month year hours
SSI 01-01-2016 01 2016 2000
SSI 02-01-2016 01 2016 1900
SSI 03-01-2016 01 2016 2038
SSI 04-01-2016 01 2016 2041
SSII 01-01-2016 01 2016 2000
SSII 02-01-2016 01 2016 2100
SSII 03-01-2016 01 2016 2105
SSII 04-01-2016 01 2016 2203
I want to calculate the lag of hours for every name, grouped by month and year, which I can do with the following code:
df1 <- df %>%
  group_by(name, year, month) %>%
  mutate(running_hrs = hours - lag(hours)) %>%
  as.data.frame()
Where running_hrs is greater than 24 or less than 0, I want to cap those values with the mean of that month. I am doing the following:
new_df <- df %>%
  group_by(name, year, month) %>%
  mutate(running_hrs = hours - lag(hours)) %>%
  mutate(running_hrs_new = ifelse(running_hrs > 24 | running_hrs < 0, mean(running_hrs), running_hrs)) %>%
  as.data.frame()
Desired output:
name date month year hours running_hrs running_hrs_new
SSI 01-01-2016 01 2016 2000 NA
SSI 02-01-2016 01 2016 1900 -100 (3/4)
SSI 03-01-2016 01 2016 2038 138 (3/4)
SSI 04-01-2016 01 2016 2041 3 3
SSII 01-01-2016 01 2016 2000 NA
SSII 02-01-2016 01 2016 2100 100 (10/4)
SSII 03-01-2016 01 2016 2105 5 5
SSII 04-01-2016 01 2016 2110 5 5
The out-of-range values should be replaced by the mean of the running hours that are less than 24 and greater than or equal to zero. I think we can use a conditional mean:

library(dplyr)
library(tidyr)
new_df <- df %>%
  group_by(name, year, month) %>%
  mutate(running_hrs = hours - lag(hours)) %>%
  mutate(valid_running_hrs = ifelse(running_hrs < 24 & running_hrs > 0, running_hrs, 0)) %>%
  replace_na(list(valid_running_hrs = 0)) %>%
  group_by(name, year, month) %>%
  mutate(running_hrs_new = ifelse(running_hrs > 24 | running_hrs < 0, mean(valid_running_hrs), running_hrs)) %>%
  as.data.frame()
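The code above averages valid_running_hrs over every row in the group, zeros included, which reproduces the (3/4) and (10/4) values in the question. If the goal is instead to average only the in-range differences, a minimal sketch could look like the following. It assumes that reading of "conditional mean", treats 0 and 24 as in range, introduces a helper column named valid that is not in the original, and takes the last SSII hours value (2110) from the output table rather than the input table.

```r
library(dplyr)

# Sample data; the final SSII value follows the question's output table (assumption).
df <- data.frame(
  name  = rep(c("SSI", "SSII"), each = 4),
  month = "01",
  year  = 2016,
  hours = c(2000, 1900, 2038, 2041, 2000, 2100, 2105, 2110),
  stringsAsFactors = FALSE
)

new_df <- df %>%
  group_by(name, year, month) %>%
  mutate(
    running_hrs = hours - lag(hours),
    # TRUE only for non-missing differences inside [0, 24]
    valid = !is.na(running_hrs) & running_hrs >= 0 & running_hrs <= 24,
    # keep valid values and the leading NA; replace the rest with the
    # group mean of the valid values only
    running_hrs_new = ifelse(valid | is.na(running_hrs),
                             running_hrs,
                             mean(running_hrs[valid]))
  ) %>%
  ungroup() %>%
  as.data.frame()
```

With this variant the out-of-range rows become 3 for SSI and 5 for SSII. A group with no in-range differences would get NaN, so that case may need separate handling.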

Related

How to dynamically loop through a split dataframe in R

I have dataframe df3
df3
x d y
1 bbc Sep 2020 123
2 rsb Sep 2020 234
3 atc Sep 2020 345
4 svc Sep 2020 543
5 mwe Sep 2020 567
6 bpa Oct 2020 322
7 mwe Oct 2020 456
8 uhs Oct 2020 786
9 se Oct 2020 543
10 db Oct 2020 778
11 rsb Nov 2020 358
12 svc Nov 2020 678
13 db Nov 2020 321
14 rb Nov 2020 689
15 bpa Nov 2020 765
After applying the split function:
> df5 = split(df3,f=df3$d)
> df5
$`Sep 2020`
x d y
1 bbc Sep 2020 123
2 rsb Sep 2020 234
3 atc Sep 2020 345
4 svc Sep 2020 543
5 mwe Sep 2020 567
$`Oct 2020`
x d y
6 bpa Oct 2020 322
7 mwe Oct 2020 456
8 uhs Oct 2020 786
9 se Oct 2020 543
10 db Oct 2020 778
$`Nov 2020`
x d y
11 rsb Nov 2020 358
12 svc Nov 2020 678
13 db Nov 2020 321
14 rb Nov 2020 689
15 bpa Nov 2020 765
I would like to dynamically loop through the split data frame.
I need to find out whether any values present in Nov 2020 are also present in Oct 2020.
If a value is present in both, I then have to check the month before that, Sep 2020, and also find the number of times each name has occurred. Here df3$d is in as.yearmon format. If any names in df5[["Nov 2020"]]$x are present in df5[["Sep 2020"]]$x, they should be extracted and stored in an object along with their count. Here the count is 2, since the names are present in Nov 2020 and Oct 2020. The previous months should be checked only for names that are present in the most recent month. For this example, the output should be:
> df4
names_present present_for
1 bpa 2
2 db 2
Thank you in advance

Find the number of times a name occurs in each 'month year' using R

I have a dataframe like
x d y
bbc Sep 2020 123
rsb Sep 2020 234
atc Sep 2020 345
svc Sep 2020 543
mwe Sep 2020 567
bpa Oct 2020 322
mwe Oct 2020 456
uhs Oct 2020 786
se Oct 2020 543
db Oct 2020 778
rsb Nov 2020 358
svc Nov 2020 678
db Nov 2020 321
rb Nov 2020 689
bpa Nov 2020 765
The column 'd' is in as.yearmon format
I want to find the values in column x that are repeated across the month-year values in column d.
In this dataframe bpa and db are present in Nov 2020 and Oct 2020. So the output needs to be like
names_present present_for
bpa 2
db 2
The values of x should be given as output only if they are present in consecutive months. Here svc is present in Nov 2020 and Sep 2020, but it cannot be part of the output since it is absent in Oct 2020.
I tried splitting the data frame on the df$d column, but I couldn't loop through the split data frames and get the required output. The split data frame looks like this:
$`Nov 2020`
x d y
11 rsb Nov 2020 358
12 svc Nov 2020 678
13 db Nov 2020 321
14 rb Nov 2020 689
15 bpa Nov 2020 765
$`Oct 2020`
x d y
6 bpa Oct 2020 322
7 mwe Oct 2020 456
8 uhs Oct 2020 786
9 se Oct 2020 543
10 db Oct 2020 778
$`Sep 2020`
x d y
1 bbc Sep 2020 123
2 rsb Sep 2020 234
3 atc Sep 2020 345
4 svc Sep 2020 543
5 mwe Sep 2020 567
The code needs to check the most recent month-year first, then the previous month-year, and so on. That is, it should check which of the x names are present in Nov 2020, then in Oct 2020, then in Sep 2020, and count their occurrences. If a name is present in Nov 2020 and Sep 2020 but not in Oct 2020, it cannot be part of the output data frame. There can be any number of month-year values; the data frame given here is just a small sample. I want to find this information dynamically.
I've been struggling with this for a long time. It would be great if anyone could help me solve this. Thank you in advance.
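One possible way to express the check described above is sketched below in base R. It assumes d is stored as text such as "Sep 2020" (an as.yearmon column could be converted with format(d, "%b %Y") in an English locale), and the helper names month_index and consecutive_presence are hypothetical. Each month is turned into a running integer so that "previous month" is simply "index minus one", which avoids looping over the split list explicitly.

```r
df3 <- data.frame(
  x = c("bbc", "rsb", "atc", "svc", "mwe",
        "bpa", "mwe", "uhs", "se", "db",
        "rsb", "svc", "db", "rb", "bpa"),
  d = rep(c("Sep 2020", "Oct 2020", "Nov 2020"), each = 5),
  y = c(123, 234, 345, 543, 567, 322, 456, 786, 543, 778,
        358, 678, 321, 689, 765),
  stringsAsFactors = FALSE
)

# Hypothetical helper: "Sep 2020" -> year * 12 + month number
month_index <- function(d) {
  as.integer(substr(d, 5, 8)) * 12L + match(substr(d, 1, 3), month.abb)
}

# Hypothetical helper: for names in the latest month, count how many
# consecutive months (walking backwards from the latest) they appear in
consecutive_presence <- function(df) {
  idx <- month_index(df$d)
  latest <- max(idx)
  candidates <- unique(df$x[idx == latest])
  streak <- vapply(candidates, function(nm) {
    months_seen <- idx[df$x == nm]
    n <- 1L
    while ((latest - n) %in% months_seen) n <- n + 1L
    n
  }, integer(1))
  out <- data.frame(names_present = candidates,
                    present_for = unname(streak),
                    stringsAsFactors = FALSE)
  out <- out[out$present_for >= 2, ]   # drop names seen only in the latest month
  out <- out[order(out$names_present), ]
  rownames(out) <- NULL
  out
}

df4 <- consecutive_presence(df3)
```

For the sample data, df4 contains bpa and db, each with present_for equal to 2; svc and rsb are dropped because they are missing in Oct 2020.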

Rolling data for 12 month period

I want to show the last 12 months, where each of those months shows the sum over the 12 months leading up to it. So January 2022 shows the sum of January 2021 -> January 2022, February 2022 shows the sum of February 2021 -> February 2022, and so on.
My current data
Expected Result
I am new to Kusto. It seems I need to use pivot with the prev function, but these month periods are a bit confusing.

If you know for sure that you have data for each month, this will do the trick.
If not, the solution gets a bit more complicated.
The idea is to create an accumulated-sum column and then match each month's accumulated sum with that of the same month from the previous year.
The difference between them is the sum of the last 12 months.
// Data sample generation. Not part of the solution.
let t = materialize(range i from 1 to 10000 step 1 | extend dt = ago(365d*5*rand()) | summarize val = count() by year = getyear(dt), month = getmonth(dt));
// Solution starts here.
t
| order by year asc, month asc
| extend cumsum_val = row_cumsum(val) - val, prev_year = year - 1
| as t2
| join kind=inner t2 on $left.prev_year == $right.year and $left.month == $right.month
| project year, month = format_datetime(make_datetime(year,month,1),'MM') , last_12_cumsum_val = cumsum_val - cumsum_val1
| evaluate pivot(month, any(last_12_cumsum_val), year)
| order by year asc
year   01   02   03   04   05   06   07   08   09   10   11   12
2018                1901 2020 2018 2023 2032 2039 2015 2025 2039
2019 2045 2048 2029 2043 2053 2040 2041 2027 2025 2037 2050 2042
2020 2035 2016 2024 2022 1999 2009 1989 1996 1975 1968 1939 1926
2021 1926 1931 1936 1933 1945 1942 1972 1969 1981 2007 2020 2049
2022 2051 2032 2019 2002
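For readers following the other R examples on this page, the accumulated-sum idea can be sketched in R as well. This is only an illustration of the arithmetic, not a translation of the Kusto query: it uses an inclusive running sum (the query above subtracts val from row_cumsum), so here each result covers the 12 months ending with the current month. The data and the column name last_12 are made up.

```r
# Made-up monthly data with no gaps: values 1..24 over two years
monthly <- data.frame(
  year  = rep(2020:2021, each = 12),
  month = rep(1:12, times = 2),
  val   = 1:24
)
monthly <- monthly[order(monthly$year, monthly$month), ]

# Running total up to and including each month
monthly$cumsum_val <- cumsum(monthly$val)

# Row position of the same month in the previous year (NA if absent)
prev <- match(paste(monthly$year - 1, monthly$month),
              paste(monthly$year, monthly$month))

# Difference of the two running totals = sum over the 12 months
# ending with the current month
monthly$last_12 <- monthly$cumsum_val - monthly$cumsum_val[prev]
```

In this sample, December 2021 gets 222, which is the sum of the values 13 through 24, and every 2020 row is NA because there is no earlier year to match against.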

Another option is to follow the sliding window aggregations sample described here:
let t = materialize(range i from 1 to 10000 step 1 | extend dt = ago(365d*5*rand()) | summarize val = count() by year = getyear(dt), month = getmonth(dt) | extend Date = make_datetime(year, month, 1));
let window_months = 12;
t
| extend _bin = startofmonth(Date)
| extend _range = range(1, window_months, 1)
| mv-expand _range to typeof(long)
| extend end_bin = datetime_add("month", _range, Date)
| extend end_month = format_datetime(end_bin, "MM"), end_year = datetime_part("year", end_bin)
| summarize sum(val), count() by end_year, end_month
| where count_ == 12
| evaluate pivot(end_month, take_any(sum_val), end_year)
| order by end_year asc
end_year   01   02   03   04   05   06   07   08   09   10   11   12
2018                    1921 2061 2036 2037 2075 2067 2038 2025 2029
2019     2012 2006 2015 2022 1997 2015 2012 2010 1994 2002 2029 2035
2020     2012 2002 1967 1949 1950 1963 1966 1976 1982 2016 1988 1972
2021     1990 1987 1991 1996 2026 2004 2005 1996 1991 1966 1989 1993
2022     1979 1983 1981 1977 1931

Summarising data frame using maximum counts

I have this data.frame:
counts <- data.frame(year = sort(rep(2000:2009, 12)), month = rep(month.abb,10), count = sample(1:500, 120, replace = T))
First 20 rows of data:
head(counts, 20)
year month count
1 2000 Jan 14
2 2000 Feb 182
3 2000 Mar 462
4 2000 Apr 395
5 2000 May 107
6 2000 Jun 127
7 2000 Jul 371
8 2000 Aug 158
9 2000 Sep 147
10 2000 Oct 41
11 2000 Nov 141
12 2000 Dec 27
13 2001 Jan 72
14 2001 Feb 7
15 2001 Mar 40
16 2001 Apr 351
17 2001 May 342
18 2001 Jun 81
19 2001 Jul 442
20 2001 Aug 389
Let's say I try to calculate the standard deviation of these data using the usual R code:
library(plyr)
ddply(counts, .(month), summarise, s.d. = sd(count))
month s.d.
1 Apr 145.3018
2 Aug 140.9949
3 Dec 173.9406
4 Feb 127.5296
5 Jan 148.2661
6 Jul 162.4893
7 Jun 133.4383
8 Mar 125.8425
9 May 168.9517
10 Nov 93.1370
11 Oct 167.9436
12 Sep 166.8740
This gives the standard deviation around the mean of each month. How can I get R to output standard deviation around maximum value of each month?

You want the maximum value per month and the average difference from this maximum (which is not the same as the standard deviation):
counts <- data.frame(year = sort(rep(2000:2009, 12)), month = rep(month.abb,10), count = sample(1:500, 120, replace = T))
library(data.table)
counts=data.table(counts)
counts[,mean(count-max(count)),by=month]

This question is highly vague. If you want to calculate the standard deviation of the differences to the maximum, you can use this code:
> library(plyr)
> ddply(counts, .(month), summarise, sd = sd(count - max(count)))
month sd
1 Apr 182.5071
2 Aug 114.3068
3 Dec 117.1049
4 Feb 184.4638
5 Jan 138.1755
6 Jul 167.0677
7 Jun 100.8841
8 Mar 144.8724
9 May 173.3452
10 Nov 132.0204
11 Oct 127.4645
12 Sep 152.2162
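One point worth checking here: subtracting a constant from every value does not change the standard deviation, so sd(count - max(count)) equals sd(count) on the same data, and the two tables differ only because the sample was regenerated. If the intent is a spread measured around the maximum rather than the mean, one possible reading is the root-mean-square deviation from the maximum. The sketch below assumes that interpretation; the names sd_plain, sd_shifted and rms_from_max are illustrative only.

```r
set.seed(1)  # fixed seed so the comparison uses one sample
counts <- data.frame(
  year  = rep(2000:2009, each = 12),
  month = rep(month.abb, 10),
  count = sample(1:500, 120, replace = TRUE)
)

# Ordinary standard deviation per month
sd_plain <- tapply(counts$count, counts$month, sd)

# Standard deviation after shifting by the monthly maximum: same result
sd_shifted <- tapply(counts$count, counts$month,
                     function(v) sd(v - max(v)))

# Root-mean-square deviation from the monthly maximum (assumed interpretation)
rms_from_max <- tapply(counts$count, counts$month,
                       function(v) sqrt(mean((v - max(v))^2)))

all.equal(sd_plain, sd_shifted)
```

The last line returns TRUE, while rms_from_max gives a different per-month value that grows as the counts fall further below the monthly maximum.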

Order dataframe by month

I have calculated the maximum counts per month in this data.frame:
counts <- data.frame(year = sort(rep(2000:2009, 12)), month = rep(month.abb,10), count = sample(1:500, 120, replace = T))
library(plyr)
count_max <- ddply(counts, .(month), summarise, max.count = max(count))
month max.count
1 Apr 470
2 Aug 389
3 Dec 446
4 Feb 487
5 Jan 473
6 Jul 460
7 Jun 488
8 Mar 449
9 May 488
10 Nov 464
11 Oct 483
12 Sep 394
I now want to sort count_max by the month.abb vector, so that month is in the usual order Jan-Dec. This is what I tried:
count_max[match(count_max$month, month.abb),]
...but it didn't work. How can I arrange count_max$month in the order Jan-Dec?

An alternate without conversion:
count_max[order(match(count_max$month, month.abb)), ]
# month max.count
# 5 Jan 466
# 4 Feb 356
# 8 Mar 496
# 1 Apr 489
# 9 May 498
# 7 Jun 497
# 6 Jul 491
# 2 Aug 446
# 12 Sep 414
# 11 Oct 490
# 10 Nov 416
# 3 Dec 475
Note that in your example, match(count...) returns the position of a given month in month.abb, which is what you want to sort by. You came real close, but instead of sorting by that vector, you subsetted by it. So, for example, August is the 2nd value in your original DF, but the 8th value in month.abb, so the match value for the 2nd value in your subset vector is 8, which means you are going to put the 8th row of your original data frame (in your case March), into the second position of your new DF, instead of ranking the 2nd row in your original DF into 8th position of the new one.
The distinction is a bit of a brain twister, but if you think it through it should make sense.
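The difference between subsetting by the match vector and sorting by it can be seen on a small stand-in table. The max.count values below are placeholders (1 to 12), not the values from the question, and the names wrong and right are only for illustration.

```r
# Stand-in for count_max: months in alphabetical order, placeholder counts
count_max <- data.frame(
  month = sort(month.abb),
  max.count = 1:12,
  stringsAsFactors = FALSE
)

# Position of each row's month within month.abb (Apr -> 4, Aug -> 8, ...)
pos <- match(count_max$month, month.abb)

# Subsetting by pos picks rows 4, 8, 12, ...: not calendar order
wrong <- count_max[pos, ]

# Ordering by pos ranks the rows by calendar position
right <- count_max[order(pos), ]

right$month
```

Here wrong starts with Feb, Mar, Sep (rows 4, 8 and 12 of the alphabetical table), whereas right$month matches month.abb exactly.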

Convert your "month" column into an ordered factor:
factor(count_max$month, month.abb, ordered=TRUE)
# [1] Apr Aug Dec Feb Jan Jul Jun Mar May Nov Oct Sep
# Levels: Jan < Feb < Mar < Apr < May < Jun < Jul < Aug < Sep < Oct < Nov < Dec
Example:
count_max$month <- factor(count_max$month, month.abb, ordered=TRUE)
count_max[order(count_max$month), ]
# month max.count
# 5 Jan 482
# 4 Feb 408
# 8 Mar 483
# 1 Apr 489
# 9 May 369
# 7 Jun 432
# 6 Jul 344
# 2 Aug 470
# 12 Sep 474
# 11 Oct 450
# 10 Nov 492
# 3 Dec 366
