Summarising data frame using maximum counts

I have this data.frame:
counts <- data.frame(year = sort(rep(2000:2009, 12)), month = rep(month.abb, 10), count = sample(1:500, 120, replace = TRUE))
First 20 rows of data:
head(counts, 20)
year month count
1 2000 Jan 14
2 2000 Feb 182
3 2000 Mar 462
4 2000 Apr 395
5 2000 May 107
6 2000 Jun 127
7 2000 Jul 371
8 2000 Aug 158
9 2000 Sep 147
10 2000 Oct 41
11 2000 Nov 141
12 2000 Dec 27
13 2001 Jan 72
14 2001 Feb 7
15 2001 Mar 40
16 2001 Apr 351
17 2001 May 342
18 2001 Jun 81
19 2001 Jul 442
20 2001 Aug 389
Let's say I try to calculate the standard deviation of these data using the usual R code:
library(plyr)
ddply(counts, .(month), summarise, s.d. = sd(count))
month s.d.
1 Apr 145.3018
2 Aug 140.9949
3 Dec 173.9406
4 Feb 127.5296
5 Jan 148.2661
6 Jul 162.4893
7 Jun 133.4383
8 Mar 125.8425
9 May 168.9517
10 Nov 93.1370
11 Oct 167.9436
12 Sep 166.8740
This gives the standard deviation around the mean of each month. How can I get R to output the standard deviation around the maximum value of each month?

You want "the maximum of the values per month and the average deviation from this maximum" (which is not the same as the standard deviation):
counts <- data.frame(year = sort(rep(2000:2009, 12)), month = rep(month.abb, 10), count = sample(1:500, 120, replace = TRUE))
library(data.table)
counts <- data.table(counts)
# mean difference between each count and its month's maximum
counts[, mean(count - max(count)), by = month]

This question is highly vague. If you want to calculate the standard deviation of the differences to the maximum, you can use this code:
> library(plyr)
> ddply(counts, .(month), summarise, sd = sd(count - max(count)))
month sd
1 Apr 182.5071
2 Aug 114.3068
3 Dec 117.1049
4 Feb 184.4638
5 Jan 138.1755
6 Jul 167.0677
7 Jun 100.8841
8 Mar 144.8724
9 May 173.3452
10 Nov 132.0204
11 Oct 127.4645
12 Sep 152.2162
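
One point worth checking before relying on the numbers above: subtracting a constant from every value does not change the standard deviation, so sd(count - max(count)) should equal sd(count) within each month. The differing outputs shown in the question and in this answer presumably come from sample() being rerun without a seed. The sketch below illustrates this in base R and, as one possible reading of "deviation around the maximum", also computes a root-mean-square deviation from the monthly maximum. The names by_month, sd_mean, sd_shift and rms_max are illustrative, the seed is arbitrary, and the RMS interpretation is an assumption about what the question intends.

```r
set.seed(1)  # arbitrary seed so the example is reproducible
counts <- data.frame(year  = sort(rep(2000:2009, 12)),
                     month = rep(month.abb, 10),
                     count = sample(1:500, 120, replace = TRUE))

# One vector of counts per month
by_month <- split(counts$count, counts$month)

# Ordinary standard deviation (around the monthly mean)
sd_mean <- sapply(by_month, sd)

# Standard deviation after subtracting the monthly maximum:
# shifting by a constant leaves sd unchanged
sd_shift <- sapply(by_month, function(v) sd(v - max(v)))

# Assumed alternative: root-mean-square deviation from the monthly maximum
rms_max <- sapply(by_month, function(v) sqrt(mean((v - max(v))^2)))

all.equal(sd_mean, sd_shift)
```

If the RMS version is what is wanted, the same expression can be dropped into the ddply call in place of sd(count - max(count)).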

Related

How to dynamically loop through a split dataframe in R

I have dataframe df3
df3
x d y
1 bbc Sep 2020 123
2 rsb Sep 2020 234
3 atc Sep 2020 345
4 svc Sep 2020 543
5 mwe Sep 2020 567
6 bpa Oct 2020 322
7 mwe Oct 2020 456
8 uhs Oct 2020 786
9 se Oct 2020 543
10 db Oct 2020 778
11 rsb Nov 2020 358
12 svc Nov 2020 678
13 db Nov 2020 321
14 rb Nov 2020 689
15 bpa Nov 2020 765
After applying the split function:
> df5 = split(df3,f=df3$d)
> df5
$`Sep 2020`
x d y
1 bbc Sep 2020 123
2 rsb Sep 2020 234
3 atc Sep 2020 345
4 svc Sep 2020 543
5 mwe Sep 2020 567
$`Oct 2020`
x d y
6 bpa Oct 2020 322
7 mwe Oct 2020 456
8 uhs Oct 2020 786
9 se Oct 2020 543
10 db Oct 2020 778
$`Nov 2020`
x d y
11 rsb Nov 2020 358
12 svc Nov 2020 678
13 db Nov 2020 321
14 rb Nov 2020 689
15 bpa Nov 2020 765
I would like to dynamically loop through the split data frame.
I need to find out whether any values present in Nov 2020 are also present in Oct 2020.
If a value is present in both, I then have to check the previous month, Sep 2020, and also count the number of months in which the name has occurred. Here df3$d is in as.yearmon format. If any names in df5[["Nov 2020"]]$x are present in df5[["Sep 2020"]]$x, extract them and store them in an object along with their count. Here the count is 2, since the names are present in Nov 2020 and Oct 2020. The previous months should be checked only if the name is present in the most recent month. For this example, the output should be
> df4
names_present present_for
1 bpa 2
2 db 2
Thank you in advance
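
One way to read the requirement is "for each name in the most recent month, count how many consecutive months, going backwards, it appears in, and keep the names with a streak of at least 2". The base R sketch below follows that reading, which is an assumption. It stores d as plain character rather than yearmon for simplicity, and months_desc, streak and streaks are hypothetical names introduced for illustration.

```r
# Rebuild the example data (d kept as character instead of yearmon)
df3 <- data.frame(
  x = c("bbc", "rsb", "atc", "svc", "mwe",
        "bpa", "mwe", "uhs", "se", "db",
        "rsb", "svc", "db", "rb", "bpa"),
  d = rep(c("Sep 2020", "Oct 2020", "Nov 2020"), each = 5),
  y = c(123, 234, 345, 543, 567, 322, 456, 786, 543, 778,
        358, 678, 321, 689, 765),
  stringsAsFactors = FALSE
)

# Months ordered from most recent to oldest
months_desc <- c("Nov 2020", "Oct 2020", "Sep 2020")
by_month <- split(df3$x, df3$d)[months_desc]

# Number of consecutive months, starting at the most recent, containing `name`
streak <- function(name) {
  present <- vapply(by_month, function(v) name %in% v, logical(1))
  sum(cumprod(present))  # counts the leading TRUE values only
}

latest  <- unique(by_month[[1]])
streaks <- vapply(latest, streak, numeric(1))

df4 <- data.frame(names_present = names(streaks),
                  present_for   = unname(streaks),
                  stringsAsFactors = FALSE)
df4 <- df4[df4$present_for >= 2, ]      # keep names also seen in the previous month
df4 <- df4[order(df4$names_present), ]
rownames(df4) <- NULL
df4
```

With real yearmon values, months_desc could be derived by sorting the unique values of df3$d in decreasing order instead of being typed by hand.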

Daily Average of Time series derived from monthly data R monthdays()

I have a time series object ts, shown in full below. It has monthly data from Jan 2013 to Dec 2017. I am trying to find the daily average value, that is, each monthly value divided by the number of days in that month.
Expected output
The first value, for Jan 2013, is 23770; I want 23770/31, since January has 31 days. The second value, for Feb 2013, is 23482; I want 23482/28, since February 2013 had 28 days, and so on.
Tried so far:
I know monthdays() can do this, with something like ts/monthdays(); monthdays() returns the number of days in a month. I am not able to implement it here. I read about tapply somewhere, but it does not give me the desired result, since I need values corresponding to each month-year combination.
ts
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
2013 23770 23482 23601 22889 23401 24240 23873 23647 23378 23871 22624 23496
2014 26765 27619 26341 27320 27389 27418 26874 27005 27538 26324 27267 27583
2015 28354 27452 28336 28998 28595 28338 27806 28660 27226 28317 28666 28574
2016 30209 30659 31554 30248 30358 31091 30389 30247 31227 31839 30602 30609
2017 32180 32203 31639 31784 32375 30856 31863 32827 32506 31702 31681 32176
> cycle(ts_actual_group2)
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
2013 1 2 3 4 5 6 7 8 9 10 11 12
2014 1 2 3 4 5 6 7 8 9 10 11 12
2015 1 2 3 4 5 6 7 8 9 10 11 12
2016 1 2 3 4 5 6 7 8 9 10 11 12
2017 1 2 3 4 5 6 7 8 9 10 11 12
Using tapply, since I read about it, but this does not give the desired output:
tapply(ts_actual_group2, cycle(ts_actual_group2), mean)
1 2 3 4 5 6 7 8 9 10 11 12
28255.6 28283.0 28294.2 28247.8 28423.6 28388.6 28161.0 28477.2 28375.0 28410.6 28168.0 28487.6

"I am not able to implement it here."
I'm not sure why you couldn't. The monthdays function from the forecast package, when applied to a ts object, returns the number of days in each month of the series. The object returned is a time series of the same dimension as the input, so you can simply divide one by the other.
library(forecast)
ts/monthdays(ts)
Jan Feb Mar Apr May Jun
2013 766.7742 838.6429 761.3226 762.9667 754.8710 808.0000
2014 863.3871 986.3929 849.7097 910.6667 883.5161 913.9333
2015 914.6452 980.4286 914.0645 966.6000 922.4194 944.6000
2016 974.4839 1057.2069 1017.8710 1008.2667 979.2903 1036.3667
2017 1038.0645 1150.1071 1020.6129 1059.4667 1044.3548 1028.5333
monthdays(ts) # Accepts a time-series object
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
2013 31 28 31 30 31 30 31 31 30 31 30 31
2014 31 28 31 30 31 30 31 31 30 31 30 31
2015 31 28 31 30 31 30 31 31 30 31 30 31
2016 31 29 31 30 31 30 31 31 30 31 30 31
2017 31 28 31 30 31 30 31 31 30 31 30 31
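
If installing forecast is not an option, the day counts can also be derived in base R from the gaps between consecutive month starts. This is a minimal sketch of that alternative, not the answer's method. It uses only the first four values of the series from the question, and ts_monthly, month_starts and days are illustrative names.

```r
# First four values of the series from the question (Jan-Apr 2013)
ts_monthly <- ts(c(23770, 23482, 23601, 22889),
                 start = c(2013, 1), frequency = 12)

# First day of the first month in the series
first_month <- as.Date(sprintf("%d-%02d-01",
                               as.integer(start(ts_monthly)[1]),
                               as.integer(start(ts_monthly)[2])))

# First day of every month in the series, plus one extra month at the end
month_starts <- seq(first_month, by = "month",
                    length.out = length(ts_monthly) + 1)

# Days in each month = gap between consecutive month starts
days <- as.numeric(diff(month_starts))

# Dividing a ts by a plain vector of the same length keeps the ts attributes
daily_avg <- ts_monthly / days
daily_avg
```

Because the day counts come from real calendar dates, leap years such as Feb 2016 are handled without special cases.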

converting to time series using ts() in r

Good afternoon
I have a time series
v2<-c(12,13,15,17,18,12,11,12)
which runs from July 1996 to October 1997, covering just the months between July and October in each year.
When I try to convert it to a time series with
v2.ts<-ts(v2, frequency=12, start=c(1996,7), end=c(1997,10))
it yields this result:
     Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
1996                          12  13  15  17  18  12
1997  11  12  12  13  15  17  18  12  11  12
What parameters can I use to make it look like this:
Jul Aug Sep Oct
1996 12 13 15 17
1997 18 12 11 12
Thanks in advance for the help

A ts series must be regularly spaced, but the output shown has points that are one month apart except between October of the first year and July of the second, so it is not of that form.
There are several packages that can represent irregularly spaced series. With the zoo package it would be done like this:
library(zoo)
z <- as.zoo(v2.ts)
z[cycle(z) %in% 7:10]
## Jul 1996 Aug 1996 Sep 1996 Oct 1996 Jul 1997 Aug 1997 Sep 1997 Oct 1997
## 12 13 15 17 18 12 11 12
If you are not looking for a time series but just a matrix with the indicated elements then:
tapply(c(v2.ts), list(floor(time(v2.ts)), cycle(v2.ts)), c)[, 7:10]
## 7 8 9 10
## 1996 12 13 15 17
## 1997 18 12 11 12
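
It may help to see why the question's ts() call produced 16 values: July 1996 to October 1997 spans 16 months, so ts() recycles the 8 input values to fill them. The base R sketch below confirms this and, assuming only a labelled matrix is needed, builds it directly from v2 without going through ts at all. The name m is illustrative.

```r
v2 <- c(12, 13, 15, 17, 18, 12, 11, 12)

# Jul 1996 to Oct 1997 is 16 months, so the 8 values are recycled
v2.ts <- ts(v2, frequency = 12, start = c(1996, 7), end = c(1997, 10))
length(v2.ts)

# Assumed goal: just a year-by-month table of the 8 observations
m <- matrix(v2, nrow = 2, byrow = TRUE,
            dimnames = list(c("1996", "1997"), month.abb[7:10]))
m
```

This matrix carries no time-series attributes, so it suits display or lookup rather than time-series modelling.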

Fill last spot of matrix with NA

I think I might be missing something very simple, but:
How can I fill the last spots of a matrix with NA instead of it just repeating previous values?
Data example:
x <- 1:27
m <- matrix(x, nrow = 12, ncol = ceiling(length(x)/12), byrow = FALSE)
col_names <- c("2013", "2014", "2015")
row_names <- c("Jan", "Feb", "Mar", "Apr", "May",
"Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
dimnames(m) <- list(row_names, col_names)
m
2013 2014 2015
Jan 1 13 25
Feb 2 14 26
Mar 3 15 27
Apr 4 16 1 # NOT NA?
May 5 17 2
Jun 6 18 3
Jul 7 19 4
Aug 8 20 5
Sep 9 21 6
Oct 10 22 7
Nov 11 23 8
Dec 12 24 9
I would like all values after 2015 March to be filled with NA.

If you assign a shorter vector to a longer vector in R, it recycles the values in the shorter vector. That's what you are observing here. (Note that a matrix is just a vector with a dimension attribute.) This behaviour cannot be avoided, so you should assign NA after creating the matrix:
m[28:length(m)] <- NA
Or, alternatively, you could append the necessary number of NA values to 1:27 when creating the matrix.
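
The padding alternative could look roughly like this. It is a sketch under the question's setup (12 rows, one column per year); n_rows, n_cols and x_padded are illustrative names.

```r
x <- 1:27
n_rows <- 12
n_cols <- ceiling(length(x) / n_rows)  # 3 columns for 27 values

# Pad with NA up to a full matrix so that nothing gets recycled
x_padded <- c(x, rep(NA, n_rows * n_cols - length(x)))

m <- matrix(x_padded, nrow = n_rows, ncol = n_cols,
            dimnames = list(month.abb, 2013:2015))
m
```

Because the padded vector has exactly nrow * ncol elements, matrix() has nothing to recycle and the trailing cells stay NA.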

Create a list of dimnames and from that create a matrix of NAs. Finally, fill it:
x <- 1:27 # input as per question
dnm <- list(month.abb, 2013:2015) # list of dimnames
m <- matrix(NA, nrow = length(dnm[[1]]), ncol = length(dnm[[2]]), dimnames = dnm)
m[seq_along(x)] <- x
Note: You might not want to do this at all and instead create a monthly time series:
library(zoo)
z <- zooreg(x, as.yearmon("2013-01"), frequency = 12)
giving:
> z
Jan 2013 Feb 2013 Mar 2013 Apr 2013 May 2013 Jun 2013 Jul 2013 Aug 2013
1 2 3 4 5 6 7 8
Sep 2013 Oct 2013 Nov 2013 Dec 2013 Jan 2014 Feb 2014 Mar 2014 Apr 2014
9 10 11 12 13 14 15 16
May 2014 Jun 2014 Jul 2014 Aug 2014 Sep 2014 Oct 2014 Nov 2014 Dec 2014
17 18 19 20 21 22 23 24
Jan 2015 Feb 2015 Mar 2015
25 26 27
or a ts series:
> as.ts(z)
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
2013 1 2 3 4 5 6 7 8 9 10 11 12
2014 13 14 15 16 17 18 19 20 21 22 23 24
2015 25 26 27
or directly:
ts(x, start = c(2013, 1), frequency = 12)

Order dataframe by month

I have calculated the maximum counts per month in this data.frame:
counts <- data.frame(year = sort(rep(2000:2009, 12)), month = rep(month.abb, 10), count = sample(1:500, 120, replace = TRUE))
library(plyr)
count_max <- ddply(counts, .(month), summarise, max.count = max(count))
month max.count
1 Apr 470
2 Aug 389
3 Dec 446
4 Feb 487
5 Jan 473
6 Jul 460
7 Jun 488
8 Mar 449
9 May 488
10 Nov 464
11 Oct 483
12 Sep 394
I now want to sort count_max by the month.abb vector, so that month is in the usual order Jan-Dec. This is what I tried:
count_max[match(count_max$month, month.abb),]
...but it didn't work. How can I arrange count_max$month in the order Jan-Dec?

An alternate without conversion:
count_max[order(match(count_max$month, month.abb)), ]
# month max.count
# 5 Jan 466
# 4 Feb 356
# 8 Mar 496
# 1 Apr 489
# 9 May 498
# 7 Jun 497
# 6 Jul 491
# 2 Aug 446
# 12 Sep 414
# 11 Oct 490
# 10 Nov 416
# 3 Dec 475
Note that in your example, match(count...) returns the position of each month in month.abb, which is what you want to sort by. You came really close, but instead of sorting by that vector, you subsetted by it. For example, August is the 2nd value in your original data frame but the 8th value in month.abb, so the 2nd element of your subset vector is 8. That means the 8th row of your original data frame (in your case March) goes into the second position of the new data frame, instead of the 2nd row of the original being ranked into 8th position of the new one.
The distinction is a bit of a brain twister, but if you think it through it should make sense.
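
A small base R sketch may make the difference concrete. The data frame below mimics the alphabetical month order that ddply returns, with made-up max.count values (simply 1 to 12). The names pos, by_subset and by_order are illustrative.

```r
# Months in the alphabetical order produced by ddply; max.count is made up
count_max <- data.frame(
  month = c("Apr", "Aug", "Dec", "Feb", "Jan", "Jul",
            "Jun", "Mar", "May", "Nov", "Oct", "Sep"),
  max.count = 1:12,
  stringsAsFactors = FALSE
)

# Position of each row's month within the calendar
pos <- match(count_max$month, month.abb)

# Subsetting by pos: row 2 of the result is row 8 of the input (Mar)
by_subset <- count_max[pos, ]

# Ordering by pos: rows are rearranged so that pos is increasing (Jan..Dec)
by_order <- count_max[order(pos), ]

by_subset$month
by_order$month
```

Here by_order$month comes out in calendar order, while by_subset$month does not, which is the behaviour the question ran into.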

Convert your "month" column into an ordered factor:
factor(count_max$month, month.abb, ordered=TRUE)
# [1] Apr Aug Dec Feb Jan Jul Jun Mar May Nov Oct Sep
# Levels: Jan < Feb < Mar < Apr < May < Jun < Jul < Aug < Sep < Oct < Nov < Dec
Example:
count_max$month <- factor(count_max$month, month.abb, ordered=TRUE)
count_max[order(count_max$month), ]
# month max.count
# 5 Jan 482
# 4 Feb 408
# 8 Mar 483
# 1 Apr 489
# 9 May 369
# 7 Jun 432
# 6 Jul 344
# 2 Aug 470
# 12 Sep 474
# 11 Oct 450
# 10 Nov 492
# 3 Dec 366
