Sorting months in a matrix in R

I have a matrix in this format:
year month Freq
1 2014 April 466
2 2015 April 59535
3 2014 August 10982
4 2015 August 0
5 2014 December 35881
6 2015 December 0
7 2014 February 17
8 2015 February 24258
9 2014 January 0
10 2015 January 22785
11 2014 July 2981
12 2015 July 0
13 2014 June 1279
14 2015 June 31356
15 2014 March 289
16 2015 March 40274
I need to sort the months in calendar order, i.e. Jan, Feb, Mar, ..., but when I sort, they are ordered alphabetically by their first letter. I used this:
mat <- mat[order(mat[,1], decreasing = TRUE), ]
and it looks like this :
row.names April August December February January July June March May November October September
1 2015 59535 0 0 24258 22785 0 31356 40274 84211 0 0 0
2 2014 466 10982 35881 17 0 2981 1279 289 879 8911 8565 4000
Can we sort months in calendar order in R?

Suppose DF is the data frame from which you derived your matrix; it is provided in reproducible form at the end. Ensure that month and year are factors with appropriate levels. month.name is a built-in variable in R, used here so that the month levels are in calendar order, and year is assumed to be a numeric column. Then use levelplot like this:
DF2 <- transform(DF,
month = factor(as.character(month), levels = month.name),
year = factor(year)
)
library(lattice)
levelplot(Freq ~ year * month, DF2)
Note: Here is DF in reproducible form:
Lines <- " year month Freq
1 2014 April 466
2 2015 April 59535
3 2014 August 10982
4 2015 August 0
5 2014 December 35881
6 2015 December 0
7 2014 February 17
8 2015 February 24258
9 2014 January 0
10 2015 January 22785
11 2014 July 2981
12 2015 July 0
13 2014 June 1279
14 2015 June 31356
15 2014 March 289
16 2015 March 40274 "
DF <- read.table(text = Lines, header = TRUE)
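For plain sorting rather than plotting, the same factor idea can be sketched in base R as follows. This is a minimal example on a hypothetical six-row subset of the data; the names sorted and wide are illustrative and do not come from the answer above.

```r
# Hypothetical subset of the question's data (three months, two years)
DF <- data.frame(
  year  = rep(c(2014, 2015), times = 3),
  month = rep(c("April", "August", "February"), each = 2),
  Freq  = c(466, 59535, 10982, 0, 17, 24258)
)

# Give month calendar-ordered levels so that order() follows the calendar
DF$month <- factor(DF$month, levels = month.name)
sorted <- DF[order(DF$year, DF$month), ]

# Cross-tabulate into a year-by-month table; columns follow the factor levels
# (droplevels() removes the months that do not occur in this subset)
wide <- xtabs(Freq ~ year + month, data = droplevels(sorted))
print(wide)
```

Because the ordering lives in the factor levels, any later tabulation or plot of month picks up the calendar order without further work.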

Assuming you want to sort chronologically (a dummy day 1 has to be added so the strings can be parsed as dates), and that mat is a data frame, since $ does not work on a matrix:
time <- as.POSIXct(strptime(paste(1, mat$month, mat$year), format = "%d %B %Y"))
mat <- mat[order(time), ]
Or, if you don't care about the year:
time <- as.POSIXct(strptime(paste(1, mat$month, 2000), format = "%d %B %Y"))
mat <- mat[order(time), ]
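If the data is already in the wide layout shown in the question (one column per month), the columns can be reordered directly by matching their names against month.name. The following is a minimal sketch on a hypothetical four-month matrix; wide, col_order and wide_sorted are illustrative names.

```r
# Hypothetical wide matrix: rows are years, columns are months in alphabetical order
wide <- matrix(
  c(59535, 466, 0, 10982, 24258, 17, 22785, 0),
  nrow = 2,
  dimnames = list(c("2015", "2014"),
                  c("April", "August", "February", "January"))
)

# match() gives each column's calendar position; order() turns that into a permutation
col_order <- order(match(colnames(wide), month.name))
wide_sorted <- wide[, col_order, drop = FALSE]
print(wide_sorted)
```

This avoids date parsing altogether, so it does not depend on the locale, but it does assume the column names are full English month names.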

Related

How can I create a new column that only extracts either the year or month from a mm/dd/yy hh:mm string?

I have a date/time string variable that looks like this:
> dput(df$starttime)
c("12/16/20 7:24", "6/21/21 13:20", "1/22/20 9:03", "1/07/20 17:19",
"11/8/21 10:14", NA, NA, "10/26/21 7:19", "3/14/22 9:48", "5/12/22 13:29")
I basically want to create a column that only has the year (2020, 2021, 2022) and another with the year + month (e.g., "Jan 2022").
1) Base R Assuming that you want separate month and year numeric columns, define a function which converts a string in the format shown in the question to a year or month number and then invoke it twice. No packages are used.
toNum <- function(x, fmt) format(as.Date(x, "%m/%d/%y"), fmt) |>
type.convert(as.is = TRUE)
transform(df, year = toNum(starttime, "%Y"), month = toNum(starttime, "%m"))
giving
starttime year month
1 12/16/20 7:24 2020 12
2 6/21/21 13:20 2021 6
3 1/22/20 9:03 2020 1
4 1/07/20 17:19 2020 1
5 11/8/21 10:14 2021 11
6 <NA> NA NA
7 <NA> NA NA
8 10/26/21 7:19 2021 10
9 3/14/22 9:48 2022 3
10 5/12/22 13:29 2022 5
2) yearmon Assuming that you want a yearmon class column, we can use the following. The yearmon class represents year and month internally as year + fraction, where fraction is 0 for Jan, 1/12 for Feb, ..., 11/12 for Dec, so that it sorts appropriately and adding 1/12, say, gives the next month. Note that if ym is yearmon then as.integer(ym) is the year and cycle(ym) is the month number (1, 2, ..., 12).
library(zoo)
transform(df, yearmon = as.yearmon(starttime, "%m/%d/%y"))
giving:
starttime yearmon
1 12/16/20 7:24 Dec 2020
2 6/21/21 13:20 Jun 2021
3 1/22/20 9:03 Jan 2020
4 1/07/20 17:19 Jan 2020
5 11/8/21 10:14 Nov 2021
6 <NA> <NA>
7 <NA> <NA>
8 10/26/21 7:19 Oct 2021
9 3/14/22 9:48 Mar 2022
10 5/12/22 13:29 May 2022
Note
If you want to sort by starttime then use
ct <- as.POSIXct(df$starttime, format = "%m/%d/%y %H:%M")
df[order(ct),, drop = FALSE ]
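Since the strings carry two-digit years, the parsing format needs %y (year without century) rather than %Y. A minimal base R sketch on a hypothetical four-element subset of the data follows; the tz = "UTC" argument is an assumption made only to keep the result independent of the local time zone.

```r
# Hypothetical subset of the starttime values, including a missing one
x <- c("12/16/20 7:24", "1/07/20 17:19", NA, "3/14/22 9:48")

# %y reads "20" as 2020; %Y would read it as the year 20
ct <- as.POSIXct(x, format = "%m/%d/%y %H:%M", tz = "UTC")

year  <- as.integer(format(ct, "%Y"))  # four-digit year as a number
month <- as.integer(format(ct, "%m"))  # month number 1..12
ord   <- order(ct)                     # chronological order, NA placed last

print(data.frame(starttime = x, year, month)[ord, ])
```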
If you want a chronologically sortable output, you could use the tsibble::yearmonth type:
tsibble::yearmonth(lubridate::mdy_hm(c("12/16/20 7:24", "6/21/21 13:20", "1/22/20 9:03", "1/07/20 17:19",
"11/8/21 10:14", NA, NA, "10/26/21 7:19", "3/14/22 9:48", "5/12/22 13:29")))
result
<yearmonth[10]>
[1] "2020 Dec" "2021 Jun" "2020 Jan" "2020 Jan" "2021 Nov" NA NA
[8] "2021 Oct" "2022 Mar" "2022 May"
An option is to convert to the datetime class POSIXct with mdy_hm (from lubridate), then use format to extract the month (%b) and 4-digit year (%Y), filter out the NA elements, and arrange by the converted datetime column:
library(dplyr)
library(lubridate)
df %>%
mutate(starttime = mdy_hm(starttime),
yearmonth = format(starttime, "%b %Y")) %>%
filter(complete.cases(yearmonth)) %>%
arrange(starttime)
-output
# A tibble: 8 × 2
starttime yearmonth
<dttm> <chr>
1 2020-01-07 17:19:00 Jan 2020
2 2020-01-22 09:03:00 Jan 2020
3 2020-12-16 07:24:00 Dec 2020
4 2021-06-21 13:20:00 Jun 2021
5 2021-10-26 07:19:00 Oct 2021
6 2021-11-08 10:14:00 Nov 2021
7 2022-03-14 09:48:00 Mar 2022
8 2022-05-12 13:29:00 May 2022
Try this with lubridate
library(lubridate)
data.frame(df,
Year = format(mdy_hm(df$starttime), "%Y"),
MonthYear = format(mdy_hm(df$starttime), "%b %Y"))
starttime Year MonthYear
1 12/16/20 7:24 2020 Dec 2020
2 6/21/21 13:20 2021 Jun 2021
3 1/22/20 9:03 2020 Jan 2020
4 1/07/20 17:19 2020 Jan 2020
5 11/8/21 10:14 2021 Nov 2021
6 <NA> <NA> <NA>
7 <NA> <NA> <NA>
8 10/26/21 7:19 2021 Oct 2021
9 3/14/22 9:48 2022 Mar 2022
10 5/12/22 13:29 2022 May 2022
This uses mdy_hm together with format to extract the four-digit year (%Y) and the abbreviated month plus year (%b %Y) from the date.
Ordered rows:
df_new <- data.frame(df,
Year = format(mdy_hm(df$starttime), "%Y"),
MonthYear = format(mdy_hm(df$starttime), "%b %Y"))
df_new[order(my(df_new$MonthYear)),]
starttime Year MonthYear
3 1/22/20 9:03 2020 Jan 2020
4 1/07/20 17:19 2020 Jan 2020
1 12/16/20 7:24 2020 Dec 2020
2 6/21/21 13:20 2021 Jun 2021
8 10/26/21 7:19 2021 Oct 2021
5 11/8/21 10:14 2021 Nov 2021
9 3/14/22 9:48 2022 Mar 2022
10 5/12/22 13:29 2022 May 2022
6 <NA> <NA> <NA>
7 <NA> <NA> <NA>
Without NAs
na.omit(df_new[order(my(df_new$MonthYear)),])
starttime Year MonthYear
3 1/22/20 9:03 2020 Jan 2020
4 1/07/20 17:19 2020 Jan 2020
1 12/16/20 7:24 2020 Dec 2020
2 6/21/21 13:20 2021 Jun 2021
8 10/26/21 7:19 2021 Oct 2021
5 11/8/21 10:14 2021 Nov 2021
9 3/14/22 9:48 2022 Mar 2022
10 5/12/22 13:29 2022 May 2022
Data
df <- structure(list(starttime = c("12/16/20 7:24", "6/21/21 13:20",
"1/22/20 9:03", "1/07/20 17:19", "11/8/21 10:14", NA, NA, "10/26/21 7:19",
"3/14/22 9:48", "5/12/22 13:29")), class = "data.frame", row.names = c(NA,
-10L))

dplyr sample_n by one variable through another one

I have a data frame with a "grouping" variable season and another variable year which is repeated for each month.
df <- data.frame(month = as.character(sapply(month.name,function(x)rep(x,4))),
season = c(rep("winter",8),rep("spring",12),rep("summer",12),rep("autumn",12),rep("winter",4)),
year = rep(2021:2024,12))
I would like to use dplyr::sample_n or something similar to choose 2 months in the data frame for each season and keep the same months for all the years, for example:
month season year
1 January winter 2021
2 January winter 2022
3 January winter 2023
4 January winter 2024
5 February winter 2021
6 February winter 2022
7 February winter 2023
8 February winter 2024
9 March spring 2021
10 March spring 2022
11 March spring 2023
12 March spring 2024
13 May spring 2021
14 May spring 2022
15 May spring 2023
16 May spring 2024
17 June summer 2021
18 June summer 2022
19 June summer 2023
20 June summer 2024
21 July summer 2021
22 July summer 2022
23 July summer 2023
24 July summer 2024
25 October autumn 2021
26 October autumn 2022
27 October autumn 2023
28 October autumn 2024
29 November autumn 2021
30 November autumn 2022
31 November autumn 2023
32 November autumn 2024
I cannot use df %>% group_by(season,year) %>% sample_n(2) since it chooses different months for each year.
Thanks!
We can randomly sample 2 values from month and filter them by group.
library(dplyr)
df %>%
group_by(season) %>%
filter(month %in% sample(unique(month),2))
# month season year
# <chr> <chr> <int>
# 1 January winter 2021
# 2 January winter 2022
# 3 January winter 2023
# 4 January winter 2024
# 5 February winter 2021
# 6 February winter 2022
# 7 February winter 2023
# 8 February winter 2024
# 9 March spring 2021
#10 March spring 2022
# … with 22 more rows
If some groups have fewer than 2 unique values, we can sample the minimum of 2 and the number of unique values in the group.
df %>%
group_by(season) %>%
filter(month %in% sample(unique(month),min(2, n_distinct(month))))
Using the same logic with base R, we can use ave
df[as.logical(with(df, ave(month, season,
FUN = function(x) x %in% sample(unique(x),2)))), ]
An option using slice
library(dplyr)
df %>%
group_by(season) %>%
slice(which(!is.na(match(month, sample(unique(month), 2)))))
# A tibble: 32 x 3
# Groups: season [4]
# month season year
# <fct> <fct> <int>
# 1 October autumn 2021
# 2 October autumn 2022
# 3 October autumn 2023
# 4 October autumn 2024
# 5 November autumn 2021
# 6 November autumn 2022
# 7 November autumn 2023
# 8 November autumn 2024
# 9 April spring 2021
#10 April spring 2022
# … with 22 more rows
Or using base R
by(df, df$season, FUN = function(x) subset(x, month %in% sample(unique(month), 2 )))
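The idea shared by all of these answers is to sample month names once per season and then keep every row whose month was drawn. A minimal base R sketch that returns a single data frame is shown below; set.seed is added only to make the draw repeatable, and months_by_season and picked are illustrative names.

```r
set.seed(1)  # only for a repeatable draw

# Same layout as the question's data: 12 months x 4 years, one season per month
df <- data.frame(
  month  = rep(month.name, each = 4),
  season = rep(c("winter", "spring", "summer", "autumn", "winter"),
               times = c(8, 12, 12, 12, 4)),
  year   = rep(2021:2024, 12)
)

# Draw 2 distinct months within each season (a named list, one entry per season)
months_by_season <- lapply(split(df$month, df$season),
                           function(m) sample(unique(m), 2))

# Keep all years for the drawn months; each month belongs to exactly one season
picked <- df[df$month %in% unlist(months_by_season), ]
print(months_by_season)
```

Sampling the month names before filtering is what keeps the chosen months the same across years.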

How to add missing months to a data frame?

I have a dataset with three observations: January, February, and March. I would like to add the remaining months as observations of zero to the same datatable, but I'm having trouble appending these.
Here's my current code:
library(dplyr)
Period <- c("January 2015", "February 2015", "March 2015",
"January 2016", "February 2016", "March 2016",
"January 2017", "February 2017", "March 2017",
"January 2018", "February 2018", "March 2018")
Month <- c("January", "February", "March",
"January", "February", "March",
"January", "February", "March",
"January", "February", "March")
Dollars <- c(936, 753, 731,
667, 643, 588,
948, 894, 997,
774,745, 684)
dat <- data.frame(Period = Period, Month = Month, Dollars = Dollars)
dat2 <- dat %>%
dplyr::select(Month, Dollars) %>%
dplyr::group_by(Month) %>%
dplyr::summarise(AvgDollars = mean(Dollars))
Any ideas for populating April through December in the dataset are greatly appreciated. Thanks in advance!
Here's the way to do it using complete in one step:
library(tidyverse)
Then use complete:
dat2 <- data.frame(Period = Period, Month = Month, Dollars = Dollars) %>%
# make a "year" variable
mutate(Year = word(Period, 2,2)) %>%
# remove period variable (we'll add it in later)
select(-Period) %>%
# month.name is a base variable listing all months (thanks @Gregor).
# nesting by "Year" lets complete know you only want the years listed in your dataset.
complete(Month = month.name, nesting(Year), fill = list(Dollars = 0)) %>%
# Arrange by Year and month
arrange(Year, Month) %>%
#remake the "period" variable
mutate(Period = paste(Month, Year)) %>%
group_by(Month) %>%
summarise(AvgDollars = mean(Dollars))
Here is a two-step solution:
library(dplyr)
Sys.setlocale("LC_TIME", "English")
# first, define a data frame with each month from January 2015 to December 2018
# (Period_all is built first so that Month is derived from the new periods,
# not from the shorter Period vector defined in the question)
Period_all <- format(seq(as.Date("2015/1/1"), as.Date("2018/12/1"), by = "month"),
                     format = "%B %Y")
dat2 <- data.frame(Period = Period_all,
                   Month = substr(Period_all, 1, nchar(Period_all) - 5))
# then, join dat onto dat2 (a left join from dat2 keeps its rows in calendar order)
dat2 %>%
  left_join(select(dat, Period, Dollars), by = "Period")
Period Month Dollars
1 January 2015 January 936
2 February 2015 February 753
3 March 2015 March 731
4 April 2015 April NA
5 May 2015 May NA
6 June 2015 June NA
7 July 2015 July NA
8 August 2015 August NA
9 September 2015 September NA
10 October 2015 October NA
11 November 2015 November NA
12 December 2015 December NA
13 January 2016 January 667
14 February 2016 February 643
15 March 2016 March 588
16 April 2016 April NA
17 May 2016 May NA
18 June 2016 June NA
19 July 2016 July NA
20 August 2016 August NA
21 September 2016 September NA
22 October 2016 October NA
23 November 2016 November NA
24 December 2016 December NA
25 January 2017 January 948
26 February 2017 February 894
27 March 2017 March 997
28 April 2017 April NA
29 May 2017 May NA
30 June 2017 June NA
31 July 2017 July NA
32 August 2017 August NA
33 September 2017 September NA
34 October 2017 October NA
35 November 2017 November NA
36 December 2017 December NA
37 January 2018 January 774
38 February 2018 February 745
39 March 2018 March 684
40 April 2018 April NA
41 May 2018 May NA
42 June 2018 June NA
43 July 2018 July NA
44 August 2018 August NA
45 September 2018 September NA
46 October 2018 October NA
47 November 2018 November NA
48 December 2018 December NA
Maybe there's a more graceful solution with dplyr, but here is a quick solution without much typing:
dat <- rbind(data.frame(Period = Period, Month = Month, Dollars = Dollars),
data.frame(Period = c(sapply(2015:2018, function(x) format(ISOdate(x,4:12,1),"%B %Y"))),
Month = c(sapply(2015:2018, function(x) format(ISOdate(x,4:12,1),"%B"))),
Dollars = 0))
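The same fill-in can also be done in base R by building the full year-by-month grid with expand.grid and merging the observed rows onto it. The sketch below uses a hypothetical two-year subset with a separate Year column (an assumption; the question stores the year inside Period), and grid, full and avg are illustrative names.

```r
# Hypothetical subset of the question's data, with the year split out
dat <- data.frame(
  Year    = rep(2015:2016, each = 3),
  Month   = rep(c("January", "February", "March"), times = 2),
  Dollars = c(936, 753, 731, 667, 643, 588)
)

# Every month of every observed year
grid <- expand.grid(Month = month.name, Year = unique(dat$Year),
                    stringsAsFactors = FALSE)

# Left join: months without data get NA, which is then replaced by 0
full <- merge(grid, dat, by = c("Year", "Month"), all.x = TRUE)
full$Dollars[is.na(full$Dollars)] <- 0

# merge() sorts month names alphabetically, so restore calendar order
full <- full[order(full$Year, match(full$Month, month.name)), ]

# Monthly average across years, as in the question's summarise() step
avg <- aggregate(Dollars ~ Month, data = full, FUN = mean)
print(avg)
```

Whether missing months should count as 0 or stay NA depends on the analysis; zeros pull the averages down, while NA would need na.rm = TRUE in mean.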

Dealing with nonexistent data when converting to time-series in CRAN R

I have the following data set and I am trying to convert consumption to a time series. Some months are missing (e.g. there is no data for 10/2014).
year month consumption
2014 7 10617
2014 8 8318
2014 9 3199
2014 12 2066
2015 1 10825
2015 2 3096
2015 3 1665
2015 4 3651
2015 5 5807
2015 7 2951
2015 8 5885
2015 9 3653
2015 10 4266
2015 11 9706
When I use ts() in R, wrong values are filled in for the nonexistent months.
ts(mkt$consumption, start = c(2014,7),end=c(2015,11), frequency=12)
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
2014 10617 8318 3199 2066 10825 3096
2015 1665 3651 5807 2951 5885 3653 4266 9706 10617 8318 3199
My question is how to simply replace the nonexistent values with zero or blank?
The "ts" class requires that the data be regularly spaced, i.e. every month must be present or NA, which is not the case here. The zoo package can handle irregularly spaced series. Read the input into zoo using the "yearmon" class for the year/month and then simply use it as a "zoo" series or else convert it to "ts". If the input is in a file but is otherwise exactly as in Lines, replace text = Lines with something like "myfile.dat".
Lines <- "year month consumption
2014 7 10617
2014 8 8318
2014 9 3199
2014 12 2066
2015 1 10825
2015 2 3096
2015 3 1665
2015 4 3651
2015 5 5807
2015 7 2951
2015 8 5885
2015 9 3653
2015 10 4266
2015 11 9706"
library(zoo)
toYearmon <- function(y, m) as.yearmon(paste(y, m), "%Y %m")
z <- read.zoo(text = Lines, header = TRUE, index = 1:2, FUN = toYearmon)
as.ts(z)
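For comparison, a regular monthly series can also be built in base R by computing each row's position on the monthly grid and leaving the gaps as NA. This is a minimal sketch on a hypothetical five-row subset; it assumes the rows are sorted by date with the earliest month first, and pos, vals, x and x0 are illustrative names.

```r
# Hypothetical subset of the question's data (Oct and Nov 2014 are missing)
mkt <- data.frame(
  year        = c(2014, 2014, 2014, 2014, 2015),
  month       = c(7, 8, 9, 12, 1),
  consumption = c(10617, 8318, 3199, 2066, 10825)
)

# Position of each observation counted in months from the first one (1-based)
pos <- (mkt$year - mkt$year[1]) * 12 + (mkt$month - mkt$month[1]) + 1

# Place the observed values on the full grid; unobserved months stay NA
vals <- rep(NA_real_, max(pos))
vals[pos] <- mkt$consumption
x <- ts(vals, start = c(mkt$year[1], mkt$month[1]), frequency = 12)

# If zeros are wanted instead of NA
x0 <- x
x0[is.na(x0)] <- 0
print(x0)
```

Keeping NA is usually the safer default, since a zero is indistinguishable from a genuinely observed zero consumption.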

Summarising data frame using maximum counts

I have this data.frame:
counts <- data.frame(year = sort(rep(2000:2009, 12)), month = rep(month.abb,10), count = sample(1:500, 120, replace = T))
First 20 rows of data:
head(counts, 20)
year month count
1 2000 Jan 14
2 2000 Feb 182
3 2000 Mar 462
4 2000 Apr 395
5 2000 May 107
6 2000 Jun 127
7 2000 Jul 371
8 2000 Aug 158
9 2000 Sep 147
10 2000 Oct 41
11 2000 Nov 141
12 2000 Dec 27
13 2001 Jan 72
14 2001 Feb 7
15 2001 Mar 40
16 2001 Apr 351
17 2001 May 342
18 2001 Jun 81
19 2001 Jul 442
20 2001 Aug 389
Let's say I try to calculate the standard deviation of these data using the usual R code:
library(plyr)
ddply(counts, .(month), summarise, s.d. = sd(count))
month s.d.
1 Apr 145.3018
2 Aug 140.9949
3 Dec 173.9406
4 Feb 127.5296
5 Jan 148.2661
6 Jul 162.4893
7 Jun 133.4383
8 Mar 125.8425
9 May 168.9517
10 Nov 93.1370
11 Oct 167.9436
12 Sep 166.8740
This gives the standard deviation around the mean of each month. How can I get R to output standard deviation around maximum value of each month?
You want: "the max of the values per month and the average difference from this maximum value" [which is not the same as the standard deviation].
counts <- data.frame(year = sort(rep(2000:2009, 12)), month = rep(month.abb,10), count = sample(1:500, 120, replace = T))
library(data.table)
counts=data.table(counts)
counts[,mean(count-max(count)),by=month]
This question is highly vague. If you want to calculate the standard deviation of the differences to the maximum, you can use this code:
> library(plyr)
> ddply(counts, .(month), summarise, sd = sd(count - max(count)))
month sd
1 Apr 182.5071
2 Aug 114.3068
3 Dec 117.1049
4 Feb 184.4638
5 Jan 138.1755
6 Jul 167.0677
7 Jun 100.8841
8 Mar 144.8724
9 May 173.3452
10 Nov 132.0204
11 Oct 127.4645
12 Sep 152.2162
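One point worth checking is that subtracting a constant does not change a standard deviation, so sd(count - max(count)) is the same number as sd(count) for the same data; the tables in the question and in this answer differ only because the data was re-sampled. If "around the maximum" is instead read as a root-mean-square deviation measured from the maximum, which is one possible interpretation and not something the question confirms, it can be sketched in base R as follows. set.seed and the names sd_plain, sd_shifted and dev_from_max are illustrative.

```r
set.seed(42)  # only for a repeatable sample
counts <- data.frame(
  year  = rep(2000:2009, each = 12),
  month = rep(month.abb, 10),
  count = sample(1:500, 120, replace = TRUE)
)
# Calendar-ordered levels so the results come out Jan..Dec
counts$month <- factor(counts$month, levels = month.abb)

# Ordinary standard deviation per month
sd_plain <- tapply(counts$count, counts$month, sd)

# Shifting by the maximum leaves the standard deviation unchanged
sd_shifted <- tapply(counts$count, counts$month, function(x) sd(x - max(x)))

# Deviation measured from the maximum instead of the mean (n - 1 denominator)
dev_from_max <- tapply(counts$count, counts$month,
                       function(x) sqrt(sum((x - max(x))^2) / (length(x) - 1)))

print(round(cbind(sd_plain, sd_shifted, dev_from_max), 2))
```

The deviation from the maximum is never smaller than the ordinary standard deviation, because the mean is the value that minimises the sum of squared deviations.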
