How to add missing months to a data frame? - r

I have a dataset with observations for only three months: January, February, and March. I would like to add the remaining months to the same data frame with a value of zero, but I'm having trouble appending them.
Here's my current code:
library(dplyr)
Period <- c("January 2015", "February 2015", "March 2015",
"January 2016", "February 2016", "March 2016",
"January 2017", "February 2017", "March 2017",
"January 2018", "February 2018", "March 2018")
Month <- c("January", "February", "March",
"January", "February", "March",
"January", "February", "March",
"January", "February", "March")
Dollars <- c(936, 753, 731,
667, 643, 588,
948, 894, 997,
774,745, 684)
dat <- data.frame(Period = Period, Month = Month, Dollars = Dollars)
dat2 <- dat %>%
dplyr::select(Month, Dollars) %>%
dplyr::group_by(Month) %>%
dplyr::summarise(AvgDollars = mean(Dollars))
Any ideas for populating April through December in the dataset are greatly appreciated. Thanks in advance!

Here's the way to do it using complete in one step:
library(tidyverse)
Then use complete:
dat2 <- data.frame(Period = Period, Month = Month, Dollars = Dollars) %>%
# make a "year" variable
mutate(Year = word(Period, 2,2)) %>%
# remove period variable (we'll add it in later)
select(-Period) %>%
# month.name is a built-in constant listing all months (thanks @Gregor).
# nesting by "Year" lets complete know you only want the years listed in your dataset.
complete(Month = month.name, nesting(Year), fill = list(Dollars = 0)) %>%
# Arrange by Year and month
arrange(Year, Month) %>%
#remake the "period" variable
mutate(Period = paste(Month, Year)) %>%
group_by(Month) %>%
summarise(AvgDollars = mean(Dollars))
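For readers without the tidyverse, the same idea can be sketched in base R: build every Month/Year combination, merge the observed values onto it, and set the unmatched rows to zero. This is only an illustration using the question's data; the names obs, grid, full, and avg are made up for the example.

```r
# Observed data from the question, with the year split out of Period
obs <- data.frame(
  Month   = rep(c("January", "February", "March"), times = 4),
  Year    = rep(2015:2018, each = 3),
  Dollars = c(936, 753, 731, 667, 643, 588, 948, 894, 997, 774, 745, 684)
)

# Every month of every year that appears in the data
grid <- expand.grid(Month = month.name, Year = unique(obs$Year),
                    stringsAsFactors = FALSE)

# Left join: keep all grid rows, then turn the unmatched Dollars into 0
full <- merge(grid, obs, by = c("Month", "Year"), all.x = TRUE)
full$Dollars[is.na(full$Dollars)] <- 0

# Calendar order rather than alphabetical order
full$Month <- factor(full$Month, levels = month.name)
full <- full[order(full$Year, full$Month), ]

# Average per month; April to December come out as 0
avg <- aggregate(Dollars ~ Month, data = full, FUN = mean)
```

Turning Month into a factor with levels = month.name is what keeps the rows in calendar order; sorting the raw strings would put April first.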

Here is a two-step solution:
library(dplyr)
Sys.setlocale("LC_TIME", "English")
# first, define a data frame with each month from January 2015 to December 2018
dat2 <- data.frame(Period = format(seq(as.Date("2015/1/1"),
as.Date("2018/12/1"), by = "month"),
format = "%B %Y"),
stringsAsFactors = FALSE)
# derive Month from the new Period column by dropping the trailing " YYYY"
# (inside the data.frame() call, Period would refer to the global vector instead)
dat2$Month <- substr(dat2$Period, 1, nchar(dat2$Period) - 5)
# then, join dat onto dat2; a left join from dat2 keeps its calendar order
dat2 %>%
left_join(select(dat, Period, Dollars), by = "Period") %>%
select(Period, Month, Dollars)
Period Month Dollars
1 January 2015 January 936
2 February 2015 February 753
3 March 2015 March 731
4 April 2015 April NA
5 May 2015 May NA
6 June 2015 June NA
7 July 2015 July NA
8 August 2015 August NA
9 September 2015 September NA
10 October 2015 October NA
11 November 2015 November NA
12 December 2015 December NA
13 January 2016 January 667
14 February 2016 February 643
15 March 2016 March 588
16 April 2016 April NA
17 May 2016 May NA
18 June 2016 June NA
19 July 2016 July NA
20 August 2016 August NA
21 September 2016 September NA
22 October 2016 October NA
23 November 2016 November NA
24 December 2016 December NA
25 January 2017 January 948
26 February 2017 February 894
27 March 2017 March 997
28 April 2017 April NA
29 May 2017 May NA
30 June 2017 June NA
31 July 2017 July NA
32 August 2017 August NA
33 September 2017 September NA
34 October 2017 October NA
35 November 2017 November NA
36 December 2017 December NA
37 January 2018 January 774
38 February 2018 February 745
39 March 2018 March 684
40 April 2018 April NA
41 May 2018 May NA
42 June 2018 June NA
43 July 2018 July NA
44 August 2018 August NA
45 September 2018 September NA
46 October 2018 October NA
47 November 2018 November NA
48 December 2018 December NA
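Two details of this approach are easy to trip over: format(x, "%B %Y") produces month names in the current locale, and the join leaves NA rather than the zeros the question asked for. A minimal base R sketch that sidesteps both, assuming English month names are wanted regardless of locale (months_seq, calendar, and out are illustrative names):

```r
# Observed data, as in the question
dat <- data.frame(
  Period  = paste(rep(c("January", "February", "March"), times = 4),
                  rep(2015:2018, each = 3)),
  Dollars = c(936, 753, 731, 667, 643, 588, 948, 894, 997, 774, 745, 684)
)

# One date per month from January 2015 to December 2018
months_seq <- seq(as.Date("2015-01-01"), as.Date("2018-12-01"), by = "month")

# month.name is always English, so no Sys.setlocale() call is needed
calendar <- data.frame(
  Month = month.name[as.integer(format(months_seq, "%m"))],
  Year  = as.integer(format(months_seq, "%Y"))
)
calendar$Period <- paste(calendar$Month, calendar$Year)

# match() keeps the calendar order; unmatched months get NA, then 0
out <- calendar
out$Dollars <- dat$Dollars[match(out$Period, dat$Period)]
out$Dollars[is.na(out$Dollars)] <- 0
```

Using match() instead of a join avoids any dependence on how a particular join function orders its result.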

Maybe there's a more graceful solution with dplyr, but here is a quick solution without much typing:
dat <- rbind(data.frame(Period = Period, Month = Month, Dollars = Dollars),
data.frame(Period = c(sapply(2015:2018, function(x) format(ISOdate(x,4:12,1),"%B %Y"))),
Month = c(sapply(2015:2018, function(x) format(ISOdate(x,4:12,1),"%B"))),
Dollars = 0))

Related

How can I create a new column that only extracts either the year or month from a mm/dd/yy hh:mm string?

I have a date/time string variable that looks like this:
> dput(df$starttime)
c("12/16/20 7:24", "6/21/21 13:20", "1/22/20 9:03", "1/07/20 17:19",
"11/8/21 10:14", NA, NA, "10/26/21 7:19", "3/14/22 9:48", "5/12/22 13:29")
I basically want to create one column that has only the year (2020, 2021, 2022) and another with the year and month (e.g., "Jan 2022").
1) Base R Assuming that you want separate month and year numeric columns, define a function which converts a string in the format shown in the question to a year or month number and then invoke it twice. No packages are used.
toNum <- function(x, fmt) format(as.Date(x, "%m/%d/%y"), fmt) |>
type.convert(as.is = TRUE)
transform(df, year = toNum(starttime, "%Y"), month = toNum(starttime, "%m"))
giving
starttime year month
1 12/16/20 7:24 2020 12
2 6/21/21 13:20 2021 6
3 1/22/20 9:03 2020 1
4 1/07/20 17:19 2020 1
5 11/8/21 10:14 2021 11
6 <NA> NA NA
7 <NA> NA NA
8 10/26/21 7:19 2021 10
9 3/14/22 9:48 2022 3
10 5/12/22 13:29 2022 5
2) yearmon Assuming that you want a yearmon class column, we can use the following. A yearmon value represents year and month internally as year + fraction, where the fraction is 0 for Jan, 1/12 for Feb, ..., 11/12 for Dec, so that it sorts appropriately and adding 1/12, say, gives the next month. Note that if ym is yearmon then as.integer(ym) is the year and cycle(ym) is the month number (1, 2, ..., 12).
library(zoo)
transform(df, yearmon = as.yearmon(starttime, "%m/%d/%y"))
giving:
starttime yearmon
1 12/16/20 7:24 Dec 2020
2 6/21/21 13:20 Jun 2021
3 1/22/20 9:03 Jan 2020
4 1/07/20 17:19 Jan 2020
5 11/8/21 10:14 Nov 2021
6 <NA> <NA>
7 <NA> <NA>
8 10/26/21 7:19 Oct 2021
9 3/14/22 9:48 Mar 2022
10 5/12/22 13:29 May 2022
Note
If you want to sort by starttime then use (note %y, since the years have two digits)
ct <- as.POSIXct(df$starttime, format = "%m/%d/%y %H:%M")
df[order(ct),, drop = FALSE ]
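Because the strings in the question carry two-digit years, the parsing format needs %y; with %Y, R reads "20" literally as the year 20 rather than 2020. A small sketch of the difference, using a shortened copy of the question's data (the names x, right, and wrong are illustrative):

```r
x <- c("12/16/20 7:24", "6/21/21 13:20", "1/22/20 9:03", NA)

# %y interprets the two-digit year 20 as 2020
right <- strptime(x, format = "%m/%d/%y %H:%M", tz = "UTC")

# %Y takes the digits literally, giving the year 20
wrong <- strptime(x, format = "%m/%d/%Y %H:%M", tz = "UTC")

# POSIXlt stores the year as an offset from 1900
right$year + 1900
wrong$year + 1900

# order() puts NA last by default, so missing start times end up at the bottom
sorted <- x[order(as.POSIXct(right))]
```

The relative order happens to be the same with either format here, but any year or month label derived from the wrongly parsed values would be off by two thousand years.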
If you want a chronologically sortable output, you could use the tsibble::yearmonth type:
tsibble::yearmonth(lubridate::mdy_hm(c("12/16/20 7:24", "6/21/21 13:20", "1/22/20 9:03", "1/07/20 17:19",
"11/8/21 10:14", NA, NA, "10/26/21 7:19", "3/14/22 9:48", "5/12/22 13:29")))
result
<yearmonth[10]>
[1] "2020 Dec" "2021 Jun" "2020 Jan" "2020 Jan" "2021 Nov" NA NA
[8] "2021 Oct" "2022 Mar" "2022 May"
An option is to convert to the datetime class POSIXct with mdy_hm (from lubridate), then use format to extract the abbreviated month (%b) and 4-digit year (%Y), filter out the NA elements, and arrange by the converted datetime column:
library(dplyr)
library(lubridate)
df %>%
mutate(starttime = mdy_hm(starttime),
yearmonth = format(starttime, "%b %Y")) %>%
filter(complete.cases(yearmonth)) %>%
arrange(starttime)
-output
# A tibble: 8 × 2
starttime yearmonth
<dttm> <chr>
1 2020-01-07 17:19:00 Jan 2020
2 2020-01-22 09:03:00 Jan 2020
3 2020-12-16 07:24:00 Dec 2020
4 2021-06-21 13:20:00 Jun 2021
5 2021-10-26 07:19:00 Oct 2021
6 2021-11-08 10:14:00 Nov 2021
7 2022-03-14 09:48:00 Mar 2022
8 2022-05-12 13:29:00 May 2022
Try this with lubridate
library(lubridate)
data.frame(df,
Year = format(mdy_hm(df$starttime), "%Y"),
MonthYear = format(mdy_hm(df$starttime), "%b %Y"))
starttime Year MonthYear
1 12/16/20 7:24 2020 Dec 2020
2 6/21/21 13:20 2021 Jun 2021
3 1/22/20 9:03 2020 Jan 2020
4 1/07/20 17:19 2020 Jan 2020
5 11/8/21 10:14 2021 Nov 2021
6 <NA> <NA> <NA>
7 <NA> <NA> <NA>
8 10/26/21 7:19 2021 Oct 2021
9 3/14/22 9:48 2022 Mar 2022
10 5/12/22 13:29 2022 May 2022
It uses mdy_hm together with format to extract the year (%Y) and the abbreviated month plus year (%b %Y) from the date.
Ordered rows:
df_new <- data.frame(df,
Year = format(mdy_hm(df$starttime), "%Y"),
MonthYear = format(mdy_hm(df$starttime), "%b %Y"))
df_new[order(my(df_new$MonthYear)),]
starttime Year MonthYear
3 1/22/20 9:03 2020 Jan 2020
4 1/07/20 17:19 2020 Jan 2020
1 12/16/20 7:24 2020 Dec 2020
2 6/21/21 13:20 2021 Jun 2021
8 10/26/21 7:19 2021 Oct 2021
5 11/8/21 10:14 2021 Nov 2021
9 3/14/22 9:48 2022 Mar 2022
10 5/12/22 13:29 2022 May 2022
6 <NA> <NA> <NA>
7 <NA> <NA> <NA>
Without NAs
na.omit(df_new[order(my(df_new$MonthYear)),])
starttime Year MonthYear
3 1/22/20 9:03 2020 Jan 2020
4 1/07/20 17:19 2020 Jan 2020
1 12/16/20 7:24 2020 Dec 2020
2 6/21/21 13:20 2021 Jun 2021
8 10/26/21 7:19 2021 Oct 2021
5 11/8/21 10:14 2021 Nov 2021
9 3/14/22 9:48 2022 Mar 2022
10 5/12/22 13:29 2022 May 2022
Data
df <- structure(list(starttime = c("12/16/20 7:24", "6/21/21 13:20",
"1/22/20 9:03", "1/07/20 17:19", "11/8/21 10:14", NA, NA, "10/26/21 7:19",
"3/14/22 9:48", "5/12/22 13:29")), class = "data.frame", row.names = c(NA,
-10L))

NAs in a data frame split by country in R

I would like to impute the NAs in a data frame with the mean of the observed data for each country. In other words, when filling an NA, only the values from that specific country should be taken into account. For instance:
Date Country Battles Riots
March 2018 Afghanistan 380 NA
March 2018 Yemen 88 5
March 2018 Mali 45 NA
April 2018 Afghanistan 350 NA
April 2018 Yemen NA 66
April 2018 Mali 67 NA
May 2018 Afghanistan NA 7
May 2018 Yemen NA NA
May 2018 Mali NA 6
I have used the following code, but it calculates the means without taking the country-specific information into account.
for(i in 6:ncol(my_data)) {
my_data[ , i][is.na(my_data[ , i])] <- mean(my_data[ , i], na.rm = TRUE)
}
Many thanks in advance.
You could use:
library(dplyr)
library(tidyr)
df %>%
group_by(Country) %>%
mutate(across(c(Battles, Riots), ~ replace_na(.x, mean(.x, na.rm = TRUE)))) %>%
ungroup()
which returns
Date Country Battles Riots
<chr> <chr> <dbl> <dbl>
1 March 2018 Afghanistan 380 7
2 March 2018 Yemen 88 5
3 March 2018 Mali 45 6
4 April 2018 Afghanistan 350 7
5 April 2018 Yemen 88 66
6 April 2018 Mali 67 6
7 May 2018 Afghanistan 365 7
8 May 2018 Yemen 88 35.5
9 May 2018 Mali 56 6
Data
structure(list(Date = c("March 2018", "March 2018", "March 2018",
"April 2018", "April 2018", "April 2018", "May 2018", "May 2018",
"May 2018"), Country = c("Afghanistan", "Yemen", "Mali", "Afghanistan",
"Yemen", "Mali", "Afghanistan", "Yemen", "Mali"), Battles = c(380,
88, 45, 350, NA, 67, NA, NA, NA), Riots = c(NA, 5, NA, NA, 66,
NA, 7, NA, 6)), row.names = c(NA, -9L), class = c("tbl_df", "tbl",
"data.frame"))
A data.table option (borrowing df from @Martin Gal)
library(data.table)
setDT(df)[
,
c("Battles", "Riots") := lapply(
.(Battles, Riots),
function(x) replace(x, is.na(x), mean(x, na.rm = TRUE))
),
Country
][]
gives
Date Country Battles Riots
1: March 2018 Afghanistan 380 7.0
2: March 2018 Yemen 88 5.0
3: March 2018 Mali 45 6.0
4: April 2018 Afghanistan 350 7.0
5: April 2018 Yemen 88 66.0
6: April 2018 Mali 67 6.0
7: May 2018 Afghanistan 365 7.0
8: May 2018 Yemen 88 35.5
9: May 2018 Mali 56 6.0
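The same per-country imputation can also be sketched in base R with ave(), which applies a function within each group and returns the values in the original row order. The helper name fill_mean is made up for this example, and the data frame is rebuilt here only so the snippet runs on its own.

```r
df <- data.frame(
  Date    = rep(c("March 2018", "April 2018", "May 2018"), each = 3),
  Country = rep(c("Afghanistan", "Yemen", "Mali"), times = 3),
  Battles = c(380, 88, 45, 350, NA, 67, NA, NA, NA),
  Riots   = c(NA, 5, NA, NA, 66, NA, 7, NA, 6)
)

# Replace the NAs in x with the mean of its observed values
fill_mean <- function(x) replace(x, is.na(x), mean(x, na.rm = TRUE))

# ave() applies fill_mean within each Country and keeps the row order
for (col in c("Battles", "Riots")) {
  df[[col]] <- ave(df[[col]], df$Country, FUN = fill_mean)
}
```

If a country has no observed value at all in a column, mean(x, na.rm = TRUE) is NaN, so those cells would become NaN rather than a number; that case needs a separate decision.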

dplyr sample_n by one variable through another one

I have a data frame with a "grouping" variable season and another variable year which is repeated for each month.
df <- data.frame(month = as.character(sapply(month.name,function(x)rep(x,4))),
season = c(rep("winter",8),rep("spring",12),rep("summer",12),rep("autumn",12),rep("winter",4)),
year = rep(2021:2024,12))
I would like to use dplyr::sample_n or something similar to choose 2 months in the data frame for each season and keep the same months for all the years, for example:
month season year
1 January winter 2021
2 January winter 2022
3 January winter 2023
4 January winter 2024
5 February winter 2021
6 February winter 2022
7 February winter 2023
8 February winter 2024
9 March spring 2021
10 March spring 2022
11 March spring 2023
12 March spring 2024
13 May spring 2021
14 May spring 2022
15 May spring 2023
16 May spring 2024
17 June summer 2021
18 June summer 2022
19 June summer 2023
20 June summer 2024
21 July summer 2021
22 July summer 2022
23 July summer 2023
24 July summer 2024
25 October autumn 2021
26 October autumn 2022
27 October autumn 2023
28 October autumn 2024
29 November autumn 2021
30 November autumn 2022
31 November autumn 2023
32 November autumn 2024
I cannot use df %>% group_by(season, year) %>% sample_n(2), since it chooses different months for each year.
Thanks!
We can randomly sample 2 values from month and filter them by group.
library(dplyr)
df %>%
group_by(season) %>%
filter(month %in% sample(unique(month),2))
# month season year
# <chr> <chr> <int>
# 1 January winter 2021
# 2 January winter 2022
# 3 January winter 2023
# 4 January winter 2024
# 5 February winter 2021
# 6 February winter 2022
# 7 February winter 2023
# 8 February winter 2024
# 9 March spring 2021
#10 March spring 2022
# … with 22 more rows
If certain groups have fewer than 2 unique values, we can sample the minimum of 2 and the number of unique values in the group.
df %>%
group_by(season) %>%
filter(month %in% sample(unique(month),min(2, n_distinct(month))))
Using the same logic with base R, we can use ave
df[as.logical(with(df, ave(month, season,
FUN = function(x) x %in% sample(unique(x),2)))), ]
An option using slice
library(dplyr)
df %>%
group_by(season) %>%
slice(which(!is.na(match(month, sample(unique(month), 2)))))
# A tibble: 32 x 3
# Groups: season [4]
# month season year
# <fct> <fct> <int>
# 1 October autumn 2021
# 2 October autumn 2022
# 3 October autumn 2023
# 4 October autumn 2024
# 5 November autumn 2021
# 6 November autumn 2022
# 7 November autumn 2023
# 8 November autumn 2024
# 9 April spring 2021
#10 April spring 2022
# … with 22 more rows
Or using base R
by(df, df$season, FUN = function(x) subset(x, month %in% sample(unique(month), 2 )))
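If you want to see which months were drawn, or reuse the same draw later, one possibility is to separate the sampling step from the filtering step. A base R sketch under that assumption (months_by_season, picked, and res are illustrative names; set.seed is there only to make the draw repeatable):

```r
set.seed(42)  # only to make the draw repeatable

df <- data.frame(
  month  = rep(month.name, each = 4),
  season = rep(c("winter", "spring", "summer", "autumn", "winter"),
               times = c(8, 12, 12, 12, 4)),
  year   = rep(2021:2024, times = 12),
  stringsAsFactors = FALSE
)

# One row per month with its season
months_by_season <- unique(df[c("month", "season")])

# Draw up to 2 months within each season
picked <- unlist(lapply(
  split(months_by_season$month, months_by_season$season),
  function(m) sample(m, min(2, length(m)))
))

# Keep every year of the chosen months
res <- df[df$month %in% picked, ]
```

Since the months are drawn once per season and not once per season and year, every selected month appears with all four years.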

How to convert months as factors while still maintaining the months in sequence?

I have an original data frame (df) containing around 10 years of data (1994-2003). The output of head(df) is shown below:
Sl.no Date Year Month Season val1 val2 val3
1 1 1993-12-01 1993 Dec Winter 21.0 16.0 3.0
2 2 1994-01-01 1994 Jan Winter 21.0 15.5 0.0
3 3 1994-02-01 1994 Feb Winter 21.0 18.5 0.0
4 4 1994-03-01 1994 Mar Spring 30.0 24.0 1.9
5 5 1994-04-01 1994 Apr Spring 35.5 27.0 0.5
6 6 1994-05-01 1994 May Spring 36.0 30.0 1.5
Since I wanted to convert Month to a factor in order to draw a boxplot, I used:
df$Month <- as.factor(format(df$Date, "%b"))
levels(df$Month) <- c("Jan","Feb","Mar", "Apr", "May", "Jun", "Jul",
"Aug", "Sep", "Oct", "Nov", "Dec")
However, the output appeared as below (the months no longer match those in the original df):
Sl.no Date Year Month Season val1 val2 val3
1 1 1993-12-01 1993 Mar Winter 21.0 16.0 3.0
2 2 1994-01-01 1994 May Winter 21.0 15.5 0.0
3 3 1994-02-01 1994 Apr Winter 21.0 18.5 0.0
4 4 1994-03-01 1994 Aug Spring 30.0 24.0 1.9
5 5 1994-04-01 1994 Jan Spring 35.5 27.0 0.5
6 6 1994-05-01 1994 Sep Spring 36.0 30.0 1.5
In the above df the months are scrambled; they should follow the sequence given by Date.
How can I rectify this problem? Your help will be highly appreciated.
Kind regards
Use
df$Month <- factor(format(df$Date, "%b"), month.abb, ordered = TRUE)
Demo of the problem you're facing:
set.seed(1)
M <- sample(month.abb, 20, TRUE)
M
# [1] "Apr" "May" "Jul" "Nov" "Mar" "Nov" "Dec" "Aug" "Aug" "Jan" "Mar" "Mar" "Sep" "May"
# [15] "Oct" "Jun" "Sep" "Dec" "May" "Oct"
your_attempt <- as.factor(M)
# [1] Apr May Jul Nov Mar Nov Dec Aug Aug Jan Mar Mar Sep May Oct Jun Sep Dec May Oct
# Levels: Apr Aug Dec Jan Jul Jun Mar May Nov Oct Sep
## At this step, you're basically asking R to replace "Apr" with "Jan",
## "Aug" with "Feb", and so on. Not what you're looking for....
levels(your_attempt) <- c("Jan", "Feb", "Mar", "Apr", "May", "Jun",
"Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
your_attempt
# [1] Jan Aug May Sep Jul Sep Mar Feb Feb Apr Jul Jul Nov Aug Oct Jun Nov Mar Aug Oct
# Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## ordered = TRUE not necessarily required. Depends on what you want to do
new_attempt <- factor(M, levels = month.abb, ordered = TRUE)
new_attempt
# [1] Apr May Jul Nov Mar Nov Dec Aug Aug Jan Mar Mar Sep May Oct Jun Sep Dec May Oct
# Levels: Jan < Feb < Mar < Apr < May < Jun < Jul < Aug < Sep < Oct < Nov < Dec
The month() function from the lubridate package will handle this for you.
library(lubridate)
df$Month <- month(df$Date, label=TRUE, abbr=TRUE)
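One caveat worth keeping in mind: format(df$Date, "%b") returns abbreviations in the current locale, so on a non-English system matching them against month.abb can produce NA levels. A possible locale-independent variant, sketched with made-up dates that start in December like the question's data (dates and month_fac are illustrative names):

```r
# Monthly dates starting in December 1993
dates <- seq(as.Date("1993-12-01"), by = "month", length.out = 14)

# The month number does not depend on the locale;
# month.abb supplies the English labels in calendar order
month_fac <- factor(month.abb[as.integer(format(dates, "%m"))],
                    levels = month.abb)

# Levels run Jan to Dec regardless of the order the data arrive in
levels(month_fac)

# The integer codes still match each date's month number
as.integer(month_fac)

# Tables (and boxplots) follow the level order
counts <- table(month_fac)
```

Passing levels to factor() declares the ordering up front, whereas assigning to levels() afterwards only renames whatever alphabetical levels were already there.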

sorting of month in matrix in R

I have a matrix in this format:
year month Freq
1 2014 April 466
2 2015 April 59535
3 2014 August 10982
4 2015 August 0
5 2014 December 35881
6 2015 December 0
7 2014 February 17
8 2015 February 24258
9 2014 January 0
10 2015 January 22785
11 2014 July 2981
12 2015 July 0
13 2014 June 1279
14 2015 June 31356
15 2014 March 289
16 2015 March 40274
I need to sort the months in calendar order, i.e. Jan, Feb, Mar, and so on. When I sort, they are ordered alphabetically instead. I used this:
mat <- mat[order(mat[,1], decreasing = TRUE), ]
and it looks like this :
row.names April August December February January July June March May November October September
1 2015 59535 0 0 24258 22785 0 31356 40274 84211 0 0 0
2 2014 466 10982 35881 17 0 2981 1279 289 879 8911 8565 4000
Can we sort months in calendar order in R?
Suppose DF is the data frame from which you derived your matrix. We provide such a data frame in reproducible form at the end. Ensure that month and year are factors with appropriate levels. Note that month.name is a builtin variable in R that is used here to ensure that the month levels are appropriately sorted and we have assumed year is a numeric column. Then use levelplot like this:
DF2 <- transform(DF,
month = factor(as.character(month), levels = month.name),
year = factor(year)
)
library(lattice)
levelplot(Freq ~ year * month, DF2)
Note: Here is DF in reproducible form:
Lines <- " year month Freq
1 2014 April 466
2 2015 April 59535
3 2014 August 10982
4 2015 August 0
5 2014 December 35881
6 2015 December 0
7 2014 February 17
8 2015 February 24258
9 2014 January 0
10 2015 January 22785
11 2014 July 2981
12 2015 July 0
13 2014 June 1279
14 2015 June 31356
15 2014 March 289
16 2015 March 40274 "
DF <- read.table(text = Lines, header = TRUE)
Assuming you want to sort based on time (a dummy day 1 has to be added to convert to a time format):
time = as.POSIXct(strptime(paste(1, mat$month, mat$year), format = "%d %B %Y"))
# order() returns the row index that sorts the times
mat = mat[order(time), ]
Or if you don't care about the year:
time = as.POSIXct(strptime(paste(1, mat$month, 2000), format = "%d %B %Y"))
mat = mat[order(time), ]
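Parsing month names with %B depends on the locale, so another possibility is to look up each month's position in month.name directly. The sketch below assumes the data are in a data frame with the question's column names (month_no, sorted, and wide are illustrative names) and also builds the wide year-by-month table with its columns in calendar order.

```r
DF <- data.frame(
  year  = rep(c(2014, 2015), times = 8),
  month = rep(c("April", "August", "December", "February",
                "January", "July", "June", "March"), each = 2),
  Freq  = c(466, 59535, 10982, 0, 35881, 0, 17, 24258,
            0, 22785, 2981, 0, 1279, 31356, 289, 40274),
  stringsAsFactors = FALSE
)

# Position of each month in the calendar (January = 1, ..., December = 12)
month_no <- match(DF$month, month.name)

# Rows sorted by year, then by calendar month
sorted <- DF[order(DF$year, month_no), ]

# Wide year-by-month table; factor levels fix the column order
DF$month <- factor(DF$month, levels = month.name)
wide <- xtabs(Freq ~ year + month, data = DF)
```

Months that never occur in the data (May and September to November in this excerpt) still get a column in wide, filled with 0, because xtabs keeps unused factor levels by default.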
