I have a dataset with a variable called month, which stores each month as a character. Is there a way with dplyr to combine some months to create a season variable? I have tried the following but got an error:
data %>%
mutate(season = ifelse(month[1:3], "Winter", ifelse(month[4:6], "Spring",
ifelse(month[7:9], "Summer",
ifelse(month[10:12], "Fall", NA)))))
With error:
Error in mutate_impl(.data, dots) : Column `season` must be length 100798 (the number of rows) or one, not 3
I am new to R so any help is much appreciated!
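The error comes from `month[1:3]`, which subsets the column by position (it returns the values in the first three rows) rather than testing which rows belong to months 1 to 3, so the result has length 3 instead of one value per row. A minimal illustration, using a made-up numeric vector:

```r
# Hypothetical month values, for illustration only
month <- c(1, 4, 7, 10, 12)

# Positional subsetting: returns the first three elements (length 3)
first_three <- month[1:3]
first_three
#> [1] 1 4 7

# Membership test: returns one TRUE/FALSE per element (length 5)
in_q1 <- month %in% 1:3
in_q1
#> [1]  TRUE FALSE FALSE FALSE FALSE
```

Only the second form gives `ifelse()` a condition of the same length as the column.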
The correct syntax should be
data %>% mutate(season = ifelse(month %in% 10:12, "Fall",
ifelse(month %in% 1:3, "Winter",
ifelse(month %in% 4:6, "Spring",
"Summer"))))
Edit: probably a better way to get the job done
Astronomical Seasons
temp_data %>%
mutate(
season = case_when(
month %in% 10:12 ~ "Fall",
month %in% 1:3 ~ "Winter",
month %in% 4:6 ~ "Spring",
TRUE ~ "Summer"))
Meteorological Seasons
temp_data %>%
mutate(
season = case_when(
month %in% 9:11 ~ "Fall",
month %in% c(12, 1, 2) ~ "Winter",
month %in% 3:5 ~ "Spring",
TRUE ~ "Summer"))
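The snippets above compare `month` against numbers. If the column holds text such as "Jan" (as the question states), one option is to convert it to a month number first. The following is only a sketch: the data frame and the `month_num` column are made up for illustration, and it assumes English three-letter abbreviations that match R's built-in `month.abb`.

```r
library(dplyr)

# Hypothetical data: months stored as three-letter abbreviations
data <- data.frame(month = c("Jan", "Apr", "Aug", "Dec"),
                   stringsAsFactors = FALSE)

result <- data %>%
  mutate(
    month_num = match(month, month.abb),   # "Jan" -> 1, ..., "Dec" -> 12
    season = case_when(
      month_num %in% c(12, 1, 2) ~ "Winter",
      month_num %in% 3:5         ~ "Spring",
      month_num %in% 6:8         ~ "Summer",
      month_num %in% 9:11        ~ "Fall"
      # anything unmatched (e.g. a misspelled month) stays NA
    )
  )

result$season
#> [1] "Winter" "Spring" "Summer" "Winter"
```

Spelling out all four seasons instead of ending with `TRUE ~ "Summer"` means an unrecognised month becomes NA rather than silently being labelled Summer.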
When there are multiple key/value pairs, we can join against a key/value lookup dataset
keyval <- data.frame(month = month.abb,
season = rep(c("Winter", "Spring", "Summer", "Fall"), each = 3),
stringsAsFactors = FALSE)
left_join(data, keyval, by = "month")
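For a small mapping like this, a named vector can play the same role as the lookup table without a join. This is a base R sketch, not part of the answer above; `season_of` and the example data frame are hypothetical names, and it assumes `month` is a character column (not a factor).

```r
# Named vector: names are month abbreviations, values are seasons
season_of <- setNames(
  rep(c("Winter", "Spring", "Summer", "Fall"), each = 3),
  month.abb
)

# Hypothetical data with character months
data <- data.frame(month = c("Jan", "Jun", "Oct"),
                   stringsAsFactors = FALSE)

# Look up each month by name; unname() drops the month labels
data$season <- unname(season_of[data$month])

data$season
#> [1] "Winter" "Spring" "Fall"
```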
You can also try using dplyr::recode or functions from forcats. I think this is the simplest method here:
library(tidyverse)
library(lubridate)
#>
#> Attaching package: 'lubridate'
#> The following object is masked from 'package:base':
#>
#> date
data <- tibble(month = c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"))
data %>%
mutate(
season = fct_collapse(
.f = month,
Spring = c("Mar", "Apr", "May"),
Summer = c("Jun", "Jul", "Aug"),
Autumn = c("Sep", "Oct", "Nov"),
Winter = c("Dec", "Jan", "Feb")
)
)
#> # A tibble: 12 x 2
#> month season
#> <chr> <fct>
#> 1 Jan Winter
#> 2 Feb Winter
#> 3 Mar Spring
#> 4 Apr Spring
#> 5 May Spring
#> 6 Jun Summer
#> 7 Jul Summer
#> 8 Aug Summer
#> 9 Sep Autumn
#> 10 Oct Autumn
#> 11 Nov Autumn
#> 12 Dec Winter
Created on 2018-04-06 by the reprex package (v0.2.0).
I have a dataset and I want to add a column with the season, like this:
Month      Year  Region  Season
January    2019  NY      Winter
February   2019  NY      Winter
March      2019  NY      Spring
September  2019  NY      Fall
How can I write code in R that automatically adds a column where every January, February and December is Winter, every March, April and May is Spring, and so on?
Thanks a lot for helping
season <- c(data, Spring = "March", Spring = "April")
We can create a keyvalue dataset and do a join
library(dplyr)
keydat <- tibble(Month = month.name,
Season = rep(c("Winter", "Spring", "Summer", "Fall", "Winter"),
c(2, 3, 3, 3, 1)))
df1 <- left_join(df1, keydat, by = "Month")
-output
df1
Month Year Region Season
1 January 2019 NY Winter
2 February 2019 NY Winter
3 March 2019 NY Spring
4 September 2019 NY Fall
data
df1 <- structure(list(Month = c("January", "February", "March", "September"
), Year = c(2019L, 2019L, 2019L, 2019L), Region = c("NY", "NY",
"NY", "NY")), class = "data.frame", row.names = c(NA, -4L))
In base R you could do:
df1$Season <- c('Winter', 'Spring', 'Summer', 'Fall')[
1 + (match(df1$Month, month.name) %/% 3) %% 4]
Which results in:
df1
#> Month Year Region Season
#> 1 January 2019 NY Winter
#> 2 February 2019 NY Winter
#> 3 March 2019 NY Spring
#> 4 September 2019 NY Fall
(Using akrun's reproducible data)
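To see why the integer arithmetic in the base R answer works, it can help to print the computed index for all twelve months. A small sketch (the names `idx` and `season_by_month` are only for illustration):

```r
seasons <- c("Winter", "Spring", "Summer", "Fall")

# Month numbers 1..12 -> integer division by 3 -> wrap around with %% 4
idx <- 1 + (seq_along(month.name) %/% 3) %% 4
idx
#> [1] 1 1 2 2 2 3 3 3 4 4 4 1

# Label each month with its season
season_by_month <- setNames(seasons[idx], month.abb)
```

January and February divide to 0, March to May to 1, and so on; December divides to 4, which `%% 4` wraps back to 0, so it lands on Winter again.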
I have a time series data with a column for a month and a column for a year. The months are JAN, FEB, etc.
I'm trying to combine them into one month year variable in order to run time series analysis on it. I'm very new to R and could use any guidance.
Perhaps something like this?
library(dplyr)
c("JAN", "FEB", "MAR", "APR",
"MAY", "JUN", "JUL", "AUG",
"SEP", "OCT", "NOV", "DEC") %>%
rep(., times = 3) %>%
as.factor() -> months
c("2018", "2019", "2020") %>%
rep(., each = 12) %>%
as.factor() -> years
df1 <- cbind.data.frame(months, years)
paste(df1$months, df1$years, sep = ".") %>%
as.factor() -> merged.years.months
Start with your month/year df.
library(tidyverse)
library(lubridate)
events <- tibble(month = c("JAN", "MAR", "FEB", "NOV", "AUG"),
year = c(2018, 2019, 2018, 2020, 2019))
Let's say that each of your time periods start on the first of the month.
series <- events %>%
mutate(mo1 = dmy(paste(1, month, year)))
This is what you want
R > series
# A tibble: 5 x 3
month year mo1
<chr> <dbl> <date>
1 JAN 2018 2018-01-01
2 MAR 2019 2019-03-01
3 FEB 2018 2018-02-01
4 NOV 2020 2020-11-01
5 AUG 2019 2019-08-01
These are now dates; you can use them in other analyses.
Base R solution:
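events <- within(events, {
# Map "JAN".."DEC" to 1..12 by calendar position (factor codes would follow alphabetical order)
month_no <- match(toupper(month), toupper(month.abb))
# Build an ISO date string for the first day of each month and parse it
date <- as.Date(paste(year, sprintf("%02d", month_no), "01", sep = "-"), "%Y-%m-%d")
# Drop the helper and the original columns
rm(month_no, month, year)
}
)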
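A quick check of how month abbreviations map to numbers, using the same five months as in the `events` example: factor codes follow alphabetical order, whereas `match()` against the built-in `month.abb` gives calendar positions. The variable names here are illustrative only.

```r
months <- c("JAN", "MAR", "FEB", "NOV", "AUG")

# Factor levels are sorted alphabetically: AUG, FEB, JAN, MAR, NOV
alphabetical <- as.integer(factor(months))
alphabetical
#> [1] 3 4 2 5 1

# Position within the calendar year
calendar <- match(months, toupper(month.abb))
calendar
#> [1]  1  3  2 11  8
```

Only the second vector can safely be pasted into a date string.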
I've a question about a pair-case multiplication of variables/df in R.
Consider the following problem:
I have data in vectors (or in a data frame) with labels and values as follows:
alpha_lab <- c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
alpha_num <- c(15.28767, 44.38356, 73.47945, 103.56164, 133.64384, 163.72603, 193.80822, 224.38356, 254.46575, 284.54795, 314.63014, 344.71233)
the alpha_num is a product of other calculations (irrelevant), the following values correspond to their labels in alpha_lab (so January = 15.28767, April = 103.56164...).
I also have a dataframe with "case", "month" (as int), "year" and "value":
> df_values
# A tibble: 1,173 x 4
# Groups: case, month
case month year value
<chr> <int> <int> <dbl>
1 A1 1 2009 121.
2 A1 1 2010 177.
3 A1 1 2011 220.
4 A1 1 2012 196.
5 A1 1 2013 161.
6 A1 1 2014 142.
7 A1 2 2009 82.3
8 A1 2 2010 169.
9 A1 2 2011 194.
10 A1 2 2012 169.
# ... with 1,163 more rows
what I am looking for, is a way to compute for each case (20 different) in each month-year a product of
value * alpha_num
where alpha_num is taken only for a calculated month, so for example:
row 1 (A1, January 2009 case): 121 * 15.28767
row 5 (A1, January 2013 case): 161 * 15.28767
row 7 (A1, February 2009 case): 82.3 * 44.38356
and so on for each case in each month in each year...
Is there a way to compute this without adding the corresponding alpha_num value to the df_values table one month at a time?
Thanks!
This should be helpful:
library(dplyr)
# original vectors
alpha_lab <- c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
alpha_num <- c(15.28767, 44.38356, 73.47945, 103.56164, 133.64384, 163.72603, 193.80822, 224.38356, 254.46575, 284.54795, 314.63014, 344.71233)
# example of your dataframe
df_values = data.frame(case = c("A1", "A1"),
month = c(1, 2),
year = c(2009, 2009),
value = c(121, 82.3), stringsAsFactors = F)
df_values %>% mutate(new_col = value * alpha_num[month])
# case month year value new_col
# 1 A1 1 2009 121.0 1849.808
# 2 A1 2 2009 82.3 3652.767
Note that this works because your alpha_lab vector has the months in the right order. i.e. Jan, Feb, ..., Dec represent the positions 1, 2, ..., 12.
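If the months were stored as labels rather than integers, or the vector were not in calendar order, indexing by name would avoid relying on position. A base R sketch under that assumption; `alpha`, `df_lab` and `month_lab` are hypothetical names, and only the first three values from the question are used.

```r
alpha_lab <- c("Jan", "Feb", "Mar")
alpha_num <- c(15.28767, 44.38356, 73.47945)

# Attach the labels as names so values can be looked up by month label
alpha <- setNames(alpha_num, alpha_lab)

# Hypothetical data frame with month labels instead of integers
df_lab <- data.frame(month_lab = c("Jan", "Feb"),
                     value = c(121, 82.3),
                     stringsAsFactors = FALSE)

# Name-based lookup, then elementwise product
df_lab$new_col <- df_lab$value * unname(alpha[df_lab$month_lab])
```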
You can also try to work with a lookup table and dplyr::left_join.
library("magrittr")
sampleData <- tibble::tibble(
case = "A1",
month = rep(1:12, each = 6),
year = rep(2009:2014, 12),
value = runif(72, 10, 130)
)
lookup_table <- tibble::tibble(
alpha_lab = c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"),
alpha_id = 1:12,
alpha_num = c(15.28767, 44.38356, 73.47945, 103.56164, 133.64384, 163.72603, 193.80822, 224.38356, 254.46575, 284.54795, 314.63014, 344.71233)
)
result <- dplyr::left_join(sampleData, lookup_table, by = c("month" = "alpha_id")) %>%
dplyr::mutate(new_col = alpha_num * value) %>%
dplyr::select(-alpha_num, -alpha_lab)
I am trying to find the means, not including NAs, for multiple columns within a data frame by multiple groups
airquality <- data.frame(City = c("CityA", "CityA","CityA",
"CityB","CityB","CityB",
"CityC", "CityC"),
year = c("1990", "2000", "2010", "1990",
"2000", "2010", "2000", "2010"),
month = c("June", "July", "August",
"June", "July", "August",
"June", "August"),
PM10 = c(runif(3), rnorm(5)),
PM25 = c(runif(3), rnorm(5)),
Ozone = c(runif(3), rnorm(5)),
CO2 = c(runif(3), rnorm(5)))
airquality
So I get a list of the names with the number so I know which columns to select:
nam<-names(airquality)
namelist <- data.frame(matrix(t(nam)));namelist
I want to calculate the mean by City and Year for PM25, Ozone, and CO2. That means I need columns 1, 2, and 5:7.
acast(datadf, year ~ city, mean, na.rm=TRUE)
But this is not really what I want because it includes the mean of something I do not need and it is not in a data frame format. I could convert it and then drop, but that seems like a very inefficient way to do it.
Is there a better way?
We can use dplyr with summarise_at to get mean of the concerned columns after grouping by the column of interest
library(dplyr)
airquality %>%
group_by(City, year) %>%
summarise_at(vars("PM25", "Ozone", "CO2"), mean, na.rm = TRUE)
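Or using across(), which at the time of writing was only in the devel version of dplyr (version - ‘0.8.99.9000’) and has been on CRAN since dplyr 1.0.0
airquality %>%
group_by(City, year) %>%
summarise(across(PM25:CO2, ~ mean(.x, na.rm = TRUE)))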
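Since the question asks for means that ignore NAs, here is a small self-contained sketch of how `na.rm = TRUE` changes the grouped result. The data frame `aq` is made up for illustration, and the code assumes dplyr 1.0.0 or later (for `across()` and the `.groups` argument).

```r
library(dplyr)

# Hypothetical data with missing values
aq <- data.frame(
  City  = c("A", "A", "B"),
  year  = c("1990", "1990", "2000"),
  PM25  = c(1, NA, 3),
  Ozone = c(2, 4, NA),
  stringsAsFactors = FALSE
)

res <- aq %>%
  group_by(City, year) %>%
  summarise(across(c(PM25, Ozone), ~ mean(.x, na.rm = TRUE)),
            .groups = "drop")
```

For city A the PM25 mean is 1 (the NA is dropped) and the Ozone mean is 3. For city B the only Ozone value is NA, so the mean of the remaining empty set is NaN.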
The summarise_at solution by Colin is simplest, but of course there are several.
Here is another solution, using tidyr to rearrange and calculate the mean:
airquality %>%
select(City, year, PM25, Ozone, CO2) %>%
gather(var, value, -City, -year) %>%
group_by(City, year, var) %>%
summarise(avg = mean(value, na.rm=T)) %>% # can stop here if you want
spread(var, avg) # optional to make this into a wider table
# A tibble: 8 x 5
# Groups: City, year [8]
City year CO2 Ozone PM25
* <fctr> <fctr> <dbl> <dbl> <dbl>
1 CityA 1990 0.275981522 0.19941717 0.826008441
2 CityA 2000 0.090342153 0.50949094 0.005052771
3 CityA 2010 0.007345704 0.21893117 0.625373926
4 CityB 1990 1.148717447 -1.05983482 -0.961916973
5 CityB 2000 -2.334429324 0.28301220 -0.828515418
6 CityB 2010 1.110398814 -0.56434523 -0.804353609
7 CityC 2000 -0.676236740 0.20661529 -0.696816058
8 CityC 2010 0.229428142 0.06202997 -1.396357288
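In tidyr 1.0.0 and later, `gather()` and `spread()` are superseded by `pivot_longer()` and `pivot_wider()`. The same reshape-and-average idea could be sketched as follows; the data frame `aq` is made up for illustration and is not the data from the question.

```r
library(dplyr)
library(tidyr)

# Hypothetical data with one missing value
aq <- data.frame(
  City  = c("A", "A", "B"),
  year  = c("1990", "1990", "2000"),
  PM25  = c(1, NA, 3),
  Ozone = c(2, 4, 6),
  stringsAsFactors = FALSE
)

wide <- aq %>%
  # one row per City/year/variable
  pivot_longer(c(PM25, Ozone), names_to = "var", values_to = "value") %>%
  group_by(City, year, var) %>%
  summarise(avg = mean(value, na.rm = TRUE), .groups = "drop") %>%
  # back to one column per variable
  pivot_wider(names_from = var, values_from = avg)
```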
You should try dplyr::summarise_at:
library(dplyr)
airquality %>%
group_by(City, year) %>%
summarise_at(.vars = c("PM10", "PM25", "Ozone", "CO2"), .funs = mean)
# A tibble: 8 x 6
# Groups: City [?]
City year PM10 PM25 Ozone CO2
<fctr> <fctr> <dbl> <dbl> <dbl> <dbl>
1 CityA 1990 0.004087379 0.5146409 0.44393422 0.61196671
2 CityA 2000 0.039414194 0.8865582 0.06754322 0.69870187
3 CityA 2010 0.116901563 0.6608619 0.51499227 0.32952099
4 CityB 1990 -1.535888778 -0.9601897 1.17183649 0.08380664
5 CityB 2000 0.226046487 0.4037230 0.86554997 -0.05698204
6 CityB 2010 -0.824719956 0.1508471 0.32089806 -0.12871853
7 CityC 2000 -0.824509111 -0.6928741 0.85553837 0.12137923
8 CityC 2010 -1.626150294 1.5176198 0.21183149 -0.63859910
So I tested the comments above and added more replication to the original dataset because I wanted to calculate the average by city and by year. Here is the updated dataset
airquality <- data.frame(City = c("CityA", "CityA","CityA","CityA",
"CityB","CityB","CityB","CityB",
"CityC", "CityC", "CityC"),
year = c("1990", "2000", "2010", "2010",
"1990", "2000", "2010", "2010",
"1990", "2000", "2000"),
month = c("June", "July", "August", "August",
"June", "July", "August","August",
"June", "August", "August"),
PM10 = c(runif(6), rnorm(5)),
PM25 = c(runif(6), rnorm(5)),
Ozone = c(runif(6), rnorm(5)),
CO2 = c(runif(6), rnorm(5)))
airquality
Of the answers above, akrun's and Colin's worked.
Here is the situation where I got kinda stuck with R. I have a data table with one row for each day, something like this:
Date = c(as.Date("2015-12-31"), as.Date("2016-01-01"));
Month1 = c("DEC", "JAN");
Year1 = c("15", "16");
Price1 = c(100, 110);
Month2 = c(NA_character_, NA_character_);
Year2 = c(NA_character_, NA_character_);
Price2 = c(NA_integer_, NA_integer_);
Month3 = c(NA_character_, NA_character_);
Year3 = c(NA_character_, NA_character_);
Price3 = c(NA_integer_, NA_integer_);
Month4 = c(NA_character_, NA_character_);
Year4 = c(NA_character_, NA_character_);
Price4 = c(NA_integer_, NA_integer_);
dataSample = data.frame(Date, Month1, Year1, Price1, Month2, Year2, Price2, Month3, Year3, Price3, Month4, Year4, Price4);
Which gives such a table:
Date Month1 Year1 Price1 Month2 Year2 Price2 Month3 Year3 Price3 Month4 Year4 Price4
1 2015-12-31 DEC 15 100 <NA> <NA> NA <NA> <NA> NA <NA> <NA> NA
2 2016-01-01 JAN 16 110 <NA> <NA> NA <NA> <NA> NA <NA> <NA> NA
Now I need to calculate all months and prices for each. For that I have 2 other data frames:
Date = c(as.Date("2015-12-31"), as.Date("2015-12-31"), as.Date("2015-12-31"), as.Date("2016-01-01"), as.Date("2016-01-01"), as.Date("2016-01-01"));
Month.Start = c("DEC", "JAN", "FEB", "JAN", "FEB", "MAR");
Year.Start = c("15", "16", "16", "16", "16", "16")
Month.End = c("JAN", "FEB", "MAR", "FEB", "MAR", "APR");
Year.End = c("16", "16", "16", "16", "16", "16")
Diff = c(10, 15, -15, 19, -20, -5);
diffsOneMonth = data.frame(Date, Month.Start, Year.Start, Month.End, Year.End, Diff)
Date = c(as.Date("2015-12-31"), as.Date("2016-01-01"));
Month.Start = c("DEC", "MAR");
Year.Start = c("15", "16")
Month.End = c("MAR", "JUN");
Year.End = c("16", "16")
Diff = c(11, 25);
diffsThreeMonth = data.frame(Date, Month.Start, Year.Start, Month.End, Year.End, Diff)
Which gives me these tables:
One month price differences
Date Month.Start Year.Start Month.End Year.End Diff
1 2015-12-31 DEC 15 JAN 16 10
2 2015-12-31 JAN 16 FEB 16 15
3 2015-12-31 FEB 16 MAR 16 -15
4 2016-01-01 JAN 16 FEB 16 19
5 2016-01-01 FEB 16 MAR 16 -20
6 2016-01-01 MAR 16 APR 16 -5
Three month price differences
Date Month.Start Year.Start Month.End Year.End Diff
1 2015-12-31 DEC 15 MAR 16 11
2 2016-01-01 MAR 16 JUN 16 25
Now I must fill dataSample data frame by using data from differences tables. I check what start/end months/years are available there and have to fill those months/years in dataSample. Then take difference of price and set calculated price in dataSample. So for example in dataSample we start with DEC 15, then in diffsOneMonth we have entry DEC 15 - JAN 16 with difference 10 so we add it to DEC 15 price and get JAN 16 price 110:
Date Month1 Year1 Price1 Month2 Year2 Price2 Month3 Year3 Price3 Month4 Year4 Price4
1 2015-12-31 DEC 15 100 JAN 16 110 <NA> <NA> NA <NA> <NA> NA
2 2016-01-01 JAN 16 110 <NA> <NA> NA <NA> <NA> NA <NA> <NA> NA
Now it's possible to do the next month, then the one after, and so on. If we used diffsOneMonth only, we would get a result like this:
Date Month1 Year1 Price1 Month2 Year2 Price2 Month3 Year3 Price3 Month4 Year4 Price4
1 2015-12-31 DEC 15 100 JAN 16 110 FEB 16 125 MAR 16 110
2 2016-01-01 JAN 16 110 FEB 16 129 MAR 16 109 APR 16 104
However, there is an additional requirement: I must use the wider month spread to calculate prices whenever possible. For 2015-12-31 there is a three-month spread from DEC 15 to MAR 16, which should override the price from the one-month differences. The DEC 15 price is 100 and the DEC 15 - MAR 16 difference is 11, which makes the MAR 16 price not 110 but 111:
Date Month1 Year1 Price1 Month2 Year2 Price2 Month3 Year3 Price3 Month4 Year4 Price4
1 2015-12-31 DEC 15 100 JAN 16 110 FEB 16 125 MAR 16 111
2 2016-01-01 JAN 16 110 FEB 16 129 MAR 16 109 APR 16 104
So for this sample it would be my final desirable output.
Real data is much more complex, with 6 and 12 month differences and 64 months forward for each date. Also, some months can be missing. I tried to do it with a loop, but it was very slow, and I am not sure how to approach such a problem without a loop. I have created a few helper functions to calculate the next year/month:
nextContract = function(currentMonth, currentYear, length = 1,
years = c("10", "11", "12", "13", "14", "15", "16", "17", "18"),
months = c("JAN", "FEB", "MAR", "APR", "MAY", "JUN", "JUL", "AUG", "SEP", "OCT", "NOV", "DEC")) {
mIdx <- match(currentMonth, months)+length;
yDiff = ifelse(length(months) < mIdx, mIdx / length(months) - ifelse(mIdx %% length(months) == 0, 1, 0), 0);
return(data.frame(nextMonth(currentMonth, length, months), nextYear(currentYear, length = yDiff)))
}
nextYear = function(currentYear, length = 1, years = c("10", "11", "12", "13", "14", "15", "16", "17", "18")) {
return(years[match(currentYear, years)+length]);
}
nextMonth = function(currentMonth, length = 1, months = c("JAN", "FEB", "MAR", "APR", "MAY", "JUN", "JUL", "AUG", "SEP", "OCT", "NOV", "DEC")) {
mIdx <- match(currentMonth, months)+length;
return(months[ifelse(length(months) < mIdx, ifelse(mIdx %% length(months) != 0, mIdx %% length(months), length(months)), mIdx)]);
}
Example of usage could be:
> nextContract("DEC", "15")
nextMonth.currentMonth..length..months. nextYear.currentYear..length...yDiff.
1 JAN 16
or:
> nextContract("DEC", "15", length = 3)
nextMonth.currentMonth..length..months. nextYear.currentYear..length...yDiff.
1 MAR 16
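The same stepping can also be sketched with plain integer arithmetic, by converting month and year into a single running month count. This is an alternative sketch, not the helper code above: `next_contract` and `step` are hypothetical names, and it assumes two-digit years within the same century and upper-case English month abbreviations.

```r
next_contract <- function(month, year, step = 1L) {
  months <- toupper(month.abb)
  # Running count of months since JAN of year "00", zero-based, plus the step
  k <- as.integer(year) * 12L + match(month, months) - 1L + as.integer(step)
  data.frame(
    month = months[k %% 12L + 1L],          # remainder -> month position
    year  = sprintf("%02d", k %/% 12L),     # quotient  -> two-digit year
    stringsAsFactors = FALSE
  )
}

one_ahead   <- next_contract("DEC", "15")
three_ahead <- next_contract("DEC", "15", step = 3)
```

Here `one_ahead` is JAN 16 and `three_ahead` is MAR 16, matching the two usage examples. Because the function is vectorised over `month` and `year`, it can be applied to whole columns without a loop.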
This got to be pretty long question but I hope someone will take time to review it :)
Thanks in advance!
EDIT
A little bit of improvement on proposed solution and I got what I needed:
outrightAndForwardRows <- list("1" = diffsOneMonth, "3" = diffsThreeMonth) %>%
bind_rows(.id = "time_step") %>%
left_join(dataSample %>%
select(Date, Price1, Month1, Year1) ) %>%
mutate(Day.Start = 1) %>%
mutate(Day.End = 1) %>%
mutate(Outright.Day = 1) %>%
unite("Contract.Start", Day.Start, Month.Start, Year.Start) %>%
unite("Contract.End", Day.End, Month.End, Year.End) %>%
unite("Contract.Outright", Outright.Day, Month1, Year1) %>%
mutate(time_step = as.numeric(time_step),
Contract.Start =
Contract.Start %>%
parse_date_time("%d_%b_%y")) %>%
mutate(Contract.End =
Contract.End %>%
parse_date_time("%d_%b_%y")) %>%
mutate(Contract.Outright =
Contract.Outright %>%
parse_date_time("%d_%b_%y")) %>%
group_by(time_step, Date) %>%
arrange(Contract.End) %>%
mutate(Price = cumsum(Diff) + Price1) %>%
group_by(Date, Contract.End) %>%
slice(time_step %>% which.max) %>%
ungroup() %>%
select(-time_step, -Diff, -Contract.Start)
#### add outright and forward months to the same columns
outright <- outrightAndForwardRows %>% select(Date, Price=Price1, Contract=Contract.Outright) %>% unique
forwardMonths <- outrightAndForwardRows %>% select(Date, Contract=Contract.End, Price)
# join and sort rows
joined <- rbind(outright, forwardMonths) %>% arrange(Date, Contract)
# add contract sequence
joined = data.table(joined)
joined = joined[, Contract.seq:=seq(.N), by=Date];
dcast(joined, Date ~ Contract.seq, value.var=c("Price", "Contract"))
Something like this:
library(dplyr)
library(tidyr)
library(lubridate)
list(`1` = diffsOneMonth,
`3` = diffsThreeMonth) %>%
bind_rows(.id = "time_step") %>%
left_join(dataSample %>%
select(Date, Price1, Month1, Year1) ) %>%
mutate(Day.Start = 1) %>%
unite("Date.Start", Day.Start, Month.Start, Year.Start) %>%
mutate(time_step = as.numeric(time_step),
Date.Start =
Date.Start %>%
parse_date_time("%d_%b_%y")) %>%
group_by(time_step, Date) %>%
arrange(Date.Start) %>%
mutate(Price = cumsum(Diff) + Price1) %>%
group_by(Date, Date.Start) %>%
slice(time_step %>% which.max)
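The core of this approach is the grouped `cumsum(Diff) + Price1` step: within each quote date, adding the running total of differences to the starting price gives the price of every later contract. A stripped-down sketch with the one-month differences for 2015-12-31 from the question; the `diffs` data frame and `start_price` are illustrative, and the rows are assumed to be already sorted by contract end.

```r
library(dplyr)

# One-month differences for a single quote date, in contract order
diffs <- data.frame(
  Date = as.Date(rep("2015-12-31", 3)),
  Contract.End = c("JAN 16", "FEB 16", "MAR 16"),
  Diff = c(10, 15, -15),
  stringsAsFactors = FALSE
)

start_price <- 100  # DEC 15 price (Price1 in the question)

prices <- diffs %>%
  group_by(Date) %>%
  mutate(Price = start_price + cumsum(Diff)) %>%
  ungroup()

prices$Price
#> [1] 110 125 110
```

These are the JAN 16, FEB 16 and MAR 16 prices from the one-month-only table in the question; the final `slice()` in the pipeline is what then prefers the wider spread where one exists.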