specifying a new column based on the criteria in another column - r

I'm working with registration data for a school going back to 1890 and currently have columns for the month (as a number) and the year. I would like to group these values into school years so that Aug-April all fall in the same school year. For example, 8/2010-4/2011 would be the 2010 school year. In SAS I would have used the code below, but I can't get my R code to work and I'm not sure what I'm missing. I apologize for my R code; I'm still learning.
SAS Code:
If Month="8" or Month="9" or Month= "10" or Month= "11" or Month="12" then SchoolYear=Year;
If Month= "1" or Month="2" or Month="3" or Month="4" then SchoolYear= Year-1;
If Month="5" or Month="6" or Month="7" then SchoolYear= "";
R Code and corresponding error:
for (i in nrow(df)) if(df$Month == 8 | df$Month == 9 |df$Month ==10| df$Month ==11 | df$Month == 12) {df$SchoolYear == df$Year} else if (df$Month == 1 | df$Month == 2 | df$Month == 3 | df$Month == 4) {df$SchoolYear == df$Year- 1} else {df$SchoolYear == "NA"}
the condition has length > 1 and only the first element will be used the condition has length > 1 and only the first element will be used

We can use %in% for multiple element comparisons
library(dplyr)
df %>%
  mutate(SchoolYear = case_when(Month %in% 8:12 ~ Year,
                                Month %in% 1:4 ~ Year - 1L,
                                Month %in% 5:7 ~ NA_integer_))
Based on the logic, it can be further simplified to
df$SchoolYear <- with(df, (NA^(Month %in% 5:7)* Year) - (Month %in% 1:4))
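The one-liner above leans on the fact that in R `NA^0` is 1 while `NA^1` is `NA`, so raising `NA` to a logical vector gives a mask that blanks out selected rows. A minimal sketch of that behaviour, using a small made-up `demo` data frame (the name and values are only for illustration):

```r
# NA raised to a logical: TRUE (1) gives NA, FALSE (0) gives 1
mask <- NA^c(TRUE, FALSE)

# Hypothetical three-row example: June, February, September
demo <- data.frame(Month = c(6L, 2L, 9L), Year = c(2000L, 2000L, 2000L))

# May-July rows become NA; Jan-April rows lose one year; the rest keep Year
demo$SchoolYear <- with(demo, (NA^(Month %in% 5:7) * Year) - (Month %in% 1:4))
demo
```

Note that the result is a double rather than an integer column, because the `NA^...` mask is numeric.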
data
set.seed(24)
df <- data.frame(Month = sample(1:12, 30, replace = TRUE),
Year = sample(1978:2001, 30, replace = TRUE))
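For comparison, here is a base R sketch of the same logic without dplyr. The loop in the question fails because `if` expects a single TRUE/FALSE while `df$Month == 8` is a whole vector, and because `==` compares rather than assigns. `ifelse` is vectorised, so no loop is needed. The small `reg` data frame below is made up for illustration.

```r
# Hypothetical sample: Aug 2010, Dec 2010, Jan 2011, Apr 2011, May 2011
reg <- data.frame(Month = c(8L, 12L, 1L, 4L, 5L),
                  Year  = c(2010L, 2010L, 2011L, 2011L, 2011L))

# Vectorised version of the three SAS IF statements
reg$SchoolYear <- ifelse(reg$Month %in% 8:12, reg$Year,
                  ifelse(reg$Month %in% 1:4, reg$Year - 1L, NA_integer_))
reg
```

All four Aug-April rows end up in school year 2010, and the May row is `NA`.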

Related

R mutate variable to variable values from another observation, using a loop, an ifelse condition and subset (dplyr)

See my reproducible example and desired output below.
I want to create a new variable, where I combine variable values from other observations (rows), which I want to identify in a loop using subset. The condition of the subset is to be defined by the loop.
In example 1 subset(df, country == i) does not work, but doing it manually (in Ex.2) subset(df, country == 'US') works. I thought country == i and country == 'US' should be pretty much the same.
# create a df
country <- c('US', 'US', 'China', 'China')
Trump_virus <- c('Y', 'N' ,'Y', 'N')
cases <- c (1000, 2000, 4, 6)
df <- data.frame(country, Trump_virus, cases)
#################################################### Ex.1
for (i in df$country) {
print(i)
df <- df %>%
mutate(cases_corected = ifelse(
Trump_virus == 'Y'
,subset(df, Trump_virus == 'N' & country == i)$cases*1000
,'killer_virus'
))}
##
df$cases_corected
#################################################### Ex.2
for (i in df$country) {
print(i)
df <- df %>%
mutate(cases_corected = ifelse(
Trump_virus == 'Y'
,subset(df, Trump_virus == 'N' & country == 'US')$cases*1000
,'killer_virus'
))}
##
df$cases_corected
################################################### Desired output
> df$cases_corected
[1] "2e+06"
[2] "killer_virus"
[3] "6000"
[4] "killer_virus"
Here is a solution with dplyr.
Updated based on the change in desired output
library(dplyr)
# Upper-case the country names so that variants such as "China" and "china" match
df <- df %>%
  mutate(country = toupper(country))
# Build a lookup table from the rows with Trump_virus == "N"
df1 <- df %>%
  dplyr::filter(Trump_virus == "N") %>%
  dplyr::mutate(ID = "Y",
                cases_corected = cases * 1e3) %>%
  dplyr::select(-c(cases, Trump_virus))
# Final merge: attach the corrected cases to the matching "Y" rows
df <- df %>%
  left_join(df1, by = c("country" = "country", "Trump_virus" = "ID")) %>%
  mutate(cases_corected = ifelse(is.na(cases_corected), 'killer_virus', cases_corected))
df
country Trump_virus cases cases_corected
1 US Y 1000 2e+06
2 US N 2000 killer_virus
3 CHINA Y 4 6000
4 CHINA N 6 killer_virus
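One likely reason the loop in Ex.1 misbehaves is that every iteration overwrites the whole `cases_corected` column, so only the result for the last value of `i` survives. A base R sketch that avoids the loop by looking up each country's 'N' row with `match` is shown below. The helper names `n_rows` and `partner` are made up for illustration, and the sketch assumes one 'N' row per country.

```r
# Same example data as in the question
df <- data.frame(country = c("US", "US", "China", "China"),
                 Trump_virus = c("Y", "N", "Y", "N"),
                 cases = c(1000, 2000, 4, 6))

# Rows that carry the value to copy (assumes one 'N' row per country)
n_rows <- df[df$Trump_virus == "N", ]

# For every row, the cases of the 'N' row in the same country
partner <- n_rows$cases[match(df$country, n_rows$country)]

# 'Y' rows get the scaled partner value, all other rows the label
df$cases_corected <- ifelse(df$Trump_virus == "Y",
                            as.character(partner * 1000),
                            "killer_virus")
df$cases_corected
```

The column is character because it mixes numbers with the `'killer_virus'` label, as in the desired output.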

How to calculate average of values from specific days each year for multiple years in R?

I wanted to calculate the average temperature (t) of specific time period for each year.
I have weather data that gives me values for each day. My real data is from 2011-2019 and has all days in all years and I would like for example average temperature for 20th of April - 15th of May for each year.
Example data:
df <- data.frame(matrix(ncol = 4, nrow = 8))
x <- c("year", "month","day","t")
colnames(df) <- x
df$year <- c(2011,2011,2011,2011,2012,2012,2012,2012)
df$month <- c(3,3,4,4,3,3,4,4)
df$day <- c(1,2,3,4,1,2,3,4)
df$t <- c(1,3,6,1,2,7,1,-9)
I did manage to do this with very ugly and time-consuming code, but my lack of knowledge has stopped me in my tracks.
Thank you in advance.
With tidyverse you could do something similar:
library(tidyverse)
# df is the weather data frame from the question
df %>%
  filter((month == 4 & day >= 20) |
         (month == 5 & day <= 15)) %>%
  group_by(year) %>%
  summarise(mean_temp = mean(t))
Similar to @Ben's answer but in base R:
aggregate(t~year, subset(df, (month == 4 & day >= 20) |
(month == 5 & day <= 15)), mean)
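If the window may span more than two months, spelling out one condition per month gets tedious. A possible alternative (an assumption of mine, not part of the answers above) is to encode month and day as a single number, `month * 100 + day`, so that a window within one calendar year becomes a simple range check. The `weather` data below is made up for illustration.

```r
# Hypothetical data: four days in 2011, two in 2012
weather <- data.frame(year  = c(2011, 2011, 2011, 2011, 2012, 2012),
                      month = c(4, 4, 5, 5, 4, 5),
                      day   = c(19, 20, 15, 16, 25, 1),
                      t     = c(99, 10, 20, 99, 4, 6))

# 20 April becomes 420, 15 May becomes 515
md <- weather$month * 100 + weather$day
in_window <- md >= 420 & md <= 515

# Mean temperature per year over the selected days only
res <- aggregate(t ~ year, data = weather[in_window, ], FUN = mean)
res
```

This only works for windows that do not cross New Year; a window such as December to February would need two range checks joined with `|`.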
You can actually put quite complex calculations inside the group_by function from the dplyr package. Maybe you want to look into something like this.
library(dplyr)
library(lubridate)
df <- data.frame(matrix(ncol = 4, nrow = 8))
x <- c("year", "month","day","t")
colnames(df) <- x
df$year <- c(2011,2011,2011,2011,2012,2012,2012,2012)
df$month <- c(3,3,4,4,3,3,4,4)
df$day <- c(1,2,3,4,1,2,3,4)
df$t <- c(1,3,6,1,2,7,1,-9)
df %>%
group_by(lubridate::dmy(paste(day, month, year)) %>%
lubridate::yday() %>%
between(lubridate::yday(dmy("3.4.2000")), lubridate::yday(dmy("15.5.2000")))) %>%
summarise(mean(t))
I am using the yday function from lubridate to be able to select days over multiple years.
Hope this helps!!
Try the code below; I like to use a for loop for this kind of problem.
# zz is the data frame holding the year, month, day and t columns
# Create a vector of all years
year_u <- unique(zz$year)
# Define the first and last day of the period
inicial_day <- 20
inicial_month <- 4
final_day <- 15
final_month <- 5
# Create an empty data.frame to store the result of each iteration
averages <- data.frame()
# Loop over the years
for (i in seq_along(year_u)) {
  # Take one year at a time
  subsets <- subset(zz, year == year_u[i])
  # Keep the days from the start date to the end of the first month and
  # from the start of the last month to the end date
  # (this assumes the period covers two adjacent months)
  in_period <- (subsets$month == inicial_month & subsets$day >= inicial_day) |
    (subsets$month == final_month & subsets$day <= final_day)
  # Mean of t within the period
  average <- mean(subsets$t[in_period])
  # Create a temporary data.frame to store the year and the t_mean
  temp <- data.frame(year = year_u[i], t_mean = average)
  # Append it to the results so far
  averages <- rbind(averages, temp)
}

Remove observation before certain row

I have a data frame and I want to compute the mean of the variable value over the whole period, excluding each observation where crisis is 1 together with the two observations before and after it (I don't care about missing values). The calculation should be done by country (even though the example below has only one country). Example:
country <- rep("AT",10)
value <- seq(1,10,1)
crisis <- c(0,0,0,NA,0,1,0,NA,0,0)
df <- data.frame(country, value, crisis)
df
mean(df$value[df$crisis == 0], na.rm=TRUE)
# expected result
exp_mean <- (1+2+3+9+10)/5
exp_mean
edit:
I would like to get a general case where we take into account other possible 1 in the dataset, for instance if we have
crisis[10] = 1
the result should be (3+9)/2
in order not to consider the periods that come after the first crisis but are actually affected by a crisis in the second period. Any ideas?
Another base R solution, using outer + c + unique to filter out rows, i.e.,
r <- mean(na.omit(df[-unique(c(outer(which(df$crisis==1),-2:2,"+"))),"value"]))
such that
> r
[1] 5
We can write a function which excludes the values that lie within +- 2 observations of crisis = 1.
custom_mean <- function(c, v) {
inds <- which(c == 1)
mean(v[-unique(c(sapply(inds, `+`, -2:2)))], na.rm = TRUE)
}
sapply is used assuming there could be multiple crisis = 1 situations for a country.
We can then apply this function for each country.
library(dplyr)
df %>% group_by(country) %>% summarise(exp_mean = custom_mean(crisis, value))
# A tibble: 1 x 2
# country exp_mean
# <fct> <dbl>
#1 AT 5
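The function above may run into trouble when a country has no crisis at all, or when a crisis sits so close to the start of the series that the window reaches index 0 or below. Below is a more defensive base R sketch under the same reading of the problem (drop the crisis row plus two rows either side). The name `mean_outside_crisis` and the `window` argument are made up for illustration.

```r
mean_outside_crisis <- function(crisis, value, window = 2) {
  hits <- which(crisis == 1)            # which() ignores NA entries
  if (length(hits) == 0) {
    return(mean(value, na.rm = TRUE))   # nothing to exclude
  }
  # All positions within `window` rows of any crisis
  drop <- unique(as.vector(outer(hits, -window:window, "+")))
  # Discard positions that fall outside the series
  drop <- drop[drop >= 1 & drop <= length(value)]
  mean(value[-drop], na.rm = TRUE)
}

crisis <- c(0, 0, 0, NA, 0, 1, 0, NA, 0, 0)
value <- 1:10
one_crisis <- mean_outside_crisis(crisis, value)

# Per country, using split() instead of dplyr
df <- data.frame(country = rep("AT", 10), value, crisis)
by_country <- sapply(split(df, df$country),
                     function(d) mean_outside_crisis(d$crisis, d$value))
by_country
```

This follows the answers' interpretation of the question, not the stricter rule sketched in the question's edit.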
This base R solution works as long as there is only one row with 'crisis == 1' and as long as there are always two rows before and after that row.
country <- rep("AT",10)
value <- seq(1,10,1)
crisis <- c(0,0,0,NA,0,1,0,NA,0,0)
df <- data.frame(country, value, crisis)
df
df[(which(df$crisis == 1) - 2):(which(df$crisis == 1) + 2), ]
This solution does not work for this data:
country <- rep("AT",11)
value <- seq(1,11,1)
crisis <- c(0,0,0,NA,0,1,0,NA,0,0,1)
df2 <- data.frame(country, value, crisis)
df2[(which(df2$crisis == 1) - 2):(which(df2$crisis == 1) + 2), ]

I want to return a season and year value from a continuous list of dates

I have a continuous list of dates (yyyy-mm-dd) from 1985 to 2018 in one column (Colname = date). What I wish to do is generate another column which outputs a water season and year given the date.
To make it clearer I have two water season:
Summer = yyyy-04-01 to yyyy-09-30;
Winter = yyyy-10-01 to yyyy(+1)-03-31.
So for 2018 - Summer = 2018-04-01 to 2018-09-30; Winter 2018-10-01 to 2019-03-31.
What I would like to output is something like the following:
Many thanks.
A tidyverse approach
library(tidyverse)
df <-tibble(date = seq(from = as.Date('2000-01-01'), to = as.Date('2001-12-31'), by = '1 month'))
df
df %>%
mutate(water_season_year = case_when(
lubridate::month(date) %in% c(4:9) ~str_c('Su_', lubridate::year(date)),
lubridate::month(date) %in% c(10:12) ~str_c('Wi_', lubridate::year(date)),
lubridate::month(date) %in% c(1:3)~str_c('Wi_', lubridate::year(date) -1),
TRUE ~ 'Error'))
You can compare just the month part of the date to get the season. In base R, consider doing
month <- as.integer(format(df$date, "%m"))
year <- format(df$date, "%Y")
inds <- month >= 4 & month <= 9
df$water_season_year <- NA
df$water_season_year[inds] <- paste("Su", year[inds], sep = "_")
df$water_season_year[!inds] <- paste("Wi", year[!inds], sep = "_")
#To add previous year for month <= 3 do
df$water_season_year[month <= 3] <- paste("Wi",
as.integer(year[month <= 3]) - 1, sep = "_")
df
# date water_season_year
#1 2019-01-03 Wi_2018
#2 2000-06-01 Su_2000
Make sure that date variable is of "Date" class.
data
df <-data.frame(date = as.Date(c("2019-01-03", "2000-06-01")))
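The base R steps can also be wrapped in a small helper so the rule lives in one place. This is only a sketch: the function name `water_season` is made up, and it assumes the input is already of class Date.

```r
water_season <- function(date) {
  m <- as.integer(format(date, "%m"))
  y <- as.integer(format(date, "%Y"))
  # April-September is summer, everything else is winter
  season <- ifelse(m >= 4 & m <= 9, "Su", "Wi")
  # January-March belong to the winter that started the previous year
  season_year <- ifelse(m <= 3, y - 1L, y)
  paste(season, season_year, sep = "_")
}

labels <- water_season(as.Date(c("2018-04-01", "2018-10-01", "2019-03-31")))
labels
```

The three dates are the boundaries from the question: the first day of summer 2018, and the first and last day of winter 2018.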

Most effective way to group data in quarters and fiscal years in R

I have a large database (POY) with data from 2011 to 2017 which contains a date column. I would need to do two things: make it possible to split by quarters and by fiscal year.
Our fiscal year unfortunately does not run in parallel with calendar years but goes from July to June. Which also means that my Quarter 1 runs from July to September.
I've written some code that seems to work fine but it seems rather lengthy (especially the second part). Does anyone have any advice for this beginner to make it more efficient?
#Copy of date column and splitting it in 3 columns for year, month and day
library(tidyr)
POY$Date2 <- POY$Date
POY<-separate(POY, Date2, c("year","month","day"), sep = "-", convert=TRUE)
#Making a quarter variable
POY$quarter[POY$month<=3] <- "Q3"
POY$quarter[POY$month>3 & POY$month <=6] <- "Q4"
POY$quarter[POY$month>6 & POY$month <=9] <- "Q1"
POY$quarter[POY$month>9 & POY$month <=12] <- "Q2"
POY$quarter <- as.factor(POY$quarter)
For the Fiscal Year variable: it runs July - June, so:
June'15 should become FY1415
July'15 should become FY1516
Or: Q1 and Q2 in 2015 should become FY1516, while Q3 and Q4 of 2015 are actually FY1415.
#Making a FY variable
for (i in 1:nrow(POY)) {
if (POY$quarter[i] == "Q1" | POY$quarter[i] == "Q2") {
year1 <- as.character(POY$year[i])
year2 <- as.character(POY$year[i] + 1)
} else {
year1 <- as.character(POY$year[i]- 1)
year2 <- as.character(POY$year[i])
}
POY$FY[i] <- paste0("FY", substr(year1, start=3, stop=4), substr(year2, start=3, stop=4))
}
POY$FY <- as.factor(POY$FY)
summary(POY$FY)
Any suggestions?
Thank you!
Not sure if this was available at the time but the lubridate package contains a quarter function which allows you to create your fiscal quarter and year columns.
The documentation is here.
Examples for your case would be:
library(lubridate)
x <- ymd("2011-07-01")
quarter(x)
quarter(x, with_year = TRUE)
quarter(x, with_year = TRUE, fiscal_start = 7)
You can then use dplyr and paste function to mutate your own columns in creating fiscal quarters and years.
I've used a combination of base R, lubridate and dplyr;
library(dplyr)
library(tidyr)
# make a blank dataframe with sequential dates ...
df <- data.frame(date = seq(as.Date('2011-07-01'), as.Date('2015-07-01'), by = 'month'))
# similar to original poster, separate year/month/day
df <-
  df %>%
  separate(col = date, into = c('yr', 'mnth', 'dy'), sep = '-', convert = TRUE, remove = FALSE)
# extract last 2 digits of year
df$yr_small <- strftime(x = df$date, format = '%y', tz = 'GMT')
df$yr_small <- as.numeric(df$yr_small)
# Use dplyr's "case_when" to categorise quarters
df <-
  df %>%
  # make quarters
  mutate(
    quarter = case_when(
      mnth >= 7 & mnth <= 9 ~ 'Q1'
      , mnth >= 10 & mnth <= 12 ~ 'Q2'
      , mnth >= 1 & mnth <= 3 ~ 'Q3'
      , mnth >= 4 & mnth <= 6 ~ 'Q4' ) ) %>%
  # ... the financial year is the (two-digit) year in which it ends
  mutate(
    financial_year = case_when(
      quarter == 'Q1' | quarter == 'Q2' ~ (yr_small + 1)
      , quarter == 'Q3' | quarter == 'Q4' ~ (yr_small) ) )
# final column to make the full financial year start/end, e.g. July 2011 -> FY1112
df <- df %>% mutate(FY = paste0('FY', financial_year - 1, financial_year))
Should give you this:
You could use this to replace the for-loop, I think. If you'd supply some data I could test it.
#Making a FY variable
POY$year1 <- as.character(POY$year - 1)
POY$year2 <- as.character(POY$year)
POY$year1[(POY$quarter == "Q1") | (POY$quarter == "Q2")] <-
as.character(POY$year[(POY$quarter == "Q1") |(POY$quarter == "Q2")])
POY$year2[(POY$quarter == "Q1") | (POY$quarter == "Q2")] <-
as.character(POY$year[(POY$quarter == "Q1") | (POY$quarter == "Q2")] + 1)
POY$FY <-
paste0("FY", substr(POY$year1, 3, 4), substr(POY$year2, 3, 4))
POY$FY <- as.factor(POY$FY)
summary(POY$FY)
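Both the quarter and the fiscal-year label can also be derived directly from the date with integer arithmetic, without intermediate columns. The sketch below is one possible base R version. The function name `fiscal_labels` and the `start_month` argument are made up for illustration, and the input is assumed to be of class Date.

```r
fiscal_labels <- function(date, start_month = 7L) {
  m <- as.integer(format(date, "%m"))
  y <- as.integer(format(date, "%Y"))
  # Calendar year in which the fiscal year starts
  fy_start <- ifelse(m >= start_month, y, y - 1L)
  # Months since the fiscal year began (0-11), mapped to quarters 1-4
  q <- ((m - start_month) %% 12L) %/% 3L + 1L
  data.frame(quarter = paste0("Q", q),
             FY = sprintf("FY%02d%02d", fy_start %% 100L, (fy_start + 1L) %% 100L))
}

out <- fiscal_labels(as.Date(c("2015-06-30", "2015-07-01", "2012-01-15")))
out
```

With the July start, June 2015 lands in Q4 of FY1415 and July 2015 in Q1 of FY1516, matching the examples in the question. The `%02d` format keeps the leading zero for years such as 2009.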
