R - subsetting a data frame in a for loop

I'm trying to subset a data frame in a for loop to create smaller data frames. This is my data.frame:
day  rain in mm  temperature in °C  season
1    201         20                 summer
2    156         18                 summer
3    56          -4                 winter
4    98          15                 spring
I want to extract a data.frame for each season (with all columns). Here is my code:
for (season in seasons){
a<- weather[which(weather$season %in% season[1]) , ,drop=TRUE]
...
}
Unfortunately, the subsetting doesn't work. When I use
a<- weather[which(weather$season %in% "summer") , ,drop=TRUE]
it works perfectly. Also, this does not work properly:
season <- "summer"
a<- weather[which(weather$season %in% season[1]) , ,drop=TRUE]
Does anyone see the problem with my code? Thank you.

It works with dplyr.
library(dplyr)
mydf <- data.frame(day = c(1,2,3,4),
rain = c(201,156,56,98),
temperature = c(20,18,-4,15),
season = c("summer", "summer", "winter", "spring"))
seasons <- c("spring", "summer", "autumn", "winter")
for (sea in seasons) {
a <- dplyr::filter(mydf, season == sea)
print(a)
}
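For reference, base R can also produce the per-season data frames without dplyr. The following is only a minimal sketch: it assumes the question's data is stored in a data frame named weather with a character season column, and the column names rain and temperature are shortened here for illustration. It uses drop = FALSE because drop = TRUE turns a single-row result into a list rather than a data frame.

```r
# Sample data modelled on the question (column names shortened; an assumption)
weather <- data.frame(
  day = 1:4,
  rain = c(201, 156, 56, 98),
  temperature = c(20, 18, -4, 15),
  season = c("summer", "summer", "winter", "spring"),
  stringsAsFactors = FALSE
)

# Option 1: split() returns a named list with one data frame per season present
by_season <- split(weather, weather$season)
names(by_season)

# Option 2: subset explicitly for a fixed set of seasons;
# drop = FALSE guarantees a data frame even for zero or one matching rows
seasons <- c("spring", "summer", "autumn", "winter")
subsets <- lapply(setNames(seasons, seasons), function(s) {
  weather[weather$season == s, , drop = FALSE]
})
subsets$winter
```

With option 2, a season that never occurs (here "autumn") yields an empty data frame instead of an error.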

Related

R - if else taking too long for 16 million rows

I have a data frame of 16 million rows, and I am looking to add a column based on an existing column, Month. If the month is 3, 4 or 5, the column Season will be "Spring", and so on.
for (i in 1:nrow(df)) {
if (df$Month[[i]] %in% c(3,4,5)) {
df$Season[[i]] <- "Spring"
} else if (df$Month[[i]] %in% c(6,7,8)) {
df$Season[[i]] <- "Summer"
} else if (df$Month[[i]] %in% c(9,10,11)) {
df$Season[[i]] <- "Autumn"
} else if (df$Month[[i]] %in% c(12,1,2)) {
df$Season[[i]] <- "Winter"
}
}
However, it is taking way too long for it to complete. What can I do?
One of the easier and faster ways is to create a data frame of the months and seasons and then join it to your parent data frame, like this:
seasons <- data.frame(Month = 1:12, Season = c("Winter", "Winter", rep("Spring", 3), rep("Summer", 3), rep("Autumn", 3), "Winter"))
answer <- dplyr::left_join(df, seasons, by = "Month")
This assumes both data frames have a matching column named "Month".
As a rough estimate, I expect about a 1000x increase in performance over the for loop.
This is along the lines of @Dave2e's answer, but with base R:
Season=c("Winter", "Winter", rep("Spring", 3),
rep("Summer", 3), rep("Autumn", 3), "Winter")
df<-data.frame(month=sample(1:12,10,replace=TRUE)) #Sample data
df$season<-Season[df$month] # index the lookup vector by month number
df
# month season
#1 8 Summer
#2 8 Summer
#3 5 Spring
#4 7 Summer
#5 2 Winter
#6 4 Spring
#7 12 Winter
#8 7 Summer
#9 11 Autumn
#10 1 Winter
This is significantly faster than the for loop method.
Using for loop (1000 rows):
#user system elapsed
#0.02 0.00 0.02
Using vectorised method (1000 rows):
#user system elapsed
# 0 0 0
Calculated using system.time.
This difference might look insignificant with only 1000 rows. However, it becomes large as the number of rows increases (in the OP's case, 16 million).
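To see the gap on a somewhat larger input, the comparison can be sketched as below. This is an illustrative setup, not the original benchmark: the row count of 10,000, the seed, and the variable names are assumptions, and the timings will vary by machine. The final check confirms that both approaches give the same result.

```r
set.seed(1)
df <- data.frame(Month = sample(1:12, 1e4, replace = TRUE))

# Lookup vector: position i holds the season for month i
season_lookup <- c("Winter", "Winter", rep("Spring", 3),
                   rep("Summer", 3), rep("Autumn", 3), "Winter")

# Row-by-row loop, similar in spirit to the question's code
loop_time <- system.time({
  loop_season <- character(nrow(df))
  for (i in seq_len(nrow(df))) {
    m <- df$Month[i]
    loop_season[i] <- if (m %in% 3:5) "Spring" else
      if (m %in% 6:8) "Summer" else
      if (m %in% 9:11) "Autumn" else "Winter"
  }
})

# Vectorised lookup: one indexing operation for all rows
vec_time <- system.time(vec_season <- season_lookup[df$Month])

print(loop_time)
print(vec_time)
identical(loop_season, vec_season)
```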

How to determine the seasons of the year from a multitemporal data list using R?

I would like to determine the seasons here in my region from a time list using dplyr or tidyr.
In my province:
Summer: Starts on December 21st through March 20th.
Autumn: Starts on March 21st through June 20th.
Winter: Starts on June 21st through September 22nd.
Spring: Starts September 23rd through December 20th.
My data.frame
sample_station <-c('A','A','A','A','A','A','A','A','A','A','A','B','B','B','B','B','B','B','B','B','B','C','C','C','C','C','C','C','C','C','C','A','B','C','A','B','C')
Date_dmy <-c('01/01/2000','08/08/2000','16/03/2001','22/09/2001','01/06/2002','05/01/2002','26/01/2002','16/02/2002','09/03/2002','30/03/2002','20/04/2002','04/01/2000','11/08/2000','19/03/2001','25/09/2001','04/06/2002','08/01/2002','29/01/2002','19/02/2002','12/03/2002','13/09/2001','08/01/2000','15/08/2000','23/03/2001','29/09/2001','08/06/2002','12/01/2002','02/02/2002','23/02/2002','16/03/2002','06/04/2002','01/02/2000','01/02/2000','01/02/2000','02/11/2001','02/11/2001','02/11/2001')
Temperature <-c(17,20,24,19,17,19,23,26,19,19,21,15,23,18,22,22,23,18,19,26,21,22,23,27,19,19,21,23,24,25,26,29,30,21,25,24,23)
df<-data.frame(sample_station, Date_dmy, Temperature)
1) Use findInterval to look up the date in the season_start vector and extract the associated season_name.
library(dplyr)
# given Date class vector returns vector of season names
date2season <- function(date) {
season_start <- c("0101", "0321", "0621", "0923", "1221") # mmdd
season_name <- c("Summer", "Autumn", "Winter", "Spring", "Summer")
mmdd <- format(date, "%m%d")
season_name[findInterval(mmdd, season_start)] ##
}
df %>% mutate(season = date2season(as.Date(Date_dmy, "%d/%m/%Y")))
giving:
sample_station Date_dmy Temperature season
1 A 01/01/2000 17 Summer
2 A 08/08/2000 20 Winter
3 A 16/03/2001 24 Summer
4 A 22/09/2001 19 Winter
5 A 01/06/2002 17 Autumn
...snip...
1a) The last line in date2season, marked ##, could optionally be replaced with
season_name[(mmdd >= "0101") + (mmdd >= "0321") + (mmdd >= "0621") +
(mmdd >= "0923") + (mmdd >= "1221")]
and in that case you don't need the line defining season_start either.
2) An alternative is to use case_when:
df %>%
mutate(mmdd = format(as.Date(Date_dmy, "%d/%m/%Y"), "%m%d"),
season = case_when(
mmdd <= "0320" ~ "Summer",
mmdd <= "0620" ~ "Autumn",
mmdd <= "0922" ~ "Winter",
mmdd <= "1220" ~ "Spring",
TRUE ~ "Summer")) %>%
select(-mmdd)
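Since the season boundaries are the easiest place to make an off-by-one mistake, it can help to check them directly. The sketch below is a base R variant of the findInterval approach that compares integers instead of strings; the function name date2season_int and the test dates are introduced here for illustration only.

```r
# Variant of date2season using integer month-day values (e.g. March 21 -> 321)
date2season_int <- function(date) {
  season_start <- c(101, 321, 621, 923, 1221)  # mmdd as integers
  season_name  <- c("Summer", "Autumn", "Winter", "Spring", "Summer")
  mmdd <- as.integer(format(date, "%m%d"))
  season_name[findInterval(mmdd, season_start)]
}

# The day before and the first day of each season, per the question's definition
boundary <- as.Date(c("2001-03-20", "2001-03-21",
                      "2001-06-20", "2001-06-21",
                      "2001-09-22", "2001-09-23",
                      "2001-12-20", "2001-12-21"))
check <- data.frame(date = boundary, season = date2season_int(boundary))
check
```

Each pair of rows should switch season exactly on the second date of the pair.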

Select the same period every year in R

This seems really simple, yet I can't find an easy solution. I'm working with future streamflow projections for every day of a 27-year period (2024-2050). I'm only interested in streamflow during the 61-day period between the 11th of April and the 10th of June each year. I want to extract the data from the seq and Data columns that fall within this period for each year and collect them in one data frame.
Example data:
library(xts)
seq <- timeBasedSeq('2024-01-01/2050-12-31')
Data <- xts(1:length(seq),seq)
I want to achieve something like this BUT with all the dates between April 11 and June 10th and for all years (2024-2050). This is a shortened sample output:
seq_x <- c("2024-04-11","2024-06-10","2025-04-11","2025-06-10","2026-04-11","2026-06-10",
"2027-04-11", "2027-06-10")
Data_x <- c(102, 162, 467, 527, 832, 892, 1197, 1257)
output <- data.frame(seq_x, Data_x)
This question is similar to:
Calculating average for certain time period in every year
and
select date ranges for multiple years in r
but doesn't provide an efficient answer to my question on how to extract the same period over multiple years.
Here is a base R approach :
dates <- index(Data)
month <- as.integer(format(dates, '%m'))
day <- as.integer(format(dates, '%d'))
result <- Data[month == 4 & day >= 11 | month == 5 | month == 6 & day <= 10]
result
#2024-04-11 102
#2024-04-12 103
#2024-04-13 104
#2024-04-14 105
#2024-04-15 106
#2024-04-16 107
#...
#...
#2024-06-07 159
#2024-06-08 160
#2024-06-09 161
#2024-06-10 162
#2025-04-11 467
#2025-04-12 468
#...
#...
Create an mmdd character string and subset using it:
mmdd <- format(time(Data), "%m%d")
Data1 <- Data[mmdd >= "0411" & mmdd <= "0610"]
These would also work. They shift the dates back by 10 days, so that the period of interest coincides exactly with April and May:
Data2 <- Data[format(time(Data)-10, "%m") %in% c("04", "05")]
or
Data3 <- Data[ cycle(as.yearmon(time(Data)-10)) %in% 4:5 ]
The command fortify.zoo(x) can be used to convert an xts object x to a data frame.
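The same mmdd filter also works on a plain data frame, without xts. The sketch below is a minimal illustration under that assumption: the data frame name flow and its columns date and value are made up here, and value simply numbers the days as in the question's example data.

```r
# Daily dates for 2024-2050 with a running day counter as the value
dates <- seq(as.Date("2024-01-01"), as.Date("2050-12-31"), by = "day")
flow <- data.frame(date = dates, value = seq_along(dates))

# Keep April 11 through June 10 of every year by comparing "mmdd" strings
mmdd <- format(flow$date, "%m%d")
window <- flow[mmdd >= "0411" & mmdd <= "0610", ]

head(window, 3)
# Number of retained days per year
days_per_year <- table(format(window$date, "%Y"))
days_per_year
```

Because the window never touches February, every year contributes the same number of days regardless of leap years.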
Here is another option. Group by the year of 'seq_x', then summarise to create a list column by subsetting 'Data' with the first and last elements of 'seq_x' in each year, pull that column, and row-bind its elements:
library(dplyr)
library(lubridate)
library(stringr) # for str_c()
output %>%
group_by(year = year(seq_x)) %>%
summarise(new = list(Data[str_c(first(seq_x), last(seq_x), sep="::")]),
.groups = 'drop') %>%
pull(new) %>%
do.call(rbind, .)
# [,1]
#2024-04-11 102
#2024-04-12 103
#2024-04-13 104
#2024-04-14 105
#2024-04-15 106
#2024-04-16 107
# ...

Adding a column of corresponding seasons to dataframe

Here is an example of my dataframe. I am working in R.
date name count
2016-11-12 Joe 5
2016-11-15 Bob 5
2016-06-15 Nick 12
2016-10-16 Cate 6
I would like to add a column to my data frame that will tell me the season that corresponds to the date. I would like it to look like this:
date name count Season
2016-11-12 Joe 5 Winter
2016-11-15 Bob 5 Winter
2017-06-15 Nick 12 Summer
2017-10-16 Cate 6 Fall
I have started some code:
startWinter <- c(month.name[1], month.name[12], month.name[11])
startSummer <- c(month.name[5], month.name[6], month.name[7])
startSpring <- c(month.name[2], month.name[3], month.name[4])
# create a function to find the correct season based on the month
MonthSeason <- function(Month) {
# !is.na()
# ignores values with NA
# match()
# returns a vector of the positions of matches
# If the starting month matches a spring season, print "Spring". If the starting month matches a summer season, print "Summer" etc.
ifelse(!is.na(match(Month, startSpring)),
return("spring"),
return(ifelse(!is.na(match(Month, startWinter)),
"winter",
ifelse(!is.na(match(Month, startSummer)),
"summer","fall"))))
}
This code gives me the season for a month. I'm not sure if I am going about this problem in the right way. Can anyone help me out?
Thanks!
There are a couple of hacks, and their usability depends on whether you want to use meteorological or astronomical seasons. I'll offer both; I think they offer sufficient flexibility.
I'm going to use the second data set you provided, since it contains more than just "Winter".
txt <- "date name count
2016-11-12 Joe 5
2016-11-15 Bob 5
2017-06-15 Nick 12
2017-10-16 Cate 6"
dat <- read.table(text = txt, header = TRUE, stringsAsFactors = FALSE)
dat$date <- as.Date(dat$date)
The quickest method works well when seasons are defined strictly by month.
metseasons <- c(
"01" = "Winter", "02" = "Winter",
"03" = "Spring", "04" = "Spring", "05" = "Spring",
"06" = "Summer", "07" = "Summer", "08" = "Summer",
"09" = "Fall", "10" = "Fall", "11" = "Fall",
"12" = "Winter"
)
metseasons[format(dat$date, "%m")]
# 11 11 06 10
# "Fall" "Fall" "Summer" "Fall"
If you choose to use date ranges for your seasons that are not defined by month start/stop such as the astronomical seasons, here's another 'hack':
astroseasons <- as.integer(c("0000", "0320", "0620", "0922", "1221", "1232"))
astroseasons_labels <- c("Winter", "Spring", "Summer", "Fall", "Winter")
If you use proper Date or POSIX types, then you are including years, which makes things a little less generic. One might think of using Julian dates (day of the year), but leap years shift the day numbers and produce anomalies. So, with the assumption that Feb 28 is never a seasonal boundary, I'm "numericizing" the month-day. Even though R does character comparisons just fine, cut expects numbers, so we convert them to integers.
Two safeguards: because cut is either right-open (and left-closed) or right-closed (and left-open), our two bookends need to extend beyond the legal dates, hence "0000" and "1232". There are other techniques that could work equally well here (e.g., using -Inf and Inf after converting to integers).
astroseasons_labels[ cut(as.integer(format(dat$date, "%m%d")), astroseasons, labels = FALSE) ]
# [1] "Fall" "Fall" "Spring" "Fall"
Notice that the third date is in Spring when using astronomical seasons and Summer otherwise.
This solution can easily be adjusted to account for the Southern hemisphere or other seasonal preferences/beliefs.
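As an illustration of that adjustment, here is a minimal sketch for the Southern Hemisphere. It reuses the same break points and only swaps the labels; the label vector south_labels is an assumption introduced here, and the boundaries may need changing if your region defines its seasons differently.

```r
dates <- as.Date(c("2016-11-12", "2016-11-15", "2017-06-15", "2017-10-16"))

# Same month-day break points as above, expressed as integers
astroseasons <- as.integer(c("0000", "0320", "0620", "0922", "1221", "1232"))

# Labels shifted by half a year relative to the Northern Hemisphere
south_labels <- c("Summer", "Autumn", "Winter", "Spring", "Summer")

south <- south_labels[cut(as.integer(format(dates, "%m%d")),
                          astroseasons, labels = FALSE)]
south
```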
Edit: motivated by @Kristofersen's answer (thanks), I looked into benchmarks. lubridate::month uses a POSIXct-to-POSIXlt conversion to extract the month, which can be over 10x faster than my format(x, "%m") method. As such:
metseasons2 <- c(
"Winter", "Winter",
"Spring", "Spring", "Spring",
"Summer", "Summer", "Summer",
"Fall", "Fall", "Fall",
"Winter"
)
Noting that as.POSIXlt returns 0-based months, we add 1:
metseasons2[ 1 + as.POSIXlt(dat$date)$mon ]
# [1] "Fall" "Fall" "Summer" "Fall"
Comparison:
library(lubridate)
library(microbenchmark)
set.seed(42)
x <- Sys.Date() + sample(1e3)
xlt <- as.POSIXlt(x)
microbenchmark(
metfmt = metseasons[ format(x, "%m") ],
metlt = metseasons2[ 1 + xlt$mon ],
astrofmt = astroseasons_labels[ cut(as.integer(format(x, "%m%d")), astroseasons, labels = FALSE) ],
astrolt = astroseasons_labels[ cut(100*(1+xlt$mon) + xlt$mday, astroseasons, labels = FALSE) ],
lubridate = sapply(month(x), seasons)
)
# Unit: microseconds
#       expr      min       lq       mean    median        uq       max neval
#     metfmt 1952.091 2135.157 2289.63943 2212.1025 2308.1945  3748.832   100
#      metlt   14.223   16.411   22.51550   20.0575   24.7980    68.924   100
#   astrofmt 2240.547 2454.245 2622.73109 2507.8520 2674.5080  3923.874   100
#    astrolt   42.303   54.702   72.98619   66.1885   89.7095   163.373   100
#  lubridate 5906.963 6473.298 7018.11535 6783.2700 7508.0565 11474.050   100
So the methods using as.POSIXlt(...)$mon are significantly faster. (@Kristofersen's answer could be improved by vectorizing it, perhaps with ifelse, but that still won't compare to the speed of the vector lookups with or without cut.)
You can do this pretty quickly with lubridate and a function to change the month number into a season.
library(lubridate)
seasons = function(x){
if(x %in% 2:4) return("Spring")
if(x %in% 5:7) return("Summer")
if(x %in% 8:10) return("Fall")
if(x %in% c(11,12,1)) return("Winter")
}
dat$Season = sapply(month(dat$date), seasons)
> dat
date name count Season
1 2016-11-12 Joe 5 Winter
2 2016-11-15 Bob 5 Winter
3 2016-06-15 Nick 12 Summer
4 2016-10-16 Cate 6 Fall
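The sapply call above invokes the function once per row. A vectorised variant with the same month grouping can be sketched in base R as follows; the lookup vector name season_by_month is introduced here for illustration, and the data is retyped from the question.

```r
# Position i holds the season for month i, using this answer's grouping:
# Jan = Winter, Feb-Apr = Spring, May-Jul = Summer, Aug-Oct = Fall, Nov-Dec = Winter
season_by_month <- c("Winter", rep("Spring", 3), rep("Summer", 3),
                     rep("Fall", 3), "Winter", "Winter")

dat <- data.frame(
  date = as.Date(c("2016-11-12", "2016-11-15", "2016-06-15", "2016-10-16")),
  name = c("Joe", "Bob", "Nick", "Cate"),
  count = c(5, 5, 12, 6)
)

# as.POSIXlt months are 0-based, so add 1 before indexing
dat$Season <- season_by_month[as.POSIXlt(dat$date)$mon + 1]
dat
```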
If your data is df:
# create a data frame of months and their corresponding seasons
dfSeason <- data.frame(season = c(rep("Winter", 3), rep("Summer", 3),
rep("Spring", 3), rep("Fall", 3)),
month = month.name[c(11,12,1, 5:7, 2:4, 8:10)],
stringsAsFactors = FALSE)
# convert the date column to Date class
df$date <- as.Date(df$date)
# match the month name of each date (format %B, which is locale-dependent,
# so this assumes an English locale) with month in dfSeason,
# then use it to index the season of dfSeason
df$season <- dfSeason$season[match(format(df$date, "%B"), dfSeason$month)]

R Need to extract month and assign season

I am using R, and I need to set up a loop (I think) where I extract the month from the date and assign a season. I would like to assign winter to months 12, 1, 2; spring to 3, 4, 5; summer to 6, 7, 8; and fall to 9, 10, 11. I have a subset of the data below. I am awful with loops and couldn't figure it out. Also for the date, I wasn't sure how packages like lubridate would work
"","UT_TDS_ID_2011.Monitoring.Location.ID","UT_TDS_ID_2011.Activity.Start.Date","UT_TDS_ID_2011.Value","UT_TDS_ID_2011.Season"
"1",4930585,"7/28/2010 0:00",196,""
"2",4933115,"4/21/2011 0:00",402,""
"3",4933115,"7/23/2010 0:00",506,""
"4",4933115,"6/14/2011 0:00",204,""
"8",4933115,"12/3/2010 0:00",556,""
"9",4933157,"11/18/2010 0:00",318,""
"10",4933157,"11/6/2010 0:00",328,""
"11",4933157,"7/23/2010 0:00",290,""
"12",4933157,"6/14/2011 0:00",250,""
Regarding the subject/title of the question, it's actually possible to do this without extracting the month. The first two solutions below do not extract the month. There is also a third solution which does extract the month, but only to increment it.
1) as.yearqtr/as.yearmon Convert the dates to year/month and add one month (1/12). Then the calendar quarters correspond to the seasons so convert to year/quarter, yq, and label the quarters as shown:
library(zoo)
yq <- as.yearqtr(as.yearmon(DF$dates, "%m/%d/%Y") + 1/12)
DF$Season <- factor(format(yq, "%q"), levels = 1:4,
labels = c("winter", "spring", "summer", "fall"))
giving:
dates Season
1 7/28/2010 summer
2 4/21/2011 spring
3 7/23/2010 summer
4 6/14/2011 summer
5 12/3/2010 winter
6 11/18/2010 fall
7 11/6/2010 fall
8 7/23/2010 summer
9 6/14/2011 summer
1a) A variation of this is to use chron's quarters which produces a factor so that levels=1:4 does not have to be specified. To use chron replace the last line in (1) with:
library(chron)
DF$Season <- factor(quarters(as.chron(yq)),
labels = c("winter", "spring", "summer", "fall"))
chron could also be used in conjunction with the remaining solutions.
2) cut. This solution only uses the base of R. First convert the dates to the first of the month using cut and add 32 to get a date in the next month, d. The quarters corresponding to d are the seasons, so compute the quarters using quarters and construct the labels in the same fashion as in the first answer:
d <- as.Date(cut(as.Date(DF$dates, "%m/%d/%Y"), "month")) + 32
DF$Season <- factor(quarters(d), levels = c("Q1", "Q2", "Q3", "Q4"),
labels = c("winter", "spring", "summer", "fall"))
giving the same answer.
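3) POSIXlt This solution also uses only the base of R:
p <- as.POSIXlt(as.Date(DF$dates, "%m/%d/%Y"))
p$mday <- 1 # first of the month
p$mon <- p$mon + 1 # shift forward one month (months are 0-based)
p <- as.POSIXlt(as.POSIXct(p)) # normalise, so month 12 rolls over to January
DF$Season <- factor(quarters(p), levels = c("Q1", "Q2", "Q3", "Q4"),
labels = c("winter", "spring", "summer", "fall"))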
Note 1: We could optionally omit levels= in all these solutions if we knew that every season appears.
Note 2: We used this data frame:
DF <- data.frame(dates = c('7/28/2010', '4/21/2011', '7/23/2010',
'6/14/2011', '12/3/2010', '11/18/2010', '11/6/2010', '7/23/2010',
'6/14/2011'))
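To illustrate Note 1, the sketch below shows why levels= matters when some season is absent from the data. The quarter vector q is a made-up example: with explicit levels the labels line up with Q1 to Q4, while without them factor() sees only two levels and rejects the four labels.

```r
q <- c("Q3", "Q2", "Q3")  # hypothetical quarters; Q1 and Q4 never occur
season_labels <- c("winter", "spring", "summer", "fall")

# Explicit levels: each label is tied to a fixed quarter
with_levels <- factor(q, levels = c("Q1", "Q2", "Q3", "Q4"),
                      labels = season_labels)
as.character(with_levels)

# No levels: only "Q2" and "Q3" become levels, so four labels do not fit
without_levels <- tryCatch(
  factor(q, labels = season_labels),
  error = function(e) conditionMessage(e)
)
without_levels
```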
Using only base R, you can convert the "datetime" column to "Date" class (as.Date(..)), extract the "month" (format(..., '%m')) and change the character value to numeric (as.numeric(..)). Create an "indx" vector whose values are the season names, set its names to the month numbers "12" and "1" to "11" (setNames(..)), and use this to get the corresponding "Season" for the "months" vector.
months <- as.numeric(format(as.Date(df$datetime, '%m/%d/%Y'), '%m'))
indx <- setNames( rep(c('winter', 'spring', 'summer',
'fall'),each=3), c(12,1:11))
df$Season <- unname(indx[as.character(months)])
df
# datetime Season
#1 7/28/2010 0:00 summer
#2 4/21/2011 0:00 spring
#3 7/23/2010 0:00 summer
#4 6/14/2011 0:00 summer
#5 12/3/2010 0:00 winter
#6 11/18/2010 0:00 fall
#7 11/6/2010 0:00 fall
#8 7/23/2010 0:00 summer
#9 6/14/2011 0:00 summer
Or, as @Roland mentioned in the comments, you can use strptime to convert the "datetime" to "POSIXlt" and extract the month ($mon):
months <- strptime(df$datetime, format='%m/%d/%Y %H:%M')$mon +1
and use the same method as above
data
df <- data.frame(datetime = c('7/28/2010 0:00', '4/21/2011 0:00',
'7/23/2010 0:00', '6/14/2011 0:00', '12/3/2010 0:00', '11/18/2010 0:00',
'11/6/2010 0:00', '7/23/2010 0:00', '6/14/2011 0:00'),stringsAsFactors=FALSE)
