How can I subset a dataset to a specific year? - r

I have a dataset (Crime) with 6,847,944 observations. I am trying to downsize this data to only those occurring in the relevant year of 2016. The dates can be found in the "Date" column. I have tried all of the following for code:
#change dates to proper format#
Crime$Date = as.Date(Crime$Date, format = "%m/%d/%y")
#filter crimes to 2016#
ATTEMPT 1: Crime16 = subset(Crime$Date = as.Date("2016"))
RESULT 1: Error: unexpected '=' in "Crime16 = subset(Crime$Date ="
ATTEMPT 2: Crimes_2016 <- Crime[year(Date)==2016,]
RESULT 2: Error in as.POSIXlt.default(x, tz = tz(x)) : do not know how to convert 'x' to class “POSIXlt”
ATTEMPT 3: Crimes_2016 = subset(Crime, Date >=2016/1/1 & Date <= 2016/31/12)
RESULT 3: Creates data frame, but contains no observations.
ATTEMPT 4: morecrimes = subset(Crime, Date == 2016)
RESULT 4: Creates data frame, but contains no observations.
ATTEMPT 5: Crimes.2016 = selectByDate(Crime$Date = 2016)
RESULT 5: Error: unexpected '=' in "Crimes.2016 = selectByDate(Crime$Date ="

Without a proper reproducible example dataset I cannot be sure of what you are after but... taking the following dataframe as a test:
x <- data.frame(
"Date" = as.Date(c("2016-01-01", "2015-05-12", "2016-06-16"), format = "%Y-%m-%d"),
"Crime" = LETTERS[1:3])
Which gives:
> x
Date Crime
1 2016-01-01 A
2 2015-05-12 B
3 2016-06-16 C
This can be subset with a logical vector, generated by format(x$Date, "%Y") == "2016", which reduces each date to its year as a character string. Using that vector to index the rows of the data.frame returns the rows where it is TRUE:
x[format(x$Date, "%Y") == "2016", ]
Giving:
> x[format(x$Date, "%Y") == "2016", ]
Date Crime
1 2016-01-01 A
3 2016-06-16 C
Alternatively you could use the dplyr function filter():
library(tidyverse)
# Route 1. Call filter() directly, passing the data frame as the first argument
filter(x, format(Date, "%Y") == "2016")
# Route 2. Pipe the data frame into filter(), the usual tidyverse style
x %>% filter(format(Date, "%Y") == "2016")
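Attempt 3 in the question returns nothing because a bare 2016/1/1 is evaluated as arithmetic (2016 divided by 1 divided by 1), not as a date. A minimal sketch of a range-based subset in base R, reusing the toy data frame from this answer and assuming the Date column is already of class Date (the names start, end and crimes_2016 are only illustrative):

```r
# Toy data, same shape as the example above
x <- data.frame(
  Date  = as.Date(c("2016-01-01", "2015-05-12", "2016-06-16")),
  Crime = LETTERS[1:3]
)

# Bounds must be Date objects; a bare 2016/1/1 is just the number 2016
start <- as.Date("2016-01-01")
end   <- as.Date("2016-12-31")

# Keep rows whose date falls inside the closed range [start, end]
crimes_2016 <- x[x$Date >= start & x$Date <= end, ]
crimes_2016
```

If the real Date column contains NA values (for example because the format string did not match the raw text), wrapping the condition in which() avoids rows of NA in the result.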

Related

Categorizing data using date variable in R

I am having trouble in using the date variable in my dataset to create categories of 6 months time period. I want to create these time period categories for years between 2017-1-1 and 2020-6-30. The time period categories for each year would be from 2017-1-1 to 2017-6-30, and 2017-7-1 to 2017-12-31 until 2020-6-30.
I have used the following two types of codes to create date categories but I am getting a similar error:
#CODE1
#checking for date class
myData <- str(myData)
myData #date in factor class
#convert to date class
date_class <- as.Date(myData$date, format = "%m/%d/%Y")
myData$date_class <- as.Date(myData$date, format = "%m/%d/%Y")
myData
#creating timeperiod category 1
date_cat <- NA
myData$date_cat[which(myData$date_class >= "2017-1-1" & myData$date_class < "2017-7-1")] <- 1
#CODE2
#converting to date format
myData$date <- strptime(myData$date,format="%m/%d/%Y")
myData$date <- as.POSIXct(myData$date)
myData
#creating timeperiod category 1
date_cat <- NA
myData$date_cat[which(myData$date >= "2017-1-1" & myData$date < "2017-7-1")] <- 1
For both the codes I am getting a similar error
Error in $<-.data.frame(*tmp*, date_cat, value = numeric(0)) :
replacement has 0 rows, data has 1123
Please help me with understanding where I am going wrong.
Thanks,
Priya
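Here's a function (to.interval) that returns a time interval index {0, 1, 2, 3, ...}, given an anchor date, an event date, and an interval width in days. Because it uses round(), each event is assigned to the nearest multiple of the width, so the categories are not aligned with calendar half-years. It is probably a good idea to add error checking to the function so that, for example, an event date before the anchor date returns NA.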
df <- data.frame(event.date=as.Date(c("2017-01-01", "2017-08-01", "2018-04-30")))
to.interval <- function(anchor.date, future.date, interval.days){
round(as.integer(future.date - anchor.date) / interval.days, 0)}
df$interval <- to.interval(as.Date('2017-01-01'),
df$event.date, 180 )
df
Output
event.date interval
1 2017-01-01 0
2 2017-08-01 1
3 2018-04-30 3
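If the categories need to follow calendar half-years (January to June, July to December) as described in the question, one possible sketch is to derive them from the year and month rather than from a fixed day width. The column names date_cat and label, and the extra 2020-06-30 row, are assumptions added for illustration:

```r
df <- data.frame(event.date = as.Date(c("2017-01-01", "2017-08-01",
                                        "2018-04-30", "2020-06-30")))

yr   <- as.integer(format(df$event.date, "%Y"))
mo   <- as.integer(format(df$event.date, "%m"))
half <- ifelse(mo <= 6, 1L, 2L)   # 1 = Jan-Jun, 2 = Jul-Dec

# Sequential category: 1 for 2017 H1, 2 for 2017 H2, ..., 7 for 2020 H1
df$date_cat <- (yr - 2017L) * 2L + half
df$label    <- paste0(yr, "-H", half)
df
```

Dates that failed to parse stay NA in both new columns, which makes parsing problems easy to spot.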

R how to avoid a loop. Counting weekends between two dates in a row for each row in a dataframe

I have two columns of dates. Two example dates are:
Date1= "2015-07-17"
Date2="2015-07-25"
I am trying to count the number of Saturdays and Sundays between the two dates each of which are in their own column (5 & 7 in this example code). I need to repeat this process for each row of my dataframe. The end results will be one column that represents the number of Saturdays and Sundays within the date range defined by two date columns.
I can get the code to work for one row:
sum(weekdays(seq(Date1[1,5],Date2[1,7],"days")) %in% c("Saturday",'Sunday')*1)
The answer to this will be 3. But, if I take out the "1" in the row position of date1 and date2 I get this error:
Error in seq.Date(Date1[, 5], Date2[, 7], "days") :
'from' must be of length 1
How do I go line by line and have one vector that lists the number of Saturdays and Sundays between the two dates in column 5 and 7 without using a loop? Another issue is that I have 2 million rows and am looking for something with a little more speed than a loop.
Thank you!!
The map2* functions from the purrr package are a good way to go. They take two vector inputs (e.g. two date columns) and apply a function to each pair of elements in turn. They're pretty fast too!
Here's an example. Note that the _int requests an integer vector back.
library(purrr)
# Example data
d <- data.frame(
Date1 = as.Date(c("2015-07-17", "2015-07-28", "2015-08-15")),
Date2 = as.Date(c("2015-07-25", "2015-08-14", "2015-08-20"))
)
# Wrapper function to compute number of weekend days between dates
n_weekend_days <- function(date_1, date_2) {
sum(weekdays(seq(date_1, date_2, "days")) %in% c("Saturday",'Sunday'))
}
# Iterate row wise
map2_int(d$Date1, d$Date2, n_weekend_days)
#> [1] 3 4 2
If you want to add the results back to your original data frame, mutate() from the dplyr package can help:
library(dplyr)
d <- mutate(d, end_days = map2_int(Date1, Date2, n_weekend_days))
d
#> Date1 Date2 end_days
#> 1 2015-07-17 2015-07-25 3
#> 2 2015-07-28 2015-08-14 4
#> 3 2015-08-15 2015-08-20 2
Here is a solution that uses dplyr to clean things up. It's not too difficult to use with to assign the columns in the dataframe directly.
Essentially, use a reference date and calculate the number of full weeks since it (by floor). Then take the difference between the counts at the end date and at the start date. The counts cover the days after the start date up to and including the end date, so a Saturday or Sunday that falls on the start date itself is not counted.
# weekdays(as.Date(0, origin = "1970-01-01")) -> "Thursday"
require(dplyr)
startDate = as.Date(0, origin = "1970-01-01") # this is a Thursday
df <- data.frame(start = "2015-07-17", end = "2015-07-25")
df$start <- as.Date(df$start, format = "%Y-%m-%d")
df$end <- as.Date(df$end, format = "%Y-%m-%d")
# you can use with to define the columns directly instead of %>%
df <- df %>%
mutate(originDate = startDate) %>%
mutate(startDayDiff = as.numeric(start-originDate), endDayDiff = as.numeric(end-originDate)) %>%
mutate(startWeekDiff = floor(startDayDiff/7),endWeekDiff = floor(endDayDiff/7)) %>%
# day offsets within a week, counted from Thursday: 2 = Saturday, 3 = Sunday
mutate(NumSatsStart = startWeekDiff + ifelse(startDayDiff %% 7>=2,1,0),
NumSunsStart = startWeekDiff + ifelse(startDayDiff %% 7>=3,1,0),
NumSatsEnd = endWeekDiff + ifelse(endDayDiff %% 7 >= 2,1,0),
NumSunsEnd = endWeekDiff + ifelse(endDayDiff %% 7 >= 3,1,0)
) %>%
mutate(NumSats = NumSatsEnd - NumSatsStart, NumSuns = NumSunsEnd - NumSunsStart)
Dates are number of days since 1970-01-01, a Thursday.
So the following is the number of Saturdays or Sundays since that date
f <- function(d) {d <- as.numeric(d); r <- d %% 7; 2*(d %/% 7) + (r>=2) + (r>=3)}
For the number of Saturdays or Sundays between two dates, just subtract, after decrementing the start date to have an inclusive count.
g <- function(d1, d2) f(d2) - f(d1-1)
These are all vectorized functions so you can just call directly on the columns.
# Example data, as in Simon Jackson's answer
d <- data.frame(
Date1 = as.Date(c("2015-07-17", "2015-07-28", "2015-08-15")),
Date2 = as.Date(c("2015-07-25", "2015-08-14", "2015-08-20"))
)
As follows
within(d, end_days<-g(Date1,Date2))
# Date1 Date2 end_days
# 1 2015-07-17 2015-07-25 3
# 2 2015-07-28 2015-08-14 4
# 3 2015-08-15 2015-08-20 2
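As a sanity check on the arithmetic, the vectorized g() can be compared against a brute-force count over randomly generated date pairs. This is only a sketch: the helper brute() and the random test data are assumptions added here, and format(..., "%u") is used instead of weekdays() so the check does not depend on the locale.

```r
f <- function(d) {
  d <- as.numeric(d); r <- d %% 7
  2 * (d %/% 7) + (r >= 2) + (r >= 3)
}
g <- function(d1, d2) f(d2) - f(d1 - 1)

# Brute-force reference; %u is the ISO weekday number (6 = Sat, 7 = Sun)
brute <- function(d1, d2) {
  sum(format(seq(d1, d2, by = "day"), "%u") %in% c("6", "7"))
}

set.seed(1)
d1 <- as.Date("2015-01-01") + sample(0:365, 50, replace = TRUE)
d2 <- d1 + sample(0:60, 50, replace = TRUE)

# TRUE when both approaches agree on every pair
agree <- all(g(d1, d2) == mapply(brute, d1, d2))
agree
```

The brute-force version builds one sequence per row, so on millions of rows the arithmetic version should be considerably cheaper.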

weekend dates within an interval R

I'm trying to identify whether or not a weekend fell within an interval of dates. I've been able to identify if a specific date is a weekend, but not when trying to look at a range of dates. Is this possible? If so, please advise. TIA.
library(lubridate)
library(chron)
start.date <- c("1/1/2017", "2/1/2017")
end.date <- c("1/21/2017", "2/11/2017")
df <- data.frame(start.date, end.date)
df$start.date <- mdy(df$start.date)
df$end.date <- mdy(df$end.date)
df$interval.date <- interval(df$start.date, df$end.date)
df$weekend.exist <- ifelse(is.weekend(df$interval.date), 1, 0)
# Error in dts - floor(dts) :
# Arithmetic operators undefined for 'Interval' and 'Interval' classes:
# convert one to numeric or a matching time-span class.
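Why not use a sequence of dates rather than an interval? The error arises because is.weekend() cannot handle a lubridate Interval object, whereas it can test each day in a sequence of dates. For example: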
df$weekend.exist <- sapply(1:nrow(df), function(i)
as.numeric(any(is.weekend(seq(df$start.date[i], df$end.date[i],by = "day")))))
# [1] 1 1
library(dplyr)
df %>%
group_by(start.date,end.date) %>%
mutate(weekend.exist = as.numeric(any(is.weekend(seq(start.date, end.date,by = "day")))))
# start.date end.date weekend.exist
# <date> <date> <dbl>
# 1 2017-01-01 2017-01-21 1
# 2 2017-02-01 2017-02-11 1
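The same idea can be sketched without chron, using only base R. The helper name has_weekend and the third row (a Monday-to-Friday range) are assumptions added for illustration, so that the result contains a 0 as well:

```r
df <- data.frame(
  start.date = as.Date(c("2017-01-01", "2017-02-01", "2017-02-06")),
  end.date   = as.Date(c("2017-01-21", "2017-02-11", "2017-02-10"))
)

# TRUE if any day in [start, end] is a Saturday or Sunday
has_weekend <- function(start, end) {
  days <- seq(start, end, by = "day")
  any(format(days, "%u") %in% c("6", "7"))   # %u: 1 = Monday ... 7 = Sunday
}

# mapply() walks the two date columns pairwise, one row at a time
df$weekend.exist <- as.integer(mapply(has_weekend, df$start.date, df$end.date))
df
```

Using the numeric weekday from "%u" rather than weekday names keeps the check independent of the session locale.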

R: converting start/end dates into data series

I have the following data frame representing user subscriptions:
User StartDate EndDate
1 2015-09-03 2015-10-17
2 2015-10-27 2015-12-25
...
How can I transform it into a time series that gives me the count of active monthly subscriptions over time (assuming it is active in the month if at least for one day in that month). Something like this (based on the example above, assuming only 2 records):
Month Count
2015-08 0
2015-09 1
2015-10 2
2015-11 1
2015-12 1
2016-01 0
Rem: I took some arbitrary start and end dates for the time series, to make the example clear.
Prepare the data and make sure that the date columns are actually stored as dates:
data <- read.table(text = "User StartDate EndDate
1 2015-09-03 2015-10-17
2 2015-10-27 2015-12-25", header = TRUE)
data$StartDate <- as.Date(data$StartDate)
data$EndDate <- as.Date(data$EndDate)
This function returns a vector with all months that fall within a subscription:
library(lubridate)
subscr_month <- function(start, end) {
start <- floor_date(start, "month")
seq <- seq(start, end, by = "1 month")
months <- format(seq, format = "%Y-%m")
return(months)
}
It uses the function floor_date() from the lubridate package. It is necessary to round the start date down to the first of the month, because otherwise the last month might be missing. For example, for user 2, stepping two months from the start date lands on 2015-12-27, which is after the end date, so no date from December would be included in seq. The last line converts the dates to character strings that contain only the year and month.
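The effect of flooring can be seen on user 2's dates. This sketch uses base R only, with as.Date(format(start, "%Y-%m-01")) as a stand-in for floor_date(start, "month"); the variable names are illustrative:

```r
start <- as.Date("2015-10-27")
end   <- as.Date("2015-12-25")

# Without flooring: steps land on the 27th, so December is never reached
no_floor <- format(seq(start, end, by = "1 month"), "%Y-%m")
no_floor

# With the start moved to the first day of its month
start_floor <- as.Date(format(start, "%Y-%m-01"))
with_floor  <- format(seq(start_floor, end, by = "1 month"), "%Y-%m")
with_floor
```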
Now, you can apply this function to each start and end date from your data using mapply(). Afterwards, table() creates a table of counts of all dates in the resulting list:
all_month <- mapply(subscr_month, data$StartDate, data$EndDate, SIMPLIFY = FALSE)
table(unlist(all_month))
## 2015-09 2015-10 2015-11 2015-12
## 1 2 1 1
You can also convert the table to a data frame:
as.data.frame(table(unlist(all_month)))
## Var1 Freq
## 1 2015-09 1
## 2 2015-10 2
## 3 2015-11 1
## 4 2015-12 1
Your example output also includes the counts for months that do not appear in the data set. If you want to have this, you can convert the vector of months to a factor and set the levels to all the months you want to include:
month_list <- format(seq(as.Date("2015-08-01"), as.Date("2016-01-01"), by = "1 month"), format = "%Y-%m")
all_month_factor <- factor(unlist(all_month), levels = month_list)
table(all_month_factor)
## all_month_factor
## 2015-08 2015-09 2015-10 2015-11 2015-12 2016-01
## 0 1 2 1 1 0
Read in the data frame mentioned in the question:
df = structure(list(StartDate = structure(c(16681, 16735), class = "Date"),
EndDate = structure(c(16725, 16794), class = "Date")), class = "data.frame", .Names = c("StartDate",
"EndDate"), row.names = c(NA, -2L))
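You could make good use of do() from the dplyr package together with seq():
library(dplyr)
df %>%
rowwise() %>% do({
w <- seq(.$StartDate,.$EndDate,by = "15 days") # 15-day steps so short ranges are still sampled; a range that ends fewer than 15 days into a new month can miss that month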
m <- format(w,"%Y-%m") %>% unique
data.frame(Month = m)
}) %>%
group_by(Month) %>%
summarise(Count = length(Month))

R subset data frame where date is less than a variable date

I am trying to select a subset of a data frame where the date needs to be less than a (calculated/variable) date.
The following code throws an error:
loanFrame_excluding_young <- loanFrame[loanFrame$LoanEffective < AddMonths(as.Date("2015-11-11"),-loanFrame$TermMonths),]
Error in seq.Date(X[[i]], ...) : 'by' must be of length 1
Any ideas?
The problem lies with the DescTools::AddMonths function. In AddMonths(x, n, ceiling = TRUE), n can only be a single number, not a vector.
The following code works instead, using the %m-% operator from lubridate, which accepts a vector of periods.
library(lubridate)
loanFrame <- data.frame(TermMonths = c(1,3,5,7),
LoanEffective = as.Date(c("2015-09-15", "2015-08-05", "2015-10-01", "2015-06-25")))
loanFrame_excluding_young <- loanFrame[loanFrame$LoanEffective < as.Date("2015-11-11") %m-% months(loanFrame$TermMonths),]
loanFrame_excluding_young
TermMonths LoanEffective
1 1 2015-09-15
2 3 2015-08-05
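For comparison, the same cutoff dates can be sketched in base R by stepping back from the reference date one row at a time, since seq() needs a single 'by' value. The names ref, back_n_months and cutoff are assumptions for illustration. Note that seq() and %m-% may differ for days near the end of a month (seq() can roll over into the following month), which does not matter for the 11th used here:

```r
loanFrame <- data.frame(
  TermMonths    = c(1, 3, 5, 7),
  LoanEffective = as.Date(c("2015-09-15", "2015-08-05", "2015-10-01", "2015-06-25"))
)
ref <- as.Date("2015-11-11")

# Step back n months from ref; the second element of the sequence is the result
back_n_months <- function(n) {
  seq(ref, by = paste0("-", n, " months"), length.out = 2)[2]
}

# One cutoff per row, combined back into a Date vector
cutoff <- do.call(c, lapply(loanFrame$TermMonths, back_n_months))

res <- loanFrame[loanFrame$LoanEffective < cutoff, ]
res
```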
