Conditional rolling counting function - r

I would like to implement a rolling count of the working days in each month. Weekends (Saturday and Sunday) should be assigned NA.
A replicable example:
#Change the locale if you are in a non-English location like me
Sys.setlocale("LC_TIME", "C")
workdays <- c("Mon","Tue","Wed","Thu","Fri")
dataset <- data.frame(Date = seq(as.Date("2020-03-01"),as.Date("2020-04-01")-1,"days"))
dataset$Day <- format(dataset$Date,format="%d")
dataset$WeekDay <- format(dataset$Date,format="%a")
dataset$Month <- format(dataset$Date,format="%m")
dataset$Year <- format(dataset$Date,format="%y")
dataset$Workday <- dataset$WeekDay %in% workdays
I wanted to use dplyr, grouped by the respective month and year, to sum conditionally over the working days.
dataset %>%
group_by(Month,Year) %>%
mutate(WorkdayNo = ???)
In my example, the first ten rows should then look like this:
[1] NA 1 2 3 4 5 NA NA 6 7 (...)

cumsum with if_else should help:
library(dplyr)
dataset %>%
group_by(Month,Year) %>%
mutate(WorkdayNo = if_else(Workday, cumsum(Workday), NA_integer_)) %>%
ungroup
# Date Day WeekDay Month Year Workday WorkdayNo
# <date> <chr> <chr> <chr> <chr> <lgl> <int>
# 1 2020-03-01 01 Sun 03 20 FALSE NA
# 2 2020-03-02 02 Mon 03 20 TRUE 1
# 3 2020-03-03 03 Tue 03 20 TRUE 2
# 4 2020-03-04 04 Wed 03 20 TRUE 3
# 5 2020-03-05 05 Thu 03 20 TRUE 4
# 6 2020-03-06 06 Fri 03 20 TRUE 5
# 7 2020-03-07 07 Sat 03 20 FALSE NA
# 8 2020-03-08 08 Sun 03 20 FALSE NA
# 9 2020-03-09 09 Mon 03 20 TRUE 6
#10 2020-03-10 10 Tue 03 20 TRUE 7
# … with 21 more rows
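For comparison, the same grouped running count can be sketched in base R with ave(), without dplyr. This is only an illustration under a few assumptions: the date range is extended to April to show the count restarting, and the weekend test uses format(..., "%u") (1 = Monday ... 7 = Sunday) instead of weekday names, so it does not depend on the locale.

```r
# Example data: two months, so the restart of the count is visible
dataset <- data.frame(Date = seq(as.Date("2020-03-01"), as.Date("2020-04-30"), by = "day"))
dataset$Month <- format(dataset$Date, "%m")
dataset$Year <- format(dataset$Date, "%Y")

# %u is the ISO weekday number: "6" and "7" are Saturday and Sunday
dataset$Workday <- !(format(dataset$Date, "%u") %in% c("6", "7"))

# Running count of working days within each month/year group
running <- ave(as.integer(dataset$Workday), dataset$Month, dataset$Year, FUN = cumsum)

# Blank out the weekends
dataset$WorkdayNo <- ifelse(dataset$Workday, running, NA_integer_)

head(dataset$WorkdayNo, 10)
# [1] NA  1  2  3  4  5 NA NA  6  7
```

The cumulative sum is computed over all rows first and the weekends are set to NA afterwards, which is the same idea as the if_else/cumsum combination above.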

Related

Assigning Values in R by Date Range

I am trying to create a "week" variable in my dataset of daily observations that begins with a new value (1, 2, 3, et cetera) whenever a new Monday happens. My dataset has observations beginning on April 6th, 2020, and the dates are stored as Date objects in "YYYY-MM-DD" format (created with as.Date()). In this example, an observation between April 6th and April 12th would be a "1", an observation between April 13th and April 19th would be a "2", et cetera.
I am aware of the week() function in lubridate, but unfortunately that doesn't work for my purposes because there are not exactly 54 weeks in the year, and therefore "week 54" would only be a few days long. In other words, I would like the days of December 28th, 2020 to January 3rd, 2021 to be categorized as the same week.
Does anyone have a good solution to this problem? I appreciate any insight folks might have.
This will also do
df <- data.frame(date = as.Date("2020-04-06")+ 0:365)
library(dplyr)
library(lubridate)
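# Group by ISO year (not calendar year) so 2021-01-01 to 2021-01-03 stay in ISO week 53 of 2020
df %>% group_by(iso_year = isoyear(date), iso_week = isoweek(date)) %>%
  mutate(week = cur_group_id()) %>% ungroup() %>% select(-iso_year, -iso_week)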
# A tibble: 366 x 2
date week
<date> <int>
1 2020-04-06 1
2 2020-04-07 1
3 2020-04-08 1
4 2020-04-09 1
5 2020-04-10 1
6 2020-04-11 1
7 2020-04-12 1
8 2020-04-13 2
9 2020-04-14 2
10 2020-04-15 2
# ... with 356 more rows
Subtract the minimum date from each date, divide the difference by 7, and use floor to get one number for each 7-day block. This works as is because the earliest date here is a Monday.
x <- as.Date(c('2020-04-06','2020-04-07','2020-04-13','2020-12-28','2021-01-03'))
as.integer(floor((x - min(x))/7) + 1)
#[1] 1 1 2 39 39
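If the earliest observation does not fall on a Monday, the same arithmetic can be anchored to the Monday on or before it. The following is a minimal base R sketch under that assumption; the example dates and the name first_monday are illustrative and not from the question.

```r
# Example dates whose minimum (2020-04-08) is a Wednesday
x <- as.Date(c("2020-04-08", "2020-04-12", "2020-04-13", "2020-12-28", "2021-01-03"))

# %u is the ISO weekday number (1 = Monday ... 7 = Sunday),
# so this steps back to the Monday on or before the earliest date
first_monday <- min(x) - (as.integer(format(min(x), "%u")) - 1L)

# Whole weeks elapsed since that Monday, counted from 1
week <- as.integer(x - first_monday) %/% 7L + 1L
week
# [1]  1  1  2 39 39
```

Because the count is based on elapsed days rather than calendar week numbers, the week spanning the turn of the year keeps a single value.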
Maybe lubridate::isoweek() and lubridate::isoyear() are what you want?
Some data:
df1 <- data.frame(date = seq.Date(as.Date("2020-04-06"),
as.Date("2021-01-04"),
by = "1 day"))
Example code:
library(dplyr)
library(lubridate)
df1 <- df1 %>%
mutate(week = isoweek(date),
year = isoyear(date)) %>%
group_by(year) %>%
mutate(week2 = 1 + (week - min(week))) %>%
ungroup()
head(df1, 8)
# A tibble: 8 x 4
date week year week2
<date> <dbl> <dbl> <dbl>
1 2020-04-06 15 2020 1
2 2020-04-07 15 2020 1
3 2020-04-08 15 2020 1
4 2020-04-09 15 2020 1
5 2020-04-10 15 2020 1
6 2020-04-11 15 2020 1
7 2020-04-12 15 2020 1
8 2020-04-13 16 2020 2
tail(df1, 8)
# A tibble: 8 x 4
date week year week2
<date> <dbl> <dbl> <dbl>
1 2020-12-28 53 2020 39
2 2020-12-29 53 2020 39
3 2020-12-30 53 2020 39
4 2020-12-31 53 2020 39
5 2021-01-01 53 2020 39
6 2021-01-02 53 2020 39
7 2021-01-03 53 2020 39
8 2021-01-04 1 2021 1
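The ISO week and ISO year shown above can also be obtained in base R through the %V and %G format codes, which may be handy for a quick check without lubridate. A small sketch, using three dates around the turn of the year as an illustration:

```r
d <- as.Date(c("2020-12-28", "2021-01-03", "2021-01-04"))

# %G is the ISO 8601 week-based year, %V the ISO 8601 week number
iso_label <- format(d, "%G-W%V")
iso_label
# [1] "2020-W53" "2020-W53" "2021-W01"
```

Note that %G (week-based year) is used rather than %Y (calendar year); pairing %Y with %V would label 2021-01-03 as week 53 of 2021.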

Creating Calendar df in R

I am currently creating a Calendar df to join to my other dfs and originally coded it in the following way:
Date <- seq(as.Date("2020-01-01"), as.Date("2021-12-31"), by="days")
Calendar <- data.frame(Date)
Calendar$DateNo <- format(Calendar$Date, format = "%d")
Calendar$NameDay <- format(Calendar$Date, format = "%A")
Calendar$MonthNo <- format(Calendar$Date, format = "%m")
Calendar$NameMonth <- format(Calendar$Date, format = "%B")
Calendar$NameMonthShort <- format(Calendar$Date, format = "%b")
Calendar$Week <- format(Calendar$Date, format = "%V")
Calendar$Year <- format(Calendar$Date, format = "%Y")
Calendar$Quarter <- quarter(Calendar$Date, with_year = F, fiscal_start = 7)
Calendar$Month_Year <-paste(Calendar$NameMonthShort,Calendar$Year,sep="-")
Calendar$Quarter_Year <-paste(Calendar$Quarter,Calendar$Year,sep="-")
After some issues with plotting my data into ggplot I came across an alternate way of creating it using lubridate package with mutate. My new code is as follows:
Date <- seq(as.Date("2020-01-01"), as.Date("2021-12-31"), by="days")
Calendar <- data.frame(Date)
Calendar <- Calendar %>%
mutate(
DateNo = day(Date),
NameDay = wday(Date,label = TRUE),
MonthNo = month(Date),
NameMonth = month(Date, label = TRUE),
NameMonthShort = month(Date, label = TRUE),
Week = week(Date),
Year = year(Date),
Quarter = quarter(Date, with_year = F, fiscal_start = 7))
The issues I am encountering are that I can't add the unabbreviated day/month names, and I'm not sure whether I can add Month_Year/Quarter_Year inside the mutate so that the values become factors. Is it possible to add those values there, or do I have to add them the way I did previously? Thanks!
You might find it easier if you use built-in as.POSIXlt, no lubridate needed. Just apply it on your sequence and you'd get a list-type format,
Date <- as.POSIXlt(seq(as.Date("2020-01-01"), as.Date("2020-06-30"), by="7 days"))
## Note: shortened for sake of brevity
that has the desired information already stored in objects that can be accessed by $.
attr(Date, "names")
# [1] "sec" "min" "hour" "mday" "mon" "year" "wday" "yday" "isdst"
There are some minor conversions needed due to the storage format, and some helper functions like weekdays, quarters, and strftime. In addition we may use the built-in constants month.name and month.abb.
Calendar <- data.frame(Date,
DateNo=Date$mday,
NameDay=weekdays(Date),
MonthNo=Date$mon + 1,
NameMonth=month.name[Date$mon + 1],
NameMonthShort=month.abb[Date$mon + 1],
Week=strftime(Date, "%V"),
Year=1900 + Date$year,
Quarter=quarters(Date)
)
Result
Calendar
# Date DateNo NameDay MonthNo NameMonth NameMonthShort Week Year Quarter
# 1 2020-01-01 1 Wednesday 1 January Jan 01 2020 Q1
# 2 2020-01-08 8 Wednesday 1 January Jan 02 2020 Q1
# 3 2020-01-15 15 Wednesday 1 January Jan 03 2020 Q1
# 4 2020-01-22 22 Wednesday 1 January Jan 04 2020 Q1
# 5 2020-01-29 29 Wednesday 1 January Jan 05 2020 Q1
# 6 2020-02-05 5 Wednesday 2 February Feb 06 2020 Q1
# 7 2020-02-12 12 Wednesday 2 February Feb 07 2020 Q1
# 8 2020-02-19 19 Wednesday 2 February Feb 08 2020 Q1
# 9 2020-02-26 26 Wednesday 2 February Feb 09 2020 Q1
# 10 2020-03-04 4 Wednesday 3 March Mar 10 2020 Q1
# 11 2020-03-11 11 Wednesday 3 March Mar 11 2020 Q1
# 12 2020-03-18 18 Wednesday 3 March Mar 12 2020 Q1
# 13 2020-03-25 25 Wednesday 3 March Mar 13 2020 Q1
# 14 2020-04-01 1 Wednesday 4 April Apr 14 2020 Q2
# 15 2020-04-08 8 Wednesday 4 April Apr 15 2020 Q2
# 16 2020-04-15 15 Wednesday 4 April Apr 16 2020 Q2
# 17 2020-04-22 22 Wednesday 4 April Apr 17 2020 Q2
# 18 2020-04-29 29 Wednesday 4 April Apr 18 2020 Q2
# 19 2020-05-06 6 Wednesday 5 May May 19 2020 Q2
# 20 2020-05-13 13 Wednesday 5 May May 20 2020 Q2
# 21 2020-05-20 20 Wednesday 5 May May 21 2020 Q2
# 22 2020-05-27 27 Wednesday 5 May May 22 2020 Q2
# 23 2020-06-03 3 Wednesday 6 June Jun 23 2020 Q2
# 24 2020-06-10 10 Wednesday 6 June Jun 24 2020 Q2
# 25 2020-06-17 17 Wednesday 6 June Jun 25 2020 Q2
# 26 2020-06-24 24 Wednesday 6 June Jun 26 2020 Q2
Long month names are easy to add by including abbr=FALSE switch to month().
Pasting quarters or months to years needs a second mutate as below.
Edit: Since paste creates character vectors and not factors, you will need to specify factor levels manually:
monthlevels = c(
'Jan-2020','Feb-2020','Mar-2020','Apr-2020','May-2020','Jun-2020',
'Jul-2020','Aug-2020','Sep-2020','Oct-2020','Nov-2020','Dec-2020',
'Jan-2021','Feb-2021','Mar-2021','Apr-2021','May-2021','Jun-2021',
'Jul-2021','Aug-2021','Sep-2021','Oct-2021','Nov-2021','Dec-2021')
quarterlevels = c('1-2020','2-2020','3-2020','4-2020','1-2021','2-2021','3-2021','4-2021')
Calendar %>%
mutate(
DateNo = day(Date),
NameDay = wday(Date,label = TRUE),
MonthNo = month(Date),
NameMonth = month(Date, label = TRUE, abbr=FALSE), ## added abbr=FALSE
NameMonthShort = month(Date, label = TRUE),
Week = week(Date),
Year = year(Date),
Quarter = quarter(Date, with_year = F, fiscal_start = 7)) %>%
## added second mutate() to paste fields created by the first mutate
mutate(
QuarterYear = factor(paste(Quarter, Year, sep='-'), levels=quarterlevels),
MonthYear = factor(paste(NameMonthShort,Year,sep="-"),levels=monthlevels)
) %>% head()
Returns:
Date DateNo NameDay MonthNo NameMonth NameMonthShort Week Year Quarter
1 2020-01-01 1 Wed 1 January Jan 1 2020 3
2 2020-01-02 2 Thu 1 January Jan 1 2020 3
3 2020-01-03 3 Fri 1 January Jan 1 2020 3
4 2020-01-04 4 Sat 1 January Jan 1 2020 3
5 2020-01-05 5 Sun 1 January Jan 1 2020 3
6 2020-01-06 6 Mon 1 January Jan 1 2020 3
QuarterYear MonthYear
1 3-2020 Jan-2020
2 3-2020 Jan-2020
3 3-2020 Jan-2020
4 3-2020 Jan-2020
5 3-2020 Jan-2020
6 3-2020 Jan-2020
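The level vectors do not have to be typed out by hand. Since the calendar is already sorted by date, the order of first appearance is chronological, so unique() on the pasted labels gives the levels directly. A base R sketch of that idea (the variable names are illustrative; month.abb is used so the labels do not depend on the locale):

```r
Date <- seq(as.Date("2020-01-01"), as.Date("2021-12-31"), by = "day")

# Build "Jan-2020"-style labels from the month index and the year
month_year <- paste(month.abb[as.POSIXlt(Date)$mon + 1], format(Date, "%Y"), sep = "-")

# Dates are sorted, so first-appearance order is chronological order
MonthYear <- factor(month_year, levels = unique(month_year))

head(levels(MonthYear), 3)
# [1] "Jan-2020" "Feb-2020" "Mar-2020"
nlevels(MonthYear)
# [1] 24
```

The same pattern should carry over to the quarter labels, provided the rows are sorted by date before the levels are taken.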

How to calculate Quarter Over Quarter %change when the dataset is monthly

I have this df, whose observations are monthly:
library(dplyr)
library(lubridate)
Date <- seq(from = as_date("2019-11-01"), to = as_date("2020-10-01"), by = "month")
A <- (10:21)
df <- data.frame(Date, A)
view(df)
Date A
<date> <int>
1 2019-11-01 10
2 2019-12-01 11
3 2020-01-01 12
4 2020-02-01 13
5 2020-03-01 14
6 2020-04-01 15
7 2020-05-01 16
8 2020-06-01 17
9 2020-07-01 18
10 2020-08-01 19
11 2020-09-01 20
12 2020-10-01 21
Using lag() I know how to calculate %change from Month over Month (MoM), but haven't been able to compare a quarter with the previous quarter: i.e, the sum of 3 months compared with the previous 3 months summed. I tried a loop approach but it didn't work and there should be a more efficient approach.
I appreciate it if someone can help.
We can use as.yearqtr from zoo to convert the 'Date' column to quarters, sum by group, and then take the difference between the current and next quarter (lead) or between the current and previous quarter (lag):
library(dplyr)
library(zoo)
df %>%
group_by(Quarter = as.yearqtr(Date)) %>%
summarise(A = sum(A), .groups = 'drop') %>%
mutate(Diff = lead(A) - A)
-output
# A tibble: 5 x 3
# Quarter A Diff
# <yearqtr> <int> <int>
#1 2019 Q4 21 18
#2 2020 Q1 39 9
#3 2020 Q2 48 9
#4 2020 Q3 57 -36
#5 2020 Q4 21 NA
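The output above is an absolute difference against the next quarter. For a percentage change against the previous quarter, as asked, one possible base R sketch is shown below. It assumes calendar quarters and builds the quarter label with quarters() instead of zoo; the names qsum and PctChange are illustrative.

```r
Date <- seq(as.Date("2019-11-01"), as.Date("2020-10-01"), by = "month")
df <- data.frame(Date, A = 10:21)

# Label each row with its calendar quarter, e.g. "2020 Q1"
df$Quarter <- paste(format(df$Date, "%Y"), quarters(df$Date))

# Sum A within each quarter (labels of this form sort chronologically)
qsum <- aggregate(A ~ Quarter, data = df, FUN = sum)

# Percent change versus the previous quarter; the first quarter has none
qsum$PctChange <- c(NA, 100 * diff(qsum$A) / head(qsum$A, -1))
qsum
```

Keep in mind that the first and last quarters in this example are incomplete (two months and one month of data), so their sums and percentages are not comparable with the full quarters.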

Remove IDs with fewer than 9 unique observations

I am trying to filter my data and remove IDs that have fewer than 9 unique month observations. I would also like to create a list of IDs that includes the count.
I've tried using a few different options:
library(dplyr)
count <- bind %>% group_by(IDS) %>% filter(n(data.month)>= 9) %>% ungroup()
count2 <- subset(bind, with(bind, IDS %in% names(which(table(data.month)>=9))))
Neither of these worked.
This is what my data looks like:
data.month ID
01 2
02 2
03 2
04 2
05 2
05 2
06 2
06 2
07 2
07 2
07 2
07 2
07 2
08 2
09 2
10 2
11 2
12 2
01 5
01 5
02 5
01 7
01 7
01 7
01 4
02 4
03 4
04 4
05 4
05 4
06 4
06 4
07 4
07 4
07 4
07 4
07 4
08 4
09 4
10 4
11 4
12 4
In the end, I would like this:
IDs
2
3
I would also like this
IDs Count
2 12
5 2
7 1
4 12
So far this code is the closest, but still just gives error codes:
count <- bind %>%
group_by(IDs) %>%
filter(length(unique(bind$data.month >=9)))
Error in filter_impl(.data, quo) :
Argument 2 filter condition does not evaluate to a logical vector
You can do with unique and length
library(dplyr)
df %>% group_by(ID) %>% summarise(Count=length(unique(data.month)))
# A tibble: 4 x 2
ID Count
<int> <int>
1 2 12
2 4 12
3 5 2
4 7 1
If you want to get the IDs with at least 9 unique months
df%>%group_by(ID)%>%summarise(Count=length(unique(data.month)))%>%filter(Count>=9)%>%select(ID)
# A tibble: 2 x 1
ID
<int>
1 2
2 4
We can use n_distinct
To remove IDs with fewer than 9 unique observations
library(dplyr)
df %>%
group_by(ID) %>%
filter(n_distinct(data.month) >= 9) %>%
pull(ID) %>% unique
#[1] 2 4
Or
df %>%
group_by(ID) %>%
filter(n_distinct(data.month) >= 9) %>%
distinct(ID)
# ID
# <int>
#1 2
#2 4
For unique counts of each ID
df %>%
group_by(ID) %>%
summarise(count = n_distinct(data.month))
# ID count
# <int> <int>
#1 2 12
#2 4 12
#3 5 2
#4 7 1
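The same two results can be sketched in base R with tapply(), which may help if dplyr is not available. The data frame below is rebuilt from the sample in the question; the helper name months_full and the object names result, keep and filtered are illustrative.

```r
# Months observed for the two IDs that cover the whole year (18 rows each)
months_full <- c(1:5, 5, 6, 6, rep(7, 5), 8:12)

bind <- data.frame(
  data.month = c(months_full, 1, 1, 2, 1, 1, 1, months_full),
  ID = rep(c(2, 5, 7, 4), times = c(18, 3, 3, 18))
)

# Number of distinct months per ID
counts <- tapply(bind$data.month, bind$ID, function(m) length(unique(m)))
result <- data.frame(IDs = as.numeric(names(counts)), Count = as.vector(counts))
result

# IDs with at least 9 distinct months, and the rows that belong to them
keep <- result$IDs[result$Count >= 9]
filtered <- bind[bind$ID %in% keep, ]
keep
# [1] 2 4
```

tapply() returns the IDs in sorted order, so the table lists 2, 4, 5, 7 rather than the order of appearance in the data.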
Here is a data.table approach
library( data.table )
#IDs with 9 unique months or more
unique( DT[, if (uniqueN(data.month) >= 9) .SD, by = .(ID)]$ID )
#[1] 2 4
#Unique months per ID
unique(DT, by = c("data.month", "ID"))[, .(counts = .N), by = .(IDs = ID)]
# IDs counts
# 1: 2 12
# 2: 5 2
# 3: 7 1
# 4: 4 12
sample data
DT <- fread("data.month ID
01 2
02 2
03 2
04 2
05 2
05 2
06 2
06 2
07 2
07 2
07 2
07 2
07 2
08 2
09 2
10 2
11 2
12 2
01 5
01 5
02 5
01 7
01 7
01 7
01 4
02 4
03 4
04 4
05 4
05 4
06 4
06 4
07 4
07 4
07 4
07 4
07 4
08 4
09 4
10 4
11 4
12 4")

In R how can I find the number of connections I have in a given dataframe and produce a variable representing it?

I currently have a dataframe which represents a social network as follows:
id age id1 id2 id3
01 14 02 05 03
02 23 01 05 03
03 52 04 01 02
04 41 03
05 32 01 02
Ideally I would like a new data frame like the following:
id age id1 id2 id3 Connections
01 14 02 05 03 3
02 23 01 05 03 3
03 52 04 01 02 3
04 41 03 1
05 32 01 02 2
The new variable represents the number of connections each "id" has. So far I have the following code:
links <- df
links <- as.matrix(links)
links <- as.data.frame(rbind(links[,c(1,3)], links[,c(1,4)]), links[,c(1,5)])
head(links)
library(igraph)
g = graph.data.frame(links)
m = as.matrix(get.adjacency(g))
m
pmax(rowSums(m), colSums(m))
Which gives me:
1 2 3 4 5 NA
3 3 3 1 2 3
How can I then incorporate this into the dataframe to create the "Connections" variable? My other data contains up to 50 connections, so ideally I would like an easier way that doesn't require recreating a dataframe.
A quick tidyverse way is to reshape the data into a long shape, add up how many non-NA values each ID has, and reshape back to wide.
library(tidyverse)
df %>%
gather(key = key, value = val, -id, -age) %>%
group_by(id, age) %>%
mutate(connections = sum(!is.na(val))) %>%
head()
#> # A tibble: 6 x 5
#> # Groups: id, age [5]
#> id age key val connections
#> <chr> <dbl> <chr> <chr> <int>
#> 1 01 14 id1 02 3
#> 2 02 23 id1 01 3
#> 3 03 52 id1 04 3
#> 4 04 41 id1 03 1
#> 5 05 32 id1 01 2
#> 6 01 14 id2 05 3
df %>%
gather(key = key, value = val, -id, -age) %>%
group_by(id, age) %>%
mutate(connections = sum(!is.na(val))) %>%
spread(key = key, value = val)
#> # A tibble: 5 x 6
#> # Groups: id, age [5]
#> id age connections id1 id2 id3
#> <chr> <dbl> <int> <chr> <chr> <chr>
#> 1 01 14 3 02 05 03
#> 2 02 23 3 01 05 03
#> 3 03 52 3 04 01 02
#> 4 04 41 1 03 <NA> <NA>
#> 5 05 32 2 01 02 <NA>
But I wouldn't consider your first approach wrong. Since you're working with a network, it makes sense to use network analysis tools and calculate the degree of each node, same as the number of connections.
library(dplyr)
# Toy data
df = data.frame(id = c(1,2,3,4),
age = c(1, 1, 1, 1),
id1 = c(1, 2, 3, 4),
id2 = c(1, 2, 3, NA),
id3 = c(1,2, NA, NA))
df$Connections = df %>%
select(-id, -age) %>% # Remove unnecessary columns
apply(1, function(row) {
binary_row = as.numeric(!is.na(row)) # Convert each column to binary
sum(binary_row) # Return connection count
})
What about something like this:
First, using regex we determine the columns corresponding to connections
# here connections columns must contain the pattern "id"+digit(s)
connectionsNames <- grepl("id\\d+", names(df), perl = TRUE)
Then we use rowSums to create the new column
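# count non-NA values in the connection columns only, so NAs in other columns cannot skew the count
df$connections <- rowSums(!is.na(df[, connectionsNames, drop = FALSE]))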
Here the result
df
id age id1 id2 id3 connections
1 1 1 1 1 1 3
2 2 1 2 2 2 3
3 3 1 3 3 NA 2
4 4 1 4 NA NA 1
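Applied to the data from the question rather than the toy data, the row-wise count might look like the sketch below. It assumes the ids are stored as character strings and that empty cells were read in as NA; the name conn_cols is illustrative.

```r
# Data from the question; blank cells are assumed to be NA
df <- data.frame(
  id  = c("01", "02", "03", "04", "05"),
  age = c(14, 23, 52, 41, 32),
  id1 = c("02", "01", "04", "03", "01"),
  id2 = c("05", "05", "01", NA, "02"),
  id3 = c("03", "03", "02", NA, NA)
)

# Connection columns are named "id" followed by digits (id1, id2, ...)
conn_cols <- grep("^id[0-9]+$", names(df), value = TRUE)

# Count the non-missing connections in each row
df$Connections <- unname(rowSums(!is.na(df[conn_cols])))
df$Connections
# [1] 3 3 3 1 2
```

Selecting the columns by pattern means the same two lines should work unchanged with 50 connection columns.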
