I have a dataset of random, sometimes infrequent, events that I want to count as a weekly sum. Because of the randomness the events are not evenly spaced, so the other examples I have tried so far are not applicable.
The data is similar to this:
df_date <- data.frame( Name = c("Jim","Jim","Jim","Jim","Jim","Jim","Jim","Jim","Jim","Jim",
"Sue","Sue","Sue","Sue","Sue","Sue","Sue","Sue","Sue","Sue"),
Dates = c("2010-1-1", "2010-1-2", "2010-01-5","2010-01-17","2010-01-20",
"2010-01-29","2010-02-6","2010-02-9","2010-02-16","2010-02-28",
"2010-1-1", "2010-1-2", "2010-01-5","2010-01-17","2010-01-20",
"2010-01-29","2010-02-6","2010-02-9","2010-02-16","2010-02-28"),
Event = c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1) )
What I'm trying to do is create a new table that contains the sum of events per week in the calendar year.
In this case producing something like this:
Name Week Events
Jim 1 3
Sue 1 3
Jim 2 0
Sue x ... x
and so on...
Update (the OP asked about handling multiple years):
We could also use isoweek() from lubridate instead of week(), as sketched below.
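A minimal sketch of that variant, assuming dplyr and lubridate are loaded as in the code further down (isoweek() paired with isoyear(), also from lubridate, keeps week numbers consistent across year boundaries):
df_date %>%
  as_tibble() %>%
  mutate(Week = isoweek(ymd(Dates)), Year = isoyear(ymd(Dates))) %>%
  count(Name, Year, Week)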
OR:
We could add the year as follows:
df_date %>%
as_tibble() %>%
mutate(Week = week(ymd(Dates))) %>%
mutate(Year = year(ymd(Dates))) %>%
count(Name, Year, Week)
We could use lubridate's week() function after transforming the character Dates to Date format with lubridate's ymd() function.
Then we can use count(), which is shorthand for group_by(Name, Week) %>% summarise(n = n()) (the expanded form is shown after the output below):
library(dplyr)
library(lubridate)
df_date %>%
as_tibble() %>%
mutate(Week = week(ymd(Dates))) %>%
count(Name, Week)
Name Week n
<chr> <dbl> <int>
1 Jim 1 3
2 Jim 3 2
3 Jim 5 1
4 Jim 6 2
5 Jim 7 1
6 Jim 9 1
7 Sue 1 3
8 Sue 3 2
9 Sue 5 1
10 Sue 6 2
11 Sue 7 1
12 Sue 9 1
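For reference, a sketch of the longer form that count() abbreviates here, producing the same table:
df_date %>%
  as_tibble() %>%
  mutate(Week = week(ymd(Dates))) %>%
  group_by(Name, Week) %>%
  summarise(n = n(), .groups = "drop")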
Here is an approach that gets you every week for each individual, with zeros for the weeks in which that individual has no events:
get_dates_df <- function(d) {
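  # enumerate every Year/Week combination between the earliest and latest observed date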
data.frame(date = seq(min(d, na.rm=T),max(d,na.rm=T),1)) %>%
mutate(Year=year(date), Week=week(date)) %>%
distinct(Year, Week)
}
df_date = df_date %>% mutate(Dates=lubridate::ymd(Dates))
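# cross join every Name with every Year/Week in that range, then attach the observed weekly event totals (zero where there are none)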
left_join(
full_join(distinct(df_date %>% select(Name)), get_dates_df(df_date$Dates), by=character()),
df_date %>%
group_by(Name,Year=year(Dates), Week=week(Dates)) %>%
summarize(Events = sum(Event), .groups="drop")
) %>%
mutate(Events=if_else(is.na(Events),0,Events))
Output:
Name Year Week Events
1 Jim 2010 1 3
2 Jim 2010 2 0
3 Jim 2010 3 2
4 Jim 2010 4 0
5 Jim 2010 5 1
6 Jim 2010 6 2
7 Jim 2010 7 1
8 Jim 2010 8 0
9 Jim 2010 9 1
10 Sue 2010 1 3
11 Sue 2010 2 0
12 Sue 2010 3 2
13 Sue 2010 4 0
14 Sue 2010 5 1
15 Sue 2010 6 2
16 Sue 2010 7 1
17 Sue 2010 8 0
18 Sue 2010 9 1
I create a simple dataframe:
library(dplyr)
df <- tibble(
UserId = c("A", "A", "A", "A", "A", "B", "B", "B", "B"),
Answer_Date = as.Date(c("2010-12-31", "2011-12-29", "2012-12-25", "2013-12-10", "2014-12-31", "2010-10-31", "2011-10-28", "2013-10-31", "2015-10-31")),
Q1 = c(3, 1, 1, 0, 1, 4, 2, 5, 4),
Q2 = c(2, 0, 1, 2, 1, 8, 2, 6, 5)
) %>%
group_by(UserId) %>%
mutate(First_Date = min(Answer_Date)) %>%
mutate(Last_Date = max(Answer_Date)) %>%
ungroup()
which gives me
> df
# A tibble: 9 x 6
UserId Answer_Date Q1 Q2 First_Date Last_Date
<chr> <date> <dbl> <dbl> <date> <date>
1 A 2010-12-31 3 2 2010-12-31 2014-12-31
2 A 2011-12-29 1 0 2010-12-31 2014-12-31
3 A 2012-12-25 1 1 2010-12-31 2014-12-31
4 A 2013-12-10 0 2 2010-12-31 2014-12-31
5 A 2014-12-31 1 1 2010-12-31 2014-12-31
6 B 2010-10-31 4 8 2010-10-31 2015-10-31
7 B 2011-10-28 2 2 2010-10-31 2015-10-31
8 B 2013-10-31 5 6 2010-10-31 2015-10-31
9 B 2015-10-31 4 5 2010-10-31 2015-10-31
I now wish to compute the change in each subject's answers between the first and last date on which they answer the questionnaire. I start by writing
df_tmp <- df %>%
filter(Answer_Date == First_Date) %>%
select(c("UserId", "Q1", "Q2"))
colnames(df_tmp) <- c("UserId", paste0("First_Response_", c("Q1", "Q2")))
df <- merge(df, df_tmp, by = "UserId")
df_tmp <- df %>%
filter(Answer_Date == Last_Date) %>%
select(c("UserId", "Q1", "Q2"))
colnames(df_tmp) <- c("UserId", paste0("Last_Response_", c("Q1", "Q2")))
df <- merge(df, df_tmp, by = "UserId")
giving me
> df
UserId Answer_Date Q1 Q2 First_Date Last_Date First_Response_Q1 First_Response_Q2 Last_Response_Q1 Last_Response_Q2
1 A 2010-12-31 3 2 2010-12-31 2014-12-31 3 2 1 1
2 A 2011-12-29 1 0 2010-12-31 2014-12-31 3 2 1 1
3 A 2012-12-25 1 1 2010-12-31 2014-12-31 3 2 1 1
4 A 2013-12-10 0 2 2010-12-31 2014-12-31 3 2 1 1
5 A 2014-12-31 1 1 2010-12-31 2014-12-31 3 2 1 1
6 B 2010-10-31 4 8 2010-10-31 2015-10-31 4 8 4 5
7 B 2011-10-28 2 2 2010-10-31 2015-10-31 4 8 4 5
8 B 2013-10-31 5 6 2010-10-31 2015-10-31 4 8 4 5
9 B 2015-10-31 4 5 2010-10-31 2015-10-31 4 8 4 5
I now wish to create two new columns, Delta_Q1 = Last_Response_Q1 - First_Response_Q1 and Delta_Q2 = Last_Response_Q2 - First_Response_Q2, but (possibly) using mutate, paste0("First_Response_", c("Q1", "Q2")), paste0("Last_Response_", c("Q1", "Q2")) and paste0("Delta_", c("Q1", "Q2")).
What is the correct syntax for computing the differences (or in general, some function of two variables) between pairs of columns sequentially? The reason I don't want to write the differences down manually is simple - the real dataframe has lots of pairs of columns.
Many thanks in advance for your help.
Sincerely
Thomas Philips
You can create two vectors of column positions and subtract the corresponding columns directly to create the new columns.
first_r_col <- grep('First_Response', colnames(df))
last_r_col <- grep('Last_Response', colnames(df))
df[paste0('delta', seq_along(first_r_col))] <- df[last_r_col] - df[first_r_col]
Using a dplyr select() statement might be an easier way to pick out the columns:
library(dplyr)
df[paste0('delta', seq_along(first_r_col))] <-
df %>% select(starts_with('Last_Response')) -
df %>% select(starts_with('First_Response'))
Here's one approach that does not require creating the First_Date and Last_Date columns:
library(dplyr)
df %>%
group_by(UserId) %>%
arrange(UserId, Answer_Date) %>%
filter(row_number() == 1 | row_number() == n()) %>%
summarize(Delta_Q1 = diff(Q1),
Delta_Q2 = diff(Q2))
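If there are many pairs of question columns, as the question mentions, the same idea can be written once with across(); a minimal sketch, assuming dplyr >= 1.0 and that every question column starts with "Q":
df %>%
  group_by(UserId) %>%
  arrange(UserId, Answer_Date) %>%
  filter(row_number() == 1 | row_number() == n()) %>%
  summarize(across(starts_with("Q"), diff, .names = "Delta_{.col}"))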
I don't think that much code is needed; below is a dplyr solution:
df %>%
group_by(UserId) %>%
arrange(Answer_Date) %>%
summarize(First_Q1 = first(Q1),
First_Q2 = first(Q2),
Last_Q1 = last(Q1),
Last_Q2 = last(Q2)) %>%
mutate(Delta_Q1 = Last_Q1 - First_Q1,
Delta_Q2 = Last_Q2 - First_Q2)
Gives the output of:
# A tibble: 2 x 7
UserId First_Q1 First_Q2 Last_Q1 Last_Q2 Delta_Q1 Delta_Q2
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 A 3 2 1 1 -2 -1
2 B 4 8 4 5 0 -3
With the benefit of hindsight, I missed the blindingly obvious answer, and made it harder than it should have been:
QUESTIONS <- c("Q1", "Q2")
FIRST_RESPONSE_PREFIX <- "First_"
LAST_RESPONSE_PREFIX <- "Last_"
DELTA_RESPONSE_PREFIX <- "Delta_"
first_response_cols <- paste0(FIRST_RESPONSE_PREFIX, QUESTIONS)
last_response_cols <- paste0(LAST_RESPONSE_PREFIX, QUESTIONS)
delta_response_cols <- paste0(DELTA_RESPONSE_PREFIX, QUESTIONS)
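# pull each user's first (and then last) responses, rename them with the First_/Last_ prefixes, and merge back onto df by UserId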
df_tmp1 <- df %>%
filter(Answer_Date == First_Date) %>%
select(c("UserId", QUESTIONS))
colnames(df_tmp1) <- c("UserId", first_response_cols)
df <- merge(df, df_tmp1, by = "UserId")
df_tmp2 <- df %>%
filter(Answer_Date == Last_Date) %>%
select(c("UserId", QUESTIONS))
colnames(df_tmp2) <- c("UserId", last_response_cols)
df <- merge(df, df_tmp2, by = "UserId")
df[delta_response_cols] <- df[last_response_cols] - df[first_response_cols]
When I run the code, I get exactly what I want:
> df
UserId Answer_Date Q1 Q2 First_Date Last_Date First_Q1 First_Q2 Last_Q1 Last_Q2 Delta_Q1 Delta_Q2
1 A 2010-12-31 3 2 2010-12-31 2014-12-31 3 2 1 1 -2 -1
2 A 2011-12-29 1 0 2010-12-31 2014-12-31 3 2 1 1 -2 -1
3 A 2012-12-25 1 1 2010-12-31 2014-12-31 3 2 1 1 -2 -1
4 A 2013-12-10 0 2 2010-12-31 2014-12-31 3 2 1 1 -2 -1
5 A 2014-12-31 1 1 2010-12-31 2014-12-31 3 2 1 1 -2 -1
6 B 2010-10-31 4 8 2010-10-31 2015-10-31 4 8 4 5 0 -3
7 B 2011-10-28 2 2 2010-10-31 2015-10-31 4 8 4 5 0 -3
8 B 2013-10-31 5 6 2010-10-31 2015-10-31 4 8 4 5 0 -3
9 B 2015-10-31 4 5 2010-10-31 2015-10-31 4 8 4 5 0 -3
That said, thanks for the help - I learned something by looking at the suggested answers.
I want to select distinct entries from my dataset based on two specific variables. In fact, I may want to create a subset and run the analysis on each subset.
The data set looks like this
id <- c(3,3,6,6,4,4,3,3)
date <- c("2017-1-1", "2017-3-3", "2017-4-3", "2017-4-7", "2017-10-1", "2017-11-1", "2018-3-1", "2018-4-3")
date_cat <- c(1,1,1,1,2,2,3,3)
measurement <- c(10, 13, 14,13, 12, 11, 14, 17)
myData <- data.frame(id, date, date_cat, measurement)
myData
myData$date1 <- as.Date(myData$date)
myData
id date date_cat measurement date1
1 3 2017-1-1 1 10 2017-01-01
2 3 2017-3-3 1 13 2017-03-03
3 6 2017-4-3 1 14 2017-04-03
4 6 2017-4-7 1 13 2017-04-07
5 4 2017-10-1 2 12 2017-10-01
6 4 2017-11-1 2 11 2017-11-01
7 3 2018-3-1 3 14 2018-03-01
8 3 2018-4-3 3 17 2018-04-03
# select the last date for each ID in each date category
Here date_cat is the date category and date1 is the date converted to Date format. How can I get the last date for each ID in each date_cat?
I want my data to show up as
id date date_cat measurement date1
1 3 2017-3-3 1 13 2017-03-03
2 6 2017-4-7 1 13 2017-04-07
3 4 2017-11-1 2 11 2017-11-01
4 3 2018-4-3 3 17 2018-04-03
Thanks!
I am not sure if you want something like below:
subset(myData,ave(date1,id,date_cat,FUN = function(x) tail(sort(x),1))==date1)
which gives
> subset(myData,ave(date1,id,date_cat,FUN = function(x) tail(sort(x),1))==date1)
id date date_cat measurement date1
2 3 2017-3-3 1 13 2017-03-03
4 6 2017-4-7 1 13 2017-04-07
6 4 2017-11-1 2 11 2017-11-01
8 3 2018-4-3 3 17 2018-04-03
Using data.table:
library(data.table)
myData_DT <- as.data.table(myData)
myData_DT[, .SD[.N] , by = .(date_cat, id)]
We could create a group with rleid() on the 'id' column, slice() the last row of each group, and then remove the temporary grouping column:
library(dplyr)
library(data.table)
myData %>%
group_by(grp = rleid(id)) %>%
slice(n()) %>%
ungroup %>%
select(-grp)
# A tibble: 4 x 5
# id date date_cat measurement date1
# <dbl> <chr> <dbl> <dbl> <date>
#1 3 2017-3-3 1 13 2017-03-03
#2 6 2017-4-7 1 13 2017-04-07
#3 4 2017-11-1 2 11 2017-11-01
#4 3 2018-4-3 3 17 2018-04-03
Or this can be done on the fly without creating a temporary column:
myData %>%
filter(!duplicated(rleid(id), fromLast = TRUE))
Or using base R with subset and rle:
subset(myData, !duplicated(with(rle(id),
rep(seq_along(values), lengths)), fromLast = TRUE))
# id date date_cat measurement date1
#2 3 2017-3-3 1 13 2017-03-03
#4 6 2017-4-7 1 13 2017-04-07
#6 4 2017-11-1 2 11 2017-11-01
#8 3 2018-4-3 3 17 2018-04-03
Using dplyr (comparing the Date-class column date1, so the ordering is chronological rather than alphabetical):
myData %>%
  group_by(id, date_cat) %>%
  top_n(1, date1)
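With dplyr 1.0.0 or later, slice_max() supersedes top_n() and reads a little more directly; a minimal sketch of the same idea:
myData %>%
  group_by(id, date_cat) %>%
  slice_max(date1, n = 1) %>%
  ungroup()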
I have a data frame (with N = 16) that contains ID (character), w_from (date), and w_to (date). Each record represents a task.
Here’s the data in R.
ID <- c(1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2)
w_from <- c("2010-01-01","2010-01-05","2010-01-29","2010-01-29",
"2010-03-01","2010-03-15","2010-07-15","2010-09-10",
"2010-11-01","2010-11-30","2010-12-15","2010-12-31",
"2011-02-01","2012-04-01","2011-07-01","2011-07-01")
w_to <- c("2010-01-31","2010-01-15", "2010-02-13","2010-02-28",
"2010-03-16","2010-03-16","2010-08-14","2010-10-10",
"2010-12-01","2010-12-30","2010-12-20","2011-02-19",
"2011-03-23","2012-06-30","2011-07-31","2011-07-06")
df <- data.frame(ID, w_from, w_to)
df$w_from <- as.Date(df$w_from)
df$w_to <- as.Date(df$w_to)
I need to generate a group number, within each ID, for records whose time intervals overlap. As an example, and in general terms, if record #1 overlaps with record #2, and record #2 overlaps with record #3, then records #1, #2, and #3 all belong to the same group.
Likewise, if record #1 overlaps with records #2 and #3, but record #2 does not overlap with record #3, then records #1, #2, and #3 still all belong to one group.
In the example above, and for ID = 1, the first four records overlap.
The final output should add that group number as a new column within each ID (for ID = 1, for example, the first four records would all be group 1).
Also, if this can be done using dplyr, that would be great!
Try this: within each ID, a new group starts whenever a record's w_from is later than the running maximum of all earlier w_to values.
library(dplyr)
df %>%
group_by(ID) %>%
arrange(w_from) %>%
mutate(group = 1+cumsum(
cummax(lag(as.numeric(w_to), default = first(as.numeric(w_to)))) < as.numeric(w_from)))
# A tibble: 16 x 4
# Groups: ID [2]
ID w_from w_to group
<dbl> <date> <date> <dbl>
1 1 2010-01-01 2010-01-31 1
2 1 2010-01-05 2010-01-15 1
3 1 2010-01-29 2010-02-13 1
4 1 2010-01-29 2010-02-28 1
5 1 2010-03-01 2010-03-16 2
6 1 2010-03-15 2010-03-16 2
7 1 2010-07-15 2010-08-14 3
8 1 2010-09-10 2010-10-10 4
9 1 2010-11-01 2010-12-01 5
10 1 2010-11-30 2010-12-30 5
11 1 2010-12-15 2010-12-20 5
12 1 2010-12-31 2011-02-19 6
13 1 2011-02-01 2011-03-23 6
14 2 2011-07-01 2011-07-31 1
15 2 2011-07-01 2011-07-06 1
16 2 2012-04-01 2012-06-30 2
I am trying to apply the following logic to create the 'subtract' column.
I have years from 1986-2014 and around 100 firms.
year firm count sum_of_year subtract
1986 A 1 2 2
1986 B 1 2 4
1987 A 2 4 5
1987 C 1 4 2
1987 D 1 4 5
1988 C 3 5
1988 E 2 5
That is, if firm i in year t also appears in year t+1, then subtract its count in t+1 from the sum_of_year in t+1;
if firm i does not appear in t+1, then just use the sum_of_year of t+1, as shown in the sample.
I am having difficulty writing this conditional logic.
How can I do this in a generalized way?
Thank you for your help.
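The answer below refers to the sample data as df; since the question does not include constructor code, here is an assumed minimal reconstruction taken from the table above:
df <- data.frame(
  year = c(1986, 1986, 1987, 1987, 1987, 1988, 1988),
  firm = c("A", "B", "A", "C", "D", "C", "E"),
  count = c(1, 1, 2, 1, 1, 3, 2),
  sum_of_year = c(2, 2, 4, 4, 4, 5, 5)
)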
One way is using dplyr with the help of tidyr::complete. We complete the missing combinations of year and firm, filling count with 0. For each year, we subtract each firm's count from the sum of count for that entire year; finally, for each firm, we take that value from the next year using lead.
library(dplyr)
df %>%
tidyr::complete(year, firm, fill = list(count = 0)) %>%
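# every firm now appears in every year, with count = 0 where it was absent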
group_by(year) %>%
mutate(n = sum(count) - count) %>%
group_by(firm) %>%
mutate(subtract = lead(n)) %>%
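# then drop the filler rows that complete() added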
filter(count != 0) %>%
select(-n)
# year firm count sum_of_year subtract
# <int> <fct> <dbl> <int> <dbl>
#1 1986 A 1 2 2
#2 1986 B 1 2 4
#3 1987 A 2 4 5
#4 1987 C 1 4 2
#5 1987 D 1 4 5
#6 1988 C 3 5 NA
#7 1988 E 2 5 NA