dplyr doesn't respect groups when ranking data

Using the code below with dplyr 0.7.6, I try to calculate the rank of a variable for each day in a dataset, but dplyr doesn't account for the group_by(CREATIONDATE_DAY):
dates <- sample(seq(from = as.POSIXct("2019-03-12", tz = "UTC"),
                    to = as.POSIXct("2019-03-20", tz = "UTC"),
                    by = "day"),
                size = 100, replace = TRUE)
group <- sample(c("A", "B", "C"), 100, TRUE)
df <- data.frame(CREATIONDATE_DAY = dates, GROUP = group)
# calculate the occurrences for each day and group
dfMod <- df %>%
  group_by(CREATIONDATE_DAY, GROUP) %>%
  dplyr::summarise(COUNT = n()) %>%
  ungroup()
# compute the rank by count within each day
dfMod <- dfMod %>%
  group_by(CREATIONDATE_DAY) %>%
  mutate(rank = rank(-COUNT, ties.method = "min"))
But the rank values are calculated over the entire data set instead of per creation day. As seen in the image, the row with id 24 should be rank 1, since 4 is the highest value for 16.03.2019, and row 23 should be rank 2 for that particular day. Where is my mistake?
Edit: added desired output.
Edit #2: as MrFlick pointed out, I checked my dplyr version (0.7.6), and upgrading to the most recent version fixed the issue for me.

There may be a conflict with another package. If you have lubridate loaded, try reversing the order in which you load lubridate and dplyr (I tried your example and it gave me the right answer). You can also try:
dfMod <- dfMod %>%
  group_by(CREATIONDATE_DAY) %>%
  mutate(rank = row_number(desc(COUNT)))
> head(dfMod)
# A tibble: 6 x 4
# Groups:   CREATIONDATE_DAY [2]
  CREATIONDATE_DAY    GROUP COUNT  rank
  <dttm>              <fct> <int> <int>
1 2019-03-12 00:00:00 A         2     3
2 2019-03-12 00:00:00 B         5     1
3 2019-03-12 00:00:00 C         4     2
4 2019-03-13 00:00:00 A         4     1
5 2019-03-13 00:00:00 B         3     2
6 2019-03-13 00:00:00 C         2     3
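If the culprit is indeed another package masking a dplyr verb, a defensive sketch (the same pipeline as above, just qualified) is to namespace every call, which makes the package load order irrelevant:
# explicit dplyr:: prefixes avoid any masking by other attached packages
dfMod <- dfMod %>%
  dplyr::group_by(CREATIONDATE_DAY) %>%
  dplyr::mutate(rank = dplyr::row_number(dplyr::desc(COUNT))) %>%
  dplyr::ungroup()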

Related

Creating new column in R based on different column with a for loop

I'm trying to create a new column in R that contains the mean of the values of a different column, but for their respective date.
My data frame looks something like this:
Temp Date
4 2018-01-01
3 2018-01-01
2 2018-01-02
2 2018-01-02
I now want to create a third column, with the mean temperature for each day. So that it looks like this:
Temp Date mean_Temp
4 2018-01-01 3.5
3 2018-01-01 3.5
2 2018-01-02 2
2 2018-01-02 2
I already tried:
for (i in as.list(df$Date)) {
  df$mean_Temp[i] <- paste(mean(df$Temp))
}
But that doesn't work: it only returns the overall mean temperature and doesn't calculate the mean for each day individually.
Thank you guys, I hope I made my problem clear.
I would not use a for loop in this case, since it is utterly unnecessary.
Here is a tidyverse approach. Based on your desired output, each Date would still have two records after the mean is calculated. If you only want a single row for each Date, use summarise() instead of mutate().
mutate()
library(tidyverse)
df %>% group_by(Date) %>% mutate(mean_Temp = mean(Temp))
# A tibble: 4 x 3
# Groups:   Date [2]
   Temp Date       mean_Temp
  <dbl> <chr>          <dbl>
1     4 2018-01-01       3.5
2     3 2018-01-01       3.5
3     2 2018-01-02       2
4     2 2018-01-02       2
summarise()
df %>% group_by(Date) %>% summarize(mean_Temp = mean(Temp))
# A tibble: 2 x 2
  Date       mean_Temp
  <chr>          <dbl>
1 2018-01-01       3.5
2 2018-01-02       2
Try:
library(dplyr)
df %>% group_by(Date) %>% mutate(mean_Temp = mean(Temp))
When grouping with dplyr, you can use either summarise or mutate. summarise will return one row per group, while mutate will add or modify a column and repeat the group's value for all entries in that group.
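For reference, a base R sketch equivalent to the grouped mutate() above: ave() computes mean(Temp) within each Date and recycles the value across that date's rows.
# base R equivalent of df %>% group_by(Date) %>% mutate(mean_Temp = mean(Temp))
df$mean_Temp <- ave(df$Temp, df$Date, FUN = mean)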

New data frame with unique values and counts [duplicate]

I'd like to create a new data table from my old one that includes a count of all the article_id values that occur for each date (e.g. there are three article_id values listed for the date 2001-10-01, so I'd like one column with the date and one column with the article count, "3").
Here is the output of the data table:
            date       article_id  N
   1: 2001-09-01 FAS_200109_11104  3
   2: 2001-10-01 FAS_200110_11126  6
   3: 2001-10-01 FAS_200110_11157 21
   4: 2001-10-01 FAS_200110_11160  5
   5: 2001-11-01 FAS_200111_11220 26
  ---
7359: 2019-08-01  FAZ_201908_2958  7
7360: 2019-09-01  FAZ_201909_3316  8
7361: 2019-09-01  FAZ_201909_3515 13
7362: 2000-12-01 FAZ_200012_92981  3
7363: 2001-08-01 FAZ_200108_86041 14
So I'll need to move the unique date values over to a new data frame (so that each date is shown only once), along with a count of the article_id's shown for each date.
I've been trying to figure this out but haven't found exactly the right answer on how to count the occurrences of a character vector (the article_id) by group (date). I think this is something pretty simple in R, but I'm new to the language and don't have much support, so I would very much appreciate your suggestions. Thank you so much!
The expected output is not clear, so here are a few options depending on what you want.
Sum of 'N' by 'date'
library(data.table)
dt[, .(N = sum(N, na.rm = TRUE)), by = date]
Count of unique 'article_id' for each date
dt[, .(N = uniqueN(article_id)), by = date]
Get the first count by 'date'
dt[, .(N = first(N)), by = date]
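These one-liners need a data.table to run against; here is a minimal dt reconstructed from the first five rows of the printout above (dates kept as character strings, which is enough for grouping):
library(data.table)
# minimal sample rebuilt from the question's printout
dt <- data.table(
  date = c("2001-09-01", "2001-10-01", "2001-10-01", "2001-10-01", "2001-11-01"),
  article_id = c("FAS_200109_11104", "FAS_200110_11126", "FAS_200110_11157",
                 "FAS_200110_11160", "FAS_200111_11220"),
  N = c(3, 6, 21, 5, 26)
)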
We could group and then summarise:
library(dplyr)
df %>%
  group_by(date) %>%
  summarise(n = n())
  date           n
  <chr>      <int>
1 2000-12-01     1
2 2001-08-01     1
3 2001-09-01     1
4 2001-10-01     3
5 2001-11-01     1
6 2019-08-01     1
7 2019-09-01     2
Here are two tidyverse solutions:
Libraries
library(tidyverse)
library(lubridate) # ymd() below comes from lubridate
Example Data
df <-
  tibble(
    date = ymd(c("2001-09-01", "2001-10-01", "2001-10-01")),
    article_id = c("FAS_200109_11104", "FAS_200110_11126", "FAS_200110_11157"),
    N = c(3, 6, 21)
  )
Solution
Solution 1
df %>%
  group_by(date) %>%
  summarise(N = sum(N, na.rm = TRUE))
Solution 2
df %>%
  count(date, wt = N)
Result
# A tibble: 2 x 2
  date           n
  <date>     <dbl>
1 2001-09-01     3
2 2001-10-01    27

Number of days spent in each STATE in R

I'm trying to calculate the number of days that a patient spent in a given state in R.
An image of example data is included below. I only have columns 1 to 3, and I want to get the answer in column 5. My thinking is that if I can create a date column in column 4 that holds the first recorded date for each state, I can subtract it from column 2 and get the days I'm looking for.
I tried group_by(MRN, STATE), but the problem is that it groups the second set of 1's as part of the first set of 1's, and likewise for the 2's, which is not what I want.
Use mdy_hm to change OBS_DTM to POSIXct, and group_by MRN together with the rleid of STATE so that the first set of 1's is handled separately from the second set. Then use difftime to calculate the difference, in days, between each OBS_DTM and the minimum OBS_DTM in its group.
If your data is called data:
library(dplyr)
data %>%
  mutate(OBS_DTM = lubridate::mdy_hm(OBS_DTM)) %>%
  group_by(MRN, grp = data.table::rleid(STATE)) %>%
  mutate(Answer = as.numeric(difftime(OBS_DTM, min(OBS_DTM), units = 'days'))) %>%
  ungroup() %>%
  select(-grp) -> result
result
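The question's data is only shown as an image, so here is a hypothetical input matching the columns the pipeline assumes: OBS_DTM as month/day/year hour:minute strings (hence mdy_hm), and a STATE column with a second run of 1's that rleid() keeps separate from the first:
# hypothetical example data; rleid(STATE) gives groups 1,1,2,2,2,3 here,
# so the trailing 1 is not lumped in with the first run of 1's
data <- data.frame(
  MRN = 1,
  OBS_DTM = c("7/27/2020 8:44", "7/27/2020 8:56", "8/8/2020 20:12",
              "8/14/2020 10:13", "8/15/2020 13:32", "8/16/2020 9:00"),
  STATE = c(1, 1, 2, 2, 2, 1)
)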
You could try the following:
library(dplyr)
df %>%
  group_by(ID, State) %>%
  mutate(priorObsDTM = lag(OBS_DTM)) %>%
  filter(!is.na(priorObsDTM)) %>%
  ungroup() %>%
  mutate(Answer = as.numeric(OBS_DTM - priorObsDTM, units = 'days'))
The dataframe I used for this example:
df <- data.frame(
  ID = 1,
  OBS_DTM = as.POSIXct(
    c('2020-07-27 8:44', '2020-7-27 8:56', '2020-8-8 20:12',
      '2020-8-14 10:13', '2020-8-15 13:32')
  ),
  State = c(1, 1, 2, 2, 2),
  stringsAsFactors = FALSE
)
df
# A tibble: 3 x 5
#      ID OBS_DTM             State priorObsDTM          Answer
#   <dbl> <dttm>              <dbl> <dttm>                <dbl>
# 1     1 2020-07-27 08:56:00     1 2020-07-27 08:44:00 0.00833
# 2     1 2020-08-14 10:13:00     2 2020-08-08 20:12:00 5.58
# 3     1 2020-08-15 13:32:00     2 2020-08-14 10:13:00 1.14

Count active observations by week

I have a data frame of observations with a start and end date for each observation indicating the period it was active.
The duration active varies by observation, and can spread across multiple weeks.
Some observations are still active and do not have an end date.
For a given date range, how can I count the number of observations that were active during a week within that date range, including those still active?
I have a crude method that works, but is pretty slow. It seems like there has to be a more efficient and simpler way to do this.
EDIT: My first approach was similar to Ronak's solution, which is definitely better than mine for smaller data sets, but my real data set has more observations and longer date ranges, so I run into memory constraints.
# I'm primarily using tidyverse/lubridate, but definitely open to other solutions.
library(tidyverse)
library(lubridate)
# sample data frame of observations with start and end dates:
df_obs <- tibble(
  observation = c(1:10),
  date_start = as_date(c("2020-03-17", "2020-01-20", "2020-02-06", "2020-01-04",
                         "2020-01-06", "2020-01-24", "2020-01-09", "2020-02-11",
                         "2020-03-13", "2020-02-07")),
  date_end = as_date(c("2020-03-27", "2020-03-20", NA, "2020-03-04",
                       "2020-01-16", "2020-02-24", NA, "2020-02-19",
                       NA, "2020-02-27"))
)
# to account for observations that are still active, NAs are converted to today's date:
df_obs <- mutate(df_obs, date_end = if_else(is.na(date_end), Sys.Date(), date_end))
# create a data frame of weeks by start and end date to count the active observations in a given week
# for this example I'm just using date ranges from the sample data:
df_weeks <-
  seq(min(df_obs$date_start), max(df_obs$date_start), by = 'day') %>%
  enframe(NULL, 'week_start') %>%
  mutate(week_start = as_date(cut(week_start, "week"))) %>%
  mutate(week_end = week_start + 6) %>%
  distinct()
# create a function that filters the observations data frame based on start and end dates:
check_active <- function(d, s, e) {
  d %>%
    filter(date_start <= e) %>%
    filter(date_end >= s) %>%
    nrow()
}
# applying that function to each week in the date range data frame gives the expected results:
df_weeks %>%
  rowwise() %>%
  mutate(total_active = check_active(df_obs, week_start, week_end)) %>%
  select(-week_end) %>%
  ungroup()
# A tibble: 12 x 2
   week_start total_active
   <date>            <int>
 1 2019-12-30            1
 2 2020-01-06            3
 3 2020-01-13            3
 4 2020-01-20            4
 5 2020-01-27            4
 6 2020-02-03            6
 7 2020-02-10            7
 8 2020-02-17            7
 9 2020-02-24            6
10 2020-03-02            4
11 2020-03-09            4
12 2020-03-16            5
Here is one way:
library(tidyverse)
df_obs %>%
  # Replace NA with today's date and
  # create a sequence between start and end date
  mutate(date_end = replace(date_end, is.na(date_end), Sys.Date()),
         date = map2(date_start, date_end, seq, "day")) %>%
  # Get data in long format
  unnest(date) %>%
  # Drop start and end date
  select(-date_start, -date_end) %>%
  # Cut data by week
  mutate(date = cut(date, "week")) %>%
  # Get unique rows per observation and date
  distinct(observation, date) %>%
  # Count the number of observations in each week
  count(date)
which returns :
# A tibble: 14 x 2
#    value          n
#    <fct>      <int>
#  1 2019-12-30     1
#  2 2020-01-06     3
#  3 2020-01-13     3
#  4 2020-01-20     4
#  5 2020-01-27     4
#  6 2020-02-03     6
#  7 2020-02-10     7
#  8 2020-02-17     7
#  9 2020-02-24     6
# 10 2020-03-02     4
# 11 2020-03-09     4
# 12 2020-03-16     5
# 13 2020-03-23     4
# 14 2020-03-30     3
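Given the memory constraints mentioned in the edit, here is a sketch of a vectorized variant of the original check_active() idea (assuming date_end NAs have already been replaced, as in the question's df_obs). Instead of expanding observations to daily rows, it applies the same overlap test per week, so no intermediate long table is built:
library(tidyverse)
# an observation is active in a week if it starts on/before the week's end
# and ends on/after the week's start
df_weeks %>%
  mutate(total_active = map2_int(
    week_start, week_end,
    ~ sum(df_obs$date_start <= .y & df_obs$date_end >= .x)
  )) %>%
  select(-week_end)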

Add sequence of week count aligned to a date column with infrequent dates

I'm building a dataset and am looking to add a week count to it, starting from the first date and ending on the last. I'm using it to summarize a much larger dataset, which I'd eventually like summarized by week.
Using this sample:
library(dplyr)
df <- tibble(Date = seq(as.Date("1944/06/1"), as.Date("1944/09/1"), "days"),
             Week = seq_along(Date) / 7)
# A tibble: 93 x 2
   Date        Week
   <date>     <dbl>
 1 1944-06-01 0.143
 2 1944-06-02 0.286
 3 1944-06-03 0.429
 4 1944-06-04 0.571
 5 1944-06-05 0.714
 6 1944-06-06 0.857
 7 1944-06-07 1
 8 1944-06-08 1.14
 9 1944-06-09 1.29
10 1944-06-10 1.43
# … with 83 more rows
Which definitely isn't right. Also, my real dataset isn't structured sequentially; there are many days missing between weeks, so a straight sequential count won't work.
An ideal end result is an additional "week" column based upon the actual dates (rather than hard-coded with a seq_along() type of result).
Similar solution to Ronak's but with lubridate:
library(lubridate)
(df <- tibble(Date = seq(as.Date("1944/06/1"), as.Date("1944/09/1"), "days"),
              week = interval(min(Date), Date) %>%
                as.duration() %>%
                as.numeric("weeks") %>%
                floor() + 1))
You could subtract the first Date from all the Date values, calculate the difference using difftime in "weeks", floor all the values, and add 1 to start the counter from 1.
df$week <- floor(as.numeric(difftime(df$Date, df$Date[1], units = "weeks"))) + 1
df
# A tibble: 93 x 2
#    Date        week
#    <date>     <dbl>
#  1 1944-06-01     1
#  2 1944-06-02     1
#  3 1944-06-03     1
#  4 1944-06-04     1
#  5 1944-06-05     1
#  6 1944-06-06     1
#  7 1944-06-07     1
#  8 1944-06-08     2
#  9 1944-06-09     2
# 10 1944-06-10     2
# … with 83 more rows
To use this in your dplyr pipe you could do
library(dplyr)
df %>%
  mutate(week = floor(as.numeric(difftime(Date, first(Date), units = "weeks"))) + 1)
data
df <- tibble::tibble(Date = seq(as.Date("1944/06/1"), as.Date("1944/09/1"), "days"))
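As a quick sanity check against the non-sequential dates mentioned in the question, here is a sketch with a hypothetical gapped Date vector; the week numbers follow the actual date differences, not the row positions:
library(dplyr)
# days 0, 2, 14 and 31 after the first date -> weeks 1, 1, 3, 5
df_gaps <- tibble::tibble(Date = as.Date(c("1944-06-01", "1944-06-03",
                                           "1944-06-15", "1944-07-02")))
df_gaps %>%
  mutate(week = floor(as.numeric(difftime(Date, first(Date), units = "weeks"))) + 1)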
