dplyr summarize date by weekdays - r

I have multiple observations from different persons on different dates, e.g.
df <- data.frame(id= c(rep(1,5), rep(2,8), rep(3,7)),
dates = seq.Date(as.Date("2015-01-01"), by="month", length=20))
Here we have 3 people (id), each with a different number of observations.
I now want to count the Mondays, Tuesdays, etc. for each person.
This should be done using dplyr and summarize, because my real data set has many more columns which I summarize with different statistics.
It should be something like this:
summa <- df %>% group_by(id) %>%
summarize(mondays = #number of mondays,
tuesdays = #number of tuesdays,
.........)
How can this be achieved?

I would do the following:
summa <- count(df, id, day = weekdays(dates))
# or:
# summa <- df %>%
# mutate(day = weekdays(dates)) %>%
# count(id, day)
head(summa)
#Source: local data frame [6 x 3]
#Groups: id [2]
#
# id day n
# (dbl) (chr) (int)
#1 1 Donnerstag 1
#2 1 Freitag 1
#3 1 Mittwoch 1
#4 1 Sonntag 2
#5 2 Dienstag 2
#6 2 Donnerstag 1
But you can also reshape to wide format:
library(tidyr)
spread(summa, day, n, fill=0)
#Source: local data frame [3 x 8]
#Groups: id [3]
#
# id Dienstag Donnerstag Freitag Mittwoch Montag Samstag Sonntag
# (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl)
#1 1 0 1 1 1 0 0 2
#2 2 2 1 1 1 1 1 1
#3 3 1 0 2 1 2 0 1
My results are in German, but yours would be in your own language of course. The column names are German weekdays.
If you want to use summarize explicitly you can achieve the same as above using:
summa <- df %>%
group_by(id, day = weekdays(dates)) %>%
summarize(n = n()) # or use across() (summarise_each() in older dplyr versions) for many columns

You could use the lubridate package:
library(lubridate)
summa <- df %>% group_by(id) %>%
# with lubridate's default week start (Sunday = 1), Monday is 2
summarize(mondays = sum(wday(dates) == 2),
....

Using base R date functions (weekdays() returns names in the current locale, so the comparison strings must match it):
summa <- df %>% group_by(id) %>%
summarise(monday = sum(weekdays(dates) == "Monday"),
tuesday = sum(weekdays(dates) == "Tuesday"))
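Because weekdays() depends on the session locale, a comparison such as == "Monday" fails on, for example, a German system. A minimal locale-independent sketch, assuming base R's as.POSIXlt() (whose wday field is 0 for Sunday through 6 for Saturday) and the df from the question; the helper column wd is a name chosen here for illustration:

```r
library(dplyr)

df <- data.frame(id = c(rep(1, 5), rep(2, 8), rep(3, 7)),
                 dates = seq.Date(as.Date("2015-01-01"), by = "month", length.out = 20))

summa <- df %>%
  # wd is a hypothetical helper column: 0 = Sunday, 1 = Monday, ..., 6 = Saturday
  mutate(wd = as.POSIXlt(dates)$wday) %>%
  group_by(id) %>%
  summarise(mondays  = sum(wd == 1),
            tuesdays = sum(wd == 2),
            sundays  = sum(wd == 0))

summa
```

The remaining weekdays follow the same pattern, and other summary statistics can be added to the same summarise() call.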

Related

Organizing a data frame with multiple entries per sample

I have the following database with several entries per individual:
record_id<-c(21,21,21,15,15,15,2,2,2,2,3,3,3)
var<-c(0,0,0,1,0,0,1,1,0,0,1,1,0)
data<-data.frame(cbind(record_id,var))
I want to create a new data frame with just 1 row per record_id. The condition is that if any row for an individual (record_id) has data$var == 1, the outcome data frame must show 1 for that individual.
So, the outcome would be like this:
record_id<-c(21,15,2,3)
var<-c(0,1,1,1)
data_sol<-data.frame(cbind(record_id,var))
I have tried this:
DF1 <- data %>%
group_by(record_id) %>%
mutate(class = ifelse(var==1,1,0)) %>%
ungroup
I know it's not the best way; I was planning to extract the unique values afterwards, but it did not do the trick.
If your 'var' is all zeroes or ones, you can also use max():
data%>%group_by(record_id)%>%
summarise(new_var=max(var))
# A tibble: 4 x 2
record_id new_var
<dbl> <dbl>
1 2 1
2 3 1
3 15 1
4 21 0
You can use mean() inside mutate to detect whether any non-zero value exists within a group, like this:
data %>%
group_by(record_id) %>%
mutate(var = ifelse(mean(var)!=0,1,0)) %>%
distinct(record_id,var)
gives,
# A tibble: 4 x 2
# Groups: record_id [4]
# record_id var
# <dbl> <dbl>
# 1 21 0
# 2 15 1
# 3 2 1
# 4 3 1
We can do
library(dplyr)
data %>%
group_by(record_id) %>%
summarise(var = +(mean(var) != 0))
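Or using slice_max (with_ties = FALSE keeps a single row per group even when the maximum is tied):
data %>%
  group_by(record_id) %>%
  slice_max(order_by = var, n = 1, with_ties = FALSE)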
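The condition "at least one row in the group equals 1" can also be written directly with any(). This is a sketch of that alternative rather than one of the answers above; it assumes the data from the question and returns the groups sorted by record_id:

```r
library(dplyr)

record_id <- c(21, 21, 21, 15, 15, 15, 2, 2, 2, 2, 3, 3, 3)
var <- c(0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0)
data <- data.frame(record_id, var)

data_sol <- data %>%
  group_by(record_id) %>%
  # any() is TRUE if at least one row in the group has var == 1
  summarise(var = as.numeric(any(var == 1)))

data_sol
```

Unlike max() or mean(), this states the intent explicitly and does not rely on var containing only zeroes and ones.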

How to find observations within a certain time range of each other in R

I have a dataset with ID, date, days of life, and medication variables. Each ID has multiple observations indicating different administrations of a certain drug. I want to find UNIQUE meds that were administered within 365 days of each other. A sample of the data frame is as follows:
ID date dayoflife meds
1 2003-11-24 16361 lasiks
1 2003-11-24 16361 vigab
1 2004-01-09 16407 lacos
1 2013-11-25 20015 pheno
1 2013-11-26 20016 vigab
1 2013-11-26 20016 lasiks
2 2008-06-05 24133 pheno
2 2008-04-07 24074 vigab
3 2014-11-25 8458 pheno
3 2014-12-22 8485 pheno
I expect the outcome to be:
ID N
1 3
2 2
3 1
indicating that individual 1 had a maximum of 3 different types of medication administered within 365 days of each other. I am not sure whether it is best to use days of life or the date to get to this expected outcome. Any help is appreciated.
One option: convert 'date' to Date class and group by 'ID'. Within each ID, take the absolute difference between each 'date' and the previous one (its lag), flag differences greater than 365 days, and build a grouping index from those flags with cumsum. Then count the distinct 'meds' in each (ID, grp) group with summarise, and take the maximum count per ID.
library(dplyr)
df1 %>%
mutate(date = as.Date(date)) %>%
group_by(ID) %>%
mutate(diffd = abs(as.numeric(difftime(date, lag(date, default = first(date)),
units = 'days')))) %>%
group_by(grp = cumsum(diffd > 365), .add = TRUE) %>% # .add replaces the deprecated add argument (dplyr >= 1.0.0)
summarise(N = n_distinct(meds)) %>%
group_by(ID) %>%
summarise(N = max(N))
# A tibble: 3 x 2
# ID N
# <int> <int>
#1 1 2
#2 2 2
#3 3 1
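The cumsum step is the core of this approach, so here is a minimal base R sketch of it on a toy date vector (made up for illustration, not the question's full data). It assumes the dates are sorted; a new episode starts whenever the gap to the previous date exceeds 365 days.

```r
# Toy dates, sorted ascending (illustrative values only)
dates <- as.Date(c("2003-11-24", "2004-01-09", "2013-11-25", "2013-11-26"))

# Gap in days to the previous date; the first observation has no predecessor
gap <- c(0, as.numeric(diff(dates)))

# Each gap above 365 days increments the episode index
episode <- cumsum(gap > 365)

data.frame(dates, gap, episode)
```

Note that this groups by gaps between consecutive dates, which is not the same as a strict 365-day sliding window: a long chain of closely spaced dates stays in one episode even if its first and last dates are more than a year apart.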
You can try:
library(dplyr)
df %>%
group_by(ID) %>%
mutate(date = as.Date(date),
lag_date = abs(date - lag(date)) <= 365,
lead_date = abs(date - lead(date)) <= 365) %>%
mutate_at(vars(lag_date, lead_date), ~ ifelse(., ., NA)) %>%
filter(coalesce(lag_date, lead_date)) %>%
summarise(N = n_distinct(meds))
Output:
# A tibble: 3 x 2
ID N
<int> <int>
1 1 2
2 2 2
3 3 1

Calculate difference in dates (in days) between group A and the row above it for each id

Here is my df (data.frame):
id group date
[1] 1 B 2000-01-01
[2] 1 B 2001-02-11
[3] 1 A 2001-04-06
[4] 2 C 2000-02-01
[5] 2 A 2001-01-01
[6] 2 B 2004-11-12
...
The data.frame has been arranged by id and date.
I would like to calculate difference in dates (in days) between group A and the row above it for each id. In my data, every group A has a row above it for the same id.
The results that I am interested in would look something like this:
id days
[1] 1 54
[2] 2 335
...
Please advise
Thanks.
Since it's already sorted, you can just do:
dft %>%
group_by(id) %>%
mutate(diff_days = difftime(date, lag(date))) %>%
filter(group == "A") %>%
select(diff_days)
which gives:
id diff_days
<int> <time>
1 1 54 days
2 2 335 days
Here is an idea using dplyr
library(dplyr)
#make sure "date" has the appropriate class
df$date <- as.POSIXct(df$date, format = '%Y-%m-%d')
df %>%
group_by(id) %>%
mutate(diff1 = c(NA, round(diff.difftime(date, units = 'days')))) %>%
filter(group == 'A') %>%
select(id, diff1)
#Source: local data frame [2 x 2]
#Groups: id [2]
# id diff1
# <int> <dbl>
#1 1 54
#2 2 335
We can use data.table
library(data.table)
setDT(df)[, diff1 := c(NA, round(diff.difftime(date,
units = 'days'), 0)), id][group=="A"][, c("id", "diff1"), with = FALSE]
# id diff1
#1: 1 54
#2: 2 335
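For reference, a self-contained sketch of the lag-based approach that rebuilds the six sample rows from the question, stores the dates as Date (an assumption; the question does not state the column class), and returns the difference as a plain number of days:

```r
library(dplyr)

df <- data.frame(
  id    = c(1, 1, 1, 2, 2, 2),
  group = c("B", "B", "A", "C", "A", "B"),
  date  = as.Date(c("2000-01-01", "2001-02-11", "2001-04-06",
                    "2000-02-01", "2001-01-01", "2004-11-12"))
)

res <- df %>%
  arrange(id, date) %>%                 # the approach relies on this ordering
  group_by(id) %>%
  # days between each row and the row above it within the same id
  mutate(days = as.numeric(difftime(date, lag(date), units = "days"))) %>%
  filter(group == "A") %>%
  ungroup() %>%
  select(id, days)

res
```

Setting units = "days" explicitly avoids difftime choosing another unit automatically when the inputs are date-times rather than dates.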

dplyr- conditional and multiple filters grouped-by

I want to filter on more than one condition in a generalizable way with a dplyr feel. My objective is to keep just the first month in which a group reached the goal of 40000. Given this data:
group month output cumulouput indi
(fctr) (int) (dbl) (dbl) (dbl)
A 1 9735.370 9735.37 0
A 2 10468.063 20203.43 0
A 3 11494.736 31698.17 0
B 1 10186.465 10186.46 0
B 2 9771.083 19957.55 0
B 3 9871.636 29829.18 0
B 4 9877.264 39706.45 0
B 5 9009.198 48715.65 1
B 6 9874.526 58590.17 1
C 1 10613.868 10613.87 0
C 2 10503.673 21117.54 0
C 3 10397.098 31514.64 0
C 4 9709.228 41223.87 1
C 5 9861.669 51085.54 1
C 6 9137.551 60223.09 1
For each group, I want the minimum month in which the group reached the goal, or, if the group never reached it, the maximum month in which it had not reached the goal.
This is the result of the filter:
group month output cumulouput indi
(fctr) (int) (dbl) (dbl) (dbl)
A 3 11494.736 31698.17 0
B 5 9994.509 51800.365 1
C 4 9709.228 41223.87 1
For the data:
library(dplyr)
df1 <- data.frame(group = rep(LETTERS[1:3], each=6), month = rep(1:6,3)) %>%
arrange(group,month) %>%
mutate(output = rnorm(n=18,mean = 10000, sd = 722))%>%
group_by(group) %>%
mutate(cumulouput=cumsum(output))%>%
filter(!(group=="A"&month>=4)) %>%
mutate( indi= ifelse(cumulouput>40000,1,0))
This will get you the desired output, although I feel it can be shortened up a bit.
library(dplyr)
df1 <- data.frame(group = rep(LETTERS[1:3], each=6), month = rep(1:6,3)) %>%
arrange(group,month) %>%
mutate(output = rnorm(n=18,mean = 10000, sd = 722))%>%
group_by(group) %>%
mutate(cumulouput=cumsum(output))%>%
filter(!(group=="A"&month>=4)) %>%
mutate( indi= ifelse(cumulouput>40000,1,0))
one <- df1 %>%
group_by(group) %>%
.[.$cumulouput > 40000,] %>%
filter(row_number(cumulouput) == 1)
two <- df1 %>%
group_by(group) %>%
.[.$indi == 0,]
three <- rbind(one,two) %>%
group_by(group) %>%
filter(cumulouput == max(cumulouput))%>%
arrange(group)
head(three)
The logic here is as follows. For each group, compute a temporary column m: if the group ever reached the goal (indi == 1), m is the minimum month in which the goal was satisfied; otherwise m is the maximum month in which the goal was not satisfied.
Then keep only the row whose month matches m.
Finally, remove the temporary column m.
df1 %>% group_by(group) %>%
  # refer to the grouped columns directly; inside mutate(), `.` is the whole data frame, not the current group
  mutate(m = if (any(indi == 1)) min(month[indi == 1]) else max(month[indi == 0])) %>%
  filter(month == m) %>%
  select(-m)
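Since the question's data come from rnorm() and change on every run, the following sketch uses the cumulative values printed in the question so the result is reproducible. It is an alternative formulation under that assumption, not the answerer's code: a single grouped filter() that tests the goal on cumulouput directly.

```r
library(dplyr)

# Cumulative output copied from the table in the question
df1 <- data.frame(
  group = c(rep("A", 3), rep("B", 6), rep("C", 6)),
  month = c(1:3, 1:6, 1:6),
  cumulouput = c(9735.37, 20203.43, 31698.17,
                 10186.46, 19957.55, 29829.18, 39706.45, 48715.65, 58590.17,
                 10613.87, 21117.54, 31514.64, 41223.87, 51085.54, 60223.09)
)

res <- df1 %>%
  group_by(group) %>%
  # if the group ever passes 40000 keep the first such month, else keep its last month
  filter(if (any(cumulouput > 40000)) month == min(month[cumulouput > 40000])
         else month == max(month)) %>%
  ungroup()

res
```

The goal (40000 here) could be passed in as a variable to make the filter reusable for other thresholds.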

rearrange specific rows into columns using dplyr

I am trying to rearrange rows into columns in a specific way (preferably using dplyr), but I don't really know where to start. I want one row for each person (Bill or Bob), with all of that person's values on that row. So far I have
df<-data.frame(
Participant=c("bob1","bill1","bob2","bill2"),
No_Photos=c(1,4,5,6)
)
res<-df %>% group_by(Participant) %>% dplyr::summarise(phot_mean=mean(No_Photos))
which gives me:
Participant mean(No_Photos)
(fctr) (dbl)
1 bill1 4
2 bill2 6
3 bob1 1
4 bob2 5
GOAL:
mean_NO_Photos_1 mean_No_Photos_2
bob 1 5
bill 4 6
Using tidyr and dplyr:
library(tidyr)
library(dplyr)
df %>% mutate(rep = as.numeric(gsub("[^0-9]", "", Participant)), # extract_numeric() is deprecated in tidyr
Participant = gsub("[0-9]", "", Participant)) %>%
group_by(Participant, rep) %>%
summarise(mean = mean(No_Photos)) %>%
spread(rep, mean)
Source: local data frame [2 x 3]
Participant 1 2
(chr) (dbl) (dbl)
1 bill 4 6
2 bob 1 5
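In current tidyr, spread() is superseded by pivot_wider(), which can also build the column names requested in the question. A sketch under that assumption; the helper column trial and the prefix "mean_No_Photos_" are names chosen here for illustration:

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  Participant = c("bob1", "bill1", "bob2", "bill2"),
  No_Photos = c(1, 4, 5, 6)
)

res <- df %>%
  mutate(trial = gsub("[^0-9]", "", Participant),       # trailing number, e.g. "1"
         Participant = gsub("[0-9]", "", Participant)) %>%  # name without the number
  group_by(Participant, trial) %>%
  summarise(mean = mean(No_Photos), .groups = "drop") %>%
  pivot_wider(names_from = trial, values_from = mean,
              names_prefix = "mean_No_Photos_")

res
```

This yields one row per participant with the columns mean_No_Photos_1 and mean_No_Photos_2.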
