I am trying to split my data in two based on a gap in the dates. In the real data, the duration of the observations is not constant. I assign lactation = 1 to every row and then try to make everything after the long gap become 2.
What I am trying to do:
Identify the gap in days; if the gap is longer than 20 days, start counting from 1 again using group_by and row_number.
The problem is that lag() does not carry the new value forward once the condition has been met.
###Code
library(dplyr)
library(lubridate)
#simulating the data
name<-"cow1"
milk<-rnorm(500,15,6)
date1<-seq(ymd('2012-01-01'),ymd('2012-09-06'),by='days') %>% as_tibble()
date2<-seq(ymd('2013-01-01'),ymd('2013-09-07'),by='days') %>% as_tibble()
date<-bind_rows(date1,date2) %>% rename("day"=value)
# the date column was renamed to "day" above, so use date$day (date$value no longer exists)
cow1<- milk %>% as_tibble() %>% rename("Yield"=value) %>% mutate(cowid=name,day=date$day)
cow1.1 <- cow1 %>% mutate(lactation=1) %>%
mutate(gap = day - lag(day, default = day[1])) %>%
mutate(lactation=ifelse(gap>20,lag(lactation)+1,lag(lactation))) %>%
group_by(lactation) %>% mutate(dim=row_number())
Sample result:
Row Yield cowid day lactation gap
250 3.1429436 cow1 2012-09-06 1 1 days
251 10.1427923 cow1 2013-01-01 2 117 days
252 19.8654469 cow1 2013-01-02 1 1 days
Desired result:
Row Yield cowid day lactation gap
250 3.1429436 cow1 2012-09-06 1 1 days
251 10.1427923 cow1 2013-01-01 2 117 days
252 19.8654469 cow1 2013-01-02 2 1 days
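One possible way to get the desired result, offered as a sketch rather than a definitive answer: lag(lactation) is evaluated once on the whole column, which is still all 1s, so it cannot carry an updated value forward row by row. A running counter such as cumsum(gap > 20) + 1 does carry forward. The toy data below mirror the simulation in the question; grouping by cowid is an assumption for the case where several cows are stacked in one table.

```r
library(dplyr)
library(lubridate)

# Toy data shaped like the question: two 250-day runs separated by a long gap
set.seed(1)
days <- c(seq(ymd("2012-01-01"), ymd("2012-09-06"), by = "days"),
          seq(ymd("2013-01-01"), ymd("2013-09-07"), by = "days"))
cow1 <- tibble(Yield = rnorm(length(days), 15, 6), cowid = "cow1", day = days)

cow1.1 <- cow1 %>%
  group_by(cowid) %>%                       # assumption: one or more cows per table
  arrange(day, .by_group = TRUE) %>%
  mutate(
    gap = as.numeric(day - lag(day, default = first(day))),  # gap in days
    lactation = cumsum(gap > 20) + 1        # counter increases at every long gap
  ) %>%
  group_by(cowid, lactation) %>%
  mutate(dim = row_number()) %>%            # days in milk restart at 1 per lactation
  ungroup()

cow1.1[250:252, ]
```

Because the counter is cumulative, a third or fourth lactation would be numbered 3, 4, and so on without further changes.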
I am trying to calculate the date difference between the second row and the last row within each pid group. The data look like this:
data<- data.frame(pid= c(1, 1, 1,1, 2, 2, 2, 3, 3, 3,3 ,3), day = c("25/07/2018", "19/10/2018", "17/01/2019", "19/03/2019", "10/09/2018","29/11/2018", "26/03/2019", "17/06/2016", "25/04/2018", "17/07/2018","05/04/2019", "09/02/2021"), catt=c(1,1,2,1,1,1,2,2,2,1,1,2))
data
   pid        day
1    1 25/07/2018
2    1 19/10/2018
3    1 17/01/2019
4    1 19/03/2019
5    2 10/09/2018
6    2 29/11/2018
7    2 26/03/2019
8    3 17/06/2016
9    3 25/04/2018
10   3 17/07/2018
11   3 05/04/2019
12   3 09/02/2021
I use the following code to obtain a difference in months.
difftime("19/10/2018","19/03/2019 ", units = "days")/ (30)
difftime("29/11/2018","26/03/2019 ", units = "days")/ (30)
difftime("25/04/2018","09/02/2021 ", units = "days")/ (30)
The desired output
id day difference
1 25/07/2018
1 19/10/2018
1 17/01/2019
1 19/03/2019 7.13
2 10/09/2018
2 29/11/2018
2 26/03/2019 44.7
3 17/06/2016
3 25/04/2018
3 17/07/2018
3 05/04/2019
3 09/02/2021 196.7667
But doing this by hand is impractical for large data. Can anyone help with a solution using lubridate and slice()?
Convert to date object and calculate the difference between last and second date for each pid.
library(dplyr)
library(lubridate)
data %>%
mutate(day = dmy(day)) %>%
arrange(pid, day) %>%
group_by(pid) %>%
summarise(difference = as.numeric(last(day) - day[2])/30)
# pid difference
# <dbl> <dbl>
#1 1 5.03
#2 2 3.9
#3 3 34.0
If you want to keep the number of rows in the dataframe, use mutate and fill in the difference only on the last row of each pid group.
data %>%
mutate(day = dmy(day)) %>%
arrange(pid, day) %>%
group_by(pid) %>%
mutate(difference = ifelse(row_number() == n(), (last(day) - day[2])/30, NA))
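Note that the difftime output in the question is incorrect: base R does not read these strings as day/month/year (it tries year/month/day), so the dates have to be converted first, for example with dmy().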
#Wrong output
difftime("19/10/2018","19/03/2019 ", units = "days")
#Time difference of 214 days
#Correct output
difftime(dmy("19/03/2019"), dmy("19/10/2018"), units = "days")
#Time difference of 151 days
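Since the question asked about slice(), here is a hedged sketch of that variant: keep only the second and the last row of each pid, then take the difference between the two dates. It assumes every pid has at least two rows; the catt column is left out because it is not needed.

```r
library(dplyr)
library(lubridate)

data <- data.frame(
  pid = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3),
  day = c("25/07/2018", "19/10/2018", "17/01/2019", "19/03/2019",
          "10/09/2018", "29/11/2018", "26/03/2019",
          "17/06/2016", "25/04/2018", "17/07/2018", "05/04/2019", "09/02/2021")
)

res <- data %>%
  mutate(day = dmy(day)) %>%           # parse dd/mm/yyyy into Date
  arrange(pid, day) %>%
  group_by(pid) %>%
  slice(c(2, n())) %>%                 # second and last row of each group
  summarise(difference = as.numeric(diff(day)) / 30)  # days / 30 as rough months

res
```

Dividing by 30 follows the question's own definition of a month; it is an approximation, not a calendar-month count.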
Using R.
This is a small subset of my dataset, simplified to only show relevant columns. The data is taken from Capital Bikeshare. The Start.date column below has exact rental times for a bike.
Start.date Member.type
2018-11-01 00:00:45 Member
2018-11-01 00:00:52 Casual
2018-11-01 00:01:46 Member
2018-11-01 01:00:02 Casual
2018-11-01 01:03:36 Member
What I'm trying to do is group all of the data by date and hour of day, and get the number of rentals for each member type plus the total (casual + member) for any given hour of any given day. So, in the end, I'll just have "Day - Hour - Number of Rentals per member type" so I can predict trends by hour of the day.
Here is my relevant code
library(dplyr)
library(lubridate) # needed for wday(), ymd_hms() and hour()
bikeData <- read.csv("2011data.csv")
bikeData <- bikeData %>%
mutate(Hour = format(strptime(
bikeData$Start.date, "%Y-%m-%d %H:%M:%S"), "%m-%d %H")) %>%
mutate(day = wday(Start.date, label=TRUE))
groupData <- bikeData %>%
mutate(Start.date = ymd_hms(Start.date)) %>%
count(date1 = as.Date(Start.date), Hour1 = hour(Start.date),
member=(Member.type)) %>%
group_by(date1, Hour1) %>%
arrange(date1, Hour1) %>%
summarise(total=sum(n))
What this gives me is the following new dataset, groupData
date1 Hour1 total
2018-11-01 0 82
2018-11-01 1 43
2018-11-01 2 17
2018-11-01 3 4
2018-11-02 0 5
2018-11-02 1 24
So I was able to do the total number of Member+Casual for all 24 hours of each day of my dataset, but how do I get another two columns that show the total number of casual and another that shows the total number of member? Thanks!
Desired below:
date1 Hour1 total Casual Member
2018-11-01 0 82 40 42
2018-11-01 1 43 20 23
2018-11-01 2 17 10 7
2018-11-01 3 4 1 3
2018-11-02 0 5 1 4
2018-11-02 1 24 20 4
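groupData <- bikeData %>%
mutate(Start.date = ymd_hms(Start.date)) %>%
count(date1 = as.Date(Start.date), Hour1 = hour(Start.date),
member = Member.type) %>%
group_by(date1, Hour1) %>%
# after count(), each row holds n rentals for one member type, so sum n per type
summarise(total = sum(n),
members = sum(n[member == "Member"]),
casuals = sum(n[member == "Casual"]))
You can simply add two variables to your summarise call that sum n over the rows where member equals each of the options. Note that count() has already replaced Member.type with the new member column, so that is the name to test, and that the counts live in n, so counting rows with sum(member == "Member") would give at most 1 per hour.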
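An alternative sketch, assuming tidyr is available: count per date, hour and member type, then spread the member types into columns with pivot_wider() and add the total. The five rows below are the sample rows from the question; real data would come from the CSV instead.

```r
library(dplyr)
library(tidyr)
library(lubridate)

bikeData <- data.frame(
  Start.date = c("2018-11-01 00:00:45", "2018-11-01 00:00:52",
                 "2018-11-01 00:01:46", "2018-11-01 01:00:02",
                 "2018-11-01 01:03:36"),
  Member.type = c("Member", "Casual", "Member", "Casual", "Member")
)

groupData <- bikeData %>%
  mutate(Start.date = ymd_hms(Start.date)) %>%
  count(date1 = as.Date(Start.date), Hour1 = hour(Start.date), Member.type) %>%
  # one column per member type; hours with no rentals of a type get 0
  pivot_wider(names_from = Member.type, values_from = n, values_fill = 0) %>%
  mutate(total = Casual + Member)

groupData
```

This keeps working if a third member type appears, although the total would then need to include the new column.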
My goal is simply to count the number of records in each hour of each day. I thought a simple solution could be found with the dplyr or data.table packages:
My data set is extremely simple:
> head(test)
id date hour
1 14869663 2018-01-24 17
2 14869664 2018-01-24 17
3 14869665 2018-01-24 17
4 14869666 2018-01-24 17
5 14869667 2018-01-24 17
6 14869668 2018-01-24 17
I only need to group by two variables (date and hour) and count. The id doesn't matter. However, these two methods in dplyr do not seem to produce the desired result (a data frame of the same length of the input data, which includes millions of records, is the output). What am I doing wrong here?
test %>% group_by(date, hour) %>% mutate(count = n())
test %>% add_count(date, hour)
The output would look something like this
> head(output)
n_records date hour
1 700 2018-01-24 0
2 750 2018-01-24 1
3 730 2018-01-24 2
4 700 2018-01-24 3
5 721 2018-01-24 4
6 753 2018-01-24 5
and so on.
Any suggestions?
This seems to do the trick:
library(dplyr)
starwars %>%
group_by(gender, species) %>%
count
It appears (h/t to Frank) that the count function can take the grouping fields directly:
starwars %>% count(gender, species)
Using data.table (this assumes test is a data.table, e.g. after setDT(test)):
test[, .N, by=.(date, hour)]
Base R:
aggregate(name ~ gender + species, data = starwars, length)
If we want to treat NAs as a group:
species1 <- factor(starwars$species, exclude = "")
gender1 <- factor(starwars$gender, exclude = "")
aggregate(name ~ gender1 + species1, data = starwars, length)
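To connect the answers back to the question's own columns, here is a small sketch on made-up rows shaped like test. It shows why the two attempts in the question return one row per record (mutate() and add_count() keep every row) while count() collapses to one row per date and hour. The values below are invented for illustration only.

```r
library(dplyr)

# Hypothetical stand-in for the question's `test` data
test <- data.frame(
  id   = 1:6,
  date = as.Date(c("2018-01-24", "2018-01-24", "2018-01-24",
                   "2018-01-24", "2018-01-25", "2018-01-25")),
  hour = c(17, 17, 18, 18, 18, 9)
)

kept   <- test %>% add_count(date, hour)                 # still 6 rows
output <- test %>% count(date, hour, name = "n_records") # one row per date/hour

output
```

The name argument only renames the count column; without it the column is called n.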
I have a dataset in long format with multiple start and end dates for patients with unique id. Dates represent hospital admission and discharge. Some patients have multiple overlapping stays in hospital, some have stays that don't overlap, and other cases have start (admission) and end dates (discharge) on same day.
Building on a post that used a lagged start date and the cummax function, I wish to do 3 things:
1. For cases with overlapping start and end dates, combine/merge cases, keeping the earliest start date and last end date.
2. For cases with start and end dates that are the same date, maintain that observation (don't merge).
3. Create new variable surgdays that is calculated from max days in surgical unit (surg), for both merged and non-merged cases.
I have data like this:
id start end surg
1 A 2013-01-01 2013-01-05 0
2 A 2013-01-08 2013-01-12 1
3 A 2013-01-10 2013-01-14 6
4 A 2013-01-20 2013-01-20 3
5 B 2013-01-15 2013-01-25 4
6 B 2013-01-20 2013-01-22 5
7 B 2013-01-28 2013-01-28 0
What I've tried:
library(dplyr)
data %>%
arrange(data, id, start) %>%
group_by(id) %>%
mutate(indx = c(0, cumsum(as.numeric(lead(start)) >
cummax(as.numeric(end)))[-n()])) %>%
group_by(id, indx) %>%
summarise(start = first(start), end = last(end), surgdays = max(surg))
What I get:
id indx start end surgdays
1 A 0 2013-01-01 2013-01-05 0
2 A 1 2013-01-08 2013-01-14 7
3 A 2 2013-01-20 2013-01-20 3
The problem: the number of rows examined with this code is limited to the number of columns in my dataset. For example, with 4 variables/columns, it worked with data from only the first 4 rows (including merging two rows with overlapping dates) and then stopped, even though there are 7 rows in the example (and thousands of rows in the actual dataset).
Similarly, when I try the same code with 70 columns (and hundreds of rows), it combines overlapping dates but only based on the first 70 rows. My initial thought was to create as many placeholder columns as there are observations in my dataset, but this is a clunky workaround.
What I'd like to get:
id indx start end surgdays
1 A 0 2013-01-01 2013-01-05 0
2 A 1 2013-01-08 2013-01-14 7
3 A 2 2013-01-20 2013-01-20 3
4 B 0 2013-01-15 2013-01-22 9
5 B 1 2013-01-28 2013-01-28 0
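This helpful approach, originally posted here by @David Arenburg, worked fine (used all cases) after I removed the arrange() statement from the sequence of operations. The likely culprit was that arrange(data, id, start) passes the data frame a second time inside the pipe; arrange(id, start) would be the usual form.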
data %>%
group_by(id) %>%
mutate(indx = c(0, cumsum(as.numeric(lead(start)) >
cummax(as.numeric(end)))[-n()])) %>%
group_by(id, indx) %>%
summarise(start = first(start), end = last(end), surgdays = max(surg))
I also found this approach helpful for capturing other variables in the collapsed cases, such as admission diagnosis at earliest hospital visit. Just add to the summarise() statement:
summarise(start = first(start), end = last(end), admit_diagnosis = first(diagnosis), surgdays = max(surg))
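A self-contained sketch of the same cumsum/cummax idea on the example rows, with two deliberate deviations that are assumptions on my part: it uses max(end) instead of last(end), so a shorter stay nested inside a longer one cannot shorten the merged stay, and it uses sum(surg), because the sample output (7 = 1 + 6, 9 = 4 + 5) looks like a total rather than a maximum. It also sorts with arrange(id, start) instead of dropping the sort.

```r
library(dplyr)

data <- data.frame(
  id    = c("A", "A", "A", "A", "B", "B", "B"),
  start = as.Date(c("2013-01-01", "2013-01-08", "2013-01-10", "2013-01-20",
                    "2013-01-15", "2013-01-20", "2013-01-28")),
  end   = as.Date(c("2013-01-05", "2013-01-12", "2013-01-14", "2013-01-20",
                    "2013-01-25", "2013-01-22", "2013-01-28")),
  surg  = c(0, 1, 6, 3, 4, 5, 0)
)

merged <- data %>%
  arrange(id, start) %>%
  group_by(id) %>%
  # a new stay begins whenever the next start is later than every end seen so far
  mutate(indx = c(0, cumsum(as.numeric(lead(start)) >
                              cummax(as.numeric(end)))[-n()])) %>%
  group_by(id, indx) %>%
  summarise(start = min(start), end = max(end), surgdays = sum(surg),
            .groups = "drop")

merged
```

With max(end), the first merged stay for patient B ends on 2013-01-25 rather than 2013-01-22; swap in last(end) and max(surg) to reproduce the posted code exactly.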
The end goal is to visualize the amount of a medication taken per day across a large sample of individuals. I'm trying to reshape my data to make a stacked area chart (or something similar).
In a more general term; I have my data structured as below:
id med start_date end_date
1 drug_a 2010-08-24 2011-03-03
2 drug_a 2011-06-07 2011-08-12
3 drug_b 2010-03-26 2010-10-31
4 drug_b 2012-08-14 2013-01-31
5 drug_c 2012-03-01 2012-06-20
5 drug_a 2012-04-01 2012-06-14
I think I'm trying to create a data frame with one row per date, and a column summing the total of patients (id) that are taking that drug on that day. For example, if someone is taking drug_a from 2010-01-01 to 2010-01-20, each of those drug-days should count.
Something like:
date drug_a drug_b drug_c
2010-01-01 5 0 10
2010-01-02 10 2 8
I'm functional with dplyr and tidyr, but unsure how to use spread with dates and durations.
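I'd expand the data out to one row per date using do() (zoo is loaded so that as.Date() accepts the plain numbers produced by start_date:end_date; this assumes start_date and end_date are already Date columns):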
library(dplyr)
library(tidyr)
library(zoo)
df %>%
group_by(id, med) %>%
do(with(.,
tibble(
date = (start_date:end_date) %>% as.Date) ) ) %>%
group_by(date, med) %>%
summarize(frequency = n() ) %>%
spread(med, frequency)
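The same expansion can be sketched with newer tidyr verbs, which avoids do() and the zoo dependency. This is an alternative under the assumption that start_date and end_date are Date columns; the rows are the sample rows from the question.

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  id  = c(1, 2, 3, 4, 5, 5),
  med = c("drug_a", "drug_a", "drug_b", "drug_b", "drug_c", "drug_a"),
  start_date = as.Date(c("2010-08-24", "2011-06-07", "2010-03-26",
                         "2012-08-14", "2012-03-01", "2012-04-01")),
  end_date   = as.Date(c("2011-03-03", "2011-08-12", "2010-10-31",
                         "2013-01-31", "2012-06-20", "2012-06-14"))
)

daily <- df %>%
  rowwise() %>%
  # one list element per prescription holding every day it covers
  mutate(date = list(seq(start_date, end_date, by = "day"))) %>%
  ungroup() %>%
  unnest(date) %>%
  count(date, med) %>%                  # patients per drug per day
  pivot_wider(names_from = med, values_from = n, values_fill = 0) %>%
  arrange(date)

head(daily)
```

Dates on which nobody takes any drug are simply absent from the result, which matters if the plot needs a continuous date axis.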