I am working in R on a dataframe which has the date of the first visit and/or of the last visit of a patient, this way:
patient_ID   date         date_number   max_date_number
3            2017-09-25   1             7
3            2019-03-05   7             7
5            2015-10-01   1             1
6            2010-04-15   1             7
6            2011-04-15   5             5
This table is contained in the visits_dataframe variable, computed this way:
visits_dataframe <- data.frame(patient_ID=integer(), date=character(), date_number=character(), max_date_number=character())
patients <- c(3,3,5,6,6)
dates <- c("2017-09-25", "2019-03-05", "2015-10-01", "2010-04-15", "2011-04-15")
date_numbers <- c("1","7","1","1","5")
max_date_numbers <- c("7","7","1","7","5")
visits_dataframe <- data.frame(patients, dates, date_numbers, max_date_numbers, stringsAsFactors=FALSE)
I need to compute the average distance in days between the first visit and the last visit, when available, across all patients. That would be the total duration of the therapy for each patient.
In this example, I would like to compute the distance between 2019-03-05 and 2017-09-25 for patient 3, and between 2011-04-15 and 2010-04-15 for patient 6.
I would not be able to compute it for patient 5, because the max_date_number is unavailable for her/him.
I tried this piece of code, but it did not work:
visits_dataframe_durations <- ave(visits_dataframe$date_number, visits_dataframe$patient_ID, FUN = (visits_dataframe[(visits_dataframe$date_number==1),] - visits_dataframe[(visits_dataframe$date_number==max_date_number),]))
Basically, I have to use a command that says:
for each patient ID:
find the last visit date (date_number == max_date_number)
find the first visit date (date_number == 1)
compute the distance between last visit and first visit (thisDuration)
save this duration into a general duration variable (generalDuration += thisDuration)
end for
compute average duration = general duration / number of patients
Can someone help me with this problem? Thanks
We could do this in dplyr
library(dplyr)
visits_dataframe %>%
mutate(dates = as.Date(dates)) %>%
group_by(patients) %>%
mutate(durations = dates[date_numbers == max_date_numbers] -
dates[date_numbers == 1])
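To go from per-patient durations to the single average the question asks for, here is one possible continuation (a sketch, re-creating the sample data with the dates already parsed; dropping zero-length spans is my interpretation of "last visit unavailable", since patient 5's first and last visit coincide):

```r
library(dplyr)

visits_dataframe <- data.frame(
  patients = c(3, 3, 5, 6, 6),
  dates = as.Date(c("2017-09-25", "2019-03-05", "2015-10-01",
                    "2010-04-15", "2011-04-15")),
  date_numbers = c(1, 7, 1, 1, 5),
  max_date_numbers = c(7, 7, 1, 7, 5)
)

# One duration per patient: last visit minus first visit
durations <- visits_dataframe %>%
  group_by(patients) %>%
  summarise(duration = as.numeric(dates[date_numbers == max_date_numbers][1] -
                                  dates[date_numbers == 1][1])) %>%
  filter(duration > 0)  # patient 5 has only one visit, so no usable span

mean(durations$duration)  # (526 + 365) / 2 = 445.5 days
```

The `[1]` subscripts just guard against a patient having several rows that match the condition.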
Related
I have a table of about 50 000 rows, with four columns.
ID Arrival Departure Gender
1 10/04/2015 23:14 11/04/2015 00:21 F
1 11/04/2015 07:59 11/04/2015 08:08 F
3 10/04/2017 21:53 30/03/2017 23:37 M
3 31/03/2017 07:09 31/03/2017 07:57 M
3 01/04/2017 01:32 01/04/2017 01:35 M
3 01/04/2017 13:09 01/04/2017 14:23 M
6 10/04/2015 21:31 10/04/2015 23:17 F
6 10/04/2015 23:48 11/04/2015 00:05 F
6 01/04/2016 21:45 01/04/2016 22:48 F
6 02/04/2016 04:54 02/04/2016 07:38 F
6 04/04/2016 18:41 04/04/2016 22:48 F
10 10/04/2015 22:39 11/04/2015 00:42 M
10 13/04/2015 02:57 13/04/2015 03:07 M
10 31/03/2016 22:29 01/04/2016 08:39 M
10 01/04/2016 18:49 01/04/2016 19:44 M
10 01/04/2016 22:28 02/04/2016 00:31 M
10 05/04/2017 09:27 05/04/2017 09:28 M
10 06/04/2017 15:12 06/04/2017 15:43 M
This is a very small representation of the table. What I want to find out is, at the same time as each entry, how many others were present, and then separate them by gender. So, say for example that at the same time as the first presence of the person with ID 1, the person with ID 6 was present and the person with ID 10 was present twice in the same interval. That would mean that 2 other overlaps occurred at the same time. This also means that the person with ID 1 has overlapped with 1 male and 1 female.
So the result should look like:
ID Arrival Departure Males encountered Females encountered
1 10/04/2015 23:14 11/04/2015 00:21 1 1
How would I be able to calculate this? I have tried to work with foverlaps and have managed to solve this with Excel, but I would want to do it in R.
Here is a data.table solution using foverlaps.
First, notice that there's an error in your data:
ID Arrival Departure Gender
3 10/04/2017 21:53 30/03/2017 23:37 M
The user arrived almost one month after he actually left. I needed to get rid of that data in order for foverlaps to run.
library(data.table)
dt <- data.table(df)
dt <- dt[Departure > Arrival, ] # filter wrong cases
setkey(dt, "Arrival", "Departure") # prepare for foverlaps
dt2 <- copy(dt) # use a different dt, inherits the key
Run foverlaps, then filter (keep only) the cases where the second person's arrival is not later than the first person's, excluding same-user matches. Add a variable counting the male simultaneous guests and another counting the female simultaneous guests, all grouped by ID and arrival.
simultaneous <- foverlaps(dt, dt2)[i.Arrival <= Arrival & ID != i.ID,
.(malesEncountered = sum(i.Gender == "M"),
femalesEncountered = sum(i.Gender == "F")),
by = .(ID, Arrival)]
Join the findings of the previous command with our original table on ID and arrival
result <- simultaneous[dt, on = .(ID, Arrival)]
Edit: convert the NAs in malesEncountered and femalesEncountered to zero:
result[is.na(malesEncountered), malesEncountered := 0][
is.na(femalesEncountered), femalesEncountered := 0]
set the column order to something nicer
setcolorder(result, c(1, 2, 5, 6, 3, 4))[]
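To see the whole pipeline run end to end, here is a self-contained sketch on three toy rows shaped like the sample (times chosen so that both other guests are already present when ID 1 arrives, matching the question's expected counts for ID 1):

```r
library(data.table)

# Toy data: ID 6 (F) and ID 10 (M) are both present when ID 1 arrives
df <- data.frame(
  ID        = c(1, 6, 10),
  Arrival   = as.POSIXct(c("2015-04-10 23:14", "2015-04-10 21:31", "2015-04-10 22:39")),
  Departure = as.POSIXct(c("2015-04-11 00:21", "2015-04-10 23:17", "2015-04-11 00:42")),
  Gender    = c("F", "F", "M")
)

dt <- data.table(df)
dt <- dt[Departure > Arrival, ]      # filter wrong cases
setkey(dt, "Arrival", "Departure")   # prepare for foverlaps
dt2 <- copy(dt)                      # second copy inherits the key

simultaneous <- foverlaps(dt, dt2)[i.Arrival <= Arrival & ID != i.ID,
                                   .(malesEncountered = sum(i.Gender == "M"),
                                     femalesEncountered = sum(i.Gender == "F")),
                                   by = .(ID, Arrival)]
result <- simultaneous[dt, on = .(ID, Arrival)]
result[is.na(malesEncountered), malesEncountered := 0][
  is.na(femalesEncountered), femalesEncountered := 0]
# ID 1 ends up with 1 male and 1 female encountered, as in the question
```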
Here's one possibility. This uses lubridate's interval and its int_overlaps function, which finds date overlaps. That has a drawback though: interval columns don't work with dplyr, so this version just does all the work manually in a for loop.
It starts by making a 1000 row random dataset that matches yours: each person arrives in a two year period and departs one or two days later.
It's taking about 24 seconds for 1000 to run so you can expect it to take a while for 50K! The for loop outputs the row number so you can see where it is though.
Any questions about the code, lemme know.
There must be a faster vectorised way but interval didn't seem to play nice with apply either. Someone else might have something quicker...
library(tidyverse)
library(lubridate)
#Sample data:
#(Date sampling code: https://stackoverflow.com/questions/21502332/generating-random-dates)
#Random dates between 2017 and 2019
x <- data.frame(
ID = c(1:1000),
Arrival = sample(seq(as.Date('2017/01/01'), as.Date('2019/01/01'), by="day"), 1000, replace = T),
Gender = ifelse(rbinom(1000,1,0.5),'Male','Female')#Random male/female with 50% probability
)
#Make departure one or two days after arrival
x$Departure = x$Arrival + sample(1:2,1000, replace=T)
#Lubridate has a function for checking whether date intervals overlap
#https://lubridate.tidyverse.org/reference/interval.html
#So first, let's make the arrival and departure dates into intervals
x$interval <- interval(x$Arrival,x$Departure)
#Then for every person / row
#We want to know if their interval overlaps with the rest
#At the moment, dplyr doesn't play nice with interval
#https://github.com/tidyverse/dplyr/issues/3206
#So let's go through each row and do this manually
#Keep each person's result in list initially
gendercounts <- list()
#Check timing
t <- proc.time()
#Go through every row manually (sigh!)
for(i in 1:nrow(x)){
print(paste0("Row ",i))
#exclude self (don't want to check date overlap with myself)
overlapcheck <- x[x$ID != x$ID[i],]
#Find out what dates this person overlaps with - can do all other intervals in one command
overlapcheck$overlaps <- int_overlaps(x$interval[i],overlapcheck$interval)
#Eyeball check that is finding the overlaps we want
#Is this ID date overlapping? Tick
#View(overlapcheck[overlapcheck$overlaps,])
#Use dplyr to find out the number of overlaps for male and female
#Keep only columns where the overlap is TRUE
#Also drop the interval column first tho as dplyr doesn't like it... (not tidy!)
gendercount <- overlapcheck %>%
select(-interval) %>%
filter(overlaps) %>%
group_by(Gender) %>%
summarise(count = n()) %>% #Get count of observations for each overlap for each sex
complete(Gender, fill = list(count = 0))#Need this to keep zero counts: summarise drops them otherwise
#We want count for each gender in their own column, so make wide
gendercount <- gendercount %>%
spread(key = Gender, value = count)
#Store for turning into dataframe shortly
gendercounts[[length(gendercounts)+1]] <- gendercount
}
#dplyr command: turn list into dataframe
gendercounts <- bind_rows(gendercounts)
#End result. Drop interval column, order columns
final <- cbind(x,gendercounts) %>%
select(ID,Arrival,Departure,Gender,Male,Female)
#~24 seconds per thousand
proc.time()-t
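On the "faster vectorised way": one possibility (a sketch of a different technique, not the loop above) is to test every pair of intervals at once with outer. It builds an n × n matrix, so it is fine for a few thousand rows but would need chunking for the 50K case. Shown on three fixed toy rows:

```r
# Toy data: ID 1's stay touches ID 6's (closed intervals, like int_overlaps);
# ID 10 overlaps nobody
x <- data.frame(
  ID        = c(1, 6, 10),
  Arrival   = as.Date(c("2015-04-10", "2015-04-10", "2015-04-12")),
  Departure = as.Date(c("2015-04-11", "2015-04-10", "2015-04-13")),
  Gender    = c("Female", "Female", "Male")
)

# Intervals [a1, d1] and [a2, d2] overlap iff a1 <= d2 and d1 >= a2
ov <- outer(x$Arrival, x$Departure, "<=") & outer(x$Departure, x$Arrival, ">=")
diag(ov) <- FALSE   # don't count an overlap with yourself

# Multiply the overlap matrix by the gender indicator vectors to get counts
x$Male   <- as.vector(ov %*% (x$Gender == "Male"))
x$Female <- as.vector(ov %*% (x$Gender == "Female"))
x
```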
I have a data frame that looks like this:
X id mat.1 mat.2 mat.3 times
1 1 1 Anne 1495206060 18.5639404 2017-05-19 11:01:00
2 2 1 Anne 1495209660 9.0160321 2017-05-19 12:01:00
3 3 1 Anne 1495211460 37.6559161 2017-05-19 12:31:00
4 4 1 Anne 1495213260 31.1218856 2017-05-19 13:01:00
....
164 164 1 Anne 1497825060 4.8098351 2017-06-18 18:31:00
165 165 1 Anne 1497826860 15.0678781 2017-06-18 19:01:00
166 166 1 Anne 1497828660 4.7636241 2017-06-18 19:31:00
What I would like is to subset the data set by time interval (all data between 11 AM and 4 PM) if there are data points for each hour at least (11 AM, 12, 1, 2, 3, 4 PM) within each day. I want to ultimately sum the values from mat.3 per time interval (11 AM to 4 PM) per day.
I tried:
sub.1 <- subset(t,format(times,'%H')>='11' & format(times,'%H')<='16')
but this returns all the data from any of the times between 11 AM and 4 PM, but often I would only have data for e.g. 12 and 1 PM for a given day.
I only want the subset from days where I have data for each hour from 11 AM to 4 PM. Any ideas what I can try?
A complement to #Henry Navarro's answer, for an additional problem mentioned in the question.
If I understand correctly, another concern of the question is to find the dates such that there are data points for at least each hour of the given interval within the day. A possible way, following the style of #Henry Navarro's solution, is as follows:
library(lubridate)
your_data$hour_only <- as.numeric(format(your_data$times, format = "%H"))
your_data$days <- ymd(format(your_data$times, "%Y-%m-%d"))
your_data_by_days_list <- split(x = your_data, f = your_data$days)
# the interval is narrowed for demonstration purposes
hours_intervals <- 11:13
# take the days from the split names so their order matches sapply's output
all_hours_flags <- data.frame(days = ymd(names(your_data_by_days_list)),
all_hours_present = sapply(your_data_by_days_list,
function(Z) all(hours_intervals %in% Z$hour_only)),
row.names = NULL)
your_data <- merge(your_data, all_hours_flags, by = "days")
There is now the column "all_hours_present" indicating that the data for a corresponding day contains at least one value for each hour in the given hours_intervals. And you may use this column to subset your data
subset(your_data, all_hours_present)
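To finish with the sum of mat.3 per qualifying day that the question ultimately asks for, one possible last step (a base-R sketch on toy data, reusing the same days/hour_only column names as above):

```r
# Toy data: day 1 has readings at 11, 12 and 13; day 2 only at 12
your_data <- data.frame(
  days      = as.Date(c("2017-05-19", "2017-05-19", "2017-05-19", "2017-05-20")),
  hour_only = c(11, 12, 13, 12),
  mat.3     = c(10, 20, 30, 40)
)
hours_intervals <- 11:13   # narrowed interval, as above

# Days that have at least one data point for every hour in the interval
complete_days <- names(which(tapply(your_data$hour_only, your_data$days,
                                    function(h) all(hours_intervals %in% h))))

# Keep only those days, then sum mat.3 per day over the interval
kept <- subset(your_data,
               as.character(days) %in% complete_days &
                 hour_only %in% hours_intervals)
daily_sums <- aggregate(mat.3 ~ days, data = kept, FUN = sum)
daily_sums   # only 2017-05-19 survives, with mat.3 = 60
```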
Try to create a new variable in your data frame with only the hour.
your_data$hour<-format(your_data$times, format="%H:%M:%S")
Then, using this new variable, do the following:
#auxiliary variable flagging your interval of time
your_data$aux_var<-ifelse(your_data$hour>="11:00:00" & your_data$hour<="16:00:00",1,0)
So, the next step is to filter your data where aux_var==1
your_data[which(your_data$aux_var ==1),]
I have a data set that has information on where people went over a certain number of days. The data has three levels of nesting in long format: first is person, second is day, third is location. Each row indicates a location. I have information on the type of location (home, work, etc.), the travel mode used to get to the location (walk, bike, public bus, etc.), and the arrival and departure time. It is a daily travel journal that starts from home and ends at home.
I need to aggregate the data to get information on the following types of journeys for each person for each day:
1. Journey from home to work without detour. (H-W)
2. Journey from home to work with detour. (H-dt-W) The number of detours does not matter.
3. Journey from work to home without detour. (W-H)
4. Journey from work to home with detour. (W-dt-H) The number of detours does not matter.
5. Journey starting from home and ending at home that does not include work in between. (H-O..-H)
6. Journey starting from work and ending at work that does not include home in between. (W-O..-W)
For all these categories I need information of travel mode, and total travel time.
For example: imagine a Monday. A person wakes up at his home (H), has his breakfast and heads for the office (W) in his car; on the way he stops at Starbucks for a coffee (C) and then picks up a co-worker from their home (D). During the day, the person goes to visit a client at a different location (E) and comes back to work; this time he takes the train. The person then leaves for home early that day because he needs to go for groceries: he returns home, goes to the grocery store at location (F), and comes back home, this time walking. This person made different kinds of journeys: 1) H-dt(C-D)-W, 2) W-O(E)-W, 3) W-H, 5) H-O(F)-H. He used different modes for the journeys: 1) driving, 2) train, 3) walking. We can add travel time to each location as well using arrival and departure times. Below is the tabular form of the data. (The data below is only for one day and one person, but my data has more days and people.)
###Data I have
Person Day ID Place Location_Code Mode Arrive Depart
5 1 0 H NA NA 8:00:00 AM
5 1 1 C D 8:30:00 AM 9:30:00 AM
5 1 2 D D 10:00:00 AM 11:00:00 AM
5 1 3 W D 11:30:00 AM 12:00:00 PM
5 1 4 E T 1:00:00 PM 1:30:00 PM
5 1 5 W T 2:30:00 PM 3:45:00 PM
5 1 6 H D 4:00:00 PM 4:30:00 PM
5 1 7 F P 5:00:00 PM 6:00:00 PM
5 1 8 H P 7:00:00 PM NA
###Data I want
Person Day Journey Type Mode/s Travel Time(hr)
5 1 H-dt-W DDD 1.5
5 1 W-O-W TT 2
5 1 W-H D 0.25
5 1 H-O-H PP 1.5
I have also attached a picture of the data I have and the data I want to have.
Here is a solution using functions from tidyverse, data.table, lubridate, and stringr. dt6 is the final output. Notice that dt6 is exactly the same as your desired output except for the Journey Type column, because I don't know the logic and meaning of your coding (like why H-C-D-W is H-dt(C-D)-W). I just combined all the information. You can change the coding based on your preference.
# Load package
library(tidyverse)
library(data.table)
library(lubridate)
library(stringr)
Data Preparation
# Create example data frame
dt <- read.table(text = "Person 'Day ID' Place Location_Code Mode Arrive Depart
5 1 0 H NA NA '8:00:00 AM'
5 1 1 C D '8:30:00 AM' '9:30:00 AM'
5 1 2 D D '10:00:00 AM' '11:00:00 AM'
5 1 3 W D '11:30:00 AM' '12:00:00 PM'
5 1 4 E T '1:00:00 PM' '1:30:00 PM'
5 1 5 W T '2:30:00 PM' '3:45:00 PM'
5 1 6 H D '4:00:00 PM' '4:30:00 PM'
5 1 7 F P '5:00:00 PM' '6:00:00 PM'
5 1 8 H P '7:00:00 PM' NA",
header = TRUE, stringsAsFactors = FALSE)
Step 1: Convert Arrive and Depart to date time class
The year and month, 2000-01, do not matter as long as all your movement events happened on the same date. I just added them to make it easier to convert to a date-time class.
dt2 <- dt %>%
mutate(Arrive = ymd_hms(paste0("2000-01-", Day.ID, " ", Arrive)),
Depart = ymd_hms(paste0("2000-01-", Day.ID, " ", Depart)))
Step 2: Convert the data frame from wide format to long format based on Arrive and Depart. Create a MoveID, which is a one-row lag of the Place column.
dt3 <- dt2 %>%
# Convert to long format
gather(Action, Time, Arrive, Depart) %>%
arrange(Person, Day.ID, Place, Location_Code, Action) %>%
group_by(Person, Day.ID, Place, Location_Code) %>%
# Create a Moving ID
mutate(MoveID = lag(Place)) %>%
ungroup() %>%
fill(MoveID, .direction = "down")
Step 3: Calculate the time difference between departure and arrival per MoveID
dt4 <- dt3 %>%
# Calculate time difference
group_by(Person, Day.ID, MoveID) %>%
summarise(Travel_Time = difftime(dplyr::last(Time), dplyr::first(Time),
units = "hours")) %>%
ungroup() %>%
select(MoveID, Travel_Time) %>%
right_join(dt3, by = "MoveID")
Step 4: Shift the Travel_Time by one. Create a Run Length ID based on Mode.
dt5 <- dt4 %>%
mutate(Travel_Time = lag(Travel_Time)) %>%
mutate(RunID = rleid(Mode)) %>%
group_by(Person, Day.ID, Place) %>%
slice(1) %>%
select(-Action, -Time) %>%
ungroup()
Step 5: Create all the desired columns
dt6 <- dt5 %>%
group_by(Person, Day.ID, RunID) %>%
summarise(Travel_Time_Sum = sum(Travel_Time),
Mode_Sum = paste(Mode, collapse = ""),
Journey = paste(Location_Code, collapse = "-")) %>%
mutate(Journey = paste(str_sub(lag(Journey), start = -1, end = -1),
Journey, sep = "-")) %>%
# Remove any columns with NA in Travel_Time_Sum
drop_na(Travel_Time_Sum) %>%
select(Person, Day = Day.ID, `Journey Type` = Journey, `Mode/s` = Mode_Sum,
`Travel Time(hr)` = Travel_Time_Sum)
I have data for hospitalisations that records date of admission and the number of days spent in the hospital:
ID date ndays
1 2005-06-01 15
2 2005-06-15 60
3 2005-12-25 20
4 2005-01-01 400
4 2006-06-04 15
I would like to create a dataset of days spent at the hospital per year, and therefore I need to deal with cases like ID 3, whose stay at the hospital goes over the end of the year, and ID 4, whose stay at the hospital is longer than one year. There is also the problem that some people have a record in the next year, and I would like to add the 'surplus' days to those when that happens.
So far I have come up with this solution:
library(lubridate)
ndays_new <- ifelse(as.Date(paste(year(data$date), "12-31", sep="-")) - data$date < data$ndays,
                    as.Date(paste(year(data$date), "12-31", sep="-")) - data$date,
                    data$ndays)
However, I can't think of a way to take those 'surplus' days that go over the end of the year and assign them to a new record starting in the next year. Can anyone point me to a good solution? I use dplyr, so solutions with that package would be especially welcome, but I'm willing to try any other tool if needed.
My solution isn't compact. But, I tried to employ dplyr and did the following. I initially changed column names for my own understanding. I calculated another date (i.e., date.2) by adding ndays to date.1. If the years of date.1 and date.2 match, that means you do not have to consider the following year. If the years do not match, you need to consider the following year. ndays.2 is basically ndays for the following year. Then, I reshaped the data using do. After filtering unnecessary rows with NAs, I changed date to year and aggregated the data by ID and year.
rename(mydf, date.1 = date, ndays.1 = ndays) %>%
mutate(date.1 = as.POSIXct(date.1, format = "%Y-%m-%d"),
date.2 = date.1 + (60 * 60 * 24) * ndays.1,
ndays.2 = ifelse(as.character(format(date.1, "%Y")) == as.character(format(date.2, "%Y")), NA,
date.2 - as.POSIXct(paste0(as.character(format(date.2, "%Y")),"-01-01"), format = "%Y-%m-%d")),
ndays.1 = ifelse(ndays.2 %in% NA, ndays.1, ndays.1 - ndays.2)) %>%
do(data.frame(ID = .$ID, date = c(.$date.1, .$date.2), ndays = c(.$ndays.1, .$ndays.2))) %>%
filter(complete.cases(ndays)) %>%
mutate(date = as.numeric(format(date, "%Y"))) %>%
rename(year = date) %>%
group_by(ID, year) %>%
summarise(ndays = sum(ndays))
# ID year ndays
#1 1 2005 15
#2 2 2005 60
#3 3 2005 7
#4 3 2006 13
#5 4 2005 365
#6 4 2006 50
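As a cross-check on the totals above, here is a sketch of a blunter base-R approach: expand every stay into one row per hospital day, then tabulate by ID and year. It reproduces the same counts (7 + 13 for ID 3, 365 + 50 for ID 4), at the cost of materialising one row per day, so it only suits modest ndays totals:

```r
# Same sample data as in the question
mydf <- data.frame(
  ID    = c(1, 2, 3, 4, 4),
  date  = as.Date(c("2005-06-01", "2005-06-15", "2005-12-25",
                    "2005-01-01", "2006-06-04")),
  ndays = c(15, 60, 20, 400, 15)
)

# One row per hospital day: a stay of ndays starting at date
# covers date, date + 1, ..., date + ndays - 1
expanded <- data.frame(
  ID  = rep(mydf$ID, mydf$ndays),
  day = as.Date(unlist(mapply(function(d, n) seq(d, by = "day", length.out = n),
                              mydf$date, mydf$ndays)),
                origin = "1970-01-01")
)
expanded$year <- format(expanded$day, "%Y")

# Count days per ID and year (the Freq column holds the day counts)
per_year <- as.data.frame(table(ID = expanded$ID, year = expanded$year))
per_year <- per_year[per_year$Freq > 0, ]
per_year
```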
Hope you're doing well. I am working on an assignment related to data pre-processing and I need some help in R.
I have a column for days which has 711 unique values. In total I have 2 million observations. The data has been collected over 2 years, and each day number represents one day of the week.
For example, day 1 is Monday, day 8 is Monday as well, day 15 is Monday, and so on.
Could someone help me replace this with 1 to 7? So if day 1 is Monday, I want the cells containing the value 8 to be replaced by 1, 15 by 1, and so on.
I hope this makes sense.
thank you for your help.
Regards
A
Following the comments (since I can't comment), try this:
# An example data.frame
mydata <- data.frame(DAY= 1:21, ABC= letters[1:21])
mydata
# Do "mod 7" on the DAY variable, so DAY now has values from 0 to 6,
# then assign the result back to DAY
mydata$DAY <- mydata$DAY %% 7
mydata
# Replace 0 for 7 in DAY variable
mydata$DAY <- ifelse(mydata$DAY == 0, 7, mydata$DAY)
mydata
# Save final data.frame
write.csv(mydata, file='mydata.csv')
Rather than issue 7 separate commands (one for each day) you can use dplyr:
require(dplyr)
d <- data.frame(day = 1:711)
mutate(d, day = (day - 1) %% 7 + 1)
What we're doing here is subtracting 1 from the day number, taking the remainder when divided by 7, and adding the 1 back. The shift keeps day 1 mapped to 1 and day 7 mapped to 7; without it, day %% 7 + 1 would map day 1 to 2 and day 7 to 1.
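Whichever recoding you pick, it's worth sanity-checking it against known days: with day 1 = Monday, days 1, 8 and 15 must all map to 1, and days 7 and 14 to 7. The shifted formula satisfies this:

```r
day <- c(1, 7, 8, 14, 15)
(day - 1) %% 7 + 1
# [1] 1 7 1 7 1
```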