R - creating a complete data frame by combining from other data frames

I am trying to create two new columns, people's ages and their age groups (5-year intervals), given their date of birth in a data frame.
The current data frame looks like this, for example:
Person Date of Birth
A 1/2/2000
B 3/2/1998
C 4/5/2008
The expected outcome is :
Person Date of Birth Age Age-Group
A 1/2/2000 18 15-20
B 3/2/1998 20 20-25
C 4/5/2008 10 5-10
What is the most efficient way to do this for a large data set? Thanks

Something like this? BTW, I slightly adjusted the age groups you used in your example, since using both 5-10 and 15-20 would imply an age group of 11-14 as well, which seems weird to me.
df <- read.table(text = "
Person DateofBirth
A 1/2/2000
B 3/2/1998
C 4/5/2008", header = TRUE)
library(lubridate)
df$age <- interval(as.Date(df$DateofBirth, "%d/%m/%Y"), Sys.Date()) %/% years(1)
df$agegroup <- cut(df$age, seq(5, 30, 5), c("5-10", "11-15", "16-20", "21-25", "26-30"))
df
Person DateofBirth age agegroup
1 A 1/2/2000 18 16-20
2 B 3/2/1998 20 16-20
3 C 4/5/2008 10 5-10
If you have many more age groups, you could generalize the last cut argument like this:
df1 <- data.frame(age = 1:100)
df1$agegroup <- cut(df1$age, seq(0,100,5), paste0(seq(1,96, 5), "-", seq(5,100,5)))
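As a quick sanity check, the first few rows of that generalized version come out as:
head(df1)
#   age agegroup
# 1   1      1-5
# 2   2      1-5
# 3   3      1-5
# 4   4      1-5
# 5   5      1-5
# 6   6     6-10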

Related

Split hms column into 30 minute intervals

I have participant data from an exercise test, which includes participant ID, the condition (either Experimental or Control), and the total time taken to complete the test. A small example of my data:
RawData <- data.frame(
ParticipantID = c(1:6),
Condition = c("Control","Experimental","Experimental","Control","Experimental","Control"),
Time = c("04:34:22","02:48:47","04:22:06","02:57:11","02:07:11","05:34:22"))
I then used the lubridate package so I have time in hms via:
RawData <- RawData %>%
mutate(TotalTime = hms::as_hms(Time))
Now I wish to create a new column that bins each RawData$TotalTime result into a category: Sub2, Sub230, Sub3, Sub330, Sub4, Sub430, Sub5, Sub530, and Sub6. I could probably do this via a long case_when statement, but is there an easy way to do this in lubridate, given I am after 30-minute intervals?
My desired output would be:
RawData <- data.frame(
ParticipantID = c(1:6),
Condition = c("Control","Experimental","Experimental","Control","Experimental","Control"),
Time = c("04:34:22","02:48:47","04:22:06","02:57:11","02:07:11","05:34:22"),
Category = c("Sub5","Sub3","Sub430","Sub3","Sub230","Sub6"))
Thank you!
You can use the ceiling_date function with unit = "30 mins".
library(dplyr)
library(lubridate)
RawData %>%
mutate(TotalTime = as.POSIXct(Time, format = '%T'),
Category = format(ceiling_date(TotalTime, '30 mins'), "%H%M")) %>%
select(-TotalTime)
# ParticipantID Condition Time Category
#1 1 Control 04:34:22 0500
#2 2 Experimental 02:48:47 0300
#3 3 Experimental 04:22:06 0430
#4 4 Control 02:57:11 0300
#5 5 Experimental 02:07:11 0230
#6 6 Control 05:34:22 0600
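If you need the exact Sub labels from the question rather than the "0500"-style strings, one option (my own tweak, not part of the original answer) is to strip the leading zero and any trailing "00" from the formatted time:
library(dplyr)
library(lubridate)
RawData %>%
  mutate(TotalTime = as.POSIXct(Time, format = '%T'),
         HHMM = format(ceiling_date(TotalTime, '30 mins'), "%H%M"),
         # "0500" -> "Sub5", "0430" -> "Sub430"; assumes times stay under 10 hours
         Category = paste0("Sub", sub("00$", "", sub("^0", "", HHMM)))) %>%
  select(-TotalTime, -HHMM)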

Merging three time series in R

library(qrmdata)
library(xts)
library(dplyr)
#load the data
data("EUR_USD")
data("SP500_const")
data("EURSTX_const")
#select stock and period
walmart <- data.frame(nr = SP500_const['2005-05-20/2015-05-19',"WMT"])
danone <- data.frame(nr = EURSTX_const['2005-05-20/2015-05-19',"BN.PA"])
exrate <- data.frame(nr = EUR_USD['2005-05-20/2015-05-19',])
#omit 'NA' entries
walmart <- na.omit(walmart)
danone <- na.omit(danone)
exrate <- na.omit(exrate)
I want to merge the three time series walmart, danone and exrate into one time series, but I only want those days in it for which I have data in all three time series.
I tried to merge danone and walmart first using
z <- merge(danone,walmart, join='inner')
which should merge danone and walmart (only using the days for which I have data from both danone and walmart), but it doesn't give me the output I described above.
You can use inner_join from the dplyr library. The dates are stored in the row names, so first copy them into a regular column:
walmart$date<-rownames(walmart)
danone$date<-rownames(danone)
exrate$date<-rownames(exrate)
a <- inner_join(inner_join(walmart, danone, by = "date"), exrate, by = "date")
> head(a)
WMT date BN.PA EUR.USD
1 37.52 2005-05-20 24.8986 1.2561
2 38.05 2005-05-23 25.0496 1.2585
3 37.89 2005-05-24 25.1000 1.2586
4 37.61 2005-05-25 25.0832 1.2605
5 37.62 2005-05-26 25.3013 1.2513
6 37.59 2005-05-27 25.2006 1.2580
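For what it's worth, the reason the original merge call didn't work: wrapping the series in data.frame() strips the xts class, so merge() dispatches to merge.data.frame, which has no join argument, and the dates end up in the row names where merge never looks. If you keep the series as xts objects instead, merge.xts supports join = 'inner' directly. A sketch under that assumption:
walmart_x <- na.omit(SP500_const['2005-05-20/2015-05-19', "WMT"])
danone_x <- na.omit(EURSTX_const['2005-05-20/2015-05-19', "BN.PA"])
exrate_x <- na.omit(EUR_USD['2005-05-20/2015-05-19', ])
# merge.xts aligns on the date index; join = 'inner' keeps only dates present in all three
z <- merge(walmart_x, danone_x, exrate_x, join = 'inner')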

Number of overlapping datetimes inside the same table (R)

I have a table of about 50 000 rows, with four columns.
ID Arrival Departure Gender
1 10/04/2015 23:14 11/04/2015 00:21 F
1 11/04/2015 07:59 11/04/2015 08:08 F
3 10/04/2017 21:53 30/03/2017 23:37 M
3 31/03/2017 07:09 31/03/2017 07:57 M
3 01/04/2017 01:32 01/04/2017 01:35 M
3 01/04/2017 13:09 01/04/2017 14:23 M
6 10/04/2015 21:31 10/04/2015 23:17 F
6 10/04/2015 23:48 11/04/2015 00:05 F
6 01/04/2016 21:45 01/04/2016 22:48 F
6 02/04/2016 04:54 02/04/2016 07:38 F
6 04/04/2016 18:41 04/04/2016 22:48 F
10 10/04/2015 22:39 11/04/2015 00:42 M
10 13/04/2015 02:57 13/04/2015 03:07 M
10 31/03/2016 22:29 01/04/2016 08:39 M
10 01/04/2016 18:49 01/04/2016 19:44 M
10 01/04/2016 22:28 02/04/2016 00:31 M
10 05/04/2017 09:27 05/04/2017 09:28 M
10 06/04/2017 15:12 06/04/2017 15:43 M
This is a very small representation of the table. What I want to find out is: at the same time as each entry, how many others were present, and then separate them by gender. So, for example, at the time of the first presence of the person with ID 1, the person with ID 6 was present and the person with ID 10 was present twice in the same interval. That means 2 other overlaps occurred, so the person with ID 1 has overlapped with 1 male and 1 female.
So the result should look like:
ID Arrival Departure Males encountered Females encountered
1 10/04/2015 23:14 11/04/2015 00:21 1 1
How would I be able to calculate this? I have tried to work with foverlaps and have managed to solve this in Excel, but I want to do it in R.
Here is a data.table solution using foverlaps.
First, notice that there's an error in your data:
ID Arrival Departure Gender
3 10/04/2017 21:53 30/03/2017 23:37 M
The user arrived almost one month after he actually left. I needed to get rid of that data in order for foverlaps to run.
library(data.table)
dt <- data.table(df)
dt <- dt[Departure > Arrival, ] # filter wrong cases
setkey(dt, "Arrival", "Departure") # prepare for foverlaps
dt2 <- copy(dt) # use a different dt, inherits the key
Run foverlaps, then keep only the cases where the other person's arrival is not later than this person's arrival, dropping same-user matches. Then add one variable counting the male simultaneous guests and one counting the female simultaneous guests, all grouped by ID and Arrival:
simultaneous <- foverlaps(dt, dt2)[i.Arrival <= Arrival & ID != i.ID,
.(malesEncountered = sum(i.Gender == "M"),
femalesEncountered = sum(i.Gender == "F")),
by = .(ID, Arrival)]
Join the findings of the previous command with our original table, on ID and Arrival:
result <- simultaneous[dt, on = .(ID, Arrival)]
Edit: convert the NAs in malesEncountered and femalesEncountered to zero:
result[is.na(malesEncountered), malesEncountered := 0][
is.na(femalesEncountered), femalesEncountered := 0]
Set the column order to something nicer:
setcolorder(result, c(1, 2, 5, 6, 3, 4))[]
Here's one possibility. This uses lubridate's interval and the int_overlaps function that finds date overlaps. That has a drawback though: interval columns don't play nicely with dplyr. So this version just does all the work manually in a for loop.
It starts by making a 1000-row random dataset that matches yours: each person arrives within a two-year period and departs one or two days later.
It takes about 24 seconds for 1000 rows to run, so you can expect it to take a while for 50K! The for loop prints the row number so you can see where it is, though.
Any questions about the code, lemme know.
There must be a faster vectorised way, but interval didn't seem to play nicely with apply either. Someone else might have something quicker...
Final output (screenshot omitted here) has one row per person, with ID, Arrival, Departure, Gender, Male, and Female columns.
library(tidyverse)
library(lubridate)
#Sample data:
#(Date sampling code: https://stackoverflow.com/questions/21502332/generating-random-dates)
#Random dates between 2017 and 2019
x <- data.frame(
ID = c(1:1000),
Arrival = sample(seq(as.Date('2017/01/01'), as.Date('2019/01/01'), by="day"), 1000, replace = T),
Gender = ifelse(rbinom(1000,1,0.5),'Male','Female')#Random Male/Female with 50% probability
)
#Make departure one or two days after arrival
x$Departure = x$Arrival + sample(1:2,1000, replace=T)
#Lubridate has a function for checking whether date intervals overlap
#https://lubridate.tidyverse.org/reference/interval.html
#So first, let's make the arrival and departure dates into intervals
x$interval <- interval(x$Arrival,x$Departure)
#Then for every person / row
#We want to know if their interval overlaps with the rest
#At the moment, dplyr doesn't play nice with interval
#https://github.com/tidyverse/dplyr/issues/3206
#So let's go through each row and do this manually
#Keep each person's result in list initially
gendercounts <- list()
#Check timing
t <- proc.time()
#Go through every row manually (sigh!)
for(i in 1:nrow(x)){
print(paste0("Row ",i))
#exclude self (don't want to check date overlap with myself)
overlapcheck <- x[x$ID != x$ID[i],]
#Find out what dates this person overlaps with - can do all other intervals in one command
overlapcheck$overlaps <- int_overlaps(x$interval[i],overlapcheck$interval)
#Eyeball check that is finding the overlaps we want
#Is this ID date overlapping? Tick
#View(overlapcheck[overlapcheck$overlaps,])
#Use dplyr to find out the number of overlaps for male and female
#Keep only columns where the overlap is TRUE
#Also drop the interval column first tho as dplyr doesn't like it... (not tidy!)
gendercount <- overlapcheck %>%
select(-interval) %>%
filter(overlaps) %>%
group_by(Gender) %>%
summarise(count = n()) %>% #Get count of observations for each overlap for each sex
complete(Gender, fill = list(count = 0))#Need this to keep zero counts: summarise drops them otherwise
#We want count for each gender in their own column, so make wide
gendercount <- gendercount %>%
spread(key = Gender, value = count)
#Store for turning into dataframe shortly
gendercounts[[length(gendercounts)+1]] <- gendercount
}
#dplyr command: turn list into dataframe
gendercounts <- bind_rows(gendercounts)
#End result. Drop interval column, order columns
final <- cbind(x,gendercounts) %>%
select(ID,Arrival,Departure,Gender,Male,Female)
#~24 seconds per thousand
proc.time()-t
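On the "faster vectorised way" point: here is a hedged sketch using data.table's foverlaps, the same idea as the data.table answer above, applied to the sample data x (the interval column is left out since foverlaps doesn't need it):
library(data.table)
xt <- as.data.table(x[, c("ID", "Arrival", "Departure", "Gender")])
setkey(xt, Arrival, Departure)       # foverlaps needs a keyed interval
ov <- foverlaps(xt, xt)[ID != i.ID,  # drop self-matches
                        .(Male = sum(i.Gender == "Male"),
                          Female = sum(i.Gender == "Female")),
                        by = .(ID, Arrival)]
# people with no overlaps get NA counts from the join, so zero them out
final2 <- ov[xt, on = .(ID, Arrival)]
final2[is.na(Male), Male := 0][is.na(Female), Female := 0]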

How to split a panel data record in R based on a threshold value for a variable?

I have data for hospitalisations that records date of admission and the number of days spent in the hospital:
ID date ndays
1 2005-06-01 15
2 2005-06-15 60
3 2005-12-25 20
4 2005-01-01 400
4 2006-06-04 15
I would like to create a dataset of days spent at the hospital per year, and therefore I need to deal with cases like ID 3, whose stay at the hospital goes over the end of the year, and ID 4, whose stay at the hospital is longer than one year. There is also the problem that some people already have a record in the next year, and I would like to add the 'surplus' days to that record when this happens.
So far I have come up with this solution:
library(lubridate)
ndays_new <- ifelse(
(as.Date(paste(year(data$date), "12-31", sep = "-"), format = "%Y-%m-%d") - data$date) < data$ndays,
as.Date(paste(year(data$date), "12-31", sep = "-"), format = "%Y-%m-%d") - data$date,
data$ndays)
However, I can't think of a way to take those 'surplus' days that go over the end of the year and assign them to a new record starting in the next year. Can anyone point me to a good solution? I use dplyr, so solutions with that package would be especially welcome, but I'm willing to try any other tool if needed.
My solution isn't compact, but I tried to employ dplyr and did the following. I initially changed the column names for my own understanding. I calculated another date (i.e., date.2) by adding ndays to date.1. If the years of date.1 and date.2 match, you do not have to consider the following year; if they do not match, you do. ndays.2 is basically ndays for the following year. Then I reshaped the data using do. After filtering out unnecessary rows with NAs, I converted date to year and aggregated the data by ID and year.
library(dplyr)
rename(mydf, date.1 = date, ndays.1 = ndays) %>%
mutate(date.1 = as.POSIXct(date.1, format = "%Y-%m-%d"),
date.2 = date.1 + (60 * 60 * 24) * ndays.1,
ndays.2 = ifelse(as.character(format(date.1, "%Y")) == as.character(format(date.2, "%Y")), NA,
date.2 - as.POSIXct(paste0(as.character(format(date.2, "%Y")),"-01-01"), format = "%Y-%m-%d")),
ndays.1 = ifelse(ndays.2 %in% NA, ndays.1, ndays.1 - ndays.2)) %>%
do(data.frame(ID = .$ID, date = c(.$date.1, .$date.2), ndays = c(.$ndays.1, .$ndays.2))) %>%
filter(complete.cases(ndays)) %>%
mutate(date = as.numeric(format(date, "%Y"))) %>%
rename(year = date) %>%
group_by(ID, year) %>%
summarise(ndays = sum(ndays))
# ID year ndays
#1 1 2005 15
#2 2 2005 60
#3 3 2005 7
#4 3 2006 13
#5 4 2005 365
#6 4 2006 50
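A hedged alternative sketch (my addition, not part of the answer above): expand each stay to one row per hospital day with tidyr::uncount, then count days per ID and year. This also handles stays that cross more than one year boundary, at the cost of materializing one row per day; it treats the admission day as day 1 of ndays.
library(dplyr)
library(tidyr)
mydf %>%
  mutate(date = as.Date(date)) %>%
  uncount(ndays, .id = "offset") %>%  # one row per hospital day
  mutate(day = date + offset - 1,
         year = as.numeric(format(day, "%Y"))) %>%
  count(ID, year, name = "ndays")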

average time-distance between grouped events

df contains battle events within years and conflicts. I am trying to calculate the average distance (in time) between battles within conflict years.
The header looks something like this:
conflictId | year | event_date | event_type
107 1997 1997-01-01 1
107 1997 1997-01-01 1
20 1997 1997-01-01 1
20 1997 1997-01-01 2
20 1997 1997-01-03 1
What I first tried was
time_prev_total <- aggregate(event_date ~ conflictId + year, data, diff)
but I end up with event_date being a list in the new df. Attempts to extract the first index position of the list within the df have been unsuccessful.
Alternatively, it was suggested to me that I could create a time index within each conflict year, lag that index, create a new data frame with conflictId, year, event_date, and the lagged index, and then merge that with the original df, matching the lagged index in the new df to the old index in the original df. I have tried to implement this, but am a little unsure how to index the observations within conflict years, since the panel is unbalanced.
You can use ddply to split a data.frame into pieces
(one per year and conflict) and apply a function to each.
# Sample data
n <- 100
d <- data.frame(
conflictId = sample(1:3, n, replace=TRUE),
year = sample(1990:2000, n, replace=TRUE),
event_date = sample(0:364, n, replace=TRUE),
event_type = sample(1:10, n, replace=TRUE)
)
d$event_date <- as.Date(ISOdate(d$year,1,1)) + d$event_date
library(plyr)
# Average distance between battles, within each year and conflict
ddply(
d,
c("year","conflictId"),
summarize,
average = mean(dist(event_date))
)
# Average distance between consecutive battles, within each year and conflict
d <- d[order(d$event_date),]
ddply(
d,
c("year","conflictId"),
summarize,
average = mean(diff(event_date))
)
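If you prefer dplyr over plyr, a roughly equivalent sketch for the consecutive-battles version (my addition; assumes dplyr >= 1.0 for .groups, and note that one-event groups return NaN):
library(dplyr)
d %>%
  arrange(event_date) %>%
  group_by(year, conflictId) %>%
  # diff() on dates gives a difftime; as.numeric() turns it into days
  summarise(average = mean(as.numeric(diff(event_date))), .groups = "drop")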
