Identify subsequent event for repeat IDs based on dates and initial event - r

I am trying to identify repeat IDs based on date and an initial event. Below is a sample data set:
+----+------------+-------------------------+
| ID | Date | Investigation or Intake |
+----+------------+-------------------------+
| 1 | 1/1/2019 | Investigation |
| 2 | 1/2/2019 | Investigation |
| 3 | 1/3/2019 | Investigation |
| 4 | 1/4/2019 | Investigation |
| 1 | 1/2/2019 | Intake |
| 2 | 12/31/2018 | Intake |
| 3 | 1/5/2019 | Intake |
+----+------------+-------------------------+
I want to write R code that goes through IDs 1 to 4 (the IDs that have investigations) and checks whether each has a subsequent intake (an intake that happens at a later date than the investigation). The expected output looks like this:
+----+------------+-------------------------+------------+
| ID | Date | Investigation or Intake | New Column |
+----+------------+-------------------------+------------+
| 1 | 1/1/2019 | Investigation | Sub Intake |
| 2 | 1/2/2019 | Investigation | None |
| 3 | 1/3/2019 | Investigation | Sub Intake |
| 4 | 1/4/2019 | Investigation | None |
| 1 | 1/2/2019 | Intake | |
| 2 | 12/31/2018 | Intake | |
| 3 | 1/5/2019 | Intake | |
+----+------------+-------------------------+------------+
What will the code look like to solve this? I am guessing it will be some loop function?
Thanks!

You can do this with the dplyr package, using ifelse statements to create the new column.
Instead of looping, check the next entry in each group with the lead function.
This solution assumes that in each group you will have one "Investigation" and then 0 or more "Intake" entries that are listed afterwards.
library(dplyr)
df <- data.frame(ID = c(1, 2, 3, 4, 1, 2, 3),
                 Date = as.Date(c("2019-01-01", "2019-01-02", "2019-01-03", "2019-01-04", "2019-01-02", "2018-12-31", "2019-01-05")),
                 Investigation_or_Intake = c("Investigation", "Investigation", "Investigation", "Investigation", "Intake", "Intake", "Intake"),
                 stringsAsFactors = FALSE)
df %>%
  group_by(ID) %>% # Make groups according to ID column
  mutate(newcol = ifelse(lead(Date) > Date, "Sub Intake", "None"), # Check next entry in the group to see if Date is after current
         newcol = ifelse(Investigation_or_Intake == "Investigation" & is.na(newcol), "None", newcol)) # Change "Investigation" entries with no Intake to "None"
This gives
ID Date Investigation_or_Intake newcol
<dbl> <date> <chr> <chr>
1 1 2019-01-01 Investigation Sub Intake
2 2 2019-01-02 Investigation None
3 3 2019-01-03 Investigation Sub Intake
4 4 2019-01-04 Investigation None
5 1 2019-01-02 Intake NA
6 2 2018-12-31 Intake NA
7 3 2019-01-05 Intake NA
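If the rows are not guaranteed to be ordered with the investigation first, the same column can be built without relying on row order. The following is only a sketch in base R, not the answer's method: for each investigation row it checks whether any intake with the same ID has a later date. The helper name has_later_intake is made up for illustration.

```r
# Same sample data as in the question
df <- data.frame(
  ID = c(1, 2, 3, 4, 1, 2, 3),
  Date = as.Date(c("2019-01-01", "2019-01-02", "2019-01-03", "2019-01-04",
                   "2019-01-02", "2018-12-31", "2019-01-05")),
  Investigation_or_Intake = c(rep("Investigation", 4), rep("Intake", 3)),
  stringsAsFactors = FALSE
)

is_inv <- df$Investigation_or_Intake == "Investigation"

# Hypothetical helper: TRUE if row i (an investigation) has an intake
# with the same ID on a strictly later date, regardless of row order
has_later_intake <- function(i) {
  any(df$ID == df$ID[i] &
        df$Investigation_or_Intake == "Intake" &
        df$Date > df$Date[i])
}

# Intake rows stay NA; investigation rows get "Sub Intake" or "None"
df$newcol <- NA_character_
df$newcol[is_inv] <- ifelse(
  vapply(which(is_inv), has_later_intake, logical(1)),
  "Sub Intake", "None"
)
df
```

This also copes with IDs that have several intakes or several investigations, at the cost of one scan of the data per investigation row.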

Related

How to summarize data in R (dplyr) and avoid duplicate identifiers? [duplicate]

I'm trying to identify the lowest rate over a range of years for a number of items (ID).
In addition, I would like to know the Year the lowest rate was pulled from.
I'm grouping by ID, but I run into an issue when rates are duplicated across years.
sample data
df <- data.frame(ID = c(1,1,1,2,2,2,3,3,3,4,4,4),
Year = rep(2010:2012,4),
Rate = c(0.3,0.6,0.9,
0.8,0.5,0.2,
0.8,0.4,0.9,
0.7,0.7,0.7))
sample data as table
| ID | Year | Rate |
|:------:|:------:|:------:|
| 1 | 2010 | 0.3 |
| 1 | 2011 | 0.6 |
| 1 | 2012 | 0.9 |
| 2 | 2010 | 0.8 |
| 2 | 2011 | 0.5 |
| 2 | 2012 | 0.2 |
| 3 | 2010 | 0.8 |
| 3 | 2011 | 0.4 |
| 3 | 2012 | 0.9 |
| 4 | 2010 | 0.7 |
| 4 | 2011 | 0.7 |
| 4 | 2012 | 0.7 |
Using dplyr I grouped by ID, then found the lowest rate.
df.Summarise <- df %>%
group_by(ID) %>%
summarise(LowestRate = min(Rate))
This gives me the following
| ID | LowestRate |
| --- | --- |
| 1 | 0.3 |
| 2 | 0.2 |
| 3 | 0.4 |
| 4 | 0.7 |
However, I also need to know the year that data was pulled from.
This is what I would like my final result to look like:
| ID | Rate | Year |
| --- | --- | --- |
| 1 | 0.3 | 2010 |
| 2 | 0.2 | 2012 |
| 3 | 0.4 | 2011 |
| 4 | 0.7 | 2012 |
Here's where I ran into some issues.
Attempt #1: Include "Year" in the original dplyr code
df.Summarise2 <- df %>%
group_by(ID) %>%
summarise(LowestRate = min(Rate),
Year = Year)
Error: Column `Year` must be length 1 (a summary value), not 3
Makes sense. I'm not summarizing "Year" at all. I just want to include that row's value for Year!
Attempt #2: Use mutate instead of summarise
df.Mutate <- df %>%
group_by(ID) %>%
mutate(LowestRate = min(Rate))
So that essentially returns my original dataframe, but with an extra column for LowestRate attached.
How would I go from this to what I want?
I tried to left_join / merge based on ID and Lowest Rate, but there's multiple matches for ID #4. Is there any way to only pick one match (row)?
df.joined <- left_join(df.Summarise,df,by = c("ID","LowestRate" = "Rate"))
df.joined as table
| ID | LowestRate | Year |
| --- | --- | --- |
| 1 | 0.3 | 2010 |
| 2 | 0.2 | 2012 |
| 3 | 0.4 | 2011 |
| 4 | 0.7 | 2010 |
| 4 | 0.7 | 2011 |
| 4 | 0.7 | 2012 |
I've tried looking online, but I can't really find anything that strikes this exactly.
Using ".drop = FALSE" for group_by() didn't help, as it seems to be intended for empty values?
The dataset I'm working with is large, so I'd really like to find how to make this work and avoid hard-coding anything :)
Thanks for any help!
You can group by ID and then filter without summarizing, and that way you'll preserve all columns but still only keep the min value:
df %>%
group_by(ID) %>%
filter(Rate == min(Rate))
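Note that filtering on Rate == min(Rate) keeps every tied row, so ID 4 would still come back three times. If exactly one row per ID is wanted, one option (a sketch, assuming dplyr 1.0.0 or later, where slice_min() is available) is to ask for a single row and switch ties off. Which tied row is kept then depends on row order; here it is the first one, Year 2010.

```r
library(dplyr)

df <- data.frame(ID = c(1,1,1,2,2,2,3,3,3,4,4,4),
                 Year = rep(2010:2012, 4),
                 Rate = c(0.3,0.6,0.9,
                          0.8,0.5,0.2,
                          0.8,0.4,0.9,
                          0.7,0.7,0.7))

# One row per ID: the row with the lowest Rate, ties broken by row order
lowest <- df %>%
  group_by(ID) %>%
  slice_min(Rate, n = 1, with_ties = FALSE) %>%
  ungroup()
lowest
```

To prefer the latest year among ties instead, sort with arrange(desc(Year)) before grouping.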

How do you assign groups to larger groups dplyr

I would like to assign groups to larger groups (clusters) so that each cluster can be sent to a core for processing. I have 16 cores. This is what I have so far:
test<-data_extract%>%group_by(group_id)%>%sample_n(16,replace = TRUE)
This takes samples of 16 from each group.
This is an example of what I would like the final product to look like (with two clusters). All I really want is for every row with the same group id to belong to the same cluster, with a set number of clusters:
________________________________
balance | group_id | cluster|
454452 | a | 1 |
5450441 | a | 1 |
5444531 | b | 1 |
5404051 | b | 1 |
5404501 | b | 1 |
5404041 | b | 1 |
544251 | b | 1 |
254252 | b | 1 |
541254 | c | 2 |
54123254 | d | 1 |
542541 | d | 1 |
5442341 | e | 2 |
541 | f | 1 |
________________________________
test<-data%>%group_by(group_id)%>% mutate(group = sample(1:16,1))
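The line above draws one random cluster number per group, so cluster sizes can end up uneven and some clusters may stay empty. A deterministic alternative is to deal the distinct group ids out round-robin. This is only a sketch in base R: the sample data and the names n_clusters and cluster_of are made up for illustration.

```r
# Hypothetical sample data with the columns from the question
data <- data.frame(
  balance  = c(454452, 5450441, 5444531, 5404051, 541254, 54123254, 542541, 5442341, 541),
  group_id = c("a", "a", "b", "b", "c", "d", "d", "e", "f"),
  stringsAsFactors = FALSE
)

n_clusters <- 2  # would be 16 for 16 cores

# Distinct group ids in order of first appearance
groups <- unique(as.character(data$group_id))

# Round-robin lookup: 1st group -> cluster 1, 2nd -> cluster 2, 3rd -> cluster 1, ...
cluster_of <- setNames((seq_along(groups) - 1) %% n_clusters + 1, groups)

# Every row inherits the cluster of its group
data$cluster <- unname(cluster_of[as.character(data$group_id)])
data
```

Round-robin balances the number of groups per cluster, not the number of rows. If groups differ a lot in size, sorting them by size before dealing them out may give a more even workload.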

Data imputation for empty subsetted dataframes in R

I'm trying to build a function in R that subsets my raw dataframe according to some specifications and then converts the subsetted dataframe into a proportion table.
Unfortunately, some of these subsets are empty, because I have no data for some particular specifications, so no proportion table can be calculated. What I would like to do is take the closest time step for which I have a non-empty subset and use it as input in place of the empty one.
Here some insights to my dataframe and function:
My raw dataframe looks +/- as follows:
| year | quarter | area | time_comb | no_individuals | lenCls | age |
|------|---------|------|-----------|----------------|--------|-----|
| 2005 | 1 | 24 | 2005.1.24 | 8 | 380 | 3 |
| 2005 | 2 | 24 | 2005.2.24 | 4 | 490 | 2 |
| 2005 | 1 | 24 | 2005.1.24 | 3 | 460 | 6 |
| 2005 | 1 | 21 | 2005.1.21 | 25 | 400 | 2 |
| 2005 | 2 | 24 | 2005.2.24 | 1 | 680 | 6 |
| 2005 | 2 | 21 | 2005.2.21 | 2 | 620 | 5 |
| 2005 | 3 | 21 | 2005.3.21 | NA | NA | NA |
| 2005 | 1 | 21 | 2005.1.21 | 1 | 510 | 5 |
| 2005 | 1 | 24 | 2005.1.24 | 1 | 670 | 4 |
| 2006 | 1 | 22 | 2006.1.22 | 2 | 750 | 4 |
| 2006 | 4 | 24 | 2006.4.24 | 1 | 660 | 8 |
| 2006 | 2 | 24 | 2006.2.24 | 8 | 540 | 3 |
| 2006 | 2 | 24 | 2006.2.24 | 4 | 560 | 3 |
| 2006 | 1 | 22 | 2006.1.22 | 2 | 250 | 2 |
| 2006 | 3 | 22 | 2006.3.22 | 1 | 520 | 2 |
| 2006 | 2 | 24 | 2006.2.24 | 1 | 500 | 2 |
| 2006 | 2 | 22 | 2006.2.22 | NA | NA | NA |
| 2006 | 2 | 21 | 2006.2.21 | 3 | 480 | 2 |
| 2006 | 1 | 24 | 2006.1.24 | 1 | 640 | 5 |
| 2007 | 4 | 21 | 2007.4.21 | 2 | 620 | 3 |
| 2007 | 2 | 21 | 2007.2.21 | 1 | 430 | 3 |
| 2007 | 4 | 22 | 2007.4.22 | 14 | 410 | 2 |
| 2007 | 1 | 24 | 2007.1.24 | NA | NA | NA |
| 2007 | 2 | 24 | 2007.2.24 | NA | NA | NA |
| 2007 | 3 | 24 | 2007.3.24 | NA | NA | NA |
| 2007 | 4 | 24 | 2007.4.24 | NA | NA | NA |
| 2007 | 3 | 21 | 2007.3.21 | 1 | 560 | 4 |
| 2007 | 1 | 21 | 2007.1.21 | 7 | 300 | 3 |
| 2007 | 3 | 23 | 2007.3.23 | 1 | 640 | 5 |
Here year, quarter and area refer to a particular time (Year & Quarter) and area for which X no. of individuals were measured (no_individuals). For example, the first row says that in the first quarter of the year 2005 in area 24 I had 8 individuals belonging to a length class (lenCls) of 380 mm and age=3. It is worth mentioning that for a particular year, quarter and area combination I can have different length classes and ages (thus, multiple rows)!
So what I want to do is basically to subset the raw dataframe for a particular year, quarter and area combination, and from that combination calculate a proportion table based on the number of individuals in each length class.
So far my basic function looks as follows:
LAK <- function(df, Year="2005", Quarter="1", Area="22", alkplot=TRUE){
  require(FSA)
  # subset alk by year, quarter and area
  sALK <- subset(df, year==Year & quarter==Quarter & area==Area)
  # repeat each row once per measured individual
  dfexp <- sALK[rep(seq_len(nrow(sALK)), sALK$no_individuals), 1:ncol(sALK)]
  raw <- t(table(dfexp$lenCls, dfexp$age))
  key <- round(prop.table(raw, margin=1), 3)
  # plot before returning: code placed after return() never runs
  if(alkplot==TRUE){
    alkPlot(key,"area",xlab="Age")
  }
  return(key)
}
From the dataset example above, one can see that for year=2005 & quarter=3 & area=21 I do not have any measured individuals. Yet, for the same area AND year I have data for quarters 1 and 2. The most reasonable assumption would be to take the subsetted dataframe from the closest time step (here quarter 2 with the same area and year), and replace the NA in the columns "no_individuals", "lenCls" and "age" accordingly.
Note also that in some cases I do not have data for a whole year! In the example above, one can see this by looking at area 24 in year 2007. In this case I cannot borrow the information from the nearest quarter, and would need to borrow from the previous year instead. This would mean that for year=2007 & area=24 & quarter=1 I would borrow the information from year=2006 & area=24 & quarter=1, and so on.
I have tried to include this in my function by specifying some extra rules, but due to my poor programming skills I didn't make any progress.
So, any help here will be very much appreciated.
Here my LAK function which I'm trying to update:
LAK <- function(df, Year="2005", Quarter="1", Area="22", alkplot=TRUE){
  require(FSA)
  # subset alk by year, quarter and area
  sALK <- subset(df, year==Year & quarter==Quarter & area==Area)
  # In case of empty dataset (no rows, or only rows without measured individuals)
  if(nrow(sALK)==0 || all(is.na(sALK$no_individuals))){
    warning("Empty subset combination; data will be subsetted based on the nearest timestep combination")
    # FIXME: INCLUDE IMPUTATION RULES HERE
  }
  dfexp <- sALK[rep(seq_len(nrow(sALK)), sALK$no_individuals), 1:ncol(sALK)]
  raw <- t(table(dfexp$lenCls, dfexp$age))
  key <- round(prop.table(raw, margin=1), 3)
  # plot before returning: code placed after return() never runs
  if(alkplot==TRUE){
    alkPlot(key,"area",xlab="Age")
  }
  return(key)
}
So, I finally came up with a partial solution to my problem and will include my function here in case it might be of someone's interest:
LAK <- function(df, Year="2005", Quarter="1", Area="22",alkplot=T){
require(FSA)
# subset alk by year, quarter, area and species
sALK <- subset(df, year==Year & quarter==Quarter & area==Area)
print(sALK)
if(nrow(sALK)==1){
warning("Empty subset combination; data has been subsetted to the nearest input combination")
syear <- unique(as.numeric(as.character(sALK$year)))
sarea <- unique(as.numeric(as.character(sALK$area)))
sALK2 <- subset(df, year==syear & area==sarea)
vals <- as.data.frame(table(sALK2$comb_index))
colnames(vals)[1] <- "comb_index"
idx <- which(vals$Freq>1)
quarterId <- as.numeric(as.character(vals[idx,"comb_index"]))
imput <- subset(df,year==syear & area==sarea & comb_index==quarterId)
dfexp2 <- imput[rep(seq(nrow(imput)), imput$no_at_length_age), 1:ncol(imput)]
raw2 <- t(table(dfexp2$lenCls, dfexp2$age))
key2 <- round(prop.table(raw2, margin=1), 3)
print(key2)
if(alkplot==TRUE){
alkPlot(key2,"area",xlab="Age")
}
} else {
dfexp <- sALK[rep(seq(nrow(sALK)), sALK$no_at_length_age), 1:ncol(sALK)]
raw <- t(table(dfexp$lenCls, dfexp$age))
key <- round(prop.table(raw, margin=1), 3)
print(key)
if(alkplot==TRUE){
alkPlot(key,"area",xlab="Age")
}
}
}
This solves my problem when I have data for at least one quarter of a particular Year & Area combination. Yet, I'm still struggling to figure out what to do when I have no data at all for a particular Year & Area combination. In this case I need to borrow data from the closest Year that contains data for all the quarters for the same area.
For the example shown above, this would mean that for year=2007 & area=24 & quarter=1 I would borrow the information from year=2006 & area=24 & quarter=1, and so on.
I don't know if you have ever encountered MICE, but it is a pretty cool and comprehensive tool for variable imputation. It also allows you to see how the imputed data is distributed so that you can choose the method most suited for your problem. Check this brief explanation and the original package description
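Separately from model-based imputation, the fallback rule described in the question can be sketched directly. The helper below is hypothetical (the name borrow_subset and the tie-breaking are assumptions, not the asker's code) and represents one possible reading of the rule: use the requested combination if it has data, otherwise the nearest quarter of the same year and area, otherwise the same quarter of the nearest year for that area. It assumes numeric year, quarter and area columns and the no_individuals column from the sample table.

```r
# Hypothetical helper: rows to use for a given year/quarter/area,
# borrowing from the nearest time step when the requested one has no data
borrow_subset <- function(df, Year, Quarter, Area) {
  # candidate rows: same area, with measured individuals
  ok <- df[df$area == Area & !is.na(df$no_individuals), ]

  # 1) the requested combination itself
  exact <- ok[ok$year == Year & ok$quarter == Quarter, ]
  if (nrow(exact) > 0) return(exact)

  # 2) nearest quarter within the same year (earlier quarter on ties)
  same_year <- ok[ok$year == Year, ]
  if (nrow(same_year) > 0) {
    d <- abs(same_year$quarter - Quarter)
    q <- min(same_year$quarter[d == min(d)])
    return(same_year[same_year$quarter == q, ])
  }

  # 3) same quarter in the nearest year (earlier year on ties)
  same_quarter <- ok[ok$quarter == Quarter, ]
  if (nrow(same_quarter) > 0) {
    d <- abs(same_quarter$year - Year)
    y <- min(same_quarter$year[d == min(d)])
    return(same_quarter[same_quarter$year == y, ])
  }

  NULL  # nothing to borrow from
}

# Small made-up data set in the shape of the question's table
demo <- data.frame(
  year           = c(2005, 2005, 2005, 2006, 2007),
  quarter        = c(1, 2, 3, 1, 1),
  area           = c(21, 21, 21, 24, 24),
  no_individuals = c(25, 2, NA, 1, NA),
  lenCls         = c(400, 620, NA, 640, NA),
  age            = c(2, 5, NA, 5, NA)
)

res1 <- borrow_subset(demo, 2005, 3, 21)  # borrows from 2005, quarter 2
res2 <- borrow_subset(demo, 2007, 1, 24)  # borrows from 2006, quarter 1
```

The returned rows could then replace sALK inside LAK before the proportion table is built.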

R data.table add new column with query for each row

I have 2 R data.tables in R like so:
first_table
id | first | trunc | val1
=========================
1 | Bob | Smith | 10
2 | Sue | Goldm | 20
3 | Sue | Wollw | 30
4 | Bob | Bellb | 40
second_table
id | first | last | val2
==============================
1 | Bob | Smith | A
2 | Bob | Smith | B
3 | Sue | Goldman | A
4 | Sue | Goldman | B
5 | Sue | Wollworth | A
6 | Sue | Wollworth | B
7 | Bob | Bellbottom | A
8 | Bob | Bellbottom | B
As you can see, the last names in the first table are truncated. Also, the combination of first and last name is unique in the first table, but not in the second. I want to "join" on the combination of first name and last name under the incredibly naive assumptions that (1) first,last uniquely defines a person, and (2) truncation of the last name does not introduce ambiguity.
The result should look like this:
id | first | trunc | last | val1
=======================================
1 | Bob | Smith | Smith | 10
2 | Sue | Goldm | Goldman | 20
3 | Sue | Wollw | Wollworth | 30
4 | Bob | Bellb | Bellbottom | 40
Basically, for each row in first_table, I need to find a row in second_table that back-fills the last name.
For Each Row in first_table:
Find the first row in second_table with:
matching first_name & trunc is a substring of last
And then join on that row
Is there an easy vectorized way to accomplish this with data.table?
One approach is to join on first, then filter based on the substring match:
first_table[
unique(second_table[, .(first, last)])
, on = "first"
, nomatch = 0
][
substr(last, 1, nchar(trunc)) == trunc
]
# id first trunc val1 last
# 1: 1 Bob Smith 10 Smith
# 2: 2 Sue Goldm 20 Goldman
# 3: 3 Sue Wollw 30 Wollworth
# 4: 4 Bob Bellb 40 Bellbottom
Or, do the truncation on the second_table to match the first (this hard-codes a truncation length of 5), then join on both columns:
first_table[
unique(second_table[, .(first, last, trunc = substr(last, 1, 5))])
, on = c("first", "trunc")
, nomatch = 0
]
## yields the same answer
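For anyone wanting to run this end to end, here is a self-contained sketch that builds the two tables from the question and applies the first approach. Two details are additions rather than part of the answer: startsWith() from base R replaces the substr() comparison, and allow.cartesian = TRUE is passed because on larger tables a join on first name alone can produce more rows than data.table permits by default.

```r
library(data.table)

first_table <- data.table(
  id    = 1:4,
  first = c("Bob", "Sue", "Sue", "Bob"),
  trunc = c("Smith", "Goldm", "Wollw", "Bellb"),
  val1  = c(10, 20, 30, 40)
)
second_table <- data.table(
  id    = 1:8,
  first = rep(c("Bob", "Sue", "Sue", "Bob"), each = 2),
  last  = rep(c("Smith", "Goldman", "Wollworth", "Bellbottom"), each = 2),
  val2  = rep(c("A", "B"), 4)
)

# Join every first_table row to all distinct (first, last) pairs sharing
# the first name, then keep the pairs whose last name starts with trunc
res <- first_table[
  unique(second_table[, .(first, last)]),
  on = "first",
  nomatch = 0,
  allow.cartesian = TRUE
][startsWith(last, trunc)]
res
```

Unlike the substr(last, 1, 5) variant, this does not depend on every name being truncated to the same length.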

Copy column data when function unaggregates a single row into multiple in R

I need help in taking an annual total (for each of many initiatives) and breaking that down to each month using a simple division formula. I need to do this for each distinct combination of a few columns while copying down the columns that are broken from annual to each monthly total. The loop will apply the formula to two columns and loop through each distinct group in a vector. I tried to explain in an example below as it's somewhat complex.
What I have :
| Init | Name | Date |Total Savings|Total Costs|
| A | John | 2015 | TotalD | TotalD |
| A | Mike | 2015 | TotalE | TotalE |
| A | Rob | 2015 | TotalF | TotalF |
| B | John | 2015 | TotalG | TotalG |
| B | Mike | 2015 | TotalH | TotalH |
......
| Init | Name | Date |Total Savings|Total Costs|
| A | John | 2016 | TotalI | TotalI |
| A | Mike | 2016 | TotalJ | TotalJ |
| A | Rob | 2016 | TotalK | TotalK |
| B | John | 2016 | TotalL | TotalL |
| B | Mike | 2016 | TotalM | TotalM |
I'm going to loop a function for the first row to take the "Total Savings" and "Total Costs" and divide by 12 where Date = 2015 and 9 where Date = 2016 (YTD to Sept) and create an individual row for each. I'm essentially breaking out an annual total in a row and creating a row for each month of the year. I need help in running that loop to copy also columns "Init", "Name", until "Init", "Name" combination are not distinct. Also, note the formula for the division based on the year will be different as well. I suppose I could separate the datasets for 2015 and 2016 and use two different functions and merge if that would be easier. Below should be the output:
| Init | Name | Date |Monthly Savings|Monthly Costs|
| A | John | 01-01-2015 | TotalD/12* | MonthD |
| A | John | 02-01-2015 | MonthD | MonthD |
| A | John | 03-01-2015 | MonthD | MonthD |
...
| A | Mike | 01-01-2016 | TotalE/9* | TotalE |
| A | Mike | 02-01-2016 | TotalE | TotalE |
| A | Mike | 03-01-2016 | TotalE | TotalE |
...
| B | John | 01-01-2015 | TotalG/12* | MonthD |
| B | John | 02-01-2015 | MonthG | MonthD |
| B | John | 03-01-2015 | MonthG | MonthD |
TotalD/12* = MonthD - this is the formula for 2015
TotalE/9* = MonthE - this is the formula for 2016
Any help would be appreciated...
As a start, here are some reproducible data, with the columns described:
myData <-
data.frame(
Init = rep(LETTERS[1:3], each = 4)
, Name = rep(c("John", "Mike"), each = 2)
, Date = 2015:2016
, Savings = (1:12)*1200
, Cost = (1:12)*2400
)
Next, set the divisor to be used for each year:
toDivide <-
c("2015" = 12, "2016" = 9)
Then, I am using the magrittr pipe as I split the data up into single rows, then looping through them with lapply to expand each row into the appropriate number of rows (9 or 12) with the savings and costs divided by the number of months. Finally, dplyr's bind_rows stitches the rows back together.
library(dplyr) # provides %>% and bind_rows()
myData %>%
split(1:nrow(.)) %>%
lapply(function(x){
temp <- data.frame(
Init = x$Init
, Name = x$Name
, Date = as.Date(paste(x$Date
, formatC(1:toDivide[as.character(x$Date)]
, width = 2, flag = "0")
, "01"
, sep = "-"))
, Savings = x$Savings / toDivide[as.character(x$Date)]
, Cost = x$Cost / toDivide[as.character(x$Date)]
)
}) %>%
bind_rows()
The head of this looks like:
Init Name Date Savings Cost
1 A John 2015-01-01 100.0000 200.0000
2 A John 2015-02-01 100.0000 200.0000
3 A John 2015-03-01 100.0000 200.0000
4 A John 2015-04-01 100.0000 200.0000
5 A John 2015-05-01 100.0000 200.0000
6 A John 2015-06-01 100.0000 200.0000
with similar entries for each expanded row.
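The same expansion can also be done without splitting the data into single rows. The following is a sketch in base R, not part of the answer above: it repeats each row index once per month with rep() and numbers the months within each row with sequence(). The names n_months, idx and monthly are made up for illustration.

```r
myData <- data.frame(
  Init = rep(LETTERS[1:3], each = 4),
  Name = rep(c("John", "Mike"), each = 2),
  Date = 2015:2016,
  Savings = (1:12) * 1200,
  Cost = (1:12) * 2400
)
toDivide <- c("2015" = 12, "2016" = 9)

# Number of months each annual row expands into (12 or 9)
n_months <- unname(toDivide[as.character(myData$Date)])

# Row index repeated once per month, e.g. 1 twelve times, 2 nine times, ...
idx <- rep(seq_len(nrow(myData)), times = n_months)

monthly <- myData[idx, c("Init", "Name")]
# sequence() restarts at 1 for every original row, giving the month number
monthly$Date <- as.Date(sprintf("%d-%02d-01", myData$Date[idx], sequence(n_months)))
monthly$Savings <- myData$Savings[idx] / n_months[idx]
monthly$Cost <- myData$Cost[idx] / n_months[idx]
rownames(monthly) <- NULL

head(monthly)
```

On large inputs this avoids building one small data frame per row, which is usually the slow part of the split and lapply version.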
