I am fairly new to R. I have a hypothetical dataset containing prescriptions for different patients and drug types. I would like to create episodes of drug use, i.e., to see for how long a patient used a drug. The loop from the post "sequentially update rows in data.table" works for me, but I am not sure how to make the loop start over when it encounters a new patient identifier or drug type.
These are some rows from the dataset "AllDrugs":
DrugType ID Duration StartPrescr EndPrescr n
1 1 90 5-3-2020 3-6-2020 1
1 2 30 7-1-2020 6-2-2020 1
1 2 30 14-1-2020 12-6-2020 2
1 2 30 21-01-2020 19-6-2020 3
Note: n is the sequence number of the prescription within each ID and DrugType.
This is the current loop:
for (i in 2:nrow(AllDrugs)) {
  if (AllDrugs[i, StartPrescr] >= AllDrugs[i-1, EndPrescr]) {
    AllDrugs[i, EndPrescr := StartPrescr + Duration]
  } else {
    AllDrugs[i, EndPrescr := AllDrugs[i-1, EndPrescr] + Duration]
  }
}
This is what I get:
DrugType ID Duration StartPrescr EndPrescr n
1 1 90 5-3-2020 3-6-2020 1
1 2 30 7-1-2020 3-7-2020 1
1 2 30 14-1-2020 2-8-2020 2
1 2 30 21-01-2020 1-9-2020 3
This is what I want:
DrugType ID Duration StartPrescr EndPrescr n
1 1 90 5-3-2020 3-6-2020 1
1 2 30 7-1-2020 6-2-2020 1
1 2 30 14-1-2020 7-3-2020 2
1 2 30 21-01-2020 6-4-2020 3
How can I shift the prescriptions based on the duration of the prescription by ID and DrugType? Note: this is an example of one drug type, but DrugType could also be 2, or 3 etc.
Does this work for you?
shift_end <- function(en, dur) {
  if (length(en) > 1) for (i in 2:length(en)) en[i] <- en[i-1] + dur[i]
  return(en)
}
df[order(ID, DrugType,StartPrescr), EndPrescr:=shift_end(EndPrescr,Duration), by=.(ID,DrugType)]
Result:
DrugType ID Duration StartPrescr EndPrescr n
1: 1 1 90 2020-03-05 2020-06-03 1
2: 1 2 30 2020-01-07 2020-02-06 1
3: 1 2 30 2020-01-14 2020-03-07 2
4: 1 2 30 2020-01-21 2020-04-06 3
Data Source:
df <- structure(list(
DrugType = c(1, 1, 1, 1),
ID = c(1, 2, 2, 2),
Duration = c(90, 30, 30, 30),
StartPrescr = structure(c(18326,18268, 18275, 18282), class = "Date"),
EndPrescr = structure(c(18416, 18298, 18425, 18432), class = "Date"),
n = c(1, 1, 2, 3)), row.names = c(NA,-4L),
class = c("data.table", "data.frame")
)
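The loop in shift_end adds each row's Duration to the previous row's end date, which is a running sum within each group, so it can also be written without an explicit loop. The following is a minimal sketch in base R, assuming (as the answer above does) that every later prescription in a group is appended to the end of the previous one. The names presc, grp, offset and first_end are illustrative and not from the original post.

```r
# Example data from the question (dates written out with as.Date for readability)
presc <- data.frame(
  DrugType    = c(1, 1, 1, 1),
  ID          = c(1, 2, 2, 2),
  Duration    = c(90, 30, 30, 30),
  StartPrescr = as.Date(c("2020-03-05", "2020-01-07", "2020-01-14", "2020-01-21")),
  EndPrescr   = as.Date(c("2020-06-03", "2020-02-06", "2020-06-12", "2020-06-19"))
)

# Sort so that prescriptions are in chronological order within each group
presc <- presc[order(presc$ID, presc$DrugType, presc$StartPrescr), ]
grp <- interaction(presc$ID, presc$DrugType, drop = TRUE)

# Days to add to the group's first end date: 0, then a running sum of later durations
offset <- ave(presc$Duration, grp, FUN = function(d) cumsum(c(0, d[-1])))

# End date of the first prescription in each group, as a number of days
first_end <- ave(as.numeric(presc$EndPrescr), grp, FUN = function(e) e[1])

presc$EndPrescr <- as.Date(first_end + offset, origin = "1970-01-01")
print(presc)
```

The same idea should carry over to data.table by computing EndPrescr[1] + cumsum(c(0, Duration[-1])) inside by = .(ID, DrugType). Note that this sketch, like shift_end, does not restart an episode when a prescription starts after the previous one has ended.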
Let's say we have a team variable, but we also have a time period 1 and a time period 2 variable, and a numeric grade 1-10. I want to mutate and add a variable that calculates the difference from time period 1 to time period 2.
How do I do this?
Visually the table looks like this: [image of the table: one row per team and time period, with its grade]
There is a neat function in the data.table package called dcast( ) that allows you to transform your data from long to wide. In this case, you can use the Period variable to create 2 new columns, Period 1 and Period 2, where the values are the Grades.
library(data.table)
> data <- data.table(
+ Team = c("Team 1","Team 1","Team 2","Team 2","Team 3","Team 3"),
+ Period = c("Period 1","Period 2","Period 1","Period 2","Period 1","Period 2"),
+ Grade = c(75,87,42,35,10,95))
> data
Team Period Grade
1: Team 1 Period 1 75
2: Team 1 Period 2 87
3: Team 2 Period 1 42
4: Team 2 Period 2 35
5: Team 3 Period 1 10
6: Team 3 Period 2 95
> data2 <- dcast(
+ data = data,
+ Team ~ Period,
+ value.var = "Grade")
> data2
Team Period 1 Period 2
1: Team 1 75 87
2: Team 2 42 35
3: Team 3 10 95
> data2 <- data2[,Difference := `Period 2` - `Period 1`]
> data2
Team Period 1 Period 2 Difference
1: Team 1 75 87 12
2: Team 2 42 35 -7
3: Team 3 10 95 85
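If you would rather avoid a reshaping function, the same long-to-wide step can be sketched in base R by splitting the rows by period and merging them back on Team. This is only an illustration on the example data above; the names scores, p1, p2 and wide are made up for this sketch.

```r
# Same example data as above, as a plain data.frame
scores <- data.frame(
  Team   = c("Team 1", "Team 1", "Team 2", "Team 2", "Team 3", "Team 3"),
  Period = c("Period 1", "Period 2", "Period 1", "Period 2", "Period 1", "Period 2"),
  Grade  = c(75, 87, 42, 35, 10, 95)
)

# One data.frame per period, keeping only the key and the value
p1 <- scores[scores$Period == "Period 1", c("Team", "Grade")]
p2 <- scores[scores$Period == "Period 2", c("Team", "Grade")]

# Join on Team; the suffixes distinguish the two Grade columns
wide <- merge(p1, p2, by = "Team", suffixes = c("_P1", "_P2"))
wide$Difference <- wide$Grade_P2 - wide$Grade_P1
print(wide)
```

This assumes each team has exactly one row per period; a team missing one of the two periods would be dropped by merge unless all = TRUE is used.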
In tidyverse syntax, we would use pivot_wider and mutate:
library(tidyverse)
df %>%
  pivot_wider(names_from = `Time Period`, values_from = Grade) %>%
  mutate(difference = P2 - P1)
#> # A tibble: 3 x 4
#> Team P1 P2 difference
#> <chr> <dbl> <dbl> <dbl>
#> 1 Team 1 75 87 12
#> 2 Team 2 42 35 -7
#> 3 Team 3 10 95 85
Created on 2022-08-29 with reprex v2.0.2
Data used
df <- data.frame(Team = paste("Team", rep(1:3, each = 2)),
`Time Period` = rep(c("P1", "P2"), 3),
Grade = c(75, 87, 42, 35, 10, 95),
check.names = FALSE)
df
#> Team Time Period Grade
#> 1 Team 1 P1 75
#> 2 Team 1 P2 87
#> 3 Team 2 P1 42
#> 4 Team 2 P2 35
#> 5 Team 3 P1 10
#> 6 Team 3 P2 95
I have data like this:
structure(list(id = c(1, 1, 1, 2, 2, 2), time = c(1, 2, 2, 5,
6, 6)), class = "data.frame", row.names = c(NA, -6L))
If, for the same id, the value in a row is equal to the value in the previous row, I want to increase the value of the duplicate by 1. I want to get this:
structure(list(id2 = c(1, 1, 1, 2, 2, 2), time2 = c(1, 2, 3,
5, 6, 7)), class = "data.frame", row.names = c(NA, -6L))
Using base R:
ave(df$time, df$time, FUN = function(z) z+cumsum(duplicated(z)))
# [1] 1 2 3 5 6 7
(This can be reassigned back into time.)
This also handles a value that is repeated more than twice. For example, if we append another copy of the 6th row,
df <- rbind(df, df[6,])
df$time2 <- ave(df$time, df$time, FUN = function(z) z+cumsum(duplicated(z)))
df
# id time time2
# 1 1 1 1
# 2 1 2 2
# 3 1 2 3
# 4 2 5 5
# 5 2 6 6
# 6 2 6 7
# 61 2 6 8
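The ave call above groups by time only, which is enough for the example because no time value is shared between ids. If the same time could occur under different ids, it is probably safer to group by both columns. A small sketch, using a made-up data frame called visits in which id 3 reuses time 6:

```r
# Hypothetical data: id 3 has a time value (6) that also occurs under id 2
visits <- data.frame(id   = c(1, 1, 1, 2, 2, 2, 3),
                     time = c(1, 2, 2, 5, 6, 6, 6))

bump <- function(z) z + cumsum(duplicated(z))

# Grouping by time only treats the row for id 3 as a third duplicate of 6
by_time <- ave(visits$time, visits$time, FUN = bump)

# Grouping by id and time only bumps repeats within the same id
by_id_time <- ave(visits$time, visits$id, visits$time, FUN = bump)

print(cbind(visits, by_time, by_id_time))
```

Here by_time ends with 8 for id 3, while by_id_time leaves that row at 6, which matches the "for the same ID" wording of the question.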
You could use accumulate from purrr (loaded with the tidyverse):
library(tidyverse)
df %>%
  group_by(id) %>%
  mutate(time2 = accumulate(time, ~ if (.x >= .y) .x + 1 else .y))
# A tibble: 6 x 3
# Groups: id [2]
id time time2
<dbl> <dbl> <dbl>
1 1 1 1
2 1 2 2
3 1 2 3
4 2 5 5
5 2 6 6
6 2 6 7
This works even if a value is repeated more than twice within a group.
If the first data.frame is named df, this gives you what you need:
df$time[duplicated(df$id) & duplicated(df$time)] <- df$time[duplicated(df$id) & duplicated(df$time)] + 1
df
id time
1 1 1
2 1 2
3 1 3
4 2 5
5 2 6
6 2 7
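It finds the rows where the id and the time value have each already appeared in an earlier row (not necessarily the row directly before), and adds 1 to time in those rows. Because each flagged row is only increased by 1, a value repeated three or more times will still leave duplicates.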
You can use dplyr's mutate with cumsum and duplicated:
library(dplyr)
df %>%
  group_by(id) %>%
  mutate(time = time + cumsum(duplicated(time))) %>%
  ungroup()
# A tibble: 6 x 2
id time
<dbl> <dbl>
1 1 1
2 1 2
3 1 3
4 2 5
5 2 6
6 2 7
I have 2 dataframes in R. The first one is a list of patients.
Patient 1
Patient 2
Patient 3
The second one is a list of procedures, and their costs, per patient.
Procedure 1 - Patient 1 - Cost
Procedure 2 - Patient 1 - Cost
Procedure 3 - Patient 1 - Cost
Procedure 1 - Patient 2 - Cost
Procedure 1 - Patient 3 - Cost
Etc.
I want to add up the costs per patient and store the total in a new column in the first data frame (i.e., total expenditure per patient).
How can I do this?
Seems like you just need to aggregate and merge your data.
Here’s some example data
patient_df <- structure(list(patient_id = 1:3, gender = structure(c(2L, 1L,
2L), .Label = c("F", "M"), class = "factor")), class = "data.frame", row.names = c(NA,
-3L))
print(patient_df)
## patient_id gender
## 1 1 M
## 2 2 F
## 3 3 M
procedure_df <- structure(list(procedure_id = c(1, 2, 3, 1, 2, 1), patient_id = c(1,
1, 1, 2, 2, 3), cost = c(10, 5, 12, 10, 5, 10)), class = "data.frame", row.names = c(NA,
-6L))
print(procedure_df)
## procedure_id patient_id cost
## 1 1 1 10
## 2 2 1 5
## 3 3 1 12
## 4 1 2 10
## 5 2 2 5
## 6 1 3 10
Let’s aggregate the procedure data
library(dplyr)
total_costs <- procedure_df %>%
group_by(patient_id) %>%
summarize(total_cost = sum(cost)) %>%
ungroup()
print(total_costs)
## # A tibble: 3 x 2
## patient_id total_cost
## <dbl> <dbl>
## 1 1 27
## 2 2 15
## 3 3 10
And then merge it to patient data
patient_costs <- left_join(patient_df, total_costs, by = "patient_id")
print(patient_costs)
## patient_id gender total_cost
## 1 1 M 27
## 2 2 F 15
## 3 3 M 10
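The same aggregate-then-merge idea can be sketched without dplyr, using base R's aggregate and merge. The data are the example data from above, rebuilt with data.frame so the snippet runs on its own.

```r
# Example data from the answer above
patient_df <- data.frame(patient_id = 1:3,
                         gender     = c("M", "F", "M"))
procedure_df <- data.frame(procedure_id = c(1, 2, 3, 1, 2, 1),
                           patient_id   = c(1, 1, 1, 2, 2, 3),
                           cost         = c(10, 5, 12, 10, 5, 10))

# Sum the cost per patient, then give the summed column a clearer name
total_costs <- aggregate(cost ~ patient_id, data = procedure_df, FUN = sum)
names(total_costs)[names(total_costs) == "cost"] <- "total_cost"

# all.x = TRUE keeps every patient, like left_join
patient_costs <- merge(patient_df, total_costs, by = "patient_id", all.x = TRUE)
print(patient_costs)
```

With all.x = TRUE, a patient without any procedures would be kept with total_cost set to NA rather than being dropped.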
I have a dataframe that looks something like this:
class <- c(3,0,3,0,0)
value <- c(50,50,70,30,100)
days <- c(3,3,2,2,1)
mydata <- data.frame(class, value, days)
What I need is for each day to have both classes represented - so if there is no class 3 on a given day (in this example, day 1) I'd like to add a row where class = 3 and value = 0 and day = 1. My real data is more complicated, because there are varying numbers of rows for each day (and many more days than 3), and many other columns (but for which it would be fine to enter NA). This doesn't seem like too complicated a problem, but I'm having trouble wrapping my head around the code. Thanks so much!
Using tidyverse you can use complete:
library(tidyverse)
mydata %>%
complete(days, class, fill = list(value = 0))
Output
# A tibble: 6 x 3
days class value
<dbl> <dbl> <dbl>
1 1 0 100
2 1 3 0
3 2 0 30
4 2 3 70
5 3 0 50
6 3 3 50
Data
mydata <- structure(list(class = c(3, 0, 3, 0, 0), value = c(50, 50, 70,
30, 100), days = c(3, 3, 2, 2, 1)), class = "data.frame", row.names = c(NA,
-5L))
With base R, we can do
out <- merge(expand.grid(lapply(mydata[c('class', 'days')],
unique)), mydata, all.x = TRUE)
out$value[is.na(out$value)] <- 0
out
# class days value
#1 0 1 100
#2 0 2 30
#3 0 3 50
#4 3 1 0
#5 3 2 70
#6 3 3 50
NOTE: No packages used
Or with data.table
library(data.table)
setDT(mydata)[CJ(class, days, unique = TRUE),
on = .(class, days)][is.na(value), value := 0][]
# class value days
#1: 0 100 1
#2: 0 30 2
#3: 0 50 3
#4: 3 0 1
#5: 3 70 2
#6: 3 50 3
Or using crossing/left_join from tidyverse
library(dplyr)
library(tidyr)
tidyr::crossing(class = unique(mydata$class),
days = unique(mydata$days)) %>%
left_join(mydata) %>%
mutate(value = replace_na(value, 0))
# A tibble: 6 x 3
# class days value
# <dbl> <dbl> <dbl>
#1 0 1 100
#2 0 2 30
#3 0 3 50
#4 3 1 0
#5 3 2 70
#6 3 3 50
I would like to count the number of observations within each group using conditions in R.
For example, I would like to count how many observations ID "A" has in every 10-day window.
ID (A,A,A,A,A,A,A,A)
Day (7,14,17,25,35,37,42,57)
X (9,20,14,24,23,30,20,40)
Desired output:
(In the first 10 days, we have one observation for ID "A". Days:7
In the next 10 days, we have two observations for ID "A". Days:14,17)
ID (A,A,A,A,A,A,A,A)
Day_10 (1,2,3,4,5,6)
Count_10 (1,2,1,2,1,1)
It would also be great if I could calculate the number of observations before and after certain values. For a given X value, I would like to know how many observations fall within [X-10, X+10] for ID "A".
The desired output would be as follows:
ID (A,A,A,A,A,A,A,A)
X (9,20,14,24,23,30,40,50)
Count_X10 (3,3,3,3,3,3,2,1)
Count_X10: for a given X(=9) there are three observations within ID "A" [-1,19]
Here are the data loaded as a data.frame to keep the observations connected. Note that I added a second group to show how to handle that:
df <-
data.frame(
ID = rep(c("A","B"), each = 8)
, Day = c(7,14,17,25,35,37,42,57)
, X = c(9,20,14,24,23,30,20,40)
)
Then, I used dplyr to pass the data through a series of steps. First, I split by the ID column, then used lapply to run a function on each of those ID groups, including calculating two columns of interest (then returning the whole data.frame). Finally, I stitch the rows back together with bind_rows
library(dplyr)
df %>%
  split(.$ID) %>%
  lapply(function(x) {
    x$nextTen <- sapply(x$Day, function(thisDay) {
      sum(between(x$Day, thisDay, thisDay + 10))
    })
    x$plusMinusTen <- sapply(x$Day, function(thisDay) {
      sum(between(x$Day, thisDay - 10, thisDay + 10))
    })
    return(x)
  }) %>%
  bind_rows()
The result is
ID Day X nextTen plusMinusTen
1 A 7 9 3 3
2 A 14 20 2 3
3 A 17 14 2 4
4 A 25 24 2 3
5 A 35 23 3 4
6 A 37 30 2 3
7 A 42 20 1 3
8 A 57 40 1 1
9 B 7 9 3 3
10 B 14 20 2 3
11 B 17 14 2 4
12 B 25 24 2 3
13 B 35 23 3 4
14 B 37 30 2 3
15 B 42 20 1 3
16 B 57 40 1 1
Any other condition you are interested in could be added to that lapply step.
Your sample data :
df = data.frame(
ID = rep('A', 8),
Day = c(7, 14, 17, 25, 35, 37, 42, 57),
X = c(9, 20, 14, 24, 23, 30, 40, 50),
stringsAsFactors = FALSE)
Note: You give two different values for vector X. I suppose it is c(9, 20, 14, 24, 23, 30, 40, 50), and not c(9, 20, 14, 24, 23, 30, 20, 40).
First calculation:
library(dplyr)
output1 = df %>%
mutate(Day_10 = ceiling(Day/10)) %>%
group_by(ID, Day_10) %>%
summarise(Count_10 = n())
The mutate step creates the 10-day ranges by rounding Day/10 up to the next integer with ceiling, so days 1-10 fall in range 1, days 11-20 in range 2, and so on. Then we group by ID and Day_10 and count the number of observations within each group.
> output1
ID Day_10 Count_10
<chr> <dbl> <int>
1 A 1 1
2 A 2 2
3 A 3 1
4 A 4 2
5 A 5 1
6 A 6 1
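For comparison, the same binning can be sketched in base R with aggregate. This is only an illustration on the sample data; the data frame name obs is made up here.

```r
# Sample data for ID "A"
obs <- data.frame(ID  = rep("A", 8),
                  Day = c(7, 14, 17, 25, 35, 37, 42, 57))

# Days 1-10 fall in window 1, days 11-20 in window 2, and so on
obs$Day_10 <- ceiling(obs$Day / 10)

# Count the rows in each ID / window combination
output1 <- aggregate(Day ~ ID + Day_10, data = obs, FUN = length)
names(output1)[names(output1) == "Day"] <- "Count_10"
print(output1)
```

As with the dplyr version, a window that contains no observations does not appear in the result at all, rather than showing up with a count of 0.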
Second calculation:
output2 = df %>%
group_by(ID) %>%
mutate(Count_X10 = sapply(X, function(x){sum(Day >= x-10 & Day <= x+10)})) %>%
select(-Day)
We group by ID, and for each X we count the number of days with this ID that are between X-10 and X+10.
> output2
ID X Count_X10
<chr> <dbl> <int>
1 A 9 3
2 A 20 3
3 A 14 3
4 A 24 3
5 A 23 3
6 A 30 3
7 A 40 3
8 A 50 2
Note: I suppose there's a mistake in the desired output you give, because for instance, when X = 50, there are 2 observations within [40, 60] with ID "A": days 42 and 57.
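The second calculation compares every X with every Day, so for a single ID it can also be sketched as a matrix of pairwise differences built with outer. This is a base R illustration on the sample vectors, using the X values assumed in the note above; the names within_10 and count_x10 are made up here.

```r
# Sample vectors for ID "A"
day <- c(7, 14, 17, 25, 35, 37, 42, 57)
x   <- c(9, 20, 14, 24, 23, 30, 40, 50)

# Entry [i, j] is TRUE when day[j] lies within [x[i] - 10, x[i] + 10]
within_10 <- abs(outer(x, day, "-")) <= 10

# Row i counts the days that fall inside the window around x[i]
count_x10 <- rowSums(within_10)
print(count_x10)
```

With several IDs, the same computation would need to be applied within each ID, for example inside the group_by and mutate step shown above.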