I have 2 dataframes in R. The first one is a list of patients.
Patient 1
Patient 2
Patient 3
The second one is a list of procedures, and their costs, per patient.
Procedure 1 - Patient 1 - Cost
Procedure 2 - Patient 1 - Cost
Procedure 3 - Patient 1 - Cost
Procedure 1 - Patient 2 - Cost
Procedure 1 - Patient 3 - Cost
Etc.
I want to add up the costs per patient and put them in a new column in the first data frame (i.e., total expenditure per patient).
How can I do this?
Seems like you just need to aggregate and merge your data.
Here’s some example data
patient_df <- structure(list(patient_id = 1:3, gender = structure(c(2L, 1L,
2L), .Label = c("F", "M"), class = "factor")), class = "data.frame", row.names = c(NA,
-3L))
print(patient_df)
## patient_id gender
## 1 1 M
## 2 2 F
## 3 3 M
procedure_df <- structure(list(procedure_id = c(1, 2, 3, 1, 2, 1), patient_id = c(1,
1, 1, 2, 2, 3), cost = c(10, 5, 12, 10, 5, 10)), class = "data.frame", row.names = c(NA,
-6L))
print(procedure_df)
## procedure_id patient_id cost
## 1 1 1 10
## 2 2 1 5
## 3 3 1 12
## 4 1 2 10
## 5 2 2 5
## 6 1 3 10
Let’s aggregate the procedure data
library(dplyr)
total_costs <- procedure_df %>%
group_by(patient_id) %>%
summarize(total_cost = sum(cost)) %>%
ungroup()
print(total_costs)
## # A tibble: 3 x 2
## patient_id total_cost
## <dbl> <dbl>
## 1 1 27
## 2 2 15
## 3 3 10
And then merge it to patient data
patient_costs <- left_join(patient_df, total_costs, by = "patient_id")
print(patient_costs)
## patient_id gender total_cost
## 1 1 M 27
## 2 2 F 15
## 3 3 M 10
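For readers who would rather not load dplyr, the same aggregate-then-merge idea can be sketched in base R. This is only an illustration, not the answerer's code: patient 4 is a hypothetical patient with no procedures, added to show what happens to unmatched rows, and replacing the resulting NA with 0 is an assumption about what "no procedures" should mean.

```r
# Example data mirroring the answer, plus a hypothetical patient 4 with no procedures
patient_df <- data.frame(patient_id = 1:4, gender = c("M", "F", "M", "F"))
procedure_df <- data.frame(
  procedure_id = c(1, 2, 3, 1, 2, 1),
  patient_id   = c(1, 1, 1, 2, 2, 3),
  cost         = c(10, 5, 12, 10, 5, 10)
)

# Sum cost within each patient_id
total_costs <- aggregate(cost ~ patient_id, data = procedure_df, FUN = sum)
names(total_costs)[names(total_costs) == "cost"] <- "total_cost"

# all.x = TRUE keeps every patient, similar to left_join()
patient_costs <- merge(patient_df, total_costs, by = "patient_id", all.x = TRUE)

# Patients without procedures come back as NA; treating that as 0 is an assumption
patient_costs$total_cost[is.na(patient_costs$total_cost)] <- 0
print(patient_costs)
```

If a missing total should stay missing in your data, drop the last assignment and keep the NA.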
Related
My dataframe contains data about political careers, such as a unique identifier column (called ui) for each politician and the electoral term (called electoral_term) in which they were elected. Since a politician can be elected in multiple electoral terms, multiple rows can contain the same ui.
Now I would like to add another column to my dataframe that counts how many times the politician was re-elected.
So, e.g., the politician with ui=1 was re-elected 2 times, since he occurred in 3 electoral terms.
I already tried
df %>% count(ui)
But that only returns a summary table, which can't be added to my dataframe.
Thanks in advance!
We may use base R
df$reelected <- with(df, ave(ui, ui, FUN = length)-1)
-output
> df
ui electoral reelected
1 1 1 2
2 1 2 2
3 1 3 2
4 2 2 0
5 3 7 1
6 3 9 1
data
df <- structure(list(ui = c(1, 1, 1, 2, 3, 3), electoral = c(1, 2,
3, 2, 7, 9)), class = "data.frame", row.names = c(NA, -6L))
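To see what the ave() call is doing, it may help to spell out the same calculation in two steps: count the rows per ui, then look up each row's count. The sketch below is an illustration only; the name terms_per_ui is made up for it, and it assumes ui converts cleanly to the character labels that table() produces.

```r
df <- data.frame(ui = c(1, 1, 1, 2, 3, 3), electoral = c(1, 2, 3, 2, 7, 9))

# Count how many rows (electoral terms) each politician has
terms_per_ui <- table(df$ui)

# Look up each row's count by name, then subtract the first election
df$reelected <- as.vector(terms_per_ui[as.character(df$ui)]) - 1
print(df)
```

ave() does the counting and the lookup in one call, which is why it returns a vector as long as the data frame.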
mydf <- tibble::tribble(~ui, ~electoral, 1, 1, 1, 2, 1, 3, 2, 2, 3, 7, 3, 9)
library(dplyr)
mydf |>
add_count(ui, name = "re_elected") |>
mutate(re_elected = re_elected - 1)
# A tibble: 6 × 3
ui electoral re_elected
<dbl> <dbl> <dbl>
1 1 1 2
2 1 2 2
3 1 3 2
4 2 2 0
5 3 7 1
6 3 9 1
library(tidyverse)
df %>%
group_by(ui) %>%
mutate(re_elected = n() - 1)
# A tibble: 6 × 3
# Groups: ui [3]
ui electoral re_elected
<dbl> <dbl> <dbl>
1 1 1 2
2 1 2 2
3 1 3 2
4 2 2 0
5 3 7 1
6 3 9 1
I have a dataset that looks like this:
(Visualising the datasets below may help you to understand the question)
original <- data.frame(
ID = c(rep("John", 3), "Steve"),
A = c(rep(3, 3), 1),
B = c(rep(4, 3), 2),
b = c(2, 3, 2, 2),
detail = c(rep("GOOOOD", 4))
)
The values in variables A, B, and b are all integers. Variable b is incomplete in this dataset: it should take every value from 1 to the value of B.
I need to complete this dataset and add a new variable a. The completed dataset will look like this:
completed1 <- data.frame(
ID = c(rep("John", 12), rep("Steve", 2)),
A = c(rep(3, 12), rep(1, 2)),
a = c(rep(1, 4), rep(2, 4), rep(3, 4), rep(1, 2)),
B = c(rep(4, 12), rep(2, 2)),
b = c(rep(1:4, 3), 1, 2),
detail = c(NA, "GOOOOD", "GOOOOD", NA, NA, "GOOOOD", rep(NA, 7), "GOOOOD")
)
Values in variable a are integers too and a has values from 1 to the value of A. Values in b are nested in each value of a, and values in a are nested in each factor of ID.
I think the most relevant functions for completing a dataset in this way are tidyr::complete() and tidyr::expand(), but they can only complete combinations of values in existing variables; they cannot add a new column (variable).
I know the challenge is that, because of the nested relationship, there are multiple places where the values in detail could go relative to the newly added a. For example, the completed dataset could also be this:
completed2 <- data.frame(
ID = c(rep("John", 12), rep("Steve", 2)),
A = c(rep(3, 12), rep(1, 2)),
a = c(rep(1, 4), rep(2, 4), rep(3, 4), rep(1, 2)),
B = c(rep(4, 12), rep(2, 2)),
b = c(rep(1:4, 3), 1, 2),
detail = c(NA, "GOOOOD", rep(NA, 4), "GOOOOD", NA, NA, "GOOOOD", rep(NA, 3), "GOOOOD")
)
Where the values in detail got located in the completed dataset does not matter to me. My actual dataset has more than 40,000 rows, so I really need something to automate it.
Is it possible to do this?
Thanks very much!!!
It's pretty messy using a for loop, and the positions of GOOOOD will be random.
library(dplyr)
library(tidyr)
library(tibble) # for rownames_to_column()
comp_dummy <- original %>%
group_by(ID) %>%
expand(A = A, a = 1:max(A), B = B, b = 1:max(B))
original <- original %>%
group_by(ID, A, B, b) %>%
summarise(n = n())
vec <- rep(NA_character_, nrow(comp_dummy))
for (i in 1:nrow(original)){
x <- original[i,]
y <- comp_dummy %>%
rownames_to_column(., "row") %>%
filter(ID == x$ID, A == x$A, B == x$B, b == x$b) %>%
pull(row)
z <- sample(y, x$n, replace = FALSE) %>% as.numeric()
print(z)
vec[{z}] <- "GOOOOD"
}
comp_dummy$detail <- vec
comp_dummy
ID A a B b detail
<chr> <dbl> <int> <dbl> <int> <chr>
1 John 3 1 4 1 NA
2 John 3 1 4 2 GOOOOD
3 John 3 1 4 3 NA
4 John 3 1 4 4 NA
5 John 3 2 4 1 NA
6 John 3 2 4 2 NA
7 John 3 2 4 3 NA
8 John 3 2 4 4 NA
9 John 3 3 4 1 NA
10 John 3 3 4 2 GOOOOD
11 John 3 3 4 3 GOOOOD
12 John 3 3 4 4 NA
13 Steve 1 1 2 1 NA
14 Steve 1 1 2 2 GOOOOD
I wonder whether running complete() twice, first for a and then for b, could be a solution. You can adjust the nesting, or the group_by, if needed.
Depending on whether the maximum of a should be taken from A within each ID group or not, you may need to adjust or remove the group_by (and similarly for b within each a group).
library(dplyr)
library(tidyr)
original %>%
dplyr::mutate(a = 1) %>%
dplyr::group_by( ID ) %>%
tidyr::complete( a = 1:max(A), nesting(ID, A, B, b), fill = list( detail = NA_character_)) %>%
group_by( a ) %>%
tidyr::complete( b = 1:max(B), nesting(ID, A, B, a), fill = list( detail = NA_character_)) %>%
dplyr::ungroup()
A base R solution
do.call(
rbind,
by(original,list(original$ID),function(x){
tmp=merge(
unique(x),
setNames(
expand.grid(
unique(x$ID),
x$A[1],
1:max(x$A),
x$B[1],
1:max(x$B)
),
c("ID","A","a","B","b")
),
by=c("ID","A","B","b"),
all=TRUE
)
tmp[order(tmp$a,tmp$b),c("ID","A","a","B","b","detail")]
})
)
resulting in
ID A a B b detail
John.1 John 3 1 4 1 <NA>
John.5 John 3 1 4 2 GOOOOD
John.8 John 3 1 4 3 GOOOOD
John.11 John 3 1 4 4 <NA>
John.2 John 3 2 4 1 <NA>
John.4 John 3 2 4 2 GOOOOD
John.9 John 3 2 4 3 GOOOOD
John.12 John 3 2 4 4 <NA>
John.3 John 3 3 4 1 <NA>
John.6 John 3 3 4 2 GOOOOD
John.7 John 3 3 4 3 GOOOOD
John.10 John 3 3 4 4 <NA>
Steve.1 Steve 1 1 2 1 <NA>
Steve.2 Steve 1 1 2 2 GOOOOD
I have data like this
structure(list(id = c(1, 1, 1, 2, 2, 2), time = c(1, 2, 2, 5,
6, 6)), class = "data.frame", row.names = c(NA, -6L))
For the same id, if the value in a row is equal to the value in the previous row, I want to increase the value of the duplicate by 1. I want to get this
structure(list(id2 = c(1, 1, 1, 2, 2, 2), time2 = c(1, 2, 3,
5, 6, 7)), class = "data.frame", row.names = c(NA, -6L))
Using base R:
ave(df$time, df$time, FUN = function(z) z+cumsum(duplicated(z)))
# [1] 1 2 3 5 6 7
(This can be reassigned back into time.)
This deals with 2 or more duplicates, meaning if we instead have another 6th row,
df <- rbind(df, df[6,])
df$time2 <- ave(df$time, df$time, FUN = function(z) z+cumsum(duplicated(z)))
df
# id time time2
# 1 1 1 1
# 2 1 2 2
# 3 1 2 3
# 4 2 5 5
# 5 2 6 6
# 6 2 6 7
# 61 2 6 8
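One thing to keep in mind is that the call above groups by time only, so two different ids that share a time value would also be treated as duplicates. A possible variant, sketched here under the assumption that repeats should only count within the same id, passes both id and time to ave() as grouping variables. The id 3 row and the helper name bump are made up for the illustration.

```r
df <- data.frame(id = c(1, 1, 1, 2, 2, 2), time = c(1, 2, 2, 5, 6, 6))
# Hypothetical extra row: id 3 also has time 2
df <- rbind(df, data.frame(id = 3, time = 2))

# Add 0, 1, 2, ... to successive repeats within a group
bump <- function(z) z + cumsum(duplicated(z))

# Grouping by time alone treats id 3's time 2 as a repeat of id 1's
by_time <- ave(df$time, df$time, FUN = bump)

# Grouping by id and time only bumps repeats within the same id
by_id_time <- ave(df$time, df$id, df$time, FUN = bump)

print(data.frame(df, by_time, by_id_time))
```

On the original six rows both versions give the same result; they only differ once a time value is shared across ids.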
You could use accumulate
library(tidyverse)
df %>%
group_by(id) %>%
mutate(time2 = accumulate(time, ~if(.x>=.y) .x + 1 else .y))
# A tibble: 6 x 3
# Groups: id [2]
id time time2
<dbl> <dbl> <dbl>
1 1 1 1
2 1 2 2
3 1 2 3
4 2 5 5
5 2 6 6
6 2 6 7
This works even if a value is repeated more than twice within a group.
If the first data.frame is named df, this gives you what you need:
df$time[duplicated(df$id) & duplicated(df$time)] <- df$time[duplicated(df$id) & duplicated(df$time)] + 1
df
id time
1 1 1
2 1 2
3 1 3
4 2 5
5 2 6
6 2 7
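It finds the rows where both the id and the time have already appeared in an earlier row, and adds 1 to time in those rows. Because it adds 1 only once, a value repeated three or more times would still leave duplicates.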
You can use dplyr's mutate with cumsum() and duplicated()
library(dplyr)
df %>% group_by(id) %>%
mutate(time = time + cumsum(duplicated(time))) %>%
ungroup()
# A tibble: 6 x 2
id time
<dbl> <dbl>
1 1 1
2 1 2
3 1 3
4 2 5
5 2 6
6 2 7
I have a dataframe, df, like the following:
id t Happiness Wealth
1 1 5 100
2 1 5 100
3 1 5 100
1 2 3 70
2 2 9 170
3 2 2 60
Is there a way to subset the data so that I can create a new variable that represents the Wealth of an individual (id) in the previous time period? For example, the value for person 2 at time 2 would be 100.
You can also achieve the same by simply merging:
df <- data.frame(id = c(1, 1, 2, 2),
t = c(1, 2, 1, 2),
wealth = c(10, 15, 12, 17))
previous <- df
previous[,"t"] <- previous[,"t"] +1
df <- merge(df, previous, by = c("id", "t"), all = TRUE, suffixes = c("_current", "_previous"))
na.omit(df)
id t wealth_current wealth_previous
2 1 2 15 10
5 2 2 17 12
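Another way to get the previous period's value, sketched here in base R on the data from the question, is to sort by id and t and shift Wealth down by one row within each id. The column name Wealth_prev is made up for the illustration. The sketch also assumes each id has consecutive time periods with no gaps; the merge approach above matches on t - 1 exactly, so it handles gaps differently.

```r
df <- data.frame(
  id        = c(1, 2, 3, 1, 2, 3),
  t         = c(1, 1, 1, 2, 2, 2),
  Happiness = c(5, 5, 5, 3, 9, 2),
  Wealth    = c(100, 100, 100, 70, 170, 60)
)

# Sort by id and time so that "previous row" means "previous period"
df <- df[order(df$id, df$t), ]

# Shift Wealth down by one position within each id; the first period gets NA
df$Wealth_prev <- ave(df$Wealth, df$id, FUN = function(w) c(NA, head(w, -1)))
print(df)
```

Unlike the na.omit() step above, this keeps the first period of each id, with NA as its previous value.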
With the following data frame:
d <- structure(list(n = c(2, 3, 5), s = c(2, 8, 3),t = c(2, 18, 30)), .Names = c("n", "s","t"), row.names = c(NA, -3L), class = "data.frame")
which looks like:
> d
n s t
1 2 2 2
2 3 8 18
3 5 3 30
How can I remove rows that have the same value in every column?
Yielding:
n s t
2 3 8 18
3 5 3 30
Here's one possible approach, which compares all columns to the first
d[rowSums(d == d[,1]) != ncol(d),]
# n s t
# 2 3 8 18
# 3 5 3 30
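A row-wise variant of the same idea, shown only as a sketch, counts the distinct values in each row and drops the rows that have exactly one. The name all_same is made up for the illustration, and the sketch assumes all columns are numeric. Note that unique() treats NA as an ordinary value, so a row consisting only of NA would also be dropped.

```r
d <- data.frame(n = c(2, 3, 5), s = c(2, 8, 3), t = c(2, 18, 30))

# TRUE for rows in which every column holds the same value
all_same <- apply(d, 1, function(row) length(unique(row)) == 1)

# Keep only the rows that contain at least two different values
result <- d[!all_same, ]
print(result)
```

On large data frames the vectorised rowSums() comparison above is likely to be faster than apply().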