"N-1" Cumulative Averages - r

I have the following data:
library(dplyr)
my_data = data.frame(patient_id = c(1,1,1,1, 2,2,2),
age = c(43, 43, 44, 44, 21, 21, 21),
gender = c("M", "M", "M", "M", "F", "F", "F"),
appointment_number = c(1,2,3,4,1,2,3),
missed = c(0, 0, 1, 1, 1, 1, 1))
My Question: Grouped by each ID, I want to create two variables:
The first variable takes the value of missed at the previous appointment.
The second variable takes the "n-1" cumulative average of the previous appointment values (e.g., if patient_id = 1 had 8 rows, the cumulative average at row 8 would be the cumulative average of the first 7 rows).
Here is my attempt to do this:
my_data_final <- my_data %>%
group_by(patient_id) %>%
mutate(cummean = cumsum(missed)/(row_number() - 1)) %>%
mutate(previous_apt = lag(missed))
This results in cummean values greater than 1 (as well as NaN and Inf), even though missed can only be 0 or 1:
# A tibble: 7 x 7
# Groups: patient_id [2]
patient_id age gender appointment_number missed cummean previous_apt
<dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
1 1 43 M 1 0 NaN NA
2 1 43 M 2 0 0 0
3 1 44 M 3 1 0.5 0
4 1 44 M 4 1 0.667 1
5 2 21 F 1 1 Inf NA
6 2 21 F 2 1 2 1
7 2 21 F 3 1 1.5 1
Can someone please show me how to fix this?
Thanks!
Note: I tried to resolve this - is this correct?
my_data %>%
group_by(patient_id) %>%
mutate(previous_apt = lag(missed)) %>%
mutate(cummean = (cumsum(missed) - missed) / (row_number() - 1)) %>%
mutate(previous_apt_2 = lag(missed, 2))
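That attempt is correct: subtracting the current missed from the running sum before dividing excludes the current row from the average. A sketch of an equivalent (and arguably clearer) formulation wraps dplyr's cummean() in lag(); the only difference is an NA instead of NaN in each group's first row:

```r
library(dplyr)

my_data <- data.frame(patient_id = c(1, 1, 1, 1, 2, 2, 2),
                      missed     = c(0, 0, 1, 1, 1, 1, 1))

my_data %>%
  group_by(patient_id) %>%
  mutate(previous_apt = lag(missed),
         # cumulative average of rows 1..(n-1), i.e. excluding the current row
         cummean_n1 = lag(cummean(missed)))
```

For patient 1 this gives NA, 0, 0, 1/3, matching the (cumsum - missed) / (row_number() - 1) formula everywhere except the first row of each group.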

Related

Rowwise name of column where first non-zero value appears

I've got a bunch of columns all starting with the prefix wtp_ that occur in the midst of a wide dataframe (with several columns before and after the wtp_ columns). Mini example:
df <- tribble(~id, ~complete, ~wtp_20,~wtp_40,~wtp_60,~wtp_80,~wtp_100, ~sex,
1, 1, 0,0,1,1,1, "F",
2, 0, 0,0,0,1,1, "F",
3, 0, 0,0,0,0,1, "M",
4, 1, 1,1,1,1,1, "M",
5, 1, 0,0,0,0,0, "M",
6, 0, 0,1,1,1,1, "F"); df
What I'm looking for: I need to create a new variable (min_wtp) that returns the name of the column the first time that one of the wtp_ columns switches from 0 to 1. In other words, I need a solution to create the following:
df_needed <- tribble(~id, ~complete, ~wtp_20,~wtp_40,~wtp_60,~wtp_80,~wtp_100, ~sex, ~min_wtp,
1, 1, 0,0,1,1,1, "F", "wtp_60",
2, 0, 0,0,0,1,1, "F", "wtp_80",
3, 0, 0,0,0,0,1, "M", "wtp_100",
4, 1, 1,1,1,1,1, "M", "wtp_20",
5, 1, 0,0,0,0,0, "M", "NA",
6, 0, 0,1,1,1,1, "F", "wtp_40"); df_needed
Please note the following complications:
-Some people (like id==5) never change to 1 while others (like id==4) are 1 all along.
-There are some irrelevant columns occurring before the wtp_ columns that have 0s and 1s in them which should be ignored in the construction of min_wtp.
-There are way more columns (including wtp_ columns) than the minimal example I included above.
I've tried playing with which and colnames functions in combination with select(starts_with("wtp_")) but have been unsuccessful.
If anyone has a dplyr solution, that would be preferred.
We can use apply to get, for each row, the index of the first column that satisfies your condition. Then we use that index to get the column name.
df$min_wtp = apply(df[ , grepl("wtp", names(df))], 1, function(x) {
names(x)[min(which(x > 0))]
})
df
id complete wtp_20 wtp_40 wtp_60 wtp_80 wtp_100 sex min_wtp
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
1 1 1 0 0 1 1 1 F wtp_60
2 2 0 0 0 0 1 1 F wtp_80
3 3 0 0 0 0 0 1 M wtp_100
4 4 1 1 1 1 1 1 M wtp_20
5 5 1 0 0 0 0 0 M NA
6 6 0 0 1 1 1 1 F wtp_40
It would be much easier if you get the data in long format:
library(dplyr)
df %>%
tidyr::pivot_longer(cols = starts_with('wtp')) %>%
group_by(id) %>%
summarise(min_wtp = name[which(value == 1 &
lag(value, default = 0) == 0)[1]]) %>%
left_join(df, by = 'id')
# A tibble: 6 x 9
# id min_wtp complete wtp_20 wtp_40 wtp_60 wtp_80 wtp_100 sex
# <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
#1 1 wtp_60 1 0 0 1 1 1 F
#2 2 wtp_80 0 0 0 0 1 1 F
#3 3 wtp_100 0 0 0 0 0 1 M
#4 4 wtp_20 1 1 1 1 1 1 M
#5 5 NA 1 0 0 0 0 0 M
#6 6 wtp_40 0 0 1 1 1 1 F
Without reshaping the data, you can use rowwise with c_across:
apply_fun <- function(x) {
which(x == 1 & lag(x, default = 0) == 0)[1]
}
cols <- grep('^wtp', names(df), value = TRUE)
df %>%
rowwise() %>%
mutate(min_wtp = cols[apply_fun(c_across(all_of(cols)))])
If it never goes backwards from 1 to 0, then you can find the change point very quickly with some basic sums:
sw <- startsWith(names(df), "wtp_")
names(df[sw])[sum(sw) - rowSums(df[sw]) + 1]
#[1] "wtp_60" "wtp_80" "wtp_100" "wtp_20" NA "wtp_40"

Create a random binary variable for a subset of observations assigning 1 to a specific proportion of rows

I have a dataframe...
df <- tibble(
id = 1:10,
family = c("a","a","b","b","c", "d", "e", "f", "g", "h")
)
Families will only contain 2 members at most (so they're either individuals or pairs).
For individuals (families with only one row, i.e. id = 5:10), I want to create a column called 'random' that randomly assigns 50% of the entries as 1 and the rest as 0. All other rows (those belonging to families with 2 members) should also equal 0.
By the end, the data should look like the following (depending on which 50% of rows are assigned 1)...
df <- tibble(
id = 1:10,
family = c("a","a","b","b","c", "d", "e", "f", "g", "h"),
random = c(0, 0, 0, 0, 1, 0, 1, 1, 0, 0)
)
I am mostly using the tidyverse and would like to include this within a pipe.
I am currently trying something along the lines of...
df %>%
group_by(family) %>%
mutate(random = if(n() == 1) *not sure what goes here* else 0)
We can assign 0 if the number of rows in a family is greater than 1; otherwise we select a random value from 0 and 1.
library(dplyr)
df %>%
group_by(family) %>%
mutate(random = if(n() > 1) 0 else sample(0:1, 1))
# id family random
# <int> <chr> <dbl>
# 1 1 a 0
# 2 2 a 0
# 3 3 b 0
# 4 4 b 0
# 5 5 c 1
# 6 6 d 1
# 7 7 e 0
# 8 8 f 0
# 9 9 g 0
#10 10 h 0
If we want a fixed number of 1's and 0's among the single-member families, we can use
df %>%
add_count(family) %>%
mutate(n = replace(n, n > 1, 0),
n = replace(n, {inds = which(n == 1);sample(inds, length(inds)/2)}, 0))
# A tibble: 10 x 3
# id family n
# <int> <chr> <dbl>
# 1 1 a 0
# 2 2 a 0
# 3 3 b 0
# 4 4 b 0
# 5 5 c 1
# 6 6 d 0
# 7 7 e 0
# 8 8 f 1
# 9 9 g 1
#10 10 h 0
Using data.table
library(data.table)
setDT(df)[, if(.N > 1) 0 else sample(0:1, 1), family]
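One caveat: sample(0:1, 1) flips an independent coin per family, so roughly, not exactly, half of the individuals get a 1. If exactly 50% is required, a sketch (my addition) is to sample the singleton row indices directly; set.seed is only there to make the draw reproducible:

```r
library(dplyr)

df <- tibble(id = 1:10,
             family = c("a", "a", "b", "b", "c", "d", "e", "f", "g", "h"))

set.seed(1)                                  # arbitrary seed, for reproducibility
df <- df %>%
  add_count(family, name = "n_fam") %>%      # family size, repeated on each row
  mutate(random = 0)
singles <- which(df$n_fam == 1)              # rows belonging to one-member families
df$random[sample(singles, length(singles) / 2)] <- 1   # exactly half get a 1
df <- df %>% select(-n_fam)
```

With the example data this always assigns three 1's among ids 5:10, and 0 to both members of every pair.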

Dplyr transformation based on string filtering and conditions

I would like to transform a messy dataset in R.
However, I am having trouble figuring out how to do so. I have provided an example dataset and the result I need to achieve:
dataset <- tribble(
~ID, ~DESC,
1, "3+1Â 81Â mÂ",
2, "2+1Â 90Â mÂ",
3, "3+KK 28Â mÂ",
4, "3+1 120 m (Mezone)")
dataset
dataset_transformed <- tribble(
~ID, ~Rooms, ~Meters, ~Mezone, ~KK,
1, 4, 81, 0, 0,
2, 3, 90, 0, 0,
3, 3, 28, 0, 1,
4, 4, 120, 1, 0)
dataset_transformed
The columns first need to be separated; however, dataset %>% separate(DESC, c("size", "meters_squared", "Mezone"), sep = " ") does not work because "(Mezone)" is thrown away.
We can do this by extracting the components individually and evaluating the room expression
library(dplyr)
library(purrr)
library(stringr)
library(tidyr)
dataset %>%
mutate(Rooms = map_dbl(DESC, ~
str_extract(.x, "^\\d+\\+\\d*") %>%
str_replace("\\+$", "+0") %>%
rlang::parse_expr(.) %>%
eval ),
Meters = str_extract(DESC, "(?<=\\s)\\d+(?=Â)"),
Mezone = +(str_detect(DESC, "Mezone")),
KK = +(str_detect(DESC, "KK"))) %>%
select(-DESC)
# A tibble: 4 x 5
# ID Rooms Meters Mezone KK
# <dbl> <dbl> <chr> <int> <int>
#1 1 4 81 0 0
#2 2 3 90 0 0
#3 3 3 28 0 1
#4 4 4 120 1 0
Or another option is extract and then make use of str_detect
dataset %>%
extract(DESC, into = c("Rooms1", "Rooms2", "Meters"),
"^(\\d+)\\+(\\d*)[^0-9]+(\\d+)", convert = TRUE, remove = FALSE) %>%
transmute(ID, Mezone = +(str_detect(DESC, "Mezone")),
KK = +(is.na(Rooms2)), Rooms = Rooms1 + replace_na(Rooms2, 0), Meters )
# A tibble: 4 x 5
# ID Mezone KK Rooms Meters
# <dbl> <int> <int> <dbl> <int>
#1 1 0 0 4 81
#2 2 0 0 3 90
#3 3 0 1 3 28
#4 4 1 0 4 120
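A third variant (my addition, assuming DESC always starts with "rooms+extra" followed by the square metres, and leaving the stray Â bytes alone): extract the numbers directly with lookarounds, avoiding the parse-and-eval step:

```r
library(dplyr)
library(stringr)

dataset <- tribble(
  ~ID, ~DESC,
  1, "3+1Â 81Â mÂ",
  2, "2+1Â 90Â mÂ",
  3, "3+KK 28Â mÂ",
  4, "3+1 120 m (Mezone)")

dataset %>%
  mutate(Rooms  = as.numeric(str_extract(DESC, "^\\d+")) +               # rooms before the +
                  coalesce(as.numeric(str_extract(DESC, "(?<=\\+)\\d+")), 0),  # digits after +, 0 for "KK"
         Meters = as.numeric(str_extract(DESC, "(?<=\\s)\\d+")),         # first number after a space
         Mezone = +str_detect(DESC, "Mezone"),
         KK     = +str_detect(DESC, "KK")) %>%
  select(-DESC)
```

For "3+KK" the lookbehind finds no digits after the +, so coalesce() supplies the 0 and Rooms stays 3.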

How do I remove offsetting rows in a tibble?

I am trying to remove rows that have offsetting values.
library(dplyr)
a <- c(1, 1, 1, 1, 2, 2, 2, 2,2,2)
b <- c("a", "b", "b", "b", "c", "c","c", "d", "d", "d")
d <- c(10, 10, -10, 10, 20, -20, 20, 30, -30, 30)
o <- c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J")
df <- tibble(ID = a, SEQ = b, VALUE = d, OTHER = o)
This generates the following table, ordered by ID and SEQ:
> df
# A tibble: 10 x 4
ID SEQ VALUE OTHER
<dbl> <chr> <dbl> <chr>
1 1 a 10 A
2 1 b 10 B
3 1 b -10 C
4 1 b 10 D
5 2 c 20 E
6 2 c -20 F
7 2 c 20 G
8 2 d 30 H
9 2 d -30 I
10 2 d 30 J
I want to drop the row pairs (2,3), (5,6), (8,9) because VALUE negates the VALUE in the matching previous row.
I want the resulting table to be
> df2
# A tibble: 4 x 4
ID SEQ VALUE OTHER
<dbl> <chr> <dbl> <chr>
1 1 a 10 A
2 1 b 10 D
3 2 c 20 G
4 2 d 30 J
I know that I can't use group_by %>% summarize, because I need to keep the value that is in OTHER. I've looked at the dplyr::lag() function but I don't see how that can help. I believe that I could loop through the table with some type of for each loop and generate a logical vector that can be used to drop the rows, but I was hoping for a more elegant solution.
What about:
vec <- cbind(
c(head(df$VALUE,-1) + df$VALUE[-1], 9999) ,
df$VALUE + c(9999, head(df$VALUE,-1))
)
vec <- apply(vec,1,prod)
vec <- vec!=0
df[vec,]
# A tibble: 4 x 4
ID SEQ VALUE OTHER
<dbl> <chr> <dbl> <chr>
1 1 a 10 A
2 1 b 50 D
3 2 c 60 G
4 2 d 70 J
The idea is to add the VALUE field to shifted versions of itself (one step forward, one step back). When either sum is 0, the row belongs to an offsetting pair and is removed.
Here's another solution with dplyr. Not sure about the edge case you mentioned in the comments, but feel free to test it with my solution:
library(dplyr)
df %>%
group_by(ID, SEQ) %>%
mutate(diff = VALUE + lag(VALUE),
diff2 = VALUE + lead(VALUE)) %>%
mutate(across(diff:diff2, ~ coalesce(., 1))) %>%
filter((diff != 0 & diff2 != 0)) %>%
select(-diff, -diff2)
Result:
# A tibble: 4 x 4
# Groups: ID, SEQ [4]
ID SEQ VALUE OTHER
<dbl> <chr> <dbl> <chr>
1 1 a 10 A
2 1 b 50 D
3 2 c 60 G
4 2 d 70 J
Note:
This solution first creates two diff columns, one adding the lag, another adding the lead of VALUE to each VALUE. Only the offset columns will either have a zero in diff or in diff2, so I filtered out those rows, resulting in the desired output.
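Note that when a group contains three alternating values such as 10, -10, 10, the pair-sum filters can remove all three rows, because the middle value forms a zero sum with both neighbours. A loop of the kind the question mentions, which greedily cancels each row against the last row still kept, handles that case; this is my sketch, not one of the original answers:

```r
library(tibble)

df <- tibble(ID    = c(1, 1, 1, 1, 2, 2, 2, 2, 2, 2),
             SEQ   = c("a", "b", "b", "b", "c", "c", "c", "d", "d", "d"),
             VALUE = c(10, 10, -10, 10, 20, -20, 20, 30, -30, 30),
             OTHER = LETTERS[1:10])

keep <- rep(TRUE, nrow(df))
for (i in 2:nrow(df)) {
  kept_before <- which(keep[seq_len(i - 1)])
  if (length(kept_before) > 0) {
    prev <- max(kept_before)                 # last row not yet cancelled
    if (df$ID[i] == df$ID[prev] && df$SEQ[i] == df$SEQ[prev] &&
        df$VALUE[i] + df$VALUE[prev] == 0) {
      keep[c(prev, i)] <- FALSE              # cancel the offsetting pair
    }
  }
}
df[keep, ]
```

This keeps rows A, D, G and J with their original VALUEs, matching the df2 the question asks for.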

substitute value in dataframe based on conditional

I have the following data set
library(dplyr)
df<- data.frame(c("a", "a", "a", "a", "a", "a", "b", "b", "b", "b", "b", "b"),
c(1, 1, 2, 2, 2, 3, 1, 2, 2, 2, 3, 3),
c(25, 75, 20, 40, 60, 50, 20, 10, 20, 30, 40, 60))
colnames(df)<-c("name", "year", "val")
We summarize this by grouping df by name and year, then computing the average and the number of entries:
asd <- (df %>%
group_by(name,year) %>%
summarize(average = mean(val), `ave_number` = n()))
This gives the following desired output
name year average ave_number
<fctr> <dbl> <dbl> <int>
1 a 1 50 2
2 a 2 40 3
3 a 3 50 1
4 b 1 20 1
5 b 2 20 3
6 b 3 50 2
Now I would like to substitute all entries of asd$average where asd$ave_number < 2 according to the following lookup table, based on year:
replacer<- data.frame(c(1,2,3),
c(100,200,300))
colnames(replacer)<-c("year", "average")
In other words, I would like to end up with
name year average ave_number
<fctr> <dbl> <dbl> <int>
1 a 1 50 2
2 a 2 40 3
3 a 3 300 1 #substituted
4 b 1 100 1 #substituted
5 b 2 20 3
6 b 3 50 2
Is there a way to achieve this with dplyr? I guess I have to use the %>%-operator, something like this (not working code)
asd %>%
group_by(name, year) %>%
summarize(average = ifelse(n() < 2, #SOMETHING#, mean(val)))
Here's what I would do:
colnames(replacer) <- c("year", "average_replacer") # To avoid a duplicated variable name
asd <- left_join(asd, replacer, by = "year") %>%
mutate(average = ifelse(ave_number < 2, average_replacer, average)) %>%
select(-average_replacer)
name year average ave_number
<fctr> <dbl> <dbl> <int>
1 a 1 50 2
2 a 2 40 3
3 a 3 300 1
4 b 1 100 1
5 b 2 20 3
6 b 3 50 2
Regarding the following:
I guess I have to use the %>%-operator
You don't ever have to use the pipe operator. It is there for convenience because you can string (or "pipe") functions one after another, as you would with a train of thought. It's kind of like having a flow in your code.
You can do this easily by using a named vector of replacement values by year instead of a data frame. If you're set on a data frame, you'd be using joins.
replacer <- setNames(c(100,200,300),c(1,2,3))
asd <- df %>%
group_by(name,year) %>%
summarize(average = mean(val),
ave_number = n()) %>%
mutate(average = if_else(ave_number < 2, replacer[as.character(year)], average)) # index by name, not position
Source: local data frame [6 x 4]
Groups: name [2]
name year average ave_number
<fctr> <dbl> <dbl> <int>
1 a 1 50 2
2 a 2 40 3
3 a 3 300 1
4 b 1 100 1
5 b 2 20 3
6 b 3 50 2
