How to do group matching in R? - r

Suppose I have the data.frame below where treat == 1 means that the id received treatment and prob is the calculated probability that treat == 1.
set.seed(1)
df <- data.frame(id = 1:10, treat = sample(0:1, 10, replace = T))
df$prob <- ifelse(df$treat, rnorm(10, .8, .1), rnorm(10, .4, .4))
df
id treat prob
1 1 0 0.3820266
2 2 0 0.3935239
3 3 1 0.8738325
4 4 1 0.8575781
5 5 0 0.6375605
6 6 1 0.9511781
7 7 1 0.8389843
8 8 1 0.7378759
9 9 1 0.5785300
10 10 0 0.6479303
To minimize selection bias, I now wish to create pseudo treatment and control groups on the basis of the values of treat and prob:
When any id withtreat == 1 is within 0.1 prob of any id with treat == 0, I want the value of group to be "treated".
When any id withtreat == 0 is within 0.1 prob of any id with treat == 1, I want the value of group to be "control".
Below is an example of what I'd like the result to be.
df$group <- c(NA, NA, NA, NA, 'control', NA, NA, 'treated', 'treated', 'control')
df
id treat prob group
1 1 0 0.3820266 <NA>
2 2 0 0.3935239 <NA>
3 3 1 0.8738325 <NA>
4 4 1 0.8575781 <NA>
5 5 0 0.6375605 control
6 6 1 0.9511781 <NA>
7 7 1 0.8389843 <NA>
8 8 1 0.7378759 treated
9 9 1 0.5785300 treated
10 10 0 0.6479303 control
How would I go about doing this? In the example above, matching is done with replacements, but a solution without replacements would be welcome, too.

You can try
foo <- function(x){
TR <- range(x$prob[x$treat == 0])
CT <- range(x$prob[x$treat == 1])
tmp <- sapply(1:nrow(x), function(y, z){
if(z$treat[y] == 1){
ifelse(any(abs(z$prob[y] - TR) <= 0.1), "treated", "NA")
}else{
ifelse(any(abs(z$prob[y] - CT) <= 0.1), "control", "NA")
}}, x)
cbind(x, group = tmp)
}
foo(df)
id treat prob group
1 1 0 0.3820266 NA
2 2 0 0.3935239 NA
3 3 1 0.8738325 NA
4 4 1 0.8575781 NA
5 5 0 0.6375605 control
6 6 1 0.9511781 NA
7 7 1 0.8389843 NA
8 8 1 0.7378759 treated
9 9 1 0.5785300 treated
10 10 0 0.6479303 control

I think this problem is well suited for cut in base R. Here is how you can do it in a vectorized way:
f <- function(r) {
x <- cut(df[r,]$prob, breaks = c(df[!r,]$prob-0.1, df[!r,]$prob+0.1))
df[r,][!is.na(x),]$id
}
ones <- df$treat==1
df$group <- NA
df[df$id %in% f(ones),]$group <- "treated"
df[df$id %in% f(!ones),]$group <- "control"
> df
# id treat prob group
# 1 1 0 0.3820266 <NA>
# 2 2 0 0.3935239 <NA>
# 3 3 1 0.8738325 <NA>
# 4 4 1 0.8575781 <NA>
# 5 5 0 0.6375605 control
# 6 6 1 0.9511781 <NA>
# 7 7 1 0.8389843 <NA>
# 8 8 1 0.7378759 treated
# 9 9 1 0.5785300 treated
# 10 10 0 0.6479303 control

Perhaps not the most elegant but it seems to work for me:
df %>% group_by(id,treat) %>% mutate(group2 = ifelse(treat==1,
ifelse(any(abs(prob-df[df$treat==0,3])<0.1),"treated","NA"),
ifelse(any(abs(prob-df[df$treat==1,3])<0.1),"control","NA"))) # treat==0

Is this what you want?
#Base R:
apply(df[df$treat == 1, ],1, function(x){
ifelse(any(df[df$treat == 0, 'prob'] -.1 < x[3] & x[3] < df[df$treat == 0, 'prob'] +.1), 'treated', NA)
})
You can invert $treatclause to reflect control-group and attach the variables to your df.

Related

Get the first non-null value from selected cells in a row

Good afternoon, friends!
I'm currently performing some calculations in R (df is displayed below). My goal is to display in a new column the first non-null value from selected cells for each row.
My df is:
MD <- c(100, 200, 300, 400, 500)
liv <- c(0, 0, 1, 3, 4)
liv2 <- c(6, 2, 0, 4, 5)
liv3 <- c(1, 1, 1, 1, 1)
liv4 <- c(1, 0, 0, 3, 5)
liv5 <- c(0, 2, 7, 9, 10)
df <- data.frame(MD, liv, liv2, liv3, liv4, liv5)
I want to display (in a column called "liv6") the first non-null value from 5 cells (given the data, liv1 = 0, liv2 = 6 , liv3 = 1, liv 4 = 1 and liv5 = 1). The result should be 6. And this calculation should be repeated fro each row in my dataframe..
I do know how to do this in Python, but not in R..
Any help is highly appreciated!
One option with dplyr could be:
df %>%
rowwise() %>%
mutate(liv6 = with(rle(c_across(liv:liv5)), values[which.max(values != 0)]))
MD liv liv2 liv3 liv4 liv5 liv6
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 100 0 6 1 1 0 6
2 200 0 2 1 0 2 2
3 300 1 0 1 0 7 1
4 400 3 4 1 3 9 3
5 500 4 5 1 5 10 4
A Base R solution:
df$liv6 <- apply(df[-1], 1, function(x) x[min(which(x != 0))])
output
df
MD liv liv2 liv3 liv4 liv5 liv6
1 100 0 6 1 1 0 2
2 200 0 2 1 0 2 2
3 300 1 0 1 0 7 1
4 400 3 4 1 3 9 1
5 500 4 5 1 5 10 1
A simple base R option is to apply across relevant columns (I exclude MD here, you can use any data frame subsetting style you want), then just take the first value of the non-zero values of that row.
df$liv6 <- apply(df[-1], 1, \(x) head(x[x > 0], 1))
df
#> MD liv liv2 liv3 liv4 liv5 liv6
#> 1 100 0 6 1 1 0 6
#> 2 200 0 2 1 0 2 2
#> 3 300 1 0 1 0 7 1
#> 4 400 3 4 1 3 9 3
#> 5 500 4 5 1 5 10 4
One approach is to use purrr::detect to detect the first non-zero element of each row.
We define a function which takes a numeric vector (row) and returns a boolean indicating whether each element is non-zero:
is_nonzero <- function(x) x != 0
We use this function to detect the first non-zero element in each row via purrr:detect
first_nonzero <- apply(df %>% dplyr::select(liv:liv5), 1, function(x) {
purrr::detect(x, is_nonzero, .dir = "forward")
})
We finally create the new column:
df$liv6 <- first_nonzero
As a result, we have
> df
MD liv liv2 liv3 liv4 liv5 liv6
100 0 6 1 1 0 6
200 0 2 1 0 2 2
300 1 0 1 0 7 1
400 3 4 1 3 9 3
500 4 5 1 5 10 4
Another straightforward solution is:
Reduce(function(x, y) ifelse(!x, y, x), df[, -1])
#[1] 6 2 1 3 4
This way should be very efficient, since we "scan" by column, as, presumably, the data have much fewer columns than rows.
The Reduce approach is a more functional form of a simple, old-school, loop:
ans = df[, 2]
for(j in 3:ncol(df)) {
i = !ans
ans[i] = df[i, j]
}
ans
#[1] 6 2 1 3 4

Create dummy-column based on another columns

Let's say I have this dataset
> example <- data.frame(a = 1:10, b = 10:1, c = 1:5 )
I want to create a new variable d. I want in d the value 1 when at least in of the variables a b c the value 1 2 or 3 is present.
d should look like this:
d <- c(1, 1, 1, 0, 0, 1, 1, 1, 1, 1)
Thanks in advance.
You can use rowSums to get a logical vector of 1, 2 or 3 appearing in each row and wrap it in as.integer to convert to 0 and 1, i.e.
as.integer(rowSums(df == 1|df == 2| df == 3) > 0)
#[1] 1 1 1 0 0 1 1 1 1 1
Will work for any number of vars:
example <- data.frame(a = 1:10, b = 10:1, c = 1:5 )
x <- c(1, 2, 3)
as.integer(Reduce(function(a, b) (a %in% x) | (b %in% x), example))
With the dplyr package:
library(dplyr)
x <- 1:3
example %>% mutate(d = as.integer(a %in% x | b %in% x | c %in% x))
Two other possibilities which work with any number of columns:
#option 1
example$d <- +(rowSums(sapply(example, `%in%`, 1:3)) > 0)
#option 2
library(matrixStats)
example$d <- rowMaxs(+(sapply(example, `%in%`, 1:3)))
which both give:
> example
a b c d
1 1 10 1 1
2 2 9 2 1
3 3 8 3 1
4 4 7 4 0
5 5 6 5 0
6 6 5 1 1
7 7 4 2 1
8 8 3 3 1
9 9 2 4 1
10 10 1 5 1
You can do this using apply(although little slow)
Logic: any will compare if there is any 1,2 or 3 is present or not, apply is used to iterate this logic on each of the rows. Then finally converting the boolean outcome to numeric by adding +0 (You may choose as.numeric here in case you want to be more expressive)
d <- apply(example,1 ,function(x)any(x==1|x==2|x==3))+0
In case someone wants to restrict the columns or want to run the logic on some columns, then one can do this also:
d <- apply(example[,c("a","b","c")], 1, function(x)any(x==1|x==2|x==3))+0
Here you have control on columns on which one to take or ignore basis your needs.
Output:
> d
[1] 1 1 1 0 0 1 1 1 1 1
general solution:
example %>%
sapply(function(i)i %in% x) %>% apply(1,any) %>% as.integer
#[1] 1 1 1 0 0 1 1 1 1 1
Try this method, verify if in any column there is at list one element present in x.
x<-c(1,2,3)
example$d<-as.numeric(example$a %in% x | example$b %in% x | example$c %in% x)
example
a b c d
1 1 10 1 1
2 2 9 2 1
3 3 8 3 1
4 4 7 4 0
5 5 6 5 0
6 6 5 1 1
7 7 4 2 1
8 8 3 3 1
9 9 2 4 1
10 10 1 5 1

R carry forward and backward conditionally

I have a set of variables that contain data about whether or not a person has ever had certain health conditions. For example, "have you ever had a heart attack?"
If they say "yes" at observation 2, then the answer is still yes at observations 3 and 4. But, it is not necessarily yes at observation 1. The heart attack could have occurred between observation 1 and 2.
If they say "no" at observation 2, then the answer is no at observations 1. But, it is not necessarily no at observations 3 or 4.
Here is a reproducible example:
df <- tibble(
id = rep(1:3, each = 4),
obs = rep(1:4, times = 3),
mi_ever = c(NA, 0, 1, NA, NA, 0, NA, NA, NA, 1, NA, NA)
)
df
id obs mi_ever
1 1 1 NA
2 1 2 0
3 1 3 1
4 1 4 NA
5 2 1 NA
6 2 2 0
7 2 3 NA
8 2 4 NA
9 3 1 NA
10 3 2 1
11 3 3 NA
12 3 4 NA
It's trivial to carry my 0's (No's) backward or carry my 1's (Yes's) forward using zoo::na.locf. However, I'm not sure how to carry 0's backward and 1's forward. Ideally, I'd like the following result:
id obs mi_ever mi_ever_2
1 1 1 NA 0
2 1 2 0 0
3 1 3 1 1
4 1 4 NA 1
5 2 1 NA 0
6 2 2 0 0
7 2 3 NA NA
8 2 4 NA NA
9 3 1 NA NA
10 3 2 1 1
11 3 3 NA 1
12 3 4 NA 1
I've checked out the following posts, but none seem to cover exactly what I'm asking here.
Carry last Factor observation forward and backward in group of rows in R
Forward and backward fill data frame in R
making a "dropdown" function in R
Any help is appreciated.
Basically I'm marking the items in sequence after the first 1 to become 1 and the ones before the last 0 to become 0.
ever <- function (x) min( which( x == 1))
NA_1 <- function(x) seq_along(x) > ever(x) #could have done in one function
# check to see if working
ave(df$mi_ever, df$id, FUN= function(x){ x[NA_1(x) ] <- 1; x})
[1] NA 0 1 1 NA 0 NA NA NA 1 1 1
NA_0 <- function(x) seq_along(x) < not_yet(x)
not_yet <- function(x){ max( which( x==0)) }
# make temporary version of 1-modified column
temp1 <- ave(df$mi_ever, df$id, FUN= function(x){ x[NA_1(x) ] <- 1; x})
df$ever2 <- ave(temp1, df$id, FUN= function(x){ x[NA_0(x) ] <- 0; x})
# then make final version; could have done it "in place" I suppose.
df
# A tibble: 12 x 4
id obs mi_ever ever2
<int> <int> <dbl> <dbl>
1 1 1 NA 0
2 1 2 0 0
3 1 3 1 1
4 1 4 NA 1
5 2 1 NA 0
6 2 2 0 0
7 2 3 NA NA
8 2 4 NA NA
9 3 1 NA NA
10 3 2 1 1
11 3 3 NA 1
12 3 4 NA 1
If you need to suppress the warnings that should be possible.
I took the answer from #42- above (Thank you!), and tweaked it a little bit to further suit my needs. Specifically, I:
Took care of the warning "no non-missing arguments to min; returning Infno non-missing arguments to max; returning -Inf".
Combined the separate functions into a single function (although the separate functions were extremely useful for learning).
Added an optional check_logic argument. When TRUE, the function will return 9's if a 0 comes after a 1. This represents a data error or logic flaw that warrants further investigation.
Added an example of using the function with data.table, and on multiple variables at once. This more accurately represents how I'm using the function in real life, and I thought it may be useful to others.
The function:
distribute_ever <- function(x, check_logic = TRUE, ...) {
if (check_logic) {
if (length(which(x == 1)) > 0 & length(which(x == 0)) > 0) {
if (min(which(x == 1)) < max(which(x == 0))) {
x <- 9 # Set x to 9 if zero comes after 1
}
}
}
ones <- which(x == 1) # Get indices for 1's
if (length(ones) > 0) { # Prevents warning
first_1_by_group <- min(which(x == 1)) # Index first 1 by group
x[seq_along(x) > first_1_by_group] <- 1 # Set x at subsequent indices to 1
}
zeros <- which(x == 0) # Get indices for 0's
if (length(zeros) > 0) { # Prevents warning
last_0_by_group <- max(which(x == 0)) # Index last 0 by group
x[seq_along(x) < last_0_by_group] <- 0 # Set x at previous indices to 0
}
x
}
A new reproducible example with multiple "ever" variables and some cases with 0 after 1:
dt <- data.table(
id = rep(1:3, each = 4),
obs = rep(1:4, times = 3),
mi_ever = c(NA, 0, 1, NA, NA, 0, NA, NA, NA, 1, NA, NA),
diab_ever = c(0, NA, NA, 1, 1, NA, NA, 0, 0, NA, NA, NA)
)
Iterate over multiple variables quickly using data.table (with by group processing):
ever_vars <- c("mi_ever", "diab_ever")
dt[, paste0(ever_vars, "_2") := lapply(.SD, distribute_ever),
.SDcols = ever_vars,
by = id][]
Results:
id obs mi_ever diab_ever mi_ever_2 diab_ever_2
1: 1 1 NA 0 0 0
2: 1 2 0 NA 0 NA
3: 1 3 1 NA 1 NA
4: 1 4 NA 1 1 1
5: 2 1 NA 1 0 9
6: 2 2 0 NA 0 9
7: 2 3 NA NA NA 9
8: 2 4 NA 0 NA 9
9: 3 1 NA 0 NA 0
10: 3 2 1 NA 1 NA
11: 3 3 NA NA 1 NA
12: 3 4 NA NA 1 NA
For each input "ever" variable, we have:
Created a new variable with "_2" appended to the end of the input variable name. You could also edit "in place" as 42- pointed out, but I like being able to double check my data.
Zeroes are carried backward and ones are carried forward in time.
NA's after zeros and before ones (within id) are returned unchanged.
If there is a 0 (No, I've never had ...) after a 1 (Yes, I've had ...), as is the case with person 2's responses regarding diabetes, then the function returns 9's.
If we were to set check_logic to FALSE, then 1's would win out and replace 0's

How to replace consecutive NAs with zero given a max gap parameter (in R)

I would like to replace all consecutive NA values per row with zero but only if the number of consecutive NAs is less than a parmeter maxgap.
This is very similar to the function zoo::na.locf
x = c(NA,1,2,3,NA,NA,5,6,7,NA,NA,NA)
zoo::na.locf(x, maxgap = 2, na.rm = FALSE)
gives
[1] NA 1 2 3 3 3 5 6 7 NA NA NA
There are two things different from my aim:
I would like to replace the leading NA too and I would like to replace the 2 consecutive NAs with 0 and not the last non-NA value.
I would like to get
0 1 2 3 0 0 5 6 7 NA NA NA
How can I do this in R. Can I use functions from the tidyverse?
If y is the result of the na.locf line then if y[i] is not NA but x[i] is NA then it was replaced so assign 0 to it. Also if it is a leading NA which occurs when the cumsum(...) term below is 0 then replace it as well.
replace(y, (!is.na(y) & is.na(x)) | cumsum(!is.na(y)) == 0, 0)
## [1] 0 1 2 3 0 0 5 6 7 NA NA NA
We can use rle to do this
f1 <- function(vec){
rl <- rle(is.na(vec))
lst <- within.list(rl, {
i1 <- seq_along(values)==1
i2 <- seq_along(values) != length(values)
values[!((lengths==2 & values & i2)|
(values & i1))] <- FALSE
})
vec[inverse.rle(lst)] <- 0
vec
}
f1(x)
#[1] 0 1 2 3 0 0 5 6 7 NA NA NA
You could e.g. do this:
require(data.table)
require(dplyr)
x = c(NA,1,2,3,NA,NA,5,6,7,NA,NA,NA)
my_replace <- function(x, n, maxgap){
if(is.na(x[1]) && n <= maxgap){
x <- 0
}
x
}
data.frame(x, y=x) %>%
group_by(data.table::rleid(x)) %>%
mutate(x = my_replace(x, n(), 2), y = my_replace(y, n(), 1)) %>%
ungroup() %>%
select(x,y)
This allows you to set the maxgap columnwise: for x 2 for y 1.
This results in:
# A tibble: 12 × 2
x y
<dbl> <dbl>
1 0 0
2 1 1
3 2 2
4 3 3
5 0 NA
6 0 NA
7 5 5
8 6 6
9 7 7
10 NA NA
11 NA NA
12 NA NA

Identify and label the largest number in each group

Hi I want to identify and label the largest number for each group, can someone tell me how to get this done in r (or maybe excel would be easier)?
The following is an example data, the original data contains only the left 2 columns and I want to generate the third one. In the 3rd column, I want to label the largest value in the group as 1, e.g., in group 1, the largest is .02874 so it's marked as 1, otherwise 0. Thank you!
x <- read.table(header=T, text="group value largest
1 0.02827 0
1 0.02703 0
1 0.02874 1
2 0.03255 0
2 0.10394 1
2 0.03417 0
3 0.13858 0
3 0.16084 0
3 0.99830 1
3 0.24563 0")
UPDATE: Thank you all for your help! They all are great solutions!
Finally, the base (no package required) approach:
is.largest <- function(x) as.integer(seq_along(x) == which.max(x))
x <- transform(x, largest = ave(value, group, FUN = is.largest))
Note that if I were you, I would remove the as.integer and just store a logical (TRUE/FALSE) vector.
library(data.table)
x <- data.table(x)
y <- x[,list(value = max(value), maxindicator = TRUE), by = c('group')]
z <- merge(x,y, by = c('group','value'), all = TRUE)
Output
> z
group value largest maxindicator
1: 1 0.02703 0 NA
2: 1 0.02827 0 NA
3: 1 0.02874 1 TRUE
4: 2 0.03255 0 NA
5: 2 0.03417 0 NA
6: 2 0.10394 1 TRUE
7: 3 0.13858 0 NA
8: 3 0.16084 0 NA
9: 3 0.24563 0 NA
10: 3 0.99830 1 TRUE
Here is a solution with plyr :
x$largest <- 0
x <- ddply(x, .(group), function(df) {
df$largest[which.max(df$value)] <- 1
df
})
And one with base R :
x$largest <- 0
l <- split(x, x$group)
l <- lapply(l, function(df) {
df$largest[which.max(df$value)] <- 1
df
})
x <- do.call(rbind, l)
Here's a less cool base approach:
FUN <- function(x) {y <- rep(0, length(x)); y[which.max(x)] <- 1; y}
x$largest <- unlist(tapply(x$value, x$group, FUN))
## group value largest
## 1 1 0.02827 0
## 2 1 0.02703 0
## 3 1 0.02874 1
## 4 2 0.03255 0
## 5 2 0.10394 1
## 6 2 0.03417 0
## 7 3 0.13858 0
## 8 3 0.16084 0
## 9 3 0.99830 1
## 10 3 0.24563 0
It was more difficult to do in base than I had anticipated.

Resources