Suppose I have a table in the following format:
CowId DIM Type
1 13 Case
2 7 Case
3 3 Control
4 4 Control
5 9 Control
6 3 Control
7 5 Control
8 10 Control
9 1 Control
10 6 Control
11 7 Control
12 4 Control
I would like to randomly match Cases to Controls (1 to 1) based on +/- 3 DIM. Is there a convenient way to accomplish this task using dplyr? Any feedback would be appreciated.
Output from dput is appended:
structure(list(CowId = 1:12, DIM = c(13L, 7L, 3L, 4L, 9L, 3L,
5L, 10L, 1L, 6L, 7L, 4L), Type = structure(c(2L, 2L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("Control", "Case"
), class = "factor")), row.names = c(NA, -12L), class = "data.frame")
A way in base R:
# Logical vector marking rows where Type is 'Case'
inds <- df$Type == 'Case'
# Get all values within -3 to +3 of each case DIM value
vals <- unique(c(sapply(df$DIM[inds], `+`, -3:3)))
# Randomly select as many in-range control rows as there are cases
result <- sample(which(df$DIM %in% vals & !inds), sum(inds))
# Combine case and control data.
df[c(which(inds), result), ]
# CowId DIM Type
#1 1 13 Case
#2 2 7 Case
#5 5 9 Control
#10 10 6 Control
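Note that this samples from the pooled windows, so a drawn control is only guaranteed to be within ±3 of some case, not necessarily of the case it is paired with. A per-case sketch (my variant, assuming each control may be used at most once) could look like:
set.seed(42)  # any seed; only for reproducibility
taken <- integer(0)
matches <- vapply(which(inds), function(i) {
  # controls within +/- 3 of this case's DIM that are not yet taken
  pool <- setdiff(which(!inds & abs(df$DIM - df$DIM[i]) <= 3), taken)
  pick <- if (length(pool)) pool[sample.int(length(pool), 1)] else NA_integer_
  taken <<- c(taken, pick)
  pick
}, integer(1))
df[c(which(inds), na.omit(matches)), ]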
The "randomly" part could be tricky. Here is my approach:
For each case ID, calculate the min/max DIM window.
Randomly pick one control (or, optionally, half of the available controls) within that window.
Record the picked control against the case ID and exclude those rows from future picks.
Repeat until all cases are done.
If no control is available for a case, a message is printed.
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(magrittr)
df <- structure(list(CowId = 1:12, DIM = c(13L, 7L, 3L, 4L, 9L, 3L,
5L, 10L, 1L, 6L, 7L, 4L), Type = structure(c(2L, 2L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("Control", "Case"
), class = "factor")), row.names = c(NA, -12L), class = "data.frame")
# create variable for tracking sample picking process
df %<>% mutate(Picked = FALSE, Case_ID = -1)
# get the list of case IDs - assume CowId in df is unique
list_case_id <- df$CowId[df$Type == "Case"]
for (i_case_id in list_case_id) {
# calculate the min/max DIM
current_case <- df %>% filter(CowId == i_case_id)
expecting_DIM_min <- current_case$DIM - 3
expecting_DIM_max <- current_case$DIM + 3
# Pick with sample
possible_sample <- df %>%
filter(Type == "Control", DIM >= expecting_DIM_min & DIM <= expecting_DIM_max,
Picked == FALSE)
if (nrow(possible_sample) == 0) {
message("There is no possible sample for Case ID: ", i_case_id)
message("DIM Range is: ", expecting_DIM_min, " - ", expecting_DIM_max)
} else {
max_sample <- nrow(possible_sample)
# Maximum picks - the OP asked for a 1-1 match;
# for half the available controls use: pick_number <- max(1, max_sample %/% 2)
pick_number <- 1
picked <- possible_sample %>%
sample_n(size = pick_number)
df$Picked[df$CowId %in% picked$CowId] <- TRUE
df$Case_ID[df$CowId %in% picked$CowId] <- i_case_id
}
}
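Since sample_n draws at random, the matching differs between runs; calling set.seed() once before the loop makes a given matching reproducible (my suggestion, any seed value works):
set.seed(123)  # hypothetical seed - run this before the for loop above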
Here is the output:
df %>% filter(Picked | Type == "Case")
#> CowId DIM Type Picked Case_ID
#> 1 1 13 Case FALSE -1
#> 2 2 7 Case FALSE -1
#> 3 8 10 Control TRUE 1
#> 4 10 6 Control TRUE 2
Updated: matching 1-1 only
Created on 2021-04-10 by the reprex package (v1.0.0)
Related: https://stackoverflow.com/questions/57330238/combine-2-datasets-and-allocate-values-more-evenly?noredirect=1#comment101211294_57330238
I have 2 data frames of different lengths and with one common column. What I need to do is combine them, but in a way that equally distributes values from a non-common column. So if we have Users:
User Category
John A
John D
Will A
Will E
Bea P
Bea E
Sarah A
Sarah B
And claims:
Category Claim
A 1
A 2
B 3
B 4
D 5
D 6
D 7
D 8
D 9
D 10
D 11
D 12
A 13
A 14
A 15
A 16
A 17
A 18
E 19
E 20
E 21
E 22
E 23
E 24
E 25
E 26
E 27
E 28
P 29
P 30
P 31
P 32
P 33
P 34
I want to provide an equal number of claims for each User based on Category, i.e. the A claims would be split up evenly between the 3 users who have category A.
Here it is; explanations are in the comments:
library("dplyr")
# Creating a "user number" which is their ID among
# other users having this category... When allocating claims, we'll know
# "this is user 2 out of 3 for category A, I need to assign the second third of the A claims."
users <-
users %>%
group_by(Category) %>%
arrange(Category) %>%
mutate(user_number = 1:n(),
total_users = n())
# Same thing for claims: this will allow us to identify the "second third of A claims"
claims <-
claims %>%
group_by(Category) %>%
mutate(claim_number = 1:n(),
total_claims = n())
user_claims <-
users %>%
# full join (by Category) gives all the XXX claims to everyone in category XXX
full_join(claims, by = "Category") %>%
# We only keep the fraction of the claims that "belongs" to the user
filter(claim_number > total_claims * (user_number - 1) / total_users,
claim_number <= total_claims * (user_number) / total_users)
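As a quick sanity check of those filter bounds (my worked example, not part of the original answer): category A has 8 claims and 3 users, so the half-open intervals (0, 2.67], (2.67, 5.33] and (5.33, 8] on claim_number assign 2, 3 and 3 claims respectively:
total_claims <- 8; total_users <- 3
for (user_number in 1:total_users) {
  lo <- total_claims * (user_number - 1) / total_users
  hi <- total_claims * user_number / total_users
  print(which(seq_len(total_claims) > lo & seq_len(total_claims) <= hi))
}
#> [1] 1 2
#> [1] 3 4 5
#> [1] 6 7 8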
library(dplyr)
claim %>%
count(Category, name="Claims") %>%
left_join(user, ., by=c("Category")) %>%
add_count(Category) %>%
mutate(Claims = Claims / n) %>%
select(-n)
#> # A tibble: 8 x 3
#> User Category Claims
#> <fct> <fct> <dbl>
#> 1 John A 2.67
#> 2 John D 8
#> 3 Will A 2.67
#> 4 Will E 5
#> 5 Bea P 6
#> 6 Bea E 5
#> 7 Sarah A 2.67
#> 8 Sarah B 2
Data:
claim <- structure(list(Category = structure(c(1L, 1L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 1L, 1L, 1L, 1L, 1L, 1L, 4L, 4L, 4L, 4L,
4L, 4L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L, 5L, 5L),
.Label = c("A", "B", "D", "E", "P"),
class = "factor"),
Claim = 1:34),
class = "data.frame", row.names = c(NA, -34L))
user <- structure(list(User = structure(c(2L, 2L, 4L, 4L, 1L, 1L, 3L, 3L),
.Label = c("Bea", "John", "Sarah", "Will"),
class = "factor"),
Category = structure(c(1L, 3L, 1L, 4L, 5L, 4L, 1L, 2L),
.Label = c("A", "B", "D", "E", "P"),
class = "factor")),
class = "data.frame", row.names = c(NA, -8L))
Here is a data.table way:
library(data.table)
Users <- data.table(User = rep(c("John","Will","Bea","Sarah"),each = 2), Category = c("A","D","A","E","P","E","A","B"))
set.seed(1)
Claims <- data.table(Category = sample(c("A","D","E","P"), replace = TRUE, 34), Claim = 1:34)
claims_joined <- merge(Users, Claims, by = "Category", allow.cartesian = TRUE)
claims_joined[, mod_base := uniqueN(User), by = .(Category)]
claims_joined <- claims_joined[, .(User = User[1L + (.GRP %% mod_base)][1]), by = .(Category, Claim)]
dcast(claims_joined, Category ~ User, fun.aggregate = length)
Category Bea John Sarah Will
1: A 0 2 3 3
2: D 0 11 0 0
3: E 3 0 0 4
4: P 8 0 0 0
You essentially do a full outer join, then set up an index that increments for each claim. You then take that index modulo the number of users in the category, which picks a rotating user for each claim within the category.
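The rotation itself is plain modular arithmetic; a minimal illustration of the index pattern (my addition):
grp <- 1:6          # consecutive .GRP values within one category
1L + (grp %% 3)     # with mod_base = 3 users, slots rotate through 2, 3, 1
#> [1] 2 3 1 2 3 1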
I am dealing with data with three variables (id, time, gender). It looks like this:
df <-
structure(
list(
id = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L),
time = c(21L, 3L, 4L, 9L, 5L, 9L, 10L, 6L, 27L, 3L, 4L, 10L),
gender = c(1L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 1L, 1L, 1L, 1L)
),
.Names = c("id", "time", "gender"),
class = "data.frame",
row.names = c(NA,-12L)
)
That is, each id has four observations of time and gender. I want to subset this data in R, keeping rows for each id up to and including the first row at which the running sum of time reaches or exceeds 25. Notice that for id 2 all observations are included, and for id 3 only the first observation is involved. The expected result would look like:
df <-
structure(
list(
id = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L ),
time = c(21L, 3L, 4L, 5L, 9L, 10L, 6L, 27L ),
gender = c(1L, 1L, 1L, 0L, 0L, 0L, 0L, 1L)
),
.Names = c("id", "time", "gender"),
class = "data.frame",
row.names = c(NA,-8L)
)
Any help on this is highly appreciated.
One option is to filter on the lag of the cumsum, keeping every row for which the running total of the previous rows is still below 25:
library(dplyr)
df %>% group_by(id,gender) %>%
filter(lag(cumsum(time), default = 0) < 25 )
# # A tibble: 8 x 3
# # Groups: id, gender [3]
# id time gender
# <int> <int> <int>
# 1 1 21 1
# 2 1 3 1
# 3 1 4 1
# 4 2 5 0
# 5 2 9 0
# 6 2 10 0
# 7 2 6 0
# 8 3 27 1
Using data.table (updated based on feedback from @Renu):
library(data.table)
setDT(df)
df[,.SD[shift(cumsum(time), fill = 0) < 25], by=.(id,gender)]
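To see why the lag keeps the crossing row, here are the intermediate vectors for id 1 (my illustration, not part of the original answers):
time <- c(21, 3, 4, 9)                      # id 1
cumsum(time)                                #> 21 24 28 37
dplyr::lag(cumsum(time), default = 0)       #> 0 21 24 28
dplyr::lag(cumsum(time), default = 0) < 25  #> TRUE TRUE TRUE FALSE
The third row is kept because the running total before it (24) is still below 25; the fourth row is dropped.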
Another option would be to create a logical vector for each 'id', cumsum(time) >= 25, that is TRUE when the cumsum of 'time' is equal to or greater than 25.
Then you can filter for rows where the cumsum of this vector is less than or equal to 1, i.e. keep entries up to and including the first TRUE for each 'id'.
df %>%
group_by(id) %>%
filter(cumsum( cumsum(time) >= 25 ) <= 1)
# A tibble: 8 x 3
# Groups: id [3]
# id time gender
# <int> <int> <int>
# 1 1 21 1
# 2 1 3 1
# 3 1 4 1
# 4 2 5 0
# 5 2 9 0
# 6 2 10 0
# 7 2 6 0
# 8 3 27 1
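For id 1, the intermediate steps of the double cumsum look like this (my illustration):
time <- c(21, 3, 4, 9)           # id 1
cumsum(time) >= 25               #> FALSE FALSE TRUE TRUE
cumsum(cumsum(time) >= 25)       #> 0 0 1 2
cumsum(cumsum(time) >= 25) <= 1  #> TRUE TRUE TRUE FALSE
Everything up to and including the first crossing is kept.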
You can also try this dplyr construction:
dt <- group_by(df, id) %>%
  # running sum of time within groups
  mutate(sum_time = cumsum(time)) %>%
  # 'select' rows that fulfill the condition; lagging the sum keeps the
  # row where the running total first reaches 25 (e.g. id 3's single row)
  filter(lag(sum_time, default = 0) < 25) %>%
  # exclude the sum_time column from the result
  select(-sum_time)
I have 100 simulated data sets; a single set is shown below as an example:
pid time status
1 2 1
1 6 0
1 4 1
2 3 0
2 1 1
2 7 1
3 8 1
3 11 1
3 2 0
pid denotes patient id. This indicates that each patient has three records on the time and status column.
I want R code that, for each patient, deletes any status-0 row that is not the patient's first observation; if the first observation has status 0, that row is kept and the remaining status-1 rows that follow it are deleted for that patient. The output should look like:
pid time status
1 2 1
1 4 1
2 3 0
3 8 1
3 11 1
As there are 100 simulated data sets, the positions of 0s and 1s in the status column are not the same across data sets. Could anyone provide R code that performs this task?
Thank you in advance.
The dplyr package can help. I extended your example data with a fourth patient (pid 4) whose records include multiple 0 values.
Group by pid; with the function first you can hold the first value of status which, thanks to the grouping, is available for every record of that pid. Then filter: if the first record has status 0, keep only that row (the row_number() == 1 condition guards against further 0 records, see pid 4); if the first record has status 1, keep all the records with status 1.
df %>%
group_by(pid) %>%
filter((first(status) == 0 & row_number() == 1) | (first(status) == 1 & status == 1))
# A tibble: 6 x 3
# Groups: pid [4]
pid time status
<int> <int> <int>
1 1 2 1
2 1 4 1
3 2 3 0
4 3 8 1
5 3 11 1
6 4 3 0
data:
df <-
structure(
list(
pid = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 4L, 4L, 4L),
time = c(2L, 6L, 4L, 3L, 1L, 7L, 8L, 11L, 2L, 3L, 6L, 8L),
status = c(1L, 0L, 1L, 0L, 1L, 1L, 1L, 1L, 0L, 0L, 1L, 0L)
),
.Names = c("pid", "time", "status"),
class = "data.frame",
row.names = c(NA,-12L)
)
This question is more appropriate on https://stackoverflow.com.
Here is an attempt using tapply() (it's a little verbose):
dat <- structure(list(pid = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L),
time = c(2L, 6L, 4L, 3L, 1L, 7L, 8L, 11L, 2L),
status = c(1L, 0L, 1L, 0L, 1L, 1L, 1L, 1L, 0L)),
.Names = c("pid", "time", "status"), class = "data.frame",
row.names = c(NA, -9L))
ind <- unlist(tapply(dat$status, dat$pid, function(x) {
# browser()
y <- rep(FALSE, length(x))
if (x[1] == 1) {
y[x != 0] <- TRUE
} else {
y[1] <- TRUE
}
y
}))
dat[ind, ]
#> pid time status
#> 1 1 2 1
#> 3 1 4 1
#> 4 2 3 0
#> 7 3 8 1
#> 8 3 11 1
ind is a vector of TRUEs and FALSEs, which will indicate whether a row of dat should be kept or not according to your rules.
I use tapply(X, INDEX, FUN) to apply a function to subsets of a vector (here X = dat$status), which are defined by a grouping factor (here INDEX = dat$pid).
Here, I used an anonymous function (i.e., FUN = function(x){}) to do something with each subset of X.
In particular, I first define y, which I will return later, to be a vector of FALSEs.
If the first status is 1 for a subgroup, I turn all elements that are non-zero (i.e., y[x != 0]) into TRUE.
Otherwise, I turn only the first element (i.e., y[1]) into TRUE.
You may uncomment the browser() statement and watch at the console what the function does: type n (for next) to step through, or type x or y to see what they contain.
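For reference, this is what the indicator evaluates to on this data (my addition); pid 1 keeps rows 1 and 3, pid 2 keeps row 1, pid 3 keeps rows 1 and 2:
unname(ind)
#> [1]  TRUE FALSE  TRUE  TRUE FALSE FALSE  TRUE  TRUE FALSE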
I have 2 dataframes, testx and testy
testx
testx <- structure(list(group = 1:2), .Names = "group", class = "data.frame", row.names = c(NA,
-2L))
testy
testy <- structure(list(group = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L),
time = c(1L, 3L, 4L, 1L, 4L, 5L, 1L, 5L, 7L), value = c(50L,
52L, 10L, 4L, 84L, 2L, 25L, 67L, 37L)), .Names = c("group",
"time", "value"), class = "data.frame", row.names = c(NA, -9L
))
Based on this topic, I add missing time values using the following code, which works perfectly.
data <- setDT(testy, key='time')[, .SD[J(min(time):max(time))], by = group]
Now I would like to add these missing time values only IF the value of group appears in testx. In this example, I thus only want to add missing time values for groups that match the group values in testx.
data <- setDT(testy, key='time')[,if(testy[group %in% testx[, group]]) .SD[J(min(time):max(time))], by = group]
The error I get is "undefined columns selected". I looked here, here and here, but I don't see why my code isn't working. I am doing this on large datasets, which is why I prefer data.table.
You don't need to refer to testy when you are inside testy[] and grouping with by; using group directly as a variable gives the correct result. You also need an extra else branch returning the rows whose group is not in testx, if you want to keep all records in testy:
testy[, {if(group %in% testx$group) .SD[J(min(time):max(time))] else .SD}, by = group]
# group time value
# 1: 1 1 50
# 2: 1 2 NA
# 3: 1 3 52
# 4: 1 4 10
# 5: 2 1 4
# 6: 2 2 NA
# 7: 2 3 NA
# 8: 2 4 84
# 9: 2 5 2
# 10: 3 1 25
# 11: 3 5 67
# 12: 3 7 37
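For comparison, a tidyverse sketch of the same conditional completion (my assumption of equivalent behavior, not part of the original answer):
library(dplyr)
library(tidyr)
testy %>%
  group_by(group) %>%
  group_modify(~ if (.y$group %in% testx$group) {
    # expand time to the full min:max range, filling value with NA
    complete(.x, time = min(.x$time):max(.x$time))
  } else .x) %>%
  ungroup()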
I have trouble combining slice and map.
I am interested in doing something similar to this, which in my case means transforming a compact person-period file into a long (sequential) person-period one. However, because my file is too big, I need to split the data first.
My data look like this
group id var ep dur
1 A 1 a 1 20
2 A 1 b 2 10
3 A 1 a 3 5
4 A 2 b 1 5
5 A 2 b 2 10
6 A 2 b 3 15
7 B 1 a 1 20
8 B 1 a 2 10
9 B 1 a 3 10
10 B 2 c 1 20
11 B 2 c 2 5
12 B 2 c 3 10
What I need is simply this (answer from this)
library(dplyr)
dt %>% slice(rep(1:n(),.$dur))
However, I am interested in introducing a split(.$group).
How am I supposed to do that?
dt %>% split(.$group) %>% map_df(slice(rep(1:n(),.$dur)))
does not work, for example.
My desired output is the same as dt %>% slice(rep(1:n(),.$dur))
which is
group id var ep dur
1 A 1 a 1 20
2 A 1 a 1 20
3 A 1 a 1 20
4 A 1 a 1 20
5 A 1 a 1 20
6 A 1 a 1 20
7 A 1 a 1 20
8 A 1 a 1 20
9 A 1 a 1 20
10 A 1 a 1 20
.....
But I need to split this operation because the file is too big.
data
dt = structure(list(group = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 2L,
2L, 2L, 2L, 2L, 2L), .Label = c("A", "B"), class = "factor"),
id = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 2L,
2L, 2L), .Label = c("1", "2"), class = "factor"), var = structure(c(1L,
2L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 3L, 3L, 3L), .Label = c("a",
"b", "c"), class = "factor"), ep = structure(c(1L, 2L, 3L,
1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L), .Label = c("1", "2",
"3"), class = "factor"), dur = c(20, 10, 5, 5, 10, 15, 20,
10, 10, 20, 5, 10)), .Names = c("group", "id", "var", "ep",
"dur"), row.names = c(NA, -12L), class = "data.frame")
map takes two arguments: a vector/list in .x and a function in .f. It then applies .f to each element of .x.
The function you are passing to map is not formatted correctly. Try this:
f <- function(x) x %>% slice(rep(1:n(), .$dur))
dt %>%
split(.$group) %>%
map_df(f)
You could also use it like this:
dt %>%
split(.$group) %>%
map_df(slice, rep(1:n(), dur))
This time you pass the slice function directly to map, and the additional arguments for slice are supplied after it.
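As a side note (my suggestion, not part of the original answer), if the split is only there to apply slice per group, a grouped slice avoids the split/map round trip entirely; whether it helps with memory depends on your data:
library(dplyr)
dt %>%
  group_by(group) %>%
  slice(rep(1:n(), dur)) %>%
  ungroup()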
I'm not quite sure what your desired final output is, but you could use tidyr to nest the data that you want to repeat and a simple function to expand levels of your nested data, very similar to Tutuchan's answer.
expand_df <- function(df, repeats) {
df %>% slice(rep(1:n(), repeats))
}
dt %>%
tidyr::nest(var:ep) %>%
mutate(expanded = purrr::map2(data, dur, expand_df)) %>%
select(-data) %>%
tidyr::unnest()
Tutuchan's answer gives exactly the same output as your original approach - is that what you were looking for? I don't know if it will have any advantage over your original method.
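As an aside (my note), with tidyr >= 1.0 the nesting calls above are spelled slightly differently while the map2 logic stays the same:
dt %>%
  tidyr::nest(data = var:ep) %>%
  mutate(expanded = purrr::map2(data, dur, expand_df)) %>%
  select(-data) %>%
  tidyr::unnest(expanded)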