I have a data frame in which patients have multiple observations of medication use over time. Some patients have consistently used medication and others have gaps. I am trying to count the patients who have never used medication.
I can't show the actual data but here is an example data frame of what I am working with.
patid meds
1 0
1 1
1 1
2 0
2 0
3 1
3 1
3 1
4 0
5 1
5 0
So in this example, two patients (2 and 4) never used medication. That's what I'm looking for.
I'm fairly new to R and have no idea how to do this; any help would be appreciated.
Here is another alternative from dplyr package.
library(dplyr)
df <- data.frame(patid = c(1,1,1,2,2,3,3,3,4,5,5),
meds = c(0,1,1,0,0,1,1,1,0,1,0))
df %>%
distinct(patid, meds) %>%
arrange(desc(meds))%>%
filter(meds == 0 & !duplicated(patid))
# patid meds
#1 2 0
#2 4 0
Try this:
library(dplyr)
#Data
df <- structure(list(patid = c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L, 4L,
5L, 5L), meds = c(0L, 1L, 1L, 0L, 0L, 1L, 1L, 1L, 0L, 1L, 0L)), class = "data.frame", row.names = c(NA,
-11L))
#Code
df %>% group_by(patid) %>% summarise(sum = sum(meds, na.rm = TRUE)) %>% filter(sum == 0)
# A tibble: 2 x 2
patid sum
<int> <int>
1 2 0
2 4 0
A Base R solution could be
subset(aggregate(meds ~ patid, df, sum), meds == 0)
which returns
patid meds
2 2 0
4 4 0
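Since the question asks for a count, here is a minimal base R sketch that returns both the IDs and the number of never-medicated patients. It assumes meds is coded 0/1 with no NA values; the names never_used, never_ids and n_never are only illustrative.

```r
# Example data from the question.
df <- data.frame(patid = c(1, 1, 1, 2, 2, 3, 3, 3, 4, 5, 5),
                 meds  = c(0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0))

# One logical per patient: TRUE when every observation is 0.
never_used <- tapply(df$meds, df$patid, function(x) all(x == 0))

# IDs of the patients who never used medication, and how many there are.
never_ids <- names(which(never_used))
n_never <- sum(never_used)

print(never_ids)
print(n_never)
```

Using all(x == 0) instead of a sum makes the "never used" condition explicit, but it returns NA for a patient whose only non-zero entries are missing, so NA handling would need to be decided separately.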
I have data structured as below:
ID Day Desired Output
1 1 1
1 1 1
1 1 1
1 2 2
1 2 2
1 3 3
2 4 1
2 4 1
2 5 2
3 6 1
3 6 1
Is it possible to create a sequence for the desired output without using a loop? The dataset is quite large, so a loop would be too slow. Can this be done with the dplyr package, or maybe with a combination of cumsum/diff?
An option is to group by 'ID' and then match 'Day' against the unique values of the 'Day' column within each group.
library(dplyr)
df1 %>%
group_by(ID) %>%
mutate(desired = match(Day, unique(Day)))
data
df1 <- structure(list(ID = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L,
3L), Day = c(1L, 1L, 1L, 2L, 2L, 3L, 4L, 4L, 5L, 6L, 6L)), row.names = c(NA,
-11L), class = "data.frame")
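The same match/unique idea can probably be written without dplyr. The sketch below uses base R's ave() to apply it within each ID; it is an assumed equivalent, not part of the original answer.

```r
# Data from the answer above.
df1 <- data.frame(ID  = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L),
                  Day = c(1L, 1L, 1L, 2L, 2L, 3L, 4L, 4L, 5L, 6L, 6L))

# Within each ID, number the days by the position of their first appearance.
df1$desired <- ave(df1$Day, df1$ID,
                   FUN = function(x) match(x, unique(x)))

print(df1)
```

Note that match(x, unique(x)) numbers days in order of first appearance. If the days are not sorted within an ID and chronological numbering is wanted, match(x, sort(unique(x))) would be one way to get it.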
I need to find a running maximum of a variable by group using R. The variable is sorted by time within group using df[order(df$group, df$time),].
My variable has some NA's but I can deal with it by replacing them with zeros for this computation.
This is what the data frame df looks like:
(df <- structure(list(var = c(5L, 2L, 3L, 4L, 0L, 3L, 6L, 4L, 8L, 4L),
group = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L),
.Label = c("a", "b"), class = "factor"),
time = c(1L, 2L, 3L, 4L, 5L, 1L, 2L, 3L, 4L, 5L)),
.Names = c("var", "group","time"),
class = "data.frame", row.names = c(NA, -10L)))
# var group time
# 1 5 a 1
# 2 2 a 2
# 3 3 a 3
# 4 4 a 4
# 5 0 a 5
# 6 3 b 1
# 7 6 b 2
# 8 4 b 3
# 9 8 b 4
# 10 4 b 5
And I want a variable curMax as:
var | group | time | curMax
5 a 1 5
2 a 2 5
3 a 3 5
4 a 4 5
0 a 5 5
3 b 1 3
6 b 2 6
4 b 3 6
8 b 4 8
4 b 5 8
Please let me know if you have any idea how to implement it in R.
We can try data.table. Convert the 'data.frame' to a 'data.table' with setDT(df1) (df1 here stands for the question's df); then, grouped by 'group', we take the cummax of 'var' and assign it by reference (:=) to a new variable ('curMax').
library(data.table)
setDT(df1)[, curMax := cummax(var), by = group]
As commented by @Michael Chirico, if the data is not already ordered by 'time', we can order it in the 'i' argument:
setDT(df1)[order(time), curMax:=cummax(var), by = group]
Or with dplyr
library(dplyr)
df1 %>%
group_by(group) %>%
mutate(curMax = cummax(var))
If df1 is a tbl_sql, explicit ordering might be required, using arrange:
df1 %>%
group_by(group) %>%
arrange(time, .by_group=TRUE) %>%
mutate(curMax = cummax(var))
or dbplyr::window_order
library(dbplyr)
df1 %>%
group_by(group) %>%
window_order(time) %>%
mutate(curMax = cummax(var))
You can do it like this:
df$curMax <- ave(df$var, df$group, FUN=cummax)
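The question mentions NA values in var. cummax() propagates an NA to every later position in the group, so the replacement with zero has to happen before the running maximum is taken. A small sketch of that step follows; the two NA values are inserted into the example data purely for illustration, and var0 is a hypothetical helper name.

```r
# The question's data, with two NAs added for illustration (an assumption).
df <- data.frame(var   = c(5L, NA, 3L, 4L, 0L, 3L, 6L, NA, 8L, 4L),
                 group = rep(c("a", "b"), each = 5),
                 time  = rep(1:5, times = 2))

# Sort by time within group, as described in the question.
df <- df[order(df$group, df$time), ]

# Replace NA with 0 for this computation only; df$var itself is left untouched.
var0 <- replace(df$var, is.na(df$var), 0L)

# Running maximum within each group.
df$curMax <- ave(var0, df$group, FUN = cummax)

print(df)
```

Replacing NA with zero only makes sense when var cannot be negative; otherwise a value such as -Inf may be a safer placeholder.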
I have the following data frame:
df =
id Year Value
1 1 3
1 2 4
2 1 6
2 2 2
2 2 3
3 1 7
3 2 3
I want to count the number of times an individual id has a duplicated year.
Desired Outcome:
1
Id 2 has year 2 twice, that's why 1 is the outcome
So far I have tried:
library("dplyr")
df %>% group_by(id, Year) %>% summarize(count=n())
but I cannot get a single number with the count
Cheers
We can use table and create counts of observation for each id and year and then calculate the ones which occur more than 1 time.
sum(table(df$id, df$Year) > 1)
#[1] 1
Just for completion, if we want to do this in dplyr
library(dplyr)
df %>%
group_by(id, Year) %>%
summarise(count= n()) %>%
ungroup() %>%
summarise(new_count = sum(count > 1))
# new_count
# <int>
#1 1
Just for fun:
data.table solution:
data:
library(data.table)
dt <- fread("id Year Value
1 1 3
1 2 4
2 1 6
2 2 2
2 2 3
3 1 7
3 2 3")
code:
sum(dt[, .N > 1, by = c("id", "Year")]$V1)
A (fast) alternative:
sum(sapply(split(df$Year, df$id), function(x) any(duplicated(x))))
Where:
df <- data.frame(
id = c(1L, 1L, 2L, 2L, 2L, 3L, 3L),
Year = c(1L, 2L, 1L, 2L, 2L, 1L, 2L),
Value = c(3L, 4L, 6L, 2L, 3L, 7L, 3L)
)
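The answers above do not all count the same thing: the table-based ones count (id, Year) pairs that occur more than once, while the split/sapply one counts ids with at least one repeated year. On this data all of them give 1, but they can differ when an id repeats several years. The sketch below, using base R's duplicated() on the two key columns, shows three possible readings side by side; the variable names are illustrative.

```r
df <- data.frame(
  id    = c(1L, 1L, 2L, 2L, 2L, 3L, 3L),
  Year  = c(1L, 2L, 1L, 2L, 2L, 1L, 2L),
  Value = c(3L, 4L, 6L, 2L, 3L, 7L, 3L)
)

# TRUE for the second and later occurrence of each (id, Year) pair.
dup_rows <- duplicated(df[c("id", "Year")])

# Number of surplus rows (a pair seen three times would contribute 2).
n_extra_rows <- sum(dup_rows)

# Number of distinct (id, Year) pairs that occur more than once.
n_dup_pairs <- nrow(unique(df[dup_rows, c("id", "Year")]))

# Number of ids that have at least one repeated year.
n_dup_ids <- length(unique(df$id[dup_rows]))

print(c(rows = n_extra_rows, pairs = n_dup_pairs, ids = n_dup_ids))
```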
I have 100 simulated data sets, for example a single set is shown below
pid time status
1 2 1
1 6 0
1 4 1
2 3 0
2 1 1
2 7 1
3 8 1
3 11 1
3 2 0
pid denotes the patient id. Each patient has three records in the time and status columns.
I want to write R code that applies the following rule per patient. If a row with status 0 is not the patient's first observation, delete it. If the patient's first observation has status 0, keep that row and delete all of that patient's remaining rows, including those with status 1 that follow it. The output should look like
pid time status
1 2 1
1 4 1
2 3 0
3 8 1
3 11 1
As there are 100 simulated data sets the positions of 0's and 1's in the status column are not the same for all the data. Could anyone be of help to provide R code that can perform this task?
Thank you in advance.
The dplyr package can help. I added a fourth patient to your example data so that one pid has multiple 0 values.
Group by pid and use first() to get each patient's first status; because of the grouping, this value applies to all of that pid's records. Then filter: if the first status is 0, keep only the first row (row_number() == 1), in case there are more records with 0 (see pid 4); if the first status is 1, keep all records with status 1.
library(dplyr)
df %>%
group_by(pid) %>%
filter((first(status) == 0 & row_number() == 1) | (first(status) == 1 & status == 1))
# A tibble: 6 x 3
# Groups: pid [4]
pid time status
<int> <int> <int>
1 1 2 1
2 1 4 1
3 2 3 0
4 3 8 1
5 3 11 1
6 4 3 0
data:
df <-
structure(
list(
pid = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 4L, 4L, 4L),
time = c(2L, 6L, 4L, 3L, 1L, 7L, 8L, 11L, 2L, 3L, 6L, 8L),
status = c(1L, 0L, 1L, 0L, 1L, 1L, 1L, 1L, 0L, 0L, 1L, 0L)
),
.Names = c("pid", "time", "status"),
class = "data.frame",
row.names = c(NA,-12L)
)
This question is more appropriate on https://stackoverflow.com.
Here is an attempt using tapply() (it's a little verbose):
dat <- structure(list(pid = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L),
time = c(2L, 6L, 4L, 3L, 1L, 7L, 8L, 11L, 2L),
status = c(1L, 0L, 1L, 0L, 1L, 1L, 1L, 1L, 0L)),
.Names = c("pid", "time", "status"), class = "data.frame",
row.names = c(NA, -9L))
ind <- unlist(tapply(dat$status, dat$pid, function(x) {
# browser()
y <- (rep(FALSE, length(x)))
if (x[1] == 1) {
y[x != 0] <- TRUE
} else {
y[1] <- TRUE
}
y
}))
dat[ind, ]
#> pid time status
#> 1 1 2 1
#> 3 1 4 1
#> 4 2 3 0
#> 7 3 8 1
#> 8 3 11 1
ind is a vector of TRUEs and FALSEs, which will indicate whether a row of dat should be kept or not according to your rules.
I use tapply(X, INDEX, FUN) to apply a function to subsets of a vector (here X = dat$status), which are defined by a grouping factor (here INDEX = dat$pid).
Here, I used an anonymous function (i.e., FUN = function(x){}) to do something with each subset of X.
In particular, I first define y, which I will return later, to be a vector of FALSEs.
If the first status is 1 for a subgroup, I turn all elements that are non-zero (i.e., y[x != 0]) into TRUE.
Otherwise, I turn only the first element (i.e., y[1]) into TRUE.
You may uncomment the browser() statement and see at the console what the function does by typing n (for next) or x or y (to see what they are).
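One caveat with the tapply() approach: unlist() returns the flags group by group, so it lines up with the rows of dat only when the rows are already sorted by pid. A possible variant, sketched below with ave(), keeps every result in its original row position. The helper names first_status, pos and keep are assumptions for illustration.

```r
dat <- data.frame(pid    = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L),
                  time   = c(2L, 6L, 4L, 3L, 1L, 7L, 8L, 11L, 2L),
                  status = c(1L, 0L, 1L, 0L, 1L, 1L, 1L, 1L, 0L))

# First status of each patient, repeated on every row of that patient.
first_status <- ave(dat$status, dat$pid, FUN = function(x) x[1])

# Row number within each patient (1, 2, 3, ...).
pos <- ave(seq_along(dat$status), dat$pid, FUN = seq_along)

# Keep the first row when the patient starts with 0;
# otherwise keep every row with status 1.
keep <- (first_status == 0 & pos == 1) | (first_status == 1 & dat$status == 1)

print(dat[keep, ])
```

This mirrors the logic of the dplyr filter above, with first_status playing the role of first(status) and pos the role of row_number().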