subset R dataframe using multiple functions/constraints [duplicate] - r

This question already has answers here:
Extract row corresponding to minimum value of a variable by group
(9 answers)
Closed 1 year ago.
I would like to subset a dataframe (dat) in 2 steps. First constraint: keep unique "id". Second constraint: keep the smallest "visit".
For example, for id=S2, I want to keep the 3rd row which has visit #1, rather than row2 with visit #2.
set.seed(42) ## for sake of reproducibility
n <- 6
dat <- data.frame(id=c("s1","s2","s2","s3","s4","s4"),
date=seq.Date(as.Date("2020-12-26"), as.Date("2020-12-31"), "day"),
visit=1:2,
age=sample(18:30, n, replace=TRUE))
#dat
# id date visit age
# 1 s1 2020-12-26 1 18
# 2 s2 2020-12-27 2 22
# 3 s2 2020-12-28 1 18
# 4 s3 2020-12-29 2 26
# 5 s4 2020-12-30 1 27
# 6 s4 2020-12-31 2 21
#desired output:
# id date visit age
# 1 s1 2020-12-26 1 18
# 3 s2 2020-12-28 1 18
# 4 s3 2020-12-29 2 26
# 5 s4 2020-12-30 1 27

base R
dat[ave(dat$visit, dat$id, FUN = function(z) seq_along(z) == which.min(z)) > 0,]
# id date visit age
# 1 s1 2020-12-26 1 18
# 3 s2 2020-12-28 1 18
# 4 s3 2020-12-29 2 26
# 5 s4 2020-12-30 1 27
dplyr
library(dplyr)
dat %>%
group_by(id) %>%
slice(which.min(visit)) %>%
ungroup()
# # A tibble: 4 x 4
# id date visit age
# <chr> <date> <int> <int>
# 1 s1 2020-12-26 1 18
# 2 s2 2020-12-28 1 18
# 3 s3 2020-12-29 2 26
# 4 s4 2020-12-30 1 27
data.table
library(data.table)
as.data.table(dat)[, .SD[which.min(visit),], by = id]
# id date visit age
# <char> <Date> <int> <int>
# 1: s1 2020-12-26 1 18
# 2: s2 2020-12-28 1 18
# 3: s3 2020-12-29 2 26
# 4: s4 2020-12-30 1 27

This seems to be the solution with the simplest syntax:
library(dplyr)
dat %>%
group_by(id) %>%
filter(age == min(age))

Related

How can I create a column in R that calculates a value of a row based on the previous row's value?

Here's what I would like to a achieve as a function in Excel, but I can't seem to find a solution to do it in R.
This is what I tried to do but it does not seem to allow me to operate with the previous values of the new column I'm trying to make.
Here is a reproducible example:
library(dplyr)
set.seed(42) ## for sake of reproducibility
dat <- data.frame(date=seq.Date(as.Date("2020-12-26"), as.Date("2020-12-31"), "day"))
This would be the output of the dataframe:
dat
date
1 2020-12-26
2 2020-12-27
3 2020-12-28
4 2020-12-29
5 2020-12-30
6 2020-12-31
Desired output:
date periodNumber
1 2020-12-26 1
2 2020-12-27 2
3 2020-12-28 3
4 2020-12-29 4
5 2020-12-30 5
6 2020-12-31 6
My try at this:
dat %>%
mutate(periodLag = dplyr::lag(date)) %>%
mutate(periodNumber = ifelse(is.na(periodLag)==TRUE, 1,
ifelse(date == periodLag, dplyr::lag(periodNumber), (dplyr::lag(periodNumber) + 1))))
Excel formula screenshot:
You could use dplyr's cur_group_id():
library(dplyr)
set.seed(42)
# I used a larger example
dat <- data.frame(date=sample(seq.Date(as.Date("2020-12-26"), as.Date("2020-12-31"), "day"), size = 30, replace = TRUE))
dat %>%
arrange(date) %>% # needs sorting because of the random example
group_by(date) %>%
mutate(periodNumber = cur_group_id())
This returns
# A tibble: 30 x 2
# Groups: date [6]
date periodNumber
<date> <int>
1 2020-12-26 1
2 2020-12-26 1
3 2020-12-26 1
4 2020-12-26 1
5 2020-12-26 1
6 2020-12-26 1
7 2020-12-26 1
8 2020-12-26 1
9 2020-12-27 2
10 2020-12-27 2
11 2020-12-27 2
12 2020-12-27 2
13 2020-12-27 2
14 2020-12-27 2
15 2020-12-27 2
16 2020-12-28 3
17 2020-12-28 3
18 2020-12-28 3
19 2020-12-29 4
20 2020-12-29 4
21 2020-12-29 4
22 2020-12-29 4
23 2020-12-29 4
24 2020-12-29 4
25 2020-12-30 5
26 2020-12-30 5
27 2020-12-30 5
28 2020-12-30 5
29 2020-12-30 5
30 2020-12-31 6

R conditional count of unique value over date range/window

In R, how can you count the number of observations fulfilling a condition over a time range?
Specifically, I want to count the number of different id by country over the last 8 months, but only if id occurs at least twice during these 8 months. Hence, for the count, it does not matter whether an id occurs 2x or 100x (doing this in 2 steps is maybe easier). NA exists both in id and country. Since this could otherwise be taken care off, accounting for this is not necessary but still helpful.
My current best try is, but does not account for the restriction (ID must appear at least twice in the previous 8 months) and also I find its counting odd when looking at the dates="2017-12-12", where desired_unrestricted should be equal to 4 according to my counting but the code gives 2.
dt[, date := as.Date(date)][
, totalids := sapply(date,
function(x) length(unique(id[between(date, x - lubridate::month(8), x)]))),
by = country]
Data
library(data.table)
library(lubridate)
ID <- c("1","1","1","1","1","1","2","2","2","3","3",NA,"4")
Date <- c("2017-01-01","2017-01-01", "2017-01-05", "2017-05-01", "2017-05-01","2018-05-02","2017-01-01", "2017-01-05", "2017-05-01", "2017-05-01","2017-05-01","2017-12-12","2017-12-12" )
Value <- c(2,4,3,5,2,5,8,17,17,3,7,5,3)
Country <- c("UK","UK","US","US",NA,"US","UK","UK","US","US","US","US","US")
Desired <- c(1,1,0,2,NA,0,1,2,2,2,2,1,1)
Desired_unrestricted <- c(2,2,1,3,NA,1,2,2,3,3,3,4,4)
dt <- data.frame(id=ID, date=Date, value=Value, country=Country, desired_output=Desired, desired_unrestricted=Desired_unrestricted)
setDT(dt)
Thanks in advance.
This data.table-only answer is motivated by a comment,
dt[, date := as.Date(date)] # if not already `Date`-class
dt[, date8 := do.call(c, lapply(dt$date, function(z) seq(z, length=2, by="-8 months")[2]))
][, results := dt[dt, on = .(country, date > date8, date <= date),
length(Filter(function(z) z > 1, table(id))), by = .EACHI]$V1
][, date8 := NULL ]
# id date value country desired_output desired_unrestricted results
# <char> <Date> <num> <char> <num> <num> <int>
# 1: 1 2017-01-01 2 UK 1 2 1
# 2: 1 2017-01-01 4 UK 1 2 1
# 3: 1 2017-01-05 3 US 0 1 0
# 4: 1 2017-05-01 5 US 1 3 2
# 5: 1 2017-05-01 2 <NA> NA NA 0
# 6: 1 2018-05-02 5 US 0 1 0
# 7: 2 2017-01-01 8 UK 1 2 1
# 8: 2 2017-01-05 17 UK 2 2 2
# 9: 2 2017-05-01 17 US 1 3 2
# 10: 3 2017-05-01 3 US 2 3 2
# 11: 3 2017-05-01 7 US 2 3 2
# 12: <NA> 2017-12-12 5 US 2 4 1
# 13: 4 2017-12-12 3 US 2 4 1
That's a lot to absorb.
Quick walk-through:
"8 months ago":
seq(z, length=2, by="-8 months")[2]
seq.Date (inferred by calling seq with a Date-class first argument) starts at z (current date for each row) and produces a sequence of length 2 with 8 months between them. seq always starts at the first argument, so length=1 won't work (it'll only return z); length=2 guarantees that the second value in the returned vector will be the "8 months before date" that we need.
Date subtraction:
[, date8 := do.call(c, lapply(dt$date, function(z) seq(...)[2])) ]
A simple base-R method for subtracting 8 months is seq(date, length=2, by="-8 months")[2]. seq.Date requires its first argument to be length-1, so we need to sapply or lapply it; unfortunately, sapply drops the class, so we lapply it and then programmatically combine them with do.call(c, ...) (since c(..) creates a list-column, and unlist will de-class it). (Perhaps this part can be improved.)
We need that in dt first since we do a non-equi (range-based) join based on this value.
Counting id with 2 or more visits:
length(Filter(function(z) z > 1, table(id)))
We produce a table(id), which gives us the count of each id within the join-period. Filter(fun, ...) allows us to reduce those that have a count below 2, and we're left with a named-vector of ids that had 2 or more visits. Retrieving the length is what we need.
Self non-equi join:
dt[dt, on = .(country, date > date8, date <= date), ... ]
Relatively straight-forward. This is an open/closed ranging, it can be changed to both-closed if you prefer.
Self non-equi join but count ids by-row: by=.EACHI.
Retrieve the results of that and assign into the original dt:
[, results := dt[...]$V1 ]
Since the non-equi join included a value (length(Filter(...))) without a name, it's named V1, and all we want is that. (To be honest, I don't know exactly why assigning it more directly doesn't work ... but the counts are all wrong. Perhaps it's backwards by-row tallying.)
Cleanup:
[, date8 := NULL ]
(Nothing fancy here, just proper data-stewardship :-)
There are some discrepancies in my counts versus your desired_output, I wonder if those are just typos in the OP; I think the math is right ...
Here is another option:
setkey(dt, country, date, id)
dt[, date := as.IDate(date)][,
eightmthsago := as.IDate(sapply(as.IDate(date), function(x) seq(x, by="-8 months", length.out=2L)[2L]))]
dt[, c("out", "out_unres") :=
dt[dt, on=.(country, date>=eightmthsago, date<=date),
by=.EACHI, {
v <- id[!is.na(id)]
.(uniqueN(v[duplicated(v)]), uniqueN(v))
}][,1L:3L := NULL]
]
dt
output (like r2evans, I am also getting different output from desired as there seems to be a miscount in the desired output):
id date value country desired_output desired_unrestricted eightmthsago out out_unres
1: 1 2017-05-01 2 <NA> NA NA 2016-09-01 0 1
2: 1 2017-01-01 2 UK 1 2 2016-05-01 1 2
3: 1 2017-01-01 4 UK 1 2 2016-05-01 1 2
4: 2 2017-01-01 8 UK 1 2 2016-05-01 1 2
5: 2 2017-01-05 17 UK 2 2 2016-05-05 2 2
6: 1 2017-01-05 3 US 0 1 2016-05-05 0 1
7: 1 2017-05-01 5 US 1 3 2016-09-01 2 3
8: 2 2017-05-01 17 US 1 3 2016-09-01 2 3
9: 3 2017-05-01 3 US 2 3 2016-09-01 2 3
10: 3 2017-05-01 7 US 2 3 2016-09-01 2 3
11: <NA> 2017-12-12 5 US 2 4 2017-04-12 1 4
12: 4 2017-12-12 3 US 2 4 2017-04-12 1 4
13: 1 2018-05-02 5 US 0 1 2017-09-02 0 2
Although this question is tagged with data.table, here is a dplyr::rowwise solution to the problem. Is this what you had in mind? The output looks valid to me: The number of ìds in the last 8 months which have a count of at least greater than 2.
library(dplyr)
library(lubridate)
dt <- dt %>% mutate(date = as.Date(date))
dt %>%
group_by(country) %>%
group_modify(~ .x %>%
rowwise() %>%
mutate(totalids = .x %>%
filter(date <= .env$date, date >= .env$date %m-% months(8)) %>%
pull(id) %>%
table() %>%
`[`(. >1) %>%
length
))
#> # A tibble: 13 x 7
#> # Groups: country [3]
#> country id date value desired_output desired_unrestricted totalids
#> <chr> <chr> <date> <dbl> <dbl> <dbl> <int>
#> 1 UK 1 2017-01-01 2 1 2 1
#> 2 UK 1 2017-01-01 4 1 2 1
#> 3 UK 2 2017-01-01 8 1 2 1
#> 4 UK 2 2017-01-05 17 2 2 2
#> 5 US 1 2017-01-05 3 0 1 0
#> 6 US 1 2017-05-01 5 1 3 2
#> 7 US 1 2018-05-02 5 0 1 0
#> 8 US 2 2017-05-01 17 1 3 2
#> 9 US 3 2017-05-01 3 2 3 2
#> 10 US 3 2017-05-01 7 2 3 2
#> 11 US <NA> 2017-12-12 5 2 4 1
#> 12 US 4 2017-12-12 3 2 4 1
#> 13 <NA> 1 2017-05-01 2 NA NA 0
Created on 2021-09-02 by the reprex package (v2.0.1)

Selecting distinct entries based on specific variables in R

I want to select distinct entries for my dataset based on two specific variables. I may, in fact, like to create a subset and do analysis using each subset.
The data set looks like this
id <- c(3,3,6,6,4,4,3,3)
date <- c("2017-1-1", "2017-3-3", "2017-4-3", "2017-4-7", "2017-10-1", "2017-11-1", "2018-3-1", "2018-4-3")
date_cat <- c(1,1,1,1,2,2,3,3)
measurement <- c(10, 13, 14,13, 12, 11, 14, 17)
myData <- data.frame(id, date, date_cat, measurement)
myData
myData$date1 <- as.Date(myData$date)
myData
id date date_cat measurement date1
1 3 2017-1-1 1 10 2017-01-01
2 3 2017-3-3 1 13 2017-03-03
3 6 2017-4-3 1 14 2017-04-03
4 6 2017-4-7 1 13 2017-04-07
5 4 2017-10-1 2 12 2017-10-01
6 4 2017-11-1 2 11 2017-11-01
7 3 2018-3-1 3 14 2018-03-01
8 3 2018-4-3 3 17 2018-04-03
#select the last date for the ID in each date category.
Here date_cat is the date category and date1 is date formatted as date. How can I get the last date for each ID in each date_category?
I want my data to show up as
id date date_cat measurement date1
1 3 2017-3-3 1 13 2017-03-03
2 6 2017-4-7 1 13 2017-04-07
3 4 2017-11-1 2 11 2017-11-01
4 3 2018-4-3 3 17 2018-04-03
Thanks!
I am not sure if you want something like below
subset(myData,ave(date1,id,date_cat,FUN = function(x) tail(sort(x),1))==date1)
which gives
> subset(myData,ave(date1,id,date_cat,FUN = function(x) tail(sort(x),1))==date1)
id date date_cat measurement date1
2 3 2017-3-3 1 13 2017-03-03
4 6 2017-4-7 1 13 2017-04-07
6 4 2017-11-1 2 11 2017-11-01
8 3 2018-4-3 3 17 2018-04-03
Using data.table:
library(data.table)
myData_DT <- as.data.table(myData)
myData_DT[, .SD[.N] , by = .(date_cat, id)]
We could create a group with rleid on the 'id' column, slice the last row, remove the temporary grouping column
library(dplyr)
library(data.table)
myData %>%
group_by(grp = rleid(id)) %>%
slice(n()) %>%
ungroup %>%
select(-grp)
# A tibble: 4 x 5
# id date date_cat measurement date1
# <dbl> <chr> <dbl> <dbl> <date>
#1 3 2017-3-3 1 13 2017-03-03
#2 6 2017-4-7 1 13 2017-04-07
#3 4 2017-11-1 2 11 2017-11-01
#4 3 2018-4-3 3 17 2018-04-03
Or this can be done on the fly without creating a temporary column
myData %>%
filter(!duplicated(rleid(id), fromLast = TRUE))
Or using base R with subset and rle
subset(myData, !duplicated(with(rle(id),
rep(seq_along(values), lengths)), fromLast = TRUE))
# id date date_cat measurement date1
#2 3 2017-3-3 1 13 2017-03-03
#4 6 2017-4-7 1 13 2017-04-07
#6 4 2017-11-1 2 11 2017-11-01
#8 3 2018-4-3 3 17 2018-04-03
Using dplyr:
myData %>%
group_by(id,date_cat) %>%
top_n(1,date)

Repeated measures in messy format, need help to tidy

I have a very large data set containing weekly weights that have been coded with week of study and the weight at that visit. There are some missing visits and the data is not currently aligned.
df <- data.frame(ID=1:3, Week_A=c(6,6,7), Weight_A=c(23,24,23), Week_B=c(7,7,8),
Weight_B=c(25,26,27), Week_C=c(8,9,9), Weight_C=c(27,26,28))
df
ID Week_A Weight_A Week_B Weight_B Week_C Weight_C
1 1 6 23 7 25 8 27
2 2 6 24 7 26 9 26
3 3 7 23 8 27 9 28
I would like to align the data by week number (ideal output below).
df_ideal <- data.frame (ID=1:3, Week_6=c(23,24,NA), Week_7=c(25,26,23),
Week_8=c(27,NA,27), Week_9=c(NA,26,28))
df_ideal
ID Week_6 Week_7 Week_8 Week_9
1 1 23 25 27 NA
2 2 24 26 NA 26
3 3 NA 23 27 28
I would appreciate some help with this, even to find a starting point to manipulate this data to an easier to manage format.
A tidyverse solution:
df <- data.frame(ID=1:3,
Week_A=c(6,6,7),
Weight_A=c(23,24,23),
Week_B=c(7,7,8),
Weight_B=c(25,26,27),
Week_C=c(8,9,9),
Weight_C=c(27,26,28))
library(tidyverse)
df_long <- df %>% gather(key="v", value="value", -ID) %>%
separate(v, into=c("v1", "v2")) %>%
spread(v1, value) %>%
complete(ID, Week) %>%
arrange(Week, ID)
df_long
# A tibble: 12 x 4
# ID Week v2 Weight
# <int> <dbl> <chr> <dbl>
# 1 1 6 A 23
# 2 2 6 A 24
# 3 3 6 <NA> NA
# 4 1 7 B 25
# 5 2 7 B 26
# 6 3 7 A 23
# 7 1 8 C 27
# 8 2 8 <NA> NA
# 9 3 8 B 27
#10 1 9 <NA> NA
#11 2 9 C 26
#12 3 9 C 28
df_wide <- df_long %>% select(-v2) %>%
spread(Week, Weight, sep="_")
df_wide
# A tibble: 3 x 5
# ID Week_6 Week_7 Week_8 Week_9
# <int> <dbl> <dbl> <dbl> <dbl>
#1 1 23 25 27 NA
#2 2 24 26 NA 26
#3 3 NA 23 27 28
Personally, I'd keep using df_long instead of df_wide, as it is a tidy data frame, while df_wide is not.
Here is a possible approach using the data.table package
library(data.table)
#convert into a data.table
setDT(df)
#convert into a long format
mdat <- melt(df, id.vars="ID", measure.vars=patterns("^Week", "^Weight", cols=names(df)))
#pivot into desired output
ans <- dcast(mdat, ID ~ value1, value.var="value2")
ans output:
ID 6 7 8 9
1: 1 23 25 27 NA
2: 2 24 26 NA 26
3: 3 NA 23 27 28
And if you really need the "Week_" in your column names, you can use
setnames(ans, names(ans)[-1L], paste("Week_", names(ans)[-1L]))
Another tidyverse solution using a double-gather with a final spread
df %>%
gather(k, v, -ID, -starts_with("Weight")) %>%
separate(k, into = c("k1", "k2")) %>%
unite(k1, k1, v) %>%
gather(k, v, starts_with("Weight")) %>%
separate(k, into = c("k3", "k4")) %>%
filter(k2 == k4) %>%
select(-k2, -k3, -k4) %>%
spread(k1, v)
# ID Week_6 Week_7 Week_8 Week_9
#1 1 23 25 27 NA
#2 2 24 26 NA 26
#3 3 NA 23 27 28
In base R, it's a double reshape, firstly to long and then back to wide on a different variable:
tmp <- reshape(df, idvar="ID", varying=lapply(c("Week_","Weight_"), grep, names(df)),
v.names=c("time","Week"), direction="long")
reshape(tmp, idvar="ID", direction="wide", sep="_")
# ID Week_6 Week_7 Week_8 Week_9
#1.1 1 23 25 27 NA
#2.1 2 24 26 NA 26
#3.1 3 NA 23 27 28

Looping over unique values [closed]

Closed. This question needs debugging details. It is not currently accepting answers.
Edit the question to include desired behavior, a specific problem or error, and the shortest code necessary to reproduce the problem. This will help others answer the question.
Closed 6 years ago.
Improve this question
I have a data frame in long format, with one observation row per measurement. I want to loop through each unique ID and find the "minimum" date for each unique individual. For example, patient 1 may be measured at three different times, but I want the earliest time. I thought about sorting the dataset by the date (in increasing order) and removing all duplicates, but I'm not sure if this is the best way to go. Any help or suggestions would be greatly appreciated. Thank you!
We can use data.table. Convert the 'data.frame' to 'data.table' (setDT(df1)), grouped by 'ID', order the 'Date' (assuming that it is in Date class or else change to Date class with as.Date with correct format), and get the first observation with head
library(data.table)
setDT(df1)[order(Date), head(.SD, 1), by = ID]
Here is another way using basic R:
earliestDates = aggregate(list(date = df$date), list(ID = df$ID), min)
result = merge(earliestDates,df)
earliestDates is a two column data frame that has the minimum date by ID. The merge will join the values in the other columns.
Example:
set.seed(1)
ID = floor(runif(20,1,5))
day = as.Date(floor(runif(20,1,25)),origin = "2017-1-1")
weight = floor(runif(20,80,95))
df = data.frame(ID = ID, date = day, weight = weight)
> df
ID date weight
1 2 2017-01-24 92
2 2 2017-01-07 89
3 3 2017-01-17 91
4 4 2017-01-05 88
5 1 2017-01-08 87
6 4 2017-01-11 91
7 4 2017-01-02 80
8 3 2017-01-11 87
9 3 2017-01-22 90
10 1 2017-01-10 90
11 1 2017-01-13 87
12 1 2017-01-16 92
13 3 2017-01-13 86
14 2 2017-01-06 83
15 4 2017-01-21 81
16 2 2017-01-18 81
17 3 2017-01-21 84
18 4 2017-01-04 87
19 2 2017-01-19 89
20 4 2017-01-11 86
After the aggregate and merge, the result is:
> result
ID date weight
1 1 2017-01-08 87
2 2 2017-01-06 83
3 3 2017-01-11 87
4 4 2017-01-02 80
Try the following dplyr code:
library(dplyr)
set.seed(12345)
###Create test dataset
tb <- tibble(id = rep(1:10, each = 3),
date = rep(seq(as.Date("2017-07-01"), by=10, len=10), 3),
obs = rnorm(30))
# # A tibble: 30 × 3
# id date obs
# <int> <date> <dbl>
# 1 2017-07-01 0.5855288
# 1 2017-07-11 0.7094660
# 1 2017-07-21 -0.1093033
# 2 2017-07-31 -0.4534972
# 2 2017-08-10 0.6058875
# 2 2017-08-20 -1.8179560
# 3 2017-08-30 0.6300986
# 3 2017-09-09 -0.2761841
# 3 2017-09-19 -0.2841597
# 4 2017-09-29 -0.9193220
# # ... with 20 more rows
###Pipe the dataset through dplyr's 'group_by' and 'filter' commands
tb %>% group_by(id) %>%
filter(date == min(date)) %>%
ungroup() %>%
distinct()
# # A tibble: 10 × 3
# id date obs
# <int> <date> <dbl>
# 1 2017-07-01 0.5855288
# 2 2017-07-31 -0.4534972
# 3 2017-08-30 0.6300986
# 4 2017-07-01 -0.1162478
# 5 2017-07-21 0.3706279
# 6 2017-08-20 0.8168998
# 7 2017-07-01 0.7796219
# 8 2017-07-11 1.4557851
# 9 2017-08-10 -1.5977095
# 10 2017-09-09 0.6203798

Resources