Preamble:
The main problem is how to subset a data.table based on IDs, forming subsets within each ID based on consecutive time differences. A hint regarding this would be most welcome.
The complete question/setup:
I have a dataset dt in data.table format that looks like
  date        id val1 val2
  %d.%m.%Y
1 01.01.2000   1    5   10
2 09.01.2000   1    4    9
3 01.08.2000   1    3    8
4 01.01.2000   2    2    7
5 01.01.2000   3    1    6
6 14.01.2000   3    7    5
7 28.01.2000   3    8    4
8 01.06.2000   3    9    3
I want to combine observations (grouped by id) that are no more than two weeks apart, measured consecutively from one observation to the next. By combining I mean that, for each subset, I
keep the value of the last observation of val1
replace val2 of the last observation with the sum of all values of val2 in the group
add a counter for how many observations came together in this group.
I.e., I want to end up with a dataset like this
  date        id val1 val2 counter
  %d.%m.%Y
2 09.01.2000   1    4   19       2
3 01.08.2000   1    3    8       1
4 01.01.2000   2    2    7       1
7 28.01.2000   3    8   15       3
8 01.06.2000   3    9    3       1
Still, I am trying to wrap my head around data.table functions, particularly .SD, and want to solve the issue with these tools.
So far I know
that I can indicate what I mean by first and last using setkey(dt, date)
that I can replace the last val2 of a subset with the sum of the group's val2, e.g.
dt[, val2 := replace(val2, .N, sum(val2, na.rm = TRUE)), by = id]
that I get the length of a subset with .N
how to delete rows
that I can calculate the difference between two dates with difftime(strptime(dt$date[1], format = "%d.%m.%Y"), strptime(dt$date[2], format = "%d.%m.%Y"), units = "weeks")
However, I can't get my head around how to subset the observations such that each subset contains only observations of the same id whose consecutive dates are at most two weeks apart.
Any help is appreciated. Many thanks in advance.
The trick is to use cumsum() on a condition. In this case, the condition is the gap between consecutive dates being more than 14 days. Whenever the condition is true, the cumulative sum increments, starting a new group.
library(dplyr)

df %>%
  mutate(rownumber = row_number()) %>%
  group_by(id) %>%
  mutate(interval = as.numeric(as.Date(date, format = "%d.%m.%Y") -
                               as.Date(lag(date), format = "%d.%m.%Y"))) %>%
  mutate(interval = ifelse(is.na(interval), 0, interval)) %>%
  mutate(group = cumsum(interval > 14) + 1) %>%
  ungroup() %>%
  group_by(id, group) %>%
  summarise(
    rownumber = last(rownumber),
    date      = last(date),
    val1      = last(val1),
    val2      = sum(val2),
    counter   = n()
  ) %>%
  select(rownumber, date, id, val1, val2, counter)
Output
rownumber date id val1 val2 counter
<int> <chr> <int> <int> <int> <int>
1 2 09.01.2000 1 4 19 2
2 3 01.08.2000 1 3 8 1
3 4 01.01.2000 2 2 7 1
4 7 28.01.2000 3 8 15 3
5 8 01.06.2000 3 9 3 1
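Since the question specifically asks about data.table, here is a sketch of the same cumsum-on-a-condition trick in data.table. It assumes dt is the data.table shown in the question; the helper column names d and grp are made up here.

library(data.table)

dt[, d := as.Date(date, format = "%d.%m.%Y")]         # parse the dates
setkey(dt, id, d)                                      # sort within id by date
dt[, grp := cumsum(c(0, diff(d)) > 14) + 1, by = id]   # new group when gap > 14 days
dt[, .(date    = date[.N],   # keep the last observation's date
       val1    = val1[.N],   # keep the last val1
       val2    = sum(val2),  # sum val2 over the run
       counter = .N),        # number of observations combined
   by = .(id, grp)]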
I have a large data frame (~30,000 rows) where I have two date fields "start_date" and "end_date".
I want to summarise the data such that I have 1 column with all the dates and a second column with a count of all the rows in which that date is between the "start_date" and "end_date".
I can make this work using two for loops, but it is very inefficient, as it goes one by one, comparing about 180 dates against 30,000 rows of date ranges.
Below is an example. Say I have the following dataframe.
library(tibble)

df <- tibble(
  start_date = c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5),
  end_date   = c(2, 3, 4, 5, 6, 7, 8, 9, 10, 11)
)
I want this to output a table/dataframe that looks like this
Date Count
1 2
2 4
3 5
4 6
5 7
6 6
7 5
8 4
9 3
10 2
11 1
Are there tidyverse functions, or anything else, that could do this transformation efficiently?
Here's a base R method:
# one row for every date in the overall range
date <- seq(min(df$start_date), max(df$end_date))
# for each date, count the ranges containing it (the \(x) lambda needs R >= 4.1)
count <- sapply(date, \(x) sum(x >= df$start_date & x <= df$end_date))
data.frame(date, count)
# date count
# 1 1 2
# 2 2 4
# 3 3 5
# 4 4 6
# 5 5 7
# 6 6 6
# 7 7 5
# 8 8 4
# 9 9 3
# 10 10 2
# 11 11 1
Here is a data.table approach using foverlaps(). First, create a sequence of dates from the minimum start_date to the maximum end_date, and build a simple data.table with one row per date.
Then use foverlaps() to do an overlap join between your original data and the new table, and count the rows per date after the join (yid indexes the rows of dt, i.e., the dates).
library(data.table)
setDT(df)

dates <- seq(min(df$start_date), max(df$end_date), by = 1)
dt <- data.table(start_date = dates, end_date = dates,
                 key = c("start_date", "end_date"))
foverlaps(df, dt, which = TRUE)[, .N, by = yid]
Output
yid N
1: 1 2
2: 2 4
3: 3 5
4: 4 6
5: 5 7
6: 6 6
7: 7 5
8: 8 4
9: 9 3
10: 10 2
11: 11 1
In the tidyverse you could adapt this to the following:
library(tidyverse)

data.frame(date = seq(min(df$start_date), max(df$end_date), by = 1)) %>%
  rowwise() %>%
  mutate(count = sum(date >= df$start_date & date <= df$end_date))
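If the ranges are short, another option worth sketching is to expand each row into the dates it covers and then count rows per date. This assumes the small integer "dates" of the example; for real Date columns you would need seq(..., by = "day"), and very long ranges would inflate the intermediate table.

library(data.table)
setDT(df)

df[, .(date = seq(start_date, end_date)), by = 1:nrow(df)  # one row per covered date
   ][, .(count = .N), keyby = date]                         # count rows per date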
I keep trying to find an answer, but haven't had much luck. I'll add a sample of some similar data.
What I'm trying to do here is exclude patient 1 and patient 4 from my subset, as they each have only one reading for "Mobility Score". So far, I've been unable to work out a way of counting the number of readings under each variable for each patient. If a patient has only one reading or none, I'd like to exclude them from the subset.
This is an imgur link to the sample data. I can't upload the real data, but it's similar to this
This can be done with dplyr and group_by. For more information, see ?group_by and ?summarize.
# Create random data
dta <- data.frame(patient = rep(c(1,2),4), MobiScor = runif(8, 0,20))
dta$MobiScor[sample(1:8,3)] <- NA
# Count all available mobility scores per patient and keep the original format
library(dplyr)
dta %>% group_by(patient) %>% mutate(count = sum(!is.na(MobiScor)))
# Merge and create pivot table
dta %>% group_by(patient) %>% summarize(count = sum(!is.na(MobiScor)))
Example data
patient MobiScor
1 1 19.203898
2 2 13.684209
3 1 17.581468
4 2 NA
5 1 NA
6 2 NA
7 1 7.794959
8 2 NA
Result 1 (mutate)
patient MobiScor count
<dbl> <dbl> <int>
1 1 19.2 3
2 2 13.7 1
3 1 17.6 3
4 2 NA 1
5 1 NA 3
6 2 NA 1
7 1 7.79 3
8 2 NA 1
Result 2 (summarize)
patient count
<dbl> <int>
1 1 3
2 2 1
You can count the number of non-NA in each group and then filter based on that.
This can be done in base R :
subset(df, ave(!is.na(Mobility_score), patient, FUN = sum) > 1)
Using dplyr
library(dplyr)
df %>% group_by(patient) %>% filter(sum(!is.na(Mobility_score)) > 1)
and data.table
library(data.table)
setDT(df)[, .SD[sum(!is.na(Mobility_score)) > 1], patient]
I'm trying to restructure my data to recode a variable ('Event') so that I can determine the number of days between events. Essentially, I want to count the number of days that occur between events. Importantly, I only want to start the count after the first event has occurred for each person. Here is a sample dataframe:
Day = c(1:8,1:8)
Event = c(0,0,1,NA,0,0,1,0,0,1,NA,NA,0,1,0,1)
Person = c(1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2)
sample <- data.frame(Person,Day,Event);sample
I would like it to end up like this:
NewEvent = c(NA,NA,0,1,2,3,0,1,NA,0,1,2,3,0,1,0)
sample2 <- data.frame(Person, Day, NewEvent); sample2
I'm new to R, unfamiliar with loops or if statements, and I could not find a thread which already answered this type of issue, so any help would be greatly appreciated. Thank you!
One approach is to group on Person and number the distinct events with cumsum(!is.na(Event) & Event == 1). Then group on both Person and that event number to count the days passed since each event. The solution is:
library(dplyr)

sample %>%
  group_by(Person) %>%
  mutate(EventNum = cumsum(!is.na(Event) & Event == 1)) %>%
  group_by(Person, EventNum) %>%
  mutate(NewEvent = ifelse(EventNum == 0, NA, row_number() - 1)) %>%
  ungroup() %>%
  select(Person, Day, NewEvent) %>%
  as.data.frame()
# Person Day NewEvent
# 1 1 1 NA
# 2 1 2 NA
# 3 1 3 0
# 4 1 4 1
# 5 1 5 2
# 6 1 6 3
# 7 1 7 0
# 8 1 8 1
# 9 2 1 NA
# 10 2 2 0
# 11 2 3 1
# 12 2 4 2
# 13 2 5 3
# 14 2 6 0
# 15 2 7 1
# 16 2 8 0
Note: if the data is not sorted on Day, add arrange(Day) to the pipeline above.
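For completeness, here is a rough data.table translation of the same idea; the column names EventNum and NewEvent mirror the dplyr code above, and it assumes the data is sorted on Day within Person, as in the sample.

library(data.table)
setDT(sample)

# number the events within each person; rows before the first event get 0
sample[, EventNum := cumsum(!is.na(Event) & Event == 1), by = Person]
# count days since the event, leaving NA before the first event
sample[, NewEvent := if (EventNum[1] == 0) NA_integer_ else seq_len(.N) - 1L,
       by = .(Person, EventNum)]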
I have several million rows of data and need to create a subset. No success despite trying hard and searching all over the web. The question is:
How to create a subset including only the smallest values of value for all ID & item combinations?
The data structure looks like this:
> df = data.frame(ID = c(1,1,1,1,2,2,2,2),
item = c('A','A','B','B','A','A','B','B'),
value = c(10,5,3,2,7,8,9,10))
> df
ID item value
1 1 A 10
2 1 A 5
3 1 B 3
4 1 B 2
5 2 A 7
6 2 A 8
7 2 B 9
8 2 B 10
The result should look like this:
ID item value
1 A 5
1 B 2
2 A 7
2 B 9
Any hints greatly appreciated. Thank you!
We can use aggregate from base R with grouping variables 'ID' and 'item' to get the min of 'value':
aggregate(value~., df, min)
# ID item value
#1 1 A 5
#2 2 A 7
#3 1 B 2
#4 2 B 9
Or using dplyr
library(dplyr)
df %>%
group_by(ID, item) %>%
summarise(value = min(value))
Or with data.table
library(data.table)
setDT(df)[, .(value = min(value)) , .(ID, item)]
Or another option would be to order and get the first row after grouping
setDT(df)[order(value), head(.SD, 1), .(ID, item)]
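A usage note on the order-and-take-first pattern: unlike the min() summaries above, it keeps every column of the original data. If you want the same behaviour in dplyr, slice_min() (available since dplyr 1.0.0) is the equivalent; a minimal sketch:

library(dplyr)

df %>%
  group_by(ID, item) %>%
  slice_min(value, n = 1, with_ties = FALSE) %>%  # one row per group with the smallest value
  ungroup()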
I have read a few different posts about finding the difference between two rows in R using dplyr. However, the posts I have seen do not give quite what I want. I would like to find the difference between the times and place it in a new variable on the same row as n, as the duration between n and n+1. All the other posts place the elapsed time on the same row as n+1.
Here is some sample data:
df <- read.table(text = c("
id time
1 1
1 4
1 7
2 5
2 10"), header = T)
My desired output:
# id time duration
# 1 1 3
# 1 4 3
# 1 7 NA
# 2 5 5
# 2 10 NA
I have the following code at the moment:
df %>% arrange(id, time) %>% group_by(id) %>% mutate(duration = time - lag(time))
Please let me know how I should change this around. Thanks!
You can use diff(), appending an NA to each group. Just change your mutate() call to
mutate(duration = c(diff(time), NA))
Edit: To clarify, the code above is only the mutate() call at the end of the pipe shown in the question. Based on the code shown in the question, the entire operation is
df %>%
  arrange(id, time) %>%
  group_by(id) %>%
  mutate(duration = c(diff(time), NA))
# Source: local data frame [5 x 3]
# Groups: id [2]
#
# id time duration
# <dbl> <dbl> <dbl>
# 1 1 1 3
# 2 1 4 3
# 3 1 7 NA
# 4 2 5 5
# 5 2 10 NA
We can swap lag() for lead():
df %>%
  group_by(id) %>%
  mutate(duration = lead(time) - time)
# id time duration
# <int> <int> <int>
#1 1 1 3
#2 1 4 3
#3 1 7 NA
#4 2 5 5
#5 2 10 NA
A corresponding option in data.table would be shift with type = "lead"
library(data.table)
setDT(df)[, duration := shift(time, type = "lead") - time, by = id]
NOTE: In the example, 'id' and 'time' were already in order. If they are not, order the data first, as the OP showed with arrange() (for the data.table version, setorder(df, id, time)).