can't figure this out, despite being rather close (supposedly). I want to check if a drug was given in a 4 hour window.
drug start stop
1 A 1 3
2 A 7 10
3 A 11 17
drug A was started on time 1 and was administered up to time 3; then started again at time 7 and given up to time 10 etc.
t1 t2
1 0 4
2 4 8
3 8 12
4 12 16
5 16 20
6 20 24
these are the windows in question
DATA:
t1 <- c(0,4,8,12,16,20)
t2 <- t1 + 4
chunks <- data.frame(t1=t1,t2=t2)
drug <- "A"
start <- c(1,7,11)
stop <- c(3,10,17)
times <- data.frame(drug,start,stop)
Expected solution
t1 t2 lsg
1 0 4 1
2 4 8 1
3 8 12 1
4 12 16 1
5 16 20 1
6 20 24 0
Attempt at solution
test <- function(){
n <- 1
for (row in times){
result <- (times$start[n] > chunks$t1 & times$stop[n] < chunks$t2) | ((times$start[n] > chunks$t1 & times$start[n] < chunks$t2) & (times$stop[n] > chunks$t2 | times$stop[n] < chunks$t2)) | (times$start[n] < chunks$t1 & times$stop[n] > chunks$t1)
n <- n + 1
print(result)
}
}
gives
[1] TRUE FALSE FALSE FALSE FALSE FALSE
[1] FALSE TRUE TRUE FALSE FALSE FALSE
[1] FALSE FALSE TRUE TRUE TRUE FALSE
which is correct! First administration fell into the first time window. 2nd and 3rd administration fell into the 2nd and 3rd windows etc. But how to get to the
expected solution?
As I said, I feel close but I don't know how to join the results to the chunks-df...
The first-half of this is #akrun's comment, but expanded to include the prerequisites. (If you come back and answer, I'll happily defer to you ... just giving more details here.) The second-half is new (and often over-looked).
data.table
data.table::foverlaps does joins based on overlaps/inequalities (as opposed to base merge and dplyr::*_join, which only operate on strict equalities). One prerequisite for using overlaps (in addition to being data.table class) is that the time fields be keyed correctly.
library(data.table)
setDT(times)
setDT(chunks)
# set the keys
setkey(times, start, stop)
setkey(chunks, t1, t2)
# the join
+(!is.na(foverlaps(chunks, times, which = TRUE, mult = 'first')))
# [1] 1 1 1 1 1 0
The function actually returns which row(s) each row in times corresponds to in chunks:
foverlaps(chunks, times, which = TRUE, mult = 'first')
# [1] 1 2 2 3 3 NA
sqldf
data.table is not the only R tool that lets this happen. This solution works on any variant of data.frame (base, data.table, or tbl_df).
Here's this:
library(sqldf)
sqldf("
select c.t1, c.t2,
(case when drug is null then 0 else 1 end) > 0 as n
from chunks c
left join times t on
(t.start between c.t1 and c.t2) or (t.stop between c.t1 and c.t2)
or (c.t1 between t.start and t.stop) or (c.t2 between t.start and t.stop)
group by c.t1, c.t2")
# t1 t2 n
# 1 0 4 1
# 2 4 8 1
# 3 8 12 1
# 4 12 16 1
# 5 16 20 1
# 6 20 24 0
(I don't know if it's possible to reduce the logic of that join, nor if it will mis-behave with other data.)
If you need the count of drugs that occur in each time frame, I think you can use sum(case when ... end) as n.
Related
I need to use data from a dataframe A to fill a column in my dataframe B.
Here is a subset of dataframe A:
> dfA <- data.frame(Family=c('A','A','A','B','B'), Count=c(1,2,3,1,2), Start=c(0,10,35,0,5), End=c(10,35,50,5,25))
> dfA
Family Count Start End
1 A 1 0 10
2 A 2 10 35
3 A 3 35 50
4 B 1 0 5
5 B 2 5 25
and a subset of dataframe B
> dfB <- data.frame(Family=c('A','A','A','B','B'), Start=c(1,4,36,2,10), End=c(3,6,40,4,24), BelongToCount=c(NA,NA,NA,NA,NA))
> dfB
Family Start End BelongToCount
1 A 1 3 NA
2 A 4 6 NA
3 A 36 40 NA
4 B 2 4 NA
5 B 10 24 NA
What I want to do is to fill in the BelongToCount column in B according to the data from dataframe A, which would end up with dataframe B filled as:
Family Start End BelongToCount
A 1 3 1
A 4 6 1
A 36 40 3
B 2 4 1
B 10 24 2
I need to do this for each family (so grouping by family), and the condition to fill the BelongToCount column is that if B$Start >= A$Start && B$End <= A$End.
I can't seem to find a clean (and fast) way to do this in R.
Right now, I am doing as follows:
split_A <- split(dfA, dfA$Family)
split_A_FamilyA <- split_A[["A"]]
split_B <- split(dfB, dfB$Family)
split_B_FamilyA <- split_B[["A"]]
for(i in 1:nrow(split_B_FamilyA)) {
row <- split_B_FamilyA[i,]
start <- row$dStart
end <- row$dEnd
for(j in 1:nrow(split_A_FamilyA)) {
row_base <- split_A_FamilyA[j,]
start_base <- row_base$Start
end_base <- row_base$End
if ((start >= start_base) && (end <= end_base)) {
split_B_FamilyA[i,][i,]$BelongToCount <- row_base$Count
break
}
}
}
I admit this is a very bad way of handling the problem (and it is awfully slow). I usually use dplyr when it comes to applying operations on specific groups, but I can't find a way to do such a thing using it. Joining the tables does not make a lot of sense either because the number of rows don't match.
Can someone point me any relevant R function / an efficient way of solving this problem in R?
You can do this with non-equi join in data.table:
library(data.table)
setDT(dfB)
setDT(dfA)
set(dfB, j='BelongToCount', value = as.numeric(dfB$BelongToCount))
dfB[dfA, BelongToCount := Count, on = .(Family, Start >= Start, End <= End)]
# Family Start End BelongToCount
# 1: A 1 3 1
# 2: A 4 6 1
# 3: A 36 40 3
# 4: B 2 4 1
# 5: B 10 24 2
In case a row in dfB is contained in multiple roles of dfA:
dfA2 <- rbind(dfA, dfA)
dfA2[dfB, .(BelongToCount = sum(Count)),
on = .(Family, Start <= Start, End >= End), by = .EACHI]
# Family Start End BelongToCount
# 1: A 1 3 2
# 2: A 4 6 2
# 3: A 36 40 6
# 4: B 2 4 2
# 5: B 10 24 4
I want to fill action column based on its records and time column. NA in action column should be filled based on previous action record and time interval. lets say we set time interval to 10, which means that if action is A and time is 1, all NA in action should be A till time==11 (1+10).
Please note that if action or ID change, this process should be reset. For example (in row 3) we have B with time==11, I want to fill the next NAs with B until time==21, but we have C in time==16, so we continue filling NA with C until time==26.
df<-read.table(text="
id action time
1 A 1
1 NA 4
1 NA 9
1 B 11
1 NA 12
1 C 16
1 NA 19
1 NA 30
1 A 31
1 NA 32
2 NA 1
2 A 2
2 NA 6",header=T,stringsAsFactors = F)
Desired Result:
id action time
1 A 1
1 A 4
1 A 9
1 B 11
1 B 12
1 C 16
1 C 19
1 NA 30
1 A 31
1 A 32
2 NA 1
2 A 2
2 A 6
We can extract the non-NA rows to use as a reference for filling in values, then iterate through the data set and conditionally replace values based on if they meet the requirements of id and the time interval.
# Use row numbers as an index (unique Id)
df$idx <- 1:nrow(df)
# Find the non-NA rows to use a reference for imputation
idx <- df %>%
group_by(id) %>%
na.omit(action)
The temporary data set idx is used as the reference and the column idx is our unique identifier. Let's first look at the logic for finding and filling in the missing values without worrying about the time interval, so that it's easier to read and understand:
# Ignoring the 'interval' limitation, we'd fill them in like this:
for(r in 1:nrow(df)){
if(is.na(df$action[r])){
df$action[r] <- dplyr::last(idx$action[idx$idx < df$idx[r] & idx$id == df$id[r]])
}
}
If you're running this example code make sure you re-create df and idx before proceeding, since it would be modified by that last example code block.
The time interval requires us to do a logical test on the value of time and also another test to avoid trying to conduct the time comparison on NA values:
# Accounting for the max interval:
interval <- 10
for(r in 1:nrow(df)){
if(is.na(df$action[r])){
if(!is.na(dplyr::last(idx$time[idx$idx < df$idx[r] & idx$id == df$id[r]]))){
if(dplyr::last(idx$time[idx$idx < df$idx[r] & idx$id == df$id[r]]) + interval >= df$time[r])
df$action[r] <- dplyr::last(idx$action[idx$idx < df$idx[r] & idx$id == df$id[r]])
}
}
}
df
This gives us:
id action time idx
1 1 A 1 1
2 1 A 4 2
3 1 A 9 3
4 1 B 11 4
5 1 B 12 5
6 1 C 16 6
7 1 C 19 7
8 1 <NA> 30 8
9 1 A 31 9
10 1 A 32 10
11 2 <NA> 1 11
12 2 A 2 12
13 2 A 6 13
which matches your desired output.
I'm relatively new to R and struggle with "vectorizing" all my code in R. Even though I appreciate that's the proper way to do it.
I need to set a value in a dataframe to be the minimum time for the IDs.
ID isTrue RealTime MinTime
1 TRUE 16
1 FALSE 8
1 TRUE 10
2 TRUE 7
2 TRUE 30
3 FALSE 3
To be turned into:
ID isTrue RealTime MinTime
1 TRUE 16 10
1 FALSE 8
1 TRUE 10 10
2 TRUE 7 7
2 TRUE 30 7
3 FALSE 3
The following works perfectly. However, it takes 10 minutes to run which isn't ideal:
for (i in 1:nrow(df)){
if (df[i,'isTrue']) {
prevTime <- sqldf(paste('Select min(MinTime) from dfStageIV where ID =',df[i,'ID'],sep=" "))[1,1]
if (is.na(prevTime) | is.na(df[i,'MinTime']) | df[i,'MinTime'] < prevTime){
df[i,'MinTime']<-dfStageIV[i,'RealTime']
} else {
dfStageIV[i,'MinTime']<-prevTime
}
}
}
How should I do this properly? I take it using for or do loops are not the best way in R. I've been looking at the apply() and aggregate.data.frame() functions but can't make sense of how to do this. Can someone point me in the right direction? Much appreciated!!
Here is a two line base R solution using ave, pmax, and is.na.
# calculate minimum for each ID, excluding FALSE instances
df$MinTime <- ave(pmax(df$RealTime, (!df$isTrue) * max(df$RealTime)), df$ID, FUN=min)
# turn FALSE instances into NA
is.na(df$MinTime) <- (!df$isTrue)
which returns
df
ID isTrue RealTime MinTime
1 1 TRUE 16 10
2 1 FALSE 8 NA
3 1 TRUE 10 10
4 2 TRUE 7 7
5 2 TRUE 30 7
6 3 FALSE 3 NA
In the first line, pmax is used to construct a vector of the observations if df$isTrue is TRUE or the maximum RealTime value in the data.frame. This new vector is used in the minimum calculation. The FALSE values are set to NA in the second line.
data
df <- read.table(header=T, text="ID isTrue RealTime
1 TRUE 16
1 FALSE 8
1 TRUE 10
2 TRUE 7
2 TRUE 30
3 FALSE 3")
It should be far faster with a dplyr chain. Here we group the data frame by both ID and group and get the minima at the group level. Then we can ungroup it again and simply remove the F minima.
library(dplyr)
df %>%
group_by(ID, isTrue) %>%
mutate(Min.all = min(RealTime)) %>%
ungroup() %>%
transmute(ID, isTrue, RealTime, MinTime = ifelse(isTrue == T, Min.all, ""))
Output:
# A tibble: 6 × 4
ID isTrue RealTime MinTime
<int> <lgl> <int> <chr>
1 1 TRUE 16 10
2 1 FALSE 8
3 1 TRUE 10 10
4 2 TRUE 7 7
5 2 TRUE 30 7
6 3 FALSE 3
I'd really recommend you get familiar with dplyr if you're going to be doing lots of data frame manipulation.
Someone suggested using the ave() function and the following works and is fast although it returns a ton of warnings:
df$MinTime<-ave(df$RealTime,df$ID, df$isTrue, FUN = min)
df$MinTime<-ifelse(df$isTrue, df$MinTime,NA).
The code in the question could be simplified by doing it all in SQL or all in R (appropriately vectorized) rather than half and half. There are already some R solutions so here is an SQL solution that shows that the problem amounts to aggregating a custom self-join.
library(sqldf)
sqldf("select a.*, min(b.RealTime) minRealTime
from df a
left join df b on a.ID = b.ID and a.isTRUE and b.isTRUE
group by a.rowid")
giving:
ID isTrue RealTime minRealTime
1 1 TRUE 16 10
2 1 FALSE 8 NA
3 1 TRUE 10 10
4 2 TRUE 7 7
5 2 TRUE 30 7
6 3 FALSE 3 NA
I have ratings of images by several raters:
data <- as.data.frame(matrix(c(rep(1,6),rep(2,6),rep(1:6,2),
0,2,1,0,1,0,0,0,3,0,0,0),12,3))
colnames(data) <- c("image", "rater", "rating")
print(data)
# image rater rating
# 1 1 1 0
# 2 1 2 2
# 3 1 3 1
# 4 1 4 0
# 5 1 5 1
# 6 1 6 0
# 7 2 1 0
# 8 2 2 0
# 9 2 3 3
# 10 2 4 0
# 11 2 5 0
# 12 2 6 0
I want to aggregate (mean) ratings by images, but only if there less than 3 zero ratings for a given image. Otherwise (=if there are 3 zeros or more), the aggregated rating should be zero. And the counting of zeros should only be for raters 1-5.
So for the above data:
# image rating
# 1 1 0.8
# 2 2 0.0
For image 1 ratings are aggregated because the third zero belongs to rater 6. For image 2, the aggregated rating is zero because there are more than 2 zeros.
On top of that, I want the aggregation to take into account a) only the first 5 ratings for each image, and b) only positive ratings.
I can manage the last 2 conditions using aggregate:
aggregate(rating ~ image, data = data[data$rater <= 5 & data$rating != 0,], mean)
# Result:
# image rating
# 1 1 1.333333
# 2 2 3.000000
But I can't figure out the first condition.
Correct results should be:
# image rating
# 1 1 1.333333
# 2 2 0.000000
Can anyone please help? Thanks.
Here is a nice method using base R:
data$this <- ave(data$rating, data$image,
FUN=function(i) if(sum(i[1:5] > 0) > 2) mean(i[1:5]) else 0)
I use i[1:5] to subset each image, so if you have fewer than 5 raters for an image, you will get an error. This returns the mean for each group, if that is of interest. Of course, you can use the same function to produce the aggregation table you mentioned:
aggregate(data$rating, data["image"],
FUN=function(i) if(sum(i[1:5] > 0) > 2) mean(i[1:5]) else 0)
Given the dataframe:
df = data.frame(
ID = c(1,1,1,1,2,3,3),
Start = c(0,8,150,200,6,7,60),
Stop = c(5,60,170,210,NA,45,80))
ID Start Stop Dummy
1 1 0 5 0
2 1 8 60 1
3 1 150 170 1
4 1 200 210 1
5 2 6 NA 0
6 3 7 45 0
7 3 60 80 1
For each ID, I would like to keep all rows until Start[i+1] - Stop[i] >= 28, and then delete the following observations of that ID
In this example, the output should be
ID Start Stop Dummy
1 1 0 5 0
2 1 8 60 1
5 2 6 NA 0
6 3 7 45 0
7 3 60 80 1
I ended up having to set NA's to a value easy to identify later and the following code
df$Stop[is.na(df$Stop)] = 10000
df$diff <- df$Start-c(0,df$Stop[1:length(df$Stop)-1])
space <- with(df, unique(ID[diff<28]))
df2 <- subset(df, (ID %in% space & diff < 28) | !ID %in% space)
Using data.table...
library(data.table)
setDT(df)
df[,{
w = which( shift(Start,type="lead") - Stop >= 28 )
if (length(w)) .SD[seq(w[1])] else .SD
}, by=ID]
# ID Start Stop
# 1: 1 0 5
# 2: 1 8 60
# 3: 2 6 NA
# 4: 3 7 45
# 5: 3 60 80
.SD is the Subset of Data associated with each by=ID group.
Create a diff column.
df$diff<-df$Start-c(0,df$Stop[1:length(df$Stop)-1])
Subset on the basis of this column
df[df$diff<28,]
PS: I have converted 'NA' to 0. You would have to handle that anyway.
p <- which(df$Start[2:nrow(df)]-df$Stop[1:(nrow(df)-1)] >= 28)
df <- df[p,]
Assuming you want to keep entries where next entry start if higher than giben entry stop by 28 or more
The result is:
>p 2 3
> df[p,]
ID Start Stop
2 1 8 60
3 1 150 170
start in row 2 ( i + 1 = 2) is higher than stop in row 1 (i=1) by 90.
Or, if by until you mean the reverse condition, then
df <- df[which(df$Start[2:nrow(df)]-df$Stop[1:(nrow(df)-1)] < 28),]
Inclusion of NA in your data frame got me thinking. You have to be very careful how you word your condition. If you want to keep all the cases where difference between next start and stop is less than 28, then the above statement will do.
However, if you want to keep all cases EXCEPT when difference is 28 or more, then you should
p <- which((df$Start[2:nrow(df)]-df$Stop[1:(nrow(df)-1)] >= 28))
rp <- which((!is.element(1:nrow(df),p)))
df <- df[rp,]
As it will include the unknown difference.