Finding the minimums of groups of observations in R

I'm relatively new to R and struggle with "vectorizing" all my code, even though I appreciate that's the proper way to do it.
I need to set a value in a dataframe to be the minimum time for the IDs.
ID isTrue RealTime MinTime
1 TRUE 16
1 FALSE 8
1 TRUE 10
2 TRUE 7
2 TRUE 30
3 FALSE 3
To be turned into:
ID isTrue RealTime MinTime
1 TRUE 16 10
1 FALSE 8
1 TRUE 10 10
2 TRUE 7 7
2 TRUE 30 7
3 FALSE 3
The following works perfectly. However, it takes 10 minutes to run, which isn't ideal:
for (i in 1:nrow(df)) {
  if (df[i, 'isTrue']) {
    # smallest MinTime recorded so far for this ID
    prevTime <- sqldf(paste('Select min(MinTime) from df where ID =', df[i, 'ID'], sep = " "))[1, 1]
    if (is.na(prevTime) | is.na(df[i, 'MinTime']) | df[i, 'MinTime'] < prevTime) {
      df[i, 'MinTime'] <- df[i, 'RealTime']
    } else {
      df[i, 'MinTime'] <- prevTime
    }
  }
}
How should I do this properly? I take it using for or do loops are not the best way in R. I've been looking at the apply() and aggregate.data.frame() functions but can't make sense of how to do this. Can someone point me in the right direction? Much appreciated!!

Here is a two-line base R solution using ave, pmax, and is.na.
# calculate minimum for each ID, excluding FALSE instances
df$MinTime <- ave(pmax(df$RealTime, (!df$isTrue) * max(df$RealTime)), df$ID, FUN=min)
# turn FALSE instances into NA
is.na(df$MinTime) <- (!df$isTrue)
which returns
df
ID isTrue RealTime MinTime
1 1 TRUE 16 10
2 1 FALSE 8 NA
3 1 TRUE 10 10
4 2 TRUE 7 7
5 2 TRUE 30 7
6 3 FALSE 3 NA
In the first line, pmax builds a vector that keeps RealTime where df$isTrue is TRUE and substitutes the maximum RealTime of the whole data.frame where it is FALSE, so a FALSE row can never win the per-ID minimum. ave then computes the minimum of this vector within each ID. The second line sets MinTime to NA for the FALSE rows.
data
df <- read.table(header=T, text="ID isTrue RealTime
1 TRUE 16
1 FALSE 8
1 TRUE 10
2 TRUE 7
2 TRUE 30
3 FALSE 3")

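It should be far faster with a dplyr chain. Here we group the data frame by both ID and isTrue and compute the minimum within each group. Then we ungroup it and blank out the minima that belong to FALSE rows. Note that using "" as the filler turns MinTime into a character column (the <chr> in the output); use NA instead if it should stay numeric.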
library(dplyr)
df %>%
group_by(ID, isTrue) %>%
mutate(Min.all = min(RealTime)) %>%
ungroup() %>%
transmute(ID, isTrue, RealTime, MinTime = ifelse(isTrue == T, Min.all, ""))
Output:
# A tibble: 6 × 4
ID isTrue RealTime MinTime
<int> <lgl> <int> <chr>
1 1 TRUE 16 10
2 1 FALSE 8
3 1 TRUE 10 10
4 2 TRUE 7 7
5 2 TRUE 30 7
6 3 FALSE 3
I'd really recommend you get familiar with dplyr if you're going to be doing lots of data frame manipulation.

Someone suggested using the ave() function. The following works and is fast, although it returns a ton of warnings:
df$MinTime <- ave(df$RealTime, df$ID, df$isTrue, FUN = min)
df$MinTime <- ifelse(df$isTrue, df$MinTime, NA)
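The warnings most likely come from ave() calling min() on ID/isTrue combinations that contain no rows (for example ID 2 with isTrue FALSE). A minimal sketch, assuming that is the cause: build the grouping factor with interaction(..., drop = TRUE) so that empty combinations are never visited. The name grp is only illustrative.

```r
df <- data.frame(
  ID       = c(1, 1, 1, 2, 2, 3),
  isTrue   = c(TRUE, FALSE, TRUE, TRUE, TRUE, FALSE),
  RealTime = c(16, 8, 10, 7, 30, 3)
)

# One level per ID/isTrue combination that actually occurs in the data
grp <- interaction(df$ID, df$isTrue, drop = TRUE)

# Minimum within each combination, kept only for the TRUE rows
df$MinTime <- ifelse(df$isTrue, ave(df$RealTime, grp, FUN = min), NA)
df
```

The FALSE rows still get their own group minimum internally, but ifelse() replaces it with NA, which matches the result requested in the question.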

The code in the question could be simplified by doing it all in SQL or all in R (appropriately vectorized) rather than half and half. There are already some R solutions so here is an SQL solution that shows that the problem amounts to aggregating a custom self-join.
library(sqldf)
sqldf("select a.*, min(b.RealTime) minRealTime
from df a
left join df b on a.ID = b.ID and a.isTRUE and b.isTRUE
group by a.rowid")
giving:
ID isTrue RealTime minRealTime
1 1 TRUE 16 10
2 1 FALSE 8 NA
3 1 TRUE 10 10
4 2 TRUE 7 7
5 2 TRUE 30 7
6 3 FALSE 3 NA

Related

Roll condition ifelse in R data frame

I have a data frame with two columns in R and I want to create a third column that will roll by 2 in both columns and check if a condition is satisfied or not as described in the table below.
The condition is a rolling ifelse and goes like this :
IF -A1<B3<A1 TRUE ELSE FALSE
IF -A2<B4<A2 TRUE ELSE FALSE
IF -A3<B5<A3 TRUE ELSE FALSE
IF -A4<B6<A4 TRUE ELSE FALSE
A B CHECK
1 4 NA
2 5 NA
3 6 FALSE
4 1 TRUE
5 -4 FALSE
6 1 TRUE
How can I do it in R? Is there a base R function for this, or a way to do it within the dplyr framework?
Since R is vectorized, you can do that with one command, using for instance dplyr::lag:
library(dplyr)
df %>%
mutate(CHECK = -lag(A, n=2) < B & lag(A, n=2) > B)
A B CHECK
1 1 4 NA
2 2 5 NA
3 3 6 FALSE
4 4 1 TRUE
5 5 -4 FALSE
6 6 1 TRUE
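For the base R half of the question, the same check can be sketched without dplyr by shifting A down two positions by hand. This is a minimal sketch using the data from the question; A_lag2 is an illustrative name, not something from the original post.

```r
df <- data.frame(A = 1:6, B = c(4, 5, 6, 1, -4, 1))

# Shift A down by two rows: the first two entries have no predecessor, so they are NA
A_lag2 <- c(NA, NA, head(df$A, -2))

# -A[i-2] < B[i] < A[i-2], evaluated for all rows at once
df$CHECK <- -A_lag2 < df$B & df$B < A_lag2
df
```

Comparisons against NA yield NA, which is why the first two rows come out as NA rather than FALSE.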

detect if drug was given in interval

I can't figure this out, despite being rather close (supposedly). I want to check if a drug was given in a 4 hour window.
drug start stop
1 A 1 3
2 A 7 10
3 A 11 17
Drug A was started at time 1 and was administered up to time 3; then it was started again at time 7 and given up to time 10, etc.
t1 t2
1 0 4
2 4 8
3 8 12
4 12 16
5 16 20
6 20 24
These are the windows in question.
DATA:
t1 <- c(0,4,8,12,16,20)
t2 <- t1 + 4
chunks <- data.frame(t1=t1,t2=t2)
drug <- "A"
start <- c(1,7,11)
stop <- c(3,10,17)
times <- data.frame(drug,start,stop)
Expected solution
t1 t2 lsg
1 0 4 1
2 4 8 1
3 8 12 1
4 12 16 1
5 16 20 1
6 20 24 0
Attempt at solution
test <- function(){
  # loop over the rows of times; `for (row in times)` would iterate over its columns
  for (n in seq_len(nrow(times))){
    result <- (times$start[n] > chunks$t1 & times$stop[n] < chunks$t2) | ((times$start[n] > chunks$t1 & times$start[n] < chunks$t2) & (times$stop[n] > chunks$t2 | times$stop[n] < chunks$t2)) | (times$start[n] < chunks$t1 & times$stop[n] > chunks$t1)
    print(result)
  }
}
gives
[1] TRUE FALSE FALSE FALSE FALSE FALSE
[1] FALSE TRUE TRUE FALSE FALSE FALSE
[1] FALSE FALSE TRUE TRUE TRUE FALSE
which is correct! First administration fell into the first time window. 2nd and 3rd administration fell into the 2nd and 3rd windows etc. But how to get to the expected solution?
As I said, I feel close but I don't know how to join the results to the chunks-df...
The first half of this is @akrun's comment, but expanded to include the prerequisites. (If you come back and answer, I'll happily defer to you ... just giving more details here.) The second half is new (and often overlooked).
data.table
data.table::foverlaps does joins based on overlaps/inequalities (as opposed to base merge and dplyr::*_join, which only operate on strict equalities). One prerequisite for using overlaps (in addition to being data.table class) is that the time fields be keyed correctly.
library(data.table)
setDT(times)
setDT(chunks)
# set the keys
setkey(times, start, stop)
setkey(chunks, t1, t2)
# the join
+(!is.na(foverlaps(chunks, times, which = TRUE, mult = 'first')))
# [1] 1 1 1 1 1 0
The function actually returns, for each row in chunks, the index of the first overlapping row in times (or NA if there is none):
foverlaps(chunks, times, which = TRUE, mult = 'first')
# [1] 1 2 2 3 3 NA
sqldf
data.table is not the only R tool that can do this. The following solution works on any variant of data.frame (base, data.table, or tbl_df):
library(sqldf)
sqldf("
select c.t1, c.t2,
(case when drug is null then 0 else 1 end) > 0 as n
from chunks c
left join times t on
(t.start between c.t1 and c.t2) or (t.stop between c.t1 and c.t2)
or (c.t1 between t.start and t.stop) or (c.t2 between t.start and t.stop)
group by c.t1, c.t2")
# t1 t2 n
# 1 0 4 1
# 2 4 8 1
# 3 8 12 1
# 4 12 16 1
# 5 16 20 1
# 6 20 24 0
(I don't know if it's possible to reduce the logic of that join, nor if it will mis-behave with other data.)
If you need the count of drugs that occur in each time frame, I think you can use sum(case when ... end) as n.
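On the question of whether the join logic can be reduced: two intervals overlap exactly when each one starts before the other ends, so the conditions collapse into a single comparison. The following base R sketch applies that test to the data from the question. It assumes that intervals which merely touch at an endpoint should not count as overlapping (strict inequalities); the helper name overlaps is illustrative.

```r
chunks <- data.frame(t1 = c(0, 4, 8, 12, 16, 20))
chunks$t2 <- chunks$t1 + 4
times <- data.frame(drug = "A", start = c(1, 7, 11), stop = c(3, 10, 17))

# TRUE if any administration overlaps the window (t1, t2)
overlaps <- function(t1, t2) any(times$start < t2 & times$stop > t1)

# Apply the test to every window and store the result as 0/1
chunks$lsg <- as.integer(mapply(overlaps, chunks$t1, chunks$t2))
chunks
```

Replacing any() with sum() inside the helper would give the number of administrations per window instead of a 0/1 flag.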

Create a true/false variable in R

I have one variable column that contains large string values which are multiple words. I want to create a True/False column which reports true if a certain value is detected within the column of interest.
I have tried a mutate function with an embedded str_detect.
Dataset <- Dataset %>%
mutate(new_column = str_detect('column.of.interest', "abcd"))
My expected output was for all rows in which my column of interest contained "abcd" would report as TRUE in my new column. However, every row reports as FALSE in my new column.
Base R version. First create a sample data set (questioner: you should have done this; answerers: you should always do this):
> Dataset = data.frame(ID=1:10, column.of.interest=c(NA,"This","abcd","Foo","the abcde",NA,"Me","my","mo","END"))
which looks like this:
> Dataset
ID column.of.interest
1 1 <NA>
2 2 This
3 3 abcd
4 4 Foo
5 5 the abcde
6 6 <NA>
7 7 Me
8 8 my
9 9 mo
10 10 END
Then do:
> Dataset$new_column <- grepl("abcd", Dataset$column.of.interest, ignore.case = T)
to get:
> Dataset
ID column.of.interest new_column
1 1 <NA> FALSE
2 2 This FALSE
3 3 abcd TRUE
4 4 Foo FALSE
5 5 the abcde TRUE
6 6 <NA> FALSE
7 7 Me FALSE
8 8 my FALSE
9 9 mo FALSE
10 10 END FALSE
You may or may not want ignore.case.
Here is one answer based on a dataset from ggplot2:
library(ggplot2)
library(dplyr)
library(stringr)  # provides str_detect
diamonds %>% mutate(newCol = str_detect(clarity, "1"))
Original bad version of answer (see comments for why the above is better)
diamonds %>% mutate(newCol = ifelse(str_detect(clarity, "1"), "TRUE", "FALSE"))
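None of the answers spells out why the original attempt returned FALSE everywhere, so here is a small sketch of the likely cause: with the column name in quotes, str_detect() searches the literal string "column.of.interest" rather than the column's values. The four-row Dataset below is made up for illustration.

```r
library(dplyr)
library(stringr)

Dataset <- data.frame(
  ID = 1:4,
  column.of.interest = c("This", "abcd", "the abcde", NA)
)

# Quoted: the literal text "column.of.interest" does not contain "abcd",
# so a single FALSE is produced and recycled to every row
quoted <- Dataset %>%
  mutate(new_column = str_detect("column.of.interest", "abcd"))

# Unquoted: str_detect() receives the values stored in the column
unquoted <- Dataset %>%
  mutate(new_column = str_detect(column.of.interest, "abcd"))

quoted$new_column
unquoted$new_column
```

Unlike grepl(), str_detect() returns NA for missing input, so the last row of the unquoted version is NA rather than FALSE.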

Remove rows based on first instance to meet a condition

In the following dataset, I want to remove all rows starting at the first instance, sorted by Time and grouped by ID, that Var is TRUE. Put differently, I want to subset all rows for each ID by those which are FALSE up until the first TRUE, sorted by Time.
ID <- c('A','B','C','A','B','C','A','B','C','A','B','C')
Time <- c(3,3,3,6,6,6,9,9,9,12,12,12)
Var <- c(F,F,F,T,T,F,T,T,F,T,F,T)
data = data.frame(ID, Time, Var)
data
ID Time Var
1 A 3 FALSE
2 B 3 FALSE
3 C 3 FALSE
4 A 6 TRUE
5 B 6 TRUE
6 C 6 FALSE
7 A 9 TRUE
8 B 9 TRUE
9 C 9 FALSE
10 A 12 TRUE
11 B 12 FALSE
12 C 12 TRUE
The desired result for this data frame should be:
ID Time Var
A 3 FALSE
B 3 FALSE
C 3 FALSE
C 6 FALSE
C 9 FALSE
Note that the solution should not only remove rows where Var == TRUE, but should also remove rows where Var == FALSE but this follows (in Time) another instance where Var == TRUE for that ID.
I've tried many different things but can't seem to figure this out. Any help is much appreciated!
Here's how to do that with dplyr using group_by and cumsum.
The rationale is that Var is a logical vector where FALSE is equal to 0 and TRUE is equal to 1. cumsum will remain at 0 until it hits the first TRUE.
library(dplyr)
data%>%
group_by(ID)%>%
filter(cumsum(Var)<1)
ID Time Var
<fctr> <dbl> <lgl>
1 A 3 FALSE
2 B 3 FALSE
3 C 3 FALSE
4 C 6 FALSE
5 C 9 FALSE
Here's the equivalent code with data.table:
library(data.table)
setDT(data)  # convert to a data.table so that .I and by are available
data[data[, .I[cumsum(Var) <1], by = ID]$V1]
ID Time Var
1: A 3 FALSE
2: B 3 FALSE
3: C 3 FALSE
4: C 6 FALSE
5: C 9 FALSE
This data.table solution should work.
library(data.table)
> setDT(data)[, .SD[1:(which.max(Var)-1)], by=ID]
ID Time Var
1: A 3 FALSE
2: B 3 FALSE
3: C 3 FALSE
4: C 6 FALSE
5: C 9 FALSE
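Given that you want all the values up to the first TRUE value, which.max does the job here. Note that it relies on every ID having a TRUE somewhere after its first row: if the first row is TRUE, or if an ID has no TRUE at all, which.max(Var) is 1 and 1:0 selects row 1 instead of the intended rows.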
You can do this with the cumall verb as well:
library(dplyr)
data %>%
dplyr::group_by(ID) %>%
dplyr::filter(dplyr::cumall(!Var))
ID Time Var
<chr> <dbl> <lgl>
1 A 3 FALSE
2 B 3 FALSE
3 C 3 FALSE
4 C 6 FALSE
5 C 9 FALSE
cumall(!x): all cases until the first TRUE
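The cumsum idea also works without any packages. As a minimal base R sketch on the question's data, ave() can compute the running count of TRUE values within each ID, and rows are kept while that count is still zero. The names seen_true and result are illustrative.

```r
ID   <- c('A','B','C','A','B','C','A','B','C','A','B','C')
Time <- c(3,3,3,6,6,6,9,9,9,12,12,12)
Var  <- c(F,F,F,T,T,F,T,T,F,T,F,T)
data <- data.frame(ID, Time, Var)

# Make sure rows are in time order before accumulating
data <- data[order(data$Time), ]

# Running number of TRUE values seen so far within each ID
seen_true <- ave(as.integer(data$Var), data$ID, FUN = cumsum)

# Keep rows that come before the first TRUE of their ID
result <- data[seen_true == 0, ]
result
```

An ID with no TRUE at all keeps every row, and an ID whose first row is TRUE keeps none.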

How to add column to data frame after subsetting it with plyr?

Say I wanted to work with hospital Medicare data showing procedure prices by hospital and by county and my data frame was called df with columns price, procedure and county. If I wanted to find the minimum and maximum prices for each procedure by county, I could do something like
library(plyr)
mostexpensive <- ddply(df,c('county','procedure'),function(x)x[which(x$price==max(x$price)),])
to get a table showing the hospitals with the most expensive procedures in each county. I can then see how many times each hospital is listed with
summary(mostexpensive$hospital)
For the final step I want to add a column to the original df dataframe that says TRUE if the row is most expensive and FALSE otherwise but I can't figure out how to get a logical vector from a plyr function. Thanks.
Posting reproducible code would be useful. Try this anyway,
For the summary
pricey <- ddply(df, c('county','procedure'), summarise, most = max(price), less=min(price))
and for the logical indexing
testing <- ddply(df, c('county','procedure'), mutate, expensive = price == max(price))
It is easier to get an answer with a reproducible example. You should keep that in mind the next time you ask for help on SO.
That being said, you can use the transform function to add a new column to your existing data.
The first step is to create a toy data set.
set.seed(123)
df <- data.frame(
county = sample(LETTERS[1:3], size = 20, replace = TRUE),
procedure = sample(c(1, 2), size = 20, replace = TRUE),
price = rpois(20, 10)
)
str(df)
## 'data.frame': 20 obs. of 3 variables:
## $ county : Factor w/ 3 levels "A","B","C": 1 3 2 3 3 1 2 3 2 2 ...
## $ procedure: num 2 2 2 2 2 2 2 2 1 1 ...
## $ price : int 6 8 6 8 4 6 6 8 5 12 ...
Now we can use plyr and the transform function
require(plyr)
expensive <- ddply(df, .(county, procedure),
transform, ismax = price == max(price))
expensive
## county procedure price ismax
## 1 A 1 9 FALSE
## 2 A 1 7 FALSE
## 3 A 1 12 TRUE
## 4 A 2 6 FALSE
## 5 A 2 6 FALSE
## 6 A 2 8 TRUE
## 7 B 1 5 FALSE
## 8 B 1 12 TRUE
## 9 B 2 6 FALSE
## 10 B 2 6 FALSE
## 11 B 2 12 TRUE
## 12 B 2 11 FALSE
## 13 C 1 9 TRUE
## 14 C 1 9 TRUE
## 15 C 2 8 FALSE
## 16 C 2 8 FALSE
## 17 C 2 4 FALSE
## 18 C 2 8 FALSE
## 19 C 2 12 TRUE
## 20 C 2 12 TRUE
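If the goal is only to add the logical column to the original df without reordering its rows, a base R sketch with ave() is an alternative to ddply. The small data frame below is made up for illustration, and grp is an illustrative name; drop = TRUE keeps ave() from visiting county/procedure combinations that do not occur.

```r
df <- data.frame(
  county    = c("A", "A", "A", "B", "B"),
  procedure = c(1, 1, 2, 1, 1),
  price     = c(9, 12, 6, 5, 5)
)

# One group per county/procedure combination present in the data
grp <- interaction(df$county, df$procedure, drop = TRUE)

# TRUE where the row's price equals the maximum of its group
df$ismax <- df$price == ave(df$price, grp, FUN = max)
df
```

As in the plyr output above, ties are all flagged TRUE, so a group can have more than one most expensive row.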

Resources