transpose long to wide within groups with tidyverse

I can't quite work this one out.
How do I go from:
Visit Test
1 A
1 B
2 A
2 C
3 B
To:
Visit A B C
1 TRUE TRUE FALSE
2 TRUE FALSE TRUE
3 FALSE TRUE FALSE

With dplyr and tidyr you can do:
library(dplyr)
library(tidyr)
dd %>% mutate(Value=TRUE) %>%
  spread(Test, Value, fill=FALSE)
# Visit A B C
# 1 1 TRUE TRUE FALSE
# 2 2 TRUE FALSE TRUE
# 3 3 FALSE TRUE FALSE
tested with
dd<-read.table(text="Visit Test
1 A
1 B
2 A
2 C
3 B", header=T)
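In current tidyr releases spread() is superseded by pivot_wider(). The following is a minimal sketch of the same reshaping, assuming a reasonably recent tidyr (in older releases values_fill may need to be a named list instead of a single value). The name wide is only illustrative.

```r
library(dplyr)
library(tidyr)

# Same toy data as in the answer above
dd <- read.table(text = "Visit Test
1 A
1 B
2 A
2 C
3 B", header = TRUE)

wide <- dd %>%
  mutate(Value = TRUE) %>%                 # marker for "this test was done"
  pivot_wider(names_from = Test,           # one column per test
              values_from = Value,
              values_fill = FALSE)         # missing combinations become FALSE
wide
```

The result should have one row per Visit and one logical column per distinct Test.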

Another option is to use reshape2::dcast with fun.aggregate to check if length is greater than 0.
library(reshape2)
dcast(df,Visit~Test, fun.aggregate = function(x)length(x)>0, value.var = "Test")
# Visit A B C
# 1 1 TRUE TRUE FALSE
# 2 2 TRUE FALSE TRUE
# 3 3 FALSE TRUE FALSE
Data:
df<-read.table(text="Visit Test
1 A
1 B
2 A
2 C
3 B",
header=TRUE, stringsAsFactors = FALSE)
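For comparison, the same presence/absence check can be sketched in base R with table(), without any reshaping package. This returns a logical matrix rather than a data frame, so it is an alternative view of the result rather than a drop-in replacement. The name present is only illustrative.

```r
# Same toy data as above, built directly
dd <- data.frame(Visit = c(1, 1, 2, 2, 3),
                 Test  = c("A", "B", "A", "C", "B"))

# Count each Visit/Test combination, then test whether the count is positive
present <- table(dd$Visit, dd$Test) > 0
present
```

Rows are labelled by Visit and columns by Test, so present["2", "C"] answers "was test C done at visit 2?".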

Related

Remove rows based on first instance to meet a condition

In the following dataset, I want to remove, for each ID, every row from the first instance where Var is TRUE onward (sorted by Time). Put differently, for each ID I want to keep only the rows that are FALSE up until the first TRUE, sorted by Time.
ID <- c('A','B','C','A','B','C','A','B','C','A','B','C')
Time <- c(3,3,3,6,6,6,9,9,9,12,12,12)
Var <- c(F,F,F,T,T,F,T,T,F,T,F,T)
data = data.frame(ID, Time, Var)
data
ID Time Var
1 A 3 FALSE
2 B 3 FALSE
3 C 3 FALSE
4 A 6 TRUE
5 B 6 TRUE
6 C 6 FALSE
7 A 9 TRUE
8 B 9 TRUE
9 C 9 FALSE
10 A 12 TRUE
11 B 12 FALSE
12 C 12 TRUE
The desired result for this data frame should be:
ID Time Var
A 3 FALSE
B 3 FALSE
C 3 FALSE
C 6 FALSE
C 9 FALSE
Note that the solution should not only remove rows where Var == TRUE, but should also remove rows where Var == FALSE that follow (in Time) an earlier row where Var == TRUE for that ID.
I've tried many different things but can't seem to figure this out. Any help is much appreciated!

Here's how to do that with dplyr using group_by and cumsum.
The rationale is that Var is a logical vector where FALSE is equal to 0 and TRUE is equal to 1. cumsum will remain at 0 until it hits the first TRUE.
library(dplyr)
data%>%
group_by(ID)%>%
filter(cumsum(Var)<1)
ID Time Var
<fctr> <dbl> <lgl>
1 A 3 FALSE
2 B 3 FALSE
3 C 3 FALSE
4 C 6 FALSE
5 C 9 FALSE
Here's the equivalent code with data.table (data must first be converted to a data.table with setDT):
library(data.table)
setDT(data)
data[data[, .I[cumsum(Var) <1], by = ID]$V1]
ID Time Var
1: A 3 FALSE
2: B 3 FALSE
3: C 3 FALSE
4: C 6 FALSE
5: C 9 FALSE
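To see why the cumsum filter works, the running total can be inspected on a single logical vector. This is a small base R sketch with a made-up vector (not the question's data). The names running and keep are only illustrative.

```r
# One group's Var values, already sorted by Time
Var <- c(FALSE, FALSE, TRUE, FALSE, TRUE)

running <- cumsum(Var)   # FALSE counts as 0, TRUE as 1
keep    <- running < 1   # TRUE only before the first TRUE in Var

running
keep
```

Because the running total never goes back to 0, every row at or after the first TRUE is dropped, including later FALSE rows. Both solutions assume the rows are already ordered by Time within each ID.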

This data.table solution should work.
library(data.table)
setDT(data)[, .SD[1:(which.max(Var)-1)], by=ID]
ID Time Var
1: A 3 FALSE
2: B 3 FALSE
3: C 3 FALSE
4: C 6 FALSE
5: C 9 FALSE
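Given that you want all the values up to the first TRUE value, which.max is a natural fit. Note that this assumes every ID has at least one TRUE and does not start with a TRUE: otherwise which.max(Var) is 1, the index becomes 1:0, and the first row of that group is returned.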
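The which.max approach relies on each ID having a TRUE somewhere after its first row. A hedged sketch of a variant that also covers groups starting with TRUE or containing no TRUE at all, using match() and seq_len() instead (the toy table dt and the name res are made up for illustration):

```r
library(data.table)

# Toy data: A starts with TRUE, B has no TRUE, C has a TRUE on its second row
dt <- data.table(
  ID   = c("A", "A", "B", "B", "C", "C"),
  Time = c(1, 2, 1, 2, 1, 2),
  Var  = c(TRUE, FALSE, FALSE, FALSE, FALSE, TRUE)
)

# match() gives the position of the first TRUE, or .N + 1 if there is none;
# seq_len() of that position minus one is empty when the group starts with TRUE
res <- dt[, .SD[seq_len(match(TRUE, Var, nomatch = .N + 1L) - 1L)], by = ID]
res
```

Here group A contributes no rows, group B keeps both rows, and group C keeps only its first row.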

You can do this with the cumall verb as well:
library(dplyr)
data %>%
dplyr::group_by(ID) %>%
dplyr::filter(dplyr::cumall(!Var))
ID Time Var
<chr> <dbl> <lgl>
1 A 3 FALSE
2 B 3 FALSE
3 C 3 FALSE
4 C 6 FALSE
5 C 9 FALSE
cumall(!x): all cases until the first TRUE
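As a quick illustration of what cumall() returns, here is a sketch on a single made-up vector. It assumes only that dplyr is installed; the name kept is illustrative.

```r
library(dplyr)

Var <- c(FALSE, FALSE, TRUE, FALSE)

# cumall() stays TRUE while every element so far is TRUE,
# so cumall(!Var) is TRUE only before the first TRUE in Var
kept <- cumall(!Var)
kept
```

The last element is FALSE even though Var is FALSE there, which is exactly the "drop everything after the first TRUE" behaviour the question asks for.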

Why do I get different results indexing with data.table

Here is a simple example of extracting some rows
from a data.table. With what appear to be identical
logical vectors, I get different answers:
a <- data.table(a=1:10, b = 10:1)
a # so here is the data we are working with
a b
1: 1 10
2: 2 9
3: 3 8
4: 4 7
5: 5 6
6: 6 5
7: 7 4
8: 8 3
9: 9 2
10: 10 1
Let's extract just the first column, since I need to specify
the column number dynamically as part of my processing:
col <- 1L # get column 1 ('a')
x <- a[[col]] > 5 # logical vector specifying condition
x
[1] FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE
str(x)
logi [1:10] FALSE FALSE FALSE FALSE FALSE TRUE ...
Look at the structure of a[[col]] > 5:
a[[col]] > 5
[1] FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE
str(a[[col]] > 5)
logi [1:10] FALSE FALSE FALSE FALSE FALSE TRUE ...
This looks very much like 'x', so why do these two ways
of indexing 'a' give different results?
a[x] # using 'x' as the logical vector
a b
1: 6 5
2: 7 4
3: 8 3
4: 9 2
5: 10 1
a[a[[col]] > 5] # using the expression as the logical vector
Empty data.table (0 rows) of 2 cols: a,b
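One possible explanation, offered as a hedged reading rather than a confirmed answer: data.table evaluates the i expression with the table's columns in scope, so inside a[...] the name a may resolve to the column a rather than to the table. Under that reading a[[col]] is the first element of the column (1), 1 > 5 is a single FALSE, and no rows are selected. A small sketch where the table is given a name that does not clash with any column (dt, r1 and r2 are names chosen here for illustration):

```r
library(data.table)

# Same data, but the table name differs from every column name
dt  <- data.table(a = 1:10, b = 10:1)
col <- 1L

x  <- dt[[col]] > 5        # logical vector built outside the table's scope
r1 <- dt[x]                # index with the precomputed vector
r2 <- dt[dt[[col]] > 5]    # index with the expression; dt is not a column name

r1
r2
```

With no name clash, both forms should select the same five rows.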

nested looping through rows/columns in r

I have a data set that looks like this:
ID RESULT_DATE Hyperkalemia Is.Hemolyzed
1 5/27/2008 2 FALSE
1 5/28/2008 2 FALSE
1 5/29/2008 2 FALSE
1 5/29/2008 2 FALSE
1 5/29/2008 3 FALSE
1 5/30/2008 2 FALSE
1 6/15/2008 4 FALSE
1 10/14/2014 1 FALSE
1 10/16/2014 NA FALSE
2 8/12/2013 2 FALSE
3 2/26/2012 2 FALSE
3 2/27/2012 2 FALSE
3 4/18/2012 3 FALSE
3 4/18/2012 4 FALSE
3 4/21/2012 4 FALSE
3 4/23/2012 4 FALSE
3 4/27/2012 4 FALSE
3 5/8/2012 4 FALSE
3 5/12/2012 4 FALSE
3 5/15/2012 4 FALSE
3 5/15/2012 NA FALSE
I want to find the number of times a potassium test with a Hyperkalemia score of 3 or 4 and Is.Hemolyzed = FALSE was repeated the same day (repeats must be counted by patient ID). The objective is the total number of times the test qualified for a repeat, and then the total number of times the repeat occurred.
Could someone help me translate my pseudocode into working R code?
# data.frame = pots
# for every row (sorted by patient and result date)
for (i in 1:nrow(pots){
# for each patient (sorted by result date)
# how do I count the rows for the individual patient?
for (i in 1:length(pots$ID)) {
# assign result date to use for calculation
result_date = pots$result_date
# if Hyperkalemia = 3 or 4
if (Hyperkalemia == 3 | Hyperkalemia == 4)
# go find the next result for patient where is.Hemolyzed = FALSE
# how do I get the next result?
for (i+1)
# assign date to compare to first date
next_result_date = pots$result_date
if next_result_date > result_date
then repeated_same_day <- FALSE
else if result_date == result_date
then repeated_same_day <- TRUE
}
}
goal: I want to calculate how often (by unique ID) a grade 3 or 4 non-hemolyzed potassium result has another potassium test within 24 hours (I'm using a different field now -- I guess I can add some date function to calculate 24 hours).
Edit: I did get it working with for loops eventually!! Sharing in case it is helpful to anyone. Later I did see a mistake, but for my data set it was okay.
library(dplyr)
pots <- read.csv("phis_potassium-2015-07-30.csv",
head=TRUE, stringsAsFactors = FALSE)
pots <- arrange(pots, MRN, COLLECTED_DATE)
pots$Hyperkalemia[is.na(pots$Hyperkalemia)] <- 0
pots$repeated_wi24hours <- NA
pots$met_criteria <- NA
pots$next_test_time_interval <- NA
# data.frame = pots
# for every patient (sorted by patient and collected date)
for (mrn in unique(pots$MRN)){
# for each row for each patient (sorted by collected date)
for (i in 1:length(pots$MRN[pots$MRN == pots$MRN[mrn]])) {
# if Hyperkalemia = 3 or 4 AND Is.Hemolyzed == FALSE
if((pots$Hyperkalemia[i] == 3 | pots$Hyperkalemia[i] == 4) & pots$Is.Hemolyzed[i] == FALSE){
pots$met_criteria[i] <- TRUE
# get time interval between tests
pots$next_test_time_interval[i] <- difftime(pots$COLLECTED_DATE[i+1], pots$COLLECTED_DATE[i], units = "hours")
# if next date is within 24 hours, then test repeated
if (pots$next_test_time_interval[i] <= 24 ){
pots$repeated_wi24hours[i] <- TRUE
}
else {
pots$repeated_wi24hours[i] <- FALSE
}
}
}
}
Desired output
ID RESULT_DATE Hyperkalemia Is.Hemolyzed Met_criteria Repeated
1 5/27/2008 2 FALSE
1 5/28/2008 2 FALSE
1 5/29/2008 2 FALSE
1 5/29/2008 2 FALSE
1 5/29/2008 3 FALSE TRUE FALSE
1 5/30/2008 2 FALSE
1 6/15/2008 4 FALSE
1 10/14/2014 1 FALSE
2 8/12/2013 2 FALSE
3 2/26/2012 2 FALSE
3 2/27/2012 2 FALSE
3 4/18/2012 3 FALSE TRUE TRUE
3 4/18/2012 4 FALSE TRUE FALSE
3 4/21/2012 4 FALSE TRUE FALSE
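The loop in the edit above uses i, a count within one patient, as an index into the whole data frame, which may be the mistake the edit mentions. A hedged sketch of the same 24-hour rule without explicit loops, using dplyr's group_by() and lead(). The toy data below is made up, and it assumes COLLECTED_DATE has already been parsed to a date-time; the column names follow the loop version.

```r
library(dplyr)

# Made-up example data; COLLECTED_DATE is assumed to be POSIXct already
pots <- data.frame(
  MRN = c(1, 1, 1, 2, 2),
  COLLECTED_DATE = as.POSIXct(c("2012-04-18 08:00", "2012-04-18 14:00",
                                "2012-04-21 09:00", "2013-08-12 10:00",
                                "2013-08-14 10:00"), tz = "UTC"),
  Hyperkalemia = c(3, 4, 4, 2, NA),
  Is.Hemolyzed = FALSE
)

res <- pots %>%
  arrange(MRN, COLLECTED_DATE) %>%
  group_by(MRN) %>%                         # lead() must not cross patients
  mutate(
    # %in% treats NA as "not 3 or 4", so no NA replacement is needed
    met_criteria = Hyperkalemia %in% c(3, 4) & !Is.Hemolyzed,
    # hours until the same patient's next test (NA for the last test)
    next_test_hours = as.numeric(difftime(lead(COLLECTED_DATE),
                                          COLLECTED_DATE, units = "hours")),
    repeated_wi24hours = met_criteria & !is.na(next_test_hours) &
      next_test_hours <= 24
  ) %>%
  ungroup()

sum(res$met_criteria)        # tests that qualified for a repeat
sum(res$repeated_wi24hours)  # qualifying tests repeated within 24 hours
```

Grouping before calling lead() is the design choice that matters here: it keeps the "next test" from being taken from a different patient.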

How about this:
metCriteria <- function( dfPots )
{
(dfPots$Hyperkalemia==3 | dfPots$Hyperkalemia==4) & !dfPots$Is.Hemolyzed
}
#----------------------------------------------------------------------
pots <- read.table(filename, header=TRUE)
d <- paste( as.character(pots$RESULT_DATE),
"_ID",
as.character(pots$ID))
lastOccurence <- unlist(lapply(d,function(x){which.min(diff(c(d,FALSE)==x))}))
pots <- cbind(pots, data.frame( Met_criteria = rep(FALSE,nrow(pots))),
Repeated = rep(TRUE ,nrow(pots)) )
pots$Repeated[lastOccurence] <- FALSE
pots$Met_criteria[which(metCriteria(pots))] <- TRUE
The dates and IDs are pasted together in the vector "d".
The i-th component of the vector "lastOccurence" is the row number where the date/ID pair d[i] occurs for the last time.
The data frame "pots" is extended by two columns, "Met_criteria" and "Repeated".
"Met_criteria" is initialized to "FALSE". Then "which(metCriteria(pots))" picks the row numbers where the criteria are met. In these rows "Met_criteria" is set to "TRUE".
"Repeated" is initialized to "TRUE". It is set to "FALSE" in those rows where the corresponding date and ID occur for the last time.
Example:
> pots
ID RESULT_DATE Hyperkalemia Is.Hemolyzed Met_criteria Repeated
1 1 5/27/2008 2 FALSE FALSE FALSE
2 1 5/28/2008 2 FALSE FALSE FALSE
3 3 5/28/2008 2 FALSE FALSE FALSE
4 1 5/29/2008 2 FALSE FALSE TRUE
5 1 5/29/2008 2 FALSE FALSE TRUE
6 1 5/29/2008 3 FALSE TRUE FALSE
7 2 5/29/2008 4 FALSE TRUE FALSE
8 1 5/30/2008 2 FALSE FALSE FALSE
9 1 6/15/2008 4 FALSE TRUE FALSE
10 1 10/14/2014 1 FALSE FALSE FALSE
11 1 10/16/2014 NA FALSE FALSE FALSE
12 2 8/12/2013 2 FALSE FALSE FALSE
13 3 2/26/2012 2 FALSE FALSE FALSE
14 3 2/27/2012 2 FALSE FALSE FALSE
15 3 4/18/2012 3 FALSE TRUE TRUE
16 3 4/18/2012 4 FALSE TRUE FALSE
17 3 4/21/2012 4 FALSE TRUE FALSE
18 3 4/23/2012 4 FALSE TRUE FALSE
19 3 4/27/2012 4 FALSE TRUE FALSE
20 3 5/8/2012 4 FALSE TRUE FALSE
21 3 5/12/2012 4 FALSE TRUE FALSE
22 3 5/15/2012 4 FALSE TRUE TRUE
23 3 5/15/2012 NA FALSE FALSE FALSE
>
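The "Repeated" flag described above (TRUE unless the row is the last one for its date/ID pair) can also be sketched with base R's duplicated() and its fromLast argument. This is an alternative formulation, not the answer's own code, and the vector d below is a made-up sample of pasted date/ID keys.

```r
# Made-up keys in the same spirit as the pasted date/ID vector "d"
d <- c("5/29/2008_ID1", "5/29/2008_ID1", "5/29/2008_ID1",
       "5/30/2008_ID1", "4/18/2012_ID3", "4/18/2012_ID3")

# Scanning from the end, a key counts as duplicated if it occurs again later,
# so only the last occurrence of each key is FALSE
repeated <- duplicated(d, fromLast = TRUE)
repeated
```

Unlike a run-based approach, this does not require rows with the same key to be adjacent.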

If row meets criteria, then TRUE else FALSE in R

I have nested data that looks like this:
ID Date Behavior
1 1 FALSE
1 2 FALSE
1 3 TRUE
2 3 FALSE
2 5 FALSE
2 6 TRUE
2 7 FALSE
3 1 FALSE
3 2 TRUE
I'd like to create a column called counter that, for each unique ID, increases by one on each row until Behavior is TRUE.
I am expecting this result:
ID Date Behavior counter
1 1 FALSE 1
1 2 FALSE 2
1 3 TRUE 3
2 3 FALSE 1
2 5 FALSE 2
2 6 TRUE 3
2 7 FALSE
3 1 FALSE 1
3 2 TRUE 2
Ultimately, I would like to pull the minimum counter in which the observation occurs for each unique ID. However, I'm having trouble developing a solution for this current counter issue.
Any and all help is greatly appreciated!

I'd like to create a counter within each array of unique IDs and from there, ultimately pull the row level info - the question is how long on average does it take to reach a TRUE
I sense there might be an XY problem going on here. You can answer your latter question directly, like so:
> library(plyr)
> mean(daply(d, .(ID), function(grp)min(which(grp$Behavior))))
[1] 2.666667
(where d is your data frame.)

Here's a dplyr solution that finds the row number for each TRUE in each ID:
library(dplyr)
newdf <- yourdataframe %>%
group_by(ID) %>%
summarise(
ftrue = which(Behavior))

do.call(rbind, by(df, list(df$ID), function(x) {n = nrow(x); data.frame(x, Counter = c(1:(m<-which(x$Behavior)), rep(NA, n-m)))}))
ID Date Behavior Counter
1.1 1 1 FALSE 1
1.2 1 2 FALSE 2
1.3 1 3 TRUE 3
2.4 2 3 FALSE 1
2.5 2 5 FALSE 2
2.6 2 6 TRUE 3
2.7 2 7 FALSE NA
3.8 3 1 FALSE 1
3.9 3 2 TRUE 2
df = read.table(text = "ID Date Behavior
1 1 FALSE
1 2 FALSE
1 3 TRUE
2 3 FALSE
2 5 FALSE
2 6 TRUE
2 7 FALSE
3 1 FALSE
3 2 TRUE", header = T)
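For the counter column itself, a hedged dplyr sketch is to number the rows within each ID and blank out anything after the first TRUE. The helper column after_true and the name res are made up for illustration; the data is the question's example.

```r
library(dplyr)

df <- data.frame(
  ID       = c(1, 1, 1, 2, 2, 2, 2, 3, 3),
  Date     = c(1, 2, 3, 3, 5, 6, 7, 1, 2),
  Behavior = c(FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE)
)

res <- df %>%
  group_by(ID) %>%
  mutate(
    # TRUE for rows strictly after the first TRUE within the ID
    after_true = lag(cumsum(Behavior), default = 0L) > 0,
    # row position within the ID, or NA once a TRUE has already occurred
    counter = if_else(after_true, NA_integer_, row_number())
  ) %>%
  ungroup()

res
```

Rows at or before the first TRUE get 1, 2, 3, ... and later rows get NA, matching the expected result in the question.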

R: Build Apply function to find minimum of columns based on conditions in other (related) columns

With data as such below, I'm trying to reassign any of the test cols (test_A, etc.) to their corresponding time cols (time_A, etc.) if the test is true, and then find the minimum of all true test times.
[ID] [time_A] [time_B] [time_C] [test_A] [test_B] [test_C] [min_true_time]
[1,] 1 2 3 4 FALSE TRUE FALSE ?
[2,] 2 -4 5 6 TRUE TRUE FALSE ?
[3,] 3 6 1 -2 TRUE TRUE TRUE ?
[4,] 4 -2 3 4 TRUE FALSE FALSE ?
My actual data set is quite large so my attempts at if and for loops have failed miserably. But I can't make any progress on an apply function.
A more negative time, say -2, would be considered the minimum for row 3.
Any suggestions are welcomed gladly

You don't give much information, but I think this does what you need. No idea if it is efficient enough, since you don't say how big your dataset actually is.
#I assume your data is in a data.frame:
df <- read.table(text="ID time_A time_B time_C test_A test_B test_C
1 1 2 3 4 FALSE TRUE FALSE
2 2 -4 5 6 TRUE TRUE FALSE
3 3 6 1 -2 TRUE TRUE TRUE
4 4 -2 3 4 TRUE FALSE FALSE")
#loop over all rows and subset column 2:4 with column 5:7, then take the mins
df$min_true_time <- sapply(1:nrow(df), function(i) min(df[i,2:4][unlist(df[i,5:7])]))
df
# ID time_A time_B time_C test_A test_B test_C min_true_time
#1 1 2 3 4 FALSE TRUE FALSE 3
#2 2 -4 5 6 TRUE TRUE FALSE -4
#3 3 6 1 -2 TRUE TRUE TRUE -2
#4 4 -2 3 4 TRUE FALSE FALSE -2
Another way, which might be faster (I'm not in the mood for benchmarking):
m <- as.matrix(df[,2:4])
m[!df[,5:7]] <- NA
df$min_true_time <- apply(m,1,min,na.rm=TRUE)
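A third option, sketched here under the assumption that the columns are named as in the question, avoids row-wise iteration entirely by masking and then calling pmin() across the columns. The names times and tests are illustrative.

```r
df <- data.frame(
  ID     = 1:4,
  time_A = c(2, -4, 6, -2),
  time_B = c(3, 5, 1, 3),
  time_C = c(4, 6, -2, 4),
  test_A = c(FALSE, TRUE, TRUE, TRUE),
  test_B = c(TRUE, TRUE, TRUE, FALSE),
  test_C = c(FALSE, FALSE, TRUE, FALSE)
)

times <- as.matrix(df[, c("time_A", "time_B", "time_C")])
tests <- as.matrix(df[, c("test_A", "test_B", "test_C")])

times[!tests] <- NA   # hide times whose test is FALSE

# pmin() takes the element-wise minimum across the columns in one call
df$min_true_time <- do.call(pmin, c(as.data.frame(times), na.rm = TRUE))
df$min_true_time
```

One difference from the apply() version: a row with no TRUE test yields NA here, whereas min() over an all-NA row with na.rm=TRUE returns Inf with a warning.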