Why do I get different results indexing with data.table

Here is a simple example of extracting some rows from a data.table.
Two logical vectors that appear to be identical give different
answers when used as the row index:
a <- data.table(a=1:10, b = 10:1)
a # so here is the data we are working with
a b
1: 1 10
2: 2 9
3: 3 8
4: 4 7
5: 5 6
6: 6 5
7: 7 4
8: 8 3
9: 9 2
10: 10 1
Let's extract just the first column, since I need to specify
the column number dynamically as part of my processing:
col <- 1L # get column 1 ('a')
x <- a[[col]] > 5 # logical vector specifying condition
x
[1] FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE
str(x)
logi [1:10] FALSE FALSE FALSE FALSE FALSE TRUE ...
Now look at the structure of a[[col]] > 5:
a[[col]] > 5
[1] FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE
str(a[[col]] > 5)
logi [1:10] FALSE FALSE FALSE FALSE FALSE TRUE ...
This looks exactly like 'x', so why do these two ways
of indexing 'a' give different results?
a[x] # using 'x' as the logical vector
a b
1: 6 5
2: 7 4
3: 8 3
4: 9 2
5: 10 1
a[a[[col]] > 5] # using the expression as the logical vector
Empty data.table (0 rows) of 2 cols: a,b
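
One likely explanation, offered here as an assumption since this thread carries no answer: data.table evaluates the expression inside `[` with the table's columns visible as variables, so inside `a[...]` the name `a` refers to the column `a`, not to the table. `a[[col]]` is then the first element of that column (1), and `1 > 5` is a single FALSE, which selects no rows. The base R sketch below reproduces the same name lookup with `with()` on a plain data frame; `outside` and `inside` are illustrative names.

```r
# Base R analogue of the name clash: a data frame 'a' with a column named 'a'.
a <- data.frame(a = 1:10, b = 10:1)
col <- 1L

# Evaluated outside the frame, 'a' is the data frame,
# so a[[col]] is the whole first column.
outside <- a[[col]] > 5           # logical vector of length 10

# Evaluated with the columns in scope, 'a' is the column vector 1:10,
# so a[[col]] is only its first element.
inside <- with(a, a[[col]] > 5)   # a single logical value

length(outside)
inside
```

If this is indeed the cause, computing the logical vector beforehand (as with `x`), or giving the table a name that no column shares, avoids the clash.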

Related

Roll condition ifelse in R data frame

I have a data frame with two columns in R, and I want to create a third column that checks, for each row, whether B lies strictly between -A and A, where A is taken from two rows earlier, as described in the table below.
The condition is a rolling ifelse and goes like this:
IF -A1<B3<A1 TRUE ELSE FALSE
IF -A2<B4<A2 TRUE ELSE FALSE
IF -A3<B5<A3 TRUE ELSE FALSE
IF -A4<B6<A4 TRUE ELSE FALSE
A B CHECK
1 4 NA
2 5 NA
3 6 FALSE
4 1 TRUE
5 -4 FALSE
6 1 TRUE
How can I do this in R? Is there a function in base R or in the dplyr framework?

Since R is vectorized, you can do that with one command, using for instance dplyr::lag:
library(dplyr)
df %>%
  mutate(CHECK = -lag(A, n = 2) < B & lag(A, n = 2) > B)
A B CHECK
1 1 4 NA
2 2 5 NA
3 3 6 FALSE
4 4 1 TRUE
5 5 -4 FALSE
6 6 1 TRUE
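
Since the question also asks about base R, here is a minimal sketch of the same check without dplyr. It rebuilds the example data; `lagA` is an illustrative name, and the two leading NAs stand in for the rows that have no value two positions earlier.

```r
# Example data from the question.
df <- data.frame(A = 1:6, B = c(4, 5, 6, 1, -4, 1))

# Shift A down by two rows: drop the last two values and pad the front with NA.
lagA <- c(NA, NA, head(df$A, -2))

# TRUE where -A[i-2] < B[i] < A[i-2]; NA for the first two rows.
df$CHECK <- -lagA < df$B & df$B < lagA
df
```

The comparison propagates NA on its own, so no explicit ifelse is needed for the first two rows.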

How does this odds/even extraction work in R language?

> a <- sample(c(1:10), 20, replace = TRUE)
> a
[1] 6 3 6 2 6 9 3 9 9 8 2 10 7 9 1 5 3 10 5 5
> a[c(TRUE,FALSE)]
[1] 6 6 6 3 9 2 7 1 3 5
Why does a[c(TRUE,FALSE)] give me the elements at the ODD positions of my vector? c(TRUE, FALSE) has a length of 2, so I expected it to select only the single index 1, which is TRUE.
Why does it work this way?

Logical subsets are recycled to match the length of the vector (numerical subsets are not recycled).
From help("["):
Arguments
i, j, …
...
For [-indexing only: i, j, … can be logical vectors,
indicating elements/slices to select. Such vectors are recycled if
necessary to match the corresponding extent. i, j, … can also be
negative integers, indicating elements/slices to leave out of the
selection.
When indexing arrays by [ a single argument i can be a matrix with
as many columns as there are dimensions of x; the result is then a
vector with elements corresponding to the sets of indices in each row
of i.
To illustrate, try:
cbind.data.frame(x = 1:10, odd = c(TRUE, FALSE), even = c(FALSE, TRUE))
# x odd even
# 1 1 TRUE FALSE
# 2 2 FALSE TRUE
# 3 3 TRUE FALSE
# 4 4 FALSE TRUE
# 5 5 TRUE FALSE
# 6 6 FALSE TRUE
# 7 7 TRUE FALSE
# 8 8 FALSE TRUE
# 9 9 TRUE FALSE
# 10 10 FALSE TRUE

a[TRUE] gives you all the elements and a[FALSE] gives none. For a[c(TRUE,FALSE)], the index of length 2 is recycled to length(a), which is 20, so it becomes TRUE, FALSE, TRUE, FALSE, ..., and you get just the elements at the odd indexes.
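
The recycling can be made explicit with `rep()`. The sketch below uses the vector printed in the question; `mask`, `odd_positions`, `even_positions` and `first_only` are illustrative names.

```r
# The vector printed in the question.
a <- c(6, 3, 6, 2, 6, 9, 3, 9, 9, 8, 2, 10, 7, 9, 1, 5, 3, 10, 5, 5)

# What recycling does implicitly: repeat c(TRUE, FALSE) up to length(a).
mask <- rep(c(TRUE, FALSE), length.out = length(a))

odd_positions  <- a[mask]             # same result as a[c(TRUE, FALSE)]
even_positions <- a[c(FALSE, TRUE)]   # shifted pattern picks positions 2, 4, ...

# Converting the logical index to positions first gives the behaviour the
# question expected: which() returns 1, and numeric indices are not recycled.
first_only <- a[which(c(TRUE, FALSE))]
```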

Find all the duplicate records using duplicated() on R [duplicate]

I have one question in R.
I have the following example code for a question.
> exdata <- data.frame(a = rep(1:4, each = 3),
+ b = c(1, 1, 2, 4, 5, 3, 3, 2, 3, 9, 9, 9))
> exdata
a b
1 1 1
2 1 1
3 1 2
4 2 4
5 2 5
6 2 3
7 3 3
8 3 2
9 3 3
10 4 9
11 4 9
12 4 9
> exdata[duplicated(exdata), ]
a b
2 1 1
9 3 3
11 4 9
12 4 9
I tried to use the duplicated() function to find all the duplicate records in the exdata dataframe, but it only finds a part of the duplicated records, so it is difficult to confirm intuitively whether duplicates exist.
I'm looking for a solution that returns the following results
a b
1 1 1
2 1 1
7 3 3
9 3 3
10 4 9
11 4 9
12 4 9
Can the duplicated() function be used to get this result?
Or is there another function that can?
I would appreciate your help.

duplicated returns a logical vector with the same length as its argument, which is TRUE for every element that has already appeared earlier in that argument. It has a method for data frames, duplicated.data.frame, that looks for duplicated rows (and so returns a logical vector of length nrow(exdata)). Your extraction using that as a logical vector is going to return exactly those rows that have occurred at least once before. It WON'T, however, return the first occurrence of those rows.
Look at the index vector you're using:
duplicated(exdata)
# [1] FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE TRUE TRUE
But you can combine it with fromLast = TRUE to get all of the occurrences of these rows:
exdata[duplicated(exdata) | duplicated(exdata, fromLast = TRUE),]
# a b
# 1 1 1
# 2 1 1
# 7 3 3
# 9 3 3
# 10 4 9
# 11 4 9
# 12 4 9
Look at the logical vector for duplicated(exdata, fromLast = TRUE), and its combination with duplicated(exdata), to convince yourself:
duplicated(exdata, fromLast = TRUE)
# [1] TRUE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE TRUE TRUE FALSE
duplicated(exdata) | duplicated(exdata, fromLast = TRUE)
# [1] TRUE TRUE FALSE FALSE FALSE FALSE TRUE FALSE TRUE TRUE TRUE TRUE
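
If this is needed in more than one place, the two calls can be wrapped in a small function. This is only a sketch; `all_duplicates` is a hypothetical helper name, not part of base R.

```r
# Hypothetical helper: keep every row that has at least one identical twin,
# including the first occurrence.
all_duplicates <- function(df) {
  df[duplicated(df) | duplicated(df, fromLast = TRUE), , drop = FALSE]
}

# Example data from the question.
exdata <- data.frame(a = rep(1:4, each = 3),
                     b = c(1, 1, 2, 4, 5, 3, 3, 2, 3, 9, 9, 9))

dups <- all_duplicates(exdata)
rownames(dups)
```

The original row names are kept, so they show directly which rows of exdata were involved.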

Remove rows based on first instance to meet a condition

In the following dataset, for each ID I want to remove every row from the first one (sorted by Time) where Var is TRUE onward. Put differently, I want to keep, for each ID, only the rows that are FALSE up until the first TRUE, sorted by Time.
ID <- c('A','B','C','A','B','C','A','B','C','A','B','C')
Time <- c(3,3,3,6,6,6,9,9,9,12,12,12)
Var <- c(F,F,F,T,T,F,T,T,F,T,F,T)
data = data.frame(ID, Time, Var)
data
ID Time Var
1 A 3 FALSE
2 B 3 FALSE
3 C 3 FALSE
4 A 6 TRUE
5 B 6 TRUE
6 C 6 FALSE
7 A 9 TRUE
8 B 9 TRUE
9 C 9 FALSE
10 A 12 TRUE
11 B 12 FALSE
12 C 12 TRUE
The desired result for this data frame should be:
ID Time Var
A 3 FALSE
B 3 FALSE
C 3 FALSE
C 6 FALSE
C 9 FALSE
Note that the solution should not only remove rows where Var == TRUE, but should also remove rows where Var == FALSE that follow (in Time) another instance where Var == TRUE for that ID.
I've tried many different things but can't seem to figure this out. Any help is much appreciated!

Here's how to do that with dplyr using group_by and cumsum.
The rationale is that Var is a logical vector where FALSE is equal to 0 and TRUE is equal to 1. cumsum will remain at 0 until it hits the first TRUE.
library(dplyr)
data %>%
  group_by(ID) %>%
  filter(cumsum(Var) < 1)
ID Time Var
<fctr> <dbl> <lgl>
1 A 3 FALSE
2 B 3 FALSE
3 C 3 FALSE
4 C 6 FALSE
5 C 9 FALSE
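Here's the equivalent code with data.table (data must be a data.table for this syntax, hence the setDT call):
library(data.table)
setDT(data) # convert the data frame to a data.table by reference
data[data[, .I[cumsum(Var) < 1], by = ID]$V1]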
ID Time Var
1: A 3 FALSE
2: B 3 FALSE
3: C 3 FALSE
4: C 6 FALSE
5: C 9 FALSE
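
The same cumsum idea also works in base R, with `ave()` doing the per-ID grouping. This is a minimal sketch on the question's data; `seen_true` and `result` are illustrative names.

```r
# Example data from the question.
data <- data.frame(
  ID   = c("A", "B", "C", "A", "B", "C", "A", "B", "C", "A", "B", "C"),
  Time = c(3, 3, 3, 6, 6, 6, 9, 9, 9, 12, 12, 12),
  Var  = c(FALSE, FALSE, FALSE, TRUE, TRUE, FALSE,
           TRUE, TRUE, FALSE, TRUE, FALSE, TRUE)
)

# Sort so that the running count follows Time within each ID.
data <- data[order(data$ID, data$Time), ]

# Running count of TRUE values per ID; it stays at 0 until the first TRUE.
seen_true <- ave(as.integer(data$Var), data$ID, FUN = cumsum)

# Keep only the rows before the first TRUE of each ID.
result <- data[seen_true == 0, ]
result
```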

This data.table solution should work.
library(data.table)
> setDT(data)[, .SD[1:(which.max(Var)-1)], by=ID]
ID Time Var
1: A 3 FALSE
2: B 3 FALSE
3: C 3 FALSE
4: C 6 FALSE
5: C 9 FALSE
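Given that you want all the values up to the first TRUE value, which.max is a natural fit. Note that it relies on every ID having a first TRUE that is not in its first row: otherwise which.max(Var) returns 1, and 1:0 selects the first row only.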
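
To see how the `1:(which.max(Var)-1)` index behaves at the edges, the sketch below compares it with a cumsum-based index on three plain logical vectors. `naive_index` and `safe_index` are hypothetical helper names used only for this illustration.

```r
# Index used in the answer above.
naive_index <- function(x) 1:(which.max(x) - 1)

# Alternative: positions before the first TRUE, via a running count.
safe_index <- function(x) which(cumsum(x) == 0)

typical    <- c(FALSE, FALSE, TRUE, FALSE)  # first TRUE in position 3
no_true    <- c(FALSE, FALSE, FALSE)        # group without any TRUE
true_first <- c(TRUE, FALSE)                # group starting with TRUE

naive_index(typical)      # positions 1 and 2, as intended
safe_index(typical)       # same

naive_index(no_true)      # 1:0, i.e. c(1, 0): only the first row is kept
safe_index(no_true)       # all three positions

naive_index(true_first)   # 1:0 again: the TRUE row itself is kept
safe_index(true_first)    # integer(0): nothing is kept
```

On the question's data every ID has a TRUE after its first row, so both indexes agree there.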

You can do this with the cumall verb as well:
library(dplyr)
data %>%
  dplyr::group_by(ID) %>%
  dplyr::filter(dplyr::cumall(!Var))
ID Time Var
<chr> <dbl> <lgl>
1 A 3 FALSE
2 B 3 FALSE
3 C 3 FALSE
4 C 6 FALSE
5 C 9 FALSE
cumall(!x): all cases until the first TRUE

nested looping through rows/columns in r

I have a data set that looks like this:
ID RESULT_DATE Hyperkalemia Is.Hemolyzed
1 5/27/2008 2 FALSE
1 5/28/2008 2 FALSE
1 5/29/2008 2 FALSE
1 5/29/2008 2 FALSE
1 5/29/2008 3 FALSE
1 5/30/2008 2 FALSE
1 6/15/2008 4 FALSE
1 10/14/2014 1 FALSE
1 10/16/2014 NA FALSE
2 8/12/2013 2 FALSE
3 2/26/2012 2 FALSE
3 2/27/2012 2 FALSE
3 4/18/2012 3 FALSE
3 4/18/2012 4 FALSE
3 4/21/2012 4 FALSE
3 4/23/2012 4 FALSE
3 4/27/2012 4 FALSE
3 5/8/2012 4 FALSE
3 5/12/2012 4 FALSE
3 5/15/2012 4 FALSE
3 5/15/2012 NA FALSE
I want to find the number of times a potassium test with a Hyperkalemia score of 3 or 4 and Is.Hemolyzed = FALSE was repeated the same day (repeats must be counted by patient ID). The objective is the total number of times a test qualified for a repeat, and then the total number of times the repeat occurred.
Could someone help me translate my pseudocode into working R code?
# data.frame = pots
# for every row (sorted by patient and result date)
for (i in 1:nrow(pots){
# for each patient (sorted by result date)
# how do I do I count the rows for the individual patient?
for (i in 1:length(pots$ID)) {
# assign result date to use for calculation
result_date = pots$result_date
# if Hyperkalemia = 3 or 4
if (Hyperkalemia == 3 | Hyperkalemia == 4)
# go find the next result for patient where is.Hemolyzed = FALSE
# how do I get the next result?
for (i+1)
# assign date to compare to first date
next_result_date = pots$result_date
if next_result_date > result_date
then repeated_same_day <- FALSE
else if result_date == result_date
then repeated_same_day <- TRUE
}
}
goal: I want to calculate how often (by unique ID) a grade 3 or 4 non-hemolyzed potassium result has another potassium test within 24 hours (I'm using a different field now -- I guess I can add some date function to calculate 24 hours).
Edit: I did get it working with for loops eventually!! Sharing in case it is helpful to anyone. Later I did see a mistake, but for my data set it was okay.
library(dplyr)
pots <- read.csv("phis_potassium-2015-07-30.csv",
head=TRUE, stringsAsFactors = FALSE)
pots <- arrange(pots, MRN, COLLECTED_DATE)
pots$Hyperkalemia[is.na(pots$Hyperkalemia)] <- 0
pots$repeated_wi24hours <- NA
pots$met_criteria <- NA
pots$next_test_time_interval <- NA
# data.frame = pots
# for every patient (sorted by patient and collected date)
for (mrn in unique(pots$MRN)){
# for each row for each patient (sorted by collected date)
for (i in 1:length(pots$MRN[pots$MRN == pots$MRN[mrn]])) {
# if Hyperkalemia = 3 or 4 AND Is.Hemolyzed == FALSE
if((pots$Hyperkalemia[i] == 3 | pots$Hyperkalemia[i] == 4) & pots$Is.Hemolyzed[i] == FALSE){
pots$met_criteria[i] <- TRUE
# get time interval between tests
pots$next_test_time_interval[i] <- difftime(pots$COLLECTED_DATE[i+1], pots$COLLECTED_DATE[i], units = "hours")
# if next date is within 24 hours, then test repeated
if (pots$next_test_time_interval[i] <= 24 ){
pots$repeated_wi24hours[i] <- TRUE
}
else {
pots$repeated_wi24hours[i] <- FALSE
}
}
}
}
Desired output
ID RESULT_DATE Hyperkalemia Is.Hemolyzed Met_criteria Repeated
1 5/27/2008 2 FALSE
1 5/28/2008 2 FALSE
1 5/29/2008 2 FALSE
1 5/29/2008 2 FALSE
1 5/29/2008 3 FALSE TRUE FALSE
1 5/30/2008 2 FALSE
1 6/15/2008 4 FALSE
1 10/14/2014 1 FALSE
2 8/12/2013 2 FALSE
3 2/26/2012 2 FALSE
3 2/27/2012 2 FALSE
3 4/18/2012 3 FALSE TRUE TRUE
3 4/18/2012 4 FALSE TRUE FALSE
3 4/21/2012 4 FALSE TRUE FALSE

How about this:
metCriteria <- function( dfPots )
{
(dfPots$Hyperkalemia==3 | dfPots$Hyperkalemia==4) & !dfPots$Is.Hemolyzed
}
#----------------------------------------------------------------------
pots <- read.table(filename, header=TRUE)
d <- paste( as.character(pots$RESULT_DATE),
"_ID",
as.character(pots$ID))
lastOccurence <- unlist(lapply(d,function(x){which.min(diff(c(d,FALSE)==x))}))
pots <- cbind(pots, data.frame( Met_criteria = rep(FALSE,nrow(pots))),
Repeated = rep(TRUE ,nrow(pots)) )
pots$Repeated[lastOccurence] <- FALSE
pots$Met_criteria[which(metCriteria(pots))] <- TRUE
The dates and IDs are pasted together in the vector "d".
The i-th component of the vector "lastOccurence" is the row number where the date/ID pair d[i] occurs for the last time.
The data frame "pots" is extended by two columns, "Met_criteria" and "Repeated".
"Met_criteria" is initialized to "FALSE". Then "which(metCriteria(pots))" picks the row numbers where the criteria are met. In these rows "Met_criteria" is set to "TRUE".
"Repeated" is initialized to "TRUE". It is set to "FALSE" in those rows where the corresponding date and ID occur for the last time.
Example:
> pots
ID RESULT_DATE Hyperkalemia Is.Hemolyzed Met_criteria Repeated
1 1 5/27/2008 2 FALSE FALSE FALSE
2 1 5/28/2008 2 FALSE FALSE FALSE
3 3 5/28/2008 2 FALSE FALSE FALSE
4 1 5/29/2008 2 FALSE FALSE TRUE
5 1 5/29/2008 2 FALSE FALSE TRUE
6 1 5/29/2008 3 FALSE TRUE FALSE
7 2 5/29/2008 4 FALSE TRUE FALSE
8 1 5/30/2008 2 FALSE FALSE FALSE
9 1 6/15/2008 4 FALSE TRUE FALSE
10 1 10/14/2014 1 FALSE FALSE FALSE
11 1 10/16/2014 NA FALSE FALSE FALSE
12 2 8/12/2013 2 FALSE FALSE FALSE
13 3 2/26/2012 2 FALSE FALSE FALSE
14 3 2/27/2012 2 FALSE FALSE FALSE
15 3 4/18/2012 3 FALSE TRUE TRUE
16 3 4/18/2012 4 FALSE TRUE FALSE
17 3 4/21/2012 4 FALSE TRUE FALSE
18 3 4/23/2012 4 FALSE TRUE FALSE
19 3 4/27/2012 4 FALSE TRUE FALSE
20 3 5/8/2012 4 FALSE TRUE FALSE
21 3 5/12/2012 4 FALSE TRUE FALSE
22 3 5/15/2012 4 FALSE TRUE TRUE
23 3 5/15/2012 NA FALSE FALSE FALSE
>
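
The per-patient comparison with the next test from the question's edit can also be written without explicit loops. The following is a rough base R sketch, not the approach of the answer above: it reuses the MRN, COLLECTED_DATE, met_criteria and repeated_wi24hours names from the question, assumes COLLECTED_DATE has already been parsed to POSIXct, and runs on a small made-up data set rather than the asker's file.

```r
# Made-up data for illustration only (not the asker's data).
pots <- data.frame(
  MRN = c(1, 1, 1, 2, 2),
  COLLECTED_DATE = as.POSIXct(
    c("2012-04-18 08:00", "2012-04-18 14:00", "2012-04-21 09:00",
      "2013-08-12 10:00", "2013-08-14 10:00"),
    tz = "UTC"
  ),
  Hyperkalemia = c(3, 4, 4, 2, NA),
  Is.Hemolyzed = FALSE
)

# Sort by patient, then by collection time.
pots <- pots[order(pots$MRN, pots$COLLECTED_DATE), ]
n <- nrow(pots)

# Hours until the next row; only meaningful if the next row is the same patient.
gap_hours <- c(diff(as.numeric(pots$COLLECTED_DATE)), NA) / 3600
same_patient <- c(pots$MRN[-1] == pots$MRN[-n], FALSE)
gap_hours[!same_patient] <- NA

# Grade 3 or 4 and not hemolyzed; %in% treats NA grades as not meeting criteria.
pots$met_criteria <- pots$Hyperkalemia %in% c(3, 4) & !pots$Is.Hemolyzed

# For qualifying tests: was there another test for that patient within 24 hours?
pots$repeated_wi24hours <- ifelse(pots$met_criteria,
                                  !is.na(gap_hours) & gap_hours <= 24,
                                  NA)

sum(pots$met_criteria)                        # tests that qualified for a repeat
sum(pots$repeated_wi24hours, na.rm = TRUE)    # qualifying tests that were repeated
```

Because the gap is reset to NA at each patient boundary, the last test of one patient is never compared with the first test of the next.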
