R - Data.table fast binary search based subset with multiple values in second key - r

I have come across this vignette at https://cran.r-project.org/web/packages/data.table/vignettes/datatable-keys-fast-subset.html#multiple-key-point.
My data looks like this:
ID TYPE MEASURE_1 MEASURE_2
1 A 3 3
1 B 4 4
1 C 5 5
1 Mean 4 4
2 A 10 1
2 B 20 2
2 C 30 3
2 Mean 20 2
When I do this, everything works as expected:
setkey(dt, ID, TYPE)
dt[.(unique(ID), "A")] # extract SD of all IDs with Type A
dt[.(unique(ID), "B")] # extract SD of all IDs with Type B
dt[.(unique(ID), "C")] # extract SD of all IDs with Type C
Whenever I try something like this, where I want to base the keyed subset on multiple values for the second key, I only get the combinations of the unique values in key 1 with the first value in the c() vector for the second key. All following values in the vector are ignored.
# extract SD of all IDs with one of the 3 types A/B/C
dt[.(unique(ID), c("A", "B", "C"))]
# previous output is equivalent to
dt[.(unique(ID), "A")] # extract SD of all IDs with Type A
# I want/expect
dt[TYPE %in% c("A", "B", "C")]
What am I missing here, or is this something I cannot do with keyed subsets?
To clarify: As I cannot leave out the key 1 in keyed subsets, the vignette calls for inclusion of the first key with unique(key1)
Supplying multiple values for key 1 also works as expected.
dt[.(c(1, 2), "A")] == dt[ID %in% c(1,2) & TYPE == "A"] # TRUE

The data.table documentation (see help("data.table") or https://rdatatable.gitlab.io/data.table/reference/data.table.html#arguments) mentions:
character, list and data.frame input to i is converted into a data.table internally using as.data.table.
So, the classical recycling rule used in R (or in data.frame) applies. That is, .(unique(ID), c("A", "B", "C")), which is equivalent to list(unique(ID), c("A", "B", "C")), becomes:
as.data.table(list(unique(ID), c("A", "B", "C")))
and since the length of the longest list element (length of c("A", "B", "C")) is not a multiple of the shorter one (length of unique(ID)), you will get an error.
If you want each value in unique(ID) combined with each element in c("A", "B", "C"), you should use CJ(unique(ID), c("A", "B", "C")) instead.
So what you should do is dt[CJ(unique(ID), c("A", "B", "C"))].
Note that dt[.(unique(ID), "A")] works correctly because you passed only one element for the second key and this gets recycled to match the length of unique(ID).
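As a minimal sketch of the CJ() lookup described above: the construction of dt below is an assumption, rebuilt from the table shown in the question, and the names lookup, keyed and filtered are purely illustrative.

```r
library(data.table)

# Sample data reconstructed from the question (assumed column types)
dt <- data.table(
  ID        = rep(1:2, each = 4),
  TYPE      = rep(c("A", "B", "C", "Mean"), times = 2),
  MEASURE_1 = c(3, 4, 5, 4, 10, 20, 30, 20),
  MEASURE_2 = c(3, 4, 5, 4, 1, 2, 3, 2)
)
setkey(dt, ID, TYPE)

# CJ() builds every combination of the two vectors (a cross join),
# so each ID is paired with each of the three types
lookup <- CJ(unique(dt$ID), c("A", "B", "C"))

# Keyed subset using the cross-joined lookup table
keyed <- dt[lookup]

# Vector-scan equivalent, for comparison
filtered <- dt[TYPE %in% c("A", "B", "C")]
```

On this data both keyed and filtered contain the six A/B/C rows; the "Mean" rows are dropped.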

Related

Return index numbers in R where object lengths are not multiples [duplicate]

This question already has answers here:
Is there an R function for finding the index of an element in a vector?
I have a vector like so:
foo = c("A", "B", "C", "D")
And I want a vector of selected index numbers, which I imagined I could do like so:
which(foo == c("A", "B", "D"))
But apparently this only works if the lengths of the two vectors are multiples, as otherwise you get an incomplete result followed by a warning message:
"longer object length is not a multiple of shorter object length".
So how do I get what I'm after, which is "1 2 4"?
Use match:
match(c('A', 'B', 'D'), foo)
Using %in% is one option here:
foo <- c("A", "B", "C", "D")
x <- c("A", "B", "D")
c(1:4)[foo %in% x] # [1] 1 2 4
The quantity foo %in% x returns a logical vector which can then be used to subset the indices you want to see.
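The two approaches differ in the order of the result, which a short sketch can make visible. The variable names here are illustrative, and x is deliberately given in a different order than in the question.

```r
foo <- c("A", "B", "C", "D")
x <- c("D", "A", "B")

# which() on %in% reports positions in the order they occur in foo
idx_in_foo_order <- which(foo %in% x)

# match() reports positions in the order of the lookup vector x
idx_in_x_order <- match(x, foo)

# match() returns NA for a value that is absent from foo
not_found <- match("Z", foo)
```

Which one fits depends on whether the order of the lookup values matters to you.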

How to return the index of certain duplicate strings in a character vector ignoring the index of the first occurence of the duplicate string?

I have a vector of strings and I want to return the index of the duplicate values, except for the index of the first occurrence of a duplicate value, given another vector with matches. For example:
x <- c("a", "b", "c", "b", "a", "a", "c", "c")
matching_values <- c("a", "b")
So I would like to have an integer vector returned with the indexes 4, 5, 6. So the first duplicate of a occurs at position 5 and the second duplicate at position 6. The first duplicate for b occurs at index 4 and because I did not specify to match c, there will be no index returned. Thank you!
You could use :
which(duplicated(x) & x %in% matching_values)
#[1] 4 5 6
We can use duplicated with %in%
which(x %in% matching_values & duplicated(x))
#[1] 4 5 6
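To see why both answers give the same result, it may help to look at the two logical vectors separately. This is only a step-by-step sketch of the same expression; is_repeat and is_wanted are names introduced for illustration.

```r
x <- c("a", "b", "c", "b", "a", "a", "c", "c")
matching_values <- c("a", "b")

# TRUE for every occurrence after the first one of each value
is_repeat <- duplicated(x)

# TRUE where the element is one of the values to match
is_wanted <- x %in% matching_values

# Positions that satisfy both conditions
result <- which(is_repeat & is_wanted)
```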

Efficient vectors accumulation by group in data frame [duplicate]

This question already has answers here:
Cumulatively paste (concatenate) values grouped by another variable
I currently have a large data table and I would like to accumulate a vector column (the classes column) for each group (id) along the years to get all past classes up to the current year in vector format.
EDIT: Previous topics (i.e. Cumulatively paste (concatenate) values grouped by another variable) have answered this question for the case of character concatenation, which I don't want (because analyzing strings forces me to parse them first, which is computationally intensive on large datasets). I would like to accumulate the vectors and get a column of vectors as well. I think the solution is pretty close, but I just can't manage to find the right syntax for it.
 
Sample data:
id year classes
----------------------------
1 2000 c("A", "B")
1 2001 c("C", "A")
1 2002 "D"
1 2003 "E"
2 2001 "A"
2 2002 c("A", "D")
2 2003 "E"
...
Expected output :
id year classes cumclasses
-----------------------------------------------------------
1 2000 c("A", "B") c("A", "B")
1 2001 c("C", "A") c("A", "B", "C", "A")
1 2002 "D" c("A", "B", "C", "A", "D")
1 2003 "E" c("A", "B", "C", "A", "D", "E")
2 2001 "A" "A"
2 2002 c("A", "D") c("A", "A", "D")
2 2003 "E" c("A", "A", "D", "E")
...
My goal is to find an efficient solution because my dataset is fairly large.
For now I have a working (but ultra slow) solution using dplyr and purrr :
dt2 <- dt %>%
setkeyv(c("id", "year")) %>%
group_by(id) %>%
mutate(cumclasses = accumulate(classes, append))
I'm looking for a data.table solution of the type:
#not working example
dt2 <- dt[, cumclasses := accumulate(classes, append), by = id]
or even a base R solution, the faster the better !
Thank you!
If you want to reproduce sample data please copy the following code:
dt <- data.table(id =
c(1,1,1,1,2,2,2),
year =
c(2000,2001,2002,2003,2001,2002,2003),
classes =
list(c('A', 'B'), c('C', 'A'), 'D', 'E', 'A', c('A', 'D'), 'E'), key = 'id')
EDIT [SOLVED]:
A working solution is (using data.table and purrr):
dt[, cumClasses := list(accumulate(classes, append)), by = id]
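One option would be to group by 'id', loop over the sequence of rows, extract the 'classes' elements up to each row, and paste them together after unlisting the list column (note that toString yields one comma-separated string per row rather than a vector):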
dt[, cumClasses := sapply(seq_len(.N), function(i) toString(unlist(classes[seq_len(i)]))), id][,
cumClasses := as.list(cumClasses)][]
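For the base R route the question asks about, a possible sketch uses Reduce() with accumulate = TRUE, which keeps every intermediate result as a list element. It assumes the rows are already sorted by id (as in the keyed sample data), has not been benchmarked, and uses plain vectors named id and classes rather than the data.table itself.

```r
id <- c(1, 1, 1, 1, 2, 2, 2)
classes <- list(c("A", "B"), c("C", "A"), "D", "E", "A", c("A", "D"), "E")

# Accumulate within each id: Reduce(c, ..., accumulate = TRUE) returns
# a list holding the running concatenation after each element
per_group <- lapply(split(classes, id), function(g) {
  Reduce(c, g, accumulate = TRUE)
})

# Flatten one level so there is one character vector per original row
# (row order is preserved only because id is already sorted)
cumclasses <- unname(unlist(per_group, recursive = FALSE))
```

Since Reduce(c, classes, accumulate = TRUE) returns the same kind of list as accumulate(classes, append), it should be usable in its place inside the data.table call from the solved edit, though that combination is not tested here.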

Is there a way in R to rank categorical variable (of characters) into ranked ordinal data?

I have a list of character strings, say
alphabets = c(a, b, c, d,..., z) and I would like to get the index of this list as a new column in a data.frame.
e.g. (b, a, c, d, e, g) would yield (2, 1, 3, 4, 5, 7).
The solution you need is to convert the character vector to a factor:
alphabets = c("b", "a", "c", "d", "e", "g")
#convert to class factor with the order defined by the levels argument
alphabets<-factor(alphabets, levels=letters)
#display the values
as.numeric(alphabets)
#[1] 2 1 3 4 5 7
This is a case for match
x <- c("b", "a", "c", "d", "e", "g")
match(x, letters)
#[1] 2 1 3 4 5 7
Or sapply with grep returning a named int vector
sapply(x, grep, letters)
#b a c d e g
#2 1 3 4 5 7
Two comments:
"I have a list of character strings" Be precise with class names of objects! alphabets = c("a", "b", "c", "d") is a character vector, not a list.
letters is a built-in constant which returns the 26 lower-case letters (of the Roman alphabet) as a character vector. See ?letters for details.
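Since the question asks for the index as a new column in a data.frame, the match approach could be applied like this. The data frame df and its column names are made up for illustration.

```r
# Hypothetical data frame holding the letters to be ranked
df <- data.frame(
  letter = c("b", "a", "c", "d", "e", "g"),
  stringsAsFactors = FALSE
)

# Position of each letter within the built-in constant `letters`
df$position <- match(df$letter, letters)
```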

For loop with factor data

I have two vectors of factor data with equal length. Just for example's sake:
observed=c("a", "b", "c", "a", "b", "c", "a")
predicted=c("a", "a", "b", "b", "b", "c", "c")
Ultimately, I am trying to generate a classification matrix showing the number of times each factor is correctly predicted. This would look like the following for the example:
name T F
a 1 2
b 1 1
c 1 1
Note that the table() command doesn't work here because I have 11 different factors, and the output would be 11x11 instead of 11x2. My plan is to create three vectors, and combine them into a data frame.
First, a vector of the unique factor values in the existing vectors. This is simple enough,
names=unique(df$observed)
Next, a vector of values showing the number of correct predictions. This is where I am running into trouble. I can get the number of correct predictions for an individual factor like so:
correct.a=sum(predicted[which(observed == "a")] == "a")
But this is cumbersome to repeat time and time again, and then combine into a vector like
correct=c(correct.a, correct.b, correct.c)
Is there a way to use a loop (or other strategy that you can think of) to improve this process?
Also note that the final vector I would create would be something like this:
incorrect.a=sum(observed == "a")-correct.a
t(sapply(split(predicted == observed, observed), table))
# FALSE TRUE
#a 2 1
#b 1 1
#c 1 1
I would suggest using data.table for an explicit, clean way to define your results:
library(data.table)
observed=c("a", "b", "c", "a", "b", "c", "a")
predicted=c("a", "a", "b", "b", "b", "c", "c")
dt <- data.table(observed, predicted)
res <- dt[, .(
T = sum(observed == predicted),
F = sum(observed != predicted)),
observed
]
res
# observed T F
# 1: a 1 2
# 2: b 1 1
# 3: c 1 1
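A further base R variant, offered as a sketch rather than a replacement for the answers above: cross-tabulating the observed value against a correct/incorrect flag gives the same n x 2 layout. Turning the flag into a factor with both levels fixed keeps the FALSE and TRUE columns present even if every prediction in the data happens to be right (or wrong). The names correct and tab are illustrative.

```r
observed  <- c("a", "b", "c", "a", "b", "c", "a")
predicted <- c("a", "a", "b", "b", "b", "c", "c")

# Fix both levels so the table always has a FALSE and a TRUE column
correct <- factor(observed == predicted, levels = c(FALSE, TRUE))

# One row per observed value, one column per outcome
tab <- table(observed, correct)
```

On the example data, tab has two misses and one hit for "a", and one of each for "b" and "c".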
