using %in% to subset a data.table

I have a data.table
library(data.table)
DT <- data.table(a=c(1,2,3,4), b=c(4,4,4,4), x=c(1,3,5,5))
> DT
a b x
1: 1 4 1
2: 2 4 3
3: 3 4 5
4: 4 4 5
and I would like to select rows where x equals either a or b. Obviously, I could use
> DT[x==a | x==b]
a b x
1: 1 4 1
which gives the correct result. However, with many columns, I thought the following should work just as well
> DT[x%in%c(a,b)]
a b x
1: 1 4 1
2: 2 4 3
but it gives a different result that is not intuitive to me. Can anyone help?

The expression
DT[x==a | x==b]
returns all rows in DT where the values in x and a are equal or x and b are equal. This is the desired result.
On the other hand
DT[x%in%c(a,b)]
returns all rows where x matches any value in c(a, b), not just the corresponding value. Thus your second row appears because x == 3 and 3 appears (somewhere) in a.
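To see the difference concretely, it can help to evaluate both conditions on plain vectors outside the data.table. The sketch below copies the three columns from the question into ordinary vectors; the names rowwise and pooled are only illustrative.

```r
# Columns from the question, as plain vectors
a <- c(1, 2, 3, 4)
b <- c(4, 4, 4, 4)
x <- c(1, 3, 5, 5)

# Element-wise: x[i] is compared only with a[i] and b[i]
rowwise <- x == a | x == b
rowwise
# [1]  TRUE FALSE FALSE FALSE

# Membership: every x[i] is looked up in the pooled values 1 2 3 4 4 4 4 4
pooled <- x %in% c(a, b)
pooled
# [1]  TRUE  TRUE FALSE FALSE
```

The second element of pooled is TRUE because 3 occurs in a, even though it sits in a different row.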

We can use Reduce with .SDcols for multiple columns. Specify the columns of interest in .SDcols, then loop over the .SD (Subset of Data.table), do the comparison (==) with 'x', and Reduce it to a single logical vector with |
DT[DT[, Reduce(`|`, lapply(.SD, `==`, x)), .SDcols = a:b]]
# a b x
#1: 1 4 1
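The two steps inside that call can be followed on plain vectors. This is only a sketch: cols is a hypothetical stand-in for what .SD holds when .SDcols selects a and b, and per_col and keep are illustrative names.

```r
# Hypothetical stand-ins for the columns selected via .SDcols, plus x
cols <- list(a = c(1, 2, 3, 4), b = c(4, 4, 4, 4))
x <- c(1, 3, 5, 5)

# Step 1: compare every column with x, giving one logical vector per column
per_col <- lapply(cols, `==`, x)

# Step 2: fold the list with |, so a row is kept if any column matched
keep <- Reduce(`|`, per_col)
keep
# [1]  TRUE FALSE FALSE FALSE
```

Swapping `|` for `&` in the Reduce call would keep only rows where every selected column equals x.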

Another way is to use rowSums
DT[rowSums(DT[,.SD,.SDcols=-'x']==x)>0,]
# a b x
#1: 1 4 1
You can change this to rowMeans(...) == 1 if you want to select only the rows where all columns equal x
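The difference between the "any column" and "all columns" tests can be sketched on a small matrix. The values of x below are hypothetical (changed from the question so that one row matches in both columns), and any_match and all_match are illustrative names.

```r
# Hypothetical data: row 4 is the only row where both columns equal x
m <- cbind(a = c(1, 2, 3, 4), b = c(4, 4, 4, 4))
x <- c(1, 3, 5, 4)

# x has one value per row, so it is recycled down each column
hits <- m == x

any_match <- rowSums(hits) > 0    # at least one column equals x
all_match <- rowMeans(hits) == 1  # every column equals x
any_match
# [1]  TRUE FALSE FALSE  TRUE
all_match
# [1] FALSE FALSE FALSE  TRUE
```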

Related

R First Row By Group When Condition Is Met

dataHAVE=data.frame(STUDENT=c(1,1,1,2,2,2,3,3,3),
SCORE=c(0,1,1,5,1,2,1,1,1),
CAT=c(3,10,7,4,5,0,4,5,1),
FOX=c(5,0,10,8,9,1,8,9,0))
dataWANT=data.frame(STUDENT=c(1,2,3),
SCORE=c(1,1,1),
CAT=c(10,5,4),
FOX=c(0,9,8))
I have 'dataHAVE' and want 'dataWANT', which takes the first row for every 'STUDENT' where 'SCORE' equals 1. I am seeking a data.table solution because the data is large. I tried this but do not know how to set the criterion for 'SCORE'
dataWANT[,.SD[1],by = key(STUDENT)]

Convert the 'data.frame' to 'data.table' (setDT), grouped by 'STUDENT', specify the logical condition in i, get the index of the first row (.I[1]), extract that column ($V1) and subset the rows
library(data.table)
setDT(dataHAVE)[dataHAVE[SCORE == 1, .I[1], STUDENT]$V1]
.I returns the row index. Without a grouping column, it returns a vector, i.e.
setDT(dataHAVE)[SCORE == 1, .I]
#[1] 1 2 3 4 5 6
When we provide the grouping column, .I is returned by default in a column named V1 (we can override this by naming the column)
setDT(dataHAVE)[SCORE == 1, .(colindex = .I[1]), STUDENT]
# STUDENT colindex
#1: 1 2
#2: 2 5
#3: 3 7
Now we have two columns, 'STUDENT' and 'colindex'. We are specifically interested in 'colindex', so extract it with the standard operators ($ or [[) and then use it as the row index in i
i1 <- setDT(dataHAVE)[SCORE == 1, .(colindex = .I[1]), STUDENT]$colindex
i1
#[1] 2 5 7
This we use for subsetting
dataHAVE[i1]

Here is a base R option using subset + ave
subset(
dataHAVE,
ave(SCORE==1, STUDENT, FUN = function(x) seq_along(x) == min(which(x)))
)
which gives
STUDENT SCORE CAT FOX
2 1 1 10 0
5 2 1 5 9
7 3 1 4 8
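The function passed to ave can be tried on its own to see what it does within one group. In this sketch first_true is just an illustrative name for that anonymous function.

```r
dataHAVE <- data.frame(STUDENT = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
                       SCORE   = c(0, 1, 1, 5, 1, 2, 1, 1, 1),
                       CAT     = c(3, 10, 7, 4, 5, 0, 4, 5, 1),
                       FOX     = c(5, 0, 10, 8, 9, 1, 8, 9, 0))

# Flags only the first TRUE in a logical vector (illustrative helper name)
first_true <- function(x) seq_along(x) == min(which(x))
first_true(c(FALSE, TRUE, TRUE))
# [1] FALSE  TRUE FALSE

# ave() applies the helper separately within each STUDENT
flag <- with(dataHAVE, ave(SCORE == 1, STUDENT, FUN = first_true))
which(as.logical(flag))
# [1] 2 5 7
```

For a group with no SCORE of 1, which(x) is empty, so min() returns Inf with a warning and no row of that group is selected.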

Solution 1. There is a straightforward and comprehensive solution in two lines:
dataWANT <- dataHAVE[dataHAVE$SCORE == 1,] #Filter score equals to 1
dataWANT <- dataWANT[!duplicated(dataWANT$STUDENT), ] #Remove duplicated students
Solution 2. However, if you prefer to solve in one line:
dataWANT <- dataHAVE[!duplicated(paste0(dataHAVE$STUDENT, dataHAVE$SCORE)) & dataHAVE$SCORE ==1, ]
This creates a logical vector marking each 'STUDENT'/'SCORE' combination that does not duplicate an earlier row, and combines it with a test for whether 'SCORE' is 1.
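One thing to watch with the one-line version: pasting the two columns together without a separator can make different combinations look identical. A possible variant, sketched here as an alternative and not as part of the original answer, calls duplicated on the two columns directly; keep is an illustrative name.

```r
dataHAVE <- data.frame(STUDENT = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
                       SCORE   = c(0, 1, 1, 5, 1, 2, 1, 1, 1),
                       CAT     = c(3, 10, 7, 4, 5, 0, 4, 5, 1),
                       FOX     = c(5, 0, 10, 8, 9, 1, 8, 9, 0))

# Without a separator, different (STUDENT, SCORE) pairs can give the same key
paste0(1, 11) == paste0(11, 1)
# [1] TRUE

# duplicated() on the two columns compares the pairs themselves
keep <- !duplicated(dataHAVE[c("STUDENT", "SCORE")]) & dataHAVE$SCORE == 1
dataWANT <- dataHAVE[keep, ]
dataWANT$CAT
# [1] 10  5  4
```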

You could use match to get 1st row where SCORE = 1 for each STUDENT.
library(data.table)
setDT(dataHAVE)
dataHAVE[, .SD[match(1, SCORE)], STUDENT]
# STUDENT SCORE CAT FOX
#1: 1 1 10 0
#2: 2 1 5 9
#3: 3 1 4 8
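What match contributes here is the position of the first 1 within each group's SCORE values. A small sketch with hypothetical score vectors (p1 to p3 are illustrative names):

```r
p1 <- match(1, c(0, 1, 1))  # the first 1 is in position 2
p1
# [1] 2

p2 <- match(1, c(5, 1, 2))
p2
# [1] 2

p3 <- match(1, c(5, 2))     # no 1 present
p3
# [1] NA
```

A STUDENT with no SCORE of 1 would therefore get NA as its index, so that case is worth checking on real data.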

What is the most effective way to sort dataframe and add special id? [duplicate]

I would like to create a numeric indicator for a matrix such that for each unique element in one variable, it creates a sequence of the length based on the element in another variable. For example:
frame<- data.frame(x = c("a", "a", "a", "b", "b"), y = c(3,3,3,2,2))
frame
x y
1 a 3
2 a 3
3 a 3
4 b 2
5 b 2
The indicator, z, should look like this:
x y z
1 a 3 1
2 a 3 2
3 a 3 3
4 b 2 1
5 b 2 2
Any and all help greatly appreciated. Thanks.

No ave?
frame$z <- with(frame, ave(y,x,FUN=seq_along) )
frame
# x y z
#1 a 3 1
#2 a 3 2
#3 a 3 3
#4 b 2 1
#5 b 2 2
A data.table version could be something like below (thanks to @mnel):
#library(data.table)
#frame <- as.data.table(frame)
frame[,z := seq_len(.N), by=x]
My original thought was to use:
frame[,z := .SD[,.I], by=x]
where .SD refers to each subset of the data.table split by x. .I returns the row numbers for an entire data.table. So, .SD[,.I] returns the row numbers within each group. Although, as @mnel points out, this is inefficient compared to the other method as the entire .SD needs to be loaded into memory for each group to run this calculation.
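As a self-contained sketch of the recommended form, the snippet below builds the example as a data.table and adds the counter with seq_len(.N). It also shows rowid(), which is available in more recent data.table releases and gives the same counter without a by clause; the column name z2 is only for comparison.

```r
library(data.table)

frame <- data.table(x = c("a", "a", "a", "b", "b"), y = c(3, 3, 3, 2, 2))

# .N is the number of rows in the current group, so this counts 1..n per group
frame[, z := seq_len(.N), by = x]

# rowid() builds the same within-group counter without a by clause
frame[, z2 := rowid(x)]

frame$z
# [1] 1 2 3 1 2
```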

Another approach, assuming the rows are already sorted so that equal values of x are adjacent:
frame$z <- unlist(lapply(rle(as.character(frame$x))$lengths, seq_len))
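A short sketch of what rle contributes, and of why the row order matters. The vector x_mixed is hypothetical and only there to show the unsorted case; sequence() is used as a compact equivalent of the unlist(lapply(..., seq_len)) step.

```r
# Sorted input: one run per group
x_sorted <- c("a", "a", "a", "b", "b")
runs <- rle(x_sorted)
runs$lengths
# [1] 3 2
z_sorted <- sequence(runs$lengths)
z_sorted
# [1] 1 2 3 1 2

# Hypothetical unsorted input: rle() sees three runs of length 1
x_mixed <- c("a", "b", "a")
z_rle <- sequence(rle(x_mixed)$lengths)
z_rle
# [1] 1 1 1

# ave() groups by value, not by run, so the second "a" gets 2
z_ave <- ave(seq_along(x_mixed), x_mixed, FUN = seq_along)
z_ave
# [1] 1 1 2
```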

library(dplyr)
frame %>%
group_by(x) %>%
mutate(z = seq_along(y))

You can split the data.frame on x, and generate a new id column based on that:
> frame$z <- unlist(lapply(split(frame, frame$x), function(x) 1:nrow(x)))
> frame
x y z
1 a 3 1
2 a 3 2
3 a 3 3
4 b 2 1
5 b 2 2
Or even more simply using data.table:
library(data.table)
frame <- data.table(frame)[,z:=1:nrow(.SD),by=x]

Try this, where x is the column to group by and y is any numeric column. If there are no numeric columns, use seq_along(x), say, in place of y:
transform(frame, z = ave(y, x, FUN = seq_along))

Finding factors that correspond to more than one values

Suppose, that one has the following dataframe:
x=data.frame(c(1,1,2,2,2,3),c("A","A","B","B","B","B"))
names(x)=c("v1","v2")
x
v1 v2
1 1 A
2 1 A
3 2 B
4 2 B
5 2 B
6 3 B
In this data frame, I want each label in v2 to correspond to a single value in v1. However, as one can see in this example, B corresponds to more than one value.
Is there an elegant and fast way to find which labels in v2 correspond to more than one value in v1?
Ideally, the result should show the values, which in our example should be c(2,3), as well as the row numbers, which in our example should be r=c(5,6).

Assuming that we want the index of the unique elements in 'v1' grouped by 'v2' and that should have more than one unique elements, we create a logical index with ave and use that to subset the rows of 'x'.
i1 <- with(x, ave(v1, v2, FUN = function(x)
length(unique(x))>1 & !duplicated(x, fromLast=TRUE)))!=0
x[i1,]
# v1 v2
#5 2 B
#6 3 B
Or a faster option is data.table
library(data.table)
i1 <- setDT(x)[, .I[uniqueN(v1)>1 & !duplicated(v1, fromLast=TRUE)], v2]$V1
x[i1, 'v1', with = FALSE][, rn := i1][]
# v1 rn
#1: 2 5
#2: 3 6
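Before extracting rows, it may be useful to check which labels are affected at all. The following base R sketch counts the distinct v1 values per label with tapply and then keeps the last row of each (v2, v1) pair for the affected labels; n_distinct, bad_labels and last_of_pair are illustrative names, not part of the answer above.

```r
x <- data.frame(v1 = c(1, 1, 2, 2, 2, 3),
                v2 = c("A", "A", "B", "B", "B", "B"))

# Number of distinct v1 values for each label in v2
n_distinct <- tapply(x$v1, x$v2, function(v) length(unique(v)))
bad_labels <- names(n_distinct)[n_distinct > 1]
bad_labels
# [1] "B"

# Last occurrence of every (v2, v1) pair
last_of_pair <- !duplicated(x[c("v2", "v1")], fromLast = TRUE)

res <- x[x$v2 %in% bad_labels & last_of_pair, ]
res$v1
# [1] 2 3
```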

Group a data.table using a column which is list

I have a really big problem and looping through the data.table to do what I want is too slow, so I am trying to get around looping. Let's assume I have a data.table as follows:
a <- data.table(i = c(1,2,3), j = c(2,2,6), k = list(c("a","b"),c("a","c"),c("b")))
> a
i j k
1: 1 2 a,b
2: 2 2 a,c
3: 3 6 b
And I want to group based on the values in k. So something like this:
a[, sum(j), by = k]
right now I am getting the following error:
Error in `[.data.table`(a, , sum(i), by = k) :
The items in the 'by' or 'keyby' list are length (2,2,1). Each must be same length as rows in x or number of rows returned by i (3).
The answer I am looking for is to group first all the rows having "a" in column k and calculate sum(j) and then all rows having "b" and so on. So the desired answer would be:
k V1
a 4
b 8
c 2
Any hints on how to do this efficiently? I can't melt column k by repeating the rows, since the data.table would become too big in my case.

I think this might work:
a[, .(k = unlist(k)), by=.(i,j)][,sum(j),by=k]
k V1
1: a 4
2: b 8
3: c 2

If we are using tidyr, a compact option would be
library(tidyr)
unnest(a, k)[, sum(j) ,k]
# k V1
#1: a 4
#2: b 8
#3: c 2
Or using the dplyr/tidyr pipes
unnest(a, k) %>%
group_by(k) %>%
summarise(V1 = sum(j))
# k V1
# <chr> <dbl>
#1 a 4
#2 b 8
#3 c 2

Since by-group operations can be slow, I'd consider...
dat = a[rep(1:.N, lengths(k)), c(.SD, .(k = unlist(a$k))), .SDcols=setdiff(names(a), "k")]
i j k
1: 1 2 a
2: 1 2 b
3: 2 2 a
4: 2 2 c
5: 3 6 b
We're repeating rows of cols i:j to match the unlisted k. The data should be kept in this format instead of using a list column, probably. From there, as in @MikeyMike's answer, we can dat[, sum(j), by=k].
In data.table 1.9.7+, we can similarly do
dat = a[, c(.SD[rep(.I, lengths(k))], .(k = unlist(k))), .SDcols=i:j]
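The idea of repeating each row once per element of its list entry can be sketched in base R on plain vectors. The names long_k, long_j and totals are illustrative; j and k are copied from the question.

```r
j <- c(2, 2, 6)
k <- list(c("a", "b"), c("a", "c"), "b")

# Flatten the list column and repeat each j once per element of its k
long_k <- unlist(k)            # "a" "b" "a" "c" "b"
long_j <- rep(j, lengths(k))   # 2 2 2 2 6

# Sum j within each value of k
totals <- tapply(long_j, long_k, sum)
totals
# a b c 
# 4 8 2 
```

Only the two columns needed for the sum are expanded here, which may matter when expanding the full table would use too much memory.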

R data.table subsetting within a group and splitting a data table into two

I have the following data.table.
ts,id
1,a
2,a
3,a
4,a
5,a
6,a
7,a
1,b
2,b
3,b
4,b
I want to split this data.table into two. The criterion is to have approximately the first half of each group (in this case, grouped by column "id") in one data.table and the remainder in another. So the expected result is two data.tables, as follows
ts,id
1,a
2,a
3,a
4,a
1,b
2,b
and
ts,id
5,a
6,a
7,a
3,b
4,b
I tried the following,
z1 = x[,.SD[.I < .N/2,],by=dev]
z1
and got just the following
id ts
a 1
a 2
a 3
Somehow, .I within the .SD isn't working the way I think it should. Any help appreciated.
Thanks in advance.

.I gives the row locations with respect to the whole data.table. Thus it can't be used like that within .SD.
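A quick way to see this is to put .I next to a within-group counter. The sketch below rebuilds the table from the question; the column names global and local are only illustrative.

```r
library(data.table)

DT <- data.table(ts = c(1:7, 1:4), id = rep(c("a", "b"), c(7, 4)))

# .I keeps counting across the whole table, seq_len(.N) restarts in each group
pos <- DT[, .(global = .I, local = seq_len(.N)), by = id]

pos[id == "b", global]
# [1]  8  9 10 11
pos[id == "b", local]
# [1] 1 2 3 4
```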
Something like
DT[, subset := seq_len(.N) > .N/2,by='id']
subset1 <- DT[(subset)][,subset:=NULL]
subset2 <- DT[!(subset)][,subset:=NULL]
subset1
# ts id
# 1: 4 a
# 2: 5 a
# 3: 6 a
# 4: 7 a
# 5: 3 b
# 6: 4 b
subset2
# ts id
# 1: 1 a
# 2: 2 a
# 3: 3 a
# 4: 1 b
# 5: 2 b
Should work
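Note that the expected output in the question puts four of the seven rows of id a into the first table, which corresponds to rounding half the group size up. One way to get that split is sketched below in base R on plain vectors; pos, size and first_half are illustrative names, and the same condition could be written as seq_len(.N) <= ceiling(.N/2) inside a grouped data.table call.

```r
ts <- c(1:7, 1:4)
id <- rep(c("a", "b"), c(7, 4))

pos  <- ave(seq_along(id), id, FUN = seq_along)  # position within each group
size <- ave(seq_along(id), id, FUN = length)     # group size, repeated per row

# Round half the group size up so the first part gets the extra row
first_half <- pos <= ceiling(size / 2)

ts[first_half]
# [1] 1 2 3 4 1 2
ts[!first_half]
# [1] 5 6 7 3 4
```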
For more than 2 groups, you could use cut to create a factor with the appropriate number of levels
Something like
DT[, subset := cut(seq_len(.N), 3, labels= FALSE),by='id']
# you could copy to the global environment a subset for each, but this
# will not be memory efficient!
list2env(setattr(split(DT, DT[['subset']]),'names', paste0('s',1:3)), .GlobalEnv)

Here's the corrected version of your expression:
dt[, .SD[, .SD[.I <= .N/2]], by = id]
# id ts
#1: a 1
#2: a 2
#3: a 3
#4: b 1
#5: b 2
The reason yours is not working is because .I and .N are not available in the i-expression (i.e. first argument of [) and so the parent data.table's .I and .N are used (i.e. dt's).
