duplicate rows in my counting by group with data.table in R

I'm really sorry to ask this dumb question but I don't get what is going wrong.
I have a dataset which I convert into a data.table object:
library(data.table)
# generate 100,000 ids, each associated with a group, in a dataset called base
id = c(1:100000)
group = sample(c(1:5), 100000, TRUE)
base = cbind(id, group)
base = as.data.table(base)
I do a basic group-by computation to get the number of rows per group, but the result table still contains the same number of rows:
counting=base[,COUNT:= .N, by = group]
nrow(counting)
#100000
What did I miss? Is there an option in data.table in order to address my problem?

Following up on akrun's comment, I decided to provide an answer. It seems you were not sure how to summarise your data. First, one point about constructing the data set:
set.seed(123)
id = c(1:100000)
group = sample(c(1:5),100000,TRUE)
base = data.frame(id,group)
setDT(base)
base
id group
1: 1 2
2: 2 4
3: 3 3
4: 4 5
5: 5 5
....
When you use cbind() on multiple vectors, they are coerced to a common class to make a matrix. The safer way to go is data.frame(), which allows mixed column classes. And if you have a data.frame, you can turn it into a data.table by reference with setDT(), without needing to assign the result.
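For instance, a quick illustration with made-up vectors showing the coercion:
# cbind() builds a matrix, so every column is coerced to one common class
m <- cbind(id = 1:3, tag = c("a", "b", "c"))
class(m[, "id"])   # "character": the ids are no longer numeric
# data.frame() keeps each column's own class
d <- data.frame(id = 1:3, tag = c("a", "b", "c"))
sapply(d, class)   # id "integer", tag "character"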
Adding a new column. Your code was basically adding a new column to the data.table object. When you use :=, you are doing the equivalent of mutate() in dplyr or transform() in base R, with one important difference: with :=, the column is added to the data.table by reference, so there is no need to assign the result.
base[, COUNT := .N, by = group]
base
id group COUNT
1: 1 2 20099
2: 2 4 19934
3: 3 3 20001
4: 4 5 19933
5: 5 5 19933
...
Here, you are counting how many data points exist in each group and assigning that count to every row of the group. For instance, the total count of group 2 is 20099, and that number is given to every row with group == 2. You are creating a new column, not summarising the data; hence you still have 100000 rows. The number of rows in base is unchanged, and there is currently no function that modifies the number of rows by reference.
Summarising the data. If you want to count how many data points exist for each group and summarize the data, you want the following.
dt2 <- base[, .(COUNT = .N), by = group]
dt2
group COUNT
1: 2 20099
2: 4 19934
3: 3 20001
4: 5 19933
5: 1 20033
dim(dt2)
[1] 5 2
Here, make sure you use =, not :=, since you are summarising the data. It is necessary to assign the result because you are creating a new data.table. I hope this clears things up.

Have you noticed?
base$regroup = group
base[, .(Count = .N, regroup), by = group]
gives 100,000 rows even though group and regroup are identical?
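This is likely because listing a bare column such as regroup in j expands the result to the full length of each group, recycling Count along the way, rather than collapsing to one row per group. Aggregating regroup as well would collapse it, as a quick sketch:
base[, .(Count = .N, regroup = first(regroup)), by = group]  # 5 rows, one per group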

Related

R populate rows in a dataframe with a logical condition and string match

I had a data frame in which rows were observations of certain events in different categories. A sample of it looks like this:
channel start.time stop.time vp duration
1 1_hands 13.840 14.985 CH1 1.145
2 3_speech 23.469 24.290 N 0.821
3 2_body 28.735 32.292 CH2 3.557
4 1_hands 36.778 41.674 CH1 4.896
5 4_ev 42.337 45.398 self 3.061
6 1_hands 46.112 50.715 N 4.603
There are different values that can show up in the 'channel' and 'vp' columns. Both of these columns represent different kinds of categories that can apply to each observation.
In particular, different 'channel' values can occur simultaneously, or overlap with the observations in other channels. I wanted to see which observations co-occur with a specific observation that I specify. I previously put this question out on Stack Overflow and got a useful answer that can be read here.
A very helpful user provided a solution using the data.table package, with some code that I have adapted and used; it looks like this:
library(data.table)
library(stringr)
setDT(df)
df[, id := .I]
setkey(df, id)
#self join on subset by row
df[df, overlaps := {
  temp <- df[!id == i.id & start.time < i.stop.time & stop.time > i.start.time, ]
  paste(temp$channel, temp$vp, temp$duration, temp$start.time, temp$stop.time, sep = "_", collapse = "_")
}, by = .EACHI]
Doing this provides exactly what I wanted, and the resulting data frame looks like this:
channel start.time stop.time vp duration X id overlaps
1: 1_hands 13.840 14.985 CH1 1.145 NA 1 2_body_CH1_1.883_14.272_16.155_2_body_N_14.272_0_14.272_4_speech_CH1_2.371_14.183_16.554_4_speech_EE_2.448_11.735_14.183_3_gaze_3_1.068_14.42_15.488
2: 1_hands 23.469 24.290 CH1 0.821 NA 2 2_body_N_5.485_22.055_27.54_4_speech_N_4.25_22.259_26.509_3_gaze_2_0.81_23.16_23.97_3_gaze_5_1.804_23.97_25.774
3: 1_hands 28.735 32.292 CH1 3.557 NA 3 2_body_CH1_3.445_27.54_30.985_2_body_N_3.519_30.985_34.504_4_speech_CH1_3.517_28.001_31.518_4_speech_N_0.725_31.518_32.243_4_speech_N_1.183_32.243_33.426_3_gaze_1_1.203_28.504_29.707_3_gaze_6_0.997_29.707_30.704_3_gaze_2_1.493_30.704_32.197_3_gaze_1_0.497_32.197_32.694
4: 1_hands 36.778 41.674 CH1 4.896 NA 4 2_body_CH1_3.308_34.504_37.812_2_body_CH1_5.288_39.482_44.77_2_body_N_1.67_37.812_39.482_4_speech_CH1_1.246_41.019_42.265_4_speech_N_0.957_36.549_37.506_4_speech_N_1.952_37.506_39.458_4_speech_N_1.557_39.462_41.019_3_gaze_3D_3.205_34.565_37.77_3_gaze_2_1.769_37.77_39.539_3_gaze_3D_1.41_39.539_40.949_3_gaze_2_0.433_41.313_41.746
5: 1_hands 42.337 45.398 CH1 3.061 NA 5 2_body_CH1_5.288_39.482_44.77_2_body_N_1.058_44.77_45.828_4_speech_N_8.506_42.268_50.774_3_gaze_2_0.341_42.108_42.449_3_gaze_3_0.652_44.099_44.751_3_gaze_2_0.532_44.751_45.283_3_gaze_1_0.287_45.283_45.57
6: 1_hands 46.112 50.715 CH1 4.603 NA 6 2_body_CH1_1_45.828_46.828_2_body_CH1_1.437_47.485_48.922_2_body_N_0.657_46.828_47.485_2_body_N_6.571_48.922_55.493_4_speech_N_8.506_42.268_50.774_3_gaze_2_0.771_46.767_47.538_3_gaze_3_0.83_48.714_49.544_3_gaze_2_1.121_49.544_50.665_3_gaze_1_0.328_50.665_50.993
What it does is paste the observations that co-occur with a given observation into the 'overlaps' column, appending the start.time, stop.time, and duration data. This is extremely useful, but it is difficult to work with the data in that column, as the strings can be very long.
What I need is a way to specify that only observations of a given category show up in the overlaps column.
For example, say I want to show only "2_body_N" strings in the overlaps column, and not the other ones, even if they co-occur with it.
I can subset the data frame in such a way that only rows will appear in which a "2_body_N" occurs in the overlaps column, but I want to remove the rest of the string that shows other observations. This is difficult, because the length of each string in the overlaps column can be different, as there can be 1 or more overlapping observations. They also don't necessarily occur in the same order.
So one row may have a "2_body_N" observation first, followed by other types, while the next row may have something else first and the "2_body_N" observation buried in the back.
I also tried splitting up the strings of the overlaps columns into additional columns to make sorting easier. But I need 144 additional columns, and again, they wouldn't occur in predictable orders by channel type since the observations are dredged up based on the start.time.
I wonder if I can run the data frame through a similar data.table approach as shown above, and specify a string instead of start.time and stop time.
However, I cannot figure out the syntax, and I'm generally unfamiliar with the functions being used here. I wonder if something like this would work, if I could just get the syntax right:
library(data.table)
library(stringr)
setDT(df)
df[, id := .I]
setkey(df, id)
#self join on subset by row
df[df, overlaps := {
  temp <- if string == "2_body_N"  # pseudocode: this is the part I can't express
  paste(temp$channel, temp$vp, temp$duration, temp$start.time, temp$stop.time, sep = "_", collapse = "_")
}, by = .EACHI]
So that is my question: can I populate the overlaps column with only observations that match a specific string, such as "2_body_N", and NOT with others, even if they co-occur temporally with the reference observation?
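In case it helps to clarify what I'm after, here is an untested guess at how that condition might be expressed: filter the inner subset on the combined channel/vp label before pasting (here "2_body_N" is the target; everything else follows the code above):
target <- "2_body_N"
df[df, overlaps := {
  temp <- df[!id == i.id & start.time < i.stop.time & stop.time > i.start.time &
               paste(channel, vp, sep = "_") == target, ]
  paste(temp$channel, temp$vp, temp$duration, temp$start.time, temp$stop.time,
        sep = "_", collapse = "_")
}, by = .EACHI]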

R data.table - only keep rows with duplicate ID (most efficient solution)

I want to use data.table to create a function that only keeps rows where the ID column(s) - stored as a vector of strings - are duplicated. Note that where there are multiple ID columns, I only want to keep rows where the combination of ID columns is duplicated.
library(data.table)
dt <- data.table(x = c(1:5,5), y = rep(c(1,3,5), each = 2), z = rep(1:3, 2))
get_duplicate_id_rows1 <- function(dt_in, id_str) {
  dt_in[, if (.N > 1) .SD, by = id_str]
}
get_duplicate_id_rows1(dt, c("x", "y"))
#> x y z
#> 1: 5 5 2
#> 2: 5 5 3
get_duplicate_id_rows1(dt[, .(x,y)], c("x", "y"))
#> Empty data.table (0 rows and 2 cols): x,y
As above, my first attempt works when the data table has one non-ID column. However, when all of the columns are ID columns, the resulting data table has no rows. I think this is because, as per ?data.table, .SD includes all variables of the original data table except the grouping columns. Consequently, .SD has zero columns, which seems to be causing my issue.
get_duplicate_id_rows2 <- function(dt_in, id_str) {
  dt_in[, if (.N > 1) .SD, by = id_str, .SDcols = names(dt_in)]
}
get_duplicate_id_rows2(dt, c("x", "y"))
#> x y x y z
#> 1: 5 5 5 5 2
#> 2: 5 5 5 5 3
get_duplicate_id_rows2(dt[, .(x,y)], c("x", "y"))
#> x y x y
#> 1: 5 5 5 5
#> 2: 5 5 5 5
My second attempt tries to circumvent my issues with my first attempt by using .SDcols. This does resolve the issue where all the columns in my data table are ID columns. However, here the column names in id_str are duplicated.
I think this is because one set of column names comes from the by argument and the other set of column names comes from .SDcols, although I'm not certain about this, because in my first attempt, the resultant data table had zero rows, not zero columns.
Consequently, I'd love to understand what's going on here, and what the most efficient solution to my problem is - particularly for large data sets, which is why I'm moving from tidyverse to data.table.
Created on 2020-04-09 by the reprex package (v0.3.0)
We can use .I to get the row indices of groups with a frequency count greater than 1, extract that column, and subset the data.table:
dt[dt[, .I[.N >1], .(x, y)]$V1]
NOTE: It should be faster than .SD
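The same idea can be wrapped into the requested function form (a sketch; get_duplicate_id_rows3 is just a name chosen here), and it also works when every column is an ID column because it never relies on .SD:
get_duplicate_id_rows3 <- function(dt_in, id_str) {
  dt_in[dt_in[, .I[.N > 1], by = id_str]$V1]
}
get_duplicate_id_rows3(dt, c("x", "y"))
get_duplicate_id_rows3(dt[, .(x, y)], c("x", "y"))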
Here is another option:
dt[dt[rowid(x, y) > 1], on=.(x, y), .SD]
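As a side note, rowid(x, y) numbers the rows within each (x, y) combination, so rowid(x, y) > 1 flags every occurrence after the first; the join back on .(x, y) then pulls in all rows of each flagged pair, including the first occurrence. A small sketch on the example data:
dt[, .(x, y, rid = rowid(x, y))]
#    x y rid
# 1: 1 1   1
# 2: 2 1   1
# 3: 3 3   1
# 4: 4 3   1
# 5: 5 5   1
# 6: 5 5   2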
In your example, the explanation you gave for the 0-row result is correct. Since the grouping columns are used for grouping, they are identical within each group and can be accessed via .BY, so .SD does not include them, to prevent duplication.
By default, when by is used, the grouping columns are also returned as the leftmost columns of the output; hence in get_duplicate_id_rows2 you see x, y and then the columns from .SD as specified in .SDcols.
Lastly, regarding efficiency, you can time the various options posted here using microbenchmark with your actual dataset and share your results.
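For example, a sketch of such a comparison on the toy dt (the microbenchmark package is assumed to be installed):
library(microbenchmark)
microbenchmark(
  SD = dt[, if (.N > 1) .SD, by = c("x", "y")],
  I  = dt[dt[, .I[.N > 1], by = .(x, y)]$V1],
  times = 100
)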

match/merge dataframes with a number of columns with different column names in R

I have two data frames with different columns, each with a large number of rows (about 2 million).
The first one is df1
The second one is df2
I need to match the values in the y column of the first table to the R column of the second table.
Example:
see how the two rows of df1 in the red box match the two rows of df2 in the red box
Then I need to get the score of the matched values
So the result should look like this, and it should be stored in a dataframe:
My attempt: I am a beginner in R. When I searched, I found that I can use the match function or the merge function, but I did not get the result I want, probably because I did not know how to use them correctly. Therefore, I need a very simple step-by-step solution.
We can use match from base R
df2[match(df2$R, df1$y, nomatch = 0), c("R", "score")]
# R score
#3 2 3
#4 111 4
Or another option is semi_join from dplyr
library(dplyr)
semi_join(df2[-1], df1, by = c(R = "y"))
# R score
#1 2 3
#2 111 4
merge(df1,df2,by.x="y",by.y="R")[c("y","score")]
y score
1 2 3
2 111 4
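Since df1 and df2 were only shown as screenshots, here is a made-up pair of data frames just to make the merge() call concrete (column names y, R and score as in the question; the values are invented):
df1 <- data.frame(x = c("a", "b", "c", "d"), y = c(5, 7, 2, 111))
df2 <- data.frame(id = 1:4, R = c(1, 9, 2, 111), score = 1:4)
merge(df1, df2, by.x = "y", by.y = "R")[c("y", "score")]
#     y score
# 1   2     3
# 2 111     4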

Apply which.min to data.table under a condition

I have a data.table and need to know the index of the row containing a minimal value under a given condition. Simple example:
dt <- data.table(i=11:13, val=21:23)
# i val
# 1: 11 21
# 2: 12 22
# 3: 13 23
Now, suppose I'd like to know in which row val is minimal under the condition i>=12, which is 2 in this case.
What didn't work:
dt[i>=12, which.min(val)]
# [1] 1
returns 1, because within dt[i>=12] it is the first row.
Also
dt[i>=12, .I[which.min(val)]]
# [1] 1
returned 1, because .I is only supposed to be used with grouping.
What did work:
To apply .I correctly, I added a grouping column:
dt[i>=12, g:=TRUE]
dt[i>=12, .I[which.min(val)], by=g][, V1]
# [1] 2
Note that g is NA for i < 12, so that group is excluded from the result.
But, this requires extra computational power to add the column and perform the grouping. My production data.table has several million rows and I have to find the minimum very often, so I'd like to avoid any extra computations.
Do you have any idea, how to efficiently solve this?
But, this requires extra computational power to add the column and perform the grouping.
So, keep the data sorted by val if it's so important:
setorder(dt, val)
dt[.(i_min = 12), on=.(i >= i_min), mult="first", which = TRUE]
# 2
This can also be extended to check more threshold i values. Just give a vector in i_min =:
dt[.(i_min = 9:14), on=.(i >= i_min), mult="first", which = TRUE]
# [1] 1 1 1 2 3 NA
How it works
x[i, on=, ...] is the syntax for a join.
i can be another table or equivalently a list of equal-length vectors.
.() is a shorthand for list().
on= can have inequalities for a "non-equi join".
mult= can determine what happens when a row of i has more than one match in x.
which=TRUE will return row numbers of x instead of the full joined table.
You can use the fact that which.min will ignore NA values to "mask" the values you don't want to consider:
dt[,which.min(ifelse(i>=12, val, NA))]
As a simple example of this behavior, which.min(c(NA, 2, 1)) returns 3, because the 3rd element is the min among all the non-NA values.

determining row indices of data.table group members

The data.table package in R provides the option:
which: ‘TRUE’ returns the integer row numbers of ‘x’ that ‘i’ matches to.
However, I see no way of obtaining, within j, the integer row numbers of 'x' within the groups established using by.
For example, given...
DT = data.table(x=rep(c("a","b","c"),each=3), y=c(1,3,6))
...I would like to know the indices into DT for each value of y.
The value to me is that I am using a data.table in parallel with Another Data Structure (ADS), on which I intend to perform group-wise computations based on the efficiently computed groupings of the data.table.
For example, assuming ADS is a vector with a value for each row in DT:
ADS<-sample(100,nrow(DT))
I can, as a workaround, compute the group-wise mean of ADS, with groups determined by DT$y, if I first add a new sequence column to the data.table.
DT[,seqNum:=seq_len(nrow(DT))]
DT[,mean(ADS[seqNum]),by=y]
This gives the result I want, at the cost of adding a new column.
I realize that in this example I can get the same answer using tapply:
tapply(ADS,DT$y,mean)
However, I will not then get the performance benefit of data.table's efficient grouping (especially when the 'by' columns are indexed).
Perhaps there is some syntax I am overlooking???
Perhaps this is an easy feature to add to data.table and I should request it (wink, wink)???
Proposed syntax: optionally set '.which' to the group indices, allowing one to write:
DT[,mean(ADS[.which]),by=y,which=TRUE]
Available since data.table 1.8.3, you can use .I in the j of a data.table to get the row indices by group...
DT[ , list( yidx = list(.I) ) , by = y ]
# y yidx
#1: 1 1,4,7
#2: 3 2,5,8
#3: 6 3,6,9
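Applied to the original goal, this means the group-wise mean of the external vector can be computed directly, without the seqNum workaround (a small sketch using the ADS from the question):
DT[, mean(ADS[.I]), by = y]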
A keyed data.table will be sorted so that groups are stored in contiguous blocks. In that case, you could use .N to extract the group-wise indexing information:
DT <- data.table(x=rep(c("a","b","c"),each=3), y=c(1,3,6))
setkey(DT, y)
ii <- DT[,.N, by=y]
ii[, start := cumsum(N) - N + 1][, end := cumsum(N)][, N := NULL]
# y start end
# 1: 1 1 3
# 2: 3 4 6
# 3: 6 7 9
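A sketch of how these start/end indices could then drive the group-wise computation (this assumes ADS is aligned with the keyed row order of DT):
ii[, .(y, meanADS = mapply(function(s, e) mean(ADS[s:e]), start, end))]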
(Personally, I'd probably just add an indexing column like your suggested seqNum. It seems simpler, and I don't think it will affect performance too much unless you are really pushing the limits.)
