I am trying to apply a logical operator to the columns of a data.table. This works fine without by= groups, but with by= groups something weird happens. I'm using R 3.3.1 and data.table 1.9.6 (off CRAN).
Here is an example data set:
library(data.table)
# x <- matrix(rnorm(10),ncol=2)
x <- structure(c(0.323618333400921, -0.103619160085676, -0.823322722581548,
-0.0175726272468397, 2.17286179204364, 1.63465793636681, 0.258469984285221,
-0.411622451584, 0.22594247704226, 0.114427086028551), .Dim = c(5L, 2L))
dt <- data.table(x)
dt$id <- c(1,1,1,2,2)
dt
## V1 V2 id
## 1: 0.32361833 1.6346579 1
## 2: -0.10361916 0.2584700 1
## 3: -0.82332272 -0.4116225 1
## 4: -0.01757263 0.2259425 2
## 5: 2.17286179 0.1144271 2
Let's say I would like to know whether the cumulative minimum (that is, the minimal value in each column from the start of the sample, or of the by= group, up to the current row) is below zero. When there are no by= groups, we are good:
dt[, cummin(.SD), .SDcols=1:2]
## V1 V2
## 1: 0.3236183 1.6346579
## 2: -0.1036192 0.2584700
## 3: -0.8233227 -0.4116225
## 4: -0.8233227 -0.4116225
## 5: -0.8233227 -0.4116225
dt[, cummin(.SD)<0, .SDcols=1:2]
## V1 V2
## [1,] FALSE FALSE
## [2,] TRUE FALSE
## [3,] TRUE TRUE
## [4,] TRUE TRUE
## [5,] TRUE TRUE
So far so good. I can also compute actual cumulative minimums with by= groups:
dt[, cummin(.SD), by=id]
## id V1 V2
## 1: 1 0.32361833 1.6346579
## 2: 1 -0.10361916 0.2584700
## 3: 1 -0.82332272 -0.4116225
## 4: 2 -0.01757263 0.2259425
## 5: 2 -0.01757263 0.1144271
However, if I try to compute the indicator variables as before, the output is as follows:
dt[, (cummin(.SD)<0), by=id]
## id V1
## 1: 1 FALSE
## 2: 1 TRUE
## 3: 1 TRUE
## 4: 1 FALSE
## 5: 1 FALSE
## 6: 1 TRUE
## 7: 2 TRUE
## 8: 2 TRUE
## 9: 2 FALSE
## 10: 2 FALSE
Instead of 2 columns of 5 observations each, we get a single column of 10. It appears that within each by= group, the columns are stacked on top of each other. How do I do this correctly?
It will work with lapply:
dt[,lapply(.SD,function(x) cummin(x)<0),by="id"]
id V1 V2
1: 1 FALSE FALSE
2: 1 TRUE FALSE
3: 1 TRUE TRUE
4: 2 TRUE FALSE
5: 2 TRUE FALSE
As David Arenburg commented, it works with cummin(.SD) because that returns a data.frame, while cummin(.SD)<0 returns a matrix, which data.table flattens column-wise within each group. So another solution is to cast the matrix back to a data.frame:
dt[, (as.data.frame(cummin(.SD)<0)), by=id]
id V1 V2
1: 1 FALSE FALSE
2: 1 TRUE FALSE
3: 1 TRUE TRUE
4: 2 TRUE FALSE
5: 2 TRUE FALSE
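To see the class difference for yourself, a quick sketch run against the dt from the question:
d <- dt[, cummin(.SD), .SDcols = 1:2]
class(d)  # a data.table, so the two columns keep their identity under by=
m <- dt[, cummin(.SD) < 0, .SDcols = 1:2]
class(m)  # a plain matrix; per by= group, data.table flattens it column-wise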
You can try something like this:
dt2 <- dt[, cummin(.SD), by = id]
results <- data.table(dt2[, .SD<0])
results$id <- dt$id
results
id V1 V2
1: 1 FALSE FALSE
2: 1 TRUE FALSE
3: 1 TRUE TRUE
4: 2 TRUE FALSE
5: 2 TRUE FALSE
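As a footnote to the lapply approach above, .SDcols also works there, which helps once the table has more columns than the two you want to test (a sketch):
# restrict .SD to the columns of interest; by= handles the grouping
dt[, lapply(.SD, function(v) cummin(v) < 0), by = id, .SDcols = c("V1", "V2")]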
I have transactional data like this:
library(data.table)
library(stringr)
sample <- data.table(
  customerid = c(1,1,2,2,2,3,4,4,5,5,6,6,6,7,7),
  product = c("A","A+B","A","A+B+C","A+C","B","B+C+D","C+D","A+D","A+B+D","A+B","A","A+C","B+D","D")
)
I am trying to count how many distinct products each customer bought in total, and to add that as a column named total_product.
I tried this code with data.table:
sample[, A := str_detect(product, "A")]
sample[, B := str_detect(product, "B")]
sample[, C := str_detect(product, "C")]
sample[, D := str_detect(product, "D")]
sample
The code returns:
customerid product A B C D
1: 1 A TRUE FALSE FALSE FALSE
2: 1 A+B TRUE TRUE FALSE FALSE
3: 2 A TRUE FALSE FALSE FALSE
4: 2 A+B+C TRUE TRUE TRUE FALSE
5: 2 A+C TRUE FALSE TRUE FALSE
6: 3 B FALSE TRUE FALSE FALSE
7: 4 B+C+D FALSE TRUE TRUE TRUE
8: 4 C+D FALSE FALSE TRUE TRUE
9: 5 A+D TRUE FALSE FALSE TRUE
10: 5 A+B+D TRUE TRUE FALSE TRUE
11: 6 A+B TRUE TRUE FALSE FALSE
12: 6 A TRUE FALSE FALSE FALSE
13: 6 A+C TRUE FALSE TRUE FALSE
14: 7 B+D FALSE TRUE FALSE TRUE
15: 7 D FALSE FALSE FALSE TRUE
I saw a question on Stack Overflow suggesting that I merge the four columns c(A, B, C, D) and count the TRUE values.
But in my case the same product would then be counted more than once.
Thanks for your advice!
We can use lapply over the pattern vector (LETTERS[1:4]) and either pass str_detect directly, naming its string argument:
sample[, LETTERS[1:4] := lapply(LETTERS[1:4], str_detect, string = product)]
Or use an anonymous/lambda function:
sample[, LETTERS[1:4] := lapply(LETTERS[1:4], function(x)
str_detect(product, x))]
Then create the 'total_product' count as the row-wise sum of the logical columns, i.e. TRUE -> 1 and FALSE -> 0:
sample[, total_product := rowSums(.SD), .SDcols = A:D]
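As a quick sanity check (a hypothetical inspection, assuming the columns above were created):
# rowSums() treats TRUE as 1, so the "A+B+C" row should show total_product == 3
sample[product == "A+B+C", .(product, A, B, C, D, total_product)]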
If we want to count the unique elements of 'product' for each 'customerid', an option is to split the column with strsplit and get the unique count with uniqueN:
sample[, .(total_product = uniqueN(unlist(strsplit(product,
'+', fixed = TRUE)))), by = customerid]
Output:
# customerid total_product
#1: 1 2
#2: 2 3
#3: 3 1
#4: 4 3
#5: 5 3
#6: 6 3
#7: 7 2
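If you prefer to keep the count on every row instead of in a summary table, the same strsplit/uniqueN idea works with := (a sketch; total_unique is a hypothetical column name):
# split "A+B+C" on "+", pool each customer's products, count the distinct ones;
# := broadcasts the per-customer count to every one of that customer's rows
sample[, total_unique := uniqueN(unlist(strsplit(product, "+", fixed = TRUE))),
       by = customerid]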
I have a data.table with a column containing lists. I want to check, row by row, whether the value of another column is present in the list column.
library(data.table)
dt <- data.table("a" = 1:3, "b" = list(1:2, 3:4, 5:6))
I tried:
dt[, is_a_in_b := a %in% b]
dt
# a b is_a_in_b
# 1: 1 1,2 FALSE
# 2: 2 3,4 FALSE
# 3: 3 5,6 FALSE
which does not give the correct result. The desired table would be
dt
# a b is_a_in_b
# 1: 1 1,2 TRUE
# 2: 2 3,4 FALSE
# 3: 3 5,6 FALSE
You can use mapply to apply the function %in% pairwise to the two columns a and b. In effect it walks the pair of vectors (lists) and produces, for every index ix, the result of a[ix] %in% b[[ix]].
dt[, is_a_in_b := mapply('%in%', a, b)]
> dt
a b is_a_in_b
1: 1 1,2 TRUE
2: 2 3,4 FALSE
3: 3 5,6 FALSE
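If the backticked '%in%' looks opaque, here is the same thing with an explicit anonymous function (a sketch against the same dt):
# mapply walks a and b in lockstep; each call asks "is a[i] in the list b[[i]]?"
dt[, is_a_in_b := mapply(function(x, lst) x %in% lst, a, b)]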
I have the following data.table:
DT1 <- data.table(col1 = c(1,2,3,4,5,6,7), col2 = letters[1:7], col3 = rep(TRUE,7))
col1 col2 col3
1: 1 a TRUE
2: 2 b TRUE
3: 3 c TRUE
4: 4 d TRUE
5: 5 e TRUE
6: 6 f TRUE
7: 7 g TRUE
Then I define:
vec <- c(2,5,6)
And with:
DT1[col1 == vec, col3 := FALSE]
I obtain:
col1 col2 col3
1: 1 a TRUE
2: 2 b TRUE
3: 3 c TRUE
4: 4 d TRUE
5: 5 e FALSE
6: 6 f FALSE
7: 7 g TRUE
I expect col3 of the second row to be set to FALSE here, which is not the case.
But for example, this works as I expect:
DT1[vec, col3 := FALSE]
What am I missing?
data.table has the format DT[i, j, by], with i meaning location/where, j meaning select/update/compute/assign, and by meaning group by.
So the mistake you are making here is the following:
In your assignment, the DT1[col1 == vec, ...] part is equivalent to the following index:
DT1$col1 == vec
This compares the elements of DT1's col1 column with vec. Since vec has only 3 elements, it is recycled to length 7, and given the specific values in your vec and col1, the 5th and 6th elements turn out TRUE after recycling.
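To see the recycling explicitly (rep() spells out the repetition and avoids the length-mismatch warning):
DT1$col1 == rep(vec, length.out = nrow(DT1))
# [1] FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE
# positions 5 and 6 match because vec recycles to c(2,5,6,2,5,6,2)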
The correct way to do what you want to do is:
Method 1: (preferred)
DT1[vec, col3 := FALSE]
Method 2: (data.frame-style assignment; works, but not preferred for a data.table)
DT1$col3[vec] <- FALSE
or, the following will also work:
DT1[vec]$col3 <- FALSE
Method 3: Here is another possibility (although slower than the first method):
DT1[col1 %in% vec, col3 := FALSE]
Hope this helps!!
Use %in% as it returns a logical vector:
> DT1<-data.table(col1=c(1,2,3,4,5,6,7),col2=letters[1:7],col3=rep(TRUE,7))
> vec <- c(2,5,6)
> DT1[col1 %in% vec, col3 := FALSE]
> DT1
col1 col2 col3
1: 1 a TRUE
2: 2 b FALSE
3: 3 c TRUE
4: 4 d TRUE
5: 5 e FALSE
6: 6 f FALSE
7: 7 g TRUE
I am trying to determine the rows in a data.table that are duplicated. Here's the sample code I used.
a<-c(1,2,3,3,5)
b<-c(6,7,8,9,10)
require(data.table)
DT<-data.table(a,b)
setkey(DT,a)
DT
# a b
# 1: 1 6
# 2: 2 7
# 3: 3 8
# 4: 3 9
# 5: 5 10
duplicated(DT)
# [1] FALSE FALSE FALSE TRUE FALSE
As you can see, row 4 is identified as a duplicate, which is incorrect, since the b values differ: DT[3:4,"b",with=FALSE] shows 8 vs. 9. Am I using the duplicated function wrong?
duplicated.data.table has a by argument, which defaults to the key columns.
Therefore, if you set it to consider all columns, it will behave as you expect:
duplicated(DT, by = names(DT))
# [1] FALSE FALSE FALSE FALSE FALSE
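The same by= argument drives unique(). Note that in newer data.table releases (1.9.8+) the default for both duplicated() and unique() changed from the key to all columns, so recent versions already behave as you expected; passing by= explicitly avoids the version-dependent default (a sketch on the keyed DT above):
unique(DT, by = key(DT))     # key-based: drops row (3, 9) as a duplicate of (3, 8)
unique(DT, by = names(DT))   # full-row comparison: keeps all 5 rows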
I'd like to summarize a set of observations in a data.table and could use some help with the syntax.
I think this is as simple as a join, but I'm trying to flag that specific values were seen on a specific observation DAY, even if it's across multiple measurements or sensors on that day.
observations are summarized by date
observation dates have varying counts of measurements (rows per date)
'M'easurement columns indicate that a specific value was observed by ANY sensor on that day
I've created 2 sample sets of data that I hope will clarify the goal. I've also created an image of an Excel spreadsheet that shows the relationship between the data.
library(data.table)
raw <- data.table(
Date = as.Date(c("2013-5-4","2013-5-4","2013-5-4", "2013-5-9","2013-5-9", "2013-5-16","2013-5-16","2013-5-16", "2013-5-30")),
S1 = c(4, 2, 3, 1, 1, 8, 7, 3, 3),
S2 = c(2, 5, 2, 4, 4, 9, 1, 6, 4),
S3 = c(6, 2, 2, 7, 3, 2, 7, 2, 1)
)
summarized <- data.table(
Date = as.Date(c("2013-5-4", "2013-5-9", "2013-5-16", "2013-5-30")),
M1 = c(FALSE,TRUE,TRUE,TRUE),
M2 = c(TRUE,FALSE,TRUE,FALSE),
M3 = c(TRUE,TRUE,TRUE,TRUE),
M4 = c(TRUE,FALSE,FALSE,TRUE),
M5 = c(TRUE,FALSE,FALSE,FALSE),
M6 = c(TRUE,FALSE,TRUE,FALSE),
M7 = c(FALSE,TRUE,TRUE,FALSE),
M8 = c(FALSE,FALSE,TRUE,FALSE),
M9 = c(FALSE,FALSE,TRUE,FALSE),
M10 = c(FALSE,FALSE,TRUE,FALSE)
)
[Excel screenshot showing how the raw rows map to the summarized rows]
Raw is the measurement input. Multiple measurements can happen on the same observation date (i.e. multiple rows per date).
Summarized is what I'm hoping to get out. Rows are collapsed per date, and the 'M'easurement columns merely indicate that the value (the number following the M, i.e. M1, M2) was observed on that day in any of the sensor columns. For example, the number 2 was seen in the first and last observations on 5/16, but the number 5 was not seen in any of the 9 values on 5/16.
I think I need to use a join but how to calculate the M columns escapes me.
Any help is much appreciated.
Question: is there a name for this type of operation in data science or mathematics?
Update:
I'm trying the following:
setkey(raw,Date)
s <- data.table( Date=unique(raw$Date)) # get a datatable of the unique dates
setkey(s,Date)
s[raw, M1 := (length(na.omit(match(c(raw$S1, raw$S2, raw$S3), 1))) >= 1)]
Note that the values are not what's expected for 5-4 (it should be FALSE). I think this is because the raw rows are not being constrained to the current date in my match statement.
Date M1
1: 2013-05-04 TRUE
2: 2013-05-09 TRUE
3: 2013-05-16 TRUE
4: 2013-05-30 TRUE
My guess is I need to use something different to subset the raw rows in the join.
This seems to work:
raw[,lapply(1:10,`%in%`,unique(unlist(.SD))),by=Date]
The result is
Date V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1: 2013-05-04 FALSE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE
2: 2013-05-09 TRUE FALSE TRUE TRUE FALSE FALSE TRUE FALSE FALSE FALSE
3: 2013-05-16 TRUE TRUE TRUE FALSE FALSE TRUE TRUE TRUE TRUE FALSE
4: 2013-05-30 TRUE FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
If you want the columns with "M" instead of "V", use c(M=1:10) in place of 1:10.
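Spelled out, since the naming trick is easy to miss: c(M = 1:10) carries the names "M1".."M10", and both lapply and data.table's j preserve them (a sketch):
# each list element asks "was value x seen anywhere among this Date's sensor readings?"
raw[, lapply(c(M = 1:10), `%in%`, unique(unlist(.SD))), by = Date]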
This is a reshaping problem.
First, since it doesn't matter which sensor the data came from, let's aggregate your three columns into one column.
temp <- raw[,Reduce(union,list(S1,S2,S3)),by=Date]
Now we want to reshape from "long" to "wide" format.
A data.table solution borrowed from this answer:
setkey(temp, Date, V1)
temp[CJ(unique(Date), unique(V1)), .N, by = .EACHI][,  # by = .EACHI groups per CJ row (required since data.table 1.9.4)
    setNames(as.list(as.logical(N)), paste0("M", unique(V1))), by = Date]
# Date M1 M2 M3 M4 M5 M6 M7 M8 M9
# 1: 2013-05-04 FALSE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE
# 2: 2013-05-09 TRUE FALSE TRUE TRUE FALSE FALSE TRUE FALSE FALSE
# 3: 2013-05-16 TRUE TRUE TRUE FALSE FALSE TRUE TRUE TRUE TRUE
# 4: 2013-05-30 TRUE FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE
Base reshape works this way:
as.data.table(reshape(temp, timevar = "V1", v.names = "V1", idvar = "Date", direction = "wide"))
# Date V1.4 V1.2 V1.3 V1.5 V1.6 V1.1 V1.7 V1.8 V1.9
# 1: 2013-05-04 4 2 3 5 6 NA NA NA NA
# 2: 2013-05-09 4 NA 3 NA NA 1 7 NA NA
# 3: 2013-05-16 NA 2 3 NA 6 1 7 8 9
# 4: 2013-05-30 4 NA 3 NA NA 1 NA NA NA
## to get the wide columns in value order
temp2 <- as.data.table(reshape(temp[order(V1)], timevar = "V1", v.names = "V1", idvar = "Date", direction = "wide"))
# Date V1.1 V1.2 V1.3 V1.4 V1.5 V1.6 V1.7 V1.8 V1.9
# 1: 2013-05-09 1 NA 3 4 NA NA 7 NA NA
# 2: 2013-05-16 1 2 3 NA NA 6 7 8 9
# 3: 2013-05-30 1 NA 3 4 NA NA NA NA NA
# 4: 2013-05-04 NA 2 3 4 5 6 NA NA NA
## convert to logical TRUE/FALSE (NA means the value was absent that day)
temp2[, lapply(.SD, function(x) {x[is.na(x)] <- 0; as.logical(x)}), by = Date]
#          Date  V1.1  V1.2  V1.3  V1.4  V1.5  V1.6  V1.7  V1.8  V1.9
# 1: 2013-05-09  TRUE FALSE  TRUE  TRUE FALSE FALSE  TRUE FALSE FALSE
# 2: 2013-05-16  TRUE  TRUE  TRUE FALSE FALSE  TRUE  TRUE  TRUE  TRUE
# 3: 2013-05-30  TRUE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE
# 4: 2013-05-04 FALSE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE
The package reshape2 is a bit more intuitive:
require(reshape2)
## dummy variable for TRUE/FALSE
temp[,vv := TRUE]
temp_reshape2 <- as.data.table(dcast(temp, Date ~ V1, value.var = "vv"))
## replace NA with FALSE
temp_reshape2[, lapply(.SD, function(x) {x[is.na(x)] <- FALSE; x}), by = Date]
# Date 1 2 3 4 5 6 7 8 9
# 1: 2013-05-04 FALSE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE
# 2: 2013-05-09 TRUE FALSE TRUE TRUE FALSE FALSE TRUE FALSE FALSE
# 3: 2013-05-16 TRUE TRUE TRUE FALSE FALSE TRUE TRUE TRUE TRUE
# 4: 2013-05-30 TRUE FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE
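As a side note, recent data.table versions ship their own dcast, so a similar route works without reshape2. A sketch, assuming temp still holds the Date/V1 pairs built above (the M, wide and cols names are mine):
temp[, M := paste0("M", V1)]   # labels for the wide columns
wide <- data.table::dcast(temp, Date ~ M, value.var = "V1", fun.aggregate = length)  # counts; absent combos fill with 0
cols <- setdiff(names(wide), "Date")
wide[, (cols) := lapply(.SD, function(n) n > 0), .SDcols = cols]   # counts -> logicals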
And for completeness, a lame parse-eval solution:
limits <- temp[, c(min(V1), max(V1))]
sapply(limits[1]:limits[2], function(x) {
  temp[, eval(parse(text = paste0("M", x, " := any(abs(V1 - ", x, ") < .Machine$double.eps)"))), by = Date]
})
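The same columns can also be built without parse/eval by looping a grouped assignment; a computed name on the left of := is evaluated as a call (a sketch against the same temp):
# one grouped assignment per candidate value; any(V1 == x) flags its presence per Date
# (exact == is fine here since the values are small whole numbers)
for (x in temp[, min(V1):max(V1)]) {
  temp[, paste0("M", x) := any(V1 == x), by = Date]
}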