Including all permutations when using data.table[,,by=...]

I have a large data.table that I am collapsing to the month level using by.
There are 5 by vars, with # of levels: c(4,3,106,3,1380). The 106 is months, the 1380 is a geographic unit. As it turns out, there are some zeros, in that some cells have no values. by drops these, but I'd like it to keep them.
Reproducible example:
require(data.table)
set.seed(1)
n <- 1000
s <- function(n,l=5) sample(letters[seq(l)],n,replace=TRUE)
dat <- data.table( x=runif(n), g1=s(n), g2=s(n), g3=s(n,25) )
datCollapsed <- dat[ , list(nv=.N), by=list(g1,g2,g3) ]
datCollapsed[ , prod(dim(table(g1,g2,g3))) ] # how many there should be: 5*5*25=625
nrow(datCollapsed) # how many there are
Is there an efficient way to fill in these missing values with 0's, so that all permutations of the by vars are in the resultant collapsed data.table?

I'd also go with a cross-join, but would use it in the i-slot of the original call to [.data.table:
keycols <- c("g1", "g2", "g3") ## Grouping columns
setkeyv(dat, keycols) ## Set dat's key
ii <- do.call(CJ, sapply(dat[, ..keycols], unique)) ## CJ() to form index
datCollapsed <- dat[ii, list(nv=.N)] ## Aggregate
## Check that it worked
nrow(datCollapsed)
# [1] 625
table(datCollapsed$nv)
# 0 1 2 3 4 5 6
# 135 191 162 82 39 13 3
This approach is referred to as a "by-without-by" and, as documented in ?data.table, it is just as efficient and fast as passing the grouping instructions in via the by argument:
Advanced: Aggregation for a subset of known groups is particularly efficient when passing those groups in i. When i is a data.table, DT[i,j] evaluates j for each row of i. We call this by without by or grouping by i. Hence, the self join DT[data.table(unique(colA)),j] is identical to DT[,j,by=colA].
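A note beyond the original answer: in data.table 1.9.4 and later, by-without-by is no longer the default behaviour; to keep grouping by the rows of i you pass by = .EACHI explicitly. A hedged adaptation of the aggregation step above:
datCollapsed <- dat[ii, list(nv=.N), by=.EACHI] ## group by each row of ii in newer versions
(Depending on the version, you may also want to check how non-matching groups are counted, e.g. via the nomatch argument.)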

Make a cartesian join of the unique values, and use that to join back to your results
dat.keys <- dat[,CJ(g1=unique(g1), g2=unique(g2), g3=unique(g3))]
setkey(datCollapsed, g1, g2, g3)
nrow(datCollapsed[dat.keys]) # effectively a left join of dat.keys with datCollapsed, so all 625 key combinations are kept
# [1] 625
Note that the missing values are NA right now, but you can easily change that to 0s if you want.
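For completeness, a minimal sketch of that last step (not part of the original answer), assuming the joined result is stored first:
res <- datCollapsed[dat.keys] # missing cells come back as NA in nv
res[is.na(nv), nv := 0L]      # replace the NA counts with 0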

Related

Expressively select rows in a data.table which match rows from another data.table

Given two data tables (tbl_A and tbl_B), I would like to select all the rows in tbl_A which have matching rows in tbl_B, and I would like the code to be expressive. If the %in% operator were defined for data.tables, something like this would be ideal:
subset <- tbl_A[tbl_A %in% tbl_B]
I can think of many ways to accomplish what I want such as:
# double negation (set differences)
subset <- tbl_A[!tbl_A[!tbl_B,1,keyby=a]]
# nomatch with keyby and this annoying `[,V1:=NULL]` bit
subset <- tbl_B[,1,keyby=.(a=x)][,V1:=NULL][tbl_A,nomatch=0L]
# nomatch with !duplicated() and setnames()
subset <- tbl_B[!duplicated(tbl_B),.(x)][tbl_A,nomatch=0L]; setnames(subset,"x","a")
# nomatch with !unique() and setnames()
subset <- unique(tbl_B)[,.(x)][tbl_A,nomatch=0L]; setnames(subset,"x","a")
# use of a temporary variable (Thanks @Frank)
subset <- tbl_A[, found := FALSE][tbl_B, found := TRUE][(found)][,found:=NULL][]
but each expression is difficult to read and it's not obvious at first glance what the code is doing. Is there a more idiomatic / expressive way of accomplishing this task?
For purposes of example, here are some toy data.tables:
# toy tables
tbl_A <- data.table(a = letters[1:5],
                    b = 1:5,
                    c = rnorm(5))
tbl_B <- data.table(x = letters[3:7],
                    y = 13:17,
                    z = rnorm(5))
# both tables might have multiple rows with the same key fields.
tbl_A <- rbind(tbl_A,tbl_A)
tbl_B <- rbind(tbl_B,tbl_B)
setkey(tbl_A,a)
setkey(tbl_B,x)
and an expected result containing the rows in tbl_A which match at least one row in tbl_B:
a b c
1: c 3 -0.5403072
2: c 3 -0.5403072
3: d 4 -1.3353621
4: d 4 -1.3353621
5: e 5 1.1811730
6: e 5 1.1811730
Adding 2 more options
tbl_A[fintersect(tbl_A[,.(a)], tbl_B[,.(a=x)])]
and
tbl_A[unique(tbl_A[tbl_B, nomatch=0L, which=TRUE])]
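A side note beyond the original answers: if the tables aren't keyed, the same semi-join idea can be written with the on= argument available in newer data.table versions; a rough sketch of the second option above:
tbl_A[unique(tbl_A[tbl_B, on = c(a = "x"), which = TRUE, nomatch = 0L])]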
I'm not sure how expressive it is (apologies if not) but this seems to work:
tbl_A[,.(a,b,c,any(a == tbl_B[,x])), by = a][V4==TRUE,.(a,b,c)]
I'm sure it can be improved - I only found out about any() yesterday and still testing it :)

Calculation in columns using previous row value without loops

I have data in columns on which I need to do calculations. Is it possible to do this using previous row values without using a loop? E.g. if the value in the first column is 139, calculate the mean of the last 5 values of Data and the percent change between the PF value 5 rows above and the PF value in the current row?
ID Data PF
135 5 123
136 4 141
137 5 124
138 6 200
139 1 310
140 2 141
141 4 141
So here in this dataset you would do:
Find 139 in ID column
Return the mean of the last 5 rows of Data (gives 4.2)
Return the percent change of the PF value 5 rows above (the row with ID 135) relative to the current PF value (gives 152%)
If I would do a loop it looks like this:
for (i in 1:nrow(data)) {
  if (data$ID[i] == "139" & i >= 5) {
    data$New_column[i] <- data[i, "PF"] / data[i-4, "PF"] - 1
  }
}
The problem is that the loop takes too long because there are too many data points. The ID 139 will appear several times in the dataset.
Many thanks.
Carlos
As pointed out by Tutuchacn and Sotos, use the package zoo to get the mean of Data over the last N rows, inclusive of the row you are querying (assuming your data is in the data frame df):
library(zoo)
ind <- which(df$ID==139) ## this is the row you are querying
N <- 5 ## here, N is 5
res <- rollapply(df$Data, width=N, mean)[ind-(N-1)]
print(res)
## [1] 4.2
rollapply(..., mean) returns the rolling mean of the windowed data of width = N. Note that the index used to query the output from rollapply is offset by N-1, because element k of the output corresponds to the window covering rows k through k+N-1, so the window ending at row ind sits at position ind-(N-1).
To get the percent performance from PF as you specified:
percent.performance <- function(x) {
  z <- zoo(x)     ## create a zoo series
  lz <- lag(z, 4) ## create the lag version
  return(z/lz - 1)
}
res <- as.numeric(percent.performance(df$PF)[ind])
print(res)
## [1] 1.520325
Here, we define a function percent.performance that returns what you want for all rows of df for which the computation makes sense. We then extract the row we want using ind and convert it to a number.
Hope this helps.
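As an aside (not from the original answers), a similar computation can be written with data.table's frollmean and shift, assuming df as defined above and data.table >= 1.12.0 for frollmean; a minimal sketch:
library(data.table)
dt <- as.data.table(df)
dt[, roll_mean := frollmean(Data, 5)] # mean of the current row and the 4 previous rows
dt[, perf := PF / shift(PF, 4) - 1]   # percent change vs. the value 4 rows earlier
dt[ID == 139, .(ID, roll_mean, perf)]
# should give roughly: ID = 139, roll_mean = 4.2, perf = 1.520325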
Is that what you want?
ntest <- 139
sol <- sapply(5:nrow(df), function(ii) {
  tdf <- df[(ii-4):ii, ]
  if (tdf[5, 1] == ntest)
    c(row = ii,
      average = mean(tdf[, "Data"]),
      performance = round(100 * (tdf[5, "PF"] / tdf[1, "PF"] - 1), 0))
})
sol <- sol[!sapply(sol, is.null)] # remove NULLs
sol
[[1]]
        row     average performance
        5.0         4.2       152.0
This could be a decent start:
mytext = "ID,Data,PF
135,5,123
136,4,141
137,5,124
138,6,200
139,1,310
140,2,141
141,4,141"
mydf <- read.table(text=mytext, header = T, sep = ",")
do.call(rbind, lapply(mydf$ID[which(mydf$ID==139):nrow(mydf)], function(x) {
  tempdf <- mydf[1:which(mydf$ID==x), ]
  data.frame(ID = x,
             Data = mean(tempdf$Data),
             PF = 100*(tempdf[nrow(tempdf),"PF"] - tempdf[(nrow(tempdf)-4),"PF"]) / tempdf[(nrow(tempdf)-4),"PF"])
}))
ID Data PF
139 4.200000 152.03252
140 3.833333 0.00000
141 3.857143 13.70968
The idea here is: you take the IDs starting from 139 through to the end and apply lapply to each of them, building a temporary data.frame that includes all the rows up to and including that particular ID. Then you take the mean of the Data column and the rate of change (i.e. what you call performance) of the PF column.

Map vectors into integers in a bijective way

I have 100,000 5-length vectors (the list VECTORS below) whose elements are chosen from one million values.
# dictionary
dictionary <- seq_len(1e6)
# generate 100,000 5-length vectors whose elements are chosen from dictionary
VECTORS <- lapply(1:1e5, sample, x = dictionary, size = 5)
My problem is to map each vector to one integer, with identical vectors getting the same integer, i.e. I need a mappy function that takes a vector as input and yields an integer.
mappy(c(58431, 976854, 661294, 460685, 341123)) = 15, for example. Do you know how to do this in an efficient way?
Subsidiary question: what if my vectors aren't all the same length?
I assume here you want a bijection between the vectors you have in your list and integers. One approach would be to create a factor variable out of character representations of your vectors. Let's start with a reproducible version of your code (I'll make it a smaller vector):
set.seed(144)
VECTORS <- replicate(1e2, sample(seq_len(1e6), 5), FALSE)
Now you can create a factor variable from the character representation of each vector:
fvar <- factor(sapply(VECTORS, paste, collapse=" "))
Now we have a bijection between string representations of elements of VECTORS and integers:
vec <- c(894025, 153892, 98596, 218401, 36616) # 15th element of VECTORS
which(levels(fvar) == paste(vec, collapse=" "))
# [1] 90
levels(fvar)[90]
# [1] "894025 153892 98596 218401 36616"
as.numeric(strsplit(levels(fvar)[90], " ")[[1]])
# [1] 894025 153892 98596 218401 36616
If you wanted to wrap them up into nice functions:
id.from.vec <- function(vec) which(levels(fvar) == paste(vec, collapse=" "))
id.from.vec(c(894025, 153892, 98596, 218401, 36616))
# [1] 90
vec.from.id <- function(id) as.numeric(strsplit(levels(fvar)[id], " ")[[1]])
vec.from.id(90)
# [1] 894025 153892 98596 218401 36616
Note that this works out of the box even if the vectors are different lengths.
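A small addition beyond the original answer: since a factor stores integer level codes internally, you can also pull the ID for every element of VECTORS in one go:
ids <- as.integer(fvar) # ids[i] is the integer assigned to VECTORS[[i]]
ids[15]                 # should match the lookup above, i.e. 90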
A keyed data.table has nice lookup properties:
library(data.table)
set.seed(1)
VECTORS <- lapply(seq(1e5), sample, x = 1e6, size = 5)
VECmap <- setkey(rbindlist(lapply(unique(VECTORS), as.list)))[, ID := .I]
# V1 V2 V3 V4 V5 ID
# 1: 13 897309 366563 678873 6571 1
# 2: 15 557977 640484 732531 848939 2
# 3: 48 18120 911805 188728 805726 3
# 4: 48 830301 862433 506297 877432 4
# 5: 52 873436 824165 86251 576173 5
# ---
# 99996: 999911 583599 803402 240910 931996 99996
# 99997: 999931 146505 287431 180259 230904 99997
# 99998: 999937 175888 266336 874987 982951 99998
# 99999: 999950 960139 455084 586956 875504 99999
# 100000: 999993 191750 258982 518519 78087 100000
mapVEC <- function(...) VECmap[.(...)]$ID
mapID <- function(id) unlist(VECmap[ID==id,!"ID",with=FALSE], use.names=FALSE)
# example usage
mapVEC(52, 873436, 824165, 86251, 576173)
# 5
mapID(5)
# 52 873436 824165 86251 576173
Comments: As mentioned by @Roland, a bijection between (a) the integers 1..1e5 and (b) all 5-length sequences of distinct numbers drawn from 1..1e6 is not possible, so I'm just guessing that this is what the OP is after.
When you write a function with ... as an argument, that means an arbitrary number of unnamed arguments are accepted. Within the function, these arguments can be referred to with ..., but are often also seen with c(...) and list(...). Within a data.table, .(...) is an alias for list(...). To see documentation for writing functions, type help.start() and click through to the "R Language Definition."
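To make the ... mechanism concrete, a tiny illustration with a hypothetical function f (not from the answer above):
f <- function(...) {
  args <- list(...) # capture all arguments passed via ... in a list
  length(args)      # how many were supplied
}
f(52, 873436, 824165)
# [1] 3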

R data.table intersection of all groups

I want to have the intersection of all groups of a data table. So for the given data:
dt <- data.table(a=c(1,2,3, 2, 3,2), myGroup=c("x","x","x", "y", "z","z"))
I want to have the result:
2
I know that
Reduce(intersect, list(c(1,2,3), c(2), c(3,2)))
will give me the desired result but I didn't figure out how to produce a list of groups of a data.table query.
I would try using Reduce in the following way (assuming dt is your data)
Reduce(intersect, dt[, .(list(unique(a))), myGroup]$V1)
## [1] 2
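For reference (not part of the original answer), the inner expression builds one list element per group, which Reduce(intersect, ...) then intersects; with the data above the intermediate result looks roughly like this:
dt[, .(list(unique(a))), myGroup]
#    myGroup    V1
# 1:       x 1,2,3
# 2:       y     2
# 3:       z   3,2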
Here's one approach.
nGroups <- length(unique(dt[,myGroup]))
dt[, if(length(unique(myGroup))==nGroups) .BY else NULL, by="a"][[1]]
# [1] 2
And here it is with some explanatory comments.
## Mark down the number of groups in your data set
nGroups <- length(unique(dt[,myGroup]))
## Then, use `by="a"` to examine in turn subsets formed by each value of "a".
## For subsets having the full complement of groups
## (i.e. those for which `length(unique(myGroup))==nGroups)`,
## return the value of "a" (stored in .BY).
## For the other subsets, return NULL.
dt[, if(length(unique(myGroup))==nGroups) .BY else NULL, by="a"][[1]]
# [1] 2
If that code and the comments aren't clear on their own, a quick glance at the following might help. Basically, the approach above just looks for and reports the values of a whose subsets cover all of the groups x, y, z shown in column V1 below.
dt[,list(list(unique(myGroup))), by="a"]
# a V1
# 1: 1 x
# 2: 2 x,y,z
# 3: 3 x,z

Using 'fastmatch' package in R

I have to find indices for 1MM numeric values within a vector of roughly 10MM values. I found the package fastmatch, but when I use the function fmatch(), I am only returning the index of the first match.
Can someone help me use this function to find all values, not just the first? I realize this is a basic question but online documentation is pretty sparse and fmatch has cut down the computing time considerably.
Thanks so much!
Here is some sample data - for the purposes of this exercise, let's call this data frame A:
DateTime Address Type ID
1 2014-03-04 20:21:03 982076970 1 2752394
2 2014-03-04 20:21:07 98174238211 1 2752394
3 2014-03-04 20:21:08 76126162197 1 2752394
4 2014-03-04 20:21:16 6718053253 1 2752394
5 2014-03-04 20:21:17 98210219176 1 2752510
6 2014-03-04 20:21:20 7622877100 1 2752510
7 2014-03-04 20:21:23 2425126157 1 2752510
8 2014-03-04 20:21:23 2425126157 1 2752510
9 2014-03-04 20:21:25 701838650 1 2752394
10 2014-03-04 20:21:27 98210219176 1 2752394
What I wish to do is to find the number of unique Type values for each Address. There are several million rows of data with roughly 1MM unique Address values... on average, each Address appears about 6 times in the data set. And, though the Type values listed above are all 1, they can take any value from 0:5. I also realize the Address values are quite long, which adds to the time required for the matching.
I have tried the following:
uvals <- unique(A$Address)
utypes <- matrix(0, length(uvals), 2)
utypes[,1] <- uvals
for (i in 1:length(uvals)) {
  b <- which(uvals[i] %in% A$Address)
  c <- length(unique(A$Type[b]))
  utypes[i,2] <- c
}
However, the code above is not very efficient - if I am looping over 1MM values, I estimate this will take 10-15 hours.
I have tried this, as well, within the loop... but it is not considerably faster.
b <- which(A$Address == uvals[i])
I know there is a more elegant/faster way, I am fairly new to R and would appreciate any help.
This can be done using the unique function in data.table, followed by an aggregation. I'll illustrate it using more or less the sample data generated by @Chinmay:
Create sample data:
set.seed(100L)
dat = data.frame(
  address = sample(1e6L, 1e7L, TRUE),
  value = sample(1:5, 1e7L, TRUE, prob = c(0.5, 0.3, 0.1, 0.07, 0.03))
)
data.table solution:
require(data.table) ## >= 1.9.2
dat.u = unique(setDT(dat), by=c("address", "value"))
ans = dat.u[, .N, by=address]
Explanation:
The setDT function converts a data.frame to data.table by reference (which is very fast).
The unique function applied to a data.table invokes the unique.data.table method, which is much faster than base::unique. After this step, we have only the unique value entries for each address.
All that's left to do is to aggregate, or group by, address and get the number of observations in each group. The by=address part groups by address, and .N is a built-in data.table variable that provides the number of observations for each group.
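To see those two steps on a tiny hypothetical table (not the benchmark data):
toy <- data.table(address = c(1, 1, 1, 2), value = c(1, 1, 2, 3))
unique(toy, by = c("address", "value"))
#    address value
# 1:       1     1
# 2:       1     2
# 3:       2     3
unique(toy, by = c("address", "value"))[, .N, by = address]
#    address N
# 1:       1 2
# 2:       2 1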
Benchmarks:
I'll create functions to generate the data as a data.table and as a data.frame, to benchmark the data.table answer against the dplyr solution (a) proposed by @beginneR, although I don't see the need for arrange(.) there and will therefore skip that part.
## function to create data
foo <- function(type = "df") {
  set.seed(100L)
  dat = data.frame(
    address = sample(1e6L, 1e7L, TRUE),
    value = sample(1:5, 1e7L, TRUE, prob = c(0.5, 0.3, 0.1, 0.07, 0.03))
  )
  if (type == "dt") setDT(dat)
  dat
}
## DT function
dt_sol <- function(x) {
  unique(x, by = c("address", "value"))[, .N, by = address]
}
## dplyr function
dplyr_sol <- function(x) {
  distinct(x) %>% group_by(address) %>% summarise(N = n_distinct(value))
}
The timings reported here are three consecutive runs of system.time(.) on each function.
## benchmark timings in seconds
## pkg run-01 run-02 run-03 command
## data.table 2.4 2.3 2.4 system.time(ans1 <- dt_sol(foo("dt")))
## dplyr 15.3 16.3 15.7 system.time(ans2 <- dplyr_sol(foo()))
For some reason, dplyr automatically orders the result by the grouping variable. So in order to compare the results, I'll also order them in the result from data.table:
system.time(setkey(ans1, address)) ## 0.102 seconds
identical(as.data.frame(ans1), as.data.frame(ans2)) ## TRUE
So, data.table is ~6x faster here.
Note that bit64::integer64 is also supported in data.table. Since you mention the address values are quite long, you can also store them as integer64.
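A minimal sketch of that idea (an addition, assuming the bit64 package is installed and A/Address as in the question):
library(bit64)
A$Address <- as.integer64(A$Address) # store the long Address values as 64-bit integers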
You can try creating an index of your 10MM values and sort that. Then looking for your 1MM values in that indexed vector should be faster.
For example, using data.table package you can do that by using setkey function which indexes given column of data.table.
require(data.table)
set.seed(100)
dat <- sample(1:1e+07, size = 1e+07, replace = T)
searchval <- sample(dat, size = 1e+06)
DT <- data.table(dat, index = seq_along(dat))
setkey(DT, dat)
DT
## dat index
## 1: 1 169458
## 2: 1 4604823
## 3: 1 7793446
## 4: 2 5372388
## 5: 3 2036622
## ---
## 9999996: 9999996 1271426
## 9999997: 9999998 530029
## 9999998: 10000000 556672
## 9999999: 10000000 6776063
## 10000000: 10000000 6949665
lookup <- data.table(val = searchval)
setkey(lookup, val)
lookup
## val
## 1: 2
## 2: 16
## 3: 24
## 4: 33
## 5: 36
## ---
## 999996: 9999970
## 999997: 9999973
## 999998: 9999988
## 999999: 9999996
## 1000000: 9999998
Now you can lookup all the values from lookup in DT by simply using
DT[lookup]
## dat index
## 1: 2 5372388
## 2: 16 537927
## 3: 16 1721233
## 4: 24 7286522
## 5: 33 7448516
## ---
## 2000298: 9999973 8008610
## 2000299: 9999988 3099060
## 2000300: 9999988 7996302
## 2000301: 9999996 1271426
## 2000302: 9999998 530029
The fmatch documentation seems to clearly state that it only finds the first match. And given that it uses an underlying hashing strategy, I imagine it's unlikely that it stores multiple items per key, which is one of the ways it stays so fast (and it's the same way match works).
Do you have many duplicate values? Perhaps you could store those in a separate place/table and create a fast index to a list of possible matches. It would be more helpful if you provided sample data representative of what you're trying to do and the code you tried to see if it would be easy to extend.
If I understand your question correctly, you can also do this with dplyr:
I will include two different ways, since I am not entirely sure which is your desired output.
First create some sample data:
Address <- rep(letters, 5)
Type <- sample(1:5, size=5*26, replace=T)
A <- data.frame(Address, Type)
Then install and load dplyr
require(dplyr)
a) To find the number of different Type values for each Address value:
A %.% arrange(Address, Type) %.% group_by(Address) %.% summarize(NoOfTypes = length(unique(Type)))
b) To find all unique combinations of Address and Type:
A %.% arrange(Address, Type) %.% group_by(Address, Type) %.% filter( 1:n() == 1)
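A side note beyond the original answer: the %.% operator shown here belongs to early dplyr versions; in current dplyr the same pipeline is usually written with %>%, and the arrange() step is optional, roughly:
library(dplyr)
A %>% group_by(Address) %>% summarize(NoOfTypes = n_distinct(Type))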
