How do you use data.table to select a subset of rows by rank? I have a large data set and I hope to do so efficiently.
> dt <- data.table(id = 1:200, category = sample(LETTERS, 200, replace = TRUE))
> dt[, count := length(id), by = category]
> dt
      id category count
  1:   1        O    13
  2:   2        O    13
 ---
199: 170        N     3
200: 171        H     3
What I want to do is to efficiently change the category to 'OTHER' for any category not in the k most common ones. Something along the lines of:
dt[rank > 5,category:="OTHER", by=category]
I'm new to data.table and I'm not quite sure how to get the rank in an efficient way. Here's a way that works, but seems clunky.
counts <- unique(dt$count)
decision <- max(counts[rank(-counts)>5])
dt[count<=decision, category:='OTHER']
I would appreciate any advice. To be honest, I don't even need the 'count' column if it's not necessary.
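For what it's worth, here is one possible data.table-native sketch (untested against this exact data; the cutoff of 5 and the handling of ties in the counts are assumptions): count the categories, keep the five most common, and relabel the rest.

# a possible sketch, not the only way: rank categories by size, relabel the rest
top_k <- dt[, .N, by = category][order(-N)][1:5, category]
dt[!category %in% top_k, category := "OTHER"]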
Related question:
I am given a large data.table, e.g.
n <- 7
dt <- data.table(id_1  = sample(1:10^(n-1), 10^n, replace = TRUE),
                 other = sample(letters[1:20], 10^n, replace = TRUE),
                 val   = rnorm(10^n, mean = 10^4, sd = 1000))
> structure(dt)
              id_1 other       val
        1:  914718     o  9623.078
        2:  695164     f 10323.943
        3:   53186     h 10930.825
        4:  496575     p  9964.064
        5:  474733     l 10759.779
       ---
  9999996:  650001     p  9653.125
  9999997:  225775     i  8945.636
  9999998:  372827     d  8947.095
  9999999:  268678     e  8371.433
 10000000:  730810     i 10150.311
and I would like to create a data.table that, for each value of the indicator id_1, has only one row, namely the one with the largest value in the column val.
The following code seems to work:
dt[, .SD[which.max(val)], by = .(id_1)]
However, it is very slow for large tables.
Is there a quicker way?
Technically this is a duplicate of this question, but the answer wasn't really explained, so here it goes:
dt[dt[, .(which_max = .I[val == max(val)]), by = "id_1"]$which_max]
The inner expression finds, for each group defined by id_1, the row index of the maximum value, and returns those indices so that they can be used to subset dt.
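To make the intermediate step concrete, here is a tiny illustration on a made-up table (mini is hypothetical, purely for demonstration):

mini <- data.table(id_1 = c(1, 1, 2, 2), val = c(5, 9, 3, 7))
mini[, .(which_max = .I[val == max(val)]), by = "id_1"]
#    id_1 which_max
# 1:    1         2
# 2:    2         4
mini[mini[, .(which_max = .I[val == max(val)]), by = "id_1"]$which_max]
#    id_1 val
# 1:    1   9
# 2:    2   7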
However, I'm kind of surprised I didn't find an answer suggesting this:
setkey(dt, id_1, val)[, .SD[.N], by = "id_1"]
which seems to be similarly fast on my machine, but it requires the rows to be sorted.
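For reference, a rough timing sketch comparing the two unkeyed approaches; it is not a definitive benchmark, and it assumes the 10^7-row dt from above plus the microbenchmark package:

library(microbenchmark)
microbenchmark(
  subset_SD = dt[, .SD[which.max(val)], by = id_1],
  row_index = dt[dt[, .(which_max = .I[val == max(val)]), by = "id_1"]$which_max],
  times = 3L
)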
I am not sure how to do it in R, but what I have done is read the file line by line and then put those lines into a data frame. This is very fast and happens in a flash for a 100 MB text file.
import pandas as pd

filename = "C:/Users/xyz/Downloads/123456789.012-01-433.txt"

sample = []  # collect the matching, split lines here
with open(filename, 'r') as f:
    for line in f:
        tag = line[:45].split('|')[5]  # a filter specific to my file; you may not need this
        if tag == 'KV-C901':
            sample.append(line.split('|'))  # keep the split fields of the matching line

print('lines collected; building a data frame from the list')
df = pd.DataFrame(sample)  # the list of split lines becomes the data frame
I want to split a data.table in R into groups based on a condition on the value of a row. I have searched SO extensively and can't find an efficient data.table way to do this (I'm not looking to loop over rows with a for loop).
I have data like this:
library(data.table)
dt1 <- data.table(x = 1:139,
                  t = c(rep(1:5, 10), 120928, rep(6:10, 9), 10400, rep(13:19, 6)))
I'd like to split into groups at the large numbers (anything over a settable threshold) and end up with the result below:
dt.desired <- data.table(x = 1:139,
                         t = c(rep(1:5, 10), 120928, rep(6:10, 9), 10400, rep(13:19, 6)),
                         group = c(rep(1, 50), rep(2, 46), rep(3, 43)))
Each row where t exceeds the threshold starts a new group, so a cumulative sum of that test gives the group index directly:
dt1[ , group := cumsum(t > 200) + 1]
dt1[t > 200]
#     x      t group
# 1: 51 120928     2
# 2: 97  10400     3
dt.desired[t > 200]
#     x      t group
# 1: 51 120928     2
# 2: 97  10400     3
You can use a test like t>100 to find the large values. You can then use cumsum() to get a running integer for each set of rows up to (but not including) the large number.
# assuming you can define "large" as >100
dt1[ , islarge := t>100]
dt1[ , group := shift(cumsum(islarge))]
I understand that you want the large number to be part of the group above it. To do this, use shift() and then fill in the first value (which will be NA after shift() is run).
# a little cleanup
# (fix first value and start group at 1 instead of 0)
dt1[1, group := 0]
dt1[ , group := group+1]
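As a quick check of where the boundary rows land under this interpretation (assuming dt1 and the columns built above, with the threshold of 100):

dt1[t > 100]
#     x      t islarge group
# 1: 51 120928    TRUE     1
# 2: 97  10400    TRUE     2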
DC <- data.table(l = c(0,0,1,4,5), d = c(1,2,0,0,1), y = c(0,1,0,1,7))
Hello, how can I get a count of a particular value in a column using data.table?
I tried the following:
DC[, lapply(.SD, function(x) length(which(DC==0)))]
But this returns the count of zeros across the entire dataset for every column, rather than counting within each column. So, how do I count by column?
Thanks
The question is not very well formulated, but I think @Sathish answered it perfectly in the comments.
Let's write it out here one more time: colSums(DC == 0) is one answer to the question.
All credit goes to @Sathish. Very helpful.
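For the example DC above, that returns one count per column; and for completeness, a small sketch of a data.table-native version of the attempted lapply() call, summing the logical test within each column:

colSums(DC == 0)
# l d y
# 2 2 2
DC[, lapply(.SD, function(x) sum(x == 0))]
#    l d y
# 1: 2 2 2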
If I understand your question, you'd like to form a frequency count of the values that comprise a given data table column. If that is true, let's say that you want to do so on column d of the data table you supplied:
> DC <- data.table(l=c(0,0,1,4,5), d=c(1,2,0,0,1), y=c(0,1,0,1,7))
> DC[, .N, by = d]
   d N
1: 1 2
2: 2 1
3: 0 2
Then, if you want the count of a particular value in d, you would do so by accessing the corresponding row of the above aggregation as follows:
> DC[, .N, by = d][d == 0, N]
[1] 2
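One small aside (not part of the original answer): if the value you ask about never occurs in the column, the subset of the aggregate is empty, so summing a logical test can be more convenient:

DC[, .N, by = d][d == 5, N]   # integer(0): 5 never occurs in d
DC[, sum(d == 5)]             # 0
DC[, sum(d == 0)]             # 2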
R's data.table package offers fast subsetting of values based on keys.
So, for example:
set.seed(1342)
df1 <- data.table(group = gl(10, 10, labels = letters[1:10]),
value = sample(1:100))
setkey(df1, group)
df1["a"]
will return all rows in df1 where group == "a".
What if I want all rows in df1 where group != "a"? Is there a concise syntax for that using data.table?
I think you answered your own question:
> nrow(df1[group != "a"])
[1] 90
> table(df1[group != "a", group])
 a  b  c  d  e  f  g  h  i  j
 0 10 10 10 10 10 10 10 10 10
Seems pretty concise to me?
EDIT FROM MATTHEW : As per comments this is a vector scan. There is a not join idiom here and here, and feature request #1384 to make it easier.
EDIT: feature request #1384 is implemented in data.table 1.8.3
df1[!'a']
# and to avoid the character-to-factor coercion warning in this example (where
# the key column happens to be a factor) :
df1[!J(factor('a'))]
I would just get all keys that are not "a":
df1[!(group %in% "a")]
Does this achieve what you want?
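Both forms should return the same 90 rows for this sample data; a quick check (assuming df1 keyed on group as above; the not-join may emit the character-to-factor coercion warning mentioned earlier):

nrow(df1[!'a'])                # not-join on the key (data.table 1.8.3+)
nrow(df1[!(group %in% "a")])   # vector scan, no key required
# [1] 90 in both cases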
I have a large data.table that I am collapsing to the month level using by.
There are 5 by vars, with # of levels: c(4,3,106,3,1380). The 106 is months, the 1380 is a geographic unit. As it turns out, some cells have no values, so by drops them, but I'd like it to keep them.
Reproducible example:
require(data.table)
set.seed(1)
n <- 1000
s <- function(n,l=5) sample(letters[seq(l)],n,replace=TRUE)
dat <- data.table( x=runif(n), g1=s(n), g2=s(n), g3=s(n,25) )
datCollapsed <- dat[ , list(nv=.N), by=list(g1,g2,g3) ]
datCollapsed[ , prod(dim(table(g1,g2,g3))) ] # how many there should be: 5*5*25=625
nrow(datCollapsed) # how many there are
Is there an efficient way to fill in these missing values with 0's, so that all permutations of the by vars are in the resultant collapsed data.table?
I'd also go with a cross-join, but would use it in the i-slot of the original call to [.data.table:
keycols <- c("g1", "g2", "g3") ## Grouping columns
setkeyv(dat, keycols) ## Set dat's key
ii <- do.call(CJ, sapply(dat[, ..keycols], unique)) ## CJ() to form index
datCollapsed <- dat[ii, list(nv=.N)] ## Aggregate
## Check that it worked
nrow(datCollapsed)
# [1] 625
table(datCollapsed$nv)
#   0   1   2   3   4   5   6
# 135 191 162  82  39  13   3
This approach is referred to as a "by-without-by" and, as documented in ?data.table, it is just as efficient and fast as passing the grouping instructions in via the by argument:
Advanced: Aggregation for a subset of known groups is particularly efficient when passing those groups in i. When i is a data.table, DT[i,j] evaluates j for each row of i. We call this by without by or grouping by i. Hence, the self join DT[data.table(unique(colA)),j] is identical to DT[,j,by=colA].
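(A side note, not from the original answer: in data.table 1.9.4 and later the implicit by-without-by was retired, so grouping by the rows of i has to be requested explicitly. A rough sketch of the equivalent call, assuming dat and ii from above:)

datCollapsed <- dat[ii, .(nv = .N), by = .EACHI]   # evaluate j for each row of ii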
Make a cross join (CJ) of the unique values, and use that to join back to your results:
dat.keys <- dat[,CJ(g1=unique(g1), g2=unique(g2), g3=unique(g3))]
setkey(datCollapsed, g1, g2, g3)
nrow(datCollapsed[dat.keys]) # effectively a left join of datCollapsed onto dat.keys
# [1] 625
Note that the missing values are NA right now, but you can easily change that to 0s if you want.
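A short sketch of that last step, turning the NAs into zeros (assumes datCollapsed and dat.keys from above):

res <- datCollapsed[dat.keys]   # left join of datCollapsed onto the full grid
res[is.na(nv), nv := 0L]        # missing combinations get a count of 0
nrow(res)                       # 625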