I have the following data table:
id user V1 V2 V3 V4
1: 1 1 1 1 1 0
2: 1 2 4 1 3 1
3: 1 3 0 1 6 0
4: 2 1 1 0 2 1
5: 2 2 2 1 0 0
and I perform a grouped-by-id lapply calculation:
my_data[,lapply(.SD,mean)*.SD,by=id,.SDcols=3:5]
The result is the following:
id V1 V2 V3
1: 1 1.666667 1.0 3.333333
2: 1 6.666667 1.0 10.000000
3: 1 0.000000 1.0 20.000000
4: 2 1.500000 0.0 2.000000
5: 2 3.000000 0.5 0.000000
Is there an easy data.table way to include the column user from the original data.table?
I have managed to do it with
cbind(my_data[,.(user)], my_data[,lapply(.SD,mean)*.SD,by=id,.SDcols=3:5])
but I really hope there is a nicer way.
I suggest you go through the vignettes. The Introduction to data.table vignette explains an important point, which I'll repeat here:
As long as j returns a list, each element of the list will become a column in the resulting data.table.
In base R, c() on two lists returns a new list with all the elements of both. We can simply use that existing functionality to do:
require(data.table) # v1.9.7 devel
my_data[, c(list(user=user), lapply(.SD, function(x) x*mean(x))), by=id, .SDcols=V1:V3]
I'm on the current development version of data.table, v1.9.7, which has certain new features, e.g., the use of column ranges like V1:V3 in .SDcols.
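For completeness, here is a minimal self-contained sketch, reconstructing my_data from the question (and using numeric .SDcols so it also runs on released versions):
library(data.table)
my_data <- data.table(id   = c(1, 1, 1, 2, 2),
                      user = c(1, 2, 3, 1, 2),
                      V1   = c(1, 4, 0, 1, 2),
                      V2   = c(1, 1, 1, 0, 1),
                      V3   = c(1, 3, 6, 2, 0),
                      V4   = c(0, 1, 0, 1, 0))
my_data[, c(list(user = user), lapply(.SD, function(x) x * mean(x))),
        by = id, .SDcols = 3:5]
# same as the desired result above, with the user column included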
We can also do the assignment by reference:
my_data[,(3:5) := lapply(.SD,mean)*.SD,by=id,.SDcols=3:5]
Or, instead of multiplying by .SD, we can do the multiplication inside the lapply() call itself:
my_data[, (3:5) := lapply(.SD, function(x) mean(x)*x), .SDcols = 3:5, by = id]
I'd like to assign values only to the first row of each group in a data.table.
For example (simplified): my data.table is DT with the following content:
x v
1 1
2 2
2 3
3 4
3 5
3 6
The key of DT is x.
I want to address the first row of each group. This works fine:
DT[, .SD[1], by=x]
x v
1 1
2 2
3 4
Now, I want to set only those values of v to 0.
But none of these work:
DT[, .SD[1], by=x]$v <- 0
DT[, .SD[1], by=x]$v := 0
DT[, .SD[1], by=x, v:=0]
I searched the package's R help and all the links provided there, but I just can't get it to work.
I found notes saying this would not work, but no examples/solutions that helped me out.
I'd be very glad for any suggestions.
(I like this package very much and I don't want to go back to a data.frame... where I did get this working.)
edit:
I'd like to have a result like this:
x v
1 0
2 0
2 3
3 0
3 5
3 6
This is not working:
DT[, .SD[1], by=x] <- DT[, .SD[1], by=x][, v:=0]
Another option would be:
DT[, v := {v[1] <- 0L; v}, by = x]
DT
# x v
#1: 1 0
#2: 2 0
#3: 2 3
#4: 3 0
#5: 3 5
#6: 3 6
Or
DT[DT[, .I[1], by=x]$V1, v:=0]
DT
# x v
#1: 1 0
#2: 2 0
#3: 2 3
#4: 3 0
#5: 3 5
#6: 3 6
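In case the .I idiom is unfamiliar: inside j, .I holds the original row numbers of the rows in each group, so DT[, .I[1], by=x]$V1 collects the global index of each group's first row, and the outer DT[..., v := 0] then updates only those rows by reference. A minimal self-contained sketch, reconstructing DT from the question:
library(data.table)
DT <- data.table(x = c(1, 2, 2, 3, 3, 3), v = 1:6)
setkey(DT, x)
first <- DT[, .I[1], by = x]$V1  # global row number of each group's first row: 1 2 4
DT[first, v := 0L]               # update by reference, only those rows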
With a little help from Roland's solution, it looks like you could do the following. It simply concatenates a zero with all the grouped values of v except the first.
DT[, v := c(0L, v[-1]), by = x] ## must have the "L" after 0, as 0L
which results in
DT
# x v
# 1: 1 0
# 2: 2 0
# 3: 2 3
# 4: 3 0
# 5: 3 5
# 6: 3 6
Note: the j expression could also be written as v := c(integer(1), v[-1]).
I'm looking for a Python-like dictionary structure in R to replace values in a large dataset (>100 MB), and I think the data.table package can help me do this. However, I cannot find an easy way to solve the problem.
For example, I have two data.table:
Table A:
V1 V2
1: A B
2: C D
3: C D
4: B C
5: D A
Table B:
V3 V4
1: A 1
2: B 2
3: C 3
4: D 4
I want to use B as a dictionary to replace the values in A. So the result I want to get is:
Table R:
V5 V6
1 2
3 4
3 4
2 3
4 1
What I did is:
c2=tB[tA[,list(V2)],list(V4)]
c1=tB[tA[,list(V1)],list(V4)]
Although I specified j=list(V4), it still returned the V3 column as well. I don't know why.
c2:
V3 V4
1: B 2
2: D 4
3: D 4
4: C 3
5: A 1
c1:
V3 V4
1: A 1
2: C 3
3: C 3
4: B 2
5: D 4
Finally, I combined the two V4 columns and got the result I want.
But I think there should be a much easier way to do this. Any ideas?
Here's an alternative way:
setkey(B, V3)
for (i in seq_len(length(A))) {
thisA = A[[i]]
set(A, j=i, value=B[thisA]$V4)
}
# V1 V2
# 1: 1 2
# 2: 3 4
# 3: 3 4
# 4: 2 3
# 5: 4 1
Since thisA is a character vector, we don't need the J() (it's there just for convenience). Here, A's columns are replaced by reference, which is therefore also memory efficient. But if you don't want to replace A, then you can just use cA <- copy(A) and replace cA's columns.
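Here is that loop as a self-contained sketch, reconstructing A and B from the question's tables:
library(data.table)
A <- data.table(V1 = c("A", "C", "C", "B", "D"),
                V2 = c("B", "D", "D", "C", "A"))
B <- data.table(V3 = c("A", "B", "C", "D"), V4 = 1:4)
setkey(B, V3)
for (i in seq_along(A)) {
  # keyed character lookup into B; column i of A is replaced by reference
  set(A, j = i, value = B[A[[i]]]$V4)
}
A
#    V1 V2
# 1:  1  2
# 2:  3  4
# 3:  3  4
# 4:  2  3
# 5:  4  1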
Alternatively, using :=:
A[, names(A) := lapply(.SD, function(x) B[J(x)]$V4)]
# or
ans = copy(A)[, names(A) := lapply(.SD, function(x) B[J(x)]$V4)]
(Following user2923419's comment): You can drop the J() if the lookup is a single column of type character (just for convenience).
In 1.9.3, when j is a single column, it returns a vector (based on user request). So, it's a bit more natural data.table syntax:
setkey(B, V3)
for (i in seq_len(length(A))) {
thisA = A[[i]]
set(A, j=i, value=B[thisA, V4])
}
I am not sure how fast this is with big data, but chmatch is supposed to be fast.
tA[ , lapply(.SD,function(x) tB$V4[chmatch(x,tB$V3)])]
V1 V2
1: 1 2
2: 3 4
3: 3 4
4: 2 3
5: 4 1
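For reference, chmatch(x, table) is data.table's character-optimized version of base match(), so the expression above is just a vectorized dictionary lookup:
chmatch(c("B", "D", "A"), c("A", "B", "C", "D"))
# [1] 2 4 1
# so tB$V4[chmatch(x, tB$V3)] picks the V4 value for each key in x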
I have a similar question to R: data.table: searching on multiple columns AND setting data type, but that question did not get fully answered. I have a pairwise table that looks conceptually like the one below. The table is the result of converting a very large distance matrix into a data.table (>100,000,000 rows), such that the comparison (a,b) is the same as (b,a). However, a and b may each appear in either column V1 or V2. I want to compute simple summary statistics using data.table's querying style, but I haven't quite figured out how to select keys in either column. Is this possible?
I've tried setting keys in either direction, but this returns just the data for that column. I also tried using list(), but that returns the intersection (understandably); I hoped for a by=key1|key2, but no such luck.
> set.seed(123)
>
> #create pairwise data
> a<-data.table(t(combn(3,2)))
> #create column that is equal both ways, 1*2 == 2*1
> dat<-a[,data:=V1*V2]
> dat
V1 V2 data
1: 1 2 2
2: 1 3 3
3: 2 3 6
#The id == 2 is the problem here; the mean should be 4 ((2+6)/2)
> #set keys
> setkey(dat,V1,V2)
>
> #One way data
> dat[,c("MEAN","VAR"):=list(mean(data),var(data)),by=V1]
> dat
V1 V2 data MEAN VAR
1: 1 2 2 2.5 0.5
2: 1 3 3 2.5 0.5
3: 2 3 6 6.0 NA
> #The other way
> dat[,c("MEAN","VAR"):=list(mean(data),var(data)),by=V2]
> dat
V1 V2 data MEAN VAR
1: 1 2 2 2.0 NA
2: 1 3 3 4.5 4.5
3: 2 3 6 4.5 4.5
>
> #The intersect just produces the original data
> dat[,c("MEAN","VAR"):=list(mean(data),var(data)),by=list(V1,V2)]
> dat
V1 V2 data MEAN VAR
1: 1 2 2 2 NA
2: 1 3 3 3 NA
3: 2 3 6 6 NA
>
> #Meaningless but hopeful attempt.
> dat[,c("MEAN","VAR"):=list(mean(data),var(data)),by=V1|V2]
> dat
V1 V2 data MEAN VAR
1: 1 2 2 3.666667 4.333333
2: 1 3 3 3.666667 4.333333
3: 2 3 6 3.666667 4.333333
#The goal is to create a table that would look like this (using mean as an example)
ID MEAN
1 2.5
2 4.0
3 4.5
My default idea would be to loop through a dat[V1==x|V2==x] statement, but I don't think I'm harnessing the full power of data.table to return a single column of ids with the mean and variance as summary columns.
Thank you!
It'll be easiest to rearrange your data a little to achieve what you want (I'm using recycling of data below to avoid typing c(data, data) in the first part):
dat[, list(c(V1, V2), data)][, list(MEAN = mean(data)), by = V1]
# V1 MEAN
#1: 1 2.5
#2: 2 4.0
#3: 3 4.5
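For completeness, a self-contained sketch of the same idea, reconstructing dat from the question and adding the variance the question asked for (the ID column name is illustrative):
library(data.table)
dat <- data.table(V1 = c(1, 1, 2), V2 = c(2, 3, 3), data = c(2, 3, 6))
# stack V1 and V2 into one id column; 'data' is recycled to the doubled length,
# so each pair contributes its value to both of its ids
dat[, list(ID = c(V1, V2), data)][, list(MEAN = mean(data), VAR = var(data)), by = ID]
#    ID MEAN VAR
# 1:  1  2.5 0.5
# 2:  2  4.0 8.0
# 3:  3  4.5 4.5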
I am implementing k-means. These are my main data structures:
dt1 is a data.table with {Filename, featureVector, GroupItBelongsTo}:
dt1<- data.table(Filename=files[1:limit],Vector=list(),G=-1)
setkey(dt1,Filename)
featureVector is a list. It associates words with occurrence counts; I am adding the count for each word using this line:
featureVector[[item]] <- emaildt[email==item]$N
A typical excerpt from my console when I call dt1 is:
Filename Vector G
1: 000057219a473629b53d33cfedef590f.txt 1,1,1,1,1,1, 3
2: 00007afb5a5e467a39e517ae87e7fad5.txt 0,0,0,0,0,0, 3
3: 000946d248fdb1d5d05c59a91b00e8f2.txt 0,0,0,0,0,0, 3
4: 000bea8dc6f716a2cac6f25bdbe09073.txt 0,0,0,0,0,0, 3
I now want to compute new centroids for each group number, meaning I want to sum the vectors element-wise (all values at position 1 with each other, all values at position 2, and so on until the end), and after that average them.
Example: for v1 = [1,1,1] and v2 = [2,2,2], I would expect the centroid to be c1 = [1.5, 1.5, 1.5].
I tried sapply(dt1[tt]$Vector, mean) (and also with sum), but it sums and "means" row-wise (inside each vector), not column-wise (over each n-th component) as I would like it to.
How to do it?
====Update, answering a question in comments====
> head(dt1)
Filename Vector G
1: 000057219a473629b53d33cfedef590f.txt 1,1,1,1,1,1, 1
2: 00007afb5a5e467a39e517ae87e7fad5.txt 0,0,0,0,0,0, 1
3: 000946d248fdb1d5d05c59a91b00e8f2.txt 0,0,0,0,0,0, 3
4: 000bea8dc6f716a2cac6f25bdbe09073.txt 0,0,0,0,0,0, 4
5: 000fcfac9e0a468a27b5e2ad0f78d842.txt 0,0,0,0,0,0, 1
6: 00166a4964d6c939f8f62280b85e706d.txt 0,0,0,1,0,0, 1
> class(dt1)
[1] "data.table" "data.frame"
>
Typing dt1$Vector gives (I only copied a small sample; it has many more words, but they all look the same):
[[1]]
homosexuality articles church people interest
1 1 1 1 1
thread email send warning worth
1 1 1 1 1
And here is the class() output
> class(dt1$Vector)
[1] "list"
The output of the following was posted as a screenshot (not reproduced here):
A <- as.matrix(t(as.data.frame(dt1$Vector)))
Result of class(dt1$Vector[[1]]):
[1] "numeric"
First, the obligatory note: you might consider using the R function kmeans to do your k-means clustering. If you prefer to roll your own, you can easily compute centroids of a data.table as follows. First, I'll build some random data that looks like yours:
> set.seed(123)
> dt<-data.table(name=LETTERS[1:20],replicate(5,sample(0:4,20,T)),G=sample(3,20,T))
> head(dt)
name V1 V2 V3 V4 V5 G
1: A 1 4 0 3 1 2
2: B 3 3 2 0 3 1
3: C 2 3 2 1 2 2
4: D 4 4 1 1 3 3
5: E 4 3 0 4 0 2
6: F 0 3 0 2 2 3
The centroids can be computed in one line:
> dt[,lapply(.SD[,-1],mean),by=G]
G V1 V2 V3 V4 V5
1: 2 2.375000 2.250000 1.25 2.125000 2.250000
2: 1 2.800000 2.400000 2.40 1.800000 1.400000
3: 3 1.714286 2.428571 1.00 2.142857 1.857143
If you're going to do this, you might want to drop the names from the data.table (temporarily), in which case you can just do:
> dt2<-copy(dt)
> dt2$name<-NULL
> dt2[,lapply(.SD,mean),by=G]
G V1 V2 V3 V4 V5
1: 2 2.375000 2.250000 1.25 2.125000 2.250000
2: 1 2.800000 2.400000 2.40 1.800000 1.400000
3: 3 1.714286 2.428571 1.00 2.142857 1.857143
Edit: a better way to do this, suggested by @Roland, is to use .SDcols:
dt[,lapply(.SD,mean),by=G,.SDcols=2:6]
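If you'd rather keep the list column from the question, a hedged sketch (assuming every Vector has the same length and word order; the file names are made up) first widens the list into ordinary numeric columns, then groups:
library(data.table)
dt1 <- data.table(Filename = c("a.txt", "b.txt", "c.txt"),
                  Vector   = list(c(1, 1, 1), c(2, 2, 2), c(0, 4, 2)),
                  G        = c(1L, 1L, 2L))
# bind the vectors into a matrix (one row per file), then convert to a data.table
wide <- as.data.table(do.call(rbind, dt1$Vector))
wide[, G := dt1$G]
# the centroids are then just per-group column means
wide[, lapply(.SD, mean), by = G]
#    G  V1  V2  V3
# 1: 1 1.5 1.5 1.5
# 2: 2 0.0 4.0 2.0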
I have recently been working with much larger datasets and have started learning and migrating to data.table to improve performance of aggregation/grouping. I have been unable to get certain expressions or functions to group as expected. Here is an example of a basic group-by operation that I am having trouble with:
library(data.table)
category <- rep(1:10, 10)
value <- rnorm(100)
df <- data.frame(category, value)
dt <- data.table(df)
If I want to simply calculate the mean for each category group, this works easily enough:
dt[,mean(value),by="category"]
category V1
1: 1 -0.67555478
2: 2 -0.50438413
3: 3 0.29093723
4: 4 -0.41684790
5: 5 0.33921764
6: 6 0.01970997
7: 7 -0.23684245
8: 8 -0.04280998
9: 9 0.01838804
10: 10 0.44295978
I run into problems if I try to use the scale function, or even a simple expression subtracting the group mean from each value. The grouping seems to be ignored and I get the function/expression applied to each row instead. The following return all 100 rows instead of 10 rows, one per category:
dt[,scale(value),by="category"]
dt[,value-mean(value),by="category"]
I thought recreating scale as a function that returns a numeric vector instead of a matrix might help:
zScore <- function(x) {
z=(x-mean(x,na.rm=TRUE))/sd(x,na.rm = TRUE)
return(z)
}
dt[,zScore(value),by="category"]
category V1
1: 1 -1.45114132
2: 1 -0.35304528
3: 1 -0.94075418
4: 1 1.44454416
5: 1 1.39448268
6: 1 0.55366652
....
97: 10 -0.43190602
98: 10 -0.25409244
99: 10 0.35496694
100: 10 0.57323480
category V1
This also returns the zScore function applied to all rows (N=100), ignoring the grouping. What am I missing in order to get scale() or a custom function to use the grouping like mean() did above?
You've clarified in the comments that you'd like the same behaviour as:
ddply(df,"category",transform, zscorebycategory=zScore(value))
which gives:
category value zscorebycategory
1 1 0.28860691 0.31565682
2 1 1.17473759 1.33282374
3 1 0.06395503 0.05778463
4 1 1.37825487 1.56643607
etc
The data table option you gave gives:
category V1
1: 1 0.31565682
2: 1 1.33282374
3: 1 0.05778463
4: 1 1.56643607
etc
Which is exactly the same data. However, you'd like to also repeat the value column in your result, and rename the V1 variable to something more descriptive. data.table gives you the grouping variable in the result, along with the result of the expression you provide. So let's modify that to give the columns you'd like:
Your
dt[,zScore(value),by="category"]
becomes:
dt[,list(value=value, zscorebycategory=zScore(value)),by="category"]
Where the named items in the list become columns in the result.
plyr = data.table(ddply(df,"category",transform, zscorebycategory=zScore(value)))
dt = dt[,list(value=value, zscorebycategory=zScore(value)),by="category"]
identical(plyr, dt)
[1] TRUE
(note I converted your ddply data.frame result into a data.table, to allow the identical command to work).
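If instead you want to keep all of dt's original columns and simply add the z-score, a common alternative (sketched here with := assignment by reference, applied to the original dt from the question) is:
dt[, zscorebycategory := zScore(value), by = category]
which adds the new column in place, without copying the table.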
Your claim that data.table does not group is wrong:
library(data.table)
category <- rep(1:2, each=4)
value <- c(rep(c(1:2),each=2),rep(c(4,10),each=2))
dt <- data.table(category, value)
category value
1: 1 1
2: 1 1
3: 1 2
4: 1 2
5: 2 4
6: 2 4
7: 2 10
8: 2 10
dt[,value-mean(value),by=category]
category V1
1: 1 -0.5
2: 1 -0.5
3: 1 0.5
4: 1 0.5
5: 2 -3.0
6: 2 -3.0
7: 2 3.0
8: 2 3.0
If you want to scale/transform, this is exactly the behavior you want, because these operations by definition return an object of the same size as their input.
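As a side note on scale() itself: it returns a one-column matrix. If you'd rather end up with a plain numeric column while still grouping, a small hedged sketch is to strip the matrix attributes with as.vector():
dt[, scaled := as.vector(scale(value)), by = category]  # plain numeric z-scores per group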