I am implementing k-means. These are my main data structures:
dt1 is a data.table with {Filename, featureVector, GroupItBelongsTo}:
dt1<- data.table(Filename=files[1:limit],Vector=list(),G=-1)
setkey(dt1,Filename)
featureVector is a list. It associates words with their occurrence counts; I am adding the count for each word using this line:
featureVector[[item]] <- emaildt[email==item]$N
A typical excerpt from my console when I call dt1 is:
Filename Vector G
1: 000057219a473629b53d33cfedef590f.txt 1,1,1,1,1,1, 3
2: 00007afb5a5e467a39e517ae87e7fad5.txt 0,0,0,0,0,0, 3
3: 000946d248fdb1d5d05c59a91b00e8f2.txt 0,0,0,0,0,0, 3
4: 000bea8dc6f716a2cac6f25bdbe09073.txt 0,0,0,0,0,0, 3
I now want to compute new centroids for each group number. That is, I want to sum the vectors component-wise: all values at position [1] with each other, then all values at position [2], and so on until the end, and after that average them.
Example: for v1=[1,1,1] and v2=[2,2,2], I would expect the centroid to be c1=[1.5, 1.5, 1.5].
I tried sapply(dt1[tt]$Vector, mean) (and also with "sum"), but it aggregates row-wise (inside each vector), not column-wise (across each n-th component) as I would like.
How can I do this?
====Update, answering a question in comments====
> head(dt1)
Filename Vector G
1: 000057219a473629b53d33cfedef590f.txt 1,1,1,1,1,1, 1
2: 00007afb5a5e467a39e517ae87e7fad5.txt 0,0,0,0,0,0, 1
3: 000946d248fdb1d5d05c59a91b00e8f2.txt 0,0,0,0,0,0, 3
4: 000bea8dc6f716a2cac6f25bdbe09073.txt 0,0,0,0,0,0, 4
5: 000fcfac9e0a468a27b5e2ad0f78d842.txt 0,0,0,0,0,0, 1
6: 00166a4964d6c939f8f62280b85e706d.txt 0,0,0,1,0,0, 1
> class(dt1)
[1] "data.table" "data.frame"
>
Typing dt1$Vector gives (I only copied a small sample; it has many more words, but they all look the same):
[[1]]
homosexuality articles church people interest
1 1 1 1 1
thread email send warning worth
1 1 1 1 1
And here is the class() output:
> class(dt1$Vector)
[1] "list"
Here is what I typed (the screenshot of the result is not reproduced here):
A <- as.matrix(t(as.data.frame(dt1$Vector)))
Result of class(dt1$Vector[[1]]):
[1] "numeric"
First, the obligatory suggestion: you might consider using the built-in R function kmeans for your k-means clustering. If you prefer to roll your own, you can easily compute the centroids of a data.table as follows. To start, I'll build some random data that looks like yours:
> set.seed(123)
> dt<-data.table(name=LETTERS[1:20],replicate(5,sample(0:4,20,T)),G=sample(3,20,T))
> head(dt)
name V1 V2 V3 V4 V5 G
1: A 1 4 0 3 1 2
2: B 3 3 2 0 3 1
3: C 2 3 2 1 2 2
4: D 4 4 1 1 3 3
5: E 4 3 0 4 0 2
6: F 0 3 0 2 2 3
The centroids can be computed in one line:
> dt[,lapply(.SD[,-1],mean),by=G]
G V1 V2 V3 V4 V5
1: 2 2.375000 2.250000 1.25 2.125000 2.250000
2: 1 2.800000 2.400000 2.40 1.800000 1.400000
3: 3 1.714286 2.428571 1.00 2.142857 1.857143
If you're going to do this, you might want to drop the name column from the data table (temporarily), in which case you can just do:
> dt2<-copy(dt)
> dt2$name<-NULL
> dt2[,lapply(.SD,mean),by=G]
G V1 V2 V3 V4 V5
1: 2 2.375000 2.250000 1.25 2.125000 2.250000
2: 1 2.800000 2.400000 2.40 1.800000 1.400000
3: 3 1.714286 2.428571 1.00 2.142857 1.857143
Edit: a better way to do this, suggested by @Roland, is to use .SDcols:
dt[,lapply(.SD,mean),by=G,.SDcols=2:6]
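Back to the original list-column layout: a minimal sketch that works directly on dt1, assuming every entry of Vector is a numeric vector of the same length with the words in the same order, would be:
# element-wise sum of the group's vectors, divided by the group size .N
dt1[, as.list(Reduce(`+`, Vector) / .N), by = G]
Each (named) word position becomes one column of the result, and each row is the centroid of one group.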
I am trying to find all possible combinations of ids by group. I used the combn function and the data.table package, as the post "Generate All ID Pairs, by group with data.table in R" teaches.
This gives me the expected result.
dat1 <- data.table(ids=1:4, groups=c("B","A","B","A"))
dat1
ids groups
1: 1 B
2: 2 A
3: 3 B
4: 4 A
dat1[, as.data.table(t(combn(ids, 2))), .(groups)]
groups V1 V2
1: B 1 3
2: A 2 4
But this gives me a strange result. It's very weird; I have tried to understand this result for about 3 hours, but I can't. Isn't it a bug?
dat2 <- data.table(ids=1:4, groups=c("B","A","B","C"))
dat2
ids groups
1: 1 B
2: 2 A
3: 3 B
4: 4 C
dat2[, as.data.table(t(combn(ids, 2))), .(groups)]
groups V1 V2
1: B 1 3
2: A 1 2
3: C 1 2
4: C 1 3
5: C 1 4
6: C 2 3
7: C 2 4
8: C 3 4
I would really appreciate your help.
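A likely explanation, for what it's worth: this is documented behavior of combn, not a data.table bug. When its first argument x is a single positive integer, combn(x, m) is interpreted as combn(seq_len(x), m). In dat2, groups A and C each contain only one id, so combn receives a single integer:
combn(4, 2)  # pairs drawn from 1:4 -- the six rows shown for group C
combn(2, 2)  # pairs drawn from 1:2 -- the unexpected (1, 2) row for group A
A guard like dat2[, if (length(ids) >= 2) as.data.table(t(combn(ids, 2))), .(groups)] would skip the single-member groups, since groups where j evaluates to NULL are dropped.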
I have a very large data.table:
DT <- data.table(a=c(1,1,1,1,2,2,2,2,3,3,3,3),b=c(1,1,2,2),c=1:12)
And I need to collapse it by several variables, e.g. list(a,b). Easy:
DT[,sum(c),by=list(a,b)]
a b V1
1: 1 1 3
2: 1 2 7
3: 2 1 11
4: 2 2 15
5: 3 1 19
6: 3 2 23
However, I don't want to perform any operation on c; I just want to drop it:
DT[,,by=list(a,b)] # includes a,b,c, thus does not collapse
DT[,list(),by=list(a,b)] # zero rows
DT[,a,by=list(a,b)] # what I want but adds extraneous column a after 'by' columns
How can I specify X below to get the indicated result?
DT[,X,by=list(a,b)]
a b
1: 1 1
2: 1 2
3: 2 1
4: 2 2
5: 3 1
6: 3 2
unique.data.table has a by argument; you could then subset the result to get the columns you want, e.g.:
unique(DT, by = c('a', 'b'))[, c('a','b')]
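For the DT above, that should give the six unique (a, b) pairs:
unique(DT, by = c('a', 'b'))[, c('a', 'b')]
#    a b
# 1: 1 1
# 2: 1 2
# 3: 2 1
# 4: 2 2
# 5: 3 1
# 6: 3 2
An equivalent one-step form, since unique on a two-column table deduplicates on both columns, is unique(DT[, list(a, b)]).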
I'm looking for a Python-like dictionary structure in R to replace values in a large dataset (>100 MB), and I think the data.table package can help me do this. However, I cannot find an easy way to solve the problem.
For example, I have two data.table:
Table A:
V1 V2
1: A B
2: C D
3: C D
4: B C
5: D A
Table B:
V3 V4
1: A 1
2: B 2
3: C 3
4: D 4
I want to use B as a dictionary to replace the values in A. So the result I want to get is:
Table R:
V5 V6
1 2
3 4
3 4
2 3
4 1
What I did is:
c2=tB[tA[,list(V2)],list(V4)]
c1=tB[tA[,list(V1)],list(V4)]
Although I specified j=list(V4), it still returned the values of V3 along with V4. I don't know why.
c2:
V3 V4
1: B 2
2: D 4
3: D 4
4: C 3
5: A 1
c1:
V3 V4
1: A 1
2: C 3
3: C 3
4: B 2
5: D 4
Finally, I combined the two V4 columns and got the result I want.
But I think there should be a much easier way to do this. Any ideas?
Here's an alternative way:
setkey(B, V3)
# replace each column of A by reference with its looked-up V4 value
for (i in seq_along(A)) {
    thisA = A[[i]]
    set(A, j = i, value = B[thisA]$V4)
}
# V1 V2
# 1: 1 2
# 2: 3 4
# 3: 3 4
# 4: 2 3
# 5: 4 1
Since thisA is a character column, we don't need the J() (it's allowed just for convenience). Here, A's columns are replaced by reference, so this is also memory efficient. But if you don't want to replace A, then you can just use cA <- copy(A) and replace cA's columns.
Alternatively, using the := operator:
A[, names(A) := lapply(.SD, function(x) B[J(x)]$V4)]
# or
ans = copy(A)[, names(A) := lapply(.SD, function(x) B[J(x)]$V4)]
(Following user2923419's comment): You can drop the J() if the lookup is a single column of type character (just for convenience).
In data.table 1.9.3, when j is a single column name, it returns a vector (implemented based on user request). So it's a bit more natural data.table syntax:
setkey(B, V3)
for (i in seq_along(A)) {
    thisA = A[[i]]
    set(A, j = i, value = B[thisA, V4])
}
I am not sure how fast this is with big data, but chmatch is supposed to be fast.
tA[ , lapply(.SD,function(x) tB$V4[chmatch(x,tB$V3)])]
V1 V2
1: 1 2
2: 3 4
3: 3 4
4: 2 3
5: 4 1
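For reference, chmatch is documented as a faster character-only version of base match, so a base equivalent of the same idea (just slower on large character vectors) would be:
# base match instead of data.table's chmatch; same positions, same result
tA[, lapply(.SD, function(x) tB$V4[match(x, tB$V3)])]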
I have a similar question to R: data.table: searching on multiple columns AND setting data type, but that question did not get fully answered. I have a pairwise table that looks conceptually like the one below. The table is the result of converting a very large distance matrix into a data.table (> 100,000,000 rows), such that the comparison a,b is the same as b,a. However, a and b may appear in either column V1 or V2. I want to compute simple summary statistics using data.table's querying style, but I haven't quite figured out how to select keys in either column. Is this possible?
I've tried setting keys in either direction, but this returns just the data for that column. I also tried using list(), but that returns the intersection (understandably). I hoped for a by=key1|key2, but no such luck.
> set.seed(123)
>
> #create pairwise data
> a<-data.table(t(combn(3,2)))
> #create column that is equal both ways, 1*2 == 2*1
> dat<-a[,data:=V1*V2]
> dat
V1 V2 data
1: 1 2 2
2: 1 3 3
3: 2 3 6
#The id ==2 is the problem here, the mean should be 4 ((2+6)/2)
> #set keys
> setkey(dat,V1,V2)
>
> #One way data
> dat[,c("MEAN","VAR"):=list(mean(data),var(data)),by=V1]
> dat
V1 V2 data MEAN VAR
1: 1 2 2 2.5 0.5
2: 1 3 3 2.5 0.5
3: 2 3 6 6.0 NA
> #The other way
> dat[,c("MEAN","VAR"):=list(mean(data),var(data)),by=V2]
> dat
V1 V2 data MEAN VAR
1: 1 2 2 2.0 NA
2: 1 3 3 4.5 4.5
3: 2 3 6 4.5 4.5
>
> #The intersect just produces the original data
> dat[,c("MEAN","VAR"):=list(mean(data),var(data)),by=list(V1,V2)]
> dat
V1 V2 data MEAN VAR
1: 1 2 2 2 NA
2: 1 3 3 3 NA
3: 2 3 6 6 NA
>
> #Meaningless but hopeful attempt.
> dat[,c("MEAN","VAR"):=list(mean(data),var(data)),by=V1|V2]
> dat
V1 V2 data MEAN VAR
1: 1 2 2 3.666667 4.333333
2: 1 3 3 3.666667 4.333333
3: 2 3 6 3.666667 4.333333
#The goal is to create a table that would look like this (using mean as an example)
ID MEAN
1 2.5
2 4.0
3 4.5
My default idea would be to loop through a dat[V1==x|V2==x] statement, but I don't think I'm harnessing the full power of data.table to return a single column of ids with the mean and var as summary columns.
Thank you!
It'll be easiest to rearrange your data a little to achieve what you want (I'm using recycling of data below to avoid typing c(data, data) in the first part):
dat[, list(c(V1, V2), data)][, list(MEAN = mean(data)), by = V1]
# V1 MEAN
#1: 1 2.5
#2: 2 4.0
#3: 3 4.5
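If you also want the VAR column from your goal table, the same stacking trick extends naturally; here is a sketch that also names the stacked column ID to match the goal:
# stack V1 and V2 into one ID column (data recycles), then summarize per ID
dat[, list(ID = c(V1, V2), data)][, list(MEAN = mean(data), VAR = var(data)), by = ID]
#    ID MEAN VAR
# 1:  1  2.5 0.5
# 2:  2  4.0 8.0
# 3:  3  4.5 4.5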
I'm having trouble figuring out how to find the percentage correct for each different number in V3. V4 shows whether the answer was correct or not. V2 is the block number.
V2 V3 V4
1 4 1
1 10 1
1 4 0
1 4 1
1 10 0
2 8 1
2 8 0
Thank you for all your help. I'm new to R and have been googling this problem for hours!
Calling your data frame DF,
tapply(DF$V4 * 100, DF$V3, mean)
will give you the percentage correct for each unique number in V3.
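With the seven example rows above, the result should be a named vector like:
       4        8       10
66.66667 50.00000 50.00000
(the names are the unique V3 values; the entries are the percentages).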
Using data.table might be helpful here:
library(data.table)
mydt <- data.table(DF, key="V2")
mydt[, mean(V4), by=V3]
Results:
V3 V1
1: 4 0.6666667
2: 10 0.5000000
3: 8 0.5000000
Then if you want to clean it up aesthetically:
# you can format it nicely using round
mydt[, round(100*mean(V4),2), by=V3]
# V3 V1
# 1: 4 66.67
# 2: 10 50.00
# 3: 8 50.00
# you can give the new column a name (wrap it all in a list)
mydt[, list("Percent" = round(100*mean(V4),2)), by=V3]
# V3 Percent
# 1: 4 66.67
# 2: 10 50.00
# 3: 8 50.00
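Since the table was keyed on V2 (the block number), you can also break the percentage down per block by grouping on both columns, a sketch:
# percent correct per (block, item) combination
mydt[, list(Percent = round(100*mean(V4), 2)), by = list(V2, V3)]
#    V2 V3 Percent
# 1:  1  4   66.67
# 2:  1 10   50.00
# 3:  2  8   50.00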