How can I choose the first 3 rows of a data table in data.table by group? - r

I currently have a dataset like:
ID RESULTS
1 M
1 A
1 M
1 C
1 B
2 Q
2 E
2 S
2 G
2 Z
......
From this, I would like to keep the first 3 rows, by group. Meaning, I'd like:
ID RESULTS
1 M
1 A
1 M
2 Q
2 E
2 S
I dug around in data.table, the closest I found was using something like mult or .I. Does anyone have a simple workaround? Thanks!

I would suggest a more concise way. You can find more detail in ?data.table or by running example(data.table).
DT = data.table(ID=rep(c(1,2),each=5),RESULTS=
c("M","A","M","C","B","Q","E","S","G","Z"))
> DT[,.SD[1:3],by=ID]
## ID RESULTS
## 1: 1 M
## 2: 1 A
## 3: 1 M
## 4: 2 Q
## 5: 2 E
## 6: 2 S
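One caveat with .SD[1:3]: if a group has fewer than three rows, indexing past the end pads the result with NA rows. A minimal sketch of two alternatives that avoid this, assuming a hypothetical third group (ID 3, with only two rows) added to the example data purely for illustration:

```r
library(data.table)

# Example data from the question, plus a hypothetical short group (ID = 3)
DT <- data.table(
  ID = rep(c(1, 2, 3), times = c(5, 5, 2)),
  RESULTS = c("M", "A", "M", "C", "B", "Q", "E", "S", "G", "Z", "X", "Y")
)

# head() returns at most 3 rows per group and never pads with NA
first3 <- DT[, head(.SD, 3), by = ID]

# Same result via row numbers: collect .I per group, then subset once
idx <- DT[, .I[seq_len(min(.N, 3L))], by = ID]$V1
first3_idx <- DT[idx]
```

The .I variant is often suggested for large tables because it avoids building .SD for every group, but whether it is faster will depend on the data.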

Related

Count occurrences of value in multiple columns with duplicates

My problem is very similar to:
R: Count occurrences of value in multiple columns
However, the solution proposed there doesn't work for me because a value may appear twice in the same row, and I want to count each row where it appears only once. I have worked out a solution but it seems too long:
> toy_data = data.table(from=c("A","A","A","C","E","E"), to=c("B","C","A","D","F","E"))
> toy_data
from to
1: A B
2: A C
3: A A
4: C D
5: E F
6: E E
> #get a table with intra-link count
> A = data.table(table(unlist(toy_data[from==to,from ])))
> A
V1 N
1: A 1
2: E 1
> #get a table with total count
> B = data.table(table(unlist(toy_data[,c(from,to)])))
> B
V1 N
1: A 4
2: B 1
3: C 2
4: D 1
5: E 3
6: F 1
>
> # concatenate changing sign
> table = rbind(B,A[,.(V1,-N)],use.names=FALSE)
> # groupby and subtract
> table[,sum(N),by=V1]
V1 V1
1: A 3
2: B 1
3: C 2
4: D 1
5: E 2
6: F 1
Is there some function that would do the job in fewer lines? I thought that in Python I'd concatenate from and to and then use match(), but I cannot find the right syntax.
EDIT: I know this would work: A=length(toy_data[from=="A"|to=="A",from]), but I would like to avoid looping over the various "A", "B", ... (and I don't know how to format the output in this way).
You can try the code below
> toy_data[, to := replace(to, from == to, NA)][, data.frame(table(unlist(.SD)))]
Var1 Freq
1 A 3
2 B 1
3 C 2
4 D 1
5 E 2
6 F 1
or
library(dplyr)
toy_data %>%
mutate(to = replace(to, from == to, NA)) %>%
unlist() %>%
table() %>%
as.data.frame()
which gives
. Freq
1 A 3
2 B 1
3 C 2
4 D 1
5 E 2
6 F 1
Using data.table
library(data.table)
toy_data[from == to, to := NA][, .(to = na.omit(c(from, to)))][, .N, to]
You could just subset the to vector:
data.table(table(unlist(toy_data[,c(from,to[to!=from])])))
V1 N
1: A 3
2: B 1
3: C 2
4: D 1
5: E 2
6: F 1
Using to:=NA as suggested by akrun, one can wrap the result in table(unlist()) and convert to data.table
data.table(table(unlist(toy_data[from==to, to:=NA, from])))
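Note that the answers using := modify toy_data by reference, so the self-links are replaced with NA in the original table. A minimal sketch of a variant that leaves toy_data untouched, assuming the column name node (chosen here only for illustration):

```r
library(data.table)

toy_data <- data.table(from = c("A", "A", "A", "C", "E", "E"),
                       to   = c("B", "C", "A", "D", "F", "E"))

# Stack 'from' with only those 'to' values that differ from 'from',
# so a row such as A -> A contributes a single "A", then count per value
counts <- toy_data[, .(node = c(from, to[to != from]))][, .N, by = node][order(node)]
counts
```

Because no := is used, toy_data still contains the A -> A and E -> E rows afterwards.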

Pool items in database until minimum sample size reached and find all permutations in R

This is an example.
df <- data.frame(item=letters[1:5], n=c(3,2,2,1,1))
df
item n
1 a 3
2 b 2
3 c 2
4 d 1
5 e 1
Items need to be grouped so that each group has a sample size of at least 4.
This would be the solution if you follow the sorting of df.
item n cluster
1 a 3 1
2 b 2 1
3 c 2 2
4 d 1 2
5 e 1 2
How to get all possible unique solutions?
Further, the code should also not allow any clusters to have a sample size less than 4.
Below, we have a brute force approach using the package partitions. The idea is that we find every partition of the rows of df. We then sum each group and check to see that the requirement has been met.
df <- data.frame(item=letters[1:5], n=c(3,2,2,1,1))
minSize <- 4
funGetClusters <- function(df, minSize) {
allParts <- partitions::listParts(nrow(df))
goodInd <- which(sapply(allParts, function(p) {
all(sapply(p, function(x) sum(df$n[x])) >= minSize)
}))
allParts[goodInd]
}
clusterBreakdown <- funGetClusters(df, minSize)
allDfs <- lapply(clusterBreakdown, function(p) {
copyDf <- df
copyDf$cluster <- 1L
clustInd <- 2L
for (i in p[-1]) {
copyDf$cluster[i] <- clustInd
# move to the next cluster id so that each group gets its own label
clustInd <- clustInd + 1L
}
copyDf
})
Here is the output:
allDfs
[[1]]
item n cluster
1 a 3 1
2 b 2 1
3 c 2 1
4 d 1 1
5 e 1 1
[[2]]
item n cluster
1 a 3 1
2 b 2 2
3 c 2 2
4 d 1 1
5 e 1 1
[[3]]
item n cluster
1 a 3 2
2 b 2 1
3 c 2 1
4 d 1 2
5 e 1 1
[[4]]
item n cluster
1 a 3 2
2 b 2 1
3 c 2 1
4 d 1 1
5 e 1 2
[[5]]
item n cluster
1 a 3 2
2 b 2 1
3 c 2 2
4 d 1 1
5 e 1 1
[[6]]
item n cluster
1 a 3 2
2 b 2 2
3 c 2 1
4 d 1 1
5 e 1 1
It should be noted that there is a combinatorial explosion as the number of rows increases. For example, with just 10 rows we would have to test 115975 different partitions.
As @chinsoon comments, RcppAlgos could be a good choice for an acceptable solution for larger cases. Disclaimer: I am the author. I have answered similar questions with much larger inputs and have had good success.
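The number of set partitions of n items is the Bell number B(n), which is where the 115975 figure for 10 rows comes from. A minimal base R sketch that computes it with the Bell triangle; the helper name bell_number is hypothetical and not part of the partitions package:

```r
# Hypothetical helper: Bell number B(n) for n >= 1 via the Bell triangle.
# Each row starts with the last entry of the previous row; every other
# entry is the sum of its left neighbour and the entry above that neighbour.
bell_number <- function(n) {
  row <- 1
  if (n <= 1) return(1)
  for (i in 2:n) {
    new_row <- numeric(i)
    new_row[1] <- row[length(row)]
    for (j in 2:i) {
      new_row[j] <- new_row[j - 1] + row[j - 1]
    }
    row <- new_row
  }
  row[length(row)]
}

bell_number(5)   # 52 partitions for the 5-row example
bell_number(10)  # 115975
```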
Allocating tasks to parallel workers so that expected cost is roughly equal
Split a set into n unequal subsets with the key deciding factor being that the elements in the subset aggregate and equal a predetermined amount?
@AllanCameron also has a great answer and a nice methodology for attacking this problem. You should give that a read as well.
Lastly, the following vignette by Robin K. S. Hankin (author of the partitions package) and Luke J. West is not only a great read, but very applicable to problems like the one presented here.
Set Partitions in R

How to transpose when the value is text and the new column is a number

I have the following table
id mycol counter
1 a 1
1 b 2
2 c 1
2 c 2
2 e 3
And this is what I need
ID 1 2 3
1 a b done
2 c c done
I tried to use the dcast function
mydata<-dcast(mydata, id~mycol, counter, value = 'mycol')
but it's not working. Any ideas?
It appears from your question that you're trying to do something like a reshaping from long to wide format. Here's how you can use base R reshape() to do this:
mydata <- data.frame(id=c(1L,1L,2L,2L,2L),mycol=c('a','b','c','c','e'),counter=c(1L,2L,1L,2L,3L),stringsAsFactors=F);
reshape(mydata,dir='w',idvar='id',timevar='counter');
## id mycol.1 mycol.2 mycol.3
## 1 1 a b <NA>
## 3 2 c c e
reshape() does not support such precise control over the resulting column names. You can fix them up yourself afterward. Assuming you saved the above result as res, you can do this:
colnames(res) <- sub(perl=T,'^mycol\\.','',colnames(res));
res;
## id 1 2 3
## 1 1 a b <NA>
## 3 2 c c e
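Since the question attempted dcast, here is a minimal sketch of how that call could look with data.table's dcast, assuming the data.table package is acceptable. The formula puts the row identifier on the left and the column that supplies the new column names on the right, and value.var names the column that fills the cells:

```r
library(data.table)

mydata <- data.table(id      = c(1L, 1L, 2L, 2L, 2L),
                     mycol   = c("a", "b", "c", "c", "e"),
                     counter = c(1L, 2L, 1L, 2L, 3L))

# One row per id, one column per counter value, cells filled from mycol
wide <- dcast(mydata, id ~ counter, value.var = "mycol")
wide
```

With this approach the new columns are named after the counter values directly, so no renaming step is needed; missing combinations are filled with NA.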

Formatting the output in R

I have a set of data which shows the visit ID and the subject name
visit<-c(1,2,3,1,2,1,1,2,3,1,2,3)
subject<-c("A","A","A","B","B","C","D","D","D","E","E","E")
data<-data.frame(visit=visit,subject=subject)
I attempted to work out the latest visit ID for each subject:
tapply(visit,subject,max)
And I get this output:
A B C D E
3 2 1 3 3
I am wondering if there is any way that I can change the output such that it becomes:
A 3
B 2
C 1
D 3
E 3
Thank you
You can try aggregate
aggregate(visit~subject, data, max)
# subject visit
#1 A 3
#2 B 2
#3 C 1
#4 D 3
#5 E 3
Or from tapply
res <- tapply(visit,subject,max)
data.frame(subject=names(res), visit=res)
Or data.table
library(data.table)
setDT(data)[, list(visit=max(visit)), by=subject]
And a dplyr solution would be:
library(dplyr)
data %>% group_by(subject) %>% summarize(max = max(visit))
## Source: local data frame [5 x 2]
## subject max
## 1 A 3
## 2 B 2
## 3 C 1
## 4 D 3
## 5 E 3
It may feel dirty, but using the base function as.matrix (or matrix for that matter) will give you what you need.
> as.matrix(tapply(visit,subject,max))
[,1]
A 3
B 2
C 1
D 3
E 3
You can easily do this in base R with stack:
stack(tapply(visit, subject, max))
# values ind
# 1 3 A
# 2 2 B
# 3 1 C
# 4 3 D
# 5 3 E
(Note: In this case, the values for "visit" and "subject" aren't actually coming from your data.frame. Just thought you should know!)
(Second note: You could also do data.frame(as.table(tapply(visit, subject, max))) but that is more deceptive than using stack so may lead to less readable code later on.)

data preparation part II

There's another problem I encountered which is (I think) quite interesting:
dt <- data.table(K=c("A","A","A","B","B","B"),A=c(2,3,4,1,3,4),B=c(3,3,3,1,1,1))
dt
K A B
1: A 2 3
2: A 3 3
3: A 4 3
4: B 1 1
5: B 3 1
6: B 4 1
Now I want a somewhat "higher" level of the data. For each letter in K, there should only be one line and "A_sum" should include the length of A where B has the same value. So there are three values for B=3 and three values for B=1.
Resulting data.table:
dt_new
K A_sum B
1: A 3 3
2: B 3 1
It's not clear how you want to treat K, but here's one option:
dt_new <- dt[, list(A_sum = length(A)), by = list(K, B)]
# K B A_sum
# 1: A 3 3
# 2: B 1 3
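An equivalent sketch using data.table's built-in .N, which holds the number of rows in each group, so no column needs to be passed to length(). The setcolorder call is an optional extra, added here only to match the column order shown in the question:

```r
library(data.table)

dt <- data.table(K = c("A", "A", "A", "B", "B", "B"),
                 A = c(2, 3, 4, 1, 3, 4),
                 B = c(3, 3, 3, 1, 1, 1))

# .N is the row count of the current (K, B) group
dt_new <- dt[, .(A_sum = .N), by = .(K, B)]

# Optional: reorder columns to K, A_sum, B as in the desired output
setcolorder(dt_new, c("K", "A_sum", "B"))
dt_new
```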
