dcast long to wide by transforming unique RHS - r

Let's say I have a table like the following:
DT <- data.table(ID1= rep(c("a","b","c"),3),ID2=rnorm(9,4),var = 1:9)
## > DT
## ID1 ID2 var
## 1: a 2.630392 1
## 2: b 3.966620 2
## 3: c 4.002776 3
## 4: a 3.188372 4
## 5: b 4.735084 5
## 6: c 4.307198 6
## 7: a 2.830868 7
## 8: b 4.892684 8
## 9: c 3.429826 9
and I would like to perform a dcast with the by taking into consideration only the number of times that each ID1 apprear.
undesired output:
dcast(DT,ID1~ID2)
desired output:
## ID1 1 2 3
## 1: a 1 4 7
## 2: b 2 5 8
## 3: c 3 6 9

Try
dcast.data.table(DT[,N:=1:.N ,ID1], ID1~N, value.var='var')
# ID1 1 2 3
#1: a 1 4 7
#2: b 2 5 8
#3: c 3 6 9

Related

Is there some way to keep variable names from.SD+.SDcols together with non .SD variable names in data.table?

Given a data.table
library(data.table)
DT = data.table(x=rep(c("b","a","c"),each=3), v=c(1,1,1,2,2,1,1,2,2), y=c(1,3,6), a=1:9, b=9:1)
DT
x v y a b
1: b 1 1 1 9
2: b 1 3 2 8
3: b 1 6 3 7
4: a 2 1 4 6
5: a 2 3 5 5
6: a 1 6 6 4
7: c 1 1 7 3
8: c 2 3 8 2
9: c 2 6 9 1
if one does
DT[, .(a, .SD), .SDcols=x:y]
a .SD.x .SD.v .SD.y
1: 1 b 1 1
2: 2 b 1 3
3: 3 b 1 6
4: 4 a 2 1
5: 5 a 2 3
6: 6 a 1 6
7: 7 c 1 1
8: 8 c 2 3
9: 9 c 2 6
the variables from .SDcols become prefixed by .SD. On the other hand, if one tries, as in https://stackoverflow.com/a/62282856/997979,
DT[, c(.(a), .SD), .SDcols=x:y]
V1 x v y
1: 1 b 1 1
2: 2 b 1 3
3: 3 b 1 6
4: 4 a 2 1
5: 5 a 2 3
6: 6 a 1 6
7: 7 c 1 1
8: 8 c 2 3
9: 9 c 2 6
the other variable name (a) become lost. (It is due to this reason that I re-ask the question which I initially marked as a duplicate to that linked above).
Is there some way to keep the names from both .SD variables and non .SD variables?
The goal is simultaneously being able to use .() to select variables without quotes and being able to select variables through .SDcols = patterns("...")
Thanks in advance!
not really sure why.. but it works ;-)
DT[, .(a, (.SD)), .SDcols=x:y]
# a x v y
# 1: 1 b 1 1
# 2: 2 b 1 3
# 3: 3 b 1 6
# 4: 4 a 2 1
# 5: 5 a 2 3
# 6: 6 a 1 6
# 7: 7 c 1 1
# 8: 8 c 2 3
# 9: 9 c 2 6

Number of copies (duplicates) in R data.table

I want to add a column to a data.table which shows how many copies of each row exist. Take the following example:
library(data.table)
DT <- data.table(id = 1:10, colA = c(1,1,2,3,4,5,6,7,7,7), colB = c(1,1,2,3,4,5,6,7,8,8))
setkey(DT, colA, colB)
DT[, copies := length(colA), by = .(colA, colB)]
The output it gives is
id colA colB copies
1: 1 1 1 1
2: 2 1 1 1
3: 3 2 2 1
4: 4 3 3 1
5: 5 4 4 1
6: 6 5 5 1
7: 7 6 6 1
8: 8 7 7 1
9: 9 7 8 1
10: 10 7 8 1
Desired output is:
id colA colB copies
1: 1 1 1 2
2: 2 1 1 2
3: 3 2 2 1
4: 4 3 3 1
5: 5 4 4 1
6: 6 5 5 1
7: 7 6 6 1
8: 8 7 7 1
9: 9 7 8 2
10: 10 7 8 2
How should I do it?
I also want to know why my approach doesn't. work. Isn't it true that when you group by colA and colB, the first group should contain two rows of data? I understand if "length" is not the function to use, but I cannot think of any other function to use. I thought of "nrow" but what can I pass to it?
DT[, copies := .N, by=.(colA,colB)]
# id colA colB copies
# 1: 1 1 1 2
# 2: 2 1 1 2
# 3: 3 2 2 1
# 4: 4 3 3 1
# 5: 5 4 4 1
# 6: 6 5 5 1
# 7: 7 6 6 1
# 8: 8 7 7 1
# 9: 9 7 8 2
# 10: 10 7 8 2
As mentioned in the comments, .N will calculate the length of the grouped object as defined in the by argument.

data.table within group id [duplicate]

This question already has answers here:
data.table "key indices" or "group counter"
(2 answers)
Closed 7 years ago.
I have a data.table with n grouping variables (in this case 2). I want to add an identifier column for each group as seen in the desired output below. I tried :=:.N` and I get why that doesn't work but don't know how to make it happen:
library(data.table)
dat <- data.table::data.table(
w = 1:16,
x = LETTERS[1:2],
y = 1:4
)[, w := NULL][order(x, y)]
## x y
## 1: A 1
## 2: A 1
## 3: A 1
## 4: A 1
## 5: A 3
## 6: A 3
## 7: A 3
## 8: A 3
## 9: B 2
## 10: B 2
## 11: B 2
## 12: B 2
## 13: B 4
## 14: B 4
## 15: B 4
## 16: B 4
dat[, z := 1:.N, by = list(x, y)]
dat
Desired Output
## x y z
## 1: A 1 1
## 2: A 1 1
## 3: A 1 1
## 4: A 1 1
## 5: A 3 2
## 6: A 3 2
## 7: A 3 2
## 8: A 3 2
## 9: B 2 3
## 10: B 2 3
## 11: B 2 3
## 12: B 2 3
## 13: B 4 4
## 14: B 4 4
## 15: B 4 4
## 16: B 4 4
dat[, z:=.GRP,by=list(x,y)]
dat
# x y z
# 1: A 1 1
# 2: A 1 1
# 3: A 1 1
# 4: A 1 1
# 5: A 3 2
# 6: A 3 2
# 7: A 3 2
# 8: A 3 2
# 9: B 2 3
# 10: B 2 3
# ...

Persistent assignment in data.table with .SD

I'm struggling with .SD calls in data.table.
In particular, I'm trying to identify some logical characteristic within a grouping of data, and draw some identifying mark in another variable. Canonical application of .SD, right?
From FAQ 4.5, http://cran.r-project.org/web/packages/data.table/vignettes/datatable-faq.pdf, imagine the following table:
library(data.table) # 1.9.5
DT = data.table(a=rep(1:3,1:3),b=1:6,c=7:12)
DT[,{ mySD = copy(.SD)
mySD[1, b := 99L]
mySD },
by = a]
## a b c
## 1: 1 99 7
## 2: 2 99 8
## 3: 2 3 9
## 4: 3 99 10
## 5: 3 5 11
## 6: 3 6 12
I've assigned these values to b (using the ':=' operator) and so when I re-call DT, I expect the same output. But, unexpectedly, I'm met with the original table:
DT
## a b c
## 1: 1 1 7
## 2: 2 2 8
## 3: 2 3 9
## 4: 3 4 10
## 5: 3 5 11
## 6: 3 6 12
Expected output was the original frame, with persistent modifications in 'b':
DT
## a b c
## 1: 1 99 7
## 2: 2 99 8
## 3: 2 3 9
## 4: 3 99 10
## 5: 3 5 11
## 6: 3 6 12
Sure, I can copy this table into another one, but that doesn't seem consistent with the ethos.
DT2 <- copy(DT[,{ mySD = copy(.SD)
mySD[1, b := 99L]
mySD },
by = a])
DT2
## a b c
## 1: 1 99 7
## 2: 2 99 8
## 3: 2 3 9
## 4: 3 99 10
## 5: 3 5 11
## 6: 3 6 12
It feels like I'm missing something fundamental here.
The mentioned FAQ is just showing a workaround on how to modify (a temprory copy of) .SD but it won't update your original data in place. A possible solution for you problem would be something like
DT[DT[, .I[1L], by = a]$V1, b := 99L]
DT
# a b c
# 1: 1 99 7
# 2: 2 99 8
# 3: 2 3 9
# 4: 3 99 10
# 5: 3 5 11
# 6: 3 6 12

How to drop factors that have fewer than n members

Is there a way to drop factors that have fewer than N rows, like N = 5, from a data table?
Data:
DT = data.table(x=rep(c("a","b","c"),each=6), y=c(1,3,6), v=1:9,
id=c(1,1,1,1,2,2,2,2,2,3,3,3,3,3,3,4,4,4))
Goal: remove rows when the number of id is less than 5. The variable "id" is the grouping variable, and the groups to delete when the number of rows in a group is less than 5. In DT, need to determine which groups have less than 5 members, (groups "1" and "4") and then remove those rows.
1: a 3 5 2
2: b 6 6 2
3: b 1 7 2
4: b 3 8 2
5: b 6 9 2
6: b 1 1 3
7: c 3 2 3
8: c 6 3 3
9: c 1 4 3
10: c 3 5 3
11: c 6 6 3
Here's an approach....
Get the length of the factors, and the factors to keep
nFactors<-tapply(DT$id,DT$id,length)
keepFactors <- nFactors >= 5
Then identify the ids to keep, and keep those rows. This generates the desired results, but is there a better way?
idsToKeep <- as.numeric(names(keepFactors[which(keepFactors)]))
DT[DT$id %in% idsToKeep,]
Since you begin with a data.table, this first part uses data.table syntax.
EDIT: Thanks to Arun (comment) for helping me improve this data table answer
DT[DT[, .(I=.I[.N>=5L]), by=id]$I]
# x y v id
# 1: a 3 5 2
# 2: a 6 6 2
# 3: b 1 7 2
# 4: b 3 8 2
# 5: b 6 9 2
# 6: b 1 1 3
# 7: b 3 2 3
# 8: b 6 3 3
# 9: c 1 4 3
# 10: c 3 5 3
# 11: c 6 6 3
In base R you could use
df <- data.frame(DT)
tab <- table(df$id)
df[df$id %in% names(tab[tab >= 5]), ]
# x y v id
# 5 a 3 5 2
# 6 a 6 6 2
# 7 b 1 7 2
# 8 b 3 8 2
# 9 b 6 9 2
# 10 b 1 1 3
# 11 b 3 2 3
# 12 b 6 3 3
# 13 c 1 4 3
# 14 c 3 5 3
# 15 c 6 6 3
If using a data.table is not necessary, you can use dplyr:
library(dplyr)
data.frame(DT) %>%
group_by(id) %>%
filter(n() >= 5)

Resources