I have a column in a data.table which is a list of comma-separated values:
library(data.table)
dt = data.table(a = c('a','b','c'), b = c('xx,yy,zz','mm,nn','qq,rr,ss,tt'))
> dt
a b
1: a xx,yy,zz
2: b mm,nn
3: c qq,rr,ss,tt
I would like to transform it into a long format
a b
1: a xx
2: a yy
3: a zz
4: b mm
5: b nn
6: c qq
7: c rr
8: c ss
9: c tt
This question has been answered for a data frame here. I'm wondering if there is an elegant data table solution.
The following will work for your example:
dt[, c(b=strsplit(b, ",")), by=a]
a b
1: a xx
2: a yy
3: a zz
4: b mm
5: b nn
6: c qq
7: c rr
8: c ss
9: c tt
This method fails if the "by" variable is repeated as in
dt = data.table(a = c('a','b','c', 'a'),
b = c('xx,yy,zz','mm,nn','qq,rr,ss,tt', 'zz,gg,tt'))
One robust solution in this situation is to use paste to collapse all observations with the same grouping variable (a) into a single string, and then feed the result to the code above.
dt[, .(b=paste(b, collapse=",")), by=a][, c(b=strsplit(b, ",")), by=a]
This returns
a b
1: a xx
2: a yy
3: a zz
4: a zz
5: a gg
6: a tt
7: b mm
8: b nn
9: c qq
10: c rr
11: c ss
12: c tt
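To see why this is robust, here is the intermediate result of the first step on the repeated-'a' data (a quick check; the chained call above then splits these collapsed strings):
dt[, .(b = paste(b, collapse = ",")), by = a]
# a b
# 1: a xx,yy,zz,zz,gg,tt
# 2: b mm,nn
# 3: c qq,rr,ss,tt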
There is another method, but it involves another package: splitstackshape.
library(splitstackshape)
cSplit(dt, "b", sep = ",", direction = "long")
a b
1: a xx
2: a yy
3: a zz
4: b mm
5: b nn
6: c qq
7: c rr
8: c ss
9: c tt
This function uses data.table under the hood, and it works even if there are multiple rows with the same value in column "a".
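For instance, with the repeated-'a' data from above, the same call still produces one row per split element (a quick sketch, assuming splitstackshape is installed):
dt <- data.table(a = c('a','b','c', 'a'),
                 b = c('xx,yy,zz','mm,nn','qq,rr,ss,tt', 'zz,gg,tt'))
cSplit(dt, "b", sep = ",", direction = "long")
# 12 rows: each element of 'b' gets its own row, with the matching 'a' value repeated alongside it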
We can split the column 'b' by the delimiter ',' (using strsplit), grouped by 'a', and set the name of the new column (i.e. 'V1') to 'b' with setnames:
setnames(dt[, strsplit(b, ','), by = a], "V1", "b")[]
# a b
#1: a xx
#2: a yy
#3: a zz
#4: b mm
#5: b nn
#6: c qq
#7: c rr
#8: c ss
#9: c tt
If there are repeating elements in 'a' as in the below example
dt <- data.table(a = c('a','b','c', 'a'),
b = c('xx,yy,zz','mm,nn','qq,rr,ss,tt', 'zz,gg,tt'))
we can group by the sequence of rows, do the strsplit on 'b', concatenate with the 'a' column, and then assign (:=) the 'grp' column to NULL:
dt[, c(a=a, b=strsplit(b, ",")), .(grp = 1:nrow(dt))][, grp := NULL][]
# a b
# 1: a xx
# 2: a yy
# 3: a zz
# 4: b mm
# 5: b nn
# 6: c qq
# 7: c rr
# 8: c ss
# 9: c tt
#10: a zz
#11: a gg
#12: a tt
NOTE: Both methods are data.table methods.
Related sample code:
library(data.table)
set.seed(42)
dt <- data.table(id = LETTERS[1:20],
setvalues = replicate(20,
sample(letters[1:4], sample(c(2,3),1))))[order(id)]
dt
id setvalues
1: A d,a,b
2: B c,d,a
3: C c,b,d
4: D b,d,c
5: E a,b,c
6: F a,c,b
7: G c,b
8: H b,c,d
9: I b,c,a
10: J a,d,b
11: K b,d,a
12: L b,c,d
13: M d,b,a
14: N b,c
15: O c,d
16: P b,d
17: Q d,c,b
18: R a,d,b
19: S a,d,c
20: T b,a
How can I count the occurrence of each set (order doesn't matter)?
The desired results are something like
setvalue counts
b,c,d 6
a,b,d 4
a,b,c 3
a,c,d 2
b,c 2
c,d 1
b,d 1
a,b 1
The 'setvalues' column is a list of vectors. We loop through the list with lapply, sort each vector, paste it together with toString, use that in the by argument, and get the 'counts' with .N:
dt[ , .(counts = .N), .(setvalue = unlist(lapply(setvalues, function(x) toString(sort(x)))))]
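If you also want the most frequent sets listed first, you can chain an order call onto the same expression:
dt[, .(counts = .N),
   .(setvalue = unlist(lapply(setvalues, function(x) toString(sort(x)))))][order(-counts)]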
I'm afraid I can't find an answer to my problem.
I am looking to create the following:
1) There are 4 sets of cards A, B, C, D, 16 cards in total.
2) Each card is numbered within a set (A from 1 to 4, B from 5 to 8, and so on).
3) We want to randomize the assignment such that each person is randomly assigned a set of cards, for example A.
4) In addition, the order of the cards within the set has to be randomized.
So what we want is the following:
Person 1: Set A, cards 1-2-3-4
Person 2: Set A, cards 4-2-3-1
Person 3: Set D, cards 16-15-12-13
and so on.
I would also like each number to be in a separate column.
If each person gets one set of cards:
> df=NULL
> a=rep(LETTERS[1:4],4)
> df$card1=sample(a,16,F)
> df=as.data.frame(df)
> df=df[order(df$card1),]
> df
card1
1: A
2: A
3: A
4: A
5: B
6: B
7: B
8: B
9: C
10: C
11: C
12: C
13: D
14: D
15: D
16: D
> df$card2=rep((1:4),4)
> df
card1 card2
1: A 1
2: A 2
3: A 3
4: A 4
5: B 1
6: B 2
7: B 3
8: B 4
9: C 1
10: C 2
11: C 3
12: C 4
13: D 1
14: D 2
15: D 3
16: D 4
> df1=df[sample(nrow(df)),]
> df1
card1 card2
1: A 2
2: D 4
3: C 3
4: D 3
5: B 3
6: D 1
7: C 2
8: A 3
9: B 2
10: D 2
11: B 1
12: A 1
13: C 4
14: C 1
15: B 4
16: A 4
Here's one way of approaching this.
person <- c("Person1", "Person2", "Person3", "Person4")
cardset <- LETTERS[1:4]
set.seed(357) # this is for reproducibility
xy <- data.frame(
person = sample(person), # pick out persons in a random order
set = sample(cardset)) # assign a random card set to a person
vx <- rep(xy$set, each = 4) # for each set, create repeats
vy <- split(paste(vx, rep(1:4, times = 4), sep = ""), f = vx) # append numbers to it
vz <- do.call(rbind, sapply(vy, FUN = sample, simplify = FALSE)) # shuffle using sapply and stitch together with do.call
cbind(xy, vz) # add it to the original data
person set 1 2 3 4
A Person1 C A4 A3 A2 A1
B Person4 B B2 B1 B4 B3
C Person3 D C2 C3 C4 C1
D Person2 A D1 D2 D4 D3
Here's another option:
# create data frame of decks and their numbered cards
cards <- data.frame(deck = rep(LETTERS[1:4], each = 4),
numbers = c(1:16),
stringsAsFactors = FALSE)
# create list of people
people <- c("Person1", "Person2", "Person3")
# loop through each person and randomly select a deck
# based on deck selected, subset the cards that can be used
# randomize the numbered cards
# add the deck, order of cards, and person to a
# growing data frame of assignments
assignment <- NULL
for(i in unique(people)) {
set <- sample(cards$deck, size = 1)
setCards <- cards[cards$deck == set, ]
orderCards <- sample(setCards$numbers)
assignment <- rbind(assignment, data.frame(Person = i,
Deck = set,
Card1 = orderCards[1],
Card2 = orderCards[2],
Card3 = orderCards[3],
Card4 = orderCards[4],
stringsAsFactors = FALSE))
}
I very often transform subsets of data using the .SDcols option in data.table. It makes sense that the .SD columns sent to j are in the same order as the original data.table.
EDITED to properly identify the issue
It's nice that .SD columns have the same order as that specified in the .SDcols argument. This does not happen when get is used in the j argument (inside an lapply call, at least). In this case, the .SD table columns maintain their original order.
Is there any way to override this behaviour?
An example without get works fine
# library(data.table)
dt = data.table(col1 = rep(LETTERS[1:3], 4),
b = rnorm(12),
a = 1:12,
c = LETTERS[1:12])
# columns I want to do something to
d.vars = c('a', 'b') #' names in different order than names(dt)
# Generate columns of first differences by group
dt[, paste('d', d.vars, sep='.') :=
lapply(.SD, function(L) L - shift(L, n = 1, type='lag') ),
keyby = col1, .SDcols = d.vars]
The results are as expected: the .SD columns are ordered the same way as the names in d.vars, even though d.vars is ordered differently than the columns in dt.
> dt
col1 b a c d.a d.b
1: A -0.28901751 1 A NA NA
2: A 0.65746901 4 D 3 0.94648651
3: A -0.10602462 7 G 3 -0.76349362
4: A -0.38406252 10 J 3 -0.27803790
5: B -1.06963450 2 B NA NA
6: B 0.35137273 5 E 3 1.42100723
7: B 0.43394046 8 H 3 0.08256772
8: B 0.82525042 11 K 3 0.39130996
9: C 0.50421710 3 C NA NA
10: C -1.09493665 6 F 3 -1.59915375
11: C -0.04858163 9 I 3 1.04635501
12: C 0.45867279 12 L 3 0.50725443
This is the expected output, because lapply in j processed column a first and b second, in spite of the column order in dt.
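A quick way to confirm the order in which .SD presents its columns (a small check, not part of the original post):
names(dt[, .SD, .SDcols = d.vars])
# [1] "a" "b"    # matches d.vars, not the column order of dt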
Example with get behaves differently
dt2 = data.table(col1 = rep(LETTERS[1:3], 4),
b = rnorm(12),
a = 1:12,
neg = -1,
c = LETTERS[1:12])
# columns I want to do something to
d.vars = c('a', 'b') #' names in different order than names(dt)
# name of variable to be called in j.
negate <- 'neg'
dt2[, paste('d', d.vars, sep='.') :=
lapply(.SD, function(L) {(L - shift(L, n = 1, type='lag') ) * get(negate) }),
keyby = col1, .SDcols = d.vars]
Now the naming of the newly created columns doesn't align with the name order in d.vars:
> dt2
col1 b a neg c d.a d.b
1: A -0.3539066 1 -1 A NA NA
2: A 0.2702374 4 -1 D -0.62414408 -3
3: A -0.7834941 7 -1 G 1.05373150 -3
4: A -1.2765652 10 -1 J 0.49307118 -3
5: B -0.2936422 2 -1 B NA NA
6: B -0.2451996 5 -1 E -0.04844252 -3
7: B -1.6577614 8 -1 H 1.41256181 -3
8: B 1.0668059 11 -1 K -2.72456737 -3
9: C -0.1160938 3 -1 C NA NA
10: C -0.7940771 6 -1 F 0.67798333 -3
11: C 0.2951743 9 -1 I -1.08925140 -3
12: C -0.4508854 12 -1 L 0.74605969 -3
In this second example the b column is processed by lapply first and therefore assigned to d.a.
If I refer to neg directly (i.e., I don't use get) then the results are as expected: lapply processes the .SD columns in the order given in d.vars.
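For reference, this is roughly what referring to neg directly looks like (a sketch of the working variant; it recreates d.a and d.b in the order given by d.vars):
dt2[, paste('d', d.vars, sep='.') :=
      lapply(.SD, function(L) (L - shift(L, n = 1, type='lag')) * neg),
    keyby = col1, .SDcols = d.vars]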
p.s. Thanks data.table team! I love this package!
Based on the description, we can use match to reorder 'd.vars' to follow the column names of 'dt' (giving 'd.vars1') and then use that to get the order right:
d.vars1 <- d.vars[match(names(dt), d.vars, nomatch = 0)]
dt[, paste0("d.",d.vars1) := lapply(.SD, function(L)
L - shift(L, n = 1, type='lag') ), keyby = col1, .SDcols = d.vars1]
dt
# col1 b a c d.b d.a
# 1: A -0.28901751 1 A NA NA
# 2: A 0.65746901 4 D 0.94648652 3
# 3: A -0.10602462 7 G -0.76349363 3
# 4: A -0.38406252 10 J -0.27803790 3
# 5: B -1.06963450 2 B NA NA
# 6: B 0.35137273 5 E 1.42100723 3
# 7: B 0.43394046 8 H 0.08256773 3
# 8: B 0.82525042 11 K 0.39130996 3
# 9: C 0.50421710 3 C NA NA
#10: C -1.09493665 6 F -1.59915375 3
#11: C -0.04858163 9 I 1.04635502 3
#12: C 0.45867279 12 L 0.50725442 3
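Here 'd.vars1' is simply 'd.vars' reordered to follow the column order of 'dt':
d.vars1
# [1] "b" "a"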
Update
Based on the new dataset
d.vars1 <- d.vars[match(names(dt2), d.vars, nomatch = 0)]
dt2[, paste0('d.', d.vars1) := lapply(.SD, function(L)
L - shift(L, n = 1, type='lag') * get(negate) ),
keyby = col1, .SDcols = d.vars1]
dt2
# col1 b a neg c d.b d.a
# 1: A -0.3539066 1 -1 A NA NA
# 2: A 0.2702374 4 -1 D -0.0836692 5
# 3: A -0.7834941 7 -1 G -0.5132567 11
# 4: A -1.2765652 10 -1 J -2.0600593 17
# 5: B -0.2936422 2 -1 B NA NA
# 6: B -0.2451996 5 -1 E -0.5388418 7
# 7: B -1.6577614 8 -1 H -1.9029610 13
# 8: B 1.0668059 11 -1 K -0.5909555 19
# 9: C -0.1160938 3 -1 C NA NA
#10: C -0.7940771 6 -1 F -0.9101709 9
#11: C 0.2951743 9 -1 I -0.4989028 15
#12: C -0.4508854 12 -1 L -0.1557111 21
Suppose I have this data:
c1 c2 c3
A A AA
A B BB
A C CC
B A DD
B B EE
B C FF
C A GG
C B HH
C C II
A A JJ
I want to reshape it with dcast using this call:
dcast(data,c1~c2,value.var="c3",function(x)x)
But I get this error:
Error in vapply(indices, fun, .default) : values must be length 0,
but FUN(X[[1]]) result is length 1
How can I use a user-defined function with dcast?
I want to get:
A B C
A AA BB CC
B DD EE FF
C GG HH II
A JJ NA NA
Here's a possible solution using data.table's new rleid function (v1.9.5+), which will create a run-length index for the c1 column (you can remove indx afterwards if you want):
library(data.table) # v 1.9.5+
dcast(setDT(data)[, indx := rleid(c1)], indx + c1 ~ c2, value.var = "c3")
# indx c1 A B C
# 1: 1 A AA BB CC
# 2: 2 B DD EE FF
# 3: 3 C GG HH II
# 4: 4 A JJ NA NA
### installing the development version
# library(devtools)
# install_github("Rdatatable/data.table", build_vignettes = FALSE)
So basically, after creating a run-length index on c1, we spread the data more or less as before, while also including indx on the left-hand side of the formula.
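If you don't want to keep the helper column, you can drop it from the result afterwards (a small follow-up, assuming the dcast result is first saved to an object):
res <- dcast(setDT(data)[, indx := rleid(c1)], indx + c1 ~ c2, value.var = "c3")
res[, indx := NULL][]
# c1 A B C -- the same four rows, without the helper index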
Or if you insist on tidyr, here's an option
library(tidyr)
data$indx <- with(rle(as.character(data$c1)), rep(seq_along(lengths), lengths))
spread(data, c2, c3)
# c1 indx A B C
# 1 A 1 AA BB CC
# 2 A 4 JJ <NA> <NA>
# 3 B 2 DD EE FF
# 4 C 3 GG HH II
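A small follow-up: if you want the rows back in their original order, sort the spread result by indx (and drop it if you like):
res <- spread(data, c2, c3)
res[order(res$indx), ]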
Another way to use dcast is to create unique identifiers with cumsum; without such an identifier, dcast will not know which value to fill in for duplicate combinations like A A.
data$ids <- cumsum(c(T,diff(as.numeric(data$c1)) != 0L))
dcast(data, ids+c1~c2, value.var="c3")[-1]
# c1 A B C
# 1 A AA BB CC
# 2 B DD EE FF
# 3 C GG HH II
# 4 A JJ <NA> <NA>
Is it possible to use the key value in the condition for creating a new column using := with data.table?
set.seed(315)
DT = data.table(a = factor(LETTERS[rep(c(1:5), 2)]),
b = factor(letters[rep(c(1, 2), 5)]),
c = rnorm(10), key = c("a", "b"))
Which gives a data.table that looks like this:
> DT
a b c
1: A a 0.11610792
2: A b -2.67495409
3: B a -0.18467740
4: B b 0.79994197
5: C a 0.74565643
6: C b 0.49959003
7: D a 0.04385948
8: D b -2.25996438
9: E a -1.86204824
10: E b 0.11327201
I want to create a new column d that is the difference of the values from A,a and A,b, B,a and B,b, and so on. I'd like to use := because of how fast it can fly on large datasets.
I can get the d column that I'm looking for with a flurry of new data.tables, merges, and more, but this just feels ugly.
dt.a <- DT[DT[, .I[b == "a"]]]
dt.b <- DT[DT[, .I[b == "b"]]]
dt <- merge(dt.a, dt.b, by = c("a"))
> dt
a b.x c.x b.y c.y
1: A a 0.11610792 b -2.674954
2: B a -0.18467740 b 0.799942
3: C a 0.74565643 b 0.499590
4: D a 0.04385948 b -2.259964
5: E a -1.86204824 b 0.113272
> dt[, d:= c.x - c.y]
> dt
a b.x c.x b.y c.y d
1: A a 0.11610792 b -2.674954 2.7910620
2: B a -0.18467740 b 0.799942 -0.9846194
3: C a 0.74565643 b 0.499590 0.2460664
4: D a 0.04385948 b -2.259964 2.3038239
5: E a -1.86204824 b 0.113272 -1.9753203
Is there a more direct way?
This gets the job done, sort of. Without splitting the data apart, each value of d would be repeated for every row sharing the same value of DT[, a]; that's OK.
Based on your input and what you have provided as your current solution, I would suggest the following:
DT[, d := diff(rev(c)), by = a]
DT
# a b c d
# 1: A a 0.11610792 2.7910620
# 2: A b -2.67495409 2.7910620
# 3: B a -0.18467740 -0.9846194
# 4: B b 0.79994197 -0.9846194
# 5: C a 0.74565643 0.2460664
# 6: C b 0.49959003 0.2460664
# 7: D a 0.04385948 2.3038239
# 8: D b -2.25996438 2.3038239
# 9: E a -1.86204824 -1.9753203
# 10: E b 0.11327201 -1.9753203
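If you would rather have a single row per group instead of repeating d, a minimal sketch (assuming every group has exactly one "a" row and one "b" row, as in this data):
DT[, .(d = c[b == "a"] - c[b == "b"]), by = a]
# a d
# 1: A 2.7910620
# 2: B -0.9846194
# 3: C 0.2460664
# 4: D 2.3038239
# 5: E -1.9753203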