In data.table v1.9.6 you can split a variable into columns like so:
library(data.table)
DT = data.table(x=c("A/B", "A", "B"), y=1:3)
DT[, c("c1", "c2") := tstrsplit(x, "/", fixed=TRUE)][]
The number of required splits [above: 2] is not always known in advance.
How can I generate the required variable names when the number of splits is known?
n = 2 # desired number of splits
# naive attempt to build required string
m = paste0("'", "myvar", 1:n, "'", collapse = ",")
m = paste0("c(", m, ")" )
# [1] "c('myvar1','myvar2')"
DT[, m := tstrsplit(x, "/", fixed=TRUE)][] # doesn't work
Two methods. The first is strongly suggested:
#one
n=2
DT[, paste0("myvar", 1:n) := tstrsplit(x, "/", fixed=TRUE)][]
# x y myvar1 myvar2
#1: A/B 1 A B
#2: A 2 A NA
#3: B 3 B NA
#two
DT[, eval(parse(text=m)) := tstrsplit(x, "/", fixed=TRUE)][]
# x y myvar1 myvar2
#1: A/B 1 A B
#2: A 2 A NA
#3: B 3 B NA
extra
If you do not know the number of splits beforehand:
splits <- max(lengths(strsplit(DT$x, "/", fixed=TRUE)))
DT[, paste0("myvar", 1:splits) := tstrsplit(x, "/", fixed=TRUE)][]
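The same idea can be sketched so that the string is split only once: the list returned by tstrsplit() already has one element per piece, so its length gives the number of columns. The data and the names pieces and new_cols below are assumptions made for illustration, not part of the original answer.

```r
library(data.table)

# Toy data; the third row has three pieces, so three columns are needed.
DT <- data.table(x = c("A/B", "A", "B/C/D"), y = 1:3)

# Split once; tstrsplit() returns one vector per piece, padded with NA.
pieces <- tstrsplit(DT$x, "/", fixed = TRUE)

# Build the names from the number of pieces (new_cols is a hypothetical name).
new_cols <- paste0("myvar", seq_along(pieces))

# Parentheses make := treat new_cols as a vector of column names.
DT[, (new_cols) := pieces]
print(DT)
```

Wrapping the left-hand side in parentheses is what distinguishes a variable holding names from a literal column name.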
Another simple way of doing this: instead of making extra columns, you can stack the split strings in a single column. Wrapping strsplit() in unlist() keeps new a plain character column rather than a list column:
DT = data.table(x=c("A/B", "A", "B"), y=1:3)
DT1 <- DT[, .(new=unlist(strsplit(x, "/", fixed=TRUE))), by=y]
DT1
# y new
# 1: 1 A
# 2: 1 B
# 3: 2 A
# 4: 3 B
Related
I have two data.tables:
left_table <- data.table(a = c(1,2,3,4), b = c(4,5,6,7), c = c(8,9,10,11))
right_table <- data.table(record = sample(LETTERS, 9))
I would like to replace the numeric entries in left_table with the values at the corresponding row numbers in right_table. For example, all instances of 4 in left_table are replaced by whatever letter (or set of characters in my real data) is on row 4 of right_table, and so on.
I have this solution but I feel it's a bit cumbersome and a simpler solution must be possible?
right_table <- data.table(row_n = as.character(seq_along(1:9)), right_table)
for (i in seq_along(left_table)){
  cols <- colnames(left_table)
  current_col <- cols[i]
  # convert numbers to character to allow := to work for matching records
  left_table[,(current_col) := lapply(.SD, as.character), .SDcols = current_col]
  #right_table[,(current_col) := lapply(.SD, as.character), .SDcols = current_col]
  #set key for quick joins
  setkeyv(left_table, current_col)
  setkeyv(right_table, "row_n")
  # replace matching records
  left_table[right_table, (current_col) := record]
}
You can create the new columns fetching the letters from right_table using the original variables.
left_table[, c("newa","newb","newc") :=
.(right_table[a,record],right_table[b,record],right_table[c,record])]
# a b c newa newb newc
# 1: 1 4 8 Y A R
# 2: 2 5 9 D B W
# 3: 3 6 10 G K <NA>
# 4: 4 7 11 A N <NA>
Edit:
To make it more generic:
columnNames <- names(left_table)
left_table[, (columnNames) :=
lapply(columnNames, function(x) right_table[left_table[,get(x)],record])]
Although there is probably a better way to do this without needing to call left_table inside lapply()
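One possible way to avoid that, sketched here under the assumption that every column of left_table holds row numbers: pass the columns through .SD and index the record vector directly. The fixed letters in right_table replace sample() only to keep the result reproducible.

```r
library(data.table)

left_table  <- data.table(a = c(1, 2, 3, 4), b = c(4, 5, 6, 7), c = c(8, 9, 10, 11))
# Fixed letters instead of sample() so the result is reproducible (an assumption).
right_table <- data.table(record = LETTERS[1:9])

cols <- names(left_table)
# .SD holds the columns named in .SDcols, so left_table is not
# referenced again inside lapply().
left_table[, (cols) := lapply(.SD, function(k) right_table$record[k]), .SDcols = cols]
print(left_table)
```

Row numbers beyond nrow(right_table), here 10 and 11, come back as NA, as in the answer above.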
Using mapvalues from plyr:
library(plyr)
corresp <- function(x) mapvalues(x,seq_along(right_table$record),right_table$record)
left_table[,c(names(left_table)) := lapply(.SD,corresp),.SDcols = names(left_table)]
a b c
1: N K X
2: U Q V
3: Z I 10
4: K G 11
Here is my attempt. When we replace the numeric values with character values, we get NAs wherever a number has no matching row, as some of the other answers show. So I took another route. First, I created a vector using unlist(). Then, I used fifelse() from the data.table package: I used foo as indices and replaced the numbers in foo with characters, converting any number without a match (i.e., 10 and 11 in the sample data) to character instead. Then, I created a matrix and converted it to a data.table object. Finally, I assigned column names to the object.
library(data.table)
foo <- unlist(left_table)
temp <- fifelse(test = foo <= nrow(right_table),
yes = right_table$record[foo],
no = as.character(foo))
res <- as.data.table(matrix(data = temp, nrow = nrow(left_table)))
setnames(res, names(left_table))
# a b c
#1: B G J
#2: Y D I
#3: P T 10
#4: G S 11
I think it might be easier to just keep record as a vector and access it via indexing:
left_table <- data.table(a = c(1,2,3,4), b = c(4,5,6,7), c = c(8,9,10,11))
# a b c
#1: 1 4 8
#2: 2 5 9
#3: 3 6 10
#4: 4 7 11
set.seed(0L)
right_table <- data.table(record = sample(LETTERS, 9))
record <- right_table$record
#[1] "N" "Y" "D" "G" "A" "B" "K" "Z" "R"
left_table[, names(left_table) := lapply(.SD, function(k) fcoalesce(record[k], as.character(k)))]
left_table
# a b c
# 1: N G Z
# 2: Y A R
# 3: D B 10
# 4: G K 11
I had a data.table like this:
library(data.table)
dt <- data.table(a = c(rep("A", 3), rep("B", 3)), b = c(1, 3, 5, 2, 4, 6))
I needed to perform an operation (forecast) on the values for each a, so I decided to put them in a list, like this:
dt <- dt[, x := .(list(b)), by = a][, .SD[1,], by = a, .SDcols = "x"]
Now I wanted to "melt" (that's the thing that comes to mind) dt back into its original form.
I could do it for very few levels of a like this:
dt2 <- rbind(expand.grid(dt[1, a], dt[1, x[[1]]]), expand.grid(dt[2, a], dt[2, x[[1]]]))
but of course, the solution is impractical for more levels of a.
I've tried
dt2 <- dt[, expand.grid(a, x[[1]]), by = a]
which results in
dt2
## a Var1 Var2
## 1: A A 1
## 2: A A 3
## 3: A A 5
## 4: B A 2
## 5: B A 4
## 6: B A 6
It's interesting to notice that Var1 doesn't actually follow the expected "A - B" pattern (but at least a remains).
Is there a better approach to achieve this?
EDITS
Expected output will be the result of
dt2[, .(a, Var2)]
Corrected "melt" for "dcast".
You are looking for a way to nest (convert a column from an atomic vector type to a list type) and unnest (the opposite direction) in a data.table way. This is different from reshaping data, which either spreads a column's values into column headers (dcast) or gathers column headers into a column's values (melt).
In data.table syntax, you can use list and unlist on the target column to summarize or broadcast it along with group variables:
Say if we are starting from:
dt
# a b
# 1: A 1
# 2: A 3
# 3: A 5
# 4: B 2
# 5: B 4
# 6: B 6
To repeat what you have achieved in your first step, i.e. nest column b, you can do:
dt_nest <- dt[, .(b = list(b)), a]
dt_nest
# a b
# 1: A 1,3,5
# 2: B 2,4,6
To go the opposite direction, use unlist with the group variable:
dt_nest[, .(b = unlist(b)), a]
# a b
# 1: A 1
# 2: A 3
# 3: A 5
# 4: B 2
# 5: B 4
# 6: B 6
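The full round trip can be sketched as follows. This is only an illustration: mean() stands in for whatever per-group operation (such as the forecast mentioned in the question) is actually needed, and mean_b and dt_back are hypothetical names.

```r
library(data.table)

dt <- data.table(a = c(rep("A", 3), rep("B", 3)), b = c(1, 3, 5, 2, 4, 6))

# Nest: one row per group, with b stored as a list column.
dt_nest <- dt[, .(b = list(b)), by = a]

# Work on each nested vector; mean() stands in for the real per-group operation.
dt_nest[, mean_b := sapply(b, mean)]

# Unnest: expand the list column back to one row per value.
dt_back <- dt_nest[, .(b = unlist(b)), by = a]
print(dt_back)
```

Because the grouping column is carried along by by = a, no expand.grid() call is needed to rebuild the long form.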
How do I use variable column names on the RHS of := operations? For example, given this data.table "dt", I'd like to create two new columns, "first_y" and "first_z", that contain the first observation of the given column for each value of "x".
dt <- data.table(x = c("one","one","two","two","three"),
y = c("a", "b", "c", "d", "e"),
z = c(1, 2, 3, 4, 5))
dt
x y z
1: one a 1
2: one b 2
3: two c 3
4: two d 4
5: three e 5
Here's how you would do it without variable column names.
dt[, c("first_y", "first_z") := .(first(y), first(z)), by = x]
dt
x y z first_y first_z
1: one a 1 a 1
2: one b 2 a 1
3: two c 3 c 3
4: two d 4 c 3
5: three e 5 e 5
But how would I do this if the "y" and "z" column names are dynamically stored in a variable?
cols <- c("y", "z")
# This doesn't work
dt[, (paste0("first_", cols)) := .(first(cols)), by = x]
# Nor does this
q <- quote(first(as.name(cols[1])))
p <- quote(first(as.name(cols[2])))
dt[, (paste0("first_", cols)) := .(eval(q), eval(p)), by = x]
I've tried numerous other combinations of quote() and eval() and as.name() without success. The LHS of the operation appears to be working as intended and is documented in many places, but I can't find anything about using a variable column name on the RHS. Thanks in advance.
I'm not familiar with the first function (although it looks like something Hadley would define).
dt[, paste0("first_", cols) := lapply(.SD, head, n = 1L),
by = x, .SDcols = cols]
# x y z first_y first_z
#1: one a 1 a 1
#2: one b 2 a 1
#3: two c 3 c 3
#4: two d 4 c 3
#5: three e 5 e 5
The .SDcols answer is fine for this case, but you can also just use get:
dt[, paste0("first_", cols) := lapply(cols, function(x) get(x)[1]), by = x]
dt
# x y z first_y first_z
#1: one a 1 a 1
#2: one b 2 a 1
#3: two c 3 c 3
#4: two d 4 c 3
#5: three e 5 e 5
Another alternative is the vectorized version - mget:
dt[, paste0("first_", cols) := setDT(mget(cols))[1], by = x]
I can never get "get" or "eval" to work on the RHS when trying to do mathematical operations. Try this if you need to.
Thing_dt[, c(new_col) := Thing_dt[[oldcol1]] * Thing_dt[[oldcol2]]]
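For comparison, here is a minimal sketch in which get() is used for arithmetic on the RHS; it may help to pin down where a failing call differs. The table and the variables old_col1, old_col2 and new_col are hypothetical.

```r
library(data.table)

dt <- data.table(width = c(1, 2, 3), height = c(4, 5, 6))

# Hypothetical variables holding the column names.
old_col1 <- "width"
old_col2 <- "height"
new_col  <- "area"

# get() looks each name up among the columns of dt when j is evaluated.
dt[, (new_col) := get(old_col1) * get(old_col2)]
print(dt)
```

The [[ ]] form above has the same effect without grouping, but it always reads the whole column, so it would not respect a by argument.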
Here are two artificial but I hope pedagogical examples of my problem.
1) When running this code:
> dat0 <- data.frame(A=c("a","a","b"), B="")
> data.table(dat0)[, lapply(.SD, function(x) length(A)) , by = "A"]
A B
1: a 1
2: b 1
I expected the output
A B
1: a 2
2: b 1
(similarly to plyr::ddply(dat0, .(A), nrow)).
Update to question 1)
Let me give a less artificial example. Consider the following dataframe:
dat0 <- data.frame(A=c("a","a","b"), x=c(1,2,3), y=c(9,8,7))
> dat0
A x y
1 a 1 9
2 a 2 8
3 b 3 7
Using plyr package, I get the means of x and y by each value of A as follows:
> ddply(dat0, .(A), summarise, x=mean(x), y=mean(y))
A x y
1 a 1.5 8.5
2 b 3.0 7.0
Very nice. Now imagine another variable H and the following calculations:
dat0 <- data.frame(A=c("a","a","b"), H=c(0,1,-1), x=c(1,2,3), y=c(9,8,7))
> ddply(dat0, .(A), summarise, x=mean(x)^mean(H), y=mean(y)^mean(H))
A x y
1 a 1.2247449 2.9154759
2 b 0.3333333 0.1428571
Very nice too. But now, imagine there's a huge number of variables x for which you want to calculate mean(x)^mean(H). Then I don't want to type:
ddply(dat0, .(A), summarise, a=mean(a)^mean(H), b=mean(b)^mean(H), c=mean(c)^mean(H), d=mean(d)^mean(H), ...........)
So my idea was to try:
flipcols <- my_selected_columns # c("a", "b", "c", "d", ....)
data.table(dat0)[, lapply(.SD, function(x) mean(x)^mean(H)), by = "A", .SDcols = flipcols]
But that doesn't work because the presence of H in function(x) mean(x)^mean(H) is not handled as I expected! I have not been able to make it work with plyr::colwise either.
2) When running this code:
> dat0 <- data.frame(A=c("a","a","b"), B=1:3, c=0)
> data.table(dat0)[, lapply(.SD, function(x) B), .SDcols="c"]
Error in ..FUN(c) : object 'B' not found
I expected it to work and produce:
c
1: 1
2: 2
3: 3
So is there a way to use the columns of the original data.table in a transformation ?
1) Use .N. The length of the grouping variable A there is 1 because there is just one value of A for each group (this is by definition of what grouping means):
dt <- data.table(A=c("a","a","b"), B="")
dt[, .N, by = A]
# A N
#1: a 2
#2: b 1
(updated 1) This is the same issue as 2). A workaround is to not use .SDcols:
dt = data.table(A=c("a","a","b"), H=c(0,1,-1), x=c(1,2,3), y=c(9,8,7))
dt[, lapply(.SD[, !"H"], function(x) mean(x) ^ mean(H)), by = A]
# A x y
#1: a 1.2247449 2.9154759
#2: b 0.3333333 0.1428571
2) This is a bug that's been reported before here: https://r-forge.r-project.org/tracker/index.php?func=detail&aid=5222&group_id=240&atid=975
I don't know if I understand you correctly.
1)
library(data.table)
dat0 <- data.frame(A=c("a","a","b"), B="")
data.table(dat0)[, list(l= nrow(.SD)) , by = "A"]
result:
A l
1: a 2
2: b 1
2)
dat0 <- data.frame(A=c("a","a","b"), B=1:3, c=0)
data.table(dat0)[, list(c=unlist(.SD)), .SDcols= "B"]
result:
c
1: 1
2: 2
3: 3
1')
Edit: I changed -1 to mycols
dat0 <- data.frame(A=c("a","a","b"), H=c(0,1,-1), x=c(1,2,3), y=c(9,8,7))
data.table(dat0)[, lapply(.SD, function(x) mean(x)^mean(H)), by = "A", .SDcols = c("x", "y")]
result:
A x y
1: a 1.2247449 2.9154759
2: b 0.3333333 0.1428571
Note that if the data is huge, mean(H) will be calculated many times wastefully. We could do {muH = mean(H); lapply(.SD, function(x) mean(x)^muH)} in this case to save computation; the above is a bit more readable though.
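The variant with the group mean computed once might look like the sketch below. It assumes a data.table version in which columns outside .SDcols, here H, are still visible in j, as in the call above; muH and res are illustrative names.

```r
library(data.table)

dt <- data.table(A = c("a", "a", "b"), H = c(0, 1, -1),
                 x = c(1, 2, 3), y = c(9, 8, 7))

res <- dt[, {
  muH <- mean(H)                          # computed once per group
  lapply(.SD, function(v) mean(v)^muH)    # reused for every column in .SD
}, by = A, .SDcols = c("x", "y")]
print(res)
```

The braces let j hold several statements; the value of the last one becomes the result for the group.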
I have a data.table and want to apply a function to each column within each subset of rows.
Normally one would do as follows: DT[, lapply(.SD, function), by = y]
But in my case the function does not return a single value but a longer vector.
Is there a chance to do something like this?
library(data.table)
set.seed(9)
DT <- data.table(x1=letters[sample(x=2L,size=6,replace=TRUE)],
x2=letters[sample(x=2L,size=6,replace=TRUE)],
y=rep(1:2,3), key="y")
DT
# x1 x2 y
#1: a a 1
#2: a b 1
#3: a a 1
#4: a a 2
#5: a b 2
#6: a a 2
DT[, lapply(.SD, table), by = y]
# Desired Result, something like this:
# x1_a x2_a x2_b
# 3 2 1
# 3 2 1
Thanks in advance, and also: I would not mind if the result of the function must have a fixed length.
You simply need to unlist the table and then coerce back to a list:
> DTCounts <- DT[, as.list(unlist(lapply(.SD, table))), by=y]
> DTCounts
y x1.a x2.a x2.b
1: 1 3 2 1
2: 2 3 2 1
if you do not like the dots in the names, you can sub them out:
> setnames(DTCounts, sub("\\.", "_", names(DTCounts)))
> DTCounts
y x1_a x2_a x2_b
1: 1 3 2 1
2: 2 3 2 1
Note that if not all values in a column are present for each group
(ie, if x2=c("a", "b") when y=1, but x2=c("b", "b") when y=2)
then the above breaks.
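The solution is to make the columns factors before counting, so that table() reports every level in every group, with a zero where a value is absent.
# convert every column (this also turns y into a factor)
DT <- DT[, lapply(.SD, factor)]
## OR
# convert only the chosen columns and keep y as it is
columnsToConvert <- c("x1", "x2") # or .. <- setdiff(names(DT), "y")
DT <- cbind(DT[, lapply(.SD, factor), .SDcols=columnsToConvert], y=DT[, y])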
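A small sketch of why the factor conversion helps, using made-up data in which group y = 2 has no "a" in x2. The names cols and counts are illustrative.

```r
library(data.table)

# Group y = 2 contains no "a" in x2.
DT <- data.table(x2 = c("a", "b", "b", "b"), y = c(1, 1, 2, 2))

# Convert in place so every group carries the full set of levels.
cols <- "x2"
DT[, (cols) := lapply(.SD, factor), .SDcols = cols]

# table() on a factor counts every level, so each group yields the same columns.
counts <- DT[, as.list(unlist(lapply(.SD, table))), by = y, .SDcols = cols]
print(counts)
```

Without the conversion, the second group would return a single count and the per-group results would no longer line up column by column.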