Data table aggregation and using toString() - r

I was trying to create a function where I aggregate data table values on the basis of one column but I can't pass the argument in toString() for column names. The following example can show it better:
library(data.table)
t1 <- data.table(P = c("a", "b", "c", "d", "a", "b"),
                 Q = c("1", "2", "3", "4", "5", "6"))
t1[ ,toString(Q), by = P] # this works
t1[ ,toString(colnames(t1)[2]), by = P] # this does not give me the desired result
I get the following result with the above:
P V1
1: a Q
2: b Q
3: c Q
4: d Q
As compared to the expected:
P V1
1: a 1, 5
2: b 2, 6
3: c 3
4: d 4
I have tried removing the quotes using noquote(), but nothing works for me.
Can anyone point out where I might be making a mistake?

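toString(colnames(t1)[2]) only stringifies the column name "Q" itself. We can use get() to return the values of the column with that name: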
t1[ , toString(get(colnames(t1)[2])), by = P]
# P V1
#1: a 1, 5
#2: b 2, 6
#3: c 3
#4: d 4
Or with eval/as.symbol
t1[, toString(eval(as.symbol(names(t1)[2]))), by = P]
Or, more idiomatically, specify the column in .SDcols:
t1[, toString(.SD[[1]]), by = P, .SDcols = names(t1)[2]]
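Since the question is about wrapping this in a function, the .SDcols variant could be sketched as follows. The helper name aggregate_to_string and its arguments are made up for illustration, and it assumes both column names are passed as character strings.

```r
library(data.table)

t1 <- data.table(P = c("a", "b", "c", "d", "a", "b"),
                 Q = c("1", "2", "3", "4", "5", "6"))

# Hypothetical helper: collapse `value_col` into one string per `by_col` group.
# Both column names are given as character strings.
aggregate_to_string <- function(dt, value_col, by_col) {
  dt[, .(V1 = toString(.SD[[1]])), by = by_col, .SDcols = value_col]
}

res <- aggregate_to_string(t1, value_col = "Q", by_col = "P")
print(res)
```

Because the column is selected through .SDcols, no get() or eval() is needed inside the function.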

Related

How do I find the row and column number for a particular string in a data.table?

Let's say I have the following data.table:
DT <- setDT(data.frame(id = 1:10, LETTERS = LETTERS[1:10],
letters = letters[1:10]))
## > DT
## id LETTERS letters
## 1: 1 A a
## 2: 2 B b
## 3: 3 C c
## 4: 4 D d
## 5: 5 E e
## 6: 6 F f
## 7: 7 G g
## 8: 8 H h
## 9: 9 I i
## 10: 10 J j
and I want to find the row and column numbers of the letter 'h' (which are 8 and 3). How would I do that?
DT[, which(.SD == "h", arr.ind = TRUE)]
# row col
# [1,] 8 3
EDIT:
Trying to take into account Michael's points:
str_idx = which(sapply(DT, function(x) is.character(x) || is.factor(x)))
idx <- DT[, which(as.matrix(.SD) == "h", arr.ind = TRUE), .SDcols = str_idx]
idx[, "col"] <- chmatch(names(str_idx)[idx[, "col"]], names(DT))
idx
# row col
# [1,] 8 3
Depends on the exact format of your desired output.
# applying to non-string columns is inefficient
str_idx = which(sapply(DT, is.character))
# returns a list as long as str_idx with two elements appropriately named
lapply(str_idx, function(jj) list(row = which(DT[[jj]] == 'h'), col = jj))
It should also be possible to melt the string columns of your table to avoid looping.
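A minimal sketch of that melt idea, assuming it is acceptable to work on a copy of the table. The helper column row_id and the name column are chosen only for this example.

```r
library(data.table)

DT <- data.table(id = 1:10, LETTERS = LETTERS[1:10], letters = letters[1:10])

# Names of the character columns to search
str_cols <- names(DT)[sapply(DT, is.character)]

# Work on a copy so DT itself does not gain the helper column `row_id`
long <- melt(copy(DT)[, row_id := .I],
             id.vars = "row_id", measure.vars = str_cols,
             variable.name = "column", variable.factor = FALSE)

# Keep the matching cells and map column names back to positions in DT
hits <- long[value == "h", .(row = row_id, col = match(column, names(DT)))]
print(hits)
```

This trades the loop for one long table with a row per cell, so it may use noticeably more memory on wide tables.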

R: data.table left outer join function not updating

Based on this previous post, I built leftOuterJoin, a function that updates a data.table X according to another data.table Y. The function is defined as follows:
leftOuterJoin <- function(X, Y, onCol) {
.colsY <- names(Y)
X[Y, (.colsY) := mget(paste0("i.", .colsY)), on = onCol]
}
The function works 99% of the time as intended, e.g.:
X <- data.table(id = 1:5, L = letters[1:5])
id L
1: 1 a
2: 2 b
3: 3 c
4: 4 d
5: 5 e
Y <- data.table(id = 3:5, L = c(NA, "g", "h"), N = c(10, NA, 12))
id L N
1: 3 <NA> 10
2: 4 g NA
3: 5 h 12
leftOuterJoin(X, Y, "id")
X
id L N
1: 1 a NA
2: 2 b NA
3: 3 <NA> 10
4: 4 g NA
5: 5 h 12
However, for a reason unknown to me, it stops working with some data tables (I have no reproducible example at hand). There is no error, but the data table is not updated. When I step through with debug(), everything seems to work and X is updated inside the function, but the real data.table isn't. If I run the same statement outside the function, it works. Maybe it is related to the scope of the function? I am really struggling with this problem.
Spec: R v3.5.1 and data.table v1.11.4.
EDIT
Based on the comments I figured out that the problem is related to the data.table pointer. You can reproduce the problem with this code:
> save(X, file = "X.RData")
> load("X.RData")
> leftOuterJoin(X, Y, "id")
> X
id L
1: 1 a
2: 2 b
3: 3 <NA>
4: 4 g
5: 5 h
Notice that X is updated but not the way we want it. However, if we use setDT() it works properly:
> load("X.RData")
> setDT(X)
> leftOuterJoin(X, Y, "id")
> X
id L N
1: 1 a NA
2: 2 b NA
3: 3 <NA> 10
4: 4 g NA
5: 5 h 12
Is there a way to set up leftOuterJoin() such that it will not be necessary to run setDT() every time some data is loaded?
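One possible workaround, sketched under the assumption that reassigning the result at the call site is acceptable: have the function return the joined table explicitly. After load(), data.table may have to reallocate the table before it can add columns, and a reallocation that happens inside the function is not visible through the caller's X, whereas the returned object does carry the new columns. The name leftOuterJoin2 is made up for this example.

```r
library(data.table)

# Same join as leftOuterJoin(), but the updated table is returned explicitly
# so the caller can reassign it (leftOuterJoin2 is a hypothetical name).
leftOuterJoin2 <- function(X, Y, onCol) {
  .colsY <- names(Y)
  X[Y, (.colsY) := mget(paste0("i.", .colsY)), on = onCol]
  invisible(X)
}

X <- data.table(id = 1:5, L = letters[1:5])
Y <- data.table(id = 3:5, L = c(NA, "g", "h"), N = c(10, NA, 12))

# Reassigning keeps the caller's X in sync even if the function
# ended up working on a reallocated copy.
X <- leftOuterJoin2(X, Y, "id")
print(X)
```

With this pattern the call reads X <- leftOuterJoin2(X, Y, "id") everywhere, whether or not X was just loaded from disk.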

Matching values in a column using R - excel vlookup

I have a data set in Excel with a lot of vlookup formulas that I am trying to translate into R using the data.table package.
In my example below I am saying, for each row, find the value in column y within column x and return the value in column z.
The first row results in na because the value 6 doesn't exist in column x.
On the second row, the value 5 appears twice in column x, but returning the first match is fine, which is e in this case.
I've added in the result column which is the expected outcome.
library(data.table)
dt <- data.table(x = c(1,2,3,4,5,5),
y = c(6,5,4,3,2,1),
z = c("a", "b", "c", "d", "e", "f"),
Result = c("na", "e", "d", "c", "b", "a"))
Many thanks
You can do this with a join, but you need to change the order first:
setorder(dt, y)
dt[.(x = x, z = z), result1 := i.z, on = .("y" = x)]
setorder(dt, x)
# x y z Result result1
#1: 1 6 a na NA
#2: 2 5 b e e
#3: 3 4 c d d
#4: 4 3 d c c
#5: 5 1 f a a
#6: 5 2 e b b
I haven't tested if this is faster than match for a big data.table, but it might be.
We can use match to find the position of each element of 'y' in 'x' (match returns only the first matching position) and use that index to get the corresponding 'z':
dt[, Result1 := z[match(y,x)]]
dt
# x y z Result Result1
#1: 1 6 a na NA
#2: 2 5 b e e
#3: 3 4 c d d
#4: 4 3 d c c
#5: 5 2 e b b
#6: 5 1 f a a
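The match idiom above can be wrapped in a small base R helper that behaves like an exact-match vlookup. The function name vlookup and its argument names are invented for this sketch.

```r
# Hypothetical helper: for each `key`, find its first position in `lookup`
# and return the element of `result` at that position (NA if not found).
vlookup <- function(key, lookup, result) {
  result[match(key, lookup)]
}

x <- c(1, 2, 3, 4, 5, 5)
y <- c(6, 5, 4, 3, 2, 1)
z <- c("a", "b", "c", "d", "e", "f")

out <- vlookup(key = y, lookup = x, result = z)
print(out)
```

Inside a data.table this would be called as dt[, Result1 := vlookup(y, x, z)].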

Adding two vectors by names

I have two named vectors
v1 <- 1:4
v2 <- 3:5
names(v1) <- c("a", "b", "c", "d")
names(v2) <- c("c", "e", "d")
I want to add them up by the names, i.e. the expected result is
> v3
a b c d e
1 2 6 9 4
Is there a way to programmatically do this in R? Note the names may not necessarily be in a sorted order, like in v2 above.
Just combine the vectors (using c, for example) and use tapply:
v3 <- c(v1, v2)
tapply(v3, names(v3), sum)
# a b c d e
# 1 2 6 9 4
Or, for fun (since you're just doing sum), continuing with "v3":
xtabs(v3 ~ names(v3))
# names(v3)
# a b c d e
# 1 2 6 9 4
I suppose with "data.table" you could also do something like:
library(data.table)
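# Note: ls(pattern = "v\\d") picks up every object matching the pattern,
# so remove v3 first (rm(v3)) if it is still in the workspace.
as.data.table(Reduce(c, mget(ls(pattern = "v\\d"))),
keep.rownames = TRUE)[, list(V2 = sum(V2)), by = V1]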
# V1 V2
# 1: a 1
# 2: b 2
# 3: c 6
# 4: d 9
# 5: e 4
(I shared the latter not so much for "data.table" but to show an automated way of capturing the vectors of interest.)
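The combine-then-tapply idea can also be sketched as a small function that accepts any number of named vectors and returns a plain named vector. The name add_by_name is made up for this example.

```r
# Hypothetical helper: sum any number of named vectors by element name.
add_by_name <- function(...) {
  v <- c(...)                          # concatenate, keeping the names
  sums <- tapply(v, names(v), sum)     # one sum per distinct name
  setNames(as.vector(sums), names(sums))
}

v1 <- 1:4
v2 <- 3:5
names(v1) <- c("a", "b", "c", "d")
names(v2) <- c("c", "e", "d")

v3 <- add_by_name(v1, v2)
print(v3)
```

Because tapply groups by name, the order of the names within each input vector does not matter.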

Finding the Column Index for a Specific Value

I am having a brain cramp. Below is a toy dataset:
df <- data.frame(
id = 1:6,
v1 = c("a", "a", "c", NA, "g", "h"),
v2 = c("z", "y", "a", NA, "a", "g"),
stringsAsFactors = FALSE)
I have a specific value that I want to find across a set of defined columns and I want to identify the position it is located in. The fields I am searching are characters and the trick is that the value I am looking for might not exist. In addition, null strings are also present in the dataset.
Assuming I knew how to do this, the variable position indicates the values I would like returned.
> df
id v1 v2 position
1 1 a z 1
2 2 a y 1
3 3 c a 2
4 4 <NA> <NA> 99
5 5 g a 2
6 6 h g 99
The general rule is that I want to find the position of value "a", and if it is not located or if v1 is missing, then I want 99 returned.
In this instance, I am searching across v1 and v2, but in reality, I have 10 different variables. It is also worth noting that the value I am searching for can only exist once across the 10 variables.
What is the best way to generate this recode?
Many thanks in advance.
Use match:
> df$position <- apply(df,1,function(x) match('a',x[-1], nomatch=99 ))
> df
id v1 v2 position
1 1 a z 1
2 2 a y 1
3 3 c a 2
4 4 <NA> <NA> 99
5 5 g a 2
6 6 h g 99
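The row-wise match call above loops over rows with apply. With 10 columns and many rows, a vectorised variant may be preferable. The sketch below assumes the columns to search are listed by name (search_cols is a name chosen for this example) and treats NA as "not found".

```r
df <- data.frame(
  id = 1:6,
  v1 = c("a", "a", "c", NA, "g", "h"),
  v2 = c("z", "y", "a", NA, "a", "g"),
  stringsAsFactors = FALSE)

search_cols <- c("v1", "v2")       # columns to search, in order

hit <- df[search_cols] == "a"      # logical matrix, NA where the value is missing
hit[is.na(hit)] <- FALSE           # treat missing values as "no match"

# Position of the first TRUE in each row, or 99 when the row has no match
df$position <- ifelse(rowSums(hit) > 0,
                      max.col(hit * 1L, ties.method = "first"),
                      99)
print(df)
```

The positions are relative to search_cols, which matches the numbering used in the question.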
Firstly, drop the first column:
df <- df[, -1]
Then, do something like this (disclaimer: I'm feeling terribly sleepy*):
( df$result <- unlist(lapply(apply(df, 1, grep, pattern = "a"), function(x) ifelse(length(x) == 0, 99, x))) )
v1 v2 result
1 a z 1
2 a y 1
3 c a 2
4 <NA> <NA> 99
5 g a 2
6 h g 99
* sleepy = code is not vectorised
EDIT (slightly different solution, I still feel sleepy):
df$result <- rapply(apply(df, 1, grep, pattern = "a"), function(x) ifelse(length(x) == 0, 99, x))
