Swapping values between two columns using data.table - r

I have been racking my brain over translating this question to a data.table solution (to keep it simple I'll use the same data set).
When V2 == "b" I want to swap the values between V1 <-> V3.
dt <- data.table(V1=c(1,2,4), V2=c("a","a","b"), V3=c(2,3,1))
#V1 V2 V3
#1: 1 a 2
#2: 2 a 3
#3: 4 b 1
The code below is a working solution for a data.frame. However, given the amount of frustration this has caused me (because I was using a data.table without realising it), I'm now determined to find a solution for data.table.
dt <- data.table(V1=c(1,2,4), V2=c("a","a","b"), V3=c(2,3,1))
df <- as.data.frame(dt)
df[df$V2 == "b", c("V1", "V3")] <- df[df$V2 == "b", c("V3", "V1")]
# V1 V2 V3
#1 1 a 2
#2 2 a 3
#3 1 b 4
I have tried writing an lapply function looping through my target swapping list, tried to narrow the problem down to replacing only one value, and attempted to call the column names in different ways, but all without success.
This was the closest attempt I've managed to get:
> dt[dt$V2 == "b", c("V1", "V3")] <- dt[dt$V2 == "b", c(V3, V1)]
#Warning messages:
#1: In `[<-.data.table`(`*tmp*`, dt$V2 == "b", c("V1", "V3"), value = c(1, :
# Supplied 2 items to be assigned to 1 items of column 'V1' (1 unused)
#2: In `[<-.data.table`(`*tmp*`, dt$V2 == "b", c("V1", "V3"), value = c(1, :
# Supplied 2 items to be assigned to 1 items of column 'V3' (1 unused)
How can we get the data.table solution?

We can try
dt[V2=="b", c("V3", "V1") := .(V1, V3)]

For amusement only. @akrun's solution is clearly superior. I reasoned that I could create a temporary copy of the column, make the conditional swap, and then delete the copy, all using [.data.table operations in sequence:
dt[, tv1 := V1][V2=="b", V1 := V3][V2=="b", V3 := tv1][ , tv1 := NULL]
> dt
V1 V2 V3
1: 1 a 2
2: 2 a 3
3: 1 b 4

Related

Using .BY, .GRP or other methods to add a multicolumn aggregation with data.table

Say we have this toy data.table example:
temp <- data.table(V=c("A", "B", "C", "D","A"), GR=c(1,1,1,2,2))
"V" "GR"
A 1
B 1
C 1
D 2
A 2
I would like to generate all ordered combinations of V (using combn) within each subset defined by GR, and build from them a new data.table with an additional column holding the grouping factor.
For example, for GR=1 we have (A,B),(A,C),(B,C)
for GR=2 we have (D,A)
If I create the result manually it would be
cbind(V=c(1,1,1,2),rbind(t(combn(c("A", "B", "C"),2)),t(combn(c( "D","A"),2))))
1 A B
1 A C
1 B C
2 D A
But I would like to do it with data.table easily instead.
These two options don't work:
temp[,cbind(rep(.GRP,.N),as.data.frame(t(combn(V,2)))),by=GR]
temp[,cbind(rep(.BY,.N),as.data.frame(t(combn(V,2)))),by=GR]
This one works, but I don't understand why. I'm afraid it could copy the whole GR column as-is instead of the proper value.
temp[,.(GR,as.list(as.data.frame((combn(V,2))))),by=GR]
And I guess there should be a shorter way to write it.
This works:
> temp[, {v_comb = combn(V,2); .(v_comb[1,], v_comb[2,])}, by=GR]
GR V1 V2
1: 1 A B
2: 1 A C
3: 1 B C
4: 2 D A
In general, I would avoid, when possible, reshaping operations such as cbind(), rep(), as.data.frame() or t() inside the data.table call. It takes a lot of trial and error to figure out the right way to do it, and it produces code that is very hard to maintain.
On the other hand, using code blocks {...} improves the readability of the code.
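As a small extension of that idea (a sketch, not part of the original answer), the block can also guard against groups with a single row, which would otherwise make combn() fail; groups for which j returns NULL are simply dropped:
library(data.table)
temp <- data.table(V=c("A","B","C","D","A","E"), GR=c(1,1,1,2,2,3))  # GR=3 has only one row
temp[, if (.N >= 2) {v_comb = combn(V, 2); .(v_comb[1,], v_comb[2,])}, by = GR]
#    GR V1 V2
# 1:  1  A  B
# 2:  1  A  C
# 3:  1  B  C
# 4:  2  D  A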
This uses data.table, though not all within [] using .BY or .GRP.
library(data.table)
temp <- data.table(V=c("A", "B", "C", "D","A"), GR=c(1,1,1,2,2))
tempfunc <- function(x){
  dat <- as.data.table(t(combn(temp[GR == x, V], 2)))
  dat[, GR := x]
  setcolorder(dat, c("GR", "V1", "V2"))
  dat[]
}
rbindlist(lapply(unique(temp$GR), tempfunc))
GR V1 V2
1: 1 A B
2: 1 A C
3: 1 B C
4: 2 D A
Here are two other approaches which also work if there is a group with just one row, e.g., row 6 below:
library(data.table)
temp <- data.table(V=c("A", "B", "C", "D","A","E"), GR=c(1,1,1,2,2,3))
temp
V GR
1: A 1
2: B 1
3: C 1
4: D 2
5: A 2
6: E 3
Using combinat::combn2
temp[, as.data.table(combinat::combn2(V)), by = GR]
GR V1 V2
1: 1 A B
2: 1 A C
3: 1 B C
4: 2 D A
Using a non-equi join (V is converted to factor first, because non-equi join conditions do not work on character columns):
temp[, V := factor(V)][temp, on = .(GR, V < V), .(GR, x.V, i.V),
     nomatch = 0L, allow.cartesian = TRUE]
GR x.V i.V
1: 1 A B
2: 1 A C
3: 1 B C
4: 2 A D
I have one solution but it seems long and complex too.
temp[,do.call(c, apply(t(combn(V,2)), 2, list)),by=GR]
I've also found that combn is 10 times slower than some specialized packages such as iterpc or combinat:
temp[, do.call(c, apply(combinat::combn2(V), 2, list)), by=GR]
You must also first filter out any group having just one row because otherwise it would cause an error.
And this is my final version, which is much faster and needs much less memory:
temp[, .(from = rep(V, (.N-1):0), to = V[unlist(sapply(2:.N, seq, .N, simplify = TRUE))]), by = GR]
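For reference, on the original two-group temp from the question this returns (a quick check; note it still assumes every group has at least two rows):
library(data.table)
temp <- data.table(V=c("A", "B", "C", "D","A"), GR=c(1,1,1,2,2))
temp[, .(from = rep(V, (.N-1):0), to = V[unlist(sapply(2:.N, seq, .N))]), by = GR]
#    GR from to
# 1:  1    A  B
# 2:  1    A  C
# 3:  1    B  C
# 4:  2    D  A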

Finding factors that correspond to more than one value

Suppose, that one has the following dataframe:
x=data.frame(c(1,1,2,2,2,3),c("A","A","B","B","B","B"))
names(x)=c("v1","v2")
x
v1 v2
1 1 A
2 1 A
3 2 B
4 2 B
5 2 B
6 3 B
In this data frame I want each value in v1 to correspond to a single label in v2. However, as one can see in this example, B has more than one corresponding value.
Is there any elegant and fast way to find which labels in v2 correspond to more than one value in v1?
Ideally the result should show the values, which in our example should be c(2,3), as well as the row numbers, which in our example should be r=c(5,6).
Assuming that we want the index of the unique elements in 'v1', grouped by 'v2', for the groups that have more than one unique element, we create a logical index with ave and use it to subset the rows of 'x'.
i1 <- with(x, ave(v1, v2, FUN = function(x)
       length(unique(x)) > 1 & !duplicated(x, fromLast = TRUE))) != 0
x[i1,]
# v1 v2
#5 2 B
#6 3 B
Or a faster option is data.table
library(data.table)
i1 <- setDT(x)[, .I[uniqueN(v1)>1 & !duplicated(v1, fromLast=TRUE)], v2]$V1
x[i1, 'v1', with = FALSE][, rn := i1][]
# v1 rn
#1: 2 5
#2: 3 6
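An equivalent formulation (a sketch, not from the original answer) that also keeps the v2 label in the result, by recording the row numbers as a column up front instead of joining them back in afterwards:
library(data.table)
x <- data.frame(v1 = c(1,1,2,2,2,3), v2 = c("A","A","B","B","B","B"))
# setDT converts x by reference; rn holds the original row number
setDT(x)[, rn := .I][, .SD[uniqueN(v1) > 1 & !duplicated(v1, fromLast = TRUE)], by = v2]
#    v2 v1 rn
# 1:  B  2  5
# 2:  B  3  6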

data.table join with roll = "nearest" returns "search value" instead of original value

I have a problem with the binary search function J() and roll = "nearest".
Let's say I have this example data.table "dt":
Key Value1 Value2
20 4 5
12 2 1
55 10 7
I do a search with roll = "nearest":
dt[J(15), roll = "nearest"]
...which returns:
Key Value1 Value2
15 2 1
Thus, the correct row is returned. However, the original "Key" value (12) is replaced by the value used in the search (15).
My question is: is that normal behaviour, and can one change this auto override?
EDIT:
Reproducible Example (Note I use version 1.9.7):
library("data.table")
dt <- data.table(c(20,12,55), c(4,2,10), c(5,1,7))
dt
# V1 V2 V3
#1: 20 4 5
#2: 12 2 1
#3: 55 10 7
setkey(dt, V1)
dt[J(15), roll = "nearest"]
# V1 V2 V3
#1: 15 2 1
You probably need data.table 1.9.7 to make x.V1 work. It lets you refer to a column from the x dataset explicitly. This is required because the columns used in the join are taken from the second dataset, i, as in base R.
library("data.table")
dt <- data.table(c(20,12,55), c(4,2,10), c(5,1,7))
setkey(dt, V1)
dt[J(15), .(V1=x.V1, V2, V3), roll = "nearest"]
# V1 V2 V3
#1: 12 2 1
As you mention, you already have 1.9.7; for others who don't, see the Installation wiki.
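Another workaround (a sketch, not from the original answer) is to ask the join only for the matching row numbers via which = TRUE and then subset the original table, which leaves every column untouched:
library("data.table")
dt <- data.table(c(20,12,55), c(4,2,10), c(5,1,7))
setkey(dt, V1)
# which = TRUE returns the row indices of dt that J(15) rolls to
idx <- dt[J(15), roll = "nearest", which = TRUE]
dt[idx]
#    V1 V2 V3
# 1: 12  2  1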

trying to subset a data table in R by removing items that are in a 2nd table

I have two data frames (from a csv file) in R as such:
df1 <- data.frame(V1 = 1:9, V2 = LETTERS[1:9])
df2 <- data.frame(V1 = 1:3, V2 = LETTERS[1:3])
I convert both to data.table as follows:
dt1 <- data.table(df1, key="V1")
dt2 <- data.table(df2, key="V1")
I want to now return a table that looks like dt1 but without any rows where the key is found in dt2. So in this instance I would like to get back:
4 D
5 E
...
9 I
I'm using the following code in R:
dt3 <- dt1[!dt2$V1]
This works on this example; however, when I try it on a large data set (100k) it does not work. It only removes 2 rows, and I know it should be a lot more than that. Is there a limit to this type of operation, or something else I haven't considered?
Drop the column name "V1" to do a not-join. The tables are already keyed by V1.
dt3 <- dt1[!dt2]
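A quick check on the example data (since both tables are keyed on V1, the expression is a not-join rather than a row-number lookup):
library(data.table)
dt1 <- data.table(data.frame(V1 = 1:9, V2 = LETTERS[1:9]), key="V1")
dt2 <- data.table(data.frame(V1 = 1:3, V2 = LETTERS[1:3]), key="V1")
dt1[!dt2]
#    V1 V2
# 1:  4  D
# 2:  5  E
# 3:  6  F
# 4:  7  G
# 5:  8  H
# 6:  9  I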
Because the tables are keyed, you can do this with a "not-join"
dt1 <- data.table(rep(1:3,2), LETTERS[1:6], key="V1")
# V1 V2
# 1: 1 A
# 2: 1 D
# 3: 2 B
# 4: 2 E
# 5: 3 C
# 6: 3 F
dt2 <- data.table(1:2, letters[1:2], key="V1")
# V1 V2
# 1: 1 a
# 2: 2 b
dt1[!.(dt2$V1)]
# V1 V2
# 1: 3 C
# 2: 3 F
According to the documentation, . or J should not be necessary, since the ! alone is enough:
All types of i may be prefixed with !. This signals a not-join or not-select should be performed.
However, the OP's code does not work:
dt1[!(dt2$V1)]
# V1 V2
# 1: 2 B
# 2: 2 E
# 3: 3 C
# 4: 3 F
In this case, dt2$V1 is read as a vector of row numbers, not as part of a join. It looks like this is what is meant by a "not-select", but I think the documentation could be more explicit: when I read the sentence above, for all I know "not-select" and "not-join" are two terms for the same thing.
You could try:
dt1[!(dt1$V1 %in% dt2$V1)]
This assumes that you don't care about ordering.
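A quick check (a sketch; note this approach does not require the tables to be keyed):
library(data.table)
dt1 <- data.table(V1 = 1:9, V2 = LETTERS[1:9])   # no key set
dt2 <- data.table(V1 = 1:3, V2 = LETTERS[1:3])
dt1[!(dt1$V1 %in% dt2$V1)]
#    V1 V2
# 1:  4  D
# 2:  5  E
# 3:  6  F
# 4:  7  G
# 5:  8  H
# 6:  9  I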

How to swap values between two columns

I have a data frame with three variables and 250K records. As an example consider
df <- data.frame(V1=c(1,2,4), V2=c("a","a","b"), V3=c(2,3,1))
V1 V2 V3
1 a 2
2 a 3
4 b 1
and want to swap values between V1 and V3 based on the value of V2 as follows:
if V2 == 'b' then V1 <- V3 and V3 <- V1
resulting in
V1 V2 V3
1 a 2
2 a 3
1 b 4
I tried a do loop but it takes forever. If I use Perl, it takes seconds. I believe this task can be done efficiently in R as well. Any suggestions are appreciated.
Try this
df <- data.frame(V1=c(1,2,4), V2=c("a","a","b"), V3=c(2,3,1))
df[df$V2 == "b", c("V1", "V3")] <- df[df$V2 == "b", c("V3", "V1")]
which yields:
> df
V1 V2 V3
1 1 a 2
2 2 a 3
3 1 b 4
You can use transform to do this.
df <- transform(df, V3 = ifelse(V2 == 'b', V1, V3), V1 = ifelse(V2 == 'b', V3, V1))
Edited: I got tripped up with the column names, sorry. This works.
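The reason this works (a brief note, not from the original answer) is that transform() evaluates all of its arguments against the original data frame before replacing any column, so both ifelse() calls see the pre-swap V1 and V3:
df <- data.frame(V1=c(1,2,4), V2=c("a","a","b"), V3=c(2,3,1))
df <- transform(df, V3 = ifelse(V2 == 'b', V1, V3),
                    V1 = ifelse(V2 == 'b', V3, V1))
df
#   V1 V2 V3
# 1  1  a  2
# 2  2  a  3
# 3  1  b  4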
If you don't mind the rows ending up in a different order, this is kind of a 'cute' way to do it:
dat <- read.table(textConnection("V1 V2 V3
1 a 2
2 a 3
4 b 1"),sep = "",header = TRUE)
tmp <- dat[dat$V2 == 'b',3:1]
colnames(tmp) <- colnames(dat)
rbind(dat[dat$V2 != 'b',],tmp)
Basically, that's just grabbing the rows where V2 == 'b', reversing the columns, and slapping the result back together with everything else. This can be extended if you have more columns that don't need switching; you'd just use an integer index with the swapped positions transposed, rather than simply 3:1, as sketched below.
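A small sketch of that extension, using hypothetical 4-column data where only V1 and V4 are swapped and V2, V3 stay in place:
dat <- data.frame(V1 = c(1,2,4), V2 = c("a","a","b"),
                  V3 = c(9,9,9), V4 = c(2,3,1))
tmp <- dat[dat$V2 == 'b', c(4, 2, 3, 1)]   # positions of V1 and V4 transposed
colnames(tmp) <- colnames(dat)
rbind(dat[dat$V2 != 'b', ], tmp)
#   V1 V2 V3 V4
# 1  1  a  9  2
# 2  2  a  9  3
# 3  1  b  9  4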
