R: Conditionally replace values in data.table column over loop [duplicate]

I want to replace all values in one data.table column (col1).
I tried to do it in a loop, but only a few of the values are replaced.
Example:
#Generate data.table
library(data.table)
dt2 <- data.table(col1=rep(c("a","b","c","d"),each=2), col2=rep(c(1,2,3,4),2),col3=rep(c(1,2,3,4),2))
> dt2
col1 col2 col3
1: a 1 1
2: a 2 2
3: b 3 3
4: b 4 4
5: c 1 1
6: c 2 2
7: d 3 3
8: d 4 4
Change values:
#new col1 elements
new_col1 = c("WT","m1","m9","m10")
#change col1
col1_names = unique(dt$col1)
for (i in 1:length(col1_names)){
dt2[col1==col1_names[i],col1:=new_col1[i]]
}
> dt2
col1 col2 col3
1: WT 1 1
2: WT 2 2
3: m1 3 3
4: m1 4 4
5: c 1 1
6: c 2 2
7: d 3 3
8: d 4 4
The values in col1 are only partially replaced.
Can anybody explain why, or suggest a better way to replace them?
I also tried to replace them all at once, but that does not work properly either:
#generate data.table
dt2 <- data.table(col1=rep(c("a","b","c","d"),each=2), col2=rep(c(1,2,3,4),2),col3=rep(c(1,2,3,4),2))
col1 col2 col3
1: a 1 1
2: a 2 2
3: b 3 3
4: b 4 4
5: c 1 1
6: c 2 2
7: d 3 3
8: d 4 4
#replace values
col1_names = unique(dt$col1)
dt2[col1==col1_names,col1:=new_col1]
> dt2
col1 col2 col3
1: WT 1 1
2: m1 2 2
3: m9 3 3
4: m10 4 4
5: WT 1 1
6: m1 2 2
7: m9 3 3
8: m10 4 4
The values are simply replaced in order, regardless of the condition.

I agree with the lookup-table answer. You might want to try:
library(data.table)
dt2 <- data.table(col1=rep(c("a","b","c","d"),each=2),
col2=rep(c(1,2,3,4),2),
col3=rep(c(1,2,3,4),2))
new_col1 <- c(a="WT",b="m1",c="m9",d="m10") # named vector: names are the old values, elements the new ones
dt2$col1 <- unname(new_col1[dt2$col1]) # look up each old value by name
Or with data.table "update join" syntax:
lookup = data.table(old = c("a","b","c","d"), new = c("WT","m1","m9","m10"))
dt2[lookup, on=.(col1 = old), col1 := i.new ]
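As a side note on the "all at once" attempt in the question: `col1 == col1_names` compares the two vectors position by position (recycling the shorter one), so it is not a lookup. The following is only a minimal sketch of that difference, using `match()` as one possible alternative; the table is rebuilt here so the snippet runs on its own, and `positional` is a name made up for illustration.

```r
library(data.table)

dt2 <- data.table(col1 = rep(c("a", "b", "c", "d"), each = 2),
                  col2 = rep(c(1, 2, 3, 4), 2))
new_col1   <- c("WT", "m1", "m9", "m10")
col1_names <- unique(dt2$col1)

# `==` recycles col1_names (a, b, c, d, a, b, c, d) and compares
# element by element, so only rows 1 and 8 happen to be TRUE
positional <- dt2$col1 == col1_names
print(which(positional))

# match() returns, for every row, the position of its value in
# col1_names; indexing new_col1 with it gives the replacement
dt2[, col1 := new_col1[match(col1, col1_names)]]
print(dt2)
```

Values that do not appear in `col1_names` would come back as NA with this approach, so the update join above is the safer choice when the lookup may be incomplete.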

Related

How to set the values of a column in an R data.table by referring the key and values in a second data.table? [duplicate]

Suppose I have a data.table in R:
> A=data.table(Col1=c(1,4,2,5,6,2,3,5,3,7))
> A
Col1
1: 1
2: 4
3: 2
4: 5
5: 6
6: 2
7: 3
8: 5
9: 3
10: 7
And a key-value data.table B:
> B=data.table(Col1=c(1,2,3,4,5,6,7),Col2=c("A","B","C","D","E","F","G"))
> B
Col1 Col2
1: 1 A
2: 2 B
3: 3 C
4: 4 D
5: 5 E
6: 6 F
7: 7 G
I would like to have Col1 of data.table A reference B and create a new column in A that corresponds to the key-value pairs:
Col1 Col2
1: 1 A
2: 4 D
3: 2 B
4: 5 E
5: 6 F
6: 2 B
7: 3 C
8: 5 E
9: 3 C
10: 7 G
How can I do this in data.table? Thanks
What you are looking for is a join by reference, also called an update join.
This looks up each value of A$Col1 in B$Col1 and assigns the matching B$Col2 to A by reference. In the code this is written as i.Col2, because B sits in the i part of the data.table syntax. It is usually the fastest way to join, but remember that each row of A receives only one value. So if B holds several Col2 values for the same Col1 value, only one of them is kept, and which one depends on how B is ordered; deduplicate B first if that matters.
A[B, Col2 := i.Col2, on = .(Col1)]
Col1 Col2
1: 1 A
2: 4 D
3: 2 B
4: 5 E
5: 6 F
6: 2 B
7: 3 C
8: 5 E
9: 3 C
10: 7 G
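A minimal sketch of the update join with two checks around it: that the key in B is unique, and what happens to a key that B does not contain. The extra value 8 in A is an assumption added purely for illustration; it is not part of the question's data.

```r
library(data.table)

# Same data as in the question, plus one key (8) that B does not know
A <- data.table(Col1 = c(1, 4, 2, 5, 6, 2, 3, 5, 3, 7, 8))
B <- data.table(Col1 = c(1, 2, 3, 4, 5, 6, 7),
                Col2 = c("A", "B", "C", "D", "E", "F", "G"))

# Guard against ambiguous lookups: every key should occur once in B
stopifnot(anyDuplicated(B$Col1) == 0L)

# Update join: adds Col2 to A by reference, row order of A is kept
A[B, on = .(Col1), Col2 := i.Col2]

# Rows of A without a partner in B are left as NA
print(A)
```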
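Using data.table's on argument. Note that this returns a new table ordered by the rows of B rather than adding the column to A in place:
library(data.table)
A[B, on = "Col1"]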
Col1 Col2
1: 1 A
2: 2 B
3: 2 B
4: 3 C
5: 3 C
6: 4 D
7: 5 E
8: 5 E
9: 6 F
10: 7 G

R data.table how to create duplicates [duplicate]

I have:
dataDT <- data.table(A = 1:3, B = 1:3)
dataDT
A B
1: 1 1
2: 2 2
3: 3 3
I want:
dataDT <- data.table(A = c(1:3, 1:3), B = c(1:3, 1:3))
dataDT
A B
1: 1 1
2: 2 2
3: 3 3
4: 1 1
5: 2 2
6: 3 3
i.e. create x copies of the table and append each copy below the last row.
I've tried (results aren't what I need):
dataDT1 <- splitstackshape::expandRows(dataset = dataDT, count = 2, count.is.col = FALSE) # order not correct
dataDT1
A B
1: 1 1
2: 1 1
3: 2 2
4: 2 2
5: 3 3
6: 3 3
Also (results aren't what I need):
dataDT2 <- rbindlist(list(rep(dataDT, 2))) # it creates columns
dataDT2
A B A B
1: 1 1 1 1
2: 2 2 2 2
3: 3 3 3 3
Can anyone recommend a correct and efficient way of doing it?
You can do it with rep:
> x = 2; dataDT[rep(seq_len(nrow(dataDT)), x), ]
A B
1: 1 1
2: 2 2
3: 3 3
4: 1 1
5: 2 2
6: 3 3
or with rbindlist and replicate:
> x = 2; rbindlist(replicate(x, dataDT, simplify = FALSE))
A B
1: 1 1
2: 2 2
3: 3 3
4: 1 1
5: 2 2
6: 3 3
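Both variants can be checked against each other. The sketch below stacks x = 3 copies and also adds a `copy` column that records which copy each row belongs to; that column is an assumption added for illustration and is not something the question asked for.

```r
library(data.table)

dataDT <- data.table(A = 1:3, B = 1:3)
x <- 3L

# Variant 1: repeat the whole sequence of row numbers x times
by_index <- dataDT[rep(seq_len(nrow(dataDT)), times = x)]

# Variant 2: build a list of x copies and bind them row-wise
by_bind <- rbindlist(replicate(x, dataDT, simplify = FALSE))

# Both give the same stacked table
same <- isTRUE(all.equal(by_index, by_bind))
print(same)

# Optional: label each row with the copy it came from
by_index[, copy := rep(seq_len(x), each = nrow(dataDT))]
print(by_index)
```

Using `times =` keeps the original row order within each copy, which is the ordering the question wants; `each =` would give the interleaved result that expandRows produced.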

Why is class(.SD) on a data.table showing "data.frame"?

colnames() seems to enumerate all columns per group as expected, but class() shows exactly two rows per group, and one of them is data.frame!
> dt <- data.table("x"=1:3, "y"=1:3, "z"=1:3, "a"=1:3, "b"=1:3, "c"=1:3, "d"=1:3, "e"=1:3)
> dt[, class(.SD), by=.(x, y, z)]
x y z V1
1: 1 1 1 data.table
2: 1 1 1 data.frame
3: 2 2 2 data.table
4: 2 2 2 data.frame
5: 3 3 3 data.table
6: 3 3 3 data.frame
> dt[, colnames(.SD), by=.(x, y, z)]
x y z V1
1: 1 1 1 a
2: 1 1 1 b
3: 1 1 1 c
4: 1 1 1 d
5: 1 1 1 e
6: 2 2 2 a
7: 2 2 2 b
8: 2 2 2 c
9: 2 2 2 d
10: 2 2 2 e
11: 3 3 3 a
12: 3 3 3 b
13: 3 3 3 c
14: 3 3 3 d
15: 3 3 3 e
.SD stands for Subset of Data, and it is itself a data.table object. Because every data.table is also a data.frame, class(.SD) returns a character vector of length 2 for each group, which is confusing if you expect a single row per group.
To avoid the confusion, you can wrap the result in a list, which forces a single row for each group.
library(data.table)
dt <- data.table(x=1:3, y=1:3)
dt[, .(class = list(class(.SD))), by = x]
# x class
#1: 1 data.table,data.frame
#2: 2 data.table,data.frame
#3: 3 data.table,data.frame
Every data.table is a data.frame, and shows both applicable classes when asked:
> class(dt)
[1] "data.table" "data.frame"
This applies to .SD, too, because .SD is a data.table by definition (.SD is a data.table containing the Subset of x's Data for each group).
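If a single value per group is wanted instead of a list column, the class vector can also be collapsed into one string. This is only a sketch of that idea; the column names `classes` and `is_df` are made up for illustration.

```r
library(data.table)

dt <- data.table(x = 1:3, y = 1:3)

# class(.SD) has length 2, so collapse it to one string per group;
# is.data.frame() confirms that .SD also counts as a data.frame
res <- dt[, .(classes = paste(class(.SD), collapse = ", "),
              is_df   = is.data.frame(.SD)),
          by = x]
print(res)
```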

Relative reference to rows in large data set

I have a very large data set (millions of rows) in which I need to set certain rows to NA whenever var1 equals "Z". However, I also need to set to NA the row immediately preceding each row with var1 == "Z".
E.g.:
id var1
1 A
1 B
1 Z
1 S
1 A
1 B
2 A
2 B
3 A
3 B
3 A
3 B
4 A
4 B
4 A
4 B
In this case, the second row and the third row for id==1 should be NA.
I have tried a loop but it doesn't work as the data set is very large.
for (i in seq_along(df$var1)){
if(df$var1[i] =="Z"){
df[i,] <- NA
df[(i-1),] <- NA
}
}
I have also tried to use data.table package unsuccessfully. Do you have any idea of how I could do it or what is the right term to look for info on what I am trying to do?
Maybe do it like this using data.table:
df <- as.data.table(read.table(header=TRUE, file='clipboard'))
df$var1 <- as.character(df$var1)
#find where var1 == Z
index <- df[, which(var1 == 'Z')]
#add the previous lines too
index <- c(index, index-1)
#convert to NA
df[index, var1 := NA ]
Or in one call:
df[c(which(var1 == 'Z'), which(var1 == 'Z') - 1), var1 := NA ]
Output:
> df
id var1
1: 1 A
2: 1 NA
3: 1 NA
4: 1 S
5: 1 A
6: 1 B
7: 2 A
8: 2 B
9: 3 A
10: 3 B
11: 3 A
12: 3 B
13: 4 A
14: 4 B
15: 4 A
16: 4 B
If you want to take the preceding row into account only when it belongs to the same id, I would suggest combining .I with by, which makes sure that you never pick up an index from the previous id:
setDT(df)[, var1 := as.character(var1)]
indx <- df[, {indx <- which(var1 == "Z") ; .I[c(indx - 1L, indx)]}, by = id]$V1
df[indx, var1 := NA_character_]
df
# id var1
# 1: 1 A
# 2: 1 NA
# 3: 1 NA
# 4: 1 S
# 5: 1 A
# 6: 1 B
# 7: 2 A
# 8: 2 B
# 9: 3 A
# 10: 3 B
# 11: 3 A
# 12: 3 B
# 13: 4 A
# 14: 4 B
# 15: 4 A
# 16: 4 B
You can also use a base R approach:
x <- df$var1 == 'Z' # TRUE where var1 is "Z"
df[x | c(x[-1], FALSE), 'var1'] <- NA # also flag the row just before each "Z"
# id var1
#1 1 A
#2 1 <NA>
#3 1 <NA>
#4 1 S
#5 1 A
#6 1 B
#7 2 A
#8 2 B
#9 3 A
#10 3 B
#11 3 A
#12 3 B
#13 4 A
#14 4 B
#15 4 A
#16 4 B
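Another possible variant of the grouped idea, sketched with data.table's shift(): flag every "Z" row and, within each id, the row directly before it. The data are typed in from the example table in the question, and `is_z` and `hit` are temporary helper columns invented for this sketch.

```r
library(data.table)

# Example data from the question
df <- data.table(
  id   = rep(1:4, times = c(6, 2, 4, 4)),
  var1 = c("A", "B", "Z", "S", "A", "B",
           "A", "B",
           "A", "B", "A", "B",
           "A", "B", "A", "B")
)

# TRUE where var1 is "Z"
df[, is_z := var1 == "Z"]

# Within each id, a row is hit if it is a "Z" row or if the next
# row of the same id is one (lead shift, padded with FALSE)
df[, hit := is_z | shift(is_z, n = 1L, fill = FALSE, type = "lead"), by = id]

# Blank out the flagged rows and drop the helper columns
df[hit == TRUE, var1 := NA_character_]
df[, c("is_z", "hit") := NULL]
print(df)
```

Because the shift is done per id, a "Z" in the first row of a group does not affect the last row of the previous group.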

Inserting a count field for each row by a grouping variable

I have a data set with observations that are both grouped and ordered (by rank). I'd like to add a third variable that is a count of the number of observations for each grouping variable. I'm aware of ways to group and count variables but I can't find a way to re-insert these counts back into the original data set, which has more rows. I'd like to get the variable C in the example table below.
A B C
1 1 3
1 2 3
1 3 3
2 1 4
2 2 4
2 3 4
2 4 4
Here's one way using ave:
DF <- within(DF, {C <- ave(A, A, FUN=length)})
# A B C
# 1 1 1 3
# 2 1 2 3
# 3 1 3 3
# 4 2 1 4
# 5 2 2 4
# 6 2 3 4
# 7 2 4 4
Here is one approach using data.table that makes use of .N, which the data.table help file describes as follows: ".N is an integer, length 1, containing the number of rows in the group."
> library(data.table)
> DT <- data.table(A = rep(c(1, 2), times = c(3, 4)), B = c(1:3, 1:4))
> DT
A B
1: 1 1
2: 1 2
3: 1 3
4: 2 1
5: 2 2
6: 2 3
7: 2 4
> DT[, C := .N, by = "A"]
> DT
A B C
1: 1 1 3
2: 1 2 3
3: 1 3 3
4: 2 1 4
5: 2 2 4
6: 2 3 4
7: 2 4 4
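Since the question mentions having trouble getting the counts back into the original rows, here is a short sketch of two routes side by side: assigning .N directly by group, and computing a separate count table that is then joined back. The names `counts` and `C2` are assumptions used only for illustration.

```r
library(data.table)

DT <- data.table(A = rep(c(1, 2), times = c(3, 4)), B = c(1:3, 1:4))

# Route 1: assign the group size to every row of the group
DT[, C := .N, by = A]

# Route 2: aggregate first (one row per group) ...
counts <- DT[, .(n = .N), by = A]

# ... then write the counts back with an update join
DT[counts, on = .(A), C2 := i.n]
print(DT)
```

Route 1 is shorter; route 2 is mainly useful when the counts come from a different table than the one being updated.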
