data.table merge() with NA in by column - r

I'm trying to join two tables on a column that contains some NA values. Rows whose key is NA should not be matched with each other; each such record should instead appear on its own, padded with NAs, i.e.
Given:
> x = data.table(c(1,2,3,NA,5), c("a","b","c","d","e"))
> x
V1 V2
1: 1 a
2: 2 b
3: 3 c
4: NA d
5: 5 e
> y = data.table(c(NA,2,3,4,5), c("A","B","C","D","E"))
> y
V1 V2
1: NA A
2: 2 B
3: 3 C
4: 4 D
5: 5 E
I want my output to be:
> z = data.table(c(NA,NA,1,2,3,4,5),c("d",NA,"a","b","c",NA,"e"),c(NA,"A",NA,"B","C","D","E"))
> z
V1 V2 V3
1: NA d NA
2: NA NA A
3: 1 a NA
4: 2 b B
5: 3 c C
6: 4 NA D
7: 5 e E
I thought merge() could be used to do this. But I can't get it to produce the output I expect:
> merge(x,y, by=c("V1"), all=TRUE)
V1 V2.x V2.y
1: NA d A
2: 1 a NA
3: 2 b B
4: 3 c C
5: 4 NA D
6: 5 e E
I really don't like that it merges based on the NA value as if it was a match, and when I do this in a larger table with several NA's, it seems to iterate over all possible combinations of column values for V1 and V2 given an NA key. Any help would be appreciated.

The data.frame method of merge has an incomparables argument, which the data.table method of merge doesn't have.
So, using the data.frame method:
merge.data.frame(x, y, by = "V1", all = TRUE, incomparables = NA)
gives the intended result:
V1 V2.x V2.y
1 1 a <NA>
2 2 b B
3 3 c C
4 4 <NA> D
5 5 e E
6 NA d <NA>
7 NA <NA> A
NOTE: According to this GitHub issue, the data.table developers are planning to add an incomparables argument to merge.data.table in the future.
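If the result should stay a data.table, one possible workaround (a sketch, not the approach from the answer above and not an official data.table feature) is to merge only the rows with a non-missing key and then append the NA-key rows unmatched. The names matched, x_na and y_na are made up for this illustration, and the columns are named explicitly here.

```r
library(data.table)

x <- data.table(V1 = c(1, 2, 3, NA, 5), V2 = c("a", "b", "c", "d", "e"))
y <- data.table(V1 = c(NA, 2, 3, 4, 5), V2 = c("A", "B", "C", "D", "E"))

# Full outer join restricted to rows whose key is not NA
matched <- merge(x[!is.na(V1)], y[!is.na(V1)], by = "V1", all = TRUE)

# Rows with an NA key are kept as they are; rename V2 so the columns
# line up with the suffixes that merge() produced
x_na <- x[is.na(V1)]
setnames(x_na, "V2", "V2.x")
y_na <- y[is.na(V1)]
setnames(y_na, "V2", "V2.y")

# Stack everything; fill = TRUE pads the missing columns with NA
z <- rbindlist(list(x_na, y_na, matched), use.names = TRUE, fill = TRUE)
print(z)
```

The subsets x[is.na(V1)] and y[is.na(V1)] are copies, so renaming their columns leaves x and y untouched.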

Related

Index the first and the last rows with NA in a dataframe

I have a large dataset, which contains many NAs. I want to find the rows where the first NA and the last NA appear. For example, for column A, I want the output to be the second row (the last NA before a number) and the fifth row (the first NA after a number). My code, which was shown below, does not work very well.
nonnaindex <- which(!is.na(df))
firstnonna <- apply(nonnaindex, 2, min)
Data:
ID A B C
1 NA NA 3
2 NA 2 2
3 3 3 1
4 4 5 NA
5 NA 6 NA
I believe this function might be what you are looking for:
first_and_last_na_row <- function(DT, col) {
  library(data.table)
  # grp numbers the consecutive runs of NA / non-NA values in `col`
  data.table(DT)[, grp := rleid(is.na(get(col)))][
    , rbind(last(.SD[is.na(get(col)) & grp == min(grp)]),
            first(.SD[is.na(get(col)) & grp == max(grp)]))][
    !is.na(ID)][, grp := NULL][]
}
which returns
first_and_last_na_row(DT, "A")
ID A B C
1: 2 NA 2 2
2: 5 NA 6 NA
first_and_last_na_row(DT, "B")
ID A B C
1: 1 NA NA 3
first_and_last_na_row(DT, "C")
ID A B C
1: 4 4 5 NA
first_and_last_na_row(DT, "D")
Empty data.table (0 rows) of 4 cols: ID,A,B,C
in case of
DT
ID A B C
1: 1 NA NA 3
2: 2 NA 2 2
3: 3 3 3 1
4: 4 4 5 NA
5: 5 NA 6 NA
or
first_and_last_na_row(DT2, "D")
ID A B C D
1: 1 NA NA 3 NA
in case of Akrun's (simplified) example
DT2
ID A B C D
1: 1 NA NA 3 NA
2: 2 NA 2 2 2
3: 3 3 3 1 NA
4: 4 4 5 NA NA
5: 5 NA 6 NA 4
Edit: Faster version using melt()
The OP has commented that his production data set consists of 4000 columns and 192 rows and that he needs the indices to clean another data set. He tried a for loop across all columns which is very slow.
Therefore, I suggest reshaping the data set from wide to long format and using data.table's efficient grouping mechanism:
# reshape from wide to long format
long <- setDT(DT2)[, melt(.SD, id = "ID")][
# add grouping variable to distinguish continuous streaks of NA/non-NA values
# for each variable
, grp := rleid(variable, is.na(value))][
# set sort order just for convenience, not essential
, setorder(.SD, variable, ID)]
long
ID variable value grp
1: 1 A NA 1
2: 2 A NA 1
3: 3 A 3 2
4: 4 A 4 2
5: 5 A NA 3
6: 1 B NA 4
7: 2 B 2 5
8: 3 B 3 5
9: 4 B 5 5
10: 5 B 6 5
11: 1 C 3 6
12: 2 C 2 6
13: 3 C 1 6
14: 4 C NA 7
15: 5 C NA 7
16: 1 D NA 8
17: 2 D 2 9
18: 3 D NA 10
19: 4 D NA 10
20: 5 D 4 11
Now we get the indices of the leading and the trailing NA sequence, respectively, for each variable (if any) by
# starting NA sequence
long[, .(ID = which(is.na(value) & grp == min(grp))), by = variable]
variable ID
1: A 1
2: A 2
3: B 1
4: D 1
# ending NA sequence
long[, .(ID = which(is.na(value) & grp == max(grp))), by = variable]
variable ID
1: A 5
2: C 4
3: C 5
Note that this returns all indices of the starting or ending NA sequences which might be more convenient for subsequent cleaning of another data set. If only the last and first indices are required this can be achieved by
long[long[, is.na(value) & grp == min(grp), by =variable]$V1, .(ID = max(ID)), by = variable]
variable ID
1: A 2
2: B 1
3: D 1
long[long[, is.na(value) & grp == max(grp), by =variable]$V1, .(ID = min(ID)), by = variable]
variable ID
1: A 5
2: C 4
I have tested this approach using a dummy data set of 192 rows times 4000 columns. The whole operation needed less than one second.
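For comparison, the same idea (runs of NA and non-NA values) can be sketched in base R with rle(). This is an illustration only, not part of the answer above; the helper name edge_na_rows is made up, and it returns row positions, which coincide with ID here only because ID equals the row number.

```r
# Hypothetical helper: for one vector, return the position of the last NA in
# the leading NA run and the position of the first NA in the trailing NA run.
edge_na_rows <- function(v) {
  stopifnot(length(v) > 0)
  runs <- rle(is.na(v))  # consecutive runs of NA / non-NA values
  k <- length(runs$lengths)
  c(
    last_leading   = if (runs$values[1]) runs$lengths[1] else NA_integer_,
    first_trailing = if (runs$values[k]) length(v) - runs$lengths[k] + 1L else NA_integer_
  )
}

df <- data.frame(
  ID = 1:5,
  A  = c(NA, NA, 3, 4, NA),
  B  = c(NA, 2, 3, 5, 6),
  C  = c(3, 2, 1, NA, NA)
)

# One result column per data column; NA means there is no such run
res <- sapply(df[c("A", "B", "C")], edge_na_rows)
print(res)
```

A column consisting only of NAs is a single run, so both values are reported for it (the last row and the first row); whether that is the desired behaviour depends on the cleaning step that follows.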

split columns according to sequence numbers

I have a dataset like this:
seq X
1 a
2 b
3 c
1 d
2 e
1 f
2 g
3 h
4 i
5 j
And I would like to split/group the columns according to the assigned seq, like this:
seq X seq1 X1 seq2 X2
1 a 1 d 1 f
2 b 2 e 2 g
3 c NA NA 3 h
NA NA NA NA 4 i
NA NA NA NA 5 j
Thank you in advance
We need to split the data frame first and apply a custom function that merges unequal data frames, i.e.
do.call(cbindPad, split(df, cumsum(df$seq == 1)))
# 1.seq 1.X 2.seq 2.X 3.seq 3.X
#1 1 a 1 d 1 f
#2 2 b 2 e 2 g
#3 3 c NA <NA> 3 h
#4 NA <NA> NA <NA> 4 i
#5 NA <NA> NA <NA> 5 j
where cbindPad was taken from @joran's answer at this post
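The cbindPad function itself is not reproduced in this thread. A minimal sketch of such a helper could look like the following; it is named cbind_pad to make clear that it is a hypothetical stand-in written for this illustration, not joran's original code. It relies on the fact that indexing a data frame past its last row yields NA rows.

```r
# Hypothetical stand-in for cbindPad: column-bind data frames with unequal
# row counts, padding the shorter ones with NA rows.
cbind_pad <- function(...) {
  dfs <- list(...)
  n <- max(vapply(dfs, nrow, integer(1)))
  padded <- lapply(dfs, function(d) {
    d <- d[seq_len(n), , drop = FALSE]  # rows beyond nrow(d) become NA
    rownames(d) <- NULL
    d
  })
  do.call(cbind, padded)
}

df <- data.frame(
  seq = c(1, 2, 3, 1, 2, 1, 2, 3, 4, 5),
  X   = letters[1:10]
)

# A new group starts every time seq returns to 1
res <- do.call(cbind_pad, split(df, cumsum(df$seq == 1)))
print(res)
```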
This was just for exploration. @Sotos, would something like this work? By the way, it involves a lot of transposing, which is not efficient:
df1 = split(df, cumsum(df$seq == 1))
df2 = lapply(df1 , function(x) as.data.frame(t(x)))
#$`1`
# V1 V2 V3
#seq 1 2 3
#X a b c
#$`2`
# V1 V2
#seq 1 2
#X d e
#$`3`
# V1 V2 V3 V4 V5
#seq 1 2 3 4 5
#X f g h i j
data.frame(t(plyr::rbind.fill(df2)))  # rbind.fill() comes from the plyr package
# X1 X2 X3 X4 X5 X6
#V1 1 a 1 d 1 f
#V2 2 b 2 e 2 g
#V3 3 c <NA> <NA> 3 h
#V4 <NA> <NA> <NA> <NA> 4 i
#V5 <NA> <NA> <NA> <NA> 5 j

Relative reference to rows in large data set

I have a very large data set (millions of rows) in which I need to set a row to NA whenever var1 equals "Z". However, I also need to set to NA the row immediately preceding each row with var1 == "Z".
E.g.:
id var1
1 A
1 B
1 Z
1 S
1 A
1 B
2 A
2 B
3 A
3 B
3 A
3 B
4 A
4 B
4 A
4 B
In this case, the second row and the third row for id==1 should be NA.
I have tried a loop but it doesn't work as the data set is very large.
for (i in 1:length(df$var1)){
if(df$var1[i] =="Z"){
df[i,] <- NA
df[(i-1),] <-- NA
}
}
I have also tried to use data.table package unsuccessfully. Do you have any idea of how I could do it or what is the right term to look for info on what I am trying to do?
Maybe do it like this using data.table:
df <- as.data.table(read.table(header=T, file='clipboard'))
df$var1 <- as.character(df$var1)
#find where var1 == Z
index <- df[, which(var1 == 'Z')]
#add the previous lines too
index <- c(index, index-1)
#convert to NA
df[index, var1 := NA ]
Or in one call:
df[c(which(var1 == 'Z'), which(var1 == 'Z') - 1), var1 := NA ]
Output:
> df
id var1
1: 1 A
2: 1 NA
3: 1 NA
4: 1 S
5: 1 A
6: 1 B
7: 2 A
8: 2 B
9: 3 A
10: 3 B
11: 3 A
12: 3 B
13: 4 A
14: 4 B
15: 4 A
16: 4 B
If you want to take the preceding indices into account only if they belong to the same id, I would suggest using the combination of .I and by, which makes sure that you are not picking up indices from the previous id
setDT(df)[, var1 := as.character(var1)]
indx <- df[, {indx <- which(var1 == "Z") ; .I[c(indx - 1L, indx)]}, by = id]$V1
df[indx, var1 := NA_character_]
df
# id var1
# 1: 1 A
# 2: 1 NA
# 3: 1 NA
# 4: 1 S
# 5: 1 A
# 6: 1 B
# 7: 2 A
# 8: 2 B
# 9: 3 A
# 10: 3 B
# 11: 3 A
# 12: 3 B
# 13: 4 A
# 14: 4 B
# 15: 4 A
# 16: 4 B
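The same per-id rule can also be sketched with shift(), which looks at the next row within each group. This is an alternative illustration rather than the method of the answer above; it assumes a data.table version that provides fifelse(), and the small data set is made up so that a "Z" sits in the first row of an id.

```r
library(data.table)

df <- data.table(
  id   = c(1, 1, 1, 1, 2, 2),
  var1 = c("A", "B", "Z", "S", "Z", "B")
)

# Within each id, set a row to NA if it is "Z" or if the next row is "Z".
# fill = "" gives the last row of each group an empty "next" value, so a
# "Z" at the start of one id never affects the last row of the previous id.
df[, var1 := fifelse(
  var1 == "Z" | shift(var1, type = "lead", fill = "") == "Z",
  NA_character_, var1
), by = id]

print(df)
```

In this example row 4 ("S", the last row of id 1) is kept even though the next row overall is a "Z", because that "Z" belongs to id 2.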
You can also use a base R approach (with df as a plain data.frame):
x = df$var1 == 'Z'
df[x | c(x[-1], FALSE), 'var1'] <- NA
# id var1
#1 1 A
#2 1 <NA>
#3 1 <NA>
#4 1 S
#5 1 A
#6 1 B
#7 2 A
#8 2 B
#9 3 A
#10 3 B
#11 3 A
#12 3 B
#13 4 A
#14 4 B
#15 4 A
#16 4 B

data.table merging in R

I'm using data.table_1.9.4 and the merge function seems to not work as expected. What am I doing wrong over here?
a
letter num_num
1: a 1
2: b 2
3: c 3
4: d 4
5: e 5
6: f 6
> b
letter num_num
1: a 3
2: b 4
3: c 5
4: d 6
5: e 5
6: f 5
> merge(as.data.frame(a),as.data.frame(b),by='letter',all=TRUE)
letter num_num.x num_num.y
##....Works as expected
> merge(a,b,by='letter',all=TRUE)
Error in setcolorder(dt, c(setdiff(names(dt), end), end)) :
neworder is length 2 but x has 3 columns.

r - data.table join and then add all columns from one table to another

My question is essentially the same as this question: data.table join then add columns to existing data.frame without re-copy.
Basically I have a template with keys and I want to assign columns from other data.tables to the template by the same keys.
> template
id1 id2
1: a 1
2: a 2
3: a 3
4: a 4
5: a 5
6: b 1
7: b 2
8: b 3
9: b 4
10: b 5
> x
id1 id2 value
1: a 2 0.01649728
2: a 3 -0.27918482
3: b 3 0.86933718
> y
id1 id2 value
1: a 4 -1.163439
2: b 4 2.267872
3: b 5 1.083258
> template[x, value := i.value]
> template[y, value := i.value]
> template
id1 id2 value
1: a 1 NA
2: a 2 0.01649728
3: a 3 -0.27918482
4: a 4 -1.16343917
5: a 5 NA
6: b 1 NA
7: b 2 NA
8: b 3 0.86933718
9: b 4 2.26787248
10: b 5 1.08325793
>
But if x and y have say 100 columns, then it is not possible to write out the value := i.value syntax for all columns. Is there a way to do the same thing but for all the columns in x and y?
EDIT:
If I do y[x[template]], then it creates separate value columns, which is not intended:
> y[x[template]]
id1 id2 value value.1
1: a 1 NA NA
2: a 2 NA 0.01649728
3: a 3 NA -0.27918482
4: a 4 -1.163439 NA
5: a 5 NA NA
6: b 1 NA NA
7: b 2 NA NA
8: b 3 NA 0.86933718
9: b 4 2.267872 NA
10: b 5 1.083258 NA
>
Just create a function that takes names as arguments and constructs the expression for you. And then eval it each time by passing the names of each data.table you require. Here's an illustration:
get_expr <- function(x) {
# 'x' is the names vector
expr = paste0("i.", x)
expr = lapply(expr, as.name)
setattr(expr, 'names', x)
as.call(c(quote(`:=`), expr))
}
> get_expr('value') ## generates the required expression
# `:=`(value = i.value)
template[x, eval(get_expr("value"))]
template[y, eval(get_expr("value"))]
# id1 id2 value
# 1: a 1 NA
# 2: a 2 0.01649728
# 3: a 3 -0.27918482
# 4: a 4 -1.16343900
# 5: a 5 NA
# 6: b 1 NA
# 7: b 2 NA
# 8: b 3 0.86933718
# 9: b 4 2.26787200
# 10: b 5 1.08325800
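A variant that avoids building the call by hand is to assign a character vector of column names with mget() inside an update join. The sketch below is an illustration under assumptions: the helper name update_from, the extra column other, and the sample values are made up, and on = is used instead of keys.

```r
library(data.table)

template <- CJ(id1 = c("a", "b"), id2 = 1:5)
x <- data.table(id1 = c("a", "a", "b"), id2 = c(2L, 3L, 3L),
                value = c(0.1, 0.2, 0.3), other = c("p", "q", "r"))
y <- data.table(id1 = c("a", "b", "b"), id2 = c(4L, 4L, 5L),
                value = c(1.1, 1.2, 1.3), other = c("s", "t", "u"))

# Hypothetical helper: copy every non-key column of `src` into `target`
# by reference, for the rows that match on `keys`.
update_from <- function(target, src, keys) {
  cols <- setdiff(names(src), keys)
  target[src, (cols) := mget(paste0("i.", cols)), on = keys]
  invisible(target)
}

update_from(template, x, c("id1", "id2"))
update_from(template, y, c("id1", "id2"))
print(template)
```

Rows of template without a match in either table keep NA in the new columns, and the second call only overwrites the rows that match y.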
