replace row values based on another row value in a data.table - r

I have a trivial question, though I am struggling to find a simple answer. I have a data table that looks something like this:
dt <- data.table(id = rep(c("A", "B", "C"), each = 4), time = rep(1:4, 3), score = c(NA, 10, 15, 13, NA, 25, NA, NA, 18, 29, NA, 19))
dt
# id time score
# 1: A 1 NA
# 2: A 2 10
# 3: A 3 15
# 4: A 4 13
# 5: B 1 NA
# 6: B 2 25
# 7: B 3 NA
# 8: B 4 NA
# 9: C 1 18
# 10: C 2 29
# 11: C 3 NA
# 12: C 4 19
I would like to replace the missing values of my group "B" with the values of "A".
The final dataset should look something like this
final
# id time score
# 1: A 1 NA
# 2: A 2 10
# 3: A 3 15
# 4: A 4 13
# 5: B 1 NA
# 6: B 2 25
# 7: B 3 15
# 8: B 4 13
# 9: C 1 18
# 10: C 2 29
# 11: C 3 NA
# 12: C 4 19
In other words, wherever B's score is NA, I would like to fill in the corresponding score from "A". Do note that "C" remains NA.
I am struggling to find a clean way to do this using data.table. However, if it is simpler with other methods it would still be ok.
Thanks a lot for your help

Here is one option: we get the index of the rows which are NA for 'score' when 'id' is "B", and use that to replace the NAs with the corresponding 'score' values from 'A'.
library(data.table)
i1 <- setDT(dt)[id == 'B', which(is.na(score))]
dt[, score := replace(score, id == 'B' & is.na(score), score[which(id == 'A')[i1]])]
Or a similar option in dplyr
library(dplyr)
dt %>%
  mutate(score = replace(score, id == "B" & is.na(score),
                         score[which(id == "A")[i1]]))
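The index-based replace above assumes A's and B's rows line up in the same time order. A join-based alternative that matches on time instead may be more robust; a self-contained sketch (a_scores is a hypothetical helper name):

```r
library(data.table)

dt <- data.table(id   = rep(c("A", "B", "C"), each = 4),
                 time = rep(1:4, times = 3),
                 score = c(NA, 10, 15, 13,  NA, 25, NA, NA,  18, 29, NA, 19))

# look up A's score for each time point
a_scores <- dt[id == "A", .(time, a_score = score)]

# update join on time: patch only rows where id is "B" and score is NA
dt[a_scores, on = "time",
   score := fifelse(id == "B" & is.na(score), i.a_score, score)]
```

B's score at time 1 stays NA because A's score at time 1 is NA too, matching the desired output.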

rbindlist only elements that meet a condition

I have a large list. Some of the elements are strings and some of the elements are data.tables. I would like to create a big data.table, but only rbind the elements that are data.tables.
I know how to do it in a for loop, but I am looking for something more efficient as my data are big and I need something quick.
Thank you!
library(data.table)
DT1 = data.table(
ID = c("b","b","b","a","a","c"),
a = 1:6
)
DT2 = data.table(
ID = c("b","b","b","a","a","c"),
a = 11:16
)
list <- list(DT1, DT2, "string")
I am looking for a result similar to the following, but since I have many entries I cannot write it out like this:
rbind(DT1, DT2)
Filter the data.table and rbind
library(data.table)
rbindlist(Filter(is.data.table, list_df))
# ID a
# 1: b 1
# 2: b 2
# 3: b 3
# 4: a 4
# 5: a 5
# 6: c 6
# 7: b 11
# 8: b 12
# 9: b 13
#10: a 14
#11: a 15
#12: c 16
data
list_df <- list(DT1,DT2,"string")
We can use keep from purrr with bind_rows
library(tidyverse)
keep(list, is.data.table) %>%
bind_rows
# ID a
# 1: b 1
# 2: b 2
# 3: b 3
# 4: a 4
# 5: a 5
# 6: c 6
# 7: b 11
# 8: b 12
# 9: b 13
#10: a 14
#11: a 15
#12: c 16
Or using rbindlist with keep
rbindlist(keep(list, is.data.table))
Using sapply() to generate a logical vector to subset your list
rbindlist(list[sapply(list, is.data.table)])
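One caveat worth knowing with any of these approaches: rbindlist() expects compatible columns by default. If the data.tables in the list differ in their columns, fill = TRUE pads the gaps with NA; a self-contained sketch with made-up tables:

```r
library(data.table)

DT1 <- data.table(ID = c("b", "a"), a = 1:2)
DT2 <- data.table(ID = "c", b = 10)   # different second column
lst <- list(DT1, "a string", DT2)

# Filter() keeps only the data.tables; fill = TRUE pads missing columns with NA
res <- rbindlist(Filter(is.data.table, lst), fill = TRUE)
res
#    ID  a  b
# 1:  b  1 NA
# 2:  a  2 NA
# 3:  c NA 10
```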

How to do a complex wide-to-long operation for network analysis

I have survey data that includes who the respondent is (iAmX), who they work with (withX), how frequently they work with each partner (freqX), and how satisfied they are with each partner (likeX). Participants can select multiple options for who they are and who they work with.
I would like to go from something like this, with one row per respondent:
df <- read.table(header=T, text='
id iAmA iAmB iAmC withA withB withC freqA freqB freqC likeA likeB likeC
1 X X NA X X NA 3 2 NA 3 2 NA
2 NA NA X X NA NA 5 NA NA 5 NA NA
')
To something like this, with one row per combination, where "from" is who the actor is and "to" is who they work with:
goal <- read.table(header=T, text='
id from to freq like
1 A A 3 3
1 B A 3 3
1 A B 2 2
1 B B 2 2
2 C A 5 5
')
I have tried some melt, gather, and reshape functions but frankly I think I'm just not up to the logic puzzle today. I would really appreciate some help!
Although I must admit I have not fully understood OP's logic, the code below reproduces the expected goal.
The key points here are data.table's incarnation of the melt() function which is able to reshape multiple measure columns simultaneously and the cross join function CJ().
library(data.table)
# reshape multiple measure columns simultaneously
cols <- c("iAm", "with", "freq", "like")
long <- melt(setDT(df), measure.vars = patterns(cols),
value.name = cols, variable.name = "to")[
# rename factor levels
, to := forcats::fct_relabel(to, function(x) LETTERS[as.integer(x)])]
# create combinations for each id
combi <- long[, CJ(from = na.omit(to[iAm == "X"]), to = na.omit(to[with == "X"])), by = id]
# join to append freq and like
result <- combi[long, on = .(id, to), nomatch = 0L][, -c("iAm", "with")]
# reorder result
setorder(result, id)
result
id from to freq like
1: 1 A A 3 3
2: 1 B A 3 3
3: 1 A B 2 2
4: 1 B B 2 2
5: 2 C A 5 5
The intermediate results are
long
id to iAm with freq like
1: 1 A X X 3 3
2: 2 A <NA> X 5 5
3: 1 B X X 2 2
4: 2 B <NA> <NA> NA NA
5: 1 C <NA> <NA> NA NA
6: 2 C X <NA> NA NA
and
combi
id from to
1: 1 A A
2: 1 A B
3: 1 B A
4: 1 B B
5: 2 C A
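The only dependency beyond data.table in the answer above is forcats, used for relabeling the factor levels. The same pipeline runs with base indexing instead; a self-contained sketch, with the question's data rebuilt directly as a data.table:

```r
library(data.table)

# the question's survey data, built directly as a data.table
df <- data.table(id = 1:2,
                 iAmA = c("X", NA), iAmB = c("X", NA), iAmC = c(NA, "X"),
                 withA = c("X", "X"), withB = c("X", NA),
                 withC = c(NA_character_, NA_character_),
                 freqA = c(3, 5), freqB = c(2, NA), freqC = c(NA_real_, NA_real_),
                 likeA = c(3, 5), likeB = c(2, NA), likeC = c(NA_real_, NA_real_))

cols <- c("iAm", "with", "freq", "like")
long <- melt(df, measure.vars = patterns("iAm", "with", "freq", "like"),
             value.name = cols, variable.name = "to")
# base-R alternative to forcats::fct_relabel(): map factor codes 1,2,3 to A,B,C
long[, to := LETTERS[as.integer(to)]]

combi <- long[, CJ(from = na.omit(to[iAm == "X"]),
                   to   = na.omit(to[with == "X"])), by = id]
result <- combi[long, on = .(id, to), nomatch = 0L][, -c("iAm", "with")]
setorder(result, id)
```

This reproduces the five goal rows: (1, A, A), (1, B, A), (1, A, B), (1, B, B), and (2, C, A).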

Refer to previous row in data.table in R, with a condition

I have a new problem with this data, because my full data has the following form:
a <- data.table(A = 1:10, B = c(1,2,0,2,0,0,3,4,0,2), C = c(2,3,1,4,5,3,6,7,2,2), d = c(1,1,1,1,1,2,2,2,2,2))
# A B C d
# 1: 1 1 2 1
# 2: 2 2 3 1
# 3: 3 0 1 1
# 4: 4 2 4 1
# 5: 5 0 5 1
# 6: 6 0 3 2
# 7: 7 3 6 2
# 8: 8 4 7 2
# 9: 9 0 2 2
#10: 10 2 2 2
Now, I want to create a new column D that multiplies A by the B/C ratio of the closest previous row in which B is not 0, calculated within each group d. For example, in line 2 I can calculate D = 2*(1/2). However, in line 4 it has to be 4*(2/3); it cannot be 4*(0/1).
I use
a[, D := {i1 <- (NA^!B)
list(A*shift(na.locf(i1*B))/shift(na.locf(i1*C)))}, by = d]
as akrun recommended yesterday, but it does not work when I calculate it by group. The result should look like this:
A B C d D
# 1: 1 1 2 1 NA
# 2: 2 2 3 1 1.000000
# 3: 3 0 1 1 2.000000
# 4: 4 2 4 1 2.666667
# 5: 5 0 5 1 2.500000
# 6: 6 0 3 2 NA
# 7: 7 3 6 2 3.500000
# 8: 8 4 7 2 4.571429
# 9: 9 0 2 2 5.142857
# 10: 10 2 2 2 NA
Does anyone know what the problem is here? The error is "longer object length is not a multiple of shorter object length".
We can replace the elements in 'B', 'C' that corresponds to '0' value in 'B' as NA. Use na.locf from zoo to replace those NA values with the previous non-NA elements, shift the elements (by default, it gives a lag of 1), divide the modified columns 'B' with 'C' and then multiply by 'A'. Assign (:=) the output to a new column 'D'.
library(zoo)
a[B==0, c('B', 'C'):=list(NA, NA)]
a[, c('B', 'C'):= na.locf(.SD), .SDcols=B:C]
a[, D:= {tmp <- shift(.SD[, 2:3, with=FALSE])
A*(tmp[[1]]/tmp[[2]])}]
Or we can make it compact. We get a logical vector (!B) that checks for '0' elements in 'B', convert that to a vector of 1s and NA (NA^), multiply with columns 'B' and 'C' so that the 1s are replaced by the corresponding elements in those columns whereas NA remains as such. Do the na.locf (as before), shift and then do the multiplication/division.
a[, D:= {i1 <- (NA^!B)
list( A*shift(na.locf(i1*B))/shift(na.locf(i1*C)))}]
Or instead of calling shift/na.locf two times
a[, D:= {i1 <- (NA^!B)
tmp <- shift(na.locf(i1*.SD))
a[['A']]*(tmp[[1]]/tmp[[2]])}, .SDcols=B:C]
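For the by-group case the OP asks about, here is a zoo-free sketch of the same idea using data.table's own nafill() (available in data.table >= 1.12.4) in place of na.locf(). Note that rows whose group has no earlier non-zero B stay NA:

```r
library(data.table)

a <- data.table(A = 1:10,
                B = c(1, 2, 0, 2, 0, 0, 3, 4, 0, 2),
                C = c(2, 3, 1, 4, 5, 3, 6, 7, 2, 2),
                d = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2))

a[, D := {
  i1 <- NA^(B == 0)   # 1 where B != 0, NA where B == 0
  A * shift(nafill(i1 * B, type = "locf")) /
      shift(nafill(i1 * C, type = "locf"))
}, by = d]
# D: NA 1 2 2.666667 2.5  NA NA 4 5.142857 5.714286
```

Rows 6 and 7 are NA because group 2 starts with B = 0, so there is no previous non-zero B within the group to carry forward.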
This can be done with a rolling join:
a[, row := .I]
a[, B/C, by=row][V1 != 0][a, A*shift(V1), on="row", roll=TRUE]
# [1] NA 1.000000 2.000000 2.666667 2.500000 3.000000 3.500000 4.000000
# [9] 5.142857 5.714286

Empty factors in "by" data.table

I have a data.table that has factor column with empty levels. I need to get the row count and sums of other variables, all grouped by multiple factors, including the one with empty levels.
My question is similar to this one, but here I need to count for multiple factors.
For example, let data.table be:
library('data.table')
dtr <- data.table(v1=sample(1:15),
v2=factor(sample(letters[1:3], 15, replace = TRUE),levels=letters[1:5]),
v3=sample(c("yes", "no"), 15, replace = TRUE))
I want to do the following:
dtr[,list(freq=.N,mm=sum(v1,na.rm=T)),by=list(v2,v3)]
#Output is:
v2 v3 freq mm
1: b yes 4 22
2: b no 1 13
3: c no 3 10
4: a no 4 49
5: c yes 1 10
6: a yes 2 16
I want output include empty levels for v2 as well ("d" and "e"), like in table(dtr$v2,dtr$v3), so the final output should look like (the order doesn't matter):
v2 v3 freq mm
1: b yes 4 22
2: b no 1 13
3: c no 3 10
4: a no 4 49
5: c yes 1 10
6: a yes 2 16
7: d yes 0 0
8: d no 0 0
9: e yes 0 0
10: e no 0 0
I tried to use the method from the link, but I'm not sure how to use the J() join function when multiple columns are involved.
This works fine for grouping by 1 column only:
setkey(dtr,v2)
dtr[J(levels(v2)),list(freq=.N,mm=sum(v1,na.rm=T))]
However, dtr[J(levels(v2),v3),list(freq=.N,mm=sum(v1,na.rm=T))] doesn't include all combinations
library(data.table)
set.seed(42)
dtr <- data.table(v1=sample(1:15),
v2=factor(sample(letters[1:3], 15, replace = TRUE),levels=letters[1:5]),
v3=sample(c("yes", "no"), 15, replace = TRUE))
res <- dtr[,list(freq=.N,mm=sum(v1,na.rm=T)),by=list(v2,v3)]
You can use CJ (a cross join). Doing this after aggregation avoids setting the key for the big table and should be faster.
setkey(res, v2, v3)
res[CJ(levels(dtr[,v2]),unique(dtr[,v3])),]
# v2 v3 freq mm
# 1: a no 1 9
# 2: a yes 2 11
# 3: b no 2 11
# 4: b yes 3 23
# 5: c no 4 40
# 6: c yes 3 26
# 7: d no NA NA
# 8: d yes NA NA
# 9: e no NA NA
# 10: e yes NA NA
table() will also capture freq values that are zero. To get the "mm" column, you could do a basic join. For example,
library(data.table)
set.seed(42)
dtr <- data.table(v1=sample(1:15),
v2=factor(sample(letters[1:3], 15, replace = TRUE),levels=letters[1:5]),
v3=sample(c("yes", "no"), 15, replace = TRUE))
res <- as.data.table(dtr[,table(v2,v3)])
setnames(res,'N','freq')
setkey(res,v2,v3)
setkey(dtr,v2,v3)
res <- dtr[,.(mm=sum(v1,na.rm=TRUE)),by=c('v2','v3')][res]
I'm not sure how table() benchmarks against the cross join approach.
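If you want 0 rather than NA for the empty groups, as in the question's desired output, one option is to overwrite the NAs after the cross join; a sketch (full is my name for the completed table):

```r
library(data.table)
set.seed(42)
dtr <- data.table(v1 = sample(1:15),
                  v2 = factor(sample(letters[1:3], 15, replace = TRUE),
                              levels = letters[1:5]),
                  v3 = sample(c("yes", "no"), 15, replace = TRUE))

res <- dtr[, .(freq = .N, mm = sum(v1, na.rm = TRUE)), by = .(v2, v3)]
setkey(res, v2, v3)

# cross join of all factor levels with all observed v3 values
full <- res[CJ(v2 = levels(dtr$v2), v3 = unique(dtr$v3))]

# groups absent from the data come back as NA; turn them into zero counts
full[is.na(freq), c("freq", "mm") := .(0L, 0L)]
```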

get rows of unique values by group

I have a data.table and want to pick those rows where the values of a variable x are unique within the groups of another variable y.
It's possible to get the unique values of x, grouped by y in a separate dataset, like this
dt[,unique(x),by=y]
But I want to pick the rows in the original dataset where this is the case. I don't want a new data.table because I also need the other variables.
So, what do I have to add to my code to get the rows in dt for which the above is true?
dt <- data.table(y=rep(letters[1:2],each=3),x=c(1,2,2,3,2,1),z=1:6)
y x z
1: a 1 1
2: a 2 2
3: a 2 3
4: b 3 4
5: b 2 5
6: b 1 6
What I want:
y x z
1: a 1 1
2: a 2 2
3: b 3 4
4: b 2 5
5: b 1 6
The idiomatic data.table way is:
require(data.table)
unique(dt, by = c("y", "x"))
# y x z
# 1: a 1 1
# 2: a 2 2
# 3: b 3 4
# 4: b 2 5
# 5: b 1 6
data.table is a bit different in how to use duplicated. Here's the approach I've seen around here somewhere before:
dt <- data.table(y=rep(letters[1:2],each=3),x=c(1,2,2,3,2,1),z=1:6)
setkey(dt, "y", "x")
key(dt)
# [1] "y" "x"
!duplicated(dt)
# [1] TRUE TRUE FALSE TRUE TRUE TRUE
dt[!duplicated(dt)]
# y x z
# 1: a 1 1
# 2: a 2 2
# 3: b 1 6
# 4: b 2 5
# 5: b 3 4
The simpler data.table solution is to grab the first element of each group
dt[, head(.SD, 1), by=.(y, x)]
y x z
1: a 1 1
2: a 2 2
3: b 3 4
4: b 2 5
5: b 1 6
Thanks to dplyr
library(dplyr)
col1 <- c(1, 1, 3, 3, 5, 6, 7, 8, 9)
col2 <- c("cust1", "cust1", "cust3", "cust4", "cust5", "cust5", "cust5", "cust5", "cust6")
df1 <- data.frame(col1, col2)
df1
distinct(select(df1, col1, col2))
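For completeness, another common data.table idiom keeps the first row number of each group via .I, which preserves all columns and the original row order (first_rows is my name for the helper):

```r
library(data.table)

dt <- data.table(y = rep(letters[1:2], each = 3),
                 x = c(1, 2, 2, 3, 2, 1),
                 z = 1:6)

# .I holds the original row numbers; take the first one in each (y, x) group
first_rows <- dt[, .I[1], by = .(y, x)]$V1
dt[first_rows]
#    y x z
# 1: a 1 1
# 2: a 2 2
# 3: b 3 4
# 4: b 2 5
# 5: b 1 6
```

Unlike the setkey/duplicated approach, this needs no key and leaves dt in its original order.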
