How to replace a certain value in one data.table with values of another data.table of same dimension - r

Given two data.table:
dt1 <- data.table(id = c(1,-99,2,2,-99), a = c(2,1,-99,-99,3), b = c(5,3,3,2,5), c = c(-99,-99,-99,2,5))
dt2 <- data.table(id = c(2,3,1,4,3),a = c(6,4,3,2,6), b = c(3,7,8,8,3), c = c(2,2,4,3,2))
> dt1
id a b c
1: 1 2 5 -99
2: -99 1 3 -99
3: 2 -99 3 -99
4: 2 -99 2 2
5: -99 3 5 5
> dt2
id a b c
1: 2 6 3 2
2: 3 4 7 2
3: 1 3 8 4
4: 4 2 8 3
5: 3 6 3 2
How can one replace the -99 of dt1 with the values of dt2?
Wanted results should be dt3:
> dt3
id a b c
1: 1 2 5 2
2: 3 1 3 2
3: 2 3 3 4
4: 2 2 2 2
5: 3 3 5 5

You can do the following:
dt3 <- as.data.frame(dt1)
dt2 <- as.data.frame(dt2)
dt3[dt3 == -99] <- dt2[dt3 == -99]
dt3
# id a b c
# 1 1 2 5 2
# 2 3 1 3 2
# 3 2 3 3 4
# 4 2 2 2 2
# 5 3 3 5 5

If your data is all of the same type (as in your example), converting both tables to matrices is considerably faster and arguably more transparent:
dt1a <- as.matrix(dt1) ## convert to matrix
dt2a <- as.matrix(dt2)
# logical matrix of the same shape, TRUE wherever dt1a holds -99
missing_idx <- dt1a == -99
dt1a[missing_idx] <- dt2a[missing_idx] ## replace the flagged entries (changes the matrix copy, not dt1)
This is a vectorized operation, so it should be fast.
Note: If you do this make sure the two data sources match exactly in shape and order of rows/columns. If they don't then you need to join by the relevant keys and pick the correct columns.
EDIT: The conversion to matrix may be unnecessary. See kath's answer for a more terse solution.
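For completeness, the replacement can also be done without leaving data.table. The following is only a minimal sketch, assuming (as in the note above) that both tables have identical column names and row order. It uses data.table's set() to patch a copy column by column; the name dt3 simply mirrors the question.

```r
library(data.table)

dt1 <- data.table(id = c(1,-99,2,2,-99), a = c(2,1,-99,-99,3),
                  b = c(5,3,3,2,5), c = c(-99,-99,-99,2,5))
dt2 <- data.table(id = c(2,3,1,4,3), a = c(6,4,3,2,6),
                  b = c(3,7,8,8,3), c = c(2,2,4,3,2))

dt3 <- copy(dt1)  # work on a copy so dt1 itself is left untouched
for (col in names(dt3)) {
  rows <- which(dt3[[col]] == -99)   # rows to patch in this column
  if (length(rows)) {
    # set() updates dt3 in place, taking values from the same cells of dt2
    set(dt3, i = rows, j = col, value = dt2[[col]][rows])
  }
}
dt3
```

Because set() modifies its argument in place, dropping the copy() call would overwrite dt1 directly, which may or may not be what you want.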

A simple way is to use the setDF function to convert to data.frame, use data-frame subsetting methods, and restore to data.table at the end.
# Change to data.frame
setDF(dt1)
setDF(dt2)
# Perform assignment
dt1[dt1==-99] = dt2[dt1==-99]
# Restore back to data.table
setDT(dt1)
setDT(dt2)
dt1
# id a b c
# 1 1 2 5 2
# 2 3 1 3 2
# 3 2 3 3 4
# 4 2 2 2 2
# 5 3 3 5 5

This simple trick works efficiently:
dt1 <- as.matrix(dt1)
dt2 <- as.matrix(dt2)
index.replace <- dt1 == -99
dt1[index.replace] <- dt2[index.replace]
# convert back, assigning the result so it is not discarded
dt1 <- as.data.table(dt1)
dt2 <- as.data.table(dt2)

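A simple loop-based approach should also work. On a data.table, dt1[i,j] with a loop variable j is not data.frame-style cell access, so the cells are read with [[ and written with set():
for (i in seq_len(nrow(dt1))) {
for (j in seq_len(ncol(dt1))) {
# compare one cell at a time and copy the matching cell from dt2
if (dt1[[j]][i] == -99) set(dt1, i = i, j = j, value = dt2[[j]][i])
}
}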

Related

create list from columns of data table expression

Consider the following dt:
dt <- data.table(a=c(1,1,2,3),b=c(4,5,6,4))
That looks like that:
> dt
a b
1: 1 4
2: 1 5
3: 2 6
4: 3 4
Here I'm aggregating each column by its unique values and then counting how many times each unique value appears:
> dt[,lapply(.SD,function(agg) dt[,.N,by=agg])]
a.agg a.N b.agg b.N
1: 1 2 4 2
2: 2 1 5 1
3: 3 1 6 1
So 1 appears twice in dt and thus a.N is 2, the same logic goes on for the other values.
But the problem is that if these transformations of the original data.table end up with different dimensions, values get recycled.
For example this dt:
dt <- data.table(a=c(1,1,2,3,7),b=c(4,5,6,4,4))
> dt[,lapply(.SD,function(agg) dt[,.N,by=agg])]
a.agg a.N b.agg b.N
1: 1 2 4 3
2: 2 1 5 1
3: 3 1 6 1
4: 7 1 4 3
Warning message:
In as.data.table.list(jval, .named = NULL) :
Item 2 has 3 rows but longest item has 4; recycled with remainder.
That is no longer the right answer, because b.N should now have only 3 rows and the vector got recycled.
This is why I would like to turn the expression dt[,lapply(.SD,function(agg) dt[,.N,by=agg])] into a list whose items have different lengths, with the item names being the column names of the new transformed dt.
A sketch of what I mean is:
newlist
$a.agg
1 2 3 7
$a.N
2 1 1 1
$b.agg
4 5 6
$b.N
3 1 1
An even better solution would be a data.table that keeps track of the source column in another column:
dt_final
agg N column
1 2 a
2 1 a
3 1 a
7 1 a
4 3 b
5 1 b
6 1 b
Get the data in long format and then aggregate by group.
library(data.table)
dt_long <- melt(dt, measure.vars = c('a', 'b'))
dt_long[, .N, .(variable, value)]
# variable value N
#1: a 1 2
#2: a 2 1
#3: a 3 1
#4: a 7 1
#5: b 4 3
#6: b 5 1
#7: b 6 1
In tidyverse -
library(dplyr)
library(tidyr)
dt %>%
pivot_longer(cols = everything()) %>%
count(name, value)
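The list-shaped result sketched in the question can also be built directly. This is a rough sketch rather than a definitive answer: it counts each column separately with lapply(), so the per-column results keep their own lengths, and then stacks them with rbindlist(). The names counts and dt_final are illustrative (dt_final mirrors the question's sketch).

```r
library(data.table)

dt <- data.table(a = c(1, 1, 2, 3, 7), b = c(4, 5, 6, 4, 4))

# One small count table per column; lapply() keeps them as separate list
# items, so nothing is recycled to a common length.
counts <- lapply(dt, function(col) data.table(agg = col)[, .N, by = agg])

# Stack the list; idcol stores the list names ("a", "b") in a new column.
dt_final <- rbindlist(counts, idcol = "column")
dt_final
```

Here counts is the "list with different dimensions" and dt_final is the long table with a column tracking where each count came from.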

How do we avoid for-loops when we want to conditionally add columns by reference? (condition to be evaluated separately in each row)

I have a data.table with many numbered columns. As a simpler example, I have this:
dat <- data.table(cbind(col1=sample(1:5,10,replace=T),
col2=sample(1:5,10,replace=T),
col3=sample(1:5,10,replace=T),
col4=sample(1:5,10,replace=T)),
oneMoreCol='a')
I want to create a new column as follows: In each row, we add the values in columns from among col1-col4 if the value is not NA or 1.
My current code for this has two for-loops which is clearly not the way to do it:
for(i in 1:nrow(dat)){
dat[i,'sumCol':={temp=0;
for(j in 1:4){if(!is.na(dat[i,paste0('col',j),with=F])&
dat[i,paste0('col',j),with=F]!=1
){temp=temp+dat[i,paste0('col',j),with=F]}};
temp}]}
I would appreciate any advice on how to remove these for-loops. My code is running on a bigger data.table and it takes a long time to run.
A possible solution:
dat[, sumCol := rowSums(.SD * (.SD != 1), na.rm = TRUE), .SDcols = col1:col4]
which gives:
> dat
col1 col2 col3 col4 oneMoreCol sumCol
1: 4 5 5 3 a 17
2: 4 5 NA 5 a 14
3: 2 3 4 3 a 12
4: 1 2 3 4 a 9
5: 4 3 NA 5 a 12
6: 2 2 1 4 a 8
7: NA 2 NA 5 a 7
8: 4 2 2 4 a 12
9: 4 1 5 4 a 13
10: 2 1 5 1 a 7
Used data:
set.seed(20200618)
dat <- data.table(cbind(col1=sample(c(NA, 1:5),10,replace=T),
col2=sample(1:5,10,replace=T),
col3=sample(c(1:5,NA),10,replace=T),
col4=sample(1:5,10,replace=T)),
oneMoreCol='a')
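To make the masking idea behind that one-liner more explicit, here is a small sketch under the same rule (skip NA and 1). It is an alternative spelling, not the answer's exact code: it copies the four columns into a matrix, zeroes the entries that should not count, and sums each row. The hand-made data and the names cols and m are assumptions chosen so the sums can be checked by eye.

```r
library(data.table)

# Small fixed table (not the random data above).
dat <- data.table(col1 = c(4, 1, NA), col2 = c(5, 2, 2),
                  col3 = c(NA, 3, 1), col4 = c(3, 4, 5), oneMoreCol = "a")

cols <- paste0("col", 1:4)
m <- as.matrix(dat[, cols, with = FALSE])  # numeric matrix of the four columns
m[is.na(m) | m == 1] <- 0                  # entries that must not be counted
dat[, sumCol := rowSums(m)]                # add the row totals by reference
dat$sumCol
```

The first row sums 4 + 5 + 3, the second 2 + 3 + 4, and the third 2 + 5.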

cumulative product in R across column

I have a dataframe in the following format
> x <- data.frame("a" = c(1,1),"b" = c(2,2),"c" = c(3,4))
> x
a b c
1 1 2 3
2 1 2 4
I'd like to add 3 new columns which is a cumulative product of the columns a b c, however I need a reverse cumulative product i.e. the output should be
row 1:
result_d = 1*2*3 = 6 , result_e = 2*3 = 6, result_f = 3
and similarly for row 2
The end result will be
a b c result_d result_e result_f
1 1 2 3 6 6 3
2 1 2 4 8 8 4
The column names do not matter; this is just an example. Does anyone have any idea how to do this?
As per my comment, is it possible to do this on a subset of columns? E.g. only for columns b and c, to return:
a b c results_e results_f
1 1 2 3 6 3
2 1 2 4 8 4
so that column "a" is effectively ignored?
One option is to loop through the rows and apply cumprod over the reverse of elements and then do the reverse
nm1 <- paste0("result_", c("d", "e", "f"))
x[nm1] <- t(apply(x, 1,
function(x) rev(cumprod(rev(x)))))
x
# a b c result_d result_e result_f
#1 1 2 3 6 6 3
#2 1 2 4 8 8 4
Or a vectorized option is rowCumprods
library(matrixStats)
x[nm1] <- rowCumprods(as.matrix(x[ncol(x):1]))[,ncol(x):1]
Another base R option uses Reduce with accumulate = TRUE on the reversed columns:
temp = data.frame(Reduce("*", x[NCOL(x):1], accumulate = TRUE))
setNames(cbind(x, temp[NCOL(temp):1]),
c(names(x), c("res_d", "res_e", "res_f")))
# a b c res_d res_e res_f
#1 1 2 3 6 6 3
#2 1 2 4 8 8 4
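For the follow-up about restricting this to columns b and c, one possible sketch is below. It assumes base R only and relies on Reduce() with right = TRUE, which folds from the last column backwards so no explicit reversing is needed. The names cols and rev_prod and the result_ prefix are illustrative.

```r
x <- data.frame(a = c(1, 1), b = c(2, 2), c = c(3, 4))

cols <- c("b", "c")  # only these columns take part; "a" is ignored

# right = TRUE accumulates from the last column towards the first, so the
# i-th element is the product of columns i..n of the chosen subset.
rev_prod <- Reduce(`*`, x[cols], accumulate = TRUE, right = TRUE)

x[paste0("result_", cols)] <- rev_prod
x
```

Changing cols to c("a", "b", "c") should reproduce the three-column result from the original question.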

Identifying unique duplicates in vector in R

I am trying to identify duplicates based on a match of elements in two vectors. Using duplicated() provides a vector of all matches; however, I would like to index which elements match each other. Using the following code as an example:
x <- c(1,6,4,6,4,4)
y <- c(3,2,5,2,5,5)
frame <- data.frame(x,y)
matches <- duplicated(frame) | duplicated(frame, fromLast = TRUE)
matches
[1] FALSE TRUE TRUE TRUE TRUE TRUE
Ultimately, I would like to create a vector that identifies elements 2 and 4 are matches as well as 3,5,6. Any thoughts are greatly appreciated.
Another data.table answer, using the group counter .GRP to assign every distinct element a label:
d <- data.table(frame)
d[,z := .GRP, by = list(x,y)]
# x y z
# 1: 1 3 1
# 2: 6 2 2
# 3: 4 5 3
# 4: 6 2 2
# 5: 4 5 3
# 6: 4 5 3
How about this with plyr::ddply()
ddply(cbind(index=1:nrow(frame),frame),.(x,y),summarise,count=length(index),elems=paste0(index,collapse=","))
x y count elems
1 1 3 1 1
2 4 5 3 3,5,6
3 6 2 2 2,4
NB: the expression cbind(index=1:nrow(frame),frame) just adds an element index to each row.
Using merge against the unique possibilities for each row, you can get a result:
labls <- data.frame(unique(frame),num=1:nrow(unique(frame)))
result <- merge(transform(frame,row = 1:nrow(frame)),labls,by=c("x","y"))
result[order(result$row),]
# x y row num
#1 1 3 1 1
#5 6 2 2 2
#2 4 5 3 3
#6 6 2 4 2
#3 4 5 5 3
#4 4 5 6 3
The result$num vector gives the groups.
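The same grouping can be sketched in base R without any package. This is only an illustration, assuming the pairs can be turned into a text key safely; the names key, group and members and the "_" separator are hypothetical choices.

```r
x <- c(1, 6, 4, 6, 4, 4)
y <- c(3, 2, 5, 2, 5, 5)

key <- paste(x, y, sep = "_")      # one text key per (x, y) pair
group <- match(key, unique(key))   # label pairs in order of first appearance

# Positions belonging to each label; keep only labels that occur more than once.
members <- split(seq_along(group), group)
dupes <- members[lengths(members) > 1]
dupes
```

Here group plays the role of the z and num columns in the answers above, and dupes lists which positions match each other.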

Create counter with multiple variables [duplicate]

I have my data that looks like below:
CustomerID TripDate
1 1/3/2013
1 1/4/2013
1 1/9/2013
2 2/1/2013
2 2/4/2013
3 1/2/2013
I need to create a counter variable, which will be like below:
CustomerID TripDate TripCounter
1 1/3/2013 1
1 1/4/2013 2
1 1/9/2013 3
2 2/1/2013 1
2 2/4/2013 2
3 1/2/2013 1
Tripcounter will be for each customer.
Use ave. Assuming your data.frame is called "mydf":
mydf$counter <- with(mydf, ave(CustomerID, CustomerID, FUN = seq_along))
mydf
# CustomerID TripDate counter
# 1 1 1/3/2013 1
# 2 1 1/4/2013 2
# 3 1 1/9/2013 3
# 4 2 2/1/2013 1
# 5 2 2/4/2013 2
# 6 3 1/2/2013 1
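A self-contained version of the ave approach might look like the sketch below. It rebuilds the question's data and sorts by customer and date first, on the assumption that the dates are month/day/year and that the counter should follow trip order; neither is stated in the question.

```r
mydf <- data.frame(
  CustomerID = c(1, 1, 1, 2, 2, 3),
  TripDate = c("1/3/2013", "1/4/2013", "1/9/2013",
               "2/1/2013", "2/4/2013", "1/2/2013")
)

# Assumption: dates are month/day/year.
mydf$TripDate <- as.Date(mydf$TripDate, format = "%m/%d/%Y")

# Sort so the counter follows the chronological order of each customer's trips.
mydf <- mydf[order(mydf$CustomerID, mydf$TripDate), ]

# seq_along() numbers the rows within each CustomerID group.
mydf$TripCounter <- ave(mydf$CustomerID, mydf$CustomerID, FUN = seq_along)
mydf
```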
For what it's worth, I also implemented a version of this approach in a function included in my "splitstackshape" package. The function is called getanID:
mydf <- data.frame(IDA = c("a", "a", "a", "b", "b", "b", "b"),
IDB = c(1, 2, 1, 1, 2, 2, 2), values = 1:7)
mydf
# install.packages("splitstackshape")
library(splitstackshape)
# getanID(mydf, id.vars = c("IDA", "IDB"))
getanID(mydf, id.vars = 1:2)
# IDA IDB values .id
# 1 a 1 1 1
# 2 a 2 2 1
# 3 a 1 3 2
# 4 b 1 4 1
# 5 b 2 5 1
# 6 b 2 6 2
# 7 b 2 7 3
As you can see from the example above, I've written the function in such a way that you can specify one or more columns that should be treated as ID columns. It checks to see if any of the id.vars are duplicated, and if they are, then it generates a new ID variable for you.
You can also use plyr for this (using @AnandaMahto's example data):
> ddply(mydf, .(IDA), transform, .id = seq_along(IDA))
IDA IDB values .id
1 a 1 1 1
2 a 2 2 2
3 a 1 3 3
4 b 1 4 1
5 b 2 5 2
6 b 2 6 3
7 b 2 7 4
or even:
> ddply(mydf, .(IDA, IDB), transform, .id = seq_along(IDA))
IDA IDB values .id
1 a 1 1 1
2 a 1 3 2
3 a 2 2 1
4 b 1 4 1
5 b 2 5 1
6 b 2 6 2
7 b 2 7 3
Note that plyr does not have a reputation for being the quickest solution. For speed, take a look at data.table.
Here's a data.table approach:
library(data.table)
DT <- data.table(mydf)
DT[, .id := sequence(.N), by = "IDA,IDB"]
DT
# IDA IDB values .id
# 1: a 1 1 1
# 2: a 2 2 1
# 3: a 1 3 2
# 4: b 1 4 1
# 5: b 2 5 1
# 6: b 2 6 2
# 7: b 2 7 3
Meanwhile, you can also use dplyr. If your data.frame is called mydata:
library(dplyr)
mydata %>% group_by(CustomerID) %>% mutate(TripCounter = row_number())
I need to do this often, and wrote a function that accomplishes it differently than the previous answers. I am not sure which solution is most efficient.
idCounter <- function(x) {
unlist(lapply(rle(x)$lengths, seq_len))
}
mydf$TripCounter <- idCounter(mydf$CustomerID)
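One caveat worth illustrating: because rle() works on consecutive runs, this function restarts the counter whenever the ID changes, so it only matches a per-ID counter when the rows are already sorted by ID. A small sketch with made-up IDs (the vector ids is hypothetical) compares it with ave:

```r
idCounter <- function(x) {
  unlist(lapply(rle(x)$lengths, seq_len))
}

ids <- c(1, 1, 2, 1)  # customer 1 reappears after customer 2

by_run <- idCounter(ids)                    # counts within each consecutive run
by_group <- ave(ids, ids, FUN = seq_along)  # counts within each ID overall

rbind(by_run, by_group)
```

The two results differ only in the last position, where the run-based counter starts again at 1 while the group-based counter continues to 3.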
Here's procedural-style code. I don't believe in sayings like "if you are using a loop in R then you are probably doing something wrong".
x <- dataframe$CustomerID
y <- integer(length(x))
count <- 1
for (i in seq_along(x)) {
# restart the counter whenever the ID differs from the previous row
if (i > 1 && x[i] == x[i - 1]) count <- count + 1 else count <- 1
y[i] <- count
}
dataframe$counter <- y
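This isn't the right answer, but it shows something interesting when compared with the for loop: a vectorized call such as sapply evaluates every element against the original res, so it does not see sequential updates and the counter never climbs past 2.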
a<-read.table(textConnection(
"CustomerID TripDate
1 1/3/2013
1 1/4/2013
1 1/9/2013
2 2/1/2013
2 2/4/2013
3 1/2/2013 "), header=TRUE)
a <- a %>%
group_by(CustomerID,TripDate) # must in order
res <- rep(1, nrow(a)) #base # 1
res[2:6] <-sapply(2:6, function(i)if(a$CustomerID[i]== a$CustomerID[i - 1]) {res[i] = res[i-1]+1} else {res[i]= res[i]})
a$TripeCounter <- res

Resources