I am trying to use data.table to recode a variable based on certain conditions. My original dataset has around 30M records and, after all variable creation, around 130 variables. I used the methods suggested here: conditional statements in data.table (M1) and also here: data.table: Proper way to do create a conditional variable when column names are not known? (M2)
My goal is to get the equivalent of the code below, but using data.table:
samp$lf5 <- samp$loadfactor5
samp$lf5 <- with(samp, ifelse(loadfactor5 < 0, 0, lf5))
I will admit that I don't understand .SD and .SDCols very well, so I might be using it wrong. The code and errors from (M1) and (M2) are given below and the sample dataset is here: http://goo.gl/Jp97Wn
(M1)
samp[,lf5 = if(loadfactor5 <0) 0 else loadfactor5]
Error Message
Error in `[.data.table`(samp, , lf5 = if (loadfactor5 < 0) 0 else loadfactor5) :
unused argument (lf5 = if (loadfactor5 < 0) 0 else loadfactor5)
When I do this:
samp[,list(lf5 = if(loadfactor5 <0) 0 else loadfactor5)]
it gives lf5 as a list but not as part of the samp data.table and does not really apply the condition as lf5 still has values less than 0.
(M2)
Col1 <- "loadfactor5"
Col2 <- "lf5"
setkeyv(samp,Col1)
samp[,(Col2) :=.SD,.SDCols = Col1][Col1<0,(Col2) := .SD, .SDcols = 0]
I get the following error
Error in `[.data.table`(samp, , `:=`((Col2), .SD), .SDCols = Col1) :
unused argument (.SDCols = Col1)
Any insights on how to finish this appreciated. My dataset has 30M records so I am hoping to use data.table to really cut the run time down.
Thanks,
Krishnan
Answer provided by eddi and included here for the sake of completeness.
samp[, lf5 := ifelse(loadfactor5 < 0, 0, loadfactor5)]
Another way (which I prefer because it's, in my opinion, cleaner):
samp[, lf5 := 0]; samp[loadfactor5 > 0, lf5 := loadfactor5];
I use data.table with a dataset with 90M rows; I am continually amazed at how fast data.table is for operations like the above.
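For reference, here is a minimal self-contained sketch of the same recode on a made-up table (the values in `samp` are invented for illustration and are not the linked sample data). It assumes a data.table version that provides `fifelse` (1.12.4 or later, as far as I know). It also shows `pmax` and the case where the column names are held in variables, which is what (M2) was attempting. The "unused argument" error in (M2) comes from spelling the `.SDcols` argument as `.SDCols`.

```r
library(data.table)

# Hypothetical stand-in for the real 30M-row table
samp <- data.table(loadfactor5 = c(-2, -0.5, 1.5, 2))

# Vectorised recode by reference; fifelse() is data.table's typed ifelse()
samp[, lf5 := fifelse(loadfactor5 < 0, 0, loadfactor5)]

# Flooring at zero can also be written with pmax()
samp[, lf5_pmax := pmax(loadfactor5, 0)]

# Column names stored in variables: wrap the target name in () on the
# left of := and use get() to look up the source column
Col1 <- "loadfactor5"
Col2 <- "lf5_named"   # hypothetical name, chosen so the columns can be compared
samp[, (Col2) := get(Col1)][get(Col1) < 0, (Col2) := 0]

print(samp)
```

All three new columns should hold the same values; which form to use is mostly a matter of taste, although `fifelse` requires the "yes" and "no" values to have the same type.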
Related
I need a fast way to conditionally compute the difference, in days, between two dates in a data.table. I managed to do it with an ifelse statement, but it is slow on big objects. Is there a faster, more elegant way of achieving the same, perhaps using data.table commands like ":="? Thx. J.
library(lubridate)
library(data.table)
rm(list = ls())
a <- as.Date(c ("2021-09-27", "2019-10-30", "2021-09-05"))
b <- as.Date(c ("2020-06-14", "2019-09-15", "2020-09-23"))
c <- as.Date(c ("2022-07-12", "2020-09-23", "2021-06-19"))
new <- data.table(leave = a, start = b, end = c)
new$days <- ifelse (
new$leave < new$end,
new$leave - new$start,
new$end - new$start)
So in words: when the leaving date < end of period, subtract start from leave; however, if leave >= end, subtract start from end. The result, in days, goes into a new column.
Using the pmin() function and data.table's assign operator :=
new[, days := as.numeric(pmin(leave, end)-start)]
Or you could assign it to all rows one way then chain off a subset:
new[, days := as.numeric(end - start)][leave < end, days := as.numeric(leave - start)]
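Or take advantage of by and the .GRP special symbol (this relies on both groups, leave < end and leave >= end, being present in the data so that .GRP lines up with the list positions):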
new[, days := list(leave-start, end-start)[.GRP], keyby=.(leave>=end)]
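As a quick sanity check, the vectorised versions can be compared against the original ifelse logic on the question's three rows. This is only a sketch: the column names `days_ifelse`, `days_pmin` and `days_sub` are made up here so the results can sit side by side.

```r
library(data.table)

new <- data.table(
  leave = as.Date(c("2021-09-27", "2019-10-30", "2021-09-05")),
  start = as.Date(c("2020-06-14", "2019-09-15", "2020-09-23")),
  end   = as.Date(c("2022-07-12", "2020-09-23", "2021-06-19"))
)

# Reference result: the original ifelse() logic, kept as plain numbers
new[, days_ifelse := ifelse(leave < end,
                            as.numeric(leave - start),
                            as.numeric(end - start))]

# pmin() picks the earlier of leave/end row by row, so one subtraction suffices
new[, days_pmin := as.numeric(pmin(leave, end) - start)]

# Assign the default to every row, then overwrite only the rows where leave < end
new[, days_sub := as.numeric(end - start)]
new[leave < end, days_sub := as.numeric(leave - start)]

# Both comparisons should report TRUE
print(all.equal(new$days_ifelse, new$days_pmin))
print(all.equal(new$days_ifelse, new$days_sub))
```

Wrapping the differences in as.numeric() keeps every column a plain number of days rather than a difftime, which makes the comparison straightforward.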
When using a list column of data.tables in a nested data.table it is easy to apply a function over the column. Example:
dt<- data.table(mtcars)[, list(dt.mtcars = list(.SD)), by = gear]
We can use:
dt[ ,list(length = nrow(dt.mtcars[[1]])), by = gear]
gear length
1: 4 12
2: 3 15
3: 5 5
or
dt[, list( length = lapply(dt.mtcars, nrow)), by = gear]
gear length
1: 4 12
2: 3 15
3: 5 5
I would like to do the same process and apply a modification by reference using the operator := to each data.table of the column.
Example:
modify_by_ref<- function(d){
d[, max_hp:= max(hp)]
}
dt[, modify_by_ref(dt.mtcars[[1]]), by = gear]
That returns the error:
Error in `[.data.table`(d, , `:=`(max_hp, max(hp))) :
.SD is locked. Using := in .SD's j is reserved for possible future use; a tortuously flexible way to modify by group. Use := in j directly to modify by group by reference.
Using the tip in the error message does not work for me in any way; it seems to target another case, but maybe I am missing something. Is there a recommended way or a flexible workaround to modify list columns by reference?
This can be done in the following two steps or in a single step:
The given table is:
dt<- data.table(mtcars)[, list(dt.mtcars = list(.SD)), by = gear]
Step 1 - Let's add a list column that holds, for each row of dt, the hp vector of its nested table
dt[, hp_vector := .(list(dt.mtcars[[1]][, hp])), by = list(gear)]
Step 2 - Now calculate the max of hp
dt[, max_hp := max(hp_vector[[1]]), by = list(gear)]
Single step - this is simply the combination of both of the above steps:
dt[, max_hp := .(list(max(dt.mtcars[[1]][, hp])[[1]])), by = list(gear)]
If we wish to populate values within the nested table by reference, the following link describes how to do it, except that a warning message has to be ignored. I would be happy if anyone could point out how to fix the warning or whether there is a pitfall. For more detail, please refer to the link:
https://stackoverflow.com/questions/48306010/how-can-i-do-fast-advance-data-manipulation-in-nested-data-table-data-table-wi/48412406#48412406
Taking inspiration from that answer, I am going to show how to do it here for the given data set.
Let's first clean everything:
rm(list = ls())
Let's re-define the given table in a different way:
dt<- data.table(mtcars)[, list(dt.mtcars = list(data.table(.SD))), by = list(gear)]
Note that I have defined the table slightly differently: I have wrapped .SD in data.table() in addition to list() in the above definition.
Next, populate the max by reference within nested table:
dt[, dt.mtcars := .(list(dt.mtcars[[1]][, max_hp := max(hp)])), by = list(gear)]
And, as a bonus, we can then perform further manipulation within the nested table:
dt[, dt.mtcars := .(list(dt.mtcars[[1]][, weighted_hp_carb := max_hp*carb])), by = list(gear)]
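If true by-reference modification is not essential, one alternative sketch is to rebuild the list column with lapply, working on a copy of each nested table. This is not the approach from the linked answer: it trades extra memory for sidestepping the locked .SD error, and it assumes each nested table fits comfortably in memory twice.

```r
library(data.table)

# Nested table: one row per gear, with that gear's rows stored in a list column.
# copy(.SD) is used so that every group stores its own independent table.
dt <- as.data.table(mtcars)[, .(dt.mtcars = list(copy(.SD))), by = gear]

# Rebuild the list column: copy() returns an unlocked table that := can modify
dt[, dt.mtcars := lapply(dt.mtcars, function(d) {
  d <- copy(d)
  d[, max_hp := max(hp)]                    # per-gear maximum, repeated on each row
  d[, weighted_hp_carb := max_hp * carb]    # derived column that uses the new one
  d
})]

# Inspect the first nested table
print(head(dt$dt.mtcars[[1]][, .(hp, carb, max_hp, weighted_hp_carb)]))
```

Because the whole list column is reassigned, the outer table stays consistent even if the nested tables are not updated in place.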
ddply has a .progress to get a progress bar while it's running, is there an equivalent for data.table in R?
Yes, you can use any progress status you want.
library(data.table)
dt = data.table(a=1:4, b=c("a","b"))
dt[, {cat("group:",b,"\n"); sum(a)}, b]
#group: a
#group: b
# b V1
#1: a 4
#2: b 6
If you are asking about progress while loading a CSV file with fread, it is displayed automatically for bigger datasets. Also, as mentioned by Sergey in a comment, you can use the verbose argument to get more information, both in fread and in [.data.table.
If you want the percentage of groups processed:
grpn = uniqueN(dt$b)
dt[, {cat("progress",.GRP/grpn*100,"%\n"); sum(a)}, b]
#progress 50 %
#progress 100 %
# b V1
#1: a 4
#2: b 6
Following up on @jangorecki's excellent answer, here's a way to use a text progress bar:
library(data.table)
dt = data.table(a=1:4, b=c("a","b"))
grpn = uniqueN(dt$b)
pb <- txtProgressBar(min = 0, max = grpn, style = 3)
dt[, {setTxtProgressBar(pb, .GRP); Sys.sleep(0.5); sum(a)}, b]
close(pb)
Following up on @jangorecki and other great answers, you can use the data.table symbol .NGRP instead of calculating grpn as in the other answers:
dt[, {cat("progress",.GRP/.NGRP*100,"%\n"); sum(a)}, b]
Following up again on @jangorecki's great answer.
If you don't want to spam your terminal too much, you can write an external function equivalent to jangorecki's that does a modulus check and only prints if .GRP is divisible by a certain number "mod". Note: I had trouble putting the if directly inside the data.table curly brackets, although j accepts ordinary R expressions there, so an inline if should work as well; the helper mainly keeps j short.
progress = function(.GRP, grpn, mod) {
if(!(.GRP %% mod)) {
cat("progress", .GRP/grpn*100,"%\n")
}
}
Then call it as below. Here I use mod = 1000, so it only prints the percentage every 1000 groups.
dt[, {progress(.GRP, grpn, 1000); sum(a)}, b]
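For comparison, here is a small sketch of the same throttled message written inline, without a helper. As far as I can tell, an `if` inside the braces of j is evaluated like any other R expression. The table and the `every` interval are made up for illustration, and `.NGRP` assumes a data.table version that provides it (the answer above already relies on it).

```r
library(data.table)

# Toy table with three groups
dt <- data.table(a = 1:6, b = rep(c("x", "y", "z"), 2))

every <- 2L  # hypothetical interval: report on every 2nd group

res <- dt[, {
  # Print only when the group counter is a multiple of `every`
  if (.GRP %% every == 0L) cat("progress", round(.GRP / .NGRP * 100), "%\n")
  sum(a)
}, by = b]

print(res)
```

The last expression inside the braces is what j returns, so the progress message does not affect the result.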
I am using data.table in R and looping over my table; it's really slow because of my table size.
I wonder if someone have any idea on
I have a set of values that I want to "cluster".
Each line has a position, a positive integer. You can load a simple view of it:
library(data.table)
#Here is a toy example
fulltable=c(seq (1,4))*c(seq(1,1000,10))
fulltable=data.table(pos=fulltable[order(fulltable)])
fulltable$id=1
So I loop over my lines, and when the gap between two consecutive positions is more than 50, I change the group:
#here is the main loop
lastposition=fulltable[1]$pos
lastid=fulltable[1]$id
for(i in 2:nrow(fulltable)){
if(fulltable[i]$pos-50>lastposition){
lastid=lastid+1
print(lastid)
}
fulltable[i]$id=lastid;
lastposition=fulltable[i]$pos
}
Any idea for an effi
fulltable[which((c(fulltable$pos[-1], NA) - fulltable$pos) > 50) + 1, new_group := 2:(.N+1)]
fulltable[is.na(new_group), new_group := 1]
fulltable[, c("lastid_new", "new_group") := list(cummax(new_group), NULL)]
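A shorter way to express the same grouping, offered as a sketch rather than a drop-in replacement, is a cumulative sum over the "gap larger than 50" flags. It assumes the table is already sorted by pos and has at least one row.

```r
library(data.table)

# Same toy data as in the question, sorted by position
fulltable <- data.table(pos = sort(seq(1, 4) * seq(1, 1000, 10)))

# diff(pos) > 50 flags rows whose gap to the previous row exceeds 50;
# the first row has no predecessor, so it is padded with FALSE.
# cumsum() then counts how many such gaps have been seen so far.
fulltable[, id := 1L + cumsum(c(FALSE, diff(pos) > 50))]

print(head(fulltable))
print(fulltable[, .N, by = id])
```

Each new group starts exactly where the loop would increment lastid, but the work is done in one vectorised pass instead of row-by-row subsetting.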
I have a dataframe with millions of rows and three columns labeled Keywords, Impressions, Clicks. I'd like to add a column with values depending on the evaluation of this function:
isType <- function(Impressions, Clicks)
{
if (Impressions >= 1 & Clicks >= 1){return("HasClicks")} else if (Impressions >=1 & Clicks == 0){return("NoClicks")} else {return("ZeroImp")}
}
So far so good. I then try this to create the column, but 1) it takes forever and 2) it marks all the rows as "HasClicks", even the ones where it shouldn't.
# Creates a dataframe
Type <- data.frame()
# Loops until last row and store it in data.frame
for (i in c(1:dim(Mydf)[1])) {Type <- rbind(Type,isType(Mydf$Impressions[i], Mydf$Clicks[i]))}
# Add the column to Mydf
Mydf <- transform(Mydf, Type = Type)
input data:
Keywords,Impressions,Clicks
"Hello",0,0
"World",1,0
"R",34,23
Wanted output:
Keywords,Impressions,Clicks,Type
"Hello",0,0,"ZeroImp"
"World",1,0,"NoClicks"
"R",34,23,"HasClicks"
Building on Joshua's solution, I find it cleaner to generate Type in a single shot (note however that this presumes Clicks >= 0...)
Mydf$Type = ifelse(Mydf$Impressions >= 1,
ifelse(Mydf$Clicks >= 1, 'HasClicks', 'NoClicks'), 'ZeroImp')
First, the if/else block in your function will return a warning like this when the condition has more than one element (as of R 4.2.0 it is an error):
Warning message:
In if (1:2 > 2:3) TRUE else FALSE :
the condition has length > 1 and only the first element will be used
which explains why all the rows are the same.
Second, you should allocate your data.frame and fill in the elements rather than repeatedly combining objects together. I imagine this is causing your long run-times.
EDIT: My shared code. I'd love for someone to provide a more elegant solution.
Mydf <- data.frame(
Keywords = sample(c("Hello","World","R"),20,TRUE),
Impressions = sample(0:3,20,TRUE),
Clicks = sample(0:3,20,TRUE) )
Mydf$Type <- "ZeroImp"
Mydf$Type <- ifelse(Mydf$Impressions >= 1 & Mydf$Clicks >= 1,
"HasClicks", Mydf$Type)
Mydf$Type <- ifelse(Mydf$Impressions >= 1 & Mydf$Clicks == 0,
"NoClicks", Mydf$Type)
This is a case where arithmetic can be cleaner and most likely faster than nested ifelse statements.
Again building on Joshua's solution:
Mydf$Type <- factor(with(Mydf, (Impressions>=1)*(1 + (Clicks>=1)) + 1),
    levels=1:3, labels=c("ZeroImp","NoClicks","HasClicks"))
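If data.table is an option, the three-way classification can also be sketched with `fcase`, which evaluates the conditions in order and falls back to a default. This assumes a data.table version that includes `fcase` (1.13.0 or later, to my knowledge). The table below just reproduces the three input rows from the question.

```r
library(data.table)

Mydf <- data.table(
  Keywords    = c("Hello", "World", "R"),
  Impressions = c(0, 1, 34),
  Clicks      = c(0, 0, 23)
)

# Conditions are checked top to bottom; rows matching none get the default
Mydf[, Type := fcase(
  Impressions >= 1 & Clicks >= 1, "HasClicks",
  Impressions >= 1 & Clicks == 0, "NoClicks",
  default = "ZeroImp"
)]

print(Mydf)
```

Unlike the row-by-row loop, this evaluates each condition once over the whole column, and the default branch covers any row where Impressions is below 1.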