Using predicted values to make predictions in data.table - r

I'm trying to solve a problem in data.table which requires me to use the value just predicted in the next step of the prediction.
I have the data set up like this, with NA rows generated, ready to receive the predictions. Each NA is calculated by multiplying the value preceding it by the current row's parameter.
library(data.table)
dt <- data.table(
  date = as.Date(paste(rep(c(2015, 2016), each = 12, times = 2), 1:12, 1, sep = "-")),
  val = c(rnorm(12, 50, 5), rep(NA, 12)),
  param1 = runif(48),
  cat = rep(c("a", "b"), each = 24)
)
I can't do it this way
dt[, {
  dt_in <- .SD
  lapply(dt_in[year(date) > 2015, date], function(d){
    dt_sub <- dt_in[date <= d]
    pred <- dt_sub[.N-1, val] * dt_sub[.N, param1]
    dt_in[date == d, val := pred]
  })
}, by = cat]
as trying to update .SD within {} gives me the '.SD is locked...' error. My current solution involves breaking the data.table into a list and updating each list item row by row:
# Create a list of data.tables, one for each category
break_list <- lapply(dt[, unique(cat)], function(c){
  dt[cat == c]
})
l_out <- lapply(break_list, function(dt_in){
  # Select the dates requiring prediction
  lapply(dt_in[year(date) > 2015, date], function(d){
    # Subset by date
    dt_sub <- dt_in[date <= d]
    # Prediction = value from the second-to-last row * parameter in the last row
    pred <- dt_sub[.N-1, val] * dt_sub[.N, param1]
    # Update data.table
    dt_in[date == d, val := pred]
  })
  dt_in
})
dt_out <- rbindlist(l_out)
This works and gives me the desired solution, but it can be slow and feels like I've broken all the data.table rules. Is there a better way?

You are looking to iteratively update rows of a data.table with values computed from rows updated in a previous iteration. It is generally better to find an explicit formulation of the problem that makes the updates independent; in your case that is possible using a helper column holding the cumprod of param1 and a rolling join (dt[dt[...], ..., roll=TRUE]). However, since such a reformulation is not always easy or possible, I will show how to do iterative updates of a data.table efficiently using data.table::set:
setkey(dt, cat, date) # sort by cat first, then by date, to have the reference value used for each calculation in the row above
val_col_nr <- which(colnames(dt)=="val") # set requires a column number
dt[is.na(val), # we want to compute new values for val where val currently is NA
   # .I is a vector of the row numbers (in dt) of each row in .SD
   for (ii in .I) set(dt, i=ii, j=val_col_nr, value=dt[ii,param1]*dt[ii-1L,val]),
   by=cat] # for every 'cat'
You can use identical(dt, setkey(dt_out,cat,date)) to check the result.
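For completeness, here is a sketch of the non-iterative cumprod formulation mentioned above (my addition, not part of the original answer). It assumes you start from the original dt with the NAs still present, and that within each cat the NA rows all come after the observed rows, as in your setup; with the data already sorted within each cat, the cumprod alone is enough and no rolling join is needed:
setkey(dt, cat, date)
dt[, val := {
  idx <- is.na(val)                             # rows to predict
  last_val <- tail(val[!idx], 1)                # last observed value in this cat
  out <- val
  out[idx] <- last_val * cumprod(param1[idx])   # unrolled val[i] = val[i-1] * param1[i]
  out
}, by = cat]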
Please do also note that it is generally a bad idea to use names of base functions (cat in your case) as variable names (even in a distinct namespace).

fast replacement of data.table values by labels stored in another data.table

It is related to this question and this other one, although at a larger scale.
I have two data.tables:
The first one with market research data, containing answers stored as integers;
The second one being what can be called a dictionary, with category labels associated to the integers mentioned above.
See reproducible example:
EDIT: Addition of a new variable to include the '0' case.
EDIT 2: Modification of 'age_group' variable to include cases where all unique levels of a factor do not appear in data.
library(data.table)
library(magrittr)
# Table with survey data :
# - each observation contains the answers of a person
# - variables describe the sample population characteristics (gender, age...)
# - numeric variables (like age) are also stored as character vectors
repex_DT <- data.table (
country = as.character(c(1,3,4,2,NA,1,2,2,2,4,NA,2,1,1,3,4,4,4,NA,1)),
gender = as.character(c(NA,2,2,NA,1,1,1,2,2,1,NA,2,1,1,1,2,2,1,2,NA)),
age = as.character(c(18,40,50,NA,NA,22,30,52,64,24,NA,38,16,20,30,40,41,33,59,NA)),
age_group = as.character(c(2,2,2,NA,NA,2,2,2,2,2,NA,2,2,2,2,2,2,2,2,NA)),
status = as.character(c(1,NA,2,9,2,1,9,2,2,1,9,2,1,1,NA,2,2,1,2,9)),
children = as.character(c(0,2,3,1,6,1,4,2,4,NA,NA,2,1,1,NA,NA,3,5,2,1))
)
# Table of the labels associated to categorical variables, plus 'label_id' to match the values
labels_DT <- data.table (
label_id = as.character(c(1:9)),
country = as.character(c("COUNTRY 1","COUNTRY 2","COUNTRY 3","COUNTRY 4",NA,NA,NA,NA,NA)),
gender = as.character(c("Male","Female",NA,NA,NA,NA,NA,NA,NA)),
age_group = as.character(c("Less than 35","35 and more",NA,NA,NA,NA,NA,NA,NA)),
status = as.character(c("Employed","Unemployed",NA,NA,NA,NA,NA,NA,"Do not want to say")),
children = as.character(c("0","1","2","3","4","5 and more",NA,NA,NA))
)
# Identification of the variable nature (numeric or character)
var_type <- c("character","character","numeric","character","character","character")
# Identification of the categorical variable names
categorical_var <- names(repex_DT)[which(var_type == "character")]
You can see that the dictionary table is smaller than the survey data table; this is expected.
Also, despite all variables being stored as character, some are true numeric variables like age, and consequently do not appear in the dictionary table.
My objective is to replace the values of every variable in the first data.table that has a matching name in the dictionary table with the corresponding labels.
I have actually achieved it using a loop, like the one below:
result_DT1 <- copy(repex_DT)
for (x in categorical_var){
  if(length(which(repex_DT[[x]]=="0"))==0){
    values_vector <- labels_DT$label_id
    labels_vector <- labels_DT[[x]]
  }else{
    values_vector <- c("0",labels_DT$label_id)
    labels_vector <- c(labels_DT[[x]][1:(length(labels_DT[[x]])-1)], NA, labels_DT[[x]][length(labels_DT[[x]])])
  }
  result_DT1[, (c(x)) := plyr::mapvalues(x=get(x), from=values_vector, to=labels_vector, warn_missing = F)]
}
What I want is a faster method (the fastest if one exists), since I have thousands of variables to qualify for tens of thousands of records.
Any performance improvements would be more than welcome. I battled with stringi but could not get the function to run without errors unless I used hard-coded variable names. See example:
test_stringi <- copy(repex_DT) %>%
.[, (c("country")) := lapply(.SD, function(x) stringi::stri_replace_all_fixed(
str=x, pattern=unique(labels_DT$label_id)[!is.na(labels_DT[["country"]])],
replacement=unique(na.omit(labels_DT[["country"]])), vectorize_all=FALSE)),
.SDcols = c("country")]
Columns of your 2nd data.table are just lookup vectors:
same_cols <- intersect(names(repex_DT), names(labels_DT))
repex_DT[
,
(same_cols) := mapply(
function(x, y) y[as.integer(x)],
repex_DT[, same_cols, with = FALSE],
labels_DT[, same_cols, with = FALSE],
SIMPLIFY = FALSE
)
]
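To make the 'lookup vector' idea concrete, here is a tiny illustration (my addition) of the indexing used inside mapply above:
labels_DT$country[as.integer(c("1", "3", NA))]
# [1] "COUNTRY 1" "COUNTRY 3" NA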
Edit
You can add NA at the first position in the columns of labels_DT (similar to what you did for the other missing values), or better yet you can keep the labels in a list:
labels_list <- list(
country = c("COUNTRY 1","COUNTRY 2","COUNTRY 3","COUNTRY 4"),
gender = c("Male","Female"),
age_group = c("Less than 35","35 and more"),
status = c("Employed","Unemployed","Do not want to say"),
children = c("0","1","2","3","4","5 and more")
)
same_cols <- names(labels_list)
repex_DT[
,
(same_cols) := mapply(
function(x, y) y[factor(as.integer(x))],
repex_DT[, same_cols, with = FALSE],
labels_list,
SIMPLIFY = FALSE
)
]
Notice that this way it is necessary to convert to a factor first, because the values in repex_DT are not necessarily the sequence 1, 2, 3...
A very computationally efficient way would be to melt your tables first, match them, and cast again:
repex_DT[, idx:= .I] # Create an index used for melting
# Melt
repex_melt <- melt(repex_DT, id.vars = "idx")
labels_melt <- melt(labels_DT, id.vars = "label_id")
# Match variables and value/label_id
repex_melt[labels_melt, value2:= i.value, on= c("variable", "value==label_id")]
# Put the data back into its original shape
result <- dcast(repex_melt, idx~variable, value.var = "value2")
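One caveat worth noting (my addition, not part of the answer above): variables with no counterpart in labels_DT, such as age, never match in the join and end up as NA in value2, so their values would be lost in the cast. Carrying the original value through for unmatched rows before casting keeps them:
# keep the original value where no label was matched
# (note this also leaves raw values such as "0" untranslated)
repex_melt[is.na(value2), value2 := value]
result <- dcast(repex_melt, idx ~ variable, value.var = "value2")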
I finally found time to work on an answer to this matter.
I changed my approach and used fastmatch::fmatch to identify labels to update.
As pointed out by #det, it is not possible to handle variables with a '0' starting label in the same loop as the other standard categorical variables, so the instruction is basically repeated twice.
Still, this is much faster than my initial for loop approach.
The answer below:
library(data.table)
library(magrittr)
library(stringi)
library(fastmatch)
#Selection of variable names depending on the presence of '0' labels
same_cols_with0 <- intersect(names(repex_DT), names(labels_DT))[
which(intersect(names(repex_DT), names(labels_DT)) %fin%
names(repex_DT)[which(unlist(lapply(repex_DT, function(x)
sum(stri_detect_regex(x, pattern="^0$", negate=FALSE), na.rm=TRUE)),
use.names=FALSE)>=1)])]
same_cols_standard <- intersect(names(repex_DT), names(labels_DT))[
which(!(intersect(names(repex_DT), names(labels_DT)) %fin% same_cols_with0))]
labels_std <- labels_DT[, same_cols_standard, with=FALSE]
labels_0 <- labels_DT[, same_cols_with0, with=FALSE]
levels_id <- as.integer(labels_DT$label_id)
#Update joins via matching IDs (credit to #det for mapply syntax).
result_DT <- data.table::copy(repex_DT) %>%
.[, (same_cols_standard) := mapply(
function(x, y) y[fastmatch::fmatch(x=as.integer(x), table=levels_id, nomatch=NA)],
repex_DT[, same_cols_standard, with=FALSE], labels_std, SIMPLIFY=FALSE)] %>%
.[, (same_cols_with0) := mapply(
function(x, y) y[fastmatch::fmatch(x=as.integer(x), table=(levels_id - 1), nomatch=NA)],
repex_DT[, same_cols_with0, with=FALSE], labels_0, SIMPLIFY=FALSE)]
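As a quick consistency check (my addition), the fmatch-based result can be compared against the loop-based output from the question:
all.equal(result_DT1, result_DT)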

how to insert sequential rows in data.table in R (Example given)?

df is a data.table and df_expected is the desired data.table. I want to add an hour column from 0 to 23, with visits filled as 0 for the newly added hours.
df<-data.table(customer=c("x","x","x","y","y"),location_id=c(1,1,1,2,3),hour=c(2,5,7,0,4),visits=c(40,50,60,70,80))
df_expected<-data.table(customer=c("x","x","x","x","x","x","x","x","x","x","x","x","x","x","x","x","x","x","x","x","x","x","x","x",
"y","y","y","y","y","y","y","y","y","y","y","y","y","y","y","y","y","y","y","y","y","y","y","y",
"y","y","y","y","y","y","y","y","y","y","y","y","y","y","y","y","y","y","y","y","y","y","y","y"),
location_id=c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,
3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3),
hour=c(0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,
0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,
0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23),
visits=c(0,0,40,0,0,50,0,60,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
70,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,80,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0))
This is what I tried to obtain my result, but it did not work:
df1<-df[,':='(hour=seq(0:23)),by=(customer)]
Error in `[.data.table`(df, , `:=`(hour = seq(0L:23L)), by = (customer)) :
Type of RHS ('integer') must match LHS ('double'). To check and coerce would impact
performance too much for the fastest cases. Either change the type of the target column, or
coerce the RHS of := yourself (e.g. by using 1L instead of 1)
Here's an approach that creates the target and then uses a join to add in the visits information. The ifelse statement just helps us clean up the NAs from the merge. You could also leave them in and replace them with := in the new data.table.
target <- data.table(
customer = rep(unique(df$customer), each = 24),
hour = 0:23)
df_join <- df[target, on = c("customer", "hour"),
.(customer, hour, visits = ifelse(is.na(visits), 0, visits))
]
all.equal(df_expected, df_join)
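A sketch of the second option mentioned above (my addition): leave the NAs in and fix them afterwards with :=, reusing the same target table:
df_join2 <- df[target, on = c("customer", "hour"), .(customer, hour, visits)]
df_join2[is.na(visits), visits := 0]  # replace the NAs produced by the join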
Edit:
This addresses the request to include the location_id column. One way to do this is with by = location_id in the creation of the target. I've also added in some of the code from chinsoon12's answer.
target <- df[ , .("customer" = rep(unique(customer), each = 24L),
"hour" = rep(0L:23L, times = uniqueN(customer))),
by = location_id]
df_join <- df[target, on = .NATURAL,
.(customer, location_id, hour, visits = fcoalesce(visits, 0))]
all.equal(df_expected, df_join)
Another option using CJ to generate your universe, on=.NATURAL for joining on identically named columns, and fcoalesce to handle NAs:
df[CJ(customer, hour=0L:23L, unique=TRUE), on=.NATURAL, allow.cartesian=TRUE,
.(customer=i.customer, hour=i.hour, visits=fcoalesce(visits, 0))]
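If you also want to carry location_id through (my addition; the snippet above returns only customer, hour and visits), one option is to build the universe per customer/location pair first and join on all three columns:
univ <- df[, CJ(hour = as.numeric(0:23)), by = .(customer, location_id)]
res <- df[univ, on = .(customer, location_id, hour),
          .(customer, location_id, hour, visits = fcoalesce(visits, 0))]
all.equal(df_expected, res)  # check against the expected output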
Here's a for-loop answer.
df_final <- data.table()
for(i in 0:23){  # hours 0 to 23
  if(i %in% df[,hour]){
    a <- df[hour==i]
  }else{
    a <- data.table(customer="x", hour=i, visits=0)
  }
  # fill=TRUE so the generated rows (which lack location_id) bind cleanly
  df_final <- rbind(df_final, a, fill=TRUE)
}
df_final
You can wrap this in another for-loop to handle your multiple customers x, y, etc. (the following loop isn't very clean but gets the job done).
df_final <- data.table()
for(j in unique(df[,customer])){
  for(i in 0:23){  # hours 0 to 23
    if(i %in% df[,hour]){
      if(df[hour==i,customer] %in% j){
        a <- df[hour==i]
      }else{
        a <- data.table(customer=j, hour=i, visits=0)
      }
    }else{
      a <- data.table(customer=j, hour=i, visits=0)
    }
    # fill=TRUE so the generated rows (which lack location_id) bind cleanly
    df_final <- rbind(df_final, a, fill=TRUE)
  }
}
df_final

How can I program a loop in R?

How can I program a loop so that all eight tables are calculated one after the other?
The code:
dt_M1_I <- M1_I
dt_M1_I <- data.table(dt_M1_I)
dt_M1_I[,I:=as.numeric(gsub(",",".",I))]
dt_M1_I[,day:=substr(t,1,10)]
dt_M1_I[,hour:=substr(t,12,16)]
dt_M1_I_median <- dt_M1_I[,list(median_I=median(I,na.rm = TRUE)),by=.(day,hour)]
This should be calculated for:
M1_I
M2_I
M3_I
M4_I
M1_U
M2_U
M3_U
M4_U
Thank you very much for your help!
Whenever you have several variables of the same kind, especially when you find yourself numbering them, as you did, step back and replace them with a single list variable. I do not recommend doing what the other answer suggested.
That is, instead of M1_I…M4_I and M1_U…M4_U, have two variables m_i and m_u (using lower case in variable names is conventional), which are each lists of four data.tables.
Alternatively, you might want to use a single variable, m, which contains nested lists of data.tables (m = list(list(i = …, u = …), …)).
Assuming the first, you can then iterate over them as follows:
give_this_a_meaningful_name = function (df) {
  dt <- data.table(df)
  dt[, I := as.numeric(gsub(",", ".", I))]
  dt[, day := substr(t, 1, 10)]
  dt[, hour := substr(t, 12, 16)]
  dt[, list(median_I = median(I, na.rm = TRUE)), by = .(day, hour)]
}
m_i_median = lapply(m_i, give_this_a_meaningful_name)
(Note also the introduction of consistent spacing around operators; good readability is paramount for writing bug-free code.)
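If the eight tables already exist as separate objects in your workspace (which the question's code suggests), one way to build m_i and m_u without retyping the names is mget(); this is my addition, a sketch rather than part of the answer above:
m_i <- mget(paste0("M", 1:4, "_I"))  # list of M1_I ... M4_I
m_u <- mget(paste0("M", 1:4, "_U"))  # list of M1_U ... M4_U
The lapply call shown above then works on m_i unchanged; the _U tables would need the analogous column name inside the function.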
You can use a combination of a for loop and the get/assign functions like this:
# create a vector of the data.frame names
dts <- c('M1_I', 'M2_I', 'M3_I', 'M4_I', 'M1_U', 'M2_U', 'M3_U', 'M4_U')
# iterate over each dataframe
for (dt in dts){
  # get the actual dataframe (not the string name of it)
  tmp <- get(dt)
  tmp <- data.table(tmp)
  tmp[, I:=as.numeric(gsub(",",".",I))]
  tmp[, day:=substr(t,1,10)]
  tmp[, hour:=substr(t,12,16)]
  tmp <- tmp[,list(median_I=median(I,na.rm = TRUE)),by=.(day,hour)]
  # assign the modified dataframe to the name you want (the paste adds the 'dt_' to the front)
  assign(paste0('dt_', dt), tmp)
}

data.table filtering solution based on posixct datetime and two variables?

I have a data frame of when animals are detected at different sites. I want to eliminate rows (filter) from the detection file (df) for only site A if the individual animal wasn't detected at site B within a time frame (5 minutes). I need to iterate this over every individual animal and across multiple sites. My real data has many animals and over a million detection observations. I'm looking for a data.table solution to be efficient.
The two variables would be the individuals (animals) and the site detected.
Example:
obs.num <- 1:21 # a simple observation number
animal <- c(rep("RBT 1",10), rep("RBT 2",7), rep("RBT 3",2), "RBT 4", "RBT 2") # a fake list of animal id's (my data has many)
now <- Sys.time()
ts <- seq(from = now, length.out = 16, by = "mins")
ts <- c(ts, seq(from = tail(ts,1), length.out = 3, by = "hour")) # create a fake series of time stamps
ts <- c(ts, seq(from = tail(ts,1), length.out = 2, by = "hour"))
df <- data.frame(obs.num, animal, ts) # make data frame
df$site <- c("A","B","A","B","A","B","A","B","A","B","A","B","A","B","A","B","A","B","A","B","B") # make a fake series of sites detection occurred at
str(df)
df # my example data frame
In this example I would like to remove the entire row for observation 19.
I'm looking for a data.table solution similar to this solution
library(sqldf)
sqldf("with B as (select * from df where site == 'B')
select distinct df.* from df
join B on df.animal = B.animal and
B.ts - df.ts between -5 * 60 and 5 * 60
order by 1")
A bit clunky, but you can accomplish this with non-equi-joins in data.table:
library(data.table)
setDT(df)
nm = names(df)
# unfortunately non-equi-joins don't support on-the-fly
# columns yet, so we have to first define them explicitly; see:
# https://github.com/Rdatatable/data.table/issues/1639
df[ , ts_minus_5 := ts - 5*60]
df[ , ts_plus_5 := ts + 5*60]
# identify the observations _matching_ your criteria (i.e. those to keep)
found_at_b = unique(
df[site == 'A'][df[site == 'B'], .(x.obs.num, x.animal),
on = .(animal == animal, ts >= ts_minus_5, ts <= ts_plus_5),
# allow.cartesian allows this join to return any
# number of rows, necessary since any "B" row
# might match multiple "A" rows;
# nomatch = 0L drops any "B" row without a
# match found in "A" rows
allow.cartesian = TRUE, nomatch = 0L]
)
# to filter, define a "drop" flag (could also call it "filter")
df[site == 'B', drop := FALSE]
df[found_at_b, on = c(obs.num = 'x.obs.num', animal = 'x.animal'),
drop := FALSE]
# could define drop = TRUE for the other rows, but no need
df = df[(!drop)]
There are some other ways to clean the code up a bit by being more careful about potentially creating copies, perhaps split-ing the data by site first, doing as much as possible within one [] call, etc., but this will get you started.
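For reference, here is a more compact sketch of the same logic (my addition, not how the answer above is written): keep all site-B rows plus the site-A rows that have a B detection for the same animal within +/- 5 minutes, found with a single non-equi join. It assumes the original df from the question, before the helper columns and filtering added above:
setDT(df)
A <- df[site == "A"]
B <- df[site == "B", .(animal, lo = ts - 5*60, hi = ts + 5*60)]
keep_A <- unique(
  A[B, on = .(animal, ts >= lo, ts <= hi),
    nomatch = 0L, allow.cartesian = TRUE, .(obs.num = x.obs.num)]
)$obs.num
result2 <- df[site == "B" | obs.num %in% keep_A]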

r - apply function to each row of a data.table

I'm looking to use data.table to improve speed for a given function, but I'm not sure I'm implementing it the correct way:
Data
Given two data.tables (dt and dt_lookup)
library(data.table)
set.seed(1234)
t <- seq(1,100); l <- letters; la <- letters[1:13]; lb <- letters[14:26]
n <- 10000
dt <- data.table(id=seq(1:n),
                 thisTime=sample(t, n, replace=TRUE),
                 thisLocation=sample(la,n,replace=TRUE),
                 finalLocation=sample(lb,n,replace=TRUE))
setkey(dt, thisLocation)
set.seed(4321)
dt_lookup <- data.table(lkpId = paste0("l-",seq(1,1000)),
                        lkpTime=sample(t, 10000, replace=TRUE),
                        lkpLocation=sample(l, 10000, replace=TRUE))
## NOTE: lkpId is purposly recycled
setkey(dt_lookup, lkpLocation)
I have a function that finds the lkpId that contains both thisLocation and finalLocation, and has the 'nearest' lkpTime (i.e. the minimum non-negative value of thisTime - lkpTime)
Function
## function to get the 'next' lkpId (i.e. the lkpId with both thisLocation and finalLocation,
## with the minimum non-negative time between thisTime and dt_lookup$lkpTime)
getId <- function(thisTime, thisLocation, finalLocation){
  ## filter lookup based on thisLocation and finalLocation,
  ## and only return values where the lkpId has both 'this' and 'final' locations
  tempThis <- unique(dt_lookup[lkpLocation == thisLocation,lkpId])
  tempFinal <- unique(dt_lookup[lkpLocation == finalLocation,lkpId])
  availServices <- tempThis[tempThis %in% tempFinal]
  tempThisFinal <- dt_lookup[lkpId %in% availServices & lkpLocation==thisLocation, .(lkpId, lkpTime)]
  ## calculate time difference between 'thisTime' and 'lkpTime' (from thisLocation)
  temp2 <- thisTime - tempThisFinal$lkpTime
  ## take the lkpId with the minimum non-negative difference
  selectedId <- tempThisFinal[min(which(temp2==min(temp2[temp2>0]))),lkpId]
  selectedId
}
Attempts at a solution
I need to get the lkpId for each row of dt. Therefore, my initial instinct was to use an *apply function, but it was taking too long (for me) when n/nrow > 1,000,000. So I've tried to implement a data.table solution to see if it's faster:
selectedId <- dt[,.(lkpId = getId(thisTime, thisLocation, finalLocation)),by=id]
However, I'm fairly new to data.table, and this method doesn't appear to give any performance gains over an *apply solution:
lkpIds <- apply(dt, 1, function(x){
  thisLocation <- as.character(x[["thisLocation"]])
  finalLocation <- as.character(x[["finalLocation"]])
  thisTime <- as.numeric(x[["thisTime"]])
  myId <- getId(thisTime, thisLocation, finalLocation)
})
both taking ~30 seconds for n = 10,000.
Question
Is there a better way of using data.table to apply the getId function over each row of dt ?
Update 12/08/2015
Thanks to the pointer from #eddi I've redesigned my whole algorithm and am making use of rolling joins (a good introduction), thus making proper use of data.table. I'll write up an answer later.
Having spent the time since asking this question looking into what data.table has to offer, researching data.table joins thanks to #eddi's pointer (for example Rolling join on data.table, and inner join with inequality), I've come up with a solution.
One of the tricky parts was moving away from the thought of 'apply a function to each row', and redesigning the solution to use joins.
And, there will no doubt be better ways of programming this, but here's my attempt.
## want to find a lkpId for each id, that has the minimum difference between 'thisTime' and 'lkpTime'
## and where the lkpId contains both 'thisLocation' and 'finalLocation'
## find all lookup id's where 'thisLocation' matches 'lookupLocation'
## and where thisTime - lkpTime > 0
setkey(dt, thisLocation)
setkey(dt_lookup, lkpLocation)
dt_this <- dt[dt_lookup, {
  idx = thisTime - i.lkpTime > 0
  .(id = id[idx],
    lkpId = i.lkpId,
    thisTime = thisTime[idx],
    lkpTime = i.lkpTime)
}, by = .EACHI]
## remove NAs
dt_this <- dt_this[complete.cases(dt_this)]
## find all matching 'finalLocation' and 'lookupLocation'
setkey(dt, finalLocation)
## inner join (and only return the id columns)
dt_final <- dt[dt_lookup, nomatch=0, allow.cartesian=TRUE][,.(id, lkpId)]
## join dt_this to dt_final (as lkpId must have both 'thisLocation' and 'finalLocation')
setkey(dt_this, id, lkpId)
setkey(dt_final, id, lkpId)
dt_join <- dt_this[dt_final, nomatch=0]
## take the combination with the minimum difference between 'thisTime' and 'lkpTime'
dt_join[,timeDiff := thisTime - lkpTime]
dt_join <- dt_join[ dt_join[order(timeDiff), .I[1], by=id]$V1]
## equivalent dplyr code (for the last step; note it operates on dt_join, which holds timeDiff)
# library(dplyr)
# dt_join <- dt_join %>%
#   group_by(id) %>%
#   arrange(timeDiff) %>%
#   slice(1) %>%
#   ungroup
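As a quick sanity check (my addition, not part of the original answer), a few ids can be compared against the row-by-row getId() result from the question:
check_ids <- head(dt_join$id, 3)
loop_res <- dt[id %in% check_ids,
               .(lkpId_loop = getId(thisTime, thisLocation, finalLocation)),
               by = id]
merge(loop_res, dt_join[, .(id, lkpId)], by = "id")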
