Related
This is related to this question and this other one, although on a larger scale.
I have two data.tables:
The first one with market research data, containing answers stored as integers;
The second one being what can be called a dictionary, with category labels associated to the integers mentioned above.
See the reproducible example below:
EDIT: Addition of a new variable to include the '0' case.
EDIT 2: Modification of 'age_group' variable to include cases where all unique levels of a factor do not appear in data.
library(data.table)
library(magrittr)
# Table with survey data :
# - each observation contains the answers of a person
# - variables describe the sample population characteristics (gender, age...)
# - numeric variables (like age) are also stored as character vectors
repex_DT <- data.table (
country = as.character(c(1,3,4,2,NA,1,2,2,2,4,NA,2,1,1,3,4,4,4,NA,1)),
gender = as.character(c(NA,2,2,NA,1,1,1,2,2,1,NA,2,1,1,1,2,2,1,2,NA)),
age = as.character(c(18,40,50,NA,NA,22,30,52,64,24,NA,38,16,20,30,40,41,33,59,NA)),
age_group = as.character(c(2,2,2,NA,NA,2,2,2,2,2,NA,2,2,2,2,2,2,2,2,NA)),
status = as.character(c(1,NA,2,9,2,1,9,2,2,1,9,2,1,1,NA,2,2,1,2,9)),
children = as.character(c(0,2,3,1,6,1,4,2,4,NA,NA,2,1,1,NA,NA,3,5,2,1))
)
# Table of the labels associated to categorical variables, plus 'label_id' to match the values
labels_DT <- data.table (
label_id = as.character(c(1:9)),
country = as.character(c("COUNTRY 1","COUNTRY 2","COUNTRY 3","COUNTRY 4",NA,NA,NA,NA,NA)),
gender = as.character(c("Male","Female",NA,NA,NA,NA,NA,NA,NA)),
age_group = as.character(c("Less than 35","35 and more",NA,NA,NA,NA,NA,NA,NA)),
status = as.character(c("Employed","Unemployed",NA,NA,NA,NA,NA,NA,"Do not want to say")),
children = as.character(c("0","1","2","3","4","5 and more",NA,NA,NA))
)
# Identification of the variable nature (numeric or character)
var_type <- c("character","character","numeric","character","character","character")
# Identification of the categorical variable names
categorical_var <- names(repex_DT)[which(var_type == "character")]
You can see that the dictionary table is smaller than the survey data table; this is expected.
Also, although all variables are stored as character, some are truly numeric (like age) and consequently do not appear in the dictionary table.
My objective is to replace the values of every variable in the first data.table that has a matching name in the dictionary table with the corresponding labels.
I have actually achieved it using a loop, like the one below:
result_DT1 <- copy(repex_DT)
for (x in categorical_var){
if(length(which(repex_DT[[x]]=="0"))==0){
values_vector <- labels_DT$label_id
labels_vector <- labels_DT[[x]]
}else{
values_vector <- c("0",labels_DT$label_id)
labels_vector <- c(labels_DT[[x]][1:(length(labels_DT[[x]])-1)], NA, labels_DT[[x]][length(labels_DT[[x]])])}
result_DT1[, (c(x)) := plyr::mapvalues(x=get(x), from=values_vector, to=labels_vector, warn_missing = F)]
}
What I want is a faster method (the fastest, if one exists), since I have thousands of variables to label across tens of thousands of records.
Any performance improvement would be more than welcome. I battled with stringi but could not get the function to run without errors unless I hard-coded the variable names. See example:
test_stringi <- copy(repex_DT) %>%
.[, (c("country")) := lapply(.SD, function(x) stringi::stri_replace_all_fixed(
str=x, pattern=unique(labels_DT$label_id)[!is.na(labels_DT[["country"]])],
replacement=unique(na.omit(labels_DT[["country"]])), vectorize_all=FALSE)),
.SDcols = c("country")]
Columns of your 2nd data.table are just look up vectors:
same_cols <- intersect(names(repex_DT), names(labels_DT))
repex_DT[
,
(same_cols) := mapply(
function(x, y) y[as.integer(x)],
repex_DT[, same_cols, with = FALSE],
labels_DT[, same_cols, with = FALSE],
SIMPLIFY = FALSE
)
]
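As a minimal sketch of why this works, the snippet below indexes a label vector by position on made-up toy values (the vectors `labels`, `codes` and `codes0` are illustrative and not part of the question's data). It also shows the one case that needs care: a code of 0 selects nothing, so the result comes back shorter than the input.

```r
# Toy values, assumed for illustration only
labels <- c("Male", "Female")      # position 1 -> "Male", position 2 -> "Female"
codes  <- c("2", NA, "1", "2")     # answers stored as character, as in repex_DT

# Positional lookup: NA codes stay NA
looked_up <- labels[as.integer(codes)]
print(looked_up)

# Index 0 is silently dropped, so the output is shorter than the input
codes0 <- c("0", "1", "2")
with_zero <- labels[as.integer(codes0)]
print(length(with_zero))
```

This is why variables that contain a '0' answer need separate handling.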
edit
You can add NA in the first position of each column of labels_DT (similar to what you did for the other missing values) or, better yet, keep the labels in a list:
labels_list <- list(
country = c("COUNTRY 1","COUNTRY 2","COUNTRY 3","COUNTRY 4"),
gender = c("Male","Female"),
age_group = c("Less than 35","35 and more"),
status = c("Employed","Unemployed","Do not want to say"),
children = c("0","1","2","3","4","5 and more")
)
same_cols <- names(labels_list)
repex_DT[
,
(same_cols) := mapply(
function(x, y) y[factor(as.integer(x))],
repex_DT[, same_cols, with = FALSE],
labels_list,
SIMPLIFY = FALSE
)
]
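Note that with this approach it is necessary to convert to factor first, because the values in repex_DT are not necessarily the sequence 1, 2, 3... Be aware that factor() only creates levels for values that actually occur in the data, so the positions line up only when every code of a variable is present.
A very computationally efficient way would be to melt your tables first, match them, and cast again: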
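The following sketch illustrates the factor-based indexing on toy values that mimic the age_group column, where only code 2 occurs. The `levels =` argument is an addition made for this sketch, not part of the answer above; it assumes the full set of codes for the variable is known in advance (for status that would be c(1, 2, 9)).

```r
# Toy values mimicking age_group, where only code "2" appears in the data
age_group_labels <- c("Less than 35", "35 and more")
x <- c("2", "2", NA)

# factor() builds its levels from the observed values only, so "2" becomes
# level 1 and picks the first label
implicit <- age_group_labels[factor(as.integer(x))]
print(implicit)

# Stating the full set of codes keeps code 2 at position 2
explicit <- age_group_labels[factor(as.integer(x), levels = 1:2)]
print(explicit)
```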
repex_DT[, idx:= .I] # Create an index used for melting
# Melt
repex_melt <- melt(repex_DT, id.vars = "idx")
labels_melt <- melt(labels_DT, id.vars = "label_id")
# Match variables and value/label_id
repex_melt[labels_melt, value2:= i.value, on= c("variable", "value==label_id")]
# Put the data back into its original shape
result <- dcast(repex_melt, idx~variable, value.var = "value2")
I finally found time to work on an answer to this matter.
I changed my approach and used fastmatch::fmatch to identify labels to update.
As pointed out by #det, it is not possible to handle variables whose labels start at '0' in the same loop as the other standard categorical variables, so the instruction is basically repeated twice.
Still, this is much faster than my initial for loop approach.
The answer below:
library(data.table)
library(magrittr)
library(stringi)
library(fastmatch)
#Selection of variable names depending on the presence of '0' labels
same_cols_with0 <- intersect(names(repex_DT), names(labels_DT))[
which(intersect(names(repex_DT), names(labels_DT)) %fin%
names(repex_DT)[which(unlist(lapply(repex_DT, function(x)
sum(stri_detect_regex(x, pattern="^0$", negate=FALSE), na.rm=TRUE)),
use.names=FALSE)>=1)])]
same_cols_standard <- intersect(names(repex_DT), names(labels_DT))[
which(!(intersect(names(repex_DT), names(labels_DT)) %fin% same_cols_with0))]
labels_std <- labels_DT[, same_cols_standard, with=FALSE]
labels_0 <- labels_DT[, same_cols_with0, with=FALSE]
levels_id <- as.integer(labels_DT$label_id)
#Update joins via matching IDs (credit to #det for mapply syntax).
result_DT <- data.table::copy(repex_DT) %>%
.[, (same_cols_standard) := mapply(
function(x, y) y[fastmatch::fmatch(x=as.integer(x), table=levels_id, nomatch=NA)],
repex_DT[, same_cols_standard, with=FALSE], labels_std, SIMPLIFY=FALSE)] %>%
.[, (same_cols_with0) := mapply(
function(x, y) y[fastmatch::fmatch(x=as.integer(x), table=(levels_id - 1), nomatch=NA)],
repex_DT[, same_cols_with0, with=FALSE], labels_0, SIMPLIFY=FALSE)]
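To make the offset in the second mapply call easier to follow, here is a small sketch on toy values. It uses base match() in place of fastmatch::fmatch so that it runs without extra packages; the vectors are copied from the children column of the example and `x` is a made-up set of answers.

```r
# Label ids and the children labels, as in labels_DT
levels_id       <- 1:9
children_labels <- c("0", "1", "2", "3", "4", "5 and more", NA, NA, NA)

# Made-up answers, including the '0' case and a code without a label
x <- c("0", "2", "6", NA)

# Shifting the ids by one makes code 0 match position 1, code 1 position 2, ...
pos <- match(as.integer(x), levels_id - 1L)
relabelled <- children_labels[pos]
print(relabelled)
```

Code 6 lands on position 7, which holds NA in the dictionary, so it stays unlabelled just as in the full example.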
df is a data.table and df_expected is the desired data.table. I want to expand the hour column to cover 0 to 23, with visits filled in as 0 for the newly added hours.
df<-data.table(customer=c("x","x","x","y","y"),location_id=c(1,1,1,2,3),hour=c(2,5,7,0,4),visits=c(40,50,60,70,80))
df_expected<-data.table(customer=c("x","x","x","x","x","x","x","x","x","x","x","x","x","x","x","x","x","x","x","x","x","x","x","x",
"y","y","y","y","y","y","y","y","y","y","y","y","y","y","y","y","y","y","y","y","y","y","y","y",
"y","y","y","y","y","y","y","y","y","y","y","y","y","y","y","y","y","y","y","y","y","y","y","y"),
location_id=c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,
3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3),
hour=c(0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,
0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,
0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23),
visits=c(0,0,40,0,0,50,0,60,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
70,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,80,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0))
This is what I tried to obtain my result, but it did not work:
df1<-df[,':='(hour=seq(0:23)),by=(customer)]
Error in `[.data.table`(df, , `:=`(hour = seq(0L:23L)), by = (customer)) :
Type of RHS ('integer') must match LHS ('double'). To check and coerce would impact
performance too much for the fastest cases. Either change the type of the target column, or
coerce the RHS of := yourself (e.g. by using 1L instead of 1)
Here's an approach that creates the target and then uses a join to add in the visits information. The ifelse statement just helps us clean up the NAs from the join. You could also leave them in and replace them with := in the new data.table.
target <- data.table(
customer = rep(unique(df$customer), each = 24),
hour = 0:23)
df_join <- df[target, on = c("customer", "hour"),
.(customer, hour, visits = ifelse(is.na(visits), 0, visits))
]
all.equal(df_expected, df_join)
Edit:
This addresses the request to include the location_id column. One way to do this is with by = location_id in the creation of the target. I've also added in some of the code from chinsoon12's answer.
target <- df[ , .("customer" = rep(unique(customer), each = 24L),
"hour" = rep(0L:23L, times = uniqueN(customer))),
by = location_id]
df_join <- df[target, on = .NATURAL,
.(customer, location_id, hour, visits = fcoalesce(visits, 0))]
all.equal(df_expected, df_join)
Another option using CJ to generate your universe, on=.NATURAL for joining on identically named columns, and fcoalesce to handle NAs:
df[CJ(customer, hour=0L:23L, unique=TRUE), on=.NATURAL, allow.cartesian=TRUE,
.(customer=i.customer, hour=i.hour, visits=fcoalesce(visits, 0))]
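If location_id has to be kept as well, one possible variant (a sketch, not taken from the answers above) is to build the grid from the observed customer/location pairs and join on all three columns. The names `combos`, `grid` and `res` are made up for this illustration.

```r
library(data.table)

df <- data.table(customer    = c("x", "x", "x", "y", "y"),
                 location_id = c(1, 1, 1, 2, 3),
                 hour        = c(2, 5, 7, 0, 4),
                 visits      = c(40, 50, 60, 70, 80))

# One row per customer/location pair that actually occurs in the data
combos <- unique(df[, .(customer, location_id)])

# Expand every pair to the 24 hours (numeric, to match the type of df$hour)
grid <- combos[, .(hour = as.numeric(0:23)), by = .(customer, location_id)]

# Right join onto the grid, then turn the unmatched hours into 0 visits
res <- df[grid, on = .(customer, location_id, hour)]
res[is.na(visits), visits := 0]

print(res[customer == "x" & hour %in% c(1, 2)])
```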
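Here's a for-loop answer for a single customer ("x").
df_final <- data.table()
for (i in 0:23) {  # hours run from 0 to 23
  if (i %in% df[customer == "x", hour]) {
    a <- df[customer == "x" & hour == i]
  } else {
    a <- data.table(customer = "x", hour = i, visits = 0)
  }
  # fill = TRUE because the filler rows have no location_id
  df_final <- rbind(df_final, a, fill = TRUE)
}
df_final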
You can wrap this in another for-loop to handle your multiple customers x, y, etc. (the following loop isn't very clean but gets the job done).
df_final <- data.table()
for (j in unique(df[, customer])) {
  for (i in 0:23) {
    # rows observed for this customer at this hour, if any
    a <- df[customer == j & hour == i]
    if (nrow(a) == 0L) {
      a <- data.table(customer = j, hour = i, visits = 0)
    }
    df_final <- rbind(df_final, a, fill = TRUE)
  }
}
df_final
How can I program a loop so that all eight tables are calculated one after the other?
The code:
dt_M1_I <- M1_I
dt_M1_I <- data.table(dt_M1_I)
dt_M1_I[,I:=as.numeric(gsub(",",".",I))]
dt_M1_I[,day:=substr(t,1,10)]
dt_M1_I[,hour:=substr(t,12,16)]
dt_M1_I_median <- dt_M1_I[,list(median_I=median(I,na.rm = TRUE)),by=.(day,hour)]
This should be calculated for:
M1_I
M2_I
M3_I
M4_I
M1_U
M2_U
M3_U
M4_U
Thank you very much for your help!
Whenever you have several variables of the same kind, especially when you find yourself numbering them, as you did, step back and replace them with a single list variable. I do not recommend doing what the other answer suggested.
That is, instead of M1_I…M4_I and M1_U…M4_U, have two variables m_i and m_u (using lower case in variable names is conventional), which are each lists of four data.tables.
Alternatively, you might want to use a single variable, m, which contains nested lists of data.tables (m = list(list(i = …, u = …), …)).
Assuming the first, you can then iterate over them as follows:
give_this_a_meaningful_name = function (df) {
dt <- data.table(df)
dt[, I := as.numeric(gsub(",", ".", I))]
dt[, day := substr(t, 1, 10)]
dt[, hour := substr(t, 12, 16)]
dt[, list(median_I = median(I, na.rm = TRUE)), by = .(day, hour)]
}
m_i_median = lapply(m_i, give_this_a_meaningful_name)
(Note also the introduction of consistent spacing around operators; good readability is paramount for writing bug-free code.)
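As a rough end-to-end sketch of the list-based approach, the example below uses two tiny made-up tables as stand-ins for the real M1_I and M2_I (their contents and the function name `median_by_minute` are assumptions for illustration). It assumes the same column names as in the question: `t` for the timestamp and `I` for the comma-decimal measurement.

```r
library(data.table)

# Hypothetical stand-ins for the real tables
m_i <- list(
  M1_I = data.frame(t = c("2020-01-01 10:00:05", "2020-01-01 10:00:35"),
                    I = c("1,5", "2,5")),
  M2_I = data.frame(t = c("2020-01-01 11:15:00", "2020-01-01 11:16:00"),
                    I = c("3,0", "5,0"))
)

median_by_minute <- function(df) {
  dt <- data.table(df)                       # copy, so the input is not modified
  dt[, I := as.numeric(gsub(",", ".", I))]   # decimal comma -> decimal point
  dt[, day := substr(t, 1, 10)]
  dt[, hour := substr(t, 12, 16)]
  dt[, list(median_I = median(I, na.rm = TRUE)), by = .(day, hour)]
}

# One result per input table, with the list names preserved
m_i_median <- lapply(m_i, median_by_minute)
print(m_i_median$M1_I)
```

Because lapply keeps the names of the list, each result can be retrieved by the name of its source table.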
You can use a combination of a for loop and the get/assign functions like this:
# create a vector of the data.frame names
dts <- c('M1_I', 'M2_I', 'M3_I', 'M4_I', 'M1_U', 'M2_U', 'M3_U', 'M4_U')
# iterate over each dataframe
for (dt in dts){
# get the actual dataframe (not the string name of it)
tmp <- get(dt)
tmp <- data.table(tmp)
tmp[, I:=as.numeric(gsub(",",".",I))]
tmp[, day:=substr(t,1,10)]
tmp[, hour:=substr(t,12,16)]
tmp <- tmp[,list(median_I=median(I,na.rm = TRUE)),by=.(day,hour)]
# assign the modified dataframe to the name you want (the paste adds the 'dt_' to the front)
assign(paste0('dt_', dt), tmp)
}
I have a data frame of when animals are detected at different sites. I want to eliminate rows (filter) from the detection file (df) for site A only, if the individual animal wasn't detected at site B within a time frame (5 minutes). I need to iterate this over every individual animal and across multiple sites. My real data has many animals and over a million detection observations. I'm looking for a data.table solution to be efficient.
The two variables would be the individuals (animals) and the site detected.
Example:
obs.num<-1:21 # a simple observation number
animal<-c(rep("RBT 1",10),rep("RBT 2",7) ,rep("RBT 3",2),"RBT 4","RBT 2") # a fake list of animal id's (my data has many)
now <- Sys.time()
ts <- seq(from = now, length.out = 16, by = "mins")
ts <- c(ts,seq(from=tail(ts,1), length.out = 3, by = "hour")) # create a fake series of time stamps
ts <- c(ts,seq(from=tail(ts,1), length.out = 2, by = "hour"))
df<-data.frame(obs.num,animal,ts) # make data frame
df$site<-c("A","B","A","B","A","B","A","B","A","B","A","B","A","B","A","B","A","B","A","B","B")# make a fake series of sites detection occurred at
str(df)
df # my example data frame
In this example I would like to remove the entire row for observation 19.
I'm looking for a data.table solution similar to this sqldf solution:
library(sqldf)
sqldf("with B as (select * from df where site == 'B')
select distinct df.* from df
join B on df.animal = B.animal and
B.ts - df.ts between -5 * 60 and 5 * 60
order by 1")
A bit clunky, but you can accomplish this with non-equi-joins in data.table:
library(data.table)
setDT(df)
nm = names(df)
# unfortunately non-equi-joins don't support on-the-fly
# columns yet, so we have to first define them explicitly; see:
# https://github.com/Rdatatable/data.table/issues/1639
df[ , ts_minus_5 := ts - 5*60]
df[ , ts_plus_5 := ts + 5*60]
# identify the observations _matching_ your criteria (i.e. those to keep)
found_at_b = unique(
df[site == 'A'][df[site == 'B'], .(x.obs.num, x.animal),
on = .(animal == animal, ts >= ts_minus_5, ts <= ts_plus_5),
# allow.cartesian allows this join to return any
# number of rows, necessary since any "B" row
# might match multiple "A" rows;
# nomatch = 0L drops any "B" row without a
# match found in "A" rows
allow.cartesian = TRUE, nomatch = 0L]
)
# to filter, define a "drop" flag (could also call it "filter")
df[site == 'B', drop := FALSE]
df[found_at_b, on = c(obs.num = 'x.obs.num', animal = 'x.animal'),
drop := FALSE]
# could define drop = TRUE for the other rows, but no need
df = df[(!drop)]
There are some other ways to clean the code up a bit by being more careful about potentially creating copies, perhaps splitting the data by site first, doing as much as possible within one [] call, etc., but this will get you started.
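A possible alternative, shown here only as a sketch on made-up data, is a rolling join with roll = "nearest": for each site A detection, look up the closest site B detection of the same animal and keep the row only if it lies within 300 seconds. The table `det` and the column names `obs`, `b_ts` and `nearest_b` are invented for this illustration.

```r
library(data.table)

t0 <- as.POSIXct("2020-01-01 00:00:00", tz = "UTC")
det <- data.table(
  obs    = 1:5,
  animal = c("a1", "a1", "a1", "a2", "a2"),
  site   = c("A", "B", "A", "A", "B"),
  ts     = t0 + c(0, 60, 7200, 0, 3600)
)

# Site B detections, with a copy of the timestamp to carry through the join
b <- det[site == "B", .(animal, ts, b_ts = ts)]
a <- det[site == "A"]

# For every A row, the timestamp of the nearest B detection of the same animal
# (NA when that animal was never seen at B)
a[, nearest_b := b[a, on = .(animal, ts), roll = "nearest", mult = "first", x.b_ts]]

# Keep A rows whose nearest B detection is at most 5 minutes away
gap_secs <- abs(as.numeric(difftime(a$nearest_b, a$ts, units = "secs")))
keep <- a[!is.na(gap_secs) & gap_secs <= 300, obs]

res <- det[site == "B" | obs %in% keep]
print(res)
```

Only the nearest B detection needs to be checked, because if the nearest one is more than 5 minutes away, every other one is too.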
I'm looking to use data.table to improve speed for a given function, but I'm not sure I'm implementing it the correct way:
Data
Given two data.tables (dt and dt_lookup)
library(data.table)
set.seed(1234)
t <- seq(1,100); l <- letters; la <- letters[1:13]; lb <- letters[14:26]
n <- 10000
dt <- data.table(id=seq(1:n),
thisTime=sample(t, n, replace=TRUE),
thisLocation=sample(la,n,replace=TRUE),
finalLocation=sample(lb,n,replace=TRUE))
setkey(dt, thisLocation)
set.seed(4321)
dt_lookup <- data.table(lkpId = paste0("l-",seq(1,1000)),
lkpTime=sample(t, 10000, replace=TRUE),
lkpLocation=sample(l, 10000, replace=TRUE))
## NOTE: lkpId is purposely recycled
setkey(dt_lookup, lkpLocation)
I have a function that finds the lkpId that contains both thisLocation and finalLocation, and has the 'nearest' lkpTime (i.e. the minimum non-negative value of thisTime - lkpTime)
Function
## function to get the 'next' lkpId (i.e. the lkpId with both thisLocation and finalLocation,
## with the minimum non-negative time between thisTime and dt_lookup$lkpTime)
getId <- function(thisTime, thisLocation, finalLocation){
## filter lookup based on thisLocation and finalLocation,
## and only return values where the lkpId has both 'this' and 'final' locations
tempThis <- unique(dt_lookup[lkpLocation == thisLocation,lkpId])
tempFinal <- unique(dt_lookup[lkpLocation == finalLocation,lkpId])
availServices <- tempThis[tempThis %in% tempFinal]
tempThisFinal <- dt_lookup[lkpId %in% availServices & lkpLocation==thisLocation, .(lkpId, lkpTime)]
## calculate time difference between 'thisTime' and 'lkpTime' (from thisLocation)
temp2 <- thisTime - tempThisFinal$lkpTime
## take the lkpId with the minimum non-negative difference
selectedId <- tempThisFinal[min(which(temp2==min(temp2[temp2>0]))),lkpId]
selectedId
}
Attempts at a solution
I need to get the lkpId for each row of dt. Therefore, my initial instinct was to use an *apply function, but it was taking too long (for me) when n/nrow > 1,000,000. So I've tried to implement a data.table solution to see if it's faster:
selectedId <- dt[,.(lkpId = getId(thisTime, thisLocation, finalLocation)),by=id]
However, I'm fairly new to data.table, and this method doesn't appear to give any performance gains over an *apply solution:
lkpIds <- apply(dt, 1, function(x){
thisLocation <- as.character(x[["thisLocation"]])
finalLocation <- as.character(x[["finalLocation"]])
thisTime <- as.numeric(x[["thisTime"]])
myId <- getId(thisTime, thisLocation, finalLocation)
})
Both take ~30 seconds for n = 10,000.
Question
Is there a better way of using data.table to apply the getId function over each row of dt ?
Update 12/08/2015
Thanks to the pointer from #eddi I've redesigned my whole algorithm and am making use of rolling joins (a good introduction), thus making proper use of data.table. I'll write up an answer later.
Having spent the time since asking this question looking into what data.table has to offer, researching data.table joins thanks to #eddi's pointer (for example Rolling join on data.table, and inner join with inequality), I've come up with a solution.
One of the tricky parts was moving away from the thought of 'apply a function to each row', and redesigning the solution to use joins.
There will no doubt be better ways of programming this, but here's my attempt.
## want to find a lkpId for each id, that has the minimum difference between 'thisTime' and 'lkpTime'
## and where the lkpId contains both 'thisLocation' and 'finalLocation'
## find all lookup id's where 'thisLocation' matches 'lookupLocation'
## and where thisTime - lkpTime > 0
setkey(dt, thisLocation)
setkey(dt_lookup, lkpLocation)
dt_this <- dt[dt_lookup, {
idx = thisTime - i.lkpTime > 0
.(id = id[idx],
lkpId = i.lkpId,
thisTime = thisTime[idx],
lkpTime = i.lkpTime)
},
by=.EACHI]
## remove NAs
dt_this <- dt_this[complete.cases(dt_this)]
## find all matching 'finalLocation' and 'lkpLocation'
setkey(dt, finalLocation)
## inner join (and only return the id columns)
dt_final <- dt[dt_lookup, nomatch=0, allow.cartesian=TRUE][,.(id, lkpId)]
## join dt_this to dt_final (as lkpId must have both 'thisLocation' and 'finalLocation')
setkey(dt_this, id, lkpId)
setkey(dt_final, id, lkpId)
dt_join <- dt_this[dt_final, nomatch=0]
## take the combination with the minimum difference between 'thisTime' and 'lkpTime'
dt_join[,timeDiff := thisTime - lkpTime]
dt_join <- dt_join[ dt_join[order(timeDiff), .I[1], by=id]$V1]
## equivalent dplyr code
# library(dplyr)
# dt_join <- dt_join %>%
# group_by(id) %>%
# arrange(timeDiff) %>%
# slice(1) %>%
# ungroup
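For readers new to rolling joins, here is a minimal sketch of the core idea on made-up toy tables (`events` and `lookup` are illustrative names, not the dt and dt_lookup above). It only shows the "latest lookup time at or before this time, within the same location" step; it does not implement the finalLocation condition, and unlike the code above, roll = TRUE also accepts a time difference of exactly zero.

```r
library(data.table)

# Toy data, assumed for illustration
events <- data.table(id = 1:3,
                     loc  = c("a", "a", "b"),
                     time = c(10, 3, 7))
lookup <- data.table(lkpId = c("l-1", "l-2", "l-3"),
                     loc   = c("a", "a", "b"),
                     time  = c(2, 8, 9))
lookup[, lkpTime := time]   # keep the lookup time visible after the join

# For each event: same loc, and the lookup row with the largest time <= event time
res <- lookup[events, on = .(loc, time), roll = TRUE]
print(res[, .(id, loc, time, lkpId, lkpTime)])
```

Event 3 gets NA because the only lookup row for location "b" is later than the event.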