Time difference between dates of adjacent columns in R

I'm trying to adapt the answer to my previous question (Difference between dates in many columns in R). I've realised I only want the time difference between a given column and the column immediately to its left. Example for clarification:
df <- data.frame(
Group=c("A","B"),
ID=c(1,2),
Date1=as.POSIXct(c('2016-04-25 09:15:29','2016-04-25 09:15:29')),
Date2=as.POSIXct(c('2016-04-25 14:01:19','2016-04-25 14:01:19')),
Date3=as.POSIXct(c('2016-04-26 13:28:19','2016-04-26 13:28:19')),
stringsAsFactors=F
)
My desired output is Date2-Date1 and Date3-Date2, and this would of course extend to many columns, i.e. Date4-Date3 etc. But I do not need Date3-Date1. To clarify: how can I automate this for many columns?
df$Date2_Date1 <- difftime(df$Date2,df$Date1, units = c("hours"))
df$Date3_Date2 <- difftime(df$Date3,df$Date2, units = c("hours"))
Thanks to @bgoldst for the original answer. I think I just need to adapt cmb below to have the correct sequence:
cmb <- combn(seq_len(ncol(df)-1L)+1L,2L);
res <- abs(apply(cmb,2L,function(x) difftime(df[[x[1L]]],df[[x[2L]]],units='hours')));
colnames(res) <- apply(cmb,2L,function(x,cns) paste0(cns[x[1L]],'_',cns[x[2L]]),names(df))
Thanks

Given your example, this should do the trick:
df <- data.frame(
Group=c("A","B"),
ID=c(1,2),
Date1=as.POSIXct(c('2016-04-25 09:15:29','2016-04-25 09:15:29')),
Date2=as.POSIXct(c('2016-04-25 14:01:19','2016-04-25 14:01:19')),
Date3=as.POSIXct(c('2016-04-26 13:28:19','2016-04-26 13:28:19')),
stringsAsFactors=F
)
mapply(difftime, df[, 4:5], df[, 3:4], units = "hours")
>          Date2 Date3
> [1,] 4.763889 23.45
> [2,] 4.763889 23.45
In my call, mapply applies the function difftime to the two sets of columns provided, so it starts with df[, 4] - df[, 3], then df[, 5] - df[, 4]. You of course have to adjust the column numbers to match your dates, and make sure the columns are ordered in the right way.
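If you do not want to hard-code the positions, one way (a sketch, assuming the date columns are named Date1, Date2, ... and already in order) is to look them up first:
date_cols <- grep("^Date", names(df))              # positions of the date columns
res <- mapply(difftime,
              df[date_cols[-1]],                   # Date2, Date3, ...
              df[date_cols[-length(date_cols)]],   # Date1, Date2, ...
              units = "hours")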
Good luck!

You could use Non-Standard Evaluation:
First you create a character vector with the names of the columns containing the dates. So let's say all the columns starting with 'Date':
dates = names(df)[grepl("^Date", names(df))]
We create a list of formulas that dynamically calculate the difference between two adjacent columns:
all_operations = lapply(seq_len(length(dates) - 1), function(i){
as.formula(paste("~difftime(", dates[i + 1], ",", dates[i],", units = c('hours'))"))
})
This will create the formulas:
[[1]]: ~difftime(Date2, Date1, units = c("hours"))
[[2]]: ~difftime(Date3, Date2, units = c("hours"))
Then you can use dplyr's NSE mutate_ to apply the dynamic formulas generated above:
df %>%
mutate_(.dots = setNames(all_operations, paste0("Diff", seq_len(length(dates) - 1))))
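Note that mutate_ was deprecated in later dplyr releases; an equivalent using tidy evaluation (a sketch, assuming a recent dplyr plus rlang are available) would be:
library(dplyr)
library(rlang)
dates <- names(df)[grepl("^Date", names(df))]
# build one difftime() call per adjacent pair of date columns
diff_exprs <- lapply(seq_len(length(dates) - 1), function(i) {
  expr(difftime(!!sym(dates[i + 1]), !!sym(dates[i]), units = "hours"))
})
names(diff_exprs) <- paste0("Diff", seq_along(diff_exprs))
df %>% mutate(!!!diff_exprs)   # splice the named calls into mutate()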

fast replacement of data.table values by labels stored in another data.table

It is related to this question and this other one, although at a larger scale.
I have two data.tables:
The first one with market research data, containing answers stored as integers;
The second one being what can be called a dictionary, with category labels associated to the integers mentioned above.
See reproducible example :
EDIT: Addition of a new variable to include the '0' case.
EDIT 2: Modification of 'age_group' variable to include cases where all unique levels of a factor do not appear in data.
library(data.table)
library(magrittr)
# Table with survey data :
# - each observation contains the answers of a person
# - variables describe the sample population characteristics (gender, age...)
# - numeric variables (like age) are also stored as character vectors
repex_DT <- data.table (
country = as.character(c(1,3,4,2,NA,1,2,2,2,4,NA,2,1,1,3,4,4,4,NA,1)),
gender = as.character(c(NA,2,2,NA,1,1,1,2,2,1,NA,2,1,1,1,2,2,1,2,NA)),
age = as.character(c(18,40,50,NA,NA,22,30,52,64,24,NA,38,16,20,30,40,41,33,59,NA)),
age_group = as.character(c(2,2,2,NA,NA,2,2,2,2,2,NA,2,2,2,2,2,2,2,2,NA)),
status = as.character(c(1,NA,2,9,2,1,9,2,2,1,9,2,1,1,NA,2,2,1,2,9)),
children = as.character(c(0,2,3,1,6,1,4,2,4,NA,NA,2,1,1,NA,NA,3,5,2,1))
)
# Table of the labels associated to categorical variables, plus 'label_id' to match the values
labels_DT <- data.table (
label_id = as.character(c(1:9)),
country = as.character(c("COUNTRY 1","COUNTRY 2","COUNTRY 3","COUNTRY 4",NA,NA,NA,NA,NA)),
gender = as.character(c("Male","Female",NA,NA,NA,NA,NA,NA,NA)),
age_group = as.character(c("Less than 35","35 and more",NA,NA,NA,NA,NA,NA,NA)),
status = as.character(c("Employed","Unemployed",NA,NA,NA,NA,NA,NA,"Do not want to say")),
children = as.character(c("0","1","2","3","4","5 and more",NA,NA,NA))
)
# Identification of the variable nature (numeric or character)
var_type <- c("character","character","numeric","character","character","character")
# Identification of the categorical variable names
categorical_var <- names(repex_DT)[which(var_type == "character")]
You can see that the dictionary table is smaller than the survey data table; this is expected.
Also, despite all variables being stored as character, some are true numeric variables like age, and consequently do not appear in the dictionary table.
My objective is to replace the values of every variable in the first data.table that has a matching name in the dictionary table with its corresponding label.
I have actually achieved it using a loop, like the one below:
result_DT1 <- copy(repex_DT)
for (x in categorical_var){
if(length(which(repex_DT[[x]]=="0"))==0){
values_vector <- labels_DT$label_id
labels_vector <- labels_DT[[x]]
}else{
values_vector <- c("0",labels_DT$label_id)
labels_vector <- c(labels_DT[[x]][1:(length(labels_DT[[x]])-1)], NA, labels_DT[[x]][length(labels_DT[[x]])])}
result_DT1[, (c(x)) := plyr::mapvalues(x=get(x), from=values_vector, to=labels_vector, warn_missing = F)]
}
What I want is a faster method (the fastest, if one exists), since I have thousands of variables to qualify for tens of thousands of records.
Any performance improvement would be more than welcome. I battled with stringi but could not get the function to run without errors unless using hard-coded variable names. See example:
test_stringi <- copy(repex_DT) %>%
.[, (c("country")) := lapply(.SD, function(x) stringi::stri_replace_all_fixed(
str=x, pattern=unique(labels_DT$label_id)[!is.na(labels_DT[["country"]])],
replacement=unique(na.omit(labels_DT[["country"]])), vectorize_all=FALSE)),
.SDcols = c("country")]
Columns of your 2nd data.table are just lookup vectors:
same_cols <- intersect(names(repex_DT), names(labels_DT))
repex_DT[
,
(same_cols) := mapply(
function(x, y) y[as.integer(x)],
repex_DT[, same_cols, with = FALSE],
labels_DT[, same_cols, with = FALSE],
SIMPLIFY = FALSE
)
]
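To see why this works: indexing a character vector by integer position is itself the lookup, e.g. (a quick illustration):
gender_labels <- labels_DT$gender              # "Male", "Female", NA, ...
gender_labels[as.integer(c("2", "1", NA))]     # "Female" "Male" NA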
Edit: you can add NA in the first position of the labels_DT columns (similar to what you did for the other missing values), or better yet you can keep the labels in a list:
labels_list <- list(
country = c("COUNTRY 1","COUNTRY 2","COUNTRY 3","COUNTRY 4"),
gender = c("Male","Female"),
age_group = c("Less than 35","35 and more"),
status = c("Employed","Unemployed","Do not want to say"),
children = c("0","1","2","3","4","5 and more")
)
same_cols <- names(labels_list)
repex_DT[
,
(same_cols) := mapply(
function(x, y) y[factor(as.integer(x))],
repex_DT[, same_cols, with = FALSE],
labels_list,
SIMPLIFY = FALSE
)
]
Notice that this way it is necessary to convert to factor first, because the values in repex_DT may not form the sequence 1, 2, 3, ...
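For instance, status uses the codes 1, 2 and 9, while its label vector has only three entries; factor() remaps the codes that actually occur to consecutive positions (a quick illustration):
status_codes <- as.integer(c("1", "2", "9"))
labels_list$status[status_codes]            # code 9 indexes past the end -> NA
labels_list$status[factor(status_codes)]    # factor codes 1, 2, 3 -> all three labels
Note this relies on every distinct code being present in the column; when some levels are missing from the data (the age_group case from EDIT 2), matching against label_id explicitly, as in the fastmatch answer below, is safer.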
A computationally efficient way would be to melt your tables first, match them, and cast again:
repex_DT[, idx:= .I] # Create an index used for melting
# Melt
repex_melt <- melt(repex_DT, id.vars = "idx")
labels_melt <- melt(labels_DT, id.vars = "label_id")
# Match variables and value/label_id
repex_melt[labels_melt, value2:= i.value, on= c("variable", "value==label_id")]
# Put the data back into its original shape
result <- dcast(repex_melt, idx~variable, value.var = "value2")
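One caveat (an assumption about the desired behaviour): columns such as age have no entry in labels_DT, so value2 is NA for them after the join; if those original values should survive the round trip, they can be copied back before casting:
repex_melt[is.na(value2), value2 := value]   # keep values that have no label (e.g. age)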
I finally found time to work on an answer to this matter.
I changed my approach and used fastmatch::fmatch to identify labels to update.
As pointed out by @det, it is not possible to handle variables with a '0' starting label in the same loop as the other standard categorical variables, so the instruction is basically repeated twice.
Still, this is much faster than my initial for loop approach.
The answer below:
library(data.table)
library(magrittr)
library(stringi)
library(fastmatch)
#Selection of variable names depending on the presence of '0' labels
same_cols_with0 <- intersect(names(repex_DT), names(labels_DT))[
which(intersect(names(repex_DT), names(labels_DT)) %fin%
names(repex_DT)[which(unlist(lapply(repex_DT, function(x)
sum(stri_detect_regex(x, pattern="^0$", negate=FALSE), na.rm=TRUE)),
use.names=FALSE)>=1)])]
same_cols_standard <- intersect(names(repex_DT), names(labels_DT))[
which(!(intersect(names(repex_DT), names(labels_DT)) %fin% same_cols_with0))]
labels_std <- labels_DT[, same_cols_standard, with=FALSE]
labels_0 <- labels_DT[, same_cols_with0, with=FALSE]
levels_id <- as.integer(labels_DT$label_id)
#Update joins via matching IDs (credit to #det for mapply syntax).
result_DT <- data.table::copy(repex_DT) %>%
.[, (same_cols_standard) := mapply(
function(x, y) y[fastmatch::fmatch(x=as.integer(x), table=levels_id, nomatch=NA)],
repex_DT[, same_cols_standard, with=FALSE], labels_std, SIMPLIFY=FALSE)] %>%
.[, (same_cols_with0) := mapply(
function(x, y) y[fastmatch::fmatch(x=as.integer(x), table=(levels_id - 1), nomatch=NA)],
repex_DT[, same_cols_with0, with=FALSE], labels_0, SIMPLIFY=FALSE)]

R Get percentage difference from previous row

I have a dataset with 4 columns: the 1st column is a date, the other 3 are numeric. I am trying to get the % difference from the previous row for those 3 numeric columns. I know there are already posts about this kind of question, e.g. df %>% mutate_each(funs(. - lag(.))) %>% na.omit(), but most of them do not handle the date: I want the date column to stay unchanged, and I need the % difference rather than the absolute difference.
Here is the dataset:
date=c('2018-01-01', '2018-02-01', '2018-03-01')
a=c(1,3,2)
b=c(89,56,47)
c=c(1872,7222,2930)
x=data.frame(date,a,b,c)
I wish to have the final dataset like this:
a=c(NA, 2, -0.333)
b=c(NA, -0.371, -0.161)
c=c(NA, 2.857, -0.594)
x=data.frame(date,a,b,c)
That is, for column a, 2 = 3/1 - 1 and -0.333 = 2/3 - 1;
for column b, -0.371 = 56/89 - 1, etc.
Thank you so much for your help!
A solution using package data.table:
x = as.data.table(x)
cols = c("a", "b", "c")
x[,(paste0(cols, "_pctChange")) := lapply(.SD, function(col){
(col-shift(col,1,type = "lag"))/shift(col,1,type = "lag")
}), .SDcols=cols]
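Equivalently, since (col - shift(col)) / shift(col) simplifies to col / shift(col) - 1, the j expression can be written a bit more compactly:
x[, (paste0(cols, "_pctChange")) := lapply(.SD, function(col) col / shift(col) - 1), .SDcols = cols]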
The quantmod package has a very useful function for exactly this, called Delt().
All you would need to do is the following:
library(quantmod)
x[-1] <- sapply(x[-1], Delt)
I'm not sure how familiar you are with sapply, but if you wanted to access Delt()'s parameters to tweak your calculation, you could try something like:
x[-1] <- sapply(x[-1], function(col) { Delt(col, k=2) })

R - add column based on intervals in separate data frame

I have the following data frames:
DF <- data.frame(Time=c(1:20))
StartEnd <- data.frame(Start=c(2,6,14,19), End=c(4,10,17,20))
I want to add a column "Activity" to DF if the values in the Time column lie within one of the intervals specified in the StartEnd data frame.
I came up with the following:
mapply(FUN = function(Start,End) ifelse(DF$Time >= Start & DF$Time <= End, 1, 0),
Start=StartEnd$Start, End=StartEnd$End)
This doesn't give me the output I want (it gives me a matrix with four columns), but I would like to get a vector that I can add to DF.
I guess the solution is easy but I'm not seeing it :) Thank you in advance.
EDIT: I'm sure I can use a loop but I'm wondering if there are more elegant solutions.
You can achieve this with
DF$Activity <- sapply(DF$Time, function(x) {
ifelse(sum(ifelse(x >= StartEnd$Start & x <= StartEnd$End, 1, 0)), 1, 0)
})
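As a side note, the inner sum(ifelse(...)) can be collapsed with any(), which reads a little more directly (an equivalent sketch):
DF$Activity <- as.integer(sapply(DF$Time, function(x) {
  any(x >= StartEnd$Start & x <= StartEnd$End)
}))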
I hope this helps!
If you're using the tidyverse, I think a good way to go would be with purrr::map2:
library(purrr)
library(dplyr)
# generate a sequence (Start, Start + 1, ..., End) for each StartEnd row
# (map functions return a list; purrr::flatten_int or unlist can
# squash this down to a vector!)
activity_times = map2(StartEnd$Start, StartEnd$End, seq) %>% flatten_int
# then get a new DF column that is TRUE if Time is in activity_times
DF %>% mutate(active = Time %in% activity_times)

Vectorizing for-loop in R for creating strings with different length

I have created a sample R script to show my question:
test.df <- data.frame(uid=c('x001','x002','x003'),
start_date=c('2015-01-02','2015-03-05','2015-07-09'),
end_date=c('2015-01-07','2015-03-07','2015-07-16'),
stringsAsFactors=FALSE)
test.df[,'start_date'] <- as.Date(test.df[,'start_date'])
test.df[,'end_date'] <- as.Date(test.df[,'end_date'])
for (loop in (1:nrow(test.df))) {
test.df[loop,'output'] <- paste(seq(test.df[loop,'start_date'],test.df[loop,'end_date'],by = 1),collapse=';')
}
I need to create strings of dates with different lengths. I can only think of using a for-loop for my problem, but I have about 70K cases that need this string processing; is there any way of speeding it up?
Update 01
Thanks #akrun for the answer, I have further modified my question as below:
library(dplyr)
test.df <- data.frame(uid=c('x001','x002','x003'),
start_date=c('2015-01-02','2015-03-05','2015-07-09'),
end_date=c('2015-01-07','2015-03-07','2015-07-16'),
stringsAsFactors=FALSE)
test.df[,'start_date'] <- as.Date(test.df[,'start_date'])
test.df[,'end_date'] <- as.Date(test.df[,'end_date'])
# Part A
for (loop in (1:nrow(test.df))) {
test.df[loop,'output'] <- paste(seq(test.df[loop,'start_date'],test.df[loop,'end_date'],by = 1),collapse=';')
}
# Part B
test.mod <- group_by(test.df,uid) %>%
do({df <- data.frame(.)
output.df <- data.frame(uid=df[1,'uid'],
date=unlist(strsplit(df[,'output'],';')))
data.frame(output.df)
})
Now Part A is fixed, but is there any way to speed up Part B? Or should I combine Part A and Part B together? Please enlighten me, as data.table is new to me.
We could convert 'test.df' to 'data.table' (setDT(test.df)), then, grouped by 'uid', get the seq from 'start_date' to 'end_date' and paste the elements together.
library(data.table)
setDT(test.df)[,paste(seq(start_date, end_date, by = '1 day'), collapse=';') , uid]
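If you prefer the result to have a named column instead of the default V1, the j expression can be wrapped in .() (a small variation, assuming 'output' is the desired name):
setDT(test.df)[, .(output = paste(seq(start_date, end_date, by = '1 day'), collapse = ';')), by = uid]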
Update
For Part B, if we don't paste, we get a two-column dataset:
setDT(test.df)[,seq(start_date, end_date, by = '1 day') , uid]
Here is how you can do it with apply
test.df <- data.frame(uid=c('x001','x002','x003'),
start_date=c('2015-01-02','2015-03-05','2015-07-09'),
end_date=c('2015-01-07','2015-03-07','2015-07-16'),
stringsAsFactors=FALSE)
test.df$output <- apply(test.df, 1, function(x) paste(seq(as.Date(x[2]), as.Date(x[3]), by = 1), collapse=';'))
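One caveat: apply coerces each row to a character vector, so x[2] and x[3] index purely by position; indexing by name is a little safer if the column order ever changes (a minor variation):
test.df$output <- apply(test.df, 1, function(x)
  paste(seq(as.Date(x['start_date']), as.Date(x['end_date']), by = 1), collapse = ';'))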

Reclassify select columns in Data Table

I wish to change the class of selected variables in a data table, using a vectorized operation. I am new to the data.table syntax and am trying to learn as much as possible. I know the question is basic, but it will help me to better understand the data table way of thinking!
A similar question was asked here. However, the solution seems to pertain to either reclassing just one column or all of the columns; my question concerns only a select few columns.
### Load package
require(data.table)
### Create pseudo data
data <- data.table(id = 1:10,
height = rnorm(10, mean = 182, sd = 20),
weight = rnorm(10, mean = 160, sd = 10),
color = rep(c('blue', 'gold'), times = 5))
### Reclass all columns
data <- data[, lapply(.SD, as.character)]
### Search for columns to be reclassed
index <- grep('(id)|(height)|(weight)', names(data))
### data frame method
df <- data.frame(data)
df[, index] <- lapply(df[, index], as.numeric)
### Failed attempt to reclass columns used the data.table method
data <- data[, lapply(index, as.character), with = F]
Any help would be appreciated. My data are large and so using regular expressions to create a vector of column numbers to reclassify is necessary.
Thank you for your time.
You could avoid the overhead of the construction of .SD within j by using set:
for (j in index) set(data, j = j, value = as.character(data[[j]]))
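If the goal, as in the question's data frame method, is to turn those columns back into numeric, the same pattern applies (a sketch, assuming that is the intent):
for (j in index) set(data, j = j, value = as.numeric(data[[j]]))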
I think that @SimonO101 did most of the job:
data[, names(data)[index] := lapply(.SD, as.character) , .SDcols = index ]
You can just use the := magic
You just need to use .SDcols with your index vector (I learnt that today!), but that will just return a data table with the reclassed columns. @dickoa's answer is what you are looking for.
data <- data[, lapply(.SD, as.character) , .SDcols = index ]
sapply(data , class)
         id      height      weight 
"character" "character" "character" 
