This is related to this question and this other one, although on a larger scale.
I have two data.tables:
The first one with market research data, containing answers stored as integers;
The second one being what can be called a dictionary, with category labels associated with the integers mentioned above.
See the reproducible example below:
EDIT: Addition of a new variable to include the '0' case.
EDIT 2: Modification of 'age_group' variable to include cases where all unique levels of a factor do not appear in data.
library(data.table)
library(magrittr)
# Table with survey data :
# - each observation contains the answers of a person
# - variables describe the sample population characteristics (gender, age...)
# - numeric variables (like age) are also stored as character vectors
repex_DT <- data.table (
country = as.character(c(1,3,4,2,NA,1,2,2,2,4,NA,2,1,1,3,4,4,4,NA,1)),
gender = as.character(c(NA,2,2,NA,1,1,1,2,2,1,NA,2,1,1,1,2,2,1,2,NA)),
age = as.character(c(18,40,50,NA,NA,22,30,52,64,24,NA,38,16,20,30,40,41,33,59,NA)),
age_group = as.character(c(2,2,2,NA,NA,2,2,2,2,2,NA,2,2,2,2,2,2,2,2,NA)),
status = as.character(c(1,NA,2,9,2,1,9,2,2,1,9,2,1,1,NA,2,2,1,2,9)),
children = as.character(c(0,2,3,1,6,1,4,2,4,NA,NA,2,1,1,NA,NA,3,5,2,1))
)
# Table of the labels associated to categorical variables, plus 'label_id' to match the values
labels_DT <- data.table (
label_id = as.character(c(1:9)),
country = as.character(c("COUNTRY 1","COUNTRY 2","COUNTRY 3","COUNTRY 4",NA,NA,NA,NA,NA)),
gender = as.character(c("Male","Female",NA,NA,NA,NA,NA,NA,NA)),
age_group = as.character(c("Less than 35","35 and more",NA,NA,NA,NA,NA,NA,NA)),
status = as.character(c("Employed","Unemployed",NA,NA,NA,NA,NA,NA,"Do not want to say")),
children = as.character(c("0","1","2","3","4","5 and more",NA,NA,NA))
)
# Identification of the variable nature (numeric or character)
var_type <- c("character","character","numeric","character","character","character")
# Identification of the categorical variable names
categorical_var <- names(repex_DT)[which(var_type == "character")]
You can see that the dictionary table is smaller than the survey data table; this is expected.
Also, despite all variables being stored as character, some are true numeric variables, like age, and consequently do not appear in the dictionary table.
My objective is, for every variable of the first data.table that has a matching name in the dictionary table, to replace its values with the corresponding labels.
I have actually achieved it using a loop, like the one below:
result_DT1 <- copy(repex_DT)
for (x in categorical_var){
if(length(which(repex_DT[[x]]=="0"))==0){
values_vector <- labels_DT$label_id
labels_vector <- labels_DT[[x]]
}else{
values_vector <- c("0",labels_DT$label_id)
labels_vector <- c(labels_DT[[x]][1:(length(labels_DT[[x]])-1)], NA, labels_DT[[x]][length(labels_DT[[x]])])}
result_DT1[, (c(x)) := plyr::mapvalues(x=get(x), from=values_vector, to=labels_vector, warn_missing = F)]
}
What I want is a faster method (the fastest, if one exists), since I have thousands of variables to label for tens of thousands of records.
Any performance improvement would be more than welcome. I battled with stringi but could not get the function to run without errors unless I used hard-coded variable names. See example:
test_stringi <- copy(repex_DT) %>%
.[, (c("country")) := lapply(.SD, function(x) stringi::stri_replace_all_fixed(
str=x, pattern=unique(labels_DT$label_id)[!is.na(labels_DT[["country"]])],
replacement=unique(na.omit(labels_DT[["country"]])), vectorize_all=FALSE)),
.SDcols = c("country")]
The columns of your second data.table are just lookup vectors:
same_cols <- intersect(names(repex_DT), names(labels_DT))
repex_DT[
,
(same_cols) := mapply(
function(x, y) y[as.integer(x)],
repex_DT[, same_cols, with = FALSE],
labels_DT[, same_cols, with = FALSE],
SIMPLIFY = FALSE
)
]
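To illustrate the lookup-vector idea on a single column, here is a minimal sketch with made-up codes (the vector names are illustrative and not taken from the question). It also shows why a "0" code needs separate handling: an index of 0 silently selects nothing.

```r
# Hypothetical toy data: a label vector and integer codes stored as character
gender_labels <- c("Male", "Female")
gender_codes  <- c("2", NA, "1", "2")

# Position i of the label vector holds the label for code i
gender_out <- gender_labels[as.integer(gender_codes)]
print(gender_out)

# Pitfall: an index of 0 selects nothing, so the result is shorter than the input
child_labels <- c("0", "1", "2")
child_codes  <- c("0", "2", "1")
child_out <- child_labels[as.integer(child_codes)]
print(length(child_out))  # 2 rather than 3
```

The length mismatch in the second case is what makes a plain positional lookup unsafe for columns whose codes start at 0.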
Edit:
You can add an NA in the first position of the columns of labels_DT (similar to what you did for other missing values) or, better yet, you can keep the labels in a list:
labels_list <- list(
country = c("COUNTRY 1","COUNTRY 2","COUNTRY 3","COUNTRY 4"),
gender = c("Male","Female"),
age_group = c("Less than 35","35 and more"),
status = c("Employed","Unemployed","Do not want to say"),
children = c("0","1","2","3","4","5 and more")
)
same_cols <- names(labels_list)
repex_DT[
,
(same_cols) := mapply(
function(x, y) y[factor(as.integer(x))],
repex_DT[, same_cols, with = FALSE],
labels_list,
SIMPLIFY = FALSE
)
]
Notice that with this approach it is necessary to convert to factor first, because the values in repex_DT are not necessarily the sequence 1, 2, 3...
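As a small sketch of what the factor conversion does (toy vectors, names are illustrative): indexing by a factor uses its integer codes, so the sorted distinct values 1, 2, 9 become positions 1, 2, 3. The same mechanism has a caveat when some levels never occur in the data, which seems to be the situation described in the question's EDIT 2.

```r
# Hypothetical toy data mirroring the 'status' variable
status_labels <- c("Employed", "Unemployed", "Do not want to say")
status_codes  <- c("1", "9", "2", NA)

f <- factor(as.integer(status_codes))
print(levels(f))      # distinct observed values, sorted
print(as.integer(f))  # positions used for indexing

status_out <- status_labels[f]
print(status_out)

# Caveat: if only the value 2 is observed, it becomes level 1,
# so it picks up the FIRST label rather than the second one
age_labels <- c("Less than 35", "35 and more")
age_codes  <- c("2", "2", "2")
age_out <- age_labels[factor(as.integer(age_codes))]
print(age_out)
```

Passing explicit levels to factor(), or matching on the IDs directly, would avoid relying on which values happen to be present.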
A very computationally efficient way would be to melt your tables first, match them, and cast again:
repex_DT[, idx:= .I] # Create an index used for melting
# Melt
repex_melt <- melt(repex_DT, id.vars = "idx")
labels_melt <- melt(labels_DT, id.vars = "label_id")
# Match variables and value/label_id
repex_melt[labels_melt, value2:= i.value, on= c("variable", "value==label_id")]
# Put the data back into its original shape
result <- dcast(repex_melt, idx~variable, value.var = "value2")
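For readers who want to try the melt / join / dcast idea in isolation, here is a minimal self-contained sketch on a two-variable toy table. The table and column names (survey, dict, label) are assumptions made for the illustration, and the join uses a fully named on= vector instead of the "value==label_id" string form.

```r
library(data.table)

# Hypothetical toy survey and dictionary
survey <- data.table(gender = c("1", "2", NA),
                     status = c("2", "9", "1"))
dict <- data.table(label_id = c("1", "2", "9"),
                   gender   = c("Male", "Female", NA),
                   status   = c("Employed", "Unemployed", "Do not want to say"))

survey[, idx := .I]  # row index so the wide shape can be restored

# Long format: one row per (respondent, variable) and per (id, variable)
survey_long <- melt(survey, id.vars = "idx", variable.factor = FALSE)
dict_long   <- melt(dict, id.vars = "label_id", variable.factor = FALSE,
                    value.name = "label")

# Update join: attach the label where variable and code both match
survey_long[dict_long, label := i.label,
            on = c(variable = "variable", value = "label_id")]

# Back to one row per respondent
wide <- dcast(survey_long, idx ~ variable, value.var = "label")
print(wide)
```

Variables without an entry in the dictionary end up as NA after the cast, so true numeric columns would need to be excluded from the melt or copied back afterwards.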
I finally found time to work on an answer to this matter.
I changed my approach and used fastmatch::fmatch to identify labels to update.
As pointed out by #det, it is not possible to handle variables whose labels start at '0' in the same loop as the other standard categorical variables, so the instruction is basically repeated twice.
Still, this is much faster than my initial for loop approach.
The answer below:
library(data.table)
library(magrittr)
library(stringi)
library(fastmatch)
#Selection of variable names depending on the presence of '0' labels
same_cols_with0 <- intersect(names(repex_DT), names(labels_DT))[
which(intersect(names(repex_DT), names(labels_DT)) %fin%
names(repex_DT)[which(unlist(lapply(repex_DT, function(x)
sum(stri_detect_regex(x, pattern="^0$", negate=FALSE), na.rm=TRUE)),
use.names=FALSE)>=1)])]
same_cols_standard <- intersect(names(repex_DT), names(labels_DT))[
which(!(intersect(names(repex_DT), names(labels_DT)) %fin% same_cols_with0))]
labels_std <- labels_DT[, same_cols_standard, with=FALSE]
labels_0 <- labels_DT[, same_cols_with0, with=FALSE]
levels_id <- as.integer(labels_DT$label_id)
#Update joins via matching IDs (credit to #det for mapply syntax).
result_DT <- data.table::copy(repex_DT) %>%
.[, (same_cols_standard) := mapply(
function(x, y) y[fastmatch::fmatch(x=as.integer(x), table=levels_id, nomatch=NA)],
repex_DT[, same_cols_standard, with=FALSE], labels_std, SIMPLIFY=FALSE)] %>%
.[, (same_cols_with0) := mapply(
function(x, y) y[fastmatch::fmatch(x=as.integer(x), table=(levels_id - 1), nomatch=NA)],
repex_DT[, same_cols_with0, with=FALSE], labels_0, SIMPLIFY=FALSE)]
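The matching logic can be checked on plain vectors. The sketch below uses base match() as a stand-in for fastmatch::fmatch (an assumption made so that it runs without extra packages); the toy vectors are illustrative.

```r
# IDs shared by all dictionary columns
ids <- 1:9

# Hypothetical dictionary columns: one standard, one whose labels start at 0
status_labels   <- c("Employed", "Unemployed", NA, NA, NA, NA, NA, NA,
                     "Do not want to say")
children_labels <- c("0", "1", "2", "3", "4", "5 and more", NA, NA, NA)

status   <- c("1", NA, "9", "2")
children <- c("0", "6", "5", NA)

# Standard variable: code k is looked up at position match(k, 1:9)
status_out <- status_labels[match(as.integer(status), ids)]

# Zero-based variable: shifting the table by one makes code 0 hit position 1
children_out <- children_labels[match(as.integer(children), ids - 1L)]

print(status_out)
print(children_out)
```

Because unmatched codes give NA positions, the result always has the same length as the input, unlike direct positional indexing with a 0.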
I have a dataset with 4 columns: the first is a date and the other 3 are numeric. I am trying to get the % difference from the previous row for those 3 numeric columns. I know there are already some posts about this kind of question, e.g. df %>% mutate_each(funs(. - lag(.))) %>% na.omit(), but most of them do not handle the date column: I want the date to stay unchanged, and I need the percentage difference rather than the absolute one.
Here is the dataset:
date=c('2018-01-01', '2018-02-01', '2018-03-01')
a=c(1,3,2)
b=c(89,56,47)
c=c(1872,7222,2930)
x=data.frame(date,a,b,c)
I wish to have the final dataset like this:
a=c(NA, 2, -0.333)
b=c(NA, -0.371, -0.161)
c=c(NA,2.857, -0.594)
x=data.frame(date,a,b,c)
This means that for column a, 2 = 3/1 - 1 and -0.333 = 2/3 - 1;
for column b, -0.371 = 56/89 - 1, etc.
Thank you so much for your help!
A solution using package data.table:
x = as.data.table(x)
cols = c("a", "b", "c")
x[,(paste0(cols, "_pctChange")) := lapply(.SD, function(col){
(col-shift(col,1,type = "lag"))/shift(col,1,type = "lag")
}), .SDcols=cols]
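For comparison, the same calculation can be sketched in base R without any package. The helper name pct_change is made up for this illustration; it overwrites the numeric columns and leaves the date column untouched.

```r
x <- data.frame(date = c("2018-01-01", "2018-02-01", "2018-03-01"),
                a = c(1, 3, 2),
                b = c(89, 56, 47),
                c = c(1872, 7222, 2930))

# Hypothetical helper: (current - previous) / previous, NA for the first row
pct_change <- function(v) c(NA, diff(v) / head(v, -1))

num_cols <- c("a", "b", "c")
result <- x
result[num_cols] <- lapply(x[num_cols], pct_change)
print(result)
```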
The quantmod package has a very useful function for exactly this, called Delt().
All you would need to do is the following:
library(quantmod)
x[-1] <- sapply(x[-1], Delt)
I'm not sure how familiar you are with sapply, but if you wanted to access Delt()'s parameters to tweak your calculation, you could try something like:
x[-1] <- sapply(x[-1], function(x) { Delt(x, k=2) })
I have the following data frames:
DF <- data.frame(Time=c(1:20))
StartEnd <- data.frame(Start=c(2,6,14,19), End=c(4,10,17,20))
I want to add a column "Activity" to DF that indicates whether the value in the Time column lies within one of the intervals specified in the StartEnd data frame.
I came up with the following:
mapply(FUN = function(Start,End) ifelse(DF$Time >= Start & DF$Time <= End, 1, 0),
Start=StartEnd$Start, End=StartEnd$End)
This doesn't give me the output I want (it gives me a matrix with four columns), but I would like to get a vector that I can add to DF.
I guess the solution is easy but I'm not seeing it :) Thank you in advance.
EDIT: I'm sure I can use a loop but I'm wondering if there are more elegant solutions.
You can achieve this with
DF$Activity <- sapply(DF$Time, function(x) {
ifelse(sum(ifelse(x >= StartEnd$Start & x <= StartEnd$End, 1, 0)), 1, 0)
})
I hope this helps!
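Another option, sketched here under the assumption that a 0/1 integer column is wanted, is to keep the mapply call from the question and collapse the resulting four-column matrix row by row:

```r
DF <- data.frame(Time = 1:20)
StartEnd <- data.frame(Start = c(2, 6, 14, 19), End = c(4, 10, 17, 20))

# One logical column per interval (a 20 x 4 matrix)
in_interval <- mapply(function(Start, End) DF$Time >= Start & DF$Time <= End,
                      StartEnd$Start, StartEnd$End)

# A row is active if it falls in at least one interval
DF$Activity <- as.integer(rowSums(in_interval) > 0)
print(DF)
```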
If you're using the tidyverse, I think a good way to go would be with purrr::map2:
library(dplyr)
library(purrr)
# generate a sequence (n, n + 1, etc.) for each StartEnd row
# (map functions return a list; purrr::flatten_int or unlist can
# squash this down to a vector!)
activity_times = map2(StartEnd$Start, StartEnd$End, seq) %>% flatten_int
# then get a new DF column that is TRUE if Time is in activity_times
DF %>% mutate(active = Time %in% activity_times)
I have created a sample R script to show my question:
test.df <- data.frame(uid=c('x001','x002','x003'),
start_date=c('2015-01-02','2015-03-05','2015-07-09'),
end_date=c('2015-01-07','2015-03-07','2015-07-16'),
stringsAsFactors=FALSE)
test.df[,'start_date'] <- as.Date(test.df[,'start_date'])
test.df[,'end_date'] <- as.Date(test.df[,'end_date'])
for (loop in (1:nrow(test.df))) {
test.df[loop,'output'] <- paste(seq(test.df[loop,'start_date'],test.df[loop,'end_date'],by = 1),collapse=';')
}
I need to create strings of dates of varying length. I can only think of using a for loop for my problem, but I have about 70K cases to process. Is there any way of speeding it up?
Update 01
Thanks #akrun for the answer, I have further modified my question as below:
library(dplyr)
test.df <- data.frame(uid=c('x001','x002','x003'),
start_date=c('2015-01-02','2015-03-05','2015-07-09'),
end_date=c('2015-01-07','2015-03-07','2015-07-16'),
stringsAsFactors=FALSE)
test.df[,'start_date'] <- as.Date(test.df[,'start_date'])
test.df[,'end_date'] <- as.Date(test.df[,'end_date'])
# Part A
for (loop in (1:nrow(test.df))) {
test.df[loop,'output'] <- paste(seq(test.df[loop,'start_date'],test.df[loop,'end_date'],by = 1),collapse=';')
}
# Part B
test.mod <- group_by(test.df,uid) %>%
do({df <- data.frame(.)
output.df <- data.frame(uid=df[1,'uid'],
date=unlist(strsplit(df[,'output'],';')))
data.frame(output.df)
})
Now Part A is fixed, but is there any way to speed up Part B? Or should I combine Part A and Part B? Please enlighten me, as data.table is new to me.
We could convert 'test.df' to a 'data.table' (setDT(test.df)); then, grouped by 'uid', we get the seq from 'start_date' to 'end_date' and paste the elements together.
library(data.table)
setDT(test.df)[,paste(seq(start_date, end_date, by = '1 day'), collapse=';') , uid]
Update
For Part B, if we don't paste, the result is a two-column dataset:
setDT(test.df)[,seq(start_date, end_date, by = '1 day') , uid]
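As a cross-check that does not depend on data.table, both parts can be sketched in base R. The object names date_seqs and long.df are made up for this illustration; the idea is to build each date sequence once and reuse it for the pasted string and for the long table.

```r
test.df <- data.frame(uid = c("x001", "x002", "x003"),
                      start_date = as.Date(c("2015-01-02", "2015-03-05", "2015-07-09")),
                      end_date   = as.Date(c("2015-01-07", "2015-03-07", "2015-07-16")),
                      stringsAsFactors = FALSE)

# One Date sequence per row, computed once
date_seqs <- lapply(seq_len(nrow(test.df)), function(i) {
  seq(test.df$start_date[i], test.df$end_date[i], by = "day")
})

# Part A: collapse each sequence into a single string
test.df$output <- vapply(date_seqs, function(d) paste(d, collapse = ";"),
                         character(1))

# Part B: one row per (uid, date), without splitting the strings again
long.df <- data.frame(uid  = rep(test.df$uid, lengths(date_seqs)),
                      date = do.call(c, date_seqs),
                      stringsAsFactors = FALSE)
print(head(long.df))
```

Building the long table straight from the sequences avoids the paste-then-strsplit round trip of the original Part B.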
Here is how you can do it with apply:
test.df <- data.frame(uid=c('x001','x002','x003'),
start_date=c('2015-01-02','2015-03-05','2015-07-09'),
end_date=c('2015-01-07','2015-03-07','2015-07-16'),
stringsAsFactors=FALSE)
test.df$output <- apply(test.df, 1, function(x) paste(seq(as.Date(x[2]), as.Date(x[3]), by = 1), collapse=';'))
I wish to change the class of selected variables in a data table, using a vectorized operation. I am new to the data.table syntax, and am trying to learn as much as possible. I know the question is basic, but it will help me to better understand the data.table way of thinking!
A similar question was asked here! However, the solution seems to pertain to either reclassing just one column or all columns. My question is unique to a select few columns.
### Load package
require(data.table)
### Create pseudo data
data <- data.table(id = 1:10,
height = rnorm(10, mean = 182, sd = 20),
weight = rnorm(10, mean = 160, sd = 10),
color = rep(c('blue', 'gold'), times = 5))
### Reclass all columns
data <- data[, lapply(.SD, as.character)]
### Search for columns to be reclassed
index <- grep('(id)|(height)|(weight)', names(data))
### data frame method
df <- data.frame(data)
df[, index] <- lapply(df[, index], as.numeric)
### Failed attempt to reclass columns using the data.table method
data <- data[, lapply(index, as.character), with = F]
Any help would be appreciated. My data are large and so using regular expressions to create a vector of column numbers to reclassify is necessary.
Thank you for your time.
You could avoid the overhead of the construction of .SD within j by using set
for(j in index) set(data, j =j ,value = as.character(data[[j]]))
I think that #SimonO101 did most of the job
data[, names(data)[index] := lapply(.SD, as.character) , .SDcols = index ]
You can just use the := magic
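A self-contained sketch of this := with .SDcols pattern, using a small made-up table and converting the selected character columns back to numeric (the column selection by regular expression follows the question; the data are illustrative):

```r
library(data.table)

# Hypothetical table where every column is stored as character
dt <- data.table(id     = as.character(1:4),
                 height = c("180.5", "175", "190.2", "168"),
                 color  = c("blue", "gold", "blue", "gold"))

# Pick the columns to reclass by name pattern
num_cols <- grep("^(id|height)$", names(dt), value = TRUE)

# Reclass only those columns, by reference; the other columns are untouched
dt[, (num_cols) := lapply(.SD, as.numeric), .SDcols = num_cols]

print(sapply(dt, class))
```

Because := modifies the table in place, there is no need to assign the result back to dt.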
You just need to use .SDcols with your index vector (I learnt that today!), but that will just return a data table with the reclassed columns. #dickoa 's answer is what you are looking for.
data <- data[, lapply(.SD, as.character) , .SDcols = index ]
sapply(data , class)
id height weight
"character" "character" "character"