It is related to this question and this other one, although to a larger scale.
I have two data.tables:
The first one with market research data, containing answers stored as integers;
The second one being what can be called a dictionary, with category labels associated to the integers mentioned above.
See reproducible example :
EDIT: Addition of a new variable to include the '0' case.
EDIT 2: Modification of 'age_group' variable to include cases where all unique levels of a factor do not appear in data.
library(data.table)
library(magrittr)
# Table with survey data :
# - each observation contains the answers of a person
# - variables describe the sample population characteristics (gender, age...)
# - numeric variables (like age) are also stored as character vectors
repex_DT <- data.table (
country = as.character(c(1,3,4,2,NA,1,2,2,2,4,NA,2,1,1,3,4,4,4,NA,1)),
gender = as.character(c(NA,2,2,NA,1,1,1,2,2,1,NA,2,1,1,1,2,2,1,2,NA)),
age = as.character(c(18,40,50,NA,NA,22,30,52,64,24,NA,38,16,20,30,40,41,33,59,NA)),
age_group = as.character(c(2,2,2,NA,NA,2,2,2,2,2,NA,2,2,2,2,2,2,2,2,NA)),
status = as.character(c(1,NA,2,9,2,1,9,2,2,1,9,2,1,1,NA,2,2,1,2,9)),
children = as.character(c(0,2,3,1,6,1,4,2,4,NA,NA,2,1,1,NA,NA,3,5,2,1))
)
# Table of the labels associated to categorical variables, plus 'label_id' to match the values
labels_DT <- data.table (
label_id = as.character(c(1:9)),
country = as.character(c("COUNTRY 1","COUNTRY 2","COUNTRY 3","COUNTRY 4",NA,NA,NA,NA,NA)),
gender = as.character(c("Male","Female",NA,NA,NA,NA,NA,NA,NA)),
age_group = as.character(c("Less than 35","35 and more",NA,NA,NA,NA,NA,NA,NA)),
status = as.character(c("Employed","Unemployed",NA,NA,NA,NA,NA,NA,"Do not want to say")),
children = as.character(c("0","1","2","3","4","5 and more",NA,NA,NA))
)
# Identification of the variable nature (numeric or character)
var_type <- c("character","character","numeric","character","character","character")
# Identification of the categorical variable names
categorical_var <- names(repex_DT)[which(var_type == "character")]
You can see that the dictionary table is smaller to the survey data table, this is expected.
Also, despite all variables being stored as character, some are true numeric variables like age, and consequently do not appear in the dictionary table.
My objective is to replace the values of all variables of the first data.table with a matching name in the dictionary table by its corresponding label.
I have actually achieved it using a loop, like the one below:
result_DT1 <- copy(repex_DT)
for (x in categorical_var){
if(length(which(repex_DT[[x]]=="0"))==0){
values_vector <- labels_DT$label_id
labels_vector <- labels_DT[[x]]
}else{
values_vector <- c("0",labels_DT$label_id)
labels_vector <- c(labels_DT[[x]][1:(length(labels_DT[[x]])-1)], NA, labels_DT[[x]][length(labels_DT[[x]])])}
result_DT1[, (c(x)) := plyr::mapvalues(x=get(x), from=values_vector, to=labels_vector, warn_missing = F)]
}
What I want is a faster method (the fastest if one exists), since I have thousands of variables to qualify for dozens of thousands of records.
Any performance improvements would be more than welcome. I battled with stringi but could not have the function running without errors unless using hard-coded variable names. See example:
test_stringi <- copy(repex_DT) %>%
.[, (c("country")) := lapply(.SD, function(x) stringi::stri_replace_all_fixed(
str=x, pattern=unique(labels_DT$label_id)[!is.na(labels_DT[["country"]])],
replacement=unique(na.omit(labels_DT[["country"]])), vectorize_all=FALSE)),
.SDcols = c("country")]
Columns of your 2nd data.table are just look up vectors:
same_cols <- intersect(names(repex_DT), names(labels_DT))
repex_DT[
,
(same_cols) := mapply(
function(x, y) y[as.integer(x)],
repex_DT[, same_cols, with = FALSE],
labels_DT[, same_cols, with = FALSE],
SIMPLIFY = FALSE
)
]
edit
you can add NA on first position in columns of labels_DT (similar like you did for other missing values) or better yet you can keep labels in list:
labels_list <- list(
country = c("COUNTRY 1","COUNTRY 2","COUNTRY 3","COUNTRY 4"),
gender = c("Male","Female"),
age_group = c("Less than 35","35 and more"),
status = c("Employed","Unemployed","Do not want to say"),
children = c("0","1","2","3","4","5 and more")
)
same_cols <- names(labels_list)
repex_DT[
,
(same_cols) := mapply(
function(x, y) y[factor(as.integer(x))],
repex_DT[, same_cols, with = FALSE],
labels_list,
SIMPLIFY = FALSE
)
]
Notice that this way it is necessary to convert to factor first because values in repex_DT can be are not sequance 1, 2, 3...
a very computationally effective way would be to melt your tables first, match them and cast again:
repex_DT[, idx:= .I] # Create an index used for melting
# Melt
repex_melt <- melt(repex_DT, id.vars = "idx")
labels_melt <- melt(labels_DT, id.vars = "label_id")
# Match variables and value/label_id
repex_melt[labels_melt, value2:= i.value, on= c("variable", "value==label_id")]
# Put the data back into its original shape
result <- dcast(repex_melt, idx~variable, value.var = "value2")
I finally found time to work on an answer to this matter.
I changed my approach and used fastmatch::fmatch to identify labels to update.
As pointed out by #det, it is not possible to consider variables with a starting '0' label in the same loop than other standard categorical variables, so the instruction is basically repeated twice.
Still, this is much faster than my initial for loop approach.
The answer below:
library(data.table)
library(magrittr)
library(stringi)
library(fastmatch)
#Selection of variable names depending on the presence of '0' labels
same_cols_with0 <- intersect(names(repex_DT), names(labels_DT))[
which(intersect(names(repex_DT), names(labels_DT)) %fin%
names(repex_DT)[which(unlist(lapply(repex_DT, function(x)
sum(stri_detect_regex(x, pattern="^0$", negate=FALSE), na.rm=TRUE)),
use.names=FALSE)>=1)])]
same_cols_standard <- intersect(names(repex_DT), names(labels_DT))[
which(!(intersect(names(repex_DT), names(labels_DT)) %fin% same_cols_with0))]
labels_std <- labels_DT[, same_cols_standard, with=FALSE]
labels_0 <- labels_DT[, same_cols_with0, with=FALSE]
levels_id <- as.integer(labels_DT$label_id)
#Update joins via matching IDs (credit to #det for mapply syntax).
result_DT <- data.table::copy(repex_DT) %>%
.[, (same_cols_standard) := mapply(
function(x, y) y[fastmatch::fmatch(x=as.integer(x), table=levels_id, nomatch=NA)],
repex_DT[, same_cols_standard, with=FALSE], labels_std, SIMPLIFY=FALSE)] %>%
.[, (same_cols_with0) := mapply(
function(x, y) y[fastmatch::fmatch(x=as.integer(x), table=(levels_id - 1), nomatch=NA)],
repex_DT[, same_cols_with0, with=FALSE], labels_0, SIMPLIFY=FALSE)]
I have a column in dataframe df with value 'name>year>format'. Now I want to split this column by > and add those values to new columns named as name, year, format. How can I do this in R.
You can do that easily using separate function in tidyr;
library(tidyr)
library(dplyr)
data <-
data.frame(
A = c("Joe>1993>student")
)
data %>%
separate(A, into = c("name", "year", "format"), sep = ">", remove = FALSE)
# A name year format
# Joe>1993>student Joe 1993 student
If you do not want the original column in the result dataframe change remove to TRUE
An option is read.table in base R
cbind(df, read.table(text = as.character(df$column), sep=">",
header = FALSE, col.names = c("name", "year", "format")))
In case your data is big, it would be a good idea to use data.table as it is very fast.
If you know how many fields your "combined" column has:
Suppose the column has 3 fields, and you know it:
library(data.table)
# the 1:3 should be replaced by 1:n, where n is the number of fields
dt1[, paste0("V", 1:3) := tstrsplit(y, split = ">", fixed = TRUE)]
If you DON'T know in advance how many fields the column has:
Now we can get some help from the stringi package:
library(data.table)
library(stringi)
maxFields <- dt2[, max(stri_count_fixed(y, ">")) + 1]
dt2[, paste0("V", 1:maxFields) := tstrsplit(y, split = ">", fixed = TRUE, fill = NA)]
Data used:
library(data.table)
dt1 <- data.table(x = c("A", "B"), y = c("letter>2018>pdf", "code>2020>Rmd"))
dt2 <- rbind(dt1, data.table(x = "C", y = "report>2019>html>pdf"))
I need to prepare a certain dataset for analysis. What I have is a table with column names (obviously). The column names are as follows (sample colnames):
"X99_NORM", "X101_NORM", "X76_110_T02_09747", "X30_NORM"
(this is a vector, for those not familiair with R colnames() function)
Now, what I want is simply to flip the values in front of, and after the underscore. e.g. X99_NORM becomes NORM_X99. Note that I want this only for the column names which contain NORM in their name.
Some other base R options
1)
Use sub to switch the beginning and end - we can make use of capturing groups here.
x <- sub(pattern = "(^X\\d+)_(NORM$)", replacement = "\\2_\\1", x = x)
Result
x
# [1] "NORM_X99" "NORM_X101" "X76_110_T02_09747" "NORM_X30"
2)
A regex-free approach that might be more efficient using chartr, dirname and paste. But we need to get the indices of the columns that contain "NORM" first
idx <- grep(x = x, pattern = "NORM", fixed = TRUE)
x[idx] <- paste0("NORM_", dirname(chartr("_", "/", x[idx])))
x
data
x <- c("X99_NORM", "X101_NORM", "X76_110_T02_09747", "X30_NORM")
x = c("X99_NORM", "X101_NORM", "X76_110_T02_09747", "X30_NORM")
replace(x,
grepl("NORM", x),
sapply(strsplit(x[grepl("NORM", x)], "_"), function(x){
paste(rev(x), collapse = "_")
}))
#[1] "NORM_X99" "NORM_X101" "X76_110_T02_09747" "NORM_X30"
A tidyverse solution with stringr:
library(tidyverse)
library(stringr)
my_data <- tibble(column = c("X99_NORM", "X101_NORM", "X76_110_T02_09747", "X30_NORM"))
my_data %>%
filter(str_detect(column, "NORM")) %>%
mutate(column_2 = paste0("NORM", "_", str_extract(column, ".+(?=_)"))) %>%
select(column_2)
# A tibble: 3 x 1
column_2
<chr>
1 NORM_X99
2 NORM_X101
3 NORM_X30
I have a data frame with some model estimations. Depending on the observation the estimation has just a value or a value together with a confidence interval between square brackets. By the way, the variable is a character (I guess that I need to change it some-when)
df<-data.frame(c("5","3","8 [3 - 5]")
I would like to split this data frame column (x) into two columns. A first one for the estimated values (y) and a second one for the confidence interval with or without brackets (z).
I have tried with tidyr::separate and tidyr::split (I am big fun of the dplyr family:-), but I do not get the wished result.
tidyr::separate(col=x,into=c("y","z"),sep="//[")
Do you know what I am doing wrong?
This can be done with extract
library(tidyr)
extract(df, x, into = c("y", "z"), "(\\d+)\\s*(.*)")
Or use the extra argument in separate
separate(df, x, into = c("y", "z"), "\\s+", extra = "merge")
data
df <- data.frame(x= c("5","3","8 [3 - 5]"))
Here ya go:
library("stringr")
df <- data.frame(c("5", "3", "8 [3 - 5]"))
df2 = str_split_fixed(string = df[,1], pattern = "\\[", n = 2)
df2[,2] = gsub(pattern = "\\]", replacement = "", x = df2[,2])