This question is related to this question and this other one, although at a larger scale.
I have two data.tables:
The first one contains market research data, with answers stored as integers;
The second one is what can be called a dictionary, with the category labels associated with the integers mentioned above.
See the reproducible example below:
EDIT: Added a new variable to include the '0' case.
EDIT 2: Modified the 'age_group' variable to include cases where not all unique levels of a factor appear in the data.
library(data.table)
library(magrittr)
# Table with survey data :
# - each observation contains the answers of a person
# - variables describe the sample population characteristics (gender, age...)
# - numeric variables (like age) are also stored as character vectors
repex_DT <- data.table (
country = as.character(c(1,3,4,2,NA,1,2,2,2,4,NA,2,1,1,3,4,4,4,NA,1)),
gender = as.character(c(NA,2,2,NA,1,1,1,2,2,1,NA,2,1,1,1,2,2,1,2,NA)),
age = as.character(c(18,40,50,NA,NA,22,30,52,64,24,NA,38,16,20,30,40,41,33,59,NA)),
age_group = as.character(c(2,2,2,NA,NA,2,2,2,2,2,NA,2,2,2,2,2,2,2,2,NA)),
status = as.character(c(1,NA,2,9,2,1,9,2,2,1,9,2,1,1,NA,2,2,1,2,9)),
children = as.character(c(0,2,3,1,6,1,4,2,4,NA,NA,2,1,1,NA,NA,3,5,2,1))
)
# Table of the labels associated with the categorical variables, plus 'label_id' to match the values
labels_DT <- data.table (
label_id = as.character(c(1:9)),
country = as.character(c("COUNTRY 1","COUNTRY 2","COUNTRY 3","COUNTRY 4",NA,NA,NA,NA,NA)),
gender = as.character(c("Male","Female",NA,NA,NA,NA,NA,NA,NA)),
age_group = as.character(c("Less than 35","35 and more",NA,NA,NA,NA,NA,NA,NA)),
status = as.character(c("Employed","Unemployed",NA,NA,NA,NA,NA,NA,"Do not want to say")),
children = as.character(c("0","1","2","3","4","5 and more",NA,NA,NA))
)
# Identification of the variable nature (numeric or character)
var_type <- c("character","character","numeric","character","character","character")
# Identification of the categorical variable names
categorical_var <- names(repex_DT)[which(var_type == "character")]
You can see that the dictionary table is smaller than the survey data table; this is expected.
Also, although all variables are stored as character, some are truly numeric (like age) and consequently do not appear in the dictionary table.
My objective is to replace, in the first data.table, the values of every variable whose name matches a column of the dictionary table with the corresponding labels.
I have actually achieved it using a loop, like the one below:
result_DT1 <- copy(repex_DT)
for (x in categorical_var){
  if (length(which(repex_DT[[x]] == "0")) == 0){
    values_vector <- labels_DT$label_id
    labels_vector <- labels_DT[[x]]
  } else {
    values_vector <- c("0", labels_DT$label_id)
    labels_vector <- c(labels_DT[[x]][1:(length(labels_DT[[x]]) - 1)], NA,
                       labels_DT[[x]][length(labels_DT[[x]])])
  }
  result_DT1[, (c(x)) := plyr::mapvalues(x = get(x), from = values_vector,
                                         to = labels_vector, warn_missing = FALSE)]
}
What I want is a faster method (the fastest if one exists), since I have thousands of variables to relabel across tens of thousands of records.
Any performance improvements would be more than welcome. I battled with stringi but could not get the function to run without errors unless I hard-coded the variable names. See the example:
test_stringi <- copy(repex_DT) %>%
.[, (c("country")) := lapply(.SD, function(x) stringi::stri_replace_all_fixed(
str=x, pattern=unique(labels_DT$label_id)[!is.na(labels_DT[["country"]])],
replacement=unique(na.omit(labels_DT[["country"]])), vectorize_all=FALSE)),
.SDcols = c("country")]
The columns of your second data.table are just lookup vectors:
same_cols <- intersect(names(repex_DT), names(labels_DT))
repex_DT[
,
(same_cols) := mapply(
function(x, y) y[as.integer(x)],
repex_DT[, same_cols, with = FALSE],
labels_DT[, same_cols, with = FALSE],
SIMPLIFY = FALSE
)
]
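One caveat, which the edit below addresses: R silently drops an index of 0, so this plain positional lookup misbehaves for the children column, where '0' is a stored value. A quick illustration using the example labels_DT:
labels_DT$children[c(0L, 2L)]  # index 0 is dropped -> length 1 ("1"), not length 2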
Edit
You can add NA in the first position of the labels_DT columns (similar to what you did for the other missing values), or better yet keep the labels in a list:
labels_list <- list(
country = c("COUNTRY 1","COUNTRY 2","COUNTRY 3","COUNTRY 4"),
gender = c("Male","Female"),
age_group = c("Less than 35","35 and more"),
status = c("Employed","Unemployed","Do not want to say"),
children = c("0","1","2","3","4","5 and more")
)
same_cols <- names(labels_list)
repex_DT[
,
(same_cols) := mapply(
function(x, y) y[factor(as.integer(x))],
repex_DT[, same_cols, with = FALSE],
labels_list,
SIMPLIFY = FALSE
)
]
Notice that it is necessary to convert to factor first, because the values in repex_DT may not form the sequence 1, 2, 3, ...
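A small illustration of why plain integer indexing is not enough, using the status labels (where the stored code 9 should map to the third entry; the values below are made up for the demo):
x    <- c("1", "2", "9", NA)
labs <- c("Employed", "Unemployed", "Do not want to say")
labs[as.integer(x)]           # index 9 is out of range -> NA
labs[factor(as.integer(x))]   # factor codes 1, 2, 3 map "9" to the third label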
A very computationally efficient way would be to melt your tables first, match them, and cast again:
repex_DT[, idx:= .I] # Create an index used for melting
# Melt
repex_melt <- melt(repex_DT, id.vars = "idx")
labels_melt <- melt(labels_DT, id.vars = "label_id")
# Match variables and value/label_id
repex_melt[labels_melt, value2:= i.value, on= c("variable", "value==label_id")]
# Put the data back into its original shape
result <- dcast(repex_melt, idx~variable, value.var = "value2")
I finally found time to work on an answer to this matter.
I changed my approach and used fastmatch::fmatch to identify labels to update.
As pointed out by #det, it is not possible to handle variables with a starting '0' label in the same loop as the other standard categorical variables, so the instruction is basically repeated twice.
Still, this is much faster than my initial for loop approach.
The answer below:
library(data.table)
library(magrittr)
library(stringi)
library(fastmatch)
#Selection of variable names depending on the presence of '0' labels
same_cols_with0 <- intersect(names(repex_DT), names(labels_DT))[
which(intersect(names(repex_DT), names(labels_DT)) %fin%
names(repex_DT)[which(unlist(lapply(repex_DT, function(x)
sum(stri_detect_regex(x, pattern="^0$", negate=FALSE), na.rm=TRUE)),
use.names=FALSE)>=1)])]
same_cols_standard <- intersect(names(repex_DT), names(labels_DT))[
which(!(intersect(names(repex_DT), names(labels_DT)) %fin% same_cols_with0))]
labels_std <- labels_DT[, same_cols_standard, with=FALSE]
labels_0 <- labels_DT[, same_cols_with0, with=FALSE]
levels_id <- as.integer(labels_DT$label_id)
#Update joins via matching IDs (credit to #det for mapply syntax).
result_DT <- data.table::copy(repex_DT) %>%
.[, (same_cols_standard) := mapply(
function(x, y) y[fastmatch::fmatch(x=as.integer(x), table=levels_id, nomatch=NA)],
repex_DT[, same_cols_standard, with=FALSE], labels_std, SIMPLIFY=FALSE)] %>%
.[, (same_cols_with0) := mapply(
function(x, y) y[fastmatch::fmatch(x=as.integer(x), table=(levels_id - 1), nomatch=NA)],
repex_DT[, same_cols_with0, with=FALSE], labels_0, SIMPLIFY=FALSE)]
I write and review a fair amount of R code like this:
df <- data.frame(replicate(10, sample(0:5, 10, rep = TRUE)))
my.func <- function(col, y) {col %in% y}
df$X2 <- my.func(df$X2, c(1,2))
df$X3 <- my.func(df$X3, c(4,5))
df$X5 <- my.func(df$X5, c(1,2))
df$X6 <- my.func(df$X6, c(4,5))
df$X8 <- my.func(df$X8, c(4,5))
df$X9 <- my.func(df$X9, c(1,2))
df$X10 <- my.func(df$X10, c(1))
That is, certain columns in a data.frame (or data.table) are transformed using a function, where one argument is a column and the other is some arbitrary, somewhat-unique-to-that-column value.
What's a more concise way to make such transformations?
I've tried using data.table's set (:=) operator, which makes things slightly cleaner, but still each column name must appear twice and the function must appear once for each column.
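For illustration, here is a sketch of what that := pattern might look like (my guess at the shape being described, assuming the same df and my.func as above; each column name still appears twice):
library(data.table)
DT <- as.data.table(df)
DT[, X2 := my.func(X2, c(1, 2))]
DT[, X3 := my.func(X3, c(4, 5))]
DT[, X5 := my.func(X5, c(1, 2))]
# ... and so on for each remaining column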
A concise way would be Map, with the input arguments being the dataset ('df') and a list of vectors to be passed as the second argument to my.func. Here, each column of the data.frame is a unit, and so is each vector element of the list.
df[] <- Map(my.func, df, list(1:2, 4:5, 3:4))
NOTE: The OP's actual function / a minimal reproducible example is not provided, so this is not tested.
NOTE 2: The assumption here is that three columns are to be changed. If there are more, increase the length of the list accordingly.
The above can also be converted to data.table syntax
library(data.table)
setDT(df)[, names(df) := Map(my.func, .SD, list(1:2, 4:5, 3:4))]
If only a subset of columns needs to be changed, specify those columns in .SDcols and also replace names(df) with that subset of names, as sketched below:
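A minimal sketch of that subset version (the column choice below is hypothetical), assuming a fresh df and the same my.func as above:
library(data.table)
cols <- c("X2", "X3")  # only these columns get transformed
setDT(df)[, (cols) := Map(my.func, .SD, list(1:2, 4:5)), .SDcols = cols]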
Or with tidyverse
library(tidyverse)
map2_dfc(df, list(1:2, 4:5, 3:4), my.func)
OP's request from a comment:
make the association between column names and function argument(s) for those columns more explicit
Adjusting the Map approach seen in the other answers:
yL <- list(X2 = 1:2, X3 = 4:5, X5 = 3:4, X6 = 4:5, X8 = 4:5, X9 = 1:2, X10 = 1)
df[names(yL)] <- Map(my.func, df[names(yL)], y = yL)
With data.table:
# this saves you from writing DT twice
DT[, names(yL) := Map(my.func, .SD, y = yL), .SDcols=names(yL)]
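Here DT is assumed to be a data.table version of the example data, e.g.:
library(data.table)
DT <- as.data.table(df)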
My question relates to this previously asked question:
Calculating a weighted mean using data.table in R with weights in one of the table columns
In my case, I have different weight columns across the columns I want to aggregate. Let's say I have four columns col_a, col_b, col_c and col_d, and let's assume I want to aggregate col_a and col_b with weights w_1 and col_c, col_d with w_2. Example:
require(data.table)
id <- c(1,1,1,2,2,2)
col_a <- c(123,56,87,987,1003,10)
col_b <- c(17,234,20,88,765,69)
col_c <- c(45,90,543,30,1,543)
col_d <- c(60,43,700,3,88,46)
w_1 <- c(1,1,1,1,1,1)
w_2 <- c(1.5,1,1.2,0.8,1,1)
dt <- data.table(id, col_a, col_b, col_c, col_d, w_1, w_2);dt
Now the desired result would look like this:
data.table(
  id    = c(1, 2),
  col_a = c(weighted.mean(col_a[1:3], w_1[1:3]), weighted.mean(col_a[4:6], w_1[4:6])),
  col_b = c(weighted.mean(col_b[1:3], w_1[1:3]), weighted.mean(col_b[4:6], w_1[4:6])),
  col_c = c(weighted.mean(col_c[1:3], w_2[1:3]), weighted.mean(col_c[4:6], w_2[4:6])),
  col_d = c(weighted.mean(col_d[1:3], w_2[1:3]), weighted.mean(col_d[4:6], w_2[4:6]))
)
This, I thought, could be accomplished similarly to #akrun's answer to this post:
R collapse multiple rows into 1 row using specific function to each column
where I would have the two functions weighted.mean(x, w_1) and weighted.mean(x, w_2) instead of min or median.
Here is how far I got:
colsToKeep <- c("col_a","col_b","col_c","col_d")
dt[, Map(function(x,y) get(x)(y, na.rm = TRUE),
setNames(rep(c('weighted.mean', 'weighted.mean'),2),names(.SD)), .SD),.SDcols=colsToKeep, by = id]
My question: how can I get the arguments w = w_1 and w = w_2 into the setNames function? Is that even possible?
It could be something like this too:
colsToKeep <- c("col_a", "col_b", "col_c", "col_d")
colsToW <- c("w_1", "w_1", "w_2", "w_2")
eval(parse(text = paste0("dt[, .(", paste0("w_", colsToKeep, " = weighted.mean(", colsToKeep, ",", colsToW, ")", collapse = ", "), "), by = id]")))
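For clarity, the string built by paste0 parses to a call equivalent to:
dt[, .(w_col_a = weighted.mean(col_a, w_1),
       w_col_b = weighted.mean(col_b, w_1),
       w_col_c = weighted.mean(col_c, w_2),
       w_col_d = weighted.mean(col_d, w_2)), by = id]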
or this one:
dt[, Map(function(x,y,w) get(x)(y, w, na.rm = TRUE),
setNames(rep('weighted.mean',length(colsToKeep)), paste0("W_", colsToKeep)),
.SD[, ..colsToKeep], .SD[, ..colsToW]),
by = id]
As mentioned by Roland, you can melt into a long format. The benefit is that, in the long run, you do not have to change the code every time there is a new column. Explanation is inline; you can print mdt to take a look.
#cast into a long format with col values in 1 column and rows in another columns
mdt <- melt(dt, id.var=c("id",grep("^w", names(dt), value=TRUE)),
variable.name="col", value.name="colVal")
mdt <- melt(mdt, id.var=c("id","col","colVal"),
variable.name="w", value.name="wVal")
#prob need to think of a programmatic way rather than typing columns
myPairs <- data.table(rbind(
c(col="col_a", w="w_1"),
c(col="col_b", w="w_1"),
c(col="col_c", w="w_2"),
c(col="col_d", w="w_2")))
#calculate weighted mean according to myPairs and then pivot the table
dcast(mdt[myPairs, on=.(col, w),
weighted.mean(colVal, wVal),
by=.(id, col)],
id ~ col,
value.var="V1")
I am a novice R programmer. I am wondering how to lapply over a dataframe while avoiding certain columns.
# Some dummy dataframe
df <- data.frame(
grp = c("A", "B", "C", "D"),
trial = as.factor(c(1,1,2,2)),
mean = as.factor(c(44,33,22,11)),
sd = as.factor(c(3,4,1,.5)))
df <- lapply(df, function (x) {as.numeric(as.character(x))})
However, the method I used introduces NAs by coercion.
Would there be a way to selectively (or deselectively) lapply over the dataframe while maintaining its integrity?
In other words, is there a way to convert only mean and sd to numeric? (In general form.)
Thank you
Try doing this:
df[,3:4] <- lapply(df[,3:4], function (x) {as.numeric(as.character(x))})
You are simply applying the function to the specified columns. You can also provide a condition to select a subset of your columns, for example excluding the ones you don't want to convert.
col <- names(df)[!(names(df) %in% c("grp", "trial"))]
df[,col] <- lapply(df[,col], function (x) {as.numeric(as.character(x))})
Well, as you might have guessed, there are many ways. Since you seem to be doing in-place substitution, a for loop would actually be suitable.
df <- data.frame(
grp = c("A", "B", "C", "D"),
trial = as.factor(c(1,1,2,2)),
mean = as.factor(c(44,33,22,11)),
sd = as.factor(c(3,4,1,.5)))
my_cols <- c("trial", "mean", "sd")
for(mc in my_cols) {
df[[mc]] <- as.numeric(as.character(df[[mc]]))
}
If you want to convert selectively by column names:
library(dplyr)
df %>%
mutate_if(names(.) %in% c("mean", "sd"),
function(x) as.numeric(as.character(x)))
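On dplyr 1.0.0 or later, the same idea can also be written with across(); a small sketch, assuming the same df as above:
library(dplyr)
df %>%
  mutate(across(c(mean, sd), ~ as.numeric(as.character(.x))))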
I am trying to modify the values of a column for rows in a specific range. This is my data:
df = data.frame(names = c("george","michael","lena","tony"))
and I want to do the following using dplyr:
df[2:3,] = "elsa"
My attempt at it is the following, but it doesn't seem to work:
df = cbind(df, rows = as.integer(rownames(df)))
dplyr::mutate(df, ifelse(rows %in% c(2,3), names = "elsa" , names = names))
which gives the result:
Error: unused arguments (names = "elsa", names = c(1, 3, 2, 4))
Thanks for any advice.
This question is a little vague, but I think the OP is just trying to replace certain values in a data frame using indexing. As the comment above noted, the example dataframe's column is a factor variable, which makes replacing values behave differently than you might expect. There are two ways to get around this.
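For context, a minimal sketch of the behaviour being described (assuming names was created as a factor, which was the data.frame default before R 4.0):
df <- data.frame(names = c("george","michael","lena","tony"), stringsAsFactors = TRUE)
df$names[2:3] <- "elsa"
# Warning: invalid factor level, NA generated
df$names  # positions 2 and 3 are now NA, not "elsa"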
The first (more verbose) way is to force df$names to be a character variable instead of a factor. Then using indexing to select the value you'd like to change and replace it:
df$names = as.character(df$names)
df$names[c(2,3)] = "elsa"
Alternatively, you can set stringsAsFactors = FALSE when creating the data frame and proceed as above.
df = data.frame(names = c("george","michael","lena","tony"), stringsAsFactors = FALSE)
df$names[c(2:3)] = "elsa"
names
1 george
2 elsa
3 elsa
4 tony
Definitely check out ?data.frame to get a fuller explanation.
The factor answers are faster, but you can do it with dplyr like this (notice that the column must be of type character and not factor):
df <- data.frame(names = c("george","michael","lena","tony"), stringsAsFactors=F)
oldnames <- c("michael", "lena")
df <- mutate(df, names=ifelse(names %in% oldnames, "elsa", names))
Another way is to do something like
oldnames <- c("michael", "lena")
df$names[df$names %in% oldnames] <- "elsa"
Convert names to a character vector explicitly and use replace:
df %>% mutate(names = replace(as.character(names), 2:3, "elsa"))
Note: If names were already a character vector we could have done just:
df %>% mutate(names = replace(names, 2:3, "elsa"))
We can do this using data.table. Convert the 'data.frame' to a 'data.table' (setDT(df)), specify the row index as i, and assign (:=) 'elsa' to 'names'. As the OP mentioned a large dataset, using := from data.table will be extremely fast.
library(data.table)
setDT(df)[2:3, names := 'elsa']
df
# names
#1: george
#2: elsa
#3: elsa
#4: tony