Rename all other levels to "Other" - r

I have a dataframe containing all the calls that I have done in the last year. Under the column "Name" there are the names of the people in my contact list. In R this column contains 30 factors, I want to have only 3 factors: Mom, Dad, BestFriend and Others.
I'm using this snippet:
library(plyr)
call$Name <- mapvalues(call$Name, from = 'Mikey Mouse', to = 'BFF')
call$Name <- mapvalues(call$Name, from = c('Rocky Balboa','Uma Thurman'), to = c('Dad','Mom'))
How can I rename all other levels aside those 3 to Other?

We can first create a level 'Others' (assuming it is a factor), assign the levels that are not %in% the vector of levels ('nm1') to 'Other'
levels(call$Name) <- c(levels(call$Name), 'Other'))
levels(call$Name)[!levels(call$Name %in% nm1] <- 'Other'
Or another option is recode from dplyr which also have the .default option to specify other levels that are not in the vector to a given value
library(dplyr)
recode(call$Name, `Mikey Mouse` = 'BFF', `Rocky Balboa` = 'Dad',
`Uma Thurman` = 'Mom', .default = 'Other')
data
set.seed(24)
call <- data.frame(Name = sample(c('Mikey Mouse', 'Rocky Balboa',
'Uma Thurman', 'Richard Gere', 'Rick Perry'), 25, replace = TRUE))
nm1 <- c('Mickey Mouse', 'Rocky Balboa', 'Uma Thurman')

There is also the fct_other() function in the forcats package for doing exactly this. Using the data akrun provided we could simply do:
library(forcats)
call$Name <- fct_other(call$Name, keep = nm1)

Related

fast replacement of data.table values by labels stored in another data.table

It is related to this question and this other one, although to a larger scale.
I have two data.tables:
The first one with market research data, containing answers stored as integers;
The second one being what can be called a dictionary, with category labels associated to the integers mentioned above.
See reproducible example :
EDIT: Addition of a new variable to include the '0' case.
EDIT 2: Modification of 'age_group' variable to include cases where all unique levels of a factor do not appear in data.
library(data.table)
library(magrittr)
# Table with survey data :
# - each observation contains the answers of a person
# - variables describe the sample population characteristics (gender, age...)
# - numeric variables (like age) are also stored as character vectors
repex_DT <- data.table (
country = as.character(c(1,3,4,2,NA,1,2,2,2,4,NA,2,1,1,3,4,4,4,NA,1)),
gender = as.character(c(NA,2,2,NA,1,1,1,2,2,1,NA,2,1,1,1,2,2,1,2,NA)),
age = as.character(c(18,40,50,NA,NA,22,30,52,64,24,NA,38,16,20,30,40,41,33,59,NA)),
age_group = as.character(c(2,2,2,NA,NA,2,2,2,2,2,NA,2,2,2,2,2,2,2,2,NA)),
status = as.character(c(1,NA,2,9,2,1,9,2,2,1,9,2,1,1,NA,2,2,1,2,9)),
children = as.character(c(0,2,3,1,6,1,4,2,4,NA,NA,2,1,1,NA,NA,3,5,2,1))
)
# Table of the labels associated to categorical variables, plus 'label_id' to match the values
labels_DT <- data.table (
label_id = as.character(c(1:9)),
country = as.character(c("COUNTRY 1","COUNTRY 2","COUNTRY 3","COUNTRY 4",NA,NA,NA,NA,NA)),
gender = as.character(c("Male","Female",NA,NA,NA,NA,NA,NA,NA)),
age_group = as.character(c("Less than 35","35 and more",NA,NA,NA,NA,NA,NA,NA)),
status = as.character(c("Employed","Unemployed",NA,NA,NA,NA,NA,NA,"Do not want to say")),
children = as.character(c("0","1","2","3","4","5 and more",NA,NA,NA))
)
# Identification of the variable nature (numeric or character)
var_type <- c("character","character","numeric","character","character","character")
# Identification of the categorical variable names
categorical_var <- names(repex_DT)[which(var_type == "character")]
You can see that the dictionary table is smaller to the survey data table, this is expected.
Also, despite all variables being stored as character, some are true numeric variables like age, and consequently do not appear in the dictionary table.
My objective is to replace the values of all variables of the first data.table with a matching name in the dictionary table by its corresponding label.
I have actually achieved it using a loop, like the one below:
result_DT1 <- copy(repex_DT)
for (x in categorical_var){
if(length(which(repex_DT[[x]]=="0"))==0){
values_vector <- labels_DT$label_id
labels_vector <- labels_DT[[x]]
}else{
values_vector <- c("0",labels_DT$label_id)
labels_vector <- c(labels_DT[[x]][1:(length(labels_DT[[x]])-1)], NA, labels_DT[[x]][length(labels_DT[[x]])])}
result_DT1[, (c(x)) := plyr::mapvalues(x=get(x), from=values_vector, to=labels_vector, warn_missing = F)]
}
What I want is a faster method (the fastest if one exists), since I have thousands of variables to qualify for dozens of thousands of records.
Any performance improvements would be more than welcome. I battled with stringi but could not have the function running without errors unless using hard-coded variable names. See example:
test_stringi <- copy(repex_DT) %>%
.[, (c("country")) := lapply(.SD, function(x) stringi::stri_replace_all_fixed(
str=x, pattern=unique(labels_DT$label_id)[!is.na(labels_DT[["country"]])],
replacement=unique(na.omit(labels_DT[["country"]])), vectorize_all=FALSE)),
.SDcols = c("country")]
Columns of your 2nd data.table are just look up vectors:
same_cols <- intersect(names(repex_DT), names(labels_DT))
repex_DT[
,
(same_cols) := mapply(
function(x, y) y[as.integer(x)],
repex_DT[, same_cols, with = FALSE],
labels_DT[, same_cols, with = FALSE],
SIMPLIFY = FALSE
)
]
edit
you can add NA on first position in columns of labels_DT (similar like you did for other missing values) or better yet you can keep labels in list:
labels_list <- list(
country = c("COUNTRY 1","COUNTRY 2","COUNTRY 3","COUNTRY 4"),
gender = c("Male","Female"),
age_group = c("Less than 35","35 and more"),
status = c("Employed","Unemployed","Do not want to say"),
children = c("0","1","2","3","4","5 and more")
)
same_cols <- names(labels_list)
repex_DT[
,
(same_cols) := mapply(
function(x, y) y[factor(as.integer(x))],
repex_DT[, same_cols, with = FALSE],
labels_list,
SIMPLIFY = FALSE
)
]
Notice that this way it is necessary to convert to factor first because values in repex_DT can be are not sequance 1, 2, 3...
a very computationally effective way would be to melt your tables first, match them and cast again:
repex_DT[, idx:= .I] # Create an index used for melting
# Melt
repex_melt <- melt(repex_DT, id.vars = "idx")
labels_melt <- melt(labels_DT, id.vars = "label_id")
# Match variables and value/label_id
repex_melt[labels_melt, value2:= i.value, on= c("variable", "value==label_id")]
# Put the data back into its original shape
result <- dcast(repex_melt, idx~variable, value.var = "value2")
I finally found time to work on an answer to this matter.
I changed my approach and used fastmatch::fmatch to identify labels to update.
As pointed out by #det, it is not possible to consider variables with a starting '0' label in the same loop than other standard categorical variables, so the instruction is basically repeated twice.
Still, this is much faster than my initial for loop approach.
The answer below:
library(data.table)
library(magrittr)
library(stringi)
library(fastmatch)
#Selection of variable names depending on the presence of '0' labels
same_cols_with0 <- intersect(names(repex_DT), names(labels_DT))[
which(intersect(names(repex_DT), names(labels_DT)) %fin%
names(repex_DT)[which(unlist(lapply(repex_DT, function(x)
sum(stri_detect_regex(x, pattern="^0$", negate=FALSE), na.rm=TRUE)),
use.names=FALSE)>=1)])]
same_cols_standard <- intersect(names(repex_DT), names(labels_DT))[
which(!(intersect(names(repex_DT), names(labels_DT)) %fin% same_cols_with0))]
labels_std <- labels_DT[, same_cols_standard, with=FALSE]
labels_0 <- labels_DT[, same_cols_with0, with=FALSE]
levels_id <- as.integer(labels_DT$label_id)
#Update joins via matching IDs (credit to #det for mapply syntax).
result_DT <- data.table::copy(repex_DT) %>%
.[, (same_cols_standard) := mapply(
function(x, y) y[fastmatch::fmatch(x=as.integer(x), table=levels_id, nomatch=NA)],
repex_DT[, same_cols_standard, with=FALSE], labels_std, SIMPLIFY=FALSE)] %>%
.[, (same_cols_with0) := mapply(
function(x, y) y[fastmatch::fmatch(x=as.integer(x), table=(levels_id - 1), nomatch=NA)],
repex_DT[, same_cols_with0, with=FALSE], labels_0, SIMPLIFY=FALSE)]

R: Create Indicator Columns from list of conditions

I have a dataframe and a number of conditions. Each condition is supposed to check whether the value in a certain column of the dataframe is within a set of valid values.
This is what I tried:
# create the sample dataframe
age <- c(120, 45)
sex <- c("x", "f")
df <-data.frame(age, sex)
# create the sample conditions
conditions <- list(
list("age", c(18:100)),
list("sex", c("f", "m"))
)
addIndicator <- function (df, columnName, validValues) {
indicator <- vector()
for (row in df[, toString(columnName)]) {
# for some strange reason, %in% doesn't work correctly here, but always returns FALSe
indicator <- append(indicator, row %in% validValues)
}
df <- cbind(df, indicator)
# rename the column
names(df)[length(names(df))] <- paste0("I_", columnName)
return(df)
}
for (condition in conditions){
columnName <- condition[1]
validValues <- condition[2]
df <- addIndicator(df, columnName, validValues)
}
print(df)
However, this leads to all conditions considered not to be met - which is not what I expect:
age sex I_age I_sex
1 120 x FALSE FALSE
2 45 f FALSE FALSE
I figured that %in% does not return the expected result. I checked for the typeof(row) and tried to boil this down into a minimum example. In a simple ME, with the same type and values of the variables, the %in% works properly. So, something must be wrong within the context I try to apply this. Since this is my first attempt to write anything in R, I am stuck here.
What am I doing wrong and how can I achieve what I want?
If you prefer an approach that uses the tidyverse family of packages:
library(tidyverse)
allowed_values <- list(age = 18:100, sex = c("f", "m"))
df %>%
imap_dfr(~ .x %in% allowed_values[[.y]]) %>%
rename_with(~ paste0('I_', .x)) %>%
bind_cols(df)
imap_dfr allows you to manipulate each column in df using a lambda function. .x references the column content and .y references the name.
rename_with renames the columns using another lambda function and bind_cols combines the results with the original dataframe.
I borrowed the simplified list of conditions from ben's answer. I find my approach slightly more readable but that is a matter of taste and of whether you are already using the tidyverse elsewhere.
conditions appears to be a nested list. When you use:
validValues <- condition[2]
in your for loop, your result is also a list.
To get the vector of values to use with %in%, you can extract [[ by:
validValues <- condition[[2]]
A simplified approach to obtaining indicators could be with a simple list:
conditions_lst <- list(age = 18:100, sex = c("f", "m"))
And using sapply instead of a for loop:
cbind(df, sapply(setNames(names(df), paste("I", names(df), sep = "_")), function(x) {
df[[x]] %in% conditions_lst[[x]]
}))
Output
age sex I_age I_sex
1 120 x FALSE FALSE
2 45 f TRUE TRUE
An alternative approach using across and cur_column() (and leaning heavily on severin's solution):
library(tidyverse)
df <- tibble(age = c(12, 45), sex = c('f', 'f'))
allowed_values <- list(age = 18:100, sex = c("f", "m"))
df %>%
mutate(across(c(age, sex),
c(valid = ~ .x %in% allowed_values[[cur_column()]])
)
)
Reference: https://dplyr.tidyverse.org/articles/colwise.html#current-column
Related question: Refering to column names inside dplyr's across()

removing columns from a data frame which feature in a list, but don't feature in another list

Say, my variable are as follows.
df = read.csv('somedataset.csv') #contains 'col1','col2','col3','col4','col5' say
colsSomeRemoveSomeDontRemove = c('col1','col2','col3')
colsDontRemove = 'col2'
I would like to remove all those columns from df which feature in colsSomeRemoveSomeDontRemove, but are not part of colsDontRemove.
So basically, at the end my df should contain only columns 'col2','col4','col5'
How can I do that?
I have tried doing the following, but could not get it to work
df1 = cbind(df[,which(!(names(df) %in% colsSomeRemoveSomeDontRemove))],as.data.frame(df[,colsDontRemove]))
df[, !(colnames(df) %in% setdiff(colsSomeRemoveSomeDontRemove, colsDontRemove))]

R dplyr: mutate a column for specific row range

I am trying to modify the values of a column for rows in a specific range. This is my data:
df = data.frame(names = c("george","michael","lena","tony"))
and I want to do the following using dplyr:
df[2:3,] = "elsa"
My attempt at it is the following, but it doesn't seem to work:
df = cbind(df, rows = as.integer(rownames(df)))
dplyr::mutate(df, ifelse(rows %in% c(2,3), names = "elsa" , names = names))
which gives the result:
Error: unused arguments (names = "elsa", names = c(1, 3, 2, 4))
Thanks for any advice.
This question is a little vague, but I think OP is trying to just replace certain values in a data frame using indexing. As the comment above noted the example dataframe's column is comprised of a factor variable, which makes replacing the value behave differently than you might expect. There are two ways to get around this.
The first (more verbose) way is to force df$names to be a character variable instead of a factor. Then using indexing to select the value you'd like to change and replace it:
df$names = as.character(df$names)
df$names[c(2,3)] = "elsa"
Alternatively, you can set stringsAsFactors = TRUE and proceed as above.
df = data.frame(names = c("george","michael","lena","tony"), stringsAsFactors = FALSE)
df$names[c(2:3)] = "elsa"
names
1 george
2 elsa
3 elsa
4 tony
Definitely check out ?data.frame to get a fuller explanation.
The factor answers are faster, but you can do it with dplyr like this (notice that the column must be of type character and not factor):
df <- data.frame(names = c("george","michael","lena","tony"), stringsAsFactors=F)
oldnames <- c("michael", "lena")
df <- mutate(df, names=ifelse(names %in% oldnames, "elsa", names))
Another way is to do something like
oldnames <- c("michael", "lena")
df$names[df$names %in% oldnames] <- "elsa"
Convert names to a character vector explicitly and use replace:
df %>% mutate(names = replace(as.character(names), 2:3, "elsa"))
Note: If names were already a character vector we could have done just:
df %>% mutate(names = replace(names, 2:3, "elsa"))
We can do this using data.table. Convert the 'data.frame' to 'data.table' (setDT(df)), specify the row index as i and assign (:=) 'elisa' to the 'names'. As the OP mentioned about large dataset, using the := from data.table will be extremely fast.
library(data.table)
setDT(df)[2:3, names := 'elisa']
df
# names
#1: george
#2: elisa
#3: elisa
#4: tony

R wildcards, sapply and as.factor

I want to change the type to factor of all variables in a data frame whose names match a certain pattern.
So here I am trying to change the type to factor of all variables whose name begins with namestub in the dataframe df.
attach(df)
sapply(grep(glob2rx("namestub*"), names(df)), as.factor)
But this doesn't work since
> levels(df$namestub1)
NULL
## Make a reproducible example
df <- data.frame(namestubA = letters[1:5], B = letters[5:1],
namestubC = LETTERS[1:5], stringsAsFactors=FALSE)
## Get indices of columns to convert
ii <- grep(glob2rx("namestub*"), names(df))
## Convert and replace the indicated columns
df[ii] <- lapply(df[ii], as.factor)

Resources