Can I write a function to revalue levels of a factor? - r

I have a column 'lg_with_children' in my data frame that has 6 levels: 'Half and half', 'Mandarin', 'Shanghainese', 'Other', 'N/A', and 'Not important'. I want to condense them down to just 2 levels, 'Shanghainese' and 'Other'.
In order to do this I used the revalue() function from the plyr package to successfully rename the levels. I used the code below and it worked fine.
data$lg_with_children <- revalue(data$lg_with_children,
c("Mandarin" = "Other"))
data$lg_with_children <- revalue(data$lg_with_children,
c("Half and half" = "Other"))
data$lg_with_children <- revalue(data$lg_with_children,
c("N/A" = "Other"))
data$lg_with_children <- revalue(data$lg_with_children,
c("Not important" = "Other"))
To condense the code a little, I went back to the data as it was before I revalued the levels and attempted to write a function. I tried the following after doing some research on how to write your own functions (I'm rather new at this).
revalue_factor_levels <- function(df, col, source, target) {df$col <- revalue(df$col, c("source" = "target"))}
I intentionally left the df, col, source, and target generic because I need to revalue some other columns in the same way.
Next, I tried to run the code with the args filled in and got this message:
warning message
I am not quite sure what the problem is. I tried the following adjustment to the code and still nothing.
revalue_factor_levels <- function(df, col, source, target) {df$col <- revalue(df$col, c(source = target))}
Any guidance is appreciated. Thanks.

You can write your function to recode the levels - the easiest way to do that is probably to change the levels directly with levels(fac) <- list(new_lvl1 = c(old_lvl1, old_lvl2), new_lvl2 = c(old_lvl3, old_lvl4))
But there are already several functions that do it out of the box. I typically use the forcats package to manipulate factors.
Check out fct_recode from the forcats package; its documentation has the details.
There are also other functions that could help you - check out the comments below.
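The direct levels() assignment mentioned above can be sketched in base R as follows. This is only an illustration: the vector lg is made-up sample data standing in for data$lg_with_children, not the asker's real column.

```r
# Made-up sample data standing in for data$lg_with_children (an assumption).
lg <- factor(c("Mandarin", "Shanghainese", "N/A", "Half and half",
               "Other", "Not important", "Shanghainese"))

# Each list name is a new level; each value lists the old levels it absorbs.
levels(lg) <- list(
  Shanghainese = "Shanghainese",
  Other        = c("Half and half", "Mandarin", "Other", "N/A", "Not important")
)

levels(lg)
table(lg)
```

Any old level that is not listed would become NA, so every level you want to keep has to appear somewhere in the list.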
Now, as to why your code isn't working:
df$col looks for a column literally named col. The workaround is to do df[[col]] instead.
Don't forget to return df at the end of your function
c(source = target) will create a vector with one element named "source", regardless of what happens to be in the variable source.
The solution is to create the vector c(source = target) in 2 steps.
revalue_factor_levels <- function(df, col, source, target) {
to_rename <- target
names(to_rename) <- source
df[[col]] <- revalue(df[[col]], to_rename)
df
}
Returning the df means the syntax is:
data <- revalue_factor_levels(data, "lg_with_children", "Mandarin", "Other")
I like functions that take the data as the first argument and return the modified data because they are pipeable.
library(dplyr)
data <- data %>%
revalue_factor_levels("lg_with_children", "Mandarin", "Other") %>%
revalue_factor_levels("lg_with_children", "Half and half", "Other") %>%
revalue_factor_levels("lg_with_children", "N/A", "Other")
Still, using forcats is easier and less prone to breaking on edge cases.
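For reference, the naming issue described above is easy to check in plain base R. The values "Mandarin" and "Other" are just the ones from the question, and setNames() is shown as a one-line alternative to the two-step construction; this is a small sketch, not a replacement for the function above.

```r
source <- "Mandarin"
target <- "Other"

# c(source = target) uses the literal word "source" as the element name.
literal <- c(source = target)
names(literal)

# setNames() evaluates its second argument, so the name comes from the variable.
dynamic <- setNames(target, source)
names(dynamic)
```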
Edit:
There is nothing preventing you from both using forcats and creating your custom function. For example, this is closer to what you want to achieve:
revalue_factor_levels <- function(df, col, ref_level) {
df[[col]] <- forcats::fct_other(df[[col]], keep = ref_level)
df
}
# Will keep Shanghainese and revalue all other levels to "Other".
data <- revalue_factor_levels(data, "lg_with_children", "Shanghainese")

Here is what I ended up with thanks to help from the community.
revalue_factor_levels <- function(df, col, ref_level) {
df[[col]] <- fct_other(df[[col]], keep = ref_level)
df
}
data <- revalue_factor_levels(data, "lg_with_children", "Shanghainese")

Related

How do I create a loop to change the text encoding of the labels in labelled variables in R

I have imported a stata file that is giving me some encoding problems in the value labels. On import, using labelled::lookfor for any keyword returns this error:
Error in structure(as.character(x), names = names(x)) :
invalid multibyte string at '<e9>bec Solidaire'
Knowing the data set, that is almost certainly one of the value labels.
How do I loop through the data set, fix the encoding problem in the names of the value labels, and then reset them? I have found a solution, I think, to fix the problematic characters, but I don't know how to replace the original names.
v <- labelled(c(1,2,2,2,3,9,1,3,2,NA), c(yes = 1, "Bloc Qu\xe9b\xe9cois" = 3, "don't know" = 9))
x<- labelled(c(1,2,2,2,3,9,1,3,2,NA), c("Bloc Qu\xe9b\xe9cois" = 1, no = 3, "don't know" = 9))
mydat<-data.frame(v=v, x=x)
glimpse(mydat)
mydat %>%
map(., val_labels)
#This works individually
iconv(names(val_labels(x)), from="latin1", to="UTF-8")
#And this seems to work looping over each variable, but how do I store it?
mydat %>%
map(., function(x) iconv(names(val_labels(x)), from="latin1", to="UTF-8"))
This seems to be a bit tough to do in one simple step, so here I used some helper functions:
conv_names <- function(x) {
setNames(x, iconv(names(x), from="latin1", to="UTF-8"))
}
conv_val_labels <- function(x) {
val_labels(x) <- conv_names(val_labels(x))
x
}
mydat <- map_dfc(mydat, conv_val_labels)
We map the function over each column and then assign the result back to the data frame. Note that we use map_dfc to combine the columns back into a data frame.
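To see what the name-conversion helper does in isolation, here is a small base R sketch on a plain named vector, without the labelled package. The vector labs is made up for illustration, and the check with validUTF8() is my addition rather than part of the answer.

```r
# Same helper as above, applied to an ordinary named vector.
conv_names <- function(x) {
  setNames(x, iconv(names(x), from = "latin1", to = "UTF-8"))
}

# Made-up labels; the second name contains latin1 bytes (\xe9).
labs <- setNames(c(1, 3, 9), c("yes", "Bloc Qu\xe9b\xe9cois", "don't know"))

fixed <- conv_names(labs)

validUTF8(names(labs))   # the latin1 name is not valid UTF-8
validUTF8(names(fixed))  # after conversion every name is valid
```

The values themselves are untouched; only the names are re-encoded.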

Why do I get a " Invalid .internal.selfref detected" warning (but no output) even if I am not using list(),key<-, names<-, or attr<-?

In a new user-defined function I would like to do some data.table transformations; in particular, I would like to create a new column with the ':=' operator.
Suppose I want to make a new column called Sex that capitalizes the first letter of the column df$sex in my example data.frame df.
The output of my prepare function should be a data.table with the same name as before but with the additional capitalised column.
I tried several ways to do this. However, I always get the following warning (and no correct output):
Warning message:
In `[.data.table`(x, , `:=`(Sex, stringr::str_to_title(sex))) :
Invalid .internal.selfref detected and fixed by taking a (shallow) copy of the data.table so that := can add this new column by reference. At an earlier point, this data.table has been copied by R (or was created manually using structure() or similar). Avoid names<- and attr<- which in R currently (and oddly) may copy the whole data.table. Use set* syntax instead to avoid copying: ?set, ?setnames and ?setattr. If this message doesn't help, please report your use case to the data.table issue tracker so the root cause can be fixed or this message improved.
library(data.table)
library(magrittr)
library(stringr)
df <- data.frame("age" = c(17, 04),
sex = c("m", "f"))
df %>% setDT()
is.data.table(df)
This is the easiest way to write my function:
prepare1<-function(x){
x[,Sex:=stringr::str_to_title(sex)]
}
prepare1(df)
#--> WARNING. (as block quoted above)
prepare2<-function(x){
x[, `:=`(Sex, stringr::str_to_title(sex))]
}
prepare2(df)
#--> WARNING. . (as block quoted above)
prepare3<-function(x){
require(data.table)
y <-as.data.table(list(x))
y <- y[,Sex:=stringr::str_to_title(sex)]
x <<- y
}
prepare3(df)
The last version does NOT throw the warning, but it creates a new dataset called x. I wanted to overwrite the dataset I passed to the function (if I have to go that way at all).
From the := help file I also know I can use set(); however, I have not been able to adapt the command appropriately. If that would cure my problem, I am happy to receive help on that too! set(x, i = NULL, Sex, str_to_title(sex)) is apparently wrong ...
Upon request, and to make the discussion in the comments clearer, I show how my code produces the problem:
library(data.table)
library(stringr)
df <- data.frame("age" = c(17, 04),
sex = c("m", "f"))
GetLastAssigned <- function(match = "<- *data.frame",
remove = " *<-.*") {
f <- tempfile()
savehistory(f)
history <- readLines(f)
unlink(f)
match <- grep(match, history, value = TRUE)
get(sub(remove, "", match[length(match)]))
}
#ok, no need for magrittr
setDT(GetLastAssigned())
#check the last function worked
is.data.table(df)
prepare1<-function(x){
x[,Sex:=stringr::str_to_title(sex)]
}
prepare1(GetLastAssigned())
# I get a warning and it does not work.
prepare1(df)
# I get a warning and it does not work, either.
#If I manually type setDT(df) everything works fine but I cannot type the "right" dfs at all the places where I need to do this transformation.
A workaround along the OP's lines:
library(data.table)
library(stringr)
GetLastAssigned2 <- function(match = "<- *data.frame", remove = " *<-.*") {
f <- tempfile()
savehistory(f)
history <- readLines(f)
unlink(f)
match <- grep(match, history, value = TRUE)
nm <- sub(remove, "", match[length(match)])
list(nm = as.name(nm), addr = address(get(nm)))
}
prepit <- function(x){
x[,Sex:=stringr::str_to_title(sex)]
}
# usage
df <- data.frame("age" = c(17, 04), sex = c("m", "f"))
z <- GetLastAssigned2()
eval(substitute(setDT(x), list(x=z$nm)))
str(df) # it seemingly works, since there is a selfref
# usage 2
df <- data.frame("age" = c(17, 04), sex = c("m", "f"))
setDT(df)
prepit(df)
str(df) # works
# usage 3
df <- data.frame("age" = c(17, 04), sex = c("m", "f"))
z <- GetLastAssigned2()
eval(substitute(setDT(x), list(x=z$nm)))
eval(substitute(prepit(x), list(x=z$nm)))
str(df) # works
Some big caveats:
savehistory is only effective in interactive use, based on my reading of the docs
using regex on human input (code typed in interactively) is complicated and risky
even this workaround will fail if the data.table x passed to prepit has not had enough space "pre-allocated" for extra columns
The data.table interface is based on passing the name/symbol of the data.frame or data.table, rather than the value (which is what get provides), as explained by Arun, one of the data.table authors. Note that the address cannot be passed around either: z$addr soon fails to match address(df) in all of the examples above.
If I manually type setDT(df) everything works fine but I cannot type the "right" dfs at all the places where I need to do this transformation.
One idea:
# helper to compose expressions
subit = function(cmd, df_nm)
do.call("substitute", list(cmd, list(x=as.name(df_nm))))
# list of expressions with x where the df name belongs
my_cmds = list(
setDT = quote(setDT(x)),
prepit = quote(x[,Sex:=stringr::str_to_title(sex)])
)
# usage 4
df = data.frame("age" = c(17, 04), sex = c("m", "f"))
df_nm = "df" # somehow get this... hopefully not via regex of command history
eval(subit(my_cmds$setDT, df_nm))
eval(subit(my_cmds$prepit, df_nm))
# usage 5
df = data.frame("age" = c(17, 04), sex = c("m", "f"))
df_nm = "df"
for(ex in lapply(my_cmds, subit, df_nm = df_nm)) eval(ex)
I think this is more aligned with recommended programmatic usage of data.table.
There is probably some way to wrap this in a function by altering the envir= argument to eval() but I'm not knowledgeable about that.
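As a rough sketch of that idea, the substitute-and-eval step can be wrapped in a function whose envir= argument defaults to the caller's frame. The name run_on is hypothetical, and the command here is a base R assignment standing in for the data.table := call, so this only demonstrates the evaluation mechanics, not data.table's by-reference behaviour.

```r
# Hypothetical wrapper: substitute the data frame's name for x, then
# evaluate the resulting expression in the caller's environment.
run_on <- function(cmd, df_nm, envir = parent.frame()) {
  expr <- do.call("substitute", list(cmd, list(x = as.name(df_nm))))
  eval(expr, envir = envir)
}

df <- data.frame(age = c(17, 4), sex = c("m", "f"))

# Base R stand-in for x[, Sex := stringr::str_to_title(sex)] (an assumption).
run_on(quote(x$Sex <- toupper(x$sex)), "df")

df$Sex
```

Because the expression is evaluated where df lives, the assignment changes df itself rather than a copy inside the function.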
Regarding how to get the name of the assignment target in nm <- data.frame(...), it looks like there are no good options. Maybe see How do I access the name of the variable assigned to the result of a function within the function? or Get name of x when defining `(<-` operator

Getting Looped Output into an Appended Object

So I am trying to make a basic sensitivity analysis script. The outputs come out as I want via the print I added to the end of the script. The issue is that I would like a tibble or other object that has all the outputs appended together, which I can export as a csv or xlsx.
I created two functions: sens_analysis, which runs all the code, and multiply_across, which multiplies each column of the table by each possible percentage. You need multiply_across to run sens_analysis.
I would normally like a title, but I just added an indicator column that I can sort by instead.
I made everything with mtcars so it should be easy to replicate. The issue is that I just have a huge print at the end, not an object that I can manipulate or pull from for other analysis.
I have been trying rbind, bind_rows, and appending rows in a variety of ways,
or building a new object. As you can see in the code at line (18), I make something called output that I have tried to populate, which hasn't gone well.
rm(list = ls())
library(dplyr)
library(tidyr)
library(purrr)
library(tibble)
library(magrittr)
library(xtable)
data<-mtcars
percent<-c(.05,.1,.15)
goods<-c("hp","gear","wt")
weight<-c(6,7,8)
disagg<-"cyl"
func<-median
sens_analysis<-function(data=data, goods=goods, weight=weight, disagg=disagg, precent=percent, func=func){
output<-NULL%>%
as.tibble()
basket<-(rbind(goods,weight))
percent<-c(0,percent,(percent*-1))
percent_to_1<-percent+1
data_select<-data%>%
dplyr::select(c(goods,disagg))%>%
group_by_at(disagg)%>%
summarise_at(.vars = goods ,.funs = func)%>%
as_tibble()
data_select_weight<-purrr::map2(data_select[,-1], as.numeric(basket[2,]),function(var, weight){
var*weight
})%>% as_tibble %>%
add_column(data_select[,1], .before = 1)
colnames(data_select_weight)[1]<-disagg
multiply_across(data_select_weight,percent_to_1)
return(output)
#output2<-rbind(output2,output)
}
############################
multiply_across<-function(data=data_select_weight,list=percent_to_1){
varlist<-names(data[,-1])
for(i in varlist){
df1 = data[,i]
for(j in list){
df<-data
df[,i]<-round(df1*j,2)
df<-mutate(df, total = round(rowSums(df[,-1]),2))%>%
mutate(type=paste0(i," BY ",(as.numeric(j)-1)*100,"% OVER ",disagg))%>%
print(df)
#output<-bind_rows(output,df)
#output<-bind_rows(output,df)
#output[[j]]<-df[[j]]
}
}
}
##############################################################################################
sens_analysis(data,goods,weight,disagg,percent,func)
The expected result if you just run the code as is should be a bunch of printed tibbles that aren't stored in an object. But ideally, for future analysis of the data or ease of use, a table of the outputs appended together would be best.
So I figured it out and will add my answer here in case someone else hits this issue.
I created a list inside the loops and then bound its elements together.
The key part is binding the rows after both for-loops have finished.
multiply_across<-function(data=data_select_weight,
list=percent_to_1){
varlist <- colnames(data[, -1])
output_list <- list()
for (i in varlist) {
df1 <- data[,i]
for (j in list) {
name <- paste0(i, " BY ", (as.numeric(j)-1)*100, "% OVER ", disagg)
df <- as_tibble(data)
df[,i] <- round(df1*j, 2)
df <- mutate(df, total = round(rowSums(df[,-1]),2))%>%
mutate(type = paste0(i, " BY ", (as.numeric(j)-1)*100, "% OVER ", disagg))
df<-df[,c(6,1,2,3,4,5)]
output_list[[paste0(i," BY ",(as.numeric(j)-1)*100)]] <- (assign(paste0(i," BY ",(as.numeric(j)-1)*100,"% OVER ",disagg),df))
}
}
bind_rows(lapply(output_list,
as.data.frame.list,
stringsAsFactors=F))
}
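Stripped of the sensitivity-analysis details, the collect-then-bind pattern used above looks roughly like this in base R. The variables and percentages are a made-up subset chosen for illustration, and do.call(rbind, ...) is used here in place of bind_rows so the sketch has no package dependencies.

```r
vars     <- c("hp", "wt")      # illustrative subset of mtcars columns
percents <- c(0.05, 0.10)      # illustrative percentage changes

results <- list()
for (v in vars) {
  for (p in percents) {
    # One small data frame per scenario, appended to the list.
    results[[length(results) + 1]] <- data.frame(
      variable = v,
      change   = p,
      median   = median(mtcars[[v]]) * (1 + p)
    )
  }
}

# Bind once, after both loops have finished.
combined <- do.call(rbind, results)
combined
```

Binding once at the end also avoids growing a data frame row by row inside the loop.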

Can "assign()" and "get()" be written more concisely?

Below is my code. I use an extra variable "tmp" to clean the "ABC_Chla". Because the "Location_name" can change, I use the "assign()" and "get()" functions.
Location_name <- "ABC_"
tmp <- get(paste(Location_name,"DO",sep = "")) %>% filter(log.DO != -Inf)
assign(paste(Location_name,"DO",sep = ""), tmp)
My code achieves this goal, but it does not seem concise (it introduces a temporary variable). Is there a better way?
Assuming the inputs shown reproducibly in the Note at the end (next time please make sure your question includes complete reproducible code including inputs) we can make the following changes:
use paste0 instead of paste
create a variable locname to hold the name of the data frame and a variable e to be the environment where our data frame is located
use e[[...]] instead of get and assign
use the magrittr %<>% assignment pipe
possibly use filter(is.finite(log.DO)) -- not shown below
giving this code:
library(dplyr)
library(magrittr)
e <- .GlobalEnv # change if our data frame is in some other environment
locname <- paste0(Location_name, "DO")
e[[locname]] %<>%
filter(log.DO != -Inf)
The result is:
get(locname, e)
## log.DO
## 1 1
## 2 2
Alternative
This alternative only uses ordinary pipes. We use e and locname from above.
library(dplyr)
e[[locname]] <- e[[locname]] %>%
filter(log.DO != -Inf)
Note
Test input:
ABC_DO <- data.frame(log.DO = c(1, -Inf, 2))
Location_name <- "ABC_"
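The is.finite() variant listed above but not shown could look roughly like this in base R. To keep the sketch self-contained it stores the data frame in a fresh environment e instead of .GlobalEnv, and the extra NA in the test input is my addition.

```r
e <- new.env()                      # stand-in for .GlobalEnv (an assumption)
Location_name <- "ABC_"
e$ABC_DO <- data.frame(log.DO = c(1, -Inf, 2, NA))

locname <- paste0(Location_name, "DO")

# Keep only rows with a finite log.DO; no temporary variable, no get/assign.
e[[locname]] <- e[[locname]][is.finite(e[[locname]]$log.DO), , drop = FALSE]

e[[locname]]
```

Note that is.finite() is stricter than log.DO != -Inf: it also removes Inf, NaN and NA rows.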
You only have a temporary variable because you store the data in tmp; I don't see it as a problem. But in this case, the only thing I can see you doing is passing the code for tmp directly to assign, like:
assign(
paste(Location_name,"DO",sep = ""),
get(paste(Location_name,"DO",sep = "")) %>% filter(log.DO != -Inf)
)

How to build subset query using a loop in R?

I'm trying to subset a big table across a number of columns, so all the rows where State_2009, State_2010, State_2011 etc. do not equal the value "Unknown."
My instinct was to do something like this (coming from a JS background), where I either build the query in a loop or continually subset the data in a loop, referencing the year as a variable.
mysubset <- data
for(i in 2009:2016){
mysubset <- subset(mysubset, paste("State_",i," != Unknown",sep=""))
}
But this doesn't work, at least because paste returns a string, giving me the error 'subset' must be logical.
Is there a better way to do this?
Using dplyr with the filter_ function should get you the correct output (note that filter_ is deprecated in current dplyr releases):
library(dplyr)
mysubset <- data
for(i in 2009:2016)
{
mysubset <- mysubset %>%
filter_(paste("State_",i," != \"Unknown\"", sep = ""))
}
To add to Matt's answer, you could also do it like this:
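cols <- paste0("State_", 2009:2016)
# row indices of every cell equal to "Unknown" in the State_ columns
inds <- which(mysubset[, cols] == "Unknown", arr.ind = TRUE)[, 1]
# guard: an empty negative index would drop every row
if (length(inds) > 0) mysubset <- mysubset[-unique(inds), ]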
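Another loop-free option is to count the "Unknown" cells per row and keep the rows with none. This is a minimal base R sketch; the data frame below is made up, with only two year columns, and it assumes the State_ columns contain no NA values.

```r
# Made-up example data with two of the State_ columns.
data <- data.frame(
  id         = 1:4,
  State_2009 = c("NY", "Unknown", "CA", "TX"),
  State_2010 = c("NY", "NJ", "Unknown", "TX"),
  stringsAsFactors = FALSE
)

cols <- paste0("State_", 2009:2010)

# TRUE for rows where none of the selected columns equals "Unknown".
keep <- rowSums(data[cols] == "Unknown") == 0
mysubset <- data[keep, ]
mysubset
```

Unlike negative indexing, a logical index behaves sensibly when no row contains "Unknown": every row is simply kept.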
