Loop various df to transform factors to numeric - r

I have imported various datasets with the same variables for different years. I am trying to transform some of the columns from factor to numeric. To save time I have created a function, which seems not to work.
I have created a list with the names of the datasets as strings
dfs <- list("df1", "df2", "df3", "df4", "df5", "df6", "df7", "df8")
And a second list with the names of the variables (columns) also as strings
vars <- list("var1", "var2", "var3", "var4")
First I tried joining both lists with an "$" in the middle and then passing the function to transform factors to numerics:
to_int <- function(column) {
  if (is.factor(column)) {
    column <- levels(column)[column]
    column <- as.numeric(column)
    return(column)
  }
  else {
    return(column)
  }
}
Option 1: create a vector with strings joined by $
col_names <- vector(mode = "list", length = length(dfs))
# Add the combination of names to each vector
for (df in dfs) {
  for (var in vars) {
    r <- paste(df, var, sep = "$") # Combine the names in the 2 lists with a $ in the middle
    col_names[[match(df, dfs)]][match(var, vars)] <- r # Assign result to the pre-set vector
  }
}
# Iterate through list (col_names) and apply "to_int" to each of the strings in the list
for (l in col_names) {
  for (col_name in l) {
    colnm <- eval(parse(text = col_name))
    nmrc <- to_int(colnm) # from factor to numeric each column. Works!
    assign(col_name, nmrc, envir = globalenv()) # Creates values (Rstudio) with the correct name but columns on dfs remain intact
  }
}
Then I tried treating the strings on both lists separately and get them together inside the loop:
Option 2: Treat the lists as separate strings and join in loop
for (df in dfs) {
  for (var in vars) {
    a <- eval(parse(text = df))
    b <- to_int(a[var]) # using $ returns null. using [] no change in original df, still factor
    a[var] <- b
  }
}
I finally tried creating a new function that takes two variables as inputs:
# with two inputs
to_int2 <- function(df, col) {
  eval(parse(text = df))
  if (is.factor(df[col])) { # $ OPERATOR IS INVALID FOR ATOMIC VECTORS
    df[col] <- levels(df[col])[df[col]]
    df[col] <- as.numeric(df[col])
    return(df[col])
  }
  else {
    return(df[col])
  }
}
And passed that through a third attempt
Option 3: transform factor to numeric with two inputs
for (df in dfs) {
  for (var in vars) {
    a <- to_int2(df, var) # $ OPERATOR IS INVALID FOR ATOMIC VECTORS
    b <- eval(parse(text = df))
    b$var <- a # No effect
  }
}
None of them had an effect on the desired columns of the dataframes.
Any idea on how to solve this?
Thanks

It's generally better to work with multiple similar datasets as a list of frames. The premise being that whatever you do to one, you will do to all, and that is automated easily using lapply.
As an example, try this:
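LOF <- mget(unlist(dfs)) # mget() needs a character vector, not a list
LOF <- lapply(LOF, function(df) {
  cols <- unlist(vars) # `[` cannot index with a list of names
  # to_int() (from the question) converts via the labels;
  # as.integer() on a factor would return the internal level codes instead
  df[cols] <- lapply(df[cols], to_int)
  df
})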
But if you must keep them separate, then try this:
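for (nm in dfs) {
  dat <- get(nm)
  cols <- unlist(vars) # `[` cannot index with a list of names
  # to_int() (from the question) converts via the labels, not the level codes
  dat[cols] <- lapply(dat[cols], to_int)
  assign(nm, dat)
}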
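To see why the conversion has to go through the factor labels, here is a small self-contained sketch. The data frames and their values are made up for illustration, and to_int is re-defined in a compact form so the block runs on its own. It is one way to do this under those assumptions, not the only one.

```r
# A factor stores integer codes plus a set of labels (levels)
f <- factor(c("10", "20", "10", "35"))
as.integer(f)               # the internal codes: 1 2 1 3
as.numeric(levels(f))[f]    # the values the labels stand for: 10 20 10 35

# Compact variant of the question's to_int(): non-factors pass through unchanged
to_int <- function(column) {
  if (is.factor(column)) as.numeric(levels(column))[column] else column
}

# Hypothetical example data: two frames with the same factor columns
df1 <- data.frame(var1 = factor(c("10", "20")), var2 = factor(c("1.5", "2.5")))
df2 <- data.frame(var1 = factor(c("7", "9")), var2 = factor(c("0.1", "0.2")))

frames <- list(df1 = df1, df2 = df2)
vars <- c("var1", "var2")   # a character vector, so it can index with `[`

# Convert the chosen columns in every frame, returning the modified frame
frames <- lapply(frames, function(d) {
  d[vars] <- lapply(d[vars], to_int)
  d
})
str(frames$df1)
```

The attempts in the question change a copy of each data frame (the result of eval or get) and never write that copy back, which is why the originals stay untouched. Keeping the frames in a list and reassigning the list avoids that problem.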

Related

how to make a loop to fetch one variable from 1000 dataframes

I have 1000 data frames named V1 ... V1000. Each one contains a variable with the same name, 'var1.predict'. I'm having a hard time writing a loop that concatenates these variables into one new data frame.
This is what I want the loop to produce:
df <- cbind.data.frame(model_V1$var1.pred,model_V2$var1.pred,.....model_V1000$var1.pred)
That is, a new data frame formed by taking one variable from each data frame.
I hope someone can help solve this.
Thank you
I assume that you mean you have 1,000 dataframes V1... V1000 each with the column var1.predict and you want to extract the predictions column from each df. If so, there are a few methods outlined below with a little reprex:
# putting dummy data in to the global env
lapply(1:3, \(i) {
  assign(paste0("V", i), data.frame(v1 = rnorm(5),
                                    v2 = rnorm(5),
                                    var1.predict = rnorm(5)), envir = .GlobalEnv)
})
df_list <- list(V1, V2, V3)
# using a for loop and do.call
pred_cols <- list()
for (df in df_list) {
  pred_cols <- c(pred_cols, list(df[["var1.predict"]]))
}
pred_cols_df <- do.call(cbind, pred_cols)
pred_cols_df <- as.data.frame(pred_cols_df) # keep the converted result
pred_cols_df
# using a loop without do.call
for (i in seq_along(df_list)) {
  if (i == 1) {
    pred_cols_df <- df_list[[1]][["var1.predict"]]
  } else {
    pred_cols_df <- cbind(pred_cols_df, df_list[[i]][["var1.predict"]])
  }
}
pred_cols_df <- as.data.frame(pred_cols_df) # keep the converted result
pred_cols_df
# using lapply
pred_cols <- lapply(df_list, `[`, "var1.predict")
pred_cols_df <- do.call(cbind, pred_cols)
pred_cols_df <- as.data.frame(pred_cols_df) # keep the converted result
pred_cols_df
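With 1,000 frames, typing the names into list() is not practical. A possible sketch is to build the names with paste0 and fetch the frames with mget. The example below uses three dummy frames stored in a separate environment (env, a name chosen here for illustration) so that it runs on its own. For frames that live in your workspace you would pass envir = globalenv() instead.

```r
# Dummy frames stored in their own environment so the sketch is self-contained
env <- new.env()
for (i in 1:3) {
  assign(paste0("V", i),
         data.frame(v1 = rnorm(5), var1.predict = rnorm(5)),
         envir = env)
}

# Build the names programmatically, then fetch all frames as a named list
frames <- mget(paste0("V", 1:3), envir = env)

# Pull the same column out of each frame; the list names become column names
preds <- as.data.frame(lapply(frames, `[[`, "var1.predict"))
names(preds)
```

Because the list returned by mget is named, the resulting columns are labelled V1, V2, V3, which keeps track of which frame each prediction came from.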

How to write a function with an unspecified number of arguments where the arguments are column names

I am trying to write a function with an unspecified number of arguments using ... but I am running into issues where those arguments are column names. As a simple example, if I want a function that takes a data frame and uses within() to make a new column that is several other columns pasted together, I would intuitively write it as
example.fun <- function(input, ...) {
  res <- within(input, pasted <- paste(...))
  res
}
where input is a data frame and ... specifies column names. This gives an error saying that the column names cannot be found (they are treated as objects). e.g.
df <- data.frame(x = c(1,2),y=c("a","b"))
example.fun(df,x,y)
This returns "Error in paste(...) : object 'x' not found "
I can use attach() and detach() within the function as a work around,
example.fun2 <- function(input, ...) {
  attach(input)
  res <- within(input, pasted <- paste(...))
  detach(input)
  res
}
This works, but it's clunky and runs into issues if there happens to be an object in the global environment that is called the same thing as a column name, so it's not my preference.
What is the correct way to do this?
Thanks
1) Wrap the code in eval(substitute(...code...)) like this:
example.fun <- function(data, ...) {
  eval(substitute(within(data, pasted <- paste(...))))
}
# test
df <- data.frame(x = c(1, 2), y = c("a", "b"))
example.fun(df, x, y)
##   x y pasted
## 1 1 a    1 a
## 2 2 b    2 b
1a) A variation of that would be:
example.fun.2 <- function(data, ...) {
  data.frame(data, pasted = eval(substitute(paste(...)), data))
}
example.fun.2(df, x, y)
2) Another possibility is to convert each argument to a character string and then use indexing.
example.fun.3 <- function(data, ...) {
  vnames <- sapply(substitute(list(...))[-1], deparse)
  data.frame(data, pasted = do.call("paste", data[vnames]))
}
example.fun.3(df, x, y)
3) Other possibilities are to change the design of the function and pass the variable names as a formula or character vector.
example.fun.4 <- function(data, formula) {
  data.frame(data, pasted = do.call("paste", get_all_vars(formula, data)))
}
example.fun.4(df, ~ x + y)
example.fun.5 <- function(data, vnames) {
  data.frame(data, pasted = do.call("paste", data[vnames]))
}
example.fun.5(df, c("x", "y"))

How can I multiply multiple dataframes of a list by each observation of a vector?

I have a list of dataframes that I would like to multiply by each element of a vector.
The first dataframe in the list would be multiplied by the first observation of the vector, and so on, producing another list of dataframes already multiplied.
I tried to do this with a loop, but was unsuccessful. I also tried to imagine something using map or lapply, but I couldn't.
for(i in vec){
for(j in listdf){
listdf2 <- i*listdf[[j]]
}
}
Error in listdf[[j]] : invalid subscript type 'list'
Any idea how to solve this?
*Vector and the List of Dataframes have the same length.
Use Map:
listdf2 <- Map(`*`, listdf, vec)
In purrr this can be done using map2:
listdf2 <- purrr::map2(listdf, vec, `*`)
If you are interested in a for loop solution, you just need one loop:
listdf2 <- vector('list', length(listdf))
for (i in seq_along(vec)) {
  listdf2[[i]] <- listdf[[i]] * vec[i]
}
data
vec <- c(4, 3, 5)
df <- data.frame(a = 1:5, b = 3:7)
listdf <- list(df, df, df)

looping over variables of a data.frame leading one final data.frame in R

I have written a function to change any one variable (i.e., column) in a data.frame to its unique levels and return the changed data.frame.
I wonder how to change multiple variables at once using my function and get one final data.frame with all the changes?
I have tried the following, but this gives multiple data.frames while only the last data.frame is the desired output:
data <- data.frame(sid = c(33,33, 41), pid = c('Bob', 'Bob', 'Jim'))
#== My function for ONE variable:
f <- function(data, what) {
  data[[what]] <- as.numeric(factor(data[[what]], levels = unique(data[[what]])))
  return(data)
}
# Looping over `what`:
what <- c('sid', 'pid')
lapply(seq_along(what), function(i) f(data, what[i]))
We could change the function so that it returns only the converted column, data[[what]], instead of the whole data frame:
f <- function(data, what) {
  data[[what]] <- as.numeric(factor(data[[what]], levels = unique(data[[what]])))
  data[[what]]
}
data[what] <- lapply(seq_along(what), function(i) f(data, what[i]))
Or do
data[what] <- lapply(what, function(x) f(data, x))
Or simply
data[what] <- lapply(what, f, data = data)
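If you would rather keep the original f, which returns the whole data frame, another option is to thread the data frame through successive calls with Reduce. This is a minimal sketch using the data and function from the question, offered as an alternative rather than a replacement for the approach above:

```r
data <- data.frame(sid = c(33, 33, 41), pid = c("Bob", "Bob", "Jim"))

# Original one-variable function from the question: returns the whole data frame
f <- function(data, what) {
  data[[what]] <- as.numeric(factor(data[[what]], levels = unique(data[[what]])))
  data
}

what <- c("sid", "pid")

# Reduce() calls f(data, "sid"), then feeds that result into f(<result>, "pid")
result <- Reduce(f, what, init = data)
result
```

Each call receives the data frame produced by the previous one, so all changes accumulate in a single final data frame.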

rbind dataframes with varying names

I have a situation where I need to rbind multiple dataframes based on a name. The trouble I'm having is how to define the binding when the names differ.
For instance, the names of my dataframes are:
AB_0
AB_1
BCD_0
BCD_1
And I want to rbind AB_0 and BCD_0, and AB_1 and BCD_1 - my common factor I'm binding on is everything from the _ and after
I know I could use strsplit, but all I'm trying to get to is something like:
for (i in 0:1) {
  do.call("rbind", mget(sprintf("*_%d", i)))
}
where * is some variable string with a varying number of characters
Something like this?
AB_0 <- data.frame(a=1, b=1)
AB_1 <- data.frame(a=2, b=2)
BCD_0 <- data.frame(a=3, b=3)
BCD_1 <- data.frame(a=4, b=4)
XX0 <- do.call("rbind", mget(ls(pattern = ".+_0")))
XX1 <- do.call("rbind", mget(ls(pattern = ".+_1")))
Or automate using a list:
XX <- list()
for (i in 0:1) {
  XX[[i+1]] <- do.call("rbind", mget(ls(pattern = paste0(".+_", i))))
}
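If the set of suffixes is not known in advance, one possible variation is to derive the grouping key from the names and let split do the grouping. The sketch below assumes the frames have already been collected into a named list (called frames here for illustration; for objects in the workspace it could be built with mget(ls(pattern = "_[0-9]+$"))).

```r
# Hypothetical data, collected into one named list
frames <- list(
  AB_0  = data.frame(a = 1, b = 1),
  AB_1  = data.frame(a = 2, b = 2),
  BCD_0 = data.frame(a = 3, b = 3),
  BCD_1 = data.frame(a = 4, b = 4)
)

# Everything after the last underscore is the grouping key
suffix <- sub(".*_", "", names(frames))

# Split the list by suffix, then rbind the frames within each group
XX <- lapply(split(frames, suffix), function(group) do.call(rbind, group))
XX[["0"]]
```

The result is a list named by suffix, so XX[["0"]] holds AB_0 and BCD_0 bound together and XX[["1"]] holds AB_1 and BCD_1.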
