This is my first question on Stack Overflow, so bear with me ;-)
I wrote a function to row-bind all objects whose names meet a regex criterion into a dataframe.
Curiously, if I run the lines outside of the function, it works perfectly, but within the function an empty data frame is returned.
Reproducible example:
library(dplyr)

offers_2022_05 <- data.frame(x = 3)
offers_2022_06 <- data.frame(x = 6)
bind_multiple_dates <- function(prefix) {
  objects <- ls(pattern = sprintf("%s_[0-9]{4}_[0-9]{2}", prefix))
  data <- bind_rows(mget(objects, envir = .GlobalEnv), .id = "month")
  return(data)
}
bind_multiple_dates("offers")
# A tibble: 0 × 0
However, this works:
prefix <- "offers"
objects <- ls(pattern = sprintf("%s_[0-9]{4}_[0-9]{2}", prefix))
data <- bind_rows(mget(objects, envir = .GlobalEnv), .id = "month")
data
           month x
1 offers_2022_05 3
2 offers_2022_06 6
I suppose it has something to do with the environment, but I can't really figure it out. Is there a better way to do this? I would like to keep the code as a function.
Thanks in advance :-)
By default, ls() lists the variables in the current environment. Inside the function, the current environment is the function's own evaluation environment, and those data.frame variables are not in the function's scope. You can explicitly point ls() at the calling environment with the envir= parameter. For example:
bind_multiple_dates <- function(prefix) {
  objects <- ls(pattern = sprintf("%s_[0-9]{4}_[0-9]{2}", prefix), envir = parent.frame())
  data <- bind_rows(mget(objects, envir = .GlobalEnv), .id = "month")
  return(data)
}
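Called from the global environment with the example data above, the fixed function should now pick up both objects (output shown for illustration):
bind_multiple_dates("offers")
           month x
1 offers_2022_05 3
2 offers_2022_06 6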
The "better" way to do this is to not create a bunch of separate variables like offers_2022_05 and offers_2022_06 in the first place. Variables should not have data or indexes in their name. It would be better to create the data frames in a list directly from the beginning. Often this is easily accomplished with a call to lapply or purrr::map. See this existing question for more info
Related
my_mtcars_1 <- mtcars
my_mtcars_2 <- mtcars
my_mtcars_3 <- mtcars
for(i in 1:3) {get(paste0('my_mtcars_', i))$blah <- 1}
Error in get(paste0("my_mtcars_", i))$blah <- 1 :
target of assignment expands to non-language object
I would like each of my 3 data frames to have a new field called blah that has a value of 1.
How can I iterate over a range of numbers in a loop, refer to each data frame by pasting its name together as a string, and then edit the data frame this way?
These three options all assume you want to modify the data frames and keep them in the environment.
So, if it must be separate data frames (in your environment, modified in a loop), you could do something like this:
for(i in 1:3) {
  obj_name = paste0('my_mtcars_', i)
  obj = get(obj_name)
  obj$blah = 1
  assign(obj_name, obj, envir = .GlobalEnv) # Send back to global environment
}
I agree with #Duck that a list is a better format (and preferred to the above loop). So, if you use a list and need it in your environment, use what Duck suggested with list2env() and send everything back to the .GlobalEnv. I.e. (in one ugly line),
list2env(lapply(mget(ls(pattern = "my_mtcars_")), function(x) {x[["blah"]] = 1; x}), .GlobalEnv)
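Unpacked into separate steps, in case the one-liner is hard to read (same calls, just unnested):
dfs <- mget(ls(pattern = "my_mtcars_"))               # gather the data frames into a named list
dfs <- lapply(dfs, function(x) {x[["blah"]] = 1; x})  # add the column to each element
list2env(dfs, envir = .GlobalEnv)                     # write them back over the originals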
Or, if you are amenable to working with data.table, you could use the set() function to add columns:
library(data.table)
# assuming my_mtcars_* is already a data.table
for(i in 1:3) {
  set(get(paste0('my_mtcars_', i)), NULL, "blah", 1)
}
As a suggestion, it is better to manage the data inside a list and use lapply() instead of a loop:
#List
List <- list(my_mtcars_1 = mtcars,
my_mtcars_2 = mtcars,
my_mtcars_3 = mtcars)
#Variable
List2 <- lapply(List, function(x) {x$blah <- 1; return(x)})
And it is easy to collect your existing data frames into a list with code like this:
#List
List <- mget(ls(pattern = 'my_mt'))
So there is no need to define each dataset individually.
We can use tidyverse
library(dplyr)
library(purrr)
map(mget(ls(pattern = '^my_mtcars_\\d+$')), ~ .x %>%
      mutate(blah = 1)) %>%
  list2env(.GlobalEnv)
Is there a way to simplify this code using a loop?
VariableList <- c(v0,v1,v2, ... etc)
National_DF <- df[,VariableList]
AL_DF <- AL[,VariableList]
AR_DF <- AR[,VariableList]
AZ_DF <- AZ[,VariableList]
... etc
I want the end result to have each as a data frame, since each will be used later in the model. Each state, such as 'AL', 'AR', 'AZ', etc., is a data frame. Each v{#} represents an out-of-place variable from the RAW data frame. This is meant to restructure the fields, while eliminating some of them, in preparation for model use.
Continuing the answer from your previous question, we can select the columns in the same lapply call before creating the data frames.
VariableList <- c('v0','v1','v2')
data <- unlist(lapply(mget(ls(pattern = '_DF$')), function(df) {
  index <- sample(1:nrow(df), 0.7*nrow(df))
  df <- df[, VariableList]
  list(train = df[index,], test = df[-index,])
}), recursive = FALSE)
Then get the data into the global environment:
list2env(data, .GlobalEnv)
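Note that unlist() joins the outer and inner names with a dot, so after list2env() the objects appear in the global environment under names such as these (illustrative):
ls(pattern = '_DF\\.')
[1] "AL_DF.test"  "AL_DF.train" "AR_DF.test"  "AR_DF.train" ...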
Let's say I have 5 datasets in a list (each named df_1, df_2, and so on), each with a variable called cons. I'd like to execute a function over cons in each dataset in the list, and create a new variable whose name has the suffix of the corresponding dataset.
So in the end df_1 will have a variable called something like cons_1, and df_2 will have a variable called cons_2. The problem I run into is looping over the variables and trying to create the dynamic names.
Any suggestions?
This is actually pretty straightforward:
df_names <- paste("df", 1:5, sep = "_")
cons_names <- paste("cons", 1:5, sep = "_")
for (i in 1:5) {
  # get the df from the current env by name
  df_i <- get(df_names[i])
  # do whatever you need to do, then assign the result back to the original name
  df_i[[cons_names[i]]] <- some_operation(df_i)
  assign(df_names[i], df_i)
}
But it would make more sense to keep your data frames in a list to avoid using get, which can be sketchy:
for (i in 1:5) {
  df_list[[i]][[cons_names[i]]] <- some_operation(df_list[[i]])
}
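If the modified frames then need to exist as top-level objects again, they can be written back in one step (assuming df_list is a named list with elements df_1 ... df_5):
list2env(df_list, envir = .GlobalEnv)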
Using the purrr package, this would be an alternative solution:
library(purrr)
lst <- list(mtcars_1 = mtcars,
mtcars_2 = mtcars,
mtcars_3 = mtcars,
mtcars_4 = mtcars,
mtcars_5 = mtcars)
map(seq_along(lst), function(x) {
  lst[[x]][paste0("mpg_", x)] <- some_operation(lst[[x]]['mpg'])
  lst[[x]]
})
Subset each data frame from the list, create the new mpg variable using the index of the current data frame, and perform whatever operation you want on the mpg variable. The result is a list containing all of the previous data frames, each with its new variable.
Since this new list doesn't carry over the data frame names, you can always add them back with setNames(newlist, names(lst)).
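A hedged alternative that keeps the names automatically is purrr::imap(), which passes each element along with its name; some_operation() remains the placeholder used above:
library(purrr)

new_lst <- imap(lst, function(df, nm) {
  i <- sub("^mtcars_", "", nm)                # pull the numeric suffix out of the element name
  df[paste0("mpg_", i)] <- some_operation(df['mpg'])
  df
})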
I have data.frame objects in a list. I intend to split each data.frame in myList by comparing its score with a given threshold value. In particular, I want my function to return only the rows whose score is greater than the threshold, while exporting the rows whose score is below the threshold as a csv file (because I will further process the saved data.frames, while the exported ones will be listed in a summary at the end). I am aware that it is easier to first split the data.frames, then export one part as csv files, and further process the desired part, but I want to make this happen in one wrapper function. Can anyone point out how to produce the output of my function more efficiently? Any idea?
mini example:
mylist <- list(
foo=data.frame( from=seq(1, by=4, len=16), to=seq(3, by=4, len=16), score=sample(30, 16)),
bar=data.frame( from=seq(3, by=7, len=20), to=seq(6, by=7, len=20), score=sample(30, 20)),
cat=data.frame( from=seq(4, by=8, len=25), to=seq(7, by=8, len=25), score=sample(30, 25)))
I intend to split them like this:
func <- function(list, threshold = 16, ...) {
  # input param checking
  stopifnot(is.numeric(threshold))
  reslt <- lapply(list, function(elm) {
    res <- split(elm, c("Dropped", "Saved")[(elm$score > threshold) + 1])
    # FIXME : anyway to export Dropped instance while returning Saved
  })
}
In my sketch function, I intend to export the Dropped rows from each data.frame as a csv file, while returning the Saved rows from each data.frame as the output and using them for further processing.
I tried to make this happen in my function, but my approach is not efficient. Can anyone point out how to accomplish this easily? Does anyone know a useful trick for producing my expected output more elegantly? Thanks in advance.
You could roll both processes into a call to lapply, like this:
# function to perform both tasks on one data frame in mylist
splitter <- function(i, threshold) {
  require(dplyr)
  DF <- mylist[[i]]
  DF %>%
    filter(score <= threshold) %>%
    write.csv(., sprintf("dropped.%s.csv", i), row.names = FALSE)
  Saved <- filter(DF, score > threshold)
  return(Saved)
}
# now use the function to create a new list, mylist2, with the Saved
# versions as its elements. The csvs of the dropped rows will get created
# as this runs.
mylist2 <- lapply(seq_along(mylist), function(i) splitter(i, 16))
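If mylist2 should keep the element names from mylist (and the csv files should be named after them instead of 1, 2, 3), one small tweak is to iterate over the names, since both mylist[[i]] and sprintf() accept character names:
mylist2 <- sapply(names(mylist), splitter, threshold = 16, simplify = FALSE)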
I made a loop that assigns the result of a function to a newly created variable. After that, that variable is used to create another.
This second step fails to produce the expected result.
library(stringr)
for (i in 1:length(Ids)) {
  nam <- paste("data", Ids[i], sep = "_")
  assign(nam, GetReportData(query, token, paginate_query = F))
  newvar = paste(nam, "contentid", sep = "$")
  originStr = paste(nam, "pagePath", sep = "$")
  assign(newvar, str_extract(originStr, "&id=[0-9]+"))
}
Don't create a bunch of variables; store related values in named lists to make it easier to retrieve them. You didn't supply any input to test with, but I'm guessing this does the same thing:
library(stringr)
mydata <- lapply(1:length(Ids), function(i) {
  dd <- GetReportData(query, token, paginate_query = F)
  dd$contentid <- str_extract(dd$pagePath, "&id=[0-9]+")
  dd
})
This will return a list of data.frames. You can access them with mydata[[1]], mydata[[2]], etc., rather than data_1, data_2, etc.
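If the data_<id> style names matter for readability, the list elements can also be named after Ids, so lookups by id still work (data_123 below is just a made-up example id):
names(mydata) <- paste0("data_", Ids)
mydata[["data_123"]]   # instead of mydata[[1]]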
If you absolutely insist on creating a bunch of variables, just make sure to do all your transformations on an actual object, and then save that object when you are done. You can never use assign with names that contain $ or [, as described in the help page: "assign does not dispatch assignment methods, so it cannot be used to set elements of vectors, names, attributes, etc." For example:
for(i in 1:length(Ids)) {
  dd <- GetReportData(query, token, paginate_query = F)
  dd$contentid <- str_extract(dd$pagePath, "&id=[0-9]+")
  assign(paste("data", i, sep = "_"), dd)
}