How to order multiple dataframes in the Global Environment in R

I'm trying to run a simulation but I'm having trouble storing multiple dataframes called "data_i" in a list ordered by i. I start with a df called "data_", which has data from 1901 to 2032 (132 rows). I apply a loop to create one dataframe per row, called data_1, data_2, data_3, ..., data_132 (the row for 2032 is stored in data_132). Finally, I store all these dataframes in a list and use lapply to create a column in each dataframe. Here is a reproducible example:
library(stringr) # provides str_extract()
#Main dataframe
time <- 1901:2032
b <- 1:132
data_ <- data.frame(time,b)
#Loop for creating data_i where i goes from 1 to 132
simulations <- 10000
for (i in 1:132) {
assign(paste("data_",i, sep = ""), as.data.frame( sapply(data_[i,], function(n) rep(n,simulations)), stringsAsFactors = FALSE ))
}
#Store all dataframes in list (**I THINK THE PROBLEM IS HERE**)
data_names<-str_extract(ls(), '^data_[[:digit:]]{1,3}$')[!is.na(str_extract(ls(), '^data_[[:digit:]]{1,3}$'))]
dataframes<-lapply(data_names, function(x)get(x))
#Create a new column in each dataframe
new_list <- lapply(dataframes, function(x) cbind(x, production = as.numeric(runif(simulations, min = 50, max = 100))))
#Create data_newi in environment
list2env(setNames(new_list,paste0("data_new", seq_along(dataframes))),
envir = parent.frame())
The code runs, but the problem is that the order of the dataframes is not data_1, data_2, data_3, ..., data_132 but data_1, data_10, data_100, data_101, ... As a result, data_names stores the names in that order, so, for example, 2032 does not end up in data_new132 as I would want it to.
Does anybody know how to solve this? Thanks in advance!

Andres, see if this helps. I added a pad of '0' up to the maximum number of characters (e.g. 132 = 3 characters wide):
library(stringr) # provides str_pad() and str_extract()
#Main dataframe
time <- 1901:2032
b <- 1:132
data_ <- data.frame(time,b)
#Loop for creating data_i where i goes from 1 to 132
simulations <- 10000
for (i in 1:132) {
assign(paste("data_",str_pad(i,nchar(max(b)),pad="0"), sep = ""), as.data.frame( sapply(data_[i,], function(n) rep(n,simulations)), stringsAsFactors = FALSE ))
}
#Store all dataframes in list (**I THINK THE PROBLEM IS HERE**)
data_names<-str_extract(ls(), '^data_[[:digit:]]{1,3}$')[!is.na(str_extract(ls(), '^data_[[:digit:]]{1,3}$'))]
dataframes<-lapply(data_names, function(x)get(x))
#Create a new column in each dataframe
new_list <- lapply(dataframes, function(x) cbind(x, production = as.numeric(runif(simulations, min = 50, max = 100))))
#Create data_newi in environment
list2env(setNames(new_list,paste0("data_new", paste(str_pad(seq_along(dataframes),nchar(max(seq_along(dataframes))),pad="0"),sep=""))),
envir = parent.frame())
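The padding works because ls() returns names sorted as character strings, not as numbers. Below is a minimal sketch of that behaviour. The vector names unpadded, padded and suffix are illustrative and not from the original post. It also shows a base R alternative, assumed here rather than taken from the answer, that leaves the names unchanged and orders them by their numeric suffix.

```r
# Character sorting compares names one character at a time,
# so "data_10" is placed before "data_2".
unpadded <- paste0("data_", c(1, 2, 10, 100))
sort(unpadded)

# Zero-padded names have equal width, so alphabetical order
# and numeric order coincide.
padded <- sprintf("data_%03d", c(1, 2, 10, 100))
sorted_padded <- sort(padded)

# Alternative that keeps the original names: extract the numeric
# suffix and order by it.
suffix <- as.integer(sub("^data_", "", unpadded))
by_number <- unpadded[order(suffix)]
by_number
```

With names ordered this way, lapply(by_number, get) would collect the objects in numeric order.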

1) Use mixedsort in gtools:
library(gtools)
for(i in c(2, 10)) assign(paste0("data", i), i)
ls(pattern = "^data")
## [1] "data10" "data2"
mixedsort(ls(pattern = "^data"))
## [1] "data2" "data10"
2) or ensure that the names are the same length using leading 0's in which case ls() will sort them appropriately:
for(i in c(2, 10)) assign(sprintf("data%03d", i), i)
ls(pattern = "^data")
## [1] "data002" "data010"
3) Normally one does not assign such objects directly into the global environment but puts them into a list. One can refer to elements using L[[1]], etc.
L <- list()
for(i in 1:3) L[[i]] <- i
L
## [[1]]
## [1] 1
##
## [[2]]
## [1] 2
##
## [[3]]
## [1] 3
3a) or in one line:
L <- lapply(1:3, function(i) i)
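Applied to the data in the question, the list approach could look like the sketch below. It is an illustration under assumptions, not the asker's code: the name data_list is made up, simulations is reduced from 10000 to 100 to keep it quick, and each row is repeated by row indexing instead of sapply().

```r
# Main dataframe, as in the question
time <- 1901:2032
b <- 1:132
data_ <- data.frame(time, b)
simulations <- 100  # assumption: smaller than the question's 10000

set.seed(1)
# Build the list directly; element i always corresponds to row i,
# so no sorting of object names is needed.
data_list <- lapply(seq_len(nrow(data_)), function(i) {
  d <- data_[rep(i, simulations), ]   # repeat row i 'simulations' times
  rownames(d) <- NULL
  d$production <- runif(simulations, min = 50, max = 100)
  d
})
names(data_list) <- paste0("data_new", seq_along(data_list))

data_list[["data_new132"]]$time[1]
## [1] 2032
```

Because the list is built in row order, the element named data_new132 holds the 2032 row without any reliance on ls().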

Related

Optimize a large number of variable operations and variable ordering

I would like some suggestions on speeding up the code below. The flow of the code is fairly straightforward:
1) create a vector of unique combinations (m = 3, 4, or 5) from the df variable names
2) transform the vector of combinations into a list of formulas
3) break up the list of formulas into chunks to get around memory limitations
4) iterate through each chunk, performing the formula operation and subsetting the df to the user-specified number of rows (topn)
The full reprex is below, including the different attempts using purrr::map and base lapply. I also attempted to use := from data.table following the link below, but I was unable to figure out how to transform the list of formulas into formulas that could be fed to quote(`:=`(...)):
Apply a list of formulas to R data.table
It appears to me that one of the bottlenecks in my code is in the variable operation step. A previous bottleneck was in the ordering step, which I've managed to speed up quite a bit using the kit library and the link below, but any suggestions that could speed up the entire flow are appreciated. The example I'm posting here uses combn of 4, as that is typically what I use in my workflow, but I would also like to be able to go up to combn of 5 if the speed is reasonable.
Fastest way to find second (third...) highest/lowest value in vector or column
library(purrr)
library(stringr)
library(kit)
df <- data.frame(matrix(data = rnorm(80000*90,200,500), nrow = 80000, ncol = 90))
df$value <- rnorm(80000,200,500)
cols <- names(df)
cols <- cols[!grepl("value", cols)]
combination <- 4
## create unique combinations of column names
ops_vec <- combn(cols, combination, FUN = paste, collapse = "*")
## transform ops vector into list of formulas
ops_vec_l <- purrr::map(ops_vec, .f = function(x) str_split(x, "\\*", simplify = T))
## break up the list of formulas into chunks otherwise memory error
chunks_run <- split(1:length(ops_vec_l), ceiling(seq_along(ops_vec_l)/10000))
## store results of each chunk into one final list
chunks_list <- vector("list", length = length(chunks_run))
ptm <- Sys.time()
chunks_idx <- 1
for (chunks_idx in seq_along(chunks_run))
{
## using purrr::map
# p <- Sys.time()
ele_length <- length(chunks_run[[chunks_idx]])
ops_list_temp <- vector("list", length = ele_length)
ops_list_temp <- purrr::map(
ops_vec_l[ chunks_run[[chunks_idx]] ], .f = function(x) df[,x[,1]]*df[,x[,2]]*df[,x[,3]]*df[,x[,4]]
)
# (p <- Sys.time()-p) #Time difference of ~ 3.6 secs to complete chunk of 10,000 operations
# ## using base lapply
# p <- Sys.time()
# ele_length <- length( ops_vec_l[ chunks_run[[chunks_idx]] ])
# ops_list_temp <- vector("list", length = ele_length)
# ops_list_temp <- lapply(
# ops_vec_l[ chunks_run[[chunks_idx]] ], function(x) df[,x[,1]]*df[,x[,2]]*df[,x[,3]]*df[,x[,4]]
# )
# (p <- Sys.time()-p) #Time difference of ~3.7 secs to complete a chunk of 10,000 operations
## number of rows I want to subset from df
topn <- 250
## list to store indices of topn values for each list element
indices_list <- vector("list", length = length(ops_list_temp))
## list to store value of the topn indices for each list element
values_list <- vector("list", length = length(ops_list_temp))
## for each variable combination in "ops_list_temp" list, find the index (indices) of the topn values in decreasing order
## each element in this list should be the length of topn
indices_list <- purrr::map(ops_list_temp, .f = function(x) kit::topn(vec = x, n = topn, decreasing = T, hasna = F))
## after finding the indices of the topn values for a given variable combination, find the value(s) corresponding to index (indices) and store in the list
## each element in this list, should be the length of topn
values_list <- purrr::map(indices_list, .f = function(x) df[x,"value"])
## save completed chunk to final list
chunks_list[[chunks_idx]] <- values_list
}
(ptm <- Sys.time()-ptm) # Time difference of 41.1 mins
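The four steps listed in the question can be sketched on a much smaller data frame as follows. This is an illustration under assumptions, not a benchmarked answer. It asks combn() for a list of name vectors directly (skipping the paste and str_split round trip), it multiplies columns of a matrix copy with Reduce() so the same code handles combinations of 3, 4 or 5, and it uses base order() in place of kit::topn so the sketch has no package dependencies. The names combos, m and values_list are illustrative.

```r
set.seed(1)
# Small stand-in for the 80000 x 90 data frame in the question
df <- data.frame(matrix(rnorm(200 * 6, 200, 500), nrow = 200, ncol = 6))
df$value <- rnorm(200, 200, 500)
cols <- setdiff(names(df), "value")

combination <- 4
topn <- 5

# Step 1 and 2: one character vector of column names per combination
combos <- combn(cols, combination, simplify = FALSE)

# Numeric matrix copy of the predictor columns
m <- as.matrix(df[cols])

# Step 4: product of the chosen columns, then the 'value' entries at
# the positions of the topn largest products
values_list <- lapply(combos, function(cc) {
  prod_vec <- Reduce(`*`, lapply(cc, function(nm) m[, nm]))
  idx <- order(prod_vec, decreasing = TRUE)[seq_len(topn)]
  df$value[idx]
})

length(values_list)  # one element per combination
```

Chunking (step 3) is omitted because only the topn values per combination are kept, so the full products are never stored. Whether that is enough to avoid the memory limits at the original size is untested here.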

Condition to check values between lists and add to new list in R

If the last value of each sublist in the list ListResiduals (e.g. OptionAOptionD) is greater than the value with the corresponding name in ListSigma (e.g. OptionAOptionD), I want to add that name (e.g. OptionAOptionD) to the Watchlist list.
In the last line of the code I put "> 5" just to make the example work; it is this "> 5" that I want to replace with the condition described in the previous paragraph.
DF <- data.frame("OptionA" = sample(1:100, 50),
"OptionB" = sample(1:100, 50),
"OptionC" = sample(1:100, 50),
"OptionD" = sample(1:100, 50))
#Unfolding options and creating DF
UnFolding <- data.frame(
First = as.vector(sapply(names(DF[]), function(x)
sapply(names(DF[]), function(y)
paste0(x)))),
Second = as.vector(sapply(names(DF[]), function(x)
sapply(names(DF[]), function(y)
paste0(y)))))
#Deleting lines with the same names
UnFolding <-
UnFolding[UnFolding$First != UnFolding$Second, ]
#Creating list with dependent and independent variables
LMList <- apply(UnFolding, 1, function(x)
as.formula(paste(x[1], "~", x[2])))
#Change list data to variable names
names(LMList) <- substring(lapply(LMList, paste, collapse = ""), 2)
#Linear regression - lm()
LMListRegression <- lapply(LMList, function(x) {
eval(call("lm", formula = x, data = DF))
})
#Residuals
ListResiduals <- lapply(LMListRegression, residuals)
#Sigma
ListSigma <- lapply(LMListRegression, function(x) {
sigma(x)*2
})
#Watchlist
Watchlist <- as.list(unlist(lapply(ListResiduals,
function(x) names(x)[1][tail(x, 1) > 5])))
I would gravitate towards converting your Sigma and Residual values to vectors and comparing the vectors. You could also use a data.frame approach to be sure the order of your lists/vectors doesn't change.
# create a vector with the last value from the Residuals list.
last_residual <- sapply(ListResiduals, `[`, 50)
names(last_residual) <- names(ListResiduals) # sapply() appends ".50" to each name; restore the original names
# Using sapply() rather than lapply, will return a named vector
sigma_vector <- sapply(LMListRegression, function(x) {
sigma(x)*2
})
Watchlist <- sigma_vector[last_residual > sigma_vector]
Watchlist
# named numeric(0)
In your example, it returns an empty named vector because no values meet your condition:
max(last_residual)
# [1] 31.70949
min(sigma_vector)
# [1] 52.93234
# To demonstrate that it works, let's divide sigma by 2 so that at least some values will pass
half_sigma <- sigma_vector/2
Watchlist2 <- sigma_vector[last_residual > half_sigma]
Watchlist2
# OptionDOptionA OptionDOptionB OptionDOptionC
# 54.52411 57.09503 56.79341
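The data.frame approach mentioned above could look like the following sketch. It is one possible reading of that suggestion, not the answerer's code. The names pairs, fits and comparison are illustrative, and the models are built with reformulate() instead of pasted formula strings. Keeping the pair name, the last residual and twice sigma in the same row means the comparison cannot be thrown off by a change in ordering.

```r
set.seed(42)
DF <- data.frame(OptionA = sample(1:100, 50),
                 OptionB = sample(1:100, 50),
                 OptionC = sample(1:100, 50),
                 OptionD = sample(1:100, 50))

# All ordered pairs of distinct columns: First is the dependent
# variable, Second the independent one
pairs <- expand.grid(First = names(DF), Second = names(DF),
                     stringsAsFactors = FALSE)
pairs <- pairs[pairs$First != pairs$Second, ]

# One linear model per pair
fits <- lapply(seq_len(nrow(pairs)), function(i) {
  lm(reformulate(pairs$Second[i], response = pairs$First[i]), data = DF)
})

# One row per model: name, last residual, and 2 * sigma side by side
comparison <- data.frame(
  pair = paste0(pairs$First, pairs$Second),
  last_residual = vapply(fits, function(f) unname(tail(residuals(f), 1)),
                         numeric(1)),
  two_sigma = vapply(fits, function(f) 2 * sigma(f), numeric(1)),
  stringsAsFactors = FALSE
)

# Names of the pairs whose last residual exceeds twice sigma
Watchlist <- comparison$pair[comparison$last_residual > comparison$two_sigma]
```

Using tail(residuals(f), 1) rather than a fixed index of 50 keeps the sketch valid if the number of rows in DF changes.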

The loop in my R function appears to be running twice

I need to add rows to a data frame. I have many files with many rows, so I have converted the code to a function. When I run each piece of the code by itself it works fine. When I wrap everything in a function, each row from my first loop gets added twice.
My code looks for a string (xx or x). If xx is present, it replaces the xx with the numbers 00-99 and 0-9 (one row for each number). If x is present, it replaces it with the numbers 0-9.
Create DF
a <- c("1.x", "2.xx", "3.1")
b <- c("single", "double", "nothing")
df <- data.frame(a, b, stringsAsFactors = FALSE)
names(df) <- c("code", "desc")
My dataframe
code desc
1 1.x single
2 2.xx double
3 3.1 nothing
My function
newdf <- function(df){
# If I run through my code chunk by chunk it works as I want it.
df$expanded <- 0 # a variable to let me know if the loop was run on the row
emp <- function(){ # This function creates empty vectors for my loop
assign("codes", c(), envir = .GlobalEnv)
assign("desc", c(), envir = .GlobalEnv)
assign("expanded", c(), envir = .GlobalEnv)
}
emp()
# I want to expand xx with numbers 00 - 99 and 0 - 9.
#Note: 2.0 is different than 2.00
# Identifies the rows to be expanded
xd <- grep("xx", df$code)
# I used chr vs. numeric so I wouldn't lose the trailing zero
# Create a vector to loop through
tens <- formatC(c(0:99)); tens <- tens[11:100]
ones <- c("00","01","02","03","04","05","06","07","08","09")
single <- as.character(c(0:9))
exp <- c(single, ones, tens)
# This loop appears to run twice when I run the function: newdf(df)
# Each row is there twice: 2.00, 2.00, 2.01 2.01...
# It runs as I want it to if I just highlight the code.
for (i in xd){
for (n in exp) {
codes <- c(codes, gsub("xx", n, df$code[i])) #expanding the number
desc <- c(desc, df$desc[i]) # repeating the description
expanded <- c(expanded, 1) # assigning 1 to indicate the row has been expanded
}
}
# Binds the df with the new expansion
df <- df[-xd, ]
df <- rbind(as.matrix(df),cbind(codes,desc,expanded))
df <- as.data.frame(df, stringsAsFactors = FALSE)
# Empties the vector to begin another expansion
emp()
xs <- grep("x", df$code) # This is for the single digit expansion
# Expands the single digits. This part of the code works fine inside the function.
for (i in xs){
for (n in 0:9) {
codes <- c(codes, gsub("x", n, df$code[i]))
desc <- c(desc, df$desc[i])
expanded <- c(expanded, 1)
}
}
df <- df[-xs,]
df <- rbind(as.matrix(df), cbind(codes,desc,expanded))
df <- as.data.frame(df, stringsAsFactors = FALSE)
assign("out", df, envir = .GlobalEnv) # This is how I view my dataframe after I run the function.
}
Calling my function
newdf(df)
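A likely cause, based on R's scoping rules and not on anything stated in the post: emp() resets codes, desc and expanded in the global environment, but inside newdf() the first assignment codes <- c(codes, ...) creates local copies. The second call to emp() does not touch those local copies, so they still hold the xx expansions when the x loop appends to them, and the xx rows are bound in a second time. Run at top level, everything is global and the reset works. Below is a minimal sketch that avoids the problem by keeping all state local. The helper expand_codes and the vector two_digit are hypothetical names, not from the original code.

```r
# Hypothetical helper: expand one placeholder using only local variables,
# so nothing carries over between the "xx" pass and the "x" pass.
expand_codes <- function(df, pattern, replacements) {
  hits <- grep(pattern, df$code, fixed = TRUE)
  if (length(hits) == 0) return(df)
  new_rows <- do.call(rbind, lapply(hits, function(i) {
    data.frame(
      code = vapply(replacements,
                    function(n) sub(pattern, n, df$code[i], fixed = TRUE),
                    character(1), USE.NAMES = FALSE),
      desc = df$desc[i],
      expanded = 1,  # marks rows produced by an expansion
      stringsAsFactors = FALSE
    )
  }))
  out <- rbind(df[-hits, , drop = FALSE], new_rows)
  rownames(out) <- NULL
  out
}

df <- data.frame(code = c("1.x", "2.xx", "3.1"),
                 desc = c("single", "double", "nothing"),
                 stringsAsFactors = FALSE)
df$expanded <- 0

# "0".."9" plus "00".."99", kept as character to preserve leading zeros
two_digit <- c(as.character(0:9), sprintf("%02d", 0:99))

out <- expand_codes(df, "xx", two_digit)          # expand xx first
out <- expand_codes(out, "x", as.character(0:9))  # then single x
nrow(out)
```

Returning the result from the function, instead of assigning it into .GlobalEnv, also removes the need for the final assign("out", ...) call.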

subsetting a data.frame using a for loop

I have a data.frame, and I want to subset it every 10 rows, apply a function to the subset, save the resulting object, and remove the previous object. Here is what I have so far:
L3 <- LETTERS[1:20]
df <- data.frame(1:391, "col", sample(L3, 391, replace = TRUE))
names(df) <- c("a", "b", "c")
b <- seq(from=1, to=391, by=10)
nsamp <- 0
for(i in seq_along(b)){
a <- i+1
nsamp <- nsamp+1
df_10 <- df[b[nsamp]:b[a], ]
res <- lapply(seq_along(df_10$b), function(x){...})
saveRDS(res, file="res.rds")
rm(res)
}
My problem is that the for loop crashes when it reaches the last element of my sequence b.
When partitioning data, split is your friend. It will create a list with each data subset as an item which is then easy to iterate over.
dfs = split(df, (seq_len(nrow(df)) - 1) %/% 10) # subtract 1 so the first group also has 10 rows
Then your for loop can be simplified to something like this (untested... I'm not exactly sure what you're doing because example data seems to switch from df to sc2_10 and I only hope your column named b is different from your vector named b):
for(i in seq_along(dfs)){
res <- lapply(seq_along(dfs[[i]]$b), function(x){...})
saveRDS(res, file = sprintf("res_%s.rds", i))
rm(res)
}
I also modified your save file name so that you aren't overwriting the same file every time.
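For reference, the original loop fails on its last iteration because b[a] indexes one past the end of b and returns NA, so b[nsamp]:b[a] cannot be evaluated. A runnable sketch of the split() approach follows. The table() call is only a placeholder for the elided per-chunk function, and writing to tempdir() is an assumption made to keep the example self-contained.

```r
set.seed(1)
df <- data.frame(a = 1:391,
                 b = "col",
                 c = sample(LETTERS[1:20], 391, replace = TRUE))

# Group id 0 for rows 1-10, 1 for rows 11-20, and so on
chunk_id <- (seq_len(nrow(df)) - 1) %/% 10
dfs <- split(df, chunk_id)

out_dir <- tempdir()  # assumption: write results to a temporary folder
for (i in seq_along(dfs)) {
  res <- table(dfs[[i]]$c)  # placeholder for the real per-chunk function
  saveRDS(res, file = file.path(out_dir, sprintf("res_%s.rds", i)))
}

length(dfs)
## [1] 40
```

The last chunk holds only the single remaining row (391), and no special handling of the final index is needed.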

Reading series of values in R

I have read a series of 332 files as shown below, storing the data from each file as a data frame in a list.
files <- list.files()
data <- list()
for (i in 1:332){
data[[i]] = read.csv(files[[i]])
}
The data has 3 columns with names id, city, town. Now I need to calculate the mean of all values under city corresponding to the id values 1:10 for which I wrote the below code
for(j in 1:10){
req.data <- data[[j]]$city
}
mean(na.omit(req.data))
But it is giving me a wrong value, and when I call it in a function it's transferring null values. Any help is highly appreciated.
Each time you iterate through j = 1:10 you assign data[[j]]$city to the object req.data. In doing so, for steps j = 2:10 you overwrite the previous version of req.data with the contents of the jth data set. Hence req.data only ever contains a single city's worth of data at any one time, and you get the wrong answer because you are computing the mean for the last city only, not all 10.
Also note that you could do mean(req.data, na.rm = TRUE) to remove the NAs.
You can do this without an explicit loop at the user R level using lapply(), for example, with dummy data,
set.seed(42)
data <- list(data.frame(city = rnorm(100)),
data.frame(city = rnorm(100)),
data.frame(city = rnorm(100)))
mean(unlist(lapply(data, `[`, "city")), na.rm = TRUE)
which gives
> mean(unlist(lapply(data, `[`, "city")), na.rm = TRUE)
[1] -0.02177902
So in your case, you need:
mean(unlist(lapply(data[1:10], `[`, "city")), na.rm = TRUE)
If you want to write a loop, then perhaps
req.data <- vector("list", length = 3) ## allocate, adjust to length = 10
for (j in 1:3) { ## adjust to 1:10 for your data / Q
req.data[[j]] <- data[[j]]$city ## fill in
}
mean(unlist(req.data), na.rm = TRUE)
> mean(unlist(req.data), na.rm = TRUE)
[1] -0.02177902
is one way. Or alternatively, compute the mean of the individual cities and then average those means
vec <- numeric(length = 3) ## allocate, adjust to length = 10
for (j in 1:3) { ## adjust to 1:10 for your question
vec[j] <- mean(data[[j]]$city, na.rm = TRUE)
}
mean(vec)
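The pooled mean and the mean of the per-data-set means are only guaranteed to agree when every data set contributes the same number of non-missing values. A small sketch with made-up numbers (not from the question) shows the difference:

```r
# Two toy data sets with different numbers of non-missing values
data <- list(data.frame(city = c(1, 2, 3)),
             data.frame(city = c(10, NA)))

# Pooled mean: all non-missing values weighted equally
pooled <- mean(unlist(lapply(data, `[[`, "city")), na.rm = TRUE)
pooled
## [1] 4

# Mean of means: each data set weighted equally
mean_of_means <- mean(vapply(data,
                             function(d) mean(d$city, na.rm = TRUE),
                             numeric(1)))
mean_of_means
## [1] 6
```

Which of the two is wanted depends on whether each observation or each file should carry equal weight.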
