R Apply Function for Formatting Many Datasets - r

I would like to loop through nine data sets, perform calculations, and output a different file name.
Existing Code:
list <- c(corporate_service, finance, its, law, market_services, operations, president, member_services, System_Planning)
Calc <- function(list){
list %>% filter(Total_Flag == 1) %>%
select(Element, Amount, Total)
}
lapply(list, Calc)
I would like to loop through each dataset and apply the function above. More specifically, I would like to re-name each processed dataframe something different. Is there a way to do this? I should also note, this code has not worked for me - is there anything noticeably wrong?
Thanks

Avoid flooding your global environment with separate, similarly structured data frames in the first place. Instead, continue to use a list of data frames. See @GregorThomas's best-practices answer for why. In fact, a named list is preferable for better indexing.
# DEFINE A NAMED LIST OF DATA FRAMES
df_list <- list(corporate_service = corporate_service,
finance = finance,
its = its,
law = law,
market_services = market_services,
operations = operations,
president = president,
member_services = member_services,
system_planning = System_Planning)
# REMOVE ORIGINALS FROM GLOBAL ENVIRONMENT
rm(corporate_service, finance, its, law, market_services,
operations, president, member_services, System_Planning)
# REVIEW STRUCTURE
str(df_list)
Then define a method to interact with a single data frame (not list) and its list name. Then call it iteratively:
Calc <- function(df, nm) {
df <- select(filter(df, Total_Flag == 1), Element, Amount, Total)
write.csv(df, file.path("path", "to", "my", "destination", paste0(nm, ".csv"))) # paste0 avoids a space before ".csv"
return(df)
}
# ASSIGN TO A NEW LIST
new_df_list <- mapply(Calc, df_list, names(df_list), SIMPLIFY=FALSE)
new_df_list <- Map(Calc, df_list, names(df_list)) # EQUIVALENT WRAPPER TO ABOVE
To be clear, you lose no functionality of a data frame if it is stored in a larger container.
head(new_df_list$corporate_service)
tail(new_df_list$finance)
summary(new_df_list$its)
Such containers even help run the same operation across every element:
lapply(new_df_list, summary)
Even concatenate all data frame elements together with column of corresponding list name:
final_df <- dplyr::bind_rows(new_df_list, .id="division")
Overall, your organization and data management is enhanced since you only have to use a single, indexed object and not many that require ls, mget, get, eval, assign for dynamic operations.
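As for why the original attempt did not work: a likely cause, assuming the nine objects are ordinary data frames, is that c() does not keep data frames intact. A data frame is itself a list of columns, so c() concatenates all the columns into one flat list, and lapply then passes single columns to Calc rather than whole data frames. A minimal sketch with two made-up data frames (a and b are illustrative, not from the question):

```r
# Two small stand-in data frames (hypothetical data)
a <- data.frame(Element = c("x", "y"), Total_Flag = c(1, 0))
b <- data.frame(Element = c("z", "w"), Total_Flag = c(0, 1))

flat <- c(a, b)               # flattens into a list of 4 column vectors
nested <- list(a = a, b = b)  # keeps 2 intact data frames

length(flat)                        # 4
length(nested)                      # 2
is.data.frame(flat[[1]])            # FALSE
all(sapply(nested, is.data.frame))  # TRUE
```

Building the container with list() rather than c() is what lets each element reach the function as a complete data frame.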

Related

Can I use names from a list/dataframe, to be recognised as list/dataframe name within R script for a loop function?

I'd like to use a loop function to recognise names from a list/dataframe as an actual list/dataframe name in the R script (for data analysis or manipulation).
I will create some pseudo data to try to help show what i'm trying to do.
Here is code to create 3 lists
height <- sample(120:200,200,TRUE)
weight <- sample(40:140,200,TRUE)
income <- sample(20000:200000,200, TRUE)
This code creates a list containing those list names
vars <- c("height","weight","income")
The code below doesn't run, but I would like to use a loop code like this, where it takes the name from the list position and uses it in script as a list name. Thus it's using the name to calculate the mean, and it's using the name to create a new object.
for (i in 1:3)
{mean_**vars[i]** = mean(**vars[i]**) }
The result should be 3 objects "mean_height", "mean_weight", "mean_income" which contain the mean scores
I'm not so much interested in the calculating of mean scores, I'm interested in the ability to use the names from the list. I want to be able to expand this to other analyses that are repetitive.
Apologies if above hasn't been articulated too well, I'm quite new to R, so I hope it makes some sense.
Any help will be most useful, or if you can point me in the right direction that would be great.
This may be what you're looking for, where lapply applies the mean function to each of the items in vars (a list of vectors). Note that the list should be built from the variables themselves, not from their names as strings.
height <- sample(120:200,200,TRUE)
weight <- sample(40:140,200,TRUE)
income <- sample(20000:200000,200, TRUE)
vars <- list(height, weight, income)
lapply(vars, function(x) mean(x))
Then create an output dataframe using that:
df1 <- data.frame(lapply(vars, function(x) mean(x)))
colnames(df1) <- c("mean_height", "mean_weight", "mean_income")
df1
From your additional comment, using vars <- list(height, weight, income) should allow you do this:
mean(height)
mean(vars[[1]])
[1] 160.48
[1] 160.48
This should work to output dynamically named variables:
vars <- list(height = height, weight = weight, income = income)
for (i in names(vars)){
assign(paste("mean_", i, sep = ""), mean(vars[[i]]))
}
mean_height
mean_weight
mean_income
[1] 163.28
[1] 90.465
[1] 109686.5
However, I'd suggest not programming that way since it can cause issues and it's not very scalable. E.g., you could end up with 10000 variables.
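If the names really do start out as a character vector, as in the question, one option is mget, which looks objects up by name and returns them as a named list. The sketch below assumes the three vectors live in the current environment; the fixed seed is only there to make the run reproducible.

```r
set.seed(1)
height <- sample(120:200, 200, TRUE)
weight <- sample(40:140, 200, TRUE)
income <- sample(20000:200000, 200, TRUE)

vars <- c("height", "weight", "income")

# Look the objects up by name; the result is a list named after vars
data_list <- mget(vars, envir = environment())

# One mean per element, returned as a named numeric vector
means <- sapply(data_list, mean)
names(means) <- paste0("mean_", names(means))

means[["mean_height"]]
```

This keeps all results in one object that can be indexed by name, instead of creating one global variable per result.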
I guess what you want is something like below, which puts three objects into your global environment holding the means of height, weight, and income from the list lst, i.e.,
list2env(setNames(Map(mean,lst),paste0("mean_",names(lst))),envir = .GlobalEnv)
DATA
height <- sample(120:200,200,TRUE)
weight <- sample(40:140,200,TRUE)
income <- sample(20000:200000,200, TRUE)
lst <- list(height = height, weight = weight, income = income) # named, so names(lst) is not NULL
A more common approach in R is to use lists of data, rather than separate variables.
Like this:
# make this reproducible
set.seed(123)
# make an empty list for the data
raw_data <- list()
# then fill the list. The data can be of varying length in a list.
raw_data$height <- sample(120:200,200,TRUE)
raw_data$weight <- sample(40:140,200,TRUE)
raw_data$income <- sample(20000:200000,200, TRUE)
Then looping becomes a one-liner and your names are preserved, using the *apply family of functions:
mean_data <- lapply(raw_data, mean)
# print that
mean_data
$height
[1] 159.06
$weight
[1] 90.83
$income
[1] 114000.7
Note what we didn't have to do:
know the number of variables.
have variables all the same length.
build a loop and keep track of names.
All handled automagically. Nice.

R: transforming multiple sets to dataframes at once

I have 31 datasets corresponding to data about 31 teachers. I need to perform multiple transformations on all these datasets. One of them is transforming all of them into dataframes
class(alexandre)
[1] "tbl_df" "tbl" "data.frame"
As I said, I have 31 similar datasets, and I need to transform all into dataframes. My code to do so has been
alexandre <- as.data.frame(alexandre)
adrian <- as.data.frame(adrian)
akemi <- as.data.frame(akemi)
arcanjo <- as.data.frame(arcanjo)
ana_barbara <- as.data.frame(ana_barbara)
brigida <- as.data.frame(brigida)
cleiton <- as.data.frame(cleiton)
daniela <- as.data.frame(daniela)
davi <- as.data.frame(davi)
eliezer <- as.data.frame(eliezer)
eduardo <- as.data.frame(eduardo)
eustaquio <- as.data.frame(eustaquio)
gilberto <- as.data.frame(gilberto)
gilmar <- as.data.frame(gilmar)
jorge <- as.data.frame(jorge)
juarez <- as.data.frame(juarez)
junior <- as.data.frame(junior)
... and so on, for 31 lines in total. Obviously all these lines of code take too much space and there must be a faster (and more elegant) way to accomplish this. In fact, I tried this
teachers <- c(alexandre, akemi, adrian, brigida, davi, ...)
cnames <- function(x){
colnames(x) <- c(1:18)
}
mapply(cnames, teachers)
Then I would do all the work with a few lines of code. And this method (form a vector containing all datasets, then use mapply on the vector) would make my work much easier because, as I said, I have to perform multiple transformation on all these datasets.
This code does not work, however. I get the following error:
Error in `colnames<-`(`*tmp*`, value = c(1:18)) :
attempt to set 'colnames' on an object with less than two dimensions
This error message is very unenlightening, I find. I have no idea what to do to make the code work, which is obviously why I'm here. Any other methods to accomplish what I'm trying to do are welcome. Thanks.
As commented and often discussed in the R tag of SO, simply use a list to maintain all your individual, similarly structured data frames. Doing so allows you the following benefits:
Easily run operations consistently across all items using loops or apply family calls without separate naming assignments.
Organizes your environment and workspace with maintenance of one object with easy reference by number or name instead of 31 objects flooding your global environment.
Facilitates data frame migrations and handling with rbind, cbind, split, by, or other operations.
To create a list of all current data frames in global environment use eapply or mget filtering on data frame objects. Each returns a named list of data frames.
teachers_df_list <- Filter(is.data.frame, eapply(.GlobalEnv, identity))
teachers_df_list <- Filter(is.data.frame, mget(x=ls()))
Alternatively, source your data frames originally from file sources using list objects such as list.files:
teachers_df_list <- lapply(list.files(...), function(f) read.csv(f, ...))
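A minimal sketch of that file-based approach, assuming one CSV file per teacher. The temporary directory, file names and columns below are made up for illustration; the elided arguments in the line above are left as they are.

```r
# Write two small example files to a temporary directory (illustrative data)
csv_dir <- file.path(tempdir(), "teachers_demo")
dir.create(csv_dir, showWarnings = FALSE)
write.csv(data.frame(score = 1:3), file.path(csv_dir, "alexandre.csv"), row.names = FALSE)
write.csv(data.frame(score = 4:6), file.path(csv_dir, "adrian.csv"), row.names = FALSE)

# Collect the file paths, then read each file into one list element
files <- list.files(csv_dir, pattern = "\\.csv$", full.names = TRUE)
teachers_demo_list <- lapply(files, read.csv)

# Name each element after its file (without the extension)
names(teachers_demo_list) <- tools::file_path_sans_ext(basename(files))

names(teachers_demo_list)
```

Naming the elements after the files gives the same name-based access as a list built with mget.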
You lose no functionality of data frame if stored inside a list.
head(teachers_df_list$alexandre)
tail(teachers_df_list$adrian)
summary(teachers_df_list$akemi)
...
Then run your operations with lapply, such as renaming columns with setNames (the function form of names(x) <- value). Other operations, such as aggregate or lm, work the same way on the renamed list:
new_teachers_df_list <- lapply(teachers_df_list,
function(df) setNames(df, paste0("col_", 1:18)))
new_teachers_agg_list <- lapply(new_teachers_df_list,
function(df) aggregate(col_1 ~ col_2, df, sum))
new_teachers_model_list <- lapply(new_teachers_df_list,
function(df) summary(lm(col_1 ~ col_2, df)))
Even compile all data frames into one master version using do.call + rbind:
# ADD A TEACHER INDICATOR COLUMN
new_teachers_df_list <- Map(function(df, n) transform(df, teacher=n),
new_teachers_df_list, names(new_teachers_df_list))
# BUILD SINGLE DF
teachers_df <- do.call(rbind, new_teachers_df_list)
Even split master version back into individual groupings if needed later on:
# SPLIT BACK TO LIST OF DFs
teachers_df_list <- split(teachers_df, teachers_df$teacher)
Maybe you could use a list to store all your data frames. It seems to work, but you then need a way to extract each data frame from the list afterwards.
df_1 <- data.frame(c(0, 1, 0), c(3, 4, 5))
df_2 <- data.frame(c(0, 1, 0), c(3, 4, 5))
l <- list(df_1, df_2)
lapply(l, function(x){
colnames(x) <- 1:2
return(x)
})
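Two details make this version work where the question's cnames did not: the function returns the modified data frame, and the result of lapply still has to be assigned, because R modifies a copy inside the function. A short sketch of the difference (rename_bad and rename_good are hypothetical names):

```r
df_1 <- data.frame(c(0, 1, 0), c(3, 4, 5))

# Last expression is an assignment, so the function returns the new names
rename_bad <- function(x) {
  colnames(x) <- c("a", "b")
}

# Returning x explicitly hands back the renamed data frame
rename_good <- function(x) {
  colnames(x) <- c("a", "b")
  x
}

res_bad <- rename_bad(df_1)
res_good <- rename_good(df_1)

is.data.frame(res_bad)  # FALSE: only the names came back
colnames(res_good)      # "a" "b"
colnames(df_1)          # unchanged, the original was never modified
```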

Looping through similar dataframes to apply changes using for

I have dataframes in which one column needs to be modified, correctly handling NAs, characters and digits. Dataframes have similar names, and the column of interest is shared.
I made a for loop to change every row of the column of interest correctly. However I had to create an intermediary object "df" in order to accomplish that.
Is that necessary, or can the original dataframes be modified directly?
sheet1 <- read.table(text="
data
15448
something_else
15334
14477", header=TRUE, stringsAsFactors=FALSE)
sheet2 <- read.table(text="
data
16448
NA
16477", header=TRUE, stringsAsFactors=FALSE)
sheets<-ls()[grep("sheet",ls())]
for(i in 1:length(sheets) ) {
df<-NULL
df<-eval(parse(text = paste0("sheet",i) ))
for (y in 1:length(df$data) ){
if(!is.na(as.integer(df$data[y])))
{
df[["data"]][y]<-as.character(as.Date(as.integer(df$data[y]), origin = "1899-12-30"))
}
}
assign(eval(as.character(paste0("sheet",i))),df)
}
As @d.b. mentions, consider working on a list of dataframes, especially if they are similarly structured, since you can run the same operations using apply procedures and you avoid managing many objects in the global environment. Also, consider using the vectorized ifelse to update the column.
And if you ever really need separate dataframe objects, use list2env to convert each element to a separate object. The code below wraps the as.* functions in suppressWarnings, since returning NA for non-numeric values is intended here.
sheetList <- mget(ls(pattern = "sheet[0-9]"))
sheetList <- lapply(sheetList, function(df) {
df$data <- ifelse(is.na(suppressWarnings(as.integer(df$data))), df$data,
as.character(suppressWarnings(as.Date(as.integer(df$data),
origin = "1899-12-30"))))
return(df)
})
list2env(sheetList, envir=.GlobalEnv)
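The same conversion can also be written with logical indexing instead of ifelse, which some find easier to read. This is only a sketch: convert_serials is a hypothetical helper, and it assumes the column holds Excel-style day counts with origin 1899-12-30, as in the question.

```r
convert_serials <- function(x) {
  # Non-numeric entries become NA here; the warning is expected
  n <- suppressWarnings(as.integer(x))
  ok <- !is.na(n)

  # Start from the original values and overwrite only the numeric ones
  out <- as.character(x)
  out[ok] <- as.character(as.Date(n[ok], origin = "1899-12-30"))
  out
}

convert_serials(c("1", "something_else", NA))
# "1899-12-31" "something_else" NA
```

It could be dropped into the lapply call above by replacing the ifelse expression with convert_serials(df$data).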

Looping through or lapply on list of dataframes in R

I do have some grasp on how to use lapply to, say, change the names of variables in several dataframes in a list. However, I am looking to carry out a slightly (but only slightly) more complicated operation.
More specifically, I am looking to calculate the mean growth rates for several entities. The growth rates have already been calculated, so I just need to perform the following operations on all dataframes
for (i in 1:13) {
growth.type[,i] <- tapply(growth[,8+i] , growth$type, mean, na.rm = TRUE)
}
This creates a new dataframe (growth.type) that includes the mean of all several hundred growth rates in the original dataframe (growth), by type.
Now, I would like to do this to several dataframes (like growth) and put them into new dataframes (like growth.type).
I hope this makes sense.
Put all data.frames you wish to process in a list
xy <- list(growth1, growth2, growth3, ...)
and then apply a custom function to this xy object.
customFunction <- function(.data) {
for (i in 1:13) {
growth.type[,i] <- tapply(.data[,8+i] , .data$type, mean, na.rm = TRUE)
}
growth.type # this is the object which will be returned when function finishes
}
then just do
out <- lapply(xy, FUN = customFunction)
If you want to combine the result of lapply, you can use do.call, e.g. do.call("rbind", out).
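Note that customFunction above relies on a growth.type object already existing outside the function. A self-contained alternative, sketched here with aggregate, builds the result from scratch. The column names g1 and g2 and the toy data are assumptions for illustration; in the question the growth rates sit in columns 9 to 21.

```r
# Toy stand-ins for the growth data frames (hypothetical values)
growth1 <- data.frame(type = c("a", "a", "b"), g1 = c(1, 3, 5), g2 = c(2, NA, 6))
growth2 <- data.frame(type = c("a", "b", "b"), g1 = c(2, 4, 6), g2 = c(1, 1, 3))

# Mean of each selected column within each type; na.rm is passed on to mean
mean_by_type <- function(.data, cols) {
  aggregate(.data[cols], by = list(type = .data$type), FUN = mean, na.rm = TRUE)
}

xy <- list(growth1 = growth1, growth2 = growth2)
out <- lapply(xy, mean_by_type, cols = c("g1", "g2"))

out$growth1
```

With the question's layout, cols could be given as column positions (9:21) instead of names.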

R: Address objects deep inside lists with filter commands inside functions/loops (ExtremeBounds package)

I am using the ExtremeBounds package which provides as a result a multi level list with (amongst others) dataframes at the lowest level. I run this package over several specifications and I would like to collect some columns of selected dataframes in these results. These should be collected by specification (spec1 and spec2 in the example below) and arranged in a list of dataframes. This list of dataframes can then be used for all kind of things, for example to export the results of different specifications into different Excel Sheets.
Here is some code which creates the problematic object (just run this code blindly, my problem only concerns how to deal with the kind of list it creates: eba_results):
library("ExtremeBounds")
Data <- data.frame(var1=rbinom(30,1,0.2),var2=rbinom(30,2,0.2),
var3=rnorm(30),var4=rnorm(30),var5=rnorm(30))
spec1 <- list(y=c("var1"),
freevars=c("var2"),
doubtvars=c("var3","var4"))
spec2 <- list(y=c("var1"),
freevars=c("var2"),
doubtvars=c("var3","var4","var5"))
indicators <- c("spec1","spec2")
ebaFun <- function(x){
eba <- eba(data=Data, y=x$y,
free=x$freevars,
doubtful=x$doubtvars,
reg.fun=glm, k=1, vif=7, draws=50, weights = "lri", family = binomial(logit))}
eba_results <- lapply(mget(indicators),ebaFun) #eba_results is the object in question
Manually I know how to access each element, for example:
eba_results$spec1$bounds$type #look at str(eba_results) to see the different levels
So "bounds" is a dataframe with identical column names for both spec1 and spec2. I would like to collect the following 5 columns from "bounds":
type, cdf.mu.normal, cdf.above.mu.normal, cdf.mu.generic, cdf.above.mu.generic
into one dataframe per spec. Manually this is simple but ugly:
collectedManually <-list(
manual_spec1 = data.frame(
type=eba_results$spec1$bounds$type,
cdf.mu.normal=eba_results$spec1$bounds$cdf.mu.normal,
cdf.above.mu.normal=eba_results$spec1$bounds$cdf.above.mu.normal,
cdf.mu.generic=eba_results$spec1$bounds$cdf.mu.generic,
cdf.above.mu.generic=eba_results$spec1$bounds$cdf.above.mu.generic),
manual_spec2= data.frame(
type=eba_results$spec2$bounds$type,
cdf.mu.normal=eba_results$spec2$bounds$cdf.mu.normal,
cdf.above.mu.normal=eba_results$spec2$bounds$cdf.above.mu.normal,
cdf.mu.generic=eba_results$spec2$bounds$cdf.mu.generic,
cdf.above.mu.generic=eba_results$spec2$bounds$cdf.above.mu.generic))
But I have more than 2 specifications and I think this should be possible with lapply functions in a prettier way. Any help would be appreciated!
p.s.: A generic example to which hrbrmstr's answer applies but which turned out to be too simplistic:
exampleList = list(a=list(aa=data.frame(A=rnorm(10),B=rnorm(10)),bb=data.frame(A=rnorm(10),B=rnorm(10))),
b=list(aa=data.frame(A=rnorm(10),B=rnorm(10)),bb=data.frame(A=rnorm(10),B=rnorm(10))))
and I want to have an object which collects, for example, all the A and B vectors into two data frames (each with its respective A and B) which are then a list of data frames. Manually this would look like:
dfa <- data.frame(A=exampleList$a$aa$A,B=exampleList$a$aa$B)
dfb <- data.frame(A=exampleList$b$aa$A,B=exampleList$b$aa$B)
collectedResults <- list(a=dfa, b=dfb)
There's probably a less brute-force way to do this.
If you want lists of individual columns this is one way:
get_col <- function(my_list, col_name) {
unlist(lapply(my_list, function(x) {
lapply(x, function(y) { y[, col_name] })
}), recursive=FALSE)
}
get_col(exampleList, "A")
get_col(exampleList, "B")
If you want a consolidated data.frame of indicator columns this is one way:
collect_indicators <- function(my_list, indicators) {
lapply(my_list, function(x) {
do.call(rbind, c(lapply(x, function(y) { y[, indicators] }), make.row.names=FALSE))
})[[1]]
}
collect_indicators(exampleList, c("A", "B"))
If you just want to bring the individual data.frames up a level to make it easier to iterate over to write to a file:
unlist(exampleList, recursive=FALSE)
Much assumption about the true output format is being made (the question was a bit vague).
There is a brute force way which works but is dependent on several named objects:
collectEBA <- function(x){
df <- paste0("eba_results$",x,"$bounds")
df <- eval(parse(text=df))[,c("type",
"cdf.mu.normal","cdf.above.mu.normal",
"cdf.mu.generic","cdf.above.mu.generic")]
df[is.na(df)] <- "NA"
df
}
eba_export <- lapply(indicators,collectEBA)
names(eba_export) <- indicators
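The eval(parse()) step can usually be avoided, because lapply can iterate over the results list directly and index each element with $ or [[. The sketch below uses a small mock list with the same nesting as eba_results; it does not call the ExtremeBounds package, and the values and the mock_bounds helper are made up.

```r
cols <- c("type", "cdf.mu.normal", "cdf.above.mu.normal")

# Hypothetical stand-in for the "bounds" data frame of one specification
mock_bounds <- function() {
  data.frame(type = c("free", "doubtful"),
             cdf.mu.normal = c(0.10, 0.90),
             cdf.above.mu.normal = c(0.90, 0.10),
             unused = c(1, 2))
}

mock_results <- list(spec1 = list(bounds = mock_bounds()),
                     spec2 = list(bounds = mock_bounds()))

# One data frame per specification, keeping only the wanted columns;
# the list names carry over, so no separate names<- step is needed
collected <- lapply(mock_results, function(res) res$bounds[, cols])

str(collected)
```

With the real object, the equivalent call would presumably be lapply(eba_results[indicators], function(res) res$bounds[, cols]), with cols extended to all five columns.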

Resources