R: transforming multiple datasets to dataframes at once

I have 31 datasets corresponding to data about 31 teachers. I need to perform multiple transformations on all these datasets. One of them is transforming all of them into dataframes:
class(alexandre)
[1] "tbl_df" "tbl" "data.frame"
As I said, I have 31 similar datasets, and I need to transform all into dataframes. My code to do so has been
alexandre <- as.data.frame(alexandre)
adrian <- as.data.frame(adrian)
akemi <- as.data.frame(akemi)
arcanjo <- as.data.frame(arcanjo)
ana_barbara <- as.data.frame(ana_barbara)
brigida <- as.data.frame(brigida)
cleiton <- as.data.frame(cleiton)
daniela <- as.data.frame(daniela)
davi <- as.data.frame(davi)
eliezer <- as.data.frame(eliezer)
eduardo <- as.data.frame(eduardo)
eustaquio <- as.data.frame(eustaquio)
gilberto <- as.data.frame(gilberto)
gilmar <- as.data.frame(gilmar)
jorge <- as.data.frame(jorge)
juarez <- as.data.frame(juarez)
junior <- as.data.frame(junior)
... and so on, for 31 lines in total. Obviously all these lines of code take too much space, and there must be a faster (and more elegant) way to accomplish this. In fact, I tried this:
teachers <- c(alexandre, akemi, adrian, brigida, davi, ...)
cnames <- function(x){
colnames(x) <- c(1:18)
}
mapply(cnames, teachers)
Then I would do all the work with a few lines of code. And this method (form a vector containing all datasets, then use mapply on the vector) would make my work much easier because, as I said, I have to perform multiple transformations on all these datasets.
This code does not work, however. I get the following error:
Error in `colnames<-`(`*tmp*`, value = c(1:18)) :
attempt to set 'colnames' on an object with less than two dimensions
I find this error message very unenlightening. I have no idea what to do to make the code work, which is obviously why I'm here. Any other methods to accomplish what I'm trying to do are welcome. Thanks.

As commented and often discussed in the R tag of SO, simply use a list to maintain all your individual, similarly structured data frames. Doing so allows you the following benefits:
Easily run operations consistently across all items using loops or apply family calls without separate naming assignments.
Keeps your environment and workspace organized: you maintain one object, with easy reference to each element by number or name, instead of 31 objects flooding your global environment.
Facilitates data frame migrations and handling with rbind, cbind, split, by, or other operations.
To create a list of all current data frames in global environment use eapply or mget filtering on data frame objects. Each returns a named list of data frames.
teachers_df_list <- Filter(is.data.frame, eapply(.GlobalEnv, identity))
teachers_df_list <- Filter(is.data.frame, mget(x=ls()))
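A minimal sketch of the mget approach above, run against a throwaway environment so it does not depend on what is in your workspace. The environment `teacher_env` and the toy objects in it are hypothetical stand-ins; with real data you would call mget(ls()) in the global environment as shown above.

```r
# Hypothetical stand-in for the workspace: two data frames plus one non-data-frame object.
teacher_env <- new.env()
assign("alexandre", data.frame(x = 1:3, y = 4:6), envir = teacher_env)
assign("adrian", data.frame(x = 7:9, y = 10:12), envir = teacher_env)
assign("n_teachers", 31, envir = teacher_env)

# Collect every object by name, then keep only the data frames.
all_objects <- mget(ls(teacher_env), envir = teacher_env)
teachers_df_list <- Filter(is.data.frame, all_objects)

# The list is named after the original objects.
names(teachers_df_list)
```

Because ls() returns names in sorted order, the list elements follow that order rather than the order of creation.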
Alternatively, read your data frames straight into a list in the first place, for example by looping over the file names returned by list.files:
teachers_df_list <- lapply(list.files(...), function(f) read.csv(f, ...))
You lose no functionality of data frame if stored inside a list.
head(teachers_df_list$alexandre)
tail(teachers_df_list$adrian)
summary(teachers_df_list$akemi)
...
Then run your needed operations with lapply, such as renaming columns with setNames (which returns the renamed object, so it works on the right-hand side of an assignment). Other operations, such as aggregate or lm, follow the same pattern.
new_teachers_df_list <- lapply(teachers_df_list,
  function(df) setNames(df, paste0("col_", 1:18)))
new_teachers_agg_list <- lapply(new_teachers_df_list,
  function(df) aggregate(col_1 ~ col_2, df, sum))
new_teachers_model_list <- lapply(new_teachers_df_list,
  function(df) summary(lm(col_1 ~ col_2, df)))
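The conversion the question originally asked about follows the same lapply pattern. The sketch below is only an illustration: the helper `make_tbl_like` is hypothetical and fakes the "tbl_df" "tbl" "data.frame" class vector from the question so that the example runs without the tibble package.

```r
# Hypothetical helper: mimic the class vector shown in the question.
make_tbl_like <- function(df) {
  class(df) <- c("tbl_df", "tbl", "data.frame")
  df
}

teachers_df_list <- list(
  alexandre = make_tbl_like(data.frame(a = 1:2, b = 3:4)),
  adrian    = make_tbl_like(data.frame(a = 5:6, b = 7:8))
)

# One call converts every element; the result keeps the list names.
teachers_df_list <- lapply(teachers_df_list, as.data.frame)

# Inspect the first class of each element.
sapply(teachers_df_list, function(df) class(df)[1])
```

This replaces the 31 separate as.data.frame assignments with a single line once the objects are in a list.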
Even compile all data frames into one master version using do.call + rbind:
# ADD A TEACHER INDICATOR COLUMN
new_teachers_df_list <- Map(function(df, n) transform(df, teacher=n),
new_teachers_df_list, names(new_teachers_df_list))
# BUILD SINGLE DF
teachers_df <- do.call(rbind, new_teachers_df_list)
Even split master version back into individual groupings if needed later on:
# SPLIT BACK TO LIST OF DFs
teachers_df_list <- split(teachers_df, teachers_df$teacher)
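As a small worked example of the stack-and-split round trip, here is a sketch with two toy data frames (the `score` column and its values are made up for illustration):

```r
teachers_df_list <- list(
  alexandre = data.frame(score = c(1, 2)),
  adrian    = data.frame(score = c(3, 4, 5))
)

# Tag each data frame with the name of its list element.
tagged <- Map(function(df, n) transform(df, teacher = n),
              teachers_df_list, names(teachers_df_list))

# Stack into one data frame, then split back by the indicator column.
teachers_df <- do.call(rbind, tagged)
round_trip  <- split(teachers_df, teachers_df$teacher)

nrow(teachers_df)
names(round_trip)
```

Note that split orders the resulting list by the sorted values of the indicator column, so the element order may differ from the original list.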

Maybe you could use a list to store all your data frames. It seems to work, but afterwards you need to extract the data frames from the list again (or keep working on the list).
df_1 <- data.frame(c(0, 1, 0), c(3, 4, 5))
df_2 <- data.frame(c(0, 1, 0), c(3, 4, 5))
l <- list(df_1, df_2)
lapply(l, function(x){
colnames(x) <- 1:2
return(x)
})
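To see why the attempt in the question fails while the list version works, the following sketch may help. It reuses the toy data frames from this answer (with column names added) and shows that c() flattens data frames into a plain list of columns, which is where the "less than two dimensions" error comes from.

```r
df_1 <- data.frame(a = c(0, 1, 0), b = c(3, 4, 5))
df_2 <- data.frame(a = c(0, 1, 0), b = c(3, 4, 5))

# c() flattens the data frames into one list of four column vectors ...
flattened <- c(df_1, df_2)
length(flattened)
is.data.frame(flattened[[1]])

# ... whereas list() keeps each data frame intact.
kept <- list(df_1, df_2)
length(kept)

# The function must return the modified copy; an assignment alone is lost
# when the function exits.
renamed <- lapply(kept, function(x) {
  colnames(x) <- c("first", "second")
  x
})
colnames(renamed[[1]])
```

The second point also applies to the cnames function in the question: without returning x, the renamed data frame never leaves the function.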

Related

Running a function that renames dataframes per intermediate step, for a list of dataframes

I have gotten instructions to do an analysis in R with the vegan package (concerning DCA's).
The instructions on a single dataframe are pretty straightforward, but I would like to apply the analysis on a set of dataframes.
I know this can be done with a for-loop or lapply or sapply, but I have trouble dealing with the fact that each step of the analysis a new extension is added to the name of the dataframe.
An example below
Say I have a dataframe DF, then it goes as follows:
DF.t1 <- decostand(DF, "total")
DF.t2 <- decostand(DF.t1, "max")
DF.t2.dca <- decorana(DF.t2)
DF.t2.dca.DW <- decorana(DF.t2, iweigh=1)
names(DF.t2.dca)
summary(DF.t2.dca)
DF.t2.dca.taxonscores <- scores(DF.t2.dca, display=c("species"), choices=c(1,2))
DF.t2.dca.taxonscores <- DF.t2.dca$cproj[ ,1:2]
DF.t2.dca.samplescores <- scores(DF.t2.dca, display=c("sites"), choices=1)
What I want to achieve is to run several dataframes through this analysis without writing it all out separately.
Let's say I have a set of dataframes called "DF_1", "DF_2" & "DF_3" which I want to do this analysis on.
I probably need to put the dataframes in a list, and get all the steps in a for-loop or one of the apply methods.
But how do I approach the problem with the extensions added (.ra, .t1, .t2, .t2.dca, .t2.dca.DW etc.) to the dataframe names?
Edit: I need to retain the original dataframes after the analysis, in order to do follow-up analysis on them.
Unless you have a very small number of data frames, I would not advise defining roughly eight new objects per data frame in the global environment, as this can become very messy.
One approach you might consider is creating a nested list where the first level is the data frame and the second level are the modified data frames.
library(vegan) # provides decostand(), decorana() and scores()
# some example data sets
DF1 <- mtcars
DF2 <- mtcars*2
DF3 <- mtcars*3
all_dfs <- list(DF1 = DF1, DF2 = DF2, DF3 =DF3)
some_stuff <- function(df) {
DF.t1 <- decostand(df, "total")
DF.t2 <- decostand(DF.t1, "max")
DF.t2.dca <- decorana(DF.t2)
DF.t2.dca.DW <- decorana(DF.t2, iweigh=1)
names(DF.t2.dca)
summary(DF.t2.dca)
DF.t2.dca.taxonscores <- scores(DF.t2.dca, display=c("species"), choices=c(1,2))
DF.t2.dca.taxonscores <- DF.t2.dca$cproj[ ,1:2]
DF.t2.dca.samplescores <- scores(DF.t2.dca, display=c("sites"), choices=1)
return(list(DF.t1 = DF.t1, DF.t2 = DF.t2,
DF.t2.dca = DF.t2.dca,
DF.t2.dca.DW = DF.t2.dca.DW,
DF.t2.dca.taxonscores = DF.t2.dca.taxonscores,
DF.t2.dca.samplescores = DF.t2.dca.samplescores
))
}
nested_list <- lapply(all_dfs, some_stuff)
# To obtain any of the objects for a specific data.frame you could, for example, run
nested_list$DF1$DF.t2.dca.DW
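The nested-list pattern can also be tried without vegan installed. In the sketch below, `run_steps` and its two base R steps are hypothetical stand-ins for the decostand/decorana pipeline; the point is only how the intermediate results are stored and retrieved, and that the original data frames stay untouched in `all_dfs`.

```r
# Stand-in pipeline: base R steps replace the vegan calls so the sketch runs anywhere.
run_steps <- function(df) {
  scaled  <- scale(df)       # hypothetical step 1
  centers <- colMeans(df)    # hypothetical step 2
  list(scaled = scaled, centers = centers)
}

all_dfs <- list(DF1 = mtcars, DF2 = mtcars * 2)
nested_list <- lapply(all_dfs, run_steps)

# Pull the same intermediate result out of every data frame's sub-list.
all_centers <- lapply(nested_list, `[[`, "centers")

names(all_centers)
all_centers$DF2[["mpg"]] / all_centers$DF1[["mpg"]]
```

Extracting with lapply and `[[` gives one named list per intermediate step, which is convenient for follow-up analysis across all data frames.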

Saving data frames to values in a list

I have a list of titles that I would like to iterate over, creating and saving a data frame for each. I have tried using the paste() function (as seen below) but that does not work for me. Any advice would be greatly appreciated.
samples <- list("A","B","C")
for (i in samples){
paste(i,sumT,sep="_") <- data.frame(col1=NA,col1=NA)
}
My desired output is three empty data frames named: A_sumT, B_sumT and C_sumT
Here's an answer with purrr.
samples <- list("A", "B", "C")
samples %>%
purrr::map(~ data.frame()) %>%
purrr::set_names(~ paste(samples, "sumT", sep="_"))
Consider creating a list of dataframes and avoid many separate objects flooding global environment as this example can extend to hundreds and not just three. Plus with this approach, you will maintain one container capable of running bulk operations across all dataframes.
By using sapply below on a character vector, you create a named list:
samples <- c("A","B","C") # OR unlist(list("A","B","C"))
df_list <- sapply(samples, function(x) data.frame(col1=NA,col2=NA), simplify=FALSE)
# RUN ANY DATAFRAME OPERATION
head(df_list$A)
tail(df_list$B)
summary(df_list$C)
# BULK OPERATIONS
stacked_df <- do.call(rbind, df_list) # stack rows
wide_df <- do.call(cbind, df_list) # bind columns side by side
merged_df <- Reduce(function(x,y) merge(x,y,by="col1"), df_list)
Or, if you need to rename the list elements:
# RENAME LIST
df_list <- setNames(df_list, paste0(samples, "_sumT"))
# RUN ANY DATAFRAME OPERATION
head(df_list$A_sumT)
tail(df_list$B_sumT)
summary(df_list$C_sumT)
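If separate objects named A_sumT, B_sumT and C_sumT are still required at the end, one possible route is list2env, which copies each element of a named list into an environment. This is a sketch; `target_env` is a hypothetical fresh environment used so that the example does not touch your workspace.

```r
samples <- c("A", "B", "C")
df_list <- sapply(samples, function(x) data.frame(col1 = NA, col2 = NA),
                  simplify = FALSE)
df_list <- setNames(df_list, paste0(samples, "_sumT"))

# Copy each list element into an environment under its list name.
# A fresh environment is used here; passing .GlobalEnv instead would create
# A_sumT, B_sumT and C_sumT in the workspace.
target_env <- new.env()
list2env(df_list, envir = target_env)

sort(ls(target_env))
```

Keeping the list as the working object and exporting only when needed preserves the bulk-operation benefits described above.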

Combine lapply, seq_along and ddply

I've been searching around this forum and trying to implement in my case what was said in previous answers from those questions. However, something in my code is missing.
I use lapply() with a function inside that runs ddply. This works nicely. However, I would like to identify every result by the name of the data frame it came from, and not by [[1]], [[2]]...
For this reason, I am trying to implement the seq_along argument, but unsuccessfully. Let's see what I have:
I created a list to group 16 different data frames (with the same structure) in one object, called melt_noNA_noDC_regression:
melt_noNA_noDC_regression <-
list(I1U_melt_noNA_noDC_regression, I1L_melt_noNA_noDC_regression,
I1U_melt_noNA_noDC_regression, I1L_melt_noNA_noDC_regression,
CU_melt_noNA_noDC_regression, CL_melt_noNA_noDC_regression,
P3U_melt_noNA_noDC_regression, P3L_melt_noNA_noDC_regression,
P4U_melt_noNA_noDC_regression, P4L_melt_noNA_noDC_regression,
M1U_melt_noNA_noDC_regression, M1L_melt_noNA_noDC_regression,
M2U_melt_noNA_noDC_regression, M2L_melt_noNA_noDC_regression,
M3U_melt_noNA_noDC_regression, M3L_melt_noNA_noDC_regression)
Later, I run this lapply() line successfully.
lapply(melt_noNA_noDC_regression, function(x) ddply(x, .(Species), model_regression))
As I have 16 different data frames, I would like to identify them in the results of the lapply function. I have tried several combinations to include seq_along within the lapply code, as in this case:
lapply(melt_noNA_noDC_regression, function(x) {
ddply(x, .(Species), model_regression)
seq_along(x), function(i) paste(names(x)[[i]], x[[i]])
})
However, I've been getting errors constantly, and it is a bit frustrating. It is maybe very easy to solve, but I am stuck.
Any idea to solve this?
Consider using eapply (lapply's lesser known sibling) or mget to retrieve a named list of your dataframes. Then run them through lapply for the ddply call to return the same named dataframe list with new corresponding values.
df_list <- eapply(.GlobalEnv, function(d) d)[c("I1U_melt_noNA_noDC_regression",
"I1L_melt_noNA_noDC_regression",
"I1U_melt_noNA_noDC_regression",
...)]
df_list <- mget(c("I1U_melt_noNA_noDC_regression",
"I1L_melt_noNA_noDC_regression",
"I1U_melt_noNA_noDC_regression",
...))
# GENERALIZED FOR ANY DF IN GLOBAL ENV
df_list <- Filter(is.data.frame, eapply(.GlobalEnv, function(d) d))
new_list <- lapply(df_list, function(x) ddply(x, .(Species), model_regression))
And because eapply (being environment apply) is part of the apply family and can iterate through objects, you can bypass lapply. But you must account for non-dataframes and then filter out by df names. Hence, tryCatch is used and [] indexing:
new_list2 <- eapply(.GlobalEnv, function(x)
tryCatch(ddply(x, .(Species), model_regression),
warning = function(w) return(NA),
error = function(e) return(NA)
)
)[c("I1U_melt_noNA_noDC_regression",
"I1L_melt_noNA_noDC_regression",
"I1U_melt_noNA_noDC_regression",
...)]
all.equal(new_list, new_list2)
# [1] TRUE
With all that said, ideally your data processing would use a named list of dataframes from the start, rather than creating 16 separate, similarly structured objects that flood your global environment. Therefore, consider adjusting the code that produces your regression objects, replacing the following:
I1U_melt_noNA_noDC_regression <- ...
with this:
df_list <- list()
df_list[["I1U_melt_noNA_noDC_regression"]] <- ...
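A short sketch of that pattern end to end, using toy data. The element names `upper` and `lower` are hypothetical, and aggregate is used as a base R stand-in for the ddply call so the example runs without plyr.

```r
# Build the list with names from the start; [[<- stores a whole data frame per slot.
df_list <- list()
df_list[["upper"]] <- data.frame(Species = c("a", "a", "b"), value = c(1, 2, 3))
df_list[["lower"]] <- data.frame(Species = c("a", "b", "b"), value = c(4, 5, 6))

# aggregate() is a base R stand-in for the ddply() call in the question.
results <- lapply(df_list, function(x) aggregate(value ~ Species, data = x, FUN = mean))

# lapply() carries the names through, so results print as $upper, $lower
# rather than [[1]], [[2]].
names(results)
results$upper
```

No seq_along is needed: as long as the input list is named, lapply returns a list with the same names.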

R: Address objects deep inside lists with filter commands inside functions/loops (ExtremeBounds package)

I am using the ExtremeBounds package, which returns a multi-level list with (amongst others) dataframes at the lowest level. I run this package over several specifications and I would like to collect some columns of selected dataframes in these results. These should be collected by specification (spec1 and spec2 in the example below) and arranged in a list of dataframes. This list of dataframes can then be used for all kinds of things, for example to export the results of different specifications into different Excel sheets.
Here is some code which creates the problematic object (just run this code blindly, my problem only concerns how to deal with the kind of list it creates: eba_results):
library("ExtremeBounds")
Data <- data.frame(var1=rbinom(30,1,0.2),var2=rbinom(30,2,0.2),
var3=rnorm(30),var4=rnorm(30),var5=rnorm(30))
spec1 <- list(y=c("var1"),
freevars=c("var2"),
doubtvars=c("var3","var4"))
spec2 <- list(y=c("var1"),
freevars=c("var2"),
doubtvars=c("var3","var4","var5"))
indicators <- c("spec1","spec2")
ebaFun <- function(x){
eba <- eba(data=Data, y=x$y,
free=x$freevars,
doubtful=x$doubtvars,
reg.fun=glm, k=1, vif=7, draws=50, weights = "lri", family = binomial(logit))}
eba_results <- lapply(mget(indicators),ebaFun) #eba_results is the object in question
Manually I know how to access each element, for example:
eba_results$spec1$bounds$type #look at str(eba_results) to see the different levels
So "bounds" is a dataframe with identical column names for both spec1 and spec2. I would like to collect the following 5 columns from "bounds":
type, cdf.mu.normal, cdf.above.mu.normal, cdf.mu.generic, cdf.above.mu.generic
into one dataframe per spec. Manually this is simple but ugly:
collectedManually <-list(
manual_spec1 = data.frame(
type=eba_results$spec1$bounds$type,
cdf.mu.normal=eba_results$spec1$bounds$cdf.mu.normal,
cdf.above.mu.normal=eba_results$spec1$bounds$cdf.above.mu.normal,
cdf.mu.generic=eba_results$spec1$bounds$cdf.mu.generic,
cdf.above.mu.generic=eba_results$spec1$bounds$cdf.above.mu.generic),
manual_spec2= data.frame(
type=eba_results$spec2$bounds$type,
cdf.mu.normal=eba_results$spec2$bounds$cdf.mu.normal,
cdf.above.mu.normal=eba_results$spec2$bounds$cdf.above.mu.normal,
cdf.mu.generic=eba_results$spec2$bounds$cdf.mu.generic,
cdf.above.mu.generic=eba_results$spec2$bounds$cdf.above.mu.generic))
But I have more than 2 specifications and I think this should be possible with lapply functions in a prettier way. Any help would be appreciated!
p.s.: A generic example to which hrbrmstr's answer applies but which turned out to be too simplistic:
exampleList = list(a=list(aa=data.frame(A=rnorm(10),B=rnorm(10)),bb=data.frame(A=rnorm(10),B=rnorm(10))),
b=list(aa=data.frame(A=rnorm(10),B=rnorm(10)),bb=data.frame(A=rnorm(10),B=rnorm(10))))
and I want to have an object which collects, for example, all the A and B vectors into two data frames (each with its respective A and B) which are then a list of data frames. Manually this would look like:
dfa <- data.frame(A=exampleList$a$aa$A,B=exampleList$a$aa$B)
dfb <- data.frame(A=exampleList$a$aa$A,B=exampleList$a$aa$B)
collectedResults <- list(a=dfa, b=dfb)
There's probably a less brute-force way to do this.
If you want lists of individual columns this is one way:
get_col <- function(my_list, col_name) {
unlist(lapply(my_list, function(x) {
lapply(x, function(y) { y[, col_name] })
}), recursive=FALSE)
}
get_col(exampleList, "A")
get_col(exampleList, "B")
If you want a consolidated data.frame of indicator columns this is one way:
collect_indicators <- function(my_list, indicators) {
lapply(my_list, function(x) {
do.call(rbind, c(lapply(x, function(y) { y[, indicators] }), make.row.names=FALSE))
})[[1]]
}
collect_indicators(exampleList, c("A", "B"))
If you just want to bring the individual data.frames up a level to make it easier to iterate over to write to a file:
unlist(exampleList, recursive=FALSE)
Much assumption about the true output format is being made (the question was a bit vague).
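For reference, this is what the flattening step produces on a small deterministic version of exampleList (the integer values are placeholders for the rnorm draws in the question):

```r
exampleList <- list(
  a = list(aa = data.frame(A = 1:2, B = 3:4), bb = data.frame(A = 5:6, B = 7:8)),
  b = list(aa = data.frame(A = 9:10, B = 11:12), bb = data.frame(A = 13:14, B = 15:16))
)

# One level of nesting is removed; outer and inner names are joined with a dot.
flat <- unlist(exampleList, recursive = FALSE)
names(flat)

# Each element is still a data frame, so it can be written out in a loop.
all(vapply(flat, is.data.frame, logical(1)))
```

The dotted names (a.aa, a.bb, ...) can serve directly as file or sheet names when exporting.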
There is a brute force way which works but is dependent on several named objects:
collectEBA <- function(x){
df <- paste0("eba_results$",x,"$bounds")
df <- eval(parse(text=df))[,c("type",
"cdf.mu.normal","cdf.above.mu.normal",
"cdf.mu.generic","cdf.above.mu.generic")]
df[is.na(df)] <- "NA"
df
}
eba_export <- lapply(indicators,collectEBA)
names(eba_export) <- indicators
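The same collection can probably be done without eval(parse()) by applying a function over the result list itself. The sketch below uses a mock `eba_results` (a hypothetical two-spec list with a shortened set of columns) so it runs without the ExtremeBounds package; it assumes, as in the question, that each element has a `bounds` data frame.

```r
# Mock of the nested result; only the $bounds data frame per spec matters here.
eba_results <- list(
  spec1 = list(bounds = data.frame(type = c("free", "doubtful"),
                                   cdf.mu.normal = c(0.1, 0.2),
                                   other = c(1, 2))),
  spec2 = list(bounds = data.frame(type = c("free", "doubtful"),
                                   cdf.mu.normal = c(0.3, 0.4),
                                   other = c(3, 4)))
)
keep_cols <- c("type", "cdf.mu.normal")  # shortened column set for the sketch

# lapply() walks the top level and keeps the spec names on the result.
eba_export <- lapply(eba_results, function(res) res$bounds[, keep_cols])

names(eba_export)
colnames(eba_export$spec2)
```

Because lapply preserves names, the separate names(eba_export) <- indicators step is not needed in this variant.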

R: merging matrices (not data.frames)

merge is a very nice function: it merges matrices and data.frames, and returns a data.frame.
Given rather big character matrices, is there another good way to merge, without data.frame conversion?
Comment 1:
A small function to merge a named vector with a matrix or data.frame. Elements of the vector can link to multiple entries in the matrix:
expand <- function(v,m,by.m,v.name='v',...) {
df <- do.call(rbind,lapply(names(v),function(x) {
pos <- which(m[,by.m] %in% v[x])
cbind(x,m[pos,],...)
}))
colnames(df)[1] <- v.name
df
}
Example:
v <- rep(letters,each=3)[seq_along(letters)]
names(v) <- letters
m <- data.frame(a=unique(v),b=seq_along(unique(v)),stringsAsFactors=FALSE)
expand(v,m,'a')
You can use a combination of match and cbind to do the equivalent of merge without conversion to data frame, a simple example:
st1 <- state.x77[ sample(1:50), ]
st2 <- as.matrix( USArrests )[ sample(1:50), ]
tmp1 <- match(rownames(st1), rownames(st2) )
st3 <- cbind( st1, st2[tmp1,] )
head(st3)
Keeping track of which columns you want, and merging with many-to-one relationships or with rows missing from one side, requires a bit more thought but is still possible.
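As a sketch of the missing-rows case, the example below joins two small character matrices by row name. The matrices `m1` and `m2` are made up for illustration; the two results roughly correspond to merge with all.x = TRUE and to the default inner merge.

```r
# Two character matrices keyed by row name; "d" has no partner in m2.
m1 <- matrix(c("x1", "x2", "x3"), ncol = 1,
             dimnames = list(c("a", "b", "d"), "left"))
m2 <- matrix(c("y1", "y2", "y3"), ncol = 1,
             dimnames = list(c("b", "a", "c"), "right"))

idx <- match(rownames(m1), rownames(m2))   # NA where m1's key is missing in m2

# Keep all rows of m1 (like all.x = TRUE); unmatched rows get NA.
left_join <- cbind(m1, m2[idx, , drop = FALSE])

# Keep only matched rows (like the default inner merge).
found <- !is.na(idx)
inner_join <- cbind(m1[found, , drop = FALSE], m2[idx[found], , drop = FALSE])

left_join
inner_join
```

Both results stay character matrices; drop = FALSE keeps the single-column pieces from collapsing to vectors.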
No, not without either (a) overwriting the merge function or (b) creating a new merge.matrix() S3 function (this would be the right approach to the problem).
You can see in the merge help:
Value
A data frame.
Also, the merge.default function:
> merge.default
function (x, y, ...)
merge(as.data.frame(x), as.data.frame(y), ...)
There is now a merge.Matrix function in the Matrix.utils package. This works on combinations of matrices as well as capital M Matrices, data.frames, etc.
The match solution is nice, but as someone pointed out does not work on m:n relationships. It also does not implement the other features of merge, including all.x, all.y, etc.
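For m:n relationships, one possible base R workaround is to enumerate all matching index pairs and bind the corresponding rows. This is only a sketch on made-up matrices `x` and `y`: outer() builds a full nrow(x) by nrow(y) comparison matrix, so it is not suited to very large inputs.

```r
# Key "k1" appears twice in x, key "k2" appears twice in y.
x <- matrix(c("k1", "k1", "k2", "a", "b", "c"), ncol = 2,
            dimnames = list(NULL, c("key", "xval")))
y <- matrix(c("k1", "k2", "k2", "p", "q", "r"), ncol = 2,
            dimnames = list(NULL, c("key", "yval")))

# All (row of x, row of y) pairs whose keys are equal.
pairs <- which(outer(x[, "key"], y[, "key"], "=="), arr.ind = TRUE)

# Bind the matching rows; the result is still a character matrix.
merged <- cbind(x[pairs[, "row"], , drop = FALSE],
                yval = y[pairs[, "col"], "yval"])
merged
```

Each x row is repeated once per matching y row, which is the behaviour merge gives for many-to-many keys.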
