Performing column select over multiple dataframes - r

I have looked around a lot for an answer; the ones I found get close, but no cigar. I am trying to select the same columns across multiple data frames. I can do this and return a list, but I want to preserve the data frames as separate objects in the global environment, for ease of use and visibility in RStudio. For example, for one data frame I select columns by name like so:
E07 <- E07[,c("Block","Name","F635.Mean","F532.Mean","B635.Mean","B532")]
I have x amount of data frames listed in dflist so I have written this function:
columnselect<-function(df){df[,c("Block","Name","F635.Mean","F532.Mean","B635.Mean","B532")];df}
I then wish to apply this over the dflist as so:
lapply(X=dflist,FUN=columnselect)
This runs the function over dflist and returns a list, but the data frames themselves remain unchanged. How do I apply the function over multiple data frames without getting them back in a list?
Many thanks
M

Your function returns the data frames unchanged because df is the last expression evaluated in your function, and R returns the value of the last expression. Instead of:
columnselect<-function(df){
df[,c("Block","Name","F635.Mean","F532.Mean","B635.Mean","B532")]
df}
It should be:
columnselect<-function(df){
df[,c("Block","Name","F635.Mean","F532.Mean","B635.Mean","B532")]
}
The trailing df in your function simply returned the full df that you passed in, discarding the subset computed on the line before.
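The difference is easy to check on a toy data frame. This is only a minimal sketch; toy, select_wrong, and select_right are made-up names for illustration and are not part of the question's data.

```r
# Hypothetical toy data frame, not the asker's data.
toy <- data.frame(a = 1:2, b = 3:4)

# The subset is computed and discarded; the trailing `df` is what gets returned.
select_wrong <- function(df) {
  df[, "a", drop = FALSE]
  df
}

# The subset is the last expression, so it is the return value.
select_right <- function(df) {
  df[, "a", drop = FALSE]
}

ncol(select_wrong(toy))  # 2: both columns are still there
ncol(select_right(toy))  # 1: only column "a" remains
```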
As for your second question, having the data.frames in the global environment rather than in the list (which is bad practice, just so you know; it is always better to keep them in the list): you need the list2env function, i.e.:
mylist <- lapply(X=dflist,FUN=columnselect)
list2env(mylist, envir = globalenv())
This updates the data.frames in the global environment, provided mylist is a named list whose names match the existing objects (lapply keeps the names of dflist).
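Put together, the whole round trip might look like the sketch below. It assumes dflist is a named list, which list2env needs in order to know which object each element belongs to. The toy data frames and the shortened column vector keep are invented for illustration; only the names E07, dflist, and columnselect come from the question.

```r
# Hypothetical stand-ins for the asker's data frames.
E07 <- data.frame(Block = 1:2, Name = c("a", "b"), Extra = c(0, 0))
E08 <- data.frame(Block = 3:4, Name = c("c", "d"), Extra = c(1, 1))

# list2env() assigns each element under its name, so the list must be named.
dflist <- list(E07 = E07, E08 = E08)

keep <- c("Block", "Name")  # assumed column names for this toy example
columnselect <- function(df) df[, keep, drop = FALSE]

# lapply() preserves the names, so each result overwrites the object it came from.
invisible(list2env(lapply(dflist, columnselect), envir = globalenv()))
```

After this runs at the top level, E07 and E08 in the global environment hold only the selected columns.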

Related

Lapply on a list of a list

This is a question about R code.
I have separated a big data file, "All_data.csv", into smaller data frames, one per individual per year.
So the list looks like this:
All_individuals is a list of 25 individuals. If you take the first element of that list, you get:
Individual 1: dataframe_year1, dataframe_year2.
If you take the second element, you get for instance:
Individual 2: dataframe_year1, dataframe_year2, dataframe_year3.
etc., so the lists within the list differ in length.
Now I want to run an (analysis) function on the data frames; I do not need to store the output in the list again per se.
I solved it by calling lapply on the list All_data with a function I defined myself, which in turn calls lapply again with my analysis function. But I was wondering if there is another way, because this seems a bit inefficient.
split <- function(All_data) {
  # function that splits files by date and individual
  # returns list of individuals and within that list is another list of dataframes. Called All_individuals
}
Make_analysis <- function(All_individuals) {
  Listfiles <- split(All_individuals)
  HRE <- lapply(Listfiles, Doall)
}
Analysis <- function(files) {
  ...
}
function calls:
lapply (All_data, Make_analysis)
Could anyone help?
Also, is this the best way to go if I want to parallelize the analysis with rslurm to run it on an HPC cluster? Then I could replace lapply with slurm_map, right?
My function works as it is, but it seems very inefficient. I would like some tips on how to make the code more efficient, and on how to parallelize it with rslurm.
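One possible way to avoid the nested lapply, sketched here under assumptions, is to flatten the list of lists by one level with unlist(recursive = FALSE) and then make a single lapply call. The nested list and the Analysis body below are placeholders, since the real data and analysis are not shown in the question.

```r
# Hypothetical nested list: individuals -> years -> data frames.
All_individuals <- list(
  ind1 = list(year1 = data.frame(v = 1:3), year2 = data.frame(v = 4:6)),
  ind2 = list(year1 = data.frame(v = 7:9))
)

# Placeholder analysis; the real one is not shown in the question.
Analysis <- function(df) mean(df$v)

# Drop one level of nesting: the result is a flat list of data frames
# named "ind1.year1", "ind1.year2", "ind2.year1".
flat <- unlist(All_individuals, recursive = FALSE)

# One pass over every data frame, regardless of how many years each individual has.
results <- lapply(flat, Analysis)
```

A flat list also leaves just one lapply call to swap out if the loop is later replaced by a parallel map.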

How to modify a dataframe that can't be called directly?

I'm creating a function that creates several data frames automatically. How can I refer to those data frames in order to modify them?
For example, say I created a vector in which each item is meant to become a data frame, like so:
assign(paste0("d","f"),c("tree","fox","river"))
Then I take an item from that vector and use it to name a data frame:
assign(paste(get(paste0("d","f"))[1]),as.data.frame(c(1,2,3)))
so that now if I do:
get(paste(get(paste0("d","f"))[1]))
it returns a data frame with 1,2,3
Here's my problem, I want to be able to modify those items so something like
get(paste(get(paste0("d","f"))[1]))[1] <- 4
#So that now if i do
get(paste(get(paste0("d","f"))[1]))
it returns a data frame with 4,2,3
It is better not to create multiple objects in the global environment. If they have already been created, load them into a list and do all the changes, transformations, mutates, etc. in the list. Reading and writing within a list is easier than looking for objects floating around in the global environment. For example, assuming objects named df1, df2, and df3:
lapply(mget(paste0("df", 1:3)), function(x) {x[[1]] <- 4; x})
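Applied to the question's own example, the list-based approach could look like the sketch below. It skips assign/get entirely and keeps the data frames in a named list; item_names, frames, and the column name value are illustrative names, not something from the question.

```r
# The names that were meant to become separate data frames.
item_names <- c("tree", "fox", "river")

# One data frame per name, stored in a named list instead of the global environment.
frames <- lapply(item_names, function(nm) data.frame(value = c(1, 2, 3)))
names(frames) <- item_names

# Modify an element by name; no get()/assign() round trip is needed.
frames[[item_names[1]]]$value[1] <- 4

frames$tree$value  # 4 2 3
```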

Converting a list of data frames into individual data frames in R [duplicate]

I have been searching high and low for what I think is an easy solution.
I have a large data frame that I split by factors.
eqRegions <- split(eqDataAll, eqDataAll$SeismicRegion)
This now creates a list object of the data frames by region; there are 8 in total. I would like to loop through the list to make individual data frames using another name.
I can execute the following to convert the list items to individual data frames, but I am thinking that there is a loop mechanism that is fast if I have many factors.
testRegion1 <- eqRegions[[1]]
testRegion3 <- eqRegions[[3]]
I can manually perform the above and it handles it nicely, but if I have many regions it's not efficient. What I would like to do is the equivalent of the following:
for (i in 1:length(eqRegions)) {
region[i] <- as.data.frame(eqRegions[[i]])
}
I think the key is to define region before the loop, but it keeps overwriting itself instead of incrementing. Many thanks.
Try
list2env(eqRegions,envir=.GlobalEnv)
This should work. The name of the data.frames created will be equal to the names within eqDataAll$SeismicRegion. Anyways, this practice of populating individual data.frames is not recommended. The more I work with R, the more I love/use list.
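If the objects should get names other than the raw factor levels (the question mentions testRegion1, testRegion3), one option is to rename the list elements before calling list2env. This is a sketch on made-up data; the Magnitude column and its values are invented for illustration.

```r
# Hypothetical stand-in for the asker's data.
eqDataAll <- data.frame(
  SeismicRegion = c(1, 1, 3, 3),
  Magnitude = c(4.5, 5.1, 6.0, 4.8)
)

eqRegions <- split(eqDataAll, eqDataAll$SeismicRegion)

# split() names the elements after the factor levels ("1", "3");
# prefix them so the resulting objects are called testRegion1, testRegion3.
names(eqRegions) <- paste0("testRegion", names(eqRegions))

invisible(list2env(eqRegions, envir = .GlobalEnv))
```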
lapply(names(eqRegions), function(x) assign(x, eqRegions[[x]], envir = .GlobalEnv))
edit: Use list2env solution posted. Was not aware of list2env function.
attach(eqRegions) should be enough. But I recommend working with them in list form using lapply. I guarantee it will result in simpler code.
list2env returns data frames to the global environment whose names are the names in the list. An alternative, if you want to have the same name for the data frames but identified by i from a loop:
for (i in 1:length(eqRegions)) {
assign(paste0("eqRegions", i), as.data.frame(eqRegions[[i]]))
}
This can be slow if the list gets too long.
As an alternative, a "best practice" when splitting data like this is to keep the data.frames within a list, as provided by split. To process it, you use either one of sapply or lapply (many factors) and capture the output back in a list. For instance:
eqRegionsProcessed <- lapply(eqRegions, function(df) {
## do something meaningful here
})
This obviously only works if you are doing the same thing to each data.frame.
If you really must break them out and deal with each data.frame uniquely, then @MatthewPlourde's and @MaratTalipov's answers will work.
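As a concrete instance of the keep-it-in-a-list approach, the sketch below computes one summary value per region with sapply. The data and the Magnitude column are hypothetical; any per-data.frame function could take the place of max.

```r
# Hypothetical stand-in for the asker's data.
eqDataAll <- data.frame(
  SeismicRegion = c("A", "A", "B"),
  Magnitude = c(4, 6, 5)
)

eqRegions <- split(eqDataAll, eqDataAll$SeismicRegion)

# Same operation on every region; the result is a named vector, one value per region.
maxMagnitude <- sapply(eqRegions, function(df) max(df$Magnitude))

maxMagnitude[["A"]]  # 6
```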

Getting the index of an iterator in R (in parallel with foreach)

I'm using the foreach function to iterate over columns of a data.frame. At each iteration, I would like to get the index of the iterator (i.e. the index or the name of the column considered) and the column itself.
However, the following code, which seems fine at first glance, doesn't work because i has no names or colnames attribute.
foreach(i=iter(base[1:N],by='col')) %dopar% c(colnames(i),i)
Now, if you wonder why I'm not iterating over indexes, the reason is that I'm using the %dopar% tool and I don't want to send the whole base to all workers, but only the columns each of them require.
Question: How can I get the index of an iterator?
Thank you
I would just specify a second iteration variable in the foreach loop that acts as a counter:
library(foreach)
library(iterators)
df <- data.frame(a=1:10, b=rnorm(10), c=runif(10))
r <- foreach(d=df, i=icount()) %do% {
list(d=d, i=i)
}
The "icount" function from the iterators package will return an unbounded counting iterator if no arguments are used, so this example works regardless of the number of columns in the data frame.
You could also include the column name as a third iteration variable:
r <- foreach(d=df, i=icount(), nm=colnames(df)) %do% {
list(d=d, i=i, nm=nm)
}
Here are a couple of possibilities:
Modify the iter function (or write your own) so that instead of sending just the value of the column, it also includes the name or other information.
You could iterate over the indexes, but use a shared memory tool (such as the Rdsm package) so that each process only needs to grab the part of the data frame it needs rather than distributing the entire data frame.
You could convert your base data frame into a list where each element contains the corresponding column of base along with the column name, then iterate over that list (so the entire element is sent, but not the other elements).
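The last possibility, bundling each column with its name, can be sketched in base R as follows. base_df is a hypothetical stand-in for the asker's base data frame, and lapply is used here only to keep the example free of extra packages; the same list could presumably be iterated with foreach instead, which is not tested here.

```r
# Hypothetical stand-in for the asker's `base` data frame.
base_df <- data.frame(a = 1:3, b = c(0.5, 1.5, 2.5))

# Bundle each column with its own name, so whatever receives an element
# gets the name along with the values and nothing else.
cols <- Map(function(nm, col) list(name = nm, values = col),
            names(base_df), base_df)

# Each iteration sees only one element: the column and its name.
res <- lapply(cols, function(el) paste(el$name, sum(el$values)))

res$a  # "a 6"
```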

Writing a loop to apply the operator 'data.frame' multiple times

I would like to write a loop to create multiple data frames from a set of already existing matrices.
I've imported and created these using the code:
temp<-list.files(pattern="\\.csv$")  # pattern is a regular expression, not a glob
ddives <- lapply(temp, read.csv)
so 'ddives' is my set of CSV files. I now want to create a data frame out of each of these using a looped version of this code:
d.dives1<- data.frame(ddives[1])
A quick primer on terminology before I answer your question:
The result of read.csv() is a data.frame.
The result of lapply() is a list.
Thus you now have a list of data frames.
If you can safely assume that the data frames in the list have the same structure (i.e. the same number of columns and the same classes), then you can use rbind() to combine your list of data frames into a single data.frame.
To make this easier, you can use do.call() as follows:
do.call(rbind, ddives)
do.call constructs a call from the function using the list elements as arguments. If they are named, they are passed as named arguments, otherwise in order (as always in R). In this case you apply rbind to all of the elements in your list, thus creating a single data.frame.
This is clearly untested, since I don't have your data. But, in general, do.call is a useful function for this type of operation.
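On toy data the call behaves as sketched below. The two data frames are invented and only stand in for the result of lapply(temp, read.csv); they share the same columns, which is the assumption rbind relies on.

```r
# Hypothetical stand-in for the list returned by lapply(temp, read.csv).
ddives <- list(
  data.frame(dive = c(1, 2), depth = c(10.5, 12.0)),
  data.frame(dive = c(3, 4), depth = c(8.2, 15.1))
)

# Equivalent to rbind(ddives[[1]], ddives[[2]]), for any number of elements.
all_dives <- do.call(rbind, ddives)

nrow(all_dives)  # 4
```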
As this is a follow up to the earlier question you posted, try this:
for (i in 1:length(ddives)) assign(temp[i], ddives[[i]])
If you really want a looped version of your code, this would be:
for (i in seq_along(ddives)) {
  # [[i]] extracts the data frame itself; [i] would give a one-element list
  assign(paste0("d.dives", i), ddives[[i]])
}
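A middle ground between separate objects and an anonymous list is to keep the list but name its elements after the files. The sketch below writes two throwaway CSV files to a temporary directory so that it runs on its own; the file names, the depth column, and csv_dir are all made up for illustration.

```r
# Create two small CSV files in a temporary directory (illustration only).
csv_dir <- file.path(tempdir(), "dives_demo")
dir.create(csv_dir, showWarnings = FALSE)
write.csv(data.frame(depth = c(1, 2)), file.path(csv_dir, "dive_a.csv"), row.names = FALSE)
write.csv(data.frame(depth = c(3, 4)), file.path(csv_dir, "dive_b.csv"), row.names = FALSE)

# Read every CSV file in the directory into a list of data frames.
temp <- list.files(csv_dir, pattern = "\\.csv$", full.names = TRUE)
ddives <- lapply(temp, read.csv)

# Name each element after its file, without the directory or extension.
names(ddives) <- tools::file_path_sans_ext(basename(temp))

ddives$dive_a  # the data frame read from dive_a.csv
```

Each data frame can then be reached by name, and the list can still be passed to lapply or do.call as a whole.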

Resources