R - How to extract single dataframes from list of lists?

I have a list of lists called step2 containing dataframes like this one:
And I want to extract every element in the list as a single dataframe, so that I have one dataframe called Likert_rank_Americas, Likert_rank_APAC, Likert_rank_Civil_law and so on for each dataframe contained in the list.
I tried with this:
list2env(step2,envir=.GlobalEnv)
But I only get the sub-lists contained in the main one as single objects, like so:
While what I want instead are the underlying dataframes as standalone objects, with the names as specified above. Is it possible to do this in a neat way without using list2env for each sub-list and then manually renaming each dataset?
I am quite new to R so apologies if the solution's easy.
Thanks in advance!

Without any data provided by you, what you want specifically is hard to guess, but at a minimum, to access a list of dataframes you need to follow this kind of logic...
a.1 <- data.frame(matrix(1:9, nrow=3))
a.2 <- data.frame(matrix(6:14, nrow=3))
data <- list(list(a.1,a.2),list("1","2"))
# NOTE: want only info from data[1] processed
library(purrr)
b <- map_dfr(data[1],rbind)
b
class(b)
dim(b)
# > b
# X1 X2 X3
# 1 1 4 7
# 2 2 5 8
# 3 3 6 9
# 4 6 9 12
# 5 7 10 13
# 6 8 11 14
# > class(b)
# [1] "data.frame"
# > dim(b)
# [1] 6 3
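If purrr is not available, the same row-binding step can be sketched in base R. This reuses a.1, a.2 and data from the example above and assumes, as there, that only the first sub-list holds data frames.

```r
# Same toy data as in the example above
a.1 <- data.frame(matrix(1:9, nrow = 3))
a.2 <- data.frame(matrix(6:14, nrow = 3))
data <- list(list(a.1, a.2), list("1", "2"))

# data[[1]] (double brackets) is the inner list of data frames;
# do.call() hands its elements to rbind() as separate arguments
b <- do.call(rbind, data[[1]])
dim(b)
```

Note the double brackets: data[1] would still be a list wrapped around the inner list, whereas data[[1]] is the inner list itself.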

I think this oneliner should work.
The function 'list2env' assigns all list components to the global environment.
The function 'lapply' applies the function 'list2env' to every element of the list 'step2'.
library(magrittr) # provides the %>% pipe used below
step2 %>% lapply(. %>% {list2env(., envir = .GlobalEnv)})
You can rename the dataframes before doing so of course.
names(step2$Geography)<- c(
'Likert_Rank_Americas',
'Likert_Rank_APAC',
'Likert_Rank_EMEA',
'Likert_Rank_Global')
names(step2$Legal_System)<- c(
'Likert_Rank_Civil_law',
'Likert_Rank_Common_law')

I created a quick reproducible dataset for testing with (this is good practice to include when asking for help)
dat <- list(list(data.frame(), data.frame(), data.frame()), list(data.frame(), data.frame(), data.frame()))
names(dat) <- c('list1' , 'list2')
names(dat$list1) <- c('A', 'B', 'C')
names(dat$list2) <- c('D', 'E', 'F')
Then I used
lapply(dat, list2env, .GlobalEnv)
Edit: To rename the dataframes, use the same structure as above where I named the sample dataframe, but use the names you want the end objects to have. If you want to automate this process, I would separate it into a different question, but I suspect you would be able to find another post with the answer already.
Something like (pseudo-code)...
name_vec <- paste0('naming_convention_', names(step2$Geography))
names(step2$Geography) <- name_vec
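The renaming and the extraction can also be combined into one pass. The following is only a sketch: the contents of step2 here are invented stand-ins (the real data was not posted), and the "Likert_rank_" prefix is taken from the names requested in the question.

```r
# Hypothetical stand-in for step2: a named list of named lists of data frames
step2 <- list(
  Geography    = list(Americas = data.frame(score = 1:2),
                      APAC     = data.frame(score = 3:4)),
  Legal_System = list(Civil_law = data.frame(score = 5:6))
)

# Concatenate the sub-lists into one flat list of data frames;
# unname() stops c() from prefixing the outer list names
flat <- do.call(c, unname(step2))

# Build the final object names from the inner names
names(flat) <- paste0("Likert_rank_", names(flat))

# Create one object per data frame in the global environment
invisible(list2env(flat, envir = .GlobalEnv))
names(flat)
```

If two sub-lists contain an element with the same inner name, the later one would overwrite the earlier one, so this assumes the inner names are unique.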

Related

How can I create a data frame with all existing variables (at once)?

It may sound trivial and the solution is probably quite simple but I can't figure it out.
I just want to combine all my variables in a data.frame. I wonder if there is a way to do that without choosing them one by one, but instead telling R that I want to use all of the already existing variables?
var1 <- c(1,2)
var2 <- c(3,4)
Instead of doing this
df <- data.frame(var1, var2)
I want to do something like this
df <- data.frame(-ALL_VARIABLES_IN_ENVIRONMENT-)
I've tried ls() (respectively objects()), also in combination with unquote() as well as names(), but this only gives me a vector of names (unquoted or not) and not the environment's objects.
var1 <- 1:3
var2 <- 1:3
data.frame(sapply(ls(), get))
# var1 var2
# 1 1 1
# 2 2 2
# 3 3 3
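A slightly more defensive variant is sketched below. It uses mget() to fetch the objects and keeps only vectors of the expected length, since a real workspace usually also holds functions and other objects that should not become columns. The environment ws and the object note are made up for the illustration; in an interactive session the global environment would play the role of ws.

```r
# A throwaway environment stands in for the workspace
ws <- new.env()
ws$var1 <- c(1, 2)
ws$var2 <- c(3, 4)
ws$note <- "not a column"   # hypothetical object that should be skipped

# mget() returns a named list of the objects behind the names
objs <- mget(ls(ws), envir = ws)

# Keep only atomic vectors with the expected number of rows
n_rows <- 2L
keep <- vapply(objs, function(x) is.atomic(x) && length(x) == n_rows, logical(1))

df <- as.data.frame(objs[keep])
df
```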

How to combine multiple data frames having similar variable names into one data frame?

I was trying to write code to combine multiple data frames (approximately 100), where each data frame is stored under a variable name output1, output2, ..., output100. I want to merge these data frames into a single data frame using the rbind function, but it is not practical because I have to write out each variable name again.
I need a suggestion to write all variable names in one go or in the form of a loop.
Problem: I am trying to write the code as rbind(output1, output2, output3,....,output100) which is extremely long and tedious.
You could use mget. Example:
Calling ls() gives you the object names in your workspace.
ls()
# [1] "n" "out.lst" "output.1" "output.2" "output.3" "something.else"
Then use mget to grab the data frames by pattern= and rbind them using do.call.
output.long <- do.call(rbind, mget(ls(pattern="output.")))
# x y z
# output.1.1 1 1 2
# output.1.2 5 5 4
# output.2.1 2 1 4
# output.2.2 5 4 1
# output.3.1 5 4 2
# output.3.2 2 2 3
Toy data:
set.seed(42)
n <- 3
out.lst <- setNames(replicate(n, data.frame(x=sample(1:5, 2),
                                            y=sample(1:5, 2),
                                            z=sample(1:5, 2)), simplify=FALSE),
                    paste0("output.", 1:n))
list2env(out.lst, envir=.GlobalEnv)
If you're willing to use the tidyverse (bind_rows() comes from dplyr), you can make output a list, then just write, say, combined <- bind_rows(output). That fits naturally with using lapply() to create the data frames in the first place.
[Untested code]
library(tidyverse)
output <- lapply(inputFiles, read.csv) # inputFiles: a character vector of CSV paths
combined <- bind_rows(output)
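One detail worth illustrating for the mget approach: the pattern passed to ls() is a regular expression, so an unanchored "output." also matches any name that merely contains "output" followed by another character. The sketch below anchors the pattern. The environment ws and the object output.notes are invented for the illustration, and the data frame names follow the output.1, output.2 convention of the toy data.

```r
# A throwaway environment stands in for the workspace
ws <- new.env()
ws$output.1 <- data.frame(x = 1:2, y = 3:4)
ws$output.2 <- data.frame(x = 5:6, y = 7:8)
ws$output.notes <- "ignore me"   # hypothetical object the pattern must skip

# Anchor the pattern: "output", a literal dot, then digits only
hits <- ls(ws, pattern = "^output\\.[0-9]+$")

# Fetch the matching data frames and stack them
combined <- do.call(rbind, mget(hits, envir = ws))
combined
```

Keep in mind that ls() returns names sorted as text, so with many data frames output.10 comes before output.2. If row order matters, sort hits by the numeric suffix first.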

apply function to column names using a list of data frames

I'm trying to apply a very complex function to a list of more than 50 Data Frames.
Let's use a very simple function to lowercase names and just 3 data frames for the sake of clarity, but my general approach is coded below
[EDITED NAMES]
# Data Sample. Every column name is different across Data Frames
quality <- data.frame(FIRST=c(1,5,3,3,2), SECOND=c(3,6,1,5,5))
thickness <- data.frame(THIRD=c(6,0,9,1,2), FOURTH=c(2,7,2,2,1))
distance <- data.frame(ONEMORE=c(0,0,1,5,1), ANOTHER=c(4,1,9,2,3))
# list of dataframes
dfs <- list(quality, thickness, distance)
# a very simple function (just for testing)
# actually a very complex one is used on real data
BetterNames <- function(x) {
names(x) <- tolower(names(x))
x
}
# apply function to data frame list
dfs <- lapply(dfs, BetterNames)
# I know the expected R behaviour is to modify a copy of the object,
# instead of the original object itself. So if you get the names
# you get the original version, not the needed one
names(quality)
[1] "FIRST" "SECOND"
Is there any way to apply a function in place, inside a loop or an "apply" call, to a large number of data frames?
The result should be that each original data frame is replaced by its modified version, for every data frame in the (big) list.
I know there's a trick using data.table, but I wonder whether this is possible in base R.
Expected Results:
names(quality)
[1] "first" "second"
[EDITED]
I was pointed to this answer: Rename columns in multiple dataframes, R
But it does not work for me. I can't use a vector of string names in my case because my new names are not a fixed list of strings. [EDITED DATA]
for(df in dfs) {
df.tmp <- get(df)
names(df.tmp) <- BetterNames(df)
assign(df, df.tmp)
}
> names(quality)
[1] "quality" NA
Thanks
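I'd use a simple yet effective parse & eval approach.
Let's use a for loop to compose a command that suits your needs (here dfs holds the data frame names as strings, e.g. c("quality", "thickness", "distance")):
for(df in dfs) {
# BetterNames returns the whole data frame, so assign it back to the object
command <- paste0(df," <- BetterNames(",df,")")
# print(command)
eval(parse(text=command))
}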
names(quality)
[1] "first" "second"
names(thickness)
[1] "third" "fourth"
names(distance)
[1] "onemore" "another"
You already have the best case scenario:
Let's add some names to your list:
names(dfs) <- c("quality", "thickness", "distance")
dfs <- lapply(dfs, BetterNames)
dfs[["quality"]]
# first second
# 1 1 3
# 2 5 6
# 3 3 1
# 4 3 5
# 5 2 5
This works great. And all your data is in a list, so if there are other things you want to do to all your data frames it is very easy.
If you are done treating these data frames similarly and really want them back in the global environment to work with individually, you can do it with
list2env(dfs, envir = .GlobalEnv)
I would recommend keeping them in a list though---in most cases if you have 50 data frames you are working with, in a list it is easy to use lapply or for loops to use them, but as individual objects you will be copy/pasting code and making mistakes.
I would consider even starting with 50 data frames in your workspace a problem - see How do I make a list of data frames? for recommendations on finding an upstream fix: going straight to a list from the start.
This is for sure not optimal and I hope something better comes up but here it goes:
BetterNames <- function(x, y) {
names(x) <- tolower(names(x))
assign(y, x, envir = .GlobalEnv)
}
dfs <- list(quality, thickness, distance)
dfs2 <- c("quality", "thickness", "distance")
mapply(BetterNames, dfs, dfs2)
> names(quality)
[1] "first" "second"
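The list-based answer and the "write back to the workspace" requirement can be combined without typing the data frames twice. The sketch below collects the objects by name with mget(), applies the function with lapply(), and copies the results back with list2env(). The shortened sample data and the helper variables df_names and here are introduced only for this illustration.

```r
# Shortened sample data in the style of the question
quality   <- data.frame(FIRST = c(1, 5, 3), SECOND = c(3, 6, 1))
thickness <- data.frame(THIRD = c(6, 0, 9), FOURTH = c(2, 7, 2))

BetterNames <- function(x) {
  names(x) <- tolower(names(x))
  x
}

df_names <- c("quality", "thickness")

# environment() is the current environment; at top level that is the global one
here <- environment()

# mget() builds a named list from the object names, lapply() keeps the names
dfs <- lapply(mget(df_names, envir = here), BetterNames)

# Overwrite the original objects with the modified copies
invisible(list2env(dfs, envir = here))
names(quality)
```

The modified data frames remain available in dfs as well, so later steps can keep working on the list rather than on the individual objects.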

Subset data frame based on character vector of column names [duplicate]

This question already has answers here:
Drop data frame columns by name
(25 answers)
Closed 6 years ago.
Rookie question - thanks in advance for patience...
I have a dataframe:
vals <- c(1,1,1,1)
testdf <- data.frame("var1"=vals, "var2"=vals, "var3"=vals)
I have a character vector of variable names:
varnames <- c("var1", "var2")
This is a character vector b/c I use it to generate a formula earlier in the script.
I'd like to subset a dataframe such that variables in varnames are excluded, e.g.
newDF <- subset(df, select=-varnames)
This creates an error since subset expects names instead of characters. So, I use lapply to change the characters to names:
varnames <- lapply(varnames, as.name)
The result of this lapply function is a named(?) and nested(?) list.
[[1]]
var1
[[2]]
var2
[[3]]
var3
Here's where I get lost (I feel like Mugatu on crazy pills... is this confusing to anyone else!?). I can see that each value has correctly been changed from character to name, but it's in this weird nested structure - so when I try to subset, I get an error.
I've tried various solutions to unnest and unname, but with no success. This must be something easy I'm missing.
As a bonus - can someone tell me why it is ever useful for lapply to return this nested named list instead of simple vector? It seems very different than, for instance, Python. Thank you.
You can define the names of the columns you want inside [ (see the help file ?Extract or help("[") for the subset operator [).
testdf[ names(testdf)[!names(testdf) %in% varnames] ]
## or
## testdf[, names(testdf)[!names(testdf) %in% varnames] , drop = FALSE]
Or, more concisely (thanks @Frank)
testdf[ setdiff(names(testdf), varnames)]
# var3
# 1 1
# 2 1
# 3 1
# 4 1
where
names(testdf)
# [1] "var1" "var2" "var3"
varnames
# [1] "var1" "var2"
And so
names(testdf) %in% varnames
# [1] TRUE TRUE FALSE
And therefore
names(testdf)[!names(testdf) %in% varnames]
# [1] "var3"
Which is the same as
testdf[, "var3" ]
And drop = FALSE to stop it 'dropping' to a vector if there's only one column returned.
Also, if you look at the help file for lapply(X, FUN, ...)
?lapply
lapply returns a list of the same length as X
This is why you're getting a list.
As a bonus - can someone tell me why it is ever useful for lapply to return this nested named list instead of simple vector? It seems very different than, for instance, Python. Thank you.
When you're working with a list, and you want it to remain as a list.
You can also use match which returns an index
testdf[-match(varnames,names(testdf))]
# var3
#1 1
#2 1
#3 1
#4 1
You can access the elements using varnames[[1]] etc. and convert it into a vector, if it makes it easier for you.
Source: https://www.datacamp.com/community/tutorials/r-tutorial-apply-family
lapply takes a list and applies the function to every element of the list. The list can also have another list as an element. So it takes that into consideration and returns that nested structure.
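To make the list-versus-vector point concrete, here is a small sketch contrasting lapply() with vapply() and unlist(), followed by the setdiff() way of dropping columns from the answer above. toupper() is used only as a stand-in function, and the one-row testdf is a shortened version of the question's data.

```r
testdf <- data.frame(var1 = 1, var2 = 1, var3 = 1)
varnames <- c("var1", "var2")

# lapply() always returns a list, one element per input element
as_list <- lapply(varnames, toupper)

# vapply() returns a plain vector and checks each result's type and length
as_vec <- vapply(varnames, toupper, character(1), USE.NAMES = FALSE)

# A list of length-one results can also be flattened afterwards
flat <- unlist(as_list)

# Dropping columns by name needs only the character vector itself
kept <- testdf[setdiff(names(testdf), varnames)]
names(kept)
```

lapply() returns a list because the function's results may differ in type or length from one element to the next, and only a list can hold them all without coercion.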

Naming dataframes based on counter iteration in R?

I have a loop that will spit out a bunch of dataframes, and want to name the dataframes based on current iteration of the loop, e.g. df1 for the first iteration, df2 for the second iteration, and so on.
However, I'm running into problems trying to use the loop iteration counter to construct the dataframe name. For example, let's imagine I am in the first iteration of the loop and want to name the dataframe:
counter <- 1
as.name(paste("df",counter,sep="")) <- data.frame(x = (1:10), y = (10:1))
I get an error
Error in as.name(paste("df", counter, sep = "")) <- data.frame(x = (1:10), :
target of assignment expands to non-language object
Does anyone know how I might use the counter information to create dataframe names?
This is meant to complement Richard's, as it felt a little too substantial to simply edit into his.
A typical code pattern for this sort of thing would be:
#Initialize an empty list of the desired length
dfs <- vector("list",3)
#Fill the list with data frames, naming as we go
for (i in seq_along(dfs)){
dfs[[i]] <- data.frame(x = runif(5),y = runif(5))
names(dfs)[[i]] <- paste0("df",i)
}
The use of assign is typically frowned upon as bad style. If the naming of the data frames is very regular, you don't even need to do it in the loop:
names(dfs) <- paste0("df",seq_along(dfs))
You can do it in a vectorized fashion, as above. And as I mentioned below Richard's answer, although having them all in a list is never worse, and usually better, than having them as separate objects, you can convert the list to separate objects via:
list2env(dfs,envir = .GlobalEnv)
Instead of cluttering the global environment with data frames, it would be best to collect them in a list, and then you can use paste0 to name them in setNames with e.g.
> dfList <- setNames(list(data.frame(x = 1:10, y = 10:1)), paste0("df", 1))
after that you can refer to the data frame with
> dfList$df1
x y
1 1 10
2 2 9
3 3 8
4 4 7
5 5 6
6 6 5
7 7 4
8 8 3
9 9 2
10 10 1
As joran notes, if you insist on populating the global environment with these data frames, you can use
list2env(dfList, envir = .GlobalEnv)
and the data frames will be assigned as objects in the global environment.
Use assign:
assign(paste0("df", counter), data.frame(x = (1:10), y = (10:1)))
I think you are looking for
assign("name", dataframe)
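For comparison with assign, the list-based pattern recommended in the other answers can be sketched without an explicit loop. make_df is a hypothetical stand-in for whatever one loop iteration produces; the df1, df2, ... naming comes from the question.

```r
# Hypothetical stand-in for the work done in one loop iteration
make_df <- function(i) data.frame(iteration = i, x = 1:3)

# Build all data frames and name them df1, df2, df3 in one step
dfs <- setNames(lapply(1:3, make_df), paste0("df", 1:3))

# Look one up by constructing its name from a counter
counter <- 2
current <- dfs[[paste0("df", counter)]]
current
```

Because the names are ordinary strings inside a list, the counter can be used for lookup with [[ ]] and no assign or get is needed.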
