How to define 1000 dataframes in a single one? - r

My problem is the following. Suppose I have 1000 dataframes in R with the names eq1.1, eq1.2, ..., eq1.1000. I would like a single dataframe containing my 1000 dataframes. Normally, if I have only two dataframes, say eq1.1 and eq1.2 then I could define
df <- data.frame(eq1.1,eq1.2)
and I'm good. However, I can't follow this procedure because I have 1000 dataframes.
I was able to define a list containing the names of my 1000 dataframes using the code
names <- c()
for (i in 1:1000){names[i]<- paste0("eq1.",i)}
However, the elements of my list are recognized as strings and not as the dataframes that I previously defined.
Any help is appreciated!

How about
df.names <- ls(pattern = "^eq1\\.\\d")
eq1.dat <- do.call(cbind,
lapply(df.names,
get))
rm(list = df.names)

library(stringi)
library(dplyr)
# recreate dummy data
lapply(1:1000,function(i){
assign(sprintf("eq1.%s",i),
as.data.frame(matrix(ncol = 12, nrow = 13, sample(1:15))),
envir = .GlobalEnv)
})
# Now have 1000 data frames in my working environment named eq1.[1:1000]
> str(ls(pattern = "eq1.\\d+"))
> chr [1:1000] "eq1.1" "eq1.10" "eq1.100" "eq1.1000" "eq1.101" "eq1.102" "eq1.103" ...
1) create a holding data frame from the ep1.1 data frame that will be appended
each iteration in the following loop
empty_df <- eq1.1
2) im going to search for all the data frame named by convention and
create a data frame from the returned characters which represent our data frame
objects, but are nothing more than a character string.
3) mutate that data frame to hold an indexing column so that I can order the data frames properly from 1:1000 as the character representation will not be in numeric order from the step above
4) Drop the indexing column once the data frame names are in proper sequence
and then unlist the dfs column back into a character sequence and slice
the first value out, since it is stored already to our empty_df
5) loop through that sequence and for each iteration globally assign and
bind the preceding data frame into place. So for example on iteration 1,
the empty_df is now the same as data.frame(ep1.1, ep1.2) and for the
second iteration the empty_df is the same as data.frame(ep1.1, ep1.2, ep1.3)
NOTE: the get function takes the character representation and calls the data object from it. see ?get for details
lapply(
data.frame(dfs = ls(pattern = 'eq1\\.\\d+'))%>%
mutate(nth = as.numeric(stri_extract_last_regex(dfs,'\\d+'))) %>%
arrange(nth) %>% select(-nth) %>% slice(-1) %>% .$dfs, function(i){
empty_df <<- data.frame(empty_df, get(i))
}
)
All done, all the dataframes are bound to the empty_df and to check
> dim(empty_df)
[1] 13 12000

Related

Compare the column names of multiple CSV files in R

I'm combining 12 CSV files into one dataframe in R. Before doing this I want to ensure all the column names are an exact match with each other. I've made a dataframe where each column is the column names of the 12 CSV files.
jul21_cols <- data.frame(colnames(jul21))
aug21_cols <- data.frame(colnames(aug21))
sep21_cols <- data.frame(colnames(sep21))
oct21_cols <- data.frame(colnames(oct21))
nov21_cols <- data.frame(colnames(nov21))
dec21_cols <- data.frame(colnames(dec21))
jan22_cols <- data.frame(colnames(jan22))
feb22_cols <- data.frame(colnames(feb22))
mar22_cols <- data.frame(colnames(mar22))
apr22_cols <- data.frame(colnames(apr22))
may22_cols <- data.frame(colnames(may22))
jun22_cols <- data.frame(colnames(jun22))
col_df <- cbind(jul21_cols,aug21_cols,sep21_cols,oct21_cols,nov21_cols,dec21_cols,
jan22_cols,feb22_cols,mar22_cols,apr22_cols,may22_cols,jun22_cols)
I've tried using the identical function to compare 2 columns at a time.
identical(col_df[['jul21']], col_df[['aug21']])
identical(col_df[['aug21']], col_df[['sep21']])
identical(col_df[['sep21']], col_df[['oct21']])
identical(col_df[['oct21']], col_df[['nov21']])
identical(col_df[['nov21']], col_df[['dec21']])
identical(col_df[['dec21']], col_df[['jan22']])
identical(col_df[['jan22']], col_df[['feb22']])
identical(col_df[['feb22']], col_df[['mar22']])
identical(col_df[['mar22']], col_df[['apr22']])
identical(col_df[['apr22']], col_df[['may22']])
identical(col_df[['may22']], col_df[['jun22']])`
All of the identical lines return the value of TRUE
I'm just trying to verify that this code is telling me all my column names are identical in each CSV files before I move on. I'd also like to know if there is a more efficient way to solve this problem.
First, identical() will only return TRUE if the two dataframes have all the same column names in the same order. If you don’t care about order, just that all the same names are in both dataframes, you can sort() the names before comparing as shown below.
Second, you can often use the base::lapply() or purrr::map() families of functions for operations requiring iteration.
For your case, let’s put your dataframes in a list (which they probably should be to begin with), then use sapply() to compare the column names of the first df in the list to the column names of all other dfs.
jul21 <- data.frame(x = 1, y = 2)
aug21 <- data.frame(x = 3, y = 4)
sep21 <- data.frame(y = 6, x = 5)
dfs <- list(jul21,aug21,sep21)
all(sapply(
dfs[-1],
\(x) identical(sort(colnames(x)), sort(colnames(dfs[[1]])))
))
# TRUE
And as another test case, we’ll add a df with a non-matching column.
oct22 <- data.frame(x = 1, y = 2, z = 3)
dfs[[4]] <- oct22
all(sapply(
dfs[-1],
\(x) identical(sort(colnames(x)), sort(colnames(dfs[[1]])))
))
# FALSE
We assume that what is needed is to determine if the column names are the same and in same order and if not to determine which differ.
First get a character vector, Names, containing the names of the data frames and from that make a named list L containing the data frames themselves.
From those names assemble a list L of the data frames and then get a character vector nms whose elements are strings of column names, one for each data frame.
Finally group the names of the data frames using tapply and nms as the groupings so we can see which data frames contain which columns. In the example below aug21 and jul21 have one set of columns, i.e. Time and demand, and sep21 has a different set, i.e. Time and DEMAND. If there were only one row then all data frames have the same column names in the same order.
Names <- c("jul21", "aug21", "sep21") # using example in Note
L <- mget(Names)[Names]
nms <- sapply(names(L), function(x) toString(names(L[[x]])))
tab <- stack(tapply(names(nms), nms, toString))
names(tab) <- c("data.frames", "column.names")
nrow(tab)
## [1] 2
tab
## data.frames column.names
## 1 jul21, aug21 Time, demand
## 2 sep21 Time, DEMAND
graph
Another approach which could be used alternately or in conjuction with the one above is to create a graph such that each vertex is a data frame and each edge means that the two vertices on either end of the edge have the same column names in the same order. Each connected component represents distinct column names or orders. From the example below we see that jul21 and aug21 form one connected component and sep21 forms a second connected component.
To investigate how data frame column names differ note that setdiff(names(jul21), names(sep21)) will show names that are in jul21 but not in sep21 and the reverse can be used for the other direction. If the setdiff in both directions are zero length vectors and names vectors are not the same then they differ by order.
library(igraph)
set.seed(123)
isSame <- function(x, y) +identical(names(x), names(y))
A <- outer(L, L, Vectorize(isSame))
diag(A) <- 0
g <- graph_from_adjacency_matrix(A, "undirected")
plot(g, vertex.color = "white", vertex.size = 30)
Note
Test data. BOD comes with R.
jul21 <- aug21 <- sep21 <- BOD
names(sep21) <- c("Time", "DEMAND")

Splitting large data frame by column into smaller data frames (not lists) using loops

I have many large data frames. Using of the smaller ones for example:
dim(ch29)
476 4283
I need to split it into smaller pieces (i.e. subset into 241 columns at the most). My problems come afterwards when I want to analyze these smaller subsets.
I do not know how to subset the large date-frame into smaller data-frames and not simply a list.
I also want to do all of this in a loop and give the newly created smaller data frames unique names in the loop.
chunk=241
df<-ch29
n<-ceiling(ncol(df)/chunk)
for (i in 1:n) {
xname <- paste("ch29", i, sep="_")
cat("_", xname)
assign(xname, split(df, rep(1:n, each=chunk, length.out=ncol(df))))
}
I'm not exactly sure what you're trying to do or how you want to choose the columns that go in each data frame, but here's an example of one option:
# Fake data
set.seed(100)
ch29 = as.data.frame(replicate(4283, rnorm(476)))
# Number of columns we want in each split data frame
ncols = floor(ncol(ch29)/20)
# Start column for each split data frame
start = seq(1,ncol(ch29),ncols)
# Split ch29 into a bunch of separate data frames
df.list = lapply(setNames(start, paste0("ch29_", start, "_", start+ncols-1)),
function(i) ch29[ , i:min(i+ncols-1,ncol(ch29))])
You now have a list, df.list, where each list element is a data frame with ncols columns from ch29, except for the last element of the list, which will have between 1 and ncols columns. Also, the name of each list element is the name of the parent data frame (ch29) and the column range from which the subset data frame is drawn.
Try
for (i in 1:3) { # i = 1
xname = paste("ch29", i, sep = "_")
col.min = (i - 1) * chunk + 1
col.max = min(i * chunk, ncol(df))
assign(xname, df[,col.min:col.max])
}
In other words, use the notation df[,a:b], where a < b, to get the subset of the dataframe df consisting only of columns a to b.

Manipulating a dataset by separating variables

I have a data set that looks similar to the image shown below. Total, it is over a 1000 observations long. I want to create a new data frame that separates the single variable into 3 variables. Each variable is separated by a "+" in each observation, so it will need to be separated by using that as a factor.
Here is a solution using data.table:
library(data.table)
# Data frame
df <- data.frame(MovieId.Title.Genres = c("yyyy+xxxx+wwww", "zzzz+aaaa+aaaa"))
# Data frame to data table.
df <- data.table(df)
# Split column into parts.
df[, c("MovieId", "Title", "Genres") := tstrsplit(MovieId.Title.Genres, "\\+")]
# Print data table
df
I'll assume that your movieData object is a single column data.frame object.
If you want to split a single element from your data set, use strsplit using the character + (which R wants to see written as "\\+"):
# split the first element of movieData into a vector of strings:
strsplit(as.character(movieData[1,1]), "\\+")
Use lapply to apply this to the entire column, then massage the resulting list into a nice, usable data.frame:
# convert to a list of vectors:
step1 = lapply(movieData[,1], function(x) strsplit(as.character(x), "\\+"))
# step1 is a list, so make it into a data.frame:
step2 = as.data.frame(step1)
# step2 is a nice data.frame, but its names are garbage. Fix it:
movieDataWithColumns = setNames(step2, c("MovieId", "Title", "Genres"))

How to cbind many data frames with a loop?

I have 105 data frames with xts, zoo class and II want to combine their 6th columns into a data frame.
So, I created a data frame that contains all the data frame names to use it with a 'for' function:
mydata <- AAL
for (i in 2:105) {
k <- top100[i,1] # The first column contains all the data frame names
mydata <- cbind(mydata, k)
}
It's obviously wrong, but I have no idea either how to cbind so many data frames with completely different names (my data frame names are NASDAQ Symbols) nor how to pick the 6th column of all.
Thank you in advance
Try foreach package. May be there is more elegant way to do this task, but this approach will work.
library(foreach)
#create simple data frames with columns named 'A' and 'B'
df1<-t(data.frame(1,2,3))
df2<-t(data.frame(4,5,6))
colnames(df1)<-c('A')
colnames(df2)<-c('B')
#make a list
dfs<-list(df1,df2)
#join data frames column by column, this will preserve their names
foreach(x=1:2
,.combine=cbind)%do% # don`t forget this directive
{
dfs[[x]]
}
The result will be:
A B
X1 1 4
X2 2 5
X3 3 6
To pick column number 6:
df[,6]
First, you should store all of your data.frames in a list. You can then use a combination of lapply and do.call to extract and recombine the sixth columns of each of the data.frames:
# Create sample data
df_list <- lapply(1:105, function(x) {
as.data.frame(matrix(sample(1:1000, 100), ncol = 10))
})
# Extract the sixth column from each data.frame
extracted_cols <- lapply(df_list, function(x) x[6])
# Combine all of the columns together into a new data.frame
result <- do.call("cbind", extracted_cols)
One way to get all of your preexisting data.frames into a list would be to use lapply along with get:
df_list <- lapply(top100[[1]], get)

in R: execute function on dataframes whose names are in list

My global environment contains several dataframes. I want to execute functions on only those that contain a specific string in their name. So, I first create a list of these dataframes of interest:
dfs <- ls()[sapply(ls(), function(x) class(get(x))) == 'data.frame']
dfs <- as.data.frame(dfs)
dfs_lst <- agrep("stats", dfs$dfs, ignore.case=FALSE, value=TRUE,
max.distance=0.1, useBytes=FALSE)
dfs_lst correctly returns all dataframes in my global environment containing the string "stats". dfs_lst
chr [1:3] "stats1" "stats2" "stats3".
Now, I want to execute functions on these 3 dataframes, however I do not know how to call them from the dfs_lst. I want something of the kind:
for(i in 1:length(dfs_lst){
# Find dataframe name in dfs_lst, and then use the matching dataframe in
# global environment. So, something of the sort:
for(dfs_lst[i] in ls()){
result[i,] <- dfs_lst[i] %>%
summarise(. , <summarise stuff> )
}
}
For example, for i=1, dfs_lst[1] is dataframe "stats1", I would want to perform the following, and save it in the first row of "results":
for(stats1 in ls()){
result[1,] <- stats1 %>% summarise(. , <summarise stuff> )
}
As #lmo pointed out, it's probably best to store these data.frames together in a single list. Instead of having data.frame objects called "stats1", "stats2", etc, floating around in your environment, a (hacky) way to store all your data.frame objects in a list is this:
dfs <- ls()[sapply(ls(), function(x) class(get(x))) == 'data.frame']
##make an empty list
my_list <- list()
##populate the list
for (dfm_name in dfs) {
my_list[[dfm_name]] <- get(dfm_name)
}
Now you've got a list my_list containing every object of the class data.frame in your environment. This will probably be helpful when you want to work with all data.frames names "statsX":
##find all list objects whose name starts with "stats"
stats_objects <- substr(names(my_list),1,5)=="stats"
results <- matrix(NA, ncol = your_length, nrow = sum(stats_objects))
##now perform intended operations
for ( row_num in 1:nrow(results)) {
results[i,] <- my_list[stats_objects][[row_num]] %>%
summarise(. , <summarise stuff> )
}
This should perform as necessary, after a couple alterations in the code (e.g. your_length needs to be specified, and you wanted all objects whose name contains "stats" so you'll need to work with regularized expressions).
What's nice about this is my_list contains all the data.frames, so if you choose to run analysis on data.frames not named "stats" you can still access them with a similar procedure. Hope this helps.
As discussed in the comments, if we have a list of interesting data frames, it will be easier to deal with the elements as data frame. So, the main issue here seems to be having just the object names and not the actual data.frame objects.
In order to follow the code and tracking the data types, I have decomposed it first:
1.
env.list <- ls() # chr vector
2.
env.classes <- sapply(env.list, function(x) class(get(x)))
# list of chr (containing classes), element names: data frame names
3.
dfs <- env.list[env.classes == 'data.frame'] # chr vector
4.
dfs <- as.data.frame(dfs)
# data frame with one column (named "dfs"), containing data.frame names
Now, we can get the list of data.frames:
3.
dfs <- env.list[env.classes == 'data.frame'] # chr vector
dfs.list <- sapply(dfs, function(x) {get(x)})
grep can be applied now to names(dfs.list) to get the interesting data frames.

Resources