How to merge dataframe lists of unequal length? - r

This question is similar to Joining dataframes from lists of unequal length.
I have a shiny script where I am using fileImport to allow the user to import a variable number of data files. Each data file is then split into a list of dataframes, and these lists are collected in another list. So I have a list of lists of dataframes.
The input datafiles have two format possibilities, one may be 129 dataframes long, the other may be 67 - where the 67 is actually a subset of the 129 (so all 67 are present in the 129, but not all 129 are present in the 67). I am then trying to rbind the dataframes by name.
A reproducible example:
# Some data
df.l1 <- list(df1 = data.frame(A = letters[1:10],
B = rnorm(10, 5, 1)),
df2 = data.frame(A = letters[11:20],
B = rnorm(10, 10, 2)))
df.l2 <- list(df1 = data.frame(A = letters[1:10],
B = rnorm(10, 5, 1)),
df2 = data.frame(A = letters[11:20],
B = rnorm(10, 10, 2)))
df.l3 <- list(df1 = data.frame(A = letters[1:10],
B = rnorm(10, 5, 1)),
df2 = data.frame(A = letters[11:20],
B = rnorm(10, 10, 2)),
df3 = data.frame(A = LETTERS[1:10],
B = rnorm(10, 15, 2)))
This works when binding lists of equal length (e.g. df.l1 and df.l2):
df.two <- list(df.l1, df.l2)
list.merged <- do.call(function(...) Map(rbind, ...), df.two)
But it fails when binding lists of dataframes with different lengths:
df.three <- list(df.l1, df.l2, df.l3)
list.merged <- do.call(function(...) Map(rbind, ...), df.three)
Giving the warnings:
Warning messages:
1: In mapply(FUN = f, ..., SIMPLIFY = FALSE) :
longer argument not a multiple of length of shorter
2: In mapply(FUN = f, ..., SIMPLIFY = FALSE) :
longer argument not a multiple of length of shorter
As I said above, similar questions have been asked, but this situation is unique given the variable number of lists I am trying to merge. Help is greatly appreciated!

For robust handling of this I would use dplyr::bind_rows or data.table::rbindlist. First you bind each list, then you bind at the upper level, which gives a single combined data frame:
tidyverse version:
library(dplyr)
bind_rows(lapply(df.three, bind_rows))
data.table version:
library(data.table)
rbindlist(lapply(df.three, rbindlist))
Not only will this handle weird corner cases you don't expect, but it will also be much faster than the do.call approach with rbind.
edit in response to comment
Try this:
library(purrr)
library(dplyr)
# lapply always returns a list, so this also works when all lists have the same length
df_names <- unique(unlist(lapply(df.three, names)))
result <- list()
for (n in df_names) {
result[[n]] <- map(df.three, n)
}
map(result, dplyr::bind_rows)
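The same by-name binding can be sketched in base R without purrr or dplyr. This is only an illustration under assumptions: the small lists l1, l2, l3 and the names all_lists, all_names and merged are made up for the example and are not part of the question's data.

```r
# Three lists of data frames; the third has an extra element (df3)
l1 <- list(df1 = data.frame(A = 1:2), df2 = data.frame(A = 3:4))
l2 <- list(df1 = data.frame(A = 5:6), df2 = data.frame(A = 7:8))
l3 <- list(df1 = data.frame(A = 9:10), df2 = data.frame(A = 11:12),
           df3 = data.frame(A = 13:14))
all_lists <- list(l1, l2, l3)

# Every data frame name that occurs in at least one list
all_names <- unique(unlist(lapply(all_lists, names)))

# For each name, collect the matching data frames and rbind them
merged <- lapply(setNames(all_names, all_names), function(n) {
  pieces <- lapply(all_lists, `[[`, n)         # NULL where the name is absent
  pieces <- Filter(Negate(is.null), pieces)    # drop the missing ones
  do.call(rbind, pieces)
})
```

Matching by name rather than by position is what avoids the recycling warning: a list that lacks df3 simply contributes nothing to that element.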

Related

How to use lapply to skimr:: skim multiple data frames then export them to Excel?

I have 2 data frames(more in real life). My goal is to apply the skim function then export them as excel to a folder. They would also have different Excel file names.
df1 <- data.frame(x = rep(3, 5), y = seq(1, 5, 1), ID = letters[1:5])
df2 <- data.frame(x = rep(5, 5), y = seq(2, 6, 1), ID = letters[6:10])
I need the short way to accomplish the below:
for df1:
df1_summary<-skim(df1)
df1_summary<-as.data.frame(df1_summary)
write_xlsx(df1_summary,"df1_summary.xlsx")
for df2:
df2_summary<-skim(df2)
df2_summary<-as.data.frame(df2_summary)
write_xlsx(df2_summary,"df2_summary.xlsx")
So far I know,
df.list<-list(df1, df2)
lapply(df.list, function(x) ...
I have many more than 2 data frames for this task in real life. Any help to shorten the process would be helpful!
We can apply the function after placing the datasets in a list:
library(skimr)
library(writexl)
# Summarise each data frame and convert the skim output to a plain data.frame
lst1 <- lapply(list(df1, df2), function(x) as.data.frame(skim(x)))
names(lst1) <- c('df1_summary', 'df2_summary')
# Write each summary to "<name>.xlsx"
Map(function(x, y) write_xlsx(x, paste0(y, ".xlsx")), lst1, names(lst1))
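The pattern scales to any number of data frames if the list is named up front, so the file names can be derived from the list names. The sketch below shows only that pattern and deliberately swaps in base R: a hand-rolled summary instead of skim() and write.csv() instead of write_xlsx(), so it runs without skimr or writexl. The names dfs, summaries, out_dir and paths are assumptions for the example.

```r
# A named list: the names drive the output file names
dfs <- list(
  df1 = data.frame(x = rep(3, 5), y = seq(1, 5, 1)),
  df2 = data.frame(x = rep(5, 5), y = seq(2, 6, 1))
)
out_dir <- tempdir()  # hypothetical output folder

# One summary per data frame: a row per column with mean, min and max
summaries <- lapply(dfs, function(d) {
  data.frame(variable = names(d),
             mean = vapply(d, mean, numeric(1)),
             min  = vapply(d, min, numeric(1)),
             max  = vapply(d, max, numeric(1)),
             row.names = NULL)
})

# Build one path per summary and write each file
paths <- file.path(out_dir, paste0(names(summaries), "_summary.csv"))
invisible(Map(function(s, p) write.csv(s, p, row.names = FALSE),
              summaries, paths))
```

With skimr and writexl installed, the two swapped-in steps would be replaced by as.data.frame(skim(d)) and write_xlsx(s, p) as in the answer above.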

function applied to dataset R

Below are two dataframes labeled 'A' and 'C'. I have created a function that takes the top 5 rows of a dataframe, and I want to apply it to dataframe C. However, it only returns the rows of A. How can I have this function applied to C instead? Thanks!
L3 <- LETTERS[1:3]
fac <- sample(L3, 10, replace = TRUE)
(d <- data.frame(x = 1, y = 1:10, fac = fac))
## The "same" with automatic column names:
A<-data.frame(1, 1:10, sample(L3, 10, replace = TRUE))
L3 <- LETTERS[7:9]
fac <- sample(L3, 10, replace = TRUE)
(d <- data.frame(x = 1, y = 1:10, fac = fac))
## The "same" with automatic column names:
C<-data.frame(1, 1:10, sample(L3, 10, replace = TRUE))
function_y<-function(Data_Analysis_Task) {
sample2<-head(A, 5)
return(sample2)
}
D<-function_y(C)
We need to use the function's argument inside the function body as well:
function_y <- function(Data_Analysis_Task) {
head(Data_Analysis_Task, 5)
}
D <- function_y(C)
If we use head(A, 5) inside the function, R looks for the object 'A' in the function's own environment first; if it is not found there, R looks in the enclosing environment, and so on until it finds 'A' in the global environment. So the function returns the head of 'A' every time it is called, regardless of the argument passed.
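A small side-by-side sketch makes the difference visible. The data frames here are simplified stand-ins for the question's A and C, and the function names uses_global and uses_argument are made up for the illustration.

```r
A <- data.frame(id = 1:10)
C <- data.frame(id = 101:110)

# Ignores its argument: 'A' is resolved by lexical scoping, not from the call
uses_global <- function(Data_Analysis_Task) head(A, 5)

# Uses whatever data frame is passed in
uses_argument <- function(Data_Analysis_Task) head(Data_Analysis_Task, 5)

from_global   <- uses_global(C)    # first five rows of A
from_argument <- uses_argument(C)  # first five rows of C
```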

R: object y not found in function (x,y) [function to pass through data frames in r]

I am writing a function to build new data frames based on existing data frames. So I essentially have
f1 <- function(x,y) {
x_adj <- data.frame("DID*"= df.y$`DM`[x], "LDI"= df.y$`DirectorID*`[-(x)], "LDM"= df.y$`DM`[-(x)], "IID*"=y)
}
I have 4,000 data frames df., so I really need to use this and R is returning an error saying that df.y is not found. y is meant to be used through a list of all the 4000 names of the different df. I am very new at R so any help would be really appreciated.
In case more specifics are needed I essentially have something like
df.1 <- data.frame(x = 1:3, b = 5)
And I need the following as a result using a function
df.11 <- data.frame(x = 1, c = 2:3, b = 5)
df.12 <- data.frame(x = 2, c = c(1,3), b = 5)
df.13 <- data.frame(x = 3, c = 1:2, b = 5)
Thanks in advance!
The OP seems to need to access a data.frame through a dynamically constructed name.
One option is to use get:
get(paste("df",y,sep = "."))
With y = 1, the get call above returns df.1.
Hence, the function can be modified as:
f1 <- function(x,y) {
temp_df <- get(paste("df",y,sep = "."))
x_adj <- data.frame("DID*"= temp_df$`DM`[x], "LDI"= temp_df$`DirectorID*`[-(x)],
"LDM"= temp_df$`DM`[-(x)], "IID*"=y)
}
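Applied to the small df.1 example from the question, the get approach might look like the sketch below. The helper name split_one and the result name res are hypothetical, and the column layout follows the asker's desired df.11, df.12 and df.13 rather than the real 4,000 data frames.

```r
df.1 <- data.frame(x = 1:3, b = 5)

# Hypothetical helper: look up "df.<y>" by name, then pair row i
# with the x values of every other row
split_one <- function(i, y) {
  src <- get(paste("df", y, sep = "."))
  data.frame(x = src$x[i], c = src$x[-i], b = src$b[i])
}

# One result per row of df.1, kept together in a list
res <- lapply(1:3, split_one, y = 1)
```

Keeping the results in a list, and ideally the source data frames too, tends to be easier to manage than thousands of separately named objects.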

rename multiple datasets after applying a function in R

I am trying to apply a function to different dataframes. After doing that I want to get the resulting dataframe and save them keeping their original names and adding something else to differentiate the new dataframes.
This is what I've tried, which is obviously not working.
# Creating dummy data
N <- 8
df1 <- data.frame(x1 = rnorm(N), x2 = sample(1:10, size = N, replace = TRUE), x3 = 1*(runif(n = N) < .75))
df2 <- data.frame(y1 = rnorm(N), y2 = sample(100:200, size = N, replace = TRUE), y3 = runif(N))
df3 <- data.frame(z1 =rnorm(N), z2 = sample(8:80, size = N,replace = TRUE), Z3 = runif(N))
# Making a list of the three data frames
mydata <- list(df1=df1, df2=df2, df3= df3)
#Applying a function to mydata list
mydata2 <- lapply(mydata, function(x) mean(unlist(x)))
# Renaming each dataset
n <- 1:length(mydata2)
noms <- names(mydata2)
for (i in 1:n){
mynewlist <- lapply(mydata2, function(x) {names(x) <-("_mean", sep ="");
return(x))}
Any help will be deeply appreciated.
We can use list2env if we need to create multiple objects in the global environment (though not recommended as most of the operations can be done within the list itself).
We change the names of the list by pasting a suffix onto each name and then use list2env:
list2env(setNames(mydata2, paste0(names(mydata2),
"_newname")), envir=.GlobalEnv)
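As a minimal sketch of both options, the renamed results can stay in the list, and list2env can target a dedicated environment. The toy data, the "_mean" suffix and the environment name target are assumptions for the example; target stands in for .GlobalEnv so the sketch does not clutter the workspace.

```r
mydata <- list(df1 = data.frame(x = 1:4), df2 = data.frame(x = 5:8))

# Apply the function, then rename the results inside the list itself
means <- lapply(mydata, function(d) mean(unlist(d)))
means <- setNames(means, paste0(names(means), "_mean"))

# Optional: turn the list elements into objects in an environment
target <- new.env()   # hypothetical; the answer above uses .GlobalEnv
list2env(means, envir = target)
```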

r - How to expand data.frame over unused factor levels?

I need to do this for a list of dataframes that all have a common variable. I want to expand each dataframe so that they would have the common variable expanded to all of the levels present in all of the dataframes.
myList <- list(A = data.frame(A1 = rnorm(10), A2 = rnorm(10), A3 = rnorm(10),
year = factor(c(2000:2009))),
B = data.frame(B1 = rnorm(10), B2 = rnorm(10), B3 = rnorm(10),
year = factor(c(2001:2010))))
masterYear <- unique(unlist(lapply(myList, function(x) levels(x$year)), use.names = F))
I've thus far tried to use the dplyr and tidyr packages in a function:
funExpand <- function(x){
levels(x$year) <- c(levels(x$year), setdiff(masterYear, levels(x$year)))
vars <- names(x)[-length(names(x))]
x %>%
tidyr::complete_(x, c(vars), fill = list(0))
x
}
myList2 <- lapply(myList, funExpand)
But that yields an error. I've tried various combinations of tidyr::complete and tidyr::complete_ functions (first argument x or year?), all yielding some error. That tells me that I'm not interpreting the complete functions correctly.
Aside from fixes for this error, I also welcome all suggestions for improving the process.
Updated to reflect comment by OP
Try this,
myList2 <- lapply(myList,
function(db) {
db$year <- factor(as.character(db$year), levels=masterYear)
merge(db, data.frame(year=setdiff(masterYear, db$year)), all=TRUE)
})
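The new rows will contain NA in the other columns. If you really need them to be 0, add another line to the function that stores the merge result, sets its NA values to 0 (e.g. out[is.na(out)] <- 0) and returns it.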
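A compact base R sketch of the whole expansion, including the zero fill, is given here under assumptions: the data are a cut-down version of myList, the helper name expand_years is made up, and the merge starts from the full set of years instead of only the missing ones, which is a slight variation on the code above.

```r
myList <- list(
  A = data.frame(A1 = 1:3, year = factor(2000:2002)),
  B = data.frame(B1 = 4:6, year = factor(2001:2003))
)
masterYear <- sort(unique(unlist(lapply(myList, function(d) levels(d$year)))))

# Hypothetical helper: one row per master year, zeros where data are missing
expand_years <- function(d) {
  d$year <- as.character(d$year)
  out <- merge(data.frame(year = masterYear), d, by = "year", all.x = TRUE)
  out[is.na(out)] <- 0                               # fill the new rows
  out$year <- factor(out$year, levels = masterYear)  # shared factor levels
  out
}

myList2 <- lapply(myList, expand_years)
```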
I guess you don't need x %>%
funExpand <- function(x) {levels(x$year) <- c(levels(x$year),
setdiff(masterYear, levels(x$year)))
vars <- names(x)[-length(names(x))]
complete_(x, vars, fill=list(0))}
lapply(myList, funExpand)
