How to rbind several named dataframes but keep only common columns? - r

I have several data frames named a32, a33,..., a63 in the namespace which I have to rbind to a single dataframe. Each has several (about 20) columns. They were supposed to have common column names but unfortunately a few have some columns missing. This leads to an error when I try to rbind them.
l <- 32:63
l <- as.character(l) ## convert to a character vector
A <- do.call(rbind.data.frame,mget(paste0("a",l))) ## "colnames not matching" error
Error in (function (..., deparse.level = 1, make.row.names = TRUE, stringsAsFactors = default.stringsAsFactors(), :
numbers of columns of arguments do not match
I want to rbind them by only taking the common columns. I tried using paste0 inside a for loop to list column names for all dataframes and see which dataframes have missing columns but got nowhere. How can I avoid manually searching for missing columns by listing column names of each data frame one-by-one.
As a small example, say:
a32 <- data.frame(AB = 1, CD = 2, EF = 3, GH = 4)
a33 <- data.frame(AB = 6, EF = 7)
a34 <- data.frame(AB = 8, CD = 9, EF = 10, GH = 11)
a35 <- data.frame(AB = 12,CD = 13, GH = 14)
a36 <- data.frame(AB = 15,CD = 16,EF = 17,GH = 18)
and so on
Is there an efficient way to rbind all the 32 data frames in the namespace?

Get the dataframes in a list.
Find the common columns using Reduce + intersect.
Subset each dataframe in the list to the common columns.
Combine all the data together.
list_data <- mget(paste0("a",l))
common_cols <- Reduce(intersect, lapply(list_data, colnames))
result <- do.call(rbind, lapply(list_data, `[`, common_cols))
You can also make use of purrr::map_df which will make this shorter.
result <- purrr::map_df(list_data, `[`, common_cols)
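To see which data frames are missing which columns before anything is dropped, a small diagnostic along these lines may help. This is only a sketch on the toy data from the question; `all_cols` and `missing_by_df` are names made up here for illustration.

```r
# Toy data from the question
a32 <- data.frame(AB = 1, CD = 2, EF = 3, GH = 4)
a33 <- data.frame(AB = 6, EF = 7)
a34 <- data.frame(AB = 8, CD = 9, EF = 10, GH = 11)
a35 <- data.frame(AB = 12, CD = 13, GH = 14)
a36 <- data.frame(AB = 15, CD = 16, EF = 17, GH = 18)

# Collect the data frames by name into a named list
list_data <- mget(paste0("a", 32:36), envir = environment())

# Union of all column names seen in any data frame (hypothetical name)
all_cols <- Reduce(union, lapply(list_data, colnames))

# For each data frame, the columns it lacks (hypothetical name)
missing_by_df <- lapply(list_data, function(d) setdiff(all_cols, colnames(d)))

# Show only the data frames that are missing something
missing_by_df[lengths(missing_by_df) > 0]

# Then bind on the columns that every data frame has
common_cols <- Reduce(intersect, lapply(list_data, colnames))
result <- do.call(rbind, lapply(list_data, `[`, common_cols))
```

In this toy data only AB appears in every data frame, so `result` has a single column.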

A base R solution:
# get the names of the data frames in the workspace (a32, a33, ...)
dat_names <- ls(pattern = "^a[0-9]{2}$")
# get the data
df <- lapply(dat_names, get)
# get the common columns (lapply always returns a list, unlike sapply)
common_col <- Reduce(intersect, lapply(df, colnames))
# select the common columns and rbind; drop = FALSE keeps a data frame
# even when only one column is common
dat <- lapply(df, function(x) x[, common_col, drop = FALSE])
dat <- do.call("rbind", dat)
dat
#   AB
# 1  1
# 2  6
# 3  8
# 4 12
# 5 15

Related

In a list of data frames, pad one variable with leading zeros (ideally w/ stringr)

I'm working with a list of data frames. In each data frame, I would like to pad a single ID variable with leading zeros. The ID variables are character vectors and are always the first variable in the data frame. In each data frame, however, the ID variable has a different length. For example:
df1_id ranges from 1:20, thus I need to pad with up to one zero,
df2_id ranges from 1:100, thus I need to pad with up to two zeros,
etc.
My question is, how can I pad each data frame without having to write a single line of code for each data frame in the list.
As mentioned above, I can solve this problem by using the str_pad function on each data frame separately. For example, see the code below:
#Load stringr package
library(stringr)
#Create sample data frames
df1 <- data.frame("x" = as.character(1:20), "y" = rnorm(20, 10, 1),
stringsAsFactors = FALSE)
df2 <- data.frame("v" = as.character(1:100), "y" = rnorm(100, 10, 1),
stringsAsFactors = FALSE)
df3 <- data.frame("z" = as.character(1:1000), "y" = rnorm(1000, 10, 1),
stringsAsFactors = FALSE)
#Combine data frames into list
dfl <- list(df1, df2, df3)
#Pad ID variables with leading zeros
dfl[[1]]$x <- str_pad(dfl[[1]]$x, width = 2, pad = "0")
dfl[[2]]$v <- str_pad(dfl[[2]]$v, width = 3, pad = "0")
dfl[[3]]$z <- str_pad(dfl[[3]]$z, width = 4, pad = "0")
While this solution works relatively well for a short list, as the number of data frames increases, it becomes a bit unwieldy.
I would love if there was a way that I could embed some sort of "sequence" vector into the width argument of the str_pad function. Something like this:
dfl <- lapply(dfl, function(x) {x[,1] <- str_pad(x[,1], width = SEQ, pad =
"0")})
where SEQ is a vector of variable lengths. Using the above example it would look something like:
seq <- c(2,3,4)
Thanks in advance, and please let me know if you have any questions.
~kj
You could use Map here, which is designed to apply a function "to the first elements of each ... argument, the second elements, the third elements", see ?mapply for details.
library(stringr)
vec <- c(2,3,4) # this is the vector of 'widths', don't name it seq
Map(function(i, y) {
dfl[[i]][, 1] <- str_pad(dfl[[i]][, 1], width = y, pad = "0")
dfl[[i]] # this gets returned
},
# you iterate over these two vectors in parallel
i = 1:length(dfl),
y = vec)
Output
#[[1]]
# x y
#1 01 9.373546
#2 02 10.183643
#3 03 9.164371
#
#[[2]]
# v y
#1 001 11.595281
#2 002 10.329508
#3 003 9.179532
#4 004 10.487429
#
#[[3]]
# z y
#1 0001 10.738325
#2 0002 10.575781
#3 0003 9.694612
#4 0004 11.511781
#5 0005 10.389843
explanation
The function that we pass to Map is an anonymous function, which more or less you provided in your question:
function(i, y) {
dfl[[i]][, 1] <- str_pad(dfl[[i]][, 1], width = y, pad = "0")
dfl[[i]] # this gets returned
}
You see the function takes two arguments, i and y (choose other names if you like, such as df and width), and for each dataframe in your list it modifies the first column dfl[[i]][, 1] <- ... . The anonymous function applies str_pad to the first column of each dataframe
... <- str_pad(dfl[[i]][, 1], width = y, pad = "0")
but you see that we don't pass a fixed value to the width argument, but y.
Coming back to Map. Map now applies str_pad to the first dataframe, with argument width = 2, it applies str_pad to the second dataframe, with argument width = 3 and - you probably guessed it - it applies str_pad to the third dataframe in your list, with argument width = 4.
The arguments are specified in the last two lines of the code as
i = 1:length(dfl),
y = vec)
I hope this helps.
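If the widths should not be typed by hand, they can be derived from the data. The sketch below assumes each ID should be padded to the length of the longest ID in its own data frame; `pad_first_col` is a hypothetical helper, and base R's strrep/paste0 stands in for str_pad so the example runs without extra packages.

```r
# Small example list (made-up values for illustration)
dfl <- list(
  data.frame(x = as.character(1:3), y = 1:3, stringsAsFactors = FALSE),
  data.frame(v = as.character(c(1, 20, 100)), y = 1:3, stringsAsFactors = FALSE)
)

# Hypothetical helper: left-pad the first column with zeros up to the
# width of the longest ID in that data frame
pad_first_col <- function(df) {
  ids <- df[[1]]
  width <- max(nchar(ids))
  # base R stand-in for stringr::str_pad(ids, width = width, pad = "0")
  df[[1]] <- paste0(strrep("0", width - nchar(ids)), ids)
  df
}

# Apply the helper to every data frame in the list
dfl_padded <- lapply(dfl, pad_first_col)
```

With this approach no separate vector of widths has to be kept in sync with the list.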
data
(consider creating a minimal example next time, as the number of rows of the dataframes is not relevant to the problem)
set.seed(1)
df1 <- data.frame("x" = as.character(1:3), "y" = rnorm(3, 10, 1),
stringsAsFactors = FALSE)
df2 <- data.frame("v" = as.character(1:4), "y" = rnorm(4, 10, 1),
stringsAsFactors = FALSE)
df3 <- data.frame("z" = as.character(1:5), "y" = rnorm(5, 10, 1),
stringsAsFactors = FALSE)
#Combine data frames into list
dfl <- list(df1, df2, df3)

Index nested lists of named data frames using character vector - R

I have a nested list of named data frames like so:
mylist2 <- list(
list(df1.a = data.frame(replicate(2,sample(0:1,5,rep=TRUE))), df2.b = data.frame(replicate(2,sample(0:1,5,rep=TRUE)))),
list(df3.c = data.frame(replicate(2,sample(0:1,5,rep=TRUE))), df4.d = data.frame(replicate(2,sample(0:1,5,rep=TRUE)))),
list(df5.e = data.frame(replicate(2,sample(0:1,5,rep=TRUE))), df6.f = data.frame(replicate(2,sample(0:1,5,rep=TRUE)))))
I run a test (not important what sort of test) and it produces a character vector telling me which data frames in this list are important:
test
[1] "df1.a" "df5.e"
What is the most efficient way to extract these data frames from the nested list using this character vector? The test only shows the names of second list, so nestedlist[test] does not work.
As the OP mentioned it was a nested list, we can loop through the initial list and then extract the elements of the second list with [
lapply(mylist2, '[', test)
or using purrr (the inner elements are plain lists, so subset them by name rather than with select)
library(purrr)
map(mylist2, ~ .x[names(.x) %in% test])
Update
Based on the updated dataset:
Filter(length, lapply(mylist2, function(x) x[intersect(test, names(x))]))
Here is a reproducible example including sample data using nested lists:
# Sample data
lst <- list(
list(df1.a = 1, df2.b = 2),
list(df3.c = 3, df4.d = 4),
list(df5.e = 5, df6.f = 6))
test <- c("df1.a", "df5.e");
ret <- lapply(lst, function(x) x[names(x) %in% test])
ret[sapply(ret, length) > 0];
#[[1]]
#[[1]]$df1.a
#[1] 1
#
#
#[[2]]
#[[2]]$df5.e
#[1] 5
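If the nesting itself is not needed afterwards, another option is to flatten the list by one level and index it by name. This is a sketch that assumes the inner names are unique across the whole nested list; `flat` and `picked` are names chosen here for illustration.

```r
# Sample data, as in the answer above
lst <- list(
  list(df1.a = 1, df2.b = 2),
  list(df3.c = 3, df4.d = 4),
  list(df5.e = 5, df6.f = 6))
test <- c("df1.a", "df5.e")

# Flatten one level only, so each inner element (e.g. a data frame) stays intact
flat <- unlist(lst, recursive = FALSE)

# Index by name; intersect() guards against names that are not present
picked <- flat[intersect(test, names(flat))]
```

Unlike the lapply approach, the result here is a single flat list rather than a list of lists.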

cbind equally named vectors in multiple data.frames in a list to a single data.frame

I have a list similar to this one:
set.seed(1602)
l <- list(data.frame(subst_name = sample(LETTERS[1:10]), perc = runif(10), crop = rep("type1", 10)),
data.frame(subst_name = sample(LETTERS[1:7]), perc = runif(7), crop = rep("type2", 7)),
data.frame(subst_name = sample(LETTERS[1:4]), perc = runif(4), crop = rep("type3", 4)),
NULL,
data.frame(subst_name = sample(LETTERS[1:9]), perc = runif(9), crop = rep("type5", 9)))
Question: How can I extract the subst_name-column of each data.frame and combine them with cbind() (or similar functions) to a new data.frame without messing up the order of each column? Additionally the columns should be named after the corresponding crop type (this is possible 'cause the crop types are unique for each data.frame)
EDIT: The output should look as follows:
Having read the comments I'm aware that within R it doesn't make much sense, but for the sake of having a look at the output the data.frame's View option is quite handy.
With the help of this SO question I came up with the following solution. (There's probably room for improvement.)
a <- lapply(l, '[[', 1) # extract the first element of the dfs in the list
a <- Filter(function(x) !is.null(unlist(x)), a) # remove NULLs
a <- lapply(a, as.character)
max.length <- max(sapply(a, length))
## Add NA values to list elements
b <- lapply(a, function(v) { c(v, rep(NA, max.length-length(v)))})
e <- as.data.frame(do.call(cbind, b)) # bind the NA-padded vectors from the previous step
names(e) <- unlist(lapply(lapply(lapply(l, '[[', "crop"), '[[', 2), as.character))
It is not really correct to do this with the given example because the number of rows is not the same in each of the list's data frames. But if you don't care you can do:
nullElements = unlist(sapply(l,is.null))
l = l[!nullElements] #delete useless null elements in list
columns=lapply(l,function(x) return(as.character(x$subst_name)))
newDf = as.data.frame(Reduce(cbind,columns))
If you don't want recycled elements in the columns you can do
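for(i in 1:ncol(newDf)){
colLength = nrow(l[[i]])
# only blank out recycled rows; without this check, (colLength+1):nrow(newDf)
# counts backwards when the column is already full length
if(colLength < nrow(newDf)) newDf[(colLength+1):nrow(newDf),i] = NA
}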
newDf = newDf[1:max(unlist(sapply(l,nrow))),] #remove possible extra NA rows
Note that I edited my previous code to remove NULL entries from l to simplify things
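The NA padding can also be done without recycling by extending each vector to the maximum length with `length<-`. The following is a sketch of that idea on the data from the question, not the method used in either answer; `dfs`, `cols`, `padded` and `wide` are names invented here.

```r
set.seed(1602)
l <- list(
  data.frame(subst_name = sample(LETTERS[1:10]), perc = runif(10), crop = rep("type1", 10)),
  data.frame(subst_name = sample(LETTERS[1:7]), perc = runif(7), crop = rep("type2", 7)),
  data.frame(subst_name = sample(LETTERS[1:4]), perc = runif(4), crop = rep("type3", 4)),
  NULL,
  data.frame(subst_name = sample(LETTERS[1:9]), perc = runif(9), crop = rep("type5", 9)))

# Drop the NULL entries
dfs <- Filter(Negate(is.null), l)

# Extract subst_name from each data frame as a character vector
cols <- lapply(dfs, function(d) as.character(d$subst_name))

# Extend every vector to the longest length; `length<-` fills with NA
n_max <- max(lengths(cols))
padded <- lapply(cols, `length<-`, n_max)

# Name each column after its (unique) crop type
names(padded) <- vapply(dfs, function(d) as.character(d$crop[1]), character(1))

wide <- as.data.frame(padded, stringsAsFactors = FALSE)
```

Each column keeps its original order, and shorter columns end in NA instead of recycled values.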

Changing Column Names in a List of Data Frames in R

Objective: change the column names of all the data frames in the global environment to the names in the following list.
So.
0) The Column names are:
colnames = c("USAF","WBAN","YR--MODAHRMN")
1) I have the following data.frames: df1, df2.
2) I put them in a list:
dfList <- list(df1,df2)
3) Loop through the list:
for (df in dfList){
colnames(df)=colnames
}
But this creates a new df with the column names that I need, it doesn't change the original column names in df1, df2. Why? Could lapply be a solution? Thanks
Can something like:
lapply(dfList, function(x) {colnames(dfList)=colnames})
work?
With lapply you can do it as follows.
Create sample data:
df1 <- data.frame(A = 1, B = 2, C = 3)
df2 <- data.frame(X = 1, Y = 2, Z = 3)
dfList <- list(df1,df2)
colnames <- c("USAF","WBAN","YR--MODAHRMN")
Then, lapply over the list using setNames and supply the vector of new column names as second argument to setNames:
lapply(dfList, setNames, colnames)
#[[1]]
# USAF WBAN YR--MODAHRMN
#1 1 2 3
#
#[[2]]
# USAF WBAN YR--MODAHRMN
#1 1 2 3
Edit
If you want to assign the data.frames back to the global environment, you can modify the code like this:
dfList <- list(df1 = df1, df2 = df2)
list2env(lapply(dfList, setNames, colnames), .GlobalEnv)
Just change your for-loop into an index for-loop like this:
Data
df1 <- data.frame(a=runif(5), b=runif(5), c=runif(5))
df2 <- data.frame(a=runif(5), b=runif(5), c=runif(5))
dflist <- list(df1,df2)
colnames = c("USAF","WBAN","YR--MODAHRMN")
Solution
for (i in seq_along(dflist)){
colnames(dflist[[i]]) <- colnames
}
Output
> dflist
[[1]]
USAF WBAN YR--MODAHRMN
1 0.8794153 0.7025747 0.2136040
2 0.8805788 0.8253530 0.5467952
3 0.1719539 0.5303908 0.5965716
4 0.9682567 0.5137464 0.4038919
5 0.3172674 0.1403439 0.1539121
[[2]]
USAF WBAN YR--MODAHRMN
1 0.20558383 0.62651334 0.4365940
2 0.43330717 0.85807280 0.2509677
3 0.32614750 0.70782919 0.6319263
4 0.02957656 0.46523151 0.2087086
5 0.58757198 0.09633181 0.6941896
By using for (df in dfList) you are essentially creating a new df on each iteration and changing the column names of that copy, which leaves the original list (dfList) untouched.
If you want the for loop to work, you should not pass the whole data.frame as the argument.
for (df in 1:length(dfList))
colnames(dfList[[df]]) <- colnames
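The difference between the two loops can be checked directly. This is a minimal sketch using the sample data from the answers above; `new_names`, `before` and `after` are names chosen here for illustration.

```r
df1 <- data.frame(A = 1, B = 2, C = 3)
df2 <- data.frame(X = 1, Y = 2, Z = 3)
dfList <- list(df1, df2)
new_names <- c("USAF", "WBAN", "YR--MODAHRMN")

# Looping over elements: `df` is a copy of each element,
# so renaming it does not touch dfList
for (df in dfList) {
  colnames(df) <- new_names
}
before <- colnames(dfList[[1]])  # still "A" "B" "C"

# Looping over positions: the assignment writes back into the list
for (i in seq_along(dfList)) {
  colnames(dfList[[i]]) <- new_names
}
after <- colnames(dfList[[1]])   # now the new names
```

Note that df1 and df2 in the global environment are still unchanged after both loops; only the copies stored in dfList are renamed.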
dfList <- lapply(dfList, `names<-`, colnames)
Create the sample data:
df1 <- data.frame(A = 1, B = 2, C = 3)
df2 <- data.frame(X = 1, Y = 2, Z = 3)
dfList <- list(df1,df2)
name <- c("USAF","WBAN","YR--MODAHRMN")
Then create a function to set the colnames:
res=lapply(dfList, function(x){colnames(x)=c(name);x})
[[1]]
USAF WBAN YR--MODAHRMN
1 1 2 3
[[2]]
USAF WBAN YR--MODAHRMN
1 1 2 3
A tidyverse solution with rename_with:
library(dplyr)
library(purrr)
map(dflist, ~ rename_with(., ~ colnames))
Or, if it's only for one column:
map(dflist, ~ rename(., new_col = old_col))
This also works with lapply:
lapply(dflist, rename_with, ~ colnames)
lapply(dflist, rename, new_col = old_col)

Unstacking a stacked dataframe unstacks columns in a different order

Using R 3.1.0
a = as.data.frame(do.call(cbind, lapply(1:100, function(x) { c(1,2,3)})))
b = unstack(stack(a))
# Returns FALSE
all(colnames(a) == colnames(b))
The documentation on stack/unstack says unstacking should "reverse this [stack] operation". Am I missing something? Why do I need to re-order the columns of b?
The last few lines of the stack (see utils:::stack.data.frame) function create a data.frame with two columns, "values" and "ind". The "ind" column is created with the code:
ind = factor(rep.int(names(x), lapply(x, length)))
But, look at how factor works in general (pay attention to the order of the "Levels"):
factor(c(1, 2, 3, 10, 4))
# [1] 1 2 3 10 4
# Levels: 1 2 3 4 10
factor(paste0("A", c(1, 2, 3, 10, 4)))
# [1] A1 A2 A3 A10 A4
# Levels: A1 A10 A2 A3 A4
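The same effect can be seen with column-style names, along with the usual way to keep the original order by passing the levels explicitly. A small sketch; `col_names`, `sorted_levels` and `ordered_ind` are hypothetical names.

```r
# Names in the order they would appear as data frame columns
col_names <- paste0("V", 1:10)

# By default the levels are sorted as strings, so "V10" no longer comes last
sorted_levels <- levels(factor(col_names))

# Passing levels explicitly keeps the order of first appearance
ordered_ind <- factor(col_names, levels = unique(col_names))
identical(levels(ordered_ind), col_names)
```

This is the same idea as supplying unique(names(x)) as the levels when building the ind column.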
If the functionality you describe is important for your analysis, you might do better modifying a version of stack.data.frame to capture the order of the data.frame names during the factoring process, like this:
Stack <- function (x, select, ...)
{
if (!missing(select)) {
nl <- as.list(1L:ncol(x))
names(nl) <- names(x)
vars <- eval(substitute(select), nl, parent.frame())
x <- x[, vars, drop = FALSE]
}
keep <- unlist(lapply(x, is.vector))
if (!sum(keep))
stop("no vector columns were selected")
if (!all(keep))
warning("non-vector columns will be ignored")
x <- x[, keep, drop = FALSE]
data.frame(values = unlist(unname(x)),
# REMOVE THIS --> ind = factor(rep.int(names(x), lapply(x, length))),
# AND ADD THIS:
ind = factor(rep.int(names(x), lapply(x, length)), unique(names(x))),
stringsAsFactors = FALSE)
}
Testing, one, two, three...
## Not using identical here because
## the factor levels are different
all.equal(Stack(a), stack(a))
# [1] TRUE
identical(unstack(Stack(a)), a)
# [1] TRUE
You'll never get me to defend the R documentation...
stack(...) creates a new data frame with two columns, values and ind. The latter has the column names from the original table, as a factor, ordered alphabetically. unstack(...) uses that factor to (re-) create columns of the new data frame. So the phrase "Unstacking reverses this operation" should be interpreted loosely...
To get the result you want, you need to reorder the factor ind, as follows:
a <- as.data.frame(do.call(cbind, lapply(1:100, function(x) { c(1,2,3)})))
c <- stack(a)
c$ind <- factor(c$ind, levels=colnames(a))
d <- unstack(c)
identical(a,d)
# [1] TRUE
