Join dataframes, retaining column names - r

I have the following 3 data frames, each of which has named columns. I want to combine them and retain the column names. When I use the patch I found for combining data frames, it drops the name of any data frame that has fewer than 2 columns. How can I retain the names?
x<-data.frame(mean(1:10))
names(x)[names(x) == 'mean.1.10.'] <- 'var.name'
y<-data.frame(1:4)
names(y)[names(y) == 'X1.4'] <- 'var.name2'
z<-data.frame(matrix(1:10,5,2))
names(z)[names(z) == 'X1'] <- 'var.name3'
names(z)[names(z) == 'X2'] <- 'var.name4'
list_datf <- list(x, y, z)
n_r <- seq_len(max(sapply(list_datf, nrow)))
NEW <- do.call(cbind, lapply(list_datf, `[`, n_r, ))

You need to include drop = FALSE in the indexing step so that the things you're binding together retain all of their dimensions. I couldn't figure out a way to do this by passing drop = FALSE as an extra argument to [, so I resorted to using an anonymous function instead.
NEW <- do.call(cbind, lapply(list_datf, function(x) x[n_r, , drop = FALSE]))
Alternatively, you could convert your components to tibbles, which (unlike data frames) never drop "unneeded" dimensions:
NEW <- do.call(cbind, lapply(list_datf, function(x) tibble::as_tibble(x)[n_r, ]))
If you want to go full tidyverse:
library(dplyr)
list_datf %>% purrr::map(~ tibble::as_tibble(.)[n_r, ]) %>% bind_cols()
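To see why drop = FALSE matters here, a minimal base R sketch using the question's data. The column names are set directly in data.frame() for brevity, and the argument name d is just illustrative:

```r
# Same three data frames as in the question, with names set directly
x <- data.frame(var.name = mean(1:10))
y <- data.frame(var.name2 = 1:4)
z <- data.frame(var.name3 = 1:5, var.name4 = 6:10)
list_datf <- list(x, y, z)
n_r <- seq_len(max(sapply(list_datf, nrow)))

# Default indexing collapses a one-column data frame to a bare vector,
# which is where the column name gets lost
is.data.frame(x[n_r, ])                # FALSE
# With drop = FALSE the result stays a data frame and keeps its name
is.data.frame(x[n_r, , drop = FALSE])  # TRUE

NEW <- do.call(cbind, lapply(list_datf, function(d) d[n_r, , drop = FALSE]))
names(NEW)
```

Rows beyond the end of a shorter data frame are filled with NA, so all pieces end up with the same number of rows before cbind.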

Related

Obtaining a vector with sapply and use it to remove rows from dataframes in a list with lapply

I have a list with dataframes:
df1 <- data.frame(id = seq(1:10), name = LETTERS[1:10])
df2 <- data.frame(id = seq(11:20), name = LETTERS[11:20])
mylist <- list(df1, df2)
I want to remove rows from each dataframe in the list based on a condition (in this case, the value stored in column id). I create an empty vector where I will store the ids:
ids_to_remove <- c()
Then I apply my function:
sapply(mylist, function(df) {
rows_above_th <- df[(df$id > 8),] # select the rows from each df above a threshold
a <- rows_above_th$id # obtain the ids of the rows above the threshold
ids_to_remove <- append(ids_to_remove, a) # append each id to the vector
},
simplify = T
)
However, with or without simplify = T, this returns a matrix, while my desired output (ids_to_remove) would be a vector containing the ids, like this:
ids_to_remove <- c(9,10,9,10)
Because lastly I would use it in this way on single dataframes:
for(i in 1:length(ids_to_remove)){
mylist[[1]] <- mylist[[1]] %>%
filter(!id == ids_to_remove[i])
}
And like this on the whole list (which is not working and I don't get why):
i = 1
lapply(mylist,
function(df) {
for(i in 1:length(ids_to_remove)){
df <- df %>%
filter(!id == ids_to_remove[i])
i = i + 1
}
} )
I suspect the errors are in the append part of the sapply and maybe in the indexing of the lapply. I played around a bit but still couldn't find the errors (or a better way to do this).
EDIT: original data has 70 dataframes (in a list) for a total of 2 million rows
If you are using sapply/lapply, you want to avoid trying to change the values of global variables (an assignment inside the function only modifies a local copy). Instead, you should return the values you want. For example, generate a vector of IDs to remove for each item in the list, collected as a list:
ids_to_remove <- lapply(mylist, function(df) {
rows_above_th <- df[(df$id > 8),] # select the rows from each df above a threshold
rows_above_th$id # obtain the ids of the rows above the threshold
})
And then you can use that list with your data list and mapply to iterate the two lists together
mapply(function(data, ids) {
data %>% dplyr::filter(!id %in% ids)
}, mylist, ids_to_remove, SIMPLIFY=FALSE)
Using base R
Map(\(x, y) subset(x, !id %in% y), mylist, ids_to_remove)
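If a single flat vector like the one in the question is still wanted, a minimal sketch is to unlist the per-data-frame results. Note that seq(11:20) in the question evaluates to 1:10, so both sample data frames have ids 1 to 10; the names filtered and d are illustrative only:

```r
df1 <- data.frame(id = 1:10, name = LETTERS[1:10])
df2 <- data.frame(id = 1:10, name = LETTERS[11:20])  # seq(11:20) is 1:10
mylist <- list(df1, df2)

# One vector of ids per data frame, returned rather than appended globally
ids_to_remove <- lapply(mylist, function(df) df$id[df$id > 8])

# Flatten to a single vector if that is the shape you need
unlist(ids_to_remove)

# Drop the matching rows from each data frame, pairing list elements up
filtered <- Map(function(d, ids) d[!d$id %in% ids, , drop = FALSE],
                mylist, ids_to_remove)
sapply(filtered, nrow)
```

Keeping the ids as a list (one element per data frame) means each data frame is only filtered by its own ids, which should also scale better than a row-by-row loop on 70 data frames.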

summary table for dataset in global environment

Is there a way I can get the data info from the global environment into a summary table?
For example, I have a lot of data sets named TXXX in my global environment, like
I would like a table that looks like this
Is it also possible to get the variable list for each data set programmatically?
It would look like this:
Is there any way I can do that by programming? Thanks.
We can use mget to get all the objects whose names start with 'T' followed by a 3-digit number into a list, then loop over the list to get the number of rows ('Obs') and the number of columns ('Variable'), and rbind the list elements after creating the column 'Data' from the names of the list
lst1 <- lapply(mget(ls(pattern = "^T\\d{3}$")),
function(x) data.frame(Obs = nrow(x),
Variable = ncol(x)))
out <- do.call(rbind, Map(cbind, Data = names(lst1), lst1))
row.names(out) <- NULL
If we need the column names, we could use rowr to cbind the column names when the lengths are not the same
lst1 <- lapply(mget(ls(pattern = "^T\\d{3}$")), names)
library(versions)
available.versions('rowr') # check which versions are available (rowr is not on CRAN)
install.versions('rowr', '1.1.2') # install a specific version
library(rowr) # load the package
do.call(cbind.fill, c(lst1, fill = NA))
Or without installing rowr
mx <- max(lengths(lst1))
do.call(cbind, lapply(lst1, `length<-`, mx))
Or using tidyverse
library(dplyr)
library(purrr)
mget(ls(pattern = '^T\\d{3}$')) %>%
map_dfr(~ tibble(Obs = nrow(.x), Variable = ncol(.x)), .id = 'Data')
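A self-contained base R sketch of both tables, for trying the idea out without touching the global environment. The environment e and the data sets T001 and T002 are made up for illustration; in a real session you would drop the envir arguments so that ls() and mget() look in the global environment:

```r
# Hypothetical data sets, kept in their own environment for the demo
e <- new.env()
assign("T001", data.frame(a = 1:3, b = 4:6), envir = e)
assign("T002", data.frame(x = 1:5), envir = e)

# Collect every object whose name is 'T' followed by three digits
objs <- mget(ls(envir = e, pattern = "^T\\d{3}$"), envir = e)

# Summary table: one row per data set
out <- data.frame(Data = names(objs),
                  Obs = unname(vapply(objs, nrow, integer(1))),
                  Variable = unname(vapply(objs, ncol, integer(1))))

# Variable list: one column per data set, padded with NA to equal length
nms <- lapply(objs, names)
vars <- do.call(cbind, lapply(nms, `length<-`, max(lengths(nms))))
out
vars
```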

Repeating calculation with different DF

I have around 10 DFs and would like to perform the following calculations on all of them and then have the output as 10 new DFs.
I have been able to get this to work for 1 DF, but rather than copying the code and changing the names 10 times, I wanted to see if there is a way to avoid that. Ideally, I end up with 1 DF and 10 different columns, but am happy with anything
The calculations I am trying to do are:
temp <- merge (x=DF1, y=temp1, by = c("name"), all.x= TRUE)
asset_column <-grep("^Assets_", names(DF1))
return_column <-grep("^Return_", names(DF1))
OutputDF <-
stack(colSums(t(t(temp[asset_column])/colSums(temp[asset_column],
na.rm=TRUE)) * US_only[return_column],na.rm =TRUE))
OutputDF['values'] = OutputDF['values']/100
If the data frames are stored in a list (here lst1), loop through the list with lapply and run the same code, where the argument x of the anonymous function (function(x) ...) takes the place of the first dataset (DF1)
out <- lapply(lst1, function(x) {
temp <- merge (x, y=temp1, by = c("name"), all.x= TRUE)
asset_column <-grep("^Assets_", names(x))
return_column <-grep("^Return_", names(x))
OutputDF <-
stack(colSums(t(t(temp[asset_column])/colSums(temp[asset_column],
na.rm=TRUE)) * US_only[return_column],na.rm =TRUE))
OutputDF['values'] = OutputDF['values']/100
OutputDF
})
Here, the output is also a list of data.frames, which can be kept as a list or extracted individually with [[
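For the "1 DF with 10 columns" layout mentioned in the question, one possible sketch is to give the list names and bind the stacked results side by side. The two small data frames and the simplified calculation (plain column sums of the Assets_ columns) are stand-ins for illustration only, not the weighting formula from the question:

```r
# Hypothetical inputs; in practice this could be list(DF1 = DF1, DF2 = DF2, ...)
lst1 <- list(
  DF1 = data.frame(name = c("a", "b"), Assets_x = c(1, 3), Assets_y = c(2, 2)),
  DF2 = data.frame(name = c("a", "b"), Assets_x = c(5, 5), Assets_y = c(1, 9))
)

# Same pattern as above: one stacked data frame (values, ind) per input
out <- lapply(lst1, function(x) {
  asset_column <- grep("^Assets_", names(x))
  stack(colSums(x[asset_column], na.rm = TRUE))
})

# One column per input data frame, one row per stacked variable
combined <- sapply(out, function(o) setNames(o$values, o$ind))
combined
```

This relies on every element of out having the same variables in the same order; if that is not guaranteed, merging on ind would be the safer route.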

Is there an easy way to tell if many data frames stored in one list contain the same columns?

I have a list containing many data frames:
df1 <- data.frame(A = 1:5, B = 2:6, C = LETTERS[1:5])
df2 <- data.frame(A = 1:5, B = 2:6, C = LETTERS[1:5])
df3 <- data.frame(A = 1:5, C = LETTERS[1:5])
my_list <- list(df1, df2, df3)
I want to know if every data frame in this list contains the same columns (i.e., the same number of columns, all having the same names and in the same order).
I know that you can easily find column names of data frames in a list using lapply:
lapply(my_list, colnames)
Is there a way to determine if any differences in column names occur? I realize this is a complicated question involving pairwise comparisons.
You can avoid pairwise comparison by simply checking whether the count of each column name is == length(my_list). This simultaneously checks the number of columns and the names of your data frames (but not the column order) -
lapply(my_list, names) %>%
unlist() %>%
table() %>%
all(. == length(my_list))
[1] FALSE
In base R i.e. without %>% -
all(table(unlist(lapply(my_list, names))) == length(my_list))
[1] FALSE
or slightly more optimized -
!any(table(unlist(lapply(my_list, names))) != length(my_list))
Here's another base solution with Reduce:
!is.logical(
Reduce(function(x,y) if(identical(x,y)) x else FALSE
, lapply(my_list, names)
)
)
You could also account for same columns in a different order with
!is.logical(
Reduce(function(x,y) if(identical(x,y)) x else FALSE
, lapply(my_list, function(z) sort(names(z)))
)
)
As for what's going on, Reduce() accumulates as it goes through the list. First, identical(names_df1, names_df2) is evaluated. If it's TRUE, the function returns that same names vector, so we can keep comparing it against the remaining members of the list. If it's FALSE, the function returns FALSE, which is not identical to any names vector, so the result stays FALSE.
Finally, if every comparison evaluates as TRUE, we get a character vector returned. Since you probably want a logical output, !is.logical(...) is used to turn that result into a boolean.
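To watch the accumulation step by step, Reduce() can be called with accumulate = TRUE, which keeps every intermediate result. A small sketch on plain name vectors mirroring df1, df2 and df3 (the helper name same_or_false is made up here):

```r
# Column names of the three example data frames
nms <- list(c("A", "B", "C"), c("A", "B", "C"), c("A", "C"))

# Return the names while they keep matching, FALSE as soon as they differ
same_or_false <- function(x, y) if (identical(x, y)) x else FALSE

# Intermediate results: names, names (1 vs 2 match), FALSE (3 differs)
steps <- Reduce(same_or_false, nms, accumulate = TRUE)
steps

# The final element is what the non-accumulating call returns
!is.logical(steps[[length(steps)]])
```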
See also here as I was very inspired by another post:
check whether all elements of a list are in equal in R
And a similar one that I saw after my edit:
Test for equality between all members of list
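We can use dplyr::bind_rows, which fills columns missing from a data frame with NA. Note that this assumes the original data contain no NA values and does not check column order: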
!any(is.na(dplyr::bind_rows(my_list)))
# [1] FALSE
Here is my answer:
k <- 1
output <- NULL
for(i in 1:(length(my_list) - 1)) {
for(j in (i + 1):length(my_list)) {
output[k] <- identical(colnames(my_list[[i]]), colnames(my_list[[j]]))
k <- k + 1
}
}
all(output)
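Since the question asks for the same names in the same order, another compact base R option is to count the distinct name vectors. This is a sketch; the helper same_columns and the reordered data frame df4 are introduced here only for illustration:

```r
df1 <- data.frame(A = 1:5, B = 2:6, C = LETTERS[1:5])
df2 <- data.frame(A = 1:5, B = 2:6, C = LETTERS[1:5])
df3 <- data.frame(A = 1:5, C = LETTERS[1:5])
df4 <- df1[c("B", "A", "C")]  # same columns as df1, different order

# TRUE only if every data frame has identical names in identical order
same_columns <- function(lst) length(unique(lapply(lst, names))) == 1

same_columns(list(df1, df2, df3))  # FALSE: df3 lacks column B
same_columns(list(df1, df2))       # TRUE
same_columns(list(df1, df4))       # FALSE: order differs

# A count-based check, by contrast, ignores order
count_check <- all(table(unlist(lapply(list(df1, df4), names))) == 2)
count_check                        # TRUE
```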

Merging two data.frames by two columns each

I have a huge data.frame that I want to reorder. The idea was to split it in half (as the first half contains different information than the second half) and create a third data frame which would be the combination of the two. As I always need the first two columns of the first data frame followed by the first two columns of the second data frame, I need help.
new1<-all_cont_video_algo[,1:826]
new2<-all_cont_video_algo[,827:length(all_cont_video_algo)]
df3<-data.frame()
The new data frame should look like the following:
new3[new1[1],new1[2],new2[1],new2[2],new1[3],new1[4],new2[3],new2[4],new1[5],new1[6],new2[5],new2[6], etc.].
Pseudoalgorithmically, cbind 2 columns from data frame new1 then cbind 2 columns from data frame new2 etc.
I tried the following now (thanks to Akrun):
new1<-all_cont_video_algo[,1:826]
new2<-all_cont_video_algo[,827:length(all_cont_video_algo)]
new1<-as.data.frame(new1, stringsAsFactors =FALSE)
new2<-as.data.frame(new2, stringsAsFactors =FALSE)
df3<-data.frame()
f1 <- function(Ncol, n) {
as.integer(gl(Ncol, n, Ncol))
}
lst1 <- split.default(new1, f1(ncol(new1), 2))
lst2 <- split.default(new2, f1(ncol(new2), 2))
lst3 <- Map(function(x, y) df3[unlist(cbind(x, y))], lst1, lst2)
However, this gives me an "undefined columns selected" error.
See whether the below code helps
library(tidyverse)
# Two sample data frames of equal number of columns and rows
df1 = mtcars %>% select(-1)
df2 = diamonds %>% slice(1:32)
# get the column names
dn1 = names(df1)
dn2 = names(df2)
# create new ordered list
neworder = map(seq(1,length(dn1),2), # sequence with interval 2
~c(dn1[.x:(.x+1)], dn2[.x:(.x+1)])) %>% # a vector of two columns each
unlist %>% # flatten the list
na.omit # remove NAs arising from odd number of columns
# Get the data frame ordered
df3 = bind_cols(df1, df2) %>%
select(all_of(neworder))
It is not clear without a reproducible example. Based on the description, we can split the columns of each dataset into a list of datasets, use Map to cbind the columns of the corresponding list elements, unlist, and use the result to order the third dataset
1) Create a function to return a grouping column for splitting the dataset
f1 <- function(Ncol, n) {
as.integer(gl(Ncol, n, Ncol))
}
2) split the datasets into a list
lst1 <- split.default(df1, f1(ncol(df1), 2))
lst2 <- split.default(df2, f1(ncol(df2), 2))
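3) Map through the corresponding list elements, cbind and unlist, and use that to subset the columns of 'df3' (this assumes 'df3' already exists with matching columns; it is not defined in the sample data below)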
lst3 <- Map(function(x, y) df3[unlist(cbind(x, y))], lst1, lst2)
data
df1 <- as.data.frame(matrix(letters[1:10], 2, 5), stringsAsFactors = FALSE)
df2 <- as.data.frame(matrix(1:10, 2, 5))
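If the goal is to interleave the columns of the two data frames themselves (two from the first, two from the second, and so on), one possible base R sketch reuses the same grouping helper to compute a column order. The column names a1..a5 and b1..b5 are assigned here only to make the resulting order visible; they are not part of the original data:

```r
f1 <- function(Ncol, n) {
  as.integer(gl(Ncol, n, Ncol))
}

df1 <- as.data.frame(matrix(letters[1:10], 2, 5), stringsAsFactors = FALSE)
df2 <- as.data.frame(matrix(1:10, 2, 5))
names(df1) <- paste0("a", 1:5)  # illustrative names
names(df2) <- paste0("b", 1:5)  # illustrative names

grp <- f1(ncol(df1), 2)  # block id per column: 1 1 2 2 3
both <- cbind(df1, df2)

# Sort columns by block first, then by source (df1 before df2)
ord <- order(c(grp, grp), rep(1:2, each = ncol(df1)))
df3 <- both[ord]
names(df3)
```

This assumes both data frames have the same number of columns; with an odd count, the last block simply contributes one column from each side.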
