Bind r data.frames that contain column(s) of nested data.frames - r

After importing multiple .json files with jsonlite, I was looking for ways to bind the resulting data.frames, which contained one or more columns that were themselves nested data.frames.
I came across the following post, https://r.789695.n4.nabble.com/data-frame-with-nested-data-frame-td3162660.html, which helped highlight the problem.
## Create nested data.frames
dat1 <- data.frame(x = 1)
dat1$y <- data.frame(y1 = "a", y2 = "A", stringsAsFactors = FALSE)
dat2 <- data.frame(x = 2)
dat2$y <- data.frame(y1 = "b", stringsAsFactors = FALSE)
None of these work
rbind(dat1, dat2)
dplyr::bind_rows(dat1, dat2)
data.table::rbindlist(list(dat1, dat2))
I've discovered a few workarounds which I'll post below in case they help others.

This could be done without additional packages, too. The data frames need to be partly unlisted within a list and then merged using Reduce.
Reduce(function(...) merge(..., all=TRUE), Map(unlist, list(dat1, dat2), recursive=FALSE))
# x y.y1 y.y2
# 1 1 a A
# 2 2 b <NA>
This also works with more than two nested data frames.
dat3 <- data.frame(x=2, y=data.frame(y1="c", y2="C", z="CC", stringsAsFactors=FALSE))
Reduce(function(...) merge(..., all=TRUE), Map(unlist, list(dat1, dat2, dat3), recursive=FALSE))
# x y.y1 y.y2 y.z
# 1 1 a A <NA>
# 2 2 b <NA> <NA>
# 3 2 c C CC
Data
dat1 <- structure(list(x = 1, y = structure(list(y1 = "a", y2 = "A"), class = "data.frame",
row.names = c(NA, -1L))), row.names = c(NA, -1L),
class = "data.frame")
dat2 <- structure(list(x = 2, y = structure(list(y1 = "b"), class = "data.frame",
row.names = c(NA, -1L))), row.names = c(NA, -1L),
class = "data.frame")
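To see what the Map(unlist, ..., recursive = FALSE) step contributes before the merge, the sketch below applies it to the single-row dat1 from the question. The name flat1 is only illustrative. Only the one-row case shown in the question is exercised here; data frames with several rows are not checked.

```r
# Rebuild the nested data frame from the question
dat1 <- data.frame(x = 1)
dat1$y <- data.frame(y1 = "a", y2 = "A", stringsAsFactors = FALSE)

# A data frame is a list of columns, so a non-recursive unlist() removes
# one level of nesting and joins the outer and inner names with "."
flat1 <- unlist(dat1, recursive = FALSE)

names(flat1)  # the flattened names that merge() later matches on
str(flat1)    # a plain list, no longer a nested data frame
```

Because the flattened names are shared across inputs, merge(..., all = TRUE) can line the pieces up and fill the gaps with NA.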

Flatten the data first (for base rbind data.frames need to have identical column names)
dplyr::bind_rows(
jsonlite::flatten(dat1),
jsonlite::flatten(dat2)
)
Put the data.frames into a list before binding (all approaches now work)
dat1$y <- list(dat1$y)
dat2$y <- list(dat2$y)
rbind(dat1, dat2)
dplyr::bind_rows(dat1, dat2)
data.table::rbindlist(list(dat1, dat2))
Use the tidyverse to nest the data.frames
tib1 <- tidyr::nest(dat1, y = c(y))
tib2 <- tidyr::nest(dat2, y = c(y))
tib3 <- dplyr::bind_rows(tib1, tib2)
tidyr::unnest(tib3, c(y))
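For the list-column approach above, a short sketch of what the bound result looks like may help. It uses only base R and the data from the question; the name bound is illustrative.

```r
dat1 <- data.frame(x = 1)
dat1$y <- data.frame(y1 = "a", y2 = "A", stringsAsFactors = FALSE)
dat2 <- data.frame(x = 2)
dat2$y <- data.frame(y1 = "b", stringsAsFactors = FALSE)

# Wrap each nested data frame in a list so that y becomes a list column
dat1$y <- list(dat1$y)
dat2$y <- list(dat2$y)

bound <- rbind(dat1, dat2)

# Each element of the list column is still one of the nested data frames
bound$y[[1]]
bound$y[[2]]
```

The nested frames keep their own, possibly different, columns, so nothing has to be padded with NA at bind time.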

Related

How can I change rownames of dataframes in a list adding their own list names in R?

I try to change row.names of each dataframe in a list by adding their names to rownames.
List is:
l <- list(a=data.frame(col = c(1,2,3),row.names = c("k","l","m")), b=data.frame(col =
c(4,5,6), row.names = c("o","p","r")))
I tried this but couldn't add index names a and b:
lapply(l,function(x) {x %>% `row.names<-` (paste(names(l)[which(l %in%
x)],rownames(x),sep = "."))})
l$a is:
row.names
col
k
1
l
2
m
3
But it should be like this:
row.names
col
a.k
1
a.l
2
a.m
3
What should I do? Thank you in advance.
Instead of iterating over the list, we can iterate over the names of the list, which gives better control when pasting the names together.
This is modified from your attempt:
setNames(lapply(names(l),
\(x) l[[x]] %>% `row.names<-` (paste(x, rownames(l[[x]]), sep = "."))),
names(l))
$a
col
a.k 1
a.l 2
a.m 3
$b
col
b.o 4
b.p 5
b.r 6
An approach using mapply
mapply(function(lis, nm){rownames(lis) <- paste0(nm, ".", rownames(lis))
list(lis)},
dlis, names(dlis))
$a
col
a.k 1
a.l 2
a.m 3
$b
col
b.o 4
b.p 5
b.r 6
Data
dlis <- list(a = structure(list(col = c(1, 2, 3)), class = "data.frame", row.names = c("k",
"l", "m")), b = structure(list(col = c(4, 5, 6)), class = "data.frame", row.names = c("o",
"p", "r")))
1) rbind: Use rbind to create a single data frame with the desired row names and then split it back to the original.
split(do.call("rbind", l), rep(names(l), sapply(l, nrow)))
2) data.frame: Use data.frame and its row.names= argument to avoid row.names<-.
rnames <- function(x, nm) data.frame(x, row.names = paste0(nm, ".", rownames(x)))
Map(rnames, l, names(l))
3) for: Use a for loop, creating ll.
ll <- l
for(nm in names(ll)) rownames(ll[[nm]]) <- paste0(nm, ".", rownames(ll[[nm]]))
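The rbind approach in 1) relies on rbind prefixing row names with the list names. A minimal sketch of the two steps follows, using the list from the question; the names combined, groups and result are illustrative, and the explicit factor levels are an added assumption to keep the original list order.

```r
l <- list(
  a = data.frame(col = c(1, 2, 3), row.names = c("k", "l", "m")),
  b = data.frame(col = c(4, 5, 6), row.names = c("o", "p", "r"))
)

# Step 1: rbind with named arguments prefixes each row name with the list name
combined <- do.call("rbind", l)
rownames(combined)

# Step 2: split back into one data frame per original list element.
# Fixing the factor levels keeps the order of names(l); a plain character
# vector would be sorted alphabetically by split().
groups <- factor(rep(names(l), sapply(l, nrow)), levels = names(l))
result <- split(combined, groups)
result$a
```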

Filter rows based on one column from a list of dataframes

I have a list of multiple data frames, and I would like to filter each of them by certain values in one column. Each data frame in the list has a column called v1, which contains the special strings ++ and ->. I want to keep only the rows that have the arrow (->) in v1, for each data frame in the list. This is a sample of my data frames:
dput(df)
df1 <- structure(list(v1 = c("->", "++", "->"),
t2 = c("James","Jane", "Egg")),
class = "data.frame", row.names = c(NA, -3L))
dput(df2)
df2 <- structure(list(v1 = c("++", "->", "->"),
t2 = c("James","Jane", "Egg")),
class = "data.frame", row.names = c(NA, -3L))
dput(df3)
df3 <- structure(list(v1 = c("++", "++", "->"),
t2 = c("James","Jane", "Egg"),
d3...c = c("James","Jane", "Egg")),
class = "data.frame", row.names = c(NA, -3L))
I have tried this, but I am not getting data frames of the filtered rows:
idx = "->"
dfs <- list(df1,df2,df3)
lapply(dfs, function(x) x$v1 %in% idx)
Can someone help?
idx <- "->"
# Base R
lapply(dfs, function(df) df[df$v1 == "->",])
lapply(dfs, function(df) df[df$v1 %in% idx,])
# tidyverse
library("purrr")
library("dplyr")
map(dfs, filter, v1 == "->")
map(dfs, filter, v1 %in% !! idx)
Try this:
idx <- "->"
fnct <- function(df) df[df$v1 %in% idx, ]  # keep only the rows whose v1 is in idx
df1_idx <- fnct(df1)
df2_idx <- fnct(df2)
df3_idx <- fnct(df3)
dfs <- list(df1_idx, df2_idx, df3_idx)
dfs
Result:
[[1]]
v1 t2
1 -> James
3 -> Egg
[[2]]
v1 t2
2 -> Jane
3 -> Egg
[[3]]
v1 t2 d3...c
3 -> Egg Egg
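The attempt in the question returns logical vectors because x$v1 %in% idx only builds the row mask; the mask still has to be used to index the rows. A small sketch separating the two steps, with two of the sample data frames (masks and filtered are illustrative names):

```r
df1 <- data.frame(v1 = c("->", "++", "->"), t2 = c("James", "Jane", "Egg"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(v1 = c("++", "->", "->"), t2 = c("James", "Jane", "Egg"),
                  stringsAsFactors = FALSE)
dfs <- list(df1, df2)
idx <- "->"

# Step 1: one logical vector per data frame (this is all the original attempt did)
masks <- lapply(dfs, function(df) df$v1 %in% idx)

# Step 2: use each mask as a row index; drop = FALSE keeps the result a data frame
filtered <- Map(function(df, keep) df[keep, , drop = FALSE], dfs, masks)
filtered
```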

Re-cast column types to a data frame which has already been read

I have a data frame df1 (with many columns) which I want to join with another data frame df2 that is supposed to have the same column types. However, for some reason when written and re-read they have acquired different types.
When I want to join these data frames, due to some of the columns which do not have the same type (but should have had), it refuses to join.
How can I force R to re-cast the classes of df2 to those of df1?
For example:
df1 <- data.frame(x = c(NA, NA, "3", "3"), y = c(NA, NA, "a", "b"))
df1_class <- sapply(df1, class) #first, determine the different classes of df1
df2 <- data.frame(x = c(NA, NA, 3, 3), y = c(NA, NA, "a", "b")) # df2 is
# equal to df1 but has a different class in column x
# now cast column x of df2 as class "character" - but do this for all
# columns together because there are many columns....
Using the purrr package, the following will update df2 to match the classes of df1:
df1_class <- sapply(df1, class)
df2 <-
purrr::map2_df(
df2,
df1_class,
~ do.call(paste0('as.', .y), list(.x))
)
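The same idea can be sketched in base R, assuming every target class has a matching as.<class> function (true for character, numeric, integer and logical; other classes such as factors or dates may need extra arguments). The helper name cast_col is hypothetical. Note that sapply(df1, class) returns a list rather than a character vector if some column has more than one class.

```r
df1 <- data.frame(x = c(NA, NA, "3", "3"), y = c(NA, NA, "a", "b"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(x = c(NA, NA, 3, 3), y = c(NA, NA, "a", "b"),
                  stringsAsFactors = FALSE)

df1_class <- sapply(df1, class)

# Look up the coercion function by name (e.g. "as.character") and apply it
cast_col <- function(col, cls) match.fun(paste0("as.", cls))(col)

# df2[] <- ... replaces the columns but keeps df2 a data frame
df2[] <- Map(cast_col, df2, df1_class)

sapply(df2, class)
```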
You could change the ?mode of each column using "mode<-" via Map.
df2[] <- Map(f = "mode<-", x = df2, value = df1_class)
df2
# A tibble: 4 x 3
# x y z
# <chr> <chr> <dbl>
#1 NA NA 2
#2 NA NA 2
#3 3 a 2
#4 3 b 2
Your data extended by a third column for illustration.
data
library(tibble)
df1 <- tibble(x = c(NA, NA, "3", "3"), y = c(NA, NA, "a", "b"), z = 1)  # data_frame() is deprecated in favour of tibble()
df2 <- tibble(x = c(NA, NA, 3, 3), y = c(NA, NA, "a", "b"), z = 2L)
(df1_class <- sapply(df1, class))
# x y z
#"character" "character" "numeric"

Creating new dataframe using weighted averages from dataframes within list

I have many dataframes stored in a list, and I want to create weighted averages from these and store the results in a new dataframe. For example, with the list:
dfs <- structure(list(df1 = structure(list(A = 4:5, B = c(8L, 4L), Weight = c(TRUE, TRUE), Site = c("X", "X")),
.Names = c("A", "B", "Weight", "Site"), row.names = c(NA, -2L), class = "data.frame"),
df2 = structure(list(A = c(6L, 8L), B = c(9L, 4L), Weight = c(FALSE, TRUE), Site = c("Y", "Y")),
.Names = c("A", "B", "Weight", "Site"), row.names = c(NA, -2L), class = "data.frame")),
.Names = c("df1", "df2"))
In this example, I want to use columns A, B, and Weight for the weighted averages. I also want to move over related data such as Site, and want to sum the number of TRUE and FALSE. My desired result would look something like:
result <- structure(list(Site = structure(1:2, .Label = c("X", "Y"), class = "factor"),
A.Weight = c(4.5, 8), B.Weight = c(6L, 4L), Sum.Weight = c(2L,
1L)), .Names = c("Site", "A.Weight", "B.Weight", "Sum.Weight"
), class = "data.frame", row.names = c(NA, -2L))
Site A.Weight B.Weight Sum.Weight
1 X 4.5 6 2
2 Y 8.0 4 1
The above is just a very simple example, but my real data have many dataframes in the list, and many more columns than just A and B for which I want to calculate weighted averages. I also have several columns similar to Site that are constant in each dataframe and that I want to move to the result.
I'm able to manually calculate weighted averages using something like
weighted.mean(dfs$df1$A, dfs$df1$Weight)
weighted.mean(dfs$df1$B, dfs$df1$Weight)
weighted.mean(dfs$df2$A, dfs$df2$Weight)
weighted.mean(dfs$df2$B, dfs$df2$Weight)
but I'm not sure how I can do this in a shorter, less "manual" way. Does anyone have any recommendations? I've recently learned how to lapply across dataframes in a list, but my attempts have not been so great so far.
The trick is to create a function that works for a single data.frame, then use lapply to iterate across your list. Since lapply returns a list, we'll then use do.call to rbind the resulting objects together:
foo <- function(data, meanCols = LETTERS[1:2], weightCol = "Weight", otherCols = "Site") {
means <- t(sapply(data[, meanCols], weighted.mean, w = data[, weightCol]))
sumWeight <- sum(data[, weightCol])
others <- data[1, otherCols, drop = FALSE] #You said all the other data was constant, so we can just grab first row
out <- data.frame(others, means, sumWeight)
return(out)
}
In action:
do.call(rbind, lapply(dfs, foo))
---
Site A B sumWeight
df1 X 4.5 6 2
df2 Y 8.0 4 1
Since you said this was a minimal example, here's one approach to expanding this to other columns. We'll use grepl() and use regular expressions to identify the right columns. Alternatively, you could write them all out in a vector. Something like this:
do.call(rbind, lapply(dfs, foo,
meanCols = grepl("A|B", names(dfs[[1]])),
otherCols = grepl("Site", names(dfs[[1]]))
))
using dplyr
library(dplyr)
library('devtools')
install_github('hadley/tidyr')
library(tidyr)
unnest(dfs) %>%
group_by(Site) %>%
filter(Weight) %>%
mutate(Sum=n()) %>%
select(-Weight) %>%
summarise_each(funs(mean=mean(., na.rm=TRUE)))
gives the result
# Site A B Sum
#1 X 4.5 6 2
#2 Y 8.0 4 1
Or using data.table
library(data.table)
DT <- rbindlist(dfs)
DT[(Weight)][, c(lapply(.SD, mean, na.rm = TRUE),
Sum=.N), by = Site, .SDcols = c("A", "B")]
# Site A B Sum
#1: X 4.5 6 2
#2: Y 8.0 4 1
Update
In response to @jazzuro's comment: using dplyr 0.3, I get
unnest(dfs) %>%
group_by(Site) %>%
summarise_each(funs(weighted.mean=stats::weighted.mean(., Weight),
Sum.Weight=sum(Weight)), -starts_with("Weight")) %>%
select(Site:B_weighted.mean, Sum.Weight=A_Sum.Weight)
# Site A_weighted.mean B_weighted.mean Sum.Weight
#1 X 4.5 6 2
#2 Y 8.0 4 1
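The dplyr code above was written against early package versions. As a hedged sketch of how the same weighted summary might look with a more recent dplyr (assuming version 1.0 or later for across()), with the list rebuilt from the question and result as an illustrative name:

```r
library(dplyr)

dfs <- list(
  df1 = data.frame(A = 4:5, B = c(8L, 4L), Weight = c(TRUE, TRUE),
                   Site = c("X", "X"), stringsAsFactors = FALSE),
  df2 = data.frame(A = c(6L, 8L), B = c(9L, 4L), Weight = c(FALSE, TRUE),
                   Site = c("Y", "Y"), stringsAsFactors = FALSE)
)

result <- bind_rows(dfs) %>%          # stack the list into one data frame
  group_by(Site) %>%
  summarise(
    # weighted mean of every listed column, using Weight as the weights
    across(c(A, B), ~ weighted.mean(.x, Weight)),
    Sum.Weight = sum(Weight)          # number of TRUE weights per site
  )

result
```

For more value columns, the c(A, B) selection can be swapped for any tidyselect expression that excludes Weight and the grouping columns.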

Merge Multiple Data Frames by Row Names

I'm trying to merge multiple data frames by row names.
I know how to do it with two:
x = data.frame(a = c(1,2,3), row.names = letters[1:3])
y = data.frame(b = c(1,2,3), row.names = letters[1:3])
merge(x,y, by = "row.names")
But when I try using the reshape package's merge_all() I'm getting an error.
z = data.frame(c = c(1,2,3), row.names = letters[1:3])
l = list(x,y,z)
merge_all(l, by = "row.names")
Error in -ncol(df) : invalid argument to unary operator
What's the best way to do this?
Merging by row.names does weird things - it creates a column called Row.names, which makes subsequent merges hard.
To avoid that issue, you can instead create a column holding the row names (which is generally a better idea anyway, since row names are very limited and hard to manipulate). One way of doing that with the data as given in the OP (not the most efficient way; for more efficient and easier ways of dealing with rectangular data, I recommend getting to know data.table instead):
Reduce(merge, lapply(l, function(x) data.frame(x, rn = row.names(x))))
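A slightly expanded sketch of the same idea, with the key column named explicitly in by= and the row names restored at the end. The data are those from the question; with_rn and merged are illustrative names.

```r
x <- data.frame(a = c(1, 2, 3), row.names = letters[1:3])
y <- data.frame(b = c(1, 2, 3), row.names = letters[1:3])
z <- data.frame(c = c(1, 2, 3), row.names = letters[1:3])
l <- list(x, y, z)

# Copy the row names into an ordinary column so they survive repeated merges
with_rn <- lapply(l, function(d) {
  data.frame(rn = row.names(d), d, stringsAsFactors = FALSE)
})

# Merge pairwise on the key column
merged <- Reduce(function(u, v) merge(u, v, by = "rn"), with_rn)

# Optionally move the key back into the row names
rownames(merged) <- merged$rn
merged$rn <- NULL
merged
```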
Maybe there is a faster version using do.call or *apply, but this works in your case:
x = data.frame(X = c(1,2,3), row.names = letters[1:3])
y = data.frame(Y = c(1,2,3), row.names = letters[1:3])
z = data.frame(Z = c(1,2,3), row.names = letters[1:3])
merge.all <- function(x, ..., by = "row.names") {
L <- list(...)
for (i in seq_along(L)) {
x <- merge(x, L[[i]], by = by)
rownames(x) <- x$Row.names
x$Row.names <- NULL
}
return(x)
}
merge.all(x,y,z)
It may be important to define explicitly, in merge.all, every parameter (like by) that you want to forward to merge, since everything passed through ... is treated as the list of objects to merge.
As an alternative to Reduce and merge:
If you put all the data frames into a list, you can then use grep and cbind to get the data frames with the desired row names.
## set up the data
x <- data.frame(x1 = c(2,4,6), row.names = letters[1:3])
y <- data.frame(x2 = c(3,6,9), row.names = letters[1:3])
z <- data.frame(x3 = c(1,2,3), row.names = letters[1:3])
a <- data.frame(x4 = c(4,6,8), row.names = letters[4:6])
lst <- list(a, x, y, z)
## combine all the data frames with row names = letters[1:3]
gg <- grep(paste(letters[1:3], collapse = ""),
sapply(lapply(lst, rownames), paste, collapse = ""))
do.call(cbind, lst[gg])
## x1 x2 x3
## a 2 3 1
## b 4 6 2
## c 6 9 3
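Because grep matches on the pasted string, it can also pick up data frames whose row names merely contain or concatenate to the same characters. A variant that compares the row names exactly is sketched below; this is a swapped-in technique using Filter and identical, not the answer's own method, and target, matching and combined are illustrative names.

```r
x <- data.frame(x1 = c(2, 4, 6), row.names = letters[1:3])
y <- data.frame(x2 = c(3, 6, 9), row.names = letters[1:3])
z <- data.frame(x3 = c(1, 2, 3), row.names = letters[1:3])
a <- data.frame(x4 = c(4, 6, 8), row.names = letters[4:6])
lst <- list(a, x, y, z)

target <- letters[1:3]

# Keep only the data frames whose row names equal the target exactly, in order
matching <- Filter(function(d) identical(rownames(d), target), lst)

# cbind is safe here because every kept data frame has the same row order
combined <- do.call(cbind, matching)
combined
```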
