Transforming a nested data frame with varying number of elements - r

I have a data frame with a column of nested data frames with 1 or 2 columns and n rows. It looks like df in the sample below:
'data.frame': 3 obs. of 2 variables:
$ vector:List of 3
..$ : chr "p1"
..$ : chr "p2"
..$ : chr "p3"
$ lists :List of 3
..$ :'data.frame': 2 obs. of 2 variables:
.. ..$ n1: Factor w/ 2 levels "a","b": 1 2
.. ..$ n2: Factor w/ 2 levels "1","2": 1 2
..$ :'data.frame': 1 obs. of 1 variable:
.. ..$ n1: Factor w/ 1 level "d": 1
..$ :'data.frame': 1 obs. of 2 variables:
.. ..$ n1: Factor w/ 1 level "e": 1
.. ..$ n2: Factor w/ 1 level "3": 1
df can be recreated like this:
v <- c("p1", "p2", "p3")
l <- list(data.frame(n1 = c("a", "b"), n2 = c("1", "2")), data.frame(n1 = "d"), data.frame(n1 = "e", n2 = "3"))
df <- as.data.frame(cbind(v, l))
I'd like to transform it into a data frame that looks like this:
[v] [n1] [n2]
p1 a 1
p1 b 2
p2 d NA
p3 e 3
n1 and n2 are in separate columns
if the data frame in row i has n rows, the vector element of row i should be repeated n times
if there is no content in n1 or n2, there should be an NA
I've tried using tidyr::unnest but got the following error:
unnest(df)
Error: All nested columns must have the same number of elements.
Does anyone have a better idea how to transform the data frame into the desired format?

Using purrr::pmap_df, within each row of df, we combine v and l into a single data frame and then combine all of the data frames into a single data frame.
library(tidyverse)
pmap_df(df, function(v,l) {
data.frame(v,l)
})
v n1 n2
1 p1 a 1
2 p1 b 2
3 p2 d <NA>
4 p3 e 3

This will avoid by-row operations, which will be important if you have a lot of rows.
library(data.table)
rbindlist(df$l, fill = TRUE, idcol = 'row')[, v := df$v[row]][]
# row n1 n2 v
#1: 1 a 1 p1
#2: 1 b 2 p1
#3: 2 d NA p2
#4: 3 e 3 p3

A solution using dplyr and tidyr. The suppressWarnings() call is optional: the nested data frames contain factor columns, and it only silences the warning that is raised when those factors are combined.
library(dplyr)
library(tidyr)
df1 <- suppressWarnings(df %>%
mutate(v = unlist(.$v)) %>%
unnest())
df1
# v n1 n2
# 1 p1 a 1
# 2 p1 b 2
# 3 p2 d <NA>
# 4 p3 e 3
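For comparison, the same reshaping can be sketched in base R only. This is a minimal sketch, not one of the posted answers: it builds v and l as plain character data (stringsAsFactors = FALSE is an assumption made here to avoid the factor warnings), and the helper names all_cols, pieces and flat are made up for illustration.

```r
v <- c("p1", "p2", "p3")
l <- list(
  data.frame(n1 = c("a", "b"), n2 = c("1", "2"), stringsAsFactors = FALSE),
  data.frame(n1 = "d", stringsAsFactors = FALSE),
  data.frame(n1 = "e", n2 = "3", stringsAsFactors = FALSE)
)

# Every column name that occurs in any nested data frame
all_cols <- unique(unlist(lapply(l, names)))

# For each row: add missing columns as NA, then attach the (recycled) v value
pieces <- Map(function(vi, li) {
  for (nm in setdiff(all_cols, names(li))) li[[nm]] <- NA_character_
  data.frame(v = vi, li[all_cols], stringsAsFactors = FALSE)
}, v, l)

# Stack the per-row pieces into one data frame
flat <- do.call(rbind, pieces)
rownames(flat) <- NULL
flat
```

Because this loops over rows, it is likely to be slower than the data.table approach on large inputs; it is only meant to show the logic without extra packages.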

Related

Creating an empty dataframe in R with column names stored in two separate lists

I have two separate lists containing column names of a new dataframe df to be created.
fixed <- list("a", "b")
variable <- list("a1", "b1", "c1")
How do I proceed so that the column names of df appear in the order a, b, a1, b1, c1?
Probably, unlist both lists, concatenate them, and subset the data:
df[unlist(c(fixed, variable))]
If there are additional elements in the lists that are not column names in 'df', use intersect:
df[intersect(unlist(c(fixed, variable)), names(df))]
a a1 c1
1 7 8 1
2 3 1 5
3 8 5 4
4 7 5 6
5 2 5 6
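A small sketch of why intersect is useful here: it keeps only the requested names that actually exist, in the order of its first argument. The toy data frame below (including the extra column z) is an assumption made for illustration and is not the data from the answer.

```r
fixed <- list("a", "b")
variable <- list("a1", "b1", "c1")

# Toy data frame: columns are out of order, and "b" and "b1" are absent
df <- data.frame(c1 = 1:2, a = 3:4, a1 = 5:6, z = 7:8)

wanted <- unlist(c(fixed, variable))   # "a" "b" "a1" "b1" "c1"
keep <- intersect(wanted, names(df))   # names present in df, in 'wanted' order
keep
df[keep]
```

Subsetting with df[wanted] directly would fail on this toy data with an "undefined columns selected" error, because b and b1 are missing.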
If it is a null data.frame, we could do
v1 <- unlist(c(fixed, variable))
df <- as.data.frame(matrix(numeric(), nrow = 0,
ncol = length(v1), dimnames = list(NULL, v1)))
str(df)
'data.frame': 0 obs. of 5 variables:
$ a : num
$ b : num
$ a1: num
$ b1: num
$ c1: num
Or another option is
df <- data.frame(setNames(rep(list(0), length(v1)), v1))[0,]
str(df)
'data.frame': 0 obs. of 5 variables:
$ a : num
$ b : num
$ a1: num
$ b1: num
$ c1: num
data
v1 <- c('a', 'd2', 'c', 'a1', 'd1', 'c1', 'e1')
set.seed(24)
df <- as.data.frame(matrix(sample(1:9, 5 * length(v1),
replace = TRUE), ncol = length(v1), dimnames = list(NULL, v1)))

How do I lookup the closest row in a lookup data.frame based on multiple columns?

Suppose I have the following dataframe with data (v) and a lookup dataframe (l):
v <- data.frame(d = c(as.Date('2019-01-01'), as.Date('2019-01-05'), as.Date('2019-01-30'), as.Date('2019-02-02')), kind=c('a', 'b', 'c', 'a'), v1=c(1,2,3,4))
v
d kind v1
1 2019-01-01 a 1
2 2019-01-05 b 2
3 2019-01-30 c 3
4 2019-02-02 a 4
l <- data.frame(d = c(as.Date('2019-01-01'), as.Date('2019-01-04'), as.Date('2019-02-01')), kind=c('a','b','a'), l1=c(10,20,30))
l
d kind l1
1 2019-01-01 a 10
2 2019-01-04 b 20
3 2019-02-01 a 30
I would like to find the closest row in the l dataframe corresponding to each row in v using the columns: c("d", "kind"). Column kind needs to match exactly and maybe use findInterval(...) on d?
I would like my result to be:
d kind v1 l1
1 2019-01-01 a 1 10
2 2019-01-05 b 2 20
3 2019-01-30 c 3 NA
4 2019-02-02 a 4 30
NOTE: I would prefer a base-R implementation, but it would be interesting to see others.
I tried findInterval(...) but I don't know how to get it to work with multiple columns.
Here's a shot in base-R only. (I do believe that data.table will do this much more elegantly, but I appreciate your aversion to bring in other packages.)
Split each frame into a list of frames, by kind:
v_spl <- split(v, v$kind)
l_spl <- split(l, l$kind)
str(v_spl)
# List of 3
# $ a:'data.frame': 2 obs. of 3 variables:
# ..$ d : Date[1:2], format: "2019-01-01" "2019-02-02"
# ..$ kind: Factor w/ 3 levels "a","b","c": 1 1
# ..$ v1 : num [1:2] 1 4
# $ b:'data.frame': 1 obs. of 3 variables:
# ..$ d : Date[1:1], format: "2019-01-05"
# ..$ kind: Factor w/ 3 levels "a","b","c": 2
# ..$ v1 : num 2
# $ c:'data.frame': 1 obs. of 3 variables:
# ..$ d : Date[1:1], format: "2019-01-30"
# ..$ kind: Factor w/ 3 levels "a","b","c": 3
# ..$ v1 : num 3
Now we determine the unique kind we have in common between the two, no need to try to join everything:
### this has the 'kind' in common
(nms <- intersect(names(v_spl), names(l_spl)))
# [1] "a" "b"
### this has the 'kind' we have to bring back in later
(miss_nms <- setdiff(names(v_spl), nms))
# [1] "c"
For the in-common kind, do an interval join:
joined <- Map(
v_spl[nms], l_spl[nms],
f = function(v0, l0) {
ind <- findInterval(v0$d, l0$d)
ind[ ind < 1 ] <- NA
v0$l1 <- l0$l1[ind]
v0
})
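To see why indices below 1 are replaced with NA in the step above, here is a standalone sketch of what findInterval returns. The vectors breaks, x and lookup are made up for illustration; the dates are converted with as.numeric explicitly, which is an assumption made here for clarity.

```r
# Sorted lookup dates and the values attached to them
breaks <- as.Date(c("2019-01-01", "2019-02-01"))
lookup <- c(10, 30)

# Dates to look up; the first one falls before every lookup date
x <- as.Date(c("2018-12-31", "2019-01-01", "2019-01-15", "2019-02-02"))

# Index of the last break that is <= each x; 0 means "before the first break"
idx <- findInterval(as.numeric(x), as.numeric(breaks))
idx_raw <- idx

# lookup[0] would silently drop the element, so mark those cases as NA
idx[idx < 1] <- NA
lookup[idx]
```

Note that this picks the most recent lookup date on or before each value, which is not necessarily the nearest date in absolute terms.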
Ultimately we will rbind things back together, but those in miss_nms will not have the new column(s). This is a generic way to capture exactly one row of the new columns with an appropriate NA value:
emptycols <- joined[[1]][, setdiff(colnames(joined[[1]]), colnames(v)),drop=FALSE][1,,drop=FALSE][NA,,drop=FALSE]
emptycols
# l1
# NA NA
And add that column(s) to the not-yet-found frames:
unjoined <- lapply(v_spl[miss_nms], cbind, emptycols)
unjoined
# $c
# d kind v1 l1
# 3 2019-01-30 c 3 NA
And finally bring everything back into a single frame:
do.call(rbind, c(joined, unjoined))
# d kind v1 l1
# a.1 2019-01-01 a 1 10
# a.4 2019-02-02 a 4 30
# b 2019-01-05 b 2 20
# c 2019-01-30 c 3 NA
If you want an exact match you would go:
vl <- merge(v, l, by = c("d","kind"))
For your purposes, you can derive additional variables from d (year, month, or day) and merge on those instead.
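A minimal sketch of that idea, assuming a year-month key is coarse enough for the data at hand. The column name ym is made up for illustration, and all.x = TRUE is added so that rows of v without a match are kept with NA.

```r
v <- data.frame(d = as.Date(c("2019-01-01", "2019-01-05", "2019-01-30", "2019-02-02")),
                kind = c("a", "b", "c", "a"), v1 = c(1, 2, 3, 4),
                stringsAsFactors = FALSE)
l <- data.frame(d = as.Date(c("2019-01-01", "2019-01-04", "2019-02-01")),
                kind = c("a", "b", "a"), l1 = c(10, 20, 30),
                stringsAsFactors = FALSE)

# Hypothetical helper key: year and month of each date
v$ym <- format(v$d, "%Y-%m")
l$ym <- format(l$d, "%Y-%m")

# Left join on the coarse key plus kind; l's own date column is left out
res <- merge(v, l[c("ym", "kind", "l1")], by = c("ym", "kind"), all.x = TRUE)

# Restore the original row order and column layout
res <- res[order(res$d), c("d", "kind", "v1", "l1")]
rownames(res) <- NULL
res
```

This is a bucketed match rather than a true nearest-date match: two dates on either side of a month boundary will not be paired, and several lookup rows in the same month would each produce a row in the result.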

Why does mutate not accept a data.frame as a column to nest?

library(tidyverse)
a = data.frame(c1 = c(1,2,3), c2 = c("a","b","c"))
b = data.frame(c3 = c(TRUE,FALSE,TRUE))
a %>% mutate(c_nested = b)
produces an error:
Error: Column c_nested is of unsupported class data.frame
How do I add a column that contains a nested data.frame?
Many thanks!
We can pass it as a list column
a %>%
mutate(c_nested = list(b))
res <-
a %>%
`$<-`(c_nested, b)
str(res)
# 'data.frame': 3 obs. of 3 variables:
# $ c1 : num 1 2 3
# $ c2 : Factor w/ 3 levels "a","b","c": 1 2 3
# $ c_nested:'data.frame': 3 obs. of 1 variable:
# ..$ c3: logi TRUE FALSE TRUE
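The two answers above produce different structures, which a base-R sketch can make explicit. This is an illustration under the assumption that plain `$<-` assignment is acceptable; the names a_list and a_df are made up here.

```r
a <- data.frame(c1 = c(1, 2, 3), c2 = c("a", "b", "c"))
b <- data.frame(c3 = c(TRUE, FALSE, TRUE))

# Option 1: a list column, where every row holds its own copy of b
a_list <- a
a_list$c_nested <- rep(list(b), nrow(a_list))

# Option 2: a data-frame column, where row i of b sits next to row i of a
a_df <- a
a_df$c_nested <- b

str(a_list$c_nested[[1]])  # a whole data frame per cell
str(a_df$c_nested)         # one data frame spanning all rows
```

Which shape is appropriate depends on whether b describes each row separately (list column) or lines up row by row with a (data-frame column).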

Aggregate() + quantile() with output as data frame

I am using quantile() within an aggregate(), see below.
The result is formatted as a data frame; however, as you can see in the str() output, the quantiles are stored inside a single column (a matrix) rather than as separate columns. How do I get the output to be a data frame in which all 'columns' are actual columns, i.e. names(result) returns:
"group" "subgroup" "value.0%" "value.25%" "value.50%" "value.75%" "value.100%"
(I don't care about the actual names, I just want to be able to use setNames())
n=1000
df <- data.frame(group=sample(c("A", "B", "C"), n, replace=T),
subgroup=sample(c("g1", "g2"), n, replace=T),
value=sample(1:10000, n, replace=T))
head(df)
result <- aggregate(value ~ group + subgroup, df, function(x) quantile(x, probs = seq(0,1, 0.25)))
> result
group subgroup value.0% value.25% value.50% value.75% value.100%
1 A g1 26.00 3088.00 5738.00 7473.00 9852.00
2 B g1 26.00 2450.00 4592.50 7319.00 9989.00
3 C g1 17.00 2989.00 5565.00 7611.00 9944.00
4 A g2 96.00 2843.75 4912.00 7719.50 9815.00
5 B g2 77.00 2802.50 4725.50 6996.75 9950.00
6 C g2 115.00 2606.00 4776.50 7673.25 9878.00
> str(result)
'data.frame': 6 obs. of 3 variables:
$ group : Factor w/ 3 levels "A","B","C": 1 2 3 1 2 3
$ subgroup: Factor w/ 2 levels "g1","g2": 1 1 1 2 2 2
$ value : num [1:6, 1:5] 26 26 17 96 77 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : NULL
.. ..$ : chr "0%" "25%" "50%" "75%" ...
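One possible way to flatten the matrix column is to convert it to a data frame and bind it next to the grouping columns. This is a sketch under assumptions: the seed is added only for reproducibility, and the replacement names q0 to q100 are made up for illustration.

```r
set.seed(1)  # assumption: fixed seed so the example is reproducible
n <- 1000
df <- data.frame(group = sample(c("A", "B", "C"), n, replace = TRUE),
                 subgroup = sample(c("g1", "g2"), n, replace = TRUE),
                 value = sample(1:10000, n, replace = TRUE))

result <- aggregate(value ~ group + subgroup, df,
                    function(x) quantile(x, probs = seq(0, 1, 0.25)))

# result$value is a 5-column matrix; turn it into 5 ordinary columns
flat <- cbind(result[c("group", "subgroup")], as.data.frame(result$value))

# Hypothetical column names, applied with setNames() as the question requests
flat <- setNames(flat, c("group", "subgroup", "q0", "q25", "q50", "q75", "q100"))
str(flat)
```

After this step every quantile is a plain numeric column, so the usual column-wise operations apply.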

comparing values in a row

I am trying to compare values on data frame rows, and removing all the ones that match, with this
dat[!dat[1]==dat[2]]
where
> dat
returns
n1 n2
n1 n4
n4 n5
n1 n3
n4 n4
So I want it to compare the values and delete the last row, since both columns have the same data. But when I use the above code, it tells me
Error in Ops.factor(left, right) : level sets of factors are different
the str(dat) reads
'data.frame': 5 obs. of 2 variables:
$ V1: Factor w/ 2 levels "n1","n4": 1 1 2 1 2
$ V2: Factor w/ 4 levels "n2","n3","n4",..: 1 3 4 2 3
I suspect in the creation of your data, you inadvertently and implicitly converted your columns to factors. This possibly happened when you read the data from source, e.g. when using read.csv or read.table. This example illustrates it:
dat <- read.table(text="
n1 n2
n1 n4
n4 n5
n1 n3
n4 n4")
str(dat)
'data.frame': 5 obs. of 2 variables:
$ V1: Factor w/ 2 levels "n1","n4": 1 1 2 1 2
$ V2: Factor w/ 4 levels "n2","n3","n4",..: 1 3 4 2 3
The remedy is to pass the argument stringsAsFactors=FALSE to read.table():
dat <- read.table(text="
n1 n2
n1 n4
n4 n5
n1 n3
n4 n4", stringsAsFactors=FALSE)
str(dat)
'data.frame': 5 obs. of 2 variables:
$ V1: chr "n1" "n1" "n4" "n1" ...
$ V2: chr "n2" "n4" "n5" "n3" ...
Then your code works (except that I suspect you've missed a comma):
dat[!dat[1]==dat[2], ]
V1 V2
1 n1 n2
2 n1 n4
3 n4 n5
4 n1 n3
One solution would be to instruct the data frame not to convert character vectors into factors (using stringsAsFactors = FALSE):
x <- c('n1', 'n1', 'n4', 'n1', 'n4')
y <- c('n2', 'n4', 'n5', 'n3', 'n4')
df <- data.frame(x, y, stringsAsFactors = FALSE)
# Keep rows whose two values differ; unlike -which(...), this is safe when no rows match
df <- df[df$x != df$y, ]
After creating the data frame the code removes the matching rows, producing the result you wanted.
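If the data frame already contains factor columns and re-reading the data is not convenient, the comparison can also be made on the character values. A minimal sketch, assuming the columns are named V1 and V2 as in the question; the names same and kept are made up for illustration.

```r
# Recreate the situation from the question: two factors with different level sets
dat <- data.frame(
  V1 = factor(c("n1", "n1", "n4", "n1", "n4")),
  V2 = factor(c("n2", "n4", "n5", "n3", "n4"))
)

# Compare the labels rather than the factors themselves
same <- as.character(dat$V1) == as.character(dat$V2)

# Keep only the rows where the two values differ
kept <- dat[!same, ]
kept
```

Comparing via as.character sidesteps the "level sets of factors are different" error, since no factor-to-factor comparison takes place.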
