Force dplyr not to drop attributes - is it possible? - r

Consider the simple example:
library(dplyr)
dat <- data.frame( a = 1, b = 2 )
attr(dat, "myattr") <- "xyz"
dat %>% mutate(c = 3) %>% str()
## 'data.frame': 1 obs. of 3 variables:
## $ a: num 1
## $ b: num 2
## $ c: num 3
So dplyr drops the attribute. Is it possible to force it not to drop it?
More generally: is it possible to force R not to drop attributes when an object's class changes?
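One possible workaround, sketched here on the assumption that the attribute only needs to survive a single transformation step, is to save it beforehand and reattach it afterwards. The helper name with_attr is hypothetical (it is not part of dplyr or base R), and cbind merely stands in for any function that may drop the attribute, such as mutate.

```r
# Hypothetical helper: apply f to x, then restore one named attribute.
with_attr <- function(x, which, f, ...) {
  saved <- attr(x, which, exact = TRUE)  # remember the attribute value
  out <- f(x, ...)                       # run the transformation
  attr(out, which) <- saved              # reattach it to the result
  out
}

dat <- data.frame(a = 1, b = 2)
attr(dat, "myattr") <- "xyz"

# cbind is only a stand-in for a step that may lose the attribute
res <- with_attr(dat, "myattr", function(d) cbind(d, c = 3))
attr(res, "myattr")
```

This does not change how dplyr or R treat attributes in general; it only wraps one call at a time.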

Related

Creating an empty dataframe in R with column names stored in two separate lists

I have two separate lists containing column names of a new dataframe df to be created.
fixed <- list("a", "b")
variable <- list("a1", "b1", "c1")
How do I proceed so that the column names of df appear in the order a, b, a1, b1, c1?
Probably the simplest way is to concatenate both lists, unlist the result, and use it to subset the data:
df[unlist(c(fixed, variable))]
If the lists contain elements that are not column names in 'df', use intersect:
df[intersect(unlist(c(fixed, variable)), names(df))]
##   a a1 c1
## 1 7  8  1
## 2 3  1  5
## 3 8  5  4
## 4 7  5  6
## 5 2  5  6
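As a small self-contained illustration of the intersect step: the data frame df_demo below is made up for this sketch and is not the data from the answer. intersect keeps the order of its first argument, so the wanted order is preserved while names missing from the data are dropped silently.

```r
fixed <- list("a", "b")
variable <- list("a1", "b1", "c1")

# Hypothetical data: columns in a different order, "b" and "b1" missing
df_demo <- data.frame(c1 = 1:2, a = 3:4, a1 = 5:6, z = 7:8)

wanted <- unlist(c(fixed, variable))        # "a" "b" "a1" "b1" "c1"
res <- df_demo[intersect(wanted, names(df_demo))]
names(res)
```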
If an empty data.frame (zero rows) is needed, we could do
v1 <- unlist(c(fixed, variable))
df <- as.data.frame(matrix(numeric(), nrow = 0,
    ncol = length(v1), dimnames = list(NULL, v1)))
str(df)
'data.frame': 0 obs. of 5 variables:
$ a : num
$ b : num
$ a1: num
$ b1: num
$ c1: num
Or another option is
df <- data.frame(setNames(rep(list(0), length(v1)), v1))[0,]
str(df)
'data.frame': 0 obs. of 5 variables:
$ a : num
$ b : num
$ a1: num
$ b1: num
$ c1: num
Data used for the subsetting examples above:
v1 <- c('a', 'd2', 'c', 'a1', 'd1', 'c1', 'e1')
set.seed(24)
df <- as.data.frame(matrix(sample(1:9, 5 * length(v1),
    replace = TRUE), ncol = length(v1), dimnames = list(NULL, v1)))

R Changing format to columns of dataframes using functional programming

The situation is the following: I have a list of dataframes, and for each dataframe I have a list of columns whose format I need to change. Setup:
df1 <- data.frame(a = c("2020-03-02", "2020-12-22", "2020-07-03"), b = c(4, 5, 6), c = c("2020-03-13", "2019-11-03", "2011-05-02"))
df2 <- data.frame(d = c(1, 2, 3), e = c("2020-05-21", "2014-08-31", "1999-01-21"), f = c(7, 8, 9))
datasets <- list("first" = df1, "second" = df2)
dates <- list("first" = c("a", "c"), "second" = c("e"))
One could do this by (1) looping over the list of dataframes and (2) for each dataframe, looping over the columns to change and reassigning them in place. Something like this:
for (i in names(datasets)) {
  for (j in dates[[i]]) {
    # assign back into the list; assigning to a loop variable would change only a copy
    datasets[[i]][[j]] <- as.Date(datasets[[i]][[j]])
  }
}
This is ugly, so I wanted to try to do the same using purrr. I thought this would be a good idea:
library(purrr)
walk2(datasets, dates, ~ walk(.x[.y], ~ {.x <- as.Date(.x)}))
But the datasets remain unperturbed after this operation. Why?
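A minimal sketch of the likely reason, using base R only: function arguments behave like copies, so assigning to .x inside a lambda only rebinds a local name, and walk() returns its input unchanged because it is meant for side effects. The function name modify_copy and the list x are made up for this illustration.

```r
x <- list(a = "2020-03-02")

# Assigning to the argument changes only the local copy
modify_copy <- function(v) {
  v <- as.Date(v)
  invisible(NULL)
}

modify_copy(x$a)
class_before <- class(x$a)  # still "character"

# The change persists only when the result is assigned back
x$a <- as.Date(x$a)
class_after <- class(x$a)   # now "Date"
```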
Here is a solution that uses purrr and dplyr:
library(purrr)
library(dplyr)
datasets <- datasets %>%
  imap(~ {
    .x %>%
      mutate_at(vars(dates[[.y]]), as.Date)
  })
str(datasets)
#List of 2
#$ first :'data.frame': 3 obs. of 3 variables:
# ..$ a: Date[1:3], format: "2020-03-02" "2020-12-22" "2020-07-03"
# ..$ b: num [1:3] 4 5 6
# ..$ c: Date[1:3], format: "2020-03-13" "2019-11-03" "2011-05-02"
#$ second:'data.frame': 3 obs. of 3 variables:
# ..$ d: num [1:3] 1 2 3
# ..$ e: Date[1:3], format: "2020-05-21" "2014-08-31" "1999-01-21"
# ..$ f: num [1:3] 7 8 9
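For comparison, the same conversion can be sketched in base R with Map, which pairs each data frame with its vector of date columns and returns the modified copies. This is an alternative to the purrr/dplyr answer, not a replacement for it; the name converted is an assumption of this sketch.

```r
df1 <- data.frame(a = c("2020-03-02", "2020-12-22", "2020-07-03"),
                  b = c(4, 5, 6),
                  c = c("2020-03-13", "2019-11-03", "2011-05-02"))
df2 <- data.frame(d = c(1, 2, 3),
                  e = c("2020-05-21", "2014-08-31", "1999-01-21"),
                  f = c(7, 8, 9))
datasets <- list("first" = df1, "second" = df2)
dates <- list("first" = c("a", "c"), "second" = c("e"))

# Convert the listed columns of each data frame and return the new data frame
converted <- Map(function(df, cols) {
  df[cols] <- lapply(df[cols], as.Date)
  df
}, datasets, dates)

str(converted)
```

Map relies on datasets and dates being in the same order, whereas the imap version looks the columns up by name.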

Why does mutate not accept a data.frame as a column to nest?

library(tidyverse)
a = data.frame(c1 = c(1,2,3), c2 = c("a","b","c"))
b = data.frame(c3 = c(TRUE,FALSE,TRUE))
a %>% mutate(c_nested = b)
produces an error:
Error: Column c_nested is of unsupported class data.frame
How do I add a column that contains a nested data.frame?
Many thanks!
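We can pass it as a list column:
a %>%
  mutate(c_nested = list(b))
Alternatively, to store b itself as a single data.frame column rather than as a list column, assign it with `$<-`:
res <-
  a %>%
  `$<-`(c_nested, b)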
str(res)
# 'data.frame': 3 obs. of 3 variables:
# $ c1 : num 1 2 3
# $ c2 : Factor w/ 3 levels "a","b","c": 1 2 3
# $ c_nested:'data.frame': 3 obs. of 1 variable:
# ..$ c3: logi TRUE FALSE TRUE
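The two results differ in shape, which the following base R sketch tries to make visible. The column names as_list and as_df are invented for the illustration: a list column holds one complete object per row, while a data.frame column contributes one of its own rows per row.

```r
a <- data.frame(c1 = c(1, 2, 3))
b <- data.frame(c3 = c(TRUE, FALSE, TRUE))

# List column: every row holds a full copy of b
a$as_list <- rep(list(b), nrow(a))

# Data-frame column: row i of b sits next to row i of a
a$as_df <- b

class(a$as_list)   # a plain list with one element per row
class(a$as_df)     # a data.frame nested inside a
a$as_list[[2]]     # the whole of b
a$as_df$c3[2]      # only the second row's value
```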

How to assign each element of a list as arguments to a function in a loop in R?

I'm new to R. I'd like to get a number of statistics on the numeric columns (say, column C) of a data frame (dt), grouped by combinations of factor columns (say, columns A and B). First I want the results grouped by both A and B, and then the same operations by A alone and by B alone. I've written code that looks like the example below. I have a list of the factor combinations I'd like to test (groupList), and in each iteration of the loop I pass one element of that list as the argument to "by". However, it doesn't work: R doesn't recognize the elements of the list as arguments to "by". Any ideas on how to make this work? Any pointer or suggestion is welcome and appreciated.
groupList <- list(".(A, B)", "A", "B")
for(i in 1:length(groupList)){
  output <- dt[,list(mean=mean(C),
                     sd=sd(C),
                     min=min(C),
                     median=median(C),
                     max=max(C)),
               by = groupList[i]]
  # insert code here to save each output
}
I think the aggregate function can solve your problem. Say you have a dataframe df that contains three columns A, B and C, given as:
df<-data.frame(A=rep(letters[1:3],3),B=rep(letters[4:6],each=3),C=1:9)
If you want to calculate the mean of C by factor A, try (the formula is passed unnamed, because the name of that argument differs between R versions):
aggregate(C~A,data=df,FUN=mean)
by factor B, try:
aggregate(C~B,data=df,FUN=mean)
by factors A and B, try:
aggregate(C~A+B,data=df,FUN=mean)
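To connect this with the loop in the question, the three groupings can be stored as a list of formulas and passed to aggregate one at a time. This is only a sketch: the names df_demo, groupings and results are assumptions, and only the mean is computed here.

```r
df_demo <- data.frame(A = rep(letters[1:3], 3),
                      B = rep(letters[4:6], each = 3),
                      C = 1:9)

# One formula per grouping of interest
groupings <- list(by_A_B = C ~ A + B, by_A = C ~ A, by_B = C ~ B)

# Apply aggregate to each formula and keep the results in a named list
results <- lapply(groupings, function(f) aggregate(f, data = df_demo, FUN = mean))

results$by_A
results$by_B
```

Keeping the results in a named list covers the "save each output" step without creating separate variables.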
Your groupList can be restructured as a list of character vectors. Then you can either use lapply or keep the existing for loop, with an added eval() so that the by= input is interpreted properly:
library(data.table)
set.seed(1)
dt <- data.table(A=rep(1:2,each=5), B=rep(1:5,each=2), C=1:10)
groupList <- list(c("A", "B"), c("A"), c("B"))
# option 1: lapply over the grouping vectors
lapply(
  groupList,
  function(x) {
    dt[, .(mean=mean(C), sd=sd(C)), by=x]
  }
)
# option 2: for loop, storing each result in a list
out <- vector("list", length(groupList))
for(i in seq_along(groupList)){
  out[[i]] <- dt[, .(mean=mean(C), sd=sd(C)), by=eval(groupList[[i]]) ]
}
str(out)
#List of 3
# $ :Classes ‘data.table’ and 'data.frame': 6 obs. of 4 variables:
# ..$ A : int [1:6] 1 1 1 2 2 2
# ..$ B : int [1:6] 1 2 3 3 4 5
# ..$ mean: num [1:6] 1.5 3.5 5 6 7.5 9.5
# ..$ sd : num [1:6] 0.707 0.707 NA NA 0.707 ...
# ..- attr(*, ".internal.selfref")=<externalptr>
# $ :Classes ‘data.table’ and 'data.frame': 2 obs. of 3 variables:
# ..$ A : int [1:2] 1 2
# ..$ mean: num [1:2] 3 8
# ..$ sd : num [1:2] 1.58 1.58
# ..- attr(*, ".internal.selfref")=<externalptr>
# $ :Classes ‘data.table’ and 'data.frame': 5 obs. of 3 variables:
# ..$ B : int [1:5] 1 2 3 4 5
# ..$ mean: num [1:5] 1.5 3.5 5.5 7.5 9.5
# ..$ sd : num [1:5] 0.707 0.707 0.707 0.707 0.707
For demonstration, I used the mtcars data set. Here is one way with the dplyr package.
library(dplyr)
# create a vector with the names of the summary functions that you need
describe <- c("mean", "sd", "min", "median", "max")
# group by the variable gear
mtcars %>%
  group_by(gear) %>%
  summarise_at(vars(mpg), describe)
# group by the variable carb
mtcars %>%
  group_by(carb) %>%
  summarise_at(vars(mpg), describe)
# group by both gear and carb
mtcars %>%
  group_by(gear, carb) %>%
  summarise_at(vars(mpg), describe)
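The three calls above can also be driven from a list of grouping vectors, which is closer to what the question asks for. The sketch below assumes dplyr 1.0.0 or later, where across() and all_of() are available; group_list and results are names made up for this example.

```r
library(dplyr)

# Each element names the columns to group by
group_list <- list(c("gear", "carb"), "gear", "carb")

results <- lapply(group_list, function(g) {
  mtcars %>%
    group_by(across(all_of(g))) %>%
    summarise(
      across(mpg, list(mean = mean, sd = sd, min = min,
                       median = median, max = max)),
      .groups = "drop"
    )
})

results[[2]]  # one row per value of gear
```

The summary columns are named mpg_mean, mpg_sd and so on, following the default across() naming.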

Access aggregated data using values rather than indexes

Using aggregate, R creates a list Z that can be indexed in the form a$Z$`1.2`, where the first number refers to the corresponding element in X and the second to the corresponding element in Y. In addition, if X or Y has 10+ elements, the form changes to a$Z$`01.02` (and presumably 001.002 for 100+ elements).
Instead of having to index Z with the zero-padded index of X and Y, how can I index with the actual X and Y values (e.g. a$Z$`52.60`), which seems much more intuitive?
df = data.frame(X=c(50, 52, 50), Y=c(60, 60, 60), Z=c(4, 5, 6))
a = aggregate(Z ~ X + Y, df, c)
str(a)
'data.frame': 2 obs. of 3 variables:
$ X: num 50 52
$ Y: num 60 60
$ Z:List of 2
..$ 1.1: num 4 6
..$ 1.2: num 5
You can easily do this after aggregate:
names(a$Z) <- paste(a$X, a$Y, sep=".")
Then check the result:
str(a)
'data.frame': 2 obs. of 3 variables:
$ X: num 50 52
$ Y: num 60 60
$ Z:List of 2
..$ 50.60: num 4 6
..$ 52.60: num 5
1) tapply Try tapply instead:
ta <- tapply(df[[3]], df[-3], c)
ta[["50", "60"]]
## [1] 4 6
ta[["52", "60"]]
## [1] 5
2) subset Consider not using aggregate at all and using subset to retrieve the values:
subset(df, X == 50 & Y == 60)$Z
## [1] 4 6
3) data.table Subsetting is even easier with data.table:
library(data.table)
dt <- data.table(df, key = "X,Y")
dt[.(50, 60), Z]
## [1] 4 6
Note: If you are not actually starting with the df shown in the question, and a is instead the result of a series of complex transformations, then df can be recovered like this:
df <- tidyr::unnest(a, Z)
at which point any of the above could be used.
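Another base R possibility, not mentioned in the answers above, is split: when given a list of grouping vectors it names each group by joining the actual values with ".", so the lookup keys match the form asked for in the question. The name by_xy is an assumption of this sketch.

```r
df <- data.frame(X = c(50, 52, 50), Y = c(60, 60, 60), Z = c(4, 5, 6))

# Group Z by the observed (X, Y) combinations; drop = TRUE omits empty ones
by_xy <- split(df$Z, list(df$X, df$Y), drop = TRUE)

names(by_xy)
by_xy[["50.60"]]
by_xy$`52.60`
```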
