Aggregate() + quantile() with output as data frame

I am using quantile() within aggregate(), as shown below.
The result is a data frame, but as str() shows, the quantiles are stored as a single matrix column rather than as separate columns. How do I get a data frame in which all the 'columns' are actual columns, i.e. names(result) returns
"group" "subgroup" "value.0%" "value.25%" "value.50%" "value.75%" "value.100%"
(I don't care about the actual names; I just want to be able to use setNames().)
n <- 1000
df <- data.frame(group = sample(c("A", "B", "C"), n, replace = TRUE),
                 subgroup = sample(c("g1", "g2"), n, replace = TRUE),
                 value = sample(1:10000, n, replace = TRUE))
head(df)
result <- aggregate(value ~ group + subgroup, df,
                    function(x) quantile(x, probs = seq(0, 1, 0.25)))
> result
group subgroup value.0% value.25% value.50% value.75% value.100%
1 A g1 26.00 3088.00 5738.00 7473.00 9852.00
2 B g1 26.00 2450.00 4592.50 7319.00 9989.00
3 C g1 17.00 2989.00 5565.00 7611.00 9944.00
4 A g2 96.00 2843.75 4912.00 7719.50 9815.00
5 B g2 77.00 2802.50 4725.50 6996.75 9950.00
6 C g2 115.00 2606.00 4776.50 7673.25 9878.00
> str(result)
'data.frame': 6 obs. of 3 variables:
$ group : Factor w/ 3 levels "A","B","C": 1 2 3 1 2 3
$ subgroup: Factor w/ 2 levels "g1","g2": 1 1 1 2 2 2
$ value : num [1:6, 1:5] 26 26 17 96 77 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : NULL
.. ..$ : chr "0%" "25%" "50%" "75%" ...
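
One possible way to get real columns (a sketch, not taken from the original post): since str() shows that value is a numeric matrix embedded in the data frame, it can be converted to a data frame and bound back onto the grouping columns. The names flat and the new column labels are only illustrative.

```r
set.seed(42)
n <- 1000
df <- data.frame(group = sample(c("A", "B", "C"), n, replace = TRUE),
                 subgroup = sample(c("g1", "g2"), n, replace = TRUE),
                 value = sample(1:10000, n, replace = TRUE))
result <- aggregate(value ~ group + subgroup, df,
                    function(x) quantile(x, probs = seq(0, 1, 0.25)))

# result$value is a matrix with one column per quantile;
# convert it to ordinary columns and bind it to the grouping columns
flat <- cbind(result[c("group", "subgroup")], as.data.frame(result$value))

# every column is now a plain vector, so setNames() works as usual
flat <- setNames(flat, c("group", "subgroup", "min", "q25", "median", "q75", "max"))
str(flat)
```

After this, str(flat) should list seven separate variables instead of three.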

Related

Creating an empty dataframe in R with column names stored in two separate lists

I have two separate lists containing column names of a new dataframe df to be created.
fixed <- list("a", "b")
variable <- list("a1", "b1", "c1")
How do I create df so that its column names appear in the order a, b, a1, b1, c1?

Probably the simplest approach is to unlist both lists, concatenate them, and subset the data:
df[unlist(c(fixed, variable))]
If the lists contain elements that are not column names in 'df', use intersect:
df[intersect(unlist(c(fixed, variable)), names(df))]
a a1 c1
1 7 8 1
2 3 1 5
3 8 5 4
4 7 5 6
5 2 5 6
If an empty (zero-row) data.frame is needed, we could do
v1 <- unlist(c(fixed, variable))
df <- as.data.frame(matrix(numeric(), nrow = 0,
ncol = length(v1), dimnames = list(NULL, v1)))
str(df)
'data.frame': 0 obs. of 5 variables:
$ a : num
$ b : num
$ a1: num
$ b1: num
$ c1: num
Or another option is
df <- data.frame(setNames(rep(list(0), length(v1)), v1))[0,]
> str(df)
'data.frame': 0 obs. of 5 variables:
$ a : num
$ b : num
$ a1: num
$ b1: num
$ c1: num
data
v1 <- c('a', 'd2', 'c', 'a1', 'd1', 'c1', 'e1')
set.seed(24)
df <- as.data.frame(matrix(sample(1:9, 5 * length(v1),
replace = TRUE), ncol = length(v1), dimnames = list(NULL, v1)))

append to dataframe in function - is globalenv really required

I am using the following code, which works fine (improvement suggestions very much welcome):
WeeklySlopes <- function(Year, Week){
DynamicQuery <- paste('select DayOfYear, Week, Year, Close from SourceData where year =', Year, 'and week =', Week, 'order by DayOfYear')
SubData = sqldf(DynamicQuery)
SubData$X <- as.numeric(rownames(SubData))
lmfit <- lm(Close ~ X, data = SubData)
lmfit <- tidy(lmfit)
Slope <- as.numeric(sqldf("select estimate from lmfit where term = 'X'"))
e <- globalenv()
e$WeeklySlopesDf[nrow(e$WeeklySlopesDf) + 1,] = c(Year,Week, Slope)
}
WeeklySlopesDf <- data.frame(Year = integer(), Week = integer(), Slope = double())
WeeklySlopes(2017, 15)
WeeklySlopes(2017, 14)
head(WeeklySlopesDf)
Is there really no other way to append a row to my existing dataframe? I seem to need to access the globalenv. On the other hand, why can sqldf 'see' the 'global' dataframe SourceData?

dfrm <- data.frame(a=1:10, b=letters[1:10]) # reproducible example
myfunc <- function(new_a=20){ g <- globalenv(); g$dfrm[3,1] <- new_a; cat(dfrm[3,1])}
myfunc()
20
dfrm
a b
1 1 a
2 2 b
3 20 c # so your strategy might work, although it's unconventional.
Now try to extend dataframe outside a function:
dfrm[11, ] <- c(a=20,b="c")
An occult disaster (conversion of numeric column to character):
str(dfrm)
'data.frame': 11 obs. of 2 variables:
$ a: chr "1" "2" "20" "4" ...
$ b: Factor w/ 10 levels "a","b","c","d",..: 1 2 3 4 5 6 7 8 9 10 ...
So use a list to avoid occult coercion:
dfrm <- data.frame(a=1:10, b=letters[1:10]) # start over
dfrm[11, ] <- list(a=20,b="c")
str(dfrm)
'data.frame': 11 obs. of 2 variables:
$ a: num 1 2 3 4 5 6 7 8 9 10 ...
$ b: Factor w/ 10 levels "a","b","c","d",..: 1 2 3 4 5 6 7 8 9 10 ...
Now try within a function:
myfunc <- function(new_a=20, new_b="ZZ"){ g <- globalenv(); g$dfrm[nrow(dfrm)+1, ] <- list(a=new_a,b=new_b)}
myfunc()
Warning message:
In `[<-.factor`(`*tmp*`, iseq, value = "ZZ") :
invalid factor level, NA generated
str(dfrm)
'data.frame': 12 obs. of 2 variables:
$ a: num 1 2 3 4 5 6 7 8 9 10 ...
$ b: Factor w/ 10 levels "a","b","c","d",..: 1 2 3 4 5 6 7 8 9 10 ...
So it succeeds, but if there are any factor columns, values that are not existing levels get turned into NA (with a warning). Your method of using named access to objects in the global environment is rather unconventional, but there is a set of tested methods that you might want to examine. Look at ?R6. Other options are <<- and assign(), which lets you specify the environment in which the assignment is to occur.
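
A more conventional alternative, sketched here under assumptions: let the function return a one-row data frame and bind the rows in the caller, so nothing in the global environment is modified from inside the function. The helper name weekly_slope and the toy SourceData are hypothetical, and base subsetting stands in for the sqldf query from the question.

```r
# Toy stand-in for the asker's SourceData (values are made up)
SourceData <- data.frame(Year = 2017,
                         Week = rep(c(14, 15), each = 5),
                         DayOfYear = 94:103,
                         Close = c(10, 12, 14, 16, 18, 20, 23, 26, 29, 32))

# Returns a one-row data frame instead of writing to a global object
weekly_slope <- function(data, year, week) {
  week_data <- data[data$Year == year & data$Week == week, ]
  week_data <- week_data[order(week_data$DayOfYear), ]
  week_data$X <- seq_len(nrow(week_data))
  fit <- lm(Close ~ X, data = week_data)
  data.frame(Year = year, Week = week, Slope = coef(fit)[["X"]])
}

# Collect the rows in the caller
weeks <- data.frame(Year = c(2017, 2017), Week = c(15, 14))
WeeklySlopesDf <- do.call(rbind, Map(function(y, w) weekly_slope(SourceData, y, w),
                                     weeks$Year, weeks$Week))
WeeklySlopesDf
```

Because each row is built as a data frame, the column types are kept and no coercion to character takes place.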

how to assign each element of a list as arguments to a function in a loop in R?

I'm new to R. I'd like to get a number of statistics on the numeric columns (say, column C) of a data frame (dt), based on combinations of factor columns (say, columns A and B). First I want the results grouped by both A and B, and then the same statistics grouped by A alone and by B alone. I've written code that looks like the example below. I have a list of the factor combinations that I'd like to test (groupList), and on each iteration of the loop I pass an element of that list as the argument to "by". However, as you can surely see, it doesn't work: R doesn't recognize the elements of the list as arguments to "by". Any ideas on how to make this work? Any pointer or suggestion is welcome and appreciated.
groupList <- list(".(A, B)", "A", "B")
for(i in 1:length(groupList)){
output <- dt[,list(mean=mean(C),
sd=sd(C),
min=min(C),
median=median(C),
max=max(C)),
by = groupList[i]]
# insert code here to save each output
}

I think the aggregate function can solve your problem. Say you have a data frame df containing three columns A, B, C, given as:
df <- data.frame(A = rep(letters[1:3], 3), B = rep(letters[4:6], each = 3), C = 1:9)
To calculate the mean of C by factor A, try (the formula is passed unnamed, because the name of that argument differs between R versions):
aggregate(C ~ A, data = df, FUN = mean)
by factor B, try:
aggregate(C ~ B, data = df, FUN = mean)
by factors A and B, try:
aggregate(C ~ A + B, data = df, FUN = mean)
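
To tie this back to the loop in the question, the three calls can be driven from a list of formulas. This is only a sketch; the names groupFormulas and out are made up for illustration.

```r
df <- data.frame(A = rep(letters[1:3], 3), B = rep(letters[4:6], each = 3), C = 1:9)

# One formula per grouping to test
groupFormulas <- list(C ~ A + B, C ~ A, C ~ B)

# Apply the same aggregation to every grouping and keep the results in a list
out <- lapply(groupFormulas, function(f) aggregate(f, data = df, FUN = mean))

out[[2]]  # means of C by A
out[[3]]  # means of C by B
```

Each element of out is an ordinary data frame, so it can be saved or inspected separately.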

Your groupList can be restructured as a list of character vectors. Then you can either use lapply or the existing for loop with an added eval() to interpret the by= input properly:
library(data.table)
set.seed(1)
dt <- data.table(A=rep(1:2,each=5), B=rep(1:5,each=2), C=1:10)
groupList <- list(c("A", "B"), c("A"), c("B"))
lapply(
groupList,
function(x) {
dt[, .(mean=mean(C), sd=sd(C)), by=x]
}
)
out <- vector("list", 3)
for(i in 1:length(groupList)){
out[[i]] <- dt[, .(mean=mean(C), sd=sd(C)), by=eval(groupList[[i]]) ]
}
str(out)
#List of 3
# $ :Classes ‘data.table’ and 'data.frame': 6 obs. of 4 variables:
# ..$ A : int [1:6] 1 1 1 2 2 2
# ..$ B : int [1:6] 1 2 3 3 4 5
# ..$ mean: num [1:6] 1.5 3.5 5 6 7.5 9.5
# ..$ sd : num [1:6] 0.707 0.707 NA NA 0.707 ...
# ..- attr(*, ".internal.selfref")=<externalptr>
# $ :Classes ‘data.table’ and 'data.frame': 2 obs. of 3 variables:
# ..$ A : int [1:2] 1 2
# ..$ mean: num [1:2] 3 8
# ..$ sd : num [1:2] 1.58 1.58
# ..- attr(*, ".internal.selfref")=<externalptr>
# $ :Classes ‘data.table’ and 'data.frame': 5 obs. of 3 variables:
# ..$ B : int [1:5] 1 2 3 4 5
# ..$ mean: num [1:5] 1.5 3.5 5.5 7.5 9.5
# ..$ sd : num [1:5] 0.707 0.707 0.707 0.707 0.707

For demonstration, I used the mtcars data set. Here is one way with the dplyr package.
library(dplyr)
# create a vector of functions that you need
describe <- c("mean", "sd", "min", "median", "max")
# group by the variable gear
mtcars %>%
group_by(gear) %>%
summarise_at(vars(mpg), describe)
# group by the variable carb
mtcars %>%
group_by(carb) %>%
summarise_at(vars(mpg), describe)
# group by both gear and carb
mtcars %>%
group_by(gear, carb) %>%
summarise_at(vars(mpg), describe)
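
In dplyr 1.0 and later, summarise_at() is superseded by across(), and the three groupings can be generated from a list. A sketch under those assumptions (the names stats, group_sets and results are illustrative):

```r
library(dplyr)

# Named list of summary functions; the names become column suffixes
stats <- list(mean = mean, sd = sd, min = min, median = median, max = max)

# Each element is one set of grouping columns
group_sets <- list(c("gear", "carb"), "gear", "carb")

results <- lapply(group_sets, function(g) {
  mtcars %>%
    group_by(across(all_of(g))) %>%
    summarise(across(mpg, stats), .groups = "drop")
})

results[[2]]  # statistics of mpg by gear
```

The summary columns are named mpg_mean, mpg_sd and so on, following the default across() naming.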

Transforming a nested data frame with varying number of elements

I have a data frame with a column of nested data frames with 1 or 2 columns and n rows. It looks like df in the sample below:
'data.frame': 3 obs. of 2 variables:
$ vector:List of 3
..$ : chr "p1"
..$ : chr "p2"
..$ : chr "p3"
$ lists :List of 3
..$ :'data.frame': 2 obs. of 2 variables:
.. ..$ n1: Factor w/ 2 levels "a","b": 1 2
.. ..$ n2: Factor w/ 2 levels "1","2": 1 2
..$ :'data.frame': 1 obs. of 1 variable:
.. ..$ n1: Factor w/ 1 level "d": 1
..$ :'data.frame': 1 obs. of 2 variables:
.. ..$ n1: Factor w/ 1 level "e": 1
.. ..$ n2: Factor w/ 1 level "3": 1
df can be recreated like this :
v <- c("p1", "p2", "p3")
l <- list(data.frame(n1 = c("a", "b"), n2 = c("1", "2")), data.frame(n1 = "d"), data.frame(n1 = "e", n2 = "3"))
df <- as.data.frame(cbind(v, l))
I'd like to transform it to a data frame that looks like that:
[v] [n1] [n2]
p1 a 1
p1 b 2
p2 d NA
p3 e 3
n1 and n2 are in separate columns
if the data frame in row i has n rows, the vector element of row i should be repeated n times
if there is no content in n1 or n2, there should be a NA
I've tried using tidyr::unnest but got the following error
unnest(df)
Error: All nested columns must have the same number of elements.
Does anyone have a better idea of how to transform the data frame into the desired format?

Using purrr::pmap_df, within each row of df, we combine v and l into a single data frame and then combine all of the data frames into a single data frame.
library(tidyverse)
pmap_df(df, function(v,l) {
data.frame(v,l)
})
v n1 n2
1 p1 a 1
2 p1 b 2
3 p2 d <NA>
4 p3 e 3

This will avoid by-row operations, which will be important if you have a lot of rows.
library(data.table)
rbindlist(df$l, fill = TRUE, idcol = 'row')[, v := df$v[row]][]
# row n1 n2 v
#1: 1 a 1 p1
#2: 1 b 2 p1
#3: 2 d NA p2
#4: 3 e 3 p3

A solution using dplyr and tidyr. suppressWarnings is not strictly required: because the nested data frames contain factor columns, it only hides the warning messages raised when the factors are combined.
library(dplyr)
library(tidyr)
df1 <- suppressWarnings(df %>%
mutate(v = unlist(.$v)) %>%
unnest())
df1
# v n1 n2
# 1 p1 a 1
# 2 p1 b 2
# 3 p2 d <NA>
# 4 p3 e 3
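
For comparison, here is a base R sketch with no extra packages. It works on the original v and l objects rather than on the combined df, and the helper name pad is made up: each nested data frame is padded with NA for any missing column, tagged with its key, and the pieces are row-bound.

```r
v <- c("p1", "p2", "p3")
l <- list(data.frame(n1 = c("a", "b"), n2 = c("1", "2")),
          data.frame(n1 = "d"),
          data.frame(n1 = "e", n2 = "3"))

# All column names that occur in any nested data frame
all_cols <- unique(unlist(lapply(l, names)))

# Add missing columns as NA and put the columns in a common order
pad <- function(d) {
  for (nm in setdiff(all_cols, names(d))) d[[nm]] <- NA
  d[all_cols]
}

# Repeat each key for every row of its nested data frame, then stack
out <- do.call(rbind, Map(function(key, d) cbind(v = key, pad(d)), v, l))
rownames(out) <- NULL
out
```

This still loops over the rows, so for large inputs the rbindlist approach above is likely to be faster.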

Access aggregated data using values rather than indexes

Using aggregate, R creates a list Z that can be indexed in the form a$Z$`1.2`, where the first number references the corresponding element in X, and the second the corresponding element in Y. In addition, if X or Y has 10+ elements, the form changes to a$Z$`01.02` (and presumably 001.002 for 100+ elements).
Instead of having to index Z with the zero-padded index value of X and Y, how can I index with the actual X and Y values instead (e.g. a$Z$`52.60`)? That seems much more intuitive.
df = data.frame(X=c(50, 52, 50), Y=c(60, 60, 60), Z=c(4, 5, 6))
a = aggregate(Z ~ X + Y, df, c)
str(a)
'data.frame': 2 obs. of 3 variables:
$ X: num 50 52
$ Y: num 60 60
$ Z:List of 2
..$ 1.1: num 4 6
..$ 1.2: num 5

You can easily do this after calling aggregate:
names(a$Z) <- paste(a$X, a$Y, sep=".")
Then check it out
str(a)
'data.frame': 2 obs. of 3 variables:
$ X: num 50 52
$ Y: num 60 60
$ Z:List of 2
..$ 50.60: num 4 6
..$ 52.60: num 5

1) Try tapply instead:
ta <- tapply(df[[3]], df[-3], c)
ta[["50", "60"]]
## [1] 4 6
ta[["52", "60"]]
## [1] 5
2) subset: Consider not using aggregate at all and using subset to retrieve the values instead:
subset(df, X == 50 & Y == 60)$Z
## [1] 4 6
3) data.table: Subsetting is even easier with data.table:
library(data.table)
dt <- data.table(df, key = "X,Y")
dt[.(50, 60), Z]
## [1] 4 6
Note: If you are not actually starting with the df shown in the question, and a is instead the result of a series of complex transformations, then df can be recovered like this:
df <- tidyr::unnest(a)
at which point any of the above could be used.
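
Another base R option along the same lines, given as a sketch: split() on both grouping columns returns a plain list whose names are the actual X and Y values joined by a dot. The name z_by_xy is illustrative.

```r
df <- data.frame(X = c(50, 52, 50), Y = c(60, 60, 60), Z = c(4, 5, 6))

# Split Z by every observed combination of X and Y;
# drop = TRUE discards combinations that never occur
z_by_xy <- split(df$Z, list(df$X, df$Y), drop = TRUE)

z_by_xy[["50.60"]]
z_by_xy[["52.60"]]
```

The list is indexed by value from the start, so no renaming step is needed.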