Difference between tilde and "by" while using aggregate function in R - r

Every time I do an aggregate on a data.frame I default to using the "by = list(...)" parameter. But I do see solutions on stackoverflow and elsewhere where tilde (~) is used in the "formula" parameter. I kinda see the "by" parameter as the "pivot" around these variables.
In some cases, the output is exactly the same. For example:
aggregate(cbind(df$A, df$B, df$C), FUN = sum, by = list("x" = df$D, "y" = df$E))
AND
aggregate(cbind(df$A, df$B, df$C) ~ df$D + df$E, FUN = sum)
What is the difference between the two and when do you use which?

I wouldn't entirely disagree that it doesn't really matter which approach you use. However, it is important to note that the two do behave differently.
I'll illustrate with a small example.
Here's some sample data:
set.seed(1)
mydf <- data.frame(A = c(1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4, 4),
B = LETTERS[c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2)],
matrix(sample(100, 36, replace = TRUE), nrow = 12))
mydf[3:5] <- lapply(mydf[3:5], function(x) {
x[sample(nrow(mydf), 1)] <- NA
x
})
mydf
# A B X1 X2 X3
# 1 1 A 27 69 27
# 2 1 A 38 NA 39
# 3 1 A 58 77 2
# 4 2 A 91 50 39
# 5 2 A 21 72 87
# 6 3 B 90 100 35
# 7 3 B 95 39 49
# 8 3 B 67 78 60
# 9 3 B 63 94 NA
# 10 4 B NA 22 19
# 11 4 B 21 66 83
# 12 4 B 18 13 67
First, the formula interface. The following three commands will all yield the same output.
aggregate(cbind(X1, X2, X3) ~ A + B, mydf, sum)
aggregate(cbind(X1, X2, X3) ~ ., mydf, sum)
aggregate(. ~ A + B, mydf, sum)
# A B X1 X2 X3
# 1 1 A 85 146 29
# 2 2 A 112 122 126
# 3 3 B 252 217 144
# 4 4 B 39 79 150
Here's a related command for the "by" interface. Pretty cumbersome to type (but that can be addressed by using with, if required).
aggregate(cbind(mydf$X1, mydf$X2, mydf$X3),
by = list(mydf$A, mydf$B), sum)
# Group.1 Group.2 V1 V2 V3
# 1 1 A 123 NA 68
# 2 2 A 112 122 126
# 3 3 B 315 311 NA
# 4 4 B NA 101 169
Now, stop and make note of any differences.
The two that pop into my mind are:
The formula method does a nicer job of preserving names but it doesn't let you control the names directly in your command, which you can do in the data.frame method:
aggregate(cbind(NewX1 = mydf$X1, NewX2 = mydf$X2, NewX3 = mydf$X3),
by = list(NewA = mydf$A, NewB = mydf$B), sum)
The formula method and the data.frame method treat NA values differently. To get the same result with the formula method as you do with the data.frame method, you need to use na.action = na.pass.
aggregate(. ~ A + B, mydf, sum, na.action=na.pass)
Again, it is not entirely wrong to say "I don't think it really matters", and I'm not going to state my preference here since that's not really what Stack Overflow is about, but it is important to always read the function documentation carefully before making such decisions.
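To make the NA difference concrete, here is a minimal sketch on a small made-up data frame. The object toy and its columns g, x and y are my own illustration, not the mydf above. It relies only on the documented na.action argument of the formula method and on passing na.rm through to sum.

```r
# Hypothetical toy data: one NA in x, none in y
toy <- data.frame(g = c("a", "a", "b", "b"),
                  x = c(1, NA, 3, 4),
                  y = c(10, 20, 30, 40))

# Formula interface: the default na.action = na.omit drops row 2 entirely,
# so y also loses its value 20 in group "a"
f_default <- aggregate(cbind(x, y) ~ g, toy, sum)

# Formula interface with na.pass: rows are kept, so sum() returns NA for x
f_pass <- aggregate(cbind(x, y) ~ g, toy, sum, na.action = na.pass)

# "by" interface: rows are kept, same result as f_pass
b_default <- aggregate(toy[c("x", "y")], by = list(g = toy$g), FUN = sum)

# "by" interface, ignoring NAs column by column via na.rm
b_narm <- aggregate(toy[c("x", "y")], by = list(g = toy$g), FUN = sum,
                    na.rm = TRUE)
```

Note that f_default and b_narm are not equivalent: dropping a whole row removes its non-missing values from the other columns as well, whereas na.rm = TRUE only skips the missing cell.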

From the help page,
aggregate.formula is a standard formula interface to aggregate.data.frame
So I don't think it really matters. Use whichever approach you're comfortable with, or which fits existing variables and formulas in your workspace.

Related

Compare vector to a dataframe

I have a dataframe that looks something like -
test A B C
28 67 4 23
45 82 43 56
34 8 24 42
I need to compare test to the other three columns: for each row, I just need the number of elements in the other columns that are less than the corresponding element in the test column.
So the desired output is -
test A B C result
28 67 4 23 2
45 82 43 56 1
34 8 24 42 2
When I tried -
comp_vec = "test"
name_vec = c("A", "B", "C")
rowSums(df[, comp_vec] > df[, name_vec])
I get the error -
Error in Ops.data.frame(df[, comp_vec], df[, name_vec]) :
‘>’ only defined for equally-sized data frames
I am looking for a way without replicating test to match size of dataframe.
You can use sapply to compare the df$test column against each of the other three columns. That returns a TRUE/FALSE matrix, which you can pass to rowSums and store as your result column.
df <- data.frame(test = c(28, 45, 34), A = c(67, 82, 8), B = c(4, 43, 24), C = c(23, 56, 42))
df$result <- rowSums(sapply(df[,2:4], function(x) df$test > x))
> df
test A B C result
1 28 67 4 23 2
2 45 82 43 56 1
3 34 8 24 42 2
I noticed your expected result has 82 for the second row of A, whereas it's 5 in your starting example.
df$result <- apply(df, 1, function(x) sum(x < x[1]))
Use apply with 1 to work by row. x < x[1] gives a TRUE/FALSE vector indicating whether the value at each position in the row is smaller than the first column's value (test). sum then counts the TRUE values.
# test A B C result
# 1 28 67 4 23 2
# 2 45 82 43 56 1
# 3 34 8 24 42 2
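Another option, sketched here under the assumption that the error in the question arose because df[, comp_vec] returned a one-column data frame (as a tibble would) rather than a vector: extract the test column with [[ so it is always a plain vector, then compare the other columns against it directly. A vector whose length equals the number of rows is recycled down each column, so test does not need to be replicated by hand.

```r
df <- data.frame(test = c(28, 45, 34), A = c(67, 82, 8),
                 B = c(4, 43, 24), C = c(23, 56, 42))
comp_vec <- "test"
name_vec <- c("A", "B", "C")

# df[[comp_vec]] is a plain vector; the comparison yields a logical matrix
# with one column per entry of name_vec
less_than_test <- df[name_vec] < df[[comp_vec]]

# Count the TRUE values in each row
df$result <- rowSums(less_than_test)
```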

R: colSums when not all columns are numeric

I have the following data frame
Type CA AR
alpha 1 5
beta 4 9
gamma 3 8
I want to get the column and row sums such that it looks like this:
Type CA AR Total
alpha 1 5 6
beta 4 9 13
gamma 3 8 11
Total 8 22 30
I am able to do rowSums (as shown above) I guess because they are all numeric.
colSums(df)
However, when I do colSums I get the error 'x must be numeric.' I realize that this is because the "Type" column is not numeric.
If I do the following code such that I try to print the value into the 4th row (and only the 2nd through 4th columns are summed)
df[,4] = colSums(df[c(2:4)])
Then I get an error that replacement isn't same as data size.
Does anyone know how to work around this? I want to print the column sums for columns 2-4, and leave the 1st column total blank or allow me to print "Total"?
Thanks in advance!!
Check out numcolwise() in the plyr package.
library(plyr)
df <- data.frame(
Type = c("alpha", "beta", "gamma"),
CA = c(1, 4, 3),
AR = c(5, 9, 8)
)
numcolwise(sum)(df)
Result:
CA AR
1 8 22
Use a matrix:
m <- as.matrix(df[,-1])
rownames(m) <- df$Type
# CA AR
# alpha 1 5
# beta 4 9
# gamma 3 8
Then add margins:
addmargins(m,FUN=c(Total=sum),quiet=TRUE)
# CA AR Total
# alpha 1 5 6
# beta 4 9 13
# gamma 3 8 11
# Total 8 22 30
The simpler addmargins(m) also works, but defaults to labeling the margins with "Sum".
You are right, it is because the first column is not numeric.
Try to use the first column as rownames:
df <- data.frame(row.names = c("alpha", "beta", "gamma"), CA = c(1, 4, 3), AR = c(5, 9, 8))
df$Total <- rowSums(df)
df['Total',] <- colSums(df)
df
The output will be:
CA AR Total
alpha 1 5 6
beta 4 9 13
gamma 3 8 11
Total 8 22 30
If you need the word 'Type', just remove the rownames and add the column back:
Type <- rownames(df)
df <- data.frame(Type, df, row.names=NULL)
df
And its output:
Type CA AR Total
1 alpha 1 5 6
2 beta 4 9 13
3 gamma 3 8 11
4 Total 8 22 30
Use:
df$Total <- df$CA + df$AR
A more general solution:
data$Total <- Reduce('+',data[, sapply(data, is.numeric)])
EDIT: I realize I misread the question. You also want the column sums added as a new row, and I only gave the row sums as a new column.
To add a row of column sums:
data <- data.frame(x = 1:3, y = 4:6, z = as.character(letters[1:3]))
data$z <- as.character(data$z)
rbind(data,sapply(data, function(y) ifelse(test = is.numeric(y), Reduce('+',y), "Total")))
If you do not know which columns are numeric, but rather want the sums across rows then do this:
df$Total = rowSums( df[ sapply(df, is.numeric)] )
The is.numeric function will return a logical value which is valid for selecting columns and sapply will return the logical values as a vector.
To add a set of column totals and a grand total we need to rewind to the point where the dataset was created and prevent the "Type" column from being constructed as a factor:
dat <- read.table(text="Type CA AR
alpha 1 5
beta 4 9
gamma 3 8 ", header=TRUE, stringsAsFactors=FALSE)
dat$Total = rowSums( dat[ sapply(dat, is.numeric)] )
rbind( dat, append(c(Type="Total"),
as.list(colSums( dat[ sapply(dat, is.numeric)] ))))
#----------
Type CA AR Total
1 alpha 1 5 6
2 beta 4 9 13
3 gamma 3 8 11
4 Total 8 22 30
That's a data.frame:
> str( rbind( dat, append(c(Type="Total"), as.list(colSums( dat[ sapply(dat, is.numeric)] )))) )
'data.frame': 4 obs. of 4 variables:
$ Type : chr "alpha" "beta" "gamma" "Total"
$ CA : num 1 4 3 8
$ AR : num 5 9 8 22
$ Total: num 6 13 11 30
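The two steps above (row totals over the numeric columns, then a row of column totals) could be wrapped in a small helper. This is only a sketch: add_totals and its label_col argument are names I made up, and it assumes the label column is character rather than factor.

```r
# Hypothetical helper: append a "Total" column and a "Total" row
add_totals <- function(d, label_col = "Type") {
  # Row totals over whichever columns are numeric
  num <- sapply(d, is.numeric)
  d$Total <- rowSums(d[num])

  # Column totals, now including the new Total column
  num <- sapply(d, is.numeric)
  total_row <- data.frame(as.list(colSums(d[num])))
  total_row[[label_col]] <- "Total"

  # Put the columns in the same order as d before binding
  rbind(d, total_row[names(d)])
}

dat <- data.frame(Type = c("alpha", "beta", "gamma"),
                  CA = c(1, 4, 3),
                  AR = c(5, 9, 8),
                  stringsAsFactors = FALSE)
out <- add_totals(dat)
```

Building the total row as a one-row data frame keeps the numeric columns numeric after rbind, which a plain character vector would not.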
I think this should solve your problem
x<-data.frame(type=c('alpha','beta','gama'), x=c(1,2,3), y=c(4,5,6))
x[,'Total'] <- rowSums(x[,c(2:3)])
x<-rbind(x,c(type = c('Total'), c(colSums(x[,c(2:4)]))))
# colSums() is base R, so no extra packages are needed
df <- data.frame(
Type = c("alpha", "beta", "gamma"),
CA = c(1, 4, 3),
AR = c(5, 9, 8)
)
df2 <- colSums(df[, c("CA", "AR")])
# CA AR
# 8 22

R programming(sum of products)

I'm working on how to find the sum of products of two data frames.
data<-w1 w2 w3 w4
4 6 8 5
where w1 w2 w3 w4 are column names
and I have one more dataframe
data2<-p1 p2 p3 p4
3 4 5 6
5 6 8 4
4 6 6 8
3 5 8 9
my result should be like this:
result <- w1*p1+w2*p2+w3*p3+w4*p4
result1 <- 4*3+6*4+8*5+5*6 # result on row 1
result2 <- 4*5+6*6+8*8+5*4 # result on row 2
and so on for each row in data2
How can I do this in general?
Thanks
The fastest way is to fall back on R's linear algebra (even more so if you have big data frames):
> as.matrix(data2) %*% unlist(data)
# [,1]
#[1,] 106
#[2,] 140
#[3,] 140
#[4,] 151
Or sweep:
> rowSums(sweep(as.matrix(data2), 2, unlist(data), `*`))
#[1] 106 140 140 151
Data
data=data.frame(a=4,b=6,c=8,d=5)
data2=data.frame(a=c(3,5,4,3),b=c(4,6,6,5),c=c(5,8,6,8),d=c(6,4,8,9))
You could use mapply:
df1 <- data.frame(w1 = 4, w2 = 6, w3 = 8, w4 = 5)
df2 <- data.frame(p1 = c(3, 5, 4, 3), p2 = c(4, 6, 6, 5),
p3 = c(5, 8, 6, 8), p4 = c(6, 4, 8, 9))
This multiplies each column of df2 by the corresponding column of df1 (a data frame is treated as a list of columns in this context):
> (tmp <- mapply(`*`, df2, df1))
p1 p2 p3 p4
[1,] 12 24 40 30
[2,] 20 36 64 20
[3,] 16 36 48 40
[4,] 12 30 64 45
>sum(tmp)
[1] 537
Edit: If you want to get the sum of each row from the above matrix you can use either apply(tmp, 1, sum) or rowSums:
> rowSums(tmp)
[1] 106 140 140 151
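As a quick cross-check, the three approaches from the answers above can be run side by side on the question's numbers. This is a sketch: the object names w, p and the by_* results are mine.

```r
w <- data.frame(w1 = 4, w2 = 6, w3 = 8, w4 = 5)
p <- data.frame(p1 = c(3, 5, 4, 3), p2 = c(4, 6, 6, 5),
                p3 = c(5, 8, 6, 8), p4 = c(6, 4, 8, 9))

# Matrix product: each row of p times the weight vector; drop() turns the
# 4 x 1 result into a plain vector
by_matrix <- drop(as.matrix(p) %*% unlist(w))

# sweep() multiplies each column by its weight, rowSums() adds them up
by_sweep <- rowSums(sweep(as.matrix(p), 2, unlist(w), `*`))

# mapply() pairs column i of p with column i of w
by_mapply <- rowSums(mapply(`*`, p, w))
```

All three give the same four row results. The matrix product does the multiplication and the summation in a single step.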

vector to dataframe in r given length of vector

I have vectors of different lengths. For instance:
df1
[1] 1 95 5 2 135 4 3 135 4 4 135 4 5 135 4 6 135 4
df2
[1] 1 70 3 2 110 4 3 112 4
I'm trying to write a script in R in order to have any vector enter the function or for loop and it returns a dataframe of three columns. So a separate dataframe for each input vector. Each vector is a multiple of three (hence, the three columns). I'm fairly new to R in terms of writing functions and can't seem to figure this out. Here was my attempt:
newdf = c()
ld <- length(df1)
ld_mult <- length(df1)/3
ld_seq <- seq(from=1,to=ld,by=3)
ld_seq2 < ld_seq +2
for (i in 1:ld_mult) {
newdf[i,] <- df1[ld_seq[i]:ld_seq2[i]]
}
the output I want for df1 would be:
1 95 5
2 135 4
3 135 4
4 135 4
5 135 4
6 135 4
Here's an example of how you could use matrix for that purpose:
x <- c(1, 95, 5,2, 135, 4, 3, 135, 4)
as.data.frame(matrix(x, ncol = 3, byrow = TRUE))
# V1 V2 V3
#1 1 95 5
#2 2 135 4
#3 3 135 4
And
y <- c(1, 70, 3, 2, 110, 4, 3, 112, 4)
as.data.frame(matrix(y, ncol = 3, byrow = TRUE))
# V1 V2 V3
#1 1 70 3
#2 2 110 4
#3 3 112 4
Or if you want to make it a custom function:
newdf <- function(vec) {
as.data.frame(matrix(vec, ncol = 3, byrow = TRUE))
}
newdf(y)
#V1 V2 V3
#1 1 70 3
#2 2 110 4
#3 3 112 4
You could also let the user specify the number of columns to create by adding another argument to newdf:
newdf <- function(vec, cols = 3) {
as.data.frame(matrix(vec, ncol = cols, byrow = TRUE))
}
Now the default number of columns is 3 if the user doesn't specify a number. To override it, call the function like this:
newdf(z, 5) # to create 5 columns
Another nice little add-on for the function would be a check that the input vector length is a multiple of the number of columns specified in the function call:
newdf <- function(vec, cols = 3) {
if(length(vec) %% cols != 0) {
stop("Input vector length is not a multiple of the number of columns. Please double check.")
}
as.data.frame(matrix(vec, ncol = cols, byrow = TRUE))
}
newdf(x, 4)
#Error in newdf(x, 4) :
# Input vector length is not a multiple of the number of columns. Please double check.
If you had multiple vectors sitting in a list, here's how you could convert each of them to be a data.frame:
> l <- list(x,y)
> l
#[[1]]
#[1] 1 95 5 2 135 4 3 135 4
#
#[[2]]
#[1] 1 70 3 2 110 4 3 112 4
> lapply(l, newdf)
#[[1]]
# V1 V2 V3
#1 1 95 5
#2 2 135 4
#3 3 135 4
#
#[[2]]
# V1 V2 V3
#1 1 70 3
#2 2 110 4
#3 3 112 4
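If the default V1, V2, V3 column names are not wanted, the same idea extends to an optional names argument. The following is a sketch only: vec_to_df and col_names are hypothetical names, not part of the answer above.

```r
# Hypothetical variant of newdf() that can also name the columns
vec_to_df <- function(vec, cols = 3, col_names = NULL) {
  # Fail early if the vector cannot be split into complete rows
  if (length(vec) %% cols != 0) {
    stop("Input vector length is not a multiple of the number of columns.")
  }
  out <- as.data.frame(matrix(vec, ncol = cols, byrow = TRUE))
  # Only rename when names were supplied
  if (!is.null(col_names)) {
    names(out) <- col_names
  }
  out
}

y <- c(1, 70, 3, 2, 110, 4, 3, 112, 4)
named <- vec_to_df(y, col_names = c("id", "value", "count"))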

Combining objects across a list

I have a simple question. I have a list of objects. Each object holds a few lists. Before this gets too complicated, let me illustrate:
x = a list
x[[1]] = some object
x[[2]] = another object
...
x[[n]] = another object
And as I said, each object holds some more lists. But I'm interested in a specific list, let's call it "a".
x[[1]][[a]] = ('A': 1, 'B': 2, 'C': 3, ..., Z: 26)
Sorry for the python-like syntax! I am really just learning R. Anyway, what I want to do is combine the lists held in these objects, then take their median. To make this more clear, I want to group all 'A' elements, then take their median:
x[[1]][[a]][['A']], x[[2]][[a]][['A']], x[[3]][[a]][['A']], ..., x[[n]][[a]][['A']]
Similarly I want to group all 'B', 'C', ..., 'Z' elements and take their median...
x[[1]][[a]][['Z']], x[[2]][[a]][['Z']], x[[3]][[a]][['Z']], ..., x[[n]][[a]][['Z']]
So the question is what's the best way to do this? I've spent hours trying to figure this out! It would be great if someone could help me.
And if you would like to know what I'm actually doing, basically I have a list (x) of random forest objects. So x[[1]] is the first random forest, x[[100]] is the 100th random forest. Each random forest has a list of predicted values, which are stored in, e.g. x[[1]][['predicted']]. Each prediction list has a label associated with its predicted value. What I'm actually trying to do is calculate each label's median predicted value across all 100 random forests. And I want to do it efficiently. In Python, this is easy, but in R I'm not so sure. Anyway, thanks for the help!!! I really appreciate it.
Here's one way you could do it. It's a bit tough because you can't use rapply to subset by the names of list elements (which is frustrating). But you can unlist and then subset on names and take the median that way...
# Make some reproducible data
set.seed(1)
l <- list( a = sample(10,3) , b = sample(10,3) , c = sample(10,3) )
ll <- list( l , l , l )
# Unlist - we get a named vector but all a's have unique names - e.g. a1 , a2... an
unl <- unlist(ll)
# a1 a2 a3 b1 b2 b3 c1 c2 c3 a1 a2 a3 b1 b2 b3 c1 c2 c3 a1 a2 a3 b1 b2 b3 c1 c2 c3
# 3 4 5 10 2 8 10 6 9 3 4 5 10 2 8 10 6 9 3 4 5 10 2 8 10 6 9
# Subset by those elements that contain 'a' in their name
a.unl <- unl[ grepl("a",names(unl)) ]
# a1 a2 a3 a1 a2 a3 a1 a2 a3
# 3 4 5 3 4 5 3 4 5
# Take median
median( a.unl )
# [1] 4
To loop over multiple names try this...
sapply( c( "a" , "b" , "c" ) , function(x) median( unl[ grepl(x,names(unl) ) ] ) )
# a b c
# 4 8 9
You could do this with a simple loop for each of A, B, C, ...
vals <- c()
for( i in seq_along(x) ) vals <- c( vals, x[[i]][["a"]][["A"]] )
median(vals)
Sample data for creating your top-level list x:
x <- replicate(3, list(a = as.list(setNames(sample(1:100, 26), LETTERS)),
b = runif(10)),
simplify = FALSE)
First, extract a from each list:
a.only <- lapply(x, `[[`, "a")
Then, to compute all A through Z medians in one shot, do:
do.call(mapply, c(a.only, FUN = function(...) median(unlist(list(...)))))
# A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
# 55 59 41 21 93 72 65 74 51 42 87 25 60 40 13 77 35 31 92 51 57 37 87 67 29 46
If the sublists contain more items than you need, say you only want to compute medians on A, C, Z, do:
a.slices <- lapply(a.only, `[`, c("A", "C", "Z"))
do.call(mapply, c(a.slices, FUN = function(...) median(unlist(list(...)))))
# A C Z
# 55 41 46
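A variation on the same idea, sketched on a tiny hand-made list so the result is easy to verify: bind the "a" sublists into a matrix with one column per object, then take the median of each row. The names x_small, a_mat and label_medians are mine, and the sketch assumes every "a" sublist has the same labels in the same order.

```r
# Three hypothetical objects, each holding a named list "a" plus something else
x_small <- list(
  list(a = list(A = 1, B = 2),  b = "other"),
  list(a = list(A = 3, B = 10), b = "other"),
  list(a = list(A = 5, B = 6),  b = "other")
)

# One column per object, one row per label (rows are named A, B)
a_mat <- sapply(x_small, function(obj) unlist(obj[["a"]]))

# Median of each label across all objects
label_medians <- apply(a_mat, 1, median)
```

With the labels as rows, other per-label summaries (mean, quantiles) only require swapping the function passed to apply.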
