How can I apply different aggregate functions to different columns in R? - r

How can I apply different aggregate functions to different columns in R? The aggregate() function only offers one function argument to be passed:
V1 V2 V3
1 18.45022 62.24411694
2 90.34637 20.86505214
1 50.77358 27.30074987
2 52.95872 30.26189013
1 61.36935 26.90993530
2 49.31730 70.60387016
1 43.64142 87.64433517
2 36.19730 83.47232907
1 91.51753 0.03056485
... ... ...
> aggregate(sample,by=sample["V1"],FUN=sum)
V1 V1 V2 V3
1 1 10 578.5299 489.5307
2 2 20 575.2294 527.2222
How can I apply a different function to each column, e.g. aggregate V2 with the sum() function and V3 with the mean() function, without calling aggregate() multiple times?

For that task, I will use ddply in plyr
> library(plyr)
> ddply(sample, .(V1), summarize, V2 = sum(V2), V3 = mean(V3))
V1 V2 V3
1 1 578.5299 48.95307
2 2 575.2294 52.72222

...Or the function data.table in the package of the same name:
library(data.table)
myDT <- data.table(sample) # As mdsumner suggested, this is not a great name
myDT[, list(sumV2 = sum(V2), meanV3 = mean(V3)), by = V1]
# V1 sumV2 meanV3
# [1,] 1 578.5299 48.95307
# [2,] 2 575.2294 52.72222

Let's call the data frame x rather than sample, since sample is already the name of a base R function.
EDIT:
The by function provides a more direct route than split/apply/combine
by(x, list(x$V1), f)
:EDIT
lapply(split(x, x$V1), myfunkyfunctionthatdoesadifferentthingforeachcolumn)
Of course, that's not a separate function for each column, but one function can do both jobs.
myfunkyfunctionthatdoesadifferentthingforeachcolumn = function(x) c(sum(x$V2), mean(x$V3))
Convenient ways to collate the result are possible such as this (but check out plyr package for a comprehensive solution, consider this motivation to learn something better).
matrix(unlist(lapply(split(x, x$V1), myfunkyfunctionthatdoesadifferentthingforeachcolumn)), ncol = 2, byrow = TRUE, dimnames = list(unique(x$V1), c("sum", "mean")))
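The split/lapply route above can be sketched end to end on made-up data. The data frame x below is only a stand-in for the question's data, and summarise_group is an illustrative name, not a function from any package.

```r
# Made-up data standing in for the question's data frame (an assumption).
x <- data.frame(V1 = c(1, 2, 1, 2),
                V2 = c(10, 20, 30, 40),
                V3 = c(1, 2, 3, 4))

# One function that applies a different summary to each column of a group.
summarise_group <- function(g) c(sum = sum(g$V2), mean = mean(g$V3))

# split() breaks x into one data frame per V1 value, lapply() summarises
# each piece, and do.call(rbind, ...) stacks the results into a matrix
# whose row names are the group labels.
res <- do.call(rbind, lapply(split(x, x$V1), summarise_group))
res
```

Using do.call(rbind, ...) instead of matrix(unlist(...)) takes the row labels from split() itself, so they cannot drift out of order relative to the groups.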

Related

How to pass a vector of values as parameters for mutate?

I am writing code that is expected to raise each column of a data frame to some exponent.
I've tried to use mutate_all to apply function(x,a) x^a to each column of the dataframe, but I am having trouble passing values of a from a pre-defined vector.
powers <- c(1,2,3)
df <- data.frame(v1 = c(1,2,3), v2 = c(2,3,4), v3 = c(3,4,5))
df %>% mutate_all(.funs, ...)
I am seeking help on how to write the parameters of mutate_all so that the elements of powers can be applied to the function for each column.
I expect the output to be a data frame, with columns being (1,2,3),(4,9,16),(27,64,125) respectively.
We can use Map in base R
df[] <- Map(`^`, df, powers)
Or map2 in purrr
purrr::map2_df(df, powers, `^`)
You can also try sweep() from base R:
sweep(df, 2, powers, "^")
v1 v2 v3
1 1 4 27
2 2 9 64
3 3 16 125
In base R, we can replicate the 'powers' to make the lengths same and then apply the function
df ^ powers[col(df)]
# v1 v2 v3
#1 1 4 27
#2 2 9 64
#3 3 16 125
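To see why the last approach works, it may help to look at what col(df) returns. A small sketch using the question's own data (the names exponents, result and via_map are just illustrative):

```r
powers <- c(1, 2, 3)
df <- data.frame(v1 = c(1, 2, 3), v2 = c(2, 3, 4), v3 = c(3, 4, 5))

# col(df) is a matrix holding each cell's column index, so indexing
# powers with it yields one exponent per cell, in column-major order.
exponents <- powers[col(df)]

# Raising the data frame to that vector applies each exponent to its column.
result <- df ^ exponents

# Map() reaches the same values by pairing each column with its exponent.
via_map <- Map(`^`, df, powers)
```

Both routes give the same numbers; Map() returns a list, which is why the earlier answer assigns it back with df[] <- to keep the data frame structure.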

Multiply several columns of a dataframe by a factor (scalar)

I have a very basic problem and can't find a solution, so sorry in advance for the beginner question.
I have a data frame with several ID columns and 30 numerical columns. I want to multiply all values of those 30 columns by the same factor, and I want to keep the rest of the data frame unchanged. I figured that dplyr and transmute_all or transmute_at are my friends, but I can't find a way to express the function Column1:Column30 * factor. All examples given use simple functions like mean, and that doesn't help me with the expression.
I would use mutate_at. For example:
library(dplyr)
mtcars %>%
  mutate_at(vars(mpg:qsec),
            .funs = ~ . * 3)  # funs() is deprecated in current dplyr; a formula works instead
I'll give a solution with data.table, the dplyr version should be close to identical.
library(data.table)
# convert to data.table format to use data.table syntax
setDT(my_df)
# .SD refers to all the columns mentioned in the .SDcols argument
# (all columns by default when this argument is not specified)
# - instead of using backticks around *, you could use quotes: "*"
my_df[ , lapply(.SD, `*`, factor), .SDcols = Column1:Column30]
On some made-up data
set.seed(0123498)
# create fake data
DT = setDT(replicate(8, rnorm(5), simplify = FALSE))
DT
# V1 V2 V3 V4 V5 V6 V7 V8
# 1: -0.2685077 -1.06491111 0.7307661 0.09880937 0.2791274 -0.5589676 1.5320685 0.4730013
# 2: 1.0783236 -0.17810929 -0.2578453 0.95940860 1.0990367 -0.6983235 0.9530062 -1.3800769
# 3: 1.1730611 -0.48828441 -1.6314077 -0.76117268 -0.5753245 -0.7370099 0.3982160 -0.8088035
# 4: 0.2060451 -0.07105785 -1.1878591 -0.83464592 2.1872117 -0.4390479 0.1428239 1.2634280
# 5: 1.6142695 0.46381602 0.5315299 2.34790945 -1.2977851 1.0428450 1.9292390 0.5337248
scalar = 3
DT[ , lapply(.SD, "*", scalar), .SDcols = V4:V6]
# V4 V5 V6
# 1: 0.2964281 0.8373822 -1.676903
# 2: 2.8782258 3.2971101 -2.094970
# 3: -2.2835180 -1.7259734 -2.211030
# 4: -2.5039378 6.5616352 -1.317144
# 5: 7.0437283 -3.8933554 3.128535
If it's all numeric columns you want to multiply, (or if you can easily write a test) I'd use lapply with an is.numeric test:
Calling the data frame dd (and using iris to demonstrate):
dd = iris
dd[] = lapply(dd, FUN = function(x) if (is.numeric(x)) return(x * 2) else return(x))
This is equivalent to a simple for loop, which also works just fine.
for (i in 1:ncol(dd)) {
if (is.numeric(dd[[i]])) dd[[i]] = dd[[i]] * 2
}
Another way is to use lapply only on the relevant columns, e.g.:
dd[1:30] = lapply(dd[1:30], "*", 2)
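For the situation in the question (ID columns plus a block of numeric columns), selecting the target columns by name may be safer than selecting by position. A minimal sketch on made-up data; my_df, scale_factor and the column names are assumptions for illustration only:

```r
# Made-up data: two ID columns and three numeric columns (an assumption;
# the question's real data has 30 such columns).
my_df <- data.frame(id = 1:3, group = c("a", "b", "c"),
                    Column1 = c(1, 2, 3), Column2 = c(4, 5, 6),
                    Column3 = c(7, 8, 9))
scale_factor <- 10  # hypothetical multiplier

# Select the target columns by name so the ID columns are left alone.
target <- paste0("Column", 1:3)
my_df[target] <- lapply(my_df[target], `*`, scale_factor)
```

Because only my_df[target] is assigned to, id and group keep their original values and types.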
Since dplyr version 1.0, you can use across():
library(dplyr)
dd = iris
dd = dd %>%
  mutate(across(where(is.numeric), function(x) x * 2))
Maybe this will help you; it uses only base R:
> set.seed(100)
> df = data.frame(id=rep(1:5), val1=rnorm(5), val2=rnorm(5), val3=rnorm(5))
> df
id val1 val2 val3
1 1 -0.50219235 0.3186301 0.08988614
2 2 0.13153117 -0.5817907 0.09627446
3 3 -0.07891709 0.7145327 -0.20163395
4 4 0.88678481 -0.8252594 0.73984050
5 5 0.11697127 -0.3598621 0.12337950
# Multiply by 2 all columns except id column
> df[, !colnames(df) %in% c("id")] <- df[, !colnames(df) %in% c("id")] * 2
> df
id val1 val2 val3
1 1 -1.0043847 0.6372602 0.1797723
2 2 0.2630623 -1.1635814 0.1925489
3 3 -0.1578342 1.4290654 -0.4032679
4 4 1.7735696 -1.6505189 1.4796810
5 5 0.2339425 -0.7197243 0.2467590
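You could just use apply:
my_df <- data.frame(a = 1:3, b = 4:6)  # some all-numeric data (placeholder)
transformation_logic <- function(col) col * 2  # placeholder transformation
my_scaled_df <- apply(my_df, 2, transformation_logic)  # note: apply() returns a matrix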
You can try this:
y <- xx[-(1:2)]*100
Here the first two columns of xx are non-numeric, so xx[-(1:2)] excludes them from the calculation.

R: efficient way to apply a function according to the columns of a dataframe

I feel extremely stupid now but I can't come up with more than a for loop...
I have a data frame with numeric and factor columns. I simply want the numeric columns to be scaled and the factor columns to be kept as they are. For example
> set.seed(160)
> df1 <- data.frame(as.data.frame(matrix(rnorm(8), ncol=2)),
V3=factor(c("A", "A", "B", "B")))
> df1
V1 V2 V3
1 0.6185496 -0.6410203 A
2 -0.8722777 2.6520986 A
3 0.8529240 -1.4156009 B
4 0.3678875 -1.1615607 B
I'd like to get
> df1
V1 V2 V3
1 0.4901808 -0.2642698 A
2 -1.4493527 1.4780179 A
3 0.7950968 -0.6740765 B
4 0.1640750 -0.5396717 B
with a more efficient command than
for(i in 1:ncol(df1)) {
if(is.factor(df1[,i])) {df1[,i] <- df1[,i]}
else{df1[,i] <- scale(df1[,i])}
}
I tried various combinations of lapply(), sapply(), if(), ifelse() but nothing seemed to work (apply doesn't work because the df gets transformed into a matrix and I lose the factor/numeric structure). Any suggestions?
NB: I am not trying to apply a function based on the values in the columns but based on the type of column.
You can try the following, which is similar to a suggestion in the comments:
df1[sapply(df1, is.numeric)] <- scale(df1[sapply(df1, is.numeric)])
#> df1
# V1 V2 V3
#1 0.4901808 -0.2642698 A
#2 -1.4493527 1.4780179 A
#3 0.7950968 -0.6740765 B
#4 0.1640750 -0.5396717 B
This should work.
# lapply (not sapply) returns a list, so factor columns are not coerced to numbers
df1[] <- lapply(df1, function(i) if(is.numeric(i)) as.numeric(scale(i)) else i)
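A quick way to check that a type-based transformation leaves the factor column alone is to test the column classes afterwards. A minimal sketch, assuming data shaped like the question's (num_cols is an illustrative name):

```r
set.seed(160)
df1 <- data.frame(V1 = rnorm(4), V2 = rnorm(4),
                  V3 = factor(c("A", "A", "B", "B")))

# Logical vector marking which columns are numeric.
num_cols <- vapply(df1, is.numeric, logical(1))

# scale() returns a matrix; assigning it into the selected columns keeps
# the remaining columns (here the factor V3) untouched.
df1[num_cols] <- scale(df1[num_cols])

# The factor survives, and each scaled column has mean 0 and sd 1.
is.factor(df1$V3)
sapply(df1[num_cols], sd)
```

vapply is used here instead of sapply only because it guarantees a logical vector; sapply(df1, is.numeric) behaves the same on this data.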

R: Properly using a dataframe as an argument to a function

I am practicing using the apply function in R, and so I'm writing a simple function to apply to a dataframe.
I have a dataframe with 2 columns.
V1 V2
1 3
2 4
I decided to do some basic arithmetic and have the answer in the 3rd column, specifically, I want to multiply the first column by 2 and the second column by 3, then sum them.
V1 V2 V3
1 3 11
2 4 16
Here's what I was thinking:
mydf <- as.data.frame(matrix(c(1:4),ncol=2,nrow=2))
some_function <- function(some_df) {some_df[,1]*2 +
some_df[,2]*3}
mydf <- apply(mydf ,2, some_function)
But what is wrong with my arguments to the function? R is giving me an error regarding the dimension of the dataframe. Why?
Three things are wrong:
1) apply passes each row or column to the function as a plain vector, so you index it with [1], not [,1]
2) you need to run it by row (MARGIN=1), not by column (MARGIN=2)
3) you need to cbind the result, because apply doesn't append a column; assigning its result to mydf overwrites the data frame
mydf <- as.data.frame(matrix(c(1:4),ncol=2,nrow=2))
some_function <- function(some_df) {some_df[1]*2 +
some_df[2]*3}
mydf <- cbind(mydf,V3=apply(mydf ,1, some_function))
# V1 V2 V3
#1 1 3 11
#2 2 4 16
but probably easier just to do the vector math:
mydf$V3<-mydf[,1]*2 + mydf[,2]*3
because vector math is one of the greatest things about R
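The two routes can be compared side by side. A small sketch on the question's data, with row_version and vec_version as illustrative names:

```r
mydf <- data.frame(V1 = c(1, 2), V2 = c(3, 4))

# With MARGIN = 1, apply() hands each row to the function as a plain vector,
# so elements are picked with single-bracket indexing.
row_version <- apply(mydf, 1, function(r) r[1] * 2 + r[2] * 3)

# The vectorised form operates on whole columns at once.
vec_version <- mydf$V1 * 2 + mydf$V2 * 3

# Both give 11 and 16; either can be stored as a new column.
mydf$V3 <- vec_version
```

The vectorised form avoids converting the data frame to a matrix, which apply() does internally and which would turn every value into a string if any column were non-numeric.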

Collapse data frame by group using different functions on each variable

Define
df<-read.table(textConnection('egg 1 20 a
egg 2 30 a
jap 3 50 b
jap 1 60 b'))
s.t.
> df
V1 V2 V3 V4
1 egg 1 20 a
2 egg 2 30 a
3 jap 3 50 b
4 jap 1 60 b
My data has no factors so I convert factors to characters:
> df$V1 <- as.character(df$V1)
> df$V4 <- as.character(df$V4)
I would like to "collapse" the data frame by V1 keeping:
The max of V2
The mean of V3
The mode of V4 (this value does not actually change within V1 groups, so first, last, etc might do also.)
Please note this is a general question, e.g. my dataset is much larger and I may want to use different functions (e.g. last, first, min, max, variance, st. dev., etc for different variables) when collapsing. Hence the functions argument could be quite long.
In this case I would want output of the form:
> df.collapse
V1 V2 V3 V4
1 egg 2 25 a
2 jap 3 55 b
plyr package will help you:
library(plyr)
ddply(df, .(V1), summarize, V2 = max(V2), V3 = mean(V3), V4 = toupper(V4)[1])
R does not have a built-in function for the statistical mode (mode() returns an object's storage mode), so I used a different function here.
But it is easy to implement a mode function.
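One possible way to write such a function is sketched below. The name stat_mode is hypothetical, and breaking ties in favour of the value that appears first is a design choice made here, not something the question specifies.

```r
# Hypothetical helper: returns the most frequent value of a vector.
# Ties are broken in favour of the value that appears first.
stat_mode <- function(v) {
  uniq <- unique(v)                          # distinct values, in order of appearance
  counts <- tabulate(match(v, uniq))         # how often each distinct value occurs
  uniq[which.max(counts)]                    # first value with the highest count
}

stat_mode(c("a", "b", "b"))  # "b"
```

It works on character, numeric and factor vectors alike, so it could be dropped into a per-group summary in place of V4[1].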
I would suggest using ddply from plyr:
require(plyr)
ddply(df, .(V1), summarise, V2=max(V2), V3=mean(V3), V4=V4[1])
You can replace the functions with any calculation you wish. Your V4 column is non-numeric, so you might want to convert it to numeric and then compute the mode. For now I am just returning the V4 value of the first row for each of the splits. Or if you don't want to use plyr:
do.call(rbind, lapply(split(df, df$V1), function(x) {
  data.frame(V2=max(x$V2), V3=mean(x$V3), V4=x$V4[1])
}))
