Collapse data frame by group using different functions on each variable - r

Define
df<-read.table(textConnection('egg 1 20 a
egg 2 30 a
jap 3 50 b
jap 1 60 b'))
so that
> df
V1 V2 V3 V4
1 egg 1 20 a
2 egg 2 30 a
3 jap 3 50 b
4 jap 1 60 b
My real data has no factors, so I convert the factor columns to character:
> df$V1 <- as.character(df$V1)
> df$V4 <- as.character(df$V4)
I would like to "collapse" the data frame by V1 keeping:
The max of V2
The mean of V3
The mode of V4 (this value does not actually change within V1 groups, so first, last, etc might do also.)
Please note this is a general question: my dataset is much larger, and I may want to use different functions (last, first, min, max, variance, standard deviation, etc.) for different variables when collapsing. Hence the functions argument could be quite long.
In this case I would want output of the form:
> df.collapse
V1 V2 V3 V4
1 egg 2 25 a
2 jap 3 55 b

plyr package will help you:
library(plyr)
ddply(df, .(V1), summarize, V2 = max(V2), V3 = mean(V3), V4 = toupper(V4)[1])
R has no built-in function for the statistical mode (the built-in mode() returns the storage mode of an object), so I substituted another function here.
A mode function is easy to implement, though.
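A minimal sketch of such a mode function, assuming ties should go to the value that appears first. The name stat_mode is made up here, chosen so that it does not mask base R's mode():

```r
# Hypothetical helper: most frequent value of a vector.
# Ties are resolved in favour of the value that appears first.
stat_mode <- function(x) {
  ux <- unique(x)                          # distinct values, in order of appearance
  ux[which.max(tabulate(match(x, ux)))]    # count each value, keep the first maximum
}

stat_mode(c("a", "b", "b", "c"))
```

It could then be used inside the ddply call in place of toupper(V4)[1], i.e. V4 = stat_mode(V4).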

I would suggest using ddply from plyr:
require(plyr)
ddply(df, .(V1), summarise, V2=max(V2), V3=mean(V3), V4=V4[1])
You can replace the functions with any calculation you wish. Your V4 column is non-numeric, so you might want to convert it to numeric and then compute the mode. For now I am just returning the V4 value of the first row in each split. Or, if you don't want to use plyr:
do.call(rbind, lapply(split(df, df$V1), function(x) {
  data.frame(V2=max(x$V2), V3=mean(x$V3), V4=x$V4[1])
}))
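Since the question mentions that the list of functions could be long, the split/lapply idea can be wrapped in a small helper that takes a named list with one function per column. This is only a sketch in base R; collapse_by and its arguments are hypothetical names, not part of any package:

```r
# Hypothetical helper: collapse `data` by the column named `by`,
# applying one function per column as given in the named list `funs`.
collapse_by <- function(data, by, funs) {
  pieces <- lapply(split(data, data[[by]]), function(g) {
    # Apply each function to its own column within the group
    out <- lapply(names(funs), function(col) funs[[col]](g[[col]]))
    names(out) <- names(funs)
    # Put the grouping value first, then the summaries
    cbind(g[1, by, drop = FALSE], as.data.frame(out, stringsAsFactors = FALSE))
  })
  res <- do.call(rbind, pieces)
  rownames(res) <- NULL
  res
}

df <- data.frame(V1 = c("egg", "egg", "jap", "jap"),
                 V2 = c(1, 2, 3, 1),
                 V3 = c(20, 30, 50, 60),
                 V4 = c("a", "a", "b", "b"),
                 stringsAsFactors = FALSE)

res <- collapse_by(df, "V1", list(V2 = max, V3 = mean, V4 = function(v) v[1]))
res
```

Each function in the list is expected to return a single value per group; adding another column to the summary only means adding another entry to the list.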

Related

How to pass a vector of values as parameters for mutate?

I am writing a code which is expected to raise each column of a data frame to some exponent.
I've tried to use mutate_all to apply function(x,a) x^a to each column of the dataframe, but I am having trouble passing values of a from a pre-defined vector.
powers <- c(1,2,3)
df <- data.frame(v1 = c(1,2,3), v2 = c(2,3,4), v3 = c(3,4,5))
df %>% mutate_all(.funs, ...)
I am seeking help on how to write the parameters of mutate_all so that the elements of powers can be applied to the function for each column.
I expect the output to be a data frame, with columns being (1,2,3),(4,9,16),(27,64,125) respectively.
We can use Map in base R
df[] <- Map(`^`, df, powers)
Or map2 in purrr
purrr::map2_df(df, powers, `^`)
You can also try sweep() from base R:
sweep(df, 2, powers, "^")
v1 v2 v3
1 1 4 27
2 2 9 64
3 3 16 125
In base R, we can replicate the 'powers' to make the lengths same and then apply the function
df ^ powers[col(df)]
# v1 v2 v3
#1 1 4 27
#2 2 9 64
#3 3 16 125
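To see why this last approach works: col(df) returns a matrix of column indices with the same shape as df, so indexing powers by it yields one exponent per cell. A small sketch reusing the data from the question (res is just an illustrative name):

```r
powers <- c(1, 2, 3)
df <- data.frame(v1 = c(1, 2, 3), v2 = c(2, 3, 4), v3 = c(3, 4, 5))

col(df)            # matrix of column indices: every row reads 1 2 3
powers[col(df)]    # exponents in column-major order: 1 1 1 2 2 2 3 3 3

# Each cell of df is raised to the exponent of its own column
res <- df ^ powers[col(df)]
res
```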

Trying to avoid a for loop in r

I have some code that works but is very clunky and I'm sure there is a better way to do it, avoiding the for loop. Essentially I have a list of performances, and a list of factors. And I want to assign the highest performance to the highest factors, the lowest performance to the lowest factors, etc. Here is some simplified sample code:
#My simplified sample list of performances:
PerformanceList <- data.frame(v1 = c(rep(10,4)), v2 = c(rep(9,4)), v3 = c(rep(8,4)))
View(PerformanceList)
v1 v2 v3
1 10 9 8
2 10 9 8
3 10 9 8
4 10 9 8
#My simplified sample list of Factors:
MyFactors <- data.frame(v1 = c(35,25,15,5), v2 = c(10,20,60,20), v3 = c(5,10,15,40))
View(MyFactors)
v1 v2 v3
1 35 10 5
2 25 20 10
3 15 60 50
4 5 20 40
#Code to find the ranking of each row from largest to smallest:
Rankings <- data.frame(t(apply(-MyFactors, 1, rank, na.last="keep",ties.method="random")))
View(Rankings)
v1 v2 v3
1 1 2 3
2 1 2 3
3 3 1 2
4 3 2 1
Function to sort each row by ranking. I assume there is a better way to do this but I couldn't figure it out:
SortFunction <- function(RankingList){
SortedRankings <- order(RankingList)
return(SortedRankings)
}
#applying that Sort function to each row of the data frame:
SortedRankings <- data.frame(t(apply(Rankings, 1,SortFunction)))
View(SortedRankings)
X1 X2 X3
1 1 2 3
2 1 2 3
3 2 3 1
4 3 2 1
Here is a for loop that does what I want but I'm sure it's not the best way to do it. Basically I want to go down each row of my PerformanceList and choose the column that corresponds to the highest Ranking (which is column 1 from my Sorted Rankings above). I'd ideally like to then be able to assign column 2 from those Sorted Rankings to assign the second highest performance to my second highest factor, and so on...
FactorPerformanceList <- data.frame(matrix(NA, ncol=1, nrow=NROW(Rankings)))
for (i in 1:NROW(Rankings)){
FactorPerformanceList[i,] <- PerformanceList[i,SortedRankings[i,1]]
}
View(FactorPerformanceList)
1 10
2 10
3 9
4 8
It seems like this should work but it gives a matrix of 4 rows by 4 columns instead:
FactorPerformanceList2 <- PerformanceList[,SortedRankings[,1]]
View(FactorPerformanceList2)
v1 v1 v2 v3
1 10 10 9 8
2 10 10 9 8
3 10 10 9 8
4 10 10 9 8
Any ideas or help would be greatly appreciated! Thank you!
This technically does not remove the for-loop, it just hides it. That said, it's a lot cleaner code than what you have, and unless you need all the intermediate data steps, it simplifies things greatly.
PerformanceList <- data.frame(
v1= c(rep(10,4)),
v2= c(rep(9,4)),
v3 = c(rep(8,4))
)
MyFactors <- data.frame(
v1 = c(35,25,15,5),
v2 = c(10,20,60,20),
v3 = c(5,10,15,40))
FactorPerformanceList <- as.data.frame(t(sapply(1:nrow(PerformanceList), function(i) {
PerformanceList[i,order(MyFactors[i,])]
})))
The same code can be written with pipes:
library(tidyverse)
FactorPerformanceList <- 1:nrow(PerformanceList) %>%
sapply(function(i) {
PerformanceList[i,order(MyFactors[i,])]
}) %>%
t() %>%
as.data.frame()
which makes the order of operations a little clearer (sapply, then t, then as.data.frame).
In general, for-loops can be avoided completely when you're working with columns, but row-wise operations aren't as easy to remove entirely. You can clean up the code by using the apply family of functions, or (if you want something fancier) the plyr or purrr packages.
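For the specific case in the question (picking, for each row, the performance in the column with the largest factor), matrix indexing is one way to avoid the row loop. This is a sketch using base R only; the names top, idx and top_perf are illustrative:

```r
PerformanceList <- data.frame(v1 = rep(10, 4), v2 = rep(9, 4), v3 = rep(8, 4))
MyFactors <- data.frame(v1 = c(35, 25, 15, 5),
                        v2 = c(10, 20, 60, 20),
                        v3 = c(5, 10, 15, 40))

# Column index of the largest factor in each row
top <- max.col(as.matrix(MyFactors), ties.method = "first")

# Index the performance matrix with (row, column) pairs: no explicit loop
idx <- cbind(seq_len(nrow(PerformanceList)), top)
top_perf <- as.matrix(PerformanceList)[idx]
top_perf
```

A two-column matrix used as an index selects one element per row of the index, which is what the loop in the question does one row at a time.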
Given the lack of clarity I've come up with a somewhat flexible answer for you.
It might make sense to convert a given data.frame to long format while keeping the index positions from the original structure, since these are what you might use to join data.frames to one another.
I've chosen to use the tidyverse suite of packages to answer this, mainly dplyr (with gather from tidyr).
Data
library(tidyverse)
PerformanceList <- data.frame(v1 = c(rep(10,4)), v2 = c(rep(9,4)), v3 = c(rep(8,4)))
MyFactors <- data.frame(v1 = c(35,25,15,5), v2 = c(10,20,60,20), v3 = c(5,10,15,40))
This function will take a data.frame and provide a long format data.frame with index position columns.
Function to convert to long data.frame with index ranks
df_ranks <- function(df) {
names(df) <- 1:ncol(df)
df %>%
mutate(row_index = 1:nrow(.)) %>%
gather(col_index, value, -row_index) %>%
group_by(row_index) %>%
mutate(row_rank = rank(value, na.last = "keep", ties.method = "random")) %>%
group_by(col_index) %>%
mutate(col_rank = rank(value, na.last = "keep", ties.method = "random")) %>%
ungroup()
}
Applying the function to the data, and making sure to adjust column names will let us join without much hassle.
ranked_perf <- df_ranks(PerformanceList) %>% setNames(paste0("rank_", names(.)))
ranked_fact <- df_ranks(MyFactors) %>% setNames(paste0("fact_", names(.)))
We can then join the tables. It's important to understand what you want to do and what the expected result should be before this step. For this example, I've decided to match values within a column by their rank.
full_join(ranked_perf, ranked_fact,
by = c("rank_col_rank" = "fact_col_rank",
"rank_col_index" = "fact_col_index"))
What you do with this result is up to you: you can select columns and reshape it back to wide format using combinations of select, unite, and spread.

How do I check if subgroups of a character column in R are different?

I have some columns of characters such as:
V1 V2 group
B C 1
B C 1
B C 1
A C 2
A A 2
A A 2
in a data frame (call it df) in R which are also grouped by a factor with 2 levels 1 and 2, and I wanted to use
'by' or 'lapply' to see if I could work out which column(s) had a corresponding group structure which is given by group. In this case, the answer would be column V1.
I was thinking something like
by(df, df$group,...)
but wasn't quite sure how to implement this. I've also seen the 'identical' function but didn't know if the opposite was available?
Thanks for any advice!
Maybe:
sapply(df[,1:2], function(x) all(as.numeric(factor(x,
levels=unique(x)))==df$group))
# V1 V2
#TRUE FALSE
Or for this example
!colSums((df[,1:2]=='A')+1!=df$group)
# V1 V2
#TRUE FALSE
Or you could use
!rowSums(aggregate(.~ group, df, FUN=function(x) length(unique(x)))[,-1]!=1)
#[1] TRUE FALSE
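Another way to phrase the check, sketched in base R: cross-tabulate each column against group and require that every value falls in exactly one group and every group holds exactly one value. The helper name matches_group is made up for this illustration:

```r
df <- data.frame(V1 = c("B", "B", "B", "A", "A", "A"),
                 V2 = c("C", "C", "C", "C", "A", "A"),
                 group = c(1, 1, 1, 2, 2, 2),
                 stringsAsFactors = FALSE)

# Hypothetical helper: TRUE when each value of x occurs in exactly one group
# and each group contains exactly one value of x.
matches_group <- function(x, g) {
  tab <- table(x, g) > 0    # which (value, group) combinations occur at all
  all(rowSums(tab) == 1) && all(colSums(tab) == 1)
}

res <- sapply(df[c("V1", "V2")], matches_group, g = df$group)
res
```

Unlike the comparison against the literal value 'A', this does not depend on which labels the columns contain or on the order in which the groups appear.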

R: Properly using a dataframe as an argument to a function

I am practicing using the apply function in R, and so I'm writing a simple function to apply to a dataframe.
I have a dataframe with 2 columns.
V1 V2
1 3
2 4
I decided to do some basic arithmetic and have the answer in the 3rd column, specifically, I want to multiply the first column by 2 and the second column by 3, then sum them.
V1 V2 V3
1 3 11
2 4 16
Here's what I was thinking:
mydf <- as.data.frame(matrix(c(1:4),ncol=2,nrow=2))
some_function <- function(some_df) {some_df[,1]*2 +
some_df[,2]*3}
mydf <- apply(mydf ,2, some_function)
But what is wrong with my arguments to the function? R is giving me an error regarding the dimension of the dataframe. Why?
Three things wrong:
1) apply passes each row or column to the function as a plain vector, so index its elements with [1], not [,1]
2) you need to iterate over rows (MARGIN = 1), not columns (MARGIN = 2)
3) you need to cbind the result, because apply doesn't append a column; assigning its result to mydf overwrites the data frame
mydf <- as.data.frame(matrix(c(1:4),ncol=2,nrow=2))
some_function <- function(some_df) {some_df[1]*2 +
some_df[2]*3}
mydf <- cbind(mydf,V3=apply(mydf ,1, some_function))
# V1 V2 V3
#1 1 3 11
#2 2 4 16
but probably easier just to do the vector math:
mydf$V3<-mydf[,1]*2 + mydf[,2]*3
because vector math is one of the greatest things about R
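If the number of columns grows, the same weighted sum can also be written as a matrix product. A small sketch, where weights is an illustrative name for the multipliers 2 and 3 from the question:

```r
mydf <- as.data.frame(matrix(1:4, ncol = 2, nrow = 2))
weights <- c(2, 3)   # multiplier for V1 and for V2

# Each row of the matrix is multiplied element-wise by `weights` and summed
mydf$V3 <- as.vector(as.matrix(mydf) %*% weights)
mydf
```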

How can I apply different aggregate functions to different columns in R?

How can I apply different aggregate functions to different columns in R? The aggregate() function only offers one function argument to be passed:
V1 V2 V3
1 18.45022 62.24411694
2 90.34637 20.86505214
1 50.77358 27.30074987
2 52.95872 30.26189013
1 61.36935 26.90993530
2 49.31730 70.60387016
1 43.64142 87.64433517
2 36.19730 83.47232907
1 91.51753 0.03056485
... ... ...
> aggregate(sample,by=sample["V1"],FUN=sum)
V1 V1 V2 V3
1 1 10 578.5299 489.5307
2 2 20 575.2294 527.2222
How can I apply a different function to each column, i.e. aggregate V3 with the mean() function and V2 with the sum() function, without calling aggregate() multiple times?
For that task, I will use ddply in plyr
> library(plyr)
> ddply(sample, .(V1), summarize, V2 = sum(V2), V3 = mean(V3))
V1 V2 V3
1 1 578.5299 48.95307
2 2 575.2294 52.72222
...Or the function data.table in the package of the same name:
library(data.table)
myDT <- data.table(sample) # As mdsumner suggested, this is not a great name
myDT[, list(sumV2 = sum(V2), meanV3 = mean(V3)), by = V1]
# V1 sumV2 meanV3
# [1,] 1 578.5299 48.95307
# [2,] 2 575.2294 52.72222
Let's call the data frame x rather than sample, which is already the name of a base R function.
EDIT:
The by function provides a more direct route than split/apply/combine
by(x, list(x$V1), f)
:EDIT
lapply(split(x, x$V1), myfunkyfunctionthatdoesadifferentthingforeachcolumn)
Of course, that's not a separate function for each column but one can do both jobs.
myfunkyfunctionthatdoesadifferentthingforeachcolumn = function(x) c(sum(x$V2), mean(x$V3))
Convenient ways to collate the result are possible such as this (but check out plyr package for a comprehensive solution, consider this motivation to learn something better).
matrix(unlist(lapply(split(x, x$V1), myfunkyfunctionthatdoesadifferentthingforeachcolumn)), ncol = 2, byrow = TRUE, dimnames = list(unique(x$V1), c("sum", "mean")))
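The collation step can also be left to sapply(), which keeps the names returned by the function. This is a sketch with made-up data standing in for the question's data frame; f is a shorter, hypothetical stand-in for the function above:

```r
# Hypothetical data standing in for the question's data frame
x <- data.frame(V1 = c(1, 2, 1, 2),
                V2 = c(10, 20, 30, 40),
                V3 = c(1, 2, 3, 4))

# One function that treats each column differently
f <- function(d) c(sum = sum(d$V2), mean = mean(d$V3))

# sapply() returns one column per group; t() turns groups into rows
res <- t(sapply(split(x, x$V1), f))
res
```

Because the row names come from split(), they stay aligned with the groups without building the dimnames by hand.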
