R aggregate function with two values - r

Let's say I have a function that takes two vectors:
someFunction <- function(x,y){
return(mean(x+y));
}
And say I have some data
toy <- data.frame(a=c(1,1,1,1,1,2,2,2,2,2), b=rnorm(10), c=rnorm(10))
What I want to do is return the result of the function someFunction for each value of toy$a, i.e. I want to achieve the same result as the code
toy$d <- toy$b + toy$c
result <- aggregate(toy$d, list(toy$a), mean)
However, in real life, the function someFunction is way more complicated and it needs two inputs, so the workaround in this toy example is not possible. So, what I want to do is:
Group the data set according to one column.
For each value in the column (in the toy example, that's 1 and 2), take two vectors v1, v2, and return someFunction(v1,v2)

Check out the dplyr package, specifically the group_by and summarize functions.
Assuming that you want to compute someFunction(b, c) for each value of a, the syntax would look like
library(dplyr)
toy %>% group_by(a) %>% summarize(result = someFunction(b, c))

library(data.table)
toy <- data.table(toy)
toy[, list(New_col = someFunction(b, c)), by = 'a']
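The same grouping can also be done in base R with no extra packages. The following is a minimal sketch, assuming the toy data and someFunction from the question; set.seed() and the names pieces and result are added here for illustration only.

```r
# Same toy setup as in the question; set.seed() is only there for reproducibility.
set.seed(42)
someFunction <- function(x, y) mean(x + y)
toy <- data.frame(a = rep(1:2, each = 5), b = rnorm(10), c = rnorm(10))

# Split the rows by the grouping column, then call the two-argument
# function on the two vectors of each piece.
pieces <- split(toy, toy$a)
result <- sapply(pieces, function(g) someFunction(g$b, g$c))
result
```

The result is a named numeric vector with one entry per value of toy$a. Because each piece is a full data frame, the function can use any number of its columns.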

Related

Vector addition with vector indexing

This may well have an answer elsewhere but I'm having trouble formulating the words of the question to find what I need.
I have two dataframes, A and B, with A having many more rows than B. I want to look up a value from B based on a column of A, and add it to another column of A. Something like:
A$ColumnToAdd + B[ColumnToMatch == A$ColumnToMatch,]$ColumnToAdd
But I get, with a load of NAs:
Warning in `==.default`: longer object length is not a multiple of shorter object length
I could do it with a messy for-loop but I'm looking for something faster & elegant.
Thanks
If I understood your question correctly, you're looking for a merge or a join, as suggested in the comments.
Here's a simple example for both using dummy data that should fit what you described.
library(tidyverse)
# Some dummy data
ColumnToAdd <- c(1,1,1,1,1,1,1,1)
ColumnToMatch <- c('a','b','b','b','c','a','c','d')
A <- data.frame(ColumnToAdd, ColumnToMatch)
ColumnToAdd <- c(1,2,3,4)
ColumnToMatch <- c('a','b','c','d')
B <- data.frame(ColumnToAdd, ColumnToMatch)
# Example using merge
A %>%
merge(B, by = c("ColumnToMatch")) %>%
mutate(sum = ColumnToAdd.x + ColumnToAdd.y)
# Example using join
A %>%
inner_join(B, by = c("ColumnToMatch")) %>%
mutate(sum = ColumnToAdd.x + ColumnToAdd.y)
The advantages of the dplyr versions over merge are:
rows are kept in existing order
much faster
tells you what keys you're merging by (if you don't supply)
also work with database tables.
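If only the added column is needed and A should keep its shape, a base R lookup with match() is another option. This is a sketch using the same dummy data as above; the names idx and sum are illustrative.

```r
# Dummy data as in the answer above.
A <- data.frame(ColumnToAdd = rep(1, 8),
                ColumnToMatch = c('a', 'b', 'b', 'b', 'c', 'a', 'c', 'd'))
B <- data.frame(ColumnToAdd = c(1, 2, 3, 4),
                ColumnToMatch = c('a', 'b', 'c', 'd'))

# For every row of A, find the position of its key in B ...
idx <- match(A$ColumnToMatch, B$ColumnToMatch)

# ... and add the looked-up value; A keeps its original row order.
A$sum <- A$ColumnToAdd + B$ColumnToAdd[idx]
A
```

Keys in A that have no counterpart in B produce NA in idx, and therefore NA in the sum, which makes missing matches easy to spot. The original attempt failed because == compares the two columns element by element and recycles the shorter one, instead of looking each key up.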

Applying functions on columns in nested data frame

I have data that I'm nesting into list columns, then I'd like to use purrr::map() to apply a plotting function separately to each column within the nested data frames. Minimal reproducible example:
library(dplyr)
library(tidyr)
library(purrr)
data=data.frame(Type=c(rep('Type1',20),
rep('Type2',20),
rep('Type3',20)),
Result1=rnorm(60),
Result2=rnorm(60),
Result3=rnorm(60)
)
dataNested=data%>%group_by(Type)%>%nest()
Say, I wanted to generate a histogram for Result1:Result3 for each element of dataNested$data:
dataNested%>%map(data,hist)
Any iteration of my code won't separately iterate over the columns within each nested dataframe.
Why complicate things this way when you're already in the tidyverse? List columns are more of a last-resort solution.
library(tidyverse)
data %>%
gather(result, value, -Type) %>%
ggplot(aes(value)) +
geom_histogram() +
facet_grid(Type ~ result)
gather reformats the wide dataset into a long one, with Type column, result column and a value column, where all the numbers are.
Perhaps do not create a nested data frame. We can split the data frame by the Type column and plot the histogram.
library(tidyverse)
dt %>%
split(.$Type) %>%
map(~walk(.[-1], ~hist(.)))
DATA
library(tidyverse)
set.seed(1)
dt <- data.frame(Type = c(rep('Type1', 20),
rep('Type2', 20),
rep('Type3', 20)),
Result1 = rnorm(60),
Result2 = rnorm(60),
Result3 = rnorm(60),
stringsAsFactors = FALSE)
So I think you are thinking about this the right way. Running this code:
dataNested$data[[1]]
You can see that you have data that you can iterate. You can loop through it like:
for(i in dataNested) {
print(i)
}
This clearly demonstrates that the structure is nothing too complicated to work with. Okay so how to create the histograms? We can create a helper function:
helper_hist <- function(df) {
lapply(df, hist)
}
And run using:
map(dataNested$data, helper_hist)
Hope this helps.
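The two levels of iteration (over groups, then over the columns within each group) can be made explicit with nested lapply() calls. This sketch uses only base R and the dt data from the answer above; plot = FALSE is used so that the histogram objects are computed and stored rather than drawn, and the names groups and hists are illustrative.

```r
set.seed(1)
dt <- data.frame(Type = rep(c('Type1', 'Type2', 'Type3'), each = 20),
                 Result1 = rnorm(60),
                 Result2 = rnorm(60),
                 Result3 = rnorm(60))

# Outer level: one data frame per Type (the Type column itself is dropped).
groups <- split(dt[-1], dt$Type)

# Inner level: one histogram object per column of each group.
hists <- lapply(groups, function(df) lapply(df, hist, plot = FALSE))

names(hists)
names(hists$Type1)
```

Each stored object can be drawn later, for example with plot(hists$Type1$Result1, main = "Type1 / Result1"), which also makes it possible to give every panel a meaningful title.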

returning a list from a user function using group_by in R

I have a data.frame, I would like to group the data by one of the columns and then apply a function, which operates on the remaining columns of the data. The function returns a list of mixed objects.
If I was just returning one value from the group I know that I could use something like:
df %>% group_by(Column_1) %>% summarise(my_function)
I also know that I could perform operations on a list using lapply, which will happily return a list. I'm just not sure how to combine these two pieces of knowledge to achieve my desired result.
Example code added; userFunction and the data are only representative, but should give a good enough idea of what I want.
userFunction <- function(carData){
return(list(
a = carData$am * carData$carb,
b = plot(carData$disp ~ carData$carb),
c = mean(carData$drat)
))
}
mtcars %>%
group_by(cyl) %>%
summarise(userFunction)
I'd like to get back a list whose length is the number of levels in the column I group by. Each element of the list should contain a, b and c.
This seems to work as I want.
this <- by(mtcars, mtcars$am, userFunction)
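A closely related base R idiom is split() followed by lapply(), which returns a plain named list rather than an object of class "by". The sketch below groups by cyl, as in the question, and uses a hypothetical variant of userFunction called summariseCars that leaves out the plot so that it has no side effects.

```r
# Hypothetical variant of userFunction without the plotting side effect.
summariseCars <- function(carData) {
  list(a = carData$am * carData$carb,
       c = mean(carData$drat))
}

# One list element per level of cyl, each holding a and c.
perCyl <- lapply(split(mtcars, mtcars$cyl), summariseCars)

names(perCyl)
perCyl[["6"]]$c
```

The list elements are named after the group values, so results for a single group can be pulled out by name.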

pass grouped dataframe to own function in dplyr

I am trying to transfer from plyr to dplyr. However, I still can't seem to figure out how to call on own functions in a chained dplyr function.
I have a data frame with a factorised ID variable and an order variable. I want to split the frame by the ID, order it by the order variable and add a sequence in a new column.
My plyr function looks like this:
f <- function(x) cbind(x[order(x$order_variable), ], Experience = 0:(nrow(x)-1))
data <- ddply(data, .(ID_variable), f)
In dplyr I thought this should look something like this
f <- function(x) cbind(x[order(x$order_variable), ], Experience = 0:(nrow(x)-1))
data <- data %>% group_by(ID_variable) %>% f
Can anyone tell me how to modify my dplyr call to successfully pass my own function and get the same functionality my plyr function provides?
EDIT: If I use the dplyr formula as described here, it DOES pass an object to f. However, while plyr seems to pass a number of different tables (split by the ID variable), dplyr does not pass one table per group but the ENTIRE table (as some kind of dplyr object where groups are annotated), thus when I cbind the Experience variable it appends a counter from 0 to the length of the entire table instead of the single groups.
I have found a way to get the same functionality in dplyr using this approach:
data <- data %>%
group_by(ID_variable) %>%
arrange(ID_variable,order_variable) %>%
mutate(Experience = 0:(n()-1))
However, I would still be keen to learn how to pass grouped variables split into different tables to own functions in dplyr.
For those who get here from google. Let's say you wrote your own print function.
printFunction <- function(dat) print(dat)
df <- data.frame(a = 1:6, b = 1:2)
As it was asked here
df %>%
group_by(b) %>%
printFunction(.)
prints the entire data frame at once. To make dplyr call the function once per group, so that each group's table is printed separately, you should use do
df %>%
group_by(b) %>%
do(printFunction(.))
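In more recent dplyr releases, do() is marked as superseded and group_modify() covers the same use case: it calls a function once per group and expects a data frame back. The following is a sketch applied to the Experience counter from the question; the small data frame is made up for illustration, and f is rewritten slightly so that it does not depend on cbind().

```r
library(dplyr)

# Hypothetical data using the column names from the question.
data <- data.frame(ID_variable = factor(c("x", "x", "x", "y", "y")),
                   order_variable = c(3, 1, 2, 20, 10))

# f receives the rows of ONE group (without the grouping column),
# sorts them and appends a counter starting at 0.
f <- function(x) {
  x <- x[order(x$order_variable), , drop = FALSE]
  x$Experience <- seq_len(nrow(x)) - 1L
  x
}

result <- data %>%
  group_by(ID_variable) %>%
  group_modify(~ f(.x)) %>%
  ungroup()

result
```

Inside the formula, .x is the data for the current group and .y (unused here) is a one-row table holding the group key. The counter restarts at 0 for every ID, which is the behaviour ddply() provided.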

dplyr: manipulate function with multiple arguments by groups

With the robCompositions package, I need to impute missing values on a group basis. For example, with the iris dataset.
library(robCompositions)
library(dplyr)
data(iris)
# Insert random NAs
for (i in 1:4) {
n_NA = sample(0:10, 1)
index_NA = sample(1:nrow(iris), n_NA)
iris[index_NA, i] = NA
}
This is where I have no idea which manip to use...
impfunc <- function(x) x %.%
regroup(list(...)) %.%
mutate(impKNNa(x[,-5], k=6, metric="Euclidean"))
impfunc(iris, "Species")
iris %.% group_by(Species) %.% mutate(impKNNa(iris[,-5], k=6, metric="Euclidean"))
Any idea?
Thanks.
Use the do() function. It allows you to apply any arbitrary function to a grouped data frame.
You'll also want to extract not just the output from impKNNa but specifically its xImp element (impKNNa(...)$xImp), which is the altered data frame.
The other issue is that impKNNa doesn't want any variables except the numeric variables of interest, and do() won't remove the categorical variables. So perhaps a solution is to write a wrapper function for impKNNa that removes the categorical variables and returns xImp, and to use do() to apply that wrapper to a grouped data frame.
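The wrapper idea can be sketched as follows. To keep the example free of package dependencies, it uses a stand-in imputer called medianImpute that is NOT impKNNa: it simply fills NAs with the column median, but it returns a list with an xImp element so that the wrapper has the same shape it would need for impKNNa. The names medianImpute, imputeGroup and imputed are hypothetical, and grouping is done with base R split() and lapply() instead of do().

```r
# Stand-in imputer (NOT impKNNa): replaces NAs with the column median and
# returns a list with an `xImp` element, as described for impKNNa above.
medianImpute <- function(x) {
  x[] <- lapply(x, function(col) {
    col[is.na(col)] <- median(col, na.rm = TRUE)
    col
  })
  list(xImp = x)
}

# Wrapper: pass only the numeric columns to the imputer, then write the
# imputed values back so the categorical columns are kept untouched.
imputeGroup <- function(df, imputer = medianImpute) {
  num <- vapply(df, is.numeric, logical(1))
  df[num] <- imputer(df[num])$xImp
  df
}

data(iris)
set.seed(1)
iris[sample(nrow(iris), 10), 1] <- NA   # insert some NAs

# Apply the wrapper to each Species separately and recombine.
imputed <- do.call(rbind, lapply(split(iris, iris$Species), imputeGroup))
anyNA(imputed)
```

With dplyr loaded, the same wrapper should also work as iris %>% group_by(Species) %>% do(imputeGroup(.)). To use the real imputation, the imputer argument would be replaced by a function that calls impKNNa with the arguments from the question.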
