This may well have an answer elsewhere but I'm having trouble formulating the words of the question to find what I need.
I have two dataframes, A and B, with A having many more rows than B. I want to look up a value from B based on a column of A, and add it to another column of A. Something like:
A$ColumnToAdd + B[ColumnToMatch == A$ColumnToMatch,]$ColumnToAdd
But I get a load of NAs, along with this warning:
Warning in `==.default`: longer object length is not a multiple of shorter object length
I could do it with a messy for-loop but I'm looking for something faster & elegant.
Thanks
If I understood your question correctly, you're looking for a merge or a join, as suggested in the comments.
Here's a simple example for both using dummy data that should fit what you described.
library(tidyverse)
# Some dummy data
ColumnToAdd <- c(1,1,1,1,1,1,1,1)
ColumnToMatch <- c('a','b','b','b','c','a','c','d')
A <- data.frame(ColumnToAdd, ColumnToMatch)
ColumnToAdd <- c(1,2,3,4)
ColumnToMatch <- c('a','b','c','d')
B <- data.frame(ColumnToAdd, ColumnToMatch)
# Example using merge
A %>%
merge(B, by = c("ColumnToMatch")) %>%
mutate(sum = ColumnToAdd.x + ColumnToAdd.y)
# Example using join
A %>%
inner_join(B, by = c("ColumnToMatch")) %>%
mutate(sum = ColumnToAdd.x + ColumnToAdd.y)
The advantages of the dplyr versions over merge are:
rows are kept in their existing order
they are much faster
they tell you which keys you're merging by (if you don't supply them)
they also work with database tables.
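If you would rather not build a merged table at all, here is a minimal base R sketch of the lookup the question was attempting. The original expression fails because `==` compares the two columns element by element, recycling the shorter one, rather than looking anything up. `match()` does the lookup. This assumes each key appears only once in B; the names `idx` and `sum` are just for illustration.

```r
# Same dummy data as above
A <- data.frame(
  ColumnToAdd   = c(1, 1, 1, 1, 1, 1, 1, 1),
  ColumnToMatch = c("a", "b", "b", "b", "c", "a", "c", "d")
)
B <- data.frame(
  ColumnToAdd   = c(1, 2, 3, 4),
  ColumnToMatch = c("a", "b", "c", "d")
)

# For each row of A, find the position of its key in B (NA if the key is absent)
idx <- match(A$ColumnToMatch, B$ColumnToMatch)

# Add the looked-up value; rows of A keep their original order
A$sum <- A$ColumnToAdd + B$ColumnToAdd[idx]
A$sum
```

Rows of A whose key is missing from B end up as NA rather than being dropped, which is closer to a left join than to the inner join shown above.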
As a follow-up to this question, I'm using dplyr's group_split() to make dataframes / tibbles based on the levels of a column. This time I want to split on two columns instead of one. When I try to split and name the resulting datasets, the wrong names get attributed to some of them.
Here's a simple example:
library(dplyr)
#Sample dataset to intuitively illustrate issue
example <- tibble(number = c(1:6),
even_or_odd = c("odd", "even", "odd", "even", "odd", "even"),
prime_or_not = c("prime", "prime", "prime", "not", "prime", "not")) %>%
mutate(type = paste0(even_or_odd, "_", prime_or_not)) %>%
mutate(type_factor = factor(type, levels = unique(type)))
#Does group split to make 3 datasets
the_test <- example %>%
group_split(even_or_odd, prime_or_not) %>%
setNames(unique(example$type_factor))
#The data sets with some being correct but others not
even_prime <- the_test["even_prime"]$even_prime #works!
even_not <- the_test["even_not"]$even_not #wrong label :`-(
odd_prime <- the_test["odd_prime"]$odd_prime #wrong label :`-(
odd_not <- the_test["odd_not"]$odd_not #works--correctly throws an error!
My question: how do I ensure that my group names will be attributed to the right dataset and avoid the issues here with even_not and odd_prime being mixed up?
In my actual dataset, I have 50+ combinations, so typing them all out manually is not an option. In addition, my actual dataset will have some combinations that don't consistently exist (like the odd not prime combination here), so relying on index isn't an option.
Instead of splitting by the two columns, split by the factor column that was created; the groups then come out in the order of the levels of type_factor. In addition, using unique on type_factor for the names can go wrong if the values in 'type_factor' first appear in a different order from the levels, because unique returns values in order of first occurrence. levels is the safer choice. It may also be appropriate to apply droplevels first, in case there are unused levels.
the_test <- example %>%
group_split(type_factor) %>%
setNames(levels(example$type_factor))
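To illustrate the droplevels point, here is a small sketch in which the factor declares a level ("odd_not") that never occurs in the data. The example data is rebuilt here so the snippet runs on its own; the explicit `levels` vector is an assumption made for the illustration. group_split only returns groups that exist, so the names have to come from the levels that are actually used.

```r
library(dplyr)

example <- tibble(
  number = 1:6,
  type = c("odd_prime", "even_prime", "odd_prime",
           "even_not", "odd_prime", "even_not")
) %>%
  # "odd_not" is a declared level with no rows (assumed for illustration)
  mutate(type_factor = factor(
    type,
    levels = c("odd_prime", "even_prime", "even_not", "odd_not")
  ))

the_test <- example %>%
  group_split(type_factor) %>%
  # drop the unused level so there is exactly one name per group
  setNames(levels(droplevels(example$type_factor)))

names(the_test)
the_test$even_not$number
```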
group_split returns an unnamed list. To avoid the pain of renaming incorrectly, use split from base R, which does return a named list. The elements can then come back in any order, as long as the key/value pairs are correct.
# 1 - return in a different order based on alphabetic order
split(example, example[c("even_or_odd", "prime_or_not")], drop = TRUE)
# 2 - return order based on the levels of the factor column
split(example, example$type_factor)
# 3 - With dplyr pipe
example %>%
split(.$type_factor)
# 4 - or using magrittr exposition operator
library(magrittr)
example %$%
split(x = ., f = type_factor)
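If you want to keep group_split and still split on the two original columns, one possible sketch is to take the names from group_keys(), which lists the groups in the same order that group_split() returns them. The variable names `grouped` and `keys` are just illustrative.

```r
library(dplyr)

example <- tibble(
  number = 1:6,
  even_or_odd = c("odd", "even", "odd", "even", "odd", "even"),
  prime_or_not = c("prime", "prime", "prime", "not", "prime", "not")
)

grouped <- example %>% group_by(even_or_odd, prime_or_not)

# One row per existing group, in the same order as group_split()
keys <- group_keys(grouped)

the_test <- grouped %>%
  group_split() %>%
  setNames(paste0(keys$even_or_odd, "_", keys$prime_or_not))

names(the_test)
the_test$odd_prime$number
```

Because the names are built from the keys of the groups that actually exist, combinations that are absent from the data never get a name.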
Oh, of course the moment I post it, I realize that an easy solution existed:
Just change the group split to the new variable and it works!
library(dplyr)
#Does group split to make 3 datasets
the_test <- example %>%
group_split(type_factor) %>%
setNames(unique(example$type_factor))
#The data sets, now all with the correct labels
even_prime <- the_test["even_prime"]$even_prime #works!
even_not <- the_test["even_not"]$even_not #works now!
odd_prime <- the_test["odd_prime"]$odd_prime #works now!
odd_not <- the_test["odd_not"]$odd_not #works--correctly throws an error!
I have a data.frame, I would like to group the data by one of the columns and then apply a function, which operates on the remaining columns of the data. The function returns a list of mixed objects.
If I was just returning one value from the group I know that I could use something like:
df %>% group_by(Column_1) %>% summarise(my_function)
I also know that I could perform operations on a list using lapply, which will happily return a list. I'm just not sure how to combine these two pieces of knowledge to achieve my desired result.
Example code added; userFunction and the data are representative, but should give a good enough idea of what I want.
userFunction <- function(carData){
return(list(
a = carData$am * carData$carb,
b = plot(carData$disp ~ carData$carb),
c = mean(carData$drat)
))
}
mtcars %>%
group_by(cyl) %>%
summarise(userFunction)
I'd like to get back a list whose length is the number of levels in the column I group_by. Each element of the list should contain a, b and c.
This seems to work as I want.
this <- by(mtcars, mtcars$am, userFunction)
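For comparison, here is a rough sketch of two other ways to get one list element per group, grouping by cyl as in the question. `summariseCars` is a hypothetical stand-in for userFunction; the plot() call is left out so that the snippet runs without a graphics device.

```r
library(dplyr)

# Hypothetical stand-in for userFunction, returning a list of mixed objects
summariseCars <- function(carData) {
  list(
    a = carData$am * carData$carb,  # numeric vector
    n = nrow(carData),              # single integer
    c = mean(carData$drat)          # single number
  )
}

# Base R: split() names the pieces by group and lapply() keeps those names
res_base <- lapply(split(mtcars, mtcars$cyl), summariseCars)

# dplyr: group_map() returns an unnamed list with one element per group
res_dplyr <- mtcars %>%
  group_by(cyl) %>%
  group_map(~ summariseCars(.x))

names(res_base)
length(res_dplyr)
```

The base R version has the advantage that the result is already named by group, so elements can be looked up as `res_base[["4"]]`.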
I have a dataframe that consists of 3 different columns (a,b and noise).
I want to apply a function on all distinct combinations of the two first columns together with the mean of the third variable and save it in a new column named c.
My first thought was to solve it with the following code
library(dplyr)
df <- data.frame(a = rep(c(1,2,3),each=9),
b = rep(c(1,2,3),length.out=3*9),
noise = rnorm(9*3*1000))
f <- function(a,b,c) a + b + c
result <- df %>% group_by(a,b) %>% summarise(c = f(a,b,mean(noise)))
To my surprise this gave the error "Error: expecting a single value".
So dplyr still treats a and b as vectors.
So the problem could then be solved with the somewhat confusing code
result <- df %>% group_by(a,b) %>% summarise(c = f(a[1],b[1],mean(noise)))
My questions are:
Why does dplyr keep grouped variables as vectors (are there any benefits of this?)
Is there any better way to solve this problem using dplyr?
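One possible way to avoid the a[1], b[1] workaround, sketched here assuming dplyr 1.0 or later (for the `.groups` argument): summarise the noise first, so that each (a, b) combination is a single row, and only then apply f. The intermediate column name `m` is just illustrative, and the data is shrunk a little to keep the example quick.

```r
library(dplyr)

set.seed(1)  # only so the sketch is reproducible
df <- data.frame(a = rep(c(1, 2, 3), each = 9),
                 b = rep(c(1, 2, 3), length.out = 3 * 9),
                 noise = rnorm(9 * 3 * 10))

f <- function(a, b, c) a + b + c

result <- df %>%
  group_by(a, b) %>%
  summarise(m = mean(noise), .groups = "drop") %>%  # one row per (a, b)
  mutate(c = f(a, b, m))                            # a, b and m now line up row by row

nrow(result)
```

Inside summarise, a and b refer to the full column slice for the current group, which is why they are vectors there; after the summarise step each combination is a single row, so f can be applied directly.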
I am trying to move from plyr to dplyr. However, I still can't figure out how to call my own functions in a dplyr chain.
I have a data frame with a factorised ID variable and an order variable. I want to split the frame by the ID, order it by the order variable and add a sequence in a new column.
My plyr function looks like this:
f <- function(x) cbind(x[order(x$order_variable), ], Experience = 0:(nrow(x)-1))
data <- ddply(data, .(ID_variable), f)
In dplyr I thought this should look something like this:
f <- function(x) cbind(x[order(x$order_variable), ], Experience = 0:(nrow(x)-1))
data <- data %>% group_by(ID_variable) %>% f
Can anyone tell me how to modify my dplyr call to successfully pass my own function and get the same functionality my plyr function provides?
EDIT: If I use the dplyr formula as described here, it DOES pass an object to f. However, while plyr seems to pass a number of different tables (split by the ID variable), dplyr does not pass one table per group but the ENTIRE table (as some kind of dplyr object where groups are annotated), thus when I cbind the Experience variable it appends a counter from 0 to the length of the entire table instead of the single groups.
I have found a way to get the same functionality in dplyr using this approach:
data <- data %>%
group_by(ID_variable) %>%
arrange(ID_variable,order_variable) %>%
mutate(Experience = 0:(n()-1))
However, I would still be keen to learn how to pass grouped variables split into different tables to own functions in dplyr.
For those who get here from Google: let's say you wrote your own print function.
printFunction <- function(dat) print(dat)
df <- data.frame(a = 1:6, b = 1:2)
As it was asked here
df %>%
group_by(b) %>%
printFunction(.)
prints the entire data. To get dplyr to print one table per group, you should use do
df %>%
group_by(b) %>%
do(printFunction(.))
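As far as I know, do() is marked as superseded in recent dplyr versions, and group_modify() is one of the suggested replacements. Here is a minimal sketch of the plyr-style "one table per group" call from the question using it. The toy data and the second argument name `key` are assumptions made for the illustration; group_modify passes each group's rows as the first argument and the group key as the second.

```r
library(dplyr)

# Toy data standing in for the question's data frame
data <- data.frame(
  ID_variable = factor(c("x", "y", "x", "y", "x")),
  order_variable = c(3, 1, 2, 5, 4)
)

# Receives one group's rows at a time (without the grouping column)
f <- function(x, key) {
  x <- x[order(x$order_variable), , drop = FALSE]
  x$Experience <- seq_len(nrow(x)) - 1L  # counter restarts at 0 in every group
  x
}

result <- data %>%
  group_by(ID_variable) %>%
  group_modify(f) %>%
  ungroup()

result
```

The function has to return a data frame, and group_modify adds the grouping column back and binds the pieces together, much like ddply did.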
I have a dataframe made up of 400'000 rows and about 50 columns. As this dataframe is so large, it is too computationally taxing to work with.
I would like to split this dataframe up into smaller ones, after which I will run the functions I would like to run, and then reassemble the dataframe at the end.
There is no grouping variable that I would like to use to split up this dataframe. I would just like to split it up by number of rows. For example, I would like to split this 400'000-row table into 400 1'000-row dataframes.
How might I do this?
Make your own grouping variable.
d <- split(my_data_frame,rep(1:400,each=1000))
You should also consider the ddply function from the plyr package, or the group_by() function from dplyr.
edited for brevity, after Hadley's comments.
If you don't know how many rows are in the data frame, or if the number of rows might not be an exact multiple of your desired chunk size, you can do
chunk <- 1000
n <- nrow(my_data_frame)
r <- rep(1:ceiling(n/chunk),each=chunk)[1:n]
d <- split(my_data_frame,r)
You could also use
r <- ggplot2::cut_width(1:n,chunk,boundary=0)
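Putting the pieces together, here is a small self-contained sketch of the whole split, process and reassemble cycle the question asks about. The toy data frame and the "+ 1" step are placeholders for the real data and the real per-chunk function.

```r
# Toy data: 2500 rows, so the last chunk is shorter than the others
my_data_frame <- data.frame(id = 1:2500, value = (1:2500) * 2)

chunk <- 1000
n <- nrow(my_data_frame)

# Chunk label for every row: 1 for rows 1-1000, 2 for rows 1001-2000, ...
r <- ceiling(seq_len(n) / chunk)
d <- split(my_data_frame, r)

# Placeholder processing step applied to each chunk
processed <- lapply(d, function(part) {
  part$value <- part$value + 1
  part
})

# Reassemble in the original row order
reassembled <- do.call(rbind, processed)
rownames(reassembled) <- NULL

sapply(d, nrow)
```

Since split() keeps the rows of each chunk in their original order and the chunks are numbered in sequence, rbind puts everything back in the order it started in.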
For future readers, methods based on the dplyr and data.table packages will probably be (much) faster for doing group-wise operations on data frames, e.g. something like
(my_data_frame
# ngrps = number of chunks, full_number = rows per full chunk (define these first)
%>% mutate(index=rep(1:ngrps,each=full_number)[seq_len(n())])
%>% group_by(index)
# %>% then mutate(), summarise(), do(), ...
)
There are also many answers here
I had a similar question and used this:
library(tidyverse)
n <- 100 # number of rows per group
split <- df %>% group_by((row_number() - 1) %/% n) %>% group_map(~ .x)
from left to right:
you assign your result to split
you start with df as your input dataframe
then you group your data by integer-dividing (%/%) the zero-based row number by n (the number of rows per group), so rows 1 to n form the first group, rows n+1 to 2n the second, and so on.
then you just pass each group through the group_map function, which returns a list.
So in the end your split is a list in which each element is one group of your dataset.
On the other hand, you could also write each group straight to disk by replacing the group_map call with e.g. group_walk(~ write_csv(.x, paste0("file_", .y[[1]], ".csv"))).
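Here is a runnable sketch of the write-to-disk variant. It uses base write.csv so that only dplyr is needed, and writes into a temporary directory; the column name `chunk`, the variable `out_dir` and the toy data are assumptions made for the illustration.

```r
library(dplyr)

df <- data.frame(id = 1:250)
n <- 100                # rows per group
out_dir <- tempdir()    # write somewhere harmless for the example

df %>%
  group_by(chunk = (row_number() - 1) %/% n) %>%
  # .x holds the rows of the current group, .y is a one-row tibble with its key
  group_walk(~ write.csv(
    .x,
    file.path(out_dir, paste0("file_", .y$chunk, ".csv")),
    row.names = FALSE
  ))

list.files(out_dir, pattern = "^file_[0-9]+\\.csv$")
```

group_walk is called for its side effects only, so nothing needs to be collected afterwards.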
You can find more info on these powerful tools on:
Cheat sheet of dplyr explaining group_by
and also below for:
group_map, group_walk follow up functions