Understanding plyr's ddply function - r

I am learning R and don't understand a section of the below function. In the below function what exactly is count=length(address) doing? Is there another way to do this?
crime_dat = ddply(crime, .(lat, lon), summarise, count = length(address))

The plyr library has two very common "helper" functions, summarize and mutate.
Summarise is used when you want to discard irrelevant data/columns, keeping only the levels of the grouping variable(s) and the results of the summary functions you specify for those groups (in your example, length).
Mutate is used to add a column (analogous to transform in base R), but without discarding anything. If you run these two commands, they should illustrate the difference nicely.
library(plyr)
ddply(mtcars, .(cyl), summarise, count = length(mpg))
ddply(mtcars, .(cyl), mutate, count = length(mpg))
In this example, as in your example, the goal is to figure out how many rows there are in each group. When using ddply like this with summarise, we need to pick a function that takes a single column (vector) as an argument, so length is a good choice. Since we're just counting rows / taking the length of the vector, it doesn't matter which column we pass to it. Alternatively, we could use nrow, but for that we have to pass a whole data.frame, so summarise won't work. In this case it saves us typing:
ddply(mtcars, .(cyl), nrow)
But if we want to do more, summarise really shines:
ddply(mtcars, .(cyl), summarise, count = length(mpg),
mean_mpg = mean(mpg), mean_disp = mean(disp))
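To tie this back to the original question, here is a minimal sketch on a made-up crime data frame (the coordinates and addresses are invented purely for illustration, not taken from the asker's data). It suggests that count = length(address) is simply the number of rows sharing each lat/lon pair:

```r
library(plyr)

# Hypothetical stand-in for the asker's `crime` data: five reports at two locations
crime <- data.frame(
  lat     = c(29.7, 29.7, 29.8, 29.8, 29.8),
  lon     = c(-95.4, -95.4, -95.3, -95.3, -95.3),
  address = c("a st", "b st", "c st", "d st", "e st"),
  stringsAsFactors = FALSE
)

# One output row per (lat, lon) pair; count is the length of that group's address vector
crime_dat <- ddply(crime, .(lat, lon), summarise, count = length(address))

# Counting rows per group directly gives the same numbers
crime_rows <- ddply(crime, .(lat, lon), nrow)
```

Any other column would give the same counts, because every column in a group has one element per row.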
Is there another way to do this?
Yes, many other ways.
I'd second Alex's recommendation to use dplyr for things like this. The summarize and mutate concepts are still used, but it works faster and results in more readable code.
Other options include the data.table package (also a great option), tapply() or aggregate() in base R, and countless other possibilities.
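As a rough sketch of those alternatives on the same mtcars example (the object names below are my own), each of these counts the rows per value of cyl:

```r
library(dplyr)

# Base R: apply length() to mpg within each level of cyl
counts_tapply <- with(mtcars, tapply(mpg, cyl, length))

# Base R: formula interface, returns a data frame with columns cyl and mpg (the count)
counts_aggregate <- aggregate(mpg ~ cyl, data = mtcars, FUN = length)

# Base R: a frequency table of the grouping variable
counts_table <- table(mtcars$cyl)

# dplyr: n() counts rows in the current group, so no column has to be picked
counts_dplyr <- mtcars %>%
  group_by(cyl) %>%
  summarise(count = n())
```

The dplyr version avoids the slightly odd step of choosing an arbitrary column just to take its length.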

Related

Combine different apply functions in R

I really love the apply-family in R, but I think I still do not get the best of it.
with(mtcars, tapply(mpg, cyl, mean))
sapply(mtcars, mean)
These two functions for example are really nice, but how can I combine them to get the mean for each variable for every category of the variable cyl?
With dplyr it is quite easy I guess:
mtcars %>%
group_by(cyl) %>%
summarise_all(mean)
With dplyr it seems to be quite easy. So maybe another question is why it is useful to even learn all these apply functions, when dplyr makes it easy to solve the problem? :-)
If you're looking for a base R solution, then you can use split to separate your data frame by cyl, then use sapply as before:
S <- split( mtcars, mtcars$cyl )
lapply( S, function(x) sapply(x, mean) )
Your second question is primarily opinion-based, so I'll give mine: tidyverse packages, like dplyr, build on top of base R functionality to provide a convenient and consistent interface for common data manipulation operations. For this reason, they are generally preferable, but may not always be available in a particular development environment. In the latter case, it is helpful to know how to fall back on base R functionality.
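For completeness, two other base R spellings of the same idea, offered as a sketch rather than the canonical way (the object names are mine):

```r
# aggregate() with a formula: the mean of every other column for each value of cyl,
# returned as a data frame with one row per group
by_cyl <- aggregate(. ~ cyl, data = mtcars, FUN = mean)

# split() plus sapply(): the same numbers as a matrix,
# one column per cyl value and one row per variable
by_cyl_matrix <- sapply(split(mtcars, mtcars$cyl), colMeans)
```

The first is closest in shape to the dplyr result; the second is handy when you want a matrix to index by variable name and group.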

dplyr default grouping option

I'm a bit confused about the default grouping option in dplyr. I always assumed that without an explicit group_by, all operations are row-wise. However, I have a data frame
data.raw = data.frame(a=c(1,2,3,4),b=c(1,2,3,4))
and when I want to calculate the mean of each ROW
data = data.raw %>%
mutate(data.average = mean(c(a,b),na.rm = TRUE))
it returns the mean value of all the elements in a and b. It seems by doing c() all data are grouped in one group and mean performed on that. I wonder how it is possible to know how functions used within mutate etc. introduce grouping.
ps. I'm not looking for a solution for this specific problem, but asking for more generally how function calls affect grouping in dplyr.
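A small sketch of what is going on, as far as I understand dplyr's semantics: inside mutate, a and b are the whole columns of the current group, and with no group_by the whole data frame is a single group. Functions like c() and mean() do not introduce grouping; they simply receive full-length vectors. The object names below are my own.

```r
library(dplyr)

data.raw <- data.frame(a = c(1, 2, 3, 4), b = c(1, 2, 3, 4))

# No grouping: c(a, b) is one vector of 8 values, mean() returns a single number,
# and that number is recycled into every row
whole_frame <- data.raw %>%
  mutate(data.average = mean(c(a, b), na.rm = TRUE))

# rowwise() makes each row its own group, so a and b are single values here
per_row <- data.raw %>%
  rowwise() %>%
  mutate(data.average = mean(c(a, b), na.rm = TRUE)) %>%
  ungroup()

# A vectorised function that already works row by row needs no grouping at all
vectorised <- data.raw %>%
  mutate(data.average = rowMeans(cbind(a, b), na.rm = TRUE))
```

So the rule of thumb is that mutate is only "row-wise" when the function you call is itself vectorised element by element (like +, or rowMeans); summary functions collapse whatever group they are handed.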

Multiple Response Questions using Dplyr and Tidyr

I'm struggling with multiple response questions in R. I'm hoping to find an easy way to tackle this with dplyr and tidyr. Below is a sample multiple response data frame. I'm trying to do two things. First, create percentages - % of cats, % of dogs, etc. Percentages will be of overall responses. My usual way of calculating percentages -
group_by(_)%>%summarise(count=n())%>%mutate(percent=count/sum(count))
doesn't seem to cut it in this situation. Maybe I have to use summarise_each or a more specialized function? I'm still new to r and really new to Dplyr and Tidyr. I also tried to use Tidyr's "unite" function, which works, but it includes NA's, which I will have to recode away. But I still can't seem to calculate the percentages of the united column.
Any suggestions would be great! First, how to unite the multiple response columns using "unite" into all possible combinations and then calculating percentages of each, and also how to simply calculate the percentage of each binary column as a proportion of overall responses? Hope this makes sense! I'm sure there's a simple and elegant answer that I'm overlooking.
Cats<-c("Cat",NA,"Cat",NA,NA,NA,"Cat",NA)
Dogs<-c(NA,NA,"Dog","Dog",NA,"Dog",NA,"Dog")
Fish<-c(NA,NA,"Fish",NA,NA,NA,"Fish","Fish")
Pets<-data.frame(Cats,Dogs,Fish)
Pets<-Pets%>%unite(Combined,Cats,Dogs,Fish,sep=",",remove=FALSE)
Pets%>%group_by(Combined)%>%summarise(count=n())%>%mutate(percent=count/sum(count))
If I understand your question correctly, the gather() function from tidyr is a better fit here than unite():
library(dplyr)
library(tidyr)
Pets %>%
  # gather only the three response columns, dropping the NA (not selected) cells
  gather(animal, type, Cats, Dogs, Fish, na.rm = TRUE) %>%
  group_by(animal) %>%
  summarize(count = n()) %>%
  mutate(percentage = count / sum(count))
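Here is a self-contained sketch of the same idea using pivot_longer(), which newer tidyr versions offer as the successor to gather(). It also shows two variations that may be closer to what was asked: the share of respondents per animal, and the combinations via unite() with na.rm = TRUE (assuming a tidyr version that supports that argument). The object names are mine.

```r
library(dplyr)
library(tidyr)

Pets <- data.frame(
  Cats = c("Cat", NA, "Cat", NA, NA, NA, "Cat", NA),
  Dogs = c(NA, NA, "Dog", "Dog", NA, "Dog", NA, "Dog"),
  Fish = c(NA, NA, "Fish", NA, NA, NA, "Fish", "Fish"),
  stringsAsFactors = FALSE
)

# Share of all selections (10 ticks in total) that went to each animal
shares <- Pets %>%
  pivot_longer(everything(), names_to = "animal", values_to = "type",
               values_drop_na = TRUE) %>%
  count(animal, name = "count") %>%
  mutate(percent = count / sum(count))

# Share of respondents (8 rows) who selected each animal
respondent_share <- colMeans(!is.na(Pets))

# Combinations: na.rm = TRUE keeps the NA cells out of the united string
combos <- Pets %>%
  unite(Combined, Cats, Dogs, Fish, sep = ",", na.rm = TRUE) %>%
  count(Combined) %>%
  mutate(percent = n / sum(n))
```

Which denominator is right depends on whether "overall responses" means ticks or people, so it is worth deciding that before picking one of the first two.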

Using dplyr in assignment

I have a question about dplyr.
Let's say I want to update certain values in a dataframe. Can I do this?
mtcars %>% filter(mpg>20) %>% select(hp)=1000
(The example is nonsensical where all cars with MPGs greater than 20 have HP set to 1000)
I get an error, so I am guessing the answer is no: I can't use %>% and the dplyr verbs on the left-hand side of an assignment. But the dplyr syntax is a lot cleaner than:
mtcars[mtcars$mpg>20,"hp"]=1000
Especially when you are dealing with more complex cases, so I wanted to ask if there is any way to use the dplyr syntax in this case?
edit: It looks like mutate is the verb I want, so now my question is, can I dynamically change the name of the var in the mutate statement like so:
for (i in c("hp","wt")) {mtcars<-mtcars %>% filter(mpg>20) %>% mutate(i=1000) }
This example just creates a column named "i" with value 1000, which isn't what I want.
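One possible approach, sketched here under the assumption that the goal is to keep all rows and only overwrite values where mpg > 20 (the names cars_one, cars_loop and cars_across are mine). Note that filter() would drop the other rows, so the condition moves inside mutate() instead:

```r
library(dplyr)

# Single column: overwrite hp only where the condition holds
cars_one <- mtcars %>%
  mutate(hp = ifelse(mpg > 20, 1000, hp))

# Column name held in a variable: `!!v :=` sets the name on the left-hand side,
# and .data[[v]] looks the existing column up by its string name
cars_loop <- mtcars
for (v in c("hp", "wt")) {
  cars_loop <- cars_loop %>%
    mutate(!!v := ifelse(mpg > 20, 1000, .data[[v]]))
}

# Same result without a loop: across() applies the function to each named column
cars_across <- mtcars %>%
  mutate(across(all_of(c("hp", "wt")), ~ ifelse(mpg > 20, 1000, .x)))
```

The across() form is usually the one I would reach for first, since it avoids reassigning the data frame in a loop.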

Split up a dataframe by number of rows

I have a dataframe made up of 400'000 rows and about 50 columns. As this dataframe is so large, it is too computationally taxing to work with.
I would like to split this dataframe up into smaller ones, after which I will run the functions I would like to run, and then reassemble the dataframe at the end.
There is no grouping variable that I would like to use to split up this dataframe. I would just like to split it up by number of rows. For example, I would like to split this 400'000-row table into 400 1'000-row dataframes.
How might I do this?
Make your own grouping variable.
d <- split(my_data_frame,rep(1:400,each=1000))
You should also consider the ddply function from the plyr package, or the group_by() function from dplyr.
edited for brevity, after Hadley's comments.
If you don't know how many rows are in the data frame, or if the number of rows might not be an exact multiple of your desired chunk size, you can do
chunk <- 1000
n <- nrow(my_data_frame)
r <- rep(1:ceiling(n/chunk),each=chunk)[1:n]
d <- split(my_data_frame,r)
You could also use
r <- ggplot2::cut_width(1:n,chunk,boundary=0)
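To make the full split, process, reassemble round trip concrete, here is a small sketch on a toy data frame. The 2500-row data, the column names and the centring step are placeholders of mine for whatever the real per-chunk work is:

```r
# Toy data: 2500 rows, deliberately not a multiple of the chunk size
df <- data.frame(id = 1:2500, value = sqrt(1:2500))

chunk <- 1000
n <- nrow(df)

# Chunk index per row: 1 for rows 1-1000, 2 for 1001-2000, 3 for the remaining 500
r <- rep(seq_len(ceiling(n / chunk)), each = chunk)[seq_len(n)]
pieces <- split(df, r)

# Placeholder processing step applied to each chunk independently
processed <- lapply(pieces, function(p) {
  p$value_centered <- p$value - mean(p$value)
  p
})

# Reassemble into one data frame, in the original row order
result <- do.call(rbind, processed)
rownames(result) <- NULL
```

Because split() keeps the chunks in index order, binding them back together restores the original row order.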
For future readers, methods based on the dplyr and data.table packages will probably be (much) faster for doing group-wise operations on data frames, e.g. something like
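library(dplyr)
# chunk is the desired number of rows per group, as defined above
(my_data_frame
  %>% mutate(index = rep(seq_len(ceiling(n() / chunk)), each = chunk)[seq_len(n())])
  %>% group_by(index)
  %>% summarise(rows = n())  # or mutate(), do(), ... as needed
)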
There are also many answers here
I had a similar question and used this:
library(tidyverse)
n = 100 # number of rows per group
split <- df %>% group_by((row_number() - 1) %/% n) %>% group_map(~ .x)
from left to right:
you assign your result to split
you start with df as your input dataframe
then you group your data by dividing the zero-based row number by n (the number of rows per group) using integer division (%/%), so rows 1 to n land in group 0, rows n+1 to 2n in group 1, and so on.
then you just pass each group through the group_map function, which returns a list.
So in the end your split is a list in which each element is one group of your dataset.
On the other hand, you could also immediately write your data by replacing the group_map call by e.g. group_walk(~ write_csv(.x, paste0("file_", .y, ".csv"))).
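A runnable sketch of both variants on a toy data frame. To keep it dependent on dplyr only, it swaps readr's write_csv for base R's write.csv and writes into a temporary directory; the grouping column name grp and the file names are my own choices:

```r
library(dplyr)

df <- data.frame(id = 1:250, value = rnorm(250))
n <- 100  # rows per group

grouped <- df %>% group_by(grp = (row_number() - 1) %/% n)

# group_map() returns a list with one tibble per group (without the grp column)
chunks <- grouped %>% group_map(~ .x)

# group_walk() is called for its side effect; .y is a one-row tibble holding the group key
out_dir <- tempdir()
grouped %>%
  group_walk(~ write.csv(.x, file.path(out_dir, paste0("file_", .y$grp, ".csv")),
                         row.names = FALSE))
```

Naming the grouping column makes the key easy to reach as .y$grp instead of relying on how the key tibble is converted to text.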
You can find more info on these tools in the dplyr cheat sheet, which explains group_by, and in the documentation for the follow-up functions group_map and group_walk.
