I really love the apply family in R, but I don't think I'm getting the most out of it yet.
with(mtcars, tapply(mpg, cyl, mean))
sapply(mtcars, mean)
These two calls, for example, are really nice, but how can I combine them to get the mean of each variable for every category of the variable cyl?
With dplyr it is quite easy I guess:
mtcars %>%
group_by(cyl) %>%
summarise_all(mean)
With dplyr it seems quite easy. So maybe another question might be: why is it useful to even learn all these apply functions when dplyr makes the problem so easy to solve? :-)
If you're looking for a base R solution, then you can use split to separate your data frame by cyl, then use sapply as before:
S <- split( mtcars, mtcars$cyl )
lapply( S, function(x) sapply(x, mean) )
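If you'd rather get a single data frame back instead of a list, base R's aggregate with a formula interface does the same job in one call (the `. ~ cyl` formula means "all other columns, grouped by cyl"):

```r
# Mean of every column of mtcars within each cyl group, as one data frame
agg <- aggregate(. ~ cyl, data = mtcars, FUN = mean)
agg
```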
Your second question is primarily opinion-based, so I'll give mine: tidyverse packages, like dplyr, build on top of base R functionality to provide a convenient and consistent interface for common data manipulation operations. For this reason they are generally preferable, but they may not always be available in a particular development environment. In the latter case, it is helpful to know how to fall back on base R functionality.
I'm running into an issue that I feel should be simple, but I cannot figure it out. I have searched the board for comparable problems/questions but have been unable to find an answer.
In short, I have data from a variety of motor vehicles and want to know the average speed of each vehicle at its maximal acceleration. I also want the opposite: the average acceleration at top speed.
I am able to do this for the whole dataset using the following code
data<-data %>% group_by(Name) %>%
mutate(speedATaccel= with(data, avg.Speed[which.max(top.Accel)]),
accelATspeed= with(data, avg.Accel[which.max(top.Speed)]))
However, the group_by function doesn't appear to be working: it just provides values computed across the whole dataset, as opposed to within each individual vehicle group.
Any help would be appreciated.
Thanks,
The use of with(data, ...) disrupts the group_by attribute and computes the index on the whole data. Instead, use tidyverse methods, i.e. remove the with(data, ...). Note that in the tidyverse we don't need any of the base R extraction methods ($, [[, or with); instead, specify the unquoted column name:
library(dplyr)
data %>%
group_by(Name) %>%
mutate(speedATaccel = avg.Speed[which.max(top.Accel)],
accelATspeed = avg.Accel[which.max(top.Speed)])
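Since the original data isn't available, here is the same pattern on mtcars (the column names are stand-ins for yours): within each cyl group, take mpg at the row where hp is maximal.

```r
library(dplyr)

# For every row, record the group's mpg at the row with maximal hp
res <- mtcars %>%
  group_by(cyl) %>%
  mutate(mpgATmaxhp = mpg[which.max(hp)]) %>%
  ungroup()
```

Each group now carries a single, group-specific value rather than one computed over the whole dataset.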
I used this question and answer to get what I want (how to compute row sums using the tidyverse), but I was wondering whether there is a way to do named subsetting with rowSums. I can imagine instances with a lot of variables where this would be desirable.
What I mean is something like this:
rowSums(iris, Sepal.Length, Sepal.Width)
Instead of:
rowSums(iris[1:2])
Thanks for any help in advance!
Using dplyr you could simply do:
iris %>% mutate(total = Sepal.Length + Sepal.Width)
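If you specifically want rowSums over named columns (useful when there are many variables), across() inside mutate gets close to the syntax you asked for. This assumes dplyr >= 1.0, which introduced across():

```r
library(dplyr)

# rowSums over a named selection of columns, no numeric indices needed
out <- iris %>%
  mutate(sepal_total = rowSums(across(c(Sepal.Length, Sepal.Width))))
```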
I have a question about dplyr.
Lets say I want to update certain values in a dataframe, can I do this?:
mtcars %>% filter(mpg>20) %>% select(hp)=1000
(The example is nonsensical: all cars with mpg greater than 20 get hp set to 1000.)
I get an error, so I am guessing the answer is no: I can't use %>% and the dplyr verbs on the left of an assignment. But the dplyr syntax is a lot cleaner than:
mtcars[mtcars$mpg>20,"hp"]=1000
Especially when you are dealing with more complex cases, so I wanted to ask if there is any way to use the dplyr syntax in this case?
edit: It looks like mutate is the verb I want, so now my question is: can I dynamically change the name of the variable in the mutate statement, like so:
for (i in c("hp","wt")) {mtcars<-mtcars %>% filter(mpg>20) %>% mutate(i=1000) }
This example just creates a column named "i" with value 1000, which isn't what I want.
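For reference, a sketch of one way this works in current dplyr (>= 1.0): across() accepts column names as strings via all_of(), so no loop is needed at all.

```r
library(dplyr)

cols <- c("hp", "wt")  # column names held in a variable

# Set every column named in `cols` to 1000 for the filtered rows
out <- mtcars %>%
  filter(mpg > 20) %>%
  mutate(across(all_of(cols), ~ 1000))
```

Note this returns only the filtered rows; writing the values back into the full data frame still needs a base-R-style assignment or dplyr's rows_update()-type helpers.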
I have the following example data
d.1 = data.frame(id=c(1,1,2,3,3), date=c(2001,2002,2001,2001,2003), measure=c(1:5))
d.2 = data.frame(id=c(1,2,2,3,3), date=c(2001,2002,2003,2002,2008), measure=c(1:5))
d = merge(d.1,d.2, all=T, by="id")
d.1 and d.2 are two kinds of measurements, and I need one of each measurement per id. The measurements should be as close to each other as possible. I can do that with dplyr by:
require(dplyr)
d = d %>%
group_by(id) %>%
do(.[which.min(abs(.$date.x-.$date.y)),])
The question is how I can use dplyr if the names of the date columns are saved in variables, like name.x = "date.x" and name.y = "date.y", because I can't use
...
do(.[which.min(abs(.[, name.x]-.[, name.y])),])
....
I tried to find another solution using eval, as.symbol and similar, but I couldn't figure one out...
d$date.x returns a vector while d[, name.x] returns a data.frame, which does not work when passed inside your function. So simply change the way you access this column to d[[name.x]] and it will work:
d %>% group_by(id) %>% do(.[which.min(abs(.[[name.x]] -.[[name.y]])),])
Since 0.4 (which was released just after this question was answered), dplyr has included a standard-evaluation version, do_, which in theory should be easier to program with than the NSE version.
You could use it similarly:
interp <- lazyeval::interp
d %>%
group_by(id) %>%
do_(interp(~ .[which.min(abs(.$x - .$y)), ],
x = as.name(name.x), y = as.name(name.y)))
I'm not sure it's any easier to read or write than the NSE version. For the other verbs,
code can remain concise while also programmatically accessing names.
For do_, however, one must use the dot pronoun to access column names, e.g. as discussed in this question. As a consequence, I think you always need to use interp with do_. This makes the code more verbose than the NSE version in the earlier answer.
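For completeness: in current dplyr (>= 1.0), both do_ and lazyeval are superseded, and the .data pronoun handles string column names directly. A sketch using the question's example data:

```r
library(dplyr)

d.1 <- data.frame(id = c(1, 1, 2, 3, 3), date = c(2001, 2002, 2001, 2001, 2003), measure = 1:5)
d.2 <- data.frame(id = c(1, 2, 2, 3, 3), date = c(2001, 2002, 2003, 2002, 2008), measure = 1:5)
d <- merge(d.1, d.2, all = TRUE, by = "id")

name.x <- "date.x"
name.y <- "date.y"

# .data[[...]] looks up a column by the string stored in the variable
res <- d %>%
  group_by(id) %>%
  slice(which.min(abs(.data[[name.x]] - .data[[name.y]]))) %>%
  ungroup()
```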
I am learning R and don't understand a section of the function below. What exactly is count = length(address) doing? Is there another way to do this?
crime_dat = ddply(crime, .(lat, lon), summarise, count = length(address))
The plyr library has two very common "helper" functions, summarise and mutate.
Summarise is used when you want to discard irrelevant data/columns, keeping only the levels of the grouping variable(s) and the summary values computed for those groups (in your example, length).
Mutate is used to add a column (analogous to transform in base R), but without discarding anything. If you run these two commands, they should illustrate the difference nicely.
library(plyr)
ddply(mtcars, .(cyl), summarise, count = length(mpg))
ddply(mtcars, .(cyl), mutate, count = length(mpg))
In this example, as in your example, the goal is to figure out how many rows there are in each group. When using ddply like this with summarise, we need to pick a function that takes a single column (vector) as an argument, so length is a good choice. Since we're just counting rows / taking the length of the vector, it doesn't matter which column we pass to it. Alternatively, we could use nrow, but for that we have to pass a whole data.frame, so summarise won't work. In this case it saves us typing:
ddply(mtcars, .(cyl), nrow)
But if we want to do more, summarise really shines
ddply(mtcars, .(cyl), summarise, count = length(mpg),
mean_mpg = mean(mpg), mean_disp = mean(disp))
Is there another way to do this?
Yes, many other ways.
I'd second Alex's recommendation to use dplyr for things like this. The summarize and mutate concepts are still used, but it works faster and results in more readable code.
Other options include the data.table package (also a great option), tapply() or aggregate() in base R, and countless other possibilities.
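For comparison, a sketch of the per-group count in dplyr and with base R's aggregate; both should match the ddply result above:

```r
library(dplyr)

# dplyr: n() counts rows per group
counts_dplyr <- mtcars %>%
  group_by(cyl) %>%
  summarise(count = n())

# base R: aggregate with length, passing any column (here mpg) just to count it
counts_base <- aggregate(mpg ~ cyl, data = mtcars, FUN = length)
```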