I'm a bit confused about the default grouping behaviour in dplyr. I always assumed that, without an explicit group_by, operations are row-wise. However, I have a data frame
data.raw = data.frame(a = c(1, 2, 3, 4), b = c(1, 2, 3, 4))
and when I want to calculate the mean of each ROW
data = data.raw %>%
  mutate(data.average = mean(c(a, b), na.rm = T))
it returns the mean value of all the elements in a and b. It seems by doing c() all data are grouped in one group and mean performed on that. I wonder how it is possible to know how functions used within mutate etc. introduce grouping.
P.S. I'm not looking for a solution to this specific problem, but asking more generally how function calls affect grouping in dplyr.
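To make the behaviour described above concrete, here is a minimal sketch rather than a definitive answer. It assumes dplyr is installed; the names ungrouped, by_row and vectorised are made up for illustration. The idea it shows: mutate hands each expression the whole column of the current group, so c() and mean() do not introduce grouping themselves. With no group_by, the "group" is simply the entire data frame.

```r
library(dplyr)

data.raw <- data.frame(a = c(1, 2, 3, 4), b = c(1, 2, 3, 4))

# Ungrouped: inside mutate, `a` and `b` are the full columns, so c(a, b)
# has 8 elements and mean() returns a single number recycled to every row.
ungrouped <- data.raw %>%
  mutate(data.average = mean(c(a, b), na.rm = TRUE))

# rowwise(): every row becomes its own group, so `a` and `b` are length-1
# vectors and mean() is computed per row.
by_row <- data.raw %>%
  rowwise() %>%
  mutate(data.average = mean(c(a, b), na.rm = TRUE)) %>%
  ungroup()

# A vectorised function such as rowMeans() works row by row without grouping.
vectorised <- data.raw %>%
  mutate(data.average = rowMeans(cbind(a, b), na.rm = TRUE))
```

A rule of thumb that follows from this: a function that returns one value per input element (like +, sqrt or rowMeans) behaves "row-wise", while a function that collapses its input (like mean or sum) is applied once per group.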
Related
I am trying to use the group_by in R, then summarise while keeping extra columns in the data.
I just want to group by id_trayecto, but I want to include the other columns that are inside the group_by. The thing is that I don't know how to include them, without them being inside the group_by.
This is my code:
library(dplyr)
prueba <- reservas %>%
  group_by(id_trayecto, id_trayecto_dia, Van, fecha, hora) %>%
  summarize(Tickets.Vendidos = n(),
            Revenue = sum(Costo.final))
Thanks in advance guys :))
The answer really depends on the cardinality between id_trayecto and the other columns in reservas that you want to keep.
To simplify, let's say you only want to keep id_trayecto and fecha after summarizing. If reservas in general can contain multiple values for fecha for a given value of id_trayecto, then what you want to do doesn't make sense, since summarizing by id_trayecto would possibly need to summarize across multiple values of fecha, so you'd either have to leave fecha out, or include it in the summarize statement with an appropriate aggregation function.
If, however, reservas only ever contains a single value of fecha for a given value of id_trayecto, then you can just include fecha in the group_by statement without changing the resulting values of Tickets.Vendidos and Revenue. In other words: grouping and summarizing by the more granular variable is equivalent to grouping and summarizing by both variables.
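As a rough illustration of the second case, here is a sketch on made-up data. The toy reservas below is hypothetical (the real table is not shown in the question) and is built so that each id_trayecto has exactly one fecha. Under that assumption, adding fecha to group_by, or carrying it along with first(), gives the same counts and revenue.

```r
library(dplyr)

# Hypothetical toy data: every id_trayecto maps to a single fecha.
reservas <- data.frame(
  id_trayecto = c(1, 1, 2, 2, 2),
  fecha       = c("2020-01-01", "2020-01-01",
                  "2020-01-02", "2020-01-02", "2020-01-02"),
  Costo.final = c(10, 20, 5, 5, 5)
)

# Option 1: add the extra column to group_by.
by_both <- reservas %>%
  group_by(id_trayecto, fecha) %>%
  summarise(Tickets.Vendidos = n(),
            Revenue = sum(Costo.final)) %>%
  ungroup()

# Option 2: group by id_trayecto only and aggregate fecha explicitly.
by_id <- reservas %>%
  group_by(id_trayecto) %>%
  summarise(fecha = first(fecha),
            Tickets.Vendidos = n(),
            Revenue = sum(Costo.final))
```

If a trip could have several dates, option 1 would produce one row per (id_trayecto, fecha) pair, and option 2 would silently keep only the first date, so it is worth checking the cardinality first.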
I am using R for a project for University. I imported a csv file and created a df. Everything was going smoothly until I had to gather the percentages of age groups in the "Age" column. There are 3,000 rows of information in my df. How do I only sample information from rows 50-200 to find the percentages of people ages 15-20, 21-25, 26-30, and 31-35?
You can create another df that only contains rows 50-200 using the slice function. For example, my_data %>% slice(1:6) gives rows 1-6, so my_data %>% slice(50:200) gives rows 50-200. In case you didn't know, slice comes from dplyr, which is part of the tidyverse and is loaded by library(tidyverse). To filter for particular age groups, you can use the filter function from the same package, e.g. my_data %>% filter.
If your goal is to draw a random sample rather than take specific rows, you can use the function sample_n instead of slice.
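Putting the pieces together, something along these lines might work. It is only a sketch: my_data is simulated here because the real csv is not available, and the column name age_group and the use of cut() to bin the ages are my own choices, not something from the question.

```r
library(dplyr)

# Hypothetical stand-in for the imported csv: 3,000 rows with an Age column.
set.seed(1)
my_data <- data.frame(Age = sample(15:35, 3000, replace = TRUE))

age_pct <- my_data %>%
  slice(50:200) %>%                                   # keep rows 50 to 200
  mutate(age_group = cut(Age,
                         breaks = c(14, 20, 25, 30, 35),
                         labels = c("15-20", "21-25", "26-30", "31-35"))) %>%
  count(age_group) %>%                                # rows per age group
  mutate(percent = 100 * n / sum(n))                  # share of the 151 rows
```

cut() uses right-closed intervals by default, so the breaks above put age 20 into "15-20" and age 21 into "21-25".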
I'm running into an issue that I feel should be simple but cannot figure out. I have searched the board for comparable problems/questions but was unable to find an answer.
In short, I have data from a variety of motor vehicles and want to know the average speed of each vehicle when it is at maximal acceleration. I also want the opposite: the average acceleration at top speed.
I am able to do this for the whole dataset using the following code
data <- data %>% group_by(Name) %>%
  mutate(speedATaccel = with(data, avg.Speed[which.max(top.Accel)]),
         accelATspeed = with(data, avg.Accel[which.max(top.Speed)]))
However, the group_by function doesn't appear to be working: it just provides the values across the whole dataset as opposed to each individual vehicle group.
Any help would be appreciated.
Thanks,
Using with(data, ...) bypasses the grouping set by group_by: with evaluates the expression against the full data frame, so which.max returns the index over the whole data. Instead, use tidyverse methods, i.e. remove the with(data, ...) wrapper. Note that in the tidyverse we don't need any of the base R extraction methods such as $, [[ or with; instead, specify the unquoted column name.
library(dplyr)
data %>%
  group_by(Name) %>%
  mutate(speedATaccel = avg.Speed[which.max(top.Accel)],
         accelATspeed = avg.Accel[which.max(top.Speed)])
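To see the difference side by side, here is a small sketch on made-up numbers. The toy data frame and the result names with_version and grouped_version are hypothetical; only the column names follow the question.

```r
library(dplyr)

# Hypothetical toy data: two vehicles, two observations each.
data <- data.frame(
  Name      = c("A", "A", "B", "B"),
  avg.Speed = c(10, 20, 30, 40),
  top.Accel = c(5, 1, 2, 9)
)

# with(data, ...) looks at the whole data frame, ignoring the groups:
# the overall maximum of top.Accel is in row 4, so every row gets 40.
with_version <- data %>%
  group_by(Name) %>%
  mutate(speedATaccel = with(data, avg.Speed[which.max(top.Accel)])) %>%
  ungroup()

# Bare column names are evaluated per group:
# A -> max accel in its 1st row -> 10; B -> max accel in its 2nd row -> 40.
grouped_version <- data %>%
  group_by(Name) %>%
  mutate(speedATaccel = avg.Speed[which.max(top.Accel)]) %>%
  ungroup()
```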
This is a mock-up based on mtcars of what I would like to do:
compute a column that counts the number of cars that have less displacement (disp) than the current row within the same transmission category (am).
The expected column holds the values I would like to get.
try1 is one attempt with the findInterval function. The problem is that I cannot make it count across the subsets that depend on the category (am).
I have tried solutions with *apply but I am somehow never able to make the function called work only on a subset that depends on the value of a variable of the row that is processed (hope this makes sense).
x = mtcars[1:6,c("disp","am")]
# expected values are the number of cars that have less disp while having the same am
x$expected = c(1,1,0,1,2,0)
#this ordered table is for findInterval
a = x[order(x$disp),]
a
# I use the findInterval function to get the number of values and I try subsetting the call
# -0.1 is to deal with the closed interval
x$try1 = findInterval(x$disp-0.1, a$disp[a$am==x$am])
x
# try1 values are not computed depending on the subsetting of a
Any solution will do; the use of the findInterval function is not mandatory.
I'd rather have a more general solution enabling a column value to be computed by calling a function that takes values from the current row to compute the expected value.
As pointed out by @dimitris_ps, the previous solution neglects the duplicated counts. The following remedies that.
library(dplyr)
x %>%
  group_by(am) %>%
  mutate(expected = findInterval(disp, sort(disp) + 0.0001))
or
library(data.table)
setDT(x)[, expected:=findInterval(disp, sort(disp) + 0.0001), by=am]
Based on @Khashaa's logic, this is my approach:
library(dplyr)
mtcars %>%
  group_by(am) %>%
  mutate(expected = match(disp, sort(disp)) - 1)
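For anyone who wants to check these against the expected column from the question, here is a small sketch that runs both approaches on the six-row mock-up. The third column, by_sapply, is an extra of mine aimed at the "more general solution" the question asks for: it calls a function once per row, with the rest of the group still visible. The column names by_interval, by_match and by_sapply are made up for the comparison.

```r
library(dplyr)

x <- mtcars[1:6, c("disp", "am")]
x$expected <- c(1, 1, 0, 1, 2, 0)

res <- x %>%
  group_by(am) %>%
  mutate(
    # shift the sorted breaks slightly so ties are not counted
    by_interval = findInterval(disp, sort(disp) + 0.0001),
    # position of the first occurrence in the sorted group, minus one
    by_match    = match(disp, sort(disp)) - 1,
    # general pattern: `d` is the current row's value, `disp` the whole group
    by_sapply   = sapply(disp, function(d) sum(disp < d))
  ) %>%
  ungroup()
```

The sapply version is the slowest of the three on large groups, but the function body can be swapped for any per-row computation that needs the group's values.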
I am learning R and don't understand a section of the below function. In the below function what exactly is count=length(address) doing? Is there another way to do this?
crime_dat = ddply(crime, .(lat, lon), summarise, count = length(address))
The plyr library has two very common "helper" functions, summarize and mutate.
Summarise is used when you want to discard irrelevant data/columns, keeping only the levels of the grouping variable(s) and the specific summary functions of those groups (in your example, length).
Mutate is used to add a column (analogous to transform in base R), but without discarding anything. If you run these two commands, they should illustrate the difference nicely.
library(plyr)
ddply(mtcars, .(cyl), summarise, count = length(mpg))
ddply(mtcars, .(cyl), mutate, count = length(mpg))
In this example, as in your example, the goal is to figure out how many rows there are in each group. When using ddply like this with summarise, we need to pick a function that takes a single column (vector) as an argument, so length is a good choice. Since we're just counting rows / taking the length of the vector, it doesn't matter which column we pass to it. Alternatively, we could use nrow, but for that we have to pass a whole data.frame, so summarise won't work. In this case it saves us typing:
ddply(mtcars, .(cyl), nrow)
But if we want to do more, summarise really shines
ddply(mtcars, .(cyl), summarise, count = length(mpg),
      mean_mpg = mean(mpg), mean_disp = mean(disp))
Is there another way to do this?
Yes, many other ways.
I'd second Alex's recommendation to use dplyr for things like this. The summarize and mutate concepts are still used, but it works faster and results in more readable code.
Other options include the data.table package (also a great option), tapply() or aggregate() in base R, and countless other possibilities.
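For reference, a rough sketch of what the dplyr and base R versions of the mtcars example might look like. The object names counts, counts_short and counts_base are made up; n() and count() are dplyr functions, and aggregate() is base R.

```r
library(dplyr)

# dplyr equivalent of the ddply(..., summarise, ...) call:
# n() counts the rows in each group, so no column has to be passed to length().
counts <- mtcars %>%
  group_by(cyl) %>%
  summarise(count = n(),
            mean_mpg = mean(mpg),
            mean_disp = mean(disp))

# Shortcut when only the count is needed (the count column is called n).
counts_short <- mtcars %>% count(cyl)

# Base R: length of mpg within each level of cyl.
counts_base <- aggregate(mpg ~ cyl, data = mtcars, FUN = length)
```

It is best to try this in a session where plyr is not attached after dplyr, since both packages define summarise and mutate and the later one masks the earlier.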