I'm running into an issue that I feel should be simple but cannot figure out. I have searched the board for comparable problems and questions but was unable to find an answer.
In short, I have data from a variety of motor vehicles and want to know the average speed of each vehicle when it is at maximal acceleration. I also want the opposite: the average acceleration at top speed.
I am able to do this for the whole dataset using the following code:
data <- data %>%
  group_by(Name) %>%
  mutate(speedATaccel = with(data, avg.Speed[which.max(top.Accel)]),
         accelATspeed = with(data, avg.Accel[which.max(top.Speed)]))
However, group_by() doesn't appear to be working: it just provides the values across the whole dataset rather than for each individual vehicle group.
Any help would be appreciated.
Thanks,
Using with(data, ...) bypasses the grouping set by group_by(), so which.max() returns the index over the whole dataset. Instead, use tidyverse methods, i.e. remove the with(data, ...) wrapper. Note that inside tidyverse verbs we don't need any of the base R extraction methods (i.e. $, [[ or with()); instead, specify the unquoted column name:
library(dplyr)
data %>%
  group_by(Name) %>%
  mutate(speedATaccel = avg.Speed[which.max(top.Accel)],  # evaluated within each Name group
         accelATspeed = avg.Accel[which.max(top.Speed)])
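To make the per-group behaviour concrete, here is a minimal sketch on made-up data. The data frame `toy` and its values are hypothetical; only the column names come from the question.

```r
library(dplyr)

# Hypothetical toy data: two vehicles, three observations each
toy <- data.frame(
  Name      = c("A", "A", "A", "B", "B", "B"),
  avg.Speed = c(10, 20, 30, 40, 50, 60),
  top.Accel = c(1, 5, 2, 9, 3, 4),
  avg.Accel = c(0.5, 1.5, 1.0, 2.0, 2.5, 3.0),
  top.Speed = c(15, 25, 35, 70, 55, 65)
)

result <- toy %>%
  group_by(Name) %>%
  mutate(
    # which.max() is evaluated separately within each Name group
    speedATaccel = avg.Speed[which.max(top.Accel)],
    accelATspeed = avg.Accel[which.max(top.Speed)]
  ) %>%
  ungroup()

result
```

Each vehicle gets its own value repeated across its rows. If one row per vehicle is preferable, swapping mutate() for summarise() should give that.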
Related
I am analyzing COVID-19 data in R and I want to get the aggregate total cases for each continent.
total_cases_continent <- covid_data %>%
  select(continent, new_cases) %>%
  group_by(continent) %>%
  summarize(total_cases = sum(new_cases))
Instead of the total cases for each continent, the result I get is a single row with one overall total.
It looks like there might be some issues with the values of your variable "continent". I would recommend checking the class of the variable, as well as making sure all of the values are what you expected them to be. This is probably causing the issues within your group_by statement.
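The checks suggested above might look like the following sketch. The data frame `covid_toy` is a hypothetical stand-in for covid_data, and dropping missing values with `na.rm = TRUE` assumes that new_cases contains NAs, which the question does not confirm.

```r
library(dplyr)

# Hypothetical stand-in for covid_data, with some messy values
covid_toy <- data.frame(
  continent = c("Asia", "Asia", "Europe", NA, "Europe"),
  new_cases = c(10, NA, 5, 7, 3)
)

class(covid_toy$continent)   # is it character or factor, as expected?
unique(covid_toy$continent)  # look for NA, blanks, or misspellings

totals <- covid_toy %>%
  filter(!is.na(continent)) %>%       # drop rows with no continent
  group_by(continent) %>%
  summarise(total_cases = sum(new_cases, na.rm = TRUE))

totals
```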
I have a very big CSV file and I'm trying to find the number of times each value is repeated in a column.
CSV file I'm using: https://www.kaggle.com/nyphil/perf-history
This is what I've been trying to do:
library(dplyr)
repeatedcomposers<-table(ny_philarmonic$composerName)
This works, but it only gives me 1000 values instead of the 2767 composers in the dataframe.
I also need the result as a separate dataframe so I can use it later.
The main dplyr verbs (e.g., mutate(), arrange(), etc.) always return dataframes. So if you are looking to do some kind of operation that results in a dataframe, you are correct that a dplyr-centric approach is probably a good place to start. Base R functions are often vector-centric, so something like table() will often require additional steps afterward if you want a dataframe in the end.
Once you've committed to dplyr, you have at least two options for this particular dilemma:
Option 1
The count() function gets you there in one step.
df %>%
  count(composerName) %>%
  arrange(-n) # to bring the highest count to the top
Option 2
Although it is one more line, I personally prefer the more verbose option because it helps me see what is happening more easily.
df %>%
  group_by(composerName) %>%
  summarise(n = n()) %>%
  arrange(-n) # to bring the highest count to the top
It has the added benefit that I can roll right into additional summarise() calculations that I might care about too.
df %>%
  group_by(composerName) %>%
  summarise(
    n = n(),
    n_sq = n^2) # a little silly here, but often convenient in other contexts
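As a sketch of how either route ends up as a stored dataframe, the example below uses a hypothetical `toy` data frame in place of ny_philarmonic. If the 1000 values seen in the question are only a printing limit (an assumption; the question does not confirm it), nrow() reports the actual number of distinct composers.

```r
library(dplyr)

# Hypothetical stand-in for ny_philarmonic
toy <- data.frame(
  composerName = c("Beethoven", "Mozart", "Beethoven",
                   "Brahms", "Beethoven", "Mozart")
)

# Base R: table() returns a table object; as.data.frame() converts it
base_counts <- as.data.frame(table(toy$composerName), stringsAsFactors = FALSE)
names(base_counts) <- c("composerName", "n")

# dplyr: count() already returns a dataframe; sort = TRUE puts the largest n first
dplyr_counts <- toy %>%
  count(composerName, sort = TRUE)

nrow(dplyr_counts)  # number of distinct composers, however many rows get printed
```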
Consider data.table for large datasets
EDIT: I would be remiss if I failed to mention that data.table might be worth looking into for this larger dataset. Although dplyr is optimized for readability, it often slows down with datasets of more than 100k rows. In contrast, the data.table package is designed for speed with large datasets. If you are an R-focused person who often runs into large datasets, it's worth the time to look into. Here is a good comparison
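For reference, the same count in data.table could be sketched as follows. The `toy` table is hypothetical; `.N` is data.table's built-in row counter, and it becomes a column named N in the result.

```r
library(data.table)

# Hypothetical stand-in for the composer data
toy <- data.table(
  composerName = c("Beethoven", "Mozart", "Beethoven",
                   "Brahms", "Beethoven", "Mozart")
)

# Count rows per composer, then sort by the count in descending order
counts <- toy[, .N, by = composerName][order(-N)]

counts
```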
I have a simple problem I am trying to solve with the tidyverse, particularly dplyr (I believe this is the appropriate package).
What is the average age of daily riders?
There is a data.frame named BikeData with two columns of interest: cyc_freq, which includes the value Daily, and age, which contains the riders' ages.
I am attempting to write a script that returns the average age of those who ride their bikes Daily. I was able to solve the problem but feel like my solution was inefficient.
Is there a simpler way to achieve my answer using dplyr?
bavg <- filter(BikeData, cyc_freq == "Daily")
mean(bavg$age)
It can be done within summarise() itself, without a separate filter() step:
library(dplyr)
BikeData %>%
  summarise(Mean = mean(age[cyc_freq == "Daily"]))
Or in base R
with(BikeData, mean(age[cyc_freq == "Daily"]))
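If the means for every riding frequency are useful, a grouped summary gives them in one step. This is a sketch on a hypothetical `bike_toy` data frame; `na.rm = TRUE` assumes age may contain missing values, which the question does not state.

```r
library(dplyr)

# Hypothetical stand-in for BikeData
bike_toy <- data.frame(
  cyc_freq = c("Daily", "Daily", "Weekly", "Never", "Daily"),
  age      = c(30, 40, 25, 50, NA)
)

# Single value for daily riders, ignoring missing ages
daily_mean <- bike_toy %>%
  filter(cyc_freq == "Daily") %>%
  summarise(Mean = mean(age, na.rm = TRUE)) %>%
  pull(Mean)

# One mean per riding frequency
by_freq <- bike_toy %>%
  group_by(cyc_freq) %>%
  summarise(Mean = mean(age, na.rm = TRUE))

by_freq
```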
I'm struggling with multiple response questions in R. I'm hoping to find an easy way to tackle this with dplyr and tidyr. Below is a sample multiple response data frame. I'm trying to do two things. First, create percentages: % of cats, % of dogs, etc. Percentages will be of overall responses. My usual way of calculating percentages,
group_by(_)%>%summarise(count=n())%>%mutate(percent=count/sum(count))
doesn't seem to cut it in this situation. Maybe I have to use summarise_each or a more specialized function? I'm still new to R and really new to dplyr and tidyr. I also tried tidyr's unite() function, which works, but it includes NAs, which I will have to recode away. And I still can't seem to calculate the percentages of the united column.
Any suggestions would be great! First, how to unite the multiple response columns using "unite" into all possible combinations and then calculating percentages of each, and also how to simply calculate the percentage of each binary column as a proportion of overall responses? Hope this makes sense! I'm sure there's a simple and elegant answer that I'm overlooking.
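library(dplyr)
library(tidyr)
# values must be quoted strings; bare Cat, Dog, Fish would be undefined objects
Cats <- c("Cat", NA, "Cat", NA, NA, NA, "Cat", NA)
Dogs <- c(NA, NA, "Dog", "Dog", NA, "Dog", NA, "Dog")
Fish <- c(NA, NA, "Fish", NA, NA, NA, "Fish", "Fish")
Pets <- data.frame(Cats, Dogs, Fish)
Pets <- Pets %>% unite(Combined, Cats, Dogs, Fish, sep = ",", remove = FALSE)
# use Pets here; no object named Animals is ever created
Pets %>% group_by(Combined) %>% summarise(count = n()) %>% mutate(percent = count / sum(count))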
Based on my understanding of your question, it sounds like what you're trying to do can be done with the gather() function from tidyr instead of unite(), applied to the original Cats, Dogs and Fish columns:
library(dplyr)
library(tidyr)
Pets %>%
  gather(animal, type, na.rm = TRUE) %>%
  group_by(animal) %>%
  summarize(count = n()) %>%
  mutate(percentage = count / sum(count))
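In current tidyr, gather() is superseded by pivot_longer(), so the same reshaping could be sketched as below. The column `share_of_respondents` is an addition of mine, covering the other possible reading of "proportion of overall responses" (share of the 8 respondents rather than of the 10 non-missing answers).

```r
library(dplyr)
library(tidyr)

# Sample data from the question, before unite() adds a Combined column
Pets <- data.frame(
  Cats = c("Cat", NA, "Cat", NA, NA, NA, "Cat", NA),
  Dogs = c(NA, NA, "Dog", "Dog", NA, "Dog", NA, "Dog"),
  Fish = c(NA, NA, "Fish", NA, NA, NA, "Fish", "Fish")
)
n_respondents <- nrow(Pets)

shares <- Pets %>%
  # one row per non-missing answer
  pivot_longer(everything(), names_to = "animal", values_to = "type",
               values_drop_na = TRUE) %>%
  count(animal, name = "count") %>%
  mutate(
    percentage           = count / sum(count),     # share of all answers given
    share_of_respondents = count / n_respondents   # share of people surveyed
  )

shares
```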
I'm working with a data frame that looks very similar to the below:
Image here, unfortunately don't have enough reputation yet
This is a 600,000-row data frame. For every repeated instance within the same date, I'd like to divide the cost by the total number of repeated instances. I would also like to consider only those falling under the "Sales" tactic.
So for example, on 1/1/16, there are 2 "Help Packages" that are also under the "Sales" tactic. Because there are 2 instances within the same date, I'd like to divide the cost of each by 2 (so the cost would come out as $5 for each).
This is the code I have:
for(i in 1:length(dfExample$Date)){
  if(dfExample$Tactic[i] == "Sales"){
    list = agrep(dfExample$Package[i], dfExample$Package)
    for(i in list){
      date_repeats = agrep(i, dfExample$Date)
      dfExample$Cost[date_repeats] = dfExample$Package[i]/length(date_repeats)
    }
  }
}
It is incredibly inefficient and slow. I know there's got to be a better way to achieve this. Any help would be much appreciated. Thank you!
ave() can give a solution without additional packages:
with(dfExample, Cost / ave(Cost, Date, Package, Tactic, FUN=length))
Using dplyr:
library(dplyr)
dfExample %>%
  group_by(Date, Package, Tactic) %>%
  mutate(Cost = Cost / n())
I'm a little unclear what you mean by "instance". This (pretty clearly) groups by Date, Package, and Tactic, and so will consider each unique combination of those columns as a grouper. If you don't include Tactic in the definition of an "instance", then you can remove it to group only by Date and Package.
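If only the "Sales" rows should be divided and every other tactic left untouched, one possible sketch is to make the division conditional. The `dfExample` below is made-up data built from the description in the question, and the "Marketing" tactic is a hypothetical second value added for contrast.

```r
library(dplyr)

# Made-up data following the question's description
dfExample <- data.frame(
  Date    = c("1/1/16", "1/1/16", "1/1/16", "1/2/16"),
  Package = c("Help Packages", "Help Packages", "Help Packages", "Help Packages"),
  Tactic  = c("Sales", "Sales", "Marketing", "Sales"),
  Cost    = c(10, 10, 10, 10)
)

result <- dfExample %>%
  group_by(Date, Package, Tactic) %>%
  # divide by the group size only for Sales rows; keep other costs as they are
  mutate(Cost = if_else(Tactic == "Sales", Cost / n(), Cost)) %>%
  ungroup()

result
```

The two Sales rows on 1/1/16 each end up at 5, while the Marketing row and the single Sales row on 1/2/16 stay at 10.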