I'm working with a data frame that looks very similar to the below:
Image here, unfortunately don't have enough reputation yet
This is a 600,000 row data frame. What I want to do is for every repeated instance within the same date, I'd like to divide the cost by total number of repeated instances. I would also like to consider only those falling under the "Sales" tactic.
So for example, in 1/1/16, there are 2 "Help Packages" that are also under the "Sales" tactic. Because there are 2 instances within the same date, I'd like to divide the cost of each by 2 (so the cost would come out as $5 for each).
This is the code I have:
for (i in 1:length(dfExample$Date)) {
  if (dfExample$Tactic[i] == "Sales") {
    list = agrep(dfExample$Package[i], dfExample$Package)
    for (i in list) {
      date_repeats = agrep(i, dfExample$Date)
      dfExample$Cost[date_repeats] = dfExample$Package[i]/length(date_repeats)
    }
  }
}
It is incredibly inefficient and slow. I know there's got to be a better way to achieve this. Any help would be much appreciated. Thank you!
ave() can give a solution without additional packages:
with(dfExample, Cost / ave(Cost, Date, Package, Tactic, FUN=length))
Using dplyr:
library(dplyr)
dfExample %>%
  group_by(Date, Package, Tactic) %>%
  mutate(Cost = Cost / n())
I'm a little unclear what you mean by "instance". This (pretty clearly) groups by Date, Package, and Tactic, and so will consider each unique combination of those columns as a grouper. If you don't include Tactic in the definition of an "instance", then you can remove it to group only by Date and Package.
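Neither snippet above restricts the division to the "Sales" tactic. A minimal sketch of one way to do that, assuming the column names from the question and a made-up toy data frame (the real data was only posted as an image):

```r
library(dplyr)

# Hypothetical toy data standing in for the image in the question
dfExample <- data.frame(
  Date    = c("1/1/16", "1/1/16", "1/1/16", "1/2/16"),
  Package = c("Help Package", "Help Package", "Help Package", "Help Package"),
  Tactic  = c("Sales", "Sales", "Support", "Sales"),
  Cost    = c(10, 10, 10, 10)
)

result <- dfExample %>%
  group_by(Date, Package, Tactic) %>%
  # Divide by the group size only for "Sales" rows; leave other rows unchanged
  mutate(Cost = if_else(Tactic == "Sales", Cost / n(), Cost)) %>%
  ungroup()

result
```

With this toy data, the two "Sales" rows on 1/1/16 each end up with a cost of 5, while the "Support" row and the single "Sales" row on 1/2/16 keep their cost of 10.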
I am analyzing COVID-19 data in R and I want to get the total number of cases for each continent.
total_cases_continent <- covid_data %>%
  select(continent, new_cases) %>%
  group_by(continent) %>%
  summarize(total_cases = sum(new_cases))
I get the result below: instead of showing the total cases for each continent, it only shows a single row with the total across all continents.
It looks like there might be some issues with the values of your variable "continent". I would recommend checking the class of the variable, as well as making sure all of the values are what you expected them to be. This is probably causing the issues within your group_by statement.
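The checks suggested above could look roughly like this. The data frame below is a hypothetical stand-in for covid_data, and the explicit dplyr:: prefix is only a precaution in case another loaded package masks summarise; whether that applies here is an assumption, not something the question confirms.

```r
library(dplyr)

# Hypothetical stand-in for covid_data, with some missing values
covid_data <- data.frame(
  continent = c("Asia", "Asia", "Europe", "Europe", NA),
  new_cases = c(10, NA, 5, 7, 3)
)

# Inspect the grouping variable before aggregating
class(covid_data$continent)
unique(covid_data$continent)

total_cases_continent <- covid_data %>%
  filter(!is.na(continent)) %>%            # drop rows with no continent
  group_by(continent) %>%
  dplyr::summarise(total_cases = sum(new_cases, na.rm = TRUE))

total_cases_continent
```

If the grouping works, the result has one row per continent rather than a single overall total.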
I'm running into an issue that I feel should be simple but cannot figure out. I have searched the board for comparable questions but was unable to find an answer.
In short, I have data from a variety of motor vehicles and want to know the average speed of each vehicle when it is at maximal acceleration. I also want the opposite: the average acceleration at top speed.
I am able to do this for the whole dataset using the following code:
data<-data %>% group_by(Name) %>%
  mutate(speedATaccel= with(data, avg.Speed[which.max(top.Accel)]),
         accelATspeed= with(data, avg.Accel[which.max(top.Speed)]))
However, the group_by function doesn't appear to be working: it just provides the values across the whole dataset as opposed to each individual vehicle group.
Any help would be appreciated.
Thanks,
Using with(data, ...) bypasses the grouping set by group_by and computes the index on the whole data frame. Instead, use tidyverse methods, i.e. remove the with(data, ...) wrapper. Note that in the tidyverse we don't need any of the base R extraction methods (i.e. $, [[ or with); instead, specify the unquoted column name.
library(dplyr)
data %>%
  group_by(Name) %>%
  mutate(speedATaccel = avg.Speed[which.max(top.Accel)],
         accelATspeed = avg.Accel[which.max(top.Speed)])
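To see the per-group behaviour, here is a small sketch on made-up data. The column names follow the question; the values and the two vehicle names are invented for illustration.

```r
library(dplyr)

# Hypothetical toy data: two vehicles, two observations each
data <- data.frame(
  Name      = c("A", "A", "B", "B"),
  avg.Speed = c(10, 20, 30, 40),
  top.Speed = c(15, 25, 45, 35),
  avg.Accel = c(1, 2, 3, 4),
  top.Accel = c(5, 3, 6, 8)
)

grouped <- data %>%
  group_by(Name) %>%
  # which.max() is now evaluated within each vehicle's rows
  mutate(speedATaccel = avg.Speed[which.max(top.Accel)],
         accelATspeed = avg.Accel[which.max(top.Speed)]) %>%
  ungroup()

grouped
```

Vehicle A gets speedATaccel 10 and accelATspeed 2, while vehicle B gets 40 and 3, so each group has its own values rather than one value for the whole dataset.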
This question already has answers here:
count number of rows in a data frame in R based on group [duplicate]
I have a very big CSV file and I'm trying to find the number of times each value is repeated in a column.
The CSV file I'm using: https://www.kaggle.com/nyphil/perf-history
This is what I've been trying to do:
library(dplyr)
repeatedcomposers<-table(ny_philarmonic$composerName)
This works but only gives me 1000 values instead of the 2767 composers in the dataframe.
I also need it to create a separate dataframe so I can use it later.
The main dplyr verbs (e.g., mutate(), arrange(), etc.) always return dataframes. So if you are looking to do some kind of operation that results in a dataframe, you are correct that a dplyr-centric approach is probably a good place to start. Base R functions are often vector-centric, so something like table() will often require additional steps afterward if you want a dataframe in the end.
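For reference, the "additional steps" for table() might look like the following sketch. The composer vector is invented toy data, and the column name n is just a choice made here to match the dplyr examples.

```r
# Hypothetical toy vector standing in for ny_philarmonic$composerName
composers <- c("Beethoven", "Mozart", "Beethoven", "Brahms", "Beethoven", "Mozart")

# table() returns a table object; as.data.frame() converts it to a dataframe
counts <- as.data.frame(
  table(composerName = composers),
  responseName = "n",
  stringsAsFactors = FALSE
)

# Sort so the most frequent composer comes first
counts <- counts[order(-counts$n), ]
counts
```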
Once you've committed to dplyr, you have at least two options for this particular dilemma:
Option 1
The count() function gets you there in one step.
df %>%
  count(composerName) %>%
  arrange(-n) # to bring the highest count to the top
Option 2
Although it is one more line, I personally prefer the more verbose option because it helps me see what is happening more easily.
df %>%
  group_by(composerName) %>%
  summarise(n = n()) %>%
  arrange(-n) # to bring the highest count to the top
It has the added benefit that I can roll right into additional summarise() calculations that I might care about too.
df %>%
  group_by(composerName) %>%
  summarise(
    n = n(),
    n_sq = n^2) # a little silly here, but often convenient in other contexts
Consider data.table for large datasets
EDIT: I would be remiss if I failed to mention that data.table might be worth looking into for this larger dataset. Although dplyr is optimized for readability, it often slows down with datasets of more than 100k rows. In contrast, the data.table package is designed for speed with large datasets. If you are an R-focused person who often runs into large datasets, it's worth the time to look into. Here is a good comparison.
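As a rough sketch of what the same count looks like in data.table, again on invented toy data rather than the real CSV:

```r
library(data.table)

# Hypothetical toy data standing in for the performance history file
dt <- data.table(
  composerName = c("Beethoven", "Mozart", "Beethoven", "Brahms", "Beethoven", "Mozart")
)

# .N is the number of rows in each group; order(-N) sorts by descending count
composer_counts <- dt[, .N, by = composerName][order(-N)]
composer_counts
```

The result is itself a data.table (which is also a data.frame), so it can be stored and reused later.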
I know there are many threads answering questions similar to mine, but none of the answers out there are specific enough to what I want to obtain.
I've got the following dataset:
I want to count the number of patients (found in "Var_name") that harbour each mutation (found in "var_id") and display the count in a new column ("var_freq"). I've tried things like:
y <- ALL_merged %>%
  group_by(var_id, Var_name) %>%
  summarise(n_counts = n(), var_freq = sum(var_id == Var_name))
NOTE: In case it is relevant for the answers... I had to convert "var_id" and "Var_name" into characters to make this work because they were factors.
However, this does not give me the output I want. Instead, I get the count of how many times each "var_id" appears per patient since, for each "var_id", the same "Var_name" appears many times (because rows contain additional columns with different information), so the final outcome gives me a higher count than I would expect:
I also want to add this new column to the original dataset. I believe this could be done, for example, by using "mutate", but I'm not sure how to set everything up...
So, in sum, what I want to get is: for each "var_id", how many different "Var_name" values I have, taking into account that the data are duplicated...
Thanks in advance!
It is not entirely clear what you are looking for. It would help to provide data without images (such as using dput) and clearly show what your output should be for example data.
Based on what you describe this might be helpful. First, group_by just var_id alone. Then in summarise, you can include n() to get the number of observations/rows of data for each var_id, and then n_distinct to get the number of unique Var_name for each var_id:
library(dplyr)
df %>%
  group_by(var_id) %>%
  summarise(n_counts = n(),
            var_freq = n_distinct(Var_name))
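Since the question also asks to attach the count to the original dataset, one possible variation is to swap summarise for mutate so that every row is kept. This is only a sketch on made-up data; the var_id and Var_name values below are invented.

```r
library(dplyr)

# Hypothetical toy data: patient names repeat within a mutation
ALL_merged <- data.frame(
  var_id   = c("m1", "m1", "m1", "m2", "m2"),
  Var_name = c("p1", "p1", "p2", "p1", "p1"),
  stringsAsFactors = FALSE
)

with_freq <- ALL_merged %>%
  group_by(var_id) %>%
  # n_distinct() counts each patient once per mutation, ignoring duplicate rows
  mutate(var_freq = n_distinct(Var_name)) %>%
  ungroup()

with_freq
```

Here mutation m1 gets var_freq 2 (patients p1 and p2) and m2 gets 1, repeated on every row of the corresponding mutation.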
I have a simple problem I am trying to solve with the tidyverse, particularly dplyr (I believe this is the appropriate package).
What is the average age of daily riders?
There is a data.frame named BikeData with two columns: cyc_freq, which includes the Daily observation, and age, which contains the riders' ages.
I am attempting to write a script that returns the average age of those who ride their bikes Daily. I was able to solve the problem but feel like my solution was inefficient.
Is there a simpler way to achieve my answer using dplyr?
bavg <- filter(BikeData, cyc_freq == "Daily")
mean(bavg$age)
It could be done within summarise itself without the need to have another step with filter
library(dplyr)
BikeData %>%
  summarise(Mean = mean(age[cyc_freq == "Daily"]))
Or in base R
with(BikeData, mean(age[cyc_freq == "Daily"]))
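A small sketch of the base R version on invented data. The na.rm = TRUE argument is an addition here, assuming the age column may contain missing values; the tapply() line is an optional extra that gives the mean age for every riding frequency at once.

```r
# Hypothetical toy data standing in for BikeData
BikeData <- data.frame(
  cyc_freq = c("Daily", "Weekly", "Daily", "Never", "Daily"),
  age      = c(30, 45, 40, 60, NA)
)

# Mean age of daily riders, ignoring missing ages
daily_mean <- with(BikeData, mean(age[cyc_freq == "Daily"], na.rm = TRUE))

# Mean age for each riding frequency
by_freq <- tapply(BikeData$age, BikeData$cyc_freq, mean, na.rm = TRUE)

daily_mean
by_freq
```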