I have a simple problem I am trying to solve with the tidyverse, particularly dplyr (I believe this is the appropriate package).
What is the average age of daily riders?
There is a data.frame named BikeData with two columns of interest: cyc_freq, which includes the value Daily, and age, which contains the riders' ages.
I am attempting to write a script that returns the average age of those who ride their bikes Daily. I was able to solve the problem but feel like my solution was inefficient.
Is there a simpler way to achieve my answer using dplyr?
bavg <- filter(BikeData, cyc_freq == "Daily")
mean(bavg$age)
This can be done within summarise itself, without a separate filter step:
library(dplyr)
BikeData %>%
summarise(Mean = mean(age[cyc_freq == "Daily"]))
Or in base R
with(BikeData, mean(age[cyc_freq == "Daily"]))
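To check that the approaches agree, here is a minimal sketch on made-up data. The BikeData below is a hypothetical stand-in (the real data is not shown in the question), and the grouped variant at the end is an extra, in case the mean for every riding frequency is wanted at once.

```r
library(dplyr)

# Hypothetical stand-in for BikeData; the real data is not shown in the question
BikeData <- data.frame(
  cyc_freq = c("Daily", "Weekly", "Daily", "Monthly", "Daily"),
  age      = c(20, 35, 30, 50, 40)
)

# Option 1: filter first, then summarise
mean_filter <- BikeData %>%
  filter(cyc_freq == "Daily") %>%
  summarise(Mean = mean(age))

# Option 2: subset inside summarise, as in the answer above
mean_subset <- BikeData %>%
  summarise(Mean = mean(age[cyc_freq == "Daily"]))

# Extra: mean age for every riding frequency at once
mean_by_freq <- BikeData %>%
  group_by(cyc_freq) %>%
  summarise(Mean = mean(age))
```

If the real age column contains missing values, mean(age, na.rm = TRUE) is probably what you want in each case.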
Related
I have a dataframe called data with variables for date, time, temperature, and a group number called Box #. I'm trying to subset the data to find the maximum temperature for each day, for each box, along with the time at which that temperature occurred. Ideally I could place this into a new dataframe with the date, the maximum temperature, and the time it occurred at.
I tried using ddply, but the code only returns one line of output:
ddply(data, .('Box #', 'Date'), summarize, max('Temp'))
I was able to find the maximum temperatures for each day using tapply on separate dataframes that only contain the values for individual groups
mx_day_2 <- tapply(box2$Temp, box2$Date, max)
I was unable to apply this to the larger dataframe with all groups and cannot figure out how to also get time from this code.
Is it possible to have ddply subset by both Box # and Date, then return two separate outputs of both maximum temperature and time, or do I need to use a different function here?
Edit: I managed to get the maximum temperatures using a version of the code in the answer below, but still haven't figured out how to find the time at which the max occurs in the same data. The code that worked for the first part was:
max_data <- data %>%
group_by(data$'Box #', data$'Date')
max_values <- summarise(max_data, max_temp=max(Temp, na.rm=TRUE))
I would use dplyr/tidyverse instead of plyr; dplyr is the successor to plyr. I would also clean the column names with janitor, since spaces and symbols in names are awkward to work with (clean_names() changes 'Box #' to box_number).
library(tidyverse)
library(janitor)
mx_day2 <- data %>%
  clean_names() %>%
  group_by(date, box_number) %>%
  summarise(max_temp = max(temp, na.rm = TRUE))
I found a solution that pulls full rows from the initial dataframe into a new dataframe based on only max values. Full code for the solution below
max_data_v2 <- data %>%
  group_by(`Box #`, Date) %>%
  filter(Temp == max(Temp, na.rm = TRUE))
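One thing to be aware of with the filter approach is that it keeps every row that ties for the maximum. If exactly one row per box and day is wanted, slice_max (dplyr 1.0.0 or later) may be an alternative. The sketch below uses made-up data; the Time column name and all values are assumptions, since the real data is not shown.

```r
library(dplyr)

# Hypothetical sample; column names mirror the question, Time is an assumed name
data <- data.frame(
  `Box #` = c(1, 1, 1, 2, 2, 2),
  Date    = c("2020-01-01", "2020-01-01", "2020-01-02",
              "2020-01-01", "2020-01-01", "2020-01-02"),
  Time    = c("08:00", "14:00", "09:00", "08:00", "14:00", "10:00"),
  Temp    = c(10, 15, 12, 11, 9, 13),
  check.names = FALSE  # keep the 'Box #' name as-is
)

# One full row per box and date: the row holding the highest temperature
max_rows <- data %>%
  group_by(`Box #`, Date) %>%
  slice_max(Temp, n = 1, with_ties = FALSE) %>%
  ungroup()
```

Each row of max_rows carries the Time at which the maximum occurred, because the whole row is kept rather than just the summarised value.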
I am using R for a project for University. I imported a csv file and created a df. Everything was going smoothly until I had to gather the percentages of age groups in the "Age" column. There are 3,000 rows of information in my df. How do I only sample information from rows 50-200 to find the percentages of people ages 15-20, 21-25, 26-30, and 31-35?
You can try creating another df that only contains rows 50-200 using the slice function, e.g. my_data %>% slice(1:6) would give rows 1-6, I believe. In case you didn't know, this function is part of the tidyverse (dplyr), which you can load using library(tidyverse). For filtering by particular age groups, you can again use the tidyverse filter function, e.g. my_data %>% filter.
If your goal is to sample rows at random rather than take specific rows, you can use the function sample_n instead of slice.
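Putting the pieces together, a rough sketch of the percentages for rows 50-200 might look like this. The my_data below is randomly generated purely for illustration, and the use of cut to build the age bands is my own suggestion rather than something from the answers above.

```r
library(dplyr)

# Hypothetical data: 3,000 rows with an Age column, as described in the question
set.seed(1)
my_data <- data.frame(Age = sample(15:35, 3000, replace = TRUE))

age_pct <- my_data %>%
  slice(50:200) %>%  # rows 50 to 200 inclusive (151 rows)
  mutate(AgeGroup = cut(Age,
                        breaks = c(14, 20, 25, 30, 35),  # (14,20], (20,25], ...
                        labels = c("15-20", "21-25", "26-30", "31-35"))) %>%
  count(AgeGroup) %>%                  # number of people per age group
  mutate(Percent = 100 * n / sum(n))   # share of the 151 sliced rows
```

Ages outside 15-35 would fall into an NA group with these breaks, so adjust them if your real data has a wider range.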
I'm running into an issue that I feel should be simple but cannot figure out. I have searched the board for comparable problems/questions but was unable to find an answer.
In short, I have data from a variety of motor vehicles and am looking to find the average speed of each vehicle when it is at maximal acceleration. I also want the opposite: the average acceleration at top speed.
I am able to do this for the whole dataset using the following code
data<-data %>% group_by(Name) %>%
mutate(speedATaccel= with(data, avg.Speed[which.max(top.Accel)]),
accelATspeed= with(data, avg.Accel[which.max(top.Speed)]))
However, the group_by function doesn't appear to be working; it just provides the values across the whole dataset as opposed to each individual vehicle group.
Any help would be appreciated.
Thanks,
Wrapping the expression in with(data, ...) bypasses the grouping set up by group_by, so the index is computed on the whole data. Instead, use tidyverse methods, i.e. remove the with(data, ...) wrapper. Note that in the tidyverse we don't need any of the base R extraction methods ($, [[ or with); instead, specify the unquoted column name:
library(dplyr)
data %>%
group_by(Name) %>%
mutate(speedATaccel = avg.Speed[which.max(top.Accel)],
accelATspeed = avg.Accel[which.max(top.Speed)])
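To see the grouping take effect, here is a small sketch on made-up data. The values are hypothetical; only the column names come from the question. It uses summarise rather than mutate so that there is one row per vehicle, which is a choice on my part, and mutate works the same way if the values should be repeated on every row.

```r
library(dplyr)

# Hypothetical sample with the column names from the question
data <- data.frame(
  Name      = c("car", "car", "car", "truck", "truck"),
  avg.Speed = c(50, 60, 70, 30, 40),
  top.Accel = c(3, 5, 4, 2, 1),
  avg.Accel = c(1.5, 2.5, 2.0, 1.0, 0.5),
  top.Speed = c(80, 90, 100, 60, 70)
)

# which.max is evaluated within each Name group, not over the whole data
per_vehicle <- data %>%
  group_by(Name) %>%
  summarise(speedATaccel = avg.Speed[which.max(top.Accel)],
            accelATspeed = avg.Accel[which.max(top.Speed)])
```

Note that which.max returns only the first position of the maximum, so ties within a vehicle are resolved in favour of the earliest row.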
Hi I have a data frame (~4 million rows) with time series data for different sites and events.
Here is a rough idea of my data, obviously on a different scale. I have several similar time series, so I've kept it general, as I want to be able to apply it in different cases:
Data1 <- data.frame(DateTimes =as.POSIXct("1988-04-30 13:20:00")+c(1:10,12:15,20:30,5:13,16:20,22:35)*300,
Site = c(rep("SiteA",25),rep("SiteB",28)),
Quality = rep(25,53),
Value = round(runif(53,0,5),2),
Othermetadata = c(rep("E1",10),rep("E2",15),rep("E1",10),rep("E2",18)))
What I'm looking for is a simple way to group and aggregate this data to different timesteps while keeping metadata which doesn't vary within the group
I have tried using the zoo library and zoo::aggregate ie:
library(zoo)
zooData <- read.zoo(select(Data1, DateTimes, Value))
zooagg <- aggregate(zooData, time(zooData) - as.numeric(time(zooData))%%3600, FUN = sum, reg = T)
However when I do this I'm losing all my metadata and merging different sites data.
I wondered about trying to use plyr or dplyr to split up the data and then applying the aggregate, but I'm still going to lose my other columns.
Is there a better way to do this? I had a brief look at the documentation for the xts library but couldn't see an intuitive solution in there either.
*Note: as I want this to work for a few different things, both the starting time step and the final time step might change, with the possibility of a random time step, or a somewhat regular time step with missing points. The FUN applied may also vary (mostly sum or mean), as may the fields I want to split it by.*
Edit I found the solution after Hercules Apergis pushed me in the right direction.
newData <- Data1 %>% group_by(timeagg, Site) %>% summarise(Total = sum(Value))
finaldata <- inner_join(Data1,newData) %>% select(-DateTimes, - Value) %>% distinct()
The original DateTimes column wasn't a grouping variable - it was the time series, so I added a grouping variable of my aggregated time (here: time to the nearest hour) and summarised on this. Problem was if I joined on this new column I missed any points where there was time during that hour but not on the hour. Thus the inner_join %>% select %>% distinct method.
Now hopefully it works with my real data, not just the example data!
Given the function that you have on aggregation:
aggregate(zooData, time(zooData) - as.numeric(time(zooData))%%3600, FUN = sum, reg = T)
You want to sum the values by group of times AND NOT lose other columns. You can simply do this with the dplyr package:
library(dplyr)
newdata <- Data1 %>% group_by(DateTimes) %>% summarise(sum(Value))
finaldata <- inner_join(Data1, newdata, by = "DateTimes")
newdata is a data.frame in which the Values are summed within each group of DateTimes. inner_join then merges the rows that the two datasets have in common on the DateTimes variable. Since I am not entirely sure what your desired output is, this should be a good starting point.
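The timeagg grouping variable mentioned in the question's edit is not defined in the code shown there. A minimal sketch of one way to build it, assuming it is the time floored to the hour (the same expression as in the zoo call), is below. The step_secs name and the tz = "UTC" setting are my own assumptions, and Othermetadata is left out of the sample for brevity.

```r
library(dplyr)

set.seed(42)
# Sample data modelled on the question; tz = "UTC" is an assumption for reproducibility
Data1 <- data.frame(
  DateTimes = as.POSIXct("1988-04-30 13:20:00", tz = "UTC") +
    c(1:10, 12:15, 20:30, 5:13, 16:20, 22:35) * 300,
  Site    = c(rep("SiteA", 25), rep("SiteB", 28)),
  Quality = rep(25, 53),
  Value   = round(runif(53, 0, 5), 2)
)

step_secs <- 3600  # hypothetical parameter: target time step in seconds (one hour)

hourly <- Data1 %>%
  # floor each timestamp to the start of its time step
  mutate(timeagg = DateTimes - as.numeric(DateTimes) %% step_secs) %>%
  # group by the new time step plus any metadata that is constant within it
  group_by(timeagg, Site, Quality) %>%
  summarise(Total = sum(Value), .groups = "drop")
```

Because the metadata columns are part of the grouping, they survive the summarise without a join. A column that changes within a time step would split that step into several rows, so only group by metadata that really is constant. Swapping sum for mean, or changing step_secs, covers the other cases mentioned in the question.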
I'm working with a data frame that looks very similar to the below:
Image here, unfortunately don't have enough reputation yet
This is a 600,000 row data frame. What I want to do is for every repeated instance within the same date, I'd like to divide the cost by total number of repeated instances. I would also like to consider only those falling under the "Sales" tactic.
So for example, on 1/1/16, there are 2 "Help Packages" that are also under the "Sales" tactic. Because there are 2 instances within the same date, I'd like to divide the cost of each by 2 (so the cost would come out as $5 for each).
This is the code I have:
for(i in 1:length(dfExample$Date)){
if(dfExample$Tactic[i] == "Sales"){
list = agrep(dfExample$Package[i], dfExample$Package)
for(i in list){
date_repeats = agrep(i, dfExample$Date)
dfExample$Cost[date_repeats] = dfExample$Package[i]/length(date_repeats)
}
}
}
It is incredibly inefficient and slow. I know there's got to be a better way to achieve this. Any help would be much appreciated. Thank you!
ave() can give a solution without additional packages:
with(dfExample, Cost / ave(Cost, Date, Package, Tactic, FUN=length))
Using dplyr:
library(dplyr)
dfExample %>%
group_by(Date, Package, Tactic) %>%
mutate(Cost = Cost / n())
I'm a little unclear what you mean by "instance". This (pretty clearly) groups by Date, Package, and Tactic, and so will consider each unique combination of those columns as a grouper. If you don't include Tactic in the definition of an "instance", then you can remove it to group only by Date and Package.
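The question also asks that only rows under the "Sales" tactic be adjusted. A sketch of one way to do that with if_else is below. The data is made up to match the example in the question (two "Help Package" rows on 1/1/16 costing $10 each), since the original was only shown as an image.

```r
library(dplyr)

# Hypothetical sample; the original data was only shown as an image
dfExample <- data.frame(
  Date    = c("1/1/16", "1/1/16", "1/1/16", "1/2/16"),
  Package = c("Help Package", "Help Package", "Other Package", "Help Package"),
  Tactic  = c("Sales", "Sales", "Marketing", "Sales"),
  Cost    = c(10, 10, 8, 10)
)

result <- dfExample %>%
  group_by(Date, Package, Tactic) %>%
  # divide by the number of repeats only for Sales rows; leave the rest untouched
  mutate(Cost = if_else(Tactic == "Sales", Cost / n(), Cost)) %>%
  ungroup()
```

With this sample, the two Sales "Help Package" rows on 1/1/16 come out at 5 each, while the Marketing row and the single row on 1/2/16 keep their original cost.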