I am using a data set of public transit information in RStudio. One column in this huge data frame is Origin Station. I'd like to count the number of times each specific station appears as an origin station and then create a new column with that value. I'd do this in Excel, but the data file is way too big. That is, for every record where "14 St-Union Sq" is the value for Origin Station, there will be a new column containing the total number of times that 14 St-Union Sq was the Origin Station.
Thanks.
Sounds like a job for the dplyr package: use the n() function together with group_by. Try something like this:
df <- data.frame(origin = sample(letters[1:5], 1000, replace = TRUE),
other_column = rnorm(1000))
library(dplyr)
df %>% group_by(origin) %>% mutate(n_appearances = n())
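You can use the ave function. Passing seq_along(variable) as the first argument keeps the result numeric even when the column is character or factor:
test$count <- with(test, ave(seq_along(variable), variable, FUN = length))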
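As a rough cross-check of the two approaches above, the sketch below builds a small made-up data frame (the column name origin_station and the station names other than "14 St-Union Sq" are placeholders, not from the question) and counts appearances once with base R's ave and once with dplyr's add_count, which is shorthand for group_by + mutate(n()).

```r
library(dplyr)

# Hypothetical stand-in for the transit data; only the idea of an
# "origin station" column comes from the question.
set.seed(42)
trips <- data.frame(
  origin_station = sample(c("14 St-Union Sq", "Canal St", "Bedford Av"),
                          20, replace = TRUE)
)

# Base R: group row indices by station and return each group's size
trips$count_ave <- ave(seq_along(trips$origin_station),
                       trips$origin_station, FUN = length)

# dplyr: add_count() appends the per-group count without collapsing rows
trips <- trips %>% add_count(origin_station, name = "count_dplyr")

head(trips)
```

Both columns should agree row for row; which one to use is mostly a matter of whether dplyr is already loaded.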
I attempted this question yesterday (Applying a function using elements within a list), but my reprex produced the wrong data structure and unfortunately the suggestions didn't work for my actual dataset.
I have what is hopefully a simple functional programming question. I have a list of locations with average temperature and amplitude for each day (180 days in my actual dataset). I want to iterate through these locations and create a sine curve of 24 points using a custom made function taking the average temperature and amplitude from each day within a list. Below is my new reprex.
library(tibble)
library(REdaS)##degrees to radians
library(tidyverse)
sinefunc<- function(Amplitude,Average){
hour<- seq(0,23,1)
temperature<-vector("double",length = 24)
for(i in seq_along(hour)){
temperature[i]<- (Amplitude*sin(deg2rad(180*(hour[i]/24)))+Average)+Amplitude*sin(deg2rad(180*hour[i]/12))
}
temperature
}
data<- tibble(Location = c(rep("London",6),rep("Glasgow",6),rep("Dublin",6)),
Day= rep(seq(1,6,1),3),
Average = runif(18,0,20),
Amplitude = runif(18,0,15))%>%
nest_by(Location)
Using Purrr and map_dfr I get the error Error in .x$Average : $ operator is invalid for atomic vectors
df<-data %>%
map_dfr(~sinefunc(.x$Average, .x$Amplitude))
Using lapply I get the error Error in x[, "Amplitude"] : incorrect number of dimensions
data <- lapply(data, function(x){
sinefunc(Amplitude = x[,"Amplitude"], Average = x[,"Average"])
})
My goal is to have 24 hourly data points for each day and location.
Any further help would be much appreciated.
Stuart
Maybe this is what you are looking for? You get a data frame back with 24 data points for each day and location, one column per combination, e.g. London_1, Dublin_1 etc.
library(dplyr)
library(purrr)
data<- tibble(Location = c(rep("London",6),rep("Glasgow",6),rep("Dublin",6)),
Day= rep(seq(1,6,1),3),
Average = runif(18,0,20),
Amplitude = runif(18,0,15))
# get group name
group_name <- data %>%
group_by(Location, Day) %>%
group_keys() %>%
mutate(group_name = stringr::str_c(Location,"_",Day)) %>%
pull(group_name)
data %>%
# split into lists
group_split(Location, Day) %>%
# get list name
setNames(group_name) %>%
# apply your function and get a dataframe back
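# (arguments are named because sinefunc expects Amplitude first, then Average)
map_dfr(~sinefunc(Amplitude = .x$Amplitude, Average = .x$Average))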
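If a long table (one row per location, day and hour) is easier to work with than one column per location-day, a sketch along the following lines may help. It assumes a vectorised stand-in for sinefunc, here called sine_curve (a name made up for this example), and replaces REdaS::deg2rad with the equivalent pi / 180 conversion so that only tidyverse packages are needed.

```r
library(dplyr)
library(tidyr)
library(purrr)

# Hypothetical vectorised rewrite of the question's sinefunc():
# deg2rad(180 * h / 24) is the same as pi * h / 24.
sine_curve <- function(Amplitude, Average) {
  hour <- 0:23
  Amplitude * sin(pi * hour / 24) + Average + Amplitude * sin(pi * hour / 12)
}

set.seed(1)
data <- tibble(Location  = rep(c("London", "Glasgow", "Dublin"), each = 6),
               Day       = rep(1:6, 3),
               Average   = runif(18, 0, 20),
               Amplitude = runif(18, 0, 15))

hourly <- data %>%
  # one 24-element vector per row, stored in list columns
  mutate(Hour        = list(0:23),
         Temperature = map2(Amplitude, Average, sine_curve)) %>%
  # expand the list columns to 24 rows per location-day
  unnest(c(Hour, Temperature))
```

The result keeps Location and Day as ordinary columns, so it can be grouped or plotted directly.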
I have a dataframe with the following sample:
df = data.frame(x1 = c("2000a","2010a","2000b","2010b","2000c","2010c"),
x2 = c(1,2,3,4,5,6))
I am trying to find a way to calculate the percent change for each "group" (a,b,c) using the change() function. Below is my attempt:
percent_change = change(df,x2, NewVar = "percent_change", slideBy = 1,type = 'percent')
where slideBy is the lag variable that restarts the percent change calculation every other observation. This does not work, and I get the following error:
" Remember to put data in time order before running.
Leading total_units by 1 time units."
Would it be possible to adapt my x1 column to a time series or is there an easier way around this I am missing?
Thank you!
This uses the data.table structure from the data.table package. First it sorts on x1, then does a row by row calculation of the percent change, grouping by the letter in x1.
library(data.table)
setDT(df)
df[order(x1),
100*(x2 - shift(x2,1L))/shift(x2,1L),
keyby=gsub("[0-9]","",x1)]
Here is a tidyverse way to do this. First, use extract to separate x1 into year and group, then pivot_wider on the table. Now you can use mutate to create the percent change column (multiply by 100 if you want it expressed as a percentage rather than a fraction).
library(dplyr)
library(tidyr)
df = data.frame(x1 = c("2000a","2010a","2000b","2010b","2000c","2010c"),x2 = c(1,2,3,4,5,6))
df_new = df %>%
extract(x1, c("year", "group"),regex="(\\d{4})(\\D{1})") %>%
pivot_wider(names_from = year, values_from=x2) %>%
mutate(percent_change=(`2010`-`2000`)/`2000`)
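For comparison, here is a sketch that keeps the data in long form and computes the change with dplyr's lag inside each group. It assumes, as in the sample, that the group is whatever remains of x1 after stripping the digits and that sorting x1 puts the years in order; the column names group and percent_change are chosen for this example.

```r
library(dplyr)

df <- data.frame(x1 = c("2000a", "2010a", "2000b", "2010b", "2000c", "2010c"),
                 x2 = c(1, 2, 3, 4, 5, 6))

res <- df %>%
  mutate(group = gsub("[0-9]", "", x1)) %>%   # "a", "b", "c"
  arrange(group, x1) %>%                      # time order within each group
  group_by(group) %>%
  # change relative to the previous observation in the same group, in percent
  mutate(percent_change = 100 * (x2 - lag(x2)) / lag(x2)) %>%
  ungroup()

res
```

The first row of each group has no predecessor, so its percent_change is NA.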
I am trying to generate a new column with values derived from the original chart. I would like to calculate the group average of same hotel and same date first, then use this group averages to divide the original sales.
Here is my code: I tried to calculate the group average using group_by and summarise from the dplyr package; however, it did not generate the results I expected.
hotel = c(rep("Hilton",3), rep("Caesar",3))
date1 = c(rep('2018-01-01',2), '2018-01-02', rep('2018-01-01',3))
dba = c(2,0,1,3,2,1)
sales = c(3,5,7,5,2,3)
df = data.frame(cbind(hotel, date1, dba, sales))
df1 = df %>%
group_by(date1, hotel) %>%
dplyr::summarise(avg = mean(sales)) %>%
acast(., date1~hotel)
Any suggestion would be highly appreciated!
Instead of summarise, we can use mutate. After grouping by 'date1', 'hotel', divide the 'sales' by the mean of 'sales' to create a new column
library(tidyverse)
df %>%
group_by(date1, hotel) %>%
mutate(SalesDividedByMean = sales/mean(sales))
NOTE: When the columns have different types, cbind produces a matrix, and a matrix can hold only a single type, so one character vector turns the whole data into character. Wrapping the result in data.frame propagates that change: every column becomes either factor (with stringsAsFactors = TRUE, the default before R 4.0.0) or character.
data
df <- data.frame(hotel, date1, dba, sales)
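The coercion described in the note can be checked directly. This is a small sketch using two of the question's vectors; the object names via_cbind and direct are made up for the comparison.

```r
hotel <- c(rep("Hilton", 3), rep("Caesar", 3))
sales <- c(3, 5, 7, 5, 2, 3)

# cbind() first builds a character matrix, so sales is no longer numeric
via_cbind <- data.frame(cbind(hotel, sales))

# passing the vectors straight to data.frame() keeps each column's type
direct <- data.frame(hotel, sales)

class(via_cbind$sales)  # character (or factor before R 4.0.0)
class(direct$sales)     # numeric
```

With the cbind version, mean(sales) cannot work as intended, which is why the grouped calculation should start from the data frame built directly.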
I have a list of statcast data, per day dating back to 2016. I am attempting to aggregate this data for finding the mean for each pitching ID.
I have the following code:
aggpitch <- aggregate(pitchingstat, by=list(pitchingstat$PitcherID),
FUN=mean, na.rm = TRUE)
This function aggregates every single column. I am looking to only aggregate a certain amount of columns.
How would I include only certain columns?
If you have more than one column that you'd like to summarize, you can build on QAsena's approach and use the summarise_at function like so:
pitchingstat %>%
group_by(PitcherID) %>%
summarise_at(vars(col1:coln), mean, na.rm = TRUE)
Check out link below for more examples:
https://dplyr.tidyverse.org/reference/summarise_all.html
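In dplyr 1.0.0 and later, summarise_at is superseded by across, so the same idea might be written as in the sketch below. The data frame and its columns velocity, spin_rate and game_id are invented for illustration; only the names pitchingstat and PitcherID come from the question.

```r
library(dplyr)

# Hypothetical miniature of the statcast data
pitchingstat <- data.frame(PitcherID = c(1, 1, 2, 2),
                           velocity  = c(90, 94, 88, NA),
                           spin_rate = c(2200, 2300, 2100, 2150),
                           game_id   = 101:104)

aggpitch <- pitchingstat %>%
  group_by(PitcherID) %>%
  # average only the selected columns; game_id is left out of the result
  summarise(across(c(velocity, spin_rate), ~ mean(.x, na.rm = TRUE)))

aggpitch
```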
Replace the first argument (pitchingstat) with only the columns you want to aggregate: pitchingstat$col1 for a single column, or pitchingstat[c("col1", "col2")] for several.
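A minimal sketch of that base R variant, again with an invented data frame (velocity, spin_rate and game_id are placeholder column names):

```r
# Hypothetical miniature of the statcast data
pitchingstat <- data.frame(PitcherID = c(1, 1, 2, 2),
                           velocity  = c(90, 94, 88, NA),
                           spin_rate = c(2200, 2300, 2100, 2150),
                           game_id   = 101:104)

# Pass only the columns to be averaged as the first argument;
# naming the list element gives the grouping column a readable name.
aggpitch <- aggregate(pitchingstat[c("velocity", "spin_rate")],
                      by = list(PitcherID = pitchingstat$PitcherID),
                      FUN = mean, na.rm = TRUE)

aggpitch
```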
How about:
library(tidyverse)
aggpitch <- pitchingstat %>%
group_by(PitcherID) %>%
summarise(pitcher_mean = mean(variable)) #replace 'variable' with your variable of interest here
or
library(tidyverse)
aggpitch <- pitchingstat %>%
select(PitcherID, var_1, var_2) %>% # keep PitcherID so group_by() can find it
group_by(PitcherID) %>%
summarise(pitcher_mean = mean(var_1),
pitcher_mean2 = mean(var_2))
I think this works but could use a dummy example of your data to play with.
I have a data.frame with information on racing performance on horses. I have a variable Competition.year that has a "Total" row and then a row for each year the horse competed. I also have a variable Competition.age that describes the age the horses were in each specific year they competed.
I am trying to create a subsetted df based on their best racing times and the age they were when they achieved it. In the "Total" row, the racing time included is their best one. So, I need to figure out how to tell R that, when the race time in Total row is equal to whenever it is they actually achieved that time, include the age they were then in the new data frame. I am super new to R so I have no idea where to even begin doing this, I've tried some stuff I've seen on other questions but I can't get it right. Any help would be much appreciated!
My df looks like this:
travdata <- data.frame(
"Name"=c(rep("Muuttuva",3),rep("Pelson Poika",7),rep("Muusan Muisto",4)),
"Competition.year" = c("Total",2005,2004,"Total",2003,2004,2006,2005,2002,2001,2008,2010,"Total",2009),
"Time.record.auto.start"=c(93.5,NA,93.5,96.5,NA,NA,104.2,96.5,NA,96.6,NA,NA,NA,NA),
"Time.record.volt.start"=c(92.5,98.4,92.5,94.3,NA,105.3,98.3,94.3,102.1,99.1,107.5,NA,107.5,NA),
"Competition.age"=c(NA,6,7,NA,4,5,6,7,8,9,NA,5,6,7))
The desired df should have 223 rows (since that is the total amount of horses I have) with columns Name, Competition.year=="Total", Time.record.auto.start, Time.record.volt.start and Competition.age
Firstly, I had to change your sample data to make sure all 5 variables only had 14 observations each. I did this by removing the final NA in the Competition.age variable. I also had to swap around the 94.3 and 98.3 values in the Time.record.volt.start variable so that the values lined up with what was expected in the Total column for the horse with Name equal to Pelson Poika.
Here is the corrected data:
travdata <- data.frame(
"Name"=c(rep("Muuttuva",3),rep("Pelson Poika",7),rep("Muusan Muisto",4)),
"Competition.year" = c("Total",2005,2004,"Total",2003,2004,2006,2005,2002,2001,2008,2010,"Total",2009),
"Time.record.auto.start"=c(93.5,NA,93.5,96.5,NA,NA,104.2,96.5,NA,96.6,NA,NA,NA,NA),
"Time.record.volt.start"=c(92.5,98.4,92.5,94.3,NA,105.3,98.3,94.3,102.1,99.1,107.5,NA,107.5,NA),
"Competition.age"=c(NA,6,7,NA,4,5,6,7,8,9,NA,5,6,7))
And here is a simple dplyr solution, which I think does what you want.
library(dplyr)
df1 <-
travdata %>% group_by(Name) %>% filter(Competition.year == "Total") %>% select(Name, Time.record.auto.start, Time.record.volt.start)
df2 <- travdata %>% filter(Competition.year != "Total")
df3 <-
inner_join(
df1,
df2,
by = c(
"Name" = "Name",
"Time.record.auto.start" = "Time.record.auto.start",
"Time.record.volt.start" = "Time.record.volt.start"
)
)
The data frame df3 should contain what you were after: one row per horse, with the best times and the age at which they were achieved.