I am trying to generate a new column with values derived from the original data. I would like to first calculate the average sales for each hotel and date, then divide the original sales by that group average.
Here is my code. I tried to calculate the group average with group_by and summarise from the dplyr package, but it did not produce the result I expected.
hotel = c(rep("Hilton",3), rep("Caesar",3))
date1 = c(rep('2018-01-01',2), '2018-01-02', rep('2018-01-01',3))
dba = c(2,0,1,3,2,1)
sales = c(3,5,7,5,2,3)
df = data.frame(cbind(hotel, date1, dba, sales))
df1 = df %>%
  group_by(date1, hotel) %>%
  dplyr::summarise(avg = mean(sales)) %>%
  acast(., date1~hotel)
Any suggestion would be highly appreciated!
Instead of summarise, we can use mutate. After grouping by 'date1' and 'hotel', divide 'sales' by the mean of 'sales' to create a new column:
library(tidyverse)
df %>%
  group_by(date1, hotel) %>%
  mutate(SalesDividedByMean = sales/mean(sales))
NOTE: When the columns have different types, cbind returns a matrix, and a matrix can hold only a single type, so one character vector turns all of the data into character. Wrapping the result in data.frame carries that change over: the columns become factor (with stringsAsFactors = TRUE, the default before R 4.0.0) or character. The data below avoids this by passing the vectors to data.frame directly.
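To make the NOTE concrete, here is a minimal sketch of the coercion, using two short vectors made up for illustration:

```r
hotel <- c("Hilton", "Caesar")
sales <- c(3, 5)

# cbind() on vectors builds a matrix, so every value becomes character
m <- cbind(hotel, sales)
is.character(m)         # TRUE

# data.frame() on that matrix therefore gives a non-numeric sales column
bad <- data.frame(m)
is.numeric(bad$sales)   # FALSE

# Passing the vectors directly keeps each column's own type
good <- data.frame(hotel, sales)
is.numeric(good$sales)  # TRUE
```

With a non-numeric sales column, mean(sales) cannot give the group average, which is why the original pipeline misbehaved.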
data
df <- data.frame(hotel, date1, dba, sales)
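If you would rather not load any packages, a base R sketch of the same idea uses ave(), which returns the group mean repeated for every row of its group. This is an alternative to the mutate approach, not what the answer above uses; the group_avg column name is just for illustration.

```r
hotel <- c(rep("Hilton", 3), rep("Caesar", 3))
date1 <- c(rep("2018-01-01", 2), "2018-01-02", rep("2018-01-01", 3))
sales <- c(3, 5, 7, 5, 2, 3)
df <- data.frame(hotel, date1, sales)

# Mean of sales within each date1/hotel combination, one value per row
df$group_avg <- ave(df$sales, df$date1, df$hotel)

# Original sales divided by the group average
df$SalesDividedByMean <- df$sales / df$group_avg
df
```

For the two Hilton rows on 2018-01-01 the group average is 4, so the new column holds 0.75 and 1.25.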
Related
In a dataset where each patient had multiple test administrations and a score on each test date, I have to identify the earliest & latest test dates, then calculate the difference between the scores on those two dates. I think I've identified the first & last dates through dplyr, creating new columns for those:
SplitDates <- SortedDates %>%
  group_by(PatientID) %>%
  mutate(EarliestTestDate = min(AdministrationDate),
         LatestTestDate = max(AdministrationDate)) %>%
  arrange(desc(PatientID))
Score column is TotalScore
Now how do I extract the scores from these 2 dates (for each patient) to create new columns of earliest & latest scores? Haven't been able to figure out a mutate with case_when or if_else to create a score based on a record with a certain date.
Have you tried a join verb, such as left_join?
SplitDates <- SortedDates %>%
  group_by(PatientID) %>%
  mutate(EarliestTestDate = min(AdministrationDate),
         LatestTestDate = max(AdministrationDate)) %>%
  ungroup() %>%
  left_join(select(SortedDates, PatientID, AdministrationDate, EarliestScore = TotalScore),
            by = c("PatientID", "EarliestTestDate" = "AdministrationDate")) %>% # picking the score of EarliestTestDate
  left_join(select(SortedDates, PatientID, AdministrationDate, LatestScore = TotalScore),
            by = c("PatientID", "LatestTestDate" = "AdministrationDate")) %>% # picking the score of LatestTestDate
  arrange(desc(PatientID)) # now you can do the mutate that you need
I suggest you take a look at the dplyr cheatsheet.
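As a possible alternative to joining, the two scores can be picked directly inside each patient's group with which.min and which.max. This is only a sketch on made-up data: the column names come from the question, while the values and the ScoreChange and ScoreDifference names are assumptions.

```r
library(dplyr)

# Toy data: patient 1 has three administrations, patient 2 has two
SortedDates <- data.frame(
  PatientID = c(1, 1, 1, 2, 2),
  AdministrationDate = as.Date(c("2020-01-05", "2020-03-01", "2020-02-10",
                                 "2020-04-01", "2020-06-15")),
  TotalScore = c(10, 14, 12, 20, 17)
)

ScoreChange <- SortedDates %>%
  group_by(PatientID) %>%
  summarise(
    # score on the row with the earliest / latest date in each group
    EarliestScore = TotalScore[which.min(AdministrationDate)],
    LatestScore = TotalScore[which.max(AdministrationDate)]
  ) %>%
  mutate(ScoreDifference = LatestScore - EarliestScore)

ScoreChange
```

Swap summarise for mutate if you want to keep every row and just add the columns. If a patient has two records on the same date, which.min and which.max pick the first one.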
I have a dataframe with the following sample:
df = data.frame(x1 = c("2000a","2010a","2000b","2010b","2000c","2010c"),
                x2 = c(1,2,3,4,5,6))
I am trying to find a way to calculate the percent change for each "group" (a,b,c) using the change() function. Below is my attempt:
percent_change = change(df,x2, NewVar = "percent_change", slideBy = 1,type = 'percent')
where slideBy is the lag variable that restarts the percent change calculation every other observation. This does not work, and I get the following error:
" Remember to put data in time order before running.
Leading total_units by 1 time units."
Would it be possible to adapt my x1 column to a time series or is there an easier way around this I am missing?
Thank you!
This uses a data.table from the data.table package. It first sorts on x1, then computes the percent change from the previous row within each group, where the group is the letter in x1.
library(data.table)
setDT(df)
df[order(x1),
   .(percent_change = 100 * (x2 / shift(x2, 1L) - 1)),  # NA for the first row of each group
   keyby = gsub("[0-9]", "", x1)]
Here is a tidyverse way to do this. First, use extract to separate x1 into year and group, then pivot_wider the table so that each year becomes a column. Now you can use mutate to create the percent change column (as a fraction; multiply by 100 for a percentage).
library(dplyr)
library(tidyr)
df = data.frame(x1 = c("2000a","2010a","2000b","2010b","2000c","2010c"),
                x2 = c(1,2,3,4,5,6))
df_new = df %>%
  extract(x1, c("year", "group"), regex="(\\d{4})(\\D{1})") %>%
  pivot_wider(names_from = year, values_from=x2) %>%
  mutate(percent_change=(`2010`-`2000`)/`2000`)
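If you want to keep the data in long format, or have more than two years per group, a grouped lag may be simpler. A minimal sketch, assuming the group is whatever follows the leading digits in x1 (the group column and df_long are names I made up):

```r
library(dplyr)

df <- data.frame(x1 = c("2000a", "2010a", "2000b", "2010b", "2000c", "2010c"),
                 x2 = c(1, 2, 3, 4, 5, 6))

df_long <- df %>%
  mutate(group = sub("^[0-9]+", "", x1)) %>%  # strip the year, keep the letter
  arrange(group, x1) %>%                      # time order within each group
  group_by(group) %>%
  mutate(percent_change = 100 * (x2 - lag(x2)) / lag(x2)) %>%
  ungroup()

df_long
```

The first row of each group has no previous value, so its percent_change is NA.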
Here I have the code with a for loop:
for (i in 1:length(mc_1$code)) {
  cmc1 = mc_1$code[i]
  cmc2 = mc_1[mc_1$code == cmc1,]
  cmc3 = cmc2[order(cmc2[ ,2], cmc2[ ,3]),]
  mc_1[mc_1$code == cmc1,]$region = last(cmc3$region)
}
For each value of the variable "code", mc_1 has a different number of rows. mc_1 also has year and month columns (columns 2 and 3) and another column, say, region. "region" can differ for the same "code" across months and years.
For each "code", I want to select only the most recent region by year and month (that's why I use "order") and assign that region to every row for that code.
My for loop works, but for efficiency and brevity, how can I rewrite it using something like data.table or dplyr?
You can try this using the dplyr package and the fact that n() returns the number of rows in each group, so region[n()] is the last region once the rows are sorted by year and month:
mc_1 %>%
  group_by(code) %>%
  arrange(year, month) %>%
  mutate(region = region[n()])
hope it helps!!
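A variation that leaves the row order alone picks the latest record per code with which.max instead of sorting. This is a sketch on made-up data, and it assumes year and month are numeric with month in 1 to 12:

```r
library(dplyr)

# Toy data: the latest record is 2020/1 for code A and 2020/7 for code B
mc_1 <- data.frame(
  code = c("A", "A", "A", "B", "B"),
  year = c(2019, 2020, 2019, 2020, 2020),
  month = c(5, 1, 12, 3, 7),
  region = c("north", "south", "east", "west", "north"),
  stringsAsFactors = FALSE
)

mc_1_latest <- mc_1 %>%
  group_by(code) %>%
  # year * 12 + month is largest for the most recent year/month pair
  mutate(region = region[which.max(year * 12 + month)]) %>%
  ungroup()

mc_1_latest
```

Every row of code A ends up with "south" and every row of code B with "north".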
There are three columns: website, Date ("%Y %m"), click_tracking (T/F). I would like to add a variable describing the number of websites whose click tracking = T in each month / the number of all website in that month.
I thought the steps would be something like:
aggregate(sum(df$click_tracking = TRUE), by=list(Category=df$Date), FUN = sum)
as.data.frame(table(Date))
Then somehow loop through Date and divide the two variables above which would have been already grouped by Date. How can I achieve this? Many thanks!
If we are creating a column, then group by 'Date' and get the sum of 'click_tracking' (assuming it is a logical column - TRUE/FALSE) in mutate
library(dplyr)
df %>%
  group_by(Date) %>%
  mutate(countTRUE = sum(click_tracking))
If the column is factor, convert to logical with as.logical
df %>%
  group_by(Date) %>%
  mutate(countTRUE = sum(as.logical(click_tracking)))
If it is to create a summarised output
df %>%
  group_by(Date) %>%
  summarise(countTRUE = sum(click_tracking))
In the OP's code, = (assignment) is used instead of == in sum(df$click_tracking = TRUE), and there is no need to compare a logical column with TRUE anyway. With aggregate, the count per month is
aggregate(cbind(click_tracking = as.logical(click_tracking)) ~ Date, data = df, FUN = sum)
This will create the proportion of websites with click tracking (out of all websites) per month.
aggregate(click_tracking ~ Date, data = df, FUN = mean)
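To get that proportion as a new column on the original rows, which is what the question asks for, a grouped mutate can compute the count, the total and their ratio in one go. A sketch on made-up data, assuming click_tracking is logical and each website appears once per month; n_tracking, n_websites and share_tracking are names chosen for illustration:

```r
library(dplyr)

df <- data.frame(
  website = c("a.com", "b.com", "c.com", "d.com", "a.com", "b.com"),
  Date = c(rep("2020 01", 4), rep("2020 02", 2)),
  click_tracking = c(TRUE, FALSE, TRUE, TRUE, FALSE, TRUE)
)

df_share <- df %>%
  group_by(Date) %>%
  mutate(n_tracking = sum(click_tracking),          # websites with tracking that month
         n_websites = n(),                          # all websites that month
         share_tracking = n_tracking / n_websites) %>%
  ungroup()

df_share
```

If a website can appear more than once in a month, n_distinct(website) would be a safer denominator than n().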
I have a list of statcast data, per day dating back to 2016. I am attempting to aggregate this data for finding the mean for each pitching ID.
I have the following code:
aggpitch <- aggregate(pitchingstat, by=list(pitchingstat$PitcherID),
FUN=mean, na.rm = TRUE)
This function aggregates every single column. I am looking to aggregate only certain columns.
How would I include only certain columns?
If you have more than one column that you'd like to summarize, you can extend QAsena's approach with the summarise_at function, like so:
pitchingstat %>%
  group_by(PitcherID) %>%
  summarise_at(vars(col1:coln), mean, na.rm = TRUE)
Check out link below for more examples:
https://dplyr.tidyverse.org/reference/summarise_all.html
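Replace the first argument (pitchingstat) with the column you want to aggregate, or with a data frame of several columns, for example pitchingstat[c("col1", "col2")] with your own column names in place of col1 and col2.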
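Spelled out on a small made-up data frame, that might look like the sketch below. The velocity, spin and game columns are hypothetical stand-ins for the real statcast columns:

```r
# Toy data: only velocity and spin should be averaged, game should be ignored
pitchingstat <- data.frame(
  PitcherID = c(1, 1, 2, 2),
  velocity = c(90, 94, NA, 88),
  spin = c(2200, 2400, 2100, 2300),
  game = c("g1", "g2", "g1", "g2")
)

# Pass only the columns to aggregate as the first argument
aggpitch <- aggregate(pitchingstat[c("velocity", "spin")],
                      by = list(PitcherID = pitchingstat$PitcherID),
                      FUN = mean, na.rm = TRUE)
aggpitch
```

Naming the list element (PitcherID = ...) gives the grouping column a readable name instead of Group.1.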
How about:
library(tidyverse)
aggpitch <- pitchingstat %>%
  group_by(PitcherID) %>%
  summarise(pitcher_mean = mean(variable)) # replace 'variable' with your variable of interest here
or
library(tidyverse)
aggpitch <- pitchingstat %>%
  select(PitcherID, var_1, var_2) %>% # keep PitcherID so it can be used for grouping
  group_by(PitcherID) %>%
  summarise(pitcher_mean = mean(var_1),
            pitcher_mean2 = mean(var_2))
I think this works but could use a dummy example of your data to play with.
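Here is a dummy example along those lines, using across() (available in dplyr 1.0.0 and later) so the mean does not have to be written once per column. The velocity, spin and game columns are made up for illustration:

```r
library(dplyr)

# Dummy data standing in for the statcast table
pitchingstat <- data.frame(
  PitcherID = c(1, 1, 2, 2),
  velocity = c(90, 94, NA, 88),
  spin = c(2200, 2400, 2100, 2300),
  game = c("g1", "g2", "g1", "g2")
)

aggpitch <- pitchingstat %>%
  group_by(PitcherID) %>%
  # average only the listed columns, ignoring missing values
  summarise(across(c(velocity, spin), ~ mean(.x, na.rm = TRUE)))

aggpitch
```

Columns not named in across(), such as game here, are dropped from the summary.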