I have this code where I need to find the mean approval for each quarter.
approval <- read_csv('covid_approval_polls.csv')
quarters2 <- approval %>%
select(start_date, end_date, approve) %>%
filter(approval$party == 'all') %>%
mutate(Quarter = as.yearqtr(approval$start_date)) %>%
group_by(Quarter) %>%
summarise(AVERAGE = ceiling(mean(approval$approve, na.rm = TRUE)))
I am trying to use dplyr which I think is correct but my code gives me the mean of all the data.
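A likely cause, sketched on made-up data: inside a pipeline, approval$approve refers to the whole original column rather than to the rows of the current group, so every quarter receives the overall mean. In addition, select() drops party before the filter can use it. The small tibble below is a hypothetical stand-in for the CSV, and as.yearqtr() is assumed to come from the zoo package.

```r
library(dplyr)
library(zoo)  # assumed source of as.yearqtr()

# Hypothetical stand-in for covid_approval_polls.csv
approval <- tibble(
  start_date = as.Date(c("2020-02-01", "2020-03-15", "2020-04-10", "2020-05-20")),
  end_date   = start_date + 2,
  party      = c("all", "all", "all", "D"),
  approve    = c(40, 50, 60, 90)
)

quarters2 <- approval %>%
  filter(party == "all") %>%                    # filter while party still exists
  select(start_date, end_date, approve) %>%
  mutate(Quarter = as.yearqtr(start_date)) %>%  # bare column names, no approval$
  group_by(Quarter) %>%
  summarise(AVERAGE = ceiling(mean(approve, na.rm = TRUE)))
```

With bare column names, mean() is evaluated once per quarter instead of once over the whole data frame.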
I had a problem coding this graph with ggplot in R and thought somebody would be able to help me here.
The main problem is finding the original and current world records from the dataset.
The link to the dataset is below:
https://drive.google.com/file/d/1olmDVa0Ku01LQrkpC_MkGq7_wFO8gLPQ/view?usp=sharing
Thanks.
This is the plot I need to code:
[Image: plot in R]
library(tidyverse)
fLm = function(data) lm(time~date, data)
dPradict = function(data){
model = data$model[[1]]
data = data$data[[1]]
dfFirstLast = data %>% arrange(date) %>% slice_head() %>%
bind_rows(data %>% arrange(date) %>% slice_tail()) %>%
select(date, time)
tibble(
x = c("Original", "Current") %>% fct_inorder(),
time = c(predict(model, dfFirstLast)[1],
predict(model, dfFirstLast)[2])
)
}
df = read_csv("records.csv", show_col_types = FALSE) %>%
mutate(track = track %>% fct_relevel("Banshee Boardwalk")) %>%
group_by(track) %>%
nest() %>%
mutate(model = map(data, ~fLm(.x))) %>%
group_modify(~dPradict(.x))
df4 = df %>%
ungroup() %>%
filter(x=="Current") %>%
arrange(desc(time)) %>%
slice_head(n=4)
df %>% ggplot(aes(as.numeric(x), time, color=track))+
geom_line()+
geom_point(size=3)+
geom_label(aes(as.numeric(x), time, label=track), data = df4, hjust=-.1)+
scale_x_continuous(breaks=c(1, 2), name="WR",
labels=c("Original", "Current"),
limits=c(0.8,2.2))+
labs(title = "Comparing the original and current WR for three lap and no shortcut races")
I am trying to create new columns grouped by different columns but I am not sure if the way I am doing it is the best way to use group_by. I am wondering if there is a way I can group_by in line?
I know it can be done using data.table package where the syntax is of type
DT[i,j, by].
But since this is a small piece in a bigger code which uses tidyverse and works great as is, I just don't want to deviate from that.
## Creating Sample Data Frame
state <- rep(c("OH", "IL", "IN", "PA", "KY"),10)
county <- sample(LETTERS[1:5], 50, replace = TRUE) %>% str_c(state, sep = "-")
# sample() draws from the given range; sample.int() expects a single n, not a vector
customers <- sample(50:100, 50)
sales <- sample(500:5000, 50)
df <- data.frame(state, county, customers, sales)
## workflow
df2 <- df %>%
group_by(state) %>%
mutate(customerInState = sum(customers),
saleInState = sum(sales)) %>%
ungroup %>%
group_by(county) %>%
mutate(customerInCounty = sum(customers),
saleInCounty = sum(sales)) %>%
ungroup %>%
mutate(salePerCountyPercent = saleInCounty/saleInState,
customerPerCountyPercent = customerInCounty/customerInState) %>%
group_by(state) %>%
mutate(minSale = min(salePerCountyPercent)) %>%
ungroup
I want my code to look like
df3 <- df %>%
mutate(customerInState = sum(customers, by = state),
saleInState = sum(sales, by = state),
customerInCounty = sum(customers, by = county),
saleInCounty = sum(sales, by = county),
salePerCountyPercent = saleInCounty/saleInState,
customerPerCountyPercent = customerInCounty/customerInState,
minSale = min(salePerCountyPercent, by = state))
It runs without errors, but I know the output is not right.
I understand that it may be possible to rearrange the mutates to get what I need with fewer group_by calls.
But the question is whether there is a way to do an inline group by in dplyr.
You could create a wrapper to do what you want. This specific solution works if you have one grouping variable. Good luck!
library(tidyverse)
mutate_by <- function(.data, group, ...) {
group_by(.data, !!enquo(group)) %>%
mutate(...) %>%
ungroup
}
df1 <- df %>%
mutate_by(state,
customerInState = sum(customers),
saleInState = sum(sales)) %>%
mutate_by(county,
customerInCounty = sum(customers),
saleInCounty = sum(sales)) %>%
mutate(salePerCountyPercent = saleInCounty/saleInState,
customerPerCountyPercent = customerInCounty/customerInState) %>%
mutate_by(state,
minSale = min(salePerCountyPercent))
identical(df2, df1)
[1] TRUE
EDIT: or, more concisely and closer to your code:
df %>%
mutate_by(customerInState = sum(customers),
saleInState = sum(sales), group = state) %>%
mutate_by(customerInCounty = sum(customers),
saleInCounty = sum(sales), group = county) %>%
mutate(salePerCountyPercent = saleInCounty/saleInState,
customerPerCountyPercent = customerInCounty/customerInState) %>%
mutate_by(minSale = min(salePerCountyPercent), group = state)
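For reference, dplyr 1.1.0 and later offer per-operation grouping through the .by argument of mutate(), which comes close to the inline syntax asked about. A minimal sketch, assuming dplyr >= 1.1.0 and a small made-up data frame (toy is a hypothetical name, not part of the question):

```r
library(dplyr)  # assumes dplyr >= 1.1.0 for the .by argument

# Made-up data: two states, three counties
toy <- data.frame(
  state     = c("OH", "OH", "OH", "IL"),
  county    = c("A-OH", "A-OH", "B-OH", "A-IL"),
  customers = c(10, 20, 30, 40),
  sales     = c(100, 200, 300, 400)
)

res <- toy %>%
  mutate(customerInState = sum(customers),
         saleInState     = sum(sales), .by = state) %>%
  mutate(customerInCounty = sum(customers),
         saleInCounty     = sum(sales), .by = county) %>%
  mutate(salePerCountyPercent     = saleInCounty / saleInState,
         customerPerCountyPercent = customerInCounty / customerInState) %>%
  mutate(minSale = min(salePerCountyPercent), .by = state)
```

The grouping given in .by applies only to that one call, so the result is never left grouped and no ungroup() is needed.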
Ah, you mean the syntax style. No, I'm afraid that is not how the tidyverse works; if you want tidyverse, you use pipes. However: (i) once you have grouped a data frame, it stays grouped until you group it again by a different column; (ii) there is no need to ungroup if you group again. We can therefore shorten your code:
df3 <- df %>%
group_by(county) %>%
mutate(customerInCounty = sum(customers),
saleInCounty = sum(sales)) %>%
group_by(state) %>%
mutate(customerInState = sum(customers),
saleInState = sum(sales),
salePerCountyPercent = saleInCounty/saleInState,
customerPerCountyPercent = customerInCounty/customerInState) %>%
mutate(minSale = min(salePerCountyPercent)) %>%
ungroup
Three mutates (the last one could be folded into the second) and two group_by's, with no intermediate ungroup.
Now: the order of columns is different, but we can easily test that the data is identical:
identical((df3 %>% select(colnames(df2))), (df2)) # TRUE
(iii) I have no idea about the administrative structure of the US, but I assume that counties are nested within states, correct? Then how about using summarize? Do you need to keep all the individual sales, or is it enough to generate per county and/or per state statistics?
You can do it in two steps, creating two data sets, then left_join them.
library(dplyr)
df2 <- df %>%
group_by(state) %>%
summarise(customerInState = sum(customers),
saleInState = sum(sales))
df3 <- df %>%
group_by(state, county) %>%
summarise(customerInCounty = sum(customers),
saleInCounty = sum(sales))
df2 <- left_join(df2, df3, by = "state") %>%
mutate(salePerCountyPercent = saleInCounty/saleInState,
customerPerCountyPercent = customerInCounty/customerInState) %>%
group_by(state) %>%
mutate(minSale = min(salePerCountyPercent))
Final clean up.
rm(df3)
This is my code, and I have a problem with group_by:
library(dplyr)
library(lubridate)
library(openxlsx) # provides read.xlsx()
df <- read.xlsx("Data.xlsx", sheet = "Sector-STOXX600", startRow = 2,colNames = TRUE, detectDates = TRUE, skipEmptyRows = FALSE)
df[2:19] <- data.matrix(df[2:19])
percent_change2 <- function(x)last(x)/first(x) - 1
monthly_return <- df %>%
group_by(gr = floor_date(Date, unit = "month")) %>%
summarize_at(vars(-Date, -gr), percent_change2) %>%
ungroup() %>%
select(-gr) %>%
as.matrix()
I get this error:
"Error in is_character(x) : object 'gr' not found"
Here is a sample of the dataset :
Date .SXQR .SXTR .SXNR .SXMR .SXAR .SX3R .SX6R .SXFR .SXOR .SXDR .SX4R .SXRR .SXER
1 2000-01-03 364.94 223.93 489.04 586.38 306.56 246.81 385.36 403.82 283.78 455.39 427.43 498.08 457.57
2 2000-01-04 345.04 218.90 474.05 566.15 301.13 239.24 374.64 390.41 275.93 434.92 414.10 476.17 435.72
UPDATE
volatility_function<- function(x)sqrt(252) * sd(diff(log(x))) * 100
annualized_volatility <- df %>%
mutate(Date=ymd(Date)) %>%
group_by(gr = floor_date(Date, unit = "year")) %>%
select(gr,everything()) %>%
summarize_at(vars(-Date, -gr), volatility_function) %>%
ungroup() %>% select(-gr) %>%
as.matrix()
head(annualized_volatility,5)
I tried what @NeslonGon told me to do; however, I now get the same error in another function. What should I do?
The idea is that the grouping variable gr does not need to be (and cannot be) referenced inside summarize_at: grouping columns are excluded automatically, so only Date has to be dropped in vars(). The select and mutate calls are there for convenience and can be skipped.
df %>%
mutate(Date=ymd(Date)) %>%
group_by(gr = floor_date(Date, unit = "month")) %>%
select(gr,everything()) %>%
summarize_at(vars(-Date), percent_change2) %>%
ungroup() %>%
select(-gr) %>%
as.matrix()
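In current dplyr, summarize_at() is superseded by across(). A hedged sketch of the same monthly-return calculation, using a tiny made-up price table in place of the Excel sheet (prices and the column names A and B are hypothetical):

```r
library(dplyr)
library(lubridate)

percent_change2 <- function(x) last(x) / first(x) - 1

# Made-up prices standing in for the Excel data
prices <- data.frame(
  Date = ymd(c("2000-01-03", "2000-01-31", "2000-02-01", "2000-02-29")),
  A    = c(100, 110, 200, 100),
  B    = c(50, 50, 10, 20)
)

monthly_return <- prices %>%
  group_by(gr = floor_date(Date, unit = "month")) %>%
  # grouping columns are excluded from across() automatically,
  # so only Date has to be dropped
  summarise(across(-Date, percent_change2), .groups = "drop") %>%
  select(-gr) %>%
  as.matrix()
```

The same pattern should carry over to the yearly volatility function by changing unit to "year" and swapping in volatility_function.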
I have generated this summary table based on the df below.
set.seed(1)
df <- data.frame(rep(
sample(c(2012,2016),10, replace = T)),
sample(c('Treat','Control'),10,replace = T),
runif(10,0,1),
runif(10,0,1),
runif(10,0,1))
colnames(df) <- c('Year','Group','V1','V2','V3')
summary.table = df %>%
group_by(Year, Group) %>%
group_by(N = n(), add = TRUE) %>%
summarise_all(funs(sd,median)) %>%
ungroup %>%
mutate(Year = ifelse(duplicated(Year),"",Year))
Is there a way I could display the values related to the median columns as percentages?
I did not know how to use mutate() and scales::percent() for only a subset of columns. I don't want to do it column by column, since the original dataset has more columns, which would make that impractical.
What should I have done instead if I wanted to mutate according to a subset of rows?
Thank you
EDIT:
And if it was like this?
summary.table = df %>%
group_by(Year, Group) %>%
summarise_all(funs(median,sd)) %>%
gather(key, value, -Year, -Group) %>%
separate(key, into=c("var", "stat")) %>%
unite(stat_Group, stat, Group) %>%
spread(stat_Group, value) %>%
ungroup %>%
mutate(Year = ifelse(duplicated(Year),"",Year))
We need to wrap the median in scales::percent():
summary.table <- df %>%
group_by(Year, Group) %>%
group_by(N = n(), add = TRUE) %>%
summarise_all(funs(sd=sd(.),median=scales::percent(median(.)))) %>%
ungroup %>%
mutate(Year = ifelse(duplicated(Year),"",Year))
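Since funs() is deprecated in recent dplyr, here is a sketch of the same idea with across(), assuming dplyr >= 1.0 and the scales package. It relies on the default across() naming scheme (column name, underscore, function name) to pick out only the median columns afterwards, however many there are.

```r
library(dplyr)

set.seed(1)
df <- data.frame(
  Year  = sample(c(2012, 2016), 10, replace = TRUE),
  Group = sample(c("Treat", "Control"), 10, replace = TRUE),
  V1 = runif(10), V2 = runif(10), V3 = runif(10)
)

summary.table <- df %>%
  group_by(Year, Group) %>%
  summarise(N = n(),
            across(V1:V3, list(sd = sd, median = median)),
            .groups = "drop") %>%
  # format only the columns whose names end in "_median"
  mutate(across(ends_with("_median"), ~ scales::percent(.x, accuracy = 0.1)))
```

Any tidyselect helper can take the place of ends_with() when a different subset of columns needs formatting.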
I am new to (d)plyr and working through chaining, and I have a basic question: in the hflights example, I want to use one of the summarised variables to make a basic plot:
hflights %>%
group_by(Year, Month, DayofMonth) %>%
select(Year:DayofMonth, ArrDelay, DepDelay) %>%
summarise(
arr = mean(ArrDelay, na.rm = TRUE),
dep = mean(DepDelay, na.rm = TRUE)
) %>%
plot (Month, arr)
Returns:
Error in match.fun(panel) : object 'arr' not found
I can make this work going step by step, but can I get where I want to go somehow with %>%...
plot() doesn't work that way. The closest you could get is:
library(dplyr)
library(hflights)
summary <- hflights %>%
group_by(Year, Month, DayofMonth) %>%
select(Year:DayofMonth, ArrDelay, DepDelay) %>%
summarise(
arr = mean(ArrDelay, na.rm = TRUE),
dep = mean(DepDelay, na.rm = TRUE)
)
summary %>%
plot(arr ~ Month, .)
Another alternative is to use ggvis, which is explicitly designed to work with pipes:
library(ggvis)
summary %>%
ggvis(~Month, ~arr)
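Two further options that stay inside a single pipeline are with() for base graphics and ggplot2. This is only a sketch: the built-in airquality data stands in for hflights so that it runs without extra data packages, and daily, temp and p are hypothetical names.

```r
library(dplyr)
library(ggplot2)

# airquality (shipped with R) stands in for hflights here
daily <- airquality %>%
  group_by(Month) %>%
  summarise(temp = mean(Temp, na.rm = TRUE))

# Base graphics: with() makes the columns visible to plot()
daily %>% with(plot(Month, temp))

# ggplot2: the data frame is the first argument, so it pipes directly
p <- daily %>%
  ggplot(aes(Month, temp)) +
  geom_point()
```

Printing p draws the ggplot2 version; the with() call draws immediately.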