How to group twice in R?

I'd like to know how to group twice in a data set. I have to answer the following question: "For each state, which municipalities have the lowest and the highest infection and death rates?". This question is part of a homework assignment (https://github.com/umbertomig/intro-prob-stat-FGV/blob/master/assignments/hw6.Rmd) and I don't know how to do it. I've tried to use top_n(), but I am not sure it is the best way.
I wanted to generate a data set in which, for each state, there were four municipalities (the two with the highest rates of infection and death from coronavirus and the two with the lowest). This is what I have done so far:
library(tidyverse)

brazilcorona <- read.csv("https://raw.githubusercontent.com/umbertomig/intro-prob-stat-FGV/master/datasets/brazilcorona.csv")

brazilcorona_hl_rates <- select(brazilcorona, estado:emAcompanhamentoNovos) %>%
  filter(data >= "2020-05-15") %>%
  subset(!(coduf == 76)) %>%
  mutate(av_inf = (casosAcumulado / populacaoTCU2019) * 100000,
         av_dth = (obitosAcumulado / populacaoTCU2019) * 100000)

brazilcorona_hilow_rates <- brazilcorona_hl_rates %>%
  group_by(estado) %>%
  summarize(top_dth = top_n(1, av_dth))

In my example, I find, for each state, the two municipalities with the maximum and the two with the minimum values of the column "obitosAcumulado"; to solve your problem, simply change it to the column you want to extract the information from.
rm(list = ls())

brazilcorona <- read.csv("https://raw.githubusercontent.com/umbertomig/intro-prob-stat-FGV/master/datasets/brazilcorona.csv")

# remove NA's from municipio
brazilcorona <- brazilcorona[!is.na(brazilcorona$municipio), ]

# here I use the column "obitosAcumulado", but you should use the one you want
brazilcorona$obitosAcumulado <- as.numeric(brazilcorona$obitosAcumulado)

states <- as.list(unique(brazilcorona$estado))

result <- lapply(states, FUN = function(x) {
  df <- brazilcorona[brazilcorona$estado == x, ]
  df <- df[order(df$obitosAcumulado, decreasing = TRUE), ]
  return(c(paste(x),
           as.character(df[1:2, "municipio"]),
           as.character(df[(nrow(df) - 1):nrow(df), "municipio"])))
})
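For what the question attempts with top_n() inside summarize(), a more direct route is dplyr's slice_max()/slice_min() (dplyr >= 1.0.0). A minimal sketch on the death rate av_dth built in the question's code; swap in av_inf for infection rates:

library(dplyr)

highest <- brazilcorona_hl_rates %>%
  group_by(estado) %>%
  slice_max(av_dth, n = 2, with_ties = FALSE)  # two highest death rates per state

lowest <- brazilcorona_hl_rates %>%
  group_by(estado) %>%
  slice_min(av_dth, n = 2, with_ties = FALSE)  # two lowest death rates per state

extremes <- bind_rows(highest, lowest) %>%     # four municipalities per state
  ungroup() %>%
  arrange(estado)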
I hope it helps you...

Related

How to separate a time series panel by the number of missing observations at the end?

Consider a set of time series having the same length. Some have missing data at the end, because the product went out of stock or was delisted.
If a series ends with at least four missing observations (in my case value = 0 rather than NA), I consider it delisted.
In my time series panel, I want to separate the series with delisted IDs from the other ones and create two different data frames based on this separation.
I created a simple reprex to illustrate the problem:
library(tidyverse)
library(lubridate)

data <- tibble(id = as.factor(c(rep("1", 24), rep("2", 24))),
               date = rep(ymd("2013-01-01") + months(0:23), 2),
               value = c(c(rep(1, 17), 0, 0, 0, 0, 2, 2, 3),
                         c(rep(9, 20), 0, 0, 0, 0)))
I am searching for a pipeable tidyverse solution.
Here is one possibility to find the delisted IDs:
data %>%
  group_by(id) %>%
  mutate(delisted = all(value[(n() - 3):n()] == 0)) %>%
  group_by(delisted) %>%
  group_split()
In the end I use group_split() to split the data into two parts: one containing the delisted IDs and the other containing the non-delisted ones.
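If you want named handles on the two pieces rather than an unnamed list, a minimal variation using base split() on the same flag (the names delisted_df and active_df are mine):

parts <- data %>%
  group_by(id) %>%
  mutate(delisted = all(value[(n() - 3):n()] == 0)) %>%
  ungroup() %>%
  split(.$delisted)               # list with elements named "FALSE" and "TRUE"

delisted_df <- parts[["TRUE"]]    # series ending in at least four zeros
active_df   <- parts[["FALSE"]]   # everything else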

'Join columns must be present in data' error

I am having this weird issue.
The following code works:
Jakarta_Covid <- left_join(DKI_Jakarta, Covid_DF,
                           by = c("Sub_District" = "Sub_District"))
However, the code chunk below gives me: 'Join columns must be present in data. x Problem with Sub_District.'
Jakarta_Death <- Covid_DF %>%
  inner_join(DKI_Jakarta, by = c("Sub_District" = "Sub_District")) %>%
  group_by(Sub_District, Month) %>%
  summarise(`Covid Death Per 10,000 Population` = sum(Death) / Total_Population * 10000)

Jakarta_Death <- Jakarta_Death %>%
  left_join(DKI_Jakarta, by = c("Sub_District" = "Sub_District"))
How can I calculate 'Covid Death Per 10,000 Population' from the two data frames? I also need the Geometry column in DKI_Jakarta to plot a map later on.
left_join(DKI_Jakarta, Covid_DF, by = c("Sub_District"))
If the column has the same name in both tables, just leave the one name in by = ().
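Putting the pieces together, a minimal sketch of the failing chunk once the join columns line up (assuming dplyr >= 1.0 for .groups, and that Total_Population comes from DKI_Jakarta and is constant within each Sub_District):

library(dplyr)

Jakarta_Death <- Covid_DF %>%
  inner_join(DKI_Jakarta, by = "Sub_District") %>%
  group_by(Sub_District, Month) %>%
  summarise(`Covid Death Per 10,000 Population` =
              sum(Death) / first(Total_Population) * 10000,
            .groups = "drop") %>%
  left_join(DKI_Jakarta, by = "Sub_District")  # brings Geometry back in for mapping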

Summing across rows conditional on groups with dplyr using select, group_by, and mutate

Problem: I'm making an aggregate market-share variable in a car market with 286 distinct models sold and a total of 501 cars sold. The group share is based on one car characteristic, cat = "compact", "midsize", "large", crossed with yr = 77, 78, 79, 80, 81, plus the share s, a small double variable; a total of 15 groups in the market.
Closest answer I've found: by mishabalyasin on community.rstudio.com, "Calculating rowwise totals and proportions using tidyeval?" (link to the post on community.rstudio.com).
Applying the select-split-combine principle, the closest I've come to the correct answer is the 15 groups (15 x 3: cat, yr, s):
df <- blp %>%
  select(cat, yr, s) %>%
  group_by(cat, yr) %>%
  summarise(group_share = sum(s))

# in my actual data, this fills group_share to get what I want,
# but it isn't the desired pipeline-based answer
blp$group_share <- 0  # initializing group_share, the 50th column
for (i in 1:501) {
  for (j in 1:15) {
    if ((blp[i, 31] == df[j, 1]) && (blp[i, 3] == df[j, 2])) { # if (sameCat & sameYr) { blpGS = dfGS }
      blp[i, 50] <- df[j, 3]
    }
  }
}
This is great, but I know this can be done in one fell swoop... Hopefully the idea is clear from what I've described above. A loop conditioned on cat and yr would help, but I'm really trying to get better at data wrangling with dplyr, so any insight toward a pipeline-based answer would be wonderful.
Example for the site: the example below doesn't work with the code I provided, but it has the "look" of my data. There is a problem with the share being a factor.
# 45 obs, 3 cats, 5 yrs
cat <- rep(c("compact", "midsize", "large"), 15)
yr  <- rep(c(77, 78, 79, 80, 81), 9)
s   <- rep(c(.001, .0005, .002, .0001, .0002), 9)
blp <- as.data.frame(cbind(unlist(lapply(cat, as.character, stringsAsFactors = FALSE)),
                           as.numeric(yr), unlist(as.numeric(s))))
names(blp) <- c("cat", "yr", "s")
head(blp)
# note: one example of a group share would be summing the shares from
(group_share.blp.large.81.s = blp[cat == "large" & yr == 81, ])
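To complete that note: the line above only subsets the rows; the group share itself also needs the sum. A sketch, using the factor-safe conversion that the fix below relies on:

# group share for cat == "large", yr == 81
sum(as.numeric(as.character(blp[cat == "large" & yr == 81, "s"])))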
# works, thanks to akrun: applying the code I provided leads to the 15 groups
df <- blp %>%
  select(cat, yr, s) %>%
  group_by(cat, yr) %>%
  summarise(group_share = sum(as.numeric(as.character(s))))

# manually filling, if I didn't want pipelining
blp$group_share <- 0
for (i in 1:45) {
  for (j in 1:15) {
    if ((blp[i, 1] == df[j, 1]) && (as.numeric(blp[i, 2]) == as.numeric(df[j, 2]))) { # if (sameCat & sameYr) { blpGS = dfGS }
      blp[i, 4] <- df[j, 3]
    }
  }
}
If I understood your problem correctly, this should help!
The only difference is that instead of summarize(), which returns only the grouping columns and the summarized one, you can use mutate() to keep the original columns and add the aggregate to them.
# Sample input
## 45 obs, 3 cats, 5 yrs
cat <- rep(c("compact", "midsize", "large"), 15)
yr  <- rep(c(77, 78, 79, 80, 81), 9)
s   <- rep(c(.001, .0005, .002, .0001, .0002), 9)

# Calculation
library(dplyr)

blp <- data.frame(cat, yr, s, stringsAsFactors = FALSE) %>% # create the data frame
  group_by(cat, yr) %>%                                     # group by category and year
  mutate(group_share = sum(s, na.rm = TRUE)) %>%            # sum of share per category/year
  ungroup()
Expected output
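As a follow-up, the row-level fill that the question's loop performs can also be written as a single join against the summarised table df; a minimal sketch, assuming blp does not already carry a group_share column:

library(dplyr)

# attach each (cat, yr) group's share back onto the individual rows in one step
blp <- blp %>%
  left_join(df, by = c("cat", "yr"))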

Attaching Appropriate Subset Quartiles to Dataset

I'm trying to add two columns to an NBA player dataset. The first column will establish which quartile the player's age falls in among all players within the dataset. The second additional column will establish which quartile an individual player's age falls in among players at his position (i.e. Point Guard, Small Forward, Center, etc.). I'm able to use the dplyr package to calculate the subset age quartiles based on player position, but I don't know how to join the result back to the original dataset, or whether this is even the correct approach to take.
I've used dplyr to calculate subset age quartiles based on position, and have attempted to use other packages like fuzzyjoin, but I don't feel comfortable working with them.
# Incorporate necessary packages
library(ballr)
library(magrittr)
library(dplyr)
library(tidyverse)

# Establish full player table
players <- NBAPerGameAdvStatistics(season = 2018)

# Calculate quartiles for each position
Pos_quartiles <- players %>%
  group_by(pos) %>%
  summarise(age = list(enframe(quantile(age, probs = c(0.25, 0.5, 0.75, 1.0))))) %>%
  unnest()
I expect to end up with the players dataset with 664 observations and 32 variables, the last two of which were added by this procedure. The additional columns will show a player's age quartile among all players included, as well as his age quartile among players at his position.
We can use base::cut with quantile to get the appropriate quartile
library(dplyr)

players %>%
  mutate(quar_all = cut(age, breaks = c(0, quantile(age, probs = c(0.25, 0.5, 0.75, 1.0))),
                        labels = FALSE)) %>%
  group_by(pos) %>%
  mutate(quar_pos = cut(age, breaks = unique(c(0, quantile(age, probs = c(0.25, 0.5, 0.75, 1.0)))),
                        labels = FALSE))
Please note that in quar_pos I used unique() because I got the error
Error in cut.default(age, breaks = quantile(age, probs = c(0.25, 0.5, :
  'breaks' are not unique
For a similar error, unique() was proposed by Didzis Elferts, so, as Didzis mentioned, expect fewer quartiles for the affected groups.
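An alternative sketch: dplyr's ntile() assigns rank-based quartiles directly and sidesteps the non-unique-breaks error, though it cuts into four equal-sized groups rather than at quantile() break points, so boundaries can differ slightly:

library(dplyr)

players %>%
  mutate(quar_all = ntile(age, 4)) %>%  # quartile among all players
  group_by(pos) %>%
  mutate(quar_pos = ntile(age, 4)) %>%  # quartile within the player's position
  ungroup()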

R: Balancing a Repeated Cross-Section Sample

I have an unbalanced panel of repeated cross-sectional data, with a different number of observations at each age of individuals in each sampling year, something like the following:
mydata <- data.frame(age = sample(60, 1000, replace = TRUE),
                     year = sample(3, 1000, replace = TRUE),
                     x = rnorm(1000))
I would like to balance my cross sections panels so that there is an equal number of ages for each cross section. I have thought of a few ways to do this. I believe the easiest would be to count the number of people in each cross section for each age.
mydata <- dplyr::mutate(group_by(mydata, age, year), nage = n())
Then I find the minimum count for each age group across years.
mydata <- dplyr::mutate(group_by(mydata, age), minN = min(nage))
Now the last part is the part I don't know how to do. I would now like to select the first 1:N observations within each group. The obvious way to do this would be to create an index variable within each group. Then subset the data.frame to only those observations which are less than that index value that counts from 1 to N.
mydata <- dplyr::mutate(group_by(mydata, age, year), index = index())
subset(mydata, index <= minN)
Of course this is the problem. The function index does not exist. I have written out this entire explanation so that either someone can provide the function I am looking for or someone can suggest an alternative method to accomplish this same objective, or both. Thanks for your consideration!
The within-group index you are looking for is dplyr's row_number(). Old solution:
mydata %>%
  group_by(age, year) %>%
  mutate(nage = n()) %>%
  group_by(age) %>%
  filter(row_number() %in% 1:min(nage))
Final solution:
mydata %>%
  group_by(age, year) %>%
  mutate(nage = n()) %>%
  group_by(age) %>%
  mutate(minN = min(nage)) %>%
  group_by(age, year) %>%
  slice(seq_len(minN[1L]))
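To confirm the panel is actually balanced, a quick sanity check (a sketch; balanced is my own name for the result of the final solution):

balanced <- mydata %>%
  group_by(age, year) %>%
  mutate(nage = n()) %>%
  group_by(age) %>%
  mutate(minN = min(nage)) %>%
  group_by(age, year) %>%
  slice(seq_len(minN[1L])) %>%
  ungroup()

# every year should now contribute the same number of rows for a given age
balanced %>%
  count(age, year) %>%
  group_by(age) %>%
  summarise(is_balanced = n_distinct(n) == 1)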
