R impute/'approximate' delayed covid time series points using non-delayed total

I'm using a grouped time series data set where there are often NAs for the most recent dates (the number of trailing NAs varies fairly randomly between series). A total across all the series is provided in the data; for recent dates this total is greater than the sum of the individual reported series, I guess because of imputation/forecasting.
So, my question is: how can the missing values be estimated, assuming that the series total is correct?
My general approach is to calculate what proportion of the total each series makes up, and somehow extrapolate that to the missing dates. As you can see from the graphs, I'm not very successful. There are complications caused by the differing last dates of reported data. I'm not sure whether using cumulative values makes a difference.
R code for simulated data and failed solution below:
## simulate simple grouped time series
library(tidyverse)
set.seed(555)
## time series length, e.g. 10
len=10
## group names
grps=letters[1:5]
df=bind_rows(lapply(grps,function(z){
tibble(rn=seq(1:len)) %>%
mutate(real_val=runif(len,min=0,max=1)) %>%
mutate(grp=z) %>%
select(grp,rn,real_val) %>%
## replace final data points with NA, length varying by group
## this simulates delays in data reporting across groups
mutate(reported_val=ifelse(rn>len-match(z,letters)+1,NA,real_val)) %>%
# mutate(reported_val=ifelse(rn>len-runif(1,0,round(max_trim)),NA,real_val)) %>%
## make cumulative to assist viz a bit. may affect imputation method.
group_by(grp) %>%
arrange(rn) %>%
mutate(real_val=cumsum(real_val),reported_val=cumsum(reported_val)) %>%
ungroup()
}))
df
## attempt to impute/estimate missing real values, given total for each rn (date)
## general solution is to use (adjusted?) proportions of the total.
df2=df %>%
group_by(rn) %>%
mutate(sum_real_val=sum(real_val),sum_reported_val=sum(reported_val,na.rm=T)) %>%
ungroup() %>%
## total value missing for each date
mutate(val_missing=sum_real_val-sum_reported_val) %>%
## what proportion of the continent that country takes up
mutate(prop=reported_val/sum_real_val) %>%
# mutate(prop=reported_val/sum_reported_val) %>%
## fill missing proportions to end terminus from most recent value
group_by(grp) %>%
arrange(rn) %>%
fill(prop,.direction='down') %>%
ungroup() %>%
## get estimated proportion of those missing
mutate(temp1=ifelse(is.na(reported_val),prop,NA)) %>%
## re-calculate proportion as only of those missing.
group_by(grp) %>%
mutate(prop_temp1=temp1/sum(temp1,na.rm=T)) %>%
ungroup() %>%
## if value missing, then multiply total missing by expected proportion of missing
mutate(result_val=ifelse(is.na(reported_val),val_missing*prop_temp1,reported_val)) %>%
ungroup()
## time series plot
## stacked by group, black line shows real_val total.
ggplot(df2 %>%
select(grp,rn,real_val,reported_val,sum_real_val,result_val) %>%
gather(val_grp,val,-c(grp,rn,sum_real_val)) %>%
ungroup(),aes(rn,val))+
# geom_line(aes(colour=loc))+
geom_area(aes(fill=grp))+
geom_line(aes(y=sum_real_val))+
facet_wrap("val_grp")
## but alas the result total doesn't agree with the reported total
## nb the imputed values for each group don't necessarily have to agree with the real values.

It's hard to know what conclusions to draw from the incomplete dates. One simple assumption could be to take the last share and extrapolate that into the future:
default_share <- df %>%
count(rn, grp, wt = !is.na(reported_val)) %>%
count(rn, n) %>%
filter(nn == max(nn)) %>%
slice_max(rn) %>%
left_join(df) %>%
mutate(share = real_val / sum(real_val)) %>%
select(rn, grp, share)
df %>%
group_by(rn) %>%
mutate(result_val = if_else(rn > default_share$rn[[1]],
## look up each group's share by name rather than relying on row order
sum(real_val) * default_share$share[match(grp, default_share$grp)],
real_val)
) %>% ungroup() %>%
select(-reported_val) %>%
pivot_longer(-c(grp:rn)) %>%
ggplot(aes(rn,value))+
geom_area(aes(fill=grp))+
facet_wrap(~name)
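If the imputed values must add up exactly to the known total on every date, one option is to spread the unreported remainder (total minus the sum of reported values) over the groups that are missing on that date, weighted by each group's last observed share. The key point is that the weights are renormalised within each date, not within each group; the attempt in the question renormalises with group_by(grp), which may be why its totals disagree. Below is a minimal base-R sketch on made-up data; the toy values, the `total` vector and the helper `fill_down` are assumptions for illustration, not part of the original data.

```r
# Hypothetical toy data: 3 groups, 4 dates, cumulative values.
# NA marks a delayed (not yet reported) value.
df <- data.frame(
  grp = rep(c("a", "b", "c"), each = 4),
  rn  = rep(1:4, times = 3),
  reported_val = c(1, 2, 3, 4,     # a: complete
                   2, 4, 6, NA,    # b: last date missing
                   1, 3, NA, NA)   # c: last two dates missing
)
total <- c(4, 9, 14, 20)           # assumed-correct total per date
df$total <- total[df$rn]

# Each group's share of the total where it is reported
df$prop <- df$reported_val / df$total

# Carry the last observed share forward within each group
fill_down <- function(x) {
  idx <- cummax(seq_along(x) * (!is.na(x)))  # index of last non-NA so far
  idx[idx == 0] <- NA                        # nothing observed yet
  x[idx]
}
df <- df[order(df$grp, df$rn), ]
df$prop <- ave(df$prop, df$grp, FUN = fill_down)

missing <- is.na(df$reported_val)

# Amount of the total that is not yet reported on each date
reported_sum   <- ave(ifelse(missing, 0, df$reported_val), df$rn, FUN = sum)
df$val_missing <- df$total - reported_sum

# Renormalise the carried-forward shares among the groups missing on the SAME date
miss_prop <- ifelse(missing, df$prop, 0)
df$weight <- miss_prop / ave(miss_prop, df$rn, FUN = sum)

# Reported values are kept; missing ones get their slice of the remainder
df$result_val <- ifelse(missing, df$val_missing * df$weight, df$reported_val)

# Per-date sums of result_val, to compare against `total`
tapply(df$result_val, df$rn, sum)
```

By construction the per-date sums match `total`. Nothing here forces an imputed cumulative series to stay non-decreasing, so that would need a separate check on real data.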

Related

How to add a total distance column in 'flights' dataset? DPLYR, Group_by, Ungroup

I am working with 'flights' dataset from 'nycflights13' package in R.
I want to add a column holding the total distance covered by each 'carrier' in 2013. I have computed the total distance covered by each carrier and stored it in a new variable.
There are 16 carriers, so how do I bind 16 numbers to a data frame with many more rows?
carrier <- flights %>%
group_by(carrier) %>%
select(distance) %>%
summarize(TotalDistance = sum(distance)) %>%
arrange(desc(TotalDistance))
How can I add these per-carrier distance sums as a new column in the flights dataset?
Thank you for your time and effort here.
PS. I tried running a for loop, but it doesn't work. I am new to programming.
Use mutate instead:
flights %>%
group_by(carrier) %>%
mutate(TotalDistance = sum(distance)) %>%
ungroup()-> carrier
We can also use left_join.
library(nycflights13)
data("flights")
library(dplyr)
flights %>%
left_join(flights %>%
group_by(carrier) %>%
select(distance) %>%
summarize(TotalDistance = sum(distance)) %>%
arrange(desc(TotalDistance)), by='carrier')-> carrier
This will work even if you don't use arrange at the end.

Calculate percentage occurrence in R

I conducted a dietary analysis of a raptor species and I would like to calculate the percentage of occurrence of the prey items in the three different stages of its breeding cycle. I would like the occurrence to be expressed as a percentage of the sample size. For example, if the sample size is 135 and Orthoptera occurs in 65 samples, I would like to calculate the percentage 65/135.
So far I have tried with the long format without success; the result I am getting is not correct. Any help is highly appreciated, and sorry if this question is a repost.
The raw dataset is as it follows:
set.seed(123)
pellets_2014<-data.frame(
Period = sample(c("Prebreeding","Breeding","Postbreedng"),12, replace=TRUE),
Orthoptera = sample(0:10, 12,replace=TRUE),
Coleoptera=sample(0:10,12,replace = TRUE),
Mammalia=sample(0:10,12, replace=TRUE))
##I transform the file to long format
##Library all the necessary packages
library(dplyr)
library(tidyr)
library(scales)
library(naniar)
pellets2014_long<-gather(pellets_2014,Categories, Count, c(Orthoptera,Coleoptera,Mammalia))
##I transform the zero values to NAs
pellets2014_NA<-pellets2014_long %>% replace_with_na(replace = list(Count = 0))
## Try to calculate the occurence
Occurence2014<-pellets2014_NA %>%
group_by(Period,Categories) %>%
summarise(n=n())
## I get this far, but I don't get the right number of occurrences and I am stuck on how to get the right percentage
##If I try this:
Occurence2014<-pellets2014_NA %>%
group_by(Period,Categories) %>%
summarise(n=n())%>%mutate(Freq_n=n/sum(n)*100)
##The above is also wrong because I need to divide by the sample size in each period (here 4 samples per period; the overall sample size is 12)!
The output must be the occurrence and percentage of occurrence of each prey category in each Period, as shown in the picture below.
Desired output
Is this close to what you're looking for?
Occurence2014 <- pellets2014_NA %>%
group_by(Period,Categories) %>%
summarise(n = n()) %>%
ungroup() %>%
mutate(
freq = n / sum(n)
)
Something like this?
Occurence2014 <- pellets2014_NA %>%
group_by(Period) %>%
mutate(period_sample_size = n()) %>%
ungroup() %>%
group_by(Period,Categories,period_sample_size) %>%
summarise(n=n())%>%
mutate(Freq_n=n/period_sample_size*100)
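Another way to read the question: occurrence is the number of samples in a period where the prey count is above zero, and the denominator is the number of samples (rows of the wide table) in that period. Note that n() on the long table counts rows, including the NA ones, so it may not give the occurrence. A base-R sketch on the wide data from the question, assuming that reading is the intended one (the names `prey`, `occ`, `n` and `pct` are mine):

```r
set.seed(123)
pellets_2014 <- data.frame(
  Period     = sample(c("Prebreeding", "Breeding", "Postbreedng"), 12, replace = TRUE),
  Orthoptera = sample(0:10, 12, replace = TRUE),
  Coleoptera = sample(0:10, 12, replace = TRUE),
  Mammalia   = sample(0:10, 12, replace = TRUE))

prey <- c("Orthoptera", "Coleoptera", "Mammalia")

# Occurrence: number of samples per period in which each prey category is present
occ <- aggregate(pellets_2014[prey] > 0,
                 by = list(Period = pellets_2014$Period),
                 FUN = sum)

# Sample size per period, taken from the wide table (one row per sample)
n_tab <- table(pellets_2014$Period)
n     <- as.vector(n_tab[as.character(occ$Period)])

# Percentage of occurrence = occurrence / samples in that period * 100
pct <- 100 * as.matrix(occ[prey]) / n

data.frame(Period = occ$Period, n_samples = n, round(pct, 1))
```

Because the denominator is the per-period sample size, the percentages of the different prey categories within a period do not have to add up to 100.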

Calculation of grouped zscores change the order

This question is based on How do I calculate a grouped z score in R using dplyr?.
Here data are scaled (zscores) for different groups and ungrouped.
dat = iris %>%
gather(variable, value, -Species) %>%
group_by(Species, variable) %>%
mutate(z_score_group = (value - mean(value)) / sd(value)) %>%
ungroup %>%
mutate(z_score_ungrouped = (value - mean(value)) / sd(value))
Scaling ungrouped preserves the order of the data.
> identical(order(dat$z_score_ungrouped), order(dat$value))
[1] TRUE
However, interestingly the data change their order by scaling group wise.
> identical(order(dat$z_score_group), order(dat$value))
[1] FALSE
In my opinion scaling should never change the order of data because this has a huge impact on rank based analysis (e.g. ROC-curves). Does anyone have an idea why grouping changes the order?
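One way to look at it: a group-wise z-score subtracts a different mean and divides by a different sd in each group, so it is an increasing transformation only within a group, not across the pooled data. A small base-R sketch of that idea, using a single iris column instead of the gathered data (that simplification is my assumption):

```r
x   <- iris$Sepal.Length
grp <- iris$Species

# Group-wise z-score: each species is centred and scaled on its own mean and sd
z_group <- ave(x, grp, FUN = function(v) (v - mean(v)) / sd(v))

# Within each species the ranking of the raw values is preserved
same_within <- sapply(split(seq_along(x), grp), function(i) {
  identical(rank(x[i]), rank(z_group[i]))
})
same_within

# Across all species pooled together, the ranking can change
identical(rank(x), rank(z_group))
```

For example, a large setosa value sits far above its own (small) group mean and can end up with a higher z-score than a virginica value that is larger in absolute terms.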

Explain ungroup() in dplyr

If I'm working with a dataset and I want to group the data (i.e. by country), compute a summary statistic (mean()) and then ungroup() the data.frame to have a dataset with the original dimensions (country-year) and a new column that lists the mean for each country (repeated over n years), how would I do that with dplyr? The ungroup() function doesn't return a data.frame with the original dimensions:
gapminder %>%
group_by(country) %>%
summarize(mn = mean(pop)) %>%
ungroup() # returns data.frame with nrows == length(unique(gapminder$country))
ungroup() is useful if you want to do something like
gapminder %>%
group_by(country) %>%
mutate(mn = pop/mean(pop)) %>%
ungroup()
where you want to do some sort of transformation that uses an entire group's statistics. In the above example, mn is the ratio of a population to the group's average population. When it is ungrouped, any further mutations called on it would not use the grouping for aggregate statistics.
summarize automatically reduces the dimensions, and there's no way to get that back. Perhaps you wanted to do
gapminder %>%
group_by(country) %>%
mutate(mn = mean(pop)) %>%
ungroup()
Which creates mn as the mean for each group, replicated for each row within that group.
The summarize() reduced the number of rows. If you didn't want to change the number of rows, then use mutate() rather than summarize().
Actually, ungroup() is not needed to compute the column in your case.
gapminder %>%
group_by(country) %>%
mutate(mn = pop/mean(pop))
generates the same results as the following:
gapminder %>%
group_by(country) %>%
mutate(mn = pop/mean(pop)) %>%
ungroup()
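The computed mn values are the same. The difference is that the first result is still grouped by country, so subsequent verbs would continue to operate per country, whereas the second returns an ungrouped tibble; the extra ungroup() call also costs a little time.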
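To make the distinction concrete, here is a small sketch with dplyr on a made-up table standing in for gapminder (the `toy` data and the `share` column are hypothetical). It shows that summarise() collapses rows, mutate() keeps them, and that whether you call ungroup() matters for what later verbs compute:

```r
library(dplyr)

# Hypothetical stand-in for gapminder
toy <- tibble(
  country = c("A", "A", "B", "B"),
  year    = c(2000, 2005, 2000, 2005),
  pop     = c(10, 20, 100, 300)
)

# summarise(): one row per country
per_country <- toy %>%
  group_by(country) %>%
  summarise(mn = mean(pop))

# mutate(): original rows kept, group mean repeated on each row
with_mean <- toy %>%
  group_by(country) %>%
  mutate(mn = mean(pop))

# Still grouped: sum(pop) is taken per country
grouped_share <- with_mean %>%
  mutate(share = pop / sum(pop))

# Ungrouped first: sum(pop) is taken over the whole table
ungrouped_share <- with_mean %>%
  ungroup() %>%
  mutate(share = pop / sum(pop))
```

Comparing grouped_share$share with ungrouped_share$share shows why it is usually safer to ungroup() once the per-group step is done.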

Categorical Variables Table with Percentages in R

I have a series of categorical variables that have the response options (Favorable, Unfavorable, Neutral).
I want to create a table in R that will give the list of all 10 variables in rows (one variable per row) - with the percentage response "Favorable, Unfavorable, Neutral" in the columns. Is this possible in R? Ideally, I would also want to be able to group this by another categorical variable (e.g. to compare how males vs. females responded to the questions differently).
You'll get better answers if you provide a sample of your actual data (see this post). That said, here is a solution using dplyr:: (and reshape2::melt).
# function to create a column of fake data
make_var <- function(n=100) sample(c("good","bad","ugly"), size=n, replace=TRUE)
# put ten of them together
dat <- as.data.frame(replicate(10, make_var()), stringsAsFactors=FALSE)
library("dplyr")
# then reshape to long format, group, and summarize --
dat %>% reshape2::melt(NULL) %>% group_by(variable) %>% summarize(
good_pct = (sum(value=="good") / length(value)) * 100,
bad_pct = (sum(value=="bad") / length(value)) * 100,
ugly_pct = (sum(value=="ugly") / length(value)) * 100
)
Note that to group by another column (e.g. sex), you can just say group_by(variable, sex) before you summarize (as long as sex is a column of the data, which isn't the case in this constructed example).
Adapting lefft's example but trying to do everything in the tidyverse (gather() and spread() come from tidyr):
dat %>%
gather(variable, value) %>%
group_by(variable) %>%
count(value) %>%
mutate(pct = n / sum(n) * 100) %>%
select(-n) %>%
spread(value, pct)
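For comparison, the same kind of table can be sketched in base R with table() and prop.table(). The seed and the name `lvls` are assumptions added for reproducibility and illustration; the fake data follow the answer above.

```r
set.seed(1)  # assumed seed, only so the fake data are reproducible

# Fake data as in the answer above: ten columns of three response options
make_var <- function(n = 100) sample(c("good", "bad", "ugly"), size = n, replace = TRUE)
dat <- as.data.frame(replicate(10, make_var()), stringsAsFactors = FALSE)

# Fix the levels so every variable gets all three columns, even if one option is absent
lvls <- c("good", "bad", "ugly")

# One row per variable, one column per response option, values in percent
pct <- t(sapply(dat, function(x) 100 * prop.table(table(factor(x, levels = lvls)))))
round(pct, 1)
```

Fixing the factor levels up front keeps the columns aligned across variables, which matters when some response option never occurs in a given variable.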
