Calculate percentage occurrence in R

I conducted a dietary analysis of a raptor species and I would like to calculate the percentage of occurrence of the prey items in the three different stages of its breeding cycle. I would like the occurrence to be expressed as a percentage of the sample size. As an example, if the sample size is 135 and the occurrence of Orthoptera is 65, I would like to calculate the percentage 65/135 (about 48%).
So far I have tried with the long format without success. The result I am getting is not correct. Any help is highly appreciated, and sorry if this question is a repost.
The raw dataset is as follows:
set.seed(123)
pellets_2014 <- data.frame(
  Period     = sample(c("Prebreeding", "Breeding", "Postbreeding"), 12, replace = TRUE),
  Orthoptera = sample(0:10, 12, replace = TRUE),
  Coleoptera = sample(0:10, 12, replace = TRUE),
  Mammalia   = sample(0:10, 12, replace = TRUE))
## Load all the necessary packages
library(dplyr)
library(tidyr)
library(scales)
library(naniar)
## Transform the data to long format
pellets2014_long <- gather(pellets_2014, Categories, Count, c(Orthoptera, Coleoptera, Mammalia))
## Transform the zero values to NAs
pellets2014_NA <- pellets2014_long %>% replace_with_na(replace = list(Count = 0))
## Try to calculate the occurrence
Occurence2014 <- pellets2014_NA %>%
  group_by(Period, Categories) %>%
  summarise(n = n())
## I get this far, but the number of occurrences is not right and I am stuck on how to get the right percentage
## If I try this:
Occurence2014 <- pellets2014_NA %>%
  group_by(Period, Categories) %>%
  summarise(n = n()) %>%
  mutate(Freq_n = n / sum(n) * 100)
## The above is also wrong, because it needs to be divided by the sample size in each period (here there are 4 samples per period; the overall sample size is 12)!
The output must give the occurrence and the percentage of occurrence for each prey category in each Period, as shown in the desired-output screenshot (image not reproduced here).

Is this close to what you're looking for?
Occurence2014 <- pellets2014_NA %>%
  group_by(Period, Categories) %>%
  summarise(n = n()) %>%
  ungroup() %>%
  mutate(freq = n / sum(n))

Something like this?
Occurence2014 <- pellets2014_NA %>%
  group_by(Period) %>%
  mutate(period_sample_size = n()) %>%
  ungroup() %>%
  group_by(Period, Categories, period_sample_size) %>%
  summarise(n = n()) %>%
  mutate(Freq_n = n / period_sample_size * 100)
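For completeness, here is a sketch under one possible reading of "occurrence" (my assumption, not something stated in the answers above): the occurrence of a category is the number of pellets in a period in which it appears at all (Count > 0), and the denominator is the number of pellets in that period. The names sample_sizes, n_pellets and pct_occurrence are illustrative, not from the question.
## Sketch: occurrence = pellets per period with Count > 0; denominator = pellets per period
sample_sizes <- pellets_2014 %>% count(Period, name = "n_pellets")

Occurence2014_alt <- pellets2014_long %>%
  group_by(Period, Categories) %>%
  summarise(occurrence = sum(Count > 0)) %>%
  ungroup() %>%
  left_join(sample_sizes, by = "Period") %>%
  mutate(pct_occurrence = occurrence / n_pellets * 100)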

Related

R impute/'approximate' delayed covid time series points using non-delayed total

I'm using a grouped time-series data set where there are often NAs for more recent dates (the length of the NA run varies fairly randomly across groups). A total across all the series is provided in the data; for more recent dates this total is actually greater than the sum of the individual series, I guess because of imputation/forecasting.
So, my question is: how can the missing values be estimated, assuming that the series total is correct?
My general approach is to calculate what proportion of the total each series makes up, and somehow extrapolate that to the future missing dates. As you can see from the graphs, I'm not so successful. There are complications caused by the differing last dates of reported data across groups. I'm not sure whether working with cumulative values makes a difference.
R code for the simulated data and my failed solution is below:
## simulate a simple grouped time series
library(tidyverse)
set.seed(555)
## time series length, e.g. 10
len <- 10
## group names
grps <- letters[1:5]
df <- bind_rows(lapply(grps, function(z) {
  tibble(rn = seq(1:len)) %>%
    mutate(real_val = runif(len, min = 0, max = 1)) %>%
    mutate(grp = z) %>%
    select(grp, rn, real_val) %>%
    ## replace final data points with NA, length varying by group
    ## this simulates delays in data reporting across groups
    mutate(reported_val = ifelse(rn > len - match(z, letters) + 1, NA, real_val)) %>%
    # mutate(reported_val = ifelse(rn > len - runif(1, 0, round(max_trim)), NA, real_val)) %>%
    ## make cumulative to assist the viz a bit; may affect the imputation method
    group_by(grp) %>%
    arrange(rn) %>%
    mutate(real_val = cumsum(real_val), reported_val = cumsum(reported_val)) %>%
    ungroup()
}))
df
## attempt to impute/estimate the missing real values, given the total for each rn (date)
## general approach: use (adjusted?) proportions of the total
df2 <- df %>%
  group_by(rn) %>%
  mutate(sum_real_val = sum(real_val), sum_reported_val = sum(reported_val, na.rm = TRUE)) %>%
  ungroup() %>%
  ## total value missing for each date
  mutate(val_missing = sum_real_val - sum_reported_val) %>%
  ## what proportion of the total each group takes up
  mutate(prop = reported_val / sum_real_val) %>%
  # mutate(prop = reported_val / sum_reported_val) %>%
  ## fill missing proportions forward to the end from the most recent value
  group_by(grp) %>%
  arrange(rn) %>%
  fill(prop, .direction = "down") %>%
  ungroup() %>%
  ## get the estimated proportion for the missing values only
  mutate(temp1 = ifelse(is.na(reported_val), prop, NA)) %>%
  ## re-calculate the proportion over only the missing values
  group_by(grp) %>%
  mutate(prop_temp1 = temp1 / sum(temp1, na.rm = TRUE)) %>%
  ungroup() %>%
  ## if the value is missing, multiply the total missing by the expected proportion of missing
  mutate(result_val = ifelse(is.na(reported_val), val_missing * prop_temp1, reported_val)) %>%
  ungroup()
## time series plot
## stacked by group; the black line shows the real_val total
ggplot(df2 %>%
         select(grp, rn, real_val, reported_val, sum_real_val, result_val) %>%
         gather(val_grp, val, -c(grp, rn, sum_real_val)) %>%
         ungroup(), aes(rn, val)) +
  # geom_line(aes(colour = loc)) +
  geom_area(aes(fill = grp)) +
  geom_line(aes(y = sum_real_val)) +
  facet_wrap("val_grp")
## but alas the result total doesn't agree with the reported total
## nb: the imputed values for each group don't necessarily have to agree with the real values
It's hard to know what conclusions to draw from the incomplete dates. One simple assumption could be to take the shares from the last fully reported date and extrapolate them into the future:
default_share <- df %>%
  count(rn, grp, wt = !is.na(reported_val)) %>%
  count(rn, n) %>%
  filter(nn == max(nn)) %>%
  slice_max(rn) %>%
  left_join(df) %>%
  mutate(share = real_val / sum(real_val)) %>%
  select(rn, grp, share)

df %>%
  group_by(rn) %>%
  mutate(result_val = if_else(rn > default_share$rn[[1]],
                              sum(real_val) * default_share$share[rn == rn],
                              real_val)) %>%
  ungroup() %>%
  select(-reported_val) %>%
  pivot_longer(-c(grp:rn)) %>%
  ggplot(aes(rn, value)) +
  geom_area(aes(fill = grp)) +
  facet_wrap(~name)
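For readability, here is a join-based restatement of the same idea (a sketch of my own, not part of the original answer; the name cutoff is mine): attach each group's share from the last fully reported date explicitly, then scale the known per-date total by that share for dates past the cutoff.
## Sketch: same extrapolation as above, with the share attached via an explicit join
cutoff <- default_share$rn[[1]]
df %>%
  left_join(default_share %>% select(grp, share), by = "grp") %>%
  group_by(rn) %>%
  mutate(result_val = if_else(rn > cutoff, sum(real_val) * share, real_val)) %>%
  ungroup()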

Keep messing up percentage change depending on multiple columns

Once again I turn to you, the great Stack Overflow community, for help. I have posted this question more than once, but somehow I'm not capable of solving this seemingly simple problem... Honestly, I'm getting a little frustrated here, so eternal gratitude to everyone who helps me with this.
Multiple similar answers on Stack Overflow worked on my small reproducible data frame, but when I used the same tactics on my original one it didn't work. So first a little translation of the variables (they're in Dutch):
Gemeente == municipality
jaar == year
Beleidscode == crime category (roughly)
aantal_misdrijven == number of offences
We won't need Kennisnamedatum (date) and weekdag (weekday) for this problem.
My problem:
I want to calculate the change from 2015 to 2017, grouped by Gemeente and Beleidscode.
library(tidyverse)
# This will download my original data frame with ease:
df <- read_csv("https://github.com/thomasdebeus/colourful-facts/raw/master/projects/crime_dataset.csv")
# The following tries to first add a column with totals per
# year, municipality and crime category, then calculate the percentage change.
df %>%
  group_by(Gemeente, jaar, Beleidscode) %>%
  arrange(Gemeente, jaar, Beleidscode) %>%
  summarise(per_jaar_Gem_misdrijf = sum(aantal_misdrijven)) %>%
  mutate(perct_change = (per_jaar_Gem_misdrijf - lag(per_jaar_Gem_misdrijf, order_by = jaar)) /
           lag(per_jaar_Gem_misdrijf, order_by = jaar)) %>%
  ungroup()
So yeah, as you might expect, this isn't creating the right figures...
I hope someone can help.
It looks like you are trying to calculate the percentage change from the previous year. One way you can do this is
library(tidyverse)
# This will download my original data frame with ease:
df <- read_csv("https://github.com/thomasdebeus/colourful-facts/raw/master/projects/crime_dataset.csv")
# Create a data frame with the summary count by year
dfSumByYear <- df %>%
  group_by(Gemeente, jaar, Beleidscode) %>%
  summarise(per_jaar_Gem_misdrijf = sum(aantal_misdrijven)) %>%
  ungroup()
# Add the previous year's counts as an additional column
dfSumByYearWithPrev <- dfSumByYear %>%
  left_join(dfSumByYear %>%
              mutate(JaarJoin = jaar + 1) %>%
              rename(per_PrevJaar_Gem_misdrijf = per_jaar_Gem_misdrijf) %>%
              select(-jaar),
            by = c("Gemeente", c("jaar" = "JaarJoin"), "Beleidscode")) %>%
  # Calculate the percentage change
  mutate(perct_change = (coalesce(per_jaar_Gem_misdrijf, 0L) - coalesce(per_PrevJaar_Gem_misdrijf, 0L)) /
           coalesce(per_PrevJaar_Gem_misdrijf, 0L))
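One small observation on the division above (my note, not part of the original answer): when a Gemeente/Beleidscode combination has no previous-year row, coalesce(per_PrevJaar_Gem_misdrijf, 0L) puts a zero in the denominator, so perct_change comes out as Inf (or NaN when both counts are zero). If NA is preferred in those cases, a variant is to leave the denominator un-coalesced:
  # variant (assumption: NA is the wanted result when there is no previous-year row)
  mutate(perct_change = (coalesce(per_jaar_Gem_misdrijf, 0L) - coalesce(per_PrevJaar_Gem_misdrijf, 0L)) /
           per_PrevJaar_Gem_misdrijf)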
If you want to calculate the change from 2015 to 2017 specifically, one way you can do this is
library(tidyverse)
# This will download my original data frame with ease:
df <- read_csv("https://github.com/thomasdebeus/colourful-facts/raw/master/projects/crime_dataset.csv")
# Calculate the percentage change from 2015 to 2017
df %>%
  group_by(Gemeente, jaar, Beleidscode) %>%
  summarise(per_jaar_Gem_misdrijf = sum(aantal_misdrijven)) %>%
  ungroup() %>%
  spread(jaar, per_jaar_Gem_misdrijf, fill = 0L) %>%
  mutate(perct_change = (`2017` - `2015`) / `2015`)
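As a side note, spread() is superseded by pivot_wider() in newer versions of tidyr (>= 1.0.0); a sketch of the equivalent reshape, assuming a recent tidyr:
df %>%
  group_by(Gemeente, jaar, Beleidscode) %>%
  summarise(per_jaar_Gem_misdrijf = sum(aantal_misdrijven)) %>%
  ungroup() %>%
  pivot_wider(names_from = jaar, values_from = per_jaar_Gem_misdrijf, values_fill = 0L) %>%
  mutate(perct_change = (`2017` - `2015`) / `2015`)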

Categorical Variables Table with Percentages in R

I have a series of categorical variables that have the response options (Favorable, Unfavorable, Neutral).
I want to create a table in R that lists all 10 variables in rows (one variable per row), with the percentage of each response (Favorable, Unfavorable, Neutral) in the columns. Is this possible in R? Ideally, I would also want to be able to group this by another categorical variable (e.g. to compare how males vs. females responded to the questions).
You'll get better answers if you provide a sample of your actual data (see this post). That said, here is a solution using dplyr (and reshape2::melt):
# function to create a column of fake data
make_var <- function(n = 100) sample(c("good", "bad", "ugly"), size = n, replace = TRUE)
# put ten of them together
dat <- as.data.frame(replicate(10, make_var()), stringsAsFactors = FALSE)
library("dplyr")
# then reshape to long format, group, and summarize
dat %>% reshape2::melt(NULL) %>% group_by(variable) %>% summarize(
  good_pct = (sum(value == "good") / length(value)) * 100,
  bad_pct  = (sum(value == "bad")  / length(value)) * 100,
  ugly_pct = (sum(value == "ugly") / length(value)) * 100
)
Note that to group by another column (e.g. sex), you can just say group_by(variable, sex) before you summarize (as long as sex is a column of the data, which isn't the case in this constructed example).
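To make that concrete, here is a sketch with a made-up sex column added to the fake data (purely hypothetical; a real dataset would already contain such a column):
# add a hypothetical grouping column to the fake data
dat$sex <- sample(c("male", "female"), size = nrow(dat), replace = TRUE)
# melt with sex as an id variable, then group by both columns before summarising
dat %>%
  reshape2::melt(id.vars = "sex") %>%
  group_by(variable, sex) %>%
  summarize(
    good_pct = mean(value == "good") * 100,
    bad_pct  = mean(value == "bad") * 100,
    ugly_pct = mean(value == "ugly") * 100
  )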
Adapting lefft's example but keeping everything in tidyverse pipes (note that gather() and spread() come from tidyr, not dplyr, so load it as well):
library(tidyr)
dat %>%
  gather(variable, value) %>%
  group_by(variable) %>%
  count(value) %>%
  mutate(pct = n / sum(n) * 100) %>%
  select(-n) %>%
  spread(value, pct)

sample a different number of random rows for each level of a factor with dplyr

I am trying to take a random sample from each level of a factor. There are a different number of observations for each factor level. For each level I want to create a sample with half as many observations.
library(dplyr)
dat <- data.frame(ID = rep(c("AAA", "AAA", "AAA", "BBB", "BBB", "CCC"), length = 100),
                  Value = sample(1:100, replace = TRUE))
Using the data above, it seems like something like the following should nearly work, but the error (Error in n() : This function should not be called directly) suggests I am incorrectly using the n() function.
Samp <- dat %>% group_by(ID) %>% sample_n(size = n() / 2)
Thanks in advance.
Try sample_frac():
library(dplyr)
Samp <- dat %>% group_by(ID) %>% sample_frac(.5)
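In dplyr >= 1.0.0, slice_sample() supersedes both sample_n() and sample_frac(); a sketch of the equivalent call:
library(dplyr)
Samp <- dat %>% group_by(ID) %>% slice_sample(prop = 0.5)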

R: Balancing a Repeated Cross-Section Sample

I have an unbalanced panel of repeated cross-sectional data, with a different number of observations at each age in each sampling year, something like the following:
mydata <- data.frame(age = sample(60, 1000, replace = TRUE),
                     year = sample(3, 1000, replace = TRUE),
                     x = rnorm(1000))
I would like to balance my cross-section panels so that there is an equal number of observations at each age in every cross section. I have thought of a few ways to do this. I believe the easiest would be to count the number of people at each age in each cross section.
mydata <- dplyr::mutate(group_by(mydata, age, year), nage=n())
Then I find the minimum count for each age group across years.
mydata <- dplyr::mutate(group_by(mydata, age), minN=min(nage))
Now the last part is the part I don't know how to do. I would now like to select the first 1:N observations within each group. The obvious way to do this would be to create an index variable within each group. Then subset the data.frame to only those observations which are less than that index value that counts from 1 to N.
mydata <- dplyr::mutate(group_by(mydata, age, year), index=index())
subset(mydata, index <= minN)
Of course this is the problem. The function index does not exist. I have written out this entire explanation so that either someone can provide the function I am looking for or someone can suggest an alternative method to accomplish this same objective, or both. Thanks for your consideration!
Old solution:
mydata %>%
  group_by(age, year) %>%
  mutate(nage = n()) %>%
  group_by(age) %>%
  filter(row_number() %in% 1:min(nage))
Final solution:
mydata %>%
  group_by(age, year) %>%
  mutate(nage = n()) %>%
  group_by(age) %>%
  mutate(minN = min(nage)) %>%
  group_by(age, year) %>%
  slice(seq_len(minN[1L]))
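A quick sanity check (a sketch; balanced is my own name for the result of the final pipeline above, assuming it is saved to a variable): every year should now contribute the same number of rows within each age.
balanced %>%
  count(age, year) %>%                         # rows per age/year cell after slicing
  group_by(age) %>%
  summarise(is_balanced = n_distinct(n) == 1)  # TRUE when counts match across years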
