how to group and summarise column by two other columns - r

I have a question like that "For 5 countries of your choice, use a group bar chart to compare “deaths per 100 confirmed cases” in each year since the beginning of the pandemic."
I wrote some code like this:
COVID_data %>%
  filter(countriesAndTerritories %in% selected_countries) %>%
  drop_na(deaths) %>%
  filter(deaths > 0, cases > 0) %>%
  mutate(d = (deaths * 100) / cases) %>%
  ggplot(aes(x = countriesAndTerritories, y = d, fill = as.factor(year))) +
  geom_bar(position = "dodge", stat = "identity") +
  labs(x = "Countries", y = "Deaths Per 100 Cases", fill = "year") +
  ggtitle("Number of Deaths per 100 confirmed cases in each year")
It gives me this output:
but my teacher's output looks like this:
Our outputs for France and Italy are different. I examined my data and calculated the number of deaths per 100 cases myself, and my data looks correct, but I couldn't find my mistake. Could you help me?
My data is from this link:
https://www.ecdc.europa.eu/en/publications-data/data-daily-new-cases-covid-19-eueea-country

The most obvious problem with the code submitted in the question is that it does not aggregate cases and deaths by country and year. Less obvious errors relate to the choices made when subsetting the data, the handling of missing values, and the use of stat = "identity" as an argument to geom_bar().
Here's a fully reproducible example that recreates the instructor's chart.
First, we load the data from the European CDC.
library(ggplot2)
library(dplyr)
library(tidyr)

data <- read.csv("https://opendata.ecdc.europa.eu/covid19/nationalcasedeath_eueea_daily_ei/csv",
                 na.strings = "", fileEncoding = "UTF-8-BOM")
Next, we create a vector of countries to subset, matching those in the instructor's chart.
# select some countries
countryList <- c("France","Italy","Germany","Poland","Romania")
Here we group by country and year, aggregate cases and deaths, calculate the death rate (deaths per 100 confirmed cases), and save the result to an output data frame.
data %>%
  filter(countriesAndTerritories %in% countryList) %>%
  group_by(countriesAndTerritories, year) %>%
  summarise(cases = sum(cases, na.rm = TRUE),
            deaths = sum(deaths, na.rm = TRUE)) %>%
  mutate(deathRate = deaths / (cases / 100)) -> summedData
We use the na.rm = TRUE argument on sum() to include as much of the data as possible, since the codebook tells us that these are daily reports of cases and deaths.
If we view the data frame with View(summedData), we see that the death rates are between 1 and 4, as expected.
Having inspected the data, we plot it with ggplot().
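The plotting code is not shown separately at this point; it is the same call used further below for the corrected version of the original poster's chart, applied to summedData:
ggplot(data = summedData,
       aes(x = countriesAndTerritories, y = deathRate, fill = as.factor(year))) +
  geom_bar(position = "dodge", stat = "identity") +
  labs(x = "Countries", y = "Deaths Per 100 Cases", fill = "year") +
  ggtitle("Number of Deaths per 100 confirmed cases in each year")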
Diagnosing the Errors
Now that we know we can reproduce the professor's chart from the European CDC data, let's walk step by step through the original post's code to find the errors.
After reading the data as above, we subset countries and execute the first part of the dplyr pipeline, and save to a data frame.
selected_countries <- c("France", "Italy", "Norway", "Sweden", "Finland")

data %>%
  filter(countriesAndTerritories %in% selected_countries) -> step1
In the RStudio object viewer we see that the resulting data frame has 4,240 observations.
If we summarize the data to look at the average daily cases and average daily deaths, we see that the cases average 12260.9 while deaths average 81.03.
> mean(step1$cases,na.rm=TRUE)
[1] 12260.9
> mean(step1$deaths,na.rm=TRUE)
[1] 81.03202
So far, so good, because this means that the average death rate across all the data is less than 1.0, which makes sense given worldwide reports about COVID mortality rates since March 2020.
> mean(step1$deaths,na.rm=TRUE) / (mean(step1$cases,na.rm=TRUE) /100)
[1] 0.6608977
Next, we execute the tidyr::drop_na() function and see what happens.
library(tidyr)
step1 %>% drop_na(deaths) -> step2
nrow(step2)
Looks like we've lost 24 observations.
> nrow(step2)
[1] 4216
When we sort by deaths and inspect step1 in the RStudio data viewer, we find that the disappearing observations are all in Norway: there are 24 days where cases were recorded but deaths are missing (NA).
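If you prefer code to the data viewer, a quick way to locate those rows (this snippet is mine, not part of the original answer):
step1 %>%
  filter(is.na(deaths)) %>%
  count(countriesAndTerritories)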
Still, there's nothing to suggest we'd generate a graph where 400 people die for every 100 confirmed COVID cases.
Next, we apply the next operation in the original poster's dplyr pipeline and count the rows.
step2 %>% filter(deaths>0, cases>0) -> step3
nrow(step3)
Hmm... we've lost over 980 rows of data now.
> nrow(step3)
[1] 3231
At this point the code is throwing valid data away, which is going to skew the results. Why? COVID case and death counts are notorious for data corrections over time, so sometimes governments will report negative cases or deaths to correct past over-reporting errors.
Sure enough, our data includes rows with negative values.
> summary(step1$cases)
      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.      NA's
 -348846.0     275.5    1114.0   12260.9    7570.5  501635.0         1
> summary(step1$deaths)
    Min.  1st Qu.  Median    Mean  3rd Qu.    Max.    NA's
 -217.00     1.00   11.00   81.03   76.00  2004.00      24
Wow! One country reported -348,846 cases in a single day. That's a major correction.
Inspecting the data again, we see that France is the culprit here. In a more serious analysis, a researcher would be obligated to assess the validity of this observation, for example by reviewing news reports about COVID case reporting corrections in France during 2021.
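A quick way to pull out the offending row(s) in code, using the column names from the ECDC file (this check is mine, not part of the original answer):
step1 %>%
  filter(cases < 0) %>%
  select(dateRep, countriesAndTerritories, cases, deaths)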
Now, when the original code uses mutate() to calculate death rates, it does not aggregate by countriesAndTerritories or year.
step3 %>% mutate(d =(deaths*100)/cases) -> step4
Therefore, when the code calls ggplot() with geom_bar(stat = "identity"), the y axis uses the values of the individual observations (3,231 of them at this point) and produces unexpected results.
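To see the scale of the problem, we can count how many daily rows would feed each country/year bar (this check is mine, not part of the original answer); each bar position receives many individual daily values rather than one aggregated number:
step4 %>%
  count(countriesAndTerritories, year)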
Corrected Version of Original Poster's Chart
Here is the code that correctly analyzes the data, using the five countries selected by the original poster.
countryList <- c("France", "Italy", "Norway", "Sweden", "Finland")

data %>%
  filter(countriesAndTerritories %in% countryList) %>%
  group_by(countriesAndTerritories, year) %>%
  summarise(cases = sum(cases, na.rm = TRUE),
            deaths = sum(deaths, na.rm = TRUE)) %>%
  mutate(deathRate = deaths / (cases / 100)) -> summedData

# plot the results
ggplot(data = summedData,
       aes(x = countriesAndTerritories, y = deathRate, fill = as.factor(year))) +
  geom_bar(position = "dodge", stat = "identity") +
  labs(x = "Countries", y = "Deaths Per 100 Cases", fill = "year") +
  ggtitle("Number of Deaths per 100 confirmed cases in each year")
...and the output, which has death rates between 1 and 3.5%.
Note that the data frame used to generate the chart has only 12 observations, or one observation per number on the chart. This is why stat = "identity" can be used with geom_bar(). ggplot() uses the value of deathRate to plot the height of each bar along the y axis.
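An equivalent, slightly more idiomatic option (not used in the answer above) is geom_col(), which is ggplot2 shorthand for geom_bar(stat = "identity"):
ggplot(summedData,
       aes(x = countriesAndTerritories, y = deathRate, fill = as.factor(year))) +
  geom_col(position = "dodge") +
  labs(x = "Countries", y = "Deaths Per 100 Cases", fill = "year")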
Conclusions
First, it's important to understand the details of the data we're analyzing, especially when there are plenty of outside references such as worldwide COVID death rates.
Second, it's important to understand which observations in a data set are valid, for example whether a data correction like the one France made in May 2021 is reasonable to keep in the analysis.
Finally, it's important to conduct a face validity check of the results. Is it realistic to expect 400 people to die for every 100 confirmed COVID cases? For a disease with worldwide deaths reported at between 1 and 4% of confirmed cases, probably not.

Related

distinct() in dplyr is returning more columns/variables than expected

I am teaching myself R for data visualization in my research, so my apologies if I end up using imprecise language or format the question incorrectly. Also, I included as much info as possible because I'm not sure what is/isn't relevant for my issue.
I am reading in data of causes of mortality from the CDC using its ICD-10 codes. These show an incredible amount of granularity, so to better visualize the data, I have converted the data from the codes to a grouping appropriate to my research. Along with these, I want to create a table of Code to the Cause of Death description to display along my graphs later on. My issue is that distinct() returns more variables than expected.
distinct() worked fine for me in the original data, displayed to the granular level. I wrote the following:
keyGranular <- mortAge %>%
  distinct(`Cause of death`, `Cause of death Code`) %>%
  na.omit(keyGranular)
The variables in the data frame mortAge are "Year", "Cause of death", "Cause of death Code", "Age Group", "Deaths", "Population", and "Crude Rate"; these data were read in initially using read_tsv().
The na.omit() call targets the rows of data that consist of the "total" amounts.
This returns 60 obs. of 2 variables, appearing as follows:
Cause of death, Cause of death Code
[1] Mental and behavioural disorders due to use of opioids, acute intoxication, F11.0
[2] Mental and behavioural disorders due to use of opioids, harmful use, F11.1
[3] Mental and behavioural disorders due to use of opioids, dependence syndrome, F11.2
This is as expected, great!
I then reduce the data from the most granular level to one step above it, i.e. F11.0-F11.9 -> F11, summing up the values from each of the sub-categories. I did this by:
using str_extract() to remove the data after the decimal place in Cause of death Code,
using str_extract() to remove the description after the comma, and
summing the data from all the rows with the same (new) cause of death and age group:
mortAgeC <- mortAge

mortAgeC$`Cause of death Code` <- str_extract(mortAgeC$`Cause of death Code`, pattern = "[A-Z][0-9][0-9]")
mortAgeC$`Cause of death` <- str_extract(mortAgeC$`Cause of death`, "[^,]+")
mortAgeC$`Cause of death Code` <- mortAgeC$`Cause of death Code` %>% replace_na('TOTAL')

mortAgeC <- mortAgeC %>%
  group_by(Year, `Cause of death Code`, `Cause of death`, `Age Group`) %>%
  summarise(Rate = sum(`Crude Rate`, na.rm = TRUE))
Now, I want to create a new key with these newly grouped codes, so I rewrite the same code as earlier:
key <- mortAgeC %>%
  distinct(`Cause of death`, `Cause of death Code`) %>%
  na.omit(key)
Here is where I run into the issue. Instead of returning only the two columns, as I expect it to, I get three columns, including the year. Thus, instead of returning only 15 obs. of 2 variables, I am left with 270 obs of 3 variables... which are just the same 15 or so codes repeated across 17 years of data:
Year, Cause of death Code, Cause of death
[1] 1999, F11, Mental and behavioural disorders due to use of opioids
[2] 1999, F12, Mental and behavioural disorders due to use of cannabinoids
[3] 1999, F13, Mental and behavioural disorders due to use of sedatives or hypnotics
[4] 1999, F14, Mental and behavioural disorders due to use of cocaine
I tried to use select() to remove the "Year" column (I did this just fine to drop extraneous columns from the original mortAge when I read in the data), but it returned a warning:
key <- mortAgeC %>%
  select(key, -"Year")
Warning message:
Using an external vector in selections was deprecated in tidyselect 1.1.0.
ℹ Please use `all_of()` or `any_of()` instead.
# Was:
data %>% select(key)
# Now:
data %>% select(all_of(key))
See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
I'm not sure where to go from here, since the documentation I've read on distinct() says that it only returns the subset of the data frame passed through distinct(). I tried also specifying .keep_all = FALSE, but that gave me the same results.
The only thing I can think of is that there is some difference with the type(?) of data, due to mortAge being created by read_tsv() and mortAgeC being created by group_by() and summarise(), but I'm not sure why that causes key to reduce mortAgeC from 5 variables to 3 instead of 5->2, as I want.
Or: does group_by() create the assumption that the data in mortAgeC must be related/grouped in a sort of funnel/tree (year > code > cause of death), where in mortAge there was no assumed relationship between the data? And that's why I can remove the Age Group and Rate variables, but not Year?
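That grouping hypothesis is plausible: summarise() leaves its result grouped by all but the last grouping variable (here Year, Cause of death Code and Cause of death), and distinct() always keeps grouping variables in its output. A hedged sketch of the fix, assuming that is indeed the cause, is to drop the grouping before taking distinct rows:
key <- mortAgeC %>%
  ungroup() %>%
  distinct(`Cause of death`, `Cause of death Code`) %>%
  na.omit()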

R - Draw cases per 100k population

I am trying to draw a line graph of COVID cases for each date. I don't have a reference output; the lecturer gave only the questions. I solved the question, but my problem is the output: it looks weird. Here is the question:
"For the ten countries with the highest absolute number of total deaths, draw the following line graphs to visualize whether epidemic has started to slow down and how the growth rate of new cases/deaths differs across those countries.
a) Number of new cases at each date (absolute number vs per 100.000 population)"
Here is my code:
library(utils)
library(dplyr)
library(ggplot2)

COVID_data <- read.csv("https://opendata.ecdc.europa.eu/covid19/nationalcasedeath_eueea_daily_ei/csv",
                       na.strings = "", fileEncoding = "UTF-8-BOM")

# Finding the ten countries with the highest absolute number of total deaths
abs_total_deaths <- COVID_data %>%
  group_by(countriesAndTerritories) %>%
  summarise(abs_total_deaths = sum(deaths)) %>%
  arrange(desc(abs_total_deaths))

abs_ten_total_deaths <- c('Italy','France','Germany','Spain','Poland',
                          'Romania','Czechia','Hungary','Belgium','Bulgaria')

# Calculate new cases by dividing the absolute number by 100,000 population
# Draw a line for each country
COVID_data %>%
  filter(countriesAndTerritories %in% abs_ten_total_deaths) %>%
  filter(cases > 0) %>%
  mutate(new_cases = cases / 100000) %>%
  ungroup() %>%
  ggplot() +
  geom_line(aes(x = dateRep, y = new_cases, color = countriesAndTerritories), size = 1) +
  labs(x = "Date",
       y = "New Cases",
       title = "New Cases per 100.000 population") +
  facet_wrap(~countriesAndTerritories) +
  theme_bw()
I will also add a picture of my output. I think my graph is not correct because it looks really weird, and I can't see where I made a mistake. I would appreciate any help.
Here is the output:
Looking at Belgium, I get total deaths = 25051 from your data file, which tallies exactly with the data here.
It's obvious that the highest value (by far) for every country occurs "on" the earliest date for the country in the file. Amongst your top ten (I agree with your selection), this is 01Mar2021 for every country apart from Spain, and 28Feb2021 for Spain.
These two facts lead me to conclude (1) your graphs correctly display the data you have asked them to summarise and that (2) you have a data artefact: the first record for each country contains the cumulative total to date, whereas subsequent dates contain data reported "in the previous 24 hours". I use quotes because different countries have different reporting conventions. For example, in the UK (since August 2020) "COVID-related deaths" are deaths from any cause within 28 days of a positive COVID test. Citation
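One way to confirm this in code (my addition, reusing your abs_ten_total_deaths vector and assuming lubridate is installed) is to find the date of the maximum daily case count for each country:
COVID_data %>%
  filter(countriesAndTerritories %in% abs_ten_total_deaths) %>%
  mutate(Date = lubridate::dmy(dateRep)) %>%
  group_by(countriesAndTerritories) %>%
  slice_max(cases, n = 1) %>%
  select(countriesAndTerritories, Date, cases)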
Therefore, to get meaningful graphs, I think your only option is to discard the cumulative data contained in the first record for each country. Here's how I would do that:
library(utils)
library(tidyverse)

COVID_data <- read.csv("https://opendata.ecdc.europa.eu/covid19/nationalcasedeath_eueea_daily_ei/csv",
                       na.strings = "", fileEncoding = "UTF-8-BOM")

# For better printing
COVID_data <- as_tibble(COVID_data)

# Which countries have the highest absolute death toll?
# [I get the same countries as you do.]
top10 <- COVID_data %>%
  group_by(countriesAndTerritories) %>%
  summarise(TotalDeaths = sum(deaths)) %>%
  slice_max(TotalDeaths, n = 10) %>%
  distinct(countriesAndTerritories) %>%
  pull(countriesAndTerritories)

COVID_data %>%
  filter(countriesAndTerritories %in% top10) %>%
  mutate(
    deathRate = 100000 * deaths / popData2020,
    caseRate = 100000 * cases / popData2020,
    Date = lubridate::dmy(dateRep)
  ) %>%
  arrange(countriesAndTerritories, Date) %>%
  group_by(countriesAndTerritories) %>%
  filter(row_number() > 1) %>%
  ggplot() +
  geom_line(aes(x = Date, y = deathRate)) +
  facet_wrap(~countriesAndTerritories)
The critical part that excludes the first data row for each country is
arrange(countriesAndTerritories, Date) %>%
group_by(countriesAndTerritories) %>%
filter(row_number() > 1) %>%
The call to arrange is necessary because the data are not in date order to begin with.
This gives the following plot
which is much more like what I (and I suspect, you) would expect.
The sawtooth patterns you see are most likely also reporting artefacts: deaths that take place over the weekend (or on public holidays) are not reported until the following Monday (or next working day). This is certainly true in the UK.
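If you want to smooth that weekly reporting cycle, one common option (my suggestion, not part of the answer above; it assumes the zoo package is installed) is to plot a 7-day rolling mean instead of the raw daily rate:
library(zoo)

COVID_data %>%
  filter(countriesAndTerritories %in% top10) %>%
  mutate(
    deathRate = 100000 * deaths / popData2020,
    Date = lubridate::dmy(dateRep)
  ) %>%
  arrange(countriesAndTerritories, Date) %>%
  group_by(countriesAndTerritories) %>%
  filter(row_number() > 1) %>%
  mutate(deathRate7 = zoo::rollmean(deathRate, k = 7, fill = NA, align = "right")) %>%
  ggplot() +
  geom_line(aes(x = Date, y = deathRate7)) +
  facet_wrap(~countriesAndTerritories)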

Smoothing out missing values in R dataframe

I am using the dataset - https://data.ca.gov/dataset/covid-19-cases/resource/7e477adb-d7ab-4d4b-a198-dc4c6dc634c9 to look into covid cases and deaths in California.
As well as looking at cases/deaths by ethnicity, I have grouped the data to give total columns of cases and deaths per day. I also used the lag() function to compute daily case and death amounts.
However, on two days in December (the 23rd and 30th) no increment was made to the cases or deaths columns, so the daily cases and deaths read 0. The following day the data is 'caught up', with an extra-large amount added on that is clearly the sum of the two days. (I suspect Christmas and New Year are the causes.)
Is there a way of fixing this data, e.g. splitting the double-day measurement in half, populating the zero cells with that value, and then retrospectively altering the daily cases and daily deaths figures?
Hopefully the screenshots clarify what I mean.
Here is the code I have used:
demog_eth <- read.csv("./Data/case_demographics_ethnicity.csv", header = TRUE, sep = ",")
demog_eth$date <- as.Date(demog_eth$date)

# Create a DF with total daily information
total_stats <- data.frame(demog_eth$cases, demog_eth$deaths, demog_eth$date)
names(total_stats) <- c('cases', 'deaths', 'date')
total_stats <- total_stats %>% group_by(date) %>% summarise(cases = sum(cases), deaths = sum(deaths))

# Add daily cases and deaths by computing the daily difference in totals
## Comment - use lag to look at previous rows
total_stats <- total_stats %>%
  mutate(daily_cases = cases - lag(cases),
         daily_deaths = deaths - lag(deaths))
The top paragraph of text in the image says cases and deaths. It should say Daily Cases and Daily Deaths. Apologies
df <- data.frame(col = seq(1:100), col2 = seq(from = 1, to = 200, by = 2))
df[c(33, 2), ] <- 0

zeros <- as.integer(rownames(df[df$col == 0, ]))  # detect rows with 0

for (i in zeros) {
  df[i, "col"] <- 0.5 * df[i + 1, "col"]
  df[i + 1, "col"] <- 0.5 * df[i + 1, "col"]
}
Sorry that I used my own simple example data, but the mechanism should work if adapted.
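Adapting the same mechanism to the daily totals from the question might look like this (a sketch on my part, assuming the total_stats data frame built above and that each zero day is immediately followed by the catch-up day):
zero_days <- which(total_stats$daily_cases == 0 & total_stats$daily_deaths == 0)

for (i in zero_days) {
  if (i < nrow(total_stats)) {
    # split the catch-up day's figures evenly across the two days
    total_stats$daily_cases[i]      <- 0.5 * total_stats$daily_cases[i + 1]
    total_stats$daily_cases[i + 1]  <- 0.5 * total_stats$daily_cases[i + 1]
    total_stats$daily_deaths[i]     <- 0.5 * total_stats$daily_deaths[i + 1]
    total_stats$daily_deaths[i + 1] <- 0.5 * total_stats$daily_deaths[i + 1]
  }
}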

How to join tables and time align for cohort analysis in R

Background
I have 2 tables I wish to join in R to perform a cohort analysis. Table 1 is a large data table, with observations of hundreds of objects, and timestamps for each observation. Table 2 holds individual timestamped events associated with individual objects. The cohort analysis would be to align the observations from every object so that the adjusted time = 0 would be the point in time of the event for that object.
Table 1: object_id, ts, observation_type, reading
Table 2: object_id, ts, event_type, notes
The complication is that there will be periodic drift of the baseline due to seasonal effects. Therefore, for the analysis to be meaningful, the observations must be adjusted against the average of all objects.
The top-level plan using tidyr, dplyr & ggplot2
Create a timestamped seasonal baseline: the average of all readings for each observation type (there are 3).
Filter out the objects where the event occurs.
Adjust the filtered data set against the baseline.
Time-adjust the filtered dataset so the events are all coincident.
Plot the results.
Disclaimer
I've literally only just read R for Data Science, and have only ever written about 100 lines of code, so I'm a rank novice. Could a data scientist offer hints on the approach above to structure my analysis?
Pseudo-Code and Plan
To create the seasonal baseline with timestamp:
seasonal_average_reading1 <- table1 %>%
  filter(observation_type == "type1") %>%
  group_by(ts) %>%
  filter(n() > 10) %>%
  summarise(seasonal_average = mean(reading)) %>%
  select(ts, seasonal_average)
No clue where to start. First I need to search Table 2 for all instances of the event I'm interested in, and grab out the object_id's that are affected. Then I need to filter Table 1 for only the objects in that list! Is that do-able?
... %>%
mutate(adj = reading - seasonal_average)
For each object, the ts of the event will be taken, and used to adjust the timestamp of the observations. Repeat for all devices and somehow keep the data together? Help!
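A hedged sketch of those two steps with dplyr (the event label "Insulation" and the column names follow the data samples below and are only illustrative):
library(dplyr)

# objects that experienced the event of interest, with the event timestamp
event_times <- table2 %>%
  filter(event_type == "Insulation") %>%
  select(object_id, event_ts = ts)

# keep only those objects' observations and shift time so the event sits at 0
# (if an object has several such events, its rows will be duplicated per event)
aligned <- table1 %>%
  inner_join(event_times, by = "object_id") %>%
  mutate(adjusted_time = ts - event_ts)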
ggplot2
data = time-adjusted, filtered dataset.
aes(x = adjusted time, y = average of adjusted dataset)
What else do I need?
Data Samples
event
{
deviceId: '410DF5',
ts: 1604391197,
date: 1604368332,
jobs: [ 'Insulation' ],
notes: 'blah blah blah'
}
DATA
id,ts,seqNum,rssi,type,reading
410D08,1604482547,714,-118,hum,61
410D08,1604482547,714,-118,temp,23.2
410D08,1604482547,714,-118,hum,61
410D08,1604482547,714,-118,temp,23.2
410D08,1604482547,714,-118,hum,62
410D08,1604482547,714,-118,temp,23.2
2C7568,1604482585,2,-98,hum,68
2C7568,1604482585,2,-98,temp,22.3
2C7568,1604482585,2,-98,hum,68
2C7568,1604482585,2,-98,temp,22.2
2C7568,1604482585,2,-98,hum,68

Compute average over sliding time interval (7 days ago/later) in R

I've seen a lot of solutions for working with groups of times or dates, like aggregate() to sum daily observations into weekly observations, or other solutions to compute a moving average, but I haven't found a way to do what I want, which is to pluck relative dates out of data keyed by an additional variable.
I have daily sales data for a bunch of stores. So that is a data.frame with columns
store_id date sales
It's nearly complete, but there are some missing data points, and those missing data points are having a strong effect on our models (I suspect). So I used expand.grid to make sure we have a row for every store and every date, but at this point the sales data for those missing data points are NAs. I've found solutions like
dframe[is.na(dframe)] <- 0
or
dframe$sales[is.na(dframe$sales)] <- mean(dframe$sales, na.rm = TRUE)
but I'm not happy with the RHS of either of those. I want to replace missing sales data with our best estimate, and the best estimate of sales for a given store on a given date is the average of the sales 7 days prior and 7 days later. E.g. for Sunday the 8th, the average of Sunday the 1st and Sunday the 15th, because sales is significantly dependent on day of the week.
So I guess I can use
dframe$sales[is.na(dframe$sales)] <- my_func(dframe)
where my_func(dframe) replaces every store's missing sales data with the average of that store's sales 7 days prior and 7 days later (ignoring for the first go-around the situation where one of those data points is also missing), but I have no idea how to write my_func in an efficient way.
How do I match up the store_id and the dates 7 days prior and future without using a terribly inefficient for loop? Preferably using only base R packages.
Something like:
with(
  dframe,
  ave(sales, store_id, FUN = function(x) {
    naw <- which(is.na(x))
    x[naw] <- rowMeans(cbind(x[naw + 7], x[naw - 7]))
    x
  })
)
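A couple of assumptions behind that snippet (mine, not stated above): ave() processes each store's sales in row order, so the ±7 offsets only correspond to ±7 calendar days if the expanded data has exactly one row per store/date and is sorted by date within each store, and an NA within a store's first or last seven rows would produce out-of-range indices. A hedged variant that guards against both and writes the result back:
# sort so that an offset of 7 rows equals 7 calendar days within each store
dframe <- dframe[order(dframe$store_id, dframe$date), ]

dframe$sales <- with(
  dframe,
  ave(sales, store_id, FUN = function(x) {
    naw <- which(is.na(x))
    lo <- naw - 7
    hi <- naw + 7
    lo[lo < 1] <- NA_integer_          # no observation 7 days earlier
    hi[hi > length(x)] <- NA_integer_  # no observation 7 days later
    # if both neighbours are missing, the result stays missing (NaN)
    x[naw] <- rowMeans(cbind(x[lo], x[hi]), na.rm = TRUE)
    x
  })
)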
