Smoothing out missing values in an R dataframe

I am using the dataset https://data.ca.gov/dataset/covid-19-cases/resource/7e477adb-d7ab-4d4b-a198-dc4c6dc634c9 to look into COVID cases and deaths in California.
As well as looking at cases/deaths by ethnicity, I have grouped the data to give total columns of cases and deaths per day. I also used the lag function to compute daily case/death counts.
However, on two days in December (the 23rd and 30th) no increment was made to the cases or deaths columns, so the daily cases and deaths read 0. The following day the data is 'caught up', with an extra-large amount added on that is clearly the sum of the two days. (I suspect Christmas and New Year are the cause.)
Is there a way of fixing this data? For example, splitting the double-day measurement in half, populating the zero-day cells with that, and then retrospectively altering the daily cases and daily deaths figures?
Hopefully the screenshots will clarify what I mean.
Here is the code I have used:
library(dplyr)

demog_eth <- read.csv("./Data/case_demographics_ethnicity.csv", header = TRUE, sep = ",")
demog_eth$date <- as.Date(demog_eth$date)

# Create a DF with total daily information
total_stats <- data.frame(demog_eth$cases, demog_eth$deaths, demog_eth$date)
names(total_stats) <- c('cases', 'deaths', 'date')
total_stats <- total_stats %>%
  group_by(date) %>%
  summarise(cases = sum(cases), deaths = sum(deaths))

# Add daily cases and deaths by computing the daily difference in totals
## Comment - use lag to look at previous rows
total_stats <- total_stats %>%
  mutate(daily_cases = cases - lag(cases),
         daily_deaths = deaths - lag(deaths))
The top paragraph of text in the image says cases and deaths. It should say Daily Cases and Daily Deaths. Apologies

df <- data.frame(col = 1:100, col2 = seq(from = 1, to = 200, by = 2))
df[c(33, 2), ] <- 0

zeros <- as.integer(rownames(df[df$col == 0, ]))  # detect rows with 0
for (i in zeros) {
  # split the following row's "caught-up" value evenly across both rows
  df[i, "col"]     <- 0.5 * df[i + 1, "col"]
  df[i + 1, "col"] <- 0.5 * df[i + 1, "col"]
}
Sorry that I used my own simple example data, but the mechanism should work if adapted; see the sketch below.
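For instance, adapted to the total_stats frame built in the question, the same idea might look like this (an untested sketch against the real CSV; it assumes the cumulative cases/deaths columns and the daily_cases column computed with lag() above, and it recomputes the daily figures after splitting each zero day):

library(dplyr)

# Rows where no increment was reported (daily difference of 0), excluding the edges
zero_days <- which(total_stats$daily_cases == 0)
zero_days <- zero_days[zero_days > 1 & zero_days < nrow(total_stats)]

for (i in zero_days) {
  # Split the two-day increment evenly between the zero day and the catch-up day
  half_cases  <- (total_stats$cases[i + 1]  - total_stats$cases[i - 1])  / 2
  half_deaths <- (total_stats$deaths[i + 1] - total_stats$deaths[i - 1]) / 2
  total_stats$cases[i]  <- total_stats$cases[i - 1]  + half_cases
  total_stats$deaths[i] <- total_stats$deaths[i - 1] + half_deaths
}

# Recompute the daily figures from the adjusted cumulative totals
total_stats <- total_stats %>%
  mutate(daily_cases  = cases - lag(cases),
         daily_deaths = deaths - lag(deaths))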

Related

R - Draw cases per 100k population

I am trying to draw a line of COVID cases for each date. I do not have the expected output; the lecturer only gave the questions. I have solved the question, but my problem is the output: it looks weird. Here is the question:
"For the ten countries with the highest absolute number of total deaths, draw the following line graphs to visualize whether the epidemic has started to slow down and how the growth rate of new cases/deaths differs across those countries.
a) Number of new cases at each date (absolute number vs per 100.000 population)"
Here is my code:
library(utils)
library(dplyr)
library(ggplot2)

COVID_data <- read.csv("https://opendata.ecdc.europa.eu/covid19/nationalcasedeath_eueea_daily_ei/csv",
                       na.strings = "", fileEncoding = "UTF-8-BOM")

# Finding the ten countries with the highest absolute total deaths
abs_total_deaths <- COVID_data %>%
  group_by(countriesAndTerritories) %>%
  summarise(abs_total_deaths = sum(deaths)) %>%
  arrange(desc(abs_total_deaths))

abs_ten_total_deaths <- c('Italy', 'France', 'Germany', 'Spain', 'Poland',
                          'Romania', 'Czechia', 'Hungary', 'Belgium', 'Bulgaria')

# Calculate new cases by dividing the absolute number by 100.000
# Draw a line for each country
COVID_data %>%
  filter(countriesAndTerritories %in% abs_ten_total_deaths) %>%
  filter(cases > 0) %>%
  mutate(new_cases = cases / 100000) %>%
  ungroup() %>%
  ggplot() +
  geom_line(aes(x = dateRep, y = new_cases, color = countriesAndTerritories), size = 1) +
  labs(x = "Date",
       y = "New Cases",
       title = "New Cases per 100.000 population") +
  facet_wrap(~countriesAndTerritories) +
  theme_bw()
I will also add a picture of my output. I think my graph is not correct, because the output looks really weird, and I can't see where I have made a mistake. Any help would be appreciated.
Here is the output:
Looking at Belgium, I get total deaths = 25051 from your data file, which tallies exactly with the data here.
It's obvious that the highest value (by far) for every country occurs "on" the earliest date for the country in the file. Amongst your top ten (I agree with your selection), this is 01Mar2021 for every country apart from Spain, and 28Feb2021 for Spain.
These two facts lead me to conclude that (1) your graphs correctly display the data you have asked them to summarise, and (2) you have a data artefact: the first record for each country contains the cumulative total to date, whereas subsequent dates contain data reported "in the previous 24 hours". I use quotes because different countries have different reporting conventions. For example, in the UK (since August 2020), "COVID-related deaths" are deaths from any cause within 28 days of a positive COVID test. (citation)
Therefore, to get meaningful graphs, I think your only option is to discard the cumulative data contained in the first record for each country. Here's how I would do that:
library(utils)
library(tidyverse)

COVID_data <- read.csv("https://opendata.ecdc.europa.eu/covid19/nationalcasedeath_eueea_daily_ei/csv",
                       na.strings = "", fileEncoding = "UTF-8-BOM")

# For better printing
COVID_data <- as_tibble(COVID_data)

# Which countries have the highest absolute death toll?
# [I get the same countries as you do.]
top10 <- COVID_data %>%
  group_by(countriesAndTerritories) %>%
  summarise(TotalDeaths = sum(deaths)) %>%
  slice_max(TotalDeaths, n = 10) %>%
  distinct(countriesAndTerritories) %>%
  pull(countriesAndTerritories)

COVID_data %>%
  filter(countriesAndTerritories %in% top10) %>%
  mutate(
    deathRate = 100000 * deaths / popData2020,
    caseRate = 100000 * cases / popData2020,
    Date = lubridate::dmy(dateRep)
  ) %>%
  arrange(countriesAndTerritories, Date) %>%
  group_by(countriesAndTerritories) %>%
  filter(row_number() > 1) %>%
  ggplot() +
  geom_line(aes(x = Date, y = deathRate)) +
  facet_wrap(~countriesAndTerritories)
The critical part that excludes the first data row for each country is
arrange(countriesAndTerritories, Date) %>%
group_by(countriesAndTerritories) %>%
filter(row_number() > 1) %>%
The call to arrange is necessary because the data are not in date order to begin with.
This gives the following plot, which is much more like what I (and I suspect, you) would expect.
The sawtooth patterns you see are most likely also reporting artefacts: deaths that take place over the weekend (or on public holidays) are not reported until the following Monday (or next working day). This is certainly true in the UK.
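If you wanted to smooth out that weekly reporting cycle for presentation, a common option is a trailing 7-day rolling mean. A minimal sketch, assuming the zoo package (which is not used in the answer above) on top of the pipeline already shown:

library(zoo)

COVID_data %>%
  filter(countriesAndTerritories %in% top10) %>%
  mutate(
    deathRate = 100000 * deaths / popData2020,
    Date = lubridate::dmy(dateRep)
  ) %>%
  arrange(countriesAndTerritories, Date) %>%
  group_by(countriesAndTerritories) %>%
  filter(row_number() > 1) %>%
  # trailing 7-day average; the first six days for each country are left NA
  mutate(deathRate7 = rollmean(deathRate, k = 7, fill = NA, align = "right")) %>%
  ggplot() +
  geom_line(aes(x = Date, y = deathRate7)) +
  facet_wrap(~countriesAndTerritories)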

R calculating time differences in a (layered) long dataset

I've been struggling with a bit of timestamp data (haven't had to work with dates much until now, and it shows). Hope you can help out.
I'm working with data from a website showing for each customer (ID) their respective visits and the timestamp for those visits. It's grouped in the sense that one customer might have multiple visits/timestamps.
The df is structured as follows, in a long format:
df <- data.frame("Customer" = c(1, 1, 1, 2, 3, 3),
"Visit" =c(1, 2, 3, 1, 1, 2), # e.g. customer ID #1 has visited the site three times.
"Timestamp" = c("2019-12-31 12:13:25", "2019-12-31 16:13:25", "2020-01-05 10:13:25", "2019-11-12 15:18:42", "2019-11-13 19:22:35", "2019-12-10 19:43:55"))
Note: in the real dataset the timestamp isn't a factor but some other haggard character-type abomination, which I should probably first try to convert into POSIXct format somehow.
What I would like to do here is create a df that displays, per customer, their average time between visits (let's say in minutes, or hours). Visitors with only a single visit (e.g. the second customer in my example) could be filtered out in advance or should display a 0. My final goal is to visualize that distribution, and possibly calculate a grand mean across all customers.
Because the number of visits can vary drastically (e.g. one or 256 visits), I can't just use a 'wide' version of the dataset where a fixed number of visits are the columns, which I could then subtract and average.
I'm at a bit of a loss how best to approach this type of problem. Thanks a bunch!
Using dplyr:
library(dplyr)

df %>%
  mutate(Timestamp = as.POSIXct(Timestamp)) %>%  # convert the character timestamps first
  arrange(Customer, Timestamp) %>%
  group_by(Customer) %>%
  mutate(Difference = Timestamp - lag(Timestamp)) %>%
  summarise(Avg_Interval = mean(Difference, na.rm = TRUE))
Due to the grouping, the first value of Difference for any customer will be NA (including those with only one visit), so these rows are dropped when taking the mean with na.rm = TRUE.
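One small caveat: Timestamp - lag(Timestamp) lets difftime pick its own units based on the size of the gaps. Since the question asks for minutes or hours, a variation that fixes the units explicitly may be safer (a sketch, assuming the same df and dplyr as above):

df %>%
  mutate(Timestamp = as.POSIXct(Timestamp)) %>%
  arrange(Customer, Timestamp) %>%
  group_by(Customer) %>%
  mutate(Difference = as.numeric(difftime(Timestamp, lag(Timestamp), units = "hours"))) %>%
  summarise(Avg_Interval_hours = mean(Difference, na.rm = TRUE))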
Using base R (no extra packages):
1. Sort the data, ordering by customer ID, then by timestamp.
2. Calculate the time difference between consecutive rows (using the diff() function), grouping by customer ID (tapply() does the grouping).
3. Find the average.
4. Squish that into a data.frame.
# 1 sort the data
df$Timestamp <- as.POSIXct(df$Timestamp)
# not debugged
df <- df[order(df$Customer, df$Timestamp),]
# 2 apply a diff.
# if you want to force the time units to seconds, convert
# the timestamp to numeric first.
# without conversion
diffs <- tapply(df$Timestamp, df$Customer, diff)
# ======OR======
# convert to seconds
diffs <- tapply(as.numeric(df$Timestamp), df$Customer, diff)
# 3 find the averages
diffs.mean <- lapply(diffs, mean)
# 4 squish that into a data.frame
diffs.df <- data.frame(do.call(rbind, diffs.mean))
diffs.df$Customer <- names(diffs.mean)
# 4a tidy up the data.frame names
names(diffs.df)[1] <- "Avg_Interval"
diffs.df
You haven't shown your timestamp strings, but when you need to wrangle them, the lubridate package is your friend.
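For example (a purely hypothetical format for illustration; your real strings may look different), if the raw timestamps came in as day/month/year strings, lubridate could parse them like this:

library(lubridate)

raw <- c("31/12/2019 12:13:25", "05/01/2020 10:13:25")  # made-up example strings

dmy_hms(raw, tz = "UTC")
# parse_date_time() copes with mixed or unusual formats:
parse_date_time(raw, orders = "dmy HMS", tz = "UTC")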

Simulate a series of code n (let's say 1000) times while saving the result in a vector in R

I'm still relatively new to R, so I'm struggling with repeating lines of code several times and saving the result for each repetition.
The aim is to randomly (with equal probability) assign a number of events, in my case 100, over a 20-year period. Since days are irrelevant, I use the number of months to define the period. Subsequently, I count the events within every 24-month period inside the 20 years. Lastly, I extract the maximum number of events occurring within any 24-month period.
Albeit messy and probably inefficient, the code works for the intended purpose. However, I want to repeat this process 1000 times to get a distribution of the maximum number of events taking place over 24 months, to compare to my real data.
Here is my code so far:
library(runner)
library(dplyr)

# First I set the period from the year 2000 to 2019 with one-month increments.
period <- seq(as.Date("2000/1/1"), by = "month", length.out = 240)

# I sample random observations assigned to different months over the entire period.
u <- sample(period, size = 100, replace = TRUE)

# Make a table in order to register the number of occurrences within each month.
u <- table(u)

# Create a data frame to ease information processing.
simulation <- data.frame(u)

# Change the date column to date format.
simulation$u <- as.Date(simulation$u)

# Compute the number of events taking place within every 24-month period (730 = days in 24 months).
u <- simulation %>%
  mutate(
    Last_24_month_total = sum_run(
      x = simulation$Freq,
      k = 730,
      idx = as.Date(simulation$u, format = "%d/%m/%Y"))
  )

# Extract the maximum number of occurrences within a 24-month period.
max <- max(u$Last_24_month_total)
Could someone help me understand how to rewrite this process in order to facilitate a thousand repetitions, while saving the max value for each repetition?
Thanks!
As #jogo suggested in the comments, you can use replicate.
I simplified your code.
library(runner)
library(dplyr)

seq_dates <- seq(as.Date("2000/1/1"), by = "month", length.out = 240)

replicate(100,
          seq_dates %>%
            sample(100, replace = TRUE) %>%
            table() %>%
            sum_run(730, idx = as.Date(names(.))) %>%
            max)
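To match the 1000 repetitions from the question and keep the results, one might simply assign the output (replicate() returns a vector of maxima here) and then look at the distribution; a follow-up sketch under the same assumptions:

max_events <- replicate(1000,
                        seq_dates %>%
                          sample(100, replace = TRUE) %>%
                          table() %>%
                          sum_run(730, idx = as.Date(names(.))) %>%
                          max)

length(max_events)  # 1000 maxima, one per simulated 20-year period
hist(max_events, main = "Max events in any 24-month window", xlab = "Events")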

Create efficient week over week calculation with subsetting

In my working dataset, I'm trying to calculate week-over-week values for the changes in wholesale and revenue. The code seems to work, but my estimates show it will take about 75 hours to run what is a seemingly simple calculation. Below is the generic reproducible version, which takes about 2 minutes to run on this smaller dataset:
########################################################################################################################
# MAKE A GENERIC REPRODUCIBLE STACK OVERFLOW QUESTION
########################################################################################################################
# Create a data frame of 26,000 observations similar to my data, but populated with noise
exampleData <- data.frame(product = rep(LETTERS, 1000),
                          wholesale = rnorm(1000 * 26),
                          revenue = rnorm(1000 * 26))

# create a week_ending column which increases by one week with every set of 26 "products"
for (i in 1:nrow(exampleData)) {
  exampleData$week_ending[i] <- as.Date("2016-09-04") + 7 * floor((i - 1) / 26)
}
exampleData$week_ending <- as.Date(exampleData$week_ending, origin = "1970-01-01")

# create empty columns to fill
exampleData$wholesale_wow <- NA
exampleData$revenue_wow <- NA

# loop through the wholesale and revenue numbers and append the week-over-week changes
for (i in 1:nrow(exampleData)) {
  # only append the week-over-week values if it's not the first week
  if (exampleData$week_ending[i] != "2016-09-04") {
    # set temporary values for the current and past week's wholesale value
    currentWholesale <- exampleData$wholesale[i]
    lastWeekWholesale <- exampleData$wholesale[which(exampleData$product == exampleData$product[i] &
                                                       exampleData$week_ending == exampleData$week_ending[i] - 7)]
    exampleData$wholesale_wow[i] <- currentWholesale / lastWeekWholesale - 1

    # set temporary values for the current and past week's revenue
    currentRevenue <- exampleData$revenue[i]
    lastWeekRevenue <- exampleData$revenue[which(exampleData$product == exampleData$product[i] &
                                                   exampleData$week_ending == exampleData$week_ending[i] - 7)]
    exampleData$revenue_wow[i] <- currentRevenue / lastWeekRevenue - 1
  }
}
Any help understanding why this takes so long or how to cut down the time would be much appreciated!
The first for loop can be replaced with the following vectorized line:
exampleData$week_ending2 <- as.Date("2016-09-04") + 7 * floor((seq_len(nrow(exampleData)) - 1) / 26)
setequal(exampleData$week_ending, exampleData$week_ending2)
[1] TRUE
Replacing the second for loop with data.table:
library(data.table)
dt1 <- as.data.table(exampleData)
dt1[, wholesale_wow := wholesale / shift(wholesale) - 1 , by = product]
dt1[, revenue_wow := revenue / shift(revenue) - 1 , by = product]
setequal(exampleData, dt1)
[1] TRUE
This takes about 4 milliseconds to run on my laptop
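If you want to check the timing on your own machine, base R's system.time() is enough; this just re-runs the two data.table assignments shown above:

system.time({
  dt1[, wholesale_wow := wholesale / shift(wholesale) - 1, by = product]
  dt1[, revenue_wow := revenue / shift(revenue) - 1, by = product]
})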
Here is a vectorized solution using the tidyr package.
set.seed(123)
# Create a data frame of 26,000 observations similar to my data, but populated with noise
exampleData <- data.frame(product = rep(LETTERS, 1000),
                          wholesale = rnorm(1000 * 26),
                          revenue = rnorm(1000 * 26))

# create a week_ending column which increases by one week with every set of 26 "products"
# vectorize the creation of the dates
i <- 1:nrow(exampleData)
exampleData$week_ending <- as.Date("2016-09-04") + 7 * floor((i - 1) / 26)
exampleData$week_ending <- as.Date(exampleData$week_ending, origin = "1970-01-01")

# create empty columns to fill
exampleData$wholesale_wow <- NA
exampleData$revenue_wow <- NA

# find the index of the rows of interest (i.e. removing the first week)
i <- i[exampleData$week_ending != "2016-09-04"]

library(tidyr)
# create temp variables and convert into wide format:
# the rows are products and the columns are the ending weeks
Wholesale <- exampleData[, c(1, 2, 4)]
Wholesale <- spread(Wholesale, week_ending, wholesale)
Revenue <- exampleData[, c(1, 3, 4)]
Revenue <- spread(Revenue, week_ending, revenue)

# number of columns
numCol <- ncol(Wholesale)

# remove the first two columns for the current week's wholesale
# remove the first and last columns for last week's wholesale
# perform the calculation on every element in the data frame (divide this week / last week)
Wholesale_wow <- Wholesale[, -c(1, 2)] / Wholesale[, -c(1, numCol)] - 1
# convert back to long format
Wholesale_wow <- gather(Wholesale_wow)

# repeat for revenue
Revenue_wow <- Revenue[, -c(1, 2)] / Revenue[, -c(1, numCol)] - 1
# convert back to long format
Revenue_wow <- gather(Revenue_wow)

# assemble the calculated values back into the original data frame
exampleData$wholesale_wow[i] <- Wholesale_wow$value
exampleData$revenue_wow[i] <- Revenue_wow$value
The strategy was to convert the original data into a wide format where the rows are the product IDs and the columns are the weeks, divide the two offset data frames by each other, convert back into a long format, and add the newly calculated values to the exampleData data frame. This works; it is not very clean, but it is very much faster than the loop. The dplyr package is another tool for this type of work.
To compare the results of this code with your test case, use:
print(identical(goldendata, exampleData))
where goldendata is your known-good result; be sure to use the same random numbers via the set.seed() function.
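Since the answer above mentions dplyr as another tool for this kind of work, here is a rough sketch of the equivalent lag()-based approach (my addition, not part of the original answer; it assumes the same exampleData and that rows are ordered by week within each product):

library(dplyr)

exampleData <- exampleData %>%
  group_by(product) %>%
  arrange(week_ending, .by_group = TRUE) %>%
  mutate(wholesale_wow = wholesale / lag(wholesale) - 1,
         revenue_wow   = revenue / lag(revenue) - 1) %>%
  ungroup()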

calculating seasonal range in r for a number of years

I have a data frame of daily temperature measurements spanning 20 years. I would like to calculate the annual range in the data series for each year (i.e. end up with 20 values, representing the range for each year). Example data:
begin_date = as.POSIXlt("1990-01-01", tz = "GMT")
dat = data.frame(dt = begin_date + (0:(20*365)) * (86400))
dat = within(dat, {speed = runif(length(dt), 1, 10)})
I was thinking of writing a loop which goes through each year and then calculates the range, but was hoping there was another solution.
I think the best way forward would be to have the maximum and minimum values for each year and then calculate the range from that. Can anyone suggest a method to do this without writing a loop to go through each year individually?
Try
library(dplyr)
library(lubridate)  # for year()

dat %>%
  group_by(year = year(dt)) %>%
  summarise(Range = diff(range(speed)))
Or
library(data.table)
setDT(dat)[, list(Range=diff(range(speed))), year(dt)]
Or
aggregate(speed~cbind(year=year(dt)), dat, function(x) diff(range(x)))
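If you would rather avoid the year() helper (which comes from lubridate or data.table in the snippets above), a plain base-R sketch of the same calculation is:

# group by calendar year extracted with format(), then take the range
tapply(dat$speed, format(dat$dt, "%Y"), function(x) diff(range(x)))

# or, as a data.frame
aggregate(speed ~ year, data = transform(dat, year = format(dt, "%Y")),
          FUN = function(x) diff(range(x)))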
