Grouping daily observations from specific time intervals - r

Context: I have a survey dataset with daily observations in a 6-7 day period each month for about a year. Observations include party choice and trust in government (Likert-scale).
Problem: the N is too small for observations each day, so I need to group the daily observations from each period. How?
I've tried the following (using lubridate), but that supposes each period of observations begins at the start of the week.
df <- df %>%
group_by(date_week = floor_date(date_variable, "week"))
Unfortunately, this is a mess: it groups all observations that fall in the same calendar week, but some survey periods cross weeks (e.g. Thursday to Wednesday), so R splits them into two periods of observations.
I need to solve this problem and then visualize the result (I'm using ggplot). So the new date variable needs to be of class Date, and ideally each period would be plotted at its median day.
Example of data
Date Party N Trust-in-gov-average
"2021-10-02" A 25 1.5
"2021-10-02" B 10 2.5
"2021-10-02" C 15 3.8
"2021-10-03" A 12 1.2
"2021-10-03" B 53 3.2
"2021-10-03" C 23 2.8
"2021-10-04" A 58 1.6
"2021-10-04" B 33 2.6
"2021-10-04" C 44 3.0

After many sleepless nights (in part thanks to New Year's Eve) I finally found a solution to my problem. It's all about combining lubridate and dplyr.
First load the packages and convert the variable to date format.
library(dplyr)
library(lubridate)
df$date_string <- ymd(df$date_string)
Then use mutate and the %within% operator to extract the periods. Name each period after the date you want to represent it, e.g. the first day of observation.
df <- df %>%
  mutate(waves = case_when(
    date_string %within% interval(ymd("2020-09-13"), ymd("2020-09-19")) ~ "2020-09-13",
    date_string %within% interval(ymd("2020-09-20"), ymd("2020-10-03")) ~ "2020-09-20",
    date_string %within% interval(ymd("2020-10-11"), ymd("2020-10-17")) ~ "2020-10-11",
    date_string %within% interval(ymd("2020-10-25"), ymd("2020-10-31")) ~ "2020-10-25"))
Finally, convert the new variable back to a date variable using ymd() again.
df$waves <- ymd(df$waves)
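If listing every interval by hand becomes tedious, the waves can also be derived from the gaps between observation days. This is only a sketch in base R, not the method used above: the example data, the max_gap threshold, and the names wave_id and wave_date are assumptions made for illustration. It also gives each wave its median observation day, as asked for in the question.

```r
# Hypothetical example data: two survey periods separated by a long gap
df <- data.frame(
  date_variable = as.Date(c("2021-10-02", "2021-10-03", "2021-10-04",
                            "2021-11-04", "2021-11-05", "2021-11-06",
                            "2021-11-08")),
  trust = c(1.5, 1.2, 1.6, 2.0, 2.4, 2.2, 2.1)
)

# Sort by date so that consecutive rows can be compared
df <- df[order(df$date_variable), ]

# Assumption: a gap of more than max_gap days starts a new wave
max_gap <- 3
gap <- c(Inf, diff(as.numeric(df$date_variable)))
df$wave_id <- cumsum(gap > max_gap)

# Median observation day per wave (rounded down to a whole day), as a Date
wave_mid <- tapply(as.numeric(df$date_variable), df$wave_id, median)
df$wave_date <- as.Date(floor(as.numeric(wave_mid[as.character(df$wave_id)])),
                        origin = "1970-01-01")
```

wave_date is of class Date, so it can be used directly on a ggplot x axis. The median is taken over rows, so days with more rows weigh more; take the unique days first if that is not wanted.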

Related

Plot data over time in R

I'm working with a data frame that has the columns 'timestamps' and 'amount'. The data can be produced like this:
sample_size <- 40
start_date = as.POSIXct("2020-01-01 00:00")
end_date = as.POSIXct("2020-01-03 00:00")
timestamps <- as.POSIXct(sample(seq(start_date, end_date, by=60), sample_size))
amount <- rpois(sample_size, 5)
df <- data.frame(timestamps=timestamps, amount=amount)
Now I'd like to plot the sum of the amount entries for some timeframe (like every hour, 30 min, 20 min). The final plot would look like a histogram of the timestamps but should not just count how many timestamps fell into the timeframe, but what amount fell into the timeframe.
How can I approach this? I could create an extra vector with the amount of each timeframe, but don't know how to proceed.
Also, I'd like to add a feature to reduce by hour, such that just one day is plotted (notice the range between start_date and end_date is two days) and, in each timeframe (let's say every hour), the amount from all days that falls within that hour is plotted. In this case the data
2020-01-01 13:03:00 5
2020-01-02 13:21:00 10
2020-01-02 13:38:00 1
2020-01-01 13:14:00 3
would produce a bar of height sum(5, 10, 1, 3) = 19 in the timeframe 13:00-14:00. How can I implement the plotting to easily switch between these two modes (plot days/plot just one day and reduce)?
EDIT: Following the advice of @Gregor Thomas I added a grouping column like this:
df$time_group <- lubridate::floor_date(df$timestamps, unit="20 minutes")
Now I'm wondering how to ignore the dates and thus reduce by 20 minute frame (independent of date).
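One possible way to ignore the date is to bin on the minute of the day instead of on the full timestamp. The following is a base R sketch under stated assumptions: the four rows are the example rows from the question, the time zone is fixed to UTC to keep the result reproducible, and bin_minutes, bin_start and label are names introduced here.

```r
# The four example rows from the question, with an explicit time zone
df <- data.frame(
  timestamps = as.POSIXct(c("2020-01-01 13:03:00", "2020-01-02 13:21:00",
                            "2020-01-02 13:38:00", "2020-01-01 13:14:00"),
                          tz = "UTC"),
  amount = c(5, 10, 1, 3)
)

bin_minutes <- 20  # assumption: bin width in minutes (use 60 for hourly bins)

# Minute of the day (0..1439), which drops the date part entirely
lt <- as.POSIXlt(df$timestamps, tz = "UTC")
minute_of_day <- lt$hour * 60 + lt$min

# Start of the bin each row falls into, in minutes after midnight
df$bin_start <- (minute_of_day %/% bin_minutes) * bin_minutes

# Sum the amounts per bin, regardless of the day
by_time_of_day <- aggregate(amount ~ bin_start, data = df, FUN = sum)

# Readable "HH:MM" label for the x axis
by_time_of_day$label <- sprintf("%02d:%02d",
                                by_time_of_day$bin_start %/% 60,
                                by_time_of_day$bin_start %% 60)
```

With bin_minutes set to 60, all four rows land in the 13:00 bin and sum to 19, as in the question. The aggregated data frame can then be drawn as bars, for example with ggplot2's geom_col. To plot across days instead, aggregate on the time_group column from the edit above rather than on bin_start.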

Using scale_x_date in ggplot2 with different columns

Say I have the following data:
Date Month Year Miles Activity
3/1/2014 3 2014 72 Walking
3/1/2014 3 2014 85 Running
3/2/2014 3 2014 42 Running
4/1/2014 4 2014 65 Biking
1/1/2015 1 2015 21 Walking
1/2/2015 1 2015 32 Running
I want to make graphs that display the sum of miles for each month, grouped and colored by year. I know that I can make a separate data frame with the sum of the miles per month per activity, but the issue is in displaying it. Here in Excel is basically what I want: the sums displayed chronologically and colored by activity.
I know ggplot2 has a scale_x_date command, but I run into issues on "both sides" of the problem--if I use the Date column as my X variable, they're not summed. But if I sum my data how I want it in a separate data frame (i.e., where every activity for every month has just one row), I can't use both Month and Year as my x-axis--at least, not in any way that I can get scale_x_date to understand.
(And, I know, if Excel is graphing it correctly why not just use Excel--unfortunately, my data is so large that Excel was running very slowly and it's not feasible to keep using it.) Any ideas?
The code below worked fine for me with the small dataset. If you convert your data.frame to a data.table, you can sum the miles per activity and month with just a couple of preprocessing steps. I've left some comments in the code to give you an idea of what's going on, but it should be pretty self-explanatory.
# Assuming your dataframe looks like this
df <- data.frame(Date = c('3/1/2014','3/1/2014','4/2/2014','5/1/2014','5/1/2014','6/1/2014','6/1/2014'), Miles = c(72,14,131,534,123,43,56), Activity = c('Walking','Walking','Biking','Running','Running','Running', 'Biking'))
# Load lubridate, data.table and ggplot2
library(lubridate)
library(data.table)
library(ggplot2)
# Convert dataframe to a data.table
setDT(df)
df[, Date := as.Date(Date, format = '%m/%d/%Y')] # Convert data to a column of Class Date -- check with class(df[, Date]) if you are unsure
df[, Date := floor_date(Date, unit = 'month')] # Reduce all dates to the first day of the month for summing later on
# Create ggplot object using data.tables functionality to sum the miles
ggplot(df[, sum(Miles), by = .(Date, Activity)], aes(x = Date, y = V1, colour = factor(Activity))) + # data.table names the summed column V1
  geom_line() +
  scale_x_date(date_labels = '%b-%y') # %b displays the abbreviated month name

R - Merging data of different frequencies

I have two dataframes that I am trying to merge. One is daily data with days missing (but at least one observation for each month). The other is monthly data (with no months missing). They both span the same time frame.
I would like to merge the data by month (i.e. the month-year of the daily data corresponding with the month-year of the monthly data), keeping the higher frequency.
df1 = daily data (unequal frequency ... i.e. missing days)
df2 = monthly data (equal frequency)
merge(df1, df2) ???
df1.date df1.x df2.y
1/1/2005 5.5 10
1/2/2005 5.9 10
1/5/2005 6.5 10
...
11/2/2005 2.5 12
11/4/2005 3.9 12
11/6/2005 1.3 12
...
Is there any way to do this in R? (I have been struggling with zoo and ts and haven't found anything even close ... hence this post).
Thank you @user1945827 & @Gregor
I was making a mountain out of a molehill.
As you suggested, all that I needed to do was to create a common index for both datasets to merge on:
lo$monthyear <- format(lo$ListingCreationDate, format='%B-%Y')
ue$monthyear <- format(ue$Month, format='%B-%Y')
lonew <- data.frame(merge(lo, ue, by="monthyear"))
I am posting this as an answer because I literally spent hours using different packages trying to accomplish something that a one sentence answer resolved. Hopefully it will be useful to someone else.
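For readers who want to try the idea on a small case, here is a sketch using the example values from the question. The data frame and column names (daily, monthly, month) are made up for illustration. The key uses '%Y-%m' rather than '%B-%Y', which is an assumption on my part: it does not depend on the locale's month names and it sorts chronologically.

```r
# Daily data with missing days (values taken from the question's example)
daily <- data.frame(
  date = as.Date(c("2005-01-01", "2005-01-02", "2005-01-05",
                   "2005-11-02", "2005-11-04", "2005-11-06")),
  x = c(5.5, 5.9, 6.5, 2.5, 3.9, 1.3)
)

# Monthly data, one row per month (only two months shown here)
monthly <- data.frame(
  month = as.Date(c("2005-01-01", "2005-11-01")),
  y = c(10, 12)
)

# Common year-month key in both data frames
daily$monthyear <- format(daily$date, "%Y-%m")
monthly$monthyear <- format(monthly$month, "%Y-%m")

# Merge on the key; every daily row picks up the value of its month
merged <- merge(daily, monthly[, c("monthyear", "y")], by = "monthyear")
merged <- merged[order(merged$date), ]
```

The result keeps the daily frequency: each of the six daily rows appears once, with the monthly y repeated within a month.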

Using dplyr::mutate between two dataframes to create column based on date range

Right now I have two dataframes. One contains over 11 million rows of a start date, end date, and other variables. The second dataframe contains daily values for heating degree days (basically a temperature measure).
set.seed(1)
library(lubridate)
date.range <- ymd(paste(2008,3,1:31,sep="-"))
daily <- data.frame(date=date.range,value=runif(31,min=0,max=45))
intervals <- data.frame(start=daily$date[1:5],end=daily$date[c(6,9,15,24,31)])
In reality my daily dataframe has every day for 9 years and my intervals dataframe has entries that span over arbitrary dates in this time period. What I wanted to do was to add a column to my intervals dataframe called nhdd that summed over the values in daily corresponding to that time interval (end exclusive).
For example, in this case the first entry of this new column would be
sum(daily$value[1:5])
and the second would be
sum(daily$value[2:8]) and so on.
I tried using the following code
intervals <- mutate(intervals,nhdd=sum(filter(daily,date>=start&date<end)$value))
This is not working and I think it might have something to do with not referencing the columns correctly but I'm not sure where to go.
I'd really like to use dplyr to solve this and not a loop because 11 million rows will take long enough using dplyr. I tried using more of lubridate but dplyr doesn't seem to support the Period class.
Edit: I'm actually using dates from as.Date now instead of lubridate, but the basic question of how to refer to a different dataframe from within mutate still stands.
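library(dplyr)
# Note: .Machine$double.eps is far smaller than the spacing between representable
# doubles near a Date's numeric value, so end - eps equals end and between()
# still includes the end date. The sums printed below therefore include the value at `end`.
# For an end-exclusive sum, use daily$value[daily$date >= start & daily$date < end] instead.
eps <- .Machine$double.eps
intervals %>%
  rowwise() %>%
  mutate(nhdd = sum(daily$value[between(daily$date, start, end - eps)]))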
# start end nhdd
#1 2008-03-01 2008-03-06 144.8444
#2 2008-03-02 2008-03-09 233.4530
#3 2008-03-03 2008-03-15 319.5452
#4 2008-03-04 2008-03-24 531.7620
#5 2008-03-05 2008-03-31 614.2481
If you find the dplyr solution a bit slow (basically due to rowwise), you might want to use data.table for pure speed:
library(data.table)
setkey(setDT(intervals), start, end)
setDT(daily)[, date1 := date]
foverlaps(daily, by.x = c("date", "date1"), intervals)[, sum(value), by=c("start", "end")]
# start end V1
#1: 2008-03-01 2008-03-06 144.8444
#2: 2008-03-02 2008-03-09 233.4530
#3: 2008-03-03 2008-03-15 319.5452
#4: 2008-03-04 2008-03-24 531.7620
#5: 2008-03-05 2008-03-31 614.2481
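Another option that avoids going row by row is to work with cumulative sums. This is a sketch, not part of the answers above, and it relies on assumptions: daily has exactly one row per day, is sorted by date, and every start and end date occurs in daily$date. The names cs, i_start and i_end are introduced here. It sums end-exclusive, as described in the question.

```r
set.seed(1)
date.range <- as.Date(paste(2008, 3, 1:31, sep = "-"))
daily <- data.frame(date = date.range, value = runif(31, min = 0, max = 45))
intervals <- data.frame(start = daily$date[1:5],
                        end = daily$date[c(6, 9, 15, 24, 31)])

# cs[i] is the sum of all daily values strictly before daily$date[i]
cs <- c(0, cumsum(daily$value))

# Row positions of each interval's start and end date in daily
i_start <- match(intervals$start, daily$date)
i_end <- match(intervals$end, daily$date)

# Sum over start <= date < end, computed for all intervals at once
intervals$nhdd <- cs[i_end] - cs[i_start]
```

Because everything is vectorised, the cost is one cumsum over daily plus two lookups per interval, which should scale to millions of intervals.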

Creating a time series from a dataset including missing values

I need to create a time series from a data frame. The problem is that the observations are not regularly spaced. The data frame looks like this:
Cases Date
15 1/2009
30 3/2010
45 12/2013
I have 60 observations like that. As you can see, the data were collected irregularly, starting in 1/2008 and ending in 12/2013 (cases are missing for most of the months in between). I will assume there were no cases in those months. So, how can I convert this dataset into a time series? I will then try to predict the possible number of cases in the future.
Try installing the plyr library,
install.packages("plyr")
and then to sum duplicated Date2 rows:
library(plyr)
mergedData <- ddply(dat, .(Date2), .fun = function(x) {
data.frame(Cases = sum(x$Cases))
})
> head(mergedData)
Date2 Cases
1 2008-01-01 16352
2 2008-11-01 10
3 2009-01-01 23
4 2009-02-01 138
5 2009-04-01 18
6 2009-06-01 3534
You can create a separate, complete sequence of dates and merge it with the data. This will create a complete time series with missing values as NA.
If df is your data frame with Date as a column of class Date, then create a new data frame ts and merge as below.
ts <- data.frame(Date = seq(as.Date("2008-01-01"), as.Date("2013-12-31"), by="1 month"))
dfwithmissing <- merge(ts, df, by="Date", all=TRUE)
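Putting the steps together for the data in the question, a minimal sketch could look like this. It assumes the Date column holds strings such as "1/2009" (month/year) and that months without a row should count as zero cases, as stated in the question. The names all_months, full and cases_ts are made up here.

```r
# Example rows from the question
df <- data.frame(Cases = c(15, 30, 45),
                 Date = c("1/2009", "3/2010", "12/2013"))

# Turn "month/year" strings into Dates on the first day of the month
df$Date <- as.Date(paste0("1/", df$Date), format = "%d/%m/%Y")

# Complete monthly sequence covering 1/2008 to 12/2013
all_months <- data.frame(
  Date = seq(as.Date("2008-01-01"), as.Date("2013-12-01"), by = "1 month")
)

# Left join: keep every month, NA where there is no observation
full <- merge(all_months, df, by = "Date", all.x = TRUE)

# Assumption from the question: no row means no cases
full$Cases[is.na(full$Cases)] <- 0

# Monthly time series object starting in January 2008
cases_ts <- ts(full$Cases, start = c(2008, 1), frequency = 12)
```

If several rows share the same month, sum them per month first (as in the plyr answer above) so that the merge yields one row per month.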
