Using scale_x_date in ggplot2 with different columns - r

Say I have the following data:
Date Month Year Miles Activity
3/1/2014 3 2014 72 Walking
3/1/2014 3 2014 85 Running
3/2/2014 3 2014 42 Running
4/1/2014 4 2014 65 Biking
1/1/2015 1 2015 21 Walking
1/2/2015 1 2015 32 Running
I want to make graphs that display the sum of miles for each month, grouped and colored by activity. I know that I can make a separate data frame with the sum of the miles per month per activity, but the issue is in displaying it. Here in Excel is basically what I want: the sums displayed chronologically and colored by activity.
I know ggplot2 has a scale_x_date command, but I run into issues on "both sides" of the problem--if I use the Date column as my X variable, they're not summed. But if I sum my data how I want it in a separate data frame (i.e., where every activity for every month has just one row), I can't use both Month and Year as my x-axis--at least, not in any way that I can get scale_x_date to understand.
(And, I know, if Excel is graphing it correctly why not just use Excel--unfortunately, my data is so large that Excel was running very slowly and it's not feasible to keep using it.) Any ideas?

The code below worked fine for me with a small dataset. If you convert your data.frame to a data.table, you can sum the miles per activity and month with just a couple of preprocessing steps. I've left some comments in the code to give you an idea of what's going on, but it should be pretty self-explanatory.
# Assuming your dataframe looks like this
df <- data.frame(Date = c('3/1/2014','3/1/2014','4/2/2014','5/1/2014','5/1/2014','6/1/2014','6/1/2014'), Miles = c(72,14,131,534,123,43,56), Activity = c('Walking','Walking','Biking','Running','Running','Running', 'Biking'))
# Load lubridate, data.table and ggplot2
library(lubridate)
library(data.table)
library(ggplot2)
# Convert dataframe to a data.table
setDT(df)
df[, Date := as.Date(Date, format = '%m/%d/%Y')] # Convert Date to class Date -- check with class(df[, Date]) if you are unsure
df[, Date := floor_date(Date, unit = 'month')] # Reduce all dates to the first day of the month for summing later on
# Create ggplot object using data.table's functionality to sum the miles
ggplot(df[, sum(Miles), by = .(Date, Activity)], aes(x = Date, y = V1, colour = factor(Activity))) + # data.table creates the column V1, which is the sum of miles
  geom_line() +
  scale_x_date(date_labels = '%b-%y') # %b displays the abbreviated month name
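If the data has already been summed into one row per activity per month, the separate Month and Year columns can be combined into a real Date before plotting, which is what scale_x_date needs. A minimal sketch of that idea, assuming ggplot2 is installed; the names runs, monthly and MonthStart are made up for illustration, and the rows are the sample data from the question:

```r
library(ggplot2)

# Sample rows from the question (without the original Date column)
runs <- data.frame(
  Month    = c(3, 3, 3, 4, 1, 1),
  Year     = c(2014, 2014, 2014, 2014, 2015, 2015),
  Miles    = c(72, 85, 42, 65, 21, 32),
  Activity = c("Walking", "Running", "Running", "Biking", "Walking", "Running")
)

# One row per activity per month
monthly <- aggregate(Miles ~ Year + Month + Activity, data = runs, FUN = sum)

# Build a Date (first day of the month) from the Year and Month columns
monthly$MonthStart <- as.Date(
  sprintf("%d-%02d-01", as.integer(monthly$Year), as.integer(monthly$Month))
)

# Bars in chronological order, colored by activity
p <- ggplot(monthly, aes(x = MonthStart, y = Miles, fill = Activity)) +
  geom_col(position = "dodge") +
  scale_x_date(date_breaks = "1 month", date_labels = "%b-%y")
```

Printing p draws the chart. Because MonthStart is a true Date, months with no data still take up space on the axis, so the gap between April 2014 and January 2015 stays visible.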

Related

Grouping daily observations from specific time intervals

Context: I have a survey dataset with daily observations in a 6-7 day period each month for about a year. Observations include party choice and trust in government (Likert-scale).
Problem: the N is too small for observations each day, so I need to group the daily observations from each period. How?
I've tried the following (using lubridate), but that supposes each period of observations begins at the start of the week.
df <- df %>%
group_by(date_week = floor_date(date_variable, "week"))
Unfortunately, this is a mess as it takes all observations from Monday-Sunday and groups them together (starting Monday), but some survey periods cross weeks, e.g. Thursday-Wednesday, and thus R creates two periods of observations.
I need to solve this problem and then visualize (I'm using ggplot). So the new date-variable needs to be in date style, and it would be perfect, if it could visualize from the median day in each period.
Example of data
Date Party N Trust-in-gov-average
"2021-10-02" A 25 1.5
"2021-10-02" B 10 2.5
"2021-10-02" C 15 3.8
"2021-10-03" A 12 1.2
"2021-10-03" B 53 3.2
"2021-10-03" C 23 2.8
"2021-10-04" A 58 1.6
"2021-10-04" B 33 2.6
"2021-10-04" C 44 3.0
After many sleepless nights (in part thanks to New Year's Eve) I finally found a solution to my problem. It's all about combining lubridate and dplyr.
First convert the variable to date-format.
df$date_string <- ymd(df$date_string)
Then use mutate and %within% to extract the periods. Name each period after the date you want to represent it, e.g. the first day of observation.
df <- df %>%
  mutate(waves = case_when(
    date_string %within% interval(ymd("2020-09-13"), ymd("2020-09-19")) ~ "2020-09-13",
    date_string %within% interval(ymd("2020-09-20"), ymd("2020-10-03")) ~ "2020-09-20",
    date_string %within% interval(ymd("2020-10-11"), ymd("2020-10-17")) ~ "2020-10-11",
    date_string %within% interval(ymd("2020-10-25"), ymd("2020-10-31")) ~ "2020-10-25"))
Finally, convert the new variable back to a date variable using ymd() again.
df$waves <- ymd(df$waves)
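Typing every interval by hand gets tedious with a year of survey periods. As an alternative (not the method used above), the periods can be detected from the gaps between observation dates, and each one labelled with its median day as the question asked. A minimal base R sketch; the dates, the 3-day gap threshold and the names obs, wave and wave_date are all assumptions for illustration:

```r
# Hypothetical observation dates: two survey periods, the first running
# Saturday to Friday so that it crosses a calendar week
obs <- data.frame(date = c(
  seq(as.Date("2021-10-02"), by = "day", length.out = 7),
  seq(as.Date("2021-11-04"), by = "day", length.out = 6)
))

obs  <- obs[order(obs$date), , drop = FALSE]
days <- as.numeric(obs$date)  # days since 1970-01-01

# Start a new wave whenever more than 3 days pass without an observation
# (assumed threshold; pick one that fits the survey schedule)
obs$wave <- cumsum(c(TRUE, diff(days) > 3))

# Median day of each wave, rounded down to a whole day, stored as a Date
obs$wave_date <- as.Date(floor(ave(days, obs$wave, FUN = median)),
                         origin = "1970-01-01")
```

Grouping by wave_date then gives one Date per survey period that ggplot can place on a date axis.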

Plot data over time in R

I'm working with a dataframe including the columns 'timestamps' and 'amount'. The data can be produced like this:
sample_size <- 40
start_date = as.POSIXct("2020-01-01 00:00")
end_date = as.POSIXct("2020-01-03 00:00")
timestamps <- as.POSIXct(sample(seq(start_date, end_date, by=60), sample_size))
amount <- rpois(sample_size, 5)
df <- data.frame(timestamps=timestamps, amount=amount)
Now I'd like to plot the sum of the amount entries for some timeframe (like every hour, 30 min, 20 min). The final plot would look like a histogram of the timestamps but should not just count how many timestamps fell into the timeframe, but what amount fell into the timeframe.
How can I approach this? I could create an extra vector with the amount of each timeframe, but don't know how to proceed.
I'd also like to add a feature to reduce by hour, such that just one day is plotted (notice the range between start_date and end_date is two days) and in each timeframe (let's say every hour) the amount of data located in this hour is plotted. In this case the data
2020-01-01 13:03:00 5
2020-01-02 13:21:00 10
2020-01-02 13:38:00 1
2020-01-01 13:14:00 3
would produce a bar of height sum(5, 10, 1, 3) = 19 in the timeframe 13:00-14:00. How can I implement the plotting to easily switch between these two modes (plot days/plot just one day and reduce)?
EDIT: Following the advice of @Gregor Thomas I added a grouping column like this:
df$time_group <- lubridate::floor_date(df$timestamps, unit="20 minutes")
Now I'm wondering how to ignore the dates and thus reduce by 20 minute frame (independent of date).
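One way to switch between the two modes is to bin on the number of seconds and, when the date should be ignored, keep only the seconds since midnight before binning. A minimal base R sketch under the assumption that the timestamps are in UTC; sum_by_bin and its arguments are hypothetical names, and the four rows are the example from the question:

```r
# Sum `amount` per time bin. With fold_days = TRUE the date is ignored and
# `bin` is returned as hours since midnight; otherwise `bin` is the POSIXct
# start of each bin. Assumes timestamps are POSIXct in UTC.
sum_by_bin <- function(timestamps, amount, minutes = 60, fold_days = FALSE) {
  width <- minutes * 60
  secs  <- as.numeric(timestamps)        # seconds since 1970-01-01 UTC
  if (fold_days) secs <- secs %% 86400   # keep only the time of day
  bin <- floor(secs / width) * width     # start of the bin, in seconds
  out <- aggregate(amount, by = list(bin = bin), FUN = sum)
  names(out)[2] <- "amount"
  if (fold_days) {
    out$bin <- out$bin / 3600
  } else {
    out$bin <- as.POSIXct(out$bin, origin = "1970-01-01", tz = "UTC")
  }
  out
}

# The four example rows from the question
ts  <- as.POSIXct(c("2020-01-01 13:03:00", "2020-01-02 13:21:00",
                    "2020-01-02 13:38:00", "2020-01-01 13:14:00"), tz = "UTC")
amt <- c(5, 10, 1, 3)

per_day <- sum_by_bin(ts, amt, minutes = 60)                    # one bar per hour per day
folded  <- sum_by_bin(ts, amt, minutes = 60, fold_days = TRUE)  # one bar per hour of day
```

Here folded has a single row for hour 13 with amount 19, matching the example, while per_day keeps the two days apart (8 and 11). Either result can be passed to barplot(out$amount, names.arg = out$bin) or to a ggplot2 bar layer.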

Plotting a variable measured monthly with a variable measured yearly in the same plot (R)

Here are two samples of datasets I would like to plot together on the same plot:
>head(df1)
Date y
1 2015-10-01 6217.734
2 2015-09-01 6242.592
3 2015-08-01 6772.145
4 2015-07-01 6865.719
and
>head(df2)
Year x
1 1980 5760
2 1981 4765
3 1982 2620
4 1983 7484
Given that df2$Year and df1$Date overlap date ranges and df1$y and df2$x are of the same scale, how can I best plot y and x against time on the same plot given that x is measured only yearly and y monthly?
I imagine it will require converting Year to an arbitrary date (1980-01-01, 1981-01-01). But beyond that, other than altering my df2 data.frame to having twelve observations per year with the same x value per observation, then combining the two data.frames, I cannot think of what to do.
I would prefer to use ggplot2 if there is a solution there.
Can you try this out for me?
library(dygraphs)
library(xts)
Rename one of your variables to match the other scaled variable.
Rename Year to match the other data frame's date column.
Then do:
prep <- cbind(df1, df2)
ts_object <- as.xts(prep[,2:ncol(prep)], prep$Year)
dygraph(ts_object)
Note that you are providing literally NO data for me to work with here. If you can provide some, that'd be great. Try dput(df1) and dput(df2), and post the output of these commands.
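Since the question asks for ggplot2, the idea of converting Year to an arbitrary date can be sketched without repeating each yearly value twelve times: give both data frames the same column names, stack them, and let a series column tell them apart. This is a minimal sketch, assuming ggplot2 is installed; the names monthly, yearly, both and series are made up, and the values are the head() rows shown in the question:

```r
library(ggplot2)

# head() rows from the question
df1 <- data.frame(
  Date = as.Date(c("2015-10-01", "2015-09-01", "2015-08-01", "2015-07-01")),
  y    = c(6217.734, 6242.592, 6772.145, 6865.719)
)
df2 <- data.frame(Year = 1980:1983, x = c(5760, 4765, 2620, 7484))

# Same columns for both series; each yearly value is placed on 1 January
monthly <- data.frame(Date = df1$Date, value = df1$y, series = "y (monthly)")
yearly  <- data.frame(Date = as.Date(paste0(df2$Year, "-01-01")),
                      value = df2$x, series = "x (yearly)")
both <- rbind(monthly, yearly)

# One line per series on a shared date axis; points show where data exists
p <- ggplot(both, aes(x = Date, y = value, colour = series)) +
  geom_line() +
  geom_point()
```

Placing the yearly value on 1 January is arbitrary; 1 July would centre it in the year instead.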

Using dplyr::mutate between two dataframes to create column based on date range

Right now I have two dataframes. One contains over 11 million rows of a start date, end date, and other variables. The second dataframe contains daily values for heating degree days (basically a temperature measure).
set.seed(1)
library(lubridate)
date.range <- ymd(paste(2008,3,1:31,sep="-"))
daily <- data.frame(date=date.range,value=runif(31,min=0,max=45))
intervals <- data.frame(start=daily$date[1:5],end=daily$date[c(6,9,15,24,31)])
In reality my daily dataframe has every day for 9 years and my intervals dataframe has entries that span over arbitrary dates in this time period. What I wanted to do was to add a column to my intervals dataframe called nhdd that summed over the values in daily corresponding to that time interval (end exclusive).
For example, in this case the first entry of this new column would be
sum(daily$value[1:5])
and the second would be
sum(daily$value[2:8]) and so on.
I tried using the following code
intervals <- mutate(intervals,nhdd=sum(filter(daily,date>=start&date<end)$value))
This is not working and I think it might have something to do with not referencing the columns correctly but I'm not sure where to go.
I'd really like to use dplyr to solve this and not a loop because 11 million rows will take long enough using dplyr. I tried using more of lubridate but dplyr doesn't seem to support the Period class.
Edit: I'm actually using dates from as.Date now instead of lubridate, but the basic question of how to refer to a different dataframe from within mutate still stands.
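eps <- .Machine$double.eps
library(dplyr)
intervals %>%
  rowwise() %>%
  # Note: a Date is stored as a number of days (about 14000 here), so
  # end - eps rounds back to end and between() still includes the end date
  mutate(nhdd = sum(daily$value[between(daily$date, start, end - eps )]))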
# start end nhdd
#1 2008-03-01 2008-03-06 144.8444
#2 2008-03-02 2008-03-09 233.4530
#3 2008-03-03 2008-03-15 319.5452
#4 2008-03-04 2008-03-24 531.7620
#5 2008-03-05 2008-03-31 614.2481
If you find the dplyr solution a bit slow (basically due to rowwise), you might want to use data.table for pure speed
library(data.table)
setkey(setDT(intervals), start, end)
setDT(daily)[, date1 := date]
foverlaps(daily, by.x = c("date", "date1"), intervals)[, sum(value), by=c("start", "end")]
# start end V1
#1: 2008-03-01 2008-03-06 144.8444
#2: 2008-03-02 2008-03-09 233.4530
#3: 2008-03-03 2008-03-15 319.5452
#4: 2008-03-04 2008-03-24 531.7620
#5: 2008-03-05 2008-03-31 614.2481
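The sums printed above include the end date: 144.8444 is sum(daily$value[1:6]), not the sum(daily$value[1:5]) the question asks for. If the end-exclusive version is needed, the comparison can be written out explicitly. A minimal base R sketch that rebuilds the question's data with seq() instead of lubridate; it loops over rows, so it only illustrates the logic and is not meant as a fast solution for 11 million rows:

```r
set.seed(1)
date.range <- seq(as.Date("2008-03-01"), by = "day", length.out = 31)
daily      <- data.frame(date = date.range, value = runif(31, min = 0, max = 45))
intervals  <- data.frame(start = daily$date[1:5],
                         end   = daily$date[c(6, 9, 15, 24, 31)])

# For each interval, sum the daily values with start <= date < end
intervals$nhdd <- vapply(seq_len(nrow(intervals)), function(i) {
  in_range <- daily$date >= intervals$start[i] & daily$date < intervals$end[i]
  sum(daily$value[in_range])
}, numeric(1))
```

With this version the first entry equals sum(daily$value[1:5]) and the second equals sum(daily$value[2:8]), as described in the question.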

Split dataframe and calculate averages for data subsets in R

I have this data frame in R:
steps day month
4758 Tuesday December
9822 Wednesday December
10773 Thursday December
I want to iterate over the data frame and apply a function to the steps column based on the value in the month column. I'm trying to work out the average number of steps per weekday for each month.
I want to output to a new data frame like so where the week days repeat but I only have the average values per day:
average.steps day month
4500 Tuesday December
9000 Wednesday December
1000 Thursday December
I can work out how to work out the averages for the data frame as a whole, but want to use a for loop to apply it just for step values from the same month.
avgsteps <- ddply(DATA, "day", summarise, msteps = mean(steps))
My basic idea for the for function was:
f <- function(m in month) {ddply(DATA, "day", summarise, msteps = mean(steps))}
But it won't process it and throws the error:
Error: unexpected 'in' in "f <- function(m in"
Any help would be greatly appreciated!
EDIT:
So I've tried @agstudy's suggested fix (below) and it gets the right data structure (a single value for each weekday for each month), but the value assigned to each day is identical. I'm a bit confused about what could be going wrong.
steps.month.day.avg <- ddply(steps.month.day, .(fitbit.day,fitbit.month), summarise, msteps = mean(steps))
No need to loop here; you should just change the variables to split the data frame by:
ddply(DATA, .(day,month), summarise, msteps = mean(steps))
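For comparison, the same split by day and month can be written without plyr. A minimal sketch using base R's aggregate(); the first three rows are the ones from the question, while the last two rows and the name avg are made up so that one group has more than one value:

```r
DATA <- data.frame(
  steps = c(4758, 9822, 10773, 5242, 8000),
  day   = c("Tuesday", "Wednesday", "Thursday", "Tuesday", "Tuesday"),
  month = c("December", "December", "December", "December", "January")
)

# Mean steps for every combination of day and month
avg <- aggregate(steps ~ day + month, data = DATA, FUN = mean)
names(avg)[names(avg) == "steps"] <- "msteps"
```

Each row of avg holds one weekday of one month, so the two December Tuesdays are averaged together while the January Tuesday stays separate.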
