Plotting a variable measured monthly with a variable measured yearly in the same plot (R) - r

Here are two samples of datasets I would like to plot together on the same plot:
>head(df1)
Date y
1 2015-10-01 6217.734
2 2015-09-01 6242.592
3 2015-08-01 6772.145
4 2015-07-01 6865.719
and
>head(df2)
Year x
1 1980 5760
2 1981 4765
3 1982 2620
4 1983 7484
Given that df2$Year and df1$Date overlap date ranges and df1$y and df2$x are of the same scale, how can I best plot y and x against time on the same plot given that x is measured only yearly and y monthly?
I imagine it will require converting Year to an arbitrary date (1980-01-01, 1981-01-01). But beyond that, other than altering my df2 data.frame to have twelve observations per year with the same x value per observation and then combining the two data.frames, I cannot think of what to do.
I would prefer to use ggplot2 if there is a solution there.

Can you try this out for me?
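library(dygraphs)
library(xts)
# Give both data frames a Date column so they share one time index
df1$Date <- as.Date(df1$Date)
df2$Date <- as.Date(paste0(df2$Year, "-01-01"))
# Build one xts series per data frame; cbind() on the raw data frames fails because their row counts differ
y_xts <- xts(df1$y, order.by = df1$Date)
x_xts <- xts(df2$x, order.by = df2$Date)
# merge() aligns the two series on the time index and fills the gaps with NA
ts_object <- merge(y_xts, x_xts)
# connectSeparatedPoints draws the yearly series as a line across the NA gaps
dygraph(ts_object) %>% dyOptions(connectSeparatedPoints = TRUE)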
Note that you have not provided any data for me to work with. If you can, run dput(df1) and dput(df2) and post the output of these commands.
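Since the question asks for ggplot2, here is a minimal sketch of one way it could be done: convert Year to a Date (January 1st of each year, an arbitrary choice) and draw each data frame in its own layer. The sample values below are made up to mirror the shape of df1 and df2; in particular, the years in df2 are shifted so that the two ranges overlap.

```r
library(ggplot2)

# Hypothetical sample data shaped like the question's df1 and df2
df1 <- data.frame(
  Date = as.Date(c("2015-10-01", "2015-09-01", "2015-08-01", "2015-07-01")),
  y = c(6217.734, 6242.592, 6772.145, 6865.719)
)
df2 <- data.frame(Year = 2012:2015, x = c(5760, 4765, 2620, 7484))

# Place each yearly value on January 1st of its year (an assumption, any fixed day works)
df2$Date <- as.Date(paste0(df2$Year, "-01-01"))

# Each layer gets its own data frame, so no padding of df2 to twelve rows per year is needed
p <- ggplot() +
  geom_line(data = df1, aes(x = Date, y = y, colour = "y (monthly)")) +
  geom_line(data = df2, aes(x = Date, y = x, colour = "x (yearly)")) +
  geom_point(data = df2, aes(x = Date, y = x, colour = "x (yearly)")) +
  labs(x = "Date", y = "Value", colour = "Series")
print(p)
```

Because both layers map a Date to the x axis, ggplot2 puts them on one shared time scale without the data frames having to be combined first.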

Related

Binning and making histogram for dates in R

I am new to using dates in R so sorry if this is a basic question. I have a data set that has the name of fracking wells and their job end date as listed below:
df = as.data.frame(df)
head(df)
`WellName JobEndDate
1 WILLIAM VALENTINE 1 5/19/1982 12:00:00 AM
2 LIZARD HEAD 1-8H RE 2/7/1995 12:00:00 AM
3 North Westbrook Unit/Well No. 3032 6/11/1996 12:00:00 AM
4 Olene Reagan 3-1 12/13/2001 12:00:00 AM
5 CNX3 9/22/2008 12:00:00 AM
7 CNX2 1/22/2009 12:00:00 AM`
It is a large file with about 100,000 entries that go until 2017. I want to create a histogram based on the job end date. To do that, I figured I would place the dates into bins, breaking by months. However, I am struggling with placing them into bins so that each month has a number corresponding to how many wells were finished in each month. Therefore, I am also struggling with the histogram. I would appreciate any help!! Thank you!
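First, parse the date strings and extract the month from every date:
library(data.table)
# JobEndDate is text such as "5/19/1982 12:00:00 AM"; as.Date() reads the leading m/d/Y part and ignores the rest
df$JobEndDate <- as.Date(df$JobEndDate, format = "%m/%d/%Y")
df$months <- month(df$JobEndDate)
Then, make your plot:
library(ggplot2)
# months is a column name, so it must not be quoted; geom_bar() counts the rows in each month
ggplot(df, aes(x = factor(months))) + geom_bar()
# alternative in base R
barplot(table(df$months))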
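Note that month() pools the same calendar month from different years into one bin. If one bin per month of each year is wanted instead, a possible sketch is to bin on a "YYYY-MM" label. The rows below are a made-up sample in the question's format (the last well name is invented so that one month has two entries).

```r
# Hypothetical sample in the question's format
df <- data.frame(
  WellName = c("WILLIAM VALENTINE 1", "LIZARD HEAD 1-8H RE", "CNX3", "CNX2", "EXAMPLE WELL"),
  JobEndDate = c("5/19/1982 12:00:00 AM", "2/7/1995 12:00:00 AM",
                 "9/22/2008 12:00:00 AM", "1/22/2009 12:00:00 AM",
                 "1/5/2009 12:00:00 AM")
)

# Parse the leading month/day/year part; the trailing time text is ignored
df$JobEndDate <- as.Date(df$JobEndDate, format = "%m/%d/%Y")

# One label per year-month, e.g. "2009-01"
df$YearMonth <- format(df$JobEndDate, "%Y-%m")

# Number of wells finished in each year-month
counts <- table(df$YearMonth)
print(counts)

# Bar chart of the counts; las = 2 turns the labels so they do not overlap
barplot(counts, las = 2)
```

Months with no wells do not appear in this table. Base R's hist() also has a method for Date objects, so hist(df$JobEndDate, breaks = "months") may be a shorter route if empty months should show up as gaps.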

Using scale_x_date in ggplot2 with different columns

Say I have the following data:
Date Month Year Miles Activity
3/1/2014 3 2014 72 Walking
3/1/2014 3 2014 85 Running
3/2/2014 3 2014 42 Running
4/1/2014 4 2014 65 Biking
1/1/2015 1 2015 21 Walking
1/2/2015 1 2015 32 Running
I want to make graphs that display the sum of each month's date for miles, grouped and colored by year. I know that I can make a separate data frame with the sum of the miles per month per activity, but the issue is in displaying. Here in Excel is basically what I want--the sums displayed chronologically and colored by activity.
I know ggplot2 has a scale_x_date command, but I run into issues on "both sides" of the problem--if I use the Date column as my X variable, they're not summed. But if I sum my data how I want it in a separate data frame (i.e., where every activity for every month has just one row), I can't use both Month and Year as my x-axis--at least, not in any way that I can get scale_x_date to understand.
(And, I know, if Excel is graphing it correctly why not just use Excel--unfortunately, my data is so large that Excel was running very slowly and it's not feasible to keep using it.) Any ideas?
The below worked fine for me with the small dataset. If you convert your data.frame to a data.table you can sum the data up to the mile per activity and month level with just a couple of preprocessing steps. I've left some comments in the code to give you an idea of what's going on but it should be pretty self-explanatory.
# Assuming your dataframe looks like this
df <- data.frame(Date = c('3/1/2014','3/1/2014','4/2/2014','5/1/2014','5/1/2014','6/1/2014','6/1/2014'), Miles = c(72,14,131,534,123,43,56), Activity = c('Walking','Walking','Biking','Running','Running','Running', 'Biking'))
# Load lubridate, data.table and ggplot2
library(lubridate)
library(data.table)
library(ggplot2)
# Convert dataframe to a data.table
setDT(df)
df[, Date := as.Date(Date, format = '%m/%d/%Y')] # Convert data to a column of Class Date -- check with class(df[, Date]) if you are unsure
df[, Date := floor_date(Date, unit = 'month')] # Reduce all dates to the first day of the month for summing later on
# Create ggplot object using data.tables functionality to sum the miles
ggplot(df[, sum(Miles), by = .(Date, Activity)], aes(x = Date, y = V1, colour = factor(Activity))) + # Data.table creates the column V1 which is the sum of miles
geom_line() +
scale_x_date(date_labels = '%b-%y') # %b is used to display the first 3 letters of the month
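If the data are already summed per Month and Year, a rough sketch of how to get back to something scale_x_date understands is to rebuild a Date from the two columns, using the first day of the month as a stand-in. This uses the six rows from the question; the names raw, monthly and MonthStart are only for illustration.

```r
library(ggplot2)

# The six rows from the question, without the Date column
raw <- data.frame(
  Month = c(3L, 3L, 3L, 4L, 1L, 1L),
  Year = c(2014L, 2014L, 2014L, 2014L, 2015L, 2015L),
  Miles = c(72, 85, 42, 65, 21, 32),
  Activity = c("Walking", "Running", "Running", "Biking", "Walking", "Running")
)

# One row per year, month and activity, with the miles summed
monthly <- aggregate(Miles ~ Year + Month + Activity, data = raw, FUN = sum)

# Rebuild a real Date from Year and Month (day 1 is an arbitrary placeholder)
monthly$MonthStart <- as.Date(sprintf("%d-%02d-01", monthly$Year, monthly$Month))

# Bars side by side per month, coloured by activity, in chronological order
p <- ggplot(monthly, aes(x = MonthStart, y = Miles, fill = Activity)) +
  geom_col(position = "dodge") +
  scale_x_date(date_labels = "%b-%y")
print(p)
```

Because MonthStart is of class Date, January 2015 sorts after April 2014 on the axis, which a plain Month column would not do.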

Using dplyr::mutate between two dataframes to create column based on date range

Right now I have two dataframes. One contains over 11 million rows of a start date, end date, and other variables. The second dataframe contains daily values for heating degree days (basically a temperature measure).
set.seed(1)
library(lubridate)
date.range <- ymd(paste(2008,3,1:31,sep="-"))
daily <- data.frame(date=date.range,value=runif(31,min=0,max=45))
intervals <- data.frame(start=daily$date[1:5],end=daily$date[c(6,9,15,24,31)])
In reality my daily dataframe has every day for 9 years and my intervals dataframe has entries that span over arbitrary dates in this time period. What I wanted to do was to add a column to my intervals dataframe called nhdd that summed over the values in daily corresponding to that time interval (end exclusive).
For example, in this case the first entry of this new column would be
sum(daily$value[1:5])
and the second would be
sum(daily$value[2:8]) and so on.
I tried using the following code
intervals <- mutate(intervals,nhdd=sum(filter(daily,date>=start&date<end)$value))
This is not working and I think it might have something to do with not referencing the columns correctly but I'm not sure where to go.
I'd really like to use dplyr to solve this and not a loop because 11 million rows will take long enough using dplyr. I tried using more of lubridate but dplyr doesn't seem to support the Period class.
Edit: I'm actually using dates from as.Date now instead of lubridate, but the basic question of how to refer to a different dataframe from within mutate still stands.
library(dplyr)
intervals %>%
rowwise() %>%
# Use a strict < on the end date: subtracting .Machine$double.eps from a Date has no effect, so between() would include the end day
mutate(nhdd = sum(daily$value[daily$date >= start & daily$date < end])) %>%
ungroup()
# start end nhdd
#1 2008-03-01 2008-03-06 104.4169
#2 2008-03-02 2008-03-09 205.1428
#3 2008-03-03 2008-03-15 284.9024
#4 2008-03-04 2008-03-24 526.1120
#5 2008-03-05 2008-03-31 592.5545
If you find the dplyr solution a bit slow (mainly due to rowwise), you might want to use data.table for pure speed:
library(data.table)
# foverlaps treats intervals as closed, so match on the last included day (end - 1) to keep the end date exclusive
setDT(intervals)[, last_day := end - 1]
setkey(intervals, start, last_day)
setDT(daily)[, date1 := date]
# nomatch = 0L drops days that fall in no interval
foverlaps(daily, intervals, by.x = c("date", "date1"), nomatch = 0L)[, sum(value), by = c("start", "end")]
# start end V1
#1: 2008-03-01 2008-03-06 104.4169
#2: 2008-03-02 2008-03-09 205.1428
#3: 2008-03-03 2008-03-15 284.9024
#4: 2008-03-04 2008-03-24 526.1120
#5: 2008-03-05 2008-03-31 592.5545
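With 11 million intervals, another option worth trying is to avoid per-row work altogether. The sketch below is not from the answers above; it assumes that daily is sorted by date with exactly one row per day, and that every start and end date appears in daily. Under those assumptions each interval sum is just the difference of two running totals.

```r
set.seed(1)
date.range <- seq(as.Date("2008-03-01"), as.Date("2008-03-31"), by = "day")
daily <- data.frame(date = date.range, value = runif(31, min = 0, max = 45))
intervals <- data.frame(start = daily$date[1:5],
                        end = daily$date[c(6, 9, 15, 24, 31)])

# running[i] holds the sum of all daily values strictly before row i
running <- c(0, cumsum(daily$value))

# Row positions of each start and end date in daily
s <- match(intervals$start, daily$date)
e <- match(intervals$end, daily$date)

# Sum over [start, end): everything before end minus everything before start
intervals$nhdd <- running[e] - running[s]
print(intervals)
```

The whole computation is vectorised, so its cost is one cumsum over daily plus two lookups per interval. If a start or end date is missing from daily, match() returns NA and the corresponding nhdd will be NA.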

Measure retention in R

I have a dataset containing information on students enrolled in an after-school program in the following format:
student_id year group number
1 2009-10 1
2 2009-10 1
3 2009-10 2
4 2009-10 3
5 2009-10 3
1 2010-11 1
2 2010-11 2
3 2010-11 3
4 2010-11 2
5 2010-11 2
I want to measure retention for each group on a per-year basis. I need to write some kind of loop statement that will look back at the previous year, count the IDs in each group that appear again, and return that count divided by the total number in that group. I have sketched out code (which is probably inefficient/missing some steps) as follows:
for (i in levels(data$year)){
if (i=="2009-10"){
#no previous year to look for
next
}else{
for(g in levels(data$group)){
##perhaps a plyr summarize function?
#look for id in previous year for that group
#compute count of identical ids
#return value/length(group)
}
} # end of else
} # end of loop over years
edit
after reading some suggestions, perhaps it would be simpler to use the ddply(transform) function. is there a way to create an associative relationship between the year and group number? the code would look something like this:
tracking=ddply(data,"student_id", transform, enroll.year1=1, enroll.year2=ifelse(criteria goes here,1,0), enroll.year3=ifelse(criteria goes here,1,0))
some sample output might look like this:
Year Group retention rate
2010-11 1 0.88
2011-12 1 0.8
2010-11 2 0.5
2011-12 2 0.6
2010-11 3 0.5
2011-12 3 0.5
has anyone solved a similar retention problem before? I'm having difficulty conceptualizing the steps, let alone implementing in R. any help would be greatly appreciated.
One solution I can think of involves a self-join shifting one year. I will be using data.table for the simplicity of both joining and the grouping required. I'll also be changing some names and the year format for convenience. I have saved your data in a data.frame called dd:
names(dd) <- c('id', 'year', 'group')
dd$year <- as.integer(substr(dd$year, 1, 4))
dd.prev <- dd
dd.prev$year <- dd.prev$year + 1 ## shifting year upwards so it matches the next year
require(data.table)
dd <- data.table(dd)
setkey(dd, group, year)
dd.prev <- data.table(dd.prev)
setkey(dd.prev, group, year)
setnames(dd.prev, 'id', 'id.prev') ## changing variable name so it is distinct
extra.year <- max(dd$year) + 1 ## the shift generates an extra year
dd.prev <- dd.prev[year != extra.year] ## drop extra year as retention not defined
dd <- dd[dd.prev, allow.cartesian = TRUE] ## join data from previous year to current data
dd[, retention := length(intersect(id, id.prev)) / length(unique(id.prev)), ## retained ids over last year's ids
by = list(year, group)]
That last bit computes the retention rate as you defined it: number of students who still remain from last year, intersect(id, id.prev), divided by the total number of students last year, unique(id.prev). With this data it only generates the retention for 2010, but with a longer series it would generate it for all years except the first.
UPDATE 1: Using plyr
names(dd) <- c('id', 'year', 'group')
dd$year <- as.integer(substr(dd$year, 1, 4))
dd.prev <- dd
dd.prev$year <- dd.prev$year + 1 ## shifting year upwards so it matches the next year
names(dd.prev)[1] <- 'id.prev' ## changing variable name so it is distinct
extra.year <- max(dd$year) + 1 ## the shift generates an extra year
dd.prev <- dd.prev[dd.prev$year!=extra.year,] ## drop extra year
dd <- merge(dd, dd.prev, all.y=TRUE) ## join data from previous year to current data
require(plyr)
dd <- ddply(dd, .(group, year), summarize,
retention=length(intersect(id, id.prev))
/length(unique(id.prev)))
I hope that helps.
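For comparison, here is a small base R sketch of the same shift-by-one-year idea, run on the ten rows from the question. It assumes each student appears at most once per year, and it counts a student as retained only if they are in the same group the following year; the names prev, retained and retention are only for illustration.

```r
# The ten rows from the question, with the year reduced to its starting year
dd <- data.frame(
  id = c(1, 2, 3, 4, 5, 1, 2, 3, 4, 5),
  year = c(rep(2009, 5), rep(2010, 5)),
  group = c(1, 1, 2, 3, 3, 1, 2, 3, 2, 2)
)

# Last year's roster, shifted forward one year so it lines up with the current year
prev <- transform(dd, year = year + 1)

# The shift creates one year beyond the data, where retention is not defined
prev <- prev[prev$year <= max(dd$year), ]

# A student is retained if the same id, group and year also occurs in the current data
prev$retained <- paste(prev$id, prev$group, prev$year) %in%
  paste(dd$id, dd$group, dd$year)

# Share of last year's students per group who are still in that group
retention <- aggregate(retained ~ year + group, data = prev, FUN = mean)
print(retention)
```

With this sample only student 1 stays in the same group, so group 1 has a retention of 0.5 and groups 2 and 3 have 0.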

Creating with time series from a dataset including missing values

I need to create a time series from a data frame. The problem is that the observations are not regularly spaced. The data frame looks like this:
Cases Date
15 1/2009
30 3/2010
45 12/2013
I have 60 observations like that. As you can see, the data were collected irregularly, starting from 1/2008 and ending 12/2013 (most of the months between these years have no recorded cases). My assumption is that there were no cases in those months. So, how can I convert this dataset to a time series? Then I will try to predict the possible number of cases in the future.
Try installing the plyr library,
install.packages("plyr")
and then sum the rows that share the same Date2 (presumably the Date column converted to class Date, with dat as your data frame):
library(plyr)
mergedData <- ddply(dat, .(Date2), .fun = function(x) {
data.frame(Cases = sum(x$Cases))
})
> head(mergedData)
Date2 Cases
1 2008-01-01 16352
2 2008-11-01 10
3 2009-01-01 23
4 2009-02-01 138
5 2009-04-01 18
6 2009-06-01 3534
You can also create a complete monthly date sequence and merge it with your data. This gives a complete time series in which the missing months are NA.
If df is your data frame and its Date column is of class Date, then create the sequence ts and merge as below.
ts <- data.frame(Date = seq(as.Date("2008-01-01"), as.Date("2013-12-31"), by="1 month"))
dfwithmissing <- merge(ts, df, by="Date", all.x=TRUE)
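Putting the pieces together, a minimal end-to-end sketch might look like the following. It uses the three rows shown in the question, treats each "month/year" string as the first day of that month, and fills the missing months with 0 as the question assumes. The names all_months, full and cases_ts are made up for the example.

```r
# The three sample rows from the question
df <- data.frame(Cases = c(15, 30, 45),
                 Date = c("1/2009", "3/2010", "12/2013"))

# "1/2009" has no day, so prepend day 1 before parsing
df$Date <- as.Date(paste0("1/", df$Date), format = "%d/%m/%Y")

# Sum the cases in case a month occurs more than once
monthly <- aggregate(Cases ~ Date, data = df, FUN = sum)

# Every month from 1/2008 to 12/2013
all_months <- data.frame(
  Date = seq(as.Date("2008-01-01"), as.Date("2013-12-01"), by = "month")
)

# Keep every month; months without data come back as NA
full <- merge(all_months, monthly, by = "Date", all.x = TRUE)

# The question assumes missing months had no cases
full$Cases[is.na(full$Cases)] <- 0

# Monthly time series starting in January 2008
cases_ts <- ts(full$Cases, start = c(2008, 1), frequency = 12)
print(cases_ts)
```

The resulting ts object has 72 monthly values and can be passed to the usual forecasting functions. Whether filling with 0 is appropriate depends on whether the missing months really had no cases or were simply not recorded.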
