Date DE VE
12/1/2016 93.387 0.095
11/1/2016 77.968 0.095
10/1/2016 65.184 0.095
9/1/2016 63.984 0.095
8/1/2016 67.657 0.095
The Date column is in %m/%d/%Y format.
DE and VE are daily averages. How can I convert a daily average to a monthly total in R, based on the actual number of days in each month? For example, the total for 12/2016 is 93.387 * 31. I need to calculate the monthly total for every month from 2006-01 to 2016-12.
To find the number of days in a month, you can use the days_in_month function from the lubridate package.
It takes a date-time object, so you first have to convert your Date column to a known date or date-time class (i.e. "POSIXct, POSIXlt, Date, chron, yearmon, yearqtr, zoo, zooreg, timeDate, xts, its, ti, jul, timeSeries, and fts objects").
Then you can just mutate your data frame, multiplying the daily averages by the number of days in each month.
library(lubridate)
library(dplyr)
myDf <- read.table(text = "Date DE VE
12/1/2016 93.387 0.095
11/1/2016 77.968 0.095
10/1/2016 65.184 0.095
9/1/2016 63.984 0.095
8/1/2016 67.657 0.095", header = TRUE)
mutate(myDf, Date = as.Date(Date, format = "%m/%d/%Y"),
monthlyTotalDE = DE * days_in_month(Date),
monthlyTotalVE = VE * days_in_month(Date))
# Date DE VE monthlyTotalDE monthlyTotalVE
# 1 2016-12-01 93.387 0.095 2894.997 2.945
# 2 2016-11-01 77.968 0.095 2339.040 2.850
# 3 2016-10-01 65.184 0.095 2020.704 2.945
# 4 2016-09-01 63.984 0.095 1919.520 2.850
# 5 2016-08-01 67.657 0.095 2097.367 2.945
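Since the full data set runs from 2006 to 2016, it includes leap years. As a quick sanity check (a minimal sketch, with the two dates chosen only for illustration), days_in_month accounts for the length of February in each year:

```r
library(lubridate)

# February in a common year and in a leap year (illustrative dates)
feb <- as.Date(c("2015-02-01", "2016-02-01"))

# days_in_month() returns a named integer vector; unname() keeps only the counts
feb_days <- unname(days_in_month(feb))
feb_days  # 28 29
```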
EDIT
In mutate, if you use a new column name, the column is appended to the data frame.
If you want to avoid adding columns, reuse the column names that already exist and mutate will overwrite those columns, e.g.
mutate(myDf, Date = as.Date(Date, format = "%m/%d/%Y"),
DE = DE * days_in_month(Date),
VE = VE * days_in_month(Date))
# Date DE VE
# 1 2016-12-01 2894.997 2.945
# 2 2016-11-01 2339.040 2.850
# 3 2016-10-01 2020.704 2.945
# 4 2016-09-01 1919.520 2.850
# 5 2016-08-01 2097.367 2.945
If you have a lot of columns to compute, I suggest using mutate_each. It's very powerful and will save you the pain of doing it manually with mutate, or the performance loss of a traditional loop.
Use vars to include or exclude variables in mutate_each.
You can exclude a variable manually using its name preceded by a minus, vars = -Date, or use a vector to cover several variables at once, e.g. vars = c(Date, DE) to include them or vars = -c(Date, DE) to exclude them.
You can also use the special selection helpers from dplyr::select, such as one_of(c("DE", "VE")) or DE:VE; see ?dplyr::select for more information. To drop variables, put - before the helper: -contains("Date").
Warning: if you use vars to include variables, don't spell out the named argument vars = in the call if you want to keep the column names.
myDf %>%
mutate(Date = as.Date(Date, format = "%m/%d/%Y")) %>%
mutate_each(funs(. * days_in_month(Date)),
vars = -Date)
# Date DE VE
# 1 2016-12-01 2894.997 2.945
# 2 2016-11-01 2339.040 2.850
# 3 2016-10-01 2020.704 2.945
# 4 2016-09-01 1919.520 2.850
# 5 2016-08-01 2097.367 2.945
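mutate_each and funs have since been deprecated in dplyr, and across (available from dplyr 1.0.0) is the current way to apply one function to many columns. A sketch of the same computation with the newer interface, assuming a recent dplyr is installed and using only the first two rows of the sample data:

```r
library(dplyr)
library(lubridate)

myDf <- data.frame(
  Date = c("12/1/2016", "11/1/2016"),
  DE   = c(93.387, 77.968),
  VE   = c(0.095, 0.095)
)

monthly <- myDf %>%
  mutate(Date = as.Date(Date, format = "%m/%d/%Y")) %>%
  # Multiply every column except Date by the number of days in that row's month
  mutate(across(-Date, ~ .x * days_in_month(Date)))

monthly
```

The selection inside across follows the same rules as dplyr::select, so -Date here plays the role that vars = -Date plays above.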
I have cross-sectional data as follows:
transaction_code <- c('A_111','A_222','A_333')
loan_start_date <- c('2016-01-03','2011-01-08','2013-02-13')
loan_maturity_date <- c('2017-01-03','2013-01-08','2015-02-13')
loan_data <- data.frame(cbind(transaction_code,loan_start_date,loan_maturity_date))
Now the dataframe looks like this
>loan_data
transaction_code loan_start_date loan_maturity_date
1 A_111 2016-01-03 2017-01-03
2 A_222 2011-01-08 2013-01-08
3 A_333 2013-02-13 2015-02-13
Now I want to create a monthly time series that tracks the time to maturity (in months) for each of the three loans over a period of 48 months. How can I achieve that? The final output should look like the following:
>loan data
transaction_code loan_start_date loan_maturity_date feb13 march13 april13........
1 A_111 2016-01-03 2017-01-03 46 45 44
2 A_222 2011-01-08 2013-01-08 NA NA NA
3 A_333 2013-02-13 2015-02-13 23 22 21
Here the new columns (one for each of the 48 months) represent the time to maturity of each loan as of that month.
Would really appreciate your help. Thanks
Here's an approach using tidyverse packages.
# Define the months to use in the right-hand columns.
months <- seq.Date(from = as.Date("2013-02-01"), by = "month", length.out = 48)
library(tidyverse); library(lubridate)
loan_data2 <- loan_data %>%
# Make a row for each combination of original data and the `months` list
crossing(months) %>%
# Format dates as MonYr and make into an ordered factor
mutate(month_name = format(months, "%b%y") %>% fct_reorder(months)) %>%
# Calculate months remaining -- this task is harder than it sounds! This
# approach isn't perfect, but it's hard to accomplish more simply, since
# months are different lengths.
mutate(months_remaining =
round(interval(months, loan_maturity_date) / ddays(1) / 30.5 - 1),
months_remaining = if_else(months_remaining < 0,
NA_real_, months_remaining)) %>%
# Drop the Date format of months now that calcs done
select(-months) %>%
# Spread into wide format
spread(month_name, months_remaining)
Output
loan_data2[,1:6]
# transaction_code loan_start_date loan_maturity_date Feb13 Mar13 Apr13
# 1 A_111 2016-01-03 2017-01-03 46 45 44
# 2 A_222 2011-01-08 2013-01-08 NA NA NA
# 3 A_333 2013-02-13 2015-02-13 23 22 21
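To see where a single cell comes from, the Feb13 value for loan A_111 can be reproduced by hand in base R. This is only a worked check of the 30.5-day approximation used above, not a different method:

```r
# Days from the first column month (2013-02-01) to loan A_111's maturity date
days_to_maturity <- as.numeric(as.Date("2017-01-03") - as.Date("2013-02-01"))

# Same approximation as above: an average month of 30.5 days, minus one, rounded
months_remaining <- round(days_to_maturity / 30.5 - 1)
months_remaining  # 46, matching the Feb13 column for A_111
```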
Let's say we have two time-series data.tables, one sampled by day and another by hour:
dtByDay
EURO TIME ... and some other columns
<num> <POSc>
1: 0.95 2017-01-20
2: 0.97 2017-01-21
3: 0.98 2017-01-22
...
dtByHour
TIME TEMP ... also some other columns
<POSc> <num>
1: 2017-01-20 00:00:00 22.45
2: 2017-01-20 01:00:00 23.50
3: 2017-01-20 02:00:00 23.50
...
and we need to merge them so as to get all the columns together. What's a nice way of doing it?
Evidently dtByDay[dtByHour] does not produce the desired outcome (as one could have wished): you get NA in the "EURO" column ...
It seems that roll = TRUE might give you odd behavior if a date is present in one data frame but not the other, so I wanted to post this alternative:
Starting with your original data frames:
dtbyday <- data.frame( EURO = c(0.95,0.97,0.98),
TIME = c(ymd("2017-01-20"),ymd("2017-01-21"),ymd("2017-01-22")))
dtbyhour <- data.frame( TEMP = c(22.45,23.50,23.40),
TIME = c(ymd_hms("2017-01-21 00:00:00"),ymd_hms("2017-01-21 01:00:00"),ymd_hms("2017-01-21 02:00:00")))
I converted the byhour$TIME to the same format as the byday$TIME using lubridate functions
dtbyhour <- dtbyhour %>%
rowwise() %>%
mutate( TIME = ymd( paste( year(TIME), month(TIME), day(TIME), sep="-" ) ) )
dtbyhour
# A tibble: 3 x 2
TEMP TIME
<dbl> <date>
1 22.45 2017-01-20
2 23.50 2017-01-20
3 23.40 2017-01-20
NOTE: The date changed because of time zone issues.
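One way to sidestep that shift, sketched here under the assumption that the hourly timestamps are meant to be read in UTC, is to state the time zone explicitly when truncating to a date with base R's as.Date:

```r
hours <- as.POSIXct(
  c("2017-01-21 00:00:00", "2017-01-21 01:00:00", "2017-01-21 02:00:00"),
  tz = "UTC"
)

# Truncate to calendar dates in the same time zone the timestamps were created in
days <- as.Date(hours, tz = "UTC")
days
```

With the time zone pinned on both sides, all three timestamps stay on 2017-01-21.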
Then use dplyr::full_join to join by TIME, which keeps all records and fills in values wherever a match exists. You'll need to aggregate the hourly values for each day; I calculated the mean TEMP below.
new.dt <- full_join( dtbyday, dtbyhour, by = c("TIME") ) %>%
group_by( TIME ) %>%
summarize( EURO = unique( EURO ),
TEMP = mean( TEMP, na.rm = TRUE ) )
# A tibble: 3 x 3
TIME EURO TEMP
<date> <dbl> <dbl>
1 2017-01-20 0.95 23.11667
2 2017-01-21 0.97 NaN
3 2017-01-22 0.98 NaN
Big thanks to comments above! - The solution is as easy as just adding roll=Inf argument when joining:
dtByHour[dtByDay, roll=Inf]
That's exactly what I needed. It takes the dtByDay value and uses it for all hours of that day. The output (from my application) is shown below.
For other applications, you may also consider roll="nearest". This will take the closest (from midnight) dtByDay value for all hours before and after midnight:
dtByHour[dtByDay, roll="nearest"]
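A small self-contained sketch of a rolling join, with made-up values in place of the real tables. It assumes the goal is one output row per hour, so the hourly table goes in the i position here (in data.table's x[i] form the result follows the rows of i); swap the two tables if your application needs the other direction:

```r
library(data.table)

# Illustrative data only; column names follow the question
dtByDay <- data.table(
  TIME = as.POSIXct(c("2017-01-20", "2017-01-21"), tz = "UTC"),
  EURO = c(0.95, 0.97)
)
dtByHour <- data.table(
  TIME = as.POSIXct(c("2017-01-20 00:00:00", "2017-01-20 01:00:00",
                      "2017-01-21 02:00:00"), tz = "UTC"),
  TEMP = c(22.45, 23.50, 23.50)
)

# For every hourly row, roll the most recent daily EURO value forward
res <- dtByDay[dtByHour, on = "TIME", roll = Inf]
res
```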
Date Price
2006-01-03 12.02
2006-01-04 11.84
2006-01-05 11.83
...
EXPIRATION DATES
2006-01-18
2006-02-15
2006-03-22
...
Hello, I have a data frame of daily futures prices with corresponding dates. I also have a vector of all the relevant contract expiration dates for the futures prices.
The price column is the price for the contract expiring in the nearest month (12 month expiration cycle). For example, the 12.02 contract price on 2006-01-03 expires on 2006-01-18. I want to create a column that lists the relevant expiration date for each futures price so I can calculate days until expiration for each daily price. The logic would be:
all dates between 2006-01-03 and 2006-01-18 would have 2006-01-18 in the new expiration date column and so on for all the 127 expiration dates I have.
I tried playing around with mutate() and subset(), but I've had no luck. I assume this will be tedious, but just need someone to help me get started
Thanks
Assuming the two data.frames are called df and df2 and dates are already formatted as such, with dplyr,
# add a row with a different expiration date to make sure it's working
df[4,] <- list(as.Date('2006-02-04'), 12)
library(dplyr)
df %>% rowwise() %>%
mutate(days_left = min(df2$EXPIRATION.DATES[df2$EXPIRATION.DATES > Date] - Date))
## Source: local data frame [4 x 3]
## Groups: <by row>
##
## # A tibble: 4 x 3
## Date Price days_left
## <date> <dbl> <S3: difftime>
## 1 2006-01-03 12.02 15 days
## 2 2006-01-04 11.84 14 days
## 3 2006-01-05 11.83 13 days
## 4 2006-02-04 12.00 11 days
or in base,
df$days_left <- lapply(df$Date, function(x){
min(df2$EXPIRATION.DATES[df2$EXPIRATION.DATES > x] - x)
})
df
## Date Price days_left
## 1 2006-01-03 12.02 15
## 2 2006-01-04 11.84 14
## 3 2006-01-05 11.83 13
## 4 2006-02-04 12.00 11
Subtracting dates calls difftime, which may be worth calling explicitly so that you can specify the units:
# dplyr
df %>% rowwise() %>%
mutate(days_left = df2$EXPIRATION.DATES[df2$EXPIRATION.DATES > Date] %>%
difftime(Date, units = 'days') %>%
min())
# base
df$days_left <- lapply(df$Date, function(x){
min(difftime(df2$EXPIRATION.DATES[df2$EXPIRATION.DATES > x], x, units = 'days'))
})
Depending on your data it may not make a difference, but it is a more robust approach than simple subtraction.
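A small illustration of the units argument, using the first price date and its expiration date from the question (the variable names are only for this sketch):

```r
expiry <- as.Date("2006-01-18")
today  <- as.Date("2006-01-03")

# The same gap expressed in two different units
in_days  <- difftime(expiry, today, units = "days")
in_weeks <- difftime(expiry, today, units = "weeks")

as.numeric(in_days)   # 15
as.numeric(in_weeks)  # 15 / 7
```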
Disclaimer: I dislike pipes (I have my reasons) and when I can find a good "Base R" solution, I go for that one first. So here's my old fart solution.
I added more data to make sure it really works as expected.
# Create main dataframe
df1 <- read.table(text=
"Date Price
2006-01-03 12.02
2006-01-18 12.04
2006-01-22 12.05
2006-02-01 11.99
2006-02-16 11.84
2006-03-21 11.83
2006-03-22 11.90
2006-03-29 12.00
", header = TRUE, stringsAsFactors = FALSE)
# Convert Date column to a proper Date-classed column
df1$Date <- as.Date(df1$Date)
# Generate an expiration dates vector
exp_dates <- as.Date(c("2006-01-18", "2006-02-15", "2006-03-22", "2006-04-18"))
# initialize df1$exp_dates
df1$exp_date <- NA
class(df1$exp_date) <- "Date"
# Loop over rows and find closest expir. date which is not past the date
for (i in seq_len(nrow(df1)))
df1$exp_date[i] <- exp_dates[which.max((df1$Date[i]-exp_dates) <= 0)]
(Yeah, I also use loops, and I even like them! :^p)
df1
Date Price exp_date
1 2006-01-03 12.02 2006-01-18
2 2006-01-18 12.04 2006-01-18
3 2006-01-22 12.05 2006-02-15
4 2006-02-01 11.99 2006-02-15
5 2006-02-16 11.84 2006-03-22
6 2006-03-21 11.83 2006-03-22
7 2006-03-22 11.90 2006-03-22
8 2006-03-29 12.00 2006-04-18
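For longer series, the same lookup can be done without a loop. A minimal base R sketch using findInterval, assuming the expiration vector is sorted in increasing order and that a price dated on an expiration day belongs to that expiration (as in the loop above):

```r
dates <- as.Date(c("2006-01-03", "2006-01-18", "2006-01-22", "2006-03-29"))
exp_dates <- as.Date(c("2006-01-18", "2006-02-15", "2006-03-22", "2006-04-18"))

# With left.open = TRUE, findInterval() counts the expirations strictly before
# each date, so adding 1 indexes the first expiration on or after it.
idx <- findInterval(as.numeric(dates), as.numeric(exp_dates), left.open = TRUE) + 1
next_exp <- exp_dates[idx]
next_exp
```

A date later than the last expiration gets an index past the end of exp_dates and therefore an NA.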
I've compiled a corpus of tweets sent over the past few months or so, which looks something like this (the actual corpus has a lot more columns and obviously a lot more rows, but you get the idea)
id when time day month year handle what
UK1.1 Sat Feb 20 2016 12:34:02 20 2 2016 dave Great goal by #lfc
UK1.2 Sat Feb 20 2016 15:12:42 20 2 2016 john Can't wait for the weekend
UK1.3 Sat Mar 01 2016 12:09:21 1 3 2016 smith Generic boring tweet
Now what I'd like to do in R is, using grep for string matching, plot the frequency of certain words/hashtags over time, ideally normalised by the number of tweets from that month/day/hour/whatever. But I have no idea how to do this.
I know how to use grep to create subsets of this dataframe, e.g. for all tweets including the #lfc hashtag, but I don't really know where to go from there.
The other issue is that whatever time scale is on my x-axis (hour/day/month etc.) needs to be numerical, and the 'when' column isn't. I've tried concatenating the 'day' and 'month' columns into something like '2.13' for February 13th, but this leads to the issue of R treating 2.13 as being 'earlier', so to speak, than 2.7 (February 7th) on mathematical grounds.
So basically, I'd like to make plots like these, where frequency of string x is plotted against time
Thanks!
Here's one way to count up tweets by day. I've illustrated with a simplified fake data set:
library(dplyr)
library(lubridate)
# Fake data
set.seed(485)
dat = data.frame(time = seq(as.POSIXct("2016-01-01"),as.POSIXct("2016-12-31"), length.out=10000),
what = sample(LETTERS, 10000, replace=TRUE))
tweet.summary = dat %>% group_by(day = date(time)) %>% # To summarise by month: group_by(month = month(time, label=TRUE))
summarise(total.tweets = n(),
A.tweets = sum(grepl("A", what)),
pct.A = A.tweets/total.tweets,
B.tweets = sum(grepl("B", what)),
pct.B = B.tweets/total.tweets)
tweet.summary
day total.tweets A.tweets pct.A B.tweets pct.B
1 2016-01-01 28 3 0.10714286 0 0.00000000
2 2016-01-02 27 0 0.00000000 1 0.03703704
3 2016-01-03 28 4 0.14285714 1 0.03571429
4 2016-01-04 27 2 0.07407407 2 0.07407407
...
Here's a way to plot the data using ggplot2. I've also summarized the data frame on the fly within ggplot, using the dplyr and reshape2 packages:
library(ggplot2)
library(reshape2)
library(scales)
ggplot(dat %>% group_by(Month = month(time, label=TRUE)) %>%
summarise(A = sum(grepl("A", what))/n(),
B = sum(grepl("B", what))/n()) %>%
melt(id.var="Month"),
aes(Month, value, colour=variable, group=variable)) +
geom_line() +
theme_bw() +
scale_y_continuous(limits=c(0,0.06), labels=percent_format()) +
labs(colour="", y="")
Regarding your date formatting issue, here's how to get numeric dates: you can turn the day, month, and year columns into a date using as.Date, and/or turn the day, month, year, and time columns into a date-time column using as.POSIXct. Both have underlying numeric values with a date class attached, so R treats them as dates in plotting functions and other functions. Once you've done this conversion, you can run the code above to count up tweets by day, month, etc.
# Fake time data
dat2 = data.frame(day=sample(1:28, 10), month=sample(1:12,10), year=2016,
time = paste0(sample(c(paste0(0,0:9),10:12),10),":",sample(10:50,10)))
# Create date-time format column from existing day/month/year/time columns
dat2$posix.date = with(dat2, as.POSIXct(paste0(year,"-",
sprintf("%02d",month),"-",
sprintf("%02d", day)," ",
time)))
# Create date format column
dat2$date = with(dat2, as.Date(paste0(year,"-",
sprintf("%02d",month),"-",
sprintf("%02d", day))))
dat2
day month year time posix.date date
1 28 10 2016 01:44 2016-10-28 01:44:00 2016-10-28
2 22 6 2016 12:28 2016-06-22 12:28:00 2016-06-22
3 3 4 2016 11:46 2016-04-03 11:46:00 2016-04-03
4 15 8 2016 10:13 2016-08-15 10:13:00 2016-08-15
5 6 2 2016 06:32 2016-02-06 06:32:00 2016-02-06
6 2 12 2016 02:38 2016-12-02 02:38:00 2016-12-02
7 4 11 2016 00:27 2016-11-04 00:27:00 2016-11-04
8 12 3 2016 07:20 2016-03-12 07:20:00 2016-03-12
9 24 5 2016 08:47 2016-05-24 08:47:00 2016-05-24
10 27 1 2016 04:22 2016-01-27 04:22:00 2016-01-27
You can see that the underlying values of a POSIXct date are numeric (number of seconds elapsed since midnight on Jan 1, 1970), by doing as.numeric(dat2$posix.date). Likewise for a Date object (number of days elapsed since Jan 1, 1970): as.numeric(dat2$date).
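A short illustration of why this fixes the ordering problem from the question, where 2.13 sorted before 2.7. The two dates are chosen only to mirror that example:

```r
feb07 <- as.Date("2016-02-07")
feb13 <- as.Date("2016-02-13")

as.numeric(feb07)            # days since 1970-01-01
feb07 < feb13                # TRUE: dates compare chronologically
as.numeric(feb13 - feb07)    # 6 days apart

# A POSIXct value stores seconds since 1970-01-01 00:00:00 UTC
secs <- as.numeric(as.POSIXct("1970-01-01 00:01:00", tz = "UTC"))
secs                         # 60
```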
I have a simple data set which has a date column and a value column. I noticed that the date sometimes comes in mmddyy (%m/%d/%y) format and other times in mmddYYYY (%m/%d/%Y) format. What is the best way to standardize the dates so that I can do other calculations without this formatting causing issues?
I tried the answers provided in "Changing date format in R" and in "How to change multiple Date formats in same column", but neither was able to fix the problem.
Below is a sample of the data
Date, Market
12/17/09,1.703
12/18/09,1.700
12/21/09,1.700
12/22/09,1.590
12/23/2009,1.568
12/24/2009,1.520
12/28/2009,1.500
12/29/2009,1.450
12/30/2009,1.450
12/31/2009,1.450
1/4/2010,1.440
When I read it into a new vector using something like this
dt <- as.Date(inp$Date, format="%m/%d/%y")
I get the following output for the above segment
dt Market
2009-12-17 1.703
2009-12-18 1.700
2009-12-21 1.700
2009-12-22 1.590
2020-12-23 1.568
2020-12-24 1.520
2020-12-28 1.500
2020-12-29 1.450
2020-12-30 1.450
2020-12-31 1.450
2020-01-04 1.440
As you can see we skipped from 2009 to 2020 at 12/23 because of change in formatting. Any help is appreciated. Thanks.
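# Drop the century from any four-digit year so every date ends in a two-digit year
dat$Date <- gsub("[0-9]{2}([0-9]{2})$", "\\1", dat$Date)
# All dates now parse with the two-digit-year format
dat$Date <- as.Date(dat$Date, format = "%m/%d/%y")
dat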
Date Market
# 1 2009-12-17 1.703
# 2 2009-12-18 1.700
# 3 2009-12-21 1.700
# 4 2009-12-22 1.590
# 5 2009-12-23 1.568
# 6 2009-12-24 1.520
# 7 2009-12-28 1.500
# 8 2009-12-29 1.450
# 9 2009-12-30 1.450
# 10 2009-12-31 1.450
# 11 2010-01-04 1.440
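The jump to 2020 in the question happens because %y consumes only the first two digits of a four-digit year such as 2009 and reads them as 20. An alternative sketch that leaves the strings untouched and picks the format per value instead, assuming every date ends in either a two-digit or a four-digit year (raw is a hypothetical stand-in for the Date column):

```r
raw <- c("12/22/09", "12/23/2009", "1/4/2010")

# TRUE where the string ends in a four-digit year
four_digit <- grepl("/[0-9]{4}$", raw)

# Parse everything with the two-digit-year format, then redo the four-digit rows
dates <- as.Date(raw, format = "%m/%d/%y")
dates[four_digit] <- as.Date(raw[four_digit], format = "%m/%d/%Y")
dates
```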