I have a roundabout solution for getting the last Thursday of each month; the reproducible code is below:
import pandas as pd
start = pd.Timestamp('2016-07-27 00:00:00')
end = pd.Timestamp('2016-11-18 00:00:00')
dt_range = pd.Series(pd.date_range(start, end, freq='W-THU'))
t = dt_range.groupby(dt_range.dt.month).last().values.astype('datetime64[D]')
However, I suspect it is unnecessary to generate a full range of weekly dates and run a groupby on it just to get the last Thursday. I tried
dt_range = pd.Series(pd.date_range(start, end, freq='4W-THU'))
but this can select the second-to-last Thursday in months that have a Thursday in the fifth week.
How can I accomplish this more efficiently, preferably within the date_range function itself?
You can use pd.offsets.LastWeekOfMonth to build your custom frequency:
last_thu = pd.offsets.LastWeekOfMonth(weekday=3)
dt_range = pd.Series(pd.date_range('2016-01-01', periods=12, freq=last_thu))
The resulting output:
0 2016-01-28
1 2016-02-25
2 2016-03-31
3 2016-04-28
4 2016-05-26
5 2016-06-30
6 2016-07-28
7 2016-08-25
8 2016-09-29
9 2016-10-27
10 2016-11-24
11 2016-12-29
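As a minimal sketch, the same offset can be passed to date_range together with the start and end timestamps from the question instead of periods. The variable name last_thursdays is only illustrative. Only dates that fall inside the range are returned, so the last Thursday of November (2016-11-24) is excluded because it lies after the end date:

```python
import pandas as pd

start = pd.Timestamp('2016-07-27 00:00:00')
end = pd.Timestamp('2016-11-18 00:00:00')

# weekday=3 means Thursday (Monday is 0)
last_thu = pd.offsets.LastWeekOfMonth(weekday=3)

# Every last Thursday of a month between start and end, inclusive
last_thursdays = pd.date_range(start, end, freq=last_thu)
print(last_thursdays.strftime('%Y-%m-%d').tolist())
# ['2016-07-28', '2016-08-25', '2016-09-29', '2016-10-27']
```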
I am trying to convert the timestamps in the stock data from Google Finance API to a more usable datetime format.
I have used data.table::fread to read the data here:
fread(<url>)
datetime open high low close volume
1: a1497619800 154.230 154.2300 154.2300 154.2300 500
2: 1 153.720 154.3200 153.7000 154.2500 1085946
3: 2 153.510 153.8000 153.2000 153.7700 34882
4: 3 153.239 153.4800 153.1400 153.4800 24343
5: 4 153.250 153.3000 152.9676 153.2700 20212
As you can see, the "datetime" format is rather strange. The format is described in this link:
The full timestamps are denoted by the leading 'a'. Like this: a1092945600. The number after the 'a' is a Unix timestamp. [...]
The numbers without a leading 'a' are "intervals". So, for example, the second row in the data set below has an interval of 1. You can multiply this number by our interval size [...] and add it to the last Unix Timestamp.
In my case, the "interval size" is 300 seconds (5 minutes). This format is restarted at the start of each new day and so trying to format it is quite difficult!
I can pull out the index positions where each day starts by using grep to search for "a":
newDay <- grep(df$V1, pattern = "a")
My idea was then to split the data frame into chunks at those index positions, extend the Unix times for each day separately, and combine the chunks back into a data.table before storing.
data.table::split looks like it will do the job, but I am unsure of how to supply it the day breaks to split by index positions, or if there is a more logical way to achieve the same result without having to break it down to each day.
Thanks.
You may use grepl to search for "a" in "datetime", which results in a boolean vector. cumsum the boolean to create a grouping variable - for each "a" (TRUE), the counter will increase by one.
Within each group, convert the first element to POSIXct, using an appropriate format and origin (and timezone, tz?). Add multiples of the 'interval size' (300 sec), using zero for the first element and the "datetime" multiples for the others.
d[ , time := {
t1 <- as.POSIXct(datetime[1], format = "a%s", origin = "1970-01-01")
.(t1 + c(0, as.numeric(datetime[-1]) * 300))
}
, by = .(cumsum(grepl("^a", datetime)))]
d
# datetime time
# 1: a1497619800 2017-06-16 15:30:00
# 2: 1 2017-06-16 15:35:00
# 3: 2 2017-06-16 15:40:00
# 4: 3 2017-06-16 15:45:00
# 5: 4 2017-06-16 15:50:00
# 6: a1500000000 2017-07-14 04:40:00
# 7: 3 2017-07-14 04:55:00
# 8: 5 2017-07-14 05:05:00
# 9: 7 2017-07-14 05:15:00
Some toy data:
d <- fread(input = "datetime
a1497619800
1
2
3
4
a1500000000
3
5
7")
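For readers working in pandas rather than data.table, the same idea (mark the rows with a leading "a", carry their Unix timestamp forward, then add multiples of the interval size) might be sketched roughly as below. This is an illustration on the toy data above, not part of the original answer. The names INTERVAL_SECONDS, is_anchor, base and steps are made up for the example, and the result is left in UTC rather than a local timezone:

```python
import pandas as pd

INTERVAL_SECONDS = 300  # interval size taken from the question (5 minutes)

d = pd.DataFrame({'datetime': ['a1497619800', '1', '2', '3', '4',
                               'a1500000000', '3', '5', '7']})

# True on rows that carry a full Unix timestamp (leading 'a')
is_anchor = d['datetime'].str.startswith('a')

# Unix timestamp on anchor rows, missing elsewhere, then carried forward
base = pd.to_numeric(d['datetime'].str[1:].where(is_anchor)).ffill()

# Interval multiplier: 0 on anchor rows, the row's own number otherwise
steps = pd.to_numeric(d['datetime'].where(~is_anchor)).fillna(0)

seconds = (base + steps * INTERVAL_SECONDS).astype('int64')
d['time'] = pd.to_datetime(seconds, unit='s', utc=True)
print(d)
```

The times printed here are in UTC, so they differ by the local UTC offset from the output shown above.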
With:
DT[grep('^a', date), datetime := as.integer(gsub('\\D+','',date))
][, datetime := zoo::na.locf(datetime)
][nchar(date) < 4, datetime := datetime + (300 * as.integer(date))
][, datetime := as.POSIXct(datetime, origin = '1970-01-01', tz = 'America/New_York')][]
you get:
date close high low open volume datetime
1: a1500298200 153.57 153.7100 153.57 153.5900 1473 2017-07-17 09:30:00
2: 1 153.51 153.8700 153.33 153.7500 205057 2017-07-17 09:35:00
3: 2 153.49 153.7800 153.34 153.5800 70023 2017-07-17 09:40:00
4: 3 153.68 153.7300 153.42 153.5400 53050 2017-07-17 09:45:00
5: 4 153.06 153.7500 153.06 153.7200 120899 2017-07-17 09:50:00
---
2348: 937 143.94 144.0052 143.91 143.9917 36651 2017-08-25 15:40:00
2349: 938 143.90 143.9958 143.90 143.9400 40769 2017-08-25 15:45:00
2350: 939 143.94 143.9500 143.87 143.8900 56616 2017-08-25 15:50:00
2351: 940 143.97 143.9700 143.89 143.9400 56381 2017-08-25 15:55:00
2352: 941 143.74 143.9700 143.74 143.9655 179811 2017-08-25 16:00:00
Used data:
DT <- fread('https://www.google.com/finance/getprices?i=300&p=30d&f=d,t,o,h,l,c,v&df=cpct&q=IBM', skip = 7, header = FALSE)
setnames(DT, 1:6, c('date','close','high','low','open','volume'))
When running the code below which is meant to create an end date for a daily time series (next day ahead), I am using c() to combine the shifted time series (which when run on its own returns a date format) with the last end date as specified. However, rather than adding 2017-05-09 to the end of it, it returns a numerical value, which while correctly shifted is of little use.
As can be seen in the example below it also does something strange with the last date which is supposed to be added.
Why is this happening and how can I fix it?
as.Date(portsim$period)
periods <- data.frame(period = portsim$period)
periods$start <- periods$period
periods$end <- c(periods$start[-1], as.Date("2017-05-09")) ##problematic line
> periods
period start end
1 2011-08-12 2011-08-12 2
2 2011-08-15 2011-08-15 3
3 2011-08-16 2011-08-16 4
4 2011-08-17 2011-08-17 5
5 2011-08-18 2011-08-18 6
6 2011-08-19 2011-08-19 7
7 2011-08-22 2011-08-22 8
8 2011-08-23 2011-08-23 9
9 2011-08-24 2011-08-24 10
10 2011-08-25 2011-08-25 11
...
1432 2017-05-04 2017-05-04 1433
1433 2017-05-05 2017-05-05 1434
1434 2017-05-08 2017-05-08 17295
It looks as if your dates are actually character or even factors. Try str(periods) to check.
If so, then replace your second line with
periods$start <- as.Date(as.character(periods$period))
Adding 1 to the date should work
period <- seq(as.Date("2011-08-12", "%Y-%m-%d"), length = 7, # 1 week long
by = 1) # seq by day
start <- period
periods <- data.frame(period, start)
library(lubridate) # can do it with base R as well
periods$end <- periods$start + days(1)
Given a dataframe:
df = pd.DataFrame({'c':[0,1,1,2,2,2],'date':pd.to_datetime(['2016-01-01','2016-02-01','2016-03-01','2016-04-01','2016-05-01','2016-06-05'])})
How can I get the beginning of the previous month for each date? The code below doesn't work for 6/5, and it leaves an extra time portion.
pd.to_datetime(df['date'], format="%Y%m") + pd.Timedelta(-1,unit='M') + MonthBegin(0)
EDIT
I have a workaround (2 steps back and 1 step forward):
(df['date']+ pd.Timedelta(-2,unit='M')+ MonthBegin(1)).dt.date
I don't like this; there should be something better.
You can first subtract MonthEnd to get to the end of the previous month, then MonthBegin to get to the beginning of the previous month:
df['date'] - pd.offsets.MonthEnd() - pd.offsets.MonthBegin()
The resulting output:
0 2015-12-01
1 2016-01-01
2 2016-02-01
3 2016-03-01
4 2016-04-01
5 2016-05-01
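As a quick check, and as one possible alternative, the sketch below runs the offset arithmetic on the question's dates and compares it with a route through monthly periods (convert to a month, step back one month, take the period's start). The names via_offsets and via_periods are only illustrative, and the period-based variant is an alternative of my own rather than part of the answer above:

```python
import pandas as pd

df = pd.DataFrame({
    'c': [0, 1, 1, 2, 2, 2],
    'date': pd.to_datetime(['2016-01-01', '2016-02-01', '2016-03-01',
                            '2016-04-01', '2016-05-01', '2016-06-05']),
})

# End of the previous month, then back to the beginning of that month
via_offsets = df['date'] - pd.offsets.MonthEnd() - pd.offsets.MonthBegin()

# Alternative: monthly period, one month back, start of that period
via_periods = (df['date'].dt.to_period('M') - 1).dt.to_timestamp()

print(via_offsets.dt.strftime('%Y-%m-%d').tolist())
# ['2015-12-01', '2016-01-01', '2016-02-01', '2016-03-01', '2016-04-01', '2016-05-01']
```

Both routes handle 2016-06-05 correctly, because neither depends on the date already being the first of the month.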
I've compiled a corpus of tweets sent over the past few months or so, which looks something like this (the actual corpus has a lot more columns and obviously a lot more rows, but you get the idea)
id when time day month year handle what
UK1.1 Sat Feb 20 2016 12:34:02 20 2 2016 dave Great goal by #lfc
UK1.2 Sat Feb 20 2016 15:12:42 20 2 2016 john Can't wait for the weekend
UK1.3 Sat Mar 01 2016 12:09:21 1 3 2016 smith Generic boring tweet
Now what I'd like to do in R is, using grep for string matching, plot the frequency of certain words/hashtags over time, ideally normalised by the number of tweets from that month/day/hour/whatever. But I have no idea how to do this.
I know how to use grep to create subsets of this dataframe, e.g. for all tweets including the #lfc hashtag, but I don't really know where to go from there.
The other issue is that whatever time scale is on my x-axis (hour/day/month etc.) needs to be numerical, and the 'when' column isn't. I've tried concatenating the 'day' and 'month' columns into something like '2.13' for February 13th, but this leads to the issue of R treating 2.13 as being 'earlier', so to speak, than 2.7 (February 7th) on mathematical grounds.
So basically, I'd like to make plots like these, where frequency of string x is plotted against time
Thanks!
Here's one way to count up tweets by day. I've illustrated with a simplified fake data set:
library(dplyr)
library(lubridate)
# Fake data
set.seed(485)
dat = data.frame(time = seq(as.POSIXct("2016-01-01"),as.POSIXct("2016-12-31"), length.out=10000),
what = sample(LETTERS, 10000, replace=TRUE))
tweet.summary = dat %>% group_by(day = date(time)) %>% # To summarise by month: group_by(month = month(time, label=TRUE))
summarise(total.tweets = n(),
A.tweets = sum(grepl("A", what)),
pct.A = A.tweets/total.tweets,
B.tweets = sum(grepl("B", what)),
pct.B = B.tweets/total.tweets)
tweet.summary
day total.tweets A.tweets pct.A B.tweets pct.B
1 2016-01-01 28 3 0.10714286 0 0.00000000
2 2016-01-02 27 0 0.00000000 1 0.03703704
3 2016-01-03 28 4 0.14285714 1 0.03571429
4 2016-01-04 27 2 0.07407407 2 0.07407407
...
Here's a way to plot the data using ggplot2. I've also summarized the data frame on the fly within ggplot, using the dplyr and reshape2 packages:
library(ggplot2)
library(reshape2)
library(scales)
ggplot(dat %>% group_by(Month = month(time, label=TRUE)) %>%
summarise(A = sum(grepl("A", what))/n(),
B = sum(grepl("B", what))/n()) %>%
melt(id.var="Month"),
aes(Month, value, colour=variable, group=variable)) +
geom_line() +
theme_bw() +
scale_y_continuous(limits=c(0,0.06), labels=percent_format()) +
labs(colour="", y="")
Regarding your date formatting issue, here's how to get numeric dates: You can turn the day month and year columns into a date using as.Date and/or turn the day, month, year, and time columns into a date-time column using as.POSIXct. Both will have underlying numeric values with a date class attached, so that R treats them as dates in plotting functions and other functions. Once you've done this conversion, you can run the code above to count up tweets by day, month, etc.
# Fake time data
dat2 = data.frame(day=sample(1:28, 10), month=sample(1:12,10), year=2016,
time = paste0(sample(c(paste0(0,0:9),10:12),10),":",sample(10:50,10)))
# Create date-time format column from existing day/month/year/time columns
dat2$posix.date = with(dat2, as.POSIXct(paste0(year,"-",
sprintf("%02d",month),"-",
sprintf("%02d", day)," ",
time)))
# Create date format column
dat2$date = with(dat2, as.Date(paste0(year,"-",
sprintf("%02d",month),"-",
sprintf("%02d", day))))
dat2
day month year time posix.date date
1 28 10 2016 01:44 2016-10-28 01:44:00 2016-10-28
2 22 6 2016 12:28 2016-06-22 12:28:00 2016-06-22
3 3 4 2016 11:46 2016-04-03 11:46:00 2016-04-03
4 15 8 2016 10:13 2016-08-15 10:13:00 2016-08-15
5 6 2 2016 06:32 2016-02-06 06:32:00 2016-02-06
6 2 12 2016 02:38 2016-12-02 02:38:00 2016-12-02
7 4 11 2016 00:27 2016-11-04 00:27:00 2016-11-04
8 12 3 2016 07:20 2016-03-12 07:20:00 2016-03-12
9 24 5 2016 08:47 2016-05-24 08:47:00 2016-05-24
10 27 1 2016 04:22 2016-01-27 04:22:00 2016-01-27
You can see that the underlying values of a POSIXct date are numeric (number of seconds elapsed since midnight on Jan 1, 1970), by doing as.numeric(dat2$posix.date). Likewise for a Date object (number of days elapsed since Jan 1, 1970): as.numeric(dat2$date).
I'm trying to define a custom week for a dataframe.
I have a dataframe with timestamps.
I've read the questions on here regarding isocalendar. While it does the job, it's not what I want.
I'm trying to define the weeks as running from Friday to Thursday.
For example:
Friday 2nd Jan 2015 would be the first day of the week.
Thursday 8th Jan 2015 would be the last day of the week.
And this would be week 1.
Is there a way to set a custom first day of the week, so that when I use the datetime accessor I get the result I expect?
df['Week_Number'] = df['Date'].dt.week
Here's one solution - convert your dates to a Period representing weeks that end on Thursday.
In [39]: df = pd.DataFrame({'Date':pd.date_range('2015-1-1', '2015-12-31')})
In [40]: df['Period'] = df['Date'].dt.to_period('W-THU')
In [41]: df['Week_Number'] = df['Period'].dt.week
In [44]: df.head()
Out[44]:
Date Period Week_Number
0 2015-01-01 2014-12-26/2015-01-01 1
1 2015-01-02 2015-01-02/2015-01-08 2
2 2015-01-03 2015-01-02/2015-01-08 2
3 2015-01-04 2015-01-02/2015-01-08 2
4 2015-01-05 2015-01-02/2015-01-08 2
Note that it follows the same convention as datetimes, where week 1 can be incomplete, so you may have to do a little extra munging if you want 1 to be the first complete week.
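One way the extra munging could look, if week 1 should be the first complete Friday-to-Thursday week as in the question's example, is to count whole weeks from the first Friday of each date's year. This is only a sketch under that assumption. The helper names year_start, days_to_friday and first_friday are illustrative, and days before the first Friday are given week 0 here, which is a choice made for the example:

```python
import pandas as pd

df = pd.DataFrame({'Date': pd.date_range('2015-01-01', '2015-12-31')})

# 1 January of each date's year
year_start = pd.to_datetime(df['Date'].dt.year.astype(str) + '-01-01')

# Days from 1 January to the first Friday (dayofweek 4 is Friday)
days_to_friday = (4 - year_start.dt.dayofweek) % 7
first_friday = year_start + pd.to_timedelta(days_to_friday, unit='D')

# Whole weeks elapsed since the first Friday, counted from 1;
# dates before the first Friday end up in week 0
df['Week_Number'] = (df['Date'] - first_friday).dt.days // 7 + 1
print(df.head(10))
```

With this numbering, Friday 2 January 2015 through Thursday 8 January 2015 form week 1, and Thursday 1 January 2015 falls in week 0.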