Download Google Trends data via R

I'm using the script below to download data from Google Trends. However, it doesn't return the last three days: I get results up to 28/09/2020, and today is 01/10/2020.
Is there a way to download more recent data?
Thank you.
Note: the script is retrieved from here.
library(gtrendsR)
library(tidyverse)
library(lubridate)

get_daily_gtrend <- function(keyword = 'Taylor Swift', geo = 'UA', from = '2013-01-01', to = '2019-08-15') {
  if (ymd(to) >= floor_date(Sys.Date(), 'month')) {
    to <- floor_date(ymd(to), 'month') - days(1)
    if (to < from) {
      stop("Specifying 'to' date in the current month is not allowed")
    }
  }

  # monthly series over the whole range, turned into per-month multipliers
  mult_m <- gtrends(keyword = keyword, geo = geo, time = paste(from, to))$interest_over_time %>%
    group_by(month = floor_date(date, 'month')) %>%
    summarise(hits = sum(hits)) %>%
    mutate(ym = format(month, '%Y-%m'),
           mult = hits / max(hits)) %>%
    select(month, ym, mult) %>%
    as_tibble()

  # start and end dates of each month in the range
  pm <- tibble(s = seq(ymd(from), ymd(to), by = 'month'),
               e = seq(ymd(from), ymd(to), by = 'month') + months(1) - days(1))

  # retrieve daily data one month at a time
  raw_trends_m <- tibble()
  for (i in seq(1, nrow(pm), 1)) {
    curr <- gtrends(keyword, geo = geo, time = paste(pm$s[i], pm$e[i]))
    print(paste('for', pm$s[i], pm$e[i], 'retrieved', nrow(curr$interest_over_time), 'days of data'))
    raw_trends_m <- rbind(raw_trends_m, curr$interest_over_time)
  }

  trend_m <- raw_trends_m %>%
    select(date, hits) %>%
    mutate(ym = format(date, '%Y-%m')) %>%
    as_tibble()

  # rescale each month's daily hits by its multiplier
  trend_res <- trend_m %>%
    left_join(mult_m, by = 'ym') %>%
    mutate(est_hits = hits * mult) %>%
    select(date, est_hits) %>%
    as_tibble() %>%
    mutate(date = as.Date(date))

  return(trend_res)
}

get_daily_gtrend(keyword = 'Taylor Swift', geo = 'UA', from = '2013-01-01', to = '2019-08-15')

This is how Google Trends data works. Even on the website, if you request data for any window longer than the last 7 days and up to the last 90 days, you get daily data only up to three days ago. So it is by design.
I'm not certain whether gtrendsR retrieves hourly data, but you can either manually download the last 7 days from the website to get hourly data up to a few hours before your request, or use the PyTrends package, which can return hourly data. Once you have the hourly data, you can of course easily aggregate it to daily.
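For what it's worth, gtrendsR does accept the same short-range time windows as the website, so something like the following should get near-real-time data (a minimal sketch; 'now 7-d' returns sub-daily data for the last 7 days, which is then aggregated to daily):

library(gtrendsR)
library(dplyr)
library(lubridate)

recent <- gtrends(keyword = 'Taylor Swift', geo = 'UA', time = 'now 7-d')$interest_over_time %>%
  mutate(day = as.Date(floor_date(date, 'day'))) %>%
  group_by(day) %>%
  # hits can contain "<1", which becomes NA under as.numeric()
  summarise(hits = sum(suppressWarnings(as.numeric(hits)), na.rm = TRUE))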

Related

Getting candlestick chart to display properly using a text / .txt file of historic stock prices in R

Hello there,
I have purchased the historic intraday prices of the S&P 500 (1 min through 1 hour) going back to 2005, because most stock charting packages stop reporting intraday prices around 2011 or 2016. I have successfully imported the prices and gotten R to read only market hours, excluding premarket and aftermarket. Two problems remain. First, I need the chart not to show Saturday and Sunday. The bigger problem is that the plot is NOT showing candlesticks, but bars, and they are very hard to read. I have tried increasing the size via (size = 4), but the bars overlap and are still not candlesticks. How can I get these to show as proper candlesticks? Thank you.
library(quantmod)
library(tidyquant)
library(tidyverse)
library(lubridate)  # for hour(), minute()
library(hms)        # for as_hms()
library(ggplot2)
library(readr)
library(ggforce)
library(dplyr)
dir <- "E:/Stock Trading/Historical Data/SPY_qjrt28"
setwd(dir)

data <- read_csv("SPY_30min.txt", col_names = FALSE)
names(data) <- tolower(c("DateTime", "Open", "High", "Low", "Close", "Volume"))
data

# clean the data and cache it
write_rds(data, "cleaned.rds")
spy30m <- read_rds("cleaned.rds")
firstwave <- filter(spy30m, datetime >= as.Date('2009-03-02'), datetime <= as.Date('2009-03-19'))

# adding more time objects to the dataset
data <- data %>%
  mutate(hour = hour(datetime),
         minute = minute(datetime),
         hms = as_hms(datetime))

# is the hour function working as expected? Yes!
data %>%
  select(datetime, hour) %>%
  sample_n(10)

# look at bins of observations at 30-minute intervals. Looks good!
data %>%
  group_by(hms) %>%
  summarise(count = n()) %>%
  arrange(hms) %>%
  print(n = 100)

# filter the dataset to only include times during regular market hours
data_regularmkt <- data %>%
  # `filter` is the dplyr function that limits the observations in a data frame;
  # `between` takes 3 arguments: an object/variable, a lower bound, and an upper bound
  filter(between(hms, as_hms("09:30:00"), as_hms("16:00:00")))

# look at it again
data_regularmkt %>%
  group_by(hms) %>%
  summarise(count = n()) %>%
  arrange(hms) %>%
  print(n = 100)

###########
firstwave <- filter(spy30m, datetime >= as.Date('2009-03-06'), datetime <= as.Date('2009-03-19'))
ggplot(firstwave, aes(x = datetime, y = close)) +
  # `size` is a fixed setting, not a data mapping, so it belongs outside aes()
  geom_candlestick(aes(open = open, high = high, low = low, close = close), size = 3)
Say we have a data frame df with the columns date (dttm format), open, high, low, close.
To overcome the issue that non-trading hours are shown, my first idea was to use another x-axis scale. Here is a version with a row index:
library(tidyverse)
library(lubridate)
library(tidyquant)

df <- df %>%
  arrange(date) %>%
  mutate(i = row_number())

# this is for the x-axis labels: keep the first timestamp of each day
df_x <- df %>%
  group_by(d = floor_date(date, "day")) %>%
  filter(date == min(date))

df %>%
  ggplot(aes(x = i)) +
  geom_candlestick(aes(open = open, low = low, high = high, close = close)) +
  scale_x_continuous(breaks = df_x$i,
                     labels = df_x$date)
The problem then is that if a contract is halted during trading hours, there will be no data for that period either, just as for nights or weekends. Those times, however, you probably do want to show.
One could play with tidyr's complete() or expand() to fill in the data first and still use my approach of plotting over an index x-scale.
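For example, something along these lines could pad each trading day to a full grid of 30-minute bars before building the row index (just a sketch, assuming 30-minute data; the padded rows get NA prices, so halts show up as gaps in the candles instead of being dropped):

df_padded <- df %>%
  group_by(d = floor_date(date, "day")) %>%
  # fill in any missing 30-minute timestamps within each day
  complete(date = seq(min(date), max(date), by = "30 mins")) %>%
  ungroup() %>%
  select(-d) %>%
  arrange(date) %>%
  mutate(i = row_number())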
Easier could be to use the plotly library:

library(plotly)

plt <- plot_ly(data = df, x = ~date,
               open = ~open, close = ~close,
               high = ~high, low = ~low,
               type = "candlestick")
plt
This is how to hide the non-trading hours:

plt %>% layout(showlegend = FALSE,
               xaxis = list(rangebreaks = list(
                              list(bounds = list(17, 9),
                                   pattern = "hour")),  # hide hours outside of 9am-5pm
                            dtick = 86400000.0 / 2,     # tick every half day (dtick is in ms)
                            tickformat = "%H:%M\n%b\n%Y"))
More information can be found here: https://plotly.com/r/time-series/#hiding-nonbusiness-hours and https://plotly.com/r/candlestick-charts/
As for the appearance of tidyquant's geom_candlestick, which you don't like, I also suggest trying out Plotly.

Manipulating data.frame while using cycles and storing values in a list

I have two pieces of code that manipulate and filter (by date) my data.frame, and they work perfectly. Now I want to run the code not just for one day, but for every day in the vector:

seq(from = as.Date('2020-03-02'), to = Sys.Date(), by = 'days')  # 538 days
The code I want to run for all the days between 2020-03-02 and today is:
KOKOKO <- data.frame %>%
  filter(DATE < '2020-03-02') %>%
  summarize(DATE = '2020-03-02', CZK = sum(Objem.v.CZK, na.rm = T))

STAVPTF <- data.frame %>%
  filter(DATE < '2020-03-02') %>%
  group_by(CP) %>%
  summarize(mnozstvi = last(AKTUALNI_MNOZSTVI_AKCIE), DATE = '2020-03-02') %>%
  select(DATE, CP, mnozstvi) %>%
  rbind(KOKOKO) %>%
  drop_na()
So instead of '2020-03-02' I want to fill in every day since '2020-03-02', one after another. Each KOKOKO and STAVPTF created for a unique day like this I want to save as a separate data.frame, and store all of them in a list.
We could use map to loop over the sequence and apply the code
library(dplyr)
library(purrr)
library(tidyr)  # for drop_na() used below

out <- map(s1, ~ data.frame %>%
             filter(DATE < .x) %>%
             summarize(DATE = .x, CZK = sum(Objem.v.CZK, na.rm = TRUE)))
As this is a repeated cycle, a function makes it cleaner:

f1 <- function(dat, date_col, group_col, Objem_col, aktualni_col, date_val) {
  filtered <- dat %>%
    filter({{date_col}} < date_val)

  KOKOKO <- filtered %>%
    summarize({{date_col}} := date_val,
              CZK = sum({{Objem_col}}, na.rm = TRUE))

  STAVPTF <- filtered %>%
    group_by({{group_col}}) %>%
    summarize(mnozstvi = last({{aktualni_col}}),
              {{date_col}} := date_val) %>%
    select({{date_col}}, {{group_col}}, mnozstvi) %>%
    bind_rows(KOKOKO) %>%
    drop_na()

  return(STAVPTF)
}
and call as
map(s1, ~ f1(data.frame, DATE, CP, Objem.v.CZK, AKTUALNI_MNOZSTVI_AKCIE, .x))
where
s1 <- seq(from=as.Date('2020-03-02'), to=Sys.Date(), by='days')
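Since the goal is one data frame per day collected in a list, it can help to name the list elements by their dates (a small addition on top of f1 above; set_names is from purrr):

out <- map(s1, ~ f1(data.frame, DATE, CP, Objem.v.CZK, AKTUALNI_MNOZSTVI_AKCIE, .x)) %>%
  set_names(as.character(s1))  # one STAVPTF data frame per day, named by its date
out[["2020-03-02"]]            # access the result for a single day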
It would be easier to answer your question if you provided a minimal reproducible example. That is easily done with the tidyverse's reprex package.
However, your KOKOKO code can be rewritten as a simple cumulative sum:

KOKOKO <- data.frame %>%
  arrange(DATE) %>%                                         # if necessary
  group_by(DATE) %>%
  summarise(CZK = sum(Objem.v.CZK), .groups = 'drop') %>%   # summarise per DATE (if necessary)
  mutate(CZK = cumsum(CZK) - CZK)                           # cumulative sum excluding the current DATE
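To see what cumsum(CZK) - CZK computes, a toy example:

czk <- c(10, 20, 30)
cumsum(czk) - czk  # 0 10 30: for each row, the sum of all previous rows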
Even the STAVPTF code can probably be rewritten without iterations. First find the last value of AKTUALNI_MNOZSTVI_AKCIE per CP and DATE; then assign this value to the next DATE:

STAVPTF <- data.frame %>%
  group_by(CP, DATE) %>%
  summarise(mnozstvi = last(AKTUALNI_MNOZSTVI_AKCIE), .groups = 'drop_last') %>%
  arrange(DATE) %>%             # if necessary
  mutate(DATE = lead(DATE))     # shift each value to the following DATE (within CP)

Grouping and Summing Data by Irregular Time Intervals (R language)

I am looking at a Stack Overflow post here: R: Count Number of Observations within a group
Here, daily data is created and summed/grouped at monthly intervals (as well as weekly intervals):
library(xts)
library(dplyr)

# create data
date_decision_made <- seq(as.Date("2014/1/1"), as.Date("2016/1/1"), by = "day")
date_decision_made <- format(as.Date(date_decision_made), "%Y/%m/%d")
property_damages_in_dollars <- rnorm(731, 100, 10)
final_data <- data.frame(date_decision_made, property_damages_in_dollars)

# weekly
weekly <- final_data %>%
  mutate(date_decision_made = as.Date(date_decision_made)) %>%
  group_by(week = format(date_decision_made, "%W-%y")) %>%
  summarise(total = sum(property_damages_in_dollars, na.rm = TRUE), Count = n())

# monthly
final_data %>%
  mutate(date_decision_made = as.Date(date_decision_made)) %>%
  group_by(month = format(date_decision_made, "%Y-%m")) %>%
  summarise(total = sum(property_damages_in_dollars, na.rm = TRUE), Count = n())
It seems that the format() function in R (https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/format) is being used to instruct the computer to "group and sum" the data at some fixed interval.
My question: is there a way to "group and sum" by irregular intervals, e.g. by 11-day periods, by 3-month periods, or by 2-year periods? (I guess 3 months can be written as 90 days... 2 years as 730 days.)
Is this possible?
Thanks
You can use lubridate's ceiling_date/floor_date to create groups at irregular intervals.
library(dplyr)
library(lubridate)

final_data %>%
  mutate(date_decision_made = as.Date(date_decision_made)) %>%
  group_by(group = ceiling_date(date_decision_made, '11 days')) %>%
  summarise(amount = sum(property_damages_in_dollars))
You can also specify intervals like ceiling_date(date_decision_made, '3 months') or ceiling_date(date_decision_made, '2 years').
Using data.table:

library(data.table)
library(lubridate)

setDT(final_data)[, .(amount = sum(property_damages_in_dollars)),
                  by = .(group = ceiling_date(as.IDate(date_decision_made), "11 days"))]
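For comparison, base R can produce the same bins, since cut.Date accepts interval specifications such as "11 days" (a sketch along those lines):

# bin the dates into consecutive 11-day intervals and sum per bin
final_data$group <- cut(as.Date(final_data$date_decision_made), breaks = "11 days")
aggregate(property_damages_in_dollars ~ group, data = final_data, FUN = sum)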

Question with using time series in R for forecasting via example

I'm working through this example. However, when I begin investigating the tk_ts output, I don't think it is taking the start/end dates I'm entering correctly, but I am unsure what the proper input is if I want it to start at 12-31-2019 and end at 7-17-2020:
daily_cases2 <- as_tibble(countrydatescases) %>%
  mutate(Date = as_date(date)) %>%
  group_by(country, Date) %>%
  summarise(total_cases = sum(total_cases))
daily_cases2$total_cases <- as.double(daily_cases2$total_cases)

# Nest
daily_cases2_nest <- daily_cases2 %>%
  group_by(country) %>%
  tidyr::nest()

# TS
daily_cases2_ts <- daily_cases2_nest %>%
  mutate(data.ts = purrr::map(.x = data,
                              .f = tk_ts,
                              select = -Date,
                              start = 2019-12-31,
                              freq = 1))
Here is what I get when I examine the output closely: the start and end do not match what I entered. When I go through the example's remaining steps with these parameters, the issue also shows up in the subsequent graph. I've tried varying the frequency and start parameters and it's just not making sense. Any suggestions?
You've given the start and end dates, but you haven't said what frequency you want. Given that you want the series to start at the end of 2019 and end in the middle of July 2020, I'm guessing you want a daily time series. In that case, the code should be:
daily_cases2_ts <- daily_cases2_nest %>%
  mutate(data.ts = purrr::map(.x = data,
                              .f = tk_ts,
                              select = -Date,
                              start = c(2019, 365),  # day 365 of year 2019
                              freq = 365))           # daily series
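To see why c(2019, 365) with freq = 365 corresponds to 2019-12-31, here is a tiny base ts illustration (hypothetical values):

x <- ts(1:3, start = c(2019, 365), frequency = 365)
time(x)  # 2019.997 2020.000 2020.003: one step per day, starting on the last day of 2019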

lubridate - select first non-Monday of every week.

Having a tibble of financial data, I would like to filter it by selecting only the first non-Monday of every week. Usually that will be a Tuesday, but it can sometimes be a Wednesday if the Tuesday is a holiday.
Here is my code, which works in most cases:
XLF <- quantmod::getSymbols("XLF", from = "2000-01-01", auto.assign = FALSE)

library(tibble)
library(lubridate)
library(dplyr)

xlf <- as_tibble(XLF) %>%
  rownames_to_column(var = "date") %>%
  select(date, XLF.Adjusted)
xlf$date <- ymd(xlf$date)

# We create Year, Month, ISO week and day-of-week columns,
# then we remove all the Mondays
xlf <- xlf %>%
  mutate(Year = year(date), Month = month(date),
         IsoWeek = isoweek(date), WDay = wday(date)) %>%
  filter(WDay != 2)

# Creating another tibble just for ease of comparison
xlf2 <- xlf %>%
  group_by(Year, IsoWeek) %>%
  filter(row_number() == 1) %>%
  ungroup()
That said, there are some issues that I have not been able to solve so far.
For instance, it skips "2002-12-31", which is a Tuesday, because it is considered part of the first ISO week of 2003. There are a few similar issues.
My question is: how could I select the first non-Monday of every week without such issues, while staying in the tidyverse (i.e. not having to use the xts/zoo classes)?
You can create a consistently increasing week number yourself. Perhaps not the most elegant solution, but it works fine for me:
as_tibble(XLF) %>%
  rownames_to_column(var = "date") %>%
  select(date, XLF.Adjusted) %>%
  mutate(date = ymd(date),
         Year = year(date),
         Month = month(date),
         WDay = wday(date),
         WDay_label = wday(date, label = TRUE)) %>%
  # increment the week number whenever the weekday number drops
  # compared to the previous row, or the previous row's date is
  # more than 6 days ago
  mutate(week_increment = (WDay < lag(WDay) | difftime(date, lag(date), unit = 'days') > 6)) %>%
  # lag() leaves the first element NA (there is no row above),
  # so we set the first element to TRUE
  mutate(week_increment = ifelse(row_number() == 1,
                                 TRUE,
                                 week_increment)) %>%
  # the cumulative sum of the booleans gives a week number
  mutate(week_number = cumsum(week_increment)) %>%
  filter(WDay != 2) %>%
  # week_number never resets, so grouping by it alone avoids
  # re-splitting weeks that straddle a year boundary
  group_by(week_number) %>%
  filter(row_number() == 1)
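As an aside, the year-boundary problem in the question comes from pairing year() with isoweek(). lubridate also provides isoyear(), which is designed to accompany isoweek(), so a fix closer to the original attempt could look like this (a sketch, reusing the xlf tibble from the question):

xlf %>%
  filter(WDay != 2) %>%                                          # drop Mondays (already done above, but harmless)
  group_by(IsoYear = isoyear(date), IsoWeek = isoweek(date)) %>%
  slice(1) %>%                                                   # first remaining row of each ISO week
  ungroup()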
