I have a data frame with the start and end of each month of 2019.
I need to fetch data from an API and write a CSV file named mydf plus the month (e.g. mydf-01.csv, mydf-02.csv, etc.).
For each month I need to fetch the data, write the CSV, and free the memory (to avoid the "not enough memory" error) before continuing with the next month.
For now I have the code below, but it gives the error "not enough memory", because the expected data for all of 2019 is around 3 GB.
I was thinking of writing a for loop, but maybe I can use an apply-family function instead?
Months: my_dates data.frame
This is how it looks:
from to
2019-01-01 2019-01-31
2019-02-01 2019-02-28
2019-03-01 2019-03-31
...
Code to generate the 12 months:
som <- function(x) as.Date(cut(as.Date(x), "month")) # start of month
eom <- function(x) som(som(x) + 32) - 1 # end of month
month_ranges <- function(from, to) {
s <- seq(som(from), as.Date(to), "month")
data.frame(from = pmax(as.Date(from), s), to = pmin(as.Date(to), eom(s)))
}
my_dates <- month_ranges(som("2019-01-01"), eom("2019-12-31"))
Code to fetch data:
Currently it fetches all months, holds them in memory and at the end
it rbinds them together. However, this approache gives error when
months range is too large because data is above 2GB. So I'd like it for each month to save the data to > a CSV and continue to the next month.
library(googleAuthR)
library(googleAnalyticsR)
my_fetch <- function(ga_id, d1, d2) {
google_analytics(ga_id,
date_range = c(d1, d2),
metrics = c("totalEvents"),
dimensions = c("ga:date", "ga:eventCategory", "ga:eventAction", "ga:eventLabel"),
anti_sample = TRUE,
anti_sample_batches = 1,
rows_per_call = 400)
}
my_fetches <- mapply(my_fetch, myviewID, my_dates$from, my_dates$to, SIMPLIFY = FALSE)
total <- do.call(rbind, my_fetches)
UPDATE 1:
Maybe it would also be possible for the loop to skip a month that generates an error, such as an API timeout, and continue with the next month?
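For example, something along the lines of this untested sketch, reusing my_fetch, myviewID and my_dates from above: fetch one month, write its CSV, free the memory, and skip ahead on an error such as an API timeout.
for (i in seq_len(nrow(my_dates))) {
  month_data <- tryCatch(
    my_fetch(myviewID, my_dates$from[i], my_dates$to[i]),
    error = function(e) {
      message("Month ", i, " failed, skipping: ", conditionMessage(e))
      NULL
    }
  )
  if (!is.null(month_data)) {
    write.csv(month_data, sprintf("mydf-%02d.csv", i), row.names = FALSE)
  }
  rm(month_data)
  gc()  # release memory before fetching the next month
}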
Related
I have a dataframe of Users with their DATE_CREATION and DATE_DELETION.
ACTIVE_USERS <- data.frame(DATE_CREATION = as.Date(c("2022-01-12", "2022-02-18", "2022-03-22", "2022-04-10", "2022-07-15")),
                           USER_ID = c("user123", "user311", "user245", "user245", "user213"),
                           DATE_DELETION = as.Date(c("2022-04-11", "2022-04-12", "2022-03-28", "2022-07-12", "2022-08-11")))
I am trying to create a graph that shows the number of active users each day (we consider a user active if it simply exists).
For that, I wrote the script below, which seems to do the job but is extremely slow and takes a few hours to run (because I have many more dates than in the example dataframe above). Any help improving the speed of this script is welcome; my current method follows.
I created a dataframe with a sequence of dates from the first creation date to the last, and joined the two dataframes by date of creation:
Axis_X_Date <- data.frame(seq(as.Date(min(ACTIVE_USERS$DATE_CREATION)),as.Date(max(ACTIVE_USERS$DATE_CREATION)),by = 1))
colnames(Axis_X_Date) <- c("DATE_CREATION")
Axis_X_Date <- merge(Axis_X_Date, ACTIVE_USERS, by = "DATE_CREATION", all.x =TRUE , all.y = FALSE)
ACTIVE_USERS <- Axis_X_Date
rm(Axis_X_Date)
colnames(ACTIVE_USERS) <- c("DATE_CREATION", "USER_ID", "DATE_DELETION")
Then I calculate the "Age" for each day.
ACTIVE_USERS$Age_Days <- ACTIVE_USERS$DATE_DELETION - ACTIVE_USERS$DATE_CREATION
And then I create a for loop, and apply the following actions for each row of the dataframe:
today_date <- Sys.Date()
table_page_all <- data.frame()
for(i in 1:nrow(ACTIVE_USERS)){
  # To simplify, I treat NAs of DATE_DELETION, meaning still-active accounts, as active until today
  if (is.na(ACTIVE_USERS$DATE_DELETION[i])) ACTIVE_USERS$DATE_DELETION[i] <- today_date
  # here I replicate USER_ID for each day where USER_ID was alive/active
  Dates_Active <- seq(ACTIVE_USERS$DATE_CREATION[i], ACTIVE_USERS$DATE_DELETION[i], by = "days")
  Ref_ID <- rep(ACTIVE_USERS$USER_ID[i], length(Dates_Active))
  # I combine these USER_ID and Dates_Active for each row of the ACTIVE_USERS dataframe
  data_frame <- data.frame(Dates_Active, Ref_ID)
  table_page_all <- rbind(table_page_all, data_frame)
  cat('Row: ', i, 'out of', nrow(ACTIVE_USERS), '\n')
}
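A much faster, vectorized sketch that avoids the per-row rbind entirely (assumption: a user counts as active on every day from DATE_CREATION through DATE_DELETION inclusive): count +1 on each creation day and -1 on the day after each deletion, then take a cumulative sum.
end_dates <- ACTIVE_USERS$DATE_DELETION
end_dates[is.na(end_dates)] <- Sys.Date()  # still-active accounts count until today
all_days <- seq(min(ACTIVE_USERS$DATE_CREATION), max(end_dates), by = "day")
starts <- table(factor(ACTIVE_USERS$DATE_CREATION, levels = as.character(all_days)))
ends   <- table(factor(end_dates + 1, levels = as.character(all_days)))
active_counts <- data.frame(Date = all_days,
                            n_active = cumsum(as.integer(starts) - as.integer(ends)))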
EDIT:
I've tried some changes and opted for the tidyquant package shared in the comments below.
This time I've set a range with variables, but I think I'm having trouble turning it into a function or a vector. This could either be the result of me writing a bad for loop or a limitation of the underlying library.
The idea behind this loop is that it pulls the adjusted prices for the period and then takes the first and last price to calculate the change (i.e. the return in share price).
I'm not sure, but would love some thoughts!
start_date = "2019-05-20"
end_date = "2019-05-30"
Symbol_list <- c("CTVA","IBM", "GOOG", "GE")
range_returns <- for (Symbol in Symbol_list) {
  frame <- tq_get(Symbol, get = "stock.prices", from = start_date, to = end_date, complete_cases = FALSE)[, 7]
  (frame[nrow(frame), ] - frame[1, ]) / frame[1, ]
}
Old stuff
Let's say I've got a dataframe
symbol <- c("GOOG", "IBM","GE","F","BKR")
name <- c("Google", "IBM","General Electric","Ford","Berkshire Hathaway")
df <- data.frame(symbol, name)
And I want to create a third column - df$custom_return that's defined based on my personal time frame.
I've tried working with the quantmod package and I'm having some trouble with its constraints.
Where I'm at:
I have to pull the entire price history first, which prevents creating a new column like so:
start_date <- "2003-01-05"
end_date <- "2019-01-05"
df$defined_period_return <- ROC(getSymbols(df$symbol, src = "yahoo", from = start_date, to = end_date, periodicity = "monthly"))
I know that I only want the adjusted close, which is the 6th column for the Yahoo source. So I could add the following and just pull the records into an environment:
price_history <- NULL
for (Symbol in df$symbol)
  price_history <- cbind(price_history,
                         getSymbols(Symbol, from = start_date,
                                    to = end_date, periodicity = "daily",
                                    auto.assign = FALSE)[, 6])
OK, that seems feasible, but it's not exactly seamless, and now I run into an issue if one of my symbols (tickers) falls outside the range of dates provided. For example, CTVA is one of them, and it didn't start trading until after the end date. The whole scrape stops right there. How do I skip over that error?
And let's say we solve the "snag" of not finding relevant records: how would you calculate the return for each symbol over different timelines? For example, Google didn't start trading until 2004. getSymbols does pull the price history once trading starts, but that return timeline is different from GE's, which had data at the start of my range.
No need for a for loop. You can do everything with tidyquant and dplyr. For the first and last observations of a group you can use the functions first and last from dplyr. See code below for a working example.
library(tidyquant)
library(dplyr)
start_date = "2019-05-20"
end_date = "2019-05-30"
Symbol_list <- c("CTVA","IBM", "GOOG", "GE")
stocks <- tq_get(Symbol_list, get = "stock.prices", from = start_date, to = end_date, complete_cases = FALSE)
stocks %>%
  group_by(symbol) %>%
  summarise(returns = (last(adjusted) / first(adjusted)) - 1) # calculate returns
# A tibble: 4 x 2
symbol returns
<chr> <dbl>
1 CTVA -0.0172
2 GE -0.0516
3 GOOG -0.0197
4 IBM -0.0402
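If a symbol still trips things up (e.g. no data at all in the range), here is a hedged sketch that wraps each download in tryCatch so one failing ticker doesn't stop the rest (this assumes a failed download signals an error; adjust if tq_get merely warns or returns an empty frame):
safe_returns <- lapply(Symbol_list, function(sym) {
  tryCatch({
    prices <- tq_get(sym, get = "stock.prices", from = start_date, to = end_date)
    data.frame(symbol = sym,
               returns = last(prices$adjusted) / first(prices$adjusted) - 1)
  }, error = function(e) NULL)  # skip symbols that error out
})
do.call(rbind, safe_returns)  # NULL entries simply drop out of the rbind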
For all you Bloomberg and R users out there:
I usually have no problem pulling Bloomberg data into R via the Rblpapi package, but have run across an issue when trying to pull index-level data.
The problem is that the code below returns erroneous results: it begins pulling data in 1986 (not 1950), and it leaves many values NA that should be populated. Using the Excel API, the data pulls in fine, but I need to add "days = a" for some of the fields since they don't begin until after 1950.
Reproducible example (assuming you have Bloomberg access):
# Load packages ----------------------------------------------------------
library("Rblpapi")
library("tidyverse")
library("lubridate")
# Connect to Bloomberg --------------------------------------------------
blpConnect()
# Pull equity index-level specific data over time for S&P 500, S&P Mid Cap (400) and S&P Small Cap (600) indices ----------------------
# Index tickers
tickers <- c("SPX Index", "MID Index", "SML Index")
# Bloomberg inputs
myField <- c("PX_LAST", "TRAIL_12M_EPS", "TRAIL_12M_DILUTED_EPS", "BEST_EPS", "PE_RATIO", "BEST_PE_RATIO",
"TRAIL_12M_EBITDA_PER_SHARE", "PX_TO_EBITDA", "PX_TO_BOOK_RATIO", "PX_TO_SALES_RATIO",
"PX_TO_FREE_CASH_FLOW", "EQY_DVD_YLD_12M", "TOT_DEBT_TO_EBITDA", "EV_TO_T12M_SALES", "EV_TO_T12M_EBITDA",
"TRAIL_12M_GROSS_MARGIN", "EBITDA_MARGIN", "TRAIL_12M_OPER_MARGIN", "TRAIL_12M_PROF_MARGIN",
"RETURN_ON_ASSET", "RETURN_COM_EQY", "RETURN_ON_CAP", "NET_DEBT_TO_EBITDA", "CUR_MKT_CAP", "AVERAGE_MARKET_CAP"
)
# Pull data
sp_indices_fundmtls_raw <- as.data.frame(bdh(tickers,
myField,
start.date = as.Date("1950-01-01"),
end.date = Sys.Date(),
include.non.trading.days = TRUE
)
)
Since this didn't work, I tried pulling the data using SPX Index only. Same issue. I then tried the call with fewer fields:
# Bloomberg inputs
myField <- c("PX_LAST", "TRAIL_12M_EPS", "TRAIL_12M_DILUTED_EPS", "BEST_EPS", "PE_RATIO", "BEST_PE_RATIO",
"TRAIL_12M_EBITDA_PER_SHARE", "PX_TO_EBITDA", "PX_TO_BOOK_RATIO", "PX_TO_SALES_RATIO",
"PX_TO_FREE_CASH_FLOW", "EQY_DVD_YLD_12M",
"TOT_DEBT_TO_EBITDA", "EV_TO_T12M_SALES", "EV_TO_T12M_EBITDA"
)
That worked better, but still started in 1964, not 1950. Again, the Excel API works fine and just returns NA where data is missing earlier, as I expected R to do.
This makes me think that there must be a field that needs an option or an override to pull the data correctly. I tried adding
ovrd <- c("PERIODICITY_OVERRIDE" = "D")
# Pull data
sp_indices_fundmtls_raw <- as.data.frame(bdh(tickers,
myField,
start.date = as.Date("1950-01-01"),
end.date = Sys.Date(),
include.non.trading.days = TRUE,
overrides = ovrd
)
)
But no luck.
Can anyone figure out the issue?
Thanks!
After much trial and error, I figured out a way to get the data.
I created a function to pull the data:
# Function to pull data
sp_indices_pull_fx <- function(myField, index_ticker) {
  df <- as.data.frame(bdh(index_ticker,
                          myField,
                          start.date = as.Date("1950-01-01"),
                          end.date = Sys.Date(),
                          include.non.trading.days = TRUE
                          )
                      )
  df
}
Then I used lapply to cycle through each field for a given ticker. For example:
# SP500
sp_500_pull <- lapply(myField, sp_indices_pull_fx, index_ticker = "SPX Index")
Then I combined those results into a single data frame:
# Merge
sp_500_fundmtls_raw = Reduce(function(...) merge(..., all = TRUE), sp_500_pull)
So in short, what worked was creating a function and feeding it one field at a time for a single ticker, as opposed to trying to pull multiple tickers and fields at once with the bdh function.
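If useful, the same per-field pattern extends to all three indices with one more lapply over the tickers; a sketch reusing sp_indices_pull_fx, myField and tickers from above:
all_indices <- lapply(tickers, function(tk) {
  pulls <- lapply(myField, sp_indices_pull_fx, index_ticker = tk)
  Reduce(function(...) merge(..., all = TRUE), pulls)
})
names(all_indices) <- tickers  # e.g. all_indices[["SPX Index"]]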
I am trying to complete a problem that pulls from two data sets that need to be combined into one. To get there, I need to rbind both data sets by their year-month information. Unfortunately, the first data set needs to be tallied by year-month, and I can't seem to figure out how to change the date so I have month-year info rather than month-day-year info.
This is data on avalanches, and I need to write code totaling the number of avalanches each month for the snow season, defined as Dec-Mar. How do I do that?
I keep trying to convert the format of the date to month-year, but after I change it with
as.Date(avalancheslc$Date, format="%y-%m")
all the values for Date turn into NAs... help!
# write the webscraper
library(XML)
library(RCurl)
avalanche<-data.frame()
avalanche.url<-"https://utahavalanchecenter.org/observations?page="
all.pages<-0:202
for(page in all.pages){
  this.url <- paste(avalanche.url, page, sep = "")
  this.webpage <- htmlParse(getURL(this.url))
  thispage.avalanche <- readHTMLTable(this.webpage, which = 1, header = T)
  avalanche <- rbind(avalanche, thispage.avalanche)
}
# subset the data to the Salt Lake Region
avalancheslc<-subset(avalanche, Region=="Salt Lake")
str(avalancheslc)
avalancheslc$monthyear<-format(as.Date(avalancheslc$Date),"%Y-%m")
# How can I tally the number of avalanches?
The final output of my dataset should be something like:
date avalanches
2000-1 18
2000-2 4
2000-3 10
2000-12 12
2001-1 52
This should work (I tried it on only one page, not all 203). Note the use of the option stringsAsFactors = F in the readHTMLTable function, and the need to add names because one column does not automatically get one.
library(XML)
library(RCurl)
library(dplyr)
library(lubridate) # for year() and month() below
avalanche <- data.frame()
avalanche.url <- "https://utahavalanchecenter.org/observations?page="
all.pages <- 0:202
for(page in all.pages){
  this.url <- paste(avalanche.url, page, sep = "")
  this.webpage <- htmlParse(getURL(this.url))
  thispage.avalanche <- readHTMLTable(this.webpage, which = 1, header = T,
                                      stringsAsFactors = F)
  names(thispage.avalanche) <- c('Date','Region','Location','Observer')
  avalanche <- rbind(avalanche, thispage.avalanche)
}
avalancheslc <- subset(avalanche, Region == "Salt Lake")
str(avalancheslc)
avalancheslc <- mutate(avalancheslc, Date = as.Date(Date, format = "%m/%d/%Y"),
monthyear = paste(year(Date), month(Date), sep = "-"))
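From there, the tally the question asks for should be something like this sketch (the Dec-Mar filter reflects the question's snow-season definition):
avalancheslc %>%
  filter(month(Date) %in% c(12, 1, 2, 3)) %>%  # snow season: Dec-Mar
  count(monthyear, name = "avalanches")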
I would like to download daily data from yahoo for the S&P 500, the DJIA, and 30-year T-Bonds, map the data to the proper time zone, and merge them with my own data. I have several questions.
My first problem is getting the tickers right. From yahoo's website, it looks like the tickers are: ^GSPC, ^DJI, and ^TYX. However, ^DJI fails. Any idea why?
My second problem is that I would like to constrain the time zone to GMT (I would like to ensure that all my data is on the same clock, and GMT seems like a neutral choice), but I couldn't get it to work.
My third problem is that I would like to merge the yahoo data with my own data, obtained by other means and available in a different format. It is also daily data.
Here is my attempt at constraining the data to the GMT time zone. Executed at the top of my R script.
Sys.setenv(TZ = "GMT")
# > Sys.getenv("TZ")
# [1] "GMT"
# the TZ variable is properly set
# but does not affect the time zone in zoo objects, why?
Here is my code to get the yahoo data:
library("tseries")
library("xts")
date.start <- "1999-12-31"
date.end <- "2013-01-01"
# tickers <- c("GSPC","TYX","DJI")
# DJI Fails, why?
# http://finance.yahoo.com/q?s=%5EDJI
tickers <- c("GSPC","TYX") # proceed without DJI
z <- zoo()
index(z) <- as.Date(format(time(z)),tz="")
for ( i in 1:length(tickers) )
{
cat("Downloading ", i, " out of ", length(tickers) , "\n")
x <- try(get.hist.quote(
instrument = paste0("^",tickers[i])
, start = date.start
, end = date.end
, quote = "AdjClose"
, provider = "yahoo"
, origin = "1970-01-01"
, compression = "d"
, retclass = "zoo"
, quiet = FALSE )
, silent = FALSE )
print(x[1:4]) # check that it's not empty
colnames(x) <- tickers[i]
z <- try( merge(z,x), silent = TRUE )
}
Here is the dput(head(df)) of my dataset:
df <- structure(list(A = c(-0.011489000171423, -0.00020300000323914,
0.0430639982223511, 0.0201549995690584, 0.0372899994254112, -0.0183669999241829
), B = c(0.00110999995376915, -0.000153000000864267, 0.0497750006616116,
0.0337960012257099, 0.014121999964118, 0.0127800004556775), date = c(9861,
9862, 9863, 9866, 9867, 9868)), .Names = c("A", "B", "date"
), row.names = c("0001-01-01", "0002-01-01", "0003-01-01", "0004-01-01",
"0005-01-01", "0006-01-01"), class = "data.frame")
I'd like to merge the data in df with the data in z, but I can't seem to get it to work.
I am new to R and very much open to your advice about efficiency, best practices, etc. Thanks.
EDIT: SOLUTIONS
On the first problem: following GSee's suggestions, the Dow Jones Industrial Average data may be downloaded with the quantmod package: thus, instead of the "^DJI" ticker, which is no longer available from yahoo, use the "DJIA" ticker. Note that there is no caret in the "DJIA" ticker.
On the second problem, Joshua Ulrich points out in the comments that "Dates don't have timezones because days don't have a time component."
On the third problem: The data frame appears to have corrupted dates, as pointed out by agstudy in the comments.
My solutions rely on the quantmod package and the attached zoo/xts packages:
library(quantmod)
Here is the code I have used to get proper dates from my csv file:
toDate <- function(x) { as.Date(as.character(x), format = "%Y%m%d") }
dtz <- read.zoo("myData.csv"
, header = TRUE
, sep = ","
, FUN = toDate
)
dtx <- as.xts(dtz)
The dates in the csv file were stored in a single column in the format "19861231". The key to getting correct dates was to wrap the date in as.character(). Part of this code was inspired by R - Stock market data from csv to xts. I also found the zoo/xts manuals helpful.
I then extract the date range from this dataset:
date.start <- start(dtx)
date.end <- end(dtx)
I will use those dates with quantmod's getSymbols function so that the other data I download will cover the same period.
Here is the code I have used to get all three tickers.
tickers <- c("^GSPC","^TYX","DJIA")
data <- new.env() # the data environment will store the data
do.call(cbind, lapply( tickers
, getSymbols
, from = date.start
, to = date.end
, env = data # data saved inside an environment
)
)
ls(data) # see what's inside the data environment
data$GSPC # access a particular ticker
Also note, as GSee pointed out in the comments, that the option auto.assign=FALSE cannot be used in conjunction with the option env=data (otherwise the download fails).
A big thank you for your help.
Yahoo doesn't provide historical data for ^DJI. Currently, it looks like you can get the same data by using the ticker "DJIA", but your mileage may vary.
Setting the TZ environment variable does work in this case because you're only dealing with Dates.
The df object you provided is yearly data beginning in the year 0001, so that's probably not what you wanted.
Here's how I would fetch and merge those series (or use an environment and only make one call to getSymbols):
library(quantmod)
do.call(cbind, lapply(c("^GSPC", "^TYX"), getSymbols, auto.assign=FALSE))
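And to merge those series with your own data (the third problem), a hedged sketch, assuming df's dates have been repaired so that its row names are real "YYYY-MM-DD" dates:
yahoo_data <- do.call(cbind, lapply(c("^GSPC", "^TYX"), getSymbols, auto.assign = FALSE))
own_xts <- xts(df[, c("A", "B")], order.by = as.Date(rownames(df)))
merged <- merge(yahoo_data, own_xts)  # outer join on the date index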