Handling internet connection errors in R

I'm trying to download several stocks from Google, but every time the connection drops, R stops the loop. How can I handle this problem?
library(quantmod)

stocks <- c(
  'MSFT',
  'GOOG',
  ...
)

j <- 1  # column index into the pre-allocated 'prices' object
for (symbol in stocks) {
  stock_price <- getSymbols(symbol, src = 'google', from = startDate, to = endDate, auto.assign = FALSE)
  prices[, j] <- stock_price[, 1]
  j <- j + 1
}

From the quantmod manual ("quantmod.pdf"):
"If auto.assign=FALSE or env=NULL (as of 0.4-0) the data will be returned from the call, and will require the user to assign the results himself. Note that only one symbol at a time may be requested when auto assignment is disabled."
You are trying to request more than one ticker symbol at a time with the auto.assign parameter set to FALSE, and this is not allowed. However, you should be able to obtain all your symbols at once by adapting the following code:
data <- new.env()
getSymbols(stocks, src = 'google', from = startDate, to = endDate, env = data, auto.assign = TRUE)
plot(data$MSFT)
Pay careful attention to the R manual for getSymbols:
"Data is fetched through one of the available getSymbols methods and saved in the env specified - the .GlobalEnv by default."

how to write out multiple files in R?

I am a newbie R user. Now, I have a question about writing out multiple files with different names. Let's say that my data has the following structure:
IV_HAR_m1 <- matrix(rnorm(1:100), ncol = 30, nrow = 2000)
DV_HAR_m1 <- matrix(rnorm(1:100), ncol = 10, nrow = 2000)
I am trying to estimate multiple LASSO regressions. At the beginning I was storing the iterations in one object called Dinamic_beta. This object was stored in a single file, and it saved the required information each time my code iterated.
To do this I was using stew(), which belongs to the pomp package, but the whole process takes 5 or 6 days and I am worried about a power outage or a failure of my computer.
Now I want to save each iteration's environment in its own .Rnd file, but I do not know how to do that. The code that I am using is the following:
library(glmnet)
library(Matrix)
library(pomp)

space <- 7  # the number of files that I would want to create

Dinamic_betas <- array(NA, c(10, 31, (nrow(IV_HAR_m1) - space)))
dimnames(Dinamic_betas) <- list(NULL, NULL, NULL)

set.seed(12345)

stew(  # stew saves the environment in a .Rnd file
  file = "Dinamic_LASSO_RD", {  # the name required by stew for creating one file with all information
    for (i in 1:dim(Dinamic_betas)[3]) {
      tryCatch(  # print messages
        expr = {
          cv_dinamic <- cv.glmnet(IV_HAR_m1[i:(space + i - 1), ],
                                  DV_HAR_m1[i:(space + i - 1), ],
                                  alpha = 1, family = "mgaussian", thresh = 1e-08, maxit = 10^9)
          LASSO_estimation_dinamic <- glmnet(IV_HAR_m1[i:(space + i - 1), ],
                                             DV_HAR_m1[i:(space + i - 1), ],
                                             alpha = 1, lambda = cv_dinamic$lambda.min, family = "mgaussian")
          coefs <- as.matrix(do.call(cbind, coef(LASSO_estimation_dinamic)))
          Dinamic_betas[, , i] <- t(coefs)
        },
        error = function(e) {
          message("Caught an error!")
          print(e)
        },
        warning = function(w) {
          message("Caught a warning!")
          print(w)
        },
        finally = {
          message("All done, quitting.")
        }
      )
      if (i %% 400 == 0) { print(i) }
    }
  }
)
If someone can suggest another package that stores the outputs in different files, I will be grateful.
Try adding this just before the close of your loop:
save.image(paste0("Results_iteration_", i, ".RData"))
This should save your entire workspace to disk on every iteration. You can then use load() to restore the workspace from any iteration. Let me know if this works.
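If writing the whole workspace every iteration turns out to be too slow or too large, a lighter variant (my own suggestion, not part of the answer above) is to checkpoint only the results array with saveRDS(), for example at the same points where the original code prints progress:

# inside the loop, e.g. every 400 iterations, write only the results array
if (i %% 400 == 0) {
  saveRDS(Dinamic_betas, file = paste0("Dinamic_betas_up_to_", i, ".rds"))
}

# later, restore a checkpoint with readRDS(), for example:
# Dinamic_betas <- readRDS("Dinamic_betas_up_to_400.rds")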

Wait for rgbif download to complete before proceeding

I am developing a small application in R Shiny. Part of the application will need to query GBIF to download species occurrence data. This is possible using rgbif. The function rgbif::occ_download() will download the data and rgbif::occ_download_meta() will check whether GBIF has fulfilled your request. For example:
geometry <- "POLYGON((30.1 10.1,40 40,20 40,10 20,30.1 10.1))"
res <- occ_download(paste0("geometry within ", geometry), type = "within", format = "SPECIES_LIST")
occ_download_meta(res)
<<gbif download metadata>>
Status: RUNNING
Format: SPECIES_LIST
Download key: 0004089-190415153152247
Created: 2019-04-25T09:18:20.952+0000
Modified: 2019-04-25T09:18:21.045+0000
Download link: http://api.gbif.org/v1/occurrence/download/request/0004089-190415153152247.zip
Total records: 0
So far, so good. However, the next function, rgbif::occ_download_get(), can't download the data for downstream analysis until the download has finished on GBIF's side (i.e. until occ_download_meta(res) reports Status: SUCCEEDED).
How can I make the session wait until the download from GBIF has been completed? I cannot hard code a wait time into the script as different sized extents will take GBIF longer or shorter amounts of time to process. Also, the number of other active users querying the service could also alter wait times. I therefore need some sort of flag where Status == Succeeded before proceeding.
I have copied some skeleton code with comments below.
library(rgbif)
geometry <- "POLYGON((30.1 10.1,40 40,20 40,10 20,30.1 10.1))" # Define boundary
res <- occ_download(paste0("geometry within ", geometry), type = "within", format = "SPECIES_LIST")
# WAIT HERE UNTIL Status == SUCCEEDED
occ_download_meta(res)
x <- occ_download_get(res, overwrite = TRUE) # Download data
data <- occ_download_import(x) # Import into R
rgbif maintainer here. You could do something like we have within the occ_download_queue() function:
res <- occ_download(paste0("geometry within ", geometry), type = "within", format = "SPECIES_LIST")
still_running <- TRUE
status_ping <- 3
while (still_running) {
  meta <- occ_download_meta(res)
  status <- tolower(meta$status)
  still_running <- !status %in% c("succeeded", "killed")  # keep polling until the download finishes or is killed
  Sys.sleep(status_ping)  # sleep between pings
}
You probably want to check for both "succeeded" and "killed", and do something different if the download was killed.
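For example, a minimal sketch of that check after the loop (my own addition, not the maintainer's code) could be:

meta <- occ_download_meta(res)
if (tolower(meta$status) == "killed") {
  stop("GBIF killed the download request")  # handle the failure however suits your app
} else {
  x <- occ_download_get(res, overwrite = TRUE)  # fetch the finished archive
  data <- occ_download_import(x)                # import into R
}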

Read in large text file in chunks

I'm working with limited RAM (AWS free-tier EC2 server, 1 GB).
I have a relatively large txt file, "vectors.txt" (800 MB), that I'm trying to read into R. Having tried various methods, I have failed to read this file into memory.
So, I was researching ways of reading it in chunks. I know that the dimensions of the resulting data frame should be 300K x 300. If I were able to read in the file, say 10K lines at a time, and then save each chunk as an RDS file, I would be able to loop over the results and get what I need, albeit a little slower and with less convenience than having the whole thing in memory.
To reproduce:
# Get data
url <- 'https://github.com/eyaler/word2vec-slim/blob/master/GoogleNews-vectors-negative300-SLIM.bin.gz?raw=true'
file <- "GoogleNews-vectors-negative300-SLIM.bin.gz"
download.file(url, file) # takes a few minutes
R.utils::gunzip(file)
# word2vec R library
library(rword2vec)
w2v_gnews <- "GoogleNews-vectors-negative300-SLIM.bin"
bin_to_txt(w2v_gnews, "vector.txt")
So far so good. Here's where I struggle:
word_vectors = as.data.frame(read.table("vector.txt",skip = 1, nrows = 10))
Returns "cannot allocate a vector of size [size]" error message.
Tried alternatives:
word_vectors <- ff::read.table.ffdf(file = "vector.txt", header = TRUE)
Same, not enough memory
word_vectors <- readr::read_tsv_chunked("vector.txt",
callback = function(x, i) saveRDS(x, i),
chunk_size = 10000)
Resulted in:
Parsed with column specification:
cols(
`299567 300` = col_character()
)
|=========================================================================================| 100% 817 MB
Error in read_tokens_chunked_(data, callback, chunk_size, tokenizer, col_specs, :
Evaluation error: bad 'file' argument.
Is there any other way to turn vectors.txt into a data frame? Maybe by breaking it into pieces and reading in each piece, saving as a data frame and then to rds? Or any other alternatives?
EDIT:
From Jonathan's answer below, tried:
library(rword2vec)
library(RSQLite)
# Download pre trained Google News word2vec model (Slimmed down version)
# https://github.com/eyaler/word2vec-slim
url <- 'https://github.com/eyaler/word2vec-slim/blob/master/GoogleNews-vectors-negative300-SLIM.bin.gz?raw=true'
file <- "GoogleNews-vectors-negative300-SLIM.bin.gz"
download.file(url, file) # takes a few minutes
R.utils::gunzip(file)
w2v_gnews <- "GoogleNews-vectors-negative300-SLIM.bin"
bin_to_txt(w2v_gnews,"vector.txt")
# from https://privefl.github.io/bigreadr/articles/csv2sqlite.html
csv2sqlite <- function(tsv,
                       every_nlines,
                       table_name,
                       dbname = sub("\\.txt$", ".sqlite", tsv),
                       ...) {
  # Prepare reading
  con <- RSQLite::dbConnect(RSQLite::SQLite(), dbname)
  init <- TRUE
  fill_sqlite <- function(df) {
    if (init) {
      RSQLite::dbCreateTable(con, table_name, df)
      init <<- FALSE
    }
    RSQLite::dbAppendTable(con, table_name, df)
    NULL
  }
  # Read and fill by parts
  bigreadr::big_fread1(tsv, every_nlines,
                       .transform = fill_sqlite,
                       .combine = unlist,
                       ... = ...)
  # Returns
  con
}
vectors_data <- csv2sqlite("vector.txt", every_nlines = 1e6, table_name = "vectors")
Resulted in:
Splitting: 12.4 seconds.
Error: nThread >= 1L is not TRUE
Another option would be to do the processing on-disk, e.g. using an SQLite file and dplyr's database functionality. Here's one option: https://stackoverflow.com/a/38651229/4168169
To get the CSV into SQLite you can also use the bigreadr package which has an article on doing just this: https://privefl.github.io/bigreadr/articles/csv2sqlite.html
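As a rough sketch of the dplyr side (assuming the csv2sqlite() call above produced vector.sqlite containing a table named vectors), you can then query the file lazily without pulling it all into RAM:

library(DBI)
library(RSQLite)
library(dplyr)  # tbl() on a database connection also needs dbplyr installed

con <- dbConnect(RSQLite::SQLite(), "vector.sqlite")
vectors <- tbl(con, "vectors")  # lazy reference; nothing is loaded yet

# pull only what you need into memory
first_rows <- vectors %>% head(100) %>% collect()

dbDisconnect(con)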

"download.file" Incomplete and inconsistent downloads

I am trying to understand why I am having inconsistent results downloading CSV files from a website archive. I don't know if the problem is at my end, at the other side, or just failed communication in between. Any suggestions are welcome.
I am using an R script to automate the downloading of CSV files by month and year from the HYCOM archives for analysis. The script generated the following URL: 'http://ncss.hycom.org/thredds/ncss/GLBu0.08/reanalysis/3hrly?var=salinity&var=water_temp&var=water_u&var=water_v&latitude=13.875&longitude=-72.25&time_start=2012-05-01T00:00:00Z&time_end=2012-05-31T21:00:00Z&vertCoord=&accept=csv'
Running download.file() successfully obtains the file about half the time and fails otherwise; the failed run is shown in the image, and the log of a successful run is below.
Successful Log
#download one month of data
MM = '05'
LastDay = ndays(paste(year,MM,'01',sep="-"))
H1 = paste( as shown in image)
H2 = '-01T00:00:00Z&time_end='
#H3 = 'T21:00:00Z&timeStride=1&vertCoord=&accept=csv'
H3 = 'T21:00:00Z&vertCoord=&accept=csv'
HtmlLink <- paste(H1,year,"-",MM,H2,year,"-",MM,"-",LastDay,H3,sep="")
dest = paste("../data/",year,MM,".csv",sep="")
download.file(url =HtmlLink ,destfile=dest,cacheOK=FALSE, method="auto")
trying URL 'as shown in image'
Content type 'text/plain;charset=UTF-8' length unknown
..................................................
................downloaded 666 KB
user system elapsed
28.278 6.605 5201.421
Failed Log (as shown in image)
You can/should turn the following into a function accepting parameters and replace the hardcoded values with said params (I used httr:::parse_query() to make the list):
library(httr)
URL <- "http://ncss.hycom.org/thredds/ncss/GLBu0.08/reanalysis/3hrly"
params <- list(var = "salinity",
               var = "water_temp",
               var = "water_u",
               var = "water_v",
               latitude = "13.875",
               longitude = "-72.25",
               time_start = "2012-05-01T00:00:00Z",
               time_end = "2012-05-31T21:00:00Z",
               vertCoord = "",
               accept = "csv")

dest_file <- "filename"

res <- GET(url = URL,
           query = params,
           timeout(360),
           write_disk(dest_file, overwrite = TRUE),
           verbose())
warn_for_status(res)
You can (eventually) remove the verbose() from that GET call, but it's helpful during debugging.
The main issue is that this server is s l o w and times out before the transfer is complete. Even the value of 360 might not be enough (you'll need to experiment).
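A sketch of that parameterisation (the function name and the year/month/latitude/longitude arguments are my own choices, not from the answer above) might look like:

library(httr)

get_hycom_month <- function(year, month, last_day, dest_file,
                            latitude = "13.875", longitude = "-72.25", wait = 360) {
  res <- GET(url = "http://ncss.hycom.org/thredds/ncss/GLBu0.08/reanalysis/3hrly",
             query = list(var = "salinity", var = "water_temp", var = "water_u", var = "water_v",
                          latitude = latitude, longitude = longitude,
                          time_start = sprintf("%s-%s-01T00:00:00Z", year, month),
                          time_end = sprintf("%s-%s-%sT21:00:00Z", year, month, last_day),
                          vertCoord = "", accept = "csv"),
             timeout(wait),
             write_disk(dest_file, overwrite = TRUE))
  warn_for_status(res)
  invisible(res)
}

# e.g. get_hycom_month("2012", "05", "31", "../data/201205.csv")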
Many thanks to all for the help. The suggestion by hrbrmstr appears to be an elegant answer and I look forward to testing it. However, I was unable to install a working copy using the package manager. Installation from a local download also failed, since R complained that the OS X version I downloaded from CRAN was a Windows version, not an OS X one. Yes, I repeated the download several times to make sure I had the right package.
As suggested by Cyrus Mohammadian, I tried the procedures in the curl package.
Running the same URL, download.file() transfers failed about 50% of the time. Using curl reduced the transfer times from 2000 seconds to 1000 seconds, with no failures in 12 tries.
library(curl)

## calculate number of days in month
ndays <- function(d) {
  last_days <- 28:31
  rev(last_days[which(!is.na(
    as.Date(paste(substr(d, 1, 8), last_days, sep = ''), '%Y-%m-%d')))])[1]
}

nlat = 13.875
elon = -72.25

# download one month of data
year = 2008
MM = '01'
LastDay = ndays(paste(year, MM, '01', sep = "-"))
H1 = paste('http://ncss.hycom.org/thredds/ncss/GLBu0.08/reanalysis/3hrly?var=salinity&var=water_temp&var=water_u&var=water_v&latitude=',
           nlat, '&longitude=', elon, '&time_start=', sep = "")
H2 = '-01T00:00:00Z&time_end='
H3 = 'T21:00:00Z&timeStride=1&vertCoord=&accept=csv'
HtmlLink <- paste(H1, year, "-", MM, H2, year, "-", MM, "-", LastDay, H3, sep = "")
dest = paste("../data/", year, MM, ".csv", sep = "")
curl_download(url = HtmlLink, destfile = dest, quiet = FALSE, mode = "wb")

Error in getSymbols, must use auto.assign=TRUE for multiple symbol requests

I'm trying to write a program that takes a .csv file of stock symbols and tests them against each other for things like cointegration. However, when I run the following code, quantmod gives me an error about having to use auto.assign = TRUE for multiple symbol requests.
library(quantmod)

getprices <- function(sym) {
  # get prices from the last 7 years
  prices <- getSymbols(sym, from = Sys.Date() - (365 * 7), auto.assign = FALSE)
  # extract closing prices
  prices <- Cl(prices)
  return(prices)
}
symbols1 <- c('TSN', 'MSFT')
symbols2 <- c('AAPL', 'NFLX')
container <- c()

addprices <- function(symbols1, symbols2) {
  for (i in symbols1) {
    for (g in symbols2) {
      i <- getprices(i)
      g <- getprices(g)
      container <- i + g
    }
  }
  return(container)
}
When I run addprices(symbols1, symbols2) I get this error:
Error in getSymbols(sym, from = Sys.Date() - (365 * 7), auto.assign = FALSE) :
must use auto.assign=TRUE for multiple Symbols requests
Calls: addprices -> getprices -> getSymbols
I know when I do this I should get that error, and I believe this is what the error is referring to:
getSymbols(sym, from = Sys.Date() - (365 * 7), auto.assign = FALSE)
However, what I'm doing isn't that, so what gives? Any advice? Is there a workaround?
I googled this and there really weren't any relevant questions/answers.
The problem is that you're overwriting the iterator i inside the g for loop. The first iteration over g works fine, but i is no longer symbols1[1] in the second iteration; it's the output from getprices(i).
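A minimal sketch of the fix (keeping everything else as-is) is simply to store the downloaded prices in variables other than the loop iterators:

addprices <- function(symbols1, symbols2) {
  container <- c()
  for (i in symbols1) {
    for (g in symbols2) {
      prices_i <- getprices(i)  # don't overwrite the iterator 'i'
      prices_g <- getprices(g)
      container <- prices_i + prices_g
    }
  }
  return(container)
}

Note that, as in the original code, container is overwritten on every pass of the inner loop, so you may also want to accumulate the results (for example in a list) rather than keep only the last pair.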

Resources