I want to pull financial data using an API. Here is what I do:
# load JSON from the CoinDesk API
library("rjson")
json_file <- "https://api.coindesk.com/v1/bpi/currentprice/USD.json"
json_data <- fromJSON(paste(readLines(json_file), collapse = ""))
# get the JSON content as a data.frame
x <- data.frame(json_data$time$updated,
                json_data$time$updatedISO,
                json_data$time$updateduk,
                json_data$bpi$USD)
x
But the main problem is that the information changes every minute, so I can't gather a history. Is there a way to make R connect to this site on its own every minute (i.e. in real-time mode) and collect the data each time? The collected data should be saved in C:/Myfolder. Is it possible to do this?
Something like this could do it
library("rjson")
json_file <- "https://api.coindesk.com/v1/bpi/currentprice/USD.json"
numOfTimes <- 2L # how many times to run in total
sleepTime <- 60L # time to wait between iterations (in seconds)
iteration <- 0L
while (iteration < numOfTimes) {
# gather data
json_data <- fromJSON(paste(readLines(json_file), collapse=""))
# get json content as data.frame
x = data.frame(json_data$time$updated,json_data$time$updatedISO,json_data$time$updateduk,json_data$bpi$USD)
# create file to save in 'C:/Myfolder'
# alternatively, create just one .csv file and update it in each iteration
nameToSave <- nameToSave <- paste('C:/Myfolder/',
gsub('\\D','',format(Sys.time(),'%F%T')),
'json_data.csv', sep = '_')
# save the file
write.csv(x, nameToSave)
# update counter and wait
iteration <- iteration + 1L
Sys.sleep(sleepTime)
}
Note that this requires an open R session (you could create a .exe or .bat file and have it run in the background).
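As the comment in the loop hints, another option is to append every reading to one growing CSV instead of writing a new file each minute. A minimal sketch, assuming the file name history_bpi.csv and that C:/Myfolder already exists:
library("rjson")

json_file  <- "https://api.coindesk.com/v1/bpi/currentprice/USD.json"
historyCsv <- "C:/Myfolder/history_bpi.csv"   # single growing file; the name is an assumption

for (iteration in 1:2) {                      # 2 iterations just for illustration
  json_data <- fromJSON(paste(readLines(json_file), collapse = ""))
  x <- data.frame(json_data$time$updated,
                  json_data$time$updatedISO,
                  json_data$time$updateduk,
                  json_data$bpi$USD)
  already <- file.exists(historyCsv)
  # write the header only the first time, afterwards just append rows
  write.table(x, historyCsv, sep = ",", row.names = FALSE,
              col.names = !already, append = already)
  Sys.sleep(60)
}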
I see the chunk_size argument in arrow::write_parquet(), but it doesn't seem to behave as expected. I would expect the code below to generate 3 separate parquet files, but only one is created, and nrow > chunk_size.
library(arrow)
# .parquet dir and file path
td <- tempdir()
tf <- tempfile("", td, ".parquet")
on.exit(unlink(tf))
# dataframe with 3e6 rows
n <- 3e6
df <- data.frame(x = rnorm(n))
# write with chunk_size 1e6, and view directory
write_parquet(df, tf, chunk_size = 1e6)
list.files(td)
Returns one file instead of 3:
[1] "25ff74854ba6.parquet"
# read parquet and show all rows are there
nrow(read_parquet(tf))
Returns:
[1] 3000000
We can't pass multiple file names to write_parquet(), and I don't want to partition, so write_dataset() also seems inapplicable.
The chunk_size parameter refers to how much data to write to disk at once, rather than the number of files produced. The write_parquet() function is designed to write individual files, whereas, as you said, write_dataset() allows partitioned file writing. I don't believe that splitting files on any other basis is supported at the moment, though it is a possibility in future releases. If you had a specific reason for wanting 3 separate files, I'd recommend separating the data into multiple datasets first and then writing each of those via write_parquet().
(Also, I am one of the devs on the R package, and can see that this isn't entirely clear from the docs, so I'm going to open a ticket to update those - thanks for flagging this up)
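For completeness, if partitioning were acceptable after all, write_dataset() can produce one file per 1e6-row block; a short sketch, where the helper column part is invented purely for the example:
library(arrow)

n  <- 3e6
df <- data.frame(x = rnorm(n))

# hypothetical partition column: 1, 2, 3 for consecutive blocks of 1e6 rows
df$part <- ceiling(seq_len(n) / 1e6)

out_dir <- file.path(tempdir(), "parts")
write_dataset(df, out_dir, partitioning = "part")  # one sub-directory (and file) per block
list.files(out_dir, recursive = TRUE)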
Short of an argument to write_parquet() like max_row that defaults to a reasonable number (like 1e6), we can do something like this:
library(arrow)
library(uuid)
library(glue)
library(dplyr)
write_parquet_multi <- function(df, dir_out, max_row = 1e6){
  # Only one parquet file is needed
  if(nrow(df) <= max_row){
    cat("Saving", formatC(nrow(df), big.mark = ","),
        "rows to 1 parquet file...")
    write_parquet(
      df,
      glue("{dir_out}/{UUIDgenerate(use.time = FALSE)}.parquet"))
    cat("done.\n")
  }
  # Multiple parquet files are needed
  if(nrow(df) > max_row){
    count <- ceiling(nrow(df)/max_row)
    start <- seq(1, count*max_row, max_row)
    end   <- c(seq(max_row, nrow(df), max_row), nrow(df))
    uuids <- UUIDgenerate(n = count, use.time = FALSE)
    cat("Saving", formatC(nrow(df), big.mark = ","),
        "rows to", count, "parquet files...")
    for(j in 1:count){
      write_parquet(
        dplyr::slice(df, start[j]:end[j]),
        glue("{dir_out}/{uuids[j]}.parquet"))
    }
    cat("done.\n")
  }
}
# .parquet dir and file path
td <- tempdir()
tf <- tempfile("", td, ".parquet")
on.exit(unlink(tf))
# dataframe with 3e6 rows
n <- 3e6
df <- data.frame(x = rnorm(n))
# write parquet multi
write_parquet_multi(df, td)
list.files(td)
This returns:
[1] "7a1292f0-cf1e-4cae-b3c1-fe29dc4a1949.parquet"
[2] "a61ac509-34bb-4aac-97fd-07f9f6b374f3.parquet"
[3] "eb5a3f95-77bf-4606-bf36-c8de4843f44a.parquet"
I have created the following function to read a csv file from a given URL:
function(){
  s <- 1
  # first get the bhav copy
  today <- c(); ty <- c(); tm <- c(); tmu <- c(); td <- c()
  # get the URL first
  today <- Sys.Date()
  ty  <- format(today, format = "%Y")
  tm  <- format(today, format = "%b")
  tmu <- toupper(tm)
  td  <- format(today, format = "%d")
  dynamic.URL <- paste("https://www.nseindia.com/content/historical/EQUITIES/",
                       ty, "/", tmu, "/cm", td, tmu, ty, "bhav.csv.zip", sep = "")
  file.string <- paste("C:/Users/user/AppData/Local/Temp/cm", td, tmu, ty, "bhav.csv")
  download.file(dynamic.URL, "C:/Users/user/Desktop/bhav.csv.zip")
  bhav.copy <- read.csv(file.string)
  return(bhav.copy)
}
If I run the function right away, it says "file.string not found". But when I run it again after some time (a few seconds), it executes normally. I think that when download.file executes, it hands control to read.csv, which tries to load a file that has not yet been properly saved. When I run the function again later, download.file tries to overwrite the existing file (which it cannot), and read.csv properly loads the file that was saved the first time.
I want the function to work the first time I run it. Is there any way, or a function, to defer the action of read.csv until the file is properly saved? Something like this:
download.file(dynamic.URL, "C:/Users/user/Desktop/bhav.csv.zip")
wait......
bhav.copy <- read.csv(file.string)
Ignore the fact that the destfile in download.file is different from file.string; that is just how my system (Windows 7) handles it.
Very many thanks for your time and effort...
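One way to express that "wait" step is to poll for the file before reading it. A rough sketch, reusing dynamic.URL and file.string from the function above; the one-second interval and 60-second cap are assumptions:
download.file(dynamic.URL, "C:/Users/user/Desktop/bhav.csv.zip")

# wait until the file shows up (give up after 60 seconds)
waited <- 0
while (!file.exists(file.string) && waited < 60) {
  Sys.sleep(1)
  waited <- waited + 1
}

bhav.copy <- read.csv(file.string)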
I have a huge .txt file named SDN_1 with more than 1 million rows. I would like to split this file into smaller .txt files (10,000 rows each) using R.
I used this code to load the file into R:
SDN_1 <- read.csv("C:/Users/JHU/Desktop/rfiles/SDN_1.csv", header=FALSE)
Then I used this code to split the table:
chunk <- 10000
n <- nrow(SDN_1)
r <- rep(1:ceiling(n/chunk),each=chunk)[1:n]
d <- split(SDN_1,r)
Next I would like to save the output of the split function into separate .txt files encoded as UTF-8. The files need to be named in the following format: test_YYYYMMDD_HHMMSS.txt
I'm new to R and any help would be appreciated.
UPDATE:
Hack-R suggested the code below to create the .csv files. It worked once, then started giving me the error message shown after it:
Code Hack-R suggested:
n <- 1
for(i in d){
  con <- file(paste0("file", n, "_",
                     gsub("-", "", gsub(":", "", gsub(" ", "_", Sys.time()))),
                     "_", ".csv"), encoding = "UTF-8")
  write.csv(tmp, file = con)
  n <- n + 1
}
The error message I'm getting:
Error in is.data.frame(x) : object 'tmp' not found
Using the code you already have:
SDN_1 <- mtcars # this represents your csv, to make it reproducible
chunk <- 10 # scaled it down for the example
n <- nrow(SDN_1)
r <- rep(1:ceiling(n/chunk),each=chunk)[1:n]
d <- split(SDN_1,r)
n <- 1 # this part is optional
for(i in d){
  con <- file(paste0("file", n, "_",
                     gsub("-", "", gsub(":", "", gsub(" ", "_", Sys.time()))),
                     "_", ".csv"), encoding = "UTF-8")
  write.csv(i, file = con)  # 'i' is the current chunk of d
  n <- n + 1
}
More generally, let's say a and b represent the splits of a larger object or any collection of objects in the environment you want to write out programmatically:
a <- "a"
b <- "b"
You can get a vector containing their names:
files <- ls()
Then loop through and programmatically write them to a UTF-8 encoded csv file as follows, appending the date and time in the format you requested:
for(i in files){
  tmp <- get(i)
  # name the file after the object's name (i), not its value
  con <- file(paste0(i, "_",
                     gsub("-", "", gsub(":", "", gsub(" ", "_", Sys.time()))),
                     "_", ".csv"), encoding = "UTF-8")
  write.csv(tmp, file = con)
}
I used Sys.time() for the timestamp with nested gsub()s to format the way you wanted. I encoded the file to UTF-8 as explained in this post.
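As a side note, the nested gsub() calls can be replaced with a single format() pattern, which also makes it easy to match the test_YYYYMMDD_HHMMSS.txt naming asked for. A small sketch using the chunk list d from above; the trailing chunk index is my addition so that chunks written within the same second do not collide:
stamp <- format(Sys.time(), "%Y%m%d_%H%M%S")
for (j in seq_along(d)) {
  con <- file(paste0("test_", stamp, "_", j, ".txt"), encoding = "UTF-8")
  write.csv(d[[j]], file = con, row.names = FALSE)
}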
I'm sure this must have been answered somewhere, so if you have a pointer to an answer that helps, please let me know... ;o)
I have a number of fairly sizeable processing tasks (mainly multi-label text classifiers) which read in large volumes of files, do stuff with them, output a result, then move on to the next.
I have this working neatly sequentially but wanted to parallelise things.
By way of a really basic example...
require(plyr)

fileDir   <- "/Users/barneyc/sourceFiles"
outputDir <- "/Users/barneyc/outputFiles"
files <- as.list(list.files(full.names=TRUE, recursive=FALSE, pattern=".csv"))

l_ply(files, function(x){
  print(x)
  # change to dir containing source files
  setwd(fileDir)
  # read file
  content <- read.csv(file=x, header=TRUE)
  # change directory to output
  setwd(outputDir)
  # append the itemID column from this CSV file to ids.csv
  write.table(content$itemID, file="ids.csv", append = TRUE, sep=",",
              row.names=FALSE, col.names=TRUE)
}, .parallel=FALSE )
This iterates through all the files in the directory fileDir, opening each CSV, extracting a value from the file and appending it to an output CSV held in the directory outputDir. A basic example, but it runs just fine and illustrates the problem.
Running this in parallel creates a problem insofar as the directory variables (fileDir & outputDir) are essentially unknown to the anonymous function, à la...
require(plyr)
require(doParallel)

fileDir   <- "/Users/barneyc/sourceFiles"
outputDir <- "/Users/barneyc/outputFiles"
files <- as.list(list.files(full.names=TRUE, recursive=FALSE, pattern=".csv"))

cl <- makeCluster(4)   # make a cluster of available cores
registerDoParallel(cl) # raise cluster

l_ply(files, function(x){
  print(x)
  # change to dir containing source files
  #setwd(fileDir)
  # read file
  content <- read.csv(file=x, header=TRUE)
  # change directory to output
  setwd(y)             # 'y' (the output dir) is not known inside the worker
  # append the itemID column from this CSV file to ids.csv
  write.table(content$itemID, file="ids.csv", append = TRUE, sep=",",
              row.names=FALSE, col.names=TRUE)
}, .parallel=TRUE )

stopCluster(cl) # kill the cluster
Can anyone shed light on how I pass those two directory variables through to the function here?
So, thanks to @Roland, my parallel function would now be...
require(plyr)
require(doParallel)

fileDir   <- "/Users/barneyc/sourceFiles"
outputDir <- "/Users/barneyc/outputFiles"
files <- as.list(list.files(full.names=TRUE, recursive=FALSE, pattern=".csv"))

cl <- makeCluster(4)   # make a cluster of available cores
registerDoParallel(cl) # raise cluster

l_ply(files, function(x, y, z){
  filename  <- x
  fileDir   <- y
  outputDir <- z
  # change to dir containing source files
  setwd(fileDir)
  # read file
  content <- read.csv(file=filename, header=TRUE)
  # change directory to output
  setwd(outputDir)
  # append the itemID column from this CSV file to ids.csv
  write.table(content$itemID, file="ids.csv", append = TRUE, sep=",",
              row.names=FALSE, col.names=TRUE)
}, y=fileDir, z=outputDir, .parallel=TRUE )

stopCluster(cl) # kill the cluster
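A small side note: rather than calling setwd() inside the workers, it may be simpler to build full paths with file.path(), so no working-directory state lives on the cluster. A sketch under that assumption:
require(plyr)
require(doParallel)

fileDir   <- "/Users/barneyc/sourceFiles"
outputDir <- "/Users/barneyc/outputFiles"
files <- list.files(fileDir, pattern = "\\.csv$", full.names = TRUE)

cl <- makeCluster(4)
registerDoParallel(cl)

l_ply(files, function(x, outDir) {
  content <- read.csv(x, header = TRUE)   # full path, no setwd() needed
  write.table(content$itemID,
              file = file.path(outDir, "ids.csv"),
              append = TRUE, sep = ",",
              row.names = FALSE, col.names = FALSE)
}, outDir = outputDir, .parallel = TRUE)

stopCluster(cl)
Bear in mind that several workers appending to the same ids.csv at once can interleave rows; collecting the results with llply() and writing once at the end is the safer pattern.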
Hello experts,
I am trying to read in a large file in consecutive blocks of 10000 lines, because the file is too large to read in at once. The "skip" argument of read.csv comes in handy for this task (see below). However, I noticed that the program starts slowing down towards the end of the file (for large values of i).
I suspect this is because each call to read.csv(file, skip=nskip, nrows=block) always starts reading the file from the beginning until the required starting line "skip" is reached. This becomes increasingly time-consuming as i increases.
Question: Is there a way to continue reading a file starting from the last location that was reached in the previous block?
numberOfBlocksInFile <- 800
block <- 10000
for (i in 1:(numberOfBlocksInFile - 1))
{
  print(i)
  nskip <- i*block
  out <- read.csv(file, skip=nskip, nrows=block)
  colnames(out) <- names   # 'names' holds the column names read beforehand
  .....
  print("keep going")
}
many thanks (:-
One way is to use readLines with a file connection. For example, you could do something like this:
temp.fpath <- tempfile() # create a temp file for this demo
d <- data.frame(a=letters[1:10], b=1:10) # sample data, 10 rows. we'll read 5 at a time
write.csv(d, temp.fpath, row.names=FALSE) # write the sample data

f.cnxn <- file(temp.fpath, 'r') # open a new connection
fields <- readLines(f.cnxn, n=1) # read the header, which we'll reuse for each block
block.size <- 5

repeat { # keep reading and printing 5-row chunks until you reach the end of the cnxn
  block.text <- readLines(f.cnxn, n=block.size) # read chunk
  if (length(block.text) == 0) # if there's nothing left, leave the loop
    break
  block <- read.csv(text=c(fields, block.text)) # process chunk with read.csv
  print(block)
}

close(f.cnxn)
file.remove(temp.fpath)
Another option is to use fread from the data.table package.
library(data.table)

N <- 1e6                                    ## about 1 second to read 1e6 rows / 10 cols
hdr <- names(fread("test.csv", nrows = 0))  ## grab the column names once
skip <- 1                                   ## always skip the header line itself

repeat {
  DT <- fread("test.csv", nrows = N, skip = skip, header = FALSE)
  setnames(DT, hdr)
  ## here use DT for your process
  if (nrow(DT) < N) break                   ## last (partial) chunk reached
  skip <- skip + N
}
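If adding another package is acceptable, readr's chunked reader also avoids re-scanning the file from the start for every block; a minimal sketch, where the callback body is only illustrative:
library(readr)

# called once per chunk; 'chunk' is a data frame, 'pos' is its starting row
process_chunk <- function(chunk, pos) {
  cat("rows", pos, "to", pos + nrow(chunk) - 1, "\n")
}

read_csv_chunked("test.csv",
                 callback = SideEffectChunkCallback$new(process_chunk),
                 chunk_size = 1e6)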