Hi, I'm new to R and I'm building off of two guides from the web. I figured out how to automate a script for data mining, but instead of being appended the data is overwritten each time the code is run. I would like to have it appended. Can anyone point me in the right direction?
Here is the script:
# loading the package is required once each session
require(XML)
# initialize a storage variable for Twitter tweets
mydata.vectors <- character(0)
# paginate to get more tweets
for (page in c(1:15))
{
# search parameter
twitter_q <- URLencode('#google OR #apple')
# construct a URL
twitter_url = paste('http://search.twitter.com/search.atom?q=',twitter_q,'&rpp=100&page=', page, sep='')
# fetch remote URL and parse
mydata.xml <- xmlParseDoc(twitter_url, asText=F)
# extract the titles
mydata.vector <- xpathSApply(mydata.xml, '//s:entry/s:title', xmlValue, namespaces =c('s'='http://www.w3.org/2005/Atom'))
# aggregate new tweets with previous tweets
mydata.vectors <- c(mydata.vector, mydata.vectors)
}
# how many tweets did we get?
length(mydata.vectors)
I think what you want is to save the results to disk between runs. So, something like this at the beginning:
if (!file.exists('path/to/file')) {
  mydata.vectors <- character(0)
} else {
  load('path/to/file')
}
And something like this at the end:
save(mydata.vectors, file='path/to/file')
Should do the trick. Of course you could get more sophisticated with save file types etc.
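For example, here's a minimal sketch of the same pattern using an RDS file instead of save()/load() (the file name 'tweets.rds' is just an example):
rds_path <- 'tweets.rds'
if (file.exists(rds_path)) {
  mydata.vectors <- readRDS(rds_path)   # tweets collected in previous runs
} else {
  mydata.vectors <- character(0)
}
# ... run the scraping loop above, which appends into mydata.vectors ...
saveRDS(mydata.vectors, rds_path)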
There are 2 parts to my question, as I explored 2 methods in this exercise, but I succeeded with neither. It would be greatly appreciated if someone could help me out.
[PART 1:]
I am attempting to scrape data from a webpage on the Singapore Stock Exchange, https://www2.sgx.com/derivatives/negotiated-large-trade, containing data stored in a table. I have some basic knowledge of scraping data using rvest. However, using the Inspector in Chrome, the HTML hierarchy is much more complex than I expected. I can see that the data I want is hidden under <div class="table-container">, and here's what I've tried:
library(rvest)
library(httr)
library(XML)
SGXurl <- "https://www2.sgx.com/derivatives/negotiated-large-trade"
SGXdata <- read_html(SGXurl, stringsAsFactors = FALSE)
html_nodes(SGXdata,".table-container")
However, nothing is picked up by the code, and I doubt whether I'm using it correctly.
[PART 2:]
I realized that there's a small "download" button on the page which can download exactly the data file I want in .csv format. So I was thinking of writing some code to mimic the download button, and I found this question: Using R to "click" a download file button on a webpage. However, I'm unable to get it to work even with some modifications to that code.
There are a few filters on the webpage. Mostly I will be interested in downloading data for a particular business day while leaving the other filters blank, so I've tried writing the following function:
library(httr)
library(rvest)
library(purrr)
library(dplyr)
crawlSGXdata = function(date){
POST("https://www2.sgx.com/derivatives/negotiated-large-trade",
body = NULL,
encode = "form",
write_disk("SGXdata.csv")) -> resfile
res = read.csv("SGXdata.csv")
return(res)
}
I intended to put the function input "date" into the "body" argument, but I was unable to figure out how to do that, so I started with "body = NULL" on the assumption that it doesn't do any filtering. However, the result is still unsatisfactory: the downloaded file is basically empty apart from the following error:
Request Rejected
The requested URL was rejected. Please consult with your administrator.
Your support ID is: 16783946804070790400
The content is loaded dynamically from an API call returning JSON. You can find this in the Network tab via dev tools.
The following returns that content. I find the total number of pages of results and loop, combining the data frame returned from each call into one final data frame containing all results.
library(jsonlite)
url <- 'https://api.sgx.com/negotiatedlargetrades/v1.0?order=asc&orderby=contractcode&category=futures&businessdatestart=20190708&businessdateend=20190708&pagestart=0&pageSize=250'
r <- jsonlite::fromJSON(url)
num_pages <- r$meta$totalPages
df <- r$data
url2 <- 'https://api.sgx.com/negotiatedlargetrades/v1.0?order=asc&orderby=contractcode&category=futures&businessdatestart=20190708&businessdateend=20190708&pagestart=placeholder&pageSize=250'
if(num_pages > 1){
for(i in seq(1, num_pages - 1)){ # pagestart is 0-based; page 0 was fetched above
newUrl <- gsub("placeholder", i , url2)
newdf <- jsonlite::fromJSON(newUrl)$data
df <- rbind(df, newdf)
}
}
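If you want this wrapped back into the date-driven function the question started from, here is a minimal sketch; it assumes date is a string in yyyymmdd form, matching the businessdatestart/businessdateend parameters in the URL above:
library(jsonlite)
crawlSGXdata <- function(date) {
  base <- paste0(
    "https://api.sgx.com/negotiatedlargetrades/v1.0?",
    "order=asc&orderby=contractcode&category=futures",
    "&businessdatestart=", date, "&businessdateend=", date,
    "&pagestart=%d&pageSize=250"
  )
  r  <- fromJSON(sprintf(base, 0))  # first page (pagestart is 0-based)
  df <- r$data
  if (r$meta$totalPages > 1) {
    for (i in seq(1, r$meta$totalPages - 1)) {
      df <- rbind(df, fromJSON(sprintf(base, i))$data)
    }
  }
  df
}
# e.g. crawlSGXdata("20190708")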
I want to get multiple zip files from an FTP server.
I can get the zip files individually with the help of previous posts.
But that would be a lot of work for all the data I need, so I wanted to find an automated way.
The FTP URL looks like this:
ftp://ftp-cdc.dwd.de/pub/CDC/observations_germany/climate/10_minutes/solar/now/10minutenwerte_SOLAR_01048_now.zip
I want to change the "01048" to the ID of the nearest weather station, which I already have in a data frame (data).
I thought I could just loop over all the needed stations:
for(y in Data$StationsID) {
urls <- "ftp://ftp-cdc.dwd.de/pub/CDC/observations_germany/climate/10_minutes/solar/now/10minutenwerte_SOLAR_{y}_now.zip"
}
but I only get "ftp://ftp-cdc.dwd.de/pub/CDC/observations_germany/climate/10_minutes/solar/now/10minutenwerte_SOLAR_{y}_now.zip"
Each zip file holds a .txt file that is ideal for CSV-style analysis.
Later I want to use the files to get solar data from different points in Germany.
But first I need a list like the one below, and I don't know how to get it:
urls
[1] url_1
[2] url_2
.
.
You do not even need a loop. Try
urls <- paste0("ftp://ftp-cdc.dwd.de/pub/CDC/observations_germany/climate/10_minutes/solar/now/10minutenwerte_SOLAR_", Data$StationsID, "_now.zip")
This will give you a vector of all URLs.
Afterwards you can fetch all files using e.g. lapply.
results <- lapply(urls, FUN = function(u) {
# FETCH
# UNZIP
# TIDY
# ...
})
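As a rough illustration of what those steps could look like (a sketch only: the semicolon-separated layout and header row of the DWD 10-minute files are assumptions, so check one file by hand first):
results <- lapply(urls, function(u) {
  tmp <- tempfile(fileext = ".zip")
  download.file(u, tmp, mode = "wb")            # FETCH the zip
  txt <- unzip(tmp, exdir = tempdir())          # UNZIP; returns the extracted file path(s)
  read.table(txt[1], header = TRUE, sep = ";",  # TIDY: read the .txt, assumed ';'-separated
             stringsAsFactors = FALSE)
})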
(👍🏼 first question!)
An alternate approach is below.
We:
get all the possible data files
filter it to just the SOLAR zip data files (you could filter more if needed — i.e. restrict it to what you have in your existing data frame)
make a save location
download the files
Note that it is really bad form to hammer a server with consecutive requests without a pause, so this code introduces one; that's an oft-overlooked courtesy these days.
library(curl)
library(httr)
base_dir <- "ftp://ftp-cdc.dwd.de/pub/CDC/observations_germany/climate/10_minutes/solar/now/"
# Get all available files in that directory
res <- curl_fetch_memory(base_dir, handle = new_handle(dirlistonly = TRUE))
grep(
  "SOLAR_[[:digit:]]{5}",
  # curl_fetch_memory() returns a raw vector since it has no idea what type of
  # content might be there, so we have to convert it; it's a text directory
  # listing, so we also split it into one entry per line
  strsplit(rawToChar(res$content), "\n")[[1]],
  value = TRUE
) -> all_zips
head(all_zips)
## [1] "10minutenwerte_SOLAR_00044_now.zip"
## [2] "10minutenwerte_SOLAR_00071_now.zip"
## [3] "10minutenwerte_SOLAR_00073_now.zip"
## [4] "10minutenwerte_SOLAR_00131_now.zip"
## [5] "10minutenwerte_SOLAR_00150_now.zip"
## [6] "10minutenwerte_SOLAR_00154_now.zip"
save_dir <- "~/Data/solar-output"
dir.create(save_dir)
for (zip in all_zips) {
try(httr::GET(
url = sprintf("%s%s", base_dir, zip),
httr::write_disk(file.path(save_dir, zip)), # enables caching (it won't overwrite by default and avoids issues with download.file() on Windows)
httr::progress() # progress bars for free!
))
Sys.sleep(5) # be kind to their server CPU and network bandwidth
}
We wrap the GET() in try() since we've asked write_disk() to not overwrite existing files. It tosses an exception when this happens so try() catches it and lets the loop keep going (but still displays the helpful message about the file already existing).
I'm trying in R to get the list of tickers from every exchange covered by Quandl.
There are 2 ways:
1) For every exchange they provide a zipped csv with all tickers. The URL looks like this (XXXXXXXXXXXXXXXXXXXX - API key, YYY - code of exchange):
https://www.quandl.com/api/v3/databases/YYY/codes?api_key=XXXXXXXXXXXXXXXXXXXX
This looks pretty promising, but I was not able to read the file with read.table or, e.g., fread. I don't know why. Is it because of the API key? read.table is supposed to read zip files with no problem.
2) I was able to go further with the 2nd way. They provide a URL to the csv of tickers. E.g.:
https://www.quandl.com/api/v3/datasets.csv?database_code=YYY&per_page=100&sort_by=id&page=1&api_key=XXXXXXXXXXXXXXXXXXXX
As you can see, the URL contains a page number. The problem is they only mention below, in the text, that you need to run this URL many times (e.g. 56 for LSE) in order to get the full list. I was able to do it like this:
pages <- 1:100 # "100" is taken just to be big enough
Source <- c("LSE","FSE", ...) # vector of exchange codes
QUANDL_API_KEY="XXXXXXXXXXXXXXXXXXXXXXXXXX"
TICKERS = lapply(sprintf(
  "https://www.quandl.com/api/v3/datasets.csv?database_code=%s&per_page=100&sort_by=id&page=%s&api_key=%s",
  Source, pages, QUANDL_API_KEY),
  FUN = fread,
  stringsAsFactors = FALSE)
TICKERS <- do.call(rbind, TICKERS)
The problem is I just put 100 pages, but when R tries to get a non-existing page (e.g. #57) it throws an error and does not go any further. I was trying to do something like iferror, but failed.
Could you please give some hints?
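A minimal sketch of the tryCatch approach being reached for here: it fetches pages until a request fails or comes back empty instead of assuming a fixed page count (it assumes the per-exchange CSVs share the same columns so they can be stacked):
library(data.table)
fetch_exchange <- function(code, api_key) {
  out  <- list()
  page <- 1
  repeat {
    url <- sprintf(
      "https://www.quandl.com/api/v3/datasets.csv?database_code=%s&per_page=100&sort_by=id&page=%d&api_key=%s",
      code, page, api_key)
    res <- tryCatch(fread(url, stringsAsFactors = FALSE),
                    error = function(e) NULL)
    if (is.null(res) || nrow(res) == 0) break   # stop at the first missing/empty page
    out[[page]] <- res
    page <- page + 1
  }
  rbindlist(out)
}
TICKERS <- rbindlist(lapply(Source, fetch_exchange, api_key = QUANDL_API_KEY))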
I'm stuck on this one after much searching....
I started with scraping the contents of a table from:
http://www.skatepress.com/skates-top-10000/artworks/
Which is easy:
library(XML)
data <- data.frame()
for (i in 1:100){
print(paste("page", i, "of 100"))
url <- paste("http://www.skatepress.com/skates-top-10000/artworks/", i, "/", sep = "")
temp <- readHTMLTable(stringsAsFactors = FALSE, url, which = 1, encoding = "UTF-8")
data <- rbind(data, temp)
} # end of scraping loop
However, I need to additionally scrape the detail that is contained in a pop-up box when you click on each name (and on the artwork title) in the list on the site.
I can't for the life of me figure out how to pass the breadcrumb (or artist-id or painting-id) through in order to make this happen. Since straight up using rvest to access the contents of the nodes doesn't work, I've tried the following:
I tried passing the painting id through in the url like this:
url <- ("http://www.skatepress.com/skates-top-10000/artworks/?painting_id=576")
site <- html(url)
But it still gives an empty result when scraping:
node1 <- "bread-crumb > ul > li.activebc"
site %>% html_nodes(node1) %>% html_text(trim = TRUE)
character(0)
I'm (clearly) not a scraping expert so any and all assistance would be greatly appreciated! I need a way to capture this additional information for each of the 10,000 items on the list...hence why I'm not interested in doing this manually!
Hoping this is an easy one and I'm just overlooking something simple.
This will be a more efficient base scraper and you can get progress bars for free with the pbapply package:
library(xml2)
library(httr)
library(rvest)
library(dplyr)
library(pbapply)
library(jsonlite)
base_url <- "http://www.skatepress.com/skates-top-10000/artworks/%d/"
n <- 100
bind_rows(pblapply(1:n, function(i) {
mutate(html_table(html_nodes(read_html(sprintf(base_url, i)), "table"))[[1]],
`Sale Date`=as.Date(`Sale Date`, format="%m.%d.%Y"),
`Premium Price USD`=as.numeric(gsub(",", "", `Premium Price USD`)))
})) -> skatepress
I added trivial date & numeric conversions.
I believe your main issue is that the site requires a login to get the additional data. You should give that (i.e. logging in) a shot using httr and grab the wordpress_logged_in_XXXXXXX… cookie from that endeavour. I just grabbed it from inspecting the session with Developer Tools in Chrome, and that will also work for you (but it's worth the time to learn how to do it via httr).
You'll need to scrape two additional <a … tags from each table row. The one for "artist" looks like:
Pablo Picasso
You can scrape the contents with:
POST("http://www.skatepress.com/wp-content/themes/skatepress/scripts/query_artist.php",
set_cookies(wordpress_logged_in_XXX="userid%XXXXXreallylongvalueXXXXX…"),
encode="form",
body=list(id="pab_pica_1881"),
verbose()) -> artist_response
fromJSON(content(artist_response, as="text"))
(The return value is too large to post here)
The one for "artwork" looks like:
Les femmes d′Alger (Version ′O′)
and you can get that in similar fashion:
POST("http://www.skatepress.com/wp-content/themes/skatepress/scripts/query_artwork.php",
set_cookies(wordpress_logged_in_XXX="userid%XXXXXreallylongvalueXXXXX…"),
encode="form",
body=list(id=576),
verbose()) -> artwork_response
fromJSON(content(artwork_response, as="text"))
That's not huge but I won't clutter the response with it.
NOTE that you can also use rvest's html_session to do the login (which will get you cookies for free) and then continue to use that session in the scraping (vs read_html), which will mean you don't have to do the httr GET/POST calls yourself.
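A minimal sketch of that session-based route, assuming the standard WordPress login page at wp-login.php and its default "log"/"pwd" form fields (both assumptions; check the site's actual login form):
library(rvest)
sess <- html_session("http://www.skatepress.com/wp-login.php")
form <- html_form(sess)[[1]]
form <- set_values(form, log = "your_username", pwd = "your_password")
sess <- submit_form(sess, form)
# the session now carries the wordpress_logged_in_* cookie, so pages fetched
# through it are authenticated
page <- jump_to(sess, "http://www.skatepress.com/skates-top-10000/artworks/1/")
page %>% html_nodes("table") %>% html_table()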
You'll have to figure out how you want to incorporate that data into the data frame or associate it with it via various id's in the data frame (or some other strategy).
You can see it call those two PHP scripts via Developer Tools, which also shows the data it passes in. I'm also somewhat surprised that the site doesn't have any anti-scraping clauses in its ToS, but it doesn't.
I am trying to read a lot of csv files into R from a website. There are multiple years of daily (business days only) files. All of the files have the same data structure. I can successfully read one file using the following logic:
# enter user credentials
user <- "JohnDoe"
password <- "SecretPassword"
credentials <- paste(user,":",password,"@",sep="")
web.site <- "downloads.theice.com/Settlement_Reports_CSV/Power/"
# construct path to data
path <- paste("https://", credentials, web.site, sep="")
# read data for 4/10/2013
file <- "icecleared_power_2013_04_10"
fname <- paste(path,file,".dat",sep="")
df <- read.csv(fname,header=TRUE, sep="|",as.is=TRUE)
However, I'm looking for tips on how to read all the files in the directory at once. I suppose I could generate a sequence of dates, construct the file name above in a loop, and use rbind to append each file, but that seems cumbersome. Plus there will be issues when attempting to read weekends and holidays, where there are no files.
The file listing shown in the web browser (screenshots omitted here) contains files named like icecleared_power_2013_04_10.dat, one per business day.
Is there a way to first scan the path (from above) to get a list of all the file names in the directory that meet a certain criterion (i.e. start with "icecleared_power_", as there are also some files in that location with a different starting name that I do not want to read in), then loop read.csv through that list and use rbind to append?
Any guidance would be greatly appreciated.
I would first try to just scrape the links to the relevant data files and use the resulting information to construct the full download path that includes user logins and so on. As others have suggested, lapply would be convenient for batch downloading.
Here's an easy way to extract the URLs. Obviously, modify the example to suit your actual scenario.
Here, we're going to use the XML package to identify all the links available at the CRAN archives for the Amelia package (http://cran.r-project.org/src/contrib/Archive/Amelia/).
> library(XML)
> url <- "http://cran.r-project.org/src/contrib/Archive/Amelia/"
> doc <- htmlParse(url)
> links <- xpathSApply(doc, "//a/@href")
> free(doc)
> links
href href href
"?C=N;O=D" "?C=M;O=A" "?C=S;O=A"
href href href
"?C=D;O=A" "/src/contrib/Archive/" "Amelia_1.1-23.tar.gz"
href href href
"Amelia_1.1-29.tar.gz" "Amelia_1.1-30.tar.gz" "Amelia_1.1-32.tar.gz"
href href href
"Amelia_1.1-33.tar.gz" "Amelia_1.2-0.tar.gz" "Amelia_1.2-1.tar.gz"
href href href
"Amelia_1.2-2.tar.gz" "Amelia_1.2-9.tar.gz" "Amelia_1.2-12.tar.gz"
href href href
"Amelia_1.2-13.tar.gz" "Amelia_1.2-14.tar.gz" "Amelia_1.2-15.tar.gz"
href href href
"Amelia_1.2-16.tar.gz" "Amelia_1.2-17.tar.gz" "Amelia_1.2-18.tar.gz"
href href href
"Amelia_1.5-4.tar.gz" "Amelia_1.5-5.tar.gz" "Amelia_1.6.1.tar.gz"
href href href
"Amelia_1.6.3.tar.gz" "Amelia_1.6.4.tar.gz" "Amelia_1.7.tar.gz"
For the sake of demonstration, imagine that, ultimately, we only want the links for the 1.2 versions of the package.
> wanted <- links[grepl("Amelia_1\\.2.*", links)]
> wanted
href href href
"Amelia_1.2-0.tar.gz" "Amelia_1.2-1.tar.gz" "Amelia_1.2-2.tar.gz"
href href href
"Amelia_1.2-9.tar.gz" "Amelia_1.2-12.tar.gz" "Amelia_1.2-13.tar.gz"
href href href
"Amelia_1.2-14.tar.gz" "Amelia_1.2-15.tar.gz" "Amelia_1.2-16.tar.gz"
href href
"Amelia_1.2-17.tar.gz" "Amelia_1.2-18.tar.gz"
You can now use that vector as follows:
wanted <- links[grepl("Amelia_1\\.2.*", links)]
GetMe <- paste(url, wanted, sep = "")
lapply(seq_along(GetMe),
function(x) download.file(GetMe[x], wanted[x], mode = "wb"))
Update (to clarify your question in comments)
The last step in the example above downloads the specified files to your current working directory (use getwd() to verify where that is). If, instead, you know for sure that read.csv works on the data, you can also try to modify your anonymous function to read the files directly:
lapply(seq_along(GetMe),
function(x) read.csv(GetMe[x], header = TRUE, sep = "|", as.is = TRUE))
However, I think a safer approach might be to download all the files into a single directory first, and then use read.delim or read.csv or whatever works to read in the data, similar to what was suggested by @Andreas. I say safer because it gives you more flexibility in case files aren't fully downloaded and so on. In that case, instead of having to redownload everything, you would only need to download the files which were not fully downloaded.
@MikeTP, if all the reports start with "icecleared_power_" followed by a date which is a business date, the package "timeDate" offers an easy way to create a vector of business dates, like so:
require(timeDate)
tSeq <- timeSequence("2012-01-01","2012-12-31") # vector of days
tBiz <- tSeq[isBizday(tSeq)] # vector of business days
and
paste0("icecleared_power_",as.character.Date(tBiz))
gives you the concatenated file name.
If the web site follows a different logic regarding the naming of files we need more information as Ananda Mahto observed.
Keep in mind that when you create a date vector with timeDate you can get much more sophisticated than my simple example. You can take into account holiday schedules, stock exchange calendars, etc.
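For example, a minimal sketch that also drops NYSE holidays (whether the NYSE calendar is the right one for these files is an assumption):
require(timeDate)
tSeq  <- timeSequence("2013-01-01", "2013-12-31")           # vector of days
tBiz  <- tSeq[isBizday(tSeq, holidays = holidayNYSE(2013))] # weekdays minus NYSE holidays
files <- paste0("icecleared_power_", format(tBiz, "%Y_%m_%d"))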
You can try using the command "download.file".
### set up the path and destination
path <- "url where file is located"
dest <- "where on your hard disk you want the file saved"
### Ask R to try really hard to download your ".csv"
try(download.file(path, dest))
The trick to this is going to be figuring out how the URL or path changes systematically between files. Often, web pages are built such that the URLs are systematic. In this case, you could potentially create a vector or data frame of URLs to iterate over inside of an apply function.
All of this can be sandwiched inside of an lapply. The "data" object is simply whatever we are iterating over. It could be a vector of URLs or a data frame of year and month observations, which could then be used to create URLs within the lapply function.
### "dl" will apply a function to every element in our vector "data"
# It will also help keep track of files which have no download data
dl <- lapply(data, function(x) {
path <- 'url'
dest <- './data_intermediate/...'
try(download.file(path, dest))
})
### Assign element names to your list "dl"
names(dl) <- unique(data$name)
index <- sapply(dl, is.null)
### Figure out which downloads returned nothing
no.download <- names(dl)[index]
You can then use "list.files()" to merge all data together, assuming they belong in one data.frame
### Create a list of files you want to merge together
files <- list.files()
### Create a list of data.frames by reading each file into memory
data <- lapply(files, read.csv)
### Stack data together
data <- do.call(rbind, data)
Sometimes you will notice that a file has been corrupted after downloading. In that case, pay attention to the "mode" option of the download.file() command: use mode = "wb" if the file is stored in a binary format.
Another solution
library(rvest)
for (i in seq_along(stations)) {
  URL <- ""
  files <- rvest::read_html(URL) %>% html_nodes("a") %>% html_text(trim = TRUE)
  files <- grep(stations[i], files, ignore.case = TRUE, value = TRUE)
  destfile <- paste0("C:/Users/...", files)
  download.file(paste0(URL, files), destfile, mode = "wb")
}