I want to get multiple zip files from an FTP server.
I can get the zip files individually with the help of previous posts.
But that would be a lot of work for all the data I need, so I wanted to find an automated way.
The FTP URL looks like this:
ftp://ftp-cdc.dwd.de/pub/CDC/observations_germany/climate/10_minutes/solar/now/10minutenwerte_SOLAR_01048_now.zip
I want to change the "01048" to the ID of the nearest weather station, which I already have in a data frame (Data).
I thought I could just loop over all the needed stations:
for(y in Data$StationsID)) {
urls <- "ftp://ftp-cdc.dwd.de/pub/CDC/observations_germany/climate/10_minutes/solar/now/10minutenwerte_SOLAR_{y}_now.zip"))
}
but I only get the literal string "ftp://ftp-cdc.dwd.de/pub/CDC/observations_germany/climate/10_minutes/solar/now/10minutenwerte_SOLAR_{y}_now.zip".
Each zip file holds a .txt file that is ideal for CSV analysis.
Later I want to use the files to get solar data from different points in Germany.
But first I need a list like this, and I don't know how to get it:
urls
[1] url_1
[2] url_2
.
.
You do not even need a loop. Try
urls <- paste0("ftp://ftp-cdc.dwd.de/pub/CDC/observations_germany/climate/10_minutes/solar/now/10minutenwerte_SOLAR_", Data$StationsID, "_now.zip")
This will give you a vector of all URLs.
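As an aside (not part of the original answer): the glue-style "{y}" in your attempt is not interpolated because R does not substitute into plain strings; sprintf() does the same substitution explicitly if you prefer that form:
# Equivalent to the paste0() call above, substituting each station ID via sprintf()
urls <- sprintf(
  "ftp://ftp-cdc.dwd.de/pub/CDC/observations_germany/climate/10_minutes/solar/now/10minutenwerte_SOLAR_%s_now.zip",
  Data$StationsID
)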
Afterwards you can fetch all files using e.g. lapply.
results <- lapply(urls, FUN = function(u) {
  # FETCH
  # UNZIP
  # TIDY
  # ...
})
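For completeness, here is one way the skeleton could be filled in; this is only a sketch, and the assumptions are flagged in the comments (in particular, the separator of the .txt inside each zip may need adjusting):
results <- lapply(urls, FUN = function(u) {
  tmp <- tempfile(fileext = ".zip")
  download.file(u, tmp, mode = "wb")                   # FETCH (works for ftp:// URLs)
  txt <- unzip(tmp, exdir = tempdir())                 # UNZIP; returns the extracted file paths
  dat <- read.csv2(txt[1], stringsAsFactors = FALSE)   # TIDY; assumes a ';'-separated .txt
  unlink(tmp)
  dat
})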
(👍🏼 first question!)
An alternate approach is below.
We:
- get all the possible data files
- filter them down to just the SOLAR zip data files (you could filter more if needed, i.e. restrict it to what you have in your existing data frame)
- make a save location
- download the files
Note that it is really bad form to hammer a server with consecutive requests without a pause, so the loop below introduces one; it's an oft-overlooked courtesy these days.
library(curl)
library(httr)
base_dir <- "ftp://ftp-cdc.dwd.de/pub/CDC/observations_germany/climate/10_minutes/solar/now/"
# Get all available files in that directory
res <- curl_fetch_memory(base_dir, handle = new_handle(dirlistonly = TRUE))
# curl_fetch_memory() returns a raw vector since it has no idea what type of
# content might be there, so we have to convert it; and it's a plain-text
# listing, so some more wrangling is needed.
grep(
  "SOLAR_[[:digit:]]{5}",
  strsplit(rawToChar(res$content), "\n")[[1]],
  value = TRUE
) -> all_zips
head(all_zips)
## [1] "10minutenwerte_SOLAR_00044_now.zip"
## [2] "10minutenwerte_SOLAR_00071_now.zip"
## [3] "10minutenwerte_SOLAR_00073_now.zip"
## [4] "10minutenwerte_SOLAR_00131_now.zip"
## [5] "10minutenwerte_SOLAR_00150_now.zip"
## [6] "10minutenwerte_SOLAR_00154_now.zip"
save_dir <- "~/Data/solar-output"
dir.create(save_dir)
for (zip in all_zips) {
try(httr::GET(
url = sprintf("%s%s", base_dir, zip),
httr::write_disk(file.path(save_dir, zip)), # enables caching (it won't overwrite by default and avoids issues with download.file() on Windows)
httr::progress() # progress bars for free!
))
Sys.sleep(5) # be kind to their server CPU and network bandwidth
}
We wrap the GET() in try() since we've asked write_disk() to not overwrite existing files. It tosses an exception when this happens so try() catches it and lets the loop keep going (but still displays the helpful message about the file already existing).
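If it helps, once the zips are on disk they can be read back into a single data frame along these lines (a sketch only; the separator of the .txt files inside the zips is an assumption here):
zips <- list.files(save_dir, pattern = "\\.zip$", full.names = TRUE)
solar <- do.call(rbind, lapply(zips, function(z) {
  txt <- unzip(z, exdir = tempdir())          # extract and get the path(s) of the .txt
  read.csv2(txt[1], stringsAsFactors = FALSE) # assumes ';'-separated data; adjust if needed
}))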
Thanks in advance for any feedback.
As part of my dissertation I'm trying to scrape data from the web (I've been working on this for months). I have a couple of issues:
- Each document I want to scrape has a document number. However, the numbers don't always go up in order. For example, one document number is 2022, but the next one is not necessarily 2023; it could be 2038, 2040, etc. I don't want to go through by hand to get each document number. I have tried to wrap download.file in purrr::safely(), but once it hits a document that does not exist it stops.
- Second, I'm still fairly new to R and am having a hard time setting up destfile for multiple documents. Indexing the path for where to store downloaded data ends up with the first document stored in the named place and the next document as NA.
Here's the code I've been working on:
base.url <- "https://www.europarl.europa.eu/doceo/document/"
document.name.1 <- "P-9-2022-00"
document.extension <- "_EN.docx"
#document.number <- 2321
document.numbers <- c(2330:2333)
for (i in 1:length(document.numbers)) {
temp.doc.name <- paste0(base.url,
document.name.1,
document.numbers[i],
document.extension)
print(temp.doc.name)
#download and save data
safely <- purrr::safely(download.file(temp.doc.name,
destfile = "/Users/...[i]"))
}
Ultimately, I need to scrape about 120,000 documents from the site. Where is the best place to store the data? I'm thinking I might run the code for each of the 15 years I'm interested in separately, in order to (hopefully) keep it manageable.
Note: I've tried several different ways to scrape the data. Unfortunately for me, the RSS feed only has the most recent 25. Because there are multiple dropdown menus to navigate before you reach the .docx file, my workaround is to use document numbers. I am, however, open to more efficient ways to scrape these written questions.
Again, thanks for any feedback!
Kari
After quickly checking out the site, I agree that I can't see any easier way to do this, because the search function doesn't appear to be URL-based. So what you need to do is poll each candidate URL, check whether it returns a "good" status (usually 200), and skip the download when it returns a "bad" status (like 404). The following code block does that.
Note that purrr::safely doesn't run a function -- it creates another function that is safe and which you then can call. The created function returns a list with two slots: result and error.
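A minimal illustration of that behaviour (the example values are mine, not from the question):
slog <- purrr::safely(log)   # creates a new, "safe" version of log()
slog(10)$result              # 2.302585; $error is NULL
slog("a")$error              # the caught error object; $result is NULL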
base.url <- "https://www.europarl.europa.eu/doceo/document/"
document.name.1 <- "P-9-2022-00"
document.extension <- "_EN.docx"
#document.number <- 2321
document.numbers <- c(2330:2333,2552,2321)
sHEAD = purrr::safely(httr::HEAD)
sdownload = purrr::safely(download.file)
for (i in seq_along(document.numbers)) {
  file_name = paste0(document.name.1, document.numbers[i], document.extension)
  temp.doc.name <- paste0(base.url, file_name)
  print(temp.doc.name)
  # issue the HEAD request once and reuse the result
  head_res = sHEAD(temp.doc.name)$result
  print(head_res$status_code)
  if (!is.null(head_res) && head_res$status_code %in% 200:299) {
    sdownload(temp.doc.name, destfile = file_name)
  }
}
It might not be as simple as all of the valid URLs returning a '200' status; in general, status codes in the range 200:299 should be OK (the answer has been edited to reflect this).
I used parts of this answer in my answer.
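For what it's worth, the same check can be written a bit more compactly with httr::http_error(), which is TRUE for any status of 400 or above; this is just an alternative sketch, not part of the code above:
head_res <- sHEAD(temp.doc.name)$result
if (!is.null(head_res) && !httr::http_error(head_res)) {
  sdownload(temp.doc.name, destfile = file_name)
}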
If the file does not exist, tryCatch simply skips it:
library(tidyverse)
get_data <- function(index) {
paste0(
"https://www.europarl.europa.eu/doceo/document/",
"P-9-2022-00",
index,
"_EN.docx"
) %>%
download.file(url = .,
destfile = paste0(index, ".docx"),
mode = "wb",
quiet = TRUE) %>%
tryCatch(.,
error = function(e) print(paste(index, "does not exist - SKIPS")))
}
map(2000:5000, get_data)
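Since get_data() is called purely for its side effect of writing files, purrr::walk() could be used instead of map() so that no list of return values is collected; that is a style preference on my part, not something the answer requires:
purrr::walk(2000:5000, get_data)   # same iteration, but returns the input invisibly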
I'm trying to read multiple .csv files from URLs starting with http. All files can be found on the same website. Generally, the structure of a file's name is: yyyy_mm_dd_location_XX.csv
Now, there are three different locations (let's say locA, locB, locC), and for each there is a file for every day of the month. So the file names would be e.g. "2009_10_01_locA_XX.csv", "2009_10_02_locA_XX.csv", and so forth.
The structure, meaning the number of columns, of all the csv files is the same; however, the length (number of rows) is not.
I'd like to combine all these files into one csv file, but I'm having problems reading them from the website due to the changing names.
Thanks a lot for any ideas!
Here is a way to programmatically generate the names of the files, and then run download.file() to download them. Since no reproducible example was given with the question, one needs to change the code to the correct HTTP location to access the files.
startDate <- as.Date("2019-10-01", "%Y-%m-%d")
dateVec <- startDate + 0:4 # create additional dates by adding integers
library(lubridate)
downloadFileNames <- unlist(lapply(dateVec, function(x) {
  locs <- c("locA", "locB", "locC")
  # zero-pad month and day so the names match the yyyy_mm_dd pattern
  sprintf("%d_%02d_%02d_%s_XX", year(x), month(x), day(x), locs)
}))
head(downloadFileNames)
We print the head() of the vector to show the correct naming pattern.
> head(downloadFileNames)
[1] "2019_10_01_locA_XX" "2019_10_01_locB_XX" "2019_10_01_locC_XX"
[4] "2019_10_02_locA_XX" "2019_10_02_locB_XX" "2019_10_02_locC_XX"
>
Next, we'll create a directory to store the files, and download them.
# create a subdirectory to store the files
if(!dir.exists("./data")) dir.create("./data")
# download files, as https://www.example.com/2019_10_01_locA_XX.csv
# to ./data/2019_10_01_locA_XX.csv, etc.
result <- lapply(downloadFileNames,function(x){
download.file(paste0("https://www.example.com/",x,".csv"),
paste0("./data/",x,".csv"))
})
Once the files are downloaded, we can use list.files() to retrieve the path names, read the data with read.csv(), and combine them into a single data frame with do.call().
theFiles <- list.files("./data",pattern = ".csv",full.names = TRUE)
dataList <- lapply(theFiles,read.csv)
data <- do.call(rbind,dataList)
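If you also want to keep track of which file each row came from (not something the question asked for, just a common follow-up), one sketch is to tag each data frame before binding:
names(dataList) <- basename(theFiles)
data <- do.call(rbind, Map(function(df, f) transform(df, source_file = f),
                           dataList, names(dataList)))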
I'm trying in R to get the list of tickers from every exchange covered by Quandl.
There are 2 ways:
1) For every exchange they provide a zipped csv with all tickers. The URL looks like this (XXXXXXXXXXXXXXXXXXXX is the API key, YYY is the exchange code):
https://www.quandl.com/api/v3/databases/YYY/codes?api_key=XXXXXXXXXXXXXXXXXXXX
This looks pretty promising, but I was not able to read the file with read.table or e.g. fread, and I don't know why. Is it because of the API key? read.table is supposed to read zip files with no problem.
2) I was able to get further with the 2nd way. They provide a URL to a csv of tickers, e.g.:
https://www.quandl.com/api/v3/datasets.csv?database_code=YYY&per_page=100&sort_by=id&page=1&api_key=XXXXXXXXXXXXXXXXXXXX
As you can see, the URL contains a page number. The problem is they only mention in the text below that you need to run this URL many times (e.g. 56 times for LSE) in order to get the full list. I was able to do it like this:
pages <- 1:100 # "100" is taken just to be big enough
Source <- c("LSE","FSE", ...) # vector of exchange codes
QUANDL_API_KEY="XXXXXXXXXXXXXXXXXXXXXXXXXX"
TICKERS = lapply(sprintf(
  "https://www.quandl.com/api/v3/datasets.csv?database_code=%s&per_page=100&sort_by=id&page=%s&api_key=%s",
  Source, pages, QUANDL_API_KEY),
  FUN = fread,
  stringsAsFactors = FALSE)
TICKERS <- do.call(rbind, TICKERS)
The problem is I just put 100 pages, but when R tries to get a non-existing page (e.g. #57) it throws an error and does not continue. I was trying to do something like iferror, but failed.
Could you please give me some hints?
I am trying to read a lot of csv files into R from a website. There are multiple years of daily (business days only) files. All of the files have the same data structure. I can successfully read one file using the following logic:
# enter user credentials
user <- "JohnDoe"
password <- "SecretPassword"
credentials <- paste(user,":",password,"@",sep="")
web.site <- "downloads.theice.com/Settlement_Reports_CSV/Power/"
# construct path to data
path <- paste("https://", credentials, web.site, sep="")
# read data for 4/10/2013
file <- "icecleared_power_2013_04_10"
fname <- paste(path,file,".dat",sep="")
df <- read.csv(fname,header=TRUE, sep="|",as.is=TRUE)
However, I'm looking for tips on how to read all the files in the directory at once. I suppose I could generate a sequence of dates, construct the file name above in a loop, and use rbind to append each file, but that seems cumbersome. Plus there will be issues when attempting to read weekends and holidays, where there are no files.
The images below show what the list of files looks like in the web browser:
[screenshots of the browser directory listing omitted]
Is there a way to scan the path (from above) to get a list of all the file names in the directory that meet a certain criterion (i.e. start with "icecleared_power_", as there are also some files in that location with a different starting name that I do not want to read in), then loop read.csv through that list and use rbind to append?
Any guidance would be greatly appreciated!
I would first try to just scrape the links to the relevant data files and use the resulting information to construct the full download path that includes user logins and so on. As others have suggested, lapply would be convenient for batch downloading.
Here's an easy way to extract the URLs. Obviously, modify the example to suit your actual scenario.
Here, we're going to use the XML package to identify all the links available at the CRAN archives for the Amelia package (http://cran.r-project.org/src/contrib/Archive/Amelia/).
> library(XML)
> url <- "http://cran.r-project.org/src/contrib/Archive/Amelia/"
> doc <- htmlParse(url)
> links <- xpathSApply(doc, "//a/@href")
> free(doc)
> links
href href href
"?C=N;O=D" "?C=M;O=A" "?C=S;O=A"
href href href
"?C=D;O=A" "/src/contrib/Archive/" "Amelia_1.1-23.tar.gz"
href href href
"Amelia_1.1-29.tar.gz" "Amelia_1.1-30.tar.gz" "Amelia_1.1-32.tar.gz"
href href href
"Amelia_1.1-33.tar.gz" "Amelia_1.2-0.tar.gz" "Amelia_1.2-1.tar.gz"
href href href
"Amelia_1.2-2.tar.gz" "Amelia_1.2-9.tar.gz" "Amelia_1.2-12.tar.gz"
href href href
"Amelia_1.2-13.tar.gz" "Amelia_1.2-14.tar.gz" "Amelia_1.2-15.tar.gz"
href href href
"Amelia_1.2-16.tar.gz" "Amelia_1.2-17.tar.gz" "Amelia_1.2-18.tar.gz"
href href href
"Amelia_1.5-4.tar.gz" "Amelia_1.5-5.tar.gz" "Amelia_1.6.1.tar.gz"
href href href
"Amelia_1.6.3.tar.gz" "Amelia_1.6.4.tar.gz" "Amelia_1.7.tar.gz"
For the sake of demonstration, imagine that, ultimately, we only want the links for the 1.2 versions of the package.
> wanted <- links[grepl("Amelia_1\\.2.*", links)]
> wanted
href href href
"Amelia_1.2-0.tar.gz" "Amelia_1.2-1.tar.gz" "Amelia_1.2-2.tar.gz"
href href href
"Amelia_1.2-9.tar.gz" "Amelia_1.2-12.tar.gz" "Amelia_1.2-13.tar.gz"
href href href
"Amelia_1.2-14.tar.gz" "Amelia_1.2-15.tar.gz" "Amelia_1.2-16.tar.gz"
href href
"Amelia_1.2-17.tar.gz" "Amelia_1.2-18.tar.gz"
You can now use that vector as follows:
wanted <- links[grepl("Amelia_1\\.2.*", links)]
GetMe <- paste(url, wanted, sep = "")
lapply(seq_along(GetMe),
function(x) download.file(GetMe[x], wanted[x], mode = "wb"))
Update (to clarify your question in comments)
The last step in the example above downloads the specified files to your current working directory (use getwd() to verify where that is). If, instead, you know for sure that read.csv works on the data, you can also try to modify your anonymous function to read the files directly:
lapply(seq_along(GetMe),
function(x) read.csv(GetMe[x], header = TRUE, sep = "|", as.is = TRUE))
However, I think a safer approach might be to download all the files into a single directory first, and then use read.delim or read.csv or whatever works to read in the data, similar to what was suggested by @Andreas. I say safer because it gives you more flexibility in case files aren't fully downloaded and so on. In that case, instead of having to redownload everything, you would only need to download the files which were not fully downloaded.
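If you do go the download-first route, reading the saved files back in could look something like the sketch below; it reuses the '|' separator from your read.csv() call, and the local directory name is my placeholder:
theFiles <- list.files("./ice_data", pattern = "^icecleared_power_", full.names = TRUE)
ice <- do.call(rbind,
               lapply(theFiles, read.csv, header = TRUE, sep = "|", as.is = TRUE))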
@MikeTP, if all the reports start with "icecleared_power_" and a date which is a business date, the package "timeDate" offers an easy way to create a vector of business dates, like so:
require(timeDate)
tSeq <- timeSequence("2012-01-01","2012-12-31") # vector of days
tBiz <- tSeq[isBizday(tSeq)] # vector of business days
and
paste0("icecleared_power_",as.character.Date(tBiz))
gives you the concatenated file name.
If the web site follows a different logic regarding the naming of files we need more information as Ananda Mahto observed.
Keep in mind that when you create a date vector with timeDate you can get much more sophisticated than my simple example. You can take into account holiday schedules, stock exchange calendars, etc.
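For example, holidays can be dropped as well; the sketch below assumes the NYSE calendar is the relevant one, which may not be true for ICE data:
tBizHol <- tSeq[isBizday(tSeq, holidays = holidayNYSE(2012))] # weekdays minus NYSE holidays
paste0("icecleared_power_", format(tBizHol, "%Y_%m_%d"))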
You can try using the command "download.file".
### set up the path and destination
path <- "url where file is located"
dest <- "where on your hard disk you want the file saved"
### Ask R to try really hard to download your ".csv"
try(download.file(path, dest))
The trick to this is going to be figuring out how the "url" or "path" changes systematically between files. Often, web pages are built such that the URLs are systematic. In that case, you could potentially create a vector or data frame of URLs to iterate over inside an apply function.
All of this can be sandwiched inside an "lapply". The "data" object is simply whatever we are iterating over: it could be a vector of URLs, or a data frame of year and month observations which could then be used to create URLs within the "lapply" function.
### "dl" will apply a function to every element in our vector "data"
# It will also help keep track of files which have no download data
dl <- lapply(data, function(x) {
path <- 'url'
dest <- './data_intermediate/...'
try(download.file(path, dest))
})
### Assign element names to your list "dl"
names(dl) <- unique(data$name)
index <- sapply(dl, is.null)
### Figure out which downloads returned nothing
no.download <- names(dl)[index]
You can then use "list.files()" to merge all data together, assuming they belong in one data.frame
### Create a list of files you want to merge together
files <- list.files()
### Create a list of data.frames by reading each file into memory
data <- lapply(files, read.csv)
### Stack data together
data <- do.call(rbind, data)
Sometimes you will notice that a file has been corrupted after downloading. In this case, pay attention to the "mode" option of the download.file() command: set mode = "wb" when the file is stored in a binary format (the default text mode can corrupt binary files on Windows).
Another solution
library(rvest) # for read_html(), html_nodes(), html_text() and the %>% pipe

for (i in seq_along(stations)) {
  URL <- "" # base URL of the directory listing (fill in)
  files <- read_html(URL) %>% html_nodes("a") %>% html_text(trim = TRUE)
  files <- grep(stations[i], files, ignore.case = TRUE, value = TRUE)
  for (f in files) {
    destfile <- paste0("C:/Users/...", f)
    download.file(paste0(URL, f), destfile, mode = "wb")
  }
}
Hi, I'm new to R and I'm building off of two guides from the web. I figured out how to automate a script for data mining, but instead of being appended, the data is overwritten each time the code is run. I would like to have it appended; can anyone point me in the right direction?
Here is the script:
# loading the package is required once each session
require(XML)
# initialize a storage variable for Twitter tweets
mydata.vectors <- character(0)
# paginate to get more tweets
for (page in c(1:15))
{
# search parameter
twitter_q <- URLencode('#google OR #apple')
# construct a URL
twitter_url = paste('http://search.twitter.com/search.atom?q=',twitter_q,'&rpp=100&page=', page, sep='')
# fetch remote URL and parse
mydata.xml <- xmlParseDoc(twitter_url, asText=F)
# extract the titles
mydata.vector <- xpathSApply(mydata.xml, '//s:entry/s:title', xmlValue, namespaces =c('s'='http://www.w3.org/2005/Atom'))
# aggregate new tweets with previous tweets
mydata.vectors <- c(mydata.vector, mydata.vectors)
}
# how many tweets did we get?
length(mydata.vectors)
I think what you want is to save the results to disk between runs. So, something like this at the beginning:
if (!file.exists('path/to/file')) {
  mydata.vectors <- character(0)
} else {
  load('path/to/file')
}
And something like this at the end:
save(mydata.vectors, file='path/to/file')
Should do the trick. Of course you could get more sophisticated with save file types etc.
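An alternative (my suggestion, not part of the original answer) is saveRDS()/readRDS(), which stores a single object and lets you assign it to any name on reload:
rds_path <- 'path/to/file.rds'
mydata.vectors <- if (file.exists(rds_path)) readRDS(rds_path) else character(0)
# ... run the scraping loop, then persist the accumulated vector:
saveRDS(mydata.vectors, rds_path)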