I am trying to read a lot of csv files into R from a website. There are multiple years of daily (business days only) files. All of the files have the same data structure. I can successfully read one file using the following logic:
# enter user credentials
user <- "JohnDoe"
password <- "SecretPassword"
credentials <- paste(user,":",password,"@",sep="")
web.site <- "downloads.theice.com/Settlement_Reports_CSV/Power/"
# construct path to data
path <- paste("https://", credentials, web.site, sep="")
# read data for 4/10/2013
file <- "icecleared_power_2013_04_10"
fname <- paste(path,file,".dat",sep="")
df <- read.csv(fname,header=TRUE, sep="|",as.is=TRUE)
However, I'm looking for tips on how to read all the files in the directory at once. I suppose I could generate a sequence of dates, construct the file name above in a loop, and use rbind to append each file, but that seems cumbersome. Plus, there will be issues when attempting to read weekends and holidays, where there are no files.
The images below show what the list of files looks like in the web browser:
...
...
...
Is there a way to scan the path (from above) to get a list of all the file names in the directory that meet a certain criterion (i.e., start with "icecleared_power_", since there are also some files in that location with a different starting name that I do not want to read in), then loop read.csv through that list and use rbind to append?
Any guidance would be greatly appreciated.
I would first try to just scrape the links to the relevant data files and use the resulting information to construct the full download path that includes user logins and so on. As others have suggested, lapply would be convenient for batch downloading.
Here's an easy way to extract the URLs. Obviously, modify the example to suit your actual scenario.
Here, we're going to use the XML package to identify all the links available at the CRAN archives for the Amelia package (http://cran.r-project.org/src/contrib/Archive/Amelia/).
> library(XML)
> url <- "http://cran.r-project.org/src/contrib/Archive/Amelia/"
> doc <- htmlParse(url)
> links <- xpathSApply(doc, "//a/@href")
> free(doc)
> links
href href href
"?C=N;O=D" "?C=M;O=A" "?C=S;O=A"
href href href
"?C=D;O=A" "/src/contrib/Archive/" "Amelia_1.1-23.tar.gz"
href href href
"Amelia_1.1-29.tar.gz" "Amelia_1.1-30.tar.gz" "Amelia_1.1-32.tar.gz"
href href href
"Amelia_1.1-33.tar.gz" "Amelia_1.2-0.tar.gz" "Amelia_1.2-1.tar.gz"
href href href
"Amelia_1.2-2.tar.gz" "Amelia_1.2-9.tar.gz" "Amelia_1.2-12.tar.gz"
href href href
"Amelia_1.2-13.tar.gz" "Amelia_1.2-14.tar.gz" "Amelia_1.2-15.tar.gz"
href href href
"Amelia_1.2-16.tar.gz" "Amelia_1.2-17.tar.gz" "Amelia_1.2-18.tar.gz"
href href href
"Amelia_1.5-4.tar.gz" "Amelia_1.5-5.tar.gz" "Amelia_1.6.1.tar.gz"
href href href
"Amelia_1.6.3.tar.gz" "Amelia_1.6.4.tar.gz" "Amelia_1.7.tar.gz"
For the sake of demonstration, imagine that, ultimately, we only want the links for the 1.2 versions of the package.
> wanted <- links[grepl("Amelia_1\\.2.*", links)]
> wanted
href href href
"Amelia_1.2-0.tar.gz" "Amelia_1.2-1.tar.gz" "Amelia_1.2-2.tar.gz"
href href href
"Amelia_1.2-9.tar.gz" "Amelia_1.2-12.tar.gz" "Amelia_1.2-13.tar.gz"
href href href
"Amelia_1.2-14.tar.gz" "Amelia_1.2-15.tar.gz" "Amelia_1.2-16.tar.gz"
href href
"Amelia_1.2-17.tar.gz" "Amelia_1.2-18.tar.gz"
You can now use that vector as follows:
wanted <- links[grepl("Amelia_1\\.2.*", links)]
GetMe <- paste(url, wanted, sep = "")
lapply(seq_along(GetMe),
function(x) download.file(GetMe[x], wanted[x], mode = "wb"))
Update (to clarify your question in comments)
The last step in the example above downloads the specified files to your current working directory (use getwd() to verify where that is). If, instead, you know for sure that read.csv works on the data, you can also try to modify your anonymous function to read the files directly:
lapply(seq_along(GetMe),
function(x) read.csv(GetMe[x], header = TRUE, sep = "|", as.is = TRUE))
However, I think a safer approach might be to download all the files into a single directory first, and then use read.delim or read.csv or whatever works to read in the data, similar to what was suggested by @Andreas. I say safer because it gives you more flexibility in case files aren't fully downloaded and so on. In that case, instead of having to re-download everything, you would only need to download the files that were not fully downloaded.
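For instance, here is a minimal sketch of that download-then-read workflow for the OP's pipe-delimited files, assuming they have already been saved to a local directory (the "ice_data" directory name is just an illustration):
# read every downloaded file and stack them into one data frame
local_files <- list.files("ice_data", pattern = "^icecleared_power_.*\\.dat$",
                          full.names = TRUE)
df_list <- lapply(local_files, read.csv, header = TRUE, sep = "|", as.is = TRUE)
combined <- do.call(rbind, df_list)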
@MikeTP, if all the reports start with "icecleared_power_" followed by a business date, the package "timeDate" offers an easy way to create a vector of business dates, like so:
require(timeDate)
tSeq <- timeSequence("2012-01-01","2012-12-31") # vector of days
tBiz <- tSeq[isBizday(tSeq)] # vector of business days
and
paste0("icecleared_power_", gsub("-", "_", as.character(tBiz))) # underscores to match file names like "icecleared_power_2013_04_10"
gives you the concatenated file names.
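Combined with the path construction from the question, a minimal sketch would be the following (tryCatch() is used so that any remaining holidays simply yield NULL instead of stopping the loop):
fnames <- paste0(path, "icecleared_power_", gsub("-", "_", as.character(tBiz)), ".dat")
all_data <- do.call(rbind, lapply(fnames, function(f) {
  tryCatch(read.csv(f, header = TRUE, sep = "|", as.is = TRUE),
           error = function(e) NULL) # skip dates with no file (holidays)
}))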
If the web site follows a different logic regarding the naming of files, we need more information, as Ananda Mahto observed.
Keep in mind that when you create a date vector with timeDate you can get much more sophisticated than my simple example. You can take into account holiday schedules, stock exchange calendars, etc.
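For example, timeDate ships several holiday calendars; excluding NYSE holidays as well as weekends is a one-liner (a sketch, using the same tSeq as above):
tBiz <- tSeq[isBizday(tSeq, holidays = holidayNYSE(2012))] # drop NYSE holidays too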
You can try using the download.file() command.
### set up the path and destination
path <- "url where file is located"
dest <- "where on your hard disk you want the file saved"
### Ask R to try really hard to download your ".csv"
try(download.file(path, dest))
The trick to this is going to be figuring out how the URL or path changes systematically between files. Often, web pages are built such that the URLs are systematic. In this case, you could potentially create a vector or data frame of URLs to iterate over inside of an apply function.
All of this can be sandwiched inside of an lapply(). The "data" object is simply whatever we are iterating over. It could be a vector of URLs or a data frame of year and month observations, which could then be used to create URLs within the lapply() function.
### "dl" will apply a function to every element in our vector "data"
# It will also help keep track of files which have no download data
dl <- lapply(data, function(x) {
path <- 'url'
dest <- './data_intermediate/...'
try(download.file(path, dest))
})
### Assign element names to your list "dl"
names(dl) <- unique(data$name)
### Figure out which downloads failed (try() returns a "try-error" object on failure)
index <- sapply(dl, inherits, "try-error")
no.download <- names(dl)[index]
You can then use list.files() to gather the downloaded files and merge all the data together, assuming they belong in one data.frame:
### Create a list of files you want to merge together
files <- list.files()
### Create a list of data.frames by reading each file into memory
data <- lapply(files, read.csv)
### Stack data together
data <- do.call(rbind, data)
Sometimes you will notice that a file has been corrupted after downloading. In this case, pay attention to the "mode" option of the download.file() command: set mode = "wb" if the file is stored in a binary format, since the default mode = "w" can corrupt binary downloads on Windows.
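For example, for a binary file such as a zip or an xls, the safer call (reusing the path and dest placeholders from above) would be:
### Binary-safe download
try(download.file(path, dest, mode = "wb"))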
Another solution
library(rvest) # also provides the %>% pipe

for (i in seq_along(stations)) {
  URL <- ""  # base URL of the directory listing
  files <- read_html(URL) %>% html_nodes("a") %>% html_text(trim = TRUE)
  files <- grep(stations[i], files, ignore.case = TRUE, value = TRUE)
  destfile <- paste0("C:/Users/...", files)
  # download each matched file (not the listing URL itself)
  mapply(download.file, paste0(URL, files), destfile, MoreArgs = list(mode = "wb"))
}
Related
I'm trying to read multiple .csv files from a URL starting with http. All files can be found on the same website. Generally, the structure of the file name is: yyyy_mm_dd_location_XX.csv
There are three different locations (let's say locA, locB, locC), and for each location there is one file per day of the month. So the file names would be e.g. "2009_10_01_locA_XX.csv", "2009_10_02_locA_XX.csv", and so forth.
The structure, i.e. the number of columns, is the same for all csv files; however, the number of rows differs.
I'd like to combine all these files into one csv file, but I'm having trouble reading them from the website because of the changing names.
Thanks a lot for any ideas!
Here is a way to programmatically generate the names of the files, and then run download.file() to download them. Since no reproducible example was given with the question, one needs to change the code to the correct HTTP location to access the files.
startDate <- as.Date("2019-10-01","%Y-%m-%d")
dateVec <- startDate + 0:4 # create additional dates by adding integers
library(lubridate)
downloadFileNames <- unlist(lapply(dateVec,function(x) {
  locs <- c("locA","locB","locC")
  # zero-pad month and day so the names match the yyyy_mm_dd pattern
  paste(year(x),sprintf("%02d",month(x)),sprintf("%02d",day(x)),locs,"XX",sep="_")
}))
head(downloadFileNames)
We print the head() of the vector to show the correct naming pattern.
> head(downloadFileNames)
[1] "2019_10_01_locA_XX" "2019_10_01_locB_XX" "2019_10_01_locC_XX"
[4] "2019_10_02_locA_XX" "2019_10_02_locB_XX" "2019_10_02_locC_XX"
Next, we'll create a directory to store the files, and download them.
# create a subdirectory to store the files
if(!dir.exists("./data")) dir.create("./data")
# download files, as https://www.example.com/2019_10_01_locA_XX.csv
# to ./data/2019_10_01_locA_XX.csv, etc.
result <- lapply(downloadFileNames,function(x){
download.file(paste0("https://www.example.com/",x,".csv"),
paste0("./data/",x,".csv"))
})
Once the files are downloaded, we can use list.files() to retrieve the path names, read the data with read.csv(), and combine them into a single data frame with do.call().
theFiles <- list.files("./data",pattern = ".csv",full.names = TRUE)
dataList <- lapply(theFiles,read.csv)
data <- do.call(rbind,dataList)
I want to get multiple zip files from an FTP server.
I can get the zip files individually with the help of previous posts.
But that would be a lot of work for all the data I need, so I wanted to find an automated way.
The ftp looks like this:
ftp://ftp-cdc.dwd.de/pub/CDC/observations_germany/climate/10_minutes/solar/now/10minutenwerte_SOLAR_01048_now.zip
I want to change the "01048" to the ID of the nearest weather station, which I already have in a data frame (Data).
I thought I could just loop over all the needed stations:
for(y in Data$StationsID) {
  urls <- "ftp://ftp-cdc.dwd.de/pub/CDC/observations_germany/climate/10_minutes/solar/now/10minutenwerte_SOLAR_{y}_now.zip"
}
but I only get "ftp://ftp-cdc.dwd.de/pub/CDC/observations_germany/climate/10_minutes/solar/now/10minutenwerte_SOLAR_{y}_now.zip"
Each zip file holds a .txt file that is ideal for csv-style analysis.
Later I want to use the files to get solar data from different points in Germany.
But first I need a list like this, and I don't know how to get it:
urls
[1] url_1
[2] url_2
.
.
You do not even need a loop. Try
urls <- paste0("ftp://ftp-cdc.dwd.de/pub/CDC/observations_germany/climate/10_minutes/solar/now/10minutenwerte_SOLAR_", Data$StationsID, "_now.zip")
This will give you a vector of all URLs.
Afterwards you can fetch all files using e.g. lapply.
results <- lapply(urls, FUN = function(u) {
  # FETCH
  # UNZIP
  # TIDY
  # ...
})
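A hedged sketch of what those three steps could look like (the semicolon separator is an assumption about the layout of the DWD .txt files; check one file first):
results <- lapply(urls, function(u) {
  tmp <- tempfile(fileext = ".zip")
  download.file(u, tmp, mode = "wb")            # FETCH
  txt <- unzip(tmp, exdir = tempdir())          # UNZIP (returns the extracted paths)
  read.table(txt[1], header = TRUE, sep = ";",  # TIDY
             stringsAsFactors = FALSE)
})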
(👍🏼 first question!)
An alternate approach is below.
We:
get all the possible data files
filter it to just the SOLAR zip data files (you could filter more if needed — i.e. restrict it to what you have in your existing data frame)
make a save location
download the files
Note that it is really bad form to hammer a server with consecutive requests without a pause, so the loop below introduces one; that's an oft-overlooked courtesy these days.
library(curl)
library(httr)
base_dir <- "ftp://ftp-cdc.dwd.de/pub/CDC/observations_germany/climate/10_minutes/solar/now/"
# Get all available files in that directory
res <- curl_fetch_memory(base_dir, handle = new_handle(dirlistonly = TRUE))

# curl_fetch_memory() returns a raw vector (it has no idea what type of
# content might be there), so convert it to text, split the directory
# listing into lines, and keep only the SOLAR zip file names
grep(
  "SOLAR_[[:digit:]]{5}",
  strsplit(rawToChar(res$content), "\n")[[1]],
  value = TRUE
) -> all_zips
head(all_zips)
## [1] "10minutenwerte_SOLAR_00044_now.zip"
## [2] "10minutenwerte_SOLAR_00071_now.zip"
## [3] "10minutenwerte_SOLAR_00073_now.zip"
## [4] "10minutenwerte_SOLAR_00131_now.zip"
## [5] "10minutenwerte_SOLAR_00150_now.zip"
## [6] "10minutenwerte_SOLAR_00154_now.zip"
save_dir <- "~/Data/solar-output"
dir.create(save_dir)
for (zip in all_zips) {
try(httr::GET(
url = sprintf("%s%s", base_dir, zip),
httr::write_disk(file.path(save_dir, zip)), # enables caching (it won't overwrite by default and avoid issues with download.file() on windows)
httr::progress() # progress bars for free!
))
Sys.sleep(5) # be kind to their server CPU and network bandwidth
}
We wrap the GET() in try() since we've asked write_disk() not to overwrite existing files. It throws an exception when that happens, so try() catches it and lets the loop keep going (while still displaying the helpful message that the file already exists).
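If you would rather avoid the exceptions altogether, a small sketch of an alternative is to filter the file list before the loop:
# request only the files that are not already present in the save directory
to_get <- all_zips[!file.exists(file.path(save_dir, all_zips))]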
My company documents summaries of policies/services for each client in a PDF-formatted file. These files are combined into a large dataset each year: one row per client, with columns for the variables in the client's document. There are a couple thousand of these files, and each one has approximately 20-30 variables. I want to automate this process by creating a data.frame with each row representing a client, and then pulling the variables for each client from their PDF document. I'm able to create a list or data.frame of all the clients from the PDF filenames in a directory, but I don't know how to write a loop that pulls each variable I need from each document. I currently have two different methods which I can't decide between, and I also need help with a loop that grabs the variables I need from each client document. My code and links to two mock files are provided below. Any help would be appreciated!
Files: Client 1 and Client 2
Method 1: pdftools
The benefit of the first method is that it extracts the entire PDF into a vector, with each page in a separate element. This makes it easier for me to pull strings/variables. However, I don't know how to loop it to pull the information from each client and appropriately place it in a column for each client.
library(pdftools)
library(stringr)
library(dplyr)
Files <- list.files(path="...", pattern=".pdf")
Files <- Files %>% mutate(FR =
  str_match(text, "\\$\\d+\\s+Financial Reporting")) # Extract the first variable
Method 2:
The benefit of this approach is that it automatically creates a data frame of the client documents, with the file name as a row identifier and the entire text of each PDF in a single variable. The downside is that having an entire PDF in one variable makes it more difficult to match and extract strings compared to having each page in its own element. I don't know how to write a loop that will extract variables for each client and place them in their respective columns.
library(readtext)
DF <- readtext("directory pathway/*.pdf")
DF <- DF %>% mutate(FR =
  str_match(text, "\\$\\d+\\s+Financial Reporting"))
Here's a basic framework that I think solves your problem using your proposed Method 1.
library(pdftools)
library(stringr)
Files <- list.files(path="pdfs/", pattern=".pdf")
lf <- length(Files)
client_df <- data.frame(client = rep(NA, lf), fr = rep(NA, lf))
for(i in 1:lf){
# extract the text from the pdf
f <- pdf_text(paste0("pdfs/", Files[i]))
# remove commas from numbers
f <- gsub(',', '', f)
# extract variables
client_name <- str_match(f[1], "Client\\s+\\d+")[[1]]
fr <- as.numeric(str_match(f[1], "\\$(\\d+)\\s+Financial Reporting")[[2]])
# add variables to your dataframe
client_df$client[i] <- client_name
client_df$fr[i] <- fr
}
I removed commas from the text under the assumption that you'll want to use any numeric variables you extract as numbers in some analysis. This removes all commas, though, so if those matter in other areas you'll have to rethink that.
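If the commas do matter elsewhere, a more surgical sketch is to strip them only when they sit between digits:
# remove commas only inside numbers (e.g. "1,234" -> "1234"), leaving text commas alone
f <- gsub("(?<=\\d),(?=\\d)", "", f, perl = TRUE)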
Also note that I put the sample PDFs into a directory called 'pdfs'.
I would imagine that with a little creative regex you can extract anything else that would be useful. Using this method makes it easy to scrape the data if the elements of interest will always be on the same pages across all documents. (Note the index on f in the str_match lines.) Hope this helps!
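For instance (the variable name and pattern here are hypothetical), a figure that always appears on the second page could be pulled inside the same loop with an index of 2:
# hypothetical example: a dollar amount labelled "Audit" on page 2
audit_fee <- as.numeric(str_match(f[2], "\\$(\\d+)\\s+Audit")[[2]])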
Using this script I have created a specific folder for each csv file and then saved all my further analysis results in this folder. The name of the folder and the csv file are the same. The csv files are stored in the main/master directory.
Now, I have created a csv file in each of these folders which contains a list of all the fitted values.
I would now like to do the following:
Set the working directory to the particular filename
Read fitted values file
Add a row/column stating the name of the site/ unique ID
Add it to the masterfile which is stored in the main directory with a title specifying site name/filename. It can be stacked by rows or by columns it doesn't really matter.
Come to the main directory to pick the next file
Repeat the loop
Using merge(), rbind(), or cbind() combines all the data under one set of column names. I want to keep all the sites separate for comparison at a later stage.
This is what I'm using at the moment and I'm lost on how to proceed further.
setwd( "path") # main directory
path <-"path" # need this for convenience while switching back to main directory
# import all files and create a character type array
files <- list.files(path=path, pattern="*.csv")
for(i in seq(1, length(files), by = 1)){
fileName <- read.csv(files[i]) # repeat to set the required working directory
base <- strsplit(files[i], ".csv")[[1]] # getting the filename
setwd(file.path(path, base)) # setting the working directory to the same filename
master <- read.csv(paste(base,"_fiited_values curve.csv"))
# read the fitted value csv file for the site and store it in a list
}
I want to construct a for loop to make one master file from the files in the different directories. I do not want to merge everything under one column name.
For example, if I have 50 similar csv files, each with two columns of data, I would like one csv file which accommodates all of it, but in its original format rather than appended to the existing rows/columns. So I would then have 100 columns of data.
Please tell me what further information I can provide.
For reading a group of files from a number of different directories, with pathnames patha, pathb, and pathc:
paths = c('patha','pathb','pathc')
files = unlist(sapply(paths, function(path) list.files(path,pattern = "*.csv", full.names = TRUE)))
listContainingAllFiles = lapply(files, read.csv)
If you want to be really quick about it, you can grab fread from data.table:
library(data.table)
listContainingAllFiles = lapply(files, fread)
Either way this will give you a list of all objects, kept separate. If you want to join them together vertically/horizontally, then:
do.call(rbind, listContainingAllFiles)
do.call(cbind, listContainingAllFiles)
EDIT: NOTE, the latter makes no sense unless your rows actually mean something when they correspond across files. It makes far more sense to just create a field tracking which location the data is from.
If you want to include the names of the files as the method of determining sample location (I don't see where you're getting this info from in your example), then you want to do this as you read in the files, so:
listContainingAllFiles = lapply(files,
function(file) data.frame(filename = file,
read.csv(file)))
Then later you can split that column to get your details (assuming, of course, you have a standard naming convention).
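A base-R sketch of that splitting step (it assumes the site name is simply the file name without its directory and .csv extension; adjust to your convention):
combined <- do.call(rbind, listContainingAllFiles)
# strip any directory prefix and the .csv extension to recover the site name
combined$site <- sub("\\.csv$", "", basename(as.character(combined$filename)))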
I have some data from Elections Canada that I am trying to load into R. It is in .csv files and I can view the data in both Excel and OpenOffice. (If you are curious, it is the 2011 poll results data from Elections Canada, available here.)
The data is coded in an unusual manner. A typical line is:
12002,Central Nova","Nova-Centre"," 1","River John",N,N,"",1,299,"Chisholm","","Matthew","Green Party","Parti Vert",N,N,11
There is a " at the end of "Central Nova" but not at the beginning. So in order to read in the data, I suppressed the quotes, which worked fine for the first few files, i.e.:
test<-read.csv("pollresults_resultatsbureau11001.csv",header = TRUE,sep=",",fileEncoding="latin1",as.is=TRUE,quote="")
Now here is the problem: in another file (eg. pollresults_resultatsbureau12002.csv), there is a line of data like this:
12002,Central Nova","Nova-Centre"," 6-1","Pictou, Subd. A",N,N,"",0,168,"Parker","","David K.","NDP-New Democratic Party","NPD-Nouveau Parti democratique",N,N,28
Because I need to suppress the quotes, the entry "Pictou, Subd. A" makes R want to split this into two variables. The data can't be read in, since R wants to add a column halfway through constructing the data frame.
Excel and OpenOffice both can open these files no problem. Somehow, Excel and OpenOffice know that quotation marks only matter if they are at the beginning of a variable entry.
Do you know what option I need to enable on R to get this data in? I have >300 files that I need to load (each with ~1000 rows each) so a manual fix is not an option...
I have looked all over the place for a solution but can't find one.
Building on my comments, here is a solution that would read all the CSV files into a single list.
# Deal with French properly
options(encoding="latin1")
# Set your working directory to where you have
# unzipped all of your 308 CSV files
setwd("path/to/unzipped/files")
# Get the file names
temp <- list.files()
# Extract the 5-digit code which we can use as names
Codes <- gsub("pollresults_resultatsbureau|.csv", "", temp)
# Read all the files into a single list named "pollResults"
pollResults <- lapply(seq_along(temp), function(x) {
T0 <- readLines(temp[x])
T0[-1] <- gsub('^(.{6})(.*)$', '\\1\\"\\2', T0[-1])
final <- read.csv(text = T0, header = TRUE)
final
})
names(pollResults) <- Codes
You can easily work with this list in different ways. If you wanted to just see the 90th data.frame you can access it by using pollResults[[90]] or by using pollResults[["24058"]] (in other words, either by index number or by district number).
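For example, to stack all 308 districts into a single data.frame while keeping the district code as an identifier column (a sketch):
allResults <- do.call(rbind, lapply(names(pollResults), function(code) {
  cbind(District = code, pollResults[[code]])
}))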
Having the data in this format means you can also do a lot of other convenient things. For instance, if you wanted to fix all 308 of the CSVs in one go, you can use the following code, which will create new CSVs with the file name prefixed with "Corrected_".
invisible(lapply(seq_along(pollResults), function(x) {
NewFilename <- paste("Corrected", temp[x], sep = "_")
write.csv(pollResults[[x]], file = NewFilename,
quote = TRUE, row.names = FALSE)
}))
Hope this helps!
This answer is mainly addressed to @AnandaMahto (see comments to the original question).
First, it helps to set some options globally because of the French accents in the data:
options(encoding="latin1")
Next, read in the data verbatim using readLines():
temp <- readLines("pollresults_resultatsbureau13001.csv")
Following this, simply replace the first comma in each line of data with a comma+quotation. This works because the first field is always 5 characters long. Note that it leaves the header untouched.
temp[-1] <- gsub('^(.{6})(.*)$', '\\1\\"\\2', temp[-1])
Penultimately, write over the original file.
fileConn<-file("pollresults_resultatsbureau13001.csv")
writeLines(temp,fileConn)
close(fileConn)
Finally, simply read the data back into R:
data<-read.csv(file="pollresults_resultatsbureau13001.csv",header = TRUE,sep=",")
There is probably a more parsimonious way to do this (and one that can be iterated more easily) but this process made sense to me.
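If you did want to iterate it, one sketch (reusing the regex above and read.csv(text = ...) rather than rewriting the files on disk) would be:
fix_file <- function(fname) {
  tmp <- readLines(fname)
  tmp[-1] <- gsub('^(.{6})(.*)$', '\\1\\"\\2', tmp[-1])
  read.csv(text = tmp, header = TRUE)
}
all_polls <- lapply(list.files(pattern = "^pollresults_.*\\.csv$"), fix_file)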