Reading multiple offline html files to a list in R


I have raw data as 20 offline html files stored in the following format:
../rawdata/1999_table.html
../rawdata/2000_table.html
../rawdata/2001_table.html
../rawdata/2002_table.html
.
.
../rawdata/2017_table.html
These files contain tables that I am extracting and reshaping to a particular format.
I want to read these files at once to a list and process them one by one through a function that I have written.
What I tried:
I put the names of these files into an Excel file called filestoread.xlsx and used a for loop to load these files using the names mentioned in the sheet. But it doesn't seem to work
filestoread <- fread("../rawdata/filestoread.csv")
x <- list()
for (i in nrow(filestoread)) {
x[[i]] <- read_html(paste0("../rawdata/", filestoread[i]))
}
How can this be done?
Also, after reading the HTML files I want to extract the tables from them and reshape them using a function I wrote after converting it to a data table.
My final objective is to rbind all the tables and have a single data table with year wise entries of the tables in the html file.

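First, build a vector of file paths in one of the following ways.
Either hardcode the paths:
filestoread <- paste0("../rawdata/", 1999:2017, "_table.html")
or list all html files in the directory (full.names = TRUE keeps the directory in each path, so read_html() can find the files):
filestoread <- list.files(path = "../rawdata/", pattern = "\\.html$", full.names = TRUE)
Then use lapply():
library(rvest)
x <- lapply(filestoread, function(f) try(read_html(f)))
Note: try() catches the error if a file is missing or unreadable, so the remaining files are still read; the failed element is an object of class "try-error".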
The second part of your question is a little broad, depends on the content of your files, and there are already some answers, you could consider e.g. this answer. In principle you use a combination of ?html_nodes and ?html_table.
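As a rough sketch of that second part, the following reads every file, takes the first table on each page, tags it with the year from the file name, and stacks the results. The demo files, the "table" selector, the column names and the helper read_one() are assumptions made for illustration; your own reshaping function would go where the year column is added.

```r
library(rvest)
library(data.table)

# Hypothetical demo data: two tiny HTML files in a temporary directory.
demo_dir <- file.path(tempdir(), "rawdata_demo")
dir.create(demo_dir, showWarnings = FALSE)
for (yr in 1999:2000) {
  writeLines(
    sprintf("<html><body><table><tr><th>item</th><th>value</th></tr><tr><td>a</td><td>%d</td></tr></table></body></html>", yr),
    file.path(demo_dir, paste0(yr, "_table.html"))
  )
}

files <- list.files(demo_dir, pattern = "_table\\.html$", full.names = TRUE)

# Illustrative helper: read one page and return its first table as a data.table.
read_one <- function(path) {
  page <- read_html(path)
  tab <- html_table(html_node(page, "table"))  # first <table> on the page
  dt <- as.data.table(tab)
  # Year taken from the first four characters of the file name.
  dt[, year := as.integer(substr(basename(path), 1, 4))]
  dt[]
}

# One data.table with year-wise entries from all files.
all_tables <- rbindlist(lapply(files, read_one), fill = TRUE)
```

If a page holds several tables, html_nodes(page, "table") returns all of them and you would pick the one you need by position or by a more specific selector.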

Related

Creating temporary data frames in R

I am importing multiple excel workbooks, processing them, and appending them subsequently. I want to create a temporary dataframe (tempfile?) that holds nothing in the beginning, and after each successive workbook processing, append it. How do I create such temporary dataframe in the beginning?
I am coming from Stata and I use tempfile a lot. Is there a counterpart to tempfile from Stata to R?

As @James said, you do not need an empty data frame or tempfile; simply add newly processed data frames to the first data frame. Here is an example (based on csv, but the logic is the same):
list_of_files <- c('1.csv','2.csv',...)
pre_processor <- function(dataframe){
# do stuff
dataframe
}
library(dplyr)
dataframe <- pre_processor(read.csv('1.csv')) %>%
rbind(pre_processor(read.csv('2.csv'))) %>%
...
Now if you have a lot of files or very complicated pre-processing, then you might have other questions (e.g. how to loop over the list of files or how to write the right pre-processing function), but these should be separate and we really need more specifics (example data, code so far, etc.).
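For more than a handful of files, the same idea can be written without an accumulator at all: process each file into a list element and bind the list once at the end. This is only a sketch; the demo files and the body of pre_processor() are made up for illustration.

```r
# Hypothetical pre-processing step: here it only adds a row-count column.
pre_processor <- function(df) {
  df$n_rows <- nrow(df)
  df
}

# Demo input: two small csv files written to a temporary directory.
demo_dir <- file.path(tempdir(), "csv_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.csv(data.frame(id = 1:2, value = c(10, 20)),
          file.path(demo_dir, "1.csv"), row.names = FALSE)
write.csv(data.frame(id = 3:5, value = c(30, 40, 50)),
          file.path(demo_dir, "2.csv"), row.names = FALSE)

list_of_files <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)

# Read and process every file, then bind all results in one step.
processed <- lapply(list_of_files, function(f) pre_processor(read.csv(f)))
combined <- do.call(rbind, processed)
```

Binding once at the end also avoids copying the growing data frame on every iteration, which is what repeated rbind() calls in a loop do.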

How can I write multiple csv files in a specific directory and then merge them into a single csv?

I am trying to deal with extracting a subset from multiple .grb2 files in the same file path, and write them in separate csv files. I am using the following code which does the job and stores the csv files in the same directory as the .grb2 files.
path <- "file path"
input.file.names <- dir(path, pattern =".grb2")
output.file.names <-
paste0(tools::file_path_sans_ext(input.file.names),".csv")
for(i in 1:length(input.file.names)){
GRIB <- brick(input.file.names[i])
GRIB <- as.array(GRIB)
tmp2m.6hr <- GRIB[46,13,c(1:20)]
str(tmp2m.6hr)
tmp2m.data <- data.frame(tmp2m.6hr)
write.csv(tmp2m.data,output.file.names[i])
}
My first question is this: how can I store the csv files in a different directory than the .grb2 files?
My .grb2 files, and thus the resulting csv files, have four different endings, i.e. 00.grb2, 06.grb2, 12.grb2, 18.grb2. The resulting csv files have the following form:
[image not available]
My second question is: how can I merge all my 00.csv, 06.csv, 12.csv, 18.csv files (each category in its own column) into a single csv file in a directory of my choice with the following headers: 00_tmp2m.6hr, 06_tmp2m.6hr, 12_tmp2m.6hr, 18_tmp2m.6hr, and also create a fifth column with the average of the other four? The result that I want is the following:
[image not available]
As I am not an experienced user, this is too complicated for me. I would very much appreciate any assistance with this.

For your first question, you might try specifying the output path relative to the working directory, as in write.csv(tmp2m.data, paste0("./myfolder/", output.file.names[i])); the folder has to exist before you write to it.
Your second question might be easier if you read the data back in, combine it, and then write your results as a new file. Note that write.csv() ignores the append argument (with a warning); if you want to append to an existing file, use write.table(..., append = TRUE) instead.
Also, you might get a better answer by creating a minimal example.
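A minimal sketch of both steps, assuming one csv per cycle with a single tmp2m.6hr column and the same number of rows in each. The output directory, the demo_ file names and the demo values are invented for the example.

```r
out_dir <- file.path(tempdir(), "csv_out")  # hypothetical output directory
dir.create(out_dir, showWarnings = FALSE)   # must exist before writing

# Demo input: one csv per cycle, each with a tmp2m.6hr column.
cycles <- c("00", "06", "12", "18")
for (k in seq_along(cycles)) {
  write.csv(data.frame(tmp2m.6hr = k * (1:3)),
            file.path(out_dir, paste0("demo_", cycles[k], ".csv")),
            row.names = FALSE)
}

# Read the tmp2m.6hr column of each file and put the columns side by side.
cols <- lapply(cycles, function(cy) {
  read.csv(file.path(out_dir, paste0("demo_", cy, ".csv")))$tmp2m.6hr
})
merged <- as.data.frame(do.call(cbind, cols))
names(merged) <- paste0(cycles, "_tmp2m.6hr")

# Fifth column: row-wise average of the four cycles.
merged$average <- rowMeans(merged)

write.csv(merged, file.path(out_dir, "merged.csv"), row.names = FALSE)
```

Column names that start with a digit are written as given, but read.csv() will alter them on the way back in unless you pass check.names = FALSE.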

Why can I only read one .json file at a time?

I have 500+ .json files that I am trying to get a specific element out of. I cannot figure out why I cannot read more than one at a time..
This works:
library (jsonlite)
files<-list.files('~/JSON')
file1<-fromJSON(readLines('~/JSON/file1.json'),flatten=TRUE)
result<-as.data.frame(source=file1$element$subdata$data)
However, regardless of using different json packages (eg RJSONIO), I cannot apply this to the entire contents of files. The error I continue to get is...
attempt to run same code as function over all contents in file list
for (i in files) {
fromJSON(readLines(i),flatten = TRUE)
as.data.frame(i)$element$subdata$data}
My goal is to loop through all 500+ and extract the data and its contents. Specifically if the file has the element ‘subdata$data’, i want to extract the list and put them all in a dataframe.
Note: files are being read as ASCII (Windows OS). This does not have a negative effect on single extractions, but for the loop I get 'invalid character bytes'.
Update 1/25/2019
Ran the following but returned errors...
files<-list.files('~/JSON')
out<-lapply(files,function (fn) {
o<-fromJSON(file(i),flatten=TRUE)
as.data.frame(i)$element$subdata$data
})
Error in file(i): object 'i' not found
Also updated function, this time with UTF* errors...
files<-list.files('~/JSON')
out<-lapply(files,function (i,fn) {
o<-fromJSON(file(i),flatten=TRUE)
as.data.frame(i)$element$subdata$data
})
Error in parse_con(txt,bigint_as_char):
lexical error: invalid bytes in UTF8 string. (right here)------^
Latest Update
I think I found a solution to the crazy 'bytes' problem. When I run readLines on the .json file, I can then apply fromJSON(),
e.g.
json<-readLines('~/JSON')
jsonread<-fromJSON(json)
jsondf<-as.data.frame(jsonread$element$subdata$data)
#returns a dataframe with the correct information
Problem is, I cannot apply readLines to all the files within the JSON folder (PATH). If I can get help with that, I think I can run...
files<-list.files('~/JSON')
for (i in files){
a<-readLines(i)
o<-fromJSON(file(a),flatten=TRUE)
as.data.frame(i)$element$subdata}
Needed Steps
apply readLines to all 500 .json files in JSON folder
apply fromJSON to files from step.1
create a data.frame that returns entries if list (fromJSON) contains $element$subdata$data.
Thoughts?
Solution (Workaround?)
Unfortunately, the fromJSON still runs in to trouble with the .json files. My guess is that my GET method (httr) is unable to wait/delay and load the 'pretty print' and thus is grabbing the raw .json which in-turn is giving odd characters and as a result giving the ubiquitous '------^' error. Nevertheless, I was able to put together a solution, please see below. I want to post it for future folks that may have the same problem with the .json files not working nicely with any R json package.
#keeping the same 'files' variable as earlier
raw_data<-lapply(files,readLines)
dat<-do.call(rbind,raw_data)
dat2<-as.data.frame(dat,stringsAsFactors=FALSE)
#check to see json contents were read-in
dat2[1,1]
library(tidyr)
dat3<-separate_rows(dat2,sep='')
x<-unlist(raw_data)
x<-gsub('[[:punct:]]', ' ',x)
#Identify elements wanted in original .json and apply regex
y<-regmatches(x,regexec('.*SubElement2 *(.*?) *Text.*',x))

for loops never return anything, so you must save all valuable data yourself.
You call as.data.frame(i) which is creating a frame with exactly one element, the filename, probably not what you want to keep.
(Minor) Use fromJSON(file(i),...).
Since you want to capture these into one frame, I suggest something along the lines of:
out <- lapply(files, function(fn) {
o <- fromJSON(file(fn), flatten = TRUE)
as.data.frame(o$element$subdata$data)
})
allout <- do.call(rbind.data.frame, out)
### alternatives:
allout <- dplyr::bind_rows(out)
allout <- data.table::rbindlist(out)
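For completeness, here is a small self-contained sketch of the same pattern that also skips files lacking element$subdata$data. The demo files, their fields (id, name) and the helper extract_one() are invented for illustration; note the full.names = TRUE, without which the file names would only resolve from inside the JSON folder.

```r
library(jsonlite)

# Demo input: three small JSON files, one of them without subdata$data.
demo_dir <- file.path(tempdir(), "json_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines('{"element": {"subdata": {"data": [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]}}}',
           file.path(demo_dir, "file1.json"))
writeLines('{"element": {"other": 1}}',
           file.path(demo_dir, "file2.json"))
writeLines('{"element": {"subdata": {"data": [{"id": 3, "name": "c"}]}}}',
           file.path(demo_dir, "file3.json"))

files <- list.files(demo_dir, pattern = "\\.json$", full.names = TRUE)

# Illustrative helper: return the wanted element as a data frame, or NULL.
extract_one <- function(fn) {
  o <- fromJSON(fn)
  d <- o$element$subdata$data
  if (is.null(d)) return(NULL)      # file has no element$subdata$data
  d <- as.data.frame(d, stringsAsFactors = FALSE)
  d$source <- basename(fn)          # remember which file each row came from
  d
}

out <- lapply(files, extract_one)
out <- Filter(Negate(is.null), out) # drop files without the element
allout <- do.call(rbind, out)
```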

Extracting one text files from multiple zip archives in R

I am trying to extract one text file from each of the zip files located in one folder. Then I want to combine those text files into one dataframe.
The folder has multiple Zip files:
pf_0915.zip
pf_0914.zip
pf_0913.zip
.....
Inside those zip files are multiple text files. I am only interested in the one called abc.txt. This is a fixed-width format file without a header. I have already set up a read for this file using read_fwf. Since all the extracted text files have the same name, it might be better to rename them according to the name of their archive, i.e. the abc.txt from pf_0915.zip could be called abc_0915.txt. Once they are all read, they should be combined into a large file called abcCombined.txt.
Or as each new abc.txt file is read, we could add it to the abcCombined.txt.
I have tried various versions of unzip() and unz() without much success, and that was without looping through all the zip files. Finally, this directory contains many zip files. Is there a way to read only some of them by using pattern matching like grep? For example, I would be interested in reading only the September files, those .._09...txt.
Any hints would be appreciated.

The following:
Creates a vector of the files in a directory
Uses the list parameter to unzip() to see the metadata for the contents
Builds a regular expression to find only the target file (I did that in the event your use-case generalizes to a broader pattern)
Tests if any of the files meet your criteria
Keeps only those files into a resultant vector
Iterates over that vector and
Extracts only the target file into a temporary directory
Reads it into a data.frame
Ultimately binds the individual data.frames into one big one
You can write out the resultant combined data.frame however you wish.
library(purrr)
target_dir <- "so"
extract_file <- "abc.txt"
list.files(target_dir, full.names=TRUE) %>%
keep(~any(grepl(sprintf("^%s$", extract_file), unzip(., list=TRUE)$Name))) %>%
map_df(function(x) {
td <- tempdir()
read.fwf(unzip(x, extract_file, exdir=td), widths=c(4,1,4,2))
}) -> combined_df
The version below just expands some of the shortcuts in the one above:
only_files_with_this_name <- function(zip_path, name) {
zip_contents <- unzip(zip_path, list=TRUE)
look_for <- sprintf("^%s$", name)
any(grepl(look_for, zip_contents$Name))
}
list.files(target_dir, full.names=TRUE) %>%
keep(only_files_with_this_name, name=extract_file) %>%
map_df(function(x) {
td <- tempdir()
file_in_zip <- unzip(x, extract_file, exdir=td)
df <- read.fwf(file_in_zip, widths=c(4,1,4,2))
unlink(file_in_zip) # remove the extracted copy once it has been read
df
}) -> combined_df

Can't comment because of my low reputation, so although this is a partial answer:
If you know the file name within the various zips the syntax to get just that file would be something like the following:
my_data<-read.csv(unz("pf_0915.zip","abc.txt"))
This is the code for a csv obviously, not a fixed width text, but if you already have that set up, it'll be something like
my_data<-read_fwf(unz("pf_0915.zip","abc.txt"), ... )
with all your other parameters in the ...
You can do this in a loop if you have many zips, and accumulate them in a data frame, data table, whatever structure floats your boat...
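A possible shape for that loop, including the pattern filter the question asks about. This is a sketch only: it assumes "pf_09" at the start of the name marks a September archive, the widths are placeholders, pf_0815.zip is a made-up name added to show the filter at work, and zip_dir and read_abc() are hypothetical.

```r
# Names as listed in the question, plus one made-up non-September archive.
# In practice: zip_names <- list.files(zip_dir, pattern = "\\.zip$")
zip_names <- c("pf_0915.zip", "pf_0914.zip", "pf_0913.zip", "pf_0815.zip")

# Keep only archives whose name starts with "pf_09" followed by two digits.
september <- grep("^pf_09[0-9]{2}\\.zip$", zip_names, value = TRUE)

# Illustrative helper: read abc.txt straight from the archive via unz(),
# without extracting it to disk, and record which archive it came from.
read_abc <- function(zip_path, widths = c(4, 1, 4, 2)) {  # placeholder widths
  df <- read.fwf(unz(zip_path, "abc.txt"), widths = widths)
  df$archive <- basename(zip_path)
  df
}

# With the real archives in place (zip_dir is a hypothetical folder):
# combined <- do.call(rbind, lapply(file.path(zip_dir, september), read_abc))
```

Because the archive name is stored in a column, there is no need to rename the extracted abc.txt files to tell them apart.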

R: Loading data from folder with multiple files

I have a folder with multiple files to load:
Every file is a list. And I want to combine all the lists loaded in a single list. I am using the following code (the variable loaded every time from a file is called TotalData) :
Filenames <- paste0('DATA3_',as.character(1:18))
Data <- list()
for (ii in Filenames){
load(ii)
Data <- append(Data,TotalData)
}
Is there a more elegant way to write it? For example using apply functions?

You can use lapply. I assume that your files have been stored using save, because you use load to get them. I create two files to use in my example as follows:
TotalData<-list(1:10)
save(TotalData,file="DATA3_1")
TotalData<-list(11:20)
save(TotalData,file="DATA3_2")
And then I read them in by
Filenames <- paste0('DATA3_',as.character(1:2))
Data <- lapply(Filenames,function(fn) {
load(fn)
return (TotalData)
})
After this, Data will be a list that contains the lists from the files as its elements. Since you are using append in your example, I assume this is not what you want. I remove one level of nesting with
Data <- unlist(Data,recursive=FALSE)
For my two example files, this gave the same result as your code. Whether it is more elegant can be debated, but I would claim that it is more R-ish than the for-loop.
