Parse multiple XBRL files stored in a zip file in R

I have downloaded multiple zip files from a website. Each zip file contains multiple files with html and xml extensions (~100K in each).
It is possible to manually extract the files and then parse them. However, I would like to be able to do this within R (if possible).
Example file (sorry, it is a bit big), using code from a previous question, to download one zip file:
library(XML)
# Page listing the monthly accounts data zip files
pth <- "http://download.companieshouse.gov.uk/en_monthlyaccountsdata.html"
doc <- htmlParse(pth)
# Link to the first Accounts_Monthly_Data zip file
myfiles <- doc["//a[contains(text(),'Accounts_Monthly_Data')]", fun = xmlAttrs][[1]]
fileURLS <- file.path("http://download.companieshouse.gov.uk", myfiles)[[1]]
# Create the download directory and the XBRL cache directory
dir.create("temp")
dir.create(file.path("temp", "hmrcCache"))
download.file(fileURLS, destfile = file.path("temp", myfiles))
I can parse the files using the XBRL package if I manually extract them. This can be done as follows:
library(XBRL)
inst <- file.path("temp", "Prod224_0004_00000121_20130630.html")
out <- xbrlDoAll(inst, cache.dir="temp/hmrcCache", prefix.out=NULL, verbose=T)
I am struggling with how to extract these files from the zip archive and parse each one, say, in a loop using R, without manually extracting them.
I tried making a start, but don't know how to progress from here. Thanks for any advice.
# Get names of the files inside the downloaded zip
lst <- unzip(file.path("temp", myfiles), list = TRUE)
dim(lst) # 118626 rows, one per file
# Take the name of the first file
nms <- lst$Name[1] # Prod224_0004_00000121_20130630.html
# unz() only gives a connection, and xbrlDoAll() expects a file path, so I am stuck here
lst2 <- unz(file.path("temp", myfiles), filename = nms)
I am using Windows 8.1
R version 3.1.2 (2014-10-31)
Platform: x86_64-w64-mingw32/x64 (64-bit)

Using the suggestion from Karsten in the comments, I unzipped the files to a temporary directory, and then parsed each file. I used the snow package to speed things up.
library(snow)
# Parse one zip file to start
fls <- list.files("temp")[[1]]
# Unzip into a temporary directory
tmp <- tempdir()
lst <- unzip(file.path("temp", fls), exdir = tmp)
# Only parse the first 10 filings
inst <- lst[1:10]
# Start to parse - in parallel
cl <- makeCluster(parallel::detectCores())
clusterCall(cl, function() library(XBRL))
# Start
st <- Sys.time()
out <- parLapply(cl, inst, function(i)
  xbrlDoAll(i,
            cache.dir = "temp/hmrcCache",
            prefix.out = NULL, verbose = TRUE))
stopCluster(cl)
Sys.time() - st
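xbrlDoAll() returns a list of data frames (facts, contexts, units and so on), so the parsed filings can be combined afterwards. A minimal sketch, assuming each element of out holds such a list with a fact component:
# Stack the 'fact' tables from every parsed filing into one data frame
facts <- do.call(rbind, lapply(out, function(x) x$fact))
head(facts)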

Related

Problem with XLS files using R's readxl package

I need to read an XLS file in R, but I'm having a problem related to the way my file is generated and the readxl package. I do not have this issue in Python, and my hope is that it's possible to solve this problem inside R.
An application we use at my company exports reports in XLS format (not XLSX). This report is generated daily. What I need is to sum the total value of the rows in each file, in order to create a new report containing each day followed by its total value.
When I try to read these files in R using the readxl package, the program returns this error:
Erro: Can't subset columns that don't exist.
x Location 5 doesn't exist.
i There are only 0 columns.
Run rlang::last_error() to see where the error occurred.
Now, the weird thing is that when I open the XLS file in Excel before running my script, R is able to read it properly.
I guessed this was an error caused by something like the file only being finalized when I open it... but the same Python script gives me the correct result.
I am now assuming this is a bug in the readxl package. Is there another package I could use to read XLS (not XLSX) files? One that does not depend on Java being installed on my computer, I mean.
My readxl script:
if (!require("readxl")) {install.packages("readxl"); library("readxl")}
"%,%" <- function(x,y) paste0(x,"\\",y)
year = "2021"
month = "Aug"
column = 5 # VL_COVAR
path <- "F:\\variancia" %,% year %,% month
tiposDF = c("date","numeric","list","numeric","numeric","numeric","list")
file.names <- dir(path, pattern =".xls")
vari <- c()
for (i in 1:length(file.names)){
file <- paste(path,sep="\\",file.names[i])
print(paste("Reading ", file))
dados <- read_excel(file, col_types = tiposDF)
somaVar <- sum(dados[column])
vari <- append(vari,c(somaVar))
}
vari
file <- paste(path,sep="\\",'Covariância.xls_02082021.xls')
print(paste("Reading ", file))
dados <- read_excel(file, col_types = tiposDF)
somaVar <- sum(dados[column])
vari <- append(vari,c(somaVar))
x <- import(file)
View(x)
Thanks everyone!
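One alternative that reads legacy .xls without Java is gdata::read.xls(), which converts the sheet via Perl. A minimal sketch, assuming Perl is installed and reusing the file and column objects from the script above:
# gdata::read.xls() reads .xls through Perl, so no Java is required
library(gdata)
dados <- read.xls(file, sheet = 1, stringsAsFactors = FALSE)
somaVar <- sum(dados[[column]], na.rm = TRUE)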

Writing a large JSON to CSV using sparklyr

I'm trying to convert a large JSON file (6GB) into a CSV to more easily load it into R. I happened upon this solution (from https://community.rstudio.com/t/how-to-read-large-json-file-in-r/13486/33):
library(sparklyr)
library(dplyr)
library(jsonlite)
Sys.setenv(SPARK_HOME="/usr/lib/spark")
# Configure cluster (c3.4xlarge 30G 16core 320disk)
conf <- spark_config()
conf$'sparklyr.shell.executor-memory' <- "7g"
conf$'sparklyr.shell.driver-memory' <- "7g"
conf$spark.executor.cores <- 20
conf$spark.executor.memory <- "7G"
conf$spark.yarn.am.cores <- 20
conf$spark.yarn.am.memory <- "7G"
conf$spark.executor.instances <- 20
conf$spark.dynamicAllocation.enabled <- "false"
conf$maximizeResourceAllocation <- "true"
conf$spark.default.parallelism <- 32
sc <- spark_connect(master = "local", config = conf, version = '2.2.0')
sample_tbl <- spark_read_json(sc,name="example",path="example.json", header = TRUE, memory = FALSE,
overwrite = TRUE)
sdf_schema_viewer(sample_tbl)
I've never used Spark before, and I'm trying to understand where the data I loaded is located in RStudio, and how I can write the data to a CSV.
I'm not sure about sparklyr, but if you are trying to read a large JSON file and write it out to CSV using SparkR, below is sample code for the same.
This code will only run in a Spark environment, not in plain RStudio.
library(SparkR)
sparkR.session()
# A JSON dataset is pointed to by path.
# The path can be either a single text file or a directory storing text files.
path <- "examples/src/main/resources/people.json"
# Create a SparkDataFrame from the file(s) pointed to by path
people <- read.json(path)
# Write the data frame to CSV
write.df(people, "people.csv", "csv")
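With sparklyr itself, the data read by spark_read_json() lives in the Spark session rather than in R's memory; only a reference (sample_tbl) is visible in RStudio. A minimal sketch of writing it out to CSV from there, assuming the sample_tbl and sc objects from the question:
# Spark writes a directory of part files rather than a single CSV
spark_write_csv(sample_tbl, path = "example_csv", header = TRUE)
spark_disconnect(sc)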

Reading hdf files into R and converting them to geoTIFF rasters

I'm trying to read MODIS 17 data files into R, manipulate them (cropping etc.) and then save them as GeoTIFFs. The data files come in .hdf format and there doesn't seem to be an easy way to read them into R.
Compared to other topics there isn't a lot of advice out there, and most of it is several years old. Some of it also advises using additional programs, but I want to stick with just using R.
What package(s) do people use for dealing with .hdf files in R?
Ok, so my MODIS hdf files were hdf4 rather than hdf5 format. It was surprisingly difficult to discover this; MODIS don't mention it on their website, but there are a few hints in various blogs and Stack Exchange posts. In the end I had to download HDFView to find out for sure.
R doesn't handle hdf4 files, and pretty much all the packages (like rgdal) only support hdf5 files. There are a few posts about downloading drivers and compiling rgdal from source, but it all seemed rather complicated, and those posts were for Mac or Unix while I'm using Windows.
Basically, gdal_translate from the gdalUtils package is the saving grace for anyone who wants to use hdf4 files in R. It converts hdf4 files into GeoTIFFs without reading them into R. This means you can't manipulate them at all, e.g. by cropping them, so it's worth getting the smallest tiles you can (for MODIS data, through something like Reverb) to minimise computing time.
Here's an example of the code:
library(gdalUtils)
# Provides detailed data on hdf4 files but takes ages
gdalinfo("MOD17A3H.A2000001.h21v09.006.2015141183401.hdf")
# Tells me what subdatasets are within my hdf4 MODIS files and makes them into a list
sds <- get_subdatasets("MOD17A3H.A2000001.h21v09.006.2015141183401.hdf")
sds
[1] "HDF4_EOS:EOS_GRID:MOD17A3H.A2000001.h21v09.006.2015141183401.hdf:MOD_Grid_MOD17A3H:Npp_500m"
[2] "HDF4_EOS:EOS_GRID:MOD17A3H.A2000001.h21v09.006.2015141183401.hdf:MOD_Grid_MOD17A3H:Npp_QC_500m"
# I'm only interested in the first subdataset, and I can use gdal_translate to convert it to a .tif
gdal_translate(sds[1], dst_dataset = "NPP2000.tif")
# Load and plot the new .tif
library(raster)
rast <- raster("NPP2000.tif")
plot(rast)
# If you have lots of files then you can make a loop to do all this for you
files <- dir(pattern = ".hdf")
files
[1] "MOD17A3H.A2000001.h21v09.006.2015141183401.hdf" "MOD17A3H.A2001001.h21v09.006.2015148124025.hdf"
[3] "MOD17A3H.A2002001.h21v09.006.2015153182349.hdf" "MOD17A3H.A2003001.h21v09.006.2015166203852.hdf"
[5] "MOD17A3H.A2004001.h21v09.006.2015099031743.hdf" "MOD17A3H.A2005001.h21v09.006.2015113012334.hdf"
[7] "MOD17A3H.A2006001.h21v09.006.2015125163852.hdf" "MOD17A3H.A2007001.h21v09.006.2015169164508.hdf"
[9] "MOD17A3H.A2008001.h21v09.006.2015186104744.hdf" "MOD17A3H.A2009001.h21v09.006.2015198113503.hdf"
[11] "MOD17A3H.A2010001.h21v09.006.2015216071137.hdf" "MOD17A3H.A2011001.h21v09.006.2015230092603.hdf"
[13] "MOD17A3H.A2012001.h21v09.006.2015254070417.hdf" "MOD17A3H.A2013001.h21v09.006.2015272075433.hdf"
[15] "MOD17A3H.A2014001.h21v09.006.2015295062210.hdf"
filename <- substr(files,11,14)
filename <- paste0("NPP", filename, ".tif")
filename
[1] "NPP2000.tif" "NPP2001.tif" "NPP2002.tif" "NPP2003.tif" "NPP2004.tif" "NPP2005.tif" "NPP2006.tif" "NPP2007.tif" "NPP2008.tif"
[10] "NPP2009.tif" "NPP2010.tif" "NPP2011.tif" "NPP2012.tif" "NPP2013.tif" "NPP2014.tif"
for (i in 1:15){
  sds <- get_subdatasets(files[i])
  gdal_translate(sds[1], dst_dataset = filename[i])
}
Now you can read your .tif files into R using, for example, raster() from the raster package and work as normal. I've checked the resulting files against a few I converted manually using QGIS and they match, so I'm confident the code is doing what I think it is. Thanks to Loïc Dutrieux and this post for the help!
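Since the original goal included cropping, that step can now be done on the converted GeoTIFFs. A minimal sketch, with an extent made up purely for illustration:
library(raster)
r <- raster("NPP2000.tif")
# Crop to an arbitrary example extent (xmin, xmax, ymin, ymax)
r_crop <- crop(r, extent(30, 35, -5, 0))
plot(r_crop)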
These days you can use the terra package with HDF files.
Either get the sub-datasets:
library(terra)
s <- sds("file.hdf")
s
These can be extracted as SpatRasters like this:
s[1]
Or create a SpatRaster of all subdatasets like this:
r <- rast("file.hdf")
The following worked for me. It's a short program that just takes in the input folder name. Make sure you know which subdataset you want; I was interested in subdataset 1.
library(raster)
library(gdalUtils)
inpath <- "E:/aster200102/ast_200102"
setwd(inpath)
filenames <- list.files(pattern = "\\.hdf$", full.names = FALSE)
for (filename in filenames) {
  sds <- get_subdatasets(filename)
  gdal_translate(sds[1], dst_dataset = paste0(substr(filename, 1, nchar(filename) - 4), ".tif"))
}
Use the HEG toolkit provided by NASA to convert your hdf file to GeoTIFF and then use any package ("raster", for example) to read the file. I do the same for both old and new hdf files.
Here's the link: https://newsroom.gsfc.nasa.gov/sdptoolkit/HEG/HEGHome.html
Take a look at the NASA products supported here: https://newsroom.gsfc.nasa.gov/sdptoolkit/HEG/HEGProductList.html
Hope this helps.
This script has been very useful and I managed to convert a batch of 36 files with it. However, my problem is that the conversion does not seem correct. When I do it using the ArcGIS 'Make NetCDF Raster Layer' tool, I get different results, and I am able to convert the numbers from Kelvin to Celsius using the simple formula RasterValue * 0.02 - 273.15. With the results from the R conversion I don't get the right values after applying that formula, which leads me to believe the ArcGIS conversion is good and the R conversion returns an error.
library(gdalUtils)
library(raster)
setwd("D:/Data/Climate/MODIS")
# Get a list of sds names
sds <- get_subdatasets('MOD11C3.A2009001.006.2016006051904.hdf')
# Isolate the name of the first sds
name <- sds[1]
filename <- 'Rasterinr.tif'
gdal_translate(sds[1], dst_dataset = filename)
# Load the Geotiff created into R
r <- raster(filename)
# Identify files to read:
rlist <- list.files(getwd(), pattern = "hdf$", full.names = FALSE)
# Take the last 9 characters of each MODIS filename for use in the new .img filenames
substrRight <- function(x, n){
  substr(x, nchar(x) - n + 1, nchar(x))
}
filenames0 <- substrRight(rlist, 9)
# Suffixes of the MODIS files for identification:
filenamessuffix <- substr(filenames0, 1, 5)
listofnewnames <- c("2009.01.MODIS_","2009.02.MODIS_","2009.03.MODIS_","2009.04.MODIS_","2009.05.MODIS_",
"2009.06.MODIS_","2009.07.MODIS_","2009.08.MODIS_","2009.09.MODIS_","2009.10.MODIS_",
"2009.11.MODIS_","2009.12.MODIS_",
"2010.01.MODIS_","2010.02.MODIS_","2010.03.MODIS_","2010.04.MODIS_","2010.05.MODIS_",
"2010.06.MODIS_","2010.07.MODIS_","2010.08.MODIS_","2010.09.MODIS_","2010.10.MODIS_",
"2010.11.MODIS_","2010.12.MODIS_",
"2011.01.MODIS_","2011.02.MODIS_","2011.03.MODIS_","2011.04.MODIS_","2011.05.MODIS_",
"2011.06.MODIS_","2011.07.MODIS_","2011.08.MODIS_","2011.09.MODIS_","2011.10.MODIS_",
"2011.11.MODIS_","2011.12.MODIS_")
# Final new names for the converted files:
newnames <- vector()
for (i in 1:length(listofnewnames)) {
  newnames[i] <- paste0(listofnewnames[i], filenamessuffix[i], ".img")
}
# Loop converting the first subdataset of each HDF file to a raster (.img)
for (i in 1:length(rlist)) {
  sds <- get_subdatasets(rlist[i])
  gdal_translate(sds[1], dst_dataset = newnames[i])
}
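If the discrepancy is only the MOD11 scaling, the factor quoted above can be applied to the converted raster after loading it. A minimal sketch, assuming the 0.02 scale factor from the comment and the Rasterinr.tif file created earlier:
library(raster)
lst_k <- raster("Rasterinr.tif") * 0.02   # scale to Kelvin
lst_c <- lst_k - 273.15                   # convert to degrees Celsius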

R crashes when using a for-loop with the tuneR package to get the length of audio files

First question here, hope I did the asking part right.
I'm trying to write a short piece of R code that will create a vector with the lengths of all the audio files in my 'Music' folder. I'm using RStudio 0.98.501 with R 3.0.3 on i686-pc-linux-gnu (32-bit). I use the tuneR package to extract info about the lengths of the songs. Here's the problem: I read the first MP3 file fine, but when I get to the second MP3, it gives me 'R Session Aborted, R encountered a fatal error, the session will be terminated'.
I'm working on an Intel® Atom™ CPU N2800 @ 1.86GHz × 4 with 2 GB of memory, running Ubuntu 13.10.
I put my code below; just change the directory to the one where your Music folder is.
library(tuneR)
# Set your working directory here
ddpath <- "/home/daniel/"
wdpath <- ddpath
setwd(wdpath)
# Create a character vector with all filenames
filenames <- list.files("Music", pattern = "\\.mp3$",
                        full.names = TRUE, recursive = TRUE)
# How many audio files do we have?
numTracks <- length(filenames)
# Vector to store lengths
lengthVector <- numeric(0)
# Here the problem arises
for (i in 1:numTracks){
  numWave <- readMP3(filenames[i])
  lengthSec <- length(numWave@left) / numWave@samp.rate
  lengthVector <- c(lengthVector, lengthSec)
  rm(numWave)
}
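One way to keep a crash in readMP3() from killing the whole session is to read each file in a disposable child R process. A minimal sketch, assuming the callr package is available (this is a workaround, not a fix for the underlying tuneR crash):
library(callr)
get_length <- function(f) {
  tryCatch(
    callr::r(function(path) {
      w <- tuneR::readMP3(path)
      length(w@left) / w@samp.rate
    }, args = list(f)),
    error = function(e) NA_real_   # a crashed child just yields NA
  )
}
lengthVector <- vapply(filenames, get_length, numeric(1))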

quit and restart a clean R session from within R (Windows 7, RGui 64-bit)

I am trying to quit and restart R from within R. The reason for this is that my job takes up a lot of memory, and none of the common options for cleaning R's workspace reclaim the RAM taken up by R. gc(), closeAllConnections() and rm(list = ls(all = TRUE)) clear the workspace, but when I examine the processes in the Windows Task Manager, R's usage of RAM remains the same. The memory is reclaimed only when the R session is restarted.
I have tried the suggestion from this post:
Quit and restart a clean R session from within R?
but it doesn't work on my machine. It closes R, but doesn't open it again. I am running R x64 3.0.2 through RGui (64-bit) on Windows 7. Perhaps it is just a simple tweak of the first line in the above post:
makeActiveBinding("refresh", function() { shell("Rgui"); q("no") }, .GlobalEnv)
but I am unsure how it needs to be changed.
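One possible tweak, sketched here purely on the assumption that plain "Rgui" is simply not found on the PATH when shell() runs, is to point it at the full path of Rgui.exe and not wait for it:
makeActiveBinding("refresh",
                  function() {
                    # full path to the Rgui.exe of the currently running R
                    rgui <- file.path(R.home("bin"), "Rgui.exe")
                    shell(paste0('"', rgui, '"'), wait = FALSE)
                    q("no")
                  },
                  .GlobalEnv)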
Here is the code. It is not fully reproducible, because it needs a large list of files that are read in and scraped. What eats memory is scrape.func(); everything else is pretty small. In the code, I apply the scrape function to all files in one folder. Eventually, I would like to apply it to a set of folders, each with a large number of files (~12,000 per folder; 50+ folders). Doing so at present is impossible, since R runs out of memory pretty quickly.
library(XML)
library(R.utils)
## define scraper function
scrape.func <- function(file.name){
  require(XML)
  ## read in (zipped) html file
  txt <- readLines(gunzip(file.name))
  ## parse html
  doc <- htmlTreeParse(txt, useInternalNodes = TRUE)
  ## extract information
  top.data <- xpathSApply(doc, "//td[@valign='top']", xmlValue)
  id <- top.data[which(top.data == "I.D.:") + 1]
  pub.date <- top.data[which(top.data == "Data publicarii:") + 1]
  doc.type <- top.data[which(top.data == "Tipul documentului:") + 1]
  ## tie into dataframe
  df <- data.frame(id, pub.date, doc.type, stringsAsFactors = FALSE)
  ## clean up
  closeAllConnections()
  rm(txt, top.data, doc)
  gc()
  return(df)
}
## where to store the scraped data
file.create("/extract.top.data.2008.1.csv")
## extract the list of files from the target folder
write(list.files(path = "/2008/01"),
file = "/list.files.2008.1.txt")
## count the number of files
length.list <- length(readLines("/list.files.2008.1.txt"))
length.list <- length.list - 1
## read in filename by filename and scrape
for (i in 0:length.list){
  ## read in line by line
  line <- scan("/list.files.2008.1.txt", '',
               skip = i, nlines = 1, sep = '\n', quiet = TRUE)
  ## catch the full path
  filename <- paste0("/2008/01/", as.character(line))
  ## scrape
  data <- scrape.func(filename)
  ## append output to results file
  write.table(data, file = "/extract.top.data.2008.1.csv",
              append = TRUE, sep = ",", col.names = FALSE)
  ## rezip the html
  filename2 <- sub(".gz", "", filename)
  gzip(filename2)
}
Many thanks in advance,
Marko
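An alternative to restarting the GUI is to push each folder's work into a fresh R process, so its memory is returned to the OS when that process exits. A minimal sketch, where scrape_folder.R is a hypothetical script wrapping the loop above and taking the folder path as an argument:
# One disposable R process per folder; memory is freed when each one exits
folders <- file.path("/2008", sprintf("%02d", 1:12))
for (f in folders) {
  system2("Rscript", args = c("scrape_folder.R", shQuote(f)))
}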
I also did some web scraping and ran directly into the same problem, and it drove me crazy. Although I'm running a modern OS (Windows 10), the memory was still not released from time to time. After having a look at the R FAQ, I went for CleanMem, which lets you set up an automated memory cleaner to run every five minutes or so. Be sure to run
rm(list = ls())
gc()
closeAllConnections()
beforehand, so that R releases the memory.
Then use CleanMem so that the OS will notice there's free memory.
