How Can I Download and Use a Matrix from Matrix Market? (Julia)

I am trying to write code that stores a matrix in a variable directly from Matrix Market's website. Below is a sample URL I'd use:
https://math.nist.gov/pub/MatrixMarket2/Harwell-Boeing/bcsstruc1/bcsstk01.mtx.gz
The example URL downloads a bcsstk01.mtx.gz file. I need to extract the bcsstk01.mtx file and then pass it to MatrixMarket.mmread() so I can save the matrix to a variable.
I first tried saving the downloaded file (or URL location) to a variable with A = HTTP.get(), but a lack of online resources and of knowledge on my part led nowhere. Then I used HTTP.download() and got the .mtx.gz file, but I can't unzip it, and MatrixMarket.mmread() cannot read .gz files. So I'm stuck with a downloaded file I can't do anything with unless I unzip it manually.

Using the info from the link in the comments and some fiddling, I managed to get the following:
using TranscodingStreams, CodecZlib
using Downloads

stream = PipeBuffer()                                       # in-memory buffer for the raw .gz bytes
openstream = TranscodingStream(GzipDecompressor(), stream)  # decompresses whatever lands in stream
Downloads.download("https://math.nist.gov/pub/MatrixMarket2/Harwell-Boeing/bcsstruc1/bcsstk01.mtx.gz", stream)
for line in eachline(openstream)
    println(line)
end
This prints:
%%MatrixMarket matrix coordinate real symmetric
48 48 224
1 1 2.8322685185200e+06
5 1 1.0000000000000e+06
6 1 2.0833333333300e+06
7 1 -3.3333333333300e+03
...
which I suppose is the desired data.
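To go from there to an actual matrix variable, one option is to decompress the whole payload and hand MatrixMarket.mmread() a temporary .mtx file. A minimal sketch (it assumes mmread() takes a file path, per the package's documented usage; the temp-file detour is just one way to bridge the two APIs):

using Downloads, CodecZlib, MatrixMarket

url = "https://math.nist.gov/pub/MatrixMarket2/Harwell-Boeing/bcsstruc1/bcsstk01.mtx.gz"
buf = Downloads.download(url, IOBuffer())      # fetch the .mtx.gz into memory
mtx = transcode(GzipDecompressor, take!(buf))  # gunzip the raw bytes
path = joinpath(mktempdir(), "bcsstk01.mtx")
write(path, mtx)
A = MatrixMarket.mmread(path)                  # a 48x48 SparseMatrixCSC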

Related

Failure of unz() to unzip from a zip file offset of more than 2^31 bytes

I have been obtaining .zip archives of genome annotation from NCBI (mainly gff files). In order to save disk space I prefer not to unzip the archive, but to read these files directly into R using unz(). However, it seems that unz() is unable to extract files from the end of 'large' zip files:
ncbi.zip <- "file_location/name.zip"
files <- unzip(ncbi.zip, list=TRUE)   # list the archive contents without extracting
gff.files <- files$Name[ grep("gff$", files$Name) ]
## this works
gff.128 <- readLines( unz(ncbi.zip, gff.files[128]) )
## this gives an empty data structure (read.table() stops
## with an error saying no lines, or similar)
gff.129 <- readLines( unz(ncbi.zip, gff.files[129]) )
## there are 31 more gff files after the 129th one;
## no lines are read from any of them.
The zip file itself seems to be fine; I can unzip the specific files using unzip on the command line and unzip -t does not report any errors.
I've tried this with R versions 3.5 (openSUSE Leap 15.1), 3.6, and 4.2 (CentOS 7), and with more than one zip file, and get exactly the same result.
I attached strace to R whilst reading in the 128th and 129th files. In both cases I start with a lot of lseek calls towards the end of the file (offset 2845892608, larger than 2^31); this is presumably where the zip central directory is found. For the 128th file (the one that can be read), I eventually get an lseek to an offset slightly below 2^31, followed by a set of lseeks and reads (which extend beyond 2^31).
For the 129th file, I get the same reads towards the end of the file, but then, rather than finding a position within the file, I get:
lseek(3, 2845933568, SEEK_SET) = 2845933568
lseek(3, 4294963200, SEEK_SET) = 4294963200
read(3, "", 4096) = 0
lseek(3, 4095, SEEK_CUR) = 4294967295
read(3, "", 4096) = 0
This is a bit weird since the file itself is only about 2.8 GB; 4294967295 is of course 2^32 - 1.
To me this feels like an integer overflow bug, and I am considering posting a bug report. But I am wondering if anyone has seen something similar before, or if I am doing something stupid.
Having done what I should have started with (reading the zip64 format specification), it's actually clear that this is not an integer overflow error.
Zip files contain a central directory at the end of the archive; this contains amongst other things the names of the compressed files and the offset of the compressed data in the zip archive. The offset (and file size fields) are only given 4 bytes each in the standard directory field; when the offset is larger than this it should instead be given in the extra fields section and the value in the standard field should be set to 0xFFFFFFFF. Since this is the offset that gets used when reading the file it seems clear that the problem lies in the parsing of the extra field.
I had a look at the source code for R 4.2.1, and it seems that the problem is due to the way the offset specified in the standard offset field is tested:
if(file_info.uncompressed_size == (ZPOS64_T)(unsigned long)-1)
Changing this to == 0xFFFFFFFF seems to fix the problem. Presumably the original test only works where unsigned long is 32 bits wide; on an LP64 platform such as 64-bit Linux, (unsigned long)-1 is 2^64 - 1, so the 0xFFFFFFFF sentinel read from the 4-byte field never matches and the zip64 extra field is never consulted.
I've submitted a bug report to R. Hopefully changing the check will not have any unintended consequences and the issue will be fixed.
Still, I'm curious as to whether anyone else has come across the same issue. Seems a bit unlikely that my experience is unique.
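In the meantime, a practical workaround is to shell out to command-line unzip (which, as noted above, handles the archive fine) and stream the member through pipe(). A sketch, reusing ncbi.zip and gff.files from the question:

## unzip -p writes the decompressed member to stdout
gff.129 <- readLines(
  pipe(sprintf("unzip -p '%s' '%s'", ncbi.zip, gff.files[129])))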

Opening JSON files in R

I have downloaded some data from the following site as a zip file and extracted it onto my computer. Now I'm having trouble trying to open the included JSON data files.
Running the following code:
install.packages("rjson")
library("rjson")
comp <- fromJSON("statsbomb/data/competitions")
gave this error:
Error in fromJSON("statsbomb/data/competitions") : unexpected character 's'
Also, is there a way to load all files at once instead of writing individual statements each time?
Here is what I did (Unix system):
1. Clone the GitHub repo (note its location):
git clone https://github.com/statsbomb/open-data.git
2. Set the working directory (the directory to which you cloned the repo or extracted the zip file):
setwd("path to directory where you cloned the repo")
3. Read the data:
jsonlite::fromJSON("competitions.json")
or, with rjson:
rjson::fromJSON(file="competitions.json")
To run over all the files at once, move all the .json files to a single directory and use lapply/assign to create the objects in your environment.
Result(single file):
competition_id season_id country_name
1 37 4 England
2 43 3 International
3 49 3 United States of America
4 72 30 International
competition_name season_name match_updated
1 FA Women's Super League 2018/2019 2019-06-05T22:43:14.514
2 FIFA World Cup 2018 2019-05-14T08:23:15.306297
3 NWSL 2018 2019-05-17T00:35:34.979298
4 Women's World Cup 2019 2019-06-21T16:45:45.211614
match_available
1 2019-06-05T22:43:14.514
2 2019-05-14T08:23:15.306297
3 2019-05-14T08:02:00.567719
4 2019-06-21T16:45:45.211614
The function fromJSON takes a JSON string as its first argument unless you specify that you are giving it a file (fromJSON(file = "competitions.json")).
The error you mention comes from the function trying to parse 'statsbomb/data/competitions' as JSON text rather than as a file name. In JSON, everything is enclosed in brackets and strings are inside quotation marks, so the s of "statsbomb" is not a valid first character.
To read all the JSON files you could do:
lapply(dir("open-data-master/", pattern = "\\.json$", recursive = TRUE), function(x) {
  ## one object per file, named after its path ("/" replaced by "_")
  assign(gsub("/", "_", x), fromJSON(file = paste0("open-data-master/", x)), envir = .GlobalEnv)
})
However, this will take a long time to complete, so you should probably refine it a little, e.g. split the list of files obtained with dir into chunks of 50 before running the lapply call.
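A tidier variant of the same idea (again a sketch, assuming the open-data-master/ layout above) collects everything into one named list instead of filling the global environment with assign:

json.files <- dir("open-data-master/", pattern = "\\.json$", recursive = TRUE)
all.data <- lapply(json.files, function(x)
  rjson::fromJSON(file = file.path("open-data-master", x)))
names(all.data) <- gsub("/", "_", json.files)   # access items as all.data[["dir_file.json"]]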

R - read html files within a folder, count frequency, and export output

I'm planning to use R to do some simple text mining tasks. Specifically, I would like to do the following:
Automatically read each HTML file within a folder, then
For each file, do a frequency count of some particular phrases (e.g., "financial constraint", "oil export", etc.), then
Automatically write the output to a .csv file using the following data structure (e.g., file 1 has "financial constraint" showing 3 times and "oil export" 4 times, etc.):
file_name count_financial_constraint count_oil_export
1 3 4
2 0 3
3 4 0
4 1 2
Can anyone please let me know where I should start? So far I think I've figured out how to clean the HTML files and do the counts, but I'm still not sure how to automate the process (I really need this, as I have around 5 folders, each containing about 1000 HTML files). Thanks!
Try this:
gethtml <- function(path = ".", phrase = "financial constraint") {
  files <- list.files(path, pattern = "\\.html$", full.names = TRUE)
  htmlcount <- vector("numeric", length(files))
  for (i in seq_along(files)) {
    text <- paste(readLines(files[i], warn = FALSE), collapse = " ")
    text <- gsub("<[^>]+>", " ", text)   # crude tag stripping
    ## count non-overlapping occurrences of the phrase
    htmlcount[i] <- length(regmatches(text, gregexpr(phrase, text, ignore.case = TRUE))[[1]])
  }
  return(sum(htmlcount))
}
R is not intended for rigorous text parsing, and consequently the tools for such tasks are limited. If you insist on doing it with R then you had better get familiar with regular expressions and have a look at this.
However, I highly recommend using Python with the BeautifulSoup library, which is specifically designed for this task.
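For completeness, here is a base-R sketch of the full pipeline the question asks for (the folder name, phrase list, and output file name are placeholders, and the tag stripping is deliberately crude):

phrases <- c(count_financial_constraint = "financial constraint",
             count_oil_export = "oil export")
count_phrases <- function(file, phrases) {
  text <- paste(readLines(file, warn = FALSE), collapse = " ")
  text <- gsub("<[^>]+>", " ", text)   # strip HTML tags crudely
  vapply(phrases, function(p)
    length(regmatches(text, gregexpr(p, text, ignore.case = TRUE))[[1]]),
    integer(1))
}
files <- list.files("folder1", pattern = "\\.html$", full.names = TRUE)
counts <- t(vapply(files, count_phrases, integer(length(phrases)), phrases = phrases))
write.csv(data.frame(file_name = basename(files), counts),
          "counts.csv", row.names = FALSE)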

R programming: how to open files by taking input from the user?

I have a folder containing 10 files, namely 1.csv, 2.csv, ..., 10.csv.
I have to take a number from 1 to 10 from the user and open the corresponding file.
Here's my code (I have saved the number from the user in x):
y <- as.character(x)
y <- paste(y, csv, sep=".")
read.csv("y")
But this isn't working. Why?
You meant to do read.csv(y) instead of read.csv("y"): with the quotes, R looks for a literal file named "y" rather than using your variable. Note that the paste() call also needs quotes around csv, i.e. paste(y, "csv", sep="."), since there is no object called csv.
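Putting the whole flow together, a minimal sketch (readline() is one hypothetical way to get x, since the question doesn't show how it is read; it only works in an interactive session):

x <- as.integer(readline("Enter a number from 1 to 10: "))
dat <- read.csv(paste0(x, ".csv"))   # e.g. reads "3.csv" when x is 3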

How do I read multiple binary files in R?

Suppose we have the files file1.bin, file2.bin, ..., file1460.bin in the directory C:\R\Data, and we want to read them in and loop over them in groups of four, taking the average of files 1 to 4, then 5 to 8, and so on up to 1460; in the end we will get 365 averages (1460/4).
I tried putting them in a list, but did not know how to write the loop.
How do I read multiple binary files and manipulate them in R? I have been wasting countless hours trying to figure it out. Any help is appreciated.
results <- numeric(365)              # one mean per group of four files
for (i in 1:365) {
  idx <- ((i - 1) * 4 + 1):(i * 4)   # 1:4, then 5:8, ..., then 1457:1460
  results[i] <- mean(sapply(yourlist[idx], mean))
}
YMMV with the inner mean() call depending on what each element of yourlist holds, but that structure is how you could loop through the data once it is loaded.
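To get yourlist in the first place, here is a sketch (it assumes each .bin file holds raw doubles; adjust the what =, size =, and n = arguments of readBin() to your actual record format):

files <- file.path("C:/R/Data", sprintf("file%d.bin", 1:1460))
yourlist <- lapply(files, function(f)
  readBin(f, what = "double", n = file.size(f) / 8))   # 8 bytes per double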
