In R, I can read a whole compressed text file into a character vector as
readLines("file.txt.bz2")
readLines transparently decompresses .gz and .bz2 files but also works with uncompressed files. Is there something analogous available in Julia? I can do
text = open(f -> read(f, String), "file.txt")
but this cannot open compressed files. What is the preferred way to read bzip2 files? Is there any approach (besides manually checking the filename extension) that can deduce the compression format automatically?
I don't know about anything automatic, but this is how you could (create and) read a bz2-compressed file:
using CodecBzip2 # after ] add CodecBzip2
# Creating a dummy bz2 file
mystring = "Hello StackOverflow!"
mystring_compressed = transcode(Bzip2Compressor, mystring)
write("testfile.bz2", mystring_compressed)
# Reading and uncompressing it
compressed = read("testfile.bz2")
plain = transcode(Bzip2Decompressor, compressed)
String(plain) # "Hello StackOverflow!"
There are also streaming variants available (Bzip2CompressorStream and Bzip2DecompressorStream). For more, see CodecBzip2.jl.
Related
I am trying to download and gunzip grid files in ASCII format, compressed to .gz files, from a URL like this. I tried to get to the files via y <- gzcon(url("name-of-url")) and then gunzip(y), but for gunzip that is an invalid file. If I can decompress the file, I would like to read the .asc file with raster()
Any ideas how to solve this?
I don't know why gunzip does not work on these files, but you can get at the contents as follows:
URL = "https://opendata.dwd.de/climate_environment/CDC/grids_germany/annual/summer_days/grids_germany_annual_summer_days_1951_17.asc.gz"
download.file(URL, "grids_germany_annual_summer_days_1951_17.asc.gz")
GZ = gzfile("grids_germany_annual_summer_days_1951_17.asc.gz")
Lines = readLines(GZ)  # reads all lines, decompressing transparently
writeLines(Lines, "grids_germany_annual_summer_days_1951_17.asc")
Now you have an ASCII text file.
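If the goal is the raster() step from the question, the decompressed file can then be read directly (a sketch, assuming the raster package is installed):
library(raster)
# read the decompressed ASCII grid written above
r <- raster("grids_germany_annual_summer_days_1951_17.asc")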
I have a list of papers that I'd like to extract from arXiv (I have the arXiv links / names of the arXiv files), but in the LaTeX format. How can I do this in Python?
If we go to this page: https://arxiv.org/format/2010.11645
We can read the following text:
Source:
Delivered as a gzipped tar (.tar.gz) file if there are multiple files, otherwise as a PDF file, or a gzipped TeX, DVI, PostScript or HTML (.gz, .dvi.gz, .ps.gz or .html.gz) file depending on submission format. [ Download source ]
We can download the file by clicking on [ Download source ], but I have no idea what type of file I'm getting back. The filename is simply 2010.11645.
I'd like to download the file in LaTeX format (which I believe is .tex) and then convert it into .txt using pandoc. I believe I'd need to download the files via requests somehow?
How can I do this? Thanks!
I'm doing a project that requires going into a database of the Brazilian equivalent of the FTC and downloading a few files (which I will later process), and I want to automate this using R.
My problem is that when naming the file, I have to tell it the file extension, and I don't know what it will be (usually it will be a scanned pdf, but sometimes it will be an html file). Here is an example:
https://sei.cade.gov.br/sei/modulos/pesquisa/md_pesq_processo_exibir.php?0c62g277GvPsZDAxAO1tMiVcL9FcFMR5UuJ6rLqPEJuTUu08mg6wxLt0JzWxCor9mNcMYP8UAjTVP9dxRfPBcbZvmE_iaYkTbpPedZsRpa1llf9W8WXxdUJxor5q0IiE
I want the first and the tenth file. Downloading them is easy:
download.file("https://sei.cade.gov.br/sei/modulos/pesquisa/md_pesq_documento_consulta_externa.php?DZ2uWeaYicbuRZEFhBt-n3BfPLlu9u7akQAh8mpB9yPDzrBMElK1BGz7u3NcOFP7-Z5s9oDvQR1K4ELVR_nmNlPto_G3CRD_y2Hu6JLvHZVV2LDxnr4dccffqX3xlEao", destfile = 'C:/teste/teste1', mode = 'wb')
download.file("https://sei.cade.gov.br/sei/modulos/pesquisa/md_pesq_documento_consulta_externa.php?DZ2uWeaYicbuRZEFhBt-n3BfPLlu9u7akQAh8mpB9yPaFy5S3krC8lTKjlRbfodOIg2NArJmAFS5PyUEHL3hnJYr8VG9zLGdNts6K99Ht673e_ZPr2gr3Cw7r8zJqRiH", destfile = 'C:/teste/teste2', mode = 'wb')
The thing is, I don't know which one is a pdf file and which one is an html file without manually trying to open them with another program. Is there any way to tell R to automatically add the correct file extension when downloading?
If you use the httr package, you can get the content-type header, which will help you decide what type of file it is. You can use the HEAD() function to fetch just the headers of the files. For example, with your URLs:
urls <- c(
"https://sei.cade.gov.br/sei/modulos/pesquisa/md_pesq_documento_consulta_externa.php?DZ2uWeaYicbuRZEFhBt-n3BfPLlu9u7akQAh8mpB9yPDzrBMElK1BGz7u3NcOFP7-Z5s9oDvQR1K4ELVR_nmNlPto_G3CRD_y2Hu6JLvHZVV2LDxnr4dccffqX3xlEao",
"https://sei.cade.gov.br/sei/modulos/pesquisa/md_pesq_documento_consulta_externa.php?DZ2uWeaYicbuRZEFhBt-n3BfPLlu9u7akQAh8mpB9yPaFy5S3krC8lTKjlRbfodOIg2NArJmAFS5PyUEHL3hnJYr8VG9zLGdNts6K99Ht673e_ZPr2gr3Cw7r8zJqRiH"
)
You can write a helper function:
library(httr)

get_content_type <- function(urls) {
  unname(sapply(urls, function(u) headers(HEAD(u))[["content-type"]]))
}
get_content_type(urls)
# [1] "application/pdf;" "text/html; charset=ISO-8859-1"
These return MIME types, but you can grep for things like "pdf" to save as a PDF or "html" for web pages. I'm not sure what other types of files might be available. There is no single "correct" file extension for a given content type, so you'd need to make that mapping yourself, as sketched below.
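For example, a small helper along these lines could pick the extension before downloading (the helper name and the type-to-extension mapping are just an illustration; extend it for other content types):
library(httr)

# Hypothetical helper: map the content-type header to a file extension,
# then download with that extension appended.
download_with_ext <- function(url, dest_base) {
  type <- headers(HEAD(url))[["content-type"]]
  ext <- if (grepl("pdf", type, fixed = TRUE)) ".pdf"
    else if (grepl("html", type, fixed = TRUE)) ".html"
    else ""  # unknown type: leave the name without an extension
  download.file(url, destfile = paste0(dest_base, ext), mode = "wb")
}

download_with_ext(urls[1], "C:/teste/teste1")
download_with_ext(urls[2], "C:/teste/teste2")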
There are very similar questions about this topic, but none deals with this precisely in R.
I have a csv.gz file and I would like to "unzip" the file and have it as an ordinary *.csv file. I suppose one would first read the csv.gz file and later create the csv file itself via the write.csv command.
Here is what I have tried, among other things:
gz.file <- read.csv(gzfile(file.choose()), as.is = TRUE)
gives:
head(gz.file)
farmNo.milk.energy.vet.cows
1 1;862533;117894;21186;121
2 2;605764;72049;43910;80
3 3;865658;158466;54583;95
4 4;662331;66783;45469;87
5 5;1003444;101714;81625;125
6 6;923512;252408;96807;135
The file claims to be a data.frame but doesn't behave like one; what am I missing here?
class(gz.file)
[1] "data.frame"
Once read into memory, I would like to have it as a plain csv file, so would write.csv be the solution?
write.csv(gz.file, file="PATH")
In recent versions of data.table, the fast csv reader fread has support for csv.gz files. It automatically detects whether it needs to decompress based on the filename, so there is not much new to learn. The following should work:
library(data.table)
dt = fread("data.csv.gz")
This feature requires an extra, fortunately lightweight, dependency, as you can read in the ?fread manual:
Compressed files ending .gz and .bz2 are supported if the R.utils package is installed.
To write compressed output, use the compress argument: fwrite(dt, "data.csv.gz", compress = "gzip").
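A minimal round trip, assuming data.table is installed (plus R.utils for reading, per the manual quote above):
library(data.table)
dt <- data.table(a = 1:3, b = letters[1:3])
fwrite(dt, "data.csv.gz", compress = "gzip")  # writes a gzip-compressed csv
dt2 <- fread("data.csv.gz")  # decompression inferred from the .gz extension
all.equal(dt, dt2)  # TRUE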
The tidyverse, particularly the readr package, has transparent support for gzip-compressed files (and a few other formats):
library(readr)
read_csv("file.csv.gz") -> d
# write uncompressed data
d %>% write_csv("file.csv")
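readr's write_csv likewise compresses on write when the output path ends in a recognized extension such as .gz:
# gzip compression is inferred from the .gz extension
d %>% write_csv("file.csv.gz")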
I have two pdf files and two text files which are converted into EBCDIC format. The two text files act as cover files for the pdf files, containing details like pdf name, number of pages, etc. in a fixed format.
Cover1.det, Firstpdf.pdf, Cover2.det, Secondpdf.pdf
Format of the cover file could be:
Firstpdf.pdf|22|03/31/2012
that is
pdfname|page num|date generated
which is then converted into EBCDIC format.
I want to merge all these files into a single file in the order: first text file, first pdf file, second text file, second pdf file.
The idea is to then push this single merged file to the mainframe using scp.
1) How do I merge the four files mentioned above into a single file?
2) Do I need to convert the pdf files to EBCDIC format as well? If yes, how?
3) As far as I know, mainframe files also need record length details during transit. How do I find out the record length of the file if I succeed in merging them into a single one?
I remember reading somewhere that it could be done using put and append in ftp. However, since I have to use scp, I am not sure how to achieve this merging.
Thanks for reading.
1) Why not use something like pkzip?
2) I don't think converting the pdf files to EBCDIC is necessary, or even possible. The files need to be transferred in binary mode.
3) Using pkzip and scp, you will not need the record length.
File merging can easily be achieved using the cat command in unix with the > and >> (append) operators, as sketched below.
Also, if the next file should start on a new line (as was my case), a blank echo can be inserted between files.
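For completeness, a rough R sketch of the same merge (R is the language used elsewhere in this thread; the file names are the ones from the question):
# shell equivalent: cat Cover1.det Firstpdf.pdf Cover2.det Secondpdf.pdf > merged
out <- "merged"
file.create(out)  # start from an empty output file
for (f in c("Cover1.det", "Firstpdf.pdf", "Cover2.det", "Secondpdf.pdf")) {
  file.append(out, f)  # append the raw bytes of f to out
}
# note: unlike the blank-echo trick, nothing is inserted between files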