read_fwf not working while unzipping files - r

I want to read several fixed-width format .txt files into R, but I first need to unzip them.
Since they are very large files, I want to use read_fwf from the readr package because it is very fast.
When I do:
read_fwf(unz(zipfileName, fileName), fwf_widths(colWidths, col_names = colNames))
I get this error Error in isOpen(con) : invalid connection
However when I do:
read.table(unz(zipfileName, fileName)) without specifying widths, it reads into R just fine. Any thoughts as to why this isn't working with read_fwf?
I am having trouble making a reproducible example. Here is what I got:
df <- data.frame(
rnorm(100),
rnorm(100)
)
write.table(df, "data.txt", row.names=F, col.names = F)
zip(zipfile = "data.zip", files = "data.txt")
colWidths <- rep(2, 100)
colNames <- c("thing1","thing2")
zipfileName <- "data.zip"
fileName <- "data.csv"

I also had trouble getting read_fwf to read zip files when passing an unz-ed connection to it, but on reading the ?read_fwf page I see that zipped files are promised to be handled automagically. The example file you made wasn't a valid fwf file, since neither of the columns had constant positions, but that is apparent in the output:
read_fwf(file="~/data.zip", fwf_widths(widths=rep(16,2) ,col_names = colNames) )
Warning: 1 parsing failure.
row col expected actual
3 thing2 16 chars 14
# A tibble: 100 x 2
thing1 thing2
<chr> <chr>
1 1.37170820802141 -0.58354018425322
2 0.03608988699566 7 -0.402708262870141
3 1.02963272114 -1 .0644333112294
4 0.73546166509663 8 0.607941664550652
5 -1.5285547658079 -0.319983522035755
6 -1.4673290956901 0.523579231857175
7 0.24946312418273 9 -0.574046655188405
8 0.58126541455159 5 -0.406516495600345
9 1.5074477698981 -0.496512994239183
10 -2.2999905645658 8 -0.662667854341041
# ... with 90 more rows
The error you were getting was from the unz function, because it expects a full path to a .zip file (and apparently won't accept an implicit working-directory location) as the "description" argument. Its second argument is the name of the compressed file inside the zip file. I think it returns a connection, but not of a type that read_fwf is able to process. Doing the parsing by hand, I see that the errors both of us got came from this section of code in read_connection:
> readr:::read_connection
function (con)
{
    stopifnot(is.connection(con))
    if (!isOpen(con)) {
        open(con, "rb")
        on.exit(close(con), add = TRUE)
    }
    read_connection_(con)
}
<environment: namespace:readr>
You didn't give unz a valid "description" argument, and even if you had, the attempt to open the connection with open(con, "rb") fails because of the lack of standardization of arguments across the various file-handling functions.
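For completeness, here is a hedged sketch of two workarounds that follow from the above, reusing the question's objects (zipfileName, fileName, colWidths, colNames) and assuming the widths actually match the file layout:
library(readr)

# Option 1: give read_fwf the zip file itself; per ?read_fwf, readr
# decompresses .zip archives automatically (single-file archives).
df1 <- read_fwf(zipfileName, fwf_widths(colWidths, col_names = colNames))

# Option 2: extract the member to a temporary directory with utils::unzip,
# then read the resulting plain-text file.
txtPath <- unzip(zipfileName, files = fileName, exdir = tempdir())
df2 <- read_fwf(txtPath, fwf_widths(colWidths, col_names = colNames))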

Related

Failing to get results using readxl and read_excel() - rogue lines 2 and 3 in .xls

I'm trying to open an .xls file which has column names in line 1 but additional parts of the names in lines 2 and 3. Some cells in lines 2 and 3 are blank.
library(readxl)
# This doesn't work
read_excel(dest)
# This doesn't work
read_excel(dest, skip = 3, col_names = FALSE)
# ... nor this
read_excel(dest, n_max = 1, col_names = TRUE)
# This works on manually modified file content (lines 2 and 3 deleted)
read_excel('../data/downloaded_FLEET2.xls', sheet = 1)
# This works on file that was manually converted to xlsx (Note: lines 2 and 3 still present)
read_excel('../data/downloaded_FLEET.xlsx', sheet = 1)
# This works on file that was manually converted to csv and back into xls (Note: lines 2 and 3 still present)
read_excel('../data/downloaded_FLEET3.xls', sheet = 1)
Any ideas? I really want to avoid any manual intervention.
Thanks
Since .xlsx is working, I would try a different library such as openxlsx::read.xlsx.
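A hedged sketch of that suggestion, using the converted file from the question (note that openxlsx reads .xlsx only, and the row indices assume the rogue lines sit directly under the header row):
library(openxlsx)

# Row 1 of the sheet supplies the column names ...
fleet <- read.xlsx("../data/downloaded_FLEET.xlsx", sheet = 1)

# ... so sheet rows 2 and 3 arrive as the first two data rows; drop them.
fleet <- fleet[-(1:2), ]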

Ignore errors in readtext r

I am trying to extract text from a large number of docx files (1500) placed in one folder, using readtext (after creating a list of the files with list.files).
You can find similar examples here: https://cran.r-project.org/web/packages/readtext/vignettes/readtext_vignette.html
I am getting errors with some files (examples below); the problem is that when such an error occurs, the extraction process stops. I can identify the problematic file by changing verbosity = 3, but then I have to restart the extraction process (to find the next problematic file(s)).
My question is if there is a way to avoid interrupting the process if an error is encountered?
I changed ignore_missing_files = TRUE but this did not fix the problem.
Examples of the errors encountered:
write error in extracting from zip file
Error: 'C:\Users--- c/word/document.xml' does not exist.
Sorry for not posting a reproducible example, but I do not know how to post an example with large docx files. But this is the code:
library(readtext)
data_files <- list.files(path = "PATH", full.names = T, recursive = T) # PATH = the path to the folder where the documents are located
extracted_texts <- readtext(data_files, docvarsfrom = "filepaths", dvsep = "/", verbosity = 3, ignore_missing_files = TRUE) # this is to extract the text in the files
write.csv2(extracted_texts, file = "data/text_extracts.csv", fileEncoding = "UTF-8") # this is to export the files into csv
Let's first put together a reproducible example:
download.file("https://file-examples-com.github.io/uploads/2017/02/file-sample_1MB.docx", "test1.docx")
writeLines("", "test2.docx")
The first file I produced here should be a proper docx file, the second one is rubbish.
I would wrap readtext in a small function that deals with the errors and warnings:
readtext_safe <- function(f) {
  out <- tryCatch(readtext::readtext(f),
                  error = function(e) "fail",
                  warning = function(e) "fail")
  if (isTRUE("fail" == out)) {
    write(f, "errored_files.txt", append = TRUE)
  } else {
    return(out)
  }
}
Note that I treat errors and warnings the same, which might not be what you actually want. We can use this function to loop through your files:
files <- list.files(pattern = ".docx$", ignore.case = TRUE, full.names = TRUE)
x <- lapply(files, readtext_safe)
x
#> [[1]]
#> readtext object consisting of 1 document and 0 docvars.
#> # Description: df[,2] [1 × 2]
#> doc_id text
#> <chr> <chr>
#> 1 test1.docx "\"Lorem ipsu\"..."
#>
#> [[2]]
#> NULL
In the resulting list, failed files simply have a NULL entry, as nothing is returned. I like to write out a list of these errored files, and the function above creates a txt file that looks like this:
readLines("errored_files.txt")
#> [1] "./test2.docx"

Split a dataframe based on a column and write out the multiple split .txt files with specific names

I'm dealing with huge .txt data frames generated from microscope data. Each single .txt output file is about 3 to 4 GB, and I have a couple hundred of them.
Each of those monster files has a couple hundred features; some are categorical and some are numeric.
Here is an abstract example of the dataframe:
df <- read.csv("output.txt", sep="\t", skip = 9,header=TRUE, fill = T)
df
Row Column stimulation Compound Concentration treatmentsum Pid_treatmentsum var1 var2 var3 ...
1 1 uns Drug1 3 uns_Drug1_3 Jack_uns_Drug1_3 15.0 20.2 3.568 ...
1 1 uns Drug1 3 uns_Drug1_3 Jack_uns_Drug1_3 55.0 0.20 9.068
1 1 uns Drug2 5 uns_Drug2_5 Jack_uns_Drug2_5 100 50.2 3.568
1 1 uns Drug2 5 uns_Drug2_5 Jack_uns_Drug2_5 75.0 60.2 13.68
1 1 3S Drug1 3 3s_Drug3_3 Jack_3s_Drug1_3 65.0 30.8 6.58
1 1 4S Drug1 3 4s_Drug3_3 Jack_4s_Drug1_3 35.0 69.3 2.98
.....
I would like to split the data frame based on a common value in a categorical column, treatmentsum,
so that all cells treated with the same drug and the same dosage end up together, i.e. all "uns_Drug1_3" rows go into one output .txt.
I have seen similar posts, so I used split():
sptdf <- split(df, df$treatmentsum)
It worked: sptdf now gives me a list of data frames. Now I want to write them out as tables, and ideally I want to use the "Pid_treatmentsum" element as each split file's name, since all rows in a split should have the exact same "Pid_treatmentsum" after splitting. I don't quite know how to do that, so thus far I can at least manually input the patient ID and join it on with paste:
lapply(names(sptdf), function(x){write.table(sptdf[[x]], file = paste("Jack", x, sep = "_"))})
This works in the sense that it writes out all the individual files with correct titles, but they are not .txt, and if I open them in Excel I get warnings that they are corrupted. Meanwhile in R I get messages like
Error in file(file, ifelse(append, "a", "w")) :
cannot open the connection
Where did I get this wrong?
Given the sheer size of each output file from the microscope (3-4 GB), is this the best way to do this?
And to push this further: can I dump all hundreds of those huge files in a folder and write a loop to automate the process, instead of splitting one file at a time? The only problem I foresee is that the microscope output files always have the same name, "output".
Thank you in advance, and sorry for the long post.
Cheers,
ML
I don't believe this is very different from the OP's code but here it goes.
First, a test data set. I will use a copy of the built-in data set iris
df <- iris
names(df)[5] <- "Pid_treatmentsum"
Now the file writing code.
sptdf <- split(df, df$Pid_treatmentsum)
lapply(sptdf, function(DF){
  outfile <- as.character(unique(DF[["Pid_treatmentsum"]]))
  outfile <- paste0(outfile, ".txt")
  write.table(DF,
              file = outfile,
              row.names = FALSE,
              quote = FALSE)
})
If Excel complains that the file is corrupt, maybe write.csv (and file extension "csv") will solve the problem.
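A hedged sketch of that csv variant of the writing step (same split object sptdf as above):
lapply(sptdf, function(DF){
  outfile <- paste0(as.character(unique(DF[["Pid_treatmentsum"]])), ".csv")
  write.csv(DF, file = outfile, row.names = FALSE)
})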
Edit.
To automate the code above for processing many files, the split/lapply could be rewritten as a function. The function would then be called with the file names passed as an argument.
Something along the lines of (untested):
splitFun <- function(file, col = "Pid_treatmentsum", ...){
  X <- read.table(file, header = TRUE, ...)
  sptdf <- split(X, X[[col]])
  lapply(sptdf, function(DF){
    outfile <- as.character(unique(DF[[col]]))
    outfile <- paste0(outfile, ".txt")
    write.table(DF,
                file = outfile,
                row.names = FALSE,
                quote = FALSE)
  })
}
filenames <- list.files(pattern = "<a regular expression>")
lapply(filenames, splitFun)

R scp and unzip - invalid zip name argument

I wonder if anyone can help me. I am trying to download a file using scp and then unzip it (it is a set of 100 files, so they have to be zipped, which also rules out unz).
Basically first I run
x <- scp(host = "WM.net", path = "/wmdata.zip", user = "w", password = "wm")
It returns a raw object (of course this is a dummy address, so you would not get anything; I cannot provide a working site you can scp from).
> class(x)
[1] "raw"
then I try to unzip it
b<-unzip(x)
Error in unzip(x) : invalid zip name argument
I tried to decompress it in memory, but with no luck: the output is still raw, not a list of files.
z<-memDecompress(x, type = "unknown")
> class(z)
[1] "raw"
Where is my error? What am I doing wrong? I have a vague feeling I need to save x to disk as a zip and then use unzip, but I have no idea how to save the raw compressed value.
EDIT: I also tried saving it as a binary file via
f<-file("file.bin",open="wb") #or f<-file("file.zip",open="wb")
writeBin(x, f)
b <- unzip(f) #or b <- unzip("file.bin") or b <- unzip("file.zip")
and it produced a file after the first line, but after the second line the file is still empty, and the unzip call returns the same zip name error.
> class(f)
[1] "file" "connection"
> f
A connection with
description "file.zip"
class "file"
mode "wb"
text "binary"
opened "opened"
can read "no"
can write "yes"
The error you are getting is not unexpected at all, because unzip expects a file path as its first argument, and you are trying to pass a raw R vector, which is a vector of bytes. You can try writing that raw vector to a file first, closing the connection, and then reading the file with unzip. Something like this:
x <- scp(host = "WM.net", path = "/wmdata.zip", user = "w", password = "wm")
f <- file("path/to/your/file.bin", "wb")
writeBin(x, f)
b <- unzip(f)
This is not tested, but I wanted to point out the issues with how you were using the various APIs.
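An equivalent and slightly shorter hedged sketch, since writeBin also accepts a file name directly (the local file name here is just a placeholder):
x <- scp(host = "WM.net", path = "/wmdata.zip", user = "w", password = "wm")
# writeBin takes either a connection or a file name; passing a name avoids
# having to remember to close the connection before unzipping.
writeBin(x, "wmdata_local.zip")
b <- unzip("wmdata_local.zip")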

Applying a function for csv files having more than 10 rows in R

Following is the code that I've written to apply a moving-average forecast to all the .csv files in a directory.
fileNames <- Sys.glob("*.csv")
for (fileName in fileNames) {
  abc <- read.csv(fileName, header = TRUE, sep = ",")
  nrows <- sapply(fileNames, function(f) nrow(read.csv(f)))
  if (nrows >= as.vector(10)) {
    library(stats)
    library(graphics)
    library(forecast)
    library(TTR)
    library(zoo)
    library(tseries)
    abc1 = abc[,1]
    abc1 = t(t(abc1))
    abc1 = as.vector(abc1)
    abc2 = ts(abc1, frequency = 12, start = c(2014,1))
    abc_decompose = decompose(abc2)
    plot(abc_decompose)
    forecast = (abc_decompose$trend)
    x <- data.frame(abc, forecast)
    write.csv(x, file = fileName, row.names = FALSE, col.names = TRUE)
  }
}
Now, when I exclude line 5, i.e. if (nrows >= as.vector(10)), the code works fine on files that have enough entries (I had taken around 20 files, all having more than 10 rows).
But I have some csv files in the directory which contain 2 or fewer entries, so when the code runs on the whole directory it gives the following error message:
Error in decompose(abc2) : time series has no or less than 2 periods. As excluding those files manually is hard, I have to use something like line 5.
Now nrows gives me a list of all the file names in the directory with their numbers of rows, but when I run the whole code I get 148 warning messages (that directory has 148 csv files), each one saying:
In if (nrows >= as.vector(10)) { ... :
the condition has length > 1 and only the first element will be used, and I'm not getting the output.
So I'm definitely doing something wrong in line 5. Please help.
Change nrows <- sapply(fileNames, function(f) nrow(read.csv(f))) to:
nrows <- nrow(abc)
Why do you need to compute the number of rows of all the files at each iteration? The warning is telling you what is going wrong: only the first element of the condition is used, i.e. the number of rows of the first file, every time.
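For clarity, a hedged sketch of the loop with that one change applied, keeping everything else essentially as in the question:
# decompose() and ts() come from the base 'stats' package, so no extra
# libraries are strictly needed for this sketch.
fileNames <- Sys.glob("*.csv")
for (fileName in fileNames) {
  abc <- read.csv(fileName, header = TRUE, sep = ",")
  # nrow(abc) is a single number for the current file, so if() gets a
  # length-one condition.  Note that decompose() needs at least 2 full
  # periods (24 monthly values here), so a cutoff of 10 may still be too low.
  if (nrow(abc) >= 10) {
    abc2 <- ts(as.vector(abc[, 1]), frequency = 12, start = c(2014, 1))
    abc_decompose <- decompose(abc2)
    plot(abc_decompose)
    forecast <- abc_decompose$trend
    x <- data.frame(abc, forecast)
    write.csv(x, file = fileName, row.names = FALSE)
  }
}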
