I am doing the following assignment:
[Coursera Air pollution Assignment][1]
[1]: https://i.stack.imgur.com/QAcMG.png
After running dir.create("specdata") and unzipping the file, all 332 files ended up in the "specdata" directory. I then wrote this function:
pollutantmean <- function(directory, pollutant, id = 1:332) {
lista <- list.files("C:/Users/Ana/Desktop/Temporario/specdata", pattern = "*.csv")
for(i in id) {
dados <- read.csv(lista[i])
valor <- numeric(dados[pollutant])
}
mean(valor, na.rm = TRUE)
}
When I tested it with
pollutantmean("specdata", "sulfate", 1:10)
I got the error message:
Error in file(file, "rt") : not possible to open the connection
Warning message: In file(file, "rt") :
Could anyone help? When I call list.files, all the files appear (001.csv, ..., 332.csv), and my working directory is the parent directory of "specdata".
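A likely cause, given the description above: list.files returns bare file names such as "001.csv", so read.csv looks for them in the working directory (the parent of "specdata") rather than inside "specdata". The following is a minimal sketch of one way around that, not the assignment's reference solution. It assumes every CSV has a column named after the pollutant, and it also collects the values from all requested files instead of overwriting them on each iteration.

```r
# Sketch only: assumes each CSV contains a column named by `pollutant`.
pollutantmean <- function(directory, pollutant, id = 1:332) {
  # full.names = TRUE returns "specdata/001.csv" instead of "001.csv",
  # so read.csv() can find the files from the parent working directory.
  files <- list.files(directory, pattern = "\\.csv$", full.names = TRUE)

  # Read each requested file and keep only the pollutant column.
  values <- unlist(lapply(files[id], function(f) read.csv(f)[[pollutant]]))

  # Average over all monitors at once, ignoring missing values.
  mean(values, na.rm = TRUE)
}
```

Called as pollutantmean("specdata", "sulfate", 1:10), this uses the `directory` argument rather than a hard-coded absolute path, so it works from any machine where "specdata" sits below the working directory.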
I am trying to import this dataset, which is an .arff file:
file_location <- file.path("/Users","supreet","Downloads","Chronic_Kidney_Disease1/")
Chronic_Kidney_Disease <- read.arff(paste(file_location,"chronic_kidney_disease.arff",sep=""))
But it throws the following error:
Error in file(arff_file, "rb") : cannot open the connection
In addition: Warning message:
In file(arff_file, "rb") :
cannot open file '/Users/supreet/Downloads/Chronic_Kidney_Disease1/chronic_kidney_disease.arff.arff': No such file or directory
Also, if I remove the .arff extension, since it is apparently appended automatically:
file_location <- file.path("/Users","supreet","Downloads","Chronic_Kidney_Disease1/")
Chronic_Kidney_Disease <- read.arff(paste(file_location,"chronic_kidney_disease",sep=""))
I get this error:
Error: XML content does not seem to be XML: '/Users/supreet/Downloads/Chronic_Kidney_Disease1/chronic_kidney_disease.xml'
In addition: Warning message:
In matrix(unlist(strsplit(arff_data, ",", fixed = T)), ncol = num_attrs, :
data length [10001] is not a sub-multiple or multiple of the number of rows [401]
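The first error shows the reader looking for "chronic_kidney_disease.arff.arff", so whichever read.arff is in use here seems to append the extension itself. Before calling any reader, it can help to confirm exactly which path R will try to open. The helper below is a hypothetical sketch (the name existing_file is made up for illustration); it only builds the path and checks that the file exists, and says nothing about how a particular read.arff implementation parses the file.

```r
# Hypothetical helper: build a path from its parts and fail early,
# with a readable message, if nothing exists at that location.
existing_file <- function(...) {
  path <- file.path(...)
  if (!file.exists(path)) {
    stop("File not found: ", path,
         "\nDirectory contains: ",
         paste(list.files(dirname(path)), collapse = ", "))
  }
  path
}

# Usage (paths taken from the question; adjust to your machine):
# arff_path <- existing_file("/Users", "supreet", "Downloads",
#                            "Chronic_Kidney_Disease1",
#                            "chronic_kidney_disease.arff")
```

If the check passes with the ".arff" name but the reader still fails, the problem is more likely in the file contents or in what that particular reader expects than in the path.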
>
I have written code that should combine multiple files into one by merging the second column of each input file. The first column is similar across input files. However, it gives an error that I do not understand.
files <- list.files(path = "/Rfam/",pattern='\\.sam')
My code
lst <- lapply(files, function(x) read.csv(x,header=TRUE))
setNames(Reduce(function(...) merge(..., by='V1'),
lst),c('ID', paste0('file',seq_along(files))) )
The error
> lst <- lapply(files, function(x) read.csv(x,header=TRUE))
Error in file(file, "rt") : cannot open the connection
In addition: Warning message:
In file(file, "rt") :
cannot open file 'Rfam_Counts_combined_SplitRfam_Counts_combinedhtseq_Rfamoutput100G.sam': No such file or directory
My files:
> head(files)
[1] "Rfam_Counts_combined_SplitRfam_Counts_combinedhtseq_Rfamoutput100G.sam"
[2] "Rfam_Counts_combined_SplitRfam_Counts_combinedhtseq_Rfamoutput100R.sam"
[3] "Rfam_Counts_combined_SplitRfam_Counts_combinedhtseq_Rfamoutput106G.sam"
[4] "Rfam_Counts_combined_SplitRfam_Counts_combinedhtseq_Rfamoutput106R.sam"
[5] "Rfam_Counts_combined_SplitRfam_Counts_combinedhtseq_Rfamoutput122G.sam"
[6] "Rfam_Counts_combined_SplitRfam_Counts_combinedhtseq_Rfamoutput122R.sam"
> length(files)
[1] 96
Example of input
DMED7013:Rfam robinm$ head Rfam_Counts_combined_SplitRfam_Counts_combinedhtseq_Rfamoutput402R.sam
Seq_../trimmed/402R.tally.fasta __not_aligned
__too_low_aQual 3
mir-10 5
Y_RNA 4
__too_low_aQual 0
__too_low_aQual 0
__not_aligned 1
mir-8 2
mir-671 3
mir-671 16
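The file names printed by head(files) have no directory in front of them, which is what list.files returns by default, so read.csv looks for them in the working directory instead of in /Rfam/. A rough sketch of a version that keeps the directory follows. It rests on several assumptions that are not confirmed by the question: the files are whitespace-separated with no header row, each ID occurs once per file, and the function name read_counts is made up for illustration.

```r
# Sketch: `dir` is the folder that holds the .sam count files.
read_counts <- function(dir) {
  # full.names = TRUE keeps the directory in each path.
  files <- list.files(dir, pattern = "\\.sam$", full.names = TRUE)

  tables <- lapply(seq_along(files), function(i) {
    # Name the count column after the file's position so that merged
    # columns stay unique (file1, file2, ...).
    read.table(files[i], header = FALSE,
               col.names = c("ID", paste0("file", i)))
  })

  # Merge pairwise on the shared ID column (inner join, as in the
  # original merge() call).
  Reduce(function(a, b) merge(a, b, by = "ID"), tables)
}
```

Naming the columns while reading avoids the V2.x / V2.y suffix clashes that merge produces when many tables share the same column name. Passing all = TRUE to merge would keep IDs that are missing from some files.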
I would like to write a loop that loads these files with read.eset from Bioconductor.
I tried this:
for(k in 1:29){
expr <- paste0("/home/proj/MT_Nellore/R/eBrowser/Adjusted/LRRadjustedextremes0.5kgchr",k,".txt")
pdat <- paste0("/home/proj/MT_Nellore/R/eBrowser/Adjusted/Samplesbinary0.5.txt")
ffdat <- paste0("/home/proj/MT_Nellore/R/LRR/Chr_adjusted/probeslabeladjustedchr",k,".txt")
eset <- read.eset(exprs.file="expr", pdat.file="/home/proj/MT_Nellore/R/eBrowser/Adjusted/Samplesbinary0.5.txt", fdat.file="ffdat")
}
However I get this error:
## Error in file(file, "r") : cannot open the connection
## In addition: Warning message:
## In file(file, "r") : cannot open file 'ffdat': No such file or directory
Any suggestions?
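Ah, just spotted the error: remove the quotes around "ffdat" on the final line, and likewise around "expr". With the quotes, R passes the literal strings "ffdat" and "expr" as file names instead of the paths stored in those variables.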
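To make the difference concrete, here is a small sketch that uses base R's read.delim as a stand-in for read.eset, with a throwaway temporary file; the file contents are invented for the demonstration.

```r
# A temporary file standing in for one of the fdat files.
ffdat <- tempfile(fileext = ".txt")
writeLines(c("probe\tlabel", "p1\tA"), ffdat)

# The variable holds a real path, so this succeeds.
fdat <- read.delim(ffdat)

# The quoted version asks for a file literally named "ffdat" in the
# working directory, which is what produced the original error:
# read.delim("ffdat")

# Applied to the loop in the question, the call would become
# (a list is used so that each chromosome's result is kept):
# esets <- vector("list", 29)
# for (k in 1:29) {
#   ...
#   esets[[k]] <- read.eset(exprs.file = expr, pdat.file = pdat,
#                           fdat.file = ffdat)
# }
```

The list in the commented loop is an addition of mine: as written in the question, eset is overwritten on every iteration, so only chromosome 29 would survive.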
(Windows 7 / R version 3.0.1)
Below are the commands and the resulting error:
> library(tm)
> pdf <- readPDF(PdftotextOptions = "-layout")
> dat <- pdf(elem = list(uri = "17214.pdf"), language="de", id="id1")
Error in file(con, "r") : cannot open the connection
In addition: Warning message:
In file(con, "r") :
cannot open file 'C:\Users\Raffael\AppData\Local\Temp\RtmpS8Uql1\pdfinfo167c2bc159f8': No such file or directory
How do I solve this issue?
EDIT I
(As suggested by Ben and described here)
I downloaded Xpdf and copied the 32-bit version to
C:\Program Files (x86)\xpdf32
and the 64-bit version to
C:\Program Files\xpdf64
The environment variables pdfinfo and pdftotext point to the respective executables: the 32-bit ones when testing with 32-bit R, the 64-bit ones when testing with 64-bit R.
EDIT II
One very confusing observation is that starting from a fresh session (tm not loaded) the last command alone will produce the error:
> dat <- pdf(elem = list(uri = "17214.pdf"), language="de", id="id1")
Error in file(con, "r") : cannot open the connection
In addition: Warning message:
In file(con, "r") :
cannot open file 'C:\Users\Raffael\AppData\Local\Temp\RtmpKi5GnL\pdfinfode8283c422f': No such file or directory
I don't understand this at all, because at that point pdf has not yet been assigned the result of readPDF. Below you'll find the function that pdf refers to "naturally", followed by what readPDF returns:
> pdf
function (elem, language, id)
{
meta <- tm:::pdfinfo(elem$uri)
content <- system2("pdftotext", c(PdftotextOptions, shQuote(elem$uri),
"-"), stdout = TRUE)
PlainTextDocument(content, meta$Author, meta$CreationDate,
meta$Subject, meta$Title, id, meta$Creator, language)
}
<environment: 0x0674bd8c>
> library(tm)
> pdf <- readPDF(PdftotextOptions = "-layout")
> pdf
function (elem, language, id)
{
meta <- tm:::pdfinfo(elem$uri)
content <- system2("pdftotext", c(PdftotextOptions, shQuote(elem$uri),
"-"), stdout = TRUE)
PlainTextDocument(content, meta$Author, meta$CreationDate,
meta$Subject, meta$Title, id, meta$Creator, language)
}
<environment: 0x0c3d7364>
Apparently there is no difference - then why use readPDF at all?
EDIT III
The pdf file is located here: C:\Users\Raffael\Documents
> getwd()
[1] "C:/Users/Raffael/Documents"
EDIT IV
The first instruction in pdf() is a call to tm:::pdfinfo(), and the error arises within its first few lines:
> outfile <- tempfile("pdfinfo")
> on.exit(unlink(outfile))
> status <- system2("pdfinfo", shQuote(normalizePath("C:/Users/Raffael/Documents/17214.pdf")),
+ stdout = outfile)
> tags <- c("Title", "Subject", "Keywords", "Author", "Creator",
+ "Producer", "CreationDate", "ModDate", "Tagged", "Form",
+ "Pages", "Encrypted", "Page size", "File size", "Optimized",
+ "PDF version")
> re <- sprintf("^(%s)", paste(sprintf("%-16s", sprintf("%s:",
+ tags)), collapse = "|"))
> lines <- readLines(outfile, warn = FALSE)
Error in file(con, "r") : cannot open the connection
In addition: Warning message:
In file(con, "r") :
cannot open file 'C:\Users\Raffael\AppData\Local\Temp\RtmpquRYX6\pdfinfo8d419174450': No such file or direc
Apparently tempfile() simply doesn't create a file.
> outfile <- tempfile("pdfinfo")
> outfile
[1] "C:\\Users\\Raffael\\AppData\\Local\\Temp\\RtmpquRYX6\\pdfinfo8d437bd65d9"
The folder C:\Users\Raffael\AppData\Local\Temp\RtmpquRYX6 exists and holds some files but none is named pdfinfo8d437bd65d9.
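A note on the observation above: tempfile() only generates a unique file name and never creates the file. In the pdfinfo code shown, the file is expected to be written by the external pdfinfo program through system2(..., stdout = outfile), so a missing file suggests that program did not run. The sketch below illustrates the tempfile() behaviour with base R only; the line written to the file is invented.

```r
outfile <- tempfile("pdfinfo")

# Nothing has written to the path yet, so no file exists.
existed_before <- file.exists(outfile)

# The file appears only once something writes to it.
writeLines("Title: example", outfile)
existed_after <- file.exists(outfile)

lines <- readLines(outfile, warn = FALSE)
unlink(outfile)

# To see whether R can find the external tool at all:
# Sys.which("pdfinfo")   # an empty string means it is not on the PATH
```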
Interesting: on my machine, after a fresh start, pdf is the grDevices function that opens a PDF graphics device:
getAnywhere(pdf)
A single object matching ‘pdf’ was found
It was found in the following places
package:grDevices
namespace:grDevices [etc.]
But back to the problem of reading PDF files in as text: fiddling with the PATH is a bit hit-and-miss (and annoying if you work across several different computers), so I think the simplest and safest method is to call pdftotext using system, as Tony Breyal describes here.
In your case it would be (note the two sets of quotes):
system(paste('"C:/Program Files/xpdf64/pdftotext.exe"',
'"C:/Users/Raffael/Documents/17214.pdf"'), wait=FALSE)
This could easily be extended with an *apply function or loop if you have many PDF files.
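As a rough sketch of that extension, assuming the same install location and document folder as above (both are placeholders to adjust, and build_cmd is a name made up for illustration):

```r
# Placeholders: adjust both paths to your own machine.
pdftotext_exe <- "C:/Program Files/xpdf64/pdftotext.exe"
pdf_dir <- "C:/Users/Raffael/Documents"

# Full paths of every PDF in the folder.
pdf_files <- list.files(pdf_dir, pattern = "\\.pdf$", full.names = TRUE)

# Build one command per PDF. shQuote(type = "cmd") wraps each path in
# double quotes so that spaces in the paths survive.
build_cmd <- function(pdf) {
  paste(shQuote(pdftotext_exe, type = "cmd"), shQuote(pdf, type = "cmd"))
}
cmds <- vapply(pdf_files, build_cmd, character(1))

# Run the conversions only if the executable is actually there.
if (file.exists(pdftotext_exe)) {
  invisible(lapply(cmds, system, wait = TRUE))
}
```

With no output file given, pdftotext by default writes a .txt file next to each PDF, which can then be read with readLines. I used wait = TRUE here so the text files exist before any later step tries to read them.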