Using the read.arff() function in R to import .arff files

I am trying to import this .arff dataset:
file_location <- file.path("/Users","supreet","Downloads","Chronic_Kidney_Disease1/")
Chronic_Kidney_Disease <- read.arff(paste(file_location,"chronic_kidney_disease.arff",sep=""))
But it throws the following error:
Error in file(arff_file, "rb") : cannot open the connection
In addition: Warning message:
In file(arff_file, "rb") :
cannot open file '/Users/supreet/Downloads/Chronic_Kidney_Disease1/chronic_kidney_disease.arff.arff': No such file or directory
Also, if I remove the .arff extension, since it is appended automatically:
file_location <- file.path("/Users","supreet","Downloads","Chronic_Kidney_Disease1/")
Chronic_Kidney_Disease <- read.arff(paste(file_location,"chronic_kidney_disease",sep=""))
I get this error:
Error: XML content does not seem to be XML: '/Users/supreet/Downloads/Chronic_Kidney_Disease1/chronic_kidney_disease.xml'
In addition: Warning message:
In matrix(unlist(strsplit(arff_data, ",", fixed = T)), ncol = num_attrs, :
data length [10001] is not a sub-multiple or multiple of the number of rows [401]

Related

Error in file(file, "rt") : invalid 'description' argument when running R script

I am trying to reproduce this protocol for DNA sequencing data analysis. It requires running this bash script, which calls an R script. However, I am getting the error shown at the bottom, and I can't seem to solve it.
#!/bin/bash
Project_dir=~/base
cd /${Project_dir}
SCRTP=~/scRepliseq-Pipeline
OUTNAME="bam/G1_F121_A1.adapter_filtered2"
genome_name="mm10"
bamfile=${OUTNAME}.${genome_name}.clean_srt_markdup.bam
rscript=${SCRTP}/util/Step3_R-Aneu-Fragment-bins.R
out_dir="Aneu_analysis"
Name=‘$bamfile’
Name=${name%.adapter_filtered2.${genome_name}.clean_srt_markdup.bam}
blacklist=~/blacklist/mm10-blacklist-v1_id.bed
genome_file=~/reference/UCSC_hg19_female.fa.fai
mkdir -p ${out_dir}
Rscript --vanilla $rscript ${bamfile} ${out_dir} ${name} ${blacklist} ${genome_file}
It calls this R script:
args = commandArgs(TRUE)
bamfile=args[1]
out_dir=args[2]
name=args[3]
blacklist=args[4]
genome_file=args[5]
options(scipen=100)
##Extension of file name##
ext="_mapq10_blacklist_fragment.Rdata"
ext2="_mapq10_blacklist_bin.Rdata"
library(AneuFinder)
##loading black list and genome Info##
genome_tmp <- read.table(genome_file,sep="\t") #UCSC_mm9.woYwR.fa.fai
genome=data.frame(UCSC_seqlevel=genome_tmp$V1,UCSC_seqlength=genome_tmp$V2)
chromosomes=as.character(genome$UCSC_seqlevel)
##setup output directories##
out_dir_f=paste0(out_dir,"/fragment")
out_dir_b=paste0(out_dir,"/bins")
dir.create(out_dir,showWarnings = FALSE)
dir.create(out_dir_f,showWarnings = FALSE)
dir.create(out_dir_b,showWarnings = FALSE)
##save the fragment file (>10 MAPQ), filtering out the blacklist regions##
raw_reads=bam2GRanges(bamfile,remove.duplicate.reads = TRUE,min.mapq = 10,blacklist = blacklist)
save(raw_reads,file = paste0(out_dir_f,"/",name,ext))
##save the bin data file ##
bins_reads=binReads(raw_reads,
assembly=genome,
chromosomes=chromosomes,
binsizes=c(40000,80000,100000,200000,500000))
rpm=1000000/length(raw_reads)
bins_reads[["rpm"]]=rpm
save(bins_reads,file=paste(out_dir_b,"/",name,ext2,sep=""))
It shows this error:
Error in file(file, "rt") : invalid 'description' argument
Calls: read.table -> file
Execution halted

XML content does not seem to be XML using read text: '/var/folders/_v/4fhshkb92kq_bnyv7dkxmh9c0000gn/T//Rtmp17F8m5/tmp.CBPLpCrO9L/word/document.xml'

I am currently working on a project in R where I collected a number of online news stories and saved them as PDFs in Word. Then I attempted to load the data into R and ran into problems. The code I used is below:
setwd("/Users/dk/Dropbox/Corpus")
getwd()
directory<-"/Users/dk/Dropbox/Corpus/nsle"
directory
library(quanteda)
library(readtext)
mydocs <- readtext(directory, encoding = "UTF-8", docvarsfrom = "filenames")
The error I am receiving is as follows:
Error: XML content does not seem to be XML:
'/var/folders/_v/4fhshkb92kq_bnyv7dkxmh9c0000gn/T//Rtmp17F8m5/tmp.CBPLpCrO9L/word/document.xml
In addition: Warning message:
In utils::unzip(i, exdir = td) : error 1 in extracting from zip file
I am only asking the question on this website because all of the other askers were using XML or R-curl.
When I change the directory, based on other answers, I receive this:
Error in get(method, envir = home) : lazy-load database '/Library/Frameworks/R.framework/Versions/3.4/Resources/library/XML/R/XML.rdb' is corrupt
In addition: Warning messages:
1: In getSource(x, text_field = text_field, encoding = encoding, ...) :
Unsupported extension " r " of file /Users/dinaklimkina/Dropbox/Corpus/New Sample Legislature/New Project 08_03_18.R treating as plain text
2: In utils::unzip(i, exdir = td) : error 1 in extracting from zip file
3: In .registerS3method(fin[i, 1], fin[i, 2], fin[i, 3], fin[i, 4], :
restarting interrupted promise evaluation
4: In get(method, envir = home) : restarting interrupted promise evaluation
5: In get(method, envir = home) : internal error -3 in R_decompress1
Should I be using a different package?
Thanks so much!
D

Creating a loop to use read.eset in bioconductor

I would like to create a loop to load these files with read.eset from Bioconductor.
I tried this:
for(k in 1:29){
expr <- paste0("/home/proj/MT_Nellore/R/eBrowser/Adjusted/LRRadjustedextremes0.5kgchr",k,".txt")
pdat <- paste0("/home/proj/MT_Nellore/R/eBrowser/Adjusted/Samplesbinary0.5.txt")
ffdat <- paste0("/home/proj/MT_Nellore/R/LRR/Chr_adjusted/probeslabeladjustedchr",k,".txt")
eset <- read.eset(exprs.file="expr", pdat.file="/home/proj/MT_Nellore/R/eBrowser/Adjusted/Samplesbinary0.5.txt", fdat.file="ffdat")
}
However I get this error:
## Error in file(file, "r") : cannot open the connection
## In addition: Warning message:
## In file(file, "r") : cannot open file 'ffdat': No such file or directory
Any suggestions?
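Ah, just spotted the error: you must remove the quotes from around "ffdat" on the final line, and likewise from around "expr". With the quotes, R looks for files literally named ffdat and expr instead of using the paths stored in those variables.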
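A minimal base-R sketch of why the quotes matter. It does not call read.eset; it uses a temporary file (an assumption made only so the example is self-contained) and reuses the variable name ffdat from the question.

```r
# Create a small temporary file so the example is self-contained.
ffdat <- tempfile(fileext = ".txt")
writeLines("probe1\tlabel1", ffdat)

# Quoted: R looks for a file literally named "ffdat" in the working directory.
quoted_exists <- file.exists("ffdat")

# Unquoted: R uses the path stored in the variable ffdat.
unquoted_exists <- file.exists(ffdat)

print(c(quoted = quoted_exists, unquoted = unquoted_exists))

# In the loop from the question, the call would then pass the variables:
# eset <- read.eset(exprs.file = expr, pdat.file = pdat, fdat.file = ffdat)

unlink(ffdat)
```

Note that eset is overwritten on every iteration of the original loop; storing each result in a list indexed by k would keep all 29 objects.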

Error trying to read a PDF using readPDF from the tm package

(Windows 7 / R version 3.0.1)
Below are the commands and the resulting error:
> library(tm)
> pdf <- readPDF(PdftotextOptions = "-layout")
> dat <- pdf(elem = list(uri = "17214.pdf"), language="de", id="id1")
Error in file(con, "r") : cannot open the connection
In addition: Warning message:
In file(con, "r") :
cannot open file 'C:\Users\Raffael\AppData\Local\Temp\RtmpS8Uql1\pdfinfo167c2bc159f8': No such file or directory
How do I solve this issue?
EDIT I
(As suggested by Ben and described here)
I downloaded Xpdf and copied the 32-bit version to
C:\Program Files (x86)\xpdf32
and the 64-bit version to
C:\Program Files\xpdf64
The environment variables pdfinfo and pdftotext refer to the respective executables, either the 32-bit ones (tested with 32-bit R) or the 64-bit ones (tested with 64-bit R).
EDIT II
One very confusing observation is that starting from a fresh session (tm not loaded) the last command alone will produce the error:
> dat <- pdf(elem = list(uri = "17214.pdf"), language="de", id="id1")
Error in file(con, "r") : cannot open the connection
In addition: Warning message:
In file(con, "r") :
cannot open file 'C:\Users\Raffael\AppData\Local\Temp\RtmpKi5GnL\pdfinfode8283c422f': No such file or directory
I don't understand this at all, because at that point the variable pdf has not yet been defined by tm.readPDF. Below is the function that pdf refers to "naturally", followed by what tm.readPDF returns:
> pdf
function (elem, language, id)
{
meta <- tm:::pdfinfo(elem$uri)
content <- system2("pdftotext", c(PdftotextOptions, shQuote(elem$uri),
"-"), stdout = TRUE)
PlainTextDocument(content, meta$Author, meta$CreationDate,
meta$Subject, meta$Title, id, meta$Creator, language)
}
<environment: 0x0674bd8c>
> library(tm)
> pdf <- readPDF(PdftotextOptions = "-layout")
> pdf
function (elem, language, id)
{
meta <- tm:::pdfinfo(elem$uri)
content <- system2("pdftotext", c(PdftotextOptions, shQuote(elem$uri),
"-"), stdout = TRUE)
PlainTextDocument(content, meta$Author, meta$CreationDate,
meta$Subject, meta$Title, id, meta$Creator, language)
}
<environment: 0x0c3d7364>
Apparently there is no difference - then why use readPDF at all?
EDIT III
The pdf file is located here: C:\Users\Raffael\Documents
> getwd()
[1] "C:/Users/Raffael/Documents"
EDIT IV
First instruction in pdf() is a call to tm:::pdfinfo() - and there the error is caused within the first few lines:
> outfile <- tempfile("pdfinfo")
> on.exit(unlink(outfile))
> status <- system2("pdfinfo", shQuote(normalizePath("C:/Users/Raffael/Documents/17214.pdf")),
+ stdout = outfile)
> tags <- c("Title", "Subject", "Keywords", "Author", "Creator",
+ "Producer", "CreationDate", "ModDate", "Tagged", "Form",
+ "Pages", "Encrypted", "Page size", "File size", "Optimized",
+ "PDF version")
> re <- sprintf("^(%s)", paste(sprintf("%-16s", sprintf("%s:",
+ tags)), collapse = "|"))
> lines <- readLines(outfile, warn = FALSE)
Error in file(con, "r") : cannot open the connection
In addition: Warning message:
In file(con, "r") :
cannot open file 'C:\Users\Raffael\AppData\Local\Temp\RtmpquRYX6\pdfinfo8d419174450': No such file or direc
Apparently tempfile() simply doesn't create a file.
> outfile <- tempfile("pdfinfo")
> outfile
[1] "C:\\Users\\Raffael\\AppData\\Local\\Temp\\RtmpquRYX6\\pdfinfo8d437bd65d9"
The folder C:\Users\Raffael\AppData\Local\Temp\RtmpquRYX6 exists and holds some files but none is named pdfinfo8d437bd65d9.
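For context, tempfile() is documented to return only a file name; it never creates the file. A small base-R sketch of that behavior follows, where the writeLines call is an assumed stand-in for the output that the external pdfinfo command would normally write. A missing file would therefore suggest that the external command did not run or did not write its output, not that tempfile() failed.

```r
# tempfile() only generates a unique path; no file exists yet.
outfile <- tempfile("pdfinfo")
exists_before <- file.exists(outfile)

# Something has to write to that path; here writeLines stands in
# for the redirected stdout of an external command.
writeLines("Title: example", outfile)
exists_after <- file.exists(outfile)

print(c(before = exists_before, after = exists_after))
unlink(outfile)
```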
Interesting. On my machine, after a fresh start, pdf is a function to convert an image to a PDF:
getAnywhere(pdf)
A single object matching ‘pdf’ was found
It was found in the following places
package:grDevices
namespace:grDevices [etc.]
But back to the problem of reading PDF files in as text: fiddling with the PATH is a bit hit-and-miss (and annoying if you work across several different computers), so I think the simplest and safest method is to call pdftotext using system, as Tony Breyal describes here.
In your case it would be (note the two sets of quotes):
system(paste('"C:/Program Files/xpdf64/pdftotext.exe"',
'"C:/Users/Raffael/Documents/17214.pdf"'), wait=FALSE)
This could easily be extended with an *apply function or loop if you have many PDF files.
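A rough sketch of that extension, assuming the executable and document paths used elsewhere in this thread. The helper build_commands is hypothetical (not part of tm or Xpdf), and the conversions run only if the executable is found.

```r
# Assumed locations, taken from the paths in this thread; adjust as needed.
pdftotext_exe <- "C:/Program Files/xpdf64/pdftotext.exe"
pdf_dir <- "C:/Users/Raffael/Documents"

# Hypothetical helper: build one command line per PDF, with both the
# executable and the file wrapped in double quotes.
build_commands <- function(exe, pdf_files) {
  vapply(pdf_files,
         function(f) paste(shQuote(exe, type = "cmd"), shQuote(f, type = "cmd")),
         character(1), USE.NAMES = FALSE)
}

# Collect every PDF in the folder (empty if the folder does not exist).
pdf_files <- list.files(pdf_dir, pattern = "\\.pdf$",
                        full.names = TRUE, ignore.case = TRUE)
commands <- build_commands(pdftotext_exe, pdf_files)

# Run the conversions only if the executable is actually present.
if (file.exists(pdftotext_exe)) {
  invisible(lapply(commands, system, wait = FALSE))
}
```

When no output name is given, pdftotext should write a .txt file next to each PDF, which can then be read back with readLines.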

How to read a .sav SPSS file in R?

I've tried read.spss(), but I get an encoding error:
library(foreign)
read.spss('persona.sav')
#>re-encoding from CP1252
Error in iconv(names(rval), cp, "") :
unsupported conversion from 'CP1252' to ''
In addition: Warning message:
In read.spss("persona.sav") :
persona.sav: Unrecognized record type 7, subtype 18 encountered in system file
Try re-encoding it as a utf-8 file:
library(foreign)
read.spss('persona.sav', reencode='utf-8')
You can also try adding 'to.data.frame = TRUE' to the read.spss() call, which returns a data frame instead of a list.
For instance:
df <- read.spss("data.sav", to.data.frame = TRUE)
