read.table function and stdin - r

I have a tab-delimited text file that I am trying to load into R with the read.table function. The first few lines of the script look like this
#!/usr/bin/env Rscript
args <- commandArgs(trailingOnly=TRUE)
data <- read.table(args[1], header=TRUE, sep="\t", quote="")
# process the data
This works. I had originally tried to get R to read the data from standard input, but was unsuccessful. My first approach...
#!/usr/bin/env Rscript
data <- read.table(stdin(), header=TRUE, sep="\t", quote="")
# process the data
...didn't seem to work at all. My second approach...
#!/usr/bin/env Rscript
data <- read.table("/dev/stdin", header=TRUE, sep="\t", quote="")
# process the data
...read the data file but (for some reason I don't understand) the first 20 or so lines get mangled, which is a big problem (especially since those lines contain the header information). Is there any way to get read.table to read from standard input? Am I missing something completely obvious?

?stdin says:
stdin() refers to the ‘console’ and not to the C-level ‘stdin’
of the process. The distinction matters in GUI consoles (which
may not have an active ‘stdin’, and if they do it may not be
connected to console input), and also in embedded applications.
If you want access to the C-level file stream ‘stdin’, use
file("stdin").
And:
When R is reading a script from a file, the file is the
‘console’: this is traditional usage to allow in-line data …
That’s the probable reason for the observed behaviour. In principle you can read.table from standard input – but in most (almost all?) cases you’ll want to do this via file('stdin').

Related

Different number of lines when loading a file into R

I have a .txt file with one column consisting of 1040 lines (including a header). However, when loading it into R using the read.table() command, it's showing 1044 lines (including a header).
The snippet of the file looks like
L*H
no
H*L
no
no
no
H*L
no
Might it be an issue with R?
When opened in Excel it doesn't show any errors as well.
EDIT
The problem was that R read a line like L + H* as three separated lines L + H*.
I used
table <- read.table(file.choose(), header=T, encoding="UTF-8", quote="\n")
You can try readLines() to see how many lines are there in your file. And feel free to use read.csv() to import it again to see it gets the expected return. Sometimes, the file may be parsed differently due to extra quote, extra return, and potentially some other things.
possible import steps:
look at your data with text editor or readLines() to figure out the delimiter and file type
Determine an import method (type read and press tab, you will see the import functions for import. Also check out readr.)
customize your argument. For example, if you have a header or not, or if you want to skip the first n lines.
Look at the data again in R with View(head(data)) or View(tail(data)). And determine if you need to repeat step 2,3,4
Based on the data you have provided, try using sep = "\n". By using sep = "\n" we ensure that each line is read as a single column value. Additionally, quote does not need to be used at all. There is no header in your example data, so I would remove that argument as well.
All that said, the following code should get the job done.
table <- read.table(file.choose(), sep = "\n")

Writing results from multiple runs of an R script to a single output csv file

I have written an R script that is to be used as part of a shell script based pipeline which will feed dozens of files containing genetic sequence data to the R script one after the other (using args[]).
I am having trouble finding a way to write the results of each run of this script to a single results file. I thought that the easiest way to do this might be to create an empty results.csv table and then ask the script to write to the next row of this file each time it is run (saves the problem of the script writing straight over the file on each run). In this vein a friend helped me out with the following code:
x<-readLines("results.csv")
if(x[[1]]==""){x[[1]]<-paste("meancoscore", "meanboot", "CIres", "RIres", "RC", "nodecount", sep= ",")}
x[[length(x)+1]]<-paste(meancoscore, meanboot, CIres, RIres, RC, nodecount, sep = ",")
x<-data.frame(x)
write.table(x,"results.csv", row.names = F, col.names = F, sep = ",")
In the above code "meancoscore", "meanboot", "CIres", "RIres", "RC", and "nodecount" are first used as a header if the data frame has nothing on the first row.
Following this the results (objects: meancoscore, meanboot, CIres, RIres, RC and nodecount are written in the columns corresponding with their headers. The idea here is that if you run the R script again with different source files it should simply write the results to the next line in the results.csv file.
However, the following is seen in the results.csv file after three runs of this code with different input files:
"\""\\""meancoscore,meanboot,CIres,RIres,RC,nodecount\\""\""
""\""\\""0.000,76.3247863247863,0.721002252252252,0.983235214508053,0.708914804154032,117\\""\""
""\""0.845,77.6923076923077,0.723259762308998,0.983410513459875,0.711261254217159,117\""
""0.85,77.4358974358974,0.728886344116805,0.983878381369061,0.717135516451654,117"
Where my desired result would be the following:
meancoscore,meanboot,CIres,RIres,RC,nodecount
0.000,76.3247863247863,0.721002252252252,0.983235214508053,0.708914804154032,117
0.845,77.6923076923077,0.723259762308998,0.983410513459875,0.711261254217159,117
0.85,77.4358974358974,0.728886344116805,0.983878381369061,0.717135516451654,117
It is worth noting that each successive fun seems to be adding more backslashes and more quotation marks to the results.csv file.
Ideally I would like to be able to simply read in the results.csv file when it is done and analyse the data by accessing the columns with results$meanboot, or summary(results$meanboot) for example.
Could anyone offer some advice on how to modify the above code or offer an alternative solution?
I should add here that I purposefully did not go for the option of writing into the R script a loop that will run through the input files of interest and simply assemble a full table of results as an object (I am aware that this would be very simple to write out). This was because the work being done by this script will be farmed out to multiple machines in a cluster.
Thank you for your time and any help you might be able to offer.
The problem was solved by adding quote = FALSE to the write.table() call as per voidHead's suspicion.

Problems with reading a txt file (EOF within quoted string)

I am trying to use read.table() to import this TXT file into R (contains informations about meteorological stations provided by the WMO):
However, when I try to use
tmp <- read.table(file=...,sep=";",header=FALSE)
I get this error
eof within quoted string
warning and only 3514 of the 6702 lines appear in 'tmp'. From a quick look at the text file, I couldn't find any seemingly problematic characters.
As suggested in other threads, I also tried quote="". The EOF warning disappeared, but still only 3514 lines are imported.
Any advice on how I can get read.table() to work for this particular txt file?
It looks like your data actually has 11548 rows. This works:
read.table(url('http://weather.noaa.gov/data/nsd_bbsss.txt'),
sep=';', quote=NULL, comment='', header=FALSE)
edit: updated according #MrFlick's comment's below.
The problem is LF. R will not recognize "^M", to load the file, you only need to specify the encoding like this:
read.table("nsd_bbsss.txt",sep=";",header=F,encoding="latin1",quote="",comment='',colClasses=rep("character",14)) -> data
But Line 8638 has more than 14 columns, which is different from other lines and may lead an error message.

Import text file using ff package

I have a textfile of 4.5 million rows and 90 columns to import into R. Using read.table I get the cannot allocate vector of size... error message so am trying to import using the ff package before subsetting the data to extract the observations which interest me (see my previous question for more details: Add selection crteria to read.table).
So, I use the following code to import:
test<-read.csv2.ffdf("FD_INDCVIZC_2010.txt", header=T)
but this returns the following error message :
Error in read.table.ffdf(FUN = "read.csv2", ...) :
only ffdf objects can be used for appending (and skipping the first.row chunk)
What am I doing wrong?
Here are the first 5 rows of the text file:
CANTVILLE.NUMMI.AEMMR.AGED.AGER20.AGEREV.AGEREVQ.ANAI.ANEMR.APAF.ARM.ASCEN.BAIN.BATI.CATIRIS.CATL.CATPC.CHAU.CHFL.CHOS.CLIM.CMBL.COUPLE.CS1.CUIS.DEPT.DEROU.DIPL.DNAI.EAU.EGOUL.ELEC.EMPL.ETUD.GARL.HLML.ILETUD.ILT.IMMI.INAI.INATC.INFAM.INPER.INPERF.IPO ...
1 1601;1;8;052;54;051;050;1956;03;1;ZZZZZ;2;Z;Z;Z;1;0;Z;4;Z;Z;6;1;1;Z;16;Z;03;16;Z;Z;Z;21;2;2;2;Z;1;2;1;1;1;4;4;4,02306147485403;ZZZZZZZZZ;1;1;1;4;M;22;32;AZ;AZ;00;04;2;2;0;1;2;4;1;00;Z;54;2;ZZ;1;32;2;10;2;11;111;11;11;1;2;ZZZZZZ;1;2;1;4;41;2;Z
2 1601;1;8;012;14;011;010;1996;03;3;ZZZZZ;2;Z;Z;Z;1;0;Z;4;Z;Z;6;2;8;Z;16;Z;ZZ;16;Z;Z;Z;ZZ;1;2;2;2;Z;2;1;1;1;4;4;4,02306147485403;ZZZZZZZZZ;3;3;3;1;M;11;11;ZZ;ZZ;00;04;2;2;0;1;2;4;1;14;Z;54;2;ZZ;1;32;Z;10;2;23;230;11;11;Z;Z;ZZZZZZ;1;2;1;4;41;2;Z
3 1601;1;8;006;05;005;005;2002;03;3;ZZZZZ;2;Z;Z;Z;1;0;Z;4;Z;Z;6;2;8;Z;16;Z;ZZ;16;Z;Z;Z;ZZ;1;2;2;2;Z;2;1;1;1;4;4;4,02306147485403;ZZZZZZZZZ;3;3;3;1;M;11;11;ZZ;ZZ;00;04;2;2;0;1;2;4;1;14;Z;54;2;ZZ;1;32;Z;10;2;23;230;11;11;Z;Z;ZZZZZZ;1;2;1;4;41;2;Z
4 1601;1;8;047;54;046;045;1961;03;2;ZZZZZ;2;Z;Z;Z;1;0;Z;4;Z;Z;6;1;6;Z;16;Z;14;974;Z;Z;Z;16;2;2;2;Z;2;2;4;1;1;4;4;4,02306147485403;ZZZZZZZZZ;2;2;2;1;M;22;32;MN;GU;14;04;2;2;0;1;2;4;1;14;Z;54;2;ZZ;2;32;1;10;2;11;111;11;11;1;4;ZZZZZZ;1;2;1;4;41;2;Z
5 1601;2;9;053;54;052;050;1958;02;1;ZZZZZ;2;Z;Z;Z;1;0;Z;2;Z;Z;2;1;2;Z;16;Z;12;87;Z;Z;Z;22;2;1;2;Z;1;2;3;1;1;2;2;4,21707670353782;ZZZZZZZZZ;1;1;1;2;M;21;40;GZ;GU;00;07;0;0;0;0;0;2;1;00;Z;54;2;ZZ;1;30;2;10;3;11;111;ZZ;ZZ;1;1;ZZZZZZ;2;2;1;4;42;1;Z
I encountered a similar problem related to reading csv into ff objects. On using
read.csv2.ffdf(file = "FD_INDCVIZC_2010.txt")
instead of implicit call
read.csv2.ffdf("FD_INDCVIZC_2010.txt")
I got rid of the error. The explicitly passing values to the argument seems specific to ff functions.
You could try the following code:
read.csv2.ffdf("FD_INDCVIZC_2010.txt",
sep = "\t",
VERBOSE = TRUE,
first.rows = 100000,
next.rows = 200000,
header=T)
I am assuming that since its a txt file, its a tab-delimited file.
Sorry I came across the question just now. Using the VERBOSE option, you can actually see how much time your each block of data is taking to be read. Hope this helps.
If possible try to filter the data at the OS level, that is before they are loaded into R. The simplest way to do this in R is to use a combination of pipe and grep command:
textpipe <- pipe('grep XXXX file.name |')
mutable <- read.table(textpipe)
You can use grep, awk, sed and basically all the machinery of unix command tools to add the necessary selection criteria and edit the csv files before they are imported into R. This works very fast and by this procedure you can strip unnecessary data before R begins to read them from pipe.
This works well under Linux and Mac, perhaps you need to install cygwin to make this work under Windows or use some other windows-specific utils.
perhaps you could try the following code:
read.table.ffdf(x = NULL, file = 'your/file/path', seq=';' )

Read SPecific lines of a CSV file in R-language

I am trying to write a code which manipulates data from a particular .csv and writes the data to another one.
I want to read each line one by one and perform the operation.
Also I am trying to read a particular line from the .csv but what I am getting is that line and the lines before it.
I am a beginner in R-Language, so I find the syntax a bit confusing.
testconn<=file("<path>")
num<-(length(readLines(testconn)))
for(i in 1:num){
num1=i-1
los<=read.table(file="<path>",sep=",",head=FALSE,skip=num1,nrows=1)[,c(col1,col2)]
write.table(los,"<path>",row.names=FALSE,quote=FALSE,sep=",",col.names=FALSE,append=TRUE)
}
This is the code I am currently using, thought it is giving the desiored output but it is extremely slow, my .csv data file has 43200 lines.
Your code doesn't work. You confuse the comparison operator <= and the assignment one <-
Your code is is extremly innefficient. You call both read.table and write.table 43200 times to read/write a single file.
You can simply do this:
los<- read.table(file="<path>",sep=",")[,c(col1,col2)]
res <- apply(los,1,function(x){## you treat your line here}
write.table(res,"<path_write>",row.names=FALSE,
quote=FALSE,sep=",",col.names=FALSE)

Resources