I'm using data.table::fread to read input from a shell script. For readability I want to split the script on multiple lines using the line continuation character '\'.
However, fread doesn't seem to like shell scripts on multiple lines.
Examples:
library(data.table)
fread("cat test1.txt test2.txt") ## OK
Now split the script across two lines:
fread("cat test1.txt \
test2.txt")
Error in fread("cat test.txt \n test.txt") :
Expected sep (' ') but new line, EOF (or other non printing character) ends field 0 when detecting types ( first): test.txt
## Same problem
fread("cat test.txt \\
test.txt")
Is there any escape sequence or switch I'm missing?
If not, I guess these are possible workarounds: 1) don't split the script at all; 2) write the script to a file and call that file with fread.
These are my settings:
sessionInfo()
R version 3.2.3 (2015-12-10)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: CentOS release 6.7 (Final)
locale:
[1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8 LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8 LC_PAPER=en_GB.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] data.table_1.10.4
loaded via a namespace (and not attached):
[1] tools_3.2.3 chron_2.3-46 tcltk_3.2.3
Embedding the command within paste() is an alternative:
fread(paste("cat test1.txt",
"test2.txt"))
If you are looking for an easy way to read multiple text files, you could either use
fread("cat t*.txt")
or if the .txt files don't follow the above example pattern of file names, perhaps move them to a sub-directory (say 'data') and read them all as below:
fread("ls data | cat")
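Option (2) from the question can also be sketched out; the script name concat.sh is made up here, and test1.txt/test2.txt are the files from the question:
## Write the multi-line shell command to a script file, then hand
## that script to fread as a command to run.
writeLines(c("cat test1.txt \\",
             "    test2.txt"), "concat.sh")
dat <- data.table::fread("sh concat.sh")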
Related
For at least some cases, Asian characters are printable if they are contained in a matrix or a vector, but not in a data.frame. Here is an example:
q<-'天'
q # Works
# [1] "天"
matrix(q) # Works
# [,1]
# [1,] "天"
q2<-data.frame(q,stringsAsFactors=FALSE)
q2 # Does not work
# q
# 1 <U+5929>
q2[1,] # Works again.
# [1] "天"
Clearly, my device is capable of displaying the character, but when it is in a data.frame, it does not work.
Doing some digging, I found that the print.data.frame function runs format on each column. It turns out that if you run format.default directly, the same problem occurs:
format(q)
# "<U+5929>"
Digging into format.default, I find that it is calling the internal format, written in C.
Before I dig any further, I want to know if others can reproduce this behaviour. Is there some configuration of R that would allow me to display these characters within data.frames?
My sessionInfo(), if it helps:
R version 3.0.1 (2013-05-16)
Platform: x86_64-w64-mingw32/x64 (64-bit)
locale:
[1] LC_COLLATE=English_Canada.1252 LC_CTYPE=English_Canada.1252
[3] LC_MONETARY=English_Canada.1252 LC_NUMERIC=C
[5] LC_TIME=English_Canada.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] tools_3.0.1
I hate to answer my own question, but although the comments and answers helped, they weren't quite right. In Windows, it doesn't seem like you can set a generic 'UTF-8' locale. You can, however, set country-specific locales, which will work in this case:
Sys.setlocale("LC_CTYPE", locale="Chinese")
q2 # Works fine
# q
#1 天
But it does make me wonder why exactly format seems to use the locale; I wonder if there is a way to have it ignore the locale on Windows. I also wonder if there is some generic UTF-8 locale on Windows that I don't know about.
I just blogged about Unicode and R several days ago. I think your R editor is UTF-8, which gives you the illusion that R on your Windows machine handles UTF-8 characters.
The short answer is: when you want to process Unicode (here, Chinese), don't use English Windows; use a Chinese version of Windows, or Linux, which uses UTF-8 by default.
Session info in my Ubuntu:
> sessionInfo()
R version 2.14.1 (2011-12-22)
Platform: i686-pc-linux-gnu (32-bit)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_PAPER=C LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
I googled around for this issue and couldn't find anything, but hopefully someone on SO can help diagnose it!
While running on R version 3.4.1, I tried running
demo("graphics")
and the result was a blank window that failed to render any of the colors or the graph (see image).
After exiting out of the window and pressing return again, another blank window (exact same thing) pops up. When I exit out of the second window, I am left with the following command line output:
> demo("graphics")
demo(graphics)
---- ~~~~~~~~
Type <Return> to start :
> # Copyright (C) 1997-2009 The R Core Team
>
> require(datasets)
> require(grDevices); require(graphics)
> ## Here is some code which illustrates some of the differences between
> ## R and S graphics capabilities. Note that colors are generally specified
> ## by a character string name (taken from the X11 rgb.txt file) and that line
> ## textures are given similarly. The parameter "bg" sets the background
> ## parameter for the plot and there is also an "fg" parameter which sets
> ## the foreground color.
>
>
> x <- stats::rnorm(50)
> opar <- par(bg = "white")
> plot(x, ann = FALSE, type = "n")
Hit <Return> to see next plot:
Error in plot.new() : attempt to plot on null device
I checked, and the "graphics" package definitely exists and is in libPath. Moreover, it appears that demo("persp") also fails in a similar way.
Anyone know what might be causing this issue?
Edit 1:
Thanks for responding!
I am running this from bash, on Ubuntu 16.04.
Here is the requested terminal output:
> sessionInfo()
R version 3.4.1 (2017-06-30)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.3 LTS
Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.6.0
LAPACK: /usr/lib/lapack/liblapack.so.3.6.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] compiler_3.4.1
> dev.off()
Error in dev.off() : cannot shut down device 1 (the null device)
Edit 2:
Turns out, when I call
dev.new()
before a demo(), the demo is suddenly able to run. However, demo() will not run without it. This doesn't explain why demo() doesn't work from the start, though.
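The workaround from Edit 2 in full, for anyone hitting the same thing:
## Open an on-screen graphics device explicitly first; without it,
## demo() ends up trying to plot on the null device in this setup.
dev.new()        # or x11() on Linux
demo("graphics")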
I have been using the stringi package for a while now and everything has worked fine.
I recently wanted to put some regex inside a function and store that function in a separate file. The code works just fine if the function is loaded from the script, but when it is sourced I do not get the expected result.
Here is the code to reproduce the issue:
clean <- function(text){
stri_replace_all_regex(str = text,
pattern = "(?i)[^a-zàâçéèêëîïôûùüÿñæœ0-9,\\.\\?!']",
replacement = " ")
}
text <- "A sample text with some french accent é, è, â, û and some special characters |, [, ( that needs to be cleaned."
clean(text) # OK
[1] "A sample text with some french accent é, è, â, û and some special characters , , that needs to be cleaned."
source("clean.r")
clean(text) # KO
[1] "A sample text with some french accent , , , and some special characters , , that needs to be cleaned."
I want to remove everything that is not a letter, an accented letter, or one of the punctuation characters ?, !, , and ..
The code works just fine if the function is loaded inside the script directly. If it is sourced then it gives a different result.
I also tried using stringr and I have the same problem. My files are saved in UTF-8 encoding.
I do not understand why this is happening, any help is greatly appreciated.
Thank you.
R version 3.4.1 (2017-06-30)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)
Matrix products: default
locale:
[1] LC_COLLATE=French_France.1252 LC_CTYPE=French_France.1252
[3] LC_MONETARY=French_France.1252 LC_NUMERIC=C
[5] LC_TIME=French_France.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] stringi_1.1.5 data.table_1.10.4
loaded via a namespace (and not attached):
[1] compiler_3.4.1 tools_3.4.1 yaml_2.1.14
Try converting the text to ASCII first. This will change the characters, and may allow the same behaviour when you source the function in R.
+1 to Felipe Alvarenga
https://stackoverflow.com/a/45941699/2069472
text <- "Ábcdêãçoàúü"
iconv(text, to = "ASCII//TRANSLIT")
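Another thing worth checking (an assumption about the cause, since I can't reproduce your setup): source() re-encodes the file using the native locale unless told otherwise, so on a Windows-1252 locale the accented characters in the pattern can get mangled. Passing the encoding explicitly may fix it:
## Tell source() the file is UTF-8 so the accented characters in the
## regex survive on a non-UTF-8 (e.g. Windows-1252) locale.
source("clean.r", encoding = "UTF-8")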
A little mystery. I have a .tsv file that contains 58936 rows. I loaded the file into R using this command:
dat <- read.csv("weekly_devdata.tsv", header=FALSE, stringsAsFactors=TRUE, sep="\t")
but nrow(dat) only shows this:
> nrow(dat)
[1] 28485
So I used the sed -n command to write the rows around where it stopped (before, including, and after that row) to a new file, and I was able to load that file into R, so I don't think there was any corruption in the file.
Is it an environment issue?
Here's my sessionInfo()
> sessionInfo()
R version 3.1.2 (2014-10-31)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] tcltk stats graphics grDevices utils datasets methods base
other attached packages:
[1] sqldf_0.4-10 RSQLite_1.0.0 DBI_0.3.1 gsubfn_0.6-6 proto_0.3-10 scales_0.2.4 plotrix_3.5-11
[8] reshape2_1.4.1 dplyr_0.4.1
loaded via a namespace (and not attached):
[1] assertthat_0.1 chron_2.3-45 colorspace_1.2-4 lazyeval_0.1.10 magrittr_1.5 munsell_0.4.2
[7] parallel_3.1.2 plyr_1.8.1 Rcpp_0.11.4 rpart_4.1-8 stringr_0.6.2 tools_3.1.2
Did I run out of memory? Is that why it didn't finish loading?
I had a similar problem lately, and it turned out I had two different problems.
1 - Not all rows had the right number of tabs. I ended up counting them using awk.
2 - At some points in the file I had quotes that were not closed. This caused it to skip over all the lines until it found a closing quote.
I will dig up the awk code I used to investigate and fix these issues and post it.
Since I am using Windows, I used the awk that came with git bash.
This counted the number of tabs in a line and printed out those lines that did not have the right number.
awk -F "\t" 'NF!=6 { print NF-1 ":" $0 } ' Catalog01.csv
I used something similar to count quotes, and I used tr to fix a lot of it.
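For the quote check, something along these lines works (a sketch, not the exact code I used): split each line on double quotes, so a line with an unmatched quote shows up as having an even field count.

```shell
# Flag lines with an odd number of double quotes (an unmatched quote):
# with k quotes on a line, awk -F '"' sees k+1 fields, so an odd k
# means an even NF. File name taken from the question above.
awk -F '"' 'NF % 2 == 0 { print NR ": " $0 }' weekly_devdata.tsv
```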
Pretty sure this was not a memory issue. If the problem is unmatched quotes then try this:
t <-read.csv("weekly_devdata.tsv", header=FALSE, stringsAsFactors=TRUE,sep="\t",
quote="")
There is also the very useful function count.fields, which I use inside table to get a high-level view of the consequences of various parameter settings. Take a look at the results of:
table( count.fields( "weekly_devdata.tsv", sep="\t"))
And compare to:
table( count.fields( "weekly_devdata.tsv", sep="\t", quote=""))
It's sometimes necessary to read in with readLines, then remove one or more lines (assigning the result to clean), and then send the cleaned-up lines to read.table(text=clean, sep="\t", quote="").
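As a sketch of that readLines approach (the bad variable, holding the indices of the broken lines, is hypothetical here):
## Read raw lines, drop the known-bad ones, then parse the rest
## with quoting disabled.
raw   <- readLines("weekly_devdata.tsv")
clean <- raw[-bad]
dat   <- read.table(text = clean, sep = "\t", quote = "")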
Could well be some illegal characters in some of the entries... Check how many rows load and where the issue is taking place, then delve deeper into that row of the raw data. Webdings font, that kind of stuff!
I've got the following problem:
Error in .C("NetCDFOpen", as.character(filename), ncid = integer(1), status = integer(1), :
C symbol name "NetCDFOpen" not in DLL for package "xcms"
Here is how to get this error:
nc <- xcms:::netCDFOpen(cdfFile)
ncData <- xcms:::netCDFRawData(nc)
xcms:::netCDFClose(nc)
I don't know why this doesn't work, although it should. For further info, feel free to ask. Free .cdf files can be found in the TargetSearchData package.
Code example:
## The directory with the NetCDF GC-MS files
cdfpath <- file.path(.find.package("TargetSearchData"), "gc-ms-data")
cdfpath
I don't think that it should, as you are implying. First, you are using a non-exported function through :::. In addition, as stated by the error message, there is no NetCDFOpen symbol defined in the dll/so files.
Using the standard input functionality from xcms works smoothly:
> library("xcms")
> cdfpath <- file.path(.find.package("TargetSearchData"), "gc-ms-data")
> cdfFile <- dir(cdfpath, full.names=TRUE)[1]
> xs <- xcmsSet(cdfFile)
7235eg04: 135:168 185:314 235:444 285:580
> xr <- xcmsRaw(cdfFile)
If you really want to input your data manually, you should use the functionality from the mzR package, which xcms depends on:
> openMSfile(cdfFile)
Mass Spectrometry file handle.
Filename: /home/lgatto/R/x86_64-unknown-linux-gnu-library/2.16/TargetSearchData/gc-ms-data/7235eg04.cdf
Number of scans: 4400
Finally, do pay attention to always provide the output of sessionInfo, to ensure that you are using the latest version. In my case:
> sessionInfo()
R Under development (unstable) (2012-10-23 r61007)
Platform: x86_64-unknown-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8
[5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8
[7] LC_PAPER=C LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] BiocInstaller_1.9.4 xcms_1.35.1 mzR_1.5.1
[4] Rcpp_0.9.15
loaded via a namespace (and not attached):
[1] Biobase_2.19.0 BiocGenerics_0.5.1 codetools_0.2-8 parallel_2.16.0
[5] tools_2.16.0
although it might be different for you if you use the stable version of R and Bioconductor (currently 2.15.2/2.11).
Hope this helps.