Read txt file as a numeric array in R

I am using RStudio on a Mac (macOS 10.14.6), and I am trying to read a text file that looks like this:
5:[0.12126984126984124, 0.11682539682539679, 0.14666666666666664, 0.07269841269841269, 0.06984126984126983, 0.0911111111111111, 0.1092063492063492, 0.12253968253968253, 0.08698412698412696, 0.09523809523809523, 0.12222222222222222, 0.10761904761904759]
I've tried several variations of "read", "read.delim", and "read.csv", and they all give pretty much the same result:
> data.matrix(read.delim("data.txt",sep=','))
X5..0.12126984126984124 X0.11682539682539679 X0.14666666666666664 X0.07269841269841269 X0.06984126984126983
X0.0911111111111111 X0.1092063492063492 X0.12253968253968253 X0.08698412698412696 X0.09523809523809523
X0.12222222222222222 X0.10761904761904759
Using "unlist", "as.numeric", or "as.character" does not yield anything useful, most likely due to the X prepended to each number. Does anyone have an idea how to read this file properly?

If you are only interested in reading the numbers, then you first have to delete the 5:[ at the beginning and the ] at the end, then read the result with scan() using sep = ','.
# `string` holds the raw contents of the file, e.g. string <- readLines("data.txt")
scan(text = gsub("^.*\\[|\\]", "", string), sep = ",")
Read 12 items
[1] 0.12126984 0.11682540 0.14666667 0.07269841 0.06984127 0.09111111
[7] 0.10920635 0.12253968 0.08698413 0.09523810 0.12222222 0.10761905
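Putting that together, a minimal end-to-end sketch, assuming the file is named data.txt and contains the single line shown above:
# Read the raw line, strip the "5:[" prefix and the trailing "]",
# then parse the comma-separated values as numerics.
string <- readLines("data.txt")
values <- scan(text = gsub("^.*\\[|\\]", "", string), sep = ",")
values
#  [1] 0.12126984 0.11682540 0.14666667 0.07269841 0.06984127 0.09111111
#  [7] 0.10920635 0.12253968 0.08698413 0.09523810 0.12222222 0.10761905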


paste specific text to strings that do not have it

I would like to paste "miR" onto strings that do not already have "miR", and skip those that do.
paste("miR", ....)
in
c("miR-26b", "miR-26a", "1297", "4465", "miR-26b", "miR-26a")
out
c("miR-26b", "miR-26a", "miR-1297", "miR-4465", "miR-26b", "miR-26a")
One way could be to remove "miR-" if it is present at the beginning of the string using sub, and then paste it back onto every string regardless.
paste0("miR-", sub("^miR-","", x))
#[1] "miR-26b" "miR-26a" "miR-1297" "miR-4465" "miR-26b" "miR-26a"
data
x <- c("miR-26b", "miR-26a", "1297", "4465", "miR-26b", "miR-26a")
vec <- c("miR-26b", "miR-26a", "1297", "4465", "miR-26b", "miR-26a")
sub("^(?!miR)(.*)$", "miR-\\1", vec, perl = T)
#[1] "miR-26b" "miR-26a" "miR-1297" "miR-4465" "miR-26b" "miR-26a"
If you want to learn more:
type ?sub in the R console
learn regex, with a closer look at negative lookahead and capturing groups
I've used perl = T because I get an error otherwise: the lookahead syntax requires Perl-compatible regular expressions.
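For comparison, a sketch of the same idea without lookahead, using grepl() to test for the prefix and pasting only where it is missing (reusing the vec defined above):
# Only prepend "miR-" to elements that do not already start with it.
ifelse(grepl("^miR-", vec), vec, paste0("miR-", vec))
#[1] "miR-26b" "miR-26a" "miR-1297" "miR-4465" "miR-26b" "miR-26a"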

How to read unquoted extra \r with data.table::fread

The data I have to process has unquoted text with some additional \r characters. The files are big (500MB), numerous (>600), and changing the export is not an option. The data might look like:
A,B,C
blah,a,1
bloo,a\r,b
blee,c,d
How can this be handled with data.table's fread?
Is there a better R read CSV function for this, that's similarly performant?
Repro
library(data.table)
csv<-"A,B,C\r\n
blah,a,1\r\n
bloo,a\r,b\r\n
blee,c,d\r\n"
fread(csv)
Error in fread(csv) :
Expected sep (',') but new line, EOF (or other non printing character) ends field 1 when detecting types from point 0:
bloo,a
Advanced repro
The simple repro might be too trivial to give a sense of scale...
samplerecs<-c("blah,a,1","bloo,a\r,b","blee,c,d")
randomcsv<-paste0(c("A,B,C",rep(samplerecs,2000000)))
write(randomcsv,file = "sample.csv")
# Naive approach
fread("sample.csv")
# Akrun's approach with needing text read first
fread(gsub("\r\n|\r", "", paste0(randomcsv,collapse="\r\n")))
#>Error in file.info(input) : file name conversion problem -- name too long?
# Julia's approach with needing text read first
readr::read_csv(gsub("\r\n|\r", "", paste0(randomcsv,collapse="\r\n")))
#> Error: C stack usage 48029706 is too close to the limit
Further to @Dirk Eddelbuettel's & @nrussell's suggestions, a way of solving this is to pre-process the file. The processor could also be called within fread(), but here it is performed in separate steps:
samplerecs<-c("blah,a,1","bloo,a\r,b","blee,c,d")
randomcsv<-paste0(c("A,B,C",rep(samplerecs,2000000)))
write(randomcsv,file = "sample.csv")
# Remove errant `\r`'s with tr - shown here is the Windows R solution
shell("C:/Rtools/bin/tr.exe -d '\\r' < sample.csv > sampleNEW.csv")
fread("sampleNEW.csv")
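On macOS or Linux, the same pre-processing works with the system tr; a sketch, assuming tr is on the PATH and, for the second form, a data.table version that supports the cmd argument:
# Strip carriage returns with tr before reading
system("tr -d '\\r' < sample.csv > sampleNEW.csv")
fread("sampleNEW.csv")

# Or stream straight into fread via its cmd argument
fread(cmd = "tr -d '\\r' < sample.csv")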
We can try with gsub
fread(gsub("\r\n|\r", "", csv))
# A B C
#1: blah a 1
#2: bloo a b
#3: blee c d
You can also do this with tidyverse packages, if you'd like.
> library(readr)
> library(stringr)
> read_csv(str_replace_all(csv, "\r", ""))
# A tibble: 3 × 3
A B C
<chr> <chr> <chr>
1 blah a 1
2 bloo a b
3 blee c d
If you do want to do it purely in R, you could try working with connections. As long as a connection is kept open, it will start reading/writing from its previous position. Of course, this means the burden of opening and closing connections falls on you.
In the following code, the file is processed by chunks:
library(data.table)
input_csv <- "sample.csv"
in_conn <- file(input_csv)
output_csv <- "out.csv"
out_conn <- file(output_csv, "w+")
open(in_conn)
chunk_size <- 1E6
return_pattern <- "(?<=^|,|\n)([^,]*(?<!\n)\r(?!\n)[^,]*)(?=,|\n|$)"
buffer <- ""
repeat {
  new_chars <- readChar(in_conn, chunk_size)
  buffer <- paste0(buffer, new_chars)

  while (grepl("[\r\n]$", buffer, perl = TRUE)) {
    next_char <- readChar(in_conn, 1)
    buffer <- paste0(buffer, next_char)
    if (!length(next_char))
      break
  }

  chunk <- gsub("(.*)[,\n][^,\n]*$", "\\1", buffer, perl = TRUE)
  buffer <- substr(buffer, nchar(chunk) + 1, nchar(buffer))

  cleaned <- gsub(return_pattern, '"\\1"', chunk, perl = TRUE)

  writeChar(cleaned, out_conn, eos = NULL)
  if (!length(new_chars))
    break
}
writeChar('\n', out_conn, eos = NULL)
close(in_conn)
close(out_conn)
result <- fread(output_csv)
Process:
If a chunk ends with a \r or \n, another character is added until it doesn't.
Quotes are put around values containing a \r which isn't adjacent to a \n.
The cleaned chunk is added to the end of another file.
Rinse and repeat.
This code simplifies the problem by assuming no quoting is done for any field in sample.csv. It's not especially fast, but not terribly slow either. Larger values of chunk_size should reduce the amount of time spent in I/O operations. If used for anything beyond this toy example, I'd strongly suggest wrapping it in a tryCatch(...) call to make sure the files are closed afterwards.
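A minimal sketch of such a wrapper; the function name clean_csv and its arguments are made up for illustration, and the chunked loop above goes in the marked spot:
# Wrap the chunked cleaning in tryCatch() so both connections are closed
# even if an error occurs part-way through.
clean_csv <- function(input_csv, output_csv, chunk_size = 1e6) {
  in_conn <- file(input_csv)
  out_conn <- file(output_csv, "w+")
  open(in_conn)
  tryCatch({
    # ... the repeat { readChar / gsub / writeChar } loop from above ...
  }, finally = {
    close(in_conn)
    close(out_conn)
  })
}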

Writing and reading a zoo object - errors

I have a zoo object, prices, for which class(prices) returns "zoo". I then create a file using:
write.zoo(prices, file = "foo", index.name = "time")
The resulting file looks like this:
"time" "AAPL.Adjusted" "SHY.Adjusted"
2013-05-01 60.31 84.12
2013-05-02 61.16 84.11
2013-05-03 61.77 84.08
I then try and read this file with this statement:
myData <- read.zoo("foo")
and I get this error:
Error in read.zoo("foo") :
  index has bad entries at data rows: 1 2 3 4
I’ve tried a number of parameter settings and nothing seems to work. Help much appreciated.
Newbie
The file has a header line so try:
z <- read.zoo("foo", header = TRUE, check.names = FALSE)
The check.names part gives nicer looking column names but you could leave it out if that were not important.
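For reference, a self-contained round trip; this is a sketch in which the prices object is reconstructed from the three rows shown in the question:
library(zoo)

# Rebuild a small zoo object matching the file contents above
prices <- zoo(cbind(AAPL.Adjusted = c(60.31, 61.16, 61.77),
                    SHY.Adjusted  = c(84.12, 84.11, 84.08)),
              order.by = as.Date(c("2013-05-01", "2013-05-02", "2013-05-03")))

write.zoo(prices, file = "foo", index.name = "time")

# header = TRUE is the key part; check.names = FALSE keeps the column
# names exactly as written in the file
z <- read.zoo("foo", header = TRUE, check.names = FALSE)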

jpeg() function incompatible with paste() function in R?

I want to write jpeg files with dynamic filenames.
In plot_filename I concatenate strings with values from other variables to create a dynamic filename.
plot_filename = paste("Series T_all","/",Participant[i],"/",Part[i,2],"/",Part[i,3],".jpg")
The output of plot_filename is just another string: "Series T_all / 802 / 1 / 64 .jpg"
However, when I want to use this string as a filename in the jpeg() function:
jpeg(filename= plot_filename, width = 2000, height = 1500, quality = 100,
pointsize = 50)
plot(T1)
dev.off()
I get the following error:
Error in jpeg(filename = paste("Series T_all", "/", Participant[i], "/", :
unable to start jpeg() device
In addition: Warning messages:
1: In jpeg(filename = paste("Series T_all", "/", Participant[i], "/", :
unable to open file 'Series T_all / 802 / 1 / 64 .jpg' for writing
2: In jpeg(filename = paste("Series T_all", "/", Participant[i], "/", :
opening device failed
But when I just use a plain string (without the paste function) as a filename
name="plot_filename.jpg"
the jpeg() function works just fine.
Does anybody know how this is possible? It seems to me that in both cases you're just passing a string to the jpeg() function, so I don't see why one works but not the other.
Thanks
The statement
plot_filename = paste("Series T_all","/",Participant[i],"/",Part[i,2],"/",Part[i,3],".jpg")
separates the individual strings with spaces (the default) as you can see in your output example
"Series T_all / 802 / 1 / 64 .jpg"
This path, however, does not exist.
If you use
plot_filename = paste("Series T_all","/",Participant[i],"/",Part[i,2],"/",Part[i,3],".jpg", sep="")
this should give a string like
"Series T_all/802/1/64.jpg"
In general, sep= can take any character or string, so you can also use sep="/" to separate your strings and avoid writing "/" every time you concatenate. However, this would also affect the concatenation of Part[i,3] and ".jpg". If you want to go that way, you can append ".jpg" in a second step with sep="". For your case, I think it is fine to just use sep="".
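An alternative worth mentioning, sketched here under the assumption that Participant, Part, i and T1 exist as in the question, is to build the path with file.path(), which inserts the "/" separators itself, and to make sure the target directory exists before opening the device:
# file.path() joins components with "/"; paste0() attaches the extension
# without a space. dir.create() makes sure the folder exists first.
plot_dir <- file.path("Series T_all", Participant[i], Part[i, 2])
dir.create(plot_dir, recursive = TRUE, showWarnings = FALSE)

plot_filename <- file.path(plot_dir, paste0(Part[i, 3], ".jpg"))
jpeg(filename = plot_filename, width = 2000, height = 1500,
     quality = 100, pointsize = 50)
plot(T1)
dev.off()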

Get function's title from documentation

I would like to get the title of a base function (e.g.: rnorm) in one of my scripts. That is included in the documentation, but I have no idea how to "grab" it.
I mean the line given in the Rd files as \title{}, or the top line in the rendered documentation.
Is there any simple way to do this without calling the Rd_db function from tools and parsing all the Rd files, which would be a very big overhead for such a simple task? Another thing: I tried parse_Rd too, but:
I do not know which Rd file holds my function,
I have no Rd files on my system (just rdb, rdx and rds).
So a function to parse the (offline) documentation would be the best :)
POC demo:
> get.title("rnorm")
[1] "The Normal Distribution"
If you look at the code for help, you see that the function index.search seems to be what is pulling in the location of the help files, and that the default for the associated find.package() function is NULL. It turns out that there is neither a help page for that function nor is it exported, so I tested the usual suspects for which package it lives in (base, tools, utils) and ended up with utils:
utils:::index.search("+", find.package())
#[1] "/Library/Frameworks/R.framework/Resources/library/base/help/Arithmetic"
So:
ghelp <- utils:::index.search("+", find.package())
gsub("^.+/", "", ghelp)
#[1] "Arithmetic"
ghelp <- utils:::index.search("rnorm", find.package())
gsub("^.+/", "", ghelp)
#[1] "Normal"
What you are asking for is \title{Title}, but here I have shown you how to find the specific Rd file to parse, and it sounds as though you already know how to do that.
EDIT: @Hadley has provided a method for getting all of the help text, once you know the package name, so applying that to the index.search() value above:
target <- gsub("^.+/library/(.+)/help.+$", "\\1",
               utils:::index.search("rnorm", find.package()))
doc.txt <- pkg_topic(target, "rnorm") # assuming both of Hadley's functions are here
print(doc.txt[[1]][[1]][1])
#[1] "The Normal Distribution"
It's not completely obvious what you want, but the code below will get the Rd data structure corresponding to the topic you're interested in - you can then manipulate that to extract whatever you want.
There may be simpler ways, but unfortunately very little of the needed code is exported and documented. I really wish there was a base help package.
pkg_topic <- function(package, topic, file = NULL) {
  # Find "file" name given topic name/alias
  if (is.null(file)) {
    topics <- pkg_topics_index(package)
    topic_page <- subset(topics, alias == topic, select = file)$file

    if (length(topic_page) < 1)
      topic_page <- subset(topics, file == topic, select = file)$file

    stopifnot(length(topic_page) >= 1)
    file <- topic_page[1]
  }

  rdb_path <- file.path(system.file("help", package = package), package)
  tools:::fetchRdDB(rdb_path, file)
}
pkg_topics_index <- function(package) {
  help_path <- system.file("help", package = package)

  file_path <- file.path(help_path, "AnIndex")
  if (length(readLines(file_path, n = 1)) < 1) {
    return(NULL)
  }

  topics <- read.table(file_path, sep = "\t",
    stringsAsFactors = FALSE, comment.char = "", quote = "", header = FALSE)

  names(topics) <- c("alias", "file")
  topics[complete.cases(topics), ]
}
