Is there a way to output line separator values in SQLite text output

I've got text like this in one of my sqlite table columns:
Mantas are found in temperate, subtropical and tropical waters. Both species are pelagic; M. birostris migrates across open oceans, singly or in groups, while M. alfredi tends to be resident and coastal. They are filter feeders and eat large quantities of zooplankton, which they swallow with their open mouths as they swim. Gestation lasts over a year, producing live pups.
Mantas may visit cleaning stations for the removal of parasites. Like whales, they breach, for unknown reasons.
The last two lines are broken from the previous one by either a \r or \n. I want to be able to see the actual value of the \r or \n in the shell output of the column. Any ideas?

There doesn't seem to be any way to directly do this in the SQLite shell, but you can use .output <file> to output the result to a file, then use a text or hex editor to see what the line endings are.
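If you are querying the database from R anyway, another option is to pull the column with DBI/RSQLite and print the separators as literal text. A minimal sketch, assuming a hypothetical database file mantas.db with a table articles and a text column body:
library(DBI)
con <- dbConnect(RSQLite::SQLite(), "mantas.db")   # hypothetical file name
txt <- dbGetQuery(con, "SELECT body FROM articles")$body
# show \r and \n as visible text instead of letting them break the line
txt <- gsub("\r", "\\r", txt, fixed = TRUE)
txt <- gsub("\n", "\\n", txt, fixed = TRUE)
cat(txt, sep = "\n\n")
dbDisconnect(con)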

Related

Vroom/fread won't read LARGE .csv file - cannot memory map it

I have a .csv file that is 112GB in size, but neither vroom nor data.table::fread will open it. Even if I ask to read in only 10 rows or just a couple of columns, it complains with a mapping error: cannot allocate memory.
df <- data.table::fread("FINAL_data_Bus.csv", select = c(1:2), nrows = 10)
System errno 22 unmapping file: Invalid argument
Error in data.table::fread("FINAL_data_Bus.csv", select = c(1:2), nrows = 10) :
Opened 112.3GB (120565605488 bytes) file ok but could not memory map it. This is a 64bit process. There is probably not enough contiguous virtual memory available.
read.csv on the other hand will read the ten rows happily.
Why won't vroom or fread read it using the usual altrep, even for 10 rows?
This matter has been discussed by the main creator of the data.table package at https://github.com/Rdatatable/data.table/issues/3526. See the comment by Matt Dowle himself at https://github.com/Rdatatable/data.table/issues/3526#issuecomment-488364641. From what I understand, the gist of the matter is that to read even 10 lines from a huge csv file with fread, the entire file needs to be memory-mapped. So fread cannot be used on its own if your csv file is too big for your machine. Please correct me if I'm wrong.
Also, I haven't been able to use vroom with big, larger-than-RAM csv files. Any pointers in this direction will be appreciated.
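Since read.csv reads the first rows happily (as noted in the question), one low-tech way to peek at the file from R without memory-mapping it is to cap the rows read; a minimal sketch reusing the file name from the question:
# read only the first 10 rows; no memory mapping of the whole file is needed
df_head <- read.csv("FINAL_data_Bus.csv", nrows = 10)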
For me, the most convenient way to check out a huge (gzipped) csv file is the small command-line tool csvtk from https://bioinf.shenwei.me/csvtk/
e.g., check the dimensions with
csvtk dim BigFile.csv.gz
and check out the head, with the top 100 rows:
csvtk head -n100 BigFile.csv.gz
and get a better view of the above with
csvtk head -n100 BigFile.csv.gz | csvtk pretty | less -SN
Here I've used the less command, which is available on Windows via "Gnu On Windows" at https://github.com/bmatzelle/gow
A word of caution - many people suggest using the command
wc -l BigFile.csv
to check the number of lines in a big csv file. In most cases it will be equal to the number of rows, but if the big csv file contains newline characters within a cell (to use a spreadsheet term), the above command will not give the number of rows; the number of lines then differs from the number of rows. So it is advisable to use csvtk dim or csvtk nrow. Other csv command-line tools like xsv and miller will also report the correct result.
Another word of caution - the short command fread(cmd="head -n 10 BigFile.csv") is not advisable for previewing the top few lines if some columns contain significant leading zeros, such as 0301 or 0542, since without a column specification fread will interpret them as integers and drop the leading zeros. For example, in some databases that I have to analyse, a leading zero in a particular column means that the record is a Revenue Receipt. So it is better to use a command-line tool like csvtk, miller or xsv with less -SN to preview a big csv file, which shows the file "as is" without any potentially wrong interpretation.
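If fread(cmd=...) has to be used for a quick preview anyway, a hedged workaround for the leading-zero problem is to force every column to be read as character via colClasses, so values like 0301 stay as text; a minimal sketch:
# preview the first rows while keeping leading zeros; all columns come in as character
preview <- data.table::fread(cmd = "head -n 10 BigFile.csv", colClasses = "character")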
PS1: Even spreadsheets like MS Excel and LibreOffice Calc lose leading zeroes in csv files by default. LibreOffice Calc actually shows the leading zeroes in the preview window but loses them when you load the file! I have yet to find a spreadsheet that does not lose leading zeroes in csv files by default.
PS2: I've posted my approach to querying very large csv files at https://stackoverflow.com/a/68693819/8079808
EDIT:
vroom does have difficulty when dealing with huge files, since it needs to store the index in memory as well as any data you read from the file. See the development thread at https://github.com/r-lib/vroom/issues/203

How to remove these footnotes from text

Alright, so I have minimal experience with RStudio. I've been googling this for hours now and I'm fed up-- I don't care about the pride of figuring it out on my own anymore, I just want it done. I want to do some stuff with the Canterbury Tales-- the Middle English version on Gutenberg.
I downloaded the plain text, trimmed out the metadata, etc., but it's chock-full of "helpful" footnotes and I can't figure out how to cut them out. For example:
"And shortly, whan the sonne was to reste,
So hadde I spoken with hem everichon,
That I was of hir felawshipe anon,
And made forward erly for to ryse,
To take our wey, ther as I yow devyse.
19. Hn. Bifel; E. Bifil. 23. E. were; _rest_ was. 24. E. Hn.
compaignye. 26, 32. E. felaweshipe. Hl. pilgryms; E. pilgrimes.
34. E. oure
But natheles, whyl I have tyme and space,..."
I at least have the vague notion that this is a grep/regex puzzle. Looking at the text in TextEdit, each bundle of footnotes is indented by 4 spaces, and the next verse starts with a capitalized word indented by (edit: 4 spaces as well).
So I tried downloading the package qdap and using the rm_between function to specify removal of text between four spaces and a number, and two spaces and a capital letter (" [0-9]", " [A-Z]"), to no avail.
I mean, this isn't nearly as simple as "make the text lowercase and remove all the numbers dur-hur", which is what all the tutorials so helpfully offer. But I'm assuming this is a rather common thing that people have to do when dealing with big texts. Can anyone help me? Or do I have to go into TextEdit and just manually delete all the footnotes?
EDIT: I restarted the workspace today and all I have is a scan of the file, each line stored in a character vector, with the Gutenberg metadata trimmed out:
text <- scan("thefilepath.txt", what = "character", sep = "\n")
start <- which(text == "GROUP A. THE PROLOGUE.")
end <- which(text == "God bringe us to the Ioye . that ever schal be!")
cant.lines.v <- text[start:end]
And that's it so far. Eventually I will
cant.v <- paste(cant.lines.v, collapse = " ")
and then strsplit and unlist into a vector of individual words-- but I'm assuming that, to get rid of the footnotes, I need to gsub and replace them with blank space, and that will be easier while each line is kept separate? I just don't know how to encode the pattern I need to cut. I believe it is 4 spaces followed by a number, continuing until you get to 4 spaces followed by a capitalized word and a second word without numbers, special characters or punctuation.
I hope that I'm providing enough information; I'm not well-versed in this, but I am looking to become so... thanks in advance.
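Not a polished answer, but a hedged sketch of one way to encode the pattern described above, assuming each footnote bundle starts with a line indented by four spaces and beginning with a digit, and runs until the next four-space-indented line that starts with a capital letter (the resuming verse):
drop <- rep(FALSE, length(cant.lines.v))
in_note <- FALSE
for (i in seq_along(cant.lines.v)) {
  if (grepl("^ {4}[0-9]", cant.lines.v[i])) {
    in_note <- TRUE                    # a footnote bundle begins
  } else if (in_note && grepl("^ {4}[A-Z]", cant.lines.v[i])) {
    in_note <- FALSE                   # the verse resumes
  }
  drop[i] <- in_note
}
cant.lines.v <- cant.lines.v[!drop]    # keep only the verse lines
The exact regexes depend on how the indentation survived scan(); checking a few known footnote lines with grepl() first is a good sanity check.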

R - Extract multiple tables from text file

I have a .txt file containing text (which I don't want) and 65 tables, as shown below (just the top of the .txt file).
Does anyone know how I can extract only the tables from this text file, so that I can open the resulting .txt file as a data.frame with my 65 tables in R? Above each table is a fixed number of lines (starting with "The result of abcpred on seq..." and ending with "Predicted B cell epitopes"), and below each of them is a variable number of lines, depending on how many rows each table has. Then comes the next table, and so on until the 65th table.
Given that the tables are the only elements that start with numbers, grepping for digits at the beginning of the line is indeed the best solution. Using the shell (and not R), the command:
grep '^[0-9]' input > output
did exactly what I wanted.
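For anyone who wants to stay inside R, the same filter can be done with readLines and grepl, and the kept lines parsed into a single data.frame; a hedged sketch, assuming the 65 tables share the same whitespace-separated column layout (the file name is a placeholder):
lines <- readLines("input.txt")
table_lines <- lines[grepl("^[0-9]", lines)]          # keep only lines starting with a digit
df <- read.table(text = table_lines, header = FALSE)  # parse them as one data frame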

How can I enter a command that is over 256 characters long in IRIX

I connect to different types of computers every day. When I telnet in, the first thing I do is run a command-line script that is about 1150 characters long. I have no problem with Linux-based systems, but if it is Unix-based (i.e. IRIX), then my command is truncated at ~256 characters.
The final result of the command will be a data dump (the results of the commands) to the Telnet window. This data will then be copied and pasted into a tool for analysis. The command string being entered is a series of commands (mostly egreps) separated by semicolons, but when combined it gets very long.
I need to be able to enter all 1150 characters on the command line. The systems I access are not mine, so I need to be as benign as possible when interacting with them.
Your help is appreciated.
If it's a parameter list that's making the command that long, then xargs is your friend.
I'm not sure if this is the answer you're looking for, but as you stated in your comment, all of the commands are less than 256 characters. So you can break the commands up into 5-6 groups, being sure to only split at the semicolons (not at pipes), and then execute each group in sequence. It's more work if you're used to just copying and pasting, but not much more if you already have the groups saved in a text file.

R: Extract value and lines after key word (text file mining)

Setting:
I have (simple) .csv and .dat files created from laboratory devices and other programs that store information on measurements or calculations. I have found solutions for other languages, but not for R.
Problem:
Using R, I am trying to extract values to quickly display results without opening the created files. There are two typical settings:
a) I need to read a priori unknown values after known key words
b) I need to read lines after known key words or lines
I can't make functions such as scan() and grep() work.
c) Finally, I would like to loop over dozens of files in a folder and get a summary (to make the picture complete: I will manage this part)
I would appreciate any form of help.
OK, it works for the key value (although perhaps not very nicely):
variable <- scan("file.csv", what = character(), sep = "")
returns a character vector of everything.
variable[grep("keyword", variable) + 2] # + 2 as the actual value is stored two places ahead
returns the sought values as characters.
as.numeric(gsub(",", ".", variable))
For completion: the data had to be converted to numbers, and the "," vs "." decimal problem needed to be solved.
In a line:
data <- as.numeric(gsub(",", ".", variable[grep("Ks_Boden", variable) + 2]))
Perseverance is not too bad of an asset ;-)
The rest isn't finished yet; I will post once it is.
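For setting b) (lines after a known key word), a similar hedged sketch using readLines and grep; the file name, key word and number of lines to grab are placeholders:
lines <- readLines("file.csv")
hit <- grep("keyword", lines)[1]         # first line containing the key word
block <- lines[(hit + 1):(hit + 5)]      # the 5 lines that follow it; adjust as needed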
