How do I parse a text file for the line after a phrase in R?

I have a large text file with 40,000+ sections of output. Each output section is ~150 lines, and I want one number from each section to put in a vector. The section I want to parse is shown below.
Min ChiSq_Th: ith_cs ith_rk
-1 1
chisq_th chisq_th_min chisq_th_max ftmp_imv fstp_imv
0.149282D+05 0.200268D+05 0.200268D+05 0.100000D+01 0.100000D+00
I need the number below chisq_th in each sweep. I tried taking every 152nd line, but not every sweep is exactly the same length. I know R is not the ideal platform for this problem, but it is the language I know best.
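One approach (a minimal sketch, not from the original thread; the file name output.txt and the exact header text are assumptions based on the excerpt above) is to key on the header line rather than a fixed offset, then take the first field of the line that follows:

lines <- readLines("output.txt")             # whole file, one string per line
idx   <- grep("^\\s*chisq_th\\s", lines)     # header row of each sweep
vals  <- sapply(idx, function(i) strsplit(trimws(lines[i + 1]), "\\s+")[[1]][1])
chisq <- as.numeric(sub("D", "E", vals, fixed = TRUE))  # Fortran "D" exponent -> "E"

Because this searches for the header text, sweeps of slightly different lengths are handled correctly.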

Related

Optimized way to write a series of strings to a text file without quotations

I am new to Julia, so sorry if this question is obvious.
I am trying to use Julia to help me run a series of finite element models, which use a text input file to give instructions to the finite element solver. Basically, I would like to use Julia to read in the base input file, edit some parameters on some lines, and then write it out as a new file. I am getting hung up on a couple of things, though.
Currently, I am reading in the file like this:
mdl = "fullmodelSVTV"; #name of input file
A = readlines(mdl*".inp")
This reads each line of the file in as a separate string in a vector, which I like because it makes it easier to edit the sections I want, but it also makes things more difficult when I try to write to a new file.
I am writing the file like this:
io = open("name.inp","w")
print(io,A)
close(io)
When I try to write to a new file, the output ends up looking like this:
["string at index 1","string at index 2","string at index 3"...]
What I would like to do is output this the exact same way it is read in, with the string at each index of the vector on its own line. I would also like to remove the brackets and quotation marks from the file, as they might interfere with the finite element solver.
I think I have found a way to concatenate all of the strings at each index, separated by newlines, as shown below.
conc = ""                      # start with an empty accumulator
for i in 1:length(A)
    conc = conc * "\n" * A[i]  # copies the whole string so far on each pass
end
The issue with this is that it takes a long time given the size of the input files I am working with, and I feel like there has to be a better way to achieve my goal.
I also cannot find a way to remove the brackets or quotation marks when writing the file.
So, I'm wondering if anyone has any advice for a better way to write these text files in terms of both concatenating all of the strings from the vector when outputting as well as outputting without the brackets and quotation marks.
Thanks, any advice is appreciated.
The issue with print(io, A) is that it prints a representation of the whole vector, but in fact you want to print each element of the vector. To do so, you can simply print each line in a loop:
open("name.inp", "w") do io
for line in A
println(io, line)
end
end
This prints each string on its own line with no brackets or quotation marks, and it avoids the overhead of repeated string concatenation. (Equivalently, write(io, join(A, "\n")) builds the whole output in one pass.)

How to extract sections of specific text from PDF files into R data frames? Complex

Any advice will be appreciated; this is time sensitive. I have PDF reports that are mostly blocks of text. They are long reports (~50-100 pages). I'm trying to write an R script that is capable of extracting specific sections of these PDF reports using start/stop positional strings. NOTE: Reports vary in length. Short example:
DOCUMENT TITLE
01. SECTION 1
This is a test section that I DONT want to extract.
This text would normally be much longer... Over 100 words.
Sample Text Text Text Text Text Text Text Text
02. SECTION 2
This is a test section that I do want to extract.
This text would normally be much longer... Over 100 words.
Sample Text Text Text Text Text Text Text Text
...
11. SECTION 11
This is a test section that I do want to extract.
This text would normally be much longer... Over 100 words.
Sample Text Text Text Text Text Text Text Text
...
12. SECTION 12
This is a test section that I DONT want to extract.
This text would normally be much longer... Over 100 words.
Sample Text Text Text Text Text Text Text Text
...
So the goal in this example is to extract the paragraph below Section 2 and store it as a field/data point, and likewise to store Section 11 as a field/data point. Note that the document is in PDF format.
I have tried using pdftools, tm, and stringr; I've literally spent 20+ hours searching for solutions and tutorials on how to do this. I know it is possible, as I have done it using SAS before...
Please see the code below; I added comments with questions. I believe regex will be part of the solution, but I'm lost.
# Init Step
libs <- c("tm","class","stringr","testthat",
"pdftools")
lapply(libs, require, character.only= TRUE)
# File name & location
filename = "~/pdf_test/test.pdf"
# converting PDF to text
textFile <- pdf_text(filename)
cat(textFile[1]) # Text of pg. 1 of PDF
cat(textFile[2]) # Text of pg. 2 of PDF
# I'm at a loss of how to parse the values I want. I have seen things like:
sectionxyz <- str_extract_all(textFile, #??? )
rm_between()
# 1) How do I loop through each page of PDF file?
# 2) How do I identify start/stop positions for section to be extracted?
# 3) How do I add logic to extract text between start/stop positions
# and then add the result to a data field?
# 4) Sections in PDF will be long sections of text (i.e. 100+ words into a field)
NEW------
So I have been able to:
-Prep doc correctly
-Identify the correct start/stop patterns:
length(grep("^11\\. LIMITS OF LIABILITY( +){1}$",source_main2))
length(grep("Applicable\\s+[Ll]imits\\s+[Oo]f",source_main2))
pat_st_lol <- "^11\\. LIMITS OF LIABILITY( +){1}$"
pat_ed_lol <- "Applicable\\s+[Ll]imits\\s+[Oo]f"
The length(grep()) statements verify that only 1 instance of each pattern is found. From here I am kind of lost as to how to use gsub or similar to extract the portion of data I want. I tried:
pat <- paste0(".*",pat_st_lol,"(.*)",pat_ed_lol,".*")
test <- gsub(".*^11\\. LIMITS OF LIABILITY( +){1}$(.*)\n",
"Applicable\\s+[Ll]imits\\s+[Oo]f", source_main2)
test2 <-gsub(".*pat_st_lol(.*)\npat_ed_lol.*")
So far, little progress, but progress anyways.
Provided you can come up with a systematic way to identify the sections you need, you could, as you indicated, use regex to extract the text you want.
In your example above, something like gsub(".*SECTION 11(.*)\n12\\..*", "\\1", string) ought to work.
Now you could define patterns dynamically using paste and iterate through all files. Each result can then be saved in your data.frame, list, etc.
Here is a more detailed explanation of the pattern:
Firstly, .* is a way of matching "anything". If you want to match digits you can use \\d or, equivalently, [0-9]; any short intro to regex in R will list the character classes available.
.* at the edges of the pattern means that there can be text before/after
(.*) captures the content we want (here .* matches any content). Basically it means: extract "anything" between SECTION 11 and 12.
\\. means a literal dot, and \n is the "newline" metacharacter (a new line starts just before "12.").
In regex you can create groupings within your pattern using parentheses, e.g. gsub(".*(\\d{2}\\:\\d{2})", "\\1", "18.05.2018, 21:37") will return 21:37, and gsub("([A-z]) \\d+", "\\1", "hello 123") will give hello.
Now, the second argument of gsub can be, and often is, used to provide a substitute, i.e. something to replace the matched pattern with. Here, however, we do not want a substitute; we want to extract something. \\1 means "keep the first grouping", i.e. whatever is inside the first pair of parentheses (you could have multiple groupings).
Finally, string is the string from which we want to extract, i.e. the text of the PDF file.
Now if you want to perform something similar in a loop you could do the following:
# we are in the loop
# first is your starting point in the extraction, i.e. "SECTION 11"
# last is your end point, i.e. "12."
first <- "SECTION 11" # first and last can be dynamically assigned
last <- "12\\." # "\\" is added before the dot as "." is a Regex metachar
# If last doesn't systematically contain a dot
# you could use gsub to add "\\" before the dot when needed:
# gsub("\\.","\\\\.",".") returns "\\."
# so gsub("\\.","\\\\.","12.") returns "12\\."
pat <- paste0(".*",first,"(.*)","\n",last,".*") #"\n" is added to stop before the newline, but it could be omitted (then "\n" might appear in the extraction)
gsub(pat,"\\1",string) # returns the same as above

R - Extract multiple tables from text file

I have a .txt file containing text (which I don't want) and 65 tables.
Does anyone know how I can extract only the tables from this text file, such that I can open the resulting .txt file as a data.frame with my 65 tables in R? Above each table is a fixed number of lines (starting with "The result of abcpred on seq..." and ending with "Predicted B cell epitopes"), and below each of them is a variable number of lines, depending on how many rows each table has. Then comes the next table, and so on until the 65th table.
Given that the tables are the only elements whose lines start with numbers, grepping for a digit at the beginning of the line is indeed the simplest solution. Using the shell (not R), the command:
grep '^[0-9]' input > output
did exactly what I wanted.
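The same filter can be written in R itself (a small sketch; input.txt and output.txt are placeholder names):

lines <- readLines("input.txt")
writeLines(grep("^[0-9]", lines, value = TRUE), "output.txt")  # keep only lines starting with a digit

read.table("output.txt") can then load the combined rows, provided all 65 tables have the same number of columns.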

R: Save multiple svg/png/tif plots

I am currently using pdf() to save multiple plots on several pages.
I change page simply by plot.new().
Can I easily get svg() and png() to do the same? Currently only the last plot is saved in the file.
If I cannot have them in the same file, can I have R autogenerate files like output.png, output2.png?
If you look at the help pages ?png and ?svg you will see that the default file names are "Rplot%03d.png" and "Rplot%03d.svg" respectively. The %03d part of those names means that each time a new plot is created it will automatically open a new file and that part of the file name will be replaced by an incrementing integer. So the first file will be "Rplot001.png" and the next will be "Rplot002.png" etc.
If you don't like the default file name, you can create your own and still insert the portion to be replaced by an integer, such as "myplots%02d.png". The % says this is where the number part starts. The 0 is optional but says to zero-pad the numbers (so you get 01, 02, ... rather than 1, 2, ...), which is generally preferred so that file sorting works out correctly (otherwise you may see the sorting as 1, 10, 11, 2, 3, ...). The digit (3 in the default, 2 in my example) is the number of digits: if you will create more than 1,000 plots you should raise it to 4, and if you know you will not create 100 then 2 is fine (1 is fine if you know you will produce fewer than 10). The d is just an indicator for an integer.
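For instance, a minimal sketch of the numbered-file behavior (file names are just examples):

png("myplots%02d.png")   # %02d is replaced by 01, 02, ... per page
plot(1:10)               # first page -> myplots01.png
plot(10:1)               # second page -> myplots02.png
dev.off()                # close the device and flush the files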

fread - skip lines starting with certain character - "#"

I am using the fread function in R for reading files to data.tables objects.
However, when reading the file I'd like to skip lines that start with #, is that possible?
I could not find any mention to that in the documentation.
fread can read from a piped command that filters out such lines, like this:
fread("grep -v '^#' filename")
Not currently, but it's on the list to do.
Are the # lines at the top forming a header which is more than 30 lines long?
If so, that's come up before and the solution is :
fread("filename", autostart=60)
where 60 is chosen to be inside the block of data to be read.
From ?fread :
Once the separator is found on line autostart, the number of columns
is determined. Then the file is searched backwards from autostart
until a row is found that doesn't have that number of columns. Thus,
the first data row is found and any human readable banners are
automatically skipped. This feature can be particularly useful for
loading a set of files which may not all have consistently sized
banners. Setting skip>0 overrides this feature by setting
autostart=skip+1 and turning off the search upwards step.
The default autostart=30 might just need bumping up a bit in your case.
Or maybe skip=n or skip="string" helps:
If -1 (default) use the procedure described below starting on line autostart to find the first data row. skip>=0 means ignore autostart and take line skip+1 as the first data row (or column names according to header="auto"|TRUE|FALSE as usual). skip="string" searches for "string" in the file (e.g. a substring of the column names row) and starts on that line (inspired by read.xls in package gdata).
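A short sketch combining these suggestions (data.txt and the column name colA are placeholders; newer data.table versions spell the piped form fread(cmd = ...)):

library(data.table)
DT <- fread("grep -v '^#' data.txt")      # pipe through grep to drop every line starting with #
# If the # lines are just a long banner at the top, searching for the
# header row also works:
# DT <- fread("data.txt", skip = "colA")  # start at the line containing "colA"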
