I have a text file with no apparent tabular or other structure, for example with contents
some text on line 1
some more text on line 2
even more text on the third line
etc
What is the most elegant and R-like way to print out the first few (say 2) lines of text from this file to the console?
Option 1: readLines
readLines('file.txt', n=2)
# [1] "some text on line 1" "some more text on line 2"
The n=2 option is useful, but I want the raw file contents, not the individual lines as elements of a vector.
Option 2: file.show
file.show('file.txt')
# some text on line 1
# some more text on line 2
# even more text on the third line
# etc
This output format is what I would like to see, but an option to limit the number of lines, like n=2 in readLines, is missing.
Option 3: system('head')
system('head -n2 file.txt')
# some text on line 1
# some more text on line 2
That's exactly the output I would like to get, but I'm not sure if this works on all platforms, and invoking an external command for such a simple task is a bit awkward.
One could combine the readLines solution with paste and cat to format the output, but this seems excessive. A simple command like file.head would be nice, or an n=2 argument in file.show, but neither exists. What is the most elegant and compact way to achieve this in R?
To clarify the goal here: This is for a write-up of an R tutorial, where the narrative is something like "... we have now written our data to a new text file, so let's see if it worked by looking at the first couple of lines ...". At this point a simple and compact R expression, using base (update: or tidyverse) functions, to do exactly this would be very useful.
Use writeLines with readLines:
writeLines(readLines("file.txt", 2))
giving:
some text on line 1
some more text on line 2
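If the one-liner is needed more than once in the tutorial, it could be wrapped in a small helper. This is only a sketch: file_head is a hypothetical name (the question notes that no such base function exists), and the sample file is recreated in a temporary location just so the example runs on its own.

```r
# file_head() is a made-up helper name, not a base R function
file_head <- function(path, n = 2) {
  # readLines() returns the first n lines as a character vector;
  # writeLines() prints them one per line, without quotes or [1] indices
  writeLines(readLines(path, n = n))
}

# Recreate the sample file from the question in a temporary location
demo_path <- tempfile(fileext = ".txt")
writeLines(c("some text on line 1",
             "some more text on line 2",
             "even more text on the third line",
             "etc"), demo_path)

file_head(demo_path)
```

Keeping n as an argument with a default means the same call can show more lines later, for example file_head(demo_path, n = 3).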
This could alternately be written as the following pipeline. It gives the same output:
library(magrittr)
"file.txt" %>% readLines(2) %>% writeLines
I am reading text data using read.delim() and read.delim2(). Both accept a skip argument, but it only takes a number of lines to skip from the top of the file (for example 2, 3, 4 or 100), so the number of lines has to be known in advance.
I want to iterate through a body of text, skipping every line until I get to a specific line. Does anyone know how this can be done?
You cannot do that in base R functions, and I don't know of a package that directly provides that. However, here are two ways to get the effect.
First, a file named file.txt:
I want to skip this
and this too
Absolute Irradiance
I need this line
library(dplyr)  # provides cumany()
txt <- readLines("file.txt")
txt[cumany(grepl("Absolute Irradiance", txt))]
# [1] "Absolute Irradiance" "I need this line"
If you don't want the "Irradiance" line but want everything after it, then add [-1] to remove the first of the returned lines:
txt[cumany(grepl("Absolute Irradiance", txt))][-1]
# [1] "I need this line"
If the file is relatively large and you do not want to read all of it into R, then
system2("sed", c("-ne", "'/Absolute Irradiance/,$p'", "file.txt"), stdout = TRUE)
# [1] "Absolute Irradiance" "I need this line"
This second technique is really not that great ... it might be better to run that from file.txt into a second (temp) file, then just readLines("tempfile.txt") directly.
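Another way to get the same effect without dplyr or an external sed is to read from an open connection one line at a time until the marker appears, and then read the rest in one go. The following is a minimal sketch under the assumption that the marker is the literal text "Absolute Irradiance"; the variable names and the temporary file are illustrative only.

```r
# Recreate the sample file from above in a temporary location
irr_path <- tempfile(fileext = ".txt")
writeLines(c("I want to skip this",
             "and this too",
             "Absolute Irradiance",
             "I need this line"), irr_path)

con <- file(irr_path, open = "r")  # an open connection remembers its position
repeat {
  line <- readLines(con, n = 1)
  # stop at end of file (zero lines returned) or at the marker line
  if (length(line) == 0 || grepl("Absolute Irradiance", line, fixed = TRUE)) break
}
after_marker <- readLines(con)     # everything after the marker line
close(con)

after_marker
# [1] "I need this line"
```

If the marker never appears, after_marker is simply an empty character vector. The remaining lines can then be handed to read.delim() through its text argument instead of a file name.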
Any advice would be appreciated; this is time sensitive. I have PDF reports that are mostly blocks of text. They are long reports (~50-100 pages). I'm trying to write an R script that can extract specific sections of these PDF reports using start/stop positional strings. NOTE: Reports vary in length. Short example:
DOCUMENT TITLE
01. SECTION 1
This is a test section that I DONT want to extract.
This text would normally be much longer... Over 100 words.
Sample Text Text Text Text Text Text Text Text
02. SECTION 2
This is a test section that I do want to extract.
This text would normally be much longer... Over 100 words.
Sample Text Text Text Text Text Text Text Text
...
11. SECTION 11
This is a test section that I do want to extract.
This text would normally be much longer... Over 100 words.
Sample Text Text Text Text Text Text Text Text
...
12. SECTION 12
This is a test section that I DONT want to extract.
This text would normally be much longer... Over 100 words.
Sample Text Text Text Text Text Text Text Text
...
So the goal in this example is to extract the paragraph below Section 2 and store it as a field/data point. I also want to store Section 11 as a field/data point. Note that the document is in PDF format.
I have tried using pdftools, tm and stringr, and I've literally spent 20+ hours searching for solutions and tutorials on how to do this. I know it is possible, as I have done it using SAS before...
Please see the code below; I added comments with questions. I believe RegEx will be part of the solution, but I'm so lost.
# Init Step
libs <- c("tm","class","stringr","testthat",
"pdftools")
lapply(libs, require, character.only= TRUE)
# File name & location
filename = "~/pdf_test/test.pdf"
# converting PDF to text
textFile <- pdf_text(filename)
cat(textFile[1]) # Text of pg. 1 of PDF
cat(textFile[2]) # Text of pg. 2 of PDF
# I'm at a loss as to how to parse the values I want. I have seen things like:
sectionxyz <- str_extract_all(textFile, #??? )
rm_between()
# 1) How do I loop through each page of PDF file?
# 2) How do I identify start/stop positions for section to be extracted?
# 3) How do I add logic to extract text between start/stop positions
# and then add the result to a data field?
# 4) Sections in PDF will be long sections of text (i.e. 100+ words into a field)
NEW------
So I have been able to:
-Prep doc correctly
-Identify the correct start/stop patterns:
length(grep("^11\\. LIMITS OF LIABILITY( +){1}$",source_main2))
length(grep("Applicable\\s+[Ll]imits\\s+[Oo]f",source_main2))
pat_st_lol <- "^11\\. LIMITS OF LIABILITY( +){1}$"
pat_ed_lol <- "Applicable\\s+[Ll]imits\\s+[Oo]f"
The length(grep()) statements verify only 1 instance is being found. From here I am kind of lost based on how to use gsub or similar to extract the portion of data I want. I tried:
pat <- paste0(".*",pat_st_lol,"(.*)",pat_ed_lol,".*")
test <- gsub(".*^11\\. LIMITS OF LIABILITY( +){1}$(.*)\n",
"Applicable\\s+[Ll]imits\\s+[Oo]f", source_main2)
test2 <-gsub(".*pat_st_lol(.*)\npat_ed_lol.*")
So far, little progress, but progress anyways.
Provided you can come up with a systematic way to identify the sections you need, you could, as you indicated, use Regex to extract the text you want.
In your above example, something like gsub(".*SECTION 11(.*)\n12\\..*","\\1",string) ought to work.
Now you could define patterns dynamically using paste and iterate through all files. Each result can then be saved in your data.frame, list,....
Here is a brief, more detailed explanation of the pattern:
Firstly, .* is a way of matching "anything". If you want to match digits you can use \\d or, equivalently, [0-9]. Here is a short intro to Regex in R (which I found to be quite useful) where you can find several character classes.
.* at the edges of the pattern means that there can be text before/after
(.*) denotes the content we want (so here matching any content as .* is used). Basically it means extract "anything" between SECTION 11 and 12.
\\. means the dot and \n is the "newline" metacharacter (as before "12.", a new line is started)
In Regex you can create groupings within your pattern using brackets, i.e. gsub(".*(\\d{2}\\:\\d{2})", "\\1","18.05.2018, 21:37") will return 21:37, and gsub("([A-Za-z]) \\d+","\\1","hello 123") will give hello.
Now, the second argument in gsub can be, and often is, used to provide a substitute, i.e. something to replace the matched pattern with. Here, however, we do not want any substitute; we want to extract something. \\1 means extract the first grouping, i.e. what is inside the first brackets (you could have multiple groupings).
Finally, string is the string from which we want to extract, i.e. the text of the PDF file.
Now if you want to perform something similar in a loop you could do the following:
# we are in the loop
# first is your starting point in the extraction, i.e. "SECTION 11"
# last is your end point, i.e. "12."
first <- "SECTION 11" # first and last can be dynamically assigned
last <- "12\\." # "\\" is added before the dot as "." is a Regex metachar
# If last doesn't systematically contain a dot
# you could use gsub to add "\\" before the dot when needed:
# gsub("\\.","\\\\.",".") returns "\\."
# so gsub("\\.","\\\\.","12.") returns "12\\."
pat <- paste0(".*",first,"(.*)","\n",last,".*") #"\n" is added to stop before the newline, but it could be omitted (then "\n" might appear in the extraction)
gsub(pat,"\\1",string) # returns the same as above
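To make the loop body above concrete, here is a small self-contained sketch. The sample string is made up and merely stands in for the text of a report; with pdftools, pdf_text() returns one string per page, so the pages would first need to be joined, for example with paste(textFile, collapse = "\n"). The pattern relies on R's default regex engine, where . also matches a newline.

```r
# Made-up stand-in for the report text; not taken from a real PDF
string <- paste(
  "01. SECTION 1", "Text to skip.",
  "11. SECTION 11", "Text to keep.", "More text to keep.",
  "12. SECTION 12", "Text to skip.",
  sep = "\n"
)

first <- "SECTION 11"  # start marker
last  <- "12\\."       # end marker, with the dot escaped

pat <- paste0(".*", first, "(.*)", "\n", last, ".*")

# "\\1" keeps only the captured group; trimws() drops the leading newline
section_11 <- trimws(gsub(pat, "\\1", string))
section_11
# [1] "Text to keep.\nMore text to keep."
```

The result is a single string, so it can be stored directly in one cell of a data.frame or one element of a list.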
When I run this code
f = open("test.txt", "r")
xp_levelup_save=f.readlines(3)
xp_levelup_save=[int(i.replace("\n", "")) for i in xp_levelup_save][2]
f.close()
print (xp_levelup_save)
but an error comes up: "List index out of range". If readlines is given 2 and the [2] is changed to [1], it works fine. I'm not sure why this is happening. Can anyone help me find a fix? I've tried looking at multiple other discussions, but none seem to work with this code.
My text document looks like this
1
2
3
4
5
6
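f.readlines() doesn't need an argument to retrieve the lines of a document. If you provide the optional argument, as in f.readlines(n), Python does not treat n as a number of lines. It treats n as a size hint: it keeps reading whole lines until the total size read so far reaches roughly n characters, and then stops. Depending on your text file, this could be one, two or even more lines, which is why index [2] may not exist.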
To read the text file, just use
xp_levelup_save = f.readlines()
#resume with your code
where each line is stored as a string list element in your variable xp_levelup_save.
Or simply
xp_levelup_save = [int(i.replace("\n", "")) for i in f.readlines()][2]
I have a double-quote delimited CSV file that has an extra double quote in it, quotes.csv
ID,Page,Category,Comments1
"6203168",26,"A","test, line 1"
"6205809",26,"B","test, line 2"
"6205410",16,"C","test, 3" line 3"
"6205410",16,"D","test, line 4"
I read a lot of SO and google links, but still cannot read the file correctly.
Basic code:
DataFrame = read.csv("quotes.csv",colClasses=c("character","integer","character","character"),header=TRUE,sep=",")
View(DataFrame)
I tried quote="\"", tried read.table with variations of quote - nothing helped. Note: It is not possible to manually edit the CSV file to correct that double quote. Looking for output like this:
       ID Page Category       Comments1
1 6203168   26        A    test, line 1
2 6205809   26        B    test, line 2
3 6205410   16        C test, 3" line 3
4 6205410   16        D    test, line 4
Thanks!
Based on this Stack Overflow post, read.csv does not seem to be able to handle escaped double quotes. But the link gives a possible workaround. Most solutions I found involve removing the double quotes before calling read.csv.
In Linux, to remove all double quotes from a file called input.csv, use the sed command:
sed 's/"//g' input.csv > output.csv
Where output.csv is your csv file with all quotes having been stripped.
In Windows, removing quotes from a file is a bit hairier, and you would have to write a batch file. However, there is an alternative. When I was working with large csv files to do data modeling, I would just open the files in a program like Textpad or Notepad++ and just do a global replace of all quotes. You probably don't want this option if you are dealing with csv very often, but this alternative is just fine for occasional use. Also note that when you open the csv file in an editor, you run the risk that the editor will save and/or strip special characters.
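If the file cannot be edited and the quotes should be kept where they belong, the rows can also be parsed inside R. Note that stripping every quote would additionally split fields such as "test, line 1" at the comma. The sketch below assumes every data row has exactly the layout shown in the question (quoted ID, unquoted Page, quoted Category, quoted Comments1 running to the end of the line); the pattern and variable names are illustrative, and rows that deviate from this layout would need separate handling.

```r
# Sample rows copied from the question, written to a temporary file
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  'ID,Page,Category,Comments1',
  '"6203168",26,"A","test, line 1"',
  '"6205809",26,"B","test, line 2"',
  '"6205410",16,"C","test, 3" line 3"',
  '"6205410",16,"D","test, line 4"'
), csv_path)

rows <- readLines(csv_path)[-1]  # drop the header row

# One capture group per column; the last group takes everything up to the
# final quote on the line, so a stray quote inside the comment is kept
pat <- '^"([^"]*)",([0-9]+),"([^"]*)","(.*)"$'
parts <- regmatches(rows, regexec(pat, rows))
fields <- do.call(rbind, parts)  # column 1 holds the full match

quotes_df <- data.frame(
  ID        = fields[, 2],
  Page      = as.integer(fields[, 3]),
  Category  = fields[, 4],
  Comments1 = fields[, 5],
  stringsAsFactors = FALSE
)
quotes_df
```

Because the last field is matched greedily up to the closing quote, this only works when the problematic column is the final one.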
This is my first ever question here and I'm new to R, trying to figure out my first step in how to do data processing, please keep it easy : )
I'm wondering what would be the best function and a useful data structure in R to load unstructured text data for further processing. For example, let's say I have a book stored as a text file, with no new line characters in it.
Is it a good idea to use read.delim() and store the data in a list? Or is a character vector better, and how would I define it?
Thank you in advance.
PN
P.S. If I use "." as my delimiter, it would treat things like "Mr." as a separate sentence. While this is just an example and I'm not concerned about this flaw, just for educational purposes, I'd still be curious how you'd get around this problem.
read.delim reads in data in table format (with rows and columns, as in Excel). It is not very useful for reading a string of text.
To read text from a text file into R you can use readLines(). readLines() creates a character vector with as many elements as there are lines of text. A line, for this kind of software, is any string of text that ends with a newline. (Read about newline on Wikipedia.) When you write text, you enter your system-specific newline character(s) by pressing Return. A line of text is therefore not defined by the width of your software window, but can run over many visual rows. In effect, a line of text is what in a book would be a paragraph. So readLines() splits your text at the paragraphs:
> readLines("/path/to/tom_sawyer.txt")
[1] "\"TOM!\""
[2] "No answer."
[3] "\"TOM!\""
[4] "No answer."
[5] "\"What's gone with that boy, I wonder? You TOM!\""
[6] "No answer."
[7] "The old lady pulled her spectacles down and looked over them about the room; then she put them up and looked out under them. She seldom or never looked through them for so small a thing as a boy; they were her state pair, the pride of her heart, and were built for \"style,\" not service—she could have seen through a pair of stove-lids just as well. She looked perplexed for a moment, and then said, not fiercely, but still loud enough for the furniture to hear:"
[8] "\"Well, I lay if I get hold of you I'll—\""
Note that you can scroll long text to the left here in Stackoverflow. That seventh line is longer than this column is wide.
As you can see, readLines() read that long seventh paragraph as one line. And, as you can also see, the printed output shows a backslash in front of each quotation mark. Since R displays each line inside quotation marks, it needs to distinguish these from the ones that are part of the original text. Therefore, it "escapes" the original quotation marks when printing; the backslashes are not stored in the text itself. Read about escaping on Wikipedia.
readLines() may output a warning that an "incomplete final line" was found in your file. This only means that there was no newline after the last line. You can suppress this warning with readLines(..., warn = FALSE), but you don't have to. It is not an error, and suppressing the warning does nothing except hide the warning message.
If you don't want to just output your text to the R console but process it further, create an object that holds the output of readLines():
mytext <- readLines("textfile.txt")
Besides readLines(), you can also use scan(), readBin() and other functions to read text from files. Look at the manual by entering ?scan etc. Look at ?connections to learn about many different methods to read files into R.
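For the book scenario from the question, a possible workflow is to read the file with readLines() and, if a single string is more convenient for later splitting, to join the lines again. This is just a sketch; the temporary file and its two short paragraphs are placeholders for a real book file.

```r
# Placeholder "book" with two paragraphs, written to a temporary file
book_path <- tempfile(fileext = ".txt")
writeLines(c("First paragraph.", "Second paragraph."), book_path)

book_lines <- readLines(book_path)               # one element per line
book_text  <- paste(book_lines, collapse = "\n") # one single string

length(book_lines)
# [1] 2
book_text
# [1] "First paragraph.\nSecond paragraph."
```

A file with no newline characters at all would simply give a character vector of length one, so both forms coincide in that case.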
I would strongly advise you to write your text in a .txt-file in a text editor like Vim, Notepad, TextWrangler etc., and not compose it in a word processor like MS Word. Word files contain more than the text you see on screen or printed, and those will be read by R. You can try and see what you get, but for good results you should either save your file as a .txt-file from Word or compose it in a text editor.
You can also copy-paste your text from a text file open in any other software to R or compose your text in the R console:
myothertext <- c("What did you do?
+ I wrote some text.
+ Ah, interesting.")
> myothertext
[1] "What did you do?\nI wrote some text.\nAh, interesting."
Note how entering Return does not cause R to execute the command before I closed the string with "). R just replies with +, telling me that I can continue to edit. I did not type in those plusses. Try it. Note also that now the newlines are part of your string of text. (I'm on a Mac, so my newline is \n.)
If you input your text manually, I would load the whole text as one string into a vector:
x <- c("The text of your book.")
You could load different chapters into different elements of this vector:
y <- c("Chapter 1", "Chapter 2")
For better reference, you can name the elements:
z <- c(ch1 = "This is the text of the first chapter. It is not long! Why was the author so lazy?", ch2 = "This is the text of the second chapter. It is even shorter.")
Now you can split the elements of any of these vectors:
sentences <- strsplit(z, "[.!?] *")
Enter ?strsplit to read the manual for this function and learn about the arguments it takes. The second argument takes a regular expression. In this case I told strsplit to split the elements of the vector at any of the three punctuation marks followed by optional spaces (if you don't allow for a space here, the resulting "sentences" will be preceded by a space).
sentences now contains:
> sentences
$ch1
[1] "This is the text of the first chapter" "It is not long"
[3] "Why was the author so lazy"
$ch2
[1] "This is the text of the second chapter" "It is even shorter"
You can access the individual sentences by indexing:
> sentences$ch1[2]
[1] "It is not long"
R will be unable to know that it should not split after "Mr.". You must define exceptions in your regular expression. Explaining this is beyond the scope of this question.
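As a rough starting point only, one common way to express such exceptions is a negative lookbehind, which needs perl = TRUE. The list of abbreviations below (Mr, Mrs, Dr) is an assumption for illustration and is far from complete; real text will need more exceptions or a dedicated sentence tokenizer.

```r
x <- "Mr. Smith went home. He slept! Did Mrs. Smith notice? No."

# Split at . ! or ? followed by spaces or the end of the string,
# unless the punctuation comes directly after one of the listed abbreviations
abbrev_safe <- "(?<!Mr|Mrs|Dr)[.!?]( +|$)"

sentences_safe <- strsplit(x, abbrev_safe, perl = TRUE)[[1]]
sentences_safe
# [1] "Mr. Smith went home"   "He slept"   "Did Mrs. Smith notice"   "No"
```

The lookbehind only inspects the characters immediately before the punctuation mark, so abbreviations have to be listed explicitly.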
How you would tell R how to recognize subjects or objects, I have no idea.