I have some data in an object called all_lines, a character vector in R (the result of reading a PDF file into R). My objective is to delete everything before a certain string and to delete everything after another string.
The data looks like this (it is stored in the object all_lines):
class(all_lines)
[1] "character"
[1] "LABORATORY Research Cover Sheet"
[2] "Number 201111EZ Title Maximizing throughput"
[3] " in Computers"
[4] "Start Date 01/15/2000"
....
[49] "Introduction"
[50] "Some text here and there"
[51] "Look more text"
....
[912] "Citations"
[913] "Author_1 Paper or journal"
[914] "Author_2 Book chapter"
I want to delete everything before the string 'Introduction' and everything after 'Citations'. However, nothing I have found seems to do the trick. I have tried commands from the post How to delete everything after a matching string in R and from multiple online R tutorials on how to do just this. Here are some commands that I have tried; all I get back is all_lines with the string 'Introduction' deleted and everything else still there.
str_remove(all_lines, "^.*(?=(Introduction))")
sub(".*Introduction", "", all_lines)
gsub(".*Introduction", "", all_lines)
I have also tried to delete everything after the string 'Citations' using the same commands, such as:
sub("Citations.*", "", all_lines)
Am I missing anything? Any help would really be appreciated!
It looks like your variable is a vector of character strings, one element per line in the document.
We can use the grep() function here to locate the lines containing the desired text. I am assuming that only one line contains "Introduction" and only one line contains "Citations".
#line numbers containing the start and end
Intro <- grep("Introduction", all_lines)
Citation <- grep("Citations", all_lines)
#extract out the desired portion.
abridged <- all_lines[Intro:Citation]
You may need to add 1 or subtract 1 if you would like to actually remove the "Introduction" or "Citations" lines themselves.
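For example, a minimal sketch that drops the two marker lines themselves (assuming a single match for each, with Intro before Citation):
#keep only the lines strictly between "Introduction" and "Citations"
abridged <- all_lines[(Intro + 1):(Citation - 1)]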
Assuming you can accept a single string as output, you could collapse the input into a single string and then use gsub(). Note that the lookaround assertions require perl = TRUE:
all_lines <- paste(all_lines, collapse = " ")
output <- gsub("^.*?(?=\\bIntroduction\\b)|(?<=\\bCitations\\b).*$", "", all_lines, perl = TRUE)
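A quick check on a small made-up vector (demo_lines is illustrative stand-in data, not your actual input):
demo_lines <- c("Cover Sheet", "Number 201111EZ", "Introduction",
                "Some text here", "Citations", "Author_1 Paper")
collapsed <- paste(demo_lines, collapse = " ")
gsub("^.*?(?=\\bIntroduction\\b)|(?<=\\bCitations\\b).*$", "", collapsed, perl = TRUE)
## [1] "Introduction Some text here Citations"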
I have a data frame desc with a variable "value" that holds long text, and I would like to remove every word in that variable that ends with ".htm". I have looked around here and at regex references for a long time and cannot find a solution.
Can anyone help? Thank you so much!
I tried things like:
library(stringr)
desc <- str_replace_all(desc$value, "\*.htm*$", "")
But I get:
Error: '\*' is an unrecognized escape in character string starting ""\*"
This regex:
Will catch everything that ends with .htm
Will not catch instances of .html
Does not depend on the match being at the beginning or end of the string
It also absorbs the whitespace before each matched word, so the spacing stays intact:
strings <- c("random text shouldbematched.htm notremoved.html matched.htm random stuff")
gsub("\\w+\\.htm\\b", "", strings)
Output:
[1] "random text notremoved.html random stuff"
I am not sure what exactly you would like to accomplish, but I guess one of these is what you are looking for:
library(stringr)
words <- c("apple", "test.htm", "friend.html", "remove.htm")
# just remove ".htm" (a literal dot, hence the escaping) from every string
str_replace_all(words, "\\.htm", "")
# exclude all words that contain .htm anywhere
words[!grepl(pattern = "\\.htm", words)]
# exclude all words that END with .htm
words[substr(words, nchar(words)-3, nchar(words)) != ".htm"]
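For reference, here is what each variant returns with the words vector above (a quick check; note that the first one turns friend.html into friendl, since it matches .htm anywhere in the word):
str_replace_all(words, "\\.htm", "")
## [1] "apple"   "test"    "friendl" "remove"
words[!grepl(pattern = "\\.htm", words)]
## [1] "apple"
words[substr(words, nchar(words) - 3, nchar(words)) != ".htm"]
## [1] "apple"       "friend.html"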
I am not sure that you can use * on its own to tell R to match anything inside a string, so I would remove it first. Also, in your code you are assigning the result to the entire data frame desc instead of to the variable "value".
So I would suggest the following:
desc$value <- str_replace_all(desc$value, "\\.htm", "")
By doing so, you are telling R to remove every ".htm" in the desc$value variable alone, leaving the rest of the data frame untouched. I hope it works!
Let's assume you have, as you say, a variable "value" that holds long text and you want to remove every word that ends in .html. Based on these assumptions you can use str_remove_all().
The main point here is to wrap the pattern in word boundary markers \\b (the leading \\s* just absorbs the space before each removed word):
library(stringr)
str_remove_all(value, "\\s*\\b\\w+\\.html\\b")
[1] "apple and test2.html01" "the word must etc. and as well" "we want to remove .htm"
Data:
value <- c("apple test.html and test2.html01",
"the word friend.html must etc. and x.html as well",
"we want to remove .htm")
To achieve what you want just do:
desc$value <- str_replace(desc$value, ".*\\.htm$", "")
You are trying to escape the star, which is unnecessary. You get an error because \* is not a valid escape sequence in R character strings; only \n, \t, etc. exist.
\. is not a valid escape in R strings either. But \\ is, and it produces a single \ in the resulting string that gets passed to the regular expression engine. Therefore, when you escape something in an R regexp you have to escape it twice:
In my regexp, .* means any characters and \\. means a literal dot. The dot has to be escaped twice because the \ first needs to be escaped in the R string.
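A small illustration of the difference between the unescaped and escaped dot:
x <- c("index.htm", "indexxhtm")
grepl(".htm$", x)   # unescaped: the dot matches any character
## [1] TRUE TRUE
grepl("\\.htm$", x) # escaped: \\. matches only a literal dot
## [1]  TRUE FALSE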
I'm working with CSV files and the problem is that some rows have columns containing \" inside. A simple example would be:
"Row 42"; "Some value"; "Description: \"xyz\""; "Anoher value"
As you can see, the third column contains that combination and when I use the read_csv method in R, the input format is messed up. One working solution is to open the CSV file in Notepad++ and simply replace \" with ', for example. However, I'd prefer to have this automated.
I'm able to replace the \" with ' by using
gsub('\\\\"', "\\\'", df)
However, I'm not able to write it in the original format. Whenever I read the CSV file with R, I lose the quotation marks indicating the columns. So, in other words, my current method outputs the following:
"Row 42; Some value; Description: 'xyz'; Anoher value"
The quotation marks before and after ; are missing.
It's almost fine, but when opening the preprocessed file with Excel, it doesn't recognize the columns. I think the most convenient solution would be to read the CSV file simply as one big string containing all the quotation marks, replace the desired combination as explained above, and then write it out again. However, I'm not able to read the file as one big string containing all the quotation marks.
Is there a way to read the CSV file with R containing all the quotation marks? Do you have any other solutions to achieve that?
Have you already tried read.table? It comes with the base installation of R.
Define sep=';' as the separator and disable quote handling with quote=''. Then gsub() the backslashes and redundant quotes away and apply trimws(). This should fix your data.
x <- '"Row 42"; "Some value;" "Description: \"xyz\""; "Anoher value"'
tab <- read.table(text=x, sep=';', quote='')
tab[] <- lapply(tab, \(x) trimws(gsub(x, pat='\\"', rep='')))
tab
# V1 V2 V3 V4
# 1 Row 42 Some value Description: xyz Anoher value
In your case use read.table(file='<path to .csv file>', sep=';', quote='')
I found the solution, if anyone else faces the same problem:
library(readr)
data <- read_lines(inputFileName)
preprocessed <- gsub('\\\\"', "'", data)
write_lines(preprocessed, outputFileName)
I'm working in R, trying to prepare text documents for analysis. Each document is stored in a column (aptly named "document") of a data frame called metaDataFrame. The documents are strings containing articles and their BibTeX citation info. The data frame looks like this:
       filename                          document doc_number
1 lithuania2016 Commentary highlights Estonian...          1
2 lithuania2016 Norwegian police, immigration ...          2
3 lithuania2016 Portugal to deply over 1,000 m...          3
I want to extract the BibTeX information from each document into a new column. The citation information begins with "Credit:" but some articles contain multiple "Credit:" instances, so I need to extract all of the text after the last instance. Unfortunately, the string is only sometimes preceded by a new line.
My solution so far has been to find all of the instances of the string and save the location of the last instance of "Credit:" in each document in a list:
locate.last.credit <- lapply(gregexpr('Credit:', metaDataFrame$document), tail, 1)
This provides a list of integer locations of the last "Credit:" string in each document or a value of "-1" where no instance is found. (Those missing values pose a separate but related problem I think I can tackle after resolving this issue).
I've tried variations of strsplit, substr, stri_match_last, and rm_between...but can't figure out a way to use the character positions instead of a regular expression to extract this part of the string.
How can I use the location of characters to manipulate a string instead of regular expressions? Is there a better approach to this (perhaps with regex)?
How about like this:
test_string <- " Portugal to deply over 1,000 m Credit: mike jones Credit: this is the bibliography"
gsub(".*Credit:\\s*(.*)", "\\1", test_string, ignore.case = TRUE)
[1] "this is the bibliography"
The regex pattern is looking for Credit:, but because it is preceded by .*, it is going to find the last instance of the word (if you wanted the first instance of Credit, you'd use .*? instead). \\s* matches 0 or more whitespace characters after Credit: and before the rest of the text. We then capture the remainder of each document in (.*) as capture group 1 and return it with \\1. Also, I use ignore.case = TRUE so credit, CREDIT, and Credit will all be matched.
And with your object it would be:
gsub(".*Credit:\\s*(.*)", "\\1", metaDataFrame$document, ignore.case = TRUE)
I am trying to read a text file containing some words on each line. I would like to store each line as an element of a list.
I am using the following command:
listOfNames = readLines("text.txt")
I am getting the following result:
[1] "\"string1\""
[2] "\"string2\""
[3] "\"string3\""
How can I get rid of the \ and the extra " symbols?
my desired output looks like the following.
[1] "string1"
[2] "string2"
[3] "string3"
Do you have any idea how I can fix this?
Thanks,
Looks like your input contains quotes. You can remove these with gsub():
listOfNames <- gsub("\"" ,"", listOfNames)
You can use
listOfNames = read.table("text.txt")
to get a data.frame. If you need a list, you can use
listOfNames = as.list(unlist(read.table("text.txt")))
The function read.table(), like the wrappers around it (for example read.csv()), has a parameter quote which handles ' and " by default. If you ever need another quote character, you can change that parameter.
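A minimal sketch with text= standing in for the real file:
listOfNames = read.table(text = '"string1"\n"string2"\n"string3"')$V1
listOfNames
## [1] "string1" "string2" "string3"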
I have scraped some data and stored it in a data frame. Some rows contain unwanted information within square brackets. Example "[N] Team Name".
I want to keep just the part containing the team name, so first I use the code below to remove the brackets and any text contained within them
gsub( " *\\(.*?\\) *", "", x)
This leaves me with " Team Name" (notice the space before the T).
Now I am trying to remove the whitespace before the T using trimws or the method shown here, but it is not working. Could someone please help me with removing the extra whitespace?
Note: if I write the string containing the space manually and apply trimws on it, it works. However, when the string comes directly from the data frame, it doesn't. Also, when running the code snippet below (where df[1,1] is the same string retrieved from the data frame), I get FALSE. This gives me reason to believe that the string in the data frame is not the same as the manually typed string.
" team name" == df[1,1]
You could try
gsub( "\\[[^]]*\\]\\W*", "", "[N] Team Name")
We can use
sub(".*\\]\\s+", "", x)
#[1] "Team Name"
Or just
sub("\\S+\\s+", "", x)
#[1] "Team Name"
data
x <- '[N] Team Name';
You should be able to remove the bracketed piece as well as any following whitespace with a single regex substitution. Your regex is correct as-is, and should successfully accomplish this. (Note: I've ignored the unexplained discrepancy between your use of parentheses vs. square brackets in your question. I've assumed square brackets for my answer.)
Strangely, this seems to be a case where the default regex engine falls short: the default engine (TRE) does not reliably support lazy quantifiers such as .*?, particularly when they are mixed with greedy quantifiers in the same pattern. Adding perl=T, which switches to the PCRE engine, gets it working:
x <- '[N] Team Name';
gsub(' *\\[.*?\\] *','',x);
## [1] " Team Name"
gsub(perl=T,' *\\[.*?\\] *','',x);
## [1] "Team Name"
In the past I have run across cases where the default regex engine misbehaves on patterns like this, but I have never encountered such problems with perl=T, so I suggest you use it whenever a pattern mixes lazy and greedy quantifiers.
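If you would rather not rely on perl=T, a pattern that avoids the lazy quantifier altogether sidesteps the problem, e.g. a negated character class:
x <- '[N] Team Name';
gsub(' *\\[[^]]*\\] *','',x); ## "[^]]*" matches any run of characters that are not "]"
## [1] "Team Name"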