I have the string I got from html_text()
feel sore — болеть feel sore about — страдать; мучиться
But it should be like this
feel sore — болеть
feel sore about — страдать; мучиться
The problem is that rvest's html_text() doesn't distinguish whitespace from line breaks, but I need to get only the first line, "feel sore — болеть", somehow.
I tried using stringr::str_extract() but failed. What do I do?
UPD: ok I've found out there's html_text2() but is it still possible to use regex?
You can use two negated character classes:
[^—]+: this matches one or more characters that are not the dash "—"
[^A-Za-z]+: this matches one or more characters that are not upper- or lower-case letters of the English alphabet
Data:
str <- c("feel sore — болеть feel sore about — страдать; мучиться",
"so long — разг. Пока!")
Solution:
library(stringr)
str_extract_all(str, "[^—]+—[^A-Za-z]+")
[[1]]
[1] "feel sore — болеть " "feel sore about — страдать; мучиться"
[[2]]
[1] "so long — разг. Пока!"
To get rid of the list character, use unlist; to get rid of the trailing whitespace, use trimws.
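As a rough sketch of that last step, assuming the goal is only the first dictionary entry, the matches can be flattened and trimmed as shown here. It uses base R's gregexpr() and regmatches() in place of stringr so that it runs without extra packages; the variable names are illustrative.

```r
# Example input from the question (two entries fused on one line)
str <- "feel sore — болеть feel sore about — страдать; мучиться"

# Same pattern as in the answer, applied with base R
pattern <- "[^—]+—[^A-Za-z]+"
entries <- unlist(regmatches(str, gregexpr(pattern, str)))

# Strip leading/trailing whitespace from every match
entries <- trimws(entries)

# Keep only the first entry
first_entry <- entries[1]
print(first_entry)
```

Note that the pattern relies on the translation containing no Latin letters, so it is tied to this particular English-to-Russian layout.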
I am trying to set my working directory within R to a specific location on my Windows computer. However, the destination that I wish to set the working directory to has a somewhat irregular path. That is, it begins with a double backslash, and then only single backslashes are used through the rest of the path. Because of this, I get the error: unrecognized escape character.
To fix this, I have tried a few common R packages along with using raw strings. The functions/packages I have tried so far are gsub, readLines and enc2vec.
However, none of these resolve the issue, and I always end up with double slashes where I need single slashes.
Below is an example that illustrates my issue:
For example, the path that I need to set my working directory to has the form
\\Z\image\pic\store
But if I try the basic path="\\Z\image\pic\store", I get the escape character error.
What can I do?
Edit: It appears it is just an issue with how the backslashes are interpreted; as one user pointed out, using the cat function will show how the string is interpreted. Hence I fixed the issue by using four backslashes in place of the leading two, and two backslashes in place of each single one.
If \\Z\image\pic\store is a valid path then we can refer to it in R using
path <- r"{\\Z\image\pic\store}"
# test
cat(path, "\n")
## \\Z\image\pic\store
See ?Quotes for some variations of this syntax that are possible as well.
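To make the relationship between the raw string and the doubled-backslash form concrete, here is a small sketch. It assumes R 4.0.0 or later (raw strings are not available before that), and the variable names are illustrative only.

```r
# Raw string (requires R >= 4.0.0): backslashes are taken literally
raw_path <- r"{\\Z\image\pic\store}"

# Classic string literal: every backslash must be doubled
escaped_path <- "\\\\Z\\image\\pic\\store"

# Both literals describe the same characters
print(identical(raw_path, escaped_path))

# print() shows the escaped form, cat() shows the actual characters
print(raw_path)
cat(raw_path, "\n")

# Convert to forward slashes; the pattern "\\\\" is a regex for one backslash
forward_path <- gsub("\\\\", "/", raw_path)
cat(forward_path, "\n")
```

The doubled backslashes that print() displays are only a display convention; the string itself holds single backslashes, which is what cat() reveals.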
edit to demonstrate I made an effort to solve this myself
I have the following string:
"This is a question? This is an answer This is another question? This is another answer."
What I want to do is create a regex that will match all the questions so I can remove them. In this case the preceding answer or sentence doesn't always end with a '.' (full stop). So what I am looking for is to match sentences that end with a question mark and start with a capital letter.
What I want to match:
"This is a question?" and "This is another question?"
I work in R so I prefer an answer with stringr, but I'm mostly interested in the regex that I should apply.
I tried the following regex ^[A-Z].+\? but unfortunately it matches the whole string.
This should do the trick in regex: ([A-Z][^A-Z?]*\?)
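A minimal sketch of applying that regex in R, using base functions rather than stringr so it runs without extra packages. In an R string the backslash has to be doubled, and perl = TRUE is used here so that [A-Z] is not affected by the locale.

```r
x <- "This is a question? This is an answer This is another question? This is another answer."

# A capital letter, then any run of characters that are neither capitals
# nor "?", then a literal "?" (the backslash is doubled in an R string)
pattern <- "[A-Z][^A-Z?]*\\?"

# Extract the questions
questions <- regmatches(x, gregexpr(pattern, x, perl = TRUE))[[1]]
print(questions)

# Remove them, then squeeze the leftover whitespace
answers <- gsub(pattern, "", x, perl = TRUE)
answers <- trimws(gsub("\\s+", " ", answers))
print(answers)
```

One limitation of this pattern: a question that contains a capital letter after its first character (a name, for example) is only matched from that later capital onward.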
I would like to find and replace tabular instances with tabularx. I tried with gsub but it seems to enter me into a world of escaping pain. Following other questions and answers I found fixed=TRUE, which is the best I have so far. The code snippet below almost works, but \B is an unrecognized escape. If I escape it twice I get \BEGIN as output!
texText <- '\begin{tabular}{rl}\begin{tabular}{rll}'
texText <- gsub("\begin{tabular}{rl}", "\BEGIN{tabular}{rll}", texText, fixed=TRUE)
I'm using BEGIN as my test to see what is happening. This is before I get to tackling the question of what goes on in the brackets {rl} {ll} {rrl} etc. Ideally I'm looking for a regex that would output:
\begin{tabularx}{rX}\begin{tabularx}{rlX}
That is the final column is replaced by X.
Try using proper escaping:
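texText <- "\\begin{tabular}{rl}\\begin{tabular}{rll}"
output <- gsub("\\\\begin\\{tabular\\}", "\\\\begin{tabularx}", texText)
output
[1] "\\begin{tabularx}{rl}\\begin{tabularx}{rll}"
In an R string literal, a literal backslash is written as two backslashes, and matching one in a regex pattern takes four (two for R's string parser, two for the regex engine). Metacharacters such as { and } are escaped with two backslashes. print() shows the backslashes doubled; use cat(output) to see the actual text.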
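For the second part of the question (replacing the final column with X), a capture group can carry over the leading columns. This is only a sketch: it assumes the column specification consists solely of the letters l, c and r, and it defines its own input so it can be run on its own.

```r
# Input with real backslashes (each one written as "\\" in the R literal)
texText <- "\\begin{tabular}{rl}\\begin{tabular}{rll}"

# \\\\begin     : a literal backslash followed by "begin"
# ([lcr]*)[lcr] : capture all columns except the last one
pattern <- "\\\\begin\\{tabular\\}\\{([lcr]*)[lcr]\\}"

# In the replacement, "\\\\" is a literal backslash and "\\1" is the capture
output <- gsub(pattern, "\\\\begin{tabularx}{\\1X}", texText)

# cat() prints the actual characters rather than the escaped form
cat(output, "\n")
```

Column specifications with other content (for example p{3cm} or vertical bars) would need a broader character class than the one assumed here.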
I have the following scenario:
> print(bob)
[1] "Do not fall in love if you can’t handle pain"
When I try to replace the can't with gsub, it does not work:
> gsub("can't", "can not", bob)
[1] "Do not fall in love if you can’t handle pain"
Yet if I simply replace the object with its content, it works fine:
> gsub("can't", "can not", "Do not fall in love if you can't handle pain")
[1] "Do not fall in love if you can not handle pain"
I'm really baffled as I can't think of any difference between these two that would be causing it to fail:
> summary(bob); summary("Do not fall in love if you can't handle pain")
Length Class Mode
1 character character
Length Class Mode
1 character character
The variable bob was derived from a dataframe, such that:
bob <- dataframe$column[3]
So my only lead is that it may have something to do with the dataframe.
The same thing happens with str_replace. Please let me know if you have any insights as to what may be causing this.
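Some helpful commenters pointed out that the symbols do not match: the string contains a curly apostrophe (’), while the pattern uses a straight one (').
On Windows, the curly apostrophe can be typed by holding Alt and typing 0146 on the numeric keypad.
Otherwise, using "can.t" as the pattern in gsub will match any single character in that position.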
You can use [[:punct:]] to match any punctuation:
gsub("can[[:punct:]]t", "can not", bob)
# [1] "Do not fall in love if you can not handle pain"
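If matching any punctuation is broader than wanted, a narrower option is to list both apostrophe forms explicitly. The sketch below rebuilds the input with the Unicode escape \u2019 (right single quotation mark) so that the difference between the two characters is visible in the code; the variable names are illustrative.

```r
# Reconstruct the input with an explicit curly apostrophe (U+2019)
bob <- "Do not fall in love if you can\u2019t handle pain"

# The straight apostrophe (U+0027) is a different character, so no match
unchanged <- gsub("can't", "can not", bob)
print(identical(unchanged, bob))

# A character class listing both apostrophes matches either form
fixed <- gsub("can['\u2019]t", "can not", bob)
print(fixed)
```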
In the example below, I want to replace the string "lastName:JordanlastName:Jordan" with "lastName:Jordan", i.e. when the pattern repeats I want to keep only the first occurrence. I want to do this for every record. How can I do this in R?
lastName:Portnoy
lastName:JordanlastName:JordanlastName:Jordan
lastName:JordanlastName:JordanlastName:Jordan
lastName:CliffordlastName:CliffordlastName:Clifford
lastName:WalkerlastName:Walker
lastName:Portnoy
# Read in the example data:
x <- unname(unlist(c(read.table(text="lastName:Portnoy
lastName:JordanlastName:JordanlastName:Jordan
lastName:JordanlastName:JordanlastName:Jordan
lastName:CliffordlastName:CliffordlastName:Clifford
lastName:WalkerlastName:Walker
lastName:Portnoy", stringsAsFactors=FALSE))))
# Delete everything after the first occurrence of the pattern:
sub('(?<=[a-z])lastName[A-Za-z:]+', '', x, perl=TRUE)
[1] "lastName:Portnoy" "lastName:Jordan" "lastName:Jordan"
[4] "lastName:Clifford" "lastName:Walker" "lastName:Portnoy"
This replaces every occurrence of "lastName" and the following characters and colons with nothing ('') if and only if there was a letter before it.
Details
sub() has three mandatory arguments: pattern, replacement, and x. I've also used the optional perl=TRUE argument because the pattern I used is a Perl-style regular expression. I've told sub() to look in the character vector x for the pattern '(?<=[a-z])lastName[A-Za-z:]+' and replace it with '', or nothing (equivalent to deleting those characters). The (?<=[a-z]) part of the pattern is called a "look-behind assertion." That means the pattern matches 'lastName[A-Za-z:]+' if and only if it finds a letter immediately preceding that pattern. 'lastName[A-Za-z:]+' looks for the exact characters "lastName" followed immediately by one or more characters in the set of uppercase letters, lowercase letters, and the colon character. It matches everything until it finds a character that is not in that set.
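An alternative worth sketching, under the assumption that each record is nothing but one unit repeated back to back, is a backreference: capture the shortest prefix whose repetition fills the whole string and keep just that prefix. This does not depend on the literal text "lastName".

```r
x <- c("lastName:Portnoy",
       "lastName:JordanlastName:JordanlastName:Jordan",
       "lastName:CliffordlastName:CliffordlastName:Clifford",
       "lastName:WalkerlastName:Walker")

# ^(.+?) : lazily capture the shortest candidate unit
# \\1+$  : require that unit to repeat one or more times up to the end
# Records with no repetition do not match and are returned unchanged.
deduped <- sub("^(.+?)\\1+$", "\\1", x, perl = TRUE)
print(deduped)
```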