Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 5 years ago.
In the example below, I want to replace the string "lastName:JordanlastName:Jordan" with "lastName:Jordan", i.e., when the pattern repeats I want to keep only its first occurrence. I want to do this for every record. How can I do this in R?
lastName:Portnoy
lastName:JordanlastName:JordanlastName:Jordan
lastName:JordanlastName:JordanlastName:Jordan
lastName:CliffordlastName:CliffordlastName:Clifford
lastName:WalkerlastName:Walker
lastName:Portnoy
# Read in the example data:
x <- unname(unlist(c(read.table(text="lastName:Portnoy
lastName:JordanlastName:JordanlastName:Jordan
lastName:JordanlastName:JordanlastName:Jordan
lastName:CliffordlastName:CliffordlastName:Clifford
lastName:WalkerlastName:Walker
lastName:Portnoy", stringsAsFactors=FALSE))))
# Delete everything after the first occurrence of the pattern:
sub('(?<=[a-z])lastName[A-Za-z:]+', '', x, perl=TRUE)
[1] "lastName:Portnoy" "lastName:Jordan" "lastName:Jordan"
[4] "lastName:Clifford" "lastName:Walker" "lastName:Portnoy"
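This deletes the first occurrence of "lastName" that is immediately preceded by a lowercase letter, together with all the letters and colons that follow it (they are replaced with the empty string ''). Because the match is greedy, that single match runs to the end of each record, so every repeat is removed.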
Details
sub() has three mandatory arguments: pattern, replacement, and x. I've also used the optional perl=TRUE argument because the pattern is a Perl-style regular expression (R's default regex engine does not support look-behind). I've told sub() to look in the character vector x for the pattern '(?<=[a-z])lastName[A-Za-z:]+' and replace it with '', the empty string (equivalent to deleting those characters). The (?<=[a-z]) part of the pattern is called a "look-behind assertion." It means the pattern matches 'lastName[A-Za-z:]+' if and only if a lowercase letter immediately precedes it, which is why the "lastName" at the start of each record is left alone. 'lastName[A-Za-z:]+' looks for the exact characters "lastName" followed immediately by one or more characters from the set of uppercase letters, lowercase letters, and the colon. It keeps matching until it finds a character that is not in that set.
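As an alternative sketch (not the method used above), a backreference can collapse the repeats without hard-coding "lastName". It assumes each record is either a single unit or that same unit repeated back to back; the variable name deduped is only illustrative.

```r
# Example records from the question
x <- c("lastName:Portnoy",
       "lastName:JordanlastName:JordanlastName:Jordan",
       "lastName:CliffordlastName:CliffordlastName:Clifford",
       "lastName:WalkerlastName:Walker")

# ^(.+?) lazily captures the shortest prefix such that the rest of the
# string is that same prefix repeated one or more times (\\1+).
# Records with no repetition do not match and are returned unchanged.
deduped <- sub("^(.+?)\\1+$", "\\1", x, perl = TRUE)
print(deduped)
```

The look-behind version is the better fit when the records may carry other trailing text; this one only helps when the whole record is a pure repetition.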
I have the string I got from html_text()
feel sore — болеть feel sore about — страдать; мучиться
But it should be like this
feel sore — болеть
feel sore about — страдать; мучиться
The problem is that rvest doesn't distinguish whitespace from line breaks, but I need to get only the first line, "feel sore — болеть", somehow.
I tried using stringr::str_extract() but failed. What do I do?
UPD: ok I've found out there's html_text2() but is it still possible to use regex?
You can use two negated character classes:
[^—]+: matches any character that is not —, one or more times
[^A-Za-z]+: matches any character that is not an upper- or lower-case letter of the English alphabet, one or more times
Data:
str <- c("feel sore — болеть feel sore about — страдать; мучиться",
"so long — разг. Пока!")
Solution:
library(stringr)
str_extract_all(str, "[^—]+—[^A-Za-z]+")
[[1]]
[1] "feel sore — болеть " "feel sore about — страдать; мучиться"
[[2]]
[1] "so long — разг. Пока!"
To flatten the resulting list into a character vector, use unlist(); to get rid of the trailing whitespace, use trimws().
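A minimal sketch of that clean-up step, written with base R's regmatches() and gregexpr() in place of stringr so it runs without extra packages. The names matches, entries and first_entries are illustrative, not part of the answer above.

```r
# Example data from the answer
str <- c("feel sore — болеть feel sore about — страдать; мучиться",
         "so long — разг. Пока!")

# One list element per input string, holding all matches found in it
matches <- regmatches(str, gregexpr("[^—]+—[^A-Za-z]+", str, perl = TRUE))

# Flatten the list into a character vector and strip surrounding whitespace
entries <- trimws(unlist(matches))

# Keep only the first entry of each input string (the "first line")
first_entries <- trimws(vapply(matches, `[`, character(1), 1))

print(entries)
print(first_entries)
```

Note that the pattern relies on the translation containing no English letters; an entry whose right-hand side includes Latin text would be cut short.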
edit to demonstrate I made an effort to solve this myself
I have the following string:
"This is a question? This is an answer This is another question? This is another answer."
What I want to do is create a regex that will match all the questions so I can remove them. In this case the preceding answer or sentence doesn't always end with a '.' (full stop). So what I am looking for is to match sentences that end with a question mark and start with a capital letter.
What I want to match:
"This is a question?" and "This is another question?"
I work in R so I prefer an answer with stringr, but I'm mostly interested in the regex that I should apply.
I tried the following regex ^[A-Z].+\? but unfortunately it matches the whole string.
This should do the trick in regex: ([A-Z][^A-Z?]*\?)
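A short sketch of how that pattern could be applied in base R, both to extract the questions and to remove them. It assumes, as the pattern does, that a question contains no capital letter other than its first character; the variable names are illustrative.

```r
s <- "This is a question? This is an answer This is another question? This is another answer."

# A capital letter, then anything that is neither a capital nor "?", then "?"
pattern <- "[A-Z][^A-Z?]*\\?"

# Extract every question
questions <- regmatches(s, gregexpr(pattern, s, perl = TRUE))[[1]]

# Remove every question along with the whitespace that follows it
without_questions <- gsub(paste0(pattern, "[[:space:]]*"), "", s, perl = TRUE)

print(questions)
print(without_questions)
```

With stringr, the same pattern string can be passed to str_extract_all() or str_remove_all().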
I'm trying to understand a regular expression someone has written in a call to gsub().
I had never used regular expressions before seeing this code. I have tried to work out how it gets the final result with some googling, but I have hit a wall, so to speak.
gsub('.*(.{2}$)', '\\1',"my big fluffy cat")
This code returns the last two characters of the given string. In the above example it returns "at". This is the expected result, but from my brief foray into regular expressions I don't understand why this code does what it does.
What I understand is that '.*' means look for any character 0 or more times. So it's going to look at the entire string, and this is what will be replaced.
The part in brackets looks for any two characters at the end of the string. It would make more sense to me if this part in brackets were in place of the '\1'. To me it would then read: look at the entire string and replace it with the last two characters of that string.
All that does, though, is output the actual code as the replacement, e.g. ".{2}$".
Finally, I don't understand why '\1' is in the replacement part of the function. To me this is just saying: replace the entire string with a single backslash and the number one. I say a single backslash because it's my understanding that the first backslash is just there to make the second backslash a non-special character.
There are two ways of using gsub. The most common is probably:
gsub("-","TEST","This is a - ")
which would return
[1] "This is a TEST "
This simply finds every match of the regular expression and replaces it with the replacement string.
The second way to use gsub is the one you described, using \\1, \\2, \\3, ...
These refer to the first, second or third capture group in your regular expression.
A capture group is anything inside parentheses, e.g. (capture_group_1)(capture_group_2)...
Explanation
Your analysis is correct.
What i understand is the '.*' means look for any character 0 or more times. So it's going to look at the entire string and this is what will be replaced.
The part in brackets looks for any two characters at the end of the string
The last two characters are placed in a capture group, and we simply replace the whole match (here, the whole string) with the contents of that capture group. So the last two characters are not replaced by anything; they are what is kept.
If it helps, check out the result of this expression:
gsub('(.*)(.{2}$)', 'Group 1: \\1, Group 2: \\2',"my big fluffy cat")
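To make the two calls concrete, here is a small sketch that runs both and also shows why '\\1' is not "a backslash, a backslash and a one". The variable names are illustrative.

```r
s <- "my big fluffy cat"

# The whole string matches; it is replaced by capture group 1 (the last two characters)
last_two <- gsub(".*(.{2}$)", "\\1", s)

# Same idea with both parts captured, so each group can be seen separately
labelled <- gsub("(.*)(.{2}$)", "Group 1: \\1, Group 2: \\2", s)

# In an R string literal "\\" is ONE backslash, so "\\1" is the two characters \1
n_chars <- nchar("\\1")

print(last_two)   # "at"
print(labelled)   # "Group 1: my big fluffy c, Group 2: at"
print(n_chars)    # 2
```

The regex engine therefore receives \1, which it reads as "the text matched by group 1", not as literal characters.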
I hope these examples help you understand it better.
Say we have the string foobarabcabcdef.
.* matches the whole string.
.*abc matches from the beginning through any characters up to the last abc (greedy matching); thus, it matches foobarabcabc.
.*(...)$ matches the whole string as well, but the last 3 chars are grouped. Even without any (), the match has a default group, group 0 (the whole match); each () adds group 1, 2, 3, and so on. Think about .*(...)(...)(...)$, for which we have:
group 0 : whole string
group 1 : "abc" the first "abc"
group 2 : "abc" the 2nd "abc"
group 3 : "def" the last 3 chars
So back to your example: \\1 is a reference to a group. It says "replace the whole match with the text matched by group 1." That is, whatever .{2}$ matched becomes the replacement.
If the backslashes are what confuses you, that is R string syntax rather than regex: inside an R string literal a backslash has to be escaped, so "\\1" is the two-character text \1 by the time it reaches the regex engine. It is all about escaping.
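The group numbering described above can be inspected directly in base R with regexec() and regmatches(); this is only an illustrative sketch, and the variable names are made up for it.

```r
s <- "foobarabcabcdef"

# Element 1 is group 0 (the whole match); elements 2-4 are groups 1-3
groups <- regmatches(s, regexec(".*(...)(...)(...)$", s))[[1]]
print(groups)

# Greedy matching: .*abc runs up to the LAST "abc", so only "def" is left over
rest <- sub(".*abc", "", s)

# A backreference in the replacement picks out one group
third <- gsub(".*(...)(...)(...)$", "\\3", s)
```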
The important part of that regular expression is the brackets, which form something called a "capturing group".
The regular expression .*(.{2}$) says: match anything, and capture the last 2 characters of the line. The replacement \\1 refers to that group, so the whole match is replaced with the captured group, which is the last two characters in this case.
I want to print all the lines of a file in which the second and second-last characters are the same, such as with the following file (the comments to the right are for explanatory purposes; they do not exist in the file):
hello james # second/second-last are 'ee' - match
how are you? # 'ou'
are you okay? # 'ry'
Is it past # 'ss' - match
Then the output should be
hello james
Is it past
How would I go about doing this?
You can use grep with grouping and backreference for this, e.g.:
grep -x ".\(.\).*\1." f1.txt
This pattern looks in given order for:
any character: .
another arbitrary character in a capture group: \(.\)
any number (including 0) of characters: .*
the same character previously captured (the backreference): \1
finally, the last arbitrary character: .
-x means it has to match the whole line rather than just some portion of it (same as using --line-regexp). As a result only the matched lines will be printed.
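For readers following along in R rather than the shell, the same backreference idea can be sketched with grepl(). This is a translation for illustration, not part of the grep answer; the explicit ^ and $ anchors play the role of -x, and the lines vector stands in for the file contents.

```r
# Stand-in for the file contents (without the explanatory comments)
lines <- c("hello james", "how are you?", "are you okay?", "Is it past")

# ^.      first character
# (.)     second character, captured as group 1
# .*      anything in between
# \\1     the captured character again, in second-last position
# .$      last character
keep <- grepl("^.(.).*\\1.$", lines)
matched <- lines[keep]
print(matched)
```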
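Here is an awk solution that compares the second and second-last characters (splitting on an empty string into single characters works in gawk and mawk, but is not guaranteed by POSIX):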
awk '{b=split($0,a,"")} a[2]==a[b-1]' file
hello james
Is it past
If there are spaces or tabs at the end of the line, they can be trimmed like this ($1=$1 rebuilds the record, which also collapses runs of whitespace inside the printed line):
awk '{$1=$1;b=split($0,a,"")} a[2]==a[b-1]' file
hello james
Is it past
I have a lot of txt files in a folder; I make a list of them and then put them all together. So far so good. Let's say I have file names like "1a", "1b", "2a", "3b", etc. I take a column from each file and build a data frame at the end.
What I cannot do now is use the file names as the column names of my final data frame. Let's say I take a column from "1a"; I want to name it 1a in my final data frame.
Is there any way to do it?
Here are the names:
> head(filelist)
[1] "./1a.txt" "./1b.txt" "./2a.txt" "./2b.txt" "./3a.txt" "./3b.txt"
You probably don't want your names to begin with numbers. Here is what I would suggest:
# Example vector of file names:
myFiles <- c("./1a.txt", "./1b.txt", "./2a.txt",
             "./2b.txt", "./3a.txt", "./3b.txt")
# In practice, get the vector of file names from disk instead:
# myFiles <- list.files(<filePath>)
# Strip the leading "./" and the ".txt" extension, then paste "file." in front:
myFiles <- paste0("file.", gsub("\\./(.*)\\.txt$", "\\1", myFiles))
# Add names to your data.frame columns (df is your existing data frame):
names(df) <- myFiles
The regular expression "\./(.*)\.txt$" (written with doubled backslashes inside an R string) can be broken down as follows:
"\." tells the regex engine to match a literal dot. In regex, "." by itself is the useful, yet dangerous, "match any character."
"/" and "txt" are literals: match those characters.
"$" is an anchor that forces the match to the end of the string.
"()" is a pair of capturing parentheses: it tells the engine to save that piece for later.
".*" within the parentheses says match anything in between the adjacent subexpressions ("\./" and "\.txt$").
"\1" in the replacement says to return the bit of text saved by the capturing parentheses.
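The breakdown above can be checked step by step with a small sketch. The three-column data frame is made up purely for illustration, and the tools::file_path_sans_ext() line is an alternative to the regex, not something the answer above uses.

```r
# Three of the file names from the question
myFiles <- c("./1a.txt", "./1b.txt", "./2a.txt")

# Keep only what the capturing parentheses saved: the part between "./" and ".txt"
stems <- gsub("\\./(.*)\\.txt$", "\\1", myFiles)

# Prefix so the names do not start with a digit
col_names <- paste0("file.", stems)

# Why the prefix helps: R would otherwise rewrite such names to be syntactically valid
auto_names <- make.names(stems)

# Regex-free alternative for getting the stems
stems_alt <- tools::file_path_sans_ext(basename(myFiles))

# Hypothetical data frame with one column per file
df <- data.frame(a = 1:2, b = 3:4, c = 5:6)
names(df) <- col_names
print(df)
```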
For more on the wonderful world of regular expressions, take a look here. Also, this site, which is linked in the SO link is where I learned much of what I use.
You will have to make sure that the orders of the names and the order of the columns match, but from your description, it sounds like you have this already.
If the list that contains the files is a named list, it should be even easier:
names(df) <- paste0("file.", names(fileList))