I'd like to replace the content of a column in a data frame with only a specific word in that column.
The column always looks like this:
Place(fullName='Würzburg, Germany', name='Würzburg', type='city', country='Germany', countryCode='DE')
Place(fullName='Iphofen, Deutschland', name='Iphofen', type='city', country='Germany', countryCode='DE')
I'd like to extract the city name (in this case Würzburg or Iphofen) into a new column, or replace the entire row with the name of the town. There are many different towns so having a gsub-command for every city name will be tough.
Is there a way to maybe just use a gsub and tell Rstudio to replace whatever it finds inside the first two ' '?
Might it be possible to tell it, "give me the word after "name=' until the next '?
I'm very new to using R so I'm kind of out of ideas.
Thanks a lot for any help!
I know of the gsub command, but I don't think it will be the most appropriate in this case.
Yes, with a regular expression you can do exactly that:
string <- "Place(fullName='Würzburg, Germany', name='Würzburg', type='city', country='Germany', countryCode='DE')"
city <- gsub(".*name='(.*?)'.*", "\\1", string)
The regular expression says "match any characters followed by name=', then capture any characters until the next ' and then match any additional characters". Then you replace all of that with just the captured characters ("\\1").
The parentheses mean "capture this part", and the value becomes "\\1". (You can do multiple captures, with subsequent captures being \\2, \\3, etc.
Note the question mark in (.*?). This means "match as little as possible while still satisfying the rest of the regex". If you don't include the question mark, the regular expression will match "greedily" and you will capture the entire rest of the line instead of just the city since that would also satisfy the regular expression.
More about regular expression (specific to R) can be found here
Related
I want to use a function to go threw strings and only replaces the last two characters if they match certain criteria. How can I do that?
I tried
cleantable3 = gsub('.{2}$', '', cleantable2)
but then it always deletes all the last two. Lets say I only want those replaced that contain " D| E"
Thank you all!
Your regex expression isn't good '.{2}$' means it will match any(.) 2 characters at the end of string($).
You haven't defined the question very well, but in your case I think this is the regex expression you need '(D.$)|(E.$)|(.D$)|(.E$)'. So this is the desired code.
cleantable3 = gsub('(D.$)|(E.$)|(.D$)|(.E$)', '', cleantable2)
This question already has an answer here:
Reference - What does this regex mean?
(1 answer)
Closed 4 years ago.
I'm trying to understand a regular expression someone has written in the gsub() function.
I've never used regular expressions before seeing this code, and i have tried to work out how it's getting the final result with some googling, but i have hit a wall so to speak.
gsub('.*(.{2}$)', '\\1',"my big fluffy cat")
This code returns the last two characters in the given string. In the above example it would return "at". This is the expected result but from my brief foray into regular expressions i don't understand why this code does what it does.
What i understand is the '.*' means look for any character 0 or more times. So it's going to look at the entire string and this is what will be replaced.
The part in brackets looks for any two characters at the end of the string. It would make more sense to me if this part in brackets was in place of the '\1'. To me it would then read look at the entire string and replace it with the last two characters of that string.
All that does though is output the actual code as the replacement e.g ".{2}$".
Finally i don't understand why '\1' is in the replace part of the function. To me this is just saying replace the entire string with a single backslash and the number one. I say a single backslash because it's my understanding the first backslash is just there to make the second backslash a none special character.
For gsub there are two ways of using the function. The most common way is probably.
gsub("-","TEST","This is a - ")
which would return
This is a TEST
What this does is simply finds the matches in the regular expression and replaces it with the replacement string.
The second way to use gsub is the method in which you described. using \\1, \\2 or \\3...
What this does is looks at the first, second or third capture group in your regular expression.
A capture group is defined by anything inside the circular brackets ex: (capture_group_1)(capture_group_2)...
Explanation
Your analysis is correct.
What i understand is the '.*' means look for any character 0 or more times. So it's going to look at the entire string and this is what will be replaced.
The part in brackets looks for any two characters at the end of the string
The last two characters are placed in a capture group and we are simply replace the whole string with this capture group. Not replacing them with anything.
if it helps, check out the result of this expression.
gsub('(.*)(.{2}$)', 'Group 1: \\1, Group 2: \\2',"my big fluffy cat")
hope the examples can help you to understand it better:
Say we have a string foobarabcabcdef
.* matches whole string.
.*abc it matches: from the beginning matches any chars till the last abc (greedy matching), thus, it matches foobarabcabc
.*(...)$ matches the whole string as well, however, the last 3 chars were groupped. Without the () , the matched string will have a default group, group0, the () will be group1, 2, 3.... think about .*(...)(...)(...)$ so we have:
group 0 : whole string
group 1 : "abc" the first "abc"
group 2 : "abc" the 2nd "abc"
group 3 : "def" the last 3 chars
So back to your example, the \\1 is a reference to group. What it does is: "replace the whole string by the matched text in group1" That is, the .{2}$ part is the replacement.
If you don't understand the backslashs, you have to reference the syntax of r, I cannot tell more. It is all about escaping.
Important part of that regular expression are brackets, that's something called "capturing group".
Regular expression .*(.{2}$) says - match anything and capture last 2 characters at the line. Replacement \\1 is referencing to that group, so it will replace whole match with captured group, which are last two characters in this case.
I have a question.
My text file contains lines such as:
1.1 Description.
This is the description.
1.1.1 Quality Assurance
Random sentence.
1.6.1 Quality Control. Quality Control is the responsibility of the contractor.
I'm trying to find out how to get:
1.1 Description
1.1.1 Quality Assurance
1.6.1 Quality Control
Right now, I have:
txt1 <- readLines("text1.txt")
txt2<-grep("^[0-9.]+", txt1, value = TRUE)
file<-write(txt2, "text3.txt")
which results in:
1.1 Description.
1.1.1 Quality Assurance
1.6.1 Quality Control. Quality Control is the responsibility of the contractor.
You are using grep with value=TRUE, which
returns a character vector containing the selected elements of x
(after coercion, preserving names but no other attributes).
This means, that if your regular expression matches anything in the line, the all line will be returned. You managed to build your regular expression to match numbers in the begining of the line. So all the lines which begin with numbers get selected.
It seems that your goal is not to select the all line, but to select only until there is a line break or a period.
So, you need to adjust the regular expression to be more specific, and you need to extract only the matching portion of the line.
A regular expression that matches what you want can be:
"^([0-9]\\.?)+ .+?(\\.|$)"
It selects numbers with dots, followed by a space, followed by anything, and stops matching things when a . comes or the line ends. I recommend the following website to better understand what the regex does: https://regexr.com/
The next step is extracting from the given lines only the matching portion, and not the all line where the regex has a match. For this we'll use the function regexpr, which tells us where the matches are, and the function regmatches, which helps us extract those matches:
txt1 <- readLines("text.txt")
regmatches(txt1, regexpr("^([0-9]\\.?)+ .+?(\\.|$)", txt1))
I am trying to figure out how I can put together a find and replace command with wildcards or figure out a way to find and replace the following example:
I would like to find terms that contain double quotes in front of them with a single quote at the end:
Example:
find "joe' and replace with 'joe'
Basically, I'm trying to find all terms with terms having "in front and at the end.'
Check the [x] Regular expression checkbox in textpad's replace dialog and enter the following values:
Find what:
"([^'"]*)'
Replace with:
'\1'
Explanation:
In a regular expression, square brackets are used to indicate character classes. A character class beginning with a caret will match anything not in the class.
Thus [^'"] will match any character except ' and ". The following * indicates that any number of these characters can follow. The ( and ) mark a group. And the group we're looking for starts with " and ends with '. Finally in the replace string we can refer to any group via \n where n is the nth group. In our case it is the first and only group and that is why we used \1.
I need to find attribute values in an ASPX file using regular expressions.
That means you don't need to worry about malformed HTML or any HTML related issues.
I need to find the value of a particular attribute (LocText). I want to get what's inside the quotes.
Any ASPX tags such as <%=, <%#, <%$ etc. inside the value don't make sense for this attribute therefore are considered as part of it.
The regex I began with looks like this:
LocText="([^"]+)"
This works great, the first group, which is the result text, gets everything except the double quotes, which are not allowed there (" ; must be used instead)
But the ASPX file allows using of single quotes - second regular expression must be applied then.
LocText='([^']+)'
I could use these two regular expressions but I'm looking for a way to connect them.
LocText=("([^"]+)"|'([^']+)')
This also works but doesn't seem very efficient as it's creating unnecessary number of groups. I think this could be somehow done by using backreferences, but I can't get it to work.
LocText=(["']{1})([^\1]+)\1
I thought that by this, I save the single/double quote to the first group and then I tell it to read anything that is NOT the char found in the first group. This is enclosed again by the quote from the first group. Obviously, I'm wrong and it's not working like that.
Is there any way, how to connect the first two expressions together creating just a minimum amount of groups with one group being the value of the attribute I want to get? Is it possible using a backreference for the single/double quote value, or have I completely misunderstood the meaning of them?
I'd say your solution with alternation isn't that bad, but you could use named captures so the result will always be found in the same group's value:
Regex regexObj = new Regex(#"LocText=(?:""(?<attr>[^""]+)""|'(?<attr>[^']+)')");
resultString = regexObj.Match(subjectString).Groups["attr"].Value;
Explanation:
LocText= # Match LocText=
(?: # Either match
"(?<attr>[^"]+)" # "...", capture in named group <attr>
| # or match
'(?<attr>[^']+)' # '...', also capture in named group <attr>
) # End of alternation
Another option would be to use lookahead assertions ([^\1] isn't working because you can't place backreferences inside a character class, but you can use them in lookarounds):
Regex regexObj = new Regex(#"LocText=([""'])((?:(?!\1).)*)\1");
resultString = regexObj.Match(subjectString).Groups[2].Value;
Explanation:
LocText= # Match LocText=
(["']) # Match and capture (group 1) " or '
( # Match and capture (group 2)...
(?: # Try to match...
(?!\1) # (unless it's the quote character we matched before)
. # any character
)* # repeat any number of times
) # End of capturing group 2
\1 # Match the previous quote character