String manipulation in R data frames

I am new to R and am trying to clean the Amount_USD column of a table using string manipulation, with the code given below. The values were not changed and I could not work out why. Please help.
Code:
csv_file2$Amount_USD <- ifelse(str_sub(csv_file$Amount_USD,1,10) == "\\\xc2\\\xa0",
str_sub(csv_file$Amount_USD,12,-1),csv_file2$Amount_USD)
Result:
\\xc2\\xa010,000,000
\\xc2\\xa016,200,000
\\xc2\\xa019,350,000
Expected Result:
10,000,000
16,200,000
19,350,000

You could use the following code, but maybe there is a more compact way:
vec <- c("\\xc2\\xa010,000,000", "\\xc2\\xa016,200,000", "\\xc2\\xa019,350,000")
gsub("(\\\\x[[:alpha:]]\\d\\\\x[[:alpha:]]0)([0-9,]*)", "\\2", vec)
[1] "10,000,000" "16,200,000" "19,350,000"

A compact way to extract the numbers is by using str_extract and negative lookahead:
library(stringr)
str_extract(vec, "(?!0)[\\d,]+$")
[1] "10,000,000" "16,200,000" "19,350,000"
How this works:
(?!0): a negative lookahead that makes sure the match does not start with 0; without it, the match would begin at the 0 that ends the prefix
[\\d,]+$: a character class allowing only digits and commas to occur one or more times right up to the string end $
Alternatively:
str_sub(vec, start = 9)
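To see what the lookahead contributes, here is a small base-R sketch. regexpr() and regmatches() with perl = TRUE stand in for str_extract (my substitution, not part of the answer above), and first_match is a hypothetical helper name. It assumes, as in vec, that the prefix is the literal text \xc2\xa0 and therefore ends in a 0:

```r
vec <- c("\\xc2\\xa010,000,000", "\\xc2\\xa016,200,000", "\\xc2\\xa019,350,000")

# Hypothetical helper: return the first match of `pattern` in each string.
first_match <- function(pattern, x) {
  regmatches(x, regexpr(pattern, x, perl = TRUE))
}

# With the lookahead, the match cannot start at the 0 that ends the prefix.
with_lookahead <- first_match("(?!0)[\\d,]+$", vec)

# Without it, that trailing 0 is swallowed into the result.
without_lookahead <- first_match("[\\d,]+$", vec)

print(with_lookahead)
print(without_lookahead)
```

The first call should give the clean amounts, while the second keeps an extra leading 0 from the prefix.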

There were a few minor issues with your code.
The main one is two unneeded backslashes in the string you compare against. That also leads to a counting error in your first str_sub(): the prefix is 8 characters long, not 10. Finally, the substring you keep should start at the character right after the prefix (i.e. position 9, not 12). The following should work:
csv_file2$Amount_USD <- ifelse(str_sub(csv_file$Amount_USD,1,8) == "\\xc2\\xa0", str_sub(csv_file$Amount_USD,9,-1),csv_file2$Amount_USD)
However, I would have done this with a more compact gsub than provided above. As long as the text at the start to remove is always going to be "\\xc2\\xa0", you can simply replace it with nothing. Note that for gsub you will need to escape all the backslashes, and hence you end up with:
csv_file2$Amount_USD <- gsub("\\\\xc2\\\\xa0", replacement = "", csv_file2$Amount_USD)
Personally, especially if you plan to do any sort of mathematics with this column, I would go the additional step and remove the commas, and then coerce the column to be numeric:
csv_file2$Amount_USD <- as.numeric(gsub("(\\\\xc2\\\\xa0)|,", replacement = "", csv_file2$Amount_USD))
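As a rough end-to-end check of the steps above, here is a sketch on a made-up data frame. The data are my own stand-in for the asker's file, and I assume the prefix is the literal eight-character text \xc2\xa0 rather than a real non-breaking space. Using fixed = TRUE is a variation on the gsub above that avoids having to double the backslashes:

```r
# Hypothetical stand-in for the asker's data; the third value has no prefix.
csv_file2 <- data.frame(
  Amount_USD = c("\\xc2\\xa010,000,000", "\\xc2\\xa016,200,000", "19,350,000"),
  stringsAsFactors = FALSE
)

prefix <- "\\xc2\\xa0"   # backslash, x, c, 2, backslash, x, a, 0
print(nchar(prefix))     # the prefix is 8 characters long

# fixed = TRUE matches the prefix literally, so no regex escaping is needed.
cleaned <- sub(prefix, "", csv_file2$Amount_USD, fixed = TRUE)

# Drop the thousands separators, then coerce to numeric.
csv_file2$Amount_USD <- as.numeric(gsub(",", "", cleaned, fixed = TRUE))
print(csv_file2$Amount_USD)
```

Values without the prefix pass through the sub() step unchanged, so the same two lines can be run on a column that is only partly affected.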

Related

Using regular expression, how can I add elements after I find a match in r?

I have a column of numeric strings that are either 10 or 11 characters long. Here are some example values from the column:
column=c("5699420001","00409226602")
How can I place a hyphen after the first four digits (in the strings with 10 characters) or after the first five digits (in the strings with 11 characters), and then another hyphen after the next four digits for both lengths? The desired output is shown below. I would like to use stringr for this.
column_standard=c("5699-4200-01","00409-2266-02")
Here's a solution using capture groups with stringr's str_replace() function:
library(stringr)
column <- c("5699420001","00409226602")
column_standard <- sapply(column, function(x){
  # 11-character strings split 5-4-rest; all others split 4-4-rest
  ifelse(nchar(x) == 11,
         stringr::str_replace(x, "^([0-9]{5})([0-9]{4})(.*)", "\\1-\\2-\\3"),
         stringr::str_replace(x, "^([0-9]{4})([0-9]{4})(.*)", "\\1-\\2-\\3"))
})
column_standard
# 5699420001 00409226602
# "5699-4200-01" "00409-2266-02"
Each pattern captures three groups: the leading five (or four) digits, the next four digits, and whatever remains. The replacement writes the groups back with hyphens between them.
Try using this as your expression:
\b(\d{4,5})(\d{4})(\d{2}\b)
It sets up three capture groups that you can later use in your replacement to easily add hyphens between them.
Then you just replace with:
\1-\2-\3
Thanks to @Dunois for pointing out how it would look in code:
column_standard <- sapply(column, function(x) stringr::str_replace(x, "^(\\d{4,5})(\\d{4})(\\d{2})", "\\1-\\2-\\3"))
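Since the regex functions are vectorised, the sapply() wrapper is not strictly needed. A minimal base-R sketch of the same three-group idea follows. The trailing $ anchor is my addition, and it assumes the column only holds 10- or 11-digit strings:

```r
column <- c("5699420001", "00409226602")

# Anchoring both ends forces a 4-4-2 split for 10 digits and 5-4-2 for 11.
column_standard <- sub("^(\\d{4,5})(\\d{4})(\\d{2})$", "\\1-\\2-\\3",
                       column, perl = TRUE)
print(column_standard)
```

With both anchors in place, strings of any other length do not match and are returned unchanged.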

Ignore last "/" in R regex

Given the string "http://compras.dados.gov.br/materiais/v1/materiais.html?pdm=08275/", I need to generate a regex filter so that it ignores the last char if it is an "/" .
I tried the following regex "(http:////)?compras\\.dados\\.gov\\.br.*\\?.*(?<!//)" (see regexr.com/4om61), but it doesn't work when I run it in R as:
regex_exp_R <- "(http:////)?compras\\.dados\\.gov\\.br.*\\?.*(?<!//)"
grep(regex_exp_R, "http://compras.dados.gov.br/materiais/v1/materiais.html?pdm=08275/", perl = T, value = T)
I need this to work in pure regex and grep function, without using any string R package.
Thank you.
Simplified Case:
After everyone's helpful contributions, one last issue remains.
Because I will pass the regex as an input to another function, the solution must work with pure regex and grep.
The remaining point is a very basic one: given the strings "a1bc/" or "a1bc", the regex must return "a1bc". Building on the suggestions I received, I tried
grep(".*[^//]" ,"a1bc/", perl = T, value = T), but I still get "a1bc/" instead of "a1bc". Any hints? Thank you.
If you want to return the string without the last / you can do this several ways. Below are a couple options using base R:
Using a back-reference in gsub() (sub() would work too here):
gsub("(.*?)/*$", "\\1", x)
[1] "http://compras.dados.gov.br/materiais/v1/materiais.html?pdm=08275"
# or, adapting your original pattern
gsub("((http:////)?compras\\.dados\\.gov\\.br.*\\?.*?)/*$", "\\1", x)
[1] "http://compras.dados.gov.br/materiais/v1/materiais.html?pdm=08275"
By position using ifelse() and substr() (this will probably be a little faster if scaling matters):
ifelse(substr(x, nchar(x), nchar(x)) == "/", substr(x, 1, nchar(x)-1), x)
[1] "http://compras.dados.gov.br/materiais/v1/materiais.html?pdm=08275"
Data:
x <- "http://compras.dados.gov.br/materiais/v1/materiais.html?pdm=08275/"
Use sub to remove a trailing /:
x <- c("a1bc/", "a2bc")
sub("/$", "", x)
This changes nothing on a string that does not end in /.
As others have pointed out, grep does not modify strings. It returns a numeric vector of indices of the matched strings or a vector of the (unmodified) matched items. It's usually used to subset a character vector.
You can use a negative look-behind at the end to ensure it doesn't end with the character you don't want (in this case, a /). The regex would then be:
.+(?<!\/)
You can view it here with your three input examples: https://regex101.com/r/XB9f7K/1/. If you only want it to match urls, then you would change the .+ part at the beginning to your url regex.
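In R, the look-behind needs perl = TRUE, and because grep() only selects elements, something like regexpr() with regmatches() is needed to pull out the matched text itself. A small sketch of the difference, with variable names of my own choosing:

```r
x <- c("a1bc/", "a1bc",
       "http://compras.dados.gov.br/materiais/v1/materiais.html?pdm=08275/")
pattern <- ".+(?<!/)"

# grep() only selects elements; the strings come back unmodified.
selected <- grep(pattern, x, perl = TRUE, value = TRUE)

# regexpr() + regmatches() return just the matched part of each string.
trimmed <- regmatches(x, regexpr(pattern, x, perl = TRUE))

print(selected)
print(trimmed)
```

One caveat: regmatches() silently drops elements with no match (for example a string consisting only of "/"), so the result can be shorter than the input.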
How about trying gsub("(.*?)/+$","\\1",s)?

str_extract expressions in R

I would like to convert this:
AIR-GEN-SUM-UD-ELA-NH-COMBINED-3-SEG1
to this:
ELA-3
I tried this function:
str_extract(.,pattern = ":?(ELA).*(\\d\\-)"))
it printed this:
"ELA-NH-COMBINED-3-"
I need to get rid of the text or anything between the two extracts. The number will be a number between 3 and 9. How should I modify my expression in pattern =?
Thanks!
1) Match everything up to -ELA, then anything (.*) up to a -, then the captured digits (\\d+), then another - and whatever follows. Replace the whole match with ELA- followed by the captured digits. No packages are used.
x <- "AIR-GEN-SUM-UD-ELA-NH-COMBINED-3-SEG1"
sub(".*-ELA.*-(\\d+)-.*", "ELA-\\1", x)
## [1] "ELA-3"
2) If there is only one numeric field, another approach is to split the string into fields, grep out the numeric one and prefix it with ELA-. No packages are used.
s <- scan(text = x, what = "", quiet = TRUE, sep = "-")
paste("ELA", grep("^\\d+$", s, value = TRUE), sep = "-")
## [1] "ELA-3"
TL;DR
You can't do that with a single call to str_extract: one match is always a contiguous span of text, so it cannot cover two parts that are separated by other text.
Work-arounds/Solutions
There are two solutions:
Capture parts of text you need and then join them (2 operations: match + join)
Capture parts of text you need and then replace with backreferences to the groups needed (1 replace operation)
Capturing groups only keep parts of text you match in separate memory buffers, but you also need a method or function that is capable of accessing these chunks.
Here, in R, str_extract drops them, but str_match keeps them in the result.
library(stringr)
s <- "AIR-GEN-SUM-UD-ELA-NH-COMBINED-3-SEG1"
m <- str_match(s, ":?(ELA).*-(\\d+)")  # column 1: full match, columns 2 and 3: the groups
paste0(m[,2], "-", m[,3])
This prints "ELA-3".
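The same match-then-join idea can be sketched without stringr, using regexec() and regmatches() as a base-R counterpart to str_match. That substitution is mine, and the second input string is made up to show that it works on a vector:

```r
x <- c("AIR-GEN-SUM-UD-ELA-NH-COMBINED-3-SEG1",
       "AIR-GEN-SUM-UD-ELA-NH-COMBINED-7-SEG2")  # second value is made up

# regexec() keeps the capture groups: each list element holds the full
# match followed by group 1 ("ELA") and group 2 (the digits).
m <- regmatches(x, regexec("(ELA).*-(\\d+)", x, perl = TRUE))

# Join the two groups with a hyphen.
result <- vapply(m, function(g) paste(g[2], g[3], sep = "-"), character(1))
print(result)
```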
Another way is to replace while capturing the parts you need to keep and then using backreferences to those parts in the replacement pattern:
x <- "AIR-GEN-SUM-UD-ELA-NH-COMBINED-3-SEG1"
sub("^.*-ELA.*?-([^-]+)-[^-]+$", "ELA-\\1", x)

R: How to count gaps at the beginning of a sequence alignment?

I am analyzing an alignment of amino acid sequences using R and need a reproducible way to figure out where the start is for each sequence. My alignment can be read in as a data frame. Here is a sample of 3.
alignment <- data.frame("Strains" = c("Strain.1", "Strain.2", "Strain.3"),
"Sequence" = c("MASLIYRQLLTNSYTVNLSDEIQNIGSAKSQDVTINPGPFAQTGYAPVNWGAGETNDSTTVEPLLDGPYQPTTFNPPTSYWILLAPTAEGVVIQGTNNTDRWLATILIEPNVQATNRTYNLFGQQETLLVENTSQTQWKFVDVSKTTSTGSYTQHGPLFSTPKLYAVMKFSGKIYTYNGTTPNAA-TGY-YSTTSYDTVNMTSSCDFYIIPRSQEGKCTEYINYGLPPIQNTRNVVPVALSAREIVHTRAQVNEDIVVSKTSLWKEMQYNRDITIRFKFDRTIIKAGGLGYKWSEISFKPITYQYTYTRDGEQITAHTTCSVNGVNNFSYNGGSL---------------------",
"MASLIYRQLLTNSYTVNLSDEIQNIGSAKSQDVTINPGPFAQTGYAPVNWGAGETNDSTTVEPLLDGPYQPTTFNPPTSYWILLAPTAEGVVIQGTNNTDRWLATILIEPNVQATNRTYNLFGQQETLLVENTSQTQWKFVDVSKTTSTGSYTQHGPLFSTPKLYAVMKFSGKIYTYNGTTPNAA-TGY-YSTTSYDTVNMTSSCDFYIIPRSQEGKCTEYINYGLPPIQNTRNVVPVALSAREIVHTRAQVNEDIVVSKTSLWKEMQYNRDITIRFKFDRTIIKAGGLGYKWSEISFKPITYQYTYTRDGEQITAHTTCSVNGVNNFSYNGGSLPTDFAIS--------------",
"-----------------------NIGSAKSQDVTINPGPFAQTGYAPVNWGAGETNDSTTVEPLLDGPYQPTTFNPPTSYWILLAPTVEGVVIQGTNNVDRWLATILIEPNVQATNRTYNLFGQQEILLIENTSQTQWKFVDVSKTTPTGSYTQHGPLFSTPKLYAVMKFSGKIYTYNGTTPNVT-TGY-YSTTNYDTVNMT-----------------------------------------------------"))
Each of the dashes represents a space. What I want to do is read through my data frame and count how many spaces are at the beginning of each sequence. So far I've tried using the str_count function. For example:
alignment$shift <- str_count(alignment$Sequence, "-")
but this fails me when I have gaps downstream in my sequence. Really I'm only interested in the gaps that occur at the beginning of the sequences.
I came across a regex approach in a post that almost perfectly matches my problem (How to count the number of hyphens at the beginning of a string in javascript?), but it is in JavaScript and I'm not sure how to translate it to R.
My questions are:
1) Is it possible to have str_count stop looking for "-" characters once it reaches a non-"-" character?
2) Is there a way to use regex or a similar function in R that outputs the length of a character match at the beginning of a string?
You could do this...
alignment$Sequence <- as.character(alignment$Sequence) #in case they are factors (as above)
alignment$shift <- nchar(alignment$Sequence) - nchar(gsub("^-+", "", alignment$Sequence))
alignment$shift
[1] 0 0 23
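It counts the characters removed when gsub deletes a run of one or more gap characters (-+) anchored at the start of the string (^). Because the pattern is anchored, sub would work equally well, and you could use str_replace instead of gsub.
Maybe this might help? It'll return the position index of the start and end of the "---" string only if it begins at the start of the string. Because the pattern also matches the first letter after the gaps, the number of leading gaps is end - 1 (here 23).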
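On the second question, regexpr() reports the length of the match directly, so nothing has to be deleted and recounted. A minimal sketch on toy sequences (made up for illustration, not the alignment above):

```r
# Toy sequences (made up for illustration); "-" marks an alignment gap.
seqs <- c("MASL-IYR--", "---NIG-SAK", "-M-")

# regexpr() reports the length of the first match; -1 means no match.
m <- regexpr("^-+", seqs)

# Treat "no match" as zero leading gaps.
shift <- pmax(attr(m, "match.length"), 0L)
print(shift)
```

Gaps further downstream are ignored because the pattern is anchored with ^.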
library(stringr)
str_locate_all(string = alignment$Sequence, pattern = "^-{1,}[A-Z]")
[[1]]
     start end

[[2]]
     start end

[[3]]
     start end
[1,]     1  24

selective removal of characters following a pattern using R

How to selectively remove characters from a string following a pattern?
I wish to remove the 7 figures and the preceding colon.
For example:
"((Northern_b:0.005926,Tropical_b:0.000000)N19:0.002950"
should become
"((Northern_b,Tropical_b)N19"
x <- "((Northern_b:0.005926,Tropical_b:0.000000)N19:0.002950"
gsub("[:]\\d{1}[.]\\d{6}", "", x)
The gsub function does string replacement and replaces all matches found in the string (see ?gsub). An alternative, if you want something with a friendlier name, is str_replace_all from the stringr package.
The regular expression uses \\d{n}, which matches exactly n consecutive digits. So \\d{1} matches a single digit and \\d{6} matches six digits in a row.
gsub('[:][0-9.]+','',x)
[1] "((Northern_b,Tropical_b)N19"
Another approach to solve this problem
library(stringr)
str1 <- c("((Northern_b:0.005926,Tropical_b:0.000000)N19:0.002950")
str_replace_all(str1, regex("(:\\d{1,}\\.\\d{1,})"), "")
#[1] "((Northern_b,Tropical_b)N19"
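The answers differ in how strict they are about the number format. A small sketch on a made-up string with branch lengths of varying width (x2 is hypothetical, not from the question) shows the difference between the fixed-width and the more general pattern:

```r
# Made-up tree string with branch lengths of differing widths.
x2 <- "(A:0.1,B:12.345678)N1:0.002950"

# Fixed width: only strips a colon followed by d.dddddd.
fixed_width <- gsub("[:]\\d{1}[.]\\d{6}", "", x2)

# Flexible: strips a colon followed by any run of digits and dots.
flexible <- gsub(":[0-9.]+", "", x2)

print(fixed_width)
print(flexible)
```

If every branch length is guaranteed to have one digit before the point and six after, both patterns behave the same. Otherwise the flexible one is the safer choice.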
