How to split strings separated by many semicolons in R?

I want to count the pieces of a text that are separated by a ; coming directly after a number. In the text named txt below, I don't want to consider the first two semicolons. Only the ; that come after 6 and 5 (in 805-806; and 123-125;) should be considered. In other words, the code should look behind each ; for a digit to decide whether it counts.
library(stringr)
txt <- "A;B; dd (2020) text pp. 805-806; Mining; exercise (1999), ee, p-123-125; F;G;H text, (2017) kk"
lengths(strsplit(txt,";")) gives me 8. In my case, however, it should be 3. Any help is highly appreciated.

We can use a regex lookbehind ((?<=[0-9])) to match only a ; that directly follows a digit, and then get the lengths
lengths(strsplit(txt, "(?<=[0-9]);", perl = TRUE))
#[1] 3
Or using str_count
library(stringr)
str_count(txt, '[0-9];') + 1
#[1] 3
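To see which pieces the lookbehind split actually produces, here is a minimal base-R sketch. The names `pieces` and `n_pieces` are illustrative and not from the original answer; `trimws()` is added only to drop the space left after each separator.

```r
txt <- "A;B; dd (2020) text pp. 805-806; Mining; exercise (1999), ee, p-123-125; F;G;H text, (2017) kk"

# Split only at semicolons that directly follow a digit (PCRE lookbehind),
# then trim the leading space left behind on each piece.
pieces <- trimws(strsplit(txt, "(?<=[0-9]);", perl = TRUE)[[1]])
pieces
# [1] "A;B; dd (2020) text pp. 805-806"
# [2] "Mining; exercise (1999), ee, p-123-125"
# [3] "F;G;H text, (2017) kk"

n_pieces <- length(pieces)
n_pieces
# [1] 3
```

The semicolons after A, B, Mining, F and G stay inside their pieces because none of them is preceded by a digit.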

Related

String Manipulation in R data frames

I just learnt R and was trying to clean the Amount_USD column of a table for analysis using string manipulation, with the code given below. I could not work out why no changes were made. Please help.
Code:
csv_file2$Amount_USD <- ifelse(str_sub(csv_file$Amount_USD,1,10) == "\\\xc2\\\xa0",
str_sub(csv_file$Amount_USD,12,-1),csv_file2$Amount_USD)
Result:
\\xc2\\xa010,000,000
\\xc2\\xa016,200,000
\\xc2\\xa019,350,000
Expected Result:
10,000,000
16,200,000
19,350,000
You could use the following code, but maybe there is a more compact way:
vec <- c("\\xc2\\xa010,000,000", "\\xc2\\xa016,200,000", "\\xc2\\xa019,350,000")
gsub("(\\\\x[[:alpha:]]\\d\\\\x[[:alpha:]]0)([d,]*)", "\\2", vec)
[1] "10,000,000" "16,200,000" "19,350,000"
A compact way to extract the numbers is by using str_extract and negative lookahead:
library(stringr)
str_extract(vec, "(?!0)[\\d,]+$")
[1] "10,000,000" "16,200,000" "19,350,000"
How this works:
(?!0): this is negative lookahead to make sure that the next character is not 0
[\\d,]+$: a character class allowing only digits and commas to occur one or more times right up to the string end $
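To illustrate why the lookahead matters, the sketch below compares the pattern with and without `(?!0)`. It uses base R instead of stringr so it runs without extra packages; `first_match` is a hypothetical helper name introduced only for this example.

```r
vec <- c("\\xc2\\xa010,000,000", "\\xc2\\xa016,200,000", "\\xc2\\xa019,350,000")

# Hypothetical helper: first match of a PCRE pattern in each string.
first_match <- function(pattern, x) {
  regmatches(x, regexpr(pattern, x, perl = TRUE))
}

# Without the lookahead, the trailing 0 of "xa0" is swallowed by the match.
without_lookahead <- first_match("[\\d,]+$", vec)
without_lookahead
# [1] "010,000,000" "016,200,000" "019,350,000"

# With (?!0), the match cannot start on that 0.
with_lookahead <- first_match("(?!0)[\\d,]+$", vec)
with_lookahead
# [1] "10,000,000" "16,200,000" "19,350,000"
```

Note that this relies on the amounts themselves never starting with 0.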
Alternatively:
str_sub(vec, start = 9)
There were a few minor issues with your code.
The main one is the two unneeded backslashes in the string you compare against. These also lead to a counting error in your first str_sub(), where you should be taking the first 8 characters, not 10. Finally, the substring you keep should start at the character right after the text you want to match (i.e. position 9, not 12). The following should work:
csv_file2$Amount_USD <- ifelse(str_sub(csv_file$Amount_USD,1,8) == "\\xc2\\xa0", str_sub(csv_file$Amount_USD,9,-1),csv_file2$Amount_USD)
However, I would have done this with a more compact gsub than provided above. As long as the text at the start to remove is always going to be "\\xc2\\xa0", you can simply replace it with nothing. Note that for gsub you will need to escape all the backslashes, and hence you end up with:
csv_file2$Amount_USD <- gsub("\\\\xc2\\\\xa0", replacement = "", csv_file2$Amount_USD)
Personally, especially if you plan to do any sort of mathematics with this column, I would go the additional step and remove the commas, and then coerce the column to be numeric:
csv_file2$Amount_USD <- as.numeric(gsub("(\\\\xc2\\\\xa0)|,", replacement = "", csv_file2$Amount_USD))
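As an alternative sketch, the prefix can also be removed as plain text with `fixed = TRUE`, which avoids doubling the backslashes for the regex engine. The data frame below is a hypothetical stand-in for the asker's table, assuming the prefix really is the literal eight-character text \xc2\xa0 and not an actual non-breaking space.

```r
# Hypothetical data mirroring the question.
csv_file2 <- data.frame(
  Amount_USD = c("\\xc2\\xa010,000,000", "\\xc2\\xa016,200,000", "\\xc2\\xa019,350,000"),
  stringsAsFactors = FALSE
)

prefix <- "\\xc2\\xa0"   # backslash, x, c, 2, backslash, x, a, 0
nchar(prefix)
# [1] 8

# fixed = TRUE treats the pattern as literal text, not as a regex.
no_prefix <- sub(prefix, "", csv_file2$Amount_USD, fixed = TRUE)

# Drop the thousands separators and convert to numeric.
csv_file2$Amount_USD <- as.numeric(gsub(",", "", no_prefix, fixed = TRUE))
csv_file2$Amount_USD
# [1] 10000000 16200000 19350000
```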

Using regular expression, how can I add elements after I find a match in r?

I have a column of digit strings that are either 10 or 11 characters long. This is an example of some of the values in the column:
column=c("5699420001","00409226602")
How can I place a hyphen after the first four digits (in the strings with 10 characters) or after the first five digits (in the strings with 11 characters), and another hyphen after the next four digits for both lengths? The desired output is shown below. I wanted to use stringr for this.
column_standard=c("5699-4200-01","00409-2266-02")
Here's a solution using capture groups with stringr's str_replace() function:
library(stringr)
column <- c("5699420001","00409226602")
column_standard <- sapply(column, function(x) {
  ifelse(nchar(x) == 11,
         stringr::str_replace(x, "^([0-9]{5})([0-9]{4})(.*)", "\\1-\\2-\\3"),
         stringr::str_replace(x, "^([0-9]{4})([0-9]{4})(.*)", "\\1-\\2-\\3"))
})
column_standard
# 5699420001 00409226602
# "5699-4200-01" "00409-2266-02"
The code should be fairly self-explanatory. I can provide a detailed explanation upon request.
Try using this as your expression:
\b(\d{4,5})(\d{4})(\d{2}\b)
It sets up three capture groups that you can later use in your replacement to easily add hyphens between them.
Then you just replace with:
\1-\2-\3
Thanks to @Dunois for pointing out how it would look in code:
column_standard <- sapply(column, function(x) stringr::str_replace(x, "^(\\d{4,5})(\\d{4})(\\d{2})", "\\1-\\2-\\3"))
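Because the regex functions are already vectorized, the same three-group idea can be sketched without sapply(). This version uses base R's sub() instead of stringr and adds a `$` anchor (an assumption not in the answers above) so that only strings of exactly 10 or 11 digits are rewritten.

```r
column <- c("5699420001", "00409226602")

# Three capture groups: 4 or 5 digits, then 4 digits, then the last 2 digits.
# Anchoring both ends forces 4-4-2 for 10 digits and 5-4-2 for 11 digits.
column_standard <- sub("^(\\d{4,5})(\\d{4})(\\d{2})$", "\\1-\\2-\\3", column)
column_standard
# [1] "5699-4200-01"  "00409-2266-02"
```

Unlike the sapply() versions, the result is an unnamed character vector; strings that do not match the pattern are returned unchanged.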

Regex to remove all non-digit symbols from string in R

How can I extract digits from a string that can have a structure of xxxx.x or xxxx.x-x and combine them as a number? e.g.
list <- c("1010.1-1", "1010.2-1", "1010.3-1", "1030-1", "1040-1",
"1060.1-1", "1060.2-1", "1070-1", "1100.1-1", "1100.2-1")
The desired (numeric) output would be:
101011, 101021, 101031...
I tried
regexp <- "([[:digit:]]+)"
solution <- str_extract(list, regexp)
However that only extracts the first set of digits; and using something like
regexp <- "([[:digit:]]+\\.[[:digit:]]+\\-[[:digit:]]+)"
returns the whole string unchanged when it matches, and NA for the shorter strings. Thoughts?
Remove all non-digit symbols:
list <- c("1010.1-1", "1010.2-1", "1010.3-1", "1030-1", "1040-1", "1060.1-1", "1060.2-1", "1070-1", "1100.1-1", "1100.2-1")
as.numeric(gsub("\\D+", "", list))
## => [1] 101011 101021 101031 10301 10401 106011 106021 10701 110011 110021
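I have no experience with R but I do know regular expressions. Looking at the pattern you're specifying, "([[:digit:]]+)", I assume [[:digit:]] stands for [0-9], so you're capturing one run of digits.
That explains why you only get the first set of digits: the match stops at the first non-digit character.
Note that repeating the group, as in "([[:digit:]]+)+", would not help, because the runs are separated by . and - and a single match cannot skip over them. You need to either collect all matches and join them, or remove the non-digits.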
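One way to combine every run of digits, rather than only the first, is to extract all matches and paste them together. This is a minimal base-R sketch; `codes` is a shortened, hypothetical copy of the question's vector, and `runs` and `combined` are illustrative names.

```r
codes <- c("1010.1-1", "1030-1", "1100.2-1")

# gregexpr() finds every run of digits; regmatches() returns them as a list.
runs <- regmatches(codes, gregexpr("[[:digit:]]+", codes))

# Join the runs of each element and convert the result to numeric.
combined <- as.numeric(vapply(runs, paste, character(1), collapse = ""))
combined
# [1] 101011  10301 110021
```

This gives the same result as deleting the non-digits with gsub("\\D+", "", codes), just approached from the opposite direction.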

str_extract in R to extract number from a string

I have a table as below, from which I would like to extract the number following the last underscore
description desired_output
desc_lvl1_id_1 1
desc_lvl1_id_2 2
The solution that I have come up with is split into two parts: first get the underscore and the number that I want, then take out the underscore: gsub("_", "", str_extract(description, "_[0-9]")). I'm hoping, though, that this can be done in one step.
We can use a positive lookbehind ((?<=_)) as the pattern in str_extract to match the digits that directly follow a _.
library(stringr)
df1$desired_output <- as.numeric(str_extract(df1$description, '(?<=_)\\d+'))
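For comparison, a one-step base-R sketch that captures the trailing number is shown below. The data frame `df1` is rebuilt here from the two rows in the question, and the pattern assumes the number always sits at the very end of the string after an underscore.

```r
# Hypothetical reconstruction of the table from the question.
df1 <- data.frame(
  description = c("desc_lvl1_id_1", "desc_lvl1_id_2"),
  stringsAsFactors = FALSE
)

# Keep only the digits after the final underscore (capture group 1).
df1$desired_output <- as.numeric(sub("^.*_([0-9]+)$", "\\1", df1$description))
df1$desired_output
# [1] 1 2
```

If a description does not end in an underscore followed by digits, sub() returns it unchanged and as.numeric() yields NA with a warning.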

selective removal of characters following a pattern using R

How to selectively remove characters from a string following a pattern?
I wish to remove each seven-digit number (e.g. 0.005926) together with the colon that precedes it.
For example:
"((Northern_b:0.005926,Tropical_b:0.000000)N19:0.002950"
should become
"((Northern_b,Tropical_b)N19"
x <- "((Northern_b:0.005926,Tropical_b:0.000000)N19:0.002950"
gsub("[:]\\d{1}[.]\\d{6}", "", x)
The gsub function does string replacement and replaces all matches found in the string (see ?gsub). An alternative, if you want something with a friendlier name, is str_replace_all from the stringr package.
The regular expression uses \\d{n}, which matches exactly n digits. So \\d{1} matches a single digit and \\d{6} matches a run of six digits; [:] and [.] match a literal colon and a literal period.
gsub('[:][0-9.]+','',x)
[1] "((Northern_b,Tropical_b)N19"
Another approach to solve this problem
library(stringr)
str1 <- c("((Northern_b:0.005926,Tropical_b:0.000000)N19:0.002950")
str_replace_all(str1, regex("(:\\d{1,}\\.\\d{1,})"), "")
#[1] "((Northern_b,Tropical_b)N19"
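The patterns above agree on the example string but differ in how strict they are. The sketch below compares the fixed-width pattern with the looser one on a second, made-up string `x2` whose numbers have varying lengths; `strict` and `loose` are illustrative names.

```r
x  <- "((Northern_b:0.005926,Tropical_b:0.000000)N19:0.002950"
x2 <- "(A:0.1,B:12.345678)C:0.002950"   # hypothetical input with mixed formats

# Fixed width: colon, one digit, a period, exactly six digits.
strict <- gsub("[:]\\d{1}[.]\\d{6}", "", x2)
strict
# [1] "(A:0.1,B:12.345678)C"

# Loose: colon followed by any run of digits and periods.
loose <- gsub(":[0-9.]+", "", x2)
loose
# [1] "(A,B)C"

# On the original example both give the same result.
identical(gsub("[:]\\d{1}[.]\\d{6}", "", x), gsub(":[0-9.]+", "", x))
# [1] TRUE
```

Which one to prefer depends on whether every number is guaranteed to have one leading digit and six decimals.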
