Extracting all values between ( ) and before % sign

How can I extract just the number between the parentheses () and before %?
df <- data.frame(X = paste0('(',runif(3,0,1), '%)'))
X
1 (0.746698269620538%)
2 (0.104987640399486%)
3 (0.864544949028641%)
For instance, I would like to have a DF like this:
X
1 0.746698269620538
2 0.104987640399486
3 0.864544949028641

We can use sub to match the ( (escaped with \\ because it is a metacharacter) at the start (^) of the string, followed by zero or more digits or dots ([0-9.]*) captured as a group ((...)), followed by % and any remaining characters (.*), and replace the whole match with the backreference (\\1) to the captured group:
df$X <- as.numeric(sub("^\\(([0-9.]*)%.*", "\\1", df$X))
If the value may also include non-numeric characters, capture everything up to the % instead:
sub("^\\(([^%]*)%.*", "\\1", df$X)
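A minimal, reproducible check of the first pattern. The sample strings below are made up for illustration, since runif produces different numbers on every run:

```r
# Hypothetical fixed values so the result is the same on every run
x <- c("(0.75%)", "(12.5%)", "(3%)")

# Keep the digits and dots that follow "(" and drop everything from "%" onward
num <- as.numeric(sub("^\\(([0-9.]*)%.*", "\\1", x))

# num now holds the numeric values 0.75, 12.5 and 3
num
```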

Use substr, since you know you need to omit the first character and the last two characters:
> df <- data.frame(X = paste0('(',runif(3,0,1), '%)'))
> df
X
1 (0.393457352882251%)
2 (0.0288733830675483%)
3 (0.289543839870021%)
> df$X <- as.numeric(substr(df$X, 2, nchar(as.character(df$X)) - 2))
> df
X
1 0.39345735
2 0.02887338
3 0.28954384
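The shorter numbers in the printed data frame are only a display effect: R prints 7 significant digits by default, while the stored value keeps the precision that was parsed. A small sketch, using one value copied from the output above:

```r
# Sample value copied from the printed data frame above
x <- "(0.393457352882251%)"

# Drop the leading "(" and the trailing "%)"
v <- as.numeric(substr(x, 2, nchar(x) - 2))

print(v)                # displayed with the default 7 significant digits
format(v, digits = 15)  # displays the digits that were actually parsed
```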

Related

Extract values based on last n characters

I have a vector like below:
vector
jdjss-jdhs--abc-bec-ndj
kdjska-kvjd-jfj-nej-ndjk
eknd-nend-neekd-nemd-nemdkd-nedke
How do I extract the last 3 values, based on a - delimiter, so that my result looks like this:
vector Col1 Col2 Col3
jdjss-jdhs--abc-bec-ndj abc bec ndj
kdjska-kvjd-jfj-nej-ndjk jfj nej ndjk
eknd-nend-neekd-nemd-nemdkd-nedke nemd nemdkd nedke
I've attempted to use sub and the qdap package, but with no luck.
sub( "(^[^-]+[-][^-]+)(.+$)", "\\2", df$vector)
qdap::char2end(df$vector, "-", 3)
Not sure how to go about doing this.

You may use tidyr::extract:
library(tidyr)
vector <- c("jdjss-jdhs--abc-bec-ndj", "kdjska-kvjd-jfj-nej-ndjk", "eknd-nend-neekd-nemd-nemdkd-nedke")
df <- data.frame(vector)
tidyr::extract(df, vector, into = c("Col1", "Col2", "Col3"), "([^-]*)-([^-]*)-([^-]*)$", remove=FALSE)
vector Col1 Col2 Col3
1 jdjss-jdhs--abc-bec-ndj abc bec ndj
2 kdjska-kvjd-jfj-nej-ndjk jfj nej ndjk
3 eknd-nend-neekd-nemd-nemdkd-nedke nemd nemdkd nedke
The ([^-]*)-([^-]*)-([^-]*)$ pattern matches:
([^-]*) - Group 1 ('Col1'): 0+ chars other than -
- - a hyphen
([^-]*) - Group 2 ('Col2'): 0+ chars other than -
- - a hyphen
([^-]*) - Group 3 ('Col3'): 0+ chars other than -
$ - end of string
Set remove=FALSE in order to keep the original column.
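To see what each group captures without loading tidyr, the same pattern can be tried on a single string with base R's regexec and regmatches. This is only an illustrative sketch; the variable names s and m are not part of the answer above:

```r
s <- "jdjss-jdhs--abc-bec-ndj"

# regexec returns the position of the full match and of each capture group;
# regmatches turns those positions into the matched text
m <- regmatches(s, regexec("([^-]*)-([^-]*)-([^-]*)$", s))[[1]]

m[1]    # the full match: the last three fields with their hyphens
m[2:4]  # groups 1 to 3, which tidyr::extract places in Col1, Col2, Col3
```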

You can use strsplit from base.
x <- "eknd-nend-neekd-nemd-nemdkd-nedke"
lastElements <- function(x, last = 3){
strLength <- length(strsplit(x, "-")[[1]])
start <- strLength - (last - 1)
strsplit(x, "-")[[1]][start:strLength]
}
> lastElements(x)
[1] "nemd" "nemdkd" "nedke"

You can simply split the string on - using strsplit and extract the last n elements:
df <- data.frame(vector = c(
"jdjss-jdhs--abc-bec-ndj",
"kdjska-kvjd-jfj-nej-ndjk",
"eknd-nend-neekd-nemd-nemdkd-nedke"),
stringsAsFactors = FALSE
)
cbind(df, t(sapply(strsplit(df$vector, "-"), tail, 3)))
vector 1 2 3
1 jdjss-jdhs--abc-bec-ndj abc bec ndj
2 kdjska-kvjd-jfj-nej-ndjk jfj nej ndjk
3 eknd-nend-neekd-nemd-nemdkd-nedke nemd nemdkd nedke

strcapture, as a base R counterpart to the tidyr extract answer from Wiktor:
strcapture("([^-]*)-([^-]*)-([^-]*)$", df$vector, proto=list(Col1="",Col2="",Col3=""))
# Col1 Col2 Col3
#1 abc bec ndj
#2 jfj nej ndjk
#3 nemd nemdkd nedke
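strcapture returns only the captured columns. If the original column should be kept alongside them (the effect of remove=FALSE in the tidyr answer), one possible sketch is to bind the two together. The proto here is written as an empty data frame, and res is a name chosen for illustration:

```r
df <- data.frame(
  vector = c("jdjss-jdhs--abc-bec-ndj",
             "kdjska-kvjd-jfj-nej-ndjk",
             "eknd-nend-neekd-nemd-nemdkd-nedke"),
  stringsAsFactors = FALSE
)

# Prototype describing the names and types of the captured columns
proto <- data.frame(Col1 = character(), Col2 = character(), Col3 = character(),
                    stringsAsFactors = FALSE)

# Capture the last three fields and place them next to the original column
res <- cbind(df, strcapture("([^-]*)-([^-]*)-([^-]*)$", df$vector, proto))
res
```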

Extract only values with a decimal point in between from strings

I have a dataframe with strings such as:
id <- c(1,2)
x <- c("...14.....5.......................395.00.........................14.........1..",
"......114.99....................124.99................")
df <- data.frame(id,x)
df$x <- as.character(df$x)
How can I extract only values with a decimal point in between such as 395.00, 114.99 and 124.99 and not 14, 5, or 1 for each row, and put them in a new column separated by a comma?
The ideal result would be:
id x2
1 395.00
2 114.99,124.99
The number of periods separating the values is random.

library(stringr)
df$x2 = str_extract_all(df$x, "[0-9]+\\.[0-9]+")
df[c(1, 3)]
# id x2
# 1 1 395.00
# 2 2 114.99, 124.99
Explanation: [0-9]+ matches one or more digits, \\. matches a single literal decimal point. str_extract_all extracts all matches.
The new column is a list column, not a string with an inserted comma. This allows you access to the individual elements, if needed:
df$x2[2]
# [[1]]
# [1] "114.99" "124.99"
If you prefer a character vector as the column, do this:
df$x3 = sapply(str_extract_all(df$x, "[0-9]+\\.[0-9]+"), paste, collapse = ",")
df$x3[2]
#[1] "114.99,124.99"
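If adding stringr is not an option, the same extraction can be sketched in base R with gregexpr and regmatches. The names m and x2 are chosen for illustration:

```r
x <- c("...14.....5.......................395.00.........................14.........1..",
       "......114.99....................124.99................")

# gregexpr finds every match of digits, a dot, digits; regmatches extracts them
m <- regmatches(x, gregexpr("[0-9]+\\.[0-9]+", x))

# Collapse the matches of each row into one comma-separated string
x2 <- sapply(m, paste, collapse = ",")
x2
```

Here m is a list with one character vector per row, comparable to the list column above, and x2 is the collapsed character vector.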

Extracting numbers from character string based on delimiters

I have the following dataframe:
a <- seq(1:5)
b <- c("abc_a_123456_defghij_1", "abc_a_78912_abc_2", "abc_a_345678912_xyzabc_3",
"abc_b_34567_defgh_4", "abc_c_891234556778_ijklmnop_5")
df <- data.frame(a, b)
df$b <- as.character(df$b)
And I need to extract the numbers in df$b that come between the second and third underscores and assign to df$c.
I'm guessing there's a fairly simple solution, but haven't found it yet. The actual dataset is fairly large (3MM rows) so efficiency is a bit of a factor.
Thanks for the help!

We can use sub to match zero or more characters that are not a _ ([^_]*) from the start (^) of the string, followed by an underscore (_), then one or more characters that are not an underscore ([^_]+) followed by another underscore. The one or more digits that follow are captured in a group ((\\d+)), followed by an underscore and any other characters (.*). The whole match is replaced with the backreference to that group (\\1) and the result is finally converted to numeric:
as.numeric(sub("^[^_]*_[^_]+_(\\d+)_.*", "\\1", df$b))
#[1] 123456 78912 345678912 34567 891234556778
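Because the wanted value is always the third _-separated field in the sample data, splitting on a literal underscore is another way to get it. This is a sketch that assumes every string has at least three fields; whether it is faster than sub on millions of rows would need to be measured:

```r
b <- c("abc_a_123456_defghij_1", "abc_c_891234556778_ijklmnop_5")

# fixed = TRUE treats "_" as a literal string rather than a regex
parts <- strsplit(b, "_", fixed = TRUE)

# Take the third field of every element; vapply guarantees a character result
third <- vapply(parts, "[", character(1), 3)

# Convert to numeric as in the sub() answer
ids <- as.numeric(third)
```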

Create a my_split function that finds the positions of "_" using gregexpr, then extracts the string between the start and end positions using substr.
my_split <- function(x, start, end){
a1 <- gregexpr("_", x)
substr(x, a1[[1]][start]+1, a1[[1]][end]-1)
}
b <- c("abc_a_123456_defghij_1", "abc_a_78912_abc_2", "abc_a_345678912_xyzabc_3", "abc_b_34567_defgh_4", "abc_c_891234556778_ijklmnop_5")
sapply(b, my_split, start = 2, end = 3)
# abc_a_123456_defghij_1 abc_a_78912_abc_2
# "123456" "78912"
# abc_a_345678912_xyzabc_3 abc_b_34567_defgh_4
# "345678912" "34567"
# abc_c_891234556778_ijklmnop_5
# "891234556778"
Using the data.table library:
library(data.table)
setDT(df)[, c := lapply(b, my_split, start = 2, end = 3)]
df
# a b c
# 1: 1 abc_a_123456_defghij_1 123456
# 2: 2 abc_a_78912_abc_2 78912
# 3: 3 abc_a_345678912_xyzabc_3 345678912
# 4: 4 abc_b_34567_defgh_4 34567
# 5: 5 abc_c_891234556778_ijklmnop_5 891234556778
data:
a <- seq(1:5)
b <- c("abc_a_123456_defghij_1", "abc_a_78912_abc_2", "abc_a_345678912_xyzabc_3", "abc_b_34567_defgh_4", "abc_c_891234556778_ijklmnop_5")
df <- data.frame(a, b, stringsAsFactors = FALSE)

Remove comma and or period except if certain condition holds for last occurrence in R

I would like to remove all commas and periods from string, except in the case that a string ends in a comma (or period) followed by one or two numbers.
Some examples would be:
12.345.67 #would become 12345.67
12.345,67 #would become 12345,67
12.345,6 #would become 12345,6
12.345.6 #would become 12345.6
12.345 #would become 12345
1,2.345 #would become 12345
and so forth

A stringi solution using the same data as @Sotos would be:
library(stringi)
Line 1 removes every , or . that is followed by another , or . so that only the last separator remains.
Lines 2 and 3 remove that last separator if more than 2 characters follow it.
x <- stri_replace_all_regex(x, "[,.](?=.*[,.])", "")
pos <- stri_locate_last_regex(x, "[,.]")[, 2]
x <- ifelse(!is.na(pos) & pos < stri_length(x) - 2, stri_replace_last_regex(x, "[,.]", ""), x)
> x
[1] "12345.67" "12345,67" "12345,6" "12234" "1234" "12.45"

Another option is to use a negative lookahead, (?!...), with Perl-compatible regex:
df
# V1
# 1 12.345.67
# 2 12.345,67
# 3 12.345,6
# 4 12.345.6
# 5 12.345
# 6 1,2.345
df$V1 = gsub("[,.](?!\\d{1,2}$)", "", df$V1, perl = TRUE)
df # removes , or . unless it is followed by 1 or 2 digits at the end of the string
# V1
# 1 12345.67
# 2 12345,67
# 3 12345,6
# 4 12345.6
# 5 12345
# 6 12345
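The answer above prints df without showing how it was built. A self-contained sketch that applies the same lookahead pattern to the six strings from the question as a plain character vector (x and cleaned are illustrative names):

```r
# The six examples from the question
x <- c("12.345.67", "12.345,67", "12.345,6", "12.345.6", "12.345", "1,2.345")

# Remove every "," or "." that is NOT followed by exactly 1 or 2 digits
# and then the end of the string
cleaned <- gsub("[,.](?!\\d{1,2}$)", "", x, perl = TRUE)
cleaned
```

perl = TRUE is required here because lookaheads are not supported by R's default regex engine.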

One solution is to count the characters after the last comma/period (nchar(word(x, -1, sep = ',|\\.'))): if the length is greater than 2, remove all delimiters (gsub(',|\\.', '', x)); otherwise remove just the first one (sub(',|\\.', '', x)).
library(stringr)
ifelse(nchar(word(x, -1, sep = ',|\\.')) > 2, gsub(',|\\.', '', x), sub(',|\\.', '', x))
#[1] "12345.67" "12345,67" "12345,6" "12234" "1234" "12.45"
DATA
x <- c("12.345.67", "12.345,67", "12.345,6", "1,2.234", "1.234", "1,2.45")

Split a string based on "^" in R

I need to split a string and obtain all the characters before ^
example:
I have a column in a dataframe that reads
2567543^ABC
7545435^J
8934939^XY
and the result column in the same dataframe should read:
2567543
7545435
8934939
I tried using stringr, strsub{base}, stringi, gsubfn, but they give weird results because of the ^. I cannot replace ^ beforehand because the table is simply huge.

Just remove everything from ^ up to the end of the string using the sub function. Since ^ is a special metacharacter in regex that matches the start of a line, you need to escape the ^ symbol in order to match a literal ^.
sub("\\^.*", "", df$x)
Example:
> df <- data.frame(x=c("2567543^ABC", "7545435^J", "8934939^XY"))
> df$x <- sub("\\^.*", "", df$x)
> df
x
1 2567543
2 7545435
3 8934939
OR
> df <- data.frame(x=c("2567543^ABC", "7545435^J", "8934939^XY"))
> df$x <- sapply(strsplit(as.character(df$x), "\\^"), "[", 1)
> df
x
1 2567543
2 7545435
3 8934939
OR
Use fixed=TRUE parameter in strsplit since ^ is a special character.
> df <- data.frame(x=c("2567543^ABC", "7545435^J", "8934939^XY"))
> df$x <- sapply(strsplit(as.character(df$x), "^", fixed=TRUE), "[", 1)
> df
x
1 2567543
2 7545435
3 8934939
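To see why the escape matters, compare the unescaped and escaped patterns on one value. This is a small illustrative sketch; the variable names are not from the answer above:

```r
s <- "2567543^ABC"

# Unescaped: ^ anchors at the start, so "^.*" matches the whole string
unescaped <- sub("^.*", "", s)

# Escaped: "\\^" matches the literal caret, so only "^ABC" is removed
escaped <- sub("\\^.*", "", s)

# fixed = TRUE treats the pattern as plain text, so only the caret itself goes
literal <- sub("^", "", s, fixed = TRUE)
```

With the unescaped pattern nothing is left of the string, which is the kind of surprising result the question describes.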
