Split a string based on "^" in R

I need to split a string and obtain all the characters before ^
example:
I have a column in a dataframe that reads
2567543^ABC
7545435^J
8934939^XY
and the result column in the same dataframe should read:
2567543
7545435
8934939
I tried using stringr, strsub{base}, stringi and gsubfn, but they give weird results because of the ^. I cannot replace ^ because the table is simply huge.

Just remove everything from the ^ up to the end of the string using the sub function. Since ^ is a special metacharacter in regex that matches the start of a line, you need to escape it in order to match a literal ^ symbol.
sub("\\^.*", "", df$x)
Example:
> df <- data.frame(x=c("2567543^ABC", "7545435^J", "8934939^XY"))
> df$x <- sub("\\^.*", "", df$x)
> df
x
1 2567543
2 7545435
3 8934939
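OR
Split each string on the escaped ^ and keep the first piece of every element (indexing with [[1]][1] would return only the first piece of the first row and recycle it down the column).
> df <- data.frame(x=c("2567543^ABC", "7545435^J", "8934939^XY"))
> df$x <- sapply(strsplit(as.character(df$x), "\\^"), `[`, 1)
> df
x
1 2567543
2 7545435
3 8934939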
OR
Use the fixed=TRUE parameter in strsplit, so that ^ is treated as a literal character rather than a regex metacharacter.
> df <- data.frame(x=c("2567543^ABC", "7545435^J", "8934939^XY"))
> df$x <- sapply(strsplit(as.character(df$x), "^", fixed=TRUE), `[`, 1)
> df
x
1 2567543
2 7545435
3 8934939
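To get the first piece of every row, the split result has to be indexed element by element. Here is a minimal base R sketch that compares the regex approach with the fixed-string split; the names via_sub, pieces and via_split are just illustrative.

```r
x <- c("2567543^ABC", "7545435^J", "8934939^XY")

# Regex approach: "\\^" matches a literal caret, ".*" matches the rest of the string.
via_sub <- sub("\\^.*", "", x)

# Split approach: fixed = TRUE treats "^" as plain text.
# strsplit() returns one character vector per input string,
# so take element 1 of each vector rather than of the whole list.
pieces <- strsplit(x, "^", fixed = TRUE)
via_split <- vapply(pieces, function(p) p[1], character(1))

print(via_sub)
print(identical(via_sub, via_split))
```

Both approaches are vectorised over the column, so no loop over rows is needed even for a large table.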

Related

Is there a way in R to count the number of substrings in a string enclosed in square brackets, all substrings are separated by commas and are quoted?

['ax', 'byc', 'crm', 'dop']
This is a character string, and I want a count of all substrings, ie 4 here as output. Want to do this for the entire column containing such strings.

We may use str_count
library(stringr)
str_count(str1, "\\w+")
[1] 4
Or may also extract the alpha numeric characters into a list and get the lengths
lengths(str_extract_all(str1, "[[:alnum:]]+"))
If it is a data.frame column, extract the column as a vector and apply str_count
str_count(df1$str1, "\\w+")
data
str1 <- "['ax', 'byc', 'crm', 'dop']"
df1 <- data.frame(str1)

Here are a few base R approaches. We use the 2 row input defined reproducibly in the Note at the end. No packages are used.
lengths(strsplit(DF$b, ","))
## [1] 4 4
nchar(gsub("[^,]", "", DF$b)) + 1
## [1] 4 4
count.fields(textConnection(DF$b), ",")
## [1] 4 4
Note
DF <- data.frame(a = 1:2, b = "['ax', 'byc', 'crm', 'dop']")
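The comma-based counts above assume that no item contains a comma and that the list is never empty. If that cannot be guaranteed, one option is to count the quoted items directly. This is only a sketch in base R; the second and third test strings are made-up inputs to show the edge cases.

```r
# Hypothetical inputs: a normal list, an item containing a comma, an empty list.
str1 <- c("['ax', 'byc', 'crm', 'dop']", "['a,b', 'c']", "[]")

# Match each single-quoted item: a quote, any run of non-quote characters, a quote.
quoted <- regmatches(str1, gregexpr("'[^']*'", str1))

# One count per input string.
n_items <- lengths(quoted)
print(n_items)
```

Counting commas and adding one would report 3 for the second string and 1 for the empty list, which is why the quoted-item pattern may be the safer choice for messy data.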

How to return rows of a df that contain strings from a character list

I have a character list. I would like to return rows in a df that contain any of the strings in the list in a given column.
I have tried things like:
hits <- df %>%
filter(column, any(strings))
strings <- c("ape", "bat", "cat")
head(df$column)
[1] "ape and some other text here"
[2] "just some random text"
[3] "Something about cats"
I would like only rows 1 and 3 returned
Thanks in advance for the help.

Use grepl() with a regular expression matching any of the strings in your strings vector:
strings <- c("ape", "bat", "cat")
Firstly, you can collapse the strings vector to the regex you need:
regex <- paste(strings, collapse = "|")
Which gives:
> regex <- paste(strings, collapse = "|")
> regex
[1] "ape|bat|cat"
The pipe symbol | acts as an or operator, so this regex ape|bat|cat will match ape or bat or cat.
If your data.frame df looks like this:
> df
# A tibble: 3 x 1
column
<chr>
1 ape and some other text here
2 just some random text
3 something about cats
Then you can run the following line of code to return just the rows matching your desired strings:
df[grepl(regex, df$column), ]
The output is as follows:
> df[grepl(regex, df$column), ]
# A tibble: 2 x 1
column
<chr>
1 ape and some other text here
2 something about cats
Note that the above example is case-sensitive: it will only match the lower case strings exactly as specified. You can overcome this easily using the ignore.case parameter of grepl() (note the upper case Cats):
> df[grepl(regex, df$column, ignore.case = TRUE), ]
# A tibble: 2 x 1
column
<chr>
1 ape and some other text here
2 something about Cats
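Also keep in mind that ape|bat|cat matches anywhere inside a word, which is why "cats" counts as a hit. If you only want whole words, one possible variation is to wrap the alternatives in word boundaries. A minimal sketch, assuming a plain character vector named column for illustration:

```r
strings <- c("ape", "bat", "cat")
column <- c("ape and some other text here",
            "just some random text",
            "Something about cats")

# Substring match: "cat" is found inside "cats".
regex <- paste(strings, collapse = "|")
partial <- grepl(regex, column)

# Whole-word match: \\b requires a word boundary on both sides.
word_regex <- paste0("\\b(", regex, ")\\b")
whole <- grepl(word_regex, column, perl = TRUE)

print(partial)
print(whole)
```

Which of the two you want depends on whether "cats" should count as a match for "cat".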

This can be accomplished with a regular expression.
aColumn <- c("ape and some other text here","just some random text","Something about cats")
aColumn[grepl("ape|bat|cat",aColumn)]
...and the output:
> aColumn[grepl("ape|bat|cat",aColumn)]
[1] "ape and some other text here" "Something about cats"
>
One can also set up the regular expression in an R object, as follows.
# use with a variable
strings <- "ape|cat|bat"
aColumn[grepl(strings,aColumn)]
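If the search strings may contain regex metacharacters such as ( or ., collapsing them with | can fail or match the wrong thing. One way around this is to test each string literally with fixed = TRUE and combine the results. This is a sketch with made-up search strings, not part of the original question:

```r
# Hypothetical search strings, two of which contain regex metacharacters.
strings <- c("a.e", "c(t", "cat")
aColumn <- c("ape and some other text here",
             "just some random text",
             "Something about cats",
             "a c(t here")

# One logical vector per search string, each matched as plain text.
per_string <- lapply(strings, function(s) grepl(s, aColumn, fixed = TRUE))

# A row is a hit if any of the search strings matched it.
hits <- Reduce(`|`, per_string)
print(aColumn[hits])
```

Here "a.e" no longer matches "ape", because the dot is treated as a literal dot.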

Extracting all values between ( ) and before % sign

How can I extract just the number between the parentheses () and before %?
df <- data.frame(X = paste0('(',runif(3,0,1), '%)'))
X
1 (0.746698269620538%)
2 (0.104987640399486%)
3 (0.864544949028641%)
For instance, I would like to have a DF like this:
X
1 0.746698269620538
2 0.104987640399486
3 0.864544949028641

We can use sub to match the ( (escaped with \\ because it is a metacharacter) at the start (^) of the string, followed by zero or more digits or dots ([0-9.]*) captured as a group ((...)), followed by % and any other characters (.*), and replace the whole match with the backreference (\\1) to the captured group
df$X <- as.numeric(sub("^\\(([0-9.]*)%.*", "\\1", df$X))
If the value can also include non-numeric characters, then
sub("^\\(([^%]*)%.*", "\\1", df$X)

Use substr, since you know you need to omit the first character and the last two:
> df <- data.frame(X = paste0('(',runif(3,0,1), '%)'))
> df
X
1 (0.393457352882251%)
2 (0.0288733830675483%)
3 (0.289543839870021%)
> df$X <- as.numeric(substr(df$X, 2, nchar(as.character(df$X)) - 2))
> df
X
1 0.39345735
2 0.02887338
3 0.28954384
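Because runif gives different values on every run, it can be easier to check the two approaches on fixed input. A small sketch using the values from the question; via_regex and via_substr are illustrative names:

```r
X <- c("(0.746698269620538%)", "(0.104987640399486%)", "(0.864544949028641%)")

# Regex approach: capture the digits and dots between "(" and "%".
via_regex <- as.numeric(sub("^\\(([0-9.]*)%.*", "\\1", X))

# Positional approach: drop the first character and the last two.
via_substr <- as.numeric(substr(X, 2, nchar(X) - 2))

# Both should give the same numbers for well-formed input.
print(identical(via_regex, via_substr))
```

The printed data frame shows fewer digits only because of R's default print precision; the stored values keep full double precision.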

Extracting numbers from character string based on delimiters

I have the following dataframe:
a <- seq(1:5)
b <- c("abc_a_123456_defghij_1", "abc_a_78912_abc_2", "abc_a_345678912_xyzabc_3",
"abc_b_34567_defgh_4", "abc_c_891234556778_ijklmnop_5")
df <- data.frame(a, b)
df$b <- as.character(df$b)
And I need to extract the numbers in df$b that come between the second and third underscores and assign to df$c.
I'm guessing there's a fairly simple solution, but haven't found it yet. The actual dataset is fairly large (3MM rows) so efficiency is a bit of a factor.
Thanks for the help!

We can use sub to match zero or more characters that are not a _ ([^_]*) from the start (^) of the string, followed by an underscore (_), then another set of characters that are not an underscore followed by an underscore, capture the one or more digits that follow in a group ((\\d+)), followed by an underscore and other characters, then replace it with the backreference for that group and finally convert it to numeric
as.numeric(sub("^[^_]*_[^_]+_(\\d+)_.*", "\\1", df$b))
#[1] 123456 78912 345678912 34567 891234556778

Create a my_split function that finds the positions of "_" using gregexpr, then extracts the string between the start and end positions using substr.
my_split <- function(x, start, end){
a1 <- gregexpr("_", x)
substr(x, a1[[1]][start]+1, a1[[1]][end]-1)
}
b <- c("abc_a_123456_defghij_1", "abc_a_78912_abc_2", "abc_a_345678912_xyzabc_3", "abc_b_34567_defgh_4", "abc_c_891234556778_ijklmnop_5")
sapply(b, my_split, start = 2, end = 3)
# abc_a_123456_defghij_1 abc_a_78912_abc_2
# "123456" "78912"
# abc_a_345678912_xyzabc_3 abc_b_34567_defgh_4
# "345678912" "34567"
# abc_c_891234556778_ijklmnop_5
# "891234556778"
Using the data.table library:
library(data.table)
setDT(df)[, c := lapply(b, my_split, start = 2, end = 3)]
df
# a b c
# 1: 1 abc_a_123456_defghij_1 123456
# 2: 2 abc_a_78912_abc_2 78912
# 3: 3 abc_a_345678912_xyzabc_3 345678912
# 4: 4 abc_b_34567_defgh_4 34567
# 5: 5 abc_c_891234556778_ijklmnop_5 891234556778
data:
a <- seq(1:5)
b <- c("abc_a_123456_defghij_1", "abc_a_78912_abc_2", "abc_a_345678912_xyzabc_3", "abc_b_34567_defgh_4", "abc_c_891234556778_ijklmnop_5")
df <- data.frame(a, b, stringsAsFactors = FALSE)
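Since the wanted value is always the third underscore-separated field, another possible approach is to split on the literal "_" and pick field 3. This is a base R sketch for comparison, not a benchmark; whether it is faster than the regex on 3MM rows would need to be measured.

```r
b <- c("abc_a_123456_defghij_1", "abc_a_78912_abc_2", "abc_a_345678912_xyzabc_3",
       "abc_b_34567_defgh_4", "abc_c_891234556778_ijklmnop_5")

# Split every string on a literal underscore (no regex engine involved).
fields <- strsplit(b, "_", fixed = TRUE)

# Take the third field of each string and convert it to numeric.
third <- as.numeric(vapply(fields, function(f) f[3], character(1)))

# Same result as the sub() approach with a capture group.
via_sub <- as.numeric(sub("^[^_]*_[^_]+_(\\d+)_.*", "\\1", b))
print(identical(third, via_sub))
```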

How to split strings and numbers in R?

I have character vector of the following form (this is just a sample):
R1Ng(10)
test(0)
n.Ex1T(34)
where as can be seen above, the first part is always some combination of alphanumeric and punctuation marks, then there are parentheses with a number inside. I want to create a numeric vector which will store the values inside the parentheses, and each number should have name attribute, and the name attribute should be the string before the number. So, for example, I want to store 10, 0, 34, inside a numeric vector and their name attributes should be, R1Ng, test, n.Ex1T, respectively.
I can always do something like this to get the numbers and create a numeric vector:
counts <- regmatches(data, gregexpr("[[:digit:]]+", data))
as.numeric(unlist(counts))
But how can I extract the first string part, and store it as the name attribute of that numeric vector?

How about this:
x <- c("R1Ng(10)", "test(0)", "n.Ex1T(34)")
data.frame(Name = gsub( "\\(.*", "", x),
Count = as.numeric(gsub(".*?\\((.*?)\\).*", "\\1", x)))
# Name Count
# 1 R1Ng 10
# 2 test 0
# 3 n.Ex1T 34
Or alternatively as a vector
setNames(as.numeric(gsub(".*?\\((.*?)\\).*", "\\1", x)),
gsub( "\\(.*", "", x ))
# R1Ng test n.Ex1T
# 10 0 34

Here is another variation using the same expression and capturing parentheses:
temp <- c("R1Ng(10)", "test(0)", "n.Ex1T(34)")
data.frame(Name=gsub("^(.*)\\((\\d+)\\)$", "\\1", temp),
count=gsub("^(.*)\\((\\d+)\\)$", "\\2", temp))

We can use str_extract_all
library(stringr)
lst <- str_extract_all(x, "[^()]+")
Or with strsplit from base R
lst <- strsplit(x, "[()]")
If we need to store as a named vector
sapply(lst, function(x) setNames(as.numeric(x[2]), x[1]))
# R1Ng test n.Ex1T
# 10 0 34
data
x <- c("R1Ng(10)", "test(0)", "n.Ex1T(34)")
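For completeness, the name and the number can also be pulled out in a single pass with regexec and regmatches, which return the full match plus each capture group. A minimal base R sketch; m and counts are illustrative names:

```r
x <- c("R1Ng(10)", "test(0)", "n.Ex1T(34)")

# Group 1 is everything before the final "(", group 2 is the digits inside "()".
m <- regmatches(x, regexec("^(.*)\\(([0-9]+)\\)$", x))

# Each element of m is c(full match, group 1, group 2).
counts <- vapply(m, function(p) as.numeric(p[3]), numeric(1))
names(counts) <- vapply(m, function(p) p[2], character(1))

print(counts)
```

This assumes every string ends with a number in parentheses; a string that does not match yields an empty element in m, which would need handling first.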
