How to split letters with bracket and numbers in R? - r

The string is s = '[12]B1[16]M5'
I want to split it as the following results with strsplit function in R:
let <- c('[12]B', '[16]M')
num <- c(1, 5)
Thanks a lot

You could use a regular expression for this task.
s = '[12]B1[16]M22'
grx <- gregexpr("\\[.+?\\][[:alpha:]]+", s)
let <- do.call(c, regmatches(s, grx))
#let
#[1] "[12]B" "[16]M"
If you want to get all chunks (let + num), you can tweak the pattern as below. This facilitates extracting the numeric part.
grx <- gregexpr("\\[.+?\\][[:alpha:]]+[[:digit:]]+", s)
out <- do.call(c, regmatches(s, grx))
num <- gsub(".+\\][[:alpha:]]+", "", out)
num
[1] "1" "22"

Using the stringr package:
library(stringr)
x <- '[12]B1[16]M2'
let <- unlist(str_extract_all(x, "\\[[0-9]{2}\\][A-Z]"))
x <- gsub(pattern = "\\[[0-9]{2}\\][A-Z]",
replacement = "",
x)
num <- unlist(str_extract_all(x, "[0-9]"))
The regular expression "\\[[0-9]{2}\\][A-Z]" can be broken down as follows:
\\[ an opening bracket
[0-9]{2} exactly two consecutive digits
\\] a closing bracket
[A-Z] exactly one uppercase letter
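The pattern above assumes exactly two digits inside the brackets and a single-digit number after the letter. A minimal base R sketch that relaxes both assumptions is shown here; it uses one pattern for the bracket-plus-letter part and a lookbehind for the number. The names let_pat and num_pat are made up for this illustration.

```r
s <- "[12]B1[16]M22"

# "[", one or more digits, "]", then one or more letters
let_pat <- "\\[[0-9]+\\][A-Za-z]+"
# one or more digits that directly follow a letter (lookbehind needs perl = TRUE)
num_pat <- "(?<=[A-Za-z])[0-9]+"

let <- regmatches(s, gregexpr(let_pat, s))[[1]]
num <- as.numeric(regmatches(s, gregexpr(num_pat, s, perl = TRUE))[[1]])

let
num
```

Because the digits inside the brackets follow "[" or another digit rather than a letter, the lookbehind skips them.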

1) strapply Create a regular expression, pat which matches the two parts and then extract each separately using strapply. The first capture group (first parenthesized portion of regular expression) consists of a left square bracket "\\[" the smallest string ".*?" until the right square bracket "\\]" followed by any character "." . The second capture group consists of one or more digits "\\d+".
library(gsubfn)
pat <- "(\\[.*?\\].)(\\d+)"
let <- strapply(s, pat, simplify = c)
num <- strapply(s, pat, ~ as.numeric(..2), simplify = c)
let
## [1] "[12]B" "[16]M"
num
## [1] 1 5
1a) Variation
This could also be expressed as this mapply producing a 2 component list:
mapply(strapply, s, pat, c(~ ..1, ~ as.numeric(..2)), simplify = "c",
SIMPLIFY = FALSE, USE.NAMES = FALSE)
## [[1]]
## [1] "[12]B" "[16]M"
##
## [[2]]
## [1] 1 5
2) gsub/read.table This uses no packages -- only gsub and read.table. pat is defined in (1). It returns a data frame with the results in two columns:
read.table(text = gsub(pat, "\\1 \\2\n", s), as.is = TRUE, col.names = c("let", "num"))
## let num
## 1 [12]B 1
## 2 [16]M 5
3) gsub/strsplit This is somewhat similar to (2) but uses strsplit rather than read.table. pat is from (1).
spl <- matrix(strsplit(gsub(pat, "\\1 \\2 ", s), " ")[[1]], 2)
let <- spl[1, ]
num <- as.numeric(spl[2, ])
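Another package-free variation, given here only as a sketch: first cut the string into complete chunks with gregexpr, then let strcapture (from the utils package that ships with R) split each chunk into its two capture groups. The names chunk_pat, proto and res are made up for this illustration.

```r
s <- "[12]B1[16]M5"

# First cut the string into complete chunks such as "[12]B1"
chunk_pat <- "\\[[^]]*\\][A-Za-z]+[0-9]+"
chunks <- regmatches(s, gregexpr(chunk_pat, s))[[1]]

# Then capture the two parts of each chunk; the prototype data frame
# fixes the column names and types of the result
proto <- data.frame(let = character(), num = numeric())
res <- utils::strcapture("^(\\[[^]]*\\][A-Za-z]+)([0-9]+)$", chunks, proto)
res
```

The result is a data frame with one row per chunk, with num already converted to numeric.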

Related

How do I convert to numeric an exponential stored as a char in R?

I have a file with numbers in scientific notation stored as 0.00684*10^0.0023. When I read the file with read.csv(), they are loaded as character strings, and trying to convert them with the as.numeric() function returns only NA.
a = "0.00684*10^0.0023"
as.numeric(a)
NA
Is there a way to force R to take on the variable as Scientific notation?
Here are a few ways. They use a from the question (also shown in the Note below) or c(a, a) to illustrate how to apply it to a vector. No packages are used.
1) Use eval/parse:
eval(parse(text = a))
## [1] 0.00687632
or for a vector such as c(a, a)
sapply(c(a, a), function(x) eval(parse(text = x)), USE.NAMES = FALSE)
## [1] 0.00687632 0.00687632
2) A different way is to split up the input and calculate it ourselves.
with(read.table(text = sub("\\*10\\^", " ", c(a, a))), V1 * 10^V2)
## [1] 0.00687632 0.00687632
3) A third way is to convert to complex numbers and then combine the real and imaginary parts to get the result.
tmp <- type.convert(sub("\\*10\\^(.*)", "+\\1i", c(a, a)), as.is = TRUE)
Re(tmp) * 10^Im(tmp)
## [1] 0.00687632 0.00687632
4) Another way to split it up is:
as.numeric(sub("\\*.*", "", c(a, a))) *
10^as.numeric(sub(".*\\^", "", c(a, a)))
## [1] 0.00687632 0.00687632
5) We could use any of the above to define a custom class which can read in fields of the required form. First we write out a test data file and then define a custom num class. In read.csv, use the colClasses argument to specify the class of each column. Use NA for those columns where we want read.csv to determine the class automatically.
# generate test file
cat("a,b\n1,0.00684*10^0.0023\n", file = "tmpfile.csv")
setAs("character", "num",
function(from) with(read.table(text = sub("\\*10\\^", " ", from)), V1 * 10^V2))
read.csv("tmpfile.csv", colClasses = c(NA, "num"))
## a b
## 1 1 0.00687632
With this definition we can also use as like this:
as(c(a, a), "num")
## [1] 0.00687632 0.00687632
Note
a <- "0.00684*10^0.0023"
One idea:
library(stringr)
a <- "0.00684*10^0.0023" # your input
b <- str_split(a, "\\*10\\^") # split the string at the "*10^" operator
b <- as.numeric(b[[1]]) # convert the two pieces to numbers
res <- b[1]*10^(b[2]) # compute mantissa * 10^exponent
res # the result
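If the values come from a file you do not fully control, evaluating them with eval/parse runs whatever code the file contains. A minimal sketch of a stricter alternative, assuming every valid field looks like mantissa*10^exponent with plain decimal numbers: parse_pow10 is a hypothetical helper name, and anything that does not match the pattern becomes NA instead of being evaluated.

```r
# Hypothetical helper: convert strings of the form "<mantissa>*10^<exponent>"
parse_pow10 <- function(x) {
  pat <- "^\\s*([-+]?[0-9.]+)\\*10\\^([-+]?[0-9.]+)\\s*$"
  m <- regmatches(x, regexec(pat, x))
  vapply(m, function(p) {
    if (length(p) != 3) return(NA_real_)   # not in the expected form
    as.numeric(p[2]) * 10^as.numeric(p[3]) # mantissa * 10^exponent
  }, numeric(1))
}

parse_pow10(c("0.00684*10^0.0023", "not a number"))
```

The first element gives the same value as the approaches above; the second is NA.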

extract numbers at specific position in a vector in r

I have lots of vectors and every element has 3 numbers and I want to extract it to different columns.
test <- '1.0226 [1.0109; 1.0344]'
What I expected is
rr <- 1.0226
low_95 <- 1.0109
up_95 <- 1.0344
I thought I should use the str_extract() function to do this, but I don't know how to write the regex.
rr: extract the number before [
low_95: extract the number between [ and ;
up_95: extract the number between ; and ]
Regex for extracting number before [ in R: *\\[.*
test <- '1.0226 [1.0109; 1.0344]'
rr <- gsub(" *\\[.*", "", test)
rr
# [1] "1.0226"
Regex for extracting number between [ and ; in R: .*\\[|;.*
test <- '1.0226 [1.0109; 1.0344]'
low_95 <- gsub(".*\\[|;.*", "", test)
low_95
# [1] "1.0109"
Regex for extracting number between ; and ] in R: .*; |].*
test <- '1.0226 [1.0109; 1.0344]'
up_95 <- gsub(".*; |].*", "", test)
up_95
# [1] "1.0344"
We could use strcapture from base R:
prt <- data.frame(rr = numeric(),low_95 = numeric(), up_95 = numeric())
strcapture("(\\d+\\.?\\d+)\\D+\\[((?1));\\s*((?1))\\]",test,prt,perl = TRUE)
rr low_95 up_95
1 1.0226 1.0109 1.0344
If test is going to hold several such values, you can use extract from tidyr.
library(magrittr) # provides the %>% pipe
data.frame(test) %>%
tidyr::extract(test, paste0('num', 1:3), '(.*)\\[(.*);\\s*(.*)\\]')
With data.table:
test <- '1.0226 [1.0109; 1.0344]'
data.table::tstrsplit(test, " \\[|; |\\]")
[[1]]
[1] "1.0226"
[[2]]
[1] "1.0109"
[[3]]
[1] "1.0344"
In case the numbers are every time at the same position you can use read.table after removing []; with gsub.
read.table(text=gsub("[][;]", "", test), col.names=c("rr","low_95","up_95"))
# rr low_95 up_95
#1 1.0226 1.0109 1.0344
Using base R with gregexpr we can just extract all the numbers, then assign them to separate variables:
test <- '1.0226 [1.0109; 1.0344]'
matches <- gregexpr('\\b\\d+(?:\\.\\d+)?\\b', test, perl=TRUE)
vec <- regmatches(test, matches)[[1]]
vec
[1] "1.0226" "1.0109" "1.0344"
rr <- vec[1]
low_95 <- vec[2]
up_95 <- vec[3]
A fun way of doing this is via gsub and multiple backreferences:
strsplit(gsub("(^\\d+\\.\\d+)\\s\\[(\\d+\\.\\d+);\\s(\\d+\\.\\d+)]", "\\1,\\2,\\3", test), ",")
[[1]]
[1] "1.0226" "1.0109" "1.0344"
From there you can assign the elements to the variables of your choice, for example:
rr <- unlist(strsplit(gsub("(^\\d+\\.\\d+)\\s\\[(\\d+\\.\\d+);\\s(\\d+\\.\\d+)]", "\\1,\\2,\\3", test), ","))[1]
rr
[1] "1.0226"
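Since the question mentions many such strings and separate columns, here is a minimal base R sketch that handles a whole character vector at once and returns numeric columns. It assumes every element contains exactly three numbers; the second element of test below is made-up example data.

```r
# The second element is made-up example data
test <- c("1.0226 [1.0109; 1.0344]", "0.9871 [0.9502; 1.0254]")

# Pull every number out of each string; each list element should hold three
nums <- regmatches(test, gregexpr("[0-9]+(\\.[0-9]+)?", test))
stopifnot(all(lengths(nums) == 3))

# Stack into a numeric matrix, then name the columns
res <- as.data.frame(do.call(rbind, lapply(nums, as.numeric)))
names(res) <- c("rr", "low_95", "up_95")
res
```

The stopifnot() call is there so that a malformed element fails loudly instead of silently shifting values into the wrong column.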

Remove all sentences where number to character ratio is greater than the average in the text

Is it possible to find and delete all sentences containing a higher number to character ratio?
I created the following function to calculate the ratio in a given string:
a <- "1aaaaaa2bbbbbbb3"
Num_Char_Ration <- function(string){
length(unlist(regmatches(string,gregexpr("[[:digit:]]",string))))/nchar(as.character(string))
}
Num_Char_Ration(a)
#0.1875
The task is now to find a method to calculate the ratio for each sentence (that is, for each character sequence ending with a ".") and then to delete the sentences with a higher ratio from the text. For example:
input:
a <- " aa111111. bbbbbb22. cccccc3."
output:
#"bbbbbb22. cccccc3."
I would use the stringr package to count digits and characters:
# Original data
input <- " aa111111. bbbbbb22. cccccc3."
# Split by .
inputSplit <- strsplit(input, "\\.")[[1]]
# Count digits and all alphanumeric characters in each split string
counts <- sapply(inputSplit, stringr::str_count, c("[[:digit:]]", "[[:alnum:]]"))
# Get ratios and collapse text back
paste(inputSplit[counts[1, ] / counts[2, ] < 0.5], collapse = ".")
# [1] " bbbbbb22. cccccc3"
counts looks like this:
# To get ratio between digits and string
# Divide first row by second row
aa111111 bbbbbb22 cccccc3
[1,] 6 2 1
[2,] 8 8 7
Here is a simple base solution:
x <- strsplit(input,"\\.")[[1]]
x <- x[nchar(x) < 2 * nchar(gsub("\\d","",x))]
paste(x,collapse=".")
# [1] " bbbbbb22. cccccc3"
You need to split your long string into single words first (for example with strsplit()).
data:
words <- c("aa111111.","bbbbbb22.","cccccc3.")
code:
library(magrittr)
fun1 <- function(x) {
num <- gsub("\\D","",x) %>% nchar
char <- gsub("[^A-Za-z]","",x,perl=TRUE) %>% nchar # [A-z] would also match [ \ ] ^ _ `
if(num <= char) return(x) else NULL
}
sapply(words,fun1) %>% unlist %>% unname
result:
#[1] "bbbbbb22." "cccccc3."
Here's how I would do it in base R. Adapted Andre's code.
my_string <- " aa111111. bbbbbb22. cccccc3."
#Split paragraph into sentences based on '.'
my_string <- unlist(strsplit(my_string, '(?<=\\.)\\s+', perl=TRUE))
#Removing sentences with more numbers than letters
my_string <- subset(my_string,nchar(gsub("\\D","",my_string)) <= nchar(gsub("[^A-Za-z]","",my_string,perl=TRUE)))
my_string
##[1] "bbbbbb22." "cccccc3."
If you then want to combine these sentences back into a paragraph, you can use
paste(my_string,collapse=" ")
##[1] "bbbbbb22. cccccc3."
# Simplified num to char ratio function
Num_Char_Ration <- function(x) {
lengths(regmatches(x, gregexpr("[0-9]", x))) / nchar(x)
}
clear_nmbstring <- function(x) {
x <- strsplit(x, ".", fixed = TRUE)[[1]]
cleanx <- trimws(x)
x <- x[Num_Char_Ration(cleanx) < 0.5]
paste(x, collapse = ".")
}
# Example:
string <- c(" aa111111. bbbbbb22. cccccc3.")
clear_nmbstring(string)
[1] " bbbbbb22. cccccc3"
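The answers above compare each ratio against a fixed threshold of 0.5, while the title asks for the average in the text. A minimal sketch of that variant, assuming "average" means the mean of the per-sentence ratios (it could also be read as the ratio of the whole text); drop_digit_heavy is a hypothetical name:

```r
# Hypothetical helper: drop sentences whose digit ratio exceeds the mean ratio
drop_digit_heavy <- function(text) {
  # Split on ".", trim surrounding whitespace, discard empty pieces
  sentences <- trimws(strsplit(text, ".", fixed = TRUE)[[1]])
  sentences <- sentences[nzchar(sentences)]
  # Digits per character for each sentence
  ratio <- lengths(regmatches(sentences, gregexpr("[0-9]", sentences))) /
    nchar(sentences)
  keep <- ratio <= mean(ratio)
  # Put the "." back and rejoin with single spaces
  paste0(sentences[keep], ".", collapse = " ")
}

drop_digit_heavy(" aa111111. bbbbbb22. cccccc3.")
```

For the example input the ratios are 0.75, 0.25 and about 0.14, with a mean of about 0.38, so only the first sentence is dropped.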

R extract string between nth and ith instance of delimiter

I have a vector of strings, similar to this one, but with many more elements:
s <- c("CGA-DV-558_T_90.67.0_DV_1541_07", "TC-V-576_T_90.0_DV_151_0", "TCA-DV-X_T_6.0_D_A2_07", "T-V-Z_T_2_D_A_0", "CGA-DV-AW0_T.1_24.4.0_V_A6_7", "ACGA-DV-A4W0_T_274.46.0_DV_A266_07")
And I would like to use a function that extracts the string between the nth and ith instances of the delimiter "_". For example, the string between the 2nd (n = 2) and 3rd (i = 3) instances, to get this:
[1] "90.67.0" "90.0" "6.0" "2" "24.4.0" "274.46.0"
Or if n = 4 and i = 5:
[1] "1541" "151" "A2" "A" "A6" "A266"
Any suggestions? Thank you for your help!
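You can do this with sub:
n = 2
i = 3
# drop everything up to and including the nth "_" (sub, so only the first match is removed)
pattern1 = paste0("(.*?_){", n, "}")
temp = sub(pattern1, "", s, perl = TRUE)
# keep the next i - n fields and discard the rest
pattern2 = paste0("((.*?_){", i-n, "}).*")
temp = sub(pattern2, "\\1", temp, perl = TRUE)
temp = sub("_$", "", temp)
temp
[1] "90.67.0" "90.0" "6.0" "2" "24.4.0" "274.46.0"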
#FUNCTION
foo = function(x, n, i){
do.call(c, lapply(x, function(X)
paste(unlist(strsplit(X, "_"))[(n+1):(i)], collapse = "_")))
}
#USAGE
foo(x = s, n = 3, i = 5)
#[1] "DV_1541" "DV_151" "D_A2" "D_A" "V_A6" "DV_A266"
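The split-and-paste function above returns "NA" pieces pasted into the result when a string has too few fields. A minimal sketch of a stricter variant that returns NA for such strings instead; between_delims is a hypothetical name, and the last element of s below is made up to show the NA case.

```r
# Hypothetical helper: text between the nth and ith occurrence of delim
between_delims <- function(x, n, i, delim = "_") {
  stopifnot(n >= 0, n < i)
  pieces <- strsplit(x, delim, fixed = TRUE)
  vapply(pieces, function(p) {
    if (length(p) < i) return(NA_character_)  # too few pieces in this string
    paste(p[(n + 1):i], collapse = delim)
  }, character(1))
}

# The last element is made-up data with too few fields
s <- c("CGA-DV-558_T_90.67.0_DV_1541_07", "T-V-Z_T_2_D_A_0", "a_b")
between_delims(s, 2, 3)
between_delims(s, 3, 5)
```

Using vapply with character(1) guarantees one result per input string, so the output always has the same length as x.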
A third method, that uses substring for the extraction and gregexpr to find the positions is
# extract positions of "_" from each vector element, returns a list
spots <- gregexpr("_", s, fixed=TRUE)
# extract text in between third and fifth underscores
substring(s, sapply(spots, "[", 3) + 1, sapply(spots, "[", 5) - 1)
#[1] "DV_1541" "DV_151" "D_A2" "D_A" "V_A6" "DV_A266"

Combining elements in a string vector with defined element size and accounting for uneven sizes

Given a vector:
vec <- c(LETTERS[1:10])
I would like to be able to combine it in a following manner:
resA <- c("AB", "CD", "EF", "GH", "IJ")
resB <- c("ABCDEF","GHIJ")
where elements of the vector vec are merged together according to the desired size of a new element constituting the resulting vector. This is 2 in case of resA and 6 in case of resB.
Desired solution characteristics
The solution should allow for flexibility with respect to the element sizes, i.e. I may want to have vectors with elements of size 2 or 20
There may be not enough elements in the vector to match the desired chunk size, in that case last element should be shortened accordingly (as shown)
This shouldn't make a difference, but the solution should work on words as well
Attempts
Initially, I was thinking of using something on the lines:
c(
paste0(vec[1:2], collapse = ""),
paste0(vec[3:4], collapse = ""),
paste0(vec[5:6], collapse = "")
# ...
)
but this would have to be adapted to jump through the remaining pairs/bigger groups of the vec and handle last group which often would be of a smaller size.
Here is what I came up with. Using Harlan's idea in this question, you can split the vector into chunks of a given size. You can then apply your paste0() idea to each chunk with lapply(). Finally, you unlist the resulting list.
unlist(lapply(split(vec, ceiling(seq_along(vec)/2)), function(x){paste0(x, collapse = "")}))
# 1 2 3 4 5
#"AB" "CD" "EF" "GH" "IJ"
unlist(lapply(split(vec, ceiling(seq_along(vec)/5)), function(x){paste0(x, collapse = "")}))
# 1 2
#"ABCDE" "FGHIJ"
unlist(lapply(split(vec, ceiling(seq_along(vec)/3)), function(x){paste0(x, collapse = "")}))
# 1 2 3 4
#"ABC" "DEF" "GHI" "J"
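The same split/ceiling idea can be wrapped in a small function so the chunk size becomes a parameter. This is only a sketch; chunk_paste is a hypothetical name, and vapply is used instead of lapply/unlist so the result is always an unnamed character vector.

```r
# Hypothetical wrapper around the split/ceiling idea
chunk_paste <- function(x, n) {
  # Group index 1, 1, ..., 2, 2, ...; the last group may be shorter than n
  groups <- ceiling(seq_along(x) / n)
  vapply(split(x, groups), paste0, character(1), collapse = "",
         USE.NAMES = FALSE)
}

vec <- LETTERS[1:10]
chunk_paste(vec, 2)
chunk_paste(vec, 6)
chunk_paste(c("foo", "bar", "baz"), 2)  # works on words as well
```

With n = 6 the last chunk is shortened to the remaining four letters, as requested in the question.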
vec <- c(LETTERS[1:10])
f1 <- function(x, n){
f <- function(x) paste0(x, collapse = '')
regmatches(f(x), gregexpr(f(rep('.', n)), f(x)))[[1]]
}
f1(vec, 2)
# [1] "AB" "CD" "EF" "GH" "IJ"
or
f2 <- function(x, n)
apply(matrix(x, nrow = n), 2, paste0, collapse = '')
f2(vec, 5)
# [1] "ABCDE" "FGHIJ"
or
f3 <- function(x, n) {
f <- function(x) paste0(x, collapse = '')
strsplit(gsub(sprintf('(%s)', f(rep('.', n))), '\\1 ', f(x)), '\\s+')[[1]]
}
f3(vec, 4)
# [1] "ABCD" "EFGH" "IJ"
I would say the last is the best of these, since for the others n must be a factor of the length of x, or you will get warnings, recycling, or dropped elements.
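The regmatches approach can be made to keep a shorter last chunk by matching "between 1 and n characters" instead of "exactly n characters". A minimal sketch; f1b is a hypothetical name, and like the regex-based functions above it works on individual characters after collapsing x.

```r
# Hypothetical variant of the regmatches approach: the greedy {1,n}
# quantifier takes n characters where possible and fewer at the end
f1b <- function(x, n) {
  s <- paste0(x, collapse = "")
  regmatches(s, gregexpr(sprintf(".{1,%d}", n), s))[[1]]
}

vec <- LETTERS[1:10]
f1b(vec, 4)
f1b(vec, 5)
```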
edit - more
f4 <- function(x, n) {
f <- function(x) paste0(x, collapse = '')
Vectorize(substring, USE.NAMES = FALSE)(f(x), which((seq_along(x) %% n) == 1),
which((seq_along(x) %% n) == 0))
}
f4(vec, 2)
# [1] "AB" "CD" "EF" "GH" "IJ"
or
f5 <- function(x, n)
mapply(function(x) paste0(x, collapse = ''),
split(x, c(0, head(cumsum(rep_len(sequence(n), length(x)) %in% n), -1))),
USE.NAMES = FALSE)
f5(vec, 4)
# [1] "ABCD" "EFGH" "IJ"
Here is another way, working with the original array.
A side note: working with words is not straightforward, since there are at least two ways to understand it. You can either keep each word as a unit or collapse the words first and then work on individual characters. The next function can deal with both options.
vec <- c(LETTERS[1:10])
vec2 <- c("AB","CDE","F","GHIJ")
cuts <- function(x, n, bychar=F) {
if (bychar) x <- unlist(strsplit(paste0(x, collapse=""), ""))
ii <- seq_along(x)
li <- split(ii, ceiling(ii/n))
return(sapply(li, function(y) paste0(x[y], collapse="")))
}
cuts(vec2,2,F)
# 1 2
# "ABCDE" "FGHIJ"
cuts(vec2,2,T)
# 1 2 3 4 5
# "AB" "CD" "EF" "GH" "IJ"
