extract numbers at specific position in a vector in r

I have many vectors in which every element contains three numbers, and I want to extract them into separate columns.
test <- '1.0226 [1.0109; 1.0344]'
What I expected is
rr <- 1.0226
low_95 <- 1.0109
up_95 <- 1.0344
I thought I should use the str_extract() function to do this, but I don't know how to write the regex.
rr: the number before [
low_95: the number between [ and ;
up_95: the number between ; and ]

Regex for extracting the number before [ in R: " *\\[.*"
test <- '1.0226 [1.0109; 1.0344]'
rr <- gsub(" *\\[.*", "", test)
rr
# [1] "1.0226"
Regex for extracting the number between [ and ; in R: ".*\\[|;.*"
test <- '1.0226 [1.0109; 1.0344]'
low_95 <- gsub(".*\\[|;.*", "", test)
low_95
# [1] "1.0109"
Regex for extracting the number between ; and ] in R: ".*; |].*"
test <- '1.0226 [1.0109; 1.0344]'
up_95 <- gsub(".*; |].*", "", test)
up_95
# [1] "1.0344"
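Since gsub is vectorised, the same three patterns can be applied to a whole vector in one go. Below is a minimal sketch of that idea, assuming a hypothetical vector tests (only its first element comes from the question; the second is made up for illustration) and converting the extracted strings with as.numeric:

```r
# Hypothetical input: the first element is from the question, the second is made up.
tests <- c("1.0226 [1.0109; 1.0344]", "0.9871 [0.9702; 1.0043]")

# gsub() is vectorised, so each pattern is applied to every element at once.
res <- data.frame(
  rr     = as.numeric(gsub(" *\\[.*", "", tests)),    # drop everything from " [" onwards
  low_95 = as.numeric(gsub(".*\\[|;.*", "", tests)),  # keep what sits between "[" and ";"
  up_95  = as.numeric(gsub(".*; |\\].*", "", tests))  # keep what sits between "; " and "]"
)
res
```

The result is a data frame with one row per input element and three numeric columns, which is probably closer to the "different columns" the question asks for than three separate character values.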

We could use strcapture from base R:
prt <- data.frame(rr = numeric(),low_95 = numeric(), up_95 = numeric())
strcapture("(\\d+\\.?\\d+)\\D+\\[((?1));\\s*((?1))\\]",test,prt,perl = TRUE)
rr low_95 up_95
1 1.0226 1.0109 1.0344

If test is going to contain other values as well, you can use extract() from tidyr.
library(magrittr) # provides %>%
data.frame(test) %>%
tidyr::extract(test, paste0('num', 1:3), '(.*)\\[(.*);\\s*(.*)\\]')

With data.table:
test <- '1.0226 [1.0109; 1.0344]'
data.table::tstrsplit(test, " \\[|; |\\]")
[[1]]
[1] "1.0226"
[[2]]
[1] "1.0109"
[[3]]
[1] "1.0344"

If the numbers are always in the same position, you can use read.table after removing [, ] and ; with gsub.
read.table(text=gsub("[][;]", "", test), col.names=c("rr","low_95","up_95"))
# rr low_95 up_95
#1 1.0226 1.0109 1.0344

Using base R with gregexpr we can extract all the numbers, then assign them to separate variables:
test <- '1.0226 [1.0109; 1.0344]'
matches <- gregexpr('\\b\\d+(?:\\.\\d+)?\\b', test, perl=TRUE)
vec <- regmatches(test, matches)[[1]]
vec
# [1] "1.0226" "1.0109" "1.0344"
rr <- vec[1]
low_95 <- vec[2]
up_95 <- vec[3]
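The same "grab every number" idea extends to a vector of strings. Here is a rough sketch, assuming every element holds exactly three numbers; the vector tests and the names found, num_mat and res2 are made up for illustration:

```r
# Hypothetical input vector; the first element is from the question.
tests <- c("1.0226 [1.0109; 1.0344]", "0.9871 [0.9702; 1.0043]")

# One character vector of matches per input element.
found <- regmatches(tests, gregexpr("[0-9]+(\\.[0-9]+)?", tests))

# Assumption: every element contains exactly three numbers.
stopifnot(all(lengths(found) == 3))

# Lay the matches out row by row and convert them to numeric columns.
num_mat <- matrix(as.numeric(unlist(found)), ncol = 3, byrow = TRUE,
                  dimnames = list(NULL, c("rr", "low_95", "up_95")))
res2 <- as.data.frame(num_mat)
res2
```

The stopifnot() check is there because a row with a missing or extra number would otherwise silently shift values into the wrong column.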

A fun way of doing this is via gsub and multiple backreferences:
strsplit(gsub("(^\\d+\\.\\d+)\\s\\[(\\d+\\.\\d+);\\s(\\d+\\.\\d+)]", "\\1,\\2,\\3", test), ",")
[[1]]
[1] "1.0226" "1.0109" "1.0344"
From there you can proceed by assigning the elements to your vectors of choice, for example:
rr <- unlist(strsplit(gsub("(^\\d+\\.\\d+)\\s\\[(\\d+\\.\\d+);\\s(\\d+\\.\\d+)]", "\\1,\\2,\\3", test), ","))[1]
[1] "1.0226"

Related

How do I convert to numeric an exponential stored as a char in R?

I have a file with numbers in scientific notation stored as 0.00684*10^0.0023. When I read the file with read.csv(), they are loaded as character strings, and trying to convert them with the as.numeric() function returns only NA.
a = "0.00684*10^0.0023"
as.numeric(a)
NA
Is there a way to force R to interpret the value as scientific notation?

Here are a few ways. They use a from the question (also shown in the Note below) or c(a, a) to illustrate how to apply it to a vector. No packages are used.
1) Use eval/parse:
eval(parse(text = a))
## [1] 0.00687632
or for a vector such as c(a, a)
sapply(c(a, a), function(x) eval(parse(text = x)), USE.NAMES = FALSE)
## [1] 0.00687632 0.00687632
2) A different way is to split up the input and calculate it ourselves.
with(read.table(text = sub("\\*10\\^", " ", c(a, a))), V1 * 10^V2)
## [1] 0.00687632 0.00687632
3) A third way is to convert to complex numbers and then combine the real and imaginary parts to get the result.
tmp <- type.convert(sub("\\*10\\^(.*)", "+\\1i", c(a, a)), as.is = TRUE)
Re(tmp) * 10^Im(tmp)
## [1] 0.00687632 0.00687632
4) Another way to split it up is:
as.numeric(sub("\\*.*", "", c(a, a))) *
10^as.numeric(sub(".*\\^", "", c(a, a)))
## [1] 0.00687632 0.00687632
5) We could use any of the above to define a custom class which can read in fields of the required form. First we write out a test data file and then define a custom num class. In read.csv use the colClasses argument to specify which class each column has. Use NA for those columns where we want read.csv to determine the class automatically.
# generate test file
cat("a,b\n1,0.00684*10^0.0023\n", file = "tmpfile.csv")
setAs("character", "num",
function(from) with(read.table(text = sub("\\*10\\^", " ", from)), V1 * 10^V2))
read.csv("tmpfile.csv", colClasses = c(NA, "num"))
## a b
## 1 1 0.00687632
With this definition we can also use as like this:
as(c(a, a), "num")
## [1] 0.00687632 0.00687632
Note
a <- "0.00684*10^0.0023"
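If the file might contain malformed entries, it can help to validate the format before computing anything, particularly since eval/parse in (1) will evaluate whatever expression the string contains. The following is only a sketch along the lines of (4); parse_pow10 is a hypothetical helper name, and the pattern assumes plain decimal mantissa and exponent:

```r
# Hypothetical helper: parse strings of the form "<mantissa>*10^<exponent>".
# Anything that does not match the expected shape becomes NA.
parse_pow10 <- function(x) {
  pat <- "^[[:space:]]*([-+]?[0-9.]+)\\*10\\^([-+]?[0-9.]+)[[:space:]]*$"
  ok <- grepl(pat, x)
  out <- rep(NA_real_, length(x))
  mantissa <- as.numeric(sub(pat, "\\1", x[ok]))
  exponent <- as.numeric(sub(pat, "\\2", x[ok]))
  out[ok] <- mantissa * 10^exponent
  out
}

parse_pow10(c("0.00684*10^0.0023", "not a number"))
```

The first element should come out as the same value as in the answers above, and the second as NA rather than an error.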

One idea:
library(stringr)
a <- "0.00684*10^0.0023" # your input
b <- str_split(a, "\\*10\\^") # split the string at the "*10^" operator
b <- as.numeric(b[[1]]) # convert the two parts to numbers
res <- b[1] * 10^b[2] # compute mantissa * 10^exponent
res # the result

extract different strings after match using R

data <- c("Demand = 001 979", "Demand = -08 976 (154)", "Demand = -01 975 (359)")
data <- str_match(data, pattern = ("Demand = (.*) (.*)"))
I need to extract the first two sets of numbers (including the - sign) into columns using str_match.
The third set of numbers, in parentheses (), should be excluded.
Any help is welcome.
Expected output:
## [1] "001" "-08" "-01"
## [2] "979" "976" "975"

How about removing everything else?
data <- c("Demand = 001 979", "Demand = -08 976 (154)", "Demand = -01 975 (359)")
data <- gsub("Demand = ", "", x = data) # drop the prefix
data <- trimws(gsub("\\(.*\\)", "", x = data)) # drop the part in parentheses
data <- strsplit(data, " ") # split each string into its two numbers
out <- list()
out[[1]] <- sapply(data, "[", 1)
out[[2]] <- sapply(data, "[", 2)
out
[[1]]
[1] "001" "-08" "-01"
[[2]]
[1] "979" "976" "975"

A possibility with str_extract_all() from stringr:
library(stringr)
x <- c("Demand = 001 979", "Demand = -08 976 (154)", "Demand = -01 975 (359)")
sapply(str_extract_all(x, "-?[0-9]+"), function(x) x[1])
[1] "001" "-08" "-01"
sapply(str_extract_all(x, "-?[0-9]+"), function(x) x[2])
[1] "979" "976" "975"
Or using the idea of @Roman Luštrik with strsplit():
sapply(strsplit(gsub("Demand = ", "", x), " "), function(x) x[1])
[1] "001" "-08" "-01"
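Since the question asks for capture groups, here is a sketch of the same extraction using base R's regexec and regmatches, which return the full match followed by each captured group. The pattern is my own assumption about the format (a fixed "Demand = " prefix, an optionally negative first number, a space, then a second number):

```r
data <- c("Demand = 001 979", "Demand = -08 976 (154)", "Demand = -01 975 (359)")

# Two capture groups: an optionally negative number, then a second number.
# The trailing "(154)" part is simply not matched.
m <- regmatches(data, regexec("Demand = (-?[0-9]+) ([0-9]+)", data))

# Each list element is c(full match, group 1, group 2); stack them as rows.
parts <- do.call(rbind, m)
first  <- parts[, 2]
second <- parts[, 3]
first
# [1] "001" "-08" "-01"
second
# [1] "979" "976" "975"
```

stringr's str_match(data, pattern) should give a matrix with the same layout (full match in the first column, groups in the following ones), so the same pattern can be dropped into the original attempt.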

Remove all sentences where number to character ratio is greater than the average in the text

Is it possible to find and delete all sentences whose number-to-character ratio is higher than the average?
I created the following function to calculate the ratio in a given string:
a <- "1aaaaaa2bbbbbbb3"
Num_Char_Ration <- function(string){
length(unlist(regmatches(string,gregexpr("[[:digit:]]",string))))/nchar(as.character(string))
}
Num_Char_Ration(a)
#0.1875
The task now is to calculate the ratio for each sentence (that is, each character sequence ending with a ".") and then to delete the sentences with a higher ratio from the text. For example:
input:
a <- " aa111111. bbbbbb22. cccccc3."
output:
#"bbbbbb22. cccccc3."

I would use stringr package to count digits and characters:
# Original data
input <- " aa111111. bbbbbb22. cccccc3."
# Split by .
inputSplit <- strsplit(input, "\\.")[[1]]
# Count digits and all alnum in splitted string
counts <- sapply(inputSplit, stringr::str_count, c("[[:digit:]]", "[[:alnum:]]"))
# Get ratios and collapse text back
paste(inputSplit[counts[1, ] / counts[2, ] < 0.5], collapse = ".")
# [1] " bbbbbb22. cccccc3"
counts looks like this (dividing the first row by the second row gives the ratio of digits to alphanumeric characters):
aa111111 bbbbbb22 cccccc3
[1,] 6 2 1
[2,] 8 8 7

Here is a simple base solution:
x <- strsplit(input,"\\.")[[1]]
x <- x[nchar(x) < 2 * nchar(gsub("\\d","",x))]
paste(x,collapse=".")
# [1] " bbbbbb22. cccccc3"

You need to split your long string into single words first (with strsplit(), for example).
data:
words <- c("aa111111.","bbbbbb22.","cccccc3.")
code:
library(magrittr)
fun1 <- function(x) {
num <- gsub("\\D","",x) %>% nchar
char <- gsub("[^A-Za-z]","",x,perl=TRUE) %>% nchar
if(num <= char) return(x) else NULL
}
sapply(words,fun1) %>% unlist %>% unname
result:
#[1] "bbbbbb22." "cccccc3."

Here's how I would do it in base R. Adapted Andre's code.
my_string <- " aa111111. bbbbbb22. cccccc3."
#Split paragraph into sentences based on '.'
my_string <- unlist(strsplit(my_string, '(?<=\\.)\\s+', perl=TRUE))
#Removing sentences with more numbers than letters
my_string <- subset(my_string,nchar(gsub("\\D","",my_string)) <= nchar(gsub("[^A-Za-z]","",my_string,perl=TRUE)))
my_string
##[1] "bbbbbb22." "cccccc3."
If you then want to combine these sentences back into a paragraph, you can use
paste(my_string,collapse=" ")
##[1] "bbbbbb22. cccccc3."

# Simplified num to char ratio function
Num_Char_Ration <- function(x) {
lengths(regmatches(x, gregexpr("[0-9]", x))) / nchar(x)
}
clear_nmbstring <- function(x) {
x <- strsplit(x, ".", fixed = TRUE)[[1]]
cleanx <- trimws(x)
x <- x[Num_Char_Ration(cleanx) < 0.5]
paste(x, collapse = ".")
}
# Example:
string <- c(" aa111111. bbbbbb22. cccccc3.")
clear_nmbstring(string)
[1] " bbbbbb22. cccccc3"
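The answers above use a fixed cut-off (0.5, or "more digits than letters"). If the threshold should really be the average ratio in the text, as the title suggests, something like the following sketch may be closer. It assumes that "the average" means the mean of the per-sentence ratios and that sentences are separated by "."; the names sentences, ratio, keep and result are made up for illustration:

```r
input <- " aa111111. bbbbbb22. cccccc3."

# Split into sentences at "." and strip surrounding whitespace.
sentences <- trimws(strsplit(input, ".", fixed = TRUE)[[1]])

# Digit-to-character ratio of each sentence.
ratio <- lengths(regmatches(sentences, gregexpr("[0-9]", sentences))) / nchar(sentences)

# Assumption: "the average" is the mean of the per-sentence ratios.
keep <- sentences[ratio <= mean(ratio)]

# Put the remaining sentences back together, restoring the full stops.
result <- paste0(paste(keep, collapse = ". "), ".")
result
# [1] "bbbbbb22. cccccc3."
```

For the example the ratios are 0.75, 0.25 and about 0.14, so the mean is about 0.38 and only the first sentence is dropped.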

How to split letters with bracket and numbers in R?

The string is s = '[12]B1[16]M5'
I want to split it as the following results with strsplit function in R:
let <- c('[12]B', '[16]M')
num <- c(1, 5)
Thanks a lot

You could use a regular expression for your task.
s = '[12]B1[16]M22'
grx <- gregexpr("\\[[^]]+\\][[:alpha:]]+", s)
let <- do.call(c, regmatches(s, grx))
#let
#[1] "[12]B" "[16]M"
If you want to get all chunks (let + num), you can tweak the pattern as below. This facilitates extracting the numeric part.
grx <- gregexpr("\\[[^]]+\\][[:alpha:]]+[[:digit:]]+", s)
out <- do.call(c, regmatches(s, grx))
num <- gsub(".+\\][[:alpha:]]+", "", out)
num
[1] "1" "22"

Using the stringr package:
library(stringr)
x <- '[12]B1[16]M2'
let <- unlist(str_extract_all(x, "\\[[0-9]{2}\\][A-Z]"))
x <- gsub(pattern = "\\[[0-9]{2}\\][A-Z]",
replacement = "",
x)
num <- unlist(str_extract_all(x, "[0-9]+"))
The regular expression "\\[[0-9]{2}\\][A-Z]" can be broken down as:
\\[ an opening bracket
[0-9]{2} exactly two consecutive digits
\\] a closing bracket
[A-Z] exactly one upper-case letter
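The breakdown above can also be tried out without stringr. This is a sketch in base R, assuming each chunk has the shape "[digits]LETTERS digits" with any number of digits in each place; the names chunks, let and num follow the question, but the pattern itself is my own generalisation:

```r
s <- "[12]B1[16]M5"

# Pull out whole chunks such as "[12]B1": bracketed digits, letters, trailing digits.
chunks <- regmatches(s, gregexpr("\\[[0-9]+\\][A-Z]+[0-9]+", s))[[1]]

# The letter part is the chunk without its trailing digits.
let <- sub("[0-9]+$", "", chunks)

# The number part is whatever follows the last non-digit character.
num <- as.numeric(sub(".*[^0-9]", "", chunks))

let
# [1] "[12]B" "[16]M"
num
# [1] 1 5
```

Matching the whole chunk first keeps each letter part paired with its own number, which matters once the numbers have more than one digit.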

1) strapply. Create a regular expression, pat, which matches the two parts, and then extract each part separately using strapply. The first capture group (the first parenthesized portion of the regular expression) consists of a left square bracket "\\[", the smallest string ".*?" up to the right square bracket "\\]", followed by any character ".". The second capture group consists of one or more digits "\\d+".
library(gsubfn)
pat <- "(\\[.*?\\].)(\\d+)"
let <- strapply(s, pat, simplify = c)
num <- strapply(s, pat, ~ as.numeric(..2), simplify = c)
let
## [1] "[12]B" "[16]M"
num
## [1] 1 5
1a) Variation
This could also be expressed as this mapply producing a 2 component list:
mapply(strapply, s, pat, c(~ ..1, ~ as.numeric(..2)), simplify = "c",
SIMPLIFY = FALSE, USE.NAMES = FALSE)
## [[1]]
## [1] "[12]B" "[16]M"
##
## [[2]]
## [1] 1 5
2) gsub/read.table. This uses no packages, only gsub and read.table. pat is defined in (1). It returns a data frame with the results in two columns:
read.table(text = gsub(pat, "\\1 \\2\n", s), as.is = TRUE, col.names = c("let", "num"))
## let num
## 1 [12]B 1
## 2 [16]M 5
3) gsub/strsplit This is somewhat similar to (2) but uses strsplit rather than read.table. pat is from (1).
spl <- matrix(strsplit(gsub(pat, "\\1 \\2 ", s), " ")[[1]], 2)
let <- spl[1, ]
num <- as.numeric(spl[2, ])

Extract characters that differ between two strings

I have used adist to calculate the number of characters that differ between two strings:
a <- "Happy day"
b <- "Tappy Pay"
adist(a,b) # result 2
Now I would like to extract those characters that differ. In my example, I would like to get the string "Hd" (or "TP", it doesn't matter).
I tried to look in adist, agrep and stringi but found nothing.

You can use the following sequence of operations:
Split the strings using strsplit().
Use setdiff() to compare the elements.
Wrap in a reducing function.
Try this:
Reduce(setdiff, strsplit(c(a, b), split = ""))
[1] "H" "d"

Split into letters and take the difference as sets:
> setdiff(strsplit(a,"")[[1]],strsplit(b,"")[[1]])
[1] "H" "d"

Not really proud of this, but it seems to do the job:
sapply(setdiff(utf8ToInt(a), utf8ToInt(b)), intToUtf8)
Results:
[1] "H" "d"

You can use one of the variables as a regex character class and gsub out from the other one:
gsub(paste0("[",a,"]"),"",b)
[1] "TP"
gsub(paste0("[",b,"]"),"",a)
[1] "Hd"

As long as a and b have the same length we can do this:
s.a <- strsplit(a, "")[[1]]
s.b <- strsplit(b, "")[[1]]
paste(s.a[s.a != s.b], collapse = "")
giving:
[1] "Hd"
This seems straightforward in terms of clarity of the code and seems tied for the fastest of the solutions provided here although I think I prefer f3:
f1 <- function(a, b)
paste(setdiff(strsplit(a,"")[[1]],strsplit(b,"")[[1]]), collapse = "")
f2 <- function(a, b)
paste(sapply(setdiff(utf8ToInt(a), utf8ToInt(b)), intToUtf8), collapse = "")
f3 <- function(a, b)
paste(Reduce(setdiff, strsplit(c(a, b), split = "")), collapse = "")
f4 <- function(a, b) {
s.a <- strsplit(a, "")[[1]]
s.b <- strsplit(b, "")[[1]]
paste(s.a[s.a != s.b], collapse = "")
}
a <- "Happy day"
b <- "Tappy Pay"
library(rbenchmark)
benchmark(f1, f2, f3, f4, replications = 10000, order = "relative")[1:4]
giving the following on a fresh session on my laptop:
test replications elapsed relative
3 f3 10000 0.07 1.000
4 f4 10000 0.07 1.000
1 f1 10000 0.09 1.286
2 f2 10000 0.10 1.429
I have assumed that the differences must be in the corresponding character positions. You might want to clarify if that is the intention or not.
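To make that distinction concrete, here is a small sketch contrasting the set-based approach (setdiff) with the position-based one. The function names diff_set and diff_pos are made up for illustration, and the second pair of strings is a hypothetical example chosen so that the two approaches disagree:

```r
# Set-based difference: characters of a that occur nowhere in b.
diff_set <- function(a, b)
  paste(setdiff(strsplit(a, "")[[1]], strsplit(b, "")[[1]]), collapse = "")

# Position-based difference: characters of a that differ from b at the same index.
diff_pos <- function(a, b) {
  s.a <- strsplit(a, "")[[1]]
  s.b <- strsplit(b, "")[[1]]
  stopifnot(length(s.a) == length(s.b))  # only defined for equal lengths
  paste(s.a[s.a != s.b], collapse = "")
}

diff_set("Happy day", "Tappy Pay")
# [1] "Hd"
diff_pos("Happy day", "Tappy Pay")
# [1] "Hd"

# Same characters in a different order: the two approaches disagree.
diff_set("abc", "bca")
# [1] ""
diff_pos("abc", "bca")
# [1] "abc"
```

For the strings in the question both give "Hd", so which one is right depends on whether the order of characters matters for your data.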

The following function could be a better option for problems like this.
list.string.diff <- function(a, b, exclude = c("-", "?"), ignore.case = TRUE, show.excluded = FALSE)
{
if(nchar(a)!=nchar(b)) stop("Lengths of input strings differ. Please check your input.")
if(ignore.case)
{
a <- toupper(a)
b <- toupper(b)
}
split_seqs <- strsplit(c(a, b), split = "")
only.diff <- (split_seqs[[1]] != split_seqs[[2]])
only.diff[
(split_seqs[[1]] %in% exclude) |
(split_seqs[[2]] %in% exclude)
] <- NA
diff.info<-data.frame(which(is.na(only.diff)|only.diff),
split_seqs[[1]][only.diff],split_seqs[[2]][only.diff])
names(diff.info)<-c("position","poly.seq.a","poly.seq.b")
if(!show.excluded) diff.info<-na.omit(diff.info)
diff.info
}
from https://www.r-bloggers.com/extract-different-characters-between-two-strings-of-equal-length/
Then you can run
list.string.diff(a, b)
to get the difference.
