I have many strings, each containing three numbers, and I want to extract those numbers into separate columns.
test <- '1.0226 [1.0109; 1.0344]'
What I expected is
rr <- 1.0226
low_95 <- 1.0109
up_95 <- 1.0344
I thought I should use the str_extract() function to do this, but I don't know how to write the regex.
rr: extract the number before [
low_95: extract the number between [ and ;
up_95: extract the number between ; and ]
To get the number before [ in R, remove the optional spaces, the [ and everything after it with the regex " *\\[.*":
test <- '1.0226 [1.0109; 1.0344]'
rr <- gsub(" *\\[.*", "", test)
rr
# [1] "1.0226"
To get the number between [ and ; in R, remove everything up to [ and everything from ; onward with the regex ".*\\[|;.*":
test <- '1.0226 [1.0109; 1.0344]'
low_95 <- gsub(".*\\[|;.*", "", test)
low_95
# [1] "1.0109"
To get the number between ; and ] in R, remove everything up to "; " and everything from ] onward with the regex ".*; |].*":
test <- '1.0226 [1.0109; 1.0344]'
up_95 <- gsub(".*; |].*", "", test)
up_95
# [1] "1.0344"
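Since gsub() is vectorized, the three patterns above can be applied to a whole character vector at once and collected into numeric columns. A minimal sketch, assuming a hypothetical input vector tests (its second element is invented for illustration) and a hypothetical result name ci:

```r
# Hypothetical input; only the first string comes from the question
tests <- c("1.0226 [1.0109; 1.0344]", "0.9871 [0.9702; 1.0043]")

# Apply each pattern to the whole vector and convert the results to numeric
ci <- data.frame(
  rr     = as.numeric(gsub(" *\\[.*", "", tests)),
  low_95 = as.numeric(gsub(".*\\[|;.*", "", tests)),
  up_95  = as.numeric(gsub(".*; |].*", "", tests))
)
ci
```

Converting with as.numeric() at this point means the columns can be used in calculations directly instead of staying as character strings.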
We could use strcapture() from base R:
prt <- data.frame(rr = numeric(),low_95 = numeric(), up_95 = numeric())
strcapture("(\\d+\\.?\\d+)\\D+\\[((?1));\\s*((?1))\\]",test,prt,perl = TRUE)
rr low_95 up_95
1 1.0226 1.0109 1.0344
If test is going to hold more values, you can use extract() from tidyr (the %>% pipe comes from magrittr):
library(magrittr)  # provides %>%
data.frame(test) %>%
tidyr::extract(test, paste0('num', 1:3), '(.*)\\[(.*);\\s*(.*)\\]')
With data.table:
test <- '1.0226 [1.0109; 1.0344]'
data.table::tstrsplit(test, " \\[|; |\\]")
[[1]]
[1] "1.0226"
[[2]]
[1] "1.0109"
[[3]]
[1] "1.0344"
If the numbers are always in the same positions, you can use read.table after removing [, ] and ; with gsub.
read.table(text=gsub("[][;]", "", test), col.names=c("rr","low_95","up_95"))
# rr low_95 up_95
#1 1.0226 1.0109 1.0344
Using base R with gregexpr we can just extract out all numbers, then assign to separate variables:
test <- '1.0226 [1.0109; 1.0344]'
matches <- gregexpr('\\b\\d+(?:\\.\\d+)?\\b', test, perl=TRUE)
vec <- regmatches(test, matches)[[1]]
vec
# [1] "1.0226" "1.0109" "1.0344"
rr <- vec[1]
low_95 <- vec[2]
up_95 <- vec[3]
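The same extraction also works on a whole vector, because gregexpr() and regmatches() return one set of matches per input string. A sketch, assuming every string contains exactly three numbers (tests and ci_mat are hypothetical names, and the second string is made up):

```r
# Hypothetical input; only the first string comes from the question
tests <- c("1.0226 [1.0109; 1.0344]", "0.9871 [0.9702; 1.0043]")

# One list element of matched numbers per input string
matches <- gregexpr("[0-9]+(\\.[0-9]+)?", tests)
nums <- regmatches(tests, matches)

# Bind the per-string matches into a numeric matrix, one row per string
ci_mat <- do.call(rbind, lapply(nums, as.numeric))
colnames(ci_mat) <- c("rr", "low_95", "up_95")
ci_mat
```

If some strings might hold a different count of numbers, rbind() would recycle or fail, so checking lengths(nums) first would be a sensible guard.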
A fun way of doing this is via gsub() and multiple backreferences:
strsplit(gsub("(^\\d+\\.\\d+)\\s\\[(\\d+\\.\\d+);\\s(\\d+\\.\\d+)]", "\\1,\\2,\\3", test), ",")
[[1]]
[1] "1.0226" "1.0109" "1.0344"
From there you can proceed by assigning the elements to your vectors of choice, for example:
rr <- unlist(strsplit(gsub("(^\\d+\\.\\d+)\\s\\[(\\d+\\.\\d+);\\s(\\d+\\.\\d+)]", "\\1,\\2,\\3", test), ","))[1]
[1] "1.0226"
I have a file with numbers in scientific notation stored as 0.00684*10^0.0023. When I read the file with read.csv(), they are loaded as character strings, and trying to convert them with the as.numeric() function returns only NA.
a = "0.00684*10^0.0023"
as.numeric(a)
NA
Is there a way to make R read such a value as a number in scientific notation?
Here are a few ways. They use a from the question (also shown in the Note below) or c(a, a) to illustrate how to apply it to a vector. No packages are used.
1) Use eval/parse:
eval(parse(text = a))
## [1] 0.00687632
or for a vector such as c(a, a)
sapply(c(a, a), function(x) eval(parse(text = x)), USE.NAMES = FALSE)
## [1] 0.00687632 0.00687632
2) A different way is to split up the input and calculate it ourselves.
with(read.table(text = sub("\\*10\\^", " ", c(a, a))), V1 * 10^V2)
## [1] 0.00687632 0.00687632
3) A third way is to convert to complex numbers and then combine the real and imaginary parts to get the result.
tmp <- type.convert(sub("\\*10\\^(.*)", "+\\1i", c(a, a)), as.is = TRUE)
Re(tmp) * 10^Im(tmp)
## [1] 0.00687632 0.00687632
4) Another way to split it up is:
as.numeric(sub("\\*.*", "", c(a, a))) *
10^as.numeric(sub(".*\\^", "", c(a, a)))
## [1] 0.00687632 0.00687632
6) We could use any of the above to define a custom class that reads fields of the required form. First we write out a test data file and then define a custom num class. In read.csv, use the colClasses argument to specify the class of each column; use NA for those columns where we want read.csv to determine the class automatically.
# generate test file
cat("a,b\n1,0.00684*10^0.0023\n", file = "tmpfile.csv")
setAs("character", "num",
function(from) with(read.table(text = sub("\\*10\\^", " ", from)), V1 * 10^V2))
read.csv("tmpfile.csv", colClasses = c(NA, "num"))
## a b
## 1 1 0.00687632
With this definition we can also use as like this:
as(c(a, a), "num")
## [1] 0.00687632 0.00687632
Note
a <- "0.00684*10^0.0023"
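The split-and-compute approaches above can be wrapped in a small helper that validates the format first and returns NA for anything else, which avoids running eval(parse()) on text read from a file. A minimal sketch; parse_pow10 is a hypothetical name, and the accepted format (mantissa*10^exponent with plain decimal numbers) is an assumption based on the single example in the question:

```r
# Hypothetical helper: convert strings like "0.00684*10^0.0023" to numbers
parse_pow10 <- function(x) {
  num <- "[-+]?[0-9]*\\.?[0-9]+"
  # Only strings of the exact form <number>*10^<number> are converted
  ok <- grepl(paste0("^", num, "\\*10\\^", num, "$"), x)
  out <- rep(NA_real_, length(x))
  parts <- strsplit(x[ok], "*10^", fixed = TRUE)
  out[ok] <- vapply(parts,
                    function(p) as.numeric(p[1]) * 10^as.numeric(p[2]),
                    numeric(1))
  out
}

res <- parse_pow10(c("0.00684*10^0.0023", "not a number"))
res
```

Strings that do not match the expected form come back as NA rather than raising an error, which is usually easier to handle when cleaning a whole column.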
One idea:
library(stringr)
a <- "0.00684*10^0.0023" # your input
b <- str_split(a, "\\*10\\^") # split the string by the operator
b <- as.numeric(b[[1]]) # transform the individual elements to numbers
c <- b[1]*10^(b[2]) # apply the desired operation to the two numbers
c # the result
data <- c("Demand = 001 979", "Demand = -08 976 (154)", "Demand = -01 975 (359)")
data <- str_match(data, pattern = ("Demand = (.*) (.*)"))
I need to extract the first two sets of numbers (including the - sign) into columns using str_match.
The third set of numbers, in parentheses (), should be excluded.
Any help is welcomed.
Desired output:
## [1] "001" "-08" "-01"
## [2] "979" "976" "975"
How about removing everything else?
data <- c("Demand = 001 979", "Demand = -08 976 (154)", "Demand = -01 975 (359)")
data <- gsub("Demand = ", "", x = data)
data <- trimws(gsub("\\(.*\\)", "", x = data))
data <- strsplit(data, " ") # split each string into its two numbers
out <- list()
out[[1]] <- sapply(data, "[", 1)
out[[2]] <- sapply(data, "[", 2)
out
[[1]]
[1] "001" "-08" "-01"
[[2]]
[1] "979" "976" "975"
A possibility with str_extract_all() from stringr, where x is the original character vector (called data in the question):
sapply(str_extract_all(x, "-?[0-9]+?[0-9]*"), function(x) x[1])
[1] "001" "-08" "-01"
sapply(str_extract_all(x, "-?[0-9]+?[0-9]*"), function(x) x[2])
[1] "979" "976" "975"
Or using the idea of @Roman Luštrik with strsplit():
sapply(strsplit(gsub("Demand = ", "", x), " "), function(x) x[1])
[1] "001" "-08" "-01"
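For a result that is already arranged in columns, base R's regexec() with two capture groups gives a character matrix, much like the one str_match() returns. A sketch under the assumption that every string starts with "Demand = " followed by two space-separated numbers; parts and the column names first and second are hypothetical:

```r
data <- c("Demand = 001 979", "Demand = -08 976 (154)", "Demand = -01 975 (359)")

# Each list element holds the full match followed by the two capture groups
m <- regmatches(data, regexec("Demand = (-?[0-9]+) ([0-9]+)", data))

# Stack into a matrix and drop the first column (the full match)
parts <- do.call(rbind, m)[, -1, drop = FALSE]
colnames(parts) <- c("first", "second")
parts
```

Because the pattern stops after the second number, the bracketed third value is never captured and needs no separate clean-up step.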
Is it possible to find and delete all sentences with a high digit-to-character ratio?
I created the following function to calculate the ratio in a given string:
a <- "1aaaaaa2bbbbbbb3"
Num_Char_Ration <- function(string){
length(unlist(regmatches(string,gregexpr("[[:digit:]]",string))))/nchar(as.character(string))
}
Num_Char_Ration(a)
#0.1875
The task now is to calculate the ratio for each sentence (that is, each character sequence ending with a ".") and then delete the sentences with a high ratio from the text. For example:
input:
a <- " aa111111. bbbbbb22. cccccc3."
output:
#"bbbbbb22. cccccc3."
I would use stringr package to count digits and characters:
# Original data
input <- " aa111111. bbbbbb22. cccccc3."
# Split by .
inputSplit <- strsplit(input, "\\.")[[1]]
# Count digits and all alphanumeric characters in each split string
counts <- sapply(inputSplit, stringr::str_count, c("[[:digit:]]", "[[:alnum:]]"))
# Get ratios and collapse text back
paste(inputSplit[counts[1, ] / counts[2, ] < 0.5], collapse = ".")
# [1] " bbbbbb22. cccccc3"
counts looks like this:
# To get ratio between digits and string
# Divide first row by second row
aa111111 bbbbbb22 cccccc3
[1,] 6 2 1
[2,] 8 8 7
Here is a simple base solution:
x <- strsplit(input,"\\.")[[1]]
x <- x[nchar(x) < 2 * nchar(gsub("\\d","",x))]
paste(x,collapse=".")
# [1] " bbbbbb22. cccccc3"
You need to split your long string into separate pieces first (with strsplit(), for example).
data:
words <- c("aa111111.","bbbbbb22.","cccccc3.")
code:
library(magrittr)
fun1 <- function(x) {
num <- gsub("\\D","",x) %>% nchar
char <- gsub("[^A-Za-z]","",x,perl=TRUE) %>% nchar # [A-z] would also match [ \ ] ^ _ and `
if(num <= char) return(x) else NULL
}
sapply(words,fun1) %>% unlist %>% unname
result:
#[1] "bbbbbb22." "cccccc3."
Here's how I would do it in base R. Adapted Andre's code.
my_string <- " aa111111. bbbbbb22. cccccc3."
#Split paragraph into sentences based on '.'
my_string <- unlist(strsplit(my_string, '(?<=\\.)\\s+', perl=TRUE))
#Removing sentences with more numbers than letters
my_string <- subset(my_string,nchar(gsub("\\D","",my_string)) <= nchar(gsub("[^A-Za-z]","",my_string,perl=TRUE)))
my_string
##[1] "bbbbbb22." "cccccc3."
If you then want to combine these sentences back into a paragraph, you can use
paste(my_string,collapse=" ")
##[1] "bbbbbb22. cccccc3."
# Simplified num to char ratio function
Num_Char_Ration <- function(x) {
lengths(regmatches(x, gregexpr("[0-9]", x))) / nchar(x)
}
clear_nmbstring <- function(x) {
x <- strsplit(x, ".", fixed = TRUE)[[1]]
cleanx <- trimws(x)
x <- x[Num_Char_Ration(cleanx) < 0.5]
paste(x, collapse = ".")
}
# Example:
string <- c(" aa111111. bbbbbb22. cccccc3.")
clear_nmbstring(string)
[1] " bbbbbb22. cccccc3"
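Before settling on a cut-off such as 0.5, it can help to look at the ratio of every sentence. A small sketch in base R; sentence_ratios is a hypothetical helper, and splitting on a literal "." is the same simplifying assumption the answers above make:

```r
# Hypothetical helper: one row per sentence with its digit-to-character ratio
sentence_ratios <- function(text) {
  sentences <- trimws(strsplit(text, ".", fixed = TRUE)[[1]])
  sentences <- sentences[nzchar(sentences)]        # drop empty pieces
  digits <- nchar(gsub("[^0-9]", "", sentences))   # count digits only
  data.frame(sentence = sentences,
             ratio = digits / nchar(sentences),
             stringsAsFactors = FALSE)
}

res <- sentence_ratios(" aa111111. bbbbbb22. cccccc3.")
res
# Keep only the sentences below the chosen threshold
kept <- res$sentence[res$ratio < 0.5]
```

The kept sentences can then be joined again with paste(kept, collapse = ". ") if a single string is needed.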
The string is s = '[12]B1[16]M5'
I want to split it into the following results with the strsplit function in R:
let <- c('[12]B', '[16]M')
num <- c(1, 5)
Thanks a lot
You could use a regular expression for your task.
s <- '[12]B1[16]M22'
grx <- gregexpr("\\[.+?\\][[:alpha:]]+", s)
let <- do.call(c, regmatches(s, grx))
#let
#[1] "[12]B" "[16]M"
If you want to get all chunks (let + num), you can tweak the pattern as below. This makes it easy to extract the numeric part.
grx <- gregexpr("\\[.+?\\][[:alpha:]]+[[:digit:]]+", s)
out <- do.call(c, regmatches(s, grx))
num <- gsub(".+\\][[:alpha:]]+", "", out)
num
[1] "1" "22"
Using the stringr package:
library(stringr)
x <- '[12]B1[16]M2'
let <- unlist(str_extract_all(x, "\\[[0-9]{2}\\][A-Z]"))
x <- gsub(pattern = "\\[[0-9]{2}\\][A-Z]",
replacement = "",
x)
num <- unlist(str_extract_all(x, "[0-9]"))
The regular expression "\\[[0-9]{2}\\][A-Z]" can be broken down as:
\\[ an opening bracket
[0-9]{2} exactly two consecutive digits
\\] a closing bracket
[A-Z] exactly one upper-case letter
1) strapply Create a regular expression, pat, which matches the two parts, and then extract each separately using strapply. The first capture group (the first parenthesized portion of the regular expression) consists of a left square bracket "\\[", the shortest string ".*?" up to the right square bracket "\\]", and then any single character ".". The second capture group consists of one or more digits "\\d+".
library(gsubfn)
pat <- "(\\[.*?\\].)(\\d+)"
let <- strapply(s, pat, simplify = c)
num <- strapply(s, pat, ~ as.numeric(..2), simplify = c)
let
## [1] "[12]B" "[16]M"
num
## [1] 1 5
1a) Variation
This could also be expressed as this mapply producing a 2 component list:
mapply(strapply, s, pat, c(~ ..1, ~ as.numeric(..2)), simplify = "c",
SIMPLIFY = FALSE, USE.NAMES = FALSE)
## [[1]]
## [1] "[12]B" "[16]M"
##
## [[2]]
## [1] 1 5
2) gsub/read.table This uses no packages, only gsub and read.table. pat is defined in (1). It returns a data frame with the results in two columns:
read.table(text = gsub(pat, "\\1 \\2\n", s), as.is = TRUE, col.names = c("let", "num"))
## let num
## 1 [12]B 1
## 2 [16]M 5
3) gsub/strsplit This is somewhat similar to (2) but uses strsplit rather than read.table. pat is from (1).
spl <- matrix(strsplit(gsub(pat, "\\1 \\2 ", s), " ")[[1]], 2)
let <- spl[1, ]
num <- as.numeric(spl[2, ])
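Another base R route is to pull out each chunk with gregexpr() and then let utils::strcapture() split the chunks into typed columns. A sketch, assuming each chunk is a bracketed number, one or more letters, and then digits; chunks and res are hypothetical names:

```r
s <- "[12]B1[16]M5"

# Step 1: extract each complete chunk, e.g. "[12]B1"
chunks <- regmatches(s, gregexpr("\\[[0-9]+\\][A-Za-z]+[0-9]+", s))[[1]]

# Step 2: split every chunk into its two parts; proto sets the column types
res <- strcapture("^(\\[[0-9]+\\][A-Za-z]+)([0-9]+)$", chunks,
                  proto = data.frame(let = character(), num = numeric(),
                                     stringsAsFactors = FALSE))
res
```

Here res$let and res$num correspond to the let and num vectors asked for in the question, with num already numeric.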
I have used adist to calculate the number of characters that differ between two strings:
a <- "Happy day"
b <- "Tappy Pay"
adist(a,b) # result 2
Now I would like to extract the characters that differ. In my example, I would like to get the string "Hd" (or "TP", it doesn't matter).
I tried to look in adist, agrep and stringi but found nothing.
You can use the following sequence of operations:
Split the strings using strsplit().
Use setdiff() to compare the elements.
Wrap it in a reducing function.
Try this:
Reduce(setdiff, strsplit(c(a, b), split = ""))
[1] "H" "d"
Split into letters and take the difference as sets:
> setdiff(strsplit(a,"")[[1]],strsplit(b,"")[[1]])
[1] "H" "d"
Not really proud of this, but it seems to do the job:
sapply(setdiff(utf8ToInt(a), utf8ToInt(b)), intToUtf8)
Results:
[1] "H" "d"
You can use one of the variables as a regex character class and gsub out from the other one:
gsub(paste0("[",a,"]"),"",b)
[1] "TP"
gsub(paste0("[",b,"]"),"",a)
[1] "Hd"
As long as a and b have the same length we can do this:
s.a <- strsplit(a, "")[[1]]
s.b <- strsplit(b, "")[[1]]
paste(s.a[s.a != s.b], collapse = "")
giving:
[1] "Hd"
This seems straightforward in terms of clarity of the code and seems tied for the fastest of the solutions provided here although I think I prefer f3:
f1 <- function(a, b)
paste(setdiff(strsplit(a,"")[[1]],strsplit(b,"")[[1]]), collapse = "")
f2 <- function(a, b)
paste(sapply(setdiff(utf8ToInt(a), utf8ToInt(b)), intToUtf8), collapse = "")
f3 <- function(a, b)
paste(Reduce(setdiff, strsplit(c(a, b), split = "")), collapse = "")
f4 <- function(a, b) {
s.a <- strsplit(a, "")[[1]]
s.b <- strsplit(b, "")[[1]]
paste(s.a[s.a != s.b], collapse = "")
}
a <- "Happy day"
b <- "Tappy Pay"
library(rbenchmark)
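# Call each function; naming the arguments keeps the short labels in the output
benchmark(f1 = f1(a, b), f2 = f2(a, b), f3 = f3(a, b), f4 = f4(a, b),
replications = 10000, order = "relative")[1:4]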
giving the following on a fresh session on my laptop:
test replications elapsed relative
3 f3 10000 0.07 1.000
4 f4 10000 0.07 1.000
1 f1 10000 0.09 1.286
2 f2 10000 0.10 1.429
I have assumed that the differences must be in the corresponding character positions. You might want to clarify if that is the intention or not.
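The distinction matters because the set-based and position-based approaches can disagree. A short illustration with two made-up strings that contain the same letters in a different order:

```r
# Hypothetical strings: same characters, different positions
a <- "abc"
b <- "cab"
s.a <- strsplit(a, "")[[1]]
s.b <- strsplit(b, "")[[1]]

# Set difference: characters of a that do not occur anywhere in b
set_diff <- setdiff(s.a, s.b)

# Positional difference: characters of a that differ from b at the same index
pos_diff <- paste(s.a[s.a != s.b], collapse = "")

set_diff   # empty, since every letter of a appears somewhere in b
pos_diff   # all three positions differ
```

For "Happy day" and "Tappy Pay" both approaches happen to give the same characters, which is why the answers above agree on that example.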
The following function could be a better option for problems like this.
list.string.diff <- function(a, b, exclude = c("-", "?"), ignore.case = TRUE, show.excluded = FALSE)
{
if(nchar(a)!=nchar(b)) stop("Lengths of input strings differ. Please check your input.")
if(ignore.case)
{
a <- toupper(a)
b <- toupper(b)
}
split_seqs <- strsplit(c(a, b), split = "")
only.diff <- (split_seqs[[1]] != split_seqs[[2]])
only.diff[
(split_seqs[[1]] %in% exclude) |
(split_seqs[[2]] %in% exclude)
] <- NA
diff.info<-data.frame(which(is.na(only.diff)|only.diff),
split_seqs[[1]][only.diff],split_seqs[[2]][only.diff])
names(diff.info)<-c("position","poly.seq.a","poly.seq.b")
if(!show.excluded) diff.info<-na.omit(diff.info)
diff.info
}
from https://www.r-bloggers.com/extract-different-characters-between-two-strings-of-equal-length/
Then you can run
list.string.diff(a, b)
to get the difference.