R: Find pattern and get the values in between

I am using readLines() to read the HTML code of a site. In almost every line of the code there is a pattern of the form <td>VALUE1<td>VALUE2<td>. I would like to extract the values in between the <td> tags. I tried some variations such as:
output <- gsub(pattern='(.*<td>)(.*)(<td>.*)(.*)(.*<td>)',replacement='\\2',x='<td>VALUE1<td>VALUE2<td>')
but the output gives back only one value. Any idea how to do that?

string <- "<td>VALUE1<td>VALUE2<td>"
regmatches(string , gregexpr("(?<=<td>)\\w+(?=<td>)" , string , perl = T) )
# use the gregexpr function to get the match indices and the lengths
indices <- gregexpr("(?<=<td>)\\w+(?=<td>)" , string , perl = T)
# this should be the result:
# [[1]]
# [1]  5 15
# attr(,"match.length")
# [1] 6 6
# attr(,"useBytes")
# [1] FALSE
# this means you have two matches: the first starts at index 5 and the
# second starts at index 15, and both matches have a length of 6
# then pass this result to the regmatches function to
# substring your string at these indices
regmatches(string , indices)
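Running that last line should give back both values:
# [[1]]
# [1] "VALUE1" "VALUE2"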

Did you take a look at the "XML" package, which can extract tables from HTML? You probably need to provide more context about the entire page you are trying to parse so that we can see whether it might be appropriate.
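For instance, a minimal sketch, assuming the page really does contain <table>/<td> markup (the URL below is just a placeholder):
library(XML)
# read the raw page, as in the question, then let the HTML parser repair the
# unclosed <td> tags instead of fighting them with a regex
txt <- readLines("http://example.com/page.html")
doc <- htmlParse(paste(txt, collapse = "\n"), asText = TRUE)
tables <- readHTMLTable(doc)  # one data frame per <table> element on the page
str(tables)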

Related

Finding consecutive values in a string in R

I am trying to find 3 or more consecutive "a" within the last 10 letters of my data frame string. My data frame looks like this:
V1
aaashkjnlkdjfoin
jbfkjdnsnkjaaaas
djshbdkjaaabdfkj
jbdfkjaaajbfjna
ndjksnsjksdnakns
aaaandfjhsnsjna
I have written some code, but it just returns the number of consecutive "a" within the whole string, whereas I want it to look only at the last 10 characters and then print the strings where the consecutive "a" are found. The output of the code I have written is:
out: [1] 3
I want my output to look like this:
jbfkjdnsnkjaaaas
djshbdkjaaabdfkj
jbdfkjaaajbfjna
Can anyone help?
Using regex, you could do:
grep("(?=.{10}$).*?a{3,}", string, perl = TRUE, value = TRUE)
[1] "jbfkjdnsnkjaaaas" "djshbdkjaaabdfkj" "jbdfkjaaajbfjna"
string <- c("aaashkjnlkdjfoin", "jbfkjdnsnkjaaaas", "djshbdkjaaabdfkj",
"jbdfkjaaajbfjna", "ndjksnsjksdnakns", "aaaandfjhsnsjna")
If you have a data frame and need to subset it:
subset(df, grepl("(?=.{10}$).*?a{3}",V1, perl = TRUE))
V1
2 jbfkjdnsnkjaaaas
3 djshbdkjaaabdfkj
4 jbdfkjaaajbfjna
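If the lookahead feels opaque, the "last 10 characters" constraint can also be made explicit with substring() (a sketch of the same idea, using the string vector defined above):
last10 <- substring(string, nchar(string) - 9, nchar(string))  # last 10 characters of each element
string[grepl("a{3,}", last10)]
# [1] "jbfkjdnsnkjaaaas" "djshbdkjaaabdfkj" "jbdfkjaaajbfjna"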

How do I find the position of a (fuzzy) match within a string?

I have a text processing problem in R. I want to get the character position within a string at which a different string makes an exact match and/or a fuzzy match with some edit distance. For example:
A = "blahmatchblah"
B = "match"
C = "latch"
I would like to return something telling me that the 5th character of string A is where the match begins for a search of both B and C. All the pattern matching tools I'm aware of will tell me whether there's a (fuzzy) match for B and C within A, but not where that match begins.
The base function aregexec() is used for approximate string position matching. Unfortunately it's not vectorized over pattern, so we'll have to use a loop to get the positions for both B and C.
sapply(c(B, C), aregexec, A)
# $match
# [1] 5
# attr(,"match.length")
# [1] 5
#
# $latch
# [1] 5
# attr(,"match.length")
# [1] 5
See help(aregexec) for more.
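If you only want the starting positions as a plain named vector, you can strip the attributes from that result afterwards (a small follow-on sketch, assuming both patterns matched as in the output above):
m <- sapply(c(B, C), aregexec, A)
# keep just the first element of each match, i.e. the starting position
sapply(m, `[[`, 1)
# match latch 
#     5     5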
I don't have the rep to comment, but at least for the first part of your question: gregexpr(B, A)[[1]][1] will yield 5, because "match" occurs as an exact substring of A.
A few months back I made an interface to the fuzzywuzzy Python package in R, which has the get_matching_blocks() method (it's pretty close to what you actually ask).
Assuming you want to find the matching blocks between two strings,
A = "blahmatchblah"
B = "match"
library(fuzzywuzzyR)
init <- SequenceMatcher$new(string1 = A, string2 = B)
init$get_matching_blocks()
returns,
[[1]]
Match(a=4, b=0, size=5)
[[2]]
Match(a=13, b=5, size=0)
The first sublist gives the matching blocks of the two strings: a = 4 is the starting index of the match in string A and b = 0 is the starting index in string B (indexing starts from 0). size = 5 is the number of characters that the two strings have in common (in this case the matching block is "match", which has 5 characters).
The documentation, especially for SequenceMatcher, has more info.

How do I find a subtext without comma using regex in R?

I have a data frame as:
result <- c('Ab1 : 256 ug/mL(R), Ab2(disk); 18mm(S)', 'Ab1 : 4 ug/mL(S), Ab2(disk); <2mm(R)')
df <- data.frame(result)
What should I do if I would like to check whether '(R)' appears after 'Ab1'?
grep("Ab1[[:print:]]*\\(R\\)", result)
gives
[1] 1 2
while the result I want is
[1] 1
Try this:
grep("Ab1[^(]*?\\(R\\)", result)
[1] 1
Ab1     matches 'Ab1' literally
[^(]*?  matches anything except an opening parenthesis, non-greedily
\(R\)   matches '(R)' literally
In the second string, the match cannot succeed without first consuming at least one opening parenthesis (the '(S)' after Ab1 comes before the '(R)'), hence only the first string matches.
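To see which elements match, and what exactly was matched, you could run, for example:
grep("Ab1[^(]*?\\(R\\)", result, value = TRUE)
# [1] "Ab1 : 256 ug/mL(R), Ab2(disk); 18mm(S)"
regmatches(result, regexpr("Ab1[^(]*?\\(R\\)", result))
# [1] "Ab1 : 256 ug/mL(R)"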

Finding the number of r's in the vector (both R and r) before the first u

rquote <- "R's internals are irrefutably intriguing"
chars <- strsplit(rquote, split = "")[[1]]
In the above code, we need to find the number of r's (R and r) in rquote before the first u.
You could use substrings.
## find position of first 'u'
u1 <- regexpr("u", rquote, fixed = TRUE)
## get count of all 'r' or 'R' before 'u1'
lengths(gregexpr("r", substr(rquote, 1, u1), ignore.case = TRUE))
# [1] 5
This follows what you ask for in the title of the post. If you want the count of all the "r", case insensitive, then simplify the above to
lengths(gregexpr("r", rquote, ignore.case = TRUE))
# [1] 6
Then there's always stringi
library(stringi)
## count before first 'u'
stri_count_regex(stri_sub(rquote, 1, stri_locate_first_regex(rquote, "u")[,1]), "r|R")
# [1] 5
## count all R or r
stri_count_regex(rquote, "r|R")
# [1] 6
To get the number of R's before the first u, you need to make an intermediate step. (You probably don't need to. I'm sure akrun knows some incredibly cool regular expression to get the job done, but it won't be as easy to understand as this).
rquote <- "R's internals are irrefutably intriguing"
before_u <- gsub("u[[:print:]]+$", "", rquote)
length(stringr::str_extract_all(before_u, "(R|r)")[[1]])
You may try this,
> length(str_extract_all(rquote, '[Rr]')[[1]])
[1] 6
To get the count of all r's before the first u
> length(str_extract_all(rquote, perl('u.*(*SKIP)(*F)|[Rr]'))[[1]])
[1] 5
EDIT: Just saw "before the first u". In that case, we can get the position of the first 'u' with either which or match.
Then use grepl on 'chars' (the strsplit output from the OP's code) up to that position ('ind') to get a logical index of the 'r'/'R' matches with ignore.case=TRUE, and sum it.
ind <- which(chars=='u')[1]
Or
ind <- match('u', chars)
sum(grepl('r', chars[seq(ind)], ignore.case=TRUE))
#[1] 5
Or we can use sub and gsub on the original string ('rquote'). The first call removes the characters from the first u to the end of the string ('u.*$'), and the second matches all characters except R and r ('[^Rr]') and replaces them with ''. We can then use nchar to count the characters that remain.
nchar(gsub('[^Rr]', '', sub('u.*$', '', rquote)))
#[1] 5
Or, if we want to count the 'r's in the entire string, use gregexpr to get the positions of the matching characters in the original string ('rquote') and take the length.
length(gregexpr('[rR]', rquote)[[1]])
#[1] 6

String processing in R (find and replace)

This is an example of my data.frame
no string
1 abc&URL_drf
2 abcdef&URL_efg
I need to replace *&URL with "" (that is, remove everything up to and including &URL). So, I need this result:
no string
1 _drf
2 _efg
In Excel, I can easily get this result by using '*&URL' in the 'Find and Replace' function.
However, I cannot find an equally effective method in R.
My approach in R is below.
First, I split the string using strsplit(df$string, "&URL") and then selected the second element. I don't think that is an effective way.
Is there a more effective method?
# data
df <- read.table(text="no string
1 abc&URL_drf
2 abcdef&URL_efg", header=T, as.is=T)
# the `gsub` function substitutes the unwanted string with nothing,
# hence the `""`. The unwanted part is described with a
# regular expression.
df$string <- gsub("[a-z]+(&URL)", "", df$string)
# you get
no string
1 1 _drf
2 2 _efg
I suggest you use the grep function.
The grep function takes your regex as the first argument and the input vector as the second argument. If you pass value=TRUE, grep returns a vector with copies of the actual elements of the input vector that could be (partially) matched.
So in your case:
grep("[a-z]+(&URL)", df$col, perl=TRUE, value=TRUE)
Another approach:
df <- transform(df, string = sub(".*&URL", "", string))
# no string
# 1 1 _drf
# 2 2 _efg
