Extract websites links from a text in R

I have multiple texts, each of which may contain references to one or more web links. For example:
text1= "s#1212a as www.abcd.com asasa11".
How do I extract:
"www.abcd.com"
from this text in R? In other words I am looking to extract patterns that start with www and end with .com

regmatches. This approach uses regexpr/gregexpr and regmatches. I expanded the test data to include more examples.
text1 <- c("s#1212a www.abcd.com www.cats.com",
"www.boo.com",
"asdf",
"blargwww.test.comasdf")
# Regular expressions take some practice.
# check out ?regex or the wikipedia page on regular expressions
# for more info on creating them yourself.
pattern <- "www\\..*?\\.com"
# Get information about where the pattern matches text1
m <- gregexpr(pattern, text1)
# Extract the matches from text1
regmatches(text1, m)
Which gives
> regmatches(text1, m) ##
[[1]]
[1] "www.abcd.com" "www.cats.com"
[[2]]
[1] "www.boo.com"
[[3]]
character(0)
[[4]]
[1] "www.test.com"
Notice it returns a list. If we want a vector we can just use unlist on the result. It returns a list because we used gregexpr, which allows for multiple matches in each string. If we know there is at most one match per string we could use regexpr instead:
> m <- regexpr(pattern, text1)
> regmatches(text1, m)
[1] "www.abcd.com" "www.boo.com" "www.test.com"
Notice, however, that this returns all results as a vector and only returns a single result from each string (note that www.cats.com isn't in the results). On the whole, though, I think either of these two methods is preferable to the gsub method because that way will return the entire input if there is no result found. For example take a look:
> gsub(text1, pattern=".*(www\\..*?\\.com).*", replace="\\1")
[1] "www.abcd.com" "www.boo.com" "asdf" "www.test.com"
And that's even after modifying the pattern to be a little more robust. We still get 'asdf' in the results even though it clearly doesn't match the pattern.
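If you want exactly one result per input string, with a missing value where nothing matches, one option is to fill a vector of NAs from the regexpr result. This is only a minimal sketch: the helper name extract_first is made up for illustration and is not part of base R, and perl = TRUE is an assumption added here to get predictable lazy matching.

```r
# Hypothetical helper: first match per string, NA where there is no match
extract_first <- function(x, pattern) {
  m <- regexpr(pattern, x, perl = TRUE)
  hit <- !is.na(m) & m != -1          # which inputs contain a match
  out <- rep(NA_character_, length(x))
  out[hit] <- regmatches(x, m)        # regmatches returns only the matched pieces
  out
}

text1 <- c("s#1212a www.abcd.com www.cats.com",
           "www.boo.com",
           "asdf",
           "blargwww.test.comasdf")
res <- extract_first(text1, "www\\..*?\\.com")
print(res)
```

The result has the same length as the input, so it can be stored next to the original strings without losing track of which element had no match.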
Shameless silly self promotion: regmatches was introduced with R 2.14 so if you're stuck with an earlier version of R you might be out of luck. Unless you're able to install the future2.14 package from my github repo which provides some support for functions introduced in 2.14 to earlier versions of R.
strapplyc. An alternative which gives the same result as ## above is:
library(gsubfn)
strapplyc(text1, pattern)
The regular expression. Here is some explanation of how to decipher the regular expression:
pattern <- "www\\..*?\\.com"
Explanation:
www matches the www portion
\\. We need to escape an actual 'dot' character using \\ because a plain . represents "any character" in regular expressions.
.*? The . represents any character, the * tells to match 0 or more times, and the ? following the * tells it to not be greedy. Otherwise "asdf www.cats.com www.dogs.com asdf" would match all of "www.cats.com www.dogs.com" as a single match instead of recognizing that there are two matches in there.
\\. Once again we need to escape an actual dot character
com This part matches the ending 'com' that we want to match
Putting it all together it says: start with www. then match any characters until you reach the first ".com"
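The difference between the greedy and the non-greedy quantifier can be checked directly on the string mentioned in the explanation. This is a small sketch; perl = TRUE is an assumption added here so that both patterns are handled by the same (PCRE) engine.

```r
x <- "asdf www.cats.com www.dogs.com asdf"

# Greedy: .* runs on to the LAST ".com", so both sites end up in one match
greedy <- regmatches(x, gregexpr("www\\..*\\.com", x, perl = TRUE))[[1]]

# Lazy: .*? stops at the FIRST ".com", so each site is a separate match
lazy <- regmatches(x, gregexpr("www\\..*?\\.com", x, perl = TRUE))[[1]]

print(greedy)
print(lazy)
```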

Check out the gsub function:
x = "s#1212a as www.abcd.com asasa11"
gsub(x=x, pattern=".*(www.*com).*", replace="\\1")
The basic idea is to surround the text you want to retain in parentheses, then replace the entire line with it. The replacement argument of gsub (abbreviated to replace above), "\\1", refers to what was matched inside the parentheses.

The solutions here are great and in base. For those that want a quick solution you can use qdap's genXtract. This function basically takes left and right element(s) and extracts everything in between. Setting with = TRUE includes those elements in the result:
text1 <- c("s#1212a www.abcd.com www.cats.com",
"www.boo.com",
"asdf",
"http://www.talkstats.com/ and http://stackoverflow.com/",
"blargwww.test.comasdf")
library(qdap)
genXtract(text1, "www.", ".com", with=TRUE)
## > genXtract(text1, "www.", ".com", with=TRUE)
## $`www. : .com1`
## [1] "www.abcd.com" "www.cats.com"
##
## $`www. : .com2`
## [1] "www.boo.com"
##
## $`www. : .com3`
## character(0)
##
## $`www. : .com4`
## [1] "www.talkstats.com"
##
## $`www. : .com5`
## [1] "www.test.com"
PS: if you look at the code for the function, it is a wrapper for Dason's solution.

Related

Extract inside of first square brackets

I know there are a few similar questions, but they did not help me, perhaps due to my lack of understanding the basics of string manipulation.
I have a piece of string that I want to extract the inside of its first square brackets.
x <- "cons/mod2/det[4]/rost2/rost_act[2]/Q2w5"
I have looked all over the internet to assemble the following code but it gives me inside of 2nd brackets
sub(".*\\[(.*)\\].*", "\\1", x, perl=TRUE)
The code returns 2. I expect to get 4.
Would appreciate if someone points out the missing piece.
---- update ----
Replacing .* with .*? in the first two instances worked, but I do not know why. I leave the question open for someone who can explain why this works:
sub(".*?\\[(.*?)\\].*", "\\1", x, perl=TRUE)

You're almost there:
sub("^[^\\]]*\\[(\\d+)\\].*", "\\1", x, perl=TRUE)
## [1] "4"
The original problem is that .* is greedy: it matches as much as possible before the pattern goes on to match [, so it runs up to the last [ in the string. Your solution, *?, is the lazy (non-greedy, reluctant) version of *, which matches as little as it can.
That is completely valid. The alternative I used is [^\\]]*, which translates to "match anything that is not ]".
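To see the three behaviours side by side, the patterns can be run against the same string. This is a sketch for comparison only; the third pattern is a variation that excludes [ before the bracket and ] inside it, and is not taken verbatim from the answer.

```r
x <- "cons/mod2/det[4]/rost2/rost_act[2]/Q2w5"

# Greedy: the leading .* swallows everything up to the LAST "["
greedy <- sub(".*\\[(.*)\\].*", "\\1", x, perl = TRUE)

# Lazy: .*? stops at the FIRST "[" and the group stops at the first "]"
lazy <- sub(".*?\\[(.*?)\\].*", "\\1", x, perl = TRUE)

# Negated character classes: no "[" before the bracket, no "]" inside it
negated <- sub("^[^\\[]*\\[([^\\]]*)\\].*", "\\1", x, perl = TRUE)

print(c(greedy = greedy, lazy = lazy, negated = negated))
```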

stringr
You can solve this with base R, but I usually prefer the functions from the stringr package when handling such 'problems'.
x <- "cons/mod2/det[4]/rost2/rost_act[2]/Q2w5"
If you want only the first string between brackets, use str_extract:
stringr::str_extract(x, "(?<=\\[).+?(?=\\])")
# [1] "4"
If you want all the strings between brackets, use str_extract_all:
stringr::str_extract_all(x, "(?<=\\[).+?(?=\\])")
# [[1]]
# [1] "4" "2"
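For comparison, the same lookaround pattern should also work in base R, as long as perl = TRUE is set (lookarounds are a PCRE feature). A minimal sketch, assuming the same x as above:

```r
x <- "cons/mod2/det[4]/rost2/rost_act[2]/Q2w5"
pattern <- "(?<=\\[).+?(?=\\])"

# regexpr finds only the first match, gregexpr finds all of them
first <- regmatches(x, regexpr(pattern, x, perl = TRUE))
all_matches <- regmatches(x, gregexpr(pattern, x, perl = TRUE))[[1]]

print(first)
print(all_matches)
```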

R regex Grouping Not Working as Expected [duplicate]

Given a regular expression containing capture groups (parentheses) and a string, how can I obtain all the substrings matching the capture groups, i.e., the substrings usually referenced by "\1", "\2"?
Example: consider a regex capturing digits preceded by "xy":
s <- "xy1234wz98xy567"
r <- "xy(\\d+)"
Desired result:
[1] "1234" "567"
First attempt: gregexpr:
regmatches(s,gregexpr(r,s))
#[[1]]
#[1] "xy1234" "xy567"
Not what I want because it returns the substrings matching the entire pattern.
Second try: regexec:
regmatches(s,regexec("xy(\\d+)",s))
#[[1]]
#[1] "xy1234" "1234"
Not what I want because it returns only the first occurrence of a match for the entire pattern and the capture group.
If there was a gregexec function, extending regexec as gregexpr extends regexpr, my problem would be solved.
So the question is: how to retrieve all substrings (or indices that can be passed to regmatches as in the examples above) matching capture groups in an arbitrary regular expression?
Note: the pattern for r given above is just a silly example, it must remain arbitrary.

For a base R solution, what about just using gsub() to finish processing the strings extracted by gregexpr() and regmatches()?
s <- "xy1234wz98xy567"
r <- "xy(\\d+)"
gsub(r, "\\1", regmatches(s,gregexpr(r,s))[[1]])
# [1] "1234" "567"

Not sure about doing this in base, but here's a package for your needs:
library(stringr)
str_match_all(s, r)
#[[1]]
# [,1] [,2]
#[1,] "xy1234" "1234"
#[2,] "xy567" "567"
Many stringr functions also have parallels in base R, so you can also achieve this without using stringr.
For instance, here's a simplified version of how the above works, using base R:
sapply(regmatches(s,gregexpr(r,s))[[1]], function(m) regmatches(m,regexec(r,m)))
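If a matrix like the one from str_match_all is preferred over a list, the two base steps can be combined with rbind. This is a sketch under one assumption: the pattern must give the same result when re-applied to each isolated match, which may not hold for patterns that depend on surrounding context (lookarounds, anchors).

```r
s <- "xy1234wz98xy567"
r <- "xy(\\d+)"

# Step 1: every full match of the pattern
full <- regmatches(s, gregexpr(r, s))[[1]]

# Step 2: re-apply the pattern to each match to recover its capture groups
parts <- regmatches(full, regexec(r, full))

# One row per match: column 1 is the full match, column 2 the first group
groups <- do.call(rbind, parts)
print(groups)
print(groups[, 2])
```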

strapplyc in the gsubfn package does that:
> library(gsubfn)
>
> strapplyc(s, r)
[[1]]
[1] "1234" "567"
Try ?strapplyc for additional info and examples.
Related Functions
1) A generalization of strapplyc is strapply in the same package. It takes a function which inputs the captured portions of each match and returns the output of the function. When the function is c it reduces to strapplyc. For example, suppose we wish to return results as numeric:
> strapply(s, r, as.numeric)
[[1]]
[1] 1234 567
2) gsubfn is another related function in the same package. It is like gsub except the replacement string can be a replacement function (or a replacement list or a replacement proto object). The replacement function inputs the captured portions and outputs the replacement. The replacement replaces the match in the input string. If a formula is used, as in this example, the right hand side of the formula is regarded as the function body. In this example we replace the match with XY{#} where # is twice the matched input number.
> gsubfn(r, ~ paste0("XY{", 2 * as.numeric(x), "}"), s)
[1] "XY{2468}wz98XY{1134}"
UPDATE: Added strapply and gsubfn examples.

Since R 4.1.0, there is gregexec:
regmatches(s,gregexec(r,s))[[1]][2, ]
[1] "1234" "567"

How do I insert a certain string in another string at a particular location in R?

I am new to R. It may be a very simple thing to do but I am not able figure it out.
Say, I have a string as follows:
This is an example string.
Now I want to make it as follows:
This is an (example/sample) string.
I know the location at which the change is to be made. (12th character in the given string).
I have a lot of strings where I need to perform a similar operation.

I'm not sure I understand the problem, but if I do, you could use gsub here:
x <- "This is an example string."
gsub("example", "(example/sample)", x)
## [1] "This is an (example/sample) string."

Here's one solution with regular expressions:
# the string
s <- "This is an example string."
# the position of the target's first character
pos <- 12
# create a regular expression
reg <- paste0("^(.{", pos - 1, "})(.+?\\b)(.*)")
# [1] "^(.{11})(.+?\\b)(.*)"
# modify string
sub(reg, "\\1\\(\\2/sample\\)\\3", s)
# [1] "This is an (example/sample) string."

Here's another regex flavoured solution using a lookbehind:
s <- "This is an example string."
pos <- 12
replacement <- '(example/sample)'
sub(sprintf('(?<=^.{%s})\\S*\\b', pos-1), replacement, s, perl=TRUE)
## [1] "This is an (example/sample) string."
Lookbehind (?<=x) is useful because the regex within it is part of the pattern but doesn't become part of the match (so we don't have to capture that text and put it back later). The pattern above says: "the beginning of the string, followed by 11 characters, followed by zero or more non-whitespace characters, followed by a word boundary". Only the non-whitespace characters are replaced, by replacement.
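Since the question mentions many strings, the position-based idea can be wrapped in a small function that sub applies to a whole vector at once. This is only a sketch: the name wrap_word_at and the second test string are made up for illustration, and it assumes the target word starts at the same position in every string and that the alternative text contains no backslashes or percent signs.

```r
# Hypothetical helper: wrap the word starting at `pos` as "(word/alt)"
wrap_word_at <- function(s, pos, alt) {
  # Group 1: the first pos - 1 characters; group 2: the word that follows
  pattern <- sprintf("^(.{%d})(\\w+)", pos - 1)
  sub(pattern, sprintf("\\1(\\2/%s)", alt), s, perl = TRUE)
}

strings <- c("This is an example string.",
             "Here is an instance too.")
out <- wrap_word_at(strings, 12, "sample")
print(out)
```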
Update
An alternative is to use strsplit to create a vector of words, and then identify the position in the vector of the character of interest (e.g. the 12th character), subsequently replacing that element with your new word. This is a bit slower than the regex approach, but makes it straightforward to request multiple replacements (at multiple character positions). For example:
f <- function(string, pos, new) {
  s <- strsplit(string, '\\s')[[1]]
  i <- findInterval(pos, c(gregexpr('(?<=\\b)\\w', string, perl=TRUE)[[1]],
                           nchar(string)))
  s[i] <- mapply(sub, s[i], patt='\\b[[:alnum:]-]+\\b', repl=new, perl=TRUE)
  paste0(s, collapse=' ')
}
f('This is an example string.', c(12, 20), c('excellent', 'function'))
## [1] "This is an excellent function."
Note that hyphenated words are fully replaced by the replacement (i.e. not just the part up to a hyphen), and all other punctuation (outside the boundaries of hyphenated words) is retained.

Removing a character from within a vector element

I have a vector of strings:
str.vect<-c ("abcR.1", "abcL.1", "abcR.2", "abcL.2")
str.vect
[1] "abcR.1" "abcL.1" "abcR.2" "abcL.2"
How can I remove the third character from the right in each vector element?
Here is the desired result:
"abc.1" "abc.1" "abc.2" "abc.2"
Thank you very much in advance

You can use nchar to find the length of each element of the vector
> nchar(str.vect)
[1] 6 6 6 6
Then you combine this with strtrim to get the beginning of each string
> strtrim(str.vect, nchar(str.vect)-3)
[1] "abc" "abc" "abc" "abc"
To get the end of the word you can then use substr (actually, you could use substr to get the beginning too...)
> substr(str.vect, nchar(str.vect)-1, nchar(str.vect))
[1] ".1" ".1" ".2" ".2"
And finally you use paste0 (which is paste with sep="") to stick them together
> paste0(strtrim(str.vect, nchar(str.vect)-3), # Beginning
substr(str.vect, nchar(str.vect)-1, nchar(str.vect))) # End
[1] "abc.1" "abc.1" "abc.2" "abc.2"
There are easier ways if you know your strings have some special characteristics
For instance, if the length is always 6 you can directly substitute the nchar calls with the appropriate value.
EDIT: alternatively, R also supports regular expressions, which make this task much easier.
> gsub(".(..)$", "\\1", str.vect)
[1] "abc.1" "abc.1" "abc.2" "abc.2"
The syntax is a bit more obscure, but not that difficult once you know what you are looking at.
The first parameter (".(..)$") is what you want to match
. matches any character, $ denotes the end of the string.
So ...$ indicates the last 3 characters in the string.
We put the last two in parentheses, so that we can store them in memory.
The second parameter tells us what you want to substitute the matched substring with. In our case we put \\1 which means "whatever was in the first pair of parentheses".
So essentially this command means: "find the last three characters in the string and change them with the last two".

The solution provided by @nico seems fine, but a simpler alternative might be to use sub:
sub('.(.{2})$', '\\1', str.vect)
This searches for the pattern of: "any character (represented by .) followed by 2 of any character (represented by .{2}), followed by the end of the string (represented by $)". By wrapping the .{2} in parentheses, R captures whatever those last two characters were. The second argument is the string to replace the matched substrings with. In this case, we refer to the first string captured in the matched pattern. This is represented by \\1. (If you captured multiple parts of the pattern, with multiple sets of parentheses, you would refer to subsequent captured regions with, e.g. \\2, \\3, etc.)
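The same pattern can be generalised to "the n-th character from the right" by building the repetition count with sprintf. This is a sketch; the function name remove_from_right is made up for illustration, and perl = TRUE is an assumption added here. Strings shorter than n are left unchanged because the pattern simply does not match.

```r
# Hypothetical helper: drop the n-th character counted from the right
remove_from_right <- function(x, n) {
  # One character to drop, followed by the n - 1 characters to keep
  pattern <- sprintf(".(.{%d})$", n - 1)
  sub(pattern, "\\1", x, perl = TRUE)
}

str.vect <- c("abcR.1", "abcL.1", "abcR.2", "abcL.2")
third <- remove_from_right(str.vect, 3)
second <- remove_from_right(str.vect, 2)
print(third)
print(second)
```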

str.vect <- c("abcR.1", "abcL.1", "abcR.2", "abcL.2")
# Split each string into single characters
a <- strsplit(str.vect, split="")
# Blank out the 4th character, then glue the characters back together
b <- unlist(lapply(a, FUN=function(x) {
  x[4] <- ""
  paste(x, collapse="")
}))
If you want to parameterize it further, change 4 to a variable holding the index of the character you want to remove.
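A parameterised version might look like the sketch below, which uses substr instead of splitting into characters. The name remove_char_at is made up for illustration; the index counts from the left, so the third character from the right is nchar(x) - 2.

```r
# Hypothetical helper: drop the character at index i (counted from the left)
remove_char_at <- function(x, i) {
  paste0(substr(x, 1, i - 1), substr(x, i + 1, nchar(x)))
}

str.vect <- c("abcR.1", "abcL.1", "abcR.2", "abcL.2")
by_left <- remove_char_at(str.vect, 4)
by_right <- remove_char_at(str.vect, nchar(str.vect) - 2)
print(by_left)
print(by_right)
```

Because substr and nchar are vectorised, the index may differ per element, which is what makes the count-from-the-right form work for strings of different lengths.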

Not sure how general or efficient this is, but it seems to work with your example string:
(This seems very similar to nico's answer although I am not using the strtrim function.)
my.string <- c("abcR.1", "abcL.1", "abcR.2", "abcL.2")
n.char <- nchar(my.string)
the.beginning <- substr(my.string, n.char-(n.char-1), n.char-3)
the.end <- substr(my.string, n.char-1, n.char)
new.string <- paste0(the.beginning, the.end)
new.string
# [1] "abc.1" "abc.1" "abc.2" "abc.2"

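The 3rd character from the right of each element is removed. Note that gsub removes every occurrence of that character and treats it as a regular expression, so this relies on the character appearing only once in each string (as it does in the example).
sapply(str.vect, function(x) gsub(substr(x, nchar(x)-2, nchar(x)-2), "", x))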

This is a very quick and dirty answer, but that's what is needed sometimes:
#Define vector
str.vect <- c("abcR.1", "abcL.1", "abcR.2", "abcL.2")
#Use gsub to remove both 'R' and 'L' independently.
str.vect2 <- gsub("R", '', str.vect )
str.vect_final <- gsub("L", '', str.vect2 )
> str.vect_final
[1] "abc.1" "abc.1" "abc.2" "abc.2"

Splitting a file name into name,extension

I have the name of a file like this: name1.csv and I would like to extract two substrings of this string. One that stores the name1 in one variable and other that stores the extension, csv, without the dot in another variable.
I have been searching if there is a function like indexOf of Java that allows to do that kind of manipulation, but I have not found anything at all.
Any help?

Use strsplit:
R> strsplit("name1.csv", "\\.")[[1]]
[1] "name1" "csv"
R>
Note that you a) need to escape the dot (as it is a metacharacter for regular expressions) and b) deal with the fact that strsplit() returns a list of which typically only the first element is of interest.
A more general solution involves regular expressions where you can extract the matches.
For the special case of filenames you also have:
R> library(tools) # unless already loaded, comes with base R
R> file_ext("name1.csv")
[1] "csv"
R>
and
R> file_path_sans_ext("name1.csv")
[1] "name1"
R>
as these are such common tasks (cf. basename in shell etc).
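The two approaches differ once a file name contains more than one dot. The sketch below uses a made-up file name, archive.tar.gz, to show the difference; fixed = TRUE is used here as an alternative to escaping the dot.

```r
f <- "archive.tar.gz"

# Splitting on every literal dot yields three pieces, not two
pieces <- strsplit(f, ".", fixed = TRUE)[[1]]

# The tools helpers only look at the last extension
ext <- tools::file_ext(f)
stem <- tools::file_path_sans_ext(f)

print(pieces)
print(ext)
print(stem)
```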

Use strsplit():
http://stat.ethz.ch/R-manual/R-devel/library/base/html/strsplit.html
Example:
> strsplit('name1.csv', '[.]')[[1]]
[1] "name1" "csv"
Note that the second argument is a regular expression, which is why you can't just pass a single dot (it would be interpreted as "any character").

Using a regular expression, you can do this, for example:
regmatches(x='name1.csv',gregexpr('[.]','name1.csv'),invert=TRUE)
[[1]]
[1] "name1" "csv"
