To clean text scraped from a web page, I ran gsub() to replace redundant symbols. In the process I used POSIX character classes (such as [:blank:], [:digit:], [:print:], etc.) in the replacement argument, but instead of performing their intended function they were inserted literally in place of the characters matched in the target text.
pg<-"http://www.irgrid.ac.cn/handle/1471x/1066693?mode=full&submit_simple=Show+full+item+record"
library(XML)
MetaNode <- getNodeSet(htmlParse(pg), '//table[@class="itemDisplayTable"]')
meta_label <- xpathSApply(MetaNode[[1]], './/td[@class="metadataFieldLabel"]', xmlValue)
meta_label <- gsub("[[:blank:]]+", "[:blank:]", meta_label)
meta_label <- gsub("[[:punct:]]+", "", meta_label)
meta_label
 [1] "Titleblank"                                  "Authorblank"
 [3] "IssuedblankDateblank"                        "Sourceblank"
 [5] "IndexedblankTypeblank"                       "ContentblankTypeblank"
 [7] "URI标识blank"                                "OpenblankAccessblank\r\nTypeblank"
 [9] "fulltextblankversionblank\r\nblanktypeblank" "专题blank"
Can those POSIX character classes only be used in the "pattern" parameter of these functions, and not in "replacement"? And do special characters like "\r" and "\n" have POSIX class equivalents?
You cannot use [[:blank:]] as a replacement because it stands for a whole class of different characters, not one specific character. If you want to reduce a run of repeated characters to a single occurrence, you can use something like
x <- "Hello   World"   # note the run of multiple blanks
gsub("([[:blank:]])+", "\\1", x)
# [1] "Hello World"
Here we use a regular expression capture group to grab the blank that was matched and substitute it back in as the replacement.
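As for the follow-up question about "\r" and "\n": they have no POSIX classes of their own, but [:space:] matches them (it covers space, tab, newline, carriage return, vertical tab and form feed), so one substitution can normalize all whitespace runs at once. A minimal sketch on a string shaped like the scraped labels:

```r
# [[:space:]] matches blanks as well as \r and \n, so a single gsub()
# collapses every whitespace run (including line breaks) to one space
meta <- "Open Access\r\nType"
gsub("[[:space:]]+", " ", meta)
# [1] "Open Access Type"
```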
In the code below I read some file names into R. The actual number of files is much larger, but this is a representative example.
folder <- here("test_data2")
files <- basename(list.files(path=folder,full.names=TRUE, pattern= "*tab.cut$"))
files
[1] "A_r1_D7__A-Prokka_1.tab.cut" "AB_r1_D7__A-Prokka_1.tab.cut" "AB_r2_D7__A-Prokka_1.tab.cut"
[4] "AB_r2_D7__B-Prokka_1.tab.cut" "AB_r2_D7__B-Prokka_10.tab.cut" "AB_r2_D7__B-Prokka_11.tab.cut"
[7] "AB_r2_D7__B-Prokka_12.tab.cut" "AB_r2_D7__B-Prokka_13.tab.cut" "AB_r2_D7__B-Prokka_14.tab.cut"
[10] "AB_r2_D7__B-Prokka_15.tab.cut" "AB_r2_D7__B-Prokka_16.tab.cut" "AB_r2_D7__B-Prokka_17.tab.cut"
[13] "AB_r2_D7__B-Prokka_18.tab.cut" "AB_r2_D7__B-Prokka_19.tab.cut" "AB_r2_D7__B-Prokka_2.tab.cut"
[16] "AB_r2_D7__B-Prokka_3.tab.cut" "AB_r2_D7__B-Prokka_4.tab.cut" "AB_r2_D7__B-Prokka_5.tab.cut"
[19] "AB_r2_D7__B-Prokka_6.tab.cut" "AB_r2_D7__B-Prokka_7.tab.cut" "AB_r2_D7__B-Prokka_8.tab.cut"
[22] "AB_r2_D7__B-Prokka_9.tab.cut" "ABCD_r1_D14__B-Prokka_1.tab.cut" "ABCD_r1_D14__B-Prokka_10.tab.cut"
[25] "ABCD_r1_D14__B-Prokka_11.tab.cut" "ABCD_r1_D14__B-Prokka_12.tab.cut" "ABCD_r1_D14__B-Prokka_13.tab.cut"
[28] "ABCD_r1_D14__B-Prokka_14.tab.cut" "ABCD_r1_D14__B-Prokka_15.tab.cut" "ABCD_r1_D14__B-Prokka_16.tab.cut"
[31] "ABCD_r1_D14__B-Prokka_17.tab.cut" "ABCD_r1_D14__B-Prokka_18.tab.cut" "ABCD_r1_D14__B-Prokka_19.tab.cut"
[34] "ABCD_r1_D14__B-Prokka_2.tab.cut" "ABCD_r1_D14__B-Prokka_3.tab.cut" "ABCD_r1_D14__B-Prokka_4.tab.cut"
[37] "ABCD_r1_D14__B-Prokka_5.tab.cut" "ABCD_r1_D14__B-Prokka_6.tab.cut" "ABCD_r1_D14__B-Prokka_7.tab.cut"
[40] "ABCD_r1_D14__B-Prokka_8.tab.cut" "ABCD_r1_D14__B-Prokka_9.tab.cut" "ABCD_r1_D14__C-Prokka_1.tab.cut"
[43] "ABCD_r1_D14__C-Prokka_2.tab.cut" "ABCD_r1_D14__D-Prokka_1.tab.cut" "ABCD_r1_D14__D-Prokka_2.tab.cut"
[46] "ABCD_r1_D14__D-Prokka_3.tab.cut" "ABCD_r1_D14__D-Prokka_4.tab.cut" "ABCD_r1_D14__D-Prokka_5.tab.cut"
[49] "ABCD_r1_D7__A-Prokka_1.tab.cut" "ACD_r2_D7__C-Prokka_1.tab.cut" "ACD_r2_D7__C-Prokka_2.tab.cut"
[52] "ACD_r2_D7__D-Prokka_1.tab.cut" "ACD_r2_D7__D-Prokka_2.tab.cut" "ACD_r2_D7__D-Prokka_3.tab.cut"
[55] "ACD_r2_D7__D-Prokka_4.tab.cut" "ACD_r2_D7__D-Prokka_5.tab.cut" "AD_r1_D7__A-Prokka_1.tab.cut"
[58] "CD_r2_D7__C-Prokka_1.tab.cut" "CD_r2_D7__C-Prokka_2.tab.cut"
But let's say I want to use the regular expression in the list.files() function to filter out those files that do NOT contain "B" among the first four characters.
I would think what's below is the proper pattern. What I'm trying to express at the beginning with \\D{1,4}[B] is: match any string of 1 to 4 characters that contains a "B".
B_files <- list.files(path=folder,full.names=TRUE, pattern= "\\D{1,4}[B]_([rR][123])_D\\d{1,2}__B-Prokka_\\d{1,2}.tab.cut$")
But this only returns those files that begin with "AB". Those that begin with "ABCD" are not in the output.
However, when I slightly alter the code by adding the ? quantifier, I suddenly get an output with files that begin with both "ABCD" and "AB" :
B_files <- list.files(path=folder,full.names=TRUE, pattern= "\\D{1,4}[B]?_([rR][123])_D\\d{1,2}__B-Prokka_\\d{1,2}.tab.cut$")
Can somebody tell me what's going on here? I thought ? was lazy, meaning it searches for the shortest possible string. Thus, shouldn't the addition of the ? quantifier return only the files starting with "AB"?
And, overall, is my regular expression the right way to go about filtering those files that contain the character "B" within the first one to four characters?
Any help is appreciated.
Thank you!
We can use a pattern specifying the start (^) of the string, followed by four characters that do not include 'B' (using ^ inside the square brackets for negation):
"^[^B]{4}.*tab\\.cut$"
We can do the inverse with grep
grep("^[^B]{4}.*tab\\.cut$", files, invert = TRUE, value = TRUE)
#[1] "AB_r2_D7__B-Prokka_1.tab.cut" "ABCD_r1_D14__B-Prokka_11.tab.cut"
data
files <- c( "A_r1_D7__A-Prokka_1.tab.cut" , "AB_r2_D7__B-Prokka_1.tab.cut", "ABCD_r1_D14__B-Prokka_11.tab.cut", "CD_r2_D7__C-Prokka_1.tab.cut" , "ACD_r2_D7__D-Prokka_4.tab.cut" )
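As for why adding ? changed the result: ? does not make the match lazy here. It makes the preceding [B] optional (zero or one occurrences; the lazy variant would be ??). With [B]? the pattern no longer requires a "B" before the underscore at all, which is why the "ABCD" files suddenly match. A quick sketch on a hypothetical minimal vector:

```r
x <- c("AB_r1", "ABCD_r1")
# B must immediately precede the underscore: only "AB_r1" qualifies
grepl("\\D{1,4}[B]_", x)
# [1]  TRUE FALSE
# [B]? makes the B optional, so "ABCD_r1" matches via "BCD_" with no B at all
grepl("\\D{1,4}[B]?_", x)
# [1] TRUE TRUE
```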
dat <- c("A_r1_D7.cut", "AB_r1_D7.cut", "ABCD_r1_D14.cut",
"ACD_r2_D7.cut", "CD_r2_D7.cut", "B.c")
You can write:
grep("^.{0,3}B", dat, value=T)
or
grep("^(?=.{0,3}B)", dat, value=T, perl=T)
Both return
[1] "AB_r1_D7.cut" "ABCD_r1_D14.cut" "B.c"
Note that the latter must use the PCRE regex engine, as the default engine does not support positive lookaheads (or lookarounds generally).
I have hundreds of TXT files which contain many things and some download links.
The pattern of the download links are like this:
start with: http://
and
end with: .nc
I created a sample text file for your convenience that you could download from this link:
https://www.dropbox.com/s/5crmleli2ppa1rm/textfile_including_https.txt?dl=1
Based on this topic on Stack Overflow, I tried to extract all the download links from the text file:
Extract websites links from a text in R
Here is my code:
download_links <- readLines(file.choose())
All_my_links <- gsub(download_links, pattern=".*(http://.*nc).*", replace="\\1")
But it returns all the lines, while I only want to extract the http links ending with .nc.
Here is the result:
> head(All_my_links )
[1] "#!/bin/bash"
[2] "##############################################################################"
[3] "version=1.3.2"
[4] "CACHE_FILE=.$(basename $0).status"
[5] "openId="
[6] "search_url='https://esgf-node.llnl.gov/esg-search/wget/?distrib=false&dataset_id=CMIP6.HighResMIP.MIROC.NICAM16-9S.highresSST-present.r1i1p1f1.day.pr.gr.v20190830|esgf-data2.diasjp.net'"
> tail(All_my_links )
[1] "MYPROXY_STATUS=$HOME/.MyProxyLogon"
[2] "COOKIE_JAR=$ESG_HOME/cookies"
[3] "MYPROXY_GETCERT=$ESG_HOME/getcert.jar"
[4] "CERT_EXPIRATION_WARNING=$((60 * 60 * 8)) #Eight hour (in seconds)"
[5] ""
[6] "WGET_TRUSTED_CERTIFICATES=$ESG_HOME/certificates"
What is my mistake in the code?
Any comment would be highly appreciated.
gsub() is not for extracting; that's what's wrong with your code. It's for replacing (see help("gsub")). For the purposes of this demonstration, I will use the following data:
x <- c("abc", "123", "http://site.nc")
(I will not, as a rule, download data posted here as a link. Most others won't either. If you want to share example data, it's best to include the output of dput() in your question.)
Let's see what happens with your gsub() approach:
gsub(pattern = ".*(http://.*nc).*", replacement = "\\1", x = x)
# [1] "abc" "123" "http://site.nc"
Looks familiar. What's going on here is that gsub() looks at each element of x and, where the pattern matches, replaces the match with replacement (here the captured link, which in this demo is the whole element anyway). Where the pattern does not match, the element is returned unchanged, which is why all the non-link lines come back untouched.
I would suggest stringr::str_extract():
stringr::str_extract(string = x, pattern = ".*http://.*nc.*")
# [1] NA NA "http://site.nc"
If you wrap this in na.omit(), it gives you the output I think you want:
na.omit(stringr::str_extract(string = x, pattern = ".*http://.*nc.*"))
# [1] "http://site.nc"
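For completeness, base R can extract too, without stringr: regmatches() together with regexpr() returns the first match per element and silently drops elements with no match. A sketch on a slightly extended sample vector (the \\S+\\.nc tightens the pattern so only the link itself, not the whole line, is returned):

```r
x <- c("abc", "123", "some text http://site.nc more text")
# regexpr() finds the first match position; regmatches() pulls the text out
m <- regexpr("http://\\S+\\.nc", x)
regmatches(x, m)
# [1] "http://site.nc"
```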
I encountered this question:
PHP explode the string, but treat words in quotes as a single word
and similar dealing with using Regex to explode words in a sentence, separated by a space, but keeping quoted text intact (as a single word).
I would like to do the same in R. I have attempted to copy-paste the regular expression into stri_split from the stringi package, as well as strsplit in base R, but, as I suspected, the regular expression uses a format R does not recognize. The error is:
Error: '\S' is an unrecognized escape in character string...
The desired output would be:
mystr <- '"preceded by itself in quotation marks forms a complete sentence" preceded by itself in quotation marks forms a complete sentence'
myfoo(mystr)
[1] "preceded by itself in quotation marks forms a complete sentence" "preceded" "by" "itself" "in" "quotation" "marks" "forms" "a" "complete" "sentence"
Trying: strsplit(mystr, '/"(?:\\\\.|(?!").)*%22|\\S+/') gives:
Error in strsplit(mystr, "/\"(?:\\\\.|(?!\").)*%22|\\S+/") :
invalid regular expression '/"(?:\\.|(?!").)*%22|\S+/', reason 'Invalid regexp'
A simple option would be to use scan:
> x <- scan(what = "", text = mystr)
Read 11 items
> x
[1] "preceded by itself in quotation marks forms a complete sentence"
[2] "preceded"
[3] "by"
[4] "itself"
[5] "in"
[6] "quotation"
[7] "marks"
[8] "forms"
[9] "a"
[10] "complete"
[11] "sentence"
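If you do want the regex route, the PHP pattern can be made to work in R: drop the /.../ delimiters, replace the HTML-escaped %22 with a real quote, double the backslashes for R's string syntax, and use perl = TRUE. Since this is really matching rather than splitting, gregexpr() plus regmatches() fits better than strsplit(). A sketch on a shortened sample string:

```r
mystr <- '"preceded by itself in quotation marks" preceded by itself'
# match either a double-quoted run (with \\ escapes allowed inside)
# or a run of non-whitespace characters
m <- gregexpr('"(?:\\\\.|(?!").)*"|\\S+', mystr, perl = TRUE)
res <- regmatches(mystr, m)[[1]]
gsub('^"|"$', '', res)  # strip the enclosing quotes from quoted chunks
# [1] "preceded by itself in quotation marks" "preceded" "by" "itself"
```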
Looking for some guidance on how to replace a curly apostrophe with a straight apostrophe in an R list of character vectors.
The reason I'm replacing the curly apostrophes - later in the script, I check each list item, to see if it's found in a dictionary (using qdapDictionary) to ensure it's a real word and not garbage. The dictionary uses straight apostrophes, so words with the curly apostrophes are being "rejected."
A sample of the code I have currently follows. In my test list, item #6 contains a curly apostrophe, and item #2 has a straight apostrophe.
Example:
list_TestWords <- as.list(c("this", "isn't", "ideal", "but", "we", "can’t", "fix", "it"))
func_ReplaceTypographicApostrophes <- function(x) {
gsub("’", "'", x, ignore.case = TRUE)
}
list_TestWords_Fixed <- lapply(list_TestWords, func_ReplaceTypographicApostrophes)
The result: No change. Item 6 still using curly apostrophe. See output below.
list_TestWords_Fixed
[[1]]
[1] "this"
[[2]]
[1] "isn't"
[[3]]
[1] "ideal"
[[4]]
[1] "but"
[[5]]
[1] "we"
[[6]]
[1] "can’t"
[[7]]
[1] "fix"
[[8]]
[1] "it"
Any help you can offer will be most appreciated!
This might work: gsub("[\u2018\u2019\u201A\u201B\u2032\u2035]", "'", x)
I found it over here: http://axonflux.com/handy-regexes-for-smart-quotes
You might be running up against a bug in R on Windows. Try using utf8::as_utf8 on your input. Alternatively, this also works:
library(utf8)
list_TestWords <- as.list(c("this", "isn't", "ideal", "but", "we", "can’t", "fix", "it"))
lapply(list_TestWords, utf8_normalize, map_quote = TRUE)
This will replace the following characters with ASCII apostrophe:
U+055A ARMENIAN APOSTROPHE
U+2018 LEFT SINGLE QUOTATION MARK
U+2019 RIGHT SINGLE QUOTATION MARK
U+201B SINGLE HIGH-REVERSED-9 QUOTATION MARK
U+FF07 FULLWIDTH APOSTROPHE
It will also convert your text to composed normal form (NFC).
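To check which apostrophe variant a string actually contains (before and after normalizing), utf8ToInt() shows the code points. A sketch:

```r
x <- "can\u2019t"  # U+2019 RIGHT SINGLE QUOTATION MARK, the curly apostrophe
sprintf("U+%04X", utf8ToInt(x))
# [1] "U+0063" "U+0061" "U+006E" "U+2019" "U+0074"
```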
I see a problem in your call to gsub:
gsub("/’", "/'", x, ignore.case = TRUE)
You are prefixing the curly single quote with a forward slash. I don't know why you are doing this. I could speculate that you are trying to escape the quote characters, but this is having the side effect that your pattern is now trying to match a forward slash followed by a quote. As this never occurs in your text, no replacements are being made. You should be doing this:
gsub("’", "'", x, ignore.case = TRUE)
With that change, the gsub call works as you expect.
Was about to say the same thing. Try using str_replace from the stringr package; you won't need the slashes.
I was facing a similar problem. Somehow none of the solutions worked for me, so I devised an indirect way: match the apostrophe by its context and replace it with the required character.
gsub("(\\w)(\\W)(\\w\\s)", "\\1'\\3","sid’s bicycle")
[1] "sid's bicycle"
Hope it helps someone.
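A portable variant of the straight gsub() fix is to write the curly quote as a Unicode escape instead of a literal character in the source, which sidesteps source-file encoding issues (e.g. on Windows). A sketch on a shortened version of the sample list:

```r
list_TestWords <- list("this", "isn't", "can\u2019t", "fix")
# \u2019 is the curly apostrophe; fixed = TRUE does a plain (non-regex) replace
lapply(list_TestWords, function(x) gsub("\u2019", "'", x, fixed = TRUE))
```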
I have the following vector of strings:
[1] "/players/playerpage.htm?ilkidn=BRYANPHI01"
[2] "/players/playerpage.htm?ilkidhh=WILLIROB027"
[3] "/players/playerpage.htm?ilkid=THOMPWIL01"
I am looking for a way to retrieve the part of the string that is placed after the equal sign meaning I would like to get a vector like this
[1] "BRYANPHI01"
[2] "WILLIROB027"
[3] "THOMPWIL01"
I tried using substr, but for it to work I would have to know exactly where the equal sign is placed in the string and where the part I want to retrieve ends.
We can use sub to match zero or more characters that are not an = ([^=]*), followed by the =, and replace that match with ''.
sub("[^=]*=", "", str1)
#[1] "BRYANPHI01" "WILLIROB027" "THOMPWIL01"
data
str1 <- c("/players/playerpage.htm?ilkidn=BRYANPHI01",
"/players/playerpage.htm?ilkidhh=WILLIROB027",
"/players/playerpage.htm?ilkid=THOMPWIL01")
Using stringr,
library(stringr)
word(str1, 2, sep = '=')
#[1] "BRYANPHI01" "WILLIROB027" "THOMPWIL01"
Using strsplit,
strsplit(str1, "=")[[1]][2]
# [1] "BRYANPHI01"
With Sotos' comment, to get the results as a vector:
sapply(str1, function(x){
strsplit(x, "=")[[1]][2]
})
Another solution based on regex, but extracting instead of substituting, which may be more efficient.
I use the stringi package which provides a more powerful regex engine than base R (in particular, supporting look-behind).
library(stringi)
str1 <- c("/players/playerpage.htm?ilkidn=BRYANPHI01",
"/players/playerpage.htm?ilkidhh=WILLIROB027",
"/players/playerpage.htm?ilkid=THOMPWIL01")
stri_extract_all_regex(str1, pattern="(?<==).+$", simplify=T)
(?<==) is a look-behind: regex will match only if preceded by an equal sign, but the equal sign will not be part of the match.
.+$ matches everything up to the end of the string. You could replace the dot with something more precise if you are confident about the format of what you match. For example, \w matches any word character (letters, digits, underscore), so you could use "(?<==)\\w+$" (the backslash must be escaped in the R string, which is why you write \\w).
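Base R can do the same without stringi, since its PCRE engine (perl = TRUE) also supports look-behind; regexpr() plus regmatches() then extracts the match. A sketch on the same vector:

```r
str1 <- c("/players/playerpage.htm?ilkidn=BRYANPHI01",
          "/players/playerpage.htm?ilkidhh=WILLIROB027",
          "/players/playerpage.htm?ilkid=THOMPWIL01")
# (?<==)\w+$ : word characters at the end of the string, preceded by "="
regmatches(str1, regexpr("(?<==)\\w+$", str1, perl = TRUE))
# [1] "BRYANPHI01"  "WILLIROB027" "THOMPWIL01"
```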