I am trying to parse a string with regex to pull out information between a colon and the last newline prior to the next colon. How can I do this?
string <- "Name: Al's\nPlace\nCountry:\nState\n/ Province: RI\n"
stringr::str_extract_all(string, "(?<=:)(.*)(?:\\n)")
but I get:
[[1]]
[1] " Al's\n" " \n" " RI\n"
when I want:
[[1]]
[1] " Al's\nPlace\n" " \n" " RI\n"
I'm not sure if this is what you're after as your wanted output looks a bit different.
:((?:.*\\n?)+?)(?=.*:|$)
: matches a colon
((?:.*\n?)+?) lazily matches and captures one or more lines (each ending in an optional \n)
(?=.*:|$) stops as soon as the line ahead contains a colon, or the end of the string is reached
See this demo at regex101
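As a rough illustration, the same idea can be tried in base R with perl = TRUE. This sketch swaps the leading colon for a lookbehind so that the colon itself is left out of each match; that change and the variable names pattern and hits are assumptions for the example, not part of the answer above.

```r
string <- "Name: Al's\nPlace\nCountry:\nState\n/ Province: RI\n"

# (?<=:)        start right after a colon (the colon is not part of the match)
# (?:.*\n?)+?   lazily take whole lines, one at a time
# (?=.*:|$)     stop once the next line contains a colon, or at the end
pattern <- "(?<=:)(?:.*\\n?)+?(?=.*:|$)"

hits <- regmatches(string, gregexpr(pattern, string, perl = TRUE))[[1]]
print(hits)
```

Note that the "Country" entry comes back as "\nState\n" rather than the " \n" shown in the question, which is the difference the answer mentions.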
Curious if you might offer advice on the following.
I have data in a text file in this form:
"var1"
"    var1a"
"        var1a_descrp1"
"            thing"
"    var1b"
"        var1b_descrp2"
"            thing"
"        var1b_descrp3"
"            thing1"
"            thing2"
"        var1b_descrp4"
"poobarvar"
"    var2a"
"        var2a_descrp1"
"    var2b"
"        var2b_descrp1"
"            thing"
"        var2b_descrp1"
"            thing1"
"            thing2"
"            thing3"
The indentation goes to a maximum depth of 12 spaces, i.e. three levels deep.
I'd love to cleanly parse this into a nested list with something like the following structure:
$var1
$var1$var1a
$var1$var1a$var1a_descrp1
$var1$var1a$var1a_descrp1[[1]]
[1] "thing"
$var1$var2a
$var1$var2a$var2a_descrp2
$var1$var2a$var2a_descrp2[[1]]
[1] "thing"
$var1$var2a$var2a_descrp3
$var1$var2a$var2a_descrp3[[1]]
[1] "thing1"
$var1$var2a$var2a_descrp3[[2]]
[1] "thing2"
$poobarvar
$poobarvar$var2a
list()
$poobarvar$var2b
$poobarvar$var2b$var2b_descrp1
$poobarvar$var2b$var2b_descrp1[[1]]
[1] "thing1"
$poobarvar$var2b$var2b_descrp1[[2]]
[1] "thing2"
$poobarvar$var2b$var2b_descrp1[[3]]
[1] "thing3"
I have a pretty convoluted set of while loops and if-else statements I'd love to clean up.
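One possible way to avoid the nested loops is a small recursive function. The sketch below is only an assumption about what is wanted: it assumes each level is indented by exactly 4 spaces, that lines without children become unnamed character elements, and that lines with children become named sub-lists. The names parse_indented and build are hypothetical.

```r
# Hypothetical helper: turn indented lines into a nested list.
# Assumes one level of nesting per `step` leading spaces.
parse_indented <- function(lines, step = 4L) {
  depth <- (nchar(lines) - nchar(trimws(lines, which = "left"))) %/% step
  label <- trimws(lines)

  build <- function(idx) {
    # Lines at the shallowest depth of this block head the sub-trees
    heads <- idx[depth[idx] == min(depth[idx])]
    out <- vector("list", length(heads))
    has_kids <- logical(length(heads))
    for (k in seq_along(heads)) {
      # Children are the lines between this head and the next head
      kids <- idx[idx > heads[k] & idx < c(heads, Inf)[k + 1L]]
      has_kids[k] <- length(kids) > 0L
      out[[k]] <- if (has_kids[k]) build(kids) else label[heads[k]]
    }
    # Only nodes with children get a name; leaves stay unnamed
    names(out) <- ifelse(has_kids, label[heads], "")
    out
  }

  build(seq_along(lines))
}

x <- c("var1",
       "    var1a",
       "        var1a_descrp1",
       "            thing",
       "poobarvar",
       "    var2a")
str(parse_indented(x))
```

Repeated names at the same level (such as two var2b_descrp1 entries) are kept as separate elements, so access by position is safer than access by name in that case.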
I have a dataset called Price which is supposed to be numeric but is read in as strings because each 5 has been replaced by +.
It looks like this:
"99000" "98300" "98300" "98290" "98310" " 9831+ " "98310" " 9830+ " " 9830+ " " 9830+ " " 9829+ " " 9828+ " " 9827+ " "98270"
I used the gsub function in R to try and replace + by 5. The code I wrote is:
finalPrice<-gsub("+",5,Price)
However, the output is just a bunch of numbers which doesn't make sense for what I intended:
"59595050505,5 59585350505,5 59585350505,5 59585259505,5 59585351505,5 5 5 595853515+5 5,5 59585351505,5 5 5 595853505+5 5,5 5 5 595853505+5
How can I fix this?
The + sign should be escaped. Try this:
finalPrice<-gsub("\\+",5, Price)
Besides using a double backslash to force the pattern argument to match a literal character, you can also use the fixed=TRUE parameter or put the character in a character class such as "[+]". See the ?regex page for more details:
> gsub("+", "5", txt, fixed=TRUE)
[1] "99000" "98300" "98300" "98290" "98310"
[6] " 98315 " "98310" " 98305 " " 98305 " " 98305 "
[11] " 98295 " " 98285 " " 98275 " "98270"
> gsub("[+]", "5", txt)
[1] "99000" "98300" "98300" "98290" "98310"
[6] " 98315 " "98310" " 98305 " " 98305 " " 98305 "
[11] " 98295 " " 98285 " " 98275 " "98270"
In a regex, + means match the preceding item one or more times. Because nothing precedes the + in your pattern, gsub ends up matching the empty string at every position in the target.
The result is that 5 is inserted at each of these positions.
To avoid this, escape the +, which needs to be done with double backslash in R:
finalPrice<-gsub("\\+",5,Price)
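As a small check of the approaches mentioned in these answers, the sketch below runs all three on a shortened sample and then converts the result to numeric. The vector price is a made-up subset of the data in the question, and the final as.numeric step is an assumption about what is ultimately wanted.

```r
# Shortened, hypothetical sample of the data from the question
price <- c("99000", "98310", " 9831+ ", " 9830+ ", "98270")

escaped  <- gsub("\\+", "5", price)             # escape the metacharacter
literal  <- gsub("+", "5", price, fixed = TRUE) # treat the pattern as plain text
in_class <- gsub("[+]", "5", price)             # + is literal inside a character class

# All three give the same strings
stopifnot(identical(escaped, literal), identical(escaped, in_class))

# Strip the padding and convert to numbers
final_price <- as.numeric(trimws(escaped))
print(final_price)
```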
For uppercase letters, lowercase letters and the ten digits, I can generate vectors that contain all of them as follows:
A <- LETTERS[0:26]
B <- letters[0:26]
C <- seq(0,9)
I wonder whether there is a similar function for non-alphanumeric characters.
~!@#$%^&*_-+=`|\(){}[]:;"'<>,.?/
I tried
D <- c("~","!","@","#","$","%","^","&","*","_","-","+","=","`","|","\\","(",")","{","}","[","]",":",";","\"","'","<",">",",",".","?","/")
Thanks
This is another option. Generate all ASCII characters, then keep only the punctuation using a regular expression.
ascii <- rawToChar(as.raw(0:127), multiple=TRUE)
ascii[grepl('[[:punct:]]', ascii)]
# [1] "!" "\"" "#" "$" "%" "&" "'" "(" ")" "*" "+" "," "-" "." "/" ":" ";" "<" "=" ">" "?" "@"
# [23] "[" "\\" "]" "^" "_" "`" "{" "|" "}" "~"
This might be useful. The ASCII character set is arranged in ranges of similar types of characters (letters, etc.).
http://datadebrief.blogspot.com/2011/03/ascii-code-table-in-r.html
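One way to use those ranges directly, sketched here as an assumption rather than something the linked page prescribes: the printable ASCII punctuation sits at code points 33-47, 58-64, 91-96 and 123-126, so the characters can be built from the codes without any regex. The names punct_codes and punct are made up for the example.

```r
# Code points of the four punctuation ranges in the ASCII table
punct_codes <- c(33:47, 58:64, 91:96, 123:126)

# Convert each code to its one-character string
punct <- rawToChar(as.raw(punct_codes), multiple = TRUE)

print(punct)
length(punct)
```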
It's a bit drawn out, and there's probably a better website (and a better way to get the same result), but
library(XML); library(RCurl)
doc <- htmlParse(getURL("https://wci.llnl.gov/codes/basis/manual/node161.html"))
xp <- xpathSApply(doc, "//tr/td", xmlValue, trim = TRUE)
xp[nzchar(xp) & nchar(xp) == 1]
# [1] "!" "[" "%" "," "]" "&" "-" "|" "'" "." "=" "~" "("
# [14] "/" ")" "*" "=" "{" "?" "`" "}" "#" ":" ";" "^" " "
Also, using the website from the other answer yields a more complete result
> URL <- "http://datadebrief.blogspot.com/2011/03/ascii-code-table-in-r.html"
> r <- readLines(URL, warn = FALSE)[780:874]
> s <- sapply(strsplit(r, "\\s+"), "[", 1)
> s[!s %in% c(letters, LETTERS, 0:9)]
# [1] "" "!" "\"" "#" "$" "%" "&" "'" "("
# [10] ")" "*" "+" "," "-" "." "/" ":" ";"
# [19] "<" "=" ">" "?" "@" "[" "\\\\" "]" "^"
# [28] "_" "`" "{" "|" "}" "~"
...or yeah, just use rawToChar(as.raw(...)) like MrFlick said :-)
This answer is only for amusement: list the characters you want and use strsplit to generate your vector.
> D <- strsplit('!"#$%&\'()*+,-./\\:;<=>?@[]^_`{|}~', '(?=.)', perl=TRUE)[[1]]
## [1] "!" "\"" "#" "$" "%" "&" "'" "(" ")" "*" "+" "," "-" "." "/"
## [16] "\\" ":" ";" "<" "=" ">" "?" "@" "[" "]" "^" "_" "`" "{" "|"
## [31] "}" "~"
Or filter the characters you want.
> D <- gsub('[^\\pP\\pS]', '', rawToChar(as.raw(1:127), multiple=TRUE), perl=TRUE)
> D[D != ""]
## [1] "!" "\"" "#" "$" "%" "&" "'" "(" ")" "*" "+" "," "-" "." "/"
## [16] ":" ";" "<" "=" ">" "?" "@" "[" "\\" "]" "^" "_" "`" "{" "|"
## [31] "}" "~"
I would like to extract the first value from this list:
[[1]]
[1] " \" 0.0337302" " -0.000248016" " -0.000496032" " -0.000744048"
[5] " -0.000992063" " -0.00124008" " -0.0014881" " -0.00173611"
[9] " -0.00198413" " -0.00223214" " -0.00248016" " -0.00272817"
[13] " -0.00297619" " -0.00322421" " -0.00347222" " -0.00372024"
[17] " -0.00396825" " -0.00421627" " -0.00446429" " -0.0047123"
[21] " -0.00496032" " -0.00520833" " -0.00545635" " -0.00570437"
The list is named M. I have tried M[1] and M[[1]], but neither gives me the value I want.
How can I do that?
You need to subset the list, and then the vector in the list:
M[[1]][1]
In other words, M is a list of 1 element, a character vector of length 24.
You may want to use unlist(M) to convert it to a plain vector.
M <- unlist(M)
Then you can just use M[1].
To remove the \" you can use sub:
sub("\"","",M[1])
[1] " 0.0337302"
The first element in the list you've shown is the entire vector shown by
[1] " \" 0.0337302" " -0.000248016" " -0.000496032" " -0.000744048"
[5] " -0.000992063" " -0.00124008" " -0.0014881" " -0.00173611"
[9] " -0.00198413" " -0.00223214" " -0.00248016" " -0.00272817"
[13] " -0.00297619" " -0.00322421" " -0.00347222" " -0.00372024"
[17] " -0.00396825" " -0.00421627" " -0.00446429" " -0.0047123"
[21] " -0.00496032" " -0.00520833" " -0.00545635" " -0.00570437"
You get that vector with M[[1]].
To get the first element of this vector, recognize that M[[1]] is the vector you want the first element of, and use normal subsetting on it: M[[1]][1]
> M[[1]][1]
[1] " \" 0.0337302"
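To make the difference between the three forms concrete, here is a minimal sketch on a shortened copy of the data. The three-element M below is a made-up stand-in for the full list, and the conversion to a number at the end is an assumption about the eventual goal.

```r
# Shortened stand-in for the list in the question
M <- list(c(" \" 0.0337302", " -0.000248016", " -0.000496032"))

class(M[1])     # still a list of length 1
class(M[[1]])   # the character vector stored inside the list
first <- M[[1]][1]  # the first string of that vector

# Drop the stray quote, trim the padding, convert to numeric
value <- as.numeric(trimws(sub("\"", "", first, fixed = TRUE)))
print(value)
```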