R sprintf in sqldf's LIKE

I would like to run a looping query in R using sqldf that selects all non-NULL X.1 values with date "11/12/2015" at 9 AM. Example:
StartDate X.1
11/12/2015 09:14 A
11/12/2015 09:36
11/12/2015 09:54 A
The date is in a variable that is generated from another query:
nullob <- 0
dayminnull <- as.numeric(sqldf("SELECT substr(Min(StartDate),1,03)as hari from testes")) # this produces "11/12/2015"
for (i in 1:12) {
  dday <- mdy(dayminnull) + days(i) # go to next day
  sqlsql <- sprintf("SELECT count([X.1]) FROM testes where StartDate like '% \%s 09: %'", dday)
  x[i] <- sqldf(sqlsql)
  nullob <- nullob + x[i]
}
It fails with this error:
Error in sprintf("SELECT count([X.1]) FROM testes WHERE StartDate like '%%s 09%'", :
unrecognised format specification '%'
Please help. Thank you in advance.

It's not super clear in the documentation, but a % followed by a %, that is %%, is the way to tell sprintf to use a literal %. We can test this fairly easily:
sprintf("%% %s %%", "hi")
[1] "% hi %"
For your query string, this should work:
sprintf("SELECT count([X.1]) FROM testes where StartDate like '%% %s 09: %%'", dday)
From ?sprintf:
The string fmt contains normal characters, which are passed through to
the output string, and also conversion specifications which operate on
the arguments provided through .... The allowed conversion
specifications start with a % and end with one of the letters in the
set aAdifeEgGosxX%. These letters denote the following types:
... [Documentation on aAdifeEgGosxX]
%: Literal % (none of the extra formatting characters given below are permitted in this case).
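Because sprintf is vectorised over its arguments, the query strings for several days can also be built in one call. The sketch below is only an illustration. It assumes StartDate is stored as text in month/day/year form (as the question's use of mdy() suggests), it hard-codes the start date instead of querying it, and it drops the leading wildcard on the assumption that the column begins with the date. Note that a Date passed to %s is converted with as.character() and so appears as "2015-11-13", which is why it is formatted back to the table's layout first.

```r
# Hypothetical start date; the question derives it from a query on `testes`.
start <- as.Date("11/12/2015", format = "%m/%d/%Y")

# The next three days, formatted the same way as the StartDate column.
days <- format(start + 1:3, "%m/%d/%Y")

# %s is replaced by each date; %% becomes the single % wildcard that LIKE expects.
queries <- sprintf(
  "SELECT count([X.1]) FROM testes WHERE StartDate LIKE '%s 09:%%'",
  days
)
print(queries[1])
```

Each element of queries can then be passed to sqldf() in turn.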

R regex match things other than known characters

For a text field, I would like to expose those that contain invalid characters. The list of invalid characters is unknown; I only know the list of accepted ones.
For example for French language, the accepted list is
A-z, 1-9, [punc::], space, àéèçè, hyphen, etc.
The list of invalid characters is unknown, yet I want anything unusual to surface. For example, I would want
This is an 2-piece à-la-carte dessert to pass, while
'Ã this Øs an apple' shows up as an anomaly.
The 'not contain' notion in R does not behave as I would like. For example,
grep("[^(abc)]",c("abcdef", "defabc", "apple") )
(those that do not contain 'abc') matches all three, while
grep("(abc)",c("abcdef", "defabc", "apple") )
behaves correctly and matches only the first two. Am I missing something?
How can we do that in R? Also, how can we include the hyphen in the list of accepted characters?
[a-z1-9[:punct:] àâæçéèêëîïôœùûüÿ-]+
The above regex matches any of the following (one or more times). Note that the parameter ignore.case=T used in the code below allows the following to also match uppercase variants of the letters.
a-z: any lowercase ASCII letter
1-9: any digit in the range from 1 to 9 (excludes 0)
[:punct:]: any punctuation character
(space): the space character
àâæçéèêëîïôœùûüÿ: any valid French character with a diacritic mark
-: the hyphen character
Code in use:
x <- c("This is an 2-piece à-la-carte dessert", "Ã this Øs an apple")
gsub("[a-z1-9[:punct:] àâæçéèêëîïôœùûüÿ-]+", "", x, ignore.case=T)
The code above replaces all valid characters with nothing. The result is all invalid characters that exist in the string. The following is the output:
[1] "" "ÃØ"
If by "expose the invalid characters" you mean delete the "accepted" ones, then a regex character class should be helpful. From the ?regex help page we can see that a hyphen is already part of the punctuation character vector;
[:punct:]
Punctuation characters:
! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~
So the code could be:
x <- 'Ã this Øs an apple'
gsub("[A-z1-9[:punct:] àéèçè]+", "", x)
#[1] "ÃØ"
Note that regex has a predefined, locale-specific "[:alpha:]" named character class that would probably be both safer and more compact than the expression "[A-zàéèçè]" especially since the post from ctwheels suggests that you missed a few. The ?regex page indicates that "[0-9A-Za-z]" might be both locale- and encoding-specific.
If by "expose" you instead meant "identify the position within the string", then you could use the negation operator "^" within the character class and apply gregexpr:
gregexpr("[^A-z1-9[:punct:] àéèçè]+", x)
[[1]]
[1] 1 8
attr(,"match.length")
[1] 1 1
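On the "[^(abc)]" part of the question: inside square brackets the parentheses do not group anything, so that pattern means "any single character other than (, a, b, c or )", and each of the three strings contains such a character. Below is a minimal sketch of the two things the question seems to be after. The vector y and the accepted set are made-up ASCII stand-ins for illustration, not the full French list.

```r
x <- c("abcdef", "defabc", "apple")

# "[^(abc)]" is a character class: any ONE character other than (, a, b, c, ).
# Every element contains such a character, so all three match.
all_match <- grepl("[^(abc)]", x)

# To keep the elements that do NOT contain the substring "abc", negate the match.
no_abc <- x[!grepl("abc", x, fixed = TRUE)]

# To flag strings with any character outside an accepted set, search for a
# single character in the negated class. The hyphen goes last so it is literal.
y <- c("a 2-piece dessert", "an apple + pie")
has_invalid <- grepl("[^A-Za-z0-9 -]", y)
print(has_invalid)
```

The same negated class can be extended with [:punct:] and the accented letters from the answers above; how those non-ASCII characters are matched depends on the locale and encoding in use.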

regex to replace text *outside* of {}

I want to use regex to replace commands or tags around strings. My use case is converting LaTeX commands to bookdown commands, which means doing things like replacing \citep{*} with [@*], \ref{*} with \@ref(*), etc. However, let's stick to the generalized question:
Given a string <begin>somestring<end> where <begin> and <end> are known and somestring is an arbitrary sequence of characters, can we use regex to substitute <newbegin> and <newend> to get the string <newbegin>somestring<newend>?
For example, consider the LaTeX command \citep{bonobo2017}, which I want to convert to [@bonobo2017]. For this example:
<begin> = \citep{
somestring = bonobo2017
<end> = }
<newbegin> = [@
<newend> = ]
This question is basically the inverse of this question.
I'm hoping for an R or Notepad++ solution.
Additional Examples
Convert \citet{bonobo2017} to @bonobo2017
Convert \ref{myfigure} to \@ref(myfigure)
Convert \section{Some title} to # Some title
Convert \emph{something important} to *something important*
I'm looking for a template regex into which I can fill my <begin>, <end>, <newbegin> and <newend> on a case-by-case basis.
You can try something like this with dplyr + stringr:
string = "\\citep{bonobo2017}"
begin = "\\citep{"
somestring = "bonobo2017"
end = "}"
newbegin = "[@"
newend = "]"
library(stringr)
library(dplyr)
# Extract the text between begin and end, then wrap it in the new delimiters
string %>%
  str_extract(paste0("(?<=\\Q", begin, "\\E)\\w+(?=\\Q", end, "\\E)")) %>%
  paste0(newbegin, ., newend)
or:
# Remove begin and end, then wrap what is left in the new delimiters
string %>%
  str_replace_all(paste0("\\Q", begin, "\\E|\\Q", end, "\\E"), "") %>%
  paste0(newbegin, ., newend)
You can also make it a function for convenience:
convertLatex = function(string, BEGIN, END, NEWBEGIN, NEWEND){
  string %>%
    str_replace_all(paste0("\\Q", BEGIN, "\\E|\\Q", END, "\\E"), "") %>%
    paste0(NEWBEGIN, ., NEWEND)
}
convertLatex(string, begin, end, newbegin, newend)
# [1] "[@bonobo2017]"
Notes:
Notice that I manually added an additional \ to "\\citep{bonobo2017}". A single \ in an ordinary R string is treated as an escape character, so a second \ is needed to escape the first. (From R 4.0.0 onward, raw strings such as r"(\citep{bonobo2017})" avoid the doubling.)
The regex in str_extract uses a positive lookbehind and a positive lookahead to extract the somestring in between begin and end.
str_replace_all takes another approach: it removes begin and end from string.
The "\\Q" and "\\E" pair in the regex tells the engine to treat everything between them as literal text: "\\Q" starts the quoted span and "\\E" ends it. This is especially useful in your case since you likely have special characters in your LaTeX command, and they no longer need to be escaped by hand.
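If the command sits inside a longer line, removing <begin> and <end> and wrapping the whole string will not give the desired result. A capture group with a back-reference can rewrite each command in place instead. Here is a minimal base-R sketch, assuming the argument contains no nested braces; the sample sentence tex is made up for illustration.

```r
# Hypothetical line of LaTeX containing two commands and surrounding text.
tex <- "As shown in \\citep{bonobo2017} and \\ref{myfigure}."

# ([^}]*) captures everything up to the closing brace; \\1 re-inserts it.
out <- gsub("\\\\citep\\{([^}]*)\\}", "[@\\1]", tex)

# In the replacement, \\\\ produces one literal backslash before @ref(.
out <- gsub("\\\\ref\\{([^}]*)\\}", "\\\\@ref(\\1)", out)

cat(out, "\n")
```

The pattern has the shape <begin>([^}]*)<end> and the replacement the shape <newbegin>\\1<newend>, so it can serve as a template for the other commands once any regex metacharacters in <begin> are escaped.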

r check if string contains special characters

I am checking whether a string contains any special characters. This is what I have, and it's not working:
if(grepl('^\\[:punct:]', val))
So if anybody can tell me what I am missing, that will be helpful.
Special characters:
~ ` ! @ # $ % ^ & * | : ; , . " |
As @thelatemail pointed out in the comments you can use:
grepl('[[:punct:]]', val)
which will result in TRUE or FALSE for each value in your vector. You can wrap the call in sum() to get the total number of these cases.
You can also use:
grepl('[^[:alnum:]]', val)
which checks for any character that is not a letter or a number.
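As for why the original attempt fails: in '^\\[:punct:]' the \\[ is an escaped literal bracket, so the pattern only matches strings that begin with the text "[:punct:]". The class name has to sit inside a bracket expression, as in "[[:punct:]]". A small sketch with made-up values:

```r
val <- c("abc123", "a!b", "hello world")

# TRUE where the string contains at least one punctuation character.
has_punct <- grepl("[[:punct:]]", val)

# TRUE where the string contains anything other than a letter or digit
# (this also flags the space in "hello world").
has_non_alnum <- grepl("[^[:alnum:]]", val)

# Number of strings that contain punctuation.
n_punct <- sum(has_punct)
print(n_punct)
```

Which of the two tests is appropriate depends on whether whitespace should count as a "special" character.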

String recognition in idl

I have the following strings:
F:\Sheyenne\ROI\SWIR32_subset\SWIR32_2005210_East_A.dat
F:\Sheyenne\ROI\SWIR32_subset\SWIR32_2005210_Froemke-Hoy.dat
and from each I want to extract the three variables, 1. SWIR32 2. the date and 3. the text following the date. I want to automate this process for about 200 files, so individually selecting the locations won't exactly work for me.
so I want:
variable1=SWIR32
variable2=2005210
variable3=East_A
variable4=SWIR32
variable5=2005210
variable6=Froemke-Hoy
I am going to be using these to add titles to graphs later on, but since the position of the text in each string varies I am unsure how to do this using strmid
I think you want to use a combination of STRPOS and STRSPLIT. Something like the following:
s = ['F:\Sheyenne\ROI\SWIR32_subset\SWIR32_2005210_East_A.dat', $
'F:\Sheyenne\ROI\SWIR32_subset\SWIR32_2005210_Froemke-Hoy.dat']
name = STRARR(s.length)
date = name
txt = name
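foreach sub, s, i do begin
  ; keep only the file name after the last backslash
  sub = STRMID(sub, 1+STRPOS(sub, '\', /REVERSE_SEARCH))
  parts = STRSPLIT(sub, '_', /EXTRACT)
  name[i] = parts[0]
  date[i] = parts[1]
  ; rejoin everything after the date, since it may itself contain '_'
  txt[i] = STRJOIN(parts[2:*], '_')
endforeach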
You could also do this with a regular expression (using just STRSPLIT) but regular expressions tend to be complicated and error prone.
Hope this helps!

Pyparsing - name not starting with a character

I am trying to use Pyparsing to identify a keyword that does not begin with $. For the following input:
$abc = 5 # is not a valid one
abc123 = 10 # is valid one
abc$ = 23 # is a valid one
I tried the following
var = Word(printables, excludeChars='$')
var.parseString('$abc')
But this doesn't allow any $ in var. How can I specify all printable characters other than $ in the first character position? Any help will be appreciated.
Thanks
Abhijit
You can use the method I used to define "all characters except X" before I added the excludeChars parameter to the Word class:
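from pyparsing import Word, printables
# every printable character except '$' is allowed in the first position
NOT_DOLLAR_SIGN = ''.join(c for c in printables if c != '$')
keyword_not_starting_with_dollar = Word(NOT_DOLLAR_SIGN, printables)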
This should be a bit more efficient than building up with a Combine and a NotAny. But this will match almost anything, integers, words, valid identifiers, invalid identifiers, so I'm skeptical of the value of this kind of expression in your parser.
