I'm trying to return values of the field [doc] that have no letters, but the results are all over the place.
SELECT Right([doc],4) AS ex1, IsNumeric([ex1]) AS ex2
FROM stat_converted;
The query returns the two fields as it should, but the evaluation is wrong: values that are all digits and values that are all letters both come back as True (-1).
I also tried building a temp table and then applying IsNumeric to that, with the same results.
I also built a small test DB where the logic works, so I am really confused.
IsNumeric will match strings like "2E+1" (2 times ten to the power of 1, i.e. 20), as that is a number in scientific notation. "0D88" is also a number according to IsNumeric, because it is the double-precision (hence the "D") version of "0E88".
You could use Right([doc],4) LIKE '####' instead, which matches exactly four digits (0-9).
If you had more complex match requirements, say a variable quantity of digits, you would be interested in Expressing basic Access query criteria as regular expressions.
A function in a package gives me a character string in which the original strings are merged together. I need to separate them; in other words, I have to find the original elements. Here is an example and what I have tried:
orig <- c("answer1", "answer2", "answer3")
result <- "answer3answer2"
What I need as an outcome is:
c("answer2","answer3")
I have tried split() on result, but there is no delimiter to split on, especially since I have no prior knowledge of what the answers will be.
I have tried to match() the result against orig, but I would need to do that with all possible substrings.
There has to be an easy solution, but I haven't found it.
# locate every element of orig inside result
index <- gregexpr(paste(orig, collapse = "|"), result)[[1]]
starts <- as.numeric(index)
stops <- starts + attributes(index)$match.length - 1
substring(result, starts, stops)
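For the example above, this returns the matched pieces in the order they occur in result:
# [1] "answer3" "answer2"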
This should work for well-defined and reversible inputs. Alternatively, is it possible to append some marker strings to the input of the function, so that the result can be easily separated afterwards?
What you're describing seems to be exactly string matching, and for your strings, grepl seems to be just the thing, in particular:
FindSubstrings <- function(orig, result){
  orig[sapply(orig, grepl, result)]
}
In more detail: grepl takes a pattern argument and checks whether it occurs in your string (result in our case), returning a TRUE/FALSE value. We then subset the original values by that logical vector: does each value occur in the string?
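Applied to the example from the question, this returns the matches in the order of orig:
FindSubstrings(orig, result)
# [1] "answer2" "answer3"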
Possible improvements:
fixed=TRUE is probably a good idea, because you don't need the full regex machinery for simple string matching (see the sketch after this list)
some patterns may contain others; for example, "answer10" contains "answer1", so both would be reported
stringi may be faster for such tasks (anecdotally; I haven't rigorously tested it), so you may want to look into it if you do this a lot.
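A minimal sketch of the fixed=TRUE variant (the same function as above, only with literal matching):
FindSubstrings <- function(orig, result){
  # fixed = TRUE treats each pattern as a literal string rather than a regex
  orig[sapply(orig, grepl, result, fixed = TRUE)]
}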
I would like to be able to control the hierarchy of elements I extract from a search string.
Specifically, in the string "425 million won", I would like to extract "won" first, but then "n" if "won" doesn't appear.
I want the result to be "won" for the following:
stringr::str_extract("425 million won", "won|n")
Note that requiring a space before won in my regex is inadequate because of other limitations in my data (there may not necessarily be a space between "million" and "won"). Ideally, I would like to do this using regex rather than if-else clauses, for performance reasons.
pattern <- "^(?:(?!won).)*\\K(?:won|n)"
s <- "425 million won"
m <- gregexpr(pattern, s, perl = TRUE)
regmatches(s, m)[[1]]
Explanation
^ Assert position at the start of the line
(?:(?!won).)* Tempered greedy token: matches any character, so long as the substring won does not begin at that position
\K Resets the starting point of the match. Any previously consumed characters are no longer included in the final match
(?:won|n) Match either won or n
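For instance, when won is absent, the greedy token consumes the whole string and backtracks, so the pattern falls back to the last n (example string of my own):
s2 <- "425 million"
regmatches(s2, gregexpr(pattern, s2, perl = TRUE))[[1]]
# [1] "n"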
If you just want to extend on the code you already have:
na.omit(str_extract("420 million won", c("won", "n")))[1]
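For instance (with stringr loaded; the second string is a variation of my own without won):
library(stringr)
na.omit(str_extract("420 million won", c("won", "n")))[1]
# [1] "won"
na.omit(str_extract("420 million", c("won", "n")))[1]
# [1] "n"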
I was wondering if there is an easy way in SAS to count sentences in a string?
In pseudocode I would search for the index of every ., ?, and !, and check whether the character just before it (at offset -1 or -2) is a letter.
Any better ideas?
Assuming that your sentences are correctly punctuated, there should be exactly one sentence per terminal ?, !, or ., so in that case you can use countc(my_string,'?!.'). The main exceptions are probably interrobangs (?!, !?) and ellipses (...).
If your string contains lots of sentences with missing stops or double stops, one option is simply to cross your fingers and hope they more or less cancel out.
If there are lots of double stops but not so many missing ones, you could apply a regex to replace any run of consecutive stops with a single . before counting those, e.g. countc(prxchange('s/[\.!\?]{2,}/./',-1,string),'?!.').
I'm a bit of an R novice and have been experimenting with the agrep function in R. I have a large database of customers (1.5 million rows), which I'm sure contains many duplicates. Many of the duplicates, though, are not revealed by using table() to get the frequency of repeated exact names. Just eyeballing some of the rows, I have noticed many duplicates that count as "unique" because of a minor mistyped character in the name.
So far, to find all of the duplicates in my data set, I have been using agrep() to do the fuzzy name matching. I have been playing around with the max.distance argument in agrep() to return different approximate matches, and I think I have found a happy medium between returning false positives and missing true matches. As agrep() is limited to matching a single pattern at a time, I found an entry on Stack Overflow that helped me write an sapply call to match the data set against numerous patterns. Here is the code I am using to loop over the patterns as it combs through my data set for "duplicates".
dups4 <- data.frame(unlist(sapply(unique$name, agrep, value = TRUE, max.distance = 0.154, vf$name)))
unique$name = the unique index I developed that holds all of the "patterns" I wish to hunt for in my data set.
vf$name = the column in my data frame that contains all of my customer names.
This code works well on a small sample of 600 or so customers, and agrep works fine. My problem comes when I use a unique index of 250K+ names and agrep it against my 1.5 million customers. As I type out this question, the code is still running in R and has not yet stopped (we are going on 20 minutes at this point).
Does anyone have any suggestions to speed this up or improve the code? I have not yet tried anything from the plyr package. Perhaps that might be faster... although I am a little unfamiliar with the ddply and llply functions.
Any suggestions would be greatly appreciated.
I'm so sorry, I missed the request to post a solution. Here is how I solved my multiple-pattern agrep problem, and then sped things up using parallel processing.
What I am essentially doing is taking a whole vector of character strings and fuzzy-matching them against themselves to find out whether there are any fuzzy-matched duplicate records in the vector.
Here I create a cluster of twenty workers that I wish to use in the parallel process run by parSapply:
library(parallel)  # provides makeCluster() and parSapply()
cl <- makeCluster(20)
So let's start with the innermost nesting of the code, parSapply. This is what allows me to run agrep() in a parallel process. The first argument is cl, the cluster object created above.
The second argument is the vector of patterns I wish to match. The third argument is the function that does the matching (in this case agrep). The subsequent arguments are all passed on to agrep(): value=TRUE says I want the actual character strings returned (not the positions of the strings), and max.distance=2 is the cost I am willing to accept in a fuzzy match. The last argument is the full vector to be matched against the patterns (argument 2); as it happens, I am looking to identify duplicates, so I match the vector against itself.
The final output is a list, so I use unlist() and then wrap it in a data frame to get, essentially, a table of matches. From there, I can run a frequency table and find out which fuzzy-matched character strings have a frequency greater than 1, ultimately telling me that such a pattern matched itself and at least one other pattern in the vector.
truedupevf <- data.frame(unlist(parSapply(cl,
  s4dupe$fuzzydob,      # the patterns to search for
  agrep, value = TRUE,  # return the matching strings, not their positions
  max.distance = 2,     # maximum edit cost accepted for a fuzzy match
  s4dupe$fuzzydob)))    # searched against itself to reveal duplicates
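When the matching is finished, the workers should be released:
stopCluster(cl)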
I hope this helps.
What I'm trying to achieve is to have all printed numbers display at maximum 7 digits. Here are examples of what I want printed:
0.000000 (versus the actual number which is 0.000000000029481.....)
0.299180 (versus the actual number which is 0.299180291884922.....)
I've had success with the latter types of numbers by using options(scipen=99999) and options(digits=6). However, the former example will always print a huge number of zeros followed by five non-zero digits. How do I stop this from occurring and achieve my desired result? I also do not want scientific notation.
I want this to apply to ALL printed numbers in EVERY context. For example if I have some matrix, call it A, and I print this matrix, I want every element to just be 6-7 digits. I want this to be automatic for every print in every context; just like using options(digits=6) and options(scipen=99999) makes it automatic for every context.
You can define a new print method for the type you wish to print. For example, if all your numbers are doubles, you can create
# cat() actually writes the output; sprintf() alone would only return the string
print.double <- function(x, ...){ cat(sprintf("%.6f", x), "\n"); invisible(x) }
Now, when you print a double (or a vector of doubles), the function print.double() will be called instead of print.default().
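A quick check with the numbers from the question (auto-printing now goes through print.double()):
0.000000000029481
# 0.000000
0.299180291884922
# 0.299180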
You may have to create similar functions print.integer(), print.complex(), etc., depending on the types you need to print.
To return to the default print method, simply delete the function with rm(print.double).
Are all your numbers < 1? Then you could try a simple sprintf("%.6f", x). Otherwise you could wrap your calls to sprintf() based on the number of digits; check ?sprintf for the details.
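A rough sketch of that wrapping idea (fmt7 is a helper name of my own; it shrinks the decimal places as the integer part grows, keeping roughly seven digits in total):
fmt7 <- function(x){
  int_digits <- nchar(trunc(abs(x)))  # digits before the decimal point
  sprintf("%.*f", pmax(0, 7 - int_digits), x)
}
fmt7(c(0.000000000029481, 0.299180291884922, 12345.6789))
# [1] "0.000000" "0.299180" "12345.68"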