Can extract_numeric deal with negative numbers? - r

Is there a way to use tidyr's extract_numeric() to extract negative numbers?
For example,
> extract_numeric("2%")
[1] 2
> extract_numeric("-2%")
[1] 2
I'd really like the second call to return -2.
Bill
PS: While it doesn't concern me today, I suspect cases such as "-$2.00" complicate any general solution.

extract_numeric is pretty simple:
> extract_numeric
function (x)
{
as.numeric(gsub("[^0-9.]+", "", as.character(x)))
}
<environment: namespace:tidyr>
It just replaces any char that isn't 0 to 9 or "." with nothing. So "-1" will become 1, and there's nothing you can do about it... except maybe file an enhancement request to tidyr, or write your own...
extract_num = function(x){as.numeric(gsub("[^0-9.\\-]+","",as.character(x)))}
will sort of do it:
> extract_num("-$1200")
[1] -1200
> extract_num("$-1200")
[1] -1200
> extract_num("1-1200")
[1] NA
Warning message:
In extract_num("1-1200") : NAs introduced by coercion
but a regexp could probably do better, only allowing minus signs at the start...
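As a rough sketch of that idea (extract_signed is a hypothetical name, not part of tidyr): keep the digit-stripping from extract_numeric, and treat a "-" that shows up before the first digit as the sign. This assumes one number per string, and it should also cover the "-$2.00" case from the question.

```r
# Hypothetical helper, not part of tidyr: strip non-numeric characters the
# way extract_numeric() does, but negate the result when a "-" appears
# somewhere before the first digit.
extract_signed <- function(x) {
  x <- as.character(x)
  negative  <- grepl("^[^0-9]*-", x)                 # minus before any digit
  magnitude <- as.numeric(gsub("[^0-9.]+", "", x))   # keep digits and "." only
  ifelse(negative, -magnitude, magnitude)
}

extract_signed(c("2%", "-2%", "-$2.00", "$-1200"))
# 2 -2 -2 -1200
```

A minus sign that comes after a digit, as in "1-1200", is simply dropped by this sketch, so that input gives 11200 rather than NA.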

Just use sub if there's a single number in the string. Here's an approach:
The function:
myfun <- function(s) as.numeric(sub(".*?([-+]?\\d*\\.?\\d+).*", "\\1", s))
Examples:
> myfun("-2%")
[1] -2
> myfun("abc 2.3 xyz")
[1] 2.3
> myfun("S+3.")
[1] 3
> myfun(".5PPP")
[1] 0.5
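One caveat with sub(): when the pattern doesn't match, the string comes back unchanged, so an input without any digits ends up as NA plus a coercion warning. A possible variant, sketched with regexpr() and regmatches() from base R (first_number is a made-up name for this example), fills in NA quietly for those strings:

```r
# Hypothetical helper: return the first signed number in each string,
# or NA when a string contains no number at all.
first_number <- function(s) {
  s <- as.character(s)
  pos <- regexpr("[-+]?[0-9]*\\.?[0-9]+", s)    # position of the first match, -1 if none
  out <- rep(NA_real_, length(s))
  hit <- !is.na(pos) & pos > 0
  out[hit] <- as.numeric(regmatches(s, pos))    # regmatches() returns only the matched elements
  out
}

first_number(c("-2%", "abc 2.3 xyz", "no digits", ".5PPP"))
# -2.0  2.3   NA  0.5
```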

Related

why does as.integer in R decrement the value?

I am doing a simple operation: multiplying a decimal number and converting the result to an integer, but the result is different from what I expected. Apologies if this is discussed elsewhere; I was not able to find any straightforward answers to this.
> as.integer(1190.60 * 100)
[1] 119059
EDIT:
So, I have to convert that to character and then do as.integer to get what is expected
> temp <- 1190.60
> temp2 <- 1190.60 * 100
> class(temp)
[1] "numeric"
> class(temp2)
[1] "numeric"
> as.character(temp2)
[1] "119060"
> as.integer(temp2)
[1] 119059
> as.integer(as.character(temp2))
[1] 119060
EDIT2: According to the comments (thanks @andrey-shabalin):
> temp2
[1] 119060
> as.integer(temp2)
[1] 119059
> as.integer(round(temp2))
[1] 119060
EDIT3: As mentioned in the comments, the question is about the behaviour of as.integer, not about floating-point calculations.
The answer to this is "floating point error". You can see this easily by checking the following:
> temp <- 1190.60
> temp2 <- 1190.60 * 100
> temp2 - 119060
[1] -1.455192e-11
Due to floating point errors, temp2 isn't exactly 119060 but:
> sprintf("%.20f", temp2)
[1] "119059.99999999998544808477"
If you use as.integer on a float, it works the same way as trunc, i.e. it truncates the float toward 0. So in this case the result is 119059.
If you convert to character using as.character(), R uses at most 15 significant digits. In this example that would be "119059.999999999". The next digit is another 9, so R rounds this to 119060 before conversion. I avoid this in the code above by using sprintf() instead of as.character().
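A short illustration of the truncation, and of the usual workaround of rounding before converting. This is only a sketch of the behaviour described above, using the values from the question:

```r
as.integer(2.9)     # 2
as.integer(-2.9)    # -2: truncation goes toward zero, not downward

temp2 <- 1190.60 * 100
as.integer(temp2)          # 119059, as in the question
as.integer(round(temp2))   # 119060: round to the nearest whole number first

# Compare doubles with a tolerance rather than with ==
isTRUE(all.equal(temp2, 119060))   # TRUE
```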

A more precise sum() in R for big values?

I'm trying to do a simple sum over a large column in R. The answer comes back all right, but not with the precision that I want. For example:
> tail(x)
[,1]
[1999995,] 1999995
[1999996,] 0
[1999997,] 1999997
[1999998,] 0
[1999999,] 1999999
[2e+06,] 0
If I do a sum(x), I get:
> sum(x)
[1] 1e+12
Which is fine, but I'd like it to print with more significant figures, something like 158683269821. Is there an option in sum() to specify how many significant figures I want?
The options I wound up using were thus:
> options("scipen"=100, "digits"=4)
> sum(x)
[1] 1000000000000
> sum(x)+1
[1] 1000000000001
> sum(x)+2
[1] 1000000000002
> sum(x)-1
[1] 999999999999
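For what it's worth, the sum itself was exact all along; scipen and digits only change how it is printed. A small sketch that formats the total without touching the global options, assuming x looks like the tail shown above (odd rows hold their row number, even rows hold 0):

```r
# Assumed reconstruction of x from the tail() output in the question
i <- seq_len(2e6)
x <- matrix(ifelse(i %% 2 == 1, i, 0))

total <- sum(x)
format(total, scientific = FALSE)                   # "1000000000000"
sprintf("%.0f", total)                              # "1000000000000"
format(total, big.mark = ",", scientific = FALSE)   # "1,000,000,000,000"
```

format() and sprintf() return character strings, so they are for display only; keep the numeric total for any further arithmetic.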

R grep NAs from vector

How do I use grep() to get NAs from a vector?
For example, when I try grep(NA, c(1,NA))
I get [1] NA NA
You want is.na():
> vec <- c(1,NA)
> is.na(vec)
[1] FALSE TRUE
If you want the NA, try
> which(is.na(vec))
[1] 2
> vec[which(is.na(vec))]
[1] NA
> vec[is.na(vec)] # simpler, logical subscripting
[1] NA
If you don't, negate the output from is.na():
> !is.na(vec)
[1] TRUE FALSE
> which(!is.na(vec))
[1] 1
> vec[which(!is.na(vec))]
[1] 1
> vec[!is.na(vec)] ## simpler, logical subscripting
[1] 1
One reason your code doesn't work is that you gave NA as the pattern. To R this means that the pattern is not defined, so whether either of the elements of the vector match this pattern is also undefined - hence both are NA in the output.
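The same thing happens with ==, which is why comparing against NA never finds anything. A quick illustration:

```r
vec <- c(1, NA)

NA == NA      # NA: comparing two unknown values is itself unknown
vec == NA     # NA NA, so this cannot locate the missing values
is.na(vec)    # FALSE TRUE
```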
grep is the wrong option here. Use the built-in function is.na instead.
> is.na(c(1,NA))
[1] FALSE TRUE
EDIT: if you want the integer indices rather than true/falses (which is more like what grep returns), use which(is.na()).
Don't; use which and is.na instead:
> which(is.na(c(1,NA)))
[1] 2
> which(is.na(c(NA,1,NA)))
[1] 1 3
Use is.na(c(1, NA)).

Getting only matched part of the string in R

Is there a function in R that matches regexp and returns only the matched parts?
Something like grep -o, so:
> ogrep('.b.',c('abc','1b2b3b4'))
[[1]]
[1] abc
[[2]]
[1] 1b2 3b4
Try stringr:
library(stringr)
str_extract_all(c('abc','1b2b3b4'), '.b.')
# [[1]]
# [1] "abc"
#
# [[2]]
# [1] "1b2" "3b4"
I can't believe nobody ever mentioned regmatches!
x <- c('abc','1b2b3b4')
regmatches(x, gregexpr('.b.', x))
# [[1]]
# [1] "abc"
# [[2]]
# [1] "1b2" "3b4"
It makes me wonder, didn't regmatches exist two and a half years ago?
You should probably give Gabor Grothendieck the check for writing the gsubfn package:
require(gsubfn)
#Loading required package: gsubfn
strapply(c('abc','1b2b3b4'), ".b.", I)
#Loading required package: tcltk
#Loading Tcl/Tk interface ... done
[[1]]
[1] "abc"
[[2]]
[1] "1b2" "3b4"
This just applies the identity function, I, to the matches of the pattern.
You need to combine gregexpr with substring, I reckon:
> s = c('abc','1b2b3b4')
> m = gregexpr('.b.',s)
> substring(s[1],m[[1]],m[[1]]+attr(m[[1]],'match.length')-1)
[1] "abc"
> substring(s[2],m[[2]],m[[2]]+attr(m[[2]],'match.length')-1)
[1] "1b2" "3b4"
The returned list 'm' has the start and lengths of matches. Loop over s to get all the substrings.
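That loop could look something like this, as a sketch (match_parts is a hypothetical helper name). mapply() walks over the strings and their match data in parallel, and the check for -1 covers strings with no match at all:

```r
s <- c('abc', '1b2b3b4')
m <- gregexpr('.b.', s)

# Hypothetical helper: cut the matched pieces out of one string,
# given that string's element of the gregexpr() result.
match_parts <- function(string, match) {
  if (match[1] == -1) return(character(0))   # gregexpr() reports -1 for "no match"
  substring(string, match, match + attr(match, 'match.length') - 1)
}

result <- mapply(match_parts, s, m, SIMPLIFY = FALSE, USE.NAMES = FALSE)
result
# a list holding "abc" and c("1b2", "3b4")
```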

excluding FALSE elements from a character vector by using logical vector

I manage to do the following:
stuff <- c("banana_fruit","apple_fruit","coin","key","crap")
fruits <- stuff[stuff %in% grep("fruit",stuff,value=TRUE)]
but I can't select the not-so-healthy stuff. The usual ideas, like
no_fruit <- stuff[stuff %not in% grep("fruit",stuff,value=TRUE)]
#or
no_fruit <- stuff[-c(stuff %in% grep("fruit",stuff,value=TRUE))]
don't work. The latter just ignores the "-"
> stuff[grep("fruit",stuff)]
[1] "banana_fruit" "apple_fruit"
> stuff[-grep("fruit",stuff)]
[1] "coin" "key" "crap"
You can only use negative subscripts with numeric/integer vectors, not logical because:
> -TRUE
[1] -1
If you want to negate a logical vector, use !:
> !TRUE
[1] FALSE
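To see what actually happens with a negated logical index, here is a small illustration (is_fruit is just a name picked for this example). The minus turns TRUE/FALSE into -1/0, and a 0 subscript selects nothing, so only element 1 gets dropped:

```r
stuff <- c("banana_fruit", "apple_fruit", "coin", "key", "crap")
is_fruit <- grepl("fruit", stuff)

-is_fruit          # -1 -1  0  0  0
stuff[-is_fruit]   # "apple_fruit" "coin" "key" "crap": only element 1 is removed
stuff[!is_fruit]   # "coin" "key" "crap"
```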
As Joshua mentioned: you can't use - to negate your logical index; use ! instead.
stuff[!(stuff %in% grep("fruit",stuff,value=TRUE))]
See also the stringr package for this kind of thing.
library(stringr)
stuff[!str_detect(stuff, "fruit")]
There is also a parameter called 'invert' in grep that does essentially what you're looking for:
> stuff <- c("banana_fruit","apple_fruit","coin","key","crap")
> fruits <- stuff[stuff %in% grep("fruit",stuff,value=TRUE)]
> fruits
[1] "banana_fruit" "apple_fruit"
> grep("fruit", stuff, value = TRUE)
[1] "banana_fruit" "apple_fruit"
> grep("fruit", stuff, value = TRUE, invert = TRUE)
[1] "coin" "key" "crap"
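One thing to watch with the stuff[-grep(...)] form: if nothing matches, grep() returns integer(0), and negating that still gives an empty index, so you get nothing back instead of everything. The invert and !grepl() forms don't have that problem. A quick check, using a pattern ("veg") chosen here only because it matches none of the elements:

```r
stuff <- c("banana_fruit", "apple_fruit", "coin", "key", "crap")

stuff[-grep("veg", stuff)]                        # character(0): everything is lost
grep("veg", stuff, value = TRUE, invert = TRUE)   # all five elements
stuff[!grepl("veg", stuff)]                       # all five elements
```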
