How to round a whole number in R [duplicate] - r

This question already has an answer here:
round number to the first 3 digits (start with digit != 0)
(1 answer)
Closed 3 years ago.
This is a really simple question but how do I round a number in R such that I only show 2 significant figures?
E.g. 326 rounds to 330 and 4999 rounds to 5000
Thanks

Use a negative digits value to round to the left of the decimal point (here, to the nearest ten).
round(326, digits=-1)
[1] 330
Here is the difference between signif() and round(), taken directly from the documentation:
x2 <- pi * 100^(-1:3)
round(x2, 3)
signif(x2, 3)
[1] 0.031 3.142 314.159 31415.927 3141592.654
[1] 3.14e-02 3.14e+00 3.14e+02 3.14e+04 3.14e+06
Use the one that works for you.
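For the two-significant-figure rounding asked about in the question, signif() also handles whole vectors at once; a quick check with the question's numbers:

```r
x <- c(326, 4999, 326232)
signif(x, 2)   # round each value to 2 significant figures
# [1]    330   5000 330000
```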

Maybe this could help:
signif(4999,2)
5000
signif(326,2)
330
signif(326232,2)
330000
And as Jim O. pointed out, there is a difference between signif() and round(). Their performance also differs; as Gregor noted, this may not matter much in practice, but it is interesting:
library(microbenchmark)
k <- sample(1:100000,1000000,replace=T)
microbenchmark(
round_ ={round(k, digits=-1)},
signif_ ={signif(k,2)}
)
Unit: milliseconds
expr min lq mean median uq max neval
round_ 68.56366 70.22595 74.02643 71.99918 75.32761 109.5727 100
signif_ 109.57957 111.86501 121.17458 114.13232 118.88837 495.0321 100

Try the following and compare the results:
round(3333, -1)
round(3333, -2)


Obtaining different results from sum() and '+'

Below is my experiment:
> xx = 293.62882204364098
> yy = 0.086783439604999998
> print(xx + yy, 20)
[1] 293.71560548324595175
> print(sum(c(xx,yy)), 20)
[1] 293.71560548324600859
It is strange to me that sum() and + give different results when both are applied to the same numbers.
Is this result expected?
How can I get the same result?
Which one is most efficient?
There is an r-devel thread here that includes some detailed description of the implementation. In particular, from Tomas Kalibera:
R uses long double type for the accumulator (on platforms where it is
available). This is also mentioned in ?sum:
"Where possible extended-precision accumulators are used, typically well
supported with C99 and newer, but possibly platform-dependent."
This would imply that sum() is more accurate, although this comes with a giant flashing warning sign that if this level of accuracy is important to you, you should be very worried about the implementation of your calculations [in terms both of algorithms and underlying numerical implementations].
I answered a question here where I eventually figured out (after some false starts) that the difference between + and sum() is due to the use of extended precision for sum().
This code shows that individually passed elements (as in sum(xx, yy)) are added together with + (in C), whereas this code is used to sum the components of a vector; line 154 (LDOUBLE s = 0.0) shows that the accumulator is stored in extended precision (if available).
I believe that @JonSpring's timing results are probably explained (but I would be happy to be corrected) by (1) sum(xx, yy) involving more processing, type-checking, etc. than +; and (2) sum(c(xx, yy)) being slightly slower than sum(xx, yy) because it works in extended precision.
Looks like addition is 3x as fast as summing, but unless you're doing high-frequency trading I can't see a situation where this would be your timing bottleneck.
xx = 293.62882204364098
yy = 0.086783439604999998
microbenchmark::microbenchmark(xx + yy, sum(xx,yy), sum(c(xx, yy)))
Unit: nanoseconds
expr min lq mean median uq max neval
xx + yy 88 102.5 111.90 107.0 110.0 352 100
sum(xx, yy) 201 211.0 256.57 218.5 232.5 2886 100
sum(c(xx, yy)) 283 297.5 330.42 304.0 311.5 1944 100
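To make the extended-precision point concrete, here is a sketch of Kahan (compensated) summation in plain R. This is not how sum() is implemented (sum() uses a C long double accumulator where available), but it mimics the same benefit: the rounding error of each addition is carried forward instead of being discarded.

```r
# Kahan (compensated) summation: 'comp' carries the rounding error
# lost by each double-precision addition
kahan_sum <- function(v) {
  s <- 0.0
  comp <- 0.0
  for (x in v) {
    y <- x - comp          # subtract the error carried from the last step
    t <- s + y             # low-order bits of y may be lost here...
    comp <- (t - s) - y    # ...recover them algebraically
    s <- t
  }
  s
}

v <- rep(1/3, 3e5)                 # exact sum would be 1e5
naive <- 0.0
for (x in v) naive <- naive + x    # plain '+' accumulation in double
c(naive = naive, kahan = kahan_sum(v))
```

For just two numbers, compensation changes nothing; the effect shows up over many additions, where the compensated total typically stays closer to the mathematically expected value than the naive loop.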

Does R have a base function that takes a predicate and a vector as arguments, and returns the first member for which the predicate is true?

I thought that match or which would be up to the task, but they appear not to be. For example, assume that I have the function divisibleByFive <- function(x) { x %% 5 == 0 } and I want to find the first member of the vector 123:456 (either its index or its value) such that divisibleByFive returns TRUE. Is there a single base function for this job, taking both divisibleByFive (possibly vectorized) and 123:456 as arguments? If not, what is the idiomatic way to solve these sorts of problems?
Yes, that function is Filter. Try this
Filter(divisibleByFive, 123:456)[[1L]]
However, I don't really recommend doing so as this function is too generic and thus a bit slow in practice. Usually, what you want can be easily achieved by something like this
(x <- 123:456)[[which(divisibleByFive(x))[[1L]]]]
, as pointed out by @SteveM. See the benchmark:
Unit: microseconds
expr min lq mean median uq max neval cld
Filter(divisibleByFive, 123:456)[[1L]] 238.3 245.35 270.543 250.45 275.9 475.4 100 b
(x <- 123:456)[[which(divisibleByFive(x))[[1L]]]] 8.3 9.00 10.099 9.50 9.9 28.5 100 a
We can use which.max
x[which.max(divisibleByFive(x))]
#[1] 125
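One caveat with the which.max() approach: when no element satisfies the predicate, which.max() on an all-FALSE vector still returns 1, so it silently reports the first element as a match. A small sketch of the pitfall and a guard:

```r
divisibleByFive <- function(x) x %% 5 == 0

x <- c(1, 2, 3)                  # contains no multiple of 5
which.max(divisibleByFive(x))    # 1; looks like a hit, but nothing matched
Position(divisibleByFive, x)     # NA; an explicit "not found"
any(divisibleByFive(x))          # FALSE; guard before trusting which.max()
```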
Find and Position.
divisibleByFive<-function(x){x%%5==0}
Find(divisibleByFive,123:456)#For the desired value.
Position(divisibleByFive,123:456)#For the index of the desired value
Output:
> Find(divisibleByFive,123:456)#For the desired value.
[1] 125
> Position(divisibleByFive,123:456)#For the index of the desired value
[1] 3

element in R dataframe good practice

Accessing the ith element in the column bar in the dataframe foo in R can be done in two different ways:
foo[i,"bar"]
and
foo$bar[i].
Is there any difference between them? If so, which one should be used in terms of efficiency, readability, etc.?
Apologies if this has already been asked, but [ and $ are hard to search for.
I tend to think this is an opinion based question, and therefore inappropriate for SO. But since you ask for speed considerations, I won't flag it as such. Note: There are more than the two methods you describe for indexing...
data(mtcars)
library(microbenchmark)
microbenchmark(opt_a= mtcars$disp[12],
opt_b= mtcars[12, "disp"],
opt_c= mtcars[["disp"]][12])
Unit: microseconds
expr min lq mean median uq max neval cld
opt_a 5.322 6.4620 8.34029 6.8425 7.603 56.640 100 a
opt_b 9.503 10.0735 15.41463 10.6435 11.024 354.285 100 b
opt_c 4.181 4.942 7.77386 5.322 6.082 84.009 100 a
Using foo$bar[i] appears to be considerably faster than foo[i, "bar"], but it is not the fastest alternative.
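Whichever form you prefer, all three expressions above extract the same scalar, which you can verify directly (a quick check using the built-in mtcars data):

```r
data(mtcars)                  # built-in dataset
a <- mtcars$disp[12]
b <- mtcars[12, "disp"]
c3 <- mtcars[["disp"]][12]
identical(a, b) && identical(b, c3)
# [1] TRUE
```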

R programming - Computational time taken for comparing strings vs comparing integers? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 8 years ago.
I would like to know if string comparison is faster or slower than integer comparison for similar size (ex. 3 character string vs 3 digit number). Or is speed irrelevant to data type?
I'm asking this question because even a little difference would matter when I have to process millions of user's data.
It appears string comparison is slower.
x <- 1:11+100; y <- 11:1+100; cx <- as.character(x); cy <- as.character(y)
library(microbenchmark) # In line with Richard Scriven's comment
microbenchmark(x == y, cx == cy, times = 1000000)
# Unit: nanoseconds
# expr min lq median uq max neval
# x == y 318 408 477 664 108641192 1e+06
# cx == cy 521 633 701 943 111547387 1e+06
Use:
start.time <- Sys.time()
# [your code here]
end.time <- Sys.time()
s <- end.time - start.time
s
and you will get your answer. Apply this to both the string comparison and the numeric comparison to measure the computational time of each.
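A slightly more reliable base-R measurement than differencing Sys.time() is wrapping a repeated loop in system.time(), which amortizes the overhead over many runs without extra packages (a sketch; exact timings are machine-dependent):

```r
x <- sample(100:999, 1e5, replace = TRUE)  # 3-digit integers
cx <- as.character(x)                      # the same values as 3-character strings

system.time(for (i in 1:200) r <- x == x)    # integer comparison
system.time(for (i in 1:200) r <- cx == cx)  # string comparison
```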

Efficiently extract frequency of signal from FFT

I am using R and attempting to recover frequencies (really, just a number close to the actual frequency) from a large number of sound waves (1000s of audio files) by applying Fast Fourier transforms to each of them and identifying the frequency with the highest magnitude for each file. I'd like to be able to recover these peak frequencies as quickly as possible. The FFT method is one method that I've learned about recently and I think it should work for this task, but I am open to answers that do not rely on FFTs. I have tried a few ways of applying the FFT and getting the frequency of highest magnitude, and I have seen significant performance gains since my first method, but I'd like to speed up the execution time much more if possible.
Here is sample data:
s.rate<-44100 # sampling frequency
t <- 2 # seconds, for my situation, I've got 1000s of 1 - 5 minute files to go through
ind <- seq(s.rate*t)/s.rate # time indices for each step
# let's add two sin waves together to make the sound wave
f1 <- 600 # Hz: freq of sound wave 1
y <- 100*sin(2*pi*f1*ind) # sine wave 1
f2 <- 1500 # Hz: freq of sound wave 2
z <- 500*sin(2*pi*f2*ind+1) # sine wave 2
s <- y+z # the sound wave: my data isn't this nice, but I think this is an OK example
The first method I tried was using the fpeaks and spec functions from the seewave package, and it seems to work. However, it is prohibitively slow.
library(seewave)
fpeaks(spec(s, f=s.rate), nmax=1, plot=F) * 1000 # *1000 in order to recover freq in Hz
[1] 1494
# pretty close, quite slow
After doing a bit more reading, I tried this next approach, which indexes the spectrum directly for the peak:
spec(s, f=s.rate, plot=F)[which(spec(s, f=s.rate, plot=F)[,2]==max(spec(s, f=s.rate, plot=F)[,2])),1] * 1000 # again need to *1000 to get Hz
x
1494
# pretty close, definitely faster
After a bit more looking around, I found this approach to work reasonably well.
which(Mod(fft(s)) == max(abs(Mod(fft(s))))) * s.rate / length(s)
[1] 1500
# recovered the exact frequency, and quickly!
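The which()/max() comparison above can be written a little more cleanly with which.max(), restricted to the first half of the spectrum (for a real-valued signal the second half is a mirror image). This sketch rebuilds the example signal from the sample data above; note the -1, since bin 1 of the FFT corresponds to 0 Hz:

```r
s.rate <- 44100                          # sampling rate from the example
ind <- seq(s.rate * 2) / s.rate          # 2 seconds of time indices
s <- 100 * sin(2 * pi * 600 * ind) + 500 * sin(2 * pi * 1500 * ind + 1)

half <- seq_len(length(s) %/% 2)         # keep only the positive frequencies
peak_bin <- which.max(Mod(fft(s))[half]) # 1-based index of the strongest bin
(peak_bin - 1) * s.rate / length(s)      # bin 1 is 0 Hz, hence the -1
# [1] 1500
```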
Here is some performance data:
library(microbenchmark)
microbenchmark(
WHICH.MOD = which(Mod(fft(s))==max(abs(Mod(fft(s))))) * s.rate / length(s),
SPEC.WHICH = spec(s,f=s.rate,plot=F)[which(spec(s,f=s.rate,plot=F)[,2] == max(spec(s,f=s.rate,plot=F)[,2])),1] * 1000, # this is spec from the seewave package
# to recover a number around 1500, you have to multiply by 1000
FPEAKS.SPEC = fpeaks(spec(s,f=s.rate),nmax=1,plot=F)[,1] * 1000, # fpeaks is from the seewave package... again, need to multiply by 1000
times=10)
Unit: milliseconds
expr min lq median uq max neval
WHICH.MOD 10.78 10.81 11.07 11.43 12.33 10
SPEC.WHICH 64.68 65.83 66.66 67.18 78.74 10
FPEAKS.SPEC 100297.52 100648.50 101056.05 101737.56 102927.06 10
Good solutions will be the ones that recover a frequency close (± 10 Hz) to the real frequency the fastest.
More Context
I've got many files (several GBs), each containing a tone that gets modulated several times a second, and sometimes the signal actually disappears altogether so that there is just silence. I want to identify the frequency of the unmodulated tone. I know they should all be somewhere less than 6000 Hz, but I don't know more precisely than that. If (big if) I understand correctly, I've got an OK approach here, it's just a matter of making it faster. Just fyi, I have no previous experience in digital signal processing, so I appreciate any tips and pointers related to the mathematics / methods in addition to advice on how better to approach this programmatically.
After coming to a better understanding of this task and some of the terminology involved, I came across some additional approaches, which I'll present here. These additional approaches allow for window functions and a lot more, whereas the fastest approach in my question does not. I also sped things up a bit by assigning the result of a function to an object and indexing the object instead of running the function again:
#i.e.
(ms<-meanspec(s,f=s.rate,wl=1024,plot=F))[which.max(ms[,2]),1]*1000
# instead of
meanspec(s,f=s.rate,wl=1024,plot=F)[which.max(meanspec(s,f=s.rate,wl=1024,plot=F)[,2]),1]*1000
I have my favorite approach, but I welcome constructive warnings, feedback, and opinions.
microbenchmark(
WHICH.MOD = which((mfft<-Mod(fft(s)))[1:(length(s)/2)] == max(abs(mfft[1:(length(s)/2)]))) * s.rate / length(s),
MEANSPEC = (ms<-meanspec(s,f=s.rate,wl=1024,plot=F))[which.max(ms[,2]),1]*1000,
DFREQ.HIST = (h<-hist(dfreq(s,f=s.rate,wl=1024,plot=F)[,2],200,plot=F))$mids[which.max(h$density)]*1000,
DFREQ.DENS = (dens <- density(dfreq(s,f=s.rate,wl=1024,plot=F)[,2],na.rm=T))$x[which.max(dens$y)]*1000,
FPEAKS.MSPEC = fpeaks(meanspec(s,f=s.rate,wl=1024,plot=F),nmax=1,plot=F)[,1]*1000 ,
times=100)
Unit: milliseconds
expr min lq median uq max neval
WHICH.MOD 8.119499 8.394254 8.513992 8.631377 10.81916 100
MEANSPEC 7.748739 7.985650 8.069466 8.211654 10.03744 100
DFREQ.HIST 9.720990 10.186257 10.299152 10.492016 12.07640 100
DFREQ.DENS 10.086190 10.413116 10.555305 10.721014 12.48137 100
FPEAKS.MSPEC 33.848135 35.441716 36.302971 37.089605 76.45978 100
DFREQ.DENS returns a frequency value farthest from the real value. The other approaches return values close to the real value.
With one of my audio files (i.e. real data) the performance results are a bit different (see below). One potentially relevant difference between the data used above and the real data used for the performance results below is that the data above is just a numeric vector, while my real data is stored in a Wave object, an S4 object from the tuneR package.
library(Rmpfr) # to avoid an integer overflow problem in `WHICH.MOD`
microbenchmark(
WHICH.MOD = which((mfft<-Mod(fft(d@left)))[1:(length(d@left)/2)] == max(abs(mfft[1:(length(d@left)/2)]))) * mpfr(s.rate,100) / length(d@left),
MEANSPEC = (ms<-meanspec(d,f=s.rate,wl=1024,plot=F))[which.max(ms[,2]),1]*1000,
DFREQ.HIST = (h<-hist(dfreq(d,f=s.rate,wl=1024,plot=F)[,2],200,plot=F))$mids[which.max(h$density)]*1000,
DFREQ.DENS = (dens <- density(dfreq(d,f=s.rate,wl=1024,plot=F)[,2],na.rm=T))$x[which.max(dens$y)]*1000,
FPEAKS.MSPEC = fpeaks(meanspec(d,f=s.rate,wl=1024,plot=F),nmax=1,plot=F)[,1]*1000 ,
times=25)
Unit: seconds
expr min lq median uq max neval
WHICH.MOD 3.249395 3.320995 3.361160 3.421977 3.768885 25
MEANSPEC 1.180119 1.234359 1.263213 1.286397 1.315912 25
DFREQ.HIST 1.468117 1.519957 1.534353 1.563132 1.726012 25
DFREQ.DENS 1.432193 1.489323 1.514968 1.553121 1.713296 25
FPEAKS.MSPEC 1.207205 1.260006 1.277846 1.308961 1.390722 25
WHICH.MOD actually has to run twice to account for the left and right audio channels (i.e. my data is stereo), so it takes longer than the output indicates. Note: I needed to use the Rmpfr library in order for the WHICH.MOD approach to work with my real data, as I was having problems with integer overflow.
Interestingly, FPEAKS.MSPEC performed really well with my data, and it seems to return a pretty accurate frequency (based on my visual inspection of a spectrogram). DFREQ.HIST and DFREQ.DENS are quick, but the output frequency isn't as close to what I judge is the real value, and both are relatively ugly solutions. My favorite solution so far, MEANSPEC, uses meanspec() and which.max(). I'll mark this as the answer since I haven't had any other answers, but feel free to provide another answer; I'll vote for it and maybe select it as the answer if it provides a better solution.
