Finding the first significant figure of difference between two very similar values - r

I'm trying to reproduce the computations that led to a data set data.ref. I'd like to test how well my current implementation does by comparing the reference data to my computed results, data.my. Since each column of the data should have comparable magnitudes within the column, but not necessarily between columns, I've been looking at
(data.ref - data.my) / data.ref
to put errors on a comparable scale. However, since the data is ultimately going to be rounded off, what I'd really like to do is just run a quick and dirty check of how many significant figures' worth of agreement the data has. That is, since I expect data.ref and data.my to be quite close to each other, I'd like to answer the question: what is the first significant figure at which each pair of corresponding entries differs?
Is there an R function that does this?

ceiling(log10(abs(data.ref - data.my))) seems to do the trick.
Example:
> data.my <- c(20, 30, 32, 32.01, 32.012)
> data.ref <- rep(32, length(data.my))
> ceiling(log10(abs(data.my - data.ref)))
[1] 2 1 -Inf -2 -1
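Since the question asks about significant figures rather than absolute decimal places, one possible variation is to scale the difference by the reference value first. This is only a rough sketch: it estimates the number of leading digits in agreement from the relative error, it is not an exact digit-by-digit comparison, and it assumes no entry of data.ref is zero. The names rel.err and agree are made up for illustration.

```r
data.ref <- rep(32, 5)
data.my  <- c(20, 30, 32, 32.01, 32.012)

# Relative error puts every column on a comparable scale
rel.err <- abs((data.ref - data.my) / data.ref)

# Rough count of leading significant figures in agreement;
# identical entries give Inf because log10(0) is -Inf
agree <- floor(-log10(rel.err))
agree
```

Entries that match exactly come out as Inf, so they are easy to separate from near misses.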

Related

Is there some way to detect 'wrong' measures in a dataframe?

I'm struggling with how to remove 'wrong' measures from my dataset. I'm dealing with a fairly large table, where I have a date and the size of a piece of equipment. The size can't get bigger with use; at most it can stay the same, so any increase must be a measurement error.
My database is extensive and has several special cases, which (among other business reasons) makes it impossible for me to post it here. Therefore, I use an image and part of the data as an example, but the problem is what I described above.
simplest_example = test = data.frame(data1 = c("20-09-2020", "15-10-2020", "13-05-2021", "20-10-2021","20-11-2021"), measure = c(5,4,3,5,2))
#as result:
# data1 measure
#1 20-09-2020 5
#2 15-10-2020 4
#3 13-05-2021 3
#4 20-11-2021 2
The point is: select the longest possible non-ascending sequence, excluding the values that prevent it.
So I would like to ask for suggestions, if anyone here has come across something similar and can recommend an approach.
If I understand correctly, you want to detect any time the measure is greater than the value at the previous time point? I'd create a lag column, which is just the measure column shifted by one time point, and then identify the rows where the current measure is greater than the previous one:
library(dplyr)
simplest_example %>%
  mutate(previous_measure = lag(measure)) %>%  # value at the previous time point
  filter(previous_measure < measure)           # rows where the size went up
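To actually drop the offending rows, one simple option in base R is to compare each measure with the running minimum. This is a greedy sketch, not a guaranteed longest non-ascending subsequence: a single reading that is wrongly low would cause later valid readings to be discarded. The names ord, sorted and cleaned are made up for illustration.

```r
simplest_example <- data.frame(
  data1   = c("20-09-2020", "15-10-2020", "13-05-2021", "20-10-2021", "20-11-2021"),
  measure = c(5, 4, 3, 5, 2)
)

# Sort chronologically (the dates are day-month-year strings)
ord    <- order(as.Date(simplest_example$data1, format = "%d-%m-%Y"))
sorted <- simplest_example[ord, ]

# Keep only rows whose measure does not exceed the smallest value seen so far
cleaned <- sorted[sorted$measure <= cummin(sorted$measure), ]
cleaned
```

On the example data this keeps the four rows listed in the question and drops the 20-10-2021 reading.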

How to run a function to EACH of my observations in R?

My problem is as follows:
I have a dataset of 6000 observations containing information from customers (each observation is one client's information).
I'm optimizing a given function (in my case a profit function) in order to find an optimum for my variable of interest. In particular, I'm looking for the optimal interest rate I should offer in order to maximize my expected profits.
I don't have any doubts about my function. The problem is that I don't know how I should proceed in order to apply this function to EACH OBSERVATION in order to obtain an OPTIMAL INTEREST RATE for EACH OF MY 6000 CLIENTS (or observations, as you prefer).
Until now, it has been easy to find the UNIQUE optimum (the same for all clients) for this variable that would maximize my profits (that is, the global maximum, I guess). But what I need to know is how I should proceed in order to apply my optimization problem to EACH of my 6000 observations, INDIVIDUALLY, in order to have the optimal interest rate to offer to each customer (that is, 6000 optimal interest rates, one for each of them).
I guess I should do something similar to a for loop, but my experience in this area is limited, and I'm quite frustrated already. What's more, I've tried to use mapply(myfunction, mydata) as usual, but I only get error messages.
This is what my (really) simple code looks like now:
profits<- function(Rate)
sum((Amount*(Rate-1.2)/100)*
(1/(1+exp(0.600002438-0.140799335888812*
((Previous.Rate - Rate)+(Competition.Rate - Rate))))))
And results for ONE optimal for the entire sample:
> optimise(profits, lower = 0, upper = 100, maximum = TRUE)
$maximum
[1] 6.644821
$objective
[1] 1347291
So the thing is, how do I rewrite my code in order to maximize this and obtain the optimal of my variable of interest for EACH of my rows?
Hope I've been clear! Thank you all in advance!
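It appears your customers are independent of one another, so you can just put lapply() around the optimise() call:
lapply(customer_list, function(one_customer) {
  # profits() has to use one_customer's values (Amount, Previous.Rate, Competition.Rate)
  # rather than the whole-sample vectors, otherwise every result is identical
  optimise(profits, lower = 0, upper = 100, maximum = TRUE)
})
This will return a very big list, where each list element has a $maximum (the optimal rate) and a $objective (the profit at that rate). You can then run sapply over it to total the $objective values, to find just how rich you have become!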
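A minimal sketch of the per-customer version might look as follows. It assumes the profit formula from the question applies row by row; the data frame customers, its values, and the helper profit_one are made up for illustration. Extra named arguments to optimise() are passed through to the function being optimised.

```r
# Hypothetical data: three customers with made-up values
customers <- data.frame(
  Amount           = c(1000, 5000, 20000),
  Previous.Rate    = c(7.0, 6.5, 8.0),
  Competition.Rate = c(6.8, 6.0, 7.5)
)

# Same formula as in the question, but for a single customer (no sum())
profit_one <- function(Rate, Amount, Previous.Rate, Competition.Rate) {
  (Amount * (Rate - 1.2) / 100) *
    (1 / (1 + exp(0.600002438 - 0.140799335888812 *
                    ((Previous.Rate - Rate) + (Competition.Rate - Rate)))))
}

# One optimisation per row; each call sees only that customer's values
results <- mapply(
  function(amt, prev, comp) {
    opt <- optimise(profit_one, lower = 0, upper = 100, maximum = TRUE,
                    Amount = amt, Previous.Rate = prev, Competition.Rate = comp)
    c(rate = opt$maximum, profit = opt$objective)
  },
  customers$Amount, customers$Previous.Rate, customers$Competition.Rate
)

t(results)  # one row per customer: optimal rate and the profit at that rate
```

With the real data, customers would simply be the 6000-row data frame, and sum(results["profit", ]) would give the total expected profit.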

How to compute for the mean and sd

I need help on 4b please
‘Warpbreaks’ is a built-in dataset in R. Load it using the function data(warpbreaks). It consists of the number of warp breaks per loom, where a loom corresponds to a fixed length of yarn. It has three variables namely, breaks, wool, and tension.
b. For the ‘AM.warpbreaks’ dataset, compute for the mean and the standard deviation of the breaks variable for those observations with breaks value not exceeding 30.
data(warpbreaks)
warpbreaks <- data.frame(warpbreaks)
AM.warpbreaks <- subset(warpbreaks, wool=="A" & tension=="M")
mean(AM.warpbreaks<=30)
sd(AM.warpbreaks<=30)
This is how I understood the problem, and I typed the last two lines of code accordingly. However, I wasn't able to run the last two lines, while the first 3 lines ran successfully. Can anybody tell me what the error is here?
Thanks! :)
Another way to go about it, which avoids generating a bunch of datasets and then having to remember which is which (this is more of a personal preference, though):
data(warpbreaks)
# AM.warpbreaks is the subset defined in the question
mean(AM.warpbreaks[which(AM.warpbreaks$breaks <= 30), "breaks"])
sd(AM.warpbreaks[which(AM.warpbreaks$breaks <= 30), "breaks"])
There are two problems with your code. The first is that you are comparing to 30, but you're looking at the entire data frame, rather than just the "breaks" column.
AM.warpbreaks$breaks <= 30
is an expression that tests whether each breaks value is at most 30.
The second is that mean(AM.warpbreaks$breaks <= 30) will not give the answer you want either, because R evaluates the inner expression to a vector of logical TRUE/FALSE values indicating whether each breaks value is at most 30, so mean() returns the proportion of TRUE values rather than the mean of the breaks.
Generally, you just want to take another subset for an analysis like this.
AM.lt.30 <- subset(AM.warpbreaks, breaks <= 30)
mean(AM.lt.30$breaks)
sd(AM.lt.30$breaks)
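To see the difference between the two readings side by side, here is a short sketch; the variable names keep, prop, m and s are made up for illustration.

```r
data(warpbreaks)
AM.warpbreaks <- subset(warpbreaks, wool == "A" & tension == "M")

# Logical vector with one TRUE/FALSE per row
keep <- AM.warpbreaks$breaks <= 30

prop <- mean(keep)                       # proportion of rows with breaks <= 30
m    <- mean(AM.warpbreaks$breaks[keep]) # mean of the breaks values that are <= 30
s    <- sd(AM.warpbreaks$breaks[keep])   # standard deviation of those same values
```

Indexing with the logical vector gives the same values as the subset() approach above.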

rank() function in R is ranking objects with floating points rather than integers

I'm quite new to R so this may seem quite trivial to many experienced programmers, sorry in advance!
I've got a numeric vector of length 8 that looks like this:
data <- c(45, 67, 23, 24, 5, 23, 45, 23)
When I type in: rank(data), R returns: [1] 6.5 8.0 3.0 5.0 1.0 3.0 6.5 3.0
However with my (very basic) understanding of rank, I expect R to return to me only whole numbers... such as:
[1] 6 8 2 5 1 3 7 4
How can rank() tell me the first element in data has a floating point ranking rather than a whole number ranking? Is it because there are values in data that are repeated and so rank() is trying to handle ties in a way that I am not expecting? If so, please tell me how I can fix this so I can get output that looks like what I previously expected. Also, any information on how rank() deals with NA values would be much appreciated. A basic description on rank() and what bells and whistles can be used would be fantastic! I've looked for videos on youtube and searched stackoverflow to no avail! Thanks so much.
From ?rank:
With some values equal (called ‘ties’), the argument ties.method determines the result at the corresponding indices. The "first" method results in a permutation with increasing values at each index set of ties. The "random" method puts these in random order whereas the default, "average", replaces them by their mean, and "max" and "min" replaces them by their maximum and minimum respectively, the latter being the typical sports ranking.
Sounds like you're using the default setting of "average" for tie breaking, which uses the mean, which is not necessarily an integer.
The built-in documentation should always be your first stop in looking for help. In this case (and most cases), it details all the "bells and whistles"---here there aren't many: just tie-handling and NA-handling. It also has examples at the bottom.
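As a quick illustration of the options described in ?rank, using the vector from the question (the variable names here are made up):

```r
data <- c(45, 67, 23, 24, 5, 23, 45, 23)

# Ties broken by order of appearance: 6 8 2 5 1 3 7 4
r_first <- rank(data, ties.method = "first")

# Every tied value gets the smallest rank of its group: 6 8 2 5 1 2 6 2
r_min <- rank(data, ties.method = "min")

# NA handling is controlled by na.last
with_na   <- c(3, NA, 1)
r_na_last <- rank(with_na, na.last = TRUE)    # NA is ranked last: 2 3 1
r_na_keep <- rank(with_na, na.last = "keep")  # NA keeps rank NA:  2 NA 1
```

ties.method = "first" reproduces the whole-number output expected in the question.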

image comparison in R

I am looking for the best way to compare 2 or more images.
The images I have are now in matrix format, so basically I am comparing matrices.
They aren't square (but this isn't a problem).
This is an example of what I have with only two matrices:
#Original data
M1<-cbind(c(0,0,20,40,50,35),c(0,0,5,20,90,80),c(0,0,10,25,85,0),c(58,70,20,50,0,5))
#Data to be compared with M1
M2<-cbind(c(0,5,25,25,60,15),c(0,30,15,10,116,67),c(0,2,9,20,90,1),c(69,50,22,30,0,2))
I can check for the differences and the correlation, but I also want to be able to say for example, if:
high values in M2 occur in the same positions as in M1
high values in M2 occur close to the positions in M1
high values in M2 occur far away
Same thing for low values.
By high values I mean maximum values. For example, if the max value in M1 is at position M1_maxvalue(x,y), then the max value in M2 should be similar to the one observed in M1 and should occur at, or close to, the same position M1_maxvalue(x,y).
I can extract the positions and the variation in the positions of the maximum values; however, I am looking for existing methods on which I can base my comparisons.
What type of calculations can I use for this type of analysis?
I can use both image processing packages and matrix algorithms.
Sounds like a job better handled with ImageJ or SAODS9 at http://hea-www.harvard.edu/RD/ds9/ .
IIRC those apps have built-in tools for spot and blob-finding, which may save you a lot of time and pain.
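If you want to stay in R, a rough base-R sketch of the "same position / close / far away" check for the maximum values could look like this. It is only a starting point, not an established image-comparison method; the names pos1, pos2, peak_dist and overall_cor are made up for illustration.

```r
M1 <- cbind(c(0,0,20,40,50,35), c(0,0,5,20,90,80), c(0,0,10,25,85,0), c(58,70,20,50,0,5))
M2 <- cbind(c(0,5,25,25,60,15), c(0,30,15,10,116,67), c(0,2,9,20,90,1), c(69,50,22,30,0,2))

# Row/column position of the maximum in each matrix
pos1 <- which(M1 == max(M1), arr.ind = TRUE)
pos2 <- which(M2 == max(M2), arr.ind = TRUE)

# Euclidean distance between the peak positions
# (uses the first peak if a maximum occurs more than once)
peak_dist <- sqrt(sum((pos1[1, ] - pos2[1, ])^2))

# Cell-by-cell correlation as an overall similarity measure
overall_cor <- cor(as.vector(M1), as.vector(M2))
```

A peak_dist of 0 means the maxima coincide; small values mean they are close, large values mean they are far apart. The same idea works for minima by replacing max() with min().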
