Grouping price ranges - formula

I am trying to group some price ranges from an .ods file, but have no idea how to do that.
For example, I have a column with different prices like this:
11,61
6,15
13,68
7,69
6,00
What I want is to tell Calc to group everything from 0,00~10,99 and output text 0-10 and everything from 11,00~20,00 and output text 11-20, so the final output would be:
col1 col2
11,61 11-20
6,15 0-10
13,68 11-20
7,69 0-10
6,00 0-10

You can use the functions ROUNDDOWN() and ROUNDUP() with a negative count to round to a multiple of 10 (-1), 100 (-2) or 1000 (-3). A negative count reduces the precision of the value by powers of 10. So, rounding to the previous or next multiple of 10 is done using:
=ROUNDDOWN(<yourvalue>; -1)
and
=ROUNDUP(<yourvalue>; -1)
respectively (take care to adapt the formula argument separators to commas (,) if this is required by the locale you're using).
So, =ROUNDDOWN(11,61; -1) will result in 10, and =ROUNDUP(11,61; -1) will give you 20. This way, you can "calculate" the appropriate group for each value (example for value in A1):
=CONCATENATE(ROUNDDOWN($A1; -1)+1;"-";ROUNDUP($A1;-1))
To split it up on multiple lines:
=CONCATENATE( # Result will be a concatenated string
ROUNDDOWN($A1;-1)+1; # first value: previous multiple of 10, +1;
"-"; # second value: literal "-"
ROUNDUP($A1;-1) # third value: next multiple of 10
)
With your example data, this results in:
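The output can also be reproduced outside Calc. Below is a minimal Python sketch of the same formula, assuming non-negative prices; the helper names `round_down_10`, `round_up_10` and `group_label` are made up for illustration and only mimic ROUNDDOWN(x; -1) and ROUNDUP(x; -1).

```python
import math

def round_down_10(x):
    # Mimics ROUNDDOWN(x; -1) for non-negative x: previous multiple of 10.
    return math.floor(x / 10) * 10

def round_up_10(x):
    # Mimics ROUNDUP(x; -1) for non-negative x: next multiple of 10.
    return math.ceil(x / 10) * 10

def group_label(x):
    # Same shape as =CONCATENATE(ROUNDDOWN($A1;-1)+1;"-";ROUNDUP($A1;-1))
    return f"{round_down_10(x) + 1}-{round_up_10(x)}"

prices = [11.61, 6.15, 13.68, 7.69, 6.00]  # example data from the question
labels = [group_label(p) for p in prices]
for price, label in zip(prices, labels):
    print(price, label)
```

Two edge cases are worth noting: values below 10 are labelled 1-10 rather than 0-10, and an exact multiple of 10 such as 10,00 produces 11-10, so those cases may need separate handling.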
EDIT:
For a grouping 0-9, 9-19 and so on, the following formula should work:
=CONCATENATE(ABS(ROUNDDOWN($A2+1; -1)-1);"-";ROUNDUP($A2+1,01;-1)-1)
EDIT2:
For a solution using the IF() function, you could use:
=IF(A2 < 9;"0-9";IF(A2 < 19; "9-19";IF(A2 < 29; "19-29";"more than 29")))
To group values greater than 29, you will have to add further IF clauses, replacing the string "more than 29" with additional checks. Every grouping range requires its own IF clause.
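The same threshold logic can be expressed as a lookup table, which avoids one nested clause per range. This is a hedged Python sketch rather than a Calc formula; `UPPER_BOUNDS`, `LABELS` and `price_group` are illustrative names, and the boundaries simply copy the ones in the IF() formula above.

```python
from bisect import bisect_right

# Boundaries copied from the IF() formula: < 9, < 19, < 29, otherwise "more than 29".
UPPER_BOUNDS = [9, 19, 29]
LABELS = ["0-9", "9-19", "19-29", "more than 29"]

def price_group(value):
    # bisect_right counts how many boundaries are <= value,
    # which is exactly the index of the matching label.
    return LABELS[bisect_right(UPPER_BOUNDS, value)]

for price in [11.61, 6.15, 13.68, 7.69, 6.00]:
    print(price, price_group(price))
```

Adding a range then only means appending one boundary and one label.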

Related

How can I identify inconsistencies and outliers in a dataset in R

I have a big dataset with a lot of columns, most of them non-numeric. I need to find inconsistencies in the data as well as outliers. Finding the inconsistencies would be easy if the dataset weren't so big (7032 rows, to be exact).
An inconsistency would be something like: an ID that is supposed to be 4 letters and 4 numbers but contains something else (like 3 numbers and 2 letters). Another example would be a number that should be 0 or 1 but is -1 or 2.
Is there any function that I can use to find the inconsistencies in each column?
For the columns that don't have numeric values, I thought of using a regex to validate each row of a given column, but I didn't find any information on how to do that.
For the outliers, I drew a boxplot to see if I could find any, like this:
boxplot(dataset$column)
But the plot didn't show any outliers. Should I be satisfied with the result from the plot, or should I try something else to see if there really are any outliers in the data?
For the specific examples you've given:
an ID must be four numbers and four letters (the pattern below also expects a hyphen between them):
!grepl("^[0-9]{4}-[[:alpha:]]{4}$", ID)
will be TRUE for inconsistent values (^ and $ mean beginning- and end-of-string respectively; {4} means "previous pattern repeats exactly four times"; [0-9] means "any symbol between 0 and 9" (i.e. any numeral); [[:alpha:]] means "any alphabetic character"). If you only want uppercase letters you could use [A-Z] instead (assuming you are not working in some weird locale like Estonian).
If you need a numeric value to be 0 or 1, then !num_val %in% c(0,1) will work (this will work for any set of allowed values; you can use it for a specific set of allowed character values as well)
If you need a numeric value to be between a and b then !(a < num_val & num_val < b) ...
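For comparison, here is a rough Python sketch of the same three checks. The sample values and the names `ID_PATTERN`, `ids`, `flags` and `measurements` are invented for illustration, and `[A-Za-z]` is an ASCII-only stand-in for `[[:alpha:]]`.

```python
import re

# Four digits, a hyphen, four ASCII letters (assumed equivalent of the R pattern).
ID_PATTERN = re.compile(r"[0-9]{4}-[A-Za-z]{4}")

ids = ["1234-abcd", "123-ab", "12345-abcd", "1234-ABCD"]  # hypothetical column
flags = [0, 1, -1, 2]                                      # should be 0 or 1
measurements = [5, 0, 10, 3]                               # should lie strictly between a and b
a, b = 0, 10

# True marks an inconsistent value, as in the R expressions above.
bad_id = [ID_PATTERN.fullmatch(s) is None for s in ids]
bad_flag = [v not in (0, 1) for v in flags]
bad_range = [not (a < v < b) for v in measurements]

print(bad_id)
print(bad_flag)
print(bad_range)
```

Each list has one entry per row, so it can be used to pick out or count the offending rows.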

R programming- adding column in dataset error

cv.uk.df$new.d[2:nrow(cv.uk.df)] <- tail(cv.uk.df$deaths, -1) - head(cv.uk.df$deaths, -1) # this line of code works
I wanted to know why we use -1 in tail() and -1 in head() to create this new column.
I tried to understand it by removing the -1, but R (the code is in RStudio) throws an error.
Could anyone shed some light on this? I can't explain how much I would appreciate it.
Look at what is being done. On the left-hand side of the assignment operator, we have:
cv.uk.df$new.d[2:nrow(cv.uk.df)] <-
Let's pick this apart.
cv.uk.df # This is the data.frame
$new.d # a new column to assign or a column to reassign
[2:nrow(cv.uk.df)] # the rows which we are going to assign
Specifically, this line of code will assign a new value to all rows of this column except the first. Why would we want to do that? We don't have your data, but from your example, it looks like you want to calculate the change from one line to the next. That calculation is invalid for the first row (there is no previous row).
Now let's look at the right-hand side.
<- tail(cv.uk.df$deaths, -1) - head(cv.uk.df$deaths, -1)
The cv.uk.df$deaths column has the same number of rows as the data.frame. R gets grouchy when the numbers of elements don't follow some rules. For data.frames, the right-hand side needs to have the same number of elements, or a number that can be recycled a whole number of times. For example, if you have 10 rows, you need to have a replacement of 10 values. Or you can have 5 values that R will recycle.
If your data.frame has 100 rows, only 99 are being replaced in this operation. You cannot feed 100 values into an operation that expects 99. We need to trim the data. Let's look at what is happening. The tail() function has the usage tail(x, n), where it returns the last n values of x. If n is a negative integer, tail() returns all values but the first |n|. The head() function works similarly, returning all values but the last |n|.
tail(cv.uk.df$deaths, -1) # This returns all values but the first
head(cv.uk.df$deaths, -1) # This returns all values but the last
This makes sense for your calculation. You cannot subtract the number of deaths in the row before the first row from the number in the first row, nor can you subtract the number of deaths in the last row from the number in the row after the last row. There are more intuitive ways to do this thing using functions from other packages, but this gets the job done.
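To make the alignment concrete, here is a small Python sketch of the same shifted subtraction. The `deaths` values are made up, and slicing stands in for R's tail(x, -1) and head(x, -1).

```python
deaths = [3, 5, 9, 14, 20]  # hypothetical cumulative counts, one per row

all_but_first = deaths[1:]   # plays the role of tail(deaths, -1)
all_but_last = deaths[:-1]   # plays the role of head(deaths, -1)

# Row i minus row i-1, for every row except the first.
daily_change = [later - earlier for later, earlier in zip(all_but_first, all_but_last)]

# The first row has no previous row, so it stays empty (NA in R).
new_d = [None] + daily_change
print(new_d)
```

Both slices have one element fewer than the original column, which is why the result fits into rows 2 to n.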

R How to get only line numbers from output?

This is my code, and I would like to get only the line number 2174 as output.
Note that the first output row will always be disregarded, so I only care about the second one and only want to see the number of that line, in this case 2174:
e[which(e$obs_pval==min(e$obs_pval)),]
snp obs_pval
1 1.852962e-07 1.852962e-07
2174 4.971520e+07 1.852962e-07
Your min call results in multiple rows sharing the minimum value, which is why more than one row is displayed.
Do you always just want the last row if there are multiple values that match your min call? If so, then you can wrap it in tail() :
tail(e[which(e$obs_pval == min(e$obs_pval)),], 1)
To just get the index:
tail(which(e$obs_pval == min(e$obs_pval)), 1)
or:
which(e$obs_pval == min(e$obs_pval))[length(which(e$obs_pval == min(e$obs_pval)))]
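The idea of "position of the last minimum" can be sketched in Python as follows. The `obs_pval` list is invented sample data, and positions are shifted to be 1-based so they read like the output of which().

```python
obs_pval = [1.852962e-07, 0.5, 0.03, 1.852962e-07]  # hypothetical p-values

smallest = min(obs_pval)

# 1-based positions of every value equal to the minimum, like which() in R.
matches = [i + 1 for i, p in enumerate(obs_pval) if p == smallest]

# Keep only the last matching position, like tail(..., 1).
last_match = matches[-1]
print(matches, last_match)
```

Note that this yields a position, which equals the printed row label only when the data frame still has its default row names.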

How to create an Excel formula that will add a number to specific digits in a multi-digit number

Ex: I enter the number 9876543210 in a cell.
I want to create an IF-THEN formula that adds a number to this, based only on the last digit (the zero in this example).
If the last digit is >= 3, then add 5; if the last digit is <= 2, then add 15.
Then have this formula repeat for 10 numbers - is that possible?
So I input 9876543210
and it then shows:
9876543225
9876543230
9876543245
and so on
=IF((RIGHT(A1,1)/1)>2,A1+5,A1+15)
This assumes that you enter the number in cell A1. Paste the above formula into A2 and copy it downwards.
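The rule can be checked quickly outside Excel. The following Python sketch applies the same test (last digit greater than 2 adds 5, otherwise adds 15) repeatedly; `next_number` and `sequence` are illustrative names, not part of the formula above.

```python
def next_number(n):
    # Same decision as =IF((RIGHT(A1,1)/1)>2,A1+5,A1+15)
    last_digit = n % 10
    return n + 5 if last_digit > 2 else n + 15

def sequence(start, count=10):
    # Apply the rule repeatedly, like copying the formula down `count` rows.
    values = []
    current = start
    for _ in range(count):
        current = next_number(current)
        values.append(current)
    return values

for value in sequence(9876543210):
    print(value)
```

The first three values match the ones listed in the question.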
If this is Excel, you may want to use MOD (modulo or remainder) function to get the last digit and then perform an IF-THEN or nested IF-THEN to achieve this.
=IF(MOD(A1,10)=3, A1+15, IF(MOD(A1,10)=5, A1+20, A1+30))
This formula translates to the following decision tree:
IF the last digit of the value in cell A1 is 3 Then
Add 15 to it
ELSEIF the last digit of the value in cell A1 is 5 then
Add 20 to it
ELSE
Add 30 to it
END IF
Repeating the operation may require some VBA. If you already know the number of times you need to repeat the operation, you can pre-populate formulas in subsequent rows/columns, each time refer to the immediately preceding cell. For example, if you want to repeat it 5 times, you should compute the diff of first two cells and then add that diff to the value of immediately preceding row/column like this (assuming A1 had the original value, B1 had the formula I posted above and C1 through G1 are the next 5 cells):
In C1: =B1 + ($B1 - $A1)
In D1: =C1 + ($B1 - $A1)
and so on...
Note the use of absolute and relative addresses in these formulae. You can copy/paste the formula in C1 to the subsequent cells and it will automatically adjust itself to refer to immediately preceding cell.
EDIT
I just realized that you want to evaluate the MOD formula in each subsequent cell. In that case you simply need to copy/paste it to subsequent cells instead of using 2nd and 3rd formulas I posted above.

Probability of 3-character string appearing in a randomly generated password

If you have a randomly generated password, consisting of only alphanumeric characters, of length 12, and the comparison is case insensitive (i.e. 'A' == 'a'), what is the probability that one specific string of length 3 (e.g. 'ABC') will appear in that password?
I know the number of total possible combinations is (26+10)^12, but beyond that, I'm a little lost. An explanation of the math would also be most helpful.
The string "abc" can appear in the first position, making the string look like this:
abcXXXXXXXXX
...where the X's can be any letter or number. There are (26 + 10)^9 such strings.
It can appear in the second position, making the string look like:
XabcXXXXXXXX
And there are (26 + 10)^9 such strings also.
Since "abc" can appear at anywhere from the first through 10th positions, there are 10*36^9 such strings.
But this overcounts, because it counts (for instance) strings like this twice:
abcXXXabcXXX
So we need to count all of the strings like this and subtract them off of our total.
Since there are 6 X's in this pattern, there are 36^6 strings that match this pattern.
I get 7+6+5+4+3+2+1 = 28 patterns like this. (If the first "abc" is at the beginning, the second can be in any of 7 places. If the first "abc" is in the second place, the second can be in any of 6 places. And so on.)
So subtract off 28*36^6.
...but that subtracts off too much. A string like this one contains "abc" three times, so it was counted three times in the first step and then subtracted three times (once for each pair of occurrences), leaving it not counted at all:
abcXabcXabcX
So we have to add the strings like this back in once. I get 4+3+2+1 + 3+2+1 + 2+1 + 1 = 20 of these patterns, meaning we have to add back in 20*(36^3).
But after that, this string has been counted 4 - 6 + 4 = 2 times:
abcabcabcabc
...so we have to subtract off 1.
Final answer:
10*36^9 - 28*36^6 + 20*(36^3) - 1
Divide that by 36^12 to get your probability.
See also the Inclusion-Exclusion Principle. And let me know if I made an error in my counting.
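A count like this is easy to get wrong, so a numerical cross-check can help. The sketch below is one way to do it in Python, assuming the target word cannot overlap itself (true for "abc"): it uses the textbook inclusion-exclusion sum, where C(n - 2j, j) is the number of ways to place j non-overlapping copies of the word in n positions, and compares it with brute force on a tiny alphabet. The function names are illustrative.

```python
from fractions import Fraction
from itertools import product
from math import comb

def count_containing(n, k):
    # Strings of length n over k symbols that contain a fixed 3-symbol,
    # non-self-overlapping word at least once (inclusion-exclusion).
    return sum((-1) ** (j + 1) * comb(n - 2 * j, j) * k ** (n - 3 * j)
               for j in range(1, n // 3 + 1))

def brute_force(n, alphabet="abc", word="abc"):
    # Exhaustive count, only feasible for small n and small alphabets.
    return sum(word in "".join(s) for s in product(alphabet, repeat=n))

# Sanity check on a case small enough to enumerate.
small_formula = count_containing(6, 3)
small_brute = brute_force(6)

# The case from the question: length 12, 36 case-insensitive symbols.
probability = Fraction(count_containing(12, 36), 36 ** 12)
print(small_formula, small_brute)
print(float(probability))
```

Using Fraction keeps the probability exact until the final conversion to a float.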
If A is not equal to C, the probability P(n) of ABC occurring in a string of length n (assuming every alphanumeric symbol is equally likely) is
P(n)=P(n-1)+P(3)[1-P(n-3)]
where
P(0)=P(1)=P(2)=0 and P(3)=1/(36)^3
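The recurrence is straightforward to evaluate by iteration. Here is a minimal Python sketch using exact fractions; `prob_contains` is an illustrative name, and the alphabet size k = 36 is taken from the question.

```python
from fractions import Fraction

def prob_contains(n, k=36):
    # P(0) = P(1) = P(2) = 0 and P(3) = 1 / k^3, as stated above.
    p3 = Fraction(1, k ** 3)
    p = [Fraction(0), Fraction(0), Fraction(0), p3]
    # P(m) = P(m-1) + P(3) * (1 - P(m-3)) for m > 3.
    for m in range(4, n + 1):
        p.append(p[m - 1] + p3 * (1 - p[m - 3]))
    return p[n]

p12 = prob_contains(12)
print(p12)
print(float(p12))
```

For n = 12 this evaluates to roughly 0.00021432.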
To expand on Paul R's answer. Probability (for equally likely outcomes) is the number of possible outcomes of your event divided by the total number of possible outcomes.
There are 10 possible places where a string of length 3 can be found in a string of length 12. And there are 9 more spots that can be filled with any other alphanumeric characters, which leads to 36^9 possibilities. So the number of possible outcomes of your event is 10 * 36^9.
Divide that by your total number of outcomes 36^12. And your answer is 10 * 36^-3 = 0.000214
EDIT: This is not completely correct. In this solution, some cases are double counted. However, they form only a very small contribution to the probability (about 1.3e-8), so this answer is still correct to about 7 decimal places. If you want the full answer, see Nemo's answer.
