I'm trying to implement Benford's Law in R. So far everything works as expected, except that when some first digits have 0 occurrences, an error is thrown:
Error in data.frame(digit = 1:9, actual.count = first_digit_counts, actual.fraction = first_digit_counts/nrow(fraudDetection), :
arguments imply differing number of rows: 9, 5
This is because in my current data set the only first digits that occur are 1, 2, 7, 8 and 9. How can I make 3, 4, 5 and 6 appear with a count of 0 instead of being missing from the table altogether?
This is the line that causes the error:
first_digit_counts <- as.vector(table(fraudDetection$first.digit))
The full code that this line belongs to is as follows:
# load the required packages
require(reshape)
require(stringr)
require(plyr)
require(ggplot2)
require(scales)
# load in data from CSV file
fraudDetection <- read.csv("Fraud Case in Arizona 1993.csv")
names(fraudDetection)
# take only the columns containing the counts and manipulate the data into a "long" format with only one value per row
# let's compare the amounts of the fraudulent transactions against Benford's Law
fraudDetection <- melt(fraudDetection["Amount"])
# add columns containing the first and last digits, extracted using regular expressions
fraudDetection <- ddply(fraudDetection, .(variable), transform, first.digit = str_extract(value, "[123456789]"), last.digit = str_extract(value, "[[:digit:]]$"))
# compare counts of each actual first digit against the counts predicted by Benford’s Law
first_digit_counts <- as.vector(table(fraudDetection$first.digit))
first_digit_actual_vs_expected <- data.frame(
digit = 1:9,
actual.count = first_digit_counts,
actual.fraction = first_digit_counts / nrow(fraudDetection),
benford.fraction = log10(1 + 1 / (1:9))
)
In order to ensure that all digits are represented in first_digit_counts, you can convert first.digit to a factor, explicitly setting the levels so they include all digits from 1 to 9:
first_digit = c(1, 1, 3, 5, 5, 5, 7, 7, 7, 7, 9)
first_digit_factor = factor(first_digit, levels=1:9) # Explicitly set the levels
That makes your table calls perform as expected:
> table(first_digit)
first_digit
1 3 5 7 9
2 1 3 4 1
> table(first_digit_factor)
first_digit_factor
1 2 3 4 5 6 7 8 9
2 0 1 0 3 0 4 0 1
> as.vector(table(first_digit_factor))
[1] 2 0 1 0 3 0 4 0 1
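Applied to the data frame from the question, the same idea might look like the sketch below. The vector first_digit is made-up stand-in data for fraudDetection$first.digit (str_extract returns character strings, hence the quoted digits), and length(first_digit) stands in for nrow(fraudDetection); only the factor() call is the actual fix.

```r
# Stand-in for fraudDetection$first.digit (hypothetical data: only 1, 2, 7, 8, 9 occur)
first_digit <- c("1", "1", "2", "7", "8", "8", "9")

# Fixing the levels makes table() report a 0 for every digit that never occurs
first_digit_counts <- as.vector(table(factor(first_digit, levels = 1:9)))

# All four columns now have 9 rows, so data.frame() no longer fails
first_digit_actual_vs_expected <- data.frame(
  digit            = 1:9,
  actual.count     = first_digit_counts,
  actual.fraction  = first_digit_counts / length(first_digit),
  benford.fraction = log10(1 + 1 / (1:9))
)
```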
A function for this is available in the rattle package:
library(rattle)
dummy <- rnorm(100)
calcInitialDigitDistr(dummy, split = "none")
A useful one-line function:
benford = function(x) barplot(table(as.numeric(substr(x,1,1))))
benford(ggplot2::diamonds$price)
Related
I have a vector of nodes taken from a binary regression tree. These are in level order, for example, 1,2,4,5,10,11. I would like to place them in infix order like so: 4,2,10,5,11,1. Thanks to Alistaire I have a solution that uses recursion. But as they point out, "There has to be a better way". I was hoping someone might be able to help me out with a non-recursive approach. The recursive version is very slow for vectors of any reasonable length. I have also tried creating a binary tree using igraph and data.tree but I cannot seem to get the ordering I want from these.
Yes, it's possible to do this without recursion, since you are dealing with a binary tree, which has a fixed structure: node k always has the children 2k and 2k + 1.
Suppose we have a vector of your nodes:
nodes <- c(1, 2, 4, 5, 10, 11)
First of all, we only want a binary tree that is of a suitable depth to accommodate your largest node. We can get the required depth by doing:
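depth <- floor(log2(max(nodes))) + 1  # node k sits at depth floor(log2(k)) + 1; ceiling(log2(k)) is one too small when k is a power of 2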
And a data frame that gives the node number, depth and 'leftness' of a sufficiently large binary tree like this:
df <- data.frame(node = seq(2^(depth) - 1),
                 depth = rep(seq(depth), times = 2^(seq(depth) - 1)),
                 leftness = unlist(sapply(2^seq(depth) - 1,
                   function(x) (seq(x)[seq(x) %% 2 == 1])/(x + 1))))
However, we only need the subset of this tree that matches your nodes:
df <- df[match(nodes, df$node),]
df
#> node depth leftness
#> 1 1 1 0.5000
#> 2 2 2 0.2500
#> 4 4 3 0.1250
#> 5 5 3 0.3750
#> 10 10 4 0.3125
#> 11 11 4 0.4375
And we can sort the nodes in order according to leftness:
df$node[order(df$leftness)]
#> [1] 4 2 10 5 11 1
Which is your expected result.
To generalize this, just put the above steps in a function:
sort_left <- function(nodes) {
  # node k sits at depth floor(log2(k)) + 1 (ceiling() is one too small for powers of 2)
  depth <- floor(log2(max(nodes))) + 1
  df <- data.frame(node = seq(2^(depth) - 1),
                   depth = rep(seq(depth), times = 2^(seq(depth) - 1)),
                   leftness = unlist(sapply(2^seq(depth) - 1,
                     function(x) (seq(x)[seq(x) %% 2 == 1])/(x + 1))))
  df <- df[match(nodes, df$node),]
  df$node[order(df$leftness)]
}
So we can do:
sort_left( c(1, 2, 4, 5, 10, 11))
#> [1] 4 2 10 5 11 1
Or, given the example in your original question,
sort_left(c(1,2,4,5,10,11,20,21))
#> [1] 4 2 20 10 21 5 11 1
Which was the desired result. All without recursion.
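For long vectors, building the full tree data frame may be unnecessary, because the leftness of a node can be computed directly from its number. The following is a minimal sketch of that idea, not part of the answer above; it assumes the usual numbering where the root is 1 and node k has children 2k and 2k + 1, and the function name sort_left_direct is made up for illustration.

```r
# Sketch: in-order position computed per node, no tree data frame needed.
# Assumes heap-style numbering (root = 1, children of k are 2k and 2k + 1).
sort_left_direct <- function(nodes) {
  level    <- floor(log2(nodes))   # 0 for the root, 1 for nodes 2-3, 2 for nodes 4-7, ...
  position <- nodes - 2^level      # 0-based position of the node within its level
  # Same leftness as in the data frame above: odd numerators over 2^(level + 1)
  leftness <- (2 * position + 1) / 2^(level + 1)
  nodes[order(leftness)]
}

sort_left_direct(c(1, 2, 4, 5, 10, 11, 20, 21))
#> [1]  4  2 20 10 21  5 11  1
```

Because every step is vectorised, the work grows with the number of nodes supplied rather than with the size of a complete tree of the required depth.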
So let's say I roll 5 dice.
The code to simulate the rolls would be
Rolls<-sample(1:6, 5, replace=TRUE)
and that's if I want to store my rolls under the object Rolls.
Now let's say for some reason I don't want there to be more than 2 sixes. That means if I roll, for example, 6 3 5 6 6 1, would I be able to re-roll one of the 6 values into a new value, so that there are only 2 values of 6 and 4 values that are not 6?
Any support would be appreciated.
Thanks in advance
A solution without loops could be:
condition = which(Rolls==6)
if(length(condition)>=3){
Rolls[condition[3:length(condition)]] = sample(1:5, length(condition)-2, replace=TRUE)
}
condition holds the positions of the 6s in Rolls. If there are more than 2 of them, you select the third one onward with Rolls[condition[3:length(condition)]] and re-sample those positions from 1:5, so that no new 6 can appear.
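The same steps can be wrapped in a small function. This is only a sketch: the name cap_sixes and the argument max_sixes are made up for illustration, and the logical subsetting replaces the 3:length(condition) range so that no if statement is needed.

```r
# Hypothetical helper: re-roll every six beyond the first `max_sixes`
cap_sixes <- function(rolls, max_sixes = 2) {
  six_pos <- which(rolls == 6)                        # positions of all sixes
  extra   <- six_pos[seq_along(six_pos) > max_sixes]  # sixes beyond the allowed number
  # Re-roll from 1:5 so that a replacement can never be another six
  rolls[extra] <- sample(1:5, length(extra), replace = TRUE)
  rolls
}

set.seed(1)                               # only to make the example reproducible
capped <- cap_sixes(c(6, 3, 5, 6, 6, 1))
sum(capped == 6)                          # 2: the first two sixes are kept
```

When there are two or fewer sixes, extra is empty and the rolls are returned unchanged.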
And the second question could be something like:
remove = 3
Rolls = Rolls[-which(Rolls==remove)[1]]
You can easily put those into functions if you like
Edit 1
To make the second answer a bit more interactive, you can build a function for it:
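remove.roll = function(remove, rolls){
  idx = which(rolls==remove)[1]    # position of the first match (NA if there is none)
  if (is.na(idx)) return(rolls)    # nothing to remove
  rolls[-idx]}                     # return the rolls without that element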
The user can then call the function with whatever value of remove they like. You can also write a program that reads the value from the prompt:
remove = as.numeric(readline(prompt="Enter number to remove: "))  # readline() returns a string
Rolls = Rolls[-which(Rolls==remove)[1]]
print(Rolls)
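If I understood it correctly, this should work. Note that the loop below keeps re-rolling until at most one six is left; to allow two sixes, change the condition to Nr_of_six > 2 and the index to [3].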
n <- 10
(Rolls<-sample(1:6, n, replace=TRUE))
#> [1] 6 2 4 1 1 6 5 2 1 6
(Nr_of_six <- sum(6 == Rolls))
#> [1] 3
while (Nr_of_six > 1) {
extra_roll <- sample(1:6, 1, replace=TRUE)
second_six <- which(Rolls==6)[2]
Rolls[second_six] <- extra_roll
print(Rolls)
Nr_of_six <- sum(6 == Rolls)
}
#> [1] 6 2 4 1 1 4 5 2 1 6
#> [1] 6 2 4 1 1 4 5 2 1 3
print(Rolls)
#> [1] 6 2 4 1 1 4 5 2 1 3
Created on 2021-03-21 by the reprex package (v1.0.0)
We can make this a fun demonstration of a use case for scan(). You can input the positions of the values that you want to replace. Note that you need to hand scan() the positions one at a time and hit enter after each one. To end the input, hand over an empty string "" and press enter.
Code
dice.roll <- function(){
  # Initial toss
  Rolls <- sample(seq(1, 6), 5, replace=TRUE)
  # Communicate
  cat("The outcome of the dice roll was:", "\n\n", Rolls, "\n\n",
      "would you like to reroll any of those values ?", "\n",
      "If yes enter the positions of the values you would \n like to replace, else just input an empty string.")
  # Take input
  tmp1 <- scan(what = "")
  # Replace
  Rolls[as.numeric(tmp1)] <- sample(seq(1, 6), length(tmp1), replace=TRUE)
  # Return
  cat("You successfully replaced", length(tmp1), "elements. Your rolls now look as follows: \n\n",
      Rolls)
}
dice.roll()
# The outcome of the dice roll was:
#
# 6 4 6 3 4
#
# would you like to reroll any of those values ?
# If yes enter the positions of the values you would
# like to replace, else just input an empty string.
# 1: 1
# 2: 3
# 3: ""
# Read 2 items
# You successfully replaced 2 elements. Your rolls now look as follows:
#
# 2 4 2 3 4
Please note that this function is just a quick write-up. To implement this properly, you should use a while statement or recursion to repeat the replacement as often as you'd like. Additionally, before actually using it, you would have to add if statements that handle inputs that are too long and other user behavior that could cause an error.
Here is my version of this function, which uses recursion to re-roll values until there are no more than two 6s. Note that the rolls vector lives outside the function, so to replace the third, fourth, ... 6 from inside the function we use the superassignment operator <<-.
I personally chose to replace the first 6 whenever there are three or more 6s.
rolls <- sample(1:6, 6, replace = TRUE)
n_six <- function() {
  n <- length(rolls[rolls == 6])
  if(n <= 2) {
    return(rolls)
  } else {
    extra <- sample(1:6, 1, replace = TRUE)
    rolls[which(rolls == 6)][1] <<- extra
  }
  n_six()
}
# Imagine our rolls would be a vector with 3 six values like this
rolls <- c(1, 2, 6, 5, 6, 6)
> n_six()
[1] 1 2 3 5 6 6 # First 6 was replaced
# Or our rolls contains 4 six values
rolls <- c(1, 6, 6, 5, 6, 6)
> n_six()
[1] 1 4 1 5 6 6 # First 2 6s have been replaced
And so on ...
I have two columns in my data frame, value and num_leads. I'd like to create a third column that holds the entry of value from n rows below, where n is the number stored in num_leads. Here's an example:
df1 <- data.frame(value = c(1:5),
num_leads = c(2, 3, 1, 1, 0))
Desired output:
value num_leads result
1 1 2 3
2 2 3 5
3 3 1 4
4 4 1 5
5 5 0 5
I have tried using the lead function in dplyr but unfortunately it seems all the leads must have the same number.
Using indexing:
with(df1, value[seq_along(value) + num_leads])
Here seq_along(value) gives the row number, and adding num_leads to it gives the row from which the value is pulled.
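As a quick check, here is the indexing approach run on the example data from the question, with the result stored in a new column:

```r
df1 <- data.frame(value = 1:5,
                  num_leads = c(2, 3, 1, 1, 0))

# Row i looks up the value in row i + num_leads[i]
df1$result <- with(df1, value[seq_along(value) + num_leads])
df1
#>   value num_leads result
#> 1     1         2      3
#> 2     2         3      5
#> 3     3         1      4
#> 4     4         1      5
#> 5     5         0      5
```

If a row points past the end of the data frame, the index is out of range and the result for that row is NA.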
This is what I came up with:
df1$result <- df1$value[seq_len(nrow(df1)) + df1$num_leads]  # index by row number; indexing by value only works when value equals the row number
Maybe the title is a little bit vague but I didn't know how to better describe it. Suppose the following table/column is given:
tab0 <- data.frame(month = c(1, 3, 4, 7, 9, 12))
What I would love to achieve by using dplyr is the following table:
tab1 <- data.frame(month = c(1, 3, 4, 7, 9, 12), group = c(1, 1, 2, 3, 3, 4))
A month is assigned to a group in a way such that there is a maximum time lag (within a group) of 2 months. This is only an example, in the end I want to apply it to much more data and use days instead of months. I hope it's clear what I am after.
# example dataframe
tab0 <- data.frame(month = c(1, 3, 4, 7, 9, 12))
# input your lag
lag = 2
# create group
tab0$group = 1 + (tab0$month - tab0$month[1]) %/% (lag + 1)
# see updated dataset
tab0
# month group
# 1 1 1
# 2 3 1
# 3 4 2
# 4 7 3
# 5 9 3
# 6 12 4
The group number is calculated as follows: for each row we take the distance between the current month and the first month. Then we divide the result by 3 (your lag of 2 plus 1) and keep the integer part of the division. Finally we add 1 to the result.
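Since the question mentions switching from months to days, here is a minimal sketch of the same arithmetic on Date values. The dates and the name max_lag are made up for illustration, and the sketch assumes the dates are sorted so that the first element is the earliest one.

```r
# Hypothetical daily data; `max_lag` is an assumed name for the lag in days
dates   <- as.Date(c("2021-01-01", "2021-01-03", "2021-01-04",
                     "2021-01-07", "2021-01-09", "2021-01-12"))
max_lag <- 2

# Days elapsed since the first date, cut into windows of (max_lag + 1) days
days_since_start <- as.numeric(dates - dates[1], units = "days")
group <- 1 + days_since_start %/% (max_lag + 1)
group
#> [1] 1 1 2 3 3 4
```

The elapsed days here are 0, 2, 3, 6, 8 and 11, so integer division by 3 gives 0, 0, 1, 2, 2 and 3 before 1 is added.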
I am setting up a "decision map" in xts. The code below generates the following:
dec.1 dec.2 dec.3 dec.4 Master.dec
2017-01-01 2 2 2 2 2
2017-02-02 3 3 3 3 3
2017-03-03 0 0 0 0 0
There will always be at least one of these columns (dec.1 -> dec.4), but it is unknown whether there will be 1, 2, 3 or 4 of them.
In the original solution the decision columns will be spread throughout the xts object, so I will not be able to use column numbers as identifiers.
Question:
In column "Master.dec" I calculate a value based on the columns to its left (dec.1 - dec.4), of which there will sometimes be 1, 2, 3 or 4. Is there a way to keep the calculation in "Master.dec" intact and working even if 1 to 3 of the decision columns are not present?
To reproduce the problem I encounter:
1) run the complete script
2) delete one column: xts1$dec.1 <- NULL
3) run only the section 2 of the script (2.add a rules system)
...you will get the error:
Error in NextMethod(.Generic) :
dims [product 3] do not match the length of object [0]
Note! The solution should still work after 1-3 decision columns have been removed. There will always be at least one decision column, but it is unknown which ones.
# dependent libraries
library(matrixStats)
library(xts)
#############################################
# 1. Create the xts from a data.frame base
#############################################
# creates a dataframe
df1 <- data.frame(date = c("2017-01-01", "2017-02-02", "2017-03-03"),
other.1 = c(1998, 1999, 2000),
dec.1 = c(2, 3, 0),
other.2 = c(58, 54, 32),
other.3 = c(12, 3, 27),
dec.2 = c(2, 3, 0),
dec.3 = c(2, 3, 0),
other.4 = c(2, 5, 27),
dec.4 = c(2, 3, 0)
)
# transforms the column date to date-format
df1 = transform(df1,date=as.Date(as.character(date),format='%Y-%m-%d'))
# creates the xts, based on the dataframe df1
xts1 <- xts(df1[,-1],order.by = df1$date)
#############################################
# 2.Add a rule system:
# if all "dec"-columns are 2, add value 2 in master.dec
# if all "dec"-columns are 3, add value 3 in master.dec
# if all "dec"-columns are 0 (or any combination other than the above), add value 0 in master.dec
#############################################
xts1$m.dec <- ifelse(rowSds(xts1)==0,rowMins(xts1),0)
Since an xts object is essentially just an indexed matrix, you could try calculating the row-wise standard deviation. If the result is 0 (i.e. all values are the same), you assign the row minimum (or maximum, whichever you prefer) to your new column, otherwise 0.
An efficient and concise solution can be found using the matrixStats package:
library(matrixStats)
xts1$m.dec <- ifelse(rowSds(xts1)==0,rowMins(xts1),0)
# dec.1 dec.2 dec.3 dec.4 m.dec
#2017-01-01 2 2 2 2 2
#2017-02-02 3 3 3 3 3
#2017-03-03 0 0 0 0 0
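If the decision columns are mixed in with other columns, or if only one of them is left, one option is to select them by name before applying the rule. The sketch below is not the answer's method: it uses a plain matrix m as a stand-in for the data part of the xts object (for example coredata(xts1)) and base R only, and the names m, dec_cols and master_dec are made up for illustration. It also avoids the standard deviation, which is NA for a single value and so would presumably not work with one remaining column.

```r
# Stand-in for coredata(xts1): a plain matrix where some "dec." columns are missing
m <- cbind(other.1 = c(1998, 1999, 2000),
           dec.2   = c(2, 3, 0),
           other.2 = c(58, 54, 32),
           dec.4   = c(2, 3, 1))

# Pick the decision columns by name, however many are present
dec_cols <- grep("^dec\\.", colnames(m))
dec <- m[, dec_cols, drop = FALSE]   # drop = FALSE keeps a matrix even for 1 column

# Rule from the question: 2 if all are 2, 3 if all are 3, otherwise 0
master_dec <- function(r) {
  if (all(r == r[1]) && r[1] %in% c(2, 3)) r[[1]] else 0
}
m.dec <- apply(dec, 1, master_dec)
m.dec
#> [1] 2 3 0
```

With an xts object the same column selection should work on colnames(xts1), and the resulting vector could then be assigned to xts1$m.dec.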